1. Introduction
Epilepsy is among the most common neurological disorders, affecting over 50 million people worldwide according to the World Health Organization (WHO) [
1]. The disease contributes to more than 0.5% of the global burden of disease and is associated with considerable morbidity, mortality, and reduced quality of life [
2,
3]. Epileptic seizures manifest in diverse forms, ranging from subtle impairments of consciousness to generalized convulsions and sudden collapses. In severe cases, seizures may result in traumatic injuries or even fatalities [
2]. The heterogeneity of seizure manifestations makes accurate and timely diagnosis particularly challenging, while also underscoring its clinical significance. In low-resource settings, delayed or inaccurate diagnosis remains a major barrier to treatment, further increasing the disease burden [
4].
Electroencephalography (EEG) is the principal diagnostic tool recommended by the International League Against Epilepsy (ILAE) [
5,
6]. EEG records electrical potentials generated by synchronized neuronal activity and provides high temporal resolution while being non-invasive. In clinical practice, EEG allows physicians to identify interictal epileptiform discharges, seizure events, and abnormal rhythms [
7]. Nevertheless, manual EEG interpretation is a time-consuming and expertise-dependent process, characterized by subjectivity and inter-rater variability [
8]. Furthermore, long-term monitoring produces vast data volumes, which amplifies the risk of oversight and increases clinicians’ workload. These limitations motivate the development of automated computational methods to improve both efficiency and accuracy in epilepsy diagnostics.
Over the past decades, numerous computational approaches have been proposed for epileptiform event detection and seizure prediction. Early methods focused on template matching and morphological features [
9], whereas later studies incorporated spectral and time–frequency analyses to capture more complex EEG dynamics [
10]. With the growth of machine learning, supervised and unsupervised classifiers have been applied to distinguish abnormal from normal EEG segments [
11,
12]. More recently, deep learning techniques have gained traction, demonstrating remarkable potential in automated EEG interpretation [
13,
14,
15,
16]. Despite these advances, major challenges persist, including class imbalance between seizure and non-seizure states, high inter-subject variability, and the presence of artifacts such as muscle activity or electrode noise that can mimic pathological patterns. These challenges motivate the exploration of outlier-oriented frameworks capable of distinguishing clinically relevant abnormalities from noise and artifacts [
17,
18,
19].
The limitations of existing methods highlight the need for approaches that can capture rare and atypical patterns in EEG signals. In this framework, abnormal events can be regarded as outliers, defined as signal segments that deviate significantly from background activity [
20]. Such anomalies may correspond to epileptiform discharges, seizure events, or pathological rhythms, but may also arise from artifacts or noise. Detecting outliers is therefore clinically valuable: it enables the identification of rare but relevant abnormalities without exhaustive manual annotation, alleviates neurologists’ workload, and facilitates timely diagnosis and patient monitoring [
21]. Outlier detection thus provides a natural framework for addressing the inherent imbalance between abundant normal activity and rare pathological events in EEG recordings—an especially important capability in clinical practice, where detecting subtle epileptiform discharges can inform diagnosis, treatment decisions, and patient monitoring [
21].
This operational perspective aligns with ILAE terminology and supports a screening-level distinction between epileptiform (outlier) and non-epileptiform (normal) EEG segments. To make this conceptual distinction explicit, we provide the following operational definition used throughout this study.
Definition 1 (EEG Outlier). An outlier in the EEG context denotes a signal segment exhibiting epileptiform activity—ictal or interictal spikes, sharp waves, spike–waves, or seizure patterns—that deviates from normal background rhythms. Accordingly, the binary labels used throughout are outlier (epileptiform) versus non-outlier (non-epileptiform). Artifact-only segments are excluded unless explicitly marked as epileptiform in the source dataset.
Building upon this conceptualization, one promising direction is to enhance outlier detection through ensemble learning. Ensemble methods aggregate the predictions of multiple classifiers, leveraging their complementary strengths to enhance predictive accuracy, reduce variance, and improve generalization [
22]. Unlike conventional single classifiers, which are sensitive to parameter selection and prone to overfitting noisy data, the proposed automated ensemble approach systematically selects well-performing base learners and integrates them through complementary aggregation strategies. This reduces subjectivity in model design and increases robustness against artifacts and inter-subject variability—both critical for reliable clinical use in EEG-based epilepsy diagnostics.
In this study, we propose an ensemble-based outlier detection framework for EEG. We screen three interpretable baseline models—Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and decision tree (DT-CART)—under a pre-registered eligibility rule (outlier-class F1 ≥ τ, with default τ = 0.60). Only models meeting this criterion are used to form ensembles; under this rule, DT-CART did not qualify, so the final ensembles combine SVM and k-NN. We then aggregate eligible base learners using bagging, stacking, and majority voting. By combining complementary decision boundaries, the framework increases robustness to noise, mitigates class imbalance, and improves the reliable identification of abnormal EEG segments.
Although deep and transformer-based models have demonstrated outstanding performance in recent EEG studies, the proposed framework deliberately employs classical machine learning algorithms—SVM, k-NN, and DT-CART—as base learners. This decision is motivated by three factors: (i) interpretability and computational efficiency, which facilitate clinical acceptance; (ii) smaller data requirements, given the limited availability of labeled EEG recordings; and (iii) the ability to serve as baseline modules in an automated ensemble generator that can later incorporate deep or transformer-based architectures. Furthermore, the selected base classifiers address complementary aspects of EEG abnormality detection: SVM captures nonlinear, high-dimensional decision boundaries typical of ictal activity; k-NN effectively models local neighborhood relations for subtle interictal discharges; and DT-CART provides interpretable, rule-based partitions that separate normal background rhythms from artifact-related patterns. Together, these models offer a balanced trade-off between flexibility, interpretability, and computational efficiency—attributes particularly important for detecting diverse EEG abnormalities within limited and heterogeneous datasets.
The first dataset, the Bonn EEG dataset, represents a well-structured benchmark comprising high-quality recordings from both healthy individuals and epilepsy patients under controlled laboratory conditions [
23]. In contrast, the Guinea-Bissau and Nigeria Epilepsy (GBNE) dataset offers a more realistic scenario, containing field-acquired EEG signals recorded using low-cost equipment [
24]. This dataset is of particular clinical importance because it reflects the challenges of diagnosing epilepsy in low-resource environments, where neurologists and specialized equipment are often scarce. Together, these datasets provide a dual perspective: Bonn serves as a clean benchmark for methodological validation, whereas GBNE captures the complexity of real-world diagnostics in resource-limited clinical environments. This combination enables a comprehensive evaluation of model robustness under both idealized and realistic conditions.
Problem statement and contributions. This study frames epileptiform activity detection as a binary outlier detection problem in EEG: signal segments that deviate from normal background rhythms are labeled as outliers (epileptiform), whereas normal background activity is treated as non-outlier (non-epileptiform). The main contributions of this work are as follows:
Operational definition of EEG outliers. We formalize an ILAE-aligned mapping between the clinical notion of “epileptiform” and the modeling term “outlier,” and use it consistently throughout the study.
Automated, performance-based ensemble construction. Base models are automatically admitted to ensembles only if they satisfy a pre-registered eligibility rule (outlier-class F1 ≥ τ), improving transparency and reproducibility.
Systematic comparison of aggregation strategies. We evaluate three ensemble aggregators—bagging, stacking, and majority voting—across homogeneous and heterogeneous configurations.
Validation on clean vs. field EEG. We assess generalization on the artifact-free Bonn dataset and the noisy, low-cost GBNE dataset, reflecting real-world diagnostic constraints.
Statistical validation. We employ McNemar’s test on paired out-of-fold predictions to confirm that ensemble improvements over the best base model are statistically significant.
The remainder of this paper is organized as follows.
Section 2 presents a review of existing solutions related to EEG signal analysis and outlier detection methods.
Section 3 introduces the baseline classifiers and ensemble learning techniques used in this study.
Section 4 describes the proposed procedures for building homogeneous and heterogeneous ensemble models.
Section 5 details the research methodology, including dataset characteristics, evaluation metrics, and experimental design.
Section 6 discusses the experimental results obtained for both EEG datasets, and
Section 7 provides final conclusions and future research directions.
4. Proposed Approach
Previous studies have demonstrated the feasibility of applying classical machine learning approaches such as
k-NN, SVM, and decision trees for outlier detection in EEG signals, showing promising results in identifying epileptiform discharges and seizure-related anomalies [
11,
72]. Recent reviews further highlight the growing importance of anomaly and outlier detection in EEG analysis, particularly in the context of epilepsy diagnosis and seizure prediction [
13]. Building upon these findings, the proposed approach develops ensemble classifier models using three aggregation techniques—bagging, stacking, and majority voting—selected for their complementary mechanisms of combining predictions and their potential to mitigate the limitations of individual base classifiers.
Bagging (bootstrap aggregating) improves prediction stability by reducing model variance, which is particularly beneficial when dealing with non-stationary EEG signals and the presence of noise and outliers. Stacking, as a meta-learning approach, constructs a higher-level model (meta-model) that learns how to optimally combine the predictions of base classifiers, thereby capturing complex relationships among them. Majority voting—in both hard and soft variants—provides a simple yet effective strategy for integrating decisions, especially when using heterogeneous base learners. Applying these three aggregation methods enables a comparative evaluation of their effectiveness in EEG outlier detection across homogeneous and heterogeneous ensemble configurations.
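As a hedged illustration, the three aggregation strategies can be sketched with scikit-learn on synthetic stand-in data; the dataset, hyperparameters, and estimator settings below are illustrative assumptions, not the configurations used in the study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for EEG feature vectors (imbalanced: ~10% outliers).
X, y = make_classification(n_samples=600, n_features=12, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = [("svm", SVC(kernel="rbf", probability=True, class_weight="balanced")),
        ("knn", KNeighborsClassifier(n_neighbors=5))]

# Bagging: variance reduction via bootstrap resamples of one base learner.
bagging = BaggingClassifier(SVC(kernel="rbf"), n_estimators=10,
                            random_state=0).fit(X_tr, y_tr)

# Stacking: a logistic-regression meta-model learns how to combine base outputs.
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression()).fit(X_tr, y_tr)

# Soft majority voting: averages the predicted class probabilities.
voting = VotingClassifier(estimators=base, voting="soft").fit(X_tr, y_tr)

scores = {name: m.score(X_te, y_te)
          for name, m in [("bagging", bagging), ("stacking", stacking),
                          ("voting", voting)]}
```

The same three aggregators are later compared across homogeneous and heterogeneous ensemble configurations.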
For ensemble construction, classical base classifiers were employed: k-NN, SVM, and DT-CART. The key advantage of the proposed framework lies in its automated ensemble generation, which minimizes subjective parameter tuning and ensures systematic model construction. By leveraging the diversity of multiple base learners, the ensembles exhibit higher robustness to noise, artifacts, and inter-subject variability than individual classifiers. This property is clinically significant, as it enhances the reliability of epileptiform abnormality detection in realistic EEG recordings.
4.1. Construction of Homogeneous Ensemble Classification Models
The process of building homogeneous ensemble classification models was carried out automatically using the base classifiers. Let C denote the set of base classifier types (e.g., k-NN, SVM, and DT-CART), and let n represent the number of instances (models) of each type. Each classifier type is treated as a separate base variant for which an individual homogeneous ensemble is constructed.
The algorithm consists of the following steps:
Configuration generation:
For each base classifier c ∈ C, define its hyperparameter space using grid search and random search techniques. Based on this space, generate a set of candidate hyperparameter configurations.
Training base models:
Each hyperparameter configuration is used to train a model on the training dataset. Model performance and outlier detection are assessed using cross-validation.
Model evaluation and selection:
For each trained model, compute performance metrics (accuracy Acc, precision P, recall R, and the F1-score). Models are retained for ensemble construction if they achieve an outlier-class F1 ≥ τ on the validation fold (default τ = 0.60). Accuracy is reported for completeness but is not used for selection due to class imbalance.
Construction of homogeneous ensemble classifiers:
For each base classifier type c ∈ C, select n trained models with different hyperparameter configurations that satisfy F1 ≥ τ (outlier class). These models form a homogeneous ensemble composed solely of classifiers of the same type, where the j-th member is the instance of classifier c with the j-th hyperparameter configuration. Each ensemble is combined using an aggregation method A ∈ {bagging, stacking, majority voting} to create the final ensemble models.
Testing of ensemble models:
For each aggregation method A, compute the performance metrics on the test set. The best combination of a homogeneous ensemble and aggregation strategy is then selected over all classifier types c ∈ C and all A ∈ {bagging, stacking, majority voting}.
The graphical representation of the procedure for constructing homogeneous ensemble classifiers is shown in
Figure 1, and an illustrative example is provided in Example 1.
Example 1. For each hyperparameter configuration, the validation F1-score (outlier class) was determined for the k-NN, SVM, and DT-CART candidates.
Assume the number of models in the ensemble is n = 2 and the minimum acceptable F1 is τ = 0.60.
For the k-NN classifier, the two best-performing models satisfying F1 ≥ τ are selected.
For SVM, the two highest-scoring qualifying configurations are chosen.
For DT-CART, no configuration meets the condition F1 ≥ τ, so no ensemble is formed.
Consequently, for n = 2, the possible homogeneous ensembles are as follows:
k-NN: an ensemble of the two selected k-NN models;
SVM: an ensemble of the two selected SVM models;
DT-CART: still does not meet the criterion and is excluded.
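The eligibility-and-selection step of Example 1 can be sketched as follows; the F1 values, ensemble size n = 2, and threshold τ = 0.60 are hypothetical illustrations of the rule, not values reported in the study:

```python
# Hypothetical validation F1-scores (outlier class) per hyperparameter
# configuration; the values are illustrative only.
f1_scores = {
    "kNN":     [0.72, 0.65, 0.58],
    "SVM":     [0.70, 0.68, 0.55],
    "DT-CART": [0.52, 0.49, 0.45],
}

TAU = 0.60   # minimum acceptable outlier-class F1
N = 2        # number of models per homogeneous ensemble

def select_homogeneous(scores, tau, n):
    """Return, per classifier type, the indices of the n best
    configurations with F1 >= tau, or None if fewer than n qualify."""
    ensembles = {}
    for clf_type, f1s in scores.items():
        eligible = [i for i, f in enumerate(f1s) if f >= tau]
        eligible.sort(key=lambda i: f1s[i], reverse=True)
        ensembles[clf_type] = eligible[:n] if len(eligible) >= n else None
    return ensembles

result = select_homogeneous(f1_scores, TAU, N)
# kNN and SVM each yield a two-model ensemble; DT-CART is excluded.
```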
The presented homogeneous ensemble classifier construction approach allows for systematic and reproducible model generation without manual parameter tuning, increasing both the objectivity and efficiency of the entire process. The scheme of homogeneous ensemble classification is illustrated in
Figure 1. From a clinical standpoint, homogeneous ensembles provide a systematic way to stabilize the performance of individual classifiers such as SVM or
k-NN, reducing the variability caused by parameter choices. This consistency is particularly relevant in EEG analysis, where noisy or artifact-contaminated segments can otherwise lead to unstable predictions.
4.2. Heterogeneous Ensemble Classification Models
Unlike homogeneous ensembles, where all models in the ensemble originate from the same base classifier type, heterogeneous classification involves combining classifiers of different types within a single ensemble, which may increase its diversity and effectiveness. As in
Section 4.1, the process of constructing heterogeneous ensemble classifiers begins with the generation and evaluation of base classifiers.
Let C denote the set of base classifier types (e.g., k-NN, SVM, and DT-CART) and n the number of model instances of each type. The best-performing base classifiers (i.e., those that achieve an outlier-class F1 ≥ τ; default τ = 0.60) are selected and combined into ensembles consisting of classifiers of different types (e.g., SVM, k-NN, and DT-CART).
The process consists of the following steps:
Configuration Generation:
As in
Section 4.1, the hyperparameter space is defined for each base classifier type. Configurations are generated using grid search
and random search
methods.
Training Base Models:
Models are trained on the training dataset using cross-validation to evaluate classification performance and outlier detection capability. Hyperparameter spaces may vary significantly between classifier types (e.g., C and kernel for SVM, k for k-NN, and tree depth for DT-CART).
Model Evaluation and Selection:
Each model is evaluated using the metrics described in
Section 5.5. Models satisfying F1 ≥ τ (outlier class) proceed to the next stage, where τ is determined empirically (e.g., 0.60 or higher).
Construction of the Heterogeneous Ensemble:
For a given number of models n, the best-performing base classifiers are selected, allowing diversity in model types. The resulting ensemble includes classifiers of different types (e.g., k-NN + SVM + DT-CART). An aggregation method A ∈ {bagging, stacking, majority voting} is then chosen.
Testing Ensemble Models:
Heterogeneous ensemble models are evaluated on the test set using the defined metrics (e.g., accuracy, recall, area under the precision–recall curve (AUPRC), and the F1-score). The effectiveness of different aggregation methods is also analyzed.
To account for class imbalance, selection relies on the outlier-class F1-score (default threshold τ = 0.60); accuracy is reported for context only. The procedure for constructing heterogeneous ensemble classifiers is presented in Example 2.
Example 2. Suppose validation F1-scores (outlier class) have been computed for all k-NN, SVM, and DT-CART configurations.
For a target number of models n = 2, the two base models with the highest F1 are selected (regardless of classifier type), provided that F1 ≥ τ. In this case, the selected models are a k-NN model and an SVM model. Based on these models, a heterogeneous ensemble classifier is created: k-NN + SVM.
For n = 3, the top three models satisfying the condition F1 ≥ τ are selected: two k-NN models and one SVM model.
As a result, the final ensemble classifier consists of two k-NN models and one SVM model.
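A minimal sketch of the heterogeneous selection rule, using a hypothetical model pool (scores and configuration names are illustrative only):

```python
# Hypothetical pool of trained base models with their validation F1
# (outlier class); the values are illustrative, not from the paper.
pool = [
    ("kNN", "cfg-a", 0.74),
    ("kNN", "cfg-b", 0.66),
    ("SVM", "cfg-c", 0.71),
    ("SVM", "cfg-d", 0.58),
    ("DT-CART", "cfg-e", 0.55),
]

def select_heterogeneous(pool, tau, n):
    """Pick the n highest-F1 models across all types (F1 >= tau)."""
    eligible = [m for m in pool if m[2] >= tau]
    eligible.sort(key=lambda m: m[2], reverse=True)
    return eligible[:n]

top2 = select_heterogeneous(pool, tau=0.60, n=2)  # k-NN + SVM
top3 = select_heterogeneous(pool, tau=0.60, n=3)  # two k-NNs + one SVM
```

Unlike the homogeneous case, selection here ignores classifier type, so the resulting ensemble composition emerges from the scores alone.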
Clinically, heterogeneous ensembles are especially valuable because they integrate complementary decision strategies from different classifiers. This diversity enhances robustness to inter-subject variability and heterogeneous recording conditions, making them well-suited for real-world EEG applications.
5. Research Methodology
The experiments were conducted using the Google Colab platform, with Python 3.12 as the programming language. The following libraries were used: MNE, NumPy, Pandas, SciPy, and Scikit-learn.
5.1. Datasets
This study utilizes two publicly available EEG datasets that differ significantly in terms of recording conditions, population, equipment, and acquisition protocols—thereby enabling robust validation of the proposed models under both controlled and real-world conditions.
5.1.1. Bonn EEG Dataset
The first dataset originates from the University of Bonn and is widely regarded as a benchmark in EEG-based epilepsy research [
23]. It comprises a total of 500 single-channel EEG segments, organized into five subsets labeled A–E. Each subset contains 100 segments of equal length (23.6 s), sampled at 173.61 Hz and digitized using a 12-bit A/D converter. The data were originally recorded from both healthy volunteers and epilepsy patients.
Subsets A and B were recorded extracranially from healthy volunteers using scalp electrodes. The distinction between the two lies in the participants’ eye status: eyes open (A) and eyes closed (B).
Subset C includes interictal EEG signals from epilepsy patients, recorded intracranially from the hippocampal formation in the non-epileptogenic hemisphere.
Subset D contains interictal recordings obtained from within the epileptogenic zone.
Subset E consists of ictal segments—recordings captured during actual epileptic seizures.
In total, the dataset provides 500 carefully selected, artifact-free single-channel segments (4097 samples each), corresponding to approximately 196 min of EEG recordings. All signals are balanced across subsets and of equal length, enabling controlled experimentation and reproducibility of results.
Epoch labels (A–E) follow the original Bonn protocol and were not modified. Each 23.6 s epoch was partitioned into overlapping 2 s windows (50% overlap), with each window inheriting its parent epoch’s label for training and evaluation.
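Under the stated parameters (173.61 Hz sampling, 4097-sample epochs, 2 s windows with 50% overlap), the windowing step can be sketched as follows; the helper name is illustrative:

```python
import numpy as np

FS = 173.61          # Bonn sampling rate (Hz)
WIN = int(2 * FS)    # 2 s window -> 347 samples
STEP = WIN // 2      # 50% overlap -> 173-sample stride

def window_epoch(epoch, label):
    """Split one 4097-sample Bonn epoch into overlapping 2 s windows,
    each inheriting the parent epoch's label."""
    windows = [epoch[s:s + WIN] for s in range(0, len(epoch) - WIN + 1, STEP)]
    return np.stack(windows), [label] * len(windows)

epoch = np.random.randn(4097)      # stand-in for one Bonn segment
X, y = window_epoch(epoch, "E")    # 22 windows of 347 samples each
```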
5.1.2. Guinea-Bissau and Nigeria Epilepsy (GBNE) Dataset
The second dataset, known as the Guinea-Bissau and Nigeria Epilepsy (GBNE) dataset [
24], was collected under field conditions in rural and semi-urban areas of West Africa. It comprises EEG recordings from 97 participants: 51 patients diagnosed with epilepsy and 46 healthy controls. The data were acquired using the EMOTIV EPOC+ wireless headset, which features 14 channels arranged according to the international 10–20 system. Signals were recorded in a resting state with eyes closed, sampled at 128 Hz with 14-bit resolution, and stored with a session duration of approximately five minutes per subject. In total, this amounts to approximately 8 h of multichannel EEG recordings.
The dataset is inherently imbalanced, with slightly more epileptic than healthy recordings. Unlike the artifact-free Bonn dataset, GBNE introduces realistic challenges, including motion artifacts, environmental noise, and variability in signal quality due to field conditions and low-cost equipment. These characteristics make GBNE particularly valuable for testing the robustness and generalization ability of automated outlier detection models in practical, resource-limited clinical scenarios.
Table 3 summarizes the key characteristics of the Bonn and GBNE datasets, highlighting their complementary roles in evaluating algorithms under both controlled and real-world conditions.
As shown in
Table 3, the two datasets complement each other: Bonn serves as a clean benchmark for reproducible testing, whereas GBNE reflects the noisy and heterogeneous conditions of clinical practice. The rationale for selecting these datasets lies in their complementary characteristics: the Bonn dataset, with its high-quality and well-annotated recordings, is optimal for evaluating model performance under controlled conditions. In contrast, the GBNE dataset captures the complexity and variability of real-world scenarios, making it suitable for testing model generalization and robustness in low-resource environments. This combination supports a comprehensive assessment of the proposed methods, spanning both idealized and challenging EEG analysis contexts.
Ground-truth labeling.
The Bonn dataset includes canonical subset labels (A–E) assigned by clinical experts. For the GBNE dataset, ground-truth labels were derived from clinical diagnoses (epilepsy vs. control) provided in the original metadata. No additional manual re-annotation or relabeling was performed in this study.
5.2. Representative EEG Examples
Figure 2,
Figure 3,
Figure 4 and
Figure 5 present representative segments from both benchmark datasets, highlighting the contrast between interictal activity in the Bonn EEG dataset and realistic, artifact-prone scalp recordings in the GBNE dataset.
As shown in
Figure 2, these examples illustrate interictal patterns observed in the non-epileptogenic zone, where background activity remains mostly regular, with only occasional isolated spikes.
Figure 3 illustrates interictal activity recorded from the epileptogenic zone, contrasting with the more stable background observed in Subset C.
To complement the intracranial recordings shown in
Figure 2 and
Figure 3,
Figure 4 and
Figure 5 present representative scalp EEG segments from the GBNE dataset, which was collected under real-world field conditions using low-cost wearable equipment. Unlike the artifact-free Bonn EEG dataset, GBNE signals exhibit varying levels of contamination arising from motion, eye blinks, and muscle activity. These examples illustrate the challenges faced by automated outlier detection systems in distinguishing pathological activity from non-neural disturbances, particularly in environments characterized by limited hardware stability and recording noise.
Figure 4 shows EEG segments with mild to moderate distortions caused primarily by slow electrode drift and subtle motion artifacts, whereas
Figure 5 depicts segments with pronounced electromyographic (EMG) bursts and high-amplitude artifacts related to facial movement or head motion. Together, these examples illustrate the spectrum of real-world noise patterns that may mimic epileptiform activity and highlight the need for models capable of robust generalization to heterogeneous EEG data.
The above visualizations underline the physiological and technical heterogeneity of the EEG data analyzed in this work. While the Bonn dataset provides clean, clinically curated intracranial recordings suitable for controlled benchmarking, the GBNE dataset captures the complexity of real-world scalp EEG, including motion and muscular artifacts. These visual observations complement the quantitative results presented in
Section 6, providing intuitive confirmation of the differences between controlled and real-world EEG conditions.
5.3. Preprocessing
To ensure consistency and comparability of results, both datasets were subjected to the same preprocessing procedure. First, a fourth-order Butterworth bandpass filter (0.1–45 Hz) was applied to suppress low-frequency drifts and high-frequency noise. Next, Z-score normalization was performed independently on each EEG channel to standardize amplitude values and reduce inter-subject variability. Finally, the EEG signals were segmented into overlapping 2 s windows with 50% overlap. This approach improved temporal resolution, increased the number of training examples, and minimized information loss at segment boundaries.
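A minimal sketch of this preprocessing chain using SciPy; the GBNE sampling rate of 128 Hz is used for illustration, and any settings beyond those stated in the text are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(sig, fs):
    """Band-pass filter (0.1-45 Hz, 4th-order Butterworth), z-score
    normalize, and cut into overlapping 2 s windows (50% overlap)."""
    b, a = butter(4, [0.1, 45.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, sig)                       # zero-phase filtering
    z = (filtered - filtered.mean()) / filtered.std()    # per-channel z-score
    win, step = int(2 * fs), int(fs)                     # 2 s window, 1 s stride
    starts = range(0, len(z) - win + 1, step)
    return np.stack([z[s:s + win] for s in starts])

fs = 128                                                 # GBNE sampling rate (Hz)
segments = preprocess(np.random.randn(5 * 60 * fs), fs)  # one 5 min recording
```

Zero-phase filtering (filtfilt) avoids introducing phase distortion that could shift transient epileptiform waveforms.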
Following preprocessing, each 2 s EEG segment was converted into a feature vector combining temporal, statistical, and spectral information. The extracted features included the following:
Time-domain statistics: Mean and variance;
Hjorth parameters: Activity, mobility, and complexity;
Spectral band power across standard EEG frequency ranges computed using Welch’s method;
Entropy-based measures: Shannon and spectral entropy.
All extracted features were standardized using z-score normalization prior to model training and evaluation. This ensured that the feature vectors used by SVM, k-NN, and DT-CART classifiers captured both temporal and frequency-domain variability, allowing for robust differentiation between epileptiform and normal EEG activity.
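The feature extraction described above can be sketched as follows; the band boundaries, histogram binning, and Welch settings are illustrative assumptions where the text does not specify them:

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def hjorth(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal."""
    dx, ddx = np.diff(x), np.diff(np.diff(x))
    act = np.var(x)
    mob = np.sqrt(np.var(dx) / act)
    comp = np.sqrt(np.var(ddx) / np.var(dx)) / mob
    return act, mob, comp

def features(x, fs):
    """Per-window feature vector: time-domain stats, Hjorth parameters,
    Welch band powers, and Shannon/spectral entropy."""
    f, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    psd_n = psd / psd.sum()
    spec_ent = -np.sum(psd_n * np.log2(psd_n + 1e-12))     # spectral entropy
    hist, _ = np.histogram(x, bins=32)
    p = hist / hist.sum()
    shannon = -np.sum(p[p > 0] * np.log2(p[p > 0]))        # Shannon entropy
    band_pow = [psd[(f >= lo) & (f < hi)].sum() for lo, hi in BANDS.values()]
    return np.array([x.mean(), x.var(), *hjorth(x), *band_pow,
                     shannon, spec_ent])

vec = features(np.random.randn(256), fs=128)   # one 2 s window at 128 Hz
```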
After feature extraction and normalization, the EEG segments were labeled to distinguish epileptiform (outlier) from normal (non-outlier) activity in accordance with the clinical annotations of each dataset. Specifically, for the Bonn dataset, segments from Subsets C–E (interictal hippocampal, interictal from the epileptogenic zone, and ictal activity) were treated as epileptiform outliers, while Subsets A–B (healthy scalp EEG with eyes open/closed) represented normal background activity. For the GBNE dataset, outliers corresponded to EEG segments obtained from patients diagnosed with epilepsy and non-outliers to recordings from healthy controls. This consistent labeling ensured reproducibility and provided a unified binary framework for distinguishing epileptiform (outlier) versus normal (non-outlier) EEG segments.
5.4. Hyperparameter Space
The experiments were conducted following the procedure described in
Section 4, starting with the definition of the hyperparameter space for each base classifier. The classification algorithms considered include k-Nearest Neighbors (
k-NN), Support Vector Machines (SVMs), and DT-CART decision trees. Hyperparameter sets were automatically generated using the grid search method.
Support Vector Machine (SVM):
C (regularization strength): This parameter controls the trade-off between model complexity and classification error. It helps to avoid overfitting while minimizing misclassification, striking a balance between margin width and training accuracy.
kernel: This parameter specifies the kernel function used to transform the data into a higher-dimensional space. The kernel can be selected from the following: linear, radial basis function (RBF), or polynomial.
Decision trees (DT-CART):
max_depth: The maximum depth of the decision tree, defining the number of levels allowed for splits.
min_samples_split: The minimum number of samples required to split an internal node.
criterion: The splitting criterion used to evaluate the quality of a split. The options include Gini impurity and Shannon entropy, both measuring node impurity.
k-Nearest Neighbors (k-NN):
k (n_neighbors): The number of neighbors considered during classification. This value affects the smoothness of the decision boundary.
metric: The distance metric used to compute the similarity between samples. Euclidean distance measures the straight-line distance, Manhattan distance computes the sum of absolute differences across dimensions, and Minkowski is a generalized metric encompassing both Euclidean and Manhattan distances, allowing parameterized weighting.
weights: Defines how the neighbors contribute to the classification. Uniform assigns equal weight to all neighbors, while distance gives more influence to closer neighbors.
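The grid-search expansion of these hyperparameter spaces can be sketched as follows; the candidate values are placeholders, since the exact grids are not listed in this section:

```python
from itertools import product

# Illustrative hyperparameter spaces mirroring the text; the exact grids
# used in the study may differ.
param_spaces = {
    "SVM":     {"C": [0.1, 1, 10], "kernel": ["linear", "rbf", "poly"]},
    "DT-CART": {"max_depth": [3, 5, 10], "min_samples_split": [2, 5],
                "criterion": ["gini", "entropy"]},
    "k-NN":    {"n_neighbors": [3, 5, 7],
                "metric": ["euclidean", "manhattan", "minkowski"],
                "weights": ["uniform", "distance"]},
}

def grid(space):
    """Expand a parameter space into all grid-search configurations."""
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in product(*space.values())]

configs = {clf: grid(space) for clf, space in param_spaces.items()}
# 9 SVM, 12 DT-CART, and 18 k-NN candidate configurations.
```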
5.5. Evaluation Metrics
The performance of all classifiers was assessed using standard metrics derived from the confusion matrix: accuracy (Acc), precision (P), recall (R), specificity (Spec), and the F1-score. These measures quantify how accurately the models distinguish between normal and epileptic EEG segments. A summary of the evaluation metrics and their mathematical definitions is presented in
Table 4.
High recall is clinically important for detecting epileptic discharges, while precision helps reduce false alarms. The F1-score provides a balanced measure suitable for imbalanced EEG datasets. All results were obtained using 10-fold stratified cross-validation, with preprocessing performed independently within each training fold to avoid data leakage.
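For concreteness, the confusion-matrix metrics can be computed as below; the counts are hypothetical and chosen to show how high accuracy can coexist with a lower outlier-class F1:

```python
def metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics used for EEG outlier detection."""
    acc  = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec  = tp / (tp + fn)            # sensitivity to epileptiform events
    spec = tn / (tn + fp)
    f1   = 2 * prec * rec / (prec + rec)
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "specificity": spec, "f1": f1}

# Hypothetical imbalanced test set: 90 outlier windows, 910 normal windows.
m = metrics(tp=80, fp=20, fn=10, tn=890)
# Accuracy is 0.97 while the outlier-class F1 is only about 0.84,
# illustrating why accuracy alone is misleading under class imbalance.
```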
5.6. Considerations for Imbalanced Data
A key challenge in EEG outlier detection is class imbalance: anomalous segments (e.g., epileptic events) are much rarer than normal activity. Relying on overall accuracy can therefore be misleading, as high accuracy may co-occur with poor sensitivity to rare events [
73].
To address this, this study adopted an evaluation and selection protocol that is robust to imbalance. All results are reported using stratified 10-fold cross-validation, and we prioritize imbalance-aware measures—precision, sensitivity (recall), and the F1-score—over accuracy when comparing models. Model selection within the automated framework uses an F1-based criterion (see
Section 5.4). Where applicable (e.g., SVM), class weights were enabled to emphasize the minority class without altering the original data distribution.
Importantly, we did not apply synthetic oversampling (e.g., SMOTE) in the final protocol. With overlapping 2 s windows, oversampling risks information leakage between training and validation/test folds and may distort the clinically meaningful class prevalence. Our reported precision–recall trade-offs thus reflect the true distribution of events in each dataset. A systematic, leakage-safe study of dataset-adaptive oversampling techniques is left for future work.
In imbalanced EEG classification, overall accuracy can be misleading because it conflates majority-class performance with true positive detection. Recent studies recommend imbalance-aware metrics—particularly the class-specific F1-score and, when probabilistic scores are available, the area under the precision–recall curve (AUPRC)—as more informative objectives for both model selection and reporting [
49,
50]. Following this guidance, we use the outlier-class F1 as our selection criterion (default threshold F1 ≥ 0.60) and report AUPRC where applicable, in addition to accuracy for context.
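These two reporting quantities can be computed in a few lines; the labels, scores, and 0.5 decision threshold below are synthetic stand-ins, with only the F1 ≥ 0.60 selection rule taken from the protocol.

```python
# Sketch of the two reporting quantities: outlier-class F1 at a decision
# threshold, and AUPRC from probabilistic scores (synthetic data).
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

rng = np.random.default_rng(5)
y = (rng.random(400) < 0.1).astype(int)                  # rare outlier class
proba = np.clip(0.35 * y + rng.random(400) * 0.65, 0, 1) # toy outlier scores

f1 = f1_score(y, (proba >= 0.5).astype(int), pos_label=1)
auprc = average_precision_score(y, proba)
eligible = f1 >= 0.60   # the F1-based selection rule from the text
print(round(float(f1), 3), round(float(auprc), 3), bool(eligible))
```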
To prevent information leakage, a grouped, stratified 10-fold cross-validation scheme was applied. For the Bonn dataset, all 2 s windows inherited the label of their parent 23.6 s epoch and were assigned to the same fold (epoch-wise grouping). For the GBNE dataset, windows originating from the same subject were grouped together to ensure subject-wise separation between training and test folds.
5.7. Paired Significance Testing via McNemar’s Test
To assess whether the ensemble’s improvement over the best base classifier is statistically significant, we applied McNemar’s test on paired out-of-fold (OOF) predictions generated under identical cross-validation splits. Let y_i be the ground-truth label for instance i, and let ŷ_i^E and ŷ_i^B denote the predictions of the ensemble and the best base model, respectively, with both predictions obtained for the same held-out instance (OOF). We construct the 2 × 2 contingency table over N OOF instances as shown in
Table 5.
McNemar’s test focuses on the discordant pairs: n_12, the number of instances classified correctly by the ensemble but misclassified by the base model, and n_21, the reverse case. The continuity-corrected chi-square statistic, with one degree of freedom, is defined in Equation (1):
χ² = (|n_12 − n_21| − 1)² / (n_12 + n_21). (1)
For small n_12 + n_21, we additionally report the exact binomial p-value, testing whether the probability of a discordant outcome favors one model over the other under the null hypothesis H_0: p = 0.5. We apply the test separately per dataset (Bonn and GBNE). A higher n_12 than n_21 indicates that the ensemble corrects more of the base model’s errors than vice versa, and significant p-values confirm that the observed superiority is unlikely to be due to chance.
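The full procedure (discordant counts, corrected chi-square, exact binomial fallback) can be sketched as follows; the labels and per-model error rates are synthetic, and only SciPy’s `binomtest` is assumed.

```python
# Minimal sketch of the paired McNemar procedure on out-of-fold predictions.
# n12 = ensemble right, base wrong; n21 = base right, ensemble wrong.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=300)                       # ground truth
pred_ens = np.where(rng.random(300) < 0.9, y, 1 - y)   # ensemble, ~90% correct
pred_base = np.where(rng.random(300) < 0.8, y, 1 - y)  # base model, ~80% correct

ens_ok, base_ok = pred_ens == y, pred_base == y
n12 = int(np.sum(ens_ok & ~base_ok))   # discordant: ensemble corrects base
n21 = int(np.sum(~ens_ok & base_ok))   # discordant: base corrects ensemble

# Continuity-corrected chi-square statistic, one degree of freedom
chi2 = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
# Exact binomial p-value for small discordant counts, H0: p = 0.5
p_exact = binomtest(n12, n12 + n21, p=0.5).pvalue
print(round(chi2, 3), round(float(p_exact), 4))
```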
6. Results
6.1. Performance of Base Classifiers
Outlier detection was conducted for two EEG datasets—Bonn and the Guinea-Bissau and Nigeria Epilepsy Dataset (GBNE)—using the automatically generated hyperparameter space described in
Section 5.4. Three base classifiers were applied in the experiments: k-Nearest Neighbors (
k-NN), Support Vector Machines (SVM), and decision trees of the CART type (DT-CART). Stratified results for ictal, interictal, and healthy subsets of the Bonn dataset were analyzed separately to assess robustness across EEG categories. This allowed us to evaluate whether classifier performance remained consistent across different types of epileptiform and normal activity.
Table 6 summarizes, for each classifier and dataset, the ranges of ACC, precision, outlier-class F1, and recall. Model selection in our framework uses the outlier-class F1, while accuracy is reported for context only.
For the SVM classifier, we used the RBF kernel. On the Bonn EEG dataset, the maximum overall accuracy reached 94.4%, yet outlier recall was only 66.3%. On GBNE, SVM achieved 83.0% accuracy with 82.0% recall, markedly exceeding its recall on Bonn.
For the DT-CART classifier (maximum depth 30, entropy split), overall accuracy on the Bonn EEG dataset remained below 50% (maximum 48.8%), with precision up to 67.2% and recall up to 48.8%. On GBNE, recall peaked at 60.0%. Within the automated selection procedure, base models were eligible for ensemble construction only if they achieved an outlier-class F1 ≥ 0.60 on validation folds (
Section 5.5). As DT-CART did not satisfy this criterion on either dataset, it served exclusively as a comparative baseline for interpretability assessment and was automatically excluded from all homogeneous and heterogeneous ensembles. This confirms that ensemble composition was determined objectively by the performance threshold rather than manual preference for specific algorithms.
The best performance was obtained with the k-NN classifier. On the Bonn EEG dataset, the maximum accuracy was 92.5%, achieved using Manhattan distance, with recall up to 92.5%. On GBNE, the best accuracy (79.9%) was achieved with Manhattan distance and uniform weighting; recall reached 80.3%.
Overall, the k-NN algorithm showed the highest effectiveness in detecting outliers—especially on Bonn—combining high accuracy with high recall. SVM yielded satisfactory results, particularly on GBNE, but its Bonn recall was limited. DT-CART exhibited the weakest performance, with both accuracy and recall markedly below those of the other methods. These limitations of individual base classifiers motivate the use of ensemble methods to improve outlier detection in EEG signals.
6.2. Results of Homogeneous Ensemble Classification
Homogeneous ensemble classifiers were constructed from a single classifier type with varying hyperparameters. Only base models that achieved an outlier-class F1 ≥ 0.60 on validation folds were included in the ensemble generation process (
Section 5.5). Under this rule, DT-CART did not qualify on either dataset and was therefore excluded from all homogeneous ensembles. Its results are reported in
Section 6.1 for completeness, as they illustrate the automatic screening mechanism that governs ensemble construction in the proposed framework. Given a specified number
n of base classifiers of the same type and a selection threshold of outlier-class F1 ≥ 0.60, the algorithm automatically formed combinations of these classifiers into a homogeneous ensemble.
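This screening-and-combination rule can be sketched in a few lines; the configuration names and validation scores below are illustrative placeholders, not the values reported in the tables.

```python
# Hedged sketch of the screening rule: only base configurations whose
# validation outlier-class F1 reaches 0.60 are combined into size-n
# homogeneous ensembles (names and scores are made up for illustration).
from itertools import combinations

F1_THRESHOLD = 0.60
candidates = [
    ("kNN(k=3, manhattan)", 0.91),
    ("kNN(k=5, euclidean)", 0.88),
    ("SVM(rbf)", 0.74),
    ("DT-CART(depth=30)", 0.49),   # fails the screen, as in Section 6.1
]

eligible = [name for name, f1 in candidates if f1 >= F1_THRESHOLD]
n = 2  # requested ensemble size
ensembles = list(combinations(eligible, n))
print(eligible)
print(ensembles)
```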
Bonn EEG Dataset
Outlier-class F1 values for the top-performing configurations of
k-NN, SVM, and DT-CART on the Bonn EEG dataset are presented in
Table 7.
For transparency, we report the best DT-CART and polynomial-kernel SVM settings even when F1 < 0.60; such models were not eligible for ensemble construction under the F1 ≥ 0.60 selection rule. Following the procedure described in
Section 4, homogeneous ensemble classifiers were automatically created for specified values of
n. Sample configurations are shown in
Table 8.
Homogeneous ensemble classifiers were evaluated using three aggregation methods: majority voting, bagging, and stacking. For each combination of classifiers, accuracy (ACC), precision (P), recall, and the F1-score were calculated. The results are presented in
Table 9. The stacking technique achieved the best performance, with accuracy up to 95.0%. Even a 2.5 percentage point improvement compared to the best single classifier is clinically meaningful. Majority voting produced results similar to the strongest individual classifiers (around 92.5%), while bagging performed more weakly, with accuracy in the 80.0–88.8% range.
As shown in
Table 9,
k-NN ensembles achieved very high precision and recall, with majority voting correctly identifying over 92% of outlier cases. The F1-score reached 91.8%, confirming robust detection performance. Bagging was consistently weaker, performing about 10 percentage points below the other aggregation methods. Stacking produced the strongest results, with precision above 91% and accuracy up to 95%.
For SVM-based ensembles, recall improved markedly compared to single models (from 66.3% to 91.3%). The highest ACC was obtained with stacking of two classifiers, whereas stacking of three to four classifiers achieved the highest F1 and recall. In contrast, bagging offered only moderate gains.
Comparing k-NN and SVM ensembles highlights distinct behaviors. k-NN ensembles achieved higher recall (above 90%), making them highly sensitive to epileptic discharges but with slightly lower precision. SVM ensembles delivered very high precision (up to 1.000) but somewhat lower recall, reflecting a more conservative detection strategy with fewer false positives but more missed seizures.
Clinically, this trade-off is important. For seizure detection, where missing an event carries high risk, the higher recall of k-NN ensembles may be preferable. Conversely, SVM ensembles are valuable in scenarios where minimizing false alarms is critical, such as continuous monitoring in clinical settings.
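The three aggregation schemes compared above can be instantiated with scikit-learn's `VotingClassifier`, `BaggingClassifier`, and `StackingClassifier`; the estimators, hyperparameters, and data below are illustrative placeholders rather than the study's tuned configurations.

```python
# Illustrative construction of the three aggregation schemes
# (synthetic data; hyperparameters are placeholders, not tuned values).
import numpy as np
from sklearn.ensemble import VotingClassifier, BaggingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))
y = (rng.random(150) < 0.3).astype(int)

base = [("knn3", KNeighborsClassifier(3)), ("knn7", KNeighborsClassifier(7))]
voting = VotingClassifier(base, voting="hard").fit(X, y)
bagging = BaggingClassifier(KNeighborsClassifier(5), n_estimators=10,
                            random_state=3).fit(X, y)
stacking = StackingClassifier(base, final_estimator=LogisticRegression(),
                              cv=5).fit(X, y)
print([round(m.score(X, y), 3) for m in (voting, bagging, stacking)])
```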
Guinea-Bissau and Nigeria Epilepsy Dataset
Results for the GBNE dataset differed substantially. For k-NN ensembles, the best performance came from majority voting and bagging with four classifiers, both achieving 82.9% accuracy. In contrast, stacking with k-NN produced the weakest outcomes.
SVM ensembles combined with stacking reached the highest overall accuracy on GBNE (86.8%), outperforming all other homogeneous ensembles. Majority voting with SVMs also achieved strong results (81.6%).
Unexpectedly, SVM ensembles trained with bagging showed extremely poor recall (only 11.8%), despite high precision (over 81%). This indicates that while they rarely produced false positives, they failed to detect most seizure events, limiting their clinical usefulness.
Detailed evaluation metrics for GBNE homogeneous ensembles are presented in
Table 10.
Overall, while accuracies on GBNE were encouraging, both precision and F1-scores dropped compared to the Bonn dataset. Majority voting and stacking often produced precision values between 44% and 58%, meaning nearly half of the detected outliers were false positives. Clinically, this limits reliability, despite strong recall in some cases.
k-NN ensembles generally achieved higher recall (up to 0.769), confirming their sensitivity to seizure events in noisy data. SVM ensembles, in contrast, achieved higher precision but often failed to detect true positives, particularly when bagging was applied.
These results highlight the challenges of applying ensemble methods to noisy, imbalanced datasets such as GBNE. While stacking improved overall accuracy, precision–recall trade-offs became more pronounced, underlining the need for dataset-adaptive strategies in clinical applications.
6.3. Results of Heterogeneous Ensemble Classification
Heterogeneous ensembles were generated automatically as described in
Section 4. Each base model was required to achieve an outlier-class F1 ≥ 0.60 on validation. Under this rule, DT-CART did not qualify on either dataset and was therefore excluded from all heterogeneous ensemble configurations. Examples of automatically generated ensembles for selected values of n are listed in
Table 11. Here, Manh denotes Manhattan distance, Eucl denotes Euclidean distance, dist indicates distance-based weighting, and u refers to uniform weighting.
Bonn EEG dataset.
Heterogeneous SVM + k-NN ensembles comfortably exceeded the selection threshold of F1 ≥ 0.60. The best result was obtained with stacking (
Table 12), achieving F1 = 92.7% and recall = 91.6%. Simple majority voting came close, whereas bagging lagged behind. Precision peaked for stacking (up to 90.5%), indicating an improved precision–recall balance without sacrificing sensitivity. Accuracy (ACC) is reported for context only, because model comparison is driven by F1.
On Bonn, stacking achieved the best overall performance (ACC = 0.950, F1 = 0.927, recall = 0.916). The confusion matrix for the best heterogeneous model (stacking, SVM +
k-NN) is shown in
Figure 6. Relative to the best homogeneous ensembles, false positives (type I errors) decreased by 4–5 cases, while false negatives (type II) remained comparable to
k-NN. Thus, the heterogeneous stack improves the precision–recall balance without sacrificing sensitivity.
It should be emphasized that on clean, well-structured EEG (Bonn), stacking SVM with k-NN reduces false alarms while maintaining high sensitivity, which is desirable for decision-support workflows.
Guinea-Bissau and Nigeria Epilepsy (GBNE).
On the more challenging GBNE dataset, stacking again produced the best overall accuracy (0.855), whereas both hard and soft majority voting underperformed, and bagging with three base models reached the best heterogeneous F1 (0.612). Despite the high precision of soft voting (0.890), its F1-score remained low due to reduced recall, indicating many missed true positives and an unstable balance on noisy data. Stacking delivered the strongest ACC, but with only moderate precision and recall, reflecting the difficulty of minority-class detection in field conditions. Detailed performance metrics for all heterogeneous ensembles are summarized in
Table 13.
On the GBNE dataset, stacking yielded the highest ACC (0.855), while bagging (3) gave the best F1 (0.612) and soft voting achieved the highest precision (0.890). It should be emphasized that on the noisy, imbalanced GBNE dataset, accuracy alone can be misleading: methods with high ACC (e.g., stacking) may still yield a modest F1 due to precision–recall trade-offs. In screening scenarios, ensembles should be tuned for higher recall to avoid missed seizures, whereas in alarm-driven monitoring, higher precision may be prioritized to limit false alerts.
The very low recall observed for SVM bagging on GBNE likely stems from variance in probability/threshold calibration under noisy, artifact-laden features, which—after bootstrap aggregation—induces a conservative decision boundary and a bias toward the majority class. In practice, this can be mitigated by post hoc calibration (Platt scaling or isotonic regression), class-weight tuning in the base SVMs, or decision-threshold optimization on validation folds (maximizing F1, or recall at a fixed precision).
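The threshold-optimization remedy can be sketched as a simple sweep over the precision–recall curve; the validation labels and scores below are synthetic stand-ins for a model's out-of-fold probabilities.

```python
# Sketch of post hoc decision-threshold tuning: sweep candidate thresholds
# from the precision-recall curve and keep the one maximizing F1.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(4)
y_val = (rng.random(500) < 0.15).astype(int)                 # imbalanced fold
scores = np.clip(0.3 * y_val + rng.random(500) * 0.7, 0, 1)  # crude probabilities

prec, rec, thresholds = precision_recall_curve(y_val, scores)
f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
best = int(np.argmax(f1[:-1]))     # last PR point has no associated threshold
print(round(float(thresholds[best]), 3), round(float(f1[best]), 3))
```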
6.4. Statistical Validation of Ensemble Superiority
To assess whether the observed improvements of the ensemble models were statistically significant, we applied the non-parametric McNemar’s test on paired classification outcomes between the ensemble (stacking) and the best-performing base models (k-NN and SVM). The test evaluates the null hypothesis that both models have identical proportions of correct classifications. A p-value below 0.05 indicates a significant difference in prediction behavior.
Figure 7 and
Figure 8 present the McNemar contingency tables for the Bonn and GBNE datasets, respectively. For the Bonn dataset, the test compared the stacking ensemble with k-NN, while for the GBNE dataset it compared stacking with SVM. In both cases the resulting p-values were below 0.01, leading to rejection of the null hypothesis and confirming that the ensemble classifier’s superiority is statistically significant at the 0.01 level.
7. Conclusions
The results obtained across both datasets demonstrate that ensemble models consistently outperformed individual base classifiers under the F1-driven selection protocol. The proposed models are designed for screening-level support and triage of large EEG corpora, not for standalone diagnosis; clinical decisions remain the purview of expert evaluation. For the Bonn EEG dataset, the best heterogeneous stack (SVM + k-NN) achieved an F1-score of 92.7%, with recall of 91.6% and accuracy of 95.0%. This configuration slightly improved the precision–recall balance compared with the strongest homogeneous ensembles, confirming that combining complementary decision rules enhances stability and reduces variability caused by hyperparameter selection.
On the more challenging GBNE dataset, performance was highly metric-dependent. Stacking achieved the highest accuracy (up to 85.5% for heterogeneous and 86.8% for homogeneous SVM ensembles), although its F1-score remained below the 0.60 threshold in the heterogeneous setting. In contrast, heterogeneous bagging reached the best heterogeneous F1 (up to 61.2%). Among homogeneous models, k-NN stacking prioritized sensitivity (recall up to 76.9%) and achieved a higher F1 (up to 75.3%), albeit at the cost of lower accuracy—illustrating the inherent precision–recall trade-offs typical of noisy, imbalanced EEG data acquired in field conditions.
These trade-offs have clear clinical implications. For screening scenarios where missed events are costly, recall-oriented ensembles (e.g., k-NN-based) are preferable. Conversely, in alarm-driven monitoring applications, precision-oriented variants (e.g., SVM voting or stacking) are more effective at reducing false alerts. Throughout this work, accuracy is reported for completeness, but model comparison and selection are guided by the outlier-class F1-score.
It should be emphasized that the Bonn dataset is small and idealized, whereas the GBNE dataset, though more realistic, includes a limited number of subjects. Fixed 2 s segmentation may overlook longer temporal dependencies. Moreover, while stacking improves effectiveness, it also reduces interpretability, and the models’ sensitivity to class imbalance and noise limits their immediate clinical deployment. All experiments were performed offline rather than in real-time conditions.
Additionally, the ensemble generator applies an automatic F1-based selection rule that determines which base models are eligible for inclusion. This simple yet effective criterion enhances the transparency, reproducibility, and adaptability of the proposed framework, making it applicable to diverse datasets and classifier types without manual tuning or arbitrary design choices.
Future research should address these limitations by expanding the datasets with longer and more diverse EEG recordings, incorporating explainable AI mechanisms to enhance interpretability, and validating the framework in real-time monitoring settings. Exploring deep learning ensembles, adaptive weighting strategies, and multimodal data integration (e.g., EEG–fMRI) may further enhance both performance and clinical applicability.
Overall, the proposed automated ensemble framework demonstrates not only methodological improvements but also clinically meaningful benefits. It offers a robust, interpretable, and generalizable solution capable of reducing false alarms, improving diagnostic efficiency, and supporting neurologists in reliable and efficient epilepsy detection.