Personalized Classifier Selection for EEG-Based BCIs

The most important component of an Electroencephalogram (EEG) Brain–Computer Interface (BCI) is its classifier, which translates EEG signals in real time into meaningful commands. The accuracy and speed of the classifier determine the utility of the BCI. However, there is significant intra- and inter-subject variability in EEG data, complicating the choice of the best classifier for different individuals over time. There is a keen need for an automatic approach to selecting a personalized classifier suited to an individual’s current needs. To this end, we have developed a systematic methodology for individual classifier selection, wherein the structural characteristics of an EEG dataset are used to predict a classifier that will perform with high accuracy. The method was evaluated using motor imagery EEG data from Physionet. We confirmed that our approach could consistently predict a classifier whose performance was no worse than the single-best-performing classifier across the participants. Furthermore, Kullback–Leibler divergences between reference distributions and signal amplitude and class label distributions emerged as the most important characteristics for classifier prediction, suggesting that classifier choice depends heavily on the morphology of signal amplitude densities and the degree of class imbalance in an EEG dataset.


Introduction
Brain-Computer Interfaces (BCIs) are a type of assistive technology that allows individuals with profound motor impairments to directly use mental activity to control external devices and interact with their world [1]. BCIs infer movement or communication intentions directly from brain signals, typically Electroencephalogram (EEG), in real time [2]. EEG signals are measured from the scalp of the individual via electrodes. They are the spatial and temporal summations of thousands of synchronous excitatory and inhibitory post-synaptic potentials, mostly due to extra-cellular currents associated with the synaptic activity of pyramidal neurons [3]. Furthermore, EEG signals are susceptible to the blurring effects of volume conduction, as the electrodes are far away from the signal sources [4]. Thus, EEG signals are inherently noisy and non-stationary (their statistics change over time), with time-varying spectra and spatial distributions, making their classification challenging [1].
At the heart of the BCI is the classifier that translates the incoming stream of EEG data into functional commands (e.g., selection of words or control of an external device). Numerous classifiers have been deployed in BCIs [5], as the classification of EEG data is generally a difficult task, where no single classifier works well across all users [6]. Specifically, BCI performance for a given classifier and task often varies greatly across individuals [7]. This widespread inter-subject variability is manifested in the spectral characteristics of task-related EEG signals [8], the temporal features of evoked potentials [9], and the spatial distribution of sensorimotor-related activations [10]. Within individuals, EEG signals are also inherently non-stationary, both within and between days [11]. For example, children undergo developmental brain changes such as neurogenesis, neural migration, pruning, and myelin formation [12], while adults experience widespread regional brain volume reductions with aging [13]. Deciding on the best classifier is thus a particularly challenging and time-consuming problem for EEG BCIs.
To deal with this rampant intra- and inter-subject variability, some BCI studies have exploited transfer learning [14]. Chen and colleagues investigated the cross-subject distribution shift problem and proposed a solution based on deep adaptation networks [15], using a custom loss function to decrease both classification and adaptation losses concurrently [16]. In another work, by George et al., transfer learning was utilized in online and offline fashions to improve the classification accuracies of three deep neural networks, namely BiGRU, Deep Net [17], and Multibranch CNN [18], to tackle the non-stationary nature of motor imagery tasks within and across sessions and subjects, respectively [19]. Alternatively, others have proposed between-session updates of the trained classifier [11,20]. However, the choice of classifier remains unaddressed in such schemes. User-dependent classifiers generally tend to outperform user-independent classifiers [21], necessitating personalized classifier selection. To this end, a scheme for expediently predicting the most accurate classifier for a given user at a given time of day would be valuable. For this paper, we leveraged empirical algorithmics and algorithm portfolio methods to design a framework that can automatically decide on the most accurate classifier (see Figure 1) for the BCI dataset at hand, on the basis of the structural characteristics of the dataset. The main contributions of this work are twofold:

• A systematic approach for classifier selection using the structural characteristics of data is proposed;
• The applicability of our classifier selection method is evaluated against 109 BCI2000 EEG datasets.

Related Work

Algorithm Portfolios
One promising strategy for classifier selection is to use machine learning methods to learn from data and subsequently advise users on the most suitable algorithm without having to actually apply different methods to the data. One of the early works to address algorithm selection was done by Rice [22], who applied approximation theory to the algorithm selection problem, based on problem space, algorithm space, and performance-measures space. Later, Gomes and Selman [23] employed an algorithm portfolio for hard combinatorial search problems and showed that algorithms with higher variance in running time, such as stochastic algorithms, are advantageous over best-bound approaches. Leyton-Brown et al. also proposed an algorithm portfolio approach, reporting that a set of algorithms with a selection heuristic collectively outperforms its constituent algorithms on the combinatorial auction Winner Determination Problem (WDP) [24]. The algorithm portfolio approach has been successfully applied to solving various problems, including propositional satisfiability [25,26], automated mission planning [27], university timetabling [28], traveling salesman [29], subgraph isomorphisms [30], collaborative filtering [31], dynamic maximal covering location [32], and human behavior and syllogistic reasoning [33]. The main advantage of combining algorithms in a portfolio is the reduction of computational cost while boosting classification accuracy [23].
Empirical or experimental algorithmics entail the use of empirical methods, such as statistics, to investigate the behaviors of algorithms [34-38]. Empirical algorithmics have found widespread application across various problem domains, including, for example, automating performance bottleneck detection [39], input-sensitive profiling [40], and computational phylogenetics [41]. By statistically analyzing the behavior of different classifiers, one can distill a subset of classifiers, i.e., a portfolio, from which the most appropriate classifier for a given problem and dataset can be selected.

Automated Machine Learning
Automated Machine Learning (Auto-ML) is a field in which the selection of machine learning tasks, such as feature selection, classification/regression, and hyper-parameter tuning, is achieved using optimization techniques, thereby obviating the need for human expert input [42]. Auto-ML finds the best combination of machine learning tasks to maximize classification/regression accuracy. However, it is computationally expensive, which makes it less applicable to processing EEG data on the fly. Although imposing a time limit for Auto-ML is an option to bound the computational requirements, this comes at the expense of inferior accuracy.
Rooted in the Bayesian optimization methods of [43], Auto-WEKA [42] simultaneously addresses model selection and parameter optimization. Auto-WEKA started a trend of optimizing as many aspects of the machine learning pipeline as possible, from data preprocessing to architecture selection to hyper-parameter tuning. Later, in Auto-WEKA 2.0, the authors added regression methods, different performance metrics, and parallelism [44]. Auto-sklearn is an improvement of Auto-WEKA 2.0, incorporating meta-learning to boost Bayesian optimization [45]. For example, a subset of methods working well on a new dataset, based on previously computed results, is determined and passed to the optimization step. In the end, an ensemble is created automatically to provide more robust results compared to Auto-WEKA. Recently, Auto-ML methods have also been used to boost the performance of Deep Learning (DL) systems for applications such as image classification and natural language processing [46]. A benchmark of Auto-ML methods can be found in [47].
Auto-ML methods are well-suited to non-experts in ML and to problems where a streamlined end-to-end pipeline of pre-processing and high-accuracy classification/regression is needed. However, with EEG data processing, pre-processing is uniquely designed for each study based on the mental tasks, a challenge beyond Auto-ML. Considering the surveyed works above, there is a paucity of research on data-driven classifier selection, particularly for EEG processing for BCIs. We thus propose a potential solution using empirical algorithmics.

Algorithmic Fairness
While the motivation for finding personalized best classifiers for EEG data is very much in the spirit of algorithmic fairness [48,49] and the multicalibration/multiaccuracy line of work [50], our setting is somewhat different. Rather than samples corresponding to individuals and classes to protected groups, here we are dealing with a series of recordings for each person, which are samples that are not necessarily computationally identifiable as belonging to a specified group/person. Instead, we look at the aggregate features of the data belonging to a specific person and use that information to estimate which classifier is likely to give the best accuracy over a set of unseen samples from a distribution with those features. In our experiments, this method outperforms using the best overall classifier, as shown in Figure 2.

EEG Data
For the analyses described herein, we used data from the BCI2000 [51] dataset. Briefly, this dataset consists of 64-channel Electroencephalogram (EEG) recordings sampled at 160 Hz from 109 subjects, performing three 2 min runs of hand and feet motor execution and imagery tasks. Here, we only consider the hand imagery tasks (imagining opening and closing of the left or right fist) and, for each participant, we merged their data from the respective runs (tasks 4, 8, and 12, in terms of the BCI2000 vernacular). In most EEG datasets considered, classes were imbalanced according to the ratio 1.8:1.0:1.05 (number of samples in class 1 to class 2 to class 3). We then applied independent component analysis (ICA), retaining enough components to attain a cumulative explained variance just below 99%. Subsequently, we down-sampled the signals to 10 Hz. The resulting data were merged into one data file for each participant. The EEG data for one participant are herein referred to as an EEG dataset. This is to be distinguished from a classifier dataset, explained in Section 3.2.3. Incidentally, given the EEG datasets described above, the EEG ternary classification problem was that of distinguishing among rest, left-hand, and right-hand motor imagery.
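As a rough illustration, the component-retention and down-sampling steps might look like the sketch below. The paper does not give implementation details, so the use of a PCA explained-variance criterion to pick the number of ICA components, and the two-stage decimation, are assumptions:

```python
import numpy as np
from scipy.signal import decimate
from sklearn.decomposition import PCA, FastICA

def preprocess(eeg, var_threshold=0.99):
    """Sketch of the preprocessing described above.
    eeg: array of shape (n_samples, n_channels), sampled at 160 Hz."""
    # Number of components whose cumulative explained variance stays
    # just below the threshold (assumed criterion)
    cumvar = np.cumsum(PCA().fit(eeg).explained_variance_ratio_)
    n_comp = max(1, int(np.searchsorted(cumvar, var_threshold)))
    # Unmix with ICA, retaining that many components
    sources = FastICA(n_components=n_comp, random_state=0).fit_transform(eeg)
    # Down-sample 160 Hz -> 10 Hz by 16 in two stages (4 x 4) so each
    # anti-aliasing filter stays well-conditioned
    for q in (4, 4):
        sources = decimate(sources, q, axis=0)
    return sources
```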

Proposed Classifier Selection Method
To the best of our knowledge, there have been no investigations of classifier selection for EEG datasets that specifically leverage the characteristics of the data. Several efforts have sought suitable pairings of algorithms to datasets [52-54], but none of them took an empirical and automated approach to finding a match between an algorithm and an EEG dataset's characteristics. Here, we propose such a framework, in which each EEG dataset is characterized by its structural properties.

EEG Dataset Characteristics
First, we generated 41 structural characteristics of the dataset (see Table 1), agnostic to the source of the data, which is a necessary step for any arbitrary dataset.These dataset descriptors fall into three categories: learnability of the dataset, properties of signal features, and informativeness of class labels.

Learnability
In the first category, we included the number of samples, signal features, and classes, as well as the ratio of the number of samples to signal features. Learnable concepts can be defined using the Probably Approximately Correct (PAC) framework, in terms of sample complexity and time and space complexity, which depend on the cost of the representation of the concepts [55].
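The four learnability descriptors above reduce to simple counts; a minimal sketch (the dictionary keys mirror the symbols used later in the paper, but the exact names in Table 1 are an assumption):

```python
import numpy as np

def learnability_characteristics(X, y):
    """Learnability descriptors: number of samples m, number of signal
    features n, number of classes, and the samples-to-features ratio."""
    m, n = X.shape
    n_class = len(np.unique(y))
    return {"m": m, "n": n, "nClass": n_class, "m_over_n": m / n}
```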

Properties of EEG Signal Features
In the second category, we focused on the properties of the signal features included in the dataset. We calculated the average of all the features and the average standard deviation and covariance of all the features. To evaluate the intra-correlation and redundancy of the features, we calculated the average chi-square and the inter-feature Pearson, Kendall, and Spearman correlations. The feature-to-class correlations were also included, to gauge the importance of features in predicting class labels. To render our approach agnostic to the data source, we also included the average Kullback-Leibler (KL) divergence [56] between all feature distributions and normal, uniform, logistic, exponential, chi-square, Rayleigh, Pareto, and Zipf distributions.
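One such descriptor can be sketched as follows: the KL divergence between a feature's empirical distribution and a fitted reference distribution. The histogram-based estimator and binning scheme are assumptions, as the paper does not specify how the divergences were computed:

```python
import numpy as np
from scipy.stats import norm, entropy

def kl_to_reference(feature, ref_dist=norm, bins=50):
    """KL divergence between the empirical distribution of one signal
    feature and a reference distribution fitted to the same data."""
    hist, edges = np.histogram(feature, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    params = ref_dist.fit(feature)           # e.g. (mean, std) for a normal
    ref = ref_dist.pdf(centers, *params)
    eps = 1e-12                              # avoid log(0) in empty bins
    p = (hist + eps) / (hist + eps).sum()
    q = (ref + eps) / (ref + eps).sum()
    return entropy(p, q)                     # KL(p || q) in nats
```

Averaging this quantity over all features, for each of the eight reference families listed above, would yield the corresponding avgKL* characteristics.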

Informativeness of Class Labels
The last category focused on the properties of class labels. We calculated the entropy of class labels and characterized their skewness via upper and lower quantiles and chi-square values. Additionally, we included the average covariance of all features in the class and the Kullback-Leibler divergence between class label distributions and various reference distributions. The Rademacher complexity of class labels was also considered, as a measure of the richness of class labels and their similarity to a randomly generated vector. After generating the 41 characteristics for each dataset, we applied 22 classifiers to each dataset to determine the best classifier, in terms of classification accuracy, as described next.
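Two of these label descriptors can be sketched directly: the Shannon entropy of the labels and the KL divergence between the empirical label distribution and a uniform reference, which acts as a class-balance measure (the descriptor name `KLUnifClass` follows the paper's Figure 3; the computation itself is an assumed reading of the text):

```python
import numpy as np
from scipy.stats import entropy

def label_characteristics(y):
    """Entropy of the class labels and the KL divergence between the
    empirical label distribution and a uniform reference."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    uniform = np.full_like(p, 1.0 / len(p))
    return {"entropy": entropy(p), "KLUnifClass": entropy(p, uniform)}
```

For a perfectly balanced dataset `KLUnifClass` is zero; it grows as the classes become more imbalanced.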

EEG Classifiers
We selected 22 classifiers (see Table 2) from ensemble, linear, Naive Bayes, nearest neighbors, neural networks, and tree-based classifiers, representing the most commonly used classifiers in EEG data processing. For the ensemble methods, we included Ada Boost (AB), Extra Trees (ET), Random Forest (RF), and Gradient Boosting (GB, with two criteria). For the linear methods, which are very common in EEG data processing [5], we employed Linear Discriminant Analysis (LDA; two variants, one with singular value decomposition and the other with a least-squares solver), Logistic Regression (LR; two variants, one with an L2 penalty and the other with none), Ridge Classifier (RC), and the regularized linear model with Stochastic Gradient Descent learning (SGD). As surveyed by multiple review papers, Naive Bayes, nearest neighbors, neural networks, and tree-based classifiers have been extensively used in emotion recognition [57,58], steady-state visual evoked potential [59], and motor imagery EEG classification [6]. Therefore, we included Bernoulli Naive Bayes (BNB), Complement Naive Bayes (CNB), Gaussian Naive Bayes (GNB), and Multinomial Naive Bayes (MNB) from the Naive Bayes family of methods; K-Neighbors Classifier (KN, with K = 5 and BallTree, KDTree, and brute-force search algorithms) and Nearest Centroid (NC) from the nearest neighbor-based classifiers; Multi-Layer Perceptron (MLP) from the neural network classifiers; and Decision Tree (DT; two variants, one with the Gini index and the other with entropy as the impurity measure) from the tree-based classifiers.
Table 2. Scikit-learn [60] classifiers and their parameters used for the experiments. As the present study focused on classifier prediction, the downsampled (to 10 Hz) raw EEG signals were classified. Signals, each comprising 696 samples from 64 channels, were concatenated to form a 696 × 64 dimensional input vector for classification.

Classifier Dataset
All 22 classifiers were applied to each dataset, using 10-fold cross-validation (10-CV), and the name of the classifier achieving the highest accuracy was used as the label for the corresponding input EEG dataset from one participant. In other words, upon completion of this exercise, we obtained a classifier dataset, wherein a single instance comprised a vector of the 41 structural characteristics of a single participant's EEG dataset, paired with the name of the highest-accuracy classifier (as a categorical variable) for these EEG data. This classifier dataset contained 109 instances.

Predicting the Best Personalized Classifier
The classifier dataset was subjected to Principal Component Analysis (PCA) to extract the most informative structural characteristics of the EEG datasets. A Random Forest (RF) was implemented to classify the instances. In practice, predicting a classifier that performs almost as well as the best classifier is just as valuable as predicting the very best one. To account for that, we introduced a rounding variable, denoting the precision of the target accuracy. We formed a "bucket" for each instance and included any number of EEG classifiers whose accuracy fell within the rounding distance of that of the best classifier. When rounding was zero, only the actual label (i.e., the name of the single-best-performing EEG classifier) for a given EEG dataset was used, whether in the training or testing phase.
If rounding = t, where t > 0, then for the training phase we used the actual labels, but for the testing phase the predicted label for each instance was compared with the labels in the corresponding "bucket", which included the best classifier and those with accuracies at most t below the highest accuracy during the training phase. If the predicted classifier was among those in the "bucket", then the prediction was considered correct. We repeated this process 10 times for all the possible values of the number of extracted structural characteristics (i.e., 2 to 41), using PCA and a 70-30% split (76 samples for training and 33 samples for testing) of the classifier dataset. The proposed personalized classifier selection method (the code is available at https://github.com/jranaraki/PersonalizedClassifierSelection (accessed on 19 June 2024)) is summarized in Figure 1.
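The "bucket" rule can be captured in a few lines. This is a sketch of the evaluation logic described above, not the authors' released code:

```python
def bucket_labels(accuracies, rounding):
    """Given {classifier_name: accuracy} for one EEG dataset, return
    every classifier whose accuracy is within `rounding` of the best.
    With rounding = 0, only the single-best classifier is accepted."""
    best = max(accuracies.values())
    return {name for name, acc in accuracies.items()
            if best - acc <= rounding}

def prediction_correct(predicted, accuracies, rounding):
    """A prediction counts as correct if it lands in the bucket."""
    return predicted in bucket_labels(accuracies, rounding)
```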

Environment
All the experiments were conducted using a machine with Ubuntu 22.04.4 LTS, an Intel® Core™ i7-8750H, and 24 GB of RAM. The proposed method was implemented in Python 3.12.2, and no proprietary library was used to run the experiments.

Experimental Results
Table 3 summarizes the frequency at which the classifiers were selected as the best, in terms of accuracy. The classifiers that were the least frequently selected as the best were Gradient Boosting (GB), Bernoulli NB (BNB), Multinomial NB (MNB), and Linear Discriminant Analysis (LDA). When GB, BNB, and MNB were selected as the best classifier, we noted that their accuracies were negligibly higher than that of the corresponding second-best classifier. As such, we discarded these classifiers from further consideration. On the other hand, LDA was retained, as its accuracy tended to be dramatically higher than that of the cognate second-best classifier. The revised counts of the number of times each of the top six classifiers was the most accurate are shown in Table 3. The average number of Floating Point Operations Per Second (FLOPS) for each method across all datasets is also reported. The LR classifier was the best overall classifier across all participants. Using our method for each participant with rounding = 0.01, the accuracy of the predicted classifier exceeded that of LR by 0.0035 ± 0.0120, on average, with an average of 24.20 extracted features. Figure 2 illustrates the difference in classification accuracies from that of the best classifier for each dataset in the test set, sorted in terms of bucket size from smallest to largest.
Table 4 provides more details by presenting the classifiers included in each bucket for each participant, the best, randomly selected, and the predicted classifiers and their corresponding accuracies.
We performed an N × N Friedman test [61] to evaluate the differences between the accuracies of each approach on the test data (see Table 5 for the mean rank of each approach). Subsequently, Holm's post hoc pairwise comparisons were conducted (see Table 6) to account for multiple comparisons. Our proposed method (labeled as 'Predicted' in Table 5) ranked higher than both the best overall classifier (i.e., LR) and the randomly selected classifiers. Based on the pairwise comparisons, all pairs were significantly different except for Predicted vs. LR, which confirmed that our method performed on a par with the best overall classifier. Table 4. Predicted classifiers for each dataset using rounding = 0.01 and 24 extracted features using PCA, and the accuracies (shown as subscripts) of the best, randomly selected, predicted, and Logistic Regression (LR) classifiers. The bucket for each dataset could contain the Extra Trees (ET), Random Forest (RF), Linear Discriminant Analysis (LDA), Logistic Regression (LR), Ridge Classifier (RC), and MLP classifiers. With rounding = 0.01, for most of the datasets more than one classifier achieved the highest accuracy. The best, average, and worst accuracies using RF as the rounding increased from 0.00 to 0.04 are shown in Table 7. The rankings of the 41 structural characteristics (Table 1) of the EEG datasets are shown in Figure 3, based on the RF classification of the classifier dataset.
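A common way to run this comparison is a Friedman omnibus test followed by pairwise tests with Holm's step-down correction. The sketch below uses pairwise Wilcoxon signed-rank tests for the post hoc step, which is an assumption; the paper does not specify the exact pairwise statistic behind Table 6:

```python
import itertools
from scipy.stats import friedmanchisquare, wilcoxon

def holm_adjust(p_values):
    """Holm's step-down adjustment of a list of raw p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def friedman_holm(accuracy_by_method):
    """Friedman omnibus test over per-dataset accuracies of each
    approach, then Holm-corrected pairwise Wilcoxon tests."""
    names = list(accuracy_by_method)
    _, p_omnibus = friedmanchisquare(*(accuracy_by_method[n] for n in names))
    pairs = list(itertools.combinations(names, 2))
    p_raw = [wilcoxon(accuracy_by_method[a], accuracy_by_method[b]).pvalue
             for a, b in pairs]
    return p_omnibus, dict(zip(pairs, holm_adjust(p_raw)))
```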

Discussion
Finding the best classifier for the dataset at hand is a laborious task. Unsurprisingly, researchers often simply deploy the classifiers previously used for similar problems. To date, no empirical approaches systematically suggest a classifier based on the structural properties of EEG datasets. As a solution to this problem, we formed a classifier dataset of instances, each comprising a set of 41 structural characteristics of an EEG dataset and a target label (i.e., the best classifier for this dataset). Then, we applied feature extraction using PCA and introduced a rounding variable to account for variability in classification accuracies. By increasing the rounding value, we allowed more than one classifier to join the "bucket" (i.e., the set of correct answers). We trained a Random Forest over the generated classifier dataset and compared the predicted classifiers with those in the "bucket". We evaluated our method on EEG datasets from BCI2000 [51].

Predicting a Classifier for a New User
Our findings suggest that it is indeed feasible to predict a classifier for a new EEG BCI user, strictly on the basis of the structural characteristics of their offline (i.e., training) EEG dataset (Figure 2 and Table 4). In other words, one could identify a person-specific classifier without the need for time-consuming experimentation (i.e., training and testing different classifiers). In fact, Tables 5 and 6 confirm that the proposed framework can predict a classifier that will perform no worse than the single-best-performing classifier across the participants. This is an important finding because it suggests that the proposed approach could allow BCI practitioners to quickly choose a subject-specific classifier once a cognate training dataset has been acquired, potentially accelerating the path to same-session online testing.

Most Predictive Structural Characteristics
From Figure 3, Kullback-Leibler measures feature prominently among the most important structural characteristics of the EEG dataset. The KLUnifClass and KLNormClass characteristics reflect the differences between the distribution of class labels (represented as integers 1, 2, and 3) and reference uniform and normal distributions, respectively. These characteristics can be interpreted as representing the degree of balance of samples across classes (i.e., if a dataset were completely balanced, the distribution of class labels would be uniform). Our analyses thus seem to suggest that certain classifiers are preferred in the presence of class imbalances.
The avgKLExpoAll and avgKLParetAll characteristics represent how closely the EEG amplitude distributions (pooled across all classes) resemble exponential and Pareto distributions. Both distributions have one-sided, right-tailed densities that fall off as the distance from the mean increases. However, the Pareto density has a heavy tail compared to an exponential density with the same mean and, thus, assigns higher probabilities to large values of x. Our findings suggest that classifier choice hinges, in large part, on the shape of the EEG amplitude density, namely, where it lies between power-law and exponential decay. The positive skewness of the amplitude density is associated with nonlinear temporal dynamics in the signal [62].
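The tail contrast between the two reference families is easy to verify numerically. The sketch below (an illustration, not taken from the paper) matches a Pareto and an exponential by their means and compares survival probabilities far from the mean:

```python
from scipy.stats import expon, pareto

# A Pareto with shape b = 3 has support x >= 1 and mean b / (b - 1) = 1.5.
heavy = pareto(3.0)
# An exponential matched to the same mean.
light = expon(scale=heavy.mean())

# Far from the mean, the power-law tail dominates: P(X > 20) is about
# 1.25e-4 for the Pareto but under 2e-6 for the exponential.
tail_heavy = heavy.sf(20.0)
tail_light = light.sf(20.0)
assert tail_heavy > tail_light
```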
In sum, class balance and the morphology of signal amplitude distributions appear to be critical structural characteristics of an EEG dataset for classifier prediction.

The Elusive Best Classifier
The logistic regression classifier was the single-best classifier across the motor imagery EEG datasets. This finding corroborates previous motor imagery BCI research, which identified the logistic regression classifier as yielding the highest accuracy [63] and the greatest receiver operating characteristic area [64] among motor imagery classifiers. With the BCI2000 dataset, the choice of best classifier was seemingly not unique in many instances; more than one preferred classifier could be selected with little difference in accuracy. This could, in part, be attributable to the well-documented, clearly lateralized, and machine-discernible event-related desynchronization and synchronization reflected in EEG signals accompanying motor imagery in adults [65]. For other BCI classification challenges, such as emotion recognition [66] or speech decoding [11], where common topographical patterns across participants are less probable, the performance differences among classifiers may be more evident.

Limitations and Future Work
We only considered a homogeneous dataset (i.e., BCI2000 [51]), where the same protocol and instrumentation were implemented across participants. As such, certain structural characteristics, namely the number of features (n), the number of classes (nClass), the ratio of the number of samples to the number of features (m/n), and the number of samples (m), contributed negligibly to classifier prediction (Figure 3). The value of the proposed method would be more evident with heterogeneous datasets, comprising data from different subjects, dissimilar protocols, and varied instrumentation. Furthermore, the performance differences among classifiers would likely be more dramatic with heterogeneous datasets, rendering the choice of classifier even more critical.
We were able to predict a classifier that performed on a par with the single-best classifier across the participants. However, this classifier may not be the best classifier for an individual user. Future research ought to investigate the prediction of the highest-accuracy classifier for a new user (i.e., with rounding = 0), as well as validating the proposed method on data collected on a different day.
We only predicted the classifier but did not optimize other parts of the signal-processing pipeline on a per-user basis. The predicted classifier itself, as well as the preceding filtering and feature extraction, could be optimized via an Auto-ML method without the need for further data. This could be followed by studying metrics specific to evaluating imbalanced datasets. In this way, the proposed method could be applied to other challenging classification problems, such as MRI and genomic data classification, where additional data collection is costly or logistically challenging.

Conclusions
We showed that it is feasible to automatically predict a classifier based on the structural characteristics of an EEG dataset. Our proposed approach can recommend a subject-specific classifier, the average accuracy of which can surpass the average classification accuracy of the single-best classifier across the participants. Personalized classifier selection has the potential to reduce the time and effort required to optimize BCIs for specific individuals.

Figure 1 .
Figure 1. Flowchart of the proposed methodology.

Figure 2 .
Figure 2. The difference in accuracy from that of the best classifier: (a) Logistic Regression (LR); (b) Predicted (rounding = 0.01); (c) randomly selected classifiers, for each dataset in the test set.

Figure 3 .
Figure 3. Average impurity of each structural characteristic based on the accuracy of the best classifier for each dataset from the BCI2000 repository [51].

Table 1 .
Generated structural characteristics for each EEG dataset.

Table 3 .
Frequency at which each classifier was selected as the most accurate, and the average of the FLOPS across all BCI2000 EEG datasets [51].

Table 5 .
Average ranking of the algorithms, using the Friedman test.

Table 6 .
Holm's post hoc comparisons.All pairwise comparisons were significantly different except for Predicted vs. LR, where p > 0.05.

Table 7 .
The best, average, and worst accuracies of the Random Forest (RF) classifier on the classifier dataset, with rounding ranging from 0.00 to 0.04. Structural characteristics are those from Table 1.