Robust Motor Imagery Tasks Classification Approach Using Bayesian Neural Network

The development of Brain–Computer Interfaces (BCIs) based on Motor Imagery (MI) tasks is a relevant research topic worldwide. The design of accurate and reliable BCI systems remains a challenge, mainly in terms of increasing performance and usability. In this work, classifiers based on Bayesian Neural Networks are proposed by using variational inference, aiming to analyze the uncertainty during MI prediction. An adaptive threshold scheme is proposed for MI classification with a reject option, and its performance on both datasets 2a and 2b from BCI Competition IV is compared with other threshold-based approaches. The results obtained with subject-specific and non-subject-specific training strategies are encouraging. From the uncertainty analysis, considerations for reducing computational cost are proposed for future work.


Introduction
A Brain-Computer Interface (BCI) provides a direct communication pathway between the user's brain and external devices or software in order to monitor and/or repair sensorimotor and cognitive functions [1][2][3]. A BCI typically receives brain commands via electroencephalography (EEG), a non-invasive, low-cost, portable technique with high temporal resolution. Many EEG-based BCIs have been developed by using different paradigms, such as P300 event-related potentials [4,5], steady-state visual evoked potentials (SSVEP) [6,7], and sensorimotor rhythms (SMR) [8][9][10]. Systems using SMR have demonstrated promising results and great potential for training and recovering motor skills through various types of motor imagery tasks. However, the poor EEG signal-to-noise ratio and the intra- and inter-subject variability still make MI classification a challenge.
Deep learning techniques for MI classification have shown better performance than handcrafted feature extraction methods with respect to accuracy [11][12][13][14][15][16][17][18][19]. This success of deep learning also appears in other research areas, such as computer vision, image, and language processing, among others. Nevertheless, these approaches are deterministic architectures, and hence they lack strategies to measure classification trust.
The source of uncertainty in deep learning models may come from the input data because of noise and inaccurate measurement (aleatoric uncertainty). Uncertainty arising from other sources, such as the model weights themselves, is known as epistemic uncertainty. The main contributions of this work are as follows:
1. Two BNN architectures applying the variational inference method for MI classification are proposed, confirming the advantage of using MOPED's prior distribution for Bayesian models. Additionally, an efficient implementation to reduce computational cost for practical real-world applications is proposed.
2. An adaptive threshold scheme based on a closed formula for robust MI classification with a reject option is proposed.
The dataset 2b comprises EEG signals collected during the execution of two MI tasks (left hand and right hand) from nine subjects with three bipolar EEG channels (around C3, Cz, and C4), sampled at 250 Hz. Each subject completed five sessions with a total of 720 trials, comprising 120 trials per session in the first two sessions and 160 trials per session in the remaining three sessions. In our research, the first three sessions were used for training and the last two sessions for testing, as in [15][16][17][18]44,48].

Preprocessing and Data Augmentation
Previous studies [49][50][51][52][53][54] reported that real or imagined unilateral movement modulates the mu (8 to 12 Hz) and beta (13 to 30 Hz) rhythms over the primary motor cortex in both the contralateral and ipsilateral hemispheres; these phenomena are known as event-related desynchronization/synchronization (ERD/ERS) [55]. Thus, in our study the EEG signals of each trial were band-pass filtered in the 4 to 38 Hz range through a fourth-order Butterworth filter [13,14,45,47,56], aiming to preserve the ERD and ERS potentials while rejecting noise and undesirable physiological and non-physiological artifacts. Afterwards, each filtered EEG trial x was standardized as (x(t_i) − µ(t_i))/σ(t_i) by applying the electrode-wise exponential moving standardization with a decay factor of 0.999, which updates both the mean µ(t_i) and the variance σ²(t_i) at each sample t_i [45]. The starting values µ(t_0) and σ²(t_0) were calculated over the rest period preceding each trial. To rectify outliers, the EEG amplitudes of each trial were first limited to µ(t_i) ± 6σ(t_i). Finally, the trial crop strategy was employed for data augmentation: for both datasets, crops of 4 s in length were extracted every 8 ms from −0.5 to 4 s (cue onset at 0 s) over all trials. Figure 1 shows the EEG preprocessing and data augmentation.
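A minimal NumPy sketch of this standardization step, under our reading of the procedure (the function name and the clip-before-update ordering are our assumptions, not taken from the paper's code):

```python
import numpy as np

def exp_moving_standardize(x, mu0, var0, decay=0.999, eps=1e-8):
    """Electrode-wise exponential moving standardization of one EEG trial.
    x: array of shape (channels, samples); mu0, var0: starting statistics
    per channel, e.g., computed over the pre-trial rest period."""
    out = np.empty(x.shape, dtype=float)
    mu = np.asarray(mu0, dtype=float).copy()
    var = np.asarray(var0, dtype=float).copy()
    for i in range(x.shape[1]):
        # limit outliers to mu +/- 6*sigma before updating the statistics
        sd = np.sqrt(var)
        xi = np.clip(x[:, i], mu - 6 * sd, mu + 6 * sd)
        mu = decay * mu + (1 - decay) * xi
        var = decay * var + (1 - decay) * (xi - mu) ** 2
        out[:, i] = (xi - mu) / np.maximum(np.sqrt(var), eps)
    return out
```

A decay of 0.999 at 250 Hz gives the running statistics a time constant of roughly 4 s, so the standardization adapts slowly within a trial.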

Baseline Architecture
A shallow architecture based on a Convolutional Neural Network (SCNN) [57] was used in our work as a deterministic baseline network, as shown in Figure 2. This architecture contains two convolutional layers and a dense classification layer. The input is a tensor of size 1000 × c × 1, where c is the number of EEG channels. The first layer applies a temporal convolution with a 45 × 1 filter and 40 channels, giving an output tensor of 478 × c × 40 after down-sampling with a stride of 2. The second convolutional layer performs a spatial convolution with 40 channels and a 1 × c filter, which convolves across all EEG channels. Following the spatial convolution, the model executes a sequence of transformations: a batch normalization layer, a square non-linearity, an average pooling layer with 45 × 1 sliding windows, a max-pooling layer with 8 × 1 pool size and 8 × 1 stride, and a logarithmic activation function. Lastly, the classification layer, a dense layer with the Softmax activation function, receives 2160 features and translates them into an s × 1 prediction vector, s being the number of classes. Although dropout layers and the "MaxNorm" regularization were previously used in the baseline architecture [57] to avoid overfitting, we did not use them in the BNN models, which are already more robust and less prone to overfitting. Furthermore, the "Early Stopping" and learning-rate decay techniques were used, together with the Adam optimizer and the Categorical Cross-Entropy cost function.
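The 2160-feature count follows from the stated filter sizes and strides, assuming "valid" padding and a stride of 1 for the average pooling (our assumptions); a quick arithmetic check:

```python
def out_len(n, k, s=1):
    """Length after a 1-D 'valid' convolution or pooling with filter k, stride s."""
    return (n - k) // s + 1

t = 1000                  # input samples per crop (4 s at 250 Hz)
t = out_len(t, 45, s=2)   # temporal conv, 45x1 filter, stride 2 -> 478
# the spatial 1xc conv collapses the channel axis; temporal length unchanged
t = out_len(t, 45, s=1)   # average pooling, 45x1 sliding window -> 434
t = out_len(t, 8, s=8)    # max pooling, 8x1 pool, 8x1 stride -> 54
features = t * 40         # 40 feature maps flattened for the dense layer
print(t, features)        # -> 54 2160
```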


Bayesian Neural Networks
BNNs [58] provide a probabilistic interpretation of deep learning models by placing distributions over their weights. In BNNs, the weights are no longer considered as fixed values, but rather as random variables sampled according to a distribution whose parameters are learned in the training stage. This makes each prediction different for the same input, and for many forward passes, the average behavior produces relevant results. Notably, the variability of these predictions also allows assessment of the model's confidence.
Let D = {X, Y} be a training set with inputs X = {x_1, · · ·, x_n} and expected outputs or targets Y = {y_1, · · ·, y_n}, where x_i ∈ R^d and y_i ∈ R^C, C being the number of classes. Bayesian modeling aims to approximate the posterior distribution p(w|D) of the weights w that generated the outputs Y. Following this approach, a prior distribution p(w) that represents the initial beliefs about the weights has to be established.
If p(w|D) is known, then for a given input x̂, the predictive distribution p(ŷ|x̂, D) can be determined from the outputs p(ŷ|x̂, w) corresponding to a specific set of weights w as follows:

p(ŷ|x̂, D) = ∫ p(ŷ|x̂, w) p(w|D) dw.    (1)

The integral in (1) can be estimated by using the Monte Carlo method. For this, it is necessary to perform T evaluations of the neural network on the same input x̂, with weights w_t sampled from the posterior distribution p(w|D). As a result, instead of a single output, we obtain T outputs from the model, {ŷ_t; 1 ≤ t ≤ T}. According to [21], T = 50 is a good balance in terms of complexity and accuracy. The set {ŷ_t} can be interpreted as a sample of the predictive distribution. The final prediction of the BNN on the input x̂ is the sample mean of the predictive distribution:

ŷ* = (1/T) Σ_{t=1}^{T} ŷ_t.    (2)

In neural networks, the uncertainty measures how reliable a model is when making predictions. The statistical dispersion of the predictive distribution p(ŷ|x̂, D) is one indicator of uncertainty: a "large" dispersion in p(ŷ|x̂, D) means that the outputs vary considerably across forward passes, i.e., the prediction is unreliable. To summarize, if the posterior distribution is known, it is possible to obtain, for each input x̂, an approximation of the predictive distribution p(ŷ|x̂, D). This then allows us to estimate the BNN prediction and an uncertainty measure. However, finding and sampling the posterior distribution for complex models, such as neural networks, is a computationally intractable problem because of their high dimensionality and non-convexity. To address this issue, two popular approaches have been introduced previously in other studies: (1) variational inference [33] and (2) Monte Carlo dropout [21]. The next section describes only the former, which was used in our approach.
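The Monte Carlo estimate of (1) and (2) can be sketched in a few lines of NumPy; the toy `forward` function below merely stands in for a network whose weights are resampled on every call (all names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bnn_predict(stochastic_forward, x, T=50):
    """Average T stochastic forward passes (weights resampled on each call)
    to estimate the predictive distribution and the final prediction."""
    samples = np.stack([stochastic_forward(x) for _ in range(T)])  # (T, C)
    y_star = samples.mean(axis=0)  # sample mean = BNN prediction, Eq. (2)
    return y_star, samples

# toy stand-in: logits perturbed by "weight noise" on each forward pass
rng = np.random.default_rng(0)
forward = lambda x: softmax(x @ np.array([[2.0, -1.0], [0.5, 1.5]])
                            + rng.normal(scale=0.3, size=2))
y_star, samples = bnn_predict(forward, np.array([1.0, 0.0]), T=50)
```

The spread of `samples` around `y_star` is exactly the dispersion that the uncertainty measures of the following sections quantify.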

Variational Inference
In variational inference, rather than sampling from the exact posterior distribution p(w|D), a variational distribution q(w, θ) is used, parametrized by a vector θ. The θ values are then learned in such a way that q(w, θ) is as close as possible to p(w|D) in a Kullback-Leibler (KL) divergence sense. It is known that minimizing the KL divergence between q(w, θ) and p(w|D) is equivalent to maximizing the evidence lower bound (ELBO) function, denoted as L(θ), which serves as the objective function to train the model:

L(θ) = E_{q(w,θ)}[log p(Y|X, w)] − KL(q(w, θ) ∥ p(w)).    (3)

Maximizing the first term in (3) encourages q(w, θ) to explain the data well, while minimizing the second term encourages q(w, θ) to stay close to the prior p(w).
Although choosing the variational family q(w, θ) is complex in general, adopting an independent Gaussian distribution w ∼ N(µ, σ²) for each weight is a simple and common approach that often works well. With respect to the prior distribution, a standard normal distribution w ∼ N(0, 1) is commonly used. In contrast, the MOPED method [38] proposes to specify the prior distributions p(w) from the deterministic weights w_d of a pretrained DNN with the same architecture.
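The MOPED rule for building a per-weight Gaussian prior from pretrained deterministic weights can be sketched as follows (the `min_scale` floor is our addition to avoid degenerate zero-scale priors for weights that are exactly zero):

```python
import numpy as np

def moped_prior(w_d, delta=0.1, min_scale=1e-6):
    """Per-weight Gaussian prior N(mu, sigma^2) built from pretrained
    deterministic weights w_d: mu = w_d, sigma = delta * |w_d|."""
    mu = np.asarray(w_d, dtype=float)
    sigma = np.maximum(delta * np.abs(mu), min_scale)  # floor zero scales
    return mu, sigma

# example: three pretrained weights, one of them exactly zero
mu, sigma = moped_prior([0.5, -2.0, 0.0])
```

With delta = 0.1 this matches the σ = 0.1|w_d| initialization used for SCBNN-MOPED below.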

Bayesian Neural Models for Uncertainty Estimation and MI Classification
In this work, two BNN architectures (SCBNN and SCBNN-MOPED) derived from the SCNN baseline are proposed. They differ only in the way the prior distribution of the weights is established: the former initializes the prior distribution of the weights as a standard Gaussian distribution N(0, 1), whereas the latter uses the MOPED method. For SCBNN-MOPED, the prior distribution of each weight w is initialized as an independent Gaussian distribution with the mean µ taken from the pretrained weights w_d of the SCNN architecture and a scale σ = 0.1|w_d|.
Both models were obtained by starting from the deterministic SCNN network and translating it into a Bayesian network by using the Flipout estimator [59] from TensorFlow. We preferred the Flipout estimator over the reparameterization estimator [60] because it offers significantly lower gradient variance, albeit at roughly twice the number of floating-point operations. Table 1 shows the main differences between SCNN and the two proposed Bayesian neural architectures, following the terminology of the Python "TensorFlow Probability" library for easier comparison [33].

Uncertainty Measures
Several uncertainty measures can be used to quantify model uncertainty [33,39,40,61]. Three well-known metrics are used in this work: predictive entropy (H), mutual information (I), and margin of confidence (M).
Let C be the total number of classes and ŷ_t = (ŷ_1t, ŷ_2t, · · ·, ŷ_Ct) the model output for a given input x̂ in a stochastic forward pass t, with ŷ* = (1/T) Σ_{t=1}^{T} ŷ_t the averaged prediction.
Predictive Entropy H. This metric captures the average amount of information contained in the predictive distribution:

H = − Σ_{j=1}^{C} ŷ*_j log_2 ŷ*_j.    (4)

It reaches its maximum value (log_2 C) when all classes have equal probability (in other words, maximum uncertainty), and it reaches zero (the minimum) when a unique class has probability equal to 1, indicating a certain prediction.
To facilitate the comparison across various datasets, the predictive entropy was normalized as follows:

H_n = H / log_2 C.    (5)

Mutual Information I. This metric measures the epistemic uncertainty by capturing the disagreement between the individual stochastic predictions and their average:

I = H + (1/T) Σ_{t=1}^{T} Σ_{j=1}^{C} ŷ_jt log_2 ŷ_jt.

Margin of Confidence M. Let c = argmax_j ŷ*_j be the predicted class. The most intuitive way to measure uncertainty is to analyze the difference between the two predictions of highest confidence. For this, in each forward pass t, the difference d_t = ŷ_ct − max_{j≠c} ŷ_jt between the predicted-class output and the next highest score is computed. These differences are then averaged as follows:

M = (1/T) Σ_{t=1}^{T} d_t.    (6)
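All three measures can be computed from the T × C matrix of stochastic softmax outputs; the following NumPy sketch is our own implementation of these standard formulas (`uncertainty_measures` is an illustrative name, not from the paper's code):

```python
import numpy as np

def uncertainty_measures(Y, eps=1e-12):
    """Y: (T, C) array of softmax outputs from T stochastic forward passes.
    Returns normalized predictive entropy H_n, mutual information I,
    and margin of confidence M."""
    T, C = Y.shape
    y_star = Y.mean(axis=0)                             # averaged prediction
    H = -(y_star * np.log2(y_star + eps)).sum()         # predictive entropy
    H_n = H / np.log2(C)                                # normalized to [0, 1]
    exp_H = -(Y * np.log2(Y + eps)).sum(axis=1).mean()  # mean per-pass entropy
    I = H - exp_H                                       # mutual information
    c = y_star.argmax()                                 # predicted class
    runner_up = np.delete(Y, c, axis=1).max(axis=1)     # next best, per pass
    M = (Y[:, c] - runner_up).mean()                    # margin of confidence
    return H_n, I, M
```

For a perfectly uniform predictive distribution, H_n is 1 while I and M are 0; for a confident, consistent prediction, H_n and I shrink while M approaches 1.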

Classification with a Reject Option
The uncertainty values, calculated by using any of the aforementioned measures, provide classifiers with the ability to accept or reject inputs. For instance, a high uncertainty means that the classifier may be performing random predictions; therefore, the associated input should be rejected. This type of classification is known as classification with a reject option [36,62], which is of great importance for applications that require low error tolerance. Classification with a reject option generalizes the decision problem of class prediction, additionally determining whether the prediction is reliable (accept) or unreliable (reject).
The process of refraining from providing a response, or discarding an input, when the system does not have enough confidence in its prediction is a topic of interest that has been addressed over the last 60 years [63]. Recent methods [39][40][41] implement rejection by comparing an uncertainty metric against an established threshold Th, rejecting inputs whose uncertainty exceeds Th.
The criteria used in [39] incorporate the ground-truth label, the model prediction, and the uncertainty value to evaluate the selected threshold. In this approach, the map of correct and incorrect values is computed by matching the ground-truth labels with the model's predictions. Then, given a threshold Th, the predictions are classified as certain and uncertain, providing four combinations to predict an input as follows: incorrect-uncertain (iu), correct-uncertain (cu), correct-certain (cc), and incorrect-certain (ic).
Let N_iu, N_cu, N_cc, and N_ic be the total number of predictions of each type, and let N be the total number of predictions. The proportion of certain predictions with respect to the total number of predictions is

R_c = (N_cc + N_ic) / N.    (7)

Then, the following criteria measure the effectiveness of an uncertainty estimator and a threshold selector [39]. The correct-certain ratio R_cc measures the ability of a classifier to accurately classify non-rejected samples:

R_cc = N_cc / (N_cc + N_ic).    (8)

The overall accuracy of the uncertainty estimation can be measured through the Uncertainty Accuracy (UA), which indicates the ability of a classifier to both accurately classify non-rejected samples and reject misclassified samples:

UA(Th) = (N_cc + N_iu) / N.    (9)

UA(Th) penalizes both the incorrect-certain and correct-uncertain predictions, and higher values suggest better thresholds. Thus, this criterion can also be used to compare different threshold values, in order to increase reliability, effectiveness, and feasibility for practical applications.
In this study, the correct-uncertain ratio R_cu was proposed to calculate the accuracy of the rejected predictions:

R_cu = N_cu / (N_cu + N_iu).    (10)

This criterion was incorporated with the objective of evaluating how close the accuracy on rejected predictions is to the hypothetical accuracy that would be obtained if the classifier provided random predictions.
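Assuming the standard definitions of these ratios, all four criteria can be computed from two boolean masks over the predictions; a hypothetical sketch:

```python
import numpy as np

def rejection_criteria(correct, uncertain):
    """correct, uncertain: boolean arrays over all N predictions.
    Returns R_c, R_cc, UA, and R_cu as defined in Equations (7)-(10)."""
    correct = np.asarray(correct, dtype=bool)
    uncertain = np.asarray(uncertain, dtype=bool)
    N = correct.size
    N_cc = np.sum(correct & ~uncertain)   # correct-certain
    N_ic = np.sum(~correct & ~uncertain)  # incorrect-certain
    N_cu = np.sum(correct & uncertain)    # correct-uncertain
    N_iu = np.sum(~correct & uncertain)   # incorrect-uncertain
    R_c = (N_cc + N_ic) / N
    R_cc = N_cc / max(N_cc + N_ic, 1)     # guard against empty partitions
    UA = (N_cc + N_iu) / N
    R_cu = N_cu / max(N_cu + N_iu, 1)
    return R_c, R_cc, UA, R_cu
```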

Adaptive Uncertainty Threshold
To select the cut-off threshold, recent works [39][40][41] evaluate all threshold values for a given uncertainty measure, such as the predictive entropy: the performance of each candidate threshold is measured with a quality criterion such as UA, and the best threshold is kept. This strategy has a high computational cost because of its exhaustive search. As a contribution, we introduce a novel low-computational-cost threshold scheme based on a closed formula rather than an exhaustive search.
Our threshold scheme uses Equation (6) to compute the margin of confidence M, which takes values close to zero when the prediction has high uncertainty. It is worth mentioning that if M is not significantly greater than zero, there is no significant difference between the two highest-confidence predictions of the model. Therefore, we reduce the problem of labeling the prediction on an input x̂ as uncertain to a hypothesis test for the mean µ_d of the differences d_t:

H_0: µ_d = 0 versus H_1: µ_d > 0, with test statistic ζ = M √T / σ_d,

where σ_d is the standard deviation of the sample {d_t}_{t=1}^{T}, and T is the number of forward passes over each input x̂. It is worth remarking that the rejection region for the null hypothesis is the right tail of the distribution.
Since M is the sample mean of T = 50 random values {d_t}_{t=1}^{T}, M approximately follows a normal distribution with variance σ_d²/T. Therefore, we assume that the random variable ζ follows a standard normal distribution. The quantile z_{1−α} = Φ^{−1}(1 − α) is used as the critical value of the test, where Φ is the cumulative distribution function (CDF) of the standard normal distribution and 1 − α is the confidence level. The null hypothesis is rejected for ζ > z_{1−α}; thus, if the prediction is certain, M must be greater than σ_d z_{1−α}/√T. This way, the proposed adaptive threshold is calculated as follows:

T_M = σ_d z_{1−α} / √T.

Our adaptive threshold scheme deserves four observations. First, it is statistically grounded in a closed formula. Second, the threshold is not fixed; it varies with the sample variance σ_d² of {d_t}_{t=1}^{T}; in other words, this novel threshold adapts to the characteristics of the predictive distribution of each input x̂. Third, our scheme is conservative, as it does not classify as uncertain those inputs whose predictions are consistent across forward passes (σ_d ≈ 0); however, increasing the confidence level 1 − α makes the threshold scheme behave less conservatively. Fourth, our threshold scheme can be considered universal, as it does not require prior knowledge about the data, depending exclusively on the predictive distribution.
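The closed formula needs only the Python standard library; `adaptive_threshold` and `is_uncertain` are illustrative names for a sketch of this decision rule:

```python
from statistics import NormalDist, mean, stdev

def adaptive_threshold(d, alpha=0.05):
    """Closed-formula threshold T_M = sigma_d * z_{1-alpha} / sqrt(T),
    where d holds the per-pass margin differences d_t."""
    T = len(d)
    z = NormalDist().inv_cdf(1 - alpha)  # quantile of the standard normal
    return stdev(d) * z / T ** 0.5

def is_uncertain(d, alpha=0.05):
    """Label the prediction as uncertain when M does not exceed T_M,
    i.e., when the one-sided test fails to reject H0: mu_d = 0."""
    return mean(d) <= adaptive_threshold(d, alpha)
```

Consistently large margins are accepted, while margins fluctuating around zero are rejected, exactly the conservative behavior described above.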

Experiments
The proposed methods were evaluated with three experiments, using subject-specific and non-subject-specific classification strategies.
The first experiment analyzed the accuracy of the Bayesian Neural Network models SCBNN and SCBNN-MOPED for MI classification, compared with their deterministic counterpart SCNN [57].
The second experiment assessed the quality of the proposed adaptive threshold scheme within the SCBNN-MOPED model. The quality was measured by splitting the testing set into accepted and rejected partitions and measuring how well each partition could be classified. Ideally, the rejected partition should be hard to classify, with accuracy not significantly better than random guessing.
The last experiment evaluated the proposed threshold scheme against the benchmark in [39], which seeks the optimum threshold via exhaustive search. The aim was to demonstrate that our proposed adaptive threshold can achieve state-of-the-art performance without brute force.
All experiments were designed to meet the requirements of the BCI Competition IV datasets, never using the testing sets during training or validation. Figure 3 shows the difference between the subject-specific and non-subject-specific strategies: the former restricts itself to the training data of each subject (see Figure 3a), whereas the latter calibrates the classifier by using the training data from all subjects (see Figure 3b).
In all experiments, a repeated holdout validation over the same testing set was carried out, considering both subject-specific and non-subject-specific classification strategies. For each repetition, a subset V_i of the training set T was randomly picked, guaranteeing class ratios in V_i identical to those in T. The subset V_i was used for validation purposes, which is necessary to build deep learning models. For training, the remaining set T_i = T \ V_i was used. Once the model was trained for each pair T_i and V_i, it was evaluated on the testing set E. To reduce the influence of random factors, this process was repeated 16 times for each model, reporting the average of metrics such as accuracy and R_cc.
Specifically, the accuracy was used for evaluation in the first experiment, whereas the criteria R_c, R_cc, and R_cu (see Equations (7), (8), and (10)) were used in the second experiment, assessing the quality of the accepted and rejected partitions of the testing set. As in [39], we used UA (Equation (9)) as the evaluation criterion in the third experiment.
For the third experiment, the method in [39] was replicated by using the SCBNN-MOPED architecture and the normalized predictive entropy H n as the uncertainty measure. As in [39][40][41], the optimum threshold was obtained by making the uncertainty threshold Th vary over the interval [0.05, 1] and calculating U A over the validation set V i for each Th. Once the optimal threshold was found, its performance was evaluated with respect to the criterion U A by using the testing set E . Similar to the other experiments, 16 validation sets (V 1 · · · , V 16 ) were used to compute the overall assessment of the method in [39].
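For contrast with the closed formula, the benchmark's exhaustive search over thresholds can be sketched as follows (the grid spacing and function name are our assumptions):

```python
import numpy as np

def best_threshold(uncertainty, correct, grid=np.arange(0.05, 1.0001, 0.05)):
    """Exhaustive search for the threshold maximizing UA on a validation set.
    uncertainty: normalized predictive entropies H_n per prediction;
    correct: boolean array marking correct predictions."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    best_th, best_ua = None, -1.0
    for th in grid:
        uncertain = uncertainty > th  # reject predictions above the threshold
        ua = (np.sum(correct & ~uncertain)
              + np.sum(~correct & uncertain)) / correct.size
        if ua > best_ua:
            best_th, best_ua = float(th), ua
    return best_th, best_ua
```

The cost grows linearly with the grid size and must be paid per validation set, which is precisely what the closed-formula threshold avoids.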
A Dell PowerEdge R720 HPC server with four 2.40 GHz Intel Xeon processors and 48 GB of RAM was used for all experiments. An NVIDIA GM107L Tesla M10 GPU with 32 GB of memory was used to train and test the Bayesian models by using the Python TensorFlow 2.3.0 framework.

MI Classification Using SCBNN and SCBNN-MOPED Models
Tables 2-5 show the results of the subject-specific and non-subject-specific strategies on datasets 2a and 2b, where statistical significance was assessed through the Wilcoxon signed-rank test at a confidence level 1 − α of 0.95. For both datasets, the SCBNN-MOPED architecture achieved significantly better results than SCBNN for almost all subjects. For dataset 2a, the accuracy improvement was greater than 22% and 19% (on average) for the subject-specific and non-subject-specific strategies, respectively. In general, our results are consistent with the findings in [38], showing that the MOPED method helps training convergence and improves the accuracy of Bayesian neural models.
The SCBNN-MOPED model also achieved better mean accuracy than the deterministic SCNN model, except for the subject-specific classification in dataset 2a. Tables 4 and 5 show that, for non-subject-specific classification, SCBNN-MOPED significantly outperformed the SCNN model with respect to overall accuracy, and for almost all subjects.
Because the SCBNN-MOPED architecture achieved superior performance, the subsequent experiments were carried out with it, discarding the SCBNN architecture. Figures 4 and 5 show the performance of the proposed adaptive threshold for MI classification with a reject option under the subject-specific and non-subject-specific strategies, respectively. A larger number of predictions were rejected in dataset 2a, independently of the training strategy. As expected from an adaptive and effective threshold, the accuracy achieved on the accepted predictions, measured by the R_cc criterion, was higher than the accuracy on the full testing set without rejection (compare with Tables 2-5). This supports that, by rejecting uncertain predictions, classification with a reject option can improve both accuracy and reliability, which is of great importance for real-life applications.

Classification with Reject Option Using SCBNN-MOPED Architecture
The R_cu criterion is another way to measure the quality of our proposed threshold scheme, looking into the accuracy obtained over the rejected predictions. For dataset 2b and both training strategies, an R_cu of 50% was achieved, showing that the classification accuracy over the rejected predictions did not differ from random guessing (50% for two classes), which confirms the expected behavior of the proposed threshold scheme. For dataset 2a, an R_cu of 40% was obtained, whereas a random classifier would obtain 25% (four classes).
It is worth noting that our approach based on an adaptive threshold is conservative, achieving a significantly low percentage of rejected predictions in comparison to the incorrect predictions. For instance, 33.23% of inputs were incorrectly classified in dataset 2a with the non-subject-specific strategy, while only 1.22% were flagged as uncertain by our adaptive threshold. However, by increasing the confidence level 1 − α, the number of rejected predictions can be increased. This strategy has limitations, though, as our proposed threshold scheme only considers epistemic uncertainty and may overlook other sources of uncertainty. Tables 6 and 7 show the UA values obtained by our proposed adaptive threshold (UA(T_M)), as well as the maximum or optimal values reached by [39] via exhaustive search over the normalized predictive entropy (UA(T_H)).

Comparison with Other Methods Based on Thresholds
Recall that the UA criterion makes it possible to evaluate different cut-off thresholds. By inspecting the values in Tables 6 and 7, we conclude that our proposed threshold scheme achieves state-of-the-art performance. Moreover, because it relies exclusively on the predictive statistical distribution, the scheme is universal and independent of the nature of the data.
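As a reference point for the T_H-based comparison above, the normalized predictive entropy can be sketched as follows. This assumes the standard definition, the entropy of the MC-averaged distribution divided by the log of the number of classes, so the value lies in [0, 1]; the function name is illustrative.

```python
import numpy as np

def normalized_predictive_entropy(mc_probs):
    """Normalized predictive entropy of the MC-averaged distribution.

    mc_probs: array of shape (T, n_classes) with softmax samples from
    T stochastic forward passes. Dividing by log(n_classes) maps the
    entropy to [0, 1], with 1 meaning a uniform (maximally uncertain)
    predictive distribution.
    """
    p = mc_probs.mean(axis=0)
    p = np.clip(p, 1e-12, 1.0)          # avoid log(0)
    entropy = -(p * np.log(p)).sum()
    return float(entropy / np.log(p.size))
```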

Considerations for Reducing Computational Cost
All evaluations of classification accuracy and uncertainty analysis throughout this work were carried out using data augmentation, as shown in Figure 1. The accuracy was obtained as the ratio between correctly classified crops and the total number of crops, as proposed in the CAgross method [19]. Other studies followed different strategies. For instance, a single prediction for each trial was obtained in the CAST method [19] by using majority voting, whereas in [45] the authors averaged the predictions over all crops. This crop-level analysis requires an extra computational cost, which increases even more with Bayesian architectures, since T forward passes are performed to obtain the prediction for the same input.
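The two trial-level aggregation strategies mentioned above, majority voting over per-crop labels (CAST-style) and averaging per-crop probabilities, can be sketched as follows. The function names are hypothetical; only the aggregation logic is taken from the cited descriptions.

```python
import numpy as np
from collections import Counter

def majority_vote(crop_preds):
    """Trial label from per-crop class predictions (voting, CAST-style)."""
    return Counter(crop_preds).most_common(1)[0][0]

def average_probs(crop_probs):
    """Trial label from per-crop probabilities averaged before argmax."""
    return int(np.argmax(np.mean(crop_probs, axis=0)))
```

Averaging the probabilities preserves how confident each crop was, whereas voting weighs every crop equally regardless of its certainty; the uncertainty analysis below suggests the crops are in fact not equally reliable.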
Given the nature of the problem addressed in our study, the variability of the EEG patterns associated with MI tasks and the large intra- and inter-subject differences can be observed through the ERD/ERS potentials throughout each trial. As a result, in certain crops, these phenomena were not detected appropriately by the model, producing incorrect predictions. Figure 6a shows the predictive distribution histograms of the crops over a specific trial of Subject A01 from dataset 2a, where the initial crops were correctly classified as left-hand MI, but after approximately crop number 45 they were incorrectly predicted as right-hand MI. Figure 6b shows that this change correlated with an increase in uncertainty. Given this behavior, we believe that considering the predictions over all crops to predict the MI of each trial is not the most appropriate strategy, since not all crops have the same certainty; it is preferable to consider the crops with the least uncertainty. We hypothesize that if the predictions with less uncertainty fall within the same time interval for all trials of the validation set, similar behavior may also be obtained in the testing set. This allows us to select only the crops contained in that interval for classification with/without a reject option, reducing the computational cost while maintaining high accuracy. Figure 7 shows the normalized predictive entropy on a validation set from dataset 2a. We observed that the central crops, covering a time interval from 1.724 to 5.756 s, regularly presented the lowest uncertainty with respect to the crops located at the extremes; this finding was also observed in dataset 2b. Accordingly, we selected the five central crops of each trial to reduce the computational cost during the evaluation step, and averaged the predictions of these five crops to classify each trial.
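A minimal sketch of this crop-selection idea, assuming the per-crop normalized predictive entropy has already been computed on the validation set; the function name, shapes, and the choice of keeping the five lowest-entropy crops are illustrative.

```python
import numpy as np

def select_low_uncertainty_crops(val_entropy, n_keep=5):
    """Pick the crop indices with the lowest mean validation entropy.

    val_entropy: array of shape (n_trials, n_crops) holding the
    normalized predictive entropy of each crop, computed on the
    validation set. Returns n_keep crop indices (sorted), which are
    then reused unchanged at test time.
    """
    mean_entropy = val_entropy.mean(axis=0)     # average over trials
    keep = np.argsort(mean_entropy)[:n_keep]    # lowest-uncertainty crops
    return np.sort(keep)
```

At evaluation time, only these crop indices are passed through the T stochastic forward passes and their predictions averaged, instead of processing every crop of the trial.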
Table 8 shows the classification accuracy without/with a reject option achieved on the selected crops of each trial (from the testing set), using the SCBNN-MOPED architecture and the subject-specific strategy on each database. For a better assessment of the established criterion, we also included the results from Tables 2 and 3, obtained using all crops. The results show that the established criterion to reduce the computational cost by using the five central crops provides advantages during MI classification without/with a reject option. This proposal increases the feasibility of real implementations using Bayesian classifiers for MI tasks. However, for practical implementation, other criteria must be considered, since neural network models require large hardware resources. This analysis is outside the scope of the current work.

Conclusions
The advantage of using the MOPED method to improve MI classification with BNNs has been verified in this work. The uncertainty quantification of the predictions via the margin of confidence provided valuable information that enhanced the MI classification. In addition, the proposed adaptive threshold scheme allowed us to obtain a more robust, effective, and reliable classifier. The proposed scheme, formulated as a closed-form equation, is based exclusively on the predictive statistical distribution, which makes it a universal method that can be extended to other classification problems. Finally, a criterion to reduce the computational cost based on the uncertainty analysis was proposed, which increases the feasibility of practical implementations of Bayesian classifiers in MI-based BCIs.