2.3. Architecture
A shallow architecture that performs temporal and spatial convolutions is used here. The temporal convolutional layer, with a 45 × 1 filter and 40 channels, has input tensors of size 1000 × 22 × 1 and output tensors of size 478 × 22 × 40 when using dataset 2a, while input tensors of size 1000 × 3 × 1 and output tensors of size 478 × 3 × 40 are used for dataset 2b. Then, downsampling from 250 to 125 Hz is performed by employing a stride of 2. The spatial convolutional layer is composed of 40 channels and a 1 × 22 filter when using dataset 2a and a 1 × 3 filter when using dataset 2b. After the temporal convolution and the spatial filter, a squaring nonlinearity, an average pooling layer with 45 × 1 sliding windows, a max-pooling layer with an 8 × 1 stride, and a logarithmic activation function are applied. Together, these steps are analogous to the trial log-variance computation widely used in Filter Bank Common Spatial Patterns (FBCSP) [30,44]. The use of quadratic activation functions, or even higher-order polynomials, is not new in neural network research [45], and to the best of our knowledge, it was first used in BCI applications by Schirrmeister [31]. The classification layer is a dense layer with a Softmax activation function that receives a total of 2160 features. To avoid overfitting, batch normalization and dropout layers are used, and "MaxNorm" regularization is further applied in both the convolutional and dense layers. Moreover, the "Early Stopping" method and learning-rate decay are also employed, together with the Adam optimizer [46] and the Categorical Cross-Entropy cost function. As a result, the proposed architecture contains a total of 45,804 weights for dataset 2a and 11,082 weights for dataset 2b.
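As a quick sanity check on the tensor sizes above, the standard "valid" convolution size formula reproduces the stated dimensions. This is an illustrative sketch, not the authors' code:

```python
def out_len(n, kernel, stride=1):
    """Output length of a 'valid' convolution or pooling along one axis."""
    return (n - kernel) // stride + 1

# Temporal convolution: 45 x 1 filter with stride 2 (250 Hz -> 125 Hz downsampling)
time_out = out_len(1000, 45, stride=2)   # 1000 time samples -> 478

# Spatial convolution: 1 x 22 filter collapses the 22 EEG channels of dataset 2a
chan_out = out_len(22, 22)               # 22 channels -> 1

print(time_out, chan_out)  # 478 1
```

The same formula with a 1 × 3 spatial filter yields the dataset 2b shapes.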
The neural network shown is totally deterministic and does not permit broader reasoning about uncertainty. To estimate the uncertainty, the Monte Carlo dropout described in the next section was used.
2.4. Monte Carlo Dropout
The dropout technique is commonly used to reduce model complexity and to avoid overfitting [24]. During training, a dropout layer multiplies the output of each neuron by a binary mask drawn from a Bernoulli distribution, randomly setting some neurons of the network to zero. The non-dropped trained network is then used at test time. Gal and Ghahramani [23] demonstrated that dropout applied at test time is an approximation of probabilistic Bayesian inference in deep Gaussian processes. Monte Carlo dropout (MCD) quantifies the uncertainty of the network outputs from its predictive distribution by sampling new dropout masks for each forward pass. As a result, instead of a single model output, T model outputs {p_t; 1 ≤ t ≤ T} are obtained for each input sample x. The set {p_t} can then be interpreted as samples from the predictive distribution, which is useful for extracting information about the prediction's variability. This information is valuable for making decisions: quantifying the uncertainty of the model may allow uncertain inputs to be treated differently.
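The sampling procedure can be sketched as follows, with a toy stand-in for a dropout network (in practice, `model` would be the SCNN with dropout kept active at test time):

```python
import random

def mc_dropout_predict(model, x, T=100):
    """Collect T stochastic forward passes; `model` must apply a fresh
    dropout mask on every call and return a softmax vector (list of floats)."""
    return [model(x) for _ in range(T)]

def mean_prediction(samples):
    """Average the T sampled outputs into the MCD estimate."""
    T, C = len(samples), len(samples[0])
    return [sum(p[c] for p in samples) / T for c in range(C)]

# Toy stand-in for a dropout network: softmax-like output jittered per pass
def toy_model(x):
    raw = [v + random.uniform(0.0, 0.1) for v in x]
    z = sum(raw)
    return [v / z for v in raw]

random.seed(0)
samples = mc_dropout_predict(toy_model, [0.7, 0.2, 0.1], T=100)
p_bar = mean_prediction(samples)   # sums to 1; class 0 remains dominant
```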
The main drawback of MCD is its computational complexity, which can be proportional to the number of forward passes T. As an alternative, the forward passes can run concurrently, resulting in a constant running time. Moreover, if the dropout layers are located near the network output, as in the SCNN model (see Figure 2), the input of the first dropout layer can be saved in the first pass and reused in the remaining passes, avoiding redundant computation [26]. Consequently, the computational complexity of MCD can be significantly reduced, enabling real-time applications.
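The caching idea can be sketched as follows (toy callables with hypothetical names): the deterministic feature extractor runs once, and only the cheap stochastic head is repeated T times:

```python
import random

def mcd_cached(extractor, head, x, T=50):
    """Run the deterministic layers once; repeat only the dropout head."""
    h = extractor(x)                  # computed a single time and reused
    return [head(h) for _ in range(T)]

# Toy layers: the head applies a fresh dropout mask on every call
extractor = lambda x: [2.0 * v for v in x]

def head(h, rate=0.5):
    kept = [v if random.random() >= rate else 0.0 for v in h]
    z = sum(kept) or 1.0              # avoid division by zero if all dropped
    return [v / z for v in kept]

random.seed(1)
outs = mcd_cached(extractor, head, [0.5, 0.3, 0.2], T=10)
```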
The MCD model estimate can be computed as the average of the T predictions:

\bar{p} = \frac{1}{T} \sum_{t=1}^{T} p_t
According to [23], a sufficiently large number of samples T is a safe choice to estimate the uncertainty, but this value must also be evaluated considering the predictive performance of MCD. In our study employing the SCNN architecture (see Figure 2), the performance obtained by applying MCD with different numbers of samples T was analyzed for each subject in both datasets 2a and 2b. Figure 3 shows the accuracy improvement (ΔACC) over the baseline. We observed that, in general, the accuracy of SCNN-MCD improves as T increases, reaching an evident stabilization beyond a certain number of samples; that value of T was therefore adopted in SCNN-MCD.
The Monte Carlo dropout can be seen as a particular case of Deep Ensembles (training multiple similar networks and sampling predictions from each), which is another alternative to improve the performance of deep learning models and estimate uncertainty. A brief description of Deep Ensembles is presented in the next section.
2.6. Uncertainty Analysis and Prediction Performance
The uncertainty in neural networks measures how reliably a model makes predictions. Several measures can be used to quantify model uncertainty [48,49]. For a better understanding, we first present five well-known metrics: the variation ratio (VR), the predictive entropy (H), the mutual information (I), the total variance (TV), and the margin of confidence (M). The following descriptions assume the aforementioned predictive distribution obtained from the T stochastic forward passes.
Let C be the total number of classes for classification, and p_t = (p_{t,1}, \ldots, p_{t,C}) the model output for forward pass t, 1 ≤ t ≤ T. If the last layer of the model is a softmax, the sum of all outputs is equal to 1. Let \bar{p} = \frac{1}{T}\sum_{t=1}^{T} p_t be the average of the predictions.
Variation ratio (VR). This measures the dispersion, i.e., how spread the distribution is around the mode:

VR = 1 - \frac{f_{c^*}}{T},

where f_{c^*} is the frequency of the mode c^* of the discrete distribution {y_t; 1 ≤ t ≤ T} and y_t = \arg\max_c p_{t,c} is the predicted class in each stochastic forward pass. Notice that 0 ≤ VR ≤ 1 - 1/C, and it reaches its minimum and maximum values for f_{c^*} closest to T and T/C, respectively.
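A minimal implementation of the variation ratio over the sampled softmax outputs (pure Python, assuming each sample is a list of class probabilities):

```python
from collections import Counter

def variation_ratio(samples):
    """VR = 1 - f/T, with f the frequency of the modal predicted class."""
    preds = [p.index(max(p)) for p in samples]        # argmax per pass
    f = Counter(preds).most_common(1)[0][1]
    return 1.0 - f / len(preds)

agree = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]]              # all passes agree
split = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]  # passes split 2-2

print(variation_ratio(agree))   # 0.0 (no dispersion)
print(variation_ratio(split))   # 0.5 (maximal dispersion for C = 2)
```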
Predictive Entropy (H). This metric captures the average amount of information contained in the predictive distribution. The predictive entropy attains its maximum value when all classes are predicted with equal, uniform probability. In contrast, it reaches its minimum value of zero when one class has a probability equal to 1, with 0 for all others (i.e., when the prediction is certain). The predictive entropy can be estimated as:

H = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c

The maximum value of H is \log C; therefore, it is not fixed for datasets with different numbers of classes. To facilitate the comparison across various datasets, we normalize the predictive entropy here as H_n = H / \log C, with 0 ≤ H_n ≤ 1.
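The (normalized) predictive entropy can be computed from the averaged prediction as follows:

```python
import math

def predictive_entropy(p_bar):
    """H = -sum_c p_c * log(p_c), skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in p_bar if p > 0)

def normalized_entropy(p_bar):
    """H / log(C), so the value lies in [0, 1] regardless of class count."""
    return predictive_entropy(p_bar) / math.log(len(p_bar))

h_uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])  # maximal, ~1.0
h_certain = normalized_entropy([1.0, 0.0, 0.0, 0.0])      # minimal, ~0.0
```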
Mutual Information (I). It measures the epistemic uncertainty by capturing the model's confidence from its output:

I = H + \frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} p_{t,c} \log p_{t,c}
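A direct implementation of the mutual information, expressed as the entropy of the mean prediction minus the mean entropy of the individual passes:

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def mutual_information(samples):
    """I = H(p_bar) - (1/T) sum_t H(p_t); large when the passes disagree."""
    T, C = len(samples), len(samples[0])
    p_bar = [sum(p[c] for p in samples) / T for c in range(C)]
    return entropy(p_bar) - sum(entropy(p) for p in samples) / T

disagree = [[0.9, 0.1], [0.1, 0.9]]   # passes contradict: high epistemic uncertainty
agree = [[0.9, 0.1], [0.9, 0.1]]      # passes agree: I = 0
```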
Total variance (TV). It is the sum of the variances obtained for each class:

TV = \sum_{c=1}^{C} \operatorname{Var}\{p_{t,c};\; 1 \le t \le T\}
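The total variance is straightforward to compute from the sampled outputs:

```python
def total_variance(samples):
    """TV = sum over classes of Var(p_{t,c}) across the T passes."""
    T, C = len(samples), len(samples[0])
    tv = 0.0
    for c in range(C):
        col = [p[c] for p in samples]
        m = sum(col) / T
        tv += sum((v - m) ** 2 for v in col) / T      # population variance
    return tv

tv_consistent = total_variance([[0.9, 0.1], [0.9, 0.1]])   # identical passes -> 0.0
tv_spread = total_variance([[1.0, 0.0], [0.0, 1.0]])       # opposing passes -> 0.5
```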
Margin of Confidence (M). The most intuitive way to measure uncertainty is to analyze the difference between the two predictions of highest confidence. Let \hat{c} be the class predicted through the MCD approach. Then, for 1 ≤ t ≤ T, we compute:

d_t = p_{t,\hat{c}} - \max_{c \ne \hat{c}} p_{t,c}, \qquad M = \frac{1}{T}\sum_{t=1}^{T} d_t,

where M takes values close to zero for points of high uncertainty, and it increases when the uncertainty decreases. We note that d_t can be negative.
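The margin of confidence and the per-pass differences d_t can be computed as:

```python
def margin_of_confidence(samples):
    """M = mean of d_t = p_{t,c_hat} - max_{c != c_hat} p_{t,c}, where
    c_hat is the class predicted by the averaged MCD output."""
    T, C = len(samples), len(samples[0])
    p_bar = [sum(p[c] for p in samples) / T for c in range(C)]
    c_hat = p_bar.index(max(p_bar))
    d = [p[c_hat] - max(v for c, v in enumerate(p) if c != c_hat)
         for p in samples]
    return sum(d) / T, d

# The second pass contradicts the averaged prediction, so its d_t is negative
M, d = margin_of_confidence([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]])
```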
The prediction's uncertainty can intuitively be expected to correlate with the classification performance. For instance, Figure 4 shows the histograms of the normalized predictive entropy for predictions classified correctly and incorrectly when applying subject-specific classification on dataset 2a. For almost all subjects, we observed that well-classified predictions were grouped predominantly toward low-entropy values, while incorrectly classified predictions were clustered more in regions of high entropy. A similar effect also occurred with the other uncertainty measures presented here. This indicates that the most uncertain predictions also tend to be incorrect. In regions of high uncertainty, the model may classify patterns essentially at random, and it is therefore preferable to reject the associated inputs. The rejection decision can be carried out using some uncertainty metric and, preferably, it must be statistically inferred. Next, the most suitable uncertainty measures for this purpose are determined.
As a novelty, a new approach based on the Bhattacharyya distance is proposed here to compare the ability of several uncertainty measures to discriminate between correctly and incorrectly classified predictions, in order to enhance MI task recognition. The Bhattacharyya distance measures the similarity of two probability distributions p and q over the same domain X, and it can be calculated as:

D_B(p, q) = -\ln \sum_{x \in X} \sqrt{p(x)\, q(x)}
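Over two normalized histograms sharing the same bins, the distance can be computed directly:

```python
import math

def bhattacharyya(p, q):
    """D_B = -ln( sum_x sqrt(p(x) * q(x)) ) for histograms on common bins."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc) if bc > 0 else float("inf")

d_same = bhattacharyya([0.5, 0.5], [0.5, 0.5])    # identical histograms -> 0
d_far = bhattacharyya([0.9, 0.1], [0.1, 0.9])     # well-separated histograms
d_near = bhattacharyya([0.6, 0.4], [0.4, 0.6])    # heavily overlapping histograms
# d_far > d_near: better-separated histograms yield larger distances
```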
Table 1 shows the Bhattacharyya distance computed between histograms obtained from correct and incorrect classified predictions, using the aforementioned uncertainty measures.
Notice that the margin of confidence M reached the highest Bhattacharyya distance on dataset 2a and in the mean over both datasets, outperforming the other metrics. Thus, we used it in the classification process to reject the less certain EEG trials. The margin of confidence is a sample mean of T random values {d_t}; consequently, a normal distribution can be assumed for the random variable M. This allows fixing a threshold M_{th} on the values of M to split the predictions into certain (M > M_{th}) and uncertain (M ≤ M_{th}). Notice that if a prediction is certain, the zero value must lie outside the confidence interval of M, and therefore M must necessarily be greater than \Phi^{-1}(\gamma)\, s/\sqrt{T}, where s is the standard deviation of the samples {d_t}, \Phi is the cumulative distribution function (CDF) of the standard normal distribution, and \gamma is the confidence level. Consequently, the following threshold can be used:

M_{th} = \Phi^{-1}(\gamma)\, \frac{s}{\sqrt{T}}
The certainty condition is satisfied if the mean of the differences {d_t} is very large or if their standard deviation s is very small. As a result, this threshold scheme does not classify as uncertain those predictions for which the model is consistent across the forward passes, even when M is close to zero. As a highlight, the proposed threshold does not require prior knowledge of the data, as it depends exclusively on the predictive distribution.
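A sketch of the threshold computation; since the Python standard library lacks the inverse normal CDF, a simple bisection on the CDF (via math.erf) is used here for illustration:

```python
import math

def inv_norm_cdf(gamma):
    """Phi^{-1}(gamma) by bisection on Phi (adequate for illustration)."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def certainty_threshold(d, gamma=0.95):
    """M_th = Phi^{-1}(gamma) * s / sqrt(T), with s the sample standard
    deviation of the per-pass differences {d_t}."""
    T = len(d)
    m = sum(d) / T
    s = math.sqrt(sum((v - m) ** 2 for v in d) / (T - 1))
    return inv_norm_cdf(gamma) * s / math.sqrt(T)

d = [0.5, 0.6, 0.4, 0.55, 0.45]         # consistent margins across passes
M = sum(d) / len(d)
certain = M > certainty_threshold(d)    # True: mean far above the threshold
```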
Finally, four subsets of predictions can be obtained using the proposed method: incorrect–uncertain (iu), correct–uncertain (cu), correct–certain (cc), and incorrect–certain (ic) predictions.
Let n_{iu}, n_{cu}, n_{cc}, and n_{ic} be the number of predictions in each subset, N = n_{iu} + n_{cu} + n_{cc} + n_{ic} be the total number of predictions, and CR = (n_{cc} + n_{ic})/N be the certain ratio. This last ratio is the proportion of certain predictions with respect to the total number of predictions.
In any recognition system, the correct classification of certain predictions is desirable. Then, the correct–certain ratio R_{cc} in Equation (12) [50] can be used to measure this expectation:

R_{cc} = \frac{n_{cc}}{n_{cc} + n_{ic}}
On the other hand, if the model makes an incorrect prediction, it is desirable to have high uncertainty, which can be measured by the incorrect–uncertain ratio R_{iu} [50], as follows:

R_{iu} = \frac{n_{iu}}{n_{iu} + n_{ic}}
The overall accuracy of the uncertainty estimation can be measured through the uncertainty accuracy (UA) as:

UA = \frac{n_{cc} + n_{iu}}{N}

where UA penalizes the incorrect–certain and correct–uncertain predictions, aiming to increase the reliability, effectiveness, and feasibility of EEG MI-based recognition systems in practical applications. UA takes higher values for the best threshold values; thus, it can further be used to compare different thresholds.
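These quantities are simple to compute once the four subset counts are available; the counts below are made-up illustrative numbers, and the ratio definitions follow the standard ones from the uncertainty-evaluation literature:

```python
def uncertainty_scores(n_iu, n_cu, n_cc, n_ic):
    """Certain ratio, correct-certain ratio, incorrect-uncertain ratio,
    and uncertainty accuracy from the four prediction-subset counts."""
    N = n_iu + n_cu + n_cc + n_ic
    return {
        "CR": (n_cc + n_ic) / N,        # proportion of certain predictions
        "Rcc": n_cc / (n_cc + n_ic),    # certain predictions that are correct
        "Riu": n_iu / (n_iu + n_ic),    # incorrect predictions flagged uncertain
        "UA": (n_cc + n_iu) / N,        # overall uncertainty accuracy
    }

scores = uncertainty_scores(n_iu=15, n_cu=10, n_cc=70, n_ic=5)
# UA = 0.85: correct-certain and incorrect-uncertain cases taken together
```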