Revisiting Softmax for Uncertainty Approximation in Text Classification

Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. One of the most widely used uncertainty approximation methods is Monte Carlo (MC) dropout, which is computationally expensive as it requires multiple forward passes through the model. A cheaper alternative is to simply use a softmax based on a single forward pass without dropout to estimate model uncertainty. However, prior work has indicated that these predictions tend to be overconfident. In this paper, we perform a thorough empirical analysis of these methods on five datasets with two base neural architectures in order to identify the trade-offs between the two. We compare both softmax and an efficient version of MC dropout on their uncertainty approximations and downstream text classification performance, while weighing their runtime (cost) against performance (benefit). We find that, while MC dropout produces the best uncertainty approximations, using a simple softmax leads to competitive, and in some cases better, uncertainty estimation for text classification at a much lower computational cost, suggesting that softmax can in fact be a sufficient uncertainty estimate when computational resources are a concern.


Introduction
The pursuit of pushing state-of-the-art performance on machine learning benchmarks often comes with an added cost of computational complexity. On top of already complex base models, such as Transformer models (Vaswani et al., 2017; Lin et al., 2021), successful methods often employ additional techniques to improve the uncertainty estimation of these models, as they tend to be overconfident in their predictions. Though these techniques can be effective, the overall benefit in relation to the added computational cost is under-studied.
More complexity does not always imply better performance. For example, Transformers can be outperformed by much simpler convolutional neural nets (CNNs) when the latter are pre-trained as well (Tay et al., 2021). Here, we turn our attention to neural network uncertainty estimation methods in text classification, which have applications in domain adaptation and decision making, and can help make models more transparent and explainable. In particular, we focus on a setting where efficiency is of concern, which can help improve the sustainability and democratisation of machine learning, as well as enable use in resource-constrained environments.
Quantifying predictive uncertainty in neural nets has been explored using various techniques (Gawlikowski et al., 2021), with the methods being divided into three main categories: Bayesian methods, single deterministic networks, and ensemble methods. Bayesian methods include Monte Carlo (MC) dropout (Gal and Ghahramani, 2016b) and Bayes by backprop (Blundell et al., 2015). Single deterministic networks can approximate the predictive uncertainty by a single forward pass in the model, with softmax being the prototypical method. Lastly, ensemble methods utilise a collection of models to calculate the predictive uncertainty. However, while uncertainty estimation can improve when using more complex Bayesian and ensembling techniques, efficiency takes a hit.
In this paper, we perform an empirical investigation of the trade-off between choosing cheap vs. expensive uncertainty approximation methods for text classification, with the goal of highlighting the efficacy of these methods in an efficient setting. We focus on one single deterministic and one Bayesian method. For the single deterministic method, we study the softmax, which is calculated from a single forward pass and is computationally very efficient. While softmax is a widely used method, prior work has posited that the softmax output, when taken as a single deterministic operation, is not the most dependable uncertainty approximation method (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). As such, it has been superseded by newer methods such as MC dropout, which leverages the dropout function in neural nets to approximate a random sample of multiple networks and aggregates the softmax outputs of this sample. MC dropout is favoured due to its close approximation of uncertainty, and because it can be used without any modification to the applied model. It has also been widely applied in text classification tasks (Zhang et al., 2019; He et al., 2020).
To understand the cost vs. benefit of softmax vs. MC dropout, we perform experiments on five datasets using two different neural network architectures, applying them to three different downstream text classification tasks. We measure both the added computational complexity in the form of runtime (cost) and the downstream performance on multiple uncertainty metrics (benefit). We show that by using a single deterministic method like softmax instead of MC dropout, we can speed up inference by roughly 10 times while still providing reasonable uncertainty estimates on the studied tasks. As such, given the already high computational cost of deep neural network based methods and recent pushes for more sustainable ML (Strubell et al., 2019; Patterson et al., 2021), we recommend not discarding efficient uncertainty approximation methods such as softmax in resource-constrained settings, as they can still potentially provide reasonable estimations of uncertainty.
Contributions: In summary, we contribute 1) an empirical study of an efficient version of MC dropout and softmax for text classification tasks, using two different neural architectures and five datasets; 2) a comparison of uncertainty estimation between MC dropout and softmax using expected calibration error; 3) a comparison of the cost vs. benefit of MC dropout and softmax in a setting where efficiency is of concern.

Uncertainty Quantification
Quantifying the uncertainty of a prediction can be done using various techniques (Ovadia et al., 2019; Gawlikowski et al., 2021; Henne et al., 2020), such as single deterministic methods (Mozejko et al., 2018; van Amersfoort et al., 2020), which calculate the uncertainty on a single forward pass of the model. They can further be classified as internal or external methods, depending on whether the uncertainty is calculated internally in the model or by post-processing the output. Another family of techniques are Bayesian methods, which combine NNs and Bayesian learning. Bayesian Neural Networks (BNNs) can also be split into subcategories, namely Variational Inference (Hinton and van Camp, 1993), Sampling (Neal, 1993), and Laplace Approximation (MacKay, 1992). Some of the more notable methods are Bayes by backprop (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b). One can also approximate uncertainty using ensemble methods, which use multiple models to better measure predictive uncertainty, compared to using the predictive uncertainty given by a single model (Lakshminarayanan et al., 2017; He et al., 2020; Durasov et al., 2021). Recently, uncertainty methods have been used to develop methods for new tasks (Zhang et al., 2019; He et al., 2020), where mainly Bayesian methods have been used. We present a thorough empirical study of how uncertainty quantification behaves for text classification tasks. Unlike prior work, we do not only evaluate based on the performance of the methods, but perform an in-depth comparison to much simpler deterministic methods based on multiple metrics.

Uncertainty Metrics
Measuring the performance of uncertainty approximation methods can be done in multiple ways, each offering benefits and downsides. Niculescu-Mizil and Caruana (2015) explore obtaining confidence values from model predictions to use for supervised learning. One of the more widespread and accepted methods is using expected calibration error (ECE, Guo et al., 2017). While ECE measures the underlying confidence of the uncertainty approximation, human intervention has also been used for text classification (Zhang et al., 2019; He et al., 2020). There, the uncertainty estimates are used to identify uncertain predictions from the model, and humans are asked to classify these predictions. The human-classified data is assumed to have 100% accuracy and to be suitable for measuring how well the model scores after removing a proportion of the most uncertain data points. Metrics such as ECE expose how well a model is calibrated, and this calibration can be improved using scaling techniques (Guo et al., 2017; Naeini et al., 2015). We use uncertainty approximation metrics like expected calibration error and human intervention (which we refer to as holdout experiments) to measure the difference in performance between MC dropout and softmax on text classification tasks.

Uncertainty Approximation for Text Classification
We focus on one deterministic method and one Bayesian method of uncertainty approximation. Both methods assume the existence of an already trained base model, and are applied at test time to obtain uncertainty estimates from the model's predictions. In the following sections, we formally introduce the two methods we study, namely MC dropout and softmax. MC dropout is a Bayesian method which utilises the dropout layers of the model to measure the predictive uncertainty, while softmax is a deterministic method that uses the classification output. In Figure 1, we visualise the differences between the two methods and how they are connected to base text classification models.

Bayesian Learning
Before introducing the MC dropout method, we briefly introduce the concept of Bayesian learning by comparing it to a traditional NN. A traditional NN assumes that the network weights ω ∈ R^n are real but of an unknown value and can be found through maximum-likelihood estimation, and the input data (x, y) ∈ D are treated as random variables. Bayesian learning instead views the weights as random variables, and infers a posterior distribution p(ω|D) over ω after observing D. The posterior distribution is defined as follows:

$$p(\omega \mid D) = \frac{p(D \mid \omega)\, p(\omega)}{p(D)} = \frac{p(D \mid \omega)\, p(\omega)}{\int p(D \mid \omega)\, p(\omega)\, d\omega}$$

Using the posterior distribution, we can find the prediction y* for an unseen input x* as follows:

$$p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \omega)\, p(\omega \mid D)\, d\omega$$

However, the posterior distribution is infeasible to compute due to the marginal likelihood in the denominator, so we cannot find a solution analytically. We therefore resort to approximating the posterior distribution. For this approximation we rely on methods such as Bayes by backprop (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b).

Monte Carlo Dropout
At a high level, MC dropout approximates the posterior distribution p(ω|D) by leveraging the dropout layers in a model (Gal and Ghahramani, 2016b,a). Mathematically, it is derived by introducing a distribution q(ω), representing a distribution of weight matrices whose columns are randomly set to 0, to approximate the posterior distribution p(ω|D), which results in the following predictive distribution:

$$p(y^* \mid x^*, D) \approx \int p(y^* \mid x^*, \omega)\, q(\omega)\, d\omega$$

As this integral is still intractable, it is approximated by taking K samples from q(ω) using the dropout layers of a learned network f which approximates p(y*|x*, ω). As such, calculating p(y*|x*, ω)q(ω) amounts to leaving the dropout layers active during testing, and approximating the integral amounts to aggregating predictions across multiple dropout samples. For the proofs, see Gal and Ghahramani (2016b).

MC dropout requires multiple forward passes, so its computational cost is a multiple of the cost of performing a forward pass through the entire network. As this is more computationally expensive than the single forward pass required for deterministic methods, we provide a fairer comparison between softmax and MC dropout by using an efficient version of MC dropout which caches an intermediate representation and only activates the dropout layers of the latter part of the network. As such, we obtain a representation z* by passing an input through the first several layers of the model, and pass only this representation through the latter part of the model multiple times, reducing the computational cost while approximating the sampling of multiple networks.
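The caching idea described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the toy representation `z_star`, the weights `W` and `b`, and the helpers `dropout`, `linear` and `mc_dropout_logits` are hypothetical names, and inverted dropout scaling (dividing surviving units by 1 − p) is assumed.

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout (assumed): zero each unit with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def linear(vec, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def mc_dropout_logits(z_star, weights, bias, k=50, p=0.5, seed=0):
    """Draw K sets of logits from a single cached representation z_star.

    The expensive first part of the network is assumed to have run once to
    produce z_star; only dropout and the final layer are repeated K times.
    """
    rng = random.Random(seed)
    return [linear(dropout(z_star, p, rng), weights, bias) for _ in range(k)]

# Toy example: a 4-dimensional cached representation and 3 classes.
z_star = [0.5, -1.0, 2.0, 0.1]
W = [[0.2, -0.1, 0.4, 0.0],
     [-0.3, 0.5, 0.1, 0.2],
     [0.1, 0.1, -0.2, 0.3]]
b = [0.0, 0.1, -0.1]
samples = mc_dropout_logits(z_star, W, b, k=50, p=0.5)
```

Because the cached representation is computed once, the cost of each extra sample in this sketch is one dropout mask and one small matrix product rather than a full forward pass.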

Combining Sample Predictions
With multiple samples of the same data point, we have to determine how to combine them to quantify the predictive uncertainty. We test two methods that can be calculated using the logits of the model, requiring no model changes. The first approach, which we refer to as Mean MC, is averaging the output of the softmax layer from all forward passes:

$$u_i = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Softmax}\left(f(z_i^k)\right)$$

where z_i^k is a representation of the i'th data point of the k'th forward pass and f is a fully connected layer. The second method we use to quantify the predictive uncertainty is Dropout Entropy (DE) (Zhang et al., 2019), which uses a combination of binning and entropy:

$$b_i = \frac{1}{K}\, \mathrm{BinCount}\left(\operatorname{argmax} f(z_i^1), \ldots, \operatorname{argmax} f(z_i^K)\right), \qquad u_i = -\sum_{j=1}^{C} b_{i,j} \log b_{i,j}$$

where BinCount is the number of predictions of each class, C is the number of classes, and b_i is a vector of the probabilities of each class's occurrence based on the bin count. We show the performance of the two methods in Section 4.3.2.
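The two aggregation strategies described above can be illustrated with a short pure-Python sketch. The function names `mean_mc` and `dropout_entropy` and the toy logits are hypothetical, and the DE variant shown here (entropy of the class-vote frequencies over the K passes) is our reading of the description rather than a verified reproduction of Zhang et al. (2019).

```python
import math
from collections import Counter

def softmax(logits):
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_mc(sample_logits):
    """Mean MC: average the softmax outputs over the K forward passes."""
    k = len(sample_logits)
    probs = [softmax(l) for l in sample_logits]
    return [sum(p[j] for p in probs) / k for j in range(len(probs[0]))]

def dropout_entropy(sample_logits):
    """DE (assumed form): entropy of the predicted-class frequencies over K passes."""
    k = len(sample_logits)
    votes = Counter(max(range(len(l)), key=l.__getitem__) for l in sample_logits)
    b = [count / k for count in votes.values()]  # classes with zero votes add nothing
    return -sum(p * math.log(p) for p in b)

# Four dropout samples for one data point with two classes;
# three passes vote for class 0 and one for class 1.
sample_logits = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
mean_probs = mean_mc(sample_logits)
de = dropout_entropy(sample_logits)
```

Mean MC needs only an average over K probability vectors, whereas DE additionally requires an argmax, a count, and an entropy per data point, which is consistent with DE being the heavier post-processing step.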

Softmax
Softmax, a common normalising function for producing a probability distribution from neural network logits, is defined as follows:

$$\mathrm{Softmax}(z_i)_j = \frac{e^{z_{i,j}}}{\sum_{c=1}^{C} e^{z_{i,c}}}$$

where z_i are the logits of the i'th data point, j indexes the classes, and C is the number of classes. The softmax yields a probability distribution over the predicted classes. However, the predicted probability distribution is often overconfident toward the predicted class (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). The issue of softmax's overconfidence can also be exploited (Gal and Ghahramani, 2016b; Joo et al., 2020); in the worst case, this leads to the softmax producing imprecise uncertainties. However, model calibration methods like temperature scaling have been found to lessen the overconfidence to some extent (Guo et al., 2017). As temperature scaling also incurs a cost in terms of runtime in order to find an optimal temperature, we choose to compare raw softmax probabilities to the efficient MC dropout method described previously, though uncertainty estimation could potentially be improved by scaling the logits appropriately.
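As a small illustration of the single-pass estimate, the sketch below computes the softmax and takes the probability of the predicted class as the confidence. The helper `softmax_confidence` and the toy logits are hypothetical. The optional `temperature` argument only shows where temperature scaling (dividing the logits by T) would enter; the experiments described here use raw softmax, i.e. T = 1.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature=1.0 gives the raw softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_confidence(logits, temperature=1.0):
    """Return (predicted class, probability assigned to that class)."""
    probs = softmax(logits, temperature)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred]

logits = [4.0, 1.0, 0.5]
pred, conf = softmax_confidence(logits)                    # raw softmax
_, conf_t2 = softmax_confidence(logits, temperature=2.0)   # softened distribution
```

A temperature above 1 flattens the distribution and lowers the reported confidence without changing the predicted class, which is why it can reduce overconfidence at the price of an extra tuning step.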

Experiments and Results
We consider five different datasets and two different base models in our experiments. Additionally, we conduct experiments to determine the optimal hyperparameters for the MC dropout method, particularly the optimal number of samples, which affects both the efficiency and the performance of MC dropout. In the paper, we focus on the results for the 20 Newsgroups dataset; the results for the other four datasets are shown in Appendices B and C. We further find the optimal dropout percentage in Appendix A.3.

Data
To test the predictive uncertainty of the two methods, we use five datasets covering diverse text classification tasks. The 20 Newsgroups dataset (Lang, 1995) is a text classification dataset consisting of a collection of 20,000 news articles, classified into 20 different classes. The Amazon dataset (McAuley and Leskovec, 2013) is a sentiment classification task. We use the 'sports and outdoors' category, which consists of 272,630 reviews with ratings ranging from 1 to 5. The IMDb dataset (Maas et al., 2011) is also a sentiment classification task. However, compared to the Amazon dataset, this is a binary problem. The dataset consists of 50,000 reviews. The SST-2 dataset (Socher et al., 2013) is also a binary sentiment classification dataset, consisting of 70,042 sentences. Lastly, we also use the Wiki dataset (Redi et al., 2019), which is a citation needed task, i.e., we predict whether a citation is needed. The dataset consists of 19,998 texts.
For the 20 Newsgroups, Amazon, IMDb and Wiki datasets, we use a split of 60%, 20% and 20% for the training, validation and test data, with the data in each split selected randomly. We used the provided splits for the SST-2 dataset, but due to the test labels being hidden, we used the validation set for testing. We select these datasets as they are large, the tasks are diverse, and they cover multiple domains of text. Additionally, they represent well-studied and standard benchmarks in the field of text classification, which helps with the reproducibility of the results and comparison with baselines.

Experimental Setup
We use two different base neural architectures with two different embeddings in our experiments. To recreate baseline results, the first model is the same model as proposed by Zhang et al. (2019), which is a CNN using pre-trained GloVe embeddings (GloVe-CNN) with a dimension of 200 (Pennington et al., 2014). The second model uses a pre-trained BERT model (Devlin et al., 2019) fine-tuned as a masked language model on the dataset under evaluation to obtain contextualised embeddings, which are then input to a CNN with 4 layers (BERT-CNN).
The selection of these models allows us to compare the established baseline architecture from Zhang et al. (2019) with a more modern version of it which takes advantage of large language models. For both models we use the final dropout layer for MC dropout. Both models are optimised using Adam (Kingma and Ba, 2015) with a learning rate of 0.001, and are trained for up to 1000 epochs with early stopping after 10 iterations without improvement.

MC Dropout Sampling
To make full use of MC dropout, we first determine the optimal number of forward passes through the model needed to obtain the best performance while maintaining high efficiency. This hyper-parameter search is imperative because the MC dropout performance and efficiency are correlated with the number of samples generated. To make a fair comparison against the already cheap softmax method, we want to find the minimum number of samples needed to approximate a good uncertainty. In Table 1, we show the performance, using the F1 score, of the MC dropout method with the BERT-CNN model on the 20 Newsgroups dataset for the following numbers of samples: [1, 5, 10, 25, 50, 100, 1000]. The table shows how the performance of the uncertainty approximation increases with the number of samples. However, the performance gained from additional samples falls off at 50. Given this, we use 50 MC samples in our experiments in order to balance good performance and efficiency.

Evaluation Metrics
We use complementary evaluation metrics to benchmark the performance of MC dropout and softmax. Namely, we measure how well each of the methods identifies uncertain predictions, as well as the runtime of the methods.

Efficiency
To quantify efficiency, we measure the runtime of each of the methods during inference and during the calculation of the uncertainties. Training the model is independent of the uncertainty estimation methods, since we only use them to quantify the uncertainty of the model's predictions. As we do not calculate uncertainties during training, we measure the runtime of each method on the test data only.

Performance Metrics
We use two main uncertainty metrics: test data holdout and expected calibration error (ECE). These metrics give us an estimation of the epistemic uncertainty of the model, i.e., the lack of certainty inherent in the model and its predictions. We do not cover metrics of aleatoric uncertainty in this paper, which focus on the inherent randomness of the data itself and which could be tested through the introduction of, e.g., label noise. For base model performance, we record the macro F1 score on the 20 Newsgroups, IMDb, Wiki and SST-2 datasets, and the accuracy on the Amazon dataset.
Test data holdout: This metric ranks all samples based on the predictive uncertainty and calculates the F1 and accuracy scores on a percentage of the samples by removing those which the model is least certain about. In other words, a method is better if it achieves a greater improvement in performance metrics (e.g., F1) when removing the most uncertain samples. As such, this metric expresses the relationship between model calibration and accuracy. We choose to remove 10%, 20%, 30% and 40% of the least certain samples for our experiments. This metric shows how well the two methods can identify uncertain predictions of the model, as reflected by improvements in performance when more uncertain predictions are removed (Zhang et al., 2019). In our experiments, we use the previously mentioned Mean MC, DE and softmax methods to calculate the uncertainties; we further add the Penultimate Layer Variance (PL-Variance), which utilises the variance of the last fully connected layer as the uncertainty (Zaragoza and d'Alché Buc, 1998).
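The holdout procedure can be sketched as follows. This is a minimal pure-Python illustration: the function `holdout_accuracy` and the toy data are hypothetical, accuracy is used instead of macro F1 purely for brevity, and the uncertainty scores are assumed to be "higher means less certain" (for softmax this could be, for example, one minus the maximum class probability).

```python
def holdout_accuracy(uncertainties, predictions, labels, holdout_fraction):
    """Accuracy on the samples that remain after dropping the most uncertain ones."""
    if not 0.0 <= holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be in [0, 1)")
    n = len(labels)
    # Sort indices from most certain to least certain.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    n_keep = n - int(round(holdout_fraction * n))
    kept = order[:n_keep]
    correct = sum(predictions[i] == labels[i] for i in kept)
    return correct / len(kept)

# Toy data: the two wrong predictions (indices 2 and 5) are also the most uncertain.
labels        = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
predictions   = [0, 1, 0, 0, 2, 1, 1, 0, 2, 1]
uncertainties = [0.05, 0.10, 0.90, 0.20, 0.15, 0.70, 0.30, 0.25, 0.12, 0.08]

scores = {f: holdout_accuracy(uncertainties, predictions, labels, f)
          for f in (0.0, 0.1, 0.2, 0.3, 0.4)}
```

In this toy case the score rises as the holdout fraction grows because the uncertainty ranking places the errors last; an uninformative ranking would leave the score roughly unchanged.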
Expected calibration error: As a second uncertainty estimation metric, we use the expected calibration error (ECE, Guo et al., 2017), which measures, in expectation, how well the confidence of the predictions matches their accuracy, for both correct and incorrect predictions. This tells us how well each of the MC dropout and softmax methods estimates the uncertainties at the level of probability distributions, as opposed to the holdout method, which only looks at downstream task performance. ECE works by dividing the data into a set of bins B, where each bin B_m contains data that is within a certain range of probabilities, using the probability of the predicted class. Formally, ECE is defined as:

$$\mathrm{ECE} = \sum_{m=1}^{|B|} \frac{|B_m|}{M} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where M is the size of the dataset, and acc and conf are the accuracy and mean confidence (i.e., predicted class probabilities) of the bin B_m. Finally, to visualise the difference between MC dropout and softmax, we create both confidence histograms and reliability diagrams (Guo et al., 2017). The reliability diagrams show how close the models are to perfect calibration, where perfect calibration means that the model's accuracy in each bin is equal to the bin's confidence. In all cases, we show reliability diagrams by comparing histograms of accuracy and confidence across confidence bins; as such, when confidence exceeds accuracy in a given bin, the gap indicates how overconfident the model is for that bin. The reliability diagrams help us visualise the ECE by showing the accuracy and mean confidence of each bin, where each bin consists of the data which have a confidence within the range of the bin. To complement the reliability diagrams, we also use confidence histograms, which show the distribution of confidence.
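A minimal pure-Python sketch of the ECE computation with equal-width confidence bins is given below. The function name `expected_calibration_error` and the toy inputs are hypothetical, and the bin-edge convention (half-open intervals, with a confidence of exactly 1.0 placed in the last bin) is an assumption of this sketch.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean gap between accuracy and mean confidence per bin.

    confidences: probability of the predicted class for each sample.
    correct:     1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        acc = sum(hit for _, hit in bucket) / len(bucket)
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        ece += len(bucket) / n * abs(acc - avg_conf)
    return ece

# Four samples at confidence 0.95 (three correct) and two at 0.55 (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct, n_bins=10)
```

Here the high-confidence bin has accuracy 0.75 against confidence 0.95, and the other bin 0.5 against 0.55, giving an ECE of (4/6)·0.20 + (2/6)·0.05 = 0.15. The per-bin accuracy and mean confidence computed inside the loop are the same quantities a reliability diagram plots.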

Efficiency Results
In Table 2, we display the runtime of the different model and method combinations. The runtime for the forward passes is calculated as the sum over all forward passes on the entire dataset, and the runtime for the uncertainty methods is likewise calculated for the entire dataset. Observing the results, we see that softmax is faster overall, and approximately 10 times faster when only looking at the forward passes. We also see that using more complex aggregation methods in MC dropout, like DE, can be computationally heavy.

Test Data Holdout Results
Table 3 and the table in Appendix B show the performance of the two uncertainty approximation methods using the different datasets and models. The tables show the macro F1 score or accuracy (depending on the dataset), with the ratio of improvement from holding out data in parentheses. We observe that in most cases, either dropout-entropy (DE) or softmax has the highest score and improvement ratio. However, in most cases the two are close in performance and improvement ratio. We further observe that Mean MC also performs well and is almost on par with DE. As Mean MC is a much more efficient method than DE, the slight trade-off in performance could be beneficial in resource-constrained settings or non-critical applications.

Model Calibration Results
To further investigate the differences between MC dropout and softmax, we utilise the expected calibration error (ECE) to observe the differences in the predictive uncertainties. In Table 4, we show the accuracy and ECE on the three datasets using the BERT embeddings. The results from our holdout experiments in Table 3 and in Appendix B, combined with the results from our ECE calculations in Table 4, all point in the direction of the efficient MC dropout used in this study and softmax performing on par with each other, but with a large gap in runtime, as shown in Table 2. To get a better understanding of if and where the two methods diverge, we plot the reliability diagrams and confidence histograms as described in Section 4.3.2.
Plot description: In Figures 2 and 3, we show the reliability diagrams and the confidence histograms on the 20 Newsgroups dataset using both our BERT-CNN and GloVe-CNN with both the MC dropout method and softmax. In both types of plot, each bin covers an interval of confidence. We create the reliability diagrams using 10 bins and the confidence histograms using 20 bins, the latter to obtain a more fine-grained view of the distribution. In the reliability diagram, the x-axis is the confidence and the y-axis is the accuracy. For the confidence histogram, the x-axis is again the confidence and the y-axis is the percentage of the samples in the given bin.
Expectations: While ECE can quantify the performance of the models on a somewhat lower level than our other metrics, the metric can be deceived, especially in cases where models score high in accuracy: it will favour an overconfident method when the model achieves high accuracy. With this in mind, we expect the results to be skewed towards the softmax.
Observations reliability diagram: From the reliability diagram, we observe that the differences between the confidence and the outputs are small. The difference between the two uncertainty methods is also minimal, for both BERT and GloVe embeddings, suggesting minimal potential gains from using MC dropout in an efficient setting while still incurring a high cost in terms of runtime. We determine that there is minimal difference by visually inspecting the plots, and by observing the ECE displayed in Table 4. We further observe that, for both MC dropout and softmax, the model is less well calibrated when we use the GloVe embeddings.
Observations confidence histogram: As mentioned earlier, we know that the softmax tends to be overconfident, which can be seen in the percentage of samples in the last bin. The MC dropout method, on the other hand, utilises the probability space to a greater extent. We include reliability diagrams and confidence histograms for the 2 other datasets in Appendix C.
Noise experiment: Inspecting both the ECE values in Table 4 and the performances in Tables 3, 6 and 7, we observe that using our two uncertainty estimation methods we achieved very high F1 scores and accuracies and low ECEs. We hypothesised that high performance could lead to softmax achieving a low ECE compared to MC dropout, due to naturally having high confidence. To test our hypothesis, we added zero-mean Gaussian noise to the 20 Newsgroups test embeddings and redid our ECE experiments. In Figure 4, we show the reliability diagram of the experiment with added noise, which shows MC dropout outperforming softmax. To further build on the theory, we also inspect the confidence histogram, showing that softmax is still overconfident and the difference between the accuracy and mean confidence is high. This suggests that MC dropout is more resilient to noise, and that in cases where the performance of a model is low, MC dropout could potentially obtain more precise predictive uncertainties.
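The perturbation step of the noise experiment can be sketched as below. This is only an illustration: the helper `add_gaussian_noise`, the toy embeddings, and in particular the standard deviation `sigma` are hypothetical, since the noise scale used in the experiment is not stated here.

```python
import random

def add_gaussian_noise(embeddings, sigma, seed=0):
    """Return a copy of the embeddings with zero-mean Gaussian noise added per value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in embeddings]

# Toy test embeddings: 3 inputs with 4 dimensions each.
test_embeddings = [[0.1, -0.4, 0.7, 0.0],
                   [1.2, 0.3, -0.5, 0.9],
                   [-0.8, 0.6, 0.2, -0.1]]
noisy_embeddings = add_gaussian_noise(test_embeddings, sigma=0.5)  # sigma is assumed
```

The noisy embeddings would then be passed through the unchanged trained model, and the uncertainty metrics recomputed on the degraded predictions.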

Discussion and Conclusion
In this paper, we perform an in-depth empirical comparison of using the MC dropout method in an efficient setting and the more straightforward softmax method. Using the metrics described in Section 4.3.2 to measure the two methods in terms of both efficiency and predictive performance, we see in our holdout experiments (Table 3) that the two methods perform approximately the same.
Looking at the expected calibration error (ECE) experiments, the results again show that the MC dropout and softmax methods perform roughly equally, as shown in Section 4.6. Differences between the two emerge when accuracy is lower, as shown by our noise experiment, also in Section 4.6.
Prior research (Hendrycks and Gimpel, 2017) investigated out-of-distribution analysis and found that softmax, both for sentiment classification and text categorisation tasks, can detect out-of-distribution data points efficiently. It further showed that, in these two tasks, the softmax can also perform well as a confidence estimator to some extent.
While we show that the two methods perform almost equally when comparing the predictive performance, the cost of using MC dropout is at a minimum 10 times that of running softmax, even in the efficient setting where only the final layer is dropped out, and depends further on the post-processing of the uncertainties, as we show in Section 4.4. The post-processing cost of MC dropout can quickly explode when used on larger datasets or if a more expensive method like dropout-entropy is used instead of simpler approaches.
Given this, when could it be appropriate to use the more efficient softmax over MC dropout for estimating predictive uncertainty? Our results suggest that when the base accuracy of a model is high, the differences in uncertainty estimation between the two methods are relatively small, likely due to the higher confidence of the softmax method. In this case, if latency or resource efficiency is a concern, such as on edge devices, it may be appropriate to rely on a quick estimate using softmax as opposed to a more cumbersome method. However, when model accuracy is expected to be low, softmax is still overconfident compared to MC dropout, so estimates using a single deterministic softmax may be unreliable. The downstream application may also impact this: in critical scenarios such as health care, it may still be more appropriate to use an inefficient method with better predictive uncertainty for improved decision-making. In low-risk applications where models are known to be accurate and efficiency is of concern, we have demonstrated that softmax can potentially be sufficient.

Limitations
We highlight a few key limitations of the study to further contextualise the work. First, we note that the study is restricted to neural network based methods, while other methods in ML may be useful to study for uncertainty estimation as well. Second, we note that we test a plain softmax method without temperature scaling; while calibrating a useful temperature could induce a cost in terms of time, it would potentially lead to better uncertainty estimation. Finally, we note that we also test an efficient form of MC dropout which only drops out a portion of the network. While this demonstrates that in an efficient setting, softmax can be as good or better at uncertainty estimation than MC dropout, full MC dropout still may have better uncertainty estimation when efficiency is not a concern.

A Reproducibility
A.1 Computing Infrastructure
All experiments were run on a Microsoft Azure NC6-series server with the following specifications: 6 Intel Xeon E5-2690 v3, an NVIDIA Tesla K80 with 12GB of RAM, and 56GB of RAM.

A.2 Hyperparameters
We used the following hyperparameters for training our CNN model and CNN GloVe model: epochs: 1000; batch size: 256 for 20 Newsgroups, IMDb, SST-2 and Wiki, and 128 for Amazon; early stopping: 10; learning rate: 0.001. For fine-tuning BERT we used the following set of hyperparameters: epochs: 3; warm-up steps: 500; weight decay: 0.01; batch size: 8; masked language model probability: 0.15. All hyperparameters are set without performing cross-validation.

A.3 Dropout Hyperparameter
The performance of the MC dropout method is correlated with the dropout probability. We therefore run our CNN model using BERT embeddings on the 20 Newsgroups dataset with the following dropout probabilities: [0.1, 0.2, 0.3, 0.4, 0.5]. In Table 5, we show the results using the 5 different dropout probabilities, where we see that performance stops improving at dropout probabilities of 0.4 and 0.5. As such, we use a dropout probability of 0.5 for our experiments.

Figure 1: MC Dropout (left) and softmax (right). In the version of MC dropout tested in this paper, a test input x* is passed through model f to obtain a representation z*, which is then passed through a dropout layer multiple times, and through the final part of the network to obtain prediction y*. For softmax, dropout is disabled and a single prediction is obtained.

Figure 2: Reliability diagram (left, displayed as a stacked bar chart comparing accuracy and confidence) and confidence histogram (right) of 20 Newsgroups using BERT-CNN. Softmax and the efficient version of MC dropout tested in this paper are relatively similar in their calibration (a higher value for confidence than accuracy in any bin indicates overconfidence in that bin). At the same time, as indicated by the confidence histogram, softmax still produces more confident estimates on average.

Figure 3: Reliability diagram (left, displayed as a stacked bar chart comparing accuracy and confidence) and confidence histogram (right) of 20 Newsgroups using GloVe-CNN. Comparing the plots to Figure 2, we see slight differences in both the reliability diagram and the confidence histogram. Most noticeably, the reliability diagram shows larger gaps between the confidence and the outputs, which indicates a less calibrated model due to the GloVe embeddings.

Figure 4: Reliability diagram of the 20 Newsgroups dataset (displayed as a stacked bar chart comparing accuracy and confidence) using the BERT-CNN model, with zero-mean Gaussian noise added to the BERT embeddings. Softmax is highly overconfident compared to MC dropout (despite the efficient setting in this paper where only the final layers of the model are used for dropout), as indicated by the large gap between average confidence and accuracy in each bin of the histogram.

Table 1: Effect of the number of samples on the performance of the MC dropout method, on the 20 Newsgroups dataset, using the BERT-CNN model. The results are reported using macro F1.

Table 2: Runtime measured in seconds for both MC dropout (top) and softmax (bottom). The times are on the full datasets, split into the runtime of the forward passes and the runtime of calculating the uncertainty.

Table 3: Macro F1 score and improvement rate for the 20 Newsgroups dataset.

Table 4: Accuracy and ECE of the two uncertainty approximation approaches on the three selected datasets.

Table 5: Correlation between the dropout probability and the performance of MC dropout, using the BERT-CNN model. The results are reported in terms of macro F1.

Table 7: Accuracy score and improvement rate for the Amazon (Sports and Outdoors) dataset.

Table 9: Macro F1 score and improvement rate for the SST-2 dataset.