Sparsity Increases Uncertainty Estimation in Deep Ensemble

Deep neural networks have achieved almost human-level results in various tasks and have become popular across broad artificial intelligence domains. Uncertainty estimation is an in-demand task because of the black-box, point-estimation behavior of deep learning. The deep ensemble provides increased accuracy and uncertainty estimates; however, its size grows linearly with the number of members, which makes the deep ensemble infeasible for memory-intensive tasks. To address this problem, we used model pruning and quantization with a deep ensemble and analyzed the effect in the context of uncertainty metrics. We empirically showed that the ensemble members' disagreement increases with pruning, which makes models sparser by zeroing irrelevant parameters. Increased disagreement implies increased uncertainty, which helps in making more robust predictions. Accordingly, an energy-efficient compressed deep ensemble is appropriate for memory-intensive and uncertainty-aware tasks.


Introduction
Deep neural networks make overconfident [1], black-box point estimates because the network parameters are matrices of scalar values. Even on unseen data that come from another distribution, a neural network makes incorrect predictions with high confidence. Deep learning therefore requires a scalable and straightforward uncertainty estimation method. Monte Carlo (MC) dropout applies dropout [2], a standard regularization technique, with a Bernoulli mask at inference time and samples multiple forward passes to measure model uncertainty [3].
Ensembling [4] is a well-known method for improving accuracy in machine learning tasks. Another sample-based method for measuring uncertainty is the deep ensemble [5], which consists of randomly initialized, independently trained neural networks that share the same architecture. It has shown excellent uncertainty estimation while increasing accuracy [6]. Randomly initialized networks are close to each other at initialization, but they move further apart in the function space during training. It is assumed that such function space diversity is the main reason for the excellent performance of the deep ensemble [7].
However, cost remains the main challenge for the deep ensemble, because the training time, test time, and ensemble size all grow linearly with the number of ensemble members. The deep ensemble is not applicable to memory-intensive tasks because of its increased size.
In deep learning, model compression techniques [8] dramatically reduce the model size with little to no accuracy degradation. Model pruning increases the model sparsity by zeroing the irrelevant parameters. Low-rank factorization decomposes the weight tensors by estimating the informative parameters. Knowledge distillation trains a more compact network to reproduce the outputs of a larger network or an ensemble. Model quantization converts floating-point parameters into representations of 8 bits or fewer.

Related Work
To the best of our knowledge, the majority of existing studies aim to improve uncertainty quantification while decreasing the cost by leveraging knowledge distillation [9][10][11][12]. We assume that combining the deep ensemble with knowledge distillation might lose the multimodality that is the main reason for its successful results. Correspondingly, because the baseline approach is a relatively simple and scalable method that requires no change in the model architecture, the compression method should also be as simple and scalable as the deep ensemble.
It is already known that model pruning [13][14][15][16][17] involves a tradeoff between accuracy and size. This study analyzes the tradeoffs between size, accuracy, and uncertainty estimation using a simple strategy comprising pruning and quantization with a deep ensemble (called a compressed deep ensemble).
Pruning sacrifices the long-tail part of the training dataset. This part of the dataset is difficult for both humans and models to predict because it contains noise-contaminated, multiple, or wrongly labeled objects [17]. If pruning identifies such instances, this is a favorable feature from an uncertainty perspective. Moreover, sparse initialization helps achieve greater diversity among units at initialization time [18]. Thus, we believe that model pruning increases the weights of the remaining non-zero parameters, and that such increased weights could positively affect the function space diversity.

Problem Setup and High-Level Summary
We assume that the training dataset comprises N data points {x, y}, where x ∈ R^d represents d-dimensional features. For classification problems, the label y ∈ {1, …, K} is assumed to be one of K classes. We used a neural network to model the predictive probability p_θ(y | x) over the labels, where θ denotes the neural network's parameters. In this study, we created an ensemble that comprises M independent neural networks.

Deep Ensemble
The deep ensemble consists of three simple recipes for measuring uncertainty. The first recipe is selecting a proper scoring rule as the training criterion. The second recipe is using independent, randomly initialized networks with a shared architecture. As the third recipe, the authors of [5] introduced adversarial training to smooth the predictive distributions, but it was not as effective as the two recipes mentioned above.
Popular loss functions meet the condition for a proper scoring rule: negative log-likelihood (NLL) for both regression and classification, root mean square error (RMSE) for regression, and Brier score for classification are all examples of proper scoring rules. Although the deep ensemble was theoretically motivated by the bootstrap, a base learner trained on a bootstrap sample sees only about 63% of the unique data points, so training each member on the entire dataset is preferable [19]. Ovadia et al. [6] and others [20][21][22][23] demonstrated that a deep ensemble consistently provides more reliable and practically useful uncertainty estimates. In the deep ensemble, all members are treated as a uniformly weighted mixture model, and the final prediction is the average of the members' predictions.
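The uniform averaging and two of the scoring rules named above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the array layout (members, samples, classes), the toy numbers, and the Brier convention (sum over classes, mean over samples) are all assumptions made for illustration.

```python
import numpy as np

def ensemble_predict(member_probs):
    """Uniformly weighted mixture: average the members' class probabilities.

    member_probs has shape (M, N, K): members, samples, classes (assumed layout).
    """
    return np.mean(member_probs, axis=0)

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))

def brier_score(probs, labels):
    """Brier score: squared error to the one-hot label, summed over classes."""
    one_hot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - one_hot) ** 2, axis=1)))

# Toy example: M = 2 members, N = 2 samples, K = 2 classes (made-up numbers).
member_probs = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
])
labels = np.array([0, 1])

mean_probs = ensemble_predict(member_probs)
ensemble_nll = nll(mean_probs, labels)
ensemble_brier = brier_score(mean_probs, labels)
```

Both scores are computed on the averaged prediction rather than on individual members, which matches the uniformly weighted mixture described above.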

Model Pruning and Quantization
Model pruning decreases the model size by removing irrelevant parameters, which increases the sparsity of the parameters. In conventional pruning, the trained model is iteratively pruned until it attains the desired sparsity, and after pruning, the model with the remaining parameters is re-trained. If the neural network model is a function f(x; W), then the pruned model is a function f(x; W ⨀ M), where M ∈ {0, 1}^|W| is a binary pruning mask that removes parameters. In practice, the pruned parameters of W are either set to zero or removed entirely. Multiplications that involve zero weights are redundant and can be skipped; as a result, the computational footprint also decreases. Blalock et al. [15] empirically showed that a large, sparse model performs better than a small, dense model. Model quantization is a model compression technique that converts the parameter representation from 32-bit floating point to 8 bits or fewer. Quantization is complementary to pruning techniques and typically has little effect on accuracy. Thus, we pruned and quantized the deep ensemble to counter its linearly increasing size.
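As a rough illustration of the masking f(x; W ⨀ M) and of 8-bit quantization, the following NumPy sketch prunes a single weight matrix in one shot by magnitude and then applies symmetric linear quantization. It is a simplification under stated assumptions: the function names are hypothetical, the iterative pruning schedule and re-training described above are omitted, and the symmetric int8 scheme is only one of several possible quantization schemes.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the `sparsity` fraction of smallest-magnitude weights."""
    k = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=weights.dtype)
    if k > 0:
        smallest = np.argsort(np.abs(weights).ravel(), kind="stable")[:k]
        mask[smallest] = 0
    return mask.reshape(weights.shape)

def quantize_int8(weights):
    """Symmetric linear quantization to int8; returns the codes and the scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float32 weights."""
    return codes.astype(np.float32) * scale

# Toy weight matrix (random values, for illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)

mask = magnitude_mask(W, sparsity=0.75)   # plays the role of M in W ⨀ M
W_pruned = W * mask                       # element-wise product
codes, scale = quantize_int8(W_pruned)
W_restored = dequantize(codes, scale)
```

Pruned weights remain exactly zero after quantization because zero maps to the int8 code 0, so the two steps compose without interfering with each other.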

Evaluation Metrics
We evaluated the standard deep ensemble as a baseline model and measured the accuracy, NLL, Brier score, zipped model size (as the memory footprint), and compression ratio on the in-distribution data.
There is no ground truth in the out-of-distribution data; thus, we cannot measure accuracy on these data. Furthermore, there is no standard uncertainty metric. Hence, we evaluated entropy, disagreement, and the confidence curve proposed in [6] as uncertainty metrics. As proposed in [5], ensemble disagreement measures the Kullback-Leibler divergence between each member network's prediction and the mean prediction.
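The two label-free metrics can be sketched in NumPy as follows. The function names and toy inputs are hypothetical, and summing the per-member KL divergences into a single disagreement value per sample is an assumption about how the member-wise divergences are aggregated.

```python
import numpy as np

def predictive_entropy(mean_probs, eps=1e-12):
    """Entropy of the averaged predictive distribution, one value per sample."""
    return -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)

def disagreement(member_probs, eps=1e-12):
    """Sum over members of KL(member prediction || mean prediction), per sample.

    member_probs has shape (M, N, K): members, samples, classes (assumed layout).
    """
    mean_probs = member_probs.mean(axis=0, keepdims=True)
    log_ratio = np.log(member_probs + eps) - np.log(mean_probs + eps)
    kl_per_member = np.sum(member_probs * log_ratio, axis=-1)
    return kl_per_member.sum(axis=0)

# Two members, one sample, two classes (made-up numbers).
agree = np.array([[[0.9, 0.1]], [[0.9, 0.1]]])   # identical predictions
split = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])   # opposite predictions

d_agree = disagreement(agree)
d_split = disagreement(split)
h_split = predictive_entropy(split.mean(axis=0))
```

When the members agree, the disagreement is zero regardless of how confident they are; when they split, both the disagreement and the entropy of the mean prediction rise, which is the behavior the out-of-distribution evaluation relies on.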

Experimental Setup
We first trained the deep ensemble using the MNIST [24] handwritten digits dataset. The hyperparameters and model summary are listed in Table 1. Subsequently, we applied magnitude-based iterative pruning and 8-bit quantization to each member of the deep ensemble at various sparsity levels (25%, 50%, 75%, and 95%) using the TensorFlow [25] Model Optimization Toolkit. We used 10,000 test samples from the MNIST dataset and 18,724 test samples from the NotMNIST [26] alphabet dataset as out-of-distribution data to evaluate the uncertainty.
<Table 1> Model summary and hyperparameters

Experimental Results
Table 2 presents the evaluation results on the MNIST dataset. Pruning with 25% sparsity and quantization increased the accuracy by 0.04% and decreased the entire ensemble size by 4×. However, pruning with 50% and 75% sparsity and quantization slightly reduced the accuracy, by 0.01% and 0.05%, while decreasing the ensemble size by 5× and 8×, respectively. Pruning with 95% sparsity degrades accuracy by ~1%.
Figure 1 presents the confidence curves of the in-distribution and out-of-distribution data. Confidence on MNIST (Figure 1(a)) is higher than that on NotMNIST (Figure 1(b)): at a confidence threshold of τ = 90%, the ensemble classifies more than 90% of the MNIST samples but only 16-32% of the NotMNIST samples. An increase in sparsity was associated with a decrease in confidence on both test datasets.
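One plausible way to compute such a curve is to take the maximum class probability of the averaged prediction as the confidence and report the fraction of samples whose confidence reaches each threshold τ. The sketch below assumes this definition; the function name and the toy probabilities are hypothetical and are not taken from the experiments.

```python
import numpy as np

def fraction_above_confidence(mean_probs, thresholds):
    """Fraction of samples whose top-class probability is at least each threshold."""
    confidence = mean_probs.max(axis=1)
    return np.array([(confidence >= t).mean() for t in thresholds])

# Averaged ensemble predictions for four samples (made-up numbers).
mean_probs = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.10, 0.90],
    [0.55, 0.45],
])
thresholds = np.array([0.5, 0.9])

curve = fraction_above_confidence(mean_probs, thresholds)
```

For out-of-distribution inputs, a lower fraction at high thresholds is the desirable outcome, since it means the ensemble declines to make confident predictions on data it was not trained on.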
The evaluation results of the out-of-distribution data are presented in Figure 2. All uncertainty metrics increased as sparsity increased; thus, pruning and quantization made the deep ensemble more robust in unseen data, especially in test data that comes from another distribution. The predictive distribution moves toward a uniform distribution as uncertainty increases, meaning that the model is closer to random guessing. However, the ensemble's generalization error is lower than that of a single neural network, and it generalizes well in unseen data.
Baseline evaluation metrics did not change significantly for the deep ensemble after pruning and quantization. The size of the deep ensemble decreased by an order of magnitude, and the deep ensemble with 50% of its parameters pruned and quantized was almost the same size as a single model.
<Table 2> Accuracy results of the MNIST dataset

Discussion
We empirically showed that applying pruning and quantization to the deep ensemble decreased the ensemble size and increased the uncertainty metrics, with a slight accuracy loss that depends on the sparsity level. Pruning with 25-75% sparsity and quantization successfully addresses the problem associated with the linear increase in the size of the deep ensemble and shows solid performance. However, pruning with 95% sparsity noticeably degrades the accuracy.
There is a tradeoff between a slight accuracy degradation and improvements in the uncertainty and memory footprint metrics. The pruned and quantized deep ensemble makes less confident predictions and generalizes well, as reflected in the increased uncertainty metrics. Thus, pruning and quantization make the deep ensemble applicable to memory-intensive tasks on Internet-of-Things or mobile devices, while the increased uncertainty metrics make the deep ensemble more robust under distribution shift.
Pruning after initialization should be studied further by applying the proposed strategy to real-world tasks. Moreover, creating a diverse deep ensemble by applying pruning to a single pre-trained model could be another possible research direction.
Proceedings of the 2021 Online Spring Conference, Vol. 28, No. 1 (May 2021)

Figure 1(a) Confidence curve of the in-distribution data.

Figure 1(b) Confidence curve of the out-of-distribution data.