Article

Uncertainty Quantification Based on Block Masking of Test Images

Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 106, Taiwan
*
Author to whom correspondence should be addressed.
Information 2025, 16(10), 885; https://doi.org/10.3390/info16100885
Submission received: 11 August 2025 / Revised: 9 October 2025 / Accepted: 10 October 2025 / Published: 11 October 2025
(This article belongs to the Special Issue Machine Learning and Data Mining for User Classification)

Abstract

In image classification tasks, models may occasionally produce incorrect predictions, which can lead to severe consequences in safety-critical applications. For instance, if a model mistakenly classifies a red traffic light as green, it could result in a traffic accident. Therefore, it is essential to assess the confidence level associated with each prediction. Predictions accompanied by high confidence scores are generally more reliable and can serve as a basis for informed decision-making. To address this, the present paper extends the block-scaling approach—originally developed for estimating classifier accuracy on unlabeled datasets—to compute confidence scores for individual samples in image classification. The proposed method, termed block masking confidence (BMC), applies a sliding mask filled with random noise to occlude localized regions of the input image. Each masked variant is classified, and predictions are aggregated across all variants. The final class is selected via majority voting, and a confidence score is derived based on prediction consistency. To evaluate the effectiveness of BMC, we conducted experiments comparing it against Monte Carlo (MC) dropout and a vanilla baseline across image datasets of varying sizes and distortion levels. While BMC does not consistently outperform the baselines under standard (in-distribution) conditions, it shows clear advantages on distorted and out-of-distribution (OOD) samples. Specifically, on the level-3 distorted iNaturalist 2018 dataset, BMC achieves a median expected calibration error (ECE) of 0.135, compared to 0.345 for MC dropout and 0.264 for the vanilla approach. On the level-3 distorted Places365 dataset, BMC yields an ECE of 0.173, outperforming MC dropout (0.290) and vanilla (0.201). For OOD samples in Places365, BMC achieves a peak entropy of 1.43, higher than the 1.06 observed for both MC dropout and vanilla. Furthermore, combining BMC with MC dropout leads to additional improvements. On distorted Places365, the median ECE is reduced to 0.151, and the peak entropy for OOD samples increases to 1.73. Overall, the proposed BMC method offers a promising framework for uncertainty quantification in image classification, particularly under challenging or distribution-shifted conditions.

1. Introduction

Supervised learning is a core area of machine learning, primarily used to predict labels (or classes) for previously unseen data samples [1]. Before a model can be deployed, it must be trained. In a typical setup, a portion of the labeled dataset is allocated for training, while the remaining samples are reserved for testing and evaluating the model’s performance. This approach commonly assumes that real-world, unlabeled data shares a similar distribution with the training data. Under this assumption, strong test accuracy provides confidence that the model will generalize well and remain effective in practical applications.
In many practical applications—particularly in image recognition—the assumption that test samples follow the same (or similar) distribution as training samples often breaks down due to the dynamic and diverse nature of real-world environments. For instance, if a model is trained solely on images of white cats and gray dogs, as illustrated in Figure 1a, it may incorrectly classify a gray cat (Figure 1b) as a dog. From the model’s perspective, white and gray cats exhibit different distributions, despite belonging to the same class.
In many scenarios, users cannot control the diversity of input data or restrict the training set to specific types of images, such as white cats and gray dogs. This leads to a challenge known as the open set recognition (OSR) problem [2]. A conventional trained classifier assigns a label to every input sample, even if it does not belong to any of the known categories. As a result, the model tends to select the closest matching class instead of expressing uncertainty, which can undermine precision and degrade performance. For example, a white bear might be misclassified as a cat simply because both are white.
To address this issue, it becomes essential for models to provide confidence scores alongside classification results. A high confidence score suggests the prediction is reliable and can be accepted. Conversely, a low confidence score may signal the need for post-processing strategies—such as reclassification using alternative models or rejecting the sample outright as an outlier to known classes. The science of characterizing and quantifying uncertainties in computational models and predictions is known as uncertainty quantification (UQ).
Confidence estimation for input samples is an essential component of modern machine learning. Several approaches have been proposed to compute confidence scores, including the maximum softmax value (referred to as the vanilla method) [3], temperature scaling [4], deep ensemble [5], stochastic variational inference (SVI) [6], and Monte Carlo (MC) dropout [7]. Each method presents its own trade-offs. For example, SVI requires specialized network architectures that are not widely adopted, making implementation more complex—even when source code is available. Temperature scaling relies on training a separate parameter and has shown limited performance in comparative studies [8], including on standard datasets like MNIST [9].
Although deep ensemble methods can leverage conventional architectures, they require multiple models to average predictions [5], making confidence estimation computationally expensive and resource-intensive because several models must be trained. Furthermore, this is problematic for users who rely on readily available pretrained models, as training multiple instances of the same architecture solely for confidence scoring may be impractical, especially in resource-constrained settings.
This study addresses the challenge of computing confidence scores using a single, possibly pretrained model, similar to the MC dropout method [7]. Our focus is on image recognition, a domain with wide-ranging applications—from traffic sign and road condition analysis in autonomous driving [10] to facial recognition in security systems [11] and even medical diagnosis via imaging [12].
In image recognition, test datasets may be categorized as
  • In-distribution: Samples closely resemble the training data (e.g., images of white cats or gray dogs in the previous example).
  • Distorted in-distribution (distribution shift): Samples exhibit visual perturbations such as blur or color shift (e.g., gray cats viewed as distorted variants).
  • Out-of-distribution (OOD): Samples differ entirely from training data, including unseen categories (e.g., white bears) or synthetic noise.
In our previous work [13], we introduced the Block Scaling Quality (BSQ) framework for estimating classification accuracy on datasets without ground truth via input perturbation. The method modifies input spectrograms by multiplying values within a block by a constant, and computes the ratio of perturbed inputs that yield different predictions. This ratio reflects the importance of specific regions and is inversely correlated with dataset accuracy. By applying this procedure to labeled datasets, a linear regression model can be constructed. The accuracy of a new, unlabeled dataset can then be estimated by applying its variation ratio to this model. BSQ demonstrated low estimation errors across various audio datasets.
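For illustration, the BSQ procedure can be sketched as follows; the helper names (variation_ratio, model_predict, estimate_accuracy) are hypothetical, and a single block position is used here for brevity, whereas [13] aggregates over many block positions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variation_ratio(model_predict, spectrograms, block, scale=2.0):
    """Fraction of block-scaled inputs whose predicted class changes (sketch).

    `model_predict` maps one spectrogram of shape (H, W, C) to a class index;
    `block` is (top, left, height, width).
    """
    top, left, h, w = block
    changed = 0
    for spec in spectrograms:
        base = model_predict(spec)
        perturbed = spec.copy()
        # Multiply the values inside the block by a constant factor.
        perturbed[top:top + h, left:left + w, :] *= scale
        changed += int(model_predict(perturbed) != base)
    return changed / len(spectrograms)

def estimate_accuracy(labeled_ratios, labeled_accuracies, unlabeled_ratio):
    """Fit the linear model on labeled datasets and estimate the accuracy of
    an unlabeled dataset from its variation ratio."""
    reg = LinearRegression().fit(np.asarray(labeled_ratios).reshape(-1, 1),
                                 np.asarray(labeled_accuracies))
    return float(reg.predict([[unlabeled_ratio]])[0])
```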
While the original BSQ method was designed to estimate dataset-level accuracy, it produced a single value per dataset and did not address the reliability of individual predictions. In this paper, we extend the block scaling concept with a new objective: computing confidence scores for individual samples. Ground-truth labels are used solely for evaluating performance metrics—such as cumulative accuracy—which will be discussed in Section 3.2.1. Crucially, ground-truth labels are not required during deployment, making the proposed method well-suited for real-world applications where labeled data may be unavailable.
The main contributions of this paper are
  • Extending the block-scaling approach to compute confidence scores: We modify the original BSQ method to compute confidence scores for each input sample. Unlike deep ensembles [5] or Bayesian neural networks (BNNs) [6], our approach requires only a single model and is compatible with pretrained networks.
  • Experimental comparison: We evaluate the proposed approach against the MC dropout and vanilla approaches on in-distribution, distorted, and OOD datasets. Empirical results show that our method offers superior performance under distortion and OOD conditions.
The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes the proposed approach. Section 4 covers the comparison metrics, the chosen dataset, experiments, and results. Finally, Section 5 presents the conclusion and future work.

2. Related Work

This section reviews the key literature related to prediction confidence estimation, commonly referred to as uncertainty quantification. The most basic method is the vanilla approach [3], which directly interprets softmax outputs as confidence scores and requires no additional computation. However, due to its limitations in reliability, several techniques have been proposed to improve confidence estimation. These include temperature scaling [4], deep ensembles [5], BNNs [6], and MC dropout [7]. In addition to these methods, two comprehensive review papers [8,14] are included to provide a broader context. Finally, one study addressing uncertainty quantification in regression tasks is also discussed [15].
Hendrycks and Gimpel [3] investigated the use of the maximum softmax value (vanilla) as a baseline for detecting misclassified and OOD samples. Their study employed two widely used evaluation metrics: the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision–Recall Curve (AUPR). Results showed that the vanilla baseline performed effectively across a variety of test datasets. Accordingly, this method is included as a comparison target in our subsequent experimental analysis.
Guo et al. [4] found that neural networks often exhibit poor calibration, with predictive confidence misaligned from true accuracy. Their analysis revealed that factors such as network depth, the number of hidden units, and the use of batch normalization contribute to this miscalibration. To address this issue, they proposed a simple yet effective post-processing technique known as temperature scaling, which applies a single-parameter adjustment to the logits to calibrate the predicted probabilities.
Lakshminarayanan et al. [5] introduced the deep ensemble approach, which employs multiple identical neural networks to generate independent predictions. The confidence score for a given input is obtained by averaging the softmax outputs across the ensemble. According to comparative analyses [8], this method demonstrates relatively stronger performance in both classification accuracy and uncertainty quantification. A conceptually related approach was proposed by You et al. [16], whose primary objective was to estimate model prediction accuracy over a test set without access to ground-truth labels.
The Bayesian neural network (BNN) [6] differs fundamentally from conventional neural networks in how it models uncertainty. While traditional neural networks learn deterministic weight parameters through optimization, BNNs treat each connection weight as a probability distribution, typically over plausible values conditioned on the training data. As a result, repeated predictions on the same input yield varying outputs, reflecting epistemic uncertainty. This stochasticity is analogous to ensembling multiple models, and it allows for the computation of confidence scores by aggregating outputs from multiple forward passes.
To estimate the confidence level of an unlabeled sample, Gal and Ghahramani [7] employed MC dropout with rate p on a trained neural network. This technique introduces stochasticity by randomly deactivating units and their corresponding weights during inference, effectively sampling from an approximate posterior over the model parameters. By performing multiple stochastic forward passes, an ensemble of predictive distributions is generated. Similarly to deep ensembles, the confidence score is computed by averaging the softmax outputs across these passes, thereby capturing predictive uncertainty in the absence of ground-truth labels.
Ovadia et al. [8] conducted an extensive comparative study on the robustness of various uncertainty estimation techniques, including deep ensembles, Bayesian neural networks, and standard (vanilla) models, under dataset shift scenarios. Their findings indicate that deep ensembles, as previously described, outperform other methods across several performance and calibration metrics when confronted with distributional changes. Among the studied techniques, temperature scaling showed limited performance under dataset shift. Some approaches necessitate model re-training or architectural modifications, rendering them unsuitable for pre-trained networks. Consequently, vanilla and MC dropout are the only methods among those evaluated that can be directly applied to a pre-trained model, and they are therefore adopted as comparison targets in this work.
Abdar et al. [14] provided a comprehensive survey of recent advancements in UQ techniques, with particular emphasis on Bayesian-based and deep ensemble-based methods. Their work also reviewed the application of these approaches in reinforcement learning settings and identified key research challenges and future directions for the development of UQ frameworks.
Kristoffersson Lind et al. [15] examined various metrics for uncertainty quantification (UQ) in regression tasks, including area under the sparsification error curve, calibration error, Spearman’s rank correlation, and negative log-likelihood. Using multiple datasets, they found calibration error to be the most effective overall, while noting that other metrics remain valuable depending on the specific context.

3. Proposed Approach, Metrics, and Datasets

This section begins by introducing the proposed method, block masking confidence (BMC). It then describes the evaluation metrics used in the analysis. Finally, it presents the datasets selected for comparative study.

3.1. Proposed Approach

Existing approaches to confidence estimation, such as deep ensembles and MC dropout, derive confidence scores by averaging multiple classification outputs for a given input sample. Specifically, let the softmax output from model $i$ be denoted as $\hat{Y}_i \in \mathbb{R}^{n_C}$, where $n_C$ is the total number of classes. The average of $\hat{Y}_i$ across models then serves as the estimated class-wise prediction confidence. In deep ensemble methods, multiple models are trained individually; in MC dropout, stochastic predictions are produced by randomly deactivating internal neurons during inference. Both were discussed in Section 2. In contrast, this paper proposes a novel method that generates multiple classification outcomes by applying localized perturbations directly to the input image, an approach conceptually similar to dropout but operating at the input level.
In the following, lowercase variables denote scalars, while uppercase variables represent matrices. Lowercase subscripts indicate indices, whereas uppercase subscripts correspond to constants. For notational shorthand, abbreviations using two or more capital letters—such as CS for confidence score—are adopted in equations.
As shown in Figure 2, the proposed approach consists of the following steps:
  • Generation of image mask. We define a patch mask $M \in \mathbb{R}^{h_M \times w_M \times 3}$, where $h_M$ and $w_M$ denote the mask dimensions. Each pixel in the mask is independently sampled from a uniform distribution across the red, green, and blue channels, allowing diverse contextual substitution when overlaid onto the original image. Typically, the mask size is approximately one-tenth the size of the input image. Empirical observations suggest that mask size has a limited impact on final performance. In contrast to our previous BSQ method [13], where pixel values within the patch were scaled by a constant factor (e.g., multiplied by 2), experimental results show that using a patch of random noise offers greater robustness against complex image content.
  • Construction of masked images. For a single test image $I$, we generate $n_M$ variants $\tilde{I}_s$ by sliding the mask $M$ from the bottom-left to the top-right of the image. Specifically, let the bottom-left corner of the mask be located at $(i_M, j_M)$ for the $s$-th masked variant. The pixel values of $\tilde{I}_s$ are then computed as
    $$\tilde{I}_s(i, j, \cdot) = \begin{cases} M(i - i_M,\ j - j_M,\ \cdot), & i_M \le i < i_M + h_M \ \text{and}\ j_M \le j < j_M + w_M, \\ I(i, j, \cdot), & \text{otherwise.} \end{cases}$$
  • Classification of generated images. Each masked image $\tilde{I}_s$ is passed through the classification model to obtain a softmax output $\hat{Y}_s \in \mathbb{R}^{n_C}$, where $n_C$ is the total number of classes. Overall, there are $n_M$ such outputs $\hat{Y}_s$.
  • Majority voting for class prediction. Initially, let $B(l) = 0$ for $1 \le l \le n_C$. For each prediction, let the predicted class be $c = \arg\max_l \hat{Y}_s(l)$; we then update the vote count $B$ such that $B(c) \leftarrow B(c) + 1$. The final class label for the original image $I$ is determined by majority vote as $c_x = \arg\max_l B(l)$.
  • Confidence score computation. This step introduces a new component not present in the original BSQ framework, which was not designed to compute confidence scores. In the proposed method, we compute the mean $\mu$ and standard deviation $\sigma$ of the softmax scores for class $c_x$ over the $n_M$ masked images, where $\mu = \frac{1}{n_M} \sum_{i=1}^{n_M} \hat{Y}_i(c_x)$ and $\sigma = \sqrt{\frac{1}{n_M} \sum_{i=1}^{n_M} \big( \hat{Y}_i(c_x) - \mu \big)^2}$. Here, $\mu$ reflects the model's average confidence, while $\sigma$ captures its sensitivity to localized perturbations. A low $\sigma$ implies stable predictions; a high $\sigma$ suggests the model relies on spatially localized features. To map these values to a normalized confidence score in [0, 1], we apply a scaling function based on the arctangent of the inverse standard deviation. The final confidence score is calculated as
    $$CS(\alpha, \mu, \sigma) = \alpha \cdot \mu + (1 - \alpha) \cdot \frac{2}{\pi} \cdot \arctan\!\left( \frac{1}{\sigma} \right),$$
    where $\alpha \in [0, 1]$ is a weighting factor balancing average confidence and sensitivity-based adjustment. A minimal code sketch of the complete procedure is provided after this list.
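The sketch below, in Python with PyTorch, walks through the masking, classification, voting, and scoring steps for a single test image. The function name bmc_confidence and the default mask size, hop size, and $\alpha$ value are illustrative assumptions rather than the exact settings of our experiments (the configuration actually used is documented in Table 8).

```python
import numpy as np
import torch

def bmc_confidence(model, image, mask_h=22, mask_w=22, hop=10, alpha=0.5, device="cpu"):
    """Minimal sketch of block masking confidence (BMC) for one test image.

    `image` is a preprocessed float tensor of shape (3, H, W); `model` returns
    class logits. Mask size, hop size, and alpha are illustrative defaults.
    """
    model.eval()
    _, H, W = image.shape
    outputs = []
    with torch.no_grad():
        for top in range(0, H - mask_h + 1, hop):
            for left in range(0, W - mask_w + 1, hop):
                masked = image.clone()
                # Steps 1-2: overwrite one block with uniform random noise in [0, 1);
                # adjust to the model's input normalization if necessary.
                masked[:, top:top + mask_h, left:left + mask_w] = torch.rand(
                    3, mask_h, mask_w, device=image.device)
                # Step 3: classify the masked variant.
                probs = torch.softmax(model(masked.unsqueeze(0).to(device)), dim=1)
                outputs.append(probs.squeeze(0).cpu().numpy())
    Y = np.stack(outputs)                                   # shape (n_M, n_C)
    # Step 4: majority vote over the per-variant argmax predictions.
    votes = np.bincount(Y.argmax(axis=1), minlength=Y.shape[1])
    c_x = int(votes.argmax())
    # Step 5: combine mean confidence and sensitivity for the voted class.
    mu, sigma = Y[:, c_x].mean(), Y[:, c_x].std()
    score = alpha * mu + (1 - alpha) * (2 / np.pi) * np.arctan(1 / (sigma + 1e-12))
    return c_x, float(score)
```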
It is important to note that the proposed BMC approach is model-agnostic and can be applied on top of any classifier, whether pretrained or trained from scratch. As a result, specific model architectures or training algorithms are not detailed in this section. Additional information about the models used in the experimental evaluation will be provided in Table 7.

3.2. Evaluation Metrics

Evaluation metrics are used to assess whether uncertainty quantification accurately reflects the relationship between a model’s confidence scores and actual outcomes. Below, we discuss the metrics used in this paper.

3.2.1. Accuracy vs. Confidence Curve

The accuracy vs. confidence curve is a commonly used tool in machine learning and statistical analysis for evaluating how well a model's prediction confidence aligns with its empirical accuracy. In this curve, the confidence denotes the probability score a model assigns to its prediction (e.g., a score of 0.6 reflects a 60% belief that the prediction is correct). The accuracy represents the proportion of correct predictions among samples whose confidence meets or exceeds a given threshold. The curve plots cumulative accuracy across various confidence levels to reveal calibration behavior.
For binary classification, the x-axis corresponds to the confidence threshold $\tau$, while the y-axis denotes the cumulative accuracy (CA), computed as
$$CA(\tau) = \frac{\sum_{i=1}^{n_T} \mathbf{1}\big( p(y_i \mid X_i) \ge \tau,\ y_i = \hat{y}_i \big)}{\sum_{i=1}^{n_T} \mathbf{1}\big( p(y_i \mid X_i) \ge \tau \big)},$$
where $n_T$ is the total number of test samples, $\mathbf{1}(\cdot)$ is an indicator function returning 1 if the condition holds and 0 otherwise, $p(y_i \mid X_i)$ is the model's predicted probability for the true class $y_i \in \{0, 1\}$ given input $X_i$, and $\hat{y}_i \in \{0, 1\}$ is the predicted class label for sample $X_i$. In the actual implementation for multi-class classification scenarios, $p(y_i \mid X_i)$ is the softmax value of the true class $l$, i.e., $p(y_i \mid X_i) = \hat{Y}_i(l)$. In our experimental analysis, we focus on the high-confidence region of the accuracy–confidence curve to assess model reliability. For a given confidence score, a higher corresponding accuracy indicates stronger predictive calibration and is therefore preferable.
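As a concrete illustration, the curve can be computed from per-sample confidences and correctness flags as in the following sketch; the function name and array-based interface are assumptions made for illustration.

```python
import numpy as np

def cumulative_accuracy_curve(confidences, correct, thresholds):
    """Cumulative accuracy CA(tau) for each threshold: accuracy over the
    samples whose confidence is at least tau (NaN if none qualify)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    curve = []
    for tau in thresholds:
        keep = confidences >= tau
        curve.append(correct[keep].mean() if keep.any() else np.nan)
    return np.asarray(curve)
```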

3.2.2. Expected Calibration Error (ECE)

Expected calibration error (ECE) quantifies how well a model's predicted confidence scores align with true outcomes [17]. Ideally, a confidence score of 60% should correspond to a 60% chance that the associated prediction is correct. To compute ECE, model predictions are partitioned into $n_B$ bins based on confidence scores. Each bin $B_i$, where $1 \le i \le n_B$, contains samples with confidence values falling within the interval $\left( \frac{i-1}{n_B}, \frac{i}{n_B} \right]$. For each bin, let $conf(B_i)$ denote the average confidence and $acc(B_i)$ the empirical accuracy. Then, ECE is defined as:
$$ECE = \sum_{i=1}^{n_B} \frac{|B_i|}{n_T} \big| acc(B_i) - conf(B_i) \big|,$$
where $|B_i|$ is the number of samples in bin $B_i$, and $n_T$ is the total number of test samples.
A lower ECE value indicates better calibration—i.e., the model’s confidence scores more accurately reflect the likelihood of correctness. However, it is important to note that a lower ECE does not necessarily imply higher classification accuracy; calibration and accuracy are related but distinct properties [18].
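A minimal implementation of ECE with the equal-width bins described above might look as follows; the function name and interface are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins ((i-1)/n_B, i/n_B]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n_total = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()          # empirical accuracy in the bin
            conf = confidences[in_bin].mean()     # average confidence in the bin
            ece += (in_bin.sum() / n_total) * abs(acc - conf)
    return ece
```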

3.2.3. Brier Score

The Brier score (BS) [19] is a metric for evaluating the accuracy of probabilistic predictions in both binary and multi-class classification settings. It computes the mean squared error between the predicted probability distributions and the true class labels.
Let $p_{i,l}$ denote the predicted probability that sample $X_i$ belongs to class $l$, and let $y_{i,l} \in \{0, 1\}$ indicate whether $X_i$ truly belongs to class $l$ (i.e., $y_{i,l} = 1$ for the correct class, and 0 otherwise). The Brier score is defined as:
$$BS = \frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{l=1}^{n_C} (p_{i,l} - y_{i,l})^2.$$
In practice, the model output $\hat{Y}_i(l)$ is used in place of $p_{i,l}$, representing the softmax probability for class $l$ on sample $X_i$. Similarly to ECE, lower Brier scores indicate better calibrated and more accurate probabilistic predictions. However, BS simultaneously reflects both calibration and sharpness, making it a more holistic measure in some evaluation contexts.
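As a short illustration, the Brier score can be computed from the softmax matrix and integer class labels as in the sketch below (function name and interface assumed).

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean over samples of the squared error between
    each softmax row of a (n_T, n_C) matrix and the one-hot true label."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))
```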

3.2.4. Negative Log-Likelihood

Negative log-likelihood (NLL) [20] is a widely used loss function in classification tasks and probabilistic modeling. It quantifies how well the predicted probability distribution aligns with the true class labels, with the objective of maximizing the likelihood of the observed data under the model’s distribution.
In the multi-class setting, the NLL is calculated based on the predicted probability assigned to the correct class label for each sample. Let $p(y_{i,j} = 1 \mid X_i)$ be the predicted probability that sample $X_i$ belongs to the true class $j$. Then, the NLL is defined as:
$$NLL = -\frac{1}{n_T} \sum_{i=1}^{n_T} \log p\big( y_{i,j} = 1 \mid X_i \big).$$
In practice, the model typically uses the softmax activation function for outputs $\hat{Y}_i$. Therefore, for each sample $X_i$, the term $p(y_{i,j} = 1 \mid X_i)$ is taken as $\hat{Y}_i(j)$, where $j$ corresponds to the true label.
Similarly to metrics like ECE and the BS, a lower NLL indicates better predictive performance. However, unlike BS, which reflects both calibration and sharpness, NLL is directly tied to likelihood optimization and penalizes overconfident incorrect predictions more severely.
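The corresponding NLL computation is equally compact; as before, the function name and interface are assumptions, and a small epsilon guards against taking the logarithm of zero.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    true_probs = probs[np.arange(len(labels)), np.asarray(labels)]
    return float(-np.mean(np.log(true_probs + eps)))
```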

3.2.5. Entropy

Entropy (EN), rooted in information theory [21], quantifies the uncertainty inherent in a probability distribution. Higher entropy values indicate greater unpredictability or “disorder” in model predictions. In the context of OOD detection, it is desirable for the confidence scores assigned to all classes to be nearly uniform, signifying that the model cannot make a definitive prediction and is appropriately expressing uncertainty.
For each test sample $i$, entropy is computed as
$$EN_i = -\sum_{l=1}^{n_C} p_{i,l} \log p_{i,l},$$
where $n_C$ is the number of classes and $p_{i,l} = \hat{Y}_i(l)$ denotes the predicted softmax probability for class $l$ on sample $X_i$.
A higher entropy value implies that the prediction distribution is flatter and less confident—an outcome that is preferable for identifying OOD samples. This metric serves as an effective proxy for gauging model uncertainty in scenarios where the input may not belong to any of the known classes.
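Per-sample predictive entropy can be computed as in the following sketch (function name assumed; entropy is reported in nats, matching the natural logarithm used above).

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each predicted distribution in a (n_T, n_C) softmax matrix."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)
```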

3.2.6. Using the Metrics in Experiments

Table 1 provides an overview of all evaluation metrics used in this study. A notable limitation emerges for metrics that inherently rely on probabilistic distributions—specifically, BS, NLL, and entropy. These metrics are not applicable to the proposed confidence score, which is constructed from the mean and standard deviation of softmax outputs. Because this representation deviates from a probability distribution, it breaks the probabilistic assumptions underlying those metrics.
This limitation affects the final two steps of the proposed method, where the derived confidence scores no longer retain a direct probabilistic interpretation. Unlike raw softmax probabilities, which sum to one and represent a categorical distribution, the mean–standard deviation formulation represents a statistical summary that cannot be meaningfully evaluated using metrics designed for full distributions.
In this study, the accuracy vs. confidence curve and ECE are employed as the primary evaluation metrics. These metrics do not assume probabilistic normalization (such as softmax) of model outputs, making them well-suited for assessing the reliability of the proposed confidence scores. Importantly, both metrics evaluate whether high-confidence predictions correspond to high empirical accuracy, providing meaningful insight into the effectiveness of the uncertainty quantification.
Although Brier Score and NLL are not directly applicable to the proposed confidence scores—given that they require probabilistic distributions—they offer valuable complementary perspectives. To facilitate comparison using these metrics, only the first three steps of the proposed method are carried out, and the average softmax output across masked images is used in the computation. These metrics are thus categorized as secondary and used to provide a broader analysis of model behavior.
For OOD samples, where ground-truth labels are unavailable, entropy is adopted as the evaluation criterion. For the same reason as in Brier score and NLL, we choose to apply it to softmax outputs derived from steps 1–3 of our approach to maintain a fair basis for comparison with other methods.
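Under this protocol, the per-mask softmax outputs from steps 1–3 (an array of shape $(n_M, n_C)$, as in the BMC sketch above) are simply averaged into one categorical distribution before BS, NLL, or entropy is computed; the helper below is an illustrative assumption.

```python
import numpy as np

def averaged_softmax(masked_probs):
    """Average per-mask softmax outputs of shape (n_M, n_C) into a single
    categorical distribution, comparable to MC dropout averaging."""
    return np.asarray(masked_probs, dtype=float).mean(axis=0)
```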

3.3. Experimental Datasets

This subsection introduces the datasets used in the experiments. From the selected datasets, distorted in-distribution variants are generated to simulate distribution shift. Additionally, OOD datasets are incorporated to evaluate model behavior under unfamiliar conditions.
Among the datasets, CIFAR-10 [22,23] and ImageNet 2012 [24,25] were previously employed in Ovadia et al. [8], and are thus included for consistency and comparative evaluation. Since the final step of our proposed method—confidence score calibration—was partially guided by performance insights derived from these two datasets, we additionally introduce two benchmark datasets that were not used in developing the confidence score methodology. This ensures an impartial comparison and robust evaluation of the proposed approach across both familiar and unseen data distributions.

3.3.1. CIFAR-10

CIFAR-10 [22,23] is a widely used benchmark in image classification research. It comprises labeled images spanning ten categories, including common animals and vehicles, and serves as a standard testbed for evaluating the performance of machine learning and deep learning models. Developed by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, the dataset was released through the Canadian Institute for Advanced Research (CIFAR). Detailed specifications are provided in Table 2.

3.3.2. ImageNet 2012

ImageNet 2012 [24,25] is a large-scale image classification dataset, best known as the foundation for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It plays a pivotal role in advancing machine learning and deep learning research, particularly in high-performance image classification.
Developed jointly by Stanford University and Princeton University as part of the broader ImageNet project, the dataset leverages the hierarchical structure of WordNet, wherein each category corresponds to a set of synonyms (Synsets). The dataset includes a wide spectrum of image categories encompassing animals, vehicles, everyday objects, and natural landscapes. These specifications—outlined in Table 3—make ImageNet 2012 a cornerstone benchmark for evaluating the scalability and generalizability of image recognition models.

3.3.3. iNaturalist 2018

iNaturalist 2018 [26,27] is a large-scale image classification dataset curated for biodiversity-focused machine learning research. Originating from the iNaturalist platform and organized by the Visipedia team, the dataset was featured in the CVPR FGVC5 competition.
It contains images categorized by biological species and further grouped into super-categories to facilitate hierarchical classification. Detailed specifications are listed in Table 4. Given its structure and taxonomic diversity, iNaturalist 2018 is well-suited for tasks such as species identification, biological classification, and deep learning model evaluation. In our experiments, we utilize only the superclass labels to simulate a reduced-class scenario. Notably, this setup introduces a pronounced class imbalance, which adds further complexity to model calibration and evaluation.

3.3.4. Places365 Dataset

Places365 [28,29] is a large-scale scene recognition dataset curated by the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). It is primarily used in image classification and scene understanding tasks and serves as a standard benchmark in machine learning and deep learning research. The dataset is a subset of the broader Places2 Database and is available in two variants: Places365-Standard and Places365-Challenge. Detailed specifications are provided in Table 5.
A distinctive characteristic of Places365 is its inclusion of images with rich co-occurring object structures. For example, a kitchen scene might feature a stove, refrigerator, sink, and tableware, while other categories—such as classrooms or libraries—are defined not only by the presence of objects but also by their spatial and functional arrangement. These structured patterns offer valuable semantic cues, making Places365 particularly effective for evaluating models that incorporate contextual and spatial reasoning.

3.3.5. Distorted In-Distribution Datasets

In addition to the primary datasets—CIFAR-10, ImageNet 2012, iNaturalist 2018, and Places365—this study incorporates distorted variants to evaluate model performance under various degradation scenarios. For CIFAR-10 and ImageNet 2012, the existing corrupted versions (CIFAR-10-C and ImageNet-C) are directly adopted.
However, as no predefined distortion sets are available for iNaturalist 2018 and Places365, this study extends the ImageNet-C distortion transformation framework to generate analogous corrupted samples for these two datasets. The implemented distortion functions are summarized in Table 6, with visual illustrations of the corresponding image alterations provided in Figure 3. Notably, the Frost transformation could not be replicated for iNaturalist 2018 and Places365 due to the absence of the original reference images necessary for applying this specific effect.

3.3.6. Out-of-Distribution Datasets

For CIFAR-10, the SVHN dataset is selected as its OOD counterpart. In contrast, ImageNet 2012, iNaturalist 2018, and Places365 are paired with synthetically generated random RGB images of resolution 224 × 224 pixels to serve as OOD samples.
Given the wide range of semantic categories covered by the experimental datasets, random noise images provide an effective proxy for extreme OOD cases, where the test inputs are entirely dissimilar to any of the known classes. Furthermore, the proposed method operates on masked patches, which also involve random noise. As such, distinguishing between unaltered and corrupted regions becomes particularly challenging when the OOD inputs themselves are purely noise-based, thus presenting a stringent evaluation scenario.
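The sketch below shows one way such noise images could be generated; the uniform per-channel sampling and the function name are assumptions, as the exact sampling scheme is not specified here.

```python
import numpy as np

def random_rgb_ood(n_samples, size=224, seed=0):
    """Generate uniform random RGB images of shape (N, size, size, 3), dtype
    uint8, to serve as extreme out-of-distribution probes."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(n_samples, size, size, 3), dtype=np.uint8)
```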

4. Experiments and Results

This section outlines the experimental setup and presents the corresponding results. It begins with a description of the baseline methods used for comparison, followed by an introduction to the models employed in the study. A total of five experiments are conducted: four on individual datasets and one additional experiment demonstrating the feasibility of an ensemble strategy. The subsequent section summarizes the key findings, followed by a discussion of computational complexity. The section concludes with an analysis of limitations and a discussion of future research directions. The link to the experimental source codes is given in the Supplementary Materials.

4.1. Comparison Targets

The baseline models used for comparison include (1) a vanilla approach that directly interprets softmax outputs as confidence scores, and (2) the Monte Carlo (MC) dropout method. In the experimental setup, following the protocol established in [8], the dropout probability is fixed at 0.1, and a total of 128 models are generated without further modification. This configuration enables direct reuse of the results reported in [8], thereby eliminating the need for retraining and reducing computational overhead.
A fair comparison beyond these baselines is challenging due to structural and procedural differences among alternative methods. For instance, deep ensembles require training multiple independent models, whereas the compared methods do not. BNNs involve specialized architectures that are incompatible with pretrained models. Additionally, temperature scaling has been excluded from comparison, as it was shown to generalize poorly in [8], and it needs an additional training phase.

4.2. Model Architectures and Configuration

To ensure a fair comparison with the results reported in [8], the deep learning models for CIFAR-10 and ImageNet 2012 follow the same architectural designs outlined in that reference. However, a notable distinction lies in the training strategy for ImageNet 2012: instead of training from scratch, models in this study leverage the official pretrained weights provided by PyTorch 2.x [30], consistent with widely adopted best practices.
For the iNaturalist 2018 and Places365 datasets, this study introduces additional configuration details. The iNaturalist 2018 models are trained using the hyperparameter settings specified in Table 7, while the Places365 models employ pretrained weights released by the MIT CSAIL team [31], also documented in Table 7.
To support the implementation of the proposed method, additional configuration parameters are specified in Table 8. Due to the large scale of several experimental datasets, particularly ImageNet 2012, iNaturalist 2018, and Places365, subsampling is applied to reduce training time and computational overhead. The numbers of sampled images are also documented in Table 8.

4.3. Experiment I: CIFAR-10

The objective of this experiment is to assess the effectiveness of the proposed uncertainty quantification method on a small-scale image classification task, using CIFAR-10 as the in-distribution benchmark. The evaluation focuses on both predictive accuracy and calibration quality, and compares the proposed approach with existing techniques.
Table 9 presents the comparative results using various uncertainty quantification metrics. For the baseline methods, including MC dropout and vanilla softmax confidence, the reported values are extracted directly from figures in [8], without retraining, to ensure alignment and fair comparison. Consequently, NLL values for these baselines are not available and are not reported here.
The results indicate that the proposed method does not yield a performance advantage over MC dropout on CIFAR-10. A possible explanation lies in the data augmentation strategy adopted during training: the base model already utilizes techniques such as random cropping, which introduces spatial variability by exposing the network to different subregions of each image. This enhances the model’s spatial robustness and reduces its sensitivity to perturbations introduced through random masking—the core mechanism of the proposed confidence score estimation. Consequently, the masking operation may not sufficiently disrupt the model’s inference behavior, leading to less informative uncertainty measurements.
While the proposed method does not outperform MC dropout under standard conditions (e.g., the CIFAR-10 test set), its advantages become more evident in the presence of input degradation. For example, under moderate distortion—specifically, Gaussian blur level 3 in CIFAR-10-C—the cumulative accuracy within high-confidence prediction regions surpasses that of MC dropout, as shown in Figure 4, indicating stronger uncertainty quantification.
Furthermore, as distortion severity increases, the proposed method consistently exhibits lower ECE, emphasizing its enhanced calibration capability under degraded conditions. This pattern, visualized in Figure 5, suggests that the method quantifies predictive uncertainty more effectively when faced with challenging inputs.
In summary, although performance on clean data remains comparable or slightly below baseline, the proposed approach demonstrates superior robustness and uncertainty sensitivity in adverse scenarios, making it a promising candidate for calibration-aware modeling in real-world settings where input quality may vary.
In scenarios involving out-of-distribution inputs, the proposed method yields more pronounced uncertainty responses than MC Dropout, as depicted in Figure 6. Specifically, the mode of the predictive entropy under the proposed approach reaches 1.066, compared to approximately 0.976 for MC Dropout. This higher entropy concentration indicates a more cautious predictive behavior, suggesting that the proposed method is less likely to assign overconfident predictions to samples outside the CIFAR-10 training distribution.
By generating sharper entropy distributions for OOD inputs, the method demonstrates improved awareness of epistemic uncertainty and a reduced tendency to make overconfident predictions on unfamiliar data. This characteristic is essential for safe model deployment in open-world settings, where unseen or corrupted inputs may arise unexpectedly.

4.4. Experiment II: ImageNet 2012

To assess the scalability and robustness of the proposed uncertainty quantification method, we extend our evaluation to large-scale image classification datasets, including ImageNet 2012, ImageNet-C, and a custom-designed set of OOD samples.
Evaluation metrics on in-distribution datasets, presented in Table 10, reveal comparable performance between the proposed method and MC Dropout. For distorted samples—particularly under Gaussian blur level 3 in ImageNet-C—both approaches exhibit similar cumulative accuracy–confidence characteristics in the high confidence region, as depicted in Figure 7.
Notably, as distortion severity increases, the ECE trend under the proposed method exhibits markedly greater consistency, as shown in Figure 8. This pattern highlights enhanced calibration stability in scenarios with degraded input quality. Such resilience suggests that the method more effectively quantifies predictive uncertainty when confronted with heavily corrupted data, thereby mitigating the risk of erroneous high-confidence predictions in ambiguous or noisy conditions.
Regarding OOD inputs, the proposed method displays quantitative advantages over MC dropout, as shown in Figure 9; however, its entropy distribution notably overlaps with that of the Vanilla baseline. This phenomenon can be attributed to the behavior of vanilla softmax when confronted with highly unfamiliar inputs—such as randomly generated three-channel RGB images—across a large number of output classes. In such cases, vanilla tends to produce nearly uniform softmax scores, resulting in elevated predictive entropy despite lacking true uncertainty awareness.
Conversely, while the proposed method integrates a masking mechanism designed to induce uncertainty, its behavior on highly abnormal samples can lead to concentrated activations in specific output branches. This effect suppresses the tails of the entropy distribution, reducing sensitivity at its extremes. As a result, although the method retains uncertainty quantification capabilities, its response to severe OOD samples may exhibit limitations in entropy expressiveness.

4.5. Experiment III: iNaturalist 2018

The iNaturalist 2018 dataset presents a long-tailed image classification challenge due to its highly imbalanced class distribution. Under standard (in-distribution) evaluation protocols, the proposed method outperforms MC dropout, as illustrated in Figure 10. This improvement is chiefly attributed to the fundamental differences in uncertainty modeling: while MC dropout relies on repeated stochastic neuron masking, leading to prediction variance and diminished stability, the proposed technique mitigates these disruptions. As a result, it achieves superior robustness and interpretability in uncertainty quantification.
However, in terms of ECE and other metrics, our method slightly underperforms compared to the vanilla approach, as shown in Table 11. This suggests that although the vanilla method lacks explicit uncertainty quantification mechanisms, its calibration accuracy remains competitive in scenarios with significant class imbalance.
Under distortion conditions in the iNaturalist 2018 dataset, the proposed method demonstrates significant advantages over both MC dropout and vanilla approaches, especially for high confidence levels, as shown in Figure 11. Furthermore, regarding the distribution of ECE values for all level-3 distortions, the proposed method yields the lowest median, as shown in Figure 12, indicating stable calibration performance. To save simulation time, only datasets with level-3 distortion are used in the calculation. These results validate the method’s effectiveness in uncertainty quantification under conditions of severe class imbalance and substantial input distortion.
Under OOD conditions, the proposed method demonstrates more discerning uncertainty characteristics on the iNaturalist 2018 dataset. Despite the pronounced class imbalance—which induces a bias toward certain classes and results in an overall low-entropy distribution—both our method and vanilla achieve meaningful sample concentration within the higher-entropy region, as depicted in Figure 13. In contrast, MC dropout exhibits strong concentration in the low-entropy region, suggesting a tendency toward overconfident predictions and diminished reliability in uncertainty quantification.

4.6. Experiment IV: Places365

Compared to ImageNet, the Places365 dataset is distinguished by its high semantic density and visually complex backgrounds. As such, models trained on Places365 must capture holistic scene context rather than relying solely on localized object features. Under in-distribution evaluation, the proposed method does not demonstrate a marked advantage over either MC dropout or the baseline vanilla approach, as reflected in Figure 14 and Table 12.
Under Gaussian-3 distortion conditions, the proposed method exhibits only minor gains over MC dropout and the vanilla approach in terms of confidence–accuracy alignment (Figure 15). However, it demonstrates improved performance in ECE for all level-3 distortions, as shown in Figure 16, indicating enhanced calibration robustness in the presence of degraded visual quality and semantically complex imagery.
In the OOD scenario for the Places365 dataset, depicted in Figure 17, all methods yield predictions with moderate confidence levels, reflected in entropy values significantly below the theoretical upper bound of log (365) ≈ 5.9 nats. However, the proposed method exhibits a more distinct concentration of samples in the higher-entropy region compared to other approaches, suggesting enhanced capability in expressing elevated uncertainty for OOD inputs.

4.7. Experiment V: Ensemble Approaches

Given the distinct mechanisms employed by our proposed method and the MC dropout technique for computing confidence scores, it is feasible to combine their outputs in an ensemble-like fashion. To explore this possibility, we use the Places365 dataset as a case study.
The integration procedure involves two steps: first, the input image is processed using the proposed method to determine the predicted class (via majority voting) and its associated confidence score. Then, the same input is evaluated using MC dropout to compute a confidence score for the same predicted class. These two scores are averaged to obtain a final ensemble-based confidence value.
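A minimal sketch of this fusion step is given below; the 50/50 averaging follows the description above, while the argument names and the (n_passes, n_C) layout of the MC dropout softmax outputs are illustrative assumptions.

```python
import numpy as np

def ensemble_confidence(bmc_class, bmc_score, mc_dropout_probs):
    """Average the BMC confidence score with the MC dropout confidence for
    the class predicted by BMC (sketch)."""
    mc_dropout_probs = np.asarray(mc_dropout_probs)   # shape (n_passes, n_C)
    mc_score = mc_dropout_probs[:, bmc_class].mean()
    return 0.5 * (bmc_score + mc_score)
```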
The results for this hybrid approach—evaluated on the original Places365 dataset—are presented in Figure 18. However, as shown, the combined method does not yield performance improvements over either standalone technique. Since this ensemble strategy targets score fusion rather than predictive refinement, only ECE is reported in Table 13. The ensemble approach does not have a clear advantage, suggesting limited synergy between the two underlying confidence estimation mechanisms for in-distribution samples.
In scenarios involving distorted inputs, the ensemble approach exhibits improved calibration performance—underscoring one of the key strengths of the proposed method. As illustrated in Figure 19, the ensemble technique achieves higher accuracy within the high-confidence prediction region, making it a preferable strategy under such conditions.
Further evidence is provided in Figure 20, which presents the ECE results for level-3 distortion. The ensemble method yields a lower median error and narrower distribution intervals, highlighting its ability to maintain robust uncertainty quantification even under significant image degradation. These findings affirm the practical value of integrating the proposed method with MC dropout to enhance reliability and performance in challenging visual environments.
In OOD scenarios, the ensemble approach demonstrates a clear advantage over other methods, as illustrated in Figure 21. This underscores the combined method’s superior capability to quantify uncertainty in OOD samples enriched with contextual information.

4.8. Summary of Experimental Results

We conducted four experiments (Section 4.3, Section 4.4, Section 4.5 and Section 4.6) to evaluate the relative performance of the proposed approach, MC dropout, and vanilla baseline methods. One additional experiment (Section 4.7) explored the feasibility of fusing two confidence scores. To give a quick overview, a summary of the experiments and the reasons for choosing the in-distribution datasets are given in Table 14.
The findings in the experiments are summarized as follows:
  • In-distribution datasets: The proposed method demonstrates consistently stable performance, remaining comparable to the other two approaches. While it does not consistently outperform them, its reliability underscores its effectiveness in standard settings.
  • Distorted (distribution-shift) datasets: In high-confidence regions, the proposed method achieves higher cumulative accuracy and lower ECE, indicating stronger calibration performance and heightened sensitivity to data degradation.
  • OOD datasets: The method exhibits a denser concentration of samples in high-entropy regions, effectively reducing overconfident predictions and enhancing the model’s awareness of unfamiliar inputs.
  • Class-imbalanced datasets (e.g., iNaturalist 2018): Relative to the other two methods, the proposed approach better mitigates prediction bias arising from unstable dropout, demonstrating enhanced robustness and interpretability.
  • Ensemble performance: When integrated with MC dropout, the proposed method performs notably well under distortion scenarios, achieving lower ECE and a higher entropy peak center value than either method in isolation—highlighting its potential for synergistic integration.

4.9. Computational Complexity

In terms of computational complexity, all three methods under comparison require training a single model, which is then reused during the testing phase. The vanilla approach performs only one forward pass for classification. The MC dropout method incurs a minor overhead to determine which weights to drop, followed by $n_P$ forward passes; in our experiments, $n_P = 128$, consistent with the protocol in [8]. The proposed method similarly requires a small amount of time to generate masked images. The number of classification passes depends on the image dimensions and the hop size of the masking window. To simplify the discussion without introducing additional notation, we use the Places365 dataset as a reference. Each image is typically sized at 256 × 256 pixels, and the hop size is empirically set to 10 × 10 pixels. As a result, a total of 25 × 25 = 625 masked images are generated and classified. Accordingly, the computational cost of the proposed method is approximately five times that of the MC dropout approach.
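For reference, the rough arithmetic behind this estimate, assuming the number of mask positions per axis is taken as the image size divided by the hop size (rounded down), is
$$n_M \approx \left\lfloor \frac{256}{10} \right\rfloor \times \left\lfloor \frac{256}{10} \right\rfloor = 25 \times 25 = 625, \qquad \frac{n_M}{n_P} = \frac{625}{128} \approx 4.9 \approx 5.$$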

4.10. Limitations of the Proposed Approach

The proposed approach has several limitations. First, its current implementation is restricted to inputs with a two-dimensional structure, such as images or spectrograms, where localized perturbations—e.g., small random-noise patches—can be meaningfully applied. Inputs with alternative structures, such as one-dimensional signals or word-embedded vectors, fall outside the scope of this method and are therefore not supported. Nonetheless, since the core concept involves perturbing localized regions of input features, it may be possible to generalize the approach to non-image-based data. This represents a promising direction for future research.
Second, the present study focuses exclusively on classification tasks; the framework does not extend to regression problems. However, if localized perturbations can be meaningfully applied to the input features, the method may be adaptable to regression settings as well.
Third, the approach incurs significant computational overhead. As discussed in Section 4.9, it requires multiple forward passes during inference, which may be impractical for time-critical applications with limited computational resources. This limitation is inherently difficult to circumvent. Fortunately, with ongoing advancements in computing hardware, this constraint may become less significant over time.

4.11. Future Directions

Two promising directions for future work include (1) generalizing the proposed approach to accommodate diverse input structures, and (2) applying the method to real-world domains. The challenge of extending beyond two-dimensional inputs—such as images or spectrograms—has been discussed in Section 4.10 and is therefore omitted here.
Regarding applications, we first aim to adapt the proposed approach for medical image classification. In this domain, predictive accuracy alone is insufficient; the model’s confidence in its decisions is equally critical. For example, misclassifying normal tissue as a tumor could have serious consequences. Thus, evaluating the trustworthiness of model outputs is essential, and when confidence is low, alternative decision-making mechanisms should be incorporated to ensure safety and reliability.
A second application area is autonomous driving, a rapidly advancing field in the automotive industry. Tasks such as pedestrian detection and traffic signal recognition are vital for safe navigation. In this context, assessing the confidence of model predictions plays a key role in preventing accidents. Improving model interpretability and reliability in such high-stakes environments is crucial for responsible deployment.
Finally, given the generality of the proposed method, it holds potential for broad applicability across various critical classification tasks, such as face recognition and user classification.

5. Conclusions

This paper investigates the performance of the proposed uncertainty quantification method against MC dropout and vanilla across diverse image datasets, including standard (in-distribution) datasets with varying numbers of classes, distorted datasets, and OOD datasets. The analysis further extends to datasets characterized by extreme class imbalance and semantic complexity, as well as to the heterogeneous integration of the proposed method with MC dropout. Experimental results demonstrate that the proposed method consistently outperforms MC dropout and the vanilla baseline in handling heavily distorted inputs and OOD samples. Moreover, its heterogeneous integration with MC dropout further elevates predictive performance and calibration quality in these challenging scenarios.
Future work will focus on extending the proposed method to accommodate inputs of more general shapes beyond the current two-dimensional constraint. This enhancement would broaden its applicability across diverse domains. In terms of practical applications, the method holds promise for critical areas such as autonomous driving and medical diagnosis, where both predictive accuracy and confidence estimation are essential for safe and reliable decision-making.

Supplementary Materials

The source code for the experiments can be downloaded at: https://github.com/ha3usee2u/bzq-on-uncertainty-quantification (accessed on 11 August 2025).

Author Contributions

Conceptualization, S.D.Y.; methodology, P.-X.W., C.-H.L. and S.D.Y.; software, P.-X.W.; validation, P.-X.W., C.-H.L. and S.D.Y.; formal analysis, S.D.Y.; investigation, P.-X.W., C.-H.L. and S.D.Y.; resources, C.-H.L. and S.D.Y.; data curation, P.-X.W.; writing—original draft preparation, S.D.Y.; writing—review and editing, C.-H.L.; visualization, P.-X.W.; supervision, C.-H.L. and S.D.Y.; project administration, C.-H.L. and S.D.Y.; funding acquisition, C.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science and Technology Council, Taiwan, grant number NSTC 112-2221-E-027-049-MY2. The APC was waived by invitation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original datasets presented in the study are from 3rd party and are openly available in the following: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 1 August 2025) (for CIFAR-10), https://image-net.org/challenges/LSVRC/2012/ (accessed on 1 August 2025) (for ImageNet 2012), https://github.com/visipedia/inat_comp (accessed on 1 August 2025) (for iNaturalist), and https://github.com/CSAILVision/places365 (accessed on 1 August 2025) (for Places365).

Acknowledgments

During the preparation of this manuscript, the authors used Copilot for the purposes of generating Figure 1 and polishing the English writing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BNN	Bayesian Neural Network
BSQ	Block Scaling Quality
ECE	Expected Calibration Error
MC	Monte Carlo
NLL	Negative Log-Likelihood
OOD	Out-of-distribution
UQ	Uncertainty Quantification

References

  1. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  2. Scheirer, W.J.; de Rezende Rocha, A.; Sapkota, A.; Boult, T.E. Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1757–1772. [Google Scholar] [CrossRef] [PubMed]
  3. Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  4. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  5. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. 2017, 30, 6405–6416. [Google Scholar]
  6. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  7. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016. [Google Scholar]
  8. Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.V.; Lakshminarayanan, B.; Snoek, J. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Adv. Neural Inf. Process. 2019, 32, 13991–14002. [Google Scholar]
  9. MNIST Dataset. Available online: https://www.kaggle.com/datasets/hojjatk/mnist-dataset (accessed on 1 August 2025).
  10. Kozłowski, M.; Racewicz, S.; Wierzbicki, S. Image Analysis in Autonomous Vehicles: A Review of the Latest AI Solutions and Their Comparison. Appl. Sci. 2024, 14, 8150. [Google Scholar] [CrossRef]
  11. Gururaj, H.L.; Soundarya, B.C.; Priya, S.; Shreyas, J.; Flammini, F. A Comprehensive Review of Face Recognition Techniques, Trends, and Challenges. IEEE Access 2024, 12, 107903–107926. [Google Scholar] [CrossRef]
  12. Zhang, W.-J.; Chen, W.-T.; Liu, C.-H.; Chen, S.-W.; Lai, Y.-H.; You, S.D. Feasibility Study of Detecting and Segmenting Small Brain Tumors in a Small MRI Dataset with Self-Supervised Learning. Diagnostics 2025, 15, 249. [Google Scholar] [CrossRef] [PubMed]
  13. You, S.D.; Lin, K.-R.; Liu, C.-H. Estimating classification accuracy for unlabeled datasets based on block scaling. Int. J. Eng. Technol. Innov. 2023, 13, 313–327. [Google Scholar] [CrossRef]
  14. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
  15. Kristoffersson Lind, S.; Xiong, Z.; Forssén, P.-E.; Krüger, V. Uncertainty quantification metrics for deep regression. Pattern Recognit. Lett. 2024, 186, 91–97. [Google Scholar] [CrossRef]
  16. You, S.D.; Liu, H.-C.; Liu, C.-H. Predicting classification accuracy of unlabeled datasets using multiple deep neural networks. IEEE Access 2022, 10, 44627–44637. [Google Scholar] [CrossRef]
  17. Pavlovic, M. Understanding model calibration—A gentle introduction and visual exploration of calibration and the expected calibration error (ECE). arXiv 2025, arXiv:2501.19047v2. [Google Scholar]
  18. Si, C.; Zhao, C.; Min, S.; Boyd-Graber, J. Re-examining calibration: The case of question answering. arXiv 2022, arXiv:2205.12507. [Google Scholar] [CrossRef]
  19. Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather. Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
  20. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
  21. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  22. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  23. The CIFAR-10 Dataset. Available online: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 1 August 2025).
  24. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  25. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  26. Van Horn, G.; Aodha, O.M.; Song, Y.; Cui, Y.; Sun, C.; Shepard, A.; Adam, H.; Perona, P.; Belongie, S. The iNaturalist Species Classification and Detection Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  27. Visipedia/Inat_Comp. Available online: https://github.com/visipedia/inat_comp (accessed on 1 August 2025).
  28. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  29. Places Download. Available online: http://places2.csail.mit.edu/download.html (accessed on 1 August 2025).
  30. Models and Pre-Trained Weights. Available online: https://docs.pytorch.org/vision/main/models.html (accessed on 1 August 2025).
  31. CSAILVision/Places365. Available online: https://github.com/CSAILVision/places365 (accessed on 1 August 2025).
Figure 1. Illustration of training (a) and test (b) samples.
Figure 2. Illustration of the proposed BMC approach. Note that the noise mask in Step 2 is depicted as a gray block, while the red dotted boxes highlight the locations corresponding to the maximum softmax values of the model outputs.
Figure 3. Illustration of various types of distortions.
Figure 4. Confidence vs. accuracy for Gaussian-3 distortion of CIFAR-10C.
Figure 5. ECE for different levels of distribution shift (distortion) of CIFAR-10C.
Figure 6. Entropy for OOD samples using models trained with CIFAR-10. Entropy values are computed only at discrete data points. The connected lines in the plot are solely for visual illustration and do not represent the true underlying distribution.
Figure 7. Confidence vs. accuracy for Gaussian-3 distortion of ImageNet 2012C.
Figure 8. ECE for different levels of distribution shift (distortion) of ImageNet 2012C.
Figure 9. Entropy for OOD samples using models trained with ImageNet 2012.
Figure 10. Confidence vs. accuracy for original iNaturalist 2018 dataset.
Figure 11. Confidence vs. accuracy for Gaussian-3 distortion of iNaturalist 2018.
Figure 12. ECE for level-3 distortion of iNaturalist 2018.
Figure 13. Entropy for OOD samples using models trained with iNaturalist 2018.
Figure 14. Confidence vs. accuracy for original Places365 dataset.
Figure 15. Confidence vs. accuracy for Gaussian-3 distortion of Places365.
Figure 16. ECE for level-3 distortion of Places365.
Figure 17. Entropy for OOD samples using models trained with Places365.
Figure 18. Confidence vs. accuracy for original Places365 dataset for ensemble experiment.
Figure 19. Confidence vs. accuracy for Gaussian-3 distortion of Places365 with ensemble approach.
Figure 20. ECE for level-3 distortion of Places365 with ensemble approach.
Figure 21. Entropy for OOD samples using models trained with Places365 with ensemble approach.
Table 1. Metrics used in the experiments.
Metric | Priority | Preference | Cal. Steps
Acc. vs. Conf. | Primary | Problem dependent | All
ECE | Primary | ↓ 1 | All
BS | Secondary | ↓ | 1–3
NLL | Secondary | ↓ | 1–3
Entropy | Secondary | ↑ | 1–3
1 Symbol ↓ indicates lower is better, and ↑ indicates higher is better.
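For readers reproducing the metric values reported in Tables 9–13, the sketch below shows one common way to compute ECE, the Brier score, NLL, and predictive entropy from softmax outputs. It is a minimal illustration only: the 15-bin equal-width binning scheme and all function names are our assumptions, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins (the 15-bin choice is an assumption)."""
    conf = probs.max(axis=1)                  # confidence = max softmax value
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap        # weight the gap by the bin's sample fraction
    return ece

def brier_score(probs, labels):
    """Mean squared difference between the softmax vector and the one-hot label."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of each predictive distribution (used for the OOD plots)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)
```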
Table 2. CIFAR-10 dataset.
Item | Value
Training set | 50,000
Test set | 10,000
Image size | 32 × 32 pixels
Image channel | 3 (R, G, B)
No. of classes | 10
Table 3. ImageNet 2012 dataset.
Item | Value
Training set | 1,281,167
Validation set | 50,000
Test set | 100,000 (no label)
Image size | Variable (typically 224 × 224)
Image channel | 3 (R, G, B)
No. of classes | 1000
Table 4. iNaturalist 2018 dataset.
Item | Value
Training set | 437,513
Validation set | 24,426
Test set | 149,394 (no label)
Image size | Variable (typically 224 × 224)
Image channel | 3 (R, G, B)
No. of classes | 8142
No. of super classes | 14
Table 5. Places365 dataset.
Item | Standard | Challenge
Training set | 1,800,000 | 8,000,000
Validation set | 36,000 | 36,000
Test set | 149,394 (no label)
Image size | Variable, typically 224 × 224 or 256 × 256 | Variable, typically 224 × 224 or 256 × 256
Image channel | 3 | 3
No. of classes | 365 | 434
Table 6. Distortion types used in the experiments.
Category | Distortion
Blur | Glass blur, Defocus blur, Zoom blur, Gaussian blur
Noise | Impulse noise, Shot noise, Speckle noise, Gaussian noise
Transformation | Elastic transform, Pixelate
Color | Saturation, Brightness, Contrast
Environment | Fog, Spatter, Frost (not for iNaturalist 2018 and Places365)
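As a concrete illustration of the noise category listed above, the minimal sketch below adds Gaussian noise to an image; the severity value sigma is a placeholder of our own and does not correspond to the level-1 to level-5 settings used in the experiments.

```python
import numpy as np

def gaussian_noise(image, sigma=0.1, rng=None):
    """Additive Gaussian noise for an image scaled to [0, 1].

    `sigma` is an illustrative placeholder, not the exact severity
    settings applied in the paper's distortion experiments.
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```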
Table 7. Models used in experiments.
Item | CIFAR-10 | ImageNet 2012 | iNaturalist 2018 | Places365
Framework | TensorFlow 2.0 | PyTorch 2.x | PyTorch 2.x | PyTorch 2.x
Structure | ResNet-20 V1 | ResNet-50 | ResNet-50 | VGG-16
Training epochs | 200 | Pre-trained | 50 | Pre-trained
Batch size/Model | 7 | IMAGENET1K_V1 | 80 | CSAIL Vision
Optimizer | Adam | - | SGD | -
Learning rate | 0.000717 | - | 0.1 | -
Momentum | - | - | 0.9 | -
Loss | Categorical cross entropy | Categorical cross entropy | Categorical cross entropy | Categorical cross entropy
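As a hedged example of how the pre-trained ImageNet 2012 weights named in Table 7 can be loaded, the snippet below uses the torchvision model zoo (ref. [30]); it shows only model loading and a single softmax forward pass on a dummy batch, not the authors' full evaluation pipeline.

```python
import torch
from torchvision import models

# Load the pre-trained ResNet-50 weights named in Table 7 (torchvision model zoo, ref. [30]).
model = models.resnet50(weights="IMAGENET1K_V1")
model.eval()  # inference mode; no fine-tuning is performed here

# Single forward pass on a dummy batch (the 3 x 224 x 224 input shape is an assumption).
with torch.no_grad():
    probs = torch.softmax(model(torch.rand(1, 3, 224, 224)), dim=1)
print(probs.argmax(dim=1))  # predicted ImageNet class index
```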
Table 8. Experimental parameters.
Item | CIFAR-10 | ImageNet 2012 | iNaturalist 2018 | Places365
Mask size | 3 × 3 | 20 × 20 | 20 × 20 | 20 × 20
Hop size | 1 × 1 | 10 × 10 | 10 × 10 | 10 × 10
α in Equation (2) | 1/2 | 1/2 | 1/2 | 1/2
In-distribution image | 10,000 | 10,000 | 5000 | 5000
Distorted image | 80,000 | 32,000 | 15,000 | 15,000
OOD image | 26,032 | 1000 | 1000 | 1000
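To make the mask-size and hop-size parameters in Table 8 concrete, the sketch below generates masked variants of a single image by sliding a random-noise block across it, in the spirit of the BMC procedure; the function name, the uniform noise range, and the boundary handling are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def masked_variants(image, mask_size=(20, 20), hop_size=(10, 10), rng=None):
    """Yield copies of `image` (H, W, C) with a random-noise block at each mask position.

    The block slides over the image with the given hop size. Noise is drawn
    uniformly from the image's own value range -- an assumption, not the
    authors' exact choice.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    mh, mw = mask_size
    hh, hw = hop_size
    lo, hi = float(image.min()), float(image.max())
    for top in range(0, h - mh + 1, hh):
        for left in range(0, w - mw + 1, hw):
            variant = image.copy()
            variant[top:top + mh, left:left + mw] = rng.uniform(
                lo, hi, size=(mh, mw) + image.shape[2:])
            yield variant

# Each masked variant is then classified; the final class is the majority vote
# over all variants, and a confidence score is derived from how consistent the
# votes are (as described in the abstract).
```

Under this sliding scheme, a 32 × 32 CIFAR-10 image with a 3 × 3 mask and 1 × 1 hop yields 30 × 30 = 900 masked variants, while a 224 × 224 input with a 20 × 20 mask and 10 × 10 hop yields 21 × 21 = 441 variants; the exact counts in the paper may differ if the image boundary is handled differently.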
Table 9. Metrics for CIFAR-10 in-distribution dataset. Bold values indicate the best result for each metric.
Metrics | Proposed | MC Dropout | Vanilla
Accuracy | 87.80% | ~91.2% | ~90.5%
ECE | 0.0693 | ~0.01 | ~0.05
Brier Score | 0.2036 | ~0.13 | ~0.15
Table 10. Metrics for original ImageNet 2012 dataset. Bold values indicate the best result for each metric.
Metrics | Proposed | MC Dropout | Vanilla
Accuracy | 75.94% | ~74.55% | ~75.0%
ECE | 0.026 | ~0.015 | ~0.040
Brier Score | 0.342 | ~0.35 | ~0.34
NLL | 0.998 | ~1.1 | ~1.1
Table 11. Metrics for original iNaturalist 2018 dataset. Bold values indicate the best result for each metric.
Metrics | Proposed | MC Dropout | Vanilla
ECE | 0.055 | 0.070 | 0.007
Brier Score | 0.053 | 0.101 | 0.048
NLL | 0.109 | 0.196 | 0.104
Table 12. Metrics for original Places365 dataset. Bold values indicate the best result for each metric.
Metrics | Proposed | MC Dropout | Vanilla
ECE | 0.062 | 0.047 | 0.043
Brier Score | 0.461 | 0.443 | 0.449
NLL | 1.165 | 1.112 | 1.108
Table 13. Metrics for original Places365 dataset for the ensemble experiment. Bold values indicate the best result for each metric.
Metrics | Proposed | MC Dropout | Vanilla | Ensemble
ECE | 0.062 | 0.047 | 0.043 | 0.079
Table 14. Summary of conducted experiments.
Experiment | In-Distribution Dataset | Highlights
I | CIFAR-10 | To compare with results in [8]
II | ImageNet 2012 | To compare with results in [8]
III | iNaturalist 2018 | Highly class-imbalanced
IV | Places365 | High semantics with complex backgrounds
V | Places365 | To evaluate the performance of the ensemble approach
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
