Towards Knowledge Uncertainty Estimation for Open Set Recognition

Abstract: Uncertainty is ubiquitous and present in every single prediction of Machine Learning models. The ability to estimate and quantify the uncertainty of individual predictions is arguably relevant, all the more in safety-critical applications. Real-world recognition poses multiple challenges since a model's knowledge about physical phenomena is never complete, and observations are incomplete by definition. Yet, Machine Learning algorithms often assume that train and test data distributions are the same and that all testing classes are present during training. A more realistic scenario is Open Set Recognition, where unknown classes can be submitted to an algorithm during testing. In this paper, we propose a Knowledge Uncertainty Estimation (KUE) method to quantify knowledge uncertainty and reject out-of-distribution inputs. Additionally, we quantify and distinguish aleatoric and epistemic uncertainty with the classical information-theoretical measures of entropy by means of ensemble techniques. We performed experiments on four datasets with different data modalities and compared our results with distance-based classifiers, SVM-based approaches and ensemble techniques using entropy measures. Overall, KUE was more effective in distinguishing in- and out-distribution inputs in most cases and at least comparable in the others. Furthermore, we demonstrate classification with a rejection option based on a proposed strategy that combines different measures of uncertainty.


Introduction
Machine Learning (ML) has continuously attracted the interest of the research community, motivated by the promising results obtained in many decision-critical domains. Along with this interest arise concerns related to the trustworthiness and robustness of the models [1]. The notion of uncertainty is of major importance in ML, and a trustworthy representation of uncertainty should be considered a key feature of any ML method [2,3]. In application domains such as medicine, information about the reliability of automated decisions is crucial to improve the system's safety [4,5]. Uncertainty also plays a role in AI at the methodological level, such as in active learning [6,7] and self-training [8,9].
Uncertainty is ubiquitous and present in every single event we encounter in the real world, arising from different sources in various forms. According to the origin of uncertainty, a distinction between aleatoric uncertainty and epistemic uncertainty is commonly made. Aleatoric uncertainty refers to the inherent randomness in nature, while epistemic uncertainty is caused by lack of knowledge of the physical world (knowledge uncertainty), as well as by the limited ability to measure and model the physical world (model uncertainty) [10]. In ML, these two sources of uncertainty are usually not distinguished. However, several studies have shown the usefulness of quantifying and distinguishing the sources of uncertainty in different applications. For self-driving cars [11], the authors emphasize "the importance of epistemic uncertainty or 'uncertainty on uncertainty' in these AI-assisted systems", referring to the first accident of a self-driving car that led to the death of the driver; in medicine, the authors of [12] focused their evaluation on medical data of chest pain patients and their diagnoses.
Until recently, almost all evaluations of ML-based recognition algorithms have taken the form of "close set" recognition, where it is assumed that the train and test data distributions are the same and that all testing classes are known at training time [13]. However, a more realistic scenario for deployed classifiers is to assume that the world is an open set of objects, that our knowledge is always incomplete and, thus, that unknown classes can be submitted to an algorithm during testing [14]. For instance, the diagnosis and treatment of infectious diseases relies on the accurate detection of bacterial infections. However, deploying an ML method to perform bacterial identification is challenging, as real data are highly likely to contain classes not seen in training [15]. Similarly, verification problems for security-oriented face matching or unplanned scenarios for self-driving cars lead to what is called "open set" recognition, in contrast to systems that use "close set" recognition.
Based on the basic recognition categories of classes asserted by Scheirer et al. [16], there are three categories of classes:
1. known classes: classes with distinctly labeled positive training samples (also serving as negative samples for other known classes);
2. known unknown classes: classes labeled as negative samples, not necessarily grouped into meaningful categories;
3. unknown unknown classes: classes unseen in training.
Traditional supervised classification methods consider only known classes. Some improvements started to include known unknown classes, training models with an explicit "other" class or with a detector trained on unclassified negatives. Open Set Recognition (OSR) algorithms, where new classes unseen in training appear in testing, consider the unknown unknown classes category. In this scenario, the classifier needs not only to accurately classify known classes but also to effectively reject unknown ones. Although classification with a reject option is more than 60 years old [17], the focus of rejection has been the ambiguity between classes (aleatoric uncertainty), not the handling of unknown inputs (epistemic uncertainty). According to Chow's theory, inputs are rejected if the posterior probability is not sufficiently high, based on a predefined threshold that optimizes the ambiguous regions between classes. Epistemic uncertainty is high near the decision boundary, and most classifiers increase confidence with the distance from the decision boundary. Thus, an unknown input far from the boundary is not only incorrectly labeled but will be incorrectly classified with very high confidence [14]. Although classification with rejection is related to OSR, it still works under the close set assumption, where a classifier refuses to classify a sample due to its low confidence in an overlapping region between classes, i.e., due to high aleatoric uncertainty. For OSR problems, One-Class Classification (OCC) is commonly used, since it focuses on the known class and ignores everything else. A popular OCC approach for the OSR scenario is to adapt the familiar Support Vector Machine (SVM) methodology in a one-vs-one or one-vs-all setting. OSR problems are usually framed as novelty, anomaly or outlier detection, so they are only interested in the epistemic uncertainty due to the lack of knowledge.
Although OSR methods indirectly deal with epistemic uncertainty, a proper uncertainty quantification is rarely performed.
In this work, we focus on uncertainty quantification of individual predictions, where an input can be rejected for both aleatoric and epistemic uncertainty. Due to the difficulty of dealing with unknown samples, we propose a new method for knowledge uncertainty quantification and combine it with measures of entropy using ensemble techniques. The experimental results are validated on four datasets with different data modalities:
• a human activity dataset using inertial data from smartphones, where uncertainty estimation plays an important role in the recognition of abnormal human activities; indoor location solutions can also benefit from a proper uncertainty estimation, where highly confident activity classifications should increase positioning accuracy;
• a handwritten digits dataset using images, where uncertainty estimation might be used for unrecognized handwritten digits;
• a bacterial dataset using Raman spectra for the identification of bacterial pathogens, where novel pathogens often appear and their identification is critical;
• a cardiotocograms dataset using fetal heart rate and uterine contraction signals, where rarely seen conditions of patient data can be assessed through uncertainty estimation.
Overall, the contributions of this work can be summarized as follows:
• a new uncertainty measure for quantifying knowledge uncertainty and rejecting unknown inputs;
• a combination strategy to incorporate different uncertainty measures, evaluating the increase of classification accuracy versus the rejection rate induced by the uncertainty measures;
• an experimental evaluation of in- and out-distribution inputs over four different datasets and eight state-of-the-art methods.

Uncertainty in Supervised Learning
The awareness of uncertainty is of major importance in ML and constitutes a key element of ML methodology. Traditionally, uncertainty in ML is modeled using probability theory, which has always been perceived as the reference tool for uncertainty handling [2]. Uncertainty arises from different sources in various forms and is commonly classified into aleatoric or epistemic uncertainty. Aleatoric uncertainty is related to the data and increases with the noise in the observations, which can cause class overlap. Epistemic uncertainty, on the other hand, is related to the model and the knowledge available to it; it increases for test samples in out-of-distribution (OOD) regions and captures the lack of knowledge about the model's parameters. Epistemic uncertainty can be reduced by collecting more samples, whereas aleatoric uncertainty is irreducible [18]. Although traditional probabilistic predictors may be a viable approach for representing uncertainty, they make no explicit distinction between the different types of uncertainty.

Uncertainty on Standard Probability Estimation
In standard probabilistic modeling and Bayesian inference, the representation of uncertainty about a prediction is given by the probability of the predicted class. Consider a distribution p(x, ω) over input features x and labels ω, where ω_k ∈ {ω_1, . . . , ω_K} consists of a finite set of K class labels, and a classification model p(ω_k | x, D) trained on a finite dataset D = {(x_i, ω_i)}_{i=1}^{N} with N samples; its predictive uncertainty is an uncertainty measure that combines aleatoric and epistemic uncertainty. The probability of the predicted class, or maximum probability, is a measure of confidence in the prediction that can be obtained by

  P_max(x) = max_k p(ω_k | x, D).   (1)

Another measure of uncertainty is the (Shannon) entropy of the predictive posterior distribution, which behaves similarly to the maximum probability but represents the uncertainty encapsulated in the entire distribution:

  H[p(ω | x, D)] = − ∑_{k=1}^{K} p(ω_k | x, D) log₂ p(ω_k | x, D).   (2)

Both maximum probability and entropy of the predictive posterior distribution can be seen as measures of the total uncertainty in predictions [19]. These measures of uncertainty for probability distributions primarily capture the shape of the distribution and, hence, are mostly concerned with the aleatoric part of the overall uncertainty. In this paradigm, the classification with a rejection option introduced by Chow [17] suggests rejecting objects for which the maximum posterior probability is below a threshold. If the classifier is not sufficiently accurate for the task at hand, one can choose not to classify all examples, but only those whose posterior probability is sufficiently high. Chow's theory is suitable when a sufficiently large training sample is available for all classes and the training sample is not contaminated by outliers [20]. Fumera et al. [21] showed that Chow's rule does not perform well if a significant error in probability estimation is present; in that case, a different rejection threshold per class has to be used.
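As an illustration, both measures can be computed directly from a predictive distribution; the function names below are ours, not from the paper:

```python
import numpy as np

def max_probability(p):
    """Confidence as the probability of the predicted class."""
    return np.max(p)

def shannon_entropy(p, eps=1e-12):
    """Entropy of the predictive posterior (total uncertainty), in bits."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p + eps))

# A peaked distribution: confident prediction, low entropy.
p_confident = [0.97, 0.01, 0.01, 0.01]
# A flat distribution: ambiguous prediction, entropy near log2(K) = 2 bits.
p_ambiguous = [0.25, 0.25, 0.25, 0.25]
```

Note how the two measures agree on the ranking of these cases but the entropy also reflects how the remaining mass is spread over the non-predicted classes.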
In classifiers with a rejection option, the key parameters are the thresholds that define the reject area, which may be hard to define and may vary significantly in value, especially when classes have a large spread. Additionally, Bayesian inference is more akin to aleatoric uncertainty, and it has been argued that probability distributions are less suitable for representing ignorance in the sense of lack of knowledge [2,22].

Knowledge Uncertainty
Knowledge uncertainty is often associated with novelty, anomaly or outlier detection, where the testing samples come from a different population than the training set. Approaches based on generative models typically use densities, p(x), to decide whether to reject a test input that is located in a region without training inputs. These low-density regions, where no training inputs have been encountered so far, represent high knowledge uncertainty. Traditional methods, such as Kernel Density Estimation (KDE), can be used to estimate p(x), and threshold-based rules are often applied on top of the density so that a classifier can refuse to predict a test input in such a region [23]. Related to this topic is the closed-world assumption, which is often violated in experimental evaluations of ML algorithms. Almost all supervised classification methods assume that the train and test data distributions are the same and that all classes in the test set are present in the training set. However, this is unrealistic for many real-world applications, where classes unseen in training appear in testing. This problem has been studied under different names with varying levels of difficulty, including the previously mentioned classification with a rejection option, OOD detection and OSR [24]. In classification with a rejection option, the test distribution usually has the same classes as the training distribution, and the classifier rejects inputs it cannot confidently classify. OOD detection is the ability of a classifier to reject a novel input rather than assigning it an incorrect label. In this setting, OOD inputs are usually considered outliers that come from entirely different datasets. This topic is particularly important for deep neural networks, as several studies have shown that deep neural networks usually predict OOD inputs with high confidence [25,26].
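A minimal sketch of this density-thresholding idea, using scikit-learn's KernelDensity; the bandwidth, the 5th-percentile threshold and the toy data are illustrative assumptions, not choices from the paper:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # in-distribution data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Threshold: 5th percentile of the training log-densities. Any input whose
# estimated density falls below it lies in a low-density region and is refused.
tau = np.percentile(kde.score_samples(X_train), 5)

x_in = np.array([[0.1, -0.2]])   # near the training mass
x_out = np.array([[8.0, 8.0]])   # far from any training input

accept_in = kde.score_samples(x_in)[0] >= tau
accept_out = kde.score_samples(x_out)[0] >= tau
```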
The OSR approach is similar to OOD detection and can be viewed as tackling both the classification and novelty detection problem at the same time. Contrary to the OOD detection, the novel classes that are not observed during training are often made up of the remaining classes in the same dataset. This task is probably the hardest one because the statistics of a class are often very similar to the statistics of other classes in the dataset [24]. In each case, the goal is to correctly classify inputs that belong to the same distribution as the training set and to reject inputs that are outside this distribution.
A number of approaches have been proposed in the literature to handle unknown classes in the testing phase [27,28]. A popular and promising approach for open-set scenarios is OCC, since it focuses on the known class and ignores any additional class. OCC problems consist of defining the limits of all, or most, of the training data by having a single target class; all samples outside those limits are considered outliers [29]. SVM models are commonly used in OCC problems, fitting a hyperplane that separates normal data points from outliers in a high-dimensional space [30]. Typically, the SVM separates the training set, containing only samples from the known classes, by the widest interval possible; training samples in the OOD region are penalized, and the prediction is made by assigning samples to the known or unknown region. Binary classification with the one-vs-all approach can also be applied to open-set recognition [31]. In this scenario, the most confident binary classifier that classifies a sample as in-distribution is chosen to predict the final class of the multiclass classifier; when no binary classifier yields an in-distribution classification, the test sample is classified as unknown. Different adaptations of OCC and variations of the SVM have been applied to OSR, aiming to minimize the risk of the unknown [13,16,32]. However, a drawback of these methods is the need to re-train the models from scratch, at a relatively high computational cost, when new classes are discovered and become available. Therefore, they are not well-suited for the incremental updates or scalability required for open-world recognition [33]. Distance-based approaches are more suitable for open-world scenarios, since new classes can be added to existing ones at near-zero cost [34].
Distance-based classifiers with a rejection option are easily applied to OSR because they can create a bounded known space in the feature space, rejecting test inputs that are far away from the training data. For instance, the Nearest Class Mean (NCM) classifier is a distance-based classifier that represents each class by the mean feature vector of its elements [34]. The problem for most methods that handle rejection by thresholding a similarity score is the difficulty of determining a threshold that decides whether a test input is an outlier or not. In this context, Júnior et al. [35] extended the traditional close-set Nearest Neighbor classifier by applying a threshold on the ratio of the similarity scores of the two most similar classes, calling it Open Set Nearest Neighbors (OSNN).
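The distance-ratio idea behind OSNN can be sketched as follows; this is our simplified illustration (the function name, the threshold value and the exact neighbor selection are assumptions, and the published OSNN procedure differs in its details):

```python
import numpy as np

def osnn_predict(x, X_train, y_train, ratio_threshold=0.7):
    """Sketch of a nearest-neighbor distance-ratio rule (OSNN-style).

    Finds the nearest training sample t and the nearest sample u belonging to
    a class different from t's class. If d(x,t)/d(x,u) is small, x lies clearly
    inside t's class region and is accepted; otherwise it is rejected as unknown.
    """
    d = np.linalg.norm(X_train - x, axis=1)
    t = np.argmin(d)                          # overall nearest neighbor
    other = y_train != y_train[t]             # samples from a different class
    u = np.where(other)[0][np.argmin(d[other])]
    if d[t] / d[u] <= ratio_threshold:
        return int(y_train[t])                # accept as known class
    return -1                                 # reject as unknown

# Two well-separated classes in 2-D.
X_train = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [10., 10.], [10., 11.], [11., 10.]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```

A test input deep inside class 0 yields a small ratio and is accepted; an input halfway between the classes yields a ratio near 1 and is rejected.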

Combined Approaches for Uncertainty Quantification
The previously mentioned studies do not explicitly quantify uncertainty nor distinguish the different sources of uncertainty. However, we argue that probabilistic methods are more concerned with handling aleatoric uncertainty to reject low-confident inputs, and OSR algorithms are mainly focused on effectively rejecting unknown inputs, which intrinsically have a high epistemic uncertainty due to the lack of knowledge.
However, in real-world applications, and advocating a trustworthy representation of uncertainty in ML, both sources are important, and a proper distinction between them is desirable, all the more in safety-critical applications of ML. Motivated by such scenarios, several works on uncertainty quantification have shown the usefulness of distinguishing both types of uncertainty in the context of Artificial Intelligence (AI) safety [11,12]. Some initial proposals for dealing with OSR while properly quantifying uncertainty can already be found in the literature, mostly in the area of deep neural networks [19,36].
An approach to quantify aleatoric, epistemic, and total uncertainty (the sum of the previous two) separately is to approximate these measures by means of ensemble techniques [19,37], representing the posterior distribution p(h|D) by a finite ensemble of M hypotheses, H = {h_1, . . . , h_M}, that map instances x to probability distributions over outcomes. This approach was developed in the context of neural networks for regression [38], but the idea is more general and can also be applied to other settings, such as in the work of Shaker et al. [37], where the measures of entropy were applied using the Random Forest (RF) classifier. Using an ensemble approach, {p(ω_k | x, h_i)}_{i=1}^{M}, a measure of total uncertainty can be approximated by the entropy of the predictive posterior:

  u_t(x) = H[p̄(ω | x)] = − ∑_{k=1}^{K} p̄(ω_k | x) log₂ p̄(ω_k | x), where p̄(ω_k | x) = (1/M) ∑_{i=1}^{M} p(ω_k | x, h_i).   (3)

An ensemble estimate of aleatoric uncertainty considers the average entropy of each model in the ensemble. The idea is that by fixing a hypothesis h, the epistemic uncertainty is essentially removed. However, since h is not precisely known, the aleatoric uncertainty is measured in terms of the expectation of the entropy with regard to the posterior probability:

  u_a(x) = (1/M) ∑_{i=1}^{M} H[p(ω | x, h_i)] = − (1/M) ∑_{i=1}^{M} ∑_{k=1}^{K} p(ω_k | x, h_i) log₂ p(ω_k | x, h_i).   (4)

Epistemic uncertainty is measured in terms of the mutual information between hypotheses and outcomes, which is a measure of the spread of the ensemble. It can be expressed as the difference between the total uncertainty, captured by the entropy of the expected distribution, and the expected data uncertainty, captured by the expected entropy of each member of the ensemble [19]:

  u_e(x) = u_t(x) − u_a(x).   (5)
Epistemic uncertainty is high if the distribution p(ω|x, h) varies a lot across hypotheses h with high posterior probability, i.e., when plausible hypotheses lead to quite different predictions.
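The decomposition in Equations (3)-(5) can be sketched directly from the M member distributions of an ensemble (the function name is ours):

```python
import numpy as np

def ensemble_uncertainties(probs, eps=1e-12):
    """Decompose uncertainty from an ensemble of M predictive distributions.

    probs: array of shape (M, K) with p(omega_k | x, h_i) for each member h_i.
    Returns (total, aleatoric, epistemic) in bits:
      total     = entropy of the mean distribution (Equation (3)),
      aleatoric = mean of the member entropies (Equation (4)),
      epistemic = total - aleatoric, the mutual information (Equation (5)).
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)
    total = -np.sum(mean * np.log2(mean + eps))
    aleatoric = -np.mean(np.sum(probs * np.log2(probs + eps), axis=1))
    return total, aleatoric, total - aleatoric

# Members agree on a flat distribution: aleatoric high, epistemic ~ 0.
agree = np.tile([0.5, 0.5], (4, 1))
# Members confidently disagree: aleatoric ~ 0, epistemic high.
disagree = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Both toy ensembles have the same total uncertainty of 1 bit, yet its composition is entirely different, which is exactly the distinction the decomposition captures.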
Finally, besides the uncertainty quantification into aleatoric, epistemic and total uncertainty, there are also open questions regarding the empirical evaluation of different methods, since data usually do not contain information about ground truth uncertainty. Commonly the evaluation is done indirectly through the increase of successful predictions, such as the Accuracy-Rejection (AR) curve which depicts the accuracy of a predictor as a function of the percentage of rejections [2].

Proposed Method
In this paper, we are interested in predictive uncertainty, where the uncertainty related to the prediction ω̂ over input features x is quantified in terms of aleatoric and epistemic (and total) uncertainty. Due to the complexity of novelty, anomaly or outlier detection, a specific uncertainty measure to deal with knowledge uncertainty is also proposed, summarizing the relationship between novelty detection and multiclass recognition as a combination of OSR problems. We formulate the problem as traditional OSR, where a model is trained only on in-distribution data, denoted by a distribution Q_in, and tested on a mixture of in- and out-distribution inputs, drawn from Q_in and Q_out, where the latter represents the out-distribution data. Given a finite training set drawn from Q_in, where x_i is the i-th training input and ω_k ∈ {ω_1, . . . , ω_K} is the finite set of class labels, a classifier is trained to correctly identify the class label from Q_in and to reject unknown classes from Q_out not seen in training, by means of an uncertainty threshold. Figure 1 summarizes the main steps of the proposed approach. Besides the traditional classification process, our method learns a feature density estimation from the training data to feed the Knowledge Uncertainty Estimation measure, which rejects test inputs with an uncertainty value higher than the threshold learned during training. Finally, each prediction is also quantified using entropy measures, in terms of total uncertainty, or of aleatoric and epistemic uncertainty if ensemble techniques are used.
Uncertainty is modeled through a combination of normalized density estimations over the input feature space for each known class. Assuming an input x_i is represented by a P-dimensional feature vector, where f_j ∈ {f_1, . . . , f_P}, an independent density estimate of each of the P features conditioned on the class label is computed and normalized by its maximum density, so that all values lie in the interval [0, 1]. Each normalized feature density thus acts as an uncertainty distance, d_unc, taking values in [0, 1], where 1 represents the maximum density seen in training and near-zero values represent low-density regions where no training inputs were observed. The feature distances are combined with the product rule over all features. Thus, given a test input x_i and a class ω_k, the uncertainty is measured using the proposed Knowledge Uncertainty Estimation method, KUE(x_i|ω_k), calculated by

  KUE(x_i | ω_k) = 1 − ∏_{j=1}^{P} d_unc(f_j | ω_k).   (6)

An example of the proposed uncertainty measure distribution over the Bacteria dataset (we will present and discuss this dataset in Section 4.1), with a 30-dimensional feature vector, 28 known classes and 2 unknown classes, is shown in Figure 2: a histogram of the density of KUE values for a training set drawn from the in-distribution data, Q_in, and a test set drawn from both in- and out-distribution data, Q_in and Q_out, respectively.
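The per-class, per-feature density estimation behind KUE can be sketched as follows, under our reading of the method: a 1-D KDE with Scott's-rule bandwidth per feature, normalization by the maximum training density, and product-rule combination. The class name and the closed form 1 − ∏ d_unc are our assumptions:

```python
import numpy as np
from scipy.stats import gaussian_kde

class KUEstimator:
    """Sketch of per-class, per-feature Knowledge Uncertainty Estimation."""

    def fit(self, X, y):
        self.kdes, self.max_dens = {}, {}
        for k in np.unique(y):
            Xk = X[y == k]
            # One 1-D KDE per feature; gaussian_kde uses Scott's rule by default.
            self.kdes[k] = [gaussian_kde(Xk[:, j]) for j in range(X.shape[1])]
            # Maximum density observed on the training samples, for normalization.
            self.max_dens[k] = [kde(Xk[:, j]).max()
                                for j, kde in enumerate(self.kdes[k])]
        return self

    def kue(self, x, k):
        # d_unc ~ 1 in dense regions, ~ 0 where no training data was observed.
        d_unc = [min(kde(x[j])[0] / m, 1.0)
                 for j, (kde, m) in enumerate(zip(self.kdes[k], self.max_dens[k]))]
        return 1.0 - np.prod(d_unc)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                 # one toy in-distribution class
est = KUEstimator().fit(X, np.zeros(300, dtype=int))
kue_in = est.kue(np.array([0.0, 0.0]), 0)     # dense region: low uncertainty
kue_out = est.kue(np.array([6.0, 6.0]), 0)    # far from training: high uncertainty
```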
A common approach to define a threshold for OOD detection, or even to tune a model's hyperparameters, is to use a certain amount of OOD data for validation. However, this approach is unrealistic, because OOD inputs by definition come from an unknown distribution, leading to compromised performance in real-world applications, as Shafaei et al. [39] showed in their recent study. Therefore, we argue that a more realistic approach is to learn a threshold only from in-distribution data. Due to the differences between datasets, learning a global threshold for all datasets is not a reliable approach. Our hypothesis is therefore that, if we learn the training uncertainty distribution for each class within a dataset, there is a specific threshold for each distribution that bounds our uncertainty space, so input samples that fall outside the upper bound threshold are rejected. The upper bound threshold is defined as a predefined percentile of the training uncertainty distribution. The percentile is chosen according to the application scenario, depending on whether the end-user is willing to reject more or fewer in-distribution samples. As the train and test in-distribution data come from the same distribution, it is expected that approximately 10% of the test data will be rejected if the chosen percentile is set to 90%. Within these 10%, a certain fraction may correspond to classification errors or, if the rejected samples were correctly classified, to classifications made under limited evidence, so that a high uncertainty is associated with those decisions. Thus, the rejection rule for an input sample x_i for in- and out-distribution detection is given by

  g(x_i | ω_k) = −1 if KUE(x_i | ω_k) > P_r[U(ω_k)], and 1 otherwise,   (7)

where P_r[U(ω_k)] represents the uncertainty value at the r-th percentile of the training uncertainty distribution associated with class ω_k. The output values −1 and 1 mean that the input sample x_i is rejected or accepted, respectively. Since the proposed measure only deals with knowledge uncertainty, besides the in- and out-distribution detection we also combined our proposed approach with the uncertainty measures presented in Equations (3)-(5) to quantify total, aleatoric and epistemic uncertainty, respectively.
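The percentile-based rejection rule of Equation (7) can be sketched as follows (the function names and toy values are ours):

```python
import numpy as np

def make_reject_rule(train_unc, r=90):
    """Per-class rejection rule in the spirit of Equation (7).

    train_unc maps each class label to the uncertainty values of its training
    samples; the class threshold is the r-th percentile P_r[U(omega_k)].
    A test input whose uncertainty exceeds the threshold of its predicted
    class is rejected (-1), otherwise accepted (1).
    """
    thresholds = {k: np.percentile(u, r) for k, u in train_unc.items()}

    def g(u_test, k):
        return -1 if u_test > thresholds[k] else 1

    return g

# Toy training uncertainties for a single class, uniform on [0, 1]:
# with r=90, roughly 10% of in-distribution inputs are expected to be rejected.
rule = make_reject_rule({0: np.linspace(0.0, 1.0, 101)}, r=90)
```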

Experiments
In this section, we describe the datasets used for the experiments and provide a detailed description of two experimental results: (1) classification with a rejection option based on a combination of our proposed KUE method and measures of predictive uncertainty from Section 2.3; (2) effectiveness of KUE in distinguishing in- and out-distribution inputs.

Datasets
We designed experiments on different data modalities to evaluate our method and to compare it with state-of-the-art methods. The experiments were performed on a real-world bacterial dataset (https://github.com/csho33/bacteria-ID (accessed on June 2020)) from [40] and on a set of standard datasets from the UCI repository [41], namely HAR (https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones (accessed on February 2020)), Digits (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits (accessed on July 2020)) and Cardio (https://archive.ics.uci.edu/ml/datasets/Cardiotocography (accessed on July 2020)). As the datasets do not explicitly contain OOD samples, we adopted a common approach from the literature to simulate a realistic OSR problem, re-labeling some of the known classes as unknown [20]. The instances, attributes, classes and OOD combinations of the datasets are summarized in Table 1. In the following, a brief description of each dataset is given:
• Digits: This dataset is composed of 10 handwritten digits (from 0 to 9) and 64 attributes. We used each class in turn as unknown, resulting in a total of 10 OOD combinations.
• Cardio: This dataset contains measurements of fetal heart rate and uterine contraction on cardiotocograms. The dataset has 10 classes and an additional labeling (Normal, Suspicious and Pathologic). Thus, we trained the model using only the classes labeled Normal and considered as unknown the classes labeled Suspicious and Pathologic.
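The re-labeling protocol can be sketched on the Digits dataset, holding one class out as unknown; using scikit-learn's bundled copy of the dataset is a convenience assumption:

```python
import numpy as np
from sklearn.datasets import load_digits

# Simulate an OSR problem on Digits: one class is re-labeled as unknown
# and removed from the training pool, appearing only at test time.
X, y = load_digits(return_X_y=True)
unknown = 9                          # illustrative choice of held-out digit

in_mask = y != unknown
X_in, y_in = X[in_mask], y[in_mask]  # in-distribution pool (train/test split)
X_out = X[~in_mask]                  # out-of-distribution inputs
```

Repeating this for each digit in turn yields the 10 OOD combinations described above.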

Classification with Rejection Option
The classification with a rejection option based on measures of predictive uncertainty is presented in this section, where we describe the evaluation metrics, the uncertainty quantification methods and the experimental results in detail.

Evaluation Metric
The empirical evaluation of methods for quantifying uncertainty is a non-trivial problem, due to the lack of ground truth uncertainty information. A common approach for indirectly evaluating the predicted uncertainty measures is to use AR curves [2]. According to Nadeem et al. [42], an accuracy-rejection curve is a function representing the accuracy of a classifier as a function of its rejection rate. Therefore, an AR curve plots the rejection rate (from 0 to 1) against the accuracy of the classifier. Since the accuracy is always 100% when the rejection rate is 1, all curves converge to the point (1, 1), and they start from the point (0, a), where a is the initial accuracy of the classifier with 0% of rejected samples.
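A possible implementation of the AR curve computation, rejecting the most uncertain samples first; the function name and the convention of 1.0 accuracy at full rejection are ours:

```python
import numpy as np

def accuracy_rejection_curve(correct, uncertainty, steps=11):
    """AR curve: accuracy among the retained samples as a function of the
    rejection rate, rejecting the most uncertain samples first.
    By convention, accuracy is 1.0 when every sample is rejected."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=float)[np.argsort(-uncertainty)]
    n = len(correct)
    rates = np.linspace(0.0, 1.0, steps)
    accs = []
    for rate in rates:
        kept = correct[int(np.round(rate * n)):]
        accs.append(kept.mean() if len(kept) else 1.0)
    return rates, np.array(accs)

# Toy run: the two errors carry the highest uncertainty, so the curve
# climbs from the base accuracy of 0.8 to 1.0 and stays monotone.
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
unc = [0.1] * 8 + [0.9, 0.95]
rates, accs = accuracy_rejection_curve(correct, unc)
```

A reliable uncertainty measure produces exactly this monotone behavior; a flat or decreasing curve indicates that the uncertainty values do not correlate with the probability of error.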

Uncertainty Quantification Methods
The methods used for classification with a rejection option through uncertainty measures are the following:
1. Knowledge uncertainty, measured by our proposed KUE (Equation (6)), using KDE for the probability density function of each feature and Scott's rule [43] for the kernel bandwidth;
2. Total uncertainty, approximated by the entropy of the predictive posterior using Equation (3);
3. Aleatoric uncertainty, measured with the average entropy of each model in an ensemble using Equation (4);
4. Epistemic uncertainty, expressed as the difference between the total and the aleatoric uncertainty, given by Equation (5).
Although KUE can be applied to any ML model with a feature-level representation, the measures of total, aleatoric and epistemic uncertainty are approximated using an ensemble approach. Therefore, an RF classifier with 50 trees and a bootstrap approach to create diversity among the trees of the forest was used for this experiment.
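With scikit-learn, the per-tree predictive distributions needed for the ensemble estimates in Equations (3)-(5) can be extracted as follows; the dataset and seed are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# 50 bootstrapped trees, as in the experiment described above.
rf = RandomForestClassifier(n_estimators=50, bootstrap=True,
                            random_state=0).fit(X, y)

# One predictive distribution per tree: shape (M, n_samples, K).
# Each tree plays the role of one hypothesis h_i of the ensemble.
per_tree = np.stack([t.predict_proba(X) for t in rf.estimators_])
mean_dist = per_tree.mean(axis=0)   # averaged member distributions
```

From `per_tree`, total uncertainty is the entropy of `mean_dist`, aleatoric uncertainty is the average of the per-tree entropies, and their difference gives the epistemic part.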

Experimental Results
As explained in Section 3, our classification rule depends on the choice of a predefined percentile of the training data uncertainty values, which can vary depending on the application. As we hypothesized that the percentage of rejected in-distribution data depends on the chosen percentile, we computed the True Positive Rate (TPR) and False Positive Rate (FPR) for a range of training percentiles, as shown in Figure 3, considering as positive the samples classified as out-distribution and as negative the samples classified as in-distribution. Additionally, as in- and out-distribution detection does not consider the prediction error, we also computed an adjusted FPR from which the classification errors were removed, i.e., in-distribution inputs with an uncertainty value higher than the chosen percentile that were also misclassified by the model were removed from the FPR. This adjusted FPR is represented in Figure 3 as FPR*. The metric is meaningful for our method because our method depends on classification performance: the uncertainty of a misclassified input is computed using the probability densities of a different class, so a high uncertainty value is expected both for OOD inputs and for misclassified ones.
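The adjusted FPR* computation can be sketched as follows (the function name and the toy values are ours):

```python
import numpy as np

def fpr_star(unc, is_ood, y_true, y_pred, tau):
    """FPR and adjusted FPR* at uncertainty threshold tau.

    FPR counts in-distribution samples rejected by the threshold; FPR*
    removes from that count the rejections that the model would have
    misclassified anyway, since rejecting those is harmless.
    """
    unc = np.asarray(unc, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    rejected = unc > tau
    in_dist = ~is_ood
    fp = rejected & in_dist                       # rejected in-distribution inputs
    fpr = fp.sum() / in_dist.sum()
    harmless = fp & (np.asarray(y_pred) != np.asarray(y_true))
    fpr_adj = (fp.sum() - harmless.sum()) / in_dist.sum()
    return fpr, fpr_adj

# Toy run: 3 in-distribution samples (one of them misclassified) and 1 OOD sample.
fpr, fpr_adj = fpr_star(unc=[0.1, 0.95, 0.9, 0.99],
                        is_ood=[0, 0, 0, 1],
                        y_true=[0, 1, 2, 0], y_pred=[0, 1, 0, 0], tau=0.8)
```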
In Figure 3, the variation of TPR, FPR and FPR* according to the training percentile (which defines the uncertainty threshold) is presented for each of the four datasets. Each graph comprises the average and the standard deviation over all OOD combinations for each dataset. As expected, increasing the training percentile produced an almost linear decrease in FPR, since the distribution of the training data is similar to the distribution of the in-distribution test data. We can see that FPR* was also linear in all datasets, and both FPR and FPR* converged to 0. This means that, depending on the application and on the risk associated with decisions, we can define the training percentile based on how many in-distribution test samples we are willing to reject. On the other hand, TPR followed a different behavior, where a high percentile could either reject most of the OOD samples together with a few in-distribution test samples, or reject a minor percentage of both in- and out-distribution inputs.

Since our proposed approach only deals with knowledge uncertainty, we also quantified the uncertainty in terms of total, aleatoric and epistemic uncertainty by means of ensemble techniques. Although epistemic uncertainty is a combination of model and knowledge uncertainty, its quantification here is limited to the use of ensemble approaches. Moreover, specialized OOD detection methods would probably perform better for knowledge uncertainty quantification alone. As our approach is only specialized in OOD detection, and total uncertainty encapsulates the uncertainty of the entire distribution, a combination of the two should ideally perform better for the overall classification accuracy. Thus, we combined the uncertainties by first rejecting input samples based on our method, up to the chosen percentile, and then rejecting samples based on total uncertainty.
For evaluation we used accuracy-rejection (AR) curves, where the quality of the prediction uncertainty can be assessed indirectly through the improvement in accuracy as a function of the rejection percentage. If we have a reliable measure of the uncertainty involved in the classification of test inputs, then the uncertainty estimate should correlate with the probability of making a correct decision, accuracy should improve with increasing rejection percentage, and AR curves should be monotonically increasing. The comparison between different methods using AR curves should be based on the required accuracy level and/or the appropriate rejection rate [44]. Since we compare methods derived from the same classifier, the AR curves always had the same starting accuracy for all methods. Consequently, the relevant variable for the empirical evaluation is the rejection rate, so we moved vertically over the graph to see which method had higher accuracy for a given rejection rate. The AR curves were obtained by varying the rejection threshold, rejecting the samples with the highest uncertainty values first.
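An AR curve can be traced by sorting test inputs by uncertainty and recomputing accuracy on the retained set at each rejection rate. A minimal sketch (the function name and the rate grid are our own choices):

```python
import numpy as np

def accuracy_rejection_curve(unc, correct, n_points=10):
    """Accuracy on the retained inputs as a function of the fraction rejected,
    rejecting the highest-uncertainty inputs first."""
    order = np.argsort(-np.asarray(unc, float))      # most uncertain first
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    rates = np.linspace(0.0, 0.9, n_points)
    accs = []
    for r in rates:
        k = min(int(round(r * n)), n - 1)            # never reject everything
        accs.append(correct[k:].mean())              # accuracy on retained set
    return rates, np.array(accs)

# toy example: the two wrong predictions carry the highest uncertainty,
# so accuracy rises from 0.5 to 1.0 as they are rejected
rates, accs = accuracy_rejection_curve([0.9, 0.1, 0.8, 0.2], [0, 1, 0, 1])
```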
Figure 4 presents the average rejection rate against the average accuracy for KUE, total, aleatoric and epistemic uncertainty. The proposed combination is shown in black, and the optimal rejection is represented by the dashed line; the optimal AR curve was computed by rejecting all OOD samples and all misclassified samples first. To obtain the AR curves, we ran 10 random repetitions using 15% of OOD inputs and an uncertainty train percentile of 95% for our proposed combination. As Figure 4 shows, almost every curve, across the different OOD combinations, increased in accuracy as the rejection rate increased. It is also interesting to note that, even with only 15% of OOD inputs, our method always presented a monotone dependency between rejection rate and classification accuracy, which means that it also behaved well on misclassified inputs. The AR curve of the proposed combination was always better than or similar to that of total uncertainty. Furthermore, the AR curve tendency of the KUE method did not vary much between different OOD combinations, in contrast to aleatoric and epistemic uncertainty. The AR curves for the other datasets can be found in Appendix B.3.

Out-of-Distribution Detection
This section presents the effectiveness of KUE in distinguishing in- and out-distribution inputs; we describe the evaluation metrics, the inference methods used for comparison and the experimental results in detail.

Evaluation Metrics
Most recent studies evaluating OOD detection methods employ the Area Under the ROC (AUROC), a threshold-independent performance metric [45]. The Receiver Operating Characteristic (ROC) curve depicts the relationship between the TPR and FPR. The AUROC can be interpreted as the probability that a positive example is assigned a higher detection score than a negative example [46]. Consequently, a random detector corresponds to a 50% AUROC, and a perfect detector corresponds to an AUROC score of 100%. Hendrycks and Gimpel [47] proposed a qualitative interpretation of AUROC values as follows: excellent: 90-100%, good: 80-90%, fair: 70-80%, poor: 60-70%, fail: 50-60%. This interpretation was adopted for the overall evaluation over all datasets.
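Since the AUROC equals the probability that a randomly drawn positive example receives a higher score than a randomly drawn negative one, it can be computed directly for small samples, and the qualitative binning above then maps the score to a label. Both helpers below are illustrative, not code from the paper:

```python
def auroc(scores, is_ood):
    """AUROC (in %) via the rank statistic: the fraction of (positive,
    negative) pairs where the positive scores higher, ties counting 0.5."""
    pos = [s for s, o in zip(scores, is_ood) if o]
    neg = [s for s, o in zip(scores, is_ood) if not o]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))

def qualitative(auroc_pct):
    """Hendrycks-Gimpel style qualitative label for an AUROC percentage."""
    for bound, label in [(90, "excellent"), (80, "good"), (70, "fair"),
                         (60, "poor"), (50, "fail")]:
        if auroc_pct >= bound:
            return label
    return "worse than random"
```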
The Area Under the Precision-Recall curve (AUPR) is another threshold-independent metric frequently applied for OOD detection evaluation [48]. The Precision-Recall (PR) curve plots precision against recall. A random baseline detector has an AUPR approximately equal to the positive class base rate [49], and a perfect detector has an AUPR of 100%. Because the base rate of the positive class greatly influences the AUPR, AUPR-In and AUPR-Out are commonly reported, where the positive class is the in-distribution inputs for AUPR-In and the out-distribution inputs for AUPR-Out. The AUPR is sometimes deemed more informative than AUROC, because the AUROC is not ideal when the positive and negative classes have greatly differing base rates.
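A sketch of AUPR-In and AUPR-Out using the standard average-precision estimate of the area under the PR curve (all names are ours; for AUPR-In the in-distribution inputs are taken as positives and scored by low uncertainty):

```python
import numpy as np

def average_precision(scores, positives):
    """AP = mean of the precision evaluated at each positive's rank,
    ranking inputs by descending score."""
    order = np.argsort(-np.asarray(scores, float))
    positives = np.asarray(positives, bool)[order]
    hits = np.cumsum(positives)                  # positives seen so far
    ranks = np.arange(1, len(positives) + 1)
    return (hits[positives] / ranks[positives]).mean()

def aupr_in_out(unc, is_ood):
    """AUPR-In: in-distribution positive, scored by low uncertainty.
    AUPR-Out: out-distribution positive, scored by high uncertainty."""
    unc = np.asarray(unc, float)
    is_ood = np.asarray(is_ood, bool)
    return average_precision(-unc, ~is_ood), average_precision(unc, is_ood)
```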
For the evaluation of OOD detection we used the same number of in-distribution and out-distribution inputs, and the main metric employed was the AUROC. Additional details about AUPR-In and AUPR-Out can be found in Appendix B.2.

Inference Methods
For in- and out-distribution detection we compare the following classification approaches:
1. KUE_KDE: Our proposed method for knowledge uncertainty estimation, using KDE as the probability density function and Scott's rule [43] for the kernel bandwidth.
2. KUE_Gauss: Our proposed method for knowledge uncertainty estimation, using a Gaussian distribution as the probability density function.
3. p(ω|x): Maximum class probability. Although standard probability estimation is more akin to the aleatoric part of the overall uncertainty, OOD data tend to have lower scores than in-distribution data [47].
4. H[p(ω|x)]: Total uncertainty modeled by the (Shannon) entropy of the predictive posterior distribution. High entropy of the predictive posterior distribution, and therefore high predictive uncertainty, suggests that the test input may be OOD [2].
5. I[ω, h]: Epistemic uncertainty measured in terms of the mutual information between hypotheses and outcomes. High epistemic uncertainty means that p(ω|x, h) varies a lot for different hypotheses h with high probability. The existence of different hypotheses, all considered probable but leading to quite different predictions, can indeed be seen as a sign of an OOD input [2].
6. OCSVM: One-Class SVM introduced by Schölkopf et al. [50], using a radial basis function kernel to allow a non-linear decision boundary. OCSVM learns a decision boundary in feature space to separate in-distribution data from outlier data.
7. SVM_ovo: Multiclass SVM with a one-vs-one approach and calibration across classes, using a variation of Platt's method extended by [51].
8. SVM_ova: Multiclass SVM with a one-vs-all strategy, fitting one SVM per class.
9. NCM: Nearest Class Mean classifier, using a probabilistic model based on multiclass logistic regression to obtain class-conditional probabilities [33].
10. OSNN: Open Set Nearest Neighbor introduced by Júnior et al. [35], using a distance ratio based on the Euclidean distance to the two most similar classes.
11. IF: Isolation Forest introduced by Liu et al. [52] for anomaly detection, implemented as an ensemble of extremely randomized tree regressors.
Note that epistemic uncertainty is approximated by means of ensemble techniques, i.e., by representing the posterior distribution with a finite ensemble of hypotheses. For this reason, and to make the comparison between baseline methods 1-5 fair, we chose an RF classifier for the experimental analysis. Nevertheless, since different classifiers have different accuracies on the very same data, a comparison study was carried out on a set of classical algorithms: RF, K-Nearest Neighbors (KNN), Naive Bayes (NB), SVM and Logistic Regression (LR). Detailed results for our method with the different algorithms can be found in Appendix B.1.
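For a single test input, the entropy-based decomposition over an ensemble of hypotheses can be computed directly from the member predictions: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty the average member entropy, and epistemic uncertainty their difference, the mutual information I[ω, h]. A minimal sketch:

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """Decompose ensemble predictive uncertainty for one input x:
      total     = H[ mean_h p(ω|x,h) ]   (entropy of the averaged prediction)
      aleatoric = mean_h H[ p(ω|x,h) ]   (expected entropy of the members)
      epistemic = total - aleatoric      (mutual information I[ω, h])

    member_probs: array of shape (n_members, n_classes).
    """
    p = np.asarray(member_probs, float)
    eps = 1e-12                                        # avoid log(0)
    mean_p = p.mean(axis=0)
    total = -(mean_p * np.log2(mean_p + eps)).sum()
    aleatoric = -(p * np.log2(p + eps)).sum(axis=1).mean()
    return total, aleatoric, total - aleatoric
```

Two members that agree on a 50/50 prediction yield high aleatoric and near-zero epistemic uncertainty, while two members that confidently disagree yield the reverse, even though the averaged prediction (and hence the total uncertainty) is identical.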

Experimental Results
For the problem of detecting OOD inputs, we trained the models using only in-distribution inputs, ignoring the OOD inputs during training. For the final evaluation, we randomly selected the same number of in-distribution and out-distribution inputs from the test set. Table 2 compares our method, using the two variants of feature modeling (KDE and Gaussian), with the methods described in Section 4.3.2. The OOD names shown in Table 2 indicate the assumed unknown classes for each dataset; for the Bacteria dataset, the names are the antibiotic treatments used to group the unknown classes, detailed in Appendix A. The AUROC values are averages over 10 random repetitions for a total of 29 OOD combinations over 4 different datasets.
From a detailed analysis of Table 2 we notice that, in the majority of the OOD combinations, our method obtained better or comparable AUROC relative to the other methods. Moreover, the proposed method performed more consistently across different OOD combinations, unlike the other methods, which showed unstable behavior with very large standard deviations over all combinations considered. For instance, OCSVM presented the highest performance on the Digits and Cardio datasets, but on the other datasets its performance varied considerably depending on the assumed unknown classes, with poor performance on several OOD combinations. As an example, Figure 5 shows the ROC curves for the Caspofungin and Ciprofloxacin OOD combinations of the Bacteria dataset, representing the best and the worst performance of our method on that dataset, respectively. It is interesting to note that, after our method, OCSVM presented the highest performance for Caspofungin, yet for Ciprofloxacin its performance was lower than random. A similar behavior occurred with the maximum class probability, p(ω|x), and the total uncertainty, H[p(ω|x)], which were the best methods for detecting OOD samples on the Ciprofloxacin combination and the worst for Caspofungin. Both methods behaved the same over all combinations due to their intrinsic dependency: the maximum class probability can also be seen as a measure of the total uncertainty in predictions. Regarding epistemic uncertainty, although it obtained a few poor performances, it showed more consistent behavior than the other methods. Additionally, all methods obtained high AUROC and comparable performance for all combinations of the Digits dataset. Comparing our two feature modeling strategies (KDE and Gaussian), we observed similar results, probably because the feature distributions estimated by KDE in our datasets were close to Gaussian.
These results also allow a deeper understanding of the behavior of the uncertainty combination method proposed in Section 4.2. On the Bacteria dataset, the OOD combinations where our method achieved significantly better AUROC than total uncertainty, namely Daptomycin, Caspofungin and Vancomycin, also had significantly higher accuracy for the same rejection rate. On the other hand, the AR curves of our combination approach and total uncertainty were similar for the other OOD combinations, because total uncertainty also obtained good AUROC on those combinations.
The conclusions drawn for the AUPR-In and AUPR-Out (see Appendix B.2) are analogous to the AUROC analysis, since we used 50% in- and 50% out-distribution inputs.
A more qualitative interpretation of the AUROC is presented in Table 3, where the results represent the number of occurrences in each AUROC interval over all datasets. From this table we can conclude that the KUE method was more robust to changes in OOD combinations and datasets than the state-of-the-art methods; unlike the others, our method did not perform worse than random on any OOD combination. We can also see that OCSVM had the most occurrences of an excellent qualitative evaluation, but it was also among the methods with the most fail and random classifications.

Table 3. Qualitative AUROC evaluation over all OOD combinations. Excellent: 90-100%, Good: 80-90%, Fair: 70-80%, Poor: 60-70%, Fail: 50-60%, Random: <50%. (Columns follow the method numbering of Section 4.3.2; unavailable entries are left blank.)

|           | KUE_KDE | KUE_Gauss | p(ω|x) | H[p(ω|x)] | I[ω,h] | OCSVM | SVM_ovo | SVM_ova | NCM | OSNN | IF |
| Excellent |    9    |    10     |   5    |     4     |   5    |  14   |    4    |    3    |  5  |  11  |  5 |
| Good      |   11    |    12     |   8    |    10     |  12    |   2   |   12    |   11    | 11  |   2  |  7 |
| Fair      |    5    |     4     |   6    |     5     |   7    |   3   |    3    |    7    |  2  |   5  |  2 |
| Poor      |    3    |     2     |   4    |     4     |   2    |   2   |    0    |    2    |  2  |   5  |  7 |
| Fail      |    1    |     1     |   2    |     2     |   1    |   4   |    5    |    3    |     |      |    |

Since our proposed approach for OOD detection is based on density estimation techniques, and density estimation typically requires a large sample size, we performed an ablation study to evaluate how the AUROC results change with the number of train samples used for modeling. Figure 6 presents the results of the ablation study for the four datasets, where we gradually removed 5% of the original number of train samples in each iteration, for a total of 20 iterations per OOD combination. The AUROC values did not change significantly with the number of train samples, meaning that the number of training samples caused only small changes in the feature modeling and hence minor variations in the performance of our method.

Figure 6. Ablation study of the KUE method, with KDE, for the four datasets. The legend represents each OOD input combination, and the title of each plot represents the dataset used.

Discussion and Conclusions
The importance of uncertainty quantification in ML has recently gained attention in the research community. However, its proper estimation is still an open research area. In a standard Bayesian setting, uncertainty is reflected by the predicted posterior probability, which is more akin to the aleatoric part of the overall uncertainty. On the other hand, epistemic uncertainty is commonly associated with OOD detection problems, despite its quantification not being explicitly performed. Although OSR settings are a more realistic scenario for the deployment of ML models, they are mainly focused on effectively rejecting unknown inputs.
With this in mind, we proposed a new method for knowledge uncertainty estimation, KUE, and combined it with the classical information-theoretical measures of entropy proposed in the context of neural networks for distinguishing aleatoric, epistemic and total uncertainty by means of ensemble techniques.
Our proposed KUE method is based on a feature-level density estimation of the in-distribution train data, and it relies on out-distribution inputs neither for hyperparameter tuning nor for threshold selection. Since different classifiers have different accuracies on the very same data, we proposed a method that, although dependent on the classification accuracy, can easily be applied to any feature-level model without changing the underlying classification methodology. As the nature of the data is often difficult to determine, we proposed KDE for feature density estimation. However, due to the computational cost of KDE as the training size grows, we also evaluated the proposed method using a Gaussian distribution. On the four datasets used for evaluation, the Gaussian estimation showed results similar to KDE, which can significantly reduce the computational cost on large datasets. Nevertheless, where possible, the train data distribution can be inspected to choose the best kernel. Regarding the AUROC, our KUE method showed performance competitive with state-of-the-art methods. Furthermore, we defined a threshold for OOD input rejection that is chosen based on the percentage of in-distribution test samples we are willing to reject. We showed how this threshold relates to the FPR and demonstrated that misclassified inputs tend to have high uncertainty values. Although the proposed threshold selection strategy effectively controlled the FPR, the TPR varied considerably between datasets, and it was not possible to estimate its behavior for unknown inputs. Future research should address this limitation by combining KUE with different methods, adopting a hybrid generative-discriminative perspective.
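To make the description concrete, the Gaussian variant of the idea can be sketched as follows. This is only an illustration under our own assumptions (per-class, per-feature univariate Gaussians fitted on train data, uncertainty as the mean negative log-density of the input's features under its predicted class, and rejection above a train-percentile threshold); it is not the paper's reference implementation:

```python
import numpy as np

class KUEGauss:
    """Minimal sketch of a KUE-style detector using the Gaussian variant."""

    def fit(self, X, y, percentile=95):
        # per-class mean and std of each feature on in-distribution train data
        self.stats = {c: (X[y == c].mean(axis=0),
                          X[y == c].std(axis=0) + 1e-9)
                      for c in np.unique(y)}
        # threshold: chosen percentile of the train uncertainty values
        self.threshold = np.percentile(self.score(X, y), percentile)
        return self

    def score(self, X, y_pred):
        """Per-input uncertainty: mean negative log-density over features,
        evaluated under the *predicted* class."""
        out = np.empty(len(X))
        for i, (x, c) in enumerate(zip(X, y_pred)):
            mu, sd = self.stats[c]
            logpdf = -0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))
            out[i] = -logpdf.mean()
        return out

    def is_ood(self, X, y_pred):
        return self.score(X, y_pred) > self.threshold
```

On synthetic data with two well-separated classes, an input near a class mean falls below the threshold while an input far from all class densities is flagged as OOD.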
The aleatoric, epistemic and total uncertainty produced by measures of entropy showed a monotone dependency between rejection rate and classification accuracy, which confirmed that these measures are a reliable indicator of the uncertainty involved in a classification decision. Moreover, the proposed combination of our KUE method and total uncertainty outperformed the individual entropy measures for classification with a rejection option.
Future research includes the study of different combination strategies of uncertainty measures for classification with a rejection option. Leveraging the uncertainty for the interpretability of the rejected inputs is another interesting research direction. In addition, expanding the testing scenarios with more datasets should provide more indications about the robustness of the measures used. If more specialized OOD detection methods are able to properly quantify their own uncertainty, different combinations between existing methods and other sources of uncertainty should also be explored.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Supplementary Details for Bacteria Dataset Experiments
The Bacteria dataset from the work of Ho et al. [40] is publicly available at https://github.com/csho33/bacteria-ID. The detailed OOD combinations can be seen in Table A1, where each combination was selected using the antibiotic treatment of a specific set of bacteria.

Appendix B.1. Detailed Results Using Different Classification Algorithms

The experimental evaluation of our method for OOD detection was performed using an RF to provide a fair comparison with the methods that require ensemble techniques. Nevertheless, we provide detailed results using a set of classical algorithms, namely KNN, NB, SVM and LR, for the uncertainty quantification of our proposed method. In Table A2 we report both the AUROC and the accuracy of each method on the different OOD combinations.
Regarding the results from different classifiers, we notice that the AUROC values are very similar between the different algorithms. However, algorithms with higher accuracy also tend to have higher AUROC, which is expected given our method's dependency on classification accuracy.

Appendix B.2. AUPR-In and AUPR-Out Results
In Tables A3 and A4 we present the detailed results for AUPR-Out and AUPR-In, where the positive class is the out-distribution inputs for AUPR-Out and the in-distribution inputs for AUPR-In.
The conclusions drawn for the AUPR are analogous to the AUROC analysis, since we used 50% in- and 50% out-distribution inputs.

Figure A2. AR curves for aleatoric, epistemic and total uncertainty for the Digits dataset. The curve for perfect rejection is included as a baseline. The name in each plot represents the digits used for each OOD input combination.

Figure A3. AR curves for aleatoric, epistemic and total uncertainty for the Cardio dataset. The curve for perfect rejection is included as a baseline. The name in each plot represents the OOD input combination.