Evaluating and Calibrating Uncertainty Prediction in Regression Tasks

Predicting not only the target but also an accurate measure of uncertainty is important for many machine learning applications, and in particular safety-critical ones. In this work we study the calibration of uncertainty prediction for regression tasks, which often arise in real-world systems. We show that the existing definition for calibration of a regression uncertainty [Kuleshov et al., 2018] has severe limitations in distinguishing informative from non-informative uncertainty predictions. We propose a new definition that avoids this limitation and an evaluation method using a simple histogram-based approach. Our method clusters examples with similar uncertainty prediction and compares the prediction with the empirical uncertainty on these examples. We also propose a simple, scaling-based calibration method that performs as well as much more complex ones. We show results on both a synthetic, controlled problem and on the object detection bounding-box regression task using the COCO and KITTI datasets.


Introduction
Regression problems are common machine learning tasks, and in many applications simply returning the target prediction is not sufficient. In these cases, the learning algorithm also needs to output its confidence in its prediction. For example, when the predictions are used by a self-driving car agent, or any other safety-critical decision maker, the agent needs to take the confidence of these predictions into account. Another example is the commonly used Kalman-Filter tracking algorithm [Blackman, 2004], which requires the variance of the observed object's location estimation in order to correctly combine past and present information to better estimate the current state of the tracked object.
To provide uncertainty estimation, each prediction produced by the machine learning module during inference should be a distribution over the target domain. There are several approaches for achieving this; the most common are Bayesian neural networks [Gal, 2016; Gal and Ghahramani, 2016], ensembles [Lakshminarayanan et al., 2017] and outputting a parametric distribution directly [Nix and Weigend, 1994]. For simplicity we use the direct approach for producing uncertainty: we transform the network output from a single scalar to a Gaussian distribution by taking the scalar as the mean and adding a branch that predicts the standard deviation (STD) as in [Lakshminarayanan et al., 2017]. While this is probably the simplest form, it is commonly used in practice, and our analysis is applicable to more complex distributions as well as other approaches.
* Authors contributed equally.
Given output distributions over target domains and observed targets, the main question we address in this work is how to evaluate the uncertainty estimation of our regressors. For classification, a simple and useful definition is calibration. We say that a classifier is calibrated if, when it predicts some label with probability p, it is correct with probability exactly p. Recently it has been shown that modern deep networks are not well calibrated but rather tend to be overconfident in their predictions [Guo et al., 2017]. The same study revealed that for classification, Platt Scaling [Platt, 1999], a simple scaling of the logits, can achieve well-calibrated confidence estimates.
Defining calibration for regression, where the model outputs a continuous distribution over possible predictions, is not straightforward. In recent work, [Kuleshov et al., 2018] suggested a definition based on credible intervals: if we take the p-th percentile of each predicted distribution, the observed target should fall below it for exactly p percent of the data. Based on this definition the authors further suggested a calibration evaluation metric and a re-calibration method. While this seems very sensible and has the advantage of considering the entire distribution, we found serious flaws in this definition. The main problem arises from averaging over the entire dataset. We show, both empirically and analytically, that using this evaluation metric one can calibrate practically any output distribution, even one which is entirely uncorrelated with the empirical uncertainty. We elaborate on this property of the evaluation method described in [Kuleshov et al., 2018] in Section 2 and show empirical evidence in Section 4.
We propose a new, simple definition for calibration for regression, which is closer to the standard one for classification. Calibration for classification can be viewed as expecting the output for every single data point to correctly predict its error, in terms of misclassification probability. In a similar fashion, we define calibration for regression by simply replacing the misclassification probability with the mean square error. Based on this definition, we propose a new calibration evaluation metric similar to the Expected Calibration Error (ECE) [Naeini et al., 2015]. Finally, we propose a calibration method where we re-adjust the predicted uncertainty, in our case the outputted Gaussian variance, by minimizing the negative log-likelihood (NLL) on a separate re-calibration set. We show good calibration results on a real-world dataset using a simple parametric model which scales the uncertainty by a constant factor. As opposed to [Kuleshov et al., 2018], we show that our calibration metric does not claim that uncertainty that is uncorrelated with the real uncertainty is perfectly calibrated. To summarize, our main contributions are:
• Revealing the fundamental flaws in the current definition of calibrated regression uncertainty [Kuleshov et al., 2018].
• A newly proposed definition of calibrated uncertainty in regression tasks, laying grounds for a new practical evaluation methodology.
• A simple scaling method, similar to temperature scaling for classification [Guo et al., 2017], that can reduce calibration error as well as more complex methods, tested on large, real-world vision datasets.

Related Work
While shallow neural networks are typically well calibrated [Niculescu-Mizil and Caruana, 2005], modern deep networks, albeit superior in accuracy, are no longer calibrated [Guo et al., 2017]. Uncertainty calibration for classification is a relatively well-studied field. Calibration plots, or reliability diagrams, provide a visual representation of uncertainty prediction calibration [DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005]. Calibration methods include extensions of Platt Scaling [Platt, 1999] such as Matrix Scaling and Temperature Scaling [Guo et al., 2017]. In [Guo et al., 2017] it is demonstrated that the simple Temperature Scaling, consisting of a one scaling-parameter model which multiplies the last layer logits, suffices to produce excellent calibration on many classification datasets.
In comparison with classification, calibration of uncertainty prediction in regression has received little attention so far. As already described, [Kuleshov et al., 2018] propose a practical method for evaluation and calibration based on confidence intervals and isotonic regression. The proposed method is applied in the context of Bayesian neural networks. In recent work [Phan et al., 2018], the authors follow the definition and method of calibration for regression of [Kuleshov et al., 2018], but use a standard deviation vs. MSE scatter plot, somewhat similar to our approach, as a sanity check. In concurrent work, [Song et al., 2019] propose a calibration method that addresses the uniformity of [Kuleshov et al., 2018] over the entire dataset. However, they do not address the inherent limitation in the calibration evaluation metric.

Confidence-intervals based calibration
We next review the method for regression uncertainty calibration proposed in [Kuleshov et al., 2018], which is based on confidence intervals, and highlight its shortcomings. We refer to this method in short as the "interval-based" calibration method. We start by introducing basic notations for uncertainty calibration used throughout the paper.
Notations. Let $X, Y \sim P$ be two random variables jointly distributed according to $P$, and let $\mathcal{X} \times \mathcal{Y}$ be their corresponding domains. A dataset $\{(x_t, y_t)\}_{t=1}^{T}$ consists of i.i.d. samples of $X, Y$. A forecaster $H : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ outputs per example $x_t$ a distribution $p_t \equiv H(x_t)$ over the target space, where $\mathcal{P}(\mathcal{Y})$ is the set of all distributions over $\mathcal{Y}$. In classification tasks, $\mathcal{Y}$ is discrete and $p_t$ is a multinomial distribution, and in regression tasks, in which $\mathcal{Y}$ is a continuous domain, $p_t$ is usually a parametric probability density function, e.g. a Gaussian. For regression, we denote by $F_t : \mathcal{Y} \to [0, 1]$ the CDF corresponding to $p_t$.
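According to [Kuleshov et al., 2018], a forecaster $H$ in a regression setting is calibrated if:

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{I}\{y_t \le F_t^{-1}(p)\} \xrightarrow{T \to \infty} p \quad \text{for all } p \in [0, 1]. \tag{1}$$

Intuitively this means that $y_t$ is smaller than $F_t^{-1}(p)$ with probability $p$, or that the predicted CDF matches the empirical one as the dataset size goes to infinity. This is equivalent to

$$\mathbb{P}_{X,Y}\big(Y \le [F(X)]^{-1}(p)\big) = p \quad \text{for all } p \in [0, 1], \tag{2}$$

where $F(X)$ represents the CDF corresponding to $H(X)$. This notion is translated by [Kuleshov et al., 2018] to a practical evaluation and calibration methodology. A re-calibration dataset $S = \{(x_t, y_t)\}_{t=1}^{T}$ is used to compute the empirical CDF value for each predicted CDF value $p \in \{F_t(y_t)\}_{t=1}^{T}$:

$$\hat{P}(p) = \frac{\left|\{y_t \mid F_t(y_t) \le p,\ t = 1, \dots, T\}\right|}{T}. \tag{3}$$

The calibration consists of fitting a regression function $R$ (i.e. isotonic regression) to the set of points $\{(F_t(y_t), \hat{P}(F_t(y_t)))\}_{t=1}^{T}$. For diagnosis the authors suggest a calibration plot of $p$ versus $\hat{P}(p)$.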
We start by intuitively explaining the basic limitation of this methodology. From Eq. 3, $\hat{P}$ is non-decreasing and therefore isotonic regression finds a perfect fit. Therefore, the modified CDF $R \circ F_t$ will satisfy $\hat{P}(p) = p$ on the re-calibration set, and the new forecaster is calibrated up to sampling error. This means that perfect calibration is always possible, even for output CDFs which are statistically independent of the actual empirical uncertainty. We note that this might be acceptable when the uncertainty prediction is degenerate, e.g. all output distributions are Gaussian with the same variance, but this is not the case here. We also note that the issue is with the calibration definition, not the re-calibration, as we show with the following analytic example.
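The argument above can be illustrated numerically. The following is a minimal sketch, not the procedure of [Kuleshov et al., 2018]: it assumes Gaussian forecasts whose STDs are drawn at random, independently of the targets, and it replaces the isotonic fit by direct interpolation of the empirical CDF of Eq. 3 (which is already non-decreasing). The helper names `gaussian_cdf`, `fit_empirical_recalibrator` and `sample_predicted_cdf_values` are hypothetical.

```python
import math
import numpy as np

def gaussian_cdf(y, mu, sigma):
    """CDF of N(mu, sigma^2), evaluated element-wise."""
    z = (np.asarray(y) - mu) / (np.asarray(sigma) * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))

def fit_empirical_recalibrator(p_cal):
    """Return R with R(p) ~ P_hat(p): interpolation of the empirical CDF of p_cal."""
    p_sorted = np.sort(p_cal)
    p_hat = np.arange(1, len(p_sorted) + 1) / len(p_sorted)
    return lambda p: np.interp(p, p_sorted, p_hat, left=0.0, right=1.0)

rng = np.random.default_rng(0)

def sample_predicted_cdf_values(n):
    """Targets y ~ N(0, 1); forecast N(0, sigma^2) with sigma independent of y."""
    y = rng.normal(0.0, 1.0, size=n)
    sigma = rng.uniform(1.0, 10.0, size=n)
    return gaussian_cdf(y, 0.0, sigma)

p_cal = sample_predicted_cdf_values(20_000)  # re-calibration set
p_val = sample_predicted_cdf_values(20_000)  # held-out set
R = fit_empirical_recalibrator(p_cal)

levels = np.linspace(0.05, 0.95, 19)
observed_before = np.array([(p_val <= q).mean() for q in levels])
observed_after = np.array([(R(p_val) <= q).mean() for q in levels])
max_gap_before = float(np.abs(observed_before - levels).max())
max_gap_after = float(np.abs(observed_after - levels).max())
print(f"max |observed - expected| before: {max_gap_before:.3f}, after: {max_gap_after:.3f}")
```

Under these assumptions the gap between expected and observed confidence levels shrinks to sampling error after re-calibration, although the predicted STDs carry no information about the targets.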
We next present a concise analytic example in which the output distribution and the ground truth distribution are independent, yet fully calibrated according to Eq. 2. Consider the case where the target has a normal distribution $y_t \sim \mathcal{N}(0, 1)$ and the network output $H(x_t)$ has a Cauchy distribution with zero location parameter and random scale parameter $\gamma_t$, independent of $x_t$ and $y_t$, defined as:

$$z_t \sim \mathcal{N}(0, 1), \qquad \gamma_t = |z_t|. \tag{4}$$

Following a known equality for Cauchy distributions, the CDF output of the network is $F_t(y) = F(y / \gamma_t)$, where $F$ is the CDF of a Cauchy distribution with zero location and unit scale parameters. First we note that $y_t / \gamma_t$ and $y_t / z_t$, i.e. with and without the absolute value, have the same distribution due to symmetry. Next we recall the well-known fact that the ratio of two independent standard normal random variables is distributed as Cauchy with zero location and unit scale parameters (i.e. $y_t / z_t \sim \mathrm{Cauchy}(0, 1)$). This means that the probability that $F_t(y_t) \equiv F(y_t / \gamma_t) \le p$ is exactly $p$ (recall that $F$ is a $\mathrm{Cauchy}(0, 1)$ CDF). In other words, the prediction is perfectly calibrated according to the definition in Eq. 2, even though the scale parameter was random and independent of the distribution of $y_t$.
While the Cauchy distribution is a bit unusual due to the lack of mean and variance, the example does not depend on it and it was chosen for simplicity of exposition. It is possible to prove the existence of a distribution whose product of two independent samples is Gaussian [Thorin, 1977] and replace the Cauchy with a Gaussian, but it is an implicit construction and not a familiar distribution.
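The Cauchy example can be checked by simulation. The sketch below assumes only the construction described in the text (standard normal targets, scale parameter equal to the absolute value of an independent standard normal); the variable names and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

y = rng.normal(size=T)              # targets y_t ~ N(0, 1)
gamma = np.abs(rng.normal(size=T))  # random scale gamma_t = |z_t|, independent of y_t

# Cauchy(0, gamma_t) CDF evaluated at y_t: 1/2 + arctan(y_t / gamma_t) / pi.
# arctan2(y, gamma) equals arctan(y / gamma) for gamma > 0 and avoids the division.
p = 0.5 + np.arctan2(y, gamma) / np.pi

levels = np.linspace(0.1, 0.9, 9)
observed = np.array([(p <= q).mean() for q in levels])
max_gap = float(np.abs(observed - levels).max())

# The predicted scale is uninformative about the actual error magnitude.
corr = float(np.corrcoef(gamma, np.abs(y))[0, 1])
print(f"max |observed - expected| = {max_gap:.4f}, corr(gamma, |y|) = {corr:.4f}")
```

The observed confidence levels match the expected ones up to sampling noise, while the correlation between the predicted scale and the absolute error stays near zero.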

Our method
We present a new definition for calibration for regression, as well as several evaluation measures and a reliability diagram for calibration diagnosis, analogous to the ones used for classification [Guo et al., 2017]. The basic idea is that for each value of uncertainty, measured through standard deviation $\sigma$, the expected mistake, measured in mean square error (MSE), matches the predicted error $\sigma^2$. This is similar to classification, with MSE replacing the role of misclassification error. More formally, if $\mu(x)$ and $\sigma(x)^2$ are the predicted mean and variance respectively, then we consider a regressor well calibrated if:

$$\forall \sigma: \quad \mathbb{E}_{x,y}\left[(\mu(x) - y)^2 \mid \sigma(x)^2 = \sigma^2\right] = \sigma^2. \tag{5}$$

In contrast to [Kuleshov et al., 2018], this does not average over points with different values of $\sigma^2$, at least in the definition; for practical measures some binning is needed. In addition, compared to [Song et al., 2019], it only looks at how well the MSE is predicted, separately from the quality of the predictions themselves, similar to classification where calibration and accuracy are disconnected. We claim that this captures the desired meaning of calibration, i.e. for each individual example one can correctly predict the expected mistake.
Since we can expect each exact value of $\sigma^2$ in our dataset to appear exactly once, we evaluate Eq. 5 empirically using binning, same as for classification. Formally, let $\sigma_t$ be the standard deviation of the predicted output PDF $p_t$ and assume without loss of generality that the examples are ordered by increasing values of $\sigma_t$. We also assume for notational simplicity that the number of bins, $N$, divides the number of examples, $T$. We divide the indices of the examples into $N$ bins, $\{B_j\}_{j=1}^{N}$, such that:

$$B_j = \left\{(j-1) \cdot \frac{T}{N} + 1, \ \dots, \ j \cdot \frac{T}{N}\right\}.$$

Each resulting bin therefore represents an interval on the standard deviation axis: $[\min_{t \in B_j}\{\sigma_t\}, \max_{t \in B_j}\{\sigma_t\}]$. The intervals are non-overlapping and their boundary values are increasing.
To evaluate how calibrated the forecaster is, we compare per bin $j$ two quantities as follows. The root of the mean variance:

$$RMV(j) = \sqrt{\frac{1}{|B_j|} \sum_{t \in B_j} \sigma_t^2} \tag{6}$$

and the empirical root mean square error:

$$RMSE(j) = \sqrt{\frac{1}{|B_j|} \sum_{t \in B_j} (y_t - \hat{y}_t)^2}, \tag{7}$$

where $\hat{y}_t$ is the mean of the predicted PDF ($p_t$). For diagnosis, we propose a reliability diagram which plots the $RMSE$ as a function of the $RMV$, as shown in Figure 2. The idea is that for a calibrated forecaster, per bin the $RMV$ and the observed $RMSE$ should be approximately equal, and hence the plot should be close to the identity function. Apart from this diagnosis tool, which as we will show is valuable for assessing calibration, we propose additional scores for evaluation.
Expected Normalized Calibration Error (ENCE). For summarizing the error in the calibration we propose the following measure:

$$ENCE = \frac{1}{N} \sum_{j=1}^{N} \frac{|RMV(j) - RMSE(j)|}{RMV(j)}. \tag{8}$$

This score averages the calibration error in each bin, normalized by the bin's root mean predicted variance, since for larger variance we naturally expect larger errors. This measure is analogous to the expected calibration error (ECE) used in classification.
STDs Coefficient of Variation ($c_v$). In addition to the calibration error we would like to measure the dispersion of the predicted uncertainties. If, for example, the forecaster predicts a single homogeneous uncertainty measure for each example, which matches the empirical uncertainty of the predictor for the entire population, then the $ENCE$ would be zero, but the uncertainty estimation per example would be uninformative. Therefore, we complement the $ENCE$ measure with the Coefficient of Variation ($c_v$) for the predicted STDs, which measures their dispersion:

$$c_v = \frac{\sqrt{\frac{1}{T-1} \sum_{t=1}^{T} (\sigma_t - \mu_\sigma)^2}}{\mu_\sigma}, \tag{9}$$

where $\mu_\sigma = \frac{1}{T} \sum_{t=1}^{T} \sigma_t$. Ideally the $c_v$ should be high, indicating a disperse uncertainty estimation over the dataset. We propose using the $ENCE$ as the primary calibration measure and the $c_v$ as a secondary diagnostic tool.
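The evaluation measures of this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, bins are formed by sorting on the predicted STD and splitting into near-equal-count groups, and the sample standard deviation (divisor T - 1) is assumed for the coefficient of variation. The usage part assumes a toy problem in which the true STD of each target equals its input value.

```python
import numpy as np

def reliability_bins(sigma, y, y_hat, n_bins=10):
    """Per-bin RMV and RMSE after ordering examples by predicted STD."""
    sigma, y, y_hat = (np.asarray(a, dtype=float) for a in (sigma, y, y_hat))
    order = np.argsort(sigma)
    bins = np.array_split(order, n_bins)  # near-equal-count bins of indices
    rmv = np.array([np.sqrt(np.mean(sigma[b] ** 2)) for b in bins])
    rmse = np.array([np.sqrt(np.mean((y[b] - y_hat[b]) ** 2)) for b in bins])
    return rmv, rmse

def ence(sigma, y, y_hat, n_bins=10):
    """Expected Normalized Calibration Error: mean of |RMV - RMSE| / RMV over bins."""
    rmv, rmse = reliability_bins(sigma, y, y_hat, n_bins)
    return float(np.mean(np.abs(rmv - rmse) / rmv))

def coefficient_of_variation(sigma):
    """Dispersion of the predicted STDs: sample STD divided by the mean."""
    sigma = np.asarray(sigma, dtype=float)
    return float(np.std(sigma, ddof=1) / np.mean(sigma))

# Toy usage: targets with input-dependent noise and a perfect mean prediction.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=50_000)
y = rng.normal(loc=x, scale=x)
y_hat = x

sigma_informative = x                                 # matches the true STD
sigma_random = rng.uniform(1.0, 10.0, size=x.size)    # independent of the true STD

ence_informative = ence(sigma_informative, y, y_hat)
ence_random = ence(sigma_random, y, y_hat)
print(f"ENCE informative: {ence_informative:.3f}, ENCE random: {ence_random:.3f}")
print(f"c_v informative: {coefficient_of_variation(sigma_informative):.3f}")
```

Plotting `rmse` against `rmv` from `reliability_bins` gives the reliability diagram; for a calibrated forecaster the points should lie near the identity line.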

Calibration
To understand the need for calibration, let us start by considering a trained neural network for regression, which has very low mean squared error (MSE) on the training data. We now add a separate branch that predicts uncertainty as standard deviation, which together with the original network output, interpreted as the mean, defines a Gaussian distribution per example. In this case, the NLL loss on the training data can be minimized by lowering the standard deviation of the predictions, without changing the MSE on training or test data. On test data, however, the MSE will naturally be higher. Since the predicted STDs remain low on test examples, this will result in higher NLL and ENCE values for the test data. This type of miscalibration is defined as over-confidence, but opposite or mixed cases can occur depending on how the model is trained.
Negative log-likelihood. NLL is a standard measure for a probabilistic model's quality [Hastie et al., 2001]. When training the network to output classification confidence or a regression distribution, it is commonly used as the objective function to minimize. It is defined as:

$$NLL = -\sum_{t=1}^{T} \log\big([H(x_t)](y_t)\big). \tag{10}$$

We propose using the NLL on the re-calibration set as our objective for calibration, and the reliability diagram, together with its summary measures ($ENCE$, $c_v$), for diagnosis of the calibration. In the most general setting a calibration function maps predicted PDFs to calibrated PDFs: $R(\theta) : \mathcal{P}(\mathcal{Y}) \to \mathcal{P}(\mathcal{Y})$, where $\theta$ is the set of parameters defining the mapping.
Optimizing calibration over the re-calibration set is obtained by finding $\theta$ yielding minimal NLL:

$$\theta^{*} = \arg\min_{\theta} \ -\sum_{t=1}^{T} \log\big([R(\theta)(p_t)](y_t)\big). \tag{11}$$

To ensure that the calibration generalizes, the diagnosis should be made on a separate validation set. Multiple choices exist for the family of functions $R$ belongs to. We propose using STD Scaling (in analogy to Temperature Scaling [Guo et al., 2017]), which essentially multiplies the STD of each predicted distribution by a constant scaling factor $s$. If the predicted PDF is that of a Gaussian distribution, $\mathcal{N}(\mu, \sigma^2)$, then the re-calibrated PDF is $\mathcal{N}(\mu, (s \cdot \sigma)^2)$. Hence, in this case the calibration objective (Eq. 11) is:

$$\arg\min_{s} \ T \log(s) + \frac{1}{2 s^2} \sum_{t=1}^{T} \frac{(y_t - \mu_t)^2}{\sigma_t^2}. \tag{12}$$

If the original predictions are overconfident, as is common in neural networks, then the calibration should set $s > 1$. This is analogous to Temperature Scaling in classification: a single multiplicative parameter is tuned to fix over- or under-confidence of the model, and it does not modify the model's final prediction since $\mu_t$ remains unchanged.
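For Gaussian outputs, STD scaling can be sketched directly. As a function of s, the NLL of the scaled forecasts equals, up to terms that do not depend on s, T log(s) + (1 / (2 s^2)) Σ ((y_t − µ_t) / σ_t)^2; setting its derivative to zero gives s^2 = (1/T) Σ ((y_t − µ_t) / σ_t)^2. The sketch below uses this closed form in place of the iterative SGD optimization used in the experiments; the function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Total negative log-likelihood of y under N(mu, sigma^2)."""
    return float(np.sum(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        + (y - mu) ** 2 / (2.0 * sigma ** 2)))

def fit_std_scale(y, mu, sigma):
    """Closed-form NLL minimizer over s: root mean square of standardized residuals."""
    return float(np.sqrt(np.mean(((y - mu) / sigma) ** 2)))

# Toy re-calibration set: predicted STDs are too small by a factor of 1.5.
rng = np.random.default_rng(0)
n = 20_000
mu = rng.normal(size=n)
sigma_true = rng.uniform(0.5, 2.0, size=n)
y = mu + sigma_true * rng.normal(size=n)
sigma_pred = sigma_true / 1.5  # overconfident forecaster

s = fit_std_scale(y, mu, sigma_pred)
nll_before = gaussian_nll(y, mu, sigma_pred)
nll_after = gaussian_nll(y, mu, s * sigma_pred)
print(f"s = {s:.3f}, NLL before = {nll_before:.1f}, NLL after = {nll_after:.1f}")
```

The recovered factor is close to 1.5 on this toy data, and the predicted means are untouched, so the accuracy of the regressor does not change.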
More complex calibration methods. Histogram binning and Isotonic Regression applied to the STDs can also be used as calibration methods. We chose STD scaling since: (a) it is less prone to overfit the validation set, (b) it does not enforce minimal and maximal STD values, (c) it is easy to implement and (d) empirically, it produced good calibration results, on par with the much more complex percentile-based isotonic regression of [Kuleshov et al., 2018].

Experimental results
We next show empirical results of our approach on two tasks: a controlled synthetic regression problem and object detection bounding box regression. We examine the effect of outputting trained and random uncertainty on the calibration process. In all training and optimization stages we use an SGD optimizer with learning rate 0.001 and 0.9 momentum. We note that since the calibration in [Kuleshov et al., 2018] works by changing the CDF directly, we need to extract the variance from the modified CDF. To do that we use the formula

$$\mathrm{Var}(X) = 2 \int_{0}^{\infty} x \big(1 - F_X(x) + F_X(-x)\big)\, dx - \left(\int_{0}^{\infty} \big(1 - F_X(x) - F_X(-x)\big)\, dx\right)^{2}. \tag{13}$$

We numerically calculate the integral in Eq. 13 using Romberg's integration method.
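Extracting a variance from a CDF can be sketched with a standard identity: for a continuous random variable, the mean equals the integral over x ≥ 0 of 1 − F(x) − F(−x), and the second moment equals twice the integral over x ≥ 0 of x (1 − F(x) + F(−x)). The sketch below is an illustration under assumptions: it uses a plain trapezoid rule on a truncated grid rather than Romberg's method, and the names `trapezoid`, `variance_from_cdf` and `logistic_cdf`, as well as the grid parameters, are hypothetical. A logistic distribution is used as a check because its variance is known in closed form.

```python
import numpy as np

def trapezoid(f_vals, x):
    """Composite trapezoid rule for samples f_vals on the grid x."""
    return float(np.sum(0.5 * (f_vals[1:] + f_vals[:-1]) * np.diff(x)))

def variance_from_cdf(cdf, upper=200.0, n=400_001):
    """Variance of a distribution given its vectorized CDF.

    The integrals over [0, inf) are truncated at `upper`, which must be
    large enough for the tails beyond it to be negligible.
    """
    x = np.linspace(0.0, upper, n)
    right_tail = 1.0 - cdf(x)  # P(X > x)
    left_tail = cdf(-x)        # P(X <= -x)
    second_moment = 2.0 * trapezoid(x * (right_tail + left_tail), x)
    mean = trapezoid(right_tail - left_tail, x)
    return second_moment - mean ** 2

def logistic_cdf(x, loc=1.0, scale=0.5):
    """Logistic CDF in a numerically stable tanh form."""
    return 0.5 * (1.0 + np.tanh((x - loc) / (2.0 * scale)))

estimated = variance_from_cdf(logistic_cdf)
exact = (0.5 ** 2) * np.pi ** 2 / 3.0  # scale^2 * pi^2 / 3
print(f"estimated variance: {estimated:.5f}, exact: {exact:.5f}")
```

For a re-calibrated forecast, `cdf` would be the composition of the re-calibration map with the original predicted CDF.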

Synthetic regression problem
Experimenting with a synthetic regression problem enables us to control the target distribution $Y$ and to validate our method. We randomly generate $T = 50{,}000$ input samples $\{(x_t, y_t)\}_{t=1}^{T}$. We sample $x_t$ from $X \sim \mathrm{Uniform}[0.1, 1]$ and $y_t$ from $Y \sim \mathcal{N}(x_t, x_t^2)$. This way, the target standard deviation of sample $x_t$ is $x_t$. We train a fully-connected network with four layers and a ReLU activation function on the generated training set using the $L_1$ loss function. In this random uncertainty experiment, per example, the standard deviation representing the uncertainty is randomly drawn from $\mathrm{Uniform}[1, 10]$. We then re-calibrate as described in Sec. 3.1 on a separate re-calibration set consisting of 6,000 samples. We also ran calibration experiments with trained uncertainty, which are described in Appendix A.
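The data for the random-uncertainty setting can be generated as follows. This is a minimal sketch of the sampling scheme described above only; the network training is omitted, and the variable names and the random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50_000

x = rng.uniform(0.1, 1.0, size=T)   # inputs; the true target STD equals x
y = rng.normal(loc=x, scale=x)      # targets y_t ~ N(x_t, x_t^2)

# "Random uncertainty": predicted STDs drawn independently of the data.
sigma_random = rng.uniform(1.0, 10.0, size=T)

corr = float(np.corrcoef(x, sigma_random)[0, 1])
standardized_std = float(np.std((y - x) / x))
print(f"corr(true STD, random STD) = {corr:.4f}")
print(f"STD of (y - x) / x = {standardized_std:.3f}")  # close to 1 by construction
```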
As one can see in Fig. 1(b), the confidence interval method [Kuleshov et al., 2018] can almost perfectly calibrate the random independent uncertainty estimation according to their definition, as the expected and observed confidence levels match and we get the desired identity curve. This phenomenon is extremely undesirable for safety-critical applications, where falsely relying on uninformative uncertainty can lead to severe consequences.
[Table 1 columns: Before calibration; Calibrated (ours); Calibrated [Kuleshov et al., 2018]]

Bounding box regression for object detection
In computer vision, an object detector outputs per input image a set of bounding boxes, each commonly defined by 5 outputs: classification confidence and four positional outputs $(t_x, t_y, t_w, t_h)$ representing its (x, y) position, width and height. We show results on each positional output as an independent regression task using the R-FCN detector [Dai et al., 2016]. To this architecture we add an uncertainty branch predicting the corresponding STDs for each regression output, $(\sigma_x, \sigma_y, \sigma_w, \sigma_h)$. Thus, the network outputs a Gaussian distribution per regression task. For training the network weights we use the entire Common Objects in Context (COCO) dataset [Lin et al., 2014]. For uncertainty calibration and validation we use two separate subsets of the KITTI [Geiger et al., 2012] object detection benchmark dataset, which consists of road scenes. Training the uncertainty output on one dataset and performing calibration on a different one without changing the predictions reduces the risk of over-fitting and increases the calibration validity. See Appendix A for further details. We initially train the network without the additional uncertainty branch as in [Dai et al., 2016], while the uncertainty branch weights are randomly initialized. We then train the uncertainty branch by minimizing the NLL loss (Eq. 10) on the training set, freezing all network weights but the uncertainty head, for 1K training iterations with 6 images per iteration. Freezing the rest of the network ensures that the additional uncertainty estimation represents uncertainty on unseen data. The result of this stage is the network with predicted uncertainty. Finally, we minimize the NLL loss for 1K additional training iterations on the re-calibration set, to optimize the single scaling parameter $s$, and obtain the calibrated uncertainty.
Figure 2 shows the resulting reliability diagrams before calibration (predicted uncertainty) and after (calibrated uncertainty) for all four positional outputs, on the validation set. For comparison we also show the results of calibration with the interval method [Kuleshov et al., 2018]. As can be observed from the monotonically increasing curve before calibration, the output uncertainties are indeed correlated with the empirical ones. Additionally, since the curves are entirely above the ideal one, the predictions are overconfident. Using the learned scaling factor $s$, which varies between 1.1 and 1.2, the $ENCE$ is significantly reduced, as shown in Table 1. The $c_v$ remains unchanged after calibration since it is invariant to uniform scaling of the output STDs (Eq. 9).
In Table 1 we see that calibration with both our method and the method in [Kuleshov et al., 2018] improves the ENCE considerably, and that the two have comparable performance. We first note that our calibration is much simpler, with only a scalar parameter compared to isotonic regression, and does not need any numeric integration to calculate the mean and variance, unlike the calibration in [Kuleshov et al., 2018]. Furthermore, we observe that although we showed that the definition of calibration in [Kuleshov et al., 2018] and its evaluation are flawed, the re-calibration algorithm that was derived from it still shows good results.

Conclusions
Calibration, and more generally uncertainty prediction, are critical parts of machine learning, especially in safety-critical applications. In this work we exposed serious flaws in the current approach to defining and evaluating calibration for regression problems. We proposed a new definition for calibration in regression problems, together with evaluation metrics. Based on our definition we proposed a simple re-calibration method that showed significant improvement in real-world applications.
Figure 3: Bounding box regression ($t_x$) with random uncertainty (independent of actual uncertainty), almost perfectly calibrated by the method proposed in [Kuleshov et al., 2018], so that the expected and observed confidence levels are nearly identical. As anything can be perfectly calibrated, this calibration definition becomes uninformative.

A.1 Synthetic regression problem
We describe here additional experiments run on the synthetic regression problem described in Section 4.1. We start with the random uncertainty estimation setting described in the aforementioned section. Figure 4(a) shows that initially, before calibration, the true and predicted uncertainties are entirely uncorrelated. As expected, after running our re-calibration method these quantities remain uncorrelated (Figure 4(b)). Figures 4(c) and 4(d) plot the reliability diagrams using our evaluation method, along with the ENCE measure, before and after our re-calibration respectively. Notice that the graphs and the high ENCE measures clearly reveal the miscalibration.
We also conducted a trained-uncertainty experiment, in which uncertainty is predicted by the network using the same training protocol as the one used for the KITTI dataset described in Section 4. We can see in Fig. 5 that the network almost perfectly learns the correct uncertainty, as expected from the simplicity of the problem and the high data availability. In this case both methods, ours and the interval method, hardly affect the calibration results. The important point is that our calibration and evaluation method can easily differentiate between the two cases, the random and the predicted uncertainty, while they are almost exactly the same after calibrating with [Kuleshov et al., 2018].

A.2 Bounding box regression for object detection
In this section we provide additional details on the network architecture and datasets used for bounding box prediction, as well as additional results using random predictions. As our base architecture we use the R-FCN detector [Dai et al., 2016] with a ResNet-101 backbone [He et al., 2016]. The R-FCN regression branch outputs per region candidate a 4-d vector that parametrizes the bounding box as $t_b = (t_x, t_y, t_w, t_h)$, following the accepted parametrization in [Girshick, 2015]. We use these outputs in our experiments as four separate regression outputs. To this architecture we add an uncertainty branch, identical in structure to the regression branch, that outputs a 4-d vector $(u_1, u_2, u_3, u_4) \equiv (\log(\sigma_x^2), \log(\sigma_y^2), \log(\sigma_w^2), \log(\sigma_h^2))$, each entry representing the log variance of the Gaussian distribution of the corresponding output. As before, the original regression output represents the Gaussian mean (i.e. $\mu_x = t_x$).
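Working with a log-variance output keeps the implied STD positive without any constraint on the network output. The following is a minimal sketch of how such an output could be converted to an STD and plugged into a per-example Gaussian NLL; the function names are hypothetical and the exact loss implementation used in the experiments is not specified here, so the standard Gaussian NLL is assumed.

```python
import numpy as np

def std_from_log_variance(u):
    """Convert a log-variance output u = log(sigma^2) to sigma."""
    return np.exp(0.5 * np.asarray(u, dtype=float))

def gaussian_nll_from_log_variance(t, y, u):
    """Per-example NLL of target y under N(t, exp(u)), with u the log variance."""
    t, y, u = (np.asarray(a, dtype=float) for a in (t, y, u))
    return 0.5 * (np.log(2.0 * np.pi) + u + (y - t) ** 2 * np.exp(-u))

# Example: one box with outputs (t_x, t_y, t_w, t_h) and log variances (u_1..u_4).
t_b = np.array([0.10, -0.05, 0.20, 0.00])
u_b = np.log(np.array([0.04, 0.04, 0.09, 0.01]))
target = np.array([0.30, -0.05, 0.50, 0.10])

sigma_b = std_from_log_variance(u_b)
nll_b = gaussian_nll_from_log_variance(t_b, target, u_b)
print("sigma:", sigma_b)
print("per-output NLL:", nll_b)
```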
For training the network weights we use the entire Common Objects in Context (COCO) dataset [Lin et al., 2014]. As stated previously, we use a two-stage training approach. We first train the original R-FCN network, and then freeze all weights and train only the additional uncertainty prediction branch. In this way we train uncertainty prediction without sacrificing the network's accuracy. Note, however, that our method still holds if the entire network is trained at once (e.g. if the importance of confidence estimation is such that accuracy may be marginally sacrificed). For uncertainty calibration we use the KITTI [Geiger et al., 2012] object detection benchmark dataset, which consists of road scenes. We divide the KITTI dataset into a re-calibration set used for training the calibration parameters (∼6K images), and a validation set (∼1.5K images, 37K object instances). The classes in the KITTI dataset represent a small subset of the classes in the COCO dataset, and therefore we reduce our model training on COCO to the 9 relevant classes (e.g. car, person) and map them accordingly to the KITTI classes.
Similarly to the random uncertainties in the synthetic data experiment, we conduct an experiment with untrained uncertainty for the bounding box regression task. In this experiment the network for bounding box regression is trained regularly without an uncertainty branch. When the uncertainty branch is added, its weights are randomly initialized, and therefore the predicted uncertainties are uncorrelated with the true uncertainties. Figure 6 shows the reliability diagrams for the four bounding box regression outputs with untrained uncertainty before and after we apply our calibration method. As with the synthetic dataset, the graphs immediately reveal the disconnect between the random values and the empirical uncertainties. In all cases the calibration results in a highly non-calibrated uncertainty according to our metrics. In contrast, Fig. 3 shows that after calibration using the interval-based method [Kuleshov et al., 2018], just as with the synthetic dataset, uncertainty is almost perfectly calibrated.

Figure 2: Reliability diagrams for bounding box regression on the KITTI validation set before and after calibration. Each plot compares the empirical $RMSE$ and the root mean variance ($RMV$) in each bin. Grey dashed line indicates the ideal calibration line. See Sec. 4.2 for details.
In Fig. 1(a) we show the predicted STD vs. the real STD after this calibration, showing that the predictions that are perfectly calibrated according to the interval definition are indeed uncorrelated with the actual uncertainty. In contrast, one can see in Fig. 1(c) that these results are clearly uncalibrated by our definition and metric.

Figure 4: Synthetic regression problem with random uncertainty estimation. (a) Real vs. predicted STDs before calibration. (b) Real vs. predicted STDs after our calibration method was applied. (c) Reliability diagram before calibration. (d) Reliability diagram after applying our calibration.

Figure 5: Synthetic regression problem with predicted uncertainty. (a) Ground truth vs. predicted STDs. (b) Reliability diagram before and after our calibration. (c) Reliability diagram using confidence intervals [Kuleshov et al., 2018] before and after calibration.

Table 1: Evaluation of uncertainty calibration for the bounding box regression tasks on the KITTI validation dataset, before and after calibration.

Figure 6: Reliability diagrams for bounding box regression with untrained uncertainty estimation for the bounding box regression outputs $(t_x, t_y, t_w, t_h)$. Top row: before calibration, bottom row: after calibration.