Peer-Review Record

Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures

Entropy 2021, 23(12), 1608; https://doi.org/10.3390/e23121608
by Benjamin Kompa 1, Jasper Snoek 2 and Andrew L. Beam 1,3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 1 October 2021 / Revised: 17 November 2021 / Accepted: 24 November 2021 / Published: 30 November 2021
(This article belongs to the Special Issue Probabilistic Methods for Deep Learning)

Round 1

Reviewer 1 Report

The manuscript proposes to measure the quality of uncertainty quantification (UQ) schemes for deep learning models using two criteria: the coverage of confidence sets/intervals and their size. For the former, there is some desired level, such as 95%, and for the latter, smaller size is better, as long as the coverage is at least as large as the desired level. The authors apply the two criteria in several simulation studies to show that they shed light on the quality of various UQ approaches.
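[Editor's note: for concreteness, both metrics reduce to a few lines of code. The following is a minimal sketch in Python/NumPy; the names y_true, lower, and upper are illustrative inputs, not taken from the manuscript.]

import numpy as np

def coverage_and_width(y_true, lower, upper):
    # Fraction of targets that fall inside their prediction interval.
    covered = (y_true >= lower) & (y_true <= upper)
    coverage = covered.mean()
    # Average interval size; smaller is better once coverage reaches
    # the nominal level (e.g., 0.95).
    width = (upper - lower).mean()
    return coverage, width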

Overall, I think the two proposed criteria are attractive because of their simplicity and interpretability. Of course, they have been explored in other contexts, but using them as metrics for UQ is not common as of yet. The simulations presented in this work are interesting and well-done, and they convincingly show the value of the authors' proposed evaluation method.

A few major questions/concerns.

(1) This proposal is deeply connected to conformal prediction, but the literature review of that subject is nearly absent from this work. I'd recommend Vovk, Gammerman, and Shafer's "Algorithmic Learning in a Random World" as the most important, general-purpose citation. Another recent overview is "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification" by Angelopoulos and Bates. Various specific citations should be included elsewhere. For example, for line 116, see "Bayes-optimal prediction with frequentist coverage control" by Hoff. For a discussion of how conformal prediction relates to distribution shift, highly relevant to the experiments presented herein, see "Robust Validation: Confident Predictions Even When Distributions Shift" by Cauchois, Gupta, Ali, and Duchi and "Conformal Prediction Under Covariate Shift" by Tibshirani, Barber, Candes, and Ramdas.

(2) The size metric is not directly comparable for models with different coverage levels. For this reason, size plots such as the one in Figure 5 are hard to interpret. To make the comparison of different methods more meaningful, why not calibrate them all to have the same coverage level (with conformal prediction) and then compare sizes?
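[Editor's note: to illustrate this suggestion, split conformal prediction can recalibrate any method's intervals to a common nominal level. The sketch below is a hedged illustration, not the manuscript's procedure; pred_cal, y_cal, pred_test, and alpha are assumed inputs, and absolute residuals serve as the conformity score.]

import numpy as np

def split_conformal(pred_cal, y_cal, pred_test, alpha=0.05):
    # Conformity scores on a held-out calibration set: absolute residuals.
    scores = np.abs(y_cal - pred_cal)
    n = len(scores)
    # Finite-sample-corrected quantile guaranteeing >= 1 - alpha coverage.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    # The "method" keyword requires NumPy >= 1.22.
    q = np.quantile(scores, level, method="higher")
    # Symmetric intervals with the same nominal coverage for every model,
    # so their average widths become directly comparable.
    return pred_test - q, pred_test + q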

Minor points

(1) At various points, the document says that it is desirable to have "high levels of coverage". E.g., Line 196 and the caption of Figure 4. This is a little imprecise. I think that one wants coverage as close as possible to the nominal level (e.g., 95%), not as large as possible. The wording should be adjusted in such places.

(2) Figure 7 is small and currently hard to read. 

(3) In Figure 5, why not present the results with no shift as well? It would help provide context to report the coverage and size without any shift.


Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Although this work proposes a study that seems novel, the way the manuscript is organised makes it very difficult to understand properly. For example, the introduction uses formulas that are unnecessary at that point and whose elements are only described later. There is a Background subsection and then a Related Work section with a single subsubsection, which makes no sense. Section 3.1, on the experimentation carried out, or the results section, needs an introduction indicating how the experiments are approached. In the same section, the different datasets used should be explained in detail. The benchmark used is taken from previous publications; however, given that these are conference publications and the fundamental role they play in the article, they should be explained in detail to show their suitability for the study.

Also noteworthy is the bibliography, which makes excessive use of works published in conferences and should be updated with studies published in prestigious journals.

As a minor comment, the black line in Figure 1a makes it difficult to read.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

1. In the Related Work section, the referenced studies are presented as long paragraphs. It would be more comprehensible to organize the literature review around the pros and cons of each approach.

2. The main objectives and contributions are not apparent in the introduction.

3. For the regression tasks, LL SVI, SVI, and Gaussian processes all had high levels of coverage across nearly all benchmarks. The authors should compare these with other existing models on similar datasets.

4. How did the authors choose the parameters for the regression and classification tasks? Discuss the parameters (and preprocessing) in more detail.

5. Describe the findings of the large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks.

6. The procedure of your approach should be presented systematically, and the functionality of your models should be easy to understand.

7. The large-scale empirical study assessed these methods on standard datasets such as MNIST, CIFAR-10, ImageNet, and other non-image datasets. Additionally, how do translations, rotations, and corruptions of these datasets quantify performance under dataset shift? Discuss the merits of these translations, rotations, corruptions, etc.
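[Editor's note: to make the question in point 7 concrete, one such shift evaluation can be sketched as follows. predict_set is a hypothetical stand-in for any of the paper's set-valued UQ procedures, and SciPy is assumed available; this is an illustration, not the authors' protocol.]

import numpy as np
from scipy.ndimage import rotate

def coverage_under_rotation(predict_set, images, labels,
                            angles=(0, 15, 30, 45, 60)):
    # Track how empirical coverage degrades as the test images are
    # rotated further away from the training distribution.
    results = {}
    for angle in angles:
        shifted = np.stack(
            [rotate(img, angle, reshape=False) for img in images])
        covered = [y in predict_set(x) for x, y in zip(shifted, labels)]
        results[angle] = float(np.mean(covered))
    return results  # coverage per angle; ideally it stays near nominal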

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The article has been suitably improved by addressing and responding to the different comments made.

Reviewer 3 Report

The authors responded to the reviewer's comments very well. The article may be accepted in its current form.
