Confidence intervals for class prevalences under prior probability shift

Point estimation of class prevalences in the presence of data set shift has been a popular research topic for more than two decades. Less attention has been paid to the construction of confidence and prediction intervals for estimates of class prevalences. One little considered question is whether or not it is necessary for practical purposes to distinguish confidence and prediction intervals. Another question so far not yet conclusively answered is whether or not the discriminatory power of the classifier or score at the basis of an estimation method matters for the accuracy of the estimates of the class prevalences. This paper presents a simulation study aimed at shedding some light on these and other related questions.


Introduction
In a prevalence estimation problem, one is presented with a sample of unlabelled instances (the test sample) and is asked to estimate the distribution of the labels in the sample. If the problem is binary, all instances belong to exactly one of two possible classes and, accordingly, can be labelled either positive or negative. The distribution of the labels then is characterised by the prevalence (i.e. proportion) of the positive labels ('class prevalence' for short) in the test sample. However, the labels are latent at estimation time such that the class prevalence cannot be determined by simple inspection of the labels. Instead the class prevalence can only be inferred from the features of the instances in the sample, i.e. from observable covariates of the labels. The interrelationship between features and labels must be learnt from a training sample of labelled instances in another step before the class prevalence of the positive labels in the test sample can be estimated. This whole process is called 'supervised prevalence estimation' (Barranquero et al., 2013), 'quantification' (Forman, 2008), 'class distribution estimation' (González-Castro et al., 2013) or 'class prior estimation' (Du Plessis et al., 2017) in the literature. See González et al. (2017) for a recent overview of the quantification problem and approaches to deal with it. The emergence of further recent papers with new proposals of prevalence estimation methods suggests that the subject is still of high interest for both researchers and practitioners (Castaño et al., 2018; Keith and O'Connor, 2018; Maletzke et al., 2019; Vaz et al., 2019).
A variety of different methods for prevalence point estimation has been proposed and a considerable number of comparative studies for such methods has been published in the literature (González et al., 2017). But the question of how to construct confidence and prediction intervals for class prevalences seems to have attracted less attention. Hopkins and King (2010) routinely provided confidence intervals for their estimates "via standard bootstrapping procedures", without commenting much on details of the procedures or on any issues encountered with them. Keith and O'Connor (2018) proposed and compared a number of methods for constructing such confidence intervals. Some of these methods involve Monte-Carlo simulation and some do not. Also Daughton and Paul (2019) proposed a new method for constructing bootstrap confidence intervals and compared its results with the confidence intervals based on popular prevalence estimation methods. Vaz et al. (2019) introduced the 'ratio estimator' for class prevalences and used its asymptotic properties for determining confidence intervals without involving Monte-Carlo techniques.
This paper presents a simulation study that seeks to illustrate some observations from these previous papers on confidence intervals for class prevalences in the binary case and to provide answers to some questions raised by the papers:
• Would it be worthwhile to distinguish confidence and prediction intervals for class prevalences and deploy different methods for their estimation? This question is raised against the backdrop that for instance Keith and O'Connor (2018) talked about estimating confidence intervals but in fact constructed prediction intervals which are conceptually different (Meeker et al., 2017).
• Would it be worthwhile to base class prevalence estimation on more accurate classifiers? The background for this question is the conflicting statements in the literature as to the benefit of using accurate classifiers for prevalence estimation. On the one hand, Forman (2008, p. 168) stated: "A major benefit of sophisticated methods for quantification is that a much less accurate classifier can be used to obtain reasonably precise quantification estimates. This enables some applications of machine learning to be deployed where otherwise the raw classification accuracy would be unacceptable or the training effort too great." As an example for the opposite position, on the other hand, Barranquero et al. (2015, p. 595) commented with respect to prevalence estimation: "We strongly believe that it is also important for the learner to consider the classification performance as well. Our claim is that this aspect is crucial to ensure a minimum level of confidence for the deployed models."
• Which prevalence estimation methods show the best performance with respect to the construction of confidence intervals for class prevalences that are as short as possible?
• Do non-simulation approaches to the construction of confidence intervals for class prevalences work?
In addition, this paper introduces two new methods for class prevalence estimation which are specifically designed for delivering as short as possible confidence intervals.
Deploying a simulation study for finding answers to the above questions has some advantages compared to working with real-world data:
• The true class prevalences are known and can even be chosen with a view to facilitate obtaining clear answers.
• The setting of the study can be freely modified, say with regard to sample sizes or accuracy of the involved classifiers, in order to investigate the topics in question more precisely.
• In a simulation study, it is easy to apply an ablation approach to assess the relative impact of factors that influence the performance of methods for estimating confidence intervals.
• The results can be easily replicated.
• Simulation studies are good for delivering counter-examples. A method performing poorly in the study reported in this paper may be considered unlikely to perform much better in complex real-world settings.
Naturally, these advantages are bought at the cost of accepting certain obvious drawbacks:
• Most findings of the study are suggestive and illustrative only. No firm conclusions can be drawn from them.
• Important features of the problem which only occur in real-world situations might be overlooked.
• The prevalence estimation problem primarily is caused by data set shift. For capacity reasons, the scope of the simulation study in this paper is restricted to prior probability shift, a special type of data set shift.
With these qualifications in mind, the main findings of this paper can be summarised as follows:
• Extra efforts to construct prediction intervals instead of confidence intervals for class prevalences appear to be unnecessary.
• 'Error Adjusted Bootstrapping' as proposed by Daughton and Paul (2019) for the construction of prevalence confidence or prediction intervals may fail in the presence of prior probability shift.
• Deploying more accurate classifiers for class prevalence estimation results in shorter confidence intervals.
• Compared to the other estimation methods considered in this paper, straightforward 'adjusted classify & count' methods for prevalence estimation (Forman, 2008; called 'confusion matrix method' in Saerens et al., 2001) without any further tuning produce the longest confidence intervals and hence, given identical coverage, perform worst. Methods based on minimisation of the Hellinger distance (González-Castro et al., 2013, with different numbers of bins) produce much shorter confidence intervals, but sometimes do not guarantee sufficient coverage. The maximum likelihood approach (with bootstrapping for the confidence intervals) and 'adjusted probabilistic classify & count' (Bella et al., 2010, called there 'scaled probability average') appear to stably produce the shortest confidence intervals among the methods considered in the paper.
The paper is organised as follows:
• Section 2 'Setting of the simulation study' describes the conception and technical details of the simulation study, including in sub-section 2.2 a list of the prevalence estimation methods in scope.
• Section 3 'Results of the simulation study' provides some tables with results of the study and comments on the results, in order to explore the questions stated above. Results in sub-section 3.3 show that certain standard non-simulation approaches cannot take into account estimation uncertainty in the training sample and that bootstrap-based construction of confidence intervals could be used instead.
• Section 4 'Conclusions' wraps up and closes the paper.
• In Appendix A 'Particulars for the implementation of the simulation study', the mathematical details needed for coding the simulation study are listed.
• In Appendix B 'Analysis of Error Adjusted Bootstrapping', the appropriateness for prior probability shift of the approach proposed by Daughton and Paul (2019) is analysed.

Setting of the simulation study
The set-up of the simulation study is intended to reflect the situation that occurs when a prevalence estimation problem as described in Section 1 has to be solved:
• There is a training sample $(x_{1,P}, y_{1,P}), \ldots, (x_{m,P}, y_{m,P})$ of observations of features $x_{i,P}$ and class labels $y_{i,P} \in \{-1, 1\}$ for m instances. By assumption, this sample was generated from a joint distribution P(X, Y) (the training population distribution) of the feature random variable X and the label (or class) random variable Y.
• There is a test sample $x_{1,Q}, \ldots, x_{n,Q}$ of observations of features $x_{i,Q}$ for n instances. By assumption, each instance has a latent class label $y_{i,Q} \in \{-1, 1\}$, and both the features and the labels were generated from a joint distribution Q(X, Y) (the test population distribution) of the feature random variable X and the label random variable Y.
The prevalence estimation or quantification problem then is to estimate the prevalence q = Q[Y = 1] of the positive class labels in the test population. Of course, this is only a problem if there is data set shift, i.e. if P(X, Y) ≠ Q(X, Y) and, as a likely consequence, p = P[Y = 1] ≠ q.
This paper deals only with the situation where the training population distribution P(X, Y) and the test population distribution Q(X, Y) are related by prior probability shift, which means in mathematical terms that
$$P[X \in A \mid Y = y] = Q[X \in A \mid Y = y], \quad y \in \{-1, 1\}, \tag{2.1}$$
for all subsets A of the feature space such that P[X ∈ A] and Q[X ∈ A] are well-defined.

The model for the simulation study
The classical binormal model with equal variances fits well into the prior probability shift setting for prevalence estimation of this paper. Kawakubo et al. (2016) used it as part of their experiments for comparing the performance of prevalence estimation methods. Logistic regression is a natural and optimal approach to the estimation of the binormal model with equal variances (Cramer, 2003, Section 6.1). Hence when logistic regression is used for the estimation of the model in the simulation study, there is no need to worry about the results being invalidated by the deployment of a sub-optimal regression or classification technique. The binormal model is specified by defining the two class-conditional feature distributions P(X | Y = −1) and P(X | Y = 1).
Training population distribution. Both class-conditional feature distributions are normal with equal variances:
$$P(X \mid Y = -1) = \mathcal{N}(\mu, \sigma^2), \qquad P(X \mid Y = 1) = \mathcal{N}(\nu, \sigma^2). \tag{2.2a}$$
Test population distribution. Same as the training population distribution, with P replaced by Q, in order to satisfy the assumption (2.1) on prior probability shift between training and test times.
For the sake of brevity, in the following the setting with (2.2a) for both the training and the test sample is referred to as 'double' binormal setting.
Given the class-conditional population distributions as specified in (2.2a), the unconditional training and test population distributions can be represented as
$$P(X) = p\,\mathcal{N}(\nu, \sigma^2) + (1-p)\,\mathcal{N}(\mu, \sigma^2), \qquad Q(X) = q\,\mathcal{N}(\nu, \sigma^2) + (1-q)\,\mathcal{N}(\mu, \sigma^2),$$
with p = P[Y = 1] and q = Q[Y = 1] as parameters whose values in the course of the simulation study are selected depending on the purposes of the specific numerical experiments.
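To make the mixture representation concrete, the following minimal Python sketch draws a sample from a two-component normal mixture with a given positive class prevalence. The function name `sample_features` and the values µ = 0, ν = 1 and σ = 1 are assumptions chosen for illustration only; they are not taken from the paper's code.

```python
import random


def sample_features(n, prevalence, mu=0.0, nu=1.0, sigma=1.0, rng=None):
    """Draw n features from prevalence * N(nu, sigma^2) + (1 - prevalence) * N(mu, sigma^2).

    Returns the features together with the labels in {-1, 1}. In a prevalence
    estimation problem the labels of the test sample would be latent.
    mu, nu and sigma are illustrative assumptions.
    """
    rng = rng or random.Random()
    features, labels = [], []
    for _ in range(n):
        positive = rng.random() < prevalence
        features.append(rng.gauss(nu if positive else mu, sigma))
        labels.append(1 if positive else -1)
    return features, labels


rng = random.Random(17)
x_test, y_test = sample_features(500, prevalence=0.2, rng=rng)
# Realised relative frequency of positive labels; it fluctuates around 0.2.
realised_frequency = sum(1 for y in y_test if y == 1) / len(y_test)
```

Because each label is drawn independently, the number of positive instances in such a sample is random, which matches the treatment of the test sample in the study.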
Control parameters. For this paper's numerical experiments, the values for the parametrisation of the model are selected from the ranges specified in the following list:
• p ∈ {0.33, 0.5, 0.67} is the prevalence of the positive class in the training population.
• m ∈ {100, ∞} is the size of the training sample. In the case m = ∞, the training sample is considered identical with the training population and learning of the model is unnecessary. In the case of a finite training sample, the number of instances with positive labels is non-random in order to reflect the fact that for model development purposes a pre-defined stratification of the training sample might be desirable and can be achieved by under-sampling of the majority class or by over-sampling of the minority class. $m_+$ then is the size of the training sub-sample with positive labels, and $m_-$ is the size of the training sub-sample with negative labels. Hence it holds that $m = m_+ + m_-$, $m_+ = p\,m$ and $m_- = (1 - p)\,m$ for finite m.
• q ∈ {0.05, 0.2} is the prevalence of the positive class in the test population.
• n ∈ {50, 500} is the size of the test sample. In the test sample, the number of instances with positive labels is random.
• The number of simulation runs in all of the experiments is $n_{\mathrm{sim}} = 100$, i.e. $n_{\mathrm{sim}}$ times a training sample and a test sample as specified above are generated and subjected to some estimation procedures.
• The number of bootstrap iterations where needed in any of the interval estimation procedures is always R = 999 (Davison and Hinkley, 1997).
• All confidence and prediction intervals are constructed at α = 90% confidence level.
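Choosing ν = 1 in one of the following simulation experiments will reflect a situation where no accurate classifier can be found, as suggested by the fact that then the AUC (area under the curve) of the feature X taken as a soft classifier is $\Phi\left(\frac{\nu-\mu}{\sigma\sqrt{2}}\right) = 76.02\%$, where Φ denotes the standard normal distribution function. In the case ν = 2.5 the same soft classifier X is very accurate with an AUC of $\Phi\left(\frac{\nu-\mu}{\sigma\sqrt{2}}\right) = 96.15\%$. The different performance of the classifier depending on the value of parameter ν is also demonstrated in Figure 1 by the ROCs (receiver operating characteristics) corresponding to the two values ν = 1 ('low power') and ν = 2.5 ('high power').
For the sake of completeness, it is also noted that the feature-conditional class probability P[Y = 1 | X] under the training population distribution is given by
$$P[Y = 1 \mid X](x) = \frac{1}{1 + \exp(a\,x + b)}, \qquad a = \frac{\mu - \nu}{\sigma^2}, \qquad b = \frac{\nu^2 - \mu^2}{2\sigma^2} + \log\frac{1-p}{p}. \tag{2.3a}$$
The density ratio R under both the training and test population distributions is likewise available in closed form, since the class-conditional densities $f_+$ of $\mathcal{N}(\nu, \sigma^2)$ and $f_-$ of $\mathcal{N}(\mu, \sigma^2)$ satisfy
$$\frac{f_+(x)}{f_-(x)} = \exp\left(x\,\frac{\nu - \mu}{\sigma^2} + \frac{\mu^2 - \nu^2}{2\sigma^2}\right). \tag{2.3b}$$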
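The two binormal quantities discussed here, the AUC of the feature X and the posterior probability of the positive class, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the values µ = 0 and σ = 1 are assumptions that happen to reproduce the quoted AUC figures for ν = 1 and ν = 2.5. The logistic form of the posterior (with coefficients named `a` and `b` here) follows from Bayes' formula for two normal densities with equal variances and is compared against a direct Bayes computation.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def binormal_auc(mu, nu, sigma):
    """AUC of X as a score when X|Y=-1 ~ N(mu, sigma^2) and X|Y=1 ~ N(nu, sigma^2)."""
    return NormalDist().cdf((nu - mu) / (sigma * sqrt(2.0)))


def posterior_positive(x, p, mu, nu, sigma):
    """P[Y = 1 | X = x] in logistic form for the equal-variance binormal model."""
    a = (mu - nu) / sigma**2
    b = (nu**2 - mu**2) / (2.0 * sigma**2) + log((1.0 - p) / p)
    return 1.0 / (1.0 + exp(a * x + b))


def posterior_by_bayes(x, p, mu, nu, sigma):
    """The same posterior, computed directly from the two class-conditional densities."""
    f_pos = NormalDist(nu, sigma).pdf(x)
    f_neg = NormalDist(mu, sigma).pdf(x)
    return p * f_pos / (p * f_pos + (1.0 - p) * f_neg)


# Assumed parameter values: mu = 0, sigma = 1 (not stated explicitly in the text).
auc_low = binormal_auc(mu=0.0, nu=1.0, sigma=1.0)   # 'low power' case
auc_high = binormal_auc(mu=0.0, nu=2.5, sigma=1.0)  # 'high power' case
```

For a training prevalence of p = 0.5 the posterior equals 50% exactly at the midpoint (µ + ν)/2 between the two class means, which is a convenient sanity check.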

Methods for prevalence estimation considered in this paper
The following criteria have been applied for the selection of the methods deployed in the simulation study:
• The methods must be Fisher consistent in the sense of Tasche (2017). This criterion excludes for instance 'classify & count' (Forman, 2008), the 'Q-measure' approach (Barranquero et al., 2013) and the distance-minimisation approaches based on the Inner Product, Kumar-Hassebrook, Cosine, and Harmonic Mean distances mentioned in Maletzke et al. (2019).
• The methods should enjoy some popularity in the literature.
• Two new methods based on already established methods and designed to minimise the lengths of confidence intervals are introduced and tested.
According to these criteria the following prevalence estimation methods have been included in the simulation study:
• ACC50: Adjusted Classify & Count (ACC: Gart and Buck, 1966; Saerens et al., 2001; Forman, 2008), based on the Bayes classifier that maximises accuracy. '50' because if the Bayes classifier is represented by means of the posterior probability of the positive class and a threshold, the threshold has to be 50%.
• ACCp: Adjusted Classify & Count, based on the Bayes classifier that maximises the difference of TPR (true positive rate) and FPR (false positive rate). 'p' because if the Bayes classifier is represented by means of the posterior probability of the positive class and a threshold, the threshold needs to be p, the a priori probability (or prevalence) of the positive class in the training population. ACCp was called 'method max' in Forman (2008).
• ACCv: New version of ACC where the threshold for the classifier is selected in such a way that the variance of the prevalence estimates is minimised among all ACC-type estimators based on classifiers represented by means of the posterior probability of the positive class and some threshold.
• APCCv: New version of APCC where the a priori positive class probability parameter in the posterior positive class probability is selected in such a way that the variance of the prevalence estimates is minimised among all APCC-type estimators based on posterior positive class probabilities where the a priori positive class probability parameter varies between 0 and 1.
• MLinf / MLboot: ML is the maximum likelihood approach to prevalence estimation (Peters and Coberly, 1976). Note that the EM (expectation maximisation) approach of Saerens et al. (2001) is one way to implement ML. 'MLinf' refers to construction of the prevalence confidence interval based on the asymptotic normality of the ML estimator (using the Fisher information for the variance). 'MLboot' refers to construction of the prevalence confidence interval solely based on bootstrap sampling.
For the readers' convenience, the particulars needed to implement the methods in this list are presented in Appendix A. Note that ACC50, ACCp, ACCv, APCC and APCCv are all special cases of the 'ratio estimator' discussed in Vaz et al. (2019).
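As an illustration of the ACC family, the sketch below implements the commonly used form of adjusted classify & count: the share of test instances classified as positive is corrected by the classifier's true and false positive rates and then clipped to [0, 1]. It is a minimal sketch under stated assumptions, not the paper's implementation: the classifier is a simple threshold on the feature, TPR and FPR are taken from the population (the m = ∞ case), and µ = 0, σ = 1, n = 5000 as well as the function name `acc_estimate` are chosen for illustration.

```python
import random
from statistics import NormalDist


def acc_estimate(test_features, threshold, mu, nu, sigma):
    """Adjusted classify & count for the classifier 'x > threshold'.

    TPR and FPR are computed from the binormal population distributions,
    i.e. the training sample is treated as identical with the population.
    """
    tpr = 1.0 - NormalDist(nu, sigma).cdf(threshold)
    fpr = 1.0 - NormalDist(mu, sigma).cdf(threshold)
    share_positive = sum(x > threshold for x in test_features) / len(test_features)
    raw = (share_positive - fpr) / (tpr - fpr)
    # Clip estimates outside [0, 1].
    return min(1.0, max(0.0, raw))


rng = random.Random(17)
mu, nu, sigma, q, n = 0.0, 2.5, 1.0, 0.2, 5000  # illustrative values
test_features = [rng.gauss(nu if rng.random() < q else mu, sigma) for _ in range(n)]
# For a training prevalence of 0.5, a posterior threshold of 50% corresponds
# to the midpoint between the two class means.
q_hat = acc_estimate(test_features, threshold=0.5 * (mu + nu), mu=mu, nu=nu, sigma=sigma)
```

The variants in the list differ only in how the threshold is chosen; the adjustment step itself stays the same.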
On the basis of the general asymptotic efficiency of maximum likelihood estimators (Theorem 10.1.12, Casella and Berger, 2002), the maximum likelihood approach for class prevalences is a promising approach for achieving minimum confidence interval lengths. In addition, the ML approach may be considered a representative of the class of entropy-related estimators and, as such, is closely related to the Topsøe approach which was found to perform very well in Maletzke et al. (2019).
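A minimal sketch of maximum likelihood prevalence estimation via EM is given below. It assumes that the class-conditional densities are known exactly (the m = ∞ case of the binormal model) and uses the standard EM update for the mixing weight of a two-component mixture with known components. The function name `ml_prevalence_em`, the starting value, the tolerance and the values µ = 0, σ = 1, n = 2000 are illustrative assumptions.

```python
import random
from statistics import NormalDist


def ml_prevalence_em(test_features, mu, nu, sigma, q_start=0.5, tol=1e-10, max_iter=10_000):
    """ML estimate of the positive class prevalence by EM with known class densities."""
    f_pos = [NormalDist(nu, sigma).pdf(x) for x in test_features]
    f_neg = [NormalDist(mu, sigma).pdf(x) for x in test_features]
    q = q_start
    for _ in range(max_iter):
        # E-step: posterior probability of the positive class under the current q.
        posteriors = [q * fp / (q * fp + (1.0 - q) * fn) for fp, fn in zip(f_pos, f_neg)]
        # M-step: the new prevalence is the average posterior probability.
        q_new = sum(posteriors) / len(posteriors)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q


rng = random.Random(17)
mu, nu, sigma, q_true, n = 0.0, 2.5, 1.0, 0.2, 2000  # illustrative values
test_features = [rng.gauss(nu if rng.random() < q_true else mu, sigma) for _ in range(n)]
q_ml = ml_prevalence_em(test_features, mu, nu, sigma)
```

With a finite training sample the densities (or posterior probabilities) would have to be estimated first, which adds estimation uncertainty that this sketch does not capture.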

Calculations performed in the simulation study
The calculations performed as part of the simulation study serve the purpose of providing facts for answers to the questions listed in Section 1 'Introduction'.
Calculations for constructing confidence intervals. Iterate $n_{\mathrm{sim}}$ times the following steps:
1) Create the training sample: Simulate $m_+$ times from P(X | Y = 1) = N(ν, σ²) features $x_{1,P^+}, \ldots, x_{m_+,P^+}$ of positive instances and $m_-$ times from P(X | Y = −1) = N(µ, σ²) features $x_{1,P^-}, \ldots, x_{m_-,P^-}$ of negative instances.
2) Create the test sample: Simulate the number $N_+$ of positive instances as a binomial random variable with size n and success probability q. Then simulate $N_+$ times from Q(X | Y = 1) = N(ν, σ²) features $x_{1,Q}, \ldots, x_{N_+,Q}$ of positive instances and $N_- = n - N_+$ times from Q(X | Y = −1) = N(µ, σ²) features $x_{N_+ + 1,Q}, \ldots, x_{n,Q}$ of negative instances. The information of whether a feature $x_{i,Q}$ was sampled from Q(X | Y = 1) or from Q(X | Y = −1) is assumed to be unknown in the estimation step. Therefore, the generated features are combined in a single sample $x_{1,Q}, \ldots, x_{n,Q}$.
3) Iterate R times the bootstrap procedure: Generate by stratified sampling with replacement bootstrap samples $x_{1,P^+}, \ldots, x_{m_+,P^+}$ of features of positive instances and $x_{1,P^-}, \ldots, x_{m_-,P^-}$ of features of negative instances from the training subsamples, and $x_{1,Q}, \ldots, x_{n,Q}$ of features with unknown labels from the test sample. Calculate, based on the three resulting bootstrap samples, estimates of the positive class prevalence in the test population according to all the estimation methods listed in Section 2.2.
4) For each estimation method, the bootstrap procedure from the previous step creates a sample of R estimates of the positive class prevalence. Based on this sample of R estimates, construct confidence intervals at level α for the positive class prevalence in the test population.
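The four steps can be sketched for a single simulation run as follows. This is a simplified illustration, not the study's code: only one ACC-type estimator is used, its threshold is set to the midpoint of the two bootstrap class means instead of being derived from a fitted logistic regression, and µ = 0, ν = 2.5, σ = 1 are assumed values. The percentile interval takes the order statistics with ranks (R + 1)·0.05 and (R + 1)·0.95 of the bootstrap estimates; all function names are hypothetical.

```python
import random


def acc_from_samples(pos_train, neg_train, test):
    """ACC estimate with an empirical threshold, TPR and FPR (a simplification)."""
    threshold = 0.5 * (sum(pos_train) / len(pos_train) + sum(neg_train) / len(neg_train))
    tpr = sum(x > threshold for x in pos_train) / len(pos_train)
    fpr = sum(x > threshold for x in neg_train) / len(neg_train)
    if tpr <= fpr:
        return None  # failed estimate
    share = sum(x > threshold for x in test) / len(test)
    return min(1.0, max(0.0, (share - fpr) / (tpr - fpr)))


def percentile_interval(estimates, level=0.90):
    """Percentile interval from the order statistics of the bootstrap estimates."""
    ordered = sorted(estimates)
    r = len(ordered)
    tail = (1.0 - level) / 2.0
    lower_idx = max(int(round((r + 1) * tail)) - 1, 0)
    upper_idx = min(int(round((r + 1) * (1.0 - tail))) - 1, r - 1)
    return ordered[lower_idx], ordered[upper_idx]


def one_run(rng, m_pos=50, m_neg=50, n=50, q=0.2, mu=0.0, nu=2.5, sigma=1.0, R=999):
    # Step 1: stratified training sample with fixed class sizes.
    pos_train = [rng.gauss(nu, sigma) for _ in range(m_pos)]
    neg_train = [rng.gauss(mu, sigma) for _ in range(m_neg)]
    # Step 2: test sample with a binomial number of positive instances.
    n_pos = sum(rng.random() < q for _ in range(n))
    test = [rng.gauss(nu, sigma) for _ in range(n_pos)]
    test += [rng.gauss(mu, sigma) for _ in range(n - n_pos)]
    # Step 3: stratified bootstrap of the training subsamples and the test sample.
    estimates = []
    for _ in range(R):
        est = acc_from_samples(
            rng.choices(pos_train, k=m_pos),
            rng.choices(neg_train, k=m_neg),
            rng.choices(test, k=n),
        )
        if est is not None:
            estimates.append(est)
    # Step 4: percentile confidence interval at the 90% level.
    return percentile_interval(estimates)


lower, upper = one_run(random.Random(17))
```

Repeating `one_run` many times and recording how often the interval contains the true prevalence q gives the coverage rate; averaging `upper - lower` gives the average interval length.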
Tabulated results of the simulation algorithm for confidence intervals.
• For each estimation method, $n_{\mathrm{sim}}$ estimates of the positive class prevalence are calculated. From this set of estimates, the following summary results are derived and tabulated:
  - The average of the prevalence estimates.
  - The average absolute deviation of the prevalence estimates from the true prevalence parameter.
  - The percentage of simulation runs with failed prevalence estimates.
  - The percentage of estimates equal to 0 or 1.
• For each estimation method, $n_{\mathrm{sim}}$ confidence intervals at level α for the positive class prevalence are produced. From this set of confidence intervals, the following summary results are derived and tabulated:
  - The average length of the confidence intervals.
  - The percentage of confidence intervals that contain the true prevalence parameter (coverage rate).

For the construction of the bootstrap confidence intervals in Step 4 of the list of calculations, the method 'perc' (Davison and Hinkley, 1997, Section 5.3.1) of the function boot.ci of the R-package 'boot' is used. More accurate methods for bootstrap confidence intervals are available, but these tend to require more computational time and to be less robust. Given that the performance of 'perc' in the setting of this simulation study can be controlled via checking the coverage rates, the loss in performance seems tolerable. In the cases where calculations have resulted in coverage rates of less than α the calculations have been repeated with the 'bca' method (Davison and Hinkley, 1997, Section 5.3.2) of boot.ci in order to confirm the results.
Step 1 of the calculations can be omitted in the case m = ∞, i.e. when the training sample is identical with the training population distribution. However, in this case some quantities of relevance for the estimates have to be pre-calculated before entering the loop for the $n_{\mathrm{sim}}$ simulation runs. The details for these pre-calculations are provided in Appendix A. Also in the case m = ∞, for the prevalence estimation methods ACC50, ACCp, ACCv, APCC and APCCv, the bootstrap confidence intervals for the prevalences are replaced by "conservative binomial intervals" (Meeker et al., 2017, Section 6.2.2), computed with the 'exact' method of the R-function binconf. Moreover, as explained in Section 2.2, in the case m = ∞ method MLinf is applied instead of MLboot for the construction of the maximum likelihood confidence interval.
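For readers without access to R, a conservative ('exact', Clopper-Pearson type) binomial interval can be sketched in plain Python by inverting the binomial distribution function with bisection. This is an illustrative stand-in, not the binconf implementation, and the function names are hypothetical. The sketch only covers the interval for a binomial success probability; how such an interval is translated into an interval for the class prevalence is not shown here.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def _solve_increasing(func, target, iterations=60):
    """Bisection on [0, 1] for func(p) = target, with func increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, level=0.90):
    """Two-sided conservative interval for a binomial success probability."""
    tail = (1.0 - level) / 2.0
    # Lower bound: P[X >= k | p] = tail; equals 0 if no successes were observed.
    lower = 0.0 if k == 0 else _solve_increasing(lambda p: 1.0 - binom_cdf(k - 1, n, p), tail)
    # Upper bound: P[X <= k | p] = tail; equals 1 if only successes were observed.
    upper = 1.0 if k == n else _solve_increasing(lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - tail)
    return lower, upper


interval = clopper_pearson(10, 50)  # 10 successes among 50 trials
```

For zero observed successes the upper bound has the closed form 1 − 0.05^(1/n) at the 90% level, which can be used to check the bisection.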
As mentioned in Section 1 'Introduction', one of the purposes of the simulation study is to illustrate the differences between confidence and prediction intervals. Conceptually, the difference may be described by their definitions as given in Meeker et al. (2017):
• "A 100(1 − α)% confidence interval for an unknown quantity θ may be formally characterized as follows: If one repeatedly calculates such intervals from many independent random samples, 100(1−α)% of the intervals would, in the long run, correctly include the actual value θ. Equivalently, one would, in the long run, be correct 100(1 − α)% of the time in claiming that the actual value of θ is contained within the confidence interval." (Meeker et al., 2017, Section 2.2.5)
• "If from many independent pairs of random samples, a 100(1 − α)% prediction interval is computed from the data of the first sample to contain the value(s) of the second sample, 100(1 − α)% of the intervals would, in the long run, correctly bracket the future value(s). Equivalently, one would, in the long run, be correct 100(1 − α)% of the time in claiming that the future value(s) will be contained within the prediction interval." (Meeker et al., 2017, Section 2.3.6)
In order to construct prediction intervals instead of confidence intervals in the simulation runs, Step 4 of the calculations is modified as follows:
4') For each estimation method, the bootstrap procedure from the previous step creates a sample of R estimates of the positive class prevalence. For each estimate, generate a virtual number of realisations of positive instances by simulating an independent binomial variable with size n and success probability given by the estimate. Divide these virtual numbers by n to obtain (for each estimation method) a sample of relative frequencies of positive labels. Based on this additional size-R sample of relative frequencies, construct prediction intervals at level α for the percentage of instances with positive labels in the test sample.
As in the case of the construction of confidence intervals, for the construction of the prediction intervals again the method 'perc' of the function boot.ci of the R-package 'boot' is deployed.
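The modified Step 4' can be sketched as follows. The bootstrap prevalence estimates used here are synthetic stand-ins (clipped normal draws around 0.2), chosen only to show the mechanics; in the study they would come from the bootstrap procedure of Step 3. The function names and the simple order-statistic percentile rule are assumptions for illustration.

```python
import random


def percentile_interval(values, level=0.90):
    """Percentile interval from the order statistics of the given values."""
    ordered = sorted(values)
    r = len(ordered)
    tail = (1.0 - level) / 2.0
    lower_idx = max(round((r + 1) * tail) - 1, 0)
    upper_idx = min(round((r + 1) * (1.0 - tail)) - 1, r - 1)
    return ordered[lower_idx], ordered[upper_idx]


def virtual_frequencies(prevalence_estimates, n, rng):
    """For each estimate, simulate a Binomial(n, estimate) count and divide by n."""
    frequencies = []
    for estimate in prevalence_estimates:
        virtual_positives = sum(rng.random() < estimate for _ in range(n))
        frequencies.append(virtual_positives / n)
    return frequencies


rng = random.Random(17)
R, n = 999, 50
# Synthetic stand-in for R bootstrap prevalence estimates (assumption for illustration).
estimates = [min(1.0, max(0.0, rng.gauss(0.2, 0.05))) for _ in range(R)]

confidence_interval = percentile_interval(estimates)
prediction_interval = percentile_interval(virtual_frequencies(estimates, n, rng))
```

The additional binomial layer adds sampling noise on top of the estimation uncertainty, so the prediction interval obtained this way is typically longer than the confidence interval, most visibly for small n.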
Tabulated results of the simulation algorithm for prediction intervals.
• For each estimation method, $n_{\mathrm{sim}}$ virtual relative frequencies of positive labels in the test sample are simulated under the assumption that the estimated positive class prevalence equals the true prevalence. From this set of frequencies, the following summary results are derived and tabulated:
  - The average of the virtual relative frequencies.
  - The average absolute deviation of the virtual relative frequencies from the true prevalence parameter.
  - The percentage of simulation runs with failed prevalence estimates and hence also failed simulations of virtual relative frequencies of positive labels.
  - The percentage of virtual relative frequencies equal to 0 or 1.
• For each estimation method, $n_{\mathrm{sim}}$ prediction intervals at level α for the realised relative frequencies of positive labels are produced. From this set of prediction intervals, the following summary results are derived and tabulated:
  - The average length of the prediction intervals.
  - The percentage of prediction intervals that contain the true relative frequencies of positive labels (coverage rate).

Results of the simulation study
All simulation procedures are performed with parameter setting $n_{\mathrm{sim}} = 100$, Rseed = 17 and R = 999 (see Section 2.1 for the complete list of control parameters). For each table in the following, the values selected for the remaining control parameters are listed in the caption or within the table body.
In all the simulation procedures run for this paper, the R-boot.ci method for determining the statistical intervals (both confidence and prediction) has been the method 'perc'. In cases where the coverage found with 'perc' is significantly lower than 90% (for $n_{\mathrm{sim}} = 100$ at 5% significance level this means lower than 85%), the calculation has been repeated with the R-boot.ci method 'bca' for confirmation or correction.
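The 85% threshold can be made plausible with a one-sided exact binomial test: if the true coverage is 90%, the number of covering intervals among 100 simulation runs is Binomial(100, 0.9) distributed. The sketch below finds the largest count that would still be judged significantly too low at the 5% level. That the paper used exactly this test is an assumption, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def coverage_threshold(n_sim=100, nominal=0.90, significance=0.05):
    """Largest count k with P[Binomial(n_sim, nominal) <= k] <= significance."""
    k = -1
    while binom_cdf(k + 1, n_sim, nominal) <= significance:
        k += 1
    return k


critical_count = coverage_threshold()
```

Under these assumptions, observed coverage counts up to and including `critical_count` out of 100 would be regarded as significantly below the nominal level.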
The naming of the table rows and table columns has been standardized. Unless mentioned otherwise, the columns always display results for all or some of the prevalence estimation methods listed in Section 2.2. Short explanations of the meaning of the row names are given in Table 1. A more detailed explanation of the row names can be found in Section 2.3.

Prediction vs. confidence intervals
In the simulation study performed for this paper, the values of the true positive class prevalences of the test samples (understood in the sense of the a priori positive class prevalences of the populations from which the samples were generated, see Section 2) are always known. In contrast, when one is working with real-world data sets, there is no way to know with certainty the true positive class prevalences of the test samples. Inevitably, therefore, in studies of prevalence estimation methods on real-world data sets, the performance has to be measured by comparison between the estimates and the relative frequencies of the positive labels observed in the test samples.
This was stated explicitly, for instance, in Keith and O'Connor (2018). The authors said in the section 'Problem definition' of the paper that they estimated 'prevalence confidence intervals' with the property that "(1 − α)% of the predicted intervals ought to contain the true value $\theta^*$". For this purpose, Keith and O'Connor defined the 'true value' as follows: "For each group D, let $\theta^* \equiv \frac{1}{n}\sum_i^n y_i$ be the true proportion of positive labels (where n = |D|)." As 'group' was used by Keith and O'Connor as equivalent to sample and $y_i$ was 1 for positive labels and 0 otherwise, it is clear that Keith and O'Connor estimated prediction intervals rather than confidence intervals (see Section 2.3 for the definitions of both types of intervals).
Hence, would it be worthwhile to distinguish confidence and prediction intervals for class prevalences and deploy different methods for their estimation, as has been asked in Section 1?
By assumption (see Section 2), the test sample $x_{1,Q}, \ldots, x_{n,Q}$ is interpreted as the feature components of independent, identically distributed random variables $(X_{1,Q}, Y_{1,Q}), \ldots, (X_{n,Q}, Y_{n,Q})$. While the positive class prevalence in the test population is given by the constant Q[Y = 1] = q, the relative frequency of the positive labels in the test sample is represented by the random variable
$$\bar{Y}_{n,Q} = \frac{1}{n}\sum_{i=1}^{n} I(Y_{i,Q} = 1), \tag{3.1}$$
where $I(Y_{i,Q} = 1) = 1$ if $Y_{i,Q} = 1$ and $I(Y_{i,Q} = 1) = 0$ otherwise.

Table 1: Explanation of the row names in the result tables.
Row name: Explanation
'Av prev': Average of the prevalence estimates (for confidence intervals)
'Av freq': Average of the relative frequencies of simulated positive class labels (for prediction intervals)
'Av abs dev': Average of the absolute deviation of the prevalence estimates or the simulated relative frequencies from the true prevalence
'Perc fail est': Percentage of simulation runs with failed prevalence estimates
'Av int length': Average of the confidence or prediction interval lengths
'Coverage': Percentage of confidence intervals containing the true prevalence or of prediction intervals containing the true realised relative frequencies of positive labels
'Perc 0 or 1': Percentage of prevalence estimates or simulated frequencies with value ≤ 10^−7 or ≥ 1 − 10^−7
The simulation procedures for the panels of Table 2 are intended to gauge the impact of using a confidence interval instead of a prediction interval for capturing the relative frequency of positive labels in the test sample as defined in (3.1). By the law of large numbers, the difference of $\bar{Y}_{n,Q}$ and q ought to be small for large n. Therefore, if there is any impact of using a confidence interval when a prediction interval would be needed, it should rather be visible for smaller n.
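The size of the effect can be quantified with the standard deviation of the relative frequency of positive labels in a test sample of n independent instances, which is the square root of q(1 − q)/n. The short sketch below evaluates it for q = 0.2 and the two test sample sizes of the study; the function name is hypothetical.

```python
from math import sqrt


def relative_frequency_sd(q, n):
    """Standard deviation of the relative frequency of positives among n i.i.d. labels."""
    return sqrt(q * (1.0 - q) / n)


sd_small = relative_frequency_sd(0.2, 50)   # roughly 0.057
sd_large = relative_frequency_sd(0.2, 500)  # roughly 0.018
```

A tenfold increase of the test sample size thus reduces the fluctuation of the realised frequency around q by a factor of about 3.2, which is why any difference between the two interval types should show up mainly for n = 50.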
The algorithm devised in this paper for the construction of prediction intervals (see Section 2.3) involves the simulation of binomial random variables with the prevalence estimates as success probabilities which are independent of the test samples. This procedure, however, is likely to exaggerate the variance of the relative frequencies of the positive labels because the prevalence estimates and the test samples are not only not independent but even by design should be strongly dependent. The dependence between prevalence estimate and the test sample should be the stronger, the more accurate the classifier underlying the estimator is. This implies that for prevalence estimation, differences between prediction and confidence intervals should rather be discernible for lower accuracy of the classifiers deployed. • Top two panels: Simulation of a 'benign' situation, with not too much difference of positive class prevalences (33% vs. 20%) in training and test population distributions, and high power of the score underlying the classifiers and distance minimisation approaches. Results suggest 'overshooting' by the binomial prediction interval approach, i.e. intervals are so long that coverage is much higher than requested, even reaching 100%. The confidence intervals clearly show sufficient coverage of the true realised percentages of positive labels for all estimation methods. Interval lengths are quite uniform, with only the straight ACC methods ACC50 and ACCp showing distinctly longer intervals. Also in terms of average absolute deviation from the true positive class prevalence, the performance is rather uniform. However, it is interesting to see that ACCv which has been designed for minimising confidence interval length among the ACC estimators shows the distinctly worst performance with regard to average absolute deviation.
• Central two panels: Simulation of a rather adverse situation, with very different (67% vs. 20%) positive class prevalences in training and test population distributions and low power of the score underlying the classifiers and distance minimisation approaches. There is still overshooting by the binomial prediction interval approach for all methods but H8. For all methods but H8 sufficient coverage by the confidence intervals is still clearly achieved. H8 coverage of relative positive class frequency is significantly too low with the confidence intervals but still sufficient with the prediction intervals. However, H8 also displays heavy bias of the average relative frequency of positive labels, possibly a consequence of the combined difficulties of there being 8 bins for only 50 points (test sample size) and little difference between the densities of the score conditional on the two classes.
In terms of interval length performance MLboot is best, closely followed by APCC and Energy. But even for these methods, confidence interval lengths of more than 47% suggest that the estimation task is rather hopeless.
• Bottom two panels: Similar picture to the central panels, but even more adverse with a small test sample prevalence of 5%. Results similar, but much higher proportions of 0% estimates for all methods. H8 now has insufficient coverage with both prediction and confidence intervals, and also H4 coverage with the confidence intervals is insufficient. Note the strong estimation bias suggested by all average frequency estimates, presumably caused by the clipping of negative estimates (i.e. replacing such estimates by zero). Among all these bad estimators, MLboot is clearly best in terms of bias, average absolute deviation and interval lengths.
• General conclusion: For all methods from Section 2.2 but the Hellinger methods, it suffices to construct confidence intervals.
• Performance in terms of interval length (with sufficient coverage in all circumstances): MLboot best, followed by APCC and Energy.
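The binomial prediction interval construction discussed in this section can be illustrated by a short sketch. It is not the algorithm of Section 2.3 itself: it assumes that a list of bootstrap prevalence estimates is already available, and the function names, the 90% level and the seed are hypothetical choices made for the illustration only.

```python
import math
import random


def percentile_interval(values, level=0.90):
    """Equal-tailed percentile interval of a list of simulated values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[math.floor(tail * last)], ordered[math.ceil((1.0 - tail) * last)]


def binomial_prediction_interval(boot_estimates, n_test, level=0.90, seed=0):
    """For each bootstrap prevalence estimate q_b, draw Binomial(n_test, q_b) / n_test
    independently of the test sample and take percentiles of the simulated frequencies."""
    rng = random.Random(seed)  # assumed seed, for reproducibility of the sketch
    frequencies = []
    for q_b in boot_estimates:
        positives = sum(1 for _ in range(n_test) if rng.random() < q_b)
        frequencies.append(positives / n_test)
    return percentile_interval(frequencies, level)


# Hypothetical bootstrap estimates scattered around 20%
boot = [0.18, 0.19, 0.20, 0.21, 0.22] * 200
confidence_interval = percentile_interval(boot)
prediction_interval = binomial_prediction_interval(boot, n_test=50)
```

Because the binomial draws are made independently of the test sample, the second interval adds the full binomial variance on top of the spread of the estimates, which is one way to see why overshooting of coverage can occur.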
Daughton and Paul (2019) proposed 'Error Adjusted Bootstrapping' as an approach to constructing "confidence intervals" (prediction intervals, as a matter of fact) for prevalences and showed by example that its performance in terms of coverage was sufficient. However, theoretical analysis of 'Error Adjusted Bootstrapping' presented in Appendix B suggests that this approach is not appropriate for constructing prediction intervals in the presence of prior probability shift. Indeed, Table 3 demonstrates that 'Error Adjusted Bootstrapping' intervals based on the classifiers ACC50 and ACCp (see Section 2.2) achieve sufficient coverage if the difference between the training and test sample prevalences is moderate (33% vs. 20%) but break down if the difference is large (67% vs. 20%).

Does higher accuracy help for shorter confidence intervals?
As mentioned in Section 1, views in the literature differ on whether or not the performance of prevalence estimators is impacted by the discriminatory power of the score underlying the estimation method. Table 4 shows a number of simulation results, for a variety of sets of circumstances, both benign and adverse. Results for high and low power are juxtaposed:
• Top two panels: Simulation of a 'benign' situation, with moderate difference of positive class prevalences (50% vs. 20%) in training and test population distributions, no estimation uncertainty on the training sample and a rather large test sample with n = 500. Results for all estimation methods suggest that the lengths of the confidence intervals are strongly dependent upon the discriminatory power of the score which is the basic building block of all the methods. Coverage is accurate for the high power situation whereas there is even slight overshooting of coverage in the low power situation.
• Central two panels: Simulation of a less benign situation, with small test sample size and low true positive class prevalence in the test sample but still without uncertainty on the training sample. There is nonetheless again evidence for the strong dependence of the lengths of the confidence intervals upon the discriminatory power of the score. For all estimation methods, low power leads to strong bias of the prevalence estimates. The percentage of zero estimates jumps between the 3rd and the 4th panel. Hence, decrease of power of the score entails much higher rates of zero estimates. For the maximum likelihood method, the interval length results in both panels show that constructing confidence intervals based on the central limit theorem for maximum likelihood estimators may become unstable for small test sample size and small positive class prevalence.
• Bottom two panels: Simulation of an adverse situation, with small test and training sample sizes and low true positive class prevalence in the test sample. Results show qualitatively very much the same picture as in the central panels. The impact of estimation uncertainty in the training sample which marks the difference to the situation for the central panels, however, is moderate for high power of the score but dramatic for low power of the score. Again there is a jump of the rate of zero estimates between the two panels differentiated by different levels of discriminatory power. For the Hellinger methods, results of the high power panel suggest a performance issue 7 with respect to the coverage rate. In contrast to MLinf, MLboot (using only bootstrapping for constructing the confidence intervals) performs well, even with relatively low bias for the prevalence estimate in the low power case.
• General conclusion: The results displayed in Table 4 suggest that there should be a clear benefit in terms of shorter confidence intervals when high power scores and classifiers are deployed for prevalence estimation. In addition, the results illustrate the statement on the asymptotic variance of ratio estimators like ACC50, ACCp, ACCv, APCC and APCCv in Corollary 11 of Vaz et al. (2019).
• Performance in terms of interval length (with sufficient coverage in all circumstances): Both APCC estimation methods show good and stable performance when compared to all other methods. Energy and MLboot follow closely. The Hellinger methods also produce short confidence intervals but may have insufficient coverage.

Do approaches to confidence intervals without Monte Carlo simulations work?
For the prevalence estimation methods ACC50, ACCp, ACCv, MS, and MLinf, confidence intervals can be constructed without bootstrapping and, therefore, much less numerical effort. For ACC50, ACCp, ACCv, and MS, conservative binomial intervals by means of the 'exact' method of R-function binconf can be deployed (Meeker et al., 2017, Section 6.2.2). For the maximum likelihood approach, an asymptotically most efficient normal approximation with variance expressed in terms of the Fisher information can be used (Theorem 10.1.12, Casella and Berger, 2002). This approach is denoted by 'MLinf' in order to distinguish it from 'MLboot', maximum likelihood estimation combined with bootstrapping for the confidence intervals.
However, it can be shown by examples that these non-simulation approaches fail in the sense of producing insufficient coverage rates if training sample sizes are finite, i.e. if parameters like true positive and false positive rates needed for the estimators have to be estimated (e.g. by means of regression) before being plugged into the estimators. Table 5, with panels juxtaposing results for infinite and finite sizes of the training sample, provides such an example.
The estimation problem whose results are shown in Table 5 is pretty well-posed, with a large test sample, a high power score underlying the estimation methods and moderate difference between training and test sample positive class prevalences. Panel 1 shows that without estimation uncertainty on the training sample (infinite sample size) the non-simulation approaches produce confidence intervals with sufficient coverage. In contrast, Panel 2 demonstrates that for all five methods coverage breaks down when estimation uncertainty is introduced into the training sample (finite sample size). According to Panel 3, this issue can be remediated by deploying bootstrapping for the construction of the confidence intervals.
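A non-simulation interval of the kind discussed here can be sketched as follows. The sketch assumes that an 'exact' (Clopper-Pearson) interval is computed for the share of positive predictions in the test sample and is then mapped through the usual ACC adjustment with TPR and FPR treated as known constants; this mapping, the function names and the 90% level are assumptions of the illustration, not a transcription of the procedure used for Table 5.

```python
import math


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_1p = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                    + j * log_p + (n - j) * log_1p)
        total += math.exp(log_term)
    return min(total, 1.0)


def exact_binomial_interval(k, n, level=0.90):
    """Clopper-Pearson interval for a binomial proportion, found by bisection."""
    alpha_half = (1.0 - level) / 2.0
    lower, upper = 0.0, 1.0
    if k > 0:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            # P[X >= k] increases in p; search for the p where it equals alpha_half
            if 1.0 - binom_cdf(k - 1, n, mid) < alpha_half:
                lo = mid
            else:
                hi = mid
        lower = (lo + hi) / 2.0
    if k < n:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            # P[X <= k] decreases in p; search for the p where it equals alpha_half
            if binom_cdf(k, n, mid) > alpha_half:
                lo = mid
            else:
                hi = mid
        upper = (lo + hi) / 2.0
    return lower, upper


def acc_interval(k, n, tpr, fpr, level=0.90):
    """Map the interval for the share of positive predictions through the ACC adjustment.
    tpr and fpr are treated as known, i.e. training uncertainty is ignored."""
    if tpr <= fpr:
        raise ValueError("the sketch requires tpr > fpr")
    lower, upper = exact_binomial_interval(k, n, level)

    def adjust(share):
        return min(1.0, max(0.0, (share - fpr) / (tpr - fpr)))

    return adjust(lower), adjust(upper)
```

Since `tpr` and `fpr` enter as fixed numbers, the interval reflects only the sampling variation of the test sample, which is consistent with the loss of coverage reported for finite training samples.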

Conclusions
The simulation study whose results are reported in this paper has been intended to shed some light on certain questions from the literature regarding the construction of confidence or prediction intervals for the prevalence of positive labels in binary quantification problems. In particular, the results of the study should help to provide answers to the questions of
• whether estimation techniques for confidence intervals are appropriate if in practice most of the time prediction intervals are needed, and
• whether the discriminatory power of the soft classifier or score at the basis of a prevalence estimation method matters when it comes to minimising the length of the confidence interval for an estimate.
The answers suggested by the results of the simulation study are subject to a number of qualifications. Most prominent among the qualifications are
• the fact that the findings of the paper apply only for problems where it is clear that training and test sample are related by prior probability shift, and
• the general observation that the scope of a simulation study necessarily is rather restricted and therefore findings of such studies can be suggestive and illustrative at best.
Hence the findings from the study do not allow firm or general conclusions. As a consequence, the answers to the questions suggested by the simulation study have to be taken with caution:
• For not too small test sample sizes like 50 or more, there is no need to deploy special techniques for prediction intervals.
• It is worthwhile to base prevalence estimation on powerful classifiers or scores because this way the lengths of the confidence intervals can be much reduced. The use of less accurate classifiers may entail confidence intervals so long that the estimates have to be considered worthless.
In most of the experiments performed as part of the simulation study, the maximum likelihood approach (method MLboot) to the estimation of the positive class prevalence turned out to deliver on average the shortest confidence intervals. As shown in Appendix A.2.3, application of the maximum likelihood approach requires that in a previous step the density ratio or the posterior class probabilities are estimated on the training samples. To achieve this with sufficient precision is a notoriously hard problem. Note, however, the promising recent progress made on this issue (Kull et al., 2017). Not much worse and in a few cases even superior was the performance of APCC (Adjusted Probabilistic Classify & Count). In contrast the performance of the Energy distance and Hellinger distance estimation methods was not outstanding and, in the case of the latter methods, even insufficient in the sense of not guaranteeing the required coverage rates of the confidence intervals.
ratio estimators are Fisher consistent for estimating the positive class prevalence of the test population under prior probability shift.

Adjusted Classify & Count (ACC).
In the setting of Section 2, denote the feature space (i.e. the range of values which the feature variable X can take) by $\mathcal{X}$. Let $g : \mathcal{X} \to \{-1, 1\}$ be a crisp classifier in the sense that if for an instance it holds that g(X) = 1, a positive class label is predicted, and if g(X) = −1 a negative class label is predicted. With the notation introduced in Section 2, the ACC estimator $Q_g[Y = 1]$ based on the classifier g of the test population positive class prevalence is given by
\[
Q_g[Y = 1] \;=\; \frac{Q[g(X) = 1] - P[g(X) = 1 \mid Y = -1]}{P[g(X) = 1 \mid Y = 1] - P[g(X) = 1 \mid Y = -1]}. \tag{A.1}
\]
Recall that
• Q[g(X) = 1] is the proportion of instances in the test population whose labels are predicted positive by the classifier g.
• P[g(X) = 1 | Y = −1] is the false positive rate (FPR) associated with the classifier g. The FPR equals 100% − true negative rate and, therefore, also 100% − specificity of the classifier g.
• P[g(X) = 1 | Y = 1] is the true positive rate (TPR) associated with the classifier g. The TPR is also called 'recall' or 'sensitivity' of g.
Of course, the ACC estimator of (A.1) is only defined if P[g(X) = 1 | Y = 1] ≠ P[g(X) = 1 | Y = −1], i.e. if g is not completely inaccurate. González et al. (2017, Section 6.2) gave some background information on the history of ACC estimators.
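As a minimal sketch of the ACC adjustment, the share of positive predictions in the test sample can be corrected with the false positive and true positive rates as (share − FPR) / (TPR − FPR). The function name `acc_estimate` is hypothetical, and the clipping to [0, 1] follows the practice mentioned in Section 3 of replacing negative estimates by zero.

```python
def acc_estimate(test_predictions, tpr, fpr):
    """Adjusted Classify & Count estimate of the positive class prevalence.

    test_predictions: crisp predictions in {-1, 1} on the test sample
    tpr, fpr: true and false positive rates of the classifier (training population)
    """
    if tpr == fpr:
        raise ValueError("ACC is undefined for a classifier with tpr == fpr")
    share_positive = sum(1 for g in test_predictions if g == 1) / len(test_predictions)
    raw = (share_positive - fpr) / (tpr - fpr)
    # Clip to the unit interval; raw values can fall outside it in small samples
    return min(1.0, max(0.0, raw))


# Hypothetical example: 40% positive predictions, TPR = 80%, FPR = 20%
predictions = [1] * 40 + [-1] * 60
estimate = acc_estimate(predictions, tpr=0.8, fpr=0.2)
```

The division by TPR − FPR shows why a weak classifier inflates the sampling variation of the estimate: the smaller the denominator, the more any fluctuation in the share of positive predictions is magnified.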
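When a threshold t ∈ R is fixed, the soft classifier $s : \mathcal{X} \to \mathbb{R}$ gives rise to a crisp classifier $g^{(s)}_t$:
\[
g^{(s)}_t(x) \;=\; \begin{cases} 1, & \text{if } s(x) > t,\\ -1, & \text{if } s(x) \le t. \end{cases} \tag{A.2}
\]
The classifiers $p_t(x)$ with
\[
p_t(x) \;=\; \begin{cases} 1, & \text{if } P[Y = 1 \mid X = x] > t,\\ -1, & \text{if } P[Y = 1 \mid X = x] \le t, \end{cases} \qquad 0 < t < 1, \tag{A.3}
\]
are Bayes classifiers which minimise cost-sensitive Bayes errors, see for instance Tasche (2017, Section 2.1). Thresholds of special interest are
• t = 1/2 for maximum accuracy (i.e. minimum classification error) which leads to the estimator ACC50 listed in Section 2.2, and
• t = P[Y = 1] for maximising the denominator of the right-hand side of (A.1) which leads to the estimator ACCp listed in Section 2.2.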
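For the simulation procedures run for this paper, a sample version of $Q[p_t(X) = 1]$ has been used:
\[
Q[p_t(X) = 1] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{p_t(x_{i,Q}) = 1\}, \tag{A.4}
\]
where $x_{1,Q}, \ldots, x_{n,Q}$ denotes a sample generated under the test population distribution Q(X).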
To deal with the case where in the setting of Section 2.1 with the double binormal model the training sample is infinite, the following formulae have been coded for the right-hand side of (A.1) with $g(X) = p_t(X)$ and (A.4) (with parameters a, b as in (2.3a)):

Adjusted Probabilistic Classify & Count (APCC).
For the simulation procedures run for this paper, a sample version of $E_Q\bigl[P[Y = 1 \mid X]\bigr]$ has been used:
\[
E_Q\bigl[P[Y = 1 \mid X]\bigr] \;\approx\; \frac{1}{n} \sum_{i=1}^{n} P[Y = 1 \mid X = x_{i,Q}], \tag{A.7}
\]
where $x_{1,Q}, \ldots, x_{n,Q}$ denotes a sample generated under the test population distribution Q(X).
To deal with the case where in the setting of Section 2.1 with the double binormal model the training sample is infinite, the following formulae have been coded for the right-hand side of (A.6) with h(X) = P[Y = 1 | X] and (A.7) (with parameters a, b as in (2.3a)):

Median sweep (MS).
Forman (2008) proposed to stabilise the prevalence estimates from ACC based on a soft classifier s via (A.2), by taking the median of all ACC estimates based on $g^{(s)}_t$ for all thresholds t such that the denominator P[g

Tuning ACC for ACCv.
Observe that a main factor impacting the length of a confidence interval for a parameter is the standard deviation of the underlying estimator. This suggests the following approach to choosing a good threshold $t^*$ for the classifier $p_t(X)$ in (A.3):
The test population distribution Q appears in the numerator of (A.9) because the confidence interval is calculated for a sample generated from Q. The training population distribution P is used in the denominator of (A.9) because the confidence interval is scaled by the denominator of (A.1). See (A.5) for the formulae used for the calculations of this paper for (A.9) in the setting of Section 2.1. Like in the case of MS, for the purpose of this paper the set of possible thresholds t is restricted to {0.05, 0.1, 0.15, . . . , 0.9, 0.95}.
Tuning APCC for APCCv.
Similarly to (A.9), the idea is to minimise the variance of the estimator under Q while controlling the size of the denominator in (A.6). For 0 < π < 1 define
where $f_+$ and $f_-$ are the class-conditional densities of the features. Then it holds that
A good choice for π could be $\pi^*$ with $\pi^*$ = arg min
For the purpose of this paper the set of possible parameters π in (A.10) is restricted to {0.05, 0.1, 0.15, . . . , 0.9, 0.95}. In the setting of Section 2.1, let a be defined as in (2.3a) and let
Then, analogously to (A.8), in the setting of Section 2.1 the following formulae are obtained for use in the calculations of this paper for (A.10):

A.2. Prevalence estimation by distance minimisation
The idea for prevalence estimation by distance minimisation is to obtain an estimate $\hat{q}$ of Q[Y = 1] = q by solving the following optimisation problem:
\[
\hat{q} \;=\; \arg\min_{0 \le q \le 1} d\bigl(Q(X),\; q\,P[X \in \cdot \mid Y = 1] + (1 - q)\,P[X \in \cdot \mid Y = -1]\bigr). \tag{A.12}
\]
Here d denotes a distance measure of probability measures with the following two properties:
1) $d(M_1, M_2) \ge 0$ for all probability measures $M_1$, $M_2$ to which d is applicable.
2) $d(M_1, M_2) = 0$ if and only if $M_1 = M_2$.
There is no need for d to be a metric (i.e. asymmetric distance measures d with $d(M_1, M_2) \neq d(M_2, M_1)$ for some $M_1$, $M_2$ are permitted). By property 2), distance minimisation estimators defined by (A.12) are Fisher consistent for estimating the positive class prevalence of the test population under prior probability shift. In the following subsections three approaches to prevalence estimation based on distance minimisation are introduced that have been suggested in the literature and appear to be popular.

A.2.1. Prevalence estimation by minimising the Hellinger distance
The Hellinger distance 8 d H of two probability measures M 1 , M 2 on the same domain is defined in measure-theoretic terms by where λ is any measure on the same domain such that both M 1 and M 2 are absolutely continuous with respect to λ. The value of d H (M 1 , M 2 ) does not depend upon the choice of λ.
In practice, the calculation of the Hellinger distance must take into account that most of the time it has to be estimated from sample data. Therefore, the right-hand side of (A.13) is discretised by (in the setting of Section 2) decomposing the feature space $\mathcal{X}$ into a finite number of subsets or bins $\mathcal{X}_1, \ldots, \mathcal{X}_b$ and evaluating the probability measures whose distance is to be measured on these bins. This leads to the following approximative version of the minimisation problem (A.12): (A.14)
If the feature space $\mathcal{X}$ is multi-dimensional, e.g. $\mathcal{X} \subset \mathbb{R}^d$ for some d ≥ 2, González-Castro et al. (2013) also suggest minimising the Hellinger distance separately across all the d dimensions of the feature vector $X = (X_1, \ldots, X_d)$. In this case, the feature space $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d$ is decomposed component-wise in b bins and (A.14) is modified to become
For the purposes of this paper, (A.14) has been adapted to become (A.15) where $(x_{1,Q}, \ldots, x_{n,Q}) \in \mathbb{R}^n$ is a sample of features of instances generated under the test population distribution Q(X) and the P-terms must be estimated from the training sample if it is finite and can be exactly pre-calculated in the case of an infinite training sample. In the latter case, (A.15) has to be modified to reflect the binormal setting of Section 2.1 for the training population distribution:
For this paper, the number b of bins 9 in (A.16a) has been chosen to be 4 or 8, and the boundaries of the bins have been defined as follows 10 :

8 See González-Castro et al. (2013) and Castaño et al. (2018) for more information on the Hellinger distance approach to prevalence estimation.
9 See Maletzke et al. (2019) for critical comments regarding the choice of the number of bins.
10 $\Phi^{-1}$ is the inverse function to the standard normal distribution function.
Kawakubo et al. (2016) and Castaño et al. (2018) provide background information for the application of the Energy distance approach to prevalence estimation.
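The binned Hellinger approach can be illustrated with a small sketch. It assumes one common form of the binned Hellinger distance (the square root of the sum of squared differences of square roots of bin probabilities), whose normalisation may differ from (A.13) without changing the minimiser, and it replaces a proper optimiser by a plain grid search. The function names and the grid size are hypothetical.

```python
import math


def binned_hellinger(probs_a, probs_b):
    """Hellinger-type distance between two discrete distributions on the same bins."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(probs_a, probs_b)))


def hellinger_prevalence(test_bins, pos_bins, neg_bins, grid_size=1001):
    """Grid search for the q in [0, 1] whose mixture of the class-conditional
    bin probabilities is closest to the observed test bin frequencies."""
    best_q, best_dist = 0.0, float("inf")
    for k in range(grid_size):
        q = k / (grid_size - 1)
        mixture = [q * p + (1.0 - q) * m for p, m in zip(pos_bins, neg_bins)]
        dist = binned_hellinger(test_bins, mixture)
        if dist < best_dist:
            best_q, best_dist = q, dist
    return best_q


# Hypothetical bin probabilities for b = 4 bins
pos_bins = [0.1, 0.2, 0.3, 0.4]      # P[X in bin | Y = 1]
neg_bins = [0.4, 0.3, 0.2, 0.1]      # P[X in bin | Y = -1]
test_bins = [0.34, 0.28, 0.22, 0.16]  # mixture with true prevalence 20%
q_hat = hellinger_prevalence(test_bins, pos_bins, neg_bins)
```

With few test instances per bin the observed frequencies `test_bins` become noisy, which gives some intuition for the difficulties reported for H8 with test samples of size 50.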

A.2.2. Prevalence estimation by minimising the Energy distance
Denote by $V$ and $V'$ the projections on the first d components and the last d components, respectively, of $\mathbb{R}^{2d}$, i.e.
\[
V(x_1, \ldots, x_{2d}) = (x_1, \ldots, x_d), \qquad V'(x_1, \ldots, x_{2d}) = (x_{d+1}, \ldots, x_{2d}).
\]
$V$ and $V'$ are also used to denote the identity mapping on $\mathbb{R}^d$, i.e. $V(x) = x$ and $V'(x) = x$ for $x \in \mathbb{R}^d$.
Let $M_1$, $M_2$ be two probability measures on $\mathbb{R}^d$. Then $M_1 \otimes M_2$ denotes the product measure of $M_1$ and $M_2$ on $\mathbb{R}^{2d}$. Hence $M_1 \otimes M_2$ is the probability measure on $\mathbb{R}^{2d}$ such that $V$ and $V'$ are stochastically independent under $M_1 \otimes M_2$.
Denote by ||x|| the Euclidean norm of $x \in \mathbb{R}^d$. Then the Energy distance $d_E$ of two probability measures

Recall that in this section the aim is to estimate class prevalences by solving the optimisation problem (A.12). To do so by means of minimising the Energy distance, fix a function $h : \mathcal{X} \to \mathbb{R}$ and choose as probability measures $M_1$ and $M_2$ the distributions of h(X) under the probability measures Q and q P[X ∈ · | Y = 1] + (1 − q) P[X ∈ · | Y = −1] whose distance is minimised in (A.12):
\[
M_1[D] = Q[h(X) \in D], \qquad M_2[D] = q\,P[h(X) \in D \mid Y = 1] + (1 - q)\,P[h(X) \in D \mid Y = -1],
\]
for 0 ≤ q ≤ 1 and $D \subset \mathbb{R}$ such that all involved probabilities are well-defined. With this choice for $M_1$ and $M_2$, it follows from (A.17) that (A.18) with $P_+$ = P[X ∈ · | Y = 1] and $P_-$ = P[X ∈ · | Y = −1]. The unique minimising value $\hat{q}$ of q for the right-hand side of (A.18) is found to be
The fact that there is a closed-form solution for the estimate $\hat{q}$ in the two-class case is one of the advantages of the Energy distance approach.

The choice $M_1$ = Q(X) and $M_2$ = q P[X ∈ · | Y = 1] + (1 − q) P[X ∈ · | Y = −1] leads to a computationally convenient Kullback-Leibler version of the minimisation problem (A.12). One has to assume that there are a measure λ on the feature space $\mathcal{X}$ and densities $f_Q$, $f_+$ and $f_-$ such that
This gives the following optimisation problem:
\[
\hat{q} \;=\; \arg\max_{0 \le q \le 1} E_Q\bigl[\log\bigl(q\,(R(X) - 1) + 1\bigr)\bigr], \tag{A.23}
\]
where the density ratio R(X) is defined by
\[
R(X) \;=\; \frac{f_+(X)}{f_-(X)},
\]
and additionally it must hold that $f_-(X) > 0$. Under fairly general smoothness conditions, the right-hand side of (A.23) can be differentiated with respect to q. This gives the following necessary condition for optimality in (A.23):
\[
0 \;=\; E_Q\!\left[\frac{R(X) - 1}{q\,(R(X) - 1) + 1}\right]. \tag{A.24}
\]
When both the training and the test samples are finite, the right-hand side of (A.24) as a function of q can be empirically estimated in a straight-forward way.
For the semi-finite setting according to Section 2.1 with infinite training sample, it can be assumed that the density ratio R is fully known by the specification of the model. In contrast, the test population distribution Q(X) of the features is only known through a sample (or empirical distribution) $x_{1,Q}, \ldots, x_{n,Q}$ that was sampled from Q(X). Replacing the expectation with respect to Q in (A.24) with a sample average gives the equation
\[
0 \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{R(x_{i,Q}) - 1}{q\,(R(x_{i,Q}) - 1) + 1} \tag{A.25a}
\]
as an approximative necessary condition for $\hat{q}$ to minimise the Kullback-Leibler distance in (A.23). Solving (A.25a) results in an approximation $\hat{q}_n$ of $\hat{q}$ and therefore of Q[Y = 1].
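The sample version of the optimality condition (A.24), i.e. the requirement that the average of (R − 1) / (q (R − 1) + 1) over the test sample vanishes, can be solved numerically. The sketch below uses bisection, which works because the average is non-increasing in q; the function names, the tolerance and the treatment of boundary cases (returning 0 or 1 when no interior root exists) are assumptions of the illustration.

```python
def ml_score(q, ratios):
    """Sample average of (R - 1) / (q (R - 1) + 1) over the density ratios R(x_i)."""
    return sum((r - 1.0) / (q * (r - 1.0) + 1.0) for r in ratios) / len(ratios)


def ml_prevalence(ratios, tol=1e-10):
    """Solve ml_score(q) = 0 for q in [0, 1] by bisection.

    ratios: density ratios f_+(x_i) / f_-(x_i) evaluated on the test sample,
            assumed strictly positive so that the score is finite on [0, 1].
    """
    if any(r <= 0.0 for r in ratios):
        raise ValueError("the sketch assumes strictly positive density ratios")
    # The score is non-increasing in q; handle the cases without interior root
    if ml_score(0.0, ratios) <= 0.0:
        return 0.0
    if ml_score(1.0, ratios) >= 0.0:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ml_score(mid, ratios) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical density ratios on a tiny test sample
q_hat = ml_prevalence([2.0, 0.5])
```

In the bootstrap variant MLboot, a function of this kind would be applied to each resampled test sample, and the confidence interval would be read off the resulting estimates.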
It is not hard to see (see, for instance, Lemma 4.1 of Tasche
Hence the event 'an instance in the test sample turns out to have a positive label' can be simulated in three steps:
1) Apply the classifier g to the bootstrapped features of an instance in the test sample.
2) If a positive label is predicted by g, simulate a Bernoulli variable with success probability Q[Y = 1 | g(X) = 1]. If a negative label is predicted by g, simulate a Bernoulli variable with success probability Q[Y = 1 | g(X) = −1].
3) In both cases, if the outcome of the Bernoulli variable is success, count the result as positive class, otherwise as negative class.
By (B.1), the probability of the positive class in this experiment is the prevalence of the positive class in the test population distribution. Repeat the experiment for all the instances in the bootstrapped test sample. Then the relative frequency of the positive outcomes of the experiments is an approximate realisation of the relative frequency of the positive class labels in the test sample which in the same way as by the binomial approach described in Section 2.3 can be used to construct a bootstrap prediction interval.
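The three steps above can be sketched in code as follows. The sketch assumes that the conditional probabilities Q[Y = 1 | g(X) = 1] and Q[Y = 1 | g(X) = −1] are supplied by the user; the function names, the number of repetitions, the 90% level and the seed are hypothetical.

```python
import math
import random


def simulate_label_frequency(test_features, classifier, prob_pos, prob_neg, rng):
    """One bootstrap replicate of the relative frequency of positive labels."""
    n = len(test_features)
    positives = 0
    for _ in range(n):
        x = rng.choice(test_features)  # bootstrap draw from the test features
        # Steps 1 and 2: classify, then pick the matching success probability
        success_prob = prob_pos if classifier(x) == 1 else prob_neg
        # Step 3: Bernoulli draw decides the simulated class label
        if rng.random() < success_prob:
            positives += 1
    return positives / n


def bootstrap_prediction_interval(test_features, classifier, prob_pos, prob_neg,
                                  n_rep=1000, level=0.90, seed=0):
    """Percentile interval of the simulated relative frequencies."""
    rng = random.Random(seed)
    freqs = sorted(
        simulate_label_frequency(test_features, classifier, prob_pos, prob_neg, rng)
        for _ in range(n_rep)
    )
    tail = (1.0 - level) / 2.0
    last = n_rep - 1
    return freqs[math.floor(tail * last)], freqs[math.ceil((1.0 - tail) * last)]


# Hypothetical one-dimensional scores and a threshold classifier
features = [0.9, 0.7, 0.2, 0.1, 0.4, 0.8, 0.3, 0.05, 0.6, 0.15]
interval = bootstrap_prediction_interval(
    features, lambda x: 1 if x > 0.5 else -1, prob_pos=0.7, prob_neg=0.1
)
```

The quality of the resulting interval depends on how `prob_pos` and `prob_neg` are obtained, since they must refer to the test population Q.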
Daughton and Paul (2019)
Unfortunately, (B.2) does not hold under prior probability shift as is implied by the following representation of the precision in terms of TPR = P[g(X) = 1 | Y = 1], FPR = P[g(X) = 1 | Y = −1] and the training population prevalence p = P[Y = 1]:
\[
P[Y = 1 \mid g(X) = 1] \;=\; \frac{p\,TPR}{p\,(TPR - FPR) + FPR}. \tag{B.3}
\]
Under prior probability shift, TPR and FPR are not changed, but p changes. Hence, by replacing p with q ≠ p on the right-hand side of (B.3), it follows that
\[
Q[Y = 1 \mid g(X) = 1] \;=\; \frac{q\,TPR}{q\,(TPR - FPR) + FPR} \;\neq\; P[Y = 1 \mid g(X) = 1].
\]
Therefore, under prior probability shift, the approach by Daughton and Paul (2019) is unlikely to work in general. See Table 3 in Section 3.1 for a numerical example. It is not clear if requiring that (B.2) holds results in defining an instance of data set shift which might occur in the real world.
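A small numerical illustration of this point, assuming hypothetical values TPR = 80% and FPR = 30% and the prevalences 67% (training) and 20% (test) used in the adverse scenarios of Section 3: the precision is evaluated with the representation prevalence · TPR / (prevalence · (TPR − FPR) + FPR). The function name `precision` is chosen for the illustration only.

```python
def precision(prevalence, tpr, fpr):
    """Probability of a positive label given a positive prediction."""
    return prevalence * tpr / (prevalence * (tpr - fpr) + fpr)


tpr, fpr = 0.8, 0.3                          # hypothetical classifier characteristics
precision_train = precision(0.67, tpr, fpr)  # prevalence of the training population
precision_test = precision(0.20, tpr, fpr)   # prevalence of the test population
```

With these assumed inputs the precision drops from roughly 0.84 to roughly 0.40, so a precision learnt on the training sample would substantially overstate the rate of true positives among the positive predictions on the test sample.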