Exact Probability Distribution for the ROC Area under Curve

Simple Summary This contribution allows for the computation of exact p-values and for conducting accurate statistical hypothesis tests of ROC AUC-values. As a result, the development of diagnostic tests is facilitated. This work is illustrated via simulated data and through the development of proteomic blood biomarkers for the early detection of cancer. Abstract The Receiver Operating Characteristic (ROC) is a de facto standard for determining the accuracy of in vitro diagnostic (IVD) medical devices, and thus the exactness in its probability distribution is crucial toward accurate statistical inference. We show the exact probability distribution of the ROC AUC-value, hence exact critical values and p-values are readily obtained. Because the exact calculations are computationally intense, we demonstrate a method of geometric interpolation, which is exact in a special case but generally an approximation, vastly increasing computational speeds. The method is illustrated through open access data, demonstrating superiority of 26 composite biomarkers relative to a predicate device. Especially under correction for testing of multiple hypotheses, traditional asymptotic approximations are encumbered by considerable imprecision, adversely affecting IVD device development. The ability to obtain exact p-values will allow more efficient IVD device development.


Background
The ROC concept [1] has become a de facto standard for determining the accuracy of binary predictors within many areas, such as IVD medical devices [2][3][4]. In order to provide a regulator with valid scientific evidence supporting the conclusion that there is reasonable assurance that an IVD device is safe and effective [5], it is highly valuable to demonstrate that the ROC Area Under Curve (AUC) value of the novel IVD device is non-inferior relative to a predicate device. Because the ROC-curve and the AUC-value is subject to randomness due to sources of error including sampling error and measurement imprecision, statistical hypothesis tests are necessary.
The motivation for this work, specifically, is our effort to develop biomarkers for the early detection of cancer. Because of combinatory proliferation when composite biomarkers are generated, the correction for testing multiple hypotheses is often severe, yielding critical values that are situated in the far tails of the AUC-value probability distribution. It is a well-known phenomenon that the tails of probability distributions are particularly sensitive to inaccuracies; hence, exactness in the AUC-value probability distribution is crucial toward accurate statistical inference.
The term ROC was introduced by Ref. [6] and the phrase AUC was used by Ref. [7]. While a complete literature review is beyond the scope of this article, a summary of approaches used to approximate the AUC-value probability distribution is warranted and follows in this paragraph and the next. It was shown [8] that when computed by the trapezoidal rule, the trapezoidal AUC-value equals a constant times the Mann-Whitney

Exact Computation via Order Statistics
This subsection derives the exact AUC-value probability distribution under the assumption of observed values of the True Positives (TPs) and True Negatives (TNs) that each are independent and identically distributed, iid with probability distribution functions denoted as F and G, respectively.
Because the ROC-curve is determined by the ranks of the observed values of the TPs and TNs, the probability of an AUC-value can be computed through order statistics. In general, suppose there are n observed values of the TPs, x 1 , . . . , x n , and m observed values of the TNs, y 1 , . . . , y m , and the two have probability density functions f and g, respectively, then under the iid assumption the order statistic joint probability density function, η, presuming they are indexed such that y 1 ≤ · · · ≤ y m and x 1 ≤ · · · ≤ x n , equals m!n!g(y 1 ) · · · g(y m ) f (x 1 ) · · · f (x n ) if y 1 ≤ · · · ≤ y m and x 1 ≤ · · · ≤ x n , 0 otherwise.
Because each ROC-curve is determined completely by the ranks of the observed values of the TPs and TNs, the probability that a given ROC-curve will manifest itself can thus be obtained through integration of the joint probability density function.
For example, the ROC-curve that has an AUC-value equal to unity corresponds to observing values of the TPs and TNs such that y m ≤ x 1 , and under the iid assumption and existence of probability density functions, the probability of the ROC-curve equals −∞ η dy 1 · · · dy m dx 1 · · · dx n , which after application of the chain rule can be shortened to For another example, the ROC-curve that has an AUC-value equal to 1 − 1/mn, i.e., the second highest AUC-value possible, corresponds to observing values of the TPs and TNs such that y m−1 ≤ x 1 ≤ y m ≤ x 2 , and under the same assumptions the probability of that ROC-curve equals Continuing to the example of the ROC-curve with AUC-value 1 − 2/mn, the reader will note that there are two distinct ROC-curves that produce the AUC-value, one that corresponds to observing values of TPs and TNs such that y m−1 ≤ x 1 ≤ x 2 ≤ y m ≤ x 3 and the other such that y m−2 ≤ x 1 ≤ y m−1 ≤ y m ≤ x 2 . Because the two ROC-curves are mutually exclusive, the probability of observing TPs and TNs such that the AUCvalue equals 1 − 2/mn is equal to the sum of the probabilities of those two ROC-curves. Generally, distinct ROC-curves are mutually exclusive and consequently the probability of a given AUC-value equals the sum of the probabilities of all ROC-curves that produce the given AUC-value.
In practice, numerical computation requires explicit probability distribution functions F and G. A common choice of probability distribution functions is two normal distributions with equal variance and some difference in their means [25], sometimes referred to as binormal. Since many common parametric choices of probability distributions are quite smooth, numerical integration using the trapezoid method is typically accurate and also fast in its algorithmic implementation [26].
The probability distribution function of the AUC-value can be determined by computing the probability for each of the AUC-values 0, 1/mn, 2/mn, . . . , 1; for an illustration, see Figure 1a. However, the number of ROC-curves that produce a given AUC-value tends to be large in many situations, particularly when the AUC-value is around the centre of the unit interval, and consequently determining the whole AUC-value probability distribution function through this method is in many instances computationally impractical.

Monte Carlo-Simulation of the AUC-Value Probabilities
When estimated through iid sampling, by the strong law of large numbers the empirical distribution function converges point-wise, with probability one, to the distribution function from which the observations were drawn. The result can be strengthened further; through the Glivenko-Cantelli lemma the convergence is uniform and through Donsker's theorem the normalized difference converges in distribution to a Gaussian process [27]. Because of its desirable asymptotic properties, and properties such as simplicity of computation, the empirical distribution function is commonly utilized for estimation of the distribution functions through Monte Carlo-simulation.
Percentiles can be estimated through the empirical distribution function by taking the infimum of the superlevel set, i.e., if α is a number in the unit interval then the 100αpercentile is estimated through inf{x :F k (x) ≥ α}, whereF k denotes the empirical distribution function at sample size k. In a common algorithmic implementation, determining the empirical distribution function and the infimum of the superlevel set amounts to sorting the observed values and selecting the value that is the 100α percent largest. Through the aforementioned beneficial asymptotic properties of the empirical distribution function, the percentile estimate obtains many desirable properties, however if the distribution function is constant in an interval then the percentile estimate obtains a discontinuity when viewed as a function of α. Further, even if the distribution function is strictly increasing, a relatively shallow slope of the distribution function yields the practical problem of a slow rate of convergence.
When correction for testing of multiple hypotheses is applied, which is common in for instance biomarker discovery, the critical values to be estimated are typically situated in the far tail of the probability distribution, where the distribution function often has a shallow slope. As an illustration, if the type-one error probability subsequent to correction for testing of multiple hypotheses equals 10 −10 , then the Monte Carlo-simulation estimate of the corresponding critical value is equal to the maximum AUC-value simulated for all simulations that are constituted of 10 10 or fewer simulated AUC-values. However, in contrast, as discussed in Section 2.1, computing the exact probability of AUC-values is relatively quick when the AUC-values are close to the extremities of the unit interval. A pragmatic compromise is to employ exact computation in the far tails and use Monte Carlo-simulation in the body of the AUC-value distribution.

Geometric Interpolation
As discussed, the number of AUC-value probabilities that can be computed exactly is in practice limited by the availability of computational resources. At the time of writing, we use a computer with an Intel (Santa Clara, CA, USA) Xeon W-2123 CPU, having 8 cores running at 3.6 GHz, and 48 GB RAM, and computing 50 consecutive AUC-value probabilities requires less than 2 h while computing 60 AUC-value probabilities requires about 13 h. We have observed that each subsequent AUC-value probability requires 22% additional time relative to the preceding AUC-value probability, thus yielding an exponential growth in the amount of time required as the number of consecutive AUCvalue probabilities computed increases.
Ideally, the sum of the AUC-value probabilities, computed consecutively from zero or one, is such that the corresponding percentile is suitable for estimation through Monte Carlo-simulation. However, in many instances the sum of the computed AUC-value probabilities is materially smaller than would be desired vis-à-vis percentile estimation through Monte Carlo-simulation. Consequently, the limitation of computational resources effectively yields a gap between the sum of the computed AUC-value probabilities and the largest value deemed suitable for percentile estimation through Monte Carlo-simulation. In this instance, we have observed a perceived stability of ratios of subsequent differences of AUC-value probabilities, arising from a special case, and exploited it toward bridging the aforementioned gap through geometric interpolation.
Consider firstly the special case when F = G, i.e., the observed values of the TPs and TNs follow the same probability distribution. Then the expressions of Section 2.1 simplify so that the probability of each ROC-curve equals n!m!/(n + m)!, and therefore the probability of an AUC-value is determined by the number of distinct ROC-curves that yield the AUC-value. Consequently, in this special case it holds that x i = k i n!m!/(n + m)!, where k 0 , k 1 , k 2 , k 3 , . . . denotes the number of distinct ROC-curves that yield the AUC-values Figure 2 illustrates the number of ROC-curves that yield the AUC-value 26/30 when n = 5 and m = 6; in the example k 4 = 5. Let ∆ denote the difference operator, ∆x i = x i − x i−1 . The present authors have observed that, in the top percentile of the probability distribution, the ratio of subsequent differences ∆x i /∆x i−1 is well approximated by C∆k i /∆k i−1 where C is a constant. As discussed, this is in the special case when F = G equality holds with C = 1. See Figure 1c,d for a numerical example with a selection of values of n and m, and where F and G are each normal distributions with equal variance and a selection of differences in their means. Rearranging terms yields the approximation which can be employed to bridge the gap between the exact computations discussed in Section 2.1 and the Monte Carlo-simulations discussed in Section 2. 2.
An estimate of C in the above approximation can be obtained by solving for equality between the Monte Carlo estimate of the, say, 99.9th percentile and the same percentile when estimated through exact computation and geometric interpolation using the above approximation. For example, the bisection method is a simple yet effective method to solve for C, thus obtaining an estimate using the aforementioned equality [26]. The reader will note that there will be division by zero in the above approximation when ∆k i−1 = 0, i.e., k i−1 = k i−2 , however the authors have only seen this occur when i = 2 or in some instances when i ≈ nm/2 and those instances do not constitute the typical intervals of application of the approximation, cf. Figure 1b.
While analyzing the dataset of Ref. [28], see Figure 3, we noted that the aforementioned ratio is less stable when the numbers of TPs and TNs are highly unbalanced. In Figure 3d, where the balance of TPs to TNs for the pancreatic cohort is approximately 1 : 9, a slightly positive trend can be visually perceived, and when a constant is assumed, the distribution function will, seen from the right, firstly drop too quickly and then too slowly, see Figure 3b. When an approximation is unsatisfactory, common approaches are to either use a more sophisticated approximation or to lessen the use of the approximation. Figure 3d shows an approximation using a monotonically increasing function, an exponent function, that connects the AUC-value probabilities with the AUC-value probability at the median, which is inferred from the curvature of the S-shape yielding ∆x i /∆x i−1 = 1. Lessening the use of the approximation is achieved by shortening the gap to be bridged; computing more exact AUC-value probabilities and estimating a higher percentile through Monte Carlo-simulation.   Determining the numbers k 0 , k 1 , k 2 , . . . , i.e., the numbers of distinct ROC-curves that yield an AUC-value, see Figure 2 for an illustrative example, can be achieved through the following recursion, which is stylized in pseudo-code in Algorithm 1. return RECURSFCN(x, max_dim1 − 1, max_dim2) 15: +RECURSFCN(x − max_dim1, max_dim1, max_dim2 − 1) 16: end procedure The recursive algorithm employs the recursive equality that is described in the following. Denote by (u · v) the number of permutations in which u rectangles, each with area 1/nm cf. Figure 2, can be subtracted under a maximum of v rectangle side lengths. For instance, the side length may be taken along the horizontal axis, cf. Figure 2, where the maximum number of horizontal side lengths is 6. The number of permutations (u · v) can be decomposed into two parts as per the equality The first part of the decomposition are the permutations that use only v − 1 side lengths, and the second part are the permutations that use all v side lengths, which therefore equals the number of permutations (u − v · v).
The recursion terminates under the conditions (u · v) = 0 for all u < 0, i.e., there are nil permutations in which a negative number of rectangles can be subtracted, (u · 1) = 1, i.e., there is one permutation in which any number of rectangles can be subtracted under a maximum of only one side length, and (0 · v) = 1, i.e., there is one permutation in which nil rectangles can be subtracted under any maximum of rectangle side lengths. Further, a condition on the maximum of the number of side lengths along the axis perpendicular relative to the axis thus discussed is implemented, denoted by max_dim2 in the pseudo-code. While the authors view it as possible that a closed form expression for the numbers k i , i = 0, 1, . . . , nm, exists, we have at the time of writing not been able to derive such; hence the recursive algorithm.

Statistical Hypothesis Testing of AUC-Values
Because the AUC-value is a number in the unit interval, a one-sided acceptance region with type-one error probability α is constructed through the interval [0, c] where c is the critical value satisfying P(X > c) = α and X is the random test variable under the null-hypothesis [29]. In applications such as biomarker discovery, it is common to simultaneously consider numerous binary predictors. In particular, systematically forming composite biomarkers from constituent measurands often causes a combinatory proliferation, and thus AUC-values for millions of composite biomarkers may be simultaneously considered. As a result of correction for testing of multiple hypotheses through Bonferronicorrection or other methods, the type-one error probability, α, is commonly small. For a numerical example, a statistical significance level of 99% and 10 8 composite biomarkers yields a Bonferroni-corrected α-value of 10 −10 .
With AUC-values, estimation of the critical value through outright Monte Carlo-simulation will be especially taxing because each AUC-value requires simulation of the observed values of TPs and TNs. Within the setting of a clinical trial, the numbers of TPs and TNs could be between 100 and 1000 each, and to obtain an estimate of a critical value at α equalling 10 −10 , at least 10 10 + 1 AUC-value simulations are needed. Hence, to obtain the most coarse estimate of the critical value through Monte Carlo-simulation, several trillion random numbers are often needed, and to obtain a more precise estimate perhaps a quadrillion random numbers would be desired in the presently discussed numerical example.
Furthermore, when the purpose of the Monte Carlo-simulation is estimation of statistical power or experimental design optimization, then simulations will typically need to be performed under a range of design parameter values, which will greatly compound the difficulties. For a numerical indication, at the time of writing the authors are able to simulate 10 10 AUC-values in 2.63 × 10 6 s, or 30.4 days, at 400 cases and 400 controls using computationally optimized algorithms in R (version 3.3.2) [30] and computer equipment as detailed in Section 2.3. Simulation across a range of sample sizes and possibly other design parameters would multiply the required time by several folds. Hence, statistical hypothesis testing of AUC-values is simple in principle, but the determination of the probability distribution tails is computationally challenging.  Similarly, under Bonferroni-correction for 10 8 hypotheses, the critical value for a one-sided test at the 99% significance level is 0.9816, i.e., if any of the observed AUC-values are greater than 0.9816 then the nullhypothesis is rejected; it seems improbable that the observed value is an observation from the hypothesized probability distribution. Figure 1c,d illustrates the degree of stability of the ratios of subsequent probability differences normalized by the corresponding ratios of the differences of numbers of ROCcurves per AUC-values, i.e., the normalized ratios (∆x i /∆x i−1 )/(∆k i /∆k i−1 ), for the first 56 differences, which were the greatest number deemed feasible to compute when preparing the present example. As discussed in Section 2.3, when the TP and TN-distribution difference in means is zero, the ratio is identically equal to unity. The constant C, discussed in Section 2.3, is estimated using the bisection method, and is plotted in the figures in dashed lines. Alignments between the stabilizing ratios and the constants C estimated using the bisection method are visually evident.

Illustration through Simulated Data
Because the distribution function curve, cf. Figure 1b, possesses an S-shape, i.e., it first accelerates then decelerates, the ratio of subsequent probability differences is likely approximately constant only within an interval of limited length. However, as discussed in Section 2.3, the need to bridge exact computations and Monte Carlo-estimates is typically confined to the top or bottom percentile of the probability distribution, and within that finite region the approximation tends to be quite valid as illustrated in Figure 1c,d.

Illustration through Biomarker Data
Statistical hypothesis testing and determination of p-values of AUC-values under correction for multiple hypothesis testing is illustrated using the proteomic dataset of Ref. [28]. The data encompass 39 proteins measured across 1004 individuals newly diagnosed with cancer and 812 healthy controls. The cancer diagnoses are breast, colorectal, esophagus, liver, lung, ovary, pancreas, and stomach cancer. In addition, a ninth pan-cancer diagnosis is formed by merging the eight cancer diagnoses. It may be noted that the aforementioned data source did not include statistical hypothesis tests or p-values. For ovarian cancer, we chose to only include female healthy controls.
Composite biomarkers are formed as per Ref. [31]. From the 39 proteins, 741 composite biomarkers are formed by combining the proteins into pairs. The 741 are further multiplied by 8 as a result of allowing positive classification when one or both of the proteins are downregulated and when either or both of the proteins are up or down-regulated. By testing for the 9 diagnoses, a total of (39 choose 2) × 8 × 9 = 53,352 composite biomarkers are simultaneously statistically hypothesis tested. p-values are adjusted as per the Bonferroni method. The null-hypothesis is that the biomarkers are on par with the colorectal cancer test FIT, which has an AUC-value of 0.88 ( [32], p. 31). As a commonly used screening test, the stool-based test FIT is a direct predicate IVD device for the colorectal cancer cohort, and also a relevant benchmark for the other cohorts that lack regulatory approved screening predicate IVD devices [33]. Rejection of the null-hypothesis is interpreted as evidence that, sources of error notwithstanding, the biomarker has an AUC-value that is superior vis-à-vis the AUC-value of FIT.
For each of the 9 diagnoses, 58 AUC-value probabilities were computed exactly, i.e., 1, 1 − 1/nm, 1 − 2/nm, . . . , 1 − 57/nm where n and m are the numbers of TPs and TNs, respectively. The nine probability distributions were then interpolated geometrically, as per Section 2.3, to the 99.99th percentiles, which were determined by Monte Carlosimulation of 10 8 AUC-values per diagnosis. The distinctly large Monte Carlo-simulation was used because some of the cohorts have highly unbalanced numbers of TPs to TNs, which affects the geometric interpolation as discussed in Section 2.3 and illustrated in Figure 3. Figure 3a shows distribution functions for the pan-cancer and pancreatic cohorts, where the difference in the number of TPs, 1004 versus 93, yields a lower critical value for the pan-cancer cohort relative to the pancreatic cohort. Figure 3c,d show geometric interpolation for the pan-cancer cohort and the pancreatic cohort, where the pan-cancer cohort possesses a balanced ratio of TPs to TNs (1004 : 812) and the pancreatic cohort possesses a highly unbalanced ratio of TPs to TNs (93 : 812) yielding a relatively better fit of the geometric constant interpolation for the pan-cancer cohort than for the pancreatic cohort (cf. Section 2.3). Using computer equipment as detailed in Section 2.3, computation of the 58 AUC-value probabilities required about 7.2 h per diagnosis, and the Monte Carlosimulation required about 1.3 h per diagnosis. The probability distributions are available on GitHub as per the data availability statement.
In total, 26 cancer biomarkers were statistically significant at the 99%-level, meaning that they have an AUC-value that is higher than the highest AUC-value out of 53,352 biomarkers, which is on par with what FIT would reasonably produce. The interpretation is that those biomarkers are, sources of error notwithstanding, superior relative to FIT. In this instance, separation of training and validation data does not constitute an issue because every biomarker is tested statistically; i.e., no training is conducted. Table 1 shows Bonferroni corrected critical values relative to the null-hypothesis that the 53, 352 biomarkers have an AUC-value on par with FIT. It is evident that tumour type cohorts that have larger sample sizes exhibit critical values that are lower than for cohorts that have relatively smaller sample sizes. Of the biomarkers tested, 26 are significantly better at the 99% level, than FIT, including 4 liver, 14 ovarian, and 8 pancreatic cancer biomarkers. Table 2

Discussion
Historically, AUC-value probability distributions have commonly been approximated via asymptotic properties or bootstrap-percentiles [19,24,25]; however the approach provides inadequate accuracy in the tails of the probability distribution where critical values under correction for multiple hypotheses are situated. With exact computation of AUC-value probabilities, impeccable precision is obtainable; permitting accurate statistical inference. In biomarker development, if critical values are too low, the type-one error probability will be too high; effectively yielding false expectations. If critical values are too high, then the type-two error probability will be too high; effectively barring identification of potentially valid biomarkers. Using the exact AUC-value probability distribution, the aforementioned risks are avoided, all while allowing for computation of exact p-values.
In order to provide reasonable assurance that a study has sufficiently large sample sizes to demonstrate a putative effect, experimental design optimization is warranted. The computational challenges inherent to Monte Carlo-simulation are greatly compounded when critical values under a multidimensional set of parameter values need to be estimated. With the proposed method of exact computation of AUC-value probabilities paired with Monte Carlo-simulations and geometric interpolation, a method is obtained that is both computationally feasible and relatively precise.
A disadvantage of the presently proposed approach using exact computation of AUCvalue probabilities is that the method is relatively computationally intense. Historically, computers have become more capable over time, which will gradually mitigate the disadvantage. Moreover, we commit to uploading the exact distributions on GitHub, as per the data availability statement, so that the exact probabilities can be shared and thus make them readily available for all who need them.
A topic of future research, proposed by one of the reviewers of this article, is that exact probabilities of the Mann-Whitney U statistic can be obtained analogously relative to the proposed method detailed in Section 2.

Conclusions
In the early 21th century, a consensus formed within the biomarker literature around the notion that, despite substantial economic resources invested, the scientific community has hitherto failed to produce biomarkers for cancer that translate into clinical use [34][35][36][37][38][39][40]. Among the known pitfalls identified, the use of statistical methods that are unsuitable or improper for the purpose is commonly named among the most meaningful.
The ability to compute exact p-values and conduct accurate statistical hypothesis tests of ROC AUC-values, we believe, will facilitate biomarker development and expedite the introduction of IVD devices into clinical use.  Informed Consent Statement: Informed consent relative to data of Section 4.2 available via [28].

Data Availability Statement:
Code for the simulation of AUC-value data and AUC-value probability distributions are available on GitHub https://github.com/TSResearchGroupUU/ (accessed on 8 February 2023). Biomarker data available are online via Ref. [28].