Universal Sample Size Invariant Measures for Uncertainty Quantification in Density Estimation

Previously, we developed a high throughput non-parametric maximum entropy method (PLOS ONE, 13(5): e0196937, 2018) that employs a log-likelihood scoring function to characterize uncertainty in trial probability density estimates through a scaled quantile residual (SQR). The SQR for the true probability density has universal sample size invariant properties equivalent to sampled uniform random data (SURD). Alternative scoring functions are considered that include the Anderson-Darling test. Scoring function effectiveness is evaluated using receiver operator characteristics to quantify efficacy in discriminating SURD from decoy-SURD, and by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.


Introduction
The rapid and accurate estimate of the probability density function (pdf) for a random variable is important in many different fields and areas of research [1][2][3][4][5][6]. For example, accurate high throughput pdf estimation is sought in bioinformatics screening applications and in high frequency trading to evaluate profit/loss risks. In the era of big data, data analytics and machine learning, it has never been more important to strive for automated high-quality pdf estimation. Of course, there are numerous other traditional areas of low throughput applications where pdf estimation is also of great importance, such as damage detection in engineering [7], isotope analysis in archaeology [8], econometric data analysis in economics [9], and particle discrimination in high energy physics [10]. The wide range of applications for pdf estimation exemplifies its ubiquitous importance in data analysis. However, a continuing objective regarding pdf estimation is to establish a robust distribution free method to make estimates rapidly while quantifying error in an estimate. To this end, it is necessary to develop universal measures to quantify error and uncertainties to enable comparisons across distribution classes. To illustrate the need for universality, the pdf and cumulative distribution function (cdf) for four distinctly different distributions are shown in Figure 1a,b. Comparing the four cases of pdf and cdf over the same sample range, it is apparent that the data are distributed very differently. The process of estimating the pdf for a given sample of data is an inverse problem. Due to fluctuations in a sample of random data, many pdf estimates will be able to model the data sample well. If additional smoothness criteria are imposed, many proposed pdf estimates can be filtered out. Nevertheless, a pdf estimate will carry intrinsic uncertainty along with it . The development of a scoring function to measure uncertainty in a pdf estimate without knowing the form of the true pdf is indispensable in high throughput applications where human domain expertise cannot be applied to inspect every proposed solution for validity. Moreover, it is desirable to remove subjective bias from human (or artificial intelligence) intervention. Automation can be achieved by employing a scoring function that measures over-fitting and under-fitting quantitatively based solely on mathematical properties. The ultimate limit is set by statistical resolution, which depends on sample size.
Solving the inverse problem becomes a matter of optimizing a scoring function, which breaks down into two parts-first, developing a suitable measure that resists under-and over-fitting to the sampled data, which is the focus of this paper. Second, developing an efficient algorithm to optimize the score while adaptively constructing a non-parametric pdf. The second part will be accomplished by an algorithm involving a non-parametric maximum entropy method (NMEM) that was recently developed by JF and DJ [11] and implemented as the "PDFestimator." Similar to a traditional parametric maximum entropy method (MEM), NMEM employs Lagrange multipliers as coefficients to orthogonal functions within a generalized Fourier series. The non-parametric aspect of the process derives from employing a data driven scoring function to select an appropriate number of orthogonal functions, as their Lagrange multipliers are optimized to accurately represent the complexity of the data sample that ultimately determines the features of the pdf. The resolution of features that can be uncovered without over-fitting naturally depends on the sample size.
Some important results in statistics [12] that are critical to obtain universality in a scoring function are summarized here. For a univariate continuous random variable, X, the cdf is given by F X (x), which is a monotonically increasing function of x and, irrespective of the domain, the range of F X (x) is on the interval (0, 1). A new random variable, R, that spans the interval (0, 1) is obtained through the mapping r = F X (x). The cdf for the random variable R can be determined as follows, F(r) = P(R ≤ r) = P(F X (x) ≤ r) = P(X ≤ F −1 X (r)) = F X (F −1 X (r)) = r Since the pdf for the random variable R is given as f (r) = dF(r) dr = 1 it follows that R has a uniform pdf on the interval (0, 1). Furthermore, due to the monotonically increasing property of F X (x) it follows that a sort ordered set of N random numbers {x k } N maps to the transformed set of random numbers {r k } N in a 1 to 1 fashion, where k is a labeling index that runs from 1 to N. In particular, for an index k > k, it is the case that r k ≥ r k . The 1 to 1 mapping that takes X → R has important implications for assessing the quality of a pdf estimate. The universal nature of this approach is that, for a given sample of random data and no a priori knowledge of the underlying functional form of the true pdf, an evaluation can be made of the transformed data.
Given a high-quality pdf estimate from an estimation method,f X (x), the corresponding estimated cdf,F X (x), will exhibit sampled uniform random data (SURD). Conversely, for a given sample from the true pdf, a poor trial estimate,f X (x) , will yield transformed random variables that deviate from SURD. The objective of this work is to consider a variety of measures that can be used as a scoring function to quantify the uncertainty in how close the estimatef X (x) is to the true pdf based on how closely the sort order statistics ofF X ({x k }) matches with the sort order statistics of SURD. The powerful concept of using sort order statistics to quantify the quality of density estimates [13] will be leveraged to construct universal scoring functions that are sample size invariant.
The strategy employed in the NMEM is to iteratively perturb a trial cdf and evaluate it with a scoring function. By means of a random search using adaptive perturbations, the trial cdf with the best score is tracked until the score reaches a threshold where optimization terminates. At this point, the trial cdf is within an acceptable tolerance to the true cdf and constitutes the pdf estimate. Different outcomes are possible since the method is based on a random fitness-selection process to solve an inverse problem. The role of the scoring function in the NMEM includes defining the objective target for optimizing the Lagrange multipliers, providing stopping criteria for adding orthogonal functions in the generalized Fourier series expansion and marking a point of diminishing returns where further optimizing the Lagrange multipliers results in over-fitting to the data. Simply put, the scoring function provides a means to quantify the quality of the NMEM density estimate. Optimizing the scoring function in NMEM differs from traditional MEM approaches that minimize error in estimates based on moments of the sampled data. Note that the universality of the scoring function eliminates problems with heavy tailed distributions that have divergent moments. Nevertheless, Lagrange multipliers are determined based on solving a well defined extremum problem in both cases.
Before tackling how to evaluate the efficacy of scoring functions, a brief description is given here on how the quality of a pdf estimate can be assessed without knowing the true pdf. Visualizing a quantile-quantile plot (QQ-plot) is a common approach in determining if two random samples come from the same pdf. Given a set of N sort ordered random variables {x k } N that are monotonically increasing, along with a cdf estimate, the corresponding empirical quantiles are determined by the mapping {r k } N =F X ({x k } N ) as described above. It is not necessary to have a second data set to compare. As described previously [11], the empirical quantile can be plotted on the y-axis versus the theoretical average quantile for the true pdf plotted on the x-axis. From single order statistics (SOS) the expectation value of r k is given by µ k = k/(N + 1) for k = 1, 2, ...N, which gives the mean quantile. Figure 2a illustrates the QQ plot for the distributions shown in Figure 1. The benefit of the QQ plot is that it is a universal measure. Unfortunately, for large sample sizes, the plot is no longer informative because all curves approach a perfect straight line as random fluctuations decrease with increasing sample size. A quantile residual (QR) allows deviations from the mean quantile to be readily visualized when one sample size is considered. However, as illustrated in Figure 2b, the residuals in a QR-plot decrease as sample size increases. Hence, the quantile residual is not sample size invariant.
The QR-plot is scaled [11] in such a way as to make the scaled quantile residual (SQR) sample size invariant. From SOS, the standard deviation for the empirical quantile to deviate from the mean quantile is well-known to be σ k = µ k (1 − µ k )/ √ N + 2 where k is the sort order index. Interestingly, all fluctuations regardless of the value for the mean quantile scale with sample size as 1/ √ N + 2. Sample size invariance is achieved by defining SQR as √ N + 2(r k − µ k ) and, when plotted against µ k , one obtains a SQR-plot. Figure 2c shows an SQR-plot for three different sample sizes for each of the four distributions considered in Figure 1. It is convenient to define contour lines using the formula s f µ(1 − µ), where the scale factor, s f , can be adjusted to control how frequently points on the SQR plot will fall within a given contour. In particular, 99% of the time the SQR points will fall within the boundaries of the oval when bounded by ±2.58 µ(1 − µ).  The SQR-plot provides a distribution free visualization tool to assess the quality of a cdf estimate in three ways. First, when the SQR falls appreciably within the oval that encloses 99% of the residual, it is not possible to reject the null hypothesis. Second, when the SQR exhibits non-random patterns, this is an indication of systematic error introduced by the estimator method. Finally, when the SQR has suppressed random fluctuations such that it is close to 0 for an extended interval, this indicates that the pdf estimate is over-fitting to the sample data. In general, over-fitting is hard to quantify [14]. As the graphical abstract shows, it is possible to plot the SQR against the original random variable x instead of the mean quantile. Doing this deforms the oval or "lemon drop" shape of the SQR-plot but it directly shows where problems in the estimate are locally occurring in relation to the pdf estimate. The aim of this paper is to quantify these salient features of an SQR-plot using a scoring function.
This work was motivated by the concern that different scoring functions will likely perform differently in terms of speed and accuracy in NMEM. The scoring function that was initially considered was constructed from the natural logarithm of the product of probabilities for each transformed random variable, given byF X ({x k }). This log-likelihood scoring function provides one way to measure the quality of a proposed cdf. Interestingly, the log-likelihood scoring function has a mathematical structure similar to the commonly employed Anderson-Darling (AD) test [15,16]. As such, the current study considers several alternative scoring functions that use SQR and compares how sensitive they are in quantifying the quality of a pdf estimate. Other types of information measures that use cumulative relative entropy [17] or residual cumulative Kullback-Leibler information [18,19] are possible. However, these alternatives are outside the scope of this study, which focuses on leveraging SQR properties. The scoring function must exhibit distribution free and sample size invariant properties so that it can be applied to any sample of random data of a continuous variable and also to sub-partitions of the data when employed in the PDFestimator. It is worth noting that all the scoring functions presented in this paper exhibit desirable properties with similar or greater efficacy than the AD scoring function and all are useful for assessing the quality of density estimates.
In the remainder of this paper, a numerical study is presented to explore different types of measures for SQR quality. The initial emphasis is on constructing sensitive quality measures that are universal and sample size invariant. These scoring functions based on SQR properties can be applied to quantifying the accuracy (or ''goodness of fit") of a pdf estimate created by any methodology, without knowledge of the true pdf. The SQR is readily calculated from the cdf which is obtained by integrating the pdf. To determine which scoring function best distinguishes between good and poor cdf estimates, the concept of decoy SURD is introduced. Once decoys are generated, Receiver Operator Characteristics (ROC) are employed to identify the most discriminating scoring function [17]. In addition to ROC evaluation, performance of the PDFestimator for different plugged in scoring functions is evaluated. This benchmark is important because the scoring function is expected to affect the rate of convergence toward a satisfactory pdf estimate using the NMEM approach. After discussing the significance of the results, several conclusions are drawn from an extensive body of experiments.

Sample Size Invariant Scoring Functions
Seven scoring functions are defined in Table 1. At the moment, the input to these scoring functions is SURD of sample size N. Specifically, N random numbers are independently and identically drawn uniformly on the interval (0,1) and then sort ordered to give SOS represented by the set {r k } N where 0 < r k ≤ r k+1 < 1 ∀ k = 1, 2, ...N. For sample size, N, a scoring function of type t is evaluated as S t ({r k } N ), which defines a new random variable that is simply denoted as S t (N). A scoring function is scale invariant if the probability density for S t (N) is independent of sample size, which typically holds only for large N. However, finite size corrections are made for each scoring function and are listed in Table 1. In all cases, the finite size corrections are empirically determined based on numerical simulation to achieve approximate scale invariance for N ≥ 9. In all coefficients reported, there is a (3) error in the last significant figure, such as 0.406(3) or 11.32(3).
generalized moment (S 4 ) where m = block size for both the i-th and j-th blocks being compared.
to account for the number of subsamples within partition, p.
As defined in Table 1, the proposed scoring functions include the relevant part of the Anderson-Darling (AD) measure [15], denoted as S AD , and the quasi log-likelihood formula [11], denoted as S LL . Note that S LL = log [∏ k p k (r k )] where p k (r k ) is the exact pdf corresponding to a beta distribution that describes the random variable r k as derived from SOS [13]. The quasi log-likelihood is not an exact log-likelihood. Rather, S LL corresponds to a mean field approximation where correlations between the random variables, {r k } N , are neglected. Another scoring function is defined as S VAR = z 2 k , where z k = (r k − µ k )/σ k . As mentioned in the Introduction, µ k = r k = k/(N + 1) is the mean quantile of the k-th random variable and σ k = µ k (µ k − 1) / √ N + 2 is the standard deviation of the k-th random variable about its mean. Essentially S VAR is the mean variance of a "z-value" for SOS. Despite sharing a similar mathematical form, the S AD and S LL scoring functions are not the same, even in the limit N → ∞. At face value, these functions look very different. However, after shifting the origin of these functions to their natural reference points and scaling S LL by a factor of −2, which was empirically determined to obtain data collapse, these two measures were remarkably similar. To demonstrate this, let The natural reference points S o AD and S o LL are respectively defined as S AD and S LL , evaluated at the mean quantiles. Figure 3a,d show the pdf for S AD and the pdf for S LL are approximately sample size invariant and markedly similar. Interestingly, S AD has superior sample size invariance because it reaches its asymptotic limit extremely fast, as reported almost 60 years ago [20].
To improve or create a scale invariant scoring function, finite size corrections are incorporated by transforming S t (N) to a z-value. For all score types, Z t = (S t − µ t )/σ t where µ t is the average of S t and σ t is the standard deviation of S t about its mean. All shifts and scale factors used to transform S t (N) → Z t (N) are given in Table 1. Figure 3a,d,g show that, after finite size corrections, the pdf for the three scoring functions Z AD , Z LL and Z VAR exhibit excellent scale invariance. Furthermore, the pdf for these scoring functions fall on top of one another in a massive data collapse (data not shown) indicating they share the same pdf for all practical purposes. It is worth noting that because this is a numerical study, there is uncertainty in the formulas that define the corrections to finite sample size. As can be clearly seen in Figure 3a, the AD measures before finite size corrections are applied display the most impressive data collapse. Indeed, the observed data collapse from numerical simulation are tighter than the intrinsic uncertainties in the correction to finite size samples. In contrast, the log likelihood measure has the most dispersion in its data collapse before finite size corrections are applied. In this case, the finite sample size corrections greatly improved the data collapse.
The most surprising result is that this numerical study demonstrates that Z VAR shares the same pdf as Z AD . This result is surprising because both Z AD and Z LL involve linear combinations of logarithms, while Z VAR has no logarithms. However, it is not surprising that Z VAR has good scaling properties because the function is defined in terms of the scaled variable, otherwise called the z-value. The transformation to the z-value naively sets the mean to the origin and normalizes the variance. As such, it would be somewhat surprising if Z VAR did not exhibit data collapse as a function of the z-value. Given that Z VAR scales, it is expected that generalized moments of the z-value variable will exhibit data collapse and also exhibit sample size invariance.
From a practical standpoint, it is computationally faster to work with Z VAR . Therefore, additional scoring functions defined as S p = |z k | p 1/p for p = 1 2 , 1, 2, 3, 4 were considered. Note that S 2 is the standard deviation of z k and, after finite size corrections are applied, S p → Z p . The cases p = 1 2 and p = 4 are listed in Table 1 and exhibit scale invariance as shown in Figure 3b,e respectively. The p = 1, 2, 3 cases (data not shown) are similar and straddle the limiting cases smoothly. It is worth mentioning that the natural reference at the mean quantile is zero for S VAR and S p .
By exploring SURD for additional patterns, it was observed that two disjoint blocks of the same size can be compared using double order statistics (DOS). Among all random variables, {r k } N , the indices that span from k 1 o to k 1 f define block 1 and the indices that span from k 2 o to k 2 f define block 2. Without loss of generality, block 2 is taken to be to the right of block 1, such that k is calculated for all disjoint blocks at once. By partitioning all random variables into equal blocks of indices, the mean log-ratio is calculated rapidly for all pairs of blocks. For any size block and for any pair of blocks, S (i,j) LR exhibits strong scale invariance as shown in Figure 3h. Over a hundred diverse cases are shown as gray lines. Interestingly, the pdf for S (i,j) LR is essentially a normal distribution shown as a red line.
LR is localized to a pair of blocks, to cover the entire SQR-plot a new scoring function is constructed by taking the root mean square of all distinct pairs of S (i,j) LR . For a calculation time proportional to sample size, the size of a block is set proportional to √ N, which necessarily makes the number of blocks, N b , proportional to √ N. The pdf for RMSLR is nearly sample size invariant as shown in Figure 3i. From Table 1, it appears the finite size corrections for RMSLR are complicated. However, as will be discussed below, scale invariance should be preserved for sub-samples of the data, called partitions. It turns out that only RMSLR requires special attention to make partitions scale, where N p is the number of data points being sub-sampled. Finally, the absolute value of the measures Z VAR and Z 4 are respectively shown in Figure 3c,f. Note that taking an absolute value of a measure that is scale invariant will remain scale invariant.

Redundant and Complimentary Information
Since the pdf of different scoring functions may be similar or the same, the next question addressed is how do different measures compare when applied to the same SURD? For sample size N, SURD is generated using numerical simulation and each measure is evaluated per realization of {r k } N . For 100,000 random trials per N, a 1 to 1 comparison is made between Z a (N) versus Z b (N) with a = b. Note that by definition, Z t (N) has a mean of zero and a standard deviation of 1. For reasons that will become clear below, absolute values are taken on the scoring functions. Despite the pdf for |Z VAR |, |Z LL | and |Z AD | being practically identical for all sample sizes, scatter plots indicate that the scores are not identical on a 1 to 1 basis. Figure 4a,b plot |Z VAR | and |Z LL | against |Z AD |, respectively. Although there is always a tight linear correlation, there is more scatter in the comparison at smaller sample sizes. As N → ∞ the different scores converge to the same value, although the approach to the asymptotic limit for each measure differs. These differences have important implications for application to density estimation as discussed below.  The scatter plot of |Z 4 | versus |Z VAR | in Figure 4c shows that these two measures characterize SURD in a fundamentally different way due to the strong deviation of |Z 4 | relative to |Z VAR | with modest statistical scatter. The greatest non-linear deviation between the two scores occurs at large values of |Z VAR |, corresponding to outliers in SURD. The scatter plot of RMSLR versus |Z VAR | in Figure 4d shows strong random scatter with no discernible deterministic dependence. Hence, RMSLR and |Z VAR | measure different SURD characteristics. Yet, despite their conspicuous differences, the pdf for |Z 4 | and RMSLR are qualitatively similar as shown in Figure 3f,i, respectively.
As demonstrated by scatter plots, various scoring functions characterize SURD in different or similar ways relative to one another. Note that combining measures with complimentary properties can potentially lead to a more sensitive measure. Through reductive analysis, a composite score (CS) is proposed as: In constructing CS, the most probable score for Z VAR , near 0.666, is used as a baseline. Then contributions are added from outliers from either |Z 4 | or RMSLR, whichever is larger. The last term does not modify the score when no outlier is detected, otherwise the contribution to CS continuously increases starting at zero at just above the threshold for outlier detection.

Partition Size Invariance
A critical part of the algorithm in the PDFestimator [11] is that the input data sample is partitioned into hierarchical sub-samples by powers of 2 when N > 1025. Consequently, the employed scoring function should be sample size invariant for all partitions. Invariance of partition size, N p , is satisfied by all scoring functions described in this work, as exemplified in Figure 5 for three of the most distinct measures. Furthermore, for any realization of SURD of size N, all partitions within have essentially the same score independent of the type of scoring function. A necessary requirement for all the scoring functions is that sub-sampling must be uniformly distributed over the data. It is worth noting that S AD (and its corresponding Z AD ) is particularly sensitive to the way the uniform sub-sampling is performed within a partition. Due to the form of the S AD equation, it is critical that the selected points are symmetric about the center index in the sort ordering. The number of samples used in a partition is always odd of the form N p = 1 + 2 n . Thus, the median point is included and for each index selected to be in the sub-sample below the median, a corresponding mirror image index above the median is selected. For example, if there are 17 indices in the full sample, indices 1, 4, 9, 14, 17 has the required mirror symmetry. All other scoring functions are not sensitive to breaking mirror symmetry.

Decoy SURD
For the purpose of quantifying how well a scoring function discriminates between true SURD and random data that is not SURD, a controlled decoy-SURD (dSURD) is generated. Let {r o k } define SURD and let {r d k } define dSURD. As described in detail in Section 4.4, a decoy cdf, F d (r), is constructed to facilitate the 1 to 1 mapping given by then the output is identical to the input. A decoy-SURD is controlled by adding a perturbation of the form F d (r) = r + ∆(r). By choosing various functional forms for the perturbation and, by controlling the amplitude of the perturbation, it is a simple matter to make a broad spectrum of decoys that range from impossible to markedly obvious to detect at any specified sample size.
In Figure 6, the middle row shows the decoy cdf resulting from the perturbations shown along the top row. This is an example of a moderately hard dSURD because by eye the decoy cdf looks close to a perfect straight line. To make it clear that dSURD is indeed different from SURD, the pdf for each case is shown along the bottom row. For a sufficiently large sample size, statistical resolution will be good enough to resolve these small perturbations, but for smaller sample sizes the perturbation will not be detectable. To demonstrate how statistical resolution increases with larger sample sizes, Figure 7 shows SQR-plots for SURD and its corresponding dSURD for samples sizes of 1000, 5000, 20,000 and 100,000. These three cases are examples of localized perturbations.  Three additional perturbations of an extended type are shown in Figure 8 using the same layout. The last column plots the perturbation, cdf and pdf as dashed red lines because "reduced fluctuation" is a special type of perturbation that is also explained in Section 4.4. As the name implies, fluctuations are suppressed, representing a scenario where a pdf estimate over-fits the data. Figure 9 shows the SQR-plots for SURD and its corresponding dSURD of the extended type for samples sizes of 1000, 5000, 20,000 and 100,000. Note that the reduced fluctuation perturbation is equally detectable at any size sample because fluctuations are suppressed by a fixed proportion in relation to true SURD.
By comparing measures applied to dSURD and SURD, it can be expected that the more sensitive scoring function is one that detects a given perturbation at smaller sample sizes compared to other scoring functions. It is also expected that a certain scoring function will be able to detect certain types of perturbations more readily than other types of perturbations. As such, it is likely impossible to find a perfect scoring function that performs best on all decoy types all the time. Nevertheless, for a given diverse set of dSURD examples, the best overall performing scoring functions with the greatest sensitivity or selectivity can be deduced using receiver operator characteristics.

Receiver Operator Characteristics
Receiver operator characteristics (ROC) are calculated based on simulation data involving 10,000 trials of SURD over a broad range of N samples, and for each SURD, many dSURD mappings are generated for each of the six decoy types shown above. Results are exemplified in Figure 10, showing ROC curves for three different sample sizes and six different decoy types. ROC curves quantify the efficacy of a scoring function in discriminating SURD from dSURD. Figure 10 shows representative results for moderately difficult decoys. As a point of reference, easy, moderate and hard decoys are aimed at requiring about 1000, 10,000 and 100,000 samples to have sufficient statistical resolution to notice dSURD just barely by eye (e.g., see Figures 7 and 9). Only the decoy that reduces fluctuations using a fixed scale factor has the same difficulty for detection independent of sample size. It is common practice to quantify ROC curves by their area under the curve (AUC). Table 2 gives all AUC values for all the cases shown in Figure 10. The ROC curves and the results listed in Table 2 clearly show that CS detects decoys better than the other measures. Of course, informed by reductive analysis, this result was purposely intended during the construction of CS given in Equation (2). In summary, it is generally found that Z AD , Z LL , Z VAR , |Z AD |, |Z LL |, |Z VAR |, Z 4 , |Z 4 | and CS scoring functions are all good measures to distinguish SURD from easy to detect dSURD. However, it is always possible to create decoy SURD that will go undetected by any measure (e.g., Figure 10). In general, Z AD , Z LL and Z VAR share similar ROC curves and |Z VAR | and |Z 4 | have similar ROC curves. The most sensitive scoring function is CS. The reason Z t and |Z t | are considered as two separate cases is now easily explained. First note that Z t has a mean of zero and a standard deviation of 1. For a decoy type of "reduced fluctuations" that mimics an over-fitting scenario, the ROC curve becomes inverted for any type of measure, Z t . However, the inversion problem is eliminated when considering |Z t | because both over-fitting and under-fitting is detected when |Z t | is large. Finally, only the combined score, CS, readily detects very localized perturbations due to its RMSLR component. Figure 11 summarizes the comparative statistics for failure rates. The bar plots in Figure 11a report averages across distributions and random samples, for cumulative ranges of sample sizes. As expected, the failure rate increases with sample size. For all scoring methods, average failure rates are typically on the order of 10% for sample sizes less than one million. Failure rate averages are the least for |Z 4 | and |Z LL |, a trend that holds across sample size. The associated box plots in Figure 11b more clearly demonstrate the computational advantage of |Z 4 | and |Z LL | over the other scoring methods. All scoring methods have between 50 and 60 outliers, but |Z 4 | and |Z LL | have virtually no failures outside of these extreme values. For computational time and Kullback-Leibler (KL) divergence [21], or simply KL, care must be taken to ensure a fair comparison, accounting for failure rates. Thus, a subset of the data is considered for these measurements. Of the 275 test sets (25 distributions at 11 sample sizes), 230 of these contain at least 10 successes out of the 100 trials, across all five scoring methods. The remaining 45 tests contain failure rates greater than 90% for at least one scoring method and are eliminated from further comparison, ensuring an equitable comparison across successful distributions and sample sizes. The results are shown in Figure 12.

PDF Estimation Performance
Computational time comparisons prove to be the most challenging to pin down, due to wide variations between distributions, sample sizes and random trials. However, Figure 12a demonstrates a clear advantage in the average computational time for |Z 4 |, across all sample sizes. Once again, the number of outliers, which are compressed for clarity in Figure 12b, is roughly the same across the five scoring methods. However, |Z AD | has a higher range of typical runtimes, as well as higher averages in the smallest sample sizes. The KL-divergence comparisons shown in Figure 12c,d are less variable between scoring methods. A lower divergence between the estimate and the known reference distribution suggests a better estimate is being made. Figure 12c shows a decreasing KL-divergence with increasing sample size for all scoring methods, which demonstrates expected convergence, albeit with diminishing returns for larger sample sizes. Notably, |Z AD | produces slightly lower KL-divergence on average, compared to the other methods. Average KL (d) Figure 12. Comparative statistics between five scoring methods averaged over successful solutions. Cumulative averages for (a) performance time across four sample size ranges and (c) Kullback-Leibler divergence [21]. Panels (b,d) show box plots for the respective data shown in panels (a,c). Box plots show inner-quartiles and whiskers represent range of data excluding outliers, which are shown as red crosses.

Discussion
Each of the five scoring methods have been evaluated when utilized within the PDFestimator and applied to the same distribution test set in terms of scalability, sensitivity, failure rate and KL-divergence. Each of the proposed measures have strengths and weaknesses in different areas. The |Z AD | measure produces the most accurate scaling and the lowest KL-divergence. The CS measure shows the greatest sensitivity for detecting small deviations from SURD. The |Z LL | method, although not a clear winner in any particular area, is notably well-performing in all tests. These results suggest a possible trade-off between a lower KL-divergence versus longer computational time with the |Z AD | scoring method. However, the slight benefit of a lower KL-divergence is arguably not worth the computational cost, particularly when also considering the higher failure rate. In contrast, the significantly low failure rate and fast performance times are strong arguments in favor of |Z 4 | as the preferred scoring method. However, this result is only true when the score of a sensitive measure is minimized, while the threshold to terminate is based on a less sensitive measure (see Section 4.7 in methods for details).
Qualitative analysis is used to elucidate why |Z 4 | minimization is the best overall performer. The pdf and SQR for hundreds of different estimates were compared visually and robust trends were observed between the |Z VAR | and |Z 4 | methods. Figure 13a is a representative example, showing the density estimates for the Burr distribution at 100,000 samples. Although both estimates were terminated at the same quality level, the smooth curve found for |Z 4 | would be subjectively judged superior. However, there is nothing inherently or measurably incorrect about the small wiggles in the |Z VAR | estimate. Note that no smoothness conditions are enforced in the PDFestimator.
The SQR-plot, shown in Figure 13b, is especially insightful in evaluating the differences in this example. The Burr distribution is deceptively difficult to estimate accurately due to a heavy tail on the right. Both |Z VAR | and |Z 4 | fall mostly within the expected range, except for the sharp peak to the right corresponding to the long tail. Although the peak is more pronounced for |Z VAR |, the more relevant point in this example is the shape of the entire SQR-plot. SQR for |Z VAR | contains scaled residuals close to zero, behavior virtually never observed in true SURD. Hence, this corresponds to over-fitting. This contrast in the SQR-plot between |Z VAR | and |Z 4 | is generally true with the following explanation.
The |Z 4 | scoring method uses the same threshold scoring as |Z VAR |, but simultaneously seeks to minimize the variance from average, thus highly penalizing outliers to the expected z-score. The |Z VAR | method, by contrast, tends to over-fit some areas of the distribution of high density, attempting to compensate for areas of relatively low density where it deviates significantly. This often results in longer run times, many unnecessary Lagrange multipliers, less smooth estimates and unrealistic SQR-plots, as the NMEM algorithm attempts to improve inappropriately. For example, in the test shown in Figure 13, the number of Lagrange multipliers required for the |Z VAR | estimate was 141, whereas |Z 4 | required only 19. Therefore, it is easy to see why |Z VAR | took much longer to complete. This phenomenon is a general trend but it is exacerbated in cases where there are large sample sizes on distributions that have a combination of sharp peaks and heavy tails. The corresponding SQR-plots for each density estimate are shown. By eye, both density estimates look exceptionally good, but the SQR-plot has a strong peak representing error in the extreme tail of the distribution. The degree of error depends on the scoring function, but both scoring functions give qualitatively the same results.
A surprising null result of this work is that the CS measure, custom designed to have the greatest overall sensitivity and selectivity, failed to be the best overall performer in practice when invoked in the PDFestimator. Although more investigation is required, all comparative results taken together suggest that the CS scoring function is the most sensitive but is over-designed for the capability of the random search optimization method currently employed in the PDFestimator. In the progression of improvements on pdf estimation, the results from the initial PDFestimator suggested that a more sensitive scoring function would improve performance. With that aim, more sensitive scoring functions have been determined and performance of the PDFestimator substantially improved. However, it appears the opposite is now true, requiring a shift in attention to optimize the optimizer, with access to a battery of available scoring functions. In preparation, another work (ZM, JF, DJ) optimizes the overall scheme by dividing the data into smaller blocks, which gives much greater speed and higher accuracy, while taking advantage of parallelization.

Methods
MATLAB 2019a (MathWorks, Natick, MA, USA) and the density estimation program "PDFestimator" were used to generate all the data presented in this work. The PDFestimator is a C++ program that JF and DJ developed as previously reported [11], which has the original Java program in supporting material. Upgrades on the PDFestimator are continuously being made on the BioMolecular Physics Group (BMPG) GitHub website,Available online: https://github. com/BioMolecularPhysicsGroup-UNCC/PDF-Estimator, where the source code is freely available, including a MATLAB interface to the C++ program. An older C++ version is also available in R, https://cran.r-project.org/web/packages/PDFEstimator/index.html. The version on the public GitHub website is the most recent stable version that has been well tested.

Generating SURD and Scoring Function Evaluation
MATLAB was employed in numerical simulations to generate SURD. For a sample size N, the sort ordered sequence of numbers {r k } N was used to evaluate each scoring function being considered. The same realization of SURD was assigned multiple scores to facilitate subsequent cross correlations.

Method for Partitioning Data
As previously explained in detail [11], sample sizes of N > 1025 were partitioned in the PDFestimator to achieve rapid calculations. The lowest and highest random number in the set {r k } N define the boundaries of each partition. The random number closest to the median was also included. Partitions have an odd number of random numbers due to the recursive process of adding one additional random number between the previously selected random numbers in the current partition. Partition sizes follow the pattern of 3, 5, 9, 17, 33, ....1 + 2 n . A desired property of scoring functions is that they should maintain size invariance for all partitions. Scores for each measure were tracked for all partitions of size 1026 and greater, including the full data set, which is the last partition. For example, with N=100,000 the scores for partitions of size Np = 1025, 2049, 4097, 8193, 16,385, 32,769, 65,537, 100,000 were calculated. Scores from different partitions were cross correlated in scatter plots.

Finite Size Corrections
For each partition of size N p , including the last partition of size N, the scores were transformed to obtain data collapse. For all practical purposes finite size corrections were successfully achieved by shifting the average of a score to zero and normalizing the data by the standard deviation of the raw score. That is to say, the score, S t (N p ) for N p samples in the p-th partition, was a random variable. This score was transformed to a Z-value through the procedure (N p , N). Operationally, tens of thousands of random sequences of SURD were generated for each scoring function type to empirically estimate µ t (N p , N) and σ t (N p , N). Note that µ t (N p , N) and σ t (N p , N) were obtained using basic fitting tools in the MATLAB graphics interface, and these are reported in Table 1.

Decoy Generation
For each decoy the sort ordered sequence of numbers r o k N defining SURD was transformed into decoy-SURD, denoted as dSURD. This was accomplished by creating a model decoy cdf, F d (r). A new set of sort ordered random numbers was created by the 1 to 1 mapping {r d k } N = F d ({r o k } N ), yielding a dSURD realization per SURD realization. Different decoys were generated based on different types of perturbations, which must meet certain criteria. Let ∆(r) represent a perturbation to SURD, such that For the perturbation to be valid, the pdf given by f d (r) = dF d (r) dr must satisfy f d (r) ≥ 0, which implies 1 + ∆ (r) ≥ 0. The boundary conditions ∆(0) = ∆(1) = 0 must also be imposed. With these conditions satisfied, decoys of a wide variety could be generated. Four types of decoys were created using this approach, listed in the first 4 rows of Table 3. In this approach, the amplitude of the perturbation is a parameter. A decoy that is marginally difficult to detect at sample size of N d has max(|∆|) = 1/ √ N d . It will be challenging to discriminate between SURD and dSURD for N < N d , and markedly distinguishable when N/N d 1. Two additional types of decoys were also generated. First, F d (r) is set to a beta distribution cdf, denoted as F β (r|α, β). Therefore, the perturbation is given as ∆(r) = F β (r|α, β) − r. The α and β parameters were adjusted to tune detection difficulty, by systematically searching for pairs of α and β on a high resolution square grid to find when max(∆) was at a level that was consistent with the targeted sample size, N d . Second, a decoy can be defined by uniformly reducing fluctuations according to r d k = r o k + p(r o k − µ k ) where µ k = k/(N + 1). When p = 0 the decoy was the same as SURD, but as p → 1 the decoy retained no fluctuations. In this sense, this decoy type mimics extreme over-fitting, where p controls how much of the fluctuations are reduced. Table 3. Decoy type summary.

Decoy Name Perturbation Equation Parameters
dimple

ROC Curves
All ROC curves were generated according to the definition that the fraction of true positives (FTP) were plotted on the y-axis versus the fraction of false positives (FFP) plotted on the x-axis [22]. Note that alternative definitions for ROC are possible. To calculate FTP and FFP, a threshold score must be specified. If a score is below this threshold, the sort ordered sequence of numbers is predicted to be SURD. Conversely, if a score exceeds the threshold, the prediction is not SURD. As such, there are four possible outcomes. First, true SURD can be predicted as SURD or not, respectively, corresponding to a true positive (TP) or a false negative (FN). Second, dSURD can be predicted as SURD or not, respectively corresponding to a false positive (FP) or true negative (TN). All possible outcomes are tallied, such that FTP = TP/(TP + FN) and FFP = FP/(FP + TN). For a given threshold value, this calculation determines one point on the ROC curve. By considering a continuous range of possible thresholds, the entire ROC curve is constructed.
Procedurally, the data used to calculate the fractions of true and false positives that come from numerical simulations in MATLAB comprised 10,000 random SURD and dSURD pairs for sample sizes, N = 10, 50, 200, 1000, 5000, 20,000 and 100,000. About 60 different types of decoys were considered with diverse sets of parameters.

Distribution Test Set
To benchmark the effect of a scoring function on the performance of the PDFestimator, a diverse collection of distributions was selected and these are listed in Table 4. A MATLAB script was created to utilize built in functions dealing with statistical distributions to generate random samples of specified size. The random samples were subsequently processed by the PDFestimator to estimate the pdf, but for which the exact pdf is known. The set of possible distributions available for analysis cover a range of monomodal distributions that represent many types of features that include sharp peaks, heavy tails and multiple resolution scales. Some mixture models were also included that combine difficult distributions to create a greater challenge.

PDF Estimation Method
Each alternative scoring function, {|Z AD |, |Z LL |, |Z VAR |, |Z 4 |, CS} was implemented in the PDFestimator and were evaluated separately. Factors confounding comparisons in performance include sample size, distribution type, selection of key factors to evaluate and consistency across multiple trials. To provide a quantitative synopsis of the strengths and weaknesses of the proposed scoring methods, large numbers of trials were conducted on the distribution test set listed in Table 4. The distribution test set increases atypical failures amongst the estimates because it is necessary to consider extreme scenarios to identify breaking points in each of the scoring methods. Nevertheless, easier distributions, such as Gaussian, uniform and exponential, were included. To wit, good performance of an estimator when applied to challenging cases should not suffer when applied to easier distributions.
As an inverse problem, density estimation applied to multiple random samples of the same size for any given distribution will generally produce variation amongst the estimates. For small samples, the pdf estimate must resist over-fitting, whereas large sample sizes create computational challenges that must trade between speed and accuracy. To monitor these issues, a large range of sample sizes were tested, each with 100 trials of an independently generated input sample data set. Specifically, 100 random samples were generated for each of the 25 distributions, for each of the following 11 sample sizes with N = 10, 50, 100, 500, 1000, 5000, 10,000, 50,000, 100,000, 500,000, 1,000,000. This produced a total of 27,500 test cases, each of which were estimated using five scoring methods. Statistics were collected and averaged over each of the 100 random sample sets.
Three key quantities were calculated for a quantitative comparison of the scoring methods-failure rate, computational time and Kullback-Leibler (KL) divergence [21]. It was found that the KL-divergence distance was not sensitive to the different scoring functions. Alternative information measures [23,24] could be considered in future work. Failure rate is expressed as a fraction of failures out of 100 random samples. The KL-divergence measures the difference between the estimate against the known reference distribution. Computational times and KL-divergences were averaged only for successful solutions and thus were not impacted by failures. A failure is automatically determined by the PDFestimator when a score does not reach a minimum threshold.
During an initial testing phase, it was found that the measures Z t and |Z t | for t = AD, LL, and VAR all worked successfully, which is not surprising considering the original measure, Z LL , works markedly well. However, for the more sensitive measures, Z 4 , |Z 4 | and CS, the PDFestimator failed consistently because the score rarely reached its target threshold, at least within a reasonable time. Therefore, a hybrid method was developed that minimizes a sensitive measure as usual, but the |Z VAR | measure was invoked to determine when to terminate. In tests of |Z t | for t = AD, LL or VAR, these measures were optimized and were simultaneously used as a stopping condition with a threshold of 0.66 corresponding to the 40% level in the cdf, which was the same level used previously [11]. All these measures have the same pdf and cdf, and thus the same threshold value. This threshold was used for |Z VAR | as a stopping condition when different scoring functions are minimized. Table 4. List of distribution types and corresponding parameters used to generate random data samples. Parameter and variable names correspond to the labeling scheme of MATLAB. For mixture distributions, subscripts indicate the distribution used to create the mixture with ordinal numbering, and under the Scale Parameter column, for mixture distributions p i is the mixing weight.

Conclusions
Several conclusions can be drawn from the large body of results presented. (1) The scaled quantile residual (SQR) is instrumental in assessing the quality of a pdf by means of visual inspection. The advantage of an SQR-plot over a traditional QQ-plot is that the displayed information is not only universal (distribution free), but importantly, sample size invariant; (2) It is possible to construct myriad scoring functions that are universal and sample size invariant based on quantitatively characterizing SQR. In particular, various measures can be developed based on mathematical properties of single order statistics (SOS) and/or double order statistics (DOS); (3) Finite size corrections can generally be applied to scoring functions so that their asymptotic properties can be utilized for finite size samples, as low as N = 9; (4) Surprisingly, the scoring functions based on the Anderson-Daring test, quasi log-likelihood of SOS and the variance of SOS z-score -when applied to sampled uniform random data (SURD) share identical pdf for their scores for all practical purposes. Moreover, the scores are invariant across sample size and for different size partitions that sub-sample the input data; (5) The concept of decoy-SURD is introduced and a few methods are given for creating decoy-SURD (dSURD). The purpose of dSURD is to quantify the sensitivity and selectivity of a proposed scoring function using Receiver Operator Characteristics (ROC) or other means, such as machine learning. The usefulness of dSURD to quantify uncertainty in density estimation parallels the use of decoys in the field of protein structure prediction. That is, better scoring functions can be developed by focusing on how they discriminate between true SURD and dSURD; (6) Implementing a more sensitive scoring function in a method that estimates a pdf from random sampled data does not necessarily imply the process of estimation will be improved. There are many confounding factors that determine the ultimate performance characteristics of an algorithm for density estimation, since speed and accuracy need to be balanced for a practical software tool; (7) Minimizing either the Z 4 or |Z 4 | scores greatly improved the performance of the PDFestimator, a C++ program for univariate density estimation, compared to the initially used scoring function Z LL .
In closing, a few research directions that can stem from this work are highlighted. Interestingly, the mean log ratio of nearest neighbor differences in sort ordered SURD, when taken from two disjoint subsets, is normally distributed (at least to a very good approximation). Unaware of an existing proof of this result, the empirical result suggests that a proof should be sought given that the literature contains many works that derive the pdf for ratios of random numbers that are distributed in a specific way. The results presented here can be applied to the problem of constructing a more sensitive distribution free "test for goodness of fit." Essentially, this was the main objective that was addressed but here the emphasis was on how to better quantify uncertainty for the process of estimating a pdf for a random sample of data. Going forward, the universal sample size invariant measures developed here can be employed to test the similarity of two random samples of data.
Author Contributions: D.J. formalized the project objectives, proposed most measures and all decoy types. D.J. wrote the MATLAB code to scale all measures for SURD, generate decoy-SURD and discriminate SURD from decoy-SURD using ROC curves. Z.M. selected the distribution test set, wrote the MATLAB code to generate the random samples, and performed preliminary tests on the PDFestimator. A.G. evaluated ROC curves and scatter plots for the proposed scoring functions applied to SURD and decoy-SURD across sample sizes. J.F. motivated the initial work by reductive data analysis using methods of data collapse and scaling. J.F. modified the PDFestimator as needed, performed all simulations involving the PDFestimator, and performed all data analysis regarding comparative density estimation performance. D.J., J.F. and Z.M. wrote the paper.
Funding: This research received no external funding.