Nonparametric Limits of Agreement for Small to Moderate Sample Sizes: A Simulation Study

Abstract: The assessment of agreement in method comparison and observer variability analysis of quantitative measurements is usually done by the Bland–Altman Limits of Agreement, where the paired differences are implicitly assumed to follow a normal distribution. Whenever this assumption does not hold, the 2.5% and 97.5% percentiles are obtained by quantile estimation. In the literature, empirical quantiles have been used for this purpose. In this simulation study, we applied sample, subsampling, and kernel quantile estimators, as well as other methods for quantile estimation, to sample sizes between 30 and 150 and different distributions of the paired differences. The performance of 15 estimators in generating prediction intervals was measured by their respective coverage probability for one newly generated observation. Our results indicated that sample quantile estimators based on one or two order statistics outperformed all of the other estimators and can be used for deriving nonparametric Limits of Agreement. For sample sizes exceeding 80 observations, more advanced quantile estimators, such as the Harrell–Davis estimator and estimators of the Sfakianakis–Verginis type, which use all of the observed differences, performed similarly well, but may be considered intuitively more appealing than simple sample quantile estimators that are based on only two observations per quantile.


Introduction
The classical Bland-Altman Limits of Agreement (BA LoA) define a range within which approximately 95% of normally distributed differences between paired measurements are expected to lie [1][2][3]. In cases of non-normally distributed differences, the use of empirical quantiles has been proposed as a robust alternative [2,4,5]; moreover, past research has suggested applying nonparametric quantile estimation to the 2.5% and 97.5% percentiles to obtain nonparametric LoA. We performed a simulation study on 15 nonparametric quantile estimators to derive nonparametric prediction intervals and assessed their performance by means of the coverage probability for one newly generated observation. The aim of this study was to suggest a nonparametric and robust alternative to the classical BA LoA when the normality assumption does not hold and/or the sample sizes are small to moderate. Our findings are illustrated by an application to data from a previously published clinical study on coronary artery calcification measured by the Agatston score [6].

Methods
Let the differences of paired observations be independent observations of a random variable X with a cumulative distribution function (CDF) F : R → [0, 1]. Since F is continuous from the right, the quantile function of X, Q(u) := inf{x : F(x) ≥ u}, u ∈ (0, 1), is continuous from the left; hence, at least 100u percent of the values of X lie at or below Q(u) [7,8]. In the following, we seek to estimate the 2.5% and 97.5% percentiles of F, corresponding to u = 0.025 and u = 0.975, by different methods of nonparametric quantile estimation.
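As a plain illustration of the generalized-inverse definition above (not one of the 15 estimators under study, and written in Python rather than the study's R; the function name is ours), the quantile function of an empirical CDF attains its infimum at the ⌈nu⌉-th order statistic:

```python
import math

def empirical_quantile(xs, u):
    """Q(u) = inf{x : F(x) >= u} for the empirical CDF F of the sample xs.

    For a step-function empirical CDF, the infimum is attained at the
    ceil(n * u)-th order statistic of the sorted sample.
    """
    assert 0 < u < 1
    xs = sorted(xs)
    return xs[math.ceil(len(xs) * u) - 1]

print(empirical_quantile([3, 1, 2, 5, 4], 0.5))  # -> 3
```

For u = 0.5 and n = 5, ⌈2.5⌉ = 3, so the third order statistic is returned.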
Databases, such as JSTOR (Journal Storage), ScienceDirect, the online journal platform "Taylor & Francis Online", and those maintained by PubMed/Medline in the NCBI (National Center for Biotechnology Information), were searched for "quantile estimator", "nonparametric quantile estimator", "nonparametric kernel quantile estimator", "subsampling quantile estimator", and "new quantile estimator". Fifteen nonparametric quantile estimators were chosen, three of which are sample quantile estimators, four are subsampling quantile estimators, two are kernel quantile estimators, and six are other quantile estimators. Hence, we chose very different types of nonparametric quantile estimators, which use one, two, or all the observations in a sample.

Sample Quantile Estimators
The simplest way to estimate quantiles nonparametrically is by using sample quantile estimators. A random sample X_1, . . ., X_n of size n is sorted in increasing order X_(1) ≤ X_(2) ≤ . . . ≤ X_(n); the symbols here denote the order statistics of the random sample. The CDF F can then be estimated by the step function F̂(x) = (1/n) Σ_{i=1}^{n} I_(−∞, x](X_i) (1), where I_A(x) is the indicator function that takes the value 1 if x ∈ A and 0 if x ∉ A; the sample quantile function is defined as Q̂(u) := inf{x : F̂(x) ≥ u}, u ∈ (0, 1), by Cheng [7]. The first sample quantile estimator, SQ_p1 = X_([np]+1), is based on a single order statistic, where [x] is the greatest integer that is less than or equal to x [9,10]. SQ_p1 is the smallest observation for which at least p percent of the observed values in the sample are smaller than or equal to SQ_p1. The second sample quantile estimator, SQ_p2 = (1 − α)X_(r) + αX_(r+1), is a weighted average of the two order statistics that are closest to including p percent of all the observations in the sample, with α = p(n + 1) − r and r = [p(n + 1)] [9,11]. Finally, we considered a weighted average of X_(i) and X_(i+1), SQI_p = (i + 0.5 − np)X_(i) + (np + 0.5 − i)X_(i+1), with i = [np + 0.5] and 0.5 ≤ np ≤ (n − 0.5) [9,12].
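The first two sample quantile estimators can be sketched in a few lines of Python (the study's own code is in R; the index conventions below follow one common reading of the definitions above, with r = [p(n + 1)] and α = p(n + 1) − r):

```python
import math

def sq_p1(xs, p):
    """Smallest observation for which at least 100p% of the sample values
    are smaller than or equal to it (single order statistic)."""
    xs = sorted(xs)
    return xs[math.ceil(len(xs) * p) - 1]

def sq_p2(xs, p):
    """Weighted average of the two order statistics closest to p(n + 1):
    (1 - alpha) * X_(r) + alpha * X_(r+1), with alpha = p(n + 1) - r and
    r = floor(p(n + 1)). For p = 0.025 this requires r >= 1, i.e. n >= 40,
    which matches the sample-size restriction noted in the Results."""
    xs, n = sorted(xs), len(xs)
    r = math.floor(p * (n + 1))
    alpha = p * (n + 1) - r
    return (1 - alpha) * xs[r - 1] + alpha * xs[r]

print(sq_p2(list(range(1, 101)), 0.5))  # -> 50.5
```

For the median of 1, . . ., 100, r = 50 and α = 0.5, so SQ_p2 averages the 50th and 51st order statistics.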

Subsampling Quantile Estimators
The abovementioned sample quantile estimators are based on only one or two order statistics whereas subsampling, kernel, and other quantile estimators employ linear combinations of all the available order statistics, weighting them according to their relative closeness to the target percentile.
Based on the sample quantile function Q̂, linear smooth nonparametric estimators of Q(u) can be written as Q̃(u) = ∫_0^1 Q̂(t) dG(u; t) (5), where G(u; ·) is a CDF with support on the unit interval. Many distributions G(u; ·) have been proposed, the choice of which depends on the sample size and, typically, a smoothing parameter.
Depending on the choice of G(u; ·), there are two major classes of quantile estimators according to Cheng [7]: subsampling quantile estimators and kernel quantile estimators. Both can be written as L-statistics, that is, Q̃(u) = Σ_{j=1}^{n} W_j X_(j), where W_j is the weight for the j-th order statistic X_(j) [13]. For subsampling quantile estimators, a discrete distribution is chosen for G(u; ·), interpreted as a resampling distribution from the set of the observed order statistics X_(j), j = 1, . . . , n [7]. The Harrell-Davis estimator is given by HD_p = Σ_{i=1}^{n} W_i X_(i) (6), with weight function W_i = I_{i/n}(a, b) − I_{(i−1)/n}(a, b), where I_x(a, b) is the incomplete beta function with parameters a = (n + 1)p and b = (n + 1)(1 − p) [10,14,15]. The quantile estimator of Kaigh and Lachenbruch, KL_p = Σ_{j=r}^{r+n−k} [C(j − 1, r − 1) C(n − j, k − r) / C(n, k)] X_(j) (7), where r = [(k + 1)p] and C(a, b) denotes the binomial coefficient, is obtained by averaging a subsample quantile estimate over all C(n, k) subsamples of size k, 1 ≤ k ≤ n, which are sampled without replacement. The subsample size k is an arbitrary smoothing (or reduction) parameter, and Kaigh and Lachenbruch proposed choosing k so as to minimize the mean squared error (MSE), that is, MSE = E(KL_p − ε_p)^2, where ε_p is the true value of the p-th quantile [7,16,17].
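The Harrell-Davis weighting scheme can be illustrated in stdlib-only Python; the regularized incomplete beta function is approximated here by midpoint integration as a stand-in for a library routine such as scipy.stats.beta.cdf (a sketch of the weighting scheme, not the study's R implementation):

```python
import math

def beta_cdf(x, a, b, steps=4000):
    """Regularized incomplete beta I_x(a, b), computed by midpoint
    integration of the Beta(a, b) density (stdlib-only approximation)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log Beta(a, b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint of each subinterval, always in (0, 1)
        total += math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - ln_b)
    return min(total * h, 1.0)

def harrell_davis(xs, p):
    """HD_p = sum_i W_i X_(i) with W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b),
    a = (n + 1) * p and b = (n + 1) * (1 - p)."""
    xs, n = sorted(xs), len(xs)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = [beta_cdf(i / n, a, b) for i in range(n + 1)]
    return sum((cdf[i] - cdf[i - 1]) * xs[i - 1] for i in range(1, n + 1))
```

For a symmetric sample such as 1, . . ., 9 and p = 0.5, the beta weights are symmetric around the central order statistic, so the estimate lies at the center of the sample.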
Kaigh and Cheng [18] proposed a related subsampling quantile estimator, KC_p (8), with r = ⌈kp⌉, where ⌈x⌉ denotes the smallest integer greater than or equal to x [7]. The value of k is again determined by minimizing the MSE of the estimator. Finally, the Bernstein polynomial quantile estimator weights the j-th order statistic by the binomial probability C(n − 1, j − 1) p^(j−1) (1 − p)^(n−j) according to Cheng [7,19].

Kernel Quantile Estimators
Like subsampling quantile estimators, kernel quantile estimators can be written in the form of Equation (5). Here, G(u; ·) is a location-scale family CDF with density function K, location parameter u, and scale parameter h(n). Equation (5) then becomes the kernel quantile estimator introduced by Parzen [8], KQ_u = ∫_0^1 K_h(t − u) Q̂(t) dt, with the L-statistic representation KQ_u = Σ_{j=1}^{n} [∫_{(j−1)/n}^{j/n} K_h(t − u) dt] X_(j), where K_h(t) = K(t/h(n))/h(n). The density K, called the kernel, is symmetric around zero and has a chosen bandwidth h(n), which satisfies h(n) → 0 when n → ∞ [7,20]. Yang [21] proposed a discretized version, which we used as the first of the two kernel quantile estimators in our study due to its closed form: KQ_p1 = Σ_{j=1}^{n} (1/n) K_h(j/n − p) X_(j). As these weights do not generally provide a (discrete) probability distribution on [0, 1], monotonicity, translation and scale equivariance, and the symmetry relation do not hold. Translation and scale equivariance would imply that KQ_p1 applied to (X_1 + c, . . . , X_n + c) is equal to c plus KQ_p1 applied to (X_1, . . . , X_n); the asymmetry means that KQ_p1 applied to (−X_1, . . . , −X_n) is not equal to −KQ_p1 applied to (X_1, . . . , X_n). The Nadaraya-Watson type estimator, which normalizes the weights by their sum, overcomes these drawbacks and was used as the second of the two kernel quantile estimators in this study [7,22,23]. Its name originates from the Nadaraya-Watson estimator in kernel regression [24,25].
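A Yang-type discretized kernel quantile estimator with a standard Gaussian kernel can be sketched as follows; the fixed bandwidth h is purely illustrative (the study tuned h(n) by minimizing the MSE), and, as noted above, the weights (1/n)K_h(j/n − p) need not sum exactly to one:

```python
import math

def yang_kq(xs, p, h=0.05):
    """Discretized kernel quantile estimator of Yang type:
    sum over j of (1/n) * K_h(j/n - p) * X_(j), with a standard Gaussian
    kernel K and fixed illustrative bandwidth h. The weights are NOT
    normalized, mirroring the drawback discussed in the text."""
    xs, n = sorted(xs), len(xs)

    def k_h(t):
        # scaled Gaussian kernel K(t / h) / h
        return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2 * math.pi))

    return sum(k_h(j / n - p) / n * xs[j - 1] for j in range(1, n + 1))
```

For a dense, evenly spaced sample on (0, 1) and a central p, the estimate lands close to the true quantile, since the kernel weights concentrate on the order statistics nearest the target percentile.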
In the following, we chose the standard Gaussian kernel for K and used the value of the bandwidth h(n) that minimized the MSE; analogously, the subsample size k was chosen by minimizing the MSE for both KL_p in Equation (7) and KC_p in Equation (8).

Other Quantile Estimators
The kernel quantile estimators presented in the previous section are all based on the usual empirical distribution (1), with equal weights 1/n assigned to each observation. To improve the performance of quantile estimators, Huang and Brill [26] proposed using a weighted empirical distribution, for instance, the level crossing empirical distribution function F_lc (12) with its level crossing weight function (13); one kernel quantile estimator in our study uses this level crossing empirical distribution. Huang [27] modified the Harrell-Davis estimator (6) by applying a weighted empirical distribution function instead of the empirical distribution with equal weights 1/n. The resulting Harrell-Davis estimator using a level crossing empirical distribution function, HD_plc, can be written as a beta-weighted average of the order statistics, where β(·, ·) is the beta function, q = 1 − p, F_lc(·) is given by (12), the weights are defined in (13), and q_0,n ≡ 0. Sfakianakis and Verginis [28] proposed a group of estimators, SV_p1, SV_p2, and SV_p3, motivated by the fact that nonparametric estimation of extreme quantiles close to 0 and 1 requires large samples for sufficient accuracy. These three quantile estimators are supposed to better estimate quantiles in the tails of a distribution when using small samples, and they employ the binomial probability of observing exactly i out of n events with an event probability of p, B(i; n, p). Finally, Navruz and Özdemir [29] introduced a new quantile estimator, NO_p, which is a weighted average of all order statistics with interior terms of the form (B(i; n, p)(1 − p) + B(i + 1; n, p)p) X_(i+1) and the boundary terms −B(n; n, p) p X_(n−2) + B(n; n, p)(3p − 1) X_(n−1) + (B(n − 1; n, p)(1 − p) + B(n; n, p)(2 − 2p)) X_(n).
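The binomial weights B(i; n, p) shared by the Sfakianakis-Verginis and Navruz-Özdemir estimators are straightforward to compute; a small Python illustration (the full estimators additionally combine these weights with the boundary terms sketched above):

```python
import math

def binom_pmf(i, n, p):
    """B(i; n, p): probability of observing exactly i out of n events,
    each with event probability p."""
    return math.comb(n, i) * p ** i * (1 - p) ** (n - i)

# As a probability mass function, the weights sum to one even for the
# extreme percentile p = 0.025 used for the lower LoA:
weights = [binom_pmf(i, 30, 0.025) for i in range(31)]
print(round(sum(weights), 10))  # -> 1.0
```

For p = 0.025 and n = 30, nearly all mass sits on the first few order statistics, which is why these estimators target the tails of the distribution.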

Simulation Setup
We contrasted nonparametric LoA as constructed with the 15 abovementioned quantile estimators by comparing their coverage probabilities for the next paired difference under the given distributional assumption. Here, we employed the standard normal distribution (ND), a standard normal distribution with 1%, 2%, and 5% outliers (ND 1%, ND 2%, and ND 5%, respectively), an exponential distribution (ED) with a rate of 1, and a lognormal distribution (LND) with meanlog = 0 and sdlog = 1. For normal distributions comprising outliers, simulated data were replaced with a probability of 1%, 2%, and 5% by data sampled from a normal distribution with a mean of 0 and a standard deviation of 3. To examine small to moderate sample sizes, the sample size was set to 30, 50, 80, 100, and 150. For each combination of distribution, sample size, and nonparametric quantile estimator, 20,000 simulated trials of size (n + 1) were generated with R (the code is available as Supplemental Material S1). Here, a seed was set in order to use the same simulated data for each combination of distribution and sample size across nonparametric quantile estimators. The first n observations in each simulated trial were used to derive nonparametric LoA, to which the last observation was compared. The coverage probability was then the proportion of cases out of the 20,000 trials where the nonparametric LoA included the last observation. All of the figures were generated with Stata/MP 16.1 (College Station, TX 77845, USA).
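The simulation loop can be sketched as follows; note that the study used R, 20,000 trials, six distributions, and 15 estimators, whereas this scaled-down Python sketch uses a single estimator, the standard normal case, and 2000 trials:

```python
import math
import random

def sq_p1(xs, p):
    """Sample quantile based on a single order statistic; for n = 30 and
    p = 0.025 / 0.975 this returns the smallest / largest observation."""
    xs = sorted(xs)
    return xs[math.ceil(len(xs) * p) - 1]

def coverage(n=30, trials=2000, seed=1):
    """Proportion of trials in which observation n + 1 falls inside the
    nonparametric LoA derived from the first n observations."""
    rng = random.Random(seed)  # fixed seed, mirroring the study's setup
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n + 1)]
        lower = sq_p1(xs[:n], 0.025)
        upper = sq_p1(xs[:n], 0.975)
        hits += lower <= xs[n] <= upper
    return hits / trials

print(coverage())  # close to the (n - 1)/(n + 1) ~ 0.935 expected for min/max LoA at n = 30
```

With the extremes as LoA, the theoretical coverage for the next observation is (n − 1)/(n + 1) ≈ 0.935 for n = 30, consistent with the values reported in Table 1.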

Results
For n = 30 (Table 1), none of the estimators reached the nominal coverage probability of 0.95. The coverage probability of SQ_p1 was closest to 0.95, ranging from 0.934 to 0.938. Note that, for sample sizes of up to n = 40 observations, the smallest and largest difference are used as nonparametric quantile estimates for the 2.5% and 97.5% percentiles, respectively. For SQI_p, HD_p, HD_plc, SV_p1, SV_p2, and SV_p3, the coverage probabilities were at least 0.921, 0.911, 0.910, 0.920, 0.914, and 0.923, respectively. Neither SQ_p2 nor KL_p is defined for n < 40. Only some of the estimators reached the nominal coverage probability for larger sample sizes (Tables 2-5); for n = 50, SQ_p2 was the only one to do so.
For n = 80 (Table 3), the coverage probabilities of HD_p, SV_p1, SV_p2, and SV_p3 fluctuated closely around the nominal level, except for the simulations with an ED (0.94). For n ≥ 100 (Tables 4 and 5), these estimators performed close to the 0.95 nominal level for all of the investigated distributions.
The coverage probabilities of the recently proposed NO_p estimator varied between 0.945 and 0.950 for n = 200 and were very close to 0.95 for n = 250 (results not shown here).

Example
Diederichsen et al. [6] compared coronary artery calcification measurements based on the Agatston score with the Framingham Heart Score in Danes aged 50 and 60 years. Of 1825 randomly sampled citizens, 1257 consented to participate in the study, and 1156 of them were eligible. Agatston scores were independently reanalyzed for 129 randomly chosen study participants, and the agreement measures were the proportions of agreement and the kappa statistics for dichotomized calcification status (absence vs. presence) to assess intra- and inter-rater agreement. In the following, the intra-rater differences are used for exemplification purposes.
Approximately half of the 129 participants had an Agatston score of 0. The paired intra-rater differences ranged from −683 to 130, with the first, second, and third quartiles all equal to 0; the 5th, 10th, 90th, and 95th percentiles were −23, −12, 1.1, and 5, respectively. The empirical distribution of the paired differences was therefore characterized by its denseness around 0 and a single, comparatively extreme outlier, clearly indicating the inappropriateness of the normality assumption in this setting (see also a histogram including an approximating normal distribution as Supplemental Material S2).
Using SQ_p2, HD_p, and SV_p1, the nonparametric, asymmetric, and robust LoA are (−61.5, 12.8), (−96.2, 26.7), and (−122.1, 30.6), respectively, whereas the symmetric BA LoA of (−129.8, 116.1) are equidistant from the estimated mean difference of −6.9 (Figure 1). The upper LoA for HD_p and SV_p1 are similar, but the respective lower LoA are differently affected by the single outlier (−683). SQ_p2 appears to be the most robust to a few outliers due to its definition. The R source code for the derivation of these nonparametric LoA as well as the example data can be found as Supplemental Materials S3 and S4, respectively. The sensitivity of the classical BA LoA to outliers becomes evident when the single outlier is excluded here: the symmetric BA LoA narrow to −37.4 and 34.3, and the estimated mean difference reduces to −1.6 (results not shown here). In practice, however, outliers are kept in the analysis dataset if there is no reasonable explanation for an exclusion. This underlines the importance of robust alternatives to the BA LoA.

Statement of Principal Findings
The simple sample quantile estimators based on one and two order statistics performed closest to the nominal level in terms of the coverage probability for the next observation across the six distributional scenarios, for n = 30 and for n = 50, 80, 100, and 150, respectively. The Harrell-Davis subsampling estimator and estimators of the Sfakianakis-Verginis type followed closely for sample sizes of at least n = 80 and may be considered intuitively more appealing, as they use the entire sample, whereas the simpler and outlier-robust sample quantile estimators are based on only a few observations from the sample.

Strengths and Limitations of the Study
The choice of distributions for the simulation study was motivated by our own experience with agreement assessments in clinical studies, especially roughly normal distributions with a few percent outliers. We investigated a wide range of quantile estimators, comprising sample, subsampling, and kernel quantile estimators as well as other methods for quantile estimation with sample sizes between 30 and 150. As a measure of performance, we considered the coverage probability of nonparametric LoA for the next observation, interpreting the nonparametric LoA as a prediction interval, as the lower and upper LoA need to be simultaneously assessed. Therefore, we did not pursue evaluations using, for instance, mean squared errors.
When compared to the usual number of 2000 iterations, the chosen number of 20,000 iteration runs reduced, for a given nonparametric estimator and sample size, the range of the coverage probabilities across distributions by approximately 0.005 and is deemed appropriately accurate. However, the increased number of iterations implied considerably longer run times for creating the data for one table (12 h as opposed to 2 h). The abovementioned studies employed between 1000 and 10,000 iterations [16,22].
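The gain in accuracy can be made explicit with the binomial Monte Carlo standard error, sqrt(p(1 − p)/B), of a coverage probability p estimated from B simulation runs; a quick Python check, assuming coverage near the nominal 0.95:

```python
import math

def mc_se(p, iters):
    """Monte Carlo standard error of a coverage probability p estimated
    as a proportion from `iters` independent simulation runs."""
    return math.sqrt(p * (1 - p) / iters)

print(round(mc_se(0.95, 2000), 4))   # -> 0.0049
print(round(mc_se(0.95, 20000), 4))  # -> 0.0015
```

Moving from 2000 to 20,000 iterations thus shrinks the simulation noise on each reported coverage probability by a factor of sqrt(10) ≈ 3.2, consistent with the narrower range observed across distributions.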
Harrell and Davis [14] did not recommend HD_p for small n and extreme p, and Dielman, Lowry, and Pfaffenberger [9] concluded, based on maximum sample sizes of 30 and 60, that there was no single best estimator across scenarios; however, they suggested that HD_p performs well in a wide range of cases, except when p = 0.02, 0.98. Our findings for HD_p are in line with these former conclusions, but extend to larger sample sizes of n = 80, 100, and 150, for which HD_p appears to be a preferable choice for estimating extreme quantiles.

Meaning of the Findings: Possible Mechanisms and Implications
Our findings suggest using SQ_p1 in small samples with approximately n = 30, but SQI_p, SV_p1, or SV_p3 may be preferable alternatives, as SQ_p1 simply reduces to the smallest and largest observations as estimates for the 2.5% and 97.5% quantiles, respectively; this is unfortunate in the case of outliers due to their unabated impact on the estimates. SQ_p2 performed closest to the nominal level for all samples with n ≥ 50 and appeared to be less prone to the single outlier in our clinical example than HD_p and SV_p1; however, the latter two estimators do involve all the observations. SQ_p2 can therefore be considered the first choice for samples of approximately n = 50, but, for larger n, both HD_p and Sfakianakis-Verginis type estimators are equally applicable and actually preferable if the researcher seeks to include the entire dataset in quantile estimation and not only pairs of order statistics.
The normality assumption for the paired differences may often be considered reasonable in the planning stage; however, alternative quantile estimators should likewise be specified at that stage, as empirical distributions may deviate notably from ideal assumptions. Moreover, our investigation suggests several beneficial nonparametric alternatives to the BA LoA beyond the simple percentile estimators that currently seem to prevail.

Unanswered Questions and Future Research
In the case of normally distributed paired differences, Bland and Altman [1,2] have already proposed approximate confidence intervals for the BA LoA. Recently, Vock [30] emphasized that only a tolerance interval or the outer confidence limits for BA LoA can provide a range that will contain a specified percentage of future differences with a known certainty. Carkeet and Goh [31,32] proposed exact confidence intervals for BA LoA, while using two-sided tolerance factors for a normal distribution.
In the case of any given distribution for the paired differences, several approaches for the construction of nonparametric confidence intervals for quantiles have been proposed over the past half century [33][34][35][36][37][38][39]. For both HD_p and KL_p, confidence intervals for quantiles have been