Bland–Altman Limits of Agreement from a Bayesian and Frequentist Perspective

Bland–Altman agreement analysis has gained widespread application across disciplines, not least in the health sciences, since its inception in the 1980s. Bayesian analysis has been on the rise thanks to increased computational power, and Alari, Kim, and Wand have put the Bland–Altman Limits of Agreement in a Bayesian framework (Meas. Phys. Educ. Exerc. Sci. 2021, 25, 137–148). We contrasted the prediction of a single future observation and the estimation of the Limits of Agreement from a frequentist and a Bayesian perspective by analyzing interrater data of two sequentially conducted preclinical studies. The estimation of the Limits of Agreement θ1 and θ2 has wider applicability than the prediction of single future differences. While a frequentist confidence interval represents a range of nonrejectable values for null hypothesis significance testing of H0: θ1 ≤ −δ or θ2 ≥ δ against H1: θ1 > −δ and θ2 < δ, with a predefined benchmark value δ, Bayesian analysis allows for direct interpretation of both the posterior probability of the alternative hypothesis and the likelihood of parameter values. We briefly discuss group-sequential testing and nonparametric alternatives. Thanks to improved computational resources, frequentist simplicity no longer trumps Bayesian interpretability, but the elicitation and implementation of prior information demand caution. Accounting for clustered data (e.g., repeated measurements per subject) is well established in frequentist, but not yet in Bayesian, Bland–Altman analysis.


Introduction
The Bland-Altman plot for method comparison on continuous outcomes has its roots in Tukey's mean-difference plot, in which means and differences of paired measurements (from, for instance, two measurement devices) are shown in a scatterplot [1]. Altman and Bland defined the two parameters θ1 = μ − 1.96σ and θ2 = μ + 1.96σ for normally distributed differences, with mean μ and standard deviation σ, and called these the lower and upper Limits of Agreement (LoA), respectively [2]. They extended Tukey's mean-difference plot by adding empirical estimates for θ1 and θ2, as θ1 and θ2 are the boundaries within which, under the assumption of normally distributed differences, 95% of all population differences are supposed to fall [2,3]. The Bland-Altman LoA have found widespread application, not least in health science, as approximately 37,450 citations of the seminal Lancet paper [4] to date bear witness. Altman and Bland later reflected on both the historical development of their approach and the intriguing success of that paper [5]. Since then, further extensions have been proposed [6-11], as well as reporting standards for [12,13] and alternative modes of analysis to the LoA [14-18].
Standard textbooks on agreement and reliability have taken a frequentist perspective [19-22]. Lyle D. Broemeling described a Bayesian take on intra-class correlation coefficients with respect to reliability [23] and a regression approach of one reader's scores on those of the other in order to assess agreement in terms of whether the simple linear regression goes through the origin with a slope of one in a scatterplot of paired measurements [24]. The latter, however, is reminiscent of the period preceding Bland-Altman analysis, in which scatterplots of the paired measurements against each other, comparison to the 45-degree diagonal, and correlation coefficients prevailed in agreement assessment [5]. Alari, Kim, and Wand have recently proposed a Bayesian perspective on Bland-Altman analysis that can be considered a counterpart of the frequentist one [25]. In a frequentist sense, θ1 and θ2 have true, but unknown, values; in a Bayesian sense, both θ1 and θ2 have probability distributions.
The aim of this work is to contrast classical Bland-Altman analysis with a Bayesian counterpart, highlighting the inherent differences in interpretation. Both the prediction of a single future observation and the estimation of the LoA are illuminated and discussed. Interrater data of two local, sequentially conducted preclinical studies (studies 1 and 2) served as a worked example. We will see that, in the absence of prior information, the results of both approaches are numerically close, but the Bayesian interpretation is naturally more appealing as its probability statements relate directly to relevant ranges for the LoA.

Classical Frequentist Bland-Altman Analysis
The prediction interval for a future observation follows the descriptions of Carstensen [20] and Vock [26]: estimated mean difference ± c_PI times the standard deviation of the differences, with c_PI = t_{1−α/2; n−1} × √((n + 1)/n), where n is the number of observed differences and t_{1−α/2; n−1} is the (1 − α/2) quantile of Student's t distribution with (n − 1) degrees of freedom. The exact parametric confidence intervals for the Bland-Altman LoA were used to derive outer 95% confidence limits for the LoA, considering these as a pair [6]: estimated mean difference ± c_OCI times the standard deviation of the differences. The constant c_OCI is to be read off from Carkeet's Supplemental Data S2 as c_t0.975 [6].
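For illustration, the prediction interval can be computed from summary statistics in a few lines of Python (a sketch using SciPy rather than the Stata code of Supplemental Material Code S1; the function name is ours, and the summary statistics are those of the pooled example data reported below):

```python
from math import sqrt
from scipy import stats

def prediction_interval(mean_d, sd_d, n, alpha=0.05):
    """Frequentist prediction interval for a single future difference:
    mean_d +/- c_PI * sd_d with c_PI = t_{1-alpha/2; n-1} * sqrt((n+1)/n)."""
    c_pi = stats.t.ppf(1 - alpha / 2, df=n - 1) * sqrt((n + 1) / n)
    return mean_d - c_pi * sd_d, mean_d + c_pi * sd_d

# Pooled summary statistics of the worked example: n = 131, mean 0.5, SD 1.93
lo, hi = prediction_interval(0.5, 1.93, 131)
print(round(lo, 2), round(hi, 2))  # -3.33 4.33
```

Because c_PI exceeds 1.96 for any finite n, the prediction interval is always slightly wider than the estimated LoA themselves.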
Stata/MP 17.0 (College Station, TX 77845 USA) was applied. Stata codes and data are attached as Supplemental Material Code S1 and Data S2, respectively.

Bayesian Bland-Altman Analysis
For the Bayesian perspective, the proposal by Alari, Kim, and Wand was employed, including respective Supplemental Materials [25]. Assuming that the differences follow a normal distribution with mean µ and standard deviation σ and given a pre-specified benchmark value of δ for these differences, they addressed two questions of interest:

1. What is the probability that a future difference d between the two measurements will be between (−δ, δ)?
2. What is the probability that the true LoA, θ1 and θ2, both lie within (−δ, δ)?
The first question focuses, like the frequentist prediction interval, on a single future difference, whereas the second question targets the LoA and quantifies the updated belief that the true parameter values for μ and σ are within the fixed region of practical equivalence (ROPE); see also Kruschke [27] (pp. 336-343). The ROPE comprises all possible values of μ and σ for which the two methods of measurement are practically the same, meaning close enough to each other in terms of the predetermined benchmark value δ. Given δ, all combinations of (μ, σ) that fall within the triangle with vertices (−δ, 0), (0, δ/1.96), and (δ, 0) belong to H1: θ1 > −δ and θ2 < δ (see also Figure 2 in Alari, Kim, and Wand [25]).
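The ROPE membership condition is elementary to check; a minimal Python sketch (the function name is ours), evaluated at the pooled estimates of the worked example below and at a hypothetical larger SD:

```python
def in_rope(mu, sigma, delta):
    """True if (mu, sigma) lies inside the ROPE triangle, i.e., both LoA
    theta1 = mu - 1.96*sigma and theta2 = mu + 1.96*sigma fall within
    the open interval (-delta, delta)."""
    return mu - 1.96 * sigma > -delta and mu + 1.96 * sigma < delta

print(in_rope(0.5, 1.93, 5))  # True: LoA (-3.28, 4.28) lie within (-5, 5)
print(in_rope(0.5, 2.50, 5))  # False: upper LoA 5.40 exceeds delta = 5
```

The posterior probability of H1 is then the posterior mass of (μ, σ) pairs for which this condition holds.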
The questions above are addressed by means of the posterior predictive distribution for a future difference d and the posterior distributions of θ1 and θ2, respectively. An uninformative normal-gamma prior on μ and σ with parameters a0 = 0.5, b0 = 0.000001, μ0 = 0, and λ0 = 0.000001 is applied to the data of study 1, resulting in updated parameters a1, b1, μ1, and λ1, which in turn specify the normal-gamma prior for study 2. However, as all data of both studies were available, the uninformative normal-gamma prior was applied to the pooled data of studies 1 and 2 for the sake of convenience.
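Under the normal-gamma conjugate model the parameter update has a closed form. The following Python sketch (our own, assuming the standard normal-gamma parameterization on the mean and the precision 1/σ²) illustrates the update and the resulting Student-t posterior predictive distribution for a future difference:

```python
from math import sqrt
from scipy import stats

def ng_update(data, mu0=0.0, lam0=1e-6, a0=0.5, b0=1e-6):
    """Conjugate normal-gamma update for normally distributed data with
    unknown mean and precision; returns (mu1, lam1, a1, b1)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)  # sum of squared deviations
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    a1 = a0 + n / 2
    b1 = b0 + 0.5 * ss + lam0 * n * (xbar - mu0) ** 2 / (2 * lam1)
    return mu1, lam1, a1, b1

def posterior_predictive(mu1, lam1, a1, b1):
    """Posterior predictive for one future difference: a Student-t
    distribution with 2*a1 degrees of freedom."""
    scale = sqrt(b1 * (lam1 + 1) / (a1 * lam1))
    return stats.t(df=2 * a1, loc=mu1, scale=scale)
```

With the near-uninformative prior above, the posterior mean μ1 essentially equals the sample mean, and the probability of a future difference falling within (−δ, δ) is obtained from the predictive distribution as `pp.cdf(delta) - pp.cdf(-delta)`.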
Alari, Kim, and Wand [25] provided a tractable method of specifying an informative prior and gave the necessary R codes for all calculations and figures in their Supplemental Data S2. Both R codes and data can be found in Supplemental Material Code S3; R version 4.1.1 was used to this end. Alari, Kim, and Wand also provided an R Shiny applet [28] with which all calculations can likewise be performed; this applet supports the use of an informative prior (induced prior specification).

Worked Example
Interrater data from a recent investigation [29] as well as from work in progress in a similar setting [30] serve as the worked example. The measurements represent lymphedema hindlimb volume (in mm³) of mice using microcomputed tomography. Study 1 [29] comprised N1 = 50 and study 2 N2 = 81 paired observations of two raters (Wiinholt and Bučan), who were rater 1 and rater 2, respectively, in study 1 [29]. Data were pooled for analysis. In light of an average lymphedema hindlimb volume of around 180 mm³ (range: 140 to 280 mm³), a clinically reasonable benchmark value of δ = 5 was applied.

Results
Descriptive statistics for differences in lymphedema hindlimb measurements between the two raters are shown in Table 1. Mean (bias) and standard deviation (SD) were clearly larger in study 2 than in study 1. The estimates for the LoA were θ̂1 = 0.5 − 1.96 × 1.93 = −3.28 and θ̂2 = 0.5 + 1.96 × 1.93 = 4.28. Figure 1 shows the classical Bland-Altman plot, supplemented by a regression line of the differences on the means. The vertical spread of the data was quite homogeneous across the measurement range (variance homogeneity), and the regression line suggested neither a positive nor a negative trend with increasing mean values.

Prediction Interval for a Future Difference
With α = 0.05 and n = 131, the prediction interval for a future observation was assessed as [−3.33; 4.33]. This prediction interval spans marginally wider than the estimates for the LoA, θ̂1 = −3.28 and θ̂2 = 4.28 (see short dashes in Figure 2).

Outer 95% Confidence Limits for the LoA

The exact outer 95% confidence limits for the LoA were −3.82 and 4.82. These limits span remarkably wider around the LoA θ̂1 = −3.28 and θ̂2 = 4.28, as they quantify the outward uncertainty around θ1 and θ2 (see long dashes in Figure 2).

Bayesian Bland-Altman Analysis
Figure 2. Classical Bland-Altman plot including the estimated LoA and a regression line of the differences on the means, extended by a prediction interval for a future difference (short dashes) and exact outer 95% confidence limits for the LoA (long dashes).

Posterior Probability of a Future Difference d Being in (−δ, δ)
After observing the differences Di, i = 1, …, 131, the posterior predictive distribution of a future difference is given in Figure 3 (bottom panel, right-hand side). We are, thus, 95% sure that a future difference will be between −3.30 and 4.27, i.e., the 95% Bayesian (posterior) predictive interval spans from −3.30 to 4.27 (see the 0.025 and 0.975 quantiles of d in Table 2). Further, we believe that a future difference will be within (−δ, δ) = (−5, 5) with a probability of 0.9880 (see post.pred.agree in the output from Supplemental Material Code S3).

Discussion
The frequentist prediction interval was [−3.33; 4.33], and the outer 95% confidence limits for the LoA were −3.82 and 4.82. According to the Bayesian analysis, the equal-tailed 95% Bayesian (posterior) predictive interval was [−3.30; 4.27], and a future difference will be within (−δ, δ) = (−5, 5) with a probability of 0.9880. Outer credibility limits for the LoA were −3.87 and 4.87, with a posterior probability of the alternative hypothesis H1: θ1 > −5 and θ2 < 5 that equaled 0.9895. The frequentist prediction interval for a future difference and its Bayesian counterpart, i.e., the Bayesian (posterior) predictive interval, were very close to each other, as were the frequentist outer 95% confidence limits and the Bayesian outer 95% credibility limits. The Bayesian results are accompanied by probabilistic statements on a future difference being within the predefined benchmark values (−δ, δ) on the one hand and on the alternative hypothesis that the LoA fall within (−δ, δ) on the other hand.

Predicting One Future Difference versus Targeting the LoA
A prediction interval for a future difference, or a Bayesian (posterior) predictive interval for it, for that matter, may give an indication but appears clinically to be much less interesting than outer 95% confidence or credibility intervals of the LoA themselves. In practice, the focus will typically be on many future differences, not just on one. In terms of frequentist analysis, Vock [26] argued in favor of outer confidence limits for the LoA as these include the 2.5% and 97.5% target percentiles with a sufficiently high probability. In a Bayesian context, Alari, Kim, and Wand [25] reckoned that the researcher's question of interest is of vital importance for whether one future difference or the LoA themselves are targeted. From our point of view, a sufficiently accurate estimate for the LoA is clearly preferred as a stronger case is built, be it in favor of or against method agreement.

Confidence versus Credibility
Frequentist and Bayesian thinking about what constitutes probability differ fundamentally: the frequentist view defines the probability of some event in terms of the relative frequency with which the event tends to occur, whereas the Bayesian view defines probability in more subjective terms, namely as a measure of the strength of one's belief regarding the true situation [31,32].
Frequentist analysis assumes parameters to have exact but unknown values, for which confidence intervals quantify the uncertainty of respective estimates based on observed data. Ninety-five percent confidence intervals are long-term statements in the sense that 95% of the confidence intervals derived from many different samples from the same target population will comprise the true but unknown population parameter. A single 95% confidence interval will or will not include the true value, but with many possible samples in mind, one may argue that a specific 95% confidence interval stands a 95% chance of containing the true but unknown value (namely across all possible samples). Kruschke [27] (p. 318) reckons that the description of 95% confidence intervals as a range of nonrejectable parameter values in null hypothesis significance testing is the most general and coherent one. To this end, a confidence interval is merely a range limited by two end-points, and it does not indicate any sort of probability distribution over values of the unknown parameter [27] (p. 323).
In Bayesian analysis, an unknown parameter is described by a distribution of possible values, not only by a single, unknown value to be estimated as in the frequentist setting. Prior beliefs about a parameter's distribution are updated by newly gathered data, resulting in the posterior belief of that parameter's distribution. In other words, credibility is reallocated across possibilities based on new data [27] (p. 15). Bayesian credibility intervals have three advantages over frequentist confidence intervals [27] (p. 324):
• Credibility intervals have a direct probabilistic interpretation in terms of the credibility of possible parameter values, which confidence intervals do not have;
• Credibility intervals have no dependence on the sampling and testing intentions of the experimenter. Frequentist confidence intervals tell us about probabilities of data relative to imaginary possibilities generated from the experimenter's intentions (namely the range of nonrejectable parameter values in null hypothesis significance testing, which focuses on the probability of observing the data as seen, or data favoring the alternative hypothesis H1, given that H0 were true);
• Credibility intervals are responsive to the experimenter's prior belief based, for instance, on previous literature findings. Bayesian analysis indicates how much newly gathered data should alter our beliefs. Frequentist confidence intervals do not incorporate prior knowledge.

Group-Sequential Testing
Group-sequential testing and, more generally, adaptive design methodology are well-developed and widely applied in clinical drug development [33-38], but to a much lesser extent in diagnostic research [39-41]. Zou et al. [42] employed the α-spending function approach [43-45] in connection with receiver operating characteristic curve analysis. Gerke et al. [46] analogously exemplified rater variability analysis with post-hoc interim analyses on the standard deviation σ. They supplemented corresponding upper one-sided confidence limits for σ, which can be used for decision making at an interim analysis time point instead of comparing the p-value at interim with the respective benchmark p-value according to the employed α-spending function. In the same way, outer confidence intervals for θ1 and θ2 can be constructed for confidence levels different from 95%, which, in turn, correspond to hypothesis testing of H0: θ1 ≤ −δ or θ2 ≥ δ against H1: θ1 > −δ and θ2 < δ at significance levels smaller than 5% at interim analysis. To this end, Carkeet's MATLAB code in his Supplemental Data S2 is a tool to derive exact confidence intervals for confidence levels different from 95% [6].
Similarly, Zhu and Yu [47] have recently proposed the application of α-spending functions in Bayesian analysis, and Kruschke [27] (pp. 383-392) exemplified various Bayesian stopping strategies based on p-values, Bayes factors, the length of credibility intervals, and ROPEs. Any sequential testing strategy based on hypothesis testing, be it frequentist or Bayesian, may produce biased estimates as a result of stopping after observing extreme values up to an interim analysis time point. Therefore, Kruschke proposed shifting the focus to stopping when sufficient precision is achieved, for instance, in terms of the width of credibility intervals.
Stallard et al. [48] compared Bayesian and frequentist group-sequential tests when data can be summarized by normally distributed test statistics. Despite conceptual differences in frequentist and Bayesian analysis, Bayesian and frequentist group-sequential tests can have identical stopping rules in a single-arm trial or two-arm comparative trial with a prior distribution specified for the treatment difference. For instance, restricting Bayesian critical values at different interim analyses to be equal, O'Brien and Fleming's design [44] corresponds to a Bayesian design with an exceptionally informative negative prior, and Pocock's design [43] is an analogue to a Bayesian design with a non-informative prior.

Nonparametric Alternatives
Whenever the assumption of normally distributed differences does not hold, a nonparametric alternative analysis for the LoA is needed. Bland and Altman [3] proposed two methods. The former comprises calculating the proportion of differences exceeding some predefined benchmark values and comparing this proportion to predefined acceptability limits. The latter represents nonparametric estimation of the 2.5% and 97.5% percentiles or, alternatively, of the 5% and 95% percentiles in small samples. Frey, Petersen, and Gerke [49] compared sample, subsampling, and kernel quantile estimators, as well as other methods for quantile estimation, in small to moderate sample sizes (30 ≤ n ≤ 150). A simple sample quantile estimator (in the form of a weighted average of the observations closest to the target quantile), the Harrell-Davis estimator, and estimators of the Sfakianakis-Verginis type outperformed 10 other quantile estimators in terms of mean coverage for the next observation. These results were later confirmed when targeting the coverage probability of nonparametric LoA, and the three estimators above performed equally well for sample sizes of n ≥ 150 [50].
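The Harrell-Davis estimator is readily available in Python via scipy.stats.mstats.hdquantiles; the sketch below (with simulated skewed differences, not the study data) estimates nonparametric LoA as the 2.5% and 97.5% percentiles:

```python
import numpy as np
from scipy.stats.mstats import hdquantiles

rng = np.random.default_rng(42)
# Simulated right-skewed differences for which normal-theory LoA
# (mean +/- 1.96 SD) would be questionable
diffs = rng.exponential(scale=2.0, size=150) - 2.0

# Harrell-Davis estimates of the 2.5% and 97.5% percentiles,
# i.e., nonparametric Limits of Agreement
loa_lower, loa_upper = hdquantiles(diffs, prob=[0.025, 0.975])
print(float(loa_lower), float(loa_upper))
```

Because the Harrell-Davis estimator is a weighted average of order statistics, its estimates always stay within the range of the observed differences.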

Simplicity versus Interpretability
The inherent simplicity of the classical frequentist LoA has led not only to widespread adoption in the literature but also to heterogeneous reporting of agreement assessments. There are numerous examples in which agreement analysis was equated with reporting sheer point estimates for the LoA, whereas the interpretation of 95% outer confidence limits for the LoA reflects the uncertainty of the estimated values of the LoA. While a frequentist confidence interval represents merely a range of nonrejectable values for null hypothesis significance testing of H0: θ1 ≤ −δ or θ2 ≥ δ against the alternative H1: θ1 > −δ and θ2 < δ, Bayesian analysis allows for direct interpretation of both the posterior probability of the alternative hypothesis H1 and the likelihood of parameter values.
The historical computational burden of Bayesian analysis has in the meantime been resolved; however, the elicitation of prior information and its implementation still require thoughtful engagement. Using prior literature evidence to full capacity is, of course, appealing for efficiency and economic reasons, but requires careful balancing by weakening priors to account for, say, differences in inclusion and exclusion criteria between a previous study and the one that builds on the prior findings. In our example, for instance, the mean difference and the SD were clearly larger in study 2 than in study 1, despite identical experimental set-ups (including the very same raters) in our preclinical laboratory. Even if every effort is taken to rebuild the original study set-up, such inter-study differences may be observed within the very same research institution. The issue of securing comparable study designs, making the employment of prior information from another study reasonable and defensible, becomes much more challenging when combining study data from different research groups. For instance, Dykun and colleagues reported intra- and interrater reproducibility for left ventricle size quantification using non-contrast-enhanced cardiac computed tomography by means of intraclass correlation coefficients but did not give descriptive statistics for these comparisons [54]. Their data could possibly have been useful as prior information for one of our local studies [55], but the necessary information for doing so was simply not available to us. Finally, our worked example was straightforward in terms of using per-subject data only. Incorporating correlated data (e.g., several lesions within the same patient) is well-established in frequentist Bland-Altman analysis [7], but requires more advanced hierarchical modelling in a Bayesian analysis of repeated measurements in method comparison studies [56].
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/stats4040062/s1, Codes S1: Stata codes for frequentist analyses, Data S2: example data in Stata format, Codes S3: R codes for Bayesian analyses.

Institutional Review Board Statement: The preclinical study from which the data for the worked example were taken was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the University of Southern Denmark.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data supporting the reported results can be found as Supplemental Materials.