Bayesian Estimation of Combined Accuracy for Tests with Verification Bias

This presentation will emphasize the estimation of the combined accuracy of two or more tests when verification bias is present. Verification bias occurs when some of the subjects are not subject to the gold standard. The approach is Bayesian where the estimation of test accuracy is based on the posterior distribution of the relevant parameter. Accuracy of two combined binary tests is estimated employing either “believe the positive” or “believe the negative” rule, then the true and false positive fractions for each rule are computed for two tests. In order to perform the analysis, the missing at random assumption is imposed, and an interesting example is provided by estimating the combined accuracy of CT and MRI to diagnose lung cancer. The Bayesian approach is extended to two ordinal tests when verification bias is present, and the accuracy of the combined tests is based on the ROC area of the risk function. An example involving mammography with two readers with extreme verification bias illustrates the estimation of the combined test accuracy for ordinal tests.


Introduction
This article introduces the reader to the methodology of measuring the accuracy of several medical tests that are administered to the patient. Our main focus is on measuring the accuracy of a combination of two or more tests. For example, to diagnose type 2 diabetes, the patient is given a fasting blood glucose test, which is followed by an oral glucose tolerance test. What is the accuracy (true and false positive fractions) of this combination of two tests? Or, in order to diagnose coronary artery disease, the subject's history of chest pain is followed by an exercise stress test. Still another example is the diagnosis of prostate cancer, where a digital rectal exam is followed by measuring PSA (prostate specific antigen). The reader is referred to Johnson, Sandmire, and Klein [1] for a description of additional examples of multiple tests to diagnose a large number of diseases, including heart disease, diabetes, lung cancer, breast cancer, etc.
In many scenarios, it is common practice to administer one or more tests to diagnose a given condition and we will explore two avenues. One case where it is common to administer several tests is in standard medical practice, and the other is an experimental situation where one test is compared to a standard medical test. An example of the latter is that MRI is now being studied as an alternative to standard mammography, as a means to diagnose breast cancer.
There are many studies that assess the accuracy of the combination of two or more tests. Two tests for the diagnosis of a disease measure different aspects or characteristics of the same disease. In the case of diagnostic imaging, two modalities have different qualities (resolution, contrast, and noise); thus, although they are imaging the same scene, the information is not the same from the two sources. When this is the case, the accuracy of the combination of two modalities is of paramount importance. For example, the accuracy of the combination of mammography and scintimammography for suspected breast cancer has been reported by Buscombe, Cwikla, Holloway, and Hilson [2]. Another study for diagnosing breast cancer was performed by Berg, Gutierrez, et al. [3], who measured the accuracy of mammography, clinical examination, ultrasound, and MRI in a preoperative assessment of the disease. The accuracy of each modality and of various combinations of the modalities was measured. When investigating metastasis to the lymph nodes in lung cancer, Van Iverhagen, Brakel, and Heijenbrok et al. [4] measured the accuracy of ultrasound and CT and the combination of the two. Ultrasound conveys different information about metastasis compared to CT, but the combination of the two might provide a more accurate diagnosis than each separately. For an example of the diagnosis of head and neck cancer, Pauleit, Zimmerman, Stoffels et al. [5] used two nuclear medicine modalities, 18F-FET PET and 18F-FDG PET, to assess the extent of the disease and estimated the accuracy of each and of the two combined. On the other hand, Schaffler, Wolf, Schoelinast et al. [6] evaluated pleural abnormalities with CT and 18F-FDG PET and the combination of the two.

Measuring the Combined Accuracy of Two Binary Tests
What is the optimal way to measure the accuracy of a combination of two binary tests? Pepe ([7], p. 207) presents two approaches: (1) the believe the positive rule, or BP, where a subject is scored positive when one or the other of the two tests is scored positive; and (2) the believe the negative rule, or BN, where a subject is scored positive only if both tests are scored positive. Pepe also provides some properties of these rules, namely: (a) the BP rule increases sensitivity relative to each of the two binary tests, but it also increases the false positive fraction (FPF), although to no more than FPF1 + FPF2, where FPF1 and FPF2 denote the false positive fractions of test 1 and test 2 respectively; (b) the BN rule decreases the false positive fraction relative to those of the two tests, but at the same time decreases the sensitivity, which nevertheless remains at or above TPF1 + TPF2 - 1, where TPF1 and TPF2 denote the true positive fractions of the two tests.
Thus, with the BP rule the combined test is scored positive if one or the other or both of the two are scored positive, but, on the other hand, with the BN rule the combined test is scored positive if both tests are positive.
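The two rules and the bounds quoted from Pepe can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: the function name combine_binary and the joint probabilities are hypothetical values chosen only for illustration.

```python
def combine_binary(p11, p10, p01, p00):
    """Positive fractions for two binary tests within one disease class.

    p_ab is the (hypothetical) joint probability P[Y1 = a, Y2 = b] in that class.
    """
    assert abs(p11 + p10 + p01 + p00 - 1.0) < 1e-9
    return {
        "test1": p11 + p10,        # test 1 positive
        "test2": p11 + p01,        # test 2 positive
        "bp": p11 + p10 + p01,     # believe the positive: either test positive
        "bn": p11,                 # believe the negative: both tests positive
    }

# Hypothetical joint distributions among diseased and non-diseased subjects
diseased = combine_binary(0.50, 0.20, 0.15, 0.15)   # gives TPF1, TPF2, TPF(BP), TPF(BN)
healthy = combine_binary(0.05, 0.10, 0.15, 0.70)    # gives FPF1, FPF2, FPF(BP), FPF(BN)

# Property (a): BP raises sensitivity, and FPF(BP) <= FPF1 + FPF2
bp_ok = (diseased["bp"] >= max(diseased["test1"], diseased["test2"])
         and healthy["bp"] <= healthy["test1"] + healthy["test2"] + 1e-12)

# Property (b): BN lowers the FPF, and TPF(BN) >= TPF1 + TPF2 - 1
bn_ok = (healthy["bn"] <= min(healthy["test1"], healthy["test2"])
         and diseased["bn"] >= diseased["test1"] + diseased["test2"] - 1 - 1e-12)
```

Both bounds follow from the inclusion-exclusion identity P[Y1 = 1 or Y2 = 1] = P[Y1 = 1] + P[Y2 = 1] - P[Y1 = 1, Y2 = 1], so they hold for any joint distribution, not only for the numbers used here.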

Verification Bias
Verification bias is present when some of the subjects are not subject to the gold standard, thus, the disease status of some of the patients is not known. Consider an example involving the exercise stress test to diagnose heart disease. Among those that test positive, some will undergo coronary angiography to confirm the diagnosis. On the other hand, among those that test negative, very few will be subject to the gold standard. Using only those cases that are verified will lead to biased estimates of test accuracy both of the true and false positive rates, thus, alternative methods based on the missing at random assumptions will be derived from a Bayesian viewpoint.
When assessing the accuracy of two tests, the design in many cases is paired. When two tests are used to assess the patient's condition, true and false positive fractions will be estimated assuming the BP (believe the positive) and BN (believe the negative) rules. For example, two modalities (e.g., CT and MRI) are imaging the same patients and the two images would be expected to be quite similar. Another case of a paired design is for two readers who are imaging the same set of patients with the same imaging device. One expects the information gained from the two paired sources to be highly correlated, and in the case of two paired readers, agreement between the two is also of interest. A good introduction to statistical methods for estimating test accuracy with verification bias is Zhou et al. ([8], p. 307), who use maximum likelihood estimation for the study layout of Table 1, which is the experimental layout for a paired study when verification bias is present.

Table 1. Two binary scores with verification bias.

                      Y1 = 1              Y1 = 0
                  Y2 = 1   Y2 = 0     Y2 = 1   Y2 = 0
  V = 1, D = 1     s11      s10        s01      s00
  V = 1, D = 0     r11      r10        r01      r00
  V = 0            u11      u10        u01      u00
  Total            m11      m10        m01      m00

The two binary tests are $Y_1$ and $Y_2$, where 1 and 0 designate positive and negative tests respectively. Subjects who are verified for disease are denoted by V = 1, while those who are not verified are denoted by V = 0.
Note that the number of subjects verified under the gold standard, when both tests are positive, is $s_{11} + r_{11}$, among which $s_{11}$ had the disease and $r_{11}$ did not, and the number that were not verified under the gold standard when both tests are positive is $u_{11}$, and so on for the other cells. Also note that the total number of subjects is
$$m = \sum_{i=0}^{1}\sum_{j=0}^{1} m_{ij}, \qquad m_{ij} = s_{ij} + r_{ij} + u_{ij}.$$

Posterior Distribution of Combined Test Accuracy
The following derivation is based on the missing at random (MAR) assumption, namely
$$P[V = 1 \mid Y_1, Y_2, D] = P[V = 1 \mid Y_1, Y_2].$$
Thus, the probability that a subject's disease status is verified depends only on the outcomes of the two tests, and not on other considerations. The derivations below follow to some extent Chapter 10 of Zhou, Obuchowski, and McClish [8].
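Suppose the unknown parameters are defined as follows:
$$\phi_{ij} = P[D = 1 \mid Y_1 = i, Y_2 = j] \qquad (1)$$
and
$$\theta_{ij} = P[Y_1 = i, Y_2 = j] \qquad (2)$$
for i, j = 0, 1. Also let
$$\phi_{i.} = P[D = 1 \mid Y_1 = i]$$
and
$$\phi_{.j} = P[D = 1 \mid Y_2 = j]$$
where i, j = 0, 1. The likelihood for the parameters is
$$L(\phi, \theta) \propto \prod_{i=0}^{1}\prod_{j=0}^{1} \phi_{ij}^{s_{ij}} (1 - \phi_{ij})^{r_{ij}} \theta_{ij}^{m_{ij}}.$$
Assuming an improper prior distribution for the parameters, the posterior distributions are
$$\phi_{ij} \sim \mathrm{Beta}(s_{ij}, r_{ij}) \qquad (6)$$
for i, j = 0, 1, and the $\theta_{ij}$ are distributed Dirichlet with parameter vector $(m_{00}, m_{01}, m_{10}, m_{11})$:
$$(\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11}) \sim \mathrm{Dirichlet}(m_{00}, m_{01}, m_{10}, m_{11}). \qquad (7)$$
The improper prior imposed on the parameters is given by the density
$$g(\phi, \theta) \propto \prod_{i=0}^{1}\prod_{j=0}^{1} \phi_{ij}^{-1} (1 - \phi_{ij})^{-1} \theta_{ij}^{-1}.$$
A period in a subscript denotes summation over the missing index, for example $s_{1.} = s_{11} + s_{10}$ and $r_{1.} = r_{11} + r_{10}$. Note that if a uniform prior is assumed, one should adjust the posterior distributions of the $\phi_{ij}$ and $\theta_{ij}$ by adding one to the beta and Dirichlet hyperparameters given in formulas (6) and (7).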
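The main parameters of interest are the true positive fraction and the false positive fraction for the two tests. For the first test the true positive fraction is $\mathrm{tpf}_1 = P[Y_1 = 1 \mid D = 1]$, which is given by Bayes theorem as
$$\mathrm{tpf}_1 = \frac{\phi_{1.}\theta_{1.}}{\phi_{1.}\theta_{1.} + \phi_{0.}\theta_{0.}} \qquad (8)$$
where $\phi_{i.} = P[D = 1 \mid Y_1 = i]$ and $\theta_{i.} = \theta_{i1} + \theta_{i0}$, so that $\phi_{i.}\theta_{i.} = \phi_{i1}\theta_{i1} + \phi_{i0}\theta_{i0}$. As for test 1, the false positive fraction is given by
$$\mathrm{fpf}_1 = P[Y_1 = 1 \mid D = 0] = \frac{(1 - \phi_{1.})\theta_{1.}}{(1 - \phi_{1.})\theta_{1.} + (1 - \phi_{0.})\theta_{0.}}.$$
With regard to test 2, the true positive fraction is
$$\mathrm{tpf}_2 = P[Y_2 = 1 \mid D = 1] = \frac{\phi_{.1}\theta_{.1}}{\phi_{.1}\theta_{.1} + \phi_{.0}\theta_{.0}}$$
and the false positive fraction is
$$\mathrm{fpf}_2 = P[Y_2 = 1 \mid D = 0] = \frac{(1 - \phi_{.1})\theta_{.1}}{(1 - \phi_{.1})\theta_{.1} + (1 - \phi_{.0})\theta_{.0}}$$
where $\phi_{.j} = P[D = 1 \mid Y_2 = j]$ and $\theta_{.j} = \theta_{1j} + \theta_{0j}$. The main focus of this section is on measuring the accuracy of the combined test in the presence of verification bias of both tests using the BN (believe the negative) and BP (believe the positive) principles.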
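Assume the BP principle is in effect; then the true positive fraction for the combined test is
$$\mathrm{tpf}_{BP} = P[Y_1 = 1 \text{ or } Y_2 = 1 \mid D = 1]$$
while
$$\mathrm{fpf}_{BP} = P[Y_1 = 1 \text{ or } Y_2 = 1 \mid D = 0].$$
Now assume the BN principle is in effect; then the true positive fraction is
$$\mathrm{tpf}_{BN} = P[Y_1 = 1, Y_2 = 1 \mid D = 1]$$
while
$$\mathrm{fpf}_{BN} = P[Y_1 = 1, Y_2 = 1 \mid D = 0].$$
The above four accuracy measures can be expressed as follows:
$$\mathrm{tpf}_{BP} = \frac{\phi_{11}\theta_{11} + \phi_{10}\theta_{10} + \phi_{01}\theta_{01}}{P[D = 1]} \qquad (16)$$
$$\mathrm{fpf}_{BP} = \frac{(1 - \phi_{11})\theta_{11} + (1 - \phi_{10})\theta_{10} + (1 - \phi_{01})\theta_{01}}{1 - P[D = 1]} \qquad (17)$$
for the BP assumption, where
$$P[D = 1] = \phi_{11}\theta_{11} + \phi_{01}\theta_{01} + \phi_{10}\theta_{10} + \phi_{00}\theta_{00}.$$
For the BN assumption,
$$\mathrm{tpf}_{BN} = \frac{\phi_{11}\theta_{11}}{P[D = 1]} \qquad (18)$$
and
$$\mathrm{fpf}_{BN} = \frac{(1 - \phi_{11})\theta_{11}}{1 - P[D = 1]}. \qquad (19)$$
Formulas (16)-(19) measure the combined accuracy of two binary tests assuming MAR and assuming an improper prior. If a uniform prior distribution is assumed for the $\phi_{ij}$ and $\theta_{ij}$, adjust the hyperparameters in formulas (6) and (7) accordingly. For additional information about estimating accuracy with verification bias see Zhou [9,10] and Zhou and Castelluccio [11].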

Example of MRI and CT to Assess Risk of Lung Cancer
Consider the hypothetical results of two correlated binary tests when verification bias is present as given below. The first test $Y_1$ gives the results of a CT determination of lung cancer risk, where a 0 indicates a small risk and a 1 a high risk of lung cancer, while the second test $Y_2$ is a determination of lung cancer risk using MRI. The patients where D = 0 do not have lung cancer. Using BUGS CODE 1, an analysis that determines the accuracy of CT and MRI separately and the combined accuracy is executed with 45,000 observations, with a burn-in of 5,000 and a refresh of 100. The list statement of the code gives the data for this example, assuming an improper prior distribution. The closing statements of the model and the list statement are:

    pd<-ph11*th11+ph10*th10+ph01*th01+ph00*th00
    fpfbp<-((1-ph11)*th11+(1-ph10)*th10+(1-ph01)*th01)/(1-pd)
    # believe the negative, BN
    tpfbn<-ph11*th11/pd
    fpfbn<-(1-ph11)*th11/(1-pd)
    }
    # CT and MRI for lung cancer risk with improper prior
    # for a uniform prior, add a one to the values in the list statement
    list(s00=3,r00=18,s01=9,r01=13,s10=12,r10=9,s11=14,r11=4,
    m00=31,m01=31,m10=29,m11=25)
    # activate initial values from the specification tool with the gen inits button

Note that the above code closely follows the derivation given in formulas (1)-(19). This analysis can be executed by downloading the code from http://medtestacc.blogspot.com. The results of the analysis show that the false positive fractions for the two modalities are fairly high, being 0.286 and 0.38 for CT and MRI respectively, while the true positive fractions are somewhat low at 0.6755 and 0.60 for CT and MRI respectively. It is also observed that the MCMC errors for all parameters are less than 0.0001 and that the posterior distributions of all parameters appear to be symmetric about the posterior mean.
Assuming the BN rule, the true and false positive fractions are estimated as 0.3663 and 0.0880, whereas for the BP rule the corresponding estimates (based on the posterior mean) are 0.9165 and 0.5766. The fact that neither modality is very accurate is reflected in the estimated combined accuracies for the BN and BP rules. At first glance, the BP rule is encouraging in that the true positive fraction is 0.9165, but the false positive fraction is also large at 0.5766. The BN rule gives a low estimate of 0.0880 for the false positive fraction, but also a low estimate of 0.3663 for the true positive fraction. The posterior density of the true positive fraction for the BP rule is depicted below. Which rule the user adopts depends on their personal preference.
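The posterior analysis of this example can be approximated outside WinBUGS by direct Monte Carlo sampling, because under the improper prior the posteriors are the independent beta and Dirichlet distributions described earlier. The Python sketch below is an illustration under that assumption and is not the author's BUGS CODE 1; the function names, the seed, and the number of draws are arbitrary choices. Only the data come from the list statement.

```python
import random
import statistics

# Data from the list statement: verified diseased (s), verified non-diseased (r),
# and cell totals (m); the key "ij" means Y1 = i (CT), Y2 = j (MRI).
s = {"11": 14, "10": 12, "01": 9, "00": 3}
r = {"11": 4, "10": 9, "01": 13, "00": 18}
m = {"11": 25, "10": 29, "01": 31, "00": 31}
cells = ["11", "10", "01", "00"]

def draw_parameters(rng):
    """One posterior draw: phi_ij ~ Beta(s_ij, r_ij), theta ~ Dirichlet(m)."""
    phi = {c: rng.betavariate(s[c], r[c]) for c in cells}
    gam = {c: rng.gammavariate(m[c], 1.0) for c in cells}  # normalized gammas give a Dirichlet draw
    total = sum(gam.values())
    theta = {c: gam[c] / total for c in cells}
    return phi, theta

def accuracy(phi, theta):
    """True and false positive fractions of each test and of the BP and BN rules."""
    dis = {c: phi[c] * theta[c] for c in cells}          # P[Y1 = i, Y2 = j, D = 1]
    non = {c: (1.0 - phi[c]) * theta[c] for c in cells}  # P[Y1 = i, Y2 = j, D = 0]
    pd = sum(dis.values())                               # P[D = 1]
    return {
        "tpf_ct": (dis["11"] + dis["10"]) / pd,
        "fpf_ct": (non["11"] + non["10"]) / (1.0 - pd),
        "tpf_mri": (dis["11"] + dis["01"]) / pd,
        "fpf_mri": (non["11"] + non["01"]) / (1.0 - pd),
        "tpf_bp": (dis["11"] + dis["10"] + dis["01"]) / pd,
        "fpf_bp": (non["11"] + non["10"] + non["01"]) / (1.0 - pd),
        "tpf_bn": dis["11"] / pd,
        "fpf_bn": non["11"] / (1.0 - pd),
    }

rng = random.Random(2024)
draws = [accuracy(*draw_parameters(rng)) for _ in range(20000)]
post_mean = {k: statistics.fmean(d[k] for d in draws) for k in draws[0]}
```

The entries of post_mean should land close to the posterior means reported in the text, up to Monte Carlo error. For a uniform prior, one would add one to every count before sampling.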

Extreme Verification Bias
The reader is referred to Pepe ([7], p. 180) and Broemeling ([12], p. 166) for a description of many studies in which a subject who tests positive is referred to the gold standard (verification of disease status), but a subject who tests negative is not. In such situations, the true and false positive rates are not estimable. Consider the hypothetical example of Table 4, a modification of Table 2 above, where MRI and CT are jointly used to detect lung cancer; the results of CT are given by the first test $Y_1$ and those of MRI by $Y_2$. Note that when the subject is verified with D = 0, the patient does not have lung cancer.

Table 4. CT and MRI for lung cancer risk with extreme verification bias.

                      Y1 = 1              Y1 = 0
                  Y2 = 1   Y2 = 0     Y2 = 1   Y2 = 0
  V = 1, D = 1      14       12          9        0
  V = 1, D = 0       4        9         13        0
  V = 0              0        0          0       21
  Total             18       21         22       21

The results of Table 2 have been modified so that all patients are referred to the gold standard (biopsy) except those for whom both tests are negative. There are 82 patients, of which 61 are verified for disease, namely the 18 who test positive with both CT and MRI, the 21 who test positive with CT but negative with MRI, and lastly the 22 who test positive with MRI and negative with CT.
When both tests are negative, 21 patients are not subject to the gold standard. Since they were not verified, one does not know their disease status (D = 0 or D = 1), thus one does not know the fraction of patients with disease, and the true and false positive fractions cannot be estimated. This is a case where it is not possible to estimate the true and false positive fractions for either test. However, not all is lost because other measures of test accuracies for both tests can be estimated.
Consider the detection probability for CT namely P[Y 1 = 1, D = 1], then the estimated detection probability is 26/82 = 0.317, while the detection probability of MRI is estimated as 23/82 = 0.280. In lieu of the true and false positive fractions, the detection probability and the false referral probability for both tests aid in the estimation of the test accuracy of the combined tests. For example, the false referral probability for CT is P[Y 1 = 1, D = 0] which is estimated as 13/82 = 0.158, and for MRI the false referral probability is estimated as 17/82 = 0.2073.
It is interesting to note that the detection probability of a test is expressed as
$$DP = \rho \, TPF \qquad (20)$$
and the false referral probability as
$$FRP = (1 - \rho) \, FPF \qquad (21)$$
where $\rho = P[D = 1]$ is the probability of disease, and TPF and FPF are the true and false positive fractions respectively. From a given study the probability of disease cannot be estimated; however, the true positive fractions can be compared with the ratio
$$\frac{TPF_1}{TPF_2} = \frac{DP_1}{DP_2} \qquad (22)$$
where $DP_1$ is the detection probability of the first test and $DP_2$ the detection probability of the second test. In a similar way, the false positive fractions can be compared by the ratio
$$\frac{FPF_1}{FPF_2} = \frac{FRP_1}{FRP_2} \qquad (23)$$
where $FRP_1$ is the false referral probability of the first test and $FRP_2$ the false referral probability of the second. Of course, the Bayesian approach will determine the posterior distribution of these quantities. It is interesting to observe that the individual true and false positive fractions cannot be estimated, but that the true and false positive fractions of two tests can be compared for studies with extreme verification bias.
How is the accuracy of the combined tests estimated? Can one employ the BP (believe the positive) and BN (believe the negative) rules to estimate the accuracy of the combined tests? The answer is no. Consider the BP rule and refer to Table 4, where the true positive fraction of the combined tests is measured by $P[Y_1 = 1 \text{ or } Y_2 = 1 \mid D = 1]$; the disease frequency cannot be estimated, so neither can this fraction. On the other hand, the ratio of the true positive fraction for the BP rule relative to the true positive fraction of the BN rule can be measured by
$$\frac{TPF_{BP}}{TPF_{BN}} = \frac{P[Y_1 = 1 \text{ or } Y_2 = 1, D = 1]}{P[Y_1 = 1, Y_2 = 1, D = 1]}. \qquad (24)$$
From Table 4 for the CT-MRI study of lung cancer risk, a naïve estimate of this quantity is 35/14 = 2.5, which implies the true positive fraction for the BP rule is approximately 2.5 times that of the BN rule.
In a similar way the false positive fraction for the BP rule relative to the false positive fraction of the BN rule can be measured as
$$\frac{FPF_{BP}}{FPF_{BN}} = \frac{P[Y_1 = 1 \text{ or } Y_2 = 1, D = 0]}{P[Y_1 = 1, Y_2 = 1, D = 0]}. \qquad (25)$$
Referring to Table 4 provides an estimate of 26/4 = 6.5 for comparing the false positive fractions for the two rules, implying the false positive fraction for the BP rule is six and a half times that of the BN rule.
Extreme verification bias does not provide sufficient information to estimate the usual measures of test accuracy, however, in lieu of those measures, it is possible to assess the accuracy of two binary tests (with extreme verification bias) by: (1) the detection probabilities; (2) the false referral probabilities and (3) the ratio of the true positive fraction for the BP rule relative to that of the BN rule, and (4) the ratio of the false positive fraction of the BP rule relative to that of the BN rule.
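The naïve estimates quoted in this section can be reproduced directly from the verified counts of the CT-MRI example. The short Python sketch below only repeats that arithmetic; the variable names are illustrative and not taken from the original code.

```python
# Verified counts from the CT-MRI example; the key "ij" means Y1 = i (CT), Y2 = j (MRI).
s = {"11": 14, "10": 12, "01": 9}   # verified, D = 1
r = {"11": 4, "10": 9, "01": 13}    # verified, D = 0
n_total = 82                        # all patients, including the 21 unverified

# Detection probabilities P[Y = 1, D = 1] and false referral probabilities P[Y = 1, D = 0]
dp_ct = (s["11"] + s["10"]) / n_total     # 26/82
dp_mri = (s["11"] + s["01"]) / n_total    # 23/82
frp_ct = (r["11"] + r["10"]) / n_total    # 13/82
frp_mri = (r["11"] + r["01"]) / n_total   # 17/82

# CT relative to MRI: the unknown disease probability cancels in the ratios
rel_tpf_ct_mri = dp_ct / dp_mri
rel_fpf_ct_mri = frp_ct / frp_mri

# BP rule relative to BN rule
ratio_tpf_bp_bn = sum(s.values()) / s["11"]   # 35/14
ratio_fpf_bp_bn = sum(r.values()) / r["11"]   # 26/4
```

Because the probability of disease cancels in each ratio, these comparisons remain available even though neither the individual true positive fractions nor the individual false positive fractions are estimable.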

Bayesian Analysis for Extreme Verification Bias
The foundation of the Bayesian analysis is given by formulas (1)-(19), which are now expanded to determine the posterior distributions of the detection probabilities and false referral probabilities of both tests.
With regard to estimating the combined accuracy of the two tests in the presence of extreme verification bias, the posterior distribution of the ratio of the true positive fraction of the BP rule relative to that of the BN rule is determined, as is that of the ratio of the false positive fraction of the BP rule relative to that of the BN rule.
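For the detection probability of the first test,
$$DP_1 = P[Y_1 = 1, D = 1] = \theta_{11}\phi_{11} + \theta_{10}\phi_{10} \qquad (26)$$
where the summation is taken over the missing subscript denoted by a period. Note that the $\theta_{ij}$ and $\phi_{ij}$ are defined by formulas (1) and (2); thus in a similar way it can be shown that
$$DP_2 = P[Y_2 = 1, D = 1] = \theta_{11}\phi_{11} + \theta_{01}\phi_{01}. \qquad (27)$$
As for the false referral probabilities, the one for the first test is
$$FRP_1 = P[Y_1 = 1, D = 0] = \theta_{11}(1 - \phi_{11}) + \theta_{10}(1 - \phi_{10}). \qquad (28)$$
It can also be shown that for the second test,
$$FRP_2 = P[Y_2 = 1, D = 0] = \theta_{11}(1 - \phi_{11}) + \theta_{01}(1 - \phi_{01}). \qquad (29)$$
In terms of the same parameters, the ratio of the true positive fraction of the BP rule relative to that of the BN rule is
$$\frac{TPF_{BP}}{TPF_{BN}} = \frac{\theta_{11}\phi_{11} + \theta_{10}\phi_{10} + \theta_{01}\phi_{01}}{\theta_{11}\phi_{11}} \qquad (30)$$
and the corresponding ratio of the false positive fractions is
$$\frac{FPF_{BP}}{FPF_{BN}} = \frac{\theta_{11}(1 - \phi_{11}) + \theta_{10}(1 - \phi_{10}) + \theta_{01}(1 - \phi_{01})}{\theta_{11}(1 - \phi_{11})}. \qquad (31)$$
Formulas (26)-(31) measure the accuracy of two combined binary tests when extreme verification bias is present.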

Bayesian Analysis for Risk of Lung Cancer
BUGS CODE 1 is amended with the following WinBUGS® code.
The above code corresponds to formulas (26)-(29). Refer to formula (30) for the ratio of the true positive fraction of the BP rule relative to that of the BN rule, thus the following:

    Rtpfbptpfbn<-(th11*ph11+th10*ph10+th01*ph01)/(th11*ph11)

Refer to formula (31) for the ratio of the false positive fraction of the BP rule to that of the BN rule, thus the code is

    Rfpfbpfpfbn<-(th11*(1-ph11)+th10*(1-ph10)+th01*(1-ph01))/(th11*(1-ph11))

The list statement for the data is given by:

    list(s00=1,r00=1,s01=9,r01=13,s10=12,r10=9,s11=14,r11=4,m00=21,m01=22,
    m10=21,m11=18)

which assumes an improper prior density for all parameters. Although the true values of s00 and r00 are zero (when both tests are negative, no patients are referred to the gold standard), I entered a one for each, which does not affect the analysis. The analysis for the CT-MRI determination of lung cancer risk with extreme verification bias is based on the amended BUGS CODE 1, generating 55,000 samples from the posterior distributions of the detection and false referral probabilities of both CT and MRI. Also computed are the posterior distributions of the ratios of the true and false positive fractions of the BP rule relative to those of the BN rule. I used a burn-in of 5,000 observations with a refresh of 100. One notices the skewness of the posterior distribution of the ratio of the false positive fraction of the BP rule relative to that of the BN rule, which is evident from Figure 2 below. I would use the posterior median, 6.902, as a point estimate of this ratio. The other posterior distributions appear to be symmetric about the posterior mean. The detection probabilities of the two tests are similar, but the false referral probability of CT is somewhat less than that of MRI, and the TPF of the BP rule appears to be 2.61 times that of the BN rule. On the other hand, the false positive fraction of the BP rule is 6.9 times that of the BN rule.
Recall that the true and false positive fractions of the two rules are not known, thus, it is difficult to interpret these ratios. The detection probability of a test is somewhat related to the true positive fraction in that one would favor a test with a higher detection probability. Also, of course, one would favor a test with lower false referral probability, thus, the overall conclusion about test comparison is to favor the CT determination of lung cancer risk. As for the accuracy of the combined test, the two ratios one for the true positive fraction and one for the false positive fraction provide the relevant information. Should the accuracy be based on the BP or the BN rule? The BP rule gives a larger true positive fraction, but unfortunately a much larger false positive fraction.
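As a rough cross-check of the extreme verification bias analysis, the posterior distributions of formulas (26)-(31) can be simulated with independent beta and Dirichlet draws. The Python sketch below is an approximation written for illustration and is not the amended BUGS CODE 1; the seed, the number of draws, and all names are assumptions. The cell with both tests negative enters only through its total m00, since its disease probability does not appear in formulas (26)-(31).

```python
import random
import statistics

# Data from the list statement; the key "ij" means Y1 = i (CT), Y2 = j (MRI).
s = {"11": 14, "10": 12, "01": 9}             # verified, D = 1
r = {"11": 4, "10": 9, "01": 13}              # verified, D = 0
m = {"11": 18, "10": 21, "01": 22, "00": 21}  # cell totals

def one_draw(rng):
    """One joint posterior draw of the quantities in formulas (26)-(31)."""
    phi = {c: rng.betavariate(s[c], r[c]) for c in s}
    gam = {c: rng.gammavariate(m[c], 1.0) for c in m}  # normalized gammas give a Dirichlet draw
    total = sum(gam.values())
    theta = {c: gam[c] / total for c in m}
    dis = {c: theta[c] * phi[c] for c in s}            # P[Y1 = i, Y2 = j, D = 1]
    non = {c: theta[c] * (1.0 - phi[c]) for c in s}    # P[Y1 = i, Y2 = j, D = 0]
    return {
        "dp_ct": dis["11"] + dis["10"],
        "dp_mri": dis["11"] + dis["01"],
        "frp_ct": non["11"] + non["10"],
        "frp_mri": non["11"] + non["01"],
        "ratio_tpf": sum(dis.values()) / dis["11"],
        "ratio_fpf": sum(non.values()) / non["11"],
    }

rng = random.Random(7)
draws = [one_draw(rng) for _ in range(20000)]
post_mean = {k: statistics.fmean(d[k] for d in draws) for k in draws[0]}
post_median = {k: statistics.median(d[k] for d in draws) for k in draws[0]}
```

Comparing post_mean["ratio_fpf"] with post_median["ratio_fpf"] shows the right skewness noted in the text, which is the reason the median is the preferred point estimate for that ratio.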

Verification Bias for Two Ordinal Tests
When analyzing the accuracy of combined tests when the tests are ordinal, a somewhat different approach is taken, where the overall accuracy is measured by the area under the ROC curve of the risk score. The risk score is the probability of disease for each patient, usually determined by logistic regression. An informative presentation of the risk score is given by Pepe ([7], p. 271).
Our general methodology is based on inverse probability weighting, which transforms the original table of observations with verification bias (see Table) into an imputed table. Such ideas will be explained in a later section, but the reader is referred to Pepe ([7], p. 172) and Broemeling ([13], p. 279) for additional details of the inverse probability weighting approach for estimating the accuracy of tests with verification bias.
The analysis of assessing the accuracy of two tests is now expanded to include two ordinal tests $T_1$ and $T_2$, where the general layout is given by Table 5. As before, the $s_i$ denote the number of patients for the nine events when D = 1, while the $r_i$ represent the number of cases for the nine events when D = 0.
The total number of observations is the sum of the cell counts over the nine events, including the subjects who were not verified. Each test has three values here, but the general case with more categories is handled in the same way.

Table 5. Two ordinal tests with verification bias.

In what follows, the posterior distribution of the ROC area of the two ordinal tests is developed, followed by an explanation of inverse probability weighting, and lastly the use of the risk score to assess the accuracy of the combined tests is explained.

Posterior Distribution of the ROC Areas for Two Ordinal Tests with Verification Bias
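Recall from formulas (1) and (2) that, for a single ordinal test T with values i = 1, 2, 3,
$$\phi_i = P[D = 1 \mid T = i]$$
and
$$\theta_i = P[T = i]$$
for i = 1, 2, 3; then, assuming an improper prior density,
$$\phi_i \sim \mathrm{Beta}(s_i, r_i)$$
and
$$(\theta_1, \theta_2, \theta_3) \sim \mathrm{Dirichlet}(m_1, m_2, m_3),$$
where $s_i$ and $r_i$ are the numbers of verified subjects with and without disease when T = i, and $m_i$ is the total number of subjects with T = i. The improper prior density used here is the reciprocal of the parameters over the relevant region of the parameter space.
If a uniform prior is appropriate, add one to the hyperparameters of the $\phi_i$ and $\theta_i$ for i = 1, 2, 3.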

BUGS CODE
The code closely follows formulas (36)-(50), and the analysis is executed with 55,000 samples from the posterior distribution of the parameters, with a burn-in of 5,000 and a refresh of 100. The results are reported in Table 7. Reader 1 appears more accurate than reader 2. It should be noted that the gold standard is biopsy of the tissue from the suspected lesion. The MCMC errors are quite small and give one confidence that the simulation is providing accurate estimates of the accuracy.
The main goal of the analysis is to estimate the combined accuracy of the two readers and compare it to the individual estimated ROC areas given in Table 7.

Inverse Probability Weighting
Pepe ([7], p. 171) describes an interesting variation on estimating test accuracy with verification bias by the inverse probability weighting technique; for further information, see Begg and Greenes [14]. Briefly, this method involves constructing an imputed data table from the observed data table with verification bias: each verified count is divided by the estimated probability of verification for its combination of test results. Consider the observed data table (Table 6). Table 6 is replaced by an imputed table, and the basic idea is to compute the ROC area of the risk score from the imputed table. For additional information about estimating the test accuracy of ordinal tests with verification bias using the ROC area, see Gray et al. [15].
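To make the imputation step concrete, the following Python sketch applies inverse probability weighting to the binary CT-MRI counts from the earlier list statement rather than to the ordinal table, which is not reproduced here. It is an illustration under the MAR assumption, and the variable names are hypothetical.

```python
# CT-MRI counts from the first example; the key "ij" means Y1 = i (CT), Y2 = j (MRI).
s = {"11": 14, "10": 12, "01": 9, "00": 3}    # verified, D = 1
r = {"11": 4, "10": 9, "01": 13, "00": 18}    # verified, D = 0
m = {"11": 25, "10": 29, "01": 31, "00": 31}  # all subjects in the cell
cells = ["11", "10", "01", "00"]

# Estimated verification probability per cell, and its inverse as the weight
p_verify = {c: (s[c] + r[c]) / m[c] for c in cells}
weight = {c: 1.0 / p_verify[c] for c in cells}

# Imputed table: every verified subject stands in for `weight` subjects
imputed_d1 = {c: s[c] * weight[c] for c in cells}
imputed_d0 = {c: r[c] * weight[c] for c in cells}
n_d1 = sum(imputed_d1.values())
n_d0 = sum(imputed_d0.values())

# Bias-corrected accuracy of CT from the imputed table
tpf_ct_ipw = (imputed_d1["11"] + imputed_d1["10"]) / n_d1
fpf_ct_ipw = (imputed_d0["11"] + imputed_d0["10"]) / n_d0

# Naive accuracy of CT using the verified subjects only
tpf_ct_naive = (s["11"] + s["10"]) / sum(s.values())
fpf_ct_naive = (r["11"] + r["10"]) / sum(r.values())
```

The imputed table has the same total as the full study (116 subjects here), and the corrected fractions agree with plugging the observed proportions into the formulas for the true and false positive fractions given earlier.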

The Risk Score
The risk score is the probability that D = 1, which is the value assigned to all patients in the study. For example, for each of the 1,768 patients in the melanoma staging study portrayed in Table 8, the probability of disease is assigned and the ROC area computed. The ROC area of the risk score is the combined accuracy of the two ordinal tests with verification bias. The risk score is defined as
$$RS(Y) = P[D = 1 \mid Y] \qquad (51)$$
and has the property that it is a monotone function of the likelihood ratio. Simply stated, the risk score assigns a probability of disease to each study subject. The risk score (51) has the same ROC curve as the likelihood ratio and the same optimal properties. Observe that
$$RS(Y) = \frac{\rho \, LR(Y)}{\rho \, LR(Y) + 1 - \rho}, \qquad LR(Y) = \frac{P[Y \mid D = 1]}{P[Y \mid D = 0]}, \qquad \rho = P[D = 1],$$
which shows that the risk score is a monotone increasing function of the likelihood ratio, which implies that the ROC curve of the risk score is the same as that of the likelihood ratio. For our purposes the risk score will be used to measure the accuracy of combined tests, namely, using the area under the ROC curve of the risk score. Pepe ([7], p. 274) shows the utility of logistic regression for finding the ROC curve of the risk score. The following statements show why.

Suppose the risk score is expressed as
$$RS(Y) = P[D = 1 \mid Y] = g(\lambda, Y)$$
where g is a known function; then: (a) the parameter λ can be estimated, even for retrospective designs in which the sampling depends on D, and (b) the function g is optimal for determining the ROC curve of the risk function.
From a practical point of view, logistic regression can be used to determine the ROC curve of the risk function, but it should be noted that finding a suitable function g can be a challenge. After all, g can be a complicated non-linear function of λ and/or Y, but it would be convenient if g were linear in the test scores Y. In order to estimate the logistic regression function or risk score, Broemeling ([13], p. 474) takes a Bayesian approach.
For the example of Table 8, where two readers are diagnosing breast cancer, the risk score is computed by logistic regression using BUGS CODE 3. In order to accommodate the data, a list statement must be added, which consists of three vectors, each of dimension 1,768, corresponding to the number of patients in the study. The vectors $T_1$ and $T_2$ consist of the values 1, 2, or 3 corresponding to Table 8, while the vector d consists of zeros and ones, where a 1 indicates the patient has melanoma and a 0 indicates no melanoma; thus, the first 1,078 components of d are ones, while the remaining 690 are zeros. After execution of BUGS CODE 3, with 45,000 samples from the posterior distribution, a burn-in of 5,000, and a refresh of 100, the posterior distribution of the coefficients of the logistic regression is given in Table 9. The posterior distributions indicate the coefficients are not zero and are important in determining the risk score of each patient, but the main interest is in the vector theta (of dimension 1,768), where the median of each component is used as the risk score. There were 9 distinct risk scores: 0.086, 0.187, 0.36, 0.336, 0.5542, 0.733, 0.7542, 0.8704, and 0.9426. Therefore, I computed the ROC area using the basic formula from Broemeling ([12], p. 58) for the ROC area of a test with ordinal responses.

BUGS CODE
Using BUGS CODE 4, I computed the ROC area of the risk score, which has a posterior mean of 0.842 and a 95% credible interval of (0.8238, 0.8608). (The ROC area was also computed with a non-parametric technique in SPSS, with the same result as the Bayesian analysis.) Thus, using the risk score, the estimated ROC area is 0.842 compared to a ROC area of 0.78 for the surgeon and 0.63 for the dermatologist; that is, the accuracy of the combined results of the two readers is greater than the individual accuracies. Consider the posterior analysis: based on BUGS CODE 4, the analysis is executed with 55,000 observations, with a burn-in of 5,000 and a refresh of 100. A2 is the probability of a tie, and auc is the ROC area. Small MCMC errors for the parameters are evident, and the posterior distributions of all parameters appear to be symmetric. In the code, explanatory remarks are indicated by #.
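The ROC area of a test with ordinal responses combines the probability that a diseased subject scores higher than a non-diseased subject with half the probability of a tie. The Python sketch below illustrates that computation as a plug-in estimate. It is not BUGS CODE 4: the function name and the example counts are hypothetical, and A1 denotes the probability of a strictly higher score while A2 denotes the probability of a tie.

```python
def ordinal_roc_area(diseased_counts, nondiseased_counts):
    """ROC area for ordinal scores listed from lowest to highest category.

    Returns (auc, a1, a2), where a1 = P[diseased score > non-diseased score],
    a2 = P[tie], and auc = a1 + a2 / 2.
    """
    n1 = sum(diseased_counts)
    n0 = sum(nondiseased_counts)
    higher = 0        # pairs in which the diseased subject has the higher score
    ties = 0          # pairs with equal scores
    below = 0         # non-diseased subjects in strictly lower categories so far
    for d, nd in zip(diseased_counts, nondiseased_counts):
        higher += d * below
        ties += d * nd
        below += nd
    a1 = higher / (n1 * n0)
    a2 = ties / (n1 * n0)
    return a1 + a2 / 2.0, a1, a2

# Hypothetical counts for three ordered score categories
auc, a1, a2 = ordinal_roc_area([10, 20, 70], [60, 30, 10])
```

In a Bayesian version, the counts would be replaced by posterior draws of the category probabilities for diseased and non-diseased subjects, which yields a posterior distribution for the ROC area instead of a single value.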

Comments and Conclusions
This presentation has reviewed the Bayesian methodology available for estimating the test accuracy when two or more tests are used to diagnose disease. The main focus is on tests that are subject to verification bias, that is, when some of the patients are not subject to the gold standard (are not verified for disease status). When verification bias occurs certain techniques are employed to correct for bias. If the usual estimators are calculated only for those patients that are verified for disease, the estimators are biased, thus, this article develops Bayesian procedures that "correct" for bias. By imposing the missing at random assumption, Bayesian estimators of the usual measures of test accuracy are developed. For two binary tests, the true and false positive fractions estimate the test accuracy, while for two ordinal tests the ROC area of the score function estimates the test accuracy of the combined tests. An interesting variation of verification bias is extreme verification bias, which is present when all of the patients that test negative with both tests are not verified for disease. In such a scenario, the true and false positive fractions cannot be estimated. However, other measures of the combined accuracy are available and easily estimated with Bayesian inference. Bayesian inference is illustrated with examples involving the diagnosis of breast and lung cancer.
There is a large literature on the subject of verification bias; for additional recent information, see [16][17][18][19]. Very little has appeared from a Bayesian viewpoint; however, Buzoianu and Kadane [20] present results on adjustment for verification bias. Broemeling [12,13] are the only books that focus on the Bayesian approach to verification bias.