Diagnostics 2011, 1(1), 1-35; doi:10.3390/diagnostics1010001

Review
Bayesian Methods for Medical Test Accuracy
Lyle D. Broemeling
Broemeling & Associates Inc., 1023 Fox Ridge Road, Medical Lake, WA 99022, USA; E-Mail: broemeli2@aol.com
Received: 18 February 2011; in revised form: 12 April 2011 / Accepted: 20 April 2011 /
Published: 5 May 2011

Abstract

Bayesian methods for medical test accuracy are presented, beginning with the basic measures for tests with binary scores: true positive fraction, false positive fraction, positive predictive value, and negative predictive value. The Bayesian approach is taken because of its efficient use of prior information, and the analysis is executed with the Bayesian software package WinBUGS®. The ROC (receiver operating characteristic) curve gives the intrinsic accuracy of medical tests that have ordinal or continuous scores, and the Bayesian approach is illustrated with many examples from cancer and other diseases. Medical tests include X-ray, mammography, ultrasound, computed tomography, magnetic resonance imaging, nuclear medicine and tests based on biomarkers, such as blood glucose values for diabetes. The presentation continues with more specialized methods suitable for measuring the accuracies of clinical studies that have verification bias, and medical tests without a gold standard. Lastly, the review concludes with Bayesian methods for measuring the accuracy of the combination of two or more tests.
Keywords:
Bayesian inference; posterior distribution; prior distribution; ROC curve; verification bias; tests without a gold standard

1. Introduction

This review presents and describes the Bayesian techniques that are available for estimating the accuracy of various medical tests used in the diagnosis and treatment of disease, with a primary focus on cancer. Fundamental measures of test accuracy are first introduced, and include the true positive rate (sensitivity), the false positive rate (1-specificity), the positive and negative predictive values, and the area under the ROC curve. Bayesian approaches are very efficient because they are based on prior information that is readily available from previous related studies.

Estimating the test accuracy when verification bias is present is considered next. Verification bias occurs when not all of the patients are subject to a gold standard. For example, consider mammography, where those patients that test positive are usually referred to the gold standard (pathology), but where those that test negative are usually not referred to pathology. Or consider PSA (prostate specific antigen) testing for prostate cancer, where those that test negative are not usually subject to the gold standard (pathology). When verification bias is present, there are special biostatistical methods that are available for providing unbiased estimates of test accuracy.

There are many cases where a gold standard is not available for estimating test accuracy, but an imperfect reference standard is. For example, there may be two tests for diagnosing a bacterial infection, one a so-called new test and the other the imperfect reference standard, with no gold standard available. The accuracy of the new test can be estimated relative to the imperfect reference standard, but such estimates can be misleading. Fortunately, there are statistical procedures for ‘correcting’ the estimated accuracy of the new test.

This review involves many of the tests used in medical practice for the diagnosis and monitoring of disease. Of the many tests used in medicine, many are based on imaging devices such as X-ray, CT (computed tomography), MRI (magnetic resonance imaging), mammography, nuclear medicine (PET (positron emission tomography) and SPECT (single-photon emission computed tomography) gamma cameras), and ultrasound. Of course, there are many others based on biomarkers, such as PSA (prostate specific antigen) for prostate cancer, CA19-9 and CA125 for pancreatic disease, and blood glucose values for diabetes.

What biostatistical methods will be described in this review? When estimating the accuracy with the fundamental measures such as the true and false positive rates and the positive and negative predictive values, ratios or fractions (sometimes referred to as rates) will be employed. On the other hand, when estimating the area under the ROC curve, more advanced methods will be explained and illustrated with various examples. Special methods have been developed for estimating the accuracy of medical tests when verification bias is present and these will be illustrated with several examples, as will the case when there is no gold standard but an imperfect reference standard is available.

When the study and patient covariates are taken into account, regression methods provide the way to estimate medical test accuracy. For example, in screening for breast cancer, the patient's age and use of hormones have an effect on breast cancer incidence and should be taken into account when estimating the accuracy of mammography.

Another aspect of test accuracy to be described is the role agreement plays in estimating the accuracy of a medical test. There are usually several observers or readers involved in observing the medical test results and each has their own interpretation of the test outcomes. For example, suppose three radiologists are interpreting the same CT image in order to diagnose lung cancer metastasis, then there could be disagreement as to the degree of metastasis of the disease. There are many studies where there are separate estimates of test accuracy corresponding to the several readers of the test results, and this review will present methods for estimating the agreement between the readers.

2. Sources of Information

The author will base the review on two sources of information: textbooks and articles in the statistical and medical literature. Three textbooks are devoted to statistical methods for estimating test accuracy: Broemeling [1], who develops methods based on a Bayesian approach, and Pepe [2] and Zhou, Obuchowski, and McClish [3], who for the most part use non-Bayesian methods such as maximum likelihood. Many of the methods and examples for this review are taken from these books, while others are based on articles in the medical literature, for example in Radiology and the Journal of Pathology.

3. Basic Measures of Test Accuracy

The basic measures of test accuracy are computed from the information in the 2 by 2 table below.

The nij are the numbers of subjects with test score X = i and disease status D = j, where i, j = 0 or 1, D = 0 indicates no disease, and D = 1 indicates disease; θij is the corresponding probability that a patient has test score i and disease status j. Thus n00 is the number of patients without disease and with a negative test score X = 0. The true and false positive fractions TPF and FPF are defined as

$$\mathrm{TPF}(\theta) = \theta_{11} / (\theta_{11} + \theta_{01}) = P(X = 1 \mid D = 1) \tag{1}$$
and
$$\mathrm{FPF}(\theta) = \theta_{10} / (\theta_{00} + \theta_{10}) = P(X = 1 \mid D = 0) \tag{2}$$

Thus, the true positive fraction is the proportion of patients with disease who test positive and the false positive fraction is the proportion of non-diseased individuals who test positive for disease. Note that these two measures of accuracy are defined in terms of the unknown cell probabilities of the above 2 by 2 table. Note each cell probability θij is estimated by the corresponding fraction nij / n, where n is the total number of individuals in the study. Usually the study is designed as follows: the individuals are selected at random from a well-defined population, such that the cell frequencies follow a multinomial distribution, and consequently the variance or standard deviation of the estimator nij / n of θij is known. If one takes a Bayesian approach assuming a uniform prior distribution for the cell probabilities, it is known that the posterior distribution of the cell probabilities is Dirichlet with parameter vector

$$\theta \mid \text{data} \sim \mathrm{Dir}(n_{00} + 1,\ n_{01} + 1,\ n_{10} + 1,\ n_{11} + 1) \tag{3}$$

By data is meant the totality of the cell frequencies of the above table. As an example, consider the study by Weiner et al. [4], examined by Pepe [2]. It is a cohort study of 1465 subjects, each classified by disease status (coronary artery disease (CAD), determined by an angiogram) and by the result of a diagnostic test, the exercise stress test (EST), which is a nuclear medicine procedure. The data can be found in Pepe [2].

What are the sensitivity and specificity of the exercise stress test? The Bayesian uses the posterior distribution (3) of the cell probabilities to estimate the true and false positive fractions (1) and (2) by generating samples from the posterior distribution of (1) and (2). Note (1) and (2) are functions of the cell probabilities, thus the posterior distribution of the true positive fraction is determined by generating samples from the Dirichlet distribution (3), then transforming those samples to samples from the true positive fraction via the formula (1). I used 55,000 samples generated from the Dirichlet distribution (3) then transformed each one via formula (1) to get the 55,000 observations from the true positive fraction.

The software package I used is WinBUGS®, which is an object-oriented language specifically designed for making Bayesian inferences, where the samples from the posterior distribution are generated via Markov chain Monte Carlo (MCMC) techniques. The reader is referred to Woodworth [5] for additional information about such simulation methods.

Now returning to the exercise stress test of Table 1, what are the Bayesian estimates of the true and false positive fractions?

Table 3 reports the WinBUGS output for making inferences about the true and false positive fractions. The mean of the posterior distribution of the TPF is 0.796 and its standard deviation is 0.0125. A 95% credible interval for the TPF is (0.7716, 0.8208), and the median of the posterior distribution is 0.7967, which is close to the posterior mean and implies that the posterior distribution of the TPF is approximately symmetric about its mean. The error column of the above table gives information about the accuracy of using 55,000 observations to estimate the ‘true’ posterior mean. An important aspect of the package is that it generates plots of the various posterior distributions, as for example for the FPF given by:

Note that the density is centered over the posterior mean of 0.2612 and appears to be symmetric about the mean. My impression is that the exercise stress test is accurate if one uses the sensitivity to estimate accuracy, but I am not so sure about the relatively high value for the false positive fraction.
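The sampling scheme described above (draw from the Dirichlet posterior (3), then transform each draw with formulas (1) and (2)) can be sketched outside WinBUGS. The following is a minimal Python illustration using only the standard library, not the WinBUGS program used for the review. The cell counts are an assumption: they are not reproduced in the text above and were chosen to be consistent with the summary values quoted in this section. The helper names `sample_dirichlet`, `posterior_tpf_fpf`, and `summarize` are hypothetical.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def posterior_tpf_fpf(n00, n01, n10, n11, draws=20000, seed=1):
    """Posterior draws of (TPF, FPF) under a uniform prior on the cell probabilities.

    nij is the number of subjects with test score i and disease status j.
    """
    rng = random.Random(seed)
    alpha = [n00 + 1, n01 + 1, n10 + 1, n11 + 1]
    tpf, fpf = [], []
    for _ in range(draws):
        t00, t01, t10, t11 = sample_dirichlet(alpha, rng)
        tpf.append(t11 / (t11 + t01))  # formula (1)
        fpf.append(t10 / (t00 + t10))  # formula (2)
    return tpf, fpf


def summarize(x):
    """Posterior mean, standard deviation, median and 95% credible limits."""
    s = sorted(x)
    n = len(s)
    return {
        "mean": statistics.fmean(s),
        "sd": statistics.stdev(s),
        "2.5%": s[int(0.025 * n)],
        "median": statistics.median(s),
        "97.5%": s[int(0.975 * n)],
    }


# Assumed counts (n00, n01, n10, n11); they are NOT given in the text above.
tpf, fpf = posterior_tpf_fpf(327, 208, 115, 815)
tpf_summary = summarize(tpf)
fpf_summary = summarize(fpf)
print("TPF:", tpf_summary)
print("FPF:", fpf_summary)
```

Because the draws are independent rather than a Markov chain, no burn-in is needed in this sketch; the simulation size can be varied to see its effect on the Monte Carlo error.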

There are other ways to measure the accuracy of the exercise stress test, namely the positive and negative predictive values:

$$\mathrm{PPV}(\theta) = \theta_{11} / (\theta_{10} + \theta_{11}) = P(D = 1 \mid X = 1) \tag{4}$$
and
$$\mathrm{NPV}(\theta) = \theta_{00} / (\theta_{00} + \theta_{01}) = P(D = 0 \mid X = 0) \tag{5}$$

These measures of accuracy are of interest to the patient. The positive predictive value (PPV) is the proportion of patients who test positive that have the disease (as determined by the gold standard), and the negative predictive value (NPV) is the proportion of patients who test negative that do not have the disease.

A casual look at the above table tells me that the exercise stress test is accurate based on the PPV, but not on the NPV. Among those that test negative, approximately 61% do not have the disease! Does this give you confidence in the exercise stress test? It should be noted that rarely is a test perfect, where the TPF, PPV, and NPV are all one and the FPF is zero!

A third group of test accuracy measures consists of the diagnostic likelihood ratios, defined as the positive diagnostic likelihood ratio

$$\mathrm{PDLR}(\theta) = P(X = 1 \mid D = 1) / P(X = 1 \mid D = 0) = [\theta_{11} / (\theta_{11} + \theta_{01})] / [\theta_{10} / (\theta_{10} + \theta_{00})] = \mathrm{TPF}(\theta) / \mathrm{FPF}(\theta) \tag{6}$$
and the negative diagnostic likelihood ratio
$$\mathrm{NDLR}(\theta) = P(X = 0 \mid D = 1) / P(X = 0 \mid D = 0) = [\theta_{01} / (\theta_{11} + \theta_{01})] / [\theta_{00} / (\theta_{10} + \theta_{00})] = \mathrm{FNF}(\theta) / \mathrm{TNF}(\theta) \tag{7}$$

These measures are quite different from the previous ones. The positive diagnostic likelihood ratio is a fraction whose numerator is the TPF and whose denominator is the FPF; larger values indicate a more accurate test, because more accurate tests have a larger TPF and a smaller FPF. For the negative diagnostic likelihood ratio, the numerator is the false negative fraction FNF = 1 − TPF and the denominator is the true negative fraction TNF = 1 − FPF; smaller values are indicative of a more accurate test, because more accurate tests have a smaller FNF and a larger TNF! With regard to the exercise stress test, the Bayesian analysis gives the following results:

Is the PDLR large enough? Recall that the TPF is three times that of the FPF and the FNF is 0.27 that of the TNF. For me the TPF, FPF, FNF, and TNF each gives a separate way to view the accuracy of a medical test and are more informative than the diagnostic likelihood ratios.

The ROC curve is another way to measure the accuracy of a medical test and is appropriate when the test scores are ordinal or continuous. Consider the results of mammography given to 60 women, of which 30 had the disease. This is presented in Zhou et al. ([3], p. 21).

The radiologist assigns a score from 1–5 to each mammogram, where 1 indicates a normal lesion, 2 a benign lesion, 3 a lesion which is probably benign, 4 a suspicious lesion, and 5 a malignant lesion. How would one estimate the accuracy of mammography from this information? When the test results are binary, the observed TPF and FPF are calculated, but here there are 5 possible results for each image. The scores could be converted to binary by designating 4 as the threshold, so that scores 1–3 are negative and 4–5 are positive test results. The TPF is then estimated as tpf = 23/30 and the specificity (1-FPF) as (1-fpf) = 22/30, since 8 of the 30 women without cancer had a score of at least 4. Another approach would be to use each test result as a threshold and calculate the tpf and fpf, which are depicted in Table 6.

Of the 30 diseased, 30 had a score of at least 1, while 23 had a score of at least 4. On the other hand, of the 30 without cancer, 30 had a score of at least 1, and 8 had a score of at least 4, etc. Figure 2 is a plot of the observed true and false positive values of Table 6. What does this graph tell us about the accuracy of mammography?

The area under the ROC curve gives the intrinsic accuracy of a diagnostic test and can be interpreted in several ways: as the average sensitivity for all values of specificity, as the average specificity for all values of sensitivity, or as the probability that the diagnostic score of a diseased patient is more of an indication of disease than the score of a patient without the disease or condition. The problem is in determining the area under the curve. For the graph above, there are five points corresponding to the five threshold values.
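To make the threshold construction concrete, the observed ROC points and the area obtained by joining them with straight lines can be computed as in the following Python sketch. The per-score counts are an assumption in the sense that the table is not reproduced here; they are the counts implied by the Dirichlet posterior parameters quoted for this example under a uniform prior, and they agree with the totals stated above (23 of 30 diseased and 8 of 30 non-diseased with a score of at least 4). The function names are hypothetical.

```python
# Counts for scores 1..5 (assumed, see the lead-in).
diseased = [1, 0, 6, 11, 12]      # women with breast cancer
non_diseased = [9, 2, 11, 8, 0]   # women without breast cancer


def roc_points(dis, non):
    """Return (threshold, fpf, tpf) for the rule 'positive if score >= threshold'."""
    n_d, n_n = sum(dis), sum(non)
    points = []
    for k in range(len(dis)):
        tpf = sum(dis[k:]) / n_d
        fpf = sum(non[k:]) / n_n
        points.append((k + 1, fpf, tpf))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the ROC points and (0, 0)."""
    xy = sorted([(f, t) for _, f, t in points] + [(0.0, 0.0)])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(xy, xy[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


points = roc_points(diseased, non_diseased)
auc_empirical = trapezoid_auc(points)
for threshold, fpf, tpf in points:
    print(f"threshold {threshold}: fpf = {fpf:.3f}, tpf = {tpf:.3f}")
print(f"empirical area = {auc_empirical:.4f}")
```

This empirical area need not coincide with a Bayesian posterior mean for the same data: with only 30 subjects per group, a uniform prior over five score categories adds a noticeable amount of prior weight to each cell.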

In the case of ordinal data, the area under the curve (AUC) is determined by a linear interpolation of the points on the graph (including (0,0) and (1,1)), and the area has the following interpretation:

$$\mathrm{AUC} = P(Y > X) + (1/2)\, P(Y = X) \tag{8}$$

See the description of Pepe ([2], p. 92), where it is assumed that one patient is selected at random from the population of diseased patients, with a diagnostic score of Y and another patient, with a score of X, is selected from the population of non-diseased patients. Note that the AUC depends on the parameters of the model. Let us return to the mammography example and estimate the area under the curve via a Bayesian method.

For the mammography example, the area is defined as

$$\mathrm{AUC}(\theta, \phi) = P(Y > X \mid \theta, \phi) + (1/2)\, P(Y = X \mid \theta, \phi) \tag{9}$$
where Y (= 1,2,3,4,5) is the diagnostic score for a person with breast cancer and X (= 1,2,3,4,5) for a person without. It can be shown that
$$\mathrm{AUC}(\theta, \phi) = \sum_{i=2}^{5} \sum_{j=1}^{i-1} \theta_i \phi_j + (1/2) \sum_{i=1}^{5} \theta_i \phi_i \tag{10}$$

It is assumed that Y and X are independent, given the parameters, and that P(Y = i) = θi and P(X = j) = ϕj, i, j = 1,2,3,4,5. AUC is a parameter that depends on the parameters θ and ϕ. Assuming a uniform prior for the parameters, their posterior distributions are θ | data ~ Dir(2,1,7,12,13) and, independently, ϕ | data ~ Dir(10,3,12,9,1); see Table 5.

Samples from the posterior distribution of the AUC are generated by sampling from the posterior distributions of θ and ϕ. This is accomplished with WinBUGS, where 55,000 observations are generated from the posterior distribution of all the parameters.

Notice that mammography gives fair to good accuracy based on the ROC area, which has a posterior mean of 0.7811 (posterior standard deviation 0.0514) and a 95% credible interval of (0.6702, 0.8709). The Bayesian estimate of the ROC area is similar to the Zhou et al. ([3], p. 30) estimate. The MCMC error for the parameter based on 50,000 observations is less than 0.001, but the reader should vary the simulation sample size to see its effect on the MCMC error and posterior mean. The parameter A1 is P[Y > X] and is estimated as 0.688 (0.0635), and the probability of a tie, P[Y = X], given by A2, is estimated as 0.1861 (0.0307).
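The posterior simulation for the ordinal ROC area can be sketched in a few lines of Python (standard library only), as an illustration rather than the WinBUGS program actually used. The sketch draws θ and ϕ from the two Dirichlet posteriors stated above and evaluates P(Y > X) + (1/2) P(Y = X) for each draw; the function names, the seed, and the simulation size of 20,000 are assumptions made for the illustration.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def ordinal_auc(theta, phi):
    """Return (AUC, A1, A2) with A1 = P(Y > X), A2 = P(Y = X), AUC = A1 + A2 / 2."""
    a1 = sum(theta[i] * sum(phi[:i]) for i in range(1, len(theta)))
    a2 = sum(t * p for t, p in zip(theta, phi))
    return a1 + 0.5 * a2, a1, a2


rng = random.Random(2011)
alpha_theta = [2, 1, 7, 12, 13]   # posterior for the diseased scores
alpha_phi = [10, 3, 12, 9, 1]     # posterior for the non-diseased scores

draws = [
    ordinal_auc(sample_dirichlet(alpha_theta, rng), sample_dirichlet(alpha_phi, rng))
    for _ in range(20000)
]
auc = sorted(d[0] for d in draws)
auc_mean = statistics.fmean(auc)
a1_mean = statistics.fmean(d[1] for d in draws)
a2_mean = statistics.fmean(d[2] for d in draws)
auc_interval = (auc[int(0.025 * len(auc))], auc[int(0.975 * len(auc))])

print(f"AUC mean = {auc_mean:.4f}, 95% interval = {auc_interval}")
print(f"A1 mean = {a1_mean:.4f}, A2 mean = {a2_mean:.4f}")
```

Since θ and ϕ are independent a posteriori and the area is linear in each, the posterior mean of the area equals the area evaluated at the two posterior mean vectors, which gives a quick check on the simulation.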

See Broemeling ([1], p. 82) and Zhou ([3], p. 134) with an example from mammography. In mammography the mammogram is partitioned into five areas of interest and the radiologist assigns a score from say 1 to 5 (which indicates the degree of malignancy) as in the above example of mammography in Table 5, and one would expect the scores to be correlated between the five areas of interest, which is taken into account by the Bayesian approach.

With regard to continuous test scores, Bayesian estimators of the ROC area are easily determined by the WinBUGS code of O'Malley et al. [6]. The area under the ROC curve gives an intrinsic value to the accuracy of a diagnostic test and has a long history beginning in signal detection theory. See Egan [7] for the early use of the ROC curve in signal detection theory. Also, the books by Pepe [2] and Zhou et al. [3] provide the history as well as the latest statistical methods (non Bayesian) for using ROC curves in diagnostic medicine. The ROC area is generally accepted as the way to measure diagnostic accuracy in radiology.

Let X be a quantitative variable and r a threshold value, and consider the test positive when X ≥ r, otherwise negative, then the ROC curve is the set of all points

$$\mathrm{ROC}(\cdot) = \{ [\mathrm{FPF}(r), \mathrm{TPF}(r)],\ r \text{ any real number} \} = \{ [t, \mathrm{ROC}(t)],\ t \in (0, 1) \} \tag{11}$$
where t = FPF(r), that is, r is the threshold corresponding to t. As r becomes large, FPF(r) and TPF(r) tend to zero, while if r becomes small, FPF(r) and TPF(r) tend to 1, thus the ROC curve passes through (0,0) and (1,1). If the area under the curve is 1, the test is discriminating perfectly between the diseased and non-diseased groups, while if the area is 0.5, the test cannot discriminate between the two groups.

Pepe ([2], ch. 4) presents several useful properties of the ROC curve, namely: (1) the invariance of the ROC curve under monotone increasing transformations of X, (2) the interpretation of the ROC area for continuous variables as the probability that the score of a randomly selected diseased patient exceeds that of a randomly selected non-diseased patient, and (3) a formula for the AUC when X is normally distributed. The Bayesian approach to estimating the ROC area is based on

$$\mathrm{AUC} = \Phi\left[ a / \sqrt{1 + b^2} \right] \tag{12}$$
where X is normally distributed,
$$a = (\mu_D - \mu_{\bar{D}}) / \sigma_D \tag{13}$$
and
$$b = \sigma_{\bar{D}} / \sigma_D \tag{14}$$

The mean and standard deviation of X for the diseased population are μD and σD respectively, while μD̄ and σD̄ are the mean and standard deviation of X for the non-diseased. Φ is the cumulative distribution function of the standard normal distribution. With these definitions, the argument of Φ in (12) equals (μD − μD̄) divided by the square root of σD² + σD̄². Formula (12) is the binormal assumption and is cited by many authors, including Pepe [2], who presents a good discussion of its use. Note that the ROC area AUC depends on the unknown parameters of the model.

Bayesian methods for estimating the ROC area with continuous data will be illustrated by referring to a hypothetical example of diabetes, which involves 59 subjects with diabetes and 19 without, where those with diabetes have a mean blood glucose value of 123.34 mg/dl and those without have a mean value of 107.54 mg/dl. The corresponding standard deviations are 6.76 mg/dl for those with diabetes and 9.09 mg/dl for those without the disease, and the actual values from the study are given in the first list statement of the WinBUGS program appearing below. The y vector of the first list statement contains the blood glucose values, where the first 59 entries correspond to diabetic patients and the remaining 19 to non-diabetic patients. Note that the first 59 entries of the d vector are 1, designating a diabetic patient, while the 19 remaining entries are zero. The O'Malley et al. [6] approach assumes binormality, where the blood glucose values for both the diabetic and non-diabetic patients are assumed to be normally distributed. I have inserted comments about the WinBUGS code designated by a # symbol.

WinBUGS Code for Diabetes Example

# Calculates posterior distribution of model parameters and the area under curve. y = test
# Based on O'Malley et al. [6] regression method.
model
{
# likelihood function
  for(i in 1:N) {
# The following statement is the regression of y on the disease vector d.
# precy[1] is the precision for non-diabetic (d = 0), precy[2] for diabetic (d = 1) patients
    y[i] ~ dnorm(mu[i], precy[d[i]+1])
#   yt[i] <- log(y[i])  # logarithmic transformation
# The beta vector are the regression coefficients
    mu[i] <- beta[1] + beta[2] * d[i]
  }
# prior distributions - non-informative prior; similarly for informative priors
  for(i in 1:P) {
    beta[i] ~ dnorm(0, 0.000001)
  }
  for(i in 1:K) {
    precy[i] ~ dgamma(0.001, 0.001)
    vary[i] <- 1.0/precy[i]
  }
# calculates area under the curve
# la1 = mean difference divided by the standard deviation of the non-diabetic group
  la1 <- beta[2]/sqrt(vary[1])  # ROC curve parameters
# la2 = ratio of the diabetic variance to the non-diabetic variance
  la2 <- vary[2]/vary[1]
# auc is the area under the ROC curve
  auc <- phi(la1/sqrt(1+la2))
}
# Diabetes data: 59 diabetic patients (d = 1) followed by 19 non-diabetic patients (d = 0)
list(K = 2, P = 2, N = 78,
y = c(123,129,115,131,119,111,129,127,118,111,
131,118,126,130,122,112,122,128,123,119,132,118,126,136,118,122,
119,117,129,120,125,115,131,123,130,113,128,138,119,118,124,127,
139,120,122,120,114,114,122,127,123,118,131,130,139,125,135,121,124,
109,106,100,88,106,108,110,111,112,94,122,110,113,106,114,101,99,128,106),
d = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
# the initial values for the simulation
list(beta = c(0,0), precy = c(1,1))

There are 78 patients, of which 19 do not have diabetes. The primary parameters are the area under the curve, auc, and the regression coefficients. Based on the above code, 75,000 observations are generated from the posterior distribution for the area and regression parameters.

The estimated area is 0.9082 which implies that the blood glucose test had good accuracy and the area given is almost identical to that given by the basic formula (12). A plot of the posterior density is shown in Figure 3. The second regression coefficient is estimated as 16.14, which implies that the group effect is strong on the mean blood glucose values, which is one reason why the ROC area is as high as it is. Also, note the variation in the MCMC errors of estimation.

The plot indicates a slight asymmetry which is also implied by comparing the posterior median with the posterior mean.
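As a rough cross-check on the posterior analysis, the binormal area can be evaluated directly at the sample means and standard deviations quoted for the diabetes example. The Python sketch below uses the form Φ[(μD − μD̄)/√(σD² + σD̄²)], which is what the binormal formula (12) reduces to; it is a plug-in calculation, not a Bayesian estimate, and the function name `binormal_auc` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def binormal_auc(mu_d, sd_d, mu_n, sd_n):
    """Binormal ROC area: Phi((mu_d - mu_n) / sqrt(sd_d^2 + sd_n^2))."""
    return NormalDist().cdf((mu_d - mu_n) / sqrt(sd_d ** 2 + sd_n ** 2))


# Summary statistics quoted in the text (mg/dl): diabetic, then non-diabetic.
auc_plugin = binormal_auc(123.34, 6.76, 107.54, 9.09)
print(f"plug-in binormal AUC = {auc_plugin:.4f}")
```

The plug-in value ignores the uncertainty in the means and variances, so a posterior mean computed by simulation will generally differ from it somewhat.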

The review continues by considering two medical tests that are each applied to all patients; a good example is the following hypothetical study of CT and MRI imaging of subjects for lung cancer. In order to compare the two, consider the following two tables, the first for lung cancer patients and the other for those without lung cancer. The example is employed to illustrate the Bayesian estimation of the basic measures of test accuracy and to compare the two modalities with regard to the true positive and false positive fractions. There are 995 subjects with the disease and 435 without lung cancer, and the important question is which modality, MRI or CT, is more accurate and by how much? Note that both modalities are imaging the same subjects, and one would expect the MRI and CT test scores to be correlated!

The Bayesian analysis will consist of finding the posterior distribution of the true and false positive fractions of the two modalities and comparing them on the basis of the ratios of the two basic measures. Let θij be the probability that a lung cancer patient has a CT score of i and an MRI score of j, where i, j = 0,1, where 0 indicates a negative outcome and 1 a positive. In a similar manner, let ϕij be the corresponding probability for a non-diseased subject.

Assuming a uniform prior distribution for θ = (θ00, θ01, θ10, θ11) and ϕ = (ϕ00, ϕ01, ϕ10, ϕ11), their joint posterior distribution is Dirichlet with parameter (23,192,31,753;149,168,50,72). The analysis is executed with 55,000 observations generated from the joint posterior distribution of the cell probabilities θij and gives the following results:

Note that fpfct is the false positive fraction for CT, while rfpf (ct/mri) is the ratio of the false positive fraction of CT to that of MRI. The 95% credible intervals for the two ratios do not include 1, implying that the two modalities have different accuracies for diagnosing lung cancer. With regard to the true positive fraction, MRI is more accurate, but CT has the smaller false positive fraction; in fact, the false positive fraction for MRI is quite large, with a posterior median of 0.5469. Which modality would you use? I would use both.
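The comparison of the two modalities can be sketched in Python as follows. The Dirichlet parameters are those stated above, taken in the order (θ00, θ01, θ10, θ11) and (ϕ00, ϕ01, ϕ10, ϕ11), with the first index the CT score and the second the MRI score; that ordering, the variable names, and the simulation size are assumptions of this illustration. Sampling the joint cell probabilities, rather than each modality separately, is what carries the correlation between the paired CT and MRI scores into the ratios.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def interval(x):
    """Return (2.5% point, median, 97.5% point) of a list of draws."""
    s = sorted(x)
    n = len(s)
    return s[int(0.025 * n)], statistics.median(s), s[int(0.975 * n)]


rng = random.Random(3)
alpha_theta = [23, 192, 31, 753]   # lung cancer patients
alpha_phi = [149, 168, 50, 72]     # subjects without lung cancer

out = {"tpf_ct": [], "tpf_mri": [], "fpf_ct": [], "fpf_mri": [], "rtpf": [], "rfpf": []}
for _ in range(20000):
    t00, t01, t10, t11 = sample_dirichlet(alpha_theta, rng)
    p00, p01, p10, p11 = sample_dirichlet(alpha_phi, rng)
    tpf_ct, tpf_mri = t10 + t11, t01 + t11   # CT positive; MRI positive
    fpf_ct, fpf_mri = p10 + p11, p01 + p11
    out["tpf_ct"].append(tpf_ct)
    out["tpf_mri"].append(tpf_mri)
    out["fpf_ct"].append(fpf_ct)
    out["fpf_mri"].append(fpf_mri)
    out["rtpf"].append(tpf_ct / tpf_mri)     # ratio CT / MRI
    out["rfpf"].append(fpf_ct / fpf_mri)     # ratio CT / MRI

summary = {name: interval(values) for name, values in out.items()}
for name, (lo, med, hi) in summary.items():
    print(f"{name}: median = {med:.4f}, 95% interval = ({lo:.4f}, {hi:.4f})")
```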

The above approach can be extended to comparing the ROC areas of two modalities and more information can be found in Broemeling ([1], p. 84).

4. Verification Bias

Up to this point, interest has been confined to standard studies of medical test accuracy, but now attention will be focused on specialized methods for measuring accuracy. In a standard study, each subject will have been subjected to the gold standard where the disease status is known, but there are many studies where this is not possible. For example with the exercise stress test, those that test positive will most likely be referred to the gold standard (coronary angiography), however those that test negative will not, unless there are other indicators that point to disease. Actually, verification bias is present in many medical test accuracy studies; however, often the investigator is unaware that bias is present. According to Zhou et al. [3], Greenes and Begg [8] reviewed 145 investigations that took place over the period 1976–1980 and found that 26% had verification bias that was not recognized by the authors. In addition, Bates, Margolis, and Evans [9] reported that at least 1/3 of 54 pediatric studies had unrecognized verification bias. There are many more such studies, including those reported by Philbrick, Horwitz, and Feinstein [10] who found that of 33 diagnostic studies for coronary artery disease, 31 had verification bias. In a major review of verification bias, that reviewed 112 studies in major medical journals, Reid, Lachs, and Feinstein [11] reported finding that 54% had verification bias!

This section presents Bayesian methods for estimating test accuracy when only some of those who test positive or negative are referred to the gold standard.

Consider the following table for one binary test Y = 0,1 where verification bias is present.

V = 1 indicates the patient is verified and the disease status is known, and V = 0 indicates a patient has not been verified; thus, there are u1 individuals who are not verified when Y = 1. The total number of patients in the study is m1 + m0, while the number who tested positive and had the disease is s1. If the test accuracy is based on only the verified patients, the estimates are misleading. Fortunately, there are statistical methods for correcting these misleading estimates of test accuracy. In order to implement these procedures, the missing at random (MAR) assumption is imposed, which entails assuming that the decision to verify the disease status depends only on the result Y of the diagnostic test and not on other factors related to the disease status. That is to say:

$$P[V = 1 \mid D, Y] = P[V = 1 \mid Y] \tag{15}$$

Our approach is likelihood based, where the likelihood function is based on the conditional distribution of the disease status D = 1, given Y = 0 or 1, and on the marginal distribution of Y. The probability that Y = 1, given D = 1 is then found by Bayes theorem. The derivation of the likelihood and relevant posterior distributions is as follows:

Let

$$\phi_i = P[D = 1 \mid Y = i] \tag{16}$$
and
$$\theta_i = P[Y = i] \tag{17}$$
where i = 0,1, then the likelihood function for the parameters is
$$L(\theta, \phi) \propto \phi_1^{s_1} (1 - \phi_1)^{r_1} \phi_0^{s_0} (1 - \phi_0)^{r_0} \theta_1^{m_1} \theta_0^{m_0} \tag{18}$$
where all parameters are between zero and one and θ0 + θ1 = 1. With a uniform prior for all parameters, the posterior distribution of the parameters is as follows:
$$\phi_i \sim \mathrm{beta}(s_i + 1,\ r_i + 1) \tag{19}$$
for i = 0,1, and (θ0, θ1) has a Dirichlet with parameters (m0 + 1,m1 +1).

The approach to correcting for bias is to use Bayes theorem to compute

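$$P[Y = 1 \mid D = 1] = P[D = 1 \mid Y = 1]\, P[Y = 1] / P[D = 1] \tag{20}$$
where
$$P[D = 1] = \phi_1 \theta_1 + \phi_0 \theta_0 \tag{21}$$
Let
$$\alpha_i = \phi_i \theta_i / (\phi_1 \theta_1 + \phi_0 \theta_0) \tag{22}$$
then α1 is the sensitivity of the test. On the other hand, let
$$\beta_1 = (1 - \phi_1) \theta_1 / (1 - \phi_1 \theta_1 - \phi_0 \theta_0) \tag{23}$$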
then β1 is the false positive fraction, that is, the probability that Y = 1, given D = 0.

Once the posterior distribution of the parameters is determined, the posterior distribution of the true and false positive fractions is also determined.

A good example of verification bias is the study of Drum and Christacopoulos [12], which is a hepatic scintigraphy test for liver disease.

This test had two results Y = 0 or 1, where a 1 indicates a positive result for disease. Note that the total number of subjects is 670, with 474 who tested positive, and among those, 150 were not verified for the disease.

Among those who tested negative, 79 were examined by the gold standard, with 31 of those having the disease. The estimated sensitivity based on the verified patients is 298/329 = 0.905, and the estimated false positive rate is 26/74 = 0.35. What are the corrected estimates and how do they differ from these? The Bayesian analysis assumes a uniform prior for the parameters of the likelihood function and is executed with 50,000 observations for the MCMC simulation.

Note the reasonably good accuracy of the scintigraphy test for liver disease, with a corrected sensitivity of 0.83 and a corrected false positive fraction of 0.2139. Recall that the naïve estimates, based on the verified cases only, are 0.905 for the sensitivity and 0.35 for the false positive fraction; thus, the corrected true and false positive rates are smaller than those calculated from the verified cases only. In general, if those that test positive are more likely to be verified than those that test negative, the naïve true and false positive fractions are larger than the corresponding unbiased estimates. See Pepe ([2], p. 169) for further details.
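The correction can be illustrated with a short Python sketch of the posterior simulation under the MAR assumption: ϕ1 and ϕ0 are drawn from their beta posteriors, (θ0, θ1) from its Dirichlet posterior (a beta distribution in the two-cell case), and each draw is transformed to the sensitivity and the false positive fraction by Bayes theorem. The counts are read from the prose above; r1 = 26, r0 = 48 and m0 = 196 are inferred by arithmetic from the quoted fractions and totals, so they should be treated as assumptions, and the output is illustrative only and need not reproduce the posterior summaries reported in the table.

```python
import random
import statistics

# Counts inferred from the text (assumptions, see the lead-in).
s1, r1, m1 = 298, 26, 474   # test positive: verified diseased, verified non-diseased, total
s0, r0, m0 = 31, 48, 196    # test negative: verified diseased, verified non-diseased, total

rng = random.Random(12)
tpf_draws, fpf_draws = [], []
for _ in range(20000):
    phi1 = rng.betavariate(s1 + 1, r1 + 1)     # P[D = 1 | Y = 1]
    phi0 = rng.betavariate(s0 + 1, r0 + 1)     # P[D = 1 | Y = 0]
    theta1 = rng.betavariate(m1 + 1, m0 + 1)   # P[Y = 1]
    theta0 = 1.0 - theta1
    p_disease = phi1 * theta1 + phi0 * theta0  # P[D = 1]
    tpf_draws.append(phi1 * theta1 / p_disease)                  # P[Y = 1 | D = 1]
    fpf_draws.append((1.0 - phi1) * theta1 / (1.0 - p_disease))  # P[Y = 1 | D = 0]

tpf_mean = statistics.fmean(tpf_draws)
fpf_mean = statistics.fmean(fpf_draws)
naive_tpf = s1 / (s1 + s0)   # verified cases only
naive_fpf = r1 / (r1 + r0)   # verified cases only

print(f"corrected TPF (posterior mean) = {tpf_mean:.4f}, naive = {naive_tpf:.4f}")
print(f"corrected FPF (posterior mean) = {fpf_mean:.4f}, naive = {naive_fpf:.4f}")
```

The sketch shows the direction of the bias: because a larger share of the positive tests is verified, both naïve fractions exceed their corrected counterparts.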

The above approach that produces unbiased estimators for the TPF and FPF, is easily extended to two paired binary tests, but will not be presented here; instead the approach is generalized to estimating the area under the ROC curve for medical tests with ordinal test scores. Consider the typical layout for such a test Y with possible values 1, 2,…, k reported below with familiar notation as:

If a uniform prior distribution is deemed appropriate, the posterior distribution of the ϕi is beta with parameters si+1 and ri+1 and that for the θi is Dirichlet with parameter (m1 + 1, m2 +1, …, mk +1).

In order to compute the area under the ROC, one must compute P[Y = i∣D = 1] and P[Y = i∣D = 0] for all i = 1, 2, …, k, where the first component is represented by Bayes theorem as

$$P[Y = i \mid D = 1] = \frac{P[D = 1 \mid Y = i]\, P[Y = i]}{P[D = 1]} = \frac{\phi_i \theta_i}{P[D = 1]}$$
where
$$P[D = 1] = \sum_{i=1}^{k} \phi_i \theta_i$$

On the other hand, the second component is computed as

$$P[Y = i \mid D = 0] = \frac{(1 - \phi_i)\, \theta_i}{P[D = 0]}$$
where P[D = 0] = 1 − P[D = 1].
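As a small illustration of the two Bayes-theorem steps above, the following Python sketch converts a set of ϕi = P[D = 1∣Y = i] and θi = P[Y = i] into the score distributions for diseased and non-diseased subjects. The numerical values are hypothetical and the function name is illustrative; in a posterior analysis the same function would be applied to each joint draw of the ϕi and θi.

```python
def score_distributions(phi, theta):
    """Return (P[Y=i | D=1], P[Y=i | D=0]) for i = 1..k.

    phi[i]   = P[D=1 | Y=i+1]
    theta[i] = P[Y=i+1], with the theta values summing to 1
    """
    p_d1 = sum(p * t for p, t in zip(phi, theta))          # P[D=1]
    p_d0 = 1.0 - p_d1                                      # P[D=0]
    given_d1 = [p * t / p_d1 for p, t in zip(phi, theta)]
    given_d0 = [(1 - p) * t / p_d0 for p, t in zip(phi, theta)]
    return given_d1, given_d0


# Hypothetical values for a five-point ordinal score.
phi = [0.05, 0.10, 0.30, 0.60, 0.90]
theta = [0.40, 0.25, 0.15, 0.12, 0.08]
given_d1, given_d0 = score_distributions(phi, theta)
print([round(v, 3) for v in given_d1])
print([round(v, 3) for v in given_d0])
```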

We are now in a position to compute the area under the ROC curve.

Let

$$\alpha_i = P[Y = i \mid D = 1]$$
and
$$\beta_i = P[Y = i \mid D = 0]$$
for i = 1, 2, …, k, then the area under the ROC is given by
$$A = A_1 + A_2/2$$
where
$$A_1 = \alpha_2 \beta_1 + \alpha_3 (\beta_1 + \beta_2) + \cdots + \alpha_k (\beta_1 + \beta_2 + \cdots + \beta_{k-1})$$
and
$$A_2 = \sum_{i=1}^{k} \alpha_i \beta_i$$

Formula (29) for the ROC area is given in Broemeling ([1], p. 72).
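The ROC area formula can be written as a short function. The sketch below takes the two score distributions (the probability of each score among diseased and among non-diseased subjects) and returns A = A1 + A2/2; the function name is illustrative, and the two checks at the end are limiting cases rather than data from the study.

```python
def ordinal_auc(alpha, beta):
    """Area under the ROC curve for an ordinal test, A = A1 + A2 / 2.

    alpha[i] = P[Y=i+1 | D=1], beta[i] = P[Y=i+1 | D=0].
    """
    a1 = 0.0
    cum_beta = 0.0                      # beta_1 + ... + beta_{i-1}
    for a, b in zip(alpha, beta):
        a1 += a * cum_beta              # diseased score strictly higher
        cum_beta += b
    a2 = sum(a * b for a, b in zip(alpha, beta))   # probability of a tie
    return a1 + a2 / 2


print(ordinal_auc([0.5, 0.5], [0.5, 0.5]))   # uninformative test: 0.5
print(ordinal_auc([0.0, 1.0], [1.0, 0.0]))   # perfect separation: 1.0
```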

The example for ordinal test scores is taken from a hypothetical mammography study with 1,509 subjects, where each patient is given a score of Y where Y = 1, 2, 3, 4, 5.

Note the number of unverified cases is 249 out of a total of 1509 subjects. A Bayesian analysis is performed using 55,000 observations for the simulation and the posterior analysis appears in the following table, where the median ROC area is 0.7764.

Generalizations for verification studies are possible in many directions including extensions to several observers and to using patient and study covariates. Also, it is possible to drop the MAR assumption and to estimate test accuracy. The case of extreme verification bias for binary tests is not considered, which is the case where only those that test positive are verified, while none are verified among those that test negative. See Pepe ([2], p. 180), and see Broemeling [1], Pepe [2], and Zhou et al. [3] for additional interesting information about the analysis of verification studies.

5. Tests with an Imperfect Reference Standard

Suppose that a gold standard does not exist, and that the accuracy of a new test is to be assessed with an imperfect reference standard. Many such cases exist. For example, depression is usually determined by a series of questions and by observing the behavior of the patient, but such assessments are highly subjective, and there is no one test that will provide a perfect diagnosis. For infectious diseases, a perfect diagnosis can be elusive when a culture is taken: the culture may not contain the infective agent or, if the agent is present, it may not grow in the culture. Pepe [2] gives other examples, including tests for diagnosing cancer and hearing loss. Zhou et al. [3] also present various studies, including the diagnosis of a bacterial infection with stool and serology tests. Their analysis uses maximum likelihood, while the Bayesian approach is taken here. Other examples presented by Zhou et al. include two tests for tuberculosis, the Tine and Mantoux tests, at two different sites, and a third example in which pleural thickening is detected by X-ray with three readers. Another interesting example of multiple tests is described by Pepe [2], where chlamydia bacterial infection is diagnosed with a blood culture, PCR, and ELISA.

Previous work has focused on maximum likelihood estimation and Bayesian methods, and Zhou et al. [3] emphasize both. The Bayesian method is based on earlier work by Joseph, Gyorkos, and Coupal [13], who employ an augmented data approach. The augmented data approach views the missing data (the disease status D of a patient) as an unobservable random variable that can be modeled in such a way as to provide the posterior density of the measures of test accuracy (true and false positive rates). Such an approach will be used here, because the Bayesian method has the advantage of using prior information and being able to separate the parameters of interest from nuisance parameters. Fortunately, prior information is available for diagnostic tests, especially the disease rates and the accuracy assessments of medical tests, and can be used as part of the posterior analysis.

With the Bayesian approach of Joseph, Gyorkos, and Coupal [13] and Dendukuri and Joseph [14], the various tests are assumed to be conditionally independent. That assumption is used in the present approach; however, it will be relaxed in some cases, and the two ways of estimating test accuracy compared.

Pepe ([2], p. 195) presents the following example of using an imperfect reference standard R to assess the accuracy of a new test T, namely:

The new test T has a ‘true’ sensitivity of 0.80 (80/100) and a specificity of 0.70 (70/100), but of course this is actually not known because there is no gold standard. Relative to the reference test R, the estimated sensitivity is also 0.8 (64/80), but the estimated specificity is 0.61 (74/120); thus, the new test is assessed to be less specific than it actually is. Also, with respect to the gold standard, the prevalence of disease is 50%, but it is estimated to be 40% with regard to R. Remember that the gold standard is not present: we do not know the ‘true’ measures of accuracy; only those with regard to the reference standard can be estimated, and they can be misleading!

The two tests are said to be conditionally independent if

$$P[T, R \mid D] = P[T \mid D]\, P[R \mid D]$$
a condition which is usually employed with both the conventional and Bayesian approaches. Using this assumption, Pepe ([2], p. 195) states that it is likely that both the observed (relative to the reference test R) sensitivity and specificity will be decreased.
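The effect of an imperfect reference under conditional independence can be checked with a short calculation. The sketch below is only an illustration: it computes the sensitivity and specificity of T that would be observed relative to R when the two tests are conditionally independent given D. The accuracy of T and the prevalence follow the example above, while the accuracy assigned to R (0.9 and 0.9) is a hypothetical choice, not a value taken from Pepe's table.

```python
def apparent_accuracy(prev, sens_t, spec_t, sens_r, spec_r):
    """Sensitivity and specificity of T measured against an imperfect reference R,
    assuming T and R are conditionally independent given disease status D."""
    # P[R=1] and P[T=1, R=1], summing over D=1 and D=0
    p_r1 = prev * sens_r + (1 - prev) * (1 - spec_r)
    p_t1_r1 = prev * sens_t * sens_r + (1 - prev) * (1 - spec_t) * (1 - spec_r)
    # P[T=0, R=0]
    p_t0_r0 = prev * (1 - sens_t) * (1 - sens_r) + (1 - prev) * spec_t * spec_r
    return p_t1_r1 / p_r1, p_t0_r0 / (1 - p_r1)


# Perfect reference: the observed values equal the true values (0.8 and 0.7).
print(apparent_accuracy(0.5, 0.8, 0.7, 1.0, 1.0))
# Hypothetical imperfect reference: both observed values drop (about 0.75 and 0.65).
print(apparent_accuracy(0.5, 0.8, 0.7, 0.9, 0.9))
```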

Are there methods that will improve on the measures of accuracy provided by the imperfect standard test? Using primarily the Bayesian approach, this question will be explored in this section. In what is to follow, the subject is introduced with two binary tests: one is the reference test R and the other a new test T whose accuracy is to be assessed. Note that none of the patients will have their true disease status D measured; instead each patient will be given a positive or negative score by both tests. A Bayesian approach is taken, where, based on the likelihood function, the posterior distributions of the sensitivity and specificity are determined. The likelihood function is presented where the missing disease status is modeled by augmented or latent variables. Conditional independence is assumed.

With the likelihood function based on latent variables and assuming conditional independence, the posterior distributions of the sensitivity, specificity, and disease prevalence are determined. An example analyzed by Joseph, Gyorkos, and Coupal [13] involves a bacterial infection of immigrants to Canada and employs the augmented data method to estimate the sensitivity and specificity of the reference test R (a serology test) and another test T, the stool examination.

Consider a layout for the experiment with the two tests R and T, using the augmented data approach.

When D = 1, the results of the study are:

where the augmented data is represented by the yij and the observations by the corresponding nij.

Now let

$$\theta_{ij} = P[R = i, T = j \mid D = 1]$$
i, j = 0 or 1, and
$$\phi_{ij} = P[R = i, T = j \mid D = 0]$$

Then the likelihood function is

$$L(\theta, \phi \mid \text{data}) \propto p^{y} (1 - p)^{n - y} \prod_{i=0}^{1} \prod_{j=0}^{1} \theta_{ij}^{y_{ij}} \prod_{i=0}^{1} \prod_{j=0}^{1} \phi_{ij}^{n_{ij} - y_{ij}}$$
and assuming a uniform prior, the posterior distribution of the parameters p, the θij, and the ϕij can be determined in terms of all the conditional distributions as follows:

If one assumes the conditional independence assumption the likelihood function is expressed directly in terms of the sensitivity and specificity as

$$L(p, s_1, s_2, c_1, c_2) \propto s_1^{y_{11} + y_{01}} (1 - s_1)^{y_{10} + y_{00}}\, s_2^{y_{11} + y_{10}} (1 - s_2)^{y_{01} + y_{00}}\, c_1^{n_{10} + n_{00} - y_{10} - y_{00}} (1 - c_1)^{n_{11} + n_{01} - y_{11} - y_{01}}\, c_2^{n_{01} + n_{00} - y_{01} - y_{00}} (1 - c_2)^{n_{11} + n_{10} - y_{11} - y_{10}}\, p^{y} (1 - p)^{n - y}$$
where y is the sum of the yij and n is the sum of the four cell frequencies. The notation has been changed to denote s1 and c1 as the sensitivity and specificity of T respectively, while s2 and c2 denote the corresponding quantities for the reference R.

For computational purposes and assuming a uniform prior, it follows from the above likelihood function that the conditional distributions of the unknown parameters are as follows.

The conditional distribution of p, given the latent counts, is beta with parameters

$$a_p = y + 1 \quad \text{and} \quad b_p = n - y + 1$$

The conditional distribution of s1, given the other parameters, is beta with parameters as1 and bs1, where

$$a_{s_1} = y_{11} + y_{01} + 1 \quad \text{and} \quad b_{s_1} = y_{10} + y_{00} + 1$$

The conditional distribution of s2, given the other parameters, is beta with parameters

$$a_{s_2} = y_{11} + y_{10} + 1 \quad \text{and} \quad b_{s_2} = y_{01} + y_{00} + 1$$

The conditional distribution of c1 is beta with parameters

$$a_{c_1} = n_{10} + n_{00} - y_{10} - y_{00} + 1 \quad \text{and} \quad b_{c_1} = n_{11} + n_{01} - y_{11} - y_{01} + 1$$
and the conditional distribution of c2 is beta with parameters
$$a_{c_2} = n_{01} + n_{00} - y_{01} - y_{00} + 1 \quad \text{and} \quad b_{c_2} = n_{11} + n_{10} - y_{11} - y_{10} + 1.$$

In addition, the posterior distribution of the latent variables is:

The conditional distribution of y11, given the other variables, is binomial with parameters

$$m_{11} = \frac{p\, s_1 s_2}{p\, s_1 s_2 + (1 - p)(1 - c_1)(1 - c_2)} \ \text{(the probability parameter)} \quad \text{and} \quad q_{11} = n_{11}$$

The conditional distribution of y10, given the other parameters, is binomial with parameters

$$m_{10} = \frac{p (1 - s_1) s_2}{p (1 - s_1) s_2 + (1 - p)\, c_1 (1 - c_2)} \quad \text{and} \quad q_{10} = n_{10}$$

The conditional distribution of y01, given the other parameters, is binomial with parameters

$$m_{01} = \frac{p\, s_1 (1 - s_2)}{p\, s_1 (1 - s_2) + (1 - p)(1 - c_1)\, c_2} \quad \text{and} \quad q_{01} = n_{01}$$
and lastly

The conditional distribution of y00, given the other parameters, is binomial with parameters

$$m_{00} = \frac{p (1 - s_1)(1 - s_2)}{p (1 - s_1)(1 - s_2) + (1 - p)\, c_1 c_2} \quad \text{and} \quad q_{00} = n_{00}$$

It is important to know that the above posterior distributions for the accuracy of two binary tests assume a uniform prior for p, s1, s2, c1, and c2 and conditional independence between R and T!

An example assuming a uniform prior and conditional independence between an imperfect reference test R and a new test T is presented as follows. Consider the diagnosis of infection by Strongyloides among 162 Cambodian refugees to Canada. They entered Canada from July 1982 to February 1983 and were tested with a stool examination, which serves as the ‘new’ test T, and a serologic reference test R, and the results as reported by Zhou et al. ([3], p. 366) are given below.

This information has been analyzed by a number of people, including Joseph, Gyorkos, and Coupal [13] and Dendukuri and Joseph [14]. The observed sensitivity and specificity of the stool exam relative to the serology exam are 38/125 = 0.304 and 35/37 = 0.946 respectively. The main focus is to correct the observed sensitivity and specificity of the stool exam via the methodology derived in the previous section. Assume conditional independence between T and R; then the posterior distribution of the relevant parameters is given by the conditional distributions of each parameter given the others, which are identified in statements (37)–(45).
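The conditional distributions translate directly into a Gibbs sampler. The following is a minimal sketch in plain Python, assuming uniform priors and conditional independence; it is not the code used for the analysis reported here. The four cell counts are those implied by the fractions above (38 and 87 serology-positive subjects with positive and negative stool exams, 2 and 35 serology-negative subjects), the starting values are arbitrary, and all function names are illustrative.

```python
import random
import statistics


def binomial(rng, n, prob):
    """Binomial draw as a sum of n Bernoulli trials (adequate for small n)."""
    return sum(rng.random() < prob for _ in range(n))


def prob_diseased(p, like_d, like_nd):
    """P[D=1 | observed cell] from the cell probabilities under D=1 and D=0."""
    return p * like_d / (p * like_d + (1 - p) * like_nd)


def gibbs_two_tests(n11, n10, n01, n00, iters=3000, burn=500, seed=1):
    """n_ij = number of subjects with R = i and T = j.
    s1, c1 refer to the new test T; s2, c2 refer to the reference R."""
    rng = random.Random(seed)
    n = n11 + n10 + n01 + n00
    p, s1, s2, c1, c2 = 0.5, 0.7, 0.7, 0.7, 0.7      # arbitrary starting values
    names = ("p", "s1", "c1", "s2", "c2")
    draws = {name: [] for name in names}
    for it in range(iters):
        # latent numbers of diseased subjects in each observed cell
        y11 = binomial(rng, n11, prob_diseased(p, s1 * s2, (1 - c1) * (1 - c2)))
        y10 = binomial(rng, n10, prob_diseased(p, (1 - s1) * s2, c1 * (1 - c2)))
        y01 = binomial(rng, n01, prob_diseased(p, s1 * (1 - s2), (1 - c1) * c2))
        y00 = binomial(rng, n00, prob_diseased(p, (1 - s1) * (1 - s2), c1 * c2))
        y = y11 + y10 + y01 + y00
        # beta conditionals of the prevalence and the accuracy parameters
        p = rng.betavariate(y + 1, n - y + 1)
        s1 = rng.betavariate(y11 + y01 + 1, y10 + y00 + 1)
        s2 = rng.betavariate(y11 + y10 + 1, y01 + y00 + 1)
        c1 = rng.betavariate(n10 + n00 - y10 - y00 + 1, n11 + n01 - y11 - y01 + 1)
        c2 = rng.betavariate(n01 + n00 - y01 - y00 + 1, n11 + n10 - y11 - y10 + 1)
        if it >= burn:
            for name, value in zip(names, (p, s1, c1, s2, c2)):
                draws[name].append(value)
    return draws


draws = gibbs_two_tests(n11=38, n10=87, n01=2, n00=35)
for name, values in draws.items():
    print(name, "posterior median:", round(statistics.median(values), 3))
```

Because the 2 × 2 table supplies only three degrees of freedom for five parameters, the posterior under uniform priors is expected to be diffuse, so a run of this sketch should be read as an illustration of the mechanics rather than as a reproduction of the tabulated results.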

The above table reports the analysis which is executed with 125,000 observations generated from the posterior distribution of the parameters. As seen from Table 20, the standard deviation for the two sensitivities is almost as large as the mean indicating uncertainty for these measures of accuracy, and the MCMC errors are relatively large (but reasonable) for all parameters. Also, the distributions for c2 and s1 are skewed, and I would use the posterior medians to report the accuracy of the two tests.

One can employ the prior information used by Zhou et al. ([3], p. 367) who utilized informative prior information about the parameters, namely:

The prior information was elicited from a panel of experts, and the ranges of the parameter values were converted to the hyperparameters of the corresponding beta prior distribution; that is, a beta prior was used for each parameter with the above values for the parameters of that variable. Note the uncertainty for p, expressed as a range (0,1) and a uniform prior for the prevalence. For example, the prior mean for c1, the specificity of the stool exam, is 0.95, while that for the sensitivity s2 of the serology test is believed to be 0.80, compared to a prior mean of 0.74 for the sensitivity of the stool exam.

Of course, the accuracy of serology is supposed to be better than that of the stool exam, and this is reflected in the prior values of the above table.
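One common way to turn an elicited range into beta hyperparameters is moment matching. The sketch below is an assumption about how such a conversion might be done, not a description of the procedure actually used for the table above: it takes the midpoint of the range as the prior mean and one quarter of the range as the prior standard deviation. The range 0.90 to 1.00 is an illustrative input chosen because its midpoint equals the prior mean of 0.95 quoted for c1.

```python
def beta_from_range(low, high):
    """Beta(a, b) whose mean is the midpoint of [low, high] and whose
    standard deviation is one quarter of the range (an assumed convention)."""
    mean = (low + high) / 2
    sd = (high - low) / 4
    common = mean * (1 - mean) / sd ** 2 - 1     # equals a + b
    return mean * common, (1 - mean) * common


a, b = beta_from_range(0.90, 1.00)
print(round(a, 2), round(b, 2), "prior mean:", round(a / (a + b), 3))
```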

A Bayesian analysis is performed utilizing the prior information in the above table and conditional independence between the two tests. Again 125,000 observations are generated from the joint posterior distribution:

Comparing Tables 20 and 22 reveals less uncertainty in the estimates (posterior means) using informative beta priors for the accuracy parameters, and the MCMC errors are much smaller when the informative prior is used. For the accuracy parameters (sensitivity and specificity), the posterior standard deviations are smaller across the board. Note the posterior distributions appear to be symmetric. This example shows the effect of prior information on the posterior analysis, where a uniform prior was compared to an informative prior (based on expert opinion). Which analysis would you use?

The Bayesian analysis for correcting for an imperfect reference test is easily generalized to multiple binary tests and to the situation where the conditional independence assumption is not imposed. See Broemeling [1], Pepe [2], and Zhou et al. [3] for additional information about this interesting topic.

6. Accuracy of Multiple Tests

This section introduces methods to assess the accuracy of the combination of two or more tests. Two tests for the diagnosis of a disease measure different aspects or characteristics of the same disease. In the case of diagnostic imaging, two modalities have different qualities (resolution, contrast, and noise); thus, although they are imaging the same scene, the information is not the same from the two sources. When this is the case, the accuracy of the combination of two modalities is of paramount importance. For example, the accuracy of the combination of mammography and scintimammography, for suspected breast cancer, has been reported by Buscombe, Cwikla, Holloway, and Hilson [15]. Another study for diagnosing breast cancer was performed by Berg, Gutierrez, et al. [16], who measured the accuracy of mammography, clinical examination, ultrasound, and MRI in a preoperative assessment of the disease. The accuracy of each modality and various combinations of the modalities were measured. When investigating metastasis to the lymph nodes in lung cancer, Van Iverhagen, Brakel, and Heijenbrok et al. [17] measured the accuracy of ultrasound and CT and the combination of the two. Ultrasound conveys different information about metastasis compared to CT, but the combination of the two might provide a more accurate diagnosis than each separately. For an example of the diagnosis of head and neck cancer, Pauleit, Zimmerman, Stoffels et al. [18] used two nuclear medicine modalities, 18F-FET PET and 18F-FDG PET, to assess the extent of the disease and estimated the accuracy of each and of the two combined. On the other hand, Schaffler, Wolf, Schoelinast et al. [19] evaluated pleural abnormalities with CT and 18F-FDG PET and the combination of the two.

Switching from cancer to heart disease, Gerber, Coche, Pasquet et al. [20] used Four-Section Multi-Detector CT and 3D Navigator MR for detecting stenosis of the coronary arteries, where the accuracy of each and the combination of the two was estimated. The above examples involve binary test scores where accuracy is measured by TPF, FPF, PPV, and NPV, but when the test scores are ordinal and involve more than two possible values, or when the test scores are continuous, the accuracy is measured by the area under the ROC curve.

What is the optimal way to measure the accuracy for the combination of two binary tests? Pepe ([2], p. 268) presents two approaches: (1) believe the positive rule, or BP, where a positive test score on a subject means one or the other of the two tests is scored positive, and (2) believe the negative rule, or BN, where a subject is scored positive if both tests are scored positive. Pepe ([2], p. 268) also provides some properties about these rules, namely:

Statement 1

  • The BP rule increases sensitivity relative to the two binary tests, but it also increases the FPF, by no more than the sum of the two false positive fractions, namely, FPF1 + FPF2.

  • The BN rule decreases the false positive rate relative to the false positive rates of the two tests, but at the same time decreases the sensitivity; however, the sensitivity remains above TPF1 + TPF2 − 1.

For the first part on two binary tests, several examples are provided, then the idea is generalized to two binary tests with several readers and to two binary tests when verification bias is present. For the section on two ordinal tests, the accuracy of the combination of the two tests is provided by the ROC curve, which in turn depends on the risk score of the component tests.

This section will employ a Bayesian approach to estimate the accuracy of two binary tests and the accuracy of the combination of the two, using the believe the positive (BP) rule and the believe the negative (BN) rule. Label the two tests Y1 and Y2, where both take on the values 0 or 1, where 0 indicates a negative test and 1 a positive score for the medical test. A subject either has the disease or does not, as determined by the gold standard; thus when D = 1, let

$$\theta_{ij} = P[Y_1 = i, Y_2 = j]$$
for i, j = 0, 1, and when D = 0, let
$$\phi_{ij} = P[Y_1 = i, Y_2 = j]$$

Thus the thetas are the four cell probabilities for the diseased subjects and the corresponding phis are the cell probabilities for the non-diseased subjects. The corresponding cell frequencies are denoted by nij and mij for the diseased and non-diseased subjects respectively. Assuming a uniform prior for the cell probabilities, the posterior distribution of θ = (θ00, θ01, θ10, θ11) is Dirichlet with parameter (n00 + 1, n01 + 1, n10 + 1, n11 + 1), and that of ϕ = (ϕ00, ϕ01, ϕ10, ϕ11) is also Dirichlet, with parameter (m00 + 1, m01 + 1, m10 + 1, m11 + 1).

Once the posterior distribution of the cell probabilities is determined, the posterior distribution of the truncated cell probabilities is easily found. The truncated cell probabilities for the diseased subjects are given by

$$\theta_{ij} = \theta_{ij} \Big/ \sum_{i=0}^{1} \theta_{ij}$$
and for the non-diseased subjects the truncated cell probabilities are
$$\phi_{ij} = \phi_{ij} \Big/ \sum_{i=0}^{1} \phi_{ij}$$
for i and j = 0 or 1.

The true and false positive fractions for the first test Y1 are

$$\mathrm{tpf}_1 = \theta_{1\cdot}$$
and
$$\mathrm{fpf}_1 = \phi_{1\cdot}$$
respectively, while for the second test the true and false positive fractions are

$$\mathrm{tpf}_2 = \theta_{\cdot 1}$$
and
$$\mathrm{fpf}_2 = \phi_{\cdot 1}$$
respectively.

The above give the accuracy of the individual tests, but what about the combination of the two? Recall there are two ways to measure the accuracy of combined tests, either by the BP rule, or by the BN rule. With the former rule, the true positive fraction is

$$\mathrm{tpfbp} = \theta_{01} + \theta_{11} + \theta_{10}$$
and the false positive fraction is
$$\mathrm{fpfbp} = \phi_{01} + \phi_{11} + \phi_{10}.$$

On the other hand, using the BN rule the true positive fraction is

$$\mathrm{tpfbn} = \theta_{11}$$
while the false positive fraction is
$$\mathrm{fpfbn} = \phi_{11}$$
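The posterior analysis of these quantities can be sketched as follows. The code draws the cell probabilities from their Dirichlet posteriors (uniform prior) and evaluates the individual, BP, and BN fractions for each draw. It is a minimal sketch with illustrative function names, and the cell counts are made up: they only share the group sizes of the coronary artery example (58 diseased and 236 non-diseased segments) and are not the study's table.

```python
import random

CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]       # (Y1, Y2) outcomes


def dirichlet(rng, params):
    """Dirichlet draw obtained by normalising independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(g)
    return [x / total for x in g]


def combined_accuracy(n, m, draws=4000, seed=1):
    """n, m: dicts keyed by (i, j) with diseased / non-diseased cell counts."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        th = dict(zip(CELLS, dirichlet(rng, [n[c] + 1 for c in CELLS])))
        ph = dict(zip(CELLS, dirichlet(rng, [m[c] + 1 for c in CELLS])))
        out.append({
            "tpf1": th[1, 0] + th[1, 1], "fpf1": ph[1, 0] + ph[1, 1],
            "tpf2": th[0, 1] + th[1, 1], "fpf2": ph[0, 1] + ph[1, 1],
            "tpfbp": th[0, 1] + th[1, 1] + th[1, 0],
            "fpfbp": ph[0, 1] + ph[1, 1] + ph[1, 0],
            "tpfbn": th[1, 1], "fpfbn": ph[1, 1],
        })
    return out


# Hypothetical cell counts (assumption), keyed by (Y1, Y2).
n = {(0, 0): 6, (0, 1): 6, (1, 0): 16, (1, 1): 30}       # diseased
m = {(0, 0): 150, (0, 1): 18, (1, 0): 48, (1, 1): 20}    # non-diseased
sample = combined_accuracy(n, m)
for key in sample[0]:
    print(key, round(sum(s[key] for s in sample) / len(sample), 3))
```

Every draw satisfies the bounds of Statement 1, since they are algebraic consequences of the cell probabilities summing to one.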

In what is to follow, the accuracies of the individual tests and the combined test will be estimated for several examples. The next example is based on the study of Gerber, Coche, Pasquet et al. [20], which investigated the use of both CT and MRI to determine the degree of stenosis in the coronary arteries, where 26 patients were suspected of having coronary artery disease. The gold standard is coronary catheterization, which found 58 diseased segments (stenosis greater than 50%) and 236 non-diseased segments. This was an experimental study to determine the value of the two noninvasive imaging modalities to diagnose coronary artery disease. The study found that the sensitivities of CT and MRI were 79% and 62% respectively, and that the specificities of CT and MRI were 71% and 84% respectively; thus, CT had higher sensitivity but smaller specificity compared to MRI. This is a very interesting study and only a brief synopsis is given here; the reader is invited to read the article for more detail in order to know the value of the investigation. The information for the study is given below:

Our goal is to determine the accuracy of the combined test using the BP and BN rules, where the simulation consists of generating 25,000 observations from the joint posterior distribution.

Which rule, the BP or BN rule, should be used to measure the accuracy of the combined test? Note the true positive fraction with the BP rule is higher than that with the BN rule, but on the other hand, the false positive rate is lower with the BN rule compared to the BP rule. This is a true quandary and it is not obvious which rule should be used to measure the accuracy of the combined test. Note the posterior mean for the bnfpf (believe the negative false positive fraction) is 0.1627 with a standard deviation of 0.0236. What is the best way to measure the combined tests of CT and MRI?

A change of emphasis from binary to ordinal and continuous test scores brings us to some ‘new’ ideas for measuring the accuracy by combining two tests. For ordinal and continuous scores the area under the ROC curve measures the intrinsic accuracy of a medical test, but how should the area be computed when two tests are combined? The ROC curve of the risk score is the foundation for measuring the accuracy for the combined test, but in turn, the risk score is a monotone increasing function of the likelihood ratio, which is the optimal way to measure accuracy for the combined test.

The optimality of the risk function is a consequence of the Neyman-Pearson lemma, which is a familiar result from classical statistics for testing hypotheses. In what is to follow, the likelihood ratio will be defined and the optimality of the ROC curve of the likelihood ratio will be demonstrated by referring to the Neyman-Pearson lemma; then the risk function will be defined and shown to be a monotone increasing function of the likelihood ratio, so that the ROC curve of the risk function is the same as the ROC curve of the likelihood ratio. The development of the subject by Pepe ([2], pp. 269–274) is closely followed but given a Bayesian emphasis, and the end result will be that the optimal way to measure the accuracy of the combined test is to estimate the area under the ROC curve of the risk function. Determining the risk function is equivalent to performing a logistic regression using the test scores of the two tests as predictors; then the ROC curve of the predicted probabilities (from the logistic regression) is computed, from which the area is estimated. Such an area is the accuracy of the combined test, and the methodology is illustrated with various examples using ordinal test scores. The first example is from an imaging trial using MRI and CT to detect lung cancer, where one radiologist uses a five point confidence score, and the ROC curve of the risk function of the combined test is computed and compared to the ROC curves of the individual tests.

This section is continued with the definition of the likelihood ratio and concluded with the definition of the risk score.

Suppose Y = (Y1, Y2, …, Yp) is the vector of scores of p ordinal tests, then the likelihood ratio is

$$LR(Y) = \frac{P[Y \mid D = 1]}{P[Y \mid D = 0]}$$
where D is the indicator of disease. The numerator is the probability of the observed test scores, given the disease is present, and the denominator is the probability of the observed scores, given the disease is not present.

Recall that the likelihood ratio is used as a test statistic for the null hypothesis

H: D =1

versus the alternative hypothesis

A: D = 0,

where larger values of LR(Y) are evidence of the null hypothesis, and smaller values are evidence the alternative is true.

It can be shown the likelihood ratio has certain optimal properties, summarized by the result:

Statement 2

Suppose a decision about the accuracy of a medical test is based on the criterion

$$LR(Y) > c$$

Then the likelihood ratio

  • maximizes the TPF among all rules with FPF = t, for all t ∈ (0,1),

  • minimizes the FPF among all rules with the TPF = r, for all r ∈ (0,1),

  • minimizes the overall misclassification probability ρ(1 − TPF) + (1 − ρ)FPF, where ρ is the disease rate, and

  • minimizes the expected cost, regardless of the costs associated with false negative and false positive errors.

The threshold c appearing in Statement 2 depends on the objective at hand, but for our purposes, the above result implies the ROC curve based on the likelihood ratio is optimal, in the sense that its area is the largest. The likelihood ratio is difficult to work with because of the complexity of determining its distribution, but, fortunately, the risk score

$$RS(Y) = P[D = 1 \mid Y]$$
does not have this disadvantage and has the property that it is a monotone function of the likelihood ratio. Simply stated, the risk score assigns a probability of disease to each study subject.

Statement 3

The risk score has the same ROC curve as the likelihood ratio and has the same optimal properties as the likelihood ratio.
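Statement 3 can be checked numerically on a toy example. In the sketch below, which uses hypothetical likelihood ratio values and an assumed disease rate of 0.3, the risk score is obtained from the likelihood ratio by Bayes theorem and the ROC area is computed as the probability that a diseased subject scores higher than a non-diseased subject, counting ties as one half. The function names are illustrative.

```python
def risk_score(lr, prevalence):
    """P[D=1 | Y] obtained from the likelihood ratio LR(Y) by Bayes theorem."""
    return lr * prevalence / (lr * prevalence + (1 - prevalence))


def auc(scores_d, scores_nd):
    """P[diseased score > non-diseased score] + 0.5 * P[tie], over all pairs."""
    wins = sum((a > b) + 0.5 * (a == b) for a in scores_d for b in scores_nd)
    return wins / (len(scores_d) * len(scores_nd))


# Hypothetical likelihood ratio values for four diseased and four non-diseased subjects.
lr_d = [0.4, 1.5, 3.0, 8.0]
lr_nd = [0.2, 0.4, 0.9, 3.0]
rho = 0.3                                   # assumed disease rate
rs_d = [risk_score(v, rho) for v in lr_d]
rs_nd = [risk_score(v, rho) for v in lr_nd]

print(auc(lr_d, lr_nd), auc(rs_d, rs_nd))   # the two areas coincide
```

The two areas agree because the transformation from likelihood ratio to risk score preserves the ordering of the subjects, which is all the ROC curve depends on.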

Observe that

$$RS(Y) = P[D = 1 \mid Y] = \frac{P[Y \mid D = 1]\, P[D = 1]}{P[Y]} = \frac{P[Y \mid D = 1]\, P[D = 1]}{P[Y \mid D = 1]\, P[D = 1] + P[Y \mid D = 0]\, P[D = 0]} = \frac{LR(Y)\, P[D = 1]}{LR(Y)\, P[D = 1] + P[D = 0]}$$
which shows that the risk score is a monotone increasing function of the likelihood ratio, which implies that the ROC curve of the risk score is the same as that of the likelihood ratio. For our purposes the risk score will be used to measure the accuracy of combined tests, namely, using the area of the ROC curve of the risk score. Pepe ([2], pp. 274–275) shows the utility of logistic regression for finding the ROC curve of the risk score. The following statement shows why.

Statement 4

Suppose the risk score is expressed as

$$\operatorname{logit} P[D = 1 \mid Y] = \gamma + g(\lambda, Y)$$
where g is a known function, then:

(a) the parameter λ can be estimated, even for retrospective designs in which the sampling depends on D, and (b) the function g is optimal for determining the ROC curve of the risk function.

From a practical point of view, logistic regression can be used to determine the ROC curve of the risk function, but it should be noted that finding a suitable function g can be a challenge. After all, g can be a complicated nonlinear function of λ and/or Y, but it would be convenient if g is linear in the test scores Y. Of course, a Bayesian approach is taken in order to estimate the logistic regression function (10.48).

The approach taken here is based on the risk score and Pepe ([2], pp. 274–275) gives a good account.

Suppose there are two medical tests with ordinal scores, then for diseased subjects the layout is:

Thus, there are nij diseased subjects with a score of i for test 1 and score j for test 2 and the cell probabilities for the diseased are

$$\theta_{ij} = P[T_1 = i, T_2 = j \mid D = 1]$$
for the first test, where i, j = 1,2,…,k.

The non-diseased cell probabilities are

$$\phi_{ij} = P[T_1 = i, T_2 = j \mid D = 0]$$

Define the ROC area for test 1, the usual way, as:

$$\mathrm{Area}_1 = A_{11} + A_{12}/2$$
where
$$A_{11} = \sum_{i=2}^{k} \theta_{i\cdot} \Big( \sum_{j=1}^{i-1} \phi_{j\cdot} \Big)$$
and

$$A_{12} = \sum_{i=1}^{k} \theta_{i\cdot}\, \phi_{i\cdot}$$
and the θi. and ϕi., i = 1,2,…,k, are the sums of the θij and ϕij over the missing subscript.

The ROC area for the second test is defined in a similar fashion as

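$$\mathrm{Area}_2 = A_{21} + A_{22}/2$$
where
$$A_{21} = \sum_{i=2}^{k} \theta_{\cdot i} \Big( \sum_{j=1}^{i-1} \phi_{\cdot j} \Big) \quad \text{and} \quad A_{22} = \sum_{i=1}^{k} \theta_{\cdot i}\, \phi_{\cdot i}$$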
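The two marginal ROC areas can be computed from the joint cell probabilities with a few lines of Python. The sketch below sums the k × k tables over the missing subscript and applies the area formula to each margin; the 2 × 2 tables at the end are hypothetical values used only to exercise the functions, and the function names are illustrative.

```python
def area(alpha, beta):
    """ROC area A1 + A2/2 from the score distributions of diseased (alpha)
    and non-diseased (beta) subjects."""
    a1, cum = 0.0, 0.0
    for a, b in zip(alpha, beta):
        a1 += a * cum                   # diseased score strictly higher
        cum += b
    return a1 + sum(a * b for a, b in zip(alpha, beta)) / 2


def marginal_areas(theta, phi):
    """theta[i][j], phi[i][j]: joint cell probabilities for (T1 = i+1, T2 = j+1)
    among diseased and non-diseased subjects. Returns (Area_1, Area_2)."""
    rows = lambda table: [sum(r) for r in table]            # sum over T2
    cols = lambda table: [sum(c) for c in zip(*table)]      # sum over T1
    return area(rows(theta), rows(phi)), area(cols(theta), cols(phi))


# Hypothetical 2 x 2 tables of cell probabilities.
theta = [[0.1, 0.1], [0.1, 0.7]]
phi = [[0.7, 0.1], [0.1, 0.1]]
print(marginal_areas(theta, phi))       # both areas are close to 0.8
```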

Our goal is to use the area under the ROC of the risk score as a measure of accuracy of the combined tests T1 and T2, where the risk scores are determined by logistic regression (if appropriate),

$$\operatorname{logit}(\theta_{ij}) = \gamma + g(\lambda, T_1, T_2)$$
and the unknown parameters γ and λ (possibly a vector) are estimated by Bayesian techniques. From the logistic regression, the estimated (e.g., posterior mean) cell probabilities are employed to estimate the area under the ROC curve of the risk score.

Note that the area under the ROC curve of the risk score is based on the posterior distribution of the 2k² parameters θij and ϕij for i, j = 1,2,…,k, and this scenario is illustrated with the following example, where the area under the ROC curve is given by the usual formulas employed in earlier sections. Of course, the areas under the ROC curves for the individual tests will also be portrayed and compared to the area under the ROC curve of the risk score. It will be a challenge to develop a good logistic regression; however, in some cases it will turn out that the logit is a linear function of the two tests T1 and T2. The risk score is assigned to each experimental unit and is the probability of disease, which is estimated from the raw scores of the two component tests! Note that using the risk score is a statistical procedure and will ideally be utilized by the clinician working with a statistician.

When considering the accuracy of two ordinal tests, a paired study is envisioned, where each test is applied to each patient and one reader examines the results of both tests. It is important to remember that the reader uses the results of both tests for each patient in order to decide what score to assign to the patient.

Our first example involves the MRI and CT determination of the lung cancer risk, where one radiologist interprets both images and gives a score from 1–5 for the presence of a malignant lesion with the following definition: A score of 1 indicates no evidence of malignancy, while a score of 2 indicates very little evidence of a lesion. The score of 3 designates a benign lesion, while a score of 4 indicates there is some evidence of a malignancy, and finally a score of 5 signals that the lesion is definitely malignant. This is obviously a paired design in that both images are taken on each patient and one would expect a ‘large’ correlation between the scores of MRI and CT images. There are 261 patients that have lung cancer and 674 who do not, and the gold standard is lung biopsy.

The above study is hypothetical, but there are many studies that have investigated CT and MRI as alternatives to detecting lung cancer, and it should be noted that CT has shown good promise (in comparison to X-ray) in a recent national lung cancer screening trial, see Gierada, Pilgrim, Ford et al. [21] for additional information.

With regard to the accuracy of the combined test, the approach is to find the area under the ROC curve of the risk score, which is determined by logistic regression, namely,

logit(theta[i]) = b[1] + b[2] T1[i] + b[3] T2[i]
where theta[i] is the probability the i-th patient has disease, where i = 1,2,…,N.

N is the number of patients in the study with 261 with disease (lung cancer) and 674 with no disease, and the b[i] are unknown regression coefficients. From a Bayesian viewpoint, the regression coefficients are given vague prior distributions of the form

b[i] ~ dnorm(0.000, 0.0001)
namely, a normal distribution with mean 0 and precision 0.0001.
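The mechanics of the risk-score approach can be sketched without MCMC. The Python code below is a simplified stand-in for the analysis, not the method used here: it replaces posterior simulation by the posterior mode under the vague normal prior (mean 0, precision 0.0001), found by Newton-Raphson, and then computes the ROC area of the fitted risk scores. The data are synthetic five-point scores generated for illustration only, the function names are hypothetical, and the coefficients b[0], b[1], b[2] in the code correspond to b[1], b[2], b[3] in the text.

```python
import math
import random


def solve(a, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(a, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic_map(rows, d, precision=0.0001, steps=25):
    """Posterior mode of b for logit(theta) = b0 + b1*T1 + b2*T2 with
    independent normal(0, 1/precision) priors on the coefficients."""
    x = [[1.0, t1, t2] for t1, t2 in rows]
    b = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [-precision * bj for bj in b]                 # prior contribution
        hess = [[precision if r == c else 0.0 for c in range(3)] for r in range(3)]
        for xi, di in zip(x, d):
            eta = sum(bj * xj for bj, xj in zip(b, xi))
            mu = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, eta))))
            for r in range(3):
                grad[r] += (di - mu) * xi[r]
                for c in range(3):
                    hess[r][c] += mu * (1 - mu) * xi[r] * xi[c]
        b = [bj + sj for bj, sj in zip(b, solve(hess, grad))]   # Newton step
    return b


def auc(scores, d):
    """ROC area: P[diseased score > non-diseased score] + 0.5 * P[tie]."""
    pos = [s for s, di in zip(scores, d) if di == 1]
    neg = [s for s, di in zip(scores, d) if di == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (assumption): 100 diseased, 200 non-diseased, scores 1 to 5.
rng = random.Random(0)
score = lambda shift: min(5, max(1, round(rng.gauss(3 + shift, 1))))
d = [1] * 100 + [0] * 200
rows = [(score(0.8 * di), score(0.8 * di)) for di in d]

b = fit_logistic_map(rows, d)
risk = [1 / (1 + math.exp(-(b[0] + b[1] * t1 + b[2] * t2))) for t1, t2 in rows]
print("coefficients:", [round(v, 3) for v in b])
print("AUC T1:", round(auc([r[0] for r in rows], d), 3),
      "AUC T2:", round(auc([r[1] for r in rows], d), 3),
      "AUC risk score:", round(auc(risk, d), 3))
```

A full Bayesian analysis would instead carry the uncertainty in the coefficients through to the ROC area by evaluating the area at each posterior draw.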

Based on 45,000 observations generated from the posterior distribution, the Bayesian analysis is presented below.

The MCMC errors are quite small and show that the presented estimated ROC areas are very ‘close’ to the actual posterior areas, and the analysis also shows that the two areas are about the same, that is, the accuracies of the two modalities are essentially the same. The probability of a tie is estimated with a posterior mean of 0.181 with CT and 0.1788 with MRI. Thus one would expect the accuracy of the combined test, as measured by the ROC area of the risk score, to be about the same value, in the area of 0.70.

As before, when estimating the ROC area of the risk score, 45,000 observations are generated for the MCMC simulation, with the following results.

The auc parameter is the ROC area of the risk score; its posterior mean (SD) is 0.7246 (0.0192), and the median is about the same, indicating very little skewness in the posterior distribution. The implication is that the accuracy of the combined test is somewhat larger than that of the individual tests; see Table 28, which gives individual areas of approximately 0.68. The modest size of the gain is not surprising: because the individual ROC areas for CT and MRI are essentially the same, one would expect the accuracy of the combined test to be close to the individual values.
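
To make the idea of a risk score concrete, the Python sketch below plugs the posterior means of b[1], b[2] and b[3] from Table 29 into the logistic model and computes the empirical ROC area of the resulting score over the 25 cells of Table 27. Two points are assumptions made here for illustration: T1 is taken to be the CT score and T2 the MRI score, and the coefficients are fixed at their posterior means instead of being averaged over the posterior. The result is therefore a rough plug-in check and not a reproduction of the auc row of Table 29.

```python
import math

# Cell counts from Table 27: rows are CT scores 1..5, columns are MRI scores 1..5.
DISEASED = [
    [15, 10, 6, 2, 1],
    [9, 21, 10, 3, 2],
    [5, 6, 32, 6, 3],
    [2, 0, 6, 47, 2],
    [0, 1, 2, 5, 65],
]
NONDISEASED = [
    [92, 62, 41, 8, 5],
    [58, 81, 10, 8, 4],
    [38, 30, 65, 31, 18],
    [16, 2, 21, 35, 12],
    [5, 1, 3, 11, 17],
]

# Posterior means of the logistic regression coefficients (Table 29).
B1, B2, B3 = -2.952, 0.3562, 0.3392


def linear_predictor(ct, mri):
    """Assumes T1 = CT score and T2 = MRI score (an assumption of this sketch)."""
    return B1 + B2 * ct + B3 * mri


def risk(ct, mri):
    """Estimated probability of disease for a given pair of scores."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(ct, mri)))


def auc_of_risk_score():
    """Empirical ROC area of the risk score, counting ties as one half."""
    # The risk is monotone in the linear predictor, so ranking by either is equivalent.
    cells = [
        (linear_predictor(ct, mri), DISEASED[ct - 1][mri - 1], NONDISEASED[ct - 1][mri - 1])
        for ct in range(1, 6)
        for mri in range(1, 6)
    ]
    total_d = sum(d for _, d, _ in cells)
    total_n = sum(n for _, _, n in cells)
    concordant = 0.0
    for score_d, d, _ in cells:
        for score_n, _, n in cells:
            if score_d > score_n:
                concordant += d * n
            elif score_d == score_n:
                concordant += 0.5 * d * n
    return concordant / (total_d * total_n)


combined_auc = auc_of_risk_score()
print(f"risk at CT=5, MRI=5: {risk(5, 5):.3f}")
print(f"plug-in ROC area of the risk score: {combined_auc:.4f}")
```

Because the two slope coefficients are nearly equal, the risk score in this example orders patients almost as the sum of the two ordinal scores would.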

Note that the b's are the regression coefficients of the logistic regression, and the beta's are the regression coefficients in the normal regression for the ROC area of the risk score. The logistic regression is linear in the two test variables T1 and T2; I also added the squares and cross product of the two, and the ROC area remained the same, so the linear form appears adequate for estimating the risk score of the combined test. The risk scores are not normally distributed but can be transformed to approximate normality with the log transformation; when this is done, the ROC area remains at about 0.72.

Of course, there are examples where the ROC area of the risk score is much greater than that of the component tests. A good example is a pancreatic cancer study analyzed by Pepe [2, p. 9], who investigated the effect of two biomarkers on disease incidence. The first biomarker is CA19-9 and the second is CA125. On the original scale, the mean (SD) of CA19-9 is 18.03 (20.81) for the 51 control patients and 1715 (3681) for the cancer patients, whereas for CA125 the mean (SD) is 21.81 (30.29) for the control patients and 55.04 (138.8) for the diseased.

The median of the first biomarker is 10 for the control patients and 249 for the cancer patients, and for the second biomarker the medians are 11.4 and 21.8 for the control and diseased patients, respectively. Note the large variability of both biomarkers; nevertheless, based on the differences in means and medians between the diseased and non-diseased patients, one would expect a high value for the ROC area of CA19-9. Note that T1 is the CA19-9 biomarker and T2 is CA125. In order to determine the accuracy of the combined test, the Bayesian analysis is executed with 45,000 observations, and the results are reported in Table 30.

Using the risk score (which is determined with a logistic regression that regresses the disease status on the logs of the two biomarkers T1 and T2), an ROC area with posterior mean 0.912 implies very good accuracy for the combined test. This is to be compared to an ROC area of 0.8733 (0.0275) based on CA19-9 and 0.6786 (0.0438) based on CA125.

Bayesian methods for determining the accuracy of combined tests can easily be extended to other situations, and the reader is referred to Broemeling [1].

7. Comments and Conclusions

This article has described some of the Bayesian methods that are available for determining the accuracy of medical tests. It began with the basic measures of accuracy, including the true and false positive fractions and the positive and negative predictive values. For ordinal and continuous test scores, Bayesian methods for estimating the ROC area were introduced. The review continued with more specialized scenarios, including studies where verification bias is present and where an imperfect reference standard is used, and for each scenario the methodology was illustrated with examples that occur in cancer and other diseases.

Other scenarios were not considered but are nevertheless important topics for Bayesian methods of medical test accuracy. One important topic not covered is that of multiple observers, each providing an estimate of test accuracy. Consider the case where two radiologists interpret the same mammograms for diagnosing breast cancer: how does one resolve any differences in interpretation, and to what degree do the observers agree? This topic is studied in some detail from a Bayesian viewpoint by Broemeling [23], where the various analyses are executed with the WinBUGS package. Another area not presented in this review is Bayesian nonparametric inference; the reader is referred to Erkanli et al. [24] and Hanson et al. [25] for additional information on this approach to medical test accuracy. Also absent from this review are certain aspects of the design of accuracy studies; for a good introduction, refer to Dendukuri et al. [26] and Cheng et al. [27].

Figure 1. Posterior density of the false positive fraction.

Figure 2. Empirical ROC for Mammography.

Figure 3. Posterior density of ROC area for head trauma study.

Table 1. Classification table.

| Test | D = 0 | D = 1 |
|---|---|---|
| X = 0 | (n00, θ00) | (n01, θ01) |
| X = 1 | (n10, θ10) | (n11, θ11) |

Table 2. Exercise stress test and heart disease.

| EST | D = 0 | D = 1 |
|---|---|---|
| X = 0 | 327 | 208 |
| X = 1 | 115 | 818 |

Table 3. Posterior distribution of the true and false positive fractions.

| Parameter | Mean | SD | MCMC error | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|---|
| TPF | 0.7967 | 0.0125 | 5.84 × 10^−5 | 0.7716 | 0.7968 | 0.8208 |
| FPF | 0.2612 | 0.0208 | 9.22 × 10^−5 | 0.2215 | 0.2608 | 0.3033 |

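
As a rough check on Table 3, the posterior mean and standard deviation of the true and false positive fractions can be computed in closed form from the counts in Table 2. The Python sketch below assumes a uniform prior, under which each fraction has a Beta posterior with parameters equal to the relevant cell counts plus one; that prior is an assumption of this sketch, chosen because it is a common default, and the helper name `beta_mean_sd` is illustrative.

```python
import math

# Counts from Table 2 (exercise stress test): n[x][d] with x = test result, d = disease status.
N00, N01 = 327, 208   # X = 0 row: D = 0, D = 1
N10, N11 = 115, 818   # X = 1 row: D = 0, D = 1


def beta_mean_sd(a: float, b: float):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Under the assumed uniform prior:
#   TPF | data ~ Beta(n11 + 1, n01 + 1),  FPF | data ~ Beta(n10 + 1, n00 + 1).
tpf_mean, tpf_sd = beta_mean_sd(N11 + 1, N01 + 1)
fpf_mean, fpf_sd = beta_mean_sd(N10 + 1, N00 + 1)

print(f"TPF: mean={tpf_mean:.4f}, sd={tpf_sd:.4f}")
print(f"FPF: mean={fpf_mean:.4f}, sd={fpf_sd:.4f}")
```

Under this assumption the closed-form values agree with the simulated means and standard deviations in Table 3 to the precision reported there.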
Table 4. Posterior distribution of predictive values.

| Parameter | Mean | SD | MCMC error | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|---|
| PPV | 0.8759 | 0.0108 | <0.0001 | 0.8538 | 0.8762 | 0.8961 |
| NPV | 0.6109 | 0.0211 | <0.0001 | 0.5693 | 0.611 | 0.6517 |

Table 5. Posterior distribution of diagnostic likelihood ratios.

| Parameter | Mean | SD | MCMC error | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|---|
| PDLR | 3.07 | 0.2526 | 0.0011 | 2.616 | 3.055 | 3.609 |
| NDLR | 0.2755 | 0.0187 | <0.0001 | 0.2399 | 0.275 | 0.3135 |

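
The diagnostic likelihood ratios are functions of the true and false positive fractions, PDLR = TPF/FPF and NDLR = (1 − TPF)/(1 − FPF), so their posterior distributions are most easily obtained by simulation. The Python sketch below is a plain Monte Carlo stand-in for the WinBUGS analysis: it assumes a uniform prior, so that TPF and FPF have independent Beta posteriors built from the counts of Table 2, and the seed, number of draws and function name are arbitrary choices made here.

```python
import random

# Counts from Table 2 (exercise stress test).
N00, N01 = 327, 208   # X = 0 row: D = 0, D = 1
N10, N11 = 115, 818   # X = 1 row: D = 0, D = 1


def simulate_likelihood_ratios(n_draws: int = 20000, seed: int = 1):
    """Posterior means of PDLR and NDLR by direct Monte Carlo sampling."""
    rng = random.Random(seed)
    pdlr_sum = 0.0
    ndlr_sum = 0.0
    for _ in range(n_draws):
        # Assumed uniform prior: Beta posteriors with parameters count + 1.
        tpf = rng.betavariate(N11 + 1, N01 + 1)
        fpf = rng.betavariate(N10 + 1, N00 + 1)
        pdlr_sum += tpf / fpf
        ndlr_sum += (1.0 - tpf) / (1.0 - fpf)
    return pdlr_sum / n_draws, ndlr_sum / n_draws


pdlr_mean, ndlr_mean = simulate_likelihood_ratios()
print(f"PDLR posterior mean ~ {pdlr_mean:.3f}")
print(f"NDLR posterior mean ~ {ndlr_mean:.4f}")
```

Because the draws are independent, no burn-in or convergence check is needed here, unlike a general MCMC run.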
Table 6. Mammogram results.

| Status | Normal 1 | Benign 2 | Probably Benign 3 | Suspicious 4 | Malignant 5 | Total |
|---|---|---|---|---|---|---|
| Cancer | 1 | 0 | 6 | 11 | 12 | 30 |
| No Cancer | 9 | 2 | 11 | 8 | 0 | 30 |

Table 7. TPF versus FPF for Mammography.

| Status | Normal 1 | Benign 2 | Probably Benign 3 | Suspicious 4 | Malignant 5 |
|---|---|---|---|---|---|
| tpf | 30/30 = 1.00 | 30/30 = 1.00 | 29/30 = 0.966 | 23/30 = 0.766 | 12/30 = 0.400 |
| fpf | 30/30 = 1.00 | 21/30 = 0.700 | 19/30 = 0.633 | 8/30 = 0.266 | 0/30 = 0.000 |

Table 8. Posterior distribution of area under the ROC curve.

| Parameter | Mean | SD | MCMC error | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|---|
| auc | 0.7811 | 0.0514 | <0.001 | 0.6702 | 0.7848 | 0.8709 |
| A1 | 0.688 | 0.0635 | <0.001 | 0.5564 | 0.6909 | 0.8036 |
| A2 | 0.1861 | 0.0307 | <0.001 | 0.128 | 0.1854 | 0.2484 |

Table 9. Posterior distribution for the ROC area.

| Parameter | Mean | SD | MCMC error | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|---|
| beta[1] | 107.5 | 2.237 | 0.0401 | 103 | 107.5 | 111.9 |
| beta[2] | 16.14 | 2.438 | 0.2438 | 11.35 | 16.11 | 21.02 |
| precy[1] | 0.0129 | 0.004 | <0.0001 | 0.00544 | 0.0116 | 0.0211 |
| precy[2] | 0.0205 | 0.0038 | <0.0001 | 0.0137 | 0.0203 | 0.0286 |
| auc | 0.9084 | 0.04227 | <0.0001 | 0.8062 | 0.9155 | 0.9689 |

Table 10. (a) CT and MRI study for diseased subjects; (b) CT and MRI study for non-diseased subjects.

(a)

| CT | MRI = 0 | MRI = 1 | Total |
|---|---|---|---|
| 0 | 22 | 191 | 213 |
| 1 | 30 | 752 | 782 |
| Total | 52 | 943 | 995 |

(b)

| CT | MRI = 0 | MRI = 1 | Total |
|---|---|---|---|
| 0 | 148 | 167 | 315 |
| 1 | 49 | 71 | 120 |
| Total | 197 | 238 | 435 |

Table 11. Posterior analysis for CT and MRI imaging for lung cancer.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| fpfct | 0.2778 | 0.0212 | <0.00001 | 0.2375 | 0.2775 | 0.3202 |
| fpfmri | 0.5468 | 0.2371 | <0.0001 | 0.5 | 0.5469 | 0.5928 |
| rfpf(ct/mri) | 0.5089 | 0.0437 | <0.0001 | 0.427 | 0.5076 | 0.5984 |
| rtpf(ct/mri) | 0.8295 | 0.0144 | <0.00001 | 0.801 | 0.8296 | 0.8577 |
| tpfct | 0.7847 | 0.0130 | <0.00001 | 0.7589 | 0.7848 | 0.8097 |
| tpfmri | 0.9595 | 0.0071 | <0.00001 | 0.931 | 0.9462 | 0.9592 |

Table 12. One binary test.

| V = 1 | Y = 1 | Y = 0 |
|---|---|---|
| D = 1 | s1 | s0 |
| D = 0 | r1 | r0 |
| V = 0 | u1 | u0 |
| Total | m1 | m0 |

Table 13. Hepatic scintigraphy study for verification bias.

| V = 1 | Y = 1 | Y = 0 |
|---|---|---|
| D = 1 | s1 = 298 | s0 = 31 |
| D = 0 | r1 = 26 | r0 = 48 |
| V = 0 | u1 = 150 | u0 = 117 |
| Total | m1 = 474 | m0 = 196 |

Table 14. Posterior distribution for hepatic scintigraphy study.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| fpf | 0.2139 | 0.0336 | <0.0001 | 0.1525 | 0.2125 | 0.2834 |
| tpf | 0.8393 | 0.0235 | <0.0001 | 0.7918 | 0.84 | 0.8836 |

Table 15. Verification bias and one ordinal test.

| V = 1 | Y = 1 | Y = 2 | ... | Y = k |
|---|---|---|---|---|
| D = 1 | s1 | s2 | ... | sk |
| D = 0 | r1 | r2 | ... | rk |
| V = 0 | u1 | u2 | ... | uk |
| Total | m1 | m2 | ... | mk |

Table 16. Ordinal results for mammography.

| V = 1 | Y = 1 | Y = 2 | Y = 3 | Y = 4 | Y = 5 |
|---|---|---|---|---|---|
| D = 1 | s1 = 72 | s2 = 54 | s3 = 121 | s4 = 145 | s5 = 245 |
| D = 0 | r1 = 308 | r2 = 127 | r3 = 78 | r4 = 33 | r5 = 77 |
| V = 0 | u1 = 92 | u2 = 66 | u3 = 76 | u4 = 10 | u5 = 5 |
| Total | m1 = 472 | m2 = 247 | m3 = 275 | m4 = 188 | m5 = 327 |

Table 17. Posterior analysis for mammography study.

| Parameter | Mean | SD | Lower 2.5% | Median | Upper 2.5% |
|---|---|---|---|---|---|
| A | 0.7762 | 0.0126 | 0.7509 | 0.7764 | 0.8005 |
| A1 | 0.6972 | 0.0154 | 0.6665 | 0.6974 | 0.7272 |
| A2 | 0.0789 | 0.00303 | 0.0729 | 0.0789 | 0.0848 |

Table 18. Hypothetical example imperfect reference.

| New Test | D = 0 | D = 1 | R = 0 | R = 1 |
|---|---|---|---|---|
| T = 0 | 70 | 20 | 74 | 16 |
| T = 1 | 30 | 80 | 46 | 64 |
| Total | 100 | 100 | 120 | 80 |

Table 19. (a) Augmented data for reference R and test T when D = 1; (b) Augmented data for R and T, when D = 0.

(a)

| Reference Test | T = 1 | T = 0 |
|---|---|---|
| R = 1 | y11 | y10 |
| R = 0 | y01 | y00 |
| Total | | |

(b)

| Reference Test | T = 1 | T = 0 |
|---|---|---|
| R = 1 | n11 − y11 | n10 − y10 |
| R = 0 | n01 − y01 | n00 − y00 |
| Total | | |

Table 20. Results of a stool exam T and a serologic reference exam R.

| Serology Test R | T = 1 | T = 0 | Total |
|---|---|---|---|
| R = 1 | 38 | 87 | 125 |
| R = 0 | 2 | 35 | 37 |
| Total | 40 | 122 | 162 |

Table 21. Posterior analysis for the stool and serology exams.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| p | 0.4986 | 0.2011 | 0.0053 | 0.1604 | 0.4991 | 0.8352 |
| c1 | 0.6977 | 0.2586 | 0.0109 | 0.0922 | 0.7046 | 0.9942 |
| c2 | 0.2504 | 0.2552 | 0.0109 | 0.0042 | 0.1237 | 0.8862 |
| s1 | 0.2517 | 0.2503 | 0.0107 | 0.0046 | 0.1308 | 0.8826 |
| s2 | 0.7014 | 0.2623 | 0.0109 | 0.0889 | 0.7192 | 0.9945 |

Table 22. Prior information about stool and serology tests.

| Parameter | Range (%) | Alpha | Beta |
|---|---|---|---|
| p | 0–100 | 1 | 1 |
| c1 | 90–100 | 71.25 | 3.75 |
| c2 | 35–100 | 4.1 | 1.76 |
| s1 | 5–45 | 4.44 | 13.31 |
| s2 | 65–95 | 21.96 | 5.49 |

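
One common way to turn an elicited range into Beta parameters is moment matching: take the midpoint of the range as the prior mean and one quarter of its width as the prior standard deviation. The Python sketch below applies that rule to the ranges in Table 22. The rule is an assumption made here for illustration and is not claimed to be how every row was obtained; it agrees with the tabulated Alpha and Beta for c1, s1 and s2, while the c2 row does not follow from it and the p row is simply the uniform Beta(1, 1). The function name `beta_from_range` is illustrative.

```python
def beta_from_range(lower_pct: float, upper_pct: float):
    """Beta(alpha, beta) whose mean is the range midpoint and whose SD is range / 4."""
    mean = (lower_pct + upper_pct) / 2.0 / 100.0
    sd = (upper_pct - lower_pct) / 4.0 / 100.0
    # Method of moments for the Beta distribution:
    #   alpha + beta = mean * (1 - mean) / sd^2 - 1
    total = mean * (1.0 - mean) / sd ** 2 - 1.0
    if total <= 0:
        raise ValueError("range is too wide for a Beta distribution with this mean")
    return mean * total, (1.0 - mean) * total


# Ranges (in percent) for three rows of Table 22.
ranges = {"c1": (90, 100), "s1": (5, 45), "s2": (65, 95)}
priors = {name: beta_from_range(lo, hi) for name, (lo, hi) in ranges.items()}

for name, (alpha, beta) in priors.items():
    print(f"{name}: alpha={alpha:.2f}, beta={beta:.2f}")
```

Narrow ranges such as 90–100 give large Alpha and Beta values, that is, a concentrated prior, which is why informative priors of this kind can tighten the posterior intervals considerably.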
Table 23. Posterior analysis for the stool and serology exams.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| p | 0.7618 | 0.1007 | 0.0014 | 0.5236 | 0.7755 | 0.9286 |
| c1 | 0.957 | 0.0214 | <0.0001 | 0.9065 | 0.9603 | 0.9885 |
| c2 | 0.6901 | 0.1605 | 0.0020 | 0.3727 | 0.7006 | 0.9558 |
| s1 | 0.3093 | 0.0518 | <0.0001 | 0.2224 | 0.3043 | 0.4269 |
| s2 | 0.8831 | 0.0423 | <0.0001 | 0.7892 | 0.8874 | 0.9535 |

Table 24. (a) Study results of the CT-MRI study; (b) study results of the CT-MRI study.

(a)

| CT | MRI = 0 | MRI = 1 | Total |
|---|---|---|---|
| 0 | 12 | 0 | 12 |
| 1 | 10 | 36 | 46 |
| Total | 22 | 36 | 58 |

(b)

| CT | MRI = 0 | MRI = 1 | Total |
|---|---|---|---|
| 0 | 168 | 0 | 168 |
| 1 | 30 | 38 | 68 |
| Total | 198 | 38 | 236 |

Table 25. Bayesian analysis for combined test of CT and MRI.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| bnfpf | 0.1627 | 0.0236 | <0.0001 | 0.1191 | 0.1617 | 0.2116 |
| bntpf | 0.5965 | 0.0623 | <0.0001 | 0.4731 | 0.5977 | 0.7148 |
| bpfpf | 0.2959 | 0.0293 | <0.0001 | 0.2404 | 0.2953 | 0.3553 |
| bptpf | 0.7901 | 0.0515 | <0.0001 | 0.68 | 0.7933 | 0.8817 |
| fpfct | 0.2918 | 0.0292 | <0.0001 | 0.2361 | 0.2911 | 0.3511 |
| fpfmri | 0.1668 | 0.0238 | <0.0001 | 0.1228 | 0.1658 | 0.2163 |
| tpfct | 0.7739 | 0.0527 | <0.0001 | 0.662 | 0.766 | 0.8675 |
| tpfmri | 0.6127 | 0.0619 | <0.0001 | 0.4883 | 0.614 | 0.7299 |

Table 26. (a) Two medical tests for diseased patients: frequencies and probabilities; (b) two medical tests for non-diseased patients: frequencies and probabilities.

(a)

| Test 1 | Test 2 = 1 | Test 2 = 2 | ... | Test 2 = k |
|---|---|---|---|---|
| 1 | n11, θ11 | n12, θ12 | ... | n1k, θ1k |
| 2 | n21, θ21 | n22, θ22 | ... | n2k, θ2k |
| ... | ... | ... | ... | ... |
| k | nk1, θk1 | nk2, θk2 | ... | nkk, θkk |

(b)

| Test 1 | Test 2 = 1 | Test 2 = 2 | ... | Test 2 = k |
|---|---|---|---|---|
| 1 | m11, ϕ11 | m12, ϕ12 | ... | m1k, ϕ1k |
| 2 | m21, ϕ21 | m22, ϕ22 | ... | m2k, ϕ2k |
| ... | ... | ... | ... | ... |
| k | mk1, ϕk1 | mk2, ϕk2 | ... | mkk, ϕkk |

Table 27. (a) MRI and CT scores for diseased patients; (b) MRI and CT scores for non-diseased patients.

(a)

| CT Scores | MRI = 1 | MRI = 2 | MRI = 3 | MRI = 4 | MRI = 5 | Total |
|---|---|---|---|---|---|---|
| 1 | 15 | 10 | 6 | 2 | 1 | 34 |
| 2 | 9 | 21 | 10 | 3 | 2 | 45 |
| 3 | 5 | 6 | 32 | 6 | 3 | 52 |
| 4 | 2 | 0 | 6 | 47 | 2 | 57 |
| 5 | 0 | 1 | 2 | 5 | 65 | 73 |
| Total | 31 | 38 | 56 | 63 | 73 | 261 |

(b)

| CT Scores | MRI = 1 | MRI = 2 | MRI = 3 | MRI = 4 | MRI = 5 | Total |
|---|---|---|---|---|---|---|
| 1 | 92 | 62 | 41 | 8 | 5 | 208 |
| 2 | 58 | 81 | 10 | 8 | 4 | 161 |
| 3 | 38 | 30 | 65 | 31 | 18 | 182 |
| 4 | 16 | 2 | 21 | 35 | 12 | 86 |
| 5 | 5 | 1 | 3 | 11 | 17 | 37 |
| Total | 209 | 176 | 140 | 93 | 56 | 674 |

Table 28. Posterior analysis for MRI and CT of individual ROC areas.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| area ct | 0.6836 | 0.0188 | <0.0001 | 0.6459 | 0.6837 | 0.7198 |
| area11 | 0.5931 | 0.0213 | <0.0001 | 0.5509 | 0.5931 | 0.6346 |
| area12 | 0.181 | 0.0057 | <0.0001 | 0.1696 | 0.1811 | 0.1921 |
| area mri | 0.6886 | 0.0183 | <0.0001 | 0.652 | 0.6889 | 0.7239 |
| area21 | 0.5992 | 0.0207 | <0.0001 | 0.5581 | 0.5994 | 0.6392 |
| area22 | 0.1788 | 0.0049 | <0.0001 | 0.1689 | 0.1789 | 0.1883 |

Table 29. Posterior accuracy of the combined test.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| auc | 0.7246 | 0.0192 | <0.0001 | 0.6858 | 0.725 | 0.7616 |
| b[1] | −2.952 | 0.2162 | <0.0001 | −3.381 | −2.949 | −2.533 |
| b[2] | 0.3562 | 0.0752 | <0.0001 | 0.2089 | 0.3562 | 0.504 |
| b[3] | 0.3392 | 0.0723 | <0.0001 | 0.1965 | 0.3391 | 0.481 |
| beta[1] | 0.2412 | 0.0053 | <0.0001 | 0.2309 | 0.2412 | 0.2516 |
| beta[2] | 0.1385 | 0.0127 | <0.0001 | 0.1136 | 0.1385 | 0.1638 |
| precy[1] | 53.13 | 2.904 | 0.0143 | 47.59 | 53.1 | 58.92 |
| precy[2] | 28.79 | 2.528 | 0.0122 | 24.08 | 28.71 | 33.99 |

Table 30. Bayesian analysis for accuracy of the combined test.

| Parameter | Mean | SD | MCMC error | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|---|
| auc | 0.9127 | 0.0227 | <0.0001 | 0.8625 | 0.915 | 0.951 |
| beta[1] | 0.3568 | 0.0223 | <0.0001 | 0.3129 | 0.3567 | 0.4015 |
| beta[2] | 0.4361 | 0.0368 | <0.0001 | 0.3633 | 0.4362 | 0.5084 |

References

  1. Broemeling, L.D. Bayesian Biostatistics and Diagnostic Medicine; Taylor & Francis: Boca Raton, FL, USA, 2007.
  2. Pepe, M.S. The Statistical Evaluation of Medical Tests for Classification and Prediction; Oxford University Press: Oxford, UK, 2003.
  3. Zhou, X.H.; Obuchowski, N.A.; McClish, D.K. Statistical Methods in Diagnostic Medicine; John Wiley & Sons: New York, NY, USA, 2002.
  4. Wiener, D.A.; Ryan, T.J.; McCabe, C.H.; Kennedy, J.W.; Schloss, M.; Tristani, F.; Chaitman, B.R.; Fisher, L.D. Correlations among history of angina, ST-segmented response and prevalence of coronary artery disease. N. Engl. J. Med. 1979, 301, 230.
  5. Woodworth, G.G. Biostatistics, A Bayesian Introduction; John Wiley & Sons: Hoboken, NJ, USA, 2004.
  6. O'Malley, J.A.; Zou, K.H.; Fielding, J.R.; Tampany, C.M.C. Bayesian regression methodology for estimating a receiver operating characteristic curve with two radiologic applications: Prostate biopsy and spiral CT of ureteral stones. Acad. Radiol. 2001, 8, 713–725.
  7. Egan, J.P. Signal Detection Theory and ROC Analysis; Academic Press: New York, NY, USA, 1975.
  8. Greenes, R.; Begg, C. Assessment of diagnostic technologies: Methodology for unbiased estimation from samples of selected verified patients. Invest. Radiol. 1985, 20, 751–756.
  9. Bates, A.S.; Margolis, P.A.; Evans, A.T. Verification bias in pediatric studies evaluating diagnostic tests. J. Pediatr. 1993, 122, 585–590.
  10. Philbrick, J.T.; Horwitz, R.I.; Feinstein, A.R. Methodologic problems of exercise testing for coronary artery disease. Am. J. Cardiol. 1980, 46, 807–812.
  11. Reid, M.C.; Lachs, M.S.; Feinstein, A.R. Use of methodologic standards in diagnostic test research. Getting better but still not good. J. Am. Med. Assoc. 1995, 274, 645–651.
  12. Drum, D.; Christacopoulos, J. Hepatic scintigraphy in clinical decision making. J. Nucl. Med. 1969, 13, 908–915.
  13. Joseph, L.; Gyorkos, T.W.; Coupal, L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. Am. J. Epidemiol. 1995, 3, 263–272.
  14. Dendukuri, N.; Joseph, L. Bayesian approaches to modeling the conditional dependence between multiple diagnostics tests. Biometrics 2001, 57, 158–167.
  15. Buscombe, J.R.; Cwikla, J.B.; Holloway, B.; Hilson, A.J.W. Prediction of the usefulness of combined mammography and scintimammography in suspected primary breast cancer using ROC curves. J. Nucl. Med. 2001, 42, 3–8.
  16. Berg, W.A.; Gutierrez, L.; NessAlver, M.S.; Carter, W.B.; Bhargavan, M.; Lewis, R.S.; Loffe, O.B. Diagnostic accuracy of mammography, clinical examination, US, and MR imaging in preoperative assessment of breast cancer. Radiology 2004, 233, 830–849.
  17. Van Overhagen, H.; Brakel, K.; Heijenbrok, M.W.; van Kasteren, J.H.L.M.; van de Moosdijk, C.N.F.; Roldaan, A.C.; van Gils, A.P.; Hansen, B.E. Metastases in supraclavicular lymph nodes in lung cancer: Assessment with palpation, US, and CT. Radiology 2004, 232, 75–80.
  18. Pauleit, D.; Zimmerman, A.; Stoffels, G.; Bauer, D.; Risse, J.; Fluss, M.O.; Hamacher, K.; Coenene, H.H.; Langen, K.J. 18F-FET PET compared with 18F-FDG PET and CT in patients with head and neck cancer. J. Nucl. Med. 2006, 47, 256–261.
  19. Schaffler, G.J.; Wolf, G.W.; Schoellnast, H.; Groell, R.; Maier, A.; Smolle-Juttner, F.M.; Woltsche, M.; Fasching, G.; Nicolletti, R.; Aigner, R.M. Non-small cell lung cancer: Evaluation of pleural abnormalities on CT scans with 18F-FET PET. Radiology 2004, 231, 858–865.
  20. Gerber, B.L.; Coche, E.; Pasquet, A.; Ketelslegers, E.; Vancraeynest, D.; Grandin, C.; van Beers, B.E.; Vanocerschelde, J.L.J. Coronary artery stenosis: Direct comparison of four-section multi-detector row CT and 3D navigator MR imaging for detection—Initial results. Radiology 2005, 234, 98–108.
  21. Gierada, M.; Pilgrim, T.K.; Ford, M. Lung cancer interobserver agreement on interpretation of pulmonary findings at low-dose CT screening. Radiology 2008, 246, 265–272.
  22. Wieand, S.; Gail, M.H.; James, B.R. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989, 76, 585.
  23. Broemeling, L.D. Bayesian Methods for Measures of Agreement; Taylor & Francis Group: Boca Raton, FL, USA, 2009.
  24. Erkanli, A.; Sung, M.; Costello, E.J.; Angold, A. Bayesian semi-parametric ROC analysis. Stat. Med. 2006, 25, 3905–3928.
  25. Hanson, T.E.; Kottas, A.; Branscum, A.J. Modelling stochastic order in the analysis of receiver operating characteristic data. Appl. Stat. 2008, 57, 207–225.
  26. Dendukuri, N.; Rahme, E.; Belisle, I.; Joseph, L. Bayesian sample size determination for prevalence and diagnostic test studies in the absence of a gold standard test. Biometrics 2004, 60, 378–387.
  27. Cheng, D.; Branscum, A.J.; Stamey, J.E. A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests. Comp. Stat. Data Analy. 2010, 54, 298–307.