Analysis of Predictive Values Based on Individual Risk Factors in Multi-Modality Trials

The accuracy of diagnostic tests with binary end-points is most frequently measured by sensitivity and specificity. However, from the clinical perspective, the main purpose of a diagnostic agent is to assess the probability that a patient is actually diseased, and hence predictive values are more suitable here. As predictive values depend on the pre-test probability of disease, we provide a method that takes risk factors influencing the patient's prior probability of disease into account when calculating predictive values. Furthermore, approaches to construct confidence intervals and a methodology to compare predictive values by statistical tests are presented. The methods can be used to analyze predictive values of factorial diagnostic trials, such as multi-modality, multi-reader trials. We further performed a simulation study assessing length and coverage probability for different types of confidence intervals, and we present the R-Package facROC, which can be used to analyze predictive values in factorial diagnostic trials in particular. The methods are applied to a study evaluating CT-angiography as a noninvasive alternative to coronary angiography for diagnosing coronary artery disease. In this application, the patients' symptoms are considered as risk factors influencing the respective predictive values.


Introduction
The main purpose of a diagnostic agent is to assess a patient's true health status, so the probability that the test gives the correct diagnosis is an important measure of diagnostic ability. The positive predictive value is the probability that a patient with an abnormal (i.e., positive) test result is actually diseased; correspondingly, the negative predictive value is the probability that a patient with a normal (i.e., negative) test result is actually free of disease. However, these quantities are of limited value on their own: the predictive values of a diagnostic agent depend critically on the prevalence of the disease and, as the prevalence may vary, e.g., between different risk groups, predictive values are not homogeneous within the population. For this reason, sensitivity and specificity are mostly used to describe the accuracy of a diagnostic agent: they are defined as the probabilities that the test correctly identifies the diseased and the non-diseased subjects, respectively. In contrast to the predictive values, sensitivity and specificity describe the result of a test within groups of patients who either have or do not have the condition. Thus they are characteristics of the diagnostic test itself and are independent of the prevalence of the disease. Even though sensitivity and specificity are useful and powerful measures of how effective a diagnostic test is, they have one main disadvantage: they do not assess the accuracy of a diagnostic agent in a practically useful way. They describe how accurately the diagnostic test discriminates between diseased and non-diseased subjects but do not tell the clinician how to interpret a normal or abnormal test result of an individual patient. This interpretation of the test result is only provided by predictive values.
Hence, predictive values should not be disregarded in the analysis of diagnostic trials, despite the problems they entail. Rather than being avoided, they should be estimated with careful attention to these problems.
This paper provides a new method to calculate predictive values for different risk groups. As the pre-test probability of disease in different risk groups does not need to be determined in the same study as the accuracy of the diagnostic agent, this method can even be applied after the diagnostic study (analyzed by means of sensitivity and specificity) has already been closed. Hence, the approach is also applicable to case-control studies where, by design, no prevalence can be assessed. With this method of analysis, the heterogeneity of predictive values throughout the population is taken into account because predictive values are estimated for each risk group separately. Reported in addition to sensitivity and specificity, these predictive values provide a comprehensive picture of the strengths and weaknesses of the investigated diagnostic agent.
In our approach, we assume that sensitivity and specificity are equal in all investigated risk groups and that the probability of disease in each group has been estimated in prior studies. We calculate predictive values by using Bayes' theorem and determine the asymptotic distribution of the resulting estimators by using the delta method (Section 2). As many diagnostic studies are trials with at least two arms in which each subject is diagnosed by different tests, we use the multivariate delta theorem so that predictive values can also be determined in factorial designs. The idea of using the delta method to calculate the distribution of predictive values in a univariate set-up was proposed by Mercaldo et al. [1]. In their work the prevalence has to be known by the investigator and cannot be estimated, which seems unlikely in clinical practice. In our approach we hence allow the prevalence to be a random variable; as the simulation studies (Section 3) show, this randomness cannot simply be neglected. The practical relevance of the presented methods is illustrated by means of a study evaluating the accuracy of multidetector CT angiography in the diagnosis of coronary artery disease (Section 4). The paper closes with a discussion of the proposed procedures (Section 6).

Methods of Analysis
In this approach, Bayes' theorem [2] serves as the theoretical basis for the analysis. This theorem connects sensitivity and specificity with the predictive values by expressing the positive ($p_+$) and the negative ($p_-$) predictive value as functions of sensitivity ($se$), specificity ($sp$) and prevalence ($\pi$), namely,

$$p_+ = f_+(se, sp, \pi) = \frac{se \cdot \pi}{se \cdot \pi + (1 - sp)(1 - \pi)}, \qquad (1)$$

$$p_- = f_-(se, sp, \pi) = \frac{sp\,(1 - \pi)}{sp\,(1 - \pi) + (1 - se)\,\pi}. \qquad (2)$$

The sensitivity can be estimated by the ratio of the true positive test results to all diseased subjects, and the specificity can be estimated by the ratio of the true negative test results to all non-diseased subjects. As we assume that the prevalence of disease in each risk group has been assessed in prior studies, estimators for the positive and the negative predictive value for each risk group can be calculated by plugging the estimators of sensitivity, specificity and prevalence into Equations (1) and (2).
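The plug-in calculation described above can be sketched in a few lines of R. This is an illustration only: the function names `ppv` and `npv` are hypothetical helpers (not part of facROC), the sensitivity and specificity are the values reported in Section 4, and the three prevalences are made-up stand-ins for low, medium and high risk groups.

```r
# Predictive values via Bayes' theorem, Equations (1) and (2).
# ppv() and npv() are hypothetical helper names used for illustration only.
ppv <- function(se, sp, prev) {
  se * prev / (se * prev + (1 - sp) * (1 - prev))
}
npv <- function(se, sp, prev) {
  sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)
}

# se and sp as estimated in Section 4; the prevalences are assumed values.
se <- 0.85
sp <- 0.90
prev <- c(low = 0.10, medium = 0.50, high = 0.90)

pv_table <- data.frame(prevalence = prev,
                       PPV = ppv(se, sp, prev),
                       NPV = npv(se, sp, prev))
print(round(pv_table, 3))
```

With sensitivity and specificity held fixed, the positive predictive value rises and the negative predictive value falls as the assumed prevalence increases, which is the heterogeneity between risk groups that the method is designed to capture.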
This method is applied, e.g., by Diamond and Forrester [3], who compute the probability of coronary artery disease. They present a table of post-test probabilities depending on the result of an electrocardiographic stress test (depression of the S-T segment) and on the pre-test probability of disease (categorized into different risk groups by age, sex and symptoms). Their work shows the importance of distinguishing between different risk groups: for the same depression of the S-T segment, the positive predictive value varies from 0.938 (high risk group) to 0.003 (low risk group).
Extending the ideas of Diamond and Forrester, our approach will go one step further: we will derive methods of analysis by which not only the predictive values of two or more modalities are calculated for different risk groups, but also the difference between the modalities is statistically tested. Furthermore, different methods to calculate confidence intervals for the positive and the negative predictive value are provided. As all these methods take the patient's risk factors into consideration, this approach of analysis can be regarded as a further step towards personalized medicine.

Notation
We consider a diagnostic trial involving N subjects, where $n_0$ subjects are classified as non-diseased by a reliable gold standard and $n_1$ are classified as diseased. In the set-up of this trial, we assume that each subject is examined by means of M different diagnostic tests, m = 1, . . . , M. For each subject the results are collected in a vector $X_{ik} = (X_{ik}^{(1)}, \dots, X_{ik}^{(M)})'$, where $X_{ik}^{(m)} = 1$ if the m-th test yields the correct diagnosis for subject k of group i and $X_{ik}^{(m)} = 0$ otherwise. Within group i (i = 0, 1) the vectors $X_{ik}$, k = 1, . . . , $n_i$, are independent identically distributed random vectors following a multivariate Bernoulli distribution with success probabilities $sp = (sp^{(1)}, \dots, sp^{(M)})'$ for i = 0 and $se = (se^{(1)}, \dots, se^{(M)})'$ for i = 1. Hereby sp and se denote the vectors of specificities and sensitivities of the different modalities.

Estimation and Asymptotic Distribution
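The sensitivity of the m-th diagnostic test is estimated by $\widehat{se}^{(m)}$, the ratio of the true positive test results (of test m) to all diseased subjects, and the specificity of the m-th modality is estimated by $\widehat{sp}^{(m)}$, the ratio of the true negative test results (of test m) to all non-diseased subjects. Similarly to sensitivity and specificity, their estimators are also collected in vectors $\widehat{se} = (\widehat{se}^{(1)}, \dots, \widehat{se}^{(M)})'$ and $\widehat{sp} = (\widehat{sp}^{(1)}, \dots, \widehat{sp}^{(M)})'$.
For the calculation of the predictive values for a subject, the subject's pre-test probability of disease is required. We assume that the pre-test probability of disease is influenced by a patient's individual characteristics and that each subject can be attributed to a risk group on the basis of these attributes. Hereby, the prevalences $\pi_g$, g = 1, . . . , G, in the g-th risk group have been estimated in prior studies by $\hat\pi_g = k_g / m_g$, the ratio of the number of diseased subjects $k_g$ in group g to all subjects $m_g$ in group g. With the help of Equations (1) and (2), the positive and the negative predictive value of the m-th modality for the g-th risk group can be calculated and finally be estimated by replacing sensitivity, specificity and prevalence by their respective estimates:

$$\hat p^{(m)}_{g+} = \frac{\widehat{se}^{(m)}\,\hat\pi_g}{\widehat{se}^{(m)}\,\hat\pi_g + (1 - \widehat{sp}^{(m)})(1 - \hat\pi_g)}, \qquad \hat p^{(m)}_{g-} = \frac{\widehat{sp}^{(m)}\,(1 - \hat\pi_g)}{\widehat{sp}^{(m)}\,(1 - \hat\pi_g) + (1 - \widehat{se}^{(m)})\,\hat\pi_g}.$$

Similarly to sensitivity and specificity, the positive and the negative predictive values of each risk group g = 1, . . . , G are collected in vectors $p_{g+} = (p^{(1)}_{g+}, \dots, p^{(M)}_{g+})'$ and $p_{g-} = (p^{(1)}_{g-}, \dots, p^{(M)}_{g-})'$, and their estimators in the vectors $\hat p_{g+}$ and $\hat p_{g-}$. To derive the asymptotic results for the predictive values, the following regularity assumptions are required:
1. The observations of different subjects are independent.
2. $N \to \infty$ such that $N/n_i \to d_i < \infty$, i = 0, 1, and $N/m_g \to e_g < \infty$, g = 1, . . . , G.
3. $se^{(m)}, sp^{(m)}, \pi_g \in (0, 1)$ for all m = 1, . . . , M and g = 1, . . . , G.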
In clinical practice, these assumptions can be interpreted in the following way. The first assumption means that different subjects are independent replications. The second assumption ensures that the sample sizes n 1 (used for the estimation of sensitivity), n 0 (used for the estimation of specificity) and m g , g = 1, . . . , G (used for the estimation of prevalences for the different risk groups) increase uniformly when the overall sample size is increased. The third assumption excludes the trivial case that the sensitivity, specificity or prevalences are equal to 0 or 1.
These assumptions lead to our main result:

Theorem
For each risk group g = 1, . . . , G, the statistics $\sqrt{N}(\hat p_{g+} - p_{g+})$ and $\sqrt{N}(\hat p_{g-} - p_{g-})$ have, asymptotically, a multivariate normal distribution with mean 0 and covariance matrices $V_{g+}$ and $V_{g-}$, respectively, which are defined in Appendix A.

Proof
The proof is mainly based on the central limit theorem and Cramer's multivariate delta theorem with $f_+$ and $f_-$ as transformation functions. For details as well as for the expressions of $V_{g+}$ and $V_{g-}$, we refer to Appendix A.
The idea of using the delta method to calculate the distribution of predictive values was already proposed by Mercaldo et al. [1] for the univariate case, i.e., in their approach, predictive values of different diagnostic tests cannot be compared if the tests are carried out on the same subjects. It is further assumed that the prevalence is a known parameter rather than an estimated quantity. If their method is applied in a set-up in which the prevalence is estimated (but incorrectly treated as fixed in order to meet the requirements of their approach), the variance of the predictive values is systematically underestimated (see Section 3).
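For a single modality and a single risk group, the delta-method variance can be written out explicitly. The following is a sketch obtained by differentiating the Bayes formulas for $p_+$ and $p_-$; it is not copied from Appendix A, the abbreviations $D_+$ and $D_-$ are introduced here for convenience, and the binomial variances are used without the small-sample correction factors.

```latex
\begin{align*}
D_+ &= se\,\pi_g + (1-sp)(1-\pi_g), &
D_- &= sp\,(1-\pi_g) + (1-se)\,\pi_g,\\[4pt]
\frac{\partial f_+}{\partial se} &= \frac{\pi_g(1-\pi_g)(1-sp)}{D_+^2}, &
\frac{\partial f_-}{\partial se} &= \frac{\pi_g(1-\pi_g)\,sp}{D_-^2},\\[4pt]
\frac{\partial f_+}{\partial sp} &= \frac{\pi_g(1-\pi_g)\,se}{D_+^2}, &
\frac{\partial f_-}{\partial sp} &= \frac{\pi_g(1-\pi_g)(1-se)}{D_-^2},\\[4pt]
\frac{\partial f_+}{\partial \pi_g} &= \frac{se\,(1-sp)}{D_+^2}, &
\frac{\partial f_-}{\partial \pi_g} &= -\frac{sp\,(1-se)}{D_-^2},
\end{align*}
\[
\operatorname{Var}(\hat p_{g\pm}) \approx
\Big(\frac{\partial f_\pm}{\partial se}\Big)^{2} \frac{se(1-se)}{n_1}
+ \Big(\frac{\partial f_\pm}{\partial sp}\Big)^{2} \frac{sp(1-sp)}{n_0}
+ \Big(\frac{\partial f_\pm}{\partial \pi_g}\Big)^{2} \frac{\pi_g(1-\pi_g)}{m_g}.
\]
```

Treating the prevalence as a known constant amounts to dropping the third term. Since that term is nonnegative, the resulting variance can only be too small when the prevalence has in fact been estimated, which is the underestimation referred to above.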

Inferential Statistics
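Based on the asymptotic distribution of $\sqrt{N}(\hat p_{g\pm} - p_{g\pm})$, hypotheses comparing the different diagnostic tests can be tested statistically. The hypotheses are formulated in the same way as in the theory of linear models,

$$H_0: p^{(1)}_{g\pm} = \dots = p^{(M)}_{g\pm},$$

which equivalently can be written as

$$H_0: \left(I_M - \tfrac{1}{M}\, 1_M 1_M'\right) p_{g\pm} = 0.$$

Hereby, $I_M$ denotes the M-dimensional unit matrix and $1_M$ the M-dimensional vector of 1s. In this case, an additive model is assumed, but this approach can easily be extended to a logistic model by again applying the delta method with a logit transformation. The hypotheses can be tested with the help of the ANOVA-type statistic [4,5]. Under $H_0$ the statistic can be approximated by a central $\chi^2_f/f$-distribution with $f$ degrees of freedom. Furthermore, the (1 − α)-confidence intervals for each modality as well as for the difference between two modalities m and m' can be calculated in the usual way:

$$\hat p^{(m)}_{g\pm} \pm u_{1-\alpha/2}\sqrt{\hat v_{g\pm}[m,m]/N}, \qquad \hat p^{(m)}_{g\pm} - \hat p^{(m')}_{g\pm} \pm u_{1-\alpha/2}\sqrt{\big(\hat v_{g\pm}[m,m] + \hat v_{g\pm}[m',m'] - 2\,\hat v_{g\pm}[m,m']\big)/N},$$

where $u_{1-\alpha/2}$ denotes the (1 − α/2)-quantile of the standard normal distribution and $\hat v_{g\pm}[m,m']$ the elements of the estimated covariance matrix $\hat V_{g\pm}$. For the confidence intervals, as well as for the test statistic, a logistic model can be applied. The logistic model has one main advantage: the resulting confidence intervals are range-preserving by construction.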
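The ANOVA-type statistic mentioned above can be sketched as follows. This assumes the form that is standard in the literature on ANOVA-type statistics; the symbols $A_N$, $T$ and $\hat f$ are chosen here for illustration, and the exact expression used in [4,5] may differ in notation.

```latex
\[
T = I_M - \tfrac{1}{M}\,1_M 1_M', \qquad
A_N = \frac{N\,\hat p_{g\pm}'\,T\,\hat p_{g\pm}}{\operatorname{tr}\big(T\,\hat V_{g\pm}\big)}
\;\overset{\text{approx.}}{\sim}\; \frac{\chi^2_{\hat f}}{\hat f}, \qquad
\hat f = \frac{\big[\operatorname{tr}\big(T\,\hat V_{g\pm}\big)\big]^2}
              {\operatorname{tr}\big(T\,\hat V_{g\pm}\,T\,\hat V_{g\pm}\big)}.
\]
% Special case M = 2: T = (1/2) c c' with c = (1, -1)', so that
\[
A_N = \frac{N\,\big(\hat p^{(1)}_{g\pm} - \hat p^{(2)}_{g\pm}\big)^2}
           {\hat v_{g\pm}[1,1] + \hat v_{g\pm}[2,2] - 2\,\hat v_{g\pm}[1,2]},
\qquad \hat f = 1 .
\]
```

For two modalities the statistic therefore reduces to the squared Wald-type statistic for the difference of the two predictive values, which is consistent with the confidence interval for a difference of two modalities.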
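For small sample sizes, the distribution of the studentized statistic $\sqrt{N}\,(\hat p^{(m)}_{g\pm} - p^{(m)}_{g\pm}) / \sqrt{\hat v_{g\pm}[m,m]}$ can be approximated by a central $t_\nu$-distribution (see Appendix B), which increases the coverage probability of the resulting confidence intervals.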
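For a single modality and summary-level data, the additive and the logistic normal-approximation intervals can be sketched as below. This is a minimal illustration under stated assumptions, not the facROC implementation: `pv_ci` is a hypothetical function name, the counts in the usage lines are invented, and the t-approximation of Appendix B is not included.

```r
# Delta-method confidence intervals for one predictive value (one modality,
# one risk group). pv_ci() is a hypothetical helper, not a facROC function.
# tp / n1: true positives among diseased; tn / n0: true negatives among
# non-diseased; k / m: diseased among all subjects of the prevalence study.
pv_ci <- function(tp, n1, tn, n0, k, m, positive = TRUE, level = 0.95) {
  se <- tp / n1
  sp <- tn / n0
  prev <- k / m
  if (positive) {
    d <- se * prev + (1 - sp) * (1 - prev)
    est <- se * prev / d
    grad <- c(prev * (1 - prev) * (1 - sp),   # d f+ / d se
              prev * (1 - prev) * se,         # d f+ / d sp
              se * (1 - sp)) / d^2            # d f+ / d prev
  } else {
    d <- sp * (1 - prev) + (1 - se) * prev
    est <- sp * (1 - prev) / d
    grad <- c(prev * (1 - prev) * sp,         # d f- / d se
              prev * (1 - prev) * (1 - se),   # d f- / d sp
              -sp * (1 - se)) / d^2           # d f- / d prev
  }
  # Unbiased variance estimates of the three independent proportions.
  vars <- c(se * (1 - se) / (n1 - 1),
            sp * (1 - sp) / (n0 - 1),
            prev * (1 - prev) / (m - 1))
  s <- sqrt(sum(grad^2 * vars))
  z <- qnorm(1 - (1 - level) / 2)
  list(estimate = est,
       std_error = s,
       additive = est + c(-1, 1) * z * s,
       # Logit scale: delta method once more, then back-transformation.
       logistic = plogis(qlogis(est) + c(-1, 1) * z * s / (est * (1 - est))))
}

# Invented counts for illustration only.
res <- pv_ci(tp = 85, n1 = 100, tn = 90, n0 = 100, k = 50, m = 100)
print(res$additive)
print(res$logistic)
```

The logistic interval always stays inside (0, 1) because it is computed on the logit scale and transformed back, whereas the additive interval can exceed these bounds when the estimate is close to 0 or 1. If the estimate equals 0 or 1 exactly, the logit is not finite and the logistic interval is not available.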

Simulation Results
In this section, we investigate the coverage and length of confidence intervals constructed with the delta method. We compare the approach of Mercaldo et al. [1] with the approaches presented in this paper. There were 48 different combinations of prevalence π ∈ {0.05, 0.25, 0.5}, sensitivity and specificity se, sp ∈ {0.5, 0.75, 0.85, 0.9}. Three different values of n = n_0 = n_1 ∈ {50, 100, 500} and m_g ∈ {100, 500, 1,000} were used with each combination to represent small, medium and large study sizes for both the study evaluating the usefulness of the diagnostic test and the study assessing the prevalence of disease according to risk groups. For each combination of π, se, sp, n, m_g, 10,000 binomial samples were generated using the function rbinom of the free software R [6]. In each simulation step, the sensitivity was estimated from a contingency table generated by n_1 Bernoulli samples, the specificity was estimated from a contingency table generated by n_0 Bernoulli samples, and the prevalence was estimated by means of m_g Bernoulli samples. The positive and negative predictive values and their estimators were calculated by applying Bayes' theorem to a given set of π, se, sp and to the estimates $\hat\pi$, $\widehat{se}$, $\widehat{sp}$, respectively.
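One cell of such a simulation can be sketched in R as follows. This is a simplified illustration, not the code used for the reported tables: it covers a single parameter combination, only the additive normal-approximation interval for the positive predictive value, and plain binomial variance estimates; all object names are chosen here.

```r
# Coverage of the additive 95% interval for p+ in one simulation cell,
# with the prevalence variance included versus ignored (prevalence "fixed").
set.seed(1)
se <- 0.85; sp <- 0.90; prev <- 0.25   # one of the 48 parameter combinations
n1 <- 100; n0 <- 100; mg <- 100        # study sizes taken from the listed values
nsim <- 10000
z <- qnorm(0.975)

ppv <- function(se, sp, prev) se * prev / (se * prev + (1 - sp) * (1 - prev))
true_ppv <- ppv(se, sp, prev)

# Independent binomial draws for sensitivity, specificity and prevalence.
se_hat <- rbinom(nsim, n1, se) / n1
sp_hat <- rbinom(nsim, n0, sp) / n0
pi_hat <- rbinom(nsim, mg, prev) / mg

est <- ppv(se_hat, sp_hat, pi_hat)
d2 <- (se_hat * pi_hat + (1 - sp_hat) * (1 - pi_hat))^2

# Delta-method variance: accuracy part and prevalence part.
var_acc <- (pi_hat * (1 - pi_hat) * (1 - sp_hat) / d2)^2 * se_hat * (1 - se_hat) / n1 +
           (pi_hat * (1 - pi_hat) * se_hat / d2)^2 * sp_hat * (1 - sp_hat) / n0
var_prev <- (se_hat * (1 - sp_hat) / d2)^2 * pi_hat * (1 - pi_hat) / mg

covers <- function(v) abs(est - true_ppv) <= z * sqrt(v)
coverage <- c(prevalence_estimated = mean(covers(var_acc + var_prev), na.rm = TRUE),
              prevalence_fixed     = mean(covers(var_acc), na.rm = TRUE))
print(coverage)
```

Both intervals share the same centre and the interval ignoring the prevalence variance is never wider, so its empirical coverage cannot exceed that of the interval that accounts for the estimated prevalence.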
Simulation results for p_+ and p_− can be found in Tables 1 and 2 and Tables A1 and A2, respectively. The results for the negative predictive value are presented in Appendix C for reasons of readability. Due to the great number of input parameters and the number of possible values, these tables were constructed in the same way as in the paper of Mercaldo et al. [1]: one parameter was held fixed while averaging over the remaining parameters. If the positive or the negative predictive value is estimated to be 0 or 1, the logistic confidence interval is not applicable. In this case the t_ν-approximation is not applicable either, because the denominator of ν is estimated to be 0. The number of times this occurred was recorded in the last columns of Tables 1 and A1, respectively. As the point estimators for p_+ and p_− are equal in all approaches, the failure rates of the t-approximation, the logistic normal approximation, the logistic t-approximation and the logistic method by Mercaldo et al. are the same.
Table 1. Summary of p_+ coverage probabilities where the cell values denote the coverage probability for one fixed parameter and averaging over the remaining parameters. Hereby N-Approx and t-Approx are abbreviations for normal and t-approximation. π fix denotes the method of [1], where π is assumed to be fixed.
[Table 1: column groups Additive and Logistic; table body not recovered.]
The approach of Mercaldo et al. assumes that the prevalence is a known fixed parameter, i.e., that the variance of $\hat\pi_g$ is 0. But as the prevalence can only be assessed by estimation, we assumed the estimated prevalence to be a binomial random variable with variance greater than 0. To investigate whether this has an impact on the quality of the confidence intervals or whether it can be neglected in practice, we also simulated the approach of Mercaldo et al. with $\hat\pi_g$ being a random variable. Note that we hence investigate the methodology of Mercaldo et al. in a set-up that seems likely in clinical practice but for which it was not designed. As the assumption that π_g is known by the investigator seems unlikely in practice, no simulation with fixed π_g was performed.
Simulation results show that the logistic confidence intervals have a slightly higher coverage probability than the additive intervals. The t-approximation in the additive set-up seems to achieve the coverage closest to the nominal level.
For our approach the overall coverages for the logistic interval are 0.9568 (p_+) and 0.9562 (p_−), whereas the overall coverages for the additive interval are 0.9364 (p_+) and 0.9303 (p_−). The t-approximation increases coverage such that the overall coverages reach 0.9494 (p_+) and 0.9440 (p_−). The t-approximated logistic confidence intervals tend to be even more conservative than the normal-approximated logistic confidence intervals. Furthermore, simulation results show that the assumption of π_g being a fixed parameter is necessary for a good performance of the approach of Mercaldo et al. If the prevalence is estimated and hence a random variable, the overall coverage probability only reaches 0.7885 (additive) or 0.7998 (logistic) for the positive predictive value. (Simulations of the negative predictive values lead to comparable results.) The variance of $\hat\pi_g$ decreases when the sample size m_g increases and, hence, the method of Mercaldo et al. achieves better results for large m_g.
Table 2. Summary of p_+ confidence interval lengths where the cell values denote the confidence interval length for one fixed parameter and averaging over the remaining parameters. Hereby N-Approx and t-Approx are abbreviations for normal and t-approximation. π fix denotes the method of [1], where π is assumed to be fixed.
[Table 2: columns Fixed Parameter; Additive (N-Approx, t-Approx, π fix); Logistic (N-Approx, t-Approx, π fix); table body not recovered.]
As the approach of Mercaldo et al. assumes the variance of $\hat\pi_g$ to be 0, its confidence intervals are noticeably shorter (by about 25%) than those of our approach. For each method, the lengths of the logistic and the additive confidence intervals are almost equal. By construction, the t-approximated confidence intervals are slightly longer than the intervals constructed by means of the normal approximation, but the difference is negligible.
We hence recommend using either the t-approximation or the logistic normal approximation when confidence intervals are computed.

Applications: Diagnostic Performance of Multidetector CT Angiography
As coronary artery disease (CAD) has been recognized as the leading cause of death in the United States [7], the diagnosis of the presence and severity of CAD is essential in clinical practice. Conventional coronary angiography reveals the extent, location and severity of obstructive lesions with high accuracy and thus invasive coronary angiography, despite the associated risks, remains the standard procedure for the diagnosis of CAD. Multidetector computed tomographic angiography (MDCTA) has been proposed as a noninvasive alternative to the conventional coronary angiography.
Recently, Miller et al. [8] performed a multi-center diagnostic trial to evaluate the accuracy of MDCTA involving 64 detectors. In 291 patients, segments of 1.5 mm or more in diameter were analyzed by means of CT and conventional angiography (gold standard) to assess whether the patient has at least one coronary stenosis of 50% or more. These data are summarized in Table 3. From Table 3, the sensitivity is estimated to be 0.85 while the specificity is estimated to be 0.90. Miller et al. [8] also estimated the predictive values from Table 3. In doing so, they implicitly assumed that the pre-test likelihood of disease is the same for all patients. They furthermore assumed that the study prevalence is representative. Using this approach, 0.83 is estimated as positive predictive value and 0.91 as negative predictive value. From these results, the authors drew the conclusion that CT angiography cannot be used as a simple replacement for conventional angiography. For our approach of analysis, we consider three risk groups with different pre-test probabilities of disease. Diamond and Forrester [3] reviewed the literature to estimate the prevalence of CAD depending on sex, age and symptoms. For reasons of simplicity, we concentrate only on the patient's symptoms as risk factor. According to the patient's symptoms, Diamond and Forrester [3] provide the pre-test probabilities of disease presented in Table 4. Using these estimators for the prevalence as well as the estimators of sensitivity and specificity, the positive predictive values (PPV), negative predictive values (NPV) and the corresponding confidence intervals were calculated using the described methods. The results are summarized in Table 5. Taking the additional information on the patient's individual risk factors into account offers a more comprehensive interpretation of the study results.
For a patient with nonanginal chest pain, a negative test result from the MDCTA eliminates the need for further examination, as does a positive test result for a patient with typical angina. In contrast, for a patient with atypical angina, neither a positive nor a negative test result from the MDCTA leads to a clear statement concerning the patient's health status.

Software
In order to analyze factorial trials, we have developed the R-Package facROC. The software can be used to evaluate most assessments of diagnostic accuracy in factorial set-ups: the area under the ROC-Curve (according to [9]), sensitivity and specificity (according to [10]) as well as predictive values (according to this paper). In most diagnostic trials sensitivity and specificity are analyzed as primary assessments of diagnostic accuracy. The evaluation of sensitivity and specificity serves as a basis for the computation of predictive values and can be performed by the facROC function facBinary:
    fB <- facBinary(formula, id, gold, data, logit=FALSE)
Hereby the factorial structure of the trial can be taken into consideration with the help of the formula parameter that specifies the model in the usual way (e.g., formula = testresult ~ rater * method). The parameter "id" indicates the patient's id and the parameter "gold" assigns the patient's true health status. (For more details as well as more options and parameters, see the facROC manual.) To calculate and evaluate predictive values, the result of the analysis of sensitivity and specificity (i.e., a facBinary object) can be passed to the facPV function:
    facPV(fB, prev, logit=FALSE, test=FALSE)
The prevalence parameter "prev" has to be passed to the facPV function as a two-dimensional vector: prev = c(diseased patients in prevalence study, number of patients). The options "logit" and "test" are logical flags indicating whether a logistic model should be fitted and whether hypotheses on the predictive values should be tested.
If the data of the original study determining sensitivity and specificity are not available and hence the function facBinary cannot be called, the function facPV can be called with summary statistics instead:
    facPV(se, sp, n1, n0, prev, logit=FALSE)
Hereby "se" denotes a vector of sensitivities under different conditions and "sp" denotes the corresponding vector of specificities. Note that it is also possible to pass one-dimensional vectors to the function facPV. "n1" and "n0" are the sample sizes of diseased and non-diseased patients used to estimate "se" and "sp". Again the logical flag "logit" indicates whether logistic confidence intervals should be computed. If the parameters to determine predictive values are provided without a facBinary object, the test option is not available: the test statistic combines the predictive values of different conditions, which may be dependent if the corresponding sensitivities and specificities are dependent, and the covariance matrices of sensitivity and specificity are not at hand in this case. As confidence intervals are computed for each condition separately, they can nevertheless be calculated.
The package facROC will shortly be available on CRAN. Currently it is available at http://github.com/KatharinaLange. To install directly from GitHub, the package devtools is needed (available on CRAN).

Discussion
In this paper, we suggested a new method to translate the results of diagnostic trials for use in clinical practice by means of predictive values. The proposed method provides an approach to calculate confidence intervals according to factors influencing the risk of disease. As in our approach the pre-test probability of disease has been assessed in prior studies, no prevalence needs to be estimated from the data of the current trial. Thus this method of analysis can also be used for calculating predictive values in case-control studies where estimating prevalences is not possible. Note that in our approach, we assume that the prevalence is independent of sensitivity and specificity, which means that sensitivity and specificity have to be homogeneous in different risk groups. Thus, with this methodology, it is possible to estimate predictive values for risk groups that are not included in the original trial. Note that the assumption of homogeneity has to be considered carefully before this method is applied. For example, the accuracy of imaging devices as well as the risk of disease might depend on the patient's BMI. In this case, sensitivity and specificity are no longer equal in the different risk groups and hence a stratified estimation for each group has to be performed. We, furthermore, considered a set-up in which it is possible to compare the predictive values of different diagnostic tests by means of the ANOVA-type statistic. Many diagnostic trials are imaging studies, in which the images are mostly evaluated by several readers. As our approach uses the multivariate delta theorem, this method of analysis can easily be extended to multiple reader diagnostic trials by using a vector of indices (r, m) indicating the reader and the method. Hypotheses can be tested by choosing appropriate contrast matrices as in the theory of linear models. Furthermore, confidence intervals for arbitrary contrasts $c' p_{g\pm}$ can be computed.
Hence, in multiple reader trials we can assess the difference between two diagnostic tests by averaging over the different readers [9,10]. In clinical practice, an initial suspicion is sometimes confirmed not by a single diagnostic test but by several. In this case, the pre-test probability of disease increases with each positive diagnostic result. In order to calculate and analyze predictive values in these cases, some information is required: 1. the probability of disease before the first test was carried out, i.e., the "pre-testing" probability of disease, which might depend on several risk factors and has to be determined from prevalence studies; 2. the sensitivity and the specificity of each diagnostic test performed as well as the correlation between these tests.
With the help of the second item, a global sensitivity and a global specificity for the whole testing procedure can be calculated. (This might be a complex problem if different diagnostic tests are dependent. For more details, see, e.g., [11].) In combination with the pre-testing probability of disease, predictive values can now be calculated in the way proposed here. Note that a pre-test or a pre-testing probability is always required when predictive values are computed. The methodology developed might help to answer two of the most important questions to clinicians: "How likely is it that the patient has the condition?" and "How likely is it that the patient is free of disease?" Nevertheless, it is important to point out that predictive values cannot replace sensitivity and specificity. As predictive values have a more concrete and thus more user-friendly interpretation than sensitivity and specificity, they might also be considered as accuracy assessments, when the usefulness of a new diagnostic agent is evaluated. However, because these measurements depend on the prevalence, regulatory authorities advise to be careful when using predictive values for the evaluation of diagnostic trials. The EMEA states "predictive values must be reported with caution and only when the study sample is considered to be representative of the prevalence in the real world" [12] and the FDA recommends that "the trials include the intended population in the appropriate clinical setting" [13]. Following these recommendations, predictive values are calculated for a patient with a mean pre-test risk of disease but the results of the evaluation will not be valid for a patient with a known higher or lower probability of disease. Hence, we achieve a result for an average patient but no general result. But the main purpose of a diagnostic trial is to evaluate whether or not a new diagnostic agent increases the probability of a correct diagnosis in general. 
In contrast, sensitivity and specificity are able to assess the effect of a new diagnostic agent independent of any prior probability of disease and any prevalence. Thus sensitivity and specificity allow us to assess the quality of a new diagnostic agent in general. Therefore, predictive values should rather be avoided when the usefulness of a new diagnostic agent is evaluated and they should only be calculated for the use in clinical practice.
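A. Proof of the Theorem
Applying the multivariate central limit theorem to the estimator of the sensitivities leads to

$$\sqrt{n_1}\,(\widehat{se} - se) \xrightarrow{d} N(0, V_{se}), \qquad (3)$$

where $V_{se}$ denotes the covariance matrix of $X_{11}$. Applying the central limit theorem to the estimator of the specificities similarly leads to

$$\sqrt{n_0}\,(\widehat{sp} - sp) \xrightarrow{d} N(0, V_{sp}), \qquad (4)$$

where $V_{sp} = Cov(X_{01})$. For the prevalences in the different risk groups, the univariate central limit theorem leads to

$$\sqrt{m_g}\,(\hat\pi_g - \pi_g) \xrightarrow{d} N(0, \sigma_g^2), \qquad (5)$$

where $\sigma_g^2 = \pi_g(1 - \pi_g)$. As $N/n_i \to d_i$, i = 0, 1, and $N/m_g \to e_g$, g = 1, . . . , G, by assumption, Equations (3)-(5) can be rewritten as

$$\sqrt{N}\,(\widehat{se} - se) \xrightarrow{d} N(0, d_1 V_{se}), \quad \sqrt{N}\,(\widehat{sp} - sp) \xrightarrow{d} N(0, d_0 V_{sp}), \quad \sqrt{N}\,(\hat\pi_g - \pi_g) \xrightarrow{d} N(0, e_g \sigma_g^2).$$

The estimators of sensitivity, specificity and prevalence are independent random variables and thus we obtain that

$$\sqrt{N}\,\big((\widehat{se}', \widehat{sp}', \hat\pi_g)' - (se', sp', \pi_g)'\big) \xrightarrow{d} N\big(0,\; d_1 V_{se} \oplus d_0 V_{sp} \oplus e_g \sigma_g^2\big),$$

where ⊕ denotes the direct sum. The functions $f_+$ and $f_-$ given in Equations (1) and (2) map sensitivity, specificity and the prevalence onto the positive and the negative predictive value, respectively. Let $f_+((se', sp', \pi_g)') = (f_+(se^{(1)}, sp^{(1)}, \pi_g), \dots, f_+(se^{(M)}, sp^{(M)}, \pi_g))'$ and $f_-((se', sp', \pi_g)') = (f_-(se^{(1)}, sp^{(1)}, \pi_g), \dots, f_-(se^{(M)}, sp^{(M)}, \pi_g))'$ denote the multivariate versions of $f_+$ and $f_-$, and let $Df_+ = Df_+((se', sp', \pi_g)')$ and $Df_- = Df_-((se', sp', \pi_g)')$ denote the corresponding Jacobian matrices of all first-order partial derivatives at position $(se', sp', \pi_g)'$. Then, applying Cramer's δ theorem leads to

$$\sqrt{N}\,(\hat p_{g\pm} - p_{g\pm}) \xrightarrow{d} N\big(0,\; Df_\pm\,(d_1 V_{se} \oplus d_0 V_{sp} \oplus e_g \sigma_g^2)\,Df_\pm'\big).$$

Now $Df_+$ and $Df_-$ are estimated by $\widehat{Df}_+ = Df_+((\widehat{se}', \widehat{sp}', \hat\pi_g)')$ and $\widehat{Df}_- = Df_-((\widehat{se}', \widehat{sp}', \hat\pi_g)')$, respectively. The quantities $d_i$ are estimated by $N/n_i$, i = 0, 1, and $e_g$ is estimated by $N/m_g$, for all g = 1, . . . , G. We further estimate the covariance matrices $V_{se}$ and $V_{sp}$ by the sample covariance matrices

$$\hat V_{se} = \frac{1}{n_1 - 1}\sum_{k=1}^{n_1}(X_{1k} - \widehat{se})(X_{1k} - \widehat{se})' \quad \text{and} \quad \hat V_{sp} = \frac{1}{n_0 - 1}\sum_{k=1}^{n_0}(X_{0k} - \widehat{sp})(X_{0k} - \widehat{sp})',$$

respectively. For convenience, we further use the unbiased empirical variance $\hat\sigma_g^2 = \frac{m_g}{m_g - 1}\cdot\hat\pi_g\cdot(1 - \hat\pi_g)$ as the estimator of $\sigma_g^2$. Plugging in these empirical counterparts and applying Slutsky's theorem hence leads to our main result: for each risk group g = 1, . . . , G, the statistics $\sqrt{N}(\hat p_{g+} - p_{g+})$ and $\sqrt{N}(\hat p_{g-} - p_{g-})$ have, asymptotically, a multivariate normal distribution with mean 0 and covariance matrices

$$V_{g+} = Df_+\,(d_1 V_{se} \oplus d_0 V_{sp} \oplus e_g \sigma_g^2)\,Df_+' \quad \text{and} \quad V_{g-} = Df_-\,(d_1 V_{se} \oplus d_0 V_{sp} \oplus e_g \sigma_g^2)\,Df_-',$$

respectively, which are consistently estimated by replacing all unknown quantities with the estimators given above.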

B. t-Approximation for Small Sample Sizes
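In order to increase the quality of our methods for small sample sizes, a $t_\nu$-approximation of the distribution of $\sqrt{N}\,(\hat p^{(m)}_{g\pm} - p^{(m)}_{g\pm}) / \sqrt{\hat v_{g\pm}[m,m]}$ is also provided. To assess the degrees of freedom ν of the t-distribution, we use an approach based on the Box approximation [14]: the distribution of $\hat v_{g\pm}[m,m]$ is approximated by a scaled $\chi^2_\nu$-distribution, i.e., by the distribution of a random variable $g \cdot Z_\nu$, where $Z_\nu \sim \chi^2_\nu$ and ν and g are constants such that the first two moments coincide. We hence determine ν by

$$\nu = \frac{2\,\big[E(\hat v_{g\pm}[m,m])\big]^2}{Var(\hat v_{g\pm}[m,m])} = \frac{2\,\Big[\big(\tfrac{\partial f}{\partial se}\big)^2 d_1\,E(\hat v_{se}[m,m]) + \big(\tfrac{\partial f}{\partial sp}\big)^2 d_0\,E(\hat v_{sp}[m,m]) + \big(\tfrac{\partial f}{\partial \pi}\big)^2 e_g\,E(\hat\sigma_g^2)\Big]^2}{\big(\tfrac{\partial f}{\partial se}\big)^4 d_1^2\,Var(\hat v_{se}[m,m]) + \big(\tfrac{\partial f}{\partial sp}\big)^4 d_0^2\,Var(\hat v_{sp}[m,m]) + \big(\tfrac{\partial f}{\partial \pi}\big)^4 e_g^2\,Var(\hat\sigma_g^2)}, \qquad (6)$$

where $\hat v_{se}[m,m]$ and $\hat v_{sp}[m,m]$ denote the empirical variances of the $X^{(m)}_{ik}$, k = 1, . . . , $n_i$, for i = 1 and i = 0. ∂f/∂se is the partial derivative of $f = f_\pm$ with respect to se at $(se^{(m)}, sp^{(m)}, \pi_g)'$. The partial derivatives ∂f/∂sp and ∂f/∂π are defined analogously. As ν contains unknown parameters, ν itself is unknown and has to be estimated. We, hence, estimate the partial derivatives at $(se^{(m)}, sp^{(m)}, \pi_g)'$ by the partial derivatives at $(\widehat{se}^{(m)}, \widehat{sp}^{(m)}, \hat\pi_g)'$ and further estimate $d_i$ by $N/n_i$, i = 0, 1, and $e_g$ by $N/m_g$. As $\hat v_{se}[m,m]$, $\hat v_{sp}[m,m]$ and $\hat\sigma_g^2$ are unbiased by construction, the numerator can be determined easily by means of these estimators. For the denominator, the variances of the empirical variances are needed. We, therefore, obtain for the variance of $\hat v_{se}[m,m]$:

$$Var(\hat v_{se}[m,m]) = \frac{1}{n_1}\Big(\mu_4 - \frac{n_1 - 3}{n_1 - 1}\,\sigma^4\Big), \qquad \sigma^2 = se^{(m)}(1 - se^{(m)}), \quad \mu_4 = \sigma^2\,(1 - 3\sigma^2),$$

where $\mu_4$ denotes the fourth central moment of $X^{(m)}_{1k}$; the variances of $\hat v_{sp}[m,m]$ and $\hat\sigma_g^2$ are obtained analogously. As the first four moments of a binomial distribution can easily be determined, the estimator of ν can be obtained by plugging all estimates into Equation (6).

C. Simulation Results for the Negative Predictive Value
Table A1. Summary of p_− coverage probabilities where the cell values denote the coverage probability for one fixed parameter and averaging over the remaining parameters. Hereby N-Approx and t-Approx are abbreviations for normal and t-approximation. π fix denotes the method of [1], where π is assumed to be fixed.
[Table A1: columns Fixed Parameter; Additive (N-Approx, t-Approx, π fix); Logistic (N-Approx, t-Approx, π fix); Failure (logistic & t-Approx); table body not recovered.]
Table A2. Summary of p_− confidence interval lengths where the cell values denote the confidence interval length for one fixed parameter and averaging over the remaining parameters. Hereby N-Approx and t-Approx are abbreviations for normal and t-approximation. π fix denotes the method of [1], where π is assumed to be fixed.
[Table A2: table body not recovered.]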