A method to compute multiplicity corrected confidence intervals for odds ratios and other relative effect estimates.

Epidemiological studies commonly test multiple null hypotheses. In some situations it may be appropriate to account for multiplicity using statistical methodology rather than simply interpreting results with greater caution as the number of comparisons increases. Given the one-to-one relationship that exists between confidence intervals and hypothesis tests, we derive a method based upon the Hochberg step-up procedure to obtain multiplicity corrected confidence intervals (CI) for odds ratios (OR) and by analogy for other relative effect estimates. In contrast to previously published methods that explicitly assume knowledge of P values, this method only requires that relative effect estimates and corresponding CI be known for each comparison to obtain multiplicity corrected CI.


Introduction
Testing the statistical significance of multiple null hypotheses is a routine practice in epidemiologic and other types of biomedical research. By chance, the probability of wrongly rejecting one or more null hypotheses increases with the number of comparisons tested [1]. This is referred to as "multiplicity bias." Various methods have been presented in the literature for controlling the type I error in the context of multiple hypothesis testing. The classic Bonferroni inequality [2] provides a simple distribution-free method for multiplicity P value correction. Letting $\alpha_i$ denote the probability that hypothesis $S_i$ is incorrect, the Bonferroni probability for the joint null hypothesis may be written as:
$$P\left(p_i \le \alpha_i \ \text{for at least one } i\right) \le \sum_{i=1}^{n} \alpha_i = \alpha \quad (1)$$
where $p_i$ denotes the P value corresponding to the $i$th null hypothesis. In the simple case, $\alpha$ is apportioned evenly among the tests. Although the family-wise (FWER) and per family (PFER) error rates are preserved at the $\alpha$ level of significance, the Bonferroni procedure is known to be conservative, especially for highly correlated test statistics (i.e., the type I error probability is less than the nominal level of $\alpha$). For example, in a study of multiple genetic polymorphisms, the procedure assumes that all variants being tested have an equal probability of being truly associated with the outcome of interest, which leads to overcorrection [3]. The first order Bonferroni inequality may be improved upon given knowledge of the joint bivariate probabilities [2,4,5] or when the absolute value of the correlation coefficient is greater than 0.5 [2,6]. However, these improvements have been limited in applied practice due to their restrictive nature. Several multiple testing procedures [7][8][9] based upon the "closure method" [10] and "Simes equality" [11] have been introduced and shown to be more powerful than the Bonferroni method for testing the intersection hypothesis [12][13].
Of the closure method based options, the Hochberg step-up multiple comparisons procedure [7] has gained popularity as being "easier to apply" than the more powerful procedures of Hommel [9] and Rom. The procedure also is uniformly more powerful than the Bonferroni-based, sequentially-rejective method of Holm [14] in many applied situations, e.g., when test statistics are uncorrelated, follow a multivariate normal or $T^2$ distribution, or are model independent [15][16][17]. Given an ordered set of P values, i.e., $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, the Hochberg procedure rejects all hypotheses $H_{(i)}$ with $i \le j$ if $p_{(j)} < \alpha/(n-j+1)$ for any $j = 1, \dots, n$. P values are incrementally corrected in order from smallest to largest by multiplying $p_{(j)}$ by $(n-j+1)$, wherein the multiplicative factor for the largest P value is unity, so that this P value remains the same after multiplicity correction.
Many researchers and journal editors increasingly recognize confidence intervals (CI) as the preferred measure for conveying statistical uncertainty of effect size estimates such as odds ratios (OR), relative risks (RR), and hazard ratios (HR), as P values have been commonly misunderstood and misinterpreted in the literature [18][19][20][21][22]. Similar to hypothesis testing by way of P values, CI also may be corrected for multiplicity to minimize the risk of making false-positive inferences. Several authors have provided techniques to correct CI for multiple hypothesis testing [23][24][25][26]. However, most of the methods are computationally intensive or mathematically complex, and more importantly, none provide a way to correct CI when corresponding P values are not provided for the individual hypothesis tests.
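The Hochberg step-up rule described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hochberg_step_up` is hypothetical, the comparison uses "less than or equal to" (the text writes a strict inequality, which is equivalent for continuous P values), and capping corrected P values at 1 is an added assumption.

```python
from typing import List, Tuple


def hochberg_step_up(p_values: List[float],
                     alpha: float = 0.05) -> Tuple[List[bool], List[float]]:
    """Return (reject flags, corrected P values) in the original input order."""
    n = len(p_values)
    # Indices that sort the P values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])

    # Step-up decision: scan from the largest P value downward and stop at the
    # first rank j (1-based) whose P value meets alpha / (n - j + 1).
    largest_j = 0
    for rank in range(n, 0, -1):
        if p_values[order[rank - 1]] <= alpha / (n - rank + 1):
            largest_j = rank
            break

    reject = [False] * n
    corrected = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # All hypotheses ranked at or below largest_j are rejected.
        reject[idx] = rank <= largest_j
        # Simple multiplicative correction p_(j) * (n - j + 1);
        # the cap at 1 is an assumption added here.
        corrected[idx] = min(1.0, (n - rank + 1) * p_values[idx])
    return reject, corrected


if __name__ == "__main__":
    # Hypothetical P values, for illustration only.
    flags, adjusted = hochberg_step_up([0.001, 0.2, 0.03])
    print(flags, adjusted)
```

The largest P value is multiplied by 1 and is therefore returned unchanged, which is the property the CI method relies on.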
Below, we present a method to compute multiplicity corrected CI for OR and by analogy for other measures of relative risk, when no P values have been explicitly provided. This computationally simple method based upon the Hochberg step-up procedure only requires knowledge of individual test OR and CI, and the number of comparisons being tested.

Methodology
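The derivation of multiplicity corrected confidence intervals for a set of n OR involves expressing the standard error (SE) for the logarithm of $\mathrm{OR}_i$ (i = 1 to n) in terms of the lower confidence limit (LCI) for $\mathrm{OR}_i$. Letting:
$$\mathrm{LCI}_i = \exp\left\{\log(\mathrm{OR}_i) - z_{(1-\alpha/2)}\,\mathrm{SE}[\log(\mathrm{OR}_i)]\right\} \quad (2)$$
where $z_{(1-\alpha/2)}$ is the $100 \times (1-\alpha/2)$th percentile of a standard normal distribution, and solving for $\mathrm{SE}[\log(\mathrm{OR}_i)]$, we see that:
$$\mathrm{SE}[\log(\mathrm{OR}_i)] = \frac{\log(\mathrm{OR}_i) - \log(\mathrm{LCI}_i)}{z_{(1-\alpha/2)}} \quad (3)$$
Substituting the right hand side of (3) into the equation for the 2-tailed z test statistic gives:
$$z_i = \frac{\log(\mathrm{OR}_i)}{\mathrm{SE}[\log(\mathrm{OR}_i)]} = \frac{z_{(1-\alpha/2)}\,\log(\mathrm{OR}_i)}{\log(\mathrm{OR}_i) - \log(\mathrm{LCI}_i)} \quad (4)$$
The corresponding P value is computed as:
$$p_i = 2\left[1 - \Phi(|z_i|)\right] \quad (5)$$
where $\Phi$ denotes the standard normal cumulative distribution function. Ordering the P values from smallest to largest,
$$p_{(1)} \le p_{(2)} \le \dots \le p_{(n)} \quad (6)$$
the Hochberg corrected P value for the $j$th ordered comparison is:
$$p^*_{(j)} = (n-j+1)\,p_{(j)} \quad (7)$$
Where:
$$z^*_{(j)} = \Phi^{-1}\left(1 - p^*_{(j)}/2\right) \quad (8)$$
gives the Hochberg corrected standard error for the logarithm of $\mathrm{OR}_{(j)}$, i.e.:
$$\mathrm{SE}^*[\log(\mathrm{OR}_{(j)})] = \frac{|\log(\mathrm{OR}_{(j)})|}{z^*_{(j)}} \quad (9)$$
The multiplicity corrected $(1-\alpha) \times 100\%$ CI for $\mathrm{OR}_{(j)}$ based upon the Hochberg step-up procedure can then be computed by substituting the above standard error from eq. 9 into the following basic equation:
$$\exp\left\{\log(\mathrm{OR}_{(j)}) \pm z_{(1-\alpha/2)}\,\mathrm{SE}^*[\log(\mathrm{OR}_{(j)})]\right\} \quad (10)$$
By analogy, replacing OR in the above equations with other relative effect estimates such as RR or HR gives the corresponding multiplicity corrected CI for these measures. When P values are directly available for the individual hypothesis tests, the Hochberg multiplicity corrected CI may be computed directly beginning with eq. 7. Furthermore, if the hypothesis test is 1-sided, then $\alpha$ must be multiplied by 2 in the above equations.
[…] 1.513, 1.068, 0.830), and 95% CI (0.09-32, 0.14-9.3, 1.3-33). The multiplicity corrected CI for Factor 1 and Factor 3 are considerably wider than the corresponding uncorrected intervals, thus indicating a greater degree of variability for the estimated OR. In the case of Factor 2, the uncorrected and corrected CI are the same, since Factor 2 had the highest P value of the 3 comparisons when applying the Hochberg algorithm.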
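The steps of the Methodology derivation (recover the SE from the lower confidence limit, form the z statistic and P value, apply the Hochberg factor, and rebuild the interval) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' own code: the function name `multiplicity_corrected_ci`, the cap of corrected P values at 1, the unbounded interval returned when the corrected P value reaches 1, and the input values in the usage example are all hypothetical. It relies only on `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist
from typing import List, Tuple

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def _safe_exp(x: float) -> float:
    """exp(x) that returns inf instead of raising OverflowError."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf


def multiplicity_corrected_ci(odds_ratios: List[float],
                              lower_limits: List[float],
                              alpha: float = 0.05) -> List[Tuple[float, float]]:
    """Hochberg-corrected CIs from ORs and their lower confidence limits.

    Assumes 0 < lower limit < OR inputs are valid and that no P value
    underflows to exactly 0; neither condition is checked here.
    """
    n = len(odds_ratios)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)

    log_or = [math.log(o) for o in odds_ratios]
    # Eq. 3: SE of log(OR) recovered from the lower confidence limit.
    se = [(b - math.log(lo)) / z_crit for b, lo in zip(log_or, lower_limits)]
    # Eq. 4 and 5: z statistic and 2-sided P value (lower tail avoids
    # cancellation for small P values).
    p = [2 * _STD_NORMAL.cdf(-abs(b / s)) for b, s in zip(log_or, se)]

    # Eq. 6: rank the comparisons from smallest to largest P value.
    order = sorted(range(n), key=lambda i: p[i])
    intervals: List[Tuple[float, float]] = [(0.0, math.inf)] * n
    for rank, idx in enumerate(order, start=1):
        # Eq. 7, with a cap at 1 (assumption added for this sketch).
        p_star = min(1.0, (n - rank + 1) * p[idx])
        if p_star >= 1.0:
            # z* would be 0 and the SE infinite: leave the interval unbounded.
            continue
        z_star = -_STD_NORMAL.inv_cdf(p_star / 2)   # Eq. 8
        se_star = abs(log_or[idx]) / z_star         # Eq. 9
        half_width = z_crit * se_star
        intervals[idx] = (_safe_exp(log_or[idx] - half_width),   # Eq. 10
                          _safe_exp(log_or[idx] + half_width))
    return intervals


if __name__ == "__main__":
    # Hypothetical ORs and lower 95% limits, for illustration only.
    for ci in multiplicity_corrected_ci([2.0, 1.5, 3.0], [1.2, 0.8, 1.1]):
        print(ci)
```

In the hypothetical input, the second comparison has the largest P value, so its corrected interval coincides with the uncorrected one, while the other two intervals widen.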

Example
In this example, the conclusions regarding the association (or lack thereof) of (D) and the exposure do not substantively change after correction for multiplicity, thus lending weight to what otherwise might be only cautious interpretation referencing the possibility of a chance observation due to multiple comparisons. However, in other situations where CI is close to containing unity, a null hypothesis might no longer be rejected at least in strict statistical terms after correction for multiplicity.

Discussion
Confidence intervals for OR, RR and other relative effect estimates are commonly reported in epidemiologic and public health literature without correction for multiple hypothesis testing. The failure to account for multiplicity may lead to inflation of type I error and overinterpretation of any apparently "positive" findings. In the current paper, we show how CI for relative effect size estimates such as OR may be corrected for multiplicity by use of the Hochberg step-up procedure, a "closed-testing" method for protecting against making excessive false-positive inferences due to multiple comparisons.
Our method has several strengths. The corrected CI are simple to compute in standard statistical software packages that have function routines for determining percentiles and areas under a curve for a normal distribution. Since P values are not required for the original hypothesis tests, multiplicity corrected CI may be computed post hoc (when estimates are reported with sufficient precision) from publications that only report values for effect size estimates and corresponding CI. When the test statistics are uncorrelated, control of the family-wise type I error probability is theoretically guaranteed by the Hochberg step-up procedure. Simulation results also show that the Hochberg step-up procedure holds for many commonly encountered dependent test statistics [27].
Table notes: The multiplicity adjusted and unadjusted 95% CI are equal for Factor 2 because its unadjusted P value was the highest of the 3 comparisons, and thus the multiplicative factor for $p_{(j)}$ in equation (7) equals 1. * Multiplicity adjusted estimates.
Several limitations must be observed when applying our procedure for computing CI. The technique is not applicable when "exact sampling distribution" methods have been used to make statistical inferences. The Hochberg multiplicity correction also will inflate P values and related CI when one or more of the hypothesis tests involve a multi-level, logically related categorical variable (e.g., current smoker, former smoker, never smoker). In this case, it is unnecessary to correct CI for multiplicity for a logically related variable in multivariate space. The computed multiplicity corrected CI will be an approximate solution when the decimal accuracy is limited for the original OR and CI values. Accordingly, it is generally recommended that at least 2 or 3 significant digits of accuracy be available for published estimates when using this method in a post hoc manner to compute multiplicity corrected confidence intervals.
Additionally, the rule for computing $p^*_{(j)}$ (eq. 7) in rare cases may lead to an anomaly wherein $p^*_{(j)}$ but not $p^*_{(j-1)}$ will achieve statistical significance. In this situation, one might apply the de facto variation of multiplying $p_{(j)}$ and lesser ranked P values by j to obtain the corresponding Hochberg corrected P values [28]. And finally, the method should not be used if the logarithm of the effect estimate does not follow a normal distribution, or if the underlying observations are not independent and identically distributed.
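The ordering anomaly mentioned above is easy to reproduce numerically. The sketch below uses hypothetical P values and two hypothetical helper functions. The first applies the simple multiplicative rule of eq. 7. The second enforces monotonicity with a running minimum taken from the largest P value downward, which is the form of the Hochberg adjustment used by common statistical software such as R's `p.adjust`. The running-minimum form is shown as one possible remedy and is not claimed to be the variation cited in [28].

```python
from typing import List


def simple_corrected(p_sorted: List[float]) -> List[float]:
    """Multiply the j-th smallest P value by (n - j + 1), capped at 1."""
    n = len(p_sorted)
    return [min(1.0, (n - j + 1) * p) for j, p in enumerate(p_sorted, start=1)]


def monotone_corrected(p_sorted: List[float]) -> List[float]:
    """Same rule, followed by a running minimum from the largest rank down."""
    adjusted = simple_corrected(p_sorted)
    # Walk from the second-largest rank to the smallest so that no corrected
    # P value exceeds the corrected value of a higher-ranked comparison.
    for j in range(len(adjusted) - 2, -1, -1):
        adjusted[j] = min(adjusted[j], adjusted[j + 1])
    return adjusted


if __name__ == "__main__":
    # Hypothetical P values, already sorted from smallest to largest.
    p = [0.02, 0.03, 0.04]
    print(simple_corrected(p))    # two smallest exceed 0.05, largest does not
    print(monotone_corrected(p))  # all three fall at or below the largest
```

With these inputs the simple rule leaves the largest P value below 0.05 while pushing the two smaller ones above it, whereas the running-minimum form agrees with the step-up decision that all three hypotheses are rejected.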
It also is important to note that correction for multiplicity may not be necessary or even desirable in some situations [29][30][31][32][33]. For example, correction for multiplicity may be unnecessary when an a priori biologic mechanism of action exists for an independent variable that manifests a linear dose response in relationship to the outcome variable. Similarly, multiplicity correction may not be desirable when attempting to control type II errors, as the latter will be inflated by virtue of decreasing type I errors [31]. Furthermore, multiplicity correction based upon the "universal null hypothesis," which tests that two groups are identical for all comparisons between variables, fails to take into account which and how many variables differ if the joint hypothesis is rejected [31]. Methods to correct for multiplicity also do not account for the inclusion of hypotheses that are biologically improbable or otherwise indefensible, which unnecessarily inflate the probability of incorrectly rejecting the joint null hypotheses [18,29]. Philosophically, some researchers believe that the "primary" purpose of CI is to indicate a range of parameter values consistent with the data rather than to serve as de facto hypothesis tests based on whether or not they include 1.0. Another salient concern regarding the appropriateness of multiplicity correction techniques is "how does one choose the universe for the number of comparisons." Clearly, multiplicity adjustment remains a debated topic with diverse opinions presented in the literature [34][35].
In the early days of the development of stepwise and closed tests for the control of type I error in multiple hypothesis testing, epidemiologists and statisticians commonly believed that joint CI could not be constructed for these procedures. However, it has been shown since that standard methods for constructing CI also readily apply to common stepwise multiplicity procedures. [23][24] Here, we have expanded on the seminal work of these researchers to develop a simple method for computing multiplicity corrected CI for standard estimates of effect size. Although our derivation has focused on the case of binary predictor variables, it is possible that similar principles might be developed and applied to obtain joint confidence sets in the more complex case of multilevel categorical variables.

Conclusions
Although the most effective strategy to minimize type I error related to multiple comparisons is to simply reduce the number of comparisons, this in effect penalizes the researcher for conducting a more informative multivariable study [32].
Statistical correction for multiple comparisons is not a substitute for the parsimonious and epidemiologically prudent selection, during the design phase of a study, of hypotheses to test. Nor should it be used in lieu of careful and informed interpretation of the results, taking into account biological plausibility (or lack thereof) and the results of prior studies. However, when statistical correction for multiple comparisons is appropriate, as is the case in many but not all situations, the method we present may have application as a supportive measure. A key advantage of this method is its correspondence with CI, which are typically more informative, and potentially more readily available, than P values.