This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In complex mixture toxicology, there is growing emphasis on testing environmentally representative doses that improve the relevance of results for health risk assessment, but are typically much lower than those used in traditional toxicology studies. Traditional experimental designs with typical sample sizes may have insufficient statistical power to detect effects caused by environmentally relevant doses. Proper study design, with adequate statistical power, is critical to ensuring that experimental results are useful for environmental health risk assessment. Studies with environmentally realistic complex mixtures have practical constraints on sample concentration factor and sample volume as well as the number of animals that can be accommodated. This article describes methodology for calculation of statistical power for non-independent observations for a multigenerational rodent reproductive/developmental bioassay. The use of the methodology is illustrated using the U.S. EPA’s Four Lab study in which rodents were exposed to chlorinated water concentrates containing complex mixtures of drinking water disinfection by-products. Possible experimental designs included two single-block designs and a two-block design. Considering the possible study designs and constraints, a design of two blocks of 100 females with a 40:60 ratio of control:treated animals and a significance level of 0.05 yielded maximum prospective power (~90%) to detect pup weight decreases, while providing the most power to detect increased prenatal loss.

Toxicological investigation of environmental chemical mixtures is evolving, with attention focused on defined mixtures, involving a limited number of chemicals, and complex environmental mixtures, involving a large number of chemicals and, typically, an unidentified fraction. Dosing regimens are being developed to evaluate the toxicity of complex mixtures using approaches consistent with human environmental exposures that are typically much lower than those used in traditional toxicology studies. Consequently, newer studies are designed to evaluate the toxicity of complex mixtures (1) with the inclusion of low, environmentally relevant dose levels; (2) with the relative proportions of component chemicals similar to those measured in environmental samples; and, (3) with approaches that maintain the chemically unidentified components of the mixture [

Disinfection of drinking water to control microbial contamination provides an essential public health benefit by reducing water-borne disease. However, oxidizing disinfectants react with materials in the source water, forming a wide variety of disinfection by-products (DBPs). DBP mixtures are highly complex, containing numerous chemicals that are not routinely measured and many that are unknown; approximately 50% of the total organic halide compounds formed when water is disinfected remain unidentified [

Because concerns identified from epidemiologic studies on whole DBP mixtures cannot be readily addressed by investigating either individual DBPs or simple, defined DBP mixtures, scientists from four of the national laboratories and centers of the U.S. EPA’s Office of Research and Development have developed and, along with extramural partners, undertaken a research project (the Four Lab Study) that integrates toxicological and chemical evaluation of environmentally realistic complex mixtures of DBPs [

The resulting database of toxicological and analytical chemistry data on the whole DBP mixture provides important information for health risk assessment of DBPs [

Experimental constraints considered in the design of the multigenerational bioassay included: the number of dams that could be accommodated at one time (maximum of 100); the extent that water could be concentrated while retaining palatability and conserving organics (a concentration factor of 136× for total organic carbon was achieved for use in the multigenerational bioassay [

In the multigenerational bioassay, timed-pregnant Sprague-Dawley rats comprising the parental (P_{0}) generation would be assigned randomly to either a control group or a treatment group which would consume chlorinated water concentrate. Each P_{0} dam was to deliver a litter (F_{1} generation). An issue addressed in the present work was whether to breed one or two females from each F_{1} litter to a non-sibling F_{1} male from the same exposure group to produce the F_{2} generation. Priority study endpoints were prenatal loss (number of uterine implantation sites minus number of live pups at birth, divided by implantation sites) and pup birth weight. In comparison with epidemiologic endpoints of concern, prenatal loss is analogous to spontaneous abortion, whereas reduced pup weight is analogous to small for gestational age and term low birth weight.
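As a small illustration of the prenatal loss definition above, the following Python sketch computes the endpoint for a single litter. The function name and the example counts are hypothetical and are not taken from the study data.

```python
def prenatal_loss(implantation_sites: int, live_pups: int) -> float:
    """Proportion of implantation sites that did not yield a live pup at birth."""
    if implantation_sites <= 0:
        raise ValueError("a litter needs at least one implantation site")
    if not 0 <= live_pups <= implantation_sites:
        raise ValueError("live pups must lie between 0 and the number of implantation sites")
    return (implantation_sites - live_pups) / implantation_sites


# Hypothetical litter: 13 implantation sites, 12 live pups at birth.
example = prenatal_loss(13, 12)
print(f"prenatal loss = {example:.3f}")  # prints "prenatal loss = 0.077"
```

Because the endpoint is a proportion computed per litter, each implantation site acts as a binary repeated measurement within that litter.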

U.S. EPA testing guidelines for reproductive toxicity call for 20 pregnant females per group as the standard protocol for single chemical bioassays [

In this study, the individual pups within each litter represent repeated measurements that are not independent. A compound symmetric correlation structure was assumed, so that the correlation between any two pups in a litter was equal. Power and sample size calculations must account for the correlation; therefore, the methodology developed by Rochon [

As with less complicated sample size procedures, estimates of the group means and variances were required to calculate power. For the generalized-estimating equation (GEE)-adapted sample size methodology, estimates for the correlation between measurements and the over-dispersion factor (if appropriate) were also required. Estimates were derived by modeling data from Narotsky et al. The hypotheses of interest were H_{0}: μ_{1} = μ_{2} versus H_{α}: μ_{1} ≠ μ_{2}, where μ_{1} and μ_{2} are the means of the two groups.

Pup weight at birth and prenatal loss were the focus of the power calculations for the Four Lab study. After conducting the study, statistical tests are to be performed independently for these two endpoints, and the Type I error rate will be set to α = 0.05 for each individual test. Based on the data of Narotsky

For analyzing pup weight, male and female pups were considered separately as well as combined; the results presented in this paper are for the combined male and female pups. The litter was treated as the experimental unit, with each pup within a litter representing a repeated measurement. A linear model was assumed for the data, so that:

where y_{ijk} is the weight of the k^{th} live pup in the j^{th} litter of the i^{th} group (where

and:

In this model, the pup weights, _{ijk}_{i}

An additional key endpoint, prenatal loss, was also examined with respect to power. For prenatal loss, litter again represented the experimental unit, with each implantation site representing a repeated measurement. The endpoint was then coded as:

where

If _{i}^{th} group, then under a linear model:

where _{1}_{j}_{2}_{j}

where _{i}_{j}_{2}_{j}

The linear and linear logistic models differ based upon the alternative hypotheses being tested. In both the linear and linear logistic models for prenatal loss a binomial error distribution was assumed for the model. In the linear model, the alternative hypothesis _{α}_{1}_{2}_{α}_{1}_{2}_{i}

For both the pup weight data and the prenatal loss data, an explicit relationship exists between the over-dispersion parameter, which represents the proportion of the observed variability that is due to the correlation among the observations, and the intra-litter correlation. For both distributions, the over-dispersion parameter,

where _{ij}^{th}^{th}_{ij}_{ij}
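The parameter values reported in the notes to the power tables are consistent with the familiar design-effect form ψ = 1 + (n − 1)ρ for n equally correlated measurements per litter. The sketch below assumes that form; it is an assumption on our part rather than a quotation of the authors' equation, and the function name is hypothetical. Small discrepancies from the reported ψ values are expected because ρ is reported to only two decimal places.

```python
def overdispersion(n_per_litter: int, rho: float) -> float:
    """Assumed relationship between over-dispersion and intra-litter correlation:
    psi = 1 + (n - 1) * rho."""
    return 1.0 + (n_per_litter - 1) * rho


# (measurements per litter, rho, psi as reported in the table notes)
reported = [
    (12, 0.60, 7.59),  # pup weight, 12 live pups/litter
    (15, 0.60, 9.39),  # pup weight, 15 live pups/litter
    (13, 0.19, 3.23),  # prenatal loss, 13 implants/dam
    (16, 0.19, 3.79),  # prenatal loss, 16 implants/dam
]

for n, rho, psi_reported in reported:
    psi = overdispersion(n, rho)
    print(f"n={n:2d} rho={rho:.2f} psi(assumed form)={psi:.2f} psi(reported)={psi_reported:.2f}")
```

Under this form, larger litters and stronger intra-litter correlation both inflate the variance relative to independent observations, which is why the litter rather than the pup is the experimental unit.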

The number of P_{0} dams available for study was limited to 100 dams per block where experimental blocks were logistically constrained to being evaluated sequentially over time. Therefore, the problem was that of determining whether the study would have sufficient power to detect the treatment effects of interest (

In each block, 100 dams would be divided into two groups: a control group and a treatment group receiving water concentrate containing a complex mixture of DBPs. In this study, one treatment group would be used due to the limited amount of water concentrate available.

The expected control group and treatment group means were estimated from the Narotsky

Having equal numbers of repeated measurements per experimental unit was a necessary assumption for the implementation of Rochon’s [

The first design considered was a single-block design in which one F_{1} female rat per litter would be bred. In this design, a maximum of 100 dams would be available for assignment to the two groups (control and treatment). The compound symmetric correlation structure was used for pups within litters, and pups from different litters were assumed to be independent. The correlation matrix is given in
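To make the assumed correlation structure concrete, the sketch below builds a compound symmetric correlation matrix for pups within a litter and places independent litters on a block diagonal. It is only an illustration: the function names and the tiny dimensions (2 litters of 3 pups) are hypothetical, and ρ = 0.60 is the pup weight correlation reported with the power tables.

```python
from typing import List

Matrix = List[List[float]]


def compound_symmetric(n_pups: int, rho: float) -> Matrix:
    """Correlation matrix with 1 on the diagonal and rho everywhere else."""
    return [[1.0 if r == c else rho for c in range(n_pups)] for r in range(n_pups)]


def independent_litters(n_litters: int, n_pups: int, rho: float) -> Matrix:
    """Block-diagonal matrix: compound symmetry within a litter, zero across litters."""
    size = n_litters * n_pups
    full = [[0.0] * size for _ in range(size)]
    block = compound_symmetric(n_pups, rho)
    for litter in range(n_litters):
        offset = litter * n_pups
        for r in range(n_pups):
            for c in range(n_pups):
                full[offset + r][offset + c] = block[r][c]
    return full


# Hypothetical toy example: 2 litters with 3 pups each.
R = independent_litters(n_litters=2, n_pups=3, rho=0.60)
for row in R:
    print(" ".join(f"{value:.2f}" for value in row))
```

Every pair of littermates shares the same correlation, while entries linking pups from different litters are zero, which is the independence assumption stated for this design.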

The cohort size of 100 dams was sufficient to achieve the desired 80% power to detect a 0.6 g difference in average pup weight between the control and treatment groups at a significance level of 0.05 for all cases considered except one. With this simple design, greater than 99% power can be achieved with 50 dams assigned to each of the control and treatment groups, or with a 40:60 control:treatment group ratio assignment.

The single-block design failed to produce a sufficient level of power to detect a treatment effect on prenatal loss across the effect size, litter size, and correlation scenarios considered. Only when the intra-litter correlation was assumed to be zero (a very unlikely assumption) would 100 dams be sufficient to achieve the desired level of power. When the intra-litter correlation was assumed to be non-zero, the maximum achievable power, regardless of whether equal or unequal allocation of the dams occurred, was substantially lower than the desired 80% for the prenatal loss endpoint for either the linear (36%) or the logistic (35%) model. Because the single-block design showed such poor performance with respect to prenatal loss, other design options were examined.

In an attempt to increase the power of the experiment without increasing the number of P_{0} dams, a design in which two females per F_{1} litter (_{1} females would not be independent. The resulting correlation structure is presented in

Rochon’s [

In addition, the size of the inter-litter correlation relative to the intra-litter correlation must be estimated. It is a reasonable assumption that the inter-litter correlation would be smaller than the intra-litter correlation, though it is unclear how much smaller. To address this last issue, a sensitivity analysis could be conducted. The true nature of the correlation structure for such a design is uncertain, and a violation of any one of these assumptions could affect the results. Validation of these assumptions is not possible, because Narotsky _{1} female per litter, is unconventional and was not encountered in the scientific literature. Despite the likelihood that this design would not be usable for the current study, it was examined to determine its potential usefulness for future studies; if large increases in expected power can be realized with such a design, then pilot data could be collected to provide the required estimates and support validation of the assumptions for future work.

Based on a single value for the ratio of the inter-litter correlation to the intra-litter correlation (δ = 0.75), the results were encouraging. As expected from the results above with one F_{1} female, power for the pup weight endpoint was greater than 99% with a control:treatment group ratio of either 50:50 or 40:60. Power for the prenatal loss endpoint remained below the desired 80% level but increased to 43% and 42% under the linear and logistic models, respectively; as with the one-female-per-litter design, power was higher with unequal allocation of the dams.

Despite the increased power using two females per litter, this design was not further considered. This design required more water concentrate than the alternative two-block design (discussed below). In addition, the project team lacked confidence in the method of handling the correlation between related litters and was doubtful that all necessary assumptions would be met. Nonetheless, it is important to note that this approach,

To achieve the desired level of power without the complication of inter-litter correlation, a design with two blocks of 100 P_{0} dams per block (total of 200 dams) was examined. Although a single-block design lacks the complication of a blocking factor, managing 200 dams in a single block exceeded the technical capability available to conduct the study. A design with two blocks (

If dams are examined in two blocks, a factor must be included in the model to account for differences between the blocks. Blocks would be treated as a random factor in the analysis of the data from the multigenerational study being powered here. The model under consideration for pup weight was revised as:

where y_{ijkl} is the weight of the l^{th} live pup in the k^{th} block in the j^{th} litter of the i^{th} group:

where τ_{k} is the effect of the k^{th} block, with τ_{k} ~ N(0, σ_{τ}^{2}), and (βτ)_{ik} is the i^{th} treatment group by k^{th} block interaction, with (βτ)_{ik} ~ N(0, σ_{βτ}^{2}). The experiment was designed to include replication so that both the block and the treatment group by block interaction can be estimated.

Using the same notation, the revised model for prenatal loss was:

Three different scenarios representing the experimental outcomes were examined in this analysis: (1) the block effect is zero (

To determine the power achieved for a two-block design, the methodology described by Rochon [

To calculate sample size and power for a given scenario with fixed parameters, the algorithm calculations needed to be performed only once. However, because of the random nature of the block effect, a single fixed parameter estimate for the block effect could not be used. Treating the blocking factor as random, the model assumed that the observed block effect was a random observation from the distribution of block effects and that the observed group × block interaction effect was a random observation from the distribution of group × block interaction effects.

To incorporate the random block main effect into the calculations, random noise was generated according to a N(0, σ_{τ}^{2}) distribution and added to the group means by block. Random noise was also generated according to a N(0, σ_{βτ}^{2}) distribution and added to each group mean to incorporate the random interaction effect. As stated above, the values for σ_{τ}^{2} and σ_{βτ}^{2} were varied since no estimates were available. The ranges for these random block effects were selected to provide sufficiently realistic variability without overwhelming the true effects of interest in the model. Power was calculated 500 times and the average power reported.
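The averaging procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a two-sided normal-approximation z-test on litter means stands in for Rochon's GEE-based power calculation, the between-litter standard deviation `sd_litter = 0.6` is a hypothetical value, and all function names are ours. It will therefore not reproduce the tabulated power values.

```python
import math
import random
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()


def z_test_power(delta: float, se: float, alpha: float = 0.05) -> float:
    """Two-sided power of a z-test for true difference `delta` with standard error `se`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def simulated_power(mu_c, mu_t, sd_litter, n_c, n_t, var_block, var_inter,
                    n_blocks=2, n_sims=500, seed=1):
    """Average power over random draws of block and group-by-block effects."""
    rng = random.Random(seed)
    # Standard error of the difference in group means, pooling dams across blocks.
    se = sd_litter * math.sqrt(1 / (n_blocks * n_c) + 1 / (n_blocks * n_t))
    powers = []
    for _ in range(n_sims):
        block_diffs = []
        for _ in range(n_blocks):
            tau = rng.gauss(0.0, math.sqrt(var_block))  # block main effect
            control = mu_c + tau + rng.gauss(0.0, math.sqrt(var_inter))
            treated = mu_t + tau + rng.gauss(0.0, math.sqrt(var_inter))
            block_diffs.append(control - treated)
        powers.append(z_test_power(mean(block_diffs), se))
    return mean(powers), median(powers)


# Pup weight means from the article; 40 control and 60 treated dams per block.
no_noise = simulated_power(6.5, 5.9, 0.6, 40, 60, var_block=0.0, var_inter=0.0)
with_noise = simulated_power(6.5, 5.9, 0.6, 40, 60, var_block=0.5, var_inter=0.5)
print("mean, median power without random effects:", no_noise)
print("mean, median power with random effects:   ", with_noise)
```

In this sketch the block main effect is added to both groups within a block and cancels in the within-block difference, so only the interaction variance changes the simulated treatment difference. Reporting both the mean and the median is useful because the distribution of simulated power is strongly skewed toward 1.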

Power as a function of sample size ratio for pup weight is presented in

In contrast, for prenatal loss, an unequal allocation of the dams in the single-block design led to increased power by placing more dams in the group anticipated to have greater variance. The recommended allocation differed for the linear and the logistic models for prenatal loss. Power as a function of sample size ratio for the linear and logistic prenatal loss models is presented graphically. For the linear model [with variance, σ^{2}, equal to np(1-p)], the effect variance for the treatment group, σ^{2} = 1.69, was greater than that for the control group, σ^{2} = 0.96. The reverse was true for the logistic model [with variance, σ^{2}, equal to 1/(np) + 1/(n(1-p))], where the effect variance for the treatment group was σ^{2} = 0.59 and for the control group was σ^{2} = 1.04. As a result, power was maximized at a control:treatment group ratio of 43:57 for the linear model and 57:43 for the logistic model. The power difference between the two models was small: at the 1:1 ratio, power was 36% and 35% for the linear and logistic models, respectively.
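The reported allocation ratios can be checked against a standard result: for a fixed total sample size, the variance of a difference between two group means is minimized when each group's share is proportional to its standard deviation. The article does not state how the ratios were obtained, so the rule below is swapped in purely for illustration, and the function names are hypothetical. The variance helpers assume n = 13 implants per dam and take the linear-model variance to be np(1-p), which is the reciprocal of the logistic expression quoted above.

```python
import math


def linear_variance(n: int, p: float) -> float:
    """Binomial count variance, n * p * (1 - p) (assumed form for the linear model)."""
    return n * p * (1 - p)


def logistic_variance(n: int, p: float) -> float:
    """Variance on the logit scale, 1/(n p) + 1/(n (1 - p))."""
    return 1 / (n * p) + 1 / (n * (1 - p))


def control_share(var_control: float, var_treated: float) -> float:
    """Share of dams assigned to control when allocation is proportional to SD."""
    sd_c, sd_t = math.sqrt(var_control), math.sqrt(var_treated)
    return sd_c / (sd_c + sd_t)


# Effect variances as reported in the text.
linear_share = control_share(var_control=0.96, var_treated=1.69)
logistic_share = control_share(var_control=1.04, var_treated=0.59)
print(round(100 * linear_share), round(100 * logistic_share))  # prints "43 57"

# Approximate check of the control-group variances with n = 13 and p = 0.08.
print(round(linear_variance(13, 0.08), 2), round(logistic_variance(13, 0.08), 2))
```

Under these assumptions the rule gives control shares of about 43% (linear) and 57% (logistic), matching the reported 43:57 and 57:43 ratios. The helper values for the control group land close to the reported 0.96 and 1.04; exact agreement is not expected because the group proportions are rounded.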

For pup weight in the two-block design with equal allocation to the control and treatment groups, the use of 100 dams in each of the two blocks was sufficient to achieve the desired 80% power with a significance level of 0.05. For this optimal allocation, mean power (calculated over 500 values for the block effect and the group × block effect) was approximately 100% to detect the treatment effect in 54% of the scenarios, greater than 90% for 62% of the scenarios, greater than 85% for 92% of the scenarios, and greater than 80% for 100% of the scenarios. Moreover, median power to detect pup weight differences was approximately 100% for all scenarios examined. These results held true for the different numbers of live pups per litter, as well as for each sex separately (results not shown) and for both sexes combined.

Based on the results from the single-block designs, two control:treatment ratios representing unequal allocation were considered for the two-block design, along with an equal allocation ratio: 45:55, 40:60, and 50:50. Pup weight results for three potential experimental outcome scenarios are represented in

For prenatal loss, two blocks of 100 dams with equal allocation to the control and treatment groups appeared to be insufficient to achieve the desired 80% power at a significance level of 0.05. The two-block design with a 40:60 ratio (control:treatment) of the dams within each block yielded the highest power estimates for prenatal loss; however, power remained below the desired 80%. The power results for prenatal loss are given in


This article describes the methodology used for calculating statistical power for non-independent observations in a two-block design for the multigenerational reproductive/developmental toxicity rodent bioassay in the Four Lab Study. It takes into account the multigenerational bioassay design as well as constraints on sample size, water concentrate volume and concentration factor.

Designing this bioassay under these constraints necessitated thoughtful consideration of statistical power. Determining power and sample size for multiple block designs is complicated, because it must account for the interaction of groups with blocks. Though the effect of the block is not of inherent interest, it might influence the group (treatment) effect if the DBP levels within the concentrate change during the course of the study. Using developmental toxicity screening data [

While several rodent developmental toxicity investigations have been conducted using exposures to concentrated tap waters [

Based on the results of the power analyses and the physical constraints of the study, the Four Lab team selected a two-block design for the multigenerational bioassay, assigning 40 and 60 timed-pregnant rats to the control and treatment groups in each block, respectively. The two-block design achieved greater than the desired 80% power at a significance level of 0.05 with respect to pup weight for all the scenarios examined; more than half of the scenarios for the two-block design achieved 100% power. This design ensured that at least one sensitive endpoint (

For this research, detecting a prenatal loss effect, if present, also was desirable. The two-block design, that optimizes the power for pup weight loss, also provides the most power for prenatal loss from among the possible designs, given the study constraints. This analysis shows that the two-block design provides a modest amount of power (_{1} litter, was eliminated from consideration due to the uncertainty surrounding the inter-litter correlation, and the need for larger quantities of water concentrate. In general, the power associated with the two-block design was approximately twice that of the single-block designs considered. The somewhat large discrepancy between the power for pup weight and prenatal loss was expected and is inevitable in a reproductive toxicity study. This is because, relative to their respective means, the variance for prenatal loss is generally much larger than for pup weight (e.g.,

The constraints imposed by conducting toxicological investigations with highly complex environmental mixtures in an environmentally relevant medium at environmentally relevant dose levels are not unique to the Four Lab Study. The methodology described here may be applied to appropriately design other toxicology studies with environmentally realistic complex mixtures, as similar constraints likely will be encountered.

This work highlights the importance of considering statistical power in the design of bioassays that evaluate health effects of chemical mixtures in the low-response region of the dose-response curve. Such biostatistical analyses provide meaningful quantitative insights into the trade-offs inherent in the design of studies conducted in the low-response region and provide a clear and logical rationale for choice of study design. These analyses and insights lead to toxicological studies in the low-response region that provide meaningful results and allow for appropriate interpretation of experiments when no observable adverse effect is detected. The conduct of such toxicological studies is critical for improved dose-response assessments of complex chemical mixtures, because they increase understanding of the potential human health effects from exposure to chemical mixtures near environmental exposure levels, which are of increased relevance to human health risk assessment [

The authors acknowledge the thoughtful reviews of Laura Aume and Woodrow Setzer, David Svendsgaard, and Glenn Suter. The authors also acknowledge the initial research on this experimental design issue by Chris Gennings under Cooperative Agreement No. CR827208-01-1 with Virginia Commonwealth University. This research was conducted under Contract No. EP-C-05-030 with Battelle, Columbus, OH, USA.

This manuscript has been reviewed in accordance with the U.S. Environmental Protection Agency’s peer and administrative review policies and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the Agency nor does mention of trade names constitute endorsement or recommendation for use.


(a) μ_{c} = 6.5, μ_{t} = 5.9, ρ = 0.60, ψ = 7.59, and n_{c} + n_{t} = 100; (b) μ_{c} = 0.08, μ_{t} = 0.15, ρ = 0.19, ψ = 3.23, and n_{c} + n_{t} = 100; (c) μ_{c} = 0.08, μ_{t} = 0.15, ρ = 0.19, ψ = 3.23, and n_{c} + n_{t}.

Summary of data from pilot study with complex mixture of disinfection by-products [

 | Control | Treated
---|---|---
 | 36 | 35
 | 36 | 35
Implantation sites per dam | 13.1 ± 0.3 | 13.4 ± 0.4
Live pups per litter | 12.1 ± 0.4 | 11.3 ± 0.6
Prenatal loss (%) | 7.8 ± 1.5 | 14.9 ± 3.8
Pup weight (g) | 6.5 ± 0.1 | 5.9 ± 0.1

Significantly different from controls (

Power to detect a 0.6 g difference in average pup weight using a linear model: Two-block design.

Block effect variance, σ_{τ}^{2} | Interaction effect variance, σ_{βτ}^{2} | Mean power, 12 live pups/litter | Mean power, 15 live pups/litter | Median power, 12 live pups/litter | Median power, 15 live pups/litter
---|---|---|---|---|---
0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00
0.05 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00
0.5 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00
1.0 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00
0.05 | 0.05 | 1.00 | 1.00 | 1.00 | 1.00
0.05 | 0.5 | 0.84 | 0.84 | 1.00 | 1.00
0.05 | 1.0 | 0.86 | 0.86 | 1.00 | 1.00
0.5 | 0.05 | 1.00 | 1.00 | 1.00 | 1.00
0.5 | 0.5 | 0.86 | 0.86 | 1.00 | 1.00
0.5 | 1.0 | 0.91 | 0.91 | 1.00 | 1.00
1.0 | 0.05 | 1.00 | 1.00 | 1.00 | 1.00
1.0 | 0.5 | 0.85 | 0.86 | 1.00 | 1.00
1.0 | 1.0 | 0.89 | 0.89 | 1.00 | 1.00

Note: Calculated across 500 simulations, assuming one F1 female per dam is bred, combined male and female pups, and unequal allocation of dams to control (40) and treatment (60) groups within each of the two blocks. Assumes an individual two-sided test, significance level of 0.05. Control group average pup weight = 6.5. Treatment group average pup weight = 5.9. ρ = 0.60. ψ = 7.59 or 9.39.

Power to detect a 7.1 percentage point difference in prenatal loss using a linear model: Two-block design.

Block effect variance, σ_{τ}^{2} | Interaction effect variance, σ_{βτ}^{2} | Mean power, 13 implants/dam | Mean power, 16 implants/dam | Median power, 13 implants/dam | Median power, 16 implants/dam
---|---|---|---|---|---
0.00 | 0.00 | 0.57 | 0.53 | 0.57 | 0.53
0.001 | 0.00 | 0.57 | 0.53 | 0.57 | 0.53
0.01 | 0.00 | 0.57 | 0.53 | 0.57 | 0.53
0.025 | 0.00 | 0.58 | 0.54 | 0.57 | 0.53
0.001 | 0.001 | 0.57 | 0.53 | 0.57 | 0.53
0.001 | 0.01 | 0.57 | 0.53 | 0.57 | 0.53
0.001 | 0.025 | 0.57 | 0.54 | 0.58 | 0.54
0.01 | 0.001 | 0.57 | 0.53 | 0.57 | 0.53
0.01 | 0.01 | 0.57 | 0.54 | 0.57 | 0.53
0.01 | 0.025 | 0.57 | 0.54 | 0.57 | 0.53
0.025 | 0.001 | 0.58 | 0.54 | 0.57 | 0.53
0.025 | 0.01 | 0.58 | 0.54 | 0.57 | 0.53
0.025 | 0.025 | 0.57 | 0.54 | 0.57 | 0.53

Note: Calculated across 500 simulations, assuming one F1 female per dam is bred, a linear model, and unequal allocation of dams to control (40) and treatment (60) groups within each of two blocks. Assumes an individual one-sided test, significance level of 0.05. Control group prenatal loss = 0.08. Treatment group prenatal loss = 0.15. ρ = 0.19. ψ = 3.23 or 3.79.

Power to detect a 1.9-fold difference in prenatal loss using a linear logistic model: Two-block design.

Block effect variance, σ_{τ}^{2} | Interaction effect variance, σ_{βτ}^{2} | Mean power, 13 implants/dam | Mean power, 16 implants/dam | Median power, 13 implants/dam | Median power, 16 implants/dam
---|---|---|---|---|---
0.00 | 0.00 | 0.52 | 0.48 | 0.52 | 0.48
0.001 | 0.00 | 0.52 | 0.48 | 0.52 | 0.48
0.01 | 0.00 | 0.52 | 0.48 | 0.52 | 0.48
0.025 | 0.00 | 0.52 | 0.48 | 0.52 | 0.48
0.001 | 0.001 | 0.52 | 0.48 | 0.52 | 0.48
0.001 | 0.01 | 0.53 | 0.49 | 0.53 | 0.49
0.001 | 0.025 | 0.54 | 0.50 | 0.54 | 0.51
0.01 | 0.001 | 0.52 | 0.48 | 0.52 | 0.48
0.01 | 0.01 | 0.51 | 0.48 | 0.51 | 0.48
0.01 | 0.025 | 0.51 | 0.47 | 0.51 | 0.49
0.025 | 0.001 | 0.52 | 0.48 | 0.52 | 0.48
0.025 | 0.01 | 0.52 | 0.49 | 0.52 | 0.48
0.025 | 0.025 | 0.51 | 0.48 | 0.51 | 0.48

Note: Calculated across 500 simulations, assuming one F1 female per dam is bred, a logistic model, and unequal allocation of dams to control (40) and treatment (60) groups within each of two blocks. Assumes an individual one-sided test, significance level of 0.05. Control group prenatal loss = 0.08. Treatment group prenatal loss = 0.15. ρ = 0.19. ψ = 3.23 or 3.79.