Abstract
In randomized clinical trials comparing two standard treatments, a two-sided test for efficacy at a significance level α is typically used when neither treatment is expected to be superior at the design stage. This two-sided test comprises two one-sided tests, each conducted at a significance level of α/2. If safety data later suggest that one treatment is not clinically acceptable due to a higher rate of adverse events, investigators may reallocate the significance level originally assigned to the one-sided efficacy test for that treatment to the other one-sided test. This results in a two-stage procedure. We examine the impact of such reallocation on the familywise error rate (FWER). Using theoretical derivations and simulation studies, we show that FWER can exceed the nominal level when the treatment with fewer adverse events tends to show greater efficacy. Therefore, the two-stage procedure should be avoided when strict control of FWER is a priority. These findings emphasize the need for caution when reallocating significance levels based on auxiliary information and have implications beyond clinical trials, particularly in adaptive statistical methodologies.
MSC:
62F03; 62J15; 62L10; 62P10
1. Introduction
This paper considers a randomized clinical trial comparing the efficacy of two standard treatments. When neither treatment is expected to be superior at the design stage, a two-sided test for efficacy at a significance level α is typically employed [1]. If one treatment is shown to be more effective than the other, it will continue to be the standard treatment, while the other will not be adopted as the standard treatment.
The two-sided test at a significance level of α consists of two one-sided tests, each assessing whether one treatment is superior to the other, with a significance level of α/2. If, for some reason, one of the one-sided tests becomes irrelevant, the significance level of α/2 assigned to that test will be wasted. Suppose that one treatment is found to have a very high rate of adverse events. Even though it has been shown to be more effective, clinicians may consider it unreasonable for that treatment to remain the standard while the other treatment does not. In such a case, investigators would not be interested in demonstrating that the treatment is more effective. Instead, they would seek to reallocate the significance level of α/2, originally assigned to testing whether the treatment is more effective, to testing whether the other treatment is more effective.
This procedure consists of two stages: The first stage reviews safety data and determines the reallocation of the significance level, while the second stage conducts the efficacy test. Two-stage procedures in hypothesis testing, although different from the one described above, have been widely discussed in the statistical literature [2,3,4,5,6]. If the two stages are not properly considered, using these procedures can lead to an increased type I error rate or familywise error rate (FWER), defined as the probability of rejecting one or more true null hypotheses under any combination of true and false null hypotheses [7].
For example, seamless phase II/III trials use data from the phase II stage in the phase III stage, and the two-stage procedure involves decisions made at both stages [2]. Senn [3] introduced a different two-stage procedure in a cross-over trial, where a preliminary test on a carry-over effect guides the choice of a primary test for the treatment effect. Similarly, Kahan [4] considered a factorial design and used an interaction test to determine whether to compare individual treatment combinations or contrasts such as A vs. not-A and B vs. not-B. Campbell and Dean [5] considered a two-stage procedure for Cox models, selecting a test for regression coefficients based on a test of the proportional hazard assumption. A two-stage procedure for group-sequential designs with two endpoints was also described in [6].
To the best of our knowledge, the reallocation of significance levels in two-sided hypothesis testing problems has not been mathematically investigated in the existing literature. This paper examines the validity of the two-stage procedure in such settings in terms of controlling FWER. The remainder of the paper is organized as follows. Section 2 provides a formal analytic description of the problem, followed by the calculation of FWER in Section 3 and the calculation of power in Section 4. Section 5 presents a statistical model for individuals, which satisfies the requirements outlined in Section 2. Section 6 presents a concluding discussion.
2. Reallocation of Significance Levels Based on Safety Data
We present a formal analytic description of the problem discussed in the Introduction. Two treatments are represented by and . For simplicity, we assume that the safety data used to determine the reallocation of significance levels relate to a single adverse event and are summarized by the difference in the proportions of patients experiencing the adverse event under treatments and . In addition, a rule for determining the reallocation of significance levels based on these data is established when planning the trial. Let be an effect measure of efficacy, where indicates that is more efficacious than , indicates that and have the same efficacy, and indicates that is more efficacious than . One example of such a measure is , where is a continuous outcome and is the assigned treatment taking values and . The null hypotheses of the two one-sided and one two-sided tests concerning are defined as , , and . Note that .
Let be a test statistic for that follows a normal distribution with mean and variance 1. Let be the standardized difference between the proportions of patients experiencing an adverse event under treatments and . Suppose that follows a two-dimensional normal distribution with mean vector , marginal variances of , and correlation coefficient . The mean of is assumed to be 0 since it is not the focus of interest.
A two-stage procedure is determined with cutoff values and () for safety data. If , the two-sided test for is conducted at the significance level of α, which is equivalent to performing two one-sided tests for and , each at the significance level of α/2. If , the significance level of α/2 assigned to the test for is reallocated to the test for , so that only the test for is conducted at the significance level of α. If , the significance level of α/2 assigned to the test for is reallocated to the test for , so that only the test for is conducted at the significance level of α.
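The decision rule above can be sketched in code. This is a minimal illustration, not the authors' implementation: the names `t`, `s`, `c1`, `c2`, `"H01"`, and `"H02"` are hypothetical, and the sign convention (a large positive `s` means more adverse events under the treatment that a large positive `t` favours) is an assumption chosen to agree with the later statement that a negative correlation inflates FWER.

```python
from statistics import NormalDist


def two_stage_decision(t, s, c1, c2, alpha=0.05):
    """Return the set of one-sided null hypotheses rejected by the two-stage rule.

    Assumed (hypothetical) conventions, not taken from the paper:
      t     : efficacy statistic; large positive values favour the first treatment
      s     : safety statistic; large positive values mean more adverse events
              under the first treatment
      "H01" : null hypothesis that the first treatment is not more efficacious
      "H02" : null hypothesis that the second treatment is not more efficacious
    """
    if not c1 < c2:
        raise ValueError("cutoffs must satisfy c1 < c2")
    z = NormalDist().inv_cdf
    rejected = set()
    if s < c1:
        # Second treatment looks unsafe: spend the full alpha on H01 only.
        if t > z(1 - alpha):
            rejected.add("H01")
    elif s > c2:
        # First treatment looks unsafe: spend the full alpha on H02 only.
        if t < -z(1 - alpha):
            rejected.add("H02")
    else:
        # No reallocation: ordinary two-sided test, alpha/2 in each tail.
        crit = z(1 - alpha / 2)
        if t > crit:
            rejected.add("H01")
        elif t < -crit:
            rejected.add("H02")
    return rejected
```

With α = 0.05, a value such as `t = 1.8` is not significant in the two-sided test but becomes significant once the safety statistic falls below `c1`, which is exactly the mechanism whose error rate is examined in the next section.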
The two-stage procedure described above maintains a type I error rate of for each test. However, due to the correlation between the efficacy test statistic and the safety index, it is unclear if the two-stage procedure adequately controls FWER. The next section mathematically evaluates FWER for the two-stage procedure.
3. Derivation of the Familywise Error Rate for the Two-Stage Procedure
This section evaluates FWER for the two-stage procedure. FWER for the two-stage procedure is a function of the effect measure , the correlation coefficient , the cutoff values and , and the significance level . The function is denoted by FWER. Let be the cumulative distribution function of the standard normal distribution, and let be the critical value of the standard normal distribution cutoff probability in the upper tail.
In the following, we evaluate FWER in three distinct cases: , , and . When , FWER is given by
Since it follows that
which equals We then have
Hence, for , we conclude that FWER.
When FWER is given by
Since it follows that
which equals We then have
Hence, for , we conclude that FWER.
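The two non-null cases can be written out explicitly under assumed notation. The symbols below are illustrative choices and may differ from the original typesetting: T is the efficacy statistic with mean θ, S is the safety statistic, c_1 < c_2 are the cutoffs, z_p is the upper-p critical value of the standard normal distribution, and S > c_2 is assumed to trigger the lower-tailed test only. In each case only one null hypothesis is true, and both of its rejection regions lie inside a single tail of size at most α.

```latex
% Assumed notation: T ~ N(\theta, 1), S ~ N(0, 1), cutoffs c_1 < c_2,
% z_p = upper-p critical value of N(0, 1), \Phi = standard normal CDF.
\begin{aligned}
\theta > 0:\quad \mathrm{FWER}
  &= P(c_1 \le S \le c_2,\ T < -z_{\alpha/2}) + P(S > c_2,\ T < -z_{\alpha}) \\
  &\le P(T < -z_{\alpha}) = \Phi(-z_{\alpha} - \theta) < \alpha, \\
\theta < 0:\quad \mathrm{FWER}
  &= P(c_1 \le S \le c_2,\ T > z_{\alpha/2}) + P(S < c_1,\ T > z_{\alpha}) \\
  &\le P(T > z_{\alpha}) = 1 - \Phi(z_{\alpha} - \theta) < \alpha .
\end{aligned}
```

The inequalities use only that the two safety regions are disjoint and that z_{α/2} > z_α, so they hold for any correlation between T and S.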
When , FWER is given by
When and , and are independent since follows a two-dimensional normal distribution. Therefore, FWER can be written as
which equals
Hence, when and , we conclude that FWER.
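The independence argument can be made explicit as follows. The notation is assumed for illustration (T for the efficacy statistic, S for the safety statistic, c_1 < c_2 for the cutoffs, z_p for the upper-p standard normal critical value), as is the convention that S < c_1 triggers the upper-tailed test and S > c_2 the lower-tailed test.

```latex
% Case: mean of T equal to 0 and T independent of S (assumed notation).
\begin{aligned}
\mathrm{FWER}
  &= P(c_1 \le S \le c_2)\,P(|T| > z_{\alpha/2})
   + P(S < c_1)\,P(T > z_{\alpha})
   + P(S > c_2)\,P(T < -z_{\alpha}) \\
  &= \alpha\,\bigl[P(c_1 \le S \le c_2) + P(S < c_1) + P(S > c_2)\bigr] = \alpha .
\end{aligned}
```

Each of the three tests has size α when considered alone, and the safety regions partition the sample space, so the weights sum to one.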
When and , FWER can be calculated via numerical integration. Table 1 presents the values of FWER for under various combinations of and . When and are negatively correlated, that is, when treatments with fewer adverse events tend to show better efficacy, FWER exceeds α. In contrast, when and are positively correlated, that is, when treatments with more adverse events tend to show better efficacy, FWER remains below α. When the absolute values of and are large (i.e., ), the probability that no reallocation occurs is high, and as a result, FWER tends to be controlled at α. In contrast, when the absolute values of and are small (i.e., ), the probability of reallocation increases, and as a result, the maximum FWER can reach .
Table 1.
FWERs of the two-stage procedure at and
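One way to carry out the numerical integration mentioned above is to condition on the safety statistic and integrate the conditional rejection probability against its density. The sketch below is an illustration under assumptions, not the supplementary R code: the function name `fwer_null`, the integration limits, and the sign convention (the upper-tailed test is used below `c1`, the lower-tailed test above `c2`) are hypothetical, and it covers only the case in which the efficacy statistic has mean zero and the correlation is strictly between -1 and 1.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def _simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def fwer_null(rho, c1, c2, alpha=0.05, lim=8.0):
    """FWER when the efficacy statistic T has mean 0 and corr(T, S) = rho.

    Assumption (not from the paper): S < c1 leads to the upper-tailed test,
    S > c2 to the lower-tailed test, and c1 <= S <= c2 to the two-sided test.
    Uses T | S = s ~ N(rho * s, 1 - rho**2).
    """
    if not (-1 < rho < 1 and -lim < c1 < c2 < lim):
        raise ValueError("need |rho| < 1 and -lim < c1 < c2 < lim")
    z_a = _N.inv_cdf(1 - alpha)
    z_a2 = _N.inv_cdf(1 - alpha / 2)
    sd = sqrt(1 - rho ** 2)

    def upper(s, z):  # P(T > z | S = s)
        return 1 - _N.cdf((z - rho * s) / sd)

    def lower(s, z):  # P(T < -z | S = s)
        return _N.cdf((-z - rho * s) / sd)

    # Integrate piecewise so the jumps at c1 and c2 fall on interval ends.
    left = _simpson(lambda s: _N.pdf(s) * upper(s, z_a), -lim, c1)
    mid = _simpson(lambda s: _N.pdf(s) * (upper(s, z_a2) + lower(s, z_a2)), c1, c2)
    right = _simpson(lambda s: _N.pdf(s) * lower(s, z_a), c2, lim)
    return left + mid + right


if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.5):
        print(rho, round(fwer_null(rho, -1.0, 1.0), 4))
```

Under these assumptions the function should return a value above α for negative `rho`, α itself for `rho = 0`, and a value below α for positive `rho`, in line with the pattern described for Table 1. The cutoffs `-1.0` and `1.0` are arbitrary illustrative values.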
The observation regarding the maximum value of FWER can be formally proven. When and , the maximum value of FWER is equal to since
Equality holds if and only if the correlation coefficient and the cutoff values satisfy . We now provide a proof of this statement. Note that holds if and only if the following three equalities are satisfied:
If , then equalities (2) and (3) cannot hold. Therefore, we must have Suppose . Then, and from (2) we have
which leads to a contradiction. Therefore, must be . In this case, from (2), we obtain
which implies , that is, . Similarly, from (3) we obtain
which implies , that is, . From (1), we have
which implies and , that is, and . Then, equalities (1)–(3) hold if and only if , and . This completes the proof. The following theorem summarizes the discussion above.
Theorem 1.
FWER of the two-stage procedure satisfies the following properties: When , or and , the FWER of the two-stage procedure can be controlled at α. When and , FWER may exceed α and it can reach the maximum value of when and .
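The extreme case in the theorem can be checked by simulation. The sketch below is a hedged illustration with hypothetical names (`fwer_null_mc`, `n_sim`, `seed`) and an assumed sign convention (upper-tailed test below `c1`, lower-tailed test above `c2`). Under that convention every rejection region lies inside the event that the absolute efficacy statistic exceeds the one-sided critical value at level α, so FWER cannot exceed 2α; the simulation examines how close a correlation of -1 with small cutoffs comes to that bound.

```python
import random
from math import sqrt
from statistics import NormalDist


def fwer_null_mc(rho, c1, c2, alpha=0.05, n_sim=200_000, seed=0):
    """Monte Carlo FWER when the efficacy statistic has mean 0.

    Assumptions (not from the paper): (T, S) is standard bivariate normal with
    correlation rho; S < c1 selects the upper-tailed test, S > c2 the
    lower-tailed test, and c1 <= S <= c2 the two-sided test.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf
    z_a, z_a2 = z(1 - alpha), z(1 - alpha / 2)
    resid_sd = sqrt(max(0.0, 1 - rho * rho))  # 0 when |rho| = 1
    errors = 0
    for _ in range(n_sim):
        s = rng.gauss(0, 1)
        t = rho * s + resid_sd * rng.gauss(0, 1)
        if s < c1:
            errors += t > z_a
        elif s > c2:
            errors += t < -z_a
        else:
            errors += abs(t) > z_a2
    return errors / n_sim


if __name__ == "__main__":
    print("rho = -1:", fwer_null_mc(-1.0, -1.0, 1.0))
    print("rho =  0:", fwer_null_mc(0.0, -1.0, 1.0))
```

With `rho = -1.0` and cutoffs of `-1.0` and `1.0`, the efficacy statistic is the negative of the safety statistic, so each tail test is applied exactly where it rejects and the estimate should be close to 2α; with `rho = 0.0` it should be close to α.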
4. Derivation of Power for the Two-Stage Procedure
The power of the two-stage procedure to detect a true alternative hypothesis is a function of the effect measure , the correlation coefficient , the cutoff values and , and the significance level . It is denoted by POWER.
- When POWER is given by
Table 2.
Power of the two-stage procedure at
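For orientation, the power can be written out under assumed notation. The symbols are illustrative and may differ from the original typesetting: T is the efficacy statistic with mean θ, S is the safety statistic with correlation ρ to T, c_1 < c_2 are the cutoffs, z_p is the upper-p standard normal critical value, and S < c_1 is assumed to select the upper-tailed test. For θ > 0 the alternative can be detected only when the upper-tailed test is carried out, so the region S > c_2 contributes nothing.

```latex
% Assumed notation; the second line additionally assumes independence of T and S.
\begin{aligned}
\theta > 0:\quad \mathrm{POWER}
  &= P(c_1 \le S \le c_2,\ T > z_{\alpha/2}) + P(S < c_1,\ T > z_{\alpha}), \\
\rho = 0:\quad \mathrm{POWER}
  &= \bigl[\Phi(c_2) - \Phi(c_1)\bigr]\bigl[1 - \Phi(z_{\alpha/2} - \theta)\bigr]
   + \Phi(c_1)\bigl[1 - \Phi(z_{\alpha} - \theta)\bigr].
\end{aligned}
```

The case θ < 0 follows by symmetry, with the lower-tailed test and the region S > c_2 taking the place of the upper-tailed test and S < c_1.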
5. A Statistical Model Satisfying the Assumptions in Section 2
We present a statistical model for individuals that is appropriate for the setting described in Section 2. For let and denote latent variables for individuals receiving treatments and , respectively. These variables are assumed to be independently and identically distributed as standard normal variables.
For a patient receiving , the efficacy outcome is given by for some positive value , and the safety outcome takes the value 0 (no adverse event) or 1 (adverse event) with probability
where we assume the logistic curve between the latent and response variable and . Here, is a constant term. Similarly, for a patient receiving , the efficacy outcome is given by , and the safety outcome takes the value 0 (no adverse event) or 1 (adverse event) with probability
When and , the probability that or equals 1 is , , and , respectively. In this study, we set .
In this setting, a test statistic for efficacy and a summary statistic for safety are given as
By the central limit theorem, the pair are asymptotically normally distributed and approximately satisfy the assumptions in Section 2. The mean of , denoted by , is equal to , where is considered as a constant. The correlation coefficient between and was estimated via a Monte Carlo simulation. We generated data with and conducted repetitions under , that is, . This resulted in independent pairs of , which are shown in the scatter plot in Figure 1. The correlation coefficient of was estimated to be 0.319, with a 95% confidence interval of .
Figure 1.
Scatter plot of independent pairs of
Since the number of individuals in each treatment group is , can take at most distinct values. As a result, certain values near zero are not represented in , leading to visible horizontal white lines in the scatterplot. As increases, these lines gradually disappear.
If the probability that equals 1 is given by for , with , and all other settings remain as previously described, then the correlation coefficient between and was estimated to be , with a 95% confidence interval of .
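A simulation of this kind can be sketched as follows. This is not the supplementary R code, and every numerical setting is a hypothetical placeholder: the latent variable is standard normal, the efficacy outcome is the latent variable plus a shift `delta` in the first arm, the adverse-event probability is a logistic function with intercept `b0` and slope `b1` of the latent variable, and the efficacy statistic treats the outcome variance as known and equal to 1. The values of `n`, `reps`, `b0`, and `b1` are not taken from the paper, so the estimate is not expected to reproduce the reported correlations.

```python
import random
from math import exp, sqrt


def simulate_pair(n, delta, b0, b1, rng):
    """Simulate one trial with n patients per arm; return (T, S)."""
    summary = []
    for shift in (delta, 0.0):  # first arm, then second arm
        y_sum, events = 0.0, 0
        for _ in range(n):
            x = rng.gauss(0, 1)                   # latent variable
            y_sum += x + shift                    # efficacy outcome
            p = 1 / (1 + exp(-(b0 + b1 * x)))     # logistic AE probability
            events += rng.random() < p            # adverse event indicator
        summary.append((y_sum / n, events / n))
    (y1, p1), (y2, p2) = summary
    t = (y1 - y2) / sqrt(2 / n)                   # unit variance assumed known
    pooled = (p1 + p2) / 2
    denom = sqrt(pooled * (1 - pooled) * 2 / n)
    s = (p1 - p2) / denom if denom > 0 else 0.0   # standardized AE difference
    return t, s


def estimate_correlation(n=100, reps=2000, delta=0.0, b0=0.0, b1=1.0, seed=1):
    """Pearson correlation of (T, S) over independent simulated trials."""
    rng = random.Random(seed)
    pairs = [simulate_pair(n, delta, b0, b1, rng) for _ in range(reps)]
    mt = sum(t for t, _ in pairs) / reps
    ms = sum(s for _, s in pairs) / reps
    cov = sum((t - mt) * (s - ms) for t, s in pairs)
    vt = sum((t - mt) ** 2 for t, _ in pairs)
    vs = sum((s - ms) ** 2 for _, s in pairs)
    return cov / sqrt(vt * vs)


if __name__ == "__main__":
    print("slope +1:", round(estimate_correlation(b1=1.0), 3))
    print("slope -1:", round(estimate_correlation(b1=-1.0), 3))
```

In this sketch the sign of the slope `b1` determines the sign of the correlation: a positive slope makes patients with better efficacy outcomes more prone to the adverse event, and a negative slope reverses the relationship.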
6. Concluding Discussion
In this paper, we consider a two-stage procedure in clinical trial settings comparing two standard treatments, in which a two-sided test for efficacy at a significance level of α is planned at the design stage. The α/2 allocated to one of the two one-sided tests that constitute the two-sided test can potentially be reallocated as follows: the α/2 originally assigned to the one-sided test for the treatment with a higher rate of an adverse event is reallocated to the other one-sided test. In Theorem 1, we show that FWER for this two-stage procedure can exceed the nominal significance level α when the treatment associated with a lower rate of adverse events tends to demonstrate greater efficacy. Therefore, this procedure should be avoided when strict control of FWER is a priority.
Consider a clinical example in which cancer patients treated with anticancer drugs tend to experience greater treatment efficacy if they develop adverse events than if they do not [8]. In such a scenario, applying the two-stage procedure in a trial comparing two such anticancer drugs may control FWER at the nominal level α; however, as demonstrated in this paper, it can result in reduced statistical power in some cases.
The main assumptions in Section 2 are as follows:
- (1) The test statistic for efficacy and the summary statistic for safety jointly follow a multivariate normal distribution.
- (2) The summary statistic for safety relates to a single adverse event and is determined by the difference in the proportions of patients experiencing the adverse event under treatments and .
- (3) The cutoff values for the safety summary statistic are determined at the design stage and are used to guide the reallocation of the significance level.
Regarding Assumption (1), we consider it to be reasonable, as many commonly used test and summary statistics are asymptotically normally distributed under standard regularity conditions. Assumptions (2) and (3) were adopted to facilitate the theoretical derivation of the FWER. In practice, however, it may be difficult to pre-specify a single adverse event and fixed cutoff values at the design stage. More commonly, all safety data are reviewed after all data have been collected, but before efficacy data are analyzed, and the decision on whether to reallocate the significance level is taken following discussions among investigators. In such cases, the safety data may influence the choice of cutoff values, making the mathematical evaluation of FWER more complex. Therefore, for analytical tractability, we focused on a setting in which both the adverse event and the cutoff values are specified in advance. Given that our study demonstrated inflation of FWER even under this simplified setting—and that FWER control is not guaranteed under more flexible or data-driven reallocation procedures—we think that such procedures, including the one examined in this study, should be avoided when strict FWER control is required.
From a practical perspective, if a treatment is found to have a very high rate of adverse events, efficacy data may not be collected after patient dropout. In such cases, the sample sizes of both the intention-to-treat population and the full analysis set may become imbalanced between groups. In general, an imbalance in sample size can lead to reduced power in the two-stage procedure. However, this study focuses on fundamental theoretical aspects and does not address practical issues such as dropout-related sample size imbalance.
This study focused solely on a two-stage procedure. However, in three- or multi-stage procedures, inflation of FWER could also occur because correlations between the test statistic and other summary statistics, which drive the inflation, may also arise in these more complex settings.
The two-stage procedure examined in this study involves sequential decision-making and, therefore, bears some resemblance to group sequential designs and the associated bias correction for point estimation (e.g., [9,10]). However, a key difference is that group sequential designs involve repeated testing of a single null hypothesis, whereas our procedure conducts only a single hypothesis test in the second stage, guided by a decision taken in the first stage. The primary focus of this study is on controlling FWER in the second stage.
The implications of our findings extend beyond clinical trial settings. The issues addressed in this study commonly arise when significance levels are reallocated based on external or auxiliary information. Since intuitive reasoning about probabilities can be misleading, careful consideration is essential when employing complex testing procedures.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math13162547/s1, the R code for simulations.
Author Contributions
Conceptualization, A.N. and K.M.; methodology, A.N. and K.M.; formal analysis, A.N.; writing—original draft preparation, A.N.; writing—review and editing, A.N. and K.M.; visualization, A.N.; supervision, K.M.; project administration, A.N. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by JSPS KAKENHI, Grant Number JP23H03353.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article. The R code used for the simulation is provided in the Supplementary Materials.
Acknowledgments
We used ChatGPT (GPT-4o, OpenAI) to identify grammatical errors and improve our English expressions throughout the development of this work.
Conflicts of Interest
No authors have a conflict of interest related to the contents of this manuscript.
References
1. Green, S.; Benedetti, J.; Smith, A.; Crowley, J. Clinical Trials in Oncology, 3rd ed.; Taylor & Francis: Oxfordshire, UK, 2012.
2. Yu, M.; Man, R.; Zhu, H.; Wang, L. Enhancing the flexibility and power of adaptive seamless phase 2/3 design with copula modeling between short-term and long-term endpoints. Commun. Stat. Simul. Comput. 2024, 1–21.
3. Senn, S. Viewpoint: Do not resurrect the two-stage procedure. In Pharmaceutical Statistics; John Wiley and Sons Ltd.: Hoboken, NJ, USA, 2022; pp. 808–814.
4. Kahan, B.C. Bias in randomised factorial trials. Stat. Med. 2013, 32, 4540–4549.
5. Campbell, H.; Dean, C.B. The consequences of proportional hazards based model selection. Stat. Med. 2014, 33, 1042–1056.
6. Hung, H.M.J.; Wang, S.-J.; O’Neill, R. Statistical considerations for testing multiple endpoints in group sequential or adaptive clinical trials. J. Biopharm. Stat. 2007, 17, 1201–1210.
7. Dmitrienko, A.; Tamhane, A.C.; Bretz, F. (Eds.) Multiple Testing Problems in Pharmaceutical Statistics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009.
8. Haratani, K.; Hayashi, H.; Chiba, Y.; Kudo, K.; Yonesaka, K.; Kato, R.; Kaneda, H.; Hasegawa, Y.; Tanaka, K.; Takeda, M.; et al. Association of immune-related adverse events with nivolumab efficacy in non-small cell lung cancer. JAMA Oncol. 2018, 4, 374–378.
9. Grayling, M.J.; Wason, J.M.S. Point estimation following a two-stage group sequential trial. Stat. Methods Med. Res. 2023, 32, 287–304.
10. Grayling, M.J.; Wason, J.M.S.; Mander, A.P. Group sequential crossover trial designs with strong control of the familywise error rate. Seq. Anal. 2018, 37, 174–203.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).