1. Introduction
This paper considers a randomized clinical trial comparing the efficacy of two standard treatments. When neither treatment is expected to be superior at the design stage, a two-sided test for efficacy at a significance level of α is typically employed [1]. If one treatment is shown to be more effective than the other, it will continue to be the standard treatment, while the other will not be adopted as the standard treatment.
The two-sided test at a significance level of α consists of two one-sided tests, each assessing whether one treatment is superior to the other, with a significance level of α/2. If, for some reason, one of the one-sided tests becomes irrelevant, the significance level of α/2 assigned to that test will be wasted. Suppose that one treatment is found to have a very high rate of adverse events. Even though it has been shown to be more effective, clinicians may consider it unreasonable for that treatment to remain the standard while the other treatment does not. In such a case, investigators would not be interested in demonstrating that the treatment is more effective. Instead, they would seek to reallocate the significance level of α/2, originally assigned to testing whether the treatment is more effective, to testing whether the other treatment is more effective.
This procedure consists of two stages: the first stage reviews safety data and determines the reallocation of the significance level, while the second stage conducts the efficacy test. Two-stage procedures in hypothesis testing, although different from the one described above, have been widely discussed in the statistical literature [2,3,4,5,6]. If the two stages are not properly considered, using these procedures can lead to an increased type I error rate or familywise error rate (FWER), defined as the probability of rejecting one or more true null hypotheses under any combination of true and false null hypotheses [7].
For example, seamless phase II/III trials use data from the phase II stage in the phase III stage, and the two-stage procedure involves decisions made at both stages [2]. Senn [3] introduced a different two-stage procedure in a cross-over trial, where a preliminary test on a carry-over effect guides the choice of a primary test for the treatment effect. Similarly, Kahan [4] considered a factorial design and used an interaction test to determine whether to compare individual treatment combinations or contrasts such as A vs. not-A and B vs. not-B. Campbell and Dean [5] considered a two-stage procedure for Cox models, selecting a test for regression coefficients based on a test of the proportional hazards assumption. A two-stage procedure for group-sequential designs with two endpoints was also described in [6].
To the best of our knowledge, the reallocation of significance levels in two-sided hypothesis testing problems has not been mathematically investigated in the existing literature. This paper examines the validity of the two-stage procedure in such settings in terms of controlling FWER. The remainder of the paper is organized as follows. Section 2 provides a formal analytic description of the problem, followed by the calculation of FWER in Section 3 and the calculation of power in Section 4. Section 5 presents a statistical model for individuals, which satisfies the requirements outlined in Section 2. Section 6 presents a concluding discussion.
2. Reallocation of Significance Levels Based on Safety Data
We present a formal analytic description of the problem discussed in the Introduction. Two treatments are represented by and . For simplicity, we assume that the safety data used to determine the reallocation of significance levels relate to a single adverse event and are summarized by the difference in the proportions of patients experiencing the adverse event under treatments and . In addition, a rule for determining the reallocation of significance levels based on these data is established when planning the trial. Let be an effect measure of efficacy, where indicates that is more efficacious than , indicates that and have the same efficacy, and indicates that is more efficacious than . One example of such a measure is , where is a continuous outcome and is the assigned treatment taking values and . The null hypotheses of the two one-sided and one two-sided tests concerning are defined as , , and . Note that .
Let be a test statistic for that follows a normal distribution with mean and variance 1. Let be the standardized difference between the proportions of patients experiencing an adverse event under treatments and . Suppose that follows a two-dimensional normal distribution with mean vector , marginal variances of , and correlation coefficient . The mean of is assumed to be 0 since it is not the focus of interest.
A two-stage procedure is determined with cutoff values and () for safety data. If , the two-sided test for is conducted at the significance level of α, which is equivalent to performing two one-sided tests for and , each at the significance level of α/2. If , the significance level of α/2 assigned to the test for is reallocated to the test for , so that only the test for is conducted at the significance level of α. If , the significance level of α/2 assigned to the test for is reallocated to the test for , so that only the test for is conducted at the significance level of α.
The two-stage procedure described above maintains a type I error rate of for each test. However, due to the correlation between the efficacy test statistic and the safety index, it is unclear if the two-stage procedure adequately controls FWER. The next section mathematically evaluates FWER for the two-stage procedure.
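The decision rule of the two-stage procedure can be sketched in a few lines of Python. This is an illustration only: the names `t`, `s`, `c1`, and `c2`, the labels "A" and "B", and the sign conventions (a positive `t` favours treatment A; a positive `s` means more adverse events under A) are assumptions introduced for this sketch, not notation taken from the text.

```python
from statistics import NormalDist


def two_stage_decision(t, s, c1, c2, alpha=0.05):
    """Return the conclusion of the two-stage rule for one trial.

    t      : efficacy test statistic (assumed: positive favours A)
    s      : standardized safety difference (assumed: positive = more events under A)
    c1, c2 : pre-specified safety cutoffs with c1 <= c2
    """
    if c1 > c2:
        raise ValueError("cutoffs must satisfy c1 <= c2")
    std = NormalDist()
    z_full = std.inv_cdf(1 - alpha)      # one-sided critical value at level alpha
    z_half = std.inv_cdf(1 - alpha / 2)  # one-sided critical value at level alpha/2

    if s < c1:
        # A looks clearly safer: only "A is more efficacious" is tested, at level alpha.
        return "A better" if t > z_full else "no rejection"
    if s > c2:
        # B looks clearly safer: only "B is more efficacious" is tested, at level alpha.
        return "B better" if t < -z_full else "no rejection"
    # No reallocation: ordinary two-sided test, alpha/2 in each tail.
    if t > z_half:
        return "A better"
    if t < -z_half:
        return "B better"
    return "no rejection"
```

With cutoffs of -1 and 1, for instance, a statistic of `t = 1.8` is not significant when no reallocation takes place but becomes significant when the safety statistic falls below the lower cutoff, which is exactly the mechanism whose error rate is examined next.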
3. Derivation of the Familywise Error Rate for the Two-Stage Procedure
This section evaluates FWER for the two-stage procedure. FWER for the two-stage procedure is a function of the effect measure , the correlation coefficient , the cutoff values and , and the significance level . The function is denoted by FWER. Let be the cumulative distribution function of the standard normal distribution, and let be the critical value of the standard normal distribution cutoff probability in the upper tail.
In the following, we evaluate FWER in three distinct cases:
,
, and
. When
, FWER is given by
Since
it follows that
which equals
We then have
Hence, for
, we conclude that FWER
.
When
FWER is given by
Since
it follows that
which equals
We then have
Hence, for
, we conclude that FWER
.
When
, FWER
is given by
When
and
,
and
are independent since
follows a two-dimensional normal distribution. Therefore, FWER
can be written as
which equals
Hence, when
and
, we conclude that FWER
.
When
and
, FWER
can be calculated via a numerical integration.
Table 1 presents the values of FWER for under various combinations of and . When and are negatively correlated, that is, when treatments with fewer adverse events tend to show better efficacy, FWER exceeds α. In contrast, when and are positively correlated, that is, when treatments with more adverse events tend to show better efficacy, FWER remains below α. When the absolute values of and are large (i.e., ), the probability that no reallocation occurs is high, and as a result, FWER tends to be controlled at α. In contrast, when the absolute values of and are small (i.e., ), the probability of reallocation increases, and as a result, the maximum FWER can reach .
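The numerical integration mentioned above can be sketched as follows. The code is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fwer`, its arguments, the midpoint rule, and the sign conventions (a safety statistic below `c1` triggers only the upper-tail test, one above `c2` only the lower-tail test) are all introduced here. It relies on the standard fact that, for a bivariate normal pair with unit variances and correlation `rho`, the efficacy statistic given the safety statistic `s` is normal with mean `theta + rho * s` and variance `1 - rho**2`.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def fwer(theta, rho, c1, c2, alpha=0.05, n_grid=2000):
    """Approximate the familywise error rate of the two-stage rule.

    Integrates P(false rejection | S = s) against the N(0, 1) density of S,
    piecewise over the three safety regions, with a midpoint rule.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("this sketch requires |rho| < 1")
    if c1 > c2:
        raise ValueError("cutoffs must satisfy c1 <= c2")
    z_full = STD.inv_cdf(1 - alpha)
    z_half = STD.inv_cdf(1 - alpha / 2)
    cond_sd = sqrt(1 - rho * rho)

    def false_rejection_prob(s):
        cond = NormalDist(theta + rho * s, cond_sd)  # law of T given S = s
        if s < c1:      # only the upper-tail test is run, at level alpha
            upper, lower = 1 - cond.cdf(z_full), 0.0
        elif s > c2:    # only the lower-tail test is run, at level alpha
            upper, lower = 0.0, cond.cdf(-z_full)
        else:           # ordinary two-sided test
            upper, lower = 1 - cond.cdf(z_half), cond.cdf(-z_half)
        if theta > 0:   # only a lower-tail rejection is a false rejection
            return lower
        if theta < 0:   # only an upper-tail rejection is a false rejection
            return upper
        return upper + lower  # theta == 0: every rejection is false

    limit = 8.0 + max(abs(c1), abs(c2))  # tails beyond this are negligible
    total = 0.0
    for lo, hi in ((-limit, c1), (c1, c2), (c2, limit)):
        step = (hi - lo) / n_grid
        for i in range(n_grid):
            s = lo + (i + 0.5) * step
            total += STD.pdf(s) * false_rejection_prob(s) * step
    return total
```

Calling `fwer(0.0, rho, -1.0, 1.0)` for a negative, zero, and positive `rho` should reproduce the qualitative pattern described for Table 1: above the nominal level, equal to it, and below it, respectively. The specific table entries depend on settings not restated here, so no numerical agreement is claimed.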
The observation regarding the maximum value of FWER can be formally proven. When
and
, the maximum value of FWER
is equal to
since
Equality holds if and only if the correlation coefficient
and the cutoff values satisfy
. We now provide a proof of this statement. Note that
holds if and only if the following three equalities are satisfied:
If
, then equalities (2) and (3) cannot hold. Therefore, we must have
Suppose
. Then,
and from (2) we have
which leads to a contradiction. Therefore,
must be
. In this case, from (2), we obtain
which implies
, that is,
. Similarly, from (3) we obtain
which implies
, that is,
. From (1), we have
which implies
and
, that is,
and
. Then, equalities (1)–(3) hold if and only if
, and
. This completes the proof. The following theorem summarizes the discussion above.
Theorem 1. FWER of the two-stage procedure satisfies the following properties: When , or and , the FWER of the two-stage procedure can be controlled at α. When and , FWER may exceed α and it can reach the maximum value of when and .
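The upper bound in Theorem 1 can be illustrated with a short union-bound argument. The notation is an assumption made for this sketch and may differ from the authors': $T$ is the efficacy statistic, $S$ the safety statistic, $c_1 \le c_2$ the cutoffs, $z_p$ the upper $p$-point of the standard normal distribution, and a value of $S$ below $c_1$ is taken to trigger only the upper-tail test. With the effect measure equal to zero, every rejection is a false rejection, so

```latex
\begin{align*}
\mathrm{FWER}
  &= P\bigl(S < c_1,\ T > z_{\alpha}\bigr)
   + P\bigl(c_1 \le S \le c_2,\ |T| > z_{\alpha/2}\bigr)
   + P\bigl(S > c_2,\ T < -z_{\alpha}\bigr) \\
  &\le P\bigl(T > z_{\alpha}\bigr) + P\bigl(T < -z_{\alpha}\bigr)
   = 2\alpha .
\end{align*}
```

The three rejection events are disjoint, and each is contained in $\{|T| > z_{\alpha}\}$ because $z_{\alpha/2} > z_{\alpha}$. Under these assumptions the bound is attained in the degenerate case $S = -T$ (correlation $-1$) with $c_1 = c_2 = 0$, where the first and third terms each equal $\alpha$ and the middle term vanishes.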
4. Derivation of Power for the Two-Stage Procedure
The power of the two-stage procedure to detect a true alternative hypothesis is a function of the effect measure , the correlation coefficient , the cutoff values and , and the significance level . It is denoted by POWER.
When POWER is given by
When
, POWER
is given by
When
, POWER
, as all alternative hypotheses are false. The power of the two-stage procedure can also be calculated via numerical integration.
Table 2 presents the power of the two-stage procedure for
under various values of
and
. The two values of the power function for
are identical, and this equality holds for other values of
as well. This pattern arises from the symmetry of the bivariate normal distribution. When
and
or
, the probability of conducting the test for
is high, resulting in high power. The same applies when
, in which case the test for
is likely to be conducted. In contrast, when
and
, the probability of conducting the test for
is low, resulting in low power. The same applies when
, in which case the test for
is unlikely to be conducted. As the correlation coefficient
increases, the power decreases since the probability of conducting the test for
when
, and for
when
, decreases. The power of the two-sided test for
is equal to
,
,
,
,
and
, for
and
, respectively. When
, the power of the two-stage procedure is, in some cases, lower than that of the two-sided test. This may be problematic. In contrast, when
, the power of the two-stage procedure, in some cases, exceeds that of the two-sided test. However, this improvement in power is a direct consequence of the failure to control the FWER under
; in other words, the gain in power reflects a loss of error rate control.
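The power comparison can also be explored by simulation. The following is a minimal Monte Carlo sketch under stated assumptions rather than the calculation behind Table 2: the function name `simulate_power`, the restriction to a positive effect `theta`, the definition of power as the probability of rejecting in the correct (upper) direction, and the sign conventions for the safety statistic are all introduced here for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(theta, rho, c1, c2, alpha=0.05, n_sim=100_000, seed=0):
    """Estimate (two-stage power, two-sided power) for an effect theta > 0.

    The pair (T, S) is drawn as bivariate normal with means (theta, 0),
    unit variances, and correlation rho.
    """
    if theta <= 0:
        raise ValueError("this sketch covers theta > 0 only")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    std = NormalDist()
    z_full = std.inv_cdf(1 - alpha)
    z_half = std.inv_cdf(1 - alpha / 2)
    resid_sd = sqrt(1 - rho * rho)

    hits_two_stage = 0
    hits_two_sided = 0
    for _ in range(n_sim):
        s = rng.gauss(0.0, 1.0)
        t = theta + rho * s + resid_sd * rng.gauss(0.0, 1.0)
        if s < c1:
            # Reallocation towards the upper-tail test: level alpha.
            hits_two_stage += int(t > z_full)
        elif s <= c2:
            # No reallocation: upper tail tested at level alpha/2.
            hits_two_stage += int(t > z_half)
        # s > c2: only the lower-tail test is run, so the true effect cannot be detected.
        hits_two_sided += int(t > z_half)
    return hits_two_stage / n_sim, hits_two_sided / n_sim
```

Comparing the first returned value for a negative and a positive `rho`, with `theta` and the cutoffs held fixed, illustrates the decrease in power with increasing correlation described above; comparing the two returned values shows where the two-stage procedure gains or loses power relative to the plain two-sided test.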
5. A Statistical Model Satisfying the Assumptions in Section 2
We present a statistical model for individuals that is appropriate for the setting described in
Section 2. For
let
and
denote latent variables for individuals receiving treatments
and
, respectively. These variables are assumed to be independently and identically distributed as standard normal variables.
For a patient
receiving
, the efficacy outcome
is given by
for some positive value
, and the safety outcome
takes the value 0 (no adverse event) or 1 (adverse event) with probability
where we assume the logistic curve between the latent and response variable
and
. Here,
is a constant term. Similarly, for patient
receiving
, the efficacy outcome
is given by
, and the safety outcome
takes the value 0 (no adverse event) or 1 (adverse event) with probability
When
and
, the probability that
or
equals 1 is
,
, and
, respectively. In this study, we set
.
In this setting, a test statistic for efficacy and a summary statistic for safety are given as
By the central limit theorem, the pair
are asymptotically normally distributed and approximately satisfy the assumptions in
Section 2. The mean of
, denoted by
, is equal to
, where
is considered as a constant. The correlation coefficient between
and
was estimated via a Monte Carlo simulation. We generated data with
and conducted
repetitions under
, that is,
. This resulted in
independent pairs of
, which are shown in the scatter plot in
Figure 1. The correlation coefficient of
was estimated to be 0.319, with a 95% confidence interval of
.
Since the number of individuals in each treatment group is , can take at most distinct values. As a result, certain values near zero are not represented in , leading to visible horizontal white lines in the scatterplot. As increases, these lines gradually disappear.
If the probability that equals 1 is given by for , with , and all other settings remain as previously described, then the correlation coefficient between and was estimated to be , with a 95% confidence interval of .
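A Monte Carlo estimate of this kind can be sketched as follows. Every specific in the code is a hypothetical stand-in, since the exact outcome equations and parameter values are not restated here: the efficacy outcome is taken to be the latent variable plus a shift `delta` for one arm, the adverse-event probability is a logistic function of `beta0` plus the latent variable, the efficacy statistic assumes a known unit variance, and the safety statistic uses a pooled-proportion standard error. The sketch is therefore not expected to reproduce the estimates reported above.

```python
import random
from math import exp, sqrt


def simulate_arm(n, shift, beta0, rng):
    """Return (mean efficacy outcome, adverse-event proportion) for one arm."""
    total = 0.0
    events = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                  # latent variable, standard normal
        total += x + shift                       # hypothetical efficacy outcome
        p = 1.0 / (1.0 + exp(-(beta0 + x)))      # hypothetical logistic link
        events += int(rng.random() < p)          # adverse event indicator
    return total / n, events / n


def simulate_statistics(n, delta, beta0, rng):
    """Return one pair (efficacy statistic, safety statistic)."""
    mean_a, prop_a = simulate_arm(n, delta, beta0, rng)
    mean_b, prop_b = simulate_arm(n, 0.0, beta0, rng)
    t = (mean_a - mean_b) / sqrt(2.0 / n)        # unit outcome variance assumed known
    pooled = (prop_a + prop_b) / 2.0
    var = pooled * (1.0 - pooled) * 2.0 / n
    s = (prop_a - prop_b) / sqrt(var) if var > 0 else 0.0
    return t, s


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def estimate_correlation(n=100, reps=2000, delta=0.0, beta0=0.0, seed=1):
    """Estimate the correlation between the two statistics over many trials."""
    rng = random.Random(seed)
    pairs = [simulate_statistics(n, delta, beta0, rng) for _ in range(reps)]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])
```

Because the same latent variable raises both the efficacy outcome and the adverse-event probability in this sketch, the estimated correlation is positive. Reversing the sign of the latent variable inside the logistic link would give a negative correlation instead, which corresponds to the setting in which FWER inflation arises.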
6. Concluding Discussion
In this paper, we consider a two-stage procedure in clinical trial settings comparing two standard treatments, in which a two-sided test for efficacy at a significance level of α is planned at the design stage. The α/2 allocated to one of the two one-sided tests that constitute the two-sided test can potentially be reallocated as follows: the α/2 originally assigned to the one-sided test for the treatment with a higher rate of adverse events is reallocated to the other one-sided test. In Theorem 1, we show that FWER for this two-stage procedure can exceed the nominal significance level when the treatment associated with a lower rate of adverse events tends to demonstrate greater efficacy. Therefore, this procedure should be avoided when strict control of FWER is a priority.
Consider a clinical example in which cancer patients treated with anticancer drugs tend to experience greater treatment efficacy if they develop adverse events than if they do not [8]. In such a scenario, applying the two-stage procedure in a trial comparing two such anticancer drugs may control FWER at the nominal level α; however, as demonstrated in this paper, it can result in reduced statistical power in some cases.
The main assumptions in Section 2 are as follows:
- (1) The test statistic for efficacy and the summary statistic for safety jointly follow a multivariate normal distribution.
- (2) The summary statistic for safety relates to a single adverse event and is determined by the difference in the proportions of patients experiencing the adverse event under the two treatments.
- (3) The cutoff values for the safety summary statistic are determined at the design stage and are used to guide the reallocation of the significance level.
Regarding Assumption (1), we consider it to be reasonable, as many commonly used test and summary statistics are asymptotically normally distributed under standard regularity conditions. Assumptions (2) and (3) were adopted to facilitate the theoretical derivation of the FWER. In practice, however, it may be difficult to pre-specify a single adverse event and fixed cutoff values at the design stage. More commonly, all safety data are reviewed after all data have been collected, but before efficacy data are analyzed, and the decision on whether to reallocate the significance level is taken following discussions among investigators. In such cases, the safety data may influence the choice of cutoff values, making the mathematical evaluation of FWER more complex. Therefore, for analytical tractability, we focused on a setting in which both the adverse event and the cutoff values are specified in advance. Given that our study demonstrated inflation of FWER even under this simplified setting—and that FWER control is not guaranteed under more flexible or data-driven reallocation procedures—we think that such procedures, including the one examined in this study, should be avoided when strict FWER control is required.
From a practical perspective, if a treatment is found to have a very high rate of adverse events, efficacy data may not be collected after patient dropout. In such cases, the sample sizes of both the intention-to-treat population and the full analysis set may become imbalanced between groups. In general, an imbalance in sample size can lead to reduced power in the two-stage procedure. However, this study focuses on fundamental theoretical aspects and does not address practical issues such as dropout-related sample size imbalance.
This study focused solely on a two-stage procedure. However, in three- or multi-stage procedures, inflation of FWER could also occur because correlations between the test statistic and other summary statistics, which drive the inflation, may also arise in these more complex settings.
The two-stage procedure examined in this study involves sequential decision-making and, therefore, bears some resemblance to group sequential designs and the associated bias correction for point estimation (e.g., [9,10]). However, a key difference is that group sequential designs involve repeated testing of a single null hypothesis, whereas our procedure conducts only a single hypothesis test in the second stage, guided by a decision taken in the first stage. The primary focus of this study is on controlling FWER in the second stage.
The implications of our findings extend beyond clinical trial settings. The issues addressed in this study commonly arise when significance levels are reallocated based on external or auxiliary information. Since intuitive reasoning about probabilities can be misleading, careful consideration is essential when employing complex testing procedures.