Inflation of Familywise Error Rate in Treatment Efficacy Testing Due to the Reallocation of Significance Levels Based on Safety Data

Akifumi Notsu and Keita Mori

Clinical Research Center, Shizuoka Cancer Center, 1007 Shimonagakubo, Sunto-Nagaizumi, Shizuoka 411-8777, Japan

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(16), 2547; https://doi.org/10.3390/math13162547

This article belongs to the Special Issue Sequential Sampling Methods for Statistical Inference

Abstract

In randomized clinical trials comparing two standard treatments, a two-sided test for efficacy at a significance level $α$ is typically used when neither treatment is expected to be superior at the design stage. This two-sided test comprises two one-sided tests, each conducted at a significance level of $α / 2$. If safety data later suggest that one treatment is not clinically acceptable due to a higher rate of adverse events, investigators may reallocate the $α / 2$ significance level originally assigned to the one-sided efficacy test for that treatment to the other one-sided test. This results in a two-stage procedure. We examine the impact of such reallocation on the familywise error rate (FWER). Using theoretical derivations and simulation studies, we show that FWER can exceed the nominal level $α$ when the treatment with fewer adverse events tends to show greater efficacy. Therefore, the two-stage procedure should be avoided when strict control of FWER is a priority. These findings emphasize the need for caution when reallocating significance levels based on auxiliary information and have implications beyond clinical trials, particularly in adaptive statistical methodologies.

Keywords: clinical trial; adaptive clinical trial; familywise error rate; reallocation

MSC: 62F03; 62J15; 62L10; 62P10

1. Introduction

This paper considers a randomized clinical trial comparing the efficacy of two standard treatments. When neither treatment is expected to be superior at the design stage, a two-sided test for efficacy at a significance level $α$ is typically employed [1]. If one treatment is shown to be more effective than the other, it will continue to be the standard treatment, while the other will not be adopted as the standard treatment.

The two-sided test at a significance level of $α$ consists of two one-sided tests, each assessing whether one treatment is superior to the other, with a significance level of $α / 2$. If, for some reason, one of the one-sided tests becomes irrelevant, the significance level of $α / 2$ assigned to that test will be wasted. Suppose that one treatment is found to have a very high rate of adverse events. Even if it were shown to be more effective, clinicians may consider it unreasonable for that treatment to remain the standard while the other treatment does not. In such a case, investigators would not be interested in demonstrating that the treatment is more effective. Instead, they would seek to reallocate the significance level of $α / 2$, originally assigned to testing whether the treatment is more effective, to testing whether the other treatment is more effective.

This procedure consists of two stages: The first stage reviews safety data and determines the reallocation of the significance level, while the second stage conducts the efficacy test. Two-stage procedures in hypothesis testing, although different from the one described above, have been widely discussed in the statistical literature [2,3,4,5,6]. If the two stages are not properly considered, using these procedures can lead to an increased type I error rate or familywise error rate (FWER), defined as the probability of rejecting one or more true null hypotheses under any combination of true and false null hypotheses [7].

For example, seamless phase II/III trials use data from the phase II stage in the phase III stage, and the two-stage procedure involves decisions made at both stages [2]. Senn [3] introduced a different two-stage procedure in a cross-over trial, where a preliminary test on a carry-over effect guides the choice of a primary test for the treatment effect. Similarly, Kahan [4] considered a factorial design and used an interaction test to determine whether to compare individual treatment combinations or contrasts such as A vs. not-A and B vs. not-B. Campbell and Dean [5] considered a two-stage procedure for Cox models, selecting a test for regression coefficients based on a test of the proportional hazard assumption. A two-stage procedure for group-sequential designs with two endpoints was also described in [6].

To the best of our knowledge, the reallocation of significance levels in two-sided hypothesis testing problems has not been mathematically investigated in the existing literature. This paper examines the validity of the two-stage procedure in such settings in terms of controlling FWER. The remainder of the paper is organized as follows. Section 2 provides a formal analytic description of the problem, followed by the calculation of FWER in Section 3 and the calculation of power in Section 4. Section 5 presents a statistical model for individuals, which satisfies the requirements outlined in Section 2. Section 6 presents a concluding discussion.

2. Reallocation of Significance Levels Based on Safety Data

We present a formal analytic description of the problem discussed in the Introduction. Two treatments are represented by $T_{1}$ and $T_{2}$. For simplicity, we assume that the safety data used to determine the reallocation of significance levels relate to a single adverse event and are summarized by the difference in the proportions of patients experiencing the adverse event under treatments $T_{1}$ and $T_{2}$. In addition, a rule for determining the reallocation of significance levels based on these data is established when planning the trial. Let $θ$ be an effect measure of efficacy, where $θ > 0$ indicates that $T_{1}$ is more efficacious than $T_{2}$, $θ = 0$ indicates that $T_{1}$ and $T_{2}$ have the same efficacy, and $θ < 0$ indicates that $T_{2}$ is more efficacious than $T_{1}$. One example of such a measure is $θ = E (Y | T = T_{1}) - E (Y | T = T_{2})$, where $Y$ is a continuous outcome and $T$ is the assigned treatment taking values $T_{1}$ and $T_{2}$. The null hypotheses of the two one-sided and one two-sided tests concerning $θ$ are defined as $H_{0}^{T_{1}} : θ \leq 0$, $H_{0}^{T_{1}, T_{2}} : θ = 0$, and $H_{0}^{T_{2}} : θ \geq 0$. Note that $H_{0}^{T_{1}, T_{2}} = H_{0}^{T_{1}} \cap H_{0}^{T_{2}}$.

Let $Z_{E}$ be a test statistic for $θ$ that follows a normal distribution with mean $θ$ and variance 1. Let $Z_{S}$ be the standardized difference between the proportions of patients experiencing an adverse event under treatments $T_{1}$ and $T_{2}$. Suppose that $(Z_{E}, Z_{S})$ follows a two-dimensional normal distribution with mean vector $(θ, 0)$, marginal variances of $1$, and correlation coefficient $r$. The mean of $Z_{S}$ is assumed to be 0 since it is not the focus of interest.

A two-stage procedure is determined with cutoff values $c_{L}$ and $c_{U}$ ($c_{L} \leq c_{U}$) for safety data. If $c_{L} \leq Z_{S} \leq c_{U}$, the two-sided test for $H_{0}^{T_{1}, T_{2}}$ is conducted at the significance level of $α$, which is equivalent to performing two one-sided tests for $H_{0}^{T_{1}}$ and $H_{0}^{T_{2}}$, each at the significance level of $α / 2$. If $Z_{S} > c_{U}$, the significance level of $α / 2$ assigned to the test for $H_{0}^{T_{1}}$ is reallocated to the test for $H_{0}^{T_{2}}$, so that only the test for $H_{0}^{T_{2}}$ is conducted at the significance level of $α$. If $Z_{S} < c_{L}$, the significance level of $α / 2$ assigned to the test for $H_{0}^{T_{2}}$ is reallocated to the test for $H_{0}^{T_{1}}$, so that only the test for $H_{0}^{T_{1}}$ is conducted at the significance level of $α$.
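The decision rule above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `two_stage_decision` and the labels `"H0_T1"` and `"H0_T2"` are assumptions introduced here, and the critical values are taken from the standard normal distribution as in the text.

```python
from statistics import NormalDist


def two_stage_decision(z_e, z_s, c_l, c_u, alpha=0.05):
    """Return the set of null hypotheses rejected by the two-stage rule.

    Hypothetical helper: "H0_T1" stands for H0^{T1}: theta <= 0 and
    "H0_T2" stands for H0^{T2}: theta >= 0.
    """
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)       # one-sided critical value at level alpha
    z_half = std.inv_cdf(1 - alpha / 2)    # one-sided critical value at level alpha / 2
    rejected = set()
    if z_s > c_u:
        # Reallocation towards H0^{T2}: only this test is run, at level alpha.
        if z_e < -z_alpha:
            rejected.add("H0_T2")
    elif z_s < c_l:
        # Reallocation towards H0^{T1}: only this test is run, at level alpha.
        if z_e > z_alpha:
            rejected.add("H0_T1")
    else:
        # No reallocation: two one-sided tests, each at level alpha / 2.
        if z_e > z_half:
            rejected.add("H0_T1")
        elif z_e < -z_half:
            rejected.add("H0_T2")
    return rejected
```

For instance, with $c_{L} = - 1.5$ and $c_{U} = 1.5$, an efficacy statistic of 1.8 rejects nothing when $Z_{S} = 0$ (since 1.8 is below $z_{0.025}$), but rejects $H_{0}^{T_{1}}$ when $Z_{S} = - 2$, because the whole level $α$ is then spent on that one-sided test.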

The two-stage procedure described above maintains a type I error rate of $α$ for each test. However, due to the correlation $r = \mathrm{cor} (Z_{E}, Z_{S})$ between the efficacy test statistic and the safety index, it is unclear whether the two-stage procedure adequately controls FWER. The next section mathematically evaluates FWER for the two-stage procedure.

3. Derivation of the Familywise Error Rate for the Two-Stage Procedure

This section evaluates FWER for the two-stage procedure, which is a function of the effect measure $θ$, the correlation coefficient $r$, the cutoff values $c_{L}$ and $c_{U}$, and the significance level $α$. The function is denoted by $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$. Let $Φ (x)$ be the cumulative distribution function of the standard normal distribution, and let $z_{α}$ be the upper $α$ critical value of the standard normal distribution, that is, $Φ (z_{α}) = 1 - α$.

In the following, we evaluate FWER in three distinct cases: $θ > 0$, $θ < 0$, and $θ = 0$. When $θ > 0$, FWER is given by

$$\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) = P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α / 2}) .$$

Since $- z_{α / 2} < - z_{α}$, it follows that

$$P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α / 2}) \leq P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α}),$$

which equals

$$P (c_{L} \leq Z_{S}, Z_{E} < - z_{α}) .$$

We then have

$$\begin{aligned} P (c_{L} \leq Z_{S}, Z_{E} < - z_{α}) & \leq P (Z_{E} < - z_{α}) \\ & = P (Z_{E} - θ < - z_{α} - θ) \\ & = Φ (- z_{α} - θ) \\ & < Φ (- z_{α}) \\ & = α . \end{aligned}$$

Hence, for $θ > 0$, we conclude that $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) \leq α$.
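As a worked illustration of the bound for $θ > 0$, take $θ = 0.5$ and $α = 0.05$ (values chosen here for illustration; they are not tied to a particular table entry). Whatever the values of $r$, $c_{L}$, and $c_{U}$, the chain of inequalities gives approximately:

```latex
\mathrm{FWER}(0.5, r, c_{L}, c_{U}, 0.05)
  \leq \Phi(-z_{0.05} - 0.5)
  \approx \Phi(-1.645 - 0.5)
  = \Phi(-2.145)
  \approx 0.016 < 0.05 .
```

The bound tightens quickly as $θ$ moves away from zero, which is why the case $θ = 0$ is the one that matters for error inflation.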

When $θ < 0$, FWER is given by

$$\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) = P (Z_{S} < c_{L}, z_{α} < Z_{E}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < Z_{E}) .$$

Since $z_{α} < z_{α / 2}$, it follows that

$$P (Z_{S} < c_{L}, z_{α} < Z_{E}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < Z_{E}) \leq P (Z_{S} < c_{L}, z_{α} < Z_{E}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α} < Z_{E}),$$

which equals

$$P (Z_{S} \leq c_{U}, z_{α} < Z_{E}) .$$

We then have

$$\begin{aligned} P (Z_{S} \leq c_{U}, z_{α} < Z_{E}) & \leq P (z_{α} < Z_{E}) \\ & = P (z_{α} - θ < Z_{E} - θ) \\ & = 1 - Φ (z_{α} - θ) \\ & < 1 - Φ (z_{α}) \\ & = α . \end{aligned}$$

Hence, for $θ < 0$, we conclude that $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) \leq α$.

When $θ = 0$, $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$ is given by

$$P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < |Z_{E}|) + P (Z_{S} < c_{L}, z_{α} < Z_{E}) .$$

When $θ = 0$ and $r = 0$, $Z_{E}$ and $Z_{S}$ are independent since $(Z_{E}, Z_{S})$ follows a two-dimensional normal distribution. Therefore, $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$ can be written as

$$P (c_{U} < Z_{S}) P (Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}) P (z_{α / 2} < |Z_{E}|) + P (Z_{S} < c_{L}) P (z_{α} < Z_{E}),$$

which equals

$$α \left( P (c_{U} < Z_{S}) + P (c_{L} \leq Z_{S} \leq c_{U}) + P (Z_{S} < c_{L}) \right) = α .$$

Hence, when $θ = 0$ and $r = 0$, we conclude that $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) = α$.

When $θ = 0$ and $r \neq 0$, $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$ can be calculated via numerical integration. Table 1 presents the values of $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$ for $α = 0.05$ under various combinations of $r$, $c_{L}$, and $c_{U}$. When $Z_{E}$ and $Z_{S}$ are negatively correlated, that is, when treatments with fewer adverse events tend to show better efficacy, FWER exceeds $α$. In contrast, when $Z_{E}$ and $Z_{S}$ are positively correlated, that is, when treatments with more adverse events tend to show better efficacy, FWER remains below $α$. When the absolute values of $c_{L}$ and $c_{U}$ are large (i.e., $|c_{L}| = |c_{U}| \geq 2.5$), the probability that no reallocation occurs is high, and as a result, FWER tends to be controlled at $α$. In contrast, when the absolute values of $c_{L}$ and $c_{U}$ are small (i.e., $|c_{L}| = |c_{U}| \leq 2.0$), the probability of reallocation increases, and as a result, the maximum FWER can reach $2 α = 0.10$.
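The numerical integration mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' R code: it conditions on $Z_{S} = s$, uses the fact that $Z_{E}$ given $Z_{S} = s$ is normal with mean $r s$ and variance $1 - r^{2}$ when $θ = 0$, and integrates with a simple midpoint rule. The names `fwer_at_null` and `_midpoint`, the step count, and the integration bound of 8 are choices made here; the sketch requires $|r| < 1$.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def _midpoint(f, lo, hi, steps):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


def fwer_at_null(r, c_l, c_u, alpha=0.05, steps=2000, bound=8.0):
    """FWER of the two-stage procedure at theta = 0, for |r| < 1."""
    z_a = STD.inv_cdf(1 - alpha)
    z_h = STD.inv_cdf(1 - alpha / 2)
    sd = sqrt(1 - r * r)

    def cond(s):
        # Conditional distribution of Z_E given Z_S = s when theta = 0.
        return NormalDist(r * s, sd)

    # Z_S > c_U: only H0^{T2} is tested, rejected if Z_E < -z_alpha.
    upper = _midpoint(lambda s: STD.pdf(s) * cond(s).cdf(-z_a), c_u, bound, steps)
    # Z_S < c_L: only H0^{T1} is tested, rejected if Z_E > z_alpha.
    lower = _midpoint(lambda s: STD.pdf(s) * (1 - cond(s).cdf(z_a)), -bound, c_l, steps)
    # c_L <= Z_S <= c_U: two-sided test, rejected if |Z_E| > z_{alpha/2}.
    middle = _midpoint(
        lambda s: STD.pdf(s) * (cond(s).cdf(-z_h) + 1 - cond(s).cdf(z_h)),
        c_l, c_u, steps,
    )
    return upper + middle + lower
```

Calling `fwer_at_null(-0.6, -1.0, 1.0)` and `fwer_at_null(0.6, -1.0, 1.0)` gives values above and below 0.05, respectively, matching the direction of the pattern described for Table 1.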

Table 1. FWERs of the two-stage procedure at $θ = 0$ and $α = 0.05$.

The observation regarding the maximum value of FWER can be formally proven. When $θ = 0$ and $r \neq 0$, the maximum value of $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α)$ is equal to $2 α$ since

$$\begin{aligned} & \mathrm{FWER} (θ, r, c_{L}, c_{U}, α) \\ & = P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < |Z_{E}|) + P (Z_{S} < c_{L}, z_{α} < Z_{E}) \\ & = P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α / 2}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < Z_{E}) + P (Z_{S} < c_{L}, z_{α} < Z_{E}) \\ & \leq P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α} < Z_{E}) + P (Z_{S} < c_{L}, z_{α} < Z_{E}) \\ & = P (c_{L} \leq Z_{S}, Z_{E} < - z_{α}) + P (Z_{S} \leq c_{U}, z_{α} < Z_{E}) \\ & \leq P (Z_{E} < - z_{α}) + P (z_{α} < Z_{E}) \\ & = 2 α . \end{aligned}$$

Equality holds if and only if the correlation coefficient $r = - 1$ and the cutoff values satisfy $- z_{α} \leq c_{L} \leq c_{U} \leq z_{α}$. We now provide a proof of this statement. Note that $\mathrm{FWER} (θ, r, c_{L}, c_{U}, α) = 2 α$ holds if and only if the following three equalities are satisfied:

$$P (c_{L} \leq Z_{S} \leq c_{U}, |Z_{E}| > z_{α / 2}) = P (c_{L} \leq Z_{S} \leq c_{U}, |Z_{E}| > z_{α}), \tag{1}$$

$$P (c_{L} \leq Z_{S}, Z_{E} < - z_{α}) = P (Z_{E} < - z_{α}), \tag{2}$$

$$P (Z_{S} \leq c_{U}, z_{α} < Z_{E}) = P (z_{α} < Z_{E}) . \tag{3}$$

If $|r| < 1$, then equalities (2) and (3) cannot hold. Therefore, we must have $|r| = 1$. Suppose $r = 1$. Then, $Z_{E} = Z_{S}$ and from (2) we have

$$P (c_{L} \leq Z_{E}, Z_{E} < - z_{α}) = P (Z_{E} < - z_{α}),$$

which leads to a contradiction, since $P (Z_{E} < c_{L}) > 0$ for any finite $c_{L}$. Therefore, $r$ must be $- 1$. In this case, $Z_{S} = - Z_{E}$, and from (2), we obtain

$$P (Z_{E} \leq - c_{L}, Z_{E} < - z_{α}) = P (Z_{E} < - z_{α}),$$

which implies $- z_{α} \leq - c_{L}$, that is, $c_{L} \leq z_{α}$. Similarly, from (3) we obtain

$$P (- c_{U} \leq Z_{E}, z_{α} < Z_{E}) = P (z_{α} < Z_{E}),$$

which implies $- c_{U} \leq z_{α}$, that is, $- z_{α} \leq c_{U}$. From (1), we have

$$P (- c_{U} \leq Z_{E} \leq - c_{L}, |Z_{E}| > z_{α / 2}) = P (- c_{U} \leq Z_{E} \leq - c_{L}, |Z_{E}| > z_{α}),$$

which implies $- z_{α} \leq - c_{U}$ and $- c_{L} \leq z_{α}$, that is, $- z_{α} \leq c_{L}$ and $c_{U} \leq z_{α}$. Then, equalities (1)–(3) hold if and only if $r = - 1$ and $- z_{α} \leq c_{L} \leq c_{U} \leq z_{α}$. This completes the proof. The following theorem summarizes the discussion above.

Theorem 1. FWER of the two-stage procedure satisfies the following properties: When $θ \neq 0$, or $θ = 0$ and $r = 0$, the FWER of the two-stage procedure is controlled at $α$. When $θ = 0$ and $r \neq 0$, FWER may exceed $α$, and it reaches the maximum value of $2 α$ when $r = - 1$ and $- z_{α} \leq c_{L} \leq c_{U} \leq z_{α}$.
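The extreme case in Theorem 1 can be checked directly, because $r = - 1$ at $θ = 0$ means $Z_{S} = - Z_{E}$ and every probability reduces to a one-dimensional normal probability. The sketch below is an illustration written for this purpose; the function name `fwer_perfect_negative` is an assumption introduced here.

```python
from statistics import NormalDist


def fwer_perfect_negative(c_l, c_u, alpha=0.05):
    """FWER at theta = 0 when r = -1, so that Z_S = -Z_E exactly."""
    std = NormalDist()
    z_a = std.inv_cdf(1 - alpha)
    z_h = std.inv_cdf(1 - alpha / 2)
    # Z_S > c_U  <=>  Z_E < -c_U; H0^{T2} is rejected if also Z_E < -z_alpha.
    upper = std.cdf(min(-c_u, -z_a))
    # Z_S < c_L  <=>  Z_E > -c_L; H0^{T1} is rejected if also Z_E > z_alpha.
    lower = 1 - std.cdf(max(-c_l, z_a))
    # c_L <= Z_S <= c_U  <=>  -c_U <= Z_E <= -c_L; rejection needs |Z_E| > z_{alpha/2}.
    mid_neg = max(0.0, std.cdf(min(-c_l, -z_h)) - std.cdf(-c_u))
    mid_pos = max(0.0, std.cdf(-c_l) - std.cdf(max(-c_u, z_h)))
    return upper + lower + mid_neg + mid_pos
```

With $α = 0.05$, cutoffs $\pm 1.0$ lie inside $[- z_{α}, z_{α}]$ and the function returns $2 α$, whereas cutoffs $\pm 2.5$ lie beyond $z_{α / 2}$ and the function returns $α$; cutoffs between $z_{α}$ and $z_{α / 2}$ give values in between.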

4. Derivation of Power for the Two-Stage Procedure

The power of the two-stage procedure to detect a true alternative hypothesis is a function of the effect measure $θ$, the correlation coefficient $r$, the cutoff values $c_{L}$ and $c_{U}$, and the significance level $α$. It is denoted by $\mathrm{POWER} (θ, r, c_{L}, c_{U}, α)$.

When $θ > 0$, $\mathrm{POWER} (θ, r, c_{L}, c_{U}, α)$ is given by

$$P (Z_{S} < c_{L}, z_{α} < Z_{E}) + P (c_{L} \leq Z_{S} \leq c_{U}, z_{α / 2} < Z_{E}) .$$

When $θ < 0$, $\mathrm{POWER} (θ, r, c_{L}, c_{U}, α)$ is given by

$$P (c_{U} < Z_{S}, Z_{E} < - z_{α}) + P (c_{L} \leq Z_{S} \leq c_{U}, Z_{E} < - z_{α / 2}) .$$

When $θ = 0$, $\mathrm{POWER} (θ, r, c_{L}, c_{U}, α) = 0$, as all alternative hypotheses are false. The power of the two-stage procedure can also be calculated via numerical integration. Table 2 presents the power of the two-stage procedure for $α = 0.05$ under various values of $θ$, $r$, $c_{L}$, and $c_{U}$. The power values for $θ = 3.0$ and $θ = - 3.0$ are identical, and this equality holds for the other values of $|θ|$ as well. This pattern arises from the symmetry of the bivariate normal distribution, together with the symmetric cutoffs ($c_{L} = - c_{U}$) used in Table 2. When $θ = 3.0$ and $|c_{L}| = |c_{U}| = 2.5$ or $1.5$, the probability of conducting the test for $H_{0}^{T_{1}} : θ \leq 0$ is high, resulting in high power. The same applies when $θ = - 3.0$, in which case the test for $H_{0}^{T_{2}} : θ \geq 0$ is likely to be conducted. In contrast, when $θ = 3.0$ and $|c_{L}| = |c_{U}| = 0.1$, the probability of conducting the test for $H_{0}^{T_{1}} : θ \leq 0$ is low, resulting in low power. The same applies when $θ = - 3.0$, in which case the test for $H_{0}^{T_{2}} : θ \geq 0$ is unlikely to be conducted. As the correlation coefficient $r$ increases, the power generally decreases since the probability of conducting the test for $H_{0}^{T_{1}} : θ \leq 0$ when $θ > 0$, and for $H_{0}^{T_{2}} : θ \geq 0$ when $θ < 0$, decreases. The power of the two-sided test for $H_{0}^{T_{1}, T_{2}} : θ = 0$ is equal to $0.851$, $0.323$, $0.072$, $0.072$, $0.323$, and $0.851$ for $θ = 3.0, 1.5, 0.5, - 0.5, - 1.5,$ and $- 3.0$, respectively. When $|θ| = 3.0$, the power of the two-stage procedure is, in some cases, lower than that of the two-sided test. This may be problematic. In contrast, when $|θ| = 1.5$, the power of the two-stage procedure, in some cases, exceeds that of the two-sided test. However, this improvement in power is a direct consequence of the failure to control the FWER under $θ = 0$; in other words, the gain in power reflects a loss of error rate control.
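As a cross-check on the power formulas, the sketch below estimates the power by Monte Carlo simulation instead of the numerical integration used for Table 2; the technique is swapped in here for simplicity and is not the authors' method. It draws $(Z_{E}, Z_{S})$ from the bivariate normal model of Section 2 and counts rejections of the false null hypothesis. The function name `simulate_power`, the seed, and the number of repetitions are assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(theta, r, c_l, c_u, alpha=0.05, reps=100_000, seed=1):
    """Monte Carlo estimate of POWER(theta, r, c_L, c_U, alpha)."""
    if theta == 0:
        return 0.0  # no alternative hypothesis is true
    rng = random.Random(seed)
    std = NormalDist()
    z_a = std.inv_cdf(1 - alpha)
    z_h = std.inv_cdf(1 - alpha / 2)
    resid_sd = sqrt(1 - r * r)
    direction = 1 if theta > 0 else -1   # sign of the true effect
    hits = 0
    for _ in range(reps):
        z_s = rng.gauss(0.0, 1.0)
        # Z_E has mean theta, variance 1, and correlation r with Z_S.
        z_e = theta + r * z_s + resid_sd * rng.gauss(0.0, 1.0)
        if c_l <= z_s <= c_u:
            crit, tested = z_h, (1, -1)   # both one-sided tests at alpha / 2
        elif z_s < c_l:
            crit, tested = z_a, (1,)      # only H0^{T1} is tested, at alpha
        else:
            crit, tested = z_a, (-1,)     # only H0^{T2} is tested, at alpha
        if direction in tested and direction * z_e > crit:
            hits += 1
    return hits / reps
```

Up to simulation error, `simulate_power(1.5, -0.4, -2.5, 2.5)` should be close to the corresponding Table 2 entry of 0.323, and swapping the sign of $θ$ should give approximately the same estimate when the cutoffs are symmetric.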

Table 2. Power of the two-stage procedure at $α = 0.05$.

5. A Statistical Model Satisfying the Assumptions in Section 2

We present a statistical model for individuals that is appropriate for the setting described in Section 2. For $i, j = 1, \dots, n$, let $L_{i}^{T_{1}}$ and $L_{j}^{T_{2}}$ denote latent variables for individuals receiving treatments $T_{1}$ and $T_{2}$, respectively. These variables are assumed to be independently and identically distributed as standard normal variables.

For a patient $i$ receiving $T_{1}$, the efficacy outcome $Y_{i, E}^{T_{1}}$ is given by $L_{i}^{T_{1}} + d$ for some non-negative value $d$, and the safety outcome $Y_{i, S}^{T_{1}}$ takes the value 0 (no adverse event) or 1 (adverse event) with probability

$$P (Y_{i, S}^{T_{1}} = 1 | L_{i}^{T_{1}}) = \exp (L_{i}^{T_{1}} - c) / \{1 + \exp (L_{i}^{T_{1}} - c)\},$$

where we assume a logistic relationship between the latent variable $L_{i}^{T_{1}}$ and the response variable $Y_{i, S}^{T_{1}}$. Here, $c$ is a constant term. Similarly, for patient $j$ receiving $T_{2}$, the efficacy outcome $Y_{j, E}^{T_{2}}$ is given by $L_{j}^{T_{2}}$, and the safety outcome $Y_{j, S}^{T_{2}}$ takes the value 0 (no adverse event) or 1 (adverse event) with probability

$$P (Y_{j, S}^{T_{2}} = 1 | L_{j}^{T_{2}}) = \exp (L_{j}^{T_{2}} - c) / \{1 + \exp (L_{j}^{T_{2}} - c)\} .$$

When $c = 1.0, 1.5,$ and $2.0$, the probability that $Y_{i, S}^{T_{1}}$ or $Y_{j, S}^{T_{2}}$ equals 1 is $0.303$, $0.221$, and $0.115$, respectively. In this study, we set $c = 2$.

In this setting, a test statistic for efficacy and a summary statistic for safety are given as

$$Z_{E} = \sqrt{\frac{n}{2}} \left( \frac{1}{n} \sum_{i = 1}^{n} Y_{i, E}^{T_{1}} - \frac{1}{n} \sum_{j = 1}^{n} Y_{j, E}^{T_{2}} \right),$$

$$Z_{S} = \frac{\frac{1}{n} \sum_{i = 1}^{n} Y_{i, S}^{T_{1}} - \frac{1}{n} \sum_{j = 1}^{n} Y_{j, S}^{T_{2}}}{\sqrt{\left( \frac{1}{n} \sum_{i = 1}^{n} Y_{i, S}^{T_{1}} \right) \left( 1 - \frac{1}{n} \sum_{i = 1}^{n} Y_{i, S}^{T_{1}} \right) / n + \left( \frac{1}{n} \sum_{j = 1}^{n} Y_{j, S}^{T_{2}} \right) \left( 1 - \frac{1}{n} \sum_{j = 1}^{n} Y_{j, S}^{T_{2}} \right) / n}} .$$

By the central limit theorem, the pair $(Z_{E}, Z_{S})$ is asymptotically normally distributed and approximately satisfies the assumptions in Section 2. The mean of $Z_{E}$, denoted by $θ$, is equal to $d \sqrt{n / 2}$, where $n$ is treated as a constant. The correlation coefficient between $Z_{E}$ and $Z_{S}$ was estimated via a Monte Carlo simulation. We generated data with $n = 200$ and conducted $10^{5}$ repetitions under $θ = 0$, that is, $d = 0$. This resulted in $10^{5}$ independent pairs of $(Z_{E}, Z_{S})$, which are shown in the scatter plot in Figure 1. The correlation coefficient of $(Z_{E}, Z_{S})$ was estimated to be 0.319, with a 95% confidence interval of $[0.314, 0.325]$.
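The simulation described above can be sketched as follows. This is a minimal Python illustration of the individual-level model, not the authors' R code from the Supplementary Materials; the function names `simulate_pair` and `estimate_correlation`, the seed, and the default of 2000 repetitions (far fewer than the $10^{5}$ used in the paper, to keep the run short) are assumptions made here.

```python
import random
from math import exp, sqrt


def simulate_pair(n, d, c, rng):
    """Simulate one trial and return (Z_E, Z_S), or None if Z_S is undefined."""
    sum_eff = [0.0, 0.0]
    events = [0, 0]
    for arm, shift in enumerate((d, 0.0)):   # arm 0 is T1, arm 1 is T2
        for _ in range(n):
            latent = rng.gauss(0.0, 1.0)
            sum_eff[arm] += latent + shift
            # Logistic link between the latent variable and the adverse event.
            p_event = 1.0 / (1.0 + exp(-(latent - c)))
            events[arm] += rng.random() < p_event
    p1, p2 = events[0] / n, events[1] / n
    z_e = sqrt(n / 2) * (sum_eff[0] / n - sum_eff[1] / n)
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    if se == 0:
        return None   # both observed proportions are 0 or 1
    return z_e, (p1 - p2) / se


def estimate_correlation(n=200, d=0.0, c=2.0, reps=2000, seed=1):
    """Sample correlation of (Z_E, Z_S) over repeated simulated trials."""
    rng = random.Random(seed)
    pairs = [p for p in (simulate_pair(n, d, c, rng) for _ in range(reps)) if p]
    m = len(pairs)
    mean_e = sum(e for e, _ in pairs) / m
    mean_s = sum(s for _, s in pairs) / m
    cov = sum((e - mean_e) * (s - mean_s) for e, s in pairs)
    var_e = sum((e - mean_e) ** 2 for e, _ in pairs)
    var_s = sum((s - mean_s) ** 2 for _, s in pairs)
    return cov / sqrt(var_e * var_s)
```

With $n = 200$, $d = 0$, and $c = 2$, the estimate should be positive and in the neighbourhood of the reported 0.319, although with only a few thousand repetitions the sampling error is noticeably larger than in the paper.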

Figure 1. Scatter plot of $10^{5}$ independent pairs of $(Z_{E}, Z_{S})$.

Since the number of individuals in each treatment group is $n = 200$, $Z_{S}$ can take at most $200 \times 200 = 40,000$ distinct values. As a result, certain values near zero are not represented in $Z_{S}$, leading to visible horizontal white lines in the scatter plot. As $n$ increases, these lines gradually disappear.

If the probability that $Y_{i, S}^{T_{k}}$ equals 1 is instead given by $\exp (- L_{i}^{T_{k}} - c) / \{1 + \exp (- L_{i}^{T_{k}} - c)\}$ for $k = 1, 2$, with $c = - 2$, and all other settings remain as previously described, then the correlation coefficient between $Z_{E}$ and $Z_{S}$ is estimated to be $- 0.319$, with a 95% confidence interval of $[- 0.325, - 0.314]$.

6. Concluding Discussion

In this paper, we considered a two-stage procedure in clinical trial settings comparing two standard treatments, in which a two-sided test for efficacy at a significance level of $α$ is planned at the design stage. The $α / 2$ allocated to one of the two one-sided tests that constitute the two-sided test can potentially be reallocated as follows: the $α / 2$ originally assigned to the one-sided test for the treatment with a higher rate of an adverse event is reallocated to the other one-sided test. In Theorem 1, we show that FWER for this two-stage procedure can exceed the nominal significance level $α$ when the treatment associated with a lower rate of adverse events tends to demonstrate greater efficacy. Therefore, this procedure should be avoided when strict control of FWER is a priority.

Consider a clinical example in which cancer patients treated with anticancer drugs tend to experience greater treatment efficacy if they develop adverse events than if they do not [8]. In such a scenario, applying the two-stage procedure in a trial comparing two such anticancer drugs may control FWER at the nominal level $α$; however, as demonstrated in this paper, it can result in reduced statistical power in some cases.

The main assumptions in Section 2 are as follows:

(1): The test statistic for efficacy and the summary statistic for safety jointly follow a multivariate normal distribution.
(2): The summary statistic for safety relates to a single adverse event and is determined by the difference in the proportions of patients experiencing the adverse event under treatments $T_{1}$ and $T_{2}$.
(3): The cutoff values for the safety summary statistic are determined at the design stage and are used to guide the reallocation of the significance level.

Regarding Assumption (1), we consider it to be reasonable, as many commonly used test and summary statistics are asymptotically normally distributed under standard regularity conditions. Assumptions (2) and (3) were adopted to facilitate the theoretical derivation of the FWER. In practice, however, it may be difficult to pre-specify a single adverse event and fixed cutoff values at the design stage. More commonly, all safety data are reviewed after all data have been collected, but before efficacy data are analyzed, and the decision on whether to reallocate the significance level is taken following discussions among investigators. In such cases, the safety data may influence the choice of cutoff values, making the mathematical evaluation of FWER more complex. Therefore, for analytical tractability, we focused on a setting in which both the adverse event and the cutoff values are specified in advance. Given that our study demonstrated inflation of FWER even under this simplified setting, and that FWER control is not guaranteed under more flexible or data-driven reallocation procedures, we think that such procedures, including the one examined in this study, should be avoided when strict FWER control is required.

From a practical perspective, if a treatment is found to have a very high rate of adverse events, efficacy data may not be collected after patient dropout. In such cases, the sample sizes of both the intention-to-treat population and the full analysis set may become imbalanced between groups. In general, an imbalance in sample size can lead to reduced power in the two-stage procedure. However, this study focuses on fundamental theoretical aspects and does not address practical issues such as dropout-related sample size imbalance.

This study focused solely on a two-stage procedure. However, in three- or multi-stage procedures, inflation of FWER could also occur because correlations between the test statistic and other summary statistics, which drive the inflation, may also arise in these more complex settings.

The two-stage procedure examined in this study involves sequential decision-making and, therefore, bears some resemblance to group sequential designs and the associated bias correction for point estimation (e.g., [9,10]). However, a key difference is that group sequential designs involve repeated testing of a single null hypothesis, whereas our procedure conducts only a single hypothesis test in the second stage, guided by a decision taken in the first stage. The primary focus of this study is on controlling FWER in the second stage.

The implications of our findings extend beyond clinical trial settings. The issues addressed in this study commonly arise when significance levels are reallocated based on external or auxiliary information. Since intuitive reasoning about probabilities can be misleading, careful consideration is essential when employing complex testing procedures.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/math13162547/s1: the R code for simulations.

Author Contributions

Conceptualization, A.N. and K.M.; methodology, A.N. and K.M.; formal analysis, A.N.; writing—original draft preparation, A.N.; writing—review and editing, A.N. and K.M.; visualization, A.N.; supervision, K.M.; project administration, A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by JSPS KAKENHI, Grant Number JP23H03353.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article. The R code used for the simulation is provided in the Supplementary Materials.

Acknowledgments

We used ChatGPT (GPT-4o, OpenAI) to identify grammatical errors and improve our English expressions throughout the development of this work.

Conflicts of Interest

No authors have a conflict of interest related to the contents of this manuscript.

References

1. Green, S.; Benedetti, J.; Smith, A.; Crowley, J. Clinical Trials in Oncology, 3rd ed.; Taylor & Francis: Oxfordshire, UK, 2012.
2. Yu, M.; Man, R.; Zhu, H.; Wang, L. Enhancing the flexibility and power of adaptive seamless phase 2/3 design with copula modeling between short-term and long-term endpoints. Commun. Stat. Simul. Comput. 2024, 1–21.
3. Senn, S. Viewpoint: Do not resurrect the two-stage procedure. In Pharmaceutical Statistics; John Wiley and Sons Ltd.: Hoboken, NJ, USA, 2022; pp. 808–814.
4. Kahan, B.C. Bias in randomised factorial trials. Stat. Med. 2013, 32, 4540–4549.
5. Campbell, H.; Dean, C.B. The consequences of proportional hazards based model selection. Stat. Med. 2014, 33, 1042–1056.
6. Hung, H.M.J.; Wang, S.-J.; O’Neill, R. Statistical considerations for testing multiple endpoints in group sequential or adaptive clinical trials. J. Biopharm. Stat. 2007, 17, 1201–1210.
7. Dmitrienko, A.; Tamhane, A.C.; Bretz, F. (Eds.) Multiple Testing Problems in Pharmaceutical Statistics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009.
8. Haratani, K.; Hayashi, H.; Chiba, Y.; Kudo, K.; Yonesaka, K.; Kato, R.; Kaneda, H.; Hasegawa, Y.; Tanaka, K.; Takeda, M.; et al. Association of immune-related adverse events with nivolumab efficacy in non-small cell lung cancer. JAMA Oncol. 2018, 4, 374–378.
9. Grayling, M.J.; Wason, J.M.S. Point estimation following a two-stage group sequential trial. Stat. Methods Med. Res. 2023, 32, 287–304.
10. Grayling, M.J.; Wason, J.M.S.; Mander, A.P. Group sequential crossover trial designs with strong control of the familywise error rate. Seq. Anal. 2018, 37, 174–203.

Table 1. FWERs of the two-stage procedure at $θ = 0$ and $α = 0.05$.

$c_{L}$	$c_{U}$	$r = -0.99$	$r = -0.80$	$r = -0.60$	$r = -0.40$	$r = -0.20$	$r = 0.20$	$r = 0.40$	$r = 0.60$	$r = 0.80$	$r = 0.99$
−0.1	0.1	0.100	0.099	0.093	0.081	0.066	0.034	0.019	0.006	0.000	0.000
−1.0	1.0	0.100	0.088	0.077	0.068	0.059	0.039	0.028	0.016	0.005	0.000
−1.5	1.5	0.097	0.073	0.065	0.060	0.055	0.044	0.037	0.027	0.014	0.000
−2.0	2.0	0.054	0.059	0.056	0.054	0.052	0.047	0.044	0.038	0.029	0.009
−2.5	2.5	0.050	0.052	0.052	0.051	0.051	0.049	0.048	0.045	0.042	0.038

FWER denotes the familywise error rate. $c_{L}$ and $c_{U}$ represent the lower and upper cutoff values for $Z_{S}$, respectively. $r$ denotes the correlation coefficient between $Z_{E}$ and $Z_{S}$.

Table 2. Power of the two-stage procedure at $α = 0.05$.

$θ$	$c_{L}$	$c_{U}$	$r = -0.99$	$r = -0.80$	$r = -0.40$	$r = -0.20$	$r = 0.20$	$r = 0.40$	$r = 0.80$	$r = 0.99$
3.0	−0.1	0.1	0.540	0.536	0.513	0.500	0.475	0.464	0.451	0.452
3.0	−1.5	1.5	0.851	0.836	0.811	0.803	0.795	0.794	0.795	0.786
3.0	−2.5	2.5	0.851	0.851	0.848	0.847	0.846	0.846	0.845	0.845
1.5	−0.1	0.1	0.428	0.367	0.293	0.261	0.198	0.164	0.078	0.001
1.5	−1.5	1.5	0.323	0.325	0.325	0.318	0.298	0.285	0.259	0.256
1.5	−2.5	2.5	0.323	0.323	0.323	0.323	0.320	0.318	0.317	0.317
0.5	−0.1	0.1	0.126	0.122	0.096	0.080	0.047	0.031	0.003	0.000
0.5	−1.5	1.5	0.077	0.084	0.079	0.076	0.065	0.058	0.036	0.010
0.5	−2.5	2.5	0.072	0.073	0.073	0.073	0.071	0.070	0.067	0.066
−0.5	−0.1	0.1	0.126	0.122	0.096	0.080	0.047	0.031	0.003	0.000
−0.5	−1.5	1.5	0.077	0.084	0.079	0.076	0.065	0.058	0.036	0.010
−0.5	−2.5	2.5	0.072	0.073	0.073	0.073	0.071	0.070	0.067	0.066
−1.5	−0.1	0.1	0.428	0.367	0.293	0.261	0.198	0.164	0.078	0.001
−1.5	−1.5	1.5	0.323	0.325	0.325	0.318	0.298	0.285	0.259	0.256
−1.5	−2.5	2.5	0.323	0.323	0.323	0.323	0.320	0.318	0.317	0.317
−3.0	−0.1	0.1	0.540	0.536	0.513	0.500	0.475	0.464	0.451	0.452
−3.0	−1.5	1.5	0.851	0.836	0.811	0.803	0.795	0.794	0.795	0.786
−3.0	−2.5	2.5	0.851	0.851	0.848	0.847	0.846	0.846	0.845	0.845

Power denotes the probability of detecting a true alternative hypothesis. $θ$ represents the effect measure. $c_{L}$ and $c_{U}$ are the cutoff values for $Z_{S}$. $r$ denotes the correlation coefficient between $Z_{E}$ and $Z_{S}$.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
