Article

Hypothesis Testing for Two-Arm Proportions with Two Binary Endpoints

1 Department of Mathematics, Syracuse University, Syracuse, NY 13244, USA
2 Department of Mathematics and Statistics, University of North Florida, Jacksonville, FL 32224, USA
* Author to whom correspondence should be addressed.
Axioms 2026, 15(6), 435; https://doi.org/10.3390/axioms15060435
Submission received: 17 March 2026 / Revised: 25 May 2026 / Accepted: 1 June 2026 / Published: 11 June 2026
(This article belongs to the Special Issue Probability Theory and Stochastic Processes: Theory and Applications)

Abstract

Many studies require evidence that a new treatment improves efficacy and maintains or improves safety. Composite endpoints can obscure trade-offs and complicate interpretation. We propose a single-stage hypothesis test that directly evaluates two binary endpoints against a concurrent control, offering a transparent alternative to composite endpoints. The test rejects only if the observed improvements on both endpoints exceed pre-specified paired thresholds. The joint distribution of efficacy and safety is modeled with a four-category multinomial, yielding probabilities for all the outcome combinations. This enables exact computation of rejection probabilities and identification of least-favorable parameter configurations to control type I error at the nominal level while retaining adequate power. Design tables map the target significance level and power, together with predefined effect sizes for each endpoint, to the required sample size and decision thresholds. Simulations and one case study illustrate design selection and interpretation. The proposed test provides an exact and practical tool for early-phase trials with dual binary endpoints, particularly when efficacy and safety must be evaluated simultaneously.

1. Introduction

Clinical trials increasingly require evidence that new treatments not only improve efficacy but also maintain or enhance safety. Regulatory agencies such as the U.S. Food and Drug Administration [1] explicitly emphasize the importance of multiple endpoints to ensure that therapeutic benefit is not achieved at the expense of unacceptable adverse effects. In many modern therapeutic areas—such as oncology, immunotherapy, and cardiovascular medicine—both efficacy and toxicity are regarded as co-primary outcomes, reflecting the dual objective of maximizing patient benefit while minimizing harm. Composite endpoints are frequently used to summarize multiple outcomes into a single measure; however, they can obscure trade-offs between efficacy and safety and complicate interpretation [2]. Consequently, there is growing interest in transparent and statistically rigorous frameworks that jointly evaluate efficacy and safety without conflating their effects.
From a methodological standpoint, the prevailing paradigm for co-primary binary endpoints is the intersection–union testing (IUT) framework, in which overall success is declared only when significant improvement is demonstrated on all the endpoints [3,4]. This approach provides strong control of the familywise type I error rate and aligns with regulatory expectations. A natural modeling strategy is to represent two binary outcomes jointly as a four-category multinomial variable, explicitly accounting for their correlation and enabling exact computation of joint probabilities and rejection regions.
A substantial body of research has focused on the design of clinical trials with dual binary endpoints. The seminal work of Bryant and Day [5] introduced a two-stage single-arm phase II design that simultaneously monitors efficacy and toxicity under a multinomial model. Their framework defined admissible regions for early termination but was limited to settings in which the experimental treatment was compared with a fixed standard rather than against a control treatment. Building on this idea, Conaway and Petroni [6] extended the single-arm framework to bivariate sequential settings, still assuming fixed standard values. Subsequent developments by Stallard, Thall, and Whitehead [7], Ivanova et al. [8], and Chen and Chi [9] further advanced early-phase designs with dual binary endpoints under single-arm or fixed-standard settings. Thall and Cheng [10] proposed a related decision-theoretic design for evaluating efficacy and safety in which clinically meaningful improvements of an experimental treatment over a standard treatment are elicited from the physician. Their testing procedure was developed using a large-sample normal approximation, whereas the present study provides an exact two-arm hypothesis testing framework under a multinomial model, allowing rejection probabilities and type I error rates to be computed exactly. Therefore, despite substantial progress, a methodological gap remains for exact two-arm hypothesis tests that jointly evaluate efficacy and safety when a clear decision is reached.
The present study addresses this methodological gap by proposing an exact frequentist hypothesis test for two-arm trials with two binary endpoints—efficacy and safety. The goal of the proposed method is twofold. First, it aims to directly evaluate the simultaneous improvement of both endpoints relative to a concurrent control rather than against fixed pre-specified standard values. Second, it ensures exact control of the type I error rate at the nominal significance level by analytically identifying least favorable parameter configurations (LFCs) or generalized least favorable parameter configurations (GLFCs). The LFCs and GLFCs are obtained by using monotonicity results to identify boundary parameter settings and then exactly evaluating the rejection probabilities under the multinomial model.
To achieve these objectives, the joint outcomes of efficacy and safety are modeled using a four-category multinomial distribution, which enables complete enumeration of all possible response combinations across treatment arms. Within this framework, the study focuses on the following four key methodological questions:
  • How can we construct a single-stage rejection rule that declares superiority only when both endpoints exceed pre-specified paired thresholds?
  • How can rejection probabilities be computed exactly under the four-category multinomial model?
  • How can LFCs or GLFCs be identified to guarantee exact control of the type I error rate and adequate power?
  • How can design tables be developed to map desired significance levels, power, and effect sizes to the required sample size and decision thresholds?
Since the two binary endpoints generate four possible joint response categories, the proposed formulation is naturally connected to the classical analysis of 2 × 2 contingency tables; see, for example, Everitt [11] (Section 2.8). In particular, the condition p 11 p 22 = p 12 p 21 corresponds to the classical no-association structure for a 2 × 2 table, a testing problem discussed by Kendall and Stuart [12] (Section 33.16). The present work builds on this classical framework but focuses on a different objective: constructing an exact two-arm hypothesis testing procedure for simultaneous efficacy and safety improvement relative to a concurrent control. This work makes several distinct contributions. First, it extends the single-arm joint-endpoint framework of Bryant and Day [5] and Chen and Chi [9] to a two-arm comparative design, allowing direct inference about relative efficacy and safety between treatments rather than against fixed standards. Second, the proposed test provides an exact frequentist formulation that allows analytical computation of rejection probabilities and formal identification of least favorable configurations. Finally, the framework produces design tables that map nominal type I error, power and effect sizes to required sample sizes and thresholds, facilitating practical implementation with simulations and exact calculations.
More recently, Homma and Yoshida [13] and Jung et al. [14] considered two-arm clinical trial designs with two binary co-primary endpoints under joint binary-outcome frameworks. Homma and Yoshida [13] developed exact power and sample size calculations for two-arm superiority trials with two co-primary binary endpoints by combining the bivariate binomial distribution with endpoint-wise testing procedures, such as Fisher’s exact test, Fisher’s mid-p test, Pearson’s chi-square test, the Z-pooled exact unconditional test, and Boschloo’s exact unconditional test. Thus, their framework is primarily built on marginal tests for each endpoint within an intersection–union structure, whereas the present study directly formulates a joint multinomial hypothesis testing procedure for simultaneous efficacy and safety improvement based on the paired treatment–control differences in ( p e , p s ) .
Jung et al. [14] proposed a related two-arm, two-stage phase II design for two binary co-primary endpoints, which allows early termination for futility and is therefore useful for screening ineffective treatments. Their approach is closer in spirit to the present work because it also models correlated binary endpoints jointly. However, their operating characteristics are calibrated at pre-specified null and alternative configurations: the type I error rate is evaluated under selected point configurations of the null hypothesis, and the power is calculated at a specified alternative point. In contrast, the proposed method identifies least favorable or generalized least favorable configurations for type I error control and power evaluation so that the resulting design guarantees the desired operating characteristics over the corresponding parameter regions rather than only at selected design points.
By uniting the intersection–union test principle and multinomial modeling, this work offers an exact and transparent testing framework for clinical trials requiring simultaneous improvement in efficacy and safety. The proposed testing procedure is calibrated to guarantee the desired type I error control and power requirements over the corresponding parameter regions rather than only at selected point configurations. It provides statistical rigor through precise error control and clinical interpretability by maintaining separate explicit evaluation of efficacy and safety. ChatGPT-5.5 was used to assist in converting simulation output tables into LaTeX format.

2. Formulation for Two Endpoints

We consider a procedure for comparing an experimental treatment with a control. Let π 0 denote the control treatment and π 1 the experimental treatment. Each treatment yields two binary endpoints, referred to as efficacy and safety, although other binary outcomes could be accommodated within the same framework. For treatment j ( j = 0 , 1 ), let X e j and X s j denote the numbers of successes in efficacy and safety among n patients, with corresponding success probabilities p e j and p s j . The two endpoints within a treatment arm may be correlated, whereas responses across different treatments are assumed to be independent. In this paper, the association between the efficacy and safety endpoints is characterized through a pre-specified odds ratio ϕ . This specification is needed because, under the four-category multinomial model, the marginal efficacy probability, the marginal safety probability, and the odds ratio together determine the joint cell probabilities ( p 11 , p 12 , p 21 , p 22 ) ; conversely, the joint cell probabilities determine the two marginal probabilities and the odds ratio. Thus, a working specification of the association structure is required to fully define the multinomial probabilities used in the design. In practice, information on the control treatment may be available from historical studies or previous clinical experience, so the marginal response rates and the association between efficacy and safety can often be estimated or elicited for the control arm. For the experimental treatment, however, the association structure is typically less well established at the design stage. Following common practice in phase II designs with dual binary endpoints, we therefore use a common odds ratio across treatment arms as a working design assumption. A similar specification of the association structure through a pre-specified odds ratio has also been used in the dual-endpoint phase II trial literature; see, for example, Chen and Chi [9].
To simultaneously assess clinical efficacy and safety, one may formulate the null hypothesis as stating that the new treatment is not sufficiently better than the control in terms of either efficacy or safety, while the alternative hypothesis requires improvement in both dimensions. Specifically, we consider the following hypothesis testing problem:
$$H_0: p_{e1} \le p_{e0} \ \text{ or } \ p_{s1} \le p_{s0} \quad \text{vs.} \quad H_1: p_{e1} > p_{e0} \ \text{ and } \ p_{s1} > p_{s0}. \tag{1}$$
Based on the null and alternative hypotheses in (1), the following probability requirements are imposed:
$$P(\text{Reject } H_0 \mid H_0 \text{ is true}) \le \alpha, \tag{2}$$
and
$$P(\text{Reject } H_0 \mid H_1 \text{ is true}) \ge 1 - \beta. \tag{3}$$
Here, condition (2) controls the type I error probability at the nominal level $\alpha$, which represents the maximum acceptable probability of falsely declaring the experimental treatment superior when it does not achieve simultaneous improvement in both efficacy and safety. In potential clinical applications, the choice of $\alpha$ reflects the tolerance for such a false-positive conclusion and may depend on the trial phase, disease severity, available alternatives, and regulatory considerations. Condition (3) requires the power of the test to be at least $1 - \beta$ under clinically meaningful alternatives, where $\beta$ denotes the maximum acceptable probability of failing to detect a treatment that improves both endpoints by the pre-specified effect sizes. Thus, $\alpha$ controls the risk of incorrectly advancing an insufficient treatment, whereas $1 - \beta$ measures the probability of correctly identifying a treatment with meaningful joint efficacy–safety benefit.

Joint Probability of Two Endpoints

Let j = 0 , 1 index the two treatment arms, where j = 0 corresponds to the control treatment π 0 and j = 1 corresponds to the experimental treatment π 1 . For patient i in treatment arm j, let ( E j i , S j i ) denote the two binary endpoints, where E j i = 1 indicates efficacy success and S j i = 1 indicates safety success. Thus, each patient falls into one of four joint response categories, ( 1 , 1 ) , ( 1 , 0 ) , ( 0 , 1 ) , or ( 0 , 0 ) , corresponding, respectively, to efficacious and safe, efficacious but unsafe, nonefficacious but safe, and neither efficacious nor safe.
Accordingly, for treatment j, we denote the vector of cell counts by
$$\mathbf{X}_j = (X_{11j}, X_{12j}, X_{21j}, X_{22j}),$$
where
$$X_{11j} = \sum_{i=1}^{n} I(E_{ji} = 1, S_{ji} = 1), \quad X_{12j} = \sum_{i=1}^{n} I(E_{ji} = 1, S_{ji} = 0),$$
$$X_{21j} = \sum_{i=1}^{n} I(E_{ji} = 0, S_{ji} = 1), \quad X_{22j} = \sum_{i=1}^{n} I(E_{ji} = 0, S_{ji} = 0).$$
We assume that this vector follows a multinomial distribution with total sample size n and cell probabilities
$$\mathbf{p}_j = (p_{11j}, p_{12j}, p_{21j}, p_{22j}),$$
for j = 0 , 1 . Here, p k l j ( k , l = 1 , 2 ) denotes the probability corresponding to each efficacy–safety combination in treatment arm j. The sample proportions X k l j / n provide empirical estimates of the corresponding cell probabilities p k l j , for k , l = 1 , 2 .
This notation is equivalent to the standard three-dimensional contingency-table notation n k l j for observed frequencies and p k l j for cell probabilities, where the third index corresponds to the treatment arm. We retain notations X k l j and p k l j because, in clinical trial testing and ranking-and-selection designs, n is conventionally used for the sample size, while X denotes observed response counts.
The marginal probability of observing efficacy under treatment j is
p e j = p 11 j + p 12 j ,
which aggregates all outcomes classified as efficacious.
Similarly, the marginal safety probability is
p s j = p 11 j + p 21 j ,
representing the chance that a patient satisfies the safety criterion regardless of efficacy.
The corresponding marginal cell counts are
X e j = X 11 j + X 12 j , X e c j = X 21 j + X 22 j ,
and
X s j = X 11 j + X 21 j , X s c j = X 12 j + X 22 j ,
where e c and s c denote the complements of efficacy success and safety success, respectively.
The outcomes for treatment $\pi_j$ are listed in Table 1.
For each treatment arm, the dependence between efficacy and safety is quantified by the treatment-specific odds ratio
$$\phi_j = \frac{p_{11j}\, p_{22j}}{p_{12j}\, p_{21j}}, \quad j = 0, 1.$$
The odds ratio is a classical measure of association for 2 × 2 contingency tables; see, for example, Everitt [11] (Section 2.8). Equivalently, the condition p 11 j p 22 j = p 12 j p 21 j corresponds to the classical no-association case in treatment arm j, which has long been studied as a testing problem for 2 × 2 tables [12] (Section 33.16). For multiple 2 × 2 tables, such as those arising after stratification by treatment arm, the treatment-specific log odds ratio is a standard way to describe conditional association. The closely related problems of partial association and conditional independence in stratified 2 × 2 tables have been studied in the classical contingency-table literature; see Birch [15] and Kendall and Stuart [12] (Section 33.62). These formulations are also closely connected to the common-odds-ratio, or homogeneous-association, setting used for multiple 2 × 2 tables. This classical formulation provides a natural justification for using the odds ratio as an association measure for the two binary endpoints. Following its use in bivariate binary-endpoint designs such as Conaway and Petroni [6], we use the odds ratio to parameterize the association between efficacy and safety.
In the present paper, we adopt the working assumption of a known common odds ratio across the two treatment arms, namely
ϕ 0 = ϕ 1 = ϕ .
This common-odds-ratio assumption is not a mathematical necessity but a design simplification that allows the four cell probabilities in each treatment arm to be determined by the two marginal probabilities p e j and p s j together with a common association parameter ϕ . It is also related to the classical common-association framework for several 2 × 2 contingency tables, often expressed in terms of the log odds ratio; see Kendall and Stuart [12] (Section 33.62). In applications, ϕ may be specified using prior studies, pilot data, or sensitivity analyses. The value of ϕ has a direct interpretation in terms of the association between the two binary endpoints. When ϕ = 1 , the efficacy and safety endpoints are independent within a treatment arm, corresponding to the no-association case discussed above. When ϕ > 1 , the two endpoints are positively associated: patients who achieve efficacy are more likely to also achieve safety, and patients who fail to achieve efficacy are more likely to also fail to achieve safety. Equivalently, for fixed marginal probabilities, larger values of ϕ place more probability mass on the concordant cells ( 1 , 1 ) and ( 0 , 0 ) and less probability mass on the discordant cells ( 1 , 0 ) and ( 0 , 1 ) . When 0 < ϕ < 1 , the two endpoints are negatively associated, indicating a tendency toward discordant outcomes; for example, achieving efficacy may be less likely to occur together with achieving safety. The boundary case ϕ = 0 represents an extreme form of negative association and is mainly included for sensitivity assessment rather than as a typical clinical setting. Two situations are considered:
Case 1: $\phi = 1$ (independence). When the two endpoints are independent, the joint distribution of $(X_e, X_s)$ factorizes as
$$P(X_e = x_e, X_s = x_s \mid p_e, p_s, \phi = 1) = b(n, p_e, x_e)\, b(n, p_s, x_s),$$
where b ( · ) denotes the binomial pmf.
Case 2: $\phi \ne 1$ (dependence). For specified $p_e$, $p_s$, and $\phi$, Bryant and Day [5] derived the corresponding four cell probabilities:
$$p_{11} = \frac{a - \sqrt{a^2 + d}}{2(\phi - 1)},$$
$$p_{12} = p_e - p_{11},$$
$$p_{21} = p_s - p_{11},$$
$$p_{22} = 1 - p_{11} - p_{12} - p_{21},$$
with
$$a = 1 + (\phi - 1)(p_e + p_s), \quad d = -4\phi(\phi - 1)\, p_e\, p_s.$$
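The cell-probability formulas can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `cell_probs` is hypothetical, and the special handling of $\phi = 1$ (where the general formula would divide by zero) uses the independence identity $p_{11} = p_e p_s$.

```python
from math import sqrt

def cell_probs(p_e, p_s, phi):
    """Return (p11, p12, p21, p22) for marginals p_e, p_s and odds ratio phi.

    Hypothetical helper: follows the formulas quoted from Bryant and Day,
    with phi = 1 handled separately as the independence case.
    """
    if abs(phi - 1.0) < 1e-12:
        p11 = p_e * p_s  # independence: the general formula would be 0/0
    else:
        a = 1.0 + (phi - 1.0) * (p_e + p_s)
        d = -4.0 * phi * (phi - 1.0) * p_e * p_s
        # max(..., 0.0) only guards against tiny negative rounding errors
        p11 = (a - sqrt(max(a * a + d, 0.0))) / (2.0 * (phi - 1.0))
    p12 = p_e - p11          # efficacy success, safety failure
    p21 = p_s - p11          # efficacy failure, safety success
    p22 = 1.0 - p11 - p12 - p21
    return p11, p12, p21, p22

# Example: marginals 0.6 and 0.7 with a positive association (phi = 2)
p11, p12, p21, p22 = cell_probs(0.6, 0.7, 2.0)
recovered_phi = (p11 * p22) / (p12 * p21)  # should reproduce the input odds ratio
```

A quick sanity check is that the returned cells reproduce both marginals and the input odds ratio, which is what `recovered_phi` computes.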
Given these cell probabilities, the joint probability of ( X e , X s ) is
$$P(X_e = x_e, X_s = x_s \mid p_e, p_s, \phi) = \sum_{i = \max(0,\, x_e + x_s - n)}^{\min(x_e,\, x_s)} \frac{n!}{i!\,(x_e - i)!\,(x_s - i)!\,(n - x_e - x_s + i)!} \; p_{11}^{\,i}\; p_{12}^{\,x_e - i}\; p_{21}^{\,x_s - i}\; p_{22}^{\,n - x_e - x_s + i}. \tag{8}$$
Remark 1.
When ϕ = 1 , expression (8) continues to hold, with the simplifying identity
p 11 = p e p s .
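The joint probability of the two marginal counts is a finite sum over the number $i$ of patients in the $(1, 1)$ cell, so it can be evaluated directly. The sketch below is illustrative only: `joint_pmf` is a hypothetical name, and the cell probabilities are passed in as a tuple $(p_{11}, p_{12}, p_{21}, p_{22})$. The example uses independent endpoints, where the result should agree with the product of two binomial probabilities.

```python
from math import comb, factorial

def joint_pmf(n, x_e, x_s, cells):
    """P(X_e = x_e, X_s = x_s) under a four-category multinomial model.

    cells = (p11, p12, p21, p22); the loop variable i is the count in cell (1,1).
    """
    p11, p12, p21, p22 = cells
    total = 0.0
    for i in range(max(0, x_e + x_s - n), min(x_e, x_s) + 1):
        rest = n - x_e - x_s + i  # count in cell (0,0)
        coef = factorial(n) // (
            factorial(i) * factorial(x_e - i) * factorial(x_s - i) * factorial(rest)
        )
        total += coef * p11**i * p12**(x_e - i) * p21**(x_s - i) * p22**rest
    return total

# Independence example (phi = 1): p11 = p_e * p_s, and so on for the other cells
n, p_e, p_s = 5, 0.6, 0.7
indep_cells = (p_e * p_s, p_e * (1 - p_s), (1 - p_e) * p_s, (1 - p_e) * (1 - p_s))
value = joint_pmf(n, 3, 4, indep_cells)
binomial_product = (comb(n, 3) * p_e**3 * (1 - p_e)**2) * (comb(n, 4) * p_s**4 * (1 - p_s))
```

Under independence, `value` and `binomial_product` should match up to rounding, which mirrors the factorization stated for Case 1.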

3. Fixed Sample Size Design

Consider fixed design constants n, e, and s. The decision rule is constructed as follows. Here, n denotes the sample size per treatment arm, while e and s are positive integer thresholds representing the minimum required excess numbers of successes for efficacy and safety, respectively, in the experimental treatment compared with the control. These thresholds are chosen together with n so that the resulting test satisfies the desired type I error and power requirements.
Procedure H:
Collect n observations from each of the two treatments (the control and the new treatment). For treatment i  ( i = 0 , 1 ) , let X e i and X s i denote the observed numbers of successes for the efficacy and safety endpoints, respectively. With positive thresholds e and s, the rule in Procedure H is given by:
  • Reject the null hypothesis if both inequalities $X_{e1} - X_{e0} \ge e$ and $X_{s1} - X_{s0} \ge s$ hold.
  • Otherwise (i.e., if either $X_{e1} - X_{e0} < e$ or $X_{s1} - X_{s0} < s$), retain the null hypothesis.
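As a small illustration, the decision rule of Procedure H amounts to two integer comparisons. The function name and the counts in the example are hypothetical and serve only to show the rule in action.

```python
def procedure_h(x_e1, x_e0, x_s1, x_s0, e, s):
    """Return True when Procedure H rejects H0.

    x_e1, x_s1: efficacy and safety successes in the experimental arm
    x_e0, x_s0: efficacy and safety successes in the control arm
    e, s:       positive integer thresholds on the excess successes
    """
    return (x_e1 - x_e0 >= e) and (x_s1 - x_s0 >= s)

# Hypothetical counts: efficacy excess 9 >= 7 and safety excess 8 >= 7, so H0 is rejected
reject = procedure_h(x_e1=40, x_e0=31, x_s1=45, x_s0=37, e=7, s=7)

# A large efficacy gain alone is not enough: safety excess 3 < 7, so H0 is retained
retain = procedure_h(x_e1=50, x_e0=31, x_s1=40, x_s0=37, e=7, s=7)
```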
Probability Requirements:
The design constants n, e, and s for Procedure H are chosen so that the procedure meets the following probabilistic criteria:
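$$\sup_{H_0} P\left(X_{e1} - X_{e0} \ge e,\ X_{s1} - X_{s0} \ge s \mid \phi\right) \le \alpha, \tag{9}$$
$$\inf_{H_1} P\left(X_{e1} - X_{e0} \ge e,\ X_{s1} - X_{s0} \ge s \mid \phi\right) \ge 1 - \beta. \tag{10}$$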
The left-hand side of (9) is the largest probability of incorrectly rejecting $H_0$ over the null region and is therefore controlled at the type I error level $\alpha$. Likewise, the left-hand side of (10) is the smallest probability of correctly declaring the new treatment superior over the alternative region, that is, the guaranteed power, which is required to be at least $1 - \beta$.
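For a fixed parameter configuration, the rejection probability inside (9) and (10) can be computed exactly by enumerating the multinomial outcomes of each arm. The following sketch assumes the cell probabilities of each arm are supplied as tuples $(p_{11}, p_{12}, p_{21}, p_{22})$; the function names are hypothetical, and the triple loop is intended for small $n$ rather than as an optimized routine.

```python
from math import comb

def joint_table(n, cells):
    """table[x_e][x_s] = P(X_e = x_e, X_s = x_s), built by enumerating all cell counts."""
    p11, p12, p21, p22 = cells
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for x11 in range(n + 1):
        for x12 in range(n + 1 - x11):
            for x21 in range(n + 1 - x11 - x12):
                x22 = n - x11 - x12 - x21
                coef = comb(n, x11) * comb(n - x11, x12) * comb(n - x11 - x12, x21)
                prob = coef * p11**x11 * p12**x12 * p21**x21 * p22**x22
                table[x11 + x12][x11 + x21] += prob
    return table

def reject_prob(n, e, s, cells0, cells1):
    """Exact P(X_e1 - X_e0 >= e, X_s1 - X_s0 >= s) for independent arms."""
    t0, t1 = joint_table(n, cells0), joint_table(n, cells1)
    # upper[a][b] = P(X_e1 >= a, X_s1 >= b), padded with zeros beyond n
    upper = [[0.0] * (n + 2) for _ in range(n + 2)]
    for a in range(n, -1, -1):
        for b in range(n, -1, -1):
            upper[a][b] = t1[a][b] + upper[a + 1][b] + upper[a][b + 1] - upper[a + 1][b + 1]
    total = 0.0
    for a0 in range(n + 1):
        for b0 in range(n + 1):
            if a0 + e <= n and b0 + s <= n:
                total += t0[a0][b0] * upper[a0 + e][b0 + s]
    return total

# Example with independent endpoints: control (0.3, 0.4), experimental (0.6, 0.7)
control = (0.3 * 0.4, 0.3 * 0.6, 0.7 * 0.4, 0.7 * 0.6)
experimental = (0.6 * 0.7, 0.6 * 0.3, 0.4 * 0.7, 0.4 * 0.3)
prob = reject_prob(6, 2, 1, control, experimental)
```

Precomputing the upper-tail table of the experimental arm keeps the final sum at one pass over the control arm's outcomes instead of a sum over all pairs of outcomes.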
To guarantee a meaningful distinction between the null and alternative spaces, we require a minimum effect separation that specifies how far the true parameters under H 1 must lie from those under H 0 . In particular, we assume the alternative satisfies
$$p_{e1} \ge p_{e0} + \delta_e, \quad p_{s1} \ge p_{s0} + \delta_s,$$
where δ e and δ s denote the smallest clinically relevant differences in efficacy and safety. These effect-size constraints ensure that the testing procedure can reliably discriminate between H 0 and H 1 when the new treatment exhibits meaningful improvement.
We now derive the values of the procedure parameters such that the probability constraints (9) and (10) are satisfied.
Theorem 1 (Monotonicity of the rejection probability). Let $\phi$ denote the odds ratio. The rejection probability
$$P(\text{Reject } H_0) = P\left(X_{e1} - X_{e0} \ge e,\ X_{s1} - X_{s0} \ge s \mid \phi\right)$$
satisfies the following monotonicity properties with respect to the marginal efficacy and safety probabilities.
1. Non-increasing in $p_{e0}$: For fixed $p_{e1}$, $p_{s0}$, $p_{s1}$, and $\phi$, $P(\text{Reject } H_0)$ decreases (or remains constant) as $p_{e0}$ increases.
2. Non-increasing in $p_{s0}$: For fixed $p_{e0}$, $p_{e1}$, $p_{s1}$, and $\phi$, $P(\text{Reject } H_0)$ decreases (or remains constant) as $p_{s0}$ increases.
3. Non-decreasing in $p_{e1}$: For fixed $p_{e0}$, $p_{s0}$, $p_{s1}$, and $\phi$, $P(\text{Reject } H_0)$ increases (or remains constant) as $p_{e1}$ increases.
4. Non-decreasing in $p_{s1}$: For fixed $p_{e0}$, $p_{e1}$, $p_{s0}$, and $\phi$, $P(\text{Reject } H_0)$ increases (or remains constant) as $p_{s1}$ increases.
The detailed proof is provided in Appendix A.
Theorem 1 immediately implies that, when p e 0 and p s 0 are known, the power is minimized at the boundary of the effect-size constraints, namely when
p e 1 = p e 0 + δ e , p s 1 = p s 0 + δ s .
Theorem 2 (Least favorable configuration for power control under $\phi = 1$). When the odds ratio satisfies $\phi = 1$ and the critical values satisfy $e \le n\delta_e$ and $s \le n\delta_s$, the minimal power under the alternative is attained at the boundary of the effect-size constraints, specifically when
$$p_{e1} = p_{e0} + \delta_e, \quad p_{s1} = p_{s0} + \delta_s, \quad p_{e0} = \frac{1 - \delta_e}{2}, \quad p_{s0} = \frac{1 - \delta_s}{2}.$$
The proof is provided in Appendix A.
Theorem 3.
If the null-hypothesis parameters p e 0 and p s 0 are fixed and known, then the maximal type I error is attained at one of the following configurations:
p e 1 = 1 , p s 1 = p s 0 ,
or
p e 1 = p e 0 , p s 1 = 1 .
This result follows directly from the monotonicity properties in Theorem 1. To attain the maximal type I error, the rejection probability should be made as large as possible while the null hypothesis remains true. Since the rejection probability is non-decreasing in p e 1 and p s 1 , the maximum occurs on the boundary of the null hypothesis: either p s 1 = p s 0 with p e 1 as large as possible or p e 1 = p e 0 with p s 1 as large as possible.
Consequently, Theorem 3 implies that the constraint in (9) can be equivalently expressed as
$$\max\left\{ P\left(X_{e1} - X_{e0} \ge e,\ X_{s1} - X_{s0} \ge s \mid p_{e1} = p_{e0},\ p_{s1} = 1,\ \phi\right),\ P\left(X_{e1} - X_{e0} \ge e,\ X_{s1} - X_{s0} \ge s \mid p_{e1} = 1,\ p_{s1} = p_{s0},\ \phi\right) \right\} \le \alpha.$$
Theorem 4 (Least favorable configuration for type I error control under $\phi = 1$). When the odds ratio satisfies $\phi = 1$ and the critical values satisfy $e, s > 0$, the maximal type I error under the null hypothesis is attained at one of two configurations,
$$p_{e1} = p_{e0} = \frac{1}{2}, \quad p_{s1} = 1, \quad p_{s0} = 0,$$
or, symmetrically,
$$p_{s1} = p_{s0} = \frac{1}{2}, \quad p_{e1} = 1, \quad p_{e0} = 0.$$
The proof is provided in Appendix A.
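As a hedged worked illustration (not part of the stated proof), the type I error at the first configuration of Theorem 4 reduces to a single binomial tail. The reduction assumes $s \le n$ and uses the independence of the two endpoints implied by $\phi = 1$, together with the fact that $n - X_{e0}$ is again binomial with success probability $1/2$, so that $X_{e1} + (n - X_{e0})$ follows a Binomial$(2n, 1/2)$ distribution.

```latex
\begin{aligned}
& p_{s1} = 1,\ p_{s0} = 0 \ \Longrightarrow\ X_{s1} - X_{s0} = n \ge s \quad \text{with probability one}, \\
& P(\text{Reject } H_0)
  = P\left(X_{e1} - X_{e0} \ge e \mid p_{e1} = p_{e0} = \tfrac{1}{2}\right)
  = P\left(X_{e1} + (n - X_{e0}) \ge n + e\right)
  = \sum_{k = n + e}^{2n} \binom{2n}{k}\, 2^{-2n}.
\end{aligned}
```

The second configuration gives the same expression with $e$ replaced by $s$, so under these assumptions the type I error constraint bounds each threshold separately.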
The above results characterize the least favorable configurations for both power and type I error control in Procedure H. Theorem 1 shows that the rejection probability is monotone in the marginal efficacy and safety probabilities. This monotonicity reduces the search for worst-case configurations to boundary points of the parameter space. Theorem 2 identifies the least favorable alternative for power calculation when ϕ = 1 . Theorems 3 and 4 characterize the worst-case null configurations for type I error control.
A key implication is that, when $\phi = 1$, the efficacy and safety endpoints are independent. In this case, a universal least favorable configuration (LFC) exists. Therefore, the design parameters $(n, e, s)$ can be determined without specifying the nuisance parameters $p_{e0}$ and $p_{s0}$. By contrast, when $\phi \ne 1$, the two endpoints are dependent. A universal LFC is generally unavailable in this setting. Thus, determining $(n, e, s)$ requires the baseline values $p_{e0}$ and $p_{s0}$ to be specified.
If p e 0 and p s 0 are known in advance, the resulting design is less conservative. It may also require a substantially smaller sample size. This reduction is illustrated in the numerical results reported in the subsequent tables.

4. Tables and Discussion

In this section, we report the design parameters required to implement the proposed fixed-sample procedure. Throughout, we assume a common association structure between the two binary endpoints across the control and the tested treatment; i.e., all the treatments share the same odds ratio ϕ . Under this assumption, bivariate binary outcomes can be characterized by the marginal success probabilities and the odds ratio, which jointly determine the 2 × 2 cell probabilities used in simulation.
We consider the following configurations: $\phi \in \{0, 1, 2, 4, 8\}$, $\delta_e \in \{0.1, 0.2, 0.3\}$, and $\delta_s \in \{0.1, 0.2, 0.3\}$. The target operating characteristics are specified by
$$(\text{Power}, \text{Type I error}) \in \{(0.75, 0.15), (0.85, 0.15)\}.$$
The type I error level of 0.15 is used here for illustrative purposes in an exploratory early-phase setting, where a less stringent error level may be considered acceptable for screening promising treatments. In practice, the choice of type I error rate and power should be determined according to the clinical objective, regulatory context, and input from clinical investigators.
For each configuration, we determine the minimum required sample size per treatment, n, together with the corresponding critical values ( e , s ) such that the probability constraints in (9) and (10) are satisfied. The reported operating characteristics in the tables are estimated by Monte Carlo simulation using 100,000 replications for each configuration. For example, when ϕ = 1 , δ e = δ s = 0.2 , and the target operating characteristics are ( Power , Type I error ) = ( 0.75 , 0.15 ) , we evaluate candidate sample sizes n sequentially. For each n, all feasible integer threshold pairs ( e , s ) are checked by Monte Carlo estimation of the corresponding rejection probabilities. The smallest sample size for which at least one pair ( e , s ) satisfies both requirements is selected. In this setting, based on 100,000 Monte Carlo replications, n = 63 with ( e , s ) = ( 7 , 7 ) gives an estimated power of 0.75241 and an estimated type I error of 0.12335 , so it satisfies the desired constraints.
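The worked example can be cross-checked with exact binomial tails instead of simulation. The sketch below rests on stated assumptions: it uses $\phi = 1$ with the least favorable configurations of Theorems 2 and 4, and the identity that $X_1 - X_0 \ge c$ is equivalent to $X_1 + (n - X_0) \ge n + c$, where $X_1 + (n - X_0)$ is Binomial$(2n, p_1)$ whenever $p_1 = 1 - p_0$. The helper name `upper_tail` is hypothetical.

```python
from math import comb

def upper_tail(m, p, k):
    """P(Y >= k) for Y ~ Binomial(m, p)."""
    return sum(comb(m, y) * p**y * (1 - p)**(m - y) for y in range(max(k, 0), m + 1))

n, e, s = 63, 7, 7
delta_e = delta_s = 0.2

# Null LFC (Theorem 4): p_e1 = p_e0 = 1/2, p_s1 = 1, p_s0 = 0.
# The safety difference equals n, so only the efficacy tail matters.
type1_error = upper_tail(2 * n, 0.5, n + e)

# Alternative LFC (Theorem 2): p_0 = (1 - delta)/2 and p_1 = (1 + delta)/2 = 1 - p_0.
# Independence (phi = 1) lets the power factor into two tails.
power = (upper_tail(2 * n, (1 + delta_e) / 2, n + e)
         * upper_tail(2 * n, (1 + delta_s) / 2, n + s))
```

The resulting `type1_error` and `power` can be compared with the Monte Carlo estimates quoted above; small discrepancies are expected from simulation error.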
The simulation procedure used to determine ( n , e , s ) can be summarized as follows:
  • Specify the target type I error level α , target power 1 β , effect sizes δ e and δ s , odds ratio ϕ , and, when needed, baseline control rates p e 0 and p s 0 .
  • For a candidate sample size n, enumerate all feasible positive integer threshold pairs ( e , s ) .
  • For each pair ( e , s ) , estimate the type I error and power using 100,000 Monte Carlo replications under the corresponding LFC or GLFC.
  • Retain the threshold pairs ( e , s ) that satisfy both the type I error constraint and the power constraint.
  • Increase n sequentially until at least one feasible pair ( e , s ) is found.
  • Select the smallest such n. If multiple threshold pairs satisfy the constraints for this minimum n, choose the pair with the smallest thresholds ( e , s ) .
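The search steps above can be sketched for the $\phi = 1$ case with unknown baseline rates. Two assumptions are made loudly here: exact binomial sums replace the Monte Carlo estimates used in the paper, and the least favorable configurations of Theorems 2 and 4 are taken as given. All function names are hypothetical. Because both null configurations lead to the same tail probability, the smallest admissible thresholds coincide, $e = s$, and taking the smallest thresholds also maximizes the power.

```python
from math import comb

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for 0..n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def diff_tail(n, p1, p0, c):
    """P(X1 - X0 >= c) for independent X1 ~ Bin(n, p1) and X0 ~ Bin(n, p0)."""
    f1, f0 = binom_pmf(n, p1), binom_pmf(n, p0)
    return sum(f1[i] * f0[j]
               for i in range(n + 1) for j in range(n + 1) if i - j >= c)

def search_design(alpha, power, delta_e, delta_s, n_max=150):
    """Smallest n (with thresholds e = s) meeting both constraints under phi = 1."""
    for n in range(2, n_max + 1):
        # Smallest threshold whose null tail (p1 = p0 = 1/2) is at most alpha
        c = next((k for k in range(1, n + 1)
                  if diff_tail(n, 0.5, 0.5, k) <= alpha), None)
        # Theorem 2 assumes e <= n*delta_e and s <= n*delta_s; skip n otherwise
        if c is None or c > n * min(delta_e, delta_s):
            continue
        achieved_power = (
            diff_tail(n, (1 + delta_e) / 2, (1 - delta_e) / 2, c)
            * diff_tail(n, (1 + delta_s) / 2, (1 - delta_s) / 2, c)
        )
        if achieved_power >= power:
            return n, c, c, diff_tail(n, 0.5, 0.5, c), achieved_power
    return None  # no feasible design up to n_max

design = search_design(alpha=0.15, power=0.75, delta_e=0.2, delta_s=0.2)
```

The returned tuple is `(n, e, s, type I error, power)`. It can be compared with the simulated design $n = 63$, $(e, s) = (7, 7)$; exact and simulated searches may differ slightly when a candidate sits close to a constraint boundary.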
The numerical study is divided into two cases.
Case 1: ϕ = 1 with unknown baseline control rates. When ϕ = 1 , the efficacy and safety endpoints are independent. In this case, we assume that the baseline control rates p e 0 and p s 0 are unknown. By Theorem 2, the minimum power under the alternative is attained at a specific least favorable configuration when ϕ = 1 . Similarly, by Theorem 4, the maximal type I error under the null is attained at a least favorable configuration. These least favorable configurations together make it possible to calibrate the design parameters ( n , e , s ) without specifying p e 0 and p s 0 . The resulting designs are reported in Table 2 and Table 3, corresponding to the two target operating characteristics ( 0.75 , 0.15 ) and ( 0.85 , 0.15 ) , respectively.
Case 2: known baseline control rates. We next consider the case where the baseline control rates are known, with
$$p_{e0},\ p_{s0} \in \{0.2, 0.4, 0.6\},$$
yielding all possible combinations of ( p e 0 , p s 0 ) . For this case, we fix
δ e = δ s = 0.2
and consider $\phi \in \{0, 1, 2, 4, 8\}$. Using the generalized least favorable configuration (GLFC) characterized by Theorems 1 and 3, we determine the corresponding minimum sample size $n$ and critical values $(e, s)$ for each parameter setting. The results are presented in Table 4 and Table 5, corresponding to $(\text{Power}, \text{Type I error}) = (0.75, 0.15)$ and $(0.85, 0.15)$, respectively.
The design parameters ( n , e , s ) are obtained via simulation. Specifically, for each candidate value of n, we search over feasible integer thresholds ( e , s ) and select the smallest n for which at least one pair ( e , s ) meets the constraints. When multiple ( e , s ) pairs satisfy the requirements for the same minimal n, we adopt the pair that yields the largest empirical power while maintaining the type I error constraint. The resulting parameters provide a direct lookup table for implementing the procedure under the considered settings.
When the baseline control rates $p_{e0}$ and $p_{s0}$ are unknown, Table 2 and Table 3 summarize, under $\phi = 1$, the minimum per-treatment sample size $n$ and the corresponding critical values $(e, s)$ that achieve the desired operating characteristics. As expected, larger effect thresholds (i.e., larger $\delta_e$ and/or $\delta_s$) generally lead to smaller required sample sizes, since stronger separation between the null and alternative hypotheses makes it easier to satisfy the power constraint while controlling type I error. These tabulated values enable straightforward implementation of the procedure without re-running the design search for each new study.
In addition, Table 4 and Table 5 report the case where the baseline control rates $p_{e0}$ and $p_{s0}$ are known in advance and $\delta_e = \delta_s = 0.2$. As the odds ratio $\phi$ increases, the required sample size $n$ is non-increasing across all the reported configurations, and the overall variation in $n$ is small. The proposed design is therefore not highly sensitive to the dependence parameter $\phi$ over a clinically relevant range: the required sample size changes only slightly for moderate values of $\phi$, so moderate misspecification of the working odds ratio has limited impact on the resulting design. The case $\phi = 0$ represents an extreme boundary scenario and is included mainly for sensitivity assessment rather than as a typical clinical setting.
Moreover, for the setting $\delta_e = \delta_s = 0.2$, the sample sizes in Table 2 and Table 3 are generally no smaller than those in Table 4 and Table 5, reflecting the fact that the designs obtained under unknown baseline control rates are more conservative. In Table 2, the required sample size is $n = 63$, which exceeds all the corresponding values reported in Table 4. In Table 3, the required sample size is $n = 76$, which is at least as large as every corresponding value in Table 5 except one: the configuration with $n = 77$ when $\phi = 0$ and $p_{e0} = p_{s0} = 0.4$. Overall, these findings suggest that the values reported in Table 2 and Table 3 provide a reasonably conservative design benchmark. Therefore, when $p_{e0}$ and $p_{s0}$ are unknown and $\phi \neq 1$, or even when $\phi$ itself is unknown, the design calibrated under the $\phi = 1$ least favorable configuration can still serve as a practical and reliable approximation for trial planning.
To further examine the robustness of the proposed design to possible misspecification of the odds ratio $\phi$, we conducted an additional sensitivity calculation using the first configuration in Table 4, where $p_{e0} = p_{s0} = 0.2$. For this setting, the design obtained under $\phi = 8$ is $(n, e, s) = (45, 5, 5)$. We then fixed this design and estimated the corresponding power and type I error rate under several different values of the true odds ratio. When $\phi = 0$, the estimated power is $0.7132$ and the estimated type I error rate is $0.11864$; when $\phi = 1$, the estimated power is $0.73223$ and the estimated type I error rate is $0.11806$; and, when $\phi = 2$, the estimated power is $0.74129$ and the estimated type I error rate is $0.11623$. The cases $\phi = 4$ and $\phi = 8$ lead to the same design parameters in the table and therefore do not require separate design calibration. These results suggest that, when the misspecification of $\phi$ is moderate, its impact on the rejection probabilities is relatively small. Even in the boundary case $\phi = 0$, which represents an extreme scenario on the odds-ratio scale, the type I error rate remains below the nominal level $0.15$, while the reduction in power is moderate. Moreover, when $p_{e0}$ and $p_{s0}$ are unknown and the design from Table 2 with $\delta_e = \delta_s = 0.2$ is used, namely $(n, e, s) = (63, 7, 7)$, the estimated power and type I error rate under $\phi = 0$ and $p_{e0} = p_{s0} = 0.2$ are $0.77839$ and $0.07363$, respectively. Thus, when there is substantial uncertainty about the value of $\phi$, the designs reported in Table 2 and Table 3 can serve as conservative and practical choices for trial planning.
The design tables also reveal several structural patterns among the design parameters. First, as expected, a higher target power generally requires a larger sample size because stronger probability guarantees require more information from each treatment arm. Second, the clinically meaningful effect sizes $\delta_e$ and $\delta_s$ have a substantial impact on the required sample size. When either the efficacy threshold or the safety threshold becomes larger, the required sample size tends to decrease since larger treatment differences are easier to detect. Conversely, smaller values of $\delta_e$ or $\delta_s$ lead to more demanding designs and therefore require larger sample sizes. Third, for fixed baseline rates and effect-size thresholds, the required sample size tends to be non-increasing as the odds ratio $\phi$ increases. This pattern indicates that a stronger positive association between efficacy and safety can reduce the amount of information needed to jointly demonstrate improvement on both endpoints. Overall, the numerical results are consistent with intuition: sample size is most sensitive to the target power, the type I error requirement, and the clinically meaningful thresholds $\delta_e$ and $\delta_s$, while the dependence parameter $\phi$ appears to have little impact over the practical range considered in the tables.

5. Example

This example considers an experimental trial involving an immunotherapy-based treatment for elderly patients (≥75 years old) diagnosed with advanced non-small-cell lung cancer (NSCLC). The trial compares the immunotherapy strategy, PD1-A (anti-PD-1 monotherapy), against the standard chemotherapy regimen, consisting of carboplatin and pemetrexed, which serves as the control treatment.
The goal of the trial is to determine whether PD1-A demonstrates superiority over the control treatment in terms of both efficacy and safety. Specifically, the study targets an improvement of at least 0.30 in the response rate and a reduction of at least 0.20 in high-grade toxicity, corresponding to $\delta_e = 0.30$ and $\delta_s = 0.20$. If PD1-A fails to show meaningful improvement over the control, the standard chemotherapy will be selected.
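Under this setting, the least favorable configuration for rejecting the null hypothesis, that is, the configuration of minimum power (Theorem 2), occurs when
$$p_{e0} = \frac{1 - \delta_e}{2}, \quad p_{s0} = \frac{1 - \delta_s}{2}, \quad p_{e1} = \frac{1 + \delta_e}{2}, \quad p_{s1} = \frac{1 + \delta_s}{2},$$
which here gives $p_{e0} = 0.35$, $p_{s0} = 0.40$, $p_{e1} = 0.65$, and $p_{s1} = 0.60$.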
By Theorem 4, the least favorable configuration (LFC) for the type I error is attained when
$$p_{e1} = p_{e0} = \frac{1}{2}, \quad p_{s1} = 1, \quad p_{s0} = 0,$$
or, symmetrically,
$$p_{s1} = p_{s0} = \frac{1}{2}, \quad p_{e1} = 1, \quad p_{e0} = 0.$$
When the probability constraints are set at power $= 0.85$ and type I error $= 0.15$, Table 3 (row $\delta_e = 0.3$, $\delta_s = 0.2$) indicates that the fixed-sample procedure requires $n = 56$ observations per treatment arm, with associated critical values $e = 6$ and $s = 6$. Therefore, the total number of observations required to meet the design criteria is $2 \times 56 = 112$.
After the trial is completed, the procedure is applied by comparing the observed numbers of efficacy and safety successes between the PD1-A arm and the control arm. Let $X_{e1}$ and $X_{s1}$ denote the observed efficacy and safety successes in the PD1-A arm, and let $X_{e0}$ and $X_{s0}$ denote the corresponding observed successes in the control arm. With the selected critical values $e = 6$ and $s = 6$, the null hypothesis is rejected only if both $X_{e1} - X_{e0} \geq 6$ and $X_{s1} - X_{s0} \geq 6$. Otherwise, the trial does not provide sufficient evidence that PD1-A is superior to the control treatment in both efficacy and safety. For instance, if the observed differences are $X_{e1} - X_{e0} = 7$ and $X_{s1} - X_{s0} = 6$, then both criteria are met and the null hypothesis is rejected; if either difference is less than 6, the null hypothesis is retained.
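The decision rule of this example can be written out as a minimal Python sketch. The function names and the success counts below are hypothetical and serve only to illustrate the rule; they are not trial data.

```python
def lfc_for_power(delta_e, delta_s):
    """Baseline and treatment rates at the minimum-power configuration."""
    return {
        "p_e0": (1 - delta_e) / 2, "p_e1": (1 + delta_e) / 2,
        "p_s0": (1 - delta_s) / 2, "p_s1": (1 + delta_s) / 2,
    }

def reject_null(x_e1, x_e0, x_s1, x_s0, e, s):
    """Reject H0 only if both count differences reach their thresholds."""
    return (x_e1 - x_e0 >= e) and (x_s1 - x_s0 >= s)

# Design of the example: n = 56 per arm, e = s = 6.
print(lfc_for_power(0.30, 0.20))

# Hypothetical counts out of 56 patients per arm.
print(reject_null(x_e1=30, x_e0=23, x_s1=40, x_s0=34, e=6, s=6))  # differences 7 and 6
print(reject_null(x_e1=30, x_e0=25, x_s1=40, x_s0=34, e=6, s=6))  # efficacy difference 5
```

The first hypothetical outcome meets both thresholds, while the second fails the efficacy threshold, so only the first leads to rejection.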

6. Conclusions

This paper develops an exact frequentist fixed sample size hypothesis testing procedure for two-arm clinical trials with two binary endpoints, motivated by settings in which a new treatment must demonstrate improvement in both efficacy and safety relative to a concurrent control. Instead of relying on a composite endpoint, the proposed method evaluates the two endpoints jointly and rejects the null hypothesis only when the observed treatment differences exceed pre-specified paired thresholds on both dimensions. By modeling the joint efficacy–safety outcomes through a four-category multinomial distribution, the procedure allows calculation of rejection probabilities and provides a clear framework for benefit–risk evaluation.
The theoretical results characterize the least favorable configurations for both power and type I error control. The monotonicity properties of the rejection probability reduce the search for worst-case configurations to the boundary of the parameter space. When the two endpoints are independent ($\phi = 1$), the least favorable configurations can be identified explicitly, allowing the design parameters $(n, e, s)$ to be determined without specifying the baseline control rates $p_{e0}$ and $p_{s0}$. When $\phi \neq 1$, a universal least favorable configuration is generally unavailable, and calibration requires the baseline control rates to be specified. The numerical results further suggest that the required sample size is not highly sensitive to the odds ratio across the settings considered.
The proposed framework can be implemented through design tables that map the target power, type I error rate, and clinically meaningful effect sizes to the required sample size and decision thresholds. These tables provide a convenient tool for study planning and reduce the need for repeated design searches in new applications. In particular, when the baseline control rates are unknown, designs calibrated under the ϕ = 1 least favorable configuration may provide a practical approximation even when the true odds ratio differs from 1.
The proposed framework has limitations that suggest several extensions. First, the odds ratio $\phi$ is treated as a pre-specified working design parameter and is assumed to be common across the control and experimental treatments. Although sensitivity analyses indicate that the design is relatively robust over practical ranges of $\phi$, future work may consider adaptive or data-driven approaches that allow different association structures between efficacy and safety across treatment arms. Second, although the present study focuses on a fixed-sample design, the relatively large sample sizes required in some settings suggest that sequential or curtailed versions of the proposed procedure may provide meaningful gains in efficiency.
Overall, the proposed procedure offers a rigorous and interpretable approach for two-arm trials with co-primary binary endpoints. By preserving separate assessments of efficacy and safety while controlling type I error and maintaining adequate power, the method provides a useful alternative to composite-endpoint analyses and helps to bridge the gap between early-phase bivariate endpoint designs and comparative two-arm confirmatory testing.

Author Contributions

Conceptualization, P.C., C.Y. and E.M.B.; Methodology, C.Y. and E.M.B.; Software, C.Y.; Validation, C.Y.; Formal analysis, E.M.B.; Investigation, P.C. and E.M.B.; Resources, E.M.B.; Data curation, C.Y.; Writing—original draft, C.Y.; Writing—review & editing, P.C., C.Y. and E.M.B.; Visualization, C.Y.; Supervision, P.C.; Project administration, P.C.; Funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

Generative ChatGPT (GPT-5.5, OpenAI) was used in a limited and supportive manner during manuscript preparation. Specifically, AI was used to assist in converting simulation output tables into LaTeX format, as well as to help check grammar and improve language clarity. All scientific content, methodology, results, and interpretations were developed and verified by the authors.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Proof of Theorem 1

Lemma A1.
Let $X_e$ and $X_s$ be two random variables with joint probability mass function $P(X_e = x_e, X_s = x_s \mid p_e, p_s, \phi)$. If $\psi = \psi(x_e, x_s)$ is a non-decreasing (non-increasing) function of $x_e$ when $x_s$ is held fixed, then $E\,\psi(X_e, X_s)$ is a non-decreasing (non-increasing) function of $p_e$ when $p_s$ and $\phi$ are held fixed. If $\psi = \psi(x_e, x_s)$ is a non-decreasing (non-increasing) function of $x_s$ when $x_e$ is held fixed, then $E\,\psi(X_e, X_s)$ is a non-decreasing (non-increasing) function of $p_s$ when $p_e$ and $\phi$ are held fixed.
This lemma was established in [16] in a related treatment-selection framework with two binary endpoints.
Theorem A1.
Let $\phi$ denote the odds ratio. The probability of rejecting the null hypothesis, $P(\text{Reject } H_0)$, satisfies the following monotonicity properties:
  1. Non-increasing in $p_{e0}$: for fixed $p_{e1}, p_{s0}, p_{s1}, \phi$, $P(\text{Reject } H_0)$ is non-increasing in $p_{e0}$.
  2. Non-increasing in $p_{s0}$: for fixed $p_{e0}, p_{e1}, p_{s1}, \phi$, $P(\text{Reject } H_0)$ is non-increasing in $p_{s0}$.
  3. Non-decreasing in $p_{e1}$: for fixed $p_{e0}, p_{s0}, p_{s1}, \phi$, $P(\text{Reject } H_0)$ is non-decreasing in $p_{e1}$.
  4. Non-decreasing in $p_{s1}$: for fixed $p_{e0}, p_{e1}, p_{s0}, \phi$, $P(\text{Reject } H_0)$ is non-decreasing in $p_{s1}$.
Proof. 
Let $\psi(X_{e0}, X_{s0}, X_{e1}, X_{s1})$ be an indicator function, where $(X_{ej}, X_{sj})$ follows the joint distribution
$$f_{X_{ej}, X_{sj}}(x_{ej}, x_{sj}) = P(X_{ej} = x_{ej}, X_{sj} = x_{sj}),$$
and the pairs $(X_{ej}, X_{sj})$, $j = 0, 1$, are independent of each other. Define
$$\psi = \begin{cases} 1 & \text{if } x_{e1} - x_{e0} \geq e \text{ and } x_{s1} - x_{s0} \geq s, \\ 0 & \text{otherwise.} \end{cases}$$
That is, $\psi = 1$ if $H_0$ is rejected and $\psi = 0$ otherwise.
We first show that $\psi$ is a non-increasing function of $x_{e0}$ when $x_{s0}$ and $(x_{e1}, x_{s1})$ are held fixed. Assume $x_{e0}^* > x_{e0}$, and let $\psi(x_{e0})$ be a simplified form of $\psi(x_{e0}, x_{s0}, x_{e1}, x_{s1})$.
If $\psi(x_{e0}^*) = 0$, then $\psi(x_{e0}) \geq \psi(x_{e0}^*) = 0$.
If $\psi(x_{e0}^*) = 1$, then, by the definition of the indicator function and $x_{e0}^* > x_{e0}$, we have
$$e \leq x_{e1} - x_{e0}^* < x_{e1} - x_{e0}. \tag{A2}$$
By (A2), we have $\psi(x_{e0}) = 1 \geq \psi(x_{e0}^*)$. So $\psi$ is a non-increasing function of $x_{e0}$. Similarly, we can show that $\psi$ is non-increasing in $x_{s0}$.
Next, we show that $\psi$ is non-decreasing in $x_{e1}$ when the other variables are held fixed. Assume $x_{e1}^* > x_{e1}$.
If $\psi(x_{e1}) = 0$, then $\psi(x_{e1}^*) \geq 0 = \psi(x_{e1})$.
If $\psi(x_{e1}) = 1$, then we have
$$x_{e1}^* - x_{e0} > x_{e1} - x_{e0} \geq e, \quad x_{s1} - x_{s0} \geq s. \tag{A3}$$
By (A3), we can conclude that $\psi(x_{e1}^*) = 1 \geq \psi(x_{e1})$. So $\psi$ is a non-decreasing function of $x_{e1}$. Similarly, we can show that $\psi$ is non-decreasing in $x_{s1}$.
So $\psi$ is monotone with respect to $x_{ej}$ and $x_{sj}$ for $j = 0, 1$.
Then, we show that $P(\text{Reject } H_0)$ is monotone with respect to $p_{e0}$. We have
$$P(\text{Reject } H_0) = E\,\psi\big((X_{e0}, X_{s0}), (X_{e1}, X_{s1})\big) = E\Big\{ E\big[\psi\big((X_{e0}, X_{s0}), (X_{e1}, X_{s1})\big) \mid x_{s0}, (x_{e1}, x_{s1})\big] \Big\},$$
where the inner conditional expectation is
$$\begin{aligned} E\big[\psi\big((X_{e0}, X_{s0}), (X_{e1}, X_{s1})\big) \mid x_{s0}, (x_{e1}, x_{s1})\big] &= \sum_{x_{e0}} \psi_{x_{s0}, (x_{e1}, x_{s1})}(x_{e0}) \, P(X_{e0} = x_{e0} \mid X_{s0} = x_{s0}) \\ &= \sum_{x_{e0}} \psi_{x_{s0}, (x_{e1}, x_{s1})}(x_{e0}) \, \frac{P(X_{e0} = x_{e0}, X_{s0} = x_{s0})}{P(X_{s0} = x_{s0})} \\ &= \frac{1}{P(X_{s0} = x_{s0})} \sum_{x_{e0}} \psi_{x_{s0}, (x_{e1}, x_{s1})}(x_{e0}) \, P(X_{e0} = x_{e0}, X_{s0} = x_{s0}). \end{aligned} \tag{A5}$$
Since (A5) is a non-increasing function of $p_{e0}$ by Lemma A1, the order-preserving property of expectations implies that $P(\text{Reject } H_0)$, which is the expectation of (A5), is a non-increasing function of $p_{e0}$. Similarly, we can show that $P(\text{Reject } H_0)$ is a monotone function of $p_{sj}$ for $j = 0, 1$ and of $p_{eg}$ for $g = 0, 1$. □
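The monotonicity properties can also be checked numerically. The following Python sketch does so only for the independent case $\phi = 1$, where the rejection probability factorizes into two binomial-difference tails; the design $(n, e, s) = (20, 3, 3)$, the fixed rates, and the function names are hypothetical choices made for illustration.

```python
from math import comb

def prob_diff_at_least(n, p1, p0, c):
    """P(X1 - X0 >= c) for independent X1 ~ Bin(n, p1), X0 ~ Bin(n, p0)."""
    f1 = [comb(n, k) * p1**k * (1 - p1)**(n - k) for k in range(n + 1)]
    f0 = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    return sum(f1[x1] * f0[x0]
               for x1 in range(n + 1) for x0 in range(n + 1)
               if x1 - x0 >= c)

def reject_prob(n, e, s, pe0, pe1, ps0, ps1):
    """P(Reject H0) when efficacy and safety are independent (phi = 1)."""
    return (prob_diff_at_least(n, pe1, pe0, e)
            * prob_diff_at_least(n, ps1, ps0, s))

def is_monotone(values, increasing, tol=1e-12):
    """Check monotonicity of a sequence up to floating-point tolerance."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(b >= a - tol for a, b in pairs)
    return all(b <= a + tol for a, b in pairs)

grid = [i / 10 for i in range(1, 10)]
n, e, s = 20, 3, 3                       # hypothetical design
by_pe0 = [reject_prob(n, e, s, p, 0.6, 0.4, 0.6) for p in grid]
by_pe1 = [reject_prob(n, e, s, 0.4, p, 0.4, 0.6) for p in grid]
print(is_monotone(by_pe0, increasing=False))
print(is_monotone(by_pe1, increasing=True))
```

A grid check of this kind illustrates the theorem for one design and does not replace the proof, which covers general $\phi$.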

Appendix A.2. Proof of Theorem 2

Theorem A2.
When the odds ratio satisfies $\phi = 1$ and the critical values satisfy $e \leq n\delta_e$ and $s \leq n\delta_s$, the minimal power under the alternative is attained at the boundary of the effect-size constraints, specifically when
$$p_{e0} = \frac{1 - \delta_e}{2}, \quad p_{s0} = \frac{1 - \delta_s}{2}, \quad p_{e1} = \frac{1 + \delta_e}{2}, \quad p_{s1} = \frac{1 + \delta_s}{2}.$$
Proof. 
Based on Theorem 1, the power is minimized at the boundary configuration $p_{e1} = p_{e0} + \delta_e$ and $p_{s1} = p_{s0} + \delta_s$. It remains to determine the location of $(p_{e0}, p_{s0})$ that minimizes the power.
We apply the variance-stabilizing normal approximation (arcsine square-root transform):
$$Z_{ej} = \sqrt{4n}\, \arcsin\sqrt{\frac{X_{ej}}{n}}, \quad Z_{sj} = \sqrt{4n}\, \arcsin\sqrt{\frac{X_{sj}}{n}}, \quad j = 0, 1.$$
Under this approximation,
$$Z_{ej} \sim N\big(\sqrt{4n}\, \arcsin\sqrt{p_{ej}},\, 1\big), \quad Z_{sj} \sim N\big(\sqrt{4n}\, \arcsin\sqrt{p_{sj}},\, 1\big).$$
Moreover, when $\phi = 1$, $X_{ej}$ is independent of $X_{sj}$ for $j = 0, 1$; hence, $(Z_{e0}, Z_{e1})$ is independent of $(Z_{s0}, Z_{s1})$.
After the transformation, the rejection criteria can be written as
$$T_e = \frac{Z_{e1} - Z_{e0}}{\sqrt{2}} \geq y_e, \quad T_s = \frac{Z_{s1} - Z_{s0}}{\sqrt{2}} \geq y_s.$$
Under $p_{e1} = p_{e0} + \delta_e$ and $p_{s1} = p_{s0} + \delta_s$, define
$$\Delta_e(p_{e0}) = \arcsin\sqrt{p_{e0} + \delta_e} - \arcsin\sqrt{p_{e0}}, \quad \Delta_s(p_{s0}) = \arcsin\sqrt{p_{s0} + \delta_s} - \arcsin\sqrt{p_{s0}}.$$
Then
$$T_e \sim N\big(\sqrt{2n}\, \Delta_e(p_{e0}),\, 1\big), \quad T_s \sim N\big(\sqrt{2n}\, \Delta_s(p_{s0}),\, 1\big),$$
and $T_e$ is independent of $T_s$. Therefore the power (under $\phi = 1$) factorizes:
$$P(\text{Reject } H_0) = P(T_e \geq y_e)\, P(T_s \geq y_s) \approx \big[1 - \Phi\{y_e - \sqrt{2n}\, \Delta_e(p_{e0})\}\big] \big[1 - \Phi\{y_s - \sqrt{2n}\, \Delta_s(p_{s0})\}\big].$$
For any fixed threshold $y \in \mathbb{R}$, the function
$$g(\Delta) = 1 - \Phi\{y - \sqrt{2n}\, \Delta\}$$
is strictly increasing in $\Delta$ since
$$g'(\Delta) = \sqrt{2n}\, \varphi\{y - \sqrt{2n}\, \Delta\} > 0.$$
Hence, minimizing the power with respect to $p_{e0}$ (resp. $p_{s0}$) is equivalent to minimizing $\Delta_e(p_{e0})$ (resp. $\Delta_s(p_{s0})$).
Next, differentiate $\Delta_e(p)$ for $p \in (0, 1 - \delta_e)$:
$$\Delta_e'(p) = \frac{1}{2\sqrt{(p + \delta_e)(1 - p - \delta_e)}} - \frac{1}{2\sqrt{p(1 - p)}}.$$
Setting $\Delta_e'(p) = 0$ yields
$$\sqrt{p(1 - p)} = \sqrt{(p + \delta_e)(1 - p - \delta_e)} \iff p(1 - p) = (p + \delta_e)(1 - p - \delta_e) \iff p = \frac{1 - \delta_e}{2}.$$
Moreover, $\Delta_e'(p) < 0$ for $p < \frac{1 - \delta_e}{2}$ and $\Delta_e'(p) > 0$ for $p > \frac{1 - \delta_e}{2}$, so $\Delta_e(p)$ is minimized at $p_{e0} = \frac{1 - \delta_e}{2}$. The same argument gives that $\Delta_s(p_{s0})$ is minimized at $p_{s0} = \frac{1 - \delta_s}{2}$. Therefore the power is minimized at
$$p_{e0} = \frac{1 - \delta_e}{2}, \quad p_{s0} = \frac{1 - \delta_s}{2}, \quad p_{e1} = \frac{1 + \delta_e}{2}, \quad p_{s1} = \frac{1 + \delta_s}{2}.$$
Remark A1 (flip condition under an untransformed difference approximation). The above conclusion relies on the variance-stabilizing transform, under which the variance is approximately constant (equal to 1) and the power is monotone in $\Delta$ for any threshold $y_e, y_s \in \mathbb{R}$. If instead one works directly with the count difference $D_e = X_{e1} - X_{e0}$ and a rejection rule of the form $D_e \geq a_e$, then, under $p_{e1} = p_{e0} + \delta_e$,
$$E[D_e] = n\delta_e, \quad \mathrm{Var}(D_e) = n\big[p_{e0}(1 - p_{e0}) + (p_{e0} + \delta_e)(1 - p_{e0} - \delta_e)\big] =: \sigma_e^2(p_{e0}),$$
and a normal approximation gives
$$P(D_e \geq a_e) \approx 1 - \Phi\left(\frac{a_e - n\delta_e}{\sigma_e(p_{e0})}\right).$$
The derivative with respect to $\sigma_e$ satisfies
$$\frac{\partial}{\partial \sigma_e} P(D_e \geq a_e) = \varphi\left(\frac{a_e - n\delta_e}{\sigma_e}\right) \cdot \frac{a_e - n\delta_e}{\sigma_e^2},$$
whose sign is determined by $a_e - n\delta_e$. Consequently:
  • If $a_e \leq n\delta_e$, then $P(D_e \geq a_e)$ decreases with $\sigma_e$, so the worst case (minimum power) occurs at the maximal-variance point, i.e., $p_{e0} = (1 - \delta_e)/2$.
  • If $a_e > n\delta_e$, the monotonicity flips: $P(D_e \geq a_e)$ increases with $\sigma_e$, so the worst case shifts toward minimal variance, i.e., the boundary values $p_{e0} \to 0$ or $p_{e0} \to 1 - \delta_e$ (and similarly for the safety endpoint).
An analogous statement holds for $D_s = X_{s1} - X_{s0}$ with threshold $a_s$ and mean shift $n\delta_s$. □
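The location of the minimizer of $\Delta_e(p)$, and of the maximizer of the variance term in Remark A1, can be confirmed on a grid. The Python sketch below is a numerical illustration only; the function names and the grid resolution are arbitrary choices.

```python
from math import asin, sqrt

def delta_arcsine(p, delta):
    """Delta(p) = arcsin(sqrt(p + delta)) - arcsin(sqrt(p))."""
    # min(...) guards against p + delta exceeding 1 by rounding error.
    return asin(sqrt(min(1.0, p + delta))) - asin(sqrt(p))

def variance_term(p, delta):
    """p(1 - p) + (p + delta)(1 - p - delta), i.e. Var(D) / n."""
    return p * (1 - p) + (p + delta) * (1 - p - delta)

def grid(delta, steps=1000):
    """Evenly spaced baseline rates on [0, 1 - delta]."""
    return [i * (1 - delta) / steps for i in range(steps + 1)]

for delta in (0.1, 0.2, 0.3):
    pts = grid(delta)
    p_min = min(pts, key=lambda p: delta_arcsine(p, delta))
    p_max = max(pts, key=lambda p: variance_term(p, delta))
    # Both grid optima should sit at (1 - delta) / 2.
    print(delta, round(p_min, 4), round(p_max, 4), (1 - delta) / 2)
```

Both the arcsine difference and the variance term are symmetric about $(1 - \delta)/2$, which is why the two approximations single out the same baseline rate when the threshold does not exceed $n\delta$.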

Appendix A.3. Proof of Theorem 4

Theorem A3.
When the odds ratio satisfies $\phi = 1$ and the critical values satisfy $e, s > 0$, the maximal type I error under the null hypothesis is attained when
$$p_{e1} = p_{e0} = 0.5, \quad p_{s1} = 1, \quad p_{s0} = 0,$$
or
$$p_{s1} = p_{s0} = 0.5, \quad p_{e1} = 1, \quad p_{e0} = 0.$$
Proof. 
Let the rejection region be
$$R = \{T_e \geq e,\; T_s \geq s\},$$
where $T_e = X_{e1} - X_{e0}$ (resp. $T_s = X_{s1} - X_{s0}$) is an increasing function of the efficacy (resp. safety) evidence in arm 1 relative to arm 0. When $\phi = 1$, efficacy and safety are independent within each arm, and the two arms are independent; hence,
$$P(\text{Reject } H_0) = P(T_e \geq e)\, P(T_s \geq s). \tag{A6}$$
The null hypothesis is composite:
$$H_0 = \{p_{e1} \leq p_{e0}\} \cup \{p_{s1} \leq p_{s0}\},$$
i.e., at least one endpoint is not improved. Since $R$ is increasing in $p_{e1} - p_{e0}$ and in $p_{s1} - p_{s0}$, the rejection probability under each sub-null is maximized on the boundary:
$$\sup_{p_{e1} \leq p_{e0}} P(T_e \geq e) = \sup_{p_{e1} = p_{e0}} P(T_e \geq e), \quad \sup_{p_{s1} \leq p_{s0}} P(T_s \geq s) = \sup_{p_{s1} = p_{s0}} P(T_s \geq s).$$
Therefore, to maximize the overall type I error $\sup_{H_0} P(R)$, it suffices to consider the two boundary cases:
Case 1: $p_{e1} = p_{e0}$ (efficacy on the null boundary). Then, by (A6),
$$P(R) = P(T_e \geq e \mid p_{e1} = p_{e0}) \cdot P(T_s \geq s).$$
The second factor is at most 1, so the product is maximized by making the safety constraint as easy as possible. In particular, taking $(p_{s0}, p_{s1}) = (0, 1)$ yields $X_{s0} \equiv 0$ and $X_{s1} \equiv n$; hence, $P(T_s \geq s) = 1$ (for any nontrivial critical value corresponding to requiring improvement). Thus, the maximal type I error in this case reduces to maximizing
$$P(T_e \geq e \mid p_{e1} = p_{e0} = p)$$
over $p \in [0, 1]$.
For the commonly used improvement-type rules based on a positive difference in binomial counts (equivalently, a positive threshold on an increasing transform), this maximization occurs at $p = 1/2$. A convenient statement is the following lemma (stated for the raw count difference; it applies to any increasing rejection rule with a positive critical value).
Lemma A2.
Let $X_0, X_1 \overset{\text{iid}}{\sim} \mathrm{Bin}(n, p)$ be independent. For any integer $a \geq 1$,
$$f(p) := P(X_1 - X_0 \geq a)$$
is maximized at $p = \frac{1}{2}$.
Proof. 
Write $X_1 - X_0 = \sum_{i=1}^{n} (A_i - B_i)$ with $(A_i, B_i)$ i.i.d. pairs of independent $\mathrm{Bernoulli}(p)$ variables, and set $Y_i = A_i - B_i \in \{-1, 0, 1\}$. Then
$$P(Y_i = 1) = P(Y_i = -1) = p(1 - p) =: t, \quad P(Y_i = 0) = 1 - 2t,$$
so the law depends on $p$ only through $t \in [0, 1/4]$. Let $K = \#\{i : Y_i \neq 0\} \sim \mathrm{Bin}(n, 2t)$. Conditional on $K = k$, the nonzero increments are $\pm 1$ with equal probability, so $X_1 - X_0 \mid K = k \overset{d}{=} S_k := \sum_{j=1}^{k} \varepsilon_j$, where $\varepsilon_j \in \{\pm 1\}$ are i.i.d. symmetric. Hence
$$f(p) = \sum_{k=0}^{n} P(K = k)\, P(S_k \geq a).$$
For $a \geq 1$, the tail probability $P(S_k \geq a)$ is non-decreasing in $k$, and $K \sim \mathrm{Bin}(n, 2t)$ is stochastically increasing in $t$. Therefore $f(p)$ is increasing in $t = p(1 - p)$, which is maximized at $p = \frac{1}{2}$. □
Applying the lemma (or its transformed-rule analogue) yields that the maximum over $p$ is attained at $p_{e0} = p_{e1} = \frac{1}{2}$. Together with $(p_{s0}, p_{s1}) = (0, 1)$, this gives the first claimed configuration:
$$p_{e1} = p_{e0} = \frac{1}{2}, \quad p_{s0} = 0, \quad p_{s1} = 1.$$
Case 2: $p_{s1} = p_{s0}$ (safety on the null boundary). By symmetry of the argument, the product (A6) is maximized by taking $(p_{e0}, p_{e1}) = (0, 1)$, so that $P(T_e \geq e) = 1$, and then maximizing $P(T_s \geq s \mid p_{s1} = p_{s0} = p)$, which is attained at $p = \frac{1}{2}$. This yields the second configuration:
$$p_{s1} = p_{s0} = \frac{1}{2}, \quad p_{e0} = 0, \quad p_{e1} = 1.$$
Combining the two cases proves the theorem. □
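The claim of Lemma A2 can be illustrated numerically by evaluating $f(p) = P(X_1 - X_0 \geq a)$ exactly on a grid of $p$. The Python sketch below uses the hypothetical values $n = 20$ and $a = 3$; it is a check for one setting, not a proof.

```python
from math import comb

def tail_prob_equal_p(n, p, a):
    """f(p) = P(X1 - X0 >= a) for independent X0, X1 ~ Bin(n, p)."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    return sum(pmf[x1] * pmf[x0]
               for x1 in range(n + 1) for x0 in range(n + 1)
               if x1 - x0 >= a)

n, a = 20, 3                              # hypothetical setting
p_grid = [i / 20 for i in range(21)]      # 0.00, 0.05, ..., 1.00
values = [tail_prob_equal_p(n, p, a) for p in p_grid]
best_p = p_grid[values.index(max(values))]
print(best_p, max(values))
```

With both arms at a common rate, the difference $X_1 - X_0$ is symmetric about zero and its spread is largest at $p = 1/2$, so the grid maximum is expected there; this is the configuration used for the type I error calibration.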

References

  1. U.S. Food and Drug Administration. Multiple Endpoints in Clinical Trials: Guidance for Industry; Food and Drug Administration: Silver Spring, MD, USA, 2022. Available online: https://www.fda.gov/media/162416/download (accessed on 24 May 2026).
  2. Pocock, S.J.; Ariti, C.A.; Collier, T.J.; Wang, D. The win ratio: A new approach to the analysis of composite endpoints in clinical trials. Eur. Heart J. 2012, 33, 176–182.
  3. Marcus, R.; Peritz, E.; Gabriel, K.R. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976, 63, 655–660.
  4. Berger, R.L.; Hsu, J.C. Bioequivalence trials, intersection–union tests and equivalence confidence sets. Stat. Sci. 1996, 11, 283–319.
  5. Bryant, J.; Day, R. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. Biometrics 1995, 51, 1372–1383.
  6. Conaway, M.R.; Petroni, G.R. Bivariate sequential designs for phase II trials. Biometrics 1995, 51, 656–664.
  7. Stallard, N.; Thall, P.F.; Whitehead, J. Decision-theoretic designs for phase II clinical trials with multiple outcomes. Biometrics 1999, 55, 971–977.
  8. Ivanova, A.; Qaqish, B.F.; Schell, M.J. Continuous toxicity monitoring in phase II trials in oncology. Biometrics 2005, 61, 540–545.
  9. Chen, C.-M.; Chi, Y. Curtailed two-stage designs with two dependent binary endpoints. Pharm. Stat. 2012, 11, 57–62.
  10. Thall, P.F.; Cheng, S.-C. Treatment comparisons based on two-dimensional safety and efficacy alternatives in oncology trials. Biometrics 1999, 55, 746–753.
  11. Everitt, B.S. The Analysis of Contingency Tables; Chapman and Hall: London, UK, 1977.
  12. Kendall, M.G.; Stuart, A. The Advanced Theory of Statistics, Volume 2: Inference and Relationship, 3rd ed.; Charles Griffin: London, UK, 1973.
  13. Homma, G.; Yoshida, T. Exact power and sample size in clinical trials with two co-primary binary endpoints. Stat. Methods Med. Res. 2025, 34, 2183–2201.
  14. Jung, H.; Mitani, A.A.; Husain, M.I.; Ma, C. Exact randomized two-stage phase 2 clinical trial designs for two binary co-primary endpoints. Stat. Med. 2026, 45, e70424.
  15. Birch, M.W. The detection of partial association, I: The 2 × 2 case. J. R. Stat. Soc. Ser. B Methodol. 1964, 26, 313–324.
  16. Yin, C.; Buzaianu, E.M.; Chen, P.; Hsu, L. A design for selecting among k treatments with two binary endpoints in comparison to a control treatment. Sankhya B 2026, 1–40.
Table 1. Observed outcomes for treatment $\pi_j$.

| | Safety: Yes | Safety: No | Sum |
|---|---|---|---|
| Efficacy: Yes | $X_{11j}$ | $X_{12j}$ | $X_{ej}$ |
| Efficacy: No | $X_{21j}$ | $X_{22j}$ | $X_{e^c j}$ |
| Sum | $X_{sj}$ | $X_{s^c j}$ | |
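Table 2. Target power $= 0.75$, target type I error $= 0.15$, $\phi = 1$, and $p_{e0}$, $p_{s0}$ unknown.

| $\delta_e$ | $\delta_s$ | $p_{e0}$ | $p_{s0}$ | $n$ | $e$ | $s$ | Power | Type I Error |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.1 | 0.45 | 0.45 | 235 | 12 | 12 | 0.75452 | 0.14436 |
| 0.1 | 0.2 | 0.45 | 0.40 | 156 | 10 | 10 | 0.75276 | 0.14103 |
| 0.1 | 0.3 | 0.45 | 0.35 | 154 | 10 | 10 | 0.75091 | 0.13948 |
| 0.2 | 0.1 | 0.40 | 0.45 | 156 | 10 | 10 | 0.75417 | 0.14103 |
| 0.2 | 0.2 | 0.40 | 0.40 | 63 | 7 | 7 | 0.75241 | 0.12335 |
| 0.2 | 0.3 | 0.40 | 0.35 | 46 | 6 | 6 | 0.75714 | 0.12566 |
| 0.3 | 0.1 | 0.35 | 0.45 | 153 | 16 | 10 | 0.75072 | 0.13869 |
| 0.3 | 0.2 | 0.35 | 0.40 | 46 | 6 | 6 | 0.75521 | 0.12566 |
| 0.3 | 0.3 | 0.35 | 0.35 | 29 | 5 | 5 | 0.76641 | 0.11852 |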
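Table 3. Target power $= 0.85$, target type I error $= 0.15$, $\phi = 1$, and $p_{e0}$, $p_{s0}$ unknown.

| $\delta_e$ | $\delta_s$ | $p_{e0}$ | $p_{s0}$ | $n$ | $e$ | $s$ | Power | Type I Error |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.1 | 0.45 | 0.45 | 312 | 14 | 14 | 0.85141 | 0.13987 |
| 0.1 | 0.2 | 0.45 | 0.40 | 225 | 12 | 12 | 0.85153 | 0.13912 |
| 0.1 | 0.3 | 0.45 | 0.35 | 224 | 12 | 17 | 0.85112 | 0.13859 |
| 0.2 | 0.1 | 0.40 | 0.45 | 225 | 12 | 12 | 0.85081 | 0.13912 |
| 0.2 | 0.2 | 0.40 | 0.40 | 76 | 7 | 7 | 0.85473 | 0.14583 |
| 0.2 | 0.3 | 0.40 | 0.35 | 56 | 6 | 6 | 0.85093 | 0.14930 |
| 0.3 | 0.1 | 0.35 | 0.45 | 224 | 14 | 12 | 0.85054 | 0.13859 |
| 0.3 | 0.2 | 0.35 | 0.40 | 56 | 6 | 6 | 0.85183 | 0.14930 |
| 0.3 | 0.3 | 0.35 | 0.35 | 34 | 5 | 5 | 0.85353 | 0.13750 |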
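Table 4. Target power $= 0.75$, target type I error $= 0.15$, $\delta_e = \delta_s = 0.2$, and $\phi \in \{0, 1, 2, 4, 8\}$.

| $p_{s0}$ | $p_{e0}$ | $\phi$ | $n$ | $e$ | $s$ | Power | Type I Error |
|---|---|---|---|---|---|---|---|
| 0.2 | 0.2 | 0 | 48 | 5 | 5 | 0.75495 | 0.12505 |
| 0.2 | 0.2 | 1 | 47 | 5 | 5 | 0.75505 | 0.12295 |
| 0.2 | 0.2 | 2 | 46 | 5 | 5 | 0.75396 | 0.11976 |
| 0.2 | 0.2 | 4 | 45 | 5 | 5 | 0.75135 | 0.11693 |
| 0.2 | 0.2 | 8 | 45 | 5 | 5 | 0.76221 | 0.11774 |
| 0.2 | 0.4 | 0 | 53 | 6 | 5 | 0.75059 | 0.13623 |
| 0.2 | 0.4 | 1 | 52 | 6 | 5 | 0.75415 | 0.13770 |
| 0.2 | 0.4 | 2 | 51 | 6 | 5 | 0.75134 | 0.13305 |
| 0.2 | 0.4 | 4 | 50 | 6 | 5 | 0.75022 | 0.12924 |
| 0.2 | 0.4 | 8 | 50 | 6 | 5 | 0.75559 | 0.13262 |
| 0.2 | 0.6 | 0 | 52 | 6 | 5 | 0.76262 | 0.13648 |
| 0.2 | 0.6 | 1 | 50 | 6 | 5 | 0.75010 | 0.13051 |
| 0.2 | 0.6 | 2 | 50 | 6 | 5 | 0.75735 | 0.13080 |
| 0.2 | 0.6 | 4 | 50 | 6 | 5 | 0.76275 | 0.13154 |
| 0.2 | 0.6 | 8 | 49 | 6 | 5 | 0.75629 | 0.12770 |
| 0.4 | 0.2 | 0 | 54 | 5 | 6 | 0.76188 | 0.13951 |
| 0.4 | 0.2 | 1 | 52 | 5 | 6 | 0.75205 | 0.13523 |
| 0.4 | 0.2 | 2 | 52 | 5 | 6 | 0.76051 | 0.13511 |
| 0.4 | 0.2 | 4 | 51 | 5 | 6 | 0.75904 | 0.13437 |
| 0.4 | 0.2 | 8 | 50 | 5 | 6 | 0.75575 | 0.13115 |
| 0.4 | 0.4 | 0 | 58 | 6 | 6 | 0.75214 | 0.14951 |
| 0.4 | 0.4 | 1 | 57 | 6 | 6 | 0.75607 | 0.14663 |
| 0.4 | 0.4 | 2 | 56 | 6 | 6 | 0.75733 | 0.14505 |
| 0.4 | 0.4 | 4 | 55 | 6 | 6 | 0.75392 | 0.14443 |
| 0.4 | 0.4 | 8 | 54 | 6 | 6 | 0.75765 | 0.14107 |
| 0.4 | 0.6 | 0 | 57 | 6 | 6 | 0.76124 | 0.14765 |
| 0.4 | 0.6 | 1 | 55 | 6 | 6 | 0.75283 | 0.14256 |
| 0.4 | 0.6 | 2 | 54 | 6 | 6 | 0.75011 | 0.14013 |
| 0.4 | 0.6 | 4 | 54 | 6 | 6 | 0.76298 | 0.14010 |
| 0.4 | 0.6 | 8 | 53 | 6 | 6 | 0.75934 | 0.13641 |
| 0.6 | 0.2 | 0 | 52 | 5 | 6 | 0.76290 | 0.13554 |
| 0.6 | 0.2 | 1 | 50 | 5 | 6 | 0.75125 | 0.13153 |
| 0.6 | 0.2 | 2 | 50 | 5 | 6 | 0.75556 | 0.13236 |
| 0.6 | 0.2 | 4 | 49 | 5 | 6 | 0.75186 | 0.12765 |
| 0.6 | 0.2 | 8 | 49 | 5 | 6 | 0.75558 | 0.12815 |
| 0.6 | 0.4 | 0 | 57 | 6 | 6 | 0.76215 | 0.14689 |
| 0.6 | 0.4 | 1 | 55 | 6 | 6 | 0.75404 | 0.14279 |
| 0.6 | 0.4 | 2 | 55 | 6 | 6 | 0.76194 | 0.14152 |
| 0.6 | 0.4 | 4 | 54 | 6 | 6 | 0.75950 | 0.14007 |
| 0.6 | 0.4 | 8 | 53 | 6 | 6 | 0.76019 | 0.13837 |
| 0.6 | 0.6 | 0 | 55 | 6 | 6 | 0.76030 | 0.14275 |
| 0.6 | 0.6 | 1 | 54 | 6 | 6 | 0.76235 | 0.13779 |
| 0.6 | 0.6 | 2 | 53 | 6 | 6 | 0.75781 | 0.13831 |
| 0.6 | 0.6 | 4 | 52 | 6 | 6 | 0.75438 | 0.13652 |
| 0.6 | 0.6 | 8 | 51 | 6 | 6 | 0.75506 | 0.13410 |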
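Table 5. Target power $= 0.85$, target type I error $= 0.15$, $\delta_e = \delta_s = 0.2$, and $\phi \in \{0, 1, 2, 4, 8\}$.

| $p_{s0}$ | $p_{e0}$ | $\phi$ | $n$ | $e$ | $s$ | Power | Type I Error |
|---|---|---|---|---|---|---|---|
| 0.2 | 0.2 | 0 | 58 | 5 | 5 | 0.85928 | 0.14710 |
| 0.2 | 0.2 | 1 | 57 | 5 | 5 | 0.85588 | 0.14647 |
| 0.2 | 0.2 | 2 | 56 | 5 | 5 | 0.85261 | 0.14387 |
| 0.2 | 0.2 | 4 | 55 | 5 | 5 | 0.85124 | 0.14296 |
| 0.2 | 0.2 | 8 | 55 | 5 | 5 | 0.85713 | 0.14121 |
| 0.2 | 0.4 | 0 | 71 | 7 | 6 | 0.85598 | 0.13195 |
| 0.2 | 0.4 | 1 | 70 | 7 | 6 | 0.85405 | 0.13131 |
| 0.2 | 0.4 | 2 | 70 | 7 | 6 | 0.85598 | 0.13254 |
| 0.2 | 0.4 | 4 | 69 | 7 | 6 | 0.85427 | 0.12807 |
| 0.2 | 0.4 | 8 | 68 | 7 | 6 | 0.85286 | 0.12577 |
| 0.2 | 0.6 | 0 | 68 | 7 | 6 | 0.85027 | 0.12720 |
| 0.2 | 0.6 | 1 | 67 | 7 | 6 | 0.85180 | 0.12478 |
| 0.2 | 0.6 | 2 | 67 | 7 | 6 | 0.85443 | 0.12593 |
| 0.2 | 0.6 | 4 | 67 | 7 | 6 | 0.85791 | 0.12723 |
| 0.2 | 0.6 | 8 | 66 | 7 | 6 | 0.85109 | 0.12466 |
| 0.4 | 0.2 | 0 | 71 | 6 | 7 | 0.85386 | 0.13086 |
| 0.4 | 0.2 | 1 | 70 | 6 | 7 | 0.85311 | 0.13067 |
| 0.4 | 0.2 | 2 | 70 | 6 | 7 | 0.85510 | 0.13046 |
| 0.4 | 0.2 | 4 | 69 | 6 | 7 | 0.85481 | 0.13033 |
| 0.4 | 0.2 | 8 | 68 | 6 | 7 | 0.85297 | 0.12787 |
| 0.4 | 0.4 | 0 | 77 | 7 | 7 | 0.85411 | 0.14499 |
| 0.4 | 0.4 | 1 | 76 | 7 | 7 | 0.85368 | 0.14162 |
| 0.4 | 0.4 | 2 | 75 | 7 | 7 | 0.85262 | 0.13935 |
| 0.4 | 0.4 | 4 | 74 | 7 | 7 | 0.85120 | 0.13735 |
| 0.4 | 0.4 | 8 | 73 | 7 | 7 | 0.85252 | 0.13745 |
| 0.4 | 0.6 | 0 | 74 | 7 | 7 | 0.85361 | 0.13814 |
| 0.4 | 0.6 | 1 | 73 | 7 | 7 | 0.85441 | 0.13641 |
| 0.4 | 0.6 | 2 | 72 | 7 | 7 | 0.85028 | 0.13375 |
| 0.4 | 0.6 | 4 | 72 | 7 | 7 | 0.85345 | 0.13483 |
| 0.4 | 0.6 | 8 | 72 | 7 | 7 | 0.85966 | 0.13623 |
| 0.6 | 0.2 | 0 | 68 | 6 | 7 | 0.85159 | 0.12723 |
| 0.6 | 0.2 | 1 | 67 | 6 | 7 | 0.85175 | 0.12547 |
| 0.6 | 0.2 | 2 | 67 | 6 | 7 | 0.85383 | 0.12457 |
| 0.6 | 0.2 | 4 | 67 | 6 | 7 | 0.85748 | 0.12530 |
| 0.6 | 0.2 | 8 | 66 | 6 | 7 | 0.85126 | 0.12497 |
| 0.6 | 0.4 | 0 | 74 | 7 | 7 | 0.85541 | 0.13949 |
| 0.6 | 0.4 | 1 | 73 | 7 | 7 | 0.85350 | 0.13631 |
| 0.6 | 0.4 | 2 | 72 | 7 | 7 | 0.85037 | 0.13508 |
| 0.6 | 0.4 | 4 | 72 | 7 | 7 | 0.85189 | 0.13507 |
| 0.6 | 0.4 | 8 | 71 | 7 | 7 | 0.85313 | 0.13311 |
| 0.6 | 0.6 | 0 | 72 | 7 | 7 | 0.85746 | 0.13336 |
| 0.6 | 0.6 | 1 | 71 | 7 | 7 | 0.85887 | 0.13302 |
| 0.6 | 0.6 | 2 | 70 | 7 | 7 | 0.85303 | 0.13095 |
| 0.6 | 0.6 | 4 | 69 | 7 | 7 | 0.85038 | 0.12930 |
| 0.6 | 0.6 | 8 | 69 | 7 | 7 | 0.85707 | 0.13020 |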
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, P.; Yin, C.; Buzaianu, E.M. Hypothesis Testing for Two-Arm Proportions with Two Binary Endpoints. Axioms 2026, 15, 435. https://doi.org/10.3390/axioms15060435
