1. Introduction
Clinical trials increasingly require evidence that new treatments not only improve efficacy but also maintain or enhance safety. Regulatory agencies such as the U.S. Food and Drug Administration [1] explicitly emphasize the importance of multiple endpoints to ensure that therapeutic benefit is not achieved at the expense of unacceptable adverse effects. In many modern therapeutic areas, such as oncology, immunotherapy, and cardiovascular medicine, both efficacy and toxicity are regarded as co-primary outcomes, reflecting the dual objective of maximizing patient benefit while minimizing harm. Composite endpoints are frequently used to summarize multiple outcomes into a single measure; however, they can obscure trade-offs between efficacy and safety and complicate interpretation [2]. Consequently, there is growing interest in transparent and statistically rigorous frameworks that jointly evaluate efficacy and safety without conflating their effects.
From a methodological standpoint, the prevailing paradigm for co-primary binary endpoints is the intersection–union testing (IUT) framework, in which overall success is declared only when significant improvement is demonstrated on all the endpoints [3,4]. This approach provides strong control of the familywise type I error rate and aligns with regulatory expectations. A natural modeling strategy is to represent two binary outcomes jointly as a four-category multinomial variable, explicitly accounting for their correlation and enabling exact computation of joint probabilities and rejection regions.
A substantial body of research has focused on the design of clinical trials with dual binary endpoints. The seminal work of Bryant and Day [5] introduced a two-stage single-arm phase II design that simultaneously monitors efficacy and toxicity under a multinomial model. Their framework defined admissible regions for early termination but was limited to settings in which the experimental treatment was compared with a fixed standard rather than against a control treatment. Building on this idea, Conaway and Petroni [6] extended the single-arm framework to bivariate sequential settings, still assuming fixed standard values. Subsequent developments by Stallard, Thall, and Whitehead [7], Ivanova et al. [8], and Chen and Chi [9] further advanced early-phase designs with dual binary endpoints under single-arm or fixed-standard settings. Thall and Cheng [10] proposed a related decision-theoretic design for evaluating efficacy and safety in which clinically meaningful improvements of an experimental treatment over a standard treatment are elicited from the physician. Their testing procedure was developed using a large-sample normal approximation, whereas the present study provides an exact two-arm hypothesis testing framework under a multinomial model, allowing rejection probabilities and type I error rates to be computed exactly. Thus, despite substantial progress, a methodological gap remains: exact two-arm hypothesis tests that jointly evaluate efficacy and safety and lead to a clear decision are still lacking.
The present study addresses this methodological gap by proposing an exact frequentist hypothesis test for two-arm trials with two binary endpoints—efficacy and safety. The goal of the proposed method is twofold. First, it aims to directly evaluate the simultaneous improvement of both endpoints relative to a concurrent control rather than against fixed pre-specified standard values. Second, it ensures exact control of the type I error rate at the nominal significance level by analytically identifying least favorable parameter configurations (LFCs) or generalized least favorable parameter configurations (GLFCs). The LFCs and GLFCs are obtained by using monotonicity results to identify boundary parameter settings and then exactly evaluating the rejection probabilities under the multinomial model.
To achieve these objectives, the joint outcomes of efficacy and safety are modeled using a four-category multinomial distribution, which enables complete enumeration of all possible response combinations across treatment arms. Within this framework, the study focuses on the following four key methodological questions:
How can we construct a single-stage rejection rule that declares superiority only when both endpoints exceed pre-specified paired thresholds?
How can rejection probabilities be computed exactly under the four-category multinomial model?
How can LFCs or GLFCs be identified to guarantee exact control of the type I error rate and adequate power?
How can design tables be developed to map desired significance levels, power, and effect sizes to the required sample size and decision thresholds?
Since the two binary endpoints generate four possible joint response categories, the proposed formulation is naturally connected to the classical analysis of 2 × 2 contingency tables; see, for example, Everitt [11] (Section 2.8). In particular, the condition that the odds ratio equals one corresponds to the classical no-association structure for a 2 × 2 table, a testing problem discussed by Kendall and Stuart [12] (Section 33.16). The present work builds on this classical framework but focuses on a different objective: constructing an exact two-arm hypothesis testing procedure for simultaneous efficacy and safety improvement relative to a concurrent control. This work makes several distinct contributions. First, it extends the single-arm joint-endpoint framework of Bryant and Day [5] and Chen and Chi [9] to a two-arm comparative design, allowing direct inference about relative efficacy and safety between treatments rather than against fixed standards. Second, the proposed test provides an exact frequentist formulation that allows analytical computation of rejection probabilities and formal identification of least favorable configurations. Finally, the framework produces design tables that map the nominal type I error, power, and effect sizes to required sample sizes and thresholds, facilitating practical implementation through simulations and exact calculations.
More recently, Homma and Yoshida [13] and Jung et al. [14] considered two-arm clinical trial designs with two binary co-primary endpoints under joint binary-outcome frameworks. Homma and Yoshida [13] developed exact power and sample size calculations for two-arm superiority trials with two co-primary binary endpoints by combining the bivariate binomial distribution with endpoint-wise testing procedures, such as Fisher’s exact test, Fisher’s mid-p test, Pearson’s chi-square test, the Z-pooled exact unconditional test, and Boschloo’s exact unconditional test. Thus, their framework is primarily built on marginal tests for each endpoint within an intersection–union structure, whereas the present study directly formulates a joint multinomial hypothesis testing procedure for simultaneous efficacy and safety improvement based on the paired treatment–control differences in
.
Jung et al. [14] proposed a related two-arm, two-stage phase II design for two binary co-primary endpoints, which allows early termination for futility and is therefore useful for screening ineffective treatments. Their approach is closer in spirit to the present work because it also models correlated binary endpoints jointly. However, their operating characteristics are calibrated at pre-specified null and alternative configurations: the type I error rate is evaluated under selected point configurations of the null hypothesis, and the power is calculated at a specified alternative point. In contrast, the proposed method identifies least favorable or generalized least favorable configurations for type I error control and power evaluation so that the resulting design guarantees the desired operating characteristics over the corresponding parameter regions rather than only at selected design points.
By uniting the intersection–union test principle and multinomial modeling, this work offers an exact and transparent testing framework for clinical trials requiring simultaneous improvement in efficacy and safety. The proposed testing procedure is calibrated to guarantee the desired type I error control and power requirements over the corresponding parameter regions rather than only at selected point configurations. It provides statistical rigor through precise error control and clinical interpretability by maintaining separate explicit evaluation of efficacy and safety. ChatGPT-5.5 was used to assist in converting simulation output tables into LaTeX format.
2. Formulation for Two Endpoints
We consider a procedure for comparing an experimental treatment with a control. Let
denote the control treatment and
the experimental treatment. Each treatment yields two binary endpoints, referred to as efficacy and safety, although other binary outcomes could be accommodated within the same framework. For treatment
j (
), let
and
denote the numbers of successes in efficacy and safety among
n patients, with corresponding success probabilities
and
. The two endpoints within a treatment arm may be correlated, whereas responses across different treatments are assumed to be independent. In this paper, the association between the efficacy and safety endpoints is characterized through a pre-specified odds ratio
. This specification is needed because, under the four-category multinomial model, the marginal efficacy probability, the marginal safety probability, and the odds ratio together determine the joint cell probabilities
; conversely, the joint cell probabilities determine the two marginal probabilities and the odds ratio. Thus, a working specification of the association structure is required to fully define the multinomial probabilities used in the design. In practice, information on the control treatment may be available from historical studies or previous clinical experience, so the marginal response rates and the association between efficacy and safety can often be estimated or elicited for the control arm. For the experimental treatment, however, the association structure is typically less well established at the design stage. Following common practice in phase II designs with dual binary endpoints, we therefore use a common odds ratio across treatment arms as a working design assumption. A similar specification of the association structure through a pre-specified odds ratio has also been used in the dual-endpoint phase II trial literature; see, for example, Chen and Chi [
9].
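The two-way correspondence described above, between (marginal efficacy probability, marginal safety probability, odds ratio) and the four cell probabilities, can be illustrated with a minimal sketch. The function names, the cell labels `p11`, `p10`, `p01`, `p00`, and the example values are hypothetical and not taken from the paper; the closed form used for the both-success cell is the standard quadratic solution for a 2 × 2 table with fixed margins and odds ratio, which is assumed here to match the formulation adopted later in the paper.

```python
from math import sqrt

def cell_probs(p_e, p_s, phi):
    """Return (p11, p10, p01, p00) for marginals p_e, p_s and odds ratio phi.

    Assumed labeling: first index = efficacy success, second = safety success.
    """
    if abs(phi - 1.0) < 1e-12:
        p11 = p_e * p_s                      # independence: product of marginals
    else:
        # Root of (phi-1)*p11^2 - a*p11 + phi*p_e*p_s = 0 that lies in [0, 1].
        a = 1.0 + (p_e + p_s) * (phi - 1.0)
        b = -4.0 * phi * (phi - 1.0) * p_e * p_s
        p11 = (a - sqrt(a * a + b)) / (2.0 * (phi - 1.0))
    return p11, p_e - p11, p_s - p11, 1.0 - p_e - p_s + p11

def marginals_and_odds_ratio(p11, p10, p01, p00):
    """Inverse map: cell probabilities -> (p_e, p_s, phi)."""
    return p11 + p10, p11 + p01, (p11 * p00) / (p10 * p01)

# Illustrative values only (not from the paper's tables).
cells = cell_probs(0.4, 0.6, 2.0)
print(cells)
print(marginals_and_odds_ratio(*cells))
```

The round trip recovers the inputs up to floating-point error, which is a convenient sanity check before the cell probabilities are used in any design calculation.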
To simultaneously assess clinical efficacy and safety, one may formulate the null hypothesis as stating that the new treatment is not sufficiently better than the control in terms of either efficacy or safety, while the alternative hypothesis requires improvement in both dimensions. Specifically, we consider the following hypothesis testing problem:
Based on the null and alternative hypotheses in (
1), the following probability requirements are imposed:
and
Here, condition (
2) controls the type I error probability at the nominal level
, which represents the maximum acceptable probability of falsely declaring the experimental treatment superior when it does not achieve simultaneous improvement in both efficacy and safety. In potential clinical applications, the choice of
reflects the tolerance for such a false-positive conclusion and may depend on the trial phase, disease severity, available alternatives, and regulatory considerations. Condition (
3) requires the power of the test to be at least
under clinically meaningful alternatives, where
denotes the maximum acceptable probability of failing to detect a treatment that improves both endpoints by the pre-specified effect sizes. Thus,
controls the risk of incorrectly advancing an insufficient treatment, whereas
measures the probability of correctly identifying a treatment with meaningful joint efficacy–safety benefit.
Joint Probability of Two Endpoints
Let j index the two treatment arms, the control treatment and the experimental treatment. For patient i in treatment arm j, the two binary endpoints are recorded as a pair of indicators, the first equal to 1 for efficacy success and the second equal to 1 for safety success. Thus, each patient falls into one of four joint response categories, (1,1), (1,0), (0,1), or (0,0), corresponding, respectively, to efficacious and safe, efficacious but unsafe, nonefficacious but safe, and neither efficacious nor safe.
Accordingly, for treatment
j, we denote the vector of cell counts by
where
We assume that this vector follows a multinomial distribution with total sample size
n and cell probabilities
for
. Here,
(
) denotes the probability corresponding to each efficacy–safety combination in treatment arm
j. The sample proportions
provide empirical estimates of the corresponding cell probabilities
, for
.
This notation is equivalent to the standard three-dimensional contingency-table notation for observed frequencies and for cell probabilities, where the third index corresponds to the treatment arm. We retain notations and because, in clinical trial testing and ranking-and-selection designs, n is conventionally used for the sample size, while X denotes observed response counts.
The marginal probability of observing efficacy under treatment
j is
which aggregates all outcomes classified as efficacious.
Similarly, the marginal safety probability is
representing the chance that a patient satisfies the safety criterion regardless of efficacy.
The corresponding marginal cell counts are
and
where
and
denote the complements of efficacy success and safety success, respectively.
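As a small illustration of the bookkeeping above, the following sketch collapses the four cell counts of one arm into the marginal efficacy and safety counts and the sample proportions. The function name, the dictionary keys, and the example counts are hypothetical; the ordering of the cells (both successes, efficacy only, safety only, neither) is an assumed convention.

```python
def marginal_counts(x11, x10, x01, x00):
    """Collapse four cell counts into marginal counts and sample proportions.

    Assumed cell order: both successes, efficacy only, safety only, neither.
    """
    n = x11 + x10 + x01 + x00
    x_e = x11 + x10            # all outcomes classified as efficacious
    x_s = x11 + x01            # all outcomes classified as safe
    return {
        "n": n,
        "efficacy": x_e,
        "no_efficacy": n - x_e,
        "safety": x_s,
        "no_safety": n - x_s,
        "proportions": tuple(c / n for c in (x11, x10, x01, x00)),
    }

# Made-up counts for one arm of 20 patients.
print(marginal_counts(8, 4, 5, 3))
```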
The outcomes for treatment arm j are listed in Table 1.
For each treatment arm, the dependence between efficacy and safety is quantified by the treatment-specific odds ratio
The odds ratio is a classical measure of association for 2 × 2 contingency tables; see, for example, Everitt [11] (Section 2.8). In particular, the condition that this odds ratio equals one corresponds to the classical no-association case in treatment arm j, which has long been studied as a testing problem for 2 × 2 tables [12] (Section 33.16). For multiple 2 × 2 tables, such as those arising after stratification by treatment arm, the treatment-specific log odds ratio is a standard way to describe conditional association. The closely related problems of partial association and conditional independence in stratified 2 × 2 tables have been studied in the classical contingency-table literature; see Birch [15] and Kendall and Stuart [12] (Section 33.62). These formulations are also closely connected to the common-odds-ratio, or homogeneous-association, setting used for multiple 2 × 2 tables. This classical formulation provides a natural justification for using the odds ratio as an association measure for the two binary endpoints. Following its use in bivariate binary-endpoint designs such as Conaway and Petroni [6], we use the odds ratio to parameterize the association between efficacy and safety.
In the present paper, we adopt the working assumption of a known common odds ratio across the two treatment arms, namely
This common-odds-ratio assumption is not a mathematical necessity but a design simplification that allows the four cell probabilities in each treatment arm to be determined by the two marginal probabilities
and
together with a common association parameter
. It is also related to the classical common-association framework for several
contingency tables, often expressed in terms of the log odds ratio; see Kendall and Stuart [
12] (Section 33.62). In applications, ϕ may be specified using prior studies, pilot data, or sensitivity analyses. The value of ϕ has a direct interpretation in terms of the association between the two binary endpoints. When ϕ = 1, the efficacy and safety endpoints are independent within a treatment arm, corresponding to the no-association case discussed above. When ϕ > 1, the two endpoints are positively associated: patients who achieve efficacy are more likely to also achieve safety, and patients who fail to achieve efficacy are more likely to also fail to achieve safety. Equivalently, for fixed marginal probabilities, larger values of ϕ place more probability mass on the concordant cells (both endpoints succeed, or both fail) and less probability mass on the discordant cells (exactly one endpoint succeeds). When ϕ < 1, the two endpoints are negatively associated, indicating a tendency toward discordant outcomes; for example, achieving efficacy may be less likely to occur together with achieving safety. The boundary case ϕ = 0 represents an extreme form of negative association and is mainly included for sensitivity assessment rather than as a typical clinical setting. Two situations are considered:
Case 1: ϕ = 1 (independence). When the two endpoints are independent, the joint distribution of
factorizes as
where
denotes the binomial pmf.
Case 2: ϕ ≠ 1 (dependence). For specified
,
, and
, Bryant and Day [
5] derived the corresponding four cell probabilities:
with
Given these cell probabilities, the joint probability of
is
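Remark 1. When ϕ = 1, expression (8) continues to hold, with the simplifying identity that the probability of joint efficacy and safety success equals the product of the two marginal success probabilities.
3. Fixed Sample Size Design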
Consider fixed design constants n, e, and s. The decision rule is constructed as follows. Here, n denotes the sample size per treatment arm, while e and s are positive integer thresholds representing the minimum required excess numbers of successes for efficacy and safety, respectively, in the experimental treatment compared with the control. These thresholds are chosen together with n so that the resulting test satisfies the desired type I error and power requirements.
Procedure H:
Collect n observations from each of the two treatments (the control and the new treatment), and record the observed numbers of successes for the efficacy and safety endpoints in each arm. With positive thresholds e and s, the rule in Procedure H is given by:
Reject the null hypothesis if the new treatment exceeds the control by at least e efficacy successes and by at least s safety successes.
Otherwise (i.e., if either excess falls short of its threshold), retain the null hypothesis.
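The decision rule of Procedure H can be written as a short function. This is a sketch under one stated assumption: the inequalities are taken to be non-strict (an excess of exactly e or s already counts), which follows the description of e and s as minimum required excess numbers of successes. The function and argument names are hypothetical.

```python
def procedure_h(x_e_ctrl, x_s_ctrl, x_e_new, x_s_new, e, s):
    """Return True when the null hypothesis is rejected under Procedure H.

    Assumption: rejection requires an excess of AT LEAST e efficacy successes
    and AT LEAST s safety successes in the new treatment over the control.
    """
    efficacy_excess = x_e_new - x_e_ctrl
    safety_excess = x_s_new - x_s_ctrl
    return efficacy_excess >= e and safety_excess >= s

# Made-up counts: excesses of 7 and 6 against thresholds e = s = 6.
print(procedure_h(10, 12, 17, 18, e=6, s=6))
# Efficacy excess of only 5 fails its threshold, so the null is retained.
print(procedure_h(10, 12, 15, 18, e=6, s=6))
```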
Probability Requirements:
The design constants
n,
e, and
s for
Procedure H are chosen so that the procedure meets the following probabilistic criteria:
The left-hand side of (
9) represents the probability of incorrectly rejecting
and is therefore controlled at the type I error level
. Likewise, the left-hand side of (
10) corresponds to the probability of correctly declaring the new treatment superior, which equals the power
under
.
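The rejection probabilities on the left-hand sides of (9) and (10) can be evaluated exactly for any given parameter configuration. The sketch below is one way to do so and is not necessarily the authors' implementation: it builds the joint distribution of the efficacy and safety counts in each arm by adding one patient at a time, then sums over the rejection region. It assumes a common odds ratio `phi`, the non-strict form of the rejection rule, and the standard closed form for the both-success cell probability; all function names are hypothetical.

```python
from math import sqrt

def cell_probs(p_e, p_s, phi):
    """(p11, p10, p01, p00) from marginals and odds ratio (assumed closed form)."""
    if abs(phi - 1.0) < 1e-12:
        p11 = p_e * p_s
    else:
        a = 1.0 + (p_e + p_s) * (phi - 1.0)
        b = -4.0 * phi * (phi - 1.0) * p_e * p_s
        p11 = (a - sqrt(a * a + b)) / (2.0 * (phi - 1.0))
    return p11, p_e - p11, p_s - p11, 1.0 - p_e - p_s + p11

def joint_count_pmf(n, probs):
    """pmf[a][b] = P(efficacy count = a, safety count = b) among n patients."""
    p11, p10, p01, p00 = probs
    pmf = [[1.0]]
    for m in range(n):                       # add patient m + 1
        new = [[0.0] * (m + 2) for _ in range(m + 2)]
        for a in range(m + 1):
            for b in range(m + 1):
                w = pmf[a][b]
                new[a][b] += w * p00
                new[a + 1][b] += w * p10
                new[a][b + 1] += w * p01
                new[a + 1][b + 1] += w * p11
        pmf = new
    return pmf

def rejection_prob(n, e, s, ctrl, new_trt, phi):
    """Exact P(reject) for marginals ctrl = (p_e, p_s) and new_trt = (p_e, p_s)."""
    f0 = joint_count_pmf(n, cell_probs(ctrl[0], ctrl[1], phi))
    f1 = joint_count_pmf(n, cell_probs(new_trt[0], new_trt[1], phi))
    # tail[a][b] = P(new efficacy count >= a, new safety count >= b)
    tail = [[0.0] * (n + 2) for _ in range(n + 2)]
    for a in range(n, -1, -1):
        for b in range(n, -1, -1):
            tail[a][b] = (f1[a][b] + tail[a + 1][b] + tail[a][b + 1]
                          - tail[a + 1][b + 1])
    total = 0.0
    for a0 in range(n + 1 - e):              # control efficacy count
        for b0 in range(n + 1 - s):          # control safety count
            total += f0[a0][b0] * tail[a0 + e][b0 + s]
    return total

# Illustrative configuration only (not from the paper's tables).
print(rejection_prob(20, 4, 4, (0.3, 0.3), (0.7, 0.7), phi=2.0))
```

Because the joint distribution is built by a recursion over patients, the cost grows roughly as n cubed per arm, which remains cheap for the early-phase sample sizes discussed here.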
To guarantee a meaningful distinction between the null and alternative spaces, we require a minimum effect separation that specifies how far the true parameters under
must lie from those under
. In particular, we assume the alternative satisfies
where
and
denote the smallest clinically relevant differences in efficacy and safety. These effect-size constraints ensure that the testing procedure can reliably discriminate between
and
when the new treatment exhibits meaningful improvement.
We now derive the values of the procedure parameters such that the probability constraints (
9) and (
10) are satisfied.
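Theorem 1 (Monotonicity of the rejection probability). Let ϕ denote the odds ratio. The rejection probability satisfies the following monotonicity properties with respect to the marginal efficacy and safety probabilities.
1. Non-increasing in the control efficacy probability, for fixed values of the other marginal probabilities and ϕ.
2. Non-increasing in the control safety probability, for fixed values of the other marginal probabilities and ϕ.
3. Non-decreasing in the experimental efficacy probability, for fixed values of the other marginal probabilities and ϕ.
4. Non-decreasing in the experimental safety probability, for fixed values of the other marginal probabilities and ϕ.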
Theorem 1 immediately implies that, when
and
are known, the power is minimized at the boundary of the effect-size constraints, namely when
Theorem 2 (Least favorable configuration for power control under ϕ = 1). When the odds ratio satisfies ϕ = 1 and the critical values satisfy and , the minimal power under the alternative is attained at the boundary of the effect-size constraints, specifically when
The proof is provided in Appendix A.
Theorem 3. If the null-hypothesis parameters and are fixed and known, then the maximal type I error is attained at one of the following configurations: or
This result follows directly from the monotonicity properties in Theorem 1. To attain the maximal type I error, the rejection probability should be made as large as possible while the null hypothesis remains true. Since the rejection probability is non-decreasing in the experimental treatment's efficacy and safety probabilities, the maximum occurs on the boundary of the null hypothesis: either the efficacy probability sits at its null boundary with the safety probability as large as possible, or the safety probability sits at its null boundary with the efficacy probability as large as possible.
Consequently, Theorem 3 implies that the constraint in (
9) can be equivalently expressed as
Theorem 4 (Least favorable configuration for type I error control under ϕ = 1). When the odds ratio satisfies ϕ = 1 and the critical values satisfy , the maximal type I error under the null hypothesis is attained at one of two configurations, or, symmetrically,
The proof is provided in Appendix A.
The above results characterize the least favorable configurations for both power and type I error control in Procedure H. Theorem 1 shows that the rejection probability is monotone in the marginal efficacy and safety probabilities. This monotonicity reduces the search for worst-case configurations to boundary points of the parameter space. Theorem 2 identifies the least favorable alternative for power calculation when ϕ = 1. Theorems 3 and 4 characterize the worst-case null configurations for type I error control.
A key implication is that, when ϕ = 1, the efficacy and safety endpoints are independent. In this case, a universal least favorable configuration (LFC) exists, so the design parameters can be determined without specifying the nuisance parameters, namely the baseline control efficacy and safety rates. By contrast, when ϕ ≠ 1, the two endpoints are dependent and a universal LFC is generally unavailable. Determining the design constants (n, e, s) then requires the baseline control rates to be specified.
If the baseline control rates are known in advance, the resulting design is less conservative and may require a substantially smaller sample size. This reduction is illustrated in the numerical results reported in the subsequent tables.
4. Tables and Discussion
In this section, we report the design parameters required to implement the proposed fixed-sample procedure. Throughout, we assume a common association structure between the two binary endpoints across the control and the experimental treatment; that is, both treatments share the same odds ratio ϕ. Under this assumption, the bivariate binary outcomes are characterized by the marginal success probabilities and the odds ratio, which jointly determine the cell probabilities used in simulation.
We consider the following configurations:
,
, and
. The target operating characteristics are specified by
The type I error level of 0.15 is used here for illustrative purposes in an exploratory early-phase setting, where a less stringent error level may be considered acceptable for screening promising treatments. In practice, the choice of type I error rate and power should be determined according to the clinical objective, regulatory context, and input from clinical investigators.
For each configuration, we determine the minimum required sample size per treatment,
n, together with the corresponding critical values
such that the probability constraints in (
9) and (
10) are satisfied. The reported operating characteristics in the tables are estimated by Monte Carlo simulation using 100,000 replications for each configuration. For example, when
,
, and the target operating characteristics are
, we evaluate candidate sample sizes
n sequentially. For each
n, all feasible integer threshold pairs
are checked by Monte Carlo estimation of the corresponding rejection probabilities. The smallest sample size for which at least one pair
satisfies both requirements is selected. In this setting, based on 100,000 Monte Carlo replications,
with
gives an estimated power of
and an estimated type I error of
, so it satisfies the desired constraints.
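The Monte Carlo estimation of a rejection probability described above might look like the following sketch. It uses only the Python standard library, draws each patient's joint outcome from the four-category distribution, and applies the non-strict rejection rule. The function names, the seed, the default replication count, and the example configuration are hypothetical choices; the paper itself reports 100,000 replications per configuration.

```python
import random
from math import sqrt

def cell_probs(p_e, p_s, phi):
    """(p11, p10, p01, p00) from marginals and odds ratio (assumed closed form)."""
    if abs(phi - 1.0) < 1e-12:
        p11 = p_e * p_s
    else:
        a = 1.0 + (p_e + p_s) * (phi - 1.0)
        b = -4.0 * phi * (phi - 1.0) * p_e * p_s
        p11 = (a - sqrt(a * a + b)) / (2.0 * (phi - 1.0))
    return p11, p_e - p11, p_s - p11, 1.0 - p_e - p_s + p11

def simulate_arm(rng, n, weights):
    """Draw n patients; return (efficacy successes, safety successes)."""
    draws = rng.choices(range(4), weights=weights, k=n)
    x11, x10, x01 = draws.count(0), draws.count(1), draws.count(2)
    return x11 + x10, x11 + x01

def mc_rejection_prob(n, e, s, ctrl, new_trt, phi, reps=10_000, seed=0):
    """Monte Carlo estimate of P(reject) for marginals ctrl and new_trt."""
    rng = random.Random(seed)
    # Clamp tiny negative values that can arise from floating-point rounding.
    w0 = [max(0.0, p) for p in cell_probs(ctrl[0], ctrl[1], phi)]
    w1 = [max(0.0, p) for p in cell_probs(new_trt[0], new_trt[1], phi)]
    hits = 0
    for _ in range(reps):
        e0, s0 = simulate_arm(rng, n, w0)
        e1, s1 = simulate_arm(rng, n, w1)
        hits += (e1 - e0 >= e) and (s1 - s0 >= s)
    return hits / reps

# Illustrative configuration only (not from the paper's tables).
print(mc_rejection_prob(20, 4, 4, (0.3, 0.3), (0.7, 0.7), phi=2.0, reps=2000))
```

With R replications the standard error of such an estimate is at most 0.5 divided by the square root of R, about 0.0016 for R = 100,000, which is worth keeping in mind when an estimate sits close to a constraint boundary.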
The simulation procedure used to determine the design constants (n, e, s) can be summarized as follows:
1. Specify the target type I error level, the target power, the efficacy and safety effect sizes, the odds ratio ϕ, and, when needed, the baseline control rates.
2. For a candidate sample size n, enumerate all feasible positive integer threshold pairs (e, s).
3. For each pair (e, s), estimate the type I error and power using 100,000 Monte Carlo replications under the corresponding LFC or GLFC.
4. Retain the threshold pairs that satisfy both the type I error constraint and the power constraint.
5. Increase n sequentially until at least one feasible pair is found.
6. Select the smallest such n. If multiple threshold pairs satisfy the constraints for this minimum n, choose the pair with the smallest thresholds (e, s).
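The search steps above can be sketched as follows. To keep the example short and deterministic, it swaps the Monte Carlo estimates for exact binomial calculations, which is valid only in the independence case ϕ = 1, where the rejection probability factorizes into an efficacy part and a safety part. The null and alternative configurations are supplied by the caller; the ones used in the example are made-up illustrations, not the least favorable configurations derived in the paper. Ranking threshold pairs by the sum e + s is one assumed reading of "smallest thresholds". All names are hypothetical.

```python
from math import comb

def binom_pmf(n, p):
    return [comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]

def diff_tail(n, p_ctrl, p_new):
    """tail[c] = P(X_new - X_ctrl >= c), c = 0..n, independent Binomial(n, .)."""
    f0, f1 = binom_pmf(n, p_ctrl), binom_pmf(n, p_new)
    upper = [0.0] * (n + 2)                  # upper[k] = P(X_new >= k)
    for k in range(n, -1, -1):
        upper[k] = upper[k + 1] + f1[k]
    return [sum(f0[a] * upper[a + c] for a in range(n - c + 1))
            for c in range(n + 1)]

def search_design(alpha, power, null_cfgs, alt_cfg, n_max=200):
    """Smallest n with thresholds (e, s) meeting both constraints, else None.

    Each configuration is (p_e_ctrl, p_s_ctrl, p_e_new, p_s_new); phi = 1.
    """
    for n in range(1, n_max + 1):
        def tails(cfg):
            pe0, ps0, pe1, ps1 = cfg
            return diff_tail(n, pe0, pe1), diff_tail(n, ps0, ps1)
        null_tails = [tails(c) for c in null_cfgs]
        alt_e, alt_s = tails(alt_cfg)
        feasible = []
        for e in range(1, n + 1):
            for s in range(1, n + 1):
                type1 = max(te[e] * ts[s] for te, ts in null_tails)
                pw = alt_e[e] * alt_s[s]
                if type1 <= alpha and pw >= power:
                    feasible.append((e, s, type1, pw))
        if feasible:
            # Assumed tie-break: smallest e + s, then smallest e.
            e, s, type1, pw = min(feasible, key=lambda r: (r[0] + r[1], r[0]))
            return n, e, s, type1, pw
    return None

# Made-up configurations for illustration only.
nulls = [(0.5, 0.5, 0.5, 1.0), (0.5, 0.5, 1.0, 0.5)]
alternative = (0.3, 0.3, 0.7, 0.7)
print(search_design(0.15, 0.80, nulls, alternative))
```

For ϕ ≠ 1 the factorization no longer holds, and the two products inside the loop would be replaced by an exact multinomial calculation or a Monte Carlo estimate of the joint rejection probability.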
The numerical study is divided into two cases.
Case 1: ϕ = 1 with unknown baseline control rates. When ϕ = 1, the efficacy and safety endpoints are independent. In this case, we assume that the baseline control rates
and
are unknown. By Theorem 2, the minimum power under the alternative is attained at a specific least favorable configuration when
. Similarly, by Theorem 4, the maximal type I error under the null is attained at a least favorable configuration. These least favorable configurations together make it possible to calibrate the design parameters
without specifying
and
. The resulting designs are reported in
Table 2 and
Table 3, corresponding to the two target operating characteristics
and
, respectively.
Case 2: known baseline control rates. We next consider the case where the baseline control rates are known, with
yielding all possible combinations of
. For this case, we fix
and consider
. Using the generalized least favorable configuration (GLFC) characterized by Theorems 1 and 3, we determine the corresponding minimum sample size
n and critical values
for each parameter setting. The results are presented in
Table 4 and
Table 5, corresponding to
and
, respectively.
The design parameters are obtained via simulation. Specifically, for each candidate value of n, we search over feasible integer thresholds and select the smallest n for which at least one pair meets the constraints. When multiple pairs satisfy the requirements for the same minimal n, we adopt the pair that yields the largest empirical power while maintaining the type I error constraint. The resulting parameters provide a direct lookup table for implementing the procedure under the considered settings.
When the baseline control rates
and
are unknown,
Table 2 and
Table 3 summarize, under
, the minimum per-treatment sample size
n and the corresponding critical values
that achieve the desired operating characteristics. As expected, larger effect thresholds (i.e., larger
and/or
) generally lead to smaller required sample sizes since stronger separation between the null and alternative hypotheses makes it easier to satisfy the power constraint while controlling type I error. These tabulated values enable straightforward implementation of the procedure without re-running the design search for each new study.
In addition,
Table 4 and
Table 5 report the case where the baseline control rates
and
are known in advance and
. The results show that, as the odds ratio
increases, the required sample size
n is non-increasing across all the reported configurations, and the overall variation in
n is relatively small. This indicates that the proposed design is not highly sensitive to the dependence parameter
. The sensitivity results in
Table 4 and
Table 5 further suggest that the proposed design is relatively robust to the choice of
over a clinically relevant range. In particular, the required sample size changes only slightly for moderate values of
, indicating that moderate misspecification of the working odds ratio has limited impact on the resulting design. The case
represents an extreme boundary scenario and is included mainly for sensitivity assessment rather than as a typical clinical setting. Moreover, for the setting
, the sample sizes in
Table 2 and
Table 3 are generally no smaller than those in
Table 4 and
Table 5, reflecting the fact that the designs obtained under unknown baseline control rates are more conservative. In particular, in
Table 2, the required sample size is
, which exceeds all the corresponding values reported in
Table 4 for
. In
Table 3, the required sample size is
, which is larger than nearly all the corresponding values in
Table 5; the only exception is one configuration with
when
and
. Overall, these findings suggest that the values reported in
Table 2 and
Table 3 provide a reasonably conservative design benchmark. Therefore, when
and
are unknown and
, or even when
itself is unknown, the design calibrated under the
least favorable configuration can still serve as a practical and reliable approximation for trial planning.
To further examine the robustness of the proposed design to possible misspecification of the odds ratio
, we conducted an additional sensitivity calculation using the first configuration in
Table 3, where
. For this setting, the design obtained under
is
. We then fixed this design and estimated the corresponding power and type I error rate under several different values of the true odds ratio. When
, the estimated power is
and the estimated type I error rate is
; when
, the estimated power is
and the estimated type I error rate is
; and, when
, the estimated power is
and the estimated type I error rate is
. The cases
and
lead to the same design parameters in the table and therefore do not require separate design calibration. These results suggest that, when the misspecification of
is moderate, its impact on the rejection probabilities is relatively small. Even in the boundary case
, which represents an extreme scenario on the odds-ratio scale, the type I error rate remains below the nominal level
, while the reduction in power is moderate. Moreover, when
and
are unknown and the design from
Table 2 with
is used, namely
, the estimated power and type I error rate under
and
are
and
, respectively. Thus, when there is substantial uncertainty about the value of
, the designs reported in
Table 2 and
Table 3 can serve as conservative and practical choices for trial planning.
The design tables also reveal several structural patterns among the design parameters. First, as expected, a higher target power generally requires a larger sample size because stronger probability guarantees require more information from each treatment arm. Second, the clinically meaningful effect sizes for efficacy and safety have a substantial impact on the required sample size. When either the efficacy threshold or the safety threshold becomes larger, the required sample size tends to decrease since larger treatment differences are easier to detect. Conversely, a smaller effect size on either endpoint leads to a more demanding design and therefore requires a larger sample size. Third, for fixed baseline rates and effect-size thresholds, the required sample size tends to be non-increasing as the odds ratio increases. This pattern indicates that a stronger positive association between efficacy and safety can reduce the amount of information needed to jointly demonstrate improvement on both endpoints. Overall, the numerical results are consistent with intuition: sample size is most sensitive to the target power, the type I error requirement, and the clinically meaningful effect sizes for efficacy and safety, while the dependence parameter ϕ appears to have little impact over the practical range considered in the tables.
5. Example
This example considers an experimental trial involving an immunotherapy-based treatment for elderly patients (≥75 years old) diagnosed with advanced non-small-cell lung cancer (NSCLC). The trial compares the immunotherapy strategy, PD1-A (anti-PD-1 monotherapy), against the standard chemotherapy regimen, consisting of carboplatin and pemetrexed, which serves as the control treatment.
The goal of the trial is to determine whether PD1-A demonstrates superiority over the control treatment in terms of both efficacy and safety. Specifically, the study targets an improvement of at least in the response rate and a reduction of at least in high-grade toxicity, corresponding to and . If PD1-A fails to show meaningful improvement over the control, the standard chemotherapy will be selected.
Under this setting, the least favorable configuration for rejecting the null hypothesis occurs when
By Theorem 4, the least favorable configuration (LFC) for the type I error is attained when
or, symmetrically,
When the probability constraints are set at power
and type I error
,
Table 2 indicates that the fixed-sample procedure requires
observations per treatment arm, with associated critical values
and
. Therefore, the total number of observations required to meet the design criteria is
.
After the trial is completed, the procedure is applied by comparing the observed numbers of efficacy and safety successes between the PD1-A arm and the control arm. For each endpoint, the observed number of successes in the control arm is subtracted from the observed number of successes in the PD1-A arm. With the selected critical values, the null hypothesis is rejected only if both the efficacy difference and the safety difference satisfy the rejection criterion defined by their respective critical values. Otherwise, the trial does not provide sufficient evidence that PD1-A is superior to the control treatment in both efficacy and safety. For instance, if either observed difference is less than 6, the null hypothesis is retained.
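The decision step described above can be sketched as a short function. This is an illustrative sketch only: the function name, the "at least" form of the comparison, and every count below are hypothetical, and the critical value of 6 on both endpoints is assumed purely for illustration.

```python
def declare_superiority(x_eff, x_saf, y_eff, y_saf, c_eff, c_saf):
    """Return True when the null hypothesis is rejected.

    x_* are success counts in the experimental arm and y_* in the control arm.
    Assumed rule: each experimental-minus-control difference must be at least
    its critical value; failing on either endpoint retains the null hypothesis.
    """
    return (x_eff - y_eff >= c_eff) and (x_saf - y_saf >= c_saf)


# Hypothetical counts: both differences equal 8, so the null hypothesis is rejected.
print(declare_superiority(30, 41, 22, 33, c_eff=6, c_saf=6))  # True

# Hypothetical counts: the safety difference is 5, so the null hypothesis is retained.
print(declare_superiority(30, 38, 22, 33, c_eff=6, c_saf=6))  # False
```

Because the rule is an intersection of two conditions, a large gain on one endpoint cannot compensate for a shortfall on the other.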
6. Conclusions
This paper develops an exact frequentist fixed sample size hypothesis testing procedure for two-arm clinical trials with two binary endpoints, motivated by settings in which a new treatment must demonstrate improvement in both efficacy and safety relative to a concurrent control. Instead of relying on a composite endpoint, the proposed method evaluates the two endpoints jointly and rejects the null hypothesis only when the observed treatment differences exceed pre-specified paired thresholds on both dimensions. By modeling the joint efficacy–safety outcomes through a four-category multinomial distribution, the procedure allows calculation of rejection probabilities and provides a clear framework for benefit–risk evaluation.
The theoretical results characterize the least favorable configurations for both power and type I error control. The monotonicity properties of the rejection probability reduce the search for worst-case configurations to the boundary of the parameter space. When the two endpoints are independent (odds ratio equal to 1), the least favorable configurations can be identified explicitly, allowing the design parameters to be determined without specifying the baseline control rates for efficacy and safety. When the odds ratio differs from 1, a universal least favorable configuration is generally unavailable, and calibration requires the baseline control rates to be specified. The numerical results further suggest that the required sample size is not highly sensitive to the odds ratio across the settings considered.
The proposed framework can be implemented through design tables that map the target power, type I error rate, and clinically meaningful effect sizes to the required sample size and decision thresholds. These tables provide a convenient tool for study planning and reduce the need for repeated design searches in new applications. In particular, when the baseline control rates are unknown, designs calibrated under the least favorable configuration may provide a practical approximation even when the true odds ratio differs from 1.
Several extensions of the proposed framework are possible. First, the odds ratio is treated as a pre-specified working design parameter and is assumed to be common across the control and experimental treatments, which is a limitation of the current framework. Although sensitivity analyses indicate that the design is relatively robust over practical ranges of the odds ratio, future work may consider adaptive or data-driven approaches that allow different association structures between efficacy and safety across treatment arms. Second, although the present study focuses on a fixed-sample design, the relatively large sample sizes required in some settings suggest that sequential or curtailed versions of the proposed procedure may provide meaningful gains in efficiency.
Overall, the proposed procedure offers a rigorous and interpretable approach for two-arm trials with co-primary binary endpoints. By preserving separate assessments of efficacy and safety while controlling type I error and maintaining adequate power, the method provides a useful alternative to composite-endpoint analyses and helps to bridge the gap between early-phase bivariate endpoint designs and comparative two-arm confirmatory testing.