1. Introduction
In randomized controlled trials, bilateral data commonly arise when participants receive treatment or surgery on paired organs or body parts. For example, in ophthalmologic studies, patients are randomly assigned to one of two treatment groups to evaluate whether a new therapy is more effective than an established treatment or a control with no intervention. After the treatment period, outcomes are often summarized into three categories: cure in both organs, cure in only one organ, or no cure. A similar data structure appears in case–control studies investigating the association between smoking and age-related macular degeneration (AMD), where clinical measurements are collected from both the left and right eyes. In such settings, responses can likewise be classified as bilateral, unilateral, or absent, and observations from paired organs are generally positively correlated.
Recent studies have demonstrated that failing to account for within-pair dependence can result in invalid statistical inferences. To address this issue, paired correlated data are commonly analyzed using one of three parametric frameworks: the R model, Dallal’s model, and the $\rho$ model.
(1) “R model”: Rosner [1] introduced a model incorporating a constant R to capture intra-pair correlation, where the conditional probability of a response in one organ given a response in the paired organ is proportional to the group-specific prevalence. Nevertheless, Dallal [2] noted that this model may provide a poor fit when bilateral responses occur with high certainty and prevalence varies substantially across groups.
(2) “Dallal’s model”: To overcome this limitation, Dallal [2] proposed a model that assumes a constant conditional probability of response in one organ given a response in the other, independent of group prevalence.
(3) “$\rho$ model”: Alternatively, Donner [3] suggested a formulation in which all treatment groups share a common intra-class correlation coefficient $\rho$ to characterize within-pair dependence.
Collectively, these models provide practical strategies for handling paired correlated data. Building on these frameworks, a range of asymptotic and exact testing procedures has been developed over the past two decades, with empirical evidence indicating satisfactory performance [4,5,6,7]. Beyond these parametric models, marginal modeling approaches have also been developed for correlated binary outcomes. For example, Zou and Donner [8] extended modified Poisson regression to prospective studies with correlated data to estimate marginal risk ratios using sandwich variance estimators. Within the generalized estimating equation framework, Westgate [9] proposed improved intra-cluster correlation estimation methods that reduce bias relative to moment-based estimators. Moreover, Li and Tong [10] derived sample size formulas for modified Poisson regression in cluster randomized trials. Together, these approaches offer flexible alternatives for analyzing correlated binary data.
Stratified designs are widely employed to adjust for multi-center effects or other potential confounders and have attracted increasing attention in the recent statistical literature [11,12]. In stratified bilateral-sample studies with two treatment groups per stratum, valid analysis of paired observations must simultaneously address within-subject correlation and between-stratum heterogeneity. Motivated by this structure, a variety of statistical methods have been proposed to explicitly incorporate the dependence of bilateral outcomes within stratified designs.
Three primary measures are commonly used to characterize the association between two proportions in a population: the risk difference, the risk ratio, and the odds ratio (OR) [13]. Owing to its intuitive interpretation, the risk difference has been extensively used to compare treatment effects. For example, Tang and Qiu [14] proposed a modified score test for assessing homogeneity of the risk difference and constructed confidence intervals (CIs) for a common risk difference under the R model. Subsequently, Shen et al. [15] developed tests based on maximum likelihood estimates (MLEs) for evaluating homogeneity of risk differences in stratified correlated bilateral data under the $\rho$ model, and further proposed multiple testing procedures and CI methods for a common risk difference within the same framework.
In certain applications, the risk ratio also provides meaningful insight. Pei et al. [16] introduced a homogeneity test for proportion ratios in stratified bilateral data using the R model, while Zhuang et al. [17] derived several CI estimators for proportion ratios under the $\rho$ model within each stratum.
The odds ratio, however, is often the preferred measure of association in prospective, retrospective, and cross-sectional studies, and it serves as a fundamental parameter in the analysis of multiway contingency tables [18]. Despite the extensive literature on risk differences and risk ratios for stratified bilateral designs, asymptotic testing procedures for the OR remain largely unexplored. Unlike the risk difference or risk ratio, the OR is defined through a nonlinear transformation and is constrained to be nonnegative, rendering standard methods developed for independent or unpaired data inappropriate. This added complexity underscores the methodological challenges associated with OR-based inference in correlated bilateral settings.
Motivated by these gaps, this article focuses on developing asymptotic test procedures for assessing homogeneity of the OR in stratified correlated paired binary data under the $\rho$ model. We address these challenges by proposing testing approaches based on maximum likelihood estimation. The finite-sample performance of the proposed methods is evaluated through simulation studies, and their practical utility is demonstrated using data from an otolaryngologic study and a multi-center, two-arm clinical trial.
The remainder of this article is organized as follows. In Section 2, we briefly describe the data structure. Section 3 presents the MLEs of the model parameters along with three testing procedures. Simulation studies evaluating the performance of these tests are reported in Section 4. In Section 5, two real data examples are analyzed to illustrate the proposed methods. Finally, Section 6 provides concluding remarks and discusses potential directions for future research.
2. Data Structure
We describe the data structure and statistical model under a prospective study design. Let $m_{lij}$ denote the number of subjects in the $i$th group ($i = 1, 2$) and $j$th stratum ($j = 1, \ldots, J$) who exhibit $l$ responses ($l = 0, 1, 2$). Let $m_{+ij} = \sum_{l=0}^{2} m_{lij}$ represent the total number of subjects in the $i$th group and $j$th stratum.
Define $Z_{ijhk}$ as an indicator variable for the response of the $k$th ($k = 1, 2$) eye (or other paired organ) of the $h$th subject ($h = 1, \ldots, m_{+ij}$) in the $i$th group and $j$th stratum, where $Z_{ijhk} = 1$ indicates a response and $Z_{ijhk} = 0$ otherwise. We assume a common marginal response probability $\pi_{ij} = \Pr(Z_{ijhk} = 1)$ for both organs within the same group and stratum.
To account for within-subject dependence, we adopt a stratified version of the equal-correlation (or $\rho$) model. Specifically, let $\rho_j$ ($j = 1, \ldots, J$) denote the intra-class correlation coefficient, $\mathrm{Corr}(Z_{ijh1}, Z_{ijh2})$, for subjects in the $j$th stratum. The correlation is assumed to be identical across treatment groups within each stratum but allowed to vary between strata.
Under this formulation, the probabilities of observing no response, a unilateral response, or a bilateral response in the $i$th group and $j$th stratum are given by
$$p_{0ij} = \rho_j (1 - \pi_{ij}) + (1 - \rho_j)(1 - \pi_{ij})^2, \quad p_{1ij} = 2\pi_{ij}(1 - \pi_{ij})(1 - \rho_j), \quad p_{2ij} = \rho_j \pi_{ij} + (1 - \rho_j)\pi_{ij}^2,$$
respectively. The corresponding observed frequencies and probabilities are summarized in Table 1.
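As a quick numerical check of the equal-correlation model described above, the following minimal Python sketch computes the three cell probabilities for one group in one stratum. The function name `cell_probabilities` and the example values $\pi = 0.3$, $\rho = 0.4$ are illustrative assumptions, not part of the original article.

```python
def cell_probabilities(pi, rho):
    """Return (p0, p1, p2): probabilities of 0, 1, or 2 responses for one subject.

    pi  : marginal response probability of a single organ (hypothetical input)
    rho : intra-class correlation between the two organs of a subject
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    p0 = rho * (1.0 - pi) + (1.0 - rho) * (1.0 - pi) ** 2  # no response
    p1 = 2.0 * pi * (1.0 - pi) * (1.0 - rho)               # unilateral response
    p2 = rho * pi + (1.0 - rho) * pi ** 2                  # bilateral response
    if min(p0, p1, p2) < 0.0:
        raise ValueError("rho is outside the range that keeps all probabilities nonnegative")
    return p0, p1, p2


# Illustrative values only (not taken from the article).
p0, p1, p2 = cell_probabilities(pi=0.3, rho=0.4)
marginal = p2 + p1 / 2.0                     # P(a given organ responds); recovers pi
correlation = (p2 - 0.3 ** 2) / (0.3 * 0.7)  # within-pair correlation; recovers rho
```

The last two lines confirm that the parameterization returns the intended marginal probability and correlation, which is a convenient sanity check before fitting.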
3. Proposed Methods
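In this article, we aim to test whether the odds ratios (ORs) comparing the two groups are equal across all strata. Let $\psi_j$ denote the OR between the two groups in the $j$th stratum, where $\psi_j = \dfrac{\pi_{2j}/(1 - \pi_{2j})}{\pi_{1j}/(1 - \pi_{1j})}$ ($j = 1, \ldots, J$), and let $\psi$ denote an arbitrary constant. The null hypothesis of interest is therefore $H_0: \psi_1 = \psi_2 = \cdots = \psi_J = \psi$ versus $H_a$: not all of the $\psi_j$’s are equal.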
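Let $m_j = (m_{01j}, m_{11j}, m_{21j}, m_{02j}, m_{12j}, m_{22j})$ denote the observed data for the $j$th stratum, as shown in Table 1, where each column follows a multinomial distribution. Following Ma and Liu [19], the log-likelihood function of the observed data $m_j$ is given by
$$l_j(\pi_{1j}, \pi_{2j}, \rho_j) = \sum_{i=1}^{2} \sum_{l=0}^{2} m_{lij} \log p_{lij} + \text{constant},$$
where $p_{lij}$ is the probability that a subject in the $i$th group and $j$th stratum exhibits $l$ responses (Table 1). Thus, the overall log-likelihood function of the observed data is
$$l(\pi_1, \pi_2, \rho) = \sum_{j=1}^{J} l_j(\pi_{1j}, \pi_{2j}, \rho_j),$$
where $\pi_1 = (\pi_{11}, \ldots, \pi_{1J})$, $\pi_2 = (\pi_{21}, \ldots, \pi_{2J})$, and $\rho = (\rho_1, \ldots, \rho_J)$.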
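The log-likelihood above can be evaluated directly from the cell counts. The sketch below is a minimal illustration under the equal-correlation model, with the additive constant dropped; the function names and the data layout (one `(m0, m1, m2)` tuple per group) are assumptions made for this example.

```python
import math


def cell_probabilities(pi, rho):
    """Cell probabilities (p0, p1, p2) under the equal-correlation model."""
    p0 = rho * (1.0 - pi) + (1.0 - rho) * (1.0 - pi) ** 2
    p1 = 2.0 * pi * (1.0 - pi) * (1.0 - rho)
    p2 = rho * pi + (1.0 - rho) * pi ** 2
    return p0, p1, p2


def stratum_loglik(counts, pis, rho):
    """Log-likelihood of one stratum, up to an additive constant.

    counts : [(m0, m1, m2) for group 1, (m0, m1, m2) for group 2]
    pis    : [pi_1j, pi_2j], with 0 < pi < 1
    rho    : rho_j, assumed here to satisfy 0 <= rho < 1 so every cell has positive probability
    """
    total = 0.0
    for (m0, m1, m2), pi in zip(counts, pis):
        p0, p1, p2 = cell_probabilities(pi, rho)
        total += m0 * math.log(p0) + m1 * math.log(p1) + m2 * math.log(p2)
    return total


def overall_loglik(data, pis_by_stratum, rhos):
    """Sum of the stratum-specific log-likelihoods over all strata."""
    return sum(
        stratum_loglik(counts, pis, rho)
        for counts, pis, rho in zip(data, pis_by_stratum, rhos)
    )
```

Because strata share no parameters under the unconstrained model, the overall log-likelihood is simply a sum, which is why the maximization can be carried out stratum by stratum.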
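(a) Unconstrained MLEs
We first derive the unconstrained MLEs. Taking partial derivatives of $l$ (or $l_j$) with respect to $\pi_{ij}$ and $\rho_j$ and setting them equal to zero yields the likelihood equations
$$\frac{\partial l}{\partial \pi_{ij}} = 0, \qquad \frac{\partial l}{\partial \rho_j} = 0, \qquad i = 1, 2; \; j = 1, \ldots, J.$$
The solutions to these equations are the MLEs, denoted by $\hat{\pi}_{ij}$ and $\hat{\rho}_j$, respectively.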
There are no closed-form solutions for the above system of equations. Moreover, direct application of a global iterative algorithm is often computationally intensive and may suffer from convergence issues, especially in high-dimensional settings. Instead, the MLEs can be obtained using the procedure proposed by Ma and Liu [19], which employs an alternating iterative scheme that updates $\pi_{ij}$ and $\rho_j$ in turn. In particular, Equation (3) can be simplified into a cubic equation in $\pi_{ij}$, Equation (5), whose coefficients depend on $\rho_j$ and the observed counts. The MLE of $\pi_{ij}$ is therefore a function of $\rho_j$ and can be obtained as a real root of this cubic.
Specifically, the iterative procedure is initialized using estimates obtained from the pooled counts across the two groups. Given an initial value $\rho_j^{(0)}$, $\pi_{ij}$ is first updated by solving the likelihood equation, selecting the real root within $(0, 1)$ that maximizes the stratum-specific log-likelihood. Then, $\rho_j$ is updated using the Fisher scoring algorithm. The $(t+1)$th approximation of $\rho_j$ is
$$\rho_j^{(t+1)} = \rho_j^{(t)} + I^{-1}\big(\rho_j^{(t)}\big) \left. \frac{\partial l_j}{\partial \rho_j} \right|_{\rho_j = \rho_j^{(t)}},$$
where $I(\rho_j) = -E\big(\partial^2 l_j / \partial \rho_j^2\big)$. Subsequently, the $(t+1)$th update of $\pi_{ij}$ is obtained by solving the cubic equation with $\rho_j$ replaced by $\rho_j^{(t+1)}$. These two steps are repeated iteratively until convergence, which is defined as $|\rho_j^{(t+1)} - \rho_j^{(t)}| < \epsilon$, where $\epsilon$ is a pre-specified tolerance level. The expressions for the second-order derivatives are provided in Appendix A.
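The alternating scheme can be sketched in a few lines of Python with NumPy. This is an illustrative sketch, not the authors' implementation: the cubic is assembled numerically by clearing denominators in the score for $\pi_{ij}$, the $\rho_j$ step uses the expected information of $\rho_j$ with $\pi_{ij}$ held fixed (derived here from the multinomial cell probabilities), the starting value is a pooled moment estimate, and $\rho_j$ is clipped to $[0, 0.999]$. All function names are hypothetical.

```python
import numpy as np


def group_loglik(m, pi, rho):
    """Log-likelihood contribution of one group (constant dropped)."""
    m0, m1, m2 = m
    c = 1.0 - rho
    p0 = (1 - pi) * (1 - c * pi)   # same as rho(1-pi) + (1-rho)(1-pi)^2
    p1 = 2 * pi * (1 - pi) * c
    p2 = pi * (rho + c * pi)       # same as rho*pi + (1-rho)pi^2
    return m0 * np.log(p0) + m1 * np.log(p1) + m2 * np.log(p2)


def update_pi(m, rho, eps=1e-8):
    """Solve the cubic likelihood equation in pi for fixed rho (assumes rho < 1)."""
    m0, m1, m2 = m
    c = 1.0 - rho
    a = np.poly1d([1.0, 0.0])    # pi
    b = np.poly1d([-1.0, 1.0])   # 1 - pi
    d = np.poly1d([-c, 1.0])     # 1 - c*pi
    e = np.poly1d([c, rho])      # rho + c*pi
    # Score for pi multiplied by a*b*d*e, which leaves a cubic polynomial in pi.
    cubic = (m1 + m2) * b * d * e - (m0 + m1) * a * d * e - m0 * c * a * b * e + m2 * c * a * b * d
    roots = [r.real for r in cubic.roots if abs(r.imag) < 1e-10 and eps < r.real < 1 - eps]
    if not roots:
        raise ValueError("no admissible root in (0, 1)")
    return max(roots, key=lambda p: group_loglik(m, p, rho))


def update_rho(counts, pis, rho):
    """One Fisher scoring step for rho with the pi values held fixed."""
    score = info = 0.0
    c = 1.0 - rho
    for (m0, m1, m2), pi in zip(counts, pis):
        q = pi * (1 - pi)          # d p0/d rho = q, d p1/d rho = -2q, d p2/d rho = q
        p0, p1, p2 = (1 - pi) * (1 - c * pi), 2 * q * c, pi * (rho + c * pi)
        score += q * (m0 / p0 - 2 * m1 / p1 + m2 / p2)
        info += (m0 + m1 + m2) * q * q * (1 / p0 + 4 / p1 + 1 / p2)
    return float(np.clip(rho + score / info, 0.0, 0.999))  # clipping is an assumption


def fit_stratum(counts, tol=1e-8, max_iter=500):
    """Alternate the pi and rho updates for one stratum until rho stabilizes."""
    pooled = np.sum(counts, axis=0)
    m = pooled.sum()
    pi0 = (pooled[1] + 2 * pooled[2]) / (2 * m)   # pooled moment estimates as start values
    rho = float(np.clip((pooled[2] / m - pi0 ** 2) / (pi0 * (1 - pi0)), 0.0, 0.999))
    for _ in range(max_iter):
        pis = [update_pi(mi, rho) for mi in counts]
        new_rho = update_rho(counts, pis, rho)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return [update_pi(mi, rho) for mi in counts], rho
```

For example, `fit_stratum([(574, 252, 174), (350, 300, 350)])` uses hypothetical counts that match the model exactly at $\pi_1 = 0.3$, $\pi_2 = 0.5$, $\rho = 0.4$, so the fitted values should return those parameters. Sparse tables with empty cells would need additional safeguards that this sketch omits.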
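(b) Constrained MLEs
We subsequently derive the constrained MLEs. Under the null hypothesis of a common OR $\psi$, $\pi_{2j}$ can be written as $\pi_{2j} = \psi \pi_{1j} / (1 - \pi_{1j} + \psi \pi_{1j})$, so that the parameter space is reduced to $\pi_{1j}$, $\rho_j$ ($j = 1, \ldots, J$), and the common parameter $\psi$. By setting the partial derivatives of $l$ (or $l_j$) with respect to these parameters equal to zero, the constrained MLEs are obtained and denoted by $\tilde{\pi}_{1j}$, $\tilde{\rho}_j$, and $\tilde{\psi}$.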
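The reparameterization under the null hypothesis can be illustrated with a short Python sketch. It assumes the OR is defined as the odds in group 2 divided by the odds in group 1; the function names are hypothetical.

```python
def odds_ratio(pi1, pi2):
    """Odds of response in group 2 relative to group 1 (assumed direction)."""
    return (pi2 / (1.0 - pi2)) / (pi1 / (1.0 - pi1))


def pi2_from_odds_ratio(pi1, psi):
    """Response probability in group 2 implied by pi1 and a common odds ratio psi."""
    if not 0.0 < pi1 < 1.0 or psi <= 0.0:
        raise ValueError("require 0 < pi1 < 1 and psi > 0")
    return psi * pi1 / (1.0 - pi1 + psi * pi1)
```

With this mapping, each stratum contributes only $\pi_{1j}$ and $\rho_j$ as free parameters, and $\psi$ is shared across strata, which is what reduces the parameter count from $3J$ to $2J + 1$.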
There are no closed-form solutions to the resulting likelihood equations. We adopt the iterative approach of Shen et al. [15] to obtain the constrained MLEs under the null hypothesis. The common odds ratio $\psi$ is initialized as 1, and $\pi_{1j}$ and $\rho_j$ are initialized using pooled counts across the two groups. Given the current values of $\pi_{1j}$ and $\rho_j$, $\psi$ is updated via a Newton–Raphson algorithm, and then $\pi_{1j}$ and $\rho_j$ are updated jointly for each stratum using a Fisher scoring algorithm with the given $\psi$. The feasible real root within $(0, 1)$ that maximizes the likelihood is selected. This process is repeated until the changes in $\psi$, $\pi_{1j}$, and $\rho_j$ between successive iterations are smaller than a pre-specified tolerance $\epsilon$.
With all MLEs obtained, we consider the following test procedures.
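3.1. Likelihood Ratio Test ($T_{LR}$)
The likelihood ratio test (LRT) statistic is given by
$$T_{LR} = 2\left[ l(\hat{\pi}_1, \hat{\pi}_2, \hat{\rho}) - l(\tilde{\pi}_1, \tilde{\pi}_2, \tilde{\rho}) \right],$$
where $\tilde{\pi}_{2j} = \tilde{\psi}\tilde{\pi}_{1j} / (1 - \tilde{\pi}_{1j} + \tilde{\psi}\tilde{\pi}_{1j})$, and the hat and tilde denote the unconstrained and constrained MLEs, respectively. Following Wilks [20], under the null hypothesis, $T_{LR}$ asymptotically follows a chi-square distribution with $J - 1$ degrees of freedom.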
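3.2. Wald-Type Log-Linear Test ($T_W$)
The estimated OR is a ratio of two estimated odds that is bounded below by zero and unbounded above, so its sampling distribution is highly skewed when sample sizes are small to moderate. A log-transformation, which is widely used in biomedical research to handle skewed data, is therefore applied. The null hypothesis can be written as $H_0: \log\psi_1 = \log\psi_2 = \cdots = \log\psi_J$, where $\log\psi_j = \log\dfrac{\pi_{2j}}{1 - \pi_{2j}} - \log\dfrac{\pi_{1j}}{1 - \pi_{1j}}$ and $j = 1, \ldots, J$. Furthermore, the null hypothesis can be expressed as $H_0: \lambda_{21} - \lambda_{11} = \cdots = \lambda_{2J} - \lambda_{1J}$, where $\lambda_{ij} = \log[\pi_{ij}/(1 - \pi_{ij})]$ is called the logit for the $i$th group from the $j$th stratum.
Let $\beta = (\beta_1^T, \ldots, \beta_J^T)^T$ with $\beta_j = (\pi_{1j}, \pi_{2j}, \rho_j)^T$, where the corresponding unconstrained MLE is $\hat{\beta}$. Let $\lambda = (\lambda_1^T, \ldots, \lambda_J^T)^T$ with $\lambda_j = (\lambda_{1j}, \lambda_{2j}, \rho_j)^T$; then the null hypothesis can be rewritten as $H\lambda = 0$, where $H$ is the $(J - 1) \times 3J$ matrix
$$H = \begin{pmatrix} h & -h & 0 & \cdots & 0 \\ 0 & h & -h & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h & -h \end{pmatrix}, \qquad h = (-1, 1, 0),$$
and each $0$ denotes a $1 \times 3$ zero block. According to the asymptotic normality of the MLE under certain regularity conditions, the asymptotic distribution of $\hat{\beta}$ is given by $\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} N(0, I^{-1}(\beta))$, where $I(\beta)$ is the information matrix for $\beta$ with a block diagonal structure $I(\beta) = \mathrm{diag}(I_1, \ldots, I_J)$, $I_j$ is the information matrix for $\beta_j$ from each stratum block, and $N$ is the sample size. Using the delta method, we can further obtain the asymptotic distribution of $\hat{\lambda}$, which is given by $\sqrt{N}(\hat{\lambda} - \lambda) \xrightarrow{d} N(0, I_\lambda^{-1})$, where $I_\lambda$ is the information matrix for $\lambda$. $I_\lambda$ has a block diagonal structure $I_\lambda = \mathrm{diag}(I_{\lambda_1}, \ldots, I_{\lambda_J})$, where $I_{\lambda_j} = D_j I_j D_j$, and
$$D_j = \mathrm{diag}\big(\pi_{1j}(1 - \pi_{1j}), \; \pi_{2j}(1 - \pi_{2j}), \; 1\big).$$
Therefore, the Wald-type log-linear test statistic has the form
$$T_W = N (H\hat{\lambda})^T \left[ H I_\lambda^{-1}(\hat{\lambda}) H^T \right]^{-1} (H\hat{\lambda}).$$
Under the null hypothesis, $T_W$ is asymptotically distributed as a chi-square distribution with $J - 1$ degrees of freedom.
In particular, to test $H_0: \psi_j = \psi_k$ for a given pair of strata $j \neq k$, the Wald-type log-linear test statistic is given by
$$T_W^{(jk)} = N (H_{jk}\hat{\lambda})^T \left[ H_{jk} I_\lambda^{-1}(\hat{\lambda}) H_{jk}^T \right]^{-1} (H_{jk}\hat{\lambda}),$$
where $H_{jk}$ is a $1 \times 3J$ row vector with $(-1, 1, 0)$ in the $(3j - 2)$th to $3j$th elements, $(1, -1, 0)$ in the $(3k - 2)$th to $3k$th elements, and 0 otherwise. Under the null hypothesis, $T_W^{(jk)}$ is asymptotically distributed as a chi-square distribution with 1 degree of freedom.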
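The quadratic form behind the Wald-type test can be sketched with NumPy as follows. This is an illustration under simplifying assumptions, not the article's implementation: the function takes the estimated logits ordered as $(\hat{\lambda}_{11}, \hat{\lambda}_{21}, \hat{\lambda}_{12}, \hat{\lambda}_{22}, \ldots)$ together with a user-supplied estimate of their covariance matrix, and it leaves out the $\rho_j$ components because they receive zero weight in the contrast. The name `wald_homogeneity_statistic` is hypothetical.

```python
import numpy as np


def wald_homogeneity_statistic(logits, cov):
    """Wald statistic for equal log odds ratios across J strata.

    logits : length-2J sequence (lambda_1j, lambda_2j) stacked stratum by stratum
    cov    : (2J x 2J) estimated covariance matrix of those logits (supplied by the user)
    Returns the statistic and its degrees of freedom, J - 1.
    """
    logits = np.asarray(logits, dtype=float)
    cov = np.asarray(cov, dtype=float)
    J = logits.size // 2
    # D maps the 2J logits to the J log odds ratios: log psi_j = lambda_2j - lambda_1j.
    D = np.zeros((J, 2 * J))
    for j in range(J):
        D[j, 2 * j] = -1.0
        D[j, 2 * j + 1] = 1.0
    # C contrasts adjacent strata: log psi_j - log psi_{j+1} = 0 under the null hypothesis.
    C = np.zeros((J - 1, J))
    for j in range(J - 1):
        C[j, j] = 1.0
        C[j, j + 1] = -1.0
    H = C @ D
    diff = H @ logits
    stat = float(diff @ np.linalg.solve(H @ cov @ H.T, diff))
    return stat, J - 1


# Toy input: two strata with log odds ratios 1 and 2 and an identity covariance matrix.
stat, df = wald_homogeneity_statistic([0.0, 1.0, 0.0, 2.0], np.eye(4))
```

Any full-rank set of $J - 1$ contrasts among the log ORs gives the same statistic, so comparing adjacent strata is only one convenient choice.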
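3.3. Score Test ($T_{SC}$)
The score test statistic utilizes the MLEs of the parameters under the null hypothesis. Let $\theta_j = (\psi_j, \pi_{1j}, \rho_j)^T$ and $\theta = (\theta_1^T, \ldots, \theta_J^T)^T$; then the score function $U$ is a row vector with a $1 \times 3$ vector $U_j = (\partial l/\partial \psi_j, \; \partial l/\partial \pi_{1j}, \; \partial l/\partial \rho_j)$ in the $j$th block, and $I(\theta)$ denotes the global information matrix for $\theta$, which has a block diagonal structure with a $3 \times 3$ matrix $I_j$ in each main diagonal block (i.e., $I(\theta) = \mathrm{diag}(I_1, \ldots, I_J)$).
Therefore, the score test statistic is given by
$$T_{SC} = \left. U I^{-1}(\theta) U^T \right|_{\psi_j = \tilde{\psi},\; \pi_{1j} = \tilde{\pi}_{1j},\; \rho_j = \tilde{\rho}_j}.$$
Here, $\psi_j$ is the parameter of interest, while $\pi_{1j}$ and $\rho_j$ are nuisance parameters. The calculation simplifies because the score components corresponding to the nuisance parameters equal 0 at the constrained MLE under $H_0$. That is, the $j$th block of the score function can be rewritten as $(\partial l/\partial \psi_j, 0, 0)$. The test statistic then simplifies to
$$T_{SC} = \sum_{j=1}^{J} \left. \left( \frac{\partial l}{\partial \psi_j} \right)^2 I_j^{(1,1)} \right|_{\psi_j = \tilde{\psi},\; \pi_{1j} = \tilde{\pi}_{1j},\; \rho_j = \tilde{\rho}_j},$$
where $I_j^{(1,1)}$ represents the $(1,1)$th entry of $I_j^{-1}$. Under the null hypothesis, $T_{SC}$ has an asymptotic chi-square distribution with $J - 1$ degrees of freedom.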
For each of the three proposed test procedures, the null hypothesis is rejected at the nominal significance level $\alpha$ if the observed value of the test statistic exceeds $\chi^2_{J-1,\alpha}$, the upper $\alpha$ quantile of the chi-square distribution with $J - 1$ degrees of freedom.
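The rejection rule can be applied without any statistical library, because the chi-square upper-tail probability has a closed form for integer degrees of freedom. The sketch below uses only the Python standard library; the function names `chi2_sf` and `homogeneity_decision` are illustrative assumptions.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k < df/2} y^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: start from erfc(sqrt(y)) (df = 1) and add the recurrence terms.
    total = math.erfc(math.sqrt(y))
    for k in range((df - 1) // 2):
        total += math.exp(-y) * y ** (k + 0.5) / math.gamma(k + 1.5)
    return total


def homogeneity_decision(statistic, n_strata, alpha=0.05):
    """Return (p_value, reject) for a homogeneity statistic referred to chi-square(J - 1)."""
    p_value = chi2_sf(statistic, n_strata - 1)
    return p_value, p_value < alpha
```

Rejecting when the p-value falls below $\alpha$ is equivalent to rejecting when the statistic exceeds the upper $\alpha$ quantile, so the same helper serves all three tests.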
4. Simulation Studies
In this section, we use Monte Carlo simulations to examine the empirical type I error rates and power of the three test statistics introduced in the previous section.
First, we conduct simulation studies to evaluate the type I error rate under various parameter configurations. We focus on balanced data with equal sample sizes in the two groups across $J$ strata. Boxplots in Figure 1 further illustrate the distribution of empirical type I error rates for all tests under balanced designs with sample sizes up to 100 and up to 8 strata. The specific parameter settings are summarized in Table 2. For each configuration, 10,000 samples are generated under the null hypothesis, and the empirical type I error rate is computed as the proportion of samples in which the null hypothesis is rejected. All tests are conducted at the $\alpha = 0.05$ significance level. Following Tang et al. [21], a test is considered liberal if the empirical type I error exceeds a prespecified upper bound, conservative if it falls below a prespecified lower bound, and robust otherwise.
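The Monte Carlo design described above can be sketched as follows, using only the Python standard library. The sketch generates balanced bilateral data under a common odds ratio and estimates the rejection rate of any test supplied as a function returning a p-value. The argument `p_value_fn`, the parameter names, and the assumed OR direction (group 2 relative to group 1) are illustrative; the article's own test statistics are not implemented here.

```python
import random


def simulate_group(m, pi, rho, rng):
    """Draw m subjects under the equal-correlation model; return counts (m0, m1, m2)."""
    p0 = rho * (1.0 - pi) + (1.0 - rho) * (1.0 - pi) ** 2
    p1 = 2.0 * pi * (1.0 - pi) * (1.0 - rho)
    counts = [0, 0, 0]
    for _ in range(m):
        u = rng.random()
        if u < p0:
            counts[0] += 1      # no response
        elif u < p0 + p1:
            counts[1] += 1      # unilateral response
        else:
            counts[2] += 1      # bilateral response
    return tuple(counts)


def empirical_rejection_rate(p_value_fn, n_rep, m, pi1s, rhos, psi, alpha=0.05, seed=1):
    """Proportion of simulated data sets for which p_value_fn(data) < alpha.

    Data are generated with a common odds ratio psi in every stratum, so the
    returned proportion estimates the type I error rate of the supplied test.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        data = []
        for pi1, rho in zip(pi1s, rhos):
            pi2 = psi * pi1 / (1.0 - pi1 + psi * pi1)  # group 2 probability implied by psi
            data.append((simulate_group(m, pi1, rho, rng), simulate_group(m, pi2, rho, rng)))
        if p_value_fn(data) < alpha:
            rejections += 1
    return rejections / n_rep
```

Generating data with stratum-specific odds ratios instead of a common `psi` would turn the same loop into an empirical power calculation.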
Generally, the results (Tables 3–5 and Figure 1) indicate that the Wald-type test based on the log-linear hypothesis maintains satisfactory type I error across all simulation configurations. The score test performs well when the sample size is moderate to large (e.g., 100), but tends to be slightly liberal in the small-sample scenario, suggesting that the asymptotic approximation is less accurate when few observations are available. Interestingly, the performance of the score test improves as the number of strata increases, likely because the increased stratification provides more information and stabilizes the variance estimates.
In contrast, the likelihood ratio test exhibits pronounced liberal behavior in small samples. Small-sample bias in the maximum likelihood estimates and deviations of the likelihood surface from its quadratic approximation can inflate the likelihood ratio test statistic, leading to higher-than-nominal rejection rates. In such settings, exact testing procedures may provide a more reliable alternative, as they avoid reliance on large-sample approximations. However, exact methods may become computationally intensive in stratified correlated settings. Investigation of finite-sample exact inference under the proposed framework remains a topic for future study.
As expected, as the sample size increases, the type I error rates for all three tests converge toward the nominal level ($\alpha = 0.05$). Overall, these findings suggest that the Wald-type log-linear test is robust across a wide range of sample sizes and configurations and that the score test is reliable in moderate to large samples or with more strata. The likelihood ratio test, by contrast, can be unreliable in small samples, and its results should be interpreted with caution.
Then, we evaluate the empirical power of the three proposed test statistics under various parameter settings. Specifically, we use the same sample sizes and parameter configurations as in the type I error simulations. Tables 6–8 present the empirical power for the likelihood ratio test, Wald-type log-linear test, and score test across different scenarios.
Overall, the three tests exhibit similar power under the same parameter settings, with no substantial differences observed among them. As expected, the empirical power increases as the difference in the true OR values among strata becomes larger, reflecting the greater detectability of the alternative hypothesis. The power also improves with increasing sample size, and to a lesser extent with the number of strata. In addition, following the confidence interval construction method described in [22], we conducted a limited set of supplementary simulations under selected parameter configurations to assess the empirical coverage probabilities of the odds ratio. The observed coverage rates ranged approximately from 94% to 96%, close to the nominal 95% level, suggesting reasonable finite-sample interval performance. These findings further suggest that the relatively small power values observed in some configurations are likely attributable to modest effect sizes and limited sample sizes, rather than substantial bias or instability of the estimator.
Considering both type I error control and empirical power, the Wald-type log-linear test and the score test are generally preferable for practical applications. They consistently maintain type I error close to the nominal level while achieving competitive power, making them reliable choices across a wide range of sample sizes and parameter configurations.
5. Real-World Examples
In this section, we analyze two real-world datasets to illustrate the application of the proposed test statistics.
The first example is a double-blind randomized clinical trial investigating the efficacy of cefaclor versus amoxicillin in treating acute otitis media with effusion (OME) in children [23]. A total of 31 children with bilateral tympanocentesis were randomly assigned to one of the two treatment groups, and each child received a 14-day course of the assigned antibiotic. Treatment outcomes were assessed by recording the number of cured ears at the end of the study. Stratification was based on age, and the resulting data structure is summarized in Table 9. A goodness-of-fit test supports the adequacy of the common $\rho$ model for these data [24], justifying the use of the proposed testing procedures.
Maximum likelihood estimates of the model parameters are reported in Table 10, while the values of the test statistics and corresponding p-values are presented in Table 11. All p-values exceed the nominal significance level $\alpha = 0.05$, indicating insufficient evidence to reject the null hypothesis of homogeneous odds ratios across strata for any of the proposed tests.
The second example comes from a multi-center, two-arm randomized trial involving 168 patients with diffuse scleroderma [25]. Participants were randomized to receive either native collagen or placebo, and treatment response was evaluated using the modified Rodnan skin score (MRSS). The MRSS assigns ordinal scores (0–3) to body parts to reflect disease severity. Following Tang et al. [26], the MRSS was dichotomized at the body-part level, with improvement defined as either a score of zero at follow-up or a decrease of at least two units from baseline. To account for disease duration, patients were stratified into early-phase (≤3 years) and late-phase (4–10 years) groups.
The corresponding data structure, parameter estimates, and test results are summarized in Tables 12–14. Similar to the first example, all p-values are greater than 0.05, providing no evidence against the null hypothesis of homogeneous odds ratios across strata.
These two examples demonstrate the practical applicability of the proposed methods for analyzing stratified bilateral or multi-center paired binary data. In both cases, the tests can be effectively implemented to compute MLEs, evaluate the test statistics, and obtain valid p-values, providing a useful framework for inference on the homogeneity of odds ratios in real-world clinical studies.
6. Conclusions
In this article, we consider the problem of testing homogeneity of odds ratios of two proportions under the “$\rho$ model” assumption in stratified bilateral designs. Three MLE-based test procedures (the likelihood ratio test, the Wald-type log-linear test, and the score test) are investigated. Classical algorithms, such as Fisher scoring and Newton–Raphson methods, can be computationally demanding, particularly when there are many parameters. We simplify the algorithm and computational process to improve efficiency, making the methods more convenient for practical use.
Simulation studies indicate that the Wald-type log-linear test and the score test generally maintain acceptable type I error and exhibit reasonable power under the parameter configurations considered in this study. The likelihood ratio test shows adequate power but can yield inflated type I error for small sample sizes. In smaller samples, the Wald-type log-linear test appears more stable, whereas the score test tends to perform better in moderate to large samples. These results suggest that both tests are suitable for practical application, with the choice guided by sample size and study design.
We note that the $\rho$ model assumes equal intra-class correlation between groups within each stratum. While this assumption simplifies estimation and inference, it may be violated in practice, potentially affecting variance estimates and test statistics. Therefore, caution is warranted when interpreting results, particularly for small samples or in the presence of highly heterogeneous correlations. Moreover, the current study focuses on comparisons between two groups within each stratum. When more than two groups are involved, the problem naturally extends to a many-to-one or pairwise comparison setting, which requires additional methodological development. Extending the proposed methods to handle multiple groups via pairwise comparisons thus represents an important topic for future research. In addition, developing exact tests for small sample sizes remains a promising direction for future work, complementing the asymptotic methods studied here and addressing scenarios in which the current approaches may have limited accuracy.