1. Introduction
In clinical trials, the outcomes of paired organs, such as eyes, ears, and kidneys, are often recorded as binary data. The data are considered bilateral when both sites of the same individual are included, in contrast to unilateral data, where only one site of a paired organ is recorded per individual. Unilateral outcomes are typically assumed to be independent, with each individual’s response being independent of the others. Bilateral outcomes, however, exhibit intra-subject correlation due to the paired nature of the data. Both types of data are often recorded together in randomized clinical trials. For instance, in ophthalmologic studies, the focus is often on the statistical analysis of eyes rather than individuals. For $n$ enrolled patients, the medical records may include responses for between $n$ and $2n$ eyes, as some patients provide data for both eyes (bilateral), while others provide data for only one (unilateral). This mixture of unilateral and bilateral observations may arise for various practical reasons, including clinical ineligibility of one eye, pre-existing conditions, or occasional technical issues that prevent measurement of both eyes, so the inclusion of both unilateral and bilateral data is inevitable in practice. Discarding the unilateral portion of the data can lead to reduced statistical power and potential bias. Therefore, it is important to develop statistical methods that can accommodate both types of data while appropriately accounting for intra-subject correlation in bilateral cases.
A number of statistical models have been proposed to address the intra-subject correlation problem. Rosner [1] proposed a “constant $R$ model” in which the conditional probability of having a response in one eye, given a response in the other, is proportional to the marginal probability. The proportionality is governed by a constant $R$, which captures the symmetric dependence between the paired outcomes. Dallal [2] later argued that Rosner’s constant $R$ model could lead to a poor fit if the characteristic is almost certain to occur bilaterally with widely varying group-specific prevalence. Instead, he proposed that the conditional probability is a constant $\gamma$. Subsequently, Donner [3] proposed an alternative approach that assumes a constant intra-person correlation $\rho$ for all individuals in the sample. This model was shown to be robust in a simulation study by Thompson [4]. Clayton [5] proposed a model for association between bivariate variables using the Clayton copula, which expresses the joint distribution in terms of the marginal cumulative distribution functions (CDFs) and a dependence parameter $\theta$. The Clayton copula model is particularly useful for capturing lower tail dependence, that is, the tendency of two variables to take extreme low values simultaneously. In medical research, this is particularly relevant when evaluating paired organ systems, where disease in one organ may increase the risk in the paired counterpart. For example, in the case of paired kidneys, a severe decline in function in one kidney may be associated with a similar decline in the other kidney. In addition to these parametric models, the independence model and the saturated model are often used as benchmarks for describing bilateral and combined data structures.
Different methods have been developed to analyze correlated binary data under the aforementioned models, including homogeneity tests and confidence interval estimation (for homogeneity tests, see [6,7,8,9,10,11,12,13,14,15]; for confidence interval estimation, see [9,16,17,18]). Given the variety of available models, it is essential to identify the most suitable model based on the characteristics of the observed data. Previous work by Tang et al. [19] and Liu and Ma [20] investigated goodness-of-fit test methods for correlated bilateral data in the context of two and $g$ groups, respectively. Both studies focused exclusively on purely bilateral data, providing valuable insights into model selection and test performance under those scenarios. In contrast, our work addresses the more complex and practically relevant situation where unilateral and bilateral data are combined, as often encountered in clinical trials due to missingness or design constraints. Moreover, we examine a model based on the Clayton copula, which has not been investigated in previous studies. In this paper, we focus on performing goodness-of-fit tests for the combined unilateral and bilateral data under the aforementioned models. Specifically, we compare the following six methods: the deviance test, the Pearson chi-square test, the adjusted chi-square test, and three bootstrap methods.
The rest of the paper is organized as follows. Section 2 introduces six models for analyzing combined binary data and describes the procedures for obtaining maximum likelihood estimates (MLEs). Section 3 presents the six goodness-of-fit test methods. A simulation study is conducted in Section 4 to evaluate the performance of these methods in terms of empirical type I error rates and powers under different models. In Section 5, three real-world examples are used to illustrate the goodness-of-fit test methods. Conclusions are provided in Section 6.
2. Models for Combined Unilateral and Bilateral Data
Let $m_{ri}$ be the number of individuals who contribute data on paired organs with $r$ ($r = 0, 1, 2$) responses (a response means the organ is cured/affected) in the $i$-th group ($i = 1, \ldots, g$), and let $n_{ri}$ be the number of individuals who contribute data on one of the paired organs with $r$ ($r = 0, 1$) responses in the $i$-th group. Let $m_{r+}$ and $n_{r+}$ be the numbers of subjects who contribute bilateral data and unilateral data, respectively, with $r$ responses. Then,
$$m_{r+} = \sum_{i=1}^{g} m_{ri}, \qquad n_{r+} = \sum_{i=1}^{g} n_{ri}.$$
Similarly, let $m_{+i}$ and $n_{+i}$ be the numbers of subjects who contribute bilateral and unilateral data in the $i$-th group, respectively. Then,
$$m_{+i} = \sum_{r=0}^{2} m_{ri}, \qquad n_{+i} = \sum_{r=0}^{1} n_{ri}.$$
The total number of subjects in the study is thus
$$N = \sum_{i=1}^{g} \left(m_{+i} + n_{+i}\right).$$
The data structure is shown in Table 1.
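For the $i$-th group, the random vectors $(m_{0i}, m_{1i}, m_{2i})$ and $(n_{0i}, n_{1i})$ follow multinomial distributions. In particular,
$$(m_{0i}, m_{1i}, m_{2i}) \sim \text{Multinomial}(m_{+i};\, p_{0i}, p_{1i}, p_{2i}), \qquad (n_{0i}, n_{1i}) \sim \text{Multinomial}(n_{+i};\, 1-\pi_i, \pi_i), \tag{1}$$
where $p_{ri}$ denotes the probability of having $r$ ($r = 0, 1, 2$) responses for a subject in the $i$-th group for bilateral data, and $\pi_i$ is the marginal probability of the response for a subject in the $i$-th group. Let $Z_{ijk}$ be the response of the $k$-th ($k = 1, 2$) paired organ for the $j$-th subject in the $i$-th group, with $Z_{ijk} = 1$ indicating a response. Then the joint probabilities $p_{ri} = \Pr(Z_{ij1} + Z_{ij2} = r)$ ($r = 0, 1, 2$) take the following form:
$$p_{0i} = (1-\pi_i)^2 + \rho_i \pi_i (1-\pi_i), \qquad p_{1i} = 2\pi_i(1-\pi_i)(1-\rho_i), \qquad p_{2i} = \pi_i^2 + \rho_i \pi_i (1-\pi_i), \tag{2}$$
where $\rho_i$ denotes the intra-subject correlation between the two responses from the $j$-th subject in the $i$-th group.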
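As a quick numerical check of the joint probabilities, the following worked example assumes the standard parameterization of two exchangeable binary responses by a marginal probability $\pi_i$ and a correlation $\rho_i$. The values $\pi_i = 0.3$ and $\rho_i = 0.4$ are purely illustrative and are not taken from any dataset in this paper.

```latex
\begin{aligned}
p_{0i} &= (1-\pi_i)^2 + \rho_i \pi_i (1-\pi_i) = 0.49 + 0.4(0.21) = 0.574,\\
p_{1i} &= 2\pi_i(1-\pi_i)(1-\rho_i) = 2(0.21)(0.6) = 0.252,\\
p_{2i} &= \pi_i^2 + \rho_i \pi_i (1-\pi_i) = 0.09 + 0.4(0.21) = 0.174,\\
p_{0i} + p_{1i} + p_{2i} &= 1, \qquad \tfrac{1}{2}p_{1i} + p_{2i} = 0.126 + 0.174 = 0.3 = \pi_i .
\end{aligned}
```

The last line confirms that the three probabilities sum to one and that each organ retains the marginal response probability $\pi_i$.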
Various statistical models that account for within-subject dependence introduce additional (nuisance) parameters to capture the intra-subject correlation. In the following, we consider four such parametric models, each with one nuisance parameter: (i) Rosner’s model, (ii) Donner’s (constant $\rho$) model, (iii) Dallal’s model, and (iv) the Clayton copula model. To facilitate comparison across models, Table 2 provides a summary of model assumptions and their associated nuisance parameters.
In what follows, we describe a general procedure for obtaining the MLEs for the four parametric models. An iterative algorithm is employed for this purpose, with the Newton–Raphson and Fisher scoring methods being two commonly used approaches. Because the Hessian matrix can be derived analytically in our case, we adopt the Newton–Raphson method for its faster convergence and its suitability for more generic MLE optimization problems.
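Let the parameter vector collect the marginal probabilities $\pi_1, \ldots, \pi_g$ and the nuisance parameter. The log-likelihood function can then be written as
$$l = \sum_{i=1}^{g}\left[\sum_{r=0}^{2} m_{ri}\log p_{ri} + n_{0i}\log(1-\pi_i) + n_{1i}\log \pi_i\right] + \text{const}, \tag{3}$$
where the dependence of $l$ on the nuisance parameter is through the joint probabilities $p_{ri}$ in (2), the data consist of the combined bilateral counts $m_{ri}$ and unilateral counts $n_{ri}$, and ‘const’ is a constant term depending only on the data.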
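Suppose the log-likelihood is concave and certain regularity conditions are satisfied. The maximum likelihood estimates (MLEs) of the parameters can be obtained via the following normal equations:
$$\frac{\partial l}{\partial \pi_i} = 0 \quad (i = 1, \ldots, g), \qquad \frac{\partial l}{\partial \eta} = 0, \tag{5}$$
where $\eta$ denotes the nuisance parameter. Usually there are no closed-form solutions for the above normal equations; rather, the MLEs are obtained iteratively. The iteration procedure for estimating the MLEs of $\pi_i$ and $\eta$ is outlined below.
Step 1. At the $(t+1)$-th step, for given $\eta^{(t)}$, obtain $\pi_i^{(t+1)}$ as a real root of the equation $\partial l/\partial \pi_i = 0$ for $i = 1, \ldots, g$, such that $0 < \pi_i^{(t+1)} < 1$.
Step 2. At the $(t+1)$-th step, $\eta^{(t+1)}$ is evaluated with the Newton–Raphson method:
$$\eta^{(t+1)} = \eta^{(t)} - \left(\frac{\partial^2 l}{\partial \eta^2}\right)^{-1} \frac{\partial l}{\partial \eta}\,\Bigg|_{\pi_i = \pi_i^{(t+1)},\ \eta = \eta^{(t)}}.$$
Step 3. Repeat Steps 1–2 until convergence occurs, as measured by the change between successive iterates. The iteration procedure stops when this change falls below a sufficiently small tolerance.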
The initial values of the marginal probabilities and of the nuisance parameter can be set somewhat arbitrarily but within their allowed regions. The acceptable range of the nuisance parameter for each model is discussed in the following subsections.
It should be noted that, in addition to the four parametric models listed in Table 2, both the independence model and the saturated model are free of nuisance parameters and allow closed-form solutions for the MLEs of the probabilities $p_{ri}$ and $\pi_i$. As a result, these models do not require the iteration procedure described above.
2.1. Rosner’s Model
Rosner [1] proposed a “constant $R$ model” that assumes equal dependence between the two eyes of the same person for ophthalmologic data. More specifically, it assumes that the probability of a cured eye at one site, given a cured eye at the other site, for the $j$-th subject in the $i$-th group is proportional to the prevalence rate for the $i$-th group by a constant factor $R$, i.e.,
$$\Pr(Z_{ijk} = 1 \mid Z_{ij(3-k)} = 1) = R\pi_i,$$
for $k = 1, 2$ denoting the left and right eye, respectively. The intra-subject correlation then takes the form $\rho_i = (R-1)\pi_i/(1-\pi_i)$. Clinically, the parameter $R$ captures the symmetric dependence between the two eyes of a subject, quantifying the degree of intra-subject correlation in bilateral outcomes. The region of $R$ is bounded by the region of probabilities and correlation. It can be shown that $R$ satisfies $0 < R \le 1/a$ if $a \le 1/2$; $(2 - 1/a)/a \le R \le 1/a$ if $a > 1/2$, with $a = \max\{\pi_i : i = 1, \ldots, g\}$ [12].
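The constraints on $R$ can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes the joint probabilities implied by the constant $R$ assumption (both organs respond with probability $R\pi_i^2$), and the function names `rosner_probs`, `rosner_correlation`, and `rosner_R_bounds` are hypothetical.

```python
def rosner_probs(pi, R):
    """Joint probabilities (p0, p1, p2) of 0, 1, 2 responses under the constant R assumption."""
    p2 = R * pi ** 2            # Pr(both respond) = pi * Pr(other responds | one responds) = pi * R * pi
    p1 = 2 * pi * (1 - R * pi)  # exactly one of the two organs responds
    p0 = 1 - p1 - p2            # neither organ responds
    return p0, p1, p2


def rosner_correlation(pi, R):
    """Intra-subject correlation implied by R for a group with marginal probability pi."""
    return (R - 1) * pi / (1 - pi)


def rosner_R_bounds(pis):
    """Range of R keeping all joint probabilities in [0, 1]; the lower limit 0 is exclusive."""
    a = max(pis)
    lower = 0.0 if a <= 0.5 else (2 - 1 / a) / a
    upper = 1 / a
    return lower, upper


if __name__ == "__main__":
    print(rosner_probs(0.3, 1.0))        # R = 1 reproduces independence
    print(rosner_R_bounds([0.2, 0.8]))   # largest prevalence drives both limits
```

At the lower limit of $R$ the probability of no response reaches zero, and at the upper limit the probability of exactly one response reaches zero, which is why the largest group prevalence determines the range.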
Taking $R$ as the nuisance parameter, the normal equations in (5) yield Equations (7a) and (7b), obtained by differentiating the log-likelihood with respect to $\pi_i$ and $R$, respectively. Equation (7a) leads to the quartic Equation (8) in $\pi_i$, whose coefficients depend on the observed counts and on $R$. Therefore, at Step 1 in the iteration procedure, $\pi_i^{(t+1)}$ is a real root of the quartic Equation (8) for a given $R^{(t)}$ [12]. The first derivative (see the middle expression in the normal Equation (7b)) and the second derivative of the log-likelihood with respect to $R$ are used to update $R$ in the $(t+1)$-th iteration, as described at Step 2 in the iteration procedure.
2.2. Donner’s Model
With Donner’s approach [3], the correlation between the outcomes of paired organs of the same subject is assumed to be the same across the sample, such that
$$\rho_i = \rho, \quad i = 1, \ldots, g.$$
Taking $\rho$ as the nuisance parameter, the normal equations take the form of Equations (11a) and (11b). A cubic Equation (12) in $\pi_i$ can be derived from (11a), with coefficients that depend on the observed counts and on $\rho$. Therefore, at Step 1 in the iteration procedure, $\pi_i^{(t+1)}$ is a real root of the cubic Equation (12) for a given $\rho^{(t)}$ [14]. The first derivative shown in the middle expression in (11b) and the second derivative of the log-likelihood with respect to $\rho$ are used to update $\rho$ in the $(t+1)$-th iteration described in the iteration procedure.
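The two-step iteration can be sketched for the constant $\rho$ model as follows. This is a minimal sketch under stated assumptions rather than the authors' code: Step 1 finds the root of the score in $\pi_i$ by bisection instead of solving the cubic equation, Step 2 takes one Newton–Raphson step in $\rho$ with analytic derivatives, $\rho$ is restricted to $[0, 0.999]$, and convergence is judged by the change in $\rho$. All function names, the starting value, and the tolerance are hypothetical.

```python
def donner_probs(pi, rho):
    """Joint probabilities (p0, p1, p2) of 0, 1, 2 responses with common correlation rho."""
    p0 = (1 - pi) ** 2 + rho * pi * (1 - pi)
    p1 = 2 * pi * (1 - pi) * (1 - rho)
    p2 = pi ** 2 + rho * pi * (1 - pi)
    return p0, p1, p2


def score_pi(pi, rho, m, n):
    """Derivative of the log-likelihood in pi for one group; m = (m0, m1, m2), n = (n0, n1)."""
    m0, m1, m2 = m
    n0, n1 = n
    return (m0 * (-1 / (1 - pi) - (1 - rho) / (1 - pi + rho * pi))
            + m1 * (1 / pi - 1 / (1 - pi))
            + m2 * (1 / pi + (1 - rho) / (pi + rho * (1 - pi)))
            - n0 / (1 - pi) + n1 / pi)


def solve_pi(rho, m, n, iters=100):
    """Step 1: root of the (decreasing) score in (0, 1), located by bisection."""
    lo, hi = 1e-10, 1 - 1e-10
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score_pi(mid, rho, m, n) > 0:
            lo = mid   # score still positive, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def newton_step_rho(rho, pis, ms):
    """Step 2: one Newton-Raphson update of rho given the current pi values."""
    grad = hess = 0.0
    for pi, (m0, m1, m2) in zip(pis, ms):
        a = 1 - pi + rho * pi        # p0 = (1 - pi) * a
        b = pi + rho * (1 - pi)      # p2 = pi * b
        grad += m0 * pi / a - m1 / (1 - rho) + m2 * (1 - pi) / b
        hess -= m0 * (pi / a) ** 2 + m1 / (1 - rho) ** 2 + m2 * ((1 - pi) / b) ** 2
    return rho - grad / hess


def fit_donner(ms, ns, rho0=0.5, tol=1e-10, max_iter=500):
    """Alternate Steps 1 and 2 until rho changes by less than tol (assumed stopping rule)."""
    rho = rho0
    pis = [solve_pi(rho, m, n) for m, n in zip(ms, ns)]
    for _ in range(max_iter):
        pis = [solve_pi(rho, m, n) for m, n in zip(ms, ns)]
        new_rho = min(max(newton_step_rho(rho, pis, ms), 0.0), 0.999)
        if abs(new_rho - rho) < tol:
            return pis, new_rho
        rho = new_rho
    return pis, rho
```

A call such as `fit_donner([(57, 25, 18), (35, 30, 35)], [(35, 15), (25, 25)])` returns the estimated marginal probabilities per group and the common correlation; the counts here are made up for illustration. Bisection is used only to keep the sketch short; solving the cubic directly, as described above, serves the same purpose.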
2.3. Dallal’s Model
Under Dallal’s model, the conditional probability is assumed to be a constant [2], i.e.,
$$\Pr(Z_{ijk} = 1 \mid Z_{ij(3-k)} = 1) = \gamma, \quad k = 1, 2.$$
The resulting intra-subject correlation is $\rho_i = (\gamma - \pi_i)/(1 - \pi_i)$. Being bounded by the region of the probabilities and correlation, it can be shown that the region of $\gamma$ is $0 < \gamma \le 1$ if $a \le 1/2$; $2 - 1/a \le \gamma \le 1$ if $a > 1/2$, where $a = \max\{\pi_i : i = 1, \ldots, g\}$.
Taking $\gamma$ as the nuisance parameter, the normal equations take the form of Equations (15a) and (15b). Further reduction of Equation (15a) gives rise to the quadratic Equation (16) in $\pi_i$, with coefficients that depend on the observed counts and on $\gamma$. The smaller root of the quadratic Equation (16) leads to the maximum of the log-likelihood and is thus used at Step 1 in the iteration procedure. The first derivative (see the middle expression in (15b)) and the second derivative of the log-likelihood with respect to $\gamma$ are used to update $\gamma$ in the $(t+1)$-th iteration.
2.4. Clayton Copula Model
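According to Sklar’s theorem [21], every joint cumulative distribution function (CDF) of a random vector can be expressed in terms of its marginal CDFs and a copula $C$. In particular, for the paired organ data, it takes the form
$$F(z_1, z_2) = C\left(F_1(z_1), F_2(z_2)\right),$$
where $F$ is the joint CDF of the two responses and $F_1$, $F_2$ are their marginal CDFs.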
Given
, the joint probabilities can be written as
Comparing Equation (
19) and Equation (
2), it is straightforward that
The Clayton copula is a type of Archimedean copula which allows modeling dependence with one parameter ($\theta$) and is denoted as $C_\theta$. It is particularly suited for modeling lower tail dependence [5]. Liang et al. [22] utilize the Clayton copula to test the homogeneity of two proportions for correlated bilateral data, where the copula is defined as:
$$C_\theta(u, v) = \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta} \tag{21}$$
for $\theta > 0$. Note that the full expression of the Clayton copula is $C_\theta(u, v) = \left[\max\left(u^{-\theta} + v^{-\theta} - 1,\, 0\right)\right]^{-1/\theta}$, for $\theta \in [-1, \infty) \setminus \{0\}$. When $\theta > 0$, the copula exhibits positive dependence and lower tail dependence. When $\theta < 0$, it models negative dependence with vanishing lower tail dependence.
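To make the copula construction concrete, the sketch below evaluates the Clayton copula for $\theta > 0$ and derives joint probabilities from it. It rests on an assumption that is not confirmed by the text here: the copula is applied to the marginal CDFs evaluated at zero, so that the probability of no response in either organ is $C_\theta(1-\pi_i, 1-\pi_i)$. The function names are hypothetical.

```python
def clayton(u, v, theta):
    """Clayton copula C_theta(u, v) for theta > 0 and u, v in (0, 1]."""
    return (u ** (-theta) + v ** (-theta) - 1) ** (-1 / theta)


def clayton_joint_probs(pi, theta):
    """(p0, p1, p2), assuming Pr(no response in either organ) = C_theta(1 - pi, 1 - pi)."""
    p0 = clayton(1 - pi, 1 - pi, theta)  # neither organ responds
    p1 = 2 * ((1 - pi) - p0)             # exactly one organ responds
    p2 = 1 - p0 - p1                     # both organs respond
    return p0, p1, p2


def implied_correlation(pi, theta):
    """Correlation obtained by matching p0 with (1 - pi)^2 + rho * pi * (1 - pi)."""
    p0, _, _ = clayton_joint_probs(pi, theta)
    return (p0 - (1 - pi) ** 2) / (pi * (1 - pi))


if __name__ == "__main__":
    for theta in (0.01, 0.5, 1.0, 5.0):
        print(theta, clayton_joint_probs(0.3, theta), implied_correlation(0.3, theta))
```

Under this assumed construction, the implied correlation approaches zero as $\theta$ approaches zero, which corresponds to the independence model.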
Using the copula form in (21), the normal equations can be written as Equations (22a) and (22b). Unlike in the previous models, where the first normal equation ($\partial l/\partial \pi_i = 0$) can be further reduced to a polynomial equation so that an analytic solution for $\pi_i^{(t+1)}$ can be found for a given value of the nuisance parameter at the $(t+1)$-th iteration, under the Clayton copula model the root of Equation (22a) is evaluated numerically. The second derivative of the log-likelihood with respect to $\theta$, along with the first derivative shown in (22b), is used to update $\theta$ at the $(t+1)$-th iteration.
2.5. Independence Model
The independence model assumes no correlation between the two paired organs of the same person, i.e., $\rho_i = 0$. Thus, this model is free of nuisance parameters, and the MLEs of $\pi_i$ can be directly obtained by solving the normal equation:
$$\frac{\partial l}{\partial \pi_i} = \frac{m_{1i} + 2m_{2i} + n_{1i}}{\pi_i} - \frac{2m_{0i} + m_{1i} + n_{0i}}{1 - \pi_i} = 0,$$
which yields a closed-form solution:
$$\hat{\pi}_i = \frac{m_{1i} + 2m_{2i} + n_{1i}}{2m_{+i} + n_{+i}}.$$
It should be noted that the independence model is a special case of the aforementioned models, including Rosner’s model, Donner’s model, and the Clayton copula model. It is a limiting case of Rosner’s model as $R \to 1$, of Donner’s model as $\rho \to 0$, and of the Clayton copula model as $\theta \to 0$.
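Under independence, the closed-form estimate is simply the number of responding organs divided by the number of organs observed. The short sketch below illustrates this; the function name and the counts are hypothetical.

```python
def independence_mle(m, n):
    """MLE of pi for one group; m = (m0, m1, m2) bilateral counts, n = (n0, n1) unilateral counts."""
    m0, m1, m2 = m
    n0, n1 = n
    responses = m1 + 2 * m2 + n1              # responding organs
    organs = 2 * (m0 + m1 + m2) + (n0 + n1)   # all organs observed
    return responses / organs


if __name__ == "__main__":
    # Illustrative counts: 60 bilateral subjects and 20 unilateral subjects in one group.
    print(independence_mle((10, 20, 30), (5, 15)))
```

Each bilateral subject contributes two organs and each unilateral subject contributes one, which is why the denominator weights the two kinds of subjects differently.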
2.6. Saturated Model
The saturated model treats each joint probability $p_{ri}$ and the marginal probability $\pi_i$ as free parameters, subject to the constraint $\sum_{r=0}^{2} p_{ri} = 1$. This model serves as a reference (or “full”) model in the goodness-of-fit test, as it imposes no structural assumptions on the data. The log-likelihood is given by Equation (3). Using the method of Lagrange multipliers, the MLEs of the parameters are
$$\hat{p}_{ri} = \frac{m_{ri}}{m_{+i}}, \qquad \hat{\pi}_i = \frac{n_{1i}}{n_{+i}},$$
for $r = 0, 1, 2$ and $i = 1, \ldots, g$.
5. Real-World Applications
Unlike the simulation study, where no model preference was considered due to the data being simulated under a specific model, selecting an appropriate model is essential for analyzing real-world data. To evaluate the performance of the six proposed methods for the goodness-of-fit test, we apply them to three real-world examples.
Model selection is conducted among the following candidates: (i) the independence model, (ii) Rosner’s model, (iii) Donner’s model, (iv) Dallal’s model, and (v) the Clayton copula model. Each of these five candidate models has been introduced and discussed in Sections 2.1 to 2.5. For each dataset, we applied all five models and assessed their fit using the Akaike Information Criterion (AIC), provided that the model passed the goodness-of-fit test. The best-fitting model was then identified as the one with the lowest AIC. The AIC is defined as
$$\text{AIC} = 2k - 2\hat{l},$$
where $k$ is the number of free parameters, and $\hat{l}$ is the log-likelihood evaluated at the MLEs of $\pi_i$ ($i = 1, \ldots, g$) and the nuisance parameter.
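The selection rule described above (keep only models that pass the goodness-of-fit test, then pick the lowest AIC) can be sketched as follows. The function names, the threshold `alpha=0.05`, and the numbers in the usage example are hypothetical and are not results from this paper.

```python
def aic(loglik, k):
    """Akaike Information Criterion for a maximized log-likelihood with k free parameters."""
    return 2 * k - 2 * loglik


def select_model(fits, alpha=0.05):
    """Pick the lowest-AIC model among those whose goodness-of-fit p-value exceeds alpha.

    fits maps a model name to (maximized log-likelihood, number of free parameters, p-value).
    Returns None when no model passes the goodness-of-fit screen.
    """
    passed = {name: aic(ll, k) for name, (ll, k, p) in fits.items() if p > alpha}
    if not passed:
        return None
    return min(passed, key=passed.get)


if __name__ == "__main__":
    # Made-up values for two groups: independence has g = 2 parameters, the others g + 1 = 3.
    fits = {
        "independence": (-90.0, 2, 0.001),
        "Rosner": (-100.0, 3, 0.60),
        "Donner": (-100.5, 3, 0.40),
    }
    print(select_model(fits))
```

Screening on the goodness-of-fit test first means a model with a low AIC is still discarded when the test rejects it, as happens to the independence model in this made-up example.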
5.1. Example 1
A double-blind randomized clinical trial was conducted at two sites to compare cefaclor and amoxicillin for the treatment of acute otitis media with effusion (OME) in 214 children [25]. Table 12 shows the presence or absence of OME (in terms of the number of cured ears) at 14 days in 203 children from the sample of 214 children treated with cefaclor and amoxicillin. Table 13 provides the p-values of the six goodness-of-fit test methods, along with the AIC values for the five competing models. The independence model is excluded due to extremely small p-values across the six methods. The remaining four parametric models are considered acceptable, with all p-values exceeding the significance level. Among them, Rosner’s model and the Clayton copula model yield the highest p-values (all ≳ 0.7), suggesting a better fit than Donner’s and Dallal’s models. The AIC for the Clayton copula model is slightly lower than that for Rosner’s model, indicating that the Clayton copula model provides the best fit for this dataset. This result may reflect that the dependence structure in the OME dataset exhibits features of lower tail dependence, which the Clayton copula is particularly suited to model. Interpreted in this context, the finding suggests that successful treatment of one ear may be associated with an increased likelihood of success in the paired ear of the same child.
5.2. Example 2
The second example involves combined unilateral and bilateral data obtained from an observational study for 60 myopia patients undergoing Orthokeratology (Ortho-k), a non-surgical vision correction method that uses specialized contact lenses worn overnight to temporarily reshape the cornea and correct myopia [
26]. Myopia improvement is assessed by the axial length growth (ALG), where improvement is indicated if ALG is less than
mm, and absent otherwise. For this analysis, a subset of 33 patients using three masked brands of Ortho-k (labeled Q, Y, and W) is included [18]. The numbers of improved myopic eyes by brand are summarized in Table 14.
Table 15 presents the p-values of the six methods, along with the AICs for the five competing models. The independence model yields considerably smaller p-values than the other four models. In particular, its p-values from two of the bootstrap methods fall below the significance level, indicating poor model fit. The remaining four models are considered acceptable, with all associated p-values exceeding the significance level. Among them, Rosner’s model achieves the lowest AIC value, suggesting it is the best model for this dataset. It is worth noting that the Pearson chi-square test should be interpreted with caution, as several cell counts in Table 14 are smaller than 5, potentially affecting the test’s validity. This result may indicate that the dependence structure in the Ortho-k dataset exhibits the symmetric feature described by Rosner’s model. Interpreted in this context, the finding suggests that the outcomes of the two myopic eyes in the same patient tend to be moderately correlated, i.e., the improvement or lack of improvement in one eye is associated with a similar outcome in the paired eye.
5.3. Example 3
The third example, originally analyzed in Rosner’s paper introducing the constant $R$ model [1], is based on data from an outpatient population of 218 persons aged 20 to 39 with retinitis pigmentosa (RP), who were seen at the Massachusetts Eye and Ear Infirmary between 1970 and 1979. The patients were classified into four genetic groups: (i) autosomal dominant RP (DOM), (ii) autosomal recessive RP (AR), (iii) sex-linked RP (SL), and (iv) isolate RP (ISO). In order to eliminate between-subject correlation, the selected patients were from different families. The distribution of the number of affected eyes for persons in the four genetic groups is given in Table 16, where an eye was considered affected if the best corrected Snellen visual acuity (VA) was 20/50 or worse, and normal if VA was 20/40 or better. Note that this dataset contains only bilateral observations and may be viewed as a special case within the combined data framework.
The results of the goodness-of-fit tests are shown in Table 17. As in the previous examples, the independence model is excluded due to extremely small p-values across all methods, indicating poor fit. The remaining four models are considered acceptable, with all associated p-values greater than the significance level. Among them, Donner’s model and the Clayton copula model produce the highest p-values, more than twice those observed for Dallal’s model. Rosner’s model yields p-values that are only marginally greater than the significance level. Between the two best-fitting models, Donner’s model has a slightly lower AIC value, indicating it is the best model for this dataset. This result may imply that the dependence structure in the retinitis pigmentosa dataset exhibits constant intra-subject correlation across the four genetic groups, as described by Donner’s model.
6. Conclusions
Selecting an appropriate statistical model that adequately fits the observed data is a key consideration in the analysis of paired organ data. Misfitting models can lead to inaccurate inference and potentially misleading conclusions. For instance, in
Example 3, Donner’s model provides the best fit for the retinitis pigmentosa dataset. This same dataset was analyzed in a recent study by Zhou and Ma [
27], which introduced three MLE-based statistics (likelihood ratio, Wald-type, and score) and the generalized estimating equations (GEE)-based statistic (generalized score) for testing homogeneity of proportions in combined unilateral and bilateral data. Under Donner’s model, the
p-values for the three MLE-based statistics were
,
, and
, respectively, which clearly indicates a rejection of the equal proportions hypothesis at the
level. In contrast, when applying Rosner’s model, the corresponding
p-values were
,
, and
, failing to reject the null hypothesis. The GEE statistic produced a
p-value of
, supporting the inference under Donner’s model. This example underscores the importance of correct model specification by showing that using a misfitting model can obscure true effects and lead to conflicting or incorrect conclusions.
While previous work was focused on goodness-of-fit test methods for purely bilateral data, this study extends the investigation to the combined structure of unilateral and bilateral outcomes. We consider six statistical models and evaluate six methods for conducting goodness-of-fit tests under these models.
A simulation study is carried out to assess the performance of the six methods under different models by computing the empirical type I error rates and powers. Based on the simulation results, we draw several conclusions. Among the three commonly used test statistics, the deviance test performs well across all four parametric models. In contrast, the Pearson chi-square test depends more on the model and performs less well when the sample size is small, especially under Donner’s model when intra-subject correlation is high. Although the adjusted chi-square test includes a continuity correction to improve type I error control in small samples, our results indicate that it is excessively conservative across all scenarios under all models. Despite its theoretical motivation, the test yields overly small empirical type I error rates. Therefore, it is not recommended for practical use in this context. On the other hand, the three bootstrap methods generally maintain good control over type I error rates, although a few liberal outcomes were observed under Donner’s model with small samples. Overall, differences in performance among the six methods are more pronounced when the sample size is small and tend to diminish as the sample size increases.
In general, our results are consistent with findings from the earlier studies. In particular, Liu and Ma [20] showed that the adjusted chi-square test tends to be overly conservative and is therefore not recommended. The deviance test performs well under Rosner’s, Donner’s, and Dallal’s models. The Pearson chi-square test controls type I error well under Rosner’s model but becomes liberal at small sample sizes when applied to the other two models. The three bootstrap methods generally mirror the behavior of the
test, however,
and
tend to be more liberal at small sample sizes compared to
. As the sample size increases, all the five methods (
) exhibit improved performance with satisfactory type I error control.
The practical application of these methods is illustrated through three real-world datasets from otolaryngologic and ophthalmologic studies.
The goodness-of-fit test methods presented in our study are universal in the sense that they can be applied to paired organ data under various model scenarios. A natural extension of this work is to incorporate covariates into the modeling framework, allowing for a more flexible assessment of model fit while accounting for intra-subject correlation. For example, extensions based on generalized linear mixed models (GLMMs) or generalized estimating equations (GEE) could be developed to handle covariate-adjusted goodness-of-fit tests. Such developments would broaden the practical utility of these methods in analyzing paired organ data across different clinical settings.
It is important to note that the asymptotic distributions used for the deviance, Pearson chi-square, and adjusted chi-square statistics (as well as for the two bootstrap methods that rely on such statistics) are theoretically valid only under large-sample conditions. When the sample size is small, these asymptotic approximations may not hold, and alternative approaches such as exact methods or the bootstrap method should be considered. Exploring and developing more accurate small-sample inference methods remains an interesting direction for future work.
Lastly, we provide guidance for applied use based on our findings. Among the six methods evaluated, the deviance test and the three bootstrap methods generally demonstrate robust performance across various model scenarios, especially as the sample size increases. The Pearson chi-square test performs well under certain models but may be liberal with small samples and high intra-subject correlation. The adjusted chi-square test, while theoretically motivated, tends to be overly conservative and is not recommended for practical use. For smaller sample sizes, bootstrap or exact methods may offer more reliable inference. We recommend that practitioners carefully consider sample size and model assumptions when selecting goodness-of-fit test methods for combined unilateral and bilateral data to ensure valid conclusions.