Article

Goodness-of-Fit Tests for Combined Unilateral and Bilateral Data

Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2501; https://doi.org/10.3390/math13152501
Submission received: 26 June 2025 / Revised: 28 July 2025 / Accepted: 30 July 2025 / Published: 3 August 2025

Abstract

Clinical trials involving paired organs often yield a mixture of unilateral and bilateral data, where each subject may contribute either one or two responses. While unilateral responses from different individuals can be treated as independent, bilateral responses from the same individual are likely correlated. Various statistical methods have been developed to account for this intra-subject correlation in the bilateral data, and in practice, it is crucial to select a model that properly accounts for this correlation to ensure accurate inference. Previous research has investigated goodness-of-fit test statistics for correlated bilateral data under different group settings, assuming fully observed paired outcomes. In this work, we extend these methods to the more general and practically common setting where unilateral and bilateral data are combined. We examine the performance of various goodness-of-fit statistics under different statistical models, including the Clayton copula model. Simulation results indicate that the performance of the goodness-of-fit tests is model-dependent, especially when the sample size is small and/or the intra-subject correlation is high. However, the three bootstrap methods generally offer more robust performance. In real world applications from otolaryngologic and ophthalmologic studies, model choice significantly impacts conclusions, emphasizing the need for appropriate model assessment in practice.

1. Introduction

In clinical trial studies, the outcomes of paired organs, such as eyes, ears, and kidneys, are often recorded as binary data. The data are considered bilateral when both sites of the same individual are included, in contrast to unilateral data, where only one site of a paired organ is recorded per individual. Unilateral outcomes are typically assumed to be independent, with each individual’s response(s) being independent of others. Bilateral outcomes, however, exhibit intra-subject correlation due to the paired nature of the data. Both types of data are often recorded together in randomized clinical trials. For instance, in ophthalmologic studies, the focus is often on the statistical analysis of eyes rather than individuals. For n enrolled patients, the medical records may include responses for between n and 2 n eyes, as some patients provide data for both eyes (bilateral), while others provide data for only one (unilateral). This mixture of unilateral and bilateral observations may arise due to various practical reasons, including clinical ineligibility of one eye, pre-existing conditions, or occasional technical issues that prevent measurement of both eyes, making the inclusion of both unilateral and bilateral data inevitable in practice. Discarding the unilateral portion of the data can lead to reduced statistical power and potential bias. Therefore, it is important to develop statistical methods that can accommodate both types of data while appropriately accounting for intra-subject correlation in bilateral cases.
A number of statistical models have been proposed to address the intra-subject correlation problem. Rosner [1] proposed a “constant R model” in which the conditional probability of having a response in one eye, given a response in the other, is proportional to the marginal probability. The proportionality is governed by a constant R, which captures the symmetric dependence between the paired outcomes. Dallal [2] later argued that Rosner’s constant R model could lead to a poor fit if the characteristic is almost certain to occur bilaterally with largely varying group-specific prevalence. Instead, he proposed that the conditional probability is a constant γ . Subsequently, Donner [3] proposed an alternative approach that assumes a constant intra-person correlation ρ for all the individuals in the sample. This model was proved to be robust with a simulation study by Thompson [4]. Clayton [5] proposed a model for association in bivariate variables using Clayton copula, which expresses the joint distribution in terms of the marginal cumulative distribution functions (CDFs) and a dependence parameter θ . The Clayton copula model is particularly useful for capturing lower tail dependence that refers to the tendency of two variables to take extreme low values simultaneously. In medical research, this is particularly relevant when evaluating paired organ systems, where the disease in one organ may increase the risk in the paired counterpart. For example, in the case of paired kidneys, a severe decline in function in one kidney may be associated with a similar decline in the other kidney. In addition to these parametric models, the independence model and saturated model are often used as benchmarks for describing bilateral and combined data structures.
Different methods have been developed to analyze correlated binary data under the aforementioned models, including homogeneity tests and confidence interval estimation (e.g., for homogeneity tests, see [6,7,8,9,10,11,12,13,14,15]; for confidence interval estimation, see [9,16,17,18]). Given the variety of available models, it is essential to identify the most suitable model based on the characteristics of the observed data. Previous work by Tang et al. [19] and Liu and Ma [20] investigated goodness-of-fit test methods for correlated bilateral data in the context of two and $g \geq 2$ groups, respectively. Both studies focused exclusively on purely bilateral data, providing valuable insights into model selection and test performance under those scenarios. In contrast, our work addresses the more complex and practically relevant situation where unilateral and bilateral data are combined, as often encountered in clinical trials due to missingness or design constraints. Moreover, we examine a model based on the Clayton copula, which has not been investigated in previous studies. In this paper, we focus on performing goodness-of-fit tests for the combined unilateral and bilateral data under the aforementioned models. Specifically, we compare the following six methods: deviance ($G^2$), Pearson chi-square ($X^2$), adjusted chi-square ($X^2_{adj}$), and three bootstrap methods ($B_1$, $B_2$, $B_3$).
The rest of the paper is organized as follows. Section 2 introduces six models for analyzing combined binary data and describes the procedures for obtaining maximum likelihood estimates (MLEs). Section 3 presents the six methods for goodness-of-fit test. A simulation study is conducted in Section 4 to evaluate the performance of these methods in terms of empirical type I error rates and powers under different models. In Section 5, three real-world examples are applied to illustrate the goodness-of-fit test methods. Conclusions are provided in Section 6.

2. Models for Combined Unilateral and Bilateral Data

Let $m_{ri}$ be the number of individuals in the $i$-th group ($i = 1, \ldots, g$) who contribute data on both paired organs with $r$ ($r = 0, 1, 2$) responses (a response means the organ is cured/affected), and let $n_{ri}$ be the number of individuals in the $i$-th group who contribute data on one of the paired organs with $r$ ($r = 0, 1$) responses. Let $m_{r+}$ ($r = 0, 1, 2$) and $n_{r+}$ ($r = 0, 1$) be the numbers of subjects who contribute bilateral and unilateral data, respectively, with $r$ responses. Then,
$$m_{r+} = \sum_{i=1}^{g} m_{ri}, \qquad n_{r+} = \sum_{i=1}^{g} n_{ri}.$$
Similarly, let $m_{+i}$ and $n_{+i}$ be the numbers of subjects who contribute bilateral and unilateral data in the $i$-th group, respectively. Then,
$$m_{+i} = \sum_{r=0}^{2} m_{ri}, \qquad n_{+i} = \sum_{r=0}^{1} n_{ri}.$$
The total number of subjects in the study is thus
$$\sum_{i=1}^{g} \left( \sum_{r=0}^{2} m_{ri} + \sum_{r=0}^{1} n_{ri} \right) = \sum_{i=1}^{g} \left( m_{+i} + n_{+i} \right) = m_{++} + n_{++}.$$
The data structure is demonstrated in Table 1.
For the $i$-th group, the count vectors $(m_{0i}, m_{1i}, m_{2i})$ and $(n_{0i}, n_{1i})$ follow multinomial and binomial distributions, respectively. In particular,
$$(m_{0i}, m_{1i}, m_{2i}) \sim \text{Multinomial}(m_{+i};\, p_{0i}, p_{1i}, p_{2i}), \quad \sum_{r=0}^{2} p_{ri} = 1, \qquad n_{1i} \sim \text{Binomial}(n_{+i}, \pi_i),$$
where $p_{ri}$ denotes the probability of having $r$ ($r = 0, 1, 2$) responses for a subject in the $i$-th group for bilateral data, and $\pi_i$ is the marginal probability of a response for a subject in the $i$-th group. Let $Z_{ijk}$ be the binary response (with $Z_{ijk} = 1$ indicating a response) of the $k$-th ($k = 1, 2$) paired organ for the $j$-th subject in the $i$-th group. The joint probabilities $p_{ri}$ ($r = 0, 1, 2$) then take the following form:
$$\begin{aligned}
p_{2i} &= Pr(Z_{ij1} = 1, Z_{ij2} = 1) = E(Z_{ij1} Z_{ij2}) = \pi_i \left[ \pi_i + (1 - \pi_i)\, \text{Corr}(Z_{ij1}, Z_{ij2}) \right], \\
p_{1i} &= \sum_{k=1}^{2} Pr(Z_{ijk} = 1, Z_{ij,3-k} = 0) = 2(\pi_i - p_{2i}) = 2 \pi_i (1 - \pi_i) \left[ 1 - \text{Corr}(Z_{ij1}, Z_{ij2}) \right], \\
p_{0i} &= Pr(Z_{ij1} = 0, Z_{ij2} = 0) = 1 - p_{1i} - p_{2i} = (1 - \pi_i) \left[ 1 - \pi_i + \pi_i\, \text{Corr}(Z_{ij1}, Z_{ij2}) \right],
\end{aligned} \tag{2}$$
where $\text{Corr}(Z_{ij1}, Z_{ij2})$ denotes the intra-subject correlation between the two responses from the $j$-th subject in the $i$-th group.
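As a quick numerical illustration of the joint probabilities in (2), the following minimal Python sketch maps a marginal rate and an intra-subject correlation to $(p_{0i}, p_{1i}, p_{2i})$. The function name `joint_probs` is an illustrative choice and does not come from the paper.

```python
def joint_probs(pi, corr):
    """Return (p0, p1, p2) from the marginal rate pi and the intra-subject correlation.

    Implements the three expressions for p_{0i}, p_{1i}, p_{2i} in Equation (2).
    """
    p2 = pi * (pi + (1.0 - pi) * corr)
    p1 = 2.0 * pi * (1.0 - pi) * (1.0 - corr)
    p0 = (1.0 - pi) * (1.0 - pi + pi * corr)
    return p0, p1, p2


# With zero correlation the probabilities reduce to the independence case:
# (1 - pi)^2, 2*pi*(1 - pi), pi^2.
print(joint_probs(0.3, 0.0))
print(joint_probs(0.3, 0.5))
```

In both calls the three probabilities sum to one, which is a convenient sanity check when plugging in a model-specific correlation.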
Various statistical models that account for within-subject dependence introduce additional (nuisance) parameters to capture the intra-subject correlation. In the following, we consider four such parametric models, each with one nuisance parameter: (i) Rosner’s model, (ii) Donner’s (constant $\rho$) model, (iii) Dallal’s model, and (iv) the Clayton copula model. To facilitate comparison across models, Table 2 provides a summary of model assumptions and their associated nuisance parameters.
In what follows, we describe a general procedure for obtaining the MLEs for the four parametric models. An iterative algorithm is employed for this purpose, with the Newton–Raphson and Fisher’s scoring methods being two commonly used approaches. Because the Hessian matrix can be derived analytically in our case, we adopt the Newton–Raphson method for its faster convergence and its suitability for more generic MLE optimization problems.
Let $\beta = (\pi_1, \ldots, \pi_g, \kappa)$; then the log-likelihood function can be written as
$$l(\beta) := l(\pi_1, \ldots, \pi_g, \kappa \mid \mathbf{m}, \mathbf{n}) = \sum_{i=1}^{g} \sum_{r=0}^{2} m_{ri} \log p_{ri} + \sum_{i=1}^{g} \left[ n_{0i} \log(1 - \pi_i) + n_{1i} \log \pi_i \right] + \text{const}, \tag{3}$$
where the dependence on $\kappa$ enters through the joint probabilities $p_{ri}$ in (2), $(\mathbf{m}, \mathbf{n})$ denotes the vector of combined bilateral and unilateral data,
$$(\mathbf{m}, \mathbf{n}) = (m_{01}, m_{11}, m_{21}, \ldots, m_{0g}, m_{1g}, m_{2g}, n_{01}, n_{11}, \ldots, n_{0g}, n_{1g}),$$
and ‘const’ is a constant term depending only on $(\mathbf{m}, \mathbf{n})$.
Suppose the log-likelihood is concave and certain regularity conditions are satisfied. The maximum likelihood estimates (MLEs) of $\beta$ can be obtained via the following normal equations:
$$\frac{\partial l(\beta)}{\partial \pi_i} = 0, \quad i = 1, \ldots, g; \qquad \frac{\partial l(\beta)}{\partial \kappa} = 0. \tag{5}$$
Usually there are no closed-form solutions for the above normal equations; rather, the MLEs $\hat{\beta} = (\hat{\pi}_1, \ldots, \hat{\pi}_g, \hat{\kappa})$ are obtained iteratively. The iteration procedure for estimating $\pi_i$ and $\kappa$ is outlined below.
  • At the $t$-th step, for given $\hat{\kappa}^{(t)}$, obtain $\hat{\pi}_i^{(t)}$ as a real root of the equation $\partial l(\beta) / \partial \pi_i = 0$ for $i = 1, \ldots, g$, such that $\hat{\beta}^{(t)} = (\hat{\pi}_1^{(t)}, \ldots, \hat{\pi}_g^{(t)}, \hat{\kappa}^{(t)})$.
  • At the $(t+1)$-th step, $\hat{\kappa}^{(t+1)}$ is evaluated with the Newton–Raphson method:
    $$\hat{\kappa}^{(t+1)} = \hat{\kappa}^{(t)} - \left[ \frac{\partial^2 l(\beta)}{\partial \kappa^2} \bigg|_{\beta = \hat{\beta}^{(t)}} \right]^{-1} \frac{\partial l(\beta)}{\partial \kappa} \bigg|_{\beta = \hat{\beta}^{(t)}}.$$
  • Repeat Steps 1–2 until $\hat{\kappa}$ converges, as measured by $\delta^{(t)} = |\hat{\kappa}^{(t+1)} - \hat{\kappa}^{(t)}|$. The iteration stops when $\delta^{(t)} < \delta_0$ for a sufficiently small $\delta_0$, such as $10^{-6}$.
The initial values $\hat{\pi}_i^{(0)}$ and $\hat{\kappa}^{(0)}$ can be set somewhat arbitrarily, provided they lie within their allowed regions. The admissible range of the nuisance parameter $\kappa$ for each model is discussed in the corresponding subsection below.
It should be noted that, in addition to the four parametric models listed in Table 2, both the independence model and the saturated model are free of nuisance parameters and allow closed-form solutions for the MLEs of the probabilities p r i and π i . As a result, these models do not require the iteration procedure described above.

2.1. Rosner’s Model

Rosner [1] proposed a “constant R model” that assumes equal dependence between the two eyes of the same person for ophthalmologic data. More specifically, it assumes that the probability of a cured eye at one site, given a cured eye at the other site, for the $j$-th subject in the $i$-th group is proportional to the prevalence rate for the $i$-th group by a constant factor $R$, i.e.,
$$Pr(Z_{ijk} = 1 \mid Z_{ij,3-k} = 1) = R \cdot Pr(Z_{ijk} = 1) = R \pi_i,$$
for $k = 1, 2$ denoting the left and right eye, respectively. The intra-subject correlation then takes the form $\text{Corr}(Z_{ijk}, Z_{ij,3-k}) = (R - 1)\pi_i / (1 - \pi_i)$. Clinically, the parameter $R$ captures the symmetric dependence between the two eyes of a subject, quantifying the degree of intra-subject correlation in bilateral outcomes. The admissible region of $R$ is determined by the constraints on the probabilities and the correlation. It can be shown that $R$ satisfies $0 < R \leq 1/a$ if $a \leq 1/2$, and $(2 - 1/a)/a \leq R \leq 1/a$ if $a > 1/2$, with $a = \max_{i = 1, \ldots, g} \pi_i$ [12].
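To make the admissible region concrete, the sketch below computes the bounds on $R$ for a given set of group prevalences together with the implied intra-subject correlation. The function names `rosner_r_bounds` and `rosner_corr` are illustrative and are not taken from the paper.

```python
def rosner_r_bounds(pis):
    """Return (lower, upper) bounds on R for the group prevalences in pis.

    For a = max(pis) <= 1/2 the lower bound 0 is exclusive (0 < R <= 1/a);
    for a > 1/2 the range is (2 - 1/a)/a <= R <= 1/a.
    """
    a = max(pis)
    upper = 1.0 / a
    lower = (2.0 - 1.0 / a) / a if a > 0.5 else 0.0
    return lower, upper


def rosner_corr(r, pi):
    """Intra-subject correlation implied by Rosner's model: (R - 1) * pi / (1 - pi)."""
    return (r - 1.0) * pi / (1.0 - pi)


print(rosner_r_bounds([0.3, 0.4]))  # upper bound is 1 / 0.4
print(rosner_r_bounds([0.8]))       # lower bound becomes active when a > 1/2
print(rosner_corr(1.5, 0.4))        # R > 1 gives positive correlation
```

Note that $R = 1$ gives zero correlation for every group, which is why the independence model arises as a special case.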
Taking $\kappa = R$, the normal equations in (5) are derived as
$$\frac{\partial l(\beta)}{\partial \pi_i} = \frac{2 m_{0i} (R \pi_i - 1)}{1 - 2 \pi_i + R \pi_i^2} + \frac{m_{1i} (1 - 2 R \pi_i)}{\pi_i (1 - R \pi_i)} + \frac{2 m_{2i}}{\pi_i} - \frac{n_{0i}}{1 - \pi_i} + \frac{n_{1i}}{\pi_i} = 0, \quad i = 1, \ldots, g, \tag{7a}$$
$$\frac{\partial l(\beta)}{\partial R} = \sum_{i=1}^{g} \left[ \frac{m_{0i} \pi_i^2}{1 - 2 \pi_i + R \pi_i^2} + \frac{m_{1i} \pi_i}{R \pi_i - 1} \right] + \frac{m_{2+}}{R} = 0. \tag{7b}$$
Multiplying Equation (7a) by $\pi_i (1 - \pi_i)(1 - R \pi_i)(1 - 2 \pi_i + R \pi_i^2)$ leads to a quartic equation with respect to $\pi_i$,
$$a_i \pi_i^4 + b_i \pi_i^3 + c_i \pi_i^2 + d_i \pi_i + e_i = 0, \tag{8}$$
where the coefficients are
$$\begin{aligned}
a_i &= (2 m_{+i} + n_{+i}) R^2, \\
b_i &= -\left[ 2(R + 2) m_{0i} + (2R + 5) m_{1i} + 2(R + 3) m_{2i} + 3 n_{0i} + (R + 3) n_{1i} \right] R, \\
c_i &= (4R + 2) m_{0i} + (7R + 2) m_{1i} + 4(2R + 1) m_{2i} + (R + 2) n_{0i} + 2(2R + 1) n_{1i}, \\
d_i &= -\left[ 2 m_{0i} + (2R + 3) m_{1i} + 2(R + 3) m_{2i} + n_{0i} + (R + 3) n_{1i} \right], \\
e_i &= m_{1i} + 2 m_{2i} + n_{1i}.
\end{aligned}$$
Therefore, at Step 1 in the iteration procedure, $\hat{\pi}_i^{(t)}$ is a real root of the quartic Equation (8) for a given $\hat{R}^{(t)}$ [12]. The first derivative (the middle expression in the normal Equation (7b)) and the second derivative of the log-likelihood with respect to $R$,
$$\frac{\partial^2 l(\beta)}{\partial R^2} = -\sum_{i=1}^{g} \left[ \frac{m_{0i} \pi_i^4}{(R \pi_i^2 - 2 \pi_i + 1)^2} + \frac{m_{1i} \pi_i^2}{(R \pi_i - 1)^2} \right] - \frac{m_{2+}}{R^2},$$
are used to update $\hat{R}^{(t+1)}$ in the $(t+1)$-th iteration, as described at Step 2 in the iteration procedure.

2.2. Donner’s Model

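With Donner’s approach [3], the correlation between the outcomes of paired organs of the same subject is assumed to be the same across the sample, such that
$$\text{Corr}(Z_{ijk}, Z_{ij,3-k}) = \rho, \qquad \rho \in [-1, 1].$$
Taking $\kappa = \rho$, the normal equations have the following form:
$$\frac{\partial l(\beta)}{\partial \pi_i} = \frac{m_{0i} \left[ 2(1 - \rho) \pi_i + \rho - 2 \right]}{(1 - \pi_i) \left[ 1 - (1 - \rho) \pi_i \right]} + \frac{m_{1i} (1 - 2 \pi_i)}{\pi_i (1 - \pi_i)} + \frac{m_{2i} \left[ 2(1 - \rho) \pi_i + \rho \right]}{\pi_i \left[ \rho (1 - \pi_i) + \pi_i \right]} - \frac{n_{0i}}{1 - \pi_i} + \frac{n_{1i}}{\pi_i} = 0, \tag{11a}$$
$$\frac{\partial l(\beta)}{\partial \rho} = \sum_{i=1}^{g} \left[ \frac{m_{0i} \pi_i}{1 - (1 - \rho) \pi_i} + \frac{m_{2i} (1 - \pi_i)}{\rho (1 - \pi_i) + \pi_i} \right] - \frac{m_{1+}}{1 - \rho} = 0, \tag{11b}$$
where a cubic equation
$$a_i \pi_i^3 + b_i \pi_i^2 + c_i \pi_i + d_i = 0, \tag{12}$$
can be derived from (11a) by multiplying through by $\pi_i (1 - \pi_i) [1 - (1 - \rho) \pi_i] [\rho (1 - \pi_i) + \pi_i]$, and the respective coefficients take the form
$$\begin{aligned}
a_i &= (1 - \rho)^2 (2 m_{+i} + n_{+i}), \\
b_i &= (1 - \rho) \left[ (3\rho - 2) m_{0i} + 3(\rho - 1) m_{1i} + (3\rho - 4) m_{2i} + (\rho - 1) n_{0i} + 2(\rho - 1) n_{1i} \right], \\
c_i &= \rho(\rho - 2) m_{0i} + \left[ \rho(\rho - 4) + 1 \right] m_{1i} + \left[ \rho(\rho - 4) + 2 \right] m_{2i} - \rho\, n_{0i} + \left[ \rho(\rho - 3) + 1 \right] n_{1i}, \\
d_i &= \rho (m_{1i} + m_{2i} + n_{1i}).
\end{aligned}$$
Therefore, at Step 1 in the iteration procedure, $\hat{\pi}_i^{(t)}$ is a real root of the cubic Equation (12) for a given $\hat{\rho}^{(t)}$ [14]. The first derivative, shown in the middle expression in (11b), and the second derivative of the log-likelihood with respect to $\rho$,
$$\frac{\partial^2 l(\beta)}{\partial \rho^2} = -\sum_{i=1}^{g} \left[ \frac{m_{0i} \pi_i^2}{\left[ 1 - (1 - \rho) \pi_i \right]^2} + \frac{m_{2i} (1 - \pi_i)^2}{\left[ \rho (1 - \pi_i) + \pi_i \right]^2} \right] - \frac{m_{1+}}{(1 - \rho)^2},$$
are used to update $\hat{\rho}^{(t+1)}$ in the $(t+1)$-th iteration described in the iteration procedure.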

2.3. Dallal’s Model

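Under Dallal’s model, the conditional probability is assumed to be a constant [2], i.e.,
$$Pr(Z_{ijk} = 1 \mid Z_{ij,3-k} = 1) = \gamma.$$
The resulting intra-subject correlation is $\text{Corr}(Z_{ijk}, Z_{ij,3-k}) = (\gamma - \pi_i) / (1 - \pi_i)$. Given the constraints on the probabilities and the correlation, it can be shown that the admissible region of $\gamma$ is $0 \leq \gamma \leq 1$ if $a \leq 1/2$, and $2 - 1/a \leq \gamma \leq 1$ if $a > 1/2$, where $a = \max_{i = 1, \ldots, g} \pi_i$.
Taking the nuisance parameter $\kappa = \gamma$, the normal equations take the form
$$\frac{\partial l(\beta)}{\partial \pi_i} = -\frac{(2 - \gamma) m_{0i}}{1 - (2 - \gamma) \pi_i} + \frac{m_{1i}}{\pi_i} + \frac{m_{2i}}{\pi_i} - \frac{n_{0i}}{1 - \pi_i} + \frac{n_{1i}}{\pi_i} = 0, \tag{15a}$$
$$\frac{\partial l(\beta)}{\partial \gamma} = \sum_{i=1}^{g} \frac{m_{0i} \pi_i}{1 - (2 - \gamma) \pi_i} - \frac{m_{1+}}{1 - \gamma} + \frac{m_{2+}}{\gamma} = 0, \tag{15b}$$
where multiplying Equation (15a) by $\pi_i (1 - \pi_i) [1 - (2 - \gamma) \pi_i]$ gives rise to the quadratic equation
$$a_i \pi_i^2 + b_i \pi_i + c_i = 0, \tag{16}$$
with the coefficients
$$\begin{aligned}
a_i &= (2 - \gamma)(m_{+i} + n_{+i}), \\
b_i &= -(2 - \gamma) m_{0i} - (3 - \gamma) m_{1i} - (3 - \gamma) m_{2i} - n_{0i} - (3 - \gamma) n_{1i}, \\
c_i &= m_{1i} + m_{2i} + n_{1i}.
\end{aligned}$$
The smaller root of the quadratic Equation (16) leads to the maximum of the log-likelihood and is thus used at Step 1 in the iteration procedure. The first derivative (the middle expression in (15b)) and the second derivative of the log-likelihood with respect to $\gamma$,
$$\frac{\partial^2 l(\beta)}{\partial \gamma^2} = -\sum_{i=1}^{g} \frac{m_{0i} \pi_i^2}{\left[ 1 - (2 - \gamma) \pi_i \right]^2} - \frac{m_{1+}}{(1 - \gamma)^2} - \frac{m_{2+}}{\gamma^2},$$
are used to update $\hat{\gamma}^{(t+1)}$ in the $(t+1)$-th iteration.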
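Because Step 1 has a closed form under Dallal’s model, the whole iteration procedure is short enough to sketch in code. The following Python sketch alternates the smaller root of the quadratic (16) with one Newton–Raphson step for $\gamma$. The function name `fit_dallal`, the starting value, the clamping of $\gamma$ to the open interval $(0, 1)$, and the example counts are all assumptions made for illustration. Edge cases such as empty cells are not handled.

```python
import math


def fit_dallal(m, n, gamma0=0.5, tol=1e-6, max_iter=500):
    """Sketch of the two-step iteration for Dallal's model.

    m: list of (m0, m1, m2) bilateral counts per group
    n: list of (n0, n1) unilateral counts per group
    Returns (list of pi estimates, gamma estimate).
    """
    m1_plus = sum(row[1] for row in m)
    m2_plus = sum(row[2] for row in m)
    gamma = gamma0
    for _ in range(max_iter):
        # Step 1: pi_i is the smaller root of a_i*pi^2 + b_i*pi + c_i = 0.
        pis = []
        for (m0, m1, m2), (n0, n1) in zip(m, n):
            c = m1 + m2 + n1
            a = (2.0 - gamma) * (m0 + m1 + m2 + n0 + n1)
            b = -((2.0 - gamma) * m0 + (3.0 - gamma) * c + n0)
            disc = max(b * b - 4.0 * a * c, 0.0)
            pis.append((-b - math.sqrt(disc)) / (2.0 * a))
        # Step 2: one Newton-Raphson step for gamma with the pi_i held fixed.
        score = -m1_plus / (1.0 - gamma) + m2_plus / gamma
        hess = -m1_plus / (1.0 - gamma) ** 2 - m2_plus / gamma ** 2
        for (m0, _, _), p in zip(m, pis):
            d = 1.0 - (2.0 - gamma) * p
            score += m0 * p / d
            hess -= m0 * p * p / (d * d)
        # Clamping keeps gamma inside (0, 1); this safeguard is an assumption.
        new_gamma = min(max(gamma - score / hess, 1e-8), 1.0 - 1e-8)
        if abs(new_gamma - gamma) < tol:
            return pis, new_gamma
        gamma = new_gamma
    raise RuntimeError("iteration did not converge")


# Hypothetical counts constructed to match gamma = 0.75 and pi = (0.4, 0.5) exactly.
m = [(50, 20, 30), (30, 20, 30)]
n = [(30, 20), (20, 20)]
pis, gamma = fit_dallal(m, n)
print(pis, gamma)
```

The example counts were chosen so that the observed proportions coincide with the model probabilities, which makes it easy to see that the iteration recovers the generating parameters.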

2.4. Clayton Copula Model

According to Sklar’s theorem [21], every joint cumulative distribution function (CDF) of a random vector can be expressed in terms of its marginal CDFs and a copula $C$. In particular, for the paired organ data, it takes the form
$$Pr(Z_{ij1} \leq z_{ij1}, Z_{ij2} \leq z_{ij2}) = C\left( F_{i1}(z_{ij1}), F_{i2}(z_{ij2}) \right).$$
Given $\mathbf{Z}_{ij} = (Z_{ij1}, Z_{ij2})^T \sim \text{Bernoulli}(\pi_i \mathbf{1}_2)$, i.e., each margin is Bernoulli($\pi_i$), the joint probabilities can be written as
$$\begin{aligned}
p_{0i} &= Pr(Z_{ij1} = 0, Z_{ij2} = 0) = C\left( F_{i1}(0), F_{i2}(0) \right) = C(1 - \pi_i, 1 - \pi_i), \\
p_{1i} &= \sum_{k=1}^{2} Pr(Z_{ijk} = 1, Z_{ij,3-k} = 0) = \sum_{k=1}^{2} \left[ Pr(Z_{ij,3-k} = 0) - Pr(Z_{ijk} = 0, Z_{ij,3-k} = 0) \right] = 2 \left[ (1 - \pi_i) - C(1 - \pi_i, 1 - \pi_i) \right], \\
p_{2i} &= Pr(Z_{ij1} = 1, Z_{ij2} = 1) = 1 - p_{0i} - p_{1i} = 2 \pi_i - 1 + C(1 - \pi_i, 1 - \pi_i).
\end{aligned} \tag{19}$$
Comparing Equation (19) with Equation (2), it follows that
$$\text{Corr}(Z_{ij1}, Z_{ij2}) = \frac{C(1 - \pi_i, 1 - \pi_i) - (1 - \pi_i)^2}{\pi_i (1 - \pi_i)}.$$
The Clayton copula is a type of Archimedean copula that models dependence with a single parameter ($\theta$) and is denoted by $C_\theta$. It is particularly suited for modeling lower tail dependence [5]. Liang et al. [22] utilize the Clayton copula to test the homogeneity of two proportions for correlated bilateral data, where the copula is defined as
$$C_\theta(u, v) = \left( u^{-\theta} + v^{-\theta} - 1 \right)^{-1/\theta}, \quad \theta > 0, \tag{21}$$
for $0 \leq u, v \leq 1$. Note that the full expression of the Clayton copula is $C_\theta(u, v) = \left[ \max\left( u^{-\theta} + v^{-\theta} - 1, 0 \right) \right]^{-1/\theta}$ for $\theta \in [-1, 0) \cup (0, \infty)$. When $\theta \in (0, \infty)$, the copula exhibits positive dependence and lower tail dependence. When $\theta \in (-1, 0)$, it models negative dependence with vanishing lower tail dependence.
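The mapping from $(\pi_i, \theta)$ to the joint probabilities in (19) and to the implied correlation can be checked numerically. The sketch below covers only the $\theta > 0$ form in (21). The function names `clayton` and `clayton_joint_probs` are illustrative and not from the paper.

```python
def clayton(u, v, theta):
    """Clayton copula C_theta(u, v) for theta > 0 and 0 < u, v <= 1."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_joint_probs(pi, theta):
    """Return ((p0, p1, p2), corr) implied by the Clayton copula model."""
    c = clayton(1.0 - pi, 1.0 - pi, theta)
    p0 = c
    p1 = 2.0 * ((1.0 - pi) - c)
    p2 = 2.0 * pi - 1.0 + c
    corr = (c - (1.0 - pi) ** 2) / (pi * (1.0 - pi))
    return (p0, p1, p2), corr


# A larger theta gives stronger positive dependence; a theta near 0 approaches independence.
print(clayton_joint_probs(0.3, 2.0))
print(clayton_joint_probs(0.3, 1e-4))
```

For $\theta$ close to zero the correlation is close to zero, in line with the independence model being the limiting case $\theta \to 0^+$.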
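Using the copula form in (21), the normal equations can be written as
$$\begin{aligned}
\frac{\partial l(\beta)}{\partial \pi_i} = {} & -\frac{2 m_{0i}}{(1 - \pi_i) \left[ 2 - (1 - \pi_i)^{\theta} \right]} + \frac{2 m_{1i} \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{-(1 + \theta)/\theta} (1 - \pi_i)^{-1 - \theta} - m_{1i}}{1 - \pi_i - \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{-1/\theta}} \\
& + \frac{2 m_{2i} \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{-(1 + \theta)/\theta} (1 - \pi_i)^{-1 - \theta} - 2 m_{2i}}{1 - 2 \pi_i - \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{-1/\theta}} - \frac{n_{0i}}{1 - \pi_i} + \frac{n_{1i}}{\pi_i} = 0,
\end{aligned} \tag{22a}$$
$$\begin{aligned}
\frac{\partial l(\beta)}{\partial \theta} = \theta^{-2} \sum_{i=1}^{g} & \left\{ \log\left[ 2 (1 - \pi_i)^{-\theta} - 1 \right] + \frac{2 \theta \log(1 - \pi_i)}{2 - (1 - \pi_i)^{\theta}} \right\} \\
& \times \left\{ m_{0i} + \frac{m_{1i}}{1 - (1 - \pi_i) \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{1/\theta}} + \frac{m_{2i}}{1 - (1 - 2 \pi_i) \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{1/\theta}} \right\} = 0.
\end{aligned} \tag{22b}$$
In the previous models, the first normal equation ($\partial l(\beta) / \partial \pi_i = 0$) reduces to a polynomial equation, so that an analytic solution for $\hat{\pi}_i^{(t)}$ can be found for a given $\hat{\kappa}^{(t)}$ at the $t$-th iteration. Under the Clayton copula model no such reduction is available, and the root of Equation (22a) is evaluated numerically. The second derivative of the log-likelihood with respect to $\theta$,
$$\frac{\partial^2 l(\beta)}{\partial \theta^2} = \sum_{i=1}^{g} \sum_{k=0}^{2} c_{ki}\, m_{ki},$$
along with the first derivative shown in (22b), is used to update $\hat{\theta}^{(t+1)}$ at the $(t+1)$-th iteration, where the coefficients $c_{ki}$ ($k = 0, 1, 2$) are shown below.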
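$$\begin{aligned}
c_{0i} &= D_i', \\
c_{1i} &= -\frac{C_i \left[ (1 - \pi_i) D_i^2 + (1 - \pi_i - C_i) D_i' \right]}{(1 - \pi_i - C_i)^2}, \\
c_{2i} &= \frac{C_i \left[ (2 \pi_i - 1) D_i^2 + (2 \pi_i - 1 + C_i) D_i' \right]}{(2 \pi_i - 1 + C_i)^2},
\end{aligned}$$
where $c_{ki} = \partial^2 \log p_{ki} / \partial \theta^2$ is written compactly in terms of
$$\begin{aligned}
C_i &= \left[ 2 (1 - \pi_i)^{-\theta} - 1 \right]^{-1/\theta}, \\
D_i &= \frac{\partial \log C_i}{\partial \theta} = \theta^{-2} \left\{ \log\left[ 2 (1 - \pi_i)^{-\theta} - 1 \right] + \frac{2 \theta \log(1 - \pi_i)}{2 - (1 - \pi_i)^{\theta}} \right\}, \\
D_i' &= \frac{\partial D_i}{\partial \theta} = -\frac{2}{\theta^3} \left\{ \log\left[ 2 (1 - \pi_i)^{-\theta} - 1 \right] + \frac{2 \theta \log(1 - \pi_i)}{2 - (1 - \pi_i)^{\theta}} - \frac{\theta^2 (1 - \pi_i)^{\theta} \left[ \log(1 - \pi_i) \right]^2}{\left[ 2 - (1 - \pi_i)^{\theta} \right]^2} \right\}.
\end{aligned}$$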

2.5. Independence Model

The independence model assumes no correlation between the two paired organs of the same person, i.e., $\text{Corr}(Z_{ij1}, Z_{ij2}) = 0$. Thus, this model is free of nuisance parameters, and the MLEs of $\pi_i$ can be directly obtained by solving the normal equation
$$\frac{\partial l(\beta)}{\partial \pi_i} = \frac{-2 \pi_i m_{0i} + (1 - 2 \pi_i) m_{1i} + 2 (1 - \pi_i) m_{2i} - \pi_i n_{0i} + (1 - \pi_i) n_{1i}}{\pi_i (1 - \pi_i)} = 0,$$
which yields a closed-form solution:
$$\hat{\pi}_i = \frac{m_{1i} + 2 m_{2i} + n_{1i}}{2 m_{+i} + n_{+i}}, \quad i = 1, \ldots, g.$$
It should be noted that the independence model is a special case of several of the aforementioned models, namely Rosner’s model, Donner’s model, and the Clayton copula model. It is recovered from Rosner’s model as $R \to 1$, from Donner’s model as $\rho \to 0$, and from the Clayton copula model as $\theta \to 0^+$.

2.6. Saturated Model

The saturated model treats each joint probability $p_{ki}$ and the marginal probability $\pi_i$ as free parameters, subject to the constraint $\sum_{k=0}^{2} p_{ki} = 1$. This model serves as a reference (or “full”) model in the goodness-of-fit test, as it imposes no structural assumptions on the data. The log-likelihood is given by Equation (3). Using the method of Lagrange multipliers, the MLEs of the parameters are
$$\hat{p}_{ki} = \frac{m_{ki}}{m_{+i}}, \qquad \hat{\pi}_i = \frac{n_{1i}}{n_{+i}},$$
for $k = 0, 1, 2$ and $i = 1, \ldots, g$.
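The closed-form estimates of the independence and saturated models are simple enough to state directly in code. The sketch below is a minimal illustration. The function names and the example counts are hypothetical.

```python
def independence_mle(m, n):
    """pi_hat_i = (m1 + 2*m2 + n1) / (2*m_plus + n_plus) for each group."""
    return [
        (m1 + 2 * m2 + n1) / (2 * (m0 + m1 + m2) + n0 + n1)
        for (m0, m1, m2), (n0, n1) in zip(m, n)
    ]


def saturated_mle(m, n):
    """Return per-group ((p0_hat, p1_hat, p2_hat), pi_hat) as observed proportions."""
    estimates = []
    for (m0, m1, m2), (n0, n1) in zip(m, n):
        m_plus, n_plus = m0 + m1 + m2, n0 + n1
        estimates.append(((m0 / m_plus, m1 / m_plus, m2 / m_plus), n1 / n_plus))
    return estimates


# Hypothetical counts for two groups.
m = [(50, 20, 30), (30, 20, 30)]
n = [(30, 20), (20, 20)]
print(independence_mle(m, n))
print(saturated_mle(m, n))
```

The independence estimate pools all observed organs (two per bilateral subject, one per unilateral subject), whereas the saturated model estimates the bilateral and unilateral parts separately.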

3. Methods for Goodness-of-Fit Test

Three commonly used test statistics are employed to assess the goodness-of-fit of the model: the deviance (likelihood ratio), the Pearson chi-square, and the adjusted chi-square tests. These are defined as follows.
  • Deviance
    $$G^2 = 2 \sum_{r,i} O_{ri} \log \frac{O_{ri}}{E_{ri}} = 2 \sum_{i=1}^{g} \left[ \sum_{r=0}^{2} m_{ri} \log \frac{m_{ri}}{m_{+i} \hat{p}_{ri}} + n_{0i} \log \frac{n_{0i}}{n_{+i} (1 - \hat{\pi}_i)} + n_{1i} \log \frac{n_{1i}}{n_{+i} \hat{\pi}_i} \right], \tag{27}$$
  • Pearson Chi-square
    $$X^2 = \sum_{r,i} \frac{(O_{ri} - E_{ri})^2}{E_{ri}} = \sum_{i=1}^{g} \left[ \sum_{r=0}^{2} \frac{(m_{ri} - m_{+i} \hat{p}_{ri})^2}{m_{+i} \hat{p}_{ri}} + \frac{\left[ n_{0i} - n_{+i} (1 - \hat{\pi}_i) \right]^2}{n_{+i} (1 - \hat{\pi}_i)} + \frac{(n_{1i} - n_{+i} \hat{\pi}_i)^2}{n_{+i} \hat{\pi}_i} \right], \tag{28}$$
  • Adjusted Chi-square
    $$X^2_{adj} = \sum_{r,i} \frac{(|O_{ri} - E_{ri}| - 1/2)^2}{E_{ri}} = \sum_{i=1}^{g} \left[ \sum_{r=0}^{2} \frac{(|m_{ri} - m_{+i} \hat{p}_{ri}| - 1/2)^2}{m_{+i} \hat{p}_{ri}} + \frac{\left( |n_{0i} - n_{+i} (1 - \hat{\pi}_i)| - 1/2 \right)^2}{n_{+i} (1 - \hat{\pi}_i)} + \frac{(|n_{1i} - n_{+i} \hat{\pi}_i| - 1/2)^2}{n_{+i} \hat{\pi}_i} \right], \tag{29}$$
    where $O_{ri}$ and $E_{ri}$ denote the observed and expected numbers of individuals with $r$ responses in the $i$-th group, respectively. In addition, $\hat{\pi}_i$ is the MLE of $\pi_i$, and $\hat{p}_{ri}$ is the MLE of $p_{ri}$, determined by $\hat{\pi}_i$ and $\hat{\kappa}$, where $\hat{\kappa}$ denotes the MLE of the nuisance parameter in the respective parametric model. The adjusted chi-square ($X^2_{adj}$) statistic introduces a continuity correction (subtraction of $1/2$ from $|O_{ri} - E_{ri}|$), which is especially helpful when the sample sizes are small and the expected counts are low [23]. By slightly reducing the magnitude of the test statistic, the continuity correction also tends to counteract any liberal behavior of the Pearson chi-square ($X^2$) test. All three statistics asymptotically follow a chi-square distribution under the null hypothesis, with the degrees of freedom (df) equal to the difference in the number of free parameters between the saturated model and the parametric model under consideration (excluding the independence model, which has no nuisance parameter), i.e., $df = 3g - (g + 1) = 2g - 1$. Two remarks apply: (1) for the saturated model, there are three free parameters in each group, two in the bilateral portion and one in the unilateral portion of the data, whereas a parametric model with one nuisance parameter has $g + 1$ parameters; (2) for purely bilateral data ($n_{ri} = 0$, $r = 0, 1$; $i = 1, \ldots, g$), the df is $g - 1$, since the saturated model then has $2g$ free parameters.
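The three statistics share the same observed/expected structure, so they can be computed from two flat lists of cell counts. The sketch below is a minimal illustration. The function names are hypothetical, and treating cells with $O_{ri} = 0$ as contributing zero to $G^2$ is a convention assumed here rather than stated in the text.

```python
import math


def gof_statistics(observed, expected):
    """Return (G2, X2, X2_adj) for matching lists of observed and expected counts."""
    pairs = list(zip(observed, expected))
    # Cells with O = 0 are skipped in G2 (the 0 * log 0 = 0 convention, assumed here).
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
    x2 = sum((o - e) ** 2 / e for o, e in pairs)
    x2_adj = sum((abs(o - e) - 0.5) ** 2 / e for o, e in pairs)
    return g2, x2, x2_adj


def degrees_of_freedom(g):
    """df = 3g - (g + 1) = 2g - 1 for a one-nuisance-parameter model with combined data."""
    return 3 * g - (g + 1)


# Toy cell counts, purely for illustration.
print(gof_statistics([10, 20], [15, 15]))
print(degrees_of_freedom(4))
```

In this toy example the continuity correction lowers the Pearson statistic, consistent with the more conservative behavior of $X^2_{adj}$ described above.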
Additionally, we consider three bootstrap methods ($B_1$, $B_2$, $B_3$) that use the statistic $G^2$ in (27), the statistic $X^2$ in (28), and the probability of the observed table,
$$Pr(\mathbf{m}, \mathbf{n} \mid H_0) = \prod_{i=1}^{g} \left[ \frac{m_{+i}!}{m_{0i}!\, m_{1i}!\, m_{2i}!}\, \hat{p}_{0i}^{\,m_{0i}}\, \hat{p}_{1i}^{\,m_{1i}}\, \hat{p}_{2i}^{\,m_{2i}} \cdot \frac{n_{+i}!}{n_{0i}!\, n_{1i}!}\, (1 - \hat{\pi}_i)^{n_{0i}}\, \hat{\pi}_i^{\,n_{1i}} \right], \tag{30}$$
respectively, to order the bootstrap samples. The bootstrap procedure is outlined as follows.
  • Generate $N_B$ bootstrap samples using the estimated parameters $\hat{\pi}_i$ and $\hat{p}_{ri}$ obtained from the observed data.
  • For each bootstrap sample, re-estimate the parameters $\hat{\pi}_i$ and $\hat{p}_{ri}$, then compute the corresponding test statistics: $G^2$, $X^2$, and $Pr(\mathbf{m}, \mathbf{n} \mid H_0)$.
  • For each method, compare the bootstrap statistic to the observed statistic from the original data. Count the number of bootstrap samples for which $G^2$ or $X^2$ is greater than the observed value, or $Pr(\mathbf{m}, \mathbf{n} \mid H_0)$ is less than the observed probability. The null hypothesis is rejected if the proportion of such samples falls below a prespecified critical value (e.g., if the number of such samples is less than $5\% \cdot N_B$).
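The bootstrap steps above can be sketched compactly for the $G^2$-ordered method ($B_1$). To keep the example self-contained, the null model here is the independence model, whose MLE is available in closed form. For the other parametric models the fitting step would be replaced by the iterative procedure of Section 2. All function names, the seed, and the example counts are assumptions made for illustration.

```python
import math
import random


def fit_independence(m, n):
    """Closed-form fit; returns per-group (pi_hat, [p0, p1, p2], m_plus, n_plus)."""
    fits = []
    for (m0, m1, m2), (n0, n1) in zip(m, n):
        m_plus, n_plus = m0 + m1 + m2, n0 + n1
        pi = (m1 + 2 * m2 + n1) / (2 * m_plus + n_plus)
        fits.append((pi, [(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2], m_plus, n_plus))
    return fits


def deviance(m, n, fits):
    """G2 over all bilateral and unilateral cells; cells with O = 0 contribute 0."""
    g2 = 0.0
    for mi, ni, (pi, probs, m_plus, n_plus) in zip(m, n, fits):
        obs = list(mi) + list(ni)
        exp = [m_plus * p for p in probs] + [n_plus * (1 - pi), n_plus * pi]
        g2 += 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    return g2


def bootstrap_b1(m, n, n_boot=500, seed=0):
    """Proportion of bootstrap samples whose G2 exceeds the observed G2."""
    rng = random.Random(seed)
    fits = fit_independence(m, n)
    g2_obs = deviance(m, n, fits)
    exceed = 0
    for _ in range(n_boot):
        mb, nb = [], []
        for pi, probs, m_plus, n_plus in fits:
            # Draw bilateral counts from the fitted multinomial.
            draws = rng.choices(range(3), weights=probs, k=m_plus)
            mb.append(tuple(draws.count(r) for r in range(3)))
            # Draw unilateral counts from the fitted binomial.
            n1 = sum(rng.random() < pi for _ in range(n_plus))
            nb.append((n_plus - n1, n1))
        # Re-estimate on the bootstrap sample before computing its statistic.
        if deviance(mb, nb, fit_independence(mb, nb)) > g2_obs:
            exceed += 1
    return exceed / n_boot


# Strongly concordant hypothetical data: independence should be rejected at the 5% level.
p_value = bootstrap_b1([(45, 10, 45), (45, 10, 45)], [(25, 25), (25, 25)], n_boot=200)
print(p_value, p_value < 0.05)
```

The $B_2$ and $B_3$ variants follow the same loop with $X^2$ or the table probability in (30) replacing the deviance. For $B_3$ the comparison direction is reversed, since smaller probabilities are more extreme.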

4. Simulation Study

We conduct a simulation study to assess the performance of the six goodness-of-fit methods described in Section 3 by investigating their empirical type I error rates and powers under Rosner’s, Donner’s, Dallal’s, and the Clayton copula models introduced in Section 2.

4.1. Empirical Type I Error (TIE)

We consider equal sample sizes $(\mathbf{m}_+, \mathbf{n}_+) = 25, 50, 100$ for $g = 2, 4, 8$, where $(\mathbf{m}_+, \mathbf{n}_+)$ is shorthand for
$$(\mathbf{m}_+, \mathbf{n}_+) = (m_{+1}, \ldots, m_{+g}, n_{+1}, \ldots, n_{+g}),$$
so that every $m_{+i}$ and $n_{+i}$ equals the stated value. The true values of the parameters under the null hypothesis $H_0$ are given in Table 3 and Table 4. The procedure for computing the empirical type I error rates is outlined as follows.
1. Generate a dataset with a designed sample size and parameter configuration as one of the combinations given in Table 3 and Table 4.
2a. Calculate the MLEs $\hat{\pi}_i$ and $\hat{p}_{ri}$ for each model and substitute them into Equations (27)–(29) to obtain the three statistics ($G^2$, $X^2$, and $X^2_{adj}$). Reject $H_0$ if the statistic exceeds $\chi^2_{1 - \alpha;\, 2g - 1}$.
2b. For the bootstrap methods ($B_1$, $B_2$, $B_3$), use the dataset generated in Step 1 as the observed data, and follow the bootstrap procedure described under Equation (30) in Section 3.
3. Replicate the above simulation $N$ times; the empirical type I error rate is then computed as the ratio of the number of rejections to $N$, i.e.,
$$\widehat{\text{TIE}} = \frac{\#\text{ of rejections}}{N}.$$
In what follows, we set $N = 10{,}000$ and $N_B = 2000$ to generate the simulation results.
Table 5, Table 6, Table 7 and Table 8 summarize the empirical type I error rates across all scenarios under Rosner’s, Donner’s, Dallal’s, and the Clayton copula models, respectively. These results should be compared against the prespecified nominal level $\alpha$. Following the definition in Tang et al. [24], a statistical test is considered liberal if the ratio of its empirical type I error rate to the nominal type I error rate exceeds $1.2$, conservative if the ratio is below $0.8$, and robust if the ratio lies between $0.8$ and $1.2$. Thus, for a nominal level of $\alpha = 0.05$, a liberal test has an empirical type I error rate greater than $0.06$, a conservative test has a rate smaller than $0.04$, and a robust test has a rate between $0.04$ and $0.06$. The robust results in these tables are highlighted in boldface, while the conservative results are shown in italics.
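The liberal/conservative/robust rule is a simple threshold on the ratio of empirical to nominal error rate. The following sketch encodes it. The function name `classify_tie` is illustrative, and the inputs are rates expressed as proportions rather than percentages.

```python
def classify_tie(empirical_rate, alpha=0.05):
    """Classify an empirical type I error rate relative to the nominal level alpha."""
    ratio = empirical_rate / alpha
    if ratio > 1.2:
        return "liberal"
    if ratio < 0.8:
        return "conservative"
    return "robust"


# Rates of 4.87%, 1.29%, and 7.73% at the 5% nominal level.
print(classify_tie(0.0487))  # robust
print(classify_tie(0.0129))  # conservative
print(classify_tie(0.0773))  # liberal
```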

4.1.1. TIE Under Rosner’s Model

As shown in Table 5 below, all the results from the three bootstrap methods ($B_1$, $B_2$, $B_3$) are robust. However, some of the results from the deviance ($G^2$) test appear conservative when the sample size is small under certain null hypothesis scenarios. In particular, conservative results are observed when $(\mathbf{m}_+, \mathbf{n}_+) = 25$ in Cases III–VI and when $(\mathbf{m}_+, \mathbf{n}_+) = 50$ in Case V. When the sample size is large ($(\mathbf{m}_+, \mathbf{n}_+) = 100$), all results are robust. The Pearson chi-square ($X^2$) test performs similarly to the $G^2$ test, though it yields a few additional conservative results in small sample size settings. For the adjusted chi-square ($X^2_{adj}$) test, all empirical type I error rates are notably small across all scenarios considered, indicating that this test is overly conservative.
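Table 5. The empirical type I error rates (in %) under Rosner’s model, at the nominal level of $\alpha = 0.05$, where the robust and conservative results are highlighted in bold and italics, respectively.
$(\mathbf{m}_+, \mathbf{n}_+)$ | $g$ | Case | $R$ | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$
25 | 2 | I | 1.2 | 4.87 | 4.52 | 1.29 | 4.70 | 4.74 | 5.13
25 | 2 | I | 1.5 | 5.70 | 5.29 | 1.53 | 5.56 | 5.62 | 5.42
25 | 2 | I | 1.8 | 5.22 | 4.72 | 1.38 | 4.82 | 4.90 | 4.38
25 | 2 | II | 1.2 | 5.10 | 4.71 | 1.52 | 4.58 | 4.63 | 4.93
25 | 2 | II | 1.5 | 5.69 | 5.24 | 1.45 | 5.24 | 5.14 | 5.48
25 | 2 | II | 1.8 | 4.99 | 4.53 | 1.49 | 4.75 | 4.75 | 3.97
25 | 4 | III | 1.2 | 3.14 | 4.02 | 0.32 | 4.63 | 4.23 | 4.64
25 | 4 | III | 1.5 | 3.44 | 3.88 | 0.30 | 5.02 | 4.47 | 4.52
25 | 4 | III | 1.8 | 3.58 | 3.61 | 0.23 | 4.49 | 4.20 | 4.34
25 | 4 | IV | 1.2 | 3.77 | 3.40 | 0.42 | 4.73 | 4.44 | 4.72
25 | 4 | IV | 1.5 | 4.52 | 3.99 | 0.35 | 5.27 | 5.18 | 5.00
25 | 4 | IV | 1.8 | 4.30 | 3.67 | 0.38 | 4.59 | 4.40 | 4.64
25 | 8 | V | 1.2 | 2.60 | 3.57 | 0.06 | 4.56 | 4.18 | 4.46
25 | 8 | V | 1.5 | 3.13 | 3.48 | 0.05 | 5.12 | 4.49 | 4.94
25 | 8 | V | 1.8 | 2.79 | 2.69 | 0.05 | 4.31 | 3.96 | 4.19
25 | 8 | VI | 1.2 | 3.79 | 3.45 | 0.18 | 4.93 | 4.79 | 4.95
25 | 8 | VI | 1.5 | 4.32 | 3.65 | 0.11 | 5.21 | 5.22 | 5.16
25 | 8 | VI | 1.8 | 4.43 | 3.47 | 0.18 | 5.05 | 4.87 | 5.01
50 | 2 | I | 1.2 | 4.98 | 4.53 | 1.71 | 4.77 | 4.57 | 4.77
50 | 2 | I | 1.5 | 4.99 | 4.71 | 2.12 | 4.65 | 4.74 | 4.97
50 | 2 | I | 1.8 | 5.44 | 5.16 | 2.21 | 5.18 | 5.10 | 5.10
50 | 2 | II | 1.2 | 4.95 | 4.64 | 2.08 | 4.74 | 4.72 | 5.02
50 | 2 | II | 1.5 | 5.35 | 5.18 | 2.29 | 5.14 | 5.25 | 5.38
50 | 2 | II | 1.8 | 5.08 | 4.88 | 2.34 | 4.83 | 4.84 | 4.59
50 | 4 | III | 1.2 | 4.12 | 4.49 | 0.89 | 5.20 | 5.02 | 5.22
50 | 4 | III | 1.5 | 3.94 | 4.01 | 0.93 | 4.60 | 4.69 | 4.90
50 | 4 | III | 1.8 | 4.13 | 4.08 | 0.73 | 4.88 | 4.70 | 4.84
50 | 4 | IV | 1.2 | 4.41 | 4.25 | 0.94 | 4.98 | 4.76 | 5.13
50 | 4 | IV | 1.5 | 4.89 | 4.60 | 0.99 | 5.03 | 4.99 | 5.15
50 | 4 | IV | 1.8 | 5.26 | 4.64 | 1.17 | 4.98 | 5.00 | 5.19
50 | 8 | V | 1.2 | 3.39 | 3.82 | 0.25 | 4.62 | 4.45 | 4.66
50 | 8 | V | 1.5 | 3.62 | 3.64 | 0.31 | 4.88 | 4.64 | 4.87
50 | 8 | V | 1.8 | 3.97 | 3.88 | 0.37 | 4.98 | 4.90 | 4.71
50 | 8 | VI | 1.2 | 4.78 | 4.42 | 0.47 | 5.17 | 5.18 | 4.97
50 | 8 | VI | 1.5 | 4.95 | 4.36 | 0.43 | 5.04 | 5.02 | 4.74
50 | 8 | VI | 1.8 | 5.54 | 4.76 | 0.62 | 5.28 | 5.21 | 5.42
100 | 2 | I | 1.2 | 4.95 | 4.77 | 2.70 | 4.83 | 4.76 | 4.84
100 | 2 | I | 1.5 | 5.17 | 4.95 | 2.78 | 4.88 | 4.82 | 4.98
100 | 2 | I | 1.8 | 5.04 | 4.92 | 2.98 | 5.04 | 4.94 | 5.07
100 | 2 | II | 1.2 | 5.09 | 4.92 | 2.82 | 5.02 | 5.05 | 5.20
100 | 2 | II | 1.5 | 5.34 | 5.32 | 3.13 | 5.23 | 5.29 | 5.34
100 | 2 | II | 1.8 | 5.24 | 5.12 | 2.91 | 5.22 | 5.23 | 5.31
100 | 4 | III | 1.2 | 4.50 | 4.53 | 1.48 | 5.20 | 5.01 | 5.17
100 | 4 | III | 1.5 | 4.37 | 4.33 | 1.35 | 4.89 | 4.80 | 4.93
100 | 4 | III | 1.8 | 4.98 | 4.96 | 1.71 | 5.27 | 5.38 | 5.61
100 | 4 | IV | 1.2 | 5.00 | 4.49 | 1.56 | 4.84 | 4.66 | 4.78
100 | 4 | IV | 1.5 | 5.58 | 5.02 | 1.86 | 5.31 | 5.12 | 5.29
100 | 4 | IV | 1.8 | 5.57 | 5.07 | 1.71 | 5.26 | 5.18 | 5.32
100 | 8 | V | 1.2 | 4.20 | 4.42 | 0.70 | 4.85 | 4.93 | 5.05
100 | 8 | V | 1.5 | 4.37 | 4.42 | 0.81 | 4.93 | 5.00 | 4.87
100 | 8 | V | 1.8 | 4.56 | 4.50 | 0.84 | 5.14 | 5.08 | 5.22
100 | 8 | VI | 1.2 | 4.95 | 4.57 | 1.04 | 4.74 | 4.69 | 4.72
100 | 8 | VI | 1.5 | 5.30 | 4.65 | 1.01 | 4.88 | 4.83 | 4.94
100 | 8 | VI | 1.8 | 5.39 | 4.84 | 1.02 | 5.00 | 4.94 | 5.10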

4.1.2. TIE Under Donner’s Model

As shown in Table 6 below, the $G^2$ and $X^2$ tests behave much as they do under Rosner’s model. Additionally, when the sample size is small, the results become increasingly conservative as the intra-subject correlation ($\rho$) increases. The $X^2_{adj}$ test consistently produces overly conservative results across all scenarios. Similar to the $G^2$ and $X^2$ tests, its performance also becomes more conservative with higher values of $\rho$. Although the three bootstrap methods yield predominantly robust results, a few liberal outcomes are observed in scenarios with small sample sizes and high intra-subject correlation ($\rho = 0.9$).
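Table 6. The empirical type I error rates (in %) under Donner’s model, at the nominal level of $\alpha = 0.05$, where the robust and conservative results are highlighted in bold and italics, respectively.
$(\mathbf{m}_+, \mathbf{n}_+)$ | $g$ | Case | $\rho$ | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$
25 | 2 | I | 0.5 | 5.68 | 4.97 | 1.28 | 5.18 | 5.11 | 5.13
25 | 2 | I | 0.7 | 5.07 | 4.32 | 1.16 | 5.39 | 5.37 | 5.25
25 | 2 | I | 0.9 | 3.03 | 2.67 | 0.69 | 7.73 | 7.78 | 8.79
25 | 2 | II | 0.5 | 5.58 | 4.99 | 1.24 | 4.94 | 5.04 | 5.09
25 | 2 | II | 0.7 | 5.18 | 4.54 | 1.15 | 5.23 | 5.23 | 5.32
25 | 2 | II | 0.9 | 3.46 | 3.10 | 0.79 | 8.33 | 8.36 | 9.10
25 | 4 | III | 0.5 | 3.90 | 3.26 | 0.21 | 5.07 | 5.01 | 4.96
25 | 4 | III | 0.7 | 3.41 | 2.93 | 0.16 | 5.04 | 5.25 | 4.84
25 | 4 | III | 0.9 | 1.62 | 1.60 | 0.15 | 5.64 | 5.60 | 7.00
25 | 4 | IV | 0.5 | 5.10 | 4.16 | 0.35 | 5.12 | 5.03 | 5.02
25 | 4 | IV | 0.7 | 4.65 | 3.90 | 0.39 | 5.30 | 5.40 | 4.94
25 | 4 | IV | 0.9 | 2.26 | 2.05 | 0.12 | 5.62 | 5.55 | 6.46
25 | 8 | V | 0.5 | 3.52 | 2.78 | 0.07 | 4.76 | 4.90 | 4.26
25 | 8 | V | 0.7 | 2.84 | 2.48 | 0.03 | 5.19 | 5.17 | 4.72
25 | 8 | V | 0.9 | 0.63 | 0.65 | 0.00 | 3.92 | 4.00 | 5.07
25 | 8 | VI | 0.5 | 5.38 | 4.04 | 0.18 | 5.21 | 5.04 | 5.09
25 | 8 | VI | 0.7 | 4.26 | 3.54 | 0.11 | 5.10 | 5.02 | 4.80
25 | 8 | VI | 0.9 | 0.91 | 0.87 | 0.01 | 3.91 | 3.89 | 5.46
50 | 2 | I | 0.5 | 5.27 | 4.96 | 1.95 | 4.88 | 4.81 | 5.02
50 | 2 | I | 0.7 | 5.23 | 4.81 | 1.76 | 4.92 | 4.78 | 4.81
50 | 2 | I | 0.9 | 3.87 | 3.41 | 1.05 | 6.07 | 6.03 | 6.17
50 | 2 | II | 0.5 | 5.39 | 5.14 | 2.17 | 5.07 | 5.09 | 5.28
50 | 2 | II | 0.7 | 5.89 | 5.51 | 2.21 | 5.46 | 5.47 | 5.59
50 | 2 | II | 0.9 | 4.29 | 3.87 | 1.56 | 6.06 | 6.05 | 6.18
50 | 4 | III | 0.5 | 5.29 | 4.62 | 0.93 | 5.20 | 5.18 | 5.25
50 | 4 | III | 0.7 | 5.28 | 4.53 | 0.79 | 5.30 | 5.25 | 5.12
50 | 4 | III | 0.9 | 2.78 | 2.93 | 0.25 | 5.04 | 5.29 | 4.86
50 | 4 | IV | 0.5 | 5.70 | 4.93 | 1.19 | 5.06 | 5.12 | 5.23
50 | 4 | IV | 0.7 | 5.85 | 4.96 | 1.07 | 5.27 | 5.25 | 5.28
50 | 4 | IV | 0.9 | 3.23 | 2.99 | 0.46 | 4.99 | 5.07 | 4.68
50 | 8 | V | 0.5 | 5.47 | 4.38 | 0.34 | 5.23 | 5.06 | 5.03
50 | 8 | V | 0.7 | 5.16 | 4.21 | 0.35 | 5.13 | 5.20 | 4.81
50 | 8 | V | 0.9 | 2.42 | 2.55 | 0.10 | 4.88 | 4.99 | 4.99
50 | 8 | VI | 0.5 | 5.89 | 4.96 | 0.63 | 4.92 | 5.01 | 5.23
50 | 8 | VI | 0.7 | 5.53 | 4.71 | 0.57 | 4.75 | 4.81 | 4.88
50 | 8 | VI | 0.9 | 2.78 | 3.08 | 0.19 | 4.79 | 4.91 | 5.16
100 | 2 | I | 0.5 | 5.21 | 5.08 | 2.62 | 5.00 | 4.98 | 5.08
100 | 2 | I | 0.7 | 4.88 | 4.69 | 2.36 | 4.74 | 4.75 | 4.84
100 | 2 | I | 0.9 | 4.97 | 4.51 | 1.87 | 5.02 | 5.02 | 4.79
100 | 2 | II | 0.5 | 5.24 | 5.10 | 2.81 | 5.06 | 5.10 | 5.21
100 | 2 | II | 0.7 | 5.30 | 5.17 | 2.60 | 5.04 | 5.01 | 5.17
100 | 2 | II | 0.9 | 5.73 | 5.09 | 2.41 | 5.68 | 5.51 | 4.94
100 | 4 | III | 0.5 | 5.31 | 4.90 | 1.66 | 5.03 | 5.01 | 5.05
100 | 4 | III | 0.7 | 5.69 | 5.06 | 1.51 | 5.19 | 5.18 | 5.19
100 | 4 | III | 0.9 | 4.20 | 4.12 | 1.04 | 4.80 | 4.94 | 4.55
100 | 4 | IV | 0.5 | 5.55 | 5.11 | 1.95 | 5.33 | 5.17 | 5.43
100 | 4 | IV | 0.7 | 5.25 | 4.85 | 1.57 | 4.86 | 4.80 | 4.99
100 | 4 | IV | 0.9 | 4.62 | 4.30 | 1.06 | 4.85 | 5.02 | 4.91
100 | 8 | V | 0.5 | 5.70 | 5.08 | 1.08 | 5.03 | 5.17 | 5.27
100 | 8 | V | 0.7 | 5.88 | 5.33 | 1.09 | 5.25 | 5.38 | 5.46
100 | 8 | V | 0.9 | 3.92 | 3.82 | 0.35 | 4.88 | 4.95 | 4.89
100 | 8 | VI | 0.5 | 5.57 | 5.10 | 1.17 | 5.10 | 5.11 | 5.29
100 | 8 | VI | 0.7 | 5.34 | 4.68 | 0.92 | 4.81 | 4.66 | 4.96
100 | 8 | VI | 0.9 | 4.66 | 4.37 | 0.63 | 4.99 | 5.00 | 5.12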

4.1.3. TIE Under Dallal’s Model

As shown in Table 7 below, all six methods ($G^2$, $X^2$, $X^2_{adj}$, $B_1$, $B_2$, $B_3$) perform similarly to their performance under Rosner’s model. In particular, the $G^2$ and $X^2$ tests yield slightly more robust results under Dallal’s model. The $X^2_{adj}$ test remains consistently overly conservative across all scenarios. The three bootstrap methods produce entirely robust results.
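Table 7. The empirical type I error rates (in %) under Dallal’s model, at the nominal level of $\alpha = 0.05$, where the robust and conservative results are highlighted in bold and italics, respectively.

| $m_{+}, n_{+}$ | $g$ | Case | $\gamma$ | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | 2 | I | 0.3 | 4.83 | 4.60 | 1.15 | 4.32 | 4.62 | 4.08 |
| | | | 0.5 | 5.51 | 4.83 | 1.34 | 5.05 | 5.01 | 5.09 |
| | | | 0.7 | 5.60 | 4.91 | 1.08 | 5.07 | 4.99 | 4.89 |
| | | II | 0.3 | 5.27 | 4.64 | 1.30 | 4.94 | 4.97 | 4.35 |
| | | | 0.5 | 5.55 | 5.06 | 1.42 | 4.85 | 4.92 | 5.03 |
| | | | 0.7 | 5.62 | 5.12 | 1.35 | 5.00 | 4.98 | 5.41 |
| | 4 | III | 0.3 | 2.93 | 3.02 | 0.35 | 4.82 | 4.93 | 4.70 |
| | | | 0.5 | 3.93 | 3.56 | 0.36 | 5.58 | 5.39 | 5.12 |
| | | | 0.7 | 3.93 | 3.32 | 0.25 | 5.49 | 5.43 | 5.16 |
| | | IV | 0.3 | 4.12 | 3.78 | 0.50 | 5.40 | 5.27 | 5.41 |
| | | | 0.5 | 4.94 | 4.12 | 0.42 | 5.20 | 4.89 | 4.87 |
| | | | 0.7 | 5.29 | 4.37 | 0.31 | 5.46 | 5.25 | 5.11 |
| | 8 | V | 0.3 | 2.27 | 2.41 | 0.05 | 4.82 | 4.89 | 4.63 |
| | | | 0.5 | 3.08 | 2.72 | 0.10 | 5.10 | 5.09 | 4.46 |
| | | | 0.7 | 3.65 | 2.87 | 0.07 | 5.61 | 5.69 | 4.95 |
| | | VI | 0.3 | 3.50 | 3.28 | 0.16 | 5.20 | 5.04 | 5.19 |
| | | | 0.5 | 5.35 | 4.57 | 0.25 | 5.47 | 5.73 | 5.32 |
| | | | 0.7 | 5.45 | 4.22 | 0.11 | 5.37 | 5.42 | 5.14 |
| 50 | 2 | I | 0.3 | 5.56 | 5.25 | 1.87 | 5.26 | 5.31 | 5.02 |
| | | | 0.5 | 5.46 | 5.18 | 2.08 | 5.04 | 5.22 | 5.25 |
| | | | 0.7 | 5.09 | 4.81 | 1.99 | 4.67 | 4.66 | 4.84 |
| | | II | 0.3 | 5.25 | 4.74 | 1.87 | 4.79 | 4.80 | 4.98 |
| | | | 0.5 | 5.56 | 5.19 | 2.30 | 5.16 | 5.11 | 5.38 |
| | | | 0.7 | 5.27 | 5.03 | 2.26 | 4.99 | 4.97 | 5.22 |
| | 4 | III | 0.3 | 4.10 | 3.90 | 0.77 | 4.80 | 4.70 | 4.74 |
| | | | 0.5 | 5.03 | 4.37 | 0.85 | 5.01 | 4.88 | 4.98 |
| | | | 0.7 | 5.34 | 4.52 | 0.90 | 5.18 | 5.12 | 4.82 |
| | | IV | 0.3 | 5.09 | 4.65 | 0.89 | 5.16 | 5.17 | 5.12 |
| | | | 0.5 | 5.73 | 5.03 | 1.05 | 4.99 | 4.93 | 5.20 |
| | | | 0.7 | 5.64 | 4.96 | 1.16 | 4.94 | 5.02 | 5.05 |
| | 8 | V | 0.3 | 3.96 | 3.72 | 0.41 | 4.95 | 4.92 | 4.79 |
| | | | 0.5 | 5.28 | 4.49 | 0.37 | 5.19 | 5.20 | 5.17 |
| | | | 0.7 | 5.39 | 4.21 | 0.39 | 5.20 | 4.94 | 4.95 |
| | | VI | 0.3 | 4.96 | 4.54 | 0.57 | 4.99 | 5.02 | 5.26 |
| | | | 0.5 | 5.99 | 5.04 | 0.57 | 5.20 | 5.21 | 5.41 |
| | | | 0.7 | 5.46 | 4.64 | 0.58 | 4.69 | 4.72 | 4.85 |
| 100 | 2 | I | 0.3 | 5.51 | 5.33 | 2.83 | 5.23 | 5.33 | 5.32 |
| | | | 0.5 | 5.37 | 5.13 | 2.62 | 5.05 | 5.03 | 5.05 |
| | | | 0.7 | 5.16 | 5.06 | 2.90 | 5.03 | 5.04 | 5.16 |
| | | II | 0.3 | 5.23 | 5.05 | 2.60 | 5.07 | 5.11 | 5.20 |
| | | | 0.5 | 5.22 | 5.07 | 2.54 | 5.02 | 5.06 | 5.07 |
| | | | 0.7 | 5.03 | 4.89 | 2.79 | 4.86 | 4.87 | 4.96 |
| | 4 | III | 0.3 | 5.57 | 5.10 | 1.58 | 5.54 | 5.43 | 5.37 |
| | | | 0.5 | 5.29 | 4.90 | 1.55 | 4.94 | 4.94 | 4.98 |
| | | | 0.7 | 5.47 | 5.07 | 1.60 | 4.99 | 5.06 | 4.99 |
| | | IV | 0.3 | 5.34 | 4.76 | 1.57 | 4.96 | 4.83 | 4.83 |
| | | | 0.5 | 5.55 | 5.27 | 2.02 | 5.16 | 5.18 | 5.29 |
| | | | 0.7 | 5.49 | 5.25 | 1.78 | 5.22 | 5.13 | 5.36 |
| | 8 | V | 0.3 | 4.98 | 4.53 | 0.79 | 4.78 | 4.78 | 4.85 |
| | | | 0.5 | 5.59 | 5.07 | 1.06 | 5.03 | 5.14 | 5.15 |
| | | | 0.7 | 5.56 | 4.92 | 1.06 | 4.95 | 4.87 | 5.06 |
| | | VI | 0.3 | 5.54 | 5.07 | 1.00 | 5.10 | 5.04 | 5.05 |
| | | | 0.5 | 5.03 | 4.68 | 1.10 | 4.59 | 4.70 | 4.73 |
| | | | 0.7 | 5.59 | 5.09 | 1.19 | 5.21 | 5.18 | 5.30 |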

4.1.4. TIE Under the Clayton Copula Model

As shown in Table 8 below, the three bootstrap methods are entirely robust across all scenarios, and the $X^2_{adj}$ test remains consistently overly conservative. These patterns are consistent with the performance of these methods under Rosner’s and Dallal’s models. The $G^2$ test is generally robust; however, unlike under the other models, it produces a few liberal results when the sample size is small. In contrast, the $X^2$ test performs slightly better under the Clayton copula model, as indicated by a reduced number of conservative outcomes.
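Table 8. The empirical type I error rates (in %) under the Clayton copula model, at the nominal level of $\alpha = 0.05$, where the robust and conservative results are highlighted in bold and italics, respectively.

| $m_{+}, n_{+}$ | $g$ | Case | $\theta$ | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | 2 | I | 1.0 | 6.31 | 5.45 | 1.50 | 5.56 | 5.55 | 5.48 |
| | | | 2.0 | 6.09 | 5.41 | 1.68 | 5.53 | 5.52 | 5.48 |
| | | | 4.0 | 6.17 | 5.11 | 1.38 | 5.50 | 5.35 | 5.30 |
| | | II | 1.0 | 5.97 | 5.20 | 1.45 | 5.38 | 5.28 | 5.44 |
| | | | 2.0 | 6.04 | 5.37 | 1.35 | 5.36 | 5.46 | 5.37 |
| | | | 4.0 | 5.90 | 5.02 | 1.31 | 5.35 | 5.31 | 5.25 |
| | 4 | III | 1.0 | 4.18 | 3.38 | 0.27 | 5.28 | 5.13 | 5.23 |
| | | | 2.0 | 3.94 | 3.26 | 0.26 | 5.13 | 5.13 | 5.12 |
| | | | 4.0 | 4.16 | 3.44 | 0.28 | 5.32 | 5.14 | 4.95 |
| | | IV | 1.0 | 5.78 | 4.58 | 0.46 | 5.67 | 5.59 | 5.61 |
| | | | 2.0 | 5.72 | 4.56 | 0.46 | 5.68 | 5.55 | 5.22 |
| | | | 4.0 | 5.28 | 4.07 | 0.44 | 5.08 | 5.08 | 4.84 |
| | 8 | V | 1.0 | 3.55 | 2.81 | 0.09 | 4.97 | 4.83 | 5.28 |
| | | | 2.0 | 3.62 | 2.86 | 0.08 | 4.92 | 4.74 | 4.74 |
| | | | 4.0 | 3.85 | 3.16 | 0.06 | 5.35 | 5.19 | 4.80 |
| | | VI | 1.0 | 5.20 | 4.06 | 0.23 | 5.25 | 5.18 | 5.36 |
| | | | 2.0 | 5.37 | 4.12 | 0.18 | 5.40 | 5.36 | 5.05 |
| | | | 4.0 | 5.32 | 3.96 | 0.25 | 5.22 | 5.18 | 4.69 |
| 50 | 2 | I | 1.0 | 5.52 | 5.20 | 2.22 | 5.11 | 5.23 | 5.33 |
| | | | 2.0 | 5.22 | 5.02 | 2.03 | 4.94 | 4.87 | 5.15 |
| | | | 4.0 | 5.49 | 5.17 | 2.18 | 5.06 | 5.07 | 5.24 |
| | | II | 1.0 | 5.63 | 5.41 | 2.37 | 5.36 | 5.43 | 5.52 |
| | | | 2.0 | 5.14 | 4.89 | 2.07 | 4.76 | 4.78 | 5.01 |
| | | | 4.0 | 5.24 | 4.93 | 1.98 | 4.90 | 4.93 | 5.09 |
| | 4 | III | 1.0 | 4.85 | 4.39 | 0.85 | 5.18 | 5.09 | 5.55 |
| | | | 2.0 | 4.95 | 4.50 | 0.92 | 5.30 | 5.14 | 5.17 |
| | | | 4.0 | 5.08 | 4.35 | 0.89 | 5.08 | 5.01 | 5.02 |
| | | IV | 1.0 | 5.65 | 5.00 | 1.06 | 5.33 | 5.25 | 5.39 |
| | | | 2.0 | 5.90 | 5.26 | 1.26 | 5.41 | 5.43 | 5.48 |
| | | | 4.0 | 6.34 | 5.35 | 1.28 | 5.65 | 5.41 | 5.53 |
| | 8 | V | 1.0 | 4.69 | 4.29 | 0.29 | 5.48 | 5.33 | 5.58 |
| | | | 2.0 | 4.49 | 3.75 | 0.29 | 4.93 | 4.74 | 4.94 |
| | | | 4.0 | 4.77 | 4.03 | 0.36 | 4.80 | 4.72 | 4.82 |
| | | VI | 1.0 | 5.46 | 4.70 | 0.56 | 5.15 | 4.92 | 5.01 |
| | | | 2.0 | 6.02 | 5.23 | 0.58 | 5.50 | 5.41 | 5.57 |
| | | | 4.0 | 5.98 | 5.04 | 0.62 | 4.98 | 5.08 | 5.19 |
| 100 | 2 | I | 1.0 | 4.97 | 4.84 | 2.74 | 4.68 | 4.72 | 4.86 |
| | | | 2.0 | 5.43 | 5.29 | 2.82 | 5.18 | 5.26 | 5.26 |
| | | | 4.0 | 5.50 | 5.26 | 2.61 | 5.19 | 5.20 | 5.30 |
| | | II | 1.0 | 5.12 | 5.03 | 2.93 | 4.99 | 5.02 | 5.08 |
| | | | 2.0 | 5.41 | 5.31 | 2.92 | 5.24 | 5.24 | 5.36 |
| | | | 4.0 | 4.83 | 4.70 | 2.46 | 4.74 | 4.75 | 4.84 |
| | 4 | III | 1.0 | 4.35 | 4.12 | 1.48 | 4.61 | 4.52 | 4.91 |
| | | | 2.0 | 5.10 | 4.83 | 1.59 | 5.13 | 5.15 | 4.99 |
| | | | 4.0 | 5.29 | 4.87 | 1.32 | 5.05 | 5.04 | 4.96 |
| | | IV | 1.0 | 5.48 | 5.11 | 2.09 | 5.12 | 5.21 | 5.31 |
| | | | 2.0 | 5.27 | 5.05 | 1.84 | 5.02 | 5.01 | 5.11 |
| | | | 4.0 | 5.64 | 5.19 | 1.87 | 5.24 | 5.13 | 5.35 |
| | 8 | V | 1.0 | 4.66 | 4.40 | 0.72 | 5.00 | 4.91 | 5.09 |
| | | | 2.0 | 5.25 | 4.85 | 0.93 | 5.23 | 5.17 | 5.17 |
| | | | 4.0 | 5.71 | 5.04 | 0.90 | 5.46 | 5.33 | 5.28 |
| | | VI | 1.0 | 5.76 | 5.37 | 1.17 | 5.29 | 5.44 | 5.44 |
| | | | 2.0 | 5.42 | 4.93 | 1.24 | 4.88 | 4.96 | 5.11 |
| | | | 4.0 | 5.54 | 5.07 | 1.20 | 5.05 | 5.06 | 5.17 |
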
To summarize, the adjusted chi-square ($X^2_{adj}$) test is consistently overly conservative, yielding excessively small empirical type I error rates. As a result, this test is not recommended for practical use. (The $X^2_{adj}$ test may achieve adequate control of type I error rates only when the sample size is very large, for example, when $m_{+}, n_{+} = 2000$.) The deviance ($G^2$) test demonstrates strong and consistent control over type I error across all four models, with empirical type I error rates mostly falling within the robust range ($0.04 \le \widehat{\mathrm{TIE}} \le 0.06$). For Rosner’s, Dallal’s, and the Clayton copula models, the Pearson chi-square ($X^2$) test tends to be moderately more conservative than the deviance ($G^2$) test when the sample size is small, but its performance improves with larger samples, producing results that fall largely within the robust range. Under Donner’s model, however, the $X^2$ test performs less favorably at small sample sizes, especially as the correlation increases. The three bootstrap methods ($B_1$, $B_2$, $B_3$) demonstrate good control over the type I error rate, producing predominantly robust results for all the models. Nevertheless, a few liberal outcomes are observed under Donner’s model when the sample size is small.
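The labels used in this summary can be made concrete with a small sketch. It assumes the robust range is the closed interval from 4% to 6%, with rates below it called conservative and rates above it called liberal; the function name `classify_tie` and the inclusive bounds are illustrative assumptions, not part of the original analysis. The example row is taken from Table 6 (Donner’s model, $m_{+}, n_{+} = 25$, $g = 2$, Case I, $\rho = 0.9$).

```python
def classify_tie(rate_percent, lower=4.0, upper=6.0):
    """Label an empirical type I error rate given in percent.

    Assumption: the robust range is [lower, upper], bounds inclusive.
    """
    if rate_percent < lower:
        return "conservative"
    if rate_percent > upper:
        return "liberal"
    return "robust"


# Empirical type I error rates (in %) from one row of Table 6.
row = {"G2": 3.03, "X2": 2.67, "X2_adj": 0.69, "B1": 7.73, "B2": 7.78, "B3": 8.79}

labels = {method: classify_tie(rate) for method, rate in row.items()}
for method, label in labels.items():
    print(f"{method}: {row[method]:.2f}% -> {label}")
```

For this row, the three chi-square-type tests fall below the range and the three bootstrap methods fall above it, which matches the pattern described for Donner’s model at small sample sizes and high correlation.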

4.2. Powers

The procedure for computing powers is analogous to that used for estimating empirical type I error rates, except that it employs the true parameter settings under the alternative hypothesis H 1 , as specified in Table 9 and Table 10.
The estimated powers across all scenarios under Rosner’s, Donner’s, Dallal’s, and the Clayton copula models are presented in Table 11.
For each model, the deviance ($G^2$) test generally yields the largest power among the six methods. The three bootstrap methods produce comparable power estimates, generally close to those of the $G^2$ test. Additionally, the power of $B_3$ tends to increase slightly faster than that of the other two bootstrap methods as the number of groups ($g$) increases.

5. Real-World Applications

Unlike the simulation study, where no model preference was considered due to the data being simulated under a specific model, selecting an appropriate model is essential for analyzing real-world data. To evaluate the performance of the six proposed methods for the goodness-of-fit test, we apply them to three real-world examples.
Model selection is conducted among the following candidates: (i) the independence model, (ii) Rosner’s model, (iii) Donner’s model, (iv) Dallal’s model, and (v) the Clayton copula model. These five candidate models are introduced and discussed in Sections 2.1 to 2.5. For each dataset, we applied all five models and assessed their fit using the Akaike Information Criterion (AIC), provided that the model passed the goodness-of-fit test. The best-fitting model was then identified as the one with the lowest AIC. The AIC is defined as
$$\mathrm{AIC} = 2k - 2\, l\left(\hat{p}_{0i}, \hat{p}_{1i}, \hat{p}_{2i}, \hat{\pi}_{i}\right)_{i=1}^{g},$$
where $k = g + 1$ is the number of free parameters, and $l(\hat{\cdot})$ is the log-likelihood evaluated at the MLEs of $p_{ri}$ and $\pi_i$ ($r = 0, 1, 2$; $i = 1, \ldots, g$).
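The selection rule described above can be sketched in a few lines. The sketch reads "passed the goodness-of-fit test" as "all six p-values exceed 0.05", which is how the examples below are discussed; that reading, and the helper names `aic`, `acceptable_models`, and `select_model`, are assumptions made for illustration. The p-values and AICs are those reported in Table 13 for the OME data.

```python
def aic(k, loglik):
    """AIC = 2k - 2l for k free parameters and maximized log-likelihood l."""
    return 2 * k - 2 * loglik


def acceptable_models(results, alpha=0.05):
    """Models whose goodness-of-fit p-values all exceed alpha (assumed rule)."""
    return [name for name, r in results.items() if min(r["p_values"]) > alpha]


def select_model(results, alpha=0.05):
    """Among acceptable models, return the one with the lowest AIC."""
    candidates = acceptable_models(results, alpha)
    if not candidates:
        return None
    return min(candidates, key=lambda name: results[name]["aic"])


# p-values in the order G2, X2, X2_adj, B1, B2, B3, with AIC, from Table 13.
table13 = {
    "Independence": {"p_values": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], "aic": 367.4916},
    "Rosner": {"p_values": [0.7327, 0.7367, 0.8796, 0.7475, 0.7515, 0.7355], "aic": 329.4285},
    "Donner": {"p_values": [0.5283, 0.5385, 0.7553, 0.5206, 0.5286, 0.5186], "aic": 330.3617},
    "Dallal": {"p_values": [0.2647, 0.2741, 0.4827, 0.2690, 0.2720, 0.2615], "aic": 332.1132},
    "Clayton copula": {"p_values": [0.7735, 0.7742, 0.9321, 0.7790, 0.7795, 0.7740], "aic": 329.2583},
}

acceptable = acceptable_models(table13)
best = select_model(table13)
print("acceptable:", acceptable)
print("best by AIC:", best)
```

Filtering on goodness of fit before comparing AICs keeps a model that is clearly rejected from being selected merely because it has fewer parameters.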

5.1. Example 1

A double-blind randomized clinical trial was conducted at two sites to compare cefaclor and amoxicillin for the treatment of acute otitis media with effusion (OME) in 214 children [25]. Table 12 shows the presence or absence of OME (in terms of the number of cured ears) at 14 days in 203 of the 214 children treated with cefaclor or amoxicillin.
Table 13 provides the p-values of the six goodness-of-fit methods, along with the AIC values for the five competing models. The independence model is excluded due to extremely small p-values across the six methods. The remaining four parametric models are considered acceptable, with all p-values exceeding 0.05. Among them, Rosner’s model and the Clayton copula model yield the highest p-values (all ≳ 0.7), suggesting a better fit than Donner’s and Dallal’s models. The AIC for the Clayton copula model is slightly lower than that for Rosner’s model, indicating that the Clayton copula model provides the best fit for this dataset. This result may reflect that the dependence structure in the OME dataset exhibits lower tail dependence, which the Clayton copula is particularly suited to model. Interpreted in this context, the finding suggests that successful treatment of one ear may be associated with an increased likelihood of success in the paired ear of the same child.
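As a rough check on these results, the deviance and Pearson statistics can be recomputed from the observed counts and the expected frequencies listed in Table 12. The sketch below assumes the standard forms $G^2 = 2\sum O \ln(O/E)$ and $X^2 = \sum (O - E)^2 / E$, summed over all bilateral and unilateral cells; because Table 12 reports expected frequencies rounded to one decimal, the values are approximations only, and no p-values are computed since the degrees of freedom are not restated here. Function and variable names are illustrative.

```python
import math


def pearson_x2(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def deviance_g2(observed, expected):
    """Deviance: 2 * sum of O * ln(O / E); cells with O = 0 contribute 0."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Cell order: bilateral cefaclor (0, 1, 2 cured ears), bilateral amoxicillin
# (0, 1, 2), unilateral cefaclor (0, 1), unilateral amoxicillin (0, 1).
observed = [21, 9, 14, 13, 3, 15, 38, 24, 27, 39]

# Rounded expected frequencies from Table 12.
expected_rosner = [20.1, 10.7, 13.2, 12.7, 2.6, 15.7, 35.9, 26.1, 29.8, 36.2]
expected_indep = [15.5, 21.2, 7.3, 5.9, 15.3, 9.8, 36.8, 25.2, 28.9, 37.1]

x2_rosner = pearson_x2(observed, expected_rosner)
g2_rosner = deviance_g2(observed, expected_rosner)
x2_indep = pearson_x2(observed, expected_indep)

print(f"Rosner:       X2 = {x2_rosner:.2f}, G2 = {g2_rosner:.2f}")
print(f"Independence: X2 = {x2_indep:.2f}")
```

Both statistics are small under Rosner’s model and very large under the independence model, in line with the p-values in Table 13.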

5.2. Example 2

The second example involves combined unilateral and bilateral data from an observational study of 60 myopia patients undergoing orthokeratology (Ortho-k), a non-surgical vision correction method that uses specialized contact lenses worn overnight to temporarily reshape the cornea and correct myopia [26]. Myopia improvement is assessed by axial length growth (ALG): an eye is classified as improved if its ALG is less than 0.3 mm and as not improved otherwise. For this analysis, a subset of 33 patients using three masked brands of Ortho-k lenses (labeled Q, Y, and W) is included [18]. The numbers of improved myopic eyes by brand are summarized in Table 14.
Table 15 presents the p-values of the six methods, along with the AICs for the five competing models. The independence model yields considerably smaller p-values than the other four models. In particular, its p-values from the bootstrap methods $B_1$ and $B_2$ fall below 0.05, indicating poor model fit. The remaining four models are considered acceptable, with all associated p-values exceeding 0.05. Among them, Rosner’s model achieves the lowest AIC value, suggesting it is the best model for this dataset. It is worth noting that the Pearson chi-square test should be interpreted with caution, as several cell counts in Table 14 are smaller than 5, potentially affecting the test’s validity. This result may indicate that the dependence structure in the Ortho-k dataset has the symmetric feature described by Rosner’s model. Interpreted in this context, the finding suggests that the outcomes of the two myopic eyes in the same patient tend to be moderately correlated, i.e., the improvement or lack of improvement in one eye is associated with a similar outcome in the paired eye.

5.3. Example 3

The third example, originally analyzed in Rosner’s paper introducing the constant $R$ model [1], is based on data from an outpatient population of 218 persons aged 20 to 39 with retinitis pigmentosa (RP), who were seen at the Massachusetts Eye and Ear Infirmary between 1970 and 1979. The patients were classified into four genetic groups: (i) autosomal dominant RP (DOM), (ii) autosomal recessive RP (AR), (iii) sex-linked RP (SL), and (iv) isolate RP (ISO). To eliminate between-subject correlation, the selected patients were from different families. The distribution of the number of affected eyes for persons in the four genetic groups is given in Table 16, where an eye was considered affected if the best corrected Snellen visual acuity (VA) was 20/50 or worse, and normal if VA was 20/40 or better. Note that this dataset contains only bilateral observations and may be viewed as a special case within the combined data framework.
The results of the goodness-of-fit tests are shown in Table 17. As in the previous examples, the independence model is excluded due to extremely small p-values across all methods, indicating poor fit. The remaining four models are considered acceptable, with all associated p-values greater than 0.05. Among them, Donner’s model and the Clayton copula model produce the highest p-values (all ≳ 0.6), more than twice those observed for Dallal’s model. Rosner’s model yields p-values that are marginally greater than 0.05. Between the two best-fitting models, Donner’s model has a slightly lower AIC value, indicating it is the best model for this dataset. This result may imply that the dependence structure in the retinitis pigmentosa dataset exhibits constant intra-subject correlation across the four genetic groups, as described by Donner’s model.

6. Conclusions

Selecting an appropriate statistical model that adequately fits the observed data is a key consideration in the analysis of paired organ data. Misfitting models can lead to inaccurate inference and potentially misleading conclusions. For instance, in Example 3, Donner’s model provides the best fit for the retinitis pigmentosa dataset. This same dataset was analyzed in a recent study by Zhou and Ma [27], which introduced three MLE-based statistics (likelihood ratio, Wald-type, and score) and the generalized estimating equations (GEE)-based statistic (generalized score) for testing homogeneity of proportions in combined unilateral and bilateral data. Under Donner’s model, the p-values for the three MLE-based statistics were 0.0073 , 0.0010 , and 0.0101 , respectively, which clearly indicates a rejection of the equal proportions hypothesis at the α = 0.05 level. In contrast, when applying Rosner’s model, the corresponding p-values were 0.1173 , 0.0980 , and 0.0769 , failing to reject the null hypothesis. The GEE statistic produced a p-value of 0.0135 , supporting the inference under Donner’s model. This example underscores the importance of correct model specification by showing that using a misfitting model can obscure true effects and lead to conflicting or incorrect conclusions.
While previous work focused on goodness-of-fit test methods for purely bilateral data, this study extends the investigation to the combined structure of unilateral and bilateral outcomes. We consider five statistical models and evaluate six methods for conducting goodness-of-fit tests under these models.
A simulation study is carried out to assess the performance of the six methods under different models by computing the empirical type I error rates and powers. Based on the simulation results, we draw several conclusions. Among the three commonly used test statistics, the deviance ($G^2$) test performs well across all four parametric models. In contrast, the Pearson chi-square ($X^2$) test is more model-dependent and performs less well when the sample size is small, especially under Donner’s model when intra-subject correlation is high. Although the adjusted chi-square ($X^2_{adj}$) test includes a continuity correction to improve type I error control in small samples, our results indicate that it is excessively conservative across all scenarios under all models. Despite its theoretical motivation, the test yields overly small empirical type I error rates. Therefore, it is not recommended for practical use in this context. On the other hand, the three bootstrap methods ($B_1$, $B_2$, $B_3$) generally maintain good control over type I error rates, although a few liberal outcomes were observed under Donner’s model with small samples. Overall, differences in performance among the six methods are more pronounced when the sample size is small and tend to diminish as the sample size increases.
In general, our results are consistent with findings from earlier studies. In particular, Liu and Ma [20] showed that the $X^2_{adj}$ test tends to be overly conservative and is therefore not recommended. The $G^2$ test performs well under Rosner’s, Donner’s, and Dallal’s models. The $X^2$ test controls type I error well under Rosner’s model but becomes liberal at small sample sizes when applied to the other two models. The three bootstrap methods generally mirror the behavior of the $G^2$ test; however, $B_1$ and $B_2$ tend to be more liberal at small sample sizes than $B_3$. As the sample size increases, all five methods ($G^2$, $X^2$, $B_1$, $B_2$, $B_3$) exhibit improved performance with satisfactory type I error control.
The practical application of these methods is illustrated through three real-world datasets from otolaryngologic and ophthalmologic studies.
The goodness-of-fit methods presented in our study are universal in the sense that they can be applied to paired organ data under various model scenarios. A natural extension of this work is to incorporate covariates into the modeling framework, allowing for a more flexible assessment of model fit while accounting for intra-subject correlation. For example, extensions based on generalized linear mixed models (GLMMs) or generalized estimating equations (GEE) could be developed to handle covariate-adjusted goodness-of-fit tests. Such developments would broaden the practical utility of these methods in analyzing paired organ data across different clinical settings.
It is important to note that the asymptotic distributions used for the $G^2$, $X^2$, and $X^2_{adj}$ statistics (as well as for bootstrap methods $B_1$ and $B_2$, which rely on the $G^2$ and $X^2$ statistics) are theoretically valid only under large-sample conditions. When the sample size is small, these asymptotic approximations may not hold, and alternative approaches such as exact methods or the bootstrap method $B_3$ should be considered. Exploring and developing more accurate small-sample inference methods remains an interesting direction for future work.
Lastly, we provide guidance for applied use based on our findings. Among the six methods evaluated, the deviance ($G^2$) test and the three bootstrap methods ($B_1$, $B_2$, $B_3$) generally demonstrate robust performance across various model scenarios, especially as sample size increases. The Pearson chi-square ($X^2$) test performs well under certain models but may be conservative with small samples and high intra-subject correlation. The adjusted chi-square ($X^2_{adj}$) test, while theoretically motivated, tends to be overly conservative and is not recommended for practical use. For smaller sample sizes, bootstrap method $B_3$ or exact methods may offer more reliable inference. We recommend that practitioners carefully consider sample size and model assumptions when selecting goodness-of-fit methods for combined unilateral and bilateral data to ensure valid conclusions.

Author Contributions

Conceptualization, J.Z. and C.-X.M.; methodology, J.Z. and C.-X.M.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z. and C.-X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in references [1,18,25]. To support the practical application of the proposed methods, we have developed a user-friendly online calculator that allows users to input their own datasets and perform model selection. The tool is available at: https://www.acsu.buffalo.edu/~cxma/GoodnessFitTestForBinaryCorrelatedData.htm (accessed on 25 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
| --- | --- |
| MLE | maximum likelihood estimate |
| CDF | cumulative distribution function |
| df | degrees of freedom |
| TIE | empirical type I error |
| AIC | Akaike information criterion |
| OME | acute otitis media with effusion |
| Ortho-k | Orthokeratology |
| ALG | axial length growth |
| RP | retinitis pigmentosa |
| DOM | autosomal dominant RP |
| AR | autosomal recessive RP |
| SL | sex-linked RP |
| ISO | isolate RP |
| VA | best corrected Snellen visual acuity |
| GLMM | generalized linear mixed model |
| GEE | generalized estimating equations |

References

  1. Rosner, B. Statistical methods in ophthalmology: An adjustment for the intraclass correlation between eyes. Biometrics 1982, 38, 105–114.
  2. Dallal, G.E. Paired Bernoulli Trials. Biometrics 1988, 44, 253–257.
  3. Donner, A. Statistical methods in ophthalmology: An adjusted chi-square approach. Biometrics 1989, 45, 605–611.
  4. Thompson, J.R. The chi-square test for data collected on eyes. Br. J. Ophthalmol. 1993, 77, 115–117.
  5. Clayton, D.G. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 1978, 65, 141–151.
  6. Tang, N.S.; Tang, M.L.; Qiu, S.F. Testing the equality of proportions for correlated otolaryngologic data. Comput. Stat. Data Anal. 2008, 52, 3719–3729.
  7. Pei, Y.b.; Tang, M.L.; Guo, J.H. Testing the equality of two proportions for combined unilateral and bilateral data. Commun. Stat.-Comput. 2008, 37, 1515–1529.
  8. Pei, Y.b.; Tang, M.L.; Wong, W.K.; Tang, N.S. Testing equality of correlations of two paired binary responses from two treated groups in a randomized trial. J. Biopharm. Stat. 2011, 21, 511–525.
  9. Tang, N.S.; Qiu, S.F. Homogeneity test, sample size determination and interval construction of difference of two proportions in stratified bilateral-sample designs. J. Stat. Plan. Inference 2012, 142, 1243–1251.
  10. Pei, Y.b.; Tian, G.L.; Tang, M.L. Testing homogeneity of proportion ratios for stratified correlated bilateral data in two-arm randomized clinical trials. Stat. Med. 2014, 33, 4370–4386.
  11. Li, Y.; Li, Z.; Mou, K. Homogeneity Test of Many-to-One Relative Risk Ratios in Unilateral and Bilateral Data with Multiple Groups. Axioms 2023, 12, 333.
  12. Ma, C.X.; Wang, K. Testing the homogeneity of proportions for combined unilateral and bilateral data. J. Biopharm. Stat. 2021, 31, 686–704.
  13. Mou, K.; Li, Z. Homogeneity Test of Many-to-One Risk Differences for Correlated Binary Data under Optimal Algorithms. Complexity 2021, 2021.
  14. Ma, C.X.; Wang, H. Testing the Equality of Proportions for Combined Unilateral and Bilateral Data Under Equal Intraclass Correlation Model. Stat. Biopharm. Res. 2022, 15, 608–617.
  15. Sun, S.; Li, Z.; Jiang, H. Homogeneity test and sample size of risk difference for stratified unilateral and bilateral data. Commun. Stat.-Simul. Comput. 2022, 53, 4209–4232.
  16. Pei, Y.; Tang, M.L.; Wong, W.K.; Guo, J. Confidence intervals for correlated proportion differences from paired data in a two-arm randomised clinical trial. Stat. Methods Med Res. 2012, 21, 167–187.
  17. Wang, K.; Ma, C.X. Interval estimation of relative risks for combined unilateral and bilateral correlated data. J. Biopharm. Stat. 2024, 35, 163–186.
  18. Liang, S.; Ma, C.X. Many-to-one Confidence Intervals of Risk Ratios for Bilateral and Unilateral Correlated Data. 2025; manuscript submitted for publication.
  19. Tang, M.L.; Pei, Y.B.; Wong, W.K.; Li, J.L. Goodness-of-fit tests for correlated paired binary data. Stat. Methods Med Res. 2012, 21, 331–345.
  20. Liu, X.; Ma, C.X. Goodness-of-fit tests for correlated bilateral data from multiple groups. In Contemporary Experimental Design, Multivariate Analysis and Data Mining: Festschrift in Honour of Professor Kai-Tai Fang; Springer: Berlin/Heidelberg, Germany, 2020; pp. 311–327.
  21. Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Annales de l’ISUP 1959, 8, 229–231.
  22. Liang, S.; Emura, T.; Ma, C.X.; Xin, Y.; Huang, X.W. Testing the Homogeneity of Two Proportions for Correlated Bilateral Data via the Clayton Copula. arXiv 2025, arXiv:2502.00523.
  23. Yates, F. Contingency Tables Involving Small Numbers and the χ2 Test. Suppl. J. R. Stat. Soc. 1934, 1, 217–235.
  24. Tang, M.L.; Tang, N.S.; Rosner, B. Statistical inference for correlated data in ophthalmologic studies. Stat. Med. 2006, 25, 2771–2783.
  25. Mandel, E.M.; Bluestone, C.D.; Rockette, H.E.; Blatter, M.M.; Reisinger, K.S.; Wucher, F.P.; Harper, J. Duration of effusion after antibiotic treatment for acute otitis media: Comparison of cefaclor and amoxicillin. Pediatr. Infect. Dis. J. 1982, 1, 310–316.
  26. Liang, S.; Fang, K.T.; Huang, X.W.; Xin, Y.; Ma, C.X. Homogeneity tests and interval estimations of risk differences for stratified bilateral and unilateral correlated data. Stat. Pap. 2024, 65, 3499–3543.
  27. Zhou, J.; Ma, C.X. Homogeneity Test of Proportions for Combined Unilateral and Bilateral Data via GEE and MLE Approaches. 2025; manuscript in preparation.
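Table 1. Frequency table for number of cured organs for subjects in $g$ groups.

| # of Cured Organs | Group 1 | Group 2 | $\cdots$ | Group $g$ | Total |
| --- | --- | --- | --- | --- | --- |
| 0 | $m_{01}$ | $m_{02}$ | $\cdots$ | $m_{0g}$ | $m_{0+}$ |
| 1 | $m_{11}$ | $m_{12}$ | $\cdots$ | $m_{1g}$ | $m_{1+}$ |
| 2 | $m_{21}$ | $m_{22}$ | $\cdots$ | $m_{2g}$ | $m_{2+}$ |
| total | $m_{+1}$ | $m_{+2}$ | $\cdots$ | $m_{+g}$ | $m_{++}$ |
| 0 | $n_{01}$ | $n_{02}$ | $\cdots$ | $n_{0g}$ | $n_{0+}$ |
| 1 | $n_{11}$ | $n_{12}$ | $\cdots$ | $n_{1g}$ | $n_{1+}$ |
| total | $n_{+1}$ | $n_{+2}$ | $\cdots$ | $n_{+g}$ | $n_{++}$ |
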
Table 2. Summary of model assumptions and corresponding nuisance parameters.

| Model | Nuisance Parameter ($\kappa$) |
| --- | --- |
| Rosner’s | $R$ |
| Donner’s | $\rho$ |
| Dallal’s | $\gamma$ |
| Clayton copula | $\theta$ |

Table 3. Setups of the nuisance parameters in Rosner’s ($R$), Donner’s ($\rho$), Dallal’s ($\gamma$), and Clayton copula ($\theta$) models, under $H_0$.

| $R$ | $\rho$ | $\gamma$ | $\theta$ |
| --- | --- | --- | --- |
| 1.2 | 0.5 | 0.3 | 1.0 |
| 1.5 | 0.7 | 0.5 | 2.0 |
| 1.8 | 0.9 | 0.7 | 4.0 |

Table 4. Setups of $\pi = (\pi_1, \ldots, \pi_g)$ for $g = 2, 4, 8$ under $H_0$.

| $g$ | Case | $\pi = (\pi_1, \ldots, \pi_g)$ |
| --- | --- | --- |
| 2 | I | (0.3, 0.5) |
| | II | (0.5, 0.5) |
| 4 | III | (0.1, 0.2, 0.3, 0.4) |
| | IV | (0.2, 0.2, 0.4, 0.4) |
| 8 | V | (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4) |
| | VI | (0.2, 0.2, 0.4, 0.4, 0.2, 0.2, 0.4, 0.4) |

Table 9. Setups of the nuisance parameters in Rosner’s ($R$), Donner’s ($\rho$), Dallal’s ($\gamma$), and Clayton copula ($\theta$) models, under $H_1$.

| Parameter | $g = 2$ | $g = 4$ | $g = 8$ |
| --- | --- | --- | --- |
| $R$ | (1.2, 1.5) | (1.2, 1.2, 1.5, 1.5) | (1.2, 1.2, 1.2, 1.2, 1.5, 1.5, 1.5, 1.5) |
| $\rho$ | (0.5, 0.7) | (0.5, 0.5, 0.7, 0.7) | (0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7) |
| $\gamma$ | (0.5, 0.7) | (0.5, 0.5, 0.7, 0.7) | (0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7) |
| $\theta$ | (2.0, 4.0) | (2.0, 2.0, 4.0, 4.0) | (2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0) |

Table 10. Setups of $\pi = (\pi_1, \ldots, \pi_g)$ for $g = 2, 4, 8$ under $H_1$.

| $g$ | Case | $\pi = (\pi_1, \ldots, \pi_g)$ |
| --- | --- | --- |
| 2 | I | (0.2, 0.2) |
| | II | (0.2, 0.4) |
| 4 | III | (0.1, 0.2, 0.3, 0.4) |
| | IV | (0.2, 0.2, 0.4, 0.4) |
| 8 | V | (0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.4) |
| | VI | (0.2, 0.2, 0.4, 0.4, 0.2, 0.2, 0.4, 0.4) |

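Table 11. The powers (in %) with $(m_{+}, n_{+}) = 150$ at the nominal level of $\alpha = 0.05$.

| Model | $g$ | Case | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rosner’s | 2 | I | 7.56 | 7.23 | 3.75 | 7.11 | 7.28 | 7.16 |
| | | II | 10.40 | 9.32 | 5.50 | 10.01 | 9.25 | 9.25 |
| | 4 | III | 7.06 | 6.07 | 2.59 | 7.59 | 6.58 | 7.90 |
| | | IV | 10.85 | 9.30 | 4.71 | 10.43 | 9.33 | 9.32 |
| | 8 | V | 18.85 | 18.52 | 7.18 | 20.07 | 19.63 | 18.51 |
| | | VI | 30.07 | 28.83 | 15.00 | 28.83 | 28.99 | 29.25 |
| Donner’s | 2 | I | 27.79 | 26.88 | 19.19 | 27.04 | 26.93 | 27.16 |
| | | II | 34.95 | 34.64 | 25.95 | 34.28 | 34.48 | 34.32 |
| | 4 | III | 39.77 | 39.27 | 26.24 | 38.61 | 39.24 | 38.68 |
| | | IV | 46.83 | 46.18 | 33.17 | 45.88 | 46.23 | 45.81 |
| | 8 | V | 62.09 | 60.43 | 40.47 | 60.54 | 60.45 | 60.96 |
| | | VI | 68.96 | 67.95 | 49.75 | 67.92 | 67.72 | 68.18 |
| Dallal’s | 2 | I | 33.88 | 33.13 | 23.75 | 33.06 | 32.98 | 33.29 |
| | | II | 45.82 | 45.62 | 37.07 | 45.16 | 45.22 | 45.18 |
| | 4 | III | 51.95 | 50.93 | 36.64 | 50.87 | 50.93 | 50.91 |
| | | IV | 64.99 | 64.26 | 51.83 | 64.02 | 64.23 | 64.03 |
| | 8 | V | 82.20 | 81.21 | 65.66 | 81.14 | 81.18 | 81.61 |
| | | VI | 90.69 | 90.30 | 80.06 | 90.29 | 90.16 | 90.56 |
| Clayton copula | 2 | I | 16.15 | 15.66 | 10.41 | 15.96 | 15.88 | 16.12 |
| | | II | 21.69 | 21.55 | 15.31 | 21.24 | 21.43 | 20.87 |
| | 4 | III | 20.16 | 19.06 | 10.89 | 19.49 | 19.34 | 18.47 |
| | | IV | 28.20 | 27.63 | 17.83 | 27.40 | 27.65 | 26.97 |
| | 8 | V | 38.40 | 37.07 | 20.43 | 37.42 | 37.42 | 36.95 |
| | | VI | 46.79 | 45.69 | 28.72 | 45.67 | 45.67 | 46.36 |
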
Table 12. Number of cured ears at 14 days in children treated with cefaclor and amoxicillin.

| # of Cured Ears | Cefaclor | Amoxicillin | Total |
| --- | --- | --- | --- |
| 0 | 21 (15.5$^1$, 20.1$^2$, 22.8$^3$, 22.6$^4$, 22.7$^5$) | 13 (5.9$^1$, 12.7$^2$, 10.8$^3$, 11.0$^4$, 11.1$^5$) | 34 |
| 1 | 9 (21.2$^1$, 10.7$^2$, 6.9$^3$, 6.1$^4$, 7.6$^5$) | 3 (15.3$^1$, 2.6$^2$, 5.0$^3$, 5.7$^4$, 4.1$^5$) | 12 |
| 2 | 14 (7.3$^1$, 13.2$^2$, 14.3$^3$, 15.3$^4$, 13.7$^5$) | 15 (9.8$^1$, 15.7$^2$, 15.2$^3$, 14.3$^4$, 15.8$^5$) | 29 |
| total | 44 | 31 | 75 |
| 0 | 38 (36.8$^1$, 35.9$^2$, 36.9$^3$, 36.2$^4$, 37.3$^5$) | 27 (28.9$^1$, 29.8$^2$, 28.3$^3$, 29.5$^4$, 28.0$^5$) | 65 |
| 1 | 24 (25.2$^1$, 26.1$^2$, 25.1$^3$, 25.8$^4$, 24.7$^5$) | 39 (37.1$^1$, 36.2$^2$, 37.7$^3$, 36.5$^4$, 38.0$^5$) | 63 |
| total | 62 | 66 | 128 |

Note: The numbers in parentheses next to each observed cell count represent the expected frequencies under different models. The superscripts 1 to 5 indicate the model used to compute the expected values: 1. Independence; 2. Rosner’s; 3. Donner’s; 4. Dallal’s; 5. Clayton copula model, respectively. This notation also applies to Tables 14 and 16.

Table 13. Example 1: p-values of the six methods for goodness-of-fit test and AICs for different models.

| No. | Model | $G^2$ | $X^2$ | $X^2_{adj}$ | $B_1$ | $B_2$ | $B_3$ | AIC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Independence | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 367.4916 |
| 2 | Rosner’s | 0.7327 | 0.7367 | 0.8796 | 0.7475 | 0.7515 | 0.7355 | 329.4285 |
| 3 | Donner’s | 0.5283 | 0.5385 | 0.7553 | 0.5206 | 0.5286 | 0.5186 | 330.3617 |
| 4 | Dallal’s | 0.2647 | 0.2741 | 0.4827 | 0.2690 | 0.2720 | 0.2615 | 332.1132 |
| 5 | Clayton copula $^a$ | 0.7735 | 0.7742 | 0.9321 | 0.7790 | 0.7795 | 0.7740 | 329.2583 |

$^a$ The Clayton copula model is the best model for the OME dataset.

Table 14. Number of improved myopic eyes with 3 brands of Ortho-k treatment.
Table 14. Number of improved myopic eyes with 3 brands of Ortho-k treatment.
| Data | # of Myopia Improved Eyes | Brand Q | Brand Y | Brand W | Total |
|---|---|---|---|---|---|
| Bilateral | 0 | 2 (0.7¹, 2.1², 1.6³, 2.0⁴, 1.7⁵) | 3 (2.6¹, 2.7², 3.2³, 3.3⁴, 2.9⁵) | 3 (1.8¹, 3.2², 3.3³, 2.8⁴, 3.5⁵) | 8 |
| Bilateral | 1 | 1 (3.9¹, 1.1², 1.9³, 2.4⁴, 1.6⁵) | 1 (2.0¹, 1.8², 1.0³, 0.5⁴, 1.4⁵) | 4 (6.1¹, 3.7², 3.0³, 3.0⁴, 3.0⁵) | 6 |
| Bilateral | 2 | 7 (5.5¹, 6.9², 6.5³, 5.6⁴, 6.7⁵) | 1 (0.4¹, 0.5², 0.8³, 1.2⁴, 0.7⁵) | 6 (5.2¹, 6.2², 6.7³, 7.2⁴, 6.5⁵) | 14 |
| Bilateral | total | 10 | 5 | 13 | 28 |
| Unilateral | 0 | 1 (0.8¹, 0.8², 0.8³, 1.0⁴, 0.7⁵) | 1 (0.7¹, 0.7², 0.7³, 0.7⁴, 0.7⁵) | 0 (0.4¹, 0.4², 0.4³, 0.3⁴, 0.4⁵) | 2 |
| Unilateral | 1 | 2 (2.2¹, 2.2², 2.2³, 2.0⁴, 2.3⁵) | 0 (0.3¹, 0.3², 0.3³, 0.3⁴, 0.3⁵) | 1 (0.6¹, 0.6², 0.6³, 0.7⁴, 0.6⁵) | 3 |
| Unilateral | total | 3 | 1 | 1 | 5 |
Table 15. Example 2: p-values of the six methods for goodness-of-fit test and AICs for different models.
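| No. | Model | p-value, G² | p-value, X² | p-value, X²_adj | p-value, B₁ | p-value, B₂ | p-value, B₃ | AIC |
|---|---|---|---|---|---|---|---|---|
| 1 | Independence | 0.0840 | 0.0935 | 0.5526 | 0.0135 | 0.0185 | 0.5820 | 74.5698 |
| 2 | Rosner’s ᵃ | 0.7554 | 0.8399 | 0.9731 | 0.4377 | 0.5778 | 1.0000 | 67.5026 |
| 3 | Donner’s | 0.7466 | 0.8403 | 0.9593 | 0.3785 | 0.5467 | 1.0000 | 67.5607 |
| 4 | Dallal’s | 0.5841 | 0.6859 | 0.9151 | 0.2499 | 0.3540 | 1.0000 | 68.6260 |
| 5 | Clayton copula | 0.7439 | 0.8335 | 0.9851 | 0.3056 | 0.4167 | 1.0000 | 67.5782 |

ᵃ Rosner’s model is the best model for the Ortho-k dataset.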
Table 16. Number of affected eyes for persons in four genetic groups.
| # of Affected Eyes | Genetic type: DOM | Genetic type: AR | Genetic type: SL | Genetic type: ISO | Total |
|---|---|---|---|---|---|
| 0 | 15 (11.6¹, 13.2², 15.5³, 15.0⁴, 15.3⁵) | 7 (4.3¹, 8.9², 7.7³, 7.0⁴, 8.3⁵) | 3 (0.8¹, 7.6², 2.8³, 3.0⁴, 3.5⁵) | 67 (42.2¹, 61.9², 65.9³, 67.0⁴, 64.8⁵) | 92 |
| 1 | 6 (12.9¹, 7.6², 4.6³, 3.9⁴, 5.6⁵) | 5 (10.4¹, 4.1², 3.7³, 4.2⁴, 3.5⁵) | 2 (6.3¹, 1.4², 2.2³, 4.8⁴, 1.5⁵) | 24 (73.7¹, 26.2², 26.4³, 24.2⁴, 26.5⁵) | 37 |
| 2 | 7 (3.6¹, 7.2², 7.8³, 9.1⁴, 7.1⁵) | 9 (6.3¹, 8.0², 9.6³, 9.8⁴, 9.2⁵) | 14 (11.8¹, 10.0², 13.9³, 11.2⁴, 14.0⁵) | 57 (32.2¹, 59.9², 55.7³, 56.8⁴, 56.7⁵) | 87 |
| total | 28 | 21 | 19 | 148 | 216 |
Table 17. Example 3: p-values of the six methods for goodness-of-fit test and AICs for different models.
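| No. | Model | p-value, G² | p-value, X² | p-value, X²_adj | p-value, B₁ | p-value, B₂ | p-value, B₃ | AIC |
|---|---|---|---|---|---|---|---|---|
| 1 | Independence | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 537.6511 |
| 2 | Rosner’s | 0.0595 | 0.0797 | 0.2032 | 0.0625 | 0.0715 | 0.0885 | 449.9490 |
| 3 | Donner’s ᵃ | 0.7355 | 0.7206 | 0.9030 | 0.7550 | 0.7295 | 0.6690 | 443.7967 |
| 4 | Dallal’s | 0.2162 | 0.2424 | 0.4418 | 0.2375 | 0.2420 | 0.2205 | 446.9802 |
| 5 | Clayton copula | 0.7218 | 0.7063 | 0.8917 | 0.7190 | 0.6875 | 0.6620 | 443.8541 |

ᵃ Donner’s model is the best model for the retinitis pigmentosa dataset.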
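The footnotes to Tables 13, 15, and 17 name the model with the smallest AIC in each example. The sketch below reproduces that selection from the tabulated AIC values and also reports each model's AIC difference from the best one. This is only a bookkeeping illustration; the helper names `best_model` and `delta_aic` are hypothetical, and the AIC differences are a supplementary summary that the tables themselves do not report.

```python
# AIC values copied from Tables 13, 15 and 17.
AIC = {
    "Example 1 (OME)": {
        "Independence": 367.4916,
        "Rosner": 329.4285,
        "Donner": 330.3617,
        "Dallal": 332.1132,
        "Clayton copula": 329.2583,
    },
    "Example 2 (Ortho-k)": {
        "Independence": 74.5698,
        "Rosner": 67.5026,
        "Donner": 67.5607,
        "Dallal": 68.6260,
        "Clayton copula": 67.5782,
    },
    "Example 3 (retinitis pigmentosa)": {
        "Independence": 537.6511,
        "Rosner": 449.9490,
        "Donner": 443.7967,
        "Dallal": 446.9802,
        "Clayton copula": 443.8541,
    },
}


def best_model(aics):
    """Return the name of the model with the smallest AIC."""
    return min(aics, key=aics.get)


def delta_aic(aics):
    """Return each model's AIC minus the smallest AIC in the set."""
    smallest = min(aics.values())
    return {model: value - smallest for model, value in aics.items()}


if __name__ == "__main__":
    for example, aics in AIC.items():
        print(f"{example}: best = {best_model(aics)}")
        # List models from smallest to largest AIC difference.
        for model, diff in sorted(delta_aic(aics).items(), key=lambda kv: kv[1]):
            print(f"  {model}: delta AIC = {diff:.4f}")
```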

Share and Cite

MDPI and ACS Style

Zhou, J.; Ma, C.-X. Goodness-of-Fit Tests for Combined Unilateral and Bilateral Data. Mathematics 2025, 13, 2501. https://doi.org/10.3390/math13152501
