1. Introduction
In clinical trials, bilateral data are usually collected when patients receive treatment on paired organs or body parts. Unilateral data are often encountered when only one of a patient’s paired organs is diseased or has received treatment. Many current studies analyze unilateral and bilateral data separately [1, 2]. Descriptive statistics and standard regression methods are relatively straightforward, but they may be insufficient for analyzing bilateral data [3]. This limitation arises from the inherent correlation between a subject’s paired organs or parts. Ignoring this correlation could lead to biased results.
To date, the intraclass correlation in bilateral data has been addressed by various probability models. For instance, Rosner [4] introduced a model which supposes that, when one organ gives a response, the probability of a response in the other organ is proportional to the prevalence rate of the corresponding group. However, Rosner’s model might fit poorly if the characteristic is almost certain to occur bilaterally while the group-specific prevalences vary widely. In view of this, Dallal [5] proposed an alternative model, which assumes that the probability that one organ responds, given that the other organ responds, does not depend on the marginal response probability. Later, Donner [6] proposed another model, in which the intraclass correlation coefficient is common to the two groups. In the statistical inference of bilateral data, it is very important to select an appropriate model for the intraclass correlation within a subject, because ignoring this correlation undermines the ability to identify the true therapeutic effect accurately. Existing results on the statistical inference of bilateral data include asymptotic tests [7, 8, 9] and confidence intervals [10, 11, 12].
Although studies of bilateral data can reflect the therapeutic effect of paired-organ treatment well, unilateral data are inevitably included when actual data are collected. In that case, methods that apply only to bilateral data are no longer valid. Two real examples, from otolaryngology and myopia research, illustrate this. The first comes from a double-blinded randomized clinical trial conducted in the context of acute otitis media with effusion [13]. It can be used to compare the effects of two antibiotics (Cefaclor and Amoxicillin) in the treatment of otitis media with effusion (OME). The second comes from an observational study on myopia patients [14], in which 60 subjects diagnosed with myopia receive orthokeratology (Ortho-k). Analyzing unilateral and bilateral data together better preserves the information in the original data, and the resulting methods extend easily to settings with only unilateral or only bilateral data. Both examples have a stratified unilateral and bilateral correlated data structure, in which the data are categorized by attributes such as gender, age, or treatment method. Recently, some studies have discussed whether patients need to be stratified and grouped for treatment under this data structure. Based on Donner’s model, Wang et al. [15] derived three homogeneity tests to detect whether the risk ratio is consistent across strata. Their results showed that the score test provided a robust type I error rate and satisfactory power. Moreover, the complete test procedure and interval estimation of the odds ratio were discussed by Hua et al. [16, 17]. Under Dallal’s model, Liang et al. [14] proposed four confidence interval methods for the common risk difference and indicated that the profile likelihood confidence interval outperformed the other methods. Sun et al. [18, 19] studied the homogeneity test and the common test of the risk difference, and found that the score method consistently outperformed other methods. Moreover, Sun and Li [20] discussed the common test of the risk ratio. However, the results in reference [21] indicated that a homogeneity test is usually required before conducting the common test.
The inference methods above share a common characteristic: they perform suboptimally when the sample size is limited, and their performance improves as the sample size grows. Therefore, once the statistical inference method has been chosen, a key question in designing a clinical trial is how to select a sample size that achieves the desired effect while limiting wasted resources. Sample size determination is thus extremely important for asymptotic statistical inference methods. With an appropriate sample size, the statistical test can achieve a specified power at a given nominal level in paired medical trials [22]. Qiu et al. [23] proposed an iterative algorithm for sample size determination using score and likelihood tests under two models. For stratified bilateral data, Mou et al. [24] proposed several methods to calculate sample sizes for a common test of relative risk ratios. However, little research has evaluated the performance of such methods for combined unilateral and bilateral data. Furthermore, although statistical inference on the risk difference for stratified unilateral and bilateral data has been established under Dallal’s model, the relative risk ratio has not yet been studied under this model. Liang et al. [14] also pointed out that the relative risk ratio is a parameter to which the current framework is worth extending. Meanwhile, the relative risk ratio may perform better than the risk difference for statistical inference when the differences between data sets are relatively small [25]. This paper aims to investigate the homogeneity test and sample size determination for the relative risk in stratified unilateral and bilateral data under Dallal’s model.
The rest of this paper is organized as follows. Section 2 describes the data structure and introduces Dallal’s model, along with the unconstrained and constrained maximum likelihood estimations (MLEs) under different hypotheses. In Section 3, we propose three test statistics for homogeneity and determine the sample size using an iterative algorithm. Section 4 presents Monte Carlo simulations to evaluate the performance of the test statistics in terms of type I error rates and power (Section 4.1), followed by an assessment of the sample size determination based on estimated power (Section 4.2). Two real-world applications, including a study on acute otitis media and a recent study on myopia, are illustrated in Section 5. Finally, Section 6 concludes the paper.
2. Dallal’s Model
Assume that there are
M subjects, divided into
J strata, and each stratum has two groups. For the
jth stratum (j = 1, ..., J),
represents the number of subjects providing unilateral data and
represents the number of subjects providing bilateral data.
and
represent the number of responses provided for unilateral data and bilateral data, respectively. Suppose that
is the number of unilateral patients with
l responses, and
is the number of bilateral patients with
responses in the
ith group (i = 1, 2) of the jth stratum (j = 1, ..., J). For each stratum, we denote:
Let
and
be the corresponding probabilities of
and
. The observed data of the
jth stratum are shown in
Table 1.
For unilateral data, let
be the indicator for judging whether the
kth patient has a response or not in the
ith group of the
jth stratum. If there is a response, then
; otherwise,
. For bilateral data, define
if the
hth organ (h = 1, 2) of the
kth patient has a response, and
otherwise. Under Dallal’s model, we assume that:
where
(
) represents the probability that the organ will improve, and
(
) represents the probability that one organ will respond when another organ improves. The correlation coefficient is
in the
ith group of the
jth stratum. Specifically,
if two organs are completely independent, while
if two organs are completely dependent. By calculation, the probabilities can be obtained by
where
,
. For the observed data
, the joint probability function is given by:
Let
be the relative risk ratio between the two groups in the
jth stratum. We are interested in whether the risk ratio between the two groups is the same across
J strata. Thus, the homogeneity test is given as follows:
Next, the expressions or algorithms for all MLEs will be provided under the homogeneity test. The MLEs under the alternative hypothesis and null hypothesis are called the unconstrained and constrained MLEs, respectively.
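As a reference for the description above, the cell probabilities implied by Dallal’s model can be written out for a single group. This is a sketch under assumed notation, not necessarily the symbols used in this paper: \pi denotes the marginal response probability of one organ, \gamma the conditional probability that one organ responds given that the other responds, \rho the within-subject correlation, and \delta_j the risk ratio in the jth stratum. Group and stratum subscripts on \pi, \gamma, and \rho are suppressed.

```latex
% Assumed notation (illustrative only): \pi, \gamma, \rho, \delta_j as described in the lead-in.
\begin{align*}
\text{Unilateral:}\quad & P(\text{response}) = \pi, \qquad P(\text{no response}) = 1-\pi,\\
\text{Bilateral:}\quad & P(\text{both organs respond}) = \gamma\pi,\\
& P(\text{exactly one organ responds}) = 2\pi(1-\gamma),\\
& P(\text{neither organ responds}) = 1 - 2\pi + \gamma\pi,\\
\text{Correlation:}\quad & \rho = \frac{\gamma\pi - \pi^{2}}{\pi(1-\pi)} = \frac{\gamma-\pi}{1-\pi},\\
\text{Homogeneity test:}\quad & H_0:\ \delta_1 = \cdots = \delta_J
  \quad\text{versus}\quad H_1:\ \text{not all } \delta_j \text{ are equal}.
\end{align*}
```

In this form, the three bilateral probabilities sum to one, \gamma = \pi gives \rho = 0 (independent organs), and \gamma = 1 gives \rho = 1 (completely dependent organs).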
Unconstrained MLEs. Based on the hypothesis
and Equation (
1), the log-likelihood function can be expressed as follows:
where
,
,
,
, and
. Differentiate Equation (2) with respect to
and
, and set them to 0. That is:
Since closed-form solutions may not be available, an iterative procedure is adopted for parameter estimation. The detailed process is as follows. First, initial values are calculated from explicit formulas of the counts:
Then, the (
)th approximation
and
can be obtained,
where
is the Fisher information matrix (
Appendix A.1). Repeat the above step until all estimates converge. Then, the unconstrained MLEs
and
can be obtained.
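The scoring iteration can be illustrated on the simplest case, a single group fitted in isolation. The sketch below is not the paper's algorithm: it assumes the group has its own pair of parameters (pi, gamma), uses the multinomial cell probabilities of Dallal's model, and the function names, starting values, and step-halving safeguard are all hypothetical choices. If the conditional probability is shared by the two groups of a stratum, the groups would have to be fitted jointly with a larger information matrix.

```python
import math

def score(pi, gamma, m, n):
    """Score vector for one group; m = (m0, m1, m2) bilateral, n = (n0, n1) unilateral counts."""
    m0, m1, m2 = m
    n0, n1 = n
    a = 1 - 2 * pi + gamma * pi  # P(neither organ responds)
    s_pi = m0 * (gamma - 2) / a + (m1 + m2 + n1) / pi - n0 / (1 - pi)
    s_gamma = m0 * pi / a - m1 / (1 - gamma) + m2 / gamma
    return s_pi, s_gamma

def fisher_info(pi, gamma, m_total, n_total):
    """Expected information for (pi, gamma), returned as (I_pp, I_pg, I_gg)."""
    a = 1 - 2 * pi + gamma * pi
    i_pp = m_total * ((2 - gamma) ** 2 / a + (2 - gamma) / pi) + n_total / (pi * (1 - pi))
    i_pg = -m_total / a
    i_gg = m_total * pi * (pi / a + 2 / (1 - gamma) + 1 / gamma)
    return i_pp, i_pg, i_gg

def in_domain(pi, gamma):
    """All cell probabilities must be strictly positive."""
    return 0 < pi < 1 and 0 < gamma < 1 and 1 - 2 * pi + gamma * pi > 0

def fit_group(m, n, tol=1e-10, max_iter=200):
    """Fisher scoring for one group. Assumes the bilateral counts m1 and m2 are positive."""
    m0, m1, m2 = m
    n0, n1 = n
    m_total, n_total = m0 + m1 + m2, n0 + n1
    # Assumed starting values: pooled per-organ response rate and the share of
    # responding organs (among bilateral subjects) that belong to a double response.
    pi = (m1 + 2 * m2 + n1) / (2 * m_total + n_total)
    gamma = 2 * m2 / (m1 + 2 * m2)
    for _ in range(max_iter):
        s_pi, s_gamma = score(pi, gamma, m, n)
        i_pp, i_pg, i_gg = fisher_info(pi, gamma, m_total, n_total)
        det = i_pp * i_gg - i_pg ** 2
        # Scoring step: inverse of the 2x2 information matrix times the score.
        d_pi = (i_gg * s_pi - i_pg * s_gamma) / det
        d_gamma = (-i_pg * s_pi + i_pp * s_gamma) / det
        step = 1.0
        while not in_domain(pi + step * d_pi, gamma + step * d_gamma):
            step /= 2  # halve the step until the update stays inside the parameter space
        pi += step * d_pi
        gamma += step * d_gamma
        if max(abs(step * d_pi), abs(step * d_gamma)) < tol:
            break
    return pi, gamma
```

With bilateral data only, the model is saturated and the starting values already solve the score equations, which gives a quick sanity check: `fit_group((50, 20, 30), (0, 0))` returns values equal, up to rounding, to (80/200, 60/80).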
Constrained MLEs. Under the null hypothesis
, it follows that
for each
. Thus, Equation (
2) can be expressed as follows:
where
is the parameter of interest in the homogeneity test.
and
(
) are nuisance parameters. The constrained MLEs of
and
can be denoted as
and
. The estimates are the solution of the following equations:
However, there is no closed-form solution. The equations can be solved by combining the Newton–Raphson procedure with the Fisher scoring method. First, take as the initial values. By iterating steps (i) and (ii) until convergence, the constrained MLEs are obtained as follows:
- (i)
The (
)th approximation
is:
- (ii)
and
can be updated by the Fisher scoring algorithm:
where
is the Fisher information matrix. See
Appendix A.1 for more details.
5. Two Real Examples
To illustrate the methodology on combined unilateral and bilateral data, we apply it to two real examples, from otolaryngology and myopia research. For the otolaryngology study (Table 5 of Reference [20]), the risk ratio is
. The homogeneity hypothesis
versus
is tested to determine whether children of different ages require distinct antibiotics for improved treatment outcomes. The unconstrained MLEs are
,
,
and
. The constrained MLEs can be found in Table 6 of Reference [20].
Table 7 presents the test statistics and p-values of the three proposed tests for the two examples. The results show that all
p-values are above
under
, so we fail to reject the null hypothesis that the relative risk ratio between Cefaclor and Amoxicillin is homogeneous across the three age strata.
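A p-value of this kind can be computed from a test statistic as sketched below. The sketch assumes the standard large-sample result that a likelihood ratio, score, or Wald-type statistic for homogeneity across J strata is referred to a chi-square distribution with J - 1 degrees of freedom. The function names are hypothetical, and the upper-tail probability is evaluated with a standard recurrence for integer degrees of freedom, so only the Python standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases: df = 1 and df = 2.
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1
    else:
        sf, k = math.exp(-half), 2
    while k < df:
        # Recurrence: Q_{k+2}(x) = Q_k(x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        sf += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return sf

def homogeneity_p_value(statistic, n_strata):
    """Asymptotic p-value of a homogeneity statistic across n_strata strata (assumed df = n_strata - 1)."""
    return chi2_sf(statistic, n_strata - 1)
```

For three strata the reference distribution has two degrees of freedom, so the p-value reduces to exp(-T/2) for a statistic T; for two strata it is erfc(sqrt(T/2)).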
The parameter estimates are identical to the MLEs given above.
Table 8 reports the sample sizes and corresponding estimated powers under target powers of 80% and 90% at
. The sample sizes for the three tests are generally accurate, with empirical powers close to the pre-specified levels, and are thus recommended for this example. While
requires a smaller sample size than
and
,
yields estimated powers closer to the nominal values. Consequently, to achieve 80% (or 90%) power, sample sizes of 504 (or 756) are needed.
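The general idea behind an iterative sample size search can be sketched as follows: increase the sample size until the power of the chosen test reaches the target. This is a generic illustration, not the procedure used in this paper. `smallest_sample_size` and `toy_power` are hypothetical names, and `toy_power` is a placeholder (the power of a one-sided z-test with a standardized effect) standing in for the asymptotic power of the score, likelihood ratio, or Wald-type test.

```python
from statistics import NormalDist

def smallest_sample_size(power_fn, target_power, start=2, max_n=100_000):
    """Return the smallest n in [start, max_n] with power_fn(n) >= target_power."""
    for n in range(start, max_n + 1):
        if power_fn(n) >= target_power:
            return n
    raise ValueError("target power not reached within max_n")

def toy_power(n, effect=0.5, alpha=0.05):
    """Placeholder power function: one-sided z-test with standardized effect `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(effect * n ** 0.5 - z_alpha)

n_80 = smallest_sample_size(toy_power, 0.80)
n_90 = smallest_sample_size(toy_power, 0.90)
```

Each evaluation of the power function is cheap here. When the power must instead be estimated for every candidate sample size, the search becomes the dominant computational cost.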
The second example is an observational study on myopia patients (Table 9 of Reference [20]). We are interested in comparing the effectiveness of treatment between the VST and CRT groups across genders. We apply the proposed methods to test
vs.
. According to the calculation, the unconstrained MLEs are
,
,
and
.
Table 8 indicates that
and
have greater estimated power than
. To obtain more robust results and reach the desired power of 80% (or 90%), a sample size of 120 (or 160) is needed.
6. Conclusions
We develop three asymptotic tests and three iterative sample size determination methods for the risk ratio based on stratified unilateral and bilateral data within Dallal’s model. The unconstrained and constrained MLEs are derived using the Newton–Raphson procedure and the Fisher scoring algorithm. Simulation results support the recommendation of the score test for evaluating treatment effectiveness under a variety of data-generating scenarios. The sample size methods based on the score test or the likelihood ratio test are also suggested for determining the empirical sample size, because their estimated powers are closer to empirical powers than those based on the Wald-type test. Furthermore, two real-world datasets of acute otitis media and myopic eyes are used to illustrate the application of the proposed tests and sample size determination.
The contributions of this work include the following: (i) Many current studies analyze bilateral data without considering unilateral data. However, unilateral data also arise in clinical practice when only one of a patient’s paired organs is diseased or has received treatment. Our methods can be applied not only to bilateral data but also to combined unilateral and bilateral data; in our context, bilateral data alone constitute a special case. (ii) In practice, sample size is one of the essential factors in designing clinical trials. With an appropriate sample size, the statistical test can achieve a specified power at a given nominal level in paired medical trials. An insufficient sample size leads to inaccurate test results, whereas an excessive sample size wastes resources. Therefore, sample size determination for stratified unilateral and bilateral data is discussed in this paper.
Despite its satisfactory performance, the proposed method has two limitations. First, the iterative sample size determination process is computationally intensive. Second, the reliability of the method relies on the assumption of large stratum sizes, and its performance in sparse data settings requires further investigation. Therefore, future research should focus on developing exact methods for small-sample data to enhance its applicability.