1. Introduction
Paired outcomes are common in many fields of study. Data with paired observations arise frequently in health and social studies. Examples include results of the same test before and after an intervention, outcomes from crossover clinical trials where the same subject receives two treatments at two different time points in the same trial, measurements on the left and right eyes of the same person, and observations from twin studies involving identical or fraternal twins. For comparing the distributions of such paired outcomes, the paired t-test is a widely used approach. However, its strong distributional assumption makes it unsuitable for non-normal data. The Wilcoxon signed-rank test is a popular nonparametric alternative for comparing paired outcomes.
The Wilcoxon signed-rank test is only valid for independent and identically distributed pairs. In practice, not all data are independent, and correlated data sets are common. One type of correlated data is clustered data, where outcomes within a cluster are correlated while outcomes from different clusters may be independent. Several methods have been developed for inference on different types of outcomes from clustered data, including comparison of continuous outcomes from independent groups [1,2,3,4], categorical outcomes [5], longitudinal outcomes [6], and censored time-to-event outcomes [7]. Paired outcomes are another type of outcome that can arise in clustered data. Such paired outcomes can be observed in dental studies involving multiple individuals, where attachment loss is measured at two different locations (e.g., buccal and mesial) of the same tooth. Here, individuals are clusters, and the attachment loss scores from the buccal and mesial sites of the same tooth form a paired outcome, resulting in many paired observations within each cluster. Paired clustered data can also be obtained from large crossover clinical trials with two treatment arms and a washout period. Here, a trial participant forms a cluster while the outcome measurements before and after a treatment form a pair in that cluster. Since every participant in a crossover trial is allocated to both competing treatment arms, separated by a washout period to remove prior treatment effects, each cluster has multiple pairs of observations. For these types of clustered paired data, the traditional Wilcoxon signed-rank test does not work because it fails to account for the correlated nature of the data. As a result, there have been a number of attempts to develop signed-rank tests for clustered data [8,9,10].
The signed-rank test by Rosner, Glynn, and Lee [10] is one of the earliest signed-rank tests developed for clustered data under the assumption of a common intra-cluster correlation structure across different clusters. Later, Datta and Satten [8] developed a more flexible signed-rank testing approach for clustered data that considers informative cluster size scenarios using the idea of within-cluster resampling [11]. Informative cluster sizes occur when the cluster size (i.e., the number of units (pairs) within a cluster) is correlated with the outcome in that cluster. Such informative cluster sizes can exist in a dental study when comparing the buccal and mesial attachment loss scores in an aged population. This is because the number of teeth (cluster size) in an aged individual (cluster) is indicative of the overall attachment loss (outcome) of that individual. Another example of a potentially informative cluster size arises when analyzing neuroimaging data of individuals suffering from dementia or Alzheimer's disease. In this case, the number of imaging sessions conducted on a patient is the cluster size, which may be related to the disease severity outcome.
The signed-rank tests discussed above were developed for marginal comparison of the outcomes in a pair. These tests do not take any covariate information into account while comparing the outcome distributions in a pair. However, in many situations the data contain potentially important covariate information that can substantially affect the outcomes and, hence, the paired comparison results. Ignoring available covariate information in marginal analyses of outcomes can lead to incomplete inference and, consequently, to inaccurate or biased findings. For example, in longitudinal neuroimaging data, it can be of interest to examine whether certain metrics of cognitive ability in individuals who are at risk of cognitive impairment have changed significantly over the period of study. Such metrics can be obtained from the multiple MRI scans performed during successive clinic visits. In this case, the data are clustered, as each individual represents a cluster, and the cluster sizes may be informative since the number of visits (cluster size) may be associated with the severity of the impairment. However, it is also known that age affects cognitive abilities and that the effect of age can become significant in older populations. Therefore, even if we find significant changes in cognitive metrics over a certain period of study, those changes cannot be attributed solely to a cognitive disorder, as age may also have contributed to them. Ignoring the age information in a marginal analysis therefore leaves the effect of age on the outcome unadjusted, which can lead to biased inference. It becomes essential to adjust for the effect of such important covariates while performing pairwise comparison of outcomes in clustered data.
This highlights the need for a robust approach that can perform hypothesis testing of paired outcomes while incorporating information on, and adjusting for the effect of, important covariates. Motivated by this need, in this article we develop a method for the covariate-adjusted pairwise comparison of outcomes in clustered data while maintaining a rank-based approach that is robust to the choice of outcome distribution. We discuss the different scenarios of clustered data, where the cluster sizes can be informative or uninformative, and how our covariate-adjusted testing approach can be applied to both types of clustered data. We show that the proposed covariate-adjusted testing methodology maintains the correct size and has substantial power in different simulated scenarios of clustered data, and that it performs better than the marginal signed-rank tests and a standard parametric linear mixed effects method. Using neuroimaging data, we demonstrate the applicability of our method in obtaining meaningful results.
The rest of the article is organized as follows. In Section 2, we introduce the notation and discuss the different types of marginal hypotheses that can be framed for clustered data, along with their implications. In the same section, we also develop our rank-based covariate-adjusted testing mechanism for paired comparison, which can be used for clustered data when the cluster sizes are informative as well as when they appear to be uninformative. In Section 3, we explore the performance of our covariate-adjusted testing methodology through different simulated scenarios of clustered data. In Section 4, we return to the neuroimaging data example for an application of our method. Finally, the article ends with a discussion in Section 5.
3. Simulation Studies
We conducted two simulation studies in this section. In the first simulation scenario, the cluster sizes varied among clusters but were uninformative. The second simulation scenario considered clustered data where the cluster sizes were informative (i.e., the cluster sizes were correlated with the outcome of interest). In each of these simulated scenarios, we evaluated the performance, namely the size (type-I error rate) and power, of the ICAST and UCAST methods. Moreover, we compared the performance of these methods with the marginal signed-rank tests of Rosner, Glynn, and Lee [10] and of Datta and Satten [8]. We abbreviate these two marginal signed-rank testing methods as RGL and DS, respectively. In addition, we compared the performance of ICAST and UCAST with a parametric linear mixed model (LMM) [14], which involved a fixed effect for the covariate and a random cluster effect. The size and power computations for each of these testing approaches were based on 500 Monte Carlo replicates under a fixed nominal size (type-I error rate) of 0.05. The empirical size and power were calculated as the proportion of Monte Carlo replicates in which the null hypothesis was rejected. Note that, in this setting, if the empirical size of a method substantially exceeded 0.05, then that testing approach was unacceptable for testing these hypotheses irrespective of its power.
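As a rough illustration of how such empirical rates are obtained, the following sketch computes a rejection proportion from a vector of replicate p-values, together with the binomial Monte Carlo standard error of that proportion. The function names are hypothetical, and the rule "reject when the p-value is below 0.05" is an assumption about how rejection was recorded.

```python
import numpy as np

def empirical_rejection_rate(p_values, alpha=0.05):
    """Proportion of Monte Carlo replicates in which H0 is rejected.

    Assumes H0 is rejected when the replicate p-value is below alpha.
    """
    p_values = np.asarray(p_values, dtype=float)
    return float(np.mean(p_values < alpha))

def monte_carlo_se(rate, n_reps):
    """Binomial standard error of an empirical rejection rate."""
    return float(np.sqrt(rate * (1.0 - rate) / n_reps))

# Under H0 this is the empirical size; under an alternative it is the power.
rate = empirical_rejection_rate([0.01, 0.20, 0.04, 0.50])
se_at_nominal = monte_carlo_se(0.05, 500)
```

With 500 replicates and a true size of 0.05, the Monte Carlo standard error is about 0.0097, which gives a sense of how much an empirical size can fluctuate around the nominal level by chance alone.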
3.1. Simulation Scenario 1
In this simulation scenario, we considered clustered data with uninformative cluster size, where the marginal distribution of the pairwise difference in a cluster did not depend on its cluster size. Extending the simulation settings of Rosner, Glynn, and Lee [10] and Datta and Satten [8], we generated the pairwise differences as

$$Y_{ij} = \beta X_{ij} + \epsilon_{ij},$$

where $X_{ij} \sim N(0,1)$, $\beta = 5$, $\epsilon_{ij} = R_{ij}\exp(|B_{ij}|)$, $B_{ij} = A_i + E_{ij}$, $A_i \sim N(0, 0.25)$, $E_{ij} \sim N(0, 0.75)$, and $1 \le j \le N_i$, $1 \le i \le M$. For generating $R_{ij}$, we first generated $p_i$ from a $\mathrm{Beta}(1, b)$ distribution for each $1 \le i \le M$. If $p_i \le 0.5$, then $R_{ij} = 1$; otherwise $R_{ij} = -1$. Here, $M$ was fixed (either 10 or 25), but for each $i$, $N_i$ was generated as $N_i = N_i^* + 1$, where $N_i^* \sim \mathrm{Binomial}(7, 0.5)$. In this scenario, the cluster size $N_i$ for a typical cluster $i$ was a random variable that was independent of the outcome variable $Y_{ij}$. Note that $b = 1$ represented the marginal null hypothesis $H_0$. For power calculations, one could choose any positive value of $b$ other than 1. For our simulations, we chose three different values of $b$ (0.15, 0.3, 0.6) to investigate and compare the power of all the methods under consideration.
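The data-generating mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the simulation code used for the reported results: it assumes the pairwise differences follow the linear form Y_ij = β X_ij + ϵ_ij, that the second argument of each N(·,·) is a variance, and that the random effect A_i is shared within a cluster. The function name `simulate_scenario1` is hypothetical.

```python
import numpy as np

def simulate_scenario1(M, b, beta=5.0, rng=None):
    """Generate M clusters with uninformative cluster sizes.

    Assumptions (not confirmed by the text): Y_ij = beta * X_ij + eps_ij,
    and N(0, v) denotes a normal distribution with variance v.
    """
    rng = np.random.default_rng(rng)
    clusters = []
    for _ in range(M):
        n_i = rng.binomial(7, 0.5) + 1            # N_i = N_i* + 1, so 1 <= N_i <= 8
        p_i = rng.beta(1.0, b)                    # b = 1 corresponds to H0
        r_i = 1.0 if p_i <= 0.5 else -1.0         # sign of the errors in cluster i
        a_i = rng.normal(0.0, np.sqrt(0.25))      # cluster-level random effect A_i
        e_ij = rng.normal(0.0, np.sqrt(0.75), size=n_i)
        x_ij = rng.normal(0.0, 1.0, size=n_i)     # covariate X_ij
        eps_ij = r_i * np.exp(np.abs(a_i + e_ij)) # skewed, non-normal errors
        clusters.append({"x": x_ij, "y": beta * x_ij + eps_ij})
    return clusters

clusters = simulate_scenario1(M=10, b=1.0, rng=0)
```

Because the cluster size is drawn independently of p_i, the size of a cluster carries no information about the sign or magnitude of its errors, which is what makes this scenario uninformative.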
Table 1 displays the performance of all the methods in this simulated scenario. From Table 1, we found that both covariate-adjusted methods, ICAST and UCAST, maintained the nominal size of 0.05 and had similar power patterns for both choices of $M$. The power of both methods increased with the number of clusters, with ICAST having slightly higher power for the smaller number of clusters ($M = 10$). The marginal signed-rank testing methods (i.e., the Datta-Satten (DS) test and the Rosner-Glynn-Lee (RGL) test) maintained the nominal size of 0.05 but had extremely low power compared to ICAST and UCAST for both small and large numbers of clusters. This showed that ignoring the effect of covariates on the outcomes led to a substantial loss of power. The parametric LMM approach had a highly inflated empirical size, much higher than the nominal size of 0.05, making it unacceptable for this simulated scenario. This was mainly because the underlying skewed distribution of the outcomes made the standard parametric mixed effects model unsuitable for this analysis. Overall, we observed that both ICAST and UCAST were appropriate for this scenario of uninformative cluster sizes and, hence, either of them can be considered for testing the marginal null hypothesis $H_0$ in the presence of covariates.
3.2. Simulation Scenario 2
In this simulation scenario, we considered clustered data with informative cluster size, where the marginal distribution of the pairwise difference in a cluster was correlated with the cluster size. Here, we generated the pairwise differences through the same model as in Section 3.1, with the same model parameters except for the generation of the cluster size $N_i$ for each $1 \le i \le M$. In this case, $N_i = 2$ if $p_i \le 0.5$ and $N_i = 8$ if $p_i > 0.5$. Recall that $p_i$ is generated from a $\mathrm{Beta}(1, b)$ distribution for each $1 \le i \le M$ and contributes to the generation of the paired differences $Y_{ij}$ through the quantity $\epsilon_{ij}$, as shown in Section 3.1. Therefore, for a typical cluster $i$, the cluster size $N_i$ and the paired outcome differences $Y_{ij}$ were correlated, leading to an informative cluster size scenario. For the size calculation, we simulated the data under $H_0$, which was equivalent to choosing $b = 1$, while for the power calculations we retained the previous set of values of $b$ (0.15, 0.3, 0.6).
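To make the informativeness concrete, the sketch below generates the error terms under the rules stated above and compares two summaries: the proportion of positive errors when every pair counts equally, and the same proportion when every cluster counts equally. It is an illustration only; the function names are hypothetical, and N(0, v) is assumed to denote a normal distribution with variance v.

```python
import numpy as np

def simulate_scenario2_errors(M, b, rng=None):
    """Errors eps_ij and cluster labels under informative cluster sizes.

    Assumption: N(0, v) denotes variance v. N_i = 2 if p_i <= 0.5, else 8.
    """
    rng = np.random.default_rng(rng)
    eps, cluster = [], []
    for i in range(M):
        p_i = rng.beta(1.0, b)
        n_i = 2 if p_i <= 0.5 else 8               # size depends on p_i
        r_i = 1.0 if p_i <= 0.5 else -1.0          # so does the error sign
        a_i = rng.normal(0.0, np.sqrt(0.25))
        e_ij = rng.normal(0.0, np.sqrt(0.75), size=n_i)
        eps.append(r_i * np.exp(np.abs(a_i + e_ij)))
        cluster.append(np.full(n_i, i))
    return np.concatenate(eps), np.concatenate(cluster)

def positive_fraction_by_pair(eps):
    """Every pair receives equal weight."""
    return float(np.mean(eps > 0))

def positive_fraction_by_cluster(eps, cluster):
    """Every cluster receives equal weight (inverse cluster-size weighting)."""
    counts = np.bincount(cluster)
    positives = np.bincount(cluster, weights=(eps > 0).astype(float))
    return float(np.mean(positives / counts))

eps, cluster = simulate_scenario2_errors(M=2000, b=1.0, rng=1)
by_pair = positive_fraction_by_pair(eps)
by_cluster = positive_fraction_by_cluster(eps, cluster)
```

Under b = 1, about half of the clusters have positive errors, so the cluster-weighted proportion is close to 0.5, whereas the pair-weighted proportion is close to 2/(2 + 8) = 0.2 because the clusters with negative errors are four times larger. The two weightings therefore describe different marginal distributions in this scenario.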
The performance of all the methods in this simulated scenario of informative cluster size is shown in Table 2. ICAST closely maintained the nominal size of 0.05 for both small and large numbers of clusters, and its power increased with the number of clusters for all choices of $b$. The performance of UCAST in this informative cluster size setting was different from that in the uninformative cluster size scenario. Here, the empirical size of UCAST exceeded the nominal size of 0.05 for both small and large numbers of clusters, indicating that the type-I error rate of UCAST can be higher than expected when the cluster size is informative. The marginal DS test maintained the nominal size for $M = 10$ but narrowly exceeded the target size of 0.05 for $M = 25$. The power of the DS method was, again, very poor, with values drastically lower than the power of ICAST even for a large number of clusters. The marginal RGL test, on the other hand, had a grossly inflated empirical size (0.202) compared to the target size of 0.05 when the number of clusters was large. Moreover, the power of RGL became lower than that of the ICAST and UCAST methods when $b$ shifted further away ($b = 0.3$ or $b = 0.15$) from its null value ($b = 1$). These results indicated the unsuitability of the marginal RGL test for informative cluster size scenarios. Interestingly, however, applying our proposed covariate effect adjustment technique to the marginal RGL test did lead to a substantial reduction of the type-I error rate, as is evident from the size value (0.067) of UCAST, although it still exceeded the nominal limit of 0.05 by a considerable margin. The parametric LMM had unacceptably high size values, much worse than its size under the uninformative cluster size scenario, due to the added complexity of informative cluster size, which the standard LMM does not address. Hence, the standard parametric LMM was inappropriate in the presence of informative cluster sizes. Overall, we found that ICAST was the only method that simultaneously maintained the empirical size close to the nominal size and had adequate power in the informative cluster size scenario.
4. An Application to Open Access Series of Imaging Studies (OASIS) Data
The Open Access Series of Imaging Studies (OASIS) [15] is a collection of publicly available neuroimaging data sets that contains magnetic resonance imaging (MRI) data from the brains of hundreds of individuals, including dementia and Alzheimer's patients as well as nondemented individuals. In this section, we focus on a longitudinal data set [16] from the OASIS platform that involves MRI data of 150 individuals who visited the clinic two or more times, with visits separated by at least a year. These individuals included 72 subjects who were classified as nondemented throughout the entire period of study and 78 subjects who were identified as demented and/or suffering from Alzheimer's disease (AD) at some point during the study. Among the different variables computed from the MRI scans of the brain, a variable of interest is the total intracranial volume (TIV), which has been linked to cognitive impairment and the development of dementia or AD in certain individuals [17,18]. A normalization factor called the atlas scaling factor (ASF) [19], proportional to TIV, is often measured in neuroimaging studies and is available in the OASIS longitudinal data. In this longitudinal study, an interesting question is whether the ASF values, and hence the TIV levels, change over time, which may be indicative of changes or time trends in the cognitive abilities of the individuals under study.
To answer this question, we use the change in ASF values, obtained from the MRI scans, between successive visits of an individual as the pairwise difference in the outcome. In that case, a hypothesis of no change over time is equivalent to the null hypothesis of symmetry (about 0) of the distribution of the paired differences in ASF values. This scenario represents clustered data where each individual is a cluster, and we have 150 clusters with one or more paired differences each, since all these individuals have at least two visits for MRI scans during the period of the study. Note that the number of visits varies by individual, and it is possible that the frequency of visits for individuals suffering from cognitive impairment, e.g., demented individuals and AD patients, differs from that of the nondemented group.
Figure 1 compares the distributions of the number of visits between the demented group and the nondemented group. Combining the numbers from Figure 1, we find that only 23% of the demented individuals had three or more visits, while more than 47% of the nondemented individuals visited the MRI clinics three or more times. Such discrepancies in the number of visits between the two groups may be related to the fact that demented individuals are more prone to be lost to follow-up due to the severity of their disease and high mortality. Since every clinic visit generated an MRI scan, the cluster size is obtained directly from the total number of visits by an individual. Hence, this situation gives rise to a possibly informative cluster size in these clustered data. To test the null hypothesis of no change in ASF values over time in such clustered data, one can use the marginal testing approach of DS, which addresses the issue of informative cluster size. Application of the DS test yields a p-value of 0.0003. We also implement the marginal RGL testing approach, which generates a p-value < 0.0001. In both cases, the null hypothesis of symmetry is rejected with highly significant p-values, indicating that the distribution of paired differences is asymmetric around 0 and that the ASF values changed over the period of study.
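The construction of paired differences from successive visits, with each individual acting as a cluster, can be sketched as follows. The function name and the argument layout are hypothetical (they do not refer to actual OASIS column names), and taking the later visit minus the earlier visit is an assumption about the sign convention.

```python
import numpy as np

def successive_differences(subject_ids, visit_numbers, values):
    """Later-minus-earlier differences between successive visits.

    Returns (cluster_ids, diffs). A subject with k visits contributes
    k - 1 paired differences, so the cluster size is k - 1.
    """
    subject_ids = np.asarray(subject_ids)
    visit_numbers = np.asarray(visit_numbers)
    values = np.asarray(values, dtype=float)
    out_ids, out_diffs = [], []
    for sid in np.unique(subject_ids):
        mask = subject_ids == sid
        order = np.argsort(visit_numbers[mask])   # sort visits chronologically
        diffs = np.diff(values[mask][order])      # change between successive visits
        out_ids.extend([sid] * diffs.size)
        out_diffs.extend(diffs.tolist())
    return np.array(out_ids), np.array(out_diffs)

# Toy example with made-up values: subject "a" has 3 visits, "b" has 2.
ids, diffs = successive_differences(
    ["a", "a", "a", "b", "b"], [2, 1, 3, 1, 2], [1.2, 1.0, 1.5, 0.9, 0.8]
)
```

In the toy example, subject "a" yields two paired differences and subject "b" yields one, which mirrors how the number of visits determines the cluster size.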
These results, obtained from the marginal tests of DS and RGL, do not account for the effects of any covariate that may be associated with TIV or ASF values. However, certain covariates that are available in the data may have important effects on TIV that need to be monitored or adjusted for while testing the pairwise differences of ASF. One such important covariate is the age of an individual. Recent studies [20] have found that age, especially in older individuals, affects the intracranial volume. Since this longitudinal study contains many aged individuals, it is of interest to see whether the conclusion of a change in ASF values over time remains consistent after adjusting for the effect of age. For this analysis, we apply the covariate-adjusted testing approaches of ICAST and UCAST with age as the adjustment covariate. The estimated age coefficient, obtained through the relevant R-estimation of ICAST and UCAST, is −0.008, indicating that higher age is associated with lower ASF values and larger cognitive impairment, which is consistent with the findings of other recent studies [20]. The signed-rank tests of ICAST and UCAST produce p-values of 0.050 and 0.049, respectively. These p-values show that the highly significant changes in ASF over time obtained previously can no longer be concluded once the effect of the age covariate is considered. Rather, the age-adjusted ICAST and UCAST results indicate only a marginally significant ASF change, if any, over time at the 5% level of significance.
Figure 2 shows the weighted histogram of the paired differences of ASF values with inverse cluster size weights, while Figure 3 shows the unweighted histogram of the same paired differences. From these figures, it appears that the distribution of paired differences may be only marginally asymmetric, supporting the borderline significant p-values of the ICAST and UCAST approaches. This demonstrates that adjusting for the effects of potentially important covariates, while performing marginal hypothesis testing of paired outcome differences, can play an important role in obtaining accurate inference.
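A histogram with inverse cluster size weights can be computed along the following lines. This is a minimal sketch with hypothetical function names; normalizing the weights to sum to one is a presentational choice made here, not something stated in the text.

```python
import numpy as np

def inverse_cluster_size_weights(cluster_ids):
    """Weight each paired difference by 1 / (size of its cluster).

    The weights are normalised to sum to 1, so every cluster contributes
    the same total weight regardless of how many pairs it contains.
    """
    cluster_ids = np.asarray(cluster_ids)
    _, inverse, counts = np.unique(
        cluster_ids, return_inverse=True, return_counts=True
    )
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()

def weighted_histogram(diffs, cluster_ids, bins=10):
    """Bin heights and edges of the cluster-size-weighted histogram."""
    weights = inverse_cluster_size_weights(cluster_ids)
    return np.histogram(np.asarray(diffs, dtype=float), bins=bins, weights=weights)

# Toy example: cluster "a" has three pairs, cluster "b" has one.
w = inverse_cluster_size_weights(["a", "a", "a", "b"])
heights, edges = weighted_histogram([0.2, 0.3, -0.1, 0.05], ["a", "a", "a", "b"], bins=4)
```

In the toy example, each pair in cluster "a" receives weight 1/6 and the single pair in cluster "b" receives weight 1/2, so both clusters contribute one half of the total. The unweighted histogram corresponds to giving every pair the same weight instead.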
5. Discussion
Rank-based tests are popular nonparametric hypothesis testing approaches when the distributions of outcomes tend to be non-normal, and the signed-rank test is one such test that is widely used for comparing the marginal distributions of paired outcomes. Signed-rank tests have been extended to different types of clustered data, including data where the cluster sizes are informative. Most of these existing signed-rank tests compare marginal distributions of paired outcomes without considering any additional covariate effect on the outcomes. However, ignoring available covariate information during paired comparisons can result in inaccurate inferences, as discussed in Section 1 and as is evident from the OASIS neuroimaging data analysis in Section 4. Motivated by this need for a hypothesis testing mechanism for paired outcomes that can adjust for the effect of covariates, we proposed a robust rank-based procedure for covariate effect adjustment while carrying out hypothesis testing in a clustered data framework. Our method addresses the issue of informative cluster sizes and performs well even if the cluster sizes are uninformative, as shown by the simulation results in Section 3.
In this article, we have outlined covariate-adjusted signed-rank testing procedures for two types of clustered data, namely, a testing procedure in the presence of informative cluster size (ICAST) and a testing procedure when the cluster sizes are uninformative (UCAST). The choice between these two testing procedures depends on the research aim of the investigator and the type of marginal distribution to be compared (see Section 2). Another deciding factor in this context can be the identification of the primary unit of sampling and inference. If the primary sampling unit is a cluster, ICAST may be preferred over UCAST, as all the clusters receive the same weight under ICAST. On the other hand, if the primary sampling unit is a member within a cluster and the cluster sizes are not expected to be informative, UCAST can be preferred. Following this idea, one may prefer ICAST over UCAST in the OASIS neuroimaging data analysis in Section 4, since the primary unit of inference is a patient undergoing the MRI scans and not the pair of successive MRI scans.
In addition to the potential areas of real-life application mentioned in Section 1 and Section 4, our proposed method can also be applied in analyzing data arising from cluster-randomized trials. In such a trial, the clusters are randomized, and an intervention is administered to a whole cluster (i.e., all the units in a cluster receive the same intervention). If the intervention is a drug under trial while the patients are units within hospitals (clusters), then our method can be applied for testing the drug efficacy. Here, the outcomes obtained for each patient before and after receiving the drug form paired outcomes and we have multiple pairs from multiple patients in a cluster. Then, we can use our proposed method to adjust for the effects of the available covariates in each patient while comparing the pre-intervention and post-intervention outcomes.
In our rank-based covariate adjustment procedure, we have assumed a linear model framework without making any strong distributional assumptions. This type of model is applicable to any continuous response even if the underlying distribution is asymmetric. The procedure can also be extended to non-continuous responses through generalized linear model frameworks (e.g., a count model for discrete count responses). We plan to pursue such non-continuous outcome modeling in the future. However, a limitation of this type of covariate-adjusted rank-based testing is that it cannot be used for binary outcomes, because such outcomes cannot be meaningfully ranked. Our proposed method is a two-step procedure: we first adjust the outcomes for covariate effects and then test the distribution of the modified paired differences. An alternative approach could be a one-step inference procedure that simultaneously estimates the covariate effects using ranks and performs rank-based testing under informative cluster sizes. Such an approach is an area of potential future research on rank-based inference.
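The first step of a two-step procedure of this kind can be illustrated with a deliberately simplified stand-in. The sketch below is not the R-estimation used by ICAST or UCAST, which is defined in Section 2; it assumes a single covariate, ignores clustering, and uses the rank-based slope estimate with Wilcoxon scores, which for simple regression is the weighted median of pairwise slopes with weights |x_i − x_j|. All function names are hypothetical.

```python
import numpy as np

def wilcoxon_score_slope(x, y):
    """Rank-based slope estimate for one covariate (illustrative stand-in).

    Minimises sum_{i<j} |(y_i - y_j) - beta * (x_i - x_j)|, i.e. the
    weighted median of pairwise slopes with weights |x_i - x_j|.
    Clustering is ignored in this simplified sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(x.size, k=1)
    dx, dy = x[i] - x[j], y[i] - y[j]
    keep = dx != 0                                  # pairs with tied x carry no slope
    slopes, weights = dy[keep] / dx[keep], np.abs(dx[keep])
    order = np.argsort(slopes)
    cum = np.cumsum(weights[order])
    k = np.searchsorted(cum, 0.5 * cum[-1])         # first index reaching half the weight
    return float(slopes[order][k])

def covariate_adjusted_differences(x, y):
    """Step 1: remove the estimated covariate effect from the paired differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta_hat = wilcoxon_score_slope(x, y)
    return beta_hat, y - beta_hat * x

# Noise-free toy data: the slope is recovered exactly.
x = np.array([0.0, 1.0, 2.0, 4.0])
beta_hat, adjusted = covariate_adjusted_differences(x, 5.0 * x + 2.0)
```

The adjusted differences returned by the first step would then be passed to a signed-rank test suitable for clustered data in the second step; that test is not reproduced in this sketch.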