1. Introduction
The area under the true receiver operating characteristic (ROC) curve, hereafter referred to as the AUC, is routinely estimated when it is of interest in determining how strongly a given model discriminates between the levels of a binary outcome. As described in Hanley and McNeil [
1] and later modified by DeLong, DeLong, and Clarke-Pearson [
2], the Mann–Whitney U statistic may be used to estimate the AUC and to compare the AUC between competing models. These approaches are only appropriate for inference with an AUC generated by data composed of independent observations. Extending the methodology of DeLong, DeLong, and Clarke-Pearson [
2], Obuchowski [
3] developed an approach for inference with the AUC that accounts for the clustering of data at two levels. In addition to this approach and building upon previous research that modified the Mann–Whitney U test for use with clustered data [
4], Rosner, Qiu, and Lee [
5] proposed a method for assessing the precision of and comparing AUC values in the setting of dependent data with a two-level structure (e.g., eyes measured within a person). In our application of the example discussed in
Section 7, the AUC is the probability that a score for a random subunit with a negative outcome (i.e., a nonevent subunit) is lower than the score for a random subunit with a positive outcome (i.e., an event subunit).
To apply this approach, a transformation to normality of the model prediction scores is performed, so that a known null distribution and an alternative distribution distinguished by a location shift may be established. Then, by extending the Mann–Whitney U statistic to allow for data at two levels (e.g., eyes within a person), a closed-form expression of the variance of the AUC can be obtained.
This article proposes further modification, enabling the construction of a confidence interval around the AUC with data having a three-level hierarchical structure. This structure can often be found in ophthalmology research. For example, when assessing the progression of age-related macular degeneration in datasets with multiple subjects in the same family, eyes are nested within patients, and patients may be nested within families [
6]. As another example, if an investigator is using a Humphrey visual field analyzer to assess visual field loss, each patient will have data at multiple sites within each eye; this allows for the analysis of data at three levels (i.e., patient, eye, and site) [
7]. In addition to ophthalmology, three-level data may be found in the fields of dentistry, where outcomes, such as the presence of caries, can occur within quadrants of the mouth, where each quadrant has multiple teeth, and in audiology, where ear-specific hearing loss may be studied in patients’ ears using tests with different frequencies within the same ear. In the following, the extended Mann–Whitney U statistic for three-level hierarchical data is discussed, the expectation and variance of this statistic, along with the construction of confidence intervals and the handling of ties, are defined, and simulation studies and two illustrations of its use are presented. For clarity, the levels of measurement will be referred to as family, person, and subunit. This terminology and hierarchy are illustrated in
Figure 1.
2. Extended Mann-Whitney U Statistic for Three-Level Hierarchical Data
We consider a generalized linear mixed model having a Bernoulli-distributed response defined at the subunit level of the hierarchy, a logit link, and two nested random intercepts.
where
i,
j, and
k index the family, person, and subunit levels of the data, respectively,
is the probability of a subunit experiencing an event,
are fixed effects and can be defined at any level of the hierarchy;
is a family-level random effect; and
is a person-level random effect. The random effects are assumed to be independent of each other and independent of the fixed effects. Cluster sizes are allowed to vary at any level. Standard manipulations from the hierarchical modeling literature, such as decomposing fixed effects into within- and between-cluster effects and interactions, may be applied, but allowing for heteroscedasticity of the random effects’ variances or including random slopes is not viable, as the correlation structure must be strictly exchangeable. By fitting the above model defined in Equation (1) to the observed data, one can derive a composite score.
Let
if the
subunit experiences an event
otherwise, and let
be an indicator
if
is true
otherwise. We define the AUC (or
by:
or in words, the probability that a risk score from a random subunit that has an event (or, in brief, an event subunit) is higher than a random nonevent subunit, which can be shown to be the area under a receiver operating characteristic curve (ROC), or AUC in brief. With 3-level hierarchical data,
, which is the parameter of interest, can be estimated with Equation (4):
where
is an indicator function that is equal to 1 if
and 0 otherwise and
is defined as the observed response status (1 = event, 0 = nonevent). Note that the numerator of Equation (4) is the generalization of the Mann–Whitney U statistics for observations across three levels of hierarchical data, e.g., across different families, across different subjects within the same family, and across different eyes of the same subjects. As usual, it is assumed that an increase in the prediction score is associated with an increase in the probability of having the event. The ties between prediction scores cannot be directly accommodated with this expression, however, they can be accounted for using the tie-breaking approach discussed in
Section 4. Since the denominator of Equation (4) is equal to the total number of discordant pairs of subunit observations (i.e., one subunit exhibits an event and the other does not),
is defined as the proportion of discordant pairs where the observation with the event has a higher prediction score than the observation without. The observations belonging to the discordant pairs are allowed to be from different families or persons or within the same family or person.
To facilitate inference with
, the prediction scores are transformed to a standard normal distribution. To do this, the probit (inverse of the cumulative standard normal probability distribution;
) is applied to the cumulative distribution of prediction scores (
), which may be estimated through the empirical distribution function. Note that this transformation preserves the rank of the prediction scores. The transformed prediction scores are defined as:
The transformed scores are distributed as:
The 2 × 1 random vector of the differences between any two transformed prediction scores is as follows:
is assumed to be a bivariate normal with a covariance matrix defined as:
will vary depending on the correlations between the pairs of transformed prediction scores; for scenarios where all scores originate from different family-level observations,
is defined as 0. The assumption of bivariate normality is useful for calculating the variance of
, as described in the
Supplementary.
Assuming that there is a location shift (
) in the transformed prediction scores between those who did and did not experience the event, the expectation of
is a composite of three
’s as defined below.
where
(a subunit that does not have an event has a lower prediction score than a subunit that has an event from the same person) = (e.g., comparing risk scores from two eyes of the same person).
(a subunit that does not have an event has a lower prediction score than a subunit that has an event from a different person in the same family) .
(a subunit that does not have an event has a lower prediction score than subunit that has an event from a person in a different family) = .
Correlation between transformed prediction scores for subunits of a person.
Correlation between transformed prediction scores for subunits of two different persons within the same family.
Assuming that there is no location shift in the transformed prediction scores (µ = 0), the above expectation is equal to 0.5. To calculate the variance of
,
is first respecified as:
where
number of subunit comparisons within a person.
number of subunit comparisons between members of the same family.
number of subunit comparisons between members of different families.
The
Supplementary S2 provide details of the derivation of the individual terms in Equation (7).
5. Simulation Studies
The validity of the point and interval estimates given in Equations (4) and (9) was assessed through simulation. Five scenarios were examined and are described in detail in their respective sections below. To simulate data, prediction scores were derived from the framework of a normal linear mixed model, specified as:
with a fixed effect
B1 for the dichotomous event status
eventijk, normally distributed random intercepts for family (
) and person (
), and a normally distributed residual for subunit (
), all with mean zero. Event status was randomly generated from a Bernoulli distribution with a predefined probability (i.e., prevalence), which was varied throughout the simulation scenarios. The regression coefficient associated with the fixed effect for event status (
) was 1.5, implying that having the event is associated with a higher prediction score. The variance of the family random effect (
) was 5.0, of person (
) was 3.0, and of subunit (
) was 2.0. Given this, the intra-class correlation at the person level was 0.50 and at the subunit level was 0.80. Due to the above simulation parametrizations, all simulation settings had
µ = 0.474, θ
s = 0.773, θ
f = 0.682, and θ = 0.631. The five simulation scenarios examined the effects of:
- (1)
Varying the prevalence of event status,
- (2)
accounting for tied prediction scores using the method outlined in
Section 4,
- (3)
having a variable number of persons per family, reflecting the 2012 US Census data,
- (4)
creating unbalanced data at the subunit level through the random deletion of subunit measurements, and
- (5)
introducing random variation to the prediction scores, which are assumed to be measured without error in Equation (2).
For each parametrization of each scenario, 5000 datasets were simulated with a prespecified number of families, persons within families (except in Scenario 3), and subunits.
calculated using Equation (4) (averaged over simulations) was compared to the expectation of
calculated using Equation (6) to assess bias (
minus
). The variance of
was examined by forming the ratio of the theoretical variance of
calculated using Equation (7), to the empirical variance, which was calculated by taking the variance of the estimated
across simulations (theoretical var
/ empirical var
). Lastly, the actual confidence interval coverage was measured for the proposed approach, the Obuchowski [
3] and Delong [
2] confidence interval methods, and the two-level approach of Rosner, Qiu, and Lee [
5].
5.1. Scenario 1
In Scenario 1, the number of persons per family and the number of subunits per person were constant, and the event status for every subunit was independently assigned using a Bernoulli distribution with a probability of 0.50 or 0.25. Results are listed in
Table 1. Across these simulation settings, the mean bias is 0.000, the mean variance ratio is 0.970, and the mean 95% CI coverage is 0.951; for the Obuchowski method, the mean 95% CI coverage is 0.935, for the DeLong method it is 0.938, and for the Rosner method it is 0.941. For the first parameterization in Scenario 1 (20 families, 2 persons, and 50% event prevalence), the actual confidence interval coverage was calculated to assess the viability of using a percentile bootstrap approach with 100 bootstrap samples by applying the same simulation methodology as described above. The percentile bootstrapping approach has a 95% CI coverage of 0.935.
5.2. Scenario 2
For Scenario 2, all the above design considerations were maintained, except that prediction scores were binned according to ventiles (20 bins), deciles (10 bins), and quintiles (5 bins) to create tied data. The analysis then proceeded as outlined in
Section 4. Results are listed in
Table 2. Across these simulation settings, the mean bias is −0.003, the mean-variance ratio is 0.955, and the mean 95% confidence interval coverage is 0.949.
5.3. Scenario 3
The design of Scenario 3 was similar to that of Scenario 1, except the number of persons within a family was allowed to vary randomly according to the empirical distribution of the number of children per family (in families with children) from the 2012 United States census data [
9]. Families could have up to 10 children with a probability of 0.43 for one child, 0.36 for two children, 0.15 for three children, and 0.03 for four children with progressively smaller probabilities for more. Based on
Table 3, the mean bias is 0.001, the mean-variance ratio is 0.965, and the mean 95% confidence interval coverage is 0.949; for the Obuchowski method, the mean 95% CI coverage is 0.939, for the DeLong method it is 0.936, and for the Rosner method it is 0.940.
5.4. Scenario 4
Scenario 4 was a modification of Scenario 1, where missing data at the subunit level was introduced. In each simulation, 10% of the persons were randomly selected to have one missing subunit. These missing observations were deleted rather than imputed. Based on
Table 4, the mean bias is 0.000, the mean-variance ratio is 0.981, and the mean 95% confidence interval coverage is 0.952; for the Obuchowski method, the mean 95% CI coverage is 0.943, for the DeLong method it is 0.942, and for the Rosner method it is 0.944.
5.5. Scenario 5
In Scenario 5, the same setting as used in Scenario 1 was examined, however random normal measurement error (mean set to 0 and standard deviation varied was applied to the
prediction scores (defined as
in the simulation setting). This was performed to assess the robustness of the proposed method to variability in the
values that are assumed to be measured without error. Based on
Table 5, the mean bias is −0.001, the mean-variance ratio is 0.954, and the mean 95% CI coverage is 0.946.
Based on the simulations, the bias in the point estimate of is negligible, and the coverage of the 95% confidence intervals is appropriate.
6. Example with Longitudinal Blood Pressure Data
To illustrate the proposed methodology, a sample of longitudinal blood pressure data collected across four timepoints from 1954 to 1968 in South Wales was analyzed [
10]. Two regions, the Rhondda Fach and the Vale of Glamorgan, were sampled through the random selection of propositi older than 5 years, whose first-degree relatives were then included in the study along with the propositi. For this illustration, only individuals older than 30 at baseline and only those who were siblings of the propositi or the propositus himself/herself were included. Additionally, only timepoints with blood pressure data available were included. Hypertension was defined as a systolic blood pressure ≥ 140 mmHg, a diastolic blood pressure ≥ 90 mmHg, or a self-reported use of blood pressure medication. A summary of the sample sizes at different levels within the clustered longitudinal structure of the study is presented in
Table S1, and descriptive statistics for the baseline visit are displayed in
Table S2 of the Supplementary.
A generalized linear mixed model with a logit link and Bernoulli response was fit for hypertension (binary, time-varying), with age (continuous, time-varying), BMI (continuous, time-varying), sex (binary, time-invariant), and region (binary, time-invariant) as predictors of interest. Due to a large quantity of missing BMI data, ten rounds of multiple imputations were applied to this variable. Age and BMI were specified using restricted cubic splines with five knots having locations at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles [
11]. The inclusion of interactions between age and sex and BMI and sex were assessed using the Akaike Information Criterion (AIC) and found not to improve predictive ability. A family-level random intercept and a sibling random intercept within families, both assumed to be normally distributed, were included. The Kenward–Roger degrees of freedom and fixed effect standard error adjustments were used, and the model was estimated with residual pseudo-likelihood and optimized using Newton–Raphson with ridging in PROC GLIMMIX (SAS 9.4). Within each imputation dataset, the final sample size was 3143 visits (58% hypertensive) with 966 siblings and 319 families, a mean of 3.03 siblings per family (minimum = 1, maximum = 12), and a mean of 3.25 visits per sibling. The final model is given by:
where
is a family-level random intercept and
is a sibling-level random intercept.
Due to the use of flexible continuous predictor parameterizations, observed associations from the model fit to the complete dataset have been presented using categorical representations in
Table 6—the impact of multiple imputation on inference is accounted for through Rubin’s rules in this table [
8]. As seen in
Table 6, an increase in age or BMI is strongly associated with higher odds of hypertension. The estimated variance of the random intercept for family was 0.51 (Wald
p-value < 0.001) and for sibling was 1.67 (Wald
p-value < 0.0001).
Using the estimated parameters of this model, the unconditional (i.e., not conditional on the random intercepts) predicted log odds for each observation in the complete dataset were calculated. By applying the proposed methodology, the AUC was estimated to be 0.76
with a 95% confidence interval of 0.72 to 0.79. A naïve analysis that does not account for the clustered nature of the data produces the same AUC with a narrower 95% confidence interval of 0.74–0.77 (DeLong method [
2]), and an analysis that accounts for clustering at only the participant level produces the same AUC as well, but also results in a narrower 95% confidence interval of 0.73–0.78 (Obuchowski method [
3]). Given that multiple imputation was used during model fitting, an approach similar to that outlined in
Section 4 was applied to the AUC estimates and confidence intervals to accurately account for the variance introduced by the multiple imputation procedure used for missing BMI data.
7. Example with Age-Related Macular Degeneration Data Clustered within Person and Family
The previous example illustrated how confidence intervals for the AUC based on longitudinal data can be produced using the proposed methodology; in this example, data measuring the progression of age-related macular degeneration (AMD) at the individual eye-level but clustered within patients (up to two eyes per person) and families was explored. The data used are from a subset of eyes from patients from the Seddon Longitudinal Cohort Study who did not present with evidence of advanced macular degeneration at a baseline visit and were followed for five years [
6]. The sample sizes at each level of clustering are presented in
Table S3, and descriptive statistics for the entire sample are presented in
Tables S4 and S5 of the Supplementary.
A generalized linear mixed model with a logit link and Bernoulli response was fit for eye-level progression (binary) with age (categorical), sex (binary), smoking status (categorical), education (binary), BMI (categorical), and a baseline measure of AMD severity known as the Clinical Age Related Maculopathy Grading System (CARMS) grade (categorical; defined on a scale from 1 to 5, with grades 4 and 5 indicating subtypes of advanced AMD, but only values 1–3 at baseline are present in these example data) as predictors [
12]. Race was not included since AMD is much less common in nonwhites, and there were few nonwhites in this cohort. A family-level random intercept and a patient-level random intercept, both assumed to be normally distributed, were included. The Kenward–Roger degrees of freedom and fixed effect standard error adjustments were applied, and the model was estimated by residual pseudo-likelihood and optimized using Newton–Raphson with ridging in PROC GLIMMIX (SAS 9.4). No missing data were present. The final model is:
where
is a family-level random intercept and
is a person-level random intercept.
The sample size was 1396 eyes (8% with progression), 741 patients, and 325 families. The results from the model fit using the complete dataset are presented in
Table 7 where it can be seen that age and CARMS grade were the strongest predictors of progression. The estimated variance of the random intercept for family was 1.04 (Wald
p-value = 0.027) and for person was 0.71 (Wald
p-value = 0.09).
Using the estimated parameters of this model, the unconditional (i.e., not conditional on the random intercepts) predicted log odds for each observation in the complete dataset were calculated. There were many ties in the resulting predictions, so the previously outlined tie-breaking approach was used. By applying the proposed methodology, the AUC was estimated to be 0.88
with a 95% confidence interval of 0.82–0.92. A correlation-naïve analysis that does not account for the hierarchical nature of the data produces the same AUC with a narrower 95% confidence interval of 0.85–0.90 (DeLong method [
2]), and an analysis that accounts for clustering at only the participant level produces the same AUC as well, but also results in a narrower 95% confidence interval of 0.85–0.91 (Obuchowski method [
3]).
8. Discussion
This investigation described how the extended Mann–Whitney U statistic, when paired with normally transformed prediction scores, can be applied to three-level hierarchical data to calculate an AUC statistic with a corresponding confidence interval. The AUC is calculated based on prediction scores from a three-level generalized linear mixed model with a logit link and Bernoulli distribution. Data may originate from a clustered setting or a longitudinal setting and cluster sizes are allowed to vary, but the correlation structure must be exchangeable. Simulation studies showed that the coverage of the 95% confidence interval around the AUC is appropriate, even in unbalanced settings with small cluster sizes and a small number of independent units (i.e., families).
In the two examples provided, the data were not divided into a training and testing set as this was not relevant to the illustration of the methodology. In practice, this technique or cross-validation could be used to obtain a validated measure of AUC. Cross-validation can be used by calculating the variance and AUC of the holdout sets of data and then combining them through the use of Rubin’s rules [
8].
In the presentation of the two-level approach, prediction scores were assumed to be generated through a generalized estimating equation (GEE) model [
5]. A three-level hierarchical data is traditionally modeled using generalized linear mixed models, and so this technique is used here in lieu of GEE. If a marginal model rather than a subject-specific model is desired, GEE may still be used, assuming that a method supporting three-levels of data dependency is applied.
Two important assumptions deserve further attention. First, Equation (5) assumes bivariate normality of normal random variables. The authors believe that this assumption is appropriate given that an examination of 3D histograms of the presented example data and data simulation scenarios did not violate this assumption. Second, this methodology assumes that the correlation structure of the response data is exchangeable. This is typically valid in data settings where correlation is induced by observations being nested within groups but is often inappropriate with longitudinal data. The blood pressure example (
Section 6) includes longitudinal data, but due to model nonconvergence, it was not possible to assess the utility of complex correlation structures. As noted in Fitzmaurice, Laird, and Ware [
13], it is often difficult to specify correlation structures that are more flexible than exchangeable ones in generalized linear mixed models with binary responses. Regardless, users of the proposed methodology should attempt to determine if exchangeability is a fair assumption. In addition to assessing the correlation structure, users should also be cognizant of the presence of informative cluster sizes, which may lead to bias in the results from the proposed methodology. The authors report that there are no competing interests to declare. For users interested in applying the proposed method to construct valid confidence intervals for the AUC in three-level settings, R code is available in the
Supplementary Materials.