Next Article in Journal
Adaptations on the Use of p-Values for Statistical Inference: An Interpretation of Messages from Recent Public Discussions
Previous Article in Journal
Recurring Errors in Studies of Gender Differences in Variability
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC

1
Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
2
University of Massachusetts Chan Medical School, Worcester, MA 01655, USA
3
Department of Epidemiology and Biostatistics, University of Maryland School of Public Health, College Park, MD 20742, USA
*
Author to whom correspondence should be addressed.
Stats 2023, 6(2), 526-538; https://doi.org/10.3390/stats6020034
Submission received: 17 March 2023 / Accepted: 12 April 2023 / Published: 24 April 2023

Abstract

:
The area under the true ROC curve (AUC) is routinely used to determine how strongly a given model discriminates between the levels of a binary outcome. Standard inference with the AUC requires that outcomes be independent of each other. To overcome this limitation, a method was developed for the estimation of the variance of the AUC in the setting of two-level hierarchical data using probit-transformed prediction scores generated from generalized estimating equation models, thereby allowing for the application of inferential methods. This manuscript presents an extension of this approach so that inference for the AUC may be performed in a three-level hierarchical data setting (e.g., eyes nested within persons and persons nested within families). A method that accounts for the effect of tied prediction scores on inference is also described. The performance of 95% confidence intervals around the AUC was assessed through the simulation of three-level clustered data in multiple settings, including ones with tied data and variable cluster sizes. Across all settings, the actual 95% confidence interval coverage varied from 0.943 to 0.958, and the ratio of the theoretical variance to the empirical variance of the AUC varied from 0.920 to 1.013. The results are better than those from existing methods. Two examples of applying the proposed methodology are presented.

1. Introduction

The area under the true receiver operating characteristic (ROC) curve, hereafter referred to as the AUC, is routinely estimated when it is of interest in determining how strongly a given model discriminates between the levels of a binary outcome. As described in Hanley and McNeil [1] and later modified by DeLong, DeLong, and Clarke-Pearson [2], the Mann–Whitney U statistic may be used to estimate the AUC and to compare the AUC between competing models. These approaches are only appropriate for inference with an AUC generated by data composed of independent observations. Extending the methodology of DeLong, DeLong, and Clarke-Pearson [2], Obuchowski [3] developed an approach for inference with the AUC that accounts for the clustering of data at two levels. In addition to this approach and building upon previous research that modified the Mann–Whitney U test for use with clustered data [4], Rosner, Qiu, and Lee [5] proposed a method for assessing the precision of and comparing AUC values in the setting of dependent data with a two-level structure (e.g., eyes measured within a person). In our application of the example discussed in Section 7, the AUC is the probability that a score for a random subunit with a negative outcome (i.e., a nonevent subunit) is lower than the score for a random subunit with a positive outcome (i.e., an event subunit).
To apply this approach, a transformation to normality of the model prediction scores is performed, so that a known null distribution and an alternative distribution distinguished by a location shift may be established. Then, by extending the Mann–Whitney U statistic to allow for data at two levels (e.g., eyes within a person), a closed-form expression of the variance of the AUC can be obtained.
This article proposes further modification, enabling the construction of a confidence interval around the AUC with data having a three-level hierarchical structure. This structure can often be found in ophthalmology research. For example, when assessing the progression of age-related macular degeneration in datasets with multiple subjects in the same family, eyes are nested within patients, and patients may be nested within families [6]. As another example, if an investigator is using a Humphrey visual field analyzer to assess visual field loss, each patient will have data at multiple sites within each eye; this allows for the analysis of data at three levels (i.e., patient, eye, and site) [7]. In addition to ophthalmology, three-level data may be found in the fields of dentistry, where outcomes, such as the presence of caries, can occur within quadrants of the mouth, where each quadrant has multiple teeth, and in audiology, where ear-specific hearing loss may be studied in patients’ ears using tests with different frequencies within the same ear. In the following, the extended Mann–Whitney U statistic for three-level hierarchical data is discussed, the expectation and variance of this statistic, along with the construction of confidence intervals and the handling of ties, are defined, and simulation studies and two illustrations of its use are presented. For clarity, the levels of measurement will be referred to as family, person, and subunit. This terminology and hierarchy are illustrated in Figure 1.

2. Extended Mann-Whitney U Statistic for Three-Level Hierarchical Data

We consider a generalized linear mixed model having a Bernoulli-distributed response defined at the subunit level of the hierarchy, a logit link, and two nested random intercepts.
ln p i j k | u i , v i j 1 p i j k | u i , v i j = β 0 + β 1 x i j k , 1 + β 2 x i j k , 2 + + β m x i j k , m + u i + v i j
u i ~ N 0 , σ u 2 , v i j ~ N ( 0 , σ v 2 )
where i, j, and k index the family, person, and subunit levels of the data, respectively, p i j k is the probability of a subunit experiencing an event, β 1 β m are fixed effects and can be defined at any level of the hierarchy; u i is a family-level random effect; and v i j is a person-level random effect. The random effects are assumed to be independent of each other and independent of the fixed effects. Cluster sizes are allowed to vary at any level. Standard manipulations from the hierarchical modeling literature, such as decomposing fixed effects into within- and between-cluster effects and interactions, may be applied, but allowing for heteroscedasticity of the random effects’ variances or including random slopes is not viable, as the correlation structure must be strictly exchangeable. By fitting the above model defined in Equation (1) to the observed data, one can derive a composite score.
S i j k = β 0 + β 1 x i j k , 1 + β 2 x i j k , 2 + β m x i j k , m
Let δ ijk = 1 if the ( i , j , k ) subunit experiences an event and   0 otherwise, and let I ( a ) be an indicator = 1 if S i 1 j 1 k 1 < S i 2 j 2 k 2 is true and   0 otherwise. We define the AUC (or θ 3 ) by:
θ 3 = E [ 1 δ i 1 j 1 k 1 δ i 2 j 2 k 2 I S i 1 j 1 k 1 S i 2 j 2 k 2 ]
or in words, the probability that a risk score from a random subunit that has an event (or, in brief, an event subunit) is higher than a random nonevent subunit, which can be shown to be the area under a receiver operating characteristic curve (ROC), or AUC in brief. With 3-level hierarchical data, θ 3 , which is the parameter of interest, can be estimated with Equation (4):
θ 3 ^ = i 1 = 1 I j 1 = 1 J i 1 k 1 = 1 K i 1 j 1 i 2 = 1 I j 2 = 1 J i 2 k 2 = 1 K i 2 j 2 1 δ i 1 j 1 k 1 δ i 2 j 2 k 2 I S i 1 j 1 k 1 S i 2 j 2 k 2 i 1 = 1 I j 1 = 1 J i 1 k 1 = 1 K i 1 j 1 i 2 = 1 I j 2 = 1 J i 2 k 2 = 1 K i 2 j 2 1 δ i 1 j 1 k 1 δ i 2 j 2 k 2 , i 1 j 1 k 1 i 2 j 2 k 2
where I is an indicator function that is equal to 1 if S i 1 j 1 k 1 < S i 2 j 2 k 2 and 0 otherwise and δ is defined as the observed response status (1 = event, 0 = nonevent). Note that the numerator of Equation (4) is the generalization of the Mann–Whitney U statistics for observations across three levels of hierarchical data, e.g., across different families, across different subjects within the same family, and across different eyes of the same subjects. As usual, it is assumed that an increase in the prediction score is associated with an increase in the probability of having the event. The ties between prediction scores cannot be directly accommodated with this expression, however, they can be accounted for using the tie-breaking approach discussed in Section 4. Since the denominator of Equation (4) is equal to the total number of discordant pairs of subunit observations (i.e., one subunit exhibits an event and the other does not), θ 3 ^ is defined as the proportion of discordant pairs where the observation with the event has a higher prediction score than the observation without. The observations belonging to the discordant pairs are allowed to be from different families or persons or within the same family or person.
To facilitate inference with θ 3 ^ , the prediction scores are transformed to a standard normal distribution. To do this, the probit (inverse of the cumulative standard normal probability distribution; Φ 1 ) is applied to the cumulative distribution of prediction scores ( F ), which may be estimated through the empirical distribution function. Note that this transformation preserves the rank of the prediction scores. The transformed prediction scores are defined as:
Z i j k = Φ 1 [ F ( S i j k ) ]
The transformed scores are distributed as:
Z i j k N ( 0,1 )   for   δ i j k = 0 N ( μ , 1 ) , μ 0   for   δ i j k = 1
The 2 × 1 random vector of the differences between any two transformed prediction scores is as follows:
Z i 1 j 1 k 1 Z i 2 j 2 k 2 Z i 3 j 3 k 3 Z i 4 j 4 k 4 , Z i 1 j 1 k 1 Z i 2 j 2 k 2   and   Z i 3 j 3 k 3 Z i 4 j 4 k 4
is assumed to be a bivariate normal with a covariance matrix defined as:
C O V Z i 1 j 1 k 1 Z i 2 j 2 k 2 Z i 3 j 3 k 3 Z i 4 j 4 k 4 = 1 ρ * ρ * 1 .
ρ * will vary depending on the correlations between the pairs of transformed prediction scores; for scenarios where all scores originate from different family-level observations, ρ * is defined as 0. The assumption of bivariate normality is useful for calculating the variance of θ 3 ^ , as described in the Supplementary.
Assuming that there is a location shift ( μ 0 ) in the transformed prediction scores between those who did and did not experience the event, the expectation of θ 3 ^ is a composite of three θ ’s as defined below.
E ( θ 3 ^ ) = θ s i = 1 I j = 1 J i f i j g i j + θ f i = 1 I j 1 = 1 J i j 2 j 1 J i f i j 1 g i j 2 + θ F × G i = 1 I j = 1 J i f i j g i j + i = 1 I j 1 = 1 J i j 2 j 1 J i f i j 1 g i j 2 F × G
where
θ s = P (a subunit that does not have an event has a lower prediction score than a subunit that has an event from the same person) = Φ μ / 2 ( 1 ρ s ) (e.g., comparing risk scores from two eyes of the same person).
θ f = P (a subunit that does not have an event has a lower prediction score than a subunit that has an event from a different person in the same family) = Φ μ / 2 ( 1 ρ f ) .
θ = P (a subunit that does not have an event has a lower prediction score than subunit that has an event from a person in a different family) = Φ μ / 2 .
ρ s = Correlation between transformed prediction scores for subunits of a person.
ρ f = Correlation between transformed prediction scores for subunits of two different persons within the same family.
f i j = k = 1 K i j δ i j k
g i j = k = 1 K i j ( 1 δ i j k )
F = i = 1 I j = 1 J i k = 1 K i j δ i j k
G = i = 1 I j = 1 J i k = 1 K i j 1 δ i j k
Assuming that there is no location shift in the transformed prediction scores (µ = 0), the above expectation is equal to 0.5. To calculate the variance of θ 3 ^ , θ 3 ^ is first respecified as:
θ 3 ^ = A + B + C F × G
where A = i = 1 I j = 1 J i k 1 = 1 K i j k 2 k 1 K i j 1 δ i j k 1 δ i j k 2 I S i j k 1 S i j k 2 ; number of subunit comparisons within a person. B = i = 1 I j 1 = 1 J i j 2 j 1 J i k 1 = 1 K i j 1 k 2 = 1 K i j 2 ( 1 δ i j 1 k 1 ) δ i j 2 k 2 I ( S i j 1 k 1 S i j 2 k 2 ) ; number of subunit comparisons between members of the same family. C = i 1 = 1 I i 2 i 1 I j 1 = 1 J i 1 j 2 = 1 J i 2 k 1 = 1 K i 1 j 1 k 2 = 1 K i 2 j 2 ( 1 δ i 1 j 1 k 1 ) δ i 2 j 2 k 2 I ( S i 1 j 1 k 1 S i 2 j 2 k 2 ) ; number of subunit comparisons between members of different families.
Then,
V A R ( θ 3 ^ ) = 1 F 2 × G 2 V A R ( A ) + V A R ( B ) + V A R ( C ) + 2 C O V ( A , B ) + 2 C O V ( A , C ) + 2 C O V ( B , C )
The Supplementary S2 provide details of the derivation of the individual terms in Equation (7).

3. Estimation of Confidence Limits for θ 3 ^

Since θ 3 is bounded by 0 and 1, the confidence interval surrounding the estimate of this parameter should be bounded similarly. To ensure this, a probit transformation can be applied to θ 3 as ω 3 = Φ 1 θ 3 where Φ 1 is the inverse of the cumulative standard normal probability distribution. Under the assumption of normality of ω 3 ^ , a 100(1 − α)% confidence interval for ω 3 is:
[ ω 3 ^ z α / 2 S E ( ω 3 ^ ) , ω 3 ^ + z α / 2 S E ( ω 3 ^ ) ]
where z α / 2 is the 100(1 − α/2)th percentile of the standard normal distribution. To find S E ( ω 3 ^ ) , the delta method is applied to θ 3 ^ , which has a known variance (Equation (7)).
n ( θ 3 ^ θ 3 ) N [ 0 , V A R ( θ 3 ^ ) ]
n [ Φ 1 ( θ 3 ^ ) Φ 1 ( θ 3 ) ] N ( 0 , V A R ( θ 3 ^ ) { ϕ [ Φ 1 ( θ 3 ) ] } 2 )
where ϕ is the standard normal probability density. Therefore, S E ( ω 3 ^ ) = S E ( θ 3 ^ ) | ϕ [ Φ 1 ( θ 3 ) ] | .
A back-transformation of Equation (8) using the cumulative standard normal probability distribution can then be used to obtain the desired confidence interval for θ 3 .
{ Φ [ ω 3 ^ z α / 2 S E ( ω 3 ^ ) ] , Φ [ ω 3 ^ + z α / 2 S E ( ω 3 ^ ) ] }

4. Incorporation of Ties

As stated in Section 2, the inferential procedures for estimating θ 3 ^ and its confidence limits require that no prediction scores be tied. For prediction scores generated by models having multiple continuous, finely measured predictors, the number of tied subunit prediction scores will likely be negligible, and a single random tie-breaking method may be adopted. For scenarios with a moderate number of ties, this approach will be inefficient and may produce biased results. With the presence of many ties in the data, repeated random tie-breaking followed by a pooling of estimates may be adopted [8]. This procedure is outlined as follows:
  • For each prediction score that is tied with at least one other prediction score, add to it a randomly generated value from a continuous uniform probability distribution defined from Δ   to   Δ where Δ is a small number. As an example, Δ may be chosen to be a number more precise than the observed prediction scores (e.g., if prediction scores are measured to the hundredths digit, then 0.001 could be used for Δ ). Call this new set of prediction scores (all unique prediction scores and the formerly tied scores) S i j k , where indexes the set of prediction scores after ties are broken, of which there will be L in total.
  • Using S i j k , calculate θ 3 ^ and V A R ( θ 3 ^ ) as usual. Call these estimates θ 3 ^ and V A R ( θ 3 ^ ) . As shown in Section 3, apply the probit transformation to θ 3 ^ and call this transformed estimate ω 3 .
  • Repeat steps 1 and 2 L times. Thus, there will be L estimates of θ 3 ^ , V A R ( θ 3 ^ ) , and ω 3 .
  • Calculate ω 3 * as 1 L = 1 L ω 3 , θ 3 ^ * as 1 L = 1 L θ 3 ^ , and V A R ( θ 3 ^ ) * as 1 L = 1 L V A R ( θ 3 ^ ) + ( 1 + L 1 ) 1 L 1 = 1 L θ 3 ^ θ 3 ^ * 2 .
Ten iterations of tie breaking ( L = 10 ) were found to be reasonable, but if computing power allows, users may run more and assess the stability of their results. θ 3 ^ * is interpreted as the AUC after accounting for ties and ω 3 * , and V A R ( θ 3 ^ ) * are used in place of ω 3 and V A R ( θ 3 ^ ) , respectively, for calculating the confidence interval defined in Section 3.

5. Simulation Studies

The validity of the point and interval estimates given in Equations (4) and (9) was assessed through simulation. Five scenarios were examined and are described in detail in their respective sections below. To simulate data, prediction scores were derived from the framework of a normal linear mixed model, specified as:
Y i j k = B 1 e v e n t i j k + u i + v i j + e i j k
with a fixed effect B1 for the dichotomous event status eventijk, normally distributed random intercepts for family ( u i ) and person ( v i j ), and a normally distributed residual for subunit ( e i j k ), all with mean zero. Event status was randomly generated from a Bernoulli distribution with a predefined probability (i.e., prevalence), which was varied throughout the simulation scenarios. The regression coefficient associated with the fixed effect for event status ( B 1 ) was 1.5, implying that having the event is associated with a higher prediction score. The variance of the family random effect ( u i ) was 5.0, of person ( v i j ) was 3.0, and of subunit ( e i j k ) was 2.0. Given this, the intra-class correlation at the person level was 0.50 and at the subunit level was 0.80. Due to the above simulation parametrizations, all simulation settings had µ = 0.474, θs = 0.773, θf = 0.682, and θ = 0.631. The five simulation scenarios examined the effects of:
(1)
Varying the prevalence of event status,
(2)
accounting for tied prediction scores using the method outlined in Section 4,
(3)
having a variable number of persons per family, reflecting the 2012 US Census data,
(4)
creating unbalanced data at the subunit level through the random deletion of subunit measurements, and
(5)
introducing random variation to the S i j k prediction scores, which are assumed to be measured without error in Equation (2).
For each parametrization of each scenario, 5000 datasets were simulated with a prespecified number of families, persons within families (except in Scenario 3), and subunits. θ 3 ^ calculated using Equation (4) (averaged over simulations) was compared to the expectation of θ 3 ^ calculated using Equation (6) to assess bias ( m e a n θ 3 ^ minus E θ 3 ^ ). The variance of θ 3 ^ was examined by forming the ratio of the theoretical variance of θ 3 ^ calculated using Equation (7), to the empirical variance, which was calculated by taking the variance of the estimated θ 3 ^ across simulations (theoretical var θ 3 ^ / empirical var θ 3 ^ ). Lastly, the actual confidence interval coverage was measured for the proposed approach, the Obuchowski [3] and Delong [2] confidence interval methods, and the two-level approach of Rosner, Qiu, and Lee [5].

5.1. Scenario 1

In Scenario 1, the number of persons per family and the number of subunits per person were constant, and the event status for every subunit was independently assigned using a Bernoulli distribution with a probability of 0.50 or 0.25. Results are listed in Table 1. Across these simulation settings, the mean bias is 0.000, the mean variance ratio is 0.970, and the mean 95% CI coverage is 0.951; for the Obuchowski method, the mean 95% CI coverage is 0.935, for the DeLong method it is 0.938, and for the Rosner method it is 0.941. For the first parameterization in Scenario 1 (20 families, 2 persons, and 50% event prevalence), the actual confidence interval coverage was calculated to assess the viability of using a percentile bootstrap approach with 100 bootstrap samples by applying the same simulation methodology as described above. The percentile bootstrapping approach has a 95% CI coverage of 0.935.

5.2. Scenario 2

For Scenario 2, all the above design considerations were maintained, except that prediction scores were binned according to ventiles (20 bins), deciles (10 bins), and quintiles (5 bins) to create tied data. The analysis then proceeded as outlined in Section 4. Results are listed in Table 2. Across these simulation settings, the mean bias is −0.003, the mean-variance ratio is 0.955, and the mean 95% confidence interval coverage is 0.949.

5.3. Scenario 3

The design of Scenario 3 was similar to that of Scenario 1, except the number of persons within a family was allowed to vary randomly according to the empirical distribution of the number of children per family (in families with children) from the 2012 United States census data [9]. Families could have up to 10 children with a probability of 0.43 for one child, 0.36 for two children, 0.15 for three children, and 0.03 for four children with progressively smaller probabilities for more. Based on Table 3, the mean bias is 0.001, the mean-variance ratio is 0.965, and the mean 95% confidence interval coverage is 0.949; for the Obuchowski method, the mean 95% CI coverage is 0.939, for the DeLong method it is 0.936, and for the Rosner method it is 0.940.

5.4. Scenario 4

Scenario 4 was a modification of Scenario 1, where missing data at the subunit level was introduced. In each simulation, 10% of the persons were randomly selected to have one missing subunit. These missing observations were deleted rather than imputed. Based on Table 4, the mean bias is 0.000, the mean-variance ratio is 0.981, and the mean 95% confidence interval coverage is 0.952; for the Obuchowski method, the mean 95% CI coverage is 0.943, for the DeLong method it is 0.942, and for the Rosner method it is 0.944.

5.5. Scenario 5

In Scenario 5, the same setting as used in Scenario 1 was examined, however random normal measurement error (mean set to 0 and standard deviation varied was applied to the S i j k prediction scores (defined as Y i j k in the simulation setting). This was performed to assess the robustness of the proposed method to variability in the S i j k values that are assumed to be measured without error. Based on Table 5, the mean bias is −0.001, the mean-variance ratio is 0.954, and the mean 95% CI coverage is 0.946.
Based on the simulations, the bias in the point estimate of θ 3 ^ is negligible, and the coverage of the 95% confidence intervals is appropriate.

6. Example with Longitudinal Blood Pressure Data

To illustrate the proposed methodology, a sample of longitudinal blood pressure data collected across four timepoints from 1954 to 1968 in South Wales was analyzed [10]. Two regions, the Rhondda Fach and the Vale of Glamorgan, were sampled through the random selection of propositi older than 5 years, whose first-degree relatives were then included in the study along with the propositi. For this illustration, only individuals older than 30 at baseline and only those who were siblings of the propositi or the propositus himself/herself were included. Additionally, only timepoints with blood pressure data available were included. Hypertension was defined as a systolic blood pressure ≥ 140 mmHg, a diastolic blood pressure ≥ 90 mmHg, or a self-reported use of blood pressure medication. A summary of the sample sizes at different levels within the clustered longitudinal structure of the study is presented in Table S1, and descriptive statistics for the baseline visit are displayed in Table S2 of the Supplementary.
A generalized linear mixed model with a logit link and Bernoulli response was fit for hypertension (binary, time-varying), with age (continuous, time-varying), BMI (continuous, time-varying), sex (binary, time-invariant), and region (binary, time-invariant) as predictors of interest. Due to a large quantity of missing BMI data, ten rounds of multiple imputations were applied to this variable. Age and BMI were specified using restricted cubic splines with five knots having locations at the 5th, 27.5th, 50th, 72.5th, and 95th percentiles [11]. The inclusion of interactions between age and sex and BMI and sex were assessed using the Akaike Information Criterion (AIC) and found not to improve predictive ability. A family-level random intercept and a sibling random intercept within families, both assumed to be normally distributed, were included. The Kenward–Roger degrees of freedom and fixed effect standard error adjustments were used, and the model was estimated with residual pseudo-likelihood and optimized using Newton–Raphson with ridging in PROC GLIMMIX (SAS 9.4). Within each imputation dataset, the final sample size was 3143 visits (58% hypertensive) with 966 siblings and 319 families, a mean of 3.03 siblings per family (minimum = 1, maximum = 12), and a mean of 3.25 visits per sibling. The final model is given by:
ln p ( h y p e r t e n s i o n i j k ) | u i , v i j 1 p ( h y p e r t e n s i o n i j k ) | u i , v i j = β 0 + β 1 4 age   spline i j k + β 5 8 BMI   spline i j k + β 9 sex i j + β 10 region i j + u i + v i j
where u i is a family-level random intercept and v i j is a sibling-level random intercept.
Due to the use of flexible continuous predictor parameterizations, observed associations from the model fit to the complete dataset have been presented using categorical representations in Table 6—the impact of multiple imputation on inference is accounted for through Rubin’s rules in this table [8]. As seen in Table 6, an increase in age or BMI is strongly associated with higher odds of hypertension. The estimated variance of the random intercept for family was 0.51 (Wald p-value < 0.001) and for sibling was 1.67 (Wald p-value < 0.0001).
Using the estimated parameters of this model, the unconditional (i.e., not conditional on the random intercepts) predicted log odds for each observation in the complete dataset were calculated. By applying the proposed methodology, the AUC was estimated to be 0.76 ( μ = 0.36 , θ s = 0.70 , θ f = 0.63 , θ = 0.60 ) with a 95% confidence interval of 0.72 to 0.79. A naïve analysis that does not account for the clustered nature of the data produces the same AUC with a narrower 95% confidence interval of 0.74–0.77 (DeLong method [2]), and an analysis that accounts for clustering at only the participant level produces the same AUC as well, but also results in a narrower 95% confidence interval of 0.73–0.78 (Obuchowski method [3]). Given that multiple imputation was used during model fitting, an approach similar to that outlined in Section 4 was applied to the AUC estimates and confidence intervals to accurately account for the variance introduced by the multiple imputation procedure used for missing BMI data.

7. Example with Age-Related Macular Degeneration Data Clustered within Person and Family

The previous example illustrated how confidence intervals for the AUC based on longitudinal data can be produced using the proposed methodology; in this example, data measuring the progression of age-related macular degeneration (AMD) at the individual eye-level but clustered within patients (up to two eyes per person) and families was explored. The data used are from a subset of eyes from patients from the Seddon Longitudinal Cohort Study who did not present with evidence of advanced macular degeneration at a baseline visit and were followed for five years [6]. The sample sizes at each level of clustering are presented in Table S3, and descriptive statistics for the entire sample are presented in Tables S4 and S5 of the Supplementary.
A generalized linear mixed model with a logit link and Bernoulli response was fit for eye-level progression (binary) with age (categorical), sex (binary), smoking status (categorical), education (binary), BMI (categorical), and a baseline measure of AMD severity known as the Clinical Age Related Maculopathy Grading System (CARMS) grade (categorical; defined on a scale from 1 to 5, with grades 4 and 5 indicating subtypes of advanced AMD, but only values 1–3 at baseline are present in these example data) as predictors [12]. Race was not included since AMD is much less common in nonwhites, and there were few nonwhites in this cohort. A family-level random intercept and a patient-level random intercept, both assumed to be normally distributed, were included. The Kenward–Roger degrees of freedom and fixed effect standard error adjustments were applied, and the model was estimated by residual pseudo-likelihood and optimized using Newton–Raphson with ridging in PROC GLIMMIX (SAS 9.4). No missing data were present. The final model is:
ln p ( p r o g r e s s i o n i j k ) | u i , v i j 1 p ( p r o g r e s s i o n i j k ) | u i , v i j = β 0 + β 1 age i j + β 2 sex i j + β 3 past   smoker i j + β 4 current   smoker i j + β 5 education i j + β 6 BMI   2 i j + β 7 BMI   3 i j + β 8 CARMS   2 i j k + β 9 CARMS   3 i j k + u i + v i j
where u i is a family-level random intercept and v i j is a person-level random intercept.
The sample size was 1396 eyes (8% with progression), 741 patients, and 325 families. The results from the model fit using the complete dataset are presented in Table 7 where it can be seen that age and CARMS grade were the strongest predictors of progression. The estimated variance of the random intercept for family was 1.04 (Wald p-value = 0.027) and for person was 0.71 (Wald p-value = 0.09).
Using the estimated parameters of this model, the unconditional (i.e., not conditional on the random intercepts) predicted log odds for each observation in the complete dataset were calculated. There were many ties in the resulting predictions, so the previously outlined tie-breaking approach was used. By applying the proposed methodology, the AUC was estimated to be 0.88 ( μ = 1.21 , θ s = 0.99 , θ f = 0.85 , θ = 0.80 ) with a 95% confidence interval of 0.82–0.92. A correlation-naïve analysis that does not account for the hierarchical nature of the data produces the same AUC with a narrower 95% confidence interval of 0.85–0.90 (DeLong method [2]), and an analysis that accounts for clustering at only the participant level produces the same AUC as well, but also results in a narrower 95% confidence interval of 0.85–0.91 (Obuchowski method [3]).

8. Discussion

This investigation described how the extended Mann–Whitney U statistic, when paired with normally transformed prediction scores, can be applied to three-level hierarchical data to calculate an AUC statistic with a corresponding confidence interval. The AUC is calculated based on prediction scores from a three-level generalized linear mixed model with a logit link and Bernoulli distribution. Data may originate from a clustered setting or a longitudinal setting and cluster sizes are allowed to vary, but the correlation structure must be exchangeable. Simulation studies showed that the coverage of the 95% confidence interval around the AUC is appropriate, even in unbalanced settings with small cluster sizes and a small number of independent units (i.e., families).
In the two examples provided, the data were not divided into a training and testing set as this was not relevant to the illustration of the methodology. In practice, this technique or cross-validation could be used to obtain a validated measure of AUC. Cross-validation can be used by calculating the variance and AUC of the holdout sets of data and then combining them through the use of Rubin’s rules [8].
In the presentation of the two-level approach, prediction scores were assumed to be generated through a generalized estimating equation (GEE) model [5]. A three-level hierarchical data is traditionally modeled using generalized linear mixed models, and so this technique is used here in lieu of GEE. If a marginal model rather than a subject-specific model is desired, GEE may still be used, assuming that a method supporting three-levels of data dependency is applied.
Two important assumptions deserve further attention. First, Equation (5) assumes bivariate normality of normal random variables. The authors believe that this assumption is appropriate given that an examination of 3D histograms of the presented example data and data simulation scenarios did not violate this assumption. Second, this methodology assumes that the correlation structure of the response data is exchangeable. This is typically valid in data settings where correlation is induced by observations being nested within groups but is often inappropriate with longitudinal data. The blood pressure example (Section 6) includes longitudinal data, but due to model nonconvergence, it was not possible to assess the utility of complex correlation structures. As noted in Fitzmaurice, Laird, and Ware [13], it is often difficult to specify correlation structures that are more flexible than exchangeable ones in generalized linear mixed models with binary responses. Regardless, users of the proposed methodology should attempt to determine if exchangeability is a fair assumption. In addition to assessing the correlation structure, users should also be cognizant of the presence of informative cluster sizes, which may lead to bias in the results from the proposed methodology. The authors report that there are no competing interests to declare. For users interested in applying the proposed method to construct valid confidence intervals for the AUC in three-level settings, R code is available in the Supplementary Materials.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/stats6020034/s1. R code for producing a confidence interval for the AUC in a three-level setting is available in the Supplementary Material.

Author Contributions

Conceptualization, R.J.G., J.M.S., M.-L.T.L. and B.R.; Data curation, B.R.; Formal analysis, C.B.; Funding acquisition, B.R. and J.M.S.; Investigation, C.B., J.M.S., M.-L.T.L. and B.R.; Methodology, C.B., R.J.G., M.-L.T.L. and B.R.; Project administration, B.R.; Resources, J.M.S. and B.R.; Software, C.B.; Supervision, B.R.; Validation, B.R.; Writing—original draft, C.B.; Writing—review and editing, C.B., R.J.G., J.M.S., M.-L.T.L. and B.R. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the National Eye Institute (NEI) Grant No. R01 EY022445, EY028602, and EY011309.

Institutional Review Board Statement

No new data were created for this study and the IRB determined that this study is exempt; however, for the data analyzed, informed consent was obtained from participants, research adhered to the tenets of the Declaration of Helsinki, and was performed under approved institutional review board protocols.

Informed Consent Statement

No new data were created for this study; however, for the data analyzed, informed consent was obtained from participants, research adhered to the tenets of the Declaration of Helsinki, and was performed under approved institutional review board protocols.

Data Availability Statement

No new data were created for this study. Data from the relevant original studies are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hanley, J.A.; McNeil, B.J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983, 148, 39–843. [Google Scholar] [CrossRef] [PubMed]
  2. DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 37–845. [Google Scholar] [CrossRef]
  3. Obuchowski, N.A. Nonparametric analysis of clustered ROC curve data. Biometrics 1997, 53, 67–578. [Google Scholar] [CrossRef]
  4. Rosner, B.; Glynn, R.J.; Lee, M.L. Extension of the rank sum test for clustered data: Two-group comparisons with group membership defined at the subunit level. Biometrics 2006, 62, 251–1259. [Google Scholar] [CrossRef] [PubMed]
  5. Rosner, B.; Qiu, W.; Lee, M.L. Assessing discrimination of risk prediction rules in a clustered data setting. Lifetime Data Anal. 2013, 19, 42–256. [Google Scholar] [CrossRef] [PubMed]
  6. Seddon, J.M.; Rosner, B. Validated prediction models for macular degeneration progression and predictors of visual acuity loss identify high-risk individuals. Am. J. Ophthalmol. 2019, 198, 23–261. [Google Scholar] [CrossRef] [PubMed]
  7. Talbot, R.; Goldberg, I.; Kelly, P. Evaluating the accuracy of the visual field index for the Humphrey Visual Field Analyzer in patients with mild to moderate glaucoma. Am. J. Ophthalmol. 2013, 156, 272–1276. [Google Scholar] [CrossRef] [PubMed]
  8. Rubin, D.B. Multiple Imputation for Non-Response in Surveys; Wiley: New York, NY, USA, 1987. [Google Scholar]
  9. Vespa, J.; Lewis, J.M.; Kreider, R.M. America’s Families and Living Arrangements: 2012. Available online: https://www.census.gov/content/dam/Census/library/publications/2013/demo/p20-570.pdf (accessed on 25 March 2021).
  10. Miall, W.E.; Bell, R.A.; Lovell, H.G. Relation between change in blood pressure and weight. Br. J. Prev. Soc. Med. 1968, 22, 3–80. [Google Scholar] [CrossRef] [PubMed]
  11. Harrell, F.E. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis; Springer-Verlag: New York, NY, USA, 2001. [Google Scholar]
  12. Seddon, J.M.; Sharma, S.; Adelman, R.A. Evaluation of the clinical age-related maculopathy staging system. Ophthalmology 2006, 113, 60–266. [Google Scholar] [CrossRef]
  13. Fitzmaurice, G.M.; Laird, N.M.; Ware, J.H. Applied Longitudinal Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
Figure 1. Example of a three-level hierarchy with 2 families and varying numbers of persons per family and subunits per person.
Figure 1. Example of a three-level hierarchy with 2 families and varying numbers of persons per family and subunits per person.
Stats 06 00034 g001
Table 1. Simulation study with event prevalence of 0.50 or 0.25 (n = 5000 simulations).
Table 1. Simulation study with event prevalence of 0.50 or 0.25 (n = 5000 simulations).
Actual 95% CI Coverage
FamiliesPersonsSubunitsEvent Prevalence M e a n θ 3 ^ E θ 3 ^ BiasVariance RatioProposedObuchowskiDelongRosner
20220.500.6350.6340.0010.9890.9580.9350.9260.943
50220.500.6330.6330.0000.9200.9470.9430.9430.944
100220.500.6320.6320.0000.9510.9470.9450.9440.944
20320.500.6340.6340.0000.9960.9550.9340.9240.940
50320.500.6320.633−0.0010.9850.9510.9390.9450.942
100320.500.6320.6320.0011.0070.9570.9490.9460.938
20220.250.6330.634−0.0010.9640.9560.9340.9240.945
50220.250.6330.6330.0000.9650.9480.9430.9420.949
100220.250.6320.6320.0000.9790.9510.9440.9470.939
20320.250.6350.6340.0010.9310.9470.9330.9240.939
50320.250.6310.633−0.0020.9770.9500.9440.9480.937
100320.250.6320.6320.0000.9780.9500.9390.9480.943
Mean0.6330.6330.0000.9700.9510.9350.9380.941
Variance ratio is theoretical/empirical variance; actual 95% CI coverage is presented for the proposed method, Obuchowski’s method [3], DeLong’s method [2], and Rosner’s method [5].
Table 2. Simulation study with tied prediction scores (n = 5000 simulations).
Table 2. Simulation study with tied prediction scores (n = 5000 simulations).
FamiliesPersonsSubunitsBins M e a n θ 3 ^ E θ 3 ^ BiasVariance
Ratio
Actual 95% CI Coverage
202250.6270.634−0.0070.9680.953
502250.6260.633−0.0070.9550.943
2022100.6330.634−0.0010.9550.952
5022100.6310.633−0.0020.9440.945
2022200.6350.6340.0010.9630.954
5022200.6320.633−0.0010.9460.944
Mean0.6310.634−0.0030.9550.949
Variance ratio is theoretical/empirical variance; bins is percentile bins; event prevalence is 0.50 for all scenarios; actual 95% CI coverage is presented for the proposed method only, as the tie-breaking approach discussed in Section 4 is required.
Table 3. Simulation study with variable family size following the 2012 United States census (n = 5000 simulations).
Table 3. Simulation study with variable family size following the 2012 United States census (n = 5000 simulations).
Actual 95% CI Coverage
FamiliesSubunitsEvent Prevalence M e a n θ 3 ^ E θ 3 ^ BiasVariance RatioProposedObuchowskiDeLongRosner
5020.250.6330.6330.0000.9880.9520.9380.9320.939
5020.500.6340.6330.0010.9410.9460.9400.9400.940
Mean0.6340.6330.0010.9650.9490.9390.9360.940
Variance ratio is theoretical/empirical variance; actual 95% CI coverage is presented for the proposed method, Obuchowski’s method [3], DeLong’s method [2], and Rosner’s method [5].
Table 4. Simulation study with 10% of persons having a single ineligible subunit (n = 5000 simulations).
Table 4. Simulation study with 10% of persons having a single ineligible subunit (n = 5000 simulations).
Actual 95% CI Coverage
FamiliesPersonsEvent Prevalence M e a n θ 3 ^ E θ 3 ^ BiasVariance RatioProposedObuchowskiDeLongRosner
5020.500.6320.633−0.0010.9440.9480.9470.9480.949
5030.500.6330.6330.0001.0050.9540.9420.9400.940
5020.250.6320.633−0.0010.9610.9500.9410.9400.945
5030.250.6330.6330.0001.0130.9550.9420.9410.940
Mean0.6330.6330.0000.9810.9520.9430.9420.944
Variance ratio is theoretical/empirical variance; in each scenario, 90% of persons had two subunits and 10% had one subunit; actual 95% CI coverage is presented for the proposed method, Obuchowski’s method [3], DeLong’s method [2], and Rosner’s method [5].
Table 5. Simulation study with random normal error (mean = 0) introduced to the S i j k prediction scores.
Table 5. Simulation study with random normal error (mean = 0) introduced to the S i j k prediction scores.
FamiliesPersonsSubunitEvent PrevalenceNormal Random Error SD M e a n θ 3 ^ E θ 3 ^ BiasVariance Ratio95% CI Coverage
50220.500.10.6320.633−0.0010.9180.946
50320.500.10.6330.6330.0000.9820.948
50220.500.250.6310.633−0.0020.9390.945
50320.500.250.6330.6330.0000.9930.952
50220.500.500.6310.633−0.0020.9590.947
50320.500.500.6300.633−0.0030.9350.937
Mean0.6320.633−0.0010.9540.946
Variance ratio is theoretical/empirical variance; SD = standard deviation.
Table 6. The association between variables of interest and hypertension using categorized specifications of continuous predictors; random intercept variances; and AUC in the complete South Wales dataset (n visits = 3143, n siblings = 966, and n families = 319).
Table 6. The association between variables of interest and hypertension using categorized specifications of continuous predictors; random intercept variances; and AUC in the complete South Wales dataset (n visits = 3143, n siblings = 966, and n families = 319).
VariableOdds Ratio95% Confidence Intervalp-Value
Sex, reference = female0.75(0.58, 0.98)0.037
Age, reference = (30–39.9) <0.0001
(40–49.9)1.43(1.07, 1.92)
(50–59.9)3.01(2.19, 4.14)
(60–69.9)8.60(5.88, 12.57)
≥70 17.10(10.49, 27.89)
BMI, reference < 25 <0.0001
(25–29.9)1.74(1.36, 2.21)
≥303.88(2.63, 5.73)
Region, reference = Vale of Glamorgan0.76(0.57, 1.01)0.06
Family random intercept variance0.51 (Wald p-value < 0.001) *
Person random intercept variance1.67 (Wald p-value < 0.0001) *
AUC (proposed methodology)0.76 (95% CI: 0.72 to 0.79)
DeLong AUC [2]0.76 (95% CI: 0.74 to 0.77)
Obuchowski AUC [3]0.76 (95% CI: 0.73 to 0.78)
* Model-based Wald test of nonzero variance.
Table 7. The association between predictors of interest and eye-level progression to advanced age-related macular degeneration (AMD); random intercept variances; and AUC in the complete AMD dataset (1396 eyes [107 eyes progressing within five years], 741 patients, and 325 families).
Table 7. The association between predictors of interest and eye-level progression to advanced age-related macular degeneration (AMD); random intercept variances; and AUC in the complete AMD dataset (1396 eyes [107 eyes progressing within five years], 741 patients, and 325 families).
VariableOdds Ratio95% Confidence Intervalp-Value
Age (years), reference = (55–64.9) 0.010
(65–74.9)0.94(0.41, 2.15)
(75–80)2.43(1.00, 5.87)
Sex, reference = female0.58(0.31, 1.08)0.08
Smoking status, reference = never 0.60
Past1.26(0.68, 2.35)
Current1.64(0.54, 5.00)
Education, reference = more than high school1.28(0.73, 2.26)0.39
BMI, reference <25 0.50
(25–29.9)0.68(0.36, 1.30)
≥300.85(0.40, 1.83)
CARMS grade, reference = 1 <0.0001
225.95(5.76, 116.86)
389.18(20.91, 380.30)
Family random intercept variance1.04 (Wald p-value = 0.027) *
Person random intercept variance0.71 (Wald p-value = 0.09) *
AUC (proposed methodology)0.88 (95% CI: 0.82 to 0.92)
DeLong AUC [2] 0.88 (95% CI: 0.85 to 0.90)
Obuchowski AUC [3]0.88 (95% CI: 0.85 to 0.91)
* Model-based Wald test of nonzero variance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bay, C.; Glynn, R.J.; Seddon, J.M.; Lee, M.-L.T.; Rosner, B. Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC. Stats 2023, 6, 526-538. https://doi.org/10.3390/stats6020034

AMA Style

Bay C, Glynn RJ, Seddon JM, Lee M-LT, Rosner B. Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC. Stats. 2023; 6(2):526-538. https://doi.org/10.3390/stats6020034

Chicago/Turabian Style

Bay, Camden, Robert J Glynn, Johanna M Seddon, Mei-Ling Ting Lee, and Bernard Rosner. 2023. "Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC" Stats 6, no. 2: 526-538. https://doi.org/10.3390/stats6020034

APA Style

Bay, C., Glynn, R. J., Seddon, J. M., Lee, M. -L. T., & Rosner, B. (2023). Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC. Stats, 6(2), 526-538. https://doi.org/10.3390/stats6020034

Article Metrics

Back to TopTop