Article

Resampling Plans and the Estimation of Prediction Error

Bradley Efron
Department of Statistics, Stanford University, Stanford, CA 94305, USA
Stats 2021, 4(4), 1091-1115; https://doi.org/10.3390/stats4040063
Submission received: 22 September 2021 / Revised: 16 November 2021 / Accepted: 18 November 2021 / Published: 20 December 2021
(This article belongs to the Special Issue Re-sampling Methods for Statistical Inference of the 2020s)

Abstract

This article was prepared for the Special Issue on Resampling methods for statistical inference of the 2020s. Modern algorithms such as random forests and deep learning are automatic machines for producing prediction rules from training data. Resampling plans have been the key technology for evaluating a rule’s prediction accuracy. After a careful description of the measurement of prediction error, the article discusses the advantages and disadvantages of the principal methods: cross-validation, the nonparametric bootstrap, covariance penalties (Mallows’ $C_p$ and the Akaike Information Criterion), and conformal inference. The emphasis is on a broad overview of a large subject, featuring examples, simulations, and a minimum of technical detail.

1. Introduction

Modern prediction algorithms such as random forests and deep learning use training sets, often very large ones, to produce rules for predicting new responses from a set of available predictors. A second question—right after “How should the prediction rule be constructed?”—is “How accurate are the rule’s predictions?” Resampling methods have played a central role in the answer. This paper is intended to provide an overview of what are actually several different answers, while trying to keep technical complications to a minimum.
This is a Special Issue of STATS devoted to resampling, and before beginning on prediction rules it seems worthwhile to say something about the general effect of resampling methods on statistics and statisticians. Table 1 shows the law school data [1], a small data set but one not completely atypical of its time. The table reports average scores of the 1973 entering class at 15 law schools on two criteria: undergraduate grade point average (GPA) and result on the “LSAT”, a national achievement test. The observed Pearson correlation coefficient between GPA and LSAT score is
$\hat{\rho} = 0.776. \qquad (1)$
How accurate is $\hat{\rho}$?
Suppose that Dr. Jones, a 1940s statistician, was given the data in Table 1 and asked to attach a standard error to ρ ^ = 0.776 ; let’s say a nonparametric standard error since a plot of (LSAT,GPA) looks definitely non-normal. At his disposal is the nonparametric delta method, which gives a first-order Taylor series approximation formula for se ( ρ ^ ) . For the Pearson correlation coefficient this turns out to be
$\widehat{\mathrm{se}}(\hat{\rho}) = \left[\frac{\hat{\rho}^{\,2}}{4n}\left(\frac{\hat{\mu}_{40}}{\hat{\mu}_{20}^{2}} + \frac{\hat{\mu}_{04}}{\hat{\mu}_{02}^{2}} + \frac{2\hat{\mu}_{22}}{\hat{\mu}_{20}\hat{\mu}_{02}} + \frac{4\hat{\mu}_{22}}{\hat{\mu}_{11}^{2}} - \frac{4\hat{\mu}_{31}}{\hat{\mu}_{11}\hat{\mu}_{20}} - \frac{4\hat{\mu}_{13}}{\hat{\mu}_{11}\hat{\mu}_{02}}\right)\right]^{1/2}, \qquad (2)$
where $n = 15$ and $\hat{\mu}_{ij}$ is the mean of $(\mathrm{GPA} - \overline{\mathrm{GPA}})^{i}(\mathrm{LSAT} - \overline{\mathrm{LSAT}})^{j}$, the bars indicating averages.
Table 1. Average scores for admittees to 15 American law schools, 1973. GPA is undergraduate grade point average, LSAT “law boards” score. Pearson correlation coefficient between GPA and LSAT is 0.776.
GPA	LSAT
3.39	576
3.30	635
2.81	558
3.03	578
3.44	666
3.07	580
3.00	555
3.43	661
3.36	651
3.13	605
3.12	653
2.74	575
2.76	545
2.88	572
2.96	594
Jones either looks up or derives (2), evaluates the six terms on his mechanical calculator, and reports
$\widehat{\mathrm{se}}(\hat{\rho}) = 0.124, \qquad (3)$
after which he goes home with the feeling of a day well spent.
Jones’ daughter, a 1960s statistician, has a much easier go of it. Now she does not have to look up or derive Formula (2). A more general resampling algorithm, the Tukey–Quenouille jackknife is available, and can be almost instantly evaluated on her university’s mainframe computer. It gives her the answer
$\widehat{\mathrm{se}}(\hat{\rho}) = 0.143. \qquad (4)$
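For readers who want to reproduce the calculation, here is a minimal R sketch of the Tukey–Quenouille jackknife standard error for the law school correlation. The column values are copied from Table 1; the variable names are ours.

```r
# Jackknife standard error for the law school correlation (Table 1)
gpa  <- c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
          3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96)
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661,
          651, 605, 653, 575, 545, 572, 594)
n <- length(gpa)
rho.jack <- sapply(1:n, function(i) cor(gpa[-i], lsat[-i]))  # leave-one-out correlations
se.jack  <- sqrt((n - 1) / n * sum((rho.jack - mean(rho.jack))^2))
se.jack   # roughly 0.14, as in (4)
```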
Dr. Jones is envious of his daughter:
1.
She does not need to spend her time deriving arduous formulas such as (2).
2.
She is not restricted to traditional estimates such as ρ ^ that have closed-form Taylor series expansions.
3.
Her university’s mainframe computer is a million times faster than his old Marchant calculator (though it is across the campus rather than on her desk).
If now, 60 years later, the Jones family is still in the statistics business they’ll have even more reason to be grateful for resampling methods. Faster, cheaper, and more convenient computation combined with more aggressive methodology have pushed the purview of resampling applications beyond the assignment of standard errors.
Figure 1 shows 2000 nonparametric bootstrap replications ρ ^ * from the law school data (Each ρ ^ * is the correlation from a bootstrapped data matrix obtained by resampling the 15 rows of the 15 × 2 matrix in Table 1 15 times with replacement. See Chapter 11 of [2]). Their empirical standard deviation is the nonparametric bootstrap estimate of standard error for ρ ^ ,
$\widehat{\mathrm{se}}(\hat{\rho}) = 0.138. \qquad (5)$
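A corresponding sketch of the nonparametric bootstrap standard error, reusing gpa, lsat, and n from the jackknife sketch above; B = 2000 matches the replication count in the text, and the seed is arbitrary.

```r
# Nonparametric bootstrap standard error: resample the 15 rows with replacement
set.seed(1)
B <- 2000
rho.star <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  cor(gpa[idx], lsat[idx])
})
sd(rho.star)   # bootstrap estimate of se(rho.hat), about 0.13-0.14
```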
Two thousand replications is roughly ten times more than needed for a standard error, but it is not too many for a bootstrap confidence interval. The arrowed segments in Figure 1 compare the standard approximate 95% confidence limit
$\rho \in \hat{\rho} \pm 1.96\,\widehat{\mathrm{se}} = [0.505, 1.048], \qquad (6)$
$\widehat{\mathrm{se}} = 0.138$, with the nonparametric bootstrap interval. (This is the bca interval, constructed using program bcajack from the CRAN package bcaboot [3]. Chapter 11 of [2] shows why bca’s “second-order corrections”, here very large, improve on the standard method.)
$\rho \in [0.320, 0.944]. \qquad (7)$
The standard method does better if we first make Fisher’s z transformation
$\hat{\xi} = \frac{1}{2}\log\frac{1+\hat{\rho}}{1-\hat{\rho}}, \qquad (8)$
compute the standard interval on the $\xi$ scale, and transform the endpoints back to the $\rho$ scale. This gives the 0.95 interval
$\rho \in [0.275, 0.946], \qquad (9)$
not so different from the bootstrap interval (7), and at least not having its upper limit above 1.00!
This is the kind of trick Dr. Jones would have known. Resampling, here in the form of the bca algorithm, automates devices such as (8) without requiring Fisher-level insight for each new application.
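For concreteness, the Fisher z device (8) can be sketched in R, continuing from the bootstrap sketch above. Using the bootstrap replicates to supply a standard error on the $\xi$ scale is our assumption; the article does not say exactly which standard error it used there.

```r
# Fisher's z transformation (8): interval on the xi scale, back-transformed to rho
xi.hat <- atanh(cor(gpa, lsat))       # 0.5 * log((1 + rho) / (1 - rho))
se.xi  <- sd(atanh(rho.star))         # bootstrap se on the xi scale (our choice)
tanh(xi.hat + c(-1.96, 1.96) * se.xi) # endpoints mapped back to the rho scale
```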
If statistics had a “word of the year” it would be two words: deep learning. This is one of a suite of prediction algorithms that input data sets, often quite massive ones, and output prediction rules. Others in the suite include random forests, support vector machines, and gradient boosting.
Having produced a prediction rule, it is natural to wonder how accurately it will predict future cases, our subject in what follows:
  • Section 2 gives a careful definition of the prediction problem, and describes a class of loss functions (the “Q class”) that apply to discrete as well as continuous response variables.
  • Section 3 concerns nonparametric estimates of prediction loss (cross-validation and the bootstrap “632 rule”) as well as Breiman’s out-of-bag error estimates for random forests.
  • Covariance penalties, including Mallows’ C p and the Akaike Information Criterion, are parametric methods discussed in Section 4 along with the related concept of degrees of freedom.
  • Section 5 briefly discusses conformal inference, the most recent addition to the resampling catalog of prediction error assessments.

2. The Prediction Problem

Statements of the prediction problem are often framed as follows:
  • A data set d of n pairs is observed,
    $d = \{(x_i, y_i),\ i = 1, 2, \ldots, n\}, \qquad (10)$
    where the x i are p-dimensional predictor vectors and the y i are one-dimensional responses.
  • The ( x , y ) pairs are assumed to be independent and identically distributed draws from an unknown ( p + 1 ) -dimensional distribution F,
    $(x_i, y_i) \overset{\mathrm{iid}}{\sim} F, \qquad i = 1, 2, \ldots, n. \qquad (11)$
  • Using some algorithm A , the statistician constructs a prediction rule f ( x , d ) that provides a prediction μ ^ ( x ) ,
    $\hat{\mu}(x) = f(x, d) \qquad (12)$
    for any vector x in the space of possible predictors.
  • A new pair ( x , y ) is independently drawn from F,
    $(x, y) \sim F \quad \text{independent of } d, \qquad (13)$
    but with only x observable.
  • The statistician predicts the unseen y by μ ^ ( x ) = f ( x , d ) and wishes to assess prediction error. Later the prediction error will turn out to directly relate to the estimation error of μ ^ ( x ) for the true conditional expectation of y given x,
    $\mu(x) = E_F\{y \mid x\}. \qquad (14)$
  • Prediction error is assessed as the expectation of loss under distribution F,
    $\mathrm{Err}^{(u)} = E_F\big\{Q\big(y, \hat{\mu}(x)\big)\big\}, \qquad (15)$
    for a given loss function such as squared error: $Q(y, \hat{\mu}) = (y - \hat{\mu})^2$.
Here, $E_F$ indicates expectation over the random choice of all $n + 1$ pairs $(x_i, y_i)$ and $(x, y)$ in (11) and (13). The u in Err ( u ) reflects the unconditional definition of error in (15). The resampling algorithms we will describe calculate an estimate of Err ( u ) from the observed data. (One might hope for a more conditional error estimate, say one applying to the observed set of predictors x , a point discussed in what follows.)
Naturally, the primary concern within the prediction community has been with the choice of the algorithm A that produces the rule μ ^ ( x ) = f ( x , d ) . Elaborate computer-intensive algorithms such as random forests and deep learning have achieved star status, even in the popular press. Here, however, the “prediction problem” will focus on the estimation of prediction error. To a large extent the prediction problem has been a contest of competing resampling methods, as discussed in the next three sections.
Figure 2 illustrates a simple example: n = 20 pairs ( x i , y i ) have been observed, in this case with x real. A fourth-degree polynomial f ( x , d ) has been fit by ordinary least squares applied to d = { ( x i , y i ) , i = 1 , 2 , , 20 } with the heavy curve tracing out μ ^ ( x ) = f ( x , d ) .
In the usual OLS notation, we’ve observed y = ( y 1 , y 2 , , y n ) from the notional model
$y = X\beta + \epsilon, \qquad (16)$
where X is the 20 × 5 matrix with ith row ( 1 , x i , x i 2 , x i 3 , x i 4 ) , β the unknown 5-dimensional vector of regression coefficients, and ϵ a vector of 20 uncorrelated errors having mean 0 and variance σ 2 ,
$\epsilon_i \sim (0, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n. \qquad (17)$
The fitted curve μ ^ ( x ) = f ( x , d ) is given by
$\hat{\mu}(x) = X(x)\hat{\beta}, \qquad (18)$
for $X(x) = (1, x, x^2, x^3, x^4)$ and $\hat{\beta} = (X'X)^{-1}X'y$ (the OLS estimate), this being algorithm $\mathcal{A}$.
The apparent error, what will be called err in what follows, is
$\mathrm{err} = \sum_{i=1}^{n} Q(y_i, \hat{\mu}_i)\big/n, \qquad \hat{\mu}_i = f(x_i, d), \qquad (19)$
which equals 1.99 for Q ( y , μ ^ ) = ( y μ ^ ) 2 . The usual unbiased estimate for the noise parameter σ 2 , not needed here, modifies the denominator in (19) to take account of fitting a 5-vector β ^ to d ,
$\hat{\sigma}^2 = \sum_{i=1}^{n} Q(y_i, \hat{\mu}_i)\big/(n - 5) = 2.65. \qquad (20)$
Dividing the sum of squared errors by $n - p$ rather than $n$ can be thought of as a classical prediction error adjustment; err usually underestimates future prediction error since the $\hat{\mu}_i$ have been chosen to fit the observations $y_i$.
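A short R sketch of the fit in Figure 2 and its apparent error. The article does not list the 20 pairs, so the x grid and the simulated y below are our own stand-ins, generated from (21)–(22) as reconstructed here.

```r
# Fourth-degree polynomial example of Figure 2 (simulated stand-in data)
set.seed(2)
x <- seq(0.5, 10, length.out = 20)                    # assumed x grid
y <- 2 + 0.2 * x + cos((x - 5) / 2) +                 # true curve (21), as reconstructed
     rnorm(20, sd = sqrt(2))                          # normal errors, variance 2 (22)
fit    <- lm(y ~ poly(x, 4, raw = TRUE))              # algorithm A: OLS on (1,x,...,x^4)
mu.hat <- fitted(fit)
err        <- mean((y - mu.hat)^2)                    # apparent error (19)
sigma2.hat <- sum((y - mu.hat)^2) / (length(y) - 5)   # unbiased noise estimate (20)
```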
Because this is a simulation, we know the true function μ ( x ) (14), the light dotted curve in Figure 2:
$\mu(x) = 2 + 0.2x + \cos\big((x-5)/2\big); \qquad (21)$
the points y i were generated with normal errors, variance 2,
$y_i \overset{\mathrm{ind}}{\sim} N\big(\mu(x_i), 2\big), \qquad i = 1, 2, \ldots, n. \qquad (22)$
Given model (22), we can calculate the true prediction error for estimate μ ^ ( x ) (18). If y * N ( μ ( x ) , 2 ) is a new observation, independent of the original data d which gave μ ^ ( x ) , then the true prediction error Err x is
$\mathrm{Err}_x = E_*\big\{\big(y^* - \hat{\mu}(x)\big)^2\big\} = 2 + \big(\hat{\mu}(x) - \mu(x)\big)^2, \qquad (23)$
the notation E * indicating expectation over y * . Let Err denote the average true error over the n = 20 observed values x i ,
$\mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}_{x_i} = 2 + \frac{1}{n}\sum_{i=1}^{n}\big(\hat{\mu}(x_i) - \mu(x_i)\big)^2, \qquad (24)$
equaling 2.40 in this case. Err figures prominently in the succeeding sections. Here, it exceeds the apparent error err = 1.99 (19) by 21%. ( Err is not the same as Err ( u ) (15).)
Prediction algorithms are often, perhaps most often, applied to situations where the responses are dichotomous, y i = 1 or 0; that is, they are Bernoulli random variables, binomials of sample size 1 each,
$y_i \overset{\mathrm{ind}}{\sim} \mathrm{Bi}\big(1, \mu(x_i)\big) \quad \text{for } i = 1, 2, \ldots, n. \qquad (25)$
Here μ ( x ) is the probability that y = 1 given prediction vector x,
$\mu(x) = \Pr\{y = 1 \mid x\}. \qquad (26)$
The probability model F in (11) can be thought of in two steps: as first selecting x according to some p-dimensional distribution G and then “flipping a biased coin” to generate y Bi ( 1 , μ ( x ) ) .
Squared error is not appropriate for dichotomous data. Two loss (or “error”) functions Q ( μ , μ ^ ) are in common use for measuring the discrepancy between μ and μ ^ , the true and estimated probability that y = 1 in (25). The first is counting error,
$Q(\mu, \hat{\mu}) = \begin{cases} 0 & \text{if } \mu \text{ and } \hat{\mu} \text{ are on the same side of } 1/2 \\ 2\,\big|\mu - \tfrac{1}{2}\big| & \text{if not.} \end{cases} \qquad (27)$
For y = 0 or 1, Q ( y , μ ^ ) equals 0 or 1 if y and μ ^ are on the same or different sides of 1 / 2 .
The second error function is binomial deviance (or twice the Kullback–Leibler divergence),
$Q(\mu, \hat{\mu}) = 2\left[\mu\log\frac{\mu}{\hat{\mu}} + (1-\mu)\log\frac{1-\mu}{1-\hat{\mu}}\right]. \qquad (28)$
Binomial deviance plays a preferred role in maximum likelihood estimation. Suppose that μ ( β ) is a p-parameter family for the true vector of means in model (25),
$\mu(\beta) = \big(\mu(x_1, \beta), \mu(x_2, \beta), \ldots, \mu(x_n, \beta)\big) \qquad (\beta \in \Omega \subset R^p). \qquad (29)$
Then, the maximum likelihood estimate (MLE) β ^ is the minimizer of the average binomial deviance (28) between y = ( y 1 , y 2 , , y n ) and μ ( β ) ,
$\hat{\beta} = \operatorname*{argmin}_{\beta}\ \frac{1}{n}\sum_{i=1}^{n} Q\big(y_i, \mu(x_i, \beta)\big); \qquad (30)$
see Chapter 8 of [2]. Most of the numerical examples in the following sections are based on binomial deviance (30). (If μ ^ equals 0 or 1 then (30) is infinite. To avoid infinities, our numerical examples truncate μ ^ to [ 0.005 , 0.995 ] .)
Squared error, counting error, and binomial deviance are all members of the Q-class, a general construction illustrated in Figure 3 ([4] Section 3). The construction begins with some concave function q ( μ ) ; for the dichotomous cases considered here, μ [ 0 , 1 ] with q ( 0 ) = q ( 1 ) = 0 . The error Q ( μ , μ ^ ) between a true value μ and an estimate μ ^ is defined by the illustrated tangency calculation:
$Q(\mu, \hat{\mu}) = q(\hat{\mu}) + \dot{q}(\hat{\mu})(\mu - \hat{\mu}) - q(\mu), \qquad (31)$
$\dot{q}(\mu) = dq(\mu)/d\mu$ (equivalent to the “Bregman divergence” [5]).
The entropy function
$q(\mu) = -2\big[\mu\log\mu + (1-\mu)\log(1-\mu)\big] \qquad (32)$
makes $Q(\mu, \hat{\mu})$ equal binomial deviance. Two other common choices are $q(\mu) = \min\{\mu, 1-\mu\}$ for counting error and $q(\mu) = \mu(1-\mu)$ for squared error.
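A small R sketch of the tangency construction (31), checking numerically that the entropy function (32) reproduces binomial deviance (28). The function names are ours.

```r
# Q-class construction (31): Q(mu, mu.hat) = q(mu.hat) + q'(mu.hat)(mu - mu.hat) - q(mu)
Q.from.q <- function(q, q.dot) {
  function(mu, mu.hat) q(mu.hat) + q.dot(mu.hat) * (mu - mu.hat) - q(mu)
}
q.ent  <- function(m) -2 * (m * log(m) + (1 - m) * log(1 - m))  # entropy (32)
qd.ent <- function(m) -2 * log(m / (1 - m))                     # its derivative
Q.dev  <- Q.from.q(q.ent, qd.ent)
mu <- 0.7; mu.hat <- 0.4                                        # interior points, no log(0)
c(Q.dev(mu, mu.hat),                                            # tangency construction
  2 * (mu * log(mu / mu.hat) + (1 - mu) * log((1 - mu) / (1 - mu.hat))))  # deviance (28)
```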
Working within the Q-class (31), it is easy to express the true error of prediction μ ^ ( x ) at predictor value x where the true mean is μ ( x ) = E F { y x } . Letting y * be an independent realization from the distribution of y given x, the true error at x is, by definition,
$\mathrm{Err}\big(\mu(x), \hat{\mu}(x)\big) = E_*\big\{Q\big(y^*, \hat{\mu}(x)\big)\big\}, \qquad (33)$
only y * being random in the expectation.
Lemma 1.
The true error at x (33) is
$\mathrm{Err}\big(\mu(x), \hat{\mu}(x)\big) = Q\big(\mu(x), \hat{\mu}(x)\big) + q\big(\mu(x)\big) - c(x), \qquad (34)$
with c ( x ) = 0 in the dichotomous case.
Proof. 
From definition (31) of the Q-class,
$E_*\big\{Q\big(y^*, \hat{\mu}(x)\big)\big\} = E_*\big\{q\big(\hat{\mu}(x)\big) + \dot{q}\big(\hat{\mu}(x)\big)\big(y^* - \hat{\mu}(x)\big) - q(y^*)\big\}$
$\quad = q\big(\hat{\mu}(x)\big) + \dot{q}\big(\hat{\mu}(x)\big)\big(\mu(x) - \hat{\mu}(x)\big) - E_*\big\{q(y^*)\big\}$
$\quad = Q\big(\mu(x), \hat{\mu}(x)\big) + q\big(\mu(x)\big) - E_*\big\{q(y^*)\big\}, \qquad (35)$
giving (34) with c ( x ) = E * q ( y * ) . In the dichotomous case q ( y * ) = 0 for y * equal 0 or 1, so c ( x ) = 0 .  ☐
To simplify notation, let μ i = μ ( x i ) and μ ^ i = μ ^ ( x i ) , with μ = ( μ 1 , μ 2 , , μ n ) and μ ^ = ( μ ^ 1 , μ ^ 2 , , μ ^ n ) . The average true error Err ( μ , μ ^ ) is defined to be
$\mathrm{Err}(\mu, \hat{\mu}) = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}(\mu_i, \hat{\mu}_i) = \frac{1}{n}\sum_{i=1}^{n}\big[Q(\mu_i, \hat{\mu}_i) + q(\mu_i) - c(x_i)\big]. \qquad (36)$
It is “true” in the desirable sense of applying to the given prediction rule μ ^ . If we average Err ( μ , μ ^ ) over the random choice of d in (11), we get the less desirable unconditional error Err ( u ) (15).
Err ( μ , μ ^ ) is minimized by μ ^ = μ ,
$\mathrm{Err}(\mu, \mu) = \frac{1}{n}\sum_{i=1}^{n}\big[q(\mu_i) - c(x_i)\big]. \qquad (37)$
Subtraction from (36) gives
$\mathrm{Err}(\mu, \hat{\mu}) - \mathrm{Err}(\mu, \mu) = \frac{1}{n}\sum_{i=1}^{n} Q(\mu_i, \hat{\mu}_i). \qquad (38)$
This is an exact analogue of the familiar squared error relationship. Suppose y and y * are independently N ( μ , σ 2 I ) vectors, and that y produces an estimate μ ^ . Then
$\frac{1}{n}E_*\|y^* - \hat{\mu}\|^2 - \frac{1}{n}E_*\|y^* - \mu\|^2 = \frac{1}{n}\sum_{i=1}^{n}(\mu_i - \hat{\mu}_i)^2, \qquad (39)$
which is (38) if $Q(\mu, \hat{\mu}) = (\mu - \hat{\mu})^2$.
At a given value of x, say x 0 , a prediction μ ^ ( x 0 ) can also be thought of as an estimate of μ ( x 0 ) = E F { y 0 x 0 } . Lemma 1 shows that the optimum choice of a prediction rule μ ^ ( x 0 ) = f ( x 0 , d ) is also the optimum choice of an estimation rule: for any rule f ( x , d ) ,
$E_F\big\{\mathrm{Err}\big(\mu(x_0), \hat{\mu}(x_0)\big)\big\} = E_F\big\{Q\big(\mu(x_0), \hat{\mu}(x_0)\big)\big\} + q\big(\mu(x_0)\big) - c(x_0) \qquad (40)$
( E F as in (11)), so that the rule that minimizes the expected prediction error also minimizes the expected estimation error E F { Q ( μ ( x 0 ) , μ ^ ( x 0 ) ) } . Predicting y and estimating its mean μ are equivalent tasks within the Q-class.

3. Cross-Validation and Its Bootstrap Competitors

Resampling methods base their inferences on recomputations of the statistic of interest derived from systematic modifications of the original sample. This is not a very precise definition, but it cannot be if we want to cover the range of methods used in estimating prediction error. There are intriguing differences among the methods concerning just what the modifications are and how the inferences are made, as discussed in this and the next two sections.
Cross-validation has a good claim to being the first resampling method. The original idea was to randomly split the sample d into two halves, the training and test sets, d train and d test . A prediction model is developed using only d train , and then validated by its performance on d test . Even if we cheated in the training phase, say by throwing out “bad” points, etc., the validation phase guarantees an honest estimate of prediction error.
One drawback is that inferences based on n / 2 data points are likely to be less accurate than those based on all n, a concern if we are trying to accurately assess prediction error. “One-at-a-time“ cross-validation almost eliminates this defect: let d ( i ) be data set (10) with point ( x i , y i ) removed, and define
$\hat{\mu}_{(i)} = f\big(x_i, d_{(i)}\big), \qquad (41)$
the prediction for case i based on x i using the rule f ( x i , d ( i ) ) , constructed using only the data in d ( i ) . The cross-validation estimate of prediction error is then
$\widehat{\mathrm{Err}}_{\mathrm{cv}} = \frac{1}{n}\sum_{i=1}^{n} Q\big(y_i, \hat{\mu}_{(i)}\big). \qquad (42)$
(This assumes that we know how to apply the construction rule $\mathcal{A}$ to subsets of size $n - 1$.) Because $y_i$ is not involved in $\hat{\mu}_{(i)}$, overfitting is no longer a concern. Under the independent-draws model (11), $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ is a nearly unbiased estimate of Err ( u ) (15) (“nearly” because it applies to samples of size $n - 1$ rather than $n$).
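A sketch of (42) for the polynomial example, continuing from the simulated x and y of the earlier sketch; squared error is the loss, as in the text.

```r
# One-at-a-time cross-validation (42) for the Figure 2 example
d <- data.frame(x = x, y = y)                              # the 20 simulated pairs
cv.loss <- sapply(seq_len(nrow(d)), function(i) {
  fit.i <- lm(y ~ poly(x, 4, raw = TRUE), data = d[-i, ])  # rule built without case i
  (d$y[i] - predict(fit.i, newdata = d[i, ]))^2            # squared-error loss at case i
})
Err.cv <- mean(cv.loss)
```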
The little example in Figure 2 has $n = 20$ points $(x_i, y_i)$. Applying (42) with $Q(y, \hat{\mu}) = (y - \hat{\mu})^2$ gave $\widehat{\mathrm{Err}}_{\mathrm{cv}} = 3.59$, compared with apparent error $\mathrm{err} = 1.99$ (19) and true error $\mathrm{Err} = 2.40$ (24). This not-very-impressive result has much to do with the small sample size resulting in large differences between the original estimates $\hat{\mu}_i$ and their cross-validated counterparts $\hat{\mu}_{(i)}$. The single data point at $x_i = 6.0$ accounted for 40% of $\widehat{\mathrm{Err}}_{\mathrm{cv}}$.
Figure 4 concerns a larger data set that will serve as a basis for simulations comparing cross-validation with its competitors: 200 transplant recipients were followed to investigate the subsequent occurrence of anemia; 138 did develop anemia (coded as y i = 1 ) while 62 did not ( y i = 0 ). The goal of the study was to predict y from x, a vector of p = 17 baseline variables including the intercept term (The predictor variables were body mass index, sex, race, patient and donor age, four measures of matching between patient and donor, three baseline medicine indicators, and four baseline general health measures).
A standard logistic regression analysis gave estimated values of the anemia probability Pr { y i = 1 x i } for i = 1 , 2 , , n = 200 that we will denote as
$\mu = (\mu_1, \mu_2, \ldots, \mu_{200}). \qquad (43)$
Figure 4 shows a histogram of the 200 μ i values. Here we will use μ as the “ground truth” for a simulation study (rather than analyzing the transplant study itself). The μ i will play the role of the true mean μ ( x i ) in Lemma 1 (34), enabling us to calculate true errors for the various prediction error estimates.
A 200 × 100 matrix Y of dichotomous responses y i j was generated as independent Bernoulli variables (that is, binomials of sample size 1),
$y_{ij} \overset{\mathrm{ind}}{\sim} \mathrm{Bi}(1, \mu_i) \qquad (44)$
for i = 1 , 2 , , 200 and j = 1 , 2 , , 100 . The jth column of Y ,
$y_j = (y_{1j}, y_{2j}, \ldots, y_{200j}) \qquad (45)$
is a simulated binomial response vector (25) having true mean vector μ . Y provides 100 such response vectors.
Figure 4. Logistic regression estimated anemia probabilities for the 200 transplant patients.
For each one, a logistic regression was run,
$\mathrm{glm}(y_j \sim X,\ \mathrm{binomial}), \qquad (46)$
in the language R, with X the 200 × 17 matrix of predictors from the transplant study. Cross-validation (“10-at-a-time” cross-validation rather than one-at-a-time: the 200 (x,y) pairs were randomly split into 20 groups of 10 each; each group was removed from the prediction set in turn and its 10 estimates obtained by logistic regression based on the other 190) gave an estimate of prediction error for the jth simulation,
$\widehat{\mathrm{Err}}_{\mathrm{cv}}(j) = \frac{1}{n}\sum_{i=1}^{200} Q\big(y_{ij}, \hat{\mu}_{(i)j}\big), \qquad (47)$
while (36) gave true error
$\mathrm{Err}(j) = \frac{1}{n}\sum_{i=1}^{200}\big[Q(\mu_i, \hat{\mu}_{ij}) + q(\mu_i)\big], \qquad (48)$
where μ ^ i j was the estimated mean from (46).
In terms of mean ± standard deviation the 100 simulations gave
$\widehat{\mathrm{Err}}_{\mathrm{cv}} \approx 1.16 \pm 0.14 \quad \text{and} \quad \mathrm{Err} \approx 1.07 \pm 0.12. \qquad (49)$
Err cv averaged 8% more than Err (a couple percent of which came from having sample size 190 rather than 200). The standard deviation of Err cv is not much bigger than the standard deviation of Err , which might suggest that Err cv was tracking Err as it varied across the simulations.
Sorry to say, that was not at all the case. The pairs ( Err ( j ) , Err cv ( j ) ) , j = 1 , 2 , , 100 , are plotted in Figure 5. It shows Err cv actually decreasing as the true error Err increases. The unfortunate implication is that Err cv is not estimating the true error, but only its expectation Err ( u ) (15). This is not a particular failing of cross-validation. It is habitually observed for all prediction error estimates—see for instance Figure 9 of [4]—though the phenomenon seems unexplained in the literature.
Cross-validation tends to pay for its low bias with high variability. Efron [1] proposed bootstrap estimates of prediction error intended to decrease variability without adding much bias. Among the several proposals the most promising was the 632 rule. (An improved version 632+ was introduced in Efron and Tibshirani [6], designed for reduced bias in overfit situations where err (19) equals zero. The calculations here use only 632.) The 632 rule is described as follows:
  • Nonparametric bootstrap samples d * are formed by drawing n pairs ( x i , y i ) with replacement from the original data set d (10).
  • Applying the original algorithm A to d * gives prediction rule f ( x , d * ) and predictions
    $\hat{\mu}(x)^* = f(x, d^*), \qquad (50)$
    as in (12).
  • B bootstrap data sets
    $d^*(1), d^*(2), \ldots, d^*(B) \qquad (51)$
    are independently drawn, giving predictions
    $\hat{\mu}_{ij}^{*} = f\big(x_i, d^*(j)\big), \qquad (52)$
    for i = 1 , 2 , , n and j = 1 , 2 , , B .
  • Two numbers are recorded for each choice of i and j, the error of μ ^ i j * as a prediction of y i ,
    $Q_{ij}^{*} = Q\big(y_i, \hat{\mu}_{ij}^{*}\big) \qquad (53)$
    and
    $N_{ij}^{*} = \#\big\{(x_i, y_i) \in d^*(j)\big\}, \qquad (54)$
    the number of times ( x i , y i ) occurs in d * ( j ) .
  • The zero bootstrap  Err ^ 0 is calculated as the average value of Q i j * for those cases having N i j = 0 ,
    $\widehat{\mathrm{Err}}_0 = \sum_{(i,j):\,N_{ij}=0} Q_{ij}^{*} \Big/ \sum_{(i,j):\,N_{ij}=0} 1. \qquad (55)$
  • Finally, the 632 estimate of prediction error is defined to be
    $\widehat{\mathrm{Err}}_{632} = 0.632\cdot\widehat{\mathrm{Err}}_0 + 0.368\cdot\mathrm{err}, \qquad (56)$
    err being the apparent error rate (19).
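The steps above translate almost directly into R. In this sketch fit.fun, pred.fun, and Q are generic placeholders (our names, not the article's) for the fitting algorithm, its prediction method, and a vectorized loss function.

```r
# The 632 rule (55)-(56) for a generic fit/predict pair and vectorized loss Q
err632 <- function(X, y, fit.fun, pred.fun, Q, B = 400) {
  n   <- length(y)
  err <- mean(Q(y, pred.fun(fit.fun(X, y), X)))        # apparent error (19)
  Qstar <- matrix(NA, n, B); Nstar <- matrix(0L, n, B)
  for (j in 1:B) {
    idx  <- sample(n, n, replace = TRUE)               # bootstrap data set d*(j)
    rule <- fit.fun(X[idx, , drop = FALSE], y[idx])
    Qstar[, j] <- Q(y, pred.fun(rule, X))              # Q*_{ij}, as in (53)
    Nstar[, j] <- tabulate(idx, nbins = n)             # N*_{ij}, as in (54)
  }
  Err0 <- mean(Qstar[Nstar == 0])                      # zero bootstrap (55)
  0.632 * Err0 + 0.368 * err                           # the 632 estimate (56)
}
```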
Err ^ 632 was calculated for the same 100 simulated response vectors y j (45) used for Err ^ cv (each using B = 400 replications), the 100 simulations giving
$\widehat{\mathrm{Err}}_{632} \approx 1.15 \pm 0.10, \qquad (57)$
an improvement on Err ^ cv 1.16 ± 0.14 at (49). This is in line with the 24 sampling experiments reported in [6]. (There, rule 632 + was used, with loss function counting error (27) rather than binomial deviance. Q ( y , μ ^ ) is discontinuous for counting error, which works to the advantage of 632 rules.)
Table 2 concerns the rationale for the 632 rule. The n × B = 80,000 values Q i j * for the first of the 100 simulated y vectors were averaged according to how many times ( x i , y i ) appeared in d * ( j ) ,
$\widehat{\mathrm{Err}}_k = \sum_{(i,j):\,N_{ij}=k} Q_{ij}^{*} \Big/ \sum_{(i,j):\,N_{ij}=k} 1, \qquad (58)$
for k = 0 , 1 , 2 , 3 , 4 , 5 . Not surprisingly, Err ^ k decreases with increasing k. Err ^ 0 (55), which would seem to be the bootstrap analogue of Err ^ cv , is seen to exceed the true error Err , while err is below Err . The intermediate linear combination 0.632 Err ^ 0 + 0.368 err is motivated in Section 6 of [1], though in fact the argument is more heuristic than compelling. The 632 rules do usually reduce variability of the error estimates compared to cross-validation, but bias can be a problem.
The 632 rule recomputes a prediction algorithm by nonparametric bootstrap resampling of the original data. Random forests [7], a widely popular prediction algorithm, carries this further: the algorithm itself as well as estimates of its accuracy depend on bootstrap resampling calculations.
Regression trees are the essential component of random forests. Figure 6 shows one such tree (constructed using rpart, the R version of CART [8]; Chapters 9 and 15 of [9] describe CART and random forests) as applied to y 1 , the first of the 100 response vectors for the transplant data simulation (44); y 1 consists of 57 0 s and 143 1 s, average value y ¯ = 0.72 . The tree-making algorithm performs successive splits of the data, hoping to partition it into bins that are mostly 0 s or 1 s. The bin at the far right—comprising n = 29 cases having low body mass index, low age, and female gender—has just one 0 and 28 1 s, for an average of 0.97. For a new transplant case ( x , y ) , with only x observable, we could follow the splits down the tree and use the terminal bin average as a quantitative prediction of y.
Random forests improves the predictive accuracy of any one tree by bagging (“bootstrap aggregation”), sometimes also called bootstrap smoothing: B bootstrap data sets d * ( 1 ) , d * ( 2 ) , , d * ( B ) are drawn at random (51), each one generating a tree such as that in Figure 6. (Some additional variability is added to the tree-building process: only a random subset of the p predictors is deemed eligible at each splitting point.) A new x is followed down each of the B trees, with the random forest prediction being the average of x’s B terminal values. Letting μ ^ i j = f ( x i , d * ( j ) ) , the prediction at x i for the tree based on d * ( j ) , the random forest prediction at x i is
$\hat{\mu}_i = \sum_{j=1}^{B} \hat{\mu}_{ij} \big/ B. \qquad (59)$
Figure 6. Regression tree for transplant data, simulation 1.
The predictive accuracy for μ ^ i uses a device such as that for Err ^ 0 (55): Let μ ^ ( i ) be the average value of μ ^ i j for bootstrap samples d j not containing ( x i , y i ) ,
$\hat{\mu}_{(i)} = \sum_{j:\,N_{ij}=0} \hat{\mu}_{ij} \Big/ \sum_{j:\,N_{ij}=0} 1, \qquad (60)$
called the “out-of-bag” (oob) estimate of μ i . The oob error estimate is then
$\widehat{\mathrm{Err}}_{\mathrm{oob}}(x_i) = Q\big(y_i, \hat{\mu}_{(i)}\big). \qquad (61)$
The overall oob estimate of prediction error is
$\widehat{\mathrm{Err}}_{\mathrm{oob}} = \frac{1}{n}\sum_{i=1}^{n}\widehat{\mathrm{Err}}_{\mathrm{oob}}(x_i). \qquad (62)$
(Notice that the leave-out calculations here are for the estimates μ ^ i , while those for Err ^ 632 (56) are for the errors Q i j * .) Calculated for the 100 simulated response vectors y j (45), this gave
$\widehat{\mathrm{Err}}_{\mathrm{oob}} \approx 1.12 \pm 0.07, \qquad (63)$
a better match to the true error Err than either Err ^ cv (49) or Err ^ 632 (57). In fact the actual match was even better than (63) suggests, as shown in Table 3 of Section 4. This is all the more surprising given that, unlike Err ^ cv and Err ^ 632 , Err ^ oob is fully nonparametric: it makes no use of the logistic regression model glm ( y X , binomial ) (46), which was involved in generating the simulated response vectors y j (44) (It has to be added that Err ^ oob is not an estimate for the prediction error of the logistic regression model μ ^ glm ( x ) (46), but rather for the random forest estimates μ ^ rf ( x ) ).

4. Covariance Penalties and Degrees of Freedom

A quite different approach to the prediction problem was initiated by Mallows’ C p formula [10]. An observed n-dimensional vector y is assumed to follow the homoskedastic model
$y = \mu + \epsilon \quad \text{with} \quad \epsilon \sim (0, \sigma^2 I), \qquad (64)$
the notation indicating uncorrelated errors of mean 0 and variance σ 2 as in (17); σ 2 is known. A linear rule
$\hat{\mu} = M y \qquad (65)$
is used to estimate μ with M a fixed and known matrix. How accurate is μ ^ as a predictor of future observations?
The apparent error
$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu}_i)^2 \qquad (66)$
is likely to underestimate the true error of μ ^ given a hypothetical new observation vector y * = μ + ϵ * independent of y ,
$\mathrm{Err} = \mathrm{Err}(\mu, \hat{\mu}) = \frac{1}{n}E_*\sum_{i=1}^{n}(y_i^* - \hat{\mu}_i)^2, \qquad (67)$
the E * notation indicating that μ ^ = ( μ ^ 1 , μ ^ 2 , , μ ^ n ) is fixed in (67). Mallows’ C p formula says that
$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2\sigma^2}{n}\,\mathrm{trace}(M) \qquad (68)$
is an unbiased estimator of Err ,
$E\big\{\widehat{\mathrm{Err}}_{\mathrm{cp}}\big\} = E\big\{\mathrm{Err}(\mu, \hat{\mu})\big\}; \qquad (69)$
that is, Err ^ cp is not unbiased for Err ( μ , μ ^ ) but is unbiased for E { Err ( μ , μ ^ ) } ( μ fixed in the expectation) under model (64)–(65).
One might wonder what has happened to the covariates $x_i$ in $d$ (10)? The answer is that they are still there but no longer considered random; rather, they are treated as fixed ancillary quantities such as the sample size $n$. In the OLS model $y = X\beta + \epsilon$ for Figure 2, (16)–(18), the covariates $x = (x_1, x_2, \ldots, x_n)$ determine $X$, $\hat{\beta} = (X'X)^{-1}X'y$ and
$\hat{\mu} = M y, \qquad M = X(X'X)^{-1}X', \qquad (70)$
Formula (65). We could, but do not, write M as M X .
Mallows’ $C_p$ Formula (68) can be extended to the Q class of error measures, Figure 3. An unknown probability model $\mathbf{f}$ (notice that $\mathbf{f}$ is not the same as $f$ in (12)) is assumed to have produced $y$ and its true mean vector $\mu$,
$\mu = E_{\mathbf{f}}\{y\}; \qquad (71)$
an estimate μ ^ = m ( y ) has been calculated using some algorithm m ( · ) ,
$\mathbf{f} \longrightarrow y \longrightarrow \hat{\mu} = m(y); \qquad (72)$
the apparent error err (19) and true error Err ( μ , μ ^ ) (33) are defined as before,
$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n} Q(y_i, \hat{\mu}_i) \quad \text{and} \quad \mathrm{Err} = \frac{1}{n}E_*\sum_{i=1}^{n} Q(y_i^*, \hat{\mu}_i), \qquad (73)$
with the $\hat{\mu}_i$ fixed and $\mathbf{f} \longrightarrow y^* = (y_1^*, y_2^*, \ldots, y_n^*)$ independently of $y$. Lemma 1 for the true error (36) still applies,
$\mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n}\big[Q(\mu_i, \hat{\mu}_i) + q(\mu_i) - c(x_i)\big]. \qquad (74)$
The Q-class version of Mallows’ Formula (68) is derived as Optimism Theorem 1 in Section 3 of [4]:
Theorem 1.
Define
$a_{mi} = -\dot{q}(\hat{\mu}_i)/2, \qquad \dot{q}(\mu) = \frac{d}{d\mu}q(\mu). \qquad (75)$
Then
$\widehat{\mathrm{Err}} = \mathrm{err} + \frac{2}{n}\sum_{i=1}^{n}\mathrm{cov}_{\mathbf{f}}\big(a_{mi}, y_i\big), \qquad (76)$
where cov f indicates covariance under model (72), is an unbiased estimate of Err in the same sense as (69),
$E_{\mathbf{f}}\big\{\widehat{\mathrm{Err}}\big\} = E_{\mathbf{f}}\{\mathrm{Err}\}. \qquad (77)$
The covariance terms in (76) measure how much each y i affects its own estimate. They sum to a covariance penalty that must be added to the apparent error to account for the fitting process. If Q ( μ , μ ^ ) is binomial deviance then
$a_{mi} = \log\big[\hat{\mu}_i/(1-\hat{\mu}_i)\big], \qquad (78)$
the logistic parameter; the theorem still applies as stated whether or not μ ^ = m ( y ) is logistic regression.
Err ^ as stated in (76) is not directly usable since the covariance terms cov f ( a m i , y i ) are not observable statistics. This is where resampling comes in.
Suppose that the observed data $y$ provides an estimate $\hat{\mathbf{f}}$ of $\mathbf{f}$. For instance in a normal regression model $y \sim N(\mu, \sigma^2 I)$ we could take $\hat{\mathbf{f}}$ to be $y^* \sim N(\hat{\mu}, \sigma^2 I)$ for some estimate $\hat{\mu}$. We replace (72) with the parametric bootstrap model
$\hat{\mathbf{f}} \longrightarrow y^* \longrightarrow \hat{\mu}^* = m(y^*) \qquad (79)$
and generate B independent replications y * ( 1 ) , y * ( 2 ) , , y * ( B ) , from which are calculated B pairs,
$\big(y_i^{*}(j),\ a_{mi}^{*}(j)\big), \qquad j = 1, 2, \ldots, B, \qquad (80)$
for i = 1 , 2 , , n as in (75). The covariances in (76) can then be estimated as
$\widehat{\mathrm{cov}}\big(a_{mi}, y_i\big) = \frac{1}{B}\sum_{j=1}^{B}\big[a_{mi}^{*}(j) - a_{mi}^{*}(\cdot)\big]\big[y_i^{*}(j) - y_i^{*}(\cdot)\big], \qquad (81)$
$a_{mi}^{*}(\cdot) = \sum_j a_{mi}^{*}(j)/B$ and $y_i^{*}(\cdot) = \sum_j y_i^{*}(j)/B$, yielding a usable version of (76),
$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2}{n}\sum_{i=1}^{n}\widehat{\mathrm{cov}}\big(a_{mi}, y_i\big), \qquad (82)$
“cp” standing for “covariance penalty”.
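A sketch of (79)–(82) for a logistic regression. X and y are placeholders for a predictor matrix and a 0/1 response, the truncation to [0.005, 0.995] follows the earlier footnote, and cov() uses the usual B - 1 denominator rather than the B in (81).

```r
# Parametric-bootstrap covariance penalty (79)-(82) for logistic regression
dev    <- function(y, m) -2 * (y * log(m) + (1 - y) * log(1 - m))
trunc1 <- function(m) pmin(pmax(m, 0.005), 0.995)
fit    <- glm(y ~ X, family = binomial)
mu.hat <- trunc1(fitted(fit))
err    <- mean(dev(y, mu.hat))                            # apparent error (19)
n <- length(y); B <- 400
ystar  <- matrix(rbinom(n * B, 1, mu.hat), n, B)          # parametric resamples (79)
astar  <- apply(ystar, 2, function(ys)                    # a_mi* = logit(mu.hat*) (78)
  qlogis(trunc1(fitted(glm(ys ~ X, family = binomial)))))
cov.hat <- sapply(1:n, function(i) cov(astar[i, ], ystar[i, ]))  # (81)
Err.cp  <- err + 2 * sum(cov.hat) / n                     # covariance-penalty estimate (82)
```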
Table 3 compares the performances of cross-validation, covariance penalties, the 632 rule, and the random forest out-of-bag estimates on the 100 transplant data simulations. The results are given in terms of total prediction error, rather than average error as in (47). The bottom line shows their root-mean-square differences from true error Err ,
$\mathrm{rms} = \Big[\sum_{i=1}^{100}\big(\widehat{\mathrm{Err}}_i - \mathrm{Err}_i\big)^2 \big/ 100\Big]^{1/2}. \qquad (83)$
Err ^ cv is highest, with Err ^ cp , Err ^ 632 , and Err ^ oob respectively 71%, 83%, and 65% as large.
Table 3. Transplant data simulation experiment. Top: First 5 of 100 estimates of total prediction error using cross-validation, covariance penalties (cp), the 632 rule, the out-of-bag random forest results, and the apparent error; degrees of freedom (df) estimate from cp. Bottom: Mean of the 100 simulations, their standard deviations, correlations with Errtrue, and root mean square differences from Errtrue. Cp and 632 rules used B = 400 bootstrap replications per simulation.
	Errtrue	ErrCv	ErrCp	Err632	Erroob	err	df
	194	216.5	232.9	260.6	240.3	180.4	26.2
	199	245.8	220.9	241.0	230.9	173.8	23.5
	192	221.2	225.9	247.6	232.7	180.6	22.7
	234	197.4	200.9	215.2	219.7	162.2	19.4
	302	175.7	178.3	187.1	211.2	144.6	16.9
	Errtrue	ErrCv	ErrCp	Err632	Erroob	err	df
mean	214.0	231.1	216.1	229.4	223.8	169.9	23.1
stdev	23.0	28.7	15.8	19.0	14.6	13.5	4.0
cor.true		−0.58	−0.61	−0.64	−0.26	0.41	0.52
rms		48.8	34.8	40.7	31.6	54.2	
For the jth simulation vector $y_j$ (45) the parametric bootstrap replications (79) were generated as follows: the logistic regression estimate $\hat{\mu}_j$ was calculated from (46), as the fitted values of $\mathrm{glm}(y_j \sim X, \mathrm{binomial})$; the bootstrap replications $y_j^*$ had independent Bernoulli components
$y_{ij}^{*} \overset{\mathrm{ind}}{\sim} \mathrm{Bi}(1, \hat{\mu}_{ij}) \quad \text{for } i = 1, 2, \ldots, 200. \qquad (84)$
B = 400 independent replications y j * were generated for each j = 1 , 2 , , 100 , giving Err ^ cp according to (81)–(82).
Because the resamples were generated by a model of the same form as that which originally gave the y j ’s (43)–(44), Err ^ cp is unbiased for Err . In practice our resampling model f ^ in (79) might not match f in (72), causing Err ^ cp to be downwardly biased (and raising its rms). Err ^ cv ’s nonparametric resamples make it nearly unbiased for the unconditional error rate Err ( u ) (15) irrespective of the true model, accounting for the overwhelming popularity of cross-validation in applications.

4.1. A Few Comments

  • In computing $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ it is not necessary for $\hat{\mathbf{f}}$ in (79) to be constructed from the original estimate $\hat{\mu}$. We might base $\hat{\mathbf{f}}$ on a bigger model than that which led to the choice of $m(y)$ for $\hat{\mu}$; in the little example of Figure 2, for instance, we could take $\hat{\mathbf{f}}$ to be $N(\hat{\mu}^{(6)}, \sigma^2 I)$, where $\hat{\mu}^{(6)}$ is the OLS sixth-degree polynomial fit, while still taking $\hat{\mu} = m(y)$ to be fourth degree as in (18). This reduces possible model-fitting bias in $\widehat{\mathrm{Err}}_{\mathrm{cp}}$, while increasing its variability.
  • A major conceptual difference between Err ^ cv and Err ^ cp concerns the role of the covariates x = ( x 1 , x 2 , , x n ) , considered as random in model (10) but fixed in (72). Classical regression problems have usually been analyzed in a fixed- x framework for three reasons:
    1.
    mathematical tractability;
    2.
    not having to specify x ’s distribution;
    3.
    inferential relevance.
    The reasons come together in the classic covariance formula for the linear model y = X β + ϵ ,
    $\mathrm{cov}(\hat{\beta}) = \sigma^2\big(X'X\big)^{-1}. \qquad (85)$
    A wider dispersion of the $x_i$’s in Figure 2 would make $\hat{\beta}$ more accurate, and conversely.
  • It can be argued that because Err ^ cp is estimating the conditional error given x it is more relevant to what is being estimated. See [11,12] for a lot more on this question.
  • On the other hand, “fixed- x ” methods such as Mallows’ C p can be faulted for missing some of the variability contributing to the unconditional prediction error Err ( u ) (15). Let Err ( x , d ) be the conditional error given d for predicting y given x,
    $\mathrm{Err}(x, d) = Q\big(\mu(x), f(x, d)\big) + q\big(\mu(x)\big) - c(x) \qquad (86)$
    according to Lemma 1 (34). Then
    $\mathrm{Err}^{(u)} = E\Big\{\int_{\mathcal{X}} g(x)\,\mathrm{Err}(x, d)\,dx\Big\}, \qquad (87)$
    where g ( x ) is the marginal density of x and E indicates expectation over the choice of the training data d .
  • In the fixed-$x$ framework of (36), Err replaces the integrand in (87) with its $x$ average $\sum_{i=1}^{n}\mathrm{Err}(x_i, d)/n$. We expect
    $\mathrm{Err}^{(u)} > E_{\mathbf{f}}\{\mathrm{Err}\} \qquad (88)$
    since $\mathrm{Err}(x, d)$ typically increases for values of $x$ farther away from $\mathbf{x}$. Rosset and Tibshirani [12] give an explicit formula for the difference in the case of normal-theory ordinary least squares when they show that the factor 2 for the penalty term in Mallows’ $C_p$ formula should be increased to $2 + (p+1)/(n-p-1)$. Cross-validation effectively estimates Err ( u ) while $C_p$ estimates the fixed-$x$ version of $E\{\mathrm{Err}\}$.
  • With (88) in mind, Err ^ cp and Err ^ cv are often contrasted as
    insample versus outsample .
    This is dangerous terminology if it’s taken to mean that $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ applies to prediction errors $\mathrm{Err}(x, d)$ at specific points $x$ outside of $\mathbf{x}$. In Figure 2 for instance it seems likely that $\mathrm{Err}(11, d)$ exceeds $\widehat{\mathrm{Err}}_{\mathrm{cv}}$, but this is a fixed-$x$ question and beyond the reach of the random-$x$ assumptions underlying (88). See the discussion of Figure 8 in Section 5.
  • The sad story told in Figure 5 shows Err ^ cv negatively correlated with the true Err . The same is the case for Err ^ cp and Err ^ 632 , as can be seen from the negative correlations in the cor.true row of Table 3. Err ^ oob is also negatively correlated with Err , but less so. In terms of rms, the bottom row shows that the fully nonparametric Err ^ oob estimates beat even the parametric Err ^ cp ones.
  • Figure 7 compares the 200 μ ^ i estimates from the logistic regression estimates (46) with those from random forests, for y the first of the 100 transplant simulations. Random forests is seen to better separate the y i = 0 from the y i = 1 cases. Err ^ oob relates to error prediction for random forest estimates, not for logistic regression, but this does not explain how Err ^ oob could provide excellent estimates of the true error Err –which in fact were based on the logistic regression model (43)–(44). If this is a fluke it’s an intriguing one.
  • There is one special case where the covariance penalty formula (76) can be unbiasedly estimated without recourse to resampling: if $\mathbf{f}$ is the normal model $y \sim N(\mu, \sigma^2 I)$, and $Q(y_i, \mu_i) = (y_i - \mu_i)^2$ (so $a_{mi} = \hat{\mu}_i$), then Stein’s unbiased risk estimate (SURE) is defined to be
    $\widehat{\mathrm{Err}}_{\mathrm{SURE}} = \mathrm{err} + \frac{2\sigma^2}{n}\sum_{i=1}^{n}\frac{\partial\hat{\mu}_i}{\partial y_i}, \qquad (90)$
    where the partial derivatives are calculated directly from the functional form of $\hat{\mu} = m(y)$. Section 2 of [4] gives an example comparing $\widehat{\mathrm{Err}}_{\mathrm{SURE}}$ with $\widehat{\mathrm{Err}}_{\mathrm{cp}}$. Each term $\partial\hat{\mu}_i/\partial y_i$ measures the influence of $y_i$ on its own estimate.

4.2. Degrees of Freedom

The OLS model $y = X\beta + \epsilon$ yields the familiar estimate $\hat{\mu} = M y$ of $\mu = E\{y\}$, where $M$ is the projection matrix $X(X'X)^{-1}X'$ as in (70); $M$ has
$\mathrm{trace}(M) = \mathrm{trace}\big((X'X)^{-1}X'X\big) = p, \qquad (91)$
where $p = \mathrm{rank}(X)$. Mallows’ $C_p$ formula $\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + (2\sigma^2/n)\,\mathrm{trace}(M)$ becomes
$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2\sigma^2}{n}\,p \qquad (92)$
in this case. In other words, the covariance penalty that must be added to the apparent error err is directly proportional to p, the degrees of freedom of the OLS model.
Suppose that μ ^ = M y with matrix M not necessarily a projection. It has become common to define μ ^ ’s degrees of freedom as
$\mathrm{df} = \text{degrees of freedom} = \mathrm{trace}(M), \qquad (93)$
trace ( M ) playing the role of p in the C p Formula (92). In this way, trace ( M ) becomes a lingua franca for comparing linear estimators μ ^ = M y of different forms. (The referee points out that formulas such as (92) are more often used for model selection rather than error rate prediction. Zhang and Yang [13] consider model selection applications, as does Remark B of [4].)
A nice example is the ridge regression estimator
$\hat{\mu}_{\gamma} = X\big(X'X + \gamma I\big)^{-1}X'y, \qquad (94)$
γ a fixed non-negative constant; μ ^ 0 is the usual OLS estimator while μ ^ γ “shrinks” μ ^ toward 0 , more so as γ increases. Some linear algebra gives the degrees of freedom for (94) as
$\mathrm{df}_{\gamma} = \sum_{i=1}^{p}\frac{e_i}{e_i + \gamma}, \qquad (95)$
where the $e_i$ are the eigenvalues of $X'X$.
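A sketch verifying (95) against the trace of the ridge "hat" matrix; the test matrix here is arbitrary.

```r
# Ridge regression degrees of freedom (95): eigenvalue formula vs. trace of the hat matrix
df.ridge <- function(X, gamma) {
  e <- eigen(crossprod(X), symmetric = TRUE)$values   # eigenvalues of X'X
  sum(e / (e + gamma))
}
tr.ridge <- function(X, gamma) {
  M <- X %*% solve(crossprod(X) + gamma * diag(ncol(X)), t(X))  # X(X'X + gamma I)^{-1} X'
  sum(diag(M))
}
X <- matrix(rnorm(100 * 5), 100, 5)                   # arbitrary test design
c(df.ridge(X, 2), tr.ridge(X, 2))                     # the two agree
```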
The generalization of Mallows’ Formula (68) in Theorem 1 (76) has penalty term
$\frac{2}{n}\sum_{i=1}^{n}\mathrm{cov}_{\mathbf{f}}\big(a_{mi}, y_i\big), \qquad (96)$
which again measures the self-influence of each y i on its own estimate μ ^ i . The choice of
$q(\mu) = \mu(1 - \mu) \qquad (97)$
in Figure 3 results in $Q(\mu, \hat{\mu}) = (\hat{\mu} - \mu)^2$, squared error, and $a_{mi} = \hat{\mu}_i - 1/2$, in which case $\mathrm{cov}_{\mathbf{f}}(a_{mi}, y_i)$ equals $\sigma^2 m_{ii}$, and (96) becomes $(2\sigma^2/n)\,\mathrm{trace}(M)$, as in Mallows’ Formula (68). This suggests using
$\mathrm{Cov} = \sum_{i=1}^{n}\mathrm{cov}_{\mathbf{f}}\big(a_{mi}, y_i\big) \qquad (98)$
as a measure of degrees of freedom (or its estimate Cov ^ from (81)) for a general estimator μ ^ = m ( y ) .
Some support comes from the following special situation: suppose y is obtained from a p-parameter generalized linear model, with prediction error measured by the appropriate deviance function. (Binomial deviance for logistic regression as in the transplant example.) Theorem 2 of [14], Section 6, then gives the asymptotic approximation
$\mathrm{Cov} \doteq p, \qquad (99)$
as in (91)–(92), the intuitively correct answer.
Approximation (99) leads directly to Akaike’s information criterion (AIC). In a generalized linear model, the total deviance from the MLE μ ^ is
$n\cdot\mathrm{err} = \sum_{i=1}^{n} Q(y_i, \hat{\mu}_i) = 2\sum_{i=1}^{n}\log\big[g_{y_i}(y_i)\big/g_{\hat{\mu}_i}(y_i)\big], \qquad (100)$
g μ ^ i ( y i ) denoting the density function ([2], Hoeffding’s Lemma). Suppose we have glm’s of different sizes p we wish to compare. Minimizing err + ( 2 / n ) Cov over the choice of model is then equivalent to maximizing the total log likelihood log g μ ^ ( y ) minus a dimensionality penalty,
$\log g_{\hat{\mu}}(y) - \mathrm{Cov} \doteq \log g_{\hat{\mu}}(y) - p, \qquad (101)$
which is the AIC.
Approximation (99) is not razor-sharp: $p = 17$ in the transplant simulation logistic regression, but the 100 estimates $\widehat{\mathrm{Cov}}$ averaged 23.08 with standard error 0.40. Degrees of freedom play a crucial role in model selection algorithms. Resampling methods allow us to assess Cov (98) even for very complicated fitting algorithms $\hat{\mu} = m(y)$.

5. Conformal Inference

If there is a challenger to cross-validation for “oldest resampling method” it is permutation testing, going back to Fisher in the 1930s. The newest prediction error technique, conformal inference, turns out to have permutation roots, as briefly reviewed next.
A clinical trial of an experimental drug has yielded independent real-valued responses for control and treatment groups:
$\text{Control: } u = (u_1, \ldots, u_n) \quad \text{and} \quad \text{Treatment: } v = (v_1, \ldots, v_m). \qquad (102)$
Student’s t-test could be used to see if the new drug was giving genuinely larger responses but Fisher, reacting to criticism of normality assumptions, proposed what we would now call a nonparametric two-sample test.
Let z be the combined data set,
$z = (u, v) = (z_1, \ldots, z_{n+m}), \qquad (103)$
and choose some score function S ( z ) that contrasts the last m z-values with the first n, for example the difference of means,
$S(z) = \sum_{i=n+1}^{n+m} z_i\big/m - \sum_{i=1}^{n} z_i\big/n. \qquad (104)$
Define S as the set of scores for all permutations of z ,
$\mathcal{S} = \big\{S(z^*)\big\}, \qquad (105)$
z * ranging over the ( m + n ) ! permutations.
The permutation p-value for the treatment’s efficacy in producing larger responses is defined to be
$p = \#\big\{S(z^*) \ge S(z)\big\}\big/(m+n)!, \qquad (106)$
the proportion of S having scores S ( z * ) exceeding the observed score S ( z ) . Fisher’s key idea was that if in fact all the observations came from the same distribution F,
$z_i \overset{\mathrm{ind}}{\sim} F \qquad (i = 1, 2, \ldots, m+n) \qquad (107)$
(implying that Treatment is the same as Control), then all $(m+n)!$ permutations would be equally likely. Rejecting the null hypothesis of No Treatment Effect if $p \le \alpha$ has null probability (nearly) $\alpha$.
Usually ( m + n ) ! is too many for practical use. This is where the sampling part of resampling comes in. Instead of all possible permutations, a randomly drawn subset of B of them, perhaps B = 1000 , is selected for scoring, giving an estimated permutation p-value
$\hat{p} = \#\big\{S(z^*) \ge S(z)\big\}\big/B. \qquad (108)$
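A sketch of the randomly sampled permutation p-value (108); u, v, and the mean-difference score (104) are as in the text.

```r
# Estimated permutation p-value (108) from B random permutations of the combined data
perm.pvalue <- function(u, v, B = 1000) {
  z <- c(u, v); n <- length(u); m <- length(v)
  S <- function(zz) mean(zz[(n + 1):(n + m)]) - mean(zz[1:n])   # score function (104)
  S.obs  <- S(z)
  S.star <- replicate(B, S(sample(z)))                          # scores of random permutations
  mean(S.star >= S.obs)                                         # proportion exceeding observed
}
```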
In 1963, Hodges and Lehmann considered an extension of the null hypothesis (107) to cover location shifts; in terms of cumulative distribution functions (cdfs), they assumed
$u_i \overset{\mathrm{iid}}{\sim} F(u) \quad \text{and} \quad v_i \overset{\mathrm{iid}}{\sim} F(v - \Delta), \qquad (109)$
where Δ is a fixed but unknown constant that translates the v’s distribution by Δ units to the right of the u’s.
For a given trial value of Δ let
$z(\Delta) = (u_1, \ldots, u_n, v_1 - \Delta, \ldots, v_m - \Delta) \qquad (110)$
and compute its permutation p-value
$\hat{p}(\Delta) = \#\big\{S\big(z^*(\Delta)\big) \ge S\big(z(\Delta)\big)\big\}\big/B. \qquad (111)$
A 0.95 two-sided nonparametric confidence interval for Δ is then
$\big\{\Delta : 0.025 \le \hat{p}(\Delta) \le 0.975\big\}. \qquad (112)$
The only assumption is that, for the true value of Δ , the m + n components of z ( Δ ) are i.i.d., or more generally exchangeable, from some distribution F.
Vovk has proposed an ingenious extension of this argument applying to prediction error estimation, a much cited reference being [15]; see also [16]. Returning to the statement of the prediction problem at the beginning of Section 2, d = { ( x i , y i ) , i = 1 , 2 , , n } is the observed data and ( x 0 , y 0 ) a new (predictor, response) pair, all n + 1 pairs assumed to be random draws from the same distribution F,
$(x_i, y_i) \overset{\mathrm{iid}}{\sim} F \quad \text{for } i = 0, 1, \ldots, n; \qquad (113)$
x 0 is observed but not y 0 , and it is desired to predict y 0 . Vovk’s proposal, conformal inference, produces an exact nonparametric distribution of the unseen y 0 .
Let Y 0 be a proposed trial value for y 0 , and define D as the data set d augmented with ( x 0 , Y 0 ) ,
$D = \big\{d, (x_0, Y_0)\big\}. \qquad (114)$
A prediction rule f ( x , D ) gives estimates
$\hat{\mu}_i = f(x_i, D) \quad \text{for } i = 0, 1, \ldots, n. \qquad (115)$
(It is required that f ( x , D ) be invariant under reordering of D ’s elements.)
For some score function s ( y , μ ^ ) let
$s_i = s(y_i, \hat{\mu}_i) \quad \text{for } i = 1, 2, \ldots, n, \qquad (116)$
where s ( y , μ ^ ) measures disagreement between y and μ ^ ( x ) , larger values of | s | indicating less conformity between observation and prediction. (More generally s i can be any function s ( y i , D ) .)
If the proposed trial value Y 0 were in fact the unobserved y 0 then s 1 , s 2 , , s n and s 0 = s ( Y 0 , μ ^ 0 ) would be exchangeable random variables, because of the i.i.d. assumption (113). Let s ( 1 ) , s ( 2 ) , , s ( n ) be the ordered values of s 1 , s 2 , , s n . Assuming no ties, the n values
$s_{(1)} < s_{(2)} < \cdots < s_{(n)} \qquad (117)$
partition the line into ( n + 1 ) intervals, the first and last of which are semi-infinite. Exchangeability implies that s 0 has probability 1 / ( n + 1 ) of falling into any one of the intervals.
A conformal interval for the unseen y 0 consists of those values of Y 0 for which s 0 = s ( Y 0 , μ ^ 0 ) “conforms” to the distribution (117). To be specific, for a chosen miscoverage level α , say 0.05, let I 0 and I 1 be integers approximately proportion α / 2 from the endpoints of 1 , 2 , , n ,
$I_0 = [n\alpha/2] \quad \text{and} \quad I_1 = [n(1 - \alpha/2)] + 1. \qquad (118)$
The conservative two-sided level 1 α conformal prediction interval C for y 0 is
$C = \big\{Y_0 : s(Y_0, D) \in [s_{(I_0)}, s_{(I_1)}]\big\}. \qquad (119)$
The argument is the same as for the Hodges–Lehmann interval (112), now with m = 1 and Δ = Y 0 .
Interval (119) is computationally expensive since all of the s i , not just s 0 , change with each choice of trial value Y 0 . The jackknife conformal interval begins with the jackknife estimates
$\hat{\mu}_{(i)} = f\big(x_i, d_{(i)}\big), \qquad i = 1, 2, \ldots, n, \qquad (120)$
where d ( i ) = { ( x j , y j ) , j i } , that is d (10) with ( x i , y i ) deleted. The scores s i (116) are taken to be
$s_i = s\big(y_i, \hat{\mu}_{(i)}\big), \qquad i = 1, 2, \ldots, n, \qquad (121)$
for some function $s$, for example $s_i = y_i - \hat{\mu}_{(i)}$. These are compared with
$s_0 = s(Y_0, \hat{\mu}_0), \qquad (122)$
$\hat{\mu}_0 = f(x_0, d)$, and $C$ is computed as at (119). Now the score distribution (117) does not depend on $Y_0$ (nor does $\hat{\mu}_0$), greatly reducing the computational burden.
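A sketch of the jackknife conformal scores (123) and the full-range interval (125) at x0 = 11, continuing from the polynomial data frame d built in the cross-validation sketch.

```r
# Jackknife conformal prediction at x0 = 11 for the Figure 2 example
x0  <- 11
mu0 <- predict(lm(y ~ poly(x, 4, raw = TRUE), data = d),
               newdata = data.frame(x = x0))              # mu.hat_0 = f(x0, d)
s <- sapply(seq_len(nrow(d)), function(i) {               # scores s_i = y_i - mu.hat_(i)
  fit.i <- lm(y ~ poly(x, 4, raw = TRUE), data = d[-i, ])
  d$y[i] - predict(fit.i, newdata = d[i, ])
})
c(mu0 + min(s), mu0 + max(s))   # full range (125), covering y0 with conformal prob. 19/21
```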
The jackknife conformal interval at x 0 = 11 was calculated for the small example of Figure 2 using
$s_i = y_i - \hat{\mu}_{(i)} \qquad (123)$
for i = 1 , 2 , , n = 20 . For this choice of scoring function, interval (119) is
$C = \big[\hat{\mu}_0 + s_{(I_0)},\ \hat{\mu}_0 + s_{(I_1)}\big]; \qquad (124)$
$y_0 \in C$ with conformal probability $(I_1 - I_0)/(n+1)$. The square dots in Figure 8 are the values $\hat{\mu}_0 + s_{(i)}$ for $i = 1, 2, \ldots, 20$, with $y_0$ having probability $1/21$ of falling into each of the 21 intervals. Conformal probability for the full range
$\big[\hat{\mu}_0 + s_{(1)},\ \hat{\mu}_0 + s_{(20)}\big] = [3.27, 10.99] \qquad (125)$
is 19 / 21 = 0.905 .

Some Comments

  • Even when $Y_0$ equals $y_0$, $s_0 = s(Y_0, \hat{\mu}_0)$ is not perfectly exchangeable with the $s_i = s(y_i, \hat{\mu}_{(i)})$ (121): each $\hat{\mu}_{(i)}$ is based on $n - 1$ other cases, while $\hat{\mu}_0$ is based on $n$. Other stand-ins for the full conformal intervals (119) are favored in the literature but these have their own disadvantages. Barber et al. [17] offer a version of the jackknife intervals, “jackknife+”, with more dependable inferential performance.
  • The jackknife scores $s_i$ (121) are also the one-at-a-time cross-validation scores if $s$ is taken to be the prediction loss $Q$ in (42). In this sense, conformal inference can be thought of as a more ambitious version of prediction error estimation, where we try to estimate the entire error distribution rather than just its expectation. The conformal point estimate $\bar{s} = \sum_{i=1}^{n} s_i/n$ is the same as $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ (42) if $s$ equals $Q$ (which is why “conformal” was not included in Table 3).
  • Figure 8 is misleading in an important sense: the 95% coverage claimed in (125) is a marginal inference following from the i.i.d. assumption ( x i , y i ) iid F for i = 0 , 1 , , n (113), and does not apply conditionally to the particular configuration of x’s and y’s seen in the figure. (See Remark 3 and Section 3 of [16].) The same complaint was leveled against cross-validation in Section 3—for estimating the unconditional error Err ( u ) rather than prediction error for the rule at hand—but conformal inference leans even harder on the i.i.d. assumption.
  • Classic parametric prediction intervals do apply conditionally. The normal-theory version of model (16) and (17) gives 95% interval
    $\hat{\mu}_0 \pm 1.96\,\sigma\sqrt{1 + \gamma_0} \qquad (126)$
    for response y 0 at x 0 = 11 , where
    $\gamma_0 = X(x_0)\big(X'X\big)^{-1}X(x_0)' \qquad (127)$
    in notation (18). (At $x_0 = 11$, (126) gives 95% prediction interval $[-15.6, 32.9]$, reflecting the hopelessly large extrapolation variability of the fourth-degree polynomial model; the standard deviation of $\hat{\mu}_0$ at $x_0 = 11$ is 12.22.) The nonparametric random sampling assumption (113) destroys the geometry seen in Figure 8.
  • Rather than s i = y i μ ^ ( i ) at (123), we might use scores
    $s_i = \big(y_i - \hat{\mu}_{(i)}\big)\big/\gamma_i, \qquad (128)$
    where γ i = γ ( x i ) is some measure of prediction difficulty at x i . Conformal inference continues to apply here since the s i are still exchangeable. Boström et al. [18] give several random forest examples, using out-of-bag estimates for the γ i .
  • Covariate shift estimation offers a more ambitious approach to broadening the reach of conformal prediction; see [19,20]. The underlying probability distribution F in (113) can be thought of in two stages, first choosing x according to say g ( x ) and then y given x according to p ( y x ) ,
    $F:\ x \sim g(x) \quad \text{and} \quad y \mid x \sim p(y \mid x). \qquad (129)$
    It is assumed that (129) holds in a training set, but in the test set where predictions are to be made g ( x ) is shifted to g test ( x ) ,
    $F_{\mathrm{test}}:\ x \sim g_{\mathrm{test}}(x) \quad \text{and} \quad y \mid x \sim p(y \mid x). \qquad (130)$
    With sufficiently large training and test sets available, the ratio g test ( x ) / g ( x ) can be estimated, allowing a suitably weighted version of conformal interval (124) to be constructed.
  • Conformal prediction is less appealing for dichotomous response data but can still be informative. Figure 9 shows its application to the transplant data of Section 3 and Section 4. The score function s is taken to be the deviance residual
    $s_i = \mathrm{sign}\big(y_i - \hat{\mu}_{(i)}\big)\,Q\big(y_i, \hat{\mu}_{(i)}\big)^{1/2}, \qquad (131)$
    $\hat{\mu}_{(i)}$ the jackknife logistic regression estimate (120) and $Q(y, \mu)$ binomial deviance (28). The left side of Figure 9 shows the histogram of the 200 $s_i$ values. Any given value of $\hat{\mu}_0$ corresponds to two possible values of $s_0 = \mathrm{sign}(y_0 - \hat{\mu}_0)\,Q(y_0, \hat{\mu}_0)^{1/2}$, for $y_0$ equal 0 or 1, and two values of the conformal p-value,
    $p(\hat{\mu}_0) = \#\{s_i \le s_0\}\big/201. \qquad (132)$
    The right side of Figure 9 graphs p ( μ ^ 0 ) as a function of μ ^ 0 for the two cases:
    $\hat{\mu}_0 < 0.315$ gives $\hat{p} \ge 0.975$ for $y_0 = 1$; $\quad \hat{\mu}_0 > 0.918$ gives $\hat{p} \le 0.025$ for $y_0 = 0$. $\qquad (133)$
    For $\hat{\mu}_0$ in $[0.315, 0.918]$, neither $y_0 = 0$ nor $y_0 = 1$ can be rejected at the 0.025 level.
Figure 9. (Left) Signed deviance residuals for transplant data. (Right) Attained conformal p-value given $\mu$; solid $y = 1$, dashed $y = 0$.
As far as prediction error is concerned, cross-validation and covariance penalties are established subjects backed up by a substantial theoretical and applied literature. Conformal prediction, as the new kid on the block, is still in its formative stage, with at least a promising hint of moving beyond complete reliance on the random sampling model (15). All three approaches rely on resampling methodology, very much in the spirit of statistical inference in the 2020s.

Funding

This research was funded in part by the National Science Foundation grant number DMS 1608182.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Efron, B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Am. Stat. Assoc. 1983, 78, 316–331.
  2. Efron, B.; Hastie, T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science; Institute of Mathematical Statistics Monographs (Book 5); Cambridge University Press: Cambridge, UK, 2016; p. 476.
  3. Efron, B.; Narasimhan, B. The automatic construction of bootstrap confidence intervals. J. Comput. Graph. Stat. 2020, 29, 608–619.
  4. Efron, B. The estimation of prediction error: Covariance penalties and cross-validation. J. Am. Stat. Assoc. 2004, 99, 619–642.
  5. Brègman, L.M. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. Ž. Vyčisl. Mat i Mat. Fiz. 1967, 7, 620–631.
  6. Efron, B.; Tibshirani, R. Improvements on cross-validation: The .632+ bootstrap method. J. Am. Stat. Assoc. 1997, 92, 548–560.
  7. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  8. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth Statistics/Probability Series; Wadsworth Advanced Books and Software: Belmont, CA, USA, 1984; p. x+358.
  9. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009; p. xxii+745.
  10. Mallows, C.L. Some Comments on Cp. Technometrics 1973, 15, 661–675.
  11. Bates, S.; Hastie, T.; Tibshirani, R. Cross-validation: What does it estimate and how well does it do it? arXiv 2021, arXiv:2104.00673v2.
  12. Rosset, S.; Tibshirani, R.J. From fixed-X to random-X regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. J. Am. Stat. Assoc. 2020, 115, 138–162.
  13. Zhang, Y.; Yang, Y. Cross-validation for selecting a model selection procedure. J. Econom. 2015, 187, 95–112.
  14. Efron, B. How biased is the apparent error rate of a prediction rule? J. Am. Stat. Assoc. 1986, 81, 461–470.
  15. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: New York, NY, USA, 2005; p. xvi+324.
  16. Lei, J.; G’Sell, M.; Rinaldo, A.; Tibshirani, R.J.; Wasserman, L. Distribution-free predictive inference for regression. J. Am. Stat. Assoc. 2018, 113, 1094–1111.
  17. Barber, R.F.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Predictive inference with the jackknife+. Ann. Stat. 2021, 49, 486–507.
  18. Boström, H.; Linusson, H.; Löfström, T.; Johansson, U. Accelerating difficulty estimation for conformal regression forests. Ann. Math. Artif. Intell. 2017, 81, 125–144.
  19. Tibshirani, R.J.; Foygel Barber, R.; Candès, E.J.; Ramdas, A. Conformal prediction under covariate shift. In Proceedings of the Advances in Neural Information Processing Systems 32 (NIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; pp. 2526–2536.
  20. Sugiyama, M.; Krauledat, M.; Müller, K.R. Covariate Shift Adaptation by Importance Weighted Cross Validation. J. Mach. Learn. Res. 2007, 8, 985–1005.
Figure 1. 2000 nonparametric bootstraps, law school correlation. 95% confidence limits: bootstrap (red), standard (black).
Figure 2. Points $(x_i, y_i)$, $i = 1, 2, \ldots, 20$, and fitted 4th-degree polynomial curve; light dotted curve is true mean.
Figure 3. The Q class of error measures.
Figure 5. CV estimate of prediction error versus true prediction error, transplant data.
Figure 7. (Left) Estimated probabilities, logistic regression simulation 1. (Right) Now for random forests, simulation 1. The increased random forests separation suggests increased degrees of freedom.
Figure 8. Square points: jackknife conformal predictions at $x = 11$ for the example in Figure 2; each interval has $\Pr = 1/21$.
Table 2. Mean of $Q(y, \hat{\mu})$ given the number of times a case $(x_i, y_i)$ occurred in the bootstrap sample; first simulation. True Error 0.97, apparent err 0.90.
# Times:	0	1	2	3	4	5
mean Q:	1.44	0.96	0.81	0.69	0.64	0.57
