1. Introduction
Modern prediction algorithms such as random forests and deep learning use training sets, often very large ones, to produce rules for predicting new responses from a set of available predictors. A second question—right after “How should the prediction rule be constructed?”—is “How accurate are the rule’s predictions?” Resampling methods have played a central role in the answer. This paper is intended to provide an overview of what are actually several different answers, while trying to keep technical complications to a minimum.
This is a Special Issue of STATS devoted to resampling, and before beginning on prediction rules it seems worthwhile to say something about the general effect of resampling methods on statistics and statisticians.
Table 1 shows the law school data [1], a small data set but one not completely atypical of its time. The table reports average scores of the 1973 entering class at 15 law schools on two criteria: undergraduate grade point average (GPA) and result on the "LSAT", a national achievement test. The observed Pearson correlation coefficient between GPA and LSAT score is
$$\hat{\theta} = 0.776. \qquad (1)$$
How accurate is $\hat{\theta}$?
Suppose that Dr. Jones, a 1940s statistician, was given the data in Table 1 and asked to attach a standard error to $\hat{\theta}$; let's say a nonparametric standard error, since a plot of (LSAT, GPA) looks definitely non-normal. At his disposal is the nonparametric delta method, which gives a first-order Taylor series approximation formula for the standard error of $\hat{\theta}$. For the Pearson correlation coefficient this turns out to be
$$\widehat{\mathrm{se}}_{\mathrm{delta}} = \frac{\hat{\theta}}{\sqrt{n}}\left[\frac{\hat{\mu}_{40}}{4\hat{\mu}_{20}^{2}} + \frac{\hat{\mu}_{04}}{4\hat{\mu}_{02}^{2}} + \frac{\hat{\mu}_{22}}{\hat{\mu}_{11}^{2}} + \frac{\hat{\mu}_{22}}{2\hat{\mu}_{20}\hat{\mu}_{02}} - \frac{\hat{\mu}_{31}}{\hat{\mu}_{11}\hat{\mu}_{20}} - \frac{\hat{\mu}_{13}}{\hat{\mu}_{11}\hat{\mu}_{02}}\right]^{1/2}, \qquad (2)$$
where $\hat{\mu}_{hk} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{h}(y_i - \bar{y})^{k}$, with $(x_i, y_i)$ the (GPA, LSAT) pair for school $i$ and the bars indicating averages.
Table 1. Average scores for admittees to 15 American law schools, 1973. GPA is undergraduate grade point average, LSAT the "law boards" score. Pearson correlation coefficient between GPA and LSAT is 0.776.
| GPA | LSAT |
|---|---|
| 3.39 | 576 |
| 3.30 | 635 |
| 2.81 | 558 |
| 3.03 | 578 |
| 3.44 | 666 |
| 3.07 | 580 |
| 3.00 | 555 |
| 3.43 | 661 |
| 3.36 | 651 |
| 3.13 | 605 |
| 3.12 | 653 |
| 2.74 | 575 |
| 2.76 | 545 |
| 2.88 | 572 |
| 2.96 | 594 |
Jones either looks up or derives (2), evaluates the six terms on his mechanical calculator, and reports his delta-method standard error estimate, after which he goes home with the feeling of a day well spent.
Jones' daughter, a 1960s statistician, has a much easier go of it. Now she does not have to look up or derive Formula (2). A more general resampling algorithm, the Tukey–Quenouille jackknife, is available and can be almost instantly evaluated on her university's mainframe computer. It gives her the answer $\widehat{\mathrm{se}}_{\mathrm{jack}}$ directly from the data.
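As a concrete illustration of the daughter's calculation, the following R sketch applies the standard jackknife standard error formula to the law school correlation; the data are typed from Table 1, and the formula used is the textbook jackknife expression, which may differ in presentation from the article's own display.

```r
# Law school data from Table 1: GPA and LSAT for 15 schools
gpa  <- c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43,
          3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96)
lsat <- c(576, 635, 558, 578, 666, 580, 555, 661,
          651, 605, 653, 575, 545, 572, 594)

theta.hat <- cor(gpa, lsat)          # Pearson correlation, 0.776
n <- length(gpa)

# Leave-one-out recomputations of the correlation
theta.jack <- sapply(1:n, function(i) cor(gpa[-i], lsat[-i]))

# Standard jackknife formula: se = sqrt((n-1)/n * sum((theta_(i) - mean)^2))
se.jack <- sqrt((n - 1) / n * sum((theta.jack - mean(theta.jack))^2))
c(theta.hat = theta.hat, se.jack = se.jack)
```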
Dr. Jones is envious of his daughter:
- 1.
She does not need to spend her time deriving arduous formulas such as (2).
- 2.
She is not restricted to traditional estimates such as $\hat{\theta}$ that have closed-form Taylor series expansions.
- 3.
Her university’s mainframe computer is a million times faster than his old Marchant calculator (though it is across the campus rather than on her desk).
If now, 60 years later, the Jones family is still in the statistics business they’ll have even more reason to be grateful for resampling methods. Faster, cheaper, and more convenient computation combined with more aggressive methodology have pushed the purview of resampling applications beyond the assignment of standard errors.
Figure 1 shows 2000 nonparametric bootstrap replications $\hat{\theta}^{*}$ from the law school data. (Each $\hat{\theta}^{*}$ is the correlation from a bootstrapped data matrix obtained by resampling the 15 rows of the $15 \times 2$ matrix in Table 1 15 times with replacement; see Chapter 11 of [2].) Their empirical standard deviation is the nonparametric bootstrap estimate of standard error for $\hat{\theta}$.
Two thousand is about ten times more replications than are needed for a standard error, but it is not too many for a bootstrap confidence interval. The arrowed segments in Figure 1 compare the standard approximate 95% confidence limits $\hat{\theta} \pm 1.96\,\widehat{\mathrm{se}}$ with the nonparametric bootstrap interval. (This is the bca interval, constructed using program bcajack from the CRAN package bcaboot [3]. Chapter 11 of [2] shows why bca's "second-order corrections", here very large, improve on the standard method.)
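A minimal sketch of the nonparametric bootstrap just described, reusing gpa and lsat from the previous block: resample the 15 rows with replacement, recompute the correlation B = 2000 times, and report the empirical standard deviation. The article's interval comes from bcajack in the bcaboot package; the percentile interval below is only a simpler stand-in and will not reproduce the bca endpoints.

```r
set.seed(1)
law <- cbind(gpa, lsat)              # 15 x 2 data matrix from Table 1
B <- 2000

# B bootstrap correlations: resample the 15 rows with replacement
theta.star <- replicate(B, {
  idx <- sample(1:nrow(law), replace = TRUE)
  cor(law[idx, 1], law[idx, 2])
})

se.boot <- sd(theta.star)            # bootstrap standard error estimate

# Standard interval and a simple percentile interval (not the bca interval)
standard   <- cor(gpa, lsat) + c(-1.96, 1.96) * se.boot
percentile <- quantile(theta.star, c(0.025, 0.975))
list(se.boot = se.boot, standard = standard, percentile = percentile)
```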
The standard method does better if we first make Fisher's z transformation
$$\hat{\phi} = \frac{1}{2}\log\frac{1 + \hat{\theta}}{1 - \hat{\theta}}, \qquad (8)$$
compute the standard interval on the $\phi$ scale, and transform the endpoints back to the $\theta$ scale. This gives a 0.95 interval not so different from the bootstrap interval (7), and at least not having its upper limit above 1.00! This is the kind of trick Dr. Jones would have known. Resampling, here in the form of the bca algorithm, automates devices such as (8) without requiring Fisher-level insight for each new application.
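The Fisher-z device (8) in R, as a small sketch; the classical approximation se(phi) = 1/sqrt(n - 3) is an assumption about what standard error was used on the transformed scale, so the endpoints need not match the article's numbers.

```r
theta.hat <- 0.776                   # law school correlation
n <- 15

# Fisher's z transformation (8)
phi.hat <- 0.5 * log((1 + theta.hat) / (1 - theta.hat))

# standard 95% interval on the phi scale, then map the endpoints back
se.phi     <- 1 / sqrt(n - 3)        # classical normal-theory approximation (assumption)
phi.lims   <- phi.hat + c(-1.96, 1.96) * se.phi
theta.lims <- tanh(phi.lims)         # inverse of the z transformation
round(theta.lims, 3)
```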
If statistics had a "word of the year" it would be two words: deep learning. This is one of a suite of prediction algorithms that input data sets, often quite massive ones, and output prediction rules. Others in the suite include random forests, support vector machines, and gradient boosting.
Having produced a prediction rule, it is natural to wonder how accurately it will predict future cases, our subject in what follows:
Section 2 gives a careful definition of the prediction problem, and describes a class of loss functions (the "Q class") that apply to discrete as well as continuous response variables. Section 3 concerns nonparametric estimates of prediction loss (cross-validation and the bootstrap "632 rule") as well as Breiman's out-of-bag error estimates for random forests. Covariance penalties, including Mallows' $C_p$ and the Akaike Information Criterion, are parametric methods discussed in Section 4, along with the related concept of degrees of freedom. Section 5 briefly discusses conformal inference, the most recent addition to the resampling catalog of prediction error assessments.
2. The Prediction Problem
Statements of the prediction problem are often framed as follows:
A data set $d$ of $n$ pairs is observed,
$$d = \left\{(x_i, y_i),\ i = 1, 2, \dots, n\right\}, \qquad (10)$$
where the $x_i$ are $p$-dimensional predictor vectors and the $y_i$ are one-dimensional responses. The pairs are assumed to be independent and identically distributed draws from an unknown $(p+1)$-dimensional distribution $F$,
$$(x_i, y_i) \overset{\mathrm{iid}}{\sim} F, \qquad i = 1, 2, \dots, n. \qquad (11)$$
Using some algorithm $\mathcal{A}$, the statistician constructs a prediction rule $f(x)$ that provides a prediction
$$\hat{y} = f(x) \qquad (12)$$
for any vector $x$ in the space of possible predictors. A new pair $(x_0, y_0)$ is independently drawn from $F$,
$$(x_0, y_0) \sim F, \qquad (13)$$
but with only $x_0$ observable. The statistician predicts the unseen $y_0$ by $\hat{y}_0 = f(x_0)$ and wishes to assess prediction error. Later the prediction error will turn out to relate directly to the estimation error of $f(x)$ for the true conditional expectation of $y$ given $x$,
$$\mu(x) = E_F\{y \mid x\}. \qquad (14)$$
Prediction error is assessed as the expectation of loss under distribution $F$,
$$\mathrm{Err}_u = E_F\left\{Q\left(y_0, f(x_0)\right)\right\}, \qquad (15)$$
for a given loss function $Q$ such as squared error, $Q(y, \hat{y}) = (y - \hat{y})^{2}$. Here, $E_F$ indicates expectation over the random choice of all $n + 1$ pairs $(x_i, y_i)$ and $(x_0, y_0)$ in (11) and (13). The $u$ in $\mathrm{Err}_u$ reflects the unconditional definition of error in (15). The resampling algorithms we will describe calculate an estimate of $\mathrm{Err}_u$ from the observed data $d$. (One might hope for a more conditional error estimate, say one applying to the observed set of predictors $x_1, x_2, \dots, x_n$, a point discussed in what follows.)
Naturally, the primary concern within the prediction community has been with the choice of the algorithm $\mathcal{A}$ that produces the rule $f$. Elaborate computer-intensive algorithms such as random forests and deep learning have achieved star status, even in the popular press. Here, however, the "prediction problem" will focus on the estimation of prediction error. To a large extent the prediction problem has been a contest of competing resampling methods, as discussed in the next three sections.
Figure 2 illustrates a simple example: $n = 20$ pairs $(x_i, y_i)$ have been observed, in this case with $x$ real. A fourth-degree polynomial $f(x)$ has been fit by ordinary least squares applied to $d$, with the heavy curve tracing out $f(x)$.

In the usual OLS notation, we have observed
$$y = X\beta + \epsilon \qquad (16)$$
from the notional model, where $X$ is the $20 \times 5$ matrix with $i$th row $(1, x_i, x_i^{2}, x_i^{3}, x_i^{4})$, $\beta$ the unknown 5-dimensional vector of regression coefficients, and $\epsilon$ a vector of 20 uncorrelated errors having mean 0 and variance $\sigma^{2}$,
$$\epsilon \sim \left(0, \sigma^{2} I\right). \qquad (17)$$
The fitted curve $f(x)$ is given by
$$f(x) = \left(1, x, x^{2}, x^{3}, x^{4}\right)\hat{\beta}, \qquad \hat{\beta} = \left(X^{\top}X\right)^{-1}X^{\top}y \qquad (18)$$
(the OLS estimate), this being algorithm $\mathcal{A}$.

The apparent error, what will be called err in what follows, is
$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^{2}, \qquad (19)$$
which equals 1.99 for the data in Figure 2. The usual unbiased estimate for the noise parameter $\sigma^{2}$, not needed here, modifies the denominator in (19) to take account of fitting a 5-vector $\hat{\beta}$ to $y$,
$$\hat{\sigma}^{2} = \frac{1}{n - 5}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^{2}. \qquad (20)$$
Dividing the sum of squared errors by $n - 5$ rather than $n$ can be thought of as a classical prediction error adjustment; err usually underestimates future prediction error since the coefficients $\hat{\beta}$ have been chosen to fit the observations $y_i$.
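A sketch of the Figure 2 setup: fit a fourth-degree polynomial by OLS and compute the apparent error (19) and variance estimate (20). The true curve and the x values below are illustrative stand-ins, not the article's, so the numbers will differ from 1.99.

```r
set.seed(2)
n <- 20
x <- sort(runif(n, 0, 10))                  # illustrative predictor values (assumption)
mu <- function(x) 2 * sin(x / 2)            # illustrative "true" curve, not the article's
y <- mu(x) + rnorm(n, sd = sqrt(2))         # normal errors with variance 2, as in (22)

# Fourth-degree polynomial fit by ordinary least squares, algorithm A
fit <- lm(y ~ poly(x, 4, raw = TRUE))
mu.hat <- fitted(fit)

err <- mean((y - mu.hat)^2)                 # apparent error (19)
sigma2.hat <- sum((y - mu.hat)^2) / (n - 5) # unbiased variance estimate (20)
c(err = err, sigma2.hat = sigma2.hat)
```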
Because this is a simulation, we know the true function $\mu(x)$ (14), the light dotted curve in Figure 2: the points $(x_i, y_i)$ were generated with normal errors, variance 2,
$$y_i = \mu(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 2). \qquad (22)$$
Given model (22), we can calculate the true prediction error for estimate (18). If $y_0$ is a new observation at $x$, independent of the original data $d$ which gave $f$, then the true prediction error at $x$ is
$$\mathrm{Err}(x) = E_0\left\{\left(y_0 - f(x)\right)^{2}\right\},$$
the notation $E_0$ indicating expectation over $y_0$. Let $\mathrm{Err}$ denote the average true error over the $n = 20$ observed values $x_i$,
$$\mathrm{Err} = \frac{1}{20}\sum_{i=1}^{20}\mathrm{Err}(x_i), \qquad (24)$$
equaling 2.40 in this case. $\mathrm{Err}$ figures prominently in the succeeding sections. Here, it exceeds the apparent error err (19) by 21%. ($\mathrm{Err}$ is not the same as $\mathrm{Err}_u$ (15).)
Prediction algorithms are often, perhaps most often, applied to situations where the responses are dichotomous, $y_i = 1$ or 0; that is, they are Bernoulli random variables, binomials of sample size 1 each,
$$y_i \sim \mathrm{Bi}\left(1, \mu_i\right), \qquad i = 1, 2, \dots, n. \qquad (25)$$
Here $\mu_i = \mu(x_i)$ is the probability that $y_i = 1$ given prediction vector $x_i$. The probability model $F$ in (11) can be thought of in two steps: as first selecting $x$ according to some $p$-dimensional distribution $G$ and then "flipping a biased coin" to generate $y \sim \mathrm{Bi}(1, \mu(x))$.

Squared error is not appropriate for dichotomous data. Two loss (or "error") functions are in common use for measuring the discrepancy between $\mu$ and $\hat{\mu}$, the true and estimated probability that $y = 1$ in (25). The first is counting error,
$$Q(y, \hat{\mu}) = \begin{cases} 0 & \text{if } y \text{ and } \hat{\mu} \text{ lie on the same side of } 1/2, \\ 1 & \text{if they lie on opposite sides.} \end{cases} \qquad (27)$$
For $y$ equal 0 or 1, $Q(y, \hat{\mu})$ equals 0 or 1 according to whether $y$ and $\hat{\mu}$ are on the same or different sides of $1/2$.

The second error function is binomial deviance (or twice the Kullback–Leibler divergence),
$$Q(\mu, \hat{\mu}) = 2\left[\mu\log\frac{\mu}{\hat{\mu}} + (1 - \mu)\log\frac{1 - \mu}{1 - \hat{\mu}}\right], \qquad (28)$$
which for an observed $y$ equal 0 or 1 reduces to $-2\left[y\log\hat{\mu} + (1 - y)\log(1 - \hat{\mu})\right]$.

Binomial deviance plays a preferred role in maximum likelihood estimation. Suppose that $\left\{\mu_{\beta}(x),\ \beta \in \mathbb{R}^{p}\right\}$ is a $p$-parameter family for the true vector of means in model (25). Then the maximum likelihood estimate (MLE) $\hat{\beta}$ is the minimizer of the average binomial deviance (28) between the observed responses and the fitted means,
$$\hat{\beta} = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n} Q\left(y_i, \mu_{\beta}(x_i)\right); \qquad (30)$$
see Chapter 8 of [2]. Most of the numerical examples in the following sections are based on binomial deviance (30). (If $\hat{\mu}$ equals 0 or 1 then (30) is infinite. To avoid infinities, our numerical examples truncate $\hat{\mu}$ away from 0 and 1.)
Squared error, counting error, and binomial deviance are all members of the Q-class, a general construction illustrated in Figure 3 ([4], Section 3). The construction begins with some concave function $q(\mu)$; for the dichotomous cases considered here, $\mu \in [0, 1]$ and $q(\mu) \ge 0$. The error $Q(\mu, \hat{\mu})$ between a true value $\mu$ and an estimate $\hat{\mu}$ is defined by the illustrated tangency calculation:
$$Q(\mu, \hat{\mu}) = q(\hat{\mu}) + \dot{q}(\hat{\mu})(\mu - \hat{\mu}) - q(\mu) \qquad (31)$$
(equivalent to the "Bregman divergence" [5]). The entropy function $q(\mu) = -2\left[\mu\log\mu + (1 - \mu)\log(1 - \mu)\right]$ makes $Q$ equal binomial deviance. Two other common choices are $q(\mu) = \min(\mu, 1 - \mu)$ for counting error and $q(\mu) = \mu(1 - \mu)$ for squared error.
Working within the Q-class (31), it is easy to express the true error of a prediction $\hat{\mu} = f(x)$ at predictor value $x$ where the true mean is $\mu = \mu(x)$. Letting $y_0$ be an independent realization from the distribution of $y$ given $x$, the true error at $x$ is, by definition,
$$\mathrm{Err}(x) = E_0\left\{Q\left(y_0, \hat{\mu}\right)\right\}, \qquad (33)$$
only $y_0$ being random in the expectation.
Lemma 1. The true error at x (33) is
$$\mathrm{Err}(x) = Q(\mu, \hat{\mu}) + q(\mu), \qquad (34)$$
with $q(0) = q(1) = 0$ in the dichotomous case.

Proof. From definition (31) of the Q-class,
$$E_0\left\{Q\left(y_0, \hat{\mu}\right)\right\} = q(\hat{\mu}) + \dot{q}(\hat{\mu})(\mu - \hat{\mu}) - E_0\left\{q(y_0)\right\} = Q(\mu, \hat{\mu}) + q(\mu) - E_0\left\{q(y_0)\right\},$$
giving (34) once the last term is shown to vanish. In the dichotomous case $q(y_0) = 0$ for $y_0$ equal 0 or 1, so $E_0\{q(y_0)\} = 0$. ☐
To simplify notation, let $\mu_i = \mu(x_i)$ and $\hat{\mu}_i = f(x_i)$, with $\mu = (\mu_1, \dots, \mu_n)$ and $\hat{\mu} = (\hat{\mu}_1, \dots, \hat{\mu}_n)$. The average true error is defined to be
$$\mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}(x_i). \qquad (36)$$
It is "true" in the desirable sense of applying to the given prediction rule $f$. If we average $\mathrm{Err}$ over the random choice of the $(x_i, y_i)$ pairs in (11), we get the less desirable unconditional error $\mathrm{Err}_u$ (15).
$\mathrm{Err}$ is minimized by $\hat{\mu} = \mu$, with minimum value $\frac{1}{n}\sum_{i=1}^{n}q(\mu_i)$. Subtraction from (36) gives
$$\mathrm{Err} - \frac{1}{n}\sum_{i=1}^{n}q(\mu_i) = \frac{1}{n}\sum_{i=1}^{n}Q\left(\mu_i, \hat{\mu}_i\right). \qquad (38)$$
This is an exact analogue of the familiar squared-error relationship. Suppose $y$ and $y_0$ are independent $(\mu, \sigma^{2}I)$ vectors, and that $y$ produces an estimate $\hat{\mu} = \hat{\mu}(y)$. Then
$$E_0\left\{\left\|y_0 - \hat{\mu}\right\|^{2}\right\} = \left\|\mu - \hat{\mu}\right\|^{2} + n\sigma^{2},$$
which is (38) when $Q$ is squared error, with $n\sigma^{2}$ playing the role of $\sum q(\mu_i)$.
At a given value of $x$, say $x_0$, a prediction $\hat{y}_0 = f(x_0)$ can also be thought of as an estimate of $\mu(x_0)$. Lemma 1 shows that the optimum choice of a prediction rule is also the optimum choice of an estimation rule: for any rule $f$,
$$E_F\left\{Q\left(y_0, f(x_0)\right)\right\} = E_F\left\{Q\left(\mu(x_0), f(x_0)\right)\right\} + E_F\left\{q\left(\mu(x_0)\right)\right\}$$
($E_F$ as in (11)), so that the rule that minimizes the expected prediction error also minimizes the expected estimation error $E_F\{Q(\mu(x_0), f(x_0))\}$.

Predicting $y$ and estimating its mean $\mu$ are equivalent tasks within the Q-class.

3. Cross-Validation and Its Bootstrap Competitors
Resampling methods base their inferences on recomputations of the statistic of interest derived from systematic modifications of the original sample. This is not a very precise definition, but it cannot be if we want to cover the range of methods used in estimating prediction error. There are intriguing differences among the methods concerning just what the modifications are and how the inferences are made, as discussed in this and the next two sections.
Cross-validation has a good claim to being the first resampling method. The original idea was to randomly split the sample into two halves, the training and test sets $d_{\mathrm{train}}$ and $d_{\mathrm{test}}$. A prediction model is developed using only $d_{\mathrm{train}}$, and then validated by its performance on $d_{\mathrm{test}}$. Even if we cheated in the training phase, say by throwing out "bad" points, etc., the validation phase guarantees an honest estimate of prediction error.
One drawback is that inferences based on $n/2$ data points are likely to be less accurate than those based on all $n$, a concern if we are trying to accurately assess prediction error. "One-at-a-time" cross-validation almost eliminates this defect: let $d_{(i)}$ be data set (10) with point $(x_i, y_i)$ removed, and define
$$\hat{\mu}_{(i)} = f_{(i)}(x_i),$$
the prediction for case $i$ based on $x_i$, using the rule $f_{(i)}$ constructed using only the data in $d_{(i)}$. The cross-validation estimate of prediction error is then
$$\widehat{\mathrm{Err}}_{\mathrm{cv}} = \frac{1}{n}\sum_{i=1}^{n} Q\left(y_i, \hat{\mu}_{(i)}\right). \qquad (42)$$
(This assumes that we know how to apply the construction rule $\mathcal{A}$ to subsets of size $n - 1$.) Because $y_i$ is not involved in $\hat{\mu}_{(i)}$, overfitting is no longer a concern. Under the independent-draws model (11), $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ is a nearly unbiased estimate of $\mathrm{Err}_u$ (15) ("nearly" because it applies to samples of size $n - 1$ rather than $n$).
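A sketch of one-at-a-time cross-validation (42) for the polynomial example, continuing the earlier Figure 2 sketch (x, y, n, err as defined there); squared error is used as the loss, which is an assumption about what the article did for this example.

```r
# Leave-one-out cross-validation (42) for the fourth-degree polynomial fit
loo.pred <- sapply(1:n, function(i) {
  fit.i <- lm(y[-i] ~ poly(x[-i], 4, raw = TRUE))   # rule f_(i) built from d_(i)
  sum(coef(fit.i) * x[i]^(0:4))                     # predict the deleted case i
})
Err.cv <- mean((y - loo.pred)^2)                    # cross-validation estimate
c(err = err, Err.cv = Err.cv)                       # compare with the apparent error (19)
```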
The little example in Figure 2 has $n = 20$ points $(x_i, y_i)$. Applying (42) with squared error loss gave an estimate to be compared with the apparent error 1.99 (19) and the true error 2.40 (24). This not-very-impressive result has much to do with the small sample size, which produces large differences between the original estimates $f(x_i)$ and their cross-validated counterparts $\hat{\mu}_{(i)}$. A single data point accounted for 40% of $\widehat{\mathrm{Err}}_{\mathrm{cv}}$.
Figure 4 concerns a larger data set that will serve as the basis for simulations comparing cross-validation with its competitors: 200 transplant recipients were followed to investigate the subsequent occurrence of anemia; 138 did develop anemia (coded as $y_i = 1$) while 62 did not ($y_i = 0$). The goal of the study was to predict $y$ from $x$, a vector of baseline predictor variables including an intercept term. (The predictor variables were body mass index, sex, race, patient and donor age, four measures of matching between patient and donor, three baseline medicine indicators, and four baseline general health measures.)
A standard logistic regression analysis gave estimated values of the anemia probability $\Pr\{y_i = 1 \mid x_i\}$ for $i = 1, 2, \dots, 200$, which we will denote as
$$\pi = \left(\pi_1, \pi_2, \dots, \pi_{200}\right). \qquad (43)$$
Figure 4 shows a histogram of the 200 $\pi_i$ values. Here we will use $\pi$ as the "ground truth" for a simulation study (rather than analyzing the transplant study itself). The $\pi_i$ will play the role of the true means $\mu_i$ in Lemma 1 (34), enabling us to calculate true errors for the various prediction error estimates.
A $200 \times 100$ matrix $Y$ of dichotomous responses $y_{ij}$ was generated as independent Bernoulli variables (that is, binomials of sample size 1),
$$y_{ij} \sim \mathrm{Bi}\left(1, \pi_i\right) \qquad (44)$$
for $i = 1, \dots, 200$ and $j = 1, \dots, 100$. The $j$th column of $Y$,
$$y_j = \left(y_{1j}, y_{2j}, \dots, y_{200,j}\right), \qquad (45)$$
is a simulated binomial response vector (25) having true mean vector $\pi$; $Y$ provides 100 such response vectors.
Figure 4. Logistic regression estimated anemia probabilities for the 200 transplant patients.
For each one, a logistic regression was run,

`glm(y ~ X, family = "binomial")` (46)

in the language R, with $X$ the $200 \times p$ matrix of predictors from the transplant study. Cross-validation ("10-at-a-time" rather than one-at-a-time: the 200 $(x, y)$ pairs were randomly split into 20 groups of 10 each; each group was removed from the prediction set in turn and its 10 estimates obtained by logistic regression based on the other 190) gave an estimate of prediction error for the $j$th simulation,
$$\widehat{\mathrm{Err}}_{\mathrm{cv}}(j) = \frac{1}{200}\sum_{i=1}^{200} Q\left(y_{ij}, \hat{\mu}_{(i)j}\right), \qquad (47)$$
while (36) gave true error
$$\mathrm{Err}_{\mathrm{true}}(j) = \frac{1}{200}\sum_{i=1}^{200}\left[Q\left(\pi_i, \hat{\mu}_{ij}\right) + q\left(\pi_i\right)\right], \qquad (48)$$
where $\hat{\mu}_{ij}$ was the estimated mean for case $i$ from (46).
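A sketch of one simulation round: generate y from "ground truth" probabilities, fit the logistic regression (46), and apply the 10-at-a-time cross-validation just described with binomial deviance. The design matrix X and probabilities pi.true below are simulated stand-ins, since the transplant data themselves are not reproduced in the article.

```r
set.seed(3)
n <- 200; p <- 16
X       <- matrix(rnorm(n * p), n, p)               # stand-in predictor matrix (assumption)
pi.true <- as.numeric(plogis(X %*% rnorm(p, sd = 0.4)))  # stand-in "ground truth" probabilities

Q.dev <- function(y, m, eps = 0.001) {               # binomial deviance, truncated
  m <- pmin(pmax(m, eps), 1 - eps)
  -2 * (y * log(m) + (1 - y) * log(1 - m))
}

y <- rbinom(n, 1, pi.true)                           # one simulated response vector, as in (44)

# 10-at-a-time cross-validation: 20 random groups of 10
group <- sample(rep(1:20, each = 10))
mu.cv <- numeric(n)
for (g in 1:20) {
  out <- which(group == g)
  fit <- glm(y[-out] ~ X[-out, ], family = "binomial")
  mu.cv[out] <- plogis(cbind(1, X[out, ]) %*% coef(fit))  # predictions for left-out cases
}
Err.cv <- mean(Q.dev(y, mu.cv))                      # cross-validation estimate (47)

# True error (48) via Lemma 1: Q(pi_i, mu.hat_i) + q(pi_i), binomial deviance q
fit.all <- glm(y ~ X, family = "binomial")
mu.hat  <- pmin(pmax(fitted(fit.all), 1e-6), 1 - 1e-6)
q.ent   <- function(m) -2 * (m * log(m) + (1 - m) * log(1 - m))
Q.true  <- function(u, v) q.ent(v) + (-2 * log(v / (1 - v))) * (u - v) - q.ent(u)
Err.true <- mean(Q.true(pi.true, mu.hat) + q.ent(pi.true))
c(Err.cv = Err.cv, Err.true = Err.true, err = mean(Q.dev(y, mu.hat)))
```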
In terms of mean ± standard deviation, the 100 simulations gave
$$\widehat{\mathrm{Err}}_{\mathrm{cv}} = 231.1 \pm 28.7 \quad \text{versus} \quad \mathrm{Err}_{\mathrm{true}} = 214.0 \pm 23.0 \qquad (49)$$
(total error over the 200 cases, as in Table 3). $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ averaged 8% more than $\mathrm{Err}_{\mathrm{true}}$ (a couple percent of which came from having sample size 190 rather than 200). The standard deviation of $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ is not much bigger than that of $\mathrm{Err}_{\mathrm{true}}$, which might suggest that $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ was tracking $\mathrm{Err}_{\mathrm{true}}$ as it varied across the simulations.
Sorry to say, that was not at all the case. The pairs $\left(\mathrm{Err}_{\mathrm{true}}(j), \widehat{\mathrm{Err}}_{\mathrm{cv}}(j)\right)$, $j = 1, \dots, 100$, are plotted in Figure 5. It shows $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ actually decreasing as the true error $\mathrm{Err}_{\mathrm{true}}$ increases. The unfortunate implication is that $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ is not estimating the true error, but only its expectation $\mathrm{Err}_u$ (15). This is not a particular failing of cross-validation. It is habitually observed for all prediction error estimates (see for instance Figure 9 of [4]), though the phenomenon seems unexplained in the literature.
Cross-validation tends to pay for its low bias with high variability. Efron [1] proposed bootstrap estimates of prediction error intended to decrease variability without adding much bias. Among the several proposals the most promising was the 632 rule. (An improved version, 632+, was introduced in Efron and Tibshirani [6], designed for reduced bias in overfit situations where err (19) equals zero. The calculations here use only 632.) The 632 rule is described as follows:
- Nonparametric bootstrap samples $d^{*}$ are formed by drawing $n$ pairs $(x_i, y_i)$ with replacement from the original data set $d$ (10). (51)
- Applying the original algorithm $\mathcal{A}$ to $d^{*}$ gives prediction rule $f^{*}$ and predictions $\hat{\mu}_i^{*} = f^{*}(x_i)$ as in (12).
- $B$ bootstrap data sets $d^{*1}, d^{*2}, \dots, d^{*B}$ are independently drawn, giving predictions $\hat{\mu}_{ij}^{*} = f^{*j}(x_i)$ for $i = 1, \dots, n$ and $j = 1, \dots, B$.
- Two numbers are recorded for each choice of $i$ and $j$: the error of $\hat{\mu}_{ij}^{*}$ as a prediction of $y_i$,
$$Q_{ij} = Q\left(y_i, \hat{\mu}_{ij}^{*}\right), \qquad (53)$$
and $N_{ij}$, the number of times $(x_i, y_i)$ occurs in $d^{*j}$.
- The zero bootstrap is calculated as the average value of $Q_{ij}$ for those cases having $N_{ij} = 0$,
$$\widehat{\mathrm{Err}}_{0} = \mathrm{ave}\left\{Q_{ij} : N_{ij} = 0\right\}. \qquad (55)$$
- Finally, the 632 estimate of prediction error is defined to be
$$\widehat{\mathrm{Err}}_{632} = 0.632\,\widehat{\mathrm{Err}}_{0} + 0.368\,\mathrm{err}, \qquad (56)$$
err being the apparent error rate (19).
$\widehat{\mathrm{Err}}_{632}$ was calculated for the same 100 simulated response vectors $y_j$ (45) used for $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ (each using $B = 400$ bootstrap replications), the 100 simulations giving
$$\widehat{\mathrm{Err}}_{632} = 229.4 \pm 19.0 \qquad (57)$$
in terms of mean ± standard deviation of total error (Table 3), an improvement on $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ at (49). This is in line with the 24 sampling experiments reported in [6]. (There, rule 632+ was used, with loss function counting error (27) rather than binomial deviance; $Q(y, \hat{\mu})$ is discontinuous for counting error, which works to the advantage of 632 rules.)
Table 2 concerns the rationale for the 632 rule. The 80,000 values $Q_{ij}$ ($i = 1, \dots, 200$; $j = 1, \dots, 400$) for the first of the 100 simulated response vectors were averaged according to how many times $(x_i, y_i)$ appeared in $d^{*j}$,
$$\bar{Q}(k) = \mathrm{ave}\left\{Q_{ij} : N_{ij} = k\right\}$$
for $k = 0, 1, 2, \dots$. Not surprisingly, $\bar{Q}(k)$ decreases with increasing $k$. $\widehat{\mathrm{Err}}_{0}$ (55), which would seem to be the bootstrap analogue of cross-validation, is seen to exceed the true error $\mathrm{Err}_{\mathrm{true}}$, while the apparent error err is below it. The intermediate linear combination $\widehat{\mathrm{Err}}_{632}$ is motivated in Section 6 of [1], though in fact the argument is more heuristic than compelling. The 632 rules do usually reduce variability of the error estimates compared to cross-validation, but bias can be a problem.
The 632 rule recomputes a prediction algorithm by nonparametric bootstrap resampling of the original data.
Random forests [7], a widely popular prediction algorithm, carries this further: the algorithm itself, as well as estimates of its accuracy, depends on bootstrap resampling calculations.
Regression trees are the essential component of random forests. Figure 6 shows one such tree (constructed using rpart, the R version of CART [8]; Chapters 9 and 15 of [9] describe CART and random forests) as applied to $y_1$, the first of the 100 response vectors for the transplant data simulation (44); $y_1$ consists of 57 0 s and 143 1 s, average value 0.715. The tree-making algorithm performs successive splits of the data, hoping to partition it into bins that are mostly 0 s or 1 s. The bin at the far right (comprising 29 cases having low body mass index, low age, and female gender) has just one 0 and 28 1 s, for an average of 0.97. For a new transplant case $(x_0, y_0)$, with only $x_0$ observable, we could follow the splits down the tree and use the terminal bin average as a quantitative prediction of $y_0$.
Random forests improves the predictive accuracy of any one tree by bagging ("bootstrap aggregation"), sometimes also called bootstrap smoothing: $B$ bootstrap data sets $d^{*1}, \dots, d^{*B}$ are drawn at random (51), each one generating a tree such as that in Figure 6. (Some additional variability is added to the tree-building process: only a random subset of the $p$ predictors is deemed eligible at each splitting point.) A new $x$ is followed down each of the $B$ trees, with the random forest prediction being the average of $x$'s $B$ terminal values. Letting $t_j(x)$ denote the prediction at $x$ for the tree based on $d^{*j}$, the random forest prediction at $x$ is
$$f_{\mathrm{rf}}(x) = \frac{1}{B}\sum_{j=1}^{B} t_j(x).$$
Figure 6. Regression tree for transplant data, simulation 1.
The predictive accuracy of $f_{\mathrm{rf}}$ is assessed using a device such as that for $\widehat{\mathrm{Err}}_{0}$ (55): let $\tilde{\mu}_i$ be the average value of $t_j(x_i)$ over the bootstrap samples $d^{*j}$ not containing $(x_i, y_i)$,
$$\tilde{\mu}_i = \mathrm{ave}\left\{t_j(x_i) : N_{ij} = 0\right\},$$
called the "out-of-bag" (oob) estimate of $\mu_i$. The oob error estimate for case $i$ is then $Q(y_i, \tilde{\mu}_i)$, and the overall oob estimate of prediction error is
$$\widehat{\mathrm{Err}}_{\mathrm{oob}} = \frac{1}{n}\sum_{i=1}^{n} Q\left(y_i, \tilde{\mu}_i\right).$$
(Notice that the leave-out calculations here are for the estimates $\tilde{\mu}_i$, while those for $\widehat{\mathrm{Err}}_{632}$ (56) are for the errors $Q_{ij}$.) Calculated for the 100 simulated response vectors $y_j$ (45), this gave
$$\widehat{\mathrm{Err}}_{\mathrm{oob}} = 223.8 \pm 14.6, \qquad (63)$$
a better match to the true error $\mathrm{Err}_{\mathrm{true}}$ than either $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ (49) or $\widehat{\mathrm{Err}}_{632}$ (57). In fact the actual match was even better than (63) suggests, as shown in Table 3 of Section 4. This is all the more surprising given that, unlike $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ and $\widehat{\mathrm{Err}}_{632}$, $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ is fully nonparametric: it makes no use of the logistic regression model (46), which was involved in generating the simulated response vectors (44). (It has to be added that $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ is not an estimate for the prediction error of the logistic regression model (46), but rather for the random forest estimates $f_{\mathrm{rf}}$.)
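A sketch of the out-of-bag calculation using the randomForest package, continuing the simulated (X, y, Q.dev) example; the forest is grown in regression mode so that the OOB predictions are probabilities, and binomial deviance is then applied to them. The package's built-in OOB machinery replaces coding N_ij by hand.

```r
# install.packages("randomForest") if needed
library(randomForest)

set.seed(5)
rf <- randomForest(x = X, y = y, ntree = 500)   # regression forest; y is 0/1 numeric
mu.oob <- rf$predicted                          # out-of-bag predictions, one per case
Err.oob <- mean(Q.dev(y, mu.oob))               # overall oob estimate of prediction error
Err.oob
```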
4. Covariance Penalties and Degrees of Freedom
A quite different approach to the prediction problem was initiated by Mallows' $C_p$ formula [10]. An observed $n$-dimensional vector $y = (y_1, \dots, y_n)$ is assumed to follow the homoskedastic model
$$y = \mu + \epsilon, \qquad \epsilon \sim \left(0, \sigma^{2} I\right), \qquad (64)$$
the notation indicating uncorrelated errors of mean 0 and variance $\sigma^{2}$ as in (17); $\sigma^{2}$ is known. A linear rule
$$\hat{\mu} = M y \qquad (65)$$
is used to estimate $\mu$, with $M$ a fixed and known matrix. How accurate is $\hat{\mu}$ as a predictor of future observations?

The apparent error
$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\mu}_i\right)^{2}$$
is likely to underestimate the true error of $\hat{\mu}$ given a hypothetical new observation vector $y_0$ independent of $y$,
$$\mathrm{Err} = \frac{1}{n}\,E_0\left\{\left\|y_0 - \hat{\mu}\right\|^{2}\right\}, \qquad (67)$$
the $E_0$ notation indicating that $\hat{\mu}$ is fixed in (67). Mallows' formula says that
$$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2\sigma^{2}}{n}\,\mathrm{tr}(M) \qquad (68)$$
is an unbiased estimator of $\mathrm{Err}$; that is, err is not unbiased for $\mathrm{Err}$ but
$$E\left\{\widehat{\mathrm{Err}}_{\mathrm{cp}}\right\} = E\left\{\mathrm{Err}\right\} \qquad (69)$$
($\mu$ fixed in the expectations) under model (64)–(65).
One might wonder what has happened to the covariates $x_i$ in $d$ (10). The answer is that they are still there, but no longer considered random; rather, they are treated as fixed ancillary quantities, such as the sample size $n$. In the OLS model (16)–(18) for Figure 2, the covariates $x_i$ determine $X$ and
$$M = X\left(X^{\top}X\right)^{-1}X^{\top}, \qquad (70)$$
and hence the linear rule (65). We could, but do not, write $M$ as $M(x_1, \dots, x_n)$.
Mallows' Formula (68) can be extended to the Q class of error measures of Figure 3. An unknown probability model $g$ (notice that $g$ is not the same as $f$ in (12)) is assumed to have produced $y$ and its true mean vector $\mu$,
$$y \sim g(\cdot), \qquad \mu = E_g\{y\}; \qquad (72)$$
an estimate $\hat{\mu} = \hat{\mu}(y)$ has been calculated using some algorithm $\mathcal{A}$; and the apparent error (19) and true error (33) are defined as before,
$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n} Q\left(y_i, \hat{\mu}_i\right) \qquad \text{and} \qquad \mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n} E_0\left\{Q\left(y_{0i}, \hat{\mu}_i\right)\right\},$$
with the $\hat{\mu}_i$ fixed and $y_0 \sim g$ independently of $y$. Lemma 1 for the true error (36) still applies,
$$\mathrm{Err} = \frac{1}{n}\sum_{i=1}^{n}\left[Q\left(\mu_i, \hat{\mu}_i\right) + q\left(\mu_i\right)\right].$$
The Q-class version of Mallows' Formula (68) is derived as Optimism Theorem 1 in Section 3 of [4]:
Theorem 1. Define
$$\hat{\lambda}_i = -\dot{q}\left(\hat{\mu}_i\right)/2, \qquad i = 1, \dots, n. \qquad (75)$$
Then
$$E_g\left\{\mathrm{Err}\right\} = E_g\left\{\mathrm{err}\right\} + \frac{2}{n}\sum_{i=1}^{n}\mathrm{cov}_g\left(\hat{\lambda}_i, y_i\right), \qquad (76)$$
where $\mathrm{cov}_g$ indicates covariance under model (72); that is,
$$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2}{n}\sum_{i=1}^{n}\mathrm{cov}_g\left(\hat{\lambda}_i, y_i\right)$$
is an unbiased estimate of $\mathrm{Err}$ in the same sense as (69).

The covariance terms in (76) measure how much each $y_i$ affects its own estimate. They sum to a covariance penalty that must be added to the apparent error to account for the fitting process. If $Q$ is binomial deviance then $\hat{\lambda}_i = \log\left[\hat{\mu}_i/(1 - \hat{\mu}_i)\right]$, the logistic parameter; the theorem still applies as stated whether or not $\mathcal{A}$ is logistic regression.

$\widehat{\mathrm{Err}}_{\mathrm{cp}}$ as stated in (76) is not directly usable, since the covariance terms $\mathrm{cov}_g(\hat{\lambda}_i, y_i)$ are not observable statistics. This is where resampling comes in.
Suppose that the observed data $y$ provides an estimate $\hat{g}$ of $g$. For instance, in a normal regression model we could take $\hat{g}$ to be $\mathcal{N}(\hat{\mu}, \hat{\sigma}^{2} I)$ for some estimate $\hat{\sigma}^{2}$. We replace (72) with the parametric bootstrap model
$$y^{*} \sim \hat{g}(\cdot) \qquad (79)$$
and generate $B$ independent replications $y^{*1}, y^{*2}, \dots, y^{*B}$, from which are calculated $B$ pairs $\left(\hat{\lambda}_i^{*j}, y_i^{*j}\right)$ for each $i$, with $\hat{\lambda}_i^{*j}$ as in (75). The covariances in (76) can then be estimated as
$$\widehat{\mathrm{cov}}_i = \frac{1}{B}\sum_{j=1}^{B}\hat{\lambda}_i^{*j}\left(y_i^{*j} - y_i^{*\cdot}\right), \qquad y_i^{*\cdot} = \frac{1}{B}\sum_{j=1}^{B} y_i^{*j}, \qquad (81)$$
yielding a useable version of (76),
$$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2}{n}\sum_{i=1}^{n}\widehat{\mathrm{cov}}_i, \qquad (82)$$
"cp" standing for "covariance penalty".
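A sketch of the covariance-penalty computation (79)–(82) for the logistic regression setting, continuing the simulated (X, y, n, Q.dev) example; lambda-hat is the fitted logit, and the covariances are estimated by parametric bootstrap resampling from the fitted probabilities.

```r
set.seed(6)
fit    <- glm(y ~ X, family = "binomial")
mu.hat <- pmin(pmax(fitted(fit), 0.001), 0.999)
err    <- mean(Q.dev(y, mu.hat))                     # apparent error (19)

B <- 200
lam.star <- matrix(NA, n, B)                         # lambda.hat_i^*j, fitted logits
y.star   <- matrix(NA, n, B)
for (j in 1:B) {
  ys <- rbinom(n, 1, mu.hat)                         # parametric bootstrap vector (79)
  fs <- glm(ys ~ X, family = "binomial")
  lam.star[, j] <- cbind(1, X) %*% coef(fs)          # natural (logit) parameter estimates
  y.star[, j]   <- ys
}

# covariance estimates (81): cov of lambda.hat_i^* with y_i^* over the B replications
cov.hat <- sapply(1:n, function(i) cov(lam.star[i, ], y.star[i, ]))

Err.cp <- err + (2 / n) * sum(cov.hat)               # covariance penalty estimate (82)
df.hat <- sum(cov.hat)                               # analogue of the df column in Table 3
c(err = err, Err.cp = Err.cp, df.hat = df.hat)
```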
Table 3 compares the performances of cross-validation, covariance penalties, the 632 rule, and the random forest out-of-bag estimates on the 100 transplant data simulations. The results are given in terms of total prediction error, rather than average error as in (47). The bottom line shows their root-mean-square differences from the true error,
$$\mathrm{rms} = \left[\frac{1}{100}\sum_{j=1}^{100}\left(\widehat{\mathrm{Err}}(j) - \mathrm{Err}_{\mathrm{true}}(j)\right)^{2}\right]^{1/2}.$$
$\widehat{\mathrm{Err}}_{\mathrm{cv}}$ is highest, with rms 48.8, and $\widehat{\mathrm{Err}}_{\mathrm{cp}}$, $\widehat{\mathrm{Err}}_{632}$, and $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ respectively 71%, 83%, and 65% as large.
Table 3. Transplant data simulation experiment. Top: First 5 of 100 estimates of total prediction error using cross-validation, covariance penalties (cp), the 632 rule, the out-of-bag random forest results, and the apparent error; degrees of freedom (df) estimate from cp. Bottom: Mean of the 100 simulations, their standard deviations, correlations with Errtrue, and root mean square differences from Errtrue. Cp and 632 rules used B = 400 bootstrap replications per simulation.
|  | Errtrue | ErrCv | ErrCp | Err632 | Erroob | err | df |
|---|---|---|---|---|---|---|---|
| 1 | 194 | 216.5 | 232.9 | 260.6 | 240.3 | 180.4 | 26.2 |
| 2 | 199 | 245.8 | 220.9 | 241.0 | 230.9 | 173.8 | 23.5 |
| 3 | 192 | 221.2 | 225.9 | 247.6 | 232.7 | 180.6 | 22.7 |
| 4 | 234 | 197.4 | 200.9 | 215.2 | 219.7 | 162.2 | 19.4 |
| 5 | 302 | 175.7 | 178.3 | 187.1 | 211.2 | 144.6 | 16.9 |
| mean | 214.0 | 231.1 | 216.1 | 229.4 | 223.8 | 169.9 | 23.1 |
| stdev | 23.0 | 28.7 | 15.8 | 19.0 | 14.6 | 13.5 | 4.0 |
| cor.true |  | −0.58 |  |  |  |  |  |
| rms |  | 48.8 | 34.8 | 40.7 | 31.6 | 54.2 |  |
For the $j$th simulation vector $y_j$ (45), the parametric bootstrap replications (79) were generated as follows: the logistic regression estimate $\hat{\mu}_j = (\hat{\mu}_{1j}, \dots, \hat{\mu}_{200,j})$ was calculated from (46); the bootstrap replications $y^{*}$ had independent Bernoulli components
$$y_i^{*} \sim \mathrm{Bi}\left(1, \hat{\mu}_{ij}\right), \qquad i = 1, \dots, 200;$$
and $B$ independent replications $y^{*1}, \dots, y^{*B}$ were generated for each $j$, giving $\widehat{\mathrm{Err}}_{\mathrm{cp}}(j)$ according to (81)–(82).
Because the resamples were generated by a model of the same form as that which originally gave the $y_{ij}$'s (43)–(44), $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ is unbiased for $\mathrm{Err}_{\mathrm{true}}$. In practice our resampling model $\hat{g}$ in (79) might not match $g$ in (72), causing $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ to be downwardly biased (and raising its rms). $\widehat{\mathrm{Err}}_{\mathrm{cv}}$'s nonparametric resamples make it nearly unbiased for the unconditional error rate $\mathrm{Err}_u$ (15) irrespective of the true model, accounting for the overwhelming popularity of cross-validation in applications.
4.1. A Few Comments
In computing $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ it is not necessary for $\hat{g}$ in (79) to be constructed from the original estimate $\hat{\mu}$. We might base $\hat{g}$ on a bigger model than that which led to the choice of $\mathcal{A}$ for $\hat{\mu}$; in the little example of Figure 2, for instance, we could take $\hat{g}$ to be $\mathcal{N}(\tilde{\mu}, \hat{\sigma}^{2} I)$, where $\tilde{\mu}$ is the OLS sixth-degree polynomial fit, while still taking $f$ to be fourth degree as in (18). This reduces possible model-fitting bias in $\widehat{\mathrm{Err}}_{\mathrm{cp}}$, while increasing its variability.
A major conceptual difference between $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ and $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ concerns the role of the covariates $x_i$, considered as random in model (10) but fixed in (72). Classical regression problems have usually been analyzed in a fixed-$x$ framework for three reasons:
- 1.
mathematical tractability;
- 2.
not having to specify $x$'s distribution;
- 3.
inferential relevance.
The reasons come together in the classic covariance formula for the linear model (16),
$$\mathrm{cov}\left(\hat{\beta}\right) = \sigma^{2}\left(X^{\top}X\right)^{-1}.$$
A wider dispersion of the $x_i$'s in Figure 2 would make $\hat{\beta}$ more accurate, and conversely.
It can be argued that because $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ is estimating the conditional error Err given $x_1, \dots, x_n$, it is the more relevant target of estimation. See [11,12] for a lot more on this question.
On the other hand, "fixed-$x$" methods such as Mallows' $C_p$ can be faulted for missing some of the variability contributing to the unconditional prediction error $\mathrm{Err}_u$ (15). Let $\mathrm{Err}(x)$ be the conditional error given $x$ for predicting $y$ given $x$,
$$\mathrm{Err}(x) = Q\left(\mu(x), f(x)\right) + q\left(\mu(x)\right)$$
according to Lemma 1 (34). Then
$$\mathrm{Err}_u = E\left\{\int \mathrm{Err}(x)\, dG(x)\right\}, \qquad (87)$$
where $G$ is the marginal distribution of $x$ and $E$ indicates expectation over the choice of the training data $d$.
In the fixed-$x$ framework of (36), $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ replaces the integrand in (87) with its average over the observed predictors, $\frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}(x_i)$. We expect
$$\mathrm{Err}_u \ge E\left\{\frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}(x_i)\right\}, \qquad (88)$$
since $\mathrm{Err}(x)$ typically increases for values of $x$ farther away from the observed $x_i$'s. Rosset and Tibshirani [12] give an explicit formula for the difference in the case of normal-theory ordinary least squares, where they show that the factor 2 for the penalty term in Mallows' formula (68) should be increased. Cross-validation effectively estimates $\mathrm{Err}_u$, while $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ estimates the fixed-$x$ version of Err.
With (88) in mind, $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ and $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ are often contrasted as estimates of "out-of-sample" and "in-sample" error, respectively. This is dangerous terminology if it is taken to mean that $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ applies to prediction errors at specific points $x$ outside of $\{x_1, \dots, x_n\}$. In Figure 2, for instance, it seems likely that $\mathrm{Err}(x)$ beyond the range of the observed data exceeds its value within it, but this is a fixed-$x$ question and beyond the reach of the random-$x$ assumptions underlying (88). See the discussion of Figure 8 in Section 5.
The sad story told in Figure 5 shows $\widehat{\mathrm{Err}}_{\mathrm{cv}}$ negatively correlated with the true error $\mathrm{Err}_{\mathrm{true}}$. The same is the case for $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ and $\widehat{\mathrm{Err}}_{632}$, as can be seen from the negative correlations in the cor.true row of Table 3. $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ is also negatively correlated with $\mathrm{Err}_{\mathrm{true}}$, but less so. In terms of rms, the bottom row shows that the fully nonparametric $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ estimates beat even the parametric $\widehat{\mathrm{Err}}_{\mathrm{cp}}$ ones.
Figure 7 compares the 200 estimates from the logistic regression fit (46) with those from random forests, for $y$ the first of the 100 transplant simulations. Random forests is seen to better separate the $y_i = 1$ from the $y_i = 0$ cases. $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ relates to error prediction for the random forest estimates, not for logistic regression, but this does not explain how $\widehat{\mathrm{Err}}_{\mathrm{oob}}$ could provide excellent estimates of the true error $\mathrm{Err}_{\mathrm{true}}$, which in fact was based on the logistic regression model (43)–(44). If this is a fluke, it is an intriguing one.
There is one special case where the covariance penalty formula (76) can be unbiasedly estimated without recourse to resampling: if $g$ is the normal model $y \sim \mathcal{N}(\mu, \sigma^{2} I)$, and $Q$ is squared error (so $\hat{\lambda}_i = \hat{\mu}_i$), then Stein's unbiased risk estimate (SURE) is defined to be
$$\widehat{\mathrm{Err}}_{\mathrm{SURE}} = \mathrm{err} + \frac{2\sigma^{2}}{n}\sum_{i=1}^{n}\frac{\partial\hat{\mu}_i}{\partial y_i},$$
where the partial derivatives are calculated directly from the functional form of $\hat{\mu} = \hat{\mu}(y)$. Section 2 of [4] gives an example comparing $\widehat{\mathrm{Err}}_{\mathrm{SURE}}$ with $\widehat{\mathrm{Err}}_{\mathrm{cp}}$. Each term $\partial\hat{\mu}_i/\partial y_i$ measures the influence of $y_i$ on its own estimate.
4.2. Degrees of Freedom
The OLS model yields the familiar estimate $\hat{\mu} = M y$ of $\mu$, where $M$ is the projection matrix $X\left(X^{\top}X\right)^{-1}X^{\top}$ as in (70); $M$ has
$$\mathrm{tr}(M) = p, \qquad (91)$$
where $p$ is the rank of $X$. Mallows' formula (68) becomes
$$\widehat{\mathrm{Err}}_{\mathrm{cp}} = \mathrm{err} + \frac{2\sigma^{2}}{n}\,p \qquad (92)$$
in this case. In other words, the covariance penalty that must be added to the apparent error is directly proportional to $p$, the degrees of freedom of the OLS model.
Suppose now that $\hat{\mu} = M y$ with matrix $M$ not necessarily a projection. It has become common to define $\hat{\mu}$'s degrees of freedom as
$$\mathrm{df} = \mathrm{tr}(M), \qquad (93)$$
playing the role of $p$ in the Formula (92). In this way, df becomes a lingua franca for comparing linear estimators $\hat{\mu} = M y$ of different forms. (The referee points out that formulas such as (92) are more often used for model selection rather than error rate prediction. Zhang and Yang [13] consider model selection applications, as does Remark B of [4].)
A nice example is the ridge regression estimator
$$\hat{\mu}(\lambda) = X\left(X^{\top}X + \lambda I\right)^{-1}X^{\top}y, \qquad (94)$$
$\lambda$ a fixed non-negative constant; $\hat{\mu}(0)$ is the usual OLS estimator, while $\hat{\mu}(\lambda)$ "shrinks" the fit toward 0, more so as $\lambda$ increases. Some linear algebra gives the degrees of freedom for (94) as
$$\mathrm{df}(\lambda) = \sum_{i=1}^{p}\frac{e_i}{e_i + \lambda}, \qquad (95)$$
where the $e_i$ are the eigenvalues of $X^{\top}X$.
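A small sketch verifying the eigenvalue formula (95) against the direct trace definition (93) on a fresh random design matrix (the dimensions and penalty value are arbitrary choices).

```r
set.seed(7)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
lambda <- 2

# Direct definition (93): df = trace of the ridge "hat" matrix M
M <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)
df.trace <- sum(diag(M))

# Eigenvalue formula (95): sum of e_i / (e_i + lambda), e_i eigenvalues of X'X
e <- eigen(crossprod(X), symmetric = TRUE)$values
df.eigen <- sum(e / (e + lambda))

c(df.trace = df.trace, df.eigen = df.eigen)   # the two agree
```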
The generalization of Mallows' Formula (68) in Theorem 1 (76) has penalty term
$$\frac{2}{n}\sum_{i=1}^{n}\mathrm{cov}\left(\hat{\lambda}_i, y_i\right), \qquad (96)$$
which again measures the self-influence of each $y_i$ on its own estimate. The choice of $q(\mu)$ in Figure 3 giving squared error results in $\hat{\lambda}_i = \hat{\mu}_i$, in which case $\mathrm{cov}(\hat{\lambda}_i, y_i)$ equals $\sigma^{2} M_{ii}$ for a linear estimator $\hat{\mu} = M y$, and (96) becomes $2\sigma^{2}\mathrm{tr}(M)/n$, as in Mallows' Formula (68). This suggests using
$$\mathrm{df} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{cov}\left(\hat{\lambda}_i, y_i\right) \qquad (98)$$
as a measure of degrees of freedom (or its estimate from the bootstrap covariances (81)) for a general estimator $\hat{\mu} = \hat{\mu}(y)$; for binomial deviance the dispersion $\sigma^{2}$ equals 1.
Some support comes from the following special situation: suppose $\hat{\mu}$ is obtained from a $p$-parameter generalized linear model, with prediction error measured by the appropriate deviance function (binomial deviance for logistic regression, as in the transplant example). Theorem 2 of [14], Section 6, then gives the asymptotic approximation
$$\mathrm{df} \approx p, \qquad (99)$$
as in (91)–(92), the intuitively correct answer.
Approximation (99) leads directly to Akaike's information criterion (AIC). In a generalized linear model, the total deviance from the MLE $\hat{\mu}$ is
$$D\left(y, \hat{\mu}\right) = \sum_{i=1}^{n} Q\left(y_i, \hat{\mu}_i\right) = 2\left[\log f_{y}(y) - \log f_{\hat{\mu}}(y)\right],$$
$f_{\mu}(y)$ denoting the density function ([2], Hoeffding's Lemma). Suppose we have glm's of different sizes $p$ that we wish to compare. Minimizing $\widehat{\mathrm{Err}}_{\mathrm{cp}} \doteq D(y, \hat{\mu})/n + 2p/n$ over the choice of model is then equivalent to maximizing the total log likelihood minus a dimensionality penalty,
$$\log f_{\hat{\mu}}(y) - p,$$
which is the AIC.
Approximation (99) is not razor-sharp: for the transplant simulation logistic regression the 100 estimates $\widehat{\mathrm{df}}$ averaged 23.08 with standard error 0.40, noticeably different from the nominal $p$. Degrees of freedom play a crucial role in model selection algorithms. Resampling methods allow us to assess df (98) even for very complicated fitting algorithms $\hat{\mu} = \hat{\mu}(y)$.
5. Conformal Inference
If there is a challenger to cross-validation for “oldest resampling method” it is permutation testing, going back to Fisher in the 1930s. The newest prediction error technique, conformal inference, turns out to have permutation roots, as briefly reviewed next.
A clinical trial of an experimental drug has yielded independent real-valued responses for control and treatment groups:
$$u_1, u_2, \dots, u_n \ \ (\text{Control}) \qquad \text{and} \qquad v_1, v_2, \dots, v_m \ \ (\text{Treatment}).$$
Student's t-test could be used to see if the new drug was giving genuinely larger responses, but Fisher, reacting to criticism of normality assumptions, proposed what we would now call a nonparametric two-sample test.
Let $z$ be the combined data set,
$$z = \left(u_1, \dots, u_n, v_1, \dots, v_m\right),$$
and choose some score function $S(z)$ that contrasts the last $m$ z-values with the first $n$, for example the difference of means,
$$S(z) = \frac{1}{m}\sum_{i=n+1}^{n+m} z_i - \frac{1}{n}\sum_{i=1}^{n} z_i.$$
Define $\mathcal{S}$ as the set of scores for all permutations of $z$,
$$\mathcal{S} = \left\{S\left(z^{(k)}\right)\right\},$$
$k$ ranging over the $(n + m)!$ permutations $z^{(k)}$ of $z$.
The permutation p-value for the treatment's efficacy in producing larger responses is defined to be the proportion of permutations having scores exceeding the observed score $S(z)$,
$$p = \frac{\#\left\{S\left(z^{(k)}\right) \ge S(z)\right\}}{(n + m)!}.$$
Fisher's key idea was that if in fact all the observations came from the same distribution $F$,
$$u_i \overset{\mathrm{iid}}{\sim} F \quad \text{and} \quad v_j \overset{\mathrm{iid}}{\sim} F \qquad (107)$$
(implying that Treatment is the same as Control), then all $(n + m)!$ permutations would be equally likely. Rejecting the null hypothesis of No Treatment Effect if $p \le \alpha$ has null probability (nearly) $\alpha$.
Usually $(n + m)!$ is too many permutations for practical use. This is where the sampling part of resampling comes in. Instead of all possible permutations, a randomly drawn subset of $B$ of them is selected for scoring, giving an estimated permutation p-value $\hat{p}$, the proportion of the $B$ sampled permutations with scores exceeding $S(z)$.
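A sketch of the sampled permutation test: B random permutations instead of all (n+m)!, with the difference of means as the score. The control and treatment responses below are made-up illustrative numbers.

```r
set.seed(8)
u <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.2, 5.7, 5.0)   # control responses (illustrative)
v <- c(5.9, 6.3, 5.6, 6.1, 6.8, 5.8, 6.4)        # treatment responses (illustrative)
z <- c(u, v)
n <- length(u); m <- length(v)

S <- function(z) mean(z[(n + 1):(n + m)]) - mean(z[1:n])   # score: difference of means
S.obs <- S(z)

B <- 10000
S.perm <- replicate(B, S(sample(z)))             # scores of randomly permuted data
p.hat <- mean(S.perm >= S.obs)                   # estimated permutation p-value
c(S.obs = S.obs, p.hat = p.hat)
```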
In 1963, Hodges and Lehmann considered an extension of the null hypothesis (107) to cover location shifts; in terms of cumulative distribution functions (cdfs), they assumed
$$F_v(t) = F_u(t - \Delta),$$
where $\Delta$ is a fixed but unknown constant that translates the $v$'s distribution by $\Delta$ units to the right of the $u$'s.

For a given trial value of $\Delta$, let
$$z(\Delta) = \left(u_1, \dots, u_n, v_1 - \Delta, \dots, v_m - \Delta\right)$$
and compute its permutation p-value $\hat{p}(\Delta)$. A 0.95 two-sided nonparametric confidence interval for $\Delta$ is then
$$\left\{\Delta : 0.025 \le \hat{p}(\Delta) \le 0.975\right\}. \qquad (112)$$
The only assumption is that, for the true value of $\Delta$, the $n + m$ components of $z(\Delta)$ are i.i.d., or more generally exchangeable, from some distribution $F$.
Vovk has proposed an ingenious extension of this argument applying to prediction error estimation, a much cited reference being [15]; see also [16]. Returning to the statement of the prediction problem at the beginning of Section 2: $d = \{(x_i, y_i),\ i = 1, \dots, n\}$ is the observed data and $(x_0, y_0)$ a new (predictor, response) pair, all $n + 1$ pairs assumed to be random draws from the same distribution $F$,
$$(x_i, y_i) \overset{\mathrm{iid}}{\sim} F, \qquad i = 0, 1, \dots, n; \qquad (113)$$
$x_0$ is observed but not $y_0$, and it is desired to predict $y_0$. Vovk's proposal, conformal inference, produces an exact nonparametric distribution for the unseen $y_0$.
Let $Y$ be a proposed trial value for $y_0$, and define $d(Y)$ as the data set $d$ augmented with $(x_0, Y)$,
$$d(Y) = \left\{(x_1, y_1), \dots, (x_n, y_n), (x_0, Y)\right\}.$$
A prediction rule $f_{d(Y)}$ gives estimates
$$\hat{\mu}_i(Y) = f_{d(Y)}(x_i), \qquad i = 0, 1, \dots, n.$$
(It is required that $f_{d(Y)}$ be invariant under reordering of $d(Y)$'s elements.) For some score function $s(\cdot, \cdot)$ let
$$S_i(Y) = s\left(y_i, \hat{\mu}_i(Y)\right), \qquad i = 0, 1, \dots, n \quad (\text{with } y_0 = Y), \qquad (116)$$
where $s(y, \hat{\mu})$ measures disagreement between $y$ and $\hat{\mu}$, larger values of $S$ indicating less conformity between observation and prediction. (More generally, $S_i$ can be any function of case $i$ and the augmented data set $d(Y)$ that treats the other cases symmetrically.)
If the proposed trial value $Y$ were in fact the unobserved $y_0$, then $S_0(Y)$ and $S_1(Y), \dots, S_n(Y)$ would be exchangeable random variables, because of the i.i.d. assumption (113). Let
$$S_{(1)}(Y) < S_{(2)}(Y) < \dots < S_{(n)}(Y) \qquad (117)$$
be the ordered values of $S_1(Y), \dots, S_n(Y)$. Assuming no ties, the $n$ values partition the line into $n + 1$ intervals, the first and last of which are semi-infinite. Exchangeability implies that $S_0(Y)$ has probability $1/(n + 1)$ of falling into any one of the intervals.
A conformal interval for the unseen $y_0$ consists of those values of $Y$ for which $S_0(Y)$ "conforms" to the distribution (117). To be specific, for a chosen miscoverage level $\alpha$, say 0.05, let $k_1$ and $k_2$ be integers approximately proportion $\alpha/2$ in from the endpoints of $1, 2, \dots, n$,
$$k_1 \doteq n\,\alpha/2 \qquad \text{and} \qquad k_2 \doteq n\,(1 - \alpha/2).$$
The conservative two-sided level $1 - \alpha$ conformal prediction interval for $y_0$ is
$$\left\{Y : S_{(k_1)}(Y) \le S_0(Y) \le S_{(k_2)}(Y)\right\}. \qquad (119)$$
The argument is the same as for the Hodges–Lehmann interval (112), now with $Y$ playing the role of $\Delta$ and $S_0(Y)$ the role of the permutation score.
Interval (119) is computationally expensive, since all of the $S_i(Y)$, not just $S_0(Y)$, change with each choice of trial value $Y$. The jackknife conformal interval begins with the jackknife estimates
$$\hat{\mu}_{(i)} = f_{(i)}(x_i), \qquad i = 1, \dots, n,$$
where $f_{(i)}$ is constructed from $d_{(i)}$, that is, $d$ (10) with $(x_i, y_i)$ deleted. The scores (116) are taken to be
$$S_i = s\left(y_i, \hat{\mu}_{(i)}\right)$$
for some function $s$, for example $s(y, \hat{\mu}) = |y - \hat{\mu}|$. These are compared with $S_0(Y) = s\left(Y, f(x_0)\right)$, $f$ the full-data rule, and the interval is computed as at (119). Now the score distribution (117) does not depend on $Y$ (nor does $f$), greatly reducing the computational burden.
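A sketch of the jackknife conformal calculation for the polynomial example that follows; the simulated data, the new point x0 = 5, and the signed residual score are illustrative assumptions, not necessarily the article's exact choices.

```r
set.seed(2)
n <- 20
x <- sort(runif(n, 0, 10))                    # same illustrative setup as the earlier sketch
y <- 2 * sin(x / 2) + rnorm(n, sd = sqrt(2))
x0 <- 5                                       # new point at which to predict (assumption)

# jackknife estimates mu_(i) = f_(i)(x_i): refit the fourth-degree polynomial without case i
mu.jack <- sapply(1:n, function(i) {
  fit.i <- lm(y[-i] ~ poly(x[-i], 4, raw = TRUE))
  sum(coef(fit.i) * x[i]^(0:4))
})
S <- sort(y - mu.jack)                        # ordered signed scores S_(1) <= ... <= S_(n)

fit  <- lm(y ~ poly(x, 4, raw = TRUE))        # full-data rule
f.x0 <- sum(coef(fit) * x0^(0:4))             # prediction at x0

# The n points f.x0 + S cut the line into n + 1 intervals, each holding y0 with
# probability 1/(n+1); dropping the two semi-infinite end intervals gives the interval below
interval <- f.x0 + c(S[1], S[n])
c(lower = interval[1], upper = interval[2], conformal.prob = (n - 1) / (n + 1))
```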
The jackknife conformal interval at a new point $x_0$ was calculated for the small example of Figure 2, using the signed score $s(y, \hat{\mu}) = y - \hat{\mu}$ for the $S_i$. For this choice of scoring function, interval (119) takes the form $\left[f(x_0) + S_{(k_1)},\ f(x_0) + S_{(k_2)}\right]$, with the corresponding conformal probability. The square dots in Figure 8 are the values $f(x_0) + S_{(k)}$ for $k = 1, \dots, 20$, with $y_0$ having probability $1/21$ of falling into each of the 21 intervals they determine. Conformal probability for the full range of the dots is $19/21 \doteq 0.90$.