1. Introduction
Consider regression data $(Y_1, X_1), \ldots, (Y_n, X_n)$. The response $Y_i$ is assumed univariate while the regressor $X_i$ is a $p$-dimensional vector. Note that the entries of $X_i$ are in general real-valued but they can also be discrete, i.e., taking values on a countable subset of the real line; however, we will not consider categorical regressors as in [1].
Suppose that the $X_1, \ldots, X_n$ were chosen (or randomly generated; see Assumption 1 in what follows) as taking the values $x_1, \ldots, x_n$ respectively, and consider the usual linear regression model:
$$Y_i = x_i'\beta + \epsilon_i \quad \text{for } i = 1, \ldots, n, \tag{1}$$
where the $\epsilon_i$ are independent, identically distributed (i.i.d.) from some distribution F with $E\epsilon_i = 0$ and $E\epsilon_i^2 = \sigma^2 < \infty$.
The goal is to predict a future response $Y_f$ associated with a particular regressor value $x_f$ of interest. Furthermore, we wish to construct a data-based prediction interval (PI), say $[\hat L, \hat U]$, so that for some chosen $\alpha \in (0,1)$ we have
$$P\left(Y_f \in [\hat L, \hat U] \mid X_f = x_f\right) \approx 1-\alpha, \tag{2}$$
where the approximation is typically asymptotic as more data accrue. A review of the state-of-the-art of PI construction in different settings is given by [2].
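At first glance, this exercise seems to be a straightforward application of employing the quantiles of the conditional distribution $D_x(y) = P(Y_f \le y \mid X_f = x)$, which in many settings admits a consistent estimator, say $\hat D_x(y)$. Letting $G^{-1}$ denote the quantile inverse of a distribution G, we can construct the ‘oracle’ equal-tailed PI $[D_{x_f}^{-1}(\alpha/2),\ D_{x_f}^{-1}(1-\alpha/2)]$, and its estimated version $[\hat D_{x_f}^{-1}(\alpha/2),\ \hat D_{x_f}^{-1}(1-\alpha/2)]$. If $D_{x_f}(y)$ is continuous and strictly increasing for y in the neighborhood of the two quantiles of interest, i.e., $D_{x_f}^{-1}(a)$ for $a$ equal to $\alpha/2$ and $1-\alpha/2$, then
$$P\left(Y_f \in [D_{x_f}^{-1}(\alpha/2),\ D_{x_f}^{-1}(1-\alpha/2)] \mid X_f = x_f\right) = 1-\alpha \tag{3}$$
and
$$P\left(Y_f \in [\hat D_{x_f}^{-1}(\alpha/2),\ \hat D_{x_f}^{-1}(1-\alpha/2)] \mid X_f = x_f\right) \to 1-\alpha \quad \text{as } n \to \infty, \tag{4}$$
since $\hat D_{x_f}^{-1}(a)$ will be a consistent estimator of $D_{x_f}^{-1}(a)$; see e.g., Lemma 1.2.1 of [3].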
Nevertheless, despite the asymptotic validity of Equation (4), the estimated quantile-based PI is well known to suffer from finite-sample undercoverage, i.e.,
$$P\left(Y_f \in \text{estimated quantile-based PI} \mid X_f = x_f\right) < 1-\alpha. \tag{5}$$
This undercoverage has been illustrated empirically time and again (see e.g., [4] and the references therein), but recently some theoretical explanations have also been given; see [5,6]. The implicit reason for the finite-sample undercoverage is that by plugging in the estimated quantiles as if they were true, the estimated PI fails to capture the variability of these estimated quantiles, i.e., it neglects part of the variance of the problem at hand. In other words, the quantile-based PI is not pertinent; this term was coined by [7] to describe PIs that encapsulate the variability of all estimated quantities figuring in the PI construction.
Starting with the work of [8], there has been increased interest in PIs obtained via conformal prediction; see also [9]. Conformal prediction was originally developed for i.i.d. data. Modelling the regression pairs $(Y_i, X_i)$ as i.i.d., a conformal prediction PI can be constructed that satisfies
$$P\left(Y_f \in \text{conformal PI}\right) \ge 1-\alpha. \tag{6}$$
However, the above finite-sample coverage guarantee is on the average, over all possible regressor values $X_f$. If the practitioner desires $(1-\alpha)100\%$ coverage conditionally on a particular regressor value $x_f$, then the exchangeability of the pairs $(Y_1, X_1), \ldots, (Y_n, X_n), (Y_f, X_f)$ breaks down and the finite-sample guarantee of (6) is lost; see [10,11]. There is an active body of literature developing variations of the conformal prediction theme to produce PIs that are asymptotically valid conditionally on $x_f$ as in Equation (2); see e.g., [12,13,14], as well as [15] and the references therein. However, none of these variations yields pertinent PIs; see [6] for a discussion.
Note that if the regression errors $\epsilon_i$ can be assumed to be Gaussian, then a pertinent PI with exact coverage is available; see Equation (A1) in Appendix A. However, in the Big Data Era, the assumption of Gaussianity is typically seen to be unjustifiable, and often just plain wrong. Without having to resort to unnecessarily restrictive assumptions such as Gaussianity, the only general way so far to obtain pertinent PIs has been via some form of bootstrap. Furthermore, to achieve pertinence, the resampling procedure must be carefully crafted so that it captures the variability of all estimated quantities including the variability of estimating the point predictor; see e.g., Ch. 3.7 of [4].
Although the bootstrap approach gives generally good results in finite samples (as well as asymptotically), it can be quite computationally expensive as it involves re-fitting the regression over M bootstrap pseudo-samples; note that each pseudo-sample entails a new scatterplot of n data points. In the paper at hand, we describe a shortcut that gives pertinent PIs based on a limited Monte Carlo simulation as opposed to full-scale resampling. The new method is introduced in Section 3 in the linear regression framework that is outlined in Section 2. Finally, Section 4 reports the results of a small numerical experiment, and provides some practical recommendations. A brief Conclusions section is also provided, as well as an Appendix A that intuitively describes the notion of pertinence.
2. Linear Regression Framework
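We can re-write the linear regression model of Equation (1) as
$$\underline{Y} = X\beta + \underline{\epsilon},$$
where $\underline{Y} = (Y_1, \ldots, Y_n)'$ and $\underline{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)'$ are $n \times 1$ random vectors, $\beta$ is a $p \times 1$ deterministic parameter vector, and X is an $n \times p$ design matrix with ith row given by vector $x_i'$. Throughout the paper we will also assume: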
Assumption 1. Assume that the design matrix X is either deterministic or that inference is conducted conditionally on the vectors $X_1, \ldots, X_n$ that are assumed independent of the error vector $\underline{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)'$. Hence, without this being explicitly denoted, all probabilities and expectations will be tacitly understood to be conditional on the event $\{X_i = x_i \text{ for } i = 1, \ldots, n\}$ where $x_1, \ldots, x_n$ are some fixed values.
Consequently, the coverage of a PI will be evaluated conditionally on
as well as
as compared to the probability measure
of [
6].
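Let $\hat\beta$ be an estimator of $\beta$ that is linear in the data $\underline{Y} = (Y_1, \ldots, Y_n)'$, i.e., $\hat\beta = B\underline{Y}$ for some matrix B; the matrix B will typically depend on X but this will not be explicitly denoted. For example, if X has full rank, we can let $B = (X'X)^{-1}X'$ so that $\hat\beta$ is the Least Squares (LS) estimator. If X is not of full rank, we can let $B = (X'X + \lambda I_p)^{-1}X'$ so that $\hat\beta$ is a ridge regression estimator; here $I_p$ is the $p \times p$ identity matrix and $\lambda$ is a (small) positive constant. Other examples are readily available in the literature.
Let $\hat\beta_{(i)}$ be the same estimator based on the dataset that has the pair $(Y_i, x_i)$ omitted. In other words, $\hat\beta_{(i)} = B_{(i)}\underline{Y}_{(i)}$ where $\underline{Y}_{(i)}$ is $\underline{Y}$ with $Y_i$ omitted, and $B_{(i)}$ is B with its ith column omitted. The predictive and fitted residuals ($\tilde\epsilon_i$ and $\hat\epsilon_i$ respectively) corresponding to data point $(Y_i, x_i)$ are defined in the usual manner, i.e., $\tilde\epsilon_i = Y_i - x_i'\hat\beta_{(i)}$ and $\hat\epsilon_i = Y_i - x_i'\hat\beta$.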
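To make the linear form $\hat\beta = B\underline{Y}$ concrete, the following NumPy sketch builds the matrix B for the LS and ridge cases. It is only an illustration: the helper names `ls_matrix` and `ridge_matrix`, the value of `lam`, and the toy data are assumptions that do not come from the paper, and the ridge form $(X'X + \lambda I)^{-1}X'$ is the standard textbook one.

```python
import numpy as np

def ls_matrix(X):
    """B = (X'X)^{-1} X' for a full-rank design (hypothetical helper)."""
    return np.linalg.solve(X.T @ X, X.T)

def ridge_matrix(X, lam):
    """B = (X'X + lam*I)^{-1} X'; well defined even if X is rank deficient."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Toy data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

# LS estimator written as a linear function of the data.
beta_ls = ls_matrix(X) @ y

# Duplicate a column so that X'X is singular; ridge still yields an estimate.
X_deficient = np.column_stack([X, X[:, 1]])
beta_ridge = ridge_matrix(X_deficient, lam=0.01) @ y
```

In both cases the estimator is a fixed matrix applied to the response vector, which is the only property of $\hat\beta$ that the later variance calculation relies on.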
Remark 1 (LS residuals)
. In the LS case where $B = (X'X)^{-1}X'$, the predictive residuals can be easily obtained without re-fitting the regression since $\tilde\epsilon_i = \hat\epsilon_i/(1-h_i)$ where $h_i$ is the ith diagonal element of the ‘hat’ matrix $H = X(X'X)^{-1}X'$; see Theorem 10.1 of [16]. In this setting, we can also use the so-called studentized residuals $\check\epsilon_i = \hat\epsilon_i/\sqrt{1-h_i}$ as defined by [17]. Assuming that the regression has an intercept term, Equation (10.12) of [16] further implies $1/n \le h_i \le 1$, yielding the ordering $|\hat\epsilon_i| \le |\check\epsilon_i| \le |\tilde\epsilon_i|$ for all i, i.e., the predictive residuals have the largest scale while the fitted residuals have the smallest.
Recall that the $L_2$-optimal point predictor of $Y_f$ given $x_f$ is the conditional expectation $E(Y_f \mid x_f) = x_f'\beta$. Since $\beta$ is unknown, our practical point predictor of $Y_f$ will be $\hat Y_f = x_f'\hat\beta$. In order to construct a pertinent PI, the goal is to approximate the distribution of the so-called ‘root’ $Y_f - \hat Y_f$. If we were to do so via bootstrap, one would proceed by resampling the residuals and creating new pseudo-scatterplots via the (fitted) model (1). In this setting, the transformation-based approach of [4] suggested resampling the predictive residuals $\tilde\epsilon_i$, as opposed to the fitted residuals $\hat\epsilon_i$, in order to alleviate the undercoverage of bootstrap PIs first pointed out by [18]. The predictive residuals will be found useful in what follows as well.
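The three residual types of the LS setting can be computed from a single fit. The sketch below is a minimal illustration under stated assumptions: the function name `ls_residuals` and the simulated data are hypothetical, and "studentized" is taken to mean the fitted residual divided by $\sqrt{1-h_i}$ (without division by an error scale estimate), which is the reading consistent with the scale ordering in Remark 1.

```python
import numpy as np

def ls_residuals(X, y):
    """Fitted, studentized and predictive residuals from one least-squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = y - X @ beta_hat
    # Leverages h_i = diagonal of the hat matrix X (X'X)^{-1} X'.
    # With the thin QR decomposition X = QR, the hat matrix equals QQ'.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    studentized = fitted / np.sqrt(1.0 - h)   # assumed definition, see lead-in
    predictive = fitted / (1.0 - h)           # leave-one-out shortcut, no re-fitting
    return beta_hat, h, fitted, studentized, predictive

# Simulated data (assumed design: intercept plus standard normal regressors).
rng = np.random.default_rng(0)
n, p = 50, 10
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.ones(p) + rng.standard_normal(n)
beta_hat, h, fitted, studentized, predictive = ls_residuals(X, y)

# Cross-check the shortcut against an explicit refit that omits data point 0.
mask = np.arange(n) != 0
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
loo_resid_0 = y[0] - X[0] @ beta_loo
```

The explicit refit is included only to confirm that the leverage-based formula reproduces the leave-one-out residual; in practice the shortcut avoids n separate regressions.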
3. Pertinent Prediction Intervals
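Consider again the root $Y_f - \hat Y_f$, where $\hat Y_f = x_f'\hat\beta$. Model (1) implies that
$$Y_f = x_f'\beta + \epsilon_f,$$
where $\epsilon_f$ is drawn from F independently of $\epsilon_1, \ldots, \epsilon_n$. Hence,
$$Y_f - \hat Y_f = \epsilon_f + e_f,$$
where $e_f = x_f'(\beta - \hat\beta)$ is the estimation error. Standard regression conditions typically ensure that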
as
. For instance, if
is the LS estimator in a well-specified model, then
. Further conditions on the design matrix are available to ensure the 2nd part of Equation (
7); see [
16].
Note that Equation (7) implies that $e_f \to 0$ in probability as $n \to \infty$. It is tempting to treat $e_f$ as negligible, and hence construct a simple quantile-based PI for $Y_f$ as
$$\left[\hat Y_f + \hat F^{-1}(\alpha/2),\ \hat Y_f + \hat F^{-1}(1-\alpha/2)\right], \tag{8}$$
where $\hat F$ is a (uniformly) consistent estimator of F. For example, $\hat F$ can be the empirical distribution function (e.d.f.) of the residuals (fitted, predictive or studentized) after centering them to ensure that $\hat F$ is a distribution with mean zero. If F is continuous and strictly increasing in the neighborhood of the two quantiles of interest, namely $F^{-1}(\alpha/2)$ and $F^{-1}(1-\alpha/2)$, then the latter are estimated consistently by $\hat F^{-1}(\alpha/2)$ and $\hat F^{-1}(1-\alpha/2)$ respectively, implying that the quantile-based PI (8) has asymptotic coverage $1-\alpha$. However, PI (8) will be referred to as naive in what follows because it ignores the term $e_f$ and its variability, and typically exhibits finite-sample undercoverage.
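For reference, the naive PI (8) takes only a few lines to compute. The following is a sketch under assumptions: the names `edf_quantile` and `naive_pi`, the use of fitted residuals, and the simulated data are illustrative choices, and the quantile is taken as the generalized inverse of the e.d.f. rather than an interpolated quantile.

```python
import numpy as np

def edf_quantile(values, a):
    """Generalized inverse of the e.d.f.: smallest value v with F_hat(v) >= a."""
    v = np.sort(values)
    k = max(int(np.ceil(a * len(v))), 1)
    return v[k - 1]

def naive_pi(X, y, x_f, alpha=0.10):
    """Naive quantile-based PI: point predictor plus residual quantiles."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    resid = resid - resid.mean()      # center so the e.d.f. has mean zero
    y_hat_f = x_f @ beta_hat
    return (y_hat_f + edf_quantile(resid, alpha / 2),
            y_hat_f + edf_quantile(resid, 1 - alpha / 2))

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, p = 50, 10
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.ones(p) + rng.standard_normal(n)
x_f = np.ones(p)
lower, upper = naive_pi(X, y, x_f)
```

Nothing in this construction accounts for the estimation error $e_f$, which is exactly why the interval tends to be too short in small samples.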
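Recall that $e_f$ is a function of $\epsilon_1, \ldots, \epsilon_n$, which are independent of $\epsilon_f$. Hence, the distribution of $Y_f - \hat Y_f$ is given by
$$G = F * F_e,$$
where ∗ denotes convolution and $F_e$ is the distribution of $e_f$. Consequently,
$$P\left(G^{-1}(\alpha/2) \le Y_f - \hat Y_f \le G^{-1}(1-\alpha/2)\right) = 1-\alpha, \tag{9}$$
assuming G is continuous and strictly increasing at its tail quantiles $G^{-1}(\alpha/2)$ and $G^{-1}(1-\alpha/2)$. Equation (9) then suggests the ‘oracle’ PI
$$\left[\hat Y_f + G^{-1}(\alpha/2),\ \hat Y_f + G^{-1}(1-\alpha/2)\right]. \tag{10}$$
The above oracle PI has exact coverage $1-\alpha$ but it involves the two unknown distributions, F and $F_e$. To construct a practical pertinent PI we propose to replace F and $F_e$ by respective estimators $\hat F$ and $\hat F_e$.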
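$\hat F$ was already mentioned; as regards $\hat F_e$, recall that regularity conditions typically ensure that $\hat\beta$ is asymptotically normal, and therefore
$$\frac{e_f}{\sqrt{\operatorname{Var}(e_f)}} \to N(0,1) \quad \text{in distribution as } n \to \infty. \tag{11}$$
Coupled with the fact that $\operatorname{Var}(e_f) = \sigma^2\, x_f' B B' x_f$, we are led to defining $\hat F_e$ as a normal distribution with mean zero and variance $\hat\sigma^2\, x_f' B B' x_f$; here, $\hat\sigma^2$ is a consistent estimator of $\sigma^2$, e.g., the sample variance of the residuals (fitted, predictive or studentized). Although $\hat F$ may be a discontinuous distribution, the continuity of $\hat F_e$ implies that our estimator $\hat G = \hat F * \hat F_e$ is a continuous, strictly increasing distribution that is (uniformly) consistent for G.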
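The variance assigned to the normal estimator of the distribution of the estimation error can be checked directly from the linearity of the estimator. The short derivation below is a sketch that assumes $\hat\beta = B\underline{Y}$, errors with $\operatorname{Var}(\underline{\epsilon}) = \sigma^2 I_n$, the sign convention $e_f = x_f'(\beta - \hat\beta)$, and conditioning on the design as in Assumption 1.

```latex
\begin{aligned}
\operatorname{Var}(\hat\beta)
  &= B\,\operatorname{Var}(\underline{Y})\,B' = \sigma^2\, B B', \\
\operatorname{Var}(e_f)
  &= x_f'\,\operatorname{Var}(\hat\beta)\,x_f = \sigma^2\, x_f' B B' x_f, \\
\text{LS case, } B = (X'X)^{-1}X' :\quad
\operatorname{Var}(e_f)
  &= \sigma^2\, x_f' (X'X)^{-1} x_f .
\end{aligned}
```

In the LS case the last line is the familiar extra term in the classical Gaussian prediction variance $\sigma^2\{1 + x_f'(X'X)^{-1}x_f\}$, which the naive interval omits.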
The following lemma puts it all together.
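Lemma 1. Assume conditions ensuring (7) and (11), and the (uniform) consistency of $\hat F$ and $\hat F_e$ for F and $F_e$ respectively. If G is continuous and strictly increasing at its tail quantiles $G^{-1}(\alpha/2)$ and $G^{-1}(1-\alpha/2)$, then
$$P\left(\hat G^{-1}(\alpha/2) \le Y_f - \hat Y_f \le \hat G^{-1}(1-\alpha/2)\right) \to 1-\alpha \tag{12}$$
as $n \to \infty$.
Equation (12) leads us to construct our novel pertinent PI as
$$\left[\hat Y_f + \hat G^{-1}(\alpha/2),\ \hat Y_f + \hat G^{-1}(1-\alpha/2)\right]. \tag{13}$$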
Note that the Central Limit Theorem (CLT) of Equation (11), as well as the approximations of F and $F_e$ by $\hat F$ and $\hat F_e$, do require a reasonably large sample. But even with n as small as 100 these approximations can be good enough for practical application; see Section 4 for some numerical evidence. Moreover, recall that assumption (7) implies that $e_f = O_p(1/\sqrt{n})$, which is not negligible with n about 100; hence the importance of actually including it in the PI construction. The following remark elaborates.
Remark 2 (Asymptotic validity)
. Lemma 1 implies that the pertinent PI (13) has asymptotic coverage $1-\alpha$ conditionally on $x_f$. Although consistency is a sine qua non, it does not tell the whole story; recall that even the naive PI (8) is consistent. The PI (13) is called pertinent because it captures the variance of the estimation error $e_f$ which the naive PI (8) treats as negligible. As shown in the simulation results of Section 4, the pertinent PI (13) alleviates the finite-sample undercoverage of the naive PI (8).
In principle, the convolution $\hat F * \hat F_e$ could be carried out numerically; however, the following Monte Carlo algorithm makes its implementation straightforward.
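1. Compute the estimator $\hat\beta$, the point predictor $\hat Y_f = x_f'\hat\beta$, and the residuals of choice (fitted, predictive or studentized).
2. Compute an estimator $\hat\sigma^2$ of $\sigma^2$; e.g., $\hat\sigma^2$ could be the sample variance of the chosen residuals. [In the case of Least Squares, the unbiased estimator $(n-p)^{-1}\sum_{i=1}^n \hat\epsilon_i^2$, based on the fitted residuals $\hat\epsilon_i$, could also be entertained.]
3. Compute an estimator $\hat F$ of F; e.g., $\hat F$ can be the e.d.f. of the residuals of choice after centering to ensure that $\hat F$ is a distribution with mean zero.
4. Choose a large integer M (not less than 1000, say), and generate $\epsilon_1^*, \ldots, \epsilon_M^*$ i.i.d. from $\hat F$; the $\epsilon_i^*$s can be thought of as pseudo-residuals. [If $\hat F$ is an e.d.f. that puts mass $1/n$ on values $r_1, \ldots, r_n$ (say), then generating $\epsilon_1^*, \ldots, \epsilon_M^*$ can be done by drawing randomly (with replacement) from the set $\{r_1, \ldots, r_n\}$.]
5. Generate $e_1^*, \ldots, e_M^*$ i.i.d. from a normal distribution with mean zero and variance $\hat\sigma^2\, x_f' B B' x_f$.
6. Define $R_i^* = \epsilon_i^* + e_i^*$ for $i = 1, \ldots, M$, and compute $\hat G^{-1}(a)$ as the $a$ sample quantile of $R_1^*, \ldots, R_M^*$, i.e., $\hat G^{-1}(a) = R_{(k)}^*$ with $k \approx aM$, where $R_{(1)}^* \le \cdots \le R_{(M)}^*$ are the order statistics.
7. Use the above $\hat G^{-1}(a)$ with $a$ equal to $\alpha/2$ and $1-\alpha/2$ to construct the pertinent PI (13).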
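The Monte Carlo algorithm can be sketched in NumPy as follows. This is a minimal illustration for the LS case, not a reference implementation: the function name `pertinent_pi`, its arguments, and the simulated data are assumptions; the variance $\hat\sigma^2 x_f' B B' x_f$ follows from $\hat\beta = B\underline{Y}$ with i.i.d. errors; "studentized" means division by $\sqrt{1-h_i}$; and the quantile index is taken as $\lceil aM \rceil$, which is one of several conventions.

```python
import numpy as np

def pertinent_pi(X, y, x_f, alpha=0.10, M=1000, residuals="predictive", rng=None):
    """Monte Carlo sketch of a pertinent PI for least squares (assumed interface)."""
    rng = np.random.default_rng() if rng is None else rng
    # Estimator written as beta_hat = B y with B = (X'X)^{-1} X'.
    B = np.linalg.solve(X.T @ X, X.T)
    beta_hat = B @ y
    y_hat_f = x_f @ beta_hat
    fitted = y - X @ beta_hat
    h = np.einsum("ij,ji->i", X, B)            # leverages: diagonal of X B
    if residuals == "fitted":
        r = fitted
    elif residuals == "studentized":
        r = fitted / np.sqrt(1.0 - h)
    elif residuals == "predictive":
        r = fitted / (1.0 - h)
    else:
        raise ValueError("residuals must be 'fitted', 'studentized' or 'predictive'")
    # Center the residuals; their sample variance estimates sigma^2.
    r = r - r.mean()
    sigma2_hat = r.var()
    # Pseudo-residuals: one draw of size M (with replacement) from the e.d.f.
    eps_star = rng.choice(r, size=M, replace=True)
    # Normal draws standing in for the estimation error e_f.
    var_e = sigma2_hat * (x_f @ B @ B.T @ x_f)
    e_star = rng.normal(0.0, np.sqrt(var_e), size=M)
    # Sample quantiles of the simulated root.
    root_star = np.sort(eps_star + e_star)
    k_lo = max(int(np.ceil(alpha / 2 * M)), 1)
    k_hi = max(int(np.ceil((1 - alpha / 2) * M)), 1)
    return y_hat_f + root_star[k_lo - 1], y_hat_f + root_star[k_hi - 1]

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, p = 50, 10
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.ones(p) + rng.standard_normal(n)
x_f = np.ones(p)
lower, upper = pertinent_pi(X, y, x_f, rng=np.random.default_rng(1))
```

The regression is fitted once; the only simulation is the pair of size-M draws, which is where the computational saving relative to a full residual bootstrap comes from.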
Remark 3 (Choice of residuals)
. In the LS setting of Remark 1, note that our Equation (7) implies that the leverages satisfy $h_i \to 0$ as $n \to \infty$, and hence all residuals (fitted, predictive or studentized) are asymptotically equivalent. Therefore, asymptotic considerations cannot help us in choosing which residuals to employ for the construction of the estimators $\hat F$ and $\hat\sigma^2$ in the above algorithm. Note, however, that the validity of Equation (7) hinges on either p being finite, or at least that $p/n \to 0$. For example, if p is of the same order of magnitude as n, then the $h_i$s will not be close to 0 even for large n, and the fitted, predictive and studentized residuals will remain distinct. The numerical experiment of Section 4 will provide some finite-sample guidance on our choice of residuals and the effect of the relative magnitude of p to n.
Remark 4 (Computational cost as compared to bootstrap)
. Note that sampling from $\hat F$ is effectively a bootstrap procedure. However, in the above Monte Carlo algorithm, sampling from $\hat F$ is done only once, i.e., we are drawing just one bootstrap sample of size M. Suppose that the estimators $\hat\beta$, $\hat F$ and $\hat\sigma^2$ can be computed in $O(n^c)$ operations for some constant $c$. Then, the above algorithm has computational cost of $O(n^c + M)$ operations. By contrast, the pertinent version of the residual-based bootstrap requires drawing M resamples each of size n from $\hat F$; using each of the M resamples as a set of pseudo-residuals, a new pseudo-scatterplot is created upon which $\hat\beta$, $\hat F$, etc. are recomputed; see the MF/MB procedure in Part II of [4] for more details. Hence, the pertinent version of the residual-based bootstrap requires $O(M n^c)$ operations, which is more expensive by orders of magnitude. In the realistic case where $n^c \ge M$, and since M is taken not less than 1000, the above Monte Carlo algorithm can be a thousand times faster than the pertinent version of the residual bootstrap.
To elaborate, consider the popular example of LASSO regression, whose operation count is given in [19] under a restriction on p relative to n. When that count is $O(n^c)$ and $M = 1000 \le n^c$, our Monte Carlo algorithm is computable in $O(n^c)$ operations, i.e., the same order as computing a single LASSO regression. By contrast, the residual-based bootstrap requires $O(1000\, n^c)$ operations, i.e., it is like fitting 1000 LASSO regressions.
4. Numerical Results
We now conduct a small numerical experiment with three goals: (a) to show that the pertinent PI (denoted PPI) of Equation (
13) has better finite-sample coverage than the naive PI of Equation (
8); (b) to provide some finite-sample guidance on which type of residuals to employ in the PPI algorithm; and (c) to see the effect of the magnitude of
p (relative to
n) in our choice of residuals, and the resulting PPI performance.
Throughout this section, we will focus on the LS estimator $\hat\beta = (X'X)^{-1}X'\underline{Y}$ where X is an $n \times p$ matrix having a column of 1s as its first column.
The structure of our numerical experiment is as follows:
1. Choose p and n; for our main results we will employ the following combinations for $(p, n)$: (10,50), (20,50), (20,100), (40,100).
2. Generate all the elements of X (other than its first column) as i.i.d. standard normal.
3. Choose the error distribution F; for our purposes we will take F as either standard normal or Laplace (i.e., two-sided exponential) with mean zero and variance one.
4. Choose $\beta$, the level $\alpha$, and the regressor of future interest $x_f$; for our purposes, $\beta$ is generated as $(\beta_1, \ldots, \beta_p)'$ where $\beta_1, \ldots, \beta_p$ are i.i.d. Uniform, the level $\alpha$ is chosen as 0.10, and $x_f'$ is a row of 1s.
5. Generate $\epsilon_1, \ldots, \epsilon_n$ i.i.d. from F, and use them to generate $Y_1, \ldots, Y_n$ from model (1).
6. Fit the data via LS and compute $\hat\beta$ and $\hat Y_f$; also compute the three sets of residuals: fitted, studentized and predictive.
7. Calculate four PIs for $Y_f$: PI.naive of Equation (8), and three versions of the PPI of Equation (13): using the fitted residuals (PPI.fitted), using the studentized residuals (PPI.stud), and using the predictive residuals (PPI.pred). The three PPIs employ the Monte Carlo algorithm of Section 3 with a large M. Compute the lengths (LEN) of the four PIs.
8. Further generate $\epsilon_{f,1}, \ldots, \epsilon_{f,500}$ i.i.d. from F, and use them to generate 500 realizations of $Y_f$ to be used to test the PI coverage. The kth realization is generated as $Y_{f,k} = x_f'\beta + \epsilon_{f,k}$.
9. Calculate the coverage (CVR) of each PI using the 500 realizations of $Y_f$, i.e., the CVR of a PI is $\frac{1}{500}\sum_{k=1}^{500} \mathbf{1}\{Y_{f,k} \in \text{PI}\}$ where $\mathbf{1}\{\cdot\}$ denotes the indicator function.
10. Repeat steps 5 to 9 many times (200 in our case). Each of these replications has its own CVR[j] and LEN[j] associated with each PI.
11. Finally:
 - (a) Calculate the average CVR of each method over the 200 replications.
 - (b) Plot a histogram of CVR[j] for $j = 1, \ldots, 200$ to portray the CVR variability, and compute what percentage of these CVRs fall below the target level of $1-\alpha = 0.90$.
 - (c) Calculate the average LEN of each method over the 200 replications; also calculate the variability of LEN via its sample standard deviation.
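The experimental loop above can be sketched compactly. The code below is an illustrative reduction under assumptions, not the paper's simulation code: it compares only the naive PI (fitted residuals) with the pertinent PI based on predictive residuals, uses standard normal errors, draws $\beta$ from Uniform(0, 1) (an assumed range), and all function names and defaults are hypothetical.

```python
import numpy as np

def edf_quantile(sorted_vals, a):
    """Inverse e.d.f. of an already sorted sample."""
    k = max(int(np.ceil(a * len(sorted_vals))), 1)
    return sorted_vals[k - 1]

def two_intervals(X, y, x_f, alpha, M, rng):
    """Return (naive PI from fitted residuals, PPI from predictive residuals)."""
    B = np.linalg.solve(X.T @ X, X.T)
    beta_hat = B @ y
    y_hat_f = x_f @ beta_hat
    fitted = y - X @ beta_hat
    h = np.einsum("ij,ji->i", X, B)                 # leverages
    f = np.sort(fitted - fitted.mean())
    naive = (y_hat_f + edf_quantile(f, alpha / 2),
             y_hat_f + edf_quantile(f, 1 - alpha / 2))
    pred = fitted / (1.0 - h)
    pred = pred - pred.mean()
    sd_e = np.sqrt(pred.var() * (x_f @ B @ B.T @ x_f))
    root = np.sort(rng.choice(pred, size=M) + rng.normal(0.0, sd_e, size=M))
    ppi = (y_hat_f + edf_quantile(root, alpha / 2),
           y_hat_f + edf_quantile(root, 1 - alpha / 2))
    return naive, ppi

def run_experiment(n=50, p=10, alpha=0.10, reps=200, n_test=500, M=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Design with an intercept column; remaining entries standard normal.
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
    beta = rng.uniform(0.0, 1.0, size=p)            # assumed Uniform range
    x_f = np.ones(p)
    cvr = np.empty((reps, 2))
    length = np.empty((reps, 2))
    for j in range(reps):
        y = X @ beta + rng.standard_normal(n)       # new sample
        pis = two_intervals(X, y, x_f, alpha, M, rng)
        y_f = x_f @ beta + rng.standard_normal(n_test)   # test responses
        for m, (lo, hi) in enumerate(pis):
            cvr[j, m] = np.mean((y_f >= lo) & (y_f <= hi))
            length[j, m] = hi - lo
    return cvr, length

cvr, length = run_experiment()
mean_cvr = cvr.mean(axis=0)                 # average CVR per method
share_below = (cvr < 0.90).mean(axis=0)     # share of CVRs under the nominal level
mean_len = length.mean(axis=0)
sd_len = length.std(axis=0, ddof=1)
```

Column 0 of each array refers to the naive PI and column 1 to the predictive-residual PPI; a histogram of each column of `cvr` gives the kind of display described in step 11(b).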
To elaborate on the PI constructions:
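PI.naive and PPI.fitted employ $\hat F$ as the e.d.f. of the fitted residuals $\hat\epsilon_i$. The latter further uses $\hat\sigma^2$ to be the unbiased estimator $(n-p)^{-1}\sum_{i=1}^n \hat\epsilon_i^2$. Recall that having a column of 1s in X ensures $\sum_{i=1}^n \hat\epsilon_i = 0$.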
PPI.stud employs
the e.d.f. of the centered studentized residuals, i.e.,
It also employs .
Finally, PPI.pred employs
the e.d.f. of the centered predictive residuals, i.e.,
It also employs .
The main results of our experiment are presented in Tables 1–9. There were also nine pages of CVR histograms associated with the results of each of the nine tables. To save space, we only present five representative ones in Figures 1–5; the omitted histograms were very similar to the ones included here. Note that each of these histograms is an approximation of the probability distribution of the random variable
$$P\left(Y_f \in \text{PI} \mid Y_1, \ldots, Y_n\right). \tag{14}$$
Recall that per Assumption 1 the above probability is also conditional on the event $\{X_i = x_i \text{ for } i = 1, \ldots, n\}$ without it being explicitly denoted.
Being a function of the sample $Y_1, \ldots, Y_n$, the quantity in Equation (14) is a random variable that varies across samples. Our numerical experiment tries to encapsulate this variability in a way analogous to Figure 2 of [6]. Note that the mean CVR reported in the first column of Tables 1–9 gives the center of each of these histograms, namely
$$E\left[P\left(Y_f \in \text{PI} \mid Y_1, \ldots, Y_n\right)\right] = P\left(Y_f \in \text{PI}\right), \tag{15}$$
which is the expected value of the random variable (14) averaged over all possible samples.
- –
The undercoverage of PI.naive is strongly illustrated in all Tables and Figures.
- –
All three of our proposed PPIs have larger coverage than PI.naive; furthermore, we observe the ordering:
CVR[PPI.fitted] < CVR[PPI.stud] < CVR[PPI.pred]
which is not unexpected in view of Remark 1 that compares the scales of the three types of residuals.
- –
It is clear that PPI.fitted falls short of achieving acceptable coverage for finite samples; this can be attributed to the known tendency of the fitted residuals to underestimate the scale of the errors; see [20].
- –
The preferable methods seem to be either PPI.stud or PPI.pred; the latter appears to overcover in several instances so it can be thought of as conservative.
- –
The variability of the CVRs that is illustrated by the histograms in Figures 1–4 is striking. Also interesting is the high proportion of CVRs (conditional on the sample) that are below the nominal level over the 200 replications. PPI.pred is more conservative in that respect as well, i.e., having the smallest proportion of PIs that undercover.
- –
In our Tables 1–8, the ratio $p/n$ was either 0.2 or 0.4, which cannot be considered small; nor can a sample size of 50 or 100 be considered particularly large. Despite the fact that these setups are quite challenging, both PPI.stud and PPI.pred provided reliable performances (with the latter being more conservative).
- –
Interestingly, there is no appreciable difference in PI performance going from normal errors to Laplace even though in the latter case the LS estimator does not coincide with the maximum likelihood estimator. The only side-effect of the heavier tails of the Laplace distribution appears to be a slightly heavier left tail in some of our histograms.
As discussed in Remark 3, the standard asymptotics kick in when
in which case the fitted, studentized and predictive residuals become approximately equal to each other. In that case the performance of all three PPIs is expected to be good, although PPI.stud and PPI.pred will still be preferable over the other two PIs. To corroborate this claim, we offer the results of an additional simulation in the ‘easy’ case of
and
, i.e., when
is small; as
Table 9 shows, PPI.stud and PPI.pred have mean CVRs of 0.897 and 0.908 respectively that are practically equivalent to the nominal 90%. So are the two methods PPI.stud and PPI.pred equivalent in this ‘easy’ case?
The above question allows us to think harder on our design goals for a successful PI construction. Until now, our target was to have CVR (
15), i.e.,
close to the nominal 90% (and ideally not below the nominal). However,
Figure 5 shows the histogram of the CVR (
14) that has Equation (
15) as its center. In order for the center to be about 90%, the CVR (
14) should take values both below and above 90%. Indeed, as shown in Corollary 4.1 of [
21], PIs based on the residual bootstrap will exhibit large-sample histograms with their median at the nominal level, i.e., asymptotically the histogram will have 50% of occurrences below the nominal and 50% above.
By the Law of Large Numbers, such a histogram will have progressively smaller and smaller range since asymptotically
where the above expectation is taken over the distribution of
. In spite of the above, it seems preferable that a desirable finite-sample histogram of
would have less mass below the nominal 90% (as compared to above the nominal), i.e., only a minority of the CVRs (
14) would be below the nominal 90%. In this sense, PPI.pred appears preferable to PPI.stud even here since, as shown in
Table 9, PPI.stud and PPI.pred have respectively 51% and 38% of occurrences below the nominal.
5. Conclusions
An interesting problem in statistics is the construction of prediction intervals (PIs) in linear regression without assuming Gaussianity, which is often not justifiable. We desire PIs that (a) have asymptotic conditional validity, and (b) are ‘pertinent’, i.e., are able to capture/incorporate the variability associated with estimated quantities employed in the PI construction. Such PIs have traditionally been constructed via resampling.
The paper at hand proposes a shortcut that directly employs the asymptotic normal distribution of relevant estimators, as opposed to a bootstrap histogram, in order to capture their variability. The resulting prediction interval achieves the property of pertinence without full-scale resampling, thus offering computational savings of orders of magnitude over the bootstrap.
Asymptotic conditional validity of the new approach is shown under general conditions using fitted, studentized or predictive residuals. Numerical work confirms that the new method has good finite-sample performance as well, and suggests the use of predictive residuals in order to construct conservative PIs, i.e., resulting in the smallest proportion of PIs that undercover.