Article

Pertinent Prediction Intervals in Linear Regression

by
Dimitris N. Politis
Department of Mathematics and Halicioglu Data Science Institute, University of California San Diego, La Jolla, San Diego, CA 92093, USA
Stats 2026, 9(4), 68; https://doi.org/10.3390/stats9040068 (registering DOI)
Submission received: 7 May 2026 / Revised: 16 June 2026 / Accepted: 17 June 2026 / Published: 25 June 2026

Abstract

In linear regression, a point predictor $\hat{Y}_f$ of a future response $Y_f$ associated with a regressor value of interest $\underline{x}_f$ can easily be constructed. Since $\hat{Y}_f$ will always incur a prediction error, it is desirable to accompany the point predictor by a prediction interval, say $C(\underline{x}_f)$, that will contain the target $Y_f$ with a pre-specified high probability, e.g., 90%. An estimated prediction interval, say $\hat{C}(\underline{x}_f)$, is called pertinent if its construction incorporates the variability of all estimators that are employed in the prediction problem. So far, pertinent prediction intervals have only been constructed via some form of bootstrap. However, resampling can be quite computationally expensive since the estimation/prediction problem has to be re-calculated on a large number of pseudo-scatterplots, each having the same sample size as the original one. The paper at hand proposes a short-cut that directly employs the asymptotic normal distribution of relevant estimators (as opposed to a bootstrap histogram) in order to capture their variability. The resulting prediction interval achieves pertinence without full-scale resampling, thus offering computational savings of orders of magnitude.

1. Introduction

Consider regression data $\{(Y_1, \underline{X}_1), \ldots, (Y_n, \underline{X}_n)\}$. The response $Y_i$ is assumed univariate while the regressor $\underline{X}_i$ is a $p \times 1$ vector. Note that the entries of $\underline{X}_i$ are in general real-valued but they can also be discrete, i.e., taking values on a countable subset of the real line; however, we will not consider categorical regressors as in [1].
Suppose that the $\{\underline{X}_1, \ldots, \underline{X}_n\}$ were chosen (or randomly generated; see Assumption 1 in what follows) as taking the values $\{\underline{x}_1, \ldots, \underline{x}_n\}$ respectively, and consider the usual linear regression model:
$$Y_i = \underline{x}_i' \underline{\beta} + Z_i, \quad \text{for } i = 1, \ldots, n, \tag{1}$$
where the $Z_i$ are independent, identically distributed (i.i.d.) from some distribution $F$ with $E Z_i = 0$ and $E Z_i^2 = \sigma^2$.
The goal is to predict a future response $Y_f$ associated with a particular regressor value $\underline{x}_f$ of interest. Furthermore, we wish to construct a data-based prediction interval (PI), say $\hat{C}(\underline{x}_f)$, so that for some chosen $\alpha \in (0, 1)$ we have
$$P(Y_f \in \hat{C}(\underline{x}_f) \mid \underline{X}_f = \underline{x}_f) \approx 1 - \alpha \tag{2}$$
where the approximation is typically asymptotic as more data accrue. A review of the state-of-the-art of PI construction in different settings is given by [2].
At first glance, this exercise seems to be a straightforward matter of employing the quantiles of the conditional distribution $F_f(y) = P(Y_f \le y \mid \underline{X}_f)$ which in many settings admits a consistent estimator, say $\hat{F}_f(y)$. Letting $G^{-1}(\gamma) = \inf\{x : G(x) \ge \gamma\}$ denote the quantile inverse of a distribution $G$, we can construct the ‘oracle’ equal-tailed PI $C(\underline{x}_f) = [F_f^{-1}(\alpha/2), F_f^{-1}(1 - \alpha/2)]$, and its estimated version $\hat{C}(\underline{x}_f) = [\hat{F}_f^{-1}(\alpha/2), \hat{F}_f^{-1}(1 - \alpha/2)]$. If $F_f(y)$ is continuous and strictly increasing for $y$ in the neighborhood of the two quantiles of interest, i.e., for $\gamma$ equal to $\alpha/2$ and $1 - \alpha/2$, then
$$P(Y_f \in C(\underline{x}_f) \mid \underline{X}_f = \underline{x}_f) = 1 - \alpha \tag{3}$$
and
$$P(Y_f \in \hat{C}(\underline{x}_f) \mid \underline{X}_f = \underline{x}_f) \to 1 - \alpha \quad \text{as } n \to \infty \tag{4}$$
since $\hat{F}_f^{-1}(\gamma)$ will be a consistent estimator of $F_f^{-1}(\gamma)$; see e.g., Lemma 1.2.1 of [3].
Nevertheless, despite the asymptotic validity of Equation (4), the quantile-based PI $\hat{C}(\underline{x}_f)$ is well-known to suffer from finite-sample undercoverage, i.e.,
$$P(Y_f \in \hat{C}(\underline{x}_f) \mid \underline{X}_f = \underline{x}_f) < 1 - \alpha \quad \text{for finite } n. \tag{5}$$
This undercoverage has been illustrated empirically time and again (see e.g., [4] and the references therein), but recently some theoretical explanations have also been given; see [5,6]. The implicit reason for the finite-sample undercoverage is that by plugging in the estimated quantiles as if they were true, the PI $\hat{C}(\underline{x}_f)$ fails to capture the variability of these estimated quantiles, i.e., it neglects part of the variance of the problem at hand. In other words, the quantile-based PI $\hat{C}(\underline{x}_f)$ is not pertinent; this term was coined by [7] to describe PIs that encapsulate the variability of all estimated quantities figuring in the PI construction.
Starting with the work of [8], there has been increased interest in PIs obtained via conformal prediction; see also [9]. Conformal prediction was originally developed for i.i.d. data. Modelling the regression pairs $(Y_1, \underline{X}_1), \ldots, (Y_n, \underline{X}_n)$ as i.i.d., a conformal prediction PI $\tilde{C}(\underline{X}_f)$ can be constructed that satisfies
$$P(Y_f \in \tilde{C}(\underline{X}_f)) \ge 1 - \alpha. \tag{6}$$
However, the above finite-sample coverage guarantee is on the average, over all possible regressor values $\underline{X}_f$. If the practitioner desires $1 - \alpha$ coverage conditionally on a particular regressor value $\underline{X}_f = \underline{x}_f$, then the exchangeability of the pairs $(Y_i, \underline{X}_i)$ breaks down and the finite-sample guarantee of (6) is lost; see [10,11]. There is an active body of literature developing variations of the conformal prediction theme to produce PIs that are asymptotically valid conditionally on $\underline{X}_f = \underline{x}_f$ as in Equation (2); see e.g., [12,13,14], as well as [15] and the references therein. However, none of these variations yields pertinent PIs; see [6] for a discussion.
Note that if the regression errors $Z_i$ can be assumed to be Gaussian, then a pertinent PI with exact coverage is available; see Equation (A1) in Appendix A. However, in the Big Data Era, the assumption of Gaussianity is typically seen to be unjustifiable, and often just plain wrong. Without having to resort to unnecessarily restrictive assumptions such as Gaussianity, the only general way so far to obtain pertinent PIs has been via some form of bootstrap. Furthermore, to achieve pertinence, the resampling procedure must be carefully crafted so that it captures the variability of all estimated quantities including the variability of estimating the point predictor; see e.g., Ch. 3.7 of [4].
Although the bootstrap approach gives generally good results in finite samples (as well as asymptotically), it can be quite computationally expensive as it involves re-fitting the regression over $M$ bootstrap pseudo-samples; note that each pseudo-sample entails a new scatterplot of $n$ data points. In the paper at hand, we describe a shortcut that gives pertinent PIs based on a limited Monte Carlo simulation as opposed to full-scale resampling. The new method is introduced in Section 3 in the linear regression framework that is outlined in Section 2. Finally, Section 4 reports the results of a small numerical experiment, and provides some practical recommendations. A brief Conclusions section is also provided as well as an Appendix A to intuitively describe the notion of pertinence.

2. Linear Regression Framework

We can re-write the linear regression model of Equation (1) as
$$\underline{Y}_n = X \underline{\beta} + \underline{Z}_n$$
where $\underline{Y}_n = (Y_1, \ldots, Y_n)'$ and $\underline{Z}_n = (Z_1, \ldots, Z_n)'$ are $n \times 1$ random vectors, $\underline{\beta}$ is a $p \times 1$ deterministic parameter vector with $p < n$, and $X$ is an $n \times p$ design matrix with $i$th row given by the vector $\underline{X}_i'$. Throughout the paper we will also assume:
Assumption 1. 
Assume that the design matrix $X$ is either deterministic or that inference is conducted conditionally on the $\underline{X}_i$ vectors that are assumed independent of the error vector $\underline{Z}_n$. Hence, without this being explicitly denoted, all probabilities and expectations will be tacitly understood to be conditional on the event $\{\underline{X}_1 = \underline{x}_1, \ldots, \underline{X}_n = \underline{x}_n\}$ where $\{\underline{x}_1, \ldots, \underline{x}_n\}$ are some fixed values.
Consequently, the coverage of a PI will be evaluated conditionally on $\{\underline{X}_1 = \underline{x}_1, \ldots, \underline{X}_n = \underline{x}_n\}$ as well as $\underline{X}_f = \underline{x}_f$ as compared to the probability measure $P_2$ of [6].
Let $\hat{\underline{\beta}}$ be an estimator of $\underline{\beta}$ that is linear in the data $\underline{Y}_n$, i.e., $\hat{\underline{\beta}} = B \underline{Y}_n$ for some $p \times n$ matrix $B$; the matrix $B$ will typically depend on $X$ but this will not be explicitly denoted. For example, if $X$ has full rank, we can let $B = (X'X)^{-1} X'$ so that $\hat{\underline{\beta}}$ is the Least Squares (LS) estimator. If $X$ is not of full rank, we can let $B = (X'X + \lambda I_p)^{-1} X'$ so that $\hat{\underline{\beta}}$ is a ridge regression estimator; here $I_p$ is the $p \times p$ identity matrix and $\lambda$ is a (small) positive constant. Other examples are readily available in the literature.
Let $\hat{\underline{\beta}}^{(i)}$ be the same estimator based on the dataset that has the pair $(Y_i, \underline{X}_i)$ omitted. In other words, $\hat{\underline{\beta}}^{(i)} = B^{(i)} \underline{Y}_n^{(i)}$ where $\underline{Y}_n^{(i)}$ is $\underline{Y}_n$ with $Y_i$ omitted, and $B^{(i)}$ is $B$ with its $i$th column omitted. The predictive and fitted residuals ($\tilde{z}_i$ and $z_i$ respectively) corresponding to data point $Y_i$ are defined in the usual manner, i.e., $\tilde{z}_i = Y_i - \underline{x}_i' \hat{\underline{\beta}}^{(i)}$ and $z_i = Y_i - \underline{x}_i' \hat{\underline{\beta}}$.
Remark 1 
(LS residuals). In the LS case where $\hat{\underline{\beta}} = (X'X)^{-1} X' \underline{Y}_n$ the predictive residuals can be easily obtained without re-fitting the regression since $\tilde{z}_i = z_i / (1 - h_i)$ where $h_i = \underline{x}_i' (X'X)^{-1} \underline{x}_i$ is the $i$th diagonal element of the ‘hat’ matrix $X (X'X)^{-1} X'$; see Theorem 10.1 of [16]. In this setting, we can also use the so-called studentized residuals $\hat{z}_i = z_i / \sqrt{1 - h_i}$ as defined by [17]. Assuming that the regression has an intercept term, Equation (10.12) of [16] further implies $1/n \le h_i \le 1$, yielding the ordering $|z_i| \le |\hat{z}_i| \le |\tilde{z}_i|$ for all $i$, i.e., the predictive residuals have the largest scale while the fitted residuals have the smallest.
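The identities in Remark 1 can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the function name `ls_residuals` and all numerical settings are illustrative choices, not part of the paper.

```python
import numpy as np

def ls_residuals(X, y):
    """Fitted, studentized and predictive residuals of a Least Squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = y - X @ beta_hat
    # h_i = x_i' (X'X)^{-1} x_i, the diagonal of the hat matrix
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    studentized = fitted / np.sqrt(1.0 - h)
    predictive = fitted / (1.0 - h)
    return fitted, studentized, predictive, h

# Simulated data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n, p = 50, 10
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ rng.uniform(0, 3, size=p) + rng.standard_normal(n)
fitted, studentized, predictive, h = ls_residuals(X, y)

# Compare the shortcut z_i / (1 - h_i) with an explicit leave-one-out re-fit
i = 7
mask = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
loo_residual = y[i] - X[i] @ beta_loo
print(loo_residual, predictive[i])
```

The two printed numbers should agree up to rounding error, which is why no re-fitting is needed to obtain the predictive residuals in the LS case.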
Recall that the $L_2$-optimal point predictor of $Y_f$ given $\underline{X}_f = \underline{x}_f$ is the conditional expectation $E(Y_f \mid \underline{X}_f = \underline{x}_f) = \underline{x}_f' \underline{\beta}$. Since $\underline{\beta}$ is unknown, our practical point predictor of $Y_f$ will be $\hat{Y}_f = \underline{x}_f' \hat{\underline{\beta}}$. In order to construct a pertinent PI, the goal is to approximate the distribution of the so-called ‘root’ $Y_f - \hat{Y}_f$. If we were to do so via bootstrap, we would proceed by resampling the residuals and creating new pseudo-scatterplots via the (fitted) model (1). In this setting, the transformation-based approach of [4] suggested resampling the predictive residuals $\tilde{z}_i$ (as opposed to the fitted residuals $z_i$) in order to alleviate the undercoverage of bootstrap PIs first pointed out by [18]. The predictive residuals will be found useful in what follows as well.

3. Pertinent Prediction Intervals

Consider again the root $Y_f - \hat{Y}_f = Y_f - \underline{x}_f' \hat{\underline{\beta}}$. Model (1) implies that $Y_f = \underline{x}_f' \underline{\beta} + Z_f$ where $Z_f \sim F$. Hence,
$$Y_f - \hat{Y}_f = Z_f + \underline{x}_f' \underline{\beta} - \underline{x}_f' \hat{\underline{\beta}} = Z_f + E_f$$
where $E_f = \underline{x}_f' \underline{\beta} - \underline{x}_f' \hat{\underline{\beta}}$ is the estimation error. Standard regression conditions typically ensure that
$$E\, \underline{x}_f' \hat{\underline{\beta}} = \underline{x}_f' \underline{\beta} + o(1/\sqrt{n}) \quad \text{and} \quad \mathrm{Var}(\underline{x}_f' \hat{\underline{\beta}}) = O(1/n) \tag{7}$$
as $n \to \infty$. For instance, if $\hat{\underline{\beta}}$ is the LS estimator in a well-specified model, then $E \hat{\underline{\beta}} = \underline{\beta}$. Further conditions on the design matrix are available to ensure the second part of Equation (7); see [16].
Note that Equation (7) implies that $E_f \to 0$ in probability as $n \to \infty$. It is tempting to treat $E_f$ as negligible, and hence construct a simple quantile-based PI for $Y_f$ as
$$[\hat{Y}_f + \hat{F}^{-1}(\alpha/2),\ \hat{Y}_f + \hat{F}^{-1}(1 - \alpha/2)] \tag{8}$$
where $\hat{F}$ is a (uniformly) consistent estimator of $F$. For example, $\hat{F}$ can be the empirical distribution function (e.d.f.) of the residuals (fitted, predictive or studentized) after centering them to ensure that $\hat{F}$ is a distribution with mean zero. If $F$ is continuous and strictly increasing in the neighborhood of the two quantiles of interest, namely $F^{-1}(\alpha/2)$ and $F^{-1}(1 - \alpha/2)$, then the latter are estimated consistently by $\hat{F}^{-1}(\alpha/2)$ and $\hat{F}^{-1}(1 - \alpha/2)$ respectively, implying that the quantile-based PI (8) has asymptotic coverage $1 - \alpha$. However, PI (8) will be referred to as naive in what follows because it ignores the term $E_f$ and its variability, and typically exhibits finite-sample undercoverage.
Recall that $\hat{\underline{\beta}}$ is a function of $\underline{Z}_n$ which is independent of $Z_f$. Hence, the distribution of $Z_f + E_f$ is given by $G = F * F_E$ where $*$ denotes convolution and $F_E$ is the distribution of $E_f$. Consequently,
$$P(G^{-1}(\alpha/2) \le Y_f - \hat{Y}_f \le G^{-1}(1 - \alpha/2) \mid \underline{X}_f = \underline{x}_f) = 1 - \alpha \tag{9}$$
assuming $G$ is continuous and strictly increasing at its tail quantiles $G^{-1}(\alpha/2)$ and $G^{-1}(1 - \alpha/2)$. Equation (9) then suggests the ‘oracle’ PI
$$[\hat{Y}_f + G^{-1}(\alpha/2),\ \hat{Y}_f + G^{-1}(1 - \alpha/2)]. \tag{10}$$
The above oracle PI has exact coverage $1 - \alpha$ but it involves the two unknown distributions, $F$ and $F_E$. To construct a practical pertinent PI we propose to replace $F$ and $F_E$ by respective estimators $\hat{F}$ and $\hat{F}_E$. $\hat{F}$ was already mentioned; as regards $F_E$, recall that regularity conditions typically ensure that $\hat{\underline{\beta}}$ is asymptotically normal, and therefore
$$\frac{E_f}{\sqrt{\mathrm{Var}(\underline{x}_f' \hat{\underline{\beta}})}} \overset{d}{\longrightarrow} N(0, 1) \quad \text{as } n \to \infty. \tag{11}$$
Coupled with the fact that $\mathrm{Var}(\underline{x}_f' \hat{\underline{\beta}}) = \sigma^2 \underline{x}_f' B B' \underline{x}_f$ we are led to defining $\hat{F}_E$ as a normal distribution with mean zero and variance $\hat{\sigma}^2 \underline{x}_f' B B' \underline{x}_f$; here, $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$, e.g., the sample variance of the residuals (fitted, predictive or studentized). Although $\hat{F}$ may be a discontinuous distribution, the continuity of $\hat{F}_E$ implies that our estimator $\hat{G} = \hat{F} * \hat{F}_E$ is a continuous, strictly increasing distribution that is (uniformly) consistent for $G$.
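For completeness, the variance formula used above follows in one line from the linearity of the estimator; the short derivation below uses only Assumption 1 (conditioning on the design) and the i.i.d. errors of model (1), so that $\mathrm{Var}(\underline{Z}_n) = \sigma^2 I_n$.

$$\begin{aligned}
\underline{x}_f' \hat{\underline{\beta}} &= \underline{x}_f' B \underline{Y}_n = \underline{x}_f' B X \underline{\beta} + \underline{x}_f' B \underline{Z}_n, \\
\mathrm{Var}(\underline{x}_f' \hat{\underline{\beta}}) &= \underline{x}_f' B \, \mathrm{Var}(\underline{Z}_n) \, B' \underline{x}_f = \sigma^2 \underline{x}_f' B B' \underline{x}_f .
\end{aligned}$$

In the LS case $B = (X'X)^{-1} X'$ gives $B B' = (X'X)^{-1}$, so the variance reduces to the familiar $\sigma^2 \underline{x}_f' (X'X)^{-1} \underline{x}_f$ that also appears in Appendix A.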
The following lemma puts it all together.
Lemma 1. 
Assume conditions ensuring (7) and (11), and the (uniform) consistency of $\hat{F}$ and $\hat{\sigma}^2$ for $F$ and $\sigma^2$ respectively. If $G$ is continuous and strictly increasing at its tail quantiles $G^{-1}(\alpha/2)$ and $G^{-1}(1 - \alpha/2)$, then
$$P(\hat{G}^{-1}(\alpha/2) \le Y_f - \hat{Y}_f \le \hat{G}^{-1}(1 - \alpha/2) \mid \underline{X}_f = \underline{x}_f) \to 1 - \alpha \tag{12}$$
as $n \to \infty$.
Equation (12) leads us to construct our novel pertinent PI as
$$[\hat{Y}_f + \hat{G}^{-1}(\alpha/2),\ \hat{Y}_f + \hat{G}^{-1}(1 - \alpha/2)]. \tag{13}$$
Note that the Central Limit Theorem (CLT) of Equation (11), as well as the approximations of $F$ and $\sigma^2$ by $\hat{F}$ and $\hat{\sigma}^2$, do require a reasonably large sample. But even with $n$ as small as 100 these approximations can be good enough for practical application; see Section 4 for some numerical evidence. Moreover, recall that assumption (7) implies that $E_f = O_P(1/\sqrt{n})$ which is not negligible with $n$ about 100, hence the importance of actually including it in the PI construction; the following remark elaborates.
Remark 2 
(Asymptotic validity). Lemma 1 implies that the pertinent PI (13) has asymptotic coverage $1 - \alpha$ conditionally on $\underline{X}_f = \underline{x}_f$. Although consistency is a sine qua non, it does not tell the whole story; recall that even the naive PI (8) is consistent. The PI (13) is called pertinent because it captures the variance of the estimation error $E_f$ which the naive PI (8) treats as negligible. As shown in the simulation results of Section 4, the pertinent PI (13) alleviates the finite-sample undercoverage of the naive PI (8).
In principle, the convolution $\hat{G} = \hat{F} * \hat{F}_E$ could be carried out numerically; however, the following Monte Carlo algorithm makes its implementation straightforward.
Monte Carlo Algorithm for Pertinent Prediction Intervals
  1. Compute the estimator $\hat{\underline{\beta}} = B \underline{Y}_n$, the point predictor $\hat{Y}_f = \underline{x}_f' \hat{\underline{\beta}}$, and the residuals of choice (fitted, predictive or studentized).
  2. Compute an estimator $\hat{\sigma}^2$ of $\sigma^2$; e.g., $\hat{\sigma}^2$ could be the sample variance of the chosen residuals. [In the case of Least Squares, the unbiased estimator $(n - p)^{-1} \sum_{i=1}^{n} z_i^2$ could also be entertained.]
  3. Compute an estimator $\hat{F}$ of $F$; e.g., $\hat{F}$ can be the e.d.f. of the residuals of choice after centering to ensure that $\hat{F}$ is a distribution with mean zero.
  4. Choose a large integer $M$, e.g., $M \ge 1000$, and generate $Z_1^*, \ldots, Z_M^*$ i.i.d. from $\hat{F}$; the $Z_i^*$s can be thought of as pseudo-residuals. [If $\hat{F}$ is an e.d.f. that puts mass $1/n$ on values $a_1, \ldots, a_n$ (say), then generating $Z_1^*, \ldots, Z_M^*$ can be done by drawing randomly (with replacement) from the set $\{a_1, \ldots, a_n\}$.]
  5. Generate $E_1^*, \ldots, E_M^*$ i.i.d. from a normal distribution with mean zero and variance $\hat{\sigma}^2 \underline{x}_f' B B' \underline{x}_f$.
  6. Define $W_i = Z_i^* + E_i^*$ for $i = 1, \ldots, M$, and compute $\hat{G}^{-1}(\gamma)$ as the $\gamma$ sample quantile of $W_1, \ldots, W_M$, i.e., $\hat{G}^{-1}(\gamma) = W_{[\gamma M]}$ where $W_{[1]} \le W_{[2]} \le \cdots \le W_{[M]}$ are the order statistics.
  7. Use the above $\hat{G}^{-1}(\gamma)$ with $\gamma$ equal to $\alpha/2$ and $1 - \alpha/2$ to construct the pertinent PI (13).
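The seven steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the author's code: it assumes the LS estimator, takes $\hat{\sigma}^2$ to be the sample variance (divisor $n$) of the centered residuals, and interprets the rank $[\gamma M]$ as $\gamma M$ rounded to the nearest integer. The function name `pertinent_pi` and the simulated data are hypothetical.

```python
import numpy as np

def pertinent_pi(X, y, x_f, alpha=0.10, M=1000, residuals="predictive", rng=None):
    """Monte Carlo pertinent PI for Y_f at regressor x_f (LS fit assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: beta_hat = B y with B = (X'X)^{-1} X', point predictor, residuals
    B = np.linalg.solve(X.T @ X, X.T)
    beta_hat = B @ y
    y_hat_f = x_f @ beta_hat
    z = y - X @ beta_hat                      # fitted residuals
    h = np.sum(X * B.T, axis=1)               # diagonal of the hat matrix X B
    if residuals == "fitted":
        r = z
    elif residuals == "studentized":
        r = z / np.sqrt(1.0 - h)
    elif residuals == "predictive":
        r = z / (1.0 - h)
    else:
        raise ValueError("residuals must be 'fitted', 'studentized' or 'predictive'")
    # Steps 2-3: center the residuals; F_hat is their e.d.f., sigma2_hat their variance
    r = r - r.mean()
    sigma2_hat = np.mean(r ** 2)
    # Step 4: pseudo-residuals Z* drawn with replacement from F_hat
    z_star = rng.choice(r, size=M, replace=True)
    # Step 5: E* ~ N(0, sigma2_hat * x_f' B B' x_f)
    xB = x_f @ B
    e_star = rng.normal(0.0, np.sqrt(sigma2_hat * (xB @ xB)), size=M)
    # Step 6: order statistics of W = Z* + E*
    w = np.sort(z_star + e_star)

    def g_inv(gamma):
        k = min(max(int(round(gamma * M)), 1), M)   # 1-based rank (assumed rounding)
        return w[k - 1]

    # Step 7: the pertinent PI (13)
    return y_hat_f + g_inv(alpha / 2), y_hat_f + g_inv(1 - alpha / 2)

# Usage on simulated data (illustrative values only)
rng = np.random.default_rng(1)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = 3 * rng.uniform(0, 1, size=p)
y = X @ beta + rng.standard_normal(n)
x_f = np.ones(p)
lo, hi = pertinent_pi(X, y, x_f, alpha=0.10, M=1000, rng=rng)
print(f"90% pertinent PI: [{lo:.3f}, {hi:.3f}]")
```

Note that the regression is fitted once; the only simulation is the draw of $2M$ scalars, which is what keeps the cost low compared with re-fitting on $M$ pseudo-scatterplots.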
Remark 3 
(Choice of residuals). In the LS setting of Remark 1, note that our Equation (7) implies that $h_i \to 0$ as $n \to \infty$, and hence all residuals (fitted, predictive or studentized) are asymptotically equivalent. Therefore, asymptotic considerations cannot help us in choosing which residuals to employ for the construction of the estimators $\hat{\sigma}^2$ and $\hat{F}$ in the above algorithm. Note, however, that the validity of Equation (7) hinges on either $p$ being finite, or at least that $p \ll n$. For example, if $p$ is of the same order of magnitude as $n$, then the $h_i$s will not be close to 0 even for large $n$, and the fitted, predictive and studentized residuals will remain distinct. The numerical experiment of Section 4 will provide some finite-sample guidance on our choice of residuals and the effect of the relative magnitude of $p$ to $n$.
Remark 4 
(Computational cost as compared to bootstrap). Note that sampling from $\hat{F}$ is effectively a bootstrap procedure. However, in the above Monte Carlo algorithm, sampling from $\hat{F}$ is done only once, i.e., we are drawing just one bootstrap sample of size $M$. Suppose that the estimators $\hat{\underline{\beta}}$, $\hat{\sigma}^2$ and $\hat{F}$ can be computed in $O(n^\zeta)$ operations for some $\zeta \ge 1$. Then, the above algorithm has computational cost of $O(n^\zeta) + O(M)$ operations. By contrast, the pertinent version of the residual-based bootstrap requires drawing $M$ resamples each of size $n$ from $\hat{F}$; using each of the $M$ resamples as a set of pseudo-residuals, a new pseudo-scatterplot is created upon which $\hat{\underline{\beta}}$, $\hat{\sigma}^2$, etc. are recomputed; see the MF/MB procedure in Part II of [4] for more details. Hence, the pertinent version of the residual-based bootstrap requires $O(M n^\zeta)$ operations which is more expensive by orders of magnitude. To elaborate, consider the realistic case where $n^\zeta \ge 1000$; since $M$ is also taken not less than 1000, the result is that the above Monte Carlo algorithm can be a thousand times faster than the pertinent version of the residual bootstrap.
As a concrete example, consider the popular case of LASSO regression that can be computed in $O(p^3 + p^2 n)$ operations as long as $p < n$; see [19]. Suppose that $p \sim \sqrt{n}$ in which case LASSO regression is computable in $O(n^\zeta)$ operations with $\zeta = 2$. Let $M = 1000$, and suppose that the sample size satisfies $n > 33$. Then, we have $M < n^2$ and our Monte Carlo algorithm is computable in $O(n^2)$ operations, i.e., the same order as computing a single LASSO regression. By contrast, the residual-based bootstrap requires $O(M n^2)$ operations, i.e., it is like fitting 1000 LASSO regressions.
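The order-of-magnitude comparison can be made concrete with a back-of-the-envelope count. The sketch below simply evaluates the two cost expressions of Remark 4 with all constants hidden in the $O(\cdot)$ notation set to one, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def monte_carlo_cost(n, zeta, M):
    # one fit of the estimators plus M Monte Carlo draws
    return n ** zeta + M

def bootstrap_cost(n, zeta, M):
    # M re-fits, one per pseudo-scatterplot of size n
    return M * n ** zeta

n, zeta, M = 100, 2, 1000
ratio = bootstrap_cost(n, zeta, M) / monte_carlo_cost(n, zeta, M)
print(round(ratio))  # about 909
```

The ratio approaches $M$ from below as $n^\zeta$ grows relative to $M$, consistent with the "thousand times faster" figure quoted for $M = 1000$.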

4. Numerical Results

We now conduct a small numerical experiment with three goals: (a) to show that the pertinent PI (denoted PPI) of Equation (13) has better finite-sample coverage than the naive PI of Equation (8); (b) to provide some finite-sample guidance on which type of residuals to employ in the PPI algorithm; and (c) to see the effect of the magnitude of p (relative to n) in our choice of residuals, and the resulting PPI performance.
Throughout this section, we will focus on the LS estimator $\hat{\underline{\beta}} = (X'X)^{-1} X' \underline{Y}_n$ where $X$ is an $n \times p$ matrix having a column of 1s as its first column.
The structure of our numerical experiment is as follows:
  1. Choose $p$ and $n$; for our main results we will employ the following combinations for $(p, n)$: (10, 50), (20, 50), (20, 100), (40, 100).
  2. Generate all the elements of $X$ (other than its first column) as i.i.d. standard normal.
  3. Choose the error distribution $F$; for our purposes we will take $F$ as either standard normal or Laplace (i.e., two-sided exponential) with mean zero and variance one.
  4. Choose $\underline{\beta}$, the $\alpha$ level, and the regressor of future interest $\underline{x}_f$; for our purposes, $\underline{\beta}$ is generated as $3 \cdot (U_1, \ldots, U_p)'$ where $U_1, \ldots, U_p$ are i.i.d. Uniform $[0, 1]$, the $\alpha$ level is chosen as 0.10, and $\underline{x}_f$ is a row of 1s.
  5. Generate $Z_1, \ldots, Z_n$ i.i.d. from $F$, and use them to generate $Y_1, \ldots, Y_n$ from model (1).
  6. Fit the data via LS and compute $\hat{\underline{\beta}}$ and $\hat{Y}_f$; also compute the three sets of residuals: fitted, studentized and predictive.
  7. Calculate four PIs for $Y_f$: PI.naive of Equation (8), and three versions of the PPI of Equation (13): using the fitted residuals (PPI.fitted), using the studentized residuals (PPI.stud), and using the predictive residuals (PPI.pred). The three PPIs employ the Monte Carlo algorithm using $M = 1000$. Compute the lengths (LEN) of the four PIs.
  8. Further generate $Z_{n+1}, \ldots, Z_{n+500}$ i.i.d. from $F$, and use them to generate 500 realizations of $Y_f$ to be used to test the PI coverage. The $k$th realization $Y_f^{(k)}$ is generated as $Y_f^{(k)} = \underline{x}_f' \underline{\beta} + Z_{n+k}$.
  9. Calculate the coverage (CVR) of each PI using the 500 realizations of $Y_f$, i.e., CVR of a PI is $500^{-1} \sum_{k=1}^{500} \mathbf{1}\{Y_f^{(k)} \in \text{PI}\}$ where $\mathbf{1}$ denotes the indicator function.
  10. Repeat steps 5 to 9 many times (in our case, 200). Each of these $j = 1, \ldots, 200$ replications has its own CVR[j] and LEN[j] associated with each PI.
    Finally:
    (a) Calculate the average CVR of each method over the 200 replications.
    (b) Plot a histogram of CVR[j] for $j = 1, \ldots, 200$ to portray the CVR variability, and compute what percentage of these CVRs fall below the target level of $1 - \alpha$.
    (c) Calculate the average LEN of each method over the 200 replications; also calculate the variability of LEN via its sample standard deviation.
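The experiment structure above can be sketched as follows for a single $(p, n)$ combination. This is a simplified illustration, not the code behind the paper's tables: quantiles are computed with `np.quantile` (linear interpolation) rather than the order statistic $W_{[\gamma M]}$, the histogram of step (b) is replaced by the share of CVRs below nominal, and the function names and the seed are hypothetical.

```python
import numpy as np

def four_intervals(X, y, x_f, alpha, M, rng):
    """PI.naive of (8) and the three PPIs of (13) for one sample."""
    n, p = X.shape
    B = np.linalg.solve(X.T @ X, X.T)        # B = (X'X)^{-1} X'
    beta_hat = B @ y
    y_hat_f = x_f @ beta_hat
    z = y - X @ beta_hat                     # fitted residuals (mean zero: intercept)
    h = np.sum(X * B.T, axis=1)              # hat-matrix diagonal
    xBBx = np.sum((x_f @ B) ** 2)            # x_f' B B' x_f
    probs = [alpha / 2, 1 - alpha / 2]
    out = {"PI.naive": y_hat_f + np.quantile(z, probs)}
    resid = {"PPI.fitted": z, "PPI.stud": z / np.sqrt(1 - h), "PPI.pred": z / (1 - h)}
    for name, r in resid.items():
        r = r - r.mean()
        # PPI.fitted uses the unbiased variance estimator, as described in the text
        sigma2 = np.sum(z ** 2) / (n - p) if name == "PPI.fitted" else np.mean(r ** 2)
        w = rng.choice(r, size=M) + rng.normal(0.0, np.sqrt(sigma2 * xBBx), size=M)
        out[name] = y_hat_f + np.quantile(w, probs)
    return out

def experiment(n=50, p=10, alpha=0.10, M=1000, reps=200, n_test=500,
               laplace=False, seed=0):
    rng = np.random.default_rng(seed)

    def draw(size):
        # Laplace with scale 1/sqrt(2) has variance one
        if laplace:
            return rng.laplace(0.0, 1 / np.sqrt(2), size)
        return rng.standard_normal(size)

    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # steps 1-2
    beta = 3 * rng.uniform(0, 1, size=p)                                 # step 4
    x_f = np.ones(p)
    cvr, length = {}, {}
    for _ in range(reps):                                                # step 10
        y = X @ beta + draw(n)                                           # step 5
        pis = four_intervals(X, y, x_f, alpha, M, rng)                   # steps 6-7
        y_f = x_f @ beta + draw(n_test)                                  # step 8
        for name, (lo, hi) in pis.items():
            cvr.setdefault(name, []).append(np.mean((lo <= y_f) & (y_f <= hi)))  # step 9
            length.setdefault(name, []).append(hi - lo)
    return {name: {"mean_cvr": float(np.mean(c)),
                   "share_below_nominal": float(np.mean(np.array(c) < 1 - alpha)),
                   "mean_len": float(np.mean(length[name])),
                   "sd_len": float(np.std(length[name], ddof=1))}
            for name, c in cvr.items()}

results = experiment()
for name, summary in results.items():
    print(name, summary)
```

Calling `experiment(laplace=True)` or changing `n` and `p` covers the other designs listed in step 1; the numbers obtained this way depend on the seed and are not expected to reproduce the paper's tables digit for digit.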
To elaborate on the PI constructions:
  • PI.naive and PPI.fitted employ as $\hat{F}$ the e.d.f. of the fitted residuals. The latter further uses as $\hat{\sigma}^2$ the unbiased estimator $(n - p)^{-1} \sum_{k=1}^{n} z_k^2$. Recall that having a column of 1s in $X$ ensures $\sum_{k=1}^{n} z_k = 0$.
  • PPI.stud employs as $\hat{F}$ the e.d.f. of the centered studentized residuals, i.e.,
    $$\hat{F}(a) = n^{-1} \sum_{i=1}^{n} \mathbf{1}\Big\{\hat{z}_i - n^{-1} \sum_{k=1}^{n} \hat{z}_k \le a\Big\}.$$
    It also employs $\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} \big[\hat{z}_i - n^{-1} \sum_{k=1}^{n} \hat{z}_k\big]^2$.
  • Finally, PPI.pred employs as $\hat{F}$ the e.d.f. of the centered predictive residuals, i.e.,
    $$\hat{F}(a) = n^{-1} \sum_{i=1}^{n} \mathbf{1}\Big\{\tilde{z}_i - n^{-1} \sum_{k=1}^{n} \tilde{z}_k \le a\Big\}.$$
    It also employs $\hat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} \big[\tilde{z}_i - n^{-1} \sum_{k=1}^{n} \tilde{z}_k\big]^2$.
The main results of our experiment are presented in Tables 1–9. There were also nine pages of CVR histograms associated with the results of each of the nine tables. To save space, we only present five representative ones in Figures 1–5; the omitted histograms were very similar to the ones included here. Note that each of these histograms is an approximation of the probability distribution of the random variable
$$P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f, \underline{Y}_n). \tag{14}$$
Recall that per Assumption 1 the above probability is also conditional on the event $\{\underline{X}_1 = \underline{x}_1, \ldots, \underline{X}_n = \underline{x}_n\}$ without it being explicitly denoted.
Being a function of the sample $\underline{Y}_n$, the quantity in Equation (14) is a random variable that varies across samples. Our numerical experiment tries to encapsulate this variability in a way analogous to Figure 2 of [6]. Note that the mean CVR reported in the first column of Tables 1–9 gives the center of each of these histograms, namely
$$P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f) \tag{15}$$
which is the expected value of the random variable (14) averaged over all possible $\underline{Y}_n$ samples.
From the numerical findings of Tables 1–9 we can draw several interesting conclusions:
  • The undercoverage of PI.naive is strongly illustrated in all Tables and Figures.
  • All three of our proposed PPIs have larger coverage than PI.naive; furthermore, we observe the ordering
    CVR[PPI.fitted] < CVR[PPI.stud] < CVR[PPI.pred]
    which is not unexpected in view of Remark 1 that compares the scales of the three types of residuals.
  • It is clear that PPI.fitted falls short of achieving acceptable coverage for finite samples; this can be attributed to the known tendency of the fitted residuals to underestimate the scale of the errors; see [20].
  • The preferable methods seem to be either PPI.stud or PPI.pred; the latter appears to overcover in several instances so it can be thought of as conservative.
  • The variability of the CVRs that is illustrated by the histograms in Figures 1–4 is striking. Also interesting is the high proportion of CVRs (conditional on the sample) that are below the nominal level over the 200 replications. PPI.pred is more conservative in that respect as well, i.e., it has the smallest proportion of PIs that undercover.
  • In Tables 1–8, the ratio $p/n$ was either 0.2 or 0.4 which cannot be considered small; nor can a sample size of 50 or 100 be considered particularly large. Despite the fact that these setups are quite challenging, both PPI.stud and PPI.pred provided reliable performances (with the latter being more conservative).
  • Interestingly, there is no appreciable difference of PI performance going from normal errors to Laplace even though in the latter case the LS estimator does not coincide with the maximum likelihood estimator. The only side-effect of the heavier tails of the Laplace distribution appears to be a slightly heavier left tail in some of our histograms.
As discussed in Remark 3, the standard asymptotics kick in when $p/n \to 0$ in which case the fitted, studentized and predictive residuals become approximately equal to each other. In that case the performance of all three PPIs is expected to be good, although PPI.stud and PPI.pred will still be preferable over the other two PIs. To corroborate this claim, we offer the results of an additional simulation in the ‘easy’ case of $p = 5$ and $n = 100$, i.e., when $p/n$ is small; as Table 9 shows, PPI.stud and PPI.pred have mean CVRs of 0.897 and 0.908 respectively that are practically equivalent to the nominal 90%. So are the two methods PPI.stud and PPI.pred equivalent in this ‘easy’ case?
The above question allows us to think harder on our design goals for a successful PI construction. Until now, our target was to have CVR (15), i.e., $P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f)$ close to the nominal 90% (and ideally not below the nominal). However, Figure 5 shows the histogram of the CVR (14) that has Equation (15) as its center. In order for the center to be about 90%, the CVR (14) should take values both below and above 90%. Indeed, as shown in Corollary 4.1 of [21], PIs based on the residual bootstrap will exhibit large-sample histograms with their median at the nominal level, i.e., asymptotically the histogram will have 50% of occurrences below the nominal and 50% above.
By the Law of Large Numbers, such a histogram will have progressively smaller and smaller range since asymptotically
$$P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f, \underline{Y}_n) \approx E\, P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f, \underline{Y}_n) = P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f)$$
where the above expectation is taken over the distribution of $\underline{Y}_n$. In spite of the above, it seems preferable that a desirable finite-sample histogram of $P(Y_f \in \text{PI} \mid \underline{X}_f = \underline{x}_f, \underline{Y}_n)$ would have less mass below the nominal 90% (as compared to above the nominal), i.e., only a minority of the CVRs (14) would be below the nominal 90%. In this sense, PPI.pred appears preferable to PPI.stud even here since, as shown in Table 9, PPI.stud and PPI.pred have respectively 51% and 38% of occurrences below the nominal.

5. Conclusions

An interesting problem in statistics is the construction of prediction intervals (PIs) in linear regression without assuming Gaussianity, which is often not justifiable. We desire PIs that (a) have asymptotic conditional validity, and (b) are ‘pertinent’, i.e., are able to capture/incorporate the variability associated with estimated quantities employed in the PI construction. Such PIs have traditionally been constructed via resampling.
The paper at hand proposes a short-cut that directly employs the asymptotic normal distribution of relevant estimators—as opposed to a bootstrap histogram—in order to capture their variability. The resulting prediction interval achieves the property of pertinence without full-scale resampling, thus offering computational savings of orders of magnitude over the bootstrap.
Asymptotic conditional validity of the new approach is shown under general conditions using either fitted, studentized or predictive residuals. Numerical work confirms that the new method has good finite-sample performance as well, and suggests the use of predictive residuals in order to construct conservative PIs, i.e., resulting in the smallest proportion of PIs that undercover.

Funding

This research was partially supported by NSF grant DMS 24-13718.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

Many thanks are due to three anonymous reviewers for their constructive suggestions that helped improve the paper.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. An Intuitive Explanation of Pertinence

The idea of pertinent intervals, i.e., intervals that possess the property of pertinence, is meant to describe prediction intervals that capture/incorporate the variability of estimated quantities in their construction. To explain it intuitively, consider the textbook setup of linear regression with Gaussian errors, i.e., model (1) with $Z_i$ i.i.d. $N(0, \sigma^2)$, and $\hat{\underline{\beta}}$ being the Least Squares estimator.
Let $S^2 = RSS / (n - p)$, where $RSS$ is the Residual Sum of Squares, i.e., $\sum_{i=1}^{n} z_i^2$. Then, conditioning on $X_f = x_f$ and on $X$ (as in Assumption 1), we can define an exact $(1 - \alpha) 100\%$ PI for $Y_f$ as
$$\hat{Y}_f \pm t_{n-p}(\alpha/2)\, S \sqrt{1 + x_f' (X'X)^{-1} x_f}; \tag{A1}$$
see Equation (5.27) in [16].
A different practitioner might use the $\alpha/2$ standard normal quantile $z(\alpha/2)$ instead of the $t$-distribution quantile $t_{n-p}(\alpha/2)$, and construct the PI
$$\hat{Y}_f \pm z(\alpha/2)\, S \sqrt{1 + x_f' (X'X)^{-1} x_f}. \tag{A2}$$
Under standard assumptions on the structure of the design matrix $X$ ensuring that the minimum eigenvalue of $X'X$ diverges to infinity, we have $x_f' (X'X)^{-1} x_f \to 0$ as $n \to \infty$ (with $p$ held constant or at least $p \ll n$). One can then consider the simplified prediction interval
$$\hat{Y}_f \pm z(\alpha/2)\, S. \tag{A3}$$
Under the normality of $F$, all three PIs, namely (A1), (A2) and (A3), are asymptotically valid. However, interval (A1) is the optimal one with exact finite-sample coverage $1 - \alpha$. Moreover, interval (A2) has better finite-sample coverage than interval (A3) as it captures the estimation variability of the linear model, i.e., the term $\mathrm{Var}(x_f' \hat{\beta}) = \sigma^2 x_f' (X'X)^{-1} x_f$ which interval (A3) treats as negligible. That is why interval (A3) can be called ‘naive’, while interval (A2) can be called ‘pertinent’.
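The three Gaussian-case intervals can be compared side by side in code. The sketch below assumes NumPy and SciPy are available and uses the upper-tail critical values `t.ppf(1 - alpha/2)` and `norm.ppf(1 - alpha/2)` for the half-widths, which is one reading of the quantile notation $t_{n-p}(\alpha/2)$ and $z(\alpha/2)$; the function name `gaussian_pis` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def gaussian_pis(X, y, x_f, alpha=0.10):
    """Intervals (A1), (A2) and (A3) for Y_f under an LS fit."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat_f = x_f @ beta_hat
    rss = np.sum((y - X @ beta_hat) ** 2)
    S = np.sqrt(rss / (n - p))
    q = x_f @ np.linalg.solve(X.T @ X, x_f)        # x_f' (X'X)^{-1} x_f
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)  # t critical value
    z_crit = stats.norm.ppf(1 - alpha / 2)         # normal critical value
    half = {
        "A1": t_crit * S * np.sqrt(1 + q),   # exact under Gaussian errors
        "A2": z_crit * S * np.sqrt(1 + q),   # 'pertinent' normal-quantile version
        "A3": z_crit * S,                    # 'naive': drops the estimation term
    }
    return {name: (y_hat_f - hw, y_hat_f + hw) for name, hw in half.items()}

# Usage on simulated Gaussian data (illustrative values only)
rng = np.random.default_rng(3)
n, p = 50, 10
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ (3 * rng.uniform(0, 1, size=p)) + rng.standard_normal(n)
pis = gaussian_pis(X, y, np.ones(p))
for name, (lo, hi) in pis.items():
    print(name, round(hi - lo, 3))
```

By construction the widths are ordered (A1) > (A2) > (A3), which mirrors the coverage ranking described above.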
In general, an exact interval such as (A1) will not be available because of potential non-normality of the errors. Furthermore, if $F$ is not Gaussian, then neither of the two intervals (A2) and (A3) will be valid, not even asymptotically. As mentioned already, in the absence of Gaussianity, the quantile-based PI (8) will be asymptotically valid under standard assumptions but it will be ‘naive’ as it neglects the variability inherent in estimation. To improve upon it and come up with a pertinent PI, past literature typically required a bootstrap procedure as described by [17]; see also Section 3.6 of [4]. The paper at hand proposes a new alternative method to construct a pertinent PI in linear regression without full-scale resampling.

References

  1. Tian, H.; Huang, L.; Cheng, C.-Y.; Zhang, L. Regression models with ordered multiple categorical predictors. J. Stat. Comput. Simul. 2018, 88, 3164–3178.
  2. Tian, Q.; Nordman, D.J.; Meeker, W.Q. Methods to Compute Prediction Intervals: A Review and New Results. Stat. Sci. 2022, 37, 580–597.
  3. Politis, D.N.; Romano, J.P.; Wolf, M. Subsampling; Springer: New York, NY, USA, 1999.
  4. Politis, D.N. Model-Free Prediction and Regression: A Transformation-Based Approach to Inference; Springer: New York, NY, USA, 2015.
  5. Bai, Y.; Mei, S.; Wang, H.; Xiong, C. Understanding the under-coverage bias in uncertainty estimation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtually, 6–14 December 2021.
  6. Wang, Y.; Politis, D.N. Model-free bootstrap and conformal prediction in regression: Conditionality, conjecture testing, and pertinent prediction intervals. J. Nonparametr. Stat. 2026, 38, 311–345.
  7. Politis, D.N. Model-free Model-fitting and Predictive Distributions. Test 2013, 22, 183–221.
  8. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: Berlin/Heidelberg, Germany, 2005.
  9. Shafer, G.; Vovk, V. A Tutorial on Conformal Prediction. J. Mach. Learn. Res. 2008, 9, 371–421.
  10. Lei, J.; Wasserman, L. Distribution-free prediction bands for non-parametric regression. J. R. Stat. Soc. Ser. B 2014, 76, 71–96.
  11. Foygel Barber, R.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. The limits of distribution-free conditional predictive inference. Inf. Inference 2021, 10, 455–482.
  12. Lei, J.; G’Sell, M.; Rinaldo, A.; Tibshirani, R.J.; Wasserman, L. Distribution-free Predictive Inference for Regression. J. Am. Stat. Assoc. 2018, 113, 1094–1111.
  13. Romano, Y.; Patterson, E.; Candès, E.J. Conformalized quantile regression. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
  14. Chernozhukov, V.; Wüthrich, K.; Zhu, Y. Distributional Conformal Prediction. Proc. Natl. Acad. Sci. USA 2021, 118, e2107794118.
  15. Duchi, J.C. A Few Observations on Sample-Conditional Coverage in Conformal Prediction. arXiv 2025, arXiv:2503.00220.
  16. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; Wiley: New York, NY, USA, 2003.
  17. Stine, R.A. Bootstrap prediction intervals for regression. J. Am. Stat. Assoc. 1985, 80, 1026–1031.
  18. Efron, B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Am. Stat. Assoc. 1983, 78, 316–331.
  19. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
  20. Foygel Barber, R.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Predictive inference with the jackknife+. Ann. Stat. 2021, 49, 486–507.
  21. Zhang, Y.; Politis, D.N. Bootstrap prediction intervals with asymptotic conditional validity and unconditional guarantees. Inf. Inference 2023, 12, 157–209.
Figure 1. CVR histograms; case p = 10 , n = 50 ; Normal errors.
Figure 2. CVR histograms; case p = 20 , n = 50 ; Laplace errors.
Figure 3. CVR histograms; case p = 20 , n = 100 ; Normal errors.
Figure 4. CVR histograms; case p = 40 , n = 100 ; Laplace errors.
Figure 5. CVR histograms; case p = 5 , n = 100 ; normal errors.
Table 1. Performance of 90% PI constructions: mean coverage levels (CVR), % of CVRs that are <0.9, mean PI length (with standard deviation in parentheses); case of Normal errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.765 | 97% | 2.79 (0.37) |
| PPI.fitted | 0.837 | 80.5% | 3.31 (0.42) |
| PPI.stud | 0.874 | 60% | 3.62 (0.46) |
| PPI.pred | 0.913 | 29% | 4.06 (0.51) |

Table 2. Entries as in Table 1; case of Laplace errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.812 | 90% | 2.74 (0.48) |
| PPI.fitted | 0.865 | 66% | 3.24 (0.55) |
| PPI.stud | 0.889 | 48% | 3.52 (0.60) |
| PPI.pred | 0.917 | 27% | 4.00 (0.68) |

Table 3. Entries as in Table 1; case of Normal errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.680 | 100% | 2.45 (0.32) |
| PPI.fitted | 0.845 | 72% | 3.41 (0.43) |
| PPI.stud | 0.899 | 39% | 4.01 (0.53) |
| PPI.pred | 0.960 | 8% | 5.27 (0.70) |

Table 4. Entries as in Table 1; case of Laplace errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.639 | 99% | 2.40 (0.41) |
| PPI.fitted | 0.842 | 60% | 3.71 (0.59) |
| PPI.stud | 0.888 | 37% | 4.35 (0.69) |
| PPI.pred | 0.950 | 10% | 5.71 (0.99) |

Table 5. Entries as in Table 1; case of Normal errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.791 | 97.5% | 2.87 (0.27) |
| PPI.fitted | 0.860 | 73% | 3.38 (0.28) |
| PPI.stud | 0.893 | 44% | 3.71 (0.32) |
| PPI.pred | 0.925 | 16% | 4.13 (0.35) |

Table 6. Entries as in Table 1; case of Laplace errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.832 | 87% | 2.77 (0.35) |
| PPI.fitted | 0.868 | 55% | 3.37 (0.40) |
| PPI.stud | 0.895 | 31.5% | 3.66 (0.41) |
| PPI.pred | 0.927 | 16.5% | 4.10 (0.51) |

Table 7. Entries as in Table 1; case of Normal errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.676 | 100% | 2.55 (0.26) |
| PPI.fitted | 0.854 | 61% | 3.82 (0.36) |
| PPI.stud | 0.890 | 44.5% | 4.34 (0.41) |
| PPI.pred | 0.963 | 9.5% | 5.60 (0.53) |

Table 8. Entries as in Table 1; case of Laplace errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.667 | 99% | 2.47 (0.35) |
| PPI.fitted | 0.845 | 57% | 3.75 (0.53) |
| PPI.stud | 0.888 | 39% | 4.26 (0.59) |
| PPI.pred | 0.950 | 10% | 5.54 (0.79) |

Table 9. Entries as in Table 1; case of Normal errors.

| p = 10, n = 50 | CVR | % < 0.9 | Length (St.Dev.) |
|---|---|---|---|
| PI.naive | 0.866 | 73% | 3.12 (0.29) |
| PPI.fitted | 0.887 | 60.5% | 3.28 (0.29) |
| PPI.stud | 0.897 | 51% | 3.36 (0.30) |
| PPI.pred | 0.908 | 38% | 3.45 (0.31) |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
