Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence

To address the difficult problem of multi-step ahead prediction of non-parametric autoregressions, we consider a forward bootstrap approach. Employing a local constant estimator, we can analyze a general type of non-parametric time series model, and show that the proposed point predictions are consistent with the true optimal predictor. We construct a quantile prediction interval that is asymptotically valid. Moreover, using a debiasing technique, we can asymptotically approximate the distribution of multi-step ahead non-parametric estimation by bootstrap. As a result, we can build bootstrap prediction intervals that are pertinent, i.e., can capture the model estimation variability, thus improving upon the standard quantile prediction intervals. Simulation studies are given to illustrate the performance of our point predictions and pertinent prediction intervals for finite samples.


Introduction
Since the 1980s, non-linear time-series models have attracted attention for modeling asymmetry in financial returns, the volatility of stock markets, switching regimes, etc. Compared to linear time-series models, non-linear models are more capable of depicting the underlying data-generating mechanism; see the review in [1], for example. However, unlike linear models, where the one-step-ahead predictor can be iterated, the multi-step-ahead prediction of non-linear models is cumbersome, since the innovation severely influences the forecasting value.
In this paper, by combining the forward bootstrap in [2] with nonparametric estimation, we develop multi-step-ahead (conditional) predictive inference for the general model: where the t values are assumed to be independent and identically distributed (i.i.d.) with mean 0 and variance 1, and m(·) and σ(·) are some functions that satisfy some smoothness conditions. We will also assume that the time series satisfying Equation (1) is geometrically ergodic and causal, i.e., that for any t, t is independent of {X s , s < t}.
In Equation (1), we have the trend/regression function m(·) depending on the last p data points, while the standard deviation/volatility function σ(·) depends on the last q data points; in many situations, p and q are taken to be equal for simplicity. Some special cases deserve mention: e.g., if σ(X t−1 , . . . , X t−q ) ≡ σ (constant), Equation (1) yields a non-linear/nonparametric autoregressive model with homoscedastic innovations. The well-known ARCH/GARCH models are a special case of Equation (1) with m(X t−1 , . . . , X t−p ) ≡ 0. Remark 1. Since neither of the above two approaches is satisfactory, we propose to approximate the distribution of the future value via a particular type of simulation when the model is known or, more generally, by the bootstrap. To describe this approach, we rewrite Equation (1) as where X t−1 is a vector that represents {X t−1 , . . . , X t−max(p,q) }, and G(·, ·) is some appropriate function. Then, when the model and the innovation information are known to us, we can create a pseudo-value X * T+k . Taking a three-step-ahead prediction as an example, the pseudo-value X * T+3 can be defined as follows: where { * i } T+3 i=T+1 is simulated as i.i.d. from F . Repeating this process to M pseudo-X * T+3 , the L 2optimal prediction of X T+3 can be estimated from the mean of {X * (m) T+3 } M m=1 . As already discussed, constructing the L 1 -optimal predictor may also be required since sometimes L 2 loss is not well defined; in our simulation framework, we can construct the optimal L 1 prediction by taking the median of {X * (m) T+k } M m=1 . Moreover, we can even build a prediction interval (PI) to measure the forecasting accuracy based on the quantile values of simulated pseudo-values. The extension of this algorithm to longer-step-ahead prediction is illustrated in Section 2.
Realistically, practitioners will not know F , m(·), or σ(·). In this situation, the first step is to estimate these quantities and plug them into the above simulation, which then turns into a bootstrap method. The bootstrap idea was introduced by [6] to carry out statistical inference for independent data. After that, many variants of the bootstrap were developed to handle time-series data. Prominent examples include the sieve bootstrap and the block bootstrap in its many variations, e.g., the circular bootstrap of [7] and the stationary bootstrap of [8]; see [9] for a review. Once some model structure of the data is assumed, participators can rely on model-based bootstrap methods, e.g., the residual and/or wild bootstrap; see [10] for a book-length treatment. The bootstrap technique can also be applied to a recently popular model, namely, a neural network. In particular, ref. [11] applied the bootstrap for the estimation inference of neural networks' parameters, while [12] utilized the bootstrap to estimate the performance of neural networks.
In the spirit of the idea of the bootstrap, ref. [13] proposed a backward bootstrap trick to predict an AR(p) model. The advantage of the backward method is that each bootstrap prediction is naturally conditional on the latest p observations, which coincide with the conditional prediction in the real world. However, this method cannot handle non-linear time series, whose backward representation may not exist. Later, ref. [14] proposed a strategy to generate forward bootstrap AR(p) series. To resolve the conditional prediction issues, they fixed the last p bootstrap values to be the true observations and computed predictions iteratively in the bootstrap world starting from there. They then extended this procedure to forecast the GARCH model in [15].
Sharing a similar idea, ref. [16] defined the forward bootstrap to perform prediction, but they proposed a different PI format that empirically has better performance, according to the coverage rate (CVR) and the length (LEN), compared to the PI of [14]. Although ref. [16] covered the forecasting of a non-linear and/or nonparametric time-series model, only one-step-ahead prediction was considered. The case of the multi-step-ahead prediction of non-linear (but parametric) time-series models was recently addressed in [17]. In the paper at hand, we address the case of the multi-step-ahead prediction of nonparametric time-series models, as in Equation (1). Beyond discussing optimal L 1 and L 2 point predictions, we consider two types of PI-quantile PI (QPI) and pertinent PI (PPI). As already mentioned, the former can be approximated by taking the quantile values of the future value's distribution in the bootstrap world. The PPI requires a more complicated and computationally heavy procedure to be built, as it attempts to capture the variability in parameter estimation. This additional effort results in improved finite-sample coverage as compared to the QPI.
As in most nonparametric estimation problems, the issue of bias becomes important. We will show that debiasing on the inherent bias-type terms of local constant estimation is necessary to guarantee the pertinence of a PI when multi-step-ahead predictions are required. Although the QPI and PPI are asymptotically equivalent, the PPI renders a better CVR in finite-sample cases; see the formal definition of PPI in the work of [2,16]. Analogously to the successful construction of PIs in the work of [18], we can employ predictive-as opposed to fitted-residuals in the bootstrap process to further alleviate the finite-sample undercoverage of bootstrap PIs in practice. There are several other nonparametric approaches to carry out the prediction inference of future values; e.g., see the work of [5,19] for variants of kernel-based methods; see the work of [20][21][22] for prediction with a neural network using the sieve bootstrap or various ensemble strategies; finally, see the work of [23][24][25] for a novel transformation-based approach for modelfree prediction. The comparison of these various nonparametric techniques could be an independent study. This paper is organized as follows. In Section 2, forward bootstrap prediction algorithms with local constant estimators will be given. The asymptotic properties of point predictions and PIs will be discussed in Section 3. Simulations are given in Section 4 to substantiate the finite-sample performance of our methods. Conclusions are given in Section 5. All proofs can be found in Appendix A. Discussions on the debiasing and pertinence related to building PIs are presented in Appendices B-D.

Nonparametric Forward Bootstrap Prediction
As discussed in the remark in Section 1, we can apply the simulation or bootstrap technique to approximate the distribution of future values. In general, this idea works for any geometrically ergodic autoregressive model, regardless of whether it is in a linear or non-linear format. For example, if we have a known general model X t = G(X t−1 , t ) at hand, we can perform k-step-ahead predictions according to the same logic of the three-step-ahead prediction example in Section 1.
To elaborate, we need to simulate { * i } T+k i=T+1 as i.i.d. from F and then compute the pseudo-value X * T+k iteratively with simulated innovations as follows: Repeating this procedure M times, we can make a prediction inference with the empirical Similarly, if the model and innovation distribution are unknown to us, we can perform the estimation first to obtain G(·, ·) and F . Then, the above simulation-based algorithm turns out to be a bootstrap-based algorithm. More specifically, we bootstrap {ˆ * i } T+k i=T+1 from F and calculate the pseudo-value X * T+k iteratively with G(·, ·). The prediction inference can also be conducted with the empirical distribution of { X * (m) T+k } M m=1 . This simulation/bootstrap idea was recently implemented by [17] in the case where the model G is either known or parametrically specified. In what follows, we will focus on the case of the nonparametric model in Equation (1) and will analyze the asymptotic properties of the point predictor and prediction interval. For the sake of simplicity, we consider only the case in which p = q = 1; the general case can be handled similarly, but the notation is much more cumbersome. Assume that we observe T + 1 data points and that we denote them by {X 0 , . . . , X T }; our goal is the prediction inference of X T+k for some k ≥ 1. If we know m(·), σ(·), and F , we can take a simulation approach to develop the prediction inference, as we explained in Section 1. When m(·), σ(·), and F are unknown, we start by estimating m(·) and σ(·); we then estimate F based on the empirical distribution of residuals. Subsequently, we can deploy a bootstrap-based method to approximate the distribution of future values. Several algorithms are given for this purpose later in the paper.

Bootstrap Algorithm for Point Prediction and QPI
For concreteness, we focus on local constant estimators, i.e., kernel-smoothed estimators of the Nadaraya-Watson type; other estimators can be applied similarly. The local constant estimators of m(·) and σ(·) are, respectively, defined as: where K is a non-negative kernel function that satisfies some regularity assumptions; see Section 3 for details. We use h to represent the bandwidth of kernel functions, but h may take a different value for mean and variance estimators. Due to theoretical and practical issues, we need to truncate the above local constant estimators as follows: where C m and C σ are large enough, and c σ is small enough.
Using m h (·) and σ h (·) on Equation (1), we can obtain the fitted residuals {ˆ t } T t=1 , which are defined as:ˆ Later, in Section 3, we will show that the innovation distribution F can be consistently estimated from the centered empirical distribution of {ˆ t } T t=1 , i.e., F , under some standard assumptions. We now have all the ingredients to perform the bootstrap-based Algorithm 1 to yield the point prediction and QPI of X T+k .

Algorithm 1 Bootstrap prediction of X T+k with fitted residuals
Step 1 With data {X 0 , . . . , X T }, construct the estimators m h (x) and σ h (x) with Equation (6).
Step 2 Compute fitted residuals based on Equation (7), and let¯ = 1 Step 3 For example, Step 4 Repeating Step 3 M times, we obtain pseudo-value replicates of X * T+k that we denote by {X

Remark 2.
To construct the QPI of Algorithm 1, we can employ the optimal bandwidth rate, i.e., h = O(T −1/5 ). However, in practice with small sample size, the QPI has a better empirical CVR for multi-step-ahead predictions by adopting an under-smoothing bandwidth; see Appendix B for a related discussion, and see Section 4 for simulation comparisons between applying optimal and under-smoothing bandwidths to the QPI.
In the next section, we will show the conditional asymptotic consistency of our optimal point predictions and the QPI. In particular, we will verify that our point predictions converge to oracle optimal point predictors in probability-conditional on X T . In addition, we will look for an asymptotically valid PI with a (1 − α)100% CVR to measure the prediction accuracy conditional on the latest observed data, which is defined as: where L and U are lower and higher PI bounds, respectively. Although not explicitly denoted, the probability P should be understood as the conditional probability given X T . Later, based on a sequence of sets that contains the observed sample with a probability tending to 1, we will show how to build a prediction interval that is asymptotically valid by the bootstrap technique, even if the model information is unknown.
Although asymptotically correct, in finite samples, the QPI typically suffers from undercoverage; see the discussion in [2,16]. To improve the CVR in practice, we consider taking the predictive residuals to boost the bootstrap process. To derive such predictive residuals, we need to estimate the model based on the delete-X t dataset, i.e., the available data for the scatter plot of X i vs. {X i−1 } for i = 1, . . . , t − 1, t + 1, . . . , T, i.e., excluding the single point at i = t. More specifically, we define the delete-X t local constant estimators as: Similarly, the truncated delete-X t local estimators m t (x) and σ t h (x) can be defined according to Equation (6). We now construct the so-called predictive residuals as: The k-step-ahead prediction of X T+k with predictive residuals is depicted in Algorithm 2.
Although Algorithms 1 and 2 are asymptotically equivalent, Algorithm 2 gives a QPI with a better CVR for finite samples; see the simulation comparisons of these two approaches in Section 4.

Algorithm 2
Bootstrap prediction of X T+k with predictive residuals Step 1 The same as Step 1 of Algorithm 1.
Step 2 Compute predictive residuals based on Equation (11). Let F p denote the empirical distribution of the centered predictive residualsˆ Steps 3-4 Replace F by F p in Algorithm 1. All the rest are the same.

Bootstrap Algorithm for PPI
To improve the CVR of a PI, we can try to take the variability in the model estimation into account when we build the PI; i.e., we need to mimic the estimation process in the bootstrap world. Employing this idea results in a pertinent PI (PPI), as discussed in Section 1; see also [26].
Algorithm 3 outlines the procedure to build a PPI. Although this algorithm is more computationally heavy, the advantage is that the PPI gives a better CVR compared to the QPI in practice, i.e., with finite samples; see the examples in Section 4.

Remark 3 (Bandwidth choices). In
Step 3 (b) of Algorithm 3, we can use an optimal bandwidth h and an over-smoothing bandwidth g to generate bootstrap time series so that we can capture the asymptotically non-random bias-type term of nonparametric estimation by the forward bootstrap; see the application in [27]. We can also apply an under-smoothing bandwidth h (and then use g = h) to render the bias term negligible. It turns out that both approaches work well for one-step-ahead prediction, although applying the over-smoothing bandwidth may be slightly better. However, taking under-smoothing bandwidth(s) is notably better for multi-step-ahead prediction. The reason for this is that the bias term cannot be captured appropriately for multi-step-ahead estimation with an over-smoothing bandwidth. On the other hand, with an under-smoothing bandwidth, the bias term is negligible; see Section 3.2 for further discussion; also, see [28] for a related discussion. The simulation studies in Appendix C explore the differences between these two bandwidth strategies.

Algorithm 3 Bootstrap PPI of X T+k with fitted residuals
Step 1 With data {X 0 , . . . , X T }, construct the estimators m h (x) and σ h (x) by using Equation (6). Furthermore, compute fitted residuals based on Equation (7). Denote the empirical distribution of centered residuals byˆ t − 1 T ∑ T i=1ˆ i , t = 1, . . . , T by F .
Step 2 Construct the L 1 or L 2 prediction X T+k using Algorithm 1.
Step 3 (a) Resample (with replacement) the residuals from F to create pseudo-errors where I is generated as a discrete random variable uniformly distributed on the values 0, . . . , T. Then, create bootstrap pseudo-data {X * t } T t=1 in a recursive manner from the formula (c) Based on the bootstrap data {X * t } T t=0 , re-estimate the regression and variance functions according to Equation (6) and obtain m * h (x) and σ * h (x); we use the same bandwidth h as the original estimator m h (x). (d) Guided by the idea of the forward bootstrap, re-define the latest value of X * T to match the original, i.e., re-define X * T = X T . (e) With the estimators m g (x) and σ g (x), the bootstrap data {X * t } T t=0 , and the pseudoerrors {ˆ * t } T+k t=T+1 , use Equation (12) to recursively generate the future bootstrap data X * T+1 , . . . , X * T+k . (f) With bootstrap data {X * t } T t=0 and the estimators m * h (x) and σ * h (x), utilize Algorithm 1 to compute the optimal bootstrap prediction, which is denoted by X * T+h ; to generate bootstrap innovations, we still use F . (g) Determine the bootstrap predictive root: X * T+k − X * T+k .
Step 4 Repeat Step 3 B times; the B bootstrap root replicates are collected in the form of an empirical distribution whose β-quantile is denoted by q(β). The (1 − α)100% equal-tailed prediction interval for X T+k centered at X T+k is then estimated by [ X T+k + q(α/2), X T+k + q(1 − α/2)].
As Algorithm 2 is a version of Algorithm 1 using predictive (as opposed to fitted) residuals, we now propose Algorithm 4, which constructs a PPI with predictive residuals.

Algorithm 4 Bootstrap PPI of X T+k with predictive residuals
Step 1 With data {X 0 , . . . , X T }, construct the estimators m h (x) and σ h (x) by using Equation (6). Furthermore, compute predictive residuals based on Equation (11).
Denote the empirical distribution of centered residualsˆ Steps 2-4 The same as in Algorithm 3, but change the residual distribution from F to F p , and change the application of Algorithm 1 to Algorithm 2.

Asymptotic Properties
In this section, we provide the theoretical substantiation of our nonparametric bootstrap prediction methods-Algorithms 1-4. We start by analyzing optimal point predictions and the QPI based on Algorithms 1 and 2.

Remark 4.
Since the effect of leaving out one data pair X t vs. {X t−1 } is asymptotically negligible for large T, the delete-X t estimators m t (x) and σ t (x) are asymptotically equal to m(x) and σ(x), respectively. Then, the predictive residualˆ p t is asymptotically the same as the fitted residualˆ t ; see Lemma 5.5 of [16] for a formal comparison of these two types of estimators and residuals. Thus, we just give theorems to guarantee the asymptotic properties of point predictions and PIs with fitted residuals. The asymptotic properties for variants with predictive residuals also hold true.

On Point Prediction and QPI
First, to conduct statistical inference for time series, we need to quantify the degree of the asymptotic dependence of the time series. In this paper, we consider that the time series is geometrically ergodic, which is equivalent to the β-mixing condition with an exponentially fast mixing rate; see [29] for a detailed introduction to different mixing conditions and ergodicity. To simplify the proof, we make the following assumptions: A1-A3 can guarantee that the time-series process is geometrically ergodic; see Theorem 1 of [30] for proof and see the work of [31] for a discussion on the sufficient conditions of higher-order time series.
Since we need to build consistent properties of the nonparametric estimation, we further assume that: A4 The regression function m(x) is twice continuously differentiable with bounded derivatives, and we denote its Lipschitz continuous constant as L m ; A5 The volatility function σ(x) is twice continuously differentiable with bounded derivatives, and we denote its Lipschitz continuous constant as The kernel function K(x) is a compactly supported and symmetric probability density on R and has a bounded derivative.

Remark 5.
Assumption A6 is originally used to show that the expected value of X * t is O p (1) in the bootstrap world for all t. In practice, this assumption is not strict; see examples in Section 4. For assumption A8, we can apply a kernel with a support on the whole real line as long as the part outside a large enough compact set is asymptotically negligible.
Under A1-A8, [27] shows that truncated the local constant estimators in Equation (6) are uniformly consistent with the true functions in an expanding region. We summarize this result in the lemma below: Lemma 1. Under A1-A8 and observed data {X 0 , . . . , X T }, for local constant estimation as in Equation (6), we have: where c T is an appropriate sequence that converges to infinity as T → ∞.
In addition, for the centered empirical distribution ofˆ , we can derive Lemma 2 to describe its consistency property.

Lemma 2.
Under A1-A8 and observed data {X 0 , . . . , X T }, for the centered empirical distribution F , we have: sup See Theorem 5 of [27] for the proof of Lemmas 1 and 2. Combining all the pieces, we present Theorem 1 to show that the optimal point prediction and QPI returned by Algorithm 1 or Algorithm 2 are consistent and asymptotically valid, respectively, conditionally on the latest observations. Theorem 1. Under assumptions A1-A8 and observed data {X 0 , . . . , X T }, we have: where X * T+k is a future value in the bootstrap world that can be determined iteratively by applying the expression X * , with its distribution given by the empirical distribution of fitted (or predictive) residuals; F X * T+k |X T ,...,X 0 (x) represents the distribution P * (X * T+k ≤ x|X T , . . . , X 0 ), and here, we take P * to represent the probability measure conditional on the sample of data; and F X T+k |X T (x) represents the (conditional) distribution of X T+k in the real world, i.e., P(X T+k ≤ x|X T ).

On PPI with Homoscedastic Errors
With more complicated prediction procedures, such as Algorithms 3 and 4, we expect to find a more accurate PI, i.e., a PPI. The superiority of such PIs is that the estimation variability can be captured when we use the distribution of the predictive root in the bootstrap world to approximate its variant in the real world. We consider models with homoscedastic errors throughout this section; the model with heteroscedastic errors will be analyzed later.
Firstly, let us consider the one-step-ahead predictive root centered at the optimal L 2 point prediction in the real and bootstrap worlds, as given below: where M is the number of bootstrap replications that we employ to approximate the optimal L 2 point prediction. Since we have centered the residuals to a mean of zero, Equation (16) degenerates to the following simple form asymptotically as M → ∞: To acquire a pertinent PI according to Definition 2.4 of [16], in addition to Equation (14), we also need asymptotically valid confidence intervals for local constant estimation in the bootstrap world; i.e., we should be able to estimate the distribution of the nonparametric estimator in the bootstrap world. For one-step-ahead prediction, this condition can be formulated as follows: where and a T is an appropriate sequence such that P(a T A m ≤ x) has a nontrivial limit as T → ∞. In [16], it was assumed that the nontrivial limit of P(a T A m ≤ x) is continuous. In this case, the uniform convergence in Equation (18) follows from the pointwise convergence of all x.

Remark 6.
As we have discussed in Remark 3, the bootstrap procedure cannot capture the bias term of nonparametric estimation exactly unless delicate manipulations are made. Ref. [16] adopts two strategies to solve this issue: (B1) let g = h, and take a bandwidth rate satisfying hT 1/5 → 0, i.e., under-smoothing in function estimation; (B2) use the optimal smoothing rate with h proportional to T −1/5 , but generate time series in the bootstrap world with over-smoothing estimators, i.e., g = h and g/h → ∞. No matter which approach we take, Equation (18) can be shown; see details from Theorem 1 of [27] and Theorem 5.4 of [16].
The following corollary is immediate: Under assumptions A1-A3 and observed data {X 0 , . . . , X T }, the one-step-ahead PI returned by Algorithms 3 and 4 with fitted or predictive residuals is asymptotically pertinent, respectively.
However, for multi-step-ahead predictions, the analysis becomes more complicated, and the under-smoothing strategy turns out to work better. For example, considering the two-step-ahead prediction, the two predictive roots can be written as follows: Correspondingly, the predictive root in the bootstrap world is: where the approximated equality is due to the application of the LLN on the sample mean of the centered residuals.

Remark 7.
We should note that the over-smoothing approach may work better for finite samples. The reason is that applying the optimal bandwidth rate is superior when the bias-type term of the nonparametric estimation can be captured by the bootstrap. However, we will soon show that applying an under-smoothing bandwidth strategy is more accurate for multi-step-ahead predictions since it can solve the bias issue and render a PPI. Thus, in practice, we recommend adopting strategy (B2) to perform one-step-ahead predictions and adopting strategy (B1) to perform multi-step-ahead predictions. For a time series with heteroscedastic errors, the optimal bandwidth strategy is slightly different; see Section 3.3 for reference.
Based on Equations (20) and (21), as we prove that the future distribution of X * T+k converges uniformly to the future distribution of X T+k in probability, we can show that the distribution of the predictive root X * T+2 − X * T+2 in the bootstrap world also converges uniformly in probability to the distribution of the predictive root X T+2 − X T+2 in the real world. This result guarantees the asymptotic validity of the PPI. We summarize this conclusion in Theorem 2.

Theorem 2.
Under assumptions A1-A8 and observed data {X 0 , . . . , X T }, we have: where X * T+k − X * T+k is the k-step-ahead predictive root in the bootstrap world, and F X * represents its distribution at point x; X T+k − X T+k is the k-step-ahead predictive root in the real world, and F X T+k − X T+k |X T ,...,X 0 (x) represents its (conditional) distribution at point x. This theorem holds for both bandwidth selection strategies.
However, since we apply a more complicated procedure to capture estimation variability, we anticipate that this results in a PPI. To see this, we first apply the Taylor expansion on the r.h.s. of Equations (20) and (21); the two predictive roots can be decomposed into several parts: wherex andx * are some points between m(X T ) and m(X T ) + T+1 and between m g (X T ) and m g (X T ) +ˆ * T+1 , respectively;x i andx * i are some points between m h (X T ) and m h (X T ) + i,T+1 and between m * h (X T ) and m * h (X T ) +ˆ * i,T+1 , respectively. The k-step-ahead predictive root can be expressed similarly when k > 2. We can consider the r.h.s of Equation (23) to be made up of two components in both the real and bootstrap worlds: (2) the rest of the terms, which are related to future innovations. For the second component, the bootstrap can mimic the real-world situation well.
We expect that the first component, i.e., the variability in local constant estimation of the mean function m(m(X T )) − m h ( m h (X T )), can be well approximated by its variant Although PPIs with either of the two bandwidth selection approaches are both asymptotically valid, the PPI with the bandwidth strategy (B2) is only "almost" pertinent for multi-step-ahead predictions since the variability in local constant estimation is not well estimated in finite samples; see also the simulation results in Section 4 and Appendix C. On the other hand, the PPI with the bandwidth strategy (B1) meets our goal. We summarize this finding in Theorem 3.
The direct implication of Theorem 3 is that the PPI generated by Algorithms 3 and 4 should have a better CVR for small sample sizes than the QPI since the estimation variability is included in the PI with high probability; see the simulation examples in Section 4.

On PPI with Heteroscedastic Errors
For time-series models with heteroscedastic errors, i.e., where the variance function σ(x) represents the heteroscedasticity of innovations, we do not need to care about the bias term in the nonparametric estimation of the variance function. In other words, we use neither under-smoothing nor over-smoothing bandwidth tricks on the variance function to generate the bootstrap series for covering the bias term; we can just use the bandwidth with the optimal rate to estimate the variance function from real and bootstrap series.
To see this, let us consider the two-step-ahead predictive root with heteroscedastic errors. In the real world, we have: Correspondingly, the predictive root in the bootstrap world is: Through Taylor expansion, we can obtain: We can still consider the r.h.s. of Equation (29) to contain two components. Once we use the under-smoothing technique to cover the estimation variability for the mean function, since the residual distribution is determined by the estimated mean and variance functions, the convergence rate of the residual distribution to the true innovation distribution is dominated by the convergence rate of m h (x) to m(x). In addition, all estimators of the variance function in Equation (29) are tied with future estimated innovations; so, we are free to use the bandwidth g = h with the optimal smoothing rate to estimate the variance function, and the overall convergence rate will not change. To show this benefit, we ran some simulations, which are presented in Appendix D, to compare the performance of PIs when applying under-smoothing or the optimal bandwidth in estimating the variance function. In Section 4, we will demonstrate the use of the optimal bandwidth to estimate the variance function if the time series is heteroscedastic.
To analyze the pertinence of the PPI for time series with heteroscedastic errors, from Equation (29), it is apparent that the distribution of m(m(X T )) − m h ( m h (X T )) can still be approximated by m g ( m g (X T )) − m * h ( m * h (X T )). For the rest of the terms, the bootstrap can still mimic the real-world situation.

Simulations
In this section, we describe the simulations that we deployed to check the performance of five-step-ahead point predictions and the corresponding PIs of our algorithms in the R platform with finite samples. To obtain the optimal bandwidth h op for our local constant estimators, we relied on the function npregbw from the R package np. The under-smoothing and over-smoothing bandwidths were taken as 0.5 · h op and 2 · h op , respectively.

Optimal Point Prediction
We first consider a simple non-linear model: where { t } is assumed to have a standard normal distribution. The geometric ergodicity of Equation (30) can be easily checked. We apply the "oracle" prediction as the benchmark. The oracle prediction is returned by employing a simulation approach, assuming that we know the true model and the error distribution, i.e., the simulation-based prediction, as we discussed in Section 1; see Section 3.2 of [17] for more details and the theoretical validation of this approach. Since this oracle prediction should have the best performance, we would like to challenge our nonparametric bootstrap-based methods by comparing them with the oracle prediction. We also pretend that the true model and innovation distribution are unknown when we perform the nonparametric bootstrap-based prediction. For point predictions, we just utilize fitted residuals. The application of predictive residuals will play a role in building PIs later.
In a single experiment, we take X 0 ∼ Uniform(−1, 1) and then iteratively generate a series with size C + T + 1 according to Equation (30). Here, C is taken as 200 to remove the effects of the initial distribution of X 0 . To perform oracle predictions, we take M = 1000 to obtain a satisfying approximation. For a fair comparison, we also apply a 1000 times bootstrap in Algorithms 1 and 2 to obtain bootstrap-based predictions.
Referring to the simulation studies in [16], we take T = 100, 200, k = 1, . . . , 5, and employ the Mean-Squared Prediction Error (MSPE) to compare oracle and bootstrap predictions. The metric MSPE can be approximated based on the formula below: MSPE of the k-th ahead prediction = 1 N N ∑ n=1 (X n,k − P n,k ) 2 , for k = 1, . . . , 5, where P n,k represents the k-th step-ahead optimal L 1 or L 2 point predictions implied by the bootstrap or simulation approach, and X n,k stands for the true future value in the n-th replication. We take N = 5000 and record all MSPEs in Table 1. From Table 1, we can find that the MSPEs of oracle-and bootstrap-based L 1 -or L 2 -optimal predictions are very close to each other, respectively. The MSPEs of oracle optimal predictions are always smaller than the corresponding bootstrap predictions. This phenomenon is in line with our expectation since the bootstrap prediction is obtained with an estimated model and innovation distribution.
Rather than applying the standard normal distribution, we consider a skewed innovation, i.e., t ∼ χ 2 (3) − 3. Repeating the above process, we present the MSPEs in Table 2.  The performance of bootstrap-based predictions is also competitive with oracle predictions. Another notable phenomenon indicated by Table 2 is that the MSPE of L 2 -optimal predictions is always less than its corresponding value in L 1 -optimal predictions. The reason for this is that the L 2 -optimal prediction coincides with the L 2 loss used in MSPE. However, this phenomenon is not remarkable for the results in Table 1 since the innovation distribution is symmetric in that case.

Model:
For the non-linear model with heteroscedastic errors, we consider the following model: Model Equation (32) is in a GARCH form, except that the regression function is non-linear. This model was also considered by [16]. We present the MSPEs of different predictions in Table 3. It reveals that our bootstrap-based optimal point prediction methods can work for the non-linear time-series model with heteroscedastic errors, and its performance is still competitive with oracle predictions. Model:

Remark 8. In practice, we should mention that both local constant estimators m(x) and σ(x)
will only be accurate when x falls in the area where data are dense. Estimations in the sparse area will return large fitted residuals. These large residuals will spoil the multi-step-ahead prediction process in the bootstrap procedure. Thus, depending on which optimal prediction we are pursuing, we replace all inappropriate or numerical NaN values with the sample mean or sample median of observed data. In addition, during the simulation studies, we truncate m(x) h ; i.e., we take C m as 5 · max{|x 0 |, . . . , |x T |}. For the mean function estimator m(x) * h in the bootstrap world, we take C * m as min{2 · C m , 5 · max{|x * 0 |, . . . , |x * T |}} since we want to allow more variability for the bootstrap series. For the local constant estimator of the variance function, we take c σ and c * σ as 0.01. We take C σ and C * σ as 2 ·σ and min{4 ·σ, 2 ·σ * }, respectively;σ andσ * are the sample standard deviations of the observed series in the real world and bootstrap world, respectively. These truncating constants work well for the above two models. In practice, a cross-validation approach could be taken to find the optimal truncating constants.

QPI and PPI
In this subsection, we try to evaluate the CVR of the QPI and PPI based on the nonparametric forward bootstrap prediction method. Similarly, we take the oracle prediction interval as the benchmark, which is computed by the QPI with a known model and innovation distribution; see the discussion in Section 1 and Section 3.2 of [17] for references on this approach.
Due to the time complexity of the double bootstrap in the bootstrap world, we only take B = 500 and M = 100 in Algorithms 3 and 4 to derive the PPI. Correspondingly, we take M = 500 to compute the QPI. In practice, people can increase the values of B and M. To make the result as consistent as possible, we still repeat the simulation process 5000 times.
The empirical CVR of the bootstrap-based QPI and PPI for k = 1, . . . , 5-step-ahead predictions is determined with the formula below: CVR of the k-th ahead prediction = 1 N N ∑ n=1 1 X n,k ∈[L n,k ,U n,k ] , for k = 1, . . . , 5, where [L n,k , U n,k ] and X n,k represent the k-th step-ahead prediction interval and the true future value in the n-th replication, respectively. In addition to the CVR, we are also concerned about the empirical LEN of different PIs. The empirical LEN of a PI is defined as follows: LEN of the k-th ahead PI = 1 N N ∑ n=1 (U n,k − L n,k ), for k = 1, . . . , 5.
Recall that the PPI can be centered at the L 1 -or L 2 -optimal point predictor, and the QPI can be found with the optimal bandwidth and the under-smoothing bandwidth; thus, we have four types of PIs based on the bootstrap. In particular, each type of PI can be obtained with fitted or predictive residuals. In total, we have eight bootstrap-type PIs and one oracle PI. In addition, to observe the effects of introducing the predictive residuals and the superiority of the PPI, we consider three sample sizes, 50, 100, and 200. All CVRs and LENs for different PIs for predicting Equations (30) and (32) are presented in Tables 4 and 5,  respectively. From there, we can observe that the SPI (oracle PI) is the best one, according to the most accurate CVR and relatively small LEN. For the QPI with fitted residuals, it severely under-covers the true future value, especially for data with a small sample size. With predictive residuals, although the LEN of the PI gets amplified, the CVR of the QPI improves significantly. After applying the under-smoothing bandwidth with the QPI, the CVR is further improved for multi-step-ahead (i.e., k ≥ 2) predictions, regardless of whether fitted or predictive residuals are used. The PPI with fitted residuals outperforms the QPI with fitted residuals. The PPI with predictive residuals can achieve the most accurate CVR among various bootstrap-based PIs, especially when the data are short, though the price is that its LEN is the largest compared to other PIs. We should note that the QPI with predictive residuals and the under-smoothing bandwidth can achieve a great CVR with 200 samples for these two models. However, we may not know the sufficiently large sample size to guarantee that the QPI can work well. Thus, we recommend taking the PPI with predictive residuals as the first choice.

Remark 9.
We should clarify that the CVR computed using Equation (33) is the unconditional coverage rate of X T+k since it is an average of the conditional coverage of X T+k for all replications. Note: With no other specifications, throughout all simulations, QPI-f and QPI-p represent QPIs based on optimal bandwidth with fitted and predictive residuals, respectively; QPI-f-u and QPI-p-u represent QPIs based on undersmoothing bandwidth with fitted and predictive residuals, respectively; L 2 -PPI-f-u and L 2 -PPI-p-u represent PPIs centered at L 2 -optimal point prediction with fitted and predictive residuals, respectively; L 1 -PPI-f-u and L 1 -PPI-p-u represent PPIs centered at L 1 -optimal point prediction with fitted and predictive residuals, respectively; all PPIs with the "-u" symbol are based on applying the under-smoothing bandwidth to estimate the model; SPI represents the oracle PI. Note: All PPIs with the "-opv" symbol are based on applying under-smoothing and optimal bandwidths to estimate mean and variance functions, respectively.

Simulation Results for Appendices
We carried out a simulation study to show that the QPIs with the optimal bandwidth and under-smoothing bandwidth are asymptotically equivalent; see the results in Table 6, and see the formal analysis in Appendix B. We also deployed simulations to check the effects of applying under-smoothing or over-smoothing tricks on the performance of the PPI. We took the sample size T + 1 to be 50 or 500 and performed simulations 5000 times on the first model; see Table 7 and Appendix C for the results and analysis, respectively. For a model with heteroscedastic errors, as we mentioned in Section 3.3, we can rely on the optimal bandwidth to estimate the variance functions. To check this claim, we consider two strategies for the bandwidth of the estimator for the variance function: (1) take the under-smoothing bandwidth as we do for the mean function estimator; (2) take the bandwidth with the optimal rate. To estimate the mean function in the bootstrap world, we continue using the under-smoothing bandwidth strategy. The simulation results based on Equation (32) with a small sample size are shown in Table 8; see the corresponding discussion in Appendix D.  Note: "-opv" indicates the corresponding PPI is built by optimal bandwidth for the variance function estimator.

Conclusions
In this paper, we propose some forward bootstrap prediction algorithms based on the local constant estimation of the model. With theoretical and practical validations, we show that our bootstrap-based point predictions work well, and their MSPEs are very close to those of the oracle predictions. By debiasing the nonparametric estimation with the under-smoothing bandwidth, we show that the confidence bounds for the multi-step-ahead estimator can be approximated by the bootstrap. As a result, we can obtain a pertinence prediction interval by using a specifically designed algorithm. Empirically, we further take the predictive residuals to make predictions that can alleviate the under-coverage of the PI for a small sample size. Among different bootstrap-based PIs, as revealed by simulation studies, the PPI with predictive residuals is the best one, and it is competitive with the oracle PI.
Author Contributions: All authors D.N.P. and K.W. contributed equally to this project. All authors have read and agreed to the published version of the manuscript.

Funding:
The research of the first author was partially supported by NSF grant DMS 19-14556. The research of the second author was partially supported by the Richard Libby Graduate Research Award.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
Proof of Theorem 1. To show that Equation (15) is satisfied for k ≥ 1, we can just show the case with k = 2. Cases with k = 1 and k > 2 can be handled similarly. F X T+2 |X T (x) is equivalent to: to simplify notations. Similarly, we can analyze F X * T+2 (x), and it has the following equivalent expressions: where G(x, X T ,ˆ * T+1 ) represents , and E * (·) represents the expectation in the bootstrap world, i.e., E(·|X T , . . . , X 0 ). Thus, we hope to show: However, it is hard to analyze Equation (A3) since there is a random variable X T inside E * (·) and E(·). Thus, we consider two regions of X T , i.e., (1) |X T | > γ T and (2) |X T | ≤ γ T , where γ T is an appropriate sequence that converges to infinity. Under A1, A2, and A5, by Lemma 1 of [30], we have: In addition, we have the relationship: Thus, to verify Equation (A3), we just need to show that the second term on the r.h.s. of Equation (A5) converges to 0. We can take the sequences c T and γ T to be the same sequence, which converges to infinity slowly enough. Then, it is enough for us to analyze the asymptotic probability of the following expression: We decompose the l.h.s. of Equation (A6) into: Then, we analyze the two terms on the r.h.s. of Equation (A7) separately. For the first term, we have: ≤ sup |x|≤c T ,|y|≤c T ,z F ( G(x, y, z)) − F ( G(x, y, z)) p → 0, under Equation (14).
For the second term on the r.h.s. of Equation (A7), we have: We further consider the two terms on the r.h.s. of Equation (A9) separately. For the first term, by applying Taylor expansion, we have: ≤ sup |x|≤c T ,|y|≤c T ,j∈{1,...,T} C · G(x, y,ˆ j ) − G(x, y, j ) .

(A11)
From Equation (A10) and Lemma 1, we verify that Equation (A11) converges to 0 in probability. For the second term on the r.h.s. of Equation (A9), by the uniform law of large numbers, we have: Combining all the pieces, Equation (A6) converges to 0 in probability, which implies Theorem 1.
Proof of Theorem 2. We want to show: where X * T+k − X * T+k and X T+k − X T+k are predictive roots. We still present the proof for the two-step prediction. The proof for higher-step predictions can be shown similarly. When we are dealing with two-step-ahead predictions, predictive roots have the same expression as in Equations (20) and (21). Thus, we want to measure the asymptotic distance between the following two quantities: Compared to Equations (A1) and (A2), Equations (20) and (21) just have two more , in the predictive root in the real and bootstrap worlds, respectively. By the LLN, these two terms converge to their corresponding means in the real or bootstrap world. Based on the consistency between m h (·) and m * h (·), we can show Theorem 2 similarly with the procedure used to prove Theorem 1.
Proof of Theorem 3. The proof is based on {X 0 , . . . , X T } ∈ Ω T . We need to verify Equation (24); i.e., we can build confidence bounds for the k-step-ahead estimation by the bootstrap. Still, we focus on the two-step-ahead prediction; i.e., we want to show: Applying the property P(|X T | > c T ) → 0 again, it is enough to show: To handle the uniform convergence on y, we make a ε-covering of X T . Let the εcovering number of [−c T , c T ] be C N = N(ε; [−c T , c T ]; | · |), which means that for every y ∈ [−c T , c T ], ∃ i ∈ {1, 2, . . . , C N } s.t. |y − y i | ≤ ε for ∀ε > 0. Defining y 0 ∈ {y 1 , . . . , y C N }, we can consider: For the first term on the r.h.s. of Equation (A17), we have: where C 1 and C 2 are some finite constants since the derivatives of m h (·) and m(·) are bounded. Considering the first term inside the absolute bracket on the r.h.s. of Equation (A18), we can think of this as a convolution of two random variables: Further, based on the smoothing property of m h (·) and m(·), again, we can take ε to be small enough to make the random variable Z close to being degenerated, i.e., P(Z = 0) = 1 − P(Z ∈ A) = 1 − o(1); A is a small set around 0 without containing 0. Thus, Equation (A18) can be written as: Similarly, the last term on the r.h.s. of Equation (A17) can also be made to converge to 0. We can then focus on analyzing the middle term. In other words, it is enough to analyze the pointwise convergence property between distributions in the real and bootstrap worlds. According to the idea for estimating the distribution of nonparametric estimation by the bootstrap in the work of [27], we decompose √ Th( m h ( m h (y 0 )) − m(m(y 0 ))) into bias-type and variance terms: where K h (·) represents the function in the form 1 h K(·/h). By also carrying this process out for For the variance term, by Lemma 4.4 of [27], we have: where Z(x 0 ) has the distribution N(0, τ 2 (x 0 )); τ 2 (x 0 ) = f X (x 0 ) K 2 (v)dv; x 0 ∈ R. Since m h (y 0 ) and m * h (y 0 ) both converge to m(y 0 ) in probability and the target distribution is continuous, by the continuous mapping theorem, we can obtain the uniform convergence between the distributions of √ Thr V,h (m(y 0 )) and √ Thr V,h ( m h (y 0 )), i.e.: To show the uniform convergence relationship between √ Thr * V,h (m(y 0 )) and Unfortunately, F is the empirical distribution of residuals and is discrete. To make the analysis more convenient, we take a convolution approach to smooth the distribution of empirical residuals; i.e., we define another random variable, which is the sum ofˆ and a standard normal random variable ξ: where ξ ∼ N(0, L(T)) and L(T) → 0 at an appropriate rate. It is easy to show that the distribution of˜ , F is asymptotically equivalent to F ; i.e., Equation (14) is also satisfied for F . In practice, we can take L(T) to be small enough, and then we still bootstrap time series based on F in practice. However, from a theoretical view, we would like to take F . To simplify the notation, we use F throughout this paper, and its representation changes according to the context. Combining all the pieces, we can obtain: (1); Then, the bias-type term in the real and bootstrap worlds remains to be analyzed. We first consider the bias-type termr B,h ( m h (y 0 )): For the first term on the r.h.s. of Equation (A29), based on the ergodicity of {X t } series, we can find that the mean of this term is: If we take the bandwidth satisfying Th 5 → 0, Equation (A30) converges to 0. Then, we consider the mean of the second term on the r.h.s. of Equation (A29): h (x) · ( m h (y 0 ) − m(y 0 )) · (m(X t ) − m(m(y 0 ))) h (x) · ( m h (y 0 ) − m(y 0 )) · (m(X t ) − m(m(y 0 ))) X t . (A31) Since E( √ Th · ( m h (y 0 ) − m(y 0 )) is O( √ Th 5 ) (see Lemma 4.6 of [27] for the proof), under the assumption that K(·) has a bounded derivative and m(·) is bounded in a compact set, we have E(E( h (x) · ( m h (y 0 )−m(y 0 )) · (m(X t ) − m(m(y 0 )))|X t )) equal to O( √ Th 5 ); once we select the under-smoothing bandwidth that satisfies Th 5 → 0, Equation (A31) converges to 0. Then, we need to analyze the variance of √ Thr B,h ( m h (y 0 )). Similarly, we can show that it is op (1). All in all,
For the bias-type termr * B,h ( m * h (y 0 )) in the bootstrap world, we can perform a similar decomposition to the one applied in Equation (A29), and then we can obtain: We first rely on the fact that E * (E * ( ; see Lemma 4.6 of [27] for more details. Thus, using the under-smoothing bandwidth strategy, the second term on the r.h.s. of Equation (A32) also converges to 0. For the first term, we can rely on the fact that the bootstrap series is also ergodic with high probability; see Theorem 2 of [30,32] for a time-series model with homoscedastic or heteroscedastic errors, respectively. Thus, with a similar analysis of the variant in the real world, we can see that the bias-type term in the bootstrap world also converges to 0 in probability. Given the consistent relationship betweenf h ( m h (y 0 )) andf * h ( m h (y 0 )), which is implied by Lemma 4.5 of [27], Equation (A15) follows from the analysis of variance and bias-type terms in the real and bootstrap worlds.

Appendix B. The Advantage of Applying Under-Smoothing Bandwidth for QPI with Finite Sample
The proof of Theorem 1 provides the big picture of the asymptotic validity of the QPI. Although the choice of the bandwidth does not influence the asymptotic validity of the QPI, we can find that the QPI with the under-smoothing bandwidth has a better CVR for multi-step-ahead predictions from the simulation results. We attempt to analyze this phenomenon informally. Starting from the convergence result, we want to show: sup |x|≤c T F X * T+k |X T ,...,X 0 (x) − F X T+k |X T (x) p → 0, for k ≥ 1.

(A36)
For the first term on the r.h.s. of Equation (A36), based on the ergodicity, asymptotically, we have: The convergence of f h (m(y) + j ) to f X (m(y) + j ) guarantees the consistency relationship of Equation (A37) and m(m(y) + j ). Similarly, for the third term on the r.h.s. of Equation (A36), we can conduct a similar analysis to find the convergence to 0 in probability. Moreover, the convergence speed is related to O(h 2 ). When multiple-step-ahead predictions are required, we will obtain more and more such O(h 2 ) terms. If we have large enough data, it is "safe" to focus on the bandwidth with the optimal rate to estimate the model. However, for the finite-sample cases, it is better to take an under-smoothing h, though the corresponding LEN of the prediction interval will get larger due to the mean-variance trade-off. This conclusion coincides with the results shown in Tables 4 and 5. From there, we can observe that the one-step-ahead QPI with the optimal bandwidth has a better CVR compared to the version with the under-smoothing bandwidth. In the meantime, the LEN of the PI with the optimal bandwidth is also slightly smaller. When the prediction horizon is larger than 1, although the QPI with the under-smoothing bandwidth has a slightly larger LEN, its CVR is notably better than the QPI with the optimal bandwidth. Here, we conducted more simulation studies to show that the QPIs with the optimal bandwidth and under-smoothing bandwidth are asymptotically equivalent. We performed simulations with Equation (30) and took T + 1 to be 1000. The CVR and LEN of different QPIs are tabulated in Table 6. From the simulation results, although the LEN of the QPI with the optimal bandwidth is always less than the variant with the under-smoothing bandwidth, the difference is marginal. In addition, these two types of QPIs have indistinguishable performance according to the CVR, which implies the asymptotic equivalence of applying the optimal bandwidth or under-smoothing bandwidth. It also implies that adopting fitted or predictive residuals is also asymptotically equivalent.