1. Introduction
Since the 1980s, non-linear time-series models have attracted attention for modeling asymmetry in financial returns, the volatility of stock markets, switching regimes, etc. Compared to linear time-series models, non-linear models are more capable of depicting the underlying data-generating mechanism; see the review in [1], for example. However, unlike linear models, where the one-step-ahead predictor can be iterated, multi-step-ahead prediction for non-linear models is cumbersome, since the innovations strongly influence the forecast value.
In this paper, by combining the forward bootstrap in [2] with nonparametric estimation, we develop multi-step-ahead (conditional) predictive inference for the general model:
$$X_t = m(X_{t-1}, \ldots, X_{t-p}) + \sigma(X_{t-1}, \ldots, X_{t-q})\,\epsilon_t, \qquad (1)$$
where the innovations $\{\epsilon_t\}$ are assumed to be independent and identically distributed (i.i.d.) with mean 0 and variance 1, and $m(\cdot)$ and $\sigma(\cdot)$ are functions that satisfy some smoothness conditions. We will also assume that the time series satisfying Equation (1) is geometrically ergodic and causal, i.e., that for any $t$, $\epsilon_t$ is independent of the past values $\{X_s,\ s < t\}$.
In Equation (1), the trend/regression function $m(\cdot)$ depends on the last $p$ data points, while the standard deviation/volatility function $\sigma(\cdot)$ depends on the last $q$ data points; in many situations, $p$ and $q$ are taken to be equal for simplicity. Some special cases deserve mention: e.g., if $\sigma(\cdot)$ is constant, Equation (1) yields a non-linear/nonparametric autoregressive model with homoscedastic innovations. The well-known ARCH/GARCH models are a special case of Equation (1) with $m(\cdot) \equiv 0$.
Although the $L_2$-optimal one-step-ahead prediction of Equation (1) is trivial when we know the regression function $m(\cdot)$ or have a consistent estimator of it, the multi-step-ahead prediction is not easy to obtain. In addition, it is nontrivial to find the $L_1$-optimal prediction, even for one-step-ahead forecasting. In several applied areas, e.g., econometrics, climate modeling, and water resources management, data might not possess a finite second moment, in which case optimizing the $L_2$ loss is vacuous. For all such cases, but also as a matter of independent interest, prediction that is optimal with respect to the $L_1$ loss should receive more attention in practice; see the detailed discussion in Ch. 10 of [2]. Later, we will show that our method is compatible with both $L_1$- and $L_2$-optimal multi-step-ahead predictions.
Efforts to overcome the difficulty of forecasting non-linear time series can be traced back to the work of [3], where a numerical approach was proposed to compute the exact conditional $L_2$-optimal $k$-step-ahead prediction of $X_{n+k}$ for the homoscedastic version of Equation (1). However, this method is computationally intractable for long-horizon prediction and requires knowledge of the innovation distribution and the regression function, which is not realistic in practice.
Consequently, practitioners started to investigate suboptimal methods to perform multi-step-ahead prediction. Generally speaking, these methods take one of two avenues: (1) direct prediction or (2) iterative prediction. The first idea involves working with a different (“direct”) model, specific to $k$-step-ahead prediction, namely:
$$X_{t+k} = m_k(X_t, \ldots, X_{t-p+1}) + \sigma_k(X_t, \ldots, X_{t-q+1})\,\xi_{t+k}. \qquad (2)$$
Even though $m_k(\cdot)$ and $\sigma_k(\cdot)$ are unknown to us, we can construct nonparametric estimators, $\hat{m}_k(\cdot)$ and $\hat{\sigma}_k(\cdot)$, and plug them into Equation (2) to perform $k$-step-ahead prediction. Ref. [4] gives a review of this approach. However, as pointed out by [5], a drawback of this approach is that information from the intermediate observations $X_{t+1}, \ldots, X_{t+k-1}$ is disregarded. Furthermore, if the innovations $\{\epsilon_t\}$ in Equation (1) are i.i.d., then the errors $\{\xi_t\}$ in Equation (2) cannot be i.i.d. In other words, a practitioner must employ the (estimated) dependence structure of $\{\xi_t\}$ in Equation (2) in order to perform the prediction in an optimal fashion.
The second idea is “iterative prediction”, which employs one-step-ahead predictors in a sequential way to perform a multi-step-ahead forecast. For example, consider a two-step-ahead prediction using Model Equation (1); first, note that the $L_2$-optimal predictor of $X_{n+1}$ is $\hat{X}_{n+1} = m(X_n, \ldots, X_{n-p+1})$. The $L_2$-optimal predictor of $X_{n+2}$ is $E\big[m(X_{n+1}, X_n, \ldots, X_{n-p+2}) \mid X_n, \ldots, X_1\big]$, but since $X_{n+1}$ is unknown, it is tempting to plug in $\hat{X}_{n+1}$ in its place. This plug-in idea can be extended to multi-step-ahead forecasts, but it does not lead to the $L_2$-optimal predictor, except in the special case where the function $m(\cdot)$ is linear, e.g., in the case of a linear autoregressive (LAR) model.
Remark 1. Since neither of the above two approaches is satisfactory, we propose to approximate the distribution of the future value via a particular type of simulation when the model is known or, more generally, by the bootstrap. To describe this approach, we rewrite Equation (1) as
$$X_t = G(\mathbf{Y}_{t-1}, \epsilon_t),$$
where $\mathbf{Y}_{t-1}$ is a vector that represents the relevant past values $(X_{t-1}, \ldots, X_{t-\max(p,q)})$, and $G$ is some appropriate function. Then, when the model and the innovation information are known to us, we can create a pseudo-value $X^*_{n+k}$. Taking a three-step-ahead prediction as an example, the pseudo-values can be defined as follows:
$$X^*_{n+1} = G(\mathbf{Y}_{n}, \epsilon^*_{n+1}), \quad X^*_{n+2} = G(\mathbf{Y}^*_{n+1}, \epsilon^*_{n+2}), \quad X^*_{n+3} = G(\mathbf{Y}^*_{n+2}, \epsilon^*_{n+3}),$$
where $\epsilon^*_{n+1}, \epsilon^*_{n+2}, \epsilon^*_{n+3}$ are simulated as i.i.d. draws from the innovation distribution. Repeating this process to obtain $M$ pseudo-values $X^*_{n+3}$, the $L_2$-optimal prediction of $X_{n+3}$ can be estimated by the mean of these $M$ values. As already discussed, constructing the $L_1$-optimal predictor may also be required, since sometimes the $L_2$ loss is not well defined; in our simulation framework, we can construct this optimal prediction by taking the median of the $M$ pseudo-values. Moreover, we can even build a prediction interval (PI) to measure the forecasting accuracy based on the quantiles of the simulated pseudo-values. The extension of this algorithm to longer prediction horizons is illustrated in Section 2.

Realistically, practitioners will not know $m(\cdot)$, $\sigma(\cdot)$, or the innovation distribution. In this situation, the first step is to estimate these quantities and plug them into the above simulation, which then turns into a bootstrap method. The bootstrap idea was introduced by [6] to carry out statistical inference for independent data. After that, many variants of the bootstrap were developed to handle time-series data. Prominent examples include the sieve bootstrap and the block bootstrap in its many variations, e.g., the circular bootstrap of [7] and the stationary bootstrap of [8]; see [9] for a review. Once some model structure for the data is assumed, practitioners can rely on model-based bootstrap methods, e.g., the residual and/or wild bootstrap; see [10] for a book-length treatment. The bootstrap technique can also be applied to a recently popular model class, namely, neural networks. In particular, ref. [11] applied the bootstrap for inference on neural network parameter estimates, while [12] utilized the bootstrap to estimate the performance of neural networks.
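To make the simulation idea of Remark 1 concrete when the model and innovation law are known, the following minimal sketch (in Python; the specific model $m(x)=\sin(x)$, $\sigma(x)=\sqrt{0.5+0.25x^2}$, and standard normal innovations are illustrative assumptions, not taken from the paper) generates $M$ three-step-ahead pseudo-values and reads off the $L_2$-optimal predictor, the $L_1$-optimal predictor, and a quantile-based PI.

```python
import numpy as np

# Illustrative (assumed) model ingredients: X_t = m(X_{t-1}) + sigma(X_{t-1}) * eps_t
m = lambda x: np.sin(x)                       # regression/trend function (assumption)
sigma = lambda x: np.sqrt(0.5 + 0.25 * x**2)  # volatility function (assumption)
rng = np.random.default_rng(0)
draw_eps = lambda size: rng.standard_normal(size)  # known innovation law: mean 0, variance 1

def simulate_future(x_n, k=3, M=5000):
    """Generate M pseudo-values X*_{n+k} by iterating the known model forward k steps."""
    x = np.full(M, x_n, dtype=float)
    for _ in range(k):
        x = m(x) + sigma(x) * draw_eps(M)     # one forward step with fresh innovations
    return x

x_star = simulate_future(x_n=1.0, k=3)
print("L2-optimal predictor (mean):  ", x_star.mean())
print("L1-optimal predictor (median):", np.median(x_star))
print("95% quantile PI:", np.quantile(x_star, [0.025, 0.975]))
```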
In the spirit of the bootstrap idea, ref. [13] proposed a backward bootstrap trick to predict an autoregressive (AR) model. The advantage of the backward method is that each bootstrap prediction is naturally conditional on the latest $p$ observations, which coincides with conditional prediction in the real world. However, this method cannot handle non-linear time series, whose backward representation may not exist. Later, ref. [14] proposed a strategy to generate forward bootstrap series. To resolve the conditional prediction issue, they fixed the last $p$ bootstrap values to be the true observations and computed predictions iteratively in the bootstrap world starting from there. They then extended this procedure to forecast the GARCH model in [15].
Sharing a similar idea, ref. [16] defined the forward bootstrap to perform prediction, but they proposed a different PI format that empirically has better performance, according to the coverage rate (CVR) and the length (LEN), compared to the PI of [14]. Although ref. [16] covered the forecasting of non-linear and/or nonparametric time-series models, only one-step-ahead prediction was considered. The case of multi-step-ahead prediction of non-linear (but parametric) time-series models was recently addressed in [17]. In the paper at hand, we address the case of multi-step-ahead prediction of nonparametric time-series models, as in Equation (1). Beyond discussing optimal $L_1$ and $L_2$ point predictions, we consider two types of PI: the quantile PI (QPI) and the pertinent PI (PPI). As already mentioned, the former can be approximated by taking the quantiles of the future value's distribution in the bootstrap world. The PPI requires a more complicated and computationally heavy procedure, as it attempts to capture the variability in parameter estimation; this additional effort results in improved finite-sample coverage compared to the QPI.
As in most nonparametric estimation problems, the issue of bias becomes important. We will show that removing the inherent bias-type terms of local constant estimation is necessary to guarantee the pertinence of a PI when multi-step-ahead predictions are required. Although the QPI and PPI are asymptotically equivalent, the PPI renders a better CVR in finite-sample cases; see the formal definition of the PPI in the work of [2,16]. Analogously to the successful construction of PIs in the work of [18], we can employ predictive (as opposed to fitted) residuals in the bootstrap process to further alleviate the finite-sample undercoverage of bootstrap PIs in practice. There are several other nonparametric approaches to carry out prediction inference for future values; e.g., see the work of [5,19] for variants of kernel-based methods; see the work of [20,21,22] for prediction with a neural network using the sieve bootstrap or various ensemble strategies; finally, see the work of [23,24,25] for a novel transformation-based approach for model-free prediction. A comparison of these various nonparametric techniques could form an independent study.
This paper is organized as follows. In Section 2, forward bootstrap prediction algorithms with local constant estimators are given. The asymptotic properties of point predictions and PIs are discussed in Section 3. Simulations are given in Section 4 to substantiate the finite-sample performance of our methods. Conclusions are given in Section 5. All proofs can be found in Appendix A. Discussions on the debiasing and pertinence issues related to building PIs are presented in Appendix B, Appendix C, and Appendix D.
2. Nonparametric Forward Bootstrap Prediction
As discussed in the remark in Section 1, we can apply the simulation or bootstrap technique to approximate the distribution of future values. In general, this idea works for any geometrically ergodic autoregressive model, regardless of whether it has a linear or non-linear format. For example, if we have a known general model $X_t = G(\mathbf{Y}_{t-1}, \epsilon_t)$ at hand, we can perform $k$-step-ahead predictions according to the same logic as the three-step-ahead prediction example in Section 1.

To elaborate, we need to simulate $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+k}$ as i.i.d. draws from the innovation distribution and then compute the pseudo-value $X^*_{n+k}$ iteratively with the simulated innovations as follows:
$$X^*_{n+i} = G(\mathbf{Y}^*_{n+i-1}, \epsilon^*_{n+i}), \quad i = 1, \ldots, k, \quad \text{with } \mathbf{Y}^*_{n} = \mathbf{Y}_{n}.$$
Repeating this procedure $M$ times, we can make a prediction inference with the empirical distribution of $X^*_{n+k}$. Similarly, if the model and innovation distribution are unknown to us, we can perform the estimation first to obtain $\hat{G}$ and an estimate of the innovation distribution. Then, the above simulation-based algorithm turns into a bootstrap-based algorithm. More specifically, we bootstrap the innovations from the estimated innovation distribution and calculate the pseudo-value $X^*_{n+k}$ iteratively with $\hat{G}$. The prediction inference can again be conducted with the empirical distribution of $X^*_{n+k}$.
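A generic sketch of this $k$-step recursion is given below; the helper `k_step_pseudo_values` and the toy model it is called with are hypothetical, and the same routine can be fed either the true $G$ and innovation sampler (simulation) or estimated/bootstrap versions of them.

```python
import numpy as np

def k_step_pseudo_values(G, y_n, sample_eps, k, M=5000):
    """Approximate the distribution of X_{n+k} given Y_n by M forward recursions
    X*_{n+i} = G(Y*_{n+i-1}, eps*_{n+i}).  G and sample_eps may be the true model
    and innovation law (simulation) or estimated versions (bootstrap)."""
    p = len(y_n)
    paths = np.tile(np.asarray(y_n, dtype=float), (M, 1))   # each row: last p values, newest last
    for _ in range(k):
        eps = sample_eps(M)
        x_new = G(paths, eps)                                # vectorised over the M replicates
        paths = np.column_stack([paths[:, 1:], x_new]) if p > 1 else x_new[:, None]
    return paths[:, -1]                                      # M pseudo-values of X*_{n+k}

# Toy usage with p = 1 and an assumed model (illustration only):
rng = np.random.default_rng(1)
G = lambda y, e: 0.8 * np.tanh(y[:, -1]) + 0.5 * e
x_star = k_step_pseudo_values(G, y_n=[0.3], sample_eps=rng.standard_normal, k=5)
print(np.mean(x_star), np.median(x_star))
```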
This simulation/bootstrap idea was recently implemented by [17] in the case where the model $G$ is either known or parametrically specified. In what follows, we will focus on the case of the nonparametric model in Equation (1) and will analyze the asymptotic properties of the point predictor and prediction interval. For the sake of simplicity, we consider only the case in which $p = q = 1$; the general case can be handled similarly, but the notation is much more cumbersome. Assume that we observe $n$ data points, denoted by $X_1, \ldots, X_n$; our goal is the prediction inference of $X_{n+k}$ for some $k \geq 1$. If we know $m(\cdot)$, $\sigma(\cdot)$, and the innovation distribution, we can take a simulation approach to develop the prediction inference, as we explained in Section 1. When $m(\cdot)$, $\sigma(\cdot)$, and the innovation distribution are unknown, we start by estimating $m(\cdot)$ and $\sigma(\cdot)$; we then estimate the innovation distribution based on the empirical distribution of the residuals. Subsequently, we can deploy a bootstrap-based method to approximate the distribution of future values. Several algorithms are given for this purpose later in the paper.
2.1. Bootstrap Algorithm for Point Prediction and QPI
For concreteness, we focus on local constant estimators, i.e., kernel-smoothed estimators of the Nadaraya–Watson type; other estimators can be applied similarly. The local constant estimators of $m(\cdot)$ and $\sigma(\cdot)$ are, respectively, defined as:
$$\hat{m}(x) = \frac{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{h}\right) X_t}{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{h}\right)}, \qquad \hat{\sigma}^2(x) = \frac{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{h}\right)\left(X_t - \hat{m}(X_{t-1})\right)^2}{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{h}\right)},$$
where $K$ is a non-negative kernel function that satisfies some regularity assumptions; see Section 3 for details. We use $h$ to represent the bandwidth of the kernel functions, but $h$ may take a different value for the mean and variance estimators. Due to theoretical and practical issues, we need to truncate the above local constant estimators as follows:
$$\hat{m}_T(x) = \max\!\big(\min(\hat{m}(x),\, c_m),\, -c_m\big), \qquad \hat{\sigma}_T(x) = \max\!\big(\min(\hat{\sigma}(x),\, c_\sigma),\, \delta\big), \qquad (6)$$
where $c_m$ and $c_\sigma$ are large enough, and $\delta > 0$ is small enough.
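For illustration, a rough sketch of such truncated local constant estimators for the case $p = q = 1$ is given below; the Gaussian kernel, the residual-based form of $\hat{\sigma}^2$, and the truncation constants are placeholder assumptions rather than the paper's exact choices.

```python
import numpy as np

def nw_estimators(X, h_m, h_s, c_m=1e6, c_s=1e6, delta=1e-3):
    """Truncated local constant (Nadaraya-Watson) estimators of m and sigma for p = q = 1.
    Gaussian kernel, variance estimated by smoothing squared centered values, and the
    truncation constants c_m, c_s ('large enough') and delta ('small enough') are
    illustrative placeholders."""
    x_prev, x_curr = np.asarray(X[:-1], float), np.asarray(X[1:], float)
    K = lambda u: np.exp(-0.5 * u ** 2)

    def m_raw(x):
        w = K((x - x_prev) / h_m)
        return np.sum(w * x_curr) / np.sum(w)

    fitted = np.array([m_raw(v) for v in x_prev])            # m_raw evaluated at lagged points
    sq_resid = (x_curr - fitted) ** 2

    def m_hat(x):                                             # truncated regression estimator
        return float(np.clip(m_raw(x), -c_m, c_m))

    def sigma_hat(x):                                         # truncated volatility estimator
        w = K((x - x_prev) / h_s)
        s2 = np.sum(w * sq_resid) / np.sum(w)
        return float(np.clip(np.sqrt(max(s2, 0.0)), delta, c_s))

    return m_hat, sigma_hat
```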
Using $\hat{m}_T(\cdot)$ and $\hat{\sigma}_T(\cdot)$ in Equation (1), we can obtain the fitted residuals $\{\hat{\epsilon}_t\}$, which are defined as:
$$\hat{\epsilon}_t = \frac{X_t - \hat{m}_T(X_{t-1})}{\hat{\sigma}_T(X_{t-1})}, \quad t = 2, \ldots, n. \qquad (7)$$
Later, in Section 3, we will show that the innovation distribution can be consistently estimated by the centered empirical distribution of the fitted residuals, i.e., the empirical distribution of $\hat{\epsilon}_t - \bar{\hat{\epsilon}}$, under some standard assumptions. We now have all the ingredients to perform the bootstrap-based Algorithm 1 to yield the point prediction and QPI of $X_{n+k}$.
Algorithm 1 Bootstrap prediction of $X_{n+k}$ with fitted residuals
Step 1. With data $\{X_t,\ t = 1, \ldots, n\}$, construct the estimators $\hat{m}_T(\cdot)$ and $\hat{\sigma}_T(\cdot)$ with Equation (6).
Step 2. Compute the fitted residuals $\{\hat{\epsilon}_t\}$ based on Equation (7), and let $\bar{\hat{\epsilon}}$ denote their sample mean. Let $\hat{F}_{\hat{\epsilon}}$ denote the empirical distribution of the centered residuals $\hat{\epsilon}_t - \bar{\hat{\epsilon}}$ for $t = 2, \ldots, n$.
Step 3. Generate $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+k}$ i.i.d. from $\hat{F}_{\hat{\epsilon}}$. Then, construct the bootstrap pseudo-values $X^*_{n+1}, \ldots, X^*_{n+k}$ iteratively, i.e., $X^*_{n+i} = \hat{m}_T(X^*_{n+i-1}) + \hat{\sigma}_T(X^*_{n+i-1})\,\epsilon^*_{n+i}$ with $X^*_{n} = X_n$. For example, $X^*_{n+1} = \hat{m}_T(X_n) + \hat{\sigma}_T(X_n)\,\epsilon^*_{n+1}$ and $X^*_{n+2} = \hat{m}_T(X^*_{n+1}) + \hat{\sigma}_T(X^*_{n+1})\,\epsilon^*_{n+2}$.
Step 4. Repeating Step 3 $M$ times, we obtain pseudo-value replicates of $X^*_{n+k}$ that we denote by $X^{*(1)}_{n+k}, \ldots, X^{*(M)}_{n+k}$. Then, the $L_2$- and $L_1$-optimal predictors can be approximated by the mean and the median of these $M$ values, respectively. Furthermore, a $(1-\alpha)100\%$ QPI can be built as $[L, U]$, where $L$ and $U$ denote the $\alpha/2$ and $1-\alpha/2$ sample quantiles of the $M$ values $X^{*(1)}_{n+k}, \ldots, X^{*(M)}_{n+k}$.
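A minimal sketch of Algorithm 1 (again for $p = q = 1$, reusing hypothetical estimator helpers such as the `nw_estimators` sketch above; not the authors' code) shows the flow from fitted residuals to the point predictors and the QPI.

```python
import numpy as np

def algorithm1(X, m_hat, sigma_hat, k, M=2000, alpha=0.05, seed=0):
    """Bootstrap k-step-ahead prediction of X_{n+k} with fitted residuals (p = q = 1)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    # Step 2: fitted residuals, centered, serve as the estimated innovation distribution
    resid = np.array([(X[t] - m_hat(X[t - 1])) / sigma_hat(X[t - 1])
                      for t in range(1, len(X))])
    resid -= resid.mean()
    # Step 3: iterate the fitted model forward k steps, starting from X_n
    x_star = np.empty(M)
    for b in range(M):
        x = X[-1]
        for _ in range(k):
            x = m_hat(x) + sigma_hat(x) * rng.choice(resid)
        x_star[b] = x
    # Step 4: point predictors and quantile PI (QPI)
    l2_pred, l1_pred = x_star.mean(), np.median(x_star)
    qpi = np.quantile(x_star, [alpha / 2, 1 - alpha / 2])
    return l2_pred, l1_pred, qpi

# Example wiring (hypothetical bandwidths):
#   m_hat, s_hat = nw_estimators(X, h_m=0.3, h_s=0.3)
#   algorithm1(X, m_hat, s_hat, k=3)
```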
Remark 2. To construct the QPI of Algorithm 1, we can employ the optimal bandwidth rate, i.e., $h \sim n^{-1/5}$. However, in practice with a small sample size, the QPI has a better empirical CVR for multi-step-ahead predictions when an under-smoothing bandwidth is adopted; see Appendix B for a related discussion, and see Section 4 for simulation comparisons between applying optimal and under-smoothing bandwidths to the QPI.

In the next section, we will show the conditional asymptotic consistency of our optimal point predictions and the QPI. In particular, we will verify that our point predictions converge to the oracle optimal point predictors in probability, conditional on the observed data. In addition, we will look for an asymptotically valid PI with a $(1-\alpha)100\%$ CVR to measure the prediction accuracy conditional on the latest observed data, which is defined by the requirement that
$$P\left(L \leq X_{n+k} \leq U\right) \to 1 - \alpha \quad \text{as } n \to \infty,$$
where $L$ and $U$ are the lower and upper PI bounds, respectively. Although not explicitly denoted, the probability $P$ should be understood as the conditional probability given the observed data. Later, based on a sequence of sets that contains the observed sample with probability tending to 1, we will show how to build a prediction interval that is asymptotically valid via the bootstrap technique, even if the model information is unknown.
Although asymptotically correct, in finite samples, the QPI typically suffers from undercoverage; see the discussion in [2,16]. To improve the CVR in practice, we consider using predictive residuals in the bootstrap process. To derive such predictive residuals, we need to estimate the model based on the delete-$X_t$ dataset, i.e., the available data for the scatter plot of $X_s$ vs. $X_{s-1}$ for $s = 2, \ldots, n$, excluding the single point at $s = t$. More specifically, we define the delete-$X_t$ local constant estimators as:
$$\hat{m}^{(t)}(x) = \frac{\sum_{s=2,\, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{h}\right) X_s}{\sum_{s=2,\, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{h}\right)}, \qquad \big(\hat{\sigma}^{(t)}(x)\big)^2 = \frac{\sum_{s=2,\, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{h}\right)\left(X_s - \hat{m}^{(t)}(X_{s-1})\right)^2}{\sum_{s=2,\, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{h}\right)}.$$
Similarly, the truncated delete-$X_t$ local estimators $\hat{m}^{(t)}_T(\cdot)$ and $\hat{\sigma}^{(t)}_T(\cdot)$ can be defined according to Equation (6). We now construct the so-called predictive residuals as:
$$\tilde{\epsilon}_t = \frac{X_t - \hat{m}^{(t)}_T(X_{t-1})}{\hat{\sigma}^{(t)}_T(X_{t-1})}, \quad t = 2, \ldots, n. \qquad (11)$$
The $k$-step-ahead prediction of $X_{n+k}$ with predictive residuals is described in Algorithm 2. Although Algorithms 1 and 2 are asymptotically equivalent, Algorithm 2 gives a QPI with a better CVR for finite samples; see the simulation comparisons of these two approaches in Section 4.
Algorithm 2 Bootstrap prediction of $X_{n+k}$ with predictive residuals
Step 1. The same as Step 1 of Algorithm 1.
Step 2. Compute the predictive residuals $\{\tilde{\epsilon}_t\}$ based on Equation (11). Let $\hat{F}_{\tilde{\epsilon}}$ denote the empirical distribution of the centered predictive residuals $\tilde{\epsilon}_t - \bar{\tilde{\epsilon}}$.
Steps 3–4. Replace $\hat{F}_{\hat{\epsilon}}$ by $\hat{F}_{\tilde{\epsilon}}$ in Algorithm 1. All the rest is the same.
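The only ingredient that Algorithm 2 changes is the residual distribution. A sketch of the delete-$X_t$ predictive residuals of Equation (11), under the same $p = q = 1$ and Gaussian-kernel assumptions as in the earlier sketches, could look as follows.

```python
import numpy as np

def predictive_residuals(X, h_m, h_s, delta=1e-3):
    """Delete-X_t ('predictive') residuals for p = q = 1: for each t the local constant
    estimators are recomputed with the single pair (X_{t-1}, X_t) removed, and the
    residual at t is formed from those estimators.  The Gaussian kernel and the floor
    delta on sigma are illustrative choices."""
    X = np.asarray(X, float)
    xp, xc = X[:-1], X[1:]
    n = len(xc)
    K = lambda u: np.exp(-0.5 * u ** 2)
    Wm = K((xp[:, None] - xp[None, :]) / h_m)      # kernel weights for the regression smoother
    Ws = K((xp[:, None] - xp[None, :]) / h_s)      # kernel weights for the variance smoother
    res = np.empty(n)
    for t in range(n):
        # delete-one regression estimates m^(t)(X_{s-1}) at every lag point s
        num = Wm @ xc - Wm[:, t] * xc[t]
        den = Wm.sum(axis=1) - Wm[:, t]
        m_del = num / den
        # delete-one variance estimate at x = X_{t-1}, smoothing squared centered values
        sq = (xc - m_del) ** 2
        s2 = (Ws[t] @ sq - Ws[t, t] * sq[t]) / (Ws[t].sum() - Ws[t, t])
        res[t] = (xc[t] - m_del[t]) / max(np.sqrt(max(s2, 0.0)), delta)
    return res - res.mean()                         # centered, as used in Algorithm 2
```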
2.2. Bootstrap Algorithm for PPI
To improve the CVR of a PI, we can try to take the variability of the model estimation into account when we build the PI; i.e., we need to mimic the estimation process in the bootstrap world. Employing this idea results in a pertinent PI (PPI), as discussed in Section 1; see also [26].

Algorithm 3 outlines the procedure to build a PPI. Although this algorithm is computationally heavier, the advantage is that the PPI gives a better CVR compared to the QPI in practice, i.e., with finite samples; see the examples in Section 4.
Remark 3 (Bandwidth choices). In Step 3 (b) of Algorithm 3, we can use an optimal bandwidth $h$ and an over-smoothing bandwidth $g$ to generate the bootstrap time series so that we can capture the asymptotically non-random bias-type term of nonparametric estimation by the forward bootstrap; see the application in [27]. We can also apply an under-smoothing bandwidth $h$ (and then use $g = h$) to render the bias term negligible. It turns out that both approaches work well for one-step-ahead prediction, although applying the over-smoothing bandwidth may be slightly better. However, taking under-smoothing bandwidth(s) is notably better for multi-step-ahead prediction. The reason for this is that the bias term cannot be captured appropriately for multi-step-ahead estimation with an over-smoothing bandwidth. On the other hand, with an under-smoothing bandwidth, the bias term is negligible; see Section 3.2 for further discussion; also, see [28] for a related discussion. The simulation studies in Appendix C explore the differences between these two bandwidth strategies.

Algorithm 3 Bootstrap PPI of $X_{n+k}$ with fitted residuals
Step 1. With data $\{X_t,\ t = 1, \ldots, n\}$, construct the estimators $\hat{m}_T(\cdot)$ and $\hat{\sigma}_T(\cdot)$ by using Equation (6). Furthermore, compute the fitted residuals $\{\hat{\epsilon}_t\}$ based on Equation (7). Denote the empirical distribution of the centered residuals $\hat{\epsilon}_t - \bar{\hat{\epsilon}}$ by $\hat{F}_{\hat{\epsilon}}$.
Step 2. Construct the $L_1$- or $L_2$-optimal prediction $\hat{X}_{n+k}$ using Algorithm 1.
Step 3.
(a) Resample (with replacement) the residuals from $\hat{F}_{\hat{\epsilon}}$ to create pseudo-errors $\epsilon^*_t$ for $t = 2, \ldots, n$ and for $t = n+1, \ldots, n+k$.
(b) Let $X^*_1 = X_I$, where $I$ is generated as a discrete random variable uniformly distributed on the values $1, \ldots, n$. Then, create bootstrap pseudo-data $X^*_2, \ldots, X^*_n$ in a recursive manner from the formula
$$X^*_t = \hat{m}_T(X^*_{t-1}) + \hat{\sigma}_T(X^*_{t-1})\,\epsilon^*_t. \qquad (12)$$
(c) Based on the bootstrap data $\{X^*_t,\ t = 1, \ldots, n\}$, re-estimate the regression and variance functions according to Equation (6) and obtain $\hat{m}^*_T(\cdot)$ and $\hat{\sigma}^*_T(\cdot)$; we use the same bandwidth $h$ as for the original estimators.
(d) Guided by the idea of the forward bootstrap, re-define the latest value of the bootstrap series to match the original, i.e., re-define $X^*_n = X_n$.
(e) With the estimators $\hat{m}_T(\cdot)$ and $\hat{\sigma}_T(\cdot)$, the bootstrap data, and the pseudo-errors $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+k}$, use Equation (12) to recursively generate the future bootstrap data $X^*_{n+1}, \ldots, X^*_{n+k}$.
(f) With the bootstrap data and the estimators $\hat{m}^*_T(\cdot)$ and $\hat{\sigma}^*_T(\cdot)$, utilize Algorithm 1 to compute the optimal bootstrap prediction, which is denoted by $\hat{X}^*_{n+k}$; to generate the bootstrap innovations, we still use $\hat{F}_{\hat{\epsilon}}$.
(g) Determine the bootstrap predictive root: $X^*_{n+k} - \hat{X}^*_{n+k}$.
Step 4. Repeat Step 3 $B$ times; the $B$ bootstrap root replicates are collected in the form of an empirical distribution whose $\gamma$-quantile is denoted by $q(\gamma)$. The equal-tailed $(1-\alpha)100\%$ prediction interval for $X_{n+k}$ centered at $\hat{X}_{n+k}$ is then estimated by
$$\left[\hat{X}_{n+k} + q(\alpha/2),\ \hat{X}_{n+k} + q(1-\alpha/2)\right].$$
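As a rough sketch of the predictive-root construction in Algorithm 3 (assuming $p = q = 1$, a single bandwidth, an $L_2$-optimal point predictor, and a hypothetical `fit_estimators` helper in the spirit of the earlier sketches; the over-/under-smoothing refinements of Remark 3 are omitted), the main loop might be organized as follows.

```python
import numpy as np

def algorithm3_ppi(X, fit_estimators, k, B=500, M=500, alpha=0.05, seed=0):
    """Pertinent PI (PPI) for X_{n+k} via bootstrap predictive roots (p = q = 1).
    `fit_estimators(X)` is assumed to return truncated (m_hat, sigma_hat) functions,
    e.g., in the spirit of the nw_estimators sketch above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    n = len(X)
    m_hat, s_hat = fit_estimators(X)
    resid = np.array([(X[t] - m_hat(X[t - 1])) / s_hat(X[t - 1]) for t in range(1, n)])
    resid -= resid.mean()

    def point_pred(m_f, s_f, x_last, eps_pool):          # L2-optimal predictor via Monte Carlo
        x = np.full(M, x_last)
        for _ in range(k):
            x = (np.array([m_f(v) for v in x])
                 + np.array([s_f(v) for v in x]) * rng.choice(eps_pool, M))
        return x.mean()

    pred = point_pred(m_hat, s_hat, X[-1], resid)          # Step 2: prediction in the real world
    roots = np.empty(B)
    for b in range(B):                                     # Step 3: bootstrap worlds
        # (b) regenerate a pseudo-series from the fitted model, random starting value
        xb = np.empty(n)
        xb[0] = X[rng.integers(n)]
        for t in range(1, n):
            xb[t] = m_hat(xb[t - 1]) + s_hat(xb[t - 1]) * rng.choice(resid)
        m_b, s_b = fit_estimators(xb)                      # (c) re-estimate on bootstrap data
        # (d)-(e) forward bootstrap: restart from the true last value, go k steps ahead
        x_future = X[-1]
        for _ in range(k):
            x_future = m_hat(x_future) + s_hat(x_future) * rng.choice(resid)
        pred_b = point_pred(m_b, s_b, X[-1], resid)        # (f) bootstrap-world predictor
        roots[b] = x_future - pred_b                       # (g) bootstrap predictive root
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return pred + lo, pred + hi                            # Step 4: PPI centered at the predictor
```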
As Algorithm 2 is a version of Algorithm 1 using predictive (as opposed to fitted) residuals, we now propose Algorithm 4, which constructs a PPI with predictive residuals.
Algorithm 4 Bootstrap PPI of $X_{n+k}$ with predictive residuals
Step 1. With data $\{X_t,\ t = 1, \ldots, n\}$, construct the estimators $\hat{m}_T(\cdot)$ and $\hat{\sigma}_T(\cdot)$ by using Equation (6). Furthermore, compute the predictive residuals $\{\tilde{\epsilon}_t\}$ based on Equation (11). Denote the empirical distribution of the centered predictive residuals by $\hat{F}_{\tilde{\epsilon}}$.
Steps 2–4. The same as in Algorithm 3, but change the residual distribution from $\hat{F}_{\hat{\epsilon}}$ to $\hat{F}_{\tilde{\epsilon}}$, and change the application of Algorithm 1 to Algorithm 2.