Multistep Forecast Averaging with Stochastic and Deterministic Trends

Kejriwal, Mohitosh; Nguyen, Linh; Yu, Xuewen

doi:10.3390/econometrics11040028

Open AccessArticle

Multistep Forecast Averaging with Stochastic and Deterministic Trends

by

Mohitosh Kejriwal

^1,*

,

Linh Nguyen

¹

and

Xuewen Yu

²

¹

Daniels School of Business, Purdue University, 403 Mitch Daniels Blvd., West Lafayette, IN 47907, USA

²

Department of Applied Economics, School of Management, Fudan University, 670 Guoshun Road, Shanghai 200433, China

^*

Author to whom correspondence should be addressed.

Econometrics 2023, 11(4), 28; https://doi.org/10.3390/econometrics11040028

Submission received: 24 August 2023 / Revised: 27 November 2023 / Accepted: 7 December 2023 / Published: 15 December 2023

Download

Browse Figures

Versions Notes

Abstract

This paper presents a new approach to constructing multistep combination forecasts in a nonstationary framework with stochastic and deterministic trends. Existing forecast combination approaches in the stationary setup typically target the in-sample asymptotic mean squared error (AMSE), relying on its approximate equivalence with the asymptotic forecast risk (AFR). Such equivalence, however, breaks down in a nonstationary setup. This paper develops combination forecasts based on minimizing an accumulated prediction errors (APE) criterion that directly targets the AFR and remains valid whether the time series is stationary or not. We show that the performance of APE-weighted forecasts is close to that of the optimal, infeasible combination forecasts. Simulation experiments are used to demonstrate the finite sample efficacy of the proposed procedure relative to Mallows/Cross-Validation weighting that target the AMSE as well as underscore the importance of accounting for both persistence and lag order uncertainty. An application to forecasting US macroeconomic time series confirms the simulation findings and illustrates the benefits of employing the APE criterion for real as well as nominal variables at both short and long horizons. A practical implication of our analysis is that the degree of persistence can play an important role in the choice of combination weights.

Keywords:

averaging; combination; cross-validation; Mallows; unit root; accumulated prediction errors

JEL Classification:

C22; C53

1. Introduction

The pioneering work of Granger (1966) demonstrated that a large number of macroeconomic time series have a typical spectral shape dominated by a peak at low frequencies. This finding suggests the presence of relatively long run information in the current level of the variables, which should be taken into account when modeling their time series evolution and can potentially be exploited to yield improved forecasts. One way to incorporate this long-run information in econometric modeling is through stochastic trends (unit roots) and/or deterministic trends. However, given that trends are slowly evolving, there is only limited information in any data set about how best to specify the trend or distinguish between alternative models of the trend. For instance, unit root tests often fail to reject a unit root despite the fact that theory does not postulate the presence of a unit root for many macroeconomic variables [see Elliott (2006), for further discussion of this issue]. Therefore, it appears prudent to incorporate the uncertainty arising from the presence of a stochastic trend when constructing macroeconomic forecasts. Moreover, this uncertainty is likely to be particularly important for longer horizons.1

A second source of uncertainty involved in the construction of forecasts relates to the specification of short-run dynamics driving the time series. Within an autoregressive modeling framework, this form of uncertainty can be expressed in terms of the lags of first differences of the time series being analyzed. Since the number of lags that ought to be included in the model is unknown in practice, there is a bias–variance trade-off facing the forecaster: underspecifying the number of lags would lead to biased forecasts, while including irrelevant lags would induce a higher forecast variance. The challenge therefore lies in incorporating lag order uncertainty in a manner that best addresses this trade-off.

Motivated by these considerations, this paper proposes a new multistep forecast combination approach designed for forecasting a highly persistent time series that simultaneously addresses uncertainty about the presence of a stochastic trend and uncertainty about the nature of short-run dynamics within a unified autoregressive modeling framework. Unlike extant forecast combination approaches, we develop combination forecasts based on minimizing the so-called accumulated prediction errors (APE) criterion that directly targets the asymptotic forecast risk (AFR) instead of the in-sample asymptotic mean squared error (AMSE). This is particularly relevant, since the equivalence between AFR and AMSE breaks down in a nonstationary setup. Our analysis generalizes existing results by establishing the asymptotic validity of the APE for multistep forecasts in the unit root and (fixed) stationary cases, both for models with and without deterministic trends. We further show that, regardless of the presence of a unit root, the performance of APE-weighted forecasts remains close to that of the infeasible combination forecasts which assume that the optimal (i.e., AFR minimizing) weights are known. Monte Carlo experiments are used to (i) demonstrate the finite sample efficacy of the proposed procedure relative to Mallows/Cross-Validation weighting that target the AMSE; (ii) underscore the importance of accounting for uncertainty about the stochastic trend and/or the lag order. In a pseudo out-of-sample forecasting exercise applied to US monthly macroeconomic time series, we evaluate the performance of a variety of selection/combination-based approaches at horizons of one, three, six, and twelve months. Consistent with the simulation results, the empirical analysis provides strong evidence in favor of a version of the advocated approach that simultaneously addresses stochastic trend and lag order uncertainty regardless of the forecast horizon considered.

The present study builds on previous work by Hansen (2010a) and Kejriwal and Yu (2021) who analyzed one-step ahead combination forecasts allowing for both persistence and lag order uncertainty. In particular, Hansen (2010a) adopted a local-to-unity framework to develop combination forecasts that combine forecasts from the restricted (i.e., imposing a unit root) and unrestricted models with the weights obtained by minimizing a one-step Mallows criterion. To address lag order uncertainty, he also proposed a general combination approach that, in addition to the restricted and unrestricted model forecasts, also combines forecasts based on different lag orders. Kejriwal and Yu (2021) provided theoretical justification for the general combination approach and developed improved combination forecasts that employ feasible generalized least squares (FGLS) estimates instead of ordinary least squares (OLS) estimates of the deterministic trend component.

Our paper can be viewed as extending Hansen’s (2010a) approach in two practically relevant directions. First, in addition to one-step ahead forecasts, we also analyze the statistical properties of multistep combination forecasts given that uncertainty regarding the presence of a stochastic trend is especially relevant over longer horizons. Second, in contrast to Mallows weighting as advocated by Hansen (2010a), our combination weights are obtained via the APE criterion that directly targets the AFR instead of the AMSE. Our Monte Carlo and empirical comparisons of the performance of combination forecasts based on different weighting schemes clearly illustrate the importance of directly targeting the AFR. Thus, an important implication of our study is that the preferred choice of weighting scheme when combining forecasts can critically depend on whether the variables involved are stationary or not.

The recent machine learning literature has proposed a variety of forecasting methods that exploit information in a large number of potential predictors (see, e.g., Masini et al. 2023, for a survey). In contrast, our study is univariate in that it only utilizes past information about the variable of interest to develop forecasts. A natural question one may ask, then, is what is the value added by our univariate forecasting approach when more sophisticated machine learning approaches are available? We offer three possible responses. First, our approach is simple to use in practice, since it only requires running OLS regressions. Second, it is transparent in that its statistical properties can be studied analytically, which can be useful for understanding the merits and limitations of the approach. Third, when evaluating the performance of machine learning methods, our preferred forecasting approach can provide a much more competitive univariate benchmark for comparison than a simple autoregressive model with a prespecified/estimated lag order which has routinely been used as the benchmark (see, e.g., Kim and Swanson 2018; Medeiros et al. 2021).

The rest of the paper is organized as follows. Section 2 provides a review of the related literature. Section 3 presents the model and the related estimators. Section 4 analyzes the AMSE and AFR as alternative measures of forecast accuracy. Section 5 discusses the choice of combination weights based on the APE criterion. Section 6 extends the analysis to allow for lag order uncertainty in the construction of the forecasts. Monte Carlo evidence, including comparisons with various existing methods, is provided in Section 7. Section 8 details an empirical application to forecasting US macroeconomic time series and Section 9 concludes. Appendix A, Appendix B and Appendix C contain, respectively, the proofs, details of forecasting methods considered, and additional simulation results. All computations were carried out in the 2022b version of MATLAB (2022).

2. Literature Review

A common practice in the economic forecasting literature is to apply a stationarity-inducing transformation (e.g., differencing or detrending) to the time series of interest and then attempt to forecast the transformed series. Consequently, most of the forecasting procedures in current use have been developed under the assumption of data stationarity. The traditional approach of Box and Jenkins (1970) transforms the data through differencing which amounts to modeling the low-frequency peak in the spectrum as a zero-frequency phenomenon and proceeds to forecast the transformed series using standard stationary autoregressive moving average (ARMA) models. More recently, Stock and Watson (2005, 2006) constructed a extensive database of 132 monthly macroeconomic time series over the period 1959-2003 and applied a variety of transformations to render them stationary before using a handful of common factors extracted from the data set using principal components as predictors (the so-called diffusion-index methodology). Similarly, McCracken and Ng (2016) assembled a publicly available database of 134 monthly time series referred to as FRED-MD and updated on a timely basis by the Federal Reserve Bank of St Louis. They also suggested a set of data transformations which is used to construct factor-based diffusion indexes for forecasting as well as to analyze business cycle turning points.

While convenient in practice, the approach of forecasting the transformed stationary series tends to ignore the information in the levels of the variables. In particular, it does not properly account for the uncertainty arising from the nature of the underlying trends, which can lead to poor forecasts if the trends are misspecified. Clements and Hendry (2001) documented, both analytically and numerically, the detrimental consequences of trend misspecification on the resulting forecasts in the presence of parameter estimation uncertainty. Specifically, they found that when the sample size increases at a faster rate than the forecast horizon, misspecifying a difference stationary process as trend stationary or vice versa yields forecast error variances of a higher order of magnitude relative to the correctly specified model. Consequently, the objective of our study is to construct forecasts of the time series in levels and explicitly model the uncertainty regarding the presence of a stochastic trend instead of transforming the time series to stationarity based on a trend specification that is determined a priori and is possibly misspecified.

Our study is closely related to the existing literature on methods for forecasting nonstationary time series. Diebold and Kilian (2000) showed that a unit root pretesting strategy can improve forecast accuracy relative to restricted or unrestricted estimation. Ng and Vogelsang (2002) found that the use of FGLS estimates of the trend component can yield superior forecasts relative to their OLS counterparts. Turner (2004) recommended the use of forecasting thresholds whereby the restricted (unit root) forecast is preferred on one side of these thresholds while the unrestricted (OLS) forecast is preferred on the other. His proposal was based on median unbiased estimation of the local-to-unity parameter to determine the thresholds and was shown to dominate a unit root pretesting strategy. Ing et al. (2012) studied the impact of nonstationarity, model complexity and model misspecification on the AFR in infinite order autoregressions.

A promising approach to addressing both stochastic trend uncertainty and lag order uncertainty is forecast combination. Introduced in the seminal work of Bates and Granger (1969), the idea underlying forecast combination is to exploit the bias–variance trade-off by combining forecasts from restricted (possibly subject to bias) specifications and unrestricted (possibly subject to overfitting) specifications using an appropriate choice of combination weights. A voluminous amount of literature has subsequently developed, which has analyzed the efficacy of several alternative weighting schemes for constructing the combination forecasts (see, e.g.,Wang et al. 2022, for a recent survey). Hansen (2010a) proposed one-step ahead combination forecasts within an autoregressive modeling framework that accounts for both aforementioned sources of uncertainty, where the combination weights are obtained by minimizing a Mallows criterion. The Mallows criterion is designed to provide an approximately unbiased estimator of the in-sample AMSE. Hansen’s analysis showed that the unit root pretesting strategy could be subject to high forecast risk for a range of persistence levels, while his combination forecast performed favorably compared to a number of methods popular in applied work and dominated the unrestricted forecast uniformly in terms of finite sample forecast risk. Kejriwal and Yu (2021) proposed a refinement of Hansen’s (2010a) approach, which entails estimating the deterministic trend component by FGLS instead of OLS. Tu and Yi 2017 analyzed one-step forecasting based on the Mallows averaging estimator in a cointegrated vector autoregressive model and found that it dominated the commonly used approach of pretesting for cointegration.

In a stationary setup, combination forecasts based on Mallows/cross-validation (CV) weighting typically target the AMSE, relying on its approximate equivalence with the AFR (e.g., Hansen 2008, 2010b; Liao and Tsay 2020). Such equivalence, however, breaks down in a nonstationary setup. Hansen (2010a) showed, within a local-to-unity framework, that the AMSEs of unrestricted as well as restricted (imposing a unit root) one-step ahead forecasts are different from the corresponding expressions for their AFR in autoregressive models (see Section 4 for further discussion on the issue of equivalence or lack thereof).

To address the lack of equivalence between AMSE and AFR, we develop combination forecasts based on minimizing the APE criterion that directly targets the AFR instead of the AMSE. Previous work in the context of model selection has shown the APE criterion to remain valid whether the process is stationary or has a unit root. Specifically, Ing (2004) showed that a normalized version of the APE is almost certain to the AFR in the stationary case, while a similar result was obtained by Ing et al. (2009) in the unit root case. Focusing on the first-order autoregressive case and one-step ahead forecasts, Yu et al. (2012) extended the validity of the APE to a unit root model with a deterministic time trend. Our study extends the use of the APE criterion to construct combination forecasts in a nonstationary environment.

In summary, there is a plethora of approaches available in the literature for forecasting nonstationary time series, including model selection, pretesting, and forecast combination. Combination forecasts have often been shown to incur lower forecast risk in practice than forecasts based on model selection or pretesting. However, the existing literature has typically employed weighting schemes such as Mallows/CV weighting that have been formally justified only in a stationary framework. Our study contributes to this literature by demonstrating that when the variable of interest is potentially nonstationary, it may be desirable to construct the combination weights using an alternative approach (the APE criterion). Our findings are particularly relevant for macroeconomic applications, given that several macroeconomic time series have been documented to exhibit a degree of persistence that is difficult to distinguish from a unit root process.

3. Model and Estimation

We consider a univariate time series

y_{t}

generated as follows:

\begin{matrix} y_{t} & = m_{t} + u_{t} \\ m_{t} & = β_{0} + β_{1} t + \dots + β_{p} t^{p} \\ u_{t} & = α u_{t - 1} + α_{1} Δ u_{t - 1} + \dots + α_{k} Δ u_{t - k} + e_{t} \\ α & = 1 + \frac{a c}{T}, a = 1 - α_{1} - \dots - α_{k}, c \leq 0 \end{matrix}

(1)

where

p \in {0, 1}

is the order of the trend component and the stochastic component

u_{t}

follows a finite order autoregressive process of order

(k + 1)

process driven by the innovations

e_{t}

. The uncertainty about the stochastic trend is captured by the persistence parameter

α

that is modeled as local-to-unity with

c = 0

corresponding to the unit root case and

c < 0

to the stationary case. The initial observations are set at

u_{0}, u_{- 1}, \dots, u_{- k} = O_{p} (1)

.2 This section treats the true lag order

k

as known. Lag order uncertainty is addressed in Section 6. Our analysis is based on the following assumptions:

Assumption 1.

The sequence

{e_{t}}

is a martingale difference sequence with

E (e_{t} | F_{t - 1}) = 0

and

E (e_{t}^{2} | F_{t - 1}) = σ^{2},

where

0 < σ^{2} < \infty,

and

F_{t}

is the σ-field generated by

{e_{s}; s \leq t}

. Moreover, there exist small positive numbers

ϕ_{1}

and

ϕ_{2}

and a large positive number

M_{1}

, such that for

0 \leq s - s^{'} \leq ϕ_{2},

sup_{1 \leq m \leq t < \infty, ∥v_{m}∥ = 1} |F_{t, m, v_{m}} (s) - F_{t, m, v_{m}} (s^{'})| \leq M_{1} {(s - s^{'})}^{ϕ_{1}},

where

v_{m} = {(v_{1}, \dots, v_{m})}^{'} \in R^{m}, ∥v_{m}∥ = \sum_{j = 1}^{m} v_{j}^{2}

and

F_{t, m, v_{m}} (.)

denotes the distribution of

\sum_{l = 1}^{m} v_{l} e_{t + 1 - l}

.

Assumption 2.

All roots of

A (L) = 1 - \sum_{i = 1}^{k} α_{i} L^{i}

lie outside the unit circle.

The data generating process in (1) and Assumptions 1 and 2 are adopted from Hansen (2010a) with an additional restriction on the distribution of

{e_{t}}

, which ensures that the sample second moments of the regressors are bounded in expectation (see Ing et al. 2009). The difference between our modeling framework and Ing et al. (2009) is that they impose an exact unit root (

c = 0

), while we allow

c \leq 0

. For

h \geq 1

, let the optimal (infeasible) mean squared error minimizing h-step ahead forecast of

y_{t}

be denoted as

μ_{t + h}

. It is the conditional mean of

y_{t + h}

given

F_{t}

, which is obtained from the following recursion (Hamilton 1994, pp. 80–82):

\begin{matrix} μ_{t + h} & = z_{t + h}^{'} β + α (μ_{t + h - 1} - z_{t + h - 1}^{'} β) + α_{1} (Δ μ_{t + h - 1} - Δ z_{t + h - 1}^{'} β) \\ + \dots + α_{k} (Δ μ_{t + h - k} - Δ z_{t + h - k}^{'} β) \end{matrix}

(2)

with

μ_{t + j} = y_{t + j}

if

j \leq 0; β = β_{0}

and

z_{t} = 1

if

p = 0; β = {(β_{0}, β_{1})}^{'}

and

z_{t} = {(1, t)}^{'}

if

p = 1

. We can further rewrite (2) as

μ_{t + h} = z_{t + h}^{'} β^{*} + α μ_{t + h - 1} + \sum_{j = 1}^{k} α_{j} Δ μ_{t + h - j}

(3)

where

β^{*} = (1 - α) β_{0}

if

z_{t} = 1

and

β^{*} = {(β_{0}^{*}, β_{1}^{*})}^{'}

with

β_{0}^{*} = (1 - α) β_{0} + (α - \sum_{j = 1}^{k} α_{j}) β_{1}, β_{1}^{*} = (1 - α) β_{1}

if

z_{t} = (1, t)

.

We consider three alternative estimators of

μ_{t + h}

. The first is the unrestricted estimator

{\hat{μ}}_{t + h}

obtained as

{\hat{μ}}_{t + h} = z_{t + h}^{'} {\hat{β}}^{*} + \hat{α} {\hat{μ}}_{t + h - 1} + \sum_{j = 1}^{k} {\hat{α}}_{j} Δ {\hat{μ}}_{t + h - j}

(4)

with

{\hat{μ}}_{t + j} = y_{t + j}

if

j \leq 0

where

({\hat{β}}^{*}, \hat{α}, {\hat{α}}_{j})

are the OLS estimates from the regression

y_{s} = z_{s}^{'} β^{*} + α y_{s - 1} + \sum_{j = 1}^{k} α_{j} Δ y_{s - j} + e_{s}, s = k + 2, \dots, T

Instead of using

(4),

one may consider a two-step strategy for estimating

μ_{t + h}

that entails regressing

y_{t}

on

z_{t}

and obtaining the estimate

\hat{β}

of

β

and the residuals

{\hat{u}}_{t} = y_{t} - z_{t}^{'} \hat{β}

in a first step and then estimating an autoregression of order

k + 1

in

{\hat{u}}_{t}

to obtain the estimates of

(α, α_{1}, \dots, α_{k})

. The forecasts are obtained from

(4)

. However, as shown in Ng and Vogelsang (2002), the one-step estimate

{\hat{μ}}_{t + h}

is preferable to the two-step estimate with persistent data.

The second estimator is the restricted estimator

{\tilde{μ}}_{t + h}

that imposes the unit root restriction

α = 1

and is obtained as

{\tilde{μ}}_{t + h} = Δ z_{t + h}^{'} {\tilde{β}}^{*} + {\tilde{μ}}_{t + h - 1} + \sum_{j = 1}^{k} {\tilde{α}}_{j} Δ {\tilde{μ}}_{t + h - j}

with

{\tilde{μ}}_{t + j} = y_{t + j}

if

j \leq 0

where

({\tilde{β}}^{*}, \tilde{α}, {\tilde{α}}_{j})

are the OLS estimates from the regression

Δ y_{s} = Δ z_{s}^{'} β^{*} + \sum_{j = 1}^{k} α_{j} Δ y_{s - j} + e_{s}, s = k + 2, \dots, T

Finally, the third estimator is based on taking a weighted average of the unrestricted and restricted forecasts. Letting

w \in [0, 1]

be the weight assigned to the unrestricted estimator, the averaging estimator is given by

{\hat{μ}}_{t + h} (w) = w {\hat{μ}}_{t + h} + (1 - w) {\tilde{μ}}_{t + h}

The relative accuracy of the three foregoing estimators can be evaluated using the asymptotic forecast risk (AFR), which is the limit of the h-step ahead expected squared forecast error:

\begin{matrix} f_{0} (c, p, k, h) & = lim_{T \to \infty} \frac{T}{σ^{2}} E {({\tilde{μ}}_{T + h} - μ_{T + h})}^{2} \\ f_{1} (c, p, k, h) & = lim_{T \to \infty} \frac{T}{σ^{2}} E {({\hat{μ}}_{T + h} - μ_{T + h})}^{2} \\ f_{w} (c, p, k, h) & = lim_{T \to \infty} \frac{T}{σ^{2}} E {({\hat{μ}}_{T + h} (w) - μ_{T + h})}^{2} \end{matrix}

In order to derive analytical expressions for the AFR, we introduce the following notation. Let

W (.)

denote a standard Brownian motion on

[0, 1]

and define the Ornstein–Uhlenbeck process

d W_{c} (r) = c W_{c} (r) + d W (r)

For

p \in {0, 1},

let

X_{c} (r) = {(r^{p}, W_{c} (r))}^{'}

and define the stochastic processes

\begin{matrix} W_{c}^{*} (r, p) & = \{\begin{matrix} W_{c} (r) \\ W_{c} (r) - \int_{0}^{1} W_{c} (s) d s \end{matrix} \begin{matrix} if \\ if \end{matrix} \begin{matrix} p = 0 \\ p = 1 \end{matrix} \\ X_{c}^{*} (r, p) & = \{\begin{matrix} X_{c} (r) \\ X_{c} (r) - \int_{0}^{1} X_{c} (s) d s \end{matrix} \begin{matrix} if \\ if \end{matrix} \begin{matrix} p = 0 \\ p = 1 \end{matrix} \end{matrix}

and the functionals

\begin{matrix} T_{0 c} & = - c W_{c}^{*} (1, p) + I (p = 1) W (1) \\ T_{1 c} & = X_{c}^{*} {(1, p)}^{'} {(\int_{0}^{1} X_{c}^{*} (r, p) X_{c}^{*} {(r, p)}^{'})}^{- 1} \int_{0}^{1} X_{c}^{*} (r, p) d W (r) + I (p = 1) W (1) \end{matrix}

Next, note that from

(1),

we can write

y_{t + h} = E_{t} (y_{t + h}) + η_{t, h}

where

η_{t, h} = \sum_{j = 0}^{h - 1} b_{j} e_{t + h - j}, E_{t} (.)

denotes conditional expectation with respect to information at time

t

and the coefficients

b_{j} (j = 0, \dots, h - 1)

are obtained by equating coefficients of

L^{j}

on both sides of the equation

b (L) d (L) = 1

where

b (L) = \sum_{j = 0}^{h - 1} b_{j} L^{j}

and

d (L) = 1 - α L - (1 - L) \sum_{j = 1}^{k} α_{j} L^{j}

. When

α = 1, b_{j} = \sum_{i = 0}^{j} ν_{i}, ν_{0} = 1

and

ν_{j}; j \geq 1,

satisfies

1 + \sum_{j = 1}^{\infty} ν_{j} L^{j} = 1 / A (L)

(see Ing et al. 2009).

Denoting

α (k) = {(α_{1}, \dots, α_{k})}^{'},

we define the following quantities:

\begin{matrix} S_{M} (k) & = (\begin{matrix} α (k - 1) & I_{k - 1} \\ α_{k} & 0_{k - 1}^{'} \end{matrix}), S_{M}^{0} (k) = I_{k} \\ M_{h} (k) & = \sum_{j = 0}^{h - 1} b_{j} S_{M}^{h - 1 - j} (k), Γ (k) = lim_{j \to \infty} E (s_{j} (k) s_{j} {(k)}^{'}), s_{j} (k) = {(Δ y_{j}, \dots, Δ y_{j - k + 1})}^{'} \\ g_{h} (k) & = \{\begin{matrix} 0 \\ t r (Γ (k) M_{h} (k) Γ^{- 1} (k) M_{h}^{'} (k)) \end{matrix} \begin{matrix} if k = 0 \\ if k \geq 1 \end{matrix} \end{matrix}

With the above notation in place, we obtain the following result, which provides an analytical representation for the AFR of the unrestricted and restricted forecasts:

Theorem 1.

Under Assumptions 1 and 2 and sup

_{t} E ({|e_{t}|}^{θ_{h}}) < \infty,

where

θ_{h} = max {8, 2 (h + 2)} + ψ

for some

ψ > 0

,

(a)

f_{1} (c, p, k, h) = f_{1} (c, p, h) + g_{h} (k), f_{1} (c, p, h) = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} E (T_{1 c}^{2}) .

(b)

f_{0} (c, p, k, h) = f_{0} (c, p, h) + g_{h} (k), f_{0} (c, p, h) = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} E (T_{0 c}^{2}) .

Theorem 1 shows that the AFR of both the restricted and unrestricted forecasts can be decomposed into two components: the first component

f_{j} (c, p, h), j = 0, 1,

depends on both the underlying stochastic/deterministic trends as well as the short-run dynamics through the coefficients

{b_{j}};

the second component

g_{h} (k)

is common to the restricted and unrestricted estimators and depends on the parameters governing the short-run dynamics of the time series. The result generalizes Theorem 2 of Hansen (2010a) for one-step forecasts to multistep forecasts. Interestingly, when

h = 1

, the AFR can be expressed as the sum of a purely nonstationary component representing the stochastic/deterministic trends (since

b_{0} = 1

) and a stationary short-run component which is simply the number of first-differenced lags, i.e.,

g_{1} (k) = k

. However, as Theorem 1 shows, when

h > 1,

such a stationary-nonstationary decomposition no longer holds, since both components now depend on the short-run coefficients

{α_{j}}

. Theorem 1 also generalizes Theorem 2.2 of Ing et al. (2009), which derived an expression for AFR assuming an exact unit root

(c = 0

) and no deterministic component.

The next result, which follows as a direct consequence of Theorem 1, shows that the optimal combination weight is independent of the forecast horizon and the moving average coefficients

{b_{j}}

but depends on the nuisance parameter c:

Corollary 1.

The AFR of the combination forecast is given by

f_{w} (c, p, k, h) = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} \{w^{2} E (T_{1 c}^{2}) + {(1 - w)}^{2} E (T_{0 c}^{2}) + 2 w (1 - w) E (T_{1 c} T_{0 c})\} + g_{h} (k)

with optimal (i.e., AFR minimizing) weight

w^{*} = \frac{E (T_{0 c}^{2}) - E (T_{0 c} T_{1 c})}{E (T_{0 c}^{2}) + E (T_{1 c}^{2}) - 2 E (T_{0 c} T_{1 c})}

4. Asymptotic Mean Squared Error and Asymptotic Forecast Risk

An alternative measure of forecast accuracy is the in-sample asymptotic mean squared error (AMSE) defined as

m_{u} (c, p, k, h) = lim_{T \to \infty} \frac{1}{σ^{2}} \sum_{t = 1}^{T - h} E {({\hat{μ}}_{t + h} - μ_{t + h})}^{2}

for the unrestricted estimator with similar expressions in place for the restricted and averaging estimators. Hansen (2008) established the approximate equivalence between this measure and the AFR under the assumption of strict stationarity. Accordingly, existing forecast combination approaches developed in the stationary framework are based on targeting the AMSE by appealing to its equivalence with the AFR. Hansen (2008) proposed estimating the weights by minimizing a Mallows (2000) criterion which yields an asymptotically unbiased estimate of the AMSE. Similarly, Hansen (2010b) demonstrated that a leave-h-out cross validation criterion delivers an asymptotically unbiased estimate of the AMSE.

This equivalence result, however, breaks down in a nonstationary setup. For instance, when the process has a unit root with no drift and the regression does not include a deterministic component, it follows from the results in Hansen (2010a) that the AMSE of the one-step ahead forecast coincides with the expected value of the squared limiting Dickey–Fuller t-statistic. This expectation has been shown to be about 1.141 by Gonzalo and Pitarakis (1998) and Meng (2005) using analytical and numerical integration techniques, respectively. In contrast, Ing (2001) theoretically established that the AFR of the one-step ahead forecast for the same data generating process and regression is two. More recently, Hansen (2010a) demonstrated the lack of equivalence within a local-to-unity framework, showing that the AMSE of unrestricted as well as restricted (imposing a unit root) one-step ahead forecasts are different from the corresponding expressions for their AFR in autoregressive models with a general lag order and a deterministically trending component. Notwithstanding this result, he suggested using a Mallows criterion to estimate the combination weights and evaluated the adequacy of the resulting combination forecast in finite samples via simulations. A similar approach was taken by Kejriwal and Yu (2021), who also employed Mallows weighting but estimated the deterministic component by FGLS in order to improve upon the accuracy of OLS-based forecasts.

To illustrate the failure of equivalence, Figure 1 plots the AMSE and the AFR of the unrestricted estimator for the case

p = 0

and

k = 0

.3 The figure clearly illustrates that while the two measures of forecast accuracy follow a similar path for c sufficiently far from zero, they tend to diverge as the process becomes more persistent. This pattern remains robust across different forecast horizons and suggests that a forecast combination approach that directly targets AFR instead of AMSE can potentially generate more accurate forecasts of highly persistent time series when forecast risk is used as a metric for forecast evaluation.

5. Choice of Combination Weights

The optimal combination forecast

{\hat{μ}}_{t + h} (w^{*})

is infeasible in practice, since the weight

w^{*}

depends on the unknown local-to-unity parameter

c

that is not consistently estimable. Given the lack of equivalence between AMSE and AFR for nonstationary time series as discussed in the previous section, we pursue an alternative approach to estimating the combination weights that directly targets the AFR, which is a more direct and practical measure of forecast accuracy than AMSE. In particular, the estimated weight

\hat{w}

is obtained by minimizing the so-called accumulated prediction errors (APE) criterion defined as

A P E (w) = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\hat{μ}}_{i + h} (w)\}}^{2} = \sum_{i = m_{h}}^{T - h} {\{w (y_{i + h} - {\hat{μ}}_{i + h}) + (1 - w) (y_{i + h} - {\tilde{μ}}_{i + h})\}}^{2}

(5)

with respect to

w,

where

w \in [0, 1], {\hat{μ}}_{i + h} (w)

is the h-step ahead combination forecast based only on data up to period i, and

m_{h}

denotes the smallest positive number such that the forecasts

{\hat{μ}}_{i + h}

and

{\tilde{μ}}_{i + h}

are well defined for all

i \geq m_{h}

. The solution is given by

\hat{w} = \frac{\sum_{i = m_{h}}^{T - h} {(y_{i + h} - {\tilde{μ}}_{i + h})}^{2} - \sum_{i = m_{h}}^{T - h} (y_{i + h} - {\hat{μ}}_{i + h}) (y_{i + h} - {\tilde{μ}}_{i + h})}{\sum_{i = m_{h}}^{T - h} {(y_{i + h} - {\tilde{μ}}_{i + h})}^{2} + \sum_{i = m_{h}}^{T - h} {(y_{i + h} - {\hat{μ}}_{i + h})}^{2} - 2 \sum_{i = m_{h}}^{T - h} (y_{i + h} - {\hat{μ}}_{i + h}) (y_{i + h} - {\tilde{μ}}_{i + h})}

The APE criterion with

h = 1

was first introduced by Rissanen (1986) in the context of model selection. Wei (1987) derived the asymptotic properties of APE in general regression models and specialized his results to stationary and nonstationary autoregressive processes with

h = 1

. Ing (2004) demonstrated the strong consistency of the APE-based lag order estimator in stationary autoregressive models for

h \geq 1

. In particular, he showed that a normalized version of the APE is almost certain to converge to the AFR in large samples. Ing et al. (2009) extended the analysis to autoregressive processes with a unit root. The results in Wei (1987), Ing (2004) and Ing et al. (2009) all relied on the law of iterated logarithm which ensures that, in large samples, APE is almost certain to be equivalent to

log T

times the AFR. It is, however, important to note that while this convergence result holds pointwise for

|α|

\leq 1

, it does not hold uniformly over

α

. In particular, it does not hold in the local-to-unity setup considered in this paper for

c < 0

.4 Nevertheless, the following result shows that the APE criterion remains asymptotically valid in the current framework at the two limits of

c

which represent the unit root and fixed stationary cases:

Theorem 2.

For a given

k,

let

A P E_{0} = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\tilde{μ}}_{i + h}\}}^{2}, A P E_{1} = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\hat{μ}}_{i + h}\}}^{2}

. Under Assumptions 1 and 2 and sup

_{t} E ({|e_{t}|}^{r}) < \infty,

for some

r > 2

,

(a) For

c = O (T),

l i m_{T \to \infty} {(σ^{2} log T)}^{- 1} (A P E_{1} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = {lim}_{c \to - \infty} f_{1} (c, p, k, h) .

(b)

l i m_{c \to 0} {lim}_{T \to \infty} {(σ^{2} log T)}^{- 1} (A P E_{0} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = {lim}_{c \to 0} f_{0} (c, p, k, h) .

Remark 1.

In a similar vein, Hansen (2010a) developed feasible combination weights by evaluating the Mallows criterion at the two limits of

c,

given that the criterion depends on

c

and is therefore infeasible in practice. Thus, while his analysis demonstrated that the infeasible Mallows criterion is an asymptotically unbiased estimate of the AMSE for any

c,

the feasible version of the criterion remains valid only in the two limit cases. When estimation is performed using FGLS instead of OLS, Kejriwal and Yu (2021) showed that the infeasible Mallows criterion also depends on the parameter

a

in (1) which governs the short-run dynamics. Evaluating the criterion at the two limits, however, eliminates the dependence on both nuisance parameters.

Figure 2 plots the AFR of the optimal (infeasible) and APE-based combination forecasts for

p = 1

and

k = 0

.5 For comparison, the unrestricted and restricted forecasts are also presented. As expected, the forecast risk of the restricted estimator increases with

|c|

, while the risk function of the unrestricted estimator is relatively flat as a function of c. Regardless of the forecast horizon, the feasible combination forecast maintains a risk profile close to that of the optimal forecast. In particular, the risk of the APE-weighted forecast is uniformly lower than that of the unrestricted estimator across values of

c

, as well as lower than that of the restricted estimator unless

c

is very close to zero. These results suggest that the loss in forecast accuracy due to the unknown degree of persistence is relatively small when constructing the combination weights based on the APE criterion. In Section 7 and Section 8, we conduct extensive comparisons of the APE-based combination forecasts with both the Mallows and cross-validation based combination forecasts.

6. Lag Order Uncertainty

This section extends the preceding analysis to the case where the lag order

k

is unknown. In order to accommodate lag order uncertainty, the set of models on which the combination forecast is based needs to be expanded to include models with different lag orders. Such a forecast can potentially trade off the misspecification bias inherent from the omission of relevant lags against the problem of overfitting induced by the inclusion of unnecessary lags. Kejriwal and Yu (2021) showed that the essence of this trade-off can be captured analytically by adopting a local asymptotic framework in which the coefficients of the short-run dynamics lie in a

O (T^{- 1 / 2})

-neighborhood of zero in addition to the

O (T^{- 1})

parameterization for the persistence parameter as specified in

(1) .

Specifically, we make the following assumption as in Kejriwal and Yu (2021):

Assumption 3.

We assume that

α_{i} = \frac{δ_{i}}{T^{1 / 2}}, i = 1, \dots, k,

where

δ = {(δ_{1}, \dots, δ_{k})}^{'}

is fixed and independent of T.

Assumption 3 ensures that the squared misspecification bias from omitting relevant lags is of the same order as the sampling variance introduced by estimating additional lags. Modeling

{α_{i}}

as fixed would make the bias due to misspecification diverge with the sample size and thus leave no scope for exploiting the trade-off between inclusion and exclusion of lags when constructing the combination forecasts.

We include sub-models with

l \in {0, 1, \dots, K}, K \geq k,

with the corresponding restricted and unrestricted forecasts given by

{\tilde{μ}}_{t} (l)

and

{\hat{μ}}_{t} (l)

, respectively. Let

I (l < k) = 1

if

l < k,

and zero otherwise. Define

ξ_{h} (δ, l, k) = M_{h} (k) {(0_{l}^{'}, δ_{l + 1}, \dots, δ_{k})}^{'},

where

0_{l}

is an

(l \times 1)

vector of zeros. Further, let

r_{h} (δ, l, k) = ξ_{h} {(δ, l, k)}^{'} ξ_{h} (δ, l, k) .

The following result derives the AFR of the forecasts in the presence of lag order uncertainty:

Theorem 3.

Under Assumptions 1–3 and sup

_{t} E ({|e_{t}|}^{θ_{h}}) < \infty,

where

θ_{h} = max {8, 2 (h + 2)} + ψ

for some

ψ > 0

,

(a)

{lim}_{T \to \infty} \frac{T}{σ^{2}} E {({\hat{μ}}_{T + h} (l) - μ_{T + h})}^{2} = h^{2} E (T_{1 c}^{2}) + g_{h} (l) + r_{h} (δ, l, k) .

(b)

{lim}_{T \to \infty} \frac{T}{σ^{2}} E {({\tilde{μ}}_{T + h} (l) - μ_{T + h})}^{2} = h^{2} E (T_{0 c}^{2}) + g_{h} (l) + r_{h} (δ, l, k) .

Theorem 3 shows that large sample forecast accuracy now depends on an additional misspecification component [

r_{h} (δ, l, k)

] emanating from the omission of relevant lags. The larger the magnitudes of the coefficients corresponding to the omitted lags, the larger the contribution of this component to the forecast risk. Moreover, under Assumption 3,

g_{h} (l) = t r (M_{h} (l) M_{h}^{'} (l))

varies with

h

for

h < l

but is constant for all

h \geq l

. Similarly,

r_{h} (δ, l, k)

varies with

h

for

h < k

but is constant thereafter. Thus, the forecast horizon only makes a limited contribution to the two short-run components of the asymptotic forecast risk. Another notable feature of Theorem 3 is that, in contrast to the case where the lag order is assumed known (Theorem 1), the contribution of the trend component is now proportional to the square of the forecast horizon. This difference is due to the fact that the coefficients

b_{j} \to 1

for all

j

since

α_{i} \to 0

for

i = 1, \dots, k

by virtue of Assumption 3.

We consider two types of combination forecasts. The first is a “partial averaging” forecast that only addresses lag order uncertainty by averaging over the

K + 1

unrestricted forecasts:

{\hat{μ}}_{t + h} (\hat{W}) = \sum_{l = 0}^{K} {\hat{w}}_{l} {\hat{μ}}_{t} (l)

(6)

The weights

\hat{W} = {({\hat{w}}_{0}, {\hat{w}}_{1}, \dots, {\hat{w}}_{K})}^{'}

are obtained by minimizing the APE criterion

A P E_{P} (W) = \sum_{i = m_{h}}^{T - h} {\{\sum_{l = 0}^{K} [w_{l} (y_{i + h} - {\hat{μ}}_{i + h} (l))]\}}^{2}

(7)

where

w_{l} \geq 0 (l = 0, \dots, K), \sum_{l = 0}^{K} w_{l} = 1

. We refer to (6) as the APE-based partial averaging (APA) forecast.

The second forecast is a “general averaging” forecast that accounts for both persistence and lag order uncertainty and thus combines the forecasts from all

2 (K + 1)

sub-models:

{\overset{˘}{μ}}_{t + h} (\overset{˘}{W}) = \sum_{l = 0}^{K} ({\overset{˘}{w}}_{1 l} {\hat{μ}}_{t} (l) + {\overset{˘}{w}}_{0 l} {\tilde{μ}}_{t} (l))

(8)

The weights

\overset{˘}{W} = {({\overset{˘}{w}}_{01}, {\overset{˘}{w}}_{02}, \dots, {\overset{˘}{w}}_{0 K}, {\overset{˘}{w}}_{11}, {\overset{˘}{w}}_{12}, \dots, {\overset{˘}{w}}_{1 K})}^{'}

are obtained by minimizing a generalized APE criterion of the form

A P E_{G} (W) = \sum_{i = m_{h}}^{T - h} {\{\sum_{l = 0}^{K} [w_{1 l} (y_{i + h} - {\hat{μ}}_{i + h} (l)) + w_{0 l} (y_{i + h} - {\tilde{μ}}_{i + h} (l))]\}}^{2}

(9)

where

w_{1 l} \geq 0, w_{0 l} \geq 0 (l = 0, \dots, K), \sum_{l = 0}^{K} (w_{0 l} + w_{1 l}) = 1

. We refer to

(8)

as the APE-based general averaging (AGA) forecast. Comparing the APA and AGA forecasts will serve to isolate the effects of the two sources of uncertainty on forecast accuracy.

The following result establishes the limiting behavior of the APE criterion in the presence of lag-order uncertainty:

Theorem 4.

Let

A P E_{0} (l) = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\tilde{μ}}_{i + h} (l)\}}^{2}, A P E_{1} (l) = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\hat{μ}}_{i + h} (l)\}}^{2}

. Under Assumptions 1–3 and sup

_{t} E ({|e_{t}|}^{r}) < \infty,

for some

r > 2

,

(a) For

c = O (T), lim_{T \to \infty} {(σ^{2} log T)}^{- 1} (A P E_{1} (l) - σ^{2} r_{h} (δ, l, k) - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = h^{2} lim_{c \to - \infty} E (T_{1 c}^{2}) + g_{h} (l) .

(b)

lim_{c \to 0} lim_{T \to \infty} {(σ^{2} log T)}^{- 1} (A P E_{0} (l) - σ^{2} r_{h} (δ, l, k) - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = h^{2} lim_{c \to 0} E (T_{0 c}^{2}) + g_{h} (l) .

Theorem 4 shows that while APE captures the components of AFR that are attributable to persistence uncertainty and estimation of the short-run dynamics, it does not account for lag-order uncertainty in the limit. As shown in the proof of Theorem 4, this is because the former two components grow at a logarithmic rate while the bias component due to lag order misspecification is bounded [i.e.,

O (1)

]. Nonetheless, given that the logarithmic function is slowly varying, it can be expected that in small samples, the APE is still effective at capturing the bias that occurs due to misspecification of the number of lags. Indeed, as shown subsequently in the simulations, the APE criterion offers considerable improvements over its competitors, even under lag-order misspecification.

7. Monte Carlo Simulations

This section reports the results of a set of Monte Carlo experiments designed to (1) evaluate the finite sample performance of the proposed approach relative to extant approaches; (2) quantify the importance of accounting for each source of uncertainty in terms of its effect on finite sample forecast risk. Section 7.1 lays out the experimental design. Section 7.2 details the different forecasting procedures included in the analysis. Section 7.3 and Section 7.4 present the results. Results are obtained for

p \in {0, 1} .

For brevity, we report the results only for

p = 1

. The results for

p = 0

are qualitatively similar, although the improvements offered by the proposed approach are more pronounced for

p = 1

than

p = 0 .

The full set of results is available upon request.

7.1. Experimental Design

We adopt a design similar to that in Hansen (2010a) and Kejriwal and Yu (2021) to facilitate direct comparisons. The data generating process (DGP) is based on (1) and specified as follows: (a) the innovations

e_{t} \overset{i . i . d .}{\sim} N (0, 1);

(b) the trend parameters are set at

β_{0} = β_{1} = 0;

(c) the true lag order

k \in {0, 6, 12}

with

α_{j} = - {(- θ)}^{j}

for

j = 1, \dots, k

and

θ = 0.6

. The maximum number of first-differenced lags included is set at

K = 12 .

The sample size is set at

T \in {100, 200}

. The local-to-unity parameter c varies from

- 20

to 0, implying

α

ranging from 0.8 to 1 for

T = 100

and

α

ranging from 0.9 to 1 for

T = 200

. At each c value, the finite-sample forecast risk

T E [{({\hat{μ}}_{T + h} - μ_{T + h})}^{2}]

is computed for all estimators considered, where

h \in {1, 3, 6, 12}

. All experiments are based on 10,000 Monte Carlo replications.

We report two sets of results. The first assumes k is known, thereby allowing us to demonstrate the effect of persistence uncertainty on forecast accuracy while abstracting from lag order uncertainty. The second allows

k

to be unknown and facilitates the comparison between forecasts that address both forms of uncertainty with those that only account for lag order uncertainty.

7.2. Forecasting Methods

The benchmark forecast in both the known and unknown lag cases is calculated from a standard autoregressive model of order

K + 1

estimated by OLS:

y_{t} = β_{0}^{*} + β_{1}^{*} t + α y_{t - 1} + \sum_{j = 1}^{K} α_{j} Δ y_{t - j} + ϵ_{t}

(10)

When the number of lags is assumed to be known (Section 7.3), we compare a set of six forecasting methods: (1) Mallows selection (Mal-Sel); (2) Cross-validation selection (CVh-Sel); (3) APE selection (APE-Sel); (4) Mallows averaging (Mal-Ave); (5) Cross-validation averaging (CVh-Ave); (6) APE averaging (APE-Ave). With an unknown number of lags, the following six methods are compared6: (1) Mallows partial averaging (MPA); (2) Cross-validation partial averaging (CPA); (3) APE partial averaging (APA); (4) Mallows general averaging (MGA); (5) Cross-validation general averaging (CGA); (6) APE general averaging (AGA). For brevity, a detailed description of these methods is not presented here but is included in Appendix B.

Both the APE selection and combination forecasts require a choice of

m_{h}

. To our knowledge, no data-dependent methods for choosing

m_{h}

are available in the existing literature. We therefore examined the viability of alternative choices via simulations. Specifically, for each persistence level (value of c), we computed the minimum forecast risk over all values of

m_{h}

in the range

[15, 70]

with a step-size of 5 (assuming a known number of lags k). While no single value was found to be uniformly dominant across persistence levels/horizons,

m_{h} = 20

turned out to be a reasonable choice overall.7 To justify this choice, Figure A1 in Appendix C plots the difference between the optimal forecast risk and the risk of the APE selection forecasts for

m_{h} = 20

expressed as a percentage of the forecast risk for

m_{h} = 20

. The corresponding results for the APE combination forecasts are presented in Figure A2. It is evident that using

m_{h} = 20

entails only a marginal increase in forecast risk (at most 5%) for the combination forecasts over the optimal forecast risk across different persistence levels and horizons. In contrast, the optimal choice of

m_{h}

for the selection forecasts is somewhat more unstable and appears to depend more heavily on the forecast horizon and the level of persistence. This robustness in behavior provides additional motivation for employing a combination approach to forecasting in practice.

7.3. Forecast Risk with Known Lag Order

Figure 3a,b, Figure 4a,b and Figure 5a,b plot the risk of the six methods relative to the benchmark. First, we consider the case

k = 0

. Several features of the results are noteworthy.

First, the selection forecasts typically exhibit higher risk than the corresponding combination forecasts across sample sizes and horizons. Second, when

T = 100,

the APE combination forecast is clearly the dominant method, performing discernibly better than forecasts based on either of the two competing weighting schemes. When

T = 200,

its dominance continues except when

| c |

is sufficiently large (the exact magnitude being horizon-dependent), in which case the benchmark delivers the most accurate forecasts and averaging over the restricted model becomes less attractive. Third, the relative performance of the Mallows and cross-validation weighting schemes depends on the horizon: at

h = 1,

the two schemes yield virtually indistinguishable forecasts; when

h \in {3, 6},

Mallows weighting yields uniformly lower risk over the parameter space; at

h = 12,

Mallows weighting is preferred when persistence is high (

c

close to zero) while cross-validation weighting dominates for lower levels of persistence.

In the presence of higher order serial correlation (

k > 0

), the superior performance of the APE combination forecast becomes even more evident: it now dominates all competing forecasts regardless of horizon and sample size. In particular, APE weighting outperforms the benchmark at all persistence levels, even at

T = 200,

unlike the

k = 0

case. The intuition for this difference in relative performance between the cases with and without higher-order serial correlation is that in the former case, averaging is comparatively more beneficial, since imposing the unit root restriction can potentially reduce the estimation uncertainty associated with the coefficients of the lagged differences. This reduction in sampling uncertainty in turn engenders a reduction in the overall risk of the combination forecast relative to the unrestricted benchmark forecast. Another notable difference from the

k = 0

case is that while Mallows and cross-validation weighting are comparable for

h \in {1, 3},

the former now dominates for

h \in {6, 12}

uniformly over the parameter space.

7.4. Forecast Risk with Unknown Lag Order

Figure 6a,b, Figure 7a,b and Figure 8a,b plot the relative risk of the six combination forecasts which comprise the three partial forecasts that only account for lag-order uncertainty and the three general forecasts that account for both lag-order and stochastic trend uncertainty. A clear implication of these results is that general averaging methods typically exhibit considerably lower forecast risk than partial averaging methods unless the process has relatively low persistence, in which case averaging over the unit root model increases the forecast risk incurred by the general averaging methods. The improvements offered by general averaging hold across both horizons and the number of lags

(k)

in the true DGP and become more prominent as the sample size increases.

Among the three weighting schemes, APE-based weights are the preferred choice except when

h \in {6, 12}

and

T = 100

, where Mallows weighting turns out to be the dominant approach if persistence is relatively low. A potential explanation for this result is that with long horizons and a small sample size, the APE criterion is based on a relatively smaller number of prediction errors, which increases the sampling variability associated with the resulting weights, thereby increasing the risk of the combination forecast. As in the known lag-order case, the choice between Mallows and cross-validation weighting is horizon-dependent: when

h = 1,

cross-validation weighting is preferred while when

h > 1,

Mallows weighting is preferred, with the magnitude of reduction in forecast risk increasing as

h

increases.

In summary, the results from the simulation experiments make a strong case for employing APE weights when constructing the combination forecasts and clearly highlight the benefits of targeting forecast risk rather than in-sample mean squared error. The comparison of general and partial combination forecasts also underscore the importance of concomitantly controlling for both stochastic trend uncertainty and lag-order uncertainty in generating accurate forecasts.

8. Empirical Application

This section conducts a pseudo out-of-sample forecast comparison of the different multistep forecast combination methods using a set of US macroeconomic time series. Our objectives are to empirically assess (1) the efficacy of different averaging/selection methods relative to a standard autoregressive benchmark; (2) the importance of averaging over both the persistence level and the lag order; and (3) the relative performance of alternative weight choices for constructing the combination forecasts.

Our analysis employs the FRED-MD data set compiled by McCracken and Ng (2016), which contains 123 monthly macroeconomic variables over the period January 1960–December 20188. McCracken and Ng (2016) suggested a set of seven transformation codes designed to render each series stationary: (1) no transformation; (2)

Δ y_{t};

(3)

Δ^{2} y_{t}

; (4) log

(y_{t})

; (5)

Δ

log

(y_{t})

; (6)

Δ^{2}

log

(y_{t})

; (7)

Δ (y_{t} / y_{t - 1} - 1)

. To ensure that the series fit our framework, which allows for highly persistent time series with/without deterministic trends, we adopt the following transformation codes as modified by Kejriwal and Yu (2021): (1’) no transformation; (2’)

y_{t}

; (3’)

Δ y_{t}

, (4’)

log (y_{t})

; (5’)

log (y_{t})

; (6’)

Δ log (y_{t})

; (7’)

y_{t} / y_{t - 1} - 1

. For series that correspond to codes (1’) and (4’), we construct the forecasts from a model with no deterministic trend (

p = 0

), while for the remaining codes, we use forecasts from a model that include a linear deterministic trend (

p = 1

). We also report results for eight core series as in Stock and Watson (2002), comprising four real and four nominal variables. As in the simulation experiments, four alternative forecast horizons are considered:

h \in {1, 3, 6,

12}

. We use a rolling window scheme with an initial estimation period of January 1960–December 1969 so that the forecast evaluation period is January 1970–December 2018 (588 observations). The size of the estimation window changes depending on the forecast horizon h. For example, when

h = 1

, the initial training sample contains 120 observations from January 1960–December 1969 while for

h = 3

, it contains only 118 observations from January 1960–October 1969. This ensures that the forecast origin is January 1970 for all forecast horizons considered. We compare ten different methods in terms of the mean squared forecast error (MSFE) computed as the average of the squared forecast errors: (1) MPA: Mallows partial averaging over the number of lags only in the unrestricted model; (2) MGA: Mallows general averaging over both the unit root restriction and the number of lags; (3) CPA: leave-h-out cross-validation (CV-h) averaging over the number of lags only in the unrestricted model; (4) CGA: leave-h-out cross-validation averaging over both the unit root restriction and the number of lags; (5) APA: accumulated prediction error averaging over the number of lags only in the unrestricted model; (6) AGA: accumulated prediction error averaging over both the unit root restriction and the number of lags; (7) MS: Mallows selection from all models (unrestricted and restricted) that vary with the number of lags; (8) CVhS: leave-h-out cross-validation selection from all models (unrestricted and restricted) that vary with the number of lags; (9) APES: accumulated prediction error selection from all models (unrestricted and restricted) that vary with the number of lags; (10) AR: unrestricted autoregressive model (benchmark). The maximum number of allowable first differenced lags in each method is set at

K = 12

. The benchmark forecast is computed from unrestricted OLS estimation of an autoregressive model of the form (10) that uses 12 first-differenced lags of the dependent variable and includes/excludes a deterministic trend depending on the transformation code the series corresponds to, as discussed above.

Table 1a (

h = 1, 3

) and Table 1b (

h = 6, 12

) report the percentages of wins and losses based on the MSFE for the 123 series. Specifically, they show the percentage of 123 series for which a method listed in a row outperforms a method listed in a column, and all other methods (last column). A summary of the results in Table 1(a and b) is given below:

The averaging methods uniformly dominate their selection counterparts at all forecast horizons. For instance, Mallows/cross-validation averaging outperform the corresponding selection procedures in more than 90% of the series at each horizon. The performance of AGA relative to APES is relatively more dependent on the horizon, with improvements observed in 77% (65%) of the series for $h = 1 (h = 12$ ), respectively.
Given a particular weighting scheme, averaging over both the unit root restriction and number of lags (general averaging) outperforms averaging over only the number of lags (partial averaging) at all horizons. For instance, when $h = 1,$ MGA (CGA, AGA) dominate MPA (CPA, APA) in 95% (81%, 79%) of the series, respectively, based on pairwise comparisons. A similar pattern is observed for multi-step forecasts.
Across all horizons, AGA emerges as the leading procedure due to its ability to deliver forecasts with the lowest MSFE among all methods for the maximum number of series (last column of Table 1(a and b)). This approach also dominates each of the competing approaches in terms of pairwise comparisons. The APES approach ranks second among all methods so that forecasting based on the accumulated prediction errors criterion (either AGA or APES) outperforms the other approaches for more than 50% of the series over each horizon (the specific percentages are 68.3% for $h = 1, 3$ ; 57.7% for $h = 6;$ 55.3% for $h = 12$ ).

Next, we examine the performance of the forecasting methods for different types of series based on their groupwise classification by McCracken and Ng (2016) in an attempt to uncover the extent to which the best methods vary by the type of series analyzed. In particular, McCracken and Ng (2016) classified the series into eight distinct groups: (1) output and income; (2) labor market; (3) housing; (4) consumption, orders and inventories; (5) money and credits; (6) interest and exchange rates; (7) prices; (8) stock market. For each of these groups, Table 2 reports the method(s) with the lowest MSFE for the most series compared to all other competing methods. We also report the number of horizons in which (a) averaging outperforms selection and vice-versa; (b) averaging over both the unit root restriction and number of lags (general averaging—GA) methods is superior to averaging over only the number of lags (partial averaging—PA) and vice-versa; (c) each of the three weighting schemes dominates the other two. The results are consistent with those in Table 1(a and b) and clearly demonstrate (1) the dominance of averaging over selection (with the exception of Group 3) ; (2) the benefits of accounting for both stochastic trend uncertainty and lag order uncertainty (GA) relative to only the latter (PA) for five out of the eight groups; (3) the superiority of APE weighting over the two competing weighting schemes (the exception is Group 5, where cross-validation weighting is the dominant approach).

Finally, we present a comparison of the different methods with respect to their ability to forecast the eight core series analyzed in Stock and Watson (2002). Table 3 reports the MSFE of the eight methods relative to the benchmark model (10) for four real variables (industrial production, real personal income less transfers, real manufacturing and trade sales, number of employees on nonagricultural payrolls) while Table 4 reports the corresponding results for four nominal variables (the consumer price index, the personal consumption expenditure implicit price deflator, the consumer price index less food and energy, and the producer price index for finished goods). To assess whether the difference between the proposed methods and the benchmark model is statistically significant, we use a two-tailed Diebold–Mariano test statistic (Diebold and Mariano 1995). A number less than one indicates better forecast performance than the benchmark and vice versa. The method with smallest relative MSFE for a given series is highlighted in bold.

Consider first the results for real variables (Table 3). The performance of the best method is statistically significant (at the 10% level) relative to the benchmark in twelve out of the sixteen cases. Consistent with the results in Table 1(a and b) and Table 2, general averaging typically dominates partial averaging, the exceptions being nonagricultural employment for

h \leq 6,

industrial production at

h = 12,

and real manufacturing and trade sales for

h = 6, 12,

where APES is the dominant procedure. The AGA approach turns out to have the highest relative forecast accuracy in 50% of all cases, with the improvements offered over rival approaches being particularly notable at

h = 12

. While cross-validation weighting does not yield the best forecasting procedure in any of the cases, Mallows weighting is the preferred approach in only two cases, although the improvements are statistically insignificant. Turning to the nominal variables (Table 4), the best method significantly outperforms the benchmark in ten cases. Again, general averaging is usually preferred to partial averaging, the exception being the case

h = 12

, where APA outperforms all other methods for three of the four variables. As with the real variables, the AGA forecast is the most accurate in 50% of all cases, though the improvements are now comparable across horizons. Finally, cross-validation weighting partly redeems itself by providing the best forecast in four cases, while Mallows weighting is the preferred method in only one case.

It is useful to briefly discuss the recent, related literature to place our empirical findings in perspective. Cheng and Hansen (2015) conducted a comparison of several shrinkage-type forecasting approaches using 143 quarterly US macroeconomic time series (transformed to stationarity) from 1960 to 2008. Their methods included factor-augmented forecast combination based on Mallows/cross-validation/equal weights, Bayesian model averaging, empirical Bayes, pretesting and bagging. They found that while the methods were comparable at the one-quarter horizon, cross-validation weighting clearly emerged as the preferred approach at the four-quarter horizon. Tu and Yi (2017) found that, when forecasting US inflation one-quarter ahead, Mallows-based combination forecasts that combine forecasts from unrestricted and restricted (imposing no error correction) vector autoregressions under the assumption of cointegration dominated both unrestricted and restricted forecasts. Using the same data set as ours, Kejriwal and Yu (2021) compared partial and general combination forecasts using Mallows weights with forecasts based on pretesting and Mallows selection. Consistent with our results, they found that a general combination strategy that averages over both the unit root restriction and different lag orders delivered the best forecasts overall.

In summary, our empirical results were found to be consistent with the simulation results in that (1) addressing both persistence uncertainty and lag-order uncertainty are crucial for generating accurate forecasts; (2) a weighting scheme that directly targets forecast risk instead of in-sample mean squared error yields an efficacious forecast combination approach at all horizons.

9. Conclusions

This paper has developed new multistep forecast combination methods for a time series driven by stochastic and/or deterministic trends. In contrast to existing methods based on Mallows/cross-validation weighting, our proposed combination forecasts were based on constructing weights obtained from an accumulated prediction errors criterion that directly targets the asymptotic forecast risk instead of the in-sample AMSE. Our analysis found strong evidence in favor of a version of the proposed approach that simultaneously addresses stochastic trend and lag order uncertainty. A practical implication of our results is that the degree of persistence in a time series can play an important role in the choice of combination weights. Our preferred approach can potentially serve as a useful univariate benchmark when evaluating the effectiveness of methods designed to exploit information in large data sets.

We conclude with a discussion of four possible directions for future research. First, the APE-based combination forecasts can potentially be used in conjunction with FGLS estimation of the deterministic component, given that the latter has been shown to yield improved forecasts over OLS estimation (Kejriwal and Yu 2021). Second, it may be useful to explore the possibility of allowing for a nonlinear deterministic component through, say, the inclusion of polynomial trends or a few low-frequency trigonometric components (Gallant 1981). To the extent that the specific nonlinear modeling structure captures the observed nonlinearities, such an approach may contribute to a further improvement in forecasting performance. Third, it would be useful to develop prediction intervals around our combination forecasts in order to quantify the associated sampling uncertainty. Fourth, and perhaps most challenging, while our numerical and empirical analyses clearly document the desirability of the proposed approach based on APE weighting relative to Mallows/CV weighting, an analytical comparison may shed further light on the relative merits of the different methods. To our knowledge, such results are primarily available in the context of the standard stationary framework with Mallows/cross-validation weighting (e.g., Hansen 2007; Zhang et al. 2013; Liao and Tsay 2020). Extending these results to the present nonstationary framework would be a potentially fruitful endeavor.

Author Contributions

Conceptualization, M.K., L.N. and X.Y.; methodology, M.K., L.N. and X.Y.; Software, L.N. and X.Y.; validation, M.K., L.N. and X.Y.; formal analysis, M.K., L.N. and X.Y.; investigation, M.K., L.N. and X.Y.; resources, M.K., L.N. and X.Y.; data curation, L.N. and X.Y.; writing—original draft preparation, M.K., L.N. and X.Y.; writing—review and editing, M.K., L.N. and X.Y.; visualization, L.N. and X.Y.; supervision, M.K.; project administration, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

Yu is supported by Shanghai Sailing Program No. 23YF1402000 and National Natural Science Foundation of China (72303040, 72192845).

Data Availability Statement

The data set is publicly available for download at https://research.stlouisfed.org/econ/mccracken/fred-databases/ (accessed on 1 February 2023) . Simulation data are available upon request.

Acknowledgments

The authors wish to thank the three anonymous referees for their constructive feedback which helped improve the paper. They also gratefully acknowledge many useful discussions from participants in the Purdue University seminar and the Midwest Econometrics conference.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

Let

W (.)

denote a standard Brownian motion on

[0, 1]

and define the Ornstein-Uhlenbeck process:

d W_{c} (r) = c W_{c} (r) + d W (r)

. For

p \in {0, 1},

let

X_{c} (r) = {(r^{p}, W_{c} (r))}^{'}

and define the detrended processes

\begin{matrix} W_{c}^{*} (r, p) & = \{\begin{matrix} W_{c} (r) \\ W_{c} (r) - \int_{0}^{1} W_{c} (s) d s \end{matrix} \begin{matrix} if \\ if \end{matrix} \begin{matrix} p = 0 \\ p = 1 \end{matrix} \\ X_{c}^{*} (r, p) & = \{\begin{matrix} X_{c} (r) \\ X_{c} (r) - \int_{0}^{1} X_{c} (s) d s \end{matrix} \begin{matrix} if \\ if \end{matrix} \begin{matrix} p = 0 \\ p = 1 \end{matrix} \end{matrix}

and the functionals

\begin{matrix} T_{0 c} & = - c W_{c}^{*} (1, p) + I (p = 1) W (1) \\ T_{1 c} & = X_{c}^{*} {(1, p)}^{'} {(\int_{0}^{1} X_{c}^{*} (r, p) X_{c}^{*} {(r, p)}^{'})}^{- 1} \int_{0}^{1} X_{c}^{*} (r, p) d W (r) + I (p = 1) W (1) . \end{matrix}

Let

β = {(β_{0}, β_{1})}^{^{'}}

,

z_{t} = {(1, t)}^{^{'}}

. Without loss of generality, we assume that

β_{0} = β_{1} = 0

in the true data generating process. For a matrix

A, {∥A∥}^{2} = {sup}_{∥v∥ = 1} v^{'} A^{'} A v

with

∥v∥

denoting the Euclidean norm for vector v. Unless otherwise defined, for any variable

x,

we use

x^{*}

to denote its demeaned version. For a random quantity

δ

, we write

δ = δ_{0} + o_{p} (δ_{0})

as

δ = δ_{0} + s . o .

, where

s . o .

represents a term of smaller order in probability. For brevity, all proofs are provided only for the case

p = 1

. The proofs for

p = 0

are simpler andfollow analogous arguments.

We start by noting that if

u_{t}

is generated by (1), it has the

A R (k + 1)

representation

u_{t} = \sum_{i = 1}^{k + 1} a_{i} u_{t - i} + e_{t},

where

a_{1} = α + α_{1}, a_{i} = α_{i} - α_{i - 1} (i = 2, \dots, k), a_{k + 1} = - α_{k}

. The companion VAR(1) form of the model is expressed as

Y_{t} = B {(k)}^{'} Y_{t - 1} + ν_{t}

where

\begin{matrix} \underset{(k + 3) \times 1}{Y_{t}} & = {(1, t + 1, y_{t}, \dots, y_{t - k})}^{'}, \underset{(k + 3) \times 1}{ν_{t}} = {(0, 0, e_{t}, 0, \dots, 0)}^{'} \\ \underset{(k + 3) \times (k + 3)}{B (k)} & = (\begin{matrix} B_{1} & B_{2} \\ 0_{(k + 1) \times 2} & F (k) \end{matrix}), \underset{(2 \times 2)}{B_{1}} = (\begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix}), \underset{2 \times (k + 1)}{B_{2}} = (\begin{matrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \end{matrix}) \\ \underset{(k + 1) \times (k + 1)}{F (k)} & = (a (k) | \begin{matrix} I_{k} \\ 0_{k}^{'} \end{matrix}), F^{0} (k) = I_{k + 1}, a (k) = {(a_{1}, \dots, a_{k + 1})}^{'} \end{matrix}

With “hat” and “tilde” denoting the unrestricted and restricted OLS estimates, respectively, the unrestricted and restricted forecasts can then be expressed as (see, e.g., Ing 2003):

\begin{matrix} {\hat{μ}}_{T + h} & = & y_{T} {(k + 1)}^{'} {\hat{B}}^{h - 1} (k) \hat{γ} (k) \end{matrix}

(A1)

\begin{matrix} {\tilde{μ}}_{T + h} & = & y_{T} {(k + 1)}^{'} {\tilde{B}}^{h - 1} (k) \tilde{γ} (k) \end{matrix}

(A2)

where

y_{T} (k + 1) = {(1, T + 1, y_{T}, \dots, y_{T - k})}^{'}, \hat{γ} (k) = {({\hat{β}}_{0}^{*}, {\hat{β}}_{1}^{*}, {\hat{a}}_{1}, \dots, {\hat{a}}_{k + 1})}^{'}

and

\begin{matrix} \underset{(k + 3) \times (k + 3)}{\hat{B} (k)} & = & (\begin{matrix} B_{1} & {\hat{B}}_{2} \\ 0_{(k + 1) \times 2} & \hat{F} (k) \end{matrix}), \\ \underset{(2 \times 2)}{B_{1}} & = & (\begin{matrix} 1 & 1 \\ 0 & 1 \end{matrix}), \underset{2 \times (k + 1)}{{\hat{B}}_{2}} = (\begin{matrix} {\hat{β}}_{0}^{*} & 0 & \dots & 0 \\ {\hat{β}}_{1}^{*} & 0 & \dots & 0 \end{matrix}) \\ \underset{(k + 1) \times (k + 1)}{\hat{F} (k)} & = & (\hat{a} (k) | \begin{matrix} I_{k} \\ 0_{k}^{'} \end{matrix}), {\hat{F}}^{0} = I_{k + 1}, \hat{a} (k) = {({\hat{a}}_{1}, \dots, {\hat{a}}_{k + 1})}^{'} \\ \underset{(k + 3) \times (k + 3)}{\tilde{B} (k)} & = & (\begin{matrix} B_{1} & {\tilde{B}}_{2} \\ 0_{(k + 1) \times 2} & \tilde{F} (k) \end{matrix}), \underset{2 \times (k + 1)}{{\tilde{B}}_{2}} = (\begin{matrix} {\tilde{β}}_{0}^{*} & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \end{matrix}) \end{matrix}

(A3)

The matrix

\tilde{F} (k)

is constructed in the same way as

\hat{F} (k)

with

\hat{a} (k)

replaced by

\tilde{a} (k)

, where

\tilde{a} (k) = {({\tilde{a}}_{1}, \dots, {\tilde{a}}_{k + 1})}^{'} = {(1 + {\tilde{α}}_{1}, {\tilde{α}}_{2} - {\tilde{α}}_{1}, \dots, {\tilde{α}}_{k} - {\tilde{α}}_{k - 1}, - {\tilde{α}}_{k})}^{'}

with

\tilde{γ} (k) = ({\tilde{β}}_{0}^{*}, 0, {\tilde{a}}_{1}, \dots,

{\tilde{a}}_{k + 1})^{'}

. Next, we state a set of lemmas that will be useful in developing the proofs of the main results. Lemmas A.1–A.4, A.7–A.9 below parallel Lemmas A.1–A.4, B.1–B.3 in Ing et al. (2009) who assumed an exact unit root

(c = 0) .

Since the sample moments have the same order whether

c = 0

or

c < 0

, the proofs of the following lemmas also follow directly those in Ing et al. (2009) and are hence omitted.

Lemma A1.

Suppose

{y_{t}}

satisfies (1) and Assumptions 1 and 2. Then for any

q > 0

,

E | | {\hat{R}}_{T}^{- 1} (k) {| |}^{q} = O (1)

where

{\hat{R}}_{T} (k) = T^{- 1} D_{T} (k) \sum_{j = k + 1}^{T - 1} y_{j} (k + 1) y_{j} {(k + 1)}^{'} D_{T} (k)^{'}

with

\begin{matrix} \underset{(k + 3) \times (k + 3)}{D_{T} (k)} & = d i a g (1, T^{- 1}, {\bar{D}}_{T} (k)), \\ \underset{(k + 1) \times (k + 1)}{{\bar{D}}_{T} (k)} & = (\begin{matrix} \frac{1}{\sqrt{T}} & \frac{- α_{1}}{\sqrt{T}} & \dots & \dots & \frac{- α_{k}}{\sqrt{T}} \\ 1 & - 1 & 0 & \dots & 0 \\ 0 & ⋱ & ⋱ & ⋱ & ⋮ \\ ⋮ & ⋱ & ⋱ & ⋱ & 0 \\ 0 & \dots & 0 & 1 & - 1 \end{matrix}) \end{matrix}

Lemma A2.

Suppose

{y_{t}}

satisfies (1) and Assumptions 1 and 2 and for some

q_{1} \geq 2

,

sup_{- \infty \leq t \leq \infty}

E | e_{t} |^{2 q_{1}} < \infty

. Then for any

0 < q < q_{1}

,

E | | {\hat{R}}_{T}^{- 1} (k) - {\hat{R}}_{T}^{*}^{- 1} (k) {| |}^{q} = O (T^{- q / 2})

where

\begin{matrix} \underset{(k + 3) \times (k + 3)}{{\hat{R}}_{T}^{*} (k)} = d i a g (\underset{3 \times 3}{{\hat{R}}_{c}^{*} (k)}, \underset{k \times k}{{\hat{Γ}}_{T} (k))} \\ {\hat{R}}_{c}^{*} (k) & = (\begin{matrix} T^{- 1} (T - 1 - k) & T^{- 1} \sum_{j = k + 1}^{T - 1} X_{t}^{'} \\ T^{- 1} \sum_{j = k + 1}^{T - 1} X_{t} & T^{- 2} \sum_{j = k + 1}^{T - 1} X_{t} X_{t}^{'} \end{matrix}) \\ X_{t} & = {[T^{- 1} (t + 1), T^{- 1 / 2} N_{t}]}^{'}, N_{j} = A (L) y_{j} \\ {\hat{Γ}}_{T} (k) & = T^{- 1} \sum_{j = k + 1}^{T - 1} s_{j} (k) s_{j} {(k)}^{'}, s_{j} (k) = {(Δ y_{j}, \dots, Δ y_{j - k + 1})}^{'} \end{matrix}

Lemma A3.

Suppose

{y_{t}}

satisfies (1) and Assumptions (1) and (2) with

sup_{- \infty \leq t \leq \infty} E {| e_{t} |}^{q} < \infty

for some

q \geq 2

. Then,

E | | T^{- 1 / 2} D_{T} (k) \sum_{j = k + 1}^{T - 1} y_{j} (k + 1) e_{j + 1} {| |}^{q} = O (1)

Lemma A4.

Suppose

{y_{t}}

satisfies (1) and Assumptions 1 and 2 with

sup_{- \infty \leq t \leq \infty} E {| e_{t} |}^{r} < \infty

for some

r > 4

. Then,

lim_{T \overset{}{\to} \infty} E (F_{T, k}) = 0

where

F_{T, k} = s_{T} {(k)}^{'} M_{h} (k) {\hat{Γ}}_{T}^{- 1} (k) \{\sum_{j = k + 1}^{T - 1} s_{j} (k) e_{j + 1}\} X_{T}^{'} {(\sum_{j = k + 1}^{T - 1} X_{j} {X_{j}}^{'})}^{- 1} \{\sum_{j = k + 1}^{T - 1} X_{j} e_{j + 1}\}

Lemma A5.

Let

\underset{T \times (p + 1)}{X} = \underset{T \times 1 T \times p}{[X_{1}, X_{2}]}

,

X_{1} = {(1, \dots, 1)}^{'}

, and assume

X^{'} X

is invertible. Define

M_{1} = \underset{T \times T}{I} - X_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'}

,

X_{2}^{*} = M_{1} X_{2}

. For any

T \times 1

vector e and any

p \times 1

vector

x_{2}

, we have

x^{'} {(X^{'} X)}^{- 1} X^{'} e = x_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} e + x_{2}^{*'} {(X_{2}^{*'} X_{2}^{*})}^{- 1} X_{2}^{*'} e

, where

x = {(x_{1}, x_{2}^{'})}^{'}

,

x_{1} = 1

,

x_{2}^{*} = x_{2} - {(X_{1}^{'} X_{1})}^{- 1} X_{2}^{'} X_{1}

.

Lemma A6.

Under Assumptions 1 and 2,

\frac{\sqrt{T} {\tilde{β}}_{0}^{*}}{σ} \overset{d}{\to} W_{c} (1) .

Lemma A7.

Under Assumptions 1 and 2 and

sup_{- \infty \leq t \leq \infty} E {| e_{t} |}^{q} < \infty

for some

q > 2

,

(i) For some

κ_{1} > 0

,

| | \hat{Γ} (k) - Γ (k) | | = o (T^{- κ_{1}}) a . s .

;

(ii) For some

κ_{2} > 0

,

| | {\hat{R}}_{T} - {\hat{R}}_{T}^{*} | | = o (T^{- κ_{2}}) a . s .

;

(iii)

| | {\hat{R}}_{T}^{- 1} | | = O (log log T) a . s .

.

Lemma A8.

Under Assumptions 1 and 2 and

sup_{- \infty \leq t \leq \infty} E {| e_{t} |}^{q} < \infty

for some

q > 2,

\sum_{i = m_{h}}^{T - h} F_{i, k} = o (T) a . s .

, where

F_{i, k} = s_{i} {(k)}^{'} M_{h} (k) {\hat{Γ}}_{i}^{- 1} (k) \{\sum_{j = k + 1}^{i - 1} s_{j} (k) e_{j + 1}\} X_{i}^{'} {(\sum_{j = k + 1}^{i - 1} X_{j} {X_{j}}^{'})}^{- 1} \{\sum_{j = k + 1}^{i - 1} X_{j} e_{j + 1}\}

Lemma A9.

Let

{x_{T}}

be a sequence of real numbers.

(i) If

x_{T} \geq 0, T^{- 1} \sum_{j = 1}^{T} x_{j} = O (1)

, and for some

ξ > 1, \underset{T \overset{}{\to} \infty}{lim inf} ν_{T} / T^{ξ} > 0

, then,

\sum_{j = 1}^{T} x_{j} / ν_{j} = O (1)

;

(ii) If

T^{- 1} \sum_{j = 1}^{T} x_{j} = o (1)

, then,

\sum_{j = 1}^{T} x_{j} / j = o (log T)

.

Proof of Lemma A5.

Note, by block matrix inversion,

\begin{matrix} {(X^{'} X)}^{- 1} = & {(\begin{matrix} X_{1}^{'} X_{1} & X_{1}^{'} X_{2} \\ X_{2}^{'} X_{1} & X_{2}^{'} X_{2} \end{matrix})}^{- 1} \\ = & (\begin{matrix} {(X_{1}^{'} X_{1})}^{- 1} + {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} X_{2} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} X_{1} {(X_{1}^{'} X_{1})}^{- 1} & - {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} X_{2} {(X_{2}^{'} M_{1} X_{2})}^{- 1} \\ - {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} X_{1} {(X_{1}^{'} X_{1})}^{- 1} & {(X_{2}^{'} M_{1} X_{2})}^{- 1} \end{matrix}) \end{matrix}

then

{(X^{'} X)}^{- 1} X^{'} e = (\begin{matrix} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} [I - X_{2} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1}] e \\ {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e \end{matrix})

Recall

x = {(x_{1}, x_{2}^{'})}^{'} = {(x_{1}, x_{2}^{*'})}^{'} + {(0, X_{1}^{'} X_{2} {(X_{1}^{'} X_{1})}^{- 1})}^{'}

, we have,

\begin{matrix} x^{'} {(X^{'} X)}^{- 1} X^{'} e \\ = & \underset{Term 1}{\underset{︸}{[(x_{1}, x_{2}^{*'})] {(X^{'} X)}^{- 1} X^{'} e}} + \underset{Term 2}{\underset{︸}{[(0, X_{1}^{'} X_{2} {(X_{1}^{'} X_{1})}^{- 1})] {(X^{'} X)}^{- 1} X^{'} e}} \\ = & \underset{Term 1}{\underset{︸}{x_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} [I - X_{2} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1}] e + x_{2}^{*'} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e}} \\ + \underset{Term 2}{\underset{︸}{X_{1}^{'} X_{2} {(X_{1}^{'} X_{1})}^{- 1} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e}} \\ = & x_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} e + x_{2}^{*'} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e \\ \underset{= 0, since x_{1} = 1, {(X_{1}^{'} X_{1})}^{- 1} = 1 / T, which is a constant}{\underset{︸}{- x_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} X_{2} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e + X_{1}^{'} X_{2} {(X_{1}^{'} X_{1})}^{- 1} {(X_{2}^{'} M_{1} X_{2})}^{- 1} X_{2}^{'} M_{1} e}} \\ = & x_{1} {(X_{1}^{'} X_{1})}^{- 1} X_{1}^{'} e + x_{2}^{*'} {(X_{2}^{*'} X_{2}^{*})}^{- 1} X_{2}^{*'} e \end{matrix}

□

Proof of Lemma A6.

The true DGP can be expressed as

Δ y_{t} = β_{0}^{*} + \sum_{j = 1}^{k} α_{j} Δ y_{t - j} + e_{t}^{*}

where

β_{0}^{*} = 0

and

e_{t}^{*} = \frac{a c}{T} u_{t - 1} + e_{t}

. Let

{\dot{Z}}_{t} = ({\dot{Z}}_{1}, {\dot{Z}}_{2, t}^{'})

,

{\dot{Z}}_{1} = 1

,

{\dot{Z}}_{2, t} = {(Δ y_{t - 1}, \dots, Δ y_{t - k})}^{'}

,

ι_{1} = {(1, 0, \dots, 0)}^{'}

,

ι_{[2 : k + 1]} = {(0, 1, \dots, 1)}^{'}

. Now

\begin{matrix} \frac{\sqrt{T} {\tilde{β}}_{0}^{*}}{σ} = & \frac{\sqrt{T}}{σ} ι_{1}^{'} {(\sum_{t = k + 1}^{T} {\dot{Z}}_{t} {\dot{Z}}_{t}^{'})}^{- 1} \sum_{t = k + 1}^{T} {\dot{Z}}_{t}^{'} (\frac{a c}{T} u_{t - 1} + e_{t}) \\ = & \frac{\sqrt{T}}{σ} {(\sum_{t = k + 1}^{T} {\dot{Z}}_{1}^{2})}^{- 1} \sum_{t = k + 1}^{T} {\dot{Z}}_{1} (\frac{a c}{T} u_{t - 1} + e_{t}) + o_{p} (1) \\ = & \frac{c a}{σ T^{3 / 2}} \sum_{t = k + 1}^{T} u_{t - 1} + \frac{1}{σ \sqrt{T}} \sum_{t = k + 1}^{T} e_{t} + o_{p} (1) \overset{d}{\to} c \int_{0}^{1} W_{c} + W (1) = W_{c} (1) \end{matrix}

(A4)

□

Proof of Theorem 1.

(a) Defining

γ (k) = {(β_{0}^{*}, β_{1}^{*}, a_{1}, \dots, a_{k + 1})}^{'}, {\hat{L}}_{h} (k) = \sum_{j = 0}^{h - 1} b_{j} {\hat{B}}^{h - 1 - j} (k)

and

L_{h} = \sum_{j = 0}^{h - 1} b_{j} B^{h - 1 - j} (k)

, we can write

\begin{matrix} \frac{T}{σ^{2}} & E {({\hat{μ}}_{T + h} - μ_{T + h})}^{2} \\ = & \frac{T}{σ^{2}} E {[y_{T} {(k + 1)}^{'} {\hat{L}}_{h} (k) (\hat{γ} (k) - γ (k))]}^{2} \\ = & \frac{T}{σ^{2}} [E {[y_{T} {(k + 1)}^{'} L_{h} (k) (\hat{γ} (k) - γ (k))]}^{2} \\ + E {[y_{T} {(k + 1)}^{'} \{{\hat{L}}_{h} (k) - L_{h} (k)\} (\hat{γ} (k) - γ (k))]}^{2} + o (1)] \\ = & \frac{1}{σ^{2}} E {[y_{T} {(k + 1)}^{'} L_{h} (k) D_{T}^{'} (k) ({\hat{R}}_{T}^{* - 1} (k)) \frac{D_{T} (k)}{\sqrt{T}} \sum_{j = k + 1}^{T - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \\ + \frac{1}{σ^{2}} E {[y_{T} {(k + 1)}^{'} L_{h} (k) D_{T}^{'} (k) ({\hat{R}}_{T}^{- 1} (k) - {\hat{R}}_{T}^{* - 1} (k)) \frac{D_{T} (k)}{\sqrt{T}} \sum_{j = k + 1}^{T - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \\ + \frac{T}{σ^{2}} E {[y_{T} {(k + 1)}^{'} \{{\hat{L}}_{h} (k) - L_{h} (k)\} (\hat{γ} (k) - γ (k))]}^{2} + o (1) \\ = & (I) + (I I) + (I I I) \end{matrix}

(A5)

The

(I I)

and

(I I I)

terms in (A5) are each

o (1)

by Lemmas A1–A3 and Holder’s inequality [see, e.g., the proof of Theorem 2.2 in Ing et al. 2009].

The term

(I)

can be written as:

\begin{matrix} \frac{1}{σ^{2}} E {[y_{T} {(k + 1)}^{'} L_{h} (k) D_{T}^{'} (k) {\hat{R}}_{T}^{* - 1} (k) \frac{D_{T} (k)}{\sqrt{T}} \sum_{j = k}^{T - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \\ = \frac{1}{σ^{2}} E {[y_{T} {(k + 1)}^{'} D_{T}^{'} (k) {\bar{L}}_{h} (k) {\hat{R}}_{T}^{* - 1} (k) \frac{D_{T} (k)}{\sqrt{T}} \sum_{j = k}^{T - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \end{matrix}

(A6)

where

{\bar{L}}_{h} (k) = \sum_{j = 0}^{h - 1} b_{j} d i a g (G_{T}^{h - 1 - j}, \bar{F} {(k)}^{h - 1 - j})

with

G_{T} = (\begin{matrix} 1 & T^{- 1} \\ 0 & 1 \end{matrix})

,

\bar{F} (k) = d i a g (1, S_{M} (k))

and

S_{M} (k) = (\begin{matrix} α (k - 1) & I_{k - 1} \\ α_{k} & 0_{k - 1}^{'} \end{matrix}), S_{M}^{0} (k) = I_{k} .

Note that

y_{T} {(k + 1)}^{'} D_{T}^{'} = (1, T^{- 1} (T + 1), T^{- 1 / 2} N_{T}, s_{T} (k))

. Further, since

G_{T}

is upper triangular, (A6) converges to

\begin{matrix} \frac{1}{σ^{2}} {(\sum_{j = 0}^{h - 1} b_{j})}^{2} lim_{T \overset{}{\to} \infty} E {\{T^{- 1 / 2} \sum_{j = k + 1}^{T - 1} e_{j + 1} + {X_{T}^{*}}^{'} {(\sum_{j = k + 1}^{T - 1} X_{j}^{*} {X_{j}^{*}}^{'})}^{- 1} \sum_{j = k + 1}^{T - 1} X_{j}^{*} e_{j + 1}\}}^{2} \\ + lim_{T \overset{}{\to} \infty} \frac{1}{σ^{2}} E {\{s_{T}^{'} (k) M_{h} (k) {\hat{Γ}}_{T}^{- 1} (k) T^{- 1 / 2} \sum_{j = k + 1}^{T - 1} s_{j} (k) e_{j + 1}\}}^{2} + \frac{2}{σ^{2}} (\sum_{j = 0}^{h - 1} b_{j}) lim_{T \overset{}{\to} \infty} E (F_{T, k}) \\ = & B. 1 + B. 2 + B. 3 \end{matrix}

(A7)

where

B. 1

utilizes Lemma A.5. Since

B. 2 = g_{h} (k)

by Theorem 1 of Ing (2003) and

B. 3 = 0

by Lemma A4, (A7) simplifies to:

\begin{matrix} B. 1 + B. 2 = & {(\sum_{j = 0}^{h - 1} b_{j})}^{2} lim_{T \overset{}{\to} \infty} E {\{W (1) + {X_{c}^{*} (1)}^{'} {(\int_{0}^{1} X_{c}^{*} {X_{c}^{*}}^{'})}^{- 1} \int_{0}^{1} X_{c}^{*} d W\}}^{2} + g_{h} (k) \\ = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} E [T_{1 c}^{2}] + g_{h} (k) \end{matrix}

(A8)

The required result then follows from (A5), (A7) and (A8).

(b) Defining

{\tilde{L}}_{h} (k) = \sum_{j = 0}^{h - 1} b_{j} {\tilde{B}}^{h - 1 - j} (k)

, with similar arguments as in (a), we can write:

\begin{matrix} \frac{T}{σ^{2}} E {({\tilde{μ}}_{T + h} - μ_{T + h})}^{2} & = \frac{T}{σ^{2}} E {[{y_{T} (k + 1)}^{'} {\tilde{L}}_{h} (k) (\tilde{γ} (k) - γ (k))]}^{2} \\ = \frac{T}{σ^{2}} E {[{y_{T} (k + 1)}^{'} L_{h} (k) (\tilde{γ} (k) - γ (k))]}^{2} + o (1) \end{matrix}

(A9)

Note that

L_{h} (k) = \sum_{j = 0}^{h - 1} b_{j} B^{h - 1 - j} (k) = \sum_{j = 0}^{h - 1} b_{j} {(\begin{matrix} B_{1} & 0 \\ 0 & F (k) \end{matrix})}^{h - 1 - j} = \sum_{j = 0}^{h - 1} b_{j} (\begin{matrix} B_{1}^{h - 1 - j} & 0 \\ 0 & F^{h - 1 - j} (k) \end{matrix})

Since

B_{1}

is upper triangular with

B_{1} (1, 1) = 1

,

\begin{matrix} (A 9) & = \frac{T}{σ^{2}} E {[y_{T} {(k + 1)}^{'} (\begin{matrix} \sum_{j = 0}^{h - 1} b_{j} [\begin{matrix} {\tilde{β}}_{0}^{*} \\ 0 \end{matrix}] \\ \sum_{j = 0}^{h - 1} b_{j} F^{h - 1 - j} (k) [\tilde{a} (k) - a (k)] \end{matrix})]}^{2} + o (1) \\ = \frac{T}{σ^{2}} E {[(\sum_{j = 0}^{h - 1} b_{j}) {\tilde{β}}_{0}^{*} + (y_{T}, \dots, y_{T - k}) \sum_{j = 0}^{h - 1} b_{j} F^{h - 1 - j} (k) [\tilde{a} (k) - a (k)]]}^{2} + o (1) \end{matrix}

(A10)

Now, consider the term

\begin{matrix} \frac{\sqrt{T}}{σ} (y_{T}, \dots, y_{T - k}) \sum_{j = 0}^{h - 1} b_{j} F^{h - 1 - j} (k) [\tilde{a} (k) - a (k)] = \frac{\sqrt{T}}{σ} (y_{T}, \dots, y_{T - k}) L_{h}^{(F)} (k) [\tilde{a} (k) - a (k)] \\ = & \frac{\sqrt{T}}{σ} (y_{T}, \dots, y_{T - k}) L_{h}^{(F)} (k) [\hat{a} (k) - a (k) + H_{k} D_{T}^{'} (k) {\hat{R}}_{T}^{- 1} (k) D_{T} (k) R_{k}^{'} {(R_{k} D_{T}^{'} (k) {\hat{R}}_{T}^{- 1} (k) D_{T} (k) R_{k}^{'})}^{- 1} (r - R_{k} \hat{γ} (k))] \end{matrix}

(A11)

where

\begin{matrix} L_{h}^{(F)} (k) & = \sum_{j = 0}^{h - 1} b_{j} F {(k)}^{h - 1 - j}, \underset{(k + 1) \times (k + 3)}{H_{k}} = [\begin{matrix} 0_{(k + 1) \times 2} & I_{(k + 1)} \end{matrix}], \underset{2 \times (k + 3)}{R_{k}} & = [\begin{matrix} 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 1 \end{matrix}], \underset{2 \times 1}{r} = {(0, 1)}^{'} \end{matrix}

Next, define

{\bar{L}}_{h}^{(F)} (k) = d i a g (\sum_{j = 0}^{h - 1} b_{j}, M_{h} (k))

,

\underset{(k + 1) \times 1}{\hat{θ} (k)} = {(T (1 - \hat{α}) / a, 0, \dots, 0)}^{'}

, we have

\begin{matrix} (A 11) = & \frac{1}{σ} (y_{T}, \dots, y_{T - k}) [\sqrt{T} L_{h}^{(F)} (k) {\hat{a} (k) - a (k)} + L_{h}^{(F)} (k) {\bar{D}}_{T}^{'} (k) \hat{θ} (k)] \\ = & \frac{1}{σ} (y_{T}, \dots, y_{T - k}) [\sqrt{T} L_{h}^{(F)} (k) {\hat{a} (k) - a (k)} + {\bar{D}}_{T}^{'} (k) {\bar{L}}_{h}^{(F)} (k) \hat{θ} (k)] \\ = & \frac{1}{σ} (y_{T}, \dots, y_{T - k}) \sqrt{T} L_{h}^{(F)} (k) (\hat{a} (k) - a (k)) + \frac{1}{σ} (N_{T} / \sqrt{T}, s_{T} (k)) d i a g (\sum_{j = 0}^{h - 1} b_{j}, M_{h} (k)) \hat{θ} (k) \\ = & \frac{1}{σ} (y_{T}, \dots, y_{T - k}) \sqrt{T} L_{h}^{(F)} (k) (\hat{a} (k) - a (k)) + \frac{N_{T}}{σ \sqrt{T}} \sum_{j = 0}^{h - 1} b_{j} [- c - T (\hat{α} - α) / a] \\ = & \frac{N_{T}}{σ \sqrt{T}} \sum_{j = 0}^{h - 1} b_{j} \{T (\hat{α} - α) / a\} + \frac{1}{σ} s_{T} {(k)}^{'} M_{h} (k) {\hat{Γ}}_{T}^{- 1} (k) T^{- 1 / 2} \sum_{j = k}^{T - 1} s_{j} (k) e_{j + 1} \\ + \frac{N_{T}}{σ \sqrt{T}} \sum_{j = 0}^{h - 1} b_{j} [- c - T (\hat{α} - α) / a] \\ = & - c \frac{N_{T}}{σ \sqrt{T}} \sum_{j = 0}^{h - 1} b_{j} + \frac{1}{σ} s_{T} {(k)}^{'} M_{h} (k) {\hat{Γ}}_{T}^{- 1} (k) T^{- 1 / 2} \sum_{j = k}^{T - 1} s_{j} (k) e_{j + 1} \end{matrix}

(A12)

Then, combining (A10) with (A12) and using Lemma A6, we finally get

\begin{matrix} lim_{T \overset{}{\to} \infty} \frac{T}{σ^{2}} E {({\tilde{μ}}_{T + h} - μ_{T + h})}^{2} & = E {[\sum_{j = 0}^{h - 1} b_{j} (W_{c} (1) - c W_{c} (1))]}^{2} \\ + \frac{1}{σ^{2}} lim_{T \overset{}{\to} \infty} E {\{s_{T}^{' *} (k) M_{h} (k) {\hat{Γ}}_{T}^{- 1} (k) T^{- 1 / 2} \sum_{j = k}^{T - 1} s_{j} (k) e_{j + 1}\}}^{2} \\ = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} E [T_{0 c}^{2}] + g_{h} (k) \end{matrix}

which uses the fact that

W_{c} (1) - c W_{c} (1) = W (1) - c W_{c}^{*} (1)

, thereby proving the result. □

Proof of Theorem 2.

Henceforth, estimated parameters and quantities with subscript i denotes the estimates using observations from 1 to i. We prove (a) first. It follows from Chow (1965) and Ing (2004) that

A P E_{1} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2} = \sum_{i = m_{h}}^{T - h} {[y_{i}^{'} (k + 1) {\hat{L}}_{i, h} (k) ({\hat{γ}}_{i} (k) - γ (k))]}^{2} (1 + o (1)) + O (1) a . s .

Using similar algebra as in Theorem 1, we have:

\begin{matrix} \sum_{i = m_{h}}^{T - h} {[y_{i}^{'} (k + 1) {\hat{L}}_{i, h} (k) ({\hat{γ}}_{i} (k) - γ (k))]}^{2} \\ = \sum_{i = m_{h}}^{T - h} [{[y_{i} {(k + 1)}^{'} L_{h} (k) (\hat{γ_{i}} (k) - γ (k))]}^{2} \\ + {[y_{i} {(k + 1)}^{'} \{{\hat{L}}_{i, h} (k) - L_{h} (k)\} ({\hat{γ}}_{i} (k) - γ (k))]}^{2}] + s . o . \\ = & \sum_{i = m_{h}}^{T - h} \frac{1}{i} {[y_{i} {(k + 1)}^{'} L_{h} (k) D_{i}^{'} (k) ({\hat{R}}_{i}^{* - 1} (k)) \frac{D_{i} (k)}{\sqrt{i}} \sum_{j = k + 1}^{i - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \\ + \sum_{i = m_{h}}^{T - h} \frac{1}{i} {[y_{i} {(k + 1)}^{'} L_{h} (k) D_{i}^{'} (k) ({\hat{R}}_{i}^{- 1} (k) - {\hat{R}}_{i}^{* - 1} (k)) \frac{D_{i} (k)}{\sqrt{i}} \sum_{j = k + 1}^{i - 1} y_{j} (k + 1) e_{j + 1}]}^{2} \\ + \sum_{i = m_{h}}^{T - h} {[y_{i} {(k + 1)}^{'} \{{\hat{L}}_{i, h} (k) - L_{h} (k)\} ({\hat{γ}}_{i} (k) - γ (k))]}^{2} + s . o . \\ = & (I V) + (V) + (V I) \end{matrix}

(A13)

The

(V)

and

(V I)

terms in (A13) are each

O (1)

following similar arguments in Ing et al. (2009) which build on Lemmas A7–A9.

Analogous to (A6) and (A7) in the proof of Theorem 1, (IV) can be rewritten as:

\begin{matrix} (I V) = & {(\sum_{j = 0}^{h - 1} b_{j})}^{2} \sum_{i = m_{h}}^{T - h} {\{{Z_{i}}^{'} {(\sum_{j = k + 1}^{i - 1} Z_{j} {Z_{j}}^{'})}^{- 1} \sum_{j = k + 1}^{i - 1} Z_{j} e_{j + 1}\}}^{2} \\ + \sum_{i = m_{h}}^{T - h} {\{s_{i}^{'} (k) M_{h} (k) {\hat{Γ}}_{i}^{- 1} (k) \frac{1}{i} \sum_{j = k + 1}^{i - 1} s_{j} (k) e_{j + 1}\}}^{2} + 2 (\sum_{j = 0}^{h - 1} b_{j}) \sum_{i = m_{h}}^{T - h} \frac{1}{i} F_{i, k} \\ = C. 1 + C. 2 + C. 3 \end{matrix}

where

Z_{j} = {(1, t + 1, N_{j})}^{'}

. In analogy with Theorem 3.1 of Ing (2004),

C . 2 = g_{h} (k) σ^{2} log T + o_{p} (log T)

(A14)

By Lemmas A8 and A9,

C . 3 = o_{p} (log T)

. Now we focus on

C . 1

. By Theorem 4 of Wei (1987), we have

C . 1 = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} σ^{2} log det (\sum_{j = k + 1}^{T - 1} Z_{j} Z_{j}^{'}) + o_{p} (log T)

Defining the

3 \times 3

matrix

Υ_{T} = d i a g (T, T^{3}, T^{2} / |c|)

and using Lemma A of Phillips (2014) in conjunction with the fact that

|c| T^{- 2} = O (T^{- 1})

, we can calculate

\begin{matrix} log det (\sum_{j = k + 1}^{T - 1} Z_{j} Z_{j}^{'}) & = log det (Υ_{T}^{1 / 2} Υ_{T}^{- 1 / 2} \sum_{j = k + 1}^{T - 1} Z_{j} Z_{j}^{'} Υ_{T}^{- 1 / 2} Υ_{T}^{1 / 2}) \\ = log det (Υ_{T}) + O_{p} (1) = log (T^{5}) + O_{p} (1) \\ = 5 log (T) + O_{p} (1) \end{matrix}

(A15)

which leads to

C . 1 = 5 σ^{2} {(\sum_{j = 0}^{h - 1} b_{j})}^{2} log (T) + o_{p} (log T)

. Thus,

lim_{T \to \infty} \frac{1}{σ^{2} log T} (A P E_{1} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = 5 {(\sum_{j = 0}^{h - 1} b_{j})}^{2} + g_{h} (k)

(A16)

where the right hand side of (A16) is the limit of

f_{1} (c, p, k, h) = f_{1} (c, p, h) + g_{h} (k)

as

c \to

-∞.

We next prove (b). Following similar steps as in the proof of (a) and the proof of Theorem 1 for the restricted case, we can derive

\begin{matrix} A P E_{0} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2} = & {(\sum_{j = 0}^{h - 1} b_{j})}^{2} \sum_{i = m_{h}}^{T - h} {({\tilde{β}}_{0, i}^{*} - c \frac{N_{i}}{i})}^{2} \\ + \sum_{i = m_{h}}^{T - h} {\{s_{i}^{'} (k) M_{h} (k) {\hat{Γ}}_{i}^{- 1} (k) \frac{1}{i} \sum_{j = k + 1}^{i - 1} s_{j} (k) e_{j + 1}\}}^{2} + o_{p} (log T) = D. 1 + D. 2 \end{matrix}

In view of (A4), taking the limit

c \to 0

, we have

\begin{matrix} \sum_{i = m_{h}}^{T - h} {({\tilde{β}}_{0, i}^{*} - c \frac{N_{i}}{i})}^{2} = & \sum_{i = m_{h}}^{T - h} {[ι_{1}^{'} {(\sum_{t = k + 1}^{i} {\dot{Z}}_{t} {\dot{Z}}_{t}^{'})}^{- 1} \sum_{t = k + 1}^{i} {\dot{Z}}_{t}^{'} e_{t}]}^{2} \\ = & \sum_{i = m_{h}}^{T - h} {[{(\sum_{t = k + 1}^{i} {\dot{Z}}_{1}^{2})}^{- 1} \sum_{t = k + 1}^{i} {\dot{Z}}_{1} e_{t}]}^{2} + s . o . \\ = & log det (\sum_{j = k + 1}^{T - 1} {\dot{Z}}_{1}^{2}) + o_{p} (log T) = σ^{2} log T + o_{p} (log T) \end{matrix}

Further, using the same argument as in (A14), we have

D . 2 = g_{h} (k) σ^{2} log T + o_{p} (log T)

. Thus,

lim_{c \to 0} lim_{T \to \infty} \frac{1}{σ^{2} log T} (A P E_{0} - \sum_{i = m_{h}}^{T - h} η_{i, h}^{2}) = {(\sum_{j = 0}^{h - 1} b_{j})}^{2} + g_{h} (k)

(A17)

where the right hand side of (A17) is the limit of

f_{0} (c, p, k, h) = f_{0} (c, p, h) + g_{h} (k)

as

c \to 0

since

{lim}_{c \to 0} E (T_{0 c}^{2}) = E [W {(1)}^{2}] = 1

. □

For the proofs of Theorems 3 and 4, we focus on the misspecified case where

l < k

. For the case

l \geq k,

the proofs follow directly from the arguments in Theorems 1 and 2 above and those in Ing et al. (2009).

Proof of Theorem 3.

First, note that under Assumption 3, Lemmas A1–A4 continue to hold with

k

replaced by

l (l < k)

, and

e_{t}

replaced by

ε_{t} = e_{t} + ω_{t},

where

ω_{t} = \sum_{j = l + 1}^{k} α_{j} Δ y_{t - j}

. Define

L_{h}^{*} (l) = \sum_{j = 0}^{h - 1} b_{j} {[B^{*} (l)]}^{h - 1 - j},

where

B^{*} (l)

is defined similarly to

B (l)

except that

F (l)

is replaced by

F^{*} (l) = (a^{*} (l) | \begin{matrix} I_{l} \\ 0_{l}^{'} \end{matrix}), F^{* 0} (l) = I_{l + 1}, a^{*} (l) = {(a_{1}, \dots, a_{l}, a_{l + 1}^{*})}^{'}

and

a_{l + 1}^{*} = - α_{l}

. Also, let

γ^{*} (l) = {(β_{0}^{*}, β_{1}^{*}, a_{1}, \dots, a_{l}, a_{l + 1}^{*})}^{'}

and

α^{*} (l, k) = {(0_{l}^{'}, α_{l + 1}, \dots, α_{k})}^{'}

. Finally, note that under Assumption 3,

b_{j} \to 1

for all j.

(a) We can write

\begin{matrix} \frac{T}{σ^{2}} E {({\hat{μ}}_{T + h} (l) - μ_{T + h})}^{2} & = \frac{T}{σ^{2}} E {[y_{T} {(l + 1)}^{'} {\hat{L}}_{h} (l) (\hat{γ} (l) - γ^{*} (l)) + s_{T} {(k)}^{'} M_{h} (k) α^{*} (l, k)]}^{2} \\ = E [{\hat{f}}_{T, h}^{2} (l, k)] + E [r_{T, h}^{2} (l, k)] + o (1) \\ = (I) + (I I) + o (1) \end{matrix}

(A18)

where

\begin{matrix} {\hat{f}}_{T, h} (l, k) & = (\sqrt{T} / σ) y_{T} {(l + 1)}^{'} {\hat{L}}_{h} (l) (\hat{γ} (l) - γ^{*} (l)) \\ r_{T, h} (l, k) & = (\sqrt{T} / σ) s_{T} {(k)}^{'} M_{h} (k) α^{*} (l, k) \end{matrix}

We now derive the limit of the terms

(I)

and

(I I)

in (A18). First, consider the term

(I)

in (A18). Noting that the effective errors are now

{ε_{t}}

instead of

{e_{t}},

we can write, similar to (A5),

\begin{matrix} E [{\hat{f}}_{T, h}^{2} (l, k)] & = \frac{1}{σ^{2}} E {[y_{T} {(l + 1)}^{'} L_{h}^{*} (l) D_{T}^{'} (l) ({\hat{R}}_{T}^{* - 1} (l)) \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) ε_{j + 1}]}^{2} \\ + \frac{1}{σ^{2}} E {[y_{T} {(l + 1)}^{'} L_{h}^{*} (l) D_{T}^{'} (l) ({\hat{R}}_{T}^{- 1} (l) - {\hat{R}}_{T}^{* - 1} (l)) \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) ε_{j + 1}]}^{2} \\ + \frac{T}{σ^{2}} E {[y_{T} {(l + 1)}^{'} \{{\hat{L}}_{h} (l) - L_{h}^{*} (l)\} (\hat{γ} (l) - γ^{*} (l))]}^{2} + o (1) \\ = (T 1) + (T 2) + (T 3) + o (1) \end{matrix}

(A19)

The terms

(T 2)

and

(T 3)

are each

o (1)

by the fact that Lemmas A1–A3 hold when

{e_{t}}

is replaced by

{ε_{t}}

. Now consider the following term appearing in term

(T 1)

:

\begin{matrix} \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) ε_{j + 1} \\ = \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) e_{j + 1} + \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) ω_{j + 1} \\ = \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) e_{j + 1} + \frac{1}{T} \sum_{i = l + 1}^{k} δ_{i} \sum_{j = l + 1}^{T} D_{T} (l) y_{j} (l + 1) Δ y_{j - i + 1} \\ = \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) e_{j + 1} + \frac{1}{\sqrt{T}} \sum_{i = l + 1}^{k} δ_{i} \frac{1}{\sqrt{T}} \sum_{j = l + 1}^{T} D_{T} (l) y_{j} (l + 1) e_{j - i + 1} + o_{p} (1) \\ = \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) e_{j + 1} + \frac{1}{\sqrt{T}} (\sum_{i = l + 1}^{k} δ_{i}) O_{p} (1) + o_{p} (1) \\ = \frac{D_{T} (l)}{\sqrt{T}} \sum_{j = l + 1}^{T - 1} y_{j} (l + 1) e_{j + 1} + o_{p} (1) \end{matrix}

(A20)

where the third equality follows from Assumption 3. Then, substituting (A20) in (A19) and following the same arguments used in deriving (A8), we get

lim_{T \to \infty} E [{\hat{f}}_{T, h}^{2} (l, k)] = lim_{T \to \infty} (T 1) = h^{2} E (T_{1 c}^{2}) + g_{h} (l)

(A21)

Now, consider the term

(I I)

in (A18). Defining

ξ_{h} (δ, l, k) = M_{h} (k) {(0_{l}^{'}, δ_{l + 1}, \dots, δ_{k})}^{'},

we can write

\begin{matrix} E [r_{T, h}^{2} (l, k)] & = & \frac{1}{σ^{2}} [\sqrt{T} α^{*} {(l, k)}^{'} M_{h} {(k)}^{'} E {s_{T} (k) s_{T} {(k)}^{'}} M_{h} (k) \sqrt{T} α^{*} (l, k)] \\ = & \frac{1}{σ^{2}} [\sqrt{T} α^{*} {(l, k)}^{'} M_{h} {(k)}^{'} σ^{2} I_{k} M_{h} (k) \sqrt{T} α^{*} (l, k)] + o (1) \\ \to & ξ_{h} {(δ, l, k)}^{'} ξ_{h} (δ, l, k) \end{matrix}

(A22)

where the second equality in (A22) follows from the facts that under Assumption 3,

\sqrt{T} α^{*} (l, k) = {(0_{l}^{'}, δ_{l + 1}, \dots, δ_{k})}^{'}

and

E {s_{T} (k) s_{T} {(k)}^{'}} \to σ^{2} I_{k} .

Finally, substituting (A21) and (A22) in (A18), the result follows.

(b) We can write

\begin{matrix} \frac{T}{σ^{2}} E {({\tilde{μ}}_{T + h} (l) - μ_{T + h})}^{2} & = & \frac{T}{σ^{2}} E {[y_{T} {(l + 1)}^{'} {\tilde{L}}_{h} (l) (\tilde{γ} (l) - γ^{*} (l)) + s_{T} {(k)}^{'} M_{h} (k) α^{*} (l, k)]}^{2} \\ = & E [{\tilde{f}}_{T, h}^{2} (l, k)] + E [r_{T, h}^{2} (l, k)] + o (1) \\ = & (I) + (I I) + o (1) \end{matrix}

(A23)

where

{\tilde{f}}_{T, h} (l, k) = (\sqrt{T} / σ) y_{T} {(l + 1)}^{'} {\tilde{L}}_{h} (l) (\tilde{γ} (l) - γ^{*} (l))

and

r_{T, h} (l, k)

is as defined in (A18). The limit of term

(I I)

is derived in (A22). To obtain the limit of

(I),

we follow the same steps as in the proof of Theorem 1(b) with

k

replaced by

l

and use (A20) to get

E [{\tilde{f}}_{T, h}^{2} (l, k)] \to h^{2} E (T_{0 c}^{2}) + g_{h} (l)

(A24)

The result then follows by substituting (A24) and (A22) in (A23). □

Proof of Theorem 4.

(a) For each

i \in {m_{h}, T - h}

, we can write

y_{i + h} - {\hat{μ}}_{i + h} (l) = η_{i, h} - \underset{{\hat{f}}_{i, h} (l, k)}{\underset{︸}{y_{i}^{'} (l + 1) {\hat{L}}_{i, h} (l) ({\hat{γ}}_{i}^{*} (l) - γ^{*} (l))}} + \underset{r_{i, h} (l, k)}{\underset{︸}{s_{i}^{'} (k) M_{h} (k) α^{*} (l, k)}}

We then have

\begin{matrix} A P E_{1} (l) & = \sum_{i = m_{h}}^{T - h} {[η_{i, h} - {\hat{f}}_{i, h} (l, k) + r_{i, h} (l, k)]}^{2} \\ = \sum_{i = m_{h}}^{T - h} η_{i, h}^{2} + \sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h}^{2} (l, k) [1 + o_{p} (1)] + \sum_{i = m_{h}}^{T - h} r_{i, h}^{2} (l, k) + 2 \sum_{i = m_{h}}^{T - h} η_{i, h} r_{i, h} (l, k) \\ - 2 \sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h} (l, k) r_{i, h} (l, k) + O_{p} (1) \end{matrix}

(A25)

Note that

\sum_{i = m_{h}}^{T - h} η_{i, h} r_{i, h} (l, k) = [T^{- 1 / 2} \sum_{i = m_{h}}^{T - h} η_{i, h} s_{i}^{'} (k)] [M_{h} (k) T^{1 / 2} α^{*} (l, k)] = O_{p} (1) . O (1) = O_{p} (1)

. We will now consider

\sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h}^{2} (l, k)

,

\sum_{i = m_{h}}^{T - h} r_{i, h}^{2} (l, k)

, and

\sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h} (l, k) r_{i, h} (l, k)

in turn.

\begin{matrix} \sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h}^{2} (l, k) & = & \sum_{i = m_{h}}^{T - h} \frac{1}{i} {[y_{i}^{'} (l + 1) L_{h} (l) D_{i}^{'} (l) {\hat{R}}_{i}^{* - 1} (l) \frac{D_{i} (l)}{\sqrt{i}} \sum_{j = l}^{i - 1} y_{j} (l + 1) ε_{j + 1}]}^{2} + O_{p} (1) \\ = & \sum_{i = m_{h}}^{T - h} \frac{1}{i} {[y_{i}^{'} (l + 1) L_{h} (l) D_{i}^{'} (l) {\hat{R}}_{i}^{* - 1} (l) \frac{D_{i} (l)}{\sqrt{i}} \sum_{j = l}^{i - 1} y_{j} (l + 1) e_{j + 1}]}^{2} + O_{p} (1) \\ = & 5 σ^{2} h^{2} log T + g_{h} (l) σ^{2} log T + o_{p} (log T) . \end{matrix}

(A26)

where the first equality in (A26) follows in analogy with (A13), the second follows in analogy with (A20), and the third follows from the same arguments used to derive (A14) and (A15) as well as the fact that

b_{j} \to 1

.

Next, consider the term

\begin{matrix} \sum_{i = m_{h}}^{T - h} r_{i, h}^{2} (l, k) & = & \sum_{i = m_{h}}^{T - h} [α^{*'} (l, k) M_{h}^{'} (k) s_{i} (k) s_{i}^{'} (k) M_{h} (k) α^{*} (l, k)] \\ = & ξ_{h} {(δ, l, k)}^{'} [T^{- 1} \sum_{i = m_{h}}^{T - h} s_{i} (k) s_{i}^{'} (k)] ξ_{h} (δ, l, k) \\ = & ξ_{h} {(δ, l, k)}^{'} (σ^{2} I_{k}) ξ_{h} (δ, l, k) + o_{p} (1) \\ \to & σ^{2} ξ_{h} {(δ, l, k)}^{'} ξ_{h} (δ, l, k) \end{matrix}

(A27)

where the third equality in (A27) is a consequence of Assumption 3.

Finally, consider the cross-product term

\begin{matrix} \sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h} (l, k) r_{i, h} (l, k) & \leq & {[\sum_{i = m_{h}}^{T - h} {\hat{f}}_{i, h}^{2} (l, k)]}^{1 / 2} {[\sum_{i = m_{h}}^{T - h} r_{i, h}^{2} (l, k)]}^{1 / 2} \\ = & O_{p} (\sqrt{log T}) O_{p} (1) = o_{p} (log T) \end{matrix}

(A28)

Substituting (A26)–(A28) in (A25), the result follows.

(b) For each

i \in {m_{h}, T - h}

, we can write

y_{i + h} - {\tilde{μ}}_{i + h} (l) = η_{i, h} - \underset{{\tilde{f}}_{i, h} (l, k)}{\underset{︸}{y_{i}^{'} (l + 1) {\tilde{L}}_{i, h} (l) ({\tilde{γ}}_{i}^{*} (l) - γ^{*} (l))}} + \underset{r_{i, h} (l, k)}{\underset{︸}{s_{i}^{'} (k) M_{h} (k) α^{*} (l, k)}}

We then have

\begin{matrix} A P E_{0} (l) & = \sum_{i = m_{h}}^{T - h} {[η_{i, h} - {\tilde{f}}_{i, h} (l, k) + r_{i, h} (l, k)]}^{2} \\ = \sum_{i = m_{h}}^{T - h} η_{i, h}^{2} + \sum_{i = m_{h}}^{T - h} {\tilde{f}}_{i, h}^{2} (l, k) [1 + o_{p} (1)] + \sum_{i = m_{h}}^{T - h} r_{i, h}^{2} (l, k) - 2 \sum_{i = m_{h}}^{T - h} {\tilde{f}}_{i, h} (l, k) r_{i, h} (l, k) + O_{p} (1) \end{matrix}

(A29)

By similar arguments as in (a), we have

\begin{matrix} \sum_{i = m_{h}}^{T - h} {\tilde{f}}_{i, h}^{2} (l, k) & = & σ^{2} h^{2} log T + g_{h} (l) σ^{2} log T + o_{p} (log T) \\ \sum_{i = m_{h}}^{T - h} {\tilde{f}}_{i, h} (l, k) r_{i, h} (l, k) & = & o_{p} (log T) \end{matrix}

(A30)

Substituting (A27) and (A30) in (A29), the result follows. □

Appendix B. Description of Methods

This Appendix provides a detailed description of the forecasting methods compared in the Monte Carlo analysis presented in Section 7 and the empirical analysis presented in Section 8.

Unrestricted Autoregressive Model (Benchmark). The benchmark forecast is calculated from a standard autoregressive model of order

K + 1

estimated by OLS:

y_{t} = β_{0}^{*} + β_{1}^{*} t + α y_{t - 1} + \sum_{j = 1}^{K} α_{j} Δ y_{t - j} + ϵ_{t},

Mallows Selection.Hansen (2010a) demonstrated the validity of the Mallows criterion for selecting between the restricted and unrestricted models when

h = 1

. When the number of lags

k

is known, the criteria for the restricted and unrestricted models are, respectively, given by

\begin{matrix} M_{0} & = T {\tilde{σ}}^{2} + 2 {\hat{σ}}^{2} (p + k) \\ M_{1} & = T {\hat{σ}}^{2} + 2 {\hat{σ}}^{2} (2 + p + k) \end{matrix}

where

{\tilde{σ}}^{2} = T^{- 1} \sum_{t = 1}^{T} {(y_{t} - {\tilde{μ}}_{t})}^{2}

and

σ^{2} = T^{- 1} \sum_{t = 1}^{T} {(y_{t} - {\hat{μ}}_{t})}^{2}

. The Mallows selection estimator picks the restricted model if

M_{0} < M_{1}

and the unrestricted model otherwise. This is equivalent to picking the unrestricted model when

F_{T} = T (\frac{{\tilde{σ}}^{2} - {\hat{σ}}^{2}}{{\hat{σ}}^{2}}) \geq 4

. The Mallows selection forecast can then be expressed as

{\hat{μ}}_{t + h, M} = {\hat{μ}}_{t + h} 1 (F_{T} \geq 4) + {\tilde{μ}}_{t + h} 1 (F_{T} < 4)

. When the number of lags is unknown, the relevant Mallows criteria are obtained as (see Kejriwal and Yu 2021):

\begin{matrix} M_{0} (l) & = T {\tilde{σ}}_{l}^{2} + 2 {\hat{σ}}_{K}^{2} (p + l) \\ M_{1} (l) & = T {\hat{σ}}_{l}^{2} + 2 {\hat{σ}}_{K}^{2} (2 + p + l) \end{matrix}

for

l = 0, 1, \dots, K,

where

{\hat{σ}}_{j}^{2} = T^{- 1} \sum_{t = 1}^{T} {(y_{t} - {\hat{μ}}_{t} (j))}^{2}, j = l, K

and

{\tilde{σ}}_{l}^{2} = T^{- 1} \sum_{t = 1}^{T} {(y_{t} - {\tilde{μ}}_{t} (l))}^{2}

. Then, defining

\tilde{l} = arg {min}_{l \in S} {M_{0} (l)}, \hat{l} = arg {min}_{l \in S} {M_{1} (l)},

where

S = {0, 1, \dots, K},

the Mallows selection forecast is obtained as

{\overset{˘}{μ}}_{t + h, M} = \{\begin{matrix} {\hat{μ}}_{t + h} (\hat{l}), \\ {\tilde{μ}}_{t + h} (\tilde{l}), \end{matrix} \begin{matrix} i f \\ i f \end{matrix} \begin{matrix} min_{l \in S} {M_{1} (l)} \leq min_{l \in S} {M_{0} (l)} \\ min_{l \in S} {M_{1} (l)} > min_{l \in S} {M_{0} (l)} \end{matrix}

Mallows Averaging. As an alternative to Mallows selection, Hansen (2010a) developed the Mallows combination forecast that entails taking a weighted average of the unrestricted and restricted forecasts where the weights are chosen by minimizing a Mallows criterion. When the number of lags is known, the criterion is

M_{w} = \sum_{t = 1}^{T} {(y_{t} - {\hat{μ}}_{t} (w))}^{2} + 2 {\hat{σ}}^{2} (2 w + p + k)

(A31)

with

{\hat{μ}}_{t} (w) = w {\hat{μ}}_{t} + (1 - w) {\tilde{μ}}_{t}

and

{\hat{σ}}^{2} = T^{- 1} \sum_{t = 1}^{T} {(y_{t} - {\hat{μ}}_{t})}^{2}

. The Mallows selected weight

\hat{w}

is derived from minimizing (A31) over

w \in [0, 1]

. The solution is

\hat{w} = \{\begin{matrix} 1 - 2 / F_{T} \\ 0 \end{matrix} \begin{matrix} i f F_{T} > 2 \\ o t h e r w i s e \end{matrix}

The Mallows averaging estimator is then defined as

{\hat{μ}}_{t + h, M} (\hat{w}) = \hat{w} {\hat{μ}}_{t + h} + (1 - \hat{w}) {\tilde{μ}}_{t + h} = \{\begin{matrix} {\tilde{μ}}_{t + h} \\ (1 - \frac{2}{F_{T}}) {\hat{μ}}_{t + h} + \frac{2}{F_{T}} {\tilde{μ}}_{t + h} \end{matrix} \begin{matrix} i f F_{T} \leq 2 \\ o t h e r w i s e \end{matrix}

(A32)

When the number of lags is unknown, Hansen (2010a) considered two alternative Mallows combination forecasts. The first is the so-called partial averaging forecast that averages only over unrestricted forecasts that vary according to the number of first-differenced lags included. With a maximum of

K

lags, this forecast is given by

{\hat{μ}}_{t + h, M} (\hat{W}) = \sum_{l = 0}^{K} {\hat{w}}_{l} {\hat{μ}}_{t + h} (l)

(A33)

where

\hat{W} = {({\hat{w}}_{0}, {\hat{w}}_{1} \dots, {\hat{w}}_{K})}^{'}

minimizes the criterion (with

{\hat{μ}}_{t} (W) = \sum_{l = 0}^{K} w_{l} {\hat{μ}}_{t} (l)

),

M_{P} (W) = \sum_{t = 1}^{T} {(y_{t} - {\hat{μ}}_{t} (W))}^{2} + 2 {\hat{σ}}_{K}^{2} (\sum_{l = 0}^{K} [w_{l} (2 + l + p)])

subject to the restrictions

w_{j} \geq 0 (j = 0, 1, \dots K), \sum_{j = 0}^{K} w_{j} = 1

. The second combination forecast is the so-called general averaging forecast that averages over the forecasts from all

2 (K + 1)

models that include the

(K + 1)

restricted models. This forecast is given by

{\overset{˘}{μ}}_{t + h, M} (\overset{˘}{W}) = \sum_{l = 0}^{K} ({\overset{˘}{w}}_{1 l} {\hat{μ}}_{t + h} (l) + {\overset{˘}{w}}_{0 l} {\tilde{μ}}_{t + h} (l))

(A34)

with

\overset{˘}{W} = {({\overset{˘}{w}}_{00}, {\overset{˘}{w}}_{01}, \dots, {\overset{˘}{w}}_{0 K}, {\overset{˘}{w}}_{10}, {\overset{˘}{w}}_{11}, {\overset{˘}{w}}_{12}, \dots, {\overset{˘}{w}}_{1 K})}^{'}

minimizing the criterion (with

{\overset{˘}{μ}}_{t} (W) = \sum_{l = 0}^{K} (w_{0 l} {\tilde{μ}}_{t} (l) + w_{1 l} {\hat{μ}}_{t} (l))

),

M_{G} (W) = \sum_{t = 1}^{T} {(y_{t} - {\overset{˘}{μ}}_{t} (W))}^{2} + 2 {\hat{σ}}_{K}^{2} (\sum_{l = 0}^{K} [w_{0 l} l + w_{1 l} (2 + l)] + p)

where the weights are non-negative and sum to one:

w_{1 l} \geq 0, w_{0 l} \geq 0, \sum_{l = 0}^{K} (w_{0 l} + w_{1 l}) = 1

. In what follows, we will refer to (A33) and (A34) as the MPA (Mallows Partial Averaging) and MGA (Mallows General Averaging) forecasts, respectively.

Leave-h-out Cross Validation Selection.Hansen (2010b) provided theoretical justification for constructing h-step ahead forecasts using leave-h-out cross validation under the assumption that the data are strictly stationary. For model selection with a known number of lags, let

C V_{0}

and

C V_{1}

denote the cross-validation criteria for the restricted and unrestricted models, respectively. These criteria are computed as

\begin{matrix} C V_{0} & = \sum_{t = k + 1}^{T - h} {(y_{t + h} - {\tilde{μ}}_{t + h}^{(t)})}^{2} \end{matrix}

(A35)

\begin{matrix} C V_{1} & = \sum_{t = k + 1}^{T - h} {(y_{t + h} - {\hat{μ}}_{t + h}^{(t)})}^{2} \end{matrix}

(A36)

where

{\tilde{μ}}_{t + h}^{(t)}

and

{\hat{μ}}_{t + h}^{(t)}

are the restricted and unrestricted leave-h-out forecasts, respectively. Specifically,

{\tilde{μ}}_{t + h}^{(t)}

is obtained using parameter estimates from the restricted model after leaving out the observations

{t + 1, \dots, t + h}

9:

Δ y_{j} = β_{0}^{*} + \sum_{s = 1}^{k} α_{k} Δ y_{j - s} + ϵ_{j}, j \neq t + 1, \dots, t + h

Similarly,

{\hat{μ}}_{t + h}^{(t)}

is obtained from estimating the unrestricted model after leaving out the observations

{t + 1, \dots, t + h}

:

y_{j} = β_{0}^{*} + β_{1}^{*} j + α y_{j - 1} + \sum_{s = 1}^{k} α_{k} Δ y_{j - s} + ϵ_{j}, j \neq t + 1, \dots, t + h

Then the cross-validation based forecast is

{\hat{μ}}_{t + h, C V} = {\hat{μ}}_{t + h} 1 (C V_{1} \leq C V_{0}) + {\tilde{μ}}_{t + h} 1 (C V_{1} > C V_{0}) .

When the number of lags is unknown, the cross-validation criterion is computed for each of the

2 (K + 1)

possible models and the selected forecast, denoted

{\overset{˘}{μ}}_{t + h, C V}

, is the one that corresponds to the model with the minimum value of this criterion.

Leave-h-out Cross Validation Averaging. When the number of lags is known, the cross validation weights

(\hat{w}, 1 - \hat{w})

are obtained by minimizing the criterion

C V_{w} = \sum_{t = k + 1}^{T - h} {\{w (y_{t + h} - {\hat{μ}}_{t + h}^{(t)}) + (1 - w) (y_{t + h} - {\tilde{μ}}_{t + h}^{(t)})\}}^{2}

and the resulting forecast is

{\hat{μ}}_{t + h, C V} (\hat{w}) = \hat{w} {\hat{μ}}_{t + h} + (1 - \hat{w}) {\tilde{μ}}_{t + h}

. When the number of lags is unknown, the partial combination forecast that only combines the unrestricted forecasts with different lags is obtained as

{\hat{μ}}_{t + h, C V} (\hat{W}) = \sum_{l = 0}^{K} {\hat{w}}_{l} {\hat{μ}}_{t + h} (l)

(A37)

where

\hat{W} = {({\hat{w}}_{0}, {\hat{w}}_{1} \dots, {\hat{w}}_{K})}^{'}

minimizes the criterion

C V_{P} (W) = \sum_{t = k + 1}^{T - h} {\{\sum_{l = 0}^{K} w_{l} (y_{t + h} - {\hat{μ}}_{t + h}^{(t)} (l))\}}^{2}

(A38)

subject to the restrictions

w_{j} \geq 0 (j = 0, 1, \dots, K), \sum_{j = 0}^{K} w_{j} = 1

, and

{\hat{μ}}_{t + h}^{(t)} (l)

is the unrestricted leave-h-out forecast assuming

l

first-differenced lags. As with weight selection using the Mallows criterion, we also construct a general combination forecast that combines forecasts from the

K + 1

unrestricted models as well as the

K + 1

restricted models. This forecast is given by

{\overset{˘}{μ}}_{t + h, C V} (\overset{˘}{W}) = \sum_{l = 0}^{K} ({\overset{˘}{w}}_{1 l} {\hat{μ}}_{t + h} (l) + {\overset{˘}{w}}_{0 l} {\tilde{μ}}_{t + h} (l))

(A39)

with

\overset{˘}{W} = {({\overset{˘}{w}}_{01}, {\overset{˘}{w}}_{02}, \dots, {\overset{˘}{w}}_{0 K}, {\overset{˘}{w}}_{11}, {\overset{˘}{w}}_{12}, \dots, {\overset{˘}{w}}_{1 K})}^{'}

minimizing the criterion

C V_{G} (W) = \sum_{t = k + 1}^{T - h} {\{\sum_{l = 0}^{K} [w_{1 l} (y_{t + h} - {\hat{μ}}_{t + h}^{(t)} (l)) + w_{0 l} (y_{t + h} - {\tilde{μ}}_{t + h}^{(t)} (l))]\}}^{2}

where

w_{1 l} \geq 0, w_{0 l} \geq 0, \sum_{l = 0}^{K} (w_{0 l} + w_{1 l}) = 1, {\hat{μ}}_{t + h}^{(t)} (l)

is as defined in (A38) and

{\tilde{μ}}_{t + h}^{(t)} (l)

is the restricted leave-h-out forecast assuming

l

first-differenced lags. In what follows, we will refer to (A37) and (A39) as the CPA (Cross-Validation Partial Averaging) and CGA (Cross-Validation General Averaging) forecasts, respectively.

APE Selection. With a known number of lags, this forecast is computed from the model that corresponds to the lower APE between the restricted and unrestricted models:

\begin{matrix} {\hat{μ}}_{t + h, S} & = {\tilde{μ}}_{t + h} I (A P E_{0} \leq A P E_{1}) + {\hat{μ}}_{t + h} I (A P E_{0} > A P E_{1}) \\ A P E_{0} & = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\tilde{μ}}_{i + h}\}}^{2}, A P E_{1} = \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\hat{μ}}_{i + h}\}}^{2} \end{matrix}

In the unknown lags case, the forecast is computed from the model that minimizes the APE criterion among all

2 (K + 1)

possible models, comprising the

K + 1

restricted and

K + 1

unrestricted models.

Appendix C. Additional Simulation Results

This Appendix shows simulation evidence to justify the choice of

m_{h} = 20

. Figure A1 and Figure A2 plot the difference between the optimal forecast risk (a grid from 20 to 70 in increments of 5) and the risk of the APE selection and APE averaging forecasts respectively for

m_{h} = 20

expressed as a percentage of the forecast risk for

m_{h} = 20

.

Figure A1. Forecast risk with optimal

m_{h}

and

m_{h} = 20

, APE selection

(p = 1)

.

Figure A1. Forecast risk with optimal

m_{h}

and

m_{h} = 20

, APE selection

(p = 1)

.

Figure A2. Forecast risk with optimal

m_{h}

and

m_{h} = 20

, APE averaging

(p = 1)

.

Figure A2. Forecast risk with optimal

m_{h}

and

m_{h} = 20

, APE averaging

(p = 1)

.

Notes

1	Analytically, the importance of the trend component over long horizons can be seen by noting that the trend/drift coefficient is multiplied by the forecast horizon when constructing forecasts so that any specification/estimation error is magnified linearly as the forecast horizon increases (Sampson 1991).
2	The conclusion for the subsequent analysis will not be affected as long as the initial observations are $o_{p} (T^{1 / 2})$ .
3	The figure was obtained by simulating the AMSE and AFR assuming i.i.d. normal errors with $T = 1000 .$ 5000 replications were used.
4	To illustrate the lack of uniformity, consider the case $p = 1$ with $k = 0$ . Using the same arguments as in the proof of Theorem 2 of Yu et al. (2012), it follows that, for any finite $c \leq 0, \sum_{i = m_{h}}^{T - h} {\{y_{i + h} - {\hat{μ}}_{i + h}\}}^{2} = E (T_{10}^{2}) log T + o_{p} (log T),$ where $E (T_{10}^{2}) = 6$ . The lack of uniformity follows since $E (T_{1 c}^{2}) \neq E (T_{10}^{2})$ for any $c < 0$ .
5	This figure was obtained using the same method as in Figure 1.
6	We do not report the results for the selection forecasts, since their performance relative to the combination forecasts is qualitatively similar to the known lag-order case. The results are nevertheless available upon request.
7	This choice was also adopted by Ing and Yang (2014) in their Monte Carlo analysis of forecasting using autoregressive models with positive-valued errors.
8	The data set is publicly available for download at https://research.stlouisfed.org/econ/mccracken/fred-databases/ (accessed on 1 February 2023).
9	Hansen (2010b) instead left out the $2 h - 1$ observations ${t - h + 1, \dots, t, t + 1, \dots, t + h - 1}$ . The difference emanates from the fact that he constructs direct forecasts while our forecasts are constructed iteratively which exploit the autoregressive structure and hence necessitate leaving out only the $h$ observations ${t + 1, \dots, t + h}$ .

References

Bates, John M., and Clive W. J. Granger. 1969. The combination of forecasts. Journal of the Operational Research Society 20: 451–68. [Google Scholar] [CrossRef]
Box, George E. P., and Gwilym M. Jenkins. 1970. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day. [Google Scholar]
Cheng, Xu, and Bruce E. Hansen. 2015. Forecasting with factor-augmented regression: A frequentist model averaging approach. Journal of Econometrics 186: 280–93. [Google Scholar] [CrossRef]
Chow, Yuan Shih. 1965. Local convergence of martingales and the law of large numbers. The Annals of Mathematical Statistics 36: 552–58. [Google Scholar] [CrossRef]
Clements, Michael P., and David F. Hendry. 2001. Forecasting with difference-stationary and trend-stationary models. The Econometrics Journal 4: 1–19. [Google Scholar]
Diebold, Francis X., and Lutz Kilian. 2000. Unit-root tests are useful for selecting forecasting models. Journal of Business & Economic Statistics 18: 265–73. [Google Scholar]
Diebold, Francis X., and Roberto S. Mariano. 1995. Comparing predictive accuracy. Journal of Business and Economic Statistics 13: 253–63. [Google Scholar]
Elliott, Graham. 2006. Unit Root Pre-Testing and Forecasting. Technical Report, Working Paper. San Diego: UCSD. [Google Scholar]
Gallant, A. Ronald. 1981. On the bias in flexible functional forms and an essentially unbiased form: The fourier flexible form. Journal of Econometrics 15: 211–45. [Google Scholar] [CrossRef]
Gonzalo, Jesus, and Jean-Yves Pitarakis. 1998. On the exact moments of asymptotic distributions in an unstable ar(1) with dependent errors. International Economic Review 39: 71–88. [Google Scholar] [CrossRef]
Granger, Clive W. J. 1966. The typical spectral shape of an economic variable. Econometrica 34: 150–61. [Google Scholar] [CrossRef]
Hamilton, James Douglas. 1994. Time Series Analysis. Princeton: Princeton University Press. [Google Scholar]
Hansen, Bruce E. 2007. Least squares model averaging. Econometrica 75: 1175–89. [Google Scholar] [CrossRef]
Hansen, Bruce E. 2008. Least-squares forecast averaging. Journal of Econometrics 146: 342–50. [Google Scholar] [CrossRef]
Hansen, Bruce E. 2010a. Averaging estimators for autoregressions with a near unit root. Journal of Econometrics 158: 142–55. [Google Scholar] [CrossRef]
Hansen, Bruce E. 2010b. Multi-step forecast model selection. Paper presented at the 20th Annual Meetings of the Midwest Econometrics Group, St. Louis, MO, USA, October 1–2. [Google Scholar]
Ing, Ching-Kang. 2001. A note on mean-squared prediction errors of the least squares predictors in random walk models. Journal of Time Series Analysis 22: 711–24. [Google Scholar] [CrossRef]
Ing, Ching-Kang. 2003. Multistep prediction in autoregressive processes. Econometric Theory 19: 254–79. [Google Scholar] [CrossRef]
Ing, Ching-Kang. 2004. Selecting optimal multistep predictors for autoregressive processes of unknown order. The Annals of Statistics 32: 693–722. [Google Scholar] [CrossRef][Green Version]
Ing, Ching-Kang, Jin-Lung Lin, and Shu-Hui Yu. 2009. Toward optimal multistep forecasts in non-stationary autoregressions. Bernoulli 15: 402–37. [Google Scholar] [CrossRef]
Ing, Ching-Kang, Chor-yiu Sin, and Shu-Hui Yu. 2012. Model selection for integrated autoregressive processes of infinite order. Journal of Multivariate Analysis 106: 57–71. [Google Scholar] [CrossRef]
Ing, Ching-Kang, and Chiao-Yi Yang. 2014. Predictor selection for positive autoregressive processes. Journal of the American Statistical Association 109: 243–53. [Google Scholar] [CrossRef]
Kejriwal, Mohitosh, and Xuewen Yu. 2021. Generalized forecast averaging in autoregressions with a near unit root. The Econometrics Journal 24: 83–102. [Google Scholar] [CrossRef]
Kim, Hyun Hak, and Norman R. Swanson. 2018. Mining big data using parsimonious factor, machine learning, variable selection and shrinkage methods. International Journal of Forecasting 34: 339–54. [Google Scholar] [CrossRef]
Liao, Jen-Che, and Wen-Jen Tsay. 2020. Optimal multistep var forecast averaging. Econometric Theory 36: 1099–126. [Google Scholar] [CrossRef]
Mallows, Colin L. 2000. Some comments on cp. Technometrics 42: 87–94. [Google Scholar]
Masini, Ricardo P., Marcelo C. Medeiros, and Eduardo F. Mendes. 2023. Machine learning advances for time series forecasting. Journal of Economic Surveys 37: 76–111. [Google Scholar] [CrossRef]
McCracken, Michael W., and Serena Ng. 2016. Fred-md: A monthly database for macroeconomic research. Journal of Business & Economic Statistics 34: 574–89. [Google Scholar]
Medeiros, Marcelo C., Gabriel F. R. Vasconcelos, Álvaro Veiga, and Eduardo Zilberman. 2021. Forecasting inflation in a data-rich environment: The benefits of machine learning methods. Journal of Business & Economic Statistics 39: 98–119. [Google Scholar]
Meng, Xiao-Li. 2005. From unit root to stein’s estimator to fisher’sk statistics: If you have a moment, i can tell you more. Statistical Science 20: 141–62. [Google Scholar] [CrossRef]
Ng, Serena, and Timothy Vogelsang. 2002. Forecasting autoregressive time series in the presence of deterministic components. The Econometrics Journal 5: 196–224. [Google Scholar] [CrossRef]
Phillips, Peter C. B. 2014. On confidence intervals for autoregressive roots and predictive regression. Econometrica 82: 1177–95. [Google Scholar] [CrossRef]
Rissanen, Jorma. 1986. Order estimation by accumulated prediction errors. Journal of Applied Probability 23: 55–61. [Google Scholar] [CrossRef]
Sampson, Michael. 1991. The effect of parameter uncertainty on forecast variances and confidence intervals for unit root and trend stationary time-series models. Journal of Applied Econometrics 6: 67–76. [Google Scholar] [CrossRef]
Stock, James H., and Mark W. Watson. 2002. Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics 20: 147–62. [Google Scholar]
Stock, James H., and Mark W. Watson. 2005. An Empirical Comparison of Methods for Forecasting Using Many Predictors. Manuscript. Princeton: Princeton University, p. 46. [Google Scholar]
Stock, James H., and Mark W. Watson. 2006. Forecasting with many predictors. Handbook of Economic Forecasting 1: 515–54. [Google Scholar]
The MathWorks Inc. 2022. MATLAB version: 9.13.0 (R2022b) Natick, Massachusetts: The MathWorks Inc. Available online: https://www.mathworks.com (accessed on 1 February 2023).
Tu, Yundong, and Yanping Yi. 2017. Forecasting cointegrated nonstationary time series with time-varying variance. Journal of Econometrics 196: 83–98. [Google Scholar] [CrossRef]
Turner, John L. 2004. Local to unity, long-horizon forecasting thresholds for model selection in the ar (1). Journal of Forecasting 23: 513–39. [Google Scholar] [CrossRef]
Wang, Xiaoqian, Rob J. Hyndman, Feng Li, and Yanfei Kang. 2022. Forecast combinations: An over 50-year review. International Journal of Forecasting 39: 1518–47. [Google Scholar] [CrossRef]
Wei, Ching-Zong. 1987. Adaptive prediction by least squares predictors in stochastic regression models with applications to time series. The Annals of Statistics 15: 1667–82. [Google Scholar] [CrossRef]
Yu, Shu-Hui, Chien-Chih Lin, and Hung-Wen Cheng. 2012. A note on mean squared prediction error under the unit root model with deterministic trend. Journal of Time Series Analysis 33: 276–86. [Google Scholar] [CrossRef]
Zhang, Xinyu, Alan T. K. Wan, and Guohua Zou. 2013. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics 174: 82–94. [Google Scholar] [CrossRef]

Figure 1. In-sample AMSE versus asymptotic forecast risk (

p = 0

,

k = 0

).

Figure 1. In-sample AMSE versus asymptotic forecast risk (

p = 0

,

k = 0

).

Figure 2. Asymptotic forecast risk of infeasible (optimal) and feasible (APE-based) combination forecasts (

p = 1, k = 0

).

Figure 2. Asymptotic forecast risk of infeasible (optimal) and feasible (APE-based) combination forecasts (

p = 1, k = 0

).

Figure 3. (a). Forecast risk with known lag order (

k = 0, T = 100

). (b). Forecast risk with known lag order (

k = 0, T = 200

).

Figure 3. (a). Forecast risk with known lag order (

k = 0, T = 100

). (b). Forecast risk with known lag order (

k = 0, T = 200

).

Figure 4. (a). Forecast risk with known lag order (

k = 6, T = 100

). (b). Forecast risk with known lag order (

k = 6, T = 200

).

Figure 4. (a). Forecast risk with known lag order (

k = 6, T = 100

). (b). Forecast risk with known lag order (

k = 6, T = 200

).

Figure 5. (a). Forecast risk with known lag order (

k = 12, T = 100

). (b). Forecast risk with known lag order (

k = 12, T = 200

).

Figure 5. (a). Forecast risk with known lag order (

k = 12, T = 100

). (b). Forecast risk with known lag order (

k = 12, T = 200

).