Selecting a Model for Forecasting

Abstract: We investigate forecasting in models that condition on variables for which future values are unknown. We consider the role of the significance level because it guides the binary decisions whether to include or exclude variables. The analysis is extended by allowing for a structural break, either in the first forecast period or just before. Theoretical results are derived for a three-variable static model, but generalized to include dynamics and many more variables in the simulation experiment. The results show that the trade-off for selecting variables in forecasting models in a stationary world, namely that variables should be retained if their noncentralities exceed unity, still applies in settings with structural breaks. This provides support for model selection at looser than conventional settings, albeit with many additional features explaining the forecast performance, and with the caveat that retaining irrelevant variables that are subject to location shifts can worsen forecast performance.


Introduction
There are many approaches to formulating models when the sole objective is forecasting, from the very parsimonious through to large systems. However, there is little agreement on which performs best on a forecasting criterion: see Makridakis and Hibon (2000) and Fildes and Ord (2002) for evidence from forecast competitions. Clements and Hendry (2001) suggest that this lack of agreement is the result of intermittent distributional shifts that affect alternative formulations in different ways. We address this puzzle by analysing the selection of models in the pursuit of optimal mean square forecast error (MSFE) in settings with structural breaks. We focus on regression models that are linear in the parameters, and consider model selection that is controlled by the nominal significance level of the selection tests. Loose significance levels have been shown to be optimal for selecting regression models for stationary processes when evaluated on one-step-ahead MSFE. Shibata (1980) showed that the Akaike information criterion (AIC, see Akaike 1973) is an asymptotically efficient selection method when the data generating process (DGP) is an infinite-order process; also see Ing and Wei (2003). Many other criteria have been proposed that aim to have optimal properties in certain settings, but information criteria alone are not a sufficient principle for selecting models as they do not ensure congruence, so a misspecified model could be selected: see Bontemps and Mizon (2003). We explore general-to-specific (Gets) model selection in the simulation exercise to narrow down the class of forecasting models to undominated models. This yields well-specified encompassing models in sample, although nonstationarities may preclude those benefits continuing over the forecast horizon.
The theoretical analysis commences with a bivariate conditional model that is part of a three-variable system in which the selection decision is whether to retain or exclude one of the regressors. This is empirically relevant as demonstrated by UK inflation, where autoregressive (AR) forecasting models are augmented with the unemployment rate. The bivariate model is analysed first in a stationary setting. This is extended to nonstationary settings where location shifts occur at or near the forecast origin. The static setting still requires forecasts of the conditioning variables, and alternative forecasting devices are considered, including the two extremes of the class of robust forecasting devices proposed by , the sample mean and the random walk. The results confirm that regressors should be retained for forecasting if their noncentralities exceed unity, regardless of whether or not there is a structural break, or of the forecasting device used. These analytic results map to a selection significance level of 16% in the bivariate case, much looser than the conventional significance levels used. The results closely match those of AIC, which can be interpreted as a likelihood ratio χ 2 test for a pair of nested models with one degree of freedom and a penalty of two, and also gives a significance level of approximately 16%: see Pötscher (1991) and Leeb and Pötscher (2009).
A key source of forecast failure is an induced shift in the equilibrium mean of the variable being forecast, irrespective of whether those conditioning variables are included in the forecasting model; see the taxonomy in Hendry and Mizon (2012). Consequently, the simulation exercise evaluates a wide range of settings including larger models, break types and magnitudes at or near the forecast origin, and the method of forecasting. We consider a range of significance levels from the very tight (0.001), eliminating almost all potentially irrelevant variables, to the very loose (0.50), enabling retention of relevant variables even if they are only marginally significant. The results enable evaluation of the costs of either omitting relevant variables or incorrectly retaining irrelevant variables. Overall, the results support looser than conventional significance levels for selecting forecasting models, with a 10% target significance level often producing superior forecasts.
This paper is structured as follows. Section 2 outlines the aims of this paper, then Section 3 formulates the model framework that is analysed. Section 4 considers the choice of selection significance level for forecasting in a stationary DGP. Section 5 analyses selection in a nonstationary DGP where a location shift occurs out of sample in one of the regressors, and investigates the consequences of that variable's inclusion or exclusion in the forecasting model. Section 6 considers the impacts on selection of in-sample shifts using different forecasting devices. The analytic results are summarized in Section 7. Sections 8 and 9 present simulation design and evidence on the performance of the various approaches, examining the preferred significance level to minimize MSFE across experimental designs. Section 10 concludes this paper. Appendix A provides analytical calculations and Supplementary Tables are given in Appendix B.

Empirical Motivation
An empirical example of inflation forecasting motivates our interest in structural breaks and their roles in forecast accuracy and the selection of regressors. Two popular models within this large literature include single-equation forecasting models based on past inflation and so-called 'Phillips curve forecasts'. The former usually consist of univariate models such as autoregressive integrated moving average (ARIMA) models. In the latter, the univariate model is augmented with an activity variable such as the unemployment rate or output gap; see Stock and Watson (2009).
The framework considered below, although static, can be applied to these two models where the econometrician wishes to determine whether to augment a univariate forecasting model with the contemporaneous unemployment rate. This 'exogenous' variable is subject to breaks in the form of location shifts, which may occur at or near the forecast horizon. Figure 1 records the quarterly observations on the annual percentage inflation in the UK consumer price index, π t , and the UK unemployment rate as a percentage, U t , along with a broken mean obtained by step indicator saturation (SIS) at a nominal significance level α = 0.1%. The analytics derived below correspond to a Phillips curve formulation (model M 1 ), a univariate AR model (M 2 ) and selection applied to the unemployment rate using a significance level of 0.16 (M 3 ). Using model-specific coefficients µ, β i , γ i and error term ν i , the three models are: where ∆π t = π t − π t−1 . Selection using Autometrics at α = 0.16 is denoted by * , e.g., γ * 0 = 0 implies that the contemporaneous unemployment rate is not selected. Dynamics are included to account for any autocorrelation. The forecasting models are estimated over the period 2000Q1-2013Q4, producing one-quarter-ahead inflation forecasts for the period 2014Q1-2017Q4 evaluated on MSFE. Selection at 16% results in U t−1 being retained, with a p-value of 0.149, so it would not be retained under a commonly used 5% significance level. Longer lags of the unemployment rate were not retained. Table 1 reports the square root of the MSFEs (RMSFE) for one-step-ahead forecasts over the sample that was held back. Three cases are considered corresponding to the analytics below: (a) known U t , (b) forecast U t using the in-sample mean, and (c) forecast U t using U t−1 . Method (a) is infeasible; method (c) is the random walk forecast. When U t is known, model M 3 outperforms M 1 and M 2 , although the differences are not statistically significant.
As this is infeasible, the random walk forecast combined with selection matches the RMSFE of knowing U t . This shows that selection can be beneficial. The next four sections formalize the framework to establish the optimal significance level for selection.

The Analytic Design
In this section, we specify the analytic design, consisting of a three-variable DGP and two different models for that DGP. In later sections, we introduce a third model that involves selection. Together, these mimic the models M 1 , M 2 , and M 3 that were introduced above.
The DGP is a static vector autoregression (VAR) for variables y, x 1 , x 2 with coefficients β i , µ i and error terms ε, η 1 , η 2 structured as:

$$y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \epsilon_t, \qquad x_{1,t} = \mu_1 + \eta_{1,t}, \qquad x_{2,t} = \mu_2 + \eta_{2,t}. \tag{1}$$

Using y t = (y t : x 1,t : x 2,t ) and µ = (µ y : µ 1 : µ 2 ), assuming normality, we can write (1) as:

$$\mathbf{y}_t \sim \mathsf{IN}_3[\boldsymbol{\mu}, \boldsymbol{\Sigma}]. \tag{2}$$

IN 3 denotes a three-dimensional independent normal distribution, here with mean µ and variance Σ. Without loss of generality we set the variance of x 1 and x 2 to one, V[x i,t ] = σ 2 ii = 1, and the correlation between x 1 and x 2 to ρ:

$$\mathsf{V}[x_{1,t}] = \mathsf{V}[x_{2,t}] = 1, \qquad \mathsf{C}[x_{1,t}, x_{2,t}] = \rho. \tag{3}$$

Unless otherwise noted, Figures 2-8 use the following parameter values in calculations: β 0 = 5, β 1 = 1, σ 2 = 1, µ 1 = µ 2 = 2, ρ = 0.5, M = 10^5, T = 50 and (when there is a break in µ 2 ) δ = 4. Although a static DGP may seem restrictive, the main role of adding dynamics to this three-variable VAR would be to slow adjustments to location shifts. Such dynamics are considered in the simulation exercise in Section 9. The analytic design ensures the assumptions required for valid application of a single t-test are satisfied. In practice, selection from a carefully designed general model including long lags and saturation estimators should deliver approximately martingale-difference normal residuals. While it may be more intuitive to lag the exogenous regressors in the DGP for forecasting purposes, none of the results would change. The current setup naturally leads to analyses of the forecasting models for the contemporaneous exogenous regressors, allowing a comparison of alternative devices and an assessment of open models, see Hendry and Mizon (2012).
Throughout, we assume that the sampling variation of estimates of µ i can be neglected, and use the population values to focus on the impacts of location shifts. Then (1) implies E[y t ] = µ y = β 0 + β 1 µ 1 + β 2 µ 2 with: Considering the conditional model (4), we compare M 1 , which includes both weakly exogenous regressors, and M 2 , which excludes x 2 :

$$\mathsf{M}_1: \; y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \epsilon_t, \qquad \mathsf{M}_2: \; y_t = \phi_0 + \gamma_1 x_{1,t} + \nu_t,$$

where Appendix A.1 summarises φ 0 , γ 1 , ν t and σ 2 ν . The choice between M 1 and M 2 will depend on a test of significance of x 2,t . The usual Student's t-statistic for β 2 is

$$t_{\beta_2 = 0} = \frac{\hat{\beta}_2}{\mathsf{SE}[\hat{\beta}_2]} \sim t(T - k, \psi_\beta),$$

where $t(T - k, \psi_\beta)$ indicates a singly noncentral Student's t-distribution with ψ β nonzero under the alternative hypothesis. Here, T − k is the degrees of freedom, and

$$\psi_\beta^2 = \frac{T \beta_2^2 (1 - \rho^2)}{\sigma_\epsilon^2} \tag{7}$$

is the squared noncentrality parameter under the alternative.
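To make the link between the noncentrality and the coefficient on x 2 concrete, the following minimal sketch computes one from the other. It assumes the squared noncentrality takes the form ψ² = Tβ₂²(1 − ρ²)/σ², which is what the unit-variance, correlation-ρ design implies for the t-test on β 2 ; the function names are hypothetical and chosen only for illustration.

```python
import math


def squared_noncentrality(beta2, rho, T, sigma2=1.0):
    """Assumed form: psi^2 = T * beta2^2 * (1 - rho^2) / sigma^2.

    Relies on unit-variance regressors with correlation rho, as in the
    static design described in the text.
    """
    return T * beta2 ** 2 * (1.0 - rho ** 2) / sigma2


def beta2_for_noncentrality(psi2, rho, T, sigma2=1.0):
    """Invert the assumed formula: the non-negative beta2 giving a target psi^2."""
    return math.sqrt(psi2 * sigma2 / (T * (1.0 - rho ** 2)))


if __name__ == "__main__":
    # Coefficient that sits exactly at the retain/omit boundary psi^2 = 1
    # for the baseline design values rho = 0.5 and T = 50.
    b2 = beta2_for_noncentrality(1.0, rho=0.5, T=50)
    print("beta2 at psi^2 = 1:", b2)
    print("round trip psi^2:", squared_noncentrality(b2, rho=0.5, T=50))
```

Under these assumptions a fairly small coefficient is enough to reach the boundary ψ² = 1 at T = 50, which is why marginally significant regressors matter for the selection decision.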

Selection in a Stationary DGP
We start by analysing the forecast errors of the two models that were introduced, denoted M 1 and M 2 , in the absence of breaks. The analysis is then augmented in Section 4.2 by introducing selection of regressors in M 3 , and the influence of the significance level on the selection decision in Section 4.3. In this section, we assume that there are no breaks in the DGP.

Known Future Values of Regressors
The one-step-ahead forecast errors from M 1 are denoted $\hat{\epsilon}_{T+1|T}$ and those from M 2 $\hat{\nu}_{T+1|T}$. The mean square forecast errors are written as MSFE 1 and MSFE 2 respectively. We look at the conditions for MSFE 2 ≤ MSFE 1 . An estimated intercept is always retained, which maintains comparability between M 1 and M 2 .
When there are no breaks, the parameter estimates for M 1 are unbiased, $\mathsf{E}[\hat{\epsilon}_{T+1|T}] = 0$, so:

$$\mathsf{MSFE}_1 = \mathsf{E}[\hat{\epsilon}_{T+1|T}^2] = \sigma_\epsilon^2 \left(1 + 3T^{-1}\right), \tag{8}$$

which is the unconditional MSFE formula for the impact of estimating 3 parameters under the assumption of correct model specification. For M 2 , despite the misspecification when β 2 ≠ 0, $\mathsf{E}[\hat{\nu}_{T+1|T}] = 0$ and the mean square forecast error is:

$$\mathsf{MSFE}_2 = \mathsf{E}[\hat{\nu}_{T+1|T}^2] = \sigma_\nu^2 \left(1 + 2T^{-1}\right), \tag{9}$$

where $\sigma_\nu^2 = \sigma_\epsilon^2 (1 + T^{-1}\psi_\beta^2) \geq \sigma_\epsilon^2$. There is one less parameter to estimate, traded off against a larger equation variance (see Appendix A.2 for derivations).
The results confirm that x 2 should be retained if its noncentrality exceeds approximately 1. The result converges to 1 as T → ∞, because the information content of the regressor outweighs the parameter estimation cost for one-step forecasts, regardless of the correlation between x 1 and x 2 .
[Figure: MSFE 1 (line computed from (8), circles by simulation) and MSFE 2 (dashed line computed from (9), squares by simulation).]
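The retain-or-omit boundary can be checked numerically. The sketch below is illustrative only: it assumes the approximations MSFE 1 ≈ σ²(1 + 3/T) for the three-parameter model and MSFE 2 ≈ σ²(1 + ψ²/T)(1 + 2/T) for the model omitting x 2 , which are consistent with the description above but are assumptions here, and all function names are hypothetical.

```python
def msfe_full(T, sigma2=1.0):
    """Assumed approximation for M1: three estimated parameters."""
    return sigma2 * (1.0 + 3.0 / T)


def msfe_restricted(T, psi2, sigma2=1.0):
    """Assumed approximation for M2: two parameters, inflated error variance."""
    sigma2_nu = sigma2 * (1.0 + psi2 / T)
    return sigma2_nu * (1.0 + 2.0 / T)


def cutoff_psi2(T):
    """psi^2 at which the two assumed MSFEs coincide.

    Solving (1 + psi2/T)(1 + 2/T) = 1 + 3/T gives psi2 = T / (T + 2).
    """
    return T / (T + 2.0)


if __name__ == "__main__":
    for T in (25, 50, 100, 1000):
        c = cutoff_psi2(T)
        gap = msfe_restricted(T, c) - msfe_full(T)
        print(f"T={T:5d}  cutoff psi^2={c:.4f}  MSFE gap at cutoff={gap:.2e}")
```

Under these assumed forms the switch point lies just below one in finite samples and approaches one as T grows, in line with the statement that the result converges to 1 as T → ∞.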

Selecting Regressors
Although M 1 and M 2 provide the extremes of always/never retaining x 2 , in practice, selection will be applied. From (5), x 2,t will be omitted if t 2 β 2 =0 < c 2 α . Using the approximation that: Thus, retention of x 2,t will depend on α and ψ 2 β for a given draw. Forecasts in repeated sampling will be based on a mixture of M 1 and M 2 depending on whether x 2,t is retained in each draw. The MSFE of the selected model, called M 3 , will be a weighted average of the MSFEs of M 1 and M 2 , with the weights given by the probability that x 2,t is retained: where ψ 2 β is given by (7), with: From the last term in (13), it is clear that MSFE 3 ≤ MSFE 1 whenever ψ 2 β ≤ 1. Moreover, p α ψ β will be low when ψ 2 β ≤ 1, so M 2 will usually be selected. Note that p α ψ β = α when β 2 = 0. However, MSFE 3 is a highly nonlinear function of ψ 2 β entering directly and indirectly, as well as of α which also influences p α ψ β nonlinearly. Figure 3 records the ratio of MSFE 3 to MSFE 1 , for a range of ψ 2 β , which from (13) is given by: Selection delivers a 1.8% improvement in MSFE relative to M 1 under the null when ψ 2 β = 0 with α = 0.05 or tighter, but for looser α, e.g., at 0.5, p α ψ β = 0.5 when x 2,t is irrelevant so the benefits of selection are halved. Selection is most costly at intermediate noncentralities under the alternative, where, e.g., the largest increase in MSFE relative to M 1 is 3% at α = 0.05 for T = 50, but is over 9% for α = 0.001 at its peak. The hump shape reflects the nonlinear trade-off as the noncentrality of x 2,t increases from the cost of omitting x 2,t rising as its signal is stronger, but the probability of retaining x 2,t also increases. While the magnitude of the maximal loss may seem small for intermediate values of α, this example considers the selection of just one regressor. 
In practice, selection is applied when there are multiple potential regressors, and the loss associated with selection at a given significance level is cumulated across all potential regressors, as seen in the simulation results below.
The selection rule that x 2,t should be retained if ψ 2 β > 1 is evident ∀α, but unfortunately the forecaster does not know ψ 2 β . If it were known, the optimal α would be 0 for ψ 2 β < 1 and 1 for ψ 2 β > 1. We next look at the choice of α that minimizes the cost in terms of MSFE for an unknown ψ 2 β .
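The mixture argument for the selected model can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact calculation: the retention probability uses a normal approximation to the noncentral t-test, the MSFE forms are the approximations σ²(1 + 3/T) and σ²(1 + ψ²/T)(1 + 2/T), and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def retention_probability(alpha, psi):
    """P(|Z + psi| > c_alpha) with Z standard normal (normal approximation)."""
    c = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    return (1.0 - _STD_NORMAL.cdf(c - psi)) + _STD_NORMAL.cdf(-c - psi)


def msfe_ratio_selected(alpha, psi2, T):
    """Ratio MSFE3 / MSFE1 when MSFE3 is a probability-weighted mixture."""
    p = retention_probability(alpha, psi2 ** 0.5)
    msfe1 = 1.0 + 3.0 / T                        # assumed form, sigma^2 = 1
    msfe2 = (1.0 + psi2 / T) * (1.0 + 2.0 / T)   # assumed form, sigma^2 = 1
    return p + (1.0 - p) * msfe2 / msfe1


if __name__ == "__main__":
    T = 50
    for alpha in (0.001, 0.05, 0.16, 0.5):
        row = [msfe_ratio_selected(alpha, psi2, T) for psi2 in (0, 1, 4, 9, 16)]
        print(alpha, [round(r, 3) for r in row])
```

With these assumptions the ratio at ψ² = 0, α = 0.05 and T = 50 is about 0.982, in line with the 1.8% improvement quoted above, and the hump at intermediate noncentralities becomes more pronounced as α tightens.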

The Choice of Significance Level
Equation (11) must hold for x 2 to be excluded at the chosen significance level. On average, that inequality requires: assuming unbiasedness. Equating that inequality for β 2 2 with ψ 2 β < 1 from (10) gives the boundary for the critical value c α in which selection results in a smaller MSFE due to the omission-estimation trade-off: This implies that c 2 α = 2 at the boundary, or an approximate significance level of α = 0.16. The theoretical probability of retaining x 2 for β 2 > 0 at α = 0.16 using E[t This gives the retention probabilities recorded in Table 2.
These results are close to the implied significance level for the AIC in Campos et al. (2003). This can have a cumulative effect, as shown in Figure 4 which records values of the term 1 − p α ψ β where there are five independent regressors, all with the same ψ 2 β . The probability of retaining all five variables is low even at loose significance levels unless the noncentralities are large. At ψ 2 β = 9 the gap between α = 0.05 and α = 0.16 is 29%, demonstrating large benefits to a looser significance level for the retention of relevant regressors. The trade-off is that more irrelevant variables will be retained, and this can be costly if those variables are subject to breaks, which we next explore.
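The mapping from a squared critical value of 2 to a significance level of roughly 16%, and the cumulative effect across several regressors, can be reproduced with a short calculation. The sketch assumes a normal approximation to the t-test (so the squared statistic is treated as χ² with one degree of freedom) and independent regressors sharing the same noncentrality; the function names are hypothetical, and the printed values need not match the figure exactly because the paper may use exact t critical values.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sided_alpha(c2):
    """P(chi2_1 > c2) = P(|Z| > sqrt(c2)) = erfc(sqrt(c2 / 2))."""
    return math.erfc(math.sqrt(c2 / 2.0))


def retention_probability(alpha, psi):
    """P(|Z + psi| > c_alpha) under a normal approximation to the t-test."""
    c = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (1.0 - _STD_NORMAL.cdf(c - psi)) + _STD_NORMAL.cdf(-c - psi)


def joint_retention(alpha, psi, k=5):
    """Probability that all k independent regressors with noncentrality psi are kept."""
    return retention_probability(alpha, psi) ** k


if __name__ == "__main__":
    print("alpha implied by c^2 = 2:", two_sided_alpha(2.0))
    for alpha in (0.01, 0.05, 0.16):
        print(alpha, joint_retention(alpha, psi=3.0, k=5))
```

The first line of output shows why a penalty of two per parameter, as in AIC, corresponds to a significance level near 16% for a single nested comparison.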

An Out-of-Sample Shift in the Regressors
The analysis of the previous section is augmented by the introduction of a break in Section 5.1. This break is immediately after the estimation sample, while in Section 6 it is applied to the last in-sample observation. We distinguish between whether the future values of the regressors are known (Section 5.2) or unknown (Section 5.4). The role of selection is studied again (Section 5.3), and we look at the random walk as a device to forecast future values of the regressors in Section 5.5. Forecasting devices based on full in-sample information and a random walk are the extremes of the class in , but there is no information in sample regarding the break to help either device.

Specification of the Out-of-Sample Shift
Consider a mean shift of size δ in x 2 at T + 1 with the forecast origin at T, so the shift coincides with the one-step-ahead forecast. The DGP has the same structure as (1)-(3) with the parameters (β 1 β 2 ) of the conditional model constant:

$$\begin{aligned}
y_t &= \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \epsilon_t, & t &= 1, \ldots, T + 1,\\
x_{1,t} &= \mu_1 + \eta_{1,t}, & t &= 1, \ldots, T + 1,\\
x_{2,t} &= \mu_2 + \eta_{2,t}, & t &= 1, \ldots, T,\\
x_{2,T+1} &= \mu_2 + \delta + \eta_{2,T+1}.
\end{aligned} \tag{15}$$

Since (15) entails: then β 2 δ ≠ 0 induces a location shift in the relationship between y T+1 and its in-sample determinants unless the future x 2,T+1 is known at time T. As shown in all forecast-error taxonomies (see e.g., Clements and Hendry 1998), shifts in the equilibrium mean are the most pernicious source of forecast failure, whereas changes in the parameters of mean-zero variables have only a variance impact. Omitting x 2,T+1 from (16) as in M 2 will create the same location shift. Thus, there is little loss of generality by only considering shifts in the regressors. We first evaluate the trade-off to omitting x 2,t for known future exogenous regressors, emulating the above results as the break which occurs in the forecast period is modeled in the known x 2,T+1 .

Known Future Values of Regressors
The one-step-ahead forecasts for M 1 given (15), in which values of x T+1 are assumed to be known at T, are unbiased when the parameter estimates are unbiased. The mean square forecast error of M 1 (see Appendix A.3 for derivations) is: which does not depend on ψ 2 β . Comparison with (8) highlights the effects of the location shift: δ 2 enters the MSFE despite the shift being 'known' given x 2,T+1 , and MSFE 1 is no longer independent of ρ. Equation (17) also reveals the additional costs of including an irrelevant regressor which shifts out of sample, as δ 2 enters even when β 2 = 0, although it is scaled by $T(1 - \rho^2)$ so larger samples mitigate its effect.
For M 2 (which omits the regressor x 2,t ), the expectation of the forecast error is $\mathsf{E}[\hat{\nu}_{T+1|T+1}] = \beta_2 \delta$, so the forecasts are biased by the shift in the omitted variable. The one-step-ahead MSFE for M 2 is: where β 2 2 enters directly so the MSFE is a function of ψ 2 β , unlike for M 1 . Comparison with (9) reveals the role that ρ and δ 2 play. When β 2 = 0, so M 2 is the correct model, (18) collapses to (9).
Assuming a criterion of minimizing one-step-ahead MSFE, using (10), MSFE 2 ≤ MSFE 1 requires: which depends on estimation uncertainty and therefore does not simplify neatly. However, the solution is close to 1 for reasonable values of ρ. For example, when ρ = 0.5, T = 50 and δ = 4, then ψ 2 β < 0.983, or |ψ β | < 0.991, results in a smaller MSFE 2 compared to MSFE 1 . Figure 5 demonstrates the close approximation to a trade-off at ψ β = 1 which holds regardless of the break. Thus, even knowing there is a shift in x 2 does not affect the choice of forecasting model between including or omitting x 2 : always (never) include for ψ 2 β ≥ 1 (ψ 2 β < 1).

Selecting Regressors
Following Section 4.2, a t-test for statistical significance will be conducted on x 2,t in sample and a decision to retain or exclude x 2,t will be made at c α for a given draw. Hence, MSFE 3 will be a weighted average of MSFE 1 and MSFE 2 , using (12): The term in square brackets is scaled by T −1 . As before, the difference between MSFE 1 and MSFE 3 diminishes as the sample size increases. When ψ 2 β = 0, the first term in square brackets in (20) drops out and the benefits of selection relative to MSFE 1 are evident as the second term must be negative. The magnitude of δ 2 affects both MSFE 1 and MSFE 2 but, from (20), the first δ 2 term is multiplied by ψ 2 β whereas the second offsetting term is not, so the effect of the location shift is exacerbated if ψ 2 β > 1. Figure 5 compares the MSFEs of M 1 from (17), M 2 from (18), and M 3 using (20) at three illustrative values of α. The profiles of the MSFEs mirror the analytical results for the no break case. Selection outperforms the estimated DGP for ψ 2 β < 1 despite a break, and remains close to the MSFE 1 at α = 0.16 for ψ 2 β > 1.

Unknown Future Values of Regressors
Now consider when the future values of the regressors are unknown. We use two devices to obtain forecasts of x i,T+1 , i = 1, 2: the in-sample mean or a random walk. The random walk is biased for unanticipated location shifts but does not result in systematic bias following a location shift, whereas the in-sample mean is persistently biased following a location shift unless updated. The two devices comprise the two extremes of using either the full in-sample data or only the last observation to produce the forecasts of the weakly exogenous regressors. Although the link between y and the x i stays constant, forecasts when the x i,T+1 are unknown will fail if the shift at T + 1 is not anticipated, inducing a shift in y T+1 . This will lead to forecast failure as the in-sample mean µ y shifts to (µ y + β 2 δ) at T + 1 but would be forecast to be µ y .
The forecasts based on in-sample estimates from (15) when µ 1 and µ 2 are not zero are given by: so will miss the unknown break. When the break occurs in x 2 , the MSFEs will worsen for β 2 ≠ 0. As before, we consider the sampling variation in estimating the means as small compared to the impact of shifts, so we approximate by taking T sufficiently large that the sample means of the x i can be replaced by µ i . Replacing the unknown x i,T+1 by µ i leads to forecasting y T+1 by the in-sample mean for both M 1 and M 2 , see Appendix A.4. Both face the same forecast bias, $\mathsf{E}[\hat{\epsilon}_{T+1|T}] = \mathsf{E}[\hat{\nu}_{T+1|T}] = \beta_2 \delta$, which is the same bias as M 2 with known regressors. Parameter estimation adds terms of $O_p(T^{-1})$. Hence, ignoring $O_p(T^{-1})$ terms, MSFE 1 = MSFE 2 : When β 2 = 0, the MSFE is σ 2 + β 2 1 , so is inflated relative to the known regressors case as x 1,T+1 must also be forecast. However, the in-sample mean forecast is the best forecast device for x 1,T+1 in this setting (in terms of minimum MSFE) as x 1,T+1 is stationary and not subject to a location shift. Selection will have little or no noticeable impact when MSFE 2 ≈ MSFE 1 , as this will also result in MSFE 3 ≈ MSFE 1 . Figure 6 records the MSFEs for M 1 and M 2 when there is a break in x 2 at T + 1, comparing known and unknown regressors using the in-sample mean to forecast x i,T+1 , i = 1, 2 in the unknown regressor case, i.e., the figure records (17), (18) and (23) (solid/dashed/dotted lines). Simulation outcomes are checked to capture $O_p(T^{-1})$ effects but they are negligible so are not recorded. Figure 6 includes the random walk forecasts, and the M 1 and M 2 results for the known regressor case are repeated from Figure 5 to facilitate comparison.
The simulation outcomes where parameters are estimated closely match the analytic results. For known regressors for MSFE 1 , the break in µ 2 does not affect the MSFE as it is captured in x 2,T+1 : even at δ = 4 for T = 100, MSFE 1 = 1.23 for the parameters given in the figure which is only slightly greater than σ 2 . However, when x T+1 is unknown both M 1 and M 2 are affected by the break in x 2,T+1 . Simulation outcomes again closely match the theory for the unknown break case, and show that the choice of whether to retain or exclude x 2,t is not important in a forecasting context. The unanticipated break dominates any forecast error resulting from model misspecification. Increasing the sample size does mitigate the MSFE costs but the MSFE premium relative to known regressors is maintained for all ψ 2 β . Increasing the number of relevant exogenous regressors that shift will increase the MSFE at ψ 2 β = 0, shifting the MSFE trajectories up. These results show that in this static setting of location shifts, if the break occurs in the forecast period and is unknown and unpredictable, then the retention of x 2 is irrelevant (other than parameter estimation uncertainty), as neither M 1 nor M 2 captures the shift which dominates the MSFE. Parsimony, or lack thereof, neither helps nor hinders much in this setting. Moreover, selection does not substantively affect the outcome as MSFE 3 ≈ MSFE 1 .
Figure 6. MSFE comparisons between M 1 , M 2 and M 3 for known and unknown future exogenous regressors including in-sample mean and random walk forecasts, where the break occurs in the mean of x 2 at T + 1.
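A small Monte Carlo sketch illustrates why the unanticipated break dominates when the regressors are forecast by their in-sample means. It is not the paper's simulation design: it uses population parameter values (so the estimation terms of order 1/T are ignored), assumes unit-variance regressors with correlation ρ, and compares the simulated MSFE with σ² + β₁² + β₂²(1 + δ²) + 2ρβ₁β₂, an expression derived here directly from that simplified setup rather than copied from (23). All names are hypothetical.

```python
import random


def simulate_msfe_mean_forecast(beta1, beta2, rho, delta, sigma=1.0,
                                reps=20000, seed=0):
    """Simulated one-step MSFE when x_{i,T+1} is replaced by mu_i.

    The forecast error is eps + beta1*eta1 + beta2*(delta + eta2), because
    the shift delta in x2 at T+1 is missed by the in-sample mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        eta1 = rng.gauss(0.0, 1.0)
        # Build eta2 with unit variance and correlation rho with eta1.
        eta2 = rho * eta1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, sigma)
        err = eps + beta1 * eta1 + beta2 * (delta + eta2)
        total += err * err
    return total / reps


def analytic_msfe_mean_forecast(beta1, beta2, rho, delta, sigma=1.0):
    """Second moment of the forecast error under the same assumptions."""
    return (sigma ** 2 + beta1 ** 2 + beta2 ** 2 * (1.0 + delta ** 2)
            + 2.0 * rho * beta1 * beta2)


if __name__ == "__main__":
    args = dict(beta1=1.0, beta2=0.3, rho=0.5, delta=4.0)
    print("simulated:", simulate_msfe_mean_forecast(**args))
    print("analytic :", analytic_msfe_mean_forecast(**args))
```

Setting `beta2=0.0` in the analytic function returns σ² + β₁², matching the value stated in the text for the case in which x 2 is irrelevant.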

Forecasting Regressors with a Random Walk
We now consider using a random walk to forecast the exogenous variables: x 2,T+1|T = x 2,T .
Such a device is not robust in this setting as the forecasts are made before the shift, and robustness refers to forecasting properties that are insensitive to a feature in the DGP, such as after a location shift. Although the last in-sample observation is an imprecise measure of the out-of-sample mean, it is unbiased when there are no location shifts (as there are no dynamics in the DGP), so E[x 1,T ] = µ 1 and E[x 2,T ] = µ 2 , and hence E[∆x 1,T+1 ] = 0 and E[∆x 2,T+1 ] = δ.
The forecasts from M 1 will be biased by the bias in the random walk forecast of x 2,T+1 , so (see Appendix A.5 for derivations) neglecting the small impact of η i,T on $\hat{\beta}_i - \beta_i$: and the resulting mean square forecast error is: Comparison with (23) highlights the additional cost of using the random walk relative to the in-sample mean when neither forecasting device can predict the break, since: The in-sample mean of x 1 is the optimal forecast of x 1,T+1 given its in-sample stationarity, so irrespective of the value of β 2 , the in-sample mean forecasts dominate when the shift is during the forecast period. When β 2 = 0, (26) collapses to ≈ σ 2 + 2β 2 1 , ignoring $O_p(T^{-1})$ terms, compared to σ 2 + β 2 1 for the in-sample mean forecasts. A random walk doubles the error variance, so can be costly if there are no breaks or if the break occurs after the forecast origin. As for the in-sample mean case, the MSFE of M 1 is a function of the break.
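The doubling of the error variance can be seen from a one-line comparison of the two devices for the stationary regressor x 1 . The worked step below assumes, as in the static design, that the η 1,t are serially independent with unit variance; it is a sketch of the reasoning rather than a reproduction of the appendix derivation.

```latex
\begin{aligned}
\text{in-sample mean:}\quad
  x_{1,T+1} - \mu_1 &= \eta_{1,T+1},
  & \mathsf{V}[\eta_{1,T+1}] &= 1, \\
\text{random walk:}\quad
  x_{1,T+1} - x_{1,T} &= \eta_{1,T+1} - \eta_{1,T},
  & \mathsf{V}[\eta_{1,T+1} - \eta_{1,T}] &= 2 .
\end{aligned}
```

Multiplying by β 1 2 , the contribution of forecasting x 1,T+1 to the MSFE is β 1 2 for the in-sample mean and 2β 1 2 for the random walk, which accounts for the difference between σ 2 + β 2 1 and σ 2 + 2β 2 1 when β 2 = 0.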
The forecast bias for M 2 is the same as that for M 1 by the same argument, although MSFE 2 (reported in Appendix A.5) does deviate from that for M 1 as ψ 2 β increases. This is due to the correlation parameter ρ which is picking up part of the omitted variable x 2,T+1 in M 2 and has more effect as ψ 2 β increases. When β 2 = 0, MSFE 2 ≈ σ 2 + 2β 2 1 , which is the same as for M 1 . Despite small but increasing deviations as ψ 2 β increases, MSFE 2 follows a similar trajectory to MSFE 1 . The misspecification is less relevant for the random walk forecasts of the marginal processes relative to the effect of the break, similar to the results for the in-sample mean forecasts.

Selecting Forecasted Regressors
In practice, selection will be applied to determine whether to include x 2,t or not. Then, from (12), we can obtain the MSFE 3 as: The trade-off between parameter estimation uncertainty and including x 2 is essentially the same as in the known variable case: if x 2 has a noncentrality of zero, so β 2 = ψ 2 β = 0, then the one-step MSFE is minimized by excluding x 2 from the forecasting model. It should be included if ψ 2 β > 1. However, depending on the values of ρ and T, the switch point can be smaller than ψ 2 β = 1, although the impact is likely to be small given the scale factor σ 2 T −1 . Even though the random walk forecast is highly uncertain by using just one observation, if the variable that breaks is quite significant then it pays to include that variable when using the random walk forecast. Figure 6 also records the MSFEs for the random walk forecasts using the same parameter values. The increase in MSFE over the in-sample mean forecasts is evident. Both MSFE 1 and MSFE 2 follow similar trajectories, although they do start to diverge for large ψ 2 β , with MSFE 3 at α = 0.16 close to MSFE 1 .

An In-Sample Shift in the Regressors
In contrast to the previous section, the break is assumed to occur at T, which is the last observation available for estimation. Now there is information available regarding the break when the forecasts are made. Such a framework would also be relevant in sequential forecasting. We consider forecasting using in-sample means. In common with the previous section, we study selection (Sections 6.3 and 6.5), the random walk device to forecast the regressors (Section 6.4), and finally using the random walk to forecast y (Section 6.6).

Specification of the In-Sample Shift
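The DGP is adapted from (15) but the shift in µ 2 occurs at T, rather than T + 1:

$$x_{2,t} = \mu_2 + \eta_{2,t}, \quad t = 1, \ldots, T - 1, \qquad x_{2,t} = \mu_2 + \delta + \eta_{2,t}, \quad t = T, T + 1,$$

with the equations for y t and x 1,t unchanged.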

Forecasting Regressors Using In-Sample Means
The relationship of interest, i.e., the conditional equation for y T+1 , remains constant. However, the in-sample mean µ y is shifted to (µ y + β 2 δ) at T. Although the only DGP parameter to shift is µ 2 to µ 2 + δ, sample calculations will be altered as now $\mathsf{E}[\bar{x}_2] = \mu_2 + T^{-1}\delta$ (see Appendix A.6 for derivations).
The impact of the break on the estimated in-sample mean of {x 2,t } will be small unless δ is very large, so by using the in-sample means for their future unknown values, the forecasted mean of y T+1 for M 1 will still be close to µ y , and the resulting forecast error bias is: This is unbiased when β 2 = 0, but could be badly biased if β 2 δ is large. The MSFE for M 1 is: This is very similar to the MSFE 1 in (23) for an out-of-sample break using the in-sample means to forecast the exogenous regressors, and hence MSFE 2 and MSFE 3 as well, although the correlation between the two regressors does not enter. When β 2 = 0, both (23) and (28) collapse to σ 2 + β 2 1 . The dampening of the squared location shift by $(1 - T^{-1})^2$ slightly improves the MSFE for the in-sample shift relative to an out-of-sample shift at larger ψ 2 β , as shown in Figure 7. For a break out of sample, we find the analytic results for M 2 are identical to those for M 1 (see Section 5.4). For the in-sample break, the forecast error and MSFE for M 2 do differ from those of M 1 (see Appendix A.6 for analytic results). This is because the in-sample location shift affects ρ, which introduces a term similar to the squared location shift scaled by T in (28). Therefore, MSFE 1 ≠ MSFE 2 unless β 2 = 0, with M 2 incurring a larger MSFE cost as ψ 2 β increases due to misspecification, although the divergence is small even for small T, and disappears asymptotically.

Selecting Regressors
Selection follows from (12). The cost of omitting $x_2$ rises with $\beta_2^2\delta^2$, although increases in $\beta_2$ will raise $\psi_\beta^2$ and hence the probability of retaining $x_2$, albeit unconnected with the magnitude of $\delta^2$. As the location shift is scaled by $T$, $\mathrm{MSFE}_3 \to \mathrm{MSFE}_1$ as $T \to \infty$.

Forecasting Regressors Using a Random Walk
From the previous analysis in Section 6.2, knowledge of the break at $T$ brought little benefit when using in-sample means as forecasts. However, the random walk should do better when the break occurs at $T$ as opposed to $T+1$. As before, $x_{1,T+1|T} = x_{1,T}$ and $x_{2,T+1|T} = x_{2,T}$, but now $\mathsf{E}[x_{1,T}] = \mu_1$ and $\mathsf{E}[x_{2,T}] = \mu_2 + \delta$, and hence $\mathsf{E}[\Delta x_{1,T+1}] = 0$ and $\mathsf{E}[\Delta x_{2,T+1}] = 0$ as well.
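The contrast between the two devices can be illustrated with a minimal simulation sketch, which is not the paper's own computation. It assumes a static regressor $x_{2,t} = \mu_2 + \delta 1\{t \ge T\} + \eta_t$ with unit-variance Gaussian noise; the function name and the default values of `mu2`, `reps` and `seed` are illustrative choices rather than settings taken from the paper.

```python
import numpy as np


def simulate_regressor_forecast_bias(mu2=1.0, delta=4.0, T=100, reps=20000, seed=0):
    """Average forecast error for x_{2,T+1} when the mean shifts by delta at t = T.

    Assumed (hypothetical) process: x_{2,t} = mu2 + delta * 1{t >= T} + eta_t,
    with eta_t ~ N(0, 1). Returns the mean error of the in-sample-mean device
    and of the random walk device, averaged over `reps` replications.
    """
    rng = np.random.default_rng(seed)
    eta = rng.standard_normal((reps, T + 1))

    # Columns 0..T correspond to t = 1..T+1; the shift starts at t = T (column T-1).
    shift = np.zeros(T + 1)
    shift[T - 1:] = delta
    x2 = mu2 + shift + eta

    in_sample = x2[:, :T]   # observations t = 1..T, available at the forecast origin
    outturn = x2[:, T]      # realised x_{2,T+1}

    mean_forecast = in_sample.mean(axis=1)   # in-sample mean device
    rw_forecast = in_sample[:, -1]           # random walk device: last observation

    return (outturn - mean_forecast).mean(), (outturn - rw_forecast).mean()


mean_bias, rw_bias = simulate_regressor_forecast_bias()
print(f"in-sample mean device: average error = {mean_bias:.3f}")
print(f"random walk device:    average error = {rw_bias:.3f}")
```

Under these assumptions the in-sample mean device should miss by roughly $\delta(1 - T^{-1})$ on average, whereas the random walk device should be approximately unbiased, at the price of a noisier forecast.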
Given the unbiased forecasts of the exogenous regressors, it follows that the forecasts for $M_1$ are unbiased (see Appendix A.7) when the parameter estimates are unbiased. The MSFE for $M_1$ is given in (29). When $\beta_2 = 0$, the MSFE is similar to that of the out-of-sample break case, where the random walk is costly as forecasts of both $x_{1,T+1}$ and $x_{2,T+1}$ are inefficient. However, (29) does depend on the magnitude of the shift independently of $\beta_2$, unlike (26). $\mathrm{MSFE}_1$ is a function of $\psi_\beta^2$, increasing as $\psi_\beta^2$ increases, unlike in the known regressor case. But it does so more slowly than for breaks out of sample, or breaks in sample using the in-sample mean. As $\psi_\beta^2$ increases, the break at $T$ in $\mu_2$ has a larger effect on the dependent variable, and hence the benefits of using a random walk forecast of $x_{2,T+1}$ are larger.

$M_2$ will suffer when $\beta_2 \neq 0$ as the forecasts will be biased. The MSFE for $M_2$ is given in (30), so no robustness in the sense of reducing bias is achieved unless $\beta_2 = 0$. When $\beta_2 = 0$, $\mathrm{MSFE}_2 < \mathrm{MSFE}_1$, but the bias from not including a random walk, and hence unbiased, forecast of $x_{2,T+1}$ quickly outweighs parameter estimation costs as $\psi_\beta^2$ increases. Solving for $\mathrm{MSFE}_2 < \mathrm{MSFE}_1$ results in (31). The break term dominates and offsets on the numerator and denominator, leading to a trade-off at approximately 1 with deviations scaled by $T^{-1}$. For $\rho = 0.5$, $T = 100$ and $\delta = 4$, the cut-off is at $\psi_\beta = 1.05$. Interestingly, the cut-off is slightly above 1 for this case, compared to slightly below 1 for the known breaks out-of-sample case, but the results still imply that a selection significance level of approximately 16% would be optimal to trade off the cost of estimating an additional parameter. Figure 8 records the MSFEs from $M_1$ (29), $M_2$ (30) and three values of $M_3$ (A4) for the analytic results. There is a clear trade-off at $\psi_\beta^2 \approx 1$, just as in the known breaks case.

Selecting Forecasted Regressors
The final step is to compute the MSFE for M 3 for the random walk forecast, reported in Appendix A.7. Just as regression models are usually selected, that will occur for any forecasting devices designed to minimize systematic bias. As with Figure 5, selection between M 1 and M 2 can be advantageous even for these forecasting devices as seen in Figure 8. Selection outperforms M 1 for ψ 2 β < 1, and remains close to the MSFE 1 at α = 0.05 and α = 0.16, again in all cases matching or outperforming always using M 2 .
A comparison with the MSFE for the in-sample mean forecasts, also recorded in Figure 8, suggests a possible forecast improvement. If the regressor that breaks at T is known, combining the in-sample mean forecast for M 1 with the random walk forecast for M 2 will improve forecast performance (shifting the MSFE curves for the random walk forecast down by approximately 1). As the number of regressors increases, the forecasting method for each contemporaneous regressor will have a cumulative impact. However, as the break occurs in sample, methods to detect breaks at the forecast origin such as impulse indicator saturation (IIS) could be used to guide the forecaster to the most appropriate forecasting device. 4 Selection between forecasting devices that minimize systematic bias versus those that trade-off bias and variance requires pre-testing and would only help for in-sample shifts; see, e.g., Chu et al. (1996). Thus, selection can be valuable for forecasting to the extent that it retains relevant regressors that shift (here, x 2 ), and also if it eliminates irrelevant regressors that shift, as considered in Section 9.

Forecasting the Dependent Variable Using a Random Walk
If a break is suspected, an alternative to the approaches considered so far is to use a knowingly misspecified model of the conditional DGP. One possibility is to use a random walk forecast for $y$, with the advantage that $y_T$ is known, which avoids the need to forecast $x_{1,T+1}$ and $x_{2,T+1}$. Hendry and Mizon (2012) derive a forecast-error taxonomy for open models that demonstrates the numerous additional forecast errors that arise from forecasting regressors offline in open models. They show that, in some cases, it can pay to use a misspecified model rather than to forecast the regressors offline. The forecast device is $y_{T+1|T} = y_T$. Then $y_T = \mu_y + \beta_2\delta + \beta_1\eta_{1,T} + \beta_2\eta_{2,T} + \epsilon_T$ is a noisy one-observation estimator of $\mu_y + \beta_2\delta$. The outturn at $T+1$ is:

$y_{T+1} = \mu_y + \beta_2\delta + \beta_1\Delta\eta_{1,T+1} + \beta_2\Delta\eta_{2,T+1} + \epsilon_{T+1} + \beta_1\eta_{1,T} + \beta_2\eta_{2,T}.$

The forecast error is therefore $y_{T+1} - y_T = \beta_1\Delta\eta_{1,T+1} + \beta_2\Delta\eta_{2,T+1} + \Delta\epsilon_{T+1}$, which is unbiased. Its MSFE is independent of $\delta$, so this device should perform relatively best when $\delta^2$ is large, although it performs worse than random walk forecasts for $x_{1,T+1}$ and $x_{2,T+1}$ when $\psi_\beta^2$ is small; see Figure 8. The forecasts are invariant to omitting $x_2$ since this random walk forecast is independent of the regressors, which is a major advantage and negates the role of selection. However, there is a cost when the model is correctly specified. The results in the simulation below suggest that such an approach should be viewed as complementary, with forecast pooling across selected conditional models and misspecified robust devices designed to mitigate bias frequently outperforming individual methods.

Summary of Analytic Results and the Impact of Selection
The theoretical analysis has established four results.

1. Regressors should be retained if $\psi_\beta \geq 1$. This is established for DGPs that are stationary or with a break out of sample for known regressors and a break in sample for random walk forecasts.

2. For the two-regressor case, $\psi_\beta = 1$ maps to $\alpha \approx 0.16$. Selection delivers improvements to the one-step-ahead MSFE for $\psi_\beta < 1$ and can be close to the correct model specification for $\psi_\beta > 1$, with the largest deviations occurring at intermediate values of $\psi_\beta$.

3. If there are breaks out of sample and contemporaneous regressors need to be forecast, the break dominates the MSFE and selection plays almost no role. Similar results are found even if the break occurs at the end of the sample, but the in-sample mean is used to forecast the regressors.

4. Random walk forecasts are costly if there are no breaks (forecasting $x_{1,T+1}$) or if the breaks are unpredictable (a break at $T+1$ and forecasting $T+1|T$). However, they improve MSFE when the break is predictable (break at $T$ and forecasting $T+1|T$).

Table 3 summarises the results for specific parameters using $T = 50$ ($T = 100$ is in Table A1 in Appendix B). For each scenario, the ratio $\mathrm{MSFE}_j/\mathrm{MSFE}_1$ for $j = 2, 3$ is reported. $\mathrm{MSFE}_2$ has no selection, and is therefore listed as $\alpha = 0$, while three values of $\alpha$ are used for $\mathrm{MSFE}_3$. The squared noncentralities $\psi_\beta^2 = 0, 1, 4, 9, 16$ capture the full hump shape seen in the figures above.
M 2 is the correct model in the column labelled ψ 2 β = 0, so the ratio of MSFE 2 /MSFE 1 measures the cost of over-specification. The gains can be substantial in some cases, almost 30% for a break out of sample with known regressors, but in other cases including x 2,t is not at all costly despite its irrelevance. Tighter selection for M 3 is close to M 2 as x 2,t will be omitted more frequently, but even at α = 0.16 the ratio for M 3 is close to the ratio for M 2 , suggesting that selection is not costly.
Moving to the next column highlights the ψ β = 1 trade-off, with all cases almost exactly equal to one. A cut-off slightly lower than one was found in (19), which is reflected in the ratio marginally greater than one. Conversely, (31) found a cut-off slightly larger than one, resulting in a ratio slightly below one, but the differences are small.
Next, consider the columns labelled $\psi_\beta^2 = 4$, 9, and 16. $M_1$ is the correct model so the objective is to minimize the ratio. In some cases $M_2$ performs poorly, but $M_3$ at $\alpha = 0.16$ is frequently very close to 1, i.e., $\mathrm{MSFE}_1$. Selection forecast performance tends to be worse at $\psi_\beta^2 = 4$, but as the signal for $x_2$ increases, the probability of retaining $x_2$ increases so the selected model is closer to $M_1$. The benefits of selection vary by case. For example, for a break at $T$ using in-sample means, selection at $\alpha = 0.16$ delivers a 2.4% improvement relative to $M_2$ for $\psi_\beta = 4$, compared to a halving of the ratio for the random walk. In almost every setting, $\mathrm{MSFE}_3$ is close to $\mathrm{MSFE}_1$ so the costs of selection are usually small, irrespective of the noncentrality. In that sense, model selection acts to reduce the risk relative to the worst model. Conversely, the costs of unmodeled shifts are very large, up to almost 8-fold greater than the baseline stationary $\mathrm{MSFE}_1$.

Table 3. Ratio of MSFE to that of $\mathrm{MSFE}_1$, $T = 50$. $M_2$ has no selection ($\alpha = 0$); selection in $M_3$ at $\alpha$.

These results show that even facing breaks, the well-known trade-off for selecting variables in forecasting models, namely that variables should be retained if their noncentralities exceed 1, still applies, resulting in much looser significance levels than typically used. The problem with such an approach is that when many $\beta_{2,i} = 0$ but the corresponding regressors are subject to location shifts, $M_1$, which erroneously includes $x_{2,t}$ in the model, will perform worse. Loose significance levels increase the chance that irrelevant variables with $\psi_\beta = 0$ are retained by being adventitiously significant for that draw. To evaluate this effect, the next section undertakes a simulation study of selection in models with ten irrelevant and five relevant exogenous regressor variables confronting a variety of shifts.
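The link between a unit noncentrality and a significance level of about 16% can be checked numerically. The sketch below is only an illustration: it takes the boundary condition $c_\alpha^2 = 2$ stated in the Conclusions, and it replaces the Student-t distribution by a standard normal approximation, which is an assumption made here for simplicity. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normal approximation to the t-distribution (assumption)


def implied_significance_level(critical_value_squared=2.0):
    """Two-sided significance level alpha whose critical value c satisfies c^2 = given value."""
    c = sqrt(critical_value_squared)
    return 2.0 * (1.0 - STD_NORMAL.cdf(c))


def retention_probability(psi, alpha):
    """P(|t| > c_alpha) for a t-statistic treated as N(psi, 1), two-sided test at level alpha."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (1.0 - STD_NORMAL.cdf(c - psi)) + STD_NORMAL.cdf(-c - psi)


alpha_star = implied_significance_level()          # close to 0.16
p_retain = retention_probability(1.0, alpha_star)  # chance of keeping a regressor with psi = 1
print(f"alpha at c^2 = 2: {alpha_star:.4f}")
print(f"retention probability at psi = 1: {p_retain:.3f}")
```

Even at this loose level, a regressor with $\psi_\beta = 1$ is retained only about a third of the time under the normal approximation, which is why selection and always including the variable can give similar MSFEs near the boundary.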

Simulation Design
We generalize the above analysis using Monte Carlo analysis, formalizing the DGP and models that are estimated. We consider larger models with dynamics, evaluating for a range of strategies to forecast future values of the regressors, different significance levels, and different configurations of out-of-sample breaks. The next section then evaluates the simulation results.

Data Generation Process
The DGP is for a scalar dependent variable $y_t$ and $N$ regressors $x_t = (x_{1,t}, \ldots, x_{N,t})'$. There are $n$ regressors that are relevant, i.e., have a nonzero coefficient in the DGP for $y_t$, and $N - n$ that are irrelevant with coefficient zero.

Breaks in the process for the target variable $y$ are introduced through breaks in the regressors. During the break, $\delta_R = -0.3 \equiv \delta_\Delta$, so $\delta$ drops by 2.3. Keeping $\lambda$ unchanged, the equilibrium changes from $x = 8$ to $x_\Delta = -1.2$, which is a shock of six unconditional standard errors. The impact on $y_t$ depends on the coefficients $\beta_j$. To quantify this, it is convenient to assume that the processes are at their unconditional means, after which we follow the shocks through the dynamic system, ignoring the disturbances. The impact on $x$ when the coefficients change from $(\delta, \lambda) = (2, 0.75)$ to $(\delta_\Delta, \lambda_\Delta)$ is given in Table 4. The process reverts to the original coefficients at $T+3$, aiming to capture qualitatively aspects of a sustained but temporary structural break, such as the Great Financial Crisis or the COVID-19 pandemic. The impact of the break on $y_{j,T+1|T}$ is 0.87 times the new $x$. For $(-0.3, 0.95)$ this is a change of 0.6, well below $y$'s conditional standard error of unity.

Table 5 lists the break settings we consider. The upward break in slope (a) pushes the process towards a unit root, while the downward break in slope (b) makes it almost white noise. Figure 9 plots the second half of $y_t$ for one replication of the DGP and for each of the five specifications of the break. This is for $T = 100$ and after discarding the initial observations. The break lasts for two observations in the forecast period, after which the DGP reverts to the settings without break. Figure 9 illustrates the low impact of the break in mean and slope when $(\delta_\Delta, \lambda_\Delta) = (-0.3, 0.95)$.

The design (33) allows for breaks in relevant variables, in irrelevant variables, or in both. In the last case, $\delta_R = \delta_I = \delta_\Delta$ and $\lambda_R = \lambda_I = \lambda_\Delta$. Breaks in irrelevant variables do not affect $y$, but can have an impact on forecasts if the irrelevant variables are used in the forecasts' construction. However, when forecasting for $T+1|T$, such breaks have no impact at all, because the future values $x_{T+1}$ are not yet known.

Figure 9. Second half of $y_t$ for one replication of the DGP under each break specification of Table 5, $T = 100$, $H = 5$.
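The equilibrium values and the size of the shock quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that each regressor follows an AR(1) of the form $x_t = \delta + \lambda x_{t-1} + v_t$ with unit innovation variance; the unit variance is an assumption made here because it is consistent with the quoted "six unconditional standard errors", and the function names are illustrative.

```python
from math import sqrt


def ar1_equilibrium(delta, lam):
    """Unconditional mean of x_t = delta + lam * x_{t-1} + v_t (requires |lam| < 1)."""
    return delta / (1.0 - lam)


def ar1_unconditional_sd(lam, sigma=1.0):
    """Unconditional standard deviation of the AR(1), with innovation sd sigma (assumed 1)."""
    return sigma / sqrt(1.0 - lam ** 2)


before = ar1_equilibrium(2.0, 0.75)    # pre-break equilibrium
after = ar1_equilibrium(-0.3, 0.75)    # equilibrium during the break in mean
shock_in_sd = (before - after) / ar1_unconditional_sd(0.75)

print(f"equilibrium before: {before:.2f}, during break: {after:.2f}")
print(f"shift in unconditional standard deviations: {shock_in_sd:.2f}")
```

With $\lambda = 0.75$ the shift of 9.2 in the equilibrium corresponds to roughly six unconditional standard deviations under the unit-variance assumption.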

Models and Forecast Devices
We generate $Q + T + H$ observations from DGP (32)-(34), discarding the initial $Q$. The starting point for modeling is the general unrestricted model (GUM) in (36). An asterisk indicates that model selection is used, so the intercept and lagged $y$ are not selected over but are always retained. Model selection is only performed once for each replication, but the selected model (37) is re-estimated by ordinary least squares (OLS) each time that we forecast, given data up to $T+h-1$. Only one-step-ahead forecasts are generated and evaluated, as in (38). The out-of-sample values $x_{j,T+h}$ of the regressors in (38) are unknown when forming the forecasts. We consider a range of forecast devices that can supply these missing values:

INF: future outcomes: $\hat{x}_{j,T+h} = x_{j,T+h}$;
AVG: the in-sample average: $\hat{x}_{j,T+h} = \sum_{t=h}^{T+h-1} x_{j,t}/T$;
ARX: an AR(1) for each regressor: $\hat{x}_{j,T+h} = \hat{\mu}_j + \hat{\rho}_j x_{j,T+h-1}$, estimated by OLS for each horizon from:
$x_{j,t} = \mu_j + \rho_j x_{j,t-1} + u_{j,t}, \quad t = h, \ldots, T+h-1$; (39)
RWX: the random walk forecast: $\hat{x}_{j,T+h} = x_{j,T+h-1}$;
RDX: a random walk with differencing (Hendry 2006), using differenced estimates from (39): $\hat{x}_{j,T+h} = x_{j,T+h-1} + \hat{\rho}_j \Delta x_{j,T+h-1}$;
CAX: Cardt forecast of $x_{j,T+h}$.

In addition, several alternatives that ignore the regressors are considered:

RWY: a random walk forecast: $\hat{y}_{T+h} = y_{T+h-1}$;
ARY: an AR(1) forecast: $\hat{y}_{T+h} = \hat{\gamma}_0 + \hat{\gamma}_1 y_{T+h-1}$, estimated by OLS for each horizon;
CAY: Cardt forecasts of $y_{T+h}$.
The devices that forecast the regressors supply plug-in values to allow forecasting with the GUM (36), as well as the reductions (37) of the GUM, at a range of nominal significance levels. Device INF uses future outcomes, making it infeasible for stochastic variables. Note that all devices using regressors benefit from some knowledge that is not available in practice, namely that the DGP is nested in the GUM, and the GUM is not misspecified. The fact that the regressors are exchangeable and break at the same time in the same way may also help: finding just one that matters could already improve the forecasts.
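The feasible plug-in devices for a single regressor can be sketched as follows. This is an illustrative implementation rather than the Ox code used for the experiments: it covers AVG, ARX, RWX and RDX for one series observed up to the forecast origin, omits INF and Cardt, and the function names are hypothetical.

```python
import numpy as np


def fit_ar1(x):
    """OLS estimates (mu, rho) of x_t = mu + rho * x_{t-1} + u_t on the supplied window."""
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(design, x[1:], rcond=None)
    return float(coef[0]), float(coef[1])


def regressor_forecasts(x):
    """One-step-ahead forecasts of a regressor from the estimation window x.

    AVG: in-sample average; ARX: AR(1) forecast; RWX: last observation;
    RDX: last observation plus rho times the last observed change.
    """
    x = np.asarray(x, dtype=float)
    mu, rho = fit_ar1(x)
    last = x[-1]
    last_change = x[-1] - x[-2]
    return {
        "AVG": float(x.mean()),
        "ARX": mu + rho * last,
        "RWX": float(last),
        "RDX": float(last + rho * last_change),
    }


# Illustrative input only: a short deterministic trend, not data from the experiments.
print(regressor_forecasts([1.0, 2.0, 3.0, 4.0, 5.0]))
```

In a recursive exercise the function would be called again at each forecast origin with the window extended by one observation, and the resulting values plugged into the selected model for $y$.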
Cardt is a slightly improved version of Card (calibrated average of rho and delta methods), see Doornik et al. (2020a), which performed very well in the M4 forecast competition of Makridakis et al. (2020). Cardt averages forecasts from a differenced, autoregressive, and a moving average model. These are then treated as future observations in a calibration model with richer autoregressive structure. The full procedure is documented in Castle et al. (2021). Cardt pays particular attention to seasonality, which is irrelevant here. We use Cardt to make four forecasts, then use the first of these. The method will take logarithms by default. Switching that off makes little difference in these experiments. Cardt is used in daily COVID-19 forecasts of Doornik et al. (2020b).

Selecting Regressors
The noncentrality $\psi_\beta$ in the DGP affects the probabilities of retaining a variable in the model selection procedure. Table 6 shows the probability of retaining one or all relevant regressors assuming independent t-tests. While the probability of retaining one variable may be quite large, the joint probability of retaining all can be extremely low. Thus, even using a significance level of 16%, many relevant variables will be omitted if their noncentralities are small. However, their contribution to explaining the dependent variable is also small and breaks in such variables will have a smaller effect.

Table 6. Probability of retaining one or all variables when the coefficients have the specified noncentrality, assuming independence at nominal significance $\alpha$ and Student-t(83) distribution.

The fraction of relevant variables that is retained in the Monte Carlo experiment is denoted the potency, and the fraction of irrelevant variables that is retained is denoted the gauge. We always retain the intercept and lagged $y$, so the GUM (36) has $2N$ possible variables to select over, of which $n$ are relevant. For $m = 1, \ldots, M$ replications, the retention rates are computed using the indicator function $1\{\cdot\}$ and then averaged over all replications. Table 7 shows that the empirical gauge matches the theoretical probabilities in Table 6 when using Autometrics for selection: the gauge is higher than $\alpha$ but not by much. Potencies are close to the powers of one-off t-tests with the same noncentralities up to $\alpha = 0.1$; beyond that they fall behind. Consequently, it is appropriate to use Autometrics to investigate the theoretical results by simulating a more general setting, without concern that the selection algorithm will influence the results relative to the single t-test approach analyzed above.
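Gauge and potency can be computed directly from the record of which candidate variables were retained in each replication. The sketch below follows the verbal definitions above (fraction of irrelevant and of relevant variables retained, averaged over replications); the array layout and the function name are assumptions made for illustration.

```python
import numpy as np


def gauge_and_potency(retained, relevant):
    """Average retention rates of irrelevant (gauge) and relevant (potency) variables.

    retained: array of shape (M, K); entry [m, k] is True if candidate variable k
              was kept in replication m.
    relevant: array of shape (K,); True if variable k enters the DGP.
    """
    retained = np.asarray(retained, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)

    potency_per_rep = retained[:, relevant].mean(axis=1)   # share of relevant kept
    gauge_per_rep = retained[:, ~relevant].mean(axis=1)    # share of irrelevant kept

    return float(gauge_per_rep.mean()), float(potency_per_rep.mean())


# Illustrative input only: two replications, two relevant and two irrelevant candidates.
retained_example = [[1, 1, 0, 0],
                    [1, 0, 1, 0]]
gauge, potency = gauge_and_potency(retained_example, [1, 1, 0, 0])
print(f"gauge = {gauge:.2f}, potency = {potency:.2f}")
```

Variables that are always retained, such as the intercept and lagged $y$, would be left out of the `retained` array so that they do not enter either rate.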

Simulation Evidence
Simulation evidence is presented using the design of Section 8.1 and forecast devices of Section 8.2. All experiments use M = 10,000 and are implemented in Ox 9 (Doornik 2018) and PcGive (Hendry and Doornik 2018). We start with out-of-sample forecasts in Section 9.1, when the break is unanticipated. Then Section 9.2 compares breaks in relevant and irrelevant variables, Section 9.3 looks at forecasts after the break, Section 9.4 considers selection, Section 9.5 introduces pooled forecasts, and Section 9.6 summarizes.

Forecasting before the Break
The top half of Table 8 is for the case without breaks, when forecasting $T+1|T$ is similar to forecasting $T+2|T+1$, etc. The table reports the ratio of the MSFE for devices INF, AVG, ARX and RWX, respectively, to the MSFE of ARY for a range of significance levels $\alpha$. Selection at $\alpha = 0$ implies dropping all the regressors, leaving an AR(1) in $y$, denoted ARY. The bottom row of each half gives the MSFE of ARY. Not selecting at all ($\alpha = 1$) coincides with the GUM.

Table 8. No break and out-of-sample break. Ratio of MSFE to $\mathrm{MSFE}_{\mathrm{ARY}}$ forecasting $T+1|T$.

Without a break, knowing the future value of regressors, device INF, is only useful when they are significant. Using the sample mean AVG never improves one-step forecasting relative to ARY. This also holds when there is a break, and is even more pronounced for $T+2|T+1$ and $T+3|T+2$ (not shown). We see that $\mathrm{MSFE}_{\mathrm{ARY}}$ increases when there are more highly significant variables. There is an improvement over ARY from forecasting the regressors with ARX at strict significance levels for $\psi(4)$. In this stationary DGP without breaks, ARX dominates RWX: it is better to model the regressors by an autoregression (the true model) than to take the last known value.
The bottom half of Table 8 is for the cases with an out-of-sample break in the relevant variables only. The ratios for the five break settings (in mean, in slope, and in mean and slope, for (a) and (b)) are averaged. Now it really would help to know the future. There is only a small penalty for including irrelevant regressors, as their influence is swamped by the break. Except for the sample means, both feasible methods perform on a par with ARY. The infeasible device is best with loose selection, as was found theoretically.

Selection and Location of the Break
The design of the experiments allows for three locations of the break. Table 9 gives the mean square forecast errors for a break in mean and slope (b), listing three cases.

Break in relevant regressors
The break shows up in y through the relevant variables. Inclusion of irrelevant variables in the forecasting model is not costly relative to the impact of the break. Loose selection is preferred, because it includes more relevant variables. For T+1|T selection has no impact because the break is not observed (except for known regressors). Including regressors in ARX and RWX gives a substantial improvement over ARY.

Break in irrelevant regressors
There is no break in y, so any inclusion of irrelevant variables is costly, as their break offsets the small estimated coefficients. The more irrelevant variables included, the stronger this effect. The autoregression in y is almost always preferred.

Break in all regressors
The y variable is identical to that of a break in relevant variables only. Selection is now a trade-off between including variables that matter and help with forecasting, and irrelevant variables that make forecasts worse. Including regressors in ARX and RWX gives a substantial improvement over ARY.

Forecasting after the Break
We now dispense with INF because it is infeasible, and with AVG because it has the highest MSFE in all experiments. Table 10 reports the ratio of the MSFE for all other devices to that of ARY. For the devices that forecast regressor values, results are reported after selection at 10%.

ψ(1) 1.10 1.15 1.31 1.15 1.21 1.29 1.11 1.16 1.31 1.16 1.21 1.29 1.10 1.14 1.30 1.15 1.21 1.28
ψ(2) 1.00 1.04 1.24 1.05 1.14 1.19 1.02 1.07 1.29 1.08 1.15 1.22 1.01 1.06 1.25 1.06 1.15 1.21

When there is no break, only ARX is able to gain on ARY, and then only for the design with significant regressors (but stricter selection would help; see Table 8). Otherwise, and always for the break in irrelevant variables only, the AR(1) in $y$ has the smallest mean square forecast error. This matches an oft-found outcome. This model is misspecified, ignoring all information from the exogenous regressors, but misspecification need not entail forecast failure. Indeed, the costs of forecasting the exogenous regressors can outweigh the benefits of their inclusion. However, the DGP design is also an AR(1) in $y$, so this forecasting device has the advantage of correctly specifying the dynamics. It may not perform so well if the DGP contains more complex dynamics.

The AR(1) in y performs poorly when relevant regressors break. Now we see substantial gains in Table 10 from modeling the regressors, even shortly after the break has finished (the break is active for T + 1 and T + 2).
Device RDX improves on RWX when the process shifts towards a unit root, but not otherwise. Cardt behaves quite similarly to the random walk forecasts in this DGP: CAX is close to RWX in most cases. Cardt on y is usually a small improvement on RWY in the cases with a break.
The AR(1) for x always improves on ARY in the cases with break. In the first period with an observed break, T + 2, it is the worst of the methods that forecast regressors, while in subsequent periods it is the best of these. But note that at T + 3 the naive random walk forecast of y and Cardt are better still.

Is Selection Costly When Forecasting?
Comparing selection to using the GUM to forecast regressors, we find that selection is always advantageous. Table 11 gives the average MSFE ratio relative to ARY, where the average is taken over the three noncentrality settings, and different break cases. The top panel of the table combines cases where there is no change in y, either because nothing breaks, or for the break in mean and slope for irrelevant variables only. In that case ARY tends to dominate, so tight selection is advantageous. The exception is highly significant regressors in a stationary setting.  The bottom panel of Table 11 averages over the five cases where all variables break. There we often see a U-shaped effect of selection, with a loose selection best. This is particularly so at T + 2|T + 1, as was found in the theoretical results.
The bottom row in each panel of Table 11 gives the result when the specification of the DGP is known but its parameters need to be estimated. The entries under INF have the most information: the DGP as well as the future values of the regressors. Moving to the other columns shows the cost of not knowing the latter.

Forecast Combinations
Many investigations of forecasting have shown that combined forecasts can outperform the individual forecasts. The main candidates here are ARX in combination with a random walk style forecast of y. Although there are many other possibilities, we restrict ourselves to: APOOL (ARX + RWY)/2; CPOOL (ARX + CAY)/2.
In both cases ARX is used in the model that is selected from the GUM at 10%. To summarize the results, we consider again the MSFE relative to ARY, with a three-way average across noncentralities, break types and horizons T + 2, T + 3, T + 4. Table 12 illustrates that in this setting pooling can be advantageous as well. It is even competitive with the infeasible device.
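Equal-weight pooling of this kind amounts to averaging the individual forecasts period by period and evaluating the result in the same way as any other device. The following sketch shows the mechanics; the function names are illustrative and the numbers are made up solely to show a case where the errors of two devices offset each other, so they are not results from the experiments.

```python
import numpy as np


def msfe(actual, forecast):
    """Mean square forecast error over the evaluation periods."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean((actual - forecast) ** 2))


def pool(*forecasts):
    """Equal-weight average of several forecast series of the same length."""
    return np.mean(np.vstack(forecasts), axis=0)


# Hypothetical forecasts from two devices with offsetting errors.
actual = [1.0, 2.0, 3.0]
device_a = [0.0, 2.0, 4.0]   # e.g. a model-based forecast
device_b = [2.0, 2.0, 2.0]   # e.g. a robust device
pooled = pool(device_a, device_b)

print("MSFE device A:", msfe(actual, device_a))
print("MSFE device B:", msfe(actual, device_b))
print("MSFE pooled:  ", msfe(actual, pooled))
```

In practice the gain from pooling depends on how strongly the forecast errors of the two devices are correlated; the example is deliberately extreme.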

Summary of the Simulation Results
We can infer some general results from the experiments. First, using the in-sample mean to forecast the exogenous regressors is always dominated by other approaches.
Next, when the break occurs out of sample, so forecasts are computed for T + 1, all methods struggle, and incorporating regressors is worse than simply using the AR(1) for y. Moving to the case when the break occurs in sample, so the forecasts are computed for T + 2 when the break occurs at T + 1, the random walk forecasts of the regressors are preferred when the break occurs in the relevant or all regressors. Looser significance levels tend to do well here. If the breaks occur in the irrelevant regressors, including even one can already be poisonous, and the AR(1) in y performs best.
There are substantial differences in the forecast performance of the two robust devices RWX and RDX. The former is the random walk for the regressor, and works best, except if the break drives the process towards a unit root. In that case, the differenced AR(1) for x gives a higher weight to the previous value. However, when the type of break is unknown, represented by the average performance here, the simple random walk dominates. Table 12, rather arbitrarily, averages over all experiments and horizons. It shows that pooling provides some protection against different states of nature, just inching ahead of the autoregression in y. After that come the methods that ignore regressors, followed by using an AR(1), random walk, or Cardt, to forecast the regressors. However, if we know that a break has happened in the regressors, we should switch to modeling them, at least until the break is out of the system again.
The variation in MSFEs across α is very small for intermediate values of α relative to the variation in MSFEs across break types and DGP designs. For moderate α the selection significance level does not have a large impact on forecast performance. This is an encouraging finding showing that forecast performance is relatively unaffected by the precise choice of significance level for selection when using Autometrics, despite a range of noncentralities and numbers of relevant and irrelevant exogenous variables.

Conclusions
This paper investigates the choice of significance level and its associated critical value when selecting forecasting models, both analytically in a static bivariate setting where there are location shifts at the forecast origin, and in more general simulation experiments. The theory suggests that variables should be retained if their noncentralities exceed 1, which translates to $c_\alpha^2 = 2$ at the boundary. This result holds regardless of whether location shifts affect the variable about which a retention decision is made. Undertaking selection at such loose significance levels implies that fewer relevant variables will be excluded when they contribute to forecast accuracy, but that more variables will be retained by chance because they happen to be in a draw that results in statistical significance at the proposed critical value. Although retaining irrelevant variables that are subject to location shifts usually worsens forecast performance, their coefficient estimates will be driven towards zero when updating estimates as the horizon moves forward.
Although the static design is simple, it produces several generic analytical results. Those results hold regardless of whether the regressors are contemporaneous or lagged, although the timing of location shifts is fundamental. Dynamics will slow adjustment to new equilibria, but this would not change the essence of the results. The inflation forecasts illustrated the analytic results, with a loose selection significance level of 16% being preferred for both the known regressors and the random walk forecasts for unknown regressors case.
The simulation evidence examines a wide range of experimental designs and despite the disparate outcomes, they provide some guidance for forecasting. The ideal scenario is obviously to have complete knowledge of the DGP, such that the empirical modeller knows the number and magnitude of both relevant and irrelevant regressors, and their future values, and hence whether and where breaks are likely to occur. In practice, no-one has the benefit of omniscience, and once the future values of regressors need to be forecast, selecting from a GUM that nests the DGP may cost little, relative to knowing the precise specification of the DGP.
The simulation results suggest that if the model is being used primarily for one-step-ahead forecasting with the aim of minimizing MSFE, selection at looser than standard selection significance levels may well help, and doing so will rarely hinder forecast performance. The results provide some support for selecting models at around 10% when there are approximately 15 regressors, many of which are irrelevant. This is close to the 16% derived theoretically in this paper when the number of irrelevant regressors is small. The simulation results also highlight the degree of complexity in pinning down the optimal selection rule for forecasting, with results depending on all aspects of the experimental design. A take-away for the forecaster is that pooling works well across many settings, suggesting that combining a robust device that minimizes systematic bias with a model-based forecast based on univariate methods is a good insurance policy. Moreover, methods that did not nest the DGP, such as the direct AR(1) forecast of the dependent variable and Cardt, also performed well, both matching commonly found empirical outcomes. However, if we know that a break has happened, one-step forecasts are improved by incorporating forecasts of the regressors.

We are especially grateful to Neil Ericsson and two anonymous referees for their careful reading and many helpful comments.

Conflicts of Interest: Doornik and Hendry have developed Autometrics, which is included in the
OxMetrics software package, and have a share in the returns.
Omitting $x_2$ from the forecasting equation leads to a forecast error and an MSFE for $M_2$ that involve $\sigma_\nu^2$, which is given in (A1).

Appendix B

Table A1. Ratio of MSFE to that of $\mathrm{MSFE}_1$. $T = 100$, otherwise as Table 3.

1 Clements and Hendry (1993) argue that the generalized forecast error second moment should be used to evaluate forecast performance instead of MSFE. In this case the results would be equivalent, because we focus on one-step-ahead forecasts.

2 UK quarterly consumer price index (CPI) is given by ONS series D7BT, which is the quarterly average of the monthly index. Annual inflation percentage is defined as $\pi_t = 100\Delta_4 \log \mathrm{D7BT}_t$. UK unemployment is the quarterly average of ONS series MGUK, LFS ILO unemployment rate (UK, All, Aged 16 and over, %, NSA).

3 Intermediate alternatives such as sub-sample estimation, recursive or rolling estimation could also be used.