“Go Wild for a While!”: A New Test for Forecast Evaluation in Nested Models

Abstract: In this paper, we present a new asymptotically normal test for out-of-sample evaluation in nested models. Our approach is a simple modification of a traditional encompassing test commonly known as the Clark and West test (CW). The key point of our strategy is to introduce an independent random variable that prevents the traditional CW test from becoming degenerate under the null hypothesis of equal predictive ability. Using the approach developed by West (1996), we show that in our test, the impact of parameter estimation uncertainty vanishes asymptotically. Using a variety of Monte Carlo simulations in iterated multi-step-ahead forecasts, we evaluated our test and CW in terms of size and power. These simulations reveal that our approach is reasonably well-sized, even at long horizons when CW may present severe size distortions. In terms of power, results were mixed, although CW has an edge over our approach. Finally, we illustrate the use of our test with an empirical application in the context of the commodity currencies literature.


Introduction
Forecasting is one of the most important and widely studied areas in time series econometrics. While there are many challenges related to financial forecasting, forecast evaluation is a key topic in the field. One of the challenges faced by the forecasting literature is the development of adequate tests to conduct inference about predictive ability. In what follows, we review some advances in this area and address some of the remaining challenges.
"Mighty oaks from little acorns grow". This is probably the best way to describe the forecast evaluation literature since the mid-1990s. The seminal works of Diebold and Mariano (1995) [1] and West (1996) [2] (DMW) have flourished in many directions, attracting the attention of both scholars and practitioners in the quest for proper evaluation techniques. See West (2006) [3], Clark and McCracken (2013a) [4], and Giacomini and Rossi (2013) [5] for great reviews on forecasting evaluation.
Considering forecasts as primitives, Diebold and Mariano (1995) [1] showed that under mild conditions on forecast errors and loss functions, standard time-series versions of the central limit theorem apply, ensuring asymptotic normality for tests evaluating predictive performance. West (1996) [2] considered the case in which forecasts are constructed with estimated econometric models. This is a critical difference with respect to Diebold and Mariano (1995) [1], since forecasts are now polluted by estimation error.
Building on this insight, West (1996) [2] developed a theory for testing population-level predictive ability (i.e., using estimated models to learn something about the true models). Two fundamental issues arise from West's contribution. Firstly, in some specific cases, parameter uncertainty is "asymptotically irrelevant"; hence, it is possible to proceed as proposed by Diebold and Mariano (1995) [1]. Secondly, although West's theory is quite general, it requires a full rank condition on the long-run variance of the objective function when parameters are set at their true values. A leading case in which this assumption is violated is in standard comparisons of mean squared prediction errors (MSPE) in nested environments.
As pointed out by West (2006) [3]: "A rule of thumb is: if the rank of the data becomes degenerate when regression parameters are set at their population values, then a rank condition assumed in the previous sections likely is violated. When only two models are being compared, "degenerate" means identically zero." West (2006) [3], page 117. Clearly, in the context of two nested models, the null hypothesis of equal MSPE means that both models are exactly the same, which generates the violation of the rank condition in West (1996) [2].
Forecast evaluations in nested models are extremely relevant in economics and finance for at least two reasons. Firstly, it is standard in financial econometrics to compare the predictive accuracy of a given model A with a simple benchmark that is usually generated from a model B nested in A (e.g., the 'no change' forecast). Some of the most influential empirical works, like Welch and Goyal (2008) [6] and Meese and Rogoff (1983, 1988) [7,8], have shown that outperforming naïve models is an extremely difficult task. Secondly, comparisons within the context of nested models provide an easy and intuitive way to evaluate and identify the predictive content of a given variable X: suppose the only difference between two competing models is that one of them uses the predictor X, while the other one does not. If the former outperforms the latter, then X has relevant information to predict the target variable.
Due to its relevance, many efforts have been undertaken to deal with this issue. Some key contributions are those of Clark and McCracken (2001, 2005) [9,10] and McCracken (2007) [11], who used a different approach that allows for comparisons at the population level between nested models. Although the derived asymptotic distributions are, in general, not standard, for some specific cases (e.g., no autocorrelation, conditional homoskedasticity of forecast errors, and one-step-ahead forecasts), the limiting distributions of the relevant statistics are free of nuisance parameters, and their critical values are provided in Clark and McCracken (2001) [9].
While the contributions of many authors in the last 25 years have been important, our reading of the state of the art in forecast evaluation coincides with the view of Diebold (2015) [12]: "[…] one must carefully tiptoe across a minefield of assumptions depending on the situation. Such assumptions include but are not limited to: (1) Nesting structure and nuisance parameters. Are the models nested, non-nested, or partially overlapping? (2) Functional form. Are the models linear or nonlinear? (3) Model disturbance properties. Are the disturbances Gaussian? Martingale differences? Something else? (4) Estimation sample. Is the pseudo-in-sample estimation period fixed? Recursively expanding? Something else? (5) Estimation method. Are the models estimated by OLS? MLE? GMM? Something else? And crucially: Does the loss function embedded in the estimation method match the loss function used for pseudo-out-of-sample forecast accuracy comparisons? (6) Asymptotics. What asymptotics are invoked?" Diebold (2015) [12], pages 3-4. Notably, the relevant limiting distribution generally depends on some of these assumptions.
In this context, there is a demand for straightforward tests that simplify the discussion in nested model comparisons. Of course, there have been some attempts in the literature. For instance, one of the most used approaches in this direction is the test outlined in Clark and West (2007) [13]. The authors showed, via simulations, that standard normal critical values tend to work well with their test, even though Clark and McCracken (2001) [9] demonstrated that this statistic has a non-standard distribution. Moreover, when the null model is a martingale difference and parameters are estimated with rolling regressions, Clark and West (2006) [14] showed that their test is indeed asymptotically normal. Despite this and other particular cases, as stated in the conclusions of West's (2006) [3] review: "One of the highest priorities for future work is the development of asymptotically normal or otherwise nuisance parameter-free tests for equal MSPE or mean absolute error in a pair of nested models. At present only special case results are available." West (2006) [3], page 131. Our paper addresses this issue.
Our WCW test can be viewed as a simple modification of the CW test. As noticed by West (1996) [2], in the context of nested models, the CW core statistic becomes degenerate under the null hypothesis of equal predictive ability. Our suggestion is to introduce an independent random variable with a "small" variance in the core statistic. This random variable prevents our test from becoming degenerate under the null hypothesis, keeps the asymptotic distribution centered around zero, and eliminates the autocorrelation structure of the core statistic at the population level. While West's (1996) [2] asymptotic theory does not apply for CW (as it does not meet the full rank condition), it does apply for our test (as the variance of our test statistic remains positive under the null hypothesis). In this sense, our approach not only prevents our test from becoming degenerate, but also ensures asymptotic normality relying on West's (1996) [2] results. In a nutshell, there are two key differences between CW and our test. Firstly, our test is asymptotically normal, while CW is not. Secondly, our simulations reveal that WCW is better sized than CW, especially at long forecasting horizons.
We have also demonstrated that "asymptotic irrelevance" applies; hence the effects of parameter uncertainty can be ignored. As asymptotic normality and "asymptotic irrelevance" apply, our test is extremely user friendly and easy to implement. Finally, one possible concern about our test is that it depends on one realization of one independent random variable. To partially overcome this issue, we have also provided a smoothed version of our test that relies on multiple realizations of this random variable.
Most of the asymptotic theory for the CW test and other statistics developed in Clark and McCracken (2001, 2005) [9,10] and McCracken (2007) [11] focused almost exclusively on direct multi-step-ahead forecasts. However, with some exceptions (e.g., Clark and McCracken (2013b) [15] and Pincheira and West (2016) [16]), iterated multi-step-ahead forecasts have received much less attention. In part for this reason, we evaluated the performance of our test (relative to CW) focusing on iterated multi-step-ahead forecasts. Our simulations reveal that our approach is reasonably well-sized, even at long horizons when CW may present severe size distortions. In terms of power, results have been rather mixed, although CW has frequently exhibited somewhat more power. All in all, our simulations reveal that asymptotic normality and size corrections come at a cost: the introduction of a random variable erodes some of the power of WCW. Nevertheless, we also show that the power of our test improves with a smaller variance of our random variable and with an average of multiple realizations of our test.
Finally, based on the commodity currencies literature, we provide an empirical illustration of our test. Following Chen, Rogoff, and Rossi (2010, 2011) [17,18]; Pincheira and Hardy (2018, 2019) [19-21]; and Pincheira and Jarsun (2020) [22], we evaluated the performance of the exchange rates of three major commodity exporters (Australia, Chile, and South Africa) when predicting commodity prices. Consistent with previous literature, we found evidence of predictability for some of the commodities considered in this exercise. Particularly strong results were found when predicting the London Metal Exchange Index, aluminum, and tin. Fairly interesting results were also found for oil and the S&P GSCI. The South African rand and the Australian dollar have a strong ability to predict these two series. We compared our results using both CW and WCW. At short horizons, both tests led to similar results. The main differences appeared at long horizons, where CW tended to reject the null hypothesis of no predictability more frequently. Based on the lessons learned from our simulations, we can think of two possible explanations for these differences: Firstly, they might be the result of CW displaying more power than WCW. Secondly, they might be the result of CW displaying a higher false discovery rate relative to WCW. Let us recall that CW may be severely oversized at long horizons, while WCW is better sized. These conflicting results between CW and WCW might act as a warning of a potential false discovery of predictability. As a consequence, our test brings good news to careful researchers who seriously wish to avoid spurious findings.
The rest of this paper is organized as follows. Section 2 establishes the econometric setup and forecast evaluation framework, and presents the WCW test. Section 3 addresses the asymptotic distribution of the WCW, showing that "asymptotic irrelevance" applies. Section 4 describes our DGPs and simulation setups. Section 5 discusses the simulation results. Section 6 provides an empirical illustration. Finally, Section 7 concludes.

Econometric Setup
Consider the following two competing nested models for a scalar target variable y_{t+1}:

y_{t+1} = X_t'β + e_{1,t+1} (model 1: null model)

y_{t+1} = X_t'β + Z_t'γ + e_{2,t+1} (model 2: alternative model)

where e_{1,t+1} and e_{2,t+1} are both zero-mean martingale difference processes, meaning that E(e_{i,t+1}|F_t) = 0 for i = 1, 2, and F_t stands for the sigma field generated by current and past values of X_t, Z_t and e_{i,t}. We will assume that e_{1,t+1} and e_{2,t+1} have finite and positive fourth moments.
When the econometrician wants to test the null using an out-of-sample approach in this econometric context, Clark and McCracken (2001) [9] derived the asymptotic distribution of a traditional encompassing statistic used, for instance, by Harvey, Leybourne, and Newbold (1998) [23] (other examples of encompassing tests include Chong and Hendry (1986) [24] and Clements and Hendry (1993) [25], to name a few). In essence, the ENC-t statistic proposed by Clark and McCracken (2001) [9] studies the covariance between e_{1,t+1} and (e_{1,t+1} − e_{2,t+1}). Accordingly, letting ĉ_{t+1} = ê_{1,t+1}(ê_{1,t+1} − ê_{2,t+1}) with sample mean c̄, this test statistic takes the form:

ENC-t = √(P − 1) · c̄ / σ̂_c

where σ̂_c² is the usual variance estimator for ê_{1,t+1}(ê_{1,t+1} − ê_{2,t+1}) and P is the number of out-of-sample forecasts under evaluation (as pointed out by Clark and McCracken (2001) [9], the HLN test is usually computed with regression-based methods; for this reason, we use √(P − 1) rather than √P). See Appendix A.1 for two intuitive interpretations of the ENC-t test.
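As a concrete illustration, the ENC-t statistic can be computed directly from the two series of out-of-sample forecast errors. The following Python sketch is our own illustration (the function name and array inputs are not part of the paper); it uses the √(P − 1) scaling discussed above:

```python
import numpy as np

def enc_t(e1, e2):
    """ENC-t (HLN-type) statistic: tests whether model 1's errors are
    uncorrelated with (e1 - e2). Large positive values favor model 2.
    e1, e2: out-of-sample forecast errors of the two models (length P)."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    P = e1.size
    c = e1 * (e1 - e2)               # core statistic c_{t+1}
    c_bar = c.mean()
    sigma2 = np.var(c)               # usual variance estimator of c
    return np.sqrt(P - 1) * c_bar / np.sqrt(sigma2)
```

Note that if the two error series were identical, the core statistic would be identically zero and the denominator would vanish: this is exactly the degeneracy problem for nested models.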
The null hypothesis of interest is that γ = 0. This implies that both models share the same population parameters on X_t and that e_{1,t+1} = e_{2,t+1}. This null hypothesis is also equivalent to equality in MSPE. Even though West (1996) [2] showed that the ENC-t is asymptotically normal for non-nested models, this is not the case in nested environments. Note that one of the main assumptions in West's (1996) [2] theory is that the population counterpart of σ_c² is strictly positive. This assumption is clearly violated when models are nested. To see this, recall that under the null of equal predictive ability, γ = 0 and e_{1,t+1} = e_{2,t+1} for all t. In other words, the population prediction errors from both models are identical under the null and, therefore, e_{1,t+1}(e_{1,t+1} − e_{2,t+1}) = 0 for all t. It follows that the rank condition in West (1996) [2] cannot be met, as σ_c² = 0.
The main aim of our paper was to modify this ENC-t test to make it asymptotically normal under the null. Our strategy requires the introduction of a sequence of independent random variables θ_t with variance φ^2 and expected value equal to 1. It is critical to notice that θ_t is not only i.i.d., but also independent of X_t, Z_t and e_{i,t}.
In this case, under the null we have e_{1,t+1} = e_{2,t+1}; therefore, the core statistic of our test becomes

f_{t+1} = e_{1,t+1}(e_{1,t+1} − θ_{t+1}e_{2,t+1}) = e_{1,t+1}^2 (1 − θ_{t+1}),

which has zero mean (as Eθ_{t+1} = 1) and strictly positive variance whenever φ^2 > 0, while under the alternative its expected value is positive. Consequently, our test is one-sided. Finally, there are two possible concerns with the implementation of our WCW-t statistic. The first one is about the choice of Var(θ_t) = φ^2. Even though this decision is arbitrary, we give the following recommendation: φ^2 should be "small". The idea of our test is to recover asymptotic normality under the null hypothesis, something that can be achieved for any value of φ^2 > 0. However, if φ^2 is "too big", it may simply erode the predictive content under the alternative hypothesis, deteriorating the power of our test. Notice that a "small" variance for some DGPs could be a "big" one for others; for this reason, we propose to take φ as a small percentage of the sample counterpart of σ(e_2). As we discuss later in Section 4, we considered three different standard deviations with reasonable size and power results: σ(θ_t) ∈ {0.01 * σ(e_2), 0.02 * σ(e_2), 0.04 * σ(e_2)} (1 percent, 2 percent, and 4 percent of the standard deviation of e_{2,t+1}). We emphasize that σ(e_2) here denotes the sample standard deviation of the estimated forecast errors. Obviously, our test tends to be better sized as φ^2 grows, at the cost of some power.
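To make the construction concrete, a single realization of the WCW-t statistic can be sketched as follows (a hedged Python sketch: the normal distribution for θ_t matches the choice used in our simulations, while the function name and the `frac` argument are our own illustration):

```python
import numpy as np

def wcw_t(e1, e2, frac=0.02, rng=None):
    """One realization of the WCW-t statistic (illustrative sketch).
    theta_t is drawn i.i.d. N(1, phi^2), with phi = frac * std(e2),
    independently of the data, so the core statistic
    e1*(e1 - theta*e2) keeps a strictly positive variance under the null."""
    rng = np.random.default_rng() if rng is None else rng
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    P = e1.size
    phi = frac * e2.std()
    theta = rng.normal(loc=1.0, scale=phi, size=P)
    f = e1 * (e1 - theta * e2)                 # modified core statistic
    return np.sqrt(P - 1) * f.mean() / f.std()
```

Because θ_t has variance φ^2 > 0, the denominator stays strictly positive even when the two error series are identical, which is what restores asymptotic normality under the null.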
Secondly, notice that our test depends on K = 1 realization of the sequence θ_t. One reasonable concern is that this randomness could strongly affect our WCW-t statistic (even for "small" values of the φ^2 parameter). In other words, we would like to avoid significant changes in our statistic generated by the randomness of θ_t. Additionally, as we report in Section 4, our simulations suggest that using just one realization of the sequence θ_t may sometimes significantly reduce the power of our test relative to CW. To tackle both issues, we propose to smooth the randomness of our approach by considering K different WCW-t statistics constructed with different and independent sequences of θ_t. Our proposed test is the simple average of these standard normal WCW-t statistics, adjusted by the correct variance of the average, as follows:

WCW-t(K) = [ (1/K) Σ_{k=1}^{K} WCW_k ] / √[ (1/K^2)(K + 2 Σ_{i<j} ρ_{i,j}) ]   (1)

where WCW_k is the k-th realization of our statistic and ρ_{i,j} is the sample correlation between the i-th and j-th realizations of the WCW-t statistic. Interestingly, as we discuss in Section 4, using K = 2 leaves the size of our test roughly stable but significantly improves its power.
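A sketch of the adjusted average in Equation (1), again in Python (our own illustration; in particular, estimating ρ_{i,j} from the per-period core statistics of each realization is our assumption about the implementation):

```python
import numpy as np

def wcw_t_avg(e1, e2, K=2, frac=0.02, rng=None):
    """Adjusted average of K independent WCW-t realizations (sketch).
    The K statistics are averaged and rescaled by the standard deviation
    of the average, sqrt((K + 2*sum_{i<j} rho_ij) / K^2), where rho_ij
    is estimated from the per-period core statistics."""
    rng = np.random.default_rng() if rng is None else rng
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    P = e1.size
    phi = frac * e2.std()
    stats, cores = [], []
    for _ in range(K):
        theta = rng.normal(1.0, phi, size=P)   # independent sequence per k
        f = e1 * (e1 - theta * e2)
        stats.append(np.sqrt(P - 1) * f.mean() / f.std())
        cores.append(f)
    rho = np.corrcoef(np.vstack(cores))        # K x K sample correlations
    var_avg = rho.sum() / K**2                 # (K + 2*sum_{i<j} rho_ij)/K^2
    return np.mean(stats) / np.sqrt(var_avg)
```

Note that `rho.sum()` equals K + 2 Σ_{i<j} ρ_{i,j}, since the diagonal of the correlation matrix is one.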

Asymptotic Distribution
Since most of our results rely on West (1996) [2], here we introduce some of his results and notation. For clarity of exposition, we focus on one-step-ahead forecasts. The generalization to multi-step-ahead forecasts is cumbersome in notation but straightforward.
Let f_{t+1} = e_{1,t+1}(e_{1,t+1} − θ_{t+1}e_{2,t+1}) = (Y_{t+1} − X_t'β*)(Y_{t+1} − X_t'β* − θ_{t+1}[Y_{t+1} − X_t'β* − Z_t'γ*]) be our loss function. We use "*" to emphasize that f_{t+1} depends on the true population parameters. Let f̂_{t+1}(β̂_t) be the sample counterpart of f_{t+1}. Notice that f̂_{t+1}(β̂_t) relies on estimates of β*, and as a consequence, f̂_{t+1}(β̂_t) is polluted by estimation error. Moreover, notice the subindex in β̂_t: the out-of-sample forecast errors (ê_{1,t+1} and ê_{2,t+1}) depend on the estimates β̂_t constructed with the relevant information available up to time t. These estimates can be constructed using either rolling, recursive, or fixed windows. See West (1996, 2006) [2,3] and Clark and McCracken (2013a) [4] for more details about out-of-sample evaluations.
Let Ef = E[e_{1,t+1}(e_{1,t+1} − θ_{t+1}e_{2,t+1})] be the expected value of our loss function. As considered in Diebold and Mariano (1995) [1], if predictions do not depend on estimated parameters, then under weak conditions we can apply the central limit theorem:

√P (f̄ − Ef) →d N(0, S_ff)   (2)

where f̄ = P^(−1) Σ f_{t+1} and S_ff > 0 stands for the long-run variance of the scalar f_{t+1}. However, one key technical contribution of West (1996) [2] was the observation that when forecasts are constructed with estimated rather than true, unknown, population parameters, some terms in expression (2) must be adjusted. We remark here that we observe f̂_{t+1} = ê_{1,t+1}(ê_{1,t+1} − θ_{t+1}ê_{2,t+1}) rather than f_{t+1} = e_{1,t+1}(e_{1,t+1} − θ_{t+1}e_{2,t+1}). To see how parameter uncertainty may play an important role, under Assumptions A.1-A.4 in Appendix A, West (1996) [2] showed that a second-order expansion of f̂_{t+1}(β̂_t) around β* yields

P^(−1/2) Σ [f̂_{t+1}(β̂_t) − Ef] = P^(−1/2) Σ [f_{t+1}(β*) − Ef] + F B (√P H̄) + o_p(1)   (3)

where F = E(∂f_{t+1}/∂β)|_{β=β*}, R denotes the length of the initial estimation window, and T is the total sample size (T = R + P), while B and H will be defined shortly.
Recall that in our case, under the null hypothesis, Ef = E[e_{1,t+1}(e_{1,t+1} − θ_{t+1}e_{2,t+1})] = 0; hence, expression (3) is equivalent to

P^(−1/2) Σ f̂_{t+1}(β̂_t) = P^(−1/2) Σ f_{t+1}(β*) + F B (√P H̄) + o_p(1).

Note that according to West (2006) [3], p. 112, and in line with Assumption 2 in West (1996) [2], pp. 1070-1071, the estimator of the regression parameters satisfies β̂_t − β* = B(t)H̄(t), where: (a) B(t) converges almost surely to B, a matrix of rank k; (b) H̄(t) = t^(−1) Σ_{s=1}^{t} h_s(β*) if the estimation method is recursive, where h_s(β*) is a q×1 orthogonality condition that is satisfied at the population parameters (notice that H̄ = P^(−1) Σ_t H̄(t)); and (c) E h_s(β*) = 0. As explained in West (2006) [3]: "Here, h_t can be considered as the score if the estimation method is ML, or the GMM orthogonality condition if GMM is the estimator. The matrix B(t) is the inverse of the Hessian if the estimation method is ML or a linear combination of orthogonality conditions when using GMM, with large sample counterparts B." West (2006) [3], p. 112.
Notice that Equation (3) clearly illustrates that P^(−1/2) Σ ê_{1,t+1}(ê_{1,t+1} − θ_{t+1}ê_{2,t+1}) can be decomposed into two parts. The first term on the RHS is the population counterpart, whereas the second term captures the sequence of estimates of β* (in other words, terms arising because of parameter uncertainty). Then, as P, R → ∞, we can apply the expansion in West (1996) [2] as long as Assumptions A.1-A.4 hold. The key point is that a proper estimation of the variance in Equation (3) must account for: (i) the variance of the first term on the RHS (S_ff = φ^2 E(e_{1,t+1}^4) > 0, i.e., the variance when there is no uncertainty about the population parameters), (ii) the variance of the second term on the RHS, associated with parameter uncertainty, and (iii) the covariance between both terms. Notice, however, that parameter uncertainty may be "asymptotically irrelevant" (hence (ii) and (iii) may be ignored) in the following cases: (1) P/R → 0 as P, R → ∞, (2) a fortunate cancellation between (ii) and (iii), or (3) F = 0.
In our case, F = E(∂f_{t+1}/∂β). Note that under the null, γ* = 0 and e_{1,t+1} = e_{2,t+1}, and recall that Eθ_{t+1} = 1; therefore,

E(∂f_{t+1}/∂β) = E[−2X_t e_{1,t+1}(1 − θ_{t+1})] = 0.

With a similar argument, it is easy to show that E(∂f_{t+1}/∂γ) = 0. This result follows from the fact that we define e_{1,t+1} as a martingale difference with respect to X_t and Z_t.
Hence, in our case, "asymptotic irrelevance" applies, as F = 0, and Equation (3) reduces simply to

P^(−1/2) Σ f̂_{t+1}(β̂_t) = P^(−1/2) Σ f_{t+1}(β*) + o_p(1).

In other words, we can simply replace true errors with estimated out-of-sample errors and forget about parameter uncertainty, at least asymptotically.

Monte Carlo Simulations
In order to capture features from different economic/financial time series and different modeling situations that might induce a different behavior in the tests under evaluation, we considered three DGPs. The first DGP (DGP1) relates to the Meese-Rogoff puzzle and matches exchange rate data (Meese and Rogoff (1983, 1988) [7,8] found that, in terms of predictive accuracy, many exchange rate models perform poorly against a simple random walk). In this DGP, under the null hypothesis, the target variable is simply white noise. In this sense, DGP1 mimics the low persistence of high-frequency exchange rate returns. While in the null model there are no parameters to estimate, under the alternative model there is only one parameter that requires estimation. Our second DGP matches quarterly GDP growth in the US. In this DGP, under the null hypothesis, the target variable follows an AR(1) process with two parameters requiring estimation. In addition, the alternative model has four extra parameters to estimate. Differing from DGP1, in DGP2, parameter uncertainty may play an important role in the behavior of the tests under evaluation. DGP1 and DGP2 model stationary variables with low persistence, such as exchange rate returns and quarterly GDP growth. To explore the behavior of our tests with a series displaying more persistence, we considered DGP3. This DGP is characterized by a VAR(1) model in which both the predictor and the predictand are stationary variables that display relatively high levels of persistence.
In a nutshell, there are three key differences in our DGPs: persistence of the variables, the number of parameters in the null model, and the number of excess parameters in the alternative model (according to Clark and McCracken (2001) [9], the asymptotic distribution of the ENC-t, under the null hypothesis, depends on the excess of parameters in the alternative model-as a consequence, the number of parameters in both the null and alternative models are key features of these DGPs).
To save space, we only report here results for recursive windows; results with rolling windows were, in general terms, similar, and they are available upon request. For large sample exercises, we considered an initial estimation window of R = 450 and a prediction window of P = 450 (T = 900), while for small sample exercises, we considered R = 90 and P = 90 (T = 180). For each DGP, we ran 2000 independent replications. We evaluated the CW test and our test, computing iterated multi-step-ahead forecasts at several forecasting horizons from h = 1 up to h = 30. As discussed at the end of Section 2, we computed our test using K = 1 and K = 2 realizations of our WCW-t statistic. Additionally, for each simulation, we considered three different standard deviations of θ_t: σ(θ_t) ∈ {0.01 * σ(e_2), 0.02 * σ(e_2), 0.04 * σ(e_2)} (1 percent, 2 percent, and 4 percent of the standard deviation of e_2). We emphasize that σ(e_2) is the sample standard deviation of the out-of-sample forecast errors, and it was calculated for each simulation.
Finally, we evaluated the usefulness of our approach using the iterated multi-step-ahead method for the three DGPs under evaluation (notice that the iterated method uses an auxiliary equation for the construction of the multi-step-ahead forecasts; here, we stretched the argument of "asymptotic irrelevance" and assumed that parameter uncertainty in the auxiliary equation plays no role). We report our results comparing the CW and WCW-t tests using one-sided standard normal critical values at the 10% and 5% significance levels (a summary of the results considering a 5% significance level can be found in the Appendix). For simplicity, in each simulation we considered only homoscedastic, i.i.d., normally distributed shocks.
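To fix ideas on the iterated approach, consider the simplest auxiliary case of an AR(1): the h-step-ahead forecast is obtained by substituting the fitted one-step-ahead equation into itself h times. The sketch below is our own minimal illustration (the paper's DGPs involve richer auxiliary equations):

```python
import numpy as np

def iterated_ar1_forecast(y, h):
    """Iterated h-step-ahead forecast from an AR(1) fitted by OLS.
    With y_{t+1} = a + b*y_t + e, iterating the one-step model gives
    E[y_{t+h} | y_t] = a*(1 + b + ... + b^(h-1)) + b^h * y_t."""
    y = np.asarray(y, float)
    Y = y[1:]
    X = np.column_stack([np.ones(y.size - 1), y[:-1]])
    a, b = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS fit of the one-step model
    geo = sum(b**j for j in range(h))             # 1 + b + ... + b^(h-1)
    return a * geo + b**h * y[-1]
```

For h = 1, this collapses to the usual one-step forecast a + b*y_T; for larger h, the forecast is generated recursively from the same estimated one-step equation, which is exactly the iterated (as opposed to direct) approach.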

DGP 1
Our first DGP assumes a white noise for the null model. We considered a case like this given its relevance in finance and macroeconomics. Our setup is very similar to the simulation experiments in Pincheira and West (2016) [16], Stambaugh (1999) [29], Nelson and Kim (1993) [30], and Mankiw and Shapiro (1986) [31].
Null model:

Y_{t+1} = e_{t+1}

Alternative model:

Y_{t+1} = βr_t + e_{t+1}

The null hypothesis posits that Y_{t+1} follows a no-change martingale difference. Additionally, the alternative forecast for multi-step-ahead horizons was constructed iteratively through an AR(p) on r_t. This is the same parametrization considered in Pincheira and West (2016) [16], and it is based on a monthly exchange rate application in Clark and West (2006) [14]. Therefore, Y_{t+1} represents the monthly return of a U.S. dollar bilateral exchange rate and r_t is the corresponding interest rate differential.
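A scaled-down sketch of the h = 1 size exercise under this null is given below. This is our own illustration, not the paper's design: it uses fewer replications and smaller samples than the 2000 replications with R = P = 450 reported in the tables, and the AR(1) predictor with a 0.95 coefficient is an arbitrary stand-in for the interest rate differential. The alternative uses a single slope parameter on r_t, as described above.

```python
import numpy as np

def wcw_stat(e1, e2, frac, rng):
    """One WCW-t realization: theta ~ i.i.d. N(1, phi^2), independent of the data."""
    phi = frac * e2.std()
    theta = rng.normal(1.0, phi, size=e1.size)
    f = e1 * (e1 - theta * e2)
    return np.sqrt(e1.size - 1) * f.mean() / f.std()

def dgp1_size_check(n_sims=200, R=100, P=100, frac=0.04, seed=1):
    """Empirical rejection rate of WCW under the DGP1 null at h = 1."""
    rng = np.random.default_rng(seed)
    rej = 0
    for _ in range(n_sims):
        T = R + P
        x = np.zeros(T)
        for t in range(1, T):                 # persistent but irrelevant predictor
            x[t] = 0.95 * x[t - 1] + rng.normal()
        y = rng.normal(size=T)                # null: martingale-difference target
        e1, e2 = np.empty(P), np.empty(P)
        for i, t in enumerate(range(R, T)):
            e1[i] = y[t]                      # null model forecast is zero
            b = x[:t - 1] @ y[1:t] / (x[:t - 1] @ x[:t - 1])
            e2[i] = y[t] - b * x[t - 1]       # recursive one-parameter alternative
        if wcw_stat(e1, e2, frac, rng) > 1.282:   # one-sided 10% normal CV
            rej += 1
    return rej / n_sims
```

With a nominal size of 10%, the rejection rate returned by this sketch should hover around 0.10, in line with the size results reported for DGP1 in Section 5.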

DGP 2
Our second DGP is mainly inspired by macroeconomic data, and it was also considered in Pincheira and West (2016) [16] and Clark and West (2007) [13]. This DGP is based on models exploring the relationship between U.S. GDP growth and the Federal Reserve Bank of Chicago's factor index of economic activity.
Null model:

Y_{t+1} = α_0 + α_1 Y_t + e_{t+1}

Alternative model:

Y_{t+1} = α_0 + α_1 Y_t + γ_1 f_t + γ_2 f_{t−1} + γ_3 f_{t−2} + γ_4 f_{t−3} + e_{t+1}

where f_t denotes the economic activity factor, so the alternative model has four parameters in excess of the null. The parameters were set as in Pincheira and West (2016) [16].

Simulation Results
This section reports exclusively results for a nominal size of 10%. To save space, we considered only results with a recursive scheme. Results with rolling windows were similar, and they are available upon request. Results of the recursive method are more interesting to us for the following reason: for DGP1, Clark and West (2006) [14] showed that the CW statistic with rolling windows is indeed asymptotically normal. In this regard, the recursive method may be more interesting to discuss due to the expected departure from normality in the CW test. For each simulation, we considered θ_t i.i.d. normally distributed with mean one and variance φ^2. Tables 1-6 show results on size considering different choices of σ(θ_t) and K, as suggested at the end of Section 2. The last row of each table reports the average size for each test across the 30 forecasting horizons. Tables 7-12 are akin to Tables 1-6, but they report results on power. Similarly, the last row of each table reports the average power for each test across the 30 forecasting horizons. Our analysis with a nominal size of 5% carried the same message; a summary of these results can be found in the Appendix.

Table 1 reports results for the case of a martingale sequence (i.e., DGP1) using large samples (P = R = 450 and T = 900). From the second column of Table 1, we observed that the CW test was modestly undersized. The empirical size of nominal 10% tests ranged from 6% to 8%, with an average size across the 30 forecasting horizons of 6%. These results are not surprising. For instance, for the case of a martingale sequence, Clark and West (2006) [14] commented that: "our statistic is slightly undersized, with actual sizes ranging from 6.3% […] to 8.5%" Clark and West (2006) [14], pp. 172-173. Moreover, Pincheira and West (2016) [16], using iterated multi-step-ahead forecasts, found very similar results.

Simulation Results: Size
Our test seemed to behave reasonably well. Across the nine different exercises presented in Table 1, the empirical size of our WCW test ranged from 7% to 11%. Moreover, the last row indicates that the average size across our exercises ranged from 0.08 (σ(θ_t) = 0.01 * σ(e_2)) to 0.10 (e.g., all exercises considering σ(θ_t) = 0.04 * σ(e_2)). Notably, our results using "the highest variance", 0.04 * σ(e_2), ranged from 9% to 11%, with an average size of 10% in the two cases. As we discuss in the following section, in some cases this outstanding result comes at the cost of some reduction in power.

Table 2 is akin to Table 1, but considers simulations with small samples (P = R = 90 and T = 180). While the overall message was very similar, the CW test behaved remarkably well, with an empirical size ranging from 8% to 10% and an average size of 9%. Additionally, our test also showed good size behavior, but with mild distortions in some experiments. Despite these cases, in 6 out of 9 exercises, our test displayed an average size of 10% across different forecast horizons. The main message of Tables 1 and 2 is that our test behaves reasonably well, although there were no great improvements (nor losses) compared to CW.

Notes: Table 1 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ_t and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average size across the 30 forecasting horizons. σ(θ_t) is the standard deviation of θ_t and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e_2)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450).
We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Table 2 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average size across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 180 (R = 90 and P = 90). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach.

Table 3 reports our results for DGP2 using large samples (P = R = 450 and T = 900). In this case, the empirical size of the CW test ranged from 8% to 16%, with an average size of 13%. Notably, the CW test was undersized at "short" forecasting horizons (h ≤ 3) and oversized at long forecasting horizons (h ≥ 12). This is consistent with the results reported in Pincheira and West (2016) [16] for the same DGP using a rolling scheme: "[…] the CW test has a size ranging from 7% to 13%. It tends to be undersized at shorter horizons (h ≤ 3), oversized at longer horizons (h ≥ 6)." (Pincheira and West (2016) [16], p. 313).
In contrast, our test tended to be considerably better sized. Across all exercises, the empirical size of the WCW ranged from 8% to 12%, and the average size for each one of our tests was in the range of 10% to 11%. In sharp contrast with CW, our test had a "stable" size and did not become increasingly oversized with the forecasting horizon. In particular, for h = 30, the empirical size of our test across all exercises was exactly 10%, while CW had an empirical size of 15%. In this sense, our test offers better protection of the null hypothesis at long forecasting horizons. Table 4 is akin to Table 3, but considers a smaller sample. The overall message is similar; however, both CW and our test became oversized. Despite these size distortions in both tests, our test performed comparatively better than CW in almost every exercise. For instance, using a standard deviation of σ(θ) = 0.02 * σ(e₂) or σ(θ) = 0.04 * σ(e₂), our test was reasonably well-sized across all exercises. The worst results were found for σ(θ) = 0.01 * σ(e₂); even so, our worst exercise, with K = 2, was still better (or equally well) sized compared to CW for all horizons. That σ(θ) = 0.01 * σ(e₂) presents the worst results is, in fact, by construction: recall that for σ(θ) = 0 our test coincides with CW; hence, as the variance of θ becomes smaller, it is reasonable to expect stronger similarities between CW and our test. In a nutshell, Tables 3 and 4 indicate that our test is reasonably well sized, with some clear benefits compared to CW for long horizons (e.g., h ≥ 12), as CW becomes increasingly oversized.

Notes: Table 3 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1).
The last row reports average size across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Table 4 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average size across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 180 (R = 90 and P = 90). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach.
Finally, Tables 5 and 6 show our results for DGP3 using large samples (P = R = 450 and T = 900) and small samples (P = R = 90 and T = 180), respectively. The main message is very similar to that obtained from DGP2: CW was slightly undersized at short forecasting horizons (e.g., h ≤ 3) and increasingly oversized at longer horizons (h ≥ 12). In contrast, our test either did not exhibit this pattern with the forecasting horizon or, when it did, the pattern was milder. Notably, for long horizons (e.g., h = 30) our test was always better sized than CW. As in the previous DGP, our test worked very well using "the highest variance", σ(θ) = 0.04 * σ(e₂), and became increasingly oversized as the standard deviation approached zero. Importantly, using the two highest variances (σ(θ) = 0.02 * σ(e₂) and σ(θ) = 0.04 * σ(e₂)), our worst results were empirical sizes of 16%; in sharp contrast, the worst entries for CW were 20% and 22%.
All in all, Tables 1 through 6 provide a similar message: on average, our test seemed to be better sized, especially at long forecasting horizons. The size of our test improved with a higher σ(θ), but as we will see in the following section, sometimes this improvement comes at the cost of a mild reduction in power.

Notes: Table 5 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average size results across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Table 6 presents empirical sizes for the CW test and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average size results across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 180 (R = 90 and P = 90). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level.
Multistep-ahead forecasts were computed using the iterated approach.
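As a rough illustration of how the size entries in Tables 1-6 are produced, the sketch below computes an empirical rejection rate against the one-sided 10% standard normal critical value. The function `simulate_stat` is a hypothetical stand-in for one draw of the CW or WCW statistic under a given DGP; it is not the paper's actual simulation code.

```python
import numpy as np

def empirical_size(simulate_stat, n_sims=2000, crit=1.2816, seed=0):
    """Share of one-sided rejections under the null across Monte Carlo draws.

    simulate_stat(rng) must return one realization of the (standardized)
    test statistic under the null; crit is the one-sided 10% standard
    normal critical value used throughout the size tables.
    """
    rng = np.random.default_rng(seed)
    rejections = sum(simulate_stat(rng) > crit for _ in range(n_sims))
    return rejections / n_sims

# Sanity check: a statistic that is exactly N(0, 1) under the null
# should reject close to the nominal 10% rate.
size = empirical_size(lambda rng: rng.standard_normal())
```

A well-sized test keeps this rejection rate near 0.10; the oversizing reported for CW at long horizons corresponds to rates well above that.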

Simulation Results: Power
The intuition of our test is that we achieve normality by introducing a random variable that prevents the core statistic of the CW test from becoming degenerate under the null hypothesis. As reported in the previous section, our test tended to display better size than CW, especially at long horizons. The presence of this random variable, however, may also erode some of the predictive content of model 2, and consequently the power of our test. As we will see in this section, CW has an edge over WCW in terms of power (this was somewhat expected, since CW exhibits some important size distortions). Nevertheless, we noticed that the power of WCW improved with the number of realizations of θ (K) and with a smaller variance of θ (ϕ). Tables 7 and 8 report power results for DGP1, considering large and small samples, respectively. Table 7 shows results that are, more or less, consistent with the previous intuition: the worst results were found for the highest standard deviation (σ(θ) = 0.04 * σ(e₂)) and one sequence of realizations of θ (K = 1). In this sense, the good results in terms of size reported in the previous section came at the cost of a slight reduction in power. In this case, the average loss of power across the 30 forecasting horizons was about 6% (55% for CW and 49% for our "less powerful" exercise). Notice, however, that averaging two independent realizations of our test (e.g., K = 2) or reducing σ(θ) rapidly enhanced the power of our test. Indeed, with K = 2 and a low σ(θ), the power of our test became very close to that of CW. The best results in terms of power were found for the smallest variance. This can be partially explained by the fact that the core statistic of our test becomes exactly the CW core statistic as the variance of θ approaches zero. Table 8 shows results mostly along the same lines, although this time the figures are much lower due to the small sample.
Importantly, differences in terms of power were almost negligible between our approach and CW.

Notes: Table 7 presents power results for CW and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average power across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Same notes as in Table 7. The only difference is that in Table 8 the sample size was T = 180 (R = 90 and P = 90).
Tables 9 and 10 report power results for DGP2, considering large and small samples, respectively. In contrast to DGP1, power reductions using our approach are now important for some exercises. For instance, in Table 10, CW had 20% more rejections than our "less powerful" exercise. In this sense, asymptotic normality and good size results for σ(θ) = 0.04 * σ(e₂) came along with an important reduction in power. As noticed before, the power of our test rapidly improved with K > 1 or with a smaller σ(θ). For instance, in Table 10, for the case of σ(θ) = 0.04 * σ(e₂), considering K = 2 instead of K = 1 improved the average power from 37% to 43%. Moreover, keeping K = 2 and reducing σ(θ) to 0.01 * σ(e₂) made the differences in power compared to CW small.

Notes: Table 9 presents power results for CW and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports average power results across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Same notes as in Table 9. The only difference is that in Table 10 the sample size was T = 180 (R = 90 and P = 90).
Finally, Tables 11 and 12 report power results for DGP3, considering large and small samples, respectively. In most cases, reductions in power were small (if any). For instance, our "less powerful" exercise in Table 11 had an average power only 3% below CW (although there were some important differences at long forecasting horizons, such as h = 30). However, as commented previously, the power of our test rapidly improved when considering K = 2; in this case, differences in power were fairly small for all exercises. Notably, in some cases we found tiny (although consistent) improvements in power over CW; for instance, using the smallest standard deviation and K = 2, our test was "as powerful" as CW, and sometimes even slightly more powerful at longer horizons (e.g., h > 18).
All in all, our simulations reveal that asymptotic normality and size corrections come with a cost: the introduction of the random variable tended to erode some of the power of our test. In this sense, there is a tradeoff between size and power in the WCW test. Nevertheless, our results are consistent with the idea that power improves with an average of K realizations of θ, and with a smaller variance of θ (ϕ). An interesting avenue for further research would be to explore different strategies to optimize this size/power tradeoff (e.g., an optimal criterion for K and ϕ).

Notes: Table 11 presents power results for CW and different versions of our test when parameters were estimated with a recursive scheme. K is the number of independent realizations of the sequence of θ and h is the forecasting horizon. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). The last row reports the average power results across the 30 forecasting horizons. σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000 and the sample size was T = 900 (R = 450 and P = 450). We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 10% significance level. Multistep-ahead forecasts were computed using the iterated approach. Same notes as in Table 11. The only difference is that in Table 12 the sample size was T = 180 (R = 90 and P = 90).
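When K > 1, the statistic is described as an "adjusted average" of K independent WCW statistics (Equation (1)). The sketch below shows one natural form of that adjustment, scaling the sum by the square root of K so the combined statistic stays asymptotically standard normal; this scaling is our assumption about the form of Equation (1), not a quotation of it.

```python
import numpy as np

def adjusted_average(stats):
    """Combine K independent, asymptotically N(0,1) WCW statistics.

    Scaling the sum by 1/sqrt(K), rather than taking a plain mean,
    keeps the combined statistic asymptotically standard normal, so the
    same one-sided normal critical values apply for any K.
    """
    stats = np.asarray(stats, dtype=float)
    return stats.sum() / np.sqrt(stats.size)

# With K = 2 independent draws, the combined statistic keeps unit variance,
# whereas a plain mean would shrink the variance to 1/K.
rng = np.random.default_rng(1)
combined = np.array([adjusted_average(pair)
                     for pair in rng.standard_normal((20000, 2))])
```

The averaging motivates the power gains reported for K = 2: noise from any single sequence of θ partially cancels, while standard normal inference is preserved.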

Simulation Results: Some Comments on Asymptotic Normality
Our simulation exercises show that CW tends to become increasingly oversized with the forecasting horizon, while WCW tends to have a more "stable" size at long forecasting horizons. These results may, in part, be explained by a substantial departure from normality of CW as h grows. Using DGP2 with h = 12, 21, and 27, Figures 1-3 support this intuition: while CW showed a strong departure from normality, our WCW seemed to behave reasonably well. Table 13 reports the means and the variances of CW and WCW after 4000 Monte Carlo simulations. As both statistics were standardized, we should expect means around zero and variances around one (if asymptotic normality applies). Results in Table 13 are consistent with our previous findings: while the variance of CW was notably high for longer horizons (around 1.5 for h > 18), the variance of our test seemed to be stable with h and tended to improve with a higher σ(θ). In particular, for the last columns, the average variance of our test ranged from 1.01 to 1.02, and none of the entries were higher than 1.05 nor lower than 0.98. In sharp contrast, the average variance of CW was 1.32, ranging from 1.07 through 1.51. All in all, these figures are consistent with the fact that WCW is asymptotically normal.

Notes: Table 13 shows the mean and the variance of the CW and WCW statistics after 4000 Monte Carlo simulations. For this exercise, we considered large samples (P = R = 450 and T = 900). We evaluated CW and our test computing iterated forecasts.
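The Table 13 diagnostic can be reproduced in a few lines: simulate the standardized statistic many times and inspect its first two moments. The `simulate_stat` argument below is a hypothetical placeholder for a function returning one draw of CW or WCW under the null.

```python
import numpy as np

def moment_check(simulate_stat, n_sims=4000, seed=0):
    """Mean and variance of a standardized test statistic across simulations.

    Under asymptotic normality, the mean should be near 0 and the variance
    near 1; a variance well above 1 (as Table 13 reports for CW at long
    horizons) signals over-rejection when N(0,1) critical values are used.
    """
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_stat(rng) for _ in range(n_sims)])
    return draws.mean(), draws.var(ddof=1)

# Placeholder statistic: an exactly standard normal draw passes the check.
mean, var = moment_check(lambda rng: rng.standard_normal())
```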

Empirical Illustration
Our empirical illustration was inspired by the commodity currencies literature. Relying on the present value model for exchange rate determination (Campbell and Shiller (1987) [33] and Engel and West (2005) [34]), Chen, Rogoff, and Rossi (2010, 2011) [17,18], Pincheira and Hardy (2018, 2019) [19-21], and many others showed that the exchange rates of some commodity-exporting countries have the ability to predict the prices of the commodities being exported, as well as other closely related commodities. Based on this evidence, we studied the predictive ability of the exchange rates of three major commodity-producing economies frequently studied in this literature: Australia, Chile, and South Africa. To this end, we considered the following nine commodities/commodity indices: (1) WTI oil, (2) copper, (3) S&P GSCI: Goldman Sachs Commodity Price Index, (4) aluminum, (5) zinc, (6) LMEX: London Metal Exchange Index, (7) lead, (8) nickel, and (9) tin.
The source of our data was Thomson Reuters Datastream, from which we downloaded the daily close price of each asset. Our series were converted to a monthly frequency by sampling the last day of each month. The time period of our database went from September 1999 through June 2019 (the starting point of our sample was determined by the date on which monetary authorities in Chile decided to pursue a pure flotation exchange rate regime).
Our econometric specifications were mainly inspired by Chen, Rogoff, and Rossi (2010) [17] and Pincheira and Hardy (2018, 2019) [19-21]. Our null model was

Δlog(CP_{t+1}) = c₁ + ρ₁ Δlog(CP_t) + ε_{1,t+1},

while the alternative model was

Δlog(CP_{t+1}) = c₂ + β Δlog(ER_t) + ρ₂ Δlog(CP_t) + ε_{2,t+1},

where Δlog(CP_{t+1}) denotes the log-difference of a commodity price at time t+1, Δlog(ER_t) stands for the log-difference of an exchange rate at time t; c₁, ρ₁ are the regression parameters of the null model, and c₂, β, ρ₂ are the regression parameters of the alternative model. Finally, ε_{1,t+1} and ε_{2,t+1} are error terms. One-step-ahead forecasts are constructed in an obvious fashion through both models. Multi-step-ahead forecasts are constructed iteratively for the cumulative returns from t through t+h. To illustrate, let ŷ_t(1) be the one-step-ahead forecast from t to t+1 and ŷ_{t+1}(1) be the one-step-ahead forecast from t+1 to t+2; then, the two-step-ahead forecast is simply ŷ_t(1) + ŷ_{t+1}(1).
Under the null hypothesis of equal predictive ability, the exchange rate has no role in predicting commodity prices, i.e., H₀: β = 0. For the construction of our iterated multi-step-ahead forecasts, we assumed that Δlog(ER_t) follows an AR(1) process. Finally, for our out-of-sample evaluations, we considered P/R = 4 and a rolling scheme.
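The iterated construction of cumulative multi-step forecasts can be sketched with an AR(1) recursion; this is a minimal illustration under that assumption, and the function name is ours, not the paper's.

```python
def iterated_cumulative_forecast(c, rho, y_t, h):
    """Iterated h-step forecast of the cumulative return sum_{j=1..h} y_{t+j}
    from a fitted AR(1): y_{t+1} = c + rho * y_t + eps.

    Each step feeds the previous iterate back into the recursion, and the
    cumulative forecast adds up the one-step-ahead iterates; e.g., the
    two-step-ahead forecast is y_hat_t(1) + y_hat_{t+1}(1), as in the text.
    """
    level, total = y_t, 0.0
    for _ in range(h):
        level = c + rho * level  # next one-step-ahead iterate
        total += level
    return total

# h = 1 reduces to the usual one-step-ahead forecast c + rho * y_t.
one_step = iterated_cumulative_forecast(0.1, 0.5, 2.0, 1)  # 0.1 + 0.5 * 2.0
```

In the empirical application, the alternative model additionally needs iterated forecasts of Δlog(ER), which is where the assumed AR(1) process for the exchange rate enters.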
Following Equation (1), we took the adjusted average of K = 2 WCW statistics and considered σ(θ) = 0.04 * σ(e₂). Additional results using a recursive scheme, other splitting decisions (P and R), and different values of σ(θ) and K are available upon request. Tables 14 and 15 show our results for Chile and Australia, respectively; Table A8 in the Appendix reports our results for South Africa. Tables 14 and 15 show interesting results for the LMEX. In particular, the alternative model outperformed the AR(1) for almost every forecasting horizon, using either the Australian Dollar or the Chilean Peso. A similar result was found for aluminum prices when considering h ≥ 3. These results seem to be consistent with previous findings. For instance, Pincheira and Hardy (2018, 2019) [19-21], using the ENCNEW test of Clark and McCracken (2001) [9], showed that models using exchange rates as predictors generally outperformed simple AR(1) processes when predicting some base metal prices via one-step-ahead forecasts.
Interestingly, using the Chilean exchange rate, Pincheira and Hardy (2019) [20] reported very unstable results for nickel and zinc at the monthly frequency; moreover, they reported some exercises in which they could not outperform an AR(1). This is again consistent with our results reported in Table 14.
Results of the CW and our WCW tests were similar. Most of the exercises tended to have the same sign and the statistics had similar "magnitudes". However, there are some important differences worth mentioning. In particular, CW tended to reject the null hypothesis more frequently. There are two possible explanations for this result: on the one hand, our simulations reveal that CW frequently had higher power; on the other hand, CW tended to be more oversized than our test at long forecasting horizons, especially for h ≥ 12. Table 14 can be understood using these two points. Both tests tended to be very similar at short forecast horizons; however, some discrepancies became apparent at longer horizons. Considering h ≥ 12, CW rejected the null hypothesis at the 10% significance level in 54 out of 81 exercises (67%), while the WCW rejected the null only 42 times (52%). Table 15 carries a similar message: CW rejected the null hypothesis at the 5% significance level in 49 out of 81 exercises (60%), while WCW rejected the null only 41 times (51%). The results for oil (C1) in Table 15 emphasize this fact: CW rejected the null hypothesis at the 5% significance level for most of the exercises with h ≥ 12, but our test only rejected at the 10% level. In summary, CW showed a higher rate of rejections at long horizons. The question here is whether this higher rate is due to higher size-adjusted power, or to a false discovery rate induced by an empirical size higher than the nominal size. While the answer to this question cannot be known for certain, a conservative approach, one that protects the null hypothesis, would suggest looking at these extra CW rejections with caution.

Notes: Table 14 shows out-of-sample results using the Chilean exchange rate as a predictor. We reported the test by CW and the WCW for P/R = 4 using a rolling window scheme.
C1 denotes WTI oil, C2: copper, C3: S&P GSCI: Goldman Sachs Commodity Price Index, C4: aluminum, C5: zinc, C6: LMEX: London Metal Exchange Index, C7: lead, C8: nickel, and C9: tin. Following Equation (1), we took the adjusted average of K = 2 WCW statistics and considered σ(θ) = 0.04 * σ(e₂). * p < 10%, ** p < 5%, *** p < 1%. Table 15 shows out-of-sample results using the Australian exchange rate as a predictor. We reported the test by CW and the WCW for P/R = 4 using a rolling window scheme. C1 denotes WTI oil, C2: copper, C3: S&P GSCI: Goldman Sachs Commodity Price Index, C4: aluminum, C5: zinc, C6: LMEX: London Metal Exchange Index, C7: lead, C8: nickel, and C9: tin. Following Equation (1), we took the adjusted average of K = 2 WCW statistics and considered σ(θ) = 0.04 * σ(e₂). * p < 10%, ** p < 5%, *** p < 1%.

Concluding Remarks
In this paper, we have presented a new test for out-of-sample evaluation in the context of nested models. We labelled this statistic the "Wild Clark and West (WCW)" test. In essence, we propose a simple modification of the CW (Clark and McCracken (2001) [9] and Clark and West (2006, 2007) [13,14]) core statistic that ensures asymptotic normality: in a sense, our approach turns a "non-normal distribution problem" into a "normal distribution" one, which significantly simplifies the discussion. The key point of our strategy was to introduce a random variable that prevents the CW core statistic from becoming degenerate under the null hypothesis of equal predictive accuracy. Using West's (1996) [2] asymptotic theory, we showed that "asymptotic irrelevance" applies, hence our test can ignore the effects of parameter estimation uncertainty. As a consequence, our test is extremely simple and easy to implement. This is important, since most characterizations of the limiting distributions of out-of-sample tests for nested models are non-standard. Additionally, they tend to rely on very specific sets of assumptions that, in general, are difficult for practitioners and scholars to follow. In this context, our test greatly simplifies the discussion when comparing nested models.
We evaluated the performance of our test (relative to CW), focusing on iterated multistep-ahead forecasts. Our Monte Carlo simulations suggest that our test is reasonably well-sized in large samples, with mixed results in power compared to CW. Importantly, when CW shows important size distortions at long horizons, our test seems to be less prone to these distortions and, therefore, it offers a better protection to the null hypothesis.
Finally, based on the commodity currencies literature, we provided an empirical illustration of our test. Following Chen, Rogoff, and Rossi (2010, 2011) [17,18] and Pincheira and Hardy (2018, 2019) [19-21], we evaluated the predictive performance of the exchange rates of three major commodity exporters (Australia, Chile, and South Africa) when forecasting commodity prices. Consistent with the previous literature, we found evidence of predictability for some of our sets of commodities. Although both tests tend to be similar, we did find some differences between CW and WCW. As our test tends to "better protect the null hypothesis", some of these differences may be explained by size distortions in the CW test at long horizons, but others are most likely explained by the fact that CW may, sometimes, be more powerful.
Extensions for future research include the evaluation of our test using the direct method to construct multi-step-ahead forecasts. Similarly, our approach seems to be flexible enough to be used in the modification of other tests. It would be interesting to explore, via simulations, its potential when applied to other traditional out-of-sample tests of predictive ability in nested environments.

An alternative interpretation goes along the lines of Clark and West (2006, 2007) [13,14]. In these papers, Clark and West showed that there is an equivalence between the ENC-t core statistic and an "adjusted mean squared prediction error (adj-MSPE)". In simple words, the CW test (or ENC-t test) tracks the behavior of MSPE differences between the forecasts coming from the nested and nesting models, but at the population level. No rejection of the null hypothesis means that the nested and nesting models are indistinguishable. Rejection of the null means that the two models are different and, furthermore, that forecasts from the bigger nesting model should have a lower population MSPE relative to the forecasts generated by the nested model.

Notes: Table A1 presents a summary of empirical sizes of the CW test and different versions of our test when parameters were estimated with a recursive scheme. Each entry reports the average size across the h = 30 exercises. Each row considers a different DGP. The first panel reports our results for large samples (P = R = 450, T = 900), while the second panel shows our results in small samples (P = R = 45, T = 90). K is the number of independent realizations of the sequence of θ. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000.
We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 5% significance level. Multistep-ahead forecasts were computed using the iterated approach.
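The adj-MSPE interpretation of the CW core statistic discussed above has a direct sample counterpart; the sketch below computes the familiar Clark-West t-statistic from realized values and the two models' forecasts. The i.i.d. standard error is a simplification: multi-step applications typically require a HAC estimator instead.

```python
import numpy as np

def cw_statistic(y, f1, f2):
    """Clark-West adjusted-MSPE t-statistic for nested model comparisons.

    y: realized values; f1: forecasts from the small (nested) model;
    f2: forecasts from the large (nesting) model.  The adjustment term
    (f1 - f2)^2 removes the extra noise that estimating the larger
    model's parameters adds under the null, so positive values favor
    the nesting model.
    """
    y, f1, f2 = map(np.asarray, (y, f1, f2))
    adj = (y - f1) ** 2 - ((y - f2) ** 2 - (f1 - f2) ** 2)
    # Simple i.i.d. standard error; multi-step applications would use a
    # HAC estimator (e.g., Newey-West) here.
    return adj.mean() / np.sqrt(adj.var(ddof=1) / adj.size)
```

The statistic is compared against one-sided standard normal critical values (e.g., 1.282 at the 10% level), which is exact for CW only under the conditions discussed in the text, and is the regime in which the WCW modification restores normality.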

Appendix A.7. Summary of Power Comparisons between CW and WCW Tests with Nominal Size of 5% for Our Three DGPs

Notes: Table A2 presents a summary of the empirical power of the CW test and different versions of our test when parameters were estimated with a recursive scheme. Each entry reports the average power across the h = 30 exercises. Each row considers a different DGP. The first panel reports our results for large samples (P = R = 450, T = 900), while the second panel shows our results in small samples (P = R = 45, T = 90). K is the number of independent realizations of the sequence of θ. When K > 1, our statistic was the adjusted average of the K WCW statistics, as considered in Equation (1). σ(θ) is the standard deviation of θ and it was set as a percentage of the standard deviation of the forecasting errors of model 2 (σ(e₂)). The total number of Monte Carlo simulations was 2000. We evaluated the CW test and our proposed test using one-sided standard normal critical values at the 5% significance level. Multistep-ahead forecasts were computed using the iterated approach.