Structural Break Tests Robust to Regression Misspecification

Structural break tests developed in the literature for regression models are sensitive to model misspecification. We show, analytically and through simulations, that the sup Wald test for breaks in the conditional mean and variance of a time series process exhibits severe size distortions when the conditional mean dynamics are misspecified. We also show that the sup Wald test for breaks in the unconditional mean and variance does not suffer the same size distortions, yet achieves similar power to its conditional counterpart. Hence, we propose using it as an alternative and complementary test for breaks. While the conditional tests based on dynamic regression models detect breaks in the mean and variance of the US unemployment growth and interest rate growth series around the Great Moderation, the evidence for these breaks disappears when using the unconditional tests. Therefore, there is no evidence of long-run mean or volatility shifts in unemployment growth and interest rate growth.


Introduction
There is a vast literature on alternative structural break tests, as well as empirical evidence that many economic indicators went through periods of structural change. Most structural break tests are developed for the slope parameters of a regression model (see inter alia Andrews 1993; Andrews and Ploberger 1994; Bai and Perron 1998; Ploberger and Krämer 1992).
Macroeconomic variables may often exhibit long-run mean shifts, that is, structural breaks in their unconditional mean. Mean shifts in unemployment rates, interest rates, GDP growth, inflation and other macroeconomic variables may signal permanent changes in the structure of the economy and are therefore themselves of interest to practitioners. Nevertheless, very few papers test for unconditional mean shifts; instead, most of the literature refers to "mean shifts" as breaks in the short-run conditional mean. Additionally, we propose testing for breaks in the unconditional mean as an alternative to testing for breaks in the conditional mean. Breaks in the unconditional mean are not equivalent, yet closely related, to breaks in the conditional mean, as long as the conditional mean is correctly specified. Aue and Horvath (2012), among others, illustrated tests for both types of breaks in a recent comprehensive review on structural break tests. Our extensive simulation study shows that the unconditional mean break test, corrected for autocorrelation, yields close to correctly-sized tests, while, for most common static and dynamic misspecifications in the conditional mean, the conditional mean tests are severely oversized. Moreover, the power of both tests is very similar, especially as the sample size increases. 3 Similar results hold for the unconditional versus conditional break tests in variance. Therefore, the approach of testing first for a break in the unconditional mean and variance of the variable of interest is not only complementary to the regression approach, but is also robust to alternative sources of misspecification.
There is a plethora of empirical evidence for breaks in the conditional mean and volatility of many US macroeconomic time series during the early and mid 1980s, associated with the Great Moderation (see, for example, Bataa et al. 2013; McConnell and Perez-Quiros 2000; Sensier and van Dijk 2004; Stock and Watson 2002). Most studies employ dynamic regression models to detect such breaks. Focusing on unemployment, industrial production and the real interest rate, we show that the unconditional mean and volatility tests mostly indicate no breaks in unemployment or industrial production growth. We further show that, while the conditional mean tests occasionally detect breaks, the implied break-point estimates vary across the dynamic specifications employed. Therefore, it is plausible that the breaks found by the conditional tests are spurious because of size distortions, or that they do not result in long-run breaks in the mean of unemployment or industrial production growth. We also find that the evidence for a break in the variance of unemployment rates and real interest rates around the Great Moderation is not as strong as previously shown in Stock and Watson (2002).
This paper is organized as follows: Section 2 defines the unconditional break tests in mean and variance and derives their asymptotic properties in a unified framework. Section 3 defines the conditional structural break tests in mean and variance. It contains asymptotic results for the conditional break tests under correct specification and misspecification. Section 4 presents the simulation evidence comparing the size and power of the conditional and unconditional break tests. Section 5 illustrates the difference between these alternative structural break test approaches with three empirical applications: the US civilian unemployment rate, the short-term real interest rate and the industrial production growth rate. A final section concludes. All the proofs are relegated to Appendix A.

Unconditional Mean and Variance Break Tests
In this section, we define the unconditional sup Wald test for an unknown break in the mean or variance of a dynamic univariate process. 4 To our knowledge, a test for an unknown unconditional mean break, adjusted for autocorrelation, is rarely used in the literature. 5 Most papers test for a break in the conditional mean of a series; when they intend to test for an unconditional mean break, they routinely test for a trend break or an intercept break instead, after specifying a conditional mean (see, e.g., Stock and Watson 2002).

3 The only case where our test has comparatively low power to the conditional mean test is a correctly specified dynamic model with an intercept very close to zero. This case is further discussed in Section 3.

4 Throughout the paper, we use the sup Wald test definition in Andrews (1993); alternative definitions of the sup Wald test are available, but they are not equivalent to the original sup Wald test in Andrews (1993) and should not be confused with it.

5 Even though UM tests are not routinely used, they are a special case of the HAC-adjusted conditional break-point test in, e.g., Bai and Perron (1998), when the only regressor is an intercept. In addition, a CUSUM (cumulative sum) variant of this test for iid data is in Pitarakis (2004). As shown in Appendix A, proof of Theorem 1, for unconditional break tests, there is an explicit asymptotic relationship between the CUSUM test and the sup Wald test. However, as the Appendix shows, the conclusions of the two tests based on asymptotic critical values are in general different. Since there is strong evidence for the non-monotonic power of the CUSUM test (see, e.g., Vogelsang 1999), the paper focuses on the sup Wald test instead.
Such approaches have the disadvantage that they are highly dependent on the correct specification of the conditional mean. They also do not shed light on unconditional mean shifts, which may not be equivalent to conditional mean shifts. Therefore, in this paper, we propose using UM break tests complementarily to CM break tests, to uncover long-run mean shifts in the presence of potential static and dynamic misspecification.
We denote the unconditional mean by UM, and the unconditional variance by UV henceforth. In contrast to UM breaks, UV breaks are routinely tested in applications, for example to uncover the Great Moderation break. It is common to test for a break in the absolute value of the demeaned data, as a proxy for testing a variance break (see McConnell and Perez-Quiros 2000; Sensier and van Dijk 2004; Stock and Watson 2002). We call these tests UA (unconditional absolute deviation) break tests. One can also use the squared demeaned data to test for a variance break, as in Pitarakis (2004) and Qu and Perron (2007). We call these UV break tests, because they test directly for a variance break. 6 Below, we state the null asymptotic distributions of the UM, UA and UV break tests under fairly general assumptions on the data. These distributions do not depend on regressor, functional form, or seasonality misspecifications, simply because a conditional mean is not specified. The only misspecifications that affect the null asymptotic distribution of these tests are UM breaks for the UA and UV tests, and UV breaks for the UM test. Fortunately, this misspecification is easy to correct; we discuss this correction at the end of this section.
The true model takes the general form:

y_t = µ_1 1[t ≤ T_UM] + µ_2 1[t > T_UM] + u_t,    (1)

where µ_1, µ_2 are deterministic, the break point T_UM = [Tλ_UM] is an unknown, fixed fraction of the sample, 0 < λ_UM < 1, and u_t satisfies the assumption below, in which AVar = lim_{T→∞} Var.

Assumption 1.
(i) E(u_t) = 0 and AVar(T^(-1/2) ∑_{t=1}^{[Tλ]} u_t) = λ v_u for some 0 < v_u < ∞; (ii) u_t is near-epoch dependent in L_2-norm on σ(g_{t-m}, . . . , g_{t+m}), and {g_t} is either φ-mixing of size m^(-d/(2(d-1))) or α-mixing of size m^(-d/(d-2)). 7

With these assumptions, y_t can exhibit very general dependence (ARMA, GARCH, and nonlinear dependence), but it cannot have unit roots or UV breaks. 8 For a UM break, the null and alternative hypotheses are:

H_0^UM: µ_1 = µ_2    versus    H_A^UM: µ_1 ≠ µ_2.

For a UA break, let a_t = E|y_t − µ|, where µ = E(y_t), and test for a one-time break in a_t at an unknown point:

H_0^UA: a_1 = a_2    versus    H_A^UA: a_1 ≠ a_2.

For a UV break, let v_{u,t} = E(y_t − µ)^2, and test:

H_0^UV: v_{u,1} = v_{u,2}    versus    H_A^UV: v_{u,1} ≠ v_{u,2}.

Note that a break in the expected absolute value of a demeaned series is equivalent to a variance break only under certain conditions.

7 Here, ||·||_2 = (E|·|^2)^(1/2) stands for the L_2-norm, and |·| stands for the Euclidean norm.

8 For a proof that the most common GARCH model, GARCH(1,1), is near-epoch dependent and therefore fits our assumptions, see Hansen (1991).

The UM test is defined below. It is a special case of the Andrews (1993) sup Wald test when the only regressor is an intercept, and when the variance is estimated under the null of no break. Therefore, it is not new; nevertheless, to our knowledge, it is rarely used in the empirical literature in the form defined below:

UM*_T = sup_{λ ∈ [ε, 1−ε]} T λ(1−λ) (ȳ_{1λ} − ȳ_{2λ})^2 / v̂_{uλ},

where ε > 0 is a small cut-off, typically ε = 0.15 in applications, ȳ_{1λ} and ȳ_{2λ} are the sample means over {1, . . . , [Tλ]} and {[Tλ]+1, . . . , T}, and v̂_{uλ} is a HAC estimator of v_u. For the HAC estimator v̂_{uλ}, it is crucial to calculate it over the full sample, i.e., under the null H_0^UM. If we use sub-sample estimators in its computation, i.e., we estimate the variances of T^(1/2)(ȳ_{1λ} − µ) and T^(1/2)(ȳ_{2λ} − µ) separately, we need a separate bandwidth for each. Since the bandwidth estimation is only accurate in large samples, for those λ's that are close to ε and 1 − ε, such an estimation would be highly inaccurate, resulting in large size distortions. 9 Thus, we define:

v̂_{uλ} = γ̂_0 + 2 ∑_{j=1}^{T−1} k(j/(τ+1)) γ̂_j,    γ̂_j = T^(-1) ∑_{t=j+1}^{T} (y_t − ȳ)(y_{t−j} − ȳ),

where k(·) is throughout the paper the Bartlett kernel, with the optimal data-dependent bandwidth τ in Newey and West (1994). 10
The lag truncation parameter τ governs how many autocovariances are used in forming the nonparametric estimates f̂^(1) and f̂^(0), which estimate the first generalized derivative of the spectral density and the spectral density itself at frequency zero. 11 Therefore, f̂^(1), f̂^(0), and the resulting bandwidth are computed over the full sample.
The UA and UV tests are denoted by UA*_T and UV*_T. They are computed as UM*_T, but with y_t replaced by a_t = |y_t − ȳ| for UA and v_t = (y_t − ȳ)^2 for UV, and with v̂_{uλ} replaced by the HAC consistent estimator of the asymptotic variance of a_t or v_t.
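To fix ideas, the UM statistic with a full-sample Bartlett-kernel long-run variance can be sketched as follows. This is a minimal Python illustration with hypothetical function names, not the authors' code; the simple 4(T/100)^(2/9) rule-of-thumb bandwidth stands in for the data-dependent Newey and West (1994) bandwidth used in the paper:

```python
import numpy as np

def bartlett_lrv(u, bw):
    """Full-sample long-run variance with a Bartlett kernel (computed under the null)."""
    u = u - u.mean()
    T = len(u)
    v = u @ u / T  # gamma_0
    for j in range(1, bw + 1):
        # Bartlett weight times the j-th sample autocovariance
        v += 2.0 * (1.0 - j / (bw + 1.0)) * (u[j:] @ u[:-j]) / T
    return v

def sup_wald_um(y, trim=0.15, bw=None):
    """sup Wald test for one unknown break in the unconditional mean of y."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    if bw is None:
        bw = int(4 * (T / 100.0) ** (2.0 / 9.0))  # rule-of-thumb bandwidth (assumption)
    v_hat = bartlett_lrv(y, bw)  # crucially estimated over the FULL sample
    stats = []
    for k in range(int(np.ceil(trim * T)), int(np.floor((1 - trim) * T)) + 1):
        lam = k / T
        diff = y[:k].mean() - y[k:].mean()
        stats.append(T * lam * (1 - lam) * diff ** 2 / v_hat)
    return max(stats)
```

The UA and UV variants apply the same statistic to |y_t − ȳ| and (y_t − ȳ)^2, with the long-run variance recomputed for the transformed series.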
Define the distribution:

G_p = sup_{λ ∈ [ε, 1−ε]} ||B_p(λ) − λ B_p(1)||^2 / (λ(1−λ)),

where B_p(·) is a p × 1 vector of independent standard Brownian motions, for some p ≥ 1. As Theorem 1 shows, G_1 is the null asymptotic distribution of the UM, UA and UV break tests. Although the distribution of various break point tests under different (more restrictive) assumptions is available, an explicit proof for the UM, UV and UA tests under Assumption 1 is not available in a unified setting to our knowledge, and we provide it in Appendix A. For the UA, respectively UV, tests, we need the following additional assumptions.

9 Simulation evidence for this statement is available from the authors upon request.

10 Additional simulations not reported here show that the fixed optimal bandwidth proposed in Andrews (1991) leads to worse performance of the UM break test.

11 The weights mentioned in Newey and West (1994) are set equal to one, as usual for scalar cases.

Assumption 2.
Theorem 1. Let the model be as in (1), and let Assumption 1 hold. Then: (i) under H_0^UM, UM*_T ⇒ G_1; (ii) under H_0^UM, H_0^UA and Assumption 2(i), UA*_T ⇒ G_1; and (iii) under H_0^UM, H_0^UV and Assumption 2(ii), UV*_T ⇒ G_1. 12

Note that the distributions are non-standard, but critical values are available in, e.g., Andrews (1993) and Bai and Perron (1998). If there is a UM break, following Pitarakis (2004), we can obtain v_t and a_t via subsample demeaning, and Theorem 1(ii)-(iii) will hold. That is, we let v_t = (y_t − ȳ_t)^2 and a_t = |y_t − ȳ_t|, where ȳ_t equals the subsample mean before and after T̂_UM, and T̂_UM is the Bai and Perron (1998) OLS break-point estimator of T_UM in (1). If there is a UV break, the asymptotic distribution of the UM test is affected, but one can employ the fixed-regressor bootstrap in Hansen (2000) to correct for this. The correction for the UV tests via sub-sample demeaning is necessary and employed in our empirical analysis in Section 5.
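The subsample-demeaning correction above can be sketched as follows (a Python sketch with hypothetical helper names; the break date is estimated by least squares, in the spirit of the Bai and Perron (1998) OLS break-point estimator, before demeaning each regime):

```python
import numpy as np

def estimate_mean_break(y, trim=0.15):
    """Least-squares break-point estimator for a single break in the mean."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    best_k, best_ssr = None, np.inf
    for k in range(int(trim * T), int((1 - trim) * T) + 1):
        # total SSR when the mean is allowed to differ before/after candidate break k
        ssr = ((y[:k] - y[:k].mean()) ** 2).sum() + ((y[k:] - y[k:].mean()) ** 2).sum()
        if ssr < best_ssr:
            best_k, best_ssr = k, ssr
    return best_k

def subsample_demeaned(y, trim=0.15):
    """Demean each regime separately, so UA/UV tests remain valid under a UM break."""
    y = np.asarray(y, dtype=float)
    k = estimate_mean_break(y, trim)
    out = y.copy()
    out[:k] -= y[:k].mean()
    out[k:] -= y[k:].mean()
    return out
```

The corrected inputs are then a_t = |subsample_demeaned(y)| and v_t = subsample_demeaned(y)**2, fed into the UA and UV statistics.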

Correct Specification
Unlike unconditional break tests, regression-based break tests are pervasive in empirical work, despite their sensitivity to misspecification (this sensitivity is discussed in Section 3.2). The most common regression specification is of the linear form:

y_t = x_t' θ_1 1[t ≤ T_CM] + x_t' θ_2 1[t > T_CM] + ε_t,    (3)

where T_CM = [Tλ_CM], 0 < λ_CM < 1, x_t is a p × 1 vector of regressors that includes an intercept and possibly lagged dependent variables, and ε_t are scalar errors. We denote by CM, CA and CV the conditional mean, conditional absolute deviation and conditional variance tests, where the word "conditional" simply refers to specifying the conditional mean in (3). To derive the asymptotic distribution of the CM, CA and CV break tests, we need additional assumptions on the joint dependence of regressors and errors.
Note that we need the assumption T^(-1) ∑_{t=1}^{[Tλ]} x_t x_t' →_P λQ for two reasons. First, as explained in Hansen (2000), if this assumption does not hold, then the asymptotic distribution of the test statistic is not pivotal. Second, this assumption does not allow for unit roots in x_t, but it allows for lagged dependent variables in x_t.
The null and alternative hypotheses of the conditional tests are:

H_0^CM: θ_1 = θ_2    versus    H_A^CM: θ_1 ≠ θ_2.

The corresponding sup Wald test for a CM break is defined in, e.g., Andrews (1993):

CM*_T = sup_{λ ∈ [ε, 1−ε]} T (θ̂_{1λ} − θ̂_{2λ})' V̂_λ^(-1) (θ̂_{1λ} − θ̂_{2λ}),

where θ̂_{1λ}, θ̂_{2λ} are the OLS estimators of θ in Equation (3) in the subsamples {1, . . . , T_{1λ}} and {T_{1λ} + 1, . . . , T}, with T_{1λ} = [Tλ], and V̂_λ is a consistent estimator of AVar(T^(1/2)(θ̂_{1λ} − θ̂_{2λ})) under H_0^CM. For the conditional test, the asymptotic variance AVar(T^(1/2)(θ̂_{1λ} − θ̂_{2λ})) is routinely estimated over subsamples, i.e., separately for T^(1/2)(θ̂_{1λ} − θ_0) and T^(1/2)(θ̂_{2λ} − θ_0), or under the alternative. If a HAC estimator under the alternative were used, the same problems would arise as for the unconditional test: there would be size distortions due to inaccurate bandwidth estimation for λ close to the beginning or the end of the sample. However, in most studies, the conditional mean specification in (3) is assumed to be correct, in which case all relevant lags of the dependent variable are included as regressors, and correcting for autocorrelation is no longer necessary. If this is the case, the variance can be estimated under the alternative without further size distortions. Thus, as in most empirical studies, we use variance estimators that are not autocorrelation-robust in all the simulations except those where the model is static. In a static model, the researcher might suspect that the errors are autocorrelated, and a HAC estimator is justified.
For the theory section, we consider two potential estimators for AVar(T^(1/2)(θ̂_{1λ} − θ̂_{2λ})), under homoskedasticity or heteroskedasticity. The one under homoskedasticity is:

V̂_λ = σ̂^2 [ (T^(-1) ∑_{t=1}^{T_{1λ}} x_t x_t')^(-1) + (T^(-1) ∑_{t=T_{1λ}+1}^{T} x_t x_t')^(-1) ],    (4)

where σ̂^2 = T^(-1) ∑_{t=1}^{T} ε̂_t^2. The one under heteroskedasticity is:

V̂_λ = Q̂_{1λ}^(-1) Ω̂_{1λ} Q̂_{1λ}^(-1) + Q̂_{2λ}^(-1) Ω̂_{2λ} Q̂_{2λ}^(-1),    (5)

where Q̂_{jλ} and Ω̂_{jλ} are the subsample analogues of T^(-1) ∑ x_t x_t' and T^(-1) ∑ ε̂_t^2 x_t x_t'. We define the CA and CV tests as the UA and UV tests, but with a_t, v_t replaced by a_t = |ε̂_t|, v_t = ε̂_t^2, where ε̂_t = y_t − x_t' θ̂ are the residuals from estimating (3) under the null H_0^CM. We emphasize that the name "conditional" refers exclusively to pre-specifying the conditional mean in (3), and not the conditional variance of y_t or ε_t. Therefore, the tests in this paper should not be confused with the conditional variance tests proposed by, e.g., Andreou and Ghysels (2002), who wrote down a model for the conditional variance of ε_t.
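As an illustration, the CM statistic with subsample heteroskedasticity-robust variance estimators can be sketched as follows (a Python sketch under the stated assumptions, with hypothetical function names, not the authors' code):

```python
import numpy as np

def _ols_with_cov(X, y, robust=True):
    """OLS estimates and an estimate of the covariance matrix of the coefficients."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    e = y - X @ beta
    if robust:
        # White-type sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
        V = XtX_inv @ (X.T @ (X * e[:, None] ** 2)) @ XtX_inv
    else:
        # homoskedastic variant: sigma^2 (X'X)^-1
        V = (e @ e / len(y)) * XtX_inv
    return beta, V

def sup_wald_cm(y, X, trim=0.15, robust=True):
    """sup Wald test for a one-time break in all regression coefficients."""
    y, X = np.asarray(y, dtype=float), np.asarray(X, dtype=float)
    T = len(y)
    stats = []
    for k in range(int(trim * T), int((1 - trim) * T) + 1):
        b1, V1 = _ols_with_cov(X[:k], y[:k], robust)
        b2, V2 = _ols_with_cov(X[k:], y[k:], robust)
        d = b1 - b2
        # Wald statistic with the variance built from the two subsamples
        stats.append(float(d @ np.linalg.solve(V1 + V2, d)))
    return max(stats)
```

Here V1 and V2 estimate the finite-sample covariances of the subsample coefficient vectors, so no extra scaling by T is needed in the quadratic form.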
Theorem 2 states the asymptotic distribution of the CM, CA and CV break tests. Note that the distributions are similar to the unconditional break tests, but there are more degrees of freedom used up by the conditional break tests.
As for the UA and UV tests, the asymptotic distributions of the CA and CV tests are not valid if there is a CM break; in that case, as Pitarakis (2004) shows, the CM break point T_CM can be pre-estimated by T̂_CM, along with the slope parameters θ_1, θ_2 before and after the break, via the methods in Bai and Perron (1998). Then, we can redefine ε̂_t = y_t − x_t' θ̂_1 1[t ≤ T̂_CM] − x_t' θ̂_2 1[t > T̂_CM] in the computation of a_t, v_t, obtaining the same asymptotic distributions as stated in Theorem 2. Under the alternative H_A^CV, the asymptotic null distribution of the CM test is not valid, but, as for the unconditional break tests, it can be bootstrapped via the fixed-regressor bootstrap in Hansen (2000).
Note that the UM and CM tests in general test equivalent null hypotheses under correct specification, with some exceptions. Consider the following general AR(p) model with additional covariates:

y_t = α_j + ∑_{i=1}^{p} β_{ji} y_{t−i} + γ_j' x_t + ε_t,    with j = 1 for t ≤ T_CM and j = 2 for t > T_CM.

The CM statistic tests the null hypothesis H_0^CM: θ_1 = θ_2, where θ_j = (α_j, β_{j1}, . . . , β_{jp}, γ_j')' for j = 1, 2. The UM statistic tests the null hypothesis H_0^UM: µ_1 = µ_2, where µ_j = (α_j + [E(x_t)]' γ_j)/(1 − ∑_{i=1}^{p} β_{ji}). In principle, a change in any of the elements of θ_1 results in a change in µ_1. In addition, a small change in β_{1i}, a number usually between zero and one, typically results in a larger change in µ_1, which means that the UM test will tend to have larger power than the CM test against these changes. However, it is useful to note that, if α_j + [E(x_t)]' γ_j = 0 (for example, α_j = γ_j = 0 for j = 1, 2), then the unconditional mean is zero and the UM test will have no power against changes in the other parameters β_{ji}. Therefore, the UM test should be used only for series which do not have a zero unconditional mean over the whole sample, a condition that can be easily verified for any dataset before proceeding with the UM test.
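A quick numerical illustration of this mapping, using a pure AR(1) with no covariates (so µ = α/(1 − β); the parameter values are illustrative, not taken from the paper):

```python
def uncond_mean(alpha, beta):
    # AR(1) without covariates: y_t = alpha + beta * y_{t-1} + e_t,
    # so the unconditional mean is E(y_t) = alpha / (1 - beta).
    return alpha / (1.0 - beta)

# A slope break of only 0.1 moves the long-run mean by 0.5:
mu_pre = uncond_mean(1.0, 0.5)    # 2.0
mu_post = uncond_mean(1.0, 0.6)   # 2.5

# With a zero intercept, the same slope break leaves the unconditional mean
# at zero, so the UM test has no power against it:
mu0_pre = uncond_mean(0.0, 0.5)   # 0.0
mu0_post = uncond_mean(0.0, 0.6)  # 0.0
```

This is the leverage effect described above: the factor 1/(1 − β) amplifies slope breaks into larger unconditional mean breaks, except when the unconditional mean is exactly zero.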

Dynamic Misspecification
Unlike the unconditional break tests, all the conditional break tests are highly dependent on the correct specification of the functional form, including seasonality and dynamics. Bataa et al. (2013) and Altansukh (2013) empirically showed the effects of misspecifying the conditional mean seasonalities, outliers, dynamics and heteroskedasticity on the conditional break tests. Chong (2003) and Bai et al. (2008) theoretically showed that misspecification of the functional form leads to different null asymptotic distributions for the CM break tests. They focus on iid errors and static misspecifications, although some of their theoretical results apply to dynamic misspecification as well. The impact of dynamic misspecification of Equation (3) on conditional break tests was analyzed by Perron (1998), Vogelsang (1999), Perron and Yabu (2009), inter alia. However, all these studies correct for omitted autocorrelation in the errors by either better selection of lags in the regression equation, or directly correcting the error variance via HAC estimators. The first correction is successful only if the method used indeed selects the number of lags correctly. The second correction is not always valid if the regression model is already dynamic, as omitted autocorrelation in the errors often violates the exogeneity Assumption 3(i), so a HAC variance estimator does not fix the dynamic misspecification problems.
To our knowledge, the effect of misspecifying the regressors or the number of lags on CM break tests has not been studied before under general dependence and conditionally heteroskedastic data as allowed for in Assumption 4. The result in Theorem 3 is a generalization of the result in Chong (2003). The assumptions in Chong (2003) allow for lagged dependent variables; however, the author assumed that the error term is iid, and constructed the sup Wald test imposing this assumption (i.e., imposing homoskedasticity). Our results are more general than Chong (2003) in two ways. First, we allow the error term in the true model to be a near-epoch dependent process, thus also allowing for conditionally heteroskedastic series, which is empirically important for analyzing both macroeconomic and financial time series. Second, we construct the sup Wald test such that it corrects for heteroskedasticity. We prove below that the asymptotic distribution of the CM break test is data-dependent and different from that stated in Theorem 2. Therefore, in the presence of dynamic misspecification, the critical values of the CM tests will be incorrect 13 , while the critical values for the UM break test are correct. Thus, the UM break test provides a valuable tool for assessing the stability of the process y_t in the presence of dynamic misspecification.
To formalize the results under dynamic misspecification, let x_t = vec(x_{t(1)}, x_{t(2)}) and θ = vec(θ_(1), θ_(2)), where x_{t(1)}, θ_(1) are p_1 × 1, x_{t(2)}, θ_(2) are p_2 × 1, and p_1 + p_2 = p. 14 The true model is (3), but we mistakenly regress y_t only on x_{t(1)} (which we assume includes the intercept). Thus, we underspecify the number of regressors; in particular, we are interested in the effects of underspecifying the number of lags. 15 Here, vech(A, B) selects, in order, the unique elements and the first occurrence of the repeating elements of vec(A, B).
Assumption 4(iii) states that the omitted regressors are correlated with the included regressors, as is the case when the number of lags is underspecified. The rest of the statements in Assumption 4 are standard. Let r = p_1(1 + p_2 + (p_1 + 1)/2), the dimension of w_t. Then, under Assumption 4, the functional central limit theorem in (Wooldridge and White 1988, Theorem 2.11) can be applied to yield T^(-1/2) ∑_{t=1}^{[Tλ]} w_t ⇒ H^(1/2) B_r(λ). To state the asymptotic distribution of the CM break test under dynamic misspecification, we need B*_r(λ), which is constructed from B_r(λ) by repeating its elements exactly in the positions where w*_t = vec(k_t, L_t, M_t) repeats the elements of w_t = vech(k_t, L_t, M_t). Similarly, let H*^(1/2) and Ω* be positive semidefinite matrices constructed from H^(1/2) and Ω, which were defined in Assumption 4, so that the corresponding partial-sum limits hold for w*_t. With this notation, the asymptotic distribution of the CM test is stated in Theorem 3.

Theorem 3. Let Assumptions 3 and 4 and H_0^CM hold.
Comment 1. The theorem above shows that the asymptotic distribution of the CM test is nonstandard and highly dependent on the data parameters and the unknown number of lags omitted. Our theorem is a generalization of Theorem 3 in Chong (2003), who proved the same result, but under iid and conditionally homoskedastic errors, with CM*_T constructed only under homoskedasticity.

Comment 2. As we expect, in the absence of misspecification (θ_(2) = 0), we can show that Theorem 3 reduces to Theorem 2. To see this, note that, when θ_(2) = 0, the distribution in Theorem 3(i) reduces exactly to the distribution in Theorem 2. By similar arguments, Theorem 3(ii) also reduces to Theorem 2, as it should under conditionally homoskedastic errors.

Comment 3. The distributions in Theorem 3 are also the same as in Theorem 2 when the model is static, except that they have fewer degrees of freedom (p_1 instead of p). The reason for this is that the functional form of the model is still linear when misspecified, so it will be correctly specified for a modified version of the initial model. 16 To see the intuition for this result, note that, under the null hypothesis, the true model is y_t = x_{t(1)}' θ_(1) + x_{t(2)}' θ_(2) + ε_t. The parameters will be consistently estimated by OLS if we regress y_t on x_t = (x_{t(1)}', x_{t(2)}')'. However, the initial model can be rewritten as

y_t = x_{t(1)}' θ*_(1) + u_t.

This new model satisfies E(x_{t(1)} u_t) = 0 by construction, and therefore it is also correctly specified in the sense that regressing y_t on x_{t(1)} via OLS consistently estimates the new parameter θ*_(1). This would suggest that the test statistic CM*_T for breaks should have a similar distribution as under correct specification, but using only the p_1 degrees of freedom pertaining to x_{t(1)}. However, this intuition is only true for static models (by static models, we mean models where the long-run variance H* = AVar(T^(-1/2) ∑_{t=1}^{T} w_t) is equal to the short-run variance Ω* = T^(-1) ∑_{t=1}^{T} w_t w_t').
In such models, the matrix appearing in the limiting quadratic form is of the form H*^(1/2)' A H*^(1/2) and is therefore a projection matrix of rank p_1, acting as a selection matrix that keeps only the first p_1 elements of B*_r(λ). 17 It follows that the limit is the same distribution as in Theorem 2, but with p_1 degrees of freedom instead of p.
Comment 4. If the model is dynamic in the sense that the long-run and short-run variances differ (H* ≠ Ω*), then the distribution in Theorem 3 does not simplify, as the intuition outlined in Comment 3 no longer holds. For example, if the x_{t(1)} ε_t are autocorrelated, then the x_{t(1)} u_t are autocorrelated, but the CM*_T test does not correct for autocorrelation, yielding a more complicated distribution. Even if a HAC correction were employed, there is still the problem of dynamic misspecification: lags of x_{t(1)} ε_t might be correlated with lags of x_{t(1)} x_{t(2)}', yielding H* ≠ Ω* and therefore the more complicated distribution in Theorem 3.
Comment 5. Theorem 3 shows that, in the general case of dynamic misspecification (with Q_(12) ≠ 0), the usual critical values from Theorem 2 no longer apply. Allowing for conditional heteroskedasticity, Theorem 3 demonstrates that the size distortions of the CM test depend on several parameters of the data generating process, and that correcting for heteroskedasticity does not help in overcoming this problem.

Correct Specification and Various Misspecifications
The objective of the simulation analysis is to compare the size and power of the unconditional moment (UM/UV) break tests to their conditional moment (CM/CV) counterparts, under correct regression model specification, and under static and dynamic misspecification. We evaluated the size and power of the tests for alternative model specifications, sample sizes, as well as structural break sources and sizes. 18 We considered sample sizes T = 100, 200, 500, 1000 with a break in the middle of the sample, T_0 = [0.5T], and four data generating processes (DGPs). We also considered alternative break points, and our results are robust to T_0 = [0.25T] and T_0 = [0.75T]. For all simulations, we used the critical values reported in Andrews (2003). For DGPs with static errors, we calculated CM*_T with V̂_λ as described in (4). For a static DGP with i.i.d. errors, we calculated UM*_T without the HAC adjustment, but with split-sample variance estimators as for CM*_T, to make the comparison fair. 19 For the same purpose of fair comparison, for the DGP with static regressors and AR(1) errors, we used HAC-adjusted variance estimators for both tests. 20

We consider four DGPs, some of which we analyze under both correct specification and misspecification. The first DGP is a simple AR(1) model with iid errors:

DGP1: y_t = α_t + β_t y_{t−1} + ε_t,    with ε_t iid.

All simulations were performed in Matlab with 10,000 replications, 21 and for the AR models we used zero as the starting value and 100 burn-in observations. Under the null, α_t = α = 1, and the persistence parameter ranges over β_t = β ∈ [0.1, 0.7]. Under the alternative, there is one break either in the intercept, with α_t = 1 + δ_α 1[t > T_0] and δ_α ∈ (0, 2], or in the slope, with β_t = 0.1 + δ_β 1[t > T_0] and δ_β ∈ (0, 0.6]. For DGP1, we estimated only the correctly specified dynamic model. The sizes of the CM and UM break tests (under the null) are reported in the top panel of Table 1. Using the 5% critical values, we found that the UM test exhibits slightly better size for small sample sizes of T = 100 relative to the CM test, which yields size around 10%.
For large sample sizes of T > 500, both tests approached the nominal level, as expected. Under the alternative, we plot the size-adjusted power functions in Figure 1. When the break occurs in the slope parameter, the UM and CM tests have similar power as the sample size grows. The CM test performs only mildly better for moderate changes in the AR slope parameter (with maximum relative gains in power of 10% for T = 100).

18 The unconditional mean and variance sup Wald tests require a long-run variance estimator. We report the Newey and West (1994) HAC estimator with the data-dependent bandwidth therein and the Bartlett kernel, as explained in detail in Section 2. The Andrews (1991) fixed-bandwidth HAC estimator leads to slightly worse performance across all tests and designs; results are available upon request from the authors.

19 For a static DGP with i.i.d. errors (DGP3, detailed below), we report the power of the tests based on asymptotic critical values rather than size-adjusted powers, because the size distortions are minor.

20 The size-adjusted powers are computed as follows: for a DGP under the alternative of one break, we take the parameter values after the break, and use these parameter values for generating the DGP under the null, which will have the same sample size as the DGP under the alternative. We simulate this null DGP and take the 95% quantile of the empirical distribution of a test statistic as the empirical critical value to be used. We then simulate the corresponding DGP under the alternative, and calculate the empirical rejection frequency using the corresponding empirical critical values. Note that, by construction, all size-adjusted power plots start at 5%, which is the corresponding empirical rejection frequency for a DGP under the null of no break using its simulated empirical 95% critical values.

21 Only for the static model in DGP3, we used 2000 simulations, because they were sufficient to obtain accurate Monte Carlo results.
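The size-adjusted power computation described in footnote 20 can be sketched generically as follows (hypothetical function names; any break statistic can be plugged in as stat_fn, and the DGP callables are assumptions of this sketch):

```python
import numpy as np

def size_adjusted_power(stat_fn, draw_null, draw_alt, n_rep=400, level=0.05, seed=0):
    """Empirical critical value from the null DGP, then rejection rate under the alternative."""
    rng = np.random.default_rng(seed)
    # Simulate the null DGP (post-break parameter values, no break) to get the
    # empirical (1 - level) quantile of the test statistic.
    null_stats = np.sort([stat_fn(draw_null(rng)) for _ in range(n_rep)])
    crit = null_stats[int(np.ceil((1.0 - level) * n_rep)) - 1]
    # Rejection frequency under the alternative, using the empirical critical value.
    alt_stats = [stat_fn(draw_alt(rng)) for _ in range(n_rep)]
    return float(np.mean([s > crit for s in alt_stats]))
```

By construction, feeding the null DGP in as draw_alt returns a rejection frequency of roughly `level`, which is why the size-adjusted power plots start at 5%.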
On the other hand, when the break is in the intercept, the UM test has better power in small sample sizes (of T = 100, 200), with up to 20% gains vis-a-vis the CM test. 22

The second DGP is an AR(4) model with iid errors:

DGP2: y_t = α_t + ∑_{i=1}^{4} β_{t,i} y_{t−i} + ε_t,    with ε_t iid.

We set β_t = (β_{t,1}, 0.2, 0.15, 0.075) to represent the memory-decaying pattern encountered in many economic time series. Under the null, we set α_t = α = 1 and vary β_{t,1}. We analyze the impact of dynamic misspecification: the true DGP is an AR(4) model, but we estimate an AR(1) model or an AR(2) model instead. 23 The top two panels of Table 2 show that underestimating the number of lags causes severe size distortions of the CM test, of up to 60%, even for small levels of forgone persistence. This effect does not die out even for large sample sizes of T = 1000. In contrast, the UM test is not so severely oversized, especially for large samples; the size distortions reach a maximum of 13% for large samples of T = 1000. Our simulation results indicate that, although the HAC estimator that corrects for dynamics in the error term of the UM test may be less reliable in small samples, it results in much smaller size distortions than if we instead used a CM test for a misspecified model. In Section 4.2, we provide further evidence for this using DGP2; we consider information criteria to select the number of lags and show the performance of the CM test for various lag lengths.
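The mechanism behind these distortions can be seen in a small simulation: when AR(4) data with the decaying coefficient pattern above (here with β_{t,1} = 0.3, an illustrative value within the range considered) is fitted as an AR(1), the omitted lags load onto the fitted slope and leave serial dependence in the error term. A sketch, with hypothetical function names:

```python
import numpy as np

def simulate_ar4(T, alpha=1.0, betas=(0.3, 0.2, 0.15, 0.075), burn=100, seed=0):
    """Simulate DGP2-style AR(4) data with zero starting values and a burn-in."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T + burn)
    for t in range(4, T + burn):
        y[t] = alpha + sum(b * y[t - 1 - i] for i, b in enumerate(betas)) \
               + rng.standard_normal()
    return y[burn:]

def fit_ar1_slope(y):
    """OLS slope from (mis)regressing y_t on an intercept and y_{t-1} only."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    b = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return b[1]

y = simulate_ar4(2000)
slope = fit_ar1_slope(y)
# The fitted AR(1) slope exceeds the true first-lag coefficient 0.3 because it
# absorbs part of the omitted persistence at lags 2-4; the remainder stays in
# the error term, which is exactly what invalidates the CM critical values.
```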
The third DGP is a static model with iid errors:

DGP3: y_t = α_t + β_t X_t + ε_t,    with ε_t iid.

For DGP3, we analyzed both correctly specified and misspecified models. As expected, when we estimated the correctly specified static model in DGP3, the sizes of both the CM and UM break tests are close to the nominal size, as shown in the second panel of Table 1. The corresponding power curves in Figure 2 are again similar for the two tests, especially as T increases.
However, if instead we estimated an AR(1) model, the results in the third panel of Table 2 show that the UM test is undersized for small sample sizes and that its size improves for T > 500. In contrast, the CM test is oversized for small samples. For smaller samples, misspecifying the regressors compromises the power of the CM test which can be up to around 20% smaller than that of the UM test when T = 100.
The fourth DGP is a static model with AR(1) errors: $y_t = \alpha_t + \beta_t X_t + u_t$, with $u_t = \rho u_{t-1} + \nu_t$. For comparison purposes, in DGP4, $X_t$ and the null and alternative hypotheses are generated in the same way as in DGP3. For DGP4, we analyzed both correctly specified and misspecified models. Under correct specification, the last panel of Table 1 shows that the UM test is correctly sized for all sample sizes, whereas the CM test is oversized even for large sample sizes: the size of the CM test can reach 10% even when T = 1000 (and the nominal size is 5%).
Furthermore, we consider a nonlinear misspecification by estimating the model with $X_t^2$ instead of $X_t$, similar to Chong (2003). The nonlinear misspecification yields oversized CM tests across all sample sizes: the last panel of Table 2 shows that, even for T = 1000, the traditional CM test has a size of around 13%.
We now turn to the size and power of tests for breaks in the variance of the residuals of the regression models, comparing the UV and CV tests. 24 We considered the same DGPs as before, but set $\alpha_t = 1$, $\beta_t = 0.5$. For DGP1-3, we let $\varepsilon_t \sim \text{iid } N(0, \sigma_t)$, and, for DGP4, we let $\nu_t \sim \text{iid } N(0, \sigma_t)$. Under the null hypothesis, we fixed $\sigma_t = \sigma \in [1, 2.6]$. Under the alternative, we set $\sigma_t = 1 + \delta_\sigma \, \mathbb{1}[t > T_0]$ and let $\delta_\sigma \in (0, 1.6]$. As before, we estimated both correctly specified and misspecified models.
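The UV test in this design can be sketched as a sup Wald test for a one-time shift in the mean of the squared demeaned series, with a Newey-West long-run variance in the denominator. This is our illustrative reconstruction; the exact statistic, bandwidth rule, and trimming used in the paper are assumptions here.

```python
import numpy as np

def newey_west_lrv(u, bandwidth=None):
    """Newey-West long-run variance of a scalar series (demeaned inside)."""
    u = u - u.mean()
    T = len(u)
    if bandwidth is None:
        bandwidth = int(np.floor(4 * (T / 100) ** (2 / 9)))  # common rule of thumb
    lrv = u @ u / T
    for j in range(1, bandwidth + 1):
        w = 1 - j / (bandwidth + 1)  # Bartlett kernel weight
        lrv += 2 * w * (u[j:] @ u[:-j]) / T
    return lrv

def sup_wald_uv(y, trim=0.10):
    """Sup Wald test for a break in the unconditional variance: a break in
    the mean of z_t = (y_t - ybar)^2, with a HAC-corrected denominator."""
    z = (y - y.mean()) ** 2
    T = len(z)
    lrv = newey_west_lrv(z)
    lo, hi = int(trim * T), int((1 - trim) * T)
    stats = []
    for t0 in range(lo, hi):
        lam = t0 / T
        diff = z[:t0].mean() - z[t0:].mean()
        # Wald statistic for a mean difference at break fraction lam
        stats.append(T * lam * (1 - lam) * diff ** 2 / lrv)
    return max(stats)
```

The HAC denominator is what allows this unconditional test to remain approximately correctly sized when the (unmodeled) conditional mean dynamics induce serial correlation in $z_t$.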
When the estimated model is correctly specified, as in DGP1 and DGP4, the sizes of both the CV and UV tests are close to the nominal size for T ≥ 200, as shown in Table 3. However, the power of the two tests differs. Figure 4 shows that, for DGP1, the CV test has better power in small samples across all break sizes, including small breaks; this advantage narrows as T increases. For DGP4, Figure 4 shows that the power curves of the CV and UV tests are essentially the same.
If, instead, a misspecified model is estimated for DGP1-DGP4, the CV test appears to enjoy good size properties, as shown in Table 4. The exception is the oversizing reported in the top panel of Table 4, which is due to underestimating the lag order; in this case, the size does not improve as the sample increases. Our analysis thus shows that misspecifying the dynamics of the conditional mean of the regression model yields an oversized CV test.
Summarizing, the simulation results show that, under correct model specification, the UM/UV and CM/CV tests have similar size and power. In contrast, under static nonlinear and dynamic misspecifications, the CM/CV tests are severely oversized, with both finite-sample and large-sample distortions. While the UM/UV tests may also occasionally exhibit mild size distortions, they feature power properties similar to those of the CM/CV tests, especially in larger samples. Therefore, the UM/UV tests can be a valuable tool for detecting breaks because, in applied work, misspecification is likely to occur and bias the CM/CV break test results. 25

Dynamic Misspecification
In this section, we further analyze DGP2, given by the AR(4) model $y_t = \alpha + \beta y_{t-1} + \gamma_1 y_{t-2} + \gamma_2 y_{t-3} + \gamma_3 y_{t-4} + \varepsilon_t$, with i.i.d. $N(0,1)$ errors and 400 burn-in observations. We consider two variants of this model: DGP2-A and DGP2-B. DGP2-A is a typical DGP encountered in applied work, where the coefficients on the higher-order lags are smaller than those on the first and second lags ($\beta \in \{0.1, 0.2, 0.3\}$, $\gamma_1 = 0.2$, $\gamma_2 = 0.15$, $\gamma_3 = 0.075$). Note that the three empirical applications in Section 5 show similar patterns of smaller coefficients on higher-order lags. The second variant, DGP2-B, is less realistic, allowing the coefficient on the fourth lag to be larger than those on the other lags ($\beta = 0.1$, $\gamma_1 = 0$, $\gamma_2 = 0$, $\gamma_3 \in \{0.175, 0.275, 0.375\}$). Such a DGP is plausible if seasonality at lag four has not yet been removed from the data; it is encountered less often in macroeconomic datasets, because these are typically seasonally adjusted. Nevertheless, we include it for completeness.
In most empirical applications, the number of lags is selected with the AIC or BIC. Hence, we begin by showing the empirical distribution of the number of lags selected by the AIC and BIC in 10,000 simulations for both DGP2-A and DGP2-B. Figures 5-7 show that, for DGP2-A, both the AIC and BIC tend to select the number of lags incorrectly, even for sample sizes of T = 1000. In particular, the BIC underestimates the number of lags about 60% of the time. What is perhaps more surprising is that even the AIC underestimates the number of lags about 20% of the time in large samples. This result implies that neither the AIC nor the BIC is reliable for typical datasets in which the higher-order lags enter with smaller coefficients; unfortunately, these are exactly the datasets typically encountered in applications, for which the problem of dynamic misspecification of the conditional mean seems unavoidable.
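This lag-selection experiment is straightforward to reproduce in miniature. The sketch below simulates a DGP2-A-style AR(4) and records how often OLS-based BIC picks fewer than four lags; the coefficient values and BIC formula follow the text, while the burn-in handling and the 200-replication count are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar4(T, coefs=(0.2, 0.2, 0.15, 0.075), burn=400):
    """DGP2-A-style AR(4) with intercept 1 and i.i.d. N(0,1) errors."""
    y = np.zeros(T + burn)
    eps = rng.standard_normal(T + burn)
    for t in range(4, T + burn):
        y[t] = 1.0 + sum(c * y[t - j - 1] for j, c in enumerate(coefs)) + eps[t]
    return y[burn:]

def bic_lag(y, max_lag=8):
    """Lag order minimizing BIC among AR(1)..AR(max_lag), fit by OLS
    on a common effective sample."""
    best, best_bic = None, np.inf
    T = len(y) - max_lag
    Y = y[max_lag:]
    for p in range(1, max_lag + 1):
        X = np.column_stack([np.ones(T)] + [y[max_lag - j: max_lag - j + T] for j in range(1, p + 1)])
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        ssr = np.sum((Y - X @ beta) ** 2)
        bic = T * np.log(ssr / T) + (p + 1) * np.log(T)
        if bic < best_bic:
            best, best_bic = p, bic
    return best

# Fraction of replications in which BIC picks fewer than 4 lags
under = np.mean([bic_lag(simulate_ar4(500)) < 4 for _ in range(200)])
```

With the small coefficient of 0.075 on the fourth lag, the BIC penalty routinely outweighs the fit gain from including it, which is the mechanism behind the underestimation documented in Figures 5-7.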
For completeness, we also show the empirical distribution of the number of lags selected by the AIC and BIC for DGP2-B, for which the fourth coefficient is larger than the other three. Figures 8-10 show that, in this case, the BIC selects the true number of lags with probability approaching one as the sample size gets large. However, when β = 0.1 and γ3 = 0.175, it still selects only one lag about 10% of the time, even for large sample sizes of T = 1000. This implies that γ3 has to be quite large, reminiscent of models with uncorrected seasonality, for the number of lags to be selected correctly. We also notice that the AIC tends to select the true number of lags or a larger number, but only as the sample size rises to T = 500.
From these figures, we conclude that the AIC and BIC can both choose the number of lags incorrectly, leading to dynamic misspecification in the conditional mean used by the CM tests. In Tables 5 and 6, we show, for DGP2-A and DGP2-B, the empirical sizes of both the UM and CM tests for a nominal size of 5% and 10,000 simulations. For the CM test, we impose different lag lengths, from k = 0 to k = 8.
For DGP2-A, presented in Table 5, we notice severe size distortions when the number of lags is underspecified, with sizes of the CM test of up to 61.4% for one lag and 21.7% for two lags at T = 1000. When the number of lags is overspecified, the size distortions of the CM test remain severe for T = 100, with sizes of up to 69.9%, although they do decrease towards the nominal level as the sample size increases. The size distortions of the UM test are less severe for small sample sizes such as T = 100, reaching up to 22.1%, due to the HAC correction this test employs. As the sample size increases, the size of the UM test improves, to around 10% for T = 1000.
For DGP2-B in Table 6, we notice the same patterns. Given these results and Figures 8-10, we conclude that, even for seasonally unadjusted models such as DGP2-B, performing the CM test with the number of lags selected by the BIC is only reliable if the sample size is large; otherwise, the UM test is a good alternative.

Overall, the simulation results show that, under both static and dynamic misspecifications and for data generating processes typically encountered in applications, the CM/CV tests are severely oversized even in large samples, while the UM/UV tests only occasionally exhibit size distortions, which are relatively mild. Especially because the size distortions due to dynamic misspecification cannot be fixed by choosing the number of lags with information criteria, while imposing too many lags also leads to size distortions in small samples, as shown above, the UM/UV tests can be a valuable complementary alternative to the CM/CV tests.

Empirical Illustrations
This section illustrates the use of the UM/UV tests in conjunction with the CM/CV break tests for three US macroeconomic series: the civilian unemployment rate, industrial production growth, and short-term real interest rates. These variables were also examined for breaks in the conditional mean and volatility by Stock and Watson (2002) and Sensier and van Dijk (2004), among others, at the quarterly frequency. We perform the analysis on a monthly sample over January 1960-October 2014, with T = 658, to benefit from larger samples (except for the interest rates, for which data are available only from April 1960; for these, we use the sample April 1960-October 2014). Our data source is the FRED database at the Federal Reserve Bank of St. Louis. The unemployment rate is the civilian unemployment rate, and the real interest rate is computed as the annualized nominal three-month Treasury Bill rate minus the three-month CPI inflation rate over the sample period.
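The real-rate construction can be sketched as below. The inflation convention used here (the annualized three-month log difference of the CPI, i.e., 400 times the three-month log change for monthly data) is one common choice and an assumption on our part; the paper's exact construction may differ.

```python
import numpy as np

def real_rate(tbill_annualized, cpi, horizon=3):
    """Ex-post real rate: annualized nominal rate minus annualized
    `horizon`-month log CPI inflation (illustrative convention only)."""
    tbill = np.asarray(tbill_annualized, dtype=float)
    cpi = np.asarray(cpi, dtype=float)
    infl = np.full(cpi.shape, np.nan)
    # annualized horizon-month log inflation, in percent
    infl[horizon:] = (12.0 / horizon) * 100.0 * (np.log(cpi[horizon:]) - np.log(cpi[:-horizon]))
    return tbill - infl
```

With monthly FRED inputs such as the three-month T-Bill rate and the CPI index, the first `horizon` observations are lost to the inflation differencing, consistent with the interest-rate sample starting slightly later than the others.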
The unemployment rate and the real interest rates are analyzed in levels for two reasons: (1) both series are measured in rates, so they are bounded over the sample period; and (2) first differencing removes most of the variation in these two series. The industrial production series is typically treated as a unit root process with a possible trend; therefore, as in Stock and Watson (2002), we analyze the first differences of the logs of the series. Note that the unit root tests give mixed results, but they would be unreliable even if there were no breaks in these series, because the lag length selection may fail, as shown in the simulation section. We therefore show the ACF and PACF plots for these series in Figures 11-13. The ACF of the unemployment rate dies out reasonably fast for a monthly series, so if we believe that the mean of this series has no breaks, as argued in Section 5.1, then this series should indeed be treated as stationary. The ACF for the interest rates dies out much more slowly. However, if we believe that there are mean breaks in the interest rates, as argued in Section 5.2, then this plot is unreliable, and the interest rates should be treated as piecewise stationary with breaks. The ACF is very persistent for the industrial production series, and if we believe that this series has no mean breaks, as argued in Section 5.3, then it is non-stationary in the sense of having a unit root and/or a trend, and therefore needs to be first differenced. We refer to the first differences of the log of the industrial production series as industrial production growth in the rest of the paper.

We apply the UM/CM tests with 5% and 10% trimming to allow detection of potential breaks due to the recent economic crisis. However, the 5% trimming turns out to be problematic in many cases, as we argue in Sections 5.1-5.3. Therefore, we report the UV/CV tests only at the 10% trimming level.
For all tests, we use the critical values in Andrews (2003). 26 For all series, the CM and CV tests are employed for an AR(p) model with an intercept, where p ∈ {1, 4, 12}, as these are typical choices in the literature, or p is selected by the AIC or BIC. When using the AIC and BIC in Tables 7-13, we impose a maximum of 12 lags, because this is the typical maximum choice in applied work for monthly data. If we increase the maximum to 30 lags, then only the AIC selects more lags for some series; this is discussed explicitly below for the series for which it occurs.
We also report, for all series, a distributed lag model with an intercept and p lags, denoted DL(p), of each of two monthly factors: a macro factor, extracted from the mean of a large cross-section of economic and financial US series, and a macro uncertainty factor. The number of lags is selected with the AIC and BIC. The two factors are taken from Jurado et al. (2014), and their use is further motivated in Benigno et al. (2015), inter alia. Note that, as shown in Table 8, the macro factor may have a variance break in July 2008. Since the CM tests are not robust to changes in the variance of the regressors, as shown in Hansen (2000), we employ a simple correction to these CM tests in Tables 7, 10 and 12 by interacting the lags of the macro factor with the break in variance in July 2008, as recommended in Pitarakis (2004). Similarly, all variance tests in Tables 9, 11 and 13 are not valid if a break was detected in the unconditional or conditional mean of the particular model used. Therefore, we follow Pitarakis (2004) and correct the UV/CV tests for a potential mean break as follows. If, for a particular model, a mean break is detected by the UM or the CM test, then the implied break is imposed in the unconditional or conditional mean of the model (in the conditional mean, all regressors and the intercept are interacted with this break). From this model, the residuals are calculated, and the CV test then tests for a mean break in the squared residuals of the corresponding model.

Note: (a) a superscript * means that the test rejects the null of no breaks at the 5% level; (b) DL(p) refers to a distributed lag model with p lags of a given factor; and (c) the tests with the macro factors are corrected for a potential break in variance, as explained in Section 5.

26 For one test in Table 10, the critical values are not available, because the test entails 26 parameters, while critical values are available, to our knowledge, only for a maximum of 20 parameters.
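The mean-break correction of the variance tests described above can be sketched in a few lines; this is a minimal illustration of the mechanics (interact the intercept and all regressors with the detected break dummy, refit by OLS, pass squared residuals to the variance test), not the authors' exact code.

```python
import numpy as np

def break_adjusted_squared_residuals(y, X, t0):
    """Given a detected mean-break date t0, interact the design matrix X
    (which should include an intercept column) with the break dummy,
    refit by OLS, and return the squared residuals -- the series to which
    the variance break (CV) test is then applied."""
    T = len(y)
    d = (np.arange(T) >= t0).astype(float)          # break dummy
    Xb = np.column_stack([X, X * d[:, None]])       # pre/post-break coefficients
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ beta
    return resid ** 2
```

Testing for a mean shift in these squared residuals, rather than in the raw squared residuals, is what prevents a detected mean break from spuriously showing up as a variance break.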
However, from Andrews (2003), it is evident that the critical values are strictly increasing in the number of parameters, so it is reasonable to assume that the critical values for 26 parameters lie above those for 20 parameters.

It is worth noting that the unconditional mean over the full sample of each of our three macroeconomic series is significantly different from zero at the 1% level, 27 which implies that it is non-zero for at least some subsamples of the data; we therefore expect the UM test to have power against at least some structural change alternatives. In addition, for all three series, the full-sample coefficient estimates on higher-order lags in AR(p) models are typically smaller than the coefficients on the first two lags, so, from the simulation evidence in Section 4.2, we expect the conditional mean to be misspecified when choosing the number of lags with the AIC or BIC.

Table 10. Structural breaks in the mean of the industrial production growth.

[Table 10 reports, for each moment/model and trimming level, the statistic value, critical value, break fraction, and break date; the first panel lists the unconditional mean sup Wald tests.]

Unemployment Rate
Stock and Watson (2002) estimated AR(4) models for the quarterly difference in the US unemployment rate and found no break in the conditional mean over a shorter period, but that is likely because the series was first differenced, so much of its variation had been removed. When analyzing the unemployment rate in levels, Table 7 shows that the evidence for breaks in this series is inconclusive. The UM test indicates a break during the recent crisis when using a 10% cut-off, but not when using a 5% cut-off. Some of the CM tests also give mixed evidence: no break in the conditional (or short-run) mean at the 10% level, and one break during the recent crisis at the 5% level. The PACF in Figure 11 indicates spikes at lags 13 and 25 (and at higher lags), lags that are, to our knowledge, rarely used in applied work. Nevertheless, when rerunning the model selection with the AIC and BIC and a maximum of 30 lags, we find that the BIC selection does not change, but the AIC selects 25 lags. The CM test statistics for an AR(25) model are 51.508 at the 10% cut-off and 148.311 at the 5% cut-off, so in both cases they reject the null of no break, because they exceed the critical values of 43.47 and 44.46, respectively (these are critical values for 20 parameters and, as explained in Footnote 26, they only provide a lower bound for the true critical values, which are in this case not available). Nevertheless, these tests achieve their maximum value at the boundaries of the cut-off: 0.9 and 0.05.
Because at a 5% cut-off we only have 33 observations relative to 26 parameters to estimate, there are strong reasons to believe that these tests are plagued by numerical inaccuracies driven by too many parameters relative to the cut-off. Note that this numerical inaccuracy may also plague the CM test for the DL(12) model with the uncertainty factor at the 5% level. Moreover, the uncertainty factor is very persistent: an AR(1) model of the uncertainty factor yields a coefficient of 0.98 on the first lag. Because this factor is typically used in predictions in levels rather than in first differences, we report the CM tests only for models with the uncertainty factor in levels. However, note that the large values of the CM test can be explained by large size distortions in the presence of long-memory regressors, and therefore these results are also not to be fully trusted.
To summarize, it could be that the CM tests are oversized and spuriously reject the null because of heavy underspecification of the true number of lags, an explanation in line with our simulations in Section 4.2. However, this problem cannot be remedied by including more lags, because of the numerical issues above. As for the UM test, the HAC correction it employs may be inaccurate when a large number of lags is needed, which could explain why this test rejects the null hypothesis when using a 10% cut-off.
Next, we test for breaks in the variance of unemployment via the UV and CV tests; these results are reported in Table 9. The UV test and the CV test with 12 lags of the uncertainty factor in the conditional mean do not reject the null of no breaks. In contrast, the CV tests based on AR(p) models show a structural break in the conditional variance of unemployment in the mid-1980s, associated with the Great Moderation period. In Section 4.2, we showed that dynamic misspecification, especially underestimation of the number of lags, yields severely oversized CV tests even for T = 1000, while their power is comparable to that of the UV tests. This might explain the difference in results between the two tests. Alternatively, the results in Table 9 can be taken to suggest that, although there is no break in the unconditional (long-run) variance of unemployment, there is a structural change in the conditional (short-run) variance of the unemployment dynamics, possibly related to the Great Moderation. It is worth mentioning that the mixed empirical evidence on volatility breaks in unemployment provided by the two tests is also found in other studies using quarterly data: while Stock and Watson (2002) report no evidence of breaks using the UV test for quarterly unemployment, Sensier and van Dijk (2004) find support for a Great Moderation volatility break when using an AR(4) model.

Industrial Production Growth

Table 10 reports the UM and CM tests for industrial production growth. Here, the UM test and the CM tests for DL models with lags of the uncertainty factor and a 10% cut-off show no evidence of breaks. Neither do the CM tests with lags of the macro factor. The other tests, for AR(p) models, do find breaks, but notice that, as the number of lags increases, the CM test statistics move closer to the critical values, so the evidence for breaks becomes weaker.

Moreover, the PACF in Figure 12 indicates a sizable negative spike at lag 24; it seems that underspecification of the number of lags (for example, with the AIC and BIC) leads to incorrect rejection of the null of no breaks, as shown in the simulations in Section 4. This may also explain why Stock and Watson (2002) find a break in the conditional mean of the quarterly industrial production growth around the Great Moderation. To investigate this further, we reran the AIC and BIC model selection of an AR(p) model using a maximum of 30 lags; the BIC does not change, but the AIC now selects 18 lags. The corresponding CM test statistic is 137.039 with both 10% and 5% cut-offs and rejects the null of no breaks at the 5% level, but the implied break-fraction estimate is in both cases 0.89, near the cut-off, indicating that this test cannot be fully trusted as evidence of a mean break in the recent period, given its inflated value near the cut-off due to estimating many parameters with few observations.

As for breaks in the variance of the industrial production growth, the UV test and the CV test for a model with lags of the macro factor do not reject the null of no breaks, but the other CV tests indicate a break around the Great Moderation. This result is also found in Stock and Watson (2002) for the quarterly industrial production growth, where the UV test does not reject but the CV tests do. Note, however, that the CV test statistics, starting with the one for the AR(4) model, decrease as more lags are included and get very close to the critical values at 12 lags, indicating that the evidence for a Great Moderation break diminishes as more lags are taken into account.

Interest Rates
Several papers find breaks in the conditional mean of short-term interest rates (see, e.g., Garcia and Perron 1996; Sensier and van Dijk 2004; Stock and Watson 2002). For example, Garcia and Perron (1996) employed a Markov switching model with three possible regimes in mean and variance for ex-post real interest rates. They found two mean shifts: one in 1973, associated with the sudden rise in oil prices, and another in 1981, in line with the federal budget deficits of the period. Furthermore, Rapach and Wohar (2005) examined structural breaks in the mean real interest rate for 13 industrialized countries over the period 1960 to 1998. For the US series, they find breaks in the late 1960s, early 1970s and early 1980s. Table 12 confirms these findings: it shows strong evidence of breaks in both the long-run and short-run mean of the interest rates, some around the Great Moderation and some around 2001, perhaps tied to the 11 September 2001 attacks, which sent the stock market plummeting after several days in which it was closed. 28 The UV test in Table 13 does not detect a break, whilst the CV tests indicate at least one break, whose estimated date is again around the Great Moderation. In this case, also because the CV test statistics remain large at 12 lags, it is more likely that there is a variance break that the UV test does not have enough power to detect, which can happen for some data generating processes, as indicated in Figure 2. We conclude that there is strong evidence of breaks in both the unconditional mean and the variance of short-term interest rates.

Conclusions
In this paper, we propose an alternative and complementary approach to the sup Wald tests for breaks in the conditional mean and variance. We show that the corresponding unconditional mean and variance break tests exhibit comparable size and power properties. The unconditional mean and variance tests do not employ a conditional mean specification, so they do not suffer from potential regression model misspecification. We show that, under certain commonly encountered forms of regression model misspecification, especially dynamic misspecification, the traditional conditional mean break tests suffer from severe oversizing, even for large sample sizes, while both tests for a break in mean have similar size-adjusted power as the sample size grows. In a comprehensive empirical analysis, we applied these tests to show that there is no clear evidence of long-run breaks in the mean of the unemployment rate or of industrial production growth. Similarly, the evidence for breaks in the long-run variance of the unemployment rate and industrial production growth is mixed. It is worth noting that we focus only on the sup Wald test, which is among the most popular break-point tests in empirical work. It would be interesting to repeat the analysis in this paper for other break-point tests, including more powerful tests; we leave this for future research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs of Theorems 1-3
Proof of Theorem 1. Part (i). Aue and Horvath (2012) consider the statistic $Z_T(\lambda)$, where $\hat v_u$ is a HAC-consistent estimator of $v_u = \operatorname{AVar}\!\left(T^{-1/2} \sum_{t=1}^{T} y_t\right)$ under $H_0^{UM}$ and Assumption 1(i). They state that if a functional central limit theorem (FCLT) holds under $H_0^{UM}$ for $T^{-1/2} \sum_{t=1}^{[T\lambda]} u_t$, then $Z_T(\lambda) \Rightarrow B_1(\lambda) - \lambda B_1(1) \equiv \bar B_1(\lambda)$, and so the result follows by the continuous mapping theorem (CMT), where $\bar B_1(\lambda) = B_1(\lambda) - \lambda B_1(1)$ is a scalar Brownian bridge. Below, we show that there is a clear connection between the CUSUM test and the UM test, so that the asymptotic distribution of the latter follows from that of the former.
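For concreteness, the CUSUM process referenced in the proof can be written in a standard form; this display is our reconstruction, consistent with the stated weak limit, since the paper's own equation is omitted in this excerpt:

```latex
Z_T(\lambda) \;=\; \frac{1}{\sqrt{T\,\hat v_u}}\left(\sum_{t=1}^{\lfloor T\lambda\rfloor} y_t \;-\; \lambda \sum_{t=1}^{T} y_t\right),
\qquad \lambda \in [\epsilon,\, 1-\epsilon].
```

Under $H_0^{UM}$, $y_t = \mu + u_t$, so the partial sums of $y_t$ reduce to partial sums of $u_t$ plus an asymptotically negligible term, and the FCLT for $T^{-1/2}\sum_{t=1}^{[T\lambda]} u_t$ delivers $Z_T(\lambda) \Rightarrow B_1(\lambda) - \lambda B_1(1)$.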
Comparing (A1) and (A2), the two limiting distributions attain their suprema at different values of $\lambda$, and thus the sizes of these tests will in general be different. 29 However, the same assumption underlies the asymptotic theory of both: that the FCLT holds for $T^{-1/2} \sum_{t=1}^{[T\lambda]} u_t$. Assumption 1 guarantees that the FCLT in Wooldridge and White (1988), Theorem 2.11, can be applied to $u_t$ (in fact, we only need $d_m = O(m^{-1/2})$), completing the proof of (i).

Part (ii). Here, we just verify Assumption 1 for $|y_t - \bar y| - a$ instead of $u_t$; the rest of the proof is as in part (i). Assumption 1(i) holds by the null hypothesis and by Assumption 2(i), and we are left to verify Assumption 1(ii). Since $u_t$ is $L_2$-near-epoch dependent of size $m^{-1/2}$ on $\{g_t\}$ with positive constants equal to 1 (these constants appear in the near-epoch dependence definition in Davidson (1994), but since here they are fixed, they are absorbed into the definition of $d_m$), it follows that so is $y_t - \bar y$, with constants $2 \sup_t(1) = 2$. In Theorem 17.12 in Davidson (1994), let $\phi_t(\cdot) = |\cdot|$, a uniform Lipschitz function, with argument $y_t - \bar y$. Then, $|y_t - \bar y|$ is $L_2$-near-epoch dependent of size $m^{-1/2}$.

Part (iii). Here, we just verify Assumption 1(ii) for $(y_t - \bar y)^2 - v_u$ instead of $u_t$, because Assumption 1(i) holds by the null hypothesis and Assumption 2(ii). In Theorem 17.12 in Davidson (1994), under $H_0^{UV}$ and $H_0^{UM}$, define the function $\phi_t(y_t - \bar y) = (y_t - \bar y)^2 - v_u$. From part (ii) of the proof, $(y_t - \bar y)$ is an $L_2$-near-epoch dependent process of size $m^{-1/2}$ on $\{g_t\}$ with constants equal to 2. Below, we show that, under $H_0^{UM}$, $\phi_t$ is uniform Lipschitz almost surely:
$$|\phi_t(y_t - \bar y) - \phi_t(y_k - \bar y)| = |(y_t - \bar y)^2 - (y_k - \bar y)^2| \le |y_t + y_k - 2\bar y|\,|(y_t - \bar y) - (y_k - \bar y)| \le |u_t + u_k - 2\bar u|\,|(y_t - \bar y) - (y_k - \bar y)| \le \left(4 \sup_t |u_t|\right) |(y_t - \bar y) - (y_k - \bar y)| \le \kappa\,|(y_t - \bar y) - (y_k - \bar y)|,$$
almost surely, for some $\kappa > 0$, by Assumption 1 for $u_t$, where $\bar u = T^{-1} \sum_{t=1}^{T} u_t$. Hence, by Theorem 17.12 in Davidson (1994), $(y_t - \bar y)^2 - v_u$ is also $L_2$-near-epoch dependent of size $m^{-1/2}$.

29 In addition, note that the test statistic $\sup_{\lambda \in [\epsilon, 1-\epsilon]} \sqrt{UM_T}$ is known in statistics as a "weighted version" of the CUSUM test; see Aue and Horvath (2012, p. 5).

Proof of Theorem 2. (i). Since Assumption 3 is a special case of Assumption 8 in Hall et al. (2012), the result follows directly from their Theorem 6, setting $x_t = z_t$.
(ii)-(iii). Primitive assumptions for the CV test can be found in, e.g., Qu and Perron (2007), and involve joint mixing assumptions on $\{x_t \epsilon_t\}$ and $\epsilon_t^2$. They mention that these conditions can be replaced by sufficient conditions yielding an FCLT for $\{x_t \epsilon_t\}$ and $\epsilon_t^2 - v$ under the null. By similar reasoning, for the CA test, sufficient conditions yielding a joint FCLT for $\{x_t \epsilon_t\}$ and $|\epsilon_t| - E|\epsilon_t|$ suffice. Since $x_t$ includes an intercept, these conditions can be verified as in the proof of Theorem 1(ii)-(iii). Note that they all require $H_0^{CM}$.