Selecting the Lag Length for the M GLS Unit Root Tests with Structural Change: A Warning Note for Practitioners Based on Simulations †

This is a simulation-based warning note for practitioners who use the $M^{GLS}$ unit root tests in the context of structural change with different lag length selection criteria. With T = 100, we find severe oversizing with some criteria, while other criteria produce undersized tests. In view of this dilemma, we do not recommend using these tests. Although this behavior tends to disappear when T = 250, most empirical applications use smaller sample sizes such as T = 100 or T = 150. The $ADF^{GLS}$ test does not present an oversizing or undersizing problem. Its only disadvantage arises in the presence of negative MA(1) correlation, in which case the $M^{GLS}$ tests are preferable; in all other cases the $M^{GLS}$ tests are very undersized. When there is a break in the series, selecting the breakpoint using the Supremum method greatly improves the results relative to the Infimum method.


Introduction
Testing for the presence of a unit root in a time series (i.e., whether or not shocks to the series have permanent effects) is now a common starting point in advanced models frequently used in macroeconomics and finance. Recent efficient unit root tests are the $ADF^{GLS}$ and the $P_T^{GLS}$ tests; see Perron and Rodríguez (2003), who show that these tests enjoy the same efficiency characteristics. $M^{GLS}$ tests have become increasingly popular in the literature. For example, Haldrup and Jansson (2006) argue that practitioners should abandon the use of ADF tests altogether in favor of $M^{GLS}$ tests because of their excellent size properties and nearly optimal power properties. However, this note arrives at the opposite conclusion, suggesting that the choice of the most suitable testing method should be carefully assessed.
Currently, it is widely accepted that the selection of the lag length (denoted by k) has important implications for the size and power of the different unit root tests. See, for instance, Schwert (1989), Ng and Perron (1995), Agiakloglou and Newbold (1992), Agiakloglou and Newbold (1996), Elliott et al. (1996), Ng and Perron (2001), Del Barrio Castro et al. (2011), and Fossati (2012). The consensus is to use data-dependent methods. These rules include AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), Modified AIC (MAIC), Modified BIC (MBIC), and the t-sig method, which are briefly explained below.
Recently, we performed a routine empirical application of the $M^{GLS}$ tests and obtained strange results. For example, applying the $MZ_\alpha^{GLS}$ test with the AIC method to the unemployment rate of the Spanish region of Cantabria,² we obtained a test statistic of −3,140,463, a huge (explosive) negative value, with k = 9. Using the t-sig procedure, we obtained −50,078,041 with k = 10, which is even more striking. A straightforward interpretation implies an overwhelming rejection of the null hypothesis, given any of the asymptotic or finite-sample critical values tabulated in Perron and Rodríguez (2003). However, a value of this magnitude is counter-intuitive and inadmissible, because it is very far from standard values. In contrast, other rules yield the opposite result (very small values in absolute value). Similar results are obtained for three other time series (the unemployment rates of the Spanish regions of Galicia and Murcia, and Peru's monetary policy rate).³ We therefore consider it worthwhile to analyze the source of the poor behavior of the $M^{GLS}$ tests in the cases mentioned above. Hence, we perform extensive finite-sample simulations for the $M^{GLS}$ tests using different lag-length criteria, where size performance is our primary interest.
This note represents, to the best of our knowledge, the first simulation-based attempt to study the size and the eccentric behavior of the $M^{GLS}$ unit root tests in the context of structural change. We do not intend to perform an exhaustive analysis of each rule. Rather, this document is only a simulation-based note of caution for users of these unit root tests.⁴

This note is structured as follows. In Section 2, the GLS approach with structural break, the test statistics, the rules used to select k, and the two methods to select the break date are briefly reviewed. In Section 3, we present simulation evidence on the size of the $MZ_\alpha^{GLS}$ test, linking the results with an explosive behavior of the test. Section 4 provides some conclusions.

2 Quarterly data covering the period Q3 1976-Q2 2012 (T = 144 observations).
3 The sample sizes for Galicia and Murcia are the same as for Cantabria. For Peru's monetary policy rate, the data are monthly for February 2002-August 2010 (T = 92 observations).
4 We recognize the limitations of this note, which is based only on simulations. We agree with a Referee that formal proofs are needed in the spirit of Del Barrio Castro et al. (2013). Hence, a formal treatment will be addressed in a future research project.

DGP, GLS Detrending, M GLS

Following Perron and Rodríguez (2003), the data generating process (DGP) is:

$$y_t = d_t + u_t, \qquad u_t = \alpha u_{t-1} + v_t, \tag{1}$$

for t = 0, 1, 2, ..., T, where $v_t = \sum_{j=0}^{\infty} \gamma_j e_{t-j}$ and $\gamma(L) = \sum_{j=0}^{\infty} \gamma_j L^j$; that is, $v_t$ is an unobserved stationary zero-mean process, where $\sum_{j=0}^{\infty} j|\gamma_j| < \infty$ and $e_t$ is a martingale difference sequence. We assume that $u_0 = 0$ throughout, although the results generally hold for the weaker requirement that $E(u_0^2) < \infty$ (even as $T \to \infty$). The process $v_t$ has a non-normalized spectral density at frequency zero given by $\sigma^2 = \sigma_e^2 \gamma(1)^2$, where $\sigma_e^2 = \lim_{T \to \infty} T^{-1} \sum_{t=1}^{T} E(e_t^2)$. In the first equation of (1), $d_t = \psi' z_t$, where $z_t$ is a set of deterministic components. Perron and Rodríguez (2003) consider two models in the context of an unknown structural break: (i) Model I, where there is a single structural change in the slope, that is, $z_t = \{1, t, 1(t > T_B)(t - T_B)\}$, where $1(\cdot)$ is the indicator function and $T_B$ is the time of change, which can be expressed as a fraction of the whole sample as $T_B = \delta T$ for some $\delta \in (0, 1)$; and (ii) Model II, which includes a single structural change in intercept and slope, that is, $z_t = \{1, 1(t > T_B), t, 1(t > T_B)(t - T_B)\}$.

GLS Detrending and M GLS Statistics
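The class of $M^{OLS}$ tests is due to Stock (1999) and was further analyzed by Perron and Ng (1996). These tests are shown to have far smaller size distortions in the presence of important negative serial correlation. The $M^{GLS}$ tests are constructed using $\tilde{y}_t = y_t - \hat{\psi}^{GLS\prime} z_t$, where $\hat{\psi}^{GLS}$ minimizes

$$S(\psi, \bar{\alpha}, \delta) = \sum_{t=1}^{T} \left( y_t^{\bar{\alpha}} - \psi' z_t^{\bar{\alpha}} \right)^2,$$

with $y_1^{\bar{\alpha}} = y_1$, $z_1^{\bar{\alpha}} = z_1$, $y_t^{\bar{\alpha}} = (1 - \bar{\alpha} L) y_t$ and $z_t^{\bar{\alpha}} = (1 - \bar{\alpha} L) z_t$ for t = 2, 3, 4, ..., T, for a chosen $\bar{\alpha} = 1 + \bar{c}/T$, and where $z_t$ has been defined in Section 2.1. We also use the $P_T^{GLS}$ test, as defined in Perron and Rodríguez (2003). Hence, defining $S(\rho, \delta) = \sum_{t=1}^{T} (y_t^{\rho} - \hat{\psi}' z_t^{\rho})^2$ as the minimized value of this objective for $\rho = \bar{\alpha}, 1$, the $M^{GLS}$ and the $P_T^{GLS}$ statistics are:

$$MZ_\alpha^{GLS}(\delta) = \left( T^{-1} \tilde{y}_T^2 - s^2 \right) \left( 2 T^{-2} \sum_{t=1}^{T} \tilde{y}_{t-1}^2 \right)^{-1},$$

$$MSB^{GLS}(\delta) = \left( T^{-2} \sum_{t=1}^{T} \tilde{y}_{t-1}^2 / s^2 \right)^{1/2},$$

$$MZ_{t_\alpha}^{GLS}(\delta) = \left( T^{-1} \tilde{y}_T^2 - s^2 \right) \left( 4 s^2 T^{-2} \sum_{t=1}^{T} \tilde{y}_{t-1}^2 \right)^{-1/2},$$

$$P_T^{GLS}(\delta) = \left[ S(\bar{\alpha}, \delta) - \bar{\alpha} S(1, \delta) \right] / s^2.$$

Following Perron and Rodríguez (2003), we use $\bar{c} = -22.5$.⁶ The first three statistics are modified versions of the $Z_\alpha$ test of Phillips and Perron (1988), Bhargava (1986)'s $R_1$ statistic, and the $Z_{t_\alpha}$ test proposed by Phillips and Perron (1988), respectively. The term $s^2$ is an autoregressive estimate of ($2\pi$ times) the spectral density at frequency zero of $v_t$, suggested by Perron and Ng (1998), and defined by $s^2 = s_{ek}^2 / (1 - \hat{b}(1))^2$, where $s_{ek}^2 = (T - k)^{-1} \sum_{t=k+1}^{T} \hat{e}_{tk}^2$ and $\hat{b}(1) = \sum_{j=1}^{k} \hat{b}_j$, with $\hat{b}_j$ and $\{\hat{e}_{tk}\}$ obtained from the autoregression:

$$\Delta \tilde{y}_t = \alpha_0 \tilde{y}_{t-1} + \sum_{j=1}^{k} b_j \Delta \tilde{y}_{t-j} + e_{tk}. \tag{2}$$

Another test is the so-called $ADF^{GLS}(\delta)$ test, which is the t-statistic for testing the null hypothesis that $\alpha_0 = 0$ in (2).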
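To make the construction concrete, the following is a minimal Python sketch of GLS detrending for Model I and of the three M statistics, assuming the standard forms found in Ng and Perron (2001) and Perron and Rodríguez (2003). The function names (`model1_regressors`, `gls_detrend`, `ar_spectral_density`, `m_gls_stats`) are illustrative and not taken from the paper; the break date is treated as known and the sample is indexed t = 1, ..., T.

```python
import numpy as np

def model1_regressors(T, TB):
    """Model I regressors z_t = (1, t, 1(t > TB)(t - TB)) for t = 1..T."""
    t = np.arange(1, T + 1, dtype=float)
    return np.column_stack([np.ones(T), t, np.where(t > TB, t - TB, 0.0)])

def gls_detrend(y, z, c_bar=-22.5):
    """Quasi-difference y and z with alpha_bar = 1 + c_bar/T, estimate psi
    by least squares on the transformed data, and return y - z @ psi."""
    T = len(y)
    a = 1.0 + c_bar / T
    yq = np.concatenate([y[:1], y[1:] - a * y[:-1]])  # first observation kept as is
    zq = np.vstack([z[:1], z[1:] - a * z[:-1]])
    psi, *_ = np.linalg.lstsq(zq, yq, rcond=None)
    return y - z @ psi

def ar_spectral_density(yt, k):
    """s^2 = s_ek^2 / (1 - b(1))^2 from the ADF autoregression with k lags."""
    dy = np.diff(yt)
    n = len(dy)
    # Columns: lagged level, then k lagged differences.
    X = np.column_stack([yt[k:-1]] + [dy[k - j:n - j] for j in range(1, k + 1)])
    target = dy[k:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    s2_ek = resid @ resid / len(target)  # residual variance over the usable sample
    b1 = coef[1:].sum()                  # sum of the lagged-difference coefficients
    return s2_ek / (1.0 - b1) ** 2

def m_gls_stats(yt, s2):
    """Return (MZ_alpha, MSB, MZ_t) computed from detrended data and s^2."""
    T = len(yt)
    ssq = np.sum(yt[:-1] ** 2) / T**2    # T^{-2} times the sum of squared lagged levels
    num = yt[-1] ** 2 / T - s2
    mz_alpha = num / (2.0 * ssq)
    msb = np.sqrt(ssq / s2)
    mz_t = num / np.sqrt(4.0 * s2 * ssq)
    return mz_alpha, msb, mz_t
```

A typical call would be `yt = gls_detrend(y, model1_regressors(len(y), TB))` followed by `m_gls_stats(yt, ar_spectral_density(yt, k))`. In this sketch the three statistics satisfy the identity MZ_t = MZ_alpha × MSB, which is a convenient sanity check.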

Rules for Selecting the Lag Length (k)
In the derivation of the asymptotic distributions of the different unit root tests, the theoretical conditions provide little practical guidance for choosing k. The literature suggests using data-dependent rules like the AIC and the BIC, where k is chosen by minimizing:

$$IC(k) = \ln(\hat{\sigma}_k^2) + \frac{k \, C_T}{T - k_{max}},$$

where $\hat{\sigma}_k^2 = (T - k_{max})^{-1} \sum_{t=k_{max}+1}^{T} \hat{e}_{tk}^2$, $C_T/(T - k_{max})$ is the penalty attached to an additional regressor, and $T - k_{max}$ is the number of observations effectively available.⁷ The AIC and the BIC are obtained when $C_T = 2$ and $C_T = \ln(T - k_{max})$, respectively. Another procedure is the sequential t-sig procedure described in Campbell and Perron (1991).⁸ After selecting a value for $k_{max}$, the lag k is selected by a general-to-specific recursive procedure based on a two-tailed t-statistic on the coefficient associated with the last lag in (2). This approach is denoted by t-sig(10). In a more recent contribution, Ng and Perron (2001) proposed a class of Modified Information Criteria (MIC) that selects k by minimizing:

$$MIC(k) = \ln(\hat{\sigma}_k^2) + \frac{C_T \left( \tau_T(k) + k \right)}{T - k_{max}}, \qquad \tau_T(k) = (\hat{\sigma}_k^2)^{-1} \hat{\alpha}_0^2 \sum_{t=k_{max}+1}^{T} \tilde{y}_{t-1}^2.$$

The modified AIC (MAIC) is obtained when $C_T = 2$, and the modified BIC (MBIC) is obtained when $C_T = \ln(T - k_{max})$.

Recently, in order to improve finite-sample size and power, Perron and Qu (2007) have proposed a hybrid approach consisting of two steps: (i) OLS detrended data are used to select k using AIC, BIC, MAIC or MBIC; and (ii) (2) is estimated using GLS detrended data to construct $s^2$. In the simulations, we consider this hybrid approach and the methods used are classified as AIC OLS, BIC OLS, MAIC OLS and MBIC OLS, respectively.
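The information-criterion rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input `yt` is an already detrended series with observations t = 0, ..., T, every candidate regression uses the common sample t > k_max, and the names `adf_design` and `select_lag` are hypothetical rather than the authors' code. The t-sig rule is not covered.

```python
import numpy as np

def adf_design(yt, k, kmax):
    """Regressors and target of regression (2) with k lags on the sample t > kmax."""
    dy = np.diff(yt)
    n = len(dy)
    X = np.column_stack([yt[kmax:-1]] + [dy[kmax - j:n - j] for j in range(1, k + 1)])
    return X, dy[kmax:]

def select_lag(yt, kmax=10, criterion="aic"):
    """Return the k in 0..kmax minimizing AIC, BIC, MAIC or MBIC."""
    n_eff = len(yt) - 1 - kmax                   # T - kmax, fixed across k
    c_t = {"aic": 2.0, "bic": np.log(n_eff),
           "maic": 2.0, "mbic": np.log(n_eff)}[criterion]
    modified = criterion in ("maic", "mbic")
    values = []
    for k in range(kmax + 1):
        X, target = adf_design(yt, k, kmax)
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ coef
        sigma2 = resid @ resid / n_eff
        # tau_T(k) is the extra term of the modified criteria; zero otherwise.
        tau = coef[0] ** 2 * np.sum(X[:, 0] ** 2) / sigma2 if modified else 0.0
        values.append(np.log(sigma2) + c_t * (tau + k) / n_eff)
    return int(np.argmin(values))
```

Because the BIC penalty exceeds the AIC penalty whenever T − k_max > e² ≈ 7.4, the lag chosen by BIC on a given sample is never larger than the one chosen by AIC, which is in line with the heavier lag lengths that AIC tends to select.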

Selecting the Breakpoint
Given that the break date (the break fraction $\delta$) is considered to be unknown, we follow Perron and Rodríguez (2003) in using two methods for selecting it. The first is to define the break date as the point that minimizes the statistic $t_{\alpha_0}$ in (2). This procedure is known as the Infimum method; see Zivot and Andrews (1992) and Perron and Rodríguez (2003) for further details. The second method is based on the maximum absolute value of the t-statistic associated with the dummy variable of the break in the slope. This procedure is known as the Supremum method, which is equivalent to minimizing the SSR; see Perron (1997) and Perron and Rodríguez (2003) for further details.
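As an illustration of the Supremum idea, the sketch below scans candidate break dates and keeps the one with the largest absolute t-statistic on the slope-break dummy of Model I. It is a simplified version under assumptions that are not in the text: the regression is run by OLS in levels rather than on GLS-transformed data, the search is restricted to a 15% trimmed range, and the function name `supremum_break_date` is hypothetical. An Infimum search would instead keep the date that minimizes the unit root t-statistic.

```python
import numpy as np

def supremum_break_date(y, trim=0.15):
    """Break date maximizing |t| on 1(t > TB)(t - TB) in an OLS regression
    of y on (1, t, 1(t > TB)(t - TB)); trimming is an assumed choice."""
    T = len(y)
    t = np.arange(1, T + 1, dtype=float)
    best_tb, best_abs_t = None, -np.inf
    for tb in range(int(trim * T), int((1 - trim) * T) + 1):
        X = np.column_stack([np.ones(T), t, np.where(t > tb, t - tb, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = resid @ resid / (T - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)     # OLS covariance of the coefficients
        abs_t = abs(coef[2]) / np.sqrt(cov[2, 2])
        if abs_t > best_abs_t:
            best_tb, best_abs_t = tb, abs_t
    return best_tb
```

Since the restricted regression (without the break dummy) is the same for every candidate date, maximizing the absolute t-statistic here selects the same date as minimizing the sum of squared residuals.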

Setup
The DGP is $y_t = \alpha y_{t-1} + u_t$ with three scenarios for the autocorrelation of $u_t$: (i) the i.i.d. case: $u_t = e_t$; (ii) the AR(1) case: $u_t = \phi u_{t-1} + e_t$; and (iii) the MA(1) case: $u_t = e_t + \theta e_{t-1}$. For all cases, $e_t \sim$ i.i.d. N(0, 1), with 1000 replications, T = 100 and 250, $\phi$ = −0.8, −0.4, 0.4, 0.8, $\theta$ = −0.8, −0.5, 0.3, 0.8, and $\alpha = 1$ (null hypothesis). We performed extensive simulations for all $M^{GLS}$ tests, using both models and both ways to select the break point. We present a selected set of results. We have selected the $MZ_\alpha^{GLS}$ test as the representative test for the entire family of the $M^{GLS}$ tests. Furthermore, the Infimum method is used to select the break date and results are only reported for Model I. All other results or Tables are available upon request.⁹
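A minimal sketch of how one replication of this design could be generated is given below. The function name `simulate_unit_root`, the zero start-up values and the use of NumPy's `default_rng` are assumptions made for illustration only.

```python
import numpy as np

def simulate_unit_root(T, kind="iid", phi=0.0, theta=0.0, alpha=1.0, rng=None):
    """Simulate y_t = alpha * y_{t-1} + u_t for t = 1..T with y_0 = 0.

    kind: "iid" (u_t = e_t), "ar1" (u_t = phi * u_{t-1} + e_t) or
          "ma1" (u_t = e_t + theta * e_{t-1}), with e_t ~ i.i.d. N(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(T + 1)               # e_0, e_1, ..., e_T
    if kind == "iid":
        u = e[1:].copy()
    elif kind == "ar1":
        u = np.empty(T)
        prev = 0.0                               # assumed start-up value u_0 = 0
        for t in range(T):
            prev = phi * prev + e[t + 1]
            u[t] = prev
    elif kind == "ma1":
        u = e[1:] + theta * e[:-1]
    else:
        raise ValueError("kind must be 'iid', 'ar1' or 'ma1'")
    y = np.empty(T)
    prev = 0.0
    for t in range(T):
        prev = alpha * prev + u[t]
        y[t] = prev
    return y
```

For example, `simulate_unit_root(100, "ma1", theta=-0.8, rng=np.random.default_rng(1))` draws one series from the MA(1) design with a strongly negative moving-average coefficient.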

The Problem of Size
Table 1 shows the size of the $MZ_\alpha^{GLS}$ test for T = 100 and for the different criteria for selecting k, with $k_{max} = 10$. For the i.i.d. case, the results indicate that the tests constructed using BIC and BIC OLS have a size of around 3.0%, suggesting an undersized test. Tests based on all MAIC versions (OLS and GLS) seem to be extremely conservative (with an exact size of 0.0%). On the other hand, tests constructed with AIC, AIC OLS and t-sig(10) are extremely oversized (22%, 27% and 63%, respectively). The same result appears when we use fixed values of k (k = 5, 6, ..., 10), where sizes range from 43% to 82%. Indeed, the size is greater when the selected k is higher. For the AR(1) case, very similar results are found. In the MA(1) case, we observe the standard result that the test is oversized. In fact, when θ = −0.80, all selection criteria yield an oversized test. Even when using MAIC and MBIC, the sizes are 23% and 24%, respectively.

In Table 2, the results are presented for T = 250, where $k_{max} = 13$. The distortions decrease, meaning that the explosiveness (oversizing) problem is reduced. For the i.i.d. case, the tests constructed with BIC and BIC OLS yield 2.6% and 2.7%, respectively, which are very similar to the values for T = 100. With MIC and MIC OLS, the test has sizes of 1.7% and 1.6%, respectively, which are better than for T = 100 but still very undersized. Tests using the AIC, AIC OLS and t-sig(10) have sizes of 9%, 11.2%, and 37.9%, respectively, which are smaller than the values for T = 100 but still indicate an oversized test, in particular for the t-sig(10) criterion. With a fixed k (k = 5, 6, ..., 13), sizes are greater when k is higher, although smaller than for T = 100. If we increase $k_{max}$, the size of the test for higher k values increases considerably.

We may emphasize this issue by comparing with the same class of test without a structural change, that is, with some of the results obtained by Ng and Perron (2001). In Table II.B of Ng and Perron (2001), the $MZ_\alpha^{GLS}$ test for θ = −0.80 using k = 10 yields a size of 18% with T = 100. In our case, for the same values, we have a size of 62%. With T = 250, Ng and Perron (2001) obtain 3.6%, a size close to the nominal size (5%). However, in our case, for this sample we have a size of 19% (Table 2, k = 13). In fact, our simulations suggest that we need T = 350 in order to obtain a size close to 5% when θ = −0.80. The results are most likely due to the higher number of deterministic components in our models compared with Ng and Perron (2001). Nevertheless, our conclusion is that practitioners interested in applying the $MZ_\alpha^{GLS}$ test need a non-trivial number of observations.

A further comparison with Ng and Perron (2001) is possible if we select k using different criteria. Again, in the MA(1) case, where θ = −0.80 and T = 100, the test constructed with MAIC and MBIC yields sizes of 23% and 24%, respectively. The OLS versions of these criteria yield 32% and 33%, respectively (see Table 1). However, in the case shown in Table VI.A of Ng and Perron (2001), sizes of 5.9% are obtained using MIC and 12.3% using MIC OLS (T = 100). In Table 2, for T = 250, the tests constructed with MAIC and MBIC yield sizes of 3.8% and 4.6%, respectively. In the case of Ng and Perron (2001), MIC and MIC OLS yield 1.2% and 1.6%, respectively.

7 Note that in all experiments we use $T - k_{max}$ as the available number of observations, which is fixed, as suggested by Ng and Perron (2005).
8 See also Hall (1994) and Ng and Perron (1995).
9 We agree with the Editor that our scenario is the worst possible scenario because we are using the Infimum method jointly (in some cases) with the t-sig(10) rule. However, this worst scenario is widely used in typical empirical applications. Furthermore, it is a regular or natural option in many statistical packages used by practitioners. Minimizing SSR (or Supremum) is better, as we mention later.
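As a rough guide to reading the simulated sizes, one can attach a Monte Carlo standard error to them. The following back-of-the-envelope calculation is an added illustration, not a result from the simulations; it assumes R = 1000 independent replications and treats each rejection as a Bernoulli draw with probability equal to the 5% nominal level:

```latex
\mathrm{se}(\hat{p}) = \sqrt{\frac{p(1-p)}{R}}
                     = \sqrt{\frac{0.05 \times 0.95}{1000}} \approx 0.0069,
\qquad
0.05 \pm 1.96 \times 0.0069 \approx [0.036,\ 0.064].
```

Under these assumptions, a simulated size of about 3.0% lies slightly below this band, whereas sizes of 22% or more lie far outside it, so the distortions discussed above are much larger than simulation noise alone would suggest.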

Some Additional Results ¹⁰
Two values are used in the construction of $s^2$: $s_{ek}^2$ and $\hat{b}(1)$. Available simulations show that $s^2 \to \infty$ because $\hat{b}(1) \to 1$. That is, when a higher k is selected, (2) may be overparameterized, so that $\hat{b}(1) \to 1$. If $s^2$ tends to $+\infty$, then the $MZ_\alpha^{GLS}$ and $MZ_{t_\alpha}^{GLS}$ statistics tend to $-\infty$, and $MSB^{GLS}$ and $P_T^{GLS}$ converge in probability to zero. Additional simulations show a link between the excessive size of the test and a high probability of selecting higher values of k. Following Ng and Perron (1995), we examine the number of times that k = i is selected by each rule for i = 0, 1, 2, ..., 10 and T = 100. In the i.i.d. case, the results show that AIC, BIC, MAIC and MBIC select k = 1 with probabilities of 56.2%, 93.2%, 74.4%, and 81.6%, respectively. The t-sig(10) criterion, in contrast, spreads its selection probabilities across all values of k. For instance, the recursive t-sig(10) has a probability of around 53% of selecting k ≥ 7. So far, a basic conclusion is that the AIC, AIC OLS, and t-sig(10) methods are not recommended, as they have high probabilities of selecting higher values of k, which are associated with the size distortions observed in Tables 1 and 2.
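The mechanism can be illustrated numerically. The short Python sketch below plugs made-up inputs (T = 100, a final detrended value of 5, a sum of squared lagged levels of 2000 and a residual variance of 1; none of these come from the simulations) into the usual expressions for $s^2$ and $MZ_\alpha^{GLS}$ and lets the sum of the autoregressive coefficients approach one. The function names are hypothetical.

```python
def s2_autoregressive(s2_ek, b1):
    """Autoregressive spectral density estimate s^2 = s_ek^2 / (1 - b(1))^2."""
    return s2_ek / (1.0 - b1) ** 2

def mz_alpha(y_T, sum_sq_lags, s2, T):
    """MZ_alpha = (y_T^2 / T - s^2) / (2 * sum_sq_lags / T^2)."""
    return (y_T**2 / T - s2) / (2.0 * sum_sq_lags / T**2)

# Hypothetical inputs, chosen only to show orders of magnitude.
T, y_T, sum_sq_lags, s2_ek = 100, 5.0, 2000.0, 1.0
for b1 in (0.0, 0.9, 0.99, 0.999):
    s2 = s2_autoregressive(s2_ek, b1)
    print(f"b(1) = {b1:<6} s^2 = {s2:>12.1f}  MZ_alpha = {mz_alpha(y_T, sum_sq_lags, s2, T):>14.1f}")
```

With these inputs the statistic moves from about −2 when the coefficient sum is zero to values in the millions (in absolute value) when it reaches 0.999, which is the order of magnitude of the explosive values reported in the Introduction.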
When we calculate the mean value of $MZ_\alpha^{GLS}$ (in the i.i.d. case), explosive negative values are obtained for k ≥ 5 with AIC, AIC OLS, BIC OLS and t-sig(10). In contrast, small values of the test (in absolute value) are given by MAIC, MBIC, MAIC OLS and MBIC OLS. We also examine the number of times that the $MZ_\alpha^{GLS}$ test is smaller than a threshold. We consider six possible values, −500, −1000, −5000, −10,000, −50,000 and −100,000, in the i.i.d. case. For all thresholds considered, we find that the number of explosive values of $MZ_\alpha^{GLS}$ increases with k. For example, for k = 7, the probability of obtaining $MZ_\alpha^{GLS} < -1000$ is 13.4%, and the probabilities for k = 9 and k = 10 are 31% and 40.2%, respectively. Furthermore, the probabilities of finding $MZ_\alpha^{GLS} < -100{,}000$ are 18% and 22.7% for k = 9 and k = 10, respectively. All previous results are less severe when T = 250. Among other things, the probabilities of selecting high values of k are lower. In this regard, the oversizing problem is attenuated (see Table 2). Moreover, when a break is included in the simulations, the improvement is greater when T = 250. However, explosive negative values are still observed when the lag is selected with AIC, AIC OLS, and t-sig(10).
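The threshold counts described above amount to a simple tabulation over the simulated statistics. A minimal sketch follows, where `explosive_frequencies` is a hypothetical helper name and the input is any array of simulated $MZ_\alpha^{GLS}$ values:

```python
import numpy as np

def explosive_frequencies(stats, thresholds=(-500, -1000, -5000,
                                             -10_000, -50_000, -100_000)):
    """Share of simulated statistics lying below each threshold."""
    stats = np.asarray(stats, dtype=float)
    return {thr: float(np.mean(stats < thr)) for thr in thresholds}
```

Applied to the replications obtained for a given fixed k, the resulting shares correspond to the probabilities quoted in the text for that k.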

The ADF GLS Statistic
While the $MZ_\alpha^{GLS}$ test (and the entire family of the $M^{GLS}$ tests) shows either oversizing or undersizing problems, depending on the criteria used to choose k, the $ADF^{GLS}$ statistic works well. In the available Tables, we find that the mean value of $ADF^{GLS}$ is not explosive irrespective of the selection criterion used. There are some slightly large negative values when θ = −0.8, but this is a standard result in the literature.
Table 3 shows the exact size of the $ADF^{GLS}$ statistic when T = 100. For the i.i.d. case, the tests constructed with MAIC and MAIC OLS yield sizes of 3.1% and 3.4%, respectively; that is, they are slightly undersized, but closer to 5%. A similar observation is valid for MBIC and MBIC OLS. Other information criteria, like AIC, AIC OLS and t-sig(10), generate oversized tests, but the values are much smaller than those in Table 1 for the $MZ_\alpha^{GLS}$ test. For example, for the t-sig(10) procedure, Table 1 (i.i.d. case) shows that the $MZ_\alpha^{GLS}$ statistic has a size of 63%, which is poor. However, this value is reduced to 14.6% in the case of the $ADF^{GLS}$ test (Table 3). In general, the values in all scenarios are smaller than those in Table 1 for $MZ_\alpha^{GLS}$. The only difference (as expected) arises when θ = −0.80. In this case, the $MZ_\alpha^{GLS}$ test has sizes of 23% and 24% for the MAIC and MBIC, respectively, while for the $ADF^{GLS}$ test the values are 31.5% and 32.6%, respectively.

Table 4 shows the exact size of the $ADF^{GLS}$ test when T = 250. Again, the size distortions are clearly smaller than those of the $MZ_\alpha^{GLS}$ test (Table 2). As in Table 3, the results using the $MZ_\alpha^{GLS}$ test are better when θ = −0.80. In Table 4, the $ADF^{GLS}$ test yields 11.5% and 12.8% when MAIC and MBIC are used, respectively. In the case of the $MZ_\alpha^{GLS}$ test, the values are 3.8% and 4.6%, respectively. Furthermore, our calculations show that the $ADF^{GLS}$ test will have a size closer to 5% for θ = −0.80 when T = 350. This sample size is even more prohibitive for most empirical applications.
A comparison of Tables 1 and 2 against Tables 3 and 4 suggests that it is advisable to use the $ADF^{GLS}$ test, except when practitioners are sure that they face a strong negative MA(1) correlation. In that case, practitioners need samples of about T = 350 for the $ADF^{GLS}$ test or T = 250 for the $MZ_\alpha^{GLS}$ test.

The Supremum Method and a Single Breakpoint
The results change favorably when the Supremum method is used to select the breakpoint. Several simulations have been performed under the setup of Section 3.1 for Model I, where $\beta_3$ denotes the coefficient on the slope-break term $1(t > T_B)(t - T_B)$, with two scenarios: (i) $\beta_3 = 0$, that is, no break; and (ii) $\beta_3$ = 0.5, 1.0, 1.5 with $T_B = 0.50 \times T$. Similar experiments have been performed for Model II. In the first scenario, the $MZ_\alpha^{GLS}$ test still has explosive values, although less frequently; and the values are negative but of a smaller magnitude (in absolute value) than when using the Infimum method. In the second scenario, the results show considerable improvement, especially when T = 250. The explosive values of the $MZ_\alpha^{GLS}$ test practically disappear for the MIC and MIC OLS rules, although the cost is to have small values (in absolute value), which produce a conservative test. On the other hand, the rules AIC, AIC OLS, and t-sig(10) continue to yield an $MZ_\alpha^{GLS}$ test with explosive values, although these are much smaller in magnitude than in the previous cases and occur only when a higher k is selected.
The better results obtained with the Supremum method are important, since this method is recommended in the literature to select the break date. For instance, Vogelsang and Perron (1998) argue that this method is to be preferred, since it allows a consistent estimate of the break point, which the Infimum method cannot provide.
The evidence suggests that, in empirical applications, the Supremum method should be used to select the breakpoint along with the MIC and MIC OLS rules, although the potential cost is to have a conservative test. The evidence also suggests avoiding rules such as AIC, AIC OLS, and t-sig(10) to select k, as well as the Infimum method to select the breakpoint.

Conclusions
This note examines the size performance of the $M^{GLS}$ statistics for testing for the presence of a unit root using different lag length selection criteria in the context of an unknown structural change.
In particular, we have focused on the size performance of the $MZ_\alpha^{GLS}$ test. Overall, the results show that there is a strong relationship between the explosive negative values of the $MZ_\alpha^{GLS}$ test and the values of the selected k. Using the Infimum method to select the break point jointly with a rule such as AIC, AIC OLS or t-sig(10) produces the worst scenario, in the sense that the test yields explosive negative values, which generate severe oversizing problems. On the other hand, using other criteria for k implies conservative tests. These issues seem to improve when T = 250 or more (relative to T = 100), which creates sample size difficulties for most macroeconomic applications, especially in Latin American countries.
The results indicate that the $ADF^{GLS}$ test should be used, because it does not result in explosiveness. Although for other reasons, this recommendation is in the same vein as Harvey et al. (2013). The advantage of the $MZ_\alpha^{GLS}$ test is that it is intrinsically conservative. So, if we obtain a good size when θ = −0.80, this is achieved at the cost of having an undersized test in the other cases, including the i.i.d. case. Our results are in line with those obtained in Del Barrio Castro et al. (2011), Del Barrio Castro et al. (2013), and Del Barrio Castro et al. (2015).¹¹
The results change for the better when using the Supremum method (minimizing the SSR) to select the breakpoint. However, this result only occurs when there is a break in the series. With this method, the test values are reduced (in absolute value) and no explosiveness is observed. Furthermore, the method offers a consistent breakpoint estimator, which is currently suggested in the literature. Although this may come at the cost of an undersized test, a possible best scenario is to use the Supremum method together with rules for selecting k such as MIC. This potential need to pre-test for the existence of a break is similar to what is proposed by Kim and Perron (2009) when there is only one break and by Carrión-i-Silvestre et al. (2009) when there are multiple breaks.

Table 1. Size of the $MZ_\alpha^{GLS}$ Test, Model I, T = 100.

Table 2. Size of the $MZ_\alpha^{GLS}$ Test, Model I, T = 250.