A Joint Chow Test for Structural Instability

The classical Chow (1960) test for structural instability requires strictly exogenous regressors and a break-point specified in advance. In this paper we consider two generalisations, the 1-step recursive Chow test (based on the sequence of studentized recursive residuals) and its supremum counterpart, which relax these requirements. We use results on strong consistency of regression estimators to show that the 1-step test is appropriate for stationary, unit root or explosive processes modelled in the autoregressive distributed lags (ADL) framework. We then use results in extreme value theory to develop a new supremum version of the test, suitable for formal testing of structural instability with an unknown break-point. The test assumes normality of errors, and is intended to be used in situations where this can either be assumed or established empirically.


Introduction
Identifying structural instability in models is of major concern to econometric practitioners. The Chow [1] tests are perhaps the most widely used for this purpose, but require strictly exogenous regressors and a break-point specified in advance. As such, a plethora of variants have been developed to meet different requirements. In this paper, we consider two generalisations: the one-step recursive Chow test, based on the sequence of studentised recursive forecast residuals; and its supremum counterpart. The pointwise test is frequently used and reported in applied work, while the supremum test is new. Whereas Chow assumes a classical regression framework, practitioners typically use the one-step test to evaluate dynamic models, e.g., [2]. Further, since a series of such tests is usually presented graphically to the modeller, multiple testing issues arise, making it difficult to determine how many point failures may be tolerated. These two issues motivate the analysis that follows. First, in Theorem 6, we show that the pointwise statistic has the correct asymptotic distribution under fairly general assumptions about the generating process, including lagged dependent variables and deterministic terms. Second, we take advantage of the almost sure convergence proven earlier to construct a supremum version of the one-step test, applicable to detecting parameter change or an outlier at an unknown point in the sample. The supremum test offers several advantages useful to modellers: it is simple to compute and has a standard distribution under the null, which does not depend on the autoregressive parameter (even in the unit-root or explosive cases); it focuses attention on end-sample instability; and it is agnostic about the number of breaks, giving power against more complex forms of misspecification. These advantages incur certain costs: the test is not invariant to the distribution of errors (even asymptotically); and other tests are more powerful against particular alternatives.
The pointwise one-step Chow test is essentially the "prediction interval" test described by Chow, but computed recursively and over the sample (rather than at an a priori hypothesised change point). It first appears in PcGive Version 4.0 [3] as part of a suite of model misspecification diagnostics; a similar diagnostic graphic, the "one-step forecast test", is provided in EViews ([4] p. 180). The idea of using residuals calculated recursively to test model misspecification dates to the landmark cumulated sum (CUSUM) and cumulated sum of squares (CUSUMSQ) tests [5,6], which are based on partial sums of (squared) recursive residuals and have since been generalised to models including lagged dependent variables [7][8][9]. Unlike these tests, the one-step Chow test does not consider partial sums, but the sequence of recursive residuals itself; in effect, testing one-step-ahead forecast failure at each time step. As the following analysis shows, this approach leads to a different type of asymptotics, with a residual sequence behaving like i.i.d. random variables, rather than a partial sum of residuals behaving like a Brownian motion.
Examining the residual sequence to check the model specification is, of course, well established. The residuals can be either OLS residuals or recursive residuals; see [6,10]. The recursive residuals have two advantages over the OLS residuals: first, under the normal linear model with fixed regressors, they are identically and independently normal; second, they have a natural interpretation (in a time series setting) as forecast errors. Ironically, in typical time series settings, where the forecast error interpretation is most useful, the independence of the residuals does not hold due to the presence of lagged dependent variables; see [11]. This may lead to difficulties drawing firm conclusions from plotted pointwise test sequences and, thus, motivates the second part of this paper, which considers a supremum test.
The supremum test considers the maximum of the pointwise one-step tests, appropriately normalised. It is intended to reflect structural instability anywhere in the sample (with the early part excluded to allow consistent estimation). It relates to work on tests for either structural breaks or outliers, both at a possibly unknown time.
Perron [12] divides structural break tests into those that do not explicitly model a break, those that model one break and those that model multiple breaks. Our test joins the first category, which includes the already mentioned CUSUM and CUSUMSQ tests. (As an aside, Perron notes that these tests can suffer from non-monotonicity of power against some alternatives. The risk of this is much reduced by using the one-step Chow statistics, since all parameters are estimated on a growing sub-sample.) The modelled break category includes, most prominently, the Quandt-Andrews (respectively, [13,14]) supremum tests. These tests are complicated by a non-standard distribution (tabulated in [15]), but are nevertheless popular in practice, being implemented in several software packages. Our test is distinguished from these by not imposing any restrictions on the end-of-sample, so that end-of-sample instability may be detected. This feature is similar to [15], but a key distinction is that our test is agnostic about the number of breaks in the sample, a useful property in practice. It is also substantially simpler in implementation. Additionally, because the one-step tests behave like an i.i.d. process, the asymptotics differ from full-sample tests, like Quandt-Andrews, requiring the application of the extreme value theory of independent and weakly dependent sequences, rather than the suprema of random walks.
Seen specifically as an outlier test, the supremum Chow test falls squarely within the tradition of [16], which, however, considers an unknown outlier in a classical setting. Outliers in the Box-Jenkins paradigm have attracted substantial interest; see [17][18][19]. These authors take a full-sample approach with stepwise elimination of outliers. Although effective in many cases, there is a risk of smearing/masking effects when multiple outliers are present, which is reduced with the recursive test we present.
Surprisingly, the use of recursive residuals to detect outliers in time series data is relatively unexplored, although there is little doubt that they are used for this purpose in practice. Barnett and Lewis ([20] p. 330) comment that "[recursive residuals] would seem to have the potential for the study of outliers, although no major progress on this front is evident. There is a major difficulty in that the labelling of the observations is usually done at random, or in relation to some concomitant variable. . .". This difficulty does not exist with time series, where there is a natural chronological labelling of observations. The section in the same book (at p. 396) on detecting outliers in time series is, nevertheless, notably brief, and recursive methods are not considered.

The Test Statistics
The one-step test applies to a linear regression: with y_t scalar, x_t a k-dimensional vector of regressors and the errors independently and identically distributed. For such a regression, we can define the sequence of least squares estimators calculated over progressively larger subsamples, along with the corresponding residual sums of squares and recursive residual (or standardised one-step forecast error). The one-step Chow test statistic, C^2_{1,t}, is then defined as: and can be expressed as: Chow showed that in a classical Gaussian regression model, this statistic would have an exact F(1, t − k − 1) distribution. We first extend this result to show that, for a general class of Gaussian autoregressive distributed lag (ADL) processes, C^2_{1,t} converges in distribution to a χ^2_1 random variable, so that, asymptotically, the additional dependence does not matter. This result means that comparing the pointwise statistic against an F(1, ·) or χ^2_1 distribution (as is typically done) is appropriate in large samples. However, it still leaves unresolved the difficulty that this test is generally reported graphically to detect parameter change with an unknown change point. To formally treat the problem of multiple testing that occurs in evaluating many pointwise statistics over the entire sample, we introduce a new supremum test based on the test statistic: where g is an arbitrary function of T, such that g(T) → ∞.
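Since the display equations defining the recursive residual and C^2_{1,t} are not reproduced above, a minimal Python sketch of the standard construction (recursive least squares, studentised one-step forecast errors and the resulting pointwise statistics) may help fix ideas. The function name and interface are ours, not the paper's:

```python
import numpy as np

def one_step_chow(y, X):
    """Pointwise one-step Chow statistics.

    A sketch of the standard construction: at each t, the model is fitted
    on the preceding observations; the studentised one-step forecast error
    (recursive residual) for observation t is squared and scaled by the
    estimated error variance from the estimation sample.
    """
    T, k = X.shape
    stats = {}
    # skip the very shortest samples so the variance estimate is stable
    for t in range(k + 10, T):
        Xp, yp = X[:t], y[:t]                      # estimation sample
        beta, rss, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
        x_new, y_new = X[t], y[t]                  # next observation
        # studentised one-step forecast error (recursive residual)
        h = x_new @ np.linalg.inv(Xp.T @ Xp) @ x_new
        w = (y_new - x_new @ beta) / np.sqrt(1.0 + h)
        sigma2 = rss[0] / (t - k)                  # residual variance estimate
        stats[t] = w**2 / sigma2                   # approx. F(1, .) under the null
    return stats
```

Under the null with Gaussian errors, the statistics should hover around one; plotting `stats` against t reproduces the familiar PcGive-style diagnostic.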
To put the test statistics and the asymptotic analysis into perspective, it is useful to review some related, but also somewhat different, statistics. The literature on adaptive control, also called tracking, comes to mind, although it does not appear to be applied much in econometrics. It is concerned with tracking the sum of squared innovations. The aim is then to show that: vanishes. A discussion for a non-explosive autoregressive setup without deterministic terms is given in [21,22]. Asymptotic distribution theory does not seem to be discussed. The reason that the tracking result does not extend to the explosive case is that the residuals are not normalised by the "hat" matrix, in contrast to the residuals in Equation (4). Normalisation by the "hat" matrix also gives excellent finite-sample properties; see Section 5.
The Chow statistic C^2_{1,t} in Equation (5) also involves scaling by a residual variance estimator, so that the asymptotic distribution is free of nuisance parameters; see Theorem 6 below. More fundamentally, the present analysis is concerned with the Chow statistic for individual observations rather than sums, and it will therefore involve extreme value theory.
Another related test that is used in econometrics is the CUSUMSQ test based on the statistic: or a similar statistic based on least squares residual variance estimates instead of sums of squared recursive residuals. This test statistic is aimed at detecting non-constancy in the innovation variance rather than detecting individual outliers. Again, it includes normalisation by the "hat" matrix. Distribution theory is discussed in [7,8] for the stationary case and in [9] for general autoregressions, which are possibly explosive and with deterministic terms.
A third related test is Andrews' sup-F test; see [13]. This is a test for structural breaks in the mean parameters for which the asymptotic theory only applies in the stationary case. The finite sample performance of the Chow test and the sup-F test is compared in Section 6.

Model and Assumptions
We consider the behaviour of the test statistic for ADL models with arbitrary deterministic terms, a class that includes by restriction many commonly-posited economic relationships; see ([23] Chapter 7). For the purpose of analysis, we assume that the true data generating model can be represented as a vector autoregression (VAR).
We observe a p-dimensional time series X_{1−k}, ..., X_0, X_1, ..., X_T. We model the series by partitioning X_t as (Y_t, Z_t)′, where Y_t is univariate and Z_t is of dimension p − 1, and then consider the regression of Y_t on the contemporaneous Z_t, lags of both Y_t and Z_t and a deterministic term D_t. That is, In order to specify the joint distribution of X_t = (Y_t, Z_t)′, we assume that X_t follows the vector autoregression: with the deterministic term D_t given by: The deterministic term D_t follows the approach of [24,25] and may include, for example, a constant, a linear trend or periodic functions, such as seasonal dummies. The matrix D has characteristic roots on the unit circle. For example, will generate a constant and a biannual dummy. The term D_t is assumed to have linearly-independent coordinates, formalised as follows. We assume the VAR innovations form a martingale difference sequence satisfying the assumption below. The requirement that the innovations have finite moments just beyond 16 stems from a problem with controlling unit root processes; see ([25] Remark 9.3). In the present analysis, this constraint emerges in Lemma 12 (i) and is transmitted via Lemma 13 (iv) to Lemma 16. If dim D = 0 and the geometric multiplicity of roots at unity equals their algebraic multiplicity (including I(1), but excluding I(2) processes), this could be improved to finite moments greater than four using the result of [26].
Assumption 2. ξ_t is a martingale difference sequence with respect to the natural filtration F_t, so E(ξ_t | F_{t−1}) = 0. The initial values X_0, ..., X_{1−k} are F_0-measurable and: a.s.
This assumption also excludes the possibility that the innovations could be heteroscedastic, a common assumption in financial modelling (e.g., autoregressive conditional heteroscedastic, ARCH), but also an increasingly relevant property in macroeconomic work, particularly in light of the "Great Moderation" period [27] and the subsequent period of the "Global Financial Crisis". The assumption indicates that such heteroscedasticity should be modelled.
We permit nearly all possible values of the autoregressive parameters A_j in Equation (11), excluding only the case of singular explosive roots, which can only arise for a VAR with p ≥ 2 and multiple explosive roots; see [28] for a discussion. We can express the restriction in terms of the companion matrix: Assumption 3. The explosive roots of B have geometric multiplicity of unity. That is, for all complex λ with |λ| > 1, rank(B − λI_{pk}) ≥ pk − 1.
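For a given companion matrix, Assumption 3 can be checked numerically. A small sketch (ours, not from the paper) that tests rank(B − λI) ≥ dim(B) − 1 at every explosive eigenvalue:

```python
import numpy as np

def check_assumption_3(B, tol=1e-8):
    """Check that every explosive root of the companion matrix B has
    geometric multiplicity one, i.e. rank(B - lambda*I) >= dim(B) - 1."""
    n = B.shape[0]
    for lam in np.linalg.eigvals(B):
        if abs(lam) > 1.0 + tol:
            # numerical rank via singular values
            s = np.linalg.svd(B - lam * np.eye(n), compute_uv=False)
            rank = int(np.sum(s > tol * s[0]))
            if rank < n - 1:
                return False
    return True
```

For example, a scalar explosive AR(1) companion matrix [[1.5]] passes, while diag(2, 2), whose explosive root 2 has geometric multiplicity two, fails.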
Additionally, we require that the innovations in the ADL regression are martingale differences.
Assumption 4. Let G_t be the sigma field generated by F_t and Z_t. Then, Finally, the one-step statistic is such that a distributional assumption must be made in order to derive the limiting distribution of the statistic (since the statistic is an estimate of a single error term, we cannot take advantage of a central limit theorem). Similarly, since the analysis of the supremum statistic will rely on extreme value theory, we must impose distributional and independence assumptions on the ADL innovations ε_t, in order to uniquely determine the norming sequences applied in Lemma 9. We assume normality, which may result from joint normality in the underlying VAR process and can be tested in practice under the above assumptions; see [29].

Main Results
We must briefly examine the decomposition of the process used in the proofs in order to elucidate the first main result in the explosive case (in the non-explosive case, this decomposition becomes trivial). A two-way decomposition allows us to express separately certain terms that arise in connection with the explosive component of the process. Group the regressors by defining: and then write Equation (11) in companion form, so that: Then, there exists a regular real matrix M to block diagonalize S (see the elaboration in Section 3 of [25]), so that the process can be decomposed into non-explosive and explosive components, R_t and S_t, respectively. We have: with R and W having eigenvalues inside or on, and outside, the unit circle, respectively. The first theorem states that the test statistic is almost surely close to a related process in the innovations, q_t^2, under multiple assumptions. This result does not require the normality of Assumption 5.
Theorem 6. Under Assumptions 1, 2, 3 and 4, where: and W is as in Equation (16), and as in [25] (Corollaries 5.3 and 7.2). Having established pointwise convergence almost surely, we use an argument based on Egorov's theorem to establish the convergence of the supremum of a subsequence. Both the subsequence itself and the lead-in period must grow without bound, to allow the regression estimates to converge. Lemma 7. Suppose C^2_{1,t} − (q_t/σ)^2 → 0 almost surely as t → ∞. Then: where g(T) is an arbitrary function of T, such that g(T) → ∞.
Now, if an appropriately normalised expression in the maximum over q_t can be shown to converge in distribution, then so will the supremum statistic, with the same normalisation, by asymptotic equivalence. We show that, under the assumption of independent and identical Gaussian innovations, max_{1≤s≤t} q_s, appropriately normalised, does indeed converge to the Gumbel extremal distribution (as t → ∞), which has distribution function: where: A useful property of the Gumbel distribution is the following simple monotonically decreasing transformation to a χ^2 variable, allowing standard distributions to be used: In showing the above convergence, we rely on Theorem 1 of Deo [30], and its corollary, showing that the extremal distribution of the absolute values of a Gaussian sequence is the same in the stationary dependent and independent cases. However, Deo's Lemma 1 gives an incorrect statement of the norming sequences. Here, we state the correct sequences, adopting the notation of Deo (proof in Section A.6). Lemma 8. Let {X_n} be independent Gaussian random variables with mean zero and variance one.
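The Gumbel distribution function and the decreasing transformation are not reproduced in display form above. A consistent reconstruction from standard properties of the Gumbel law is the following (the paper's exact Equation (21) may differ in form):

```latex
% Gumbel (Type 1) distribution function:
\Pr(\Lambda \le x) \;=\; \exp\{-\exp(-x)\}, \qquad x \in \mathbb{R}.
% If \Lambda follows this law, then \exp(-\Lambda) is standard exponential,
% and twice a standard exponential variable is chi-squared with 2 d.f.:
\Lambda \sim \mathrm{Gumbel}
\;\Longrightarrow\; \exp(-\Lambda) \sim \operatorname{Exp}(1)
\;\Longrightarrow\; 2\exp(-\Lambda) \sim \chi^2_2 .
```

The map x ↦ 2 exp(−x) is monotonically decreasing, so rejection in the right tail of Λ corresponds to rejection for small values of the transformed statistic, consistent with the remark following Theorem 10.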
Deo's result can then be applied to q_t defined in Equation (18).
Lemma 9. Under Assumption 5, where: and Λ is a random variable distributed according to the Gumbel (Type 1) law.
Combining these lemmas gives our main result: with independent and identically distributed Gaussian innovations, an appropriate normalisation of the supremum one-step Chow test converges in distribution to the Gumbel extremal distribution.
Theorem 10. Under Assumptions 1, 2, 3, 4 and 5, with some g(T) → ∞, where C^2_{1,t} is the one-step Chow statistic defined in Equation (5) and: and Λ is a random variable distributed according to the Gumbel distribution Equation (20).
As a simple corollary, we can transform the test using Equation (21), so that it may be compared against a more readily-available distribution.
A test based on this result should reject for small values of the statistic.

Finite-Sample Corrections
In practice, we find by simulation that the test as specified above is over-sized in small samples. To minimise this, we suggest two corrections. For the first correction, we observe that the one-step statistics appear to be distributed close to F(1, t − k − 1) (as indeed, they are exactly in the classical case) and so use the following transformation to bring the statistics closer to the asymptotic chi-squared distribution: where F(·) and G(·) are the F(1, t − k − 1) and χ^2_1 distribution functions, respectively. This first correction results in a test that tends to under-correct, largely as a result of relatively slow convergence to the limiting Gumbel distribution. We find that the test performs better if simply compared with the finite maximal distribution, assuming the independence and identical distribution of the test statistics (the first assumption holding only in the limit and in the absence of an explosive component, and the second holding only in the limit). That is, we approximate the distribution of the maximum, max_{g(T)<t≤T} C^{2*}_{1,t}, by: This forms the basis of the finite-sample adjusted sup-Chow test (SC^{2*}), with rejection in the right tail. Note that, in this case, no centring or scaling is required, because the null distribution itself depends on T.
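As a concrete illustration, the two corrections can be sketched in a few lines of Python (SciPy). The function names and the dictionary interface are ours, and the display equations are assumed to take the standard forms implied by the text (F-to-χ² transformation, then the maximal distribution of independent χ^2_1 variables):

```python
import numpy as np
from scipy.stats import f, chi2

def corrected_sup_chow(stats, k):
    """Finite-sample corrected sup-Chow test (a sketch).

    `stats` maps t -> C^2_{1,t}.  Each statistic is pushed through the
    F(1, t-k-1) distribution function and back through the chi^2_1
    quantile function; the maximum is then compared with the distribution
    of the maximum of n independent chi^2_1 variables.
    """
    corrected = {t: chi2.ppf(f.cdf(c, 1, t - k - 1), 1) for t, c in stats.items()}
    m = max(corrected.values())
    n = len(corrected)
    p_value = 1.0 - chi2.cdf(m, 1) ** n   # P(max of n iid chi^2_1 > m)
    return m, p_value

def critical_value(p, n):
    """Simultaneous level-p critical value under the finite maximal approximation."""
    return chi2.ppf((1.0 - p) ** (1.0 / n), 1)
```

Rejection is in the right tail, and no centring or scaling is needed, exactly as described above.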

Simulation Study
We present the results of four simulation experiments done in Ox [31]: size; distributional sensitivity; power against mean shifts; and power against outliers.All of the simulations are done first for a first order autoregression and then for an autoregressive distributed lag model.

Autoregressive Data-Generating Process
Consider the following data-generating process: x_0 = 0.
Where not otherwise stated, we set α = 0 and ε_t ∼ iid N(0, 1). The number of Monte Carlo repetitions and the implied Monte Carlo standard error, MCSE, are indicated in table captions.
The regression model is a first order autoregression. It includes an intercept unless otherwise stated. The five tests computed and presented in the tables are: the asymptotic sup-Chow test (SC^2); the corrected sup-Chow test (SC^{2*}); the [13] sup-F test (sup F); an outlier test based on the OLS residuals (sup t^2); and the [32] E_p test for normality (Φ). The nominal size is 5% unless otherwise stated.
The two sup-Chow tests are described in Theorem 10 and Equation (28), respectively; for the function g(T), we use T^{1/2}. The sup-F test is a linear regression form of Andrews' sup-W (Wald) test with 15% trimming, as used for the simulations in [33] and implemented in EViews 7 by the command ubreak. It is the maximum of the Chow F-tests calculated over break points 0.15T ≤ λ ≤ 0.85T, such that for each break point λ, under the alternative, the model is estimated separately for each subsample (1, ..., λ and λ + 1, ..., T), whereas under the hypothesis, the model is estimated for the full sample. The null distribution is given by asymptotic approximation to a non-standard distribution, with simulated critical values given in [15]. The sup-t^2 test examines the maximum of the squared full-sample OLS residuals, externally studentised as in ([34] s. 2.2.9); that is, residual t is normalised using an estimate of the error variance that excludes residual t itself. These squared statistics are F(1, t − k − 1) distributed under normality of the errors, but not independent; hence, the use of the Bonferroni inequality to find significance levels is recommended by [34]. We find in simulation that, despite dependence, the exact maximal distribution of T independent F(1, t − k − 1) random variables is a reasonable choice for the sample sizes and processes we consider, except those that are near-unit-root. Finally, in experiments where we wish to evaluate the performance of the sup-Chow test conditional on residuals having satisfied a normality test, we use the [32] E_p test of the OLS residuals.
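The sup-t^2 test just described can be sketched directly in Python (the function name is ours; the p-value uses the maximal distribution of independent F variables, the approximation found adequate in the text):

```python
import numpy as np
from scipy.stats import f

def sup_t2(y, X):
    """Sup-t^2 outlier check: maximum squared externally studentised OLS residual.

    The p-value treats the T squared statistics as independent F(1, T-k-1)
    variables; dependence between residuals is ignored, as in the text.
    """
    T, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # "hat" matrix
    e = y - H @ y                                  # OLS residuals
    rss = e @ e
    h = np.diag(H)
    # externally studentised: variance estimate excludes observation t
    s2_ext = (rss - e**2 / (1.0 - h)) / (T - k - 1)
    t2 = e**2 / (s2_ext * (1.0 - h))
    m = t2.max()
    p_value = 1.0 - f.cdf(m, 1, T - k - 1) ** T
    return m, p_value
```

A large outlier inflates one residual, pushing the maximum far into the tail of the maximal F distribution and yielding a small p-value.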
In the first experiment (Table 1), we vary the autoregressive parameter through the stationary, unit-root and explosive regions and consider the effect of either including or excluding an intercept from the model. As noted above, the SC^2 test is uniformly oversized. The SC^{2*} test is correctly sized and approximately similar, with simulated size varying very little across the parameter space. There is some tendency towards inflated sizes under near-unit-root processes when an intercept is included in the model, but the extent of this is quite limited (7% simulated size). The key consequence of this result is that it is not necessary to know a priori where the autoregressive parameter lies to effectively apply the SC^{2*} test, avoiding a potential circularity in model construction.
In simulations that are not reported here, we also investigated the sup F test. This test is not valid in the non-stationary case; accordingly, the same patterns were seen, albeit with a larger effect, the simulated size reaching 44% in the unit root case. The second experiment evaluates the sensitivity of the SC^{2*} test to failures of Assumption 5, in particular the non-normality of the errors. Table 2 presents simulated sizes for both the SC^{2*} and sup F tests under a range of error distributions. The former is very sensitive to departures from normality, while the latter is not. In the second part of the table, we consider a further scenario, in which a model builder runs the structural instability tests only if a test for normal residuals is not rejected. This yields three additional tests: the normality test Φ and the SC^{2*}|Φ and sup F|Φ tests, each conditional on the normality hypothesis having not been rejected. We also consider joint tests SC^{2*} + Φ and sup F + Φ that first test normality and then test for a break if normality cannot be rejected. As the table illustrates, some size distortion remains, but the inflation of the unconditional test is largely controlled in the conditional case. As noted in Section 8 below, we recommend using the test in this way if the normality of the errors cannot be safely assumed. The third experiment considers the power of the tests against a single shift in the mean level of the process. The data generating process is: x_0 = 0.
and we allow γ and τ to vary as presented. The regression model remains a first order autoregression with an intercept. The level shift is therefore not modelled. Table 3 shows simulated rejection frequencies for the unconditional tests as in the previous experiment. We note that the sup F test performs well for a break at mid-sample, but is outperformed by the SC^{2*} test for breaks occurring near the end of the sample. We also consider conditional tests as in the previous experiment. There are two main observations: firstly, the normality test is increasingly likely to reject as the break magnitude becomes large; but secondly, the SC^{2*} test still has power (attenuated by around one-half) to detect the break in this case. The fourth experiment (Table 4) considers the power of the tests against a single innovation outlier at the process mid-point. The data generating process is: x_0 = 0.
and we allow α and δ to vary. Both tests presented have similar power in most circumstances, with an outlier larger than three-times the error standard deviation being detected with useful frequency. The OLS-based sup t^2 test has slightly better power than the Chow test. The conditional evaluations show that both tests retain power in situations where the normality test is not rejected.

Autoregressive Distributed Lag Data-Generating Process
We consider a bivariate data-generating process, written in triangular equilibrium correction form as where (ε_t, η_t)′ ∼ iid N_2(0, I_2). The characteristic roots of the system are ψ and 3/4. When ψ = 1, the model is cointegrated. When ψ = 1/4, the model is stationary. In both cases, y_t − z_t is stationary.
We then fit the univariate autoregressive distributed lag model and investigate the residuals ε_t using the Chow statistics.
The first experiment (Table 5) evaluates the size of the Chow tests. Here, ψ varies, while ν = 0 in the data generating process. The results are in line with those seen for the autoregressive situation in Table 1. The second experiment is not repeated for this situation.
Table 5. Simulated rejection frequency for SC^2, SC^{2*} under an autoregressive distributed lag process in Equations (32), (33).
The third experiment (Table 6) evaluates the power of the Chow tests against a single shift in the mean level. This is done by replacing ν by νI_{t>τ} in the data generating process Equation (32). The results are in line with those seen for the autoregressive situation in Table 3: there is good size control, and the power is nearly uniform in ψ and comparable to the power reported in Table 3.
The fourth experiment (Table 7) evaluates the power of the Chow tests against a single innovation outlier at the process mid-point. This is done by replacing ν by νI_{t=T/2+1} in the data generating process Equation (32). The results are in line with those seen for the autoregressive situation in Table 4.
Table 6. Simulated rejection frequency for SC^{2*} under the process in Equations (32), (33) with a break of magnitude ν at time τ. T = 50. 50,000 repetitions, MCSE ≤ 0.5.
Table 7. Simulated rejection frequency for SC^{2*} under the process in Equations (32), (33) with an outlier of magnitude ν at mid-sample. T = 50. 50,000 repetitions, MCSE ≤ 0.5.

Empirical Illustration
As an empirical illustration, consider log quarterly U.K. gross domestic product, y, say, for the period 1991:1 to 2013:3. This gives a total sample length of 91, of which two observations are held back as initial values. The data is provided as supplementary material. Figure 1a shows the series.
An autoregression with two lags, an intercept and a linear trend is fitted to the data (Equation (35)). The supremum Chow test gives max C^{2*} = 12.9 (p = 0.026), with arg max = 2008:2. The specification tests in Equation (36) are a cumulant-based test for normality and a test for autocorrelated residuals. They are valid for non-stationary autoregressive distributed lag models; see [29,35,36]. These specification tests do not indicate particular problems with the model. The specification test in Equation (37) is the supremum Chow test C^{2*}_{1,t} from Equation (27) evaluated according to the finite sample approximation Equation (28). This indicates a specification problem that is worst for 2008:2. A graphical version of this test is discussed below.
Figure 1b shows the scaled residuals. There appears to be clustering of first positive and then negative residuals around the start of the financial crisis in 2006-2009. This tendency is, however, not sufficient to trigger the specification tests for autocorrelation and non-normality in Equation (36).
Figure 1c shows pointwise one-step Chow tests. This is the standard output from PcGive. The scale is a probability scale, where the horizontal line indicates the critical value for pointwise tests at a 1% level. The exact nature of the probability scale is not documented in [37]. This plot represents pointwise tests based on 79 statistics with an unclear correlation structure. Thus, researchers have traditionally interpreted this plot with a grain of salt; see ([38] p. 197). In this plot, there is a cluster of pointwise rejections, so it would seem prudent to question the stability of the model.
Figure 1d shows the Chow statistics C^{2*}_{1,t} (see Equation (27)), along with horizontal lines indicating the critical values of simultaneous 5% and 1% tests. These are computed according to the finite sample approximation Equation (28). The maximum of the test statistics is 12.9; see Equation (37). This is between the 5% and the 1% critical levels of 11.6 and 14.7, respectively. Thus, at a 5% level, it would seem prudent to question the stability of the model.
What could be done to remedy the situation? The usual answer is to inspect the data, think about the economic context and the economic question of interest and update the empirical model accordingly. While the economic question is vague in this illustration, the context of the specification problem is the financial crisis. In 2008, there is a downward shift both in productivity and in growth rates. How big these effects were remains unclear, and so the Office for National Statistics keeps revising the data for this period; see [39,40]. Now, one way to capture the downward shift in productivity and growth rates is to use an impulse indicator and a step indicator. This gives the model: The formal asymptotic theory does not cover this situation with dummies. We proceed with these results nonetheless. The specification tests (39), (40) for the new model (38) are clearly better than the tests (36), (37) for the original model (35). An asymptotic theory for this type of iterative specification of a regression model is given in [41].
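For concreteness, the two indicators just mentioned can be constructed as follows; a minimal Python sketch (the function name and 1-based timing convention are ours):

```python
import numpy as np

def impulse_and_step(T, tau):
    """Impulse indicator I(t == tau) and step indicator I(t >= tau),
    for t = 1, ..., T (1-based), returned as regressor columns."""
    t = np.arange(1, T + 1)
    impulse = (t == tau).astype(float)   # one at the break date only
    step = (t >= tau).astype(float)      # one from the break date onwards
    return impulse, step
```

The impulse indicator absorbs the one-off level effect at the break date, while the step indicator captures the permanent shift; both columns are simply appended to the regression design.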
It is worth noting how the plot in Figure 1d is computed. First, C^2_{1,t} is computed using Equation (5) for t = 11, ..., 89, so that g(T) = 10 and T = 89. This can be done easily using standard regression techniques. For instance, to compute C^2_{1,11}, run a regression over the sample 1, ..., 11, including a dummy for observation 11. Then, C^2_{1,11} is the F-statistic for testing the absence of the dummy variable. Secondly, C^{2*}_{1,11} is computed from C^2_{1,11} by first transforming using the F(1, t − k − 1) distribution function, with t = 11 and k = 4 in this case, and then by the inverse of G, the χ^2_1 distribution function. The critical values are computed according to the finite sample approximation (28). For instance, for p = 0.05, x = G^{-1}{(1 − p)^{1/[T−g(T)]}} = 11.63, so that [G(x)]^{T−g(T)} = 1 − p.
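The dummy-variable device just described can be checked numerically. A minimal NumPy sketch (function name ours) that reproduces the F-statistic route:

```python
import numpy as np

def chow_via_dummy(y, X, t):
    """One-step Chow statistic C^2_{1,t} via the dummy-variable device:
    regress y_1, ..., y_t on X plus an indicator for observation t;
    the F-statistic for excluding the indicator is the statistic."""
    Xt, yt = X[:t], y[:t]
    d = np.zeros(t)
    d[-1] = 1.0                                   # dummy for observation t
    Xa = np.column_stack([Xt, d])                 # unrestricted design
    b_u = np.linalg.lstsq(Xa, yt, rcond=None)[0]
    b_r = np.linalg.lstsq(Xt, yt, rcond=None)[0]
    rss_u = yt @ yt - yt @ Xa @ b_u               # dummy absorbs observation t
    rss_r = yt @ yt - yt @ Xt @ b_r
    k = X.shape[1]
    return (rss_r - rss_u) / (rss_u / (t - k - 1))  # F(1, t-k-1) under the null
```

The same value follows from the recursive-residual identity RSS_t − RSS_{t−1} = w_t^2, so the statistic equals the squared recursive residual divided by the residual variance estimated from observations 1, ..., t − 1.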

Conclusions
We advocate the sup-Chow test as a general misspecification test to be used as part of an iterative modelling strategy. It is a relatively simple transformation of the existing one-step Chow test (or, similarly, the EViews one-step forecast test), with a standard and easily calculated null distribution, which does not vary substantially over the AR(1) parameter space. We anticipate that it would be used as one of a battery of tests (including a test of the normality of residuals); rejection would draw the modeller's attention to the pointwise plot, which would help identify the cause and timing of the failure.
By construction, the test is sensitive to parameter changes and outliers and is somewhat agnostic about the timing and number of breaks. This makes it useful against a variety of simple and complex misspecification types. However, there is a clear trade-off: as the first columns of Table 3 show, the test is less powerful against a particular alternative than the sup-F test of [13], which explicitly models a single break. This motivates the use of multiple different tests, with the failure of any one signalling misspecification and triggering further investigation. In real datasets, breaks may not be of the single mean-shift variety, and the parallel popularity of CUSUM-type and Andrews-type tests suggests that both approaches have value.
The test is not invariant to the error distribution, even asymptotically, a feature it shares with most outlier tests and end-sample tests of parameter change. There are two different solutions, depending on the modelling approach being used.
If normality is assumed and tested, there is no problem. As Table 2 shows, the pre-test for normality affects the size and power of the test, but substantial power remains after the pre-test: that is, the sup-Chow test has power somewhat orthogonal to a common normality test.
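A sketch of this two-stage strategy follows, with the Jarque-Bera statistic standing in for the normality test Φ; the choice of Jarque-Bera, the function name and the decision strings are my assumptions, not the paper's specification:

```python
from scipy import stats

def two_stage_decision(residuals, sup_chow_stat, T, gT, level=0.05):
    """Pre-test residual normality, then apply the sup-Chow test only if
    normality is not rejected.  Jarque-Bera is an illustrative stand-in
    for the normality test used in the paper."""
    jb = stats.jarque_bera(residuals)
    if jb.pvalue < level:
        return "normality rejected; sup-Chow not applicable as presented"
    # Finite-sample simultaneous critical value from the approximation (28).
    cv = stats.chi2.ppf((1.0 - level) ** (1.0 / (T - gT)), 1)
    return "instability detected" if sup_chow_stat > cv else "no evidence of instability"
```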
If normality is not maintained, the sup-Chow test as presented cannot be used. One solution would involve a subsampling technique to recover the distribution of the errors from the known good part of the sample. This is the approach taken in [15], but it necessarily complicates the test. An appealing alternative is offered in [42] in the context of another test applying extreme value theory to outlier detection. The authors resolve the distribution dependency in three stages. First, extreme value theory itself means that a wide class of error distributions converge in the maximum to the same extreme value distribution: the Gumbel, discussed in Equation (20) above. Second, the centring factor required for extreme value convergence is eliminated by examining differences between order statistics, applying a convenient theorem of [43]. Third, the scaling factor is implicitly sampled by examining the largest 1/α pointwise statistics for a level-α test. Preliminary experiments suggest that this approach can be applied effectively to the sup-Chow test, but, as in [42], a further correction is needed to control the size of the test; hence, this remains future work.

A.2. Three-Way Process Decomposition
We elaborate on the decomposition of the companion form Equation (15) given in Equation (16). Whereas there it was decomposed into non-explosive and explosive components, we now further decompose the non-explosive component into stationary and unit-root components. As before, there exists a regular real matrix M_3 that block-diagonalizes S into stationary, unit-root and explosive components, where Ũ, Q and W have eigenvalues inside, on and outside the unit circle, respectively. We can then express the two-way decomposition presented in Equation (16) as follows:
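In display form, the block diagonalization referred to above can be written as follows (a reconstruction from the surrounding text; the orientation of the similarity transform is an assumption):

```latex
M_3 S M_3^{-1} =
\begin{pmatrix}
\tilde{U} & 0 & 0 \\
0 & Q & 0 \\
0 & 0 & W
\end{pmatrix},
\qquad
\operatorname{eig}(\tilde{U}) \subset \{ |z| < 1 \}, \quad
\operatorname{eig}(Q) \subset \{ |z| = 1 \}, \quad
\operatorname{eig}(W) \subset \{ |z| > 1 \}.
```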

A.3. Preliminary Asymptotic Results
The ADL model Equation (10) becomes: where θ is the vector of coefficients. Then, from Equation (11), we have Z_t = Π S_{t−1} + ξ_{2,t}, where ξ_t has been partitioned conformably with X_t. The residuals from regressing Y_t on (Z_t, S_{t−1}) could also be obtained by regressing Y_t on (ξ_{2,t}, S_{t−1}) or, as a result of the decomposition above in Equation (16), on x_t = (ξ_{2,t}, R_{t−1}, W_{t−1})′; so we can analyse the test statistic Equation (6) as if these were the actual regressors.
Lemma 12. Suppose Assumptions 1, 2 and 3 hold with α > 4 only. Then, for all β > 1/α and ζ < 1/8, Proof. Result (i) is proven by decomposing the correlation to apply results from [25], so that: where the last line follows because C_{ŨQ} is vanishing almost surely by [25]. Result (iii) follows by writing: and applying (i) to show that C_{RW} is vanishing.
Result (iv) is exactly analogous, but substitutes (ii) for (i). Result (v) follows by again decomposing R_t. Namely, the first normed quantity on the right-hand side is bounded, since C_{ŨQ} is vanishing by [25], Theorem 9.4. The second normed quantity is handled by [44], Theorem 1(i). The term in Q_t is treated in the same way, but since Q_t contains only the unit-root components (with eigenvalues on the unit circle), we can apply [25], Theorem 8.4, which gives the stated bound for some η and all ζ < 1/8, and so, a fortiori, the same rate holds. Considering the maximum of these components, the latter again dominates and S_{RR}^{−1/2} R_{t−1} = O(t^{−ζ/2}), since α > 16/7 under Assumption 2. Result (vi) follows directly from [44], Lemma 4(i). Result (vii) follows from (i), (v) and (vi). Write: giving three normed quantities to bound. The first is o(t^{−ζ/2}) by (v), as is the second by (i), while the third is bounded by (vi).
Proof. For (i), use [46], Lemma 1(iii), and [44], Corollary 1(iii). For (ii), write: Proof. Once again, we take the proof in two steps, using that: For the first step, we again decompose using a partial regression transformation, so that: and we consider each term on the right separately. For the first term in Equation (47), use Lemma 12(iv) to write: For the second step, we have to show the bounding rate for: Many of these terms are familiar from the proof of Lemma 15, and the only new terms to bound are ∑_{s=1}^{t−1} ε_s W_{s−1} (W^{−(t−1)}) − G_t and G_t. For the latter, we have:

Table 3. Simulated rejection frequency for SC^{2*} and sup-F, possibly combined with the normality test Φ, under the process in Equation (30) with a break of magnitude γ at time τ. 50,000 repetitions; MCSE ≤ 0.5.

Table 4. Simulated rejection frequency for SC^{2*} and sup-F, possibly combined with the normality test Φ, under the process in Equation (31) with a break of magnitude δ. 50,000 repetitions; MCSE ≤ 0.5.