Information-Criterion-Based Lag Length Selection in Vector Autoregressive Approximations for I(2) Processes

Abstract: When using vector autoregressive (VAR) models for approximating time series, a key step is the selection of the lag length. Often this is performed using information criteria, even if a theoretical justification is lacking in some cases. For stationary processes, the asymptotic properties of the corresponding estimators are well documented in great generality in the book by Hannan and Deistler (1988). If the data-generating process is not a finite-order VAR, the selected lag length typically tends to infinity as a function of the sample size. For invertible vector autoregressive moving average (VARMA) processes, the selected lag length typically grows roughly proportionally to $\log T$. The same approach to lag length selection is also followed in practice for more general processes, for example, unit root processes. In the I(1) case, the literature suggests that the behavior is analogous to the stationary case. For I(2) processes, no such results are currently known. This note closes this gap, concluding that information-criteria-based lag length selection for I(2) processes indeed shows properties similar to those in the stationary case.


Introduction
The vector autoregressive (VAR) model remains the workhorse in empirical modeling of economic time series. The VAR model explains the process $(y_t)_{t \in \mathbb{Z}}$, $y_t \in \mathbb{R}^p$, using the difference equation

$$y_t = A_1 y_{t-1} + \cdots + A_h y_{t-h} + u_t, \qquad (1)$$

where $(u_t)_{t \in \mathbb{Z}}$ is a white noise process. Under the stability assumption $|A(z)| \neq 0$, $|z| \le 1$, for $A(z) = I_p - A_1 z - \cdots - A_h z^h$, this equation has a unique stationary solution. Often, deterministic components such as a constant, a linear trend, and seasonal dummies are extracted prior to VAR modeling. Frequently, this model is seen as an approximation to the data-generating process. This happens, for instance, if the data are generated by a vector autoregressive moving average (VARMA) process, where $u_t = \varepsilon_t + B_1 \varepsilon_{t-1} + \cdots + B_g \varepsilon_{t-g}$ for white noise $(\varepsilon_t)_{t \in \mathbb{Z}}$ with expectation zero and variance $\Sigma > 0$. The VARMA system $(A(z), B(z))$, $B(z) = I_p + B_1 z + \cdots + B_g z^g$ (where the pair $(A(z), B(z))$ is left-coprime), is called invertible if $|B(z)| \neq 0$, $|z| \le 1$. Under stability and invertibility, we obtain the VAR($\infty$) representation

$$y_t = \sum_{j=1}^{\infty} \Phi_j y_{t-j} + \varepsilon_t. \qquad (2)$$

Under invertibility, we have $\|\Phi_j\| \le C \rho_0^j$, where $\rho_0^{-1} > 1$ typically is the smallest modulus of the solutions of $|B(z)| = 0$. Thus, the coefficients converge to zero geometrically, implying that a VAR approximation seems plausible.
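The geometric decay of the autoregressive coefficients can be checked numerically in the simplest case. The following is a minimal sketch, assuming a pure MA(1) process $u_t = \varepsilon_t + B \varepsilon_{t-1}$ (that is, $A(z) = I_p$ and $g = 1$); the function name `ar_inf_coefficients` and the example matrix `B` are illustrative choices and do not come from the paper.

```python
import numpy as np

def ar_inf_coefficients(B, n):
    """Phi_1..Phi_n in u_t = sum_j Phi_j u_{t-j} + eps_t for u_t = eps_t + B eps_{t-1}."""
    # (I + B L)^{-1} = sum_{j>=0} (-B)^j L^j, hence Phi_j = -(-B)^j for j >= 1
    return [-np.linalg.matrix_power(-B, j) for j in range(1, n + 1)]

# Illustrative (hypothetical) MA coefficient with eigenvalues 0.5 and 0.4
B = np.array([[0.5, 0.2],
              [0.0, 0.4]])
rho0 = max(abs(np.linalg.eigvals(B)))       # inverse modulus of the zero of |B(z)| closest to the unit circle
norms = [np.linalg.norm(P) for P in ar_inf_coefficients(B, 20)]

# The ratio of successive norms approaches rho0, i.e. the decay is geometric
print("rho0:", rho0)
print("ratio of last two norms:", norms[-1] / norms[-2])
```

In this special case the decay rate is the spectral radius of `B`; for a general VARMA system the coefficients would instead be obtained from the power series expansion of $B(z)^{-1} A(z)$.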
When estimating a VAR model, one has to decide on the lag length $h$. This choice is typically based on adding a penalty term to the estimation accuracy, as measured by the residual variance $\hat\Sigma_h$, for a range of potential lag lengths $0 \le h \le H_T$, resulting in so-called information criteria. Here, the residual variance $\hat\Sigma_h$ is estimated from data as follows:¹

$$\hat\Sigma_h = \frac{1}{T} \sum_{t=h+1}^{T} \hat u_t(h) \hat u_t(h)', \qquad \hat u_t(h) = y_t - \hat A_1 y_{t-1} - \cdots - \hat A_h y_{t-h}.$$

If no structural restrictions apply, the matrices $\hat A_1, \ldots, \hat A_h$ are typically estimated using OLS or the Yule-Walker equations (see, for example, Hannan and Deistler 1988, p. 211).² In this paper, we only consider OLS estimation. Information criteria then are defined as

$$IC(h; C_T) = \log |\hat\Sigma_h| + C_T \frac{h p^2}{T}.$$

Here, $h p^2$ denotes the number of parameters estimated in the conditional mean equation, and $C_T$ is a penalty factor. The most prominent choices are AIC ($C_T = 2$) and BIC ($C_T = \log T$). The smallest integer $\hat h$ minimizing IC is then the selected lag length.³

For the case of stationary invertible VARMA processes, the asymptotic properties are well documented in Section 6.6 of HD. Under appropriate assumptions on the noise (see Theorem 6.6.3 of HD) and upper bounds $H_T$ for the lag length, for $\hat h_{BIC}$ estimated using BIC, we have $\hat h_{BIC}/l(T) \to 1$ a.s. (Theorem 6.6.3 of HD), and the same limit holds for $\hat h_{AIC}$ in probability (Theorem 6.6.4 of HD), where $l(T) = \log T/(-2 \log \rho_0)$. Consequently, $\rho_0^{l(T)} = \exp(-\log \rho_0 \log T/(2 \log \rho_0)) = T^{-1/2}$. Thus, in this case, asymptotically, the approximation error from truncating (2) at the selected lag length is of the order $T^{-1/2}$. Moreover, Theorem 7.4.7 of HD covers a more general setting: if $(y_t)_{t \in \mathbb{Z}}$ is not generated by a finite autoregression, under the same assumptions on $(\varepsilon_t)_{t \in \mathbb{Z}}$ as above and where $\sum_{j=1}^{\infty} j^{1/2} \|\Phi_j\| < \infty$, it provides an asymptotic expansion of the criterion in which the term involving $\mathring\Sigma_T = \frac{1}{T} \sum_{t=1}^{T} \varepsilon_t \varepsilon_t'$ is independent of $h$. Now impose an assumption on the sequence $\Sigma_h$, $h \in \mathbb{N}$ (Assumption 1), which involves a function $\theta(h)$. Under this assumption, HD show that the optimal $h$ (for $IC(h; C_T)$) is close to a minimizer $l(T)$ of the deterministic function $L_T(h) = \frac{h p^2}{T}(C_T - 1) + \theta(h)$.
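The selection procedure described above can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes the criterion has the form $\log|\hat\Sigma_h| + C_T h p^2 / T$, that the residual sum runs over $t = h+1, \ldots, T$ and is divided by $T$, and that no deterministic terms are included. The function names `residual_covariance` and `select_lag` are hypothetical.

```python
import numpy as np

def residual_covariance(y, h):
    """OLS residual covariance of a VAR(h) fit (sum over t = h+1..T, divided by T)."""
    T, p = y.shape
    Y = y[h:]                                   # dependent observations y_t
    if h == 0:
        resid = Y                               # no regressors: residuals are the data
    else:
        # regressor matrix with rows (y_{t-1}', ..., y_{t-h}')
        X = np.hstack([y[h - j:T - j] for j in range(1, h + 1)])
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
    return resid.T @ resid / T

def select_lag(y, H, C):
    """Smallest minimizer of log det(Sigma_hat_h) + C * h * p^2 / T over 0 <= h <= H."""
    T, p = y.shape
    ic = [np.linalg.slogdet(residual_covariance(y, h))[1] + C * h * p**2 / T
          for h in range(H + 1)]
    return int(np.argmin(ic)), ic               # argmin returns the first minimizer on ties

# Usage on a simulated scalar AR(1) series (illustrative data only)
rng = np.random.default_rng(0)
T = 500
y = np.zeros((T, 1))
for t in range(1, T):
    y[t, 0] = 0.8 * y[t - 1, 0] + rng.standard_normal()
h_aic, _ = select_lag(y, H=6, C=2.0)            # AIC: C_T = 2
h_bic, _ = select_lag(y, H=6, C=np.log(T))      # BIC: C_T = log T
print("AIC lag:", h_aic, "BIC lag:", h_bic)
```

Because `np.argmin` returns the first index attaining the minimum, ties are resolved in favor of the smallest lag length, in line with the convention stated in the text.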
The limitation to stationary processes and to autoregressive coefficients $\Phi_j$ that decline sufficiently fast excludes persistent processes such as unit root processes, which are often encountered in economic time series. Nevertheless, it is common practice to select the lag length using information criteria also in the case of highly persistent processes. For integrated processes of the I(1) kind, this is backed by some results in the literature, such as Ng and Perron (1995), stating that lag length selection for VARMA models has properties analogous to the stationary case. These results are generalized to a larger class of processes in Lütkepohl and Saikkonen (1999).⁴ However, there are currently no results available in the literature for the I(2) case of doubly integrated processes. This class of processes was introduced by Granger and Lee (1989), who illustrated the main underlying idea using an example from inventory processes: if the demand for a product (a flow variable) is modeled as an I(1) process (which is realistic in a number of cases; see Granger and Lee 1989), then the stock of the product at the producers will be I(2), as stock variables sum up the corresponding flow variables. In a macroeconomic framework, I(2) processes have been investigated, for example, by Johansen, Juselius, and coworkers, who argue that I(2) processes play a role if the model contains nominal quantities, as inflation is often found to be integrated. Since inflation is the rate of change of prices, it follows that prices must then be I(2) (see, for example, Juselius 2006, chps. 16-18, or Johansen 1995).
While these models have been used in the literature, the corresponding inferential methodology focuses on the case of autoregressions with a finite lag length. If the data-generating process is more general, such models are often an approximation and must involve the selection of the lag length. In practice, this is performed using information criteria such as AIC or BIC also in this situation, even though a theoretical justification is missing.
This paper closes this gap: we provide a general result for the asymptotic behavior of lag length selection using information criteria for I(2) processes, linking the performance for the I(2) process to a related stationary process for which the results of HD can be applied. The proof of this result can be found in the appendix. The result is illustrated using a simulation study in Section 3.

Integrated Processes
The key to the results for integrated processes is the insight that the OLS residuals of (1) and, for example, of the vector error correction equation

$$\Delta y_t = \Pi y_{t-1} + \sum_{j=1}^{h-1} \Gamma_j \Delta y_{t-j} + u_t \qquad (6)$$

are identical. Since the information criteria are a function of $\hat\Sigma_h$ (see (3)), they are invariant to invertible linear transformations of the regressors as well as to transformations of the dependent variable by adding linear functions of the regressors. If $\Pi = 0$ in (6), this defines a VAR($h-1$) process $(\Delta y_t)_{t \in \mathbb{Z}}$ for which stationary theory applies under stability assumptions for $\Gamma(z) = I_p - \sum_{j=1}^{h-1} \Gamma_j z^j$. As the estimation of $\Pi$ is superconsistent in this case, one would suspect that the inclusion of $y_{t-1}$ as a regressor does not change the lag length selection substantially, except for adding one lag to account for the differencing. A similar reasoning applies in the case of reduced rank of $\Pi$. Paulsen (1984) shows that, for finite-order autoregressive I(1) processes, the results for the stationary case can indeed be extended. Lütkepohl and Saikkonen (1999) extend some of the results to VAR($\infty$) I(1) processes, rectifying earlier incorrect proofs in the literature. Bauer and Wagner (2004) provide theory for the case of multifrequency I(1) processes, such as seasonally integrated processes, by using the vector of seasons representation, which links the multifrequency I(1) case to the I(1) case.
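The invariance of the OLS residuals stated above can be verified numerically. The sketch below fits a VAR($h$) in levels and the corresponding error correction regression of $\Delta y_t$ on $y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-h+1}$ on the same sample and compares the residuals. The function names and the simulated random walk data are illustrative assumptions.

```python
import numpy as np

def ols_residuals(Y, X):
    """Residuals from the multivariate OLS regression of Y on X."""
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Y - X @ coef

def var_and_vecm_residuals(y, h):
    """Residuals of the VAR(h) in levels and of the error correction form, h >= 1."""
    T, p = y.shape
    dy = np.diff(y, axis=0)                      # dy[t-1] = y[t] - y[t-1]
    # levels regression: y_t on y_{t-1}, ..., y_{t-h}
    X_lev = np.hstack([y[h - j:T - j] for j in range(1, h + 1)])
    res_lev = ols_residuals(y[h:], X_lev)
    # error correction regression: dy_t on y_{t-1}, dy_{t-1}, ..., dy_{t-h+1}
    X_ec = np.hstack([y[h - 1:T - 1]] +
                     [dy[h - 1 - j:T - 1 - j] for j in range(1, h)])
    res_ec = ols_residuals(dy[h - 1:], X_ec)
    return res_lev, res_ec

# Illustrative data: a bivariate random walk
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal((200, 2)), axis=0)
res_lev, res_ec = var_and_vecm_residuals(y, h=3)
print("max abs difference:", np.max(np.abs(res_lev - res_ec)))
```

The two regressor sets span the same space, and the dependent variables differ only by the regressor $y_{t-1}$, so the difference is zero up to floating-point error. Any criterion that depends on the data only through the residual covariance therefore takes the same value in both parameterizations.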
In more detail, let, in the I(1) case, where |c • (z)| = 0, |z| ≤ 1,c • (z) = ∑ ∞ j=0c•,j z j and where it is assumed that ∑ ∞ j=0 j 3/2 c •,j < ∞. Under these assumptions, it follows that we obtain the VAR(∞) representatioñ for the I(1) process (y t ) t∈Z . Furthermore, the triangular representation (7) implies that the process (ỹ 2, t ) t∈Z is integrated and not cointegrated; compare, for example, Saikkonen and Lütkepohl (1996), sct. 2. The corresponding VAR(h) approximation is given as y t = ∑ h j=1 A j y t−j + e t (h). Premultiplying with B and using that (for h > 1), the set of regressors y t−1 , y t−2 , . . . , y t−h (of dimension hp) can be linearly transformed into the set of regressors u t−1 , . . . , Here, all variables are stationary except forỹ 2,t−1 , which is I(1) and not cointegrated by design. Consequently, Φ 1,2 = 0 has to hold (compare equation (A.2) in Saikkonen and Lütkepohl 1996). Omitting this regressor, the approximation almost equals a VAR(h) approximation of (u t ) t∈Z except for the inclusion of u 1,t−h instead of the full vector u t−h .
Here, $\Sigma^u_h$ denotes the residual variance in a VAR($h$) approximation for $(u_t)_{t \in \mathbb{Z}}$. This shows that the residual variance of $(y_t)_{t \in \mathbb{Z}}$ and the one of $(u_t)_{t \in \mathbb{Z}}$ are closely related.
Thus, the essential step in linking the asymptotic properties of the lag length selection in the I(1) case (obtained from inclusion of $\tilde y_{2,t-1}$) to the ones in the stationary case (assuming that $\tilde y_{2,t-1}$ is left out, which is infeasible in practice as the matrix $\beta$ is not known prior to estimation) lies in establishing that the inclusion of $\tilde y_{2,t-1}$ has negligible effects.
For the I(2) case, the same route can be taken. To this end, consider processes according to the following assumptions: Assumption 2. Let (y t ) t∈Z be an I(2) process (not generated by a finite-order autoregression) obtained as a solution to the equation (with deterministic values y 0 , y 1 , and y 2 ) is nonsingular.
From these assumptions, it follows that the process (y t ) t∈Z is I(2) with cointegration and potentially multi-cointegration occurring. The structure of the process is best seen in the triangular representation of the process, which is equivalent to the one used above (see, for example, Li and Bauer 2020): assume that the stationary process (u t ) t∈Z is generated according to u t =c • (L)ε t (for suitable assumptions onc • (z) including |c • (1)| = 0) and related to the process (y t ) t∈Z via where β, β 1 , and β 2 are as in Assumption 2. Then, clearly, (ỹ 3, t ) t∈Z = β 2 (y t ) t∈Z is I(2) and not cointegrated, (ỹ 2, t ) t∈Z = β 1 (y t ) t∈Z is I(1) and not cointegrated, and (ỹ 1, t ) t∈Z = β (y t ) t∈Z is I(1) and cointegrates with ∆ỹ 3,t to stationarity for nonzero A. If A = 0, it is stationary. 6 The cointegrating relationsỹ 1,t − A∆ỹ 3,t between an I(1) and a differenced I(2) process are termed multi-cointegration by Granger and Lee (1989).
Denoting Au t = c • (L)ε t and using c −1 • (L) = c −1 • (1) + c * • (1)∆ + c * * • (L)∆ 2 , we obtain from premultiplying the above equation with c −1 • (L): This confirms the low rank decompositions αβ and α ⊥ Γβ ⊥ = ζη as well as the rank of these matrices. The VAR(∞) representation then follows from calculating the coefficients ofc(L) and c • (L) −1 . Therefore, the triangular representation immediately shows that the process (y t ) t∈Z is I(2) and not integrated of higher order, but leaves the dynamics contained in c • (z) unspecified. The VAR(∞) representation, on the other hand, focuses on the dynamics, but makes the derivation of the properties of the processes defined by the difference equation less immediate.
The right-hand side process is a stationary process with nonsingular spectrum at $z = 1$, as it equals $\alpha_2 \varepsilon_t + \Delta \tilde v_t$ for a stationary process $(\tilde v_t)_{t \in \mathbb{Z}}$. This shows that, for the properties of $\beta_2' \Delta^2 y_t$, the square matrix in (8) is of major importance: if it is nonsingular, then $(\beta_2' \Delta^2 y_t)_{t \in \mathbb{Z}}$ is stationary. If it is singular such that the process still contains a unit root, then $(\beta_2' \Delta^2 y_t)_{t \in \mathbb{Z}}$ is an integrated process. Hence, the nonsingularity of (8) ascertains that the solution process $(y_t)_{t \in \mathbb{Z}}$ is I(2) and not I(3) (or integrated of even higher order). For simplicity of presentation, no deterministic terms are included in the model formulation. The corresponding result for I(2) processes is the main result of this note:

Theorem 1. Let $(y_t)_{t \in \mathbb{Z}}$ be generated according to Assumption 2. Let the lag length selection be performed by minimizing the information criterion $IC(h; C_T)$ with penalty term $C_T \ge c > 1$, $C_T/T \to 0$, over the integers $0 \le h \le H_T = o(T^{1/3})$. Let the minimal minimizing argument be denoted as $\hat h(C_T)$. Then,
(i) $P(\hat h(C_T) \le M) \to 0$ for each constant $M > 0$.
(ii) If Assumption 1 holds for $\Sigma_h$ with function $\theta(h)$, then $\hat h(C_T)/h^*_T \to 1$ in probability, where $h^*_T$ minimizes the function $L_T(h; C_T) = h p^2 (C_T - 1)/T + \theta(h)$.
(iii) If $(y_t)_{t \in \mathbb{Z}}$ is an I(2) invertible VARMA process corresponding to the left-coprime pair $(A(z), B(z))$, then $-2 \hat h(C_T) \log \rho_0 / \log T \to 1$ in probability, where $\rho_0^{-1} = \min\{|z| : z \in \mathbb{C}, |B(z)| = 0\}$.
The same results hold if the process is demeaned and/or detrended before estimation.
The result shows that the lag length selection essentially follows the minimizers of the function $L_T(h; C_T)$, analogously to the stationary case. In the VARMA situation, we obtain a lag length that is proportional to $\log T$, with a proportionality constant that depends on the location of the zero closest to the unit circle. In the general case, the minimizers of $L_T(h; C_T)$ increase with the sample size, as the penalty term $h p^2 (C_T - 1)/T$ tends to zero for $T \to \infty$ while $\theta(h)$ decreases monotonically.
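To give a feeling for the orders of magnitude in the VARMA case, the following small sketch evaluates the stationary-case rate $l(T) = \log T / (-2 \log \rho_0)$ quoted in the introduction for a few sample sizes. The function name `approx_lag_length` and the value $\rho_0 = 0.7$ are illustrative assumptions.

```python
import math

def approx_lag_length(T, rho0):
    """l(T) = log T / (-2 log rho0) for 0 < rho0 < 1."""
    return math.log(T) / (-2.0 * math.log(rho0))

rho0 = 0.7                                       # illustrative value, not from the paper
for T in (100, 200, 400, 800):
    l = approx_lag_length(T, rho0)
    # rho0**l coincides with T**(-1/2) by construction of l(T)
    print(T, round(l, 2), rho0**l, T**-0.5)
```

Each doubling of the sample size adds the constant $\log 2 / (-2 \log \rho_0)$ to $l(T)$, which is the linear growth in $\log T$ referred to in the text. Values of $\rho_0$ closer to one give a steeper slope.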
The assumptions here are stronger than needed in almost all aspects. Convergence of $\Pi(z)$ on $|z| < 1 + \delta$ can be weakened to $\sum_{j=1}^{\infty} j^a \|\Pi_j\| < \infty$ for suitable $a > 0$. In addition, the iid assumption for $(\varepsilon_t)_{t \in \mathbb{Z}}$ can be weakened to martingale difference type assumptions as in HD. I use the stronger assumptions here in order to stay in line with Li and Bauer (2020) (LB in the following), on which the proof (to be found in Appendix A) is based.⁷ Note that Assumption 2 excludes the finite lag length case in which $(y_t)_{t \in \mathbb{Z}}$ is generated by an autoregression of, say, order $h_0$. In this case, we have $\Sigma_h = \Sigma$, $h \ge h_0$. It follows that $\hat h(C_T) \ge h_0$ with probability tending to one if $C_T/T \to 0$, $C_T \ge c > 1$. Furthermore, the probability of $\hat h(C_T) > h_0$ tends to zero for $C_T \to \infty$ as a function of $T$, as the penalty then dominates the estimation error for $\hat\Sigma_h$. This demonstrates that BIC is (weakly) consistent for VAR($h_0$) processes.
Inspecting the proof, it is clear that analogous results also hold for the I(1) and the multifrequency I(1) case. Moreover, the lag length is often not selected uniformly for all component processes; instead, different lag lengths are selected for each component. Also in this case, the proof in the appendix can be adapted to show that the asymptotic properties in the integrated case are analogous to the ones in the stationary case obtained by suitably differencing the process.
Such an extension also pertains to VARX models. In this situation, it is also possible to be more general such that the exogenous process does not need to possess a VAR(∞) representation with sufficiently fast decreasing coefficients, but may show (stationary or integrated) long memory if the number of lags to be included is subject to an upper bound. The bottom line in all such cases is that the integration properties of the process are of less concern, if differencing leads to stationarity with sufficiently fast decaying impulse responses.
Finally, note that similar results appear possible for processes of higher order of integration.

Simulations
In this section, the theory is illustrated with a simulation exercise involving MA(1) processes of the form

$$u_t = \varepsilon_t - \theta \varepsilon_{t-1}$$

for a standard Gaussian white noise process $(\varepsilon_t)_{t \in \mathbb{Z}}$. These processes are stationary and invertible for $\theta$ being a stable matrix. In the scalar case, the zero of the MA polynomial has modulus $\rho_0^{-1} = 1/|\theta|$, that is, $\rho_0 = |\theta|$. Therefore, the required number of lags in an autoregressive approximation is controlled via $\theta$ and is expected to grow similarly to $\log T/(-2 \log \rho_0)$ as a function of the sample size. A total of $M = 1000$ realizations of the process are simulated for each of the sample sizes $T = 100, 200, 400$, and $800$, and the parameter $\theta = -0.9, -0.7, \ldots, 0.9$ is varied on a regular grid.
We thus investigate the increase of the average selected lag length for univariate MA(1) processes as a function of the sample size as well as of the location of the zero closest to the unit circle. This is performed for the stationary process as well as for a doubly integrated MA(1) process obtained via double summation of the stationary process. The double integration is expected to add two lags compared to the lag length required for the stationary process $(u_t)_{t \in \mathbb{Z}}$, as $\Delta^2 y_t = u_t$. Furthermore, the double integration potentially also adds a deterministic quadratic trend. Thus, lag length selection is performed on the corresponding detrended process in the I(2) case. Figure 1 shows the resulting average lag lengths selected using BIC. The main effects contained in the theorem are clearly visible: the selected lag length increases with the sample size. The logarithmic scale on the x-axis for plot (a) shows⁸ the roughly linear increase in $\log T$. In addition, larger absolute values of $\theta$ result in larger lag lengths. The average lag lengths selected with BIC are very similar to the optimal values $h^*_T$ (corresponding to the stationary process $(u_t)_{t \in \mathbb{Z}}$). The doubly integrated processes $(y_t)_{t \in \mathbb{Z}}$ require roughly two more lags in all cases compared to the stationary process $(\Delta^2 y_t)_{t \in \mathbb{Z}}$, except close to $\theta = 0.9$ (which is close to a pole-zero cancellation), where the lag length selection results in only one additional lag on average.
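A scaled-down version of such an experiment can be sketched as follows. This is an illustration under stated assumptions rather than the code behind Figure 1: the sign convention $u_t = \varepsilon_t - \theta \varepsilon_{t-1}$, the polynomial detrending with a hypothetical `degree` parameter, the upper bound `H`, and the small number of replications are all choices made here for the sake of a short runtime.

```python
import numpy as np

def bic_lag(x, H):
    """Smallest minimizer of log(sigma_hat_h^2) + h * log(T) / T for a scalar series."""
    T = len(x)
    ic = []
    for h in range(H + 1):
        Y = x[h:]
        if h == 0:
            resid = Y
        else:
            X = np.column_stack([x[h - j:T - j] for j in range(1, h + 1)])
            coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
            resid = Y - X @ coef
        ic.append(np.log(resid @ resid / T) + h * np.log(T) / T)
    return int(np.argmin(ic))

def average_lags(theta, T, reps, H, rng, degree=1):
    """Average BIC lag for the MA(1) series u and for its detrended double sum y."""
    lags_u, lags_y = [], []
    # polynomial trend regressors (columns t^degree, ..., t, 1 on a rescaled time axis)
    trend = np.vander(np.arange(T, dtype=float) / T, degree + 1)
    for _ in range(reps):
        eps = rng.standard_normal(T + 1)
        u = eps[1:] - theta * eps[:-1]           # assumed sign convention for the MA(1)
        y = np.cumsum(np.cumsum(u))              # double summation: second differences of y give u
        coef, *_ = np.linalg.lstsq(trend, y, rcond=None)
        lags_u.append(bic_lag(u, H))
        lags_y.append(bic_lag(y - trend @ coef, H))
    return np.mean(lags_u), np.mean(lags_y)

rng = np.random.default_rng(1)
for T in (100, 200, 400):
    mean_u, mean_y = average_lags(theta=0.5, T=T, reps=50, H=12, rng=rng)
    print(T, mean_u, mean_y)
```

Comparing the two averages across sample sizes gives a rough impression of the number of additional lags selected for the doubly integrated series. With so few replications the averages are noisy and should not be read as a replication of the reported results.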

Conclusions
In this paper, we investigate the properties of the commonly used lag length selection based on information criteria for VAR approximations of I(2) processes. The discussion shows that the asymptotic properties of the lag length selection criteria are analogous to those in the standard stationary setting, but that two extra lags are typically required to account for the double integration.
In the invertible VARMA case, this implies that the lag lengths selected using AIC or BIC tend to infinity as a function of the sample size proportional to log T. The proportionality constant depends on the location of the zero closest to the unit circle. This is identical to the stationary case. The proof of the result indicates that this property is robust for a great number of unit roots being present in the data-generating process.
Funding: This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation-Projektnummer 276051388) which is gratefully acknowledged. I acknowledge support for the publication costs by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Proof of Theorem 1
Proof. As in Johansen (1995), sct. 4.3, it follows that the process is stationary for A =ᾱ Γβ 2 and β 1 = β ⊥ η for appropriate choice of initial conditions y 0 , y 1 , and y 2 . The MA representation, as in Johansen (1995), Theorem 4.6, shows that the difference for different values of y 0 , y 1 , and y 2 amounts to the addition of deterministic terms, which are easily dealt with.
The properties of c • (z) follow from the usual arguments, as in Johansen (1995), sct. 4.3 (see Theorem 4.6 on p. 58) noting that derivatives of absolutely convergent power series have the same radius of convergence as the original function.
The rest of the proof uses such a triangular representation. Theñ wherec • (L) −1 u t = ε t is a VAR(∞) process. Note, however, that the triangular representation requires knowledge of β, β 1 , β 2 and hence is not operational in general. We use it here only as a technical device in the proof in order to separate processes of different order of integration. In particular, the representation above in conjunction with the nonsingularity of the spectrum of (u t ) t∈Z at z = 1 directly implies thatỹ 3,t is I(2) and not cointegrated,ỹ 2,t is I(1) and not cointegrated, andỹ 1,t is I(1) or I(0) and cointegrated with ∆ỹ 3,t . Using this notation (A.2) of LB states that 9 The regressors on the right hand side are a linear transformation of the regressors in the VAR(h + 2) approximation in levels. Φ 2 , Φ 3 , and Θ subsume all nonstationary directions of the regressors. Since these are not cointegrated-as follows from the triangular representation-and all other quantities in the equation are stationary, their coefficients need to be zero, that is, We rewrite the equation as ∆ 2 y t = Ψ u z t + Ξ h U t−1,h + e t (h + 2), where z t = (ỹ 2,t−1 , ∆ỹ 3,t−1 ,ỹ 3,t−1 ) and U t−1,h collects all stationary regressors. Denoting a t , b t = T −1 ∑ T t=h+1 a t b t and W t = [z t , U t−1,h ] , we obtain, from the Frisch-Waugh-Lovell theorem, We introduce the scaling matrix D T = diag(T −1/2 I r+s , T −3/2 I s ). Then, D T z t , z t D T d → W for some random, almost surely positive, definite matrix W ∈ R (r+2s)×(r+2s) . Moreover, The theory for stationary processes implies that the second factor is O P (1). The expectation of the first factor is O(h/T) = o(1) (compare LB, p. 17). Thus, the whole term is Similarly, according to Theorem 7.4.5. of HD, where Ψ h U t−1,h , z t D T = O P (T −1/2 ). Since ∆ 2 y t , z t D T = O P (T −1/2 ), we obtain ∆ 2 y t , z π t D T = O P (T −1/2 ). 
Thus, Comparing with the expression forΣ h+2 from above, we obtain Thus, |Σ u h | ≥ |Σ u h | ≥ |Σ u h+2 | and |Σ h+2 | − |Σ u h | = O P (T −1 ) uniformly in 0 < h < H T , wherê Σ u h denotes the estimated residual variance of a long VAR approximation of (u t ) t∈Z using lag length h. This is sufficient for the results, as can be seen as follows: divergence to infinity for h(C T ) is obvious since the error term is at most of the same order as the penalty term.
To establish (ii), consider the case C T → ∞ first. Note that the order O P (T −1 ) together with C T → ∞ implies that IC(h; C T ) = IC u (h; C T )(1 + o P (1)), where IC u denotes the criterion for (u t ) t∈Z . This is the prerequisite for (ii) and (iii) according to the arguments in HD on pp. 332-34 following Theorem 7.4.7. (ii): Let h * T denote the minimizer over h of L T (h) (for h ∈ R + ). Then, a mean value expansion around h * T leads to whereh T is an intermediate value. Now IC(h; C T ) is minimized atĥ T , where IC(h; C T ) − Σ T − L T (h; C T ) = O P (T −1 ) + o P (θ(h)) uniformly in h, according to the above in combination with (7.4.22) of HD. This implies that eitherĥ T /h * T → 1 or θ(h * T ) = O P (T −1 ) such that the positive right hand side of the above can be reversed by the estimation error to obtain IC(ĥ T ; C T ) ≤ IC(h * T ; C T ). However, θ(h * T ) = O P (T −1 ) is a contradiction to h * T , minimizing L T (h; C T ) = hp 2 (C T − 1)/T + θ(h): in that case, h * T − 1 reduces L T by p 2 (C T − 1)/T plus a smaller error term due to C T → ∞. This leads toĥ T /h * T → 1. For h → ∞, the order IC(h; C T ) = IC u (h; C T )(1 + o P (1)) also holds with C T remaining finite, as in AIC. Since the lag length estimated for finite C T is at least as large as the one selected using BIC,ĥ T → ∞ follows, again implying the results.
All statements above also hold for demeaned and/or detrended processes.

¹ Other variants exist, for example, using the same time points in the summation for all models; compare with Kilian and Lütkepohl (2017).
² Hannan and Deistler (1988) will, in the following, be abbreviated as HD.
³ In the unlikely case of ties, the smallest integer h achieving the minimum is selected.
⁴ Lütkepohl and Saikkonen (1999) also correct an argument contained in Ng and Perron (1995).