Modeling I(2) Processes Using Vector Autoregressions Where the Lag Length Increases with the Sample Size

In this paper the theory on the estimation of vector autoregressive (VAR) models for I(2) processes is extended to long VAR approximations of more general processes. The order of the autoregression is allowed to tend to infinity at a rate depending on the sample size. We deal with unrestricted OLS estimation (in the model formulated in levels as well as in vector error correction form) and with two stage estimation (2SI2) in the vector error correction model (VECM) formulation. Our main results are analogous to the I(1) case: we show that the long VAR approximation leads to consistent estimates of the long and short run dynamics. Furthermore, tests on the autoregressive coefficients follow standard asymptotics. The pseudo likelihood ratio tests on the cointegrating ranks (using the Gaussian likelihood) employed in the 2SI2 algorithm have the same null distributions as for data generating processes following finite order VARs. The same holds for the asymptotic distribution of the estimators of the long run dynamics, both in unrestricted VECM estimation and in the reduced rank regressions of the 2SI2 algorithm. Building on these results we show that if the data are generated by an invertible VARMA process, the VAR approximation can be used to derive a consistent initial estimator for subsequent pseudo likelihood optimization in the VARMA model.


Introduction
Many macroeconomic variables have been found to exhibit trend-like behaviour that can be modelled by using vector autoregressions (VARs). Katarina Juselius (2006) states that empirical modelling led to the development of I(1) and I(2) models since certain features of the datasets considered required including first and second differences in order to obtain stationary time series. Additionally, cointegrating relations were found in the corresponding analyses. Similar findings have recurred numerous times in the literature, for example in relation to money demand (Johansen 1992b; Juselius 1994), inflation (Banerjee et al. 2001; Georgoutsos and Kouretas 2004), and interest rates and real exchange rates (Johansen et al. 2007; Juselius and Assenmacher 2017; Juselius and Stillwagon 2018; Stillwagon 2018), to mention only a few sources.
The predominant methodological approach to modeling integration and cointegration in the I(1) and I(2) cases in the vector autoregressive (VAR) framework has been established mainly by Søren Johansen and Katarina Juselius together with a number of coauthors (see the lists of references in Johansen (1995); Juselius (2006) for details), building on vector error correction models (see Engle and Granger (1987) for early comments on the history of using error correction models for cointegrated processes). Extending the main ideas of cointegration modeling in the I(1) setting (see, e.g., Johansen 1992a), Johansen (1997) suggested a representation for the I(2) case and established asymptotic distributions for the suggested two step I(2) estimator (2SI2) as an approximation to pseudo maximum likelihood estimation involving numerical optimization. Asymptotics for the corresponding likelihood ratio tests have been developed in Paruolo (1994, 1996); the asymptotic equivalence of 2SI2 to pseudo likelihood (using the Gaussian distribution) optimization, and hence in a certain sense its statistical efficiency, is shown in Paruolo (2000). However, Nielsen and Rahbek (2007) show that in finite samples the likelihood ratio test has size advantages. The testing of restrictions on the parameters has been investigated by Boswijk and Doornik (2004); Boswijk and Paruolo (2017); Johansen and Lütkepohl (2005). Due to the implicit vector error correction model (VECM) formulation, deterministic terms in the VECM produce complex deterministic terms in the solution processes. In the I(2) context Nielsen and Rahbek (2007); Paruolo (1994, 2006); Rahbek et al. (1999); Kurita et al. (2011) discuss the impact of deterministic terms.
As the VECM representation expresses reduced rank matrices as products of two matrices, identification conditions are of particular importance, see Juselius (2006); Paruolo (2013, 2017). In this context weak exogeneity has also been studied (Kurita 2012; Paruolo and Rahbek 1999).
The main idea underlying the VECM approach for estimating VAR models in the I(2) context is to reparameterize the problem such that integration and cointegration properties relate to the rank of two matrices. Assuming the data generating process to be a VAR of known finite order, the ranks of these matrices can be tested using (pseudo) likelihood ratio tests.
Sometimes the assumption of a known finite order is not justified. For example, it is known that a subset of variables generated by a finite order VAR in general cannot itself be described by a finite order VAR, but instead requires a vector autoregressive moving average (VARMA) model. However, the class of VARs is flexible in the sense that a VAR of infinite order can represent a large set of linear dynamical systems, including all invertible VARMA systems. For stationary processes Berk (1974) and Lewis and Reinsel (1985) show that by letting the order of the VAR tend to infinity at a suitable rate depending on the sample size, consistent estimation of the underlying transfer function can be achieved for data generating processes that can be described by a VAR(∞), subject to mild assumptions on the summability of the VAR coefficients. Lewis and Reinsel (1985) also establish asymptotic normality (in a very specific sense) of linear combinations of the estimated autoregressive coefficients. Hannan and Deistler (1988) make these concepts operational by showing that in the case of a VARMA process generating the dataset the required rate for letting the order tend to infinity can be attained using BIC model selection.
In the case of I(1) processes the estimation theory for long VAR approximations to VARMA processes has been extended, based on the techniques of Lewis and Reinsel for the stationary case, in a series of papers: Saikkonen (1991, 1992); Lütkepohl and Saikkonen (1997); Saikkonen and Lütkepohl (1996); Saikkonen and Luukkonen (1997). The Johansen framework of rank restricted estimation in the VECM has also been extended to long VAR approximations by Saikkonen and Luukkonen (1997). Bauer and Wagner (2004) provide extensions to the multi frequency I(1) case where unit roots may occur at the seasonal frequencies.
For the I(2) case no such extensions are currently known. This is the research gap this paper fills: First, we establish consistency and asymptotic normality of the estimated autoregressive coefficients (in the sense of Lewis and Reinsel) for unrestricted ordinary least squares (OLS) estimation in the VECM representation. This can be used to derive Wald type tests of linear restrictions on the autoregressive parameters. Second, we extend the rank restricted regression techniques of the I(2) case to long VAR approximations, showing that the asymptotics (for estimated cointegrating relations, likelihood ratio tests and the two step estimation procedure) are identical for long VAR approximations and VARs of finite known order. Third, we show that if the data generating process is an invertible VARMA process, the long VAR system estimator can be used to obtain consistent initial estimators for subsequent pseudo likelihood maximization in the VARMA model class. In all results we restrict ourselves to the case of no deterministic terms in the VECM representation. The inclusion of deterministic terms requires changing the test distributions; compare, for example, the theory in Rahbek et al. (1999).
The paper is organized as follows: In the next section the data generating process and the main assumptions are described. Section 3 then provides the results for the unrestricted estimation. Section 4 deals with rank restricted regression in the 2SI2 procedure, while Section 5 investigates the initial guess in the VARMA setting for subsequent pseudo likelihood maximization. Finally Section 6 concludes the paper. Proofs are relegated to an appendix.
Throughout the paper we will use the notation introduced by Johansen (1997): For a matrix C ∈ R^{p×s}, s < p, of full column rank we use the notation C̄ = C(C′C)^{−1}. Furthermore, C⊥ denotes a full column rank matrix of dimension p × (p − s) such that C′⊥C = 0. Whenever this notation is used, the particular choice of C⊥ is not of importance. For a matrix C = (C_{i,j}) ∈ R^{p×s} we let ‖C‖ denote the Frobenius norm.

Data Generating Process and Assumptions
In this paper we use the following assumptions on the data generating process:

Assumption 1 (DGP). The process (y_t)_{t∈Z}, y_t ∈ R^p, is generated from the difference equation for t ∈ Z, where α, β ∈ R^{p×r}, 0 ≤ r < p, are full column rank matrices and ∆ = (1 − L) with L denoting the backward shift operator, such that L(y_t)_{t∈Z} = (y_{t−1})_{t∈Z}. Furthermore, there exists a real δ > 0 such that the power series defining A(z) converges absolutely for |z| < 1 + δ. Define β_2 = β⊥η⊥ and α_2 = α⊥ζ⊥, where α′⊥Γβ⊥ = ζη′ with η, ζ ∈ R^{(p−r)×s} of full column rank, s < p − r. Then it is assumed that the corresponding matrix is nonsingular. Furthermore, the process (ε_t)_{t∈Z} is independent identically distributed (iid) white noise with mean zero and variance Σ > 0.
It is well known that the conditions (2) and (3) are necessary and sufficient for the existence of solutions to the difference equation that are I(2) processes, see for example Johansen (1992a). Moreover, the assumption of absolute convergence of A(z) for |z| < 1 + δ implies that ∑_{j=0}^∞ j^k ‖Π_j‖ < ∞ for every k ∈ N. In particular ∑_{j=0}^∞ j² ‖Π_j‖ < ∞ follows, which will be used frequently below.

Every vector autoregressive function A(z) corresponding to the autoregression A(L)y_t = ε_t that fulfills Assumption 1 allows a factorization A(L) = g(L)B(L), where B(L) collects the unit root structure and the relevant leading matrix is nonsingular due to assumption (3). Furthermore, g(z) = ∑_{j=0}^∞ G_j z^j is a transfer function with ∑_{j=0}^∞ ‖G_j‖ j² < ∞, since ∑_{j=1}^∞ ‖Π_j‖ j² < ∞, and thus the same holds for the power series coefficients of A*(L). Since |B(z)| = 0 only for z = 1, it follows that |g(z)| ≠ 0 for |z| ≤ 1, so that the process (B(L)y_t)_{t∈Z} is a VAR process. Note, however, that g(0) = G_0 ≠ I_p in general. This leads to a triangular representation of the process: denoting y_{1,t} = β′y_t ∈ R^{p_1}, y_{2,t} = β′_1 y_t ∈ R^{p_2}, y_{3,t} = β′_2 y_t ∈ R^{p_3} we have

y_{1,t} = −ᾱ′Γβ̄_2 ∆y_{3,t} + u_{1,t} = A∆y_{3,t} + u_{1,t},   A : p_1 × p_3,
∆y_{2,t} = u_{2,t},
∆²y_{3,t} = u_{3,t},

where u_t = [u′_{1,t}, u′_{2,t}, u′_{3,t}]′ has a VAR(∞) representation. Furthermore, defining g̃(L) and B̃(L) suitably we obtain A(L) = g(L)B(L) = g̃(L)B̃(L) such that B̃(L)y_t = ∆²y_t + Πy_{t−1} + Γ∆y_{t−1} = v_t, g̃(L)v_t = ε_t, is another representation of the process (y_t)_{t∈Z} with B̃(0) = I_p. It follows that the triangular representation can be seen as a special case where one has partial information on the matrices β, β_1, β_2. For estimation the VECM representation is approximated using a finite order h:

∆²y_t = Φy_{t−1} + Ψ∆y_{t−1} + ∑_{j=1}^{h−2} Π_j ∆²y_{t−j} + e_t,

where e_t = ε_t + e_{1t}, e_{1t} = ∑_{j=h−1}^∞ Π_j ∆²y_{t−j}. As in the VECM representation, the dimensions of β, β_1, β_2 are linked to the ranks of the matrices Φ and ᾱ′⊥Ψβ̄⊥. Restricting these matrices to be of particular rank is simpler than imposing the equivalent restrictions in the VAR(h) representation directly.
In the following we will first investigate the unrestricted ordinary least squares estimator in the VECM representation without taking rank restrictions into account. In the second step the 2SI2 procedure as presented in Paruolo (2000) for imposing the two rank restrictions in two steps is investigated.
For both procedures the selection of the order h is of importance. In this respect the following assumption will be used:

Assumption 2 (Lag order h). The order h = h(T) is chosen subject to the following restrictions: (i) h → ∞ and h = o(T^{1/5}) as T → ∞; (ii) T^{1/2} ∑_{j=h(T)+1}^∞ ‖Π_j‖ → 0.

The first condition defines an upper bound for the order, which is usually directly assured during order selection using, for example, information criteria. The upper bound is smaller than the usual rate T^{1/3} for technical reasons. The stronger bound is not needed for all results, and its implications for practical applications are minor: for example, in the range 1 ≤ T ≤ 950 we have 2.5T^{1/5} > T^{1/3}. The second condition of Assumption 2 implies a lower bound for the increase of h as a function of the sample size. Clearly ∑_{j=h+1}^∞ ‖Π_j‖ → 0 for h → ∞; the condition requires that for h = h(T) this convergence is fast enough that T^{1/2} ∑_{j=h(T)+1}^∞ ‖Π_j‖ still converges to zero. The lower bound depends on the underlying true parameters. For invertible VARMA processes, which can be seen as the leading case, ‖Π_j‖ ≤ Cρ_0^j for some 0 ≤ ρ_0 < 1. Hannan and Deistler (1988) show that for an invertible stationary VARMA process the lower bound (in this case proportional to log T) is achieved asymptotically by using BIC as the order selection procedure; in this case also the stronger condition h = o(T^{1/5}) is satisfied. Bauer and Wagner (2004) extend this result to the multi frequency I(1) setting. For the I(2) case no analogous result is known, although the developments of Bauer and Wagner (2004) suggest that a similar result holds there as well. This is left for future research.
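The practical content of these bounds can be checked numerically. The sketch below verifies the claim about 2.5T^{1/5} versus T^{1/3} and illustrates that a logarithmic lag order satisfies the lower bound under geometric decay of the coefficient norms; the decay rate ρ0 = 0.7 and the proportionality constant 3 in h(T) are illustrative choices, not values from the paper:

```python
import numpy as np

# Upper bound: for 1 <= T <= 950 the rate 2.5 * T^(1/5) still exceeds the
# 'usual' upper bound rate T^(1/3), so the stronger bound barely binds.
T = np.arange(1, 951, dtype=float)
assert np.all(2.5 * T ** (1 / 5) > T ** (1 / 3))

# Lower bound: with geometric decay ||Pi_j|| <= C * rho0^j (the VARMA case),
# h(T) = c * log(T) with c large enough makes T^(1/2) * (tail sum) vanish.
rho0, C = 0.7, 1.0                            # illustrative decay parameters
vals = []
for TT in [1e3, 1e5, 1e7]:
    h = int(3 * np.log(TT))                   # c = 3 > 1 / (2 log(1/rho0)) ~ 1.4
    tail = C * rho0 ** (h + 1) / (1 - rho0)   # sum_{j = h+1}^infty C * rho0^j
    vals.append(np.sqrt(TT) * tail)
assert vals[0] > vals[1] > vals[2] and vals[2] < 1e-3
```

The first assertion reproduces the numerical remark in the text; the second shows the product T^{1/2} times the tail sum decreasing towards zero along a growing sequence of sample sizes.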
Therefore the difference between the 'usual' rates and the ones assumed above is deemed to be of minor practical consequence. Thus in the main text we are not explicit as to which results hold true under the less restrictive conditions and which do not; in the appendix, however, we comment on this point.

Unrestricted Estimation
In this section the results of Lewis and Reinsel (1985) and Saikkonen and Lütkepohl (1996) are extended to the I(2) case. To simplify notation define ⟨a_t, b_t⟩ = ∑_{t=h+1}^T a_t b′_t for sequences a_t, b_t, t = 1, ..., T.¹ The unrestricted least squares estimator in the finite VECM model uses the regressor vector Z_{t,h} = [y′_{t−1}, ∆y′_{t−1}, ∆²y′_{t−1}, ..., ∆²y′_{t−h+2}]′ ∈ R^{ph}. The corresponding ordinary least squares estimator is given as

[Φ̂, Ψ̂, Π̂_1, ..., Π̂_{h−2}] = [⟨∆²y_t, y_{t−1}⟩, ⟨∆²y_t, ∆y_{t−1}⟩, ⟨∆²y_t, ∆²y_{t−1}⟩, ..., ⟨∆²y_t, ∆²y_{t−h+2}⟩] ⟨Z_{t,h}, Z_{t,h}⟩^{−1}.

The noise covariance is estimated from the residuals as usual as Σ̂ = N^{−1}⟨ê_t, ê_t⟩, where N = T − h denotes the effective sample size.

¹ Here, somewhat sloppily, we use the same symbols for processes and their realizations.
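To fix ideas, the unrestricted OLS step can be sketched as follows. The simulated DGP (∆²y_t = ε_t, so that Φ = Ψ = Π_j = 0 and Σ = I_2) and all variable names are illustrative choices for this sketch, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, h = 2, 500, 4                       # dimension, sample size, VAR order

# Illustrative I(2) DGP with Phi = Psi = Pi_j = 0:  Delta^2 y_t = eps_t.
eps = rng.standard_normal((T, p))
y = np.cumsum(np.cumsum(eps, axis=0), axis=0)

d1 = y[1:] - y[:-1]                       # d1[i] corresponds to Delta y_{i+1}
d2 = d1[1:] - d1[:-1]                     # d2[i] corresponds to Delta^2 y_{i+2}

# Regress Delta^2 y_t on Z_{t,h} = [y_{t-1}', Dy_{t-1}', D2y_{t-1}', ..., D2y_{t-h+2}']'.
Z, Y = [], []
for s in range(h, T):                     # time index t = s
    z = np.concatenate([y[s - 1], d1[s - 2]]
                       + [d2[s - j - 2] for j in range(1, h - 1)])
    Z.append(z)
    Y.append(d2[s - 2])
Z, Y = np.array(Z), np.array(Y)

coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # rows: [Phi, Psi, Pi_1, ...]'
resid = Y - Z @ coef
Sigma_hat = resid.T @ resid / len(Y)           # noise covariance from residuals
```

Since the simulated Φ and Ψ are zero, both the estimated coefficients and the deviation of Σ̂ from I_2 should be small for this sample size.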

Estimation in the Triangular VECM Representation
As is typical in the cointegration framework, the analysis is easier in the triangular representation, which separates stationary components from I(1) and I(2) processes: Let y_t = [y′_{1,t}, y′_{2,t}, y′_{3,t}]′ ∈ R^p, y_{i,t} ∈ R^{p_i}, as in the triangular representation above. Note, however, that using the triangular representation implies that the matrix B(L) is known up to the value of the matrix A. In applications this is seldom the case.
Thus letting g(z) = g(1) + g*(z)(1 − z), we obtain, with π(L) = I_p − ∑_{j=1}^∞ Π_j L^j, the corresponding VECM representation. The sums involved exist since ∑_{j=1}^∞ ‖G_j‖ j² < ∞ by assumption. Similarly, we partition Φ, Ψ and Π_j into [Φ_1, Φ_2, Φ_3], [Ψ_1, Ψ_2, Ψ_3] and [Π_{j1}, Π_{j2}, Π_{j3}], respectively; the analogous partitioning is used for the estimates. Note that in this notation the I(2) components on the right hand side are y_{3,t−1}, the I(1) components are y_{1,t−1}, y_{2,t−1}, ∆y_{3,t−1}, where y_{1,t−1} − A∆y_{3,t−1} is stationary. Thus, in order to separate regressors of different integration orders in the proof (as is usually done in the literature), we use a transformation involving the unknown matrix A such that the regressor y_{1,t−1} is replaced by y_{1,t−1} − A∆y_{3,t−1}. Consequently the estimate Ψ̂_3 of Ψ_3 is replaced by the estimate Θ̂. Based on the estimates Ψ̂ and Φ̂, A can then be estimated as Â. Here the insertion of Σ̂^{−1} appears somewhat arbitrary; a motivation for this choice in the I(1) case can be found in Saikkonen (1992), equation (12). However, any other positive definite matrix could be used as well; nothing is currently known about the optimality of the choice suggested above.
In the asymptotic distribution of the estimation error, Brownian motions relating to the process (u_t)_{t∈Z} occur: under Assumption 1 a functional central limit theorem holds for the partial sums of u_t, where B(r), 0 ≤ r ≤ 1, denotes a Brownian motion with corresponding variance matrix. An estimator of Ω_{1.c} is given by Ω̂_{1.c}. With these definitions we can state the first result of the paper (which is proved in Appendix B):

Theorem 1. Under Assumptions 1 and 2 for the triangular VECM representation we have (N = T − h): (A) consistency; (B) the asymptotic distribution of the coefficients to nonstationary regressors, where, since α = [I_{p_1}, 0]′, the limit simplifies accordingly; (C) the asymptotic distribution of the coefficients to stationary regressors, for a sequence of matrices L_h of appropriate dimension; (D) the asymptotic distribution of Wald type tests.

The theorem provides the asymptotic distributions of the OLS estimates in the triangular system. Note that in this somewhat special case the properties of the regressor components (stationary or not) are known, such that for each entry the convergence speed is known. Correspondingly, the definition of the regressor vector X̃_t involves only lags of y_t but omits all nonstationary regressors except the ones cointegrated with ∆y_{3,t−1}.
The assumptions on L_h are more restrictive than needed. Lewis and Reinsel (1985) and Saikkonen and Lütkepohl (1996) only require that L_h has full column rank when deriving the normalized convergence to a normal distribution with unit variance in the limit. Similar arguments could be used here.

Estimation in the General VECM Representation
The previous section dealt with the special case in which a triangular representation is used and hence knowledge of the matrices [β, β_1, β_2] is given. This section provides a result for the general case, which, however, is limited to the coefficients of the stationary components. Since a general process generated according to Assumption 1 can be rewritten in triangular form using the knowledge of [β, β_1, β_2], some asymptotic properties of the unrestricted OLS estimators can be derived from Theorem 1 for the general case. The following theorem is proved in Appendix C:

Theorem 2. Under Assumptions 1 and 2 it follows that Λ̂ − Λ = o_P(1).
Besides consistency, the theorem implies that linear combinations of the OLS estimators are asymptotically normal, and hence allow standard inference, if the asymptotic variance is nonsingular. One application of such results is the so called 'surplus lag' formulation in the context of Granger causality testing, see Bauer and Maynard (2012); Dolado and Lütkepohl (1996).
Finally, note that this section does not contain results with regard to the cointegrating rank or the cointegrating space. The theorem above merely allows testing of coefficients corresponding to stationary regressors. Therefore its usage is limited to somewhat special situations like the surplus lag causality tests. However, it is also relevant for impulse response analysis, compare Inoue and Kilian (2020).
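The surplus lag idea of Dolado and Lütkepohl (1996) can be illustrated by a small sketch (illustrative Python; the DGP, the lag choices and all names are assumptions of this sketch, not the paper's setup): one extra, untested lag is included so that the Wald test on the first h lags of the potentially causing variable retains its standard χ² limit despite the unit roots:

```python
import numpy as np

rng = np.random.default_rng(1)
T, h = 400, 1                              # sample size, number of lags under test

# Two independent random walks: y2 does not Granger-cause y1.
y1 = np.cumsum(rng.standard_normal(T))
y2 = np.cumsum(rng.standard_normal(T))

# Surplus-lag regression: include h + 1 lags of both variables, but test
# only the y2 coefficients at lags 1..h (the extra lag absorbs the unit roots).
L = h + 1
X = np.column_stack([np.ones(T - L)]
                    + [y1[L - j:T - j] for j in range(1, L + 1)]
                    + [y2[L - j:T - j] for j in range(1, L + 1)])
Y = y1[L:]
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
s2 = resid @ resid / (len(Y) - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)            # conventional OLS covariance estimate

idx = [1 + L + j for j in range(h)]        # positions of y2 lags 1..h
wald = beta[idx] @ np.linalg.inv(V[np.ix_(idx, idx)]) @ beta[idx]
# Under no causality, wald is asymptotically chi-squared with h degrees of freedom.
```

The point of the construction is that the tested coefficients attach to regressors that can be rewritten in terms of stationary quantities, which is exactly the situation covered by the theorem above.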

Rank Restricted Regression
The previous sections show that for the estimators discussed there, full inference on all coefficients is only possible when information on the matrices β, β_1 and β_2 exists. The dimensions of these matrices relate to the rank of the matrix Φ = αβ′ and, conditional on this, to the rank of ᾱ′⊥Ψβ̄⊥. The two rank restrictions make estimation and specification more complex than in the I(1) case. Johansen (1995) provides the two step approach 2SI2 that can be used for estimation and specification of the two integer valued parameters p_1 and p_2. Paruolo and Rahbek (1999) extend the 2SI2 procedure suggested in section 8 of Johansen (1997). Paruolo (2000) shows that this 2SI2 procedure achieves the same asymptotic distribution as pseudo maximum likelihood estimation which could be performed subsequent to 2SI2 estimation. This makes the procedure attractive from a practical point of view. In this section we show that these approaches extend naturally to the long VAR case. The main focus lies on the derivation of the asymptotic properties of the rank tests.
Recall the long VAR approximation, where Φ = αβ′ has reduced rank r < p and ᾱ′⊥Ψβ̄⊥ = ζη′ has reduced rank s < p − r. In this notation the 2SI2 procedure works as follows: In the first step the rank constraint on ᾱ′⊥Ψβ̄⊥ is neglected, and α and β are estimated by reduced rank regression (RRR). In the second step the reduced rank of ᾱ′⊥Ψβ̄⊥ is imposed using RRR in a transformed equation. In more detail, using the Johansen notation, we denote by R_{0t}, R_{1t} and R_{2t} the residuals of regressing ∆²y_t, ∆y_{t−1} and y_{t−1} on ∆²y_{t−1}, ..., ∆²y_{t−h+2}, respectively; then we can rewrite (9) in terms of these residuals. Concentrating out R_{1t} and denoting the residuals by R_{0.1t} and R_{2.1t}, we define S_{ij.1} := ⟨R_{i.1t}, R_{j.1t}⟩. The solution to the RRR problem follows from solving an eigenvalue problem with solutions 1 > λ̂_1 ≥ ... ≥ λ̂_p > 0, ordered with decreasing size, and corresponding vectors V̂ = (v̂_1, ..., v̂_p). Then, as usual, the trace statistic Q_r for testing the model H_r with rank(Φ) ≤ r, r < p, within the model H_p with rank(Φ) ≤ p, is obtained from these eigenvalues, and the optimizers for α and β are obtained from the leading eigenvectors. In the second step, given α and β known, a transformed equation is obtained by multiplying (10) by ᾱ′⊥. Note that β′R_{1t} is stationary. Thus, concentrating out C and denoting the residuals by R_{ᾱ⊥.β,t} and R_{β⊥.β,t}, respectively, we define S_{ab.β} := ⟨R_{a.β,t}, R_{b.β,t}⟩ for a, b = ᾱ⊥ or β⊥. The likelihood ratio test Q_{r,s} of the model H_{r,s} with rank(ζη′) ≤ s, s < p − r, within the model H⁰_r with rank(ᾱ′⊥Ψβ̄⊥) = p − r is then based on the solutions 1 > ρ̂_1 ≥ ... ≥ ρ̂_{p−r} > 0 of the corresponding eigenvalue problem with eigenvectors Ŵ = (ŵ_1, ..., ŵ_{p−r}); estimators of ζ and η are obtained from these. For the 2SI2 procedure in this second step the first step estimates α̂ and β̂ are used in place of the unknown true quantities. We then obtain the following analogue of the results in the finite order VAR framework (the proof is given in Appendix D):

Theorem 3. Let the data be generated according to Assumption 1 and let the VAR order fulfill Assumption 2. Then the following asymptotic results hold: (A) The asymptotic distribution of the likelihood ratio statistic Q_r under the null hypothesis H_r is identical to the distribution achieved in the finite VAR case. (B) The asymptotic distribution of the likelihood ratio statistic Q_{r,s} under the null hypothesis H_{r,s} is identical to the finite VAR case. (C) The same holds for the asymptotic distribution of the test statistic S_{r,s} = Q_r + Q_{r,s} under the null hypothesis H_{r,s}. (D) Using suitable normalizations all estimators are consistent; the asymptotic distributions of the coefficients to the nonstationary regressors are identical to the ones in the finite order VAR case stated in Paruolo (2000).
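A compact sketch of the first step RRR computation (illustrative Python; the function names and the reduction of the generalized eigenvalue problem via a Cholesky factor are choices of this sketch, not the paper's algorithm):

```python
import numpy as np

def project_out(X, W):
    """Residuals of regressing the columns of X on W (concentration step)."""
    coef, *_ = np.linalg.lstsq(W, X, rcond=None)
    return X - W @ coef

def rrr_eigenvalues(R0, R1, R2):
    """Concentrate out R1, then solve the generalized eigenvalue problem
    |lam * S22.1 - S20.1 S00.1^{-1} S02.1| = 0 via a Cholesky reduction."""
    N = R0.shape[0]
    R0c, R2c = project_out(R0, R1), project_out(R2, R1)
    S00 = R0c.T @ R0c / N
    S02 = R0c.T @ R2c / N
    S22 = R2c.T @ R2c / N
    L = np.linalg.cholesky(S22)
    Li = np.linalg.inv(L)
    M = Li @ S02.T @ np.linalg.solve(S00, S02) @ Li.T   # symmetric, PSD
    return np.sort(np.linalg.eigvalsh(M))[::-1]         # 1 > lam_1 >= ... >= lam_p

def trace_statistic(lam, r, N):
    """Pseudo-LR trace statistic Q_r = -N * sum_{i > r} log(1 - lam_i)."""
    return -N * np.sum(np.log1p(-lam[r:]))

# Illustration with pure-noise residuals (true rank of Phi equal to zero).
rng = np.random.default_rng(2)
N, p = 300, 2
lam = rrr_eigenvalues(rng.standard_normal((N, p)),
                      rng.standard_normal((N, p)),
                      rng.standard_normal((N, p)))
Q0 = trace_statistic(lam, 0, N)
```

The eigenvalues are the squared canonical correlations between the concentrated residuals, so they always lie in [0, 1); the second step works with the analogous quantities built from the transformed equation.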
The asymptotic distributions of the coefficients Π̂_j are identical to the ones in Theorem 1.
The main message of the theorem is that the 2SI2 procedure shows the same asymptotic properties, including the rank tests, as in the finite order VAR case. As usual, restricting the coefficients of the nonstationary regressors does not influence the asymptotics of the coefficients corresponding to the stationary regressors.
Note that Paruolo (2000) shows that in the finite VAR case the 2SI2 estimates have the same asymptotic distribution as pseudo maximum likelihood (pML) estimates maximizing the Gaussian likelihood. The first order conditions for the pML estimates of the coefficients of the nonstationary regressors, provided in the first display on p. 548 of Paruolo (2000), depend on the data only via the matrices S_{ij} defined above. These matrices depend on the lag length of the VECM only via the concentration step. The proof of our Theorem 3 shows that these terms have the same asymptotic distributions for the finite order VAR and the long VAR. Theorem 4.3 of Paruolo (2000) shows that the asymptotic distribution of the coefficients corresponding to stationary regressors does not depend on the distribution of the coefficients corresponding to the nonstationary regressors as long as the latter are estimated super-consistently. Thus our results imply that also in the long VAR case the asymptotic distributions of all estimates for the 2SI2 and the pML approach are identical.

Initial Guess for VARMA Estimation
One use of long VAR approximations is as a preliminary estimator for VARMA model estimation. Hannan and Kavalieris (1986) provide properties of such an approach in the stationary case; Lütkepohl and Claessen (1997) extend the procedure to the I(1) case. Here we extend this idea to the I(2) case.
The goal is to provide a consistent initial guess for the estimation of a VARMA model for I(2) processes. In this respect we assume the following data generating process:

Assumption 3 (VARMA dgp). The process (y_t)_{t∈Z} is generated as the solution to the state space equations, where (ε_t)_{t∈Z} denotes white noise subject to the same assumptions as in Assumption 1 and x_t ∈ R^n is the unobserved state process. The system (A, B, C) is assumed to be minimal and in the canonical form of Bauer and Wagner (2012); in particular, the stationary subvector of the state is given as the stationary solution to the stable part of the system.
In this situation it follows that (y_t)_{t∈Z} is an I(2) process in the sense of Bauer and Wagner (2012); that is, its second difference is a stationary VARMA process. The integers c and d are connected to the integers p_1, p_2, p_3 via c = p_3, d = p_2, such that p_1 = p − c − d. It can furthermore be shown that a process generated under Assumption 3 possesses a VAR(h) approximation whose coefficients converge to zero exponentially fast for j → ∞ due to the strict minimum-phase condition. Letting h → ∞ then implies the existence of a VAR(∞) representation. It follows that for such systems A(z) converges absolutely for |z| < ρ^{−1} with ρ^{−1} > 1.
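The exponential decay of the autoregressive coefficients can be made concrete under the additional assumption of the innovations form y_t = Cx_t + ε_t, x_{t+1} = Ax_t + Bε_t (a standard state space convention, not necessarily the exact canonical form of Assumption 3). Inverting the system gives AR(∞) coefficients C(A − BC)^{j−1}B, whose norms decay at the spectral radius of A − BC; a minimal numerical sketch:

```python
import numpy as np

# Illustrative stable, strictly minimum-phase scalar system (n = p = 1).
A = np.array([[0.5]]); B = np.array([[1.0]]); C = np.array([[0.8]])

Abar = A - B @ C                     # state matrix of the inverse system
rho = np.max(np.abs(np.linalg.eigvals(Abar)))
assert rho < 1                       # strict minimum-phase condition

# AR(infinity) coefficients Pi_j = C (A - BC)^{j-1} B, j = 1, 2, ...
norms = [np.linalg.norm(C @ np.linalg.matrix_power(Abar, j - 1) @ B)
         for j in range(1, 11)]
# For this scalar example ||Pi_j|| = 0.8 * 0.3^(j-1) exactly.
assert all(n <= 0.8 * rho ** (j - 1) + 1e-12
           for j, n in zip(range(1, 11), norms))
```

This geometric decay is exactly what makes the lower bound of Assumption 2 attainable with a lag order growing like log T.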
From the autoregressive representation the VECM representation can be obtained, and a comparison of power series coefficients provides the corresponding identities. It follows that the coefficients Π_j, j = 1, 2, ..., form the impulse response of a rational transfer function of order smaller than or equal to n. If Ā is nonsingular, the order equals n and the system (Ā, B, D) is minimal. Furthermore, it follows that for arbitrary Φ and Ψ the transfer function is rational with the additional property stated above. Consequently Φ and Ψ determine the integration properties of processes generated using a(z). Conversely, whenever the constraints hold, the corresponding triple (A, B, C) corresponds to an I(2) process (if the eigenvalues of A are in the closed unit disc). The third equation does not have a solution for fixed Bβ⊥, ζ, η if the row space of Bβ⊥ does not contain the space spanned by the rows of η′. In this case row-wise projection of η′ onto the space spanned by the rows of Bβ⊥ allows for (not necessarily unique) solutions in C†. In the limit no projection is needed; consequently, for large enough T the projected matrix will have full row rank. The second equation then determines ᾱ⊥, which in turn determines ᾱ up to the choice of basis, such that ᾱ = T_C ᾱ_o for some full row rank matrix ᾱ_o ∈ R^{r×p}, ᾱ′_o ᾱ⊥ = 0. The first equation can then be rewritten accordingly. The second equation shows that the row space of (I − Ā)^{−1}B contains the row space of ᾱ′⊥; thus the matrix R_1 has full row rank, and it follows that this equation has solutions.
Having obtained a solution for C*, C†, ᾱ and ᾱ⊥, the matrix C is obtained from these quantities. A unique solution can then be obtained by adding the restrictions Π_j = C(I − Ā)^{−2}Ā^{j+1}B, j = 1, 2, ..., 2n, which for the estimates are to be solved in a least squares sense among all solutions to equations (22).
It then follows that for the true matrices Φ, Ψ, Π_j the only solution for given Ā, B is the corresponding true C. These facts can therefore be used to develop an initial guess for subsequent pseudo likelihood maximization using the parameterization of I(2) processes in state space representation. Given the integer valued parameters n, c and d, the procedure is:

1. Estimate the long VAR approximation, imposing the two rank restrictions via the 2SI2 procedure, to obtain in particular the impulse response estimates Π̂_j and the estimates α̂, β̂, ζ̂, η̂.

2. Choose an integer f ≥ n. Use the algorithm described in Appendix F to obtain estimates (Â, B̂, D̂) realizing the impulse response Π̂_j, j = 1, ..., 2f, from the Hankel matrix with f block columns and f block rows.

3. Project the rows of η̂′ onto the space spanned by the rows of B̂β̂⊥ to obtain η̃′.
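The realization in step 2 can be illustrated by a generic Ho–Kalman-type algorithm (a sketch standing in for Appendix F, which is not reproduced here; the function name and the zero-based indexing markov[k] = CA^kB of the impulse response are choices of this sketch):

```python
import numpy as np

def ho_kalman(markov, n, f):
    """Realize (A, B, C) of order n from impulse responses
    markov[k] = C A^k B, k = 0, ..., 2f - 1 (generic sketch, not Appendix F)."""
    p, q = markov[0].shape
    H  = np.block([[markov[i + j]     for j in range(f)] for i in range(f)])
    Hs = np.block([[markov[i + j + 1] for j in range(f)] for i in range(f)])
    U, s, Vt = np.linalg.svd(H)
    U, s, Vt = U[:, :n], s[:n], Vt[:n]           # rank-n truncation
    O   = U * np.sqrt(s)                          # observability factor
    Ctr = np.sqrt(s)[:, None] * Vt                # controllability factor
    A = np.linalg.pinv(O) @ Hs @ np.linalg.pinv(Ctr)
    return A, Ctr[:, :q], O[:p, :]                # (A, B, C)

# Sanity check on a scalar system with C A^k B = 0.5^k (f = 2 needs 2f = 4 terms).
markov = [np.array([[0.5 ** k]]) for k in range(4)]
A, B, C = ho_kalman(markov, n=1, f=2)
```

For exact impulse responses of a system of order n the shifted-Hankel construction recovers the Markov parameters CA^kB exactly (up to a basis change of the state); with estimated Π̂_j the SVD truncation provides the least squares flavor described above.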
The algorithm yields a minimal state space system of order n in the canonical form for I(2) processes given in Bauer and Wagner (2012) and hence can be used as an initial guess for subsequent pseudo-likelihood optimization in the set M_n(r, s) of all order n rational transfer functions corresponding to I(2) processes with state space unit root structure ((0, (c, c + d))).
Theorem 4 (Consistent initial guess). Let (y_t)_{t∈Z} denote a process generated by the system (A_0, B_0, C_0) according to Assumption 3 and let the system (Ã, B̃, C̃) be estimated based on the long VAR approximation with lag order chosen according to Assumption 2. Then (Ã, B̃, C̃) is a weakly consistent estimator of the data generating system (A_0, B_0, C_0) in the sense that C̃Ã^j B̃ →_p C_0 A_0^j B_0, j = 0, 1, ..., and hence the corresponding transfer functions converge in the pointwise topology.
The proof of this theorem can be found in Appendix E.

Conclusions
In this paper the theory on long VAR approximations of general linear dynamical processes is extended to the case of I(2) processes. We find that slightly stronger upper and lower bounds on the lag order are needed than in the stationary and I(1) cases. The tighter bounds are not needed for all results and appear not very restrictive for applications.
The main results are completely analogous to the I(1) case: the asymptotics are in many respects identical to the finite order VAR case. Asymptotic distributions of the coefficients of nonstationary variables are the same as in the finite order VAR case; this holds both for unrestricted OLS estimates and for the 2SI2 approach in the Johansen framework. Tests on the cointegrating ranks have identical asymptotic null distributions as in the finite order VAR case and hence do not require new tables. In this respect the main conclusion is that the usual procedure of estimating the lag order in a first step and then applying the Johansen procedure with the estimated lag order is justified also for processes generated by a VAR(∞), provided the lag order lies within the prescribed bounds.
Additionally, in the VARMA case the long VAR approximation can be used to derive consistent initial guesses for subsequent pseudo likelihood estimation.
Thus the paper provides both a full extension of the results achieved in the I(1) case and a useful starting point for subsequent VARMA modeling, which may be preferable in situations that require a high VAR order or involve a large number of variables, where VARMA models can be more parsimonious than VAR models.

Acknowledgments: The reviewers and in particular the two guest editors provided significant comments that helped in improving the paper, which is gratefully acknowledged.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Preliminaries
The theory in this paper follows closely the arguments in Lewis and Reinsel (1985) and their extension to the I(1) case in Saikkonen and Lütkepohl (1996). To this end consider the finite order VECM approximation (A1). The properties of the various estimators heavily use a rewriting of this approximation using the triangular representation of y_t, given in (A2). Note that in the reparametrization (A2) the I(1) components, y_{c,t} := (y′_{2,t}, ∆y′_{3,t})′, as well as the I(2) components, y_{3,t−1}, are isolated from the stationary ones, u_{t−j}, and have coefficients equal to zero, which facilitates the derivation of the asymptotic properties.
In the reparameterized setting define Ξ accordingly³ and let Λ̂ denote the OLS estimator of Λ. Throughout, ⟨X_t, Z_t⟩ := Σ_{t=h+3}^T X_t Z_t'. Note that W_t and the regressors in (A1) are in one-one correspondence. In the original Equation (A1), besides the nonstationary regressors y_{c,t−1} and y_{3,t−1}, the regressor vector X_t = [y_{1,t−1}', Δy_{1,t−1}', u_{2,t−1}', Δ²y_{t−1}', …, Δ²y_{t−h}']' ∈ R^{2p₁+p₂+ph} occurs, which cointegrates with Δy_{3,t−1} such that the corresponding linear combination is stationary. Here the nonsingular matrix T_h ∈ R^{(ph+2p₁+p₂)×(ph+2p₁+p₂)} maps between the two regressor vectors; it can be verified that T_h is invertible. The asymptotic properties of Λ̂ − Λ are clarified in the next lemma:

Lemma A1. Under the assumptions of Theorem 1, using N = T − h − 2 as the effective sample size.³

³ In this appendix, processes whose dimension depends on the choice of h are denoted using upper case letters; the dependence on h is otherwise neglected in the notation for simplicity.
Proof. The proof essentially shows that the coefficients corresponding to the stationary regressors and those corresponding to the integrated regressors can asymptotically be dealt with separately; the first, second and third column blocks of (Λ̂ − Λ)D_T^{−1} (the third being N²Φ̂₃) correspond to the stationary, I(1) and I(2) regressors respectively. Note that each block of the matrix R₂ is of order O_P(1), and moreover both R₂ and its limit are almost surely invertible, as there is no cointegration between y_{c,t−1} and y_{3,t−1} (see Lemma 3.1.1 in Chan and Wei (1988), and Sims et al. (1990)). The term ⟨ε_t, W_t⟩D_T R₂^{−1} has the limits stated in the lemma. The lemma therefore holds if E₁ = [o_P(h^{1/2}), o_P(1), o_P(1)], E₂ = o_P(1) and E₃ = o_P(1) can be shown, where the blocks in E₁ correspond to the partitioning of W_t into stationary, I(1) and I(2) components. For this it suffices to establish the bounds (I) and (II) below; here ∥·∥₁ denotes the spectral norm of a matrix while ∥·∥ denotes the Frobenius norm. (I) As in Lewis and Reinsel (1985) there exists a transformation T_u of full row rank such that U_t = T_u U_t^o, where T_u is a (ph + 2p₁ + p₂) × p(h + 2) matrix (cf. Lewis and Reinsel (1985), p. 397), and ∥R₂^{−1}∥₁ = O_P(1), since R₂ is a.s. invertible and converges in distribution to an almost surely nonsingular random matrix. (II) With respect to ⟨e_{1t}, W_t⟩D_T = o_P(h^{1/2}) note that ∥⟨e_{1t}, W_t⟩D_T∥ ≤ N^{−1/2}∥⟨e_{1t}, U_t⟩∥ + N^{−1}∥⟨e_{1t}, y_{c,t−1}⟩∥ + N^{−2}∥⟨e_{1t}, y_{3,t−1}⟩∥.
(ii) From the definition of e_t we obtain the stated bound, where the last equality follows from the law of large numbers and the first equality is implied by the facts that ∥e_{1t}∥² = o_P(T^{−1}) and ∥ε_t∥² = O_P(1).

Appendix B.2. (B) Asymptotic Distribution of Coefficients to Nonstationary Regressors
(i) The distribution of the coefficients due to the nonstationary components is contained in Lemma A1.

Appendix B.3. (C) Asymptotic Distribution of Coefficients to Stationary Regressors
Since the regressor vector U_t is stationary, the asymptotic distribution of N^{1/2}L_h vec(Ξ̂ − Ξ) follows from Lewis and Reinsel (1985) in combination with the uniform boundedness of the maximal and minimal eigenvalues of Γ_u = EU_tU_t', see above. Analogously, the result for the coefficients corresponding to the regressor vector X_t follows, since X_t = T_h U_t for the nonsingular matrix T_h.

Appendix C. Proof of Theorem 2
Consistency follows directly from Theorem 1, as the general representation can be transformed into the triangular representation using the matrix B = [β, β₁, β₂], see (4).
With respect to the asymptotic distribution, following the proof of Theorem 1 there exists a nonsingular transformation matrix S_h such that the blocks corresponding to the nonstationary regressors do not contribute to the asymptotic distribution. Then standard arguments for the stationary part of the regressor vector can be used.

Appendix D. Proofs for Theorem 3
The proof combines the ideas of Saikkonen and Luukkonen (1997) (in the following S&L) with the asymptotics of the 2SI2 procedure of Paruolo (2000) (in the following P). In the proof we work, without loss of generality, with the triangular representation.
The key to the asymptotic properties of the estimators obtained from the 2SI2 algorithm lies in the results of Lemmas A.4 and A.5 in the appendix of P. These lemmas deal with the limits of various moment matrices of the form N^{−a}⟨R_{it}, R_{jt}⟩, corrected for the stationary components Δ²y_{t−j}, j = 1, …, h − 2. The correction involves a regressor vector whose dimension grows with the sample size; this is dealt with as in S&L.
In this respect let S_t = [Δ²y_{t−1}', …, Δ²y_{t−h+2}']', which according to (A4) is a linear function of lagged values of u_t. On p. 543 of P the matrices Σ_{ij}, i, j ∈ {Y, U, 0}, are defined as limits of second moment matrices. Here U refers to β₁'Δy_{t−1} = u_{2,t−1} in the triangular representation, Y refers to β'y_{t−1} + δβ₂'Δy_{t−1} = y_{1,t−1} − AΔy_{3,t−1} = u_{1,t−1}, and 0 refers to Δ²y_t. These are all stationary processes and linear functions of u_t, u_{t−1}, u_{t−2}. In addition to S_t, also β'Δy_{t−1} = Δu_{1,t−1} + Au_{3,t−1} is corrected for in the second stage.
The arguments on pp. 114 and 115 of S&L deal with terms of this form. Arguments analogous to S&L (A.12) show that the term equals, up to terms of order o_P(1), a quantity which S&L state to be bounded from above and bounded away from zero. The second claim is actually wrong: if (u_{1,t})_{t∈Z} is univariate white noise with unit variance, then C₁₁ = 1/h is achieved by predicting u_{1,t−1} using the integration of the regressors implicit in the summation. This does not change the remaining arguments in S&L; it only implies that the separation between the eigenvalues corresponding to the stationary regressors and those corresponding to the nonstationary ones is weaker.
In the current case one can show, for the situation where S_t contains Δu_{1,t−1} and Δ²u_{1,t−j}, j = 1, …, h, that for the corresponding limit C₁₁ the lower bound hC₁₁ ≥ cI holds for some c > 0. The order of the lower bound is attained by including a double integration of the regressors; the arguments from above can then be applied to the process (Δu_t)_{t∈Z}. For a differenced process the smallest eigenvalue of the corresponding covariance matrix is of order h^{−2}, compare Theorem 2 of Palma and Bondon (2003).
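The order h^{−2} can be checked numerically: for univariate white noise u_t with unit variance, the covariance matrix of (Δu_{t−1}, …, Δu_{t−h}) is the tridiagonal Toeplitz matrix with 2 on the diagonal and −1 on the off-diagonals, whose smallest eigenvalue has the closed form 2 − 2cos(π/(h+1)) ≈ (π/(h+1))². A short numerical check (illustrative only):

```python
import numpy as np

def min_eig_diff_cov(h):
    """Smallest eigenvalue of the covariance matrix of (Δu_{t-1},...,Δu_{t-h})
    for univariate white noise u with unit variance: the tridiagonal Toeplitz
    matrix with 2 on the diagonal and -1 on the off-diagonals."""
    G = 2.0 * np.eye(h) - np.eye(h, k=1) - np.eye(h, k=-1)
    return np.linalg.eigvalsh(G)[0]

for h in (10, 20, 40):
    lam = min_eig_diff_cov(h)
    # closed form: 2 - 2*cos(pi/(h+1)), which is of order h^{-2}
    print(h, lam, (np.pi / (h + 1)) ** 2)
```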
Since N^{−1}⟨S_t, y_{c,t−1}⟩ = O_P(h^{1/2}) and N^{−2}⟨S_t, y_{3,t−1}⟩ = O_P(h^{1/2}), it follows that

N^{−1}(⟨u_{1,t−1}, y_{c,t−1}⟩ − ⟨u_{1,t−1}, S_t⟩⟨S_t, S_t⟩^{−1}⟨S_t, y_{c,t−1}⟩) = O_P(h^{1/2}),
N^{−2}(⟨y_{c,t−1}, y_{c,t−1}⟩ − ⟨y_{c,t−1}, S_t⟩⟨S_t, S_t⟩^{−1}⟨S_t, y_{c,t−1}⟩) = N^{−2}⟨y_{c,t−1}, y_{c,t−1}⟩ + o_P((h/N)^{1/2}),

and analogously for the terms involving y_{3,t−1}. Therefore the limits of the moment matrices M_{ij} are not affected by the correction using stationary terms even if h → ∞, except for the terms involving the orders O_P(h^{1/2}). For all stationary terms we find convergence to the corresponding limits, denoted Σ_{ij} in P.
The first step in the 2SI2 procedure then uses RRR in the corresponding equation, where R_{0t} denotes Δ²y_t corrected for S_t, R_{1t} denotes Δy_{t−1} corrected for S_t, and R_{2t} denotes y_{t−1} corrected for S_t. Lemma A.4 of P derives the limits of different directions of the moment matrices M_{ij.k}. Here R_{ε,t} equals e_t corrected for S_t and R_{β,t} = β'R_{1t}. Further, P uses the notation A_T = [β₁, T^{−1}β̄₂] and β̄_{2,T} = β̄₂. Here and below we assume without loss of generality that [β, β₁, β₂] is an orthonormal matrix; consequently β̄ = β, β̄₁ = β₁, β̄₂ = β₂. Then the results above imply all results of Lemma A.4 of P, except that now A_T'M_{20.1} = O_P(h^{1/2}).
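The RRR step amounts to a generalized eigenvalue problem det(λS₁₁ − S₁₀S₀₀⁻¹S₀₁) = 0 for product moment matrices of the corrected processes. A generic numerical sketch follows; it is not the 2SI2 implementation, and the data below are arbitrary stand-ins for the corrected processes, used only to exercise the linear algebra:

```python
import numpy as np

def rrr_eigen(S00, S01, S11):
    """Solve det(lam * S11 - S10 S00^{-1} S01) = 0 via a symmetric reduction:
    with S11 = L L', the problem becomes the eigenproblem of
    L^{-1} S10 S00^{-1} S01 L^{-T}; eigenvectors are mapped back by L^{-T}."""
    S10 = S01.T
    L = np.linalg.cholesky(S11)
    Linv = np.linalg.inv(L)
    M = Linv @ S10 @ np.linalg.solve(S00, S01) @ Linv.T
    lam, V = np.linalg.eigh(M)
    idx = np.argsort(lam)[::-1]              # largest eigenvalue first
    return lam[idx], (Linv.T @ V)[:, idx]

rng = np.random.default_rng(2)
N, p = 400, 3
R0 = rng.standard_normal((N, p))             # stand-in for R_0t
R1 = rng.standard_normal((N, p))             # stand-in for R_1t
S00, S01, S11 = R0.T @ R0 / N, R0.T @ R1 / N, R1.T @ R1 / N
lam, beta = rrr_eigen(S00, S01, S11)
print(lam)                                   # squared sample canonical correlations
```

The eigenvalues are the squared sample canonical correlations between the two corrected processes and hence lie in [0, 1].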
In particular we obtain the limits stated below. Here W = g(1)B denotes the Brownian motion corresponding to (ε_t)_{t∈Z}, and F† denotes the Brownian motion corresponding to R_{2t} (equaling y_{t−1} corrected for S_t) corrected for R_{1t} (i.e., Δy_{t−1}, whose only nonstationary component equals Δy_{3,t−1} with corresponding Brownian motion B₃). The corresponding definitions, with L as in Theorem 1, follow accordingly. The above arguments show that in the current setting U_{t−1} = u_{2,t−1} and Y_{t−1} = u_{1,t−1} are contained in the space spanned by S_t for h → ∞; therefore Σ_{ij} = 0 for i, j ∈ {U, Y}. The subscript 'b' refers to correcting for β⊥'R_{2t}, used in the second stage of 2SI2.
Let Σ̃_{YY} denote the limit of h⟨Y_{t−1}, Y_{t−1}⟩ and analogously define Σ̃_{YU}, Σ̃_{UU}, Σ̃_{0Y} and Σ̃_{0U}. For the latter two note that Σ̃_{0Y} denotes the corresponding limit after correcting for S_t and β'Δy_{t−1}. Since Y_{t−1} is stationary, the last term is of order O_P((h³/N)^{1/2}) = o_P(1). Therefore it follows that Σ̃_{0Y} = αΣ̃_{YY} + ζΣ̃_{UY}. Then the results of Lemma A.5 of P hold, where in (A.11) and (A.14) Σ_{ij} can be replaced by Σ̃_{ij}.
The asymptotic analysis below will heavily use the Johansen approach of investigating the solutions to eigenvalue problems in order to maximize the pseudo-likelihood corresponding to the reduced rank regression problem. In order to use the corresponding local analysis one has to first clarify consistency for the various estimators as well as rates of convergence.
The main tool in this respect is Theorem A.1 of Johansen (1997), which establishes in the I(2) setting, for the regression y_t = θ'Z_t + ε_t (Z_t being composed of stationary, I(1) and I(2) components) with D_T⟨Z_t, Z_t⟩D_T = O_P(1) and D_T⟨Z_t, ε_t⟩ = o_P(1), that D_T^{−1}(θ̂ − θ) = o_P(1), where θ̂ denotes the pseudo likelihood estimator over some closed parameter set Θ.
It is straightforward to see that analogous results hold in the present setting when first concentrating out the stationary components: Consider y_t = θ₁'z_t + θ₂'Z_t + e_t. Then θ̂₂(θ₁) is obtained from the concentration step, and the pseudo likelihood involves ⟨R_{t,y} − θ₁'R_{t,z}, R_{t,y} − θ₁'R_{t,z}⟩, where the processes R_{t,y} and R_{t,z} denote y_t and z_t with the stationary regressors Z_t regressed out. These concentrated quantities can now be used in the proof of Theorem A.1 of Johansen (1997), essentially without changes, to show consistency of θ̂₁. Consistency of θ̂₂(θ̂₁) then follows from the unrestricted estimation as contained in Theorem 2. As shown above, the rates of convergence as well as the limits for the coefficients corresponding to the nonstationary components of the regressors are unchanged in the long VAR case compared to the finite VAR case.
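The concentration step is the Frisch-Waugh-Lovell device: regressing the stationary components out of both sides and then estimating on the residuals reproduces the corresponding block of the full regression. A minimal numerical check with synthetic data (all names illustrative):

```python
import numpy as np

def concentrate(M, S):
    """Regress the columns of S out of each column of M and return the
    OLS residuals (the concentrated 'R_t' processes)."""
    coef, *_ = np.linalg.lstsq(S, M, rcond=None)
    return M - S @ coef

rng = np.random.default_rng(1)
N = 300
S = rng.standard_normal((N, 4))   # stand-in for the stationary regressors
y = rng.standard_normal((N, 2))
x = rng.standard_normal((N, 3))

# two-step (concentrated) OLS of y on [x, S] ...
Rty, Rtx = concentrate(y, S), concentrate(x, S)
theta1, *_ = np.linalg.lstsq(Rtx, Rty, rcond=None)

# ... coincides with the x-block of the full regression (Frisch-Waugh-Lovell)
full, *_ = np.linalg.lstsq(np.hstack([x, S]), y, rcond=None)
print(np.allclose(theta1, full[:3]))
```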
Note that these results hold for general closed parameter space Θ, thus including the unrestricted as well as the rank-reduced problem. This shows that we can always reduce the asymptotic analysis of the eigenvalue problems to a neighborhood of the true value as is done in P.
The first step in the proof of Theorem 4.1 of P consists in the investigation of the solutions to the first stage eigenvalue problem. Asymptotically the first p − r eigenvalues of hΛ̂ are positive, the remaining ones tending to zero; likewise the eigenvectors converge at the same speed as the matrices.
Here a₁ = M_{20.1}O_P(h²T^{−1/2}) = o_P(1) and β̄ = β̂H^{−1}. Then the remaining arguments on p. 546 of P show that the asymptotic distribution of (Tβ₁, T²β₂)'(β̂ − β) in the long VAR case is identical to that in the finite VAR case. From these arguments the distribution of the likelihood ratio test of H_r versus H_p can be derived: Let δ₁ = Tλ, so that for every fixed δ₁ we have λ → 0 as T → ∞. By the above arguments, with W† = (α⊥'Σ_ε α⊥)^{−1/2}α⊥'W, the smallest p − r solutions of (11) converge in distribution to the solutions of

det(δ₁ ∫₀¹ F†F†' du − (∫₀¹ F† dW†')(∫₀¹ dW† F†')) = 0,

which implies that the test statistic Q_r has the limiting distribution

Q_r →_d tr{(∫₀¹ dW† F†')(∫₀¹ F†F†' du)^{−1}(∫₀¹ F† dW†')}.

For the second stage the arguments are very similar. The eigenvalue problem solved there is the following:

β̂⊥'M_{11.β̂}β̂⊥ η̂ Λ̂ = β̂⊥'M_{1α̂⊥.β̂} M_{α̂⊥α̂⊥.β̂}^{−1} M_{α̂⊥1.β̂} β̂⊥ η̂.⁴

⁴ Contrary to the usual Johansen notation we use Σ as the noise covariance and Ω as the variance of the Brownian motion corresponding to (u_t)_{t∈Z}; thus some of the formulas in this part take an 'unusual' form.
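Limiting functionals of this type can be simulated by replacing the Brownian motions with scaled random walks. The sketch below draws from tr{(∫dW F')(∫FF' du)⁻¹(∫F dW')} with F = W a standard Brownian motion, which is a simplification of F† (in the I(2) case a corrected process); it illustrates only the simulation technique behind such critical value tables:

```python
import numpy as np

def sim_trace_stat(dim, T, reps, seed=0):
    """Monte Carlo draws of tr{(∫dW F')(∫F F' du)^{-1}(∫F dW')} with
    F = W a dim-dimensional standard Brownian motion, approximated by a
    random walk with T steps."""
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for r in range(reps):
        dW = rng.standard_normal((T, dim)) / np.sqrt(T)
        W = np.cumsum(dW, axis=0)
        F = np.vstack([np.zeros((1, dim)), W[:-1]])   # left endpoints
        A = dW.T @ F                                  # approximates ∫ dW F'
        B = F.T @ F / T                               # approximates ∫ F F' du
        out[r] = np.trace(A @ np.linalg.solve(B, A.T))
    return out

draws = sim_trace_stat(dim=1, T=400, reps=2000)
print(np.quantile(draws, 0.95))
```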
These terms show consistency of β̂ and η̂. Using the results of Lemma A.4 of P, consistency of α̂ and ζ̂ then follows. Following the proof of Theorem 4.2 on pp. 548-549 of P we can show consistency of ψ̂ as defined in P. The only changes concern the orders of convergence, where our setting introduces orders of h into the arguments. Jointly this proves consistency of Ψ̂ and Γ̂. Consistency of the estimates of the coefficients of the stationary terms Δ²y_{t−j} follows as usual from the consistency of the estimates of the coefficients of the nonstationary regressors. This completes the proof of (D).
With respect to (E) note that the results above show that the asymptotics for the two eigenvalue problems to be solved converge to the same quantities as in the finite VAR case. This shows that the results of P in this respect hold also in the case of long VARs.
Finally, for the matrices Π_j note that Theorem 4.3 of P shows that the asymptotic distribution of all quantities corresponding to stationary regressors is identical for every super-consistent estimator of the coefficients of the nonstationary components.
For details see Appendix F. Note that a(z) does not necessarily correspond to a rational transfer function of order n; it does, however, if the additional restrictions (22) hold.
Steps 3 and 4 of the proposed algorithm achieve this. Here step 3 ascertains that solutions to the third equation exist. The second equation explicitly provides a solution α⊥ for a given C†. This solution is not necessarily of full row rank; since this is the case in the limit, however, it also holds for T large enough. The first equation always admits solutions. Thus for large enough T the set of all solutions is defined by polynomial restrictions. Adding the least squares distance to the estimated impulse response sequence then leads to a quadratic problem under nonlinear differentiable constraints, which in the limit has a unique solution. Thus the solution is unique for large enough T.
Consistency of the estimates in combination with continuity of the solution of step 4 implies consistency of the system (Â, B̂, Ĉ). This implies consistency of the inverse system in the sense of converging impulse response coefficients and hence consistency of the transfer function estimator in the pointwise topology. The fulfillment of the restrictions (22) ensures that the corresponding matrix Â has the structure implied by the state space unit root structure ((0, (c, c + d))).