Estimation of the I(2) cointegrated vector autoregressive (CVAR) model is considered. Without further restrictions, estimation of the I(1) model is by reduced-rank regression (Anderson (1951)). Maximum likelihood estimation of I(2) models, on the other hand, always requires iteration. This paper presents a new triangular representation of the I(2) model. This is the basis for a new estimation procedure of the unrestricted I(2) model, as well as the I(2) model with linear restrictions imposed.
Keywords: cointegration; I(2); vector autoregression; representation; maximum likelihood estimation; reduced rank regression; generalized least squares
JEL Classification: C32; C51; C61
1. Introduction
The I(1) model or cointegrated vector autoregression (CVAR) is now well established. The model is developed in a series of papers and books (see, e.g., Johansen (1988), Johansen (1991), Johansen (1995a), Juselius (2006)) and is generally available in econometric software. The I(1) model is formulated as a rank reduction of the matrix of ‘long-run’ coefficients. The Gaussian log-likelihood is maximized by reduced-rank regression (RRR; see Anderson (1951), Anderson (2002)).
Determining the cointegrating rank only finds the cointegrating vectors up to a rank-preserving linear transformation. Therefore, the next step of an empirical study usually identifies the cointegrating vectors. This may be followed by imposing over-identifying restrictions. Common restrictions, i.e., the same restrictions on each cointegrating vector, can still be solved by adjusting the RRR estimation; see Johansen and Juselius (1990) and Johansen and Juselius (1992). Estimation with separate linear restrictions on the cointegrating vectors, or more general non-linear restrictions, requires iterative maximization. The usual approach is based on so-called switching algorithms; see Johansen (1995b) and Boswijk and Doornik (2004). The former proposes an algorithm that alternates between cointegrating vectors, estimating one while keeping the others fixed. The latter consider algorithms that alternate between the cointegrating vectors and their loadings: when one is kept fixed, the other is identified. The drawback is that these algorithms can be very slow and occasionally terminate prematurely. Doornik (2017) proposes improvements that can be applied to all switching algorithms.
Johansen (1995c) and Johansen (1997) extend the CVAR to allow for I(2) stochastic trends. These tend to be smoother than I(1) stochastic trends. The I(2) model implies a second reduced rank restriction, but this is now more complicated, and estimation under Gaussian errors can no longer be performed by RRR. The basis of an algorithm for maximum likelihood estimation is presented in Johansen (1997), with an implementation in Dennis and Juselius (2004).
The general approach to handling the I(2) model is to create representations that introduce parameters that vary freely without changing the nature of the model. This facilitates both the statistical analysis and the estimation.
The contributions of the current paper are two-fold. First, we present the triangular representation of the I(2) model. This is a new trilinear formulation with a block-triangular matrix structure at its core. The triangular representation provides a convenient framework for imposing linear restrictions on the model parameters. Next, we introduce several improved estimation algorithms for the I(2) model. A simulation experiment is used to study the behaviour of the algorithms.
Notation
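Let $\alpha$ ($p\times r$) be a matrix with full column rank $r$, $r\le p$. The perpendicular matrix $\alpha_{\llcorner}$ ($p\times(p-r)$) has $\alpha_{\llcorner}'\alpha=0$. The orthogonal complement $\alpha_\perp$ has $\alpha_\perp'\alpha=0$ with the additional property that $\alpha_\perp'\alpha_\perp=I_{p-r}$. Define $\tilde\alpha=\alpha(\alpha'\alpha)^{-1/2}$ and $\bar\alpha=\alpha(\alpha'\alpha)^{-1}$. Then, $(\tilde\alpha:\alpha_\perp)$ is a $p\times p$ orthogonal matrix, so $I_p=\tilde\alpha\tilde\alpha'+\alpha_\perp\alpha_\perp'=\bar\alpha\alpha'+\alpha_\perp\alpha_\perp'=\alpha\bar\alpha'+\alpha_\perp\alpha_\perp'$.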
The (thin) singular value decomposition (SVD) of $\alpha$ is $\alpha=UWV'$, where $U$ ($p\times r$) and $V$ ($r\times r$) are orthogonal: $U'U=V'V=VV'=I_r$, and $W$ is a diagonal matrix with the ordered positive singular values on the diagonal. If $\mathrm{rank}(\alpha)=s<r$, then the last $r-s$ singular values are zero. We can find $\alpha_\perp=U_2$ from the SVD of the square matrix $(\alpha:0)=(U_1:U_2)WV'=(U_1W_1V_1':0)$.
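The notation above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `orth_complement` and `sym_inv_sqrt` are hypothetical, and the full SVD of $\alpha$ is used in place of the SVD of the padded square matrix $(\alpha:0)$, which yields the same trailing left singular vectors.

```python
import numpy as np

def orth_complement(alpha, tol=1e-12):
    """Orthonormal basis alpha_perp of the space orthogonal to the columns of alpha."""
    # Full SVD: u is p x p; the columns beyond rank(alpha) are orthogonal to alpha.
    u, w, _ = np.linalg.svd(alpha, full_matrices=True)
    rank = int(np.sum(w > tol * w.max()))
    return u[:, rank:]

def sym_inv_sqrt(m):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(0)
p, r = 5, 2
alpha = rng.standard_normal((p, r))                   # illustrative data only

alpha_perp = orth_complement(alpha)                   # p x (p - r)
alpha_tilde = alpha @ sym_inv_sqrt(alpha.T @ alpha)   # alpha (alpha'alpha)^{-1/2}
alpha_bar = alpha @ np.linalg.inv(alpha.T @ alpha)    # alpha (alpha'alpha)^{-1}

# Two of the decompositions of I_p given in the text
id_tilde = alpha_tilde @ alpha_tilde.T + alpha_perp @ alpha_perp.T
id_bar = alpha_bar @ alpha.T + alpha_perp @ alpha_perp.T
```

Both `id_tilde` and `id_bar` should equal the identity matrix up to rounding error.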
The (thin) QR factorization of α with pivoting is αP=QR, with Q(p×r) orthogonal and R upper triangular. This pivoting is the reordering of columns of α to better handle poor conditioning and singularity, and is captured in P, as discussed in Golub and Van Loan (2013, §5.4.2).
The QL decomposition of $A$ can be derived from the QR decomposition of $JAJ$: if $JAJ=QR$, then $A=JJAJJ=JQJ\,JRJ=Q^*L$, where $Q^*=JQJ$ is orthogonal and $L=JRJ$ is lower triangular. $J$ is the exchange matrix, which is the identity matrix with columns in reverse order: premultiplication reverses rows; postmultiplication reverses columns; and $JJ=I$.
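The construction of the QL decomposition from the QR decomposition can be sketched in a few lines. This assumes NumPy and a square input; the function name `ql` is hypothetical.

```python
import numpy as np

def ql(a):
    """QL decomposition of a square matrix, obtained from the QR decomposition of J a J."""
    n = a.shape[0]
    J = np.eye(n)[::-1]              # exchange matrix: identity with reversed rows
    q, r = np.linalg.qr(J @ a @ J)   # J a J = q r, with r upper triangular
    # a = (J q J)(J r J): J q J is orthogonal and J r J is lower triangular
    return J @ q @ J, J @ r @ J

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4))      # illustrative data only
q_star, l = ql(a)
```

Reversing both the rows and the columns of an upper triangular matrix gives a lower triangular one, which is why `l` has zeros above the diagonal.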
Let $\bar{\bar\alpha}=\Omega^{-1}\alpha(\alpha'\Omega^{-1}\alpha)^{-1}$; then $\alpha_\perp'\Omega\bar{\bar\alpha}=0$.
Finally, a←b assigns the value of b to a.
2. The I(2) Model
The vector autoregression (VAR) with p dependent variables and m≥1 lags:
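$$y_t=A_1y_{t-1}+\cdots+A_my_{t-m}+\Phi x_t^U+\epsilon_t,\qquad\epsilon_t\sim\mathrm{IIN}_p[0_p,\Omega],$$
for $t=1,\ldots,T$, and with $y_j$, $j=-m+1,\ldots,0$, fixed and given, can be written in equilibrium correction form as:
$$\Delta y_t=\Pi_yy_{t-1}+\Gamma_1\Delta y_{t-1}+\cdots+\Gamma_{m-1}\Delta y_{t-m+1}+\Phi x_t^U+\epsilon_t,$$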
without imposing any restrictions. The I(1) cointegrated VAR (CVAR) imposes a reduced rank restriction on Πy(p×p): rankΠy=r; see, e.g., Johansen and Juselius (1990), Johansen (1995a).
With m≥2, the model can be written in second-differenced equilibrium correction form as:
$$\Delta^2y_t=\Pi_yy_{t-1}-\Gamma_y\Delta y_{t-1}+\Psi_1\Delta^2y_{t-1}+\cdots+\Psi_{m-2}\Delta^2y_{t-m+2}+\Phi x_t^U+\epsilon_t. \tag{1}$$
The I(2) CVAR involves an additional reduced rank restriction:
rank(α⊥′Γyβy,⊥)=s,
where $\alpha_\perp'\alpha=0$. The two rank restrictions can be expressed more conveniently in terms of products of matrices with reduced dimensions:
$$\Pi_y=\alpha\beta_y', \tag{2}$$
$$\alpha_\perp'\Gamma_y\beta_{y,\perp}=\xi\eta_y', \tag{3}$$
where $\alpha$ and $\beta_y$ are $p\times r$ matrices. The second restriction needs rank $s$, so $\xi$ and $\eta_y$ are $(p-r)\times s$ matrices. This requires that the matrices on the right-hand side of (2) and (3) have full column rank. The number of I(2) trends is $s_2=p-r-s$.
The most relevant model in terms of deterministics allows for linearly trending behaviour: $\Phi x_t^U=\mu_0+\mu_1t$. Using the representation theorem of Johansen (1992) and assuming $E[y_t]=a+bt$ implies:
$$\mu_1=-\alpha\beta_y'b, \tag{4}$$
$$\mu_0=-\alpha\beta_y'a+\Gamma_yb, \tag{5}$$
which restricts and links μ0 and μ1; we see that α⊥′μ1=0 and α⊥′μ0=α⊥′Γyb.
2.1. The I(2) Model with a Linear Trend
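The model (1) subject to the I(1) and I(2) rank restrictions (2) and (3) with $\Phi x_t^U=\mu_0+\mu_1t$, subject to (4) and (5), can be written as:
$$\Delta^2y_t=\alpha\beta'\begin{pmatrix}y_{t-1}\\t\end{pmatrix}-\Gamma\begin{pmatrix}\Delta y_{t-1}\\1\end{pmatrix}+\Psi_1\Delta^2y_{t-1}+\cdots+\Psi_{m-2}\Delta^2y_{t-m+2}+\epsilon_t, \tag{6}$$
subject to:
$$\alpha_\perp'\Gamma\beta_\perp=\xi\eta', \tag{7}$$
where $\beta$ is $p_1\times r$, $\Gamma$ is $p\times p_1$ and $\eta$ is $(p_1-r)\times s$. In this case, $p_1=p+1$. Because $\alpha$ is the leading term in (4), we can extend $\beta_y$ by introducing $\beta_c'=-\beta_y'b$, so $\beta'=(\beta_y':\beta_c')$. Furthermore, $\Gamma$ has been extended to $\Gamma=(\Gamma_y:\Gamma_c)=(\Gamma_y:-\mu_0)$.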
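To see that (6) and (7) remain the same I(2) model, consider $\alpha_\perp'\Gamma_c$ and insert $I_p=\bar\beta_y\beta_y'+\beta_{y\perp}\beta_{y\perp}'$:
$$\alpha_\perp'\Gamma_c=-\alpha_\perp'\Gamma_y\bar\beta_y\beta_y'b-\alpha_\perp'\Gamma_y\beta_{y\perp}\beta_{y\perp}'b=\alpha_\perp'\Gamma_y\bar\beta_y\beta_c'-\xi\eta_y'\beta_{y\perp}'b=\alpha_\perp'\Gamma_y\bar\beta_y\beta_c'+\xi\eta_c'.$$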
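Using the perpendicular matrix:
$$\beta_{\llcorner}=\begin{pmatrix}\beta_{y\perp}&-\bar\beta_y\beta_c'\\0&1\end{pmatrix},$$
we see that the rank condition is unaffected:
$$\alpha_\perp'\Gamma\beta_{\llcorner}=\left(\alpha_\perp'\Gamma_y\beta_{y\perp}:\alpha_\perp'[-\Gamma_y\bar\beta_y\beta_c'+\Gamma_c]\right)=\left(\alpha_\perp'\Gamma_y\beta_{y\perp}:\xi\eta_c'\right)=\xi\left(\eta_y':\eta_c'\right).$$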
A more general formulation allows for restricted deterministic and weakly exogenous variables xtR and unrestricted variables xtU:
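$$\Delta^2y_t=\Pi\begin{pmatrix}y_{t-1}\\x_{t-1}^R\end{pmatrix}-\Gamma\begin{pmatrix}\Delta y_{t-1}\\\Delta x_t^R\end{pmatrix}+\Psi_1\Delta^2y_{t-1}+\cdots+\Psi_{m-2}\Delta^2y_{t-m+2}+\Phi x_t^U+\epsilon_t=\Pi w_{2t}-\Gamma w_{1t}+\Psi w_{3t}+\epsilon_t,$$
where $\Delta^2x_t^R$ and its lags are contained in $x_t^U$; this, in turn, is subsumed under $w_{3t}=(\Delta^2y_{t-1}',\ldots,x_t^{U\prime})'$. The number of variables in $x_t^R$ is $p_1-p$, so $\Pi$ and $\Gamma$ always have the same dimensions. $\Psi$ is unrestricted, which allows it to be concentrated out by regressing all other variables on $w_{3t}$:
$$z_{0t}=\alpha\beta'z_{2t}-\Gamma z_{1t}+\epsilon_t,\qquad\epsilon_t\sim\mathrm{IIN}_p[0_p,\Omega]. \tag{8}$$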
To implement likelihood-ratio tests, it is necessary to count the number of restrictions:
$$\begin{array}{ll}\text{restrictions on }\Pi:&(p-r)(p_1-r)\text{ restrictions},\\\text{restrictions on }\Gamma:&(p-r-s)(p_1-r-s)=s_2s_2^*\text{ restrictions},\end{array}$$
defining $s_2^*=p_1-r-s$. The restrictions on Π follow from the representation. Several representations of the I(2) model have been introduced in the literature to translate the implicit non-linear restriction (3) on Γ into an explicit part of the model. These representations reveal the number of restrictions imposed on Γ, as is shown below.
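The restriction counts above are simple to tabulate. The sketch below uses a hypothetical helper, `i2_restriction_counts`, and illustrates it with $p=5$ variables and one restricted deterministic term ($p_1=6$); the rank indices $r=3$, $s=1$ are chosen purely for illustration.

```python
def i2_restriction_counts(p, p1, r, s):
    """Number of restrictions that the I(1) and I(2) rank conditions impose on Pi and Gamma."""
    s2 = p - r - s           # number of I(2) trends
    s2_star = p1 - r - s
    if min(r, s, s2) < 0:
        raise ValueError("rank indices must satisfy r >= 0, s >= 0 and r + s <= p")
    return {"Pi": (p - r) * (p1 - r), "Gamma": s2 * s2_star}

counts = i2_restriction_counts(p=5, p1=6, r=3, s=1)
```

For this configuration, the rank condition on Π imposes $(5-3)(6-3)=6$ restrictions, and the I(2) condition on Γ imposes $1\times2=2$.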
First, we introduce the new triangular representation.
2.2. The Triangular Representation
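Consider the model:
$$z_{0t}=\Pi z_{2t}-\Gamma z_{1t}+\epsilon_t,$$
with rank restrictions $\Pi=\alpha\beta'$ and $\alpha_\perp'\Gamma\beta_\perp=\xi\eta'$, where $\alpha$ is a $p\times r$ matrix, $\beta$ is $p_1\times r$, $\xi$ is $(p-r)\times s$ and $\eta$ is $(p_1-r)\times s$. This can be written as:
$$z_{0t}=AWB'z_{2t}-AVB'z_{1t}+\epsilon_t, \tag{9}$$
where:
$$W=\begin{pmatrix}0&0&0\\0&0&0\\W_{11}&0&0\end{pmatrix},\qquad V=\begin{pmatrix}V_{31}&0&0\\V_{21}&V_{22}&0\\V_{11}&V_{12}&V_{13}\end{pmatrix}.$$
$A$, $B$, $W_{11}$ and $V_{22}$ are full rank matrices. $A$ is $p\times p$, and $B$ is $p_1\times p_1$; moreover, $A$, $B$ and the nonzero blocks in $W$ and $V$ are freely varying. $A$ and $B$ are partitioned as:
$$A=(A_2:A_1:A_0),\qquad B=(B_0:B_1:B_2),$$
where the blocks in $A$ have $s_2$, $s$, $r$ columns, respectively, and the blocks in $B$ have $r$, $s$, $s_2^*$ columns; $p_1=r+s+s_2^*$. $W$ and $V$ are partitioned accordingly.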
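Write $\tilde\alpha=\alpha(\alpha'\alpha)^{-1/2}$, such that $\tilde\alpha'\tilde\alpha=I_r$. Construct $A$ and $B$ as:
$$A=(\alpha_\perp\xi_\perp:\alpha_\perp\tilde\xi:\tilde\alpha),\qquad B=(\tilde\beta:\beta_\perp\tilde\eta:\beta_\perp\eta_\perp).$$
Now, $A'A=I$ and $B'B=I$. $A$ ($p\times p$) and $B$ ($p_1\times p_1$) are full rank by design. Define $V=A'\Gamma B$:
$$V=\begin{pmatrix}\xi_\perp'\alpha_\perp'\Gamma\tilde\beta&\xi_\perp'\alpha_\perp'\Gamma\beta_\perp\tilde\eta&\xi_\perp'\alpha_\perp'\Gamma\beta_\perp\eta_\perp\\\tilde\xi'\alpha_\perp'\Gamma\tilde\beta&\tilde\xi'\alpha_\perp'\Gamma\beta_\perp\tilde\eta&\tilde\xi'\alpha_\perp'\Gamma\beta_\perp\eta_\perp\\\tilde\alpha'\Gamma\tilde\beta&\tilde\alpha'\Gamma\beta_\perp\tilde\eta&\tilde\alpha'\Gamma\beta_\perp\eta_\perp\end{pmatrix}=\begin{pmatrix}V_{31}&0&0\\V_{21}&V_{22}&0\\V_{11}&V_{12}&V_{13}\end{pmatrix}.$$
$V_{22}=(\xi'\xi)^{1/2}(\eta'\eta)^{1/2}$ is a full rank $s\times s$ matrix. The zero blocks in $V$ arise because, e.g., $\xi_\perp'\alpha_\perp'\Gamma\beta_\perp=\xi_\perp'\xi\eta'=0$. Trivially:
$$\Pi=\alpha\beta'=A\begin{pmatrix}0&0&0\\0&0&0\\W_{11}&0&0\end{pmatrix}B'=AWB'.$$
$W_{11}=(\alpha'\alpha)^{1/2}(\beta'\beta)^{1/2}$ is a full rank $r\times r$ matrix. Both $W$ and $V$ are $p\times p_1$ matrices. Because $A$ and $B$ are each orthogonal: $\Gamma=AA'\Gamma BB'=AVB'$.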
The QR decomposition shows that a full rank square matrix can be written as the product of an orthogonal matrix and a triangular matrix. Therefore, $AVB'=AL_aL_a^{-1}VL_bL_b^{-1}B'=A^*V^*B^{*\prime}$ preserves the structure in $V^*$ when $L_a$, $L_b$ are lower triangular, as well as that in $W^*$. This shows that (9) holds for any full rank $A$ and $B$, and the orthogonality can be relaxed.
Therefore, any model with full rank matrices A and B, together with any W,V that have the zeros as described above, satisfies the I(2) rank restrictions. We obtain the same model by restricting A and B to be orthogonal. ☐
When Γ is restricted only by the I(2) condition: rankΓ=r+s+min(r,s2). Then, V varies freely, except for the zero blocks, and the I(2) restrictions are imposed through the trilinear form of (9). Γ=0 implies V=0. Another way to have s=0 is Γ=(α:0)G; in that case, V≠0.
The s2 restrictions on the intercept (5) can be expressed as A2′(μ0-μc)=0, using μc=Γyβ¯yβc′, or μ0=(A1:A0)v+μc, for a vector v of length r+s.
2.3. Obtaining the Triangular Representation
The triangular representation shows that the I(2) model can be written in trilinear form:
$$z_{0t}=AWB'z_{2t}-AVB'z_{1t}+\epsilon_t,$$
where A and B are freely varying, provided W and V have the appropriate structure.
Suppose that we are given $\alpha$, $\beta$, $\Gamma$ of an I(2) CVAR with rank indices $r,s$ and wish to obtain the parameters of the triangular representation. First compute $\alpha_\perp'\Gamma\beta_\perp=\xi\eta'$, which can be done with the SVD, assuming rank $s$. From this, compute $A$ and $B$:
$$A=(A_2:A_1:A_0)=(\alpha_\perp\xi_\perp:\alpha_\perp\tilde\xi:\alpha),\qquad B=(B_0:B_1:B_2)=(\beta:\beta_\perp\tilde\eta:\beta_\perp\eta_\perp).$$
Then, $V=A^{-1}\Gamma(B')^{-1}$. Because Γ satisfies the I(2) rank restriction, $V$ will have the corresponding block-triangular structure.
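The steps just described can be sketched numerically. The code below is an illustration under stated assumptions, not the implementation used for the results in this paper: it assumes NumPy, the helper names `null_basis` and `triangular_representation` are hypothetical, and the test matrix Γ is constructed artificially so that $\alpha_\perp'\Gamma\beta_\perp=\xi\eta'$ holds exactly.

```python
import numpy as np

def null_basis(m):
    """Orthonormal basis orthogonal to the columns of a full-column-rank matrix."""
    u, _, _ = np.linalg.svd(m, full_matrices=True)
    return u[:, m.shape[1]:]

def triangular_representation(alpha, beta, gamma, s):
    """Return A, B, V with gamma = A V B', given rank index s."""
    a_perp, b_perp = null_basis(alpha), null_basis(beta)
    # SVD of alpha_perp' Gamma beta_perp, which is assumed to have rank s
    u, _, vt = np.linalg.svd(a_perp.T @ gamma @ b_perp, full_matrices=True)
    xi, xi_perp = u[:, :s], u[:, s:]
    eta, eta_perp = vt[:s].T, vt[s:].T
    A = np.hstack([a_perp @ xi_perp, a_perp @ xi, alpha])   # (A2 : A1 : A0)
    B = np.hstack([beta, b_perp @ eta, b_perp @ eta_perp])  # (B0 : B1 : B2)
    # V = A^{-1} Gamma (B')^{-1}, computed with two linear solves
    V = np.linalg.solve(B, np.linalg.solve(A, gamma).T).T
    return A, B, V

rng = np.random.default_rng(2)
p, p1, r, s = 5, 6, 2, 1
s2 = p - r - s
alpha, beta = rng.standard_normal((p, r)), rng.standard_normal((p1, r))
xi0, eta0 = rng.standard_normal((p - r, s)), rng.standard_normal((p1 - r, s))

# Artificial Gamma: adjust a random matrix so that alpha_perp' Gamma beta_perp = xi0 eta0'
a_perp, b_perp = null_basis(alpha), null_basis(beta)
gamma0 = rng.standard_normal((p, p1))
gamma = gamma0 - a_perp @ (a_perp.T @ gamma0 @ b_perp - xi0 @ eta0.T) @ b_perp.T

A, B, V = triangular_representation(alpha, beta, gamma, s)
W = np.zeros((p, p1))
W[s2 + s:, :r] = np.eye(r)   # W11 = I because A0 = alpha and B0 = beta are not normalized
```

The blocks `V[:s2, r:]` and `V[s2:s2 + s, r + s:]` should then be zero up to rounding error, while `A @ V @ B.T` and `A @ W @ B.T` reproduce Γ and $\alpha\beta'$.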
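It may be of interest to consider which part of the structure can be retrieved in the case where $\mathrm{rank}(\Pi)=r$, but $\mathrm{rank}(\alpha_\perp'\Gamma\beta_\perp)=p-r$, while it should be $s$. This would happen when using I(1) starting values for I(2) estimation. The off anti-diagonal blocks of zeros:
$$V^*=\begin{pmatrix}V_{31}^*&V_{32}^*&V_{33}^*\\V_{21}&V_{22}&V_{23}^*\\V_{11}&V_{12}&V_{13}^*\end{pmatrix}\rightarrow\begin{pmatrix}V_{31}&0&V_{33}\\V_{21}&V_{22}&0\\V_{11}&V_{12}&V_{13}\end{pmatrix}=V$$
can be implemented with two sweep operations:
$$\begin{pmatrix}I_{s_2}&-V_{32}^*V_{22}^{-1}&0\\0&I_s&0\\0&0&I_r\end{pmatrix}V^*\begin{pmatrix}I_r&0&0\\0&I_s&-V_{22}^{-1}V_{23}^*\\0&0&I_{s_2^*}\end{pmatrix}.$$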
The offsetting operations affect $A_1$ and $B_1$ only, so $\Pi$ and $\Gamma$ are unchanged. However, we cannot achieve $V_{33}=0$ in a similar way, because it would remove the zeros just obtained. The $V_{33}$ block has dimension $s_2\times s_2^*$, and its $s_2s_2^*$ elements correspond to the restrictions imposed on $\Gamma$ in the I(2) model. Similarly, the anti-diagonal block of zeros in $W$ captures the restrictions on $\Pi$.
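The two sweep operations, together with the offsetting adjustments to $A$ and $B$, can be illustrated as follows. This is a sketch assuming NumPy and randomly generated matrices; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
s2, s, r, s2s = 2, 1, 2, 3            # row blocks (s2, s, r); column blocks (r, s, s2*)
p, p1 = s2 + s + r, r + s + s2s

A = rng.standard_normal((p, p))       # illustrative data only
B = rng.standard_normal((p1, p1))
V_star = rng.standard_normal((p, p1))
gamma = A @ V_star @ B.T

V22 = V_star[s2:s2 + s, r:r + s]
V32 = V_star[:s2, r:r + s]            # block in the first row block, second column block
V23 = V_star[s2:s2 + s, r + s:]       # block in the second row block, third column block

left = np.eye(p)
left[:s2, s2:s2 + s] = -V32 @ np.linalg.inv(V22)
right = np.eye(p1)
right[r:r + s, r + s:] = -np.linalg.inv(V22) @ V23

V = left @ V_star @ right             # sweeps out the two off anti-diagonal blocks
A_new = A @ np.linalg.inv(left)       # offsetting change: only the A1 columns move
B_new = B @ np.linalg.inv(right).T    # offsetting change: only the B1 columns move
```

After the sweeps, `A_new @ V @ B_new.T` still equals Γ, and only the middle column blocks of `A` and `B` differ from their original values.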
Note that the $r\times s_2^*$ block $V_{13}$ can be made lower triangular. Write the column partition of $V$ as $(V_{\cdot1}:V_{\cdot2}:V_{\cdot3})$, and use $V_{13}=LQ$ to replace $V_{\cdot3}$ by $V_{\cdot3}Q'$ and $B_2$ by $B_2Q'$. When $r<s_2^*$, the rightmost $s_2^*-r$ columns of $L$ will be zero, and the corresponding columns of $B_2$ are not needed to compute Γ. This part can then be omitted from the likelihood evaluation. This becomes relevant for the estimation procedure proposed in §4.2.1.
2.4. Restoring Orthogonality
Although A and B are freely varying, interpretation may require orthogonality between column blocks. The column blocks of A are in reverse order from B to make V and W block lower triangular. As a consequence, multiplication of V or W from either side by a lower triangular matrix preserves their structure. This allows for the relaxation of the orthogonality of A and B, but also enables us to restore it again.
To restore orthogonality, let Γ=A*V*B*′, where A*,B* are not orthogonal, but with V* block-triangular. Now, use the QL decomposition to get A*=AL, with A orthogonal and L lower triangular. Use the QR decomposition to get B*=BR, with B orthogonal and R upper triangular. Then, A*V*B*′=ALV*R′B′=AVB′ with the blocks of zeros in V preserved. A*W*B*′ must be adjusted accordingly. When β is restricted, B0 cannot be modified like this. However, we can still adjust (A2:A1)=A* to get A*′A*=Ip-r and A0′A*=0; with similar adjustments to (B1:B2).
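A minimal numerical sketch of restoring orthogonality is given below. It assumes NumPy, unrestricted $A^*$ and $B^*$, and a hypothetical helper `ql` that builds the QL decomposition from the QR decomposition via the exchange matrix.

```python
import numpy as np

def ql(a):
    """QL decomposition of a square matrix via the QR decomposition of J a J."""
    J = np.eye(a.shape[0])[::-1]
    q, r = np.linalg.qr(J @ a @ J)
    return J @ q @ J, J @ r @ J

rng = np.random.default_rng(4)
s2, s, r, s2s = 2, 1, 2, 3
p, p1 = s2 + s + r, r + s + s2s

# Block-triangular V* with the zero blocks of the triangular representation
V_star = rng.standard_normal((p, p1))
V_star[:s2, r:] = 0.0
V_star[s2:s2 + s, r + s:] = 0.0

A_star = rng.standard_normal((p, p))      # full rank, but not orthogonal
B_star = rng.standard_normal((p1, p1))

A, L = ql(A_star)                         # A* = A L, with L lower triangular
B, R = np.linalg.qr(B_star)               # B* = B R, with R upper triangular
V = L @ V_star @ R.T                      # R' is lower triangular, so the zeros survive
```

Because `L` and `R.T` are both lower triangular, the zero blocks of `V_star` carry over to `V`, and `A @ V @ B.T` reproduces `A_star @ V_star @ B_star.T`.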
The orthogonal version is convenient mathematically, but for estimation, it is preferable to use the unrestricted version. We do not distinguish through notation, but the context will state when the orthogonal version is used.
2.5. Identification in the Triangular Representation
The matrices $A$ and $B$ are not identified without further restrictions. For example, rescaling $\alpha$ and $\beta$ as in $\alpha W_{11}\beta'=\alpha c^{-1}\,cW_{11}d\,d^{-1}\beta'=\alpha^*\,cW_{11}d\,\beta^{*\prime}$ can be absorbed in $V$:
$$\begin{pmatrix}V_{31}d&0&0\\V_{21}d&V_{22}&0\\cV_{11}d&cV_{12}&cV_{13}\end{pmatrix}.$$
When β is identified, $W_{11}$ remains freely varying, and we can, e.g., set $c=W_{11}^{-1}$. However, it is convenient to transform to $W_{11}=I$, so that $A_0$ and $B_0$ correspond to α and β. This prevents part of the orthogonality, in the sense that $A_0'A_0\ne I$ and $B_0'B_0\ne I$.
The following scheme identifies A and B, under the assumption that B0 is already identified through prior restrictions.
Orthogonalize to obtain A0′A1=0,A0′A2=0,A1′A2=0.
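Choose $s$ full rank rows from $B_1$, denoted $M_{B_1}$, and set $B_1\leftarrow B_1M_{B_1}^{-1}$. Adjust $V$ accordingly.
Do the same for $B_2\leftarrow B_2M_{B_2}^{-1}$.
Set $A_1\leftarrow A_1V_{22}$, $V_{21}\leftarrow V_{22}^{-1}V_{21}$ and $V_{22}\leftarrow I$.
$A_2\leftarrow A_2M_{A_2}^{-1}$.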
The ordering of columns inside Ai,Bi is not unique.
3. Relation to Other Representations
Two other formulations of the I(2) model that are in use are the so-called τ and δ representations. All representations implement the same model and make the rank restrictions explicit. However, they differ in their definitions of freely-varying parameters, so may facilitate different forms of analysis, e.g., asymptotic analysis, estimation or the imposition of restrictions. The different parametrizations may also affect economic interpretations.
Johansen (1997) transforms (8) into the τ-representation:
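$$z_{0t}=\alpha(\varrho'\tau'z_{2t}+\psi'z_{1t})+w\kappa'\tau'z_{1t}+\epsilon_t, \tag{12}$$
where $\tau$ is $p_1\times(r+s)$ and $\varrho$ ($(r+s)\times r$) is used to recover β: $\beta=\tau\varrho$. The parameters $(\alpha,\varrho,\tau,\psi,\kappa)$ vary freely. If we normalize on $\varrho'=(I_r:0)$ and adjust κ, τ accordingly, then $\tau=(\beta:\beta_1)$, and:
$$z_{0t}=\alpha(\beta'z_{2t}+\psi'z_{1t})+w\kappa'\tau'z_{1t}+\epsilon_t.$$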
We shall derive the τ representation. The first step is to define a transformation of ϵt∼N[0,Ω]:
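$$\begin{pmatrix}\alpha_\perp'\\\bar{\bar\alpha}'\end{pmatrix}\epsilon_t\sim N\left[0,\begin{pmatrix}\alpha_\perp'\Omega\alpha_\perp&0\\0&(\alpha'\Omega^{-1}\alpha)^{-1}\end{pmatrix}\right]. \tag{13}$$
This splits the $p$-variate system into two independent parts. The first has any terms with leading α knocked out, while the second has all leading α’s cancelled. The inverse transformation is given by: $(\alpha_\perp:\bar{\bar\alpha})^{-1}=(\alpha_\perp-\alpha\bar{\bar\alpha}'\alpha_\perp:\alpha)'=(w:\alpha)'$.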
The next step is to apply (13) to (8) to create two independent systems and insert Ip=β¯β′+β⊥β⊥′ in the ‘marginal’ equation:
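$$\alpha_\perp'z_{0t}=-\alpha_\perp'\Gamma(\bar\beta\beta'+\beta_\perp\beta_\perp')z_{1t}+\varepsilon_{1t}=\kappa'(\beta:\beta_\perp\eta)'z_{1t}+\varepsilon_{1t}, \tag{14a}$$
$$\bar{\bar\alpha}'z_{0t}=\beta'z_{2t}-\bar{\bar\alpha}'\Gamma z_{1t}+\varepsilon_{2t}=\beta'z_{2t}+\psi'z_{1t}+\varepsilon_{2t}, \tag{14b}$$
where $\psi'=-\bar{\bar\alpha}'\Gamma$ and $\kappa'=-(\alpha_\perp'\Gamma\bar\beta:\xi)$ are freely varying. Removing the transformation:
$$z_{0t}=w\kappa'(\beta:\beta_\perp\eta)'z_{1t}+\alpha(\beta'z_{2t}+\psi'z_{1t})+\epsilon_t$$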
and introducing the additional parameters τ=(β:β⊥η) and ϱ completes the τ-representation (12). Table 1 provides definitions of the parameters that are used (cf. Johansen (1997, Tables 1 and 2)).
Triangular representation (9) is equivalent to the τ-representation (12) when A0′(A2:A1)=0.
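Write $A^*=(A_2:A_1)$, so $A=(A^*:A_0)$. First, the system (9) is premultiplied by $A^{-1}=\bar A'$ and subsequently by a lower triangular matrix $L$ to create two independent subsystems. The matrix $L$ and its inverse are given by:
$$L=\begin{pmatrix}I_{p-r}&0\\-\bar A_0'w^*&I_r\end{pmatrix},\qquad L^{-1}=\begin{pmatrix}I_{p-r}&0\\\bar A_0'w^*&I_r\end{pmatrix},$$
where $w^*=\Omega\bar A^*(\bar A^{*\prime}\Omega\bar A^*)^{-1}$, cf. (14). Because $A_0'A^*=0$, we have that $A^*+A_0\bar A_0'w^*=A^*+(I-A^*\bar A^{*\prime})w^*=w^*$, so $AL^{-1}=(w^*:A_0)$. Furthermore, $LW=W$. The identity matrix $L^{-1}L$ can also be inserted directly in (9):
$$\begin{aligned}z_{0t}&=A_0W_{11}B_0'z_{2t}-(w^*:A_0)LVB'z_{1t}+\epsilon_t\\&=A_0W_{11}B_0'z_{2t}-w^*\begin{pmatrix}V_{31}&0\\V_{21}&V_{22}\end{pmatrix}(B_0:B_1)'z_{1t}-A_0\left[(V_{11}:V_{12}:V_{13})B'-\bar A_0'w^*\begin{pmatrix}V_{31}&0\\V_{21}&V_{22}\end{pmatrix}(B_0:B_1)'\right]z_{1t}+\epsilon_t\\&=\alpha(\beta'z_{2t}+\psi'z_{1t})+w\kappa'\tau'z_{1t}+\epsilon_t,\end{aligned}$$
where $\psi'=-(V_{11}:V_{12}:V_{13})B'+\bar A_0'w^*\begin{pmatrix}V_{31}&0\\V_{21}&V_{22}\end{pmatrix}(B_0:B_1)'$ and $w\kappa'=-w^*\begin{pmatrix}V_{31}&0\\V_{21}&V_{22}\end{pmatrix}$. ☐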
Paruolo and Rahbek (1999) and Paruolo (2000a) use the δ representation:
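$$z_{0t}=\alpha(\beta'z_{2t}+\delta\tau_\perp'z_{1t})+\zeta\tau'z_{1t}+\epsilon_t. \tag{15}$$
Here, $(\alpha,\delta,\zeta,\tau=[\beta:\beta_1])$ vary freely. To derive the δ representation, use $\bar\tau\tau'+\tau_\perp\tau_\perp'=I_{p_1}$:
$$-\Gamma\bar\tau\tau'=-(\Gamma\bar\beta:\Gamma\beta_\perp\bar\eta)\tau'=(\zeta_1:\zeta_2)\tau'=\zeta\tau',$$
$$-\Gamma\tau_\perp\tau_\perp'=-\alpha\bar\alpha'\Gamma\tau_\perp\tau_\perp'-\alpha_\perp\alpha_\perp'\Gamma\tau_\perp\tau_\perp'=\alpha\delta\tau_\perp',$$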
and insert in (8). The term with α⊥′Γτ⊥ disappears because τ⊥=β⊥η⊥, so α⊥′Γτ⊥=ξη′η⊥=0.
When β is identified, both δτ⊥′ and ζτ′ are unique, but not yet ζ or δ. In the τ representation, the variable ψ is also unique with ϱ chosen as (I:0)′ and β identified. Table 2 relates the τ, δ and triangular representations.
Triangular representation (9) is equivalent to the δ-representation (15) when B2′(B0:B1)=0.
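Write $B_2=\tau_{\llcorner}$ and $(B_0:B_1)=\tau$. Using the column partitioning of $V=(V_{\cdot1}:V_{\cdot2}:V_{\cdot3})$: $\Gamma\bar\tau=AVB'\bar\tau=A(V_{\cdot1}:V_{\cdot2})$. From (9):
$$\begin{aligned}z_{0t}&=A_0W_{11}B_0'z_{2t}-A_0V_{13}\tau_{\llcorner}'z_{1t}-A(V_{\cdot1}:V_{\cdot2})\tau'z_{1t}+\epsilon_t\\&=A_0W_{11}B_0'z_{2t}-A_0V_{13}\tau_{\llcorner}'z_{1t}-\Gamma\bar\tau\tau'z_{1t}+\epsilon_t\\&=\alpha(\beta'z_{2t}+\delta\tau_\perp'z_{1t})+\zeta\tau'z_{1t}+\epsilon_t.\end{aligned}$$
☐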
4. Algorithms for Gaussian Maximum Likelihood Estimation
Algorithms to estimate the Gaussian CVAR are usually alternating over sets of variables. In the cointegration literature, these are called switching algorithms, following Johansen and Juselius (1994).
The advantage of switching is that each step is easy to implement, and no derivatives are required. Furthermore, the partitioning circumvents the lack of identification that can occur in these models and which makes it harder to use Newton-type methods. The drawback is that progress is often slow, taking many iterations to converge. Occasionally, this will lead to premature convergence. Although the steps can generally be shown to be in a non-downward direction, this is not enough to show convergence to a stationary point. The work in Doornik (2017) documents the framework for the switching algorithms and also considers acceleration of these algorithms; both results are used here.
Johansen (1997, §8) proposes an algorithm based on the τ-representation, called τ-switching here. This is presented in detail in Appendix B. Two new algorithms are given next, the first based on the δ-representation, the second on the triangular representation. Some formality is required to describe the algorithms with sufficient detail.
The free parameters in the δ-representation (15) are (α,δ,ζ,τ) with symmetric positive definite Ω. The algorithm alternates between estimating τ given the rest and fixing τ. The model for τ given the other parameters is linear:
To estimate τ=[β:β1], rewrite (15) as:
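$$z_{0t}=\alpha\beta'z_{2t}+\zeta_1\beta'z_{1t}+\zeta_2\beta_1'z_{1t}+\alpha dz_{1t}+\epsilon_t,$$
where $d$ replaces $\delta\tau_\perp'$. Then, vectorize, using $\alpha\beta'z_{2t}=\mathrm{vec}(z_{2t}'\beta\alpha')=(\alpha\otimes z_{2t}')\mathrm{vec}\,\beta$:
$$z_{0t}=(\alpha\otimes z_{2t}'+\zeta_1\otimes z_{1t}')\mathrm{vec}\,\beta+(\zeta_2\otimes z_{1t}')\mathrm{vec}\,\beta_1+(\alpha\otimes z_{1t}')\mathrm{vec}(d')+\epsilon_t. \tag{16}$$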
Given α,ζ1,ζ2,Ω, we can treat β,β1 and d as free parameters to be estimated by generalized least squares (GLS). This will give a new estimate of τ.
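The vectorization behind the GLS step can be verified numerically for a single observation. The sketch below assumes NumPy, column-major stacking for vec, and arbitrary randomly generated parameter values; it checks only the algebraic identity and does not perform the GLS estimation itself.

```python
import numpy as np

rng = np.random.default_rng(5)
p, p1, r, s = 4, 5, 2, 1

alpha, zeta1 = rng.standard_normal((p, r)), rng.standard_normal((p, r))
zeta2 = rng.standard_normal((p, s))
beta, beta1 = rng.standard_normal((p1, r)), rng.standard_normal((p1, s))
d = rng.standard_normal((r, p1))                  # stands in for delta tau_perp'
z1, z2 = rng.standard_normal(p1), rng.standard_normal(p1)

def vec(m):
    """Stack the columns of m into one vector (column-major order)."""
    return m.reshape(-1, order="F")

# Systematic part of the model before vectorization
direct = alpha @ beta.T @ z2 + zeta1 @ beta.T @ z1 + zeta2 @ beta1.T @ z1 + alpha @ d @ z1

# The same quantity written as a linear function of vec(beta), vec(beta1) and vec(d')
X_beta = np.kron(alpha, z2[None, :]) + np.kron(zeta1, z1[None, :])
X_beta1 = np.kron(zeta2, z1[None, :])
X_d = np.kron(alpha, z1[None, :])
stacked = X_beta @ vec(beta) + X_beta1 @ vec(beta1) + X_d @ vec(d.T)
```

Stacking `X_beta`, `X_beta1` and `X_d` side by side gives the regressor matrix of one observation in the linear model for $(\mathrm{vec}\,\beta,\mathrm{vec}\,\beta_1,\mathrm{vec}(d'))$.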
We can treat d as a free parameter in (16). First, when r≥s2*, δ has more parameters than τ⊥. Second, when r<s2*, then Γ is reduced rank, and s2*-r columns of τ⊥ are redundant. Orthogonality is recovered in the next step.
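Given $\tau$ and derived $\tau_\perp$, we can estimate α and δ by RRR after concentrating out $\tau'z_{1t}$. Introducing $\rho$ with dimension $(r+s_2^*)\times r$ allows us to write (15) as:
$$z_{0t}=\alpha^*\rho'\begin{pmatrix}\beta'z_{2t}\\\tau_\perp'z_{1t}\end{pmatrix}+\zeta\tau'z_{1t}+\epsilon_t. \tag{17}$$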
RRR provides estimates of α* and ρ′. Next, α*ρ′ is transformed to α(Ir:δ), giving new estimates of α and δ. Finally, ζ can be obtained by OLS from (17) given α,δ,τ, and hence, Ω.
The RRR step is the same as used in Dennis and Juselius (2004) and Paruolo (2000b). However, the GLS step for τ is different from both. We have found that the specification of the GLS step can have a substantial impact on the performance of the algorithm.
For numerical reasons (see, e.g. Golub and Van Loan (2013, Ch.5)), we prefer to use the QR decomposition to implement OLS and RRR estimation rather than moment matrices. However, in iterative estimation, there are very many regressions, which would be much faster using precomputed moment matrices. As a compromise, we use precomputed ‘data’ matrices that are transformed by a QR decomposition. This reduces the effective sample size from T to 2p1. The regressions (16) and (17) can then be implemented in terms of the transformed data matrices; see Appendix A.
Usually, starting values of α and β are available from I(1) estimation. The initial τ is then obtained from the marginal equation of the τ-representation, (14a), written as:
$$\alpha_\perp'z_{0t}=\kappa'\tau'z_{1t}+\varepsilon_{1t}=\kappa_1'\beta'z_{1t}+\xi\eta'\beta_\perp'z_{1t}+\varepsilon_{1t}. \tag{18}$$
RRR of α⊥′z0t on β⊥′z1t corrected for β′z1t gives estimates of η, and so, τ.
δ-switching algorithm:
To start, set $k=1$, and choose starting values $\alpha^{(0)},\beta^{(0)}$, tolerance $\varepsilon_1$ and the maximum number of iterations. Compute $\tau_c^{(0)}$ from (18) and $\alpha^{(0)},\delta^{(0)},\zeta^{(0)},\Omega^{(0)}$ from (17). Furthermore, compute $f^{(0)}=-\log|\Omega^{(0)}|$.
Get $\tau_c^{(k)}$ from (16); get the remaining parameters from (17).
Compute $f_c^{(k)}=-\log|\Omega_c^{(k)}|$.
Enter a line search for τ.
The change in $\tau$ is $\nabla=\tau_c^{(k)}-\tau^{(k-1)}$, and the line search finds a step length $\lambda$ with $\tau^{(k)}=\tau^{(k-1)}+\lambda\nabla$. Because only τ is varied, a GLS step is needed to evaluate the log-likelihood for each trial τ. The line search gives new parameters with corresponding $f^{(k)}$.
Compute the relative change from the previous iteration:
$$c^{(k)}=\frac{f^{(k)}-f^{(k-1)}}{1+|f^{(k-1)}|}.$$
Terminate if:
$$|c^{(k)}|\le\varepsilon_1\quad\text{and}\quad\max_{i,j}\frac{|\Pi_{ij}^{(k)}-\Pi_{ij}^{(k-1)}|}{1+|\Pi_{ij}^{(k-1)}|}\le\varepsilon_1^{1/2}.$$
Else increment k, and return to Step 1. ☐
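The termination rule of the algorithm above can be sketched as a small helper. The function names are hypothetical, and the sketch assumes that both relative changes are measured with absolute values in the numerator of the parameter check and in the denominators; that reading of the criterion is an assumption.

```python
import math

def relative_change(f_new, f_old):
    """Relative change in the objective f = -log|Omega| between two iterations."""
    return (f_new - f_old) / (1.0 + abs(f_old))   # assumed form: 1 + |f| in the denominator

def has_converged(f_new, f_old, pi_new, pi_old, eps1=1e-11):
    """Apply both termination conditions; pi_new and pi_old are nested lists holding Pi."""
    c = relative_change(f_new, f_old)
    max_rel = max(
        abs(a - b) / (1.0 + abs(b))
        for row_new, row_old in zip(pi_new, pi_old)
        for a, b in zip(row_new, row_old)
    )
    return abs(c) <= eps1 and max_rel <= math.sqrt(eps1)

pi_old = [[0.5, 1.0], [-0.2, 0.0]]
unchanged = has_converged(5.0, 5.0, pi_old, pi_old)                         # both conditions hold
pi_moved = has_converged(5.0, 5.0, [[0.5, 1.1], [-0.2, 0.0]], pi_old)       # Pi still changing
```

The second condition uses the looser tolerance $\varepsilon_1^{1/2}$, so the parameter check is less strict than the check on the objective.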
The subscript c indicates that these are candidate values that may be improved upon by the line search. The line search is the concentrated version, so the I(2) equivalent to the LBeta line search documented in Doornik (2017). This means that the function evaluation inside the line search needs to re-evaluate all of the other parameters as τ changes. Therefore, within the line search, we effectively concentrate out all other parameters.
Normalization of τ prevents the scale from growing excessively, and it was found to be beneficial to normalize in the first iteration, in every hundredth iteration, or when the norm of τ gets large. Continuous normalization had a negative impact in our experiments. Care is required when normalizing: if an iteration uses a different normalization from the previous one, then the line search will only be effective if the previous coefficients are adjusted accordingly.
The algorithm is incomplete without starting values, and it is obvious that a better start will lead to faster and more reliable convergence. Experimentation also showed that this and other algorithms struggled more in cases with s=0. To improve this, we generate two initial values, follow three iterations of the τ-switching algorithms, then select the best for continuation. The details are in Appendix C.
4.2. MLE with the Triangular Representation
We set W11=I. This tends to lead to slower convergence, but is required when both α and β are restricted. V22 is kept unrestricted: fewer restrictions seem to lead to faster convergence. All regressions use the data matrices that are pre-transformed by an orthogonal matrix as described in Appendix A. In the next section, we describe the estimation steps that can be repeated until convergence.
4.2.1. Estimation Steps
Equation (9) provides a convenient structure for an alternating variables algorithm. We can solve three separate steps by ordinary or generalized least squares for the case with orthogonal A:
B-step: estimate B, and fix A,V,W,Ω at Ac,Vc,Wc,Ωc. The resulting model is linear in B:
$$z_{0t}=A_cW_cB'z_{2t}-A_cV_cB'z_{1t}+\varepsilon_t. \tag{20}$$
Estimation by GLS can be conveniently done as follows. Start with the Cholesky decomposition $\Omega_c=PP'$, and premultiply (20) by $P^{-1}$. Next, take the QL decomposition of $P^{-1}A_c$ as $P^{-1}A_c=HL$, with $L$ lower triangular and $H$ orthogonal. Now, premultiply the transformed system by $H'$:
$$H'P^{-1}z_{0t}=LW_cB'z_{2t}-LV_cB'z_{1t}+u_t=\tilde W_cB'z_{2t}-\tilde V_cB'z_{1t}+u_t,$$
which has the unit variance matrix. Because the structures of W and V are preserved, this approach can also be used in the next step.
V-step: estimate W,V, and fix A,B,Ω. This is a linear model in (W,V), which can be solved by GLS as in the B step.
A-step: estimate A,Ω and fix W,V,B at Wc,Vc,Bc:
$$z_{0t}=A(W_cB_c'z_{2t}-V_cB_c'z_{1t})+\varepsilon_t.$$
This is the linear regression of z0t on WcBc′z2t-VcBc′z1t.
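The transformation used for the GLS estimation in the B-step and V-step above can be illustrated numerically. The sketch assumes NumPy, randomly generated matrices, and a hypothetical helper `ql`; it only demonstrates that the transformed errors have unit variance and that the zero blocks of $W$ and $V$ are preserved.

```python
import numpy as np

def ql(a):
    """QL decomposition of a square matrix via the QR decomposition of J a J."""
    J = np.eye(a.shape[0])[::-1]
    q, r = np.linalg.qr(J @ a @ J)
    return J @ q @ J, J @ r @ J

rng = np.random.default_rng(6)
s2, s, r, s2s = 1, 1, 2, 2
p, p1 = s2 + s + r, r + s + s2s

m = rng.standard_normal((p, p))
omega = m @ m.T + p * np.eye(p)          # symmetric positive definite variance matrix
A = rng.standard_normal((p, p))          # illustrative data only

W = np.zeros((p, p1))
W[s2 + s:, :r] = np.eye(r)
V = rng.standard_normal((p, p1))
V[:s2, r:] = 0.0
V[s2:s2 + s, r + s:] = 0.0

P = np.linalg.cholesky(omega)            # omega = P P'
H, L = ql(np.linalg.solve(P, A))         # P^{-1} A = H L
transform = H.T @ np.linalg.inv(P)       # premultiplier H' P^{-1}

W_tilde, V_tilde = L @ W, L @ V          # coefficient matrices of the transformed system
error_cov = transform @ omega @ transform.T
```

Since `transform @ A` equals the lower triangular `L`, the transformed system has the same block structure as the original one, so the same least squares code can serve both steps.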
The likelihood will not go down when making one update that consists of the three steps given above, provided V is full rank. If that does not hold, as noted at the end of §2.3, some part of B2 or A2 is not identified from the above expressions. To handle this, we make the following adjustments to steps 1 and 3:
B-step: Remove the last s2*-min{r,s2*} columns from B, V and W, as they do not affect the log-likelihood. When iteration is finished, we can add columns of zeros back to W and V and the orthogonal complement of the reduced B to get a rectangular B.
A-step: we wish to keep A invertible and, so, square during iteration. The missing part of A2 is filled in with the orthogonal complement of the remainder of A after each regression. This requires re-estimation of V.1 by OLS.
4.2.2. Triangular-Switching Algorithm
The steps described in the previous section form the basis of an alternating variables algorithm:
Triangular-switching algorithm:
To start, set k=1, and choose α(0),β(0) and the maximum number of iterations. Compute A(0), B(0), V(0), W(0) and Ω(0).
B-step: obtain B(k) from A(k-1),V(k-1),W(k-1),Ω(k-1).
V step: obtain W(k),V(k) from A(k-1),B(k),Ω(k-1).
A step: obtain A(k),Ω(k) from B(k),V(k),W(k).
V.1 step: if necessary, obtain new V.1(k) from A(k),B(k),V.2(k),V.3(k),W(k).
As steps 2,3,T from the δ-switching algorithm. In this case, the line search is over all of the parameters in A,B,V. ☐
The starting values are taken as for the δ-switching algorithm; see Appendix C. This means that two iterations of δ-switching are taken first, using only restrictions on β.
4.3. Linear Restrictions
4.3.1. Delta Switching
Estimation under linear restrictions on β or τ of the form:
$$\beta=(H_1\phi_1:\cdots:H_r\phi_r)\quad\text{or}\quad\tau=(H_1\phi_1:\cdots:H_{r+s}\phi_{r+s})$$
can be done by adjusting the GLS step in §4.1. However, estimation of α is by RRR, which is not so easily adjusted for linear restrictions. Restricting δ requires replacing the RRR step by regression conditional on δ, which makes the algorithm much slower. Estimation under δ=0, which implies d=0, is straightforward.
4.3.2. Triangular Switching
Triangular switching avoids RRR, and restrictions on β=B0 or τ=(B0:B1) can be implemented by adjusting the B-step. In general, we can test restrictions of the form:
$$B=(H_1\phi_1:\cdots:H_{p_1}\phi_{p_1})\quad\text{and}\quad A=(G_1\theta_1:\cdots:G_p\theta_p).$$
Such linear restrictions on the columns of A and B are a straightforward extension of the GLS steps described above.
Estimation without multi-cointegration is also feasible. Setting $\delta=0$ corresponds to $V_{13}=0$ in the triangular representation. This amounts to removing the last $s_2^*$ columns from $B$, $V$, $W$. Boswijk (2010) shows that the test for $\delta=0$ has an asymptotic $\chi^2(rs_2^*)$ distribution.
Paruolo and Rahbek (1999) derive conditions for weak exogeneity in (15). They decompose this into three sub-hypotheses: $H_0:b'\alpha=0$, $H_1:b'(\alpha_\perp\xi)=0$, $H_2:b'\zeta_1=0$. These restrictions, taking $b=e_{p,i}$, where $e_{p,i}$ is the $i$-th column of $I_p$, correspond to a zero right-hand side in a particular equation in the triangular representation. First is $e_{p,i}'A_0=0$, creating a row of zeros in $AWB'$. Next is $e_{p,i}'A_1=0$, which extends the row of zeros. However, $A$ must be full rank, so the final restriction must be imposed on $V$ as $e_{p,i}'AV=(e_{p,i}'A_2:0:0)V=0$, expressed as $e_{p,i}'A_2V_{31}=0$. Paruolo and Rahbek (1999) show that the combined test for a single variable has an asymptotic $\chi^2(2r+s)$ distribution.
5. Comparing Algorithms
We have three algorithms that can be compared:
The δ-switching algorithm, §4.1, which can handle linear restrictions on β or τ.
The triangular-switching algorithm proposed in §4.2.2. This can optionally have linear restrictions on the columns of A or B.
The improved τ-switching algorithm, Appendix B, implemented to allow for common restrictions on τ.
These algorithms, as well as two pre-existing ones, have been implemented in Ox 7 (Doornik (2013)).
The comparisons are based on a model for the Danish data (five variables: m3= log real money, y= log real GDP, Δp= log GDP deflator, and rm, rb, two interest rates); see Juselius (2006, §4.1.1). This has two lags in the VAR, with an unrestricted constant and restricted trend for the deterministic terms, i.e., specification Hl. The sample period is 1973(3) to 2003(1). First computed is the I(2) rank test table.
Table 3 records the number of iterations used by each of the algorithms; this is closely related to the actual computational time required (but less machine specific). All three algorithms converge rapidly to the same likelihood value. Although τ switching takes somewhat fewer iterations, it tends to take a bit more time to run than the other two algorithms. The new triangular I(2) switching procedure is largely competitive with the new δ-switching algorithm.
To illustrate the advances made with the new algorithms, we report in Table 4 how the original τ-switching, as well as the CATS2 version of δ-switching, performed. CATS2 (Dennis and Juselius (2004)) is a RATS package for the estimation of I(1) and I(2) models, which uses a somewhat different implementation of an I(2) algorithm that is also called δ-switching. The number of iterations of that CATS2 algorithm is up to 200 times higher than that of the new algorithms, which are therefore much faster, as well as more robust and reliable.
6. A More Detailed Comparison
A Monte Carlo experiment is used to show the difference between algorithms in more detail. The first data generation process is the model for the Danish data, estimated with the I(1) and I(2) restrictions $r,s$ imposed. $M=1000$ random samples are drawn from this, using, for each case, the estimated parameters and estimated residual variance assuming normality. The number of iterations and the progress of the algorithm are recorded for each sample. The maximum number of iterations was set to 10,000, $\varepsilon_1=10^{-11}$, and all replications are included in the results.
Figure 1 shows the histograms of the number of iterations required to achieve convergence (or 10,000). Each graph has the number of iterations (on a log 10 scale) on the horizontal axis and the count (out of 1000 experiments) represented by the bars and the vertical axis. Ideally, all of the mass is to the left, reflecting very quick convergence. The top row of histograms is for δ switching, the bottom row for triangular switching. In each histogram, the data generation process (DGP) uses the stated r,s values, and estimation is using the correct values of r,s.
The histograms show that triangular switching (bottom row) uses more iterations than δ switching (top row), in particular when s=0. Nonetheless, the experiment using triangular switching runs slightly faster as measured by the total time taken (and τ switching is the slowest).
An important question is whether the algorithms converge to the same maximum. The function value that is maximized is:
$$f(\theta)=-\log|\Omega(\theta)|.$$
Out of 10,000 experiments, counted over all r,s combinations that we consider, there is only a single experiment with a noticeable difference in f(θ^). This happens for r=3,s=0, and δ-switching finds a higher function value by almost 0.05. Because T=119, the 0.05 translates to a difference of three in the log-likelihoods.
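The conversion from the difference in $f$ to the difference in log-likelihoods can be written out as follows, assuming the usual concentrated Gaussian log-likelihood, which equals $\tfrac{T}{2}f(\theta)$ up to an additive constant $c$:

```latex
\ell(\hat\theta) = c + \frac{T}{2}\, f(\hat\theta),
\qquad
\Delta\ell = \frac{T}{2}\,\Delta f = \frac{119}{2}\times 0.05 = 2.975 \approx 3 .
```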
A second issue of interest is how the algorithms perform when restrictions are imposed. The following restrictions are imposed on the three columns of β with r=3:
$$\begin{array}{l|cccccc}&m_3&y&\Delta p&r_m&r_b&t\\\hline\beta_1'&a&-a&0&1&-1&*\\\beta_2'&0&*&1&-a&a&*\\\beta_3'&0&0&1&*&0&*\end{array}$$
This specification identifies the cointegrating vectors and imposes two over-identifying restrictions. For r=3,s=0 this is accepted with a p-value of 0.4, while for r=3,s=1, the p-value is 0.5 using the model on the actual Danish data. Simulation is from the estimated restricted model.
In terms of the number of iterations, as Figure 2 shows, δ-switching converges more rapidly in most cases. This makes triangular switching slower, but only by about 10%–20%.
Figure 3 shows $f(\hat\theta_\delta) - f(\hat\theta_{\mathrm{triangular}})$, so a positive value means that triangular switching obtained a lower log-likelihood. There are many small differences, mostly to the advantage of δ-switching when s=1 (right-hand plot), but to the advantage of triangular switching on the left, when s=0. The latter case is also much noisier.
6.1. Hybrid Estimation
To increase the robustness of the triangular procedure, we also consider a hybrid procedure, which combines algorithms as follows:
standard starting values, as well as twenty randomized starting values, then
triangular switching, followed by
BFGS optimization (the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method) for a maximum of 200 iterations, followed by
triangular switching.
This offers some protection against false convergence, because BFGS is based on first derivatives combined with an approximation to the inverse Hessian.
More importantly, we add a randomized search for better starting values as perturbations of the default starting values. Twenty versions of starting values are created this way, and each is followed for ten iterations. Then, we discard half, merge (almost) identical ones and run another ten iterations. This is repeated until a single one is left.
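The randomized search over starting values can be illustrated with a minimal sketch. It is not the implementation used for the results: the Gaussian perturbation scheme, the merge tolerance, the names `tournament`, `step`, `f` and `grad`, and the toy one-dimensional objective (standing in for the likelihood, with one damped gradient step standing in for one switching iteration) are all assumptions made for illustration.

```python
import random

def tournament(default_start, step, f, n_starts=20, n_iter=10,
               scale=1.0, merge_tol=1e-6, seed=0):
    """Successive halving over perturbed starting values (illustrative sketch)."""
    rng = random.Random(seed)
    # Perturbations of the default start; the Gaussian scheme is an assumption.
    pool = [[x + rng.gauss(0.0, scale) for x in default_start]
            for _ in range(n_starts)]
    while True:
        for _ in range(n_iter):                  # follow each candidate
            pool = [step(theta) for theta in pool]
        if len(pool) == 1:
            return pool[0]
        pool.sort(key=f, reverse=True)           # best function value first
        pool = pool[:max(1, len(pool) // 2)]     # discard the worse half
        merged = []
        for theta in pool:                       # merge (almost) identical ones
            if all(max(abs(a - b) for a, b in zip(theta, kept)) > merge_tol
                   for kept in merged):
                merged.append(theta)
        pool = merged

# Toy stand-in for the likelihood: two local maxima, the higher one near x = 1.
def f(theta):
    x = theta[0]
    return -(x * x - 1.0) ** 2 + 0.3 * x

def grad(theta):
    x = theta[0]
    return -4.0 * x * (x * x - 1.0) + 0.3

def step(theta):
    # One damped gradient-ascent step standing in for one switching iteration.
    return [theta[0] + max(-0.5, min(0.5, 0.05 * grad(theta)))]

best = tournament([-1.0], step, f)
```

Ranking after only ten iterations is a heuristic: a candidate that would eventually reach a higher maximum can still be discarded early, which is why such a search reduces the risk of a poor local maximum without removing it.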
Figure 4 shows that this hybrid approach is an improvement: now, it is almost never beaten by δ switching. Of course, the hybrid approach is a bit slower again. The starting value procedure for δ switching could be improved in the same way.
7. Conclusions
We introduced the triangular representation of the I(2) model and showed how it can be used for estimation. The trilinear form of the triangular representation has the advantage that estimation can be implemented as alternating least squares, without using reduced-rank regression. This structure allows us to impose restrictions on (parts of) the A and B matrices, which gives more flexibility than is available in the δ and τ representations.
We also presented an algorithm based on the δ-representation and compared the performance to triangular switching in an application based on Danish data, as well as a parametric bootstrap using that data. Combined with the acceleration of Doornik (2017), both algorithms are fast and give mostly the same result. This will improve empirical applications of the I(2) model and facilitate recursive estimation and Monte Carlo analysis. Expressions for the computation of t-values of coefficients will be reported in a separate paper.
Because the new algorithms are considerably faster than the previous generation, bootstrapping the I(2) model can now be considered, as Cavaliere et al. (2012) did for the I(1) model.
Acknowledgments
I am grateful to Peter Boswijk, Søren Johansen and Katarina Juselius for providing detailed comments as this paper was developing. Their suggestions have greatly improved the results presented here. Financial support from the Robertson Foundation (Award 9907422) and the Institute for New Economic Thinking (Grant 20029822) is gratefully acknowledged. All computations and graphs use OxMetrics and Ox, Doornik (2013).
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Estimation Using the QR Decomposition
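The data matrices in the I(2) model (8) are $Z_i' = (z_{i1} : \cdots : z_{iT})$ for $i = 0, 1, 2$.
Take the QR decomposition of $(Z_2 : Z_1)$ as
$$(Z_2 : Z_1)P = QR = Q_z(R_2 : R_1),$$
where $Q$ is a $T \times T$ orthogonal matrix and $R$ a $T \times 2p_1$ upper triangular matrix, while $Q_z$ holds the $T \times 2p_1$ leading columns of $Q$ and $(R_2 : R_1)$ is a $2p_1 \times 2p_1$ upper triangular matrix. $P$ is the orthogonal matrix that captures the column reordering. Then:
$$Q_z'(Z_2 : Z_1) = (R_2 : R_1)P' = (X_2 : X_1),$$
where $(X_2 : X_1)$ is no longer triangular. Introduce:
$$\begin{pmatrix} X_0 \\ X_0^* \end{pmatrix} = Q'Z_0,$$
where $X_0 = Q_z'Z_0$ is $2p_1 \times p$, then:
$$Z_i'Z_j = Z_i'QQ'Z_j = \begin{cases} X_0'X_0 + X_0^{*\prime}X_0^* & \text{if } i = j = 0, \\ X_i'X_j & \text{otherwise.} \end{cases}$$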
Now, e.g., a regression of A′z0t on B′z1t for known A,B:
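$$A'z_{0t} = \gamma B'z_{1t} + \epsilon_t, \quad t = 1, \dots, T, \tag{A1}$$
has:
$$\hat\gamma' = (B'Z_1'Z_1B)^{-1}B'Z_1'Z_0A = (B'X_1'X_1B)^{-1}B'X_1'X_0A.$$
This is a regression of $X_0A$ on $X_1B$. If such regressions need to be done often for the same Z’s, it is more efficient to do them in terms of the $X_i$:
$$A'x_{0i} = \gamma B'x_{1i} + e_i, \quad i = 1, \dots, 2p_1,$$
with estimated residual variance:
$$\hat\Omega_e = T^{-1}\left(A'X_0^{*\prime}X_0^*A + \sum_{i=1}^{2p_1}\hat e_i\hat e_i'\right).$$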
This regression has fewer ‘observations’, while at the same time avoiding the creation of moment matrices. Precomputed moment matrices would be faster, but not as good numerically. For recursive estimation, it is useful to be able to handle singular regressions because dummy variables can be zero over a subsample; this happens naturally in the QR approach. This approach needs to be adjusted when (A1) also has z0t on the right-hand side, as happens for τ-switching in (A3).
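The equivalence between the regression on the original data and the regression on the QR-transformed data can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the dimensions and the random data are illustrative, the "known" A and B are drawn at random, and NumPy's QR does not pivot, so P = I here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, p1 = 60, 3, 4               # illustrative sizes, not from the paper
Z0 = rng.standard_normal((T, p))
Z1 = rng.standard_normal((T, p1))
Z2 = rng.standard_normal((T, p1))
A = rng.standard_normal((p, 2))   # "known" A and B, drawn at random here
B = rng.standard_normal((p1, 2))

# Full QR of (Z2 : Z1); NumPy's qr does not pivot, so P = I in this sketch.
Q, R = np.linalg.qr(np.hstack([Z2, Z1]), mode="complete")
k = 2 * p1
Qz = Q[:, :k]
X21 = Qz.T @ np.hstack([Z2, Z1])  # equals the leading k x k block of R
X1 = X21[:, p1:]
QtZ0 = Q.T @ Z0
X0, X0s = QtZ0[:k], QtZ0[k:]      # X0 is 2p1 x p, X0* is (T - 2p1) x p

# Regression in terms of X (2p1 "observations") versus Z (T observations).
g_x, *_ = np.linalg.lstsq(X1 @ B, X0 @ A, rcond=None)
g_z, *_ = np.linalg.lstsq(Z1 @ B, Z0 @ A, rcond=None)

# Residual variance: the X0* term restores what the shorter regression omits.
res_x = X0 @ A - X1 @ B @ g_x
omega_x = (A.T @ X0s.T @ X0s @ A + res_x.T @ res_x) / T
res_z = Z0 @ A - Z1 @ B @ g_z
omega_z = res_z.T @ res_z / T
```

Both routes give the same coefficients and the same residual variance; the transformed regression has only 2p1 rows once the QR factors are available.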
Reduced Rank Regression
Let RRR(Z0,Z1|Zx) denote reduced rank regression of z0t on z1t corrected for zxt. Assume that (Z0,Z1,Zx) have been transformed into (X0,X1,Xx) using the QR decomposition described above. Concentrating Xx out can be done by regression of (X0,X1) on Xx, with residuals (Y,X). Form S00=Y′Y+X0*′X0*, and decompose using the Cholesky decomposition: S00=LL′.
We need to solve the matrix pencil:
$$X'YS_{00}^{-1}Y'Xx = \lambda X'Xx.$$
Start by using the QR decomposition $X = QRP'$, $y = P'x$:
$$\begin{aligned}
R'Q'YL^{-1\prime}L^{-1}Y'QRy &= \lambda R'Ry, \\
R'W'WRy &= \lambda R'Ry, \\
W'Wz &= \lambda z, \\
U\Sigma^2U'z &= \lambda z.
\end{aligned}$$
The second line introduces $W = L^{-1}Y'Q$; the next line removes $R$ by setting $z = Ry$; and the final line takes the SVD of $W'$. The eigenvalues are the squared singular values on the diagonal of $\Sigma^2$, and the eigenvectors are $PR^{-1}U$.
When X is singular, as may be the case in recursive estimation, the upper triangular matrix R will have rows and columns that are zero at the bottom and end. These are the same on the left and right of the pencil, so they can be dropped. The resulting reduced dimension R is full rank, and we can set the corresponding rows in the eigenvectors to zero. When the regressors are singular, their corresponding coefficients in β will be set to zero, just as in our regressions.
This approach differs somewhat from Doornik and O’Brien (2002) because of the different structure of S00 as a consequence of the prior QR transformation.
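The chain from the matrix pencil to the SVD can be sketched in NumPy. This is an illustration under simplifying assumptions rather than the routine used in the paper: there is no concentration step, the data are not pre-transformed (so S00 has no X0* term), the regressors have full rank, and the QR is unpivoted (P = I). The sizes and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, k = 80, 3, 5                      # illustrative sizes, not from the paper
X = rng.standard_normal((T, k))
Y = X[:, :2] @ rng.standard_normal((2, p)) + rng.standard_normal((T, p))

S00 = Y.T @ Y                           # no X0* term: data not pre-transformed here
L = np.linalg.cholesky(S00)             # S00 = L L'
Q, R = np.linalg.qr(X)                  # thin QR without pivoting, so P = I
W = np.linalg.solve(L, Y.T @ Q)         # W = L^{-1} Y' Q
U, sv, _ = np.linalg.svd(W.T, full_matrices=False)
eigvals = sv ** 2                       # squared singular values
eigvecs = np.linalg.solve(R, U)         # R^{-1} U (times P = I)

# Check against the pencil X'Y S00^{-1} Y'X x = lambda X'X x.
lhs = X.T @ Y @ np.linalg.solve(S00, Y.T @ X) @ eigvecs
rhs = (X.T @ X @ eigvecs) * eigvals
```

No moment matrix of the regressors is inverted: only triangular factors and an SVD are used.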
Appendix B. Tau-Switching Algorithm
The algorithm of Johansen (1997, §8) is based on the τ-representation and involves three stages:
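The estimate of τ is obtained by GLS given all other parameters except ψ. Johansen (1997, p. 451) shows the GLS expressions using second moment matrices. Define the orthogonal matrix $A = (\alpha_\perp : \bar{\bar\alpha})$, then using $\kappa'\tau'z_{1t} = \mathrm{vec}(z_{1t}'\tau\kappa) = (\kappa' \otimes z_{1t}')\,\mathrm{vec}\,\tau$:
$$A'z_{0t} = \begin{pmatrix} \kappa' \otimes z_{1t}' \\ \varrho' \otimes z_{2t}' \end{pmatrix}\mathrm{vec}\,\tau + \begin{pmatrix} 0 \\ I_r \otimes z_{1t}' \end{pmatrix}\mathrm{vec}\,\psi + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} = \left[\begin{pmatrix} \kappa' \\ 0 \end{pmatrix} \otimes z_{1t}' + \begin{pmatrix} 0 \\ \varrho' \end{pmatrix} \otimes z_{2t}'\right]\mathrm{vec}\,\tau + \begin{pmatrix} 0 \\ I_r \otimes z_{1t}' \end{pmatrix}\mathrm{vec}\,\psi + u_t. \tag{A2}$$
The error term $u_t$ has variance $A'\Omega A$, which is block diagonal. Given α, κ, ϱ, Ω, (A2) is linear in τ and ψ. The estimates of the latter are discarded.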
Given just τ, reduced-rank regression of z0t corrected for τ′z1t on z0t corrected for z1t,τ′z2t is used to estimate α. Details are in Johansen (1997, p. 450).
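Given τ and α, the remaining parameters can be obtained by GLS. The equivalence $\bar{\bar\alpha}' = \bar\alpha' - \bar\alpha'w\alpha_\perp'$ is used to write the conditional equation as:
$$\bar\alpha'z_{0t} = \gamma'\alpha_\perp'z_{0t} + \varrho'\tau'z_{2t} + \psi'z_{1t} + \varepsilon_{2t}, \tag{A3}$$
from which ϱ and ψ are estimated by regression. Then, κ is estimated from the marginal equation:
$$\alpha_\perp'z_{0t} = \kappa'\tau'z_{1t} + \varepsilon_{1t}. \tag{A4}$$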
Together, they give Ω and w. We always transform to set ϱ′=(I:0), adjusting κ and τ accordingly.
τ-switching algorithm:
To start, set $k = 1$, and choose starting values $\alpha^{(0)}, \beta^{(0)}$, tolerance $\varepsilon_1$ and the maximum number of iterations. Compute $\tau_c^{(0)}$ from (18) and $\kappa^{(0)}, \psi^{(0)}, \Omega^{(0)}$ from (A3) and (A4). Furthermore, compute $f^{(0)} = -\log|\Omega^{(0)}|$.
Get $\tau_c^{(k)}$ from (A2). Identify this as follows. Select the non-singular $(r+s) \times (r+s)$ submatrix of $\tau$ with the largest volume, say $M$. We find $M$ by using the first $r+s$ column pivots that are chosen by the QR decomposition of $\tau'$ (Golub and Van Loan (2013, Algorithm 5.4.1)). Set $\tau_c^{(k)} \leftarrow \tau_c^{(k)}M^{-1}$. Get $\alpha_c^{(k)}$ by RRR; finally, get the remaining parameters from (A3) and (A4).
As steps 2,3,T from the δ-switching algorithm. ☐
The line search is only for the p1s2* parameters in τ as part of it is set to a unit matrix every time. The function evaluation inside the line search needs to obtain all of the other parameters as τ changes.
This is the algorithm of Johansen (1997) except for the normalization of τ and the line search. The former protects the parameter values from exploding, while the latter improves convergence speed and makes it more robust. Removing ϱ is largely for convenience: it has little impact on convergence. The τ-switching algorithm is easily adjusted for common restrictions on τ in the form of τ=Hτ˜. However, ϱ gets in the way of more general restrictions.
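The normalization of τ can be sketched as follows. The sketch assumes that the rows of τ are selected through the column pivots of τ′ and that τ is post-multiplied by the inverse of the selected submatrix M, so that those rows become a unit matrix, as the remark on the line search suggests. The helper `pivot_rows` is hypothetical: it uses a simple Gram-Schmidt style greedy pivot choice rather than the Householder routine of Golub and Van Loan, and the dimensions and random τ are illustrative.

```python
import numpy as np

def pivot_rows(tau):
    """Rows of tau picked by a greedy column-pivoting pass over tau'."""
    work = tau.T.astype(float).copy()        # (r+s) x p1; its columns are rows of tau
    n_pick = work.shape[0]
    picked = []
    for _ in range(n_pick):
        norms = np.linalg.norm(work, axis=0)
        j = int(np.argmax(norms))            # pivot: column with largest remaining norm
        picked.append(j)
        q = work[:, j] / norms[j]
        work -= np.outer(q, q @ work)        # project the pivot direction out
    return picked

rng = np.random.default_rng(2)
p1, rs = 6, 3                                # illustrative p1 and r+s
tau = rng.standard_normal((p1, rs))
rows = pivot_rows(tau)
M = tau[rows, :]                             # selected (r+s) x (r+s) submatrix
tau_norm = tau @ np.linalg.inv(M)            # selected rows become a unit matrix
```

The column space of τ is unchanged by this step; only the free parameters outside the unit-matrix block remain to be searched over.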
Appendix C. Starting Values
The first starting value procedure is:
Set α(-1),β(-1) to their I(1) values (i.e., with full rank Γ).
Get τ(-1) from (A4), then Ω(-1) from (A3), ignoring restrictions.
Take two iterations with the relevant switching algorithm subject to restrictions.
The second starting value procedure is:
Get α(-2),β(-2) by RRR from the τ-representation using κ=0:
z0t=α(β′z2t+ψ′z1t)+ϵt.
Get κ(-2),w(-2) from (A3), (A4).
Get α(-1),β(-1) by RRR from the τ-representation:
z0t-wκ′β′z1t=α(β′z2t+ψ′z1t)+ϵt.
Get τ(-1) from (A4), then Ω(-1) from (A3), ignoring restrictions.
Take two iterations with the relevant switching algorithm subject to restrictions.
Finally, choose the final starting values as those that have the highest function value.
References
Anderson, Theodore W. Estimating linear restrictions on regression coefficients for multivariate normal distributions.
Anderson, Theodore W. Reduced rank regression in cointegrated models.
Boswijk, H. Peter. Mixed Normal Inference on Multicointegration.
Boswijk, H. Peter, and Jurgen A. Doornik. Identifying, Estimating and Testing Restricted Cointegrated Systems: An Overview.
Cavaliere, Giuseppe, Anders Rahbek, and R. A. M. Taylor. Bootstrap Determination of the Co-Integration Rank in Vector Autoregressive Models.
Dennis, Jonathan G., and Katarina Juselius.
Doornik, Jurgen A.
Doornik, Jurgen A.
Doornik, Jurgen A., and R. J. O’Brien. Numerically Stable Cointegration Analysis.
Golub, Gene H., and Charles F. Van Loan.
Johansen, Søren. Statistical Analysis of Cointegration Vectors.
Johansen, Søren. Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models.
Johansen, Søren. A Representation of Vector Autoregressive Processes Integrated of Order 2.
Johansen, Søren.
Johansen, Søren. Identifying Restrictions of Linear Equations with Applications to Simultaneous Equations and Cointegration.
Johansen, Søren. A Statistical Analysis of Cointegration for I(2) Variables.
Johansen, Søren. Likelihood Analysis of the I(2) Model.
Johansen, Søren, and Katarina Juselius. Maximum Likelihood Estimation and Inference on Cointegration—With Application to the Demand for Money.
Johansen, Søren, and Katarina Juselius. Testing Structural Hypotheses in a Multivariate Cointegration Analysis of the PPP and the UIP for UK.
Johansen, Søren, and Katarina Juselius. Identification of the Long-run and the Short-run Structure. An Application to the ISLM Model.
Juselius, Katarina.
Paruolo, Paolo. Asymptotic Efficiency of the Two Stage Estimator in I(2) Systems.
Paruolo, Paolo. 2000b. On likelihood-maximizing algorithms for I(2) VAR models. Mimeo, Universitá dell’Insubria, Varese.
Paruolo, Paolo, and Anders Rahbek. Weak exogeneity in I(2) VAR Systems.
Figures and Tables
Figure 1. Comparison of algorithms: δ-switching (top row) and triangular-switching (bottom row). Simulating a range of r,s. Number of iterations on the horizontal axis, count (out of 1000) on the vertical.
Figure 2. Comparison of algorithms: δ-switching (left two) and triangular-switching (right two). Simulating a range of r,s. Number of iterations on the horizontal axis, count (out of 1000) on the vertical.
Figure 3. δ-switching function value minus the triangular switching function value (vertical axis) for each replication (horizontal axis). Both starting from their default starting values. The labels are the cointegration indices (r,s,s2).
Figure 4. δ-switching function value minus the hybrid triangular-switching function value (vertical axis) for each replication (horizontal axis).
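Table 1. Definitions of the symbols used in the τ and δ representations of the I(2) model.

| Symbol | Definition | Dimension |
|---|---|---|
| $\tau$ | $(\beta : \beta_\perp\eta)$ when $\varrho' = (I : 0)$ | $p_1 \times (r+s)$ |
| $\tau_\perp$ | $\beta_\perp\eta_\perp$ | $p_1 \times s_2^*$ |
| $\psi$ | $-(\bar{\bar\alpha}'\Gamma)'$ | $p_1 \times r$ |
| $\kappa'$ | $-\alpha_\perp'\Gamma\bar\tau = -(\alpha_\perp'\Gamma\bar\beta : \xi) = (\kappa_1 : \kappa_2)'$ | $(p-r) \times (r+s)$ |
| $\delta$ | $-\bar\alpha'\Gamma\tau_\perp$ | $r \times s_2^*$ |
| $\zeta$ | $-\Gamma\bar\tau = (\zeta_1 : \zeta_2)$ | $p \times (r+s)$ |
| $w$ | $\alpha_\perp - \alpha\bar{\bar\alpha}'\alpha_\perp = \Omega\alpha_\perp(\alpha_\perp'\Omega\alpha_\perp)^{-1} = \bar{\bar\alpha}_\perp$ | $p \times (p-r)$ |
| $d$ | $\tau_\perp\delta'$ | $p_1 \times r$ |
| $e$ | $\tau\zeta'$ | $p_1 \times p$ |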
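Table 2. Links between symbols used in the representations of the I(2) model, assuming $W_{11} = I_r$ and $a_\perp'a_\perp = I$.

$$-\Gamma = \alpha\psi' + w\kappa'\tau' = \alpha\delta\tau_\perp' + \zeta\tau' = \alpha d' + e'$$

| Symbol | Expression | Obtained from |
|---|---|---|
| $\zeta$ | $\alpha\psi'\bar\tau + w\kappa'$ | from $\Gamma\bar\tau$ |
| $d'$ | $\psi'\tau_\perp\tau_\perp'$ | from $\Gamma\tau_\perp$ |
| $\kappa'$ | $\alpha_\perp'\zeta$ | from $\alpha_\perp'\Gamma$ |
| $\psi'$ | $d' + \bar{\bar\alpha}'\zeta\tau'$ | from $\bar{\bar\alpha}'\Gamma$ |
| $\alpha$ | $A_0$ | |
| $\beta$ | $B_0$ | |
| $d'$ | $-V_{13}B_2'$ | |
| $e'$ | $-A(V_{\cdot 1} : V_{\cdot 2})(B_0 : B_1)'$ | |
| $\tau$ | $(B_0 : B_1)$ | |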
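Table 3. Estimation of all I(2) models by τ, δ and triangular switching; all using the same starting value procedure. Number of iterations to convergence for $\varepsilon_1 = 10^{-14}$. Within each algorithm, the columns are for s2 = 4, 3, 2, 1.

| r \ s2 | τ: 4 | τ: 3 | τ: 2 | τ: 1 | δ: 4 | δ: 3 | δ: 2 | δ: 1 | Triangular: 4 | Triangular: 3 | Triangular: 2 | Triangular: 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 19 | 25 | 36 | 34 | 15 | 24 | 37 | 30 | 31 | 31 | 39 | 32 |
| 2 | | 18 | 32 | 25 | | 18 | 32 | 34 | | 22 | 27 | 50 |
| 3 | | | 37 | 23 | | | 42 | 38 | | | 50 | 59 |
| 4 | | | | 29 | | | | 28 | | | | 85 |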
Table 4. Estimation of all I(2) models by old versions of τ, δ switching. Number of iterations to convergence for $\varepsilon_1 = 10^{-14}$.