## 1. Introduction

The I(1) model or cointegrated vector autoregression (CVAR) is now well established. The model is developed in a series of papers and books (see, e.g., Johansen (1988), Johansen (1991), Johansen (1995a), Juselius (2006)) and is generally available in econometric software. The I(1) model is formulated as a rank reduction of the matrix of ‘long-run’ coefficients. The Gaussian log-likelihood is maximized by reduced-rank regression (RRR; see Anderson (1951), Anderson (2002)).

Determining the cointegrating rank only finds the cointegrating vectors up to a rank-preserving linear transformation. Therefore, the next step of an empirical study usually identifies the cointegrating vectors. This may be followed by imposing over-identifying restrictions. Common restrictions, i.e., the same restrictions on each cointegrating vector, can still be solved by adjusting the RRR estimation; see Johansen and Juselius (1990) and Johansen and Juselius (1992). Estimation with separate linear restrictions on the cointegrating vectors, or more general non-linear restrictions, requires iterative maximization. The usual approach is based on so-called switching algorithms; see Johansen (1995b) and Boswijk and Doornik (2004). The former proposes an algorithm that alternates between cointegrating vectors, estimating one while keeping the others fixed. The latter consider algorithms that alternate between the cointegrating vectors and their loadings: when one is kept fixed, the other is identified. The drawback is that these algorithms can be very slow and occasionally terminate prematurely.

Doornik (2017) proposes improvements that can be applied to all switching algorithms.

Johansen (1995c) and Johansen (1997) extend the CVAR to allow for I(2) stochastic trends. These tend to be smoother than I(1) stochastic trends. The I(2) model implies a second reduced rank restriction, but this is now more complicated, and estimation under Gaussian errors can no longer be performed by RRR. The basis of an algorithm for maximum likelihood estimation is presented in Johansen (1997), with an implementation in Dennis and Juselius (2004).

The general approach to handling the I(2) model is to create representations that introduce parameters that vary freely without changing the nature of the model. This facilitates both the statistical analysis and the estimation.

The contributions of the current paper are two-fold. First, we present the triangular representation of the I(2) model. This is a new trilinear formulation with a block-triangular matrix structure at its core. The triangular representation provides a convenient framework for imposing linear restrictions on the model parameters. Next, we introduce several improved estimation algorithms for the I(2) model. A simulation experiment is used to study the behaviour of the algorithms.

#### Notation

Let $\alpha $ ($p\times r$) be a matrix with full column rank $r,r\le p$. The perpendicular matrix ${\alpha}_{∟}$ ($p\times (p-r)$) has ${\alpha}_{∟}^{\prime}\alpha =0$. The orthogonal complement ${\alpha}_{\perp}$ has ${\alpha}_{\perp}^{\prime}\alpha =0$ with the additional property that ${\alpha}_{\perp}^{\prime}{\alpha}_{\perp}={I}_{p-r}$. Define $\tilde{\alpha}=\alpha {\left({\alpha}^{\prime}\alpha \right)}^{-1/2}$ and $\overline{\alpha}=\alpha {\left({\alpha}^{\prime}\alpha \right)}^{-1}$. Then, $(\tilde{\alpha}:{\alpha}_{\perp})$ is a $p\times p$ orthogonal matrix, so ${I}_{p}=\tilde{\alpha}{\tilde{\alpha}}^{\prime}+{\alpha}_{\perp}{\alpha}_{\perp}^{\prime}=\overline{\alpha}{\alpha}^{\prime}+{\alpha}_{\perp}{\alpha}_{\perp}^{\prime}=\alpha {\overline{\alpha}}^{\prime}+{\alpha}_{\perp}{\alpha}_{\perp}^{\prime}$.

The (thin) singular value decomposition (SVD) of $\alpha $ is $\alpha =UW{V}^{\prime}$, where $U(p\times r),V(r\times r)$ are orthogonal: ${U}^{\prime}U={V}^{\prime}V=V{V}^{\prime}={I}_{r}$, and W is a diagonal matrix with the ordered positive singular values on the diagonal. If rank$\left(\alpha \right)=s<r$, then the last $r-s$ singular values are zero. We can find ${\alpha}_{\perp}={U}_{2}$ from the SVD of the square matrix $(\alpha :0)=({U}_{1}:{U}_{2})W{V}^{\prime}=({U}_{1}{W}_{1}{V}_{1}^{\prime}:0)$.
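The SVD route to the orthogonal complement can be sketched in NumPy (variable names are ours; `U[:, :r]` spans the same space as $\tilde{\alpha}$, so the projection identity can be checked directly):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 5, 2
alpha = rng.standard_normal((p, r))

# Full SVD: the first r left singular vectors span col(alpha),
# the remaining p - r columns give the orthogonal complement alpha_perp.
U, w, Vt = np.linalg.svd(alpha, full_matrices=True)
a_perp = U[:, r:]

assert np.allclose(a_perp.T @ alpha, 0)               # alpha_perp' alpha = 0
assert np.allclose(a_perp.T @ a_perp, np.eye(p - r))  # orthonormal columns
# I_p = alpha~ alpha~' + alpha_perp alpha_perp' (projection identity)
assert np.allclose(U[:, :r] @ U[:, :r].T + a_perp @ a_perp.T, np.eye(p))
```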

The (thin) QR factorization of $\alpha $ with pivoting is $\alpha P=QR$, with $Q(p\times r)$ orthogonal and R upper triangular. Pivoting reorders the columns of $\alpha $ to better handle poor conditioning and singularity; the reordering is captured in P, as discussed in Golub and Van Loan (2013, §5.4.2).

The QL decomposition of A can be derived from the QR decomposition of $JAJ$: with $JAJ=QR$, we have $A=J(JAJ)J=(JQJ)(JRJ)={Q}_{*}L$, where ${Q}_{*}=JQJ$ is orthogonal and $L=JRJ$ is lower triangular. J is the exchange matrix, which is the identity matrix with columns in reverse order: premultiplication reverses rows; postmultiplication reverses columns; and $JJ=I$.
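A minimal NumPy sketch of this QL construction (our helper name `ql` is not from the paper):

```python
import numpy as np

def ql(a):
    """QL decomposition of a square matrix via the QR decomposition of JAJ.

    J is the exchange matrix (identity with columns reversed); with
    JAJ = QR it follows that A = (JQJ)(JRJ), where JQJ is orthogonal
    and JRJ is lower triangular.
    """
    n = a.shape[0]
    J = np.eye(n)[:, ::-1]
    Q, R = np.linalg.qr(J @ a @ J)
    return J @ Q @ J, J @ R @ J  # orthogonal factor, lower-triangular factor

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Qql, L = ql(A)
assert np.allclose(Qql @ L, A)
assert np.allclose(np.triu(L, 1), 0)            # L is lower triangular
assert np.allclose(Qql.T @ Qql, np.eye(4))      # Qql is orthogonal
```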

Let $\overline{\overline{\alpha}}={\Omega}^{-1}\alpha {\left({\alpha}^{\prime}{\Omega}^{-1}\alpha \right)}^{-1}$, then ${\alpha}_{\perp}^{\prime}\Omega \overline{\overline{\alpha}}=0$.

Finally, $a\leftarrow b$ assigns the value of b to a.

## 2. The I(2) Model

The vector autoregression (VAR) with p dependent variables and $m\ge 1$ lags, for $t=1,...,T$ and with ${y}_{j},j=-m+1,...,0$ fixed and given, can be written in equilibrium correction form without imposing any restrictions. The I(1) cointegrated VAR (CVAR) imposes a reduced rank restriction on ${\Pi}_{y}(p\times p)$: rank ${\Pi}_{y}=r$; see, e.g., Johansen and Juselius (1990), Johansen (1995a).

With $m\ge 2$, the model can be written in second-differenced equilibrium correction form. The I(2) CVAR involves an additional reduced rank restriction, where ${\alpha}_{\perp}^{\prime}\alpha =0.$ The two rank restrictions can be expressed more conveniently in terms of products of matrices with reduced dimensions, where $\alpha $ and ${\beta}_{y}$ are $p\times r$ matrices. The second restriction needs rank s, so $\xi $ and ${\eta}_{y}$ are $(p-r)\times s$ matrices. This requires that the matrices on the right-hand side of (2) and (3) have full column rank. The number of I(2) trends is ${s}_{2}=p-r-s$.
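The two rank restrictions can be checked numerically. The sketch below (NumPy; all names are ours) builds one $\Gamma_y$ consistent with the second restriction and verifies both ranks:

```python
import numpy as np

def perp(a):
    """Orthonormal basis of the orthogonal complement of col(a)."""
    U, _, _ = np.linalg.svd(a, full_matrices=True)
    return U[:, a.shape[1]:]

rng = np.random.default_rng(2)
p, r, s = 5, 2, 1                  # s2 = p - r - s = 2 I(2) trends
alpha, beta_y = rng.standard_normal((p, r)), rng.standard_normal((p, r))
xi, eta_y = rng.standard_normal((p - r, s)), rng.standard_normal((p - r, s))
a_perp, b_perp = perp(alpha), perp(beta_y)

Pi_y = alpha @ beta_y.T            # first reduced rank restriction: rank r
# One Gamma_y consistent with the second restriction
#   alpha_perp' Gamma_y beta_perp = xi eta_y'   (rank s),
# using the orthonormality of a_perp and b_perp:
Gamma_y = alpha @ rng.standard_normal((r, p)) + a_perp @ xi @ eta_y.T @ b_perp.T

assert np.linalg.matrix_rank(Pi_y) == r
assert np.linalg.matrix_rank(a_perp.T @ Gamma_y @ b_perp) == s
```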

The most relevant model in terms of deterministics allows for linearly trending behaviour: $\Phi {x}_{t}^{U}={\mu}_{0}+{\mu}_{1}t$. Using the representation theorem of Johansen (1992) and assuming $\mathsf{E}\left[{y}_{t}\right]=a+bt$ yields a condition that restricts and links ${\mu}_{0}$ and ${\mu}_{1}$; we see that ${\alpha}_{\perp}^{\prime}{\mu}_{1}=0$ and ${\alpha}_{\perp}^{\prime}{\mu}_{0}={\alpha}_{\perp}^{\prime}{\Gamma}_{y}b$.

#### 2.1. The I(2) Model with a Linear Trend

The model (1), subject to the I(1) and I(2) rank restrictions (2) and (3) with $\Phi {x}_{t}^{U}={\mu}_{0}+{\mu}_{1}t$, and subject to (4) and (5), can be written as (6) subject to (7), where $\beta $ is ${p}_{1}\times r$, $\Gamma $ is $p\times {p}_{1}$ and $\eta $ is $({p}_{1}-r)\times s$. In this case, ${p}_{1}=p+1$. Because $\alpha $ is the leading term in (4), we can extend ${\beta}_{y}$ by introducing ${\beta}_{c}^{\prime}=-{\beta}_{y}^{\prime}b$, so ${\beta}^{\prime}=({\beta}_{y}^{\prime}:{\beta}_{c}^{\prime})$. Furthermore, $\Gamma $ has been extended to $\Gamma =({\Gamma}_{y}:{\Gamma}_{c})=({\Gamma}_{y}:-{\mu}_{0})$.

To see that (6) and (7) remain the same I(2) model, consider ${\alpha}_{\perp}^{\prime}{\Gamma}_{c}$ and insert ${I}_{p}={\beta}_{y}{\overline{\beta}}_{y}^{\prime}+{\beta}_{y\perp}{\beta}_{y\perp}^{\prime}$. Using the perpendicular matrix, we see that the rank condition is unaffected.

A more general formulation allows for restricted deterministic and weakly exogenous variables ${x}_{t}^{R}$ and unrestricted variables ${x}_{t}^{U}$, where ${\Delta}^{2}{x}_{t}^{R}$ and its lags are contained in ${x}_{t}^{U}$; this, in turn, is subsumed under ${w}_{3t}={({\Delta}^{2}{y}_{t-1}^{\prime},...,{x}_{t}^{U\prime})}^{\prime}$. The number of variables in ${x}_{t}^{R}$ is ${p}_{1}-p$, so $\Pi $ and $\Gamma $ always have the same dimensions. $\Psi $ is unrestricted, which allows it to be concentrated out by regressing all other variables on ${w}_{3t}$.

To implement likelihood-ratio tests, it is necessary to count the number of restrictions, defining ${s}_{2}^{*}={p}_{1}-r-s$. The restrictions on $\Pi $ follow from the representation. Several representations of the I(2) model have been introduced in the literature to translate the implicit non-linear restriction (3) on $\Gamma $ into an explicit part of the model. These representations reveal the number of restrictions imposed on $\Gamma $, as is shown below.

First, we introduce the new triangular representation.

#### 2.2. The Triangular Representation

**Theorem** **1.** Consider the model with rank restrictions $\Pi =\alpha {\beta}^{\prime}$ and ${\alpha}_{\perp}^{\prime}\Gamma {\beta}_{\perp}=\xi {\eta}^{\prime}$, where $\alpha$ is a $p\times r$ matrix, $\beta$ is ${p}_{1}\times r$, $\xi$ is $(p-r)\times s$ and $\eta$ is $({p}_{1}-r)\times s$. This can be written in the trilinear form (9), where $A,B,{W}_{11},{V}_{22}$ are full rank matrices. A is $p\times p$, and B is ${p}_{1}\times {p}_{1}$; moreover, A, B and the nonzero blocks in W and V are freely varying. A and B are partitioned as $A=({A}_{2}:{A}_{1}:{A}_{0})$ and $B=({B}_{0}:{B}_{1}:{B}_{2})$, where the blocks in A have ${s}_{2},s,r$ columns, respectively; for B, this is $r,s,{s}_{2}^{*}$, with ${p}_{1}=r+s+{s}_{2}^{*}$. W and V are partitioned accordingly.

**Proof.** Write $\tilde{\alpha}=\alpha {\left({\alpha}^{\prime}\alpha \right)}^{-1/2}$, such that ${\tilde{\alpha}}^{\prime}\tilde{\alpha}={I}_{r}$, and construct A and B from such orthonormalized blocks. Then, ${A}^{\prime}A=I$ and ${B}^{\prime}B=I$, and $A(p\times p)$ and $B({p}_{1}\times {p}_{1})$ are full rank by design. Define $V={A}^{\prime}\Gamma B$; its block ${V}_{22}={\left({\xi}^{\prime}\xi \right)}^{\frac{1}{2}}{\left({\eta}^{\prime}\eta \right)}^{\frac{1}{2}}$ is a full rank $s\times s$ matrix. The zero blocks in V arise because, e.g., ${\xi}_{\perp}^{\prime}{\alpha}_{\perp}^{\prime}\Gamma {\beta}_{\perp}={\xi}_{\perp}^{\prime}\xi {\eta}^{\prime}=0$. Trivially, ${W}_{11}={\left({\alpha}^{\prime}\alpha \right)}^{\frac{1}{2}}{\left({\beta}^{\prime}\beta \right)}^{\frac{1}{2}}$ is a full rank $r\times r$ matrix. Both W and V are $p\times {p}_{1}$ matrices. Because A and B are each orthogonal: $\Gamma =A{A}^{\prime}\Gamma B{B}^{\prime}=AV{B}^{\prime}.$

The QR decomposition shows that a full rank square matrix can be written as the product of an orthogonal matrix and a triangular matrix. Therefore, $AV{B}^{\prime}=A{L}_{a}{L}_{a}^{-1}V{L}_{b}{L}_{b}^{-1}{B}^{\prime}={A}_{*}{V}_{*}{B}_{*}^{\prime}$ preserves the structure in ${V}_{*}$ when ${L}_{a},{L}_{b}$ are lower triangular, as well as that in ${W}_{*}$. This shows that (9) holds for any full rank A and B, and the orthogonality can be relaxed.

Therefore, any model with full rank matrices A and B, together with any $W,V$ that have the zeros as described above, satisfies the I(2) rank restrictions. We obtain the same model by restricting A and B to be orthogonal. ☐

When $\Gamma $ is restricted only by the I(2) condition, rank $\Gamma =r+s+min(r,{s}_{2})$. Then, V varies freely, except for the zero blocks, and the I(2) restrictions are imposed through the trilinear form of (9). $\Gamma =0$ implies $V=0$. Another way to have $s=0$ is $\Gamma =(\alpha :0)G$; in that case, $V\ne 0$.

The ${s}_{2}$ restrictions on the intercept (5) can be expressed as ${A}_{2}^{\prime}({\mu}_{0}-{\mu}_{c})=0,$ using ${\mu}_{c}={\Gamma}_{y}{\overline{\beta}}_{y}{\beta}_{c}^{\prime}$, or as ${\mu}_{0}=({A}_{1}:{A}_{0})v+{\mu}_{c},$ for a vector v of length $r+s$.

#### 2.3. Obtaining the Triangular Representation

The triangular representation shows that the I(2) model can be written in trilinear form, where A and B are freely varying, provided W and V have the appropriate structure.

Consider that we are given $\alpha ,\beta ,\Gamma $ of an I(2) CVAR with rank indices $r,s$ and wish to obtain the parameters of the triangular representation. First compute ${\alpha}_{\perp}^{\prime}\Gamma {\beta}_{\perp}=\xi {\eta}^{\prime}$, which can be done with the SVD, assuming rank s. From this, compute A and B. Then, $V={A}^{-1}\Gamma {B}^{-1\prime}$. Because $\Gamma $ satisfies the I(2) rank restriction, V will have the corresponding block-triangular structure.

It may be of interest to consider which part of the structure can be retrieved in the case where rank $(\Pi )=r$, but rank $({\alpha}_{\perp}^{\prime}\Gamma {\beta}_{\perp})=p-r$, while it should be s. This would happen when using I(1) starting values for I(2) estimation. The off anti-diagonal blocks of zeros can be implemented with two sweep operations. The offsetting operations affect ${A}_{1}$ and ${B}_{1}$ only, so $\Pi $ and $\Gamma $ are unchanged. However, we cannot achieve ${V}_{33}=0$ in a similar way, because that would remove the zeros just obtained. The ${V}_{33}$ block has dimension ${s}_{2}\times {s}_{2}^{*}$ and represents the number of restrictions imposed on $\Gamma $ in the I(2) model. Similarly, the anti-diagonal block of zeros in W captures the restrictions on $\Pi $.

Note that the $r\times {s}_{2}^{*}$ block ${V}_{13}$ can be made lower triangular. Write the column partition of V as $({V}_{\cdot 1}:{V}_{\cdot 2}:{V}_{\cdot 3})$, and use ${V}_{13}=LQ$ to replace ${V}_{\cdot 3}$ by ${V}_{\cdot 3}{Q}^{\prime}$ and ${B}_{2}$ by ${B}_{2}{Q}^{\prime}$. When $r<{s}_{2}^{*}$, the rightmost ${s}_{2}^{*}-r$ columns of L will be zero, and the corresponding columns of ${B}_{2}$ are not needed to compute $\Gamma $. This becomes an issue when we propose an estimation procedure in §4.2.1.
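The LQ factorization used here can be obtained as the transpose of a QR decomposition; a small NumPy sketch (our names) also shows the zero columns of L when $r<{s}_{2}^{*}$:

```python
import numpy as np

rng = np.random.default_rng(5)
r, s2s = 2, 3                       # r < s2*, so L has s2* - r zero columns
V13 = rng.standard_normal((r, s2s))

# LQ via QR of the transpose: V13' = Qt R  =>  V13 = R' Qt' = L Q
Qt, R = np.linalg.qr(V13.T, mode="complete")
L, Q = R.T, Qt.T                    # L (r x s2*) lower triangular, Q orthogonal

assert np.allclose(L @ Q, V13)
assert np.allclose(L[:, r:], 0)     # rightmost s2* - r columns of L are zero
assert np.allclose(Q @ Q.T, np.eye(s2s))
```

In the text's notation, one would then replace ${V}_{\cdot 3}$ by ${V}_{\cdot 3}{Q}^{\prime}$ and ${B}_{2}$ by ${B}_{2}{Q}^{\prime}$, leaving $\Gamma $ unchanged.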

#### 2.4. Restoring Orthogonality

Although A and B are freely varying, interpretation may require orthogonality between column blocks. The column blocks of A are in reverse order from B to make V and W block lower triangular. As a consequence, multiplication of V or W from either side by a lower triangular matrix preserves their structure. This allows for the relaxation of the orthogonality of A and B, but also enables us to restore it again.

To restore orthogonality, let $\Gamma ={A}_{*}{V}_{*}{B}_{*}^{\prime}$, where ${A}_{*},{B}_{*}$ are not orthogonal, but with ${V}_{*}$ block-triangular. Now, use the QL decomposition to get ${A}_{*}=AL$, with A orthogonal and L lower triangular. Use the QR decomposition to get ${B}_{*}=BR$, with B orthogonal and R upper triangular. Then, ${A}_{*}{V}_{*}{B}_{*}^{\prime}=AL{V}_{*}{R}^{\prime}{B}^{\prime}=AV{B}^{\prime}$ with the blocks of zeros in V preserved. ${A}_{*}{W}_{*}{B}_{*}^{\prime}$ must be adjusted accordingly. When $\beta $ is restricted, ${B}_{0}$ cannot be modified like this. However, we can still adjust $({A}_{2}:{A}_{1})={A}_{*}$ to get ${A}_{*}^{\prime}{A}_{*}={I}_{p-r}$ and ${A}_{0}^{\prime}{A}_{*}=0$; with similar adjustments to $({B}_{1}:{B}_{2})$.
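This restoration step can be checked numerically. The sketch below (NumPy; random non-orthogonal ${A}_{*},{B}_{*}$ and a conformable block-triangular ${V}_{*}$ of our own making) verifies that the QL/QR adjustment restores orthogonality while preserving the zero blocks:

```python
import numpy as np

def ql(a):
    """QL decomposition via QR of JAJ (J = exchange matrix)."""
    J = np.eye(a.shape[0])[:, ::-1]
    Q, R = np.linalg.qr(J @ a @ J)
    return J @ Q @ J, J @ R @ J

rng = np.random.default_rng(4)
p, p1, r, s = 5, 5, 2, 1
s2 = p - r - s
# Non-orthogonal A*, B* and a V* with the upper-right zero blocks
A_s, B_s = rng.standard_normal((p, p)), rng.standard_normal((p1, p1))
V_s = rng.standard_normal((p, p1))
V_s[:s2, r:] = 0
V_s[s2:s2 + s, r + s:] = 0

Gamma = A_s @ V_s @ B_s.T
A, L = ql(A_s)                 # A orthogonal, L lower triangular
Bq, R = np.linalg.qr(B_s)      # Bq orthogonal, R upper triangular
V = L @ V_s @ R.T              # lower-triangular multiplications keep the zeros

assert np.allclose(A @ V @ Bq.T, Gamma)   # Gamma unchanged
assert np.allclose(V[:s2, r:], 0)         # zero blocks preserved
assert np.allclose(V[s2:s2 + s, r + s:], 0)
```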

The orthogonal version is convenient mathematically, but for estimation, it is preferable to use the unrestricted version. We do not distinguish through notation, but the context will state when the orthogonal version is used.

#### 2.5. Identification in the Triangular Representation

The matrices A and B are not identified without further restrictions. For example, rescaling $\alpha $ and $\beta $ as in $\alpha {W}_{11}{\beta}^{\prime}=\alpha {c}^{-1}c{W}_{11}d{d}^{-1}{\beta}^{\prime}={\alpha}^{*}(c{W}_{11}d){\beta}^{*\prime}$ can be absorbed in V. When $\beta $ is identified, ${W}_{11}$ remains freely varying, and we can, e.g., set $c={W}_{11}^{-1}$. However, it is convenient to transform to ${W}_{11}=I$, so that ${A}_{0}$ and ${B}_{0}$ correspond to $\alpha $ and $\beta $. This sacrifices part of the orthogonality, in the sense that ${A}_{0}^{\prime}{A}_{0}\ne I$ and ${B}_{0}^{\prime}{B}_{0}\ne I$.

The following scheme identifies A and B, under the assumption that ${B}_{0}$ is already identified through prior restrictions.

- 1.
Orthogonalize to obtain ${A}_{0}^{\prime}{A}_{1}=0,{A}_{0}^{\prime}{A}_{2}=0,{A}_{1}^{\prime}{A}_{2}=0$.

- 2.
Choose s full rank rows from ${B}_{1}$, denoted ${M}_{{B}_{1}}$, and set ${B}_{1}\leftarrow {B}_{1}{M}_{{B}_{1}}^{-1}$. Adjust V accordingly.

- 3.
Do the same for ${B}_{2}\leftarrow {B}_{2}{M}_{{B}_{2}}^{-1}$.

- 4.
Set ${A}_{1}\leftarrow {A}_{1}{V}_{22}$, ${V}_{21}\leftarrow {V}_{22}^{-1}{V}_{21}$ and ${V}_{22}\leftarrow I$.

- 5.
Normalize ${A}_{2}\leftarrow {A}_{2}{M}_{{A}_{2}}^{-1}$.

The ordering of columns inside ${A}_{i},{B}_{i}$ is not unique.

## 4. Algorithms for Gaussian Maximum Likelihood Estimation

Algorithms to estimate the Gaussian CVAR usually alternate over sets of variables. In the cointegration literature, these are called switching algorithms, following Johansen and Juselius (1994).

The advantage of switching is that each step is easy to implement, and no derivatives are required. Furthermore, the partitioning circumvents the lack of identification that can occur in these models and which makes it harder to use Newton-type methods. The drawback is that progress is often slow, taking many iterations to converge. Occasionally, this will lead to premature convergence. Although the steps can generally be shown to be in a non-downward direction, this is not enough to show convergence to a stationary point. The work in Doornik (2017) documents the framework for the switching algorithms and also considers acceleration of these algorithms; both results are used here.

Johansen (1997, §8) proposes an algorithm based on the $\tau $-representation, called $\tau $-switching here. This is presented in detail in Appendix B. Two new algorithms are given next, the first based on the $\delta $-representation, the second on the triangular representation. Some formality is required to describe the algorithms in sufficient detail.

#### 4.1. $\delta $-Switching Algorithm

The free parameters in the $\delta $-representation (15) are $(\alpha ,\delta ,\zeta ,\tau )$ with symmetric positive definite $\Omega $. The algorithm alternates between estimating $\tau $ given the remaining parameters and estimating those parameters with $\tau $ fixed. The model for $\tau $ given the other parameters is linear. To estimate $\tau =[\beta :{\beta}_{1}]$, rewrite (15) as (16), where d replaces $\delta {\tau}_{\perp}^{\prime}$, and then vectorize, using $\alpha {\beta}^{\prime}{z}_{2t}=\mathrm{vec}\left({z}_{2t}^{\prime}\beta {\alpha}^{\prime}\right)=(\alpha \otimes {z}_{2t}^{\prime})\mathrm{vec}\beta $. Given $\alpha ,{\zeta}_{1},{\zeta}_{2},\Omega $, we can treat $\beta ,{\beta}_{1}$ and d as free parameters to be estimated by generalized least squares (GLS). This gives a new estimate of $\tau $.

We can treat d as a free parameter in (16). First, when $r\ge {s}_{2}^{*}$, $\delta $ has more parameters than ${\tau}_{\perp}$. Second, when $r<{s}_{2}^{*}$, $\Gamma $ has reduced rank, and ${s}_{2}^{*}-r$ columns of ${\tau}_{\perp}$ are redundant. Orthogonality is recovered in the next step.

Given $\tau $ and the derived ${\tau}_{\perp}$, we can estimate $\alpha $ and $\delta $ by RRR after concentrating out ${\tau}^{\prime}{z}_{1t}$. Introducing $\rho $ with dimension $(r+{s}_{2}^{*})\times r$ allows us to rewrite (15); RRR then provides estimates of ${\alpha}^{*}$ and ${\rho}^{\prime}$. Next, ${\alpha}^{*}{\rho}^{\prime}$ is transformed to $\alpha ({I}_{r}:\delta )$, giving new estimates of $\alpha $ and $\delta $. Finally, $\zeta $ can be obtained by OLS from (17) given $\alpha ,\delta ,\tau $, and hence, $\Omega $.

The RRR step is the same as that used in Dennis and Juselius (2004) and Paruolo (2000b). However, the GLS step for $\tau $ differs from both. We have found that the specification of the GLS step can have a substantial impact on the performance of the algorithm.

For numerical reasons (see, e.g., Golub and Van Loan (2013, Ch.5)), we prefer to use the QR decomposition to implement OLS and RRR estimation, rather than moment matrices. However, iterative estimation involves very many regressions, which would be much faster using precomputed moment matrices. As a compromise, we use precomputed ‘data’ matrices that are transformed by a QR decomposition. This reduces the effective sample size from T to $2{p}_{1}$. The regressions (16) and (17) can then be implemented in terms of the transformed data matrices; see Appendix A.

Usually, starting values of $\alpha $ and $\beta $ are available from I(1) estimation. The initial $\tau $ is then obtained from the marginal equation of the $\tau $-representation, (14a). RRR of ${\alpha}_{\perp}^{\prime}{z}_{0t}$ on ${\beta}_{\perp}^{\prime}{z}_{1t}$, corrected for ${\beta}^{\prime}{z}_{1t}$, gives estimates of $\eta $ and, hence, $\tau $.

$\delta $-switching algorithm:

To start, set $k=1$, and choose starting values ${\alpha}^{\left(0\right)},{\beta}^{\left(0\right)}$, tolerance ${\epsilon}_{1}$ and the maximum number of iterations. Compute ${\tau}_{c}^{\left(0\right)}$ from (18) and ${\alpha}^{\left(0\right)},{\delta}^{\left(0\right)},{\zeta}^{\left(0\right)},{\Omega}^{\left(0\right)}$ from (17). Furthermore, compute ${f}^{\left(0\right)}=-log\left|{\Omega}^{\left(0\right)}\right|$.

- 1.
Get ${\tau}_{c}^{\left(k\right)}$ from (16); get the remaining parameters from (17).

- 2.
Compute ${f}_{c}^{\left(k\right)}=-log\left|{\Omega}_{c}^{\left(k\right)}\right|.$

- 3.
Enter a line search for $\tau $. The change in $\tau $ is $\nabla ={\tau}_{c}^{\left(k\right)}-{\tau}^{(k-1)}$, and the line search finds a step length $\lambda $ with ${\tau}^{\left(k\right)}={\tau}^{(k-1)}+\lambda \nabla $. Because only $\tau $ is varied, a GLS step is needed to evaluate the log-likelihood for each trial $\tau $. The line search gives new parameters with corresponding ${f}^{\left(k\right)}$.

- T.
Compute the relative change from the previous iteration, and terminate if it is below the tolerance ${\epsilon}_{1}$. Otherwise, increment k, and return to Step 1. ☐
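The overall switching structure (alternate exact subproblem solutions, track the criterion, stop on a small relative change) can be illustrated on a toy problem. The sketch below is not the paper's $\delta $-switching: it alternates plain least-squares steps for $\alpha $ and $\beta $ in a reduced-rank regression $Y=Z\beta {\alpha}^{\prime}+E$, using the same vectorization trick and a relative-change termination rule; all names and tolerances are ours:

```python
import numpy as np

rng = np.random.default_rng(7)
T, p, r = 200, 4, 2
alpha0, beta0 = rng.standard_normal((p, r)), rng.standard_normal((p, r))
Z = rng.standard_normal((T, p))
Y = Z @ beta0 @ alpha0.T + 0.1 * rng.standard_normal((T, p))

beta = rng.standard_normal((p, r))       # starting value
eps1, f_prev, f_path = 1e-9, np.inf, []
for k in range(1000):
    # alpha given beta: OLS of Y on Z beta
    X = Z @ beta
    alpha = np.linalg.lstsq(X, Y, rcond=None)[0].T
    # beta given alpha: vec(Z beta alpha') = (alpha kron Z) vec(beta)
    b = np.linalg.lstsq(np.kron(alpha, Z), Y.flatten(order="F"), rcond=None)[0]
    beta = b.reshape(p, r, order="F")
    # criterion and relative-change termination (cf. Step T)
    f = ((Y - Z @ beta @ alpha.T) ** 2).sum()
    f_path.append(f)
    if abs(f_prev - f) <= eps1 * (1 + abs(f)):
        break
    f_prev = f

# Each half-step solves its subproblem exactly, so f never increases
assert all(a >= b - 1e-8 for a, b in zip(f_path, f_path[1:]))
```

The real algorithm replaces these OLS steps with the GLS and RRR steps above, adds the line search of Step 3, and uses $f=-log\left|\Omega \right|$ as the criterion.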

The subscript c indicates that these are candidate values that may be improved upon by the line search. The line search is the concentrated version, i.e., the I(2) equivalent of the LBeta line search documented in Doornik (2017). This means that the function evaluation inside the line search must re-evaluate all of the other parameters as $\tau $ changes. Therefore, within the line search, we effectively concentrate out all other parameters.

Normalization of $\tau $ prevents the scale from growing excessively. It was found to be beneficial to normalize in the first iteration, every hundredth iteration thereafter, and whenever the norm of $\tau $ gets large; continuous normalization had a negative impact in our experiments. Care is required when normalizing: if an iteration uses a different normalization from the previous one, then the line search will only be effective if the previous coefficients are adjusted accordingly.

The algorithm is incomplete without starting values, and a better start obviously leads to faster and more reliable convergence. Experimentation also showed that this and other algorithms struggle more in cases with $s=0$. To improve this, we generate two initial values, run three iterations of the $\tau $-switching algorithm, and then select the best for continuation. The details are in Appendix C.

#### 4.2. MLE with the Triangular Representation

We set ${W}_{11}=I$. This tends to lead to slower convergence, but is required when both $\alpha $ and $\beta $ are restricted. ${V}_{22}$ is kept unrestricted: fewer restrictions seem to lead to faster convergence. All regressions use data matrices that are pre-transformed by an orthogonal matrix, as described in Appendix A. In the next section, we describe the estimation steps that are repeated until convergence.

#### 4.2.1. Estimation Steps

Equation (9) provides a convenient structure for an alternating-variables algorithm. We can solve three separate steps by ordinary or generalized least squares for the case with orthogonal A.

B-step: estimate B, and fix $A,V,W,\Omega $ at ${A}_{c},{V}_{c},{W}_{c},{\Omega}_{c}$. The resulting model, (20), is linear in B. Estimation by GLS can be done conveniently as follows. Start with the Cholesky decomposition ${\Omega}_{c}=P{P}^{\prime}$, and premultiply (20) by ${P}^{-1}$. Next, take the QL decomposition of ${P}^{-1}A$ as ${P}^{-1}A=HL$, with L lower triangular and H orthogonal. Now, premultiply the transformed system by ${H}^{\prime}$, which gives a system with unit variance matrix. Because the structures of W and V are preserved, this approach can also be used in the next step.
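The Cholesky-plus-QL transformation can be checked numerically (a NumPy sketch with our own names; A here stands for the fixed ${A}_{c}$):

```python
import numpy as np

def ql(a):
    """QL decomposition via QR of JAJ (J = exchange matrix)."""
    J = np.eye(a.shape[0])[:, ::-1]
    Q, R = np.linalg.qr(J @ a @ J)
    return J @ Q @ J, J @ R @ J

rng = np.random.default_rng(6)
p = 4
A = rng.standard_normal((p, p))
M = rng.standard_normal((p, p))
Omega = M @ M.T + p * np.eye(p)        # a symmetric positive definite Omega_c
P = np.linalg.cholesky(Omega)          # Omega = P P'
Pinv = np.linalg.inv(P)

H, L = ql(Pinv @ A)                    # P^{-1} A = H L
# After premultiplying the system by H' P^{-1}:
# - the coefficient on A becomes lower triangular,
assert np.allclose(H.T @ Pinv @ A, L)
assert np.allclose(np.triu(L, 1), 0)
# - and errors with variance Omega acquire unit variance.
assert np.allclose(H.T @ Pinv @ Omega @ Pinv.T @ H, np.eye(p))
```

The lower-triangular L multiplying the structured W and V from the left is what preserves their zero blocks.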

V-step: estimate W and V, and fix $A,B,\Omega $. This is a linear model in $(W,V)$, which can be solved by GLS as in the B-step.

A-step: estimate $A,\Omega $, and fix $W,V,B$ at ${W}_{c},{V}_{c},{B}_{c}$. This is the linear regression of ${z}_{0t}$ on ${W}_{c}{B}_{c}^{\prime}{z}_{2t}-{V}_{c}{B}_{c}^{\prime}{z}_{1t}.$

The likelihood will not go down when making one update that consists of the three steps given above, provided V is full rank. If that does not hold, as noted at the end of §2.3, some part of ${B}_{2}$ or ${A}_{2}$ is not identified from the above expressions. To handle this, we make the following adjustments to steps 1 and 3:

- 1a.
B-step: Remove the last ${s}_{2}^{*}-min\{r,{s}_{2}^{*}\}$ columns from B, V and W, as they do not affect the log-likelihood. When iteration is finished, we can add columns of zeros back to W and V and the orthogonal complement of the reduced B to get a rectangular B.

- 3a.
A-step: we wish to keep A invertible and, so, square during iteration. The missing part of ${A}_{2}$ is filled in with the orthogonal complement of the remainder of A after each regression. This requires re-estimation of ${V}_{.1}$ by OLS.

#### 4.2.2. Triangular-Switching Algorithm

The steps described in the previous section form the basis of an alternating variables algorithm:

Triangular-switching algorithm:

To start, set $k=1$, and choose ${\alpha}^{\left(0\right)},{\beta}^{\left(0\right)}$ and the maximum number of iterations. Compute ${A}^{\left(0\right)}$, ${B}^{\left(0\right)}$, ${V}^{\left(0\right)}$, ${W}^{\left(0\right)}$ and ${\Omega}^{\left(0\right)}$.

- 1.1
B-step: obtain ${B}^{\left(k\right)}$ from ${A}^{(k-1)},{V}^{(k-1)},{W}^{(k-1)},{\Omega}^{(k-1)}$.

- 1.2
V-step: obtain ${W}^{\left(k\right)},{V}^{\left(k\right)}$ from ${A}^{(k-1)},{B}^{\left(k\right)},{\Omega}^{(k-1)}$.

- 1.3
A-step: obtain ${A}^{\left(k\right)},{\Omega}^{\left(k\right)}$ from ${B}^{\left(k\right)},{V}^{\left(k\right)},{W}^{\left(k\right)}$.

- 1.4
${V}_{.1}$-step: if necessary, obtain a new ${V}_{.1}^{\left(k\right)}$ from ${A}^{\left(k\right)},{B}^{\left(k\right)},{V}_{.2}^{\left(k\right)},{V}_{.3}^{\left(k\right)},{W}^{\left(k\right)}$.

- 2...
As steps 2,3,T from the $\delta $-switching algorithm. In this case, the line search is over all of the parameters in $A,B,V$. ☐

The starting values are taken as for the $\delta $-switching algorithm; see Appendix C. This means that two iterations of $\delta $-switching are run first, using only the restrictions on $\beta $.

#### 4.3. Linear Restrictions

#### 4.3.1. Delta Switching

Estimation under linear restrictions on $\beta $ or $\tau $ can be done by adjusting the GLS step in §4.1. However, estimation of $\alpha $ is by RRR, which is not so easily adjusted for linear restrictions. Restricting $\delta $ requires replacing the RRR step by regression conditional on $\delta $, which makes the algorithm much slower. Estimation under $\delta =0$, which implies $d=0$, is straightforward.

#### 4.3.2. Triangular Switching

Triangular switching avoids RRR, and restrictions on $\beta ={B}_{0}$ or $\tau =({B}_{0}:{B}_{1})$ can be implemented by adjusting the B-step. In general, we can test restrictions that are linear in the columns of A and B; these are a straightforward extension of the GLS steps described above.

Estimation without multi-cointegration is also feasible. Setting $\delta =0$ corresponds to ${V}_{13}=0$ in the triangular representation. This amounts to removing the last ${s}_{2}^{*}$ columns from $B,V,W$. Boswijk (2010) shows that the test for $\delta =0$ has an asymptotic ${\chi}^{2}\left(r{s}_{2}^{*}\right)$ distribution.

Paruolo and Rahbek (1999) derive conditions for weak exogeneity in (15). They decompose this into three sub-hypotheses: ${\mathcal{H}}_{0}:{b}^{\prime}\alpha =0$, ${\mathcal{H}}_{1}:{b}^{\prime}\left({\alpha}_{\perp}\xi \right)=0$, ${\mathcal{H}}_{2}:{b}^{\prime}{\zeta}_{1}=0$. Taking $b={e}_{p,i}$, where ${e}_{p,i}$ is the i-th column of ${I}_{p}$, these restrictions correspond to a zero right-hand side in a particular equation of the triangular representation. The first is ${e}_{p,i}^{\prime}{A}_{0}=0$, creating a row of zeros in $AW{B}^{\prime}$. The second is ${e}_{p,i}^{\prime}{A}_{1}=0$, which extends the row of zeros. However, A must be full rank, so the final restriction must be imposed on V as ${e}_{p,i}^{\prime}AV=({e}_{p,i}^{\prime}{A}_{2}:0:0)V=0$, expressed as ${e}_{p,i}^{\prime}{A}_{2}{V}_{31}=0$. Paruolo and Rahbek (1999) show that the combined test for a single variable has an asymptotic ${\chi}^{2}(2r+s)$ distribution.

## 5. Comparing Algorithms

We have three algorithms that can be compared:

The $\delta $-switching algorithm, §4.1, which can handle linear restrictions on $\beta $ or $\tau $.

The triangular-switching algorithm proposed in §4.2.2. This can optionally have linear restrictions on the columns of A or B.

The improved $\tau $-switching algorithm, Appendix B, implemented to allow for common restrictions on $\tau $.

These algorithms, as well as two pre-existing ones, have been implemented in Ox 7 (Doornik (2013)).

The comparisons are based on a model for the Danish data (five variables: ${m}_{3}=$ log real money, $y=$ log real GDP, $\Delta p=$ log GDP deflator, and ${r}_{m}$, ${r}_{b}$, two interest rates); see Juselius (2006, §4.1.1). This model has two lags in the VAR, with an unrestricted constant and a restricted trend for the deterministic terms, i.e., specification ${H}_{l}$. The sample period is 1973(3) to 2003(1). First computed is the I(2) rank test table.

Table 3 records the number of iterations used by each of the algorithms; this is closely related to the actual computational time required (but less machine specific). All three algorithms converge rapidly to the same likelihood value. Although $\tau $-switching takes somewhat fewer iterations, it tends to take a bit more time to run than the other two algorithms. The new triangular I(2) switching procedure is largely competitive with the new $\delta $-switching algorithm.

To illustrate the advances made with the new algorithms, we report in Table 4 how the original $\tau $-switching, as well as the CATS2 version of $\delta $-switching, performed. CATS2 (Dennis and Juselius (2004)) is a RATS package for the estimation of I(1) and I(2) models, which uses a somewhat different implementation of an I(2) algorithm that is also called $\delta $-switching. The number of iterations of the CATS2 algorithm is up to 200 times higher than that of the new algorithms, which are therefore much faster, as well as more robust and reliable.

## 6. A More Detailed Comparison

A Monte Carlo experiment is used to show the differences between the algorithms in more detail. The first data generation process is the model for the Danish data, estimated with the I(1) and I(2) restrictions $r,s$ imposed. $M=1000$ random samples are drawn from this, using, for each case, the estimated parameters and estimated residual variance assuming normality. The number of iterations and the progress of the algorithm are recorded for each sample. The maximum number of iterations was set to 10,000, ${\epsilon}_{1}={10}^{-11}$, and all replications are included in the results.

Figure 1 shows the histograms of the number of iterations required to achieve convergence (or the maximum of 10,000). Each graph has the number of iterations (on a log 10 scale) on the horizontal axis and the count (out of 1000 experiments), represented by the bars, on the vertical axis. Ideally, all of the mass is to the left, reflecting very quick convergence. The top row of histograms is for $\delta $-switching, the bottom row for triangular switching. In each histogram, the data generation process (DGP) uses the stated $r,s$ values, and estimation uses the correct values of $r,s$.

The histograms show that triangular switching (bottom row) uses more iterations than $\delta $-switching (top row), in particular when $s=0$. Nonetheless, the experiment using triangular switching runs slightly faster as measured by the total time taken (and $\tau $-switching is the slowest).

An important question is whether the algorithms converge to the same maximum. The function value that is maximized is:

Out of 10,000 experiments, counted over all $r,s$ combinations that we consider, there is only a single experiment with a noticeable difference in $f(\widehat{\theta})$. This happens for $r=3,s=0$, and $\delta $-switching finds a higher function value by almost $0.05$. Because $T=119$, the $0.05$ translates to a difference of three in the log-likelihoods.
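To spell out the arithmetic behind the last sentence (assuming, as the quoted numbers suggest, that the criterion $f$ is the log-likelihood scaled by $2/T$):

```latex
\[
\ell_{\delta} - \ell_{\mathrm{triangular}}
  = \frac{T}{2}\,\bigl(f(\widehat{\theta}_{\delta}) - f(\widehat{\theta}_{\mathrm{triangular}})\bigr)
  \approx \frac{119}{2} \times 0.05 \approx 3.
\]
```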

A second issue of interest is how the algorithms perform when restrictions are imposed. The following restrictions are imposed on the three columns of $\beta $ with $r=3$:

This specification identifies the cointegrating vectors and imposes two over-identifying restrictions. Using the model on the actual Danish data, it is accepted with a p-value of $0.4$ for $r=3,s=0$, while for $r=3,s=1$ the p-value is $0.5$. Simulation is from the estimated restricted model.

In terms of the number of iterations, as Figure 2 shows, $\delta $-switching converges more rapidly in most cases. This makes triangular switching slower, but only by about $10\%$–$20\%$.

Figure 3 shows $f({\widehat{\theta}}_{\delta})-f({\widehat{\theta}}_{\mathrm{triangular}})$, so a positive value means that triangular switching obtained a lower log-likelihood. There are many small differences, mostly to the advantage of $\delta $-switching when $s=1$ (right-hand plot), but to the advantage of triangular switching on the left, when $s=0$. The latter case is also much noisier.

### 6.1. Hybrid Estimation

To increase the robustness of the triangular procedure, we also consider a hybrid procedure, which combines algorithms as follows:

1. standard starting values, as well as twenty randomized starting values, then
2. triangular switching, followed by
3. BFGS optimization (the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method) for a maximum of 200 iterations, followed by
4. triangular switching.

This offers some protection against false convergence, because BFGS is based on first derivatives combined with an approximation to the inverse Hessian.

More importantly, we add a randomized search for better starting values as perturbations of the default starting values. Twenty versions of starting values are created this way, and each is followed for ten iterations. Then, we discard half, merge (almost) identical ones and run another ten iterations. This is repeated until a single one is left.
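The starting-value search described above can be sketched as a knock-out tournament. This is a minimal sketch under stated assumptions, not the paper's implementation: `iterate` is a hypothetical placeholder for a fixed number of triangular-switching iterations, returning updated parameters and their likelihood.

```python
import numpy as np

def tournament(starts, iterate, tol=1e-6):
    """Run a knock-out tournament over candidate starting values.

    starts  : list of parameter vectors (default start plus perturbations)
    iterate : function theta -> (theta_new, loglik), performing a fixed
              number of switching iterations from theta
    Repeatedly: iterate all candidates, keep the better half, merge
    candidates whose likelihoods are (almost) identical, until one is left.
    """
    pool = list(starts)
    while len(pool) > 1:
        results = [iterate(th) for th in pool]
        results.sort(key=lambda r: -r[1])               # best likelihood first
        results = results[: max(1, len(results) // 2)]  # discard half
        merged = []                                     # drop near-duplicates
        for th, ll in results:
            if not any(abs(ll - ll2) < tol for _, ll2 in merged):
                merged.append((th, ll))
        pool = [th for th, _ in merged]
    return pool[0]
```

The surviving candidate is then used as the starting value for the full hybrid run (switching, BFGS, switching).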

Figure 4 shows that this hybrid approach is an improvement: now it is almost never beaten by $\delta $-switching. Of course, the hybrid approach is a bit slower again. The starting-value procedure for $\delta $-switching could be improved in the same way.

## 7. Conclusions

We introduced the triangular representation of the I(2) model and showed how it can be used for estimation. The trilinear form of the triangular representation has the advantage that estimation can be implemented as alternating least squares, without using reduced-rank regression. This structure allows us to impose restrictions on (parts of) the $A$ and $B$ matrices, which gives more flexibility than is available in the $\delta $ and $\tau $ representations.
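To illustrate the alternating-least-squares idea in its simplest form, consider the bilinear analogue (not the full trilinear I(2) form): fitting the reduced-rank regression $Y = AB'X + E$ by cycling between two ordinary-least-squares steps, each with the other factor held fixed. This is a hedged sketch of the technique, not the paper's algorithm.

```python
import numpy as np

def als_reduced_rank(Y, X, r, iters=200, seed=0):
    """Fit Y = A @ B.T @ X + E with rank r by alternating least squares.

    Y : (p, T), X : (q, T). Returns A (p, r) and B (q, r).
    Each step is ordinary least squares with the other factor fixed,
    so no reduced-rank regression (eigenvalue) step is needed.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[0], r))          # arbitrary start
    for _ in range(iters):
        Z = B.T @ X                                   # (r, T) regressors
        A = Y @ Z.T @ np.linalg.inv(Z @ Z.T)          # A-step: OLS of Y on Z
        G = np.linalg.inv(A.T @ A) @ A.T @ Y          # (r, T) projected targets
        B = (G @ X.T @ np.linalg.inv(X @ X.T)).T      # B-step: OLS for B'
    return A, B
```

Restrictions on (parts of) $A$ or $B$ can then be imposed simply by restricting the corresponding least-squares step, which is the flexibility referred to above.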

We also presented an algorithm based on the $\delta $-representation and compared its performance to triangular switching in an application based on Danish data, as well as in a parametric bootstrap using those data. Combined with the acceleration of Doornik (2017), both algorithms are fast and mostly give the same result. This will improve empirical applications of the I(2) model and facilitate recursive estimation and Monte Carlo analysis. Expressions for the computation of t-values of coefficients will be reported in a separate paper.

Because the new algorithms are considerably faster than the previous generation, bootstrapping the I(2) model can now be considered, as Cavaliere et al. (2012) did for the I(1) model.