
# Maximum Likelihood Estimation of the I(2) Model under Linear Restrictions

Department of Economics and Institute for New Economic Thinking at the Oxford Martin School, University of Oxford, Oxford, OX1 3UQ, UK
Econometrics 2017, 5(2), 19; https://doi.org/10.3390/econometrics5020019
Received: 27 February 2017 / Revised: 2 May 2017 / Accepted: 8 May 2017 / Published: 15 May 2017

## Abstract

Estimation of the I(2) cointegrated vector autoregressive (CVAR) model is considered. Without further restrictions, estimation of the I(1) model is by reduced-rank regression (Anderson (1951)). Maximum likelihood estimation of I(2) models, on the other hand, always requires iteration. This paper presents a new triangular representation of the I(2) model. This is the basis for a new estimation procedure of the unrestricted I(2) model, as well as the I(2) model with linear restrictions imposed.

## 1. Introduction

The I(1) model or cointegrated vector autoregression (CVAR) is now well established. The model is developed in a series of papers and books (see, e.g., Johansen (1988), Johansen (1991), Johansen (1995a), Juselius (2006)) and generally available in econometric software. The I(1) model is formulated as a rank reduction of the matrix of ‘long-run’ coefficients. The Gaussian log-likelihood is estimated by reduced-rank regression (RRR; see Anderson (1951), Anderson (2002)).
Determining the cointegrating rank only finds the cointegrating vectors up to a rank-preserving linear transformation. Therefore, the next step of an empirical study usually identifies the cointegrating vectors. This may be followed by imposing over-identifying restrictions. Common restrictions, i.e., the same restrictions on each cointegrating vector, can still be solved by adjusting the RRR estimation; see Johansen and Juselius (1990) and Johansen and Juselius (1992). Estimation with separate linear restrictions on the cointegrating vectors, or more general non-linear restrictions, requires iterative maximization. The usual approach is based on so-called switching algorithms; see Johansen (1995b) and Boswijk and Doornik (2004). The former proposes an algorithm that alternates between cointegrating vectors, estimating one while keeping the others fixed. The latter consider algorithms that alternate between the cointegrating vectors and their loadings: when one is kept fixed, the other is identified. The drawback is that these algorithms can be very slow and occasionally terminate prematurely. Doornik (2017) proposes improvements that can be applied to all switching algorithms.
Johansen (1995c) and Johansen (1997) extend the CVAR to allow for I(2) stochastic trends. These tend to be smoother than I(1) stochastic trends. The I(2) model implies a second reduced rank restriction, but this is now more complicated, and estimation under Gaussian errors can no longer be performed by RRR. The basis of an algorithm for maximum likelihood estimation is presented in Johansen (1997), with an implementation in Dennis and Juselius (2004).
The general approach to handling the I(2) model is to create representations that introduce parameters that vary freely without changing the nature of the model. This facilitates both the statistical analysis and the estimation.
The contributions of the current paper are two-fold. First, we present the triangular representation of the I(2) model. This is a new trilinear formulation with a block-triangular matrix structure at its core. The triangular representation provides a convenient framework for imposing linear restrictions on the model parameters. Next, we introduce several improved estimation algorithms for the I(2) model. A simulation experiment is used to study the behaviour of the algorithms.

#### Notation

Let $\alpha$ ($p \times r$) be a matrix with full column rank $r$, $r \leq p$. The perpendicular matrix $\alpha_{\llcorner}$ ($p \times (p-r)$) has $\alpha_{\llcorner}'\alpha = 0$. The orthogonal complement $\alpha_{\perp}$ has $\alpha_{\perp}'\alpha = 0$ with the additional property that $\alpha_{\perp}'\alpha_{\perp} = I_{p-r}$. Define $\widetilde{\alpha} = \alpha(\alpha'\alpha)^{-1/2}$ and $\overline{\alpha} = \alpha(\alpha'\alpha)^{-1}$. Then, $(\widetilde{\alpha} : \alpha_{\perp})$ is a $p \times p$ orthogonal matrix, so $I_p = \widetilde{\alpha}\widetilde{\alpha}' + \alpha_{\perp}\alpha_{\perp}' = \overline{\alpha}\alpha' + \alpha_{\perp}\alpha_{\perp}' = \alpha\overline{\alpha}' + \alpha_{\perp}\alpha_{\perp}'$.
The (thin) singular value decomposition (SVD) of $\alpha$ is $\alpha = UWV'$, where $U$ ($p \times r$) and $V$ ($r \times r$) are orthogonal: $U'U = V'V = VV' = I_r$, and $W$ is a diagonal matrix with the ordered positive singular values on the diagonal. If rank$(\alpha) = s < r$, then the last $r-s$ singular values are zero. We can find $\alpha_{\perp} = U_2$ from the SVD of the square matrix $(\alpha : 0) = (U_1 : U_2)WV' = (U_1W_1V_1' : 0)$.
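As a hedged illustration (not part of the paper), the SVD-based construction of $\alpha_{\perp}$ takes a few lines in numpy; the full SVD plays the role of the SVD of the padded matrix $(\alpha : 0)$:

```python
import numpy as np

def orth_complement(a):
    """Orthogonal complement a_perp (p x (p-r)) of a (p x r):
    a_perp' a = 0 and a_perp' a_perp = I."""
    p, r = a.shape
    # The last p - r left singular vectors of the full SVD span the
    # orthogonal complement of the column space of a.
    U, _, _ = np.linalg.svd(a, full_matrices=True)
    return U[:, r:]

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 2))
a_perp = orth_complement(a)
print(np.allclose(a_perp.T @ a, 0.0))             # a_perp' a = 0
print(np.allclose(a_perp.T @ a_perp, np.eye(3)))  # a_perp' a_perp = I_{p-r}
```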
The (thin) QR factorization of $\alpha$ with pivoting is $\alpha P = QR$, with $Q$ ($p \times r$) orthogonal and $R$ upper triangular. Pivoting reorders the columns of $\alpha$ to better handle poor conditioning and singularity; it is captured in the permutation matrix $P$, as discussed in Golub and Van Loan (2013, §5.4.2).
The QL decomposition of $A$ can be derived from the QR decomposition of $JAJ$: if $JAJ = QR$, then $A = J(JAJ)J = (JQJ)(JRJ) = Q^*L$, with $Q^* = JQJ$ orthogonal and $L = JRJ$ lower triangular. $J$ is the exchange matrix, which is the identity matrix with its columns in reverse order: premultiplication reverses rows; postmultiplication reverses columns; and $JJ = I$.
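The QL construction via the exchange matrix can be sketched as follows (an illustration under the definitions above, not code from the paper):

```python
import numpy as np

def ql(a):
    """QL decomposition of a square matrix via the QR decomposition of J a J."""
    n = a.shape[0]
    J = np.eye(n)[:, ::-1]           # exchange matrix: identity, columns reversed
    Q, R = np.linalg.qr(J @ a @ J)   # QR of the row- and column-reversed matrix
    return J @ Q @ J, J @ R @ J      # Q* = JQJ orthogonal, L = JRJ lower triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Qs, L = ql(A)
print(np.allclose(Qs @ L, A))             # A = Q* L
print(np.allclose(np.triu(L, 1), 0.0))    # L is lower triangular
print(np.allclose(Qs.T @ Qs, np.eye(4)))  # Q* is orthogonal
```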
Let $\overline{\overline{\alpha}} = \Omega^{-1}\alpha(\alpha'\Omega^{-1}\alpha)^{-1}$; then, $\alpha_{\perp}'\Omega\overline{\overline{\alpha}} = 0$.
Finally, $a ← b$ assigns the value of b to a.

## 2. The I(2) Model

The vector autoregression (VAR) with $p$ dependent variables and $m \geq 1$ lags:
$y_t = A_1y_{t-1} + \ldots + A_my_{t-m} + \Phi x_t^U + \epsilon_t, \qquad \epsilon_t \sim \text{IIN}_p[0_p, \Omega],$
for $t = 1, \ldots, T$, and with $y_j$, $j = -m+1, \ldots, 0$ fixed and given, can be written in equilibrium-correction form as:
$\Delta y_t = \Pi_y y_{t-1} + \Gamma_1\Delta y_{t-1} + \ldots + \Gamma_{m-1}\Delta y_{t-m+1} + \Phi x_t^U + \epsilon_t,$
without imposing any restrictions. The I(1) cointegrated VAR (CVAR) imposes a reduced rank restriction on $\Pi_y$ ($p \times p$): rank$(\Pi_y) = r$; see, e.g., Johansen and Juselius (1990), Johansen (1995a).
With $m \geq 2$, the model can be written in second-differenced equilibrium-correction form as:
$\Delta^2 y_t = \Pi_y y_{t-1} - \Gamma_y\Delta y_{t-1} + \Psi_1\Delta^2 y_{t-1} + \ldots + \Psi_{m-2}\Delta^2 y_{t-m+2} + \Phi x_t^U + \epsilon_t. \qquad (1)$
The I(2) CVAR involves an additional reduced rank restriction:
$\text{rank}(\alpha_{\perp}'\Gamma_y\beta_{y\perp}) = s,$
where $\alpha_{\perp}'\alpha = 0$. The two rank restrictions can be expressed more conveniently in terms of products of matrices with reduced dimensions:
$\Pi_y = \alpha\beta_y', \qquad (2)$
$\alpha_{\perp}'\Gamma_y\beta_{y\perp} = \xi\eta_y', \qquad (3)$
where $\alpha$ and $\beta_y$ are $p \times r$ matrices. The second restriction needs rank $s$, so $\xi$ and $\eta_y$ are $(p-r) \times s$ matrices. This requires that the matrices on the right-hand side of (2) and (3) have full column rank. The number of I(2) trends is $s_2 = p - r - s$.
The most relevant model in terms of deterministics allows for linearly trending behaviour: $\Phi x_t^U = \mu_0 + \mu_1 t$. Using the representation theorem of Johansen (1992) and assuming $\mathsf{E}[y_t] = a + bt$ implies:
$\mu_1 = -\alpha\beta_y'b, \qquad (4)$
$\mu_0 = -\alpha\beta_y'a + \Gamma_y b, \qquad (5)$
which restricts and links $\mu_0$ and $\mu_1$; we see that $\alpha_{\perp}'\mu_1 = 0$ and $\alpha_{\perp}'\mu_0 = \alpha_{\perp}'\Gamma_y b$.

#### 2.1. The I(2) Model with a Linear Trend

The model (1), subject to the I(1) and I(2) rank restrictions (2) and (3), and with $\Phi x_t^U = \mu_0 + \mu_1 t$ satisfying (4) and (5), can be written as:
$\Delta^2 y_t = \alpha\beta'\begin{pmatrix} y_{t-1} \\ t \end{pmatrix} - \Gamma\begin{pmatrix} \Delta y_{t-1} \\ 1 \end{pmatrix} + \Psi_1\Delta^2 y_{t-1} + \ldots + \Psi_{m-2}\Delta^2 y_{t-m+2} + \epsilon_t, \qquad (6)$
subject to:
$\alpha_{\perp}'\Gamma\beta_{\perp} = \xi\eta', \qquad (7)$
where $\beta$ is $p_1 \times r$, $\Gamma$ is $p \times p_1$ and $\eta$ is $(p_1 - r) \times s$. In this case, $p_1 = p + 1$. Because $\alpha$ is the leading term in (4), we can extend $\beta_y$ by introducing $\beta_c' = -\beta_y'b$, so $\beta' = (\beta_y' : \beta_c')$. Furthermore, $\Gamma$ has been extended to $\Gamma = (\Gamma_y : \Gamma_c) = (\Gamma_y : -\mu_0)$.
To see that (6) and (7) remain the same I(2) model, consider $\alpha_{\perp}'\Gamma_c$ and insert $I_p = \overline{\beta}_y\beta_y' + \beta_{y\perp}\beta_{y\perp}'$:
$\alpha_{\perp}'\Gamma_c = -\alpha_{\perp}'\Gamma_y\overline{\beta}_y\beta_y'b - \alpha_{\perp}'\Gamma_y\beta_{y\perp}\beta_{y\perp}'b = \alpha_{\perp}'\Gamma_y\overline{\beta}_y\beta_c' - \xi\eta_y'\beta_{y\perp}'b = \alpha_{\perp}'\Gamma_y\overline{\beta}_y\beta_c' + \xi\eta_c',$
defining $\eta_c' = -\eta_y'\beta_{y\perp}'b$.
Using the perpendicular matrix:
$\beta_{\llcorner} = \begin{pmatrix} \beta_{y\perp} & -\overline{\beta}_y\beta_c' \\ 0 & 1 \end{pmatrix}$
we see that the rank condition is unaffected:
$\alpha_{\perp}'\Gamma\beta_{\llcorner} = \left(\alpha_{\perp}'\Gamma_y\beta_{y\perp} : \alpha_{\perp}'[-\Gamma_y\overline{\beta}_y\beta_c' + \Gamma_c]\right) = \left(\alpha_{\perp}'\Gamma_y\beta_{y\perp} : \xi\eta_c'\right) = \xi\left(\eta_y' : \eta_c'\right).$
A more general formulation allows for restricted deterministic and weakly exogenous variables $x_t^R$ and unrestricted variables $x_t^U$:
$\Delta^2 y_t = \Pi\begin{pmatrix} y_{t-1} \\ x_{t-1}^R \end{pmatrix} - \Gamma\begin{pmatrix} \Delta y_{t-1} \\ \Delta x_t^R \end{pmatrix} + \Psi_1\Delta^2 y_{t-1} + \ldots + \Psi_{m-2}\Delta^2 y_{t-m+2} + \Phi x_t^U + \epsilon_t = \Pi w_{2t} - \Gamma w_{1t} + \Psi w_{3t} + \epsilon_t,$
where $\Delta^2 x_t^R$ and its lags are contained in $x_t^U$; this, in turn, is subsumed under $w_{3t} = (\Delta^2 y_{t-1}', \ldots, x_t^{U\prime})'$. The number of variables in $x_t^R$ is $p_1 - p$, so $\Pi$ and $\Gamma$ always have the same dimensions. $\Psi$ is unrestricted, which allows it to be concentrated out by regressing all other variables on $w_{3t}$:
$z_{0t} = \alpha\beta'z_{2t} - \Gamma z_{1t} + \epsilon_t, \qquad \epsilon_t \sim \text{IIN}_p[0_p, \Omega]. \qquad (8)$
To implement likelihood-ratio tests, it is necessary to count the number of restrictions:
restrictions on $\Pi$: $(p-r)(p_1-r)$ restrictions,
restrictions on $\Gamma$: $(p-r-s)(p_1-r-s) = s_2s_2^*$ restrictions,
defining $s_2^* = p_1 - r - s$. The restrictions on $\Pi$ follow from the representation. Several representations of the I(2) model have been introduced in the literature to translate the implicit non-linear restriction (3) on $\Gamma$ into an explicit part of the model. These representations reveal the number of restrictions imposed on $\Gamma$, as is shown below.
First, we introduce the new triangular representation.

#### 2.2. The Triangular Representation

Theorem 1.
Consider the model:
$z_{0t} = \Pi z_{2t} - \Gamma z_{1t} + \epsilon_t,$
with rank restrictions $\Pi = \alpha\beta'$ and $\alpha_{\perp}'\Gamma\beta_{\perp} = \xi\eta'$, where $\alpha$ is a $p \times r$ matrix, $\beta$ is $p_1 \times r$, $\xi$ is $(p-r) \times s$ and $\eta$ is $(p_1-r) \times s$. This can be written as:
$z_{0t} = AWB'z_{2t} - AVB'z_{1t} + \epsilon_t, \qquad (9)$
where:
$W = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ W_{11} & 0 & 0 \end{pmatrix}, \qquad V = \begin{pmatrix} V_{31} & 0 & 0 \\ V_{21} & V_{22} & 0 \\ V_{11} & V_{12} & V_{13} \end{pmatrix}.$
$A, B, W_{11}, V_{22}$ are full rank matrices. $A$ is $p \times p$, and $B$ is $p_1 \times p_1$; moreover, $A$, $B$ and the nonzero blocks in $W$ and $V$ are freely varying. $A$ and $B$ are partitioned as:
$A = (A_2 : A_1 : A_0), \qquad B = (B_0 : B_1 : B_2),$
where the blocks in $A$ have $s_2, s, r$ columns, respectively; for $B$, this is: $r, s, s_2^*$; $p_1 = r + s + s_2^*$. $W$ and $V$ are partitioned accordingly.
Proof.
Write $\widetilde{\alpha} = \alpha(\alpha'\alpha)^{-1/2}$, such that $\widetilde{\alpha}'\widetilde{\alpha} = I_r$. Construct $A$ and $B$ as:
$A = \left(\alpha_{\perp}\xi_{\perp} : \alpha_{\perp}\widetilde{\xi} : \widetilde{\alpha}\right), \qquad B = \left(\widetilde{\beta} : \beta_{\perp}\widetilde{\eta} : \beta_{\perp}\eta_{\perp}\right).$
Now, $A'A = I$ and $B'B = I$. $A$ ($p \times p$) and $B$ ($p_1 \times p_1$) are full rank by design. Define $V = A'\Gamma B$:
$V = \begin{pmatrix} \xi_{\perp}'\alpha_{\perp}'\Gamma\widetilde{\beta} & \xi_{\perp}'\alpha_{\perp}'\Gamma\beta_{\perp}\widetilde{\eta} & \xi_{\perp}'\alpha_{\perp}'\Gamma\beta_{\perp}\eta_{\perp} \\ \widetilde{\xi}'\alpha_{\perp}'\Gamma\widetilde{\beta} & \widetilde{\xi}'\alpha_{\perp}'\Gamma\beta_{\perp}\widetilde{\eta} & \widetilde{\xi}'\alpha_{\perp}'\Gamma\beta_{\perp}\eta_{\perp} \\ \widetilde{\alpha}'\Gamma\widetilde{\beta} & \widetilde{\alpha}'\Gamma\beta_{\perp}\widetilde{\eta} & \widetilde{\alpha}'\Gamma\beta_{\perp}\eta_{\perp} \end{pmatrix} = \begin{pmatrix} V_{31} & 0 & 0 \\ V_{21} & V_{22} & 0 \\ V_{11} & V_{12} & V_{13} \end{pmatrix}.$
$V_{22} = (\xi'\xi)^{1/2}(\eta'\eta)^{1/2}$ is a full rank $s \times s$ matrix. The zero blocks in $V$ arise because, e.g., $\xi_{\perp}'\alpha_{\perp}'\Gamma\beta_{\perp} = \xi_{\perp}'\xi\eta' = 0$. Trivially:
$\Pi = \alpha\beta' = A\begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ W_{11} & 0 & 0 \end{pmatrix}B' = AWB'.$
$W_{11} = (\alpha'\alpha)^{1/2}(\beta'\beta)^{1/2}$ is a full rank $r \times r$ matrix. Both $W$ and $V$ are $p \times p_1$ matrices. Because $A$ and $B$ are each orthogonal: $\Gamma = AA'\Gamma BB' = AVB'$.
The QR decomposition shows that a full rank square matrix can be written as the product of an orthogonal matrix and a triangular matrix. Therefore, $AVB' = (AL_a)(L_a^{-1}VL_b)(L_b^{-1}B') = A^*V^*B^{*\prime}$ preserves the structure in $V^*$ when $L_a, L_b$ are lower triangular, as well as that in $W^*$. This shows that (9) holds for any full rank $A$ and $B$, and the orthogonality can be relaxed.
Therefore, any model with full rank matrices $A$ and $B$, together with any $W, V$ that have the zeros as described above, satisfies the I(2) rank restrictions. We obtain the same model by restricting $A$ and $B$ to be orthogonal. ☐
When $\Gamma$ is restricted only by the I(2) condition: rank$(\Gamma) = r + s + \min(r, s_2)$. Then, $V$ varies freely, except for the zero blocks, and the I(2) restrictions are imposed through the trilinear form of (9). One way to have $s = 0$ is $\Gamma = 0$, which implies $V = 0$. Another way to have $s = 0$ is $\Gamma = (\alpha : 0)G$; in that case, $V \neq 0$.
The $s_2$ restrictions on the intercept (5) can be expressed as $A_2'(\mu_0 - \mu_c) = 0$, using $\mu_c = \Gamma_y\overline{\beta}_y\beta_c'$, or as $\mu_0 = (A_1 : A_0)v + \mu_c$ for a vector $v$ of length $r + s$.

#### 2.3. Obtaining the Triangular Representation

The triangular representation shows that the I(2) model can be written in trilinear form:
$z_{0t} = AWB'z_{2t} - AVB'z_{1t} + \epsilon_t,$
where $A$ and $B$ are freely varying, provided $W$ and $V$ have the appropriate structure.
Consider that we are given $\alpha, \beta, \Gamma$ of an I(2) CVAR with rank indices $r, s$ and wish to obtain the parameters of the triangular representation. First compute $\alpha_{\perp}'\Gamma\beta_{\perp} = \xi\eta'$, which can be done with the SVD, assuming rank $s$. From this, compute $A$ and $B$:
$A = (A_2 : A_1 : A_0) = \left(\alpha_{\perp}\xi_{\perp} : \alpha_{\perp}\widetilde{\xi} : \alpha\right), \qquad B = (B_0 : B_1 : B_2) = \left(\beta : \beta_{\perp}\widetilde{\eta} : \beta_{\perp}\eta_{\perp}\right).$
Then, $V = A^{-1}\Gamma B^{-1\prime}$. Because $\Gamma$ satisfies the I(2) rank restriction, $V$ will have the corresponding block-triangular structure.
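The steps above can be sketched numerically (a hedged illustration, not the paper's code; $\Gamma$ is constructed artificially so that the I(2) restriction holds exactly):

```python
import numpy as np

def orth_complement(a):
    """Orthonormal complement via the SVD, as in the Notation section."""
    U, _, _ = np.linalg.svd(a, full_matrices=True)
    return U[:, a.shape[1]:]

rng = np.random.default_rng(2)
p = p1 = 6
r, s = 2, 2
s2 = p - r - s

# alpha, beta and a Gamma built to satisfy alpha_perp' Gamma beta_perp = xi eta'.
alpha = rng.standard_normal((p, r))
beta = rng.standard_normal((p1, r))
a_perp, b_perp = orth_complement(alpha), orth_complement(beta)
xi = rng.standard_normal((p - r, s))
eta = rng.standard_normal((p1 - r, s))
Gamma = (alpha @ rng.standard_normal((r, p1))
         + a_perp @ (xi @ eta.T @ b_perp.T
                     + rng.standard_normal((p - r, r)) @ beta.T))

# Rank-s SVD of alpha_perp' Gamma beta_perp recovers xi-tilde and eta-tilde.
U, w, Vt = np.linalg.svd(a_perp.T @ Gamma @ b_perp)
xi_t, xi_perp = U[:, :s], U[:, s:]
eta_t, eta_perp = Vt[:s].T, Vt[s:].T

A = np.hstack([a_perp @ xi_perp, a_perp @ xi_t, alpha])   # (A2 : A1 : A0)
B = np.hstack([beta, b_perp @ eta_t, b_perp @ eta_perp])  # (B0 : B1 : B2)
V = np.linalg.inv(A) @ Gamma @ np.linalg.inv(B).T

print(np.allclose(w[s:], 0.0))                 # reduced rank, as required
print(np.allclose(V[:s2, r:], 0.0))            # zero blocks in the top row
print(np.allclose(V[s2:s2 + s, r + s:], 0.0))  # zero block in the middle row
print(np.allclose(A @ V @ B.T, Gamma))         # Gamma = A V B'
```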
It may be of interest to consider which part of the structure can be retrieved in the case where rank$(\Pi) = r$, but rank$(\alpha_{\perp}'\Gamma\beta_{\perp}) = p - r$, while it should be $s$. This would happen when using I(1) starting values for I(2) estimation. The off anti-diagonal blocks of zeros:
$V^* = \begin{pmatrix} V_{31}^* & V_{32}^* & V_{33}^* \\ V_{21} & V_{22} & V_{23}^* \\ V_{11} & V_{12} & V_{13}^* \end{pmatrix} \rightarrow \begin{pmatrix} V_{31} & 0 & V_{33} \\ V_{21} & V_{22} & 0 \\ V_{11} & V_{12} & V_{13} \end{pmatrix} = V$
can be implemented with two sweep operations:
$\begin{pmatrix} I_{s_2} & -V_{32}^*V_{22}^{-1} & 0 \\ 0 & I_s & 0 \\ 0 & 0 & I_r \end{pmatrix} V^* \begin{pmatrix} I_r & 0 & 0 \\ 0 & I_s & -V_{22}^{-1}V_{23}^* \\ 0 & 0 & I_{s_2^*} \end{pmatrix}.$
The offsetting operations affect $A_1$ and $B_1$ only, so $\Pi$ and $\Gamma$ are unchanged. However, we cannot achieve $V_{33} = 0$ in a similar way, because it would remove the zeros just obtained. The $V_{33}$ block has dimension $s_2 \times s_2^*$, corresponding to the $s_2s_2^*$ restrictions imposed on $\Gamma$ in the I(2) model. Similarly, the anti-diagonal block of zeros in $W$ captures the restrictions on $\Pi$.
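A small numerical check of the two sweep operations (an illustration, not the paper's code): starting from an unstructured $V^*$, the sweeps zero $V_{32}^*$ and $V_{23}^*$, while the offsetting adjustments to $A$ and $B$ leave $\Gamma = AV^*B'$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
p = p1 = 6
r, s = 2, 2
s2 = p - r - s

A = rng.standard_normal((p, p))
B = rng.standard_normal((p1, p1))
Vs = rng.standard_normal((p, p1))       # V* without the zero blocks
Gamma = A @ Vs @ B.T

# Row blocks (s2, s, r) from the top; column blocks (r, s, s2*).
R1, R2 = slice(0, s2), slice(s2, s2 + s)
C2, C3 = slice(r, r + s), slice(r + s, p1)
V22i = np.linalg.inv(Vs[R2, C2])

Sl = np.eye(p)
Sl[R1, R2] = -Vs[R1, C2] @ V22i         # sweeps out V32*
Sr = np.eye(p1)
Sr[C2, C3] = -V22i @ Vs[R2, C3]         # sweeps out V23*

V = Sl @ Vs @ Sr
A_new = A @ np.linalg.inv(Sl)           # offset: only the A1 block changes
B_new = B @ np.linalg.inv(Sr).T         # offset: only the B1 block changes

print(np.allclose(V[R1, C2], 0.0))              # V32 = 0
print(np.allclose(V[R2, C3], 0.0))              # V23 = 0
print(np.allclose(A_new @ V @ B_new.T, Gamma))  # Gamma unchanged
```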
Note that the $r \times s_2^*$ block $V_{13}$ can be made lower triangular. Write the column partition of $V$ as $(V_{\cdot1} : V_{\cdot2} : V_{\cdot3})$, and use the LQ decomposition $V_{13} = LQ$ to replace $V_{\cdot3}$ by $V_{\cdot3}Q'$ and $B_2$ by $B_2Q'$. When $r < s_2^*$, the rightmost $s_2^* - r$ columns of $L$ will be zero, and the corresponding columns of $B_2$ are not needed to compute $\Gamma$. This part can then be omitted from the likelihood evaluation. This issue arises in the estimation procedure proposed in §4.2.1.

#### 2.4. Restoring Orthogonality

Although A and B are freely varying, interpretation may require orthogonality between column blocks. The column blocks of A are in reverse order from B to make V and W block lower triangular. As a consequence, multiplication of V or W from either side by a lower triangular matrix preserves their structure. This allows for the relaxation of the orthogonality of A and B, but also enables us to restore it again.
To restore orthogonality, let $\Gamma = A^*V^*B^{*\prime}$, where $A^*, B^*$ are not orthogonal, but with $V^*$ block-triangular. Now, use the QL decomposition to get $A^* = AL$, with $A$ orthogonal and $L$ lower triangular. Use the QR decomposition to get $B^* = BR$, with $B$ orthogonal and $R$ upper triangular. Then, $A^*V^*B^{*\prime} = ALV^*R'B' = AVB'$ with the blocks of zeros in $V = LV^*R'$ preserved. $A^*W^*B^{*\prime}$ must be adjusted accordingly. When $\beta$ is restricted, $B_0$ cannot be modified like this. However, we can still adjust $A^* = (A_2 : A_1)$ to get $A^{*\prime}A^* = I_{p-r}$ and $A_0'A^* = 0$; with similar adjustments to $(B_1 : B_2)$.
The orthogonal version is convenient mathematically, but for estimation, it is preferable to use the unrestricted version. We do not distinguish through notation, but the context will state when the orthogonal version is used.
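The restoration step can be checked numerically (a sketch; for simplicity $V^*$ is taken elementwise lower triangular, which is a special case of the block structure):

```python
import numpy as np

def ql(a):
    """QL decomposition via the QR decomposition of JaJ (J = exchange matrix)."""
    n = a.shape[0]
    J = np.eye(n)[:, ::-1]
    Q, R = np.linalg.qr(J @ a @ J)
    return J @ Q @ J, J @ R @ J

rng = np.random.default_rng(4)
p = p1 = 5
As = rng.standard_normal((p, p))              # non-orthogonal A*
Bs = rng.standard_normal((p1, p1))            # non-orthogonal B*
Vs = np.tril(rng.standard_normal((p, p1)))    # lower-triangular V*

A, L = ql(As)                 # A* = A L, A orthogonal, L lower triangular
B, R = np.linalg.qr(Bs)       # B* = B R, B orthogonal, R upper triangular
V = L @ Vs @ R.T              # lower x lower x lower stays lower triangular

print(np.allclose(A @ V @ B.T, As @ Vs @ Bs.T))  # A* V* B*' = A V B'
print(np.allclose(np.triu(V, 1), 0.0))           # zeros in V preserved
print(np.allclose(A.T @ A, np.eye(p)))           # A restored to orthogonality
```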

#### 2.5. Identification in the Triangular Representation

The matrices $A$ and $B$ are not identified without further restrictions. For example, rescaling $\alpha$ and $\beta$ as in $\alpha W_{11}\beta' = \alpha c^{-1}\,cW_{11}d\,d^{-1}\beta' = \alpha^*\,cW_{11}d\,\beta^{*\prime}$ can be absorbed in $V$:
$\begin{pmatrix} V_{31}d & 0 & 0 \\ V_{21}d & V_{22} & 0 \\ cV_{11}d & cV_{12} & cV_{13} \end{pmatrix}.$
When $\beta$ is identified, $W_{11}$ remains freely varying, and we can, e.g., set $c = W_{11}^{-1}$. However, it is convenient to transform to $W_{11} = I$, so that $A_0$ and $B_0$ correspond to $\alpha$ and $\beta$. This prevents part of the orthogonality, in the sense that $A_0'A_0 \neq I$ and $B_0'B_0 \neq I$.
The following scheme identifies $A$ and $B$, under the assumption that $B_0$ is already identified through prior restrictions.
• Orthogonalize to obtain $A_0'A_1 = 0$, $A_0'A_2 = 0$, $A_1'A_2 = 0$.
• Choose $s$ rows of $B_1$ that form a full rank matrix $M_{B_1}$, and set $B_1 \leftarrow B_1M_{B_1}^{-1}$. Adjust $V$ accordingly.
• Do the same for $B_2 \leftarrow B_2M_{B_2}^{-1}$.
• Set $A_1 \leftarrow A_1V_{22}$, $V_{21} \leftarrow V_{22}^{-1}V_{21}$ and $V_{22} \leftarrow I$.
• $A_2 \leftarrow A_2M_{A_2}^{-1}$.
The ordering of columns inside $A_i, B_i$ is not unique.

## 3. Relation to Other Representations

Two other formulations of the I(2) model that are in use are the so-called $τ$ and $δ$ representations. All representations implement the same model and make the rank restrictions explicit. However, they differ in their definitions of freely-varying parameters, so may facilitate different forms of analysis, e.g., asymptotic analysis, estimation or the imposition of restrictions. The different parametrizations may also affect economic interpretations.

#### 3.1. $τ$ Representation

Johansen (1997) transforms (8) into the $\tau$-representation:
$z_{0t} = \alpha\varrho'\tau'z_{2t} + \psi'z_{1t} + w\kappa'\tau'z_{1t} + \epsilon_t, \qquad (12)$
where $\tau$ is $p_1 \times (r+s)$ and $\varrho$ ($(r+s) \times r$) is used to recover $\beta$: $\beta = \tau\varrho$. The parameters $(\alpha, \varrho, \tau, \psi, \kappa)$ vary freely. If we normalize on $\varrho' = (I_r : 0)$ and adjust $\kappa, \tau$ accordingly, then $\tau = (\beta : \beta_1)$, and:
$z_{0t} = \alpha\beta'z_{2t} + \psi'z_{1t} + w\kappa'\tau'z_{1t} + \epsilon_t.$
We shall derive the $\tau$-representation. The first step is to define a transformation of $\epsilon_t \sim N[0, \Omega]$:
$\begin{pmatrix} \alpha_{\perp}' \\ \overline{\overline{\alpha}}' \end{pmatrix}\epsilon_t \sim N\left[0, \begin{pmatrix} \alpha_{\perp}'\Omega\alpha_{\perp} & 0 \\ 0 & (\alpha'\Omega^{-1}\alpha)^{-1} \end{pmatrix}\right]. \qquad (13)$
This splits the $p$-variate system into two independent parts. The first has any term with a leading $\alpha$ knocked out, while the second has all leading $\alpha$'s cancelled. The inverse transformation is given by: $(\alpha_{\perp} : \overline{\overline{\alpha}})^{-1} = (\alpha_{\perp} - \alpha\overline{\overline{\alpha}}'\alpha_{\perp} : \alpha)' = (w : \alpha)'$.
The next step is to apply (13) to (8) to create two independent systems and insert $I_{p_1} = \overline{\beta}\beta' + \beta_{\perp}\beta_{\perp}'$ in the 'marginal' equation:
$\alpha_{\perp}'z_{0t} = -\alpha_{\perp}'\Gamma(\overline{\beta}\beta' + \beta_{\perp}\beta_{\perp}')z_{1t} + \varepsilon_{1t} = \kappa'(\beta : \beta_{\perp}\eta)'z_{1t} + \varepsilon_{1t}, \qquad (14a)$
$\overline{\overline{\alpha}}'z_{0t} = \beta'z_{2t} - \overline{\overline{\alpha}}'\Gamma z_{1t} + \varepsilon_{2t} = \beta'z_{2t} + \psi'z_{1t} + \varepsilon_{2t}, \qquad (14b)$
where $\psi' = -\overline{\overline{\alpha}}'\Gamma$ and $\kappa' = -(\alpha_{\perp}'\Gamma\overline{\beta} : \xi)$ are freely varying. Removing the transformation:
$z_{0t} = w\kappa'(\beta : \beta_{\perp}\eta)'z_{1t} + \alpha(\beta'z_{2t} + \psi'z_{1t}) + \epsilon_t$
and introducing the additional parameters $\tau = (\beta : \beta_{\perp}\eta)$ and $\varrho$ completes the $\tau$-representation (12). Table 1 provides definitions of the parameters that are used (cf. Johansen (1997, Tables 1 and 2)).
Corollary 1.
Triangular representation (9) is equivalent to the $\tau$-representation (12) when $A_0'(A_2 : A_1) = 0$.
Proof.
Write $A^* = (A_2 : A_1)$, so $A = (A^* : A_0)$. First, the system (9) is premultiplied by $A^{-1} = \overline{A}'$ and subsequently by a lower triangular matrix $L$ to create two independent subsystems. The matrix $L$ and its inverse are given by:
$L = \begin{pmatrix} I_{p-r} & 0 \\ -\overline{A}_0'w^* & I_r \end{pmatrix}, \qquad L^{-1} = \begin{pmatrix} I_{p-r} & 0 \\ \overline{A}_0'w^* & I_r \end{pmatrix},$
where $w^* = \Omega\overline{A}^*(\overline{A}^{*\prime}\Omega\overline{A}^*)^{-1}$, cf. (14). Because $A_0'A^* = 0$, we have that $A^* + A_0\overline{A}_0'w^* = A^* + (I - A^*\overline{A}^{*\prime})w^* = w^*$, so $AL^{-1} = (w^* : A_0)$. Furthermore: $LW = W$. The identity matrix $L^{-1}L$ can also be inserted directly in (9):
$z_{0t} = A_0W_{11}B_0'z_{2t} - (w^* : A_0)LVB'z_{1t} + \epsilon_t = A_0W_{11}B_0'z_{2t} - w^*\begin{pmatrix} V_{31} & 0 \\ V_{21} & V_{22} \end{pmatrix}(B_0 : B_1)'z_{1t} - A_0\left[(V_{11} : V_{12} : V_{13}) - \overline{A}_0'w^*\begin{pmatrix} V_{31} & 0 & 0 \\ V_{21} & V_{22} & 0 \end{pmatrix}\right]B'z_{1t} + \epsilon_t = \alpha\beta'z_{2t} + \psi'z_{1t} + w\kappa'\tau'z_{1t} + \epsilon_t,$
where $\psi' = -A_0\left[(V_{11} : V_{12} : V_{13}) - \overline{A}_0'w^*\begin{pmatrix} V_{31} & 0 & 0 \\ V_{21} & V_{22} & 0 \end{pmatrix}\right]B'$ and $w\kappa' = -w^*\begin{pmatrix} V_{31} & 0 \\ V_{21} & V_{22} \end{pmatrix}$. ☐

#### 3.2. $δ$ Representation

Paruolo and Rahbek (1999) and Paruolo (2000a) use the $\delta$-representation:
$z_{0t} = \alpha\beta'z_{2t} + \alpha\delta\tau_{\perp}'z_{1t} + \zeta\tau'z_{1t} + \epsilon_t. \qquad (15)$
Here, $(\alpha, \delta, \zeta, \tau = (\beta : \beta_1))$ vary freely. To derive the $\delta$-representation, use $\overline{\tau}\tau' + \tau_{\perp}\tau_{\perp}' = I_{p_1}$:
$-\Gamma\overline{\tau}\tau' = -(\Gamma\overline{\beta} : \Gamma\beta_{\perp}\overline{\eta})\tau' = (\zeta_1 : \zeta_2)\tau' = \zeta\tau',$
$-\Gamma\tau_{\perp}\tau_{\perp}' = -\alpha\overline{\alpha}'\Gamma\tau_{\perp}\tau_{\perp}' - \alpha_{\perp}\alpha_{\perp}'\Gamma\tau_{\perp}\tau_{\perp}' = \alpha\delta\tau_{\perp}',$
and insert in (8). The term with $\alpha_{\perp}'\Gamma\tau_{\perp}$ disappears because $\tau_{\perp} = \beta_{\perp}\eta_{\perp}$, so $\alpha_{\perp}'\Gamma\tau_{\perp} = \xi\eta'\eta_{\perp} = 0$.
When $\beta$ is identified, both $\delta\tau_{\perp}'$ and $\zeta\tau'$ are unique, but not yet $\zeta$ or $\delta$. In the $\tau$-representation, the parameter $\psi$ is also unique with $\varrho$ chosen as $(I : 0)'$ and $\beta$ identified. Table 2 relates the $\tau$, $\delta$ and triangular representations.
Corollary 2.
Triangular representation (9) is equivalent to the $\delta$-representation (15) when $B_2'(B_0 : B_1) = 0$.
Proof.
Write $B_2 = \tau_{\llcorner}$ and $(B_0 : B_1) = \tau$. Using the column partitioning of $V = (V_{\cdot1} : V_{\cdot2} : V_{\cdot3})$: $\Gamma\overline{\tau} = AVB'\overline{\tau} = A(V_{\cdot1} : V_{\cdot2})$. From (9):
$z_{0t} = A_0W_{11}B_0'z_{2t} - A_0V_{13}\tau_{\llcorner}'z_{1t} - A(V_{\cdot1} : V_{\cdot2})\tau'z_{1t} + \epsilon_t = A_0W_{11}B_0'z_{2t} - A_0V_{13}\tau_{\llcorner}'z_{1t} - \Gamma\overline{\tau}\tau'z_{1t} + \epsilon_t = \alpha\beta'z_{2t} + \alpha\delta\tau_{\perp}'z_{1t} + \zeta\tau'z_{1t} + \epsilon_t.$
☐

## 4. Algorithms for Gaussian Maximum Likelihood Estimation

Algorithms to estimate the Gaussian CVAR are usually alternating over sets of variables. In the cointegration literature, these are called switching algorithms, following Johansen and Juselius (1994).
The advantage of switching is that each step is easy to implement, and no derivatives are required. Furthermore, the partitioning circumvents the lack of identification that can occur in these models and which makes it harder to use Newton-type methods. The drawback is that progress is often slow, taking many iterations to converge. Occasionally, this will lead to premature convergence. Although the steps can generally be shown to be in a non-downward direction, this is not enough to show convergence to a stationary point. The work in Doornik (2017) documents the framework for the switching algorithms and also considers acceleration of these algorithms; both results are used here.
Johansen (1997, §8) proposes an algorithm based on the $τ$-representation, called $τ$-switching here. This is presented in detail in Appendix B. Two new algorithms are given next, the first based on the $δ$-representation, the second on the triangular representation. Some formality is required to describe the algorithms with sufficient detail.

#### 4.1. $δ$-Switching Algorithm

The free parameters in the $δ$-representation (15) are $( α , δ , ζ , τ )$ with symmetric positive definite $Ω$. The algorithm alternates between estimating $τ$ given the rest and fixing $τ$. The model for $τ$ given the other parameters is linear:
• To estimate $\tau = (\beta : \beta_1)$, rewrite (15) as:
$z_{0t} = \alpha\beta'z_{2t} + \zeta_1\beta'z_{1t} + \zeta_2\beta_1'z_{1t} + \alpha dz_{1t} + \epsilon_t,$
where $d$ replaces $\delta\tau_{\perp}'$. Then, vectorize, using $\alpha\beta'z_{2t} = \text{vec}(z_{2t}'\beta\alpha') = (\alpha \otimes z_{2t}')\text{vec}\,\beta$:
$z_{0t} = (\alpha \otimes z_{2t}' + \zeta_1 \otimes z_{1t}')\text{vec}\,\beta + (\zeta_2 \otimes z_{1t}')\text{vec}\,\beta_1 + (\alpha \otimes z_{1t}')\text{vec}(d') + \epsilon_t. \qquad (16)$
Given $\alpha, \zeta_1, \zeta_2, \Omega$, we can treat $\beta, \beta_1$ and $d$ as free parameters to be estimated by generalized least squares (GLS). This will give a new estimate of $\tau$.
We can treat $d$ as a free parameter in (16). First, when $r \geq s_2^*$, $\delta$ has more parameters than $\tau_{\perp}$. Second, when $r < s_2^*$, then $\Gamma$ has reduced rank, and $s_2^* - r$ columns of $\tau_{\perp}$ are redundant. Orthogonality is recovered in the next step.
• Given $\tau$ and the derived $\tau_{\perp}$, we can estimate $\alpha$ and $\delta$ by RRR after concentrating out $\tau'z_{1t}$. Introducing $\rho$ with dimension $(r + s_2^*) \times r$ allows us to write (15) as:
$z_{0t} = \alpha^*\rho'\begin{pmatrix} \beta'z_{2t} \\ \tau_{\perp}'z_{1t} \end{pmatrix} + \zeta\tau'z_{1t} + \epsilon_t. \qquad (17)$
RRR provides estimates of $\alpha^*$ and $\rho'$. Next, $\alpha^*\rho'$ is transformed to $\alpha(I_r : \delta)$, giving new estimates of $\alpha$ and $\delta$. Finally, $\zeta$ can be obtained by OLS from (17) given $\alpha, \delta, \tau$, and hence, $\Omega$.
The RRR step is the same as used in Dennis and Juselius (2004) and Paruolo (2000b). However, the GLS step for $τ$ is different from both. We have found that the specification of the GLS step can have a substantial impact on the performance of the algorithm.
For numerical reasons (see, e.g. Golub and Van Loan (2013, Ch.5)), we prefer to use the QR decomposition to implement OLS and RRR estimation rather than moment matrices. However, in iterative estimation, there are very many regressions, which would be much faster using precomputed moment matrices. As a compromise, we use precomputed ‘data’ matrices that are transformed by a QR decomposition. This reduces the effective sample size from T to $2 p 1$. The regressions (16) and (17) can then be implemented in terms of the transformed data matrices; see Appendix A.
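The compromise can be illustrated as follows (a sketch with hypothetical data, not the paper's implementation): a one-off QR decomposition of the stacked data matrix replaces the $T$ observations by a small triangular factor, and least-squares computations on the factor reproduce the full-sample results because the cross-product matrix is unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
T, k = 500, 4
X = rng.standard_normal((T, k))
y = X @ np.array([1.0, -0.5, 2.0, 0.3]) + rng.standard_normal(T)

# One-off QR of the stacked data [X : y]: R has only k + 1 rows.
R = np.linalg.qr(np.hstack([X, y[:, None]]), mode='r')
Xr, yr = R[:, :k], R[:, k]

# [X : y]'[X : y] = R'R, so OLS on the compressed 'data' is identical.
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
b_small = np.linalg.lstsq(Xr, yr, rcond=None)[0]
print(np.allclose(b_full, b_small))

# The residual sum of squares is also preserved.
rss_full = np.sum((y - X @ b_full) ** 2)
rss_small = np.sum((yr - Xr @ b_small) ** 2)
print(np.allclose(rss_full, rss_small))
```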
Usually, starting values of $\alpha$ and $\beta$ are available from I(1) estimation. The initial $\tau$ is then obtained from the marginal equation of the $\tau$-representation, (14a), written as:
$\alpha_{\perp}'z_{0t} = \kappa'\tau'z_{1t} + \varepsilon_{1t} = \kappa_1'\beta'z_{1t} + \xi\eta'\beta_{\perp}'z_{1t} + \varepsilon_{1t}. \qquad (18)$
RRR of $\alpha_{\perp}'z_{0t}$ on $\beta_{\perp}'z_{1t}$, corrected for $\beta'z_{1t}$, gives estimates of $\eta$, and so, $\tau$.
$\delta$-switching algorithm:
To start, set $k = 1$, and choose starting values $\alpha^{(0)}, \beta^{(0)}$, tolerance $\varepsilon_1$ and the maximum number of iterations. Compute $\tau_c^{(0)}$ from (18) and $\alpha^{(0)}, \delta^{(0)}, \zeta^{(0)}, \Omega^{(0)}$ from (17). Furthermore, compute $f^{(0)} = -\log|\Omega^{(0)}|$.
1. Get $\tau_c^{(k)}$ from (16); get the remaining parameters from (17).
2. Compute $f_c^{(k)} = -\log|\Omega_c^{(k)}|$.
3. Enter a line search for $\tau$. The change in $\tau$ is $\nabla = \tau_c^{(k)} - \tau^{(k-1)}$, and the line search finds a step length $\lambda$ with $\tau^{(k)} = \tau^{(k-1)} + \lambda\nabla$. Because only $\tau$ is varied, a GLS step is needed to evaluate the log-likelihood for each trial $\tau$. The line search gives new parameters with corresponding $f^{(k)}$.
T. Compute the relative change from the previous iteration:
$c^{(k)} = \frac{f^{(k)} - f^{(k-1)}}{1 + f^{(k-1)}}.$
Terminate if:
$|c^{(k)}| \leq \varepsilon_1 \quad \text{and} \quad \max_{i,j}\frac{\left|\Pi_{ij}^{(k)} - \Pi_{ij}^{(k-1)}\right|}{1 + \left|\Pi_{ij}^{(k-1)}\right|} \leq \varepsilon_1^{1/2}.$
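In code, the termination rule of step T reads as follows (a sketch mirroring the formulas above; an absolute value is added in the denominator of $c^{(k)}$ as a numerical safeguard):

```python
import numpy as np

def converged(f_k, f_prev, Pi_k, Pi_prev, eps1=1e-10):
    """Step T: relative change in f = -log|Omega| and in the elements of Pi,
    with tolerances eps1 and sqrt(eps1)."""
    c_k = (f_k - f_prev) / (1.0 + abs(f_prev))
    d_k = np.max(np.abs(Pi_k - Pi_prev) / (1.0 + np.abs(Pi_prev)))
    return abs(c_k) <= eps1 and d_k <= np.sqrt(eps1)

Pi = np.array([[0.5, -0.2], [0.1, 0.3]])
print(converged(3.0, 3.0, Pi, Pi))   # no change at all: terminate
print(converged(3.1, 3.0, Pi, Pi))   # likelihood still moving: continue
```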
The subscript $c$ indicates that these are candidate values that may be improved upon by the line search. The line search is the concentrated version, i.e., the I(2) equivalent of the LBeta line search documented in Doornik (2017). This means that the function evaluation inside the line search needs to re-evaluate all of the other parameters as $\tau$ changes. Therefore, within the line search, we effectively concentrate out all other parameters.
Normalization of $\tau$ prevents the scale from growing excessively; it was found to be beneficial to normalize in the first iteration, every hundredth iteration thereafter, or when the norm of $\tau$ gets large. Continuous normalization had a negative impact in our experiments. Care is required when normalizing: if an iteration uses a different normalization from the previous one, then the line search will only be effective if the previous coefficients are adjusted accordingly.
The algorithm is incomplete without starting values, and it is obvious that a better start will lead to faster and more reliable convergence. Experimentation also showed that this and other algorithms struggled more in cases with $s = 0$. To improve this, we generate two initial values, follow each for three iterations of the $\tau$-switching algorithm, then select the best for continuation. The details are in Appendix C.

#### 4.2. MLE with the Triangular Representation

We set $W 11 = I$. This tends to lead to slower convergence, but is required when both $α$ and $β$ are restricted. $V 22$ is kept unrestricted: fewer restrictions seem to lead to faster convergence. All regressions use the data matrices that are pre-transformed by an orthogonal matrix as described in Appendix A. In the next section, we describe the estimation steps that can be repeated until convergence.

#### 4.2.1. Estimation Steps

Equation (9) provides a convenient structure for an alternating variables algorithm. We can solve three separate steps by ordinary or generalized least squares for the case with orthogonal A:
• B-step: estimate $B$, and fix $A, V, W, \Omega$ at $A_c, V_c, W_c, \Omega_c$. The resulting model is linear in $B$:
$z_{0t} = A_cW_cB'z_{2t} - A_cV_cB'z_{1t} + \varepsilon_t. \qquad (20)$
Estimation by GLS can be conveniently done as follows. Start with the Cholesky decomposition $\Omega_c = PP'$, and premultiply (20) by $P^{-1}$. Next, take the QL decomposition of $P^{-1}A_c$ as $P^{-1}A_c = HL$, with $L$ lower triangular and $H$ orthogonal. Now, premultiply the transformed system by $H'$:
$H'P^{-1}z_{0t} = LW_cB'z_{2t} - LV_cB'z_{1t} + u_t = \widetilde{W}_cB'z_{2t} - \widetilde{V}_cB'z_{1t} + u_t,$
in which the error $u_t$ has unit variance matrix. Because the structures of $W$ and $V$ are preserved, this approach can also be used in the next step.
• V-step: estimate $W$ and $V$, and fix $A, B, \Omega$. This is a linear model in $(W, V)$, which can be solved by GLS as in the B-step.
• A-step: estimate $A, \Omega$, and fix $W, V, B$ at $W_c, V_c, B_c$:
$z_{0t} = A(W_cB_c'z_{2t} - V_cB_c'z_{1t}) + \varepsilon_t.$
This is the linear regression of $z_{0t}$ on $W_cB_c'z_{2t} - V_cB_c'z_{1t}$.
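The variance transformation in the B-step can be verified directly (a numerical sketch, not the paper's code): after premultiplying by $H'P^{-1}$, the error variance becomes the identity, and premultiplication by the lower-triangular $L$ preserves the zero pattern of $W_c$.

```python
import numpy as np

def ql(a):
    """QL decomposition via the QR decomposition of JaJ (J = exchange matrix)."""
    n = a.shape[0]
    J = np.eye(n)[:, ::-1]
    Q, R = np.linalg.qr(J @ a @ J)
    return J @ Q @ J, J @ R @ J

rng = np.random.default_rng(6)
p, r = 5, 2
M = rng.standard_normal((p, p))
Omega = M @ M.T                                   # SPD error variance Omega_c
A = np.linalg.qr(rng.standard_normal((p, p)))[0]  # A_c (here orthogonal)

P = np.linalg.cholesky(Omega)                     # Omega_c = P P'
H, L = ql(np.linalg.solve(P, A))                  # P^{-1} A_c = H L

# Transformed error variance: H' P^{-1} Omega P^{-1}' H = I.
Ht = H.T @ np.linalg.inv(P)
print(np.allclose(Ht @ Omega @ Ht.T, np.eye(p)))

# L W keeps the zero pattern of W (W11 in the bottom-left corner).
W = np.zeros((p, p))
W[p - r:, :r] = rng.standard_normal((r, r))
print(np.allclose((L @ W)[:p - r, :], 0.0))
print(np.allclose((L @ W)[:, r:], 0.0))
```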
The likelihood will not go down when making one update that consists of the three steps given above, provided $V$ is full rank. If that does not hold, as noted at the end of §2.3, some part of $B_2$ or $A_2$ is not identified from the above expressions. To handle this, we make the following adjustments to steps 1 and 3:
1a. B-step: remove the last $s_2^* - \min\{r, s_2^*\}$ columns from $B$, $V$ and $W$, as they do not affect the log-likelihood. When iteration is finished, we can add columns of zeros back to $W$ and $V$, and the orthogonal complement of the reduced $B$, to get a rectangular $B$.
3a. A-step: we wish to keep $A$ invertible and, so, square during iteration. The missing part of $A_2$ is filled in with the orthogonal complement of the remainder of $A$ after each regression. This requires re-estimation of $V_{\cdot1}$ by OLS.

#### 4.2.2. Triangular-Switching Algorithm

The steps described in the previous section form the basis of an alternating variables algorithm:
Triangular-switching algorithm:
To start, set $k = 1$, and choose $\alpha^{(0)}, \beta^{(0)}$ and the maximum number of iterations. Compute $A^{(0)}$, $B^{(0)}$, $V^{(0)}$, $W^{(0)}$ and $\Omega^{(0)}$.
1.1. B-step: obtain $B^{(k)}$ from $A^{(k-1)}, V^{(k-1)}, W^{(k-1)}, \Omega^{(k-1)}$.
1.2. V-step: obtain $W^{(k)}, V^{(k)}$ from $A^{(k-1)}, B^{(k)}, \Omega^{(k-1)}$.
1.3. A-step: obtain $A^{(k)}, \Omega^{(k)}$ from $B^{(k)}, V^{(k)}, W^{(k)}$.
1.4. $V_{\cdot1}$-step: if necessary, obtain a new $V_{\cdot1}^{(k)}$ from $A^{(k)}, B^{(k)}, V_{\cdot2}^{(k)}, V_{\cdot3}^{(k)}, W^{(k)}$.
2., 3., T. As steps 2, 3, T from the $\delta$-switching algorithm. In this case, the line search is over all of the parameters in $A, B, V$. ☐
The starting values are taken as for the $\delta$-switching algorithm; see Appendix C. This means that two iterations of $\delta$-switching are taken first, using only restrictions on $\beta$.

#### 4.3.1. Delta Switching

Estimation under linear restrictions on $\beta$ or $\tau$ of the form:
$\beta = (H_1\phi_1 : \ldots : H_r\phi_r) \qquad \text{or} \qquad \tau = (H_1\phi_1 : \ldots : H_{r+s}\phi_{r+s})$
can be done by adjusting the GLS step in §4.1. However, estimation of $α$ is by RRR, which is not so easily adjusted for linear restrictions. Restricting $δ$ requires replacing the RRR step by regression conditional on $δ$, which makes the algorithm much slower. Estimation under $δ = 0$, which implies $d = 0$, is straightforward.

#### 4.3.2. Triangular Switching

Triangular switching avoids RRR, and restrictions on $\beta = B_0$ or $\tau = (B_0 : B_1)$ can be implemented by adjusting the B-step. In general, we can test restrictions of the form:
$B = (H_1\phi_1 : \ldots : H_{p_1}\phi_{p_1}) \quad \text{and} \quad A = (G_1\theta_1 : \ldots : G_p\theta_p).$
Such linear restrictions on the columns of A and B are a straightforward extension of the GLS steps described above.
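As an illustration of such a column-by-column restricted GLS step, the following sketch (hypothetical dimensions, restriction matrices and data; not the I(2) model itself) estimates a coefficient matrix whose $i$-th column is restricted to $H_i\phi_i$, by stacking the system with the vec operator:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
T, q, r = 150, 4, 2                 # hypothetical dimensions

# Hypothetical restriction matrices: column i of B must lie in span(H_i).
H1 = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., 0.]])   # b_1 = H1 phi_1
H2 = np.array([[0.], [0.], [1.], [-1.]])                   # b_2 = H2 phi_2
H = block_diag(H1, H2)              # so that vec(B) = H phi

X = rng.standard_normal((T, q))
B_true = np.column_stack([H1 @ [0.5, 2.0], H2 @ [1.5]])
Omega = np.array([[1.0, 0.3], [0.3, 0.5]])
Y = X @ B_true + rng.multivariate_normal(np.zeros(r), Omega, size=T)

# One GLS step: vec(Y) = (I_r kron X) H phi + vec(E), Cov(vec E) = Omega kron I_T,
# hence phi-hat = [H'(Omega^-1 kron X'X)H]^-1 H'(Omega^-1 kron X') vec(Y).
Oi = np.linalg.inv(Omega)
phi = np.linalg.solve(H.T @ np.kron(Oi, X.T @ X) @ H,
                      H.T @ np.kron(Oi, X.T) @ Y.flatten(order="F"))
B_hat = (H @ phi).reshape((q, r), order="F")
```

In a switching algorithm, $\Omega$ would be the current estimate $\Omega^{(k-1)}$, and a step of this form replaces the unrestricted regression for the restricted columns.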
Estimation without multi-cointegration is also feasible. Setting $\delta = 0$ corresponds to $V_{13} = 0$ in the triangular representation. This amounts to removing the last $s_2^*$ columns from $B, V, W$. Boswijk (2010) shows that the test for $\delta = 0$ has an asymptotic $\chi^2(rs_2^*)$ distribution.
Paruolo and Rahbek (1999) derive conditions for weak exogeneity in (15). They decompose this into three sub-hypotheses: $H_0: b'\alpha = 0$, $H_1: b'(\alpha_\perp\xi) = 0$, $H_2: b'\zeta_1 = 0$. These restrictions, taking $b = e_{p,i}$, where $e_{p,i}$ is the $i$-th column of $I_p$, correspond to a zero right-hand side in a particular equation of the triangular representation. First, $e_{p,i}'A_0 = 0$ creates a row of zeros in $AWB'$. Next, $e_{p,i}'A_1 = 0$ extends the row of zeros. However, $A$ must be full rank, so the final restriction must be imposed on $V$ as $e_{p,i}'AV = (e_{p,i}'A_2 : 0 : 0)V = 0$, expressed as $e_{p,i}'A_2V_{31} = 0$. Paruolo and Rahbek (1999) show that the combined test for a single variable has an asymptotic $\chi^2(2r + s)$ distribution.

## 5. Comparing Algorithms

We have three algorithms that can be compared:
• The $δ$-switching algorithm, §4.1, which can handle linear restrictions on $β$ or $τ$.
• The triangular-switching algorithm proposed in §4.2.2. This can optionally have linear restrictions on the columns of A or B.
• The improved $τ$-switching algorithm, Appendix B, implemented to allow for common restrictions on $τ$.
These algorithms, as well as two pre-existing ones, have been implemented in Ox 7 (Doornik (2013)).
The comparisons are based on a model for the Danish data (five variables: $m3 =$ log real money, $y =$ log real GDP, $\Delta p =$ change in the log GDP deflator, and $r_m$, $r_b$, two interest rates); see Juselius (2006, §4.1.1). This has two lags in the VAR, with an unrestricted constant and restricted trend for the deterministic terms, i.e., specification $H_l$. The sample period is 1973(3) to 2003(1). First computed is the I(2) rank test table.
Table 3 records the number of iterations used by each of the algorithms; this is closely related to the actual computational time required (but less machine specific). All three algorithms converge rapidly to the same likelihood value. Although $τ$ switching takes somewhat fewer iterations, it tends to take a bit more time to run than the other two algorithms. The new triangular I(2) switching procedure is largely competitive with the new $δ$-switching algorithm.
To illustrate the advances made with the new algorithms, we report in Table 4 how the original $\tau$-switching, as well as the CATS2 version of $\delta$-switching, performed. CATS2 (Dennis and Juselius (2004)) is a RATS package for the estimation of I(1) and I(2) models; it uses a somewhat different implementation of an I(2) algorithm that is also called $\delta$-switching. The number of iterations of the CATS2 algorithm is up to 200-times higher than that of the new algorithms, which are therefore much faster, as well as more robust and reliable.

## 6. A More Detailed Comparison

A Monte Carlo experiment is used to show the differences between the algorithms in more detail. The first data generation process is the model for the Danish data, estimated with the I(1) and I(2) restrictions $r, s$ imposed. $M = 1000$ random samples are drawn from this, using, for each case, the estimated parameters and estimated residual variance assuming normality. The number of iterations and the progress of the algorithm are recorded for each sample. The maximum number of iterations was set to 10,000, $\epsilon_1 = 10^{-11}$, and all replications are included in the results.
Figure 1 shows the histograms of the number of iterations required to achieve convergence (or 10,000). Each graph has the number of iterations (on a $\log_{10}$ scale) on the horizontal axis and the count (out of 1000 experiments) represented by the bars on the vertical axis. Ideally, all of the mass is to the left, reflecting very quick convergence. The top row of histograms is for $\delta$-switching, the bottom row for triangular switching. In each histogram, the data generation process (DGP) uses the stated $r, s$ values, and estimation uses the correct values of $r, s$.
The histograms show that triangular switching (bottom row) uses more iterations than $δ$ switching (top row), in particular when $s = 0$. Nonetheless, the experiment using triangular switching runs slightly faster as measured by the total time taken (and $τ$ switching is the slowest).
An important question is whether the algorithms converge to the same maximum. The function value that is maximized is:
$f(\theta) = -\log|\Omega(\theta)|.$
Out of 10,000 experiments, counted over all $r, s$ combinations that we consider, there is only a single experiment with a noticeable difference in $f(\hat\theta)$. This happens for $r = 3, s = 0$, where $\delta$-switching finds a higher function value by almost $0.05$. Because $T = 119$, the $0.05$ translates to a difference of three in the log-likelihoods.
A second issue of interest is how the algorithms perform when restrictions are imposed. The following restrictions are imposed on the three columns of $β$ with $r = 3$:
$\begin{array}{lcccccc} & m3 & y & \Delta p & r_m & r_b & t \\ \beta_1' & a & -a & 0 & 1 & -1 & * \\ \beta_2' & 0 & * & 1 & -a & a & * \\ \beta_3' & 0 & 0 & 1 & * & 0 & * \end{array}$
This specification identifies the cointegrating vectors and imposes two over-identifying restrictions. For $r = 3 , s = 0$ this is accepted with a p-value of $0.4$, while for $r = 3 , s = 1$, the p-value is $0.5$ using the model on the actual Danish data. Simulation is from the estimated restricted model.
In terms of the number of iterations, as Figure 2 shows, $\delta$-switching converges more rapidly in most cases. This makes triangular switching slower, but only by about 10–20%.
Figure 3 shows $f(\hat\theta_\delta) - f(\hat\theta_{\mathrm{triangular}})$, so a positive value means that triangular switching obtained a lower log-likelihood. There are many small differences, mostly to the advantage of $\delta$-switching when $s = 1$ (right-hand plot), but to the advantage of triangular switching on the left, when $s = 0$. The latter case is also much noisier.

#### 6.1. Hybrid Estimation

To increase the robustness of the triangular procedure, we also consider a hybrid procedure, which combines algorithms as follows:
• standard starting values, as well as twenty randomized starting values, then
• triangular switching, followed by
• BFGS optimization (the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method) for a maximum of 200 iterations, followed by
• triangular switching.
This offers some protection against false convergence, because BFGS is based on first derivatives combined with an approximation to the inverse Hessian.
More importantly, we add a randomized search for better starting values as perturbations of the default starting values. Twenty versions of starting values are created this way, and each is followed for ten iterations. Then, we discard half, merge (almost) identical ones and run another ten iterations. This is repeated until a single one is left.
Figure 4 shows that this hybrid approach is an improvement: now, it is almost never beaten by $δ$ switching. Of course, the hybrid approach is a bit slower again. The starting value procedure for $δ$ switching could be improved in the same way.

## 7. Conclusions

We introduced the triangular representation of the I(2) model and showed how it can be used for estimation. The trilinear form of the triangular representation has the advantage that estimation can be implemented as alternating least squares, without using reduced-rank regression. This structure allows us to impose restrictions on (parts of) the A and B matrices, which gives more flexibility than is available in the $δ$ and $τ$ representations.
We also presented an algorithm based on the $δ$-representation and compared the performance to triangular switching in an application based on Danish data, as well as a parametric bootstrap using that data. Combined with the acceleration of Doornik (2017), both algorithms are fast and give mostly the same result. This will improve empirical applications of the I(2) model and facilitate recursive estimation and Monte Carlo analysis. Expressions for the computation of t-values of coefficients will be reported in a separate paper.
Because the new algorithms are considerably faster than the previous generation, bootstrapping the I(2) model can now be considered, as Cavaliere et al. (2012) did for the I(1) model.

## Acknowledgments

I am grateful to Peter Boswijk, Søren Johansen and Katarina Juselius for providing detailed comments as this paper was developing. Their suggestions have greatly improved the results presented here. Financial support from the Robertson Foundation (Award 9907422) and the Institute for New Economic Thinking (Grant 20029822) is gratefully acknowledged. All computations and graphs use OxMetrics and Ox, Doornik (2013).

## Conflicts of Interest

The author declares no conflict of interest.

## Appendix A. Estimation Using the QR Decomposition

The data matrices in the I(2) model (8) are $Z_i' = (z_{i1} : \ldots : z_{iT})$ for $i = 0, 1, 2$.
Take the QR decomposition of $(Z_2 : Z_1)$ as $(Z_2 : Z_1)P = QR = Q_z(R_2 : R_1)$, where $Q$ is a $T \times T$ orthogonal matrix and $R$ a $T \times 2p_1$ upper-triangular matrix, while $Q_z$ consists of the $2p_1$ leading columns of $Q$ and $(R_2 : R_1)$ is the $2p_1 \times 2p_1$ upper-triangular top block of $R$. $P$ is the permutation matrix that captures the column reordering. Then:
$Q_z'(Z_2 : Z_1) = (R_2 : R_1)P' = (X_2 : X_1),$
where $(X_2 : X_1)$ is no longer triangular. Introduce:
$\begin{pmatrix} X_0 \\ X_0^* \end{pmatrix} = Q'Z_0,$
where $X_0 = Q_z'Z_0$ consists of the first $2p_1$ rows, then:
$Z_i'Z_j = Z_i'QQ'Z_j = \begin{cases} X_0'X_0 + X_0^{*\prime}X_0^* & \text{if } i = j = 0, \\ X_i'X_j & \text{otherwise}. \end{cases}$
Now, e.g., a regression of $A'z_{0t}$ on $B'z_{1t}$ for known $A, B$:
$A'z_{0t} = \gamma B'z_{1t} + \epsilon_t, \quad t = 1, \ldots, T,$ (A1)
has:
$\hat\gamma = (B'Z_1'Z_1B)^{-1}B'Z_1'Z_0A = (B'X_1'X_1B)^{-1}B'X_1'X_0A.$
This is a regression of $X_0A$ on $X_1B$. If such regressions need to be done often for the same $Z$'s, it is more efficient to do them in terms of the $X_i$:
$A'x_{0i} = \gamma B'x_{1i} + e_i, \quad i = 1, \ldots, 2p_1,$
with estimated residual variance:
$\hat\Omega_e = T^{-1}\Big(A'X_0^{*\prime}X_0^*A + \sum_{i=1}^{2p_1} \hat e_i\hat e_i'\Big).$
This regression has fewer ‘observations’, while at the same time avoiding the creation of moment matrices. Precomputed moment matrices would be faster, but not as good numerically. For recursive estimation, it is useful to be able to handle singular regressions, because dummy variables can be zero over a subsample; this is handled naturally in the QR approach. This approach needs to be adjusted when (A1) also has $z_{0t}$ on the right-hand side, as happens for $\tau$-switching in (A3).
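A minimal numerical check of this device (hypothetical dimensions, random data): the regression coefficients computed from the compressed $X_i = Q_z'Z_i$ agree with those computed from the original $Z_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p1, p = 120, 3, 4                 # hypothetical dimensions

Z0 = rng.standard_normal((T, p))     # rows are z_0t'
Z1 = rng.standard_normal((T, p1))
Z2 = rng.standard_normal((T, p1))
A = rng.standard_normal((p, 2))      # known A, B as in the regression above
B = rng.standard_normal((p1, 2))

# Thin QR of the stacked regressor data; its Q plays the role of Q_z.
Qz, _ = np.linalg.qr(np.hstack([Z2, Z1]))   # T x 2*p1
X1 = Qz.T @ Z1                              # 2*p1 x p1
X0 = Qz.T @ Z0                              # 2*p1 x p

# gamma-hat from the raw data ...
g_raw = np.linalg.solve(B.T @ Z1.T @ Z1 @ B, B.T @ Z1.T @ Z0 @ A)
# ... equals gamma-hat from the 2*p1 compressed 'observations'.
g_qr = np.linalg.solve(B.T @ X1.T @ X1 @ B, B.T @ X1.T @ X0 @ A)
print(np.allclose(g_raw, g_qr))
```

The equivalence holds because $Z_1$ lies in the column span of $Q_z$, so $Z_1'Q_zQ_z' = Z_1'$; only cross-moments involving $Z_0$ on both sides need the extra $X_0^{*\prime}X_0^*$ term.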

#### Reduced Rank Regression

Let RRR$(Z_0, Z_1 \mid Z_x)$ denote reduced rank regression of $z_{0t}$ on $z_{1t}$ corrected for $z_{xt}$. Assume that $(Z_0, Z_1, Z_x)$ have been transformed into $(X_0, X_1, X_x)$ using the QR decomposition described above. Concentrating $X_x$ out can be done by regression of $(X_0, X_1)$ on $X_x$, with residuals $(Y, X)$. Form $S_{00} = Y'Y + X_0^{*\prime}X_0^*$, and decompose using the Cholesky decomposition: $S_{00} = LL'$.
We need to solve the matrix pencil:
$X'YS_{00}^{-1}Y'Xx = \lambda X'Xx.$
Start by using the QR decomposition $X = QRP'$, $y = P'x$:
$\begin{aligned} R'Q'YL^{-1\prime}L^{-1}Y'QRy &= \lambda R'Ry, \\ R'W'WRy &= \lambda R'Ry, \\ W'Wz &= \lambda z, \\ U\Sigma^2U'z &= \lambda z. \end{aligned}$
The second line introduces $W = L^{-1}Y'Q$; the next line removes $R$, with $z = Ry$; and the final line takes the SVD of $W'$. The eigenvalues are the squared singular values on the diagonal of $\Sigma^2$, and the eigenvectors are $PR^{-1}U$.
When $X$ is singular, as may be the case in recursive estimation, the upper-triangular matrix $R$ will have zero rows at the bottom and zero columns at the end. These are the same on the left and right of the pencil, so they can be dropped. The resulting reduced-dimension $R$ is full rank, and we can set the corresponding rows of the eigenvectors to zero. When the regressors are singular, their corresponding coefficients in $\beta$ will be set to zero, just as in our regressions.
This approach differs somewhat from Doornik and O’Brien (2002) because of the different structure of $S 00$ as a consequence of the prior QR transformation.
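The recipe can be sketched as follows (hypothetical dimensions and data; scipy supplies the pivoted QR and, for verification, a direct generalized eigensolver):

```python
import numpy as np
from scipy.linalg import cholesky, eigh, qr

rng = np.random.default_rng(3)
T, nx, ny = 100, 4, 3                # hypothetical dimensions

X = rng.standard_normal((T, nx))
Y = rng.standard_normal((T, ny))
S00 = Y.T @ Y + 0.5 * np.eye(ny)     # stand-in for Y'Y + X0*'X0*
L = cholesky(S00, lower=True)        # S00 = L L'

# Pivoted QR: X[:, piv] = Q R, i.e. X = Q R P' with P a permutation matrix.
Q, R, piv = qr(X, mode='economic', pivoting=True)
P = np.eye(nx)[:, piv]

W = np.linalg.solve(L, Y.T @ Q)      # W = L^{-1} Y'Q, so W'W = Q'Y S00^{-1} Y'Q
U, s, _ = np.linalg.svd(W.T, full_matrices=False)

evals = s ** 2                       # eigenvalues of the pencil
evecs = P @ np.linalg.solve(R, U)    # eigenvectors P R^{-1} U

# Verify against the pencil X'Y S00^{-1} Y'X x = lambda X'X x directly.
M1 = X.T @ Y @ np.linalg.solve(S00, Y.T @ X)
M2 = X.T @ X
for i in range(ny):
    assert np.allclose(M1 @ evecs[:, i], evals[i] * (M2 @ evecs[:, i]))
assert np.allclose(np.sort(eigh(M1, M2, eigvals_only=True))[::-1][:ny], evals)
```

Since `M1` has rank at most `ny`, the remaining `nx - ny` eigenvalues of the pencil are zero, which is why only the leading `ny` values are compared.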

## Appendix B. Tau-Switching Algorithm

The algorithm of Johansen (1997, §8) is based on the $τ$-representation and involves three stages:
• The estimate of $\tau$ is obtained by GLS given all other parameters except $\psi$. Johansen (1997, p. 451) shows the GLS expressions using second moment matrices. Define the orthogonal matrix $A = (\alpha_\perp : \bar{\bar{\alpha}})$; then, using $\kappa'\tau'z_{1t} = \operatorname{vec}(z_{1t}'\tau\kappa) = (\kappa' \otimes z_{1t}')\operatorname{vec}\tau$:
$A'z_{0t} = \begin{pmatrix} \kappa' \otimes z_{1t}' \\ \varrho' \otimes z_{2t}' \end{pmatrix}\operatorname{vec}\tau + \begin{pmatrix} 0 \\ I_r \otimes z_{1t}' \end{pmatrix}\operatorname{vec}\psi + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} = \left\{\begin{pmatrix} \kappa' \\ 0 \end{pmatrix} \otimes z_{1t}' + \begin{pmatrix} 0 \\ \varrho' \end{pmatrix} \otimes z_{2t}'\right\}\operatorname{vec}\tau + \begin{pmatrix} 0 \\ I_r \otimes z_{1t}' \end{pmatrix}\operatorname{vec}\psi + u_t.$ (A2)
The error term $u_t$ has variance $A'\Omega A$, which is block diagonal. Given $\alpha, \kappa, \varrho, \Omega$, (A2) is linear in $\tau$ and $\psi$. The estimates of the latter are discarded.
• Given just $τ$, reduced-rank regression of $z 0 t$ corrected for $τ ′ z 1 t$ on $z 0 t$ corrected for $z 1 t , τ ′ z 2 t$ is used to estimate $α$. Details are in Johansen (1997, p. 450).
• Given $\tau$ and $\alpha$, the remaining parameters can be obtained by GLS. The equivalence $\bar{\bar{\alpha}}' = \bar{\alpha}' - \bar{\alpha}'w\alpha_\perp'$ is used to write the conditional equation as:
$\bar{\alpha}'z_{0t} = \gamma'\alpha_\perp'z_{0t} + \varrho'\tau'z_{2t} + \psi'z_{1t} + \varepsilon_{2t},$ (A3)
from which $\varrho$ and $\psi$ are estimated by regression. Then, $\kappa$ is estimated from the marginal equation:
$\alpha_\perp'z_{0t} = \kappa'\tau'z_{1t} + \varepsilon_{1t}.$ (A4)
Together, they give $\Omega$ and $w$. We always transform to set $\varrho' = (I : 0)$, adjusting $\kappa$ and $\tau$ accordingly.
$τ$-switching algorithm:
To start, set $k = 1$, and choose starting values $\alpha^{(0)}, \beta^{(0)}$, tolerance $\epsilon_1$ and the maximum number of iterations. Compute $\tau_c^{(0)}$ from (18) and $\kappa^{(0)}, \psi^{(0)}, \Omega^{(0)}$ from (A3) and (A4). Furthermore, compute $f^{(0)} = -\log|\Omega^{(0)}|$.
1. Get $\tau_c^{(k)}$ from (A2). Identify this as follows. Select the non-singular $(r+s) \times (r+s)$ submatrix of $\tau$ with the largest volume, say $M$. We find $M$ by using the first $r+s$ column pivots that are chosen by the QR decomposition of $\tau'$ (Golub and Van Loan (2013, Algorithm 5.4.1)). Set $\tau_c^{(k)} \leftarrow \tau_c^{(k)}M^{-1}$. Get $\alpha_c^{(k)}$ by RRR; finally, get the remaining parameters from (A3) and (A4).
2. As steps 2, 3, T of the $\delta$-switching algorithm. ☐
The line search is only over the $p_1s_2^*$ parameters in $\tau$, as part of it is set to a unit matrix every time. The function evaluation inside the line search needs to obtain all of the other parameters as $\tau$ changes.
This is the algorithm of Johansen (1997) except for the normalization of $τ$ and the line search. The former protects the parameter values from exploding, while the latter improves convergence speed and makes it more robust. Removing $ϱ$ is largely for convenience: it has little impact on convergence. The $τ$-switching algorithm is easily adjusted for common restrictions on $τ$ in the form of $τ = H τ ˜$. However, $ϱ$ gets in the way of more general restrictions.
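The normalization step, selecting a well-conditioned square submatrix of $\tau$ by QR column pivoting, can be sketched as follows (hypothetical dimensions; assuming the pivoting is applied to $\tau'$ so that rows of $\tau$ are selected, and that the selected block $M$ is divided out):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
p1, rs = 6, 3                        # hypothetical: tau is p1 x (r+s)
tau = rng.standard_normal((p1, rs))

# Column pivoting on tau' ranks the rows of tau by a greedy largest-volume
# criterion; the first r+s pivots give a well-conditioned square block M.
_, _, piv = qr(tau.T, pivoting=True)
rows = piv[:rs]
M = tau[rows, :]                     # non-singular (r+s) x (r+s) submatrix

tau_c = tau @ np.linalg.inv(M)       # normalized: identity in the selected rows
```

Dividing out $M$ keeps the iterates bounded, which is the protection against exploding parameter values mentioned above.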

## Appendix C. Starting Values

The first starting value procedure is:
• Set $\alpha^{(-1)}, \beta^{(-1)}$ to their I(1) values (i.e., with full-rank $\Gamma$).
• Get $\tau^{(-1)}$ from (A4), then $\Omega^{(-1)}$ from (A3), ignoring restrictions.
• Take two iterations with the relevant switching algorithm subject to restrictions.
The second starting value procedure is:
• Get $\alpha^{(-2)}, \beta^{(-2)}$ by RRR from the $\tau$-representation using $\kappa = 0$:
$z_{0t} = \alpha(\beta'z_{2t} + \psi'z_{1t}) + \epsilon_t.$
• Get $\kappa^{(-2)}, w^{(-2)}$ from (A3), (A4).
• Get $\alpha^{(-1)}, \beta^{(-1)}$ by RRR from the $\tau$-representation:
$z_{0t} - w\kappa'\beta'z_{1t} = \alpha(\beta'z_{2t} + \psi'z_{1t}) + \epsilon_t.$
• Get $\tau^{(-1)}$ from (A4), then $\Omega^{(-1)}$ from (A3), ignoring restrictions.
• Take two iterations with the relevant switching algorithm subject to restrictions.
Finally, choose the final starting values as those that have the highest function value.

## References

1. Anderson, Theodore W. 1951. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics 22: 327–51, (Erratum in Annals of Statistics 8, 1980). [Google Scholar] [CrossRef]
2. Anderson, Theodore W. 2002. Reduced rank regression in cointegrated models. Journal of Econometrics 106: 203–16. [Google Scholar] [CrossRef]
3. Boswijk, H. Peter. 2010. Mixed Normal Inference on Multicointegration. Econometric Theory 26: 1565–76. [Google Scholar] [CrossRef]
4. Boswijk, H. Peter, and Jurgen A. Doornik. 2004. Identifying, Estimating and Testing Restricted Cointegrated Systems: An Overview. Statistica Neerlandica 58: 440–65. [Google Scholar] [CrossRef]
5. Cavaliere, Giuseppe, Anders Rahbek, and A. M. Robert Taylor. 2012. Bootstrap Determination of the Co-Integration Rank in Vector Autoregressive Models. Econometrica 80: 1721–40. [Google Scholar]
6. Dennis, Jonathan G., and Katarina Juselius. 2004. CATS in RATS: Cointegration Analysis of Time Series Version 2. Technical Report. Evanston: Estima. [Google Scholar]
7. Doornik, Jurgen A. 2013. Object-Oriented Matrix Programming using Ox, 7th ed. London: Timberlake Consultants Press. [Google Scholar]
8. Doornik, Jurgen A. 2017. Accelerated Estimation of Switching Algorithms: The Cointegrated VAR Model and Other Applications. Oxford: Department of Economics, University of Oxford. [Google Scholar]
9. Doornik, Jurgen A., and R. J. O’Brien. 2002. Numerically Stable Cointegration Analysis. Computational Statistics & Data Analysis 41: 185–93. [Google Scholar] [CrossRef]
10. Golub, Gene H., and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Baltimore: The Johns Hopkins University Press. [Google Scholar]
11. Johansen, Søren. 1988. Statistical Analysis of Cointegration Vectors. Journal of Economic Dynamics and Control 12: 231–54, Reprinted in R. F. Engle, and C. W. J. Granger, eds. 1991. Long-Run Economic Relationships. Oxford: Oxford University Press, pp. 131–52. [Google Scholar] [CrossRef]
12. Johansen, Søren. 1991. Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59: 1551–80. [Google Scholar] [CrossRef]
13. Johansen, Søren. 1992. A Representation of Vector Autoregressive Processes Integrated of Order 2. Econometric Theory 8: 188–202. [Google Scholar] [CrossRef]
14. Johansen, Søren. 1995a. Likelihood-based Inference in Cointegrated Vector Autoregressive Models. Oxford: Oxford University Press. [Google Scholar]
15. Johansen, Søren. 1995b. Identifying Restrictions of Linear Equations with Applications to Simultaneous Equations and Cointegration. Journal of Econometrics 69: 111–32. [Google Scholar] [CrossRef]
16. Johansen, Søren. 1995c. A Statistical Analysis of Cointegration for I(2) Variables. Econometric Theory 11: 25–59. [Google Scholar] [CrossRef]
17. Johansen, Søren. 1997. Likelihood Analysis of the I(2) Model. Scandinavian Journal of Statistics 24: 433–62. [Google Scholar] [CrossRef]
18. Johansen, Søren, and Katarina Juselius. 1990. Maximum Likelihood Estimation and Inference on Cointegration—With Application to the Demand for Money. Oxford Bulletin of Economics and Statistics 52: 169–210. [Google Scholar] [CrossRef]
19. Johansen, Søren, and Katarina Juselius. 1992. Testing Structural Hypotheses in a Multivariate Cointegration Analysis of the PPP and the UIP for UK. Journal of Econometrics 53: 211–44. [Google Scholar] [CrossRef]
20. Johansen, Søren, and Katarina Juselius. 1994. Identification of the Long-run and the Short-run Structure. An Application to the ISLM Model. Journal of Econometrics 63: 7–36. [Google Scholar] [CrossRef]
21. Juselius, Katarina. 2006. The Cointegrated VAR Model: Methodology and Applications. Oxford: Oxford University Press. [Google Scholar]
22. Paruolo, Paolo. 2000a. Asymptotic Efficiency of the Two Stage Estimator in I(2) Systems. Econometric Theory 16: 524–50. [Google Scholar] [CrossRef]
23. Paruolo, Paolo. 2000b. On likelihood-maximizing algorithms for I(2) VAR models. Mimeo. Varese: Università dell’Insubria. [Google Scholar]
24. Paruolo, Paolo, and Anders Rahbek. 1999. Weak exogeneity in I(2) VAR Systems. Journal of Econometrics 93: 281–308. [Google Scholar] [CrossRef]
Figure 1. Comparison of algorithms: $\delta$-switching (top row) and triangular-switching (bottom row). Simulating a range of $r, s$. Number of iterations on the horizontal axis, count (out of 1000) on the vertical.
Figure 2. Comparison of algorithms: $\delta$-switching (left two) and triangular-switching (right two). Simulating a range of $r, s$. Number of iterations on the horizontal axis, count (out of 1000) on the vertical.
Figure 3. $\delta$-switching function value minus the triangular-switching function value (vertical axis) for each replication (horizontal axis). Both starting from their default starting values. The labels are the cointegration indices $(r, s, s_2)$.
Figure 4. $\delta$-switching function value minus the hybrid triangular-switching function value (vertical axis) for each replication (horizontal axis).
Table 1. Definitions of the symbols used in the $\tau$ and $\delta$ representations of the I(2) model.
| Symbol | Definition | Dimension |
|---|---|---|
| $\tau$ | $(\beta : \beta_\perp\eta)$ when $\varrho' = (I : 0)$ | $p_1 \times (r+s)$ |
| $\tau_\perp$ | $\beta_\perp\eta_\perp$ | $p_1 \times s_2^*$ |
| $\psi$ | $-(\bar{\bar{\alpha}}'\Gamma)'$ | $p_1 \times r$ |
| $\kappa'$ | $-\alpha_\perp'\Gamma\bar\tau = -(\alpha_\perp'\Gamma\bar\beta : \xi) = (\kappa_1 : \kappa_2)'$ | $(p-r) \times (r+s)$ |
| $\delta$ | $-\bar\alpha'\Gamma\tau_\perp$ | $r \times s_2^*$ |
| $\zeta$ | $-\Gamma\bar\tau = (\zeta_1 : \zeta_2)$ | $p \times (r+s)$ |
| $w$ | $\alpha_\perp - \alpha\bar{\bar{\alpha}}'\alpha_\perp = \Omega\alpha_\perp(\alpha_\perp'\Omega\alpha_\perp)^{-1} = \bar{\bar{\alpha_\perp}}$ | $p \times (p-r)$ |
| $d$ | $\tau_\perp\delta'$ | $p_1 \times r$ |
| $e$ | $\tau\zeta'$ | $p_1 \times p$ |
Table 2. Links between symbols used in the representations of the I(2) model, assuming $W_{11} = I_r$ and $a_\perp'a_\perp = I$.
$-\Gamma = \alpha\psi' + w\kappa'\tau' = \alpha\delta\tau_\perp' + \zeta\tau' = \alpha d' + e'$
| | |
|---|---|
| $\zeta = \alpha\psi'\bar\tau + w\kappa'$ | (from $\Gamma\bar\tau$) |
| $d' = \psi'\tau_\perp\tau_\perp'$ | (from $\Gamma\tau_\perp$) |
| $\kappa' = \alpha_\perp'\zeta$ | (from $\alpha_\perp'\Gamma$) |
| $\psi' = d' + \bar{\bar{\alpha}}'\zeta\tau'$ | (from $\bar{\bar{\alpha}}'\Gamma$) |
| $\alpha = A_0$ | $\beta = B_0$ |
| $d' = -V_{13}B_2'$ | $e' = -A(V_{.1} : V_{.2})(B_0 : B_1)'$ |
| $\tau = (B_0 : B_1)$ | |
Table 3. Estimation of all I(2) models by $\tau$, $\delta$ and triangular switching; all using the same starting value procedure. Number of iterations to convergence for $\epsilon_1 = 10^{-14}$.
| | $\tau$ switching | | | | $\delta$ switching | | | | Triangular switching | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $r \backslash s_2$ | 4 | 3 | 2 | 1 | 4 | 3 | 2 | 1 | 4 | 3 | 2 | 1 |
| 1 | 19 | 25 | 36 | 34 | 15 | 24 | 37 | 30 | 31 | 31 | 39 | 32 |
| 2 | | 18 | 32 | 25 | | 18 | 32 | 34 | | 22 | 27 | 50 |
| 3 | | | 37 | 23 | | | 42 | 38 | | | 50 | 59 |
| 4 | | | | 29 | | | | 28 | | | | 85 |
Table 4. Estimation of all I(2) models by old versions of $\tau$, $\delta$ switching. Number of iterations to convergence for $\epsilon_1 = 10^{-14}$.
$r \backslash s_2$, with columns $s_2 = 4, 3, 2, 1$ for old $\tau$ switching, then $s_2 = 4, 3, 2, 1$ for CATS2 switching:
11261983382015229832985165371
2 79211229 7234709861
3 483237 550432
4 4851 5771