Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method

Bauer, Dietmar; Buschmeier, Rainer

doi:10.3390/e23040436

Open AccessArticle

Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method

by

Dietmar Bauer

^*

and

Rainer Buschmeier

Department of Business Administration and Economics, Bielefeld University, Universitaetsstrasse 25, 33615 Bielefeld, Germany

^*

Author to whom correspondence should be addressed.

Entropy 2021, 23(4), 436; https://doi.org/10.3390/e23040436

Submission received: 19 February 2021 / Revised: 26 March 2021 / Accepted: 31 March 2021 / Published: 8 April 2021

(This article belongs to the Special Issue Time Series Modelling)

Download

Browse Figures

Versions Notes

Abstract

This paper investigates the asymptotic properties of estimators obtained from the so called CVA (canonical variate analysis) subspace algorithm proposed by Larimore (1983) in the case when the data is generated using a minimal state space system containing unit roots at the seasonal frequencies such that the yearly difference is a stationary vector autoregressive moving average (VARMA) process. The empirically most important special cases of such data generating processes are the I(1) case as well as the case of seasonally integrated quarterly or monthly data. However, increasingly also datasets with a higher sampling rate such as hourly, daily or weekly observations are available, for example for electricity consumption. In these cases the vector error correction representation (VECM) of the vector autoregressive (VAR) model is not very helpful as it demands the parameterization of one matrix per seasonal unit root. Even for weekly series this amounts to 52 matrices using yearly periodicity, for hourly data this is prohibitive. For such processes estimation using quasi-maximum likelihood maximization is extremely hard since the Gaussian likelihood typically has many local maxima while the parameter space often is high-dimensional. Additionally estimating a large number of models to test hypotheses on the cointegrating rank at the various unit roots becomes practically impossible for weekly data, for example. This paper shows that in this setting CVA provides consistent estimators of the transfer function generating the data, making it a valuable initial estimator for subsequent quasi-likelihood maximization. Furthermore, the paper proposes new tests for the cointegrating rank at the seasonal frequencies, which are easy to compute and numerically robust, making the method suitable for automatic modeling. A simulation study demonstrates by example that for processes of moderate to large dimension the new tests may outperform traditional tests based on long VAR approximations in sample sizes typically found in quarterly macroeconomic data. Further simulations show that the unit root tests are robust with respect to different distributions for the innovations as well as with respect to GARCH-type conditional heteroskedasticity. Moreover, an application to Kaggle data on hourly electricity consumption by different American providers demonstrates the usefulness of the method for applications. Therefore the CVA algorithm provides a very useful initial guess for subsequent quasi maximum likelihood estimation and also delivers relevant information on the cointegrating ranks at the different unit root frequencies. It is thus a useful tool for example in (but not limited to) automatic modeling applications where a large number of time series involving a substantial number of variables need to be modelled in parallel.

Keywords:

cointegration; subspace algorithms; VARMA models; seasonality

JEL Classification:

C13; C32

1. Introduction

Many time series show seasonal patterns that, according to [1] for example, cannot be modeled appropriately using seasonal dummies because they exhibit a slowly trending behavior typical for unit root processes.

To model such processes in the vector autoregressive (VAR) framework, Ref. [2] (abbreviated as JS in the following) extend the error correction representation for seasonally integrated autoregressive processes pioneered by [3] to the multivariate case. This vector error correction formulation (VECM) models the yearly differences of a process observed S times per year. The model includes systems having unit roots at some or all of the possible locations

z_{j} = exp (\frac{2 π j}{S} i), j = 0, . . ., S - 1

of seasonal unit roots. In JS all unit roots are assumed to be simple such that the process of yearly differences is stationary.

In this setting JS propose an estimator for the autoregressive polynomial subject to restrictions on its rank (the so-called cointegrating rank) at the unit roots

z_{j}

based on an iterative scheme focusing on a pair of complex-conjugated unit roots (or the unit roots

z_{j} = 1

or

z_{j} = - 1

respectively) at a time. The main idea here is the reformulation of the model using the so called vector error correction representation. Beside estimators JS also derived likelihood ratio tests for the cointegrating rank at the various unit roots.

Refs. [4,5] propose simpler estimation schemes based on complex reduced rank regression (cRRR in the following). They also show that their numerically simpler algorithm leads to test statistics for the cointegrating rank that are asymptotically equivalent to the quasi maximum likelihood tests of JS. These schemes still typically alternate between cRRR problems corresponding to different unit roots until convergence, although a one step version estimating only once at each unit root exists. Ref. [6] provides updating equations for quasi maximum likelihood estimation in situations where constraints on the parameters prohibit focusing on one unit root at a time.

The leading case here is that of quarterly data (

S = 4

) where potential unit roots are located at

\pm 1

and

\pm i

, implying that the VECM representation contains four potentially rank restricted matrices. However, increasingly time series of much higher sampling frequency such as hourly, daily or weekly observations are available. In such cases it is unrealistic that all unit roots are present. If a unit root is not present, the corresponding matrix in the VECM is of full rank. Therefore in situations with only a few unit roots being present, the VECM requires a large number of parameters to be estimated. Also in cases with a long period length (such as for example hourly data with yearly cycles) usage of the VECM involves the estimation of all coefficient matrices for lags for at least one year.

In general, for processes of moderate to large dimension the VAR framework involves estimation of a large number of parameters which potentially can be avoided by using the more flexible vector autoregressive moving average (VARMA) or the—in a sense—equivalent state space framework. This setting has been used in empirical research for the modeling of electricity markets, see the survey [7] for a long list of contributions. In particular, ref. [8] use the model described below without formal verification of the asymptotic theory for the quasi maximum likelihood estimation.

Recently, ref. [9] show that in the setting of dynamic factor models, typically used for observation processes of high dimension, the common assumption that the factors are generated using a vector autoregression jointly with the assumption that the idiosyncratic component is white noise (or more generally generated using a VAR or VARMA model independent of the factors) leads to a VARMA process. Also a number of papers (see for example [10,11,12]) show that in their empirical application the usage of VARMA models instead of approximations using the VAR model leads to superior prediction performance. This, jointly with the fact that the linearization of dynamic stochastic general equilibrium models (DSGE) leads to state space models, see e.g., [13], has fuelled recent interest in VARMA—and thus state space—modeling in particular in macroeconomics, see for example [14].

In this respect, quasi maximum likelihood estimation is the most often used approach for inference. Due to the typically highly non-convex nature of the quasi likelihood function (using the Gaussian density) in the VARMA setting, the criterion function shows many local maxima where the optimization can easily get stuck. Randomization alone does not solve the problem efficiently, as typically the parameter space is high-dimensional causing problems of the curse of dimensionality type.

Moreover, VARMA modeling requires a full specification of the state space unit root structure of the process, see [15]. The state space unit root structure specifies the number of common trends at each seasonal frequency (see below for definitions). For data of weekly or higher sampling frequency it is unlikely that the state space unit root structure is known prior to estimation. Testing all possible combinations is numerically infeasible in many cases.

As an attractive alternative in this respect the class of subspace algorithms is investigated in this paper. One particular member of this class, the so called canonical variate analysis (CVA) introduced by [16] (in the literature the algorithm is often called canonical correlation analysis; CCA), has been shown to provide system estimators which (under the assumption of known system order) are asymptotically equivalent to quasi maximum likelihood estimation (using the Gaussian likelihood) in the stationary case [17]. CVA shares a number of robustness properties in the stationary case with VAR estimators: [18] shows that CVA produces consistent estimators of the underlying transfer function in situations where the innovations are conditionally heteroskedastic processes of considerable generality. Ref. [19] shows that CVA provides consistent estimators of the transfer function even for stationary fractionally integrated processes, if the order of the system tends to infinity as a function of the sample size at a sufficient rate.

In the I(1) case [20] introduce a heuristic adaptation of the algorithm using the assumption of known cointegrating rank in order to show consistency for the corresponding transfer function estimators. However, the specification of the cointegrating rank is no easy task in itself. In case of misspecification of the cointegrating rank the properties of this approach are unclear. Ref. [21] states without proof that also the original CVA algorithm delivers consistent estimates in the I(1) case without the need to impose the true cointegrating rank.

Furthermore for I(1) processes [20] proposed various tests for the cointegrating rank and compared them to tests in the Johansen framework showing superior finite sample performance in particular for multivariate data sets with a large dimension of the modeled process.

This paper builds on these results and shows that CVA can also be used in the seasonally integrated case. The main contributions of the paper are:

(i): It is shown that the original CVA algorithm in the seasonally integrated case provides strongly consistent system estimators under the assumption of known system order (thus delivering the currently unpublished proof of the claim in the I(1) case in [21]).
(ii): Upper bounds for the order of convergence for the estimated system matrices are given, establishing the familiar superconsistency for the estimation of the cointegrating spaces at all unit roots.
(iii): Several tests for separate (that is for each unit root irrespective of the specification at the other potential unit roots) determination of the seasonal cointegrating ranks are proposed which are based on the estimated systems and are simple to implement.

The derivation of the asymptotic properties of the estimators is complemented by a simulation study and an application, both demonstrating the potential of CVA and one of the suggested tests. Jointly our results imply that CVA constitutes a very reasonable initial estimate for subsequent quasi likelihood maximization in the VARMA case. Moreover the method provides valuable information on the number of unit roots present in the process, which can be used for subsequent investigation at the very least by providing upper bounds on the number of common trends present at each unit root frequency. Contrary to the JS approach in the VAR framework these tests can be performed in parallel for all unit roots, eliminating the interdependence of the results inherent in the VECM representation. Moreover, they do not use the VECM representation involving a large number of parameters in the case of high sampling rates.

These properties make CVA a useful tool in automatic modeling of multivariate (with a substantial number of variables) seasonally (co-)integrated processes.

The paper is organized as follows: in the next section the model set and the main assumptions of the paper are presented. The estimation methods are described in Section 3. Section 4 states the consistency results. Inference on the cointegrating ranks is proposed in Section 5. Data preprocessing is discussed in Section 6. The simulations are contained in Section 7, while Section 8 discusses an application to real world data. Section 9 concludes the paper. Appendix A contains supporting material, Appendix C provides the proofs of the main results of this paper, which are based on preliminary results presented in Appendix B.

Throughout the paper we will use the symbols

o (g_{T})

and

O (g_{T})

to denote orders of almost sure convergence where T denotes the sample size, i.e.,

x_{T} = o (g_{T})

if

x_{T} / g_{T} \to 0

almost surely and

x_{T} = O (g_{T})

if

x_{T} / g_{T}

is bounded almost surely for large enough T (that is there exists a constant

M < \infty

such that

lim {sup}_{T \to \infty} x_{T} / g_{T} \leq M

a.s.). Furthermore,

o_{P} (g_{T}), O_{P} (g_{T})

denote the corresponding in probability versions.

2. Model Set and Assumptions

In this paper state space processes

{(y_{t})}_{t \in Z}, y_{t} \in R^{s},

are considered which are defined as the solutions to the following equations for given white noise

{(ε_{t})}_{t \in Z}, ε_{t} \in R^{s}, E ε_{t} = 0, E ε_{t} ε_{t}^{'} = Ω > 0

:

\begin{matrix} x_{t + 1} & = & A x_{t} + K ε_{t}, \\ y_{t} & = & C x_{t} + ε_{t} . \end{matrix}

(1)

Here

x_{t} \in R^{n}

denotes the unobserved state and

A \in R^{n \times n}, C \in R^{s \times n}

and

K \in R^{n \times s}

define the state space system typically written as the tuple

(A, C, K)

.

In this paper we consider without restriction of generality only minimal state space systems in innovations representation. For a minimal system the integer n is called the order of the system. As is well known (cf. e.g., [22]) minimal systems are only identified up to the choice of the basis of the state space. Two minimal systems

(A, C, K)

and

(\tilde{A}, \tilde{C}, \tilde{K})

are observationally equivalent if and only if there exists a nonsingular matrix

T \in R^{n \times n}

such that

A = T \tilde{A} T^{- 1}, C = \tilde{C} T^{- 1}, K = T \tilde{K}

. For two observationally equivalent systems the impulse response sequences

k_{0} = I_{s}, k_{j + 1} = C A^{j} K = \tilde{C} {\tilde{A}}^{j} \tilde{K}, j = 0, 1, . . .

coincide.

Ref. [15] shows that the structure of the Jordan normal form of the matrix A determines the properties (such as stationarity) of the solutions to (1) for

t \in Z

. Eigenvalues of A on the unit circle lead to unit root processes in the sense of [15] who also define a state space unit root structure indicating the location and multiplicity of unit roots. A process

{(y_{t})}_{t \in Z}

with state space unit root structure

Ω_{S} = {(0, (c_{0})), (2 π / S, (c_{1})), . . ., (π, (c_{S / 2}))}

for some even integer S is called multi frequency I(1) (in short MFI(1)). Even S is chosen because it simplifies the notation by implying that

S / 2

also is an integer and

z = - 1

is a seasonal unit root. By adjusting the notation appropriately all results hold true for odd S as well).

If, moreover, such a process is observed for S periods per year, it is called seasonal MFI(1). In this case the canonical form in [15] takes the following form:

\begin{matrix} A & = & diag (A_{0}, A_{1}, \dots, A_{S / 2}, A_{•}), \\ A_{0} & = & I_{c_{0}}, \\ A_{j} & = & [\begin{matrix} cos (ω_{j}) I_{c_{j}} & sin (ω_{j}) I_{c_{j}} \\ - sin (ω_{j}) I_{c_{j}} & cos (ω_{j}) I_{c_{j}} \end{matrix}], 0 < j < S / 2, \\ A_{S / 2} & = & - I_{c_{S / 2}}, \\ C & = & [\begin{matrix} C_{0, R} & | & C_{1, R} & C_{1, I} & \dots & \dots & C_{S / 2 - 1, R} & C_{S / 2 - 1, I} & | & C_{S / 2} & | & C_{•} \end{matrix}] \\ = & [\begin{matrix} C_{0} & | & C_{1} & \dots & C_{S / 2 - 1} & | & C_{S / 2} & | & C_{•} \end{matrix}], \\ K & = & {[\begin{matrix} K_{0, R}^{'} & | & K_{1, R}^{'} & K_{1, I}^{'} | & \dots & \dots & | K_{S / 2 - 1, R}^{'} & K_{S / 2 - 1, I}^{'} & | & K_{S / 2}^{'} & | & K_{•}^{'} \end{matrix}]}^{'} \end{matrix}

(2)

where

ω_{j} = 2 π j / S, j = 0, \dots, S / 2

denote the unit root frequencies and

C_{j, R} \in R^{s \times c_{j}}, C_{j, I} \in R^{s \times c_{j}}, K_{j, R} \in R^{c_{j} \times s}, K_{j, I} \in R^{c_{j} \times s}

where

0 \leq c_{j} \leq s, 0 \leq j \leq S / 2

. Furthermore for

C_{j, C} : = C_{j, R} - i C_{j, I}

it holds that

C_{j, C}^{'} C_{j, C} = I_{c_{j}}

and

K_{j, C} = K_{j, R} + i K_{j, I}

is of full row rank and positive upper triangular (

C_{0, I} = C_{S / 2, I} = 0, K_{0, I} = K_{S / 2, I} = 0

), see [15] for details. Finally

| λ_{m a x} (A_{•}) | < 1

, where

λ_{m a x} (A)

denotes an eigenvalue of the matrix

A

with maximal modulus. The stable subsystem

(A_{•}, C_{•}, K_{•})

is assumed to be in echelon canonical form (see [22]).

Using this notation the assumptions on the data generating process (dgp) in this paper can be stated as follows:

Assumption 1.

{(y_{t})}_{t \in Z}

has a minimal state space representation

(A_{\circ}, C_{\circ}, K_{\circ}), A_{\circ} \in R^{n \times n}

of the form (2) with minimal

(A_{\circ, •}, C_{\circ, •}, K_{\circ, •}), A_{\circ, •} \in R^{n_{•} \times n_{•}}

in echelon canonical form where

c = n - n_{•} > 0

.

Furthermore the stability assumption

| λ_{m a x} (A_{\circ, •}) | < 1

and the strict minimum-phase condition

ρ_{0} : = | λ_{m a x} (A_{\circ} - K_{\circ} C_{\circ}) | < 1

hold.

The state at time

t = 1

is given by

x_{1} = {[x_{1, 0}^{'}, . . ., x_{1, S / 2}^{'}, x_{1, •}^{'}]}^{'}

where

x_{1, j} \in R^{δ_{j} c_{j}}

(for

δ_{j} = 2, 0 < j < S / 2

and

δ_{j} = 1

else) is deterministic and

x_{1, •} = \sum_{j = 1}^{\infty} A_{\circ, •}^{j - 1} K_{\circ, •} ε_{1 - j}

is such that

{(x_{t, •})}_{t \in Z}

is stationary.

The noise process

{(ε_{t})}_{t \in Z}

is assumed to be a strictly stationary ergodic martingale difference sequence with respect to the filtration

F_{t}

with zero conditional mean

E (ε_{t} | F_{t - 1}) = 0

, deterministic conditional variance

E (ε_{t} ε_{t}^{'} | F_{t - 1}) = Ω > 0

and finite fourth moments.

Due to the block diagonal form of A the state equations are in a convenient form such that partitioning the state vector accordingly as

x_{t} = (\begin{matrix} x_{t, 0} \\ x_{t, 1} \\ ⋮ \\ x_{t, S / 2} \\ x_{t, •} \end{matrix}),

(3)

the blocks

{(x_{t, j})}_{t \in Z}, x_{t, j} \in R^{δ_{j} c_{j}}

for

c_{j} > 0

are unit root processes with state space unit root structure

{(ω_{j}, (c_{j}))}

. Finally

{(x_{t, •})}_{t \in Z}

is assumed to be stationary due to the assumptions on

x_{1, •}

. If

{({\tilde{y}}_{t})}_{t \in N}

denotes a different solution to the state space equations corresponding to

{\tilde{x}}_{1}

then (for

t > 1

)

{\tilde{y}}_{t} - y_{t} = C A^{t - 1} ({\tilde{x}}_{1} - x_{1}) = \sum_{j = 0}^{S / 2} C_{j} A_{j}^{t - 1} ({\tilde{x}}_{1, j} - x_{1, j}) + C_{•} A_{•}^{t - 1} ({\tilde{x}}_{1, •} - x_{1, •}) .

Note that

C_{j} A_{j}^{t - 1} z_{12} = cos (ω_{j} t) z_{1} + sin (ω_{j} t) z_{2}, 0 < j < S / 2

(for appropriate vectors

z_{12}, z_{1}, z_{2}

),

C_{0} A_{0}^{t - 1} = C_{0}, C_{S / 2} A_{S / 2}^{t - 1} = {(- 1)}^{t - 1} C_{S / 2} .

Therefore the sum

\sum_{j = 0}^{S / 2} C_{j} A_{j}^{t - 1} ({\tilde{x}}_{1, j} - x_{1, j})

can be modeled using a constant and seasonal dummies. The term

C_{•} A_{•}^{t - 1} ({\tilde{x}}_{1, •} - x_{1, •})

tends to zero with an exponential rate as

t \to \infty

and hence does not influence the asymptotics.

Assumption 1 implies that the yearly difference

\begin{matrix} y_{t} - y_{t - S} & = & C A^{S} x_{t - S} + ε_{t} + \sum_{i = 1}^{S} C A^{i - 1} K ε_{t - i} - C x_{t - S} - ε_{t - S} \\ = & (C A^{S} - C) x_{t - S} + v_{t} = (C_{•} A_{•}^{S} - C_{•}) x_{t - S, •} + v_{t} \end{matrix}

is a stationary VARMA process where

v_{t} = ε_{t} + \sum_{i = 1}^{S} C A^{i - 1} K ε_{t - i} - ε_{t - S}

since

A_{j}^{S} = I_{δ_{j} c_{j}}

. Thus the process according to Assumption 1 is a unit root process in the sense of [15]. Note that we do not assume that all unit roots are contained such that the spectral density of the stationary process

{(y_{t} - y_{t - S})}_{t \in Z}

may contain zeros due to overdifferentiation and hence the process potentially is not stably invertible. The special form of

A_{0}

implies that

I (1)

processes are a special case of our dgp while

I (d), d > 1, d \in N,

processes are not allowed for.

3. Canonical Variate Analysis

The main idea of CVA is that, given the state, the system equations (1) are linear in the system matrices. Therefore, based on an estimate of the state sequence, the system can be estimated using least squares regression. The estimate of the state is based on the following equation (for details see for example [17]):

Let

Y_{t, f}^{+} : = {[y_{t}^{'}, y_{t + 1}^{'}, \dots, y_{t + f - 1}^{'}]}^{'}

denote the vector of stacked observations for some integer

f \geq n

and let

E_{t, f}^{+} : = {[ε_{t}^{'}, ε_{t + 1}^{'}, \dots, ε_{t + f - 1}^{'}]}^{'}

. Further define

Y_{t, p}^{-} : = {[y_{t - 1}^{'}, \dots, y_{t - p}^{'}]}^{'}

. Then (for

t > p

)

\begin{matrix} Y_{t, f}^{+} & = & O_{f} x_{t} + E_{f} E_{t, f}^{+} = O_{f} K_{p} Y_{t, p}^{-} + O_{f} {(A_{\circ} - K_{\circ} C_{\circ})}^{p} x_{t - p} + E_{f} E_{t, f}^{+} \\ = & β_{1} Y_{t, p}^{-} + N_{t, f}^{+} \end{matrix}

(4)

where

K_{p} : = [K_{\circ}, {\bar{A}}_{\circ} K_{\circ}, {\bar{A_{\circ}}}^{2} K_{\circ}, \dots, {\bar{A_{\circ}}}^{p - 1} K_{\circ}]

for

\bar{A_{\circ}} : = A_{\circ} - K_{\circ} C_{\circ}

and

O_{f} : = {[C_{\circ}^{'}, A_{\circ}^{'} C_{\circ}^{'}, \dots, {(A_{\circ}^{f - 1})}_{\circ}^{'} C_{\circ}^{'}]}^{'}

. The strict minimum-phase assumption implies

{\bar{A_{\circ}}}^{p} \to 0

for

p \to \infty

.

Let

〈 a_{t}, b_{t} 〉 : = T^{- 1} \sum_{t = p + 1}^{T - f + 1} a_{t} b_{t}^{'}

for sequences

{(a_{t})}_{t \in N}

and

{(b_{t})}_{t \in N}

. Then an estimate of

β_{1}

is obtained from the reduced rank regression (RRR)

Y_{t, f}^{+} = β_{1} Y_{t, p}^{-} + N_{t, f}^{+}

under the rank constraint

r a n k (β_{1}) = n

. This results in the estimate

\hat{O_{f}} \hat{K_{p}} : = [{(\hat{Ξ_{f}^{+}})}^{- 1} {\hat{U}}_{n} {\hat{S}}_{n}] [{\hat{V}}_{n}^{'} {(\hat{Ξ_{p}^{-}})}^{- 1}]

of

β_{1}

using the singular value decomposition (SVD)

\hat{Ξ_{f}^{+}} {\hat{β}}_{1} \hat{Ξ_{p}^{-}} = \hat{U} \hat{S} {\hat{V}}^{'} = {\hat{U}}_{n} {\hat{S}}_{n} {\hat{V}}_{n}^{'} + {\hat{R}}_{n} .

Here

{\hat{β}}_{1} = 〈 Y_{t, f}^{+}, Y_{t, p}^{-} 〉 {〈 Y_{t, p}^{-}, Y_{t, p}^{-} 〉}^{- 1}

denotes the unrestricted least squares estimate of

β_{1}

and

\hat{Ξ_{f}^{+}} : = {〈 Y_{t, f}^{+}, Y_{t, f}^{+} 〉}^{- 1 / 2}, \hat{Ξ_{p}^{-}} : = {〈 Y_{t, p}^{-}, Y_{t, p}^{-} 〉}^{1 / 2} .

(5)

Here the symmetric matrix square root is used. The definition is, however, not of importance and other square roots such as Cholesky factors could be used.

{\hat{U}}_{n} \in R^{f s \times n}

denotes the matrix whose columns are the left singular vectors to the n largest singular values which are the diagonal entries in

{\hat{S}}_{n} : = diag ({\hat{σ}}_{1}, {\hat{σ}}_{2}, \dots, {\hat{σ}}_{n}), {\hat{σ}}_{1} \geq \dots \geq {\hat{σ}}_{n} > 0

and

{\hat{V}}_{n} \in R^{p s \times n}

contains the corresponding right singular vectors as its columns.

{\hat{R}}_{n}

denotes the approximation error.

The system estimate

(\hat{A}, \hat{C}, \hat{K})

is then obtained using the estimated state

{\hat{x}}_{t} : = \hat{K_{p}} Y_{t, p}^{-}, t = p + 1, \dots, T + 1

via regression in the system equations.

In the algorithm a specific decomposition of the rank n matrix

\hat{O_{f}} \hat{K_{p}}

into the two factors

\hat{O_{f}}

and

\hat{K_{p}}

is given such that

\hat{K_{p}} \hat{Ξ_{p}^{-}} {(\hat{Ξ_{p}^{-}})}^{'} {\hat{K_{p}}}^{'} = I_{n}

. It is obvious that every other decomposition of

\hat{O_{f}} \hat{K_{p}}

produces an estimated state sequence in a different coordinate system, leading to a different observationally equivalent representation of the same transfer function estimator. Therefore, with respect to consistency of the transfer function estimator it is sufficient to show that there exists a factorization of

\hat{O_{f}} \hat{K_{p}}

leading to convergent system matrix estimators

(\tilde{A}, \tilde{C}, \tilde{K})

, even if this factorization cannot be used in actual computations, as it requires information not known at the time of estimation.

In order to generate a consistent initial guess for subsequent quasi likelihood optimization in the set of all state space systems corresponding to processes with state space unit root structure

Ω_{S} : = {(ω_{0}, (c_{0})), . . ., (ω_{S / 2}, (c_{S / 2}))}

, however, we will derive a realizable (for known integers

c_{j}

and matrices

E_{j}

such that

E_{j}^{'} C_{\circ, j, C} = I_{c_{j}}

) consistent system estimate. To this end note that consistency of the transfer function implies (see for example [23]) that the eigenvalues

{\tilde{λ}}_{l}

of

\hat{A}

converge (in a specific sense) to the eigenvalues

λ_{j}

of

A_{\circ}

. Therefore transforming

\hat{A}

into complex Jordan normal form (where

\hat{A}

is almost surely diagonalizable), ordering the eigenvalues such that groups of eigenvalues

{\tilde{λ}}_{l} (j), l = 1, . . ., c_{j},

converging to

λ_{j}

are grouped together, we obtain a realizable system

(\overset{ˇ}{A}, \overset{ˇ}{C}, \overset{ˇ}{K})

where the diagonal blocks of the block diagonal matrix

\overset{ˇ}{A}

corresponding to the unit roots converge to a diagonal matrix with the eigenvalues

z_{j}

on the diagonal:

{\overset{ˇ}{A}}_{j, C} = [\begin{matrix} {\tilde{λ}}_{1} (j) & 0 & \dots & 0 \\ 0 & {\tilde{λ}}_{2} (j) & ⋱ & ⋮ \\ ⋮ & ⋱ & ⋱ & 0 \\ 0 & \dots & 0 & {\tilde{λ}}_{c_{j}} (j) \end{matrix}] \to A_{j, C} = [\begin{matrix} z_{j} & 0 & \dots & 0 \\ 0 & z_{j} & ⋱ & ⋮ \\ ⋮ & ⋱ & ⋱ & 0 \\ 0 & \dots & 0 & z_{j} \end{matrix}] .

Replacing

{\overset{ˇ}{A}}_{j, C}

by the limit

A_{j, C}

and transforming the estimates to the real Jordan normal form, we obtain estimates

(\overset{˘}{A}, \overset{ˇ}{C}, \overset{ˇ}{K})

corresponding to unit root processes with state space unit root structure

Ω_{S}

.

Note, however, that this representation not necessarily converges as perturbation analysis only implies convergence of the eigenspaces. Therefore in the final step the estimate

(\overset{˘}{A}, \overset{ˇ}{C}, \overset{ˇ}{K})

is converted such that we obtain convergence of the system matrix estimates. In the class of observationally equivalent systems with the matrix

{\overset{˘}{A}}_{C} = diag (A_{0, C}, A_{1, C}, \bar{A_{1, C}} . . ., \bar{A_{S / 2 - 1, C}}, A_{S / 2, C}, {\overset{ˇ}{A}}_{•}), A_{j, C} = I_{c_{j}} z_{j},

block diagonal transformations of the form

T = diag (T_{0}, T_{1}, \bar{T_{1}}, . . ., T_{S / 2}, I)

do not change the matrix

{\overset{˘}{A}}_{C}

. Here the basis of the stable subsystem can be chosen such that the corresponding transformed

({\overset{˘}{A}}_{•}, {\overset{˘}{C}}_{•}, {\overset{˘}{K}}_{•})

is uniquely defined using an overlapping echelon form (see [22], Section 2.6). The impact of such transformations on the blocks of C is given by

{\overset{ˇ}{C}}_{j, C} T_{j}^{- 1}

. Therefore, if for each

j = 0, . . ., S / 2

a matrix

E_{j} \in C^{s \times c_{j}}

is known such that

E_{j}^{'} C_{\circ, j, C} \in C^{c_{j} \times c_{j}}

is nonsingular, the restriction

E_{j}^{'} {\overset{˘}{C}}_{j, C} = I_{c_{j}}

picks a unique representative

(\overset{˘}{A}, \overset{˘}{C}, \overset{˘}{K})

of the class of systems observationally equivalent to

(\overset{˘}{A}, \overset{ˇ}{C}, \overset{ˇ}{K})

.

Note that this estimate

(\overset{˘}{A}, \overset{˘}{C}, \overset{˘}{K})

is realizable if the integers

c_{j}

(needed to identify the

c_{j}

eigenvalues of

\hat{A}

closest to

z_{j}

), the matrices

E_{j}

(needed to fix a basis for

x_{t, j}

) and the index corresponding to the overlapping echelon form for the stable part are known. Furthermore, this estimate corresponds to a process with state space unit root structure

Ω_{S}

and hence can be used as a starting value for quasi likelihood maximization.

Finally in this section it should be noted that the estimate of the state

{\hat{x}}_{t}

here mainly serves the purpose of obtaining an estimator for the state space system. Based on this estimate, Kalman filtering techniques can be used to obtain different estimates of the state sequence. The relation between these different estimates is unclear and so is their usage for inference. For this paper the state estimates

{\hat{x}}_{t}

are only an intermediate step in the CVA algorithm.

4. Asymptotic Properties of the System Estimators

As follows from the last section, the central step in the CVA procedure is a RRR problem involving stationary and nonstationary components. The asymptotic properties of the solution to such RRR problems are derived in Theorem 3.2. of [24]. Using these results the following theorem can be proved (see Appendix C.1):

Theorem 1.

Let the process

{(y_{t})}_{t \in Z}

be generated according to Assumption 1. Let

(\hat{A}, \hat{C}, \hat{K})

denote theCVAestimators of the system matrices using the assumption of correctly specified order n with

f \geq n

not depending on the sample size and finite and

p = o ({(log T)}^{\bar{a}})

for some real

0 < \bar{a} < \infty, p \geq - d log T / log ρ_{0}

for some

d > 1

where

0 < ρ_{0} = | λ_{m a x} (A_{\circ} - K_{\circ} C_{\circ}) | < 1

. Let

(A_{\circ}, C_{\circ}, K_{\circ})

be in the form given in (2) where

(A_{\circ, •}, C_{\circ, •}, K_{\circ, •})

is in echelon canonical form and for each

j = 0, . . ., S / 2

there exists a row selector matrix

E_{j} \in R^{s \times c_{j}}

such that

E_{j}^{'} C_{\circ, j, C}

is non-singular. Then for some integer a:

(I): $\hat{C} {\hat{A}}^{j} \hat{K} - C_{\circ} A_{\circ}^{j} K_{\circ} = O_{P} ({(log T)}^{a} / \sqrt{T})$ for each $j \geq 0$ .
(II): Using $D_{x} = d i a g (T^{- 1} I_{c}, T^{- 1 / 2} I_{n - c})$ where $c = \sum_{j = 0}^{S / 2} c_{j} δ_{j}$ we have

$(\overset{˘}{A} - A_{\circ}) D_{x}^{- 1} = O_{P} ({(log T)}^{a}), \sqrt{T} (\overset{˘}{K} - K_{\circ}) = O_{P} ({(log T)}^{a}), (\overset{˘}{C} - C_{\circ}) D_{x}^{- 1} = O_{P} ({(log T)}^{a})$

for some integer $a < \infty$ .
(III): If the noise is assumed to be an iid sequence, then results (I) and (II) hold almost surely.

Beside stating consistency in the seasonal integration case, the theorem also improves on the results of [20] in the I(1) case by showing that no adaptation of CVA is needed in order to obtain consistent estimators of the impulse response sequence or the system matrices. Note that this consistency result for the impulse response sequence concerns both the short and the long-run dynamics. In particular it implies that short-run prediction coefficients are consistent. Moreover the theorem establishes strong consistency rather than weak consistency as opposed to [20]. (II) establishes orders of convergence which, however, apply only to a transformed system that requires knowledge of the integers

c_{j}

and matrices

E_{j}

to be realized. No tight bounds for the integer a are derived, since they do not seem to be of much value.

Note that the assumptions on the innovations rule out conditionally heteroskedastic processes. However, since the proof mostly relies on convergence properties for covariance estimators for stationary processes and continuous mapping theorems for integrated processes, it appears likely that the results can be extended to conditionally heteroskedastic processes as well. For the stationary cases this follows directly from the arguments in [18], while for integrated processes results (using different assumptions on the innovations) given for example in [25] can be used. The conditions of [25] hold for example in a large number of GARCH type processes, see [26]. The combination of the different sets of assumptions on the innovations is not straightforward, however, and hence would further complicate the proofs. We refrain from including them.

It is worth pointing out that due to the block diagonal structure of

A_{\circ}

the result

(\overset{˘}{C} - C_{\circ}) D_{x}^{- 1} = O_{P} ({(log T)}^{a})

implies consistency of the blocks

{\overset{˘}{C}}_{j}

corresponding to unit root

z_{j}

(or the corresponding complex pair) of order almost

T^{- 1}

. Using the complex valued canonical form this implies consistent estimation of

C_{\circ, j, C}

by the corresponding

{\overset{˘}{C}}_{j, C}

. In the canonical form this matrix determines the cointegrating relations (both the static as well as the dynamic ones, for details see [15]) as the unitary complement to this matrix. It thus follows that CVA delivers estimators for the cointegrating relations at the various unit roots that are (super-)consistent. In fact, the proof can be extended to show convergence in distribution of

(\overset{˘}{C} - C_{\circ}) D_{x}^{- 1}

. This distribution could be used in order to derive tests for cointegrating relations. However, preliminary simulations indicate that these estimates and hence the corresponding tests are not optimal and can be improved upon by quasi maximum likelihood estimation in the VARMA setting initialized by the CVA estimates. Therefore we refrain from presenting these results.

Note that the assumptions impose the restriction

ρ_{0} > 0

excluding VAR systems. This is done solely for stating a uniform lower bound on the increase of p as a function of T. This bound is related to the lag length selection achieved using BIC, see [27]. In the VAR case the lag length estimator using BIC will converge to the true order and thus remain finite. All results hold true if in the VAR case a fixed (that is independent of the sample size)

p \geq n

is used.

5. Inference Based on the Subspace Estimators

Beside consistency of the impulse response sequence also the specification of the integers

c_{0}, . . ., c_{S / 2}

is of interest. First, following [20] this information can be obtained by detecting the unity singular values in the RRR step of the procedure. Second, from the system representation (2) it is clear that the location of the unit roots is determined by the eigenvalues of

A_{\circ}

on the unit circle: The integers

c_{j}

denote the number of eigenvalues at the corresponding locations on the unit circle (provided the eigenvalues are simple). Due to perturbation theory (see for example Lemma A2) we know that the eigenvalues of

\hat{A}

will converge (for

T \to \infty

) to the eigenvalues of

A_{\circ}

and the distribution of the mean of all eigenvalues of

\hat{A}

converging to an eigenvalue of

A_{\circ}

can be derived based on the distribution of the estimation error

\hat{A} - A_{\circ}

. This can be used to derive tests on the number of eigenvalues at a particular location on the unit circle. Third, if

n \leq s

the state process is a VAR(1) process and hence in some cases allows for inference on the number of cointegrating relations and thus also on the integers

c_{j}

as outlined in [4]. Tests based on these three arguments will be discussed below.

Theorem 2.

Under the assumptions of Theorem 1 the test statistic

T \sum_{i = 1}^{c} (1 - {\hat{σ}}_{i}^{2})

converges in distribution to the random variable

Z = t r [E ({\tilde{ε}}_{t, ⊥} {\tilde{ε}}_{t, ⊥}^{'}) {(\int_{0}^{1} W (r) W {(r)}^{'})}^{- 1}]

where

{\tilde{ε}}_{t, ⊥} = {\tilde{ε}}_{t, 1} - E {\tilde{ε}}_{t, 1} {\tilde{ε}}_{t, •}^{'} {(E {\tilde{ε}}_{t, •} {\tilde{ε}}_{t, •}^{'})}^{- 1} {\tilde{ε}}_{t, •}

(for definition of

{\tilde{ε}}_{t, 1}

and

{\tilde{ε}}_{t, •}

see the proof in Appendix C.2) and where

W (r)

denotes a c-dimensional Brownian motion with variance

\sum_{i = 0}^{S - 1} A_{u}^{i} K_{u} Ω K_{u}^{'} {(A_{u}^{i})}^{'}

with

A_{u}

denoting the

c \times c

heading submatrix of

A

and

K_{u}

denoting the submatrix of

K

composed of the first c rows such that

(A_{u}, C_{u}, K_{u})

denotes the unstable subsystem.

The theorem is proved in Appendix C.2, where also the many nuisance parameters of the limiting random variable are explained and defined. The proof also corrects an error in Theorem 4 of [20], where the wrong distribution is given since the second order terms were neglected.

As the distribution is not pivotal and in particular contains information that is unknown when performing the RRR step, it is not of much interest for direct application. Nevertheless the order of convergence allows for the derivation of simple consistent estimators of the number of common trends: Let

{\hat{c}}_{T}

denote the number of singular values calculated in the RRR that exceed

\sqrt{1 - h (T) / T}

for arbitrary

h (T) \to \infty, h (T) < T, h (T) / T \to 0,

for

T \to \infty

. Then it is a direct consequence of Theorem 2 in combination with

{\hat{σ}}_{j} \to σ_{j} < 1, j > c

, that

{\hat{c}}_{T} \to c

in probability, implying consistent estimation of c. Based on these results also estimators for c could be derived, for example along the lines of [28]. However, as [29] shows, such estimators have not performed well in simulations and thus are not considered subsequently.

The singular values do not provide information on the location of the unit roots. This additional information is contained in the eigenvalues of the matrix

A_{\circ}

:

Theorem 3.

Under the assumptions of Theorem 1 let

{\hat{λ}}_{i} (m), i = 1, . . ., c_{m}

denote the

c_{m}

eigenvalues of

\hat{A}

closest to the unit root

z_{m}, | z_{m} | = 1

. Then defining

{\hat{μ}}_{m} = \sum_{i = 1}^{c_{m}} ({\hat{λ}}_{i} (m) - z_{m})

we obtain

T {\hat{μ}}_{m} \overset{d}{\to} t r [{(\int B (r) B (r) d r)}^{- 1} \int B (r) d B {(r)}^{'}]

where

B (r)

denotes a

c_{m}

-dimensional Brownian motion with zero expectation and variance

I_{c_{m}}

for

z_{m} = \pm 1

and a complex Brownian motion with expectation zero and variance equal to the identity matrix else.

Further if

\tilde{A} : = 〈 x_{t + 1}, x_{t} 〉 {〈 x_{t}, x_{t} 〉}^{- 1}

using the true state

x_{t}

and

{\tilde{μ}}_{m} = \sum_{i = 1}^{c_{m}} ({\tilde{λ}}_{i} (m) - z_{m})

where

{\tilde{λ}}_{i} (m), i = 1, . . ., c_{m}

denote the

c_{m}

eigenvalues of

\tilde{A}

closest to

z_{m}

, then

T ({\hat{μ}}_{m} - {\tilde{μ}}_{m}) = o_{P} (1) .

Therefore the estimated eigenvalues can be used in order to obtain a test on the number of common trends at a particular frequency for each frequency separately. The test distribution is obtained as the limit to

T t r [〈 K_{\circ, m, C} ε_{t}, x_{t, m, C} 〉 {〈 x_{t, m, C}, x_{t, m, C} 〉}^{- 1}]

where

x_{t, m, C} = \bar{z_{m}} x_{t - 1, m, C} + K_{\circ, m, C} ε_{t - 1}, x_{1, m, C} = 0

. The distribution thus does not depend on the presence of other unit roots or stationary components of the state. Furthermore it can be seen that it is independent of the noise variance or the matrix

K_{\circ, m, C}

. Hence critical values are easily obtained from simulations. Also note that the limiting distribution is identical for all complex unit roots.

Therefore, for each seasonal unit root location

z_{m}

we can order the eigenvalues of the estimated matrix

\hat{A}

with increasing distance to

z_{m}

. Then starting from the assumption of

H_{0} : c_{m} = \bar{c}

(for a reasonable

\bar{c}

obtained, e.g., from a plot of the eigenvalues) one can perform the test with statistic

T {\hat{μ}}_{m}

. If the test rejects, then the hypothesis

H_{0} : c_{m} = \bar{c} - 1

is tested, until the hypothesis is not rejected anymore, or

H_{0} : c_{m} = 1

is reached. This is then the last test. If

H_{0}

is rejected again, no unit root is found at this location. Otherwise we do not have evidence against

c_{m} = 1

. In any case, the system needs to be estimated only once and the calculation of the test statistics is easy even for all seasonal unit roots jointly.

The third option for obtaining tests is to use the tests derived in [4] based on the JS framework for VARs. In the case

n \leq s

the state process

x_{t + 1} = A x_{t} + K ε_{t}

is a seasonally integrated VAR(1) process (for

n > s

the noise variance is singular). The corresponding VECM representation equals

p (L) x_{t} = \sum_{m = 1}^{S} (I_{n} - A z_{m}) X_{t - 1}^{(m)} + K ε_{t - 1} = \sum_{m = 1}^{S} α_{m} β_{m}^{'} X_{t - 1}^{(m)} + K ε_{t - 1}

where

z_{m} = exp (\frac{2 π m}{S} i), m = 1, . . ., S

and

\begin{matrix} p (L) = 1 - L^{S} & , & p_{t} = p (L) x_{t} = x_{t} - x_{t - S}, \\ p_{m} (L) = \frac{p (L)}{1 - \bar{z_{m}} L} & , & X_{t}^{(m)} = - \frac{p_{m} (L)}{p_{m} (z_{m}) z_{m}} x_{t} . \end{matrix}

Note that in this VAR(1) setting no additional stationary regressors of the form

p (L) x_{t - j}

occur. Also no seasonal dummies are needed but could be added to the equation. In this setting [4] suggests to use the eigenvalues

{\hat{λ}}_{i}

(ordered with increasing modulus) of the matrix (the superscript

{(.)}^{π}

denotes the residuals with respect to the remaining regressors

X_{t - 1}^{(j)}, j \neq m

)

〈 X_{t - 1}^{(m), π}, p_{t}^{π} 〉 {〈 p_{t}^{π}, p_{t}^{π} 〉}^{- 1} 〈 p_{t}^{π}, X_{t - 1}^{(m), π} 〉 {〈 X_{t - 1}^{(m), π}, X_{t - 1}^{(m), π} 〉}^{- 1}

as the basis for a test statistic

{\tilde{C}}_{m} : = - δ_{m} \sum_{i = 1}^{c_{m}} log (1 - {\hat{λ}}_{i}) .

where

δ_{m} = 2

for complex unit roots and

δ_{m} = 1

for real unit roots. In the

I (1)

case this leads to the familiar Johansen trace test, for seasonal unit roots a different asymptotic distribution is obtained.

Theorem 4.

Under the assumptions of Theorem 1 let

{\hat{C}}_{m}

be calculated based on the estimated state and let

{\tilde{C}}_{m}

denote the same statistic based on the true state. Then for

n \leq s

it holds that

{\hat{C}}_{m} - {\tilde{C}}_{m} = o_{P} (T^{- 1})

and

T {\hat{C}}_{m} \overset{d}{\to} t r [\int d B (r) B {(r)}^{'} {(\int B (r) B (r) d r)}^{- 1} \int B (r) d B {(r)}^{'}]

where

B (r)

is a real Brownian motion for

z_{m} = \pm 1

or a complex Brownian motion else.

Thus again under the null hypothesis the test statistic based on the estimated state and the one based on the true state reject jointly asymptotically with probability one. Therefore for

n \leq s

the tests of JS can be used to obtain information on the number of common cycles, ignoring the fact that the estimated state is used in place of the true state process.

After presenting three disjoint ideas for providing information on the number and location of unit roots, the question arises, which one to use in practice. In the following a number of ideas are given in this respect.

The criterion based on the singular values given in Theorem 2 is of limited information as it only provides the overall number of unit roots. Since the limiting distribution is not pivotal it cannot be used for tests and the choice of the cutoff value

h (T)

is somewhat arbitrary. Nevertheless, using a relatively large value one obtains a useful upper bound on c which can be included in the typical sequential procedures for tests for

c_{j}

.

Using the results of Theorem 4 has the advantage of using a framework that is well known to many researchers. It is remarkable that in terms of the asymptotic distributions there is no difference involved in using the estimated state in place of the true state. The assumption

n \leq s

, however, is somewhat restrictive except in situations with a large s.

Finally the results of Theorem 3 provide simple to use tests for all unit roots, independently of the specification of the model for the remaining unit roots. Again it is remarkable that, under the null, inference is identical for known and for estimated state.

Since our estimators are not quasi maximum likelihood estimators the question of a comparison with the usual likelihood ratio tests arises. For VAR models simulation exercises documented in Section 7 below demonstrate that there are situations where the proposed tests outperform tests in the VAR framework. Comparisons with tests in the state space framework (or equivalently in the VARMA framework) are complicated by the fact that no results are currently available in the literature of this framework. One difference, however, is given by the fact that quasi likelihood ratio tests in the VARMA setting require a full specification of the

c_{j}

values for all unit roots. This introduces interdependencies such that the tests for one unit root depend on the specification of the cointegrating rank at the other roots. The interdependencies can be broken by performing tests based on alternative specifications for each unit root. The test based on Theorem 3 does not require this but can be performed based on the same estimate

\hat{A}

. This is seen as an advantage.

The question of the comparison of the empirical size in finite samples as well as power to local alternatives between the CVA based tests and tests based on quasi-likelihood ratios is left as a research question.

6. Deterministic Terms

Up to now it has been assumed that no deterministic terms appear in the model contrary to common practice. In the VAR framework dealing with trends is complicated by the usage of the VECM representation, see e.g., [30]. In the state space framework used in this paper, however, deterministic terms are easily incorporated.

Theorem 5.

Let the process

{(y_{t})}_{t \in Z}

be generated according to Assumption 1 and assume that the process

{({\tilde{y}}_{t})}_{t \in Z}

is observed where

{\tilde{y}}_{t} = y_{t} + Φ d_{t}

with

d_{t} = {[\begin{matrix} 1, & cos (\frac{2 π}{S} t), & sin (\frac{2 π}{S} t), & \dots & {(- 1)}^{t} \end{matrix}]}^{'} \in R^{S}

and

Φ \in R^{s \times S}

.

Then if theCVAestimation is applied to

{\tilde{y}}_{t}^{π} : = y_{t} - (\sum_{t = 1}^{T} y_{t} d_{t}^{'}) {(\sum_{t = 1}^{T} d_{t} d_{t}^{'})}^{- 1} d_{t}, t = 1, . . ., T,

the results of Theorem 1 hold, i.e., the system is estimated consistently and the orders of convergence for the transformed system

(\overset{˘}{A}, \overset{˘}{C}, \overset{˘}{K})

hold true.

Furthermore the convergence in distribution results in Theorems 2–4 hold true where in the limits the Brownian motions

B (r)

occurring in the distributions must be replaced by their demeaned versions

B (r) - \int_{0}^{1} B (s) d s

.

In this sense the results are robust to some operations typically termed preprocessing of data such as demeaning and deseasonalizing using seasonal dummies. More general preprocessing steps such as detrending or the extraction of more general deterministic terms analogous to [30] can be investigated along the same lines.

7. Simulations

The estimation of the seasonal cointegration ranks and spaces is usually carried out via quasi maximum likelihood methods that originated from the VAR model class. Typical estimators in this setting are those of [2,4,5,31]. In the first two experiments we focus on the estimation of the cointegrating spaces and the specification of the cointegration ranks in the classical situation of quarterly data and show that there are certain situations in which CVA estimators and the test in Theorem 3 possess finite sample properties superior to those of the methods above. In a third experiment the test performance is evaluated for a daily sampling rate. Moreover, the prediction accuracy of CVA is investigated as well as its robustness to innovations exhibiting behaviors often encountered at such higher sampling rates. All simulations are carried out using 1000 replications.

To investigate the practical usefulness of the proposed procedures we generate quarterly data using two VAR dgps of dimension

s = 2

first and then two more general VARMA dgps with

s = 8

. Each pair contains dgps with different state space unit root structures

{(0, (1)), (π / 2, (c_{π / 2})), (π, (1))}, c_{π / 2} = 1, 2 .

From all four dgps samples of size

T \in {50, 100, 200, 500}

are generated with initial values set to zero. Although none of the dgps contains deterministics, the data is adjusted for a constant and quarterly seasonal dummies as in [5]. For reasons of comparability, the adjustment for deterministic terms is done before estimation.

In the third experiment we generate daily data with dimension

s = 4

from a state space system including unit roots corresponding to weekly frequencies (that is a period length of seven days). In the simulations we use several years of data (excluding new year’s day to account for 52 weeks of seven days each). The first 200 observations are discarded to include the effects of different starting values. In this example the focus lies on a comparison of the prediction accuracy. Furthermore we investigate the robustness of the test procedures to conditional heteroskedasticity of the GARCH type as well as to non-normality of the innovations.

To assess the performance of specifying the cointegrating rank at unit root z using CVA, the following test statistic is constructed from the results in Theorem 3

Λ (c) = T | (\frac{1}{c} \sum_{i = 1}^{c} {\hat{λ}}_{i}) - z | .

(6)

Here

{\hat{λ}}_{1}, \dots, {\hat{λ}}_{n}

are the eigenvalues of

\hat{A}

ordered increasingly according to the distance from z. Note that a similar test in [20] only uses the c-th largest eigenvalue, whereas here the average over the nearest c eigenvalues is taken. Critical values have been obtained by simulation using large sample sizes (sample size 2000 (JS) and 5000 (CVA), 10,000 replications).

In our first two experiments usage of

Λ (c)

is compared with variants of the likelihood ratio test from [2] (JS), [4] (

Q_{1}

), and [5] (

Q_{2}

,

Q_{3}

).

Q_{1}

is Cubadda’s trace test for complex-valued data,

Q_{2}

takes the information at frequency

π / 2

into account when the analysis is carried out at frequency

3 π / 2

, and

Q_{3}

iterates between

π / 2

and

3 π / 2

in the alternating reduced rank regression (ARR) of [5]. For the procedure of [2] the likelihood maximization at frequency

π / 2

is carried out using numerical optimization (BFGS) with initial values obtained from an unrestricted regression.

All tests are evaluated by comparing the percentages of correctly detected common trends, or hit rates, with

0.95

, the hit rate to be expected from a nominal significance level of

0.05

. The testing procedure employed for all tests is the same: at each of the frequencies it is started from a null hypothesis of s unit roots against less than s unit roots. In case of rejection,

s - 1

unit roots are tested versus less than

s - 1

and so on, until there are zero unit roots under the alternative.

For the first two experiments the estimation performance of CVA for the simultaneous estimation of the seasonal cointegrating spaces is compared with the maximum likelihood estimates of [2,4,31] (cRRR), and also with an iterative procedure (Generalized ARR or GARR) of [5]. The comparison is carried out by means of the gap metric, measuring the distance between the true and the estimated cointegrating space as in [32]. The smaller the mean gap over all replications, the better is the estimation performance. Throughout a difference between two mean gaps or two hit rates is considered statistically significant if it is larger than twice the Monte Carlo standard error.

For all procedures used in this section, an AR lag length has to be chosen first. For CVA this can be done using the AIC as in ([33], Section 5), as is done in the third experiment.

In the first two experiments where sample sizes are rather small, we estimate the lag length via minimization of the corrected AIC (AICc) ([34], p. 432),

{\hat{k}}_{A I C c}

, benefitting the simulation results. For larger sample sizes the two criteria lead to the same choices. Due to the quarterly data we work with, the lag length is then chosen to be

\hat{k} = max {{\hat{k}}_{A I C c}, 4}

.

Other information criteria could be chosen here. An anonymous referee also suggested the application of the Modified Akaike Information Criterion (MAIC) of [35], proposed there for the I(1)-case. In an attempt to apply it to the seasonally integrated case considered here, it performed considerably worse than the AICc. Thus we refrain from using the MAIC in the following and also omit the results of that attempt. They can be obtained from the authors upon request.

For CVA the truncation indices f and p are chosen as

\hat{f} = \hat{p} = 2 \hat{k}

([33], Section 5). The system order n is estimated by minimizing ([33], Section 5)

S V C (n) = {\hat{σ}}_{n + 1}^{2} + 2 n s \frac{log T}{T} .

(7)

Here

{\hat{σ}}_{i}

denotes the i-th largest singular value from the singular value decomposition of

{\hat{Ξ}}_{f}^{+} {\hat{β}}_{1} {\hat{Ξ}}_{p}^{-}

(Step 2 of CVA). Note that selecting the number of states by

S V C

is made less influential insofar as

\hat{n} = max {c_{0} + 2 c_{π / 2} + c_{π}, {\hat{n}}_{S V C}}

, where

{\hat{n}}_{S V C}

denotes the SVC estimated system order.

In Section 7.1 we start with the two VAR dgps and find that the likelihood-based procedures are mostly superior. Continuing with the VARMA dgps in Section 7.2, CVA performs better and is superior for the smaller sample sizes in terms of size and gap and better for all sample sizes in terms of power. Section 7.3 evaluates the performance of the tests for unit roots for larger sample sizes together with the prediction performance in this setting. We find that the tests are robust to the distribution of the innovations as well as to conditional heteroskedasticity of the GARCH type. Furthermore the empirical size of the tests lies close to the size already for moderate sample sizes, where the tests also show almost perfect power properties.

7.1. VAR Processes

The VAR dgps considered in this paper are given by,

X_{t} = Π_{1} X_{t - 1} + Π_{2} X_{t - 2} + Π_{3} X_{t - 3} + Π_{4} X_{t - 4} + ε_{t}, ε_{t} \sim N ([\begin{matrix} 0 \\ 0 \end{matrix}], [\begin{matrix} 1 & 0.5 \\ 0.5 & 1 \end{matrix}])

(8)

where

{(ε_{t})}_{t \in Z}

is white noise and the coefficient matrices are

\begin{matrix} Π_{1} & = & [\begin{matrix} γ & 0 \\ 0 & 0 \end{matrix}], Π_{2} = [\begin{matrix} - 0.4 & 0.4 - γ \\ 0 & 0 \end{matrix}], \\ Π_{3} & = & [\begin{matrix} - γ & 0 \\ 0 & 0 \end{matrix}], Π_{4} = [\begin{matrix} 0.6 - (γ / 10) & 0.4 + γ \\ 0 & 1 \end{matrix}] . \end{matrix}

This dgp is adopted from [5] with a slight adjustment to

Π_{4}

. The corresponding VECM representation in the notation of [5] equals

\begin{matrix} X_{0, t} & = & [\begin{matrix} - 0.2 \\ 0 \end{matrix}] [\begin{matrix} 1 + γ / 8 & - 1 \end{matrix}] X_{1, t - 1} + [\begin{matrix} 0.2 \\ 0 \end{matrix}] [\begin{matrix} 1 + γ / 8 & - 1 \end{matrix}] X_{2, t - 1} + \\ [\begin{matrix} γ \\ 0 \end{matrix}] [\begin{matrix} 1 + 0.05 L & - L \end{matrix}] X_{3, t - 1} + ε_{t} . \end{matrix}

As can be seen from Table 1, the dgps possess unit roots at frequencies 0,

π

, and

π / 2

, where

c_{π / 2} = 2 [1]

for

γ = 0 [0.2]

, respectively. Note that in all cases the order of integration equals 1, while the number of common cycles at

π / 2

is varied.

Table 2 exhibits the hit rates from the application of the different test statistics. At frequencies 0 and

π

,

Λ

is compared with the trace test of Johansen (J; based on [31] for unit roots

z = - 1

), whereas at

π / 2

it is competing with JS,

Q_{1}, Q_{2}

, and

Q_{3}

. All competitors are likelihood-based tests which is the term we are referring to when we compare

Λ

to them as a whole.

The results for 0 and

π

are very similar for both dgps in that

Λ

scores more hits than the likelihood-based tests when the sample size is small,

T \in {50, 100}

. Convergence of its finite sample distribution is slower than for the other test statistics, however, as J is closer to

0.95

from

T = 200

on. For

T = 500

the distribution of

Λ

only seems to have converged to its asymptotic distribution when

c_{π / 2} = 2

at frequency 0, whereas convergence of the likelihood-based tests has occurred in all cases.

At

π / 2

the likelihood ratio test of JS strictly dominates all implementations of [5] for all sample sizes and both dgps. It strictly dominates the CVA-based test procedure as well, except for one case, it seems: when

c_{π / 2} = 1

and

T = 50

Λ

scores slightly, but significantly, more hits than the likelihood ratio test of JS. Surprisingly,

Λ

is drastically worse when

T = 100

with only

8.7 %

, only to be up at

85 %

for

T = 200

.

The behavior of

Λ

is explained by

z_{5}

and

z_{6}

being close to

\pm i

when

c_{π / 2} = 1

, cf. Table 1. For future reference we will call the corresponding roots false unit roots.

For

T = 50

the estimates of eigenvalues corresponding to actual unit roots are rather not very close to

\pm i

in contrast to the false unit roots. Thus the latter are mistaken for actual unit roots (cf. the first panel in Figure 1), leading to a hit rate of

81.1 %

, one that is even larger than the rates at 0 and

π

. As the sample size increases, the eigenvalue estimates of the true unit roots become more and more accurate, visible from the second and third panel in Figure 1. Accordingly they can be detected correctly more often. Unfortunately however, for

T = 100

the false unit roots remain to be detected such that often two instead of just one unit root are found by

Λ

, resulting in a hit rate of only

8.7 %

. For

T \in {200, 500}

Λ

is able to distinguish the false unit roots from the true ones and the detection rate is getting closer to the asymptotic rate,

85.5 %

and

92.7 %

, respectively.

When the VAR dgp without false unit roots and

c_{π / 2} = 2

is considered, it is visible that the hit rates of

Λ

at

π / 2

are monotonously increasing in the sample size again. The rates are smaller than those of the likelihood-based tests, however, and also clearly worse than those of

Λ

at 0 and

π

, cf. Table 2 again.

Taken together, at frequencies 0 and

π

which correspond to real-valued unit roots, the use of

Λ

was advantageous for

T = 50

. It also scored more hits for

T = 100

and

c_{π / 2} = 1

. For higher sample sizes the likelihood-based tests clearly dominate

Λ

at these two frequencies. At

π / 2

this superiority of the likelihood-based tests for all sample sizes and both dgps continues. The example also points to a general weakness: if the sample size is low and false unit roots are present, it can be difficult for

Λ

to distinguish them from actual unit roots.

7.2. VARMA Processes

The second setup consists of VARMA data generated by a state space system

(A_{r}, C_{r}, K_{r}),

r = 1, 2

, as in (1), where the matrices

A_{1}

and

A_{2}

are constructed as in (2) and are taken to be

\begin{matrix} A_{1} = [\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & - 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & - 1 & 0 \end{matrix}], A_{2} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & - 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & - 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & - 1 & 0 \end{matrix}] . \end{matrix}

(9)

These two choices yield the same state space unit root structures as those of the two VAR dgps with

c_{π / 2} = 1

and

c_{π / 2} = 2

for

A_{1}

and

A_{2}

, respectively. The other two system matrices

K_{r} \in R^{(2 + 2 r) \times s}

and

C_{r} \in R^{s \times (2 + 2 r)}

with

s = 8

are drawn randomly from a standard normal distribution in each replication and

{(ε_{t})}_{t \in Z}

is multivariate normal white noise with an identity covariance matrix.

Note that these systems are within the VARMA model class such that the dgp is contained in the VAR setting only by increasing the lag length as a function of the sample size. While superiority of the CVA approach in such a setting might be expected, this is far from obvious. Moreover, using a long VAR approximation is the industry norm in such situations.

From the hit rates in Table 3 it can be seen that the combination of large s, small T, and a minimal lag length of four render the likelihood-based tests useless at all frequencies with hit rates below ten percent for

T = 50

.

Λ

in contrast does not suffer from this problem and is already close to

95 %

for this sample size. Only when

T = 200

do the likelihood-based tests appear to work, exhibiting hit rates close to

95 %

.

For all tests alike, however, it is striking that hit rates move away from

95 %

when

T = 500

. This behavior is most pronounced for

Λ

, e.g., from

T = 200

to

T = 500

its hit rate drops from

93.1 %

to

82.4 %

at 0 when

A_{2}

is used. This phenomenon is a consequence of the fact that f and k in the algorithm are chosen data dependent. An inspection of how the hit rates depend on f and k and a comparison with the actually selected

\hat{f}, \hat{k}

reveals that for

T = 500

too large values of f and k are chosen too often and leave room for improvement in the hit rates, cf. Figure 2. The figure stresses an important point: The performance of the unit root tests is heavily influenced by the selected lag lengths for all procedures. We tested a number of different information criteria in this respect. AICc turned out to be the best criterion overall, but not uniformly. Figure 2 indicates advantages for this example of BIC over AIC as it on average selects smaller lag lengths, associated here with higher hit rates.

To study the power of the different procedures, the transition dynamics

A_{r}

in (9) are multiplied by

ρ \in {0.8, 0.85, 0.9, 0.95}

so that the systems do not contain unit roots at any of the frequencies. Here empirical power is defined as the frequency of choosing zero common trends. This is why for

ρ = 1

, when there are in fact common trends present in our specifications, the empirical power values plotted in Figure 3 are not equal to the actual size we could define as one minus the hit rate: our measure of empirical power in this situation only counts the false test conclusion of zero common trends, but there are of course multiple ways the testing procedure could conclude falsely.

As expected, rejection of the null hypothesis is easiest when

ρ

is small and is very difficult when it is close to 1, cf. Figure 3 for the case of

A_{2}

.

Further, there are almost no differences among the likelihood-based tests over all combinations of sample size and frequency, only for

T = 100

is JS significantly worse than the

Q_{i}, i = 1, 2, 3

at

π / 2

. It is also clearly visible at all frequencies that the likelihood-based tests possess no or only very limited power when

T = 50

and

T = 100

, respectively.

Λ

, in contrast, is clearly more powerful in these cases. As the sample size increases to

T = 200

, the power of each test improves, still

Λ

remains the most powerful option. Only for

T = 500

have the differences almost vanished with small, but significant, advantages for

Λ

at 0 and

π

.

The results are the same when

A_{1}

is used and

c_{π / 2} = 1

and all of the differences described here are statistically significant.

Next the estimation performance of CVA is evaluated by calculation of the gaps between the true and the estimated cointegrating spaces. At all frequencies these gaps are compared with the GARR procedure of [5] which cycles through frequencies. At

π / 2

CVA and GARR are also compared with our implementation of JS and cRRR of [4], whereas it is also compared with the usual Johansen procedure at 0 and

π

. All estimates are conditional on the true state space unit root structure in the sense that the minimal number of states used is larger or equal to the number of unit roots over all frequencies. Other than imposing a minimum state dimension, the estimation of the order using

S V C

is not influenced. The likelihood-based procedures, on the other hand, take the unit root structure as given, i.e., do not perform CI rank testing for this estimation exercise.

From the results in Table 4 it can be noted first that the likelihood-based procedures show mostly equal mean gaps. Only for

π / 2

and

T = 50

and both dgps does JS possess significantly larger gaps than cRRR and GARR and other differences are not statistically significant. Thus it does not matter in our example whether the iterative procedure is used or not.

Second, CVA is again superior for

T = 50

where it exhibits mean gaps that are significantly smaller than those of the other estimators at all frequencies. This advantage is turned around for higher sample sizes, though: mean gaps are smaller for the likelihood-based procedures when

T \in {100, 200, 500}

and

A_{2}

is used, if only slightly. When

A_{1}

is used instead, mean gaps do not differ significantly from each other at

π / 2

when

T > 50

and at

0, π

when

T = 100

and those of CVA are only very modestly worse when

T \in {200, 500}

at

0, π

.

Thus, when it comes to estimating the cointegrating spaces, CVA is superior for

T = 50

and equally good or only slightly worse than the likelihood-based procedures for higher sample sizes. For the systems analyzed, decreasing

c_{π / 2}

leads to gaps that are smaller for all methods and these improvements are slightly larger for CVA than for the other estimators.

7.3. Robustness of Unit Root Tests for Daily Data

In this last simulation example we examine the robustness of the proposed procedures with regard to test performance and prediction accuracy with respect to the innovation distribution and the existence of conditional heteroskedasticity of the GARCH-type, as these features are often observed in data of higher frequency, for example in financial applications. While our asymptotic results do not depend on the distribution of the innovations (subject to the assumptions), the assumptions do not include GARCH effects. Nevertheless, the theory in [25,26] suggests that the tests might be robust also in this respect.

We generate a state space system of order

n = 8

using the matrix

A = {[A_{i, j}]}_{i, j = 1, . . ., 8}

where

A_{i, i + 1} = 1, i = 1, . . ., 6, A_{7, 1} = 1, A_{8, 8} = 0.8

and

A_{i, j} = 0

else. This implies that the eigenvalues of this matrix are

λ_{j} = exp (2 π i j / 7), j = 1, . . ., 7, λ_{8} = 0.8

. Therefore the corresponding process has state space unit root structure

((0, (1)), (2 π / 7, (1)), (4 π / 7, (1)), (6 π / 7, (1))) .

The entries of the matrices C and K are chosen as independent standard normally distributed random variables as before.

A process

{(y_{t})}_{t = 1, . . ., T}

is generated from filtering an independent identically distributed innovation process

{(ε_{t})}_{t = - 199, . . ., T + 1}

through the system

(A, C, K)

. The first 200 observations are discarded, the last are used for validation purposes. A total of 1000 replications are generated where in each replication a different system is chosen.

With the generated data three different estimates are obtained: An autoregressive model (called AR in the following) is estimated with lag length chosen using AIC of maximal lag length equal to

⌊ \sqrt{T} ⌋

. Second, an autoregressive model with large lag length (called ARlong) is estimated. This estimate is used to hint at the behavior of an autoregression using the lag length equal to a full year. This would correspond to estimating a VECM without rank restrictions, when accounting for yearly differences. The third method consists of the CVA estimates, where

f = p = 2 {\hat{k}}_{A I C}

is chosen. The order is estimated by minimizing SVC. However, we correct for orders smaller than

n = 7

which would limit the possibilities of finding all unit roots.

First, we compare the prediction accuracy for the three methods for two different distributions of the innovations: Beside the standard normal distribution also the student t-distribution with

v = 5

degrees of freedom (scaled to unit variance) is used. This distribution shows considerably heavier tails than the normal distribution but nevertheless is covered by our assumptions.

Figure 4 provides the results for out-of-sample one day ahead mean absolute prediction error (over all coordinates) for the sample sizes

T = 364

days (one year),

T = 1092

(3 years) and

T = 3276

(nine years). The long AR model is estimated with lag lengths of 8 weeks for the smallest sample size, 10 weeks for the medium sample size and 12 weeks for the largest sample size.

In the figure the results for the normally distributed innovations are presented as well as the ones for the t-distributed residuals (scaled to unit variance). It can be seen that for the two larger sample sizes the mean absolute error for the residuals for CVA is smaller in all cases. For the smallest sample size, by contrast, results are mixed. For CVA the results for the heavy tailed distribution in this case are much worse than for the normal distribution. For the larger sample sizes the differences are small. The maximal standard error of the estimated means over 1000 replications for

T = 1092

and

T = 3276

amounts to 0.05. This allows the conclusion that CVA performs better for the two larger sample sizes. For

T = 364

there are no statistically significant differences between the performance of the three methods: CVA seems to suffer more from few very large errors (using the root mean square errors the CVA results are worse for

T = 364

in comparison; if one uses the 95% percentiles CVA performs best also for the smallest sample size). This results in a standard error over the replications of the mean absolute error for

T = 364

of 0.18 for normally distributed innovations and 3.4 for t-distributed innovations. The long AR models are clearly worse than the two other approaches. This happens even if we are still far from using a full year as the lag length.

With regard to the unit root tests we investigate results for the tests of the hypotheses

H_{0} : c_{m} = 1

versus

H_{1} : c_{m} = 0

at all frequencies

2 π m / 364, m = 0, . . ., 363

. The data generating process features unit roots with

c_{m} = 1

at the seven frequencies

2 π k / 7, k = 0, . . ., 6

. Therefore the tests should not reject at these frequencies, but should reject at all others.

Consequently we compare the minimum of the non-rejection rates for the seven unit roots (called empirical size below) as well as the maximum of the non-rejection rates for the non-unit root frequencies

ω_{j} = 2 π j / 364, j \neq 52 k, k = 0, 1, 2, . . ., 6

(called empirical power below).

For the larger sample sizes the empirical size is practically 95% while the empirical power is 100%. For

T = 364

we obtain an empirical size of 90% for the normal distribution and 91.5% for the t-distribution. The worst empirical power equals 89.3% (normal) and 87.6% (t-distribution). Hence even for one year of data the discrimination properties of the unit root tests are good and we do not observe differences between the normal distribution for the innovations and the heavy tailed t-distribution.

Finally we compare the empirical size and power of the tests for the various unit roots for smaller sample sizes

T \in {104, 208, 312, 416, 520}

. For the experiments we consider univariate GARCH models of the form

ε_{t, i} = h_{t, i} η_{t, i}, h_{t, i}^{2} = 1 + α ε_{t - 1, i}^{2} + β h_{t - 1, i}^{2}, i = 1, . ., 4,

where

{(η_{t, i})}_{t \in Z}

is independent and identically standard normally distributed.

α, β \geq 0

are reals. It follows that the component processes

{(ε_{t, i})}_{t \in Z}

show conditional heteroskedasticity, the persistence of which is governed by

α + β

. Here

0 < α + β < 1

implies stationarity while

α + β = 1

implies persistent conditional heteroskedasticity usually termed I-GARCH. We include five different processes for the innovations:

norm: $α = β = 0$ , no GARCH effects
G1: $α = 0.8, β = 0.1$
IG1: $α = 0.8, β = 0.2$
IG2: $α = 0.5, β = 0.5$
IG3: $α = 0.2, β = 0.8$

For the five different sample sizes 1000 replications of the estimates using the CVA algorithm are obtained. For each estimate we calculate the test statistic for testing

H_{0} : c_{m} = 1

versus

H_{0} : c_{m} = 0

for

m = 0, . . ., 363

corresponding to the unit roots

z_{m} = exp (2 π i m / 364)

. This set of unit roots contains all seven unit roots

exp (2 π i k / 7), k = 0, . . ., 6

.

Figure 5 provides the mean over the 1000 replications of the test statistics

Λ (1)

for

z_{j}, j = 0, . . ., 363

and the five sample sizes. It can be seen that the test

Λ (1)

is able to pinpoint the seven unit roots present in the data generating process fairly accurately even for sample size

T = 104

. The zoom on the region around the unit root frequency

2 π / 7

shows that the mean value is larger than the cutoff value of the test (the dashed horizontal line) for the adjacent frequency

2 π \frac{53}{364}

already for

T = 312

.

Table 5 lists the minimum of the achieved percentages of non-rejections of the test statistic for the seven unit root frequencies as well as the maximum over all non-unit root frequencies. It can be seen that for all GARCH models for

T = 312

the test rejects unit roots at all non unit root frequencies every time, while the empirical size is close to the nominal 5%. For small sample sizes the tests are slightly undersized while for

T = 208

a slight oversizing is observed. The two larger sample sizes are omitted as the tests perform perfectly there.

It follows from the examples presented in this subsection that the test is robust also in small samples with respect to heavy tailed distributions of the innovations (subject to the assumptions). Furthermore also a remarkable robustness with respect to GARCH-type conditional heteroskedasticity is observed.

8. Application

In this section we apply CVA to the modeling of electricity consumption using a data set from [36]. The dataset contains hourly consumption data (in megawatts) from a number of US regions, scraped from the webpage of PJM Interconnection LLC, a regional transmission organization. The number of regions have changed over time, thus the data set contains many missing values. It also contains data aggregated into regions called east and west, which are not used subsequently.

In order to avoid problems with missing values, we restrict the analysis to four regions, for which data over the same time period is available: American Electric Power (AEP; in the following printed in blue), the Dayton Power and Light Company (DAYTON; black), Dominion Virginia Power (DOM; red) and Duquesne Light Co. (DUQ; green). We use data from 1 May 2005 until 31 July 2018. In this period only 3 data points are missing for the four regions and their imputation is handled by interpolation of the corresponding previous values. One observation in this sample is an obvious outlier which is corrected for analogously.

The data is split into an estimation sample covering observations up to the end of 2016 (102,291 observations on 4263 days) and a validation sample containing data in 2017 and 2018 (13,845 observations on 577 days). Data is equally sampled, but contains two hour segments when switching from winter to summer time or back. Table 6 contains some summary statistics.

Figure 6 provides an overview of the data: Panel (a) shows the full data on an hourly basis, while (b) presents aggregation to daily frequency. Panel (c) zooms in on a two year stretch of daily consumption. Panel (d) finally provides hourly data for the first month in the validation data. The figures clearly document strong daily, weekly and yearly patterns. From these figures it appears that these seasonal fluctuations are somewhat regular with changes throughout time. It is hence not clear whether a fixed seasonal pattern is appropriate. Also note that the sampling frequency is on an hourly basis such that a year roughly covers 8760 observations.

In the following we estimate (on the estimation part) and compare (on the validation part) a number of different models, first for the full hourly data set and afterwards for the aggregated daily data. As a benchmark we will use univariate AR models including deterministic seasonal patterns for daily, weekly and yearly variations. Subsequently we estimate models using CVA including different sets of such seasonal patterns.

First in the analysis using dummy variables fixed periodic patterns have been estimated. We model the natural logarithm of consumption (to reduce problems due to heteroskedasticity) and include dummies for weekdays, hours and sine and cosine terms corresponding to the first 20 Fourier frequencies with respect to annual periodicity. The corresponding results can be viewed in Figure 7. It is obvious that there is quite some periodic variation. Also the four data sets show very similar patterns as expected.

After the extraction of these deterministic terms the next step is univariate autoregressive (AR) modeling. Figure 8 shows the BIC values of AR models of lag lengths zero to 800 for the four series as well as the BIC of a multivariate AR model for the same number of lags. The chosen values are given in Table 6.

The BIC curve is extremely flat for the univariate models. Noticeable drops in BIC occur around lag 24 (one day), 144 (six days), 168 (one week), 336 (two weeks), 504 (three weeks). BIC selects large lag lengths from 529 (DUQ) up to 554 (DOM). AIC selects lag lengths close to the maximum allowed with a minimum at 772 lags. The BIC pattern of the multivariate model differs in that the two drops at two and three weeks are missing. Instead, the optimal BIC value is obtained at lag 194, well below the optimal lag lengths in the univariate cases. AIC here opts for lag length 531, just over 22 days.

Subsequently CVA is applied with

f = {\hat{k}}_{B I C}, p = {\hat{k}}_{A I C}

as estimated for the multivariate model. This differs from the usual recommendation of

f = p = 2 {\hat{k}}_{A I C}

in order to avoid numerical problems with huge matrices. The order is chosen according to SVC, resulting in

\hat{n} = 240

. The corresponding model is termed Mod 1 in the following. Note that this configuration of

f, \hat{n}

does not fulfill the requirements of our asymptotic theory. The bound

f \geq n

ensures that the matrix

O_{f}

has full column rank. Generically this will be the case for

f s \geq n

leading to a less restrictive assumption. In practice too low values of f will be detected by

\hat{n}

estimated close to the maximum, which is not the case here.

As a second model we only use weekday dummies but neglect the other deterministics. Again AIC (

{\hat{k}}_{A I C} = 531

) and BIC (

{\hat{k}}_{B I C} = 195

) are used to determine the optimal lag length in the multivariate AR model. The corresponding CVA estimated model uses

\hat{n} = 245

according to SVC, resulting in Mod 2.

The third model uses only a constant as deterministic term. Again similar AIC (555) and BIC (195) values are selected. A state space model, Mod 3, using CVA is estimated with

\hat{n} = 209

.

Figure 9 provides information on the results. Panel (a) shows the coefficients of the univariate AR models. It can be seen that lags around one day and one to three weeks play the biggest role for all four datasets. Panel (b) shows that the multivariate models lead to better one step ahead predictions in terms of the root mean square error (RMSE). Mod 1 and Mod 2 show practically equivalent out of sample prediction error for all four data sets, while Mod 3 delivers the best out of sample fit for all four regions.

In particular in financial applications data of high sampling frequency shows persistent behaviour, also in terms of conditional heteroskedasticity, as well as heavy tailed distributions of the innovations. For our data sets Figure 10 below provides some information in this respect for the residuals according to Mod 3. Panel (a) provides a plot of the residuals in the year 2018 (contained in the validation period). It can be seen that large deviations occur occasionally, while else residuals vary in a tight band around 0. The kernel density estimates for the normalized (to unit variance) residuals on the full validation data set in panel (b) show the typical heavy tailed distributions. Panel (c) contains an ACF plot for the four regions again calculated using the full validation sample. It demonstrates that the model successfully eliminates all autocorrelations with only a few ACF values occurring outside the confidence interval. Panel (d) provides the ACF plot for the squared innovations to examine GARCH-type effects. While GARCH-effects are clearly visible, the ACF drops to zero fast with occasional positive values (except maybe for the Duquesne data).

Applying the eigenvalue based test

Λ (1)

for

c = 1

and all Fourier frequencies

ω_{j} = 2 π j / (365 * 24)

we find that for Mod 2 and Mod 3 the largest p-value is obtained for

ω_{365}

corresponding to a period length of one day with 0.0187 for Mod 2 (test statistic 6.6) and 0.02 for Mod 3 (with a statistic of 6.5). This implies that the unit root at frequency

ω_{365}

is not rejected for a significance level of 1%, but is rejected for 5%. All other unit roots are rejected at every usual significance level. For Mod 1 the test statistic for

ω_{365}

equals 41.2 corresponding to a p-value of practically 0. This implies that on top of a deterministic daily pattern the series show strong persistence at the daily period. Excluding the hourly dummies pulls the roots closest to

ω_{365}

closer to the unit circle resulting in insignificant unit root tests and improves the one step ahead forecasts. Including the dummies weakens the evidence of a unit root while leading to worse predictions.

The analysis is repeated with data aggregated to daily sampling frequency. The aggregation reduces the required lag lengths, as is visible from Table 6 in the univariate case, and hence we use CVA with the recommended

f = p = 2 {\hat{k}}_{A I C}

. Beside the univariate models, in this case also a naive model of predicting the consumption for today as yesterday’s consumption is used. Three multivariate models are estimated: Mod 1 contains weekday dummies and sine and cosine terms for the first twenty Fourier frequencies corresponding to a period of one year. Mod 2 only contains the weekday dummies, while Mod 3 only uses the constant. Figure 11 provides the out-of-sample RMSE for one day ahead predictions (panel (a)) and seven days ahead predictions (panel (b)).

It can be seen that both Mod 1 and Mod 2 beat the univariate AR models in terms of one step ahead prediction error, while Mod 3 performs better for seven days ahead prediction. Mod 1 performs on par with Mod 2 for one step ahead prediction but performs better in predicting seven steps ahead. In Figure 12 poles and zeros for the three estimated state space models are plotted. Here the poles (marked with ‘x’) are the eigenvalues of the matrix A. These are the inverses of the determinantal roots of the autoregressive matrix polynomial in the equivalent VARMA representation. The zeros are the inverses of zeros of the determinant of the MA polynomial. We can see that for Mod 3 with only a constant, poles close to

2 π j / 7, j = 1, . . ., 6

arise to capture the weekly pattern. The other two models only show one pole close to the unit circle, a real pole of almost

z = 1

. The pole corresponding to Mod 1 is closer to the unit circle than the one for Mod 2 (see (b)).

For Mod 3 we obtain p-values for the tests of three complex unit roots of 0.05 (

ω = 2 π / 7

), 0.165 (

4 π / 7

) and 0.01 (

6 π / 7

), which are hence all not statistically significant for significance level

α = 0.01

. The corresponding test for

z = 1

shows a p-value of 0.004. This provides evidence against the null hypothesis of the root being present. For Mod 1 the p-value for the test of

z = 1

is 0.28 and hence we cannot reject the null. Mod 2 provides a p-value of 0.023 and hence weak evidence for the presence of the unit root. This can be seen from the distance of the nearest pole from the point

z = 1

in Figure 12.

Jointly this indicates that the location and strength of persistence due to the estimated roots is influenced by the presence of deterministic terms: if the deterministic terms are not included in the model, the cyclical patterns are generated by poles situated close to the unit circle.

The decision whether on top of the deterministic seasonality unit roots exist, is not easy in all cases: for the daily data the locations of the poles indicate that deterministic seasonality is enough to capture weekly fluctuations while a unit root at

z = 1

appears to be needed to capture yearly variations. For hourly data there is evidence that the daily cycle is best captured with a unit root at frequency

ω_{365}

. This leads to the best predictive fit. Finally note that temporal aggregation from hourly data to daily data implies that the frequency

ω_{365}

for hourly data aliases to the frequency

ω = 0

in the daily data. Therefore the higher evidence of a unit root at

z = 1

found in daily data might be a consequence of the unit root at frequency

ω_{365}

found for hourly data, compare [37].

The system matrix estimates as well as the evidence in support of unit roots at

ω_{365}

for hourly data and

z = 1

for daily data that we obtain from the CVA modeling can be taken as starting points in subsequent quasi maximum likelihood estimation.

9. Conclusions

In this paper the asymptotic properties of CVA estimators for seasonally integrated unit root processes are investigated. The main results can be summarized as follows:

CVA provides consistent estimators for long-run and short-run dynamics without knowledge of the location and number of unit roots. Hence the algorithm is robust with respect to the presence of trending components at frequency zero as well as at the other seasonal unit root frequencies.
The singular values calculated in the RRR step reveal information on the total number of unit roots. The distance of the singular values to one can be used to construct a consistent estimator of this quantity.
The eigenvalues of $\hat{A}$ can be used in order to test for the number of common trends. Under the null hypothesis these tests are asymptotically equivalent to the corresponding tests using the true state, making the derivation of asymptotic results and the simulation of the test distribution simple.
An analogous statement holds for the Johansen trace test in the $I (1)$ case and analogous tests in the $M F I (1)$ case calculated on the basis of the estimated state in the restrictive setting of $n \leq s$ . Under the null hypothesis these tests reject and accept asymptotically jointly with the corresponding tests calculated using the true state.
From the simulation exercises we conclude that CVA performs best when the dgp is of the more general VARMA type, the process dimension is moderate to large and the sample size is small. Then it is superior to the likelihood-based procedures based on VAR approximations in terms of the estimation performance and the size and power of $Λ$ , the test developed from CVA. For higher sample sizes the likelihood-based procedures are clearly superior when it comes to the size of the corresponding tests, whereas $Λ$ remains the best test choice in terms of empirical power. The estimation performance is about equal for all procedures when the sample size is high with slight advantages for the likelihood-based procedures.
The simulations also demonstrate that the unit root test results are robust with respect to the distribution of the innovation sequence as well as some forms of conditional heteroskedasticity of the GARCH-type.

Because of the promising performance of CVA and in particular its robustness it can be recommended as a simple way to extract information on the number of common trends from the estimated matrix of transition dynamics. This information can be used in order to reduce the uncertainty in a subsequent likelihood ratio analysis where quasi maximum likelihood estimates can be obtained starting from the CVA estimates. Since the CVA estimates can be obtained for a range of orders numerically fast they are seen as a valuable starting point for the empirical modeling of time series potentially including seasonal cointegration. Moreover they can also be used in situations where the number of seasons is large or even unclear as in hourly data sets as demonstrated in the case study.

Author Contributions

Conceptualization, D.B. and R.B.; methodology, D.B.; software, R.B.; formal analysis, D.B. and R.B.; writing—original draft preparation, D.B. and R.B.; writing—review and editing, D.B. and R.B.; visualization, D.B. and R.B.; supervision, D.B. Both authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation—Projektnummer 276051388) which is gratefully acknowledged. We acknowledge support for the publication costs by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Supporting Material

Appendix A.1. Complex Valued Canonical Form

Additionally to the real valued canonical form (2) we will also use the corresponding complex valued representation obtained by transforming each block corresponding to unit root

z_{j} = cos (ω_{j}) + i sin (ω_{j})

with the transformation matrix

T_{j} = [\begin{matrix} I_{c_{j}} & i I_{c_{j}} \\ I_{c_{j}} & - i I_{c_{j}} \end{matrix}]

leading to the triple of system matrices in the j-th block as:

A_{j, C} = [\begin{matrix} \bar{z_{j}} I_{c_{j}} & 0 \\ 0 & z_{j} I_{c_{j}} \end{matrix}], K_{j, C} = [\begin{matrix} K_{j, C} \\ \bar{K_{j, C}} \end{matrix}],_{j, C} = [\begin{matrix} C_{j, C} / 2 & \bar{C_{j, C}} / 2 \end{matrix}],

such that

x_{t + 1, j, C} = \bar{z_{j}} x_{t, j, C} + K_{j, C} ε_{t}, x_{t, j} = T_{j}^{- 1} [\begin{matrix} x_{t, j, C} \\ \bar{x_{t, j, C}} \end{matrix}] .

Lemma A1.

Let

x_{t} = {[x_{t, 0}^{'}, x_{t, 1}^{'}, \dots, x_{t, S / 2}^{'}, x_{t, •}^{'}]}^{'}

where

x_{t, j}

is generated according to

x_{t + 1, j} = A_{j} x_{t, j} + K_{j} ε_{t}, t \in N

with

A_{j}

as in (2) and

K_{j} = {[K_{j, R}^{'}, K_{j, I}^{'}]}^{'} \in R^{δ_{j} c_{j} \times s}

using iid white noise process

{(ε_{t})}_{t \in N}

where

x_{0, j}

is deterministic. Further let

{(x_{t, •})}_{t \in N}

denote the stationary solution to the equation

x_{t + 1, •} = A_{•} x_{t, •} + K_{•} ε_{t}

such that

M_{•} = E x_{t, •} x_{t, •}^{'} > 0

.

(I) Then using

Q_{T} = \sqrt{(log log T) / T}

for

u_{t} = \sum_{i = 0}^{q} φ_{i} ε_{t + i}

for arbitrary

q \in N, q < \infty,

and coefficients

φ_{i}, i = 0, . . ., q

we have

\begin{matrix} 〈 x_{t, •}, u_{t} 〉 = O (Q_{T}) & , & 〈 u_{t - j}, u_{t} 〉 - E u_{t - j} u_{t}^{'} = O (Q_{T}), \\ 〈 x_{t, j}, x_{t, •} 〉 = O (log T) & , & 〈 x_{t, j}, u_{t} 〉 = O (log T) \\ 〈 x_{t, j}, x_{t, k} 〉 / T = O (log log T) & , & j, k = 0, . . ., S / 2 . \end{matrix}

If

{(ε_{t})}_{t \in Z}

only fulfills Assumptions 1 then the order bounds hold in probability rather than almost surely.

(II) Furthermore for

0 < j, k < S / 2

\begin{matrix} 〈 x_{t, j, C}, ε_{t} 〉 & \overset{d}{\to} & \frac{1}{2} \int_{0}^{1} W_{j} d B_{j, C}^{'} = : M_{j}, \\ 〈 x_{t, j, C}, x_{t, k, C} 〉 / T & \overset{d}{\to} & \{\begin{matrix} \frac{1}{2} \int_{0}^{1} W_{j} W_{j}^{'} : = N_{j} & , & j = k, \\ 0 & , & j \neq k \end{matrix} \\ 〈 x_{t, j}, ε_{t} 〉 & \overset{d}{\to} & [\begin{matrix} \frac{1}{2} \int_{0}^{1} (W_{j, R} d B_{j, R}^{'} + W_{j, I} d B_{j, I}^{'}) \\ \frac{1}{2} \int_{0}^{1} (W_{j, I} d B_{j, R}^{'} - W_{j, R} d B_{j, I}^{'}) \end{matrix}], \\ 〈 x_{t, k}, x_{t, j} 〉 / T & \overset{d}{\to} & \{\begin{matrix} \frac{1}{2} [\begin{matrix} \int_{0}^{1} (W_{k, R} W_{k, R}^{'} + W_{k, I} W_{k, I}^{'}) & \int_{0}^{1} (W_{k, R} W_{k, I}^{'} - W_{k, I} W_{k, R}^{'}) \\ - \int_{0}^{1} (W_{k, R} W_{k, I}^{'} - W_{k, I} W_{k, R}^{'}) & \int_{0}^{1} (W_{k, R} W_{k, R}^{'} + W_{k, I} W_{k, I}^{'}) \end{matrix}] & , & j = k \\ 0 & , & j \neq k \end{matrix} \end{matrix}

where

W_{j} = W_{j, R} + i W_{j, I} = K_{j, C} B_{j, C}, K_{j, C} = K_{j, R} + i K_{j, I}, B_{j, C} = B_{j, R} + i B_{j, I}

and

B_{j, R}, B_{j, I}

are two independent Brownian motions with covariance matrix Ω. For

j = 0

and

j = S / 2

the results hold analogously:

\begin{matrix} 〈 x_{t, 0}, ε_{t} 〉 \overset{d}{\to} \int_{0}^{1} W_{0, R} d W_{0, R}^{'} & , & 〈 x_{t, 0}, x_{t, 0} 〉 / T \overset{d}{\to} \int_{0}^{1} W_{0, R} W_{0, R}^{'}, \\ 〈 x_{t, S / 2}, ε_{t} 〉 \overset{d}{\to} \int_{0}^{1} W_{S / 2, R} d W_{S / 2, R}^{'} & , & 〈 x_{t, S / 2}, x_{t, S / 2} 〉 / T \overset{d}{\to} \int_{0}^{1} W_{S / 2, R} W_{S / 2, R}^{'} . \end{matrix}

Proof.

Most evaluations in (I) are standard, see for example Lemma 4 in [38].

(II) follows from the results in Section 4 of [2] for the complex valued representations or [39] for the corresponding real case. □

Appendix A.2. Perturbation of Eigendecompositions

Lemma A2

(Rayleigh-Schrödinger expansion). Let

{\hat{A}}_{t} = A - δ A_{t}

where

∥ δ A_{t} ∥ \to 0

and where

A = U Λ U^{- 1} \in R^{n \times n}, Λ = d i a g (λ_{1} I_{c_{1}}, . . ., λ_{J} I_{c_{J}}), \sum_{j = 1}^{J} c_{j} = n

is diagonalizable.

U = [U_{1}, . . ., U_{J}] \in C^{n \times n}

is a nonsingular matrix such that for

U^{- 1} = {[V_{1}, . . ., V_{J}]}^{'}

we have

V_{j}^{'} U_{j} = I_{c_{j}}

.

Then for each circle

B (λ_{j}, δ)

around

λ_{j}

not containing any other eigenvalue of A there exist from some t onwards

$c_{j}$ eigenvalues of ${\hat{A}}_{t}$ in the circle $B (λ_{j}, δ)$ around $λ_{j}$
a basis ${\hat{U}}_{t, j}$ for the space spanned by the eigenspaces to these $c_{j}$ eigenvalues such that $V_{j}^{'} {\hat{U}}_{t, j} = I_{c_{j}}$ ,
a sequence of matrices ${\hat{B}}_{t, j} = V_{j}^{'} {\hat{A}}_{t} {\hat{U}}_{t, j} \in C^{c_{j} \times c_{j}}$ .

Then

{\hat{U}}_{t, j} = \sum_{k = 0}^{\infty} Z_{k}, {\hat{B}}_{t, j} = \sum_{k = 0}^{\infty} C_{k}

where

\begin{matrix} Z_{0} = U_{j} & , & C_{0} = λ_{j} I_{c_{j}}, \\ Z_{k} = Σ (δ A_{t} Z_{k - 1} + \sum_{i = 1}^{k - 1} Z_{k - i} C_{i}) & , & C_{k} = - V_{j}^{'} δ A_{t} Z_{k - 1} . \end{matrix}

Here

Σ = U {(Λ - I_{n} λ_{j})}^{+} U^{- 1}

where

d i a g {(s_{1}, . . ., s_{n})}^{+} = d i a g (s_{1}^{+}, . . ., s_{n}^{+})

and

x^{+} = 1 / x, x \neq 0

and zero else, that is

{(Λ - I_{n} λ_{j})}^{+}

denotes a quasi-inverse.

Furthermore for

ρ = ∥ δ A_{t} ∥ < 1

we obtain:

∥ C_{k} ∥ \leq μ_{C} ρ^{k}, ∥ Z_{k} ∥ \leq μ_{Z} ρ^{k}, k \geq 0

.

The results follow directly from Section 2.9 of [23], see in particular Proposition 2.9.1 and the discussion below this proposition. Further note that the results hold for each root separately and hence the restriction

ℓ_{j} = 1

needs to hold only for the investigated root for the results to apply. Finally note that a second order approximation

{\hat{U}}_{t, j} = Z_{0} + Z_{1} + Z_{2}

and

{\hat{B}}_{t, j} = C_{0} + C_{1} + C_{2}

is accurate to the order

o (∥ δ A_{t} ∥^{2})

.

Appendix A.3. Random Transformation of Systems

Lemma A3.

Let the assumptions of Theorem 1 hold and use the same notation as given there. Let

(\tilde{A}, \tilde{C}, \tilde{K})

denote a sequence of systems converging a.s. to

(A, C, K)

such that

(\tilde{A} - A) D_{x}^{- 1} = O ({(log T)}^{a}), \sqrt{T} (\tilde{K} - K) = O ({(log T)}^{a}), (\tilde{C} - C) D_{x}^{- 1} = O ({(log T)}^{a})

and let

A_{0} = S_{0} A S_{0}^{- 1} = d i a g (A_{0, 11}, A_{0, 22}), K_{0} = S_{0} K, C_{0} = C S_{0}^{- 1}

. Further let

S_{T} = [\begin{matrix} S_{T, 11} & S_{T, 12} \\ 0 & S_{T, 22} \end{matrix}] \to S_{0}

such that

(S_{T} - S_{0}) D_{x}^{- 1} = O ({(log T)}^{a})

. Let

Δ S = (S_{T} - S_{0}) D_{x}^{- 1}, Δ A = (\tilde{A} - A) D_{x}^{- 1}

and denote the sequence of transformed systems as

(\hat{A}, \hat{C}, \hat{K}) = (S_{T} \tilde{A} S_{T}^{- 1}, \tilde{C} S_{T}^{- 1}, S_{T} \tilde{K})

. Let the block entries of

S_{0}

be denoted as

S_{i j}

and the blocks of

Δ S

be denoted as

Δ S_{i j}

. Then:

\begin{matrix} T ({\hat{A}}_{11} - A_{0, 11}) & = & (Δ S_{11} A_{11} - A_{0, 11} Δ S_{11} + S_{11} Δ A_{11} + S_{12} Δ A_{21}) S_{11}^{- 1} + o (1), \\ \sqrt{T} ({\hat{A}}_{12} - A_{0, 12}) & = & (S_{11} Δ A_{12} + S_{12} Δ A_{22}) S_{22}^{- 1} + Δ S_{12} S_{22}^{- 1} A_{0, 22} - A_{0, 11} Δ S_{12} S_{22}^{- 1} + o (1), \\ T ({\hat{A}}_{21} - A_{0, 21}) & = & S_{22} Δ A_{21} S_{11}^{- 1} + o (1), \\ \sqrt{T} ({\hat{A}}_{22} - A_{0, 22}) & = & Δ S_{22} S_{22}^{- 1} A_{0, 22} + S_{22} Δ A_{22} S_{22}^{- 1} - A_{0, 22} Δ S_{22} S_{22}^{- 1} + o (1), \\ \sqrt{T} (\hat{K} - K_{0}) & = & [\begin{matrix} Δ S_{12} K_{2} + S_{11} \sqrt{T} ({\tilde{K}}_{1} - K_{1}) + S_{12} \sqrt{T} ({\tilde{K}}_{2} - K_{2}) \\ Δ S_{22} K_{2} + S_{22} \sqrt{T} ({\tilde{K}}_{2} - K_{2}) \end{matrix}] + o (1), \\ (\hat{C} - C_{0}) D_{x}^{- 1} & = & (\hat{C} - C) D_{x}^{- 1} [\begin{matrix} S_{11}^{- 1} & 0 \\ 0 & S_{22}^{- 1} \end{matrix}] - C_{0} [\begin{matrix} Δ S_{11} S_{11}^{- 1} & Δ S_{12} S_{22}^{- 1} \\ 0 & Δ S_{22} S_{22}^{- 1} \end{matrix}] + o (1) . \end{matrix}

Proof.

The proof follows from straightforward computations using the various orders of convergence by neglecting higher order terms. □

Appendix B. Reduced Rank Regression with Integrated Variables

The main results of this paper are based on a more general result documented in [24] (henceforth called BRRR). BRRR uses a slightly different setting and in particular a different dgp. The following lemma provides the essence of the results of BRRR that will be used below.

Lemma A4.

Let

{(y_{t})}_{t \in N}, {(z_{t}^{r})}_{t \in N}, {(z_{t}^{u})}_{t \in N}, y_{t} \in R^{s}, z_{t}^{r} \in R^{m}, z_{t}^{u} \in R^{l}

be three processes related via

y_{t} = b_{r} z_{t}^{r} + b_{u} z_{t}^{u} + u_{t}

where the zero mean stationary process

{(u_{t})}_{t \in N}

is such that

E u_{t} {(z_{t}^{r})}^{'} = 0, E u_{t} {(z_{t}^{u})}^{'} = 0, E u_{t} u_{t}^{'} > 0

and where

n =

rank

(b_{r}) < min (s, m)

, that is

b_{r}

is of reduced rank.

Further assume that there exist square nonsingular matrices

T_{y} \in R^{s \times s}, T_{r} \in R^{m \times m}, T_{u} \in R^{n \times n}

such that

{\tilde{y}}_{t} = T_{y} y_{t} = (T_{y} b_{r} T_{r}^{- 1}) (T_{r} z_{t}^{r}) + (T_{y} b_{r} T_{r}^{- 1}) (T_{r} z_{t}^{r}) + T_{y} u_{t} = {\tilde{b}}_{r} {\tilde{z}}_{t} + {\tilde{b}}_{u} {\tilde{z}}_{t}^{u} + {\tilde{u}}_{t}

such that with

c_{•} = n - c

we have

{\tilde{b}}_{r} = [\begin{matrix} I_{c} & 0 & 0 \\ 0 & 0 & {\tilde{b}}_{r, •} \end{matrix}], {\tilde{b}}_{r, •} = {\tilde{O}}_{•} Γ_{•}^{'}, {\tilde{O}}_{•} \in R^{(s - c) \times c_{•}}, Γ_{•} \in R^{m_{•} \times c_{•}} .

Here the partitioning corresponds to

{\tilde{z}}_{t}^{'} = [{\tilde{z}}_{t, 1}^{'}, {\tilde{z}}_{t, 2}^{'}, {\tilde{z}}_{t, •}^{'}]

where

{\tilde{z}}_{t, 1} \in R^{c}, {\tilde{z}}_{t, 2} \in R^{m - c - m_{•}}

are MFI(1) processes and

{({\tilde{z}}_{t, •})}_{t \in N}, {\tilde{z}}_{t, •} \in R^{m_{•}}

is stationary,

{\tilde{z}}_{t}^{u} = {[{({\tilde{z}}_{t, 1}^{u})}^{'}, {({\tilde{z}}_{t, •}^{u})}^{'}]}^{'}

where

{({\tilde{z}}_{t, 1}^{u})}_{t \in N}

is a MFI(1) process and

{({\tilde{z}}_{t, •}^{u})}_{t \in N}

is stationary and where the following bounds hold (

{\tilde{z}}_{t, :} = {[{\tilde{z}}_{t, 1}^{'}, {\tilde{z}}_{t, 2}^{'}]}^{'}

):

\begin{matrix} 〈 {\tilde{u}}_{t}, {\tilde{u}}_{t} 〉 = O (1) & , & 〈 {\tilde{u}}_{t}, {\tilde{z}}_{t, •} 〉 = O (Q_{T}) & , & 〈 {\tilde{u}}_{t}, {\tilde{z}}_{t, •}^{u} 〉 = O (Q_{T}), \\ 〈 {\tilde{u}}_{t}, {\tilde{u}}_{t} 〉 - E {\tilde{u}}_{t} {\tilde{u}}_{t}^{'} = O (Q_{T}) & , & 〈 {\tilde{u}}_{t}, {\tilde{z}}_{t, :} 〉 = O (log T) & , & 〈 {\tilde{u}}_{t}, {\tilde{z}}_{t, 1}^{u} 〉 = O (log T), \\ {\hat{M}}_{•} = 〈 (\begin{matrix} {\tilde{z}}_{t, •} \\ {\tilde{z}}_{t, •}^{u} \end{matrix}), (\begin{matrix} {\tilde{z}}_{t, •} \\ {\tilde{z}}_{t, •}^{u} \end{matrix}) 〉 & , & {\hat{M}}_{•}^{- 1} = O (1) & , & {\hat{M}}_{•} = O (1), M_{•} > 0 \\ {\hat{M}}_{1} = 〈 (\begin{matrix} {\tilde{z}}_{t, :} \\ {\tilde{z}}_{t, 1}^{u} \end{matrix}), (\begin{matrix} {\tilde{z}}_{t, :} \\ {\tilde{z}}_{t, 1}^{u} \end{matrix}) 〉 & , & {\hat{M}}_{1} / T = O (log log T) & , & {({\hat{M}}_{1})}^{- 1} = O (Q_{T}^{2}), \\ 〈 (\begin{matrix} {\tilde{z}}_{t, •} \\ {\tilde{z}}_{t, •}^{u} \end{matrix}), (\begin{matrix} {\tilde{z}}_{t, :} \\ {\tilde{z}}_{t, 1}^{u} \end{matrix}) 〉 = O (log T) & , & {\hat{M}}_{•} - M_{•} = O (Q_{T}) . \end{matrix}

Then the reduced rank regression estimator

{\hat{b}}_{R R R} = [{\hat{b}}_{r, R R R}, {\hat{b}}_{u, R R R}]

maximizing the Gaussian likelihood subject to rank

(β_{r}) = n = c + c_{•}

is consistent:

{\hat{b}}_{R R R} - b = O ({(log T)}^{a} / \sqrt{T})

for some

a < \infty

. Furthermore

{\tilde{b}}_{R R R, r} - {\tilde{b}}_{r} = [O ({(log T)}^{a} / T), O ({(log T)}^{a} / \sqrt{T})]

with

{\tilde{b}}_{R R R, r} = T_{y} {\hat{b}}_{R R R, r} T_{r}^{- 1}

, where the second block has

m_{•}

columns and corresponds to the stationary components of the regressor vector.

Proof.

The theorem slightly extends the results of BRRR by adding high level assumptions instead of low level assumptions on the data generating process. The proof hence consists in adjusting the proof in BRRR. In the following we only indicate where arguments in BRRR need to be replaced. A detailed proof would replicate much of the arguments in BRRR and hence is omitted.

The representation of Theorem 3.1 in BRRR is contained in the assumptions. Then consistency follows from examining the proof of the first part of Theorem 3.2 in BRRR: essential for the norm bounds are Lemma A.1 (I) and (III). The norm bounds stated under point (I) are directly assumed in this lemma except for the filtered version using

n_{t}

in place of

x_{t}

. Instead, here the results for

n_{t}

which are needed in the proof of Theorem 3.2 of BRRR are directly assumed. (III) then follows. Lemmas A.3–A.5 in BRRR do not depend on the assumptions on the various processes and hence continue to hold. Then the proof for consistency in Appendix A.3.1 of BRRR only uses these norm bounds referring also to [38] (which is also only based on the norm bounds contained in the assumptions of this lemma) and hence continues to hold. □

Appendix C. Proofs of the Theorems

Appendix C.1. Proof of Theorem 1

For proving consistency of the transfer function estimators it is sufficient to find a (possibly) random matrix

{\tilde{S}}_{T}

such that the least squares estimates

(\tilde{A}, \tilde{C}, \tilde{K})

of one representation

(A, C, K)

of the true system obtained using

{\tilde{x}}_{t} : = {\tilde{S}}_{T} {\hat{x}}_{t}

converges (a.s.) to

(A, C, K)

. This will be done in two steps: First a particular basis (which is not realizable in practice) will be chosen such that

{\tilde{K}}_{p} - K_{p} = o (1)

sufficiently fast such that in the second step the regressions in the system equations based on the resulting state estimator

{\tilde{x}}_{t}

are consistent. The derivation of the first step will also provide an approximation of the error term which can be used in order to derive the asymptotic distribution.

Appendix C.1.1. Proof of Theorem 1 (I)

The central step in CVA is the solution to the RRR problem. The following proof heavily draws on the results contained in [24] (henceforth called BRRR) collected in Lemma A4 for easier reference. As in BRRR, in order to derive the asymptotic properties we first transform the vectors in order to separate stationary and nonstationary terms. In order to achieve the separation let

Z_{t} = {[y_{t - 1}^{'}, y_{t - 2}^{'}, . . ., y_{t - S}^{'}]}^{'} \in R^{s S}

. Then for

p = k S

we obtain

Y_{t, p}^{-} = (\begin{matrix} y_{t - 1} \\ y_{t - 2} \\ ⋮ \\ y_{t - S} \\ y_{t - S - 1} \\ ⋮ \\ y_{t - k S} \end{matrix}) = (\begin{matrix} Z_{t} \\ Z_{t - S} \\ ⋮ \\ Z_{t - (k - 1) S} \end{matrix}) .

It is easy to see that for each j the process

{(Z_{r S - j})}_{r \in N}

is an

I (1)

process. Moreover the strict minimum-phase condition for

(A_{\circ}, C_{\circ}, K_{\circ})

implies that also for the system corresponding to

{(Z_{r S - j})}_{r \in N}

the strict minimum-phase condition holds.

Define the transformation

T_{S} : = {[O_{S, 1}, O_{S, ⊥}]}^{'}

where

O_{S, 1} \in R^{s S \times c}

denotes the matrix containing the first c columns of

O_{S}

for the system

(A_{\circ}, C_{\circ}, K_{\circ})

in the canonical form. Further

O_{S, ⊥}

is a block column of an orthonormal matrix such that

O_{S, ⊥}^{'} O_{S, 1} = 0

. Then the argument of [20] shows that in

T_{S} Z_{t}

the first c components are integrated while the remaining

s S - c

components are stationary. Then consider for

p = k S < t \leq T - f + 1

(using

O_{S, 1}^{†} = {(O_{S, 1}^{'} O_{S, 1})}^{- 1} O_{S, 1}^{'}

)

{\tilde{z}}_{t, p} : = [\begin{matrix} O_{S, 1}^{†} O_{S} (x_{t} - {\bar{A}}_{\circ}^{p} x_{t - p}) \\ O_{S, ⊥}^{'} Z_{t} \\ O_{S, 1}^{†} (Z_{t} - Z_{t - S}) \\ O_{S, ⊥}^{'} Z_{t - S} \\ ⋮ \\ O_{S, 1}^{†} (Z_{t - (k - 2) S} - Z_{t - (k - 1) S}) \\ O_{S, ⊥}^{'} Z_{t - (k - 1) S} \end{matrix}], {\tilde{y}}_{t} : = [\begin{matrix} O_{f, 1}^{†} \\ O_{f, ⊥}^{'} \end{matrix}] Y_{t, f}^{+} .

(A1)

Here

O_{f, ⊥}

is a matrix such that

O_{f, ⊥}^{'} O_{f, 1} = 0, O_{f, ⊥}^{'} O_{f, ⊥} = I

. Obviously

{\tilde{z}}_{t, p}

is a linear transformation of

Y_{t, p}^{-}

and

{\tilde{y}}_{t}

of

Y_{t, f}^{+}

. It can be shown that the linear transformation is nonsingular such that there is a one-one relation between

Y_{t, p}^{-}

and

{\tilde{z}}_{t, p}

. In

{\tilde{z}}_{t, p}

and

{\tilde{y}}_{t}

only the first c components are unit root processes, the remaining components being stationary.

For

p \neq k S

the final

p - k S

block rows of

{\tilde{z}}_{t, p}

are defined as

y_{t - (k - 1) S - j} - y_{t - k S - j}, j = 1, . . ., p - k S

. Clearly also these components are stationary.

Partition

{\tilde{z}}_{t, p} = {[{\tilde{z}}_{t, 1}^{'}, {\tilde{z}}_{t, •}^{'}]}^{'}, {\tilde{z}}_{t, 1} \in R^{c},

into its first c and the remaining coordinates (omitting the subscript

_{p}

on the right hand side for notational convenience). Similarly partition

{\tilde{y}}_{t} = {[{\tilde{y}}_{t, 1}^{'}, {\tilde{y}}_{t, •}^{'}]}^{'}, {\tilde{y}}_{t, 1} \in R^{c}

. Using these transformed matrices,

Y_{t, f}^{+} = β_{1} Y_{t, p}^{-} + N_{t, f}^{+}

can be written as

{\tilde{y}}_{t} = {\tilde{b}}_{1} {\tilde{z}}_{t, p} + {\tilde{N}}_{t, f, p}^{+} = [\begin{matrix} {\tilde{y}}_{t, 1} \\ {\tilde{y}}_{t, •} \end{matrix}] = [\begin{matrix} I_{c} & 0 \\ 0 & {\tilde{b}}_{•, p} \end{matrix}] [\begin{matrix} {\tilde{z}}_{t, 1} \\ {\tilde{z}}_{t, •} \end{matrix}] + \tilde{O_{f}} {\bar{A}}_{\circ}^{p} x_{t - p} + [\begin{matrix} {\tilde{ε}}_{t, 1} \\ {\tilde{ε}}_{t, •} \end{matrix}]

(A2)

where

{\tilde{b}}_{1} = [\begin{matrix} I_{c} & 0 \\ 0 & {\tilde{b}}_{•, p} \end{matrix}], {\tilde{b}}_{•, p} = E {\tilde{y}}_{t, •} {\tilde{z}}_{t, •}^{'} {(E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'})}^{- 1} = O_{•, p} Γ_{•, p}^{'}, {\tilde{b}}_{1} = O_{p} Γ_{p}^{'}

and where

{\tilde{b}}_{•, p}

is of rank

n - c

providing a representation of the form given in Theorem 3.1 of BRRR except that the error term

{\tilde{N}}_{t, f, p}^{+}

(defined by the equation) is not white. Finally (A2) also defines the sub blocks

{\tilde{ε}}_{t, i}

of

{\tilde{N}}_{t, f, p}^{+}

which are hence linear combinations of

E_{t, f}^{+}

and therefore typically MA(f) processes. Note that

{\tilde{z}}_{t, 1}, {\tilde{z}}_{t, •}, {\tilde{y}}_{t, •}

depend on the choice of f and p which is not reflected in the notation.

Here

{(E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'})}^{- 1}

and

E {\tilde{y}}_{t, •} {\tilde{z}}_{t, •}^{'}

are worth a remark: for

p = k S

the results of [20] can be directly used to obtain upper and lower bounds for the norms of these matrices uniformly in

k \in N

. For

p \neq k S

the additional rows in

{\tilde{z}}_{t, •}

add entries to

E {\tilde{y}}_{t, •} {\tilde{z}}_{t, •}^{'}

that are of order

O (λ^{p})

for some

0 < λ < 1

as

y_{t} - y_{t - S}

is a VARMA process. Similarly the smallest eigenvalue of

E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'}

can be bounded from below based on arguments for

p = k S

following [20] which in turn refer to Theorem 6.6.10 of [22]. The additional terms for

p \neq k S

correspond to backward innovations with non-singular covariance matrix thus also leading to a lower bound of the smallest eigenvalue uniformly in k. (The backward innovations representation for a stationary VARMA process

{(y_{t})}_{t \in Z}

equals

y_{t} = \sum_{j = 1}^{\infty} k_{j}^{b} y_{t + j} + ε_{t}^{b}

and can be obtained from the complex conjugate of the spectral density. Nonsingularity of the spectral density implies that the backward innovation

ε_{t}^{b}

have nonsingular covariance matrix. This implies a lower bound on the accuracy with which components of

y_{t - (k - 1) S - j}

can be predicted based on

y_{t - i}, i \leq (k - 1) S

.)

Furthermore the strict minimum-phase assumption for the state space representation

(A_{\circ}, C_{\circ}, K_{\circ})

of the process

{(y_{t})}_{t \in Z}

implies the strict minimum-phase assumption for the sub-sampled process

{(Z_{k S + j})}_{k \in Z}

. Thus the arguments of [20] show that

[{\tilde{b}}_{•, p}, 0] \to {\tilde{b}}_{•, \infty}

where the norm of the difference is of order

O (∥ {\bar{A}}_{\circ}^{p} ∥)

. The increase of p as a function of the sample size jointly with the strict minimum-phase assumption implies that

O (∥ {\bar{A}}_{\circ}^{p} ∥) = o (T^{- 1})

. This also implies that

\tilde{O_{f}} {\bar{A}}_{\circ}^{p} x_{t - p} = o_{p} (T^{- 1 / 2})

.

Correspondingly there exists a limiting decomposition

{\tilde{b}}_{•, \infty} = O_{•} Γ_{•}^{'}

such that

Γ_{•}^{'} S_{•} = I_{n - c}

where

S_{•}

denotes a selector matrix whose columns contain the vectors of the canonical basis of

R^{\infty}

. Since

[K_{\circ}, (A_{\circ} - K_{\circ} C_{\circ}) K_{\circ}, {(A_{\circ} - K_{\circ} C_{\circ})}^{2} K_{\circ}, . . ., {(A_{\circ} - K_{\circ} C_{\circ})}^{n - 1} K_{\circ}]

is of full row rank it can be assumed that

S_{•}

only has nonzero entries in its first

n s

rows. Denoting the submatrix of the first

p s

rows by

S_{p, 2}

then also

{[Γ_{•}^{'}]}_{1 : p} S_{p, 2} = I_{n - c}

where

{[.]}_{1 : p}

denotes the first p block columns of a matrix. This fixes a unique decomposition of

{\tilde{b}}_{•}

and hence

O_{•}

and

Γ_{•}

do not depend on p. Convergence of

{\tilde{b}}_{•, p}

to

{\tilde{b}}_{•}

jointly with the lower bound on

p (T)

then implies convergence of order

o (T^{- 1})

of

O_{•, p}

to

O_{•}

and

Γ_{•, p}^{'}

to

{[Γ_{•}^{'}]}_{1 : p}

using the decomposition of

{\tilde{b}}_{•, p}

such that

Γ_{•, p}^{'} S_{p, 2} = I_{n - c}

. Correspondingly

O_{p} \to O

and

∥ Γ_{p}^{'} - {[Γ^{'}]}_{1 : p} ∥ \to 0

.

Therefore the reduced rank regression in the CVA procedure shows the same structure as investigated in Lemma A4 with the differences that

{\tilde{z}}_{t, 2}

and

{\tilde{z}}_{t}^{u}

do not occur, and

{\tilde{z}}_{t, •}

has increasing size as a function of the sample size. The next lemma therefore extends the results of BRRR to the RRR used in CVA:

In the following we will use a generic

a \in N

in statements like

O ({(log T)}^{a})

, not necessarily the same in each occurrence. In this sense e.g., the product of two terms that are

O ({(log T)}^{a})

is again taken to be

O ({(log T)}^{a})

.

Lemma A5.

Let the assumptions of Theorem 1 hold where additionally

{(ε_{t})}_{t \in Z}

is iid. Introduce the notation

\begin{matrix} {\tilde{D}}_{z} = d i a g (T^{- 1 / 2} I_{c}, I_{p s - c}), & {\tilde{D}}_{y} = d i a g (T^{- 1 / 2} I_{c}, I_{f s - c}), & {\tilde{D}}_{x} = d i a g (T^{- 1 / 2} I_{c}, I_{n - c}) . \end{matrix}

Let

{\bar{G}}_{p}

denote a solution to

({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y}) {({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y})}^{- 1} ({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z}) {\bar{G}}_{p} = ({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z}) {\bar{G}}_{p} {\bar{R}}^{2}

using the notation of (A1) where

{\bar{R}}^{2} \to Θ^{2} = d i a g (I_{c}, Θ_{•}) \in R^{n \times n}

and where

{\bar{G}}_{p}

is normalized such that

{\bar{G}}_{1, 1, p} = I_{c}, {\bar{G}}_{•, 2, p}^{'} S_{p, 2} = I_{n - c}

for a selector matrix

S_{p, 2}

. Further let

{\bar{Γ}}_{p} = [\begin{matrix} I_{c} & 0 \\ 0 & {\bar{Γ}}_{•, p} \end{matrix}], {\bar{Γ}}_{•, p}^{'} S_{p, 2} = I_{n - c}

denote the solution to the decoupled problem where the stationary and the nonstationary subproblem are separated:

(\begin{matrix} 〈 {\tilde{z}}_{t, 1}, {\tilde{y}}_{t, 1} 〉 {〈 {\tilde{y}}_{t, 1}, {\tilde{y}}_{t, 1} 〉}^{- 1} 〈 {\tilde{y}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 {\bar{Γ}}_{1, 1, p} \\ 〈 {\tilde{z}}_{t, •}, {\tilde{y}}_{t, •} 〉 {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1} 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p} \end{matrix}) = (\begin{matrix} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 {\bar{Γ}}_{1, 1, p} {\bar{Θ}}_{1} \\ 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p} {\bar{Θ}}_{•} \end{matrix}) .

(I) Then if

f \geq n

fixed independent of T and

p \geq - d log T / log ρ_{0}, d > 1, p = o ({(log T)}^{\bar{a}})

for some

\bar{a} < \infty

the a.s. results of Lemma A.6 (I)-(III) and Lemma A.7 of [24] hold true for

{(log T)}^{3}

replaced by

{(log T)}^{a}

for some integer

a < \infty

. In particular

{\bar{G}}_{p} - {\bar{Γ}}_{p} = O ({(log T)}^{a} / T^{1 / 2})

.

(II) Using the notation

δ G_{p} : = {\bar{G}}_{p} - {\bar{Γ}}_{p}

define

{\tilde{S}}_{T} : = [\begin{matrix} I_{c} & - \sqrt{T} δ G_{•, 1, p}^{'} 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} \\ 0 & I_{p s - c} \end{matrix}], {\bar{Γ}}_{•, p}^{†} : = {\bar{Γ}}_{•, p} {({\bar{Γ}}_{•, p}^{'} 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p})}^{- 1} .

Then for

{\tilde{Γ}}_{p}^{'} : = {\tilde{S}}_{T} {\tilde{D}}_{x}^{- 1} {\bar{G}}_{p}^{'} {\tilde{D}}_{z}

and

Γ^{'} = [\begin{matrix} I & 0 \\ 0 & Γ_{•}^{'} \end{matrix}]

we obtain

{\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p} = [O ({(log T)}^{a} / T), O ({(log T)}^{a} / T^{1 / 2})]

where the partitioning corresponds to the partitioning of

{\tilde{z}}_{t, p}

into

{\tilde{z}}_{t, 1}

and

{\tilde{z}}_{t, •}

. Here

Γ_{•}^{'}

denotes the right factor of

{\tilde{b}}_{•, \infty} = O_{•} Γ_{•}^{'}

such that

{[Γ_{•}^{'}]}_{1 : p} S_{p, 2} = I_{n - c}

holds.

(III) Let the assumptions of Theorem 1 hold. Then

{\hat{Z}}_{T} : = T v e c (({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) [\begin{matrix} I_{c} \\ 0 \end{matrix}])

converges in distribution.

Proof.

(I) First consider the entries of the vectors

{\tilde{y}}_{t, •}

and

{\tilde{z}}_{t, •}

(see (A1)) more closely. Since in

O_{f, ⊥}^{'} Y_{t, f}^{+} = O_{f, ⊥}^{'} (O_{f, •} x_{t, •} + E_{f} E_{t, f}^{+})

the nonstationary directions are filtered by definition,

{\tilde{y}}_{t, •}

is stationary and does not depend on T.

Further, also

{\tilde{z}}_{t, •}

is stationary for fixed p as the nonstationary directions are either filtered by pre-multiplication with

O_{S, ⊥}^{'}

or by yearly differencing

Z_{t} - Z_{t - S}

.

Therefore we obtain from stationary theory for fixed

p = k S

that

∥ E {\tilde{y}}_{t, •} {\tilde{z}}_{t, •}^{'} {(E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'})}^{- 1} - 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 {〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉}^{- 1} ∥ = o (1) .

Here

{sup}_{p} ∥ {(E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'})}^{- 1} ∥ < \infty

has been discussed before. Now

E {\tilde{y}}_{t, •} {\tilde{z}}_{t, •}^{'} {(E {\tilde{z}}_{t, •} {\tilde{z}}_{t, •}^{'})}^{- 1} = {\tilde{β}}_{•, p} + o (T^{- 1 / 2}) = O_{•, p} {[Γ_{•}^{'}]}_{1 : p} + o (T^{- 1 / 2})

where the

o (T^{- 1 / 2})

term appears due to neglecting

\tilde{O_{f}} {\bar{A}}^{p} x_{t - p}

. It follows that

det [{({\tilde{β}}_{•, p} S_{p, 2})}^{'} ({\tilde{β}}_{•, p} S_{p, 2})] = det [O_{•, p}^{'} O_{•, p}] > 0

and hence

∥ {\hat{β}}_{•, p} - {\tilde{β}}_{•, p} ∥_{F r} = o (1)

implies

{lim}_{T \to \infty} det [{({\hat{β}}_{•, p} S_{p, 2})}^{'} ({\hat{β}}_{•, p} S_{p, 2})] > 0

a.s. where

{\hat{β}}_{•, p} : = 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 {〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉}^{- 1}

. Since

{\hat{O}}_{•, p} {\bar{Γ}}_{•, p}^{'} - {\tilde{β}}_{•, p} = o (1)

due to consistency, also

lim_{T \to \infty} det [{({\hat{O}}_{•, p} {\bar{Γ}}_{•, p}^{'} S_{p, 2})}^{'} ({\hat{O}}_{•, p} {\bar{Γ}}_{•, p}^{'} S_{p, 2})] = lim_{T \to \infty} det {\hat{O}}_{•, p}^{'} {\hat{O}}_{•, p} det {({\bar{Γ}}_{•, p}^{'} S_{p, 2})}^{2} > 0 a . s .

Since

{\bar{Γ}}_{•, p} - Γ_{•, p} = o (1)

due to the definition of

{\bar{Γ}}_{•, p}

and the continuity of the solution of the eigenvalue problem it follows that

{\hat{O}}_{•, p} - O_{•, p} = o (1)

and therefore

lim {sup}_{T} det {\hat{O}}_{•, p}^{'} {\hat{O}}_{•, p} > 0

. As in Lemma 6 of [40] it can be shown that

Γ_{•, p}^{'} - {[Γ_{•}^{'}]}_{1 : p} = o (T^{- 1})

and

O_{•, p} = O_{•} + o (T^{- 1})

for the range of p given in Theorem 1 since these matrices correspond to a stationary problem. Hence the chosen normalization of

{\bar{Γ}}_{•, p}

can be used a.s.

Next in order to obtain the convergence of

\bar{G}

to

{\bar{Γ}}_{p}

, Lemma A.6 of BRRR is slightly extended to the current situation (for details and notation see there). Lemma A.6 of BRRR contains three parts: BRRR(I) gives bounds on the error in the matrices (with

l_{T} = log T

)

\begin{matrix} δ_{y z} & = [\begin{matrix} \frac{1}{T} 〈 {\tilde{y}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & \frac{1}{\sqrt{T}} 〈 {\tilde{y}}_{t, 1}, {\tilde{z}}_{t, •} 〉 \\ \frac{1}{\sqrt{T}} 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, 1} 〉 & 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 \end{matrix}] - [\begin{matrix} \frac{1}{T} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & 0 \\ 0 & 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 \end{matrix}] = [\begin{matrix} O (\frac{1}{T} l_{T}^{a}) & O (\frac{1}{\sqrt{T}} l_{T}^{a}) \\ O (\frac{1}{\sqrt{T}} l_{T}^{a}) & 0 \end{matrix}], \\ δ_{y y} & = [\begin{matrix} \frac{1}{T} 〈 {\tilde{y}}_{t, 1}, {\tilde{y}}_{t, 1} 〉 & \frac{1}{\sqrt{T}} 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉 \\ \frac{1}{\sqrt{T}} 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, 1} 〉 & 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉 \end{matrix}] - [\begin{matrix} \frac{1}{T} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & 0 \\ 0 & 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉 \end{matrix}] = [\begin{matrix} O (\frac{1}{T} l_{T}^{a}) & O (\frac{1}{\sqrt{T}} l_{T}^{a}) \\ O (\frac{1}{\sqrt{T}} l_{T}^{a}) & 0 \end{matrix}], \\ δ_{z z} & = [\begin{matrix} \frac{1}{T} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & \frac{1}{\sqrt{T}} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, •} 〉 \\ \frac{1}{\sqrt{T}} 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, 1} 〉 & 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 \end{matrix}] - [\begin{matrix} \frac{1}{T} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & 0 \\ 0 & 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 \end{matrix}] = [\begin{matrix} 0 & O (\frac{1}{\sqrt{T}} l_{T}^{a}) \\ O (\frac{1}{\sqrt{T}} l_{T}^{a}) & 0 \end{matrix}] . \end{matrix}

BRRR(II) deals with

J = \bar{Q} - \bar{Φ} =

{\tilde{D}}_{z} 〈 {\tilde{z}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y} {({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y})}^{- 1} {\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{z}}_{t} 〉 {\tilde{D}}_{z} - [\begin{matrix} \frac{1}{T} 〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, 1} 〉 & 0 \\ 0 & 〈 {\tilde{z}}_{t, •}, {\tilde{y}}_{t, •} 〉 {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1} 〈 {\tilde{y}}_{t, •}, {\tilde{z}}_{t, •} 〉 \end{matrix}]

and BRRR(III) shows that there exists a solution

{\bar{G}}_{p}

converging to a solution

{\bar{Γ}}_{p}

of the separated problem.

For showing the orders of convergence of

δ_{z z}

the arguments are unchanged except for noting that in

〈 {\tilde{z}}_{t, 1}, {\tilde{z}}_{t, •} 〉

the number of columns increases as a function of the sample size. Since the a.s. bounds on the entries of this expression hold uniformly (as follows straightforwardly from the arguments of Lemma A.1 of BRRR) this does not change the arguments. With respect to

δ_{y z}

note that now

{\tilde{y}}_{t} = {\tilde{β}}_{1} {\tilde{z}}_{t, p} + {\tilde{ε}}_{t} + \tilde{O_{f}} {\bar{A}}^{p} x_{t - p}

. Due to the increase of p as a function of the sample size,

{\bar{A}}^{p} = o (T^{- 1 - ϵ})

for small enough

ϵ > 0

and therefore

\tilde{O_{f}} {\bar{A}}^{p} x_{t - p} = o (T^{- 1 / 2 - ϵ / 2})

since

x_{t} = o (T^{(1 + ϵ) / 2})

(uniformly in

1 \leq t \leq T

) whether

{(x_{t})}_{t \in Z}

is a unit root process or stationary. Hence

〈 \tilde{O_{f}} {\bar{A}}^{p} x_{t - p}, \tilde{O_{f}} {\bar{A}}^{p} x_{t - p} 〉 = o (1)

. Further

〈 \tilde{O_{f}} {\bar{A}}^{p} x_{t - p}, {\tilde{ε}}_{t} 〉 = o (T^{- 1 / 2})

follows from

〈 x_{t - p}, {\tilde{ε}}_{t} 〉 = O (log T)

(see Lemma A.1 (I)). This shows that the additional term is always of lower order and can be neglected. The remaining arguments follow exactly as in the proof of Lemma A.6 of BRRR. The proof of Lemma A.7 of BRRR only uses the order bounds derived above and hence follows immediately. This shows (I).

(II) Using the definition of

{\tilde{S}}_{T}

we obtain:

\begin{matrix} {\tilde{Γ}}_{p}^{'} & = & {\tilde{S}}_{T} {\tilde{D}}_{x}^{- 1} {\bar{G}}_{p}^{'} {\tilde{D}}_{z} = {\tilde{S}}_{T} [\begin{matrix} I_{c} & \sqrt{T} δ G_{•, 1, p}^{'} \\ δ G_{1, 2, p}^{'} / \sqrt{T} & {\bar{G}}_{•, 2, p}^{'} \end{matrix}] \\ = & [\begin{matrix} I_{c} - δ G_{•, 1, p}^{'} 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} δ G_{1, 2, p}^{'} & \sqrt{T} δ G_{•, 1, p}^{'} (I - 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} {\bar{G}}_{•, 2, p}^{'}) \\ δ G_{1, 2, p}^{'} / \sqrt{T} & {\bar{G}}_{•, 2, p}^{'} \end{matrix}] . \end{matrix}

From (I) and Lemma A.7 of BRRR

δ G_{•, 1, p} = O ({(log T)}^{a} / T^{1 / 2}), δ G_{1, 2, p} = O ({(log T)}^{a} / T^{1 / 2})

and

{\bar{G}}_{•, 2, p} - {\bar{Γ}}_{•, p} = o ({(log T)}^{a} / T^{1 / 2})

. Finally

δ G_{•, 1, p}^{'} (I - 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} {\bar{G}}_{•, 2, p}^{'}) = δ G_{•, 1, p}^{'} (I - 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} {\bar{Γ}}_{•, p}^{'}) + O ({(log T)}^{a} / T) = O ({(log T)}^{a} / T)

as in the proof of Lemma A.7 of BRRR. Using Lemma A.5 (III) of BRRR with

{\hat{Ξ}}_{f} = {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1 / 2}

it follows that

{\bar{Γ}}_{•, p}^{'} - Γ_{•, p}^{'} = O ({(log T)}^{a} T^{- 1 / 2})

. Since

{\bar{G}}_{•, 2, p} - {\bar{Γ}}_{•, p} {= o ((log T)}^{a} /

T^{1 / 2})

the same rate of convergence holds for

{\bar{G}}_{•, 2, p}^{'} - Γ_{•, p}^{'} = O ({(log T)}^{a} / T^{1 / 2})

. It follows that

{\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p} = [O ({(log T)}^{a} / T), O ({(log T)}^{a} / T^{1 / 2})]

.

(III) From above we have

T ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) (\begin{matrix} I_{c} \\ 0 \end{matrix}) = [\begin{matrix} - (\sqrt{T} δ G_{•, 1, p}^{'} 〈 {\tilde{z}}_{t, •}, {\tilde{z}}_{t, •} 〉 {\bar{Γ}}_{•, p}^{†} \sqrt{T} δ G_{1, 2, p}^{'}) \\ \sqrt{T} δ G_{1, 2, p}^{'} \end{matrix}] + o_{P} (1) .

(A3)

Now from the proof of Lemma A.7 of BRRR we obtain

{[\sqrt{T} δ G_{•, 1, p}]}^{'} = Ξ O_{•} {(I - Θ_{•}^{2})}^{- 1} {\bar{Γ}}_{•, p}^{'} + o_{P} (1) .

Furthermore using the expression given in Lemma A.7 of BRRR:

\begin{matrix} \sqrt{T} δ G_{1, 2, p} & = & \sqrt{T} Z_{11}^{- 1} [δ_{z z}^{1 •} Γ_{•, p} Θ_{•}^{2} - J_{1, •} Γ_{•, p}] {(I - Θ_{•}^{2})}^{- 1} + o_{P} (1) \\ = & \sqrt{T} Z_{11}^{- 1} [δ_{z z}^{1 •} Γ_{•, p} (Θ_{•}^{2} - I) + [δ_{z z}^{1, •} - J_{1, •}] Γ_{•, p}] {(I - Θ_{•}^{2})}^{- 1} + o_{P} (1) \\ = & - Z_{11}^{- 1} 〈 {\tilde{z}}_{t, 1}, x_{t, •} 〉 - Z_{11}^{- 1} \sqrt{T} [J_{1, •} - δ_{z z}^{1, •}] Γ_{•, p} {(I - Θ_{•}^{2})}^{- 1} + o_{P} (1) \\ = & - Z_{11}^{- 1} 〈 {\tilde{z}}_{t, 1}, x_{t, •} 〉 - Z_{11}^{- 1} E {\tilde{ε}}_{t, 1} {\tilde{ε}}_{t, •}^{'} {(E {\tilde{y}}_{t, •} {({\tilde{y}}_{t, •})}^{'})}^{- 1} E {\tilde{y}}_{t, •} x_{t, •}^{'} {(I - Θ_{•}^{2})}^{- 1} + o_{P} (1) . \end{matrix}

This shows the result. □

The transformations in the representation lead to an estimator

\bar{G}

taking the place of

\hat{K_{p}}

. Using

{\tilde{S}}_{T}

as defined in Lemma A5 the corresponding estimator

{\tilde{Γ}}_{p}^{'} = {\tilde{S}}_{T} {\tilde{D}}_{x}^{- 1} {\bar{G}}_{p}^{'} {\tilde{D}}_{z}

fulfills

{\tilde{Γ}}_{p}^{'} - Γ_{p}^{'} = [O ({(log T)}^{a} / T), O ({(log T)}^{a} / \sqrt{T})]

.

Based on this result let

(A, C, K)

denote the realization of the true transfer function in the state basis corresponding to

Γ_{p}^{'}

where

Γ_{p}^{'} S_{p} = I_{n}

and let

(\tilde{A}, \tilde{C}, \tilde{K})

denote the (unfeasible) CVA estimates using

{\tilde{x}}_{t} : = {\tilde{Γ}}_{p}^{'} {\tilde{z}}_{t, p}

. The next lemma then provides the main ingredients for the rest of the proofs:

Lemma A6.

Let the assumptions of Theorem 1 hold and define

D_{x} : = d i a g (I_{c} T^{- 1}, I_{n - c} T^{- 1 / 2})

. Then there exists an integer

a < \infty

such that

\begin{matrix} (\tilde{A} - A) D_{x}^{- 1} = O ({(log T)}^{a}), & (\tilde{C} - C) D_{x}^{- 1} = O ({(log T)}^{a}), & (\tilde{K} - K) = O ({(log T)}^{a} / T^{1 / 2}) . \end{matrix}

Proof.

First note that the regression of

Y_{t, f}^{+}

onto

Y_{t, p}^{-}

includes time points

t = p + 1, . . ., T - f + 1

whereas for estimating the system matrices we can use

{\hat{x}}_{t}, t = p + 1, . . ., T + 1

and

y_{t}, t = p + 1, . . ., T

. Thus in this proof we use

{〈 a_{t}, b_{t} 〉}_{p + 1}^{T} : = T^{- 1} \sum_{t = p + 1}^{T} a_{t} b_{t}^{'}

instead of

〈 a_{t}, b_{t} 〉 = T^{- 1} \sum_{t = p + 1}^{T - f + 1} a_{t} b_{t}^{'}

.

The following orders of convergence are straightforward to derive using the results of Lemma A1,

{\bar{A}}^{p} = o (T^{- 1}), ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) D_{z}^{- 1} = O ({(log T)}^{a})

and

{\tilde{x}}_{t} - x_{t} = ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) {\tilde{z}}_{t, p} - {\bar{A}}^{p} x_{t - p}, t > p

according to Lemma A5 and Lemma A1 for the range of p given in Theorem 1:

\begin{matrix} {〈 ε_{t}, {\tilde{x}}_{t} - x_{t} 〉}_{p + 1}^{T} = O (p {(log T)}^{a} / T) & , & {\tilde{D}}_{z} {〈 {\tilde{z}}_{t, p}, {\tilde{x}}_{t} - x_{t} 〉}_{p + 1}^{T} = O (p {(log T)}^{a} / T^{1 / 2}) \\ {\tilde{D}}_{z} {〈 {\tilde{z}}_{t + 1, p}, {\tilde{x}}_{t} - x_{t} 〉}_{p + 1}^{T} = O (p {(log T)}^{a} / T^{1 / 2}) & , & {\tilde{D}}_{x} {〈 x_{t}, {\tilde{x}}_{t} - x_{t} 〉}_{p + 1}^{T} = O (p {(log T)}^{a} / T^{1 / 2}) \\ {〈 {\tilde{x}}_{t} - x_{t}, {\tilde{x}}_{t} - x_{t} 〉}_{p + 1}^{T} = O (p^{2} {(log T)}^{a} / T) & . \end{matrix}

Using these orders of convergence we obtain

{\tilde{D}}_{x} {〈 {\tilde{x}}_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} = {\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} + O (p^{2} {(log T)}^{a} / T^{1 / 2}) > 0 a . s .

From Lemma A1 also

{({\tilde{D}}_{x} {〈 {\tilde{x}}_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} = {({\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} (1 + o (1)) = O ({(log T)}^{a})

.

Therefore

\begin{matrix} (\tilde{C} - C) D_{x}^{- 1} & = & \sqrt{T} ({〈 ε_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} - {C 〈 {\tilde{x}}_{t} - x_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T}) {\tilde{D}}_{x} {({\tilde{D}}_{x} {〈 {\tilde{x}}_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} \\ = & \sqrt{T} {〈 ε_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} {({\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} + o (1))}^{- 1} \\ - \sqrt{T} C {〈 {\tilde{x}}_{t} - x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} {({\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} + o (1) = O (p {(log T)}^{a}) . \end{matrix}

(A4)

This in particular establishes consistency for the estimate. Next analogously (using the notation

δ x_{t} = {\tilde{x}}_{t} - x_{t}

) we obtain

(\tilde{A} - A) D_{x}^{- 1} =

\begin{matrix} \sqrt{T} {〈 {\tilde{x}}_{t + 1} - A {\tilde{x}}_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} {({\tilde{D}}_{x} {〈 {\tilde{x}}_{t}, {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} \\ = & \sqrt{T} ({〈 ({\tilde{x}}_{t + 1} - x_{t + 1}) + (x_{t + 1} - A x_{t}) + A (x_{t} - {\tilde{x}}_{t}), {\tilde{x}}_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x}) {({\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x} + o (1))}^{- 1} \\ = & \sqrt{T} ({〈 δ x_{t + 1}, x_{t} 〉}_{p + 1}^{T} - A {〈 δ x_{t}, x_{t} 〉}_{p + 1}^{T} + {〈 K ε_{t}, x_{t} 〉}_{p + 1}^{T}) {\tilde{D}}_{x} {({\tilde{D}}_{x} {〈 x_{t}, x_{t} 〉}_{p + 1}^{T} {\tilde{D}}_{x})}^{- 1} + o (1) \\ = & O (p {(log T)}^{a}) \end{matrix}

(A5)

and therefore consistency for

\tilde{A}

is established. Finally note that for

{\hat{ε}}_{t} = y_{t} - \tilde{C} {\tilde{x}}_{t} = ε_{t} + C (x_{t} - {\tilde{x}}_{t}) + (C - \tilde{C}) {\tilde{x}}_{t}

it follows that

{〈 {\hat{ε}}_{t}, {\hat{ε}}_{t} 〉}_{p + 1}^{T} = Ω + O (p^{2} {(log T)}^{a} / T^{1 / 2})

. Furthermore since

{\hat{ε}}_{t}

denotes the residuals of the regression of

y_{t}

onto

{\tilde{x}}_{t}

it follows that

{〈 {\hat{ε}}_{t}, {\bar{x}}_{t} 〉}_{p + 1}^{T} = 0

. From this we obtain

\begin{matrix} \sqrt{T} (\tilde{K} - K) & = & \sqrt{T} ({〈 {\tilde{x}}_{t + 1} - K {\hat{ε}}_{t}, {\hat{ε}}_{t} 〉}_{p + 1}^{T} {({〈 {\hat{ε}}_{t}, {\hat{ε}}_{t} 〉}_{p + 1}^{T})}^{- 1} \\ = & \sqrt{T} ({〈 ({\tilde{x}}_{t + 1} - x_{t + 1}) - A δ x_{t} + K (ε_{t} - {\hat{ε}}_{t}), {\hat{ε}}_{t} 〉}_{p + 1}^{T}) {({〈 {\hat{ε}}_{t}, {\hat{ε}}_{t} 〉}_{p + 1}^{T})}^{- 1} \\ = & \sqrt{T} ({〈 δ x_{t + 1} - A δ x_{t} + K (ε_{t} - {\hat{ε}}_{t}), {\hat{ε}}_{t} 〉}_{p + 1}^{T}) Ω^{- 1} (1 + o (1)) \\ = & \sqrt{T} ({〈 δ x_{t + 1} - A δ x_{t} + K (ε_{t} - {\hat{ε}}_{t}), ε_{t} 〉}_{p + 1}^{T}) Ω^{- 1} (1 + o (1)) + o (1) \\ = & (\sqrt{T} {〈 δ x_{t + 1}, ε_{t} 〉}_{p + 1}^{T}) Ω^{- 1} + o (1) = (\sqrt{T} {〈 ({\tilde{Γ}}_{p}^{'} - Γ_{p}^{'}) {\bar{z}}_{t + 1, p}, ε_{t} 〉}_{p + 1}^{T}) Ω^{- 1} + o (1) \\ = & O (p {(log T)}^{a}) . \end{matrix}

(A6)

□

These expressions do not only show consistency of a specific order, but also give the relevant highest order terms for the asymptotic distribution, which are used below.

As

\hat{C} {\hat{A}}^{j} \hat{K} = \tilde{C} {\tilde{A}}^{j} \tilde{K} \to C A^{j} K = C_{\circ} A_{\circ}^{j} K_{\circ}

, Lemma A6 establishes consistency for the impulse response sequence

\hat{C} {\hat{A}}^{j} \hat{K}

(thus proofs Theorem 1 (I)) as well as, jointly with

p = O ({(log T)}^{a})

, the rate of convergence

O ({(log T)}^{a} / T^{1 / 2})

for the not realizable choice of the basis and the impulse response sequence

C A^{j} K

.

Appendix C.1.2. Proof of Theorem 1 (II)

In order to arrive at the canonical representation

(\overset{ˇ}{A}, \overset{ˇ}{C}, \overset{ˇ}{K})

two steps are performed: first the reordered Jordan normal form is calculated, afterwards the matrices

{\tilde{C}}_{j, C}

are transformed such that

E_{j}^{'} {\overset{ˇ}{C}}_{j, C} = I_{c_{j}}

holds. We will follow these steps below.

In the first step a transformation matrix

\hat{U}

needs to be found such that

\tilde{A} = \hat{U} \tilde{A} {\hat{U}}^{- 1}

is in reordered Jordan normal form. In this respect

\tilde{A}

and

A

are used in Lemma A2. Accordingly

{\hat{U}}_{t} = [{\hat{U}}_{t, 1}, . . ., {\hat{U}}_{t, S / 2}, {\hat{U}}_{t, •}]

can be defined such that

V_{j}^{'} {\hat{U}}_{t, j} = I_{c_{j}}

where

U \in R^{n \times n}

corresponds to the transformation from

A

to

A_{\circ}

as given in the theorem. An appropriate choice of

{\tilde{z}}_{t, 1}

leads to

U = I_{n}

. Furthermore the basis in the space spanned by the columns of

{\hat{U}}_{t, •}

where

{\hat{U}}_{t, j}^{'} {\hat{U}}_{t, •} = 0

can be chosen such that

[0, I] {\hat{U}}_{t, •} = I

for large enough T.

A first order approximation according to Lemma A2 then leads to

{\hat{U}}_{t, j} = U_{j} + Z_{1} + O (∥ \hat{A} {- A ∥}^{2}) = U_{j} - Σ (\hat{A} - A) U_{j} + O (∥ \hat{A} - A ∥^{2})

for

j = 0, . . ., S / 2

. Consequently

∥ {\hat{U}}_{t, j} - U_{j} ∥ = O ({(log T)}^{a} T^{- 1})

and thus also

{\hat{U}}_{t} - U = O ({(log T)}^{a} T^{- 1})

. Consequently the order of convergence for the transformed system

(\hat{A}, \hat{C}, \hat{K})

is unchanged. In a second step an upper triangular transformation matrix

\tilde{U}

can be found transforming

(\hat{A}, \hat{C}, \hat{K})

such that

\tilde{A}

corresponds to the reordered Jordan normal form. Due to the upper block triangularity of this transform we can apply Lemma A3 to show that the order of convergence remains identical.

For the second step note that Lemma A3 provides the required terms: An application to the block diagonal transformation

S_{T} = diag (E_{1}^{'} {\tilde{C}}_{1, C}, . . ., E_{S / 2}^{'} {\tilde{C}}_{S / 2, C}, S_{T, •})

, where

S_{T, •}

transforms the stationary subsystem to echelon form, concludes the proof.

Appendix C.1.3. Proof of Theorem 1 (III)

The only argument that uses the iid assumption is the almost sure convergence of

{({\tilde{D}}_{x} 〈 x_{t}, x_{t} 〉 {\tilde{D}}_{x})}^{- 1}

. Weakening the assumptions on the noise implies that this order of convergence still holds in probability while the almost sure version cannot be shown with the tools of this paper. This concludes the proof of Theorem 1.

Appendix C.2. Proof of Theorem 2

Using the notation introduced in (A1),

\hat{X} = {\tilde{D}}_{z} 〈 {\tilde{y}}_{t}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z} {({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z})}^{- 1} {\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{z} {({\tilde{D}}_{z} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{z})}^{- 1} \to X_{\circ} = [\begin{matrix} I_{c} & 0 \\ 0 & X_{\circ, •} \end{matrix}]

for a suitable matrix

X_{\circ, •}

. The eigenvalues of

\hat{X}

are the squares of the singular values of the RRR problem in the first step of CVA. Therefore

\begin{matrix} T \sum_{i = 1}^{c} (1 - {\hat{σ}}_{i}^{2}) & = & - T t r [U_{c}^{'} (\hat{X} - X_{\circ}) [U_{c} - {(X_{\circ} - I)}^{†} (\hat{X} - X_{\circ}) U_{c}]] + o_{P} (1) \\ = & - T t r [Δ X_{11} - Δ X_{12} {(X_{\circ, •} - I)}^{†} Δ X_{21}] + o_{P} (1) \end{matrix}

according to a second order approximation in the Rayleigh-Schrödinger expansions (Lemma A2). Now, in the current situation we obtain

(I - \hat{X}) [\begin{matrix} I \\ 0 \end{matrix}] =

\begin{matrix} = & ({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y} - {\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z} {({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z})}^{- 1} {\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y}) {({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y})}^{- 1} [\begin{matrix} I \\ 0 \end{matrix}] \\ = & ({\tilde{D}}_{y} 〈 {\tilde{ε}}_{t}, {\tilde{ε}}_{t} 〉 {\tilde{D}}_{y} - {\tilde{D}}_{y} 〈 {\tilde{ε}}_{t}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z} {({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z})}^{- 1} {\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{ε}}_{t} 〉 {\tilde{D}}_{y}) {({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y})}^{- 1} [\begin{matrix} I \\ 0 \end{matrix}] . \end{matrix}

Furthermore

〈 {\tilde{ε}}_{t}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z} {({\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{z}}_{t, p} 〉 {\tilde{D}}_{z})}^{- 1} {\tilde{D}}_{z} 〈 {\tilde{z}}_{t, p}, {\tilde{ε}}_{t} 〉 = O_{P} (T^{- 1})

and

{({\tilde{D}}_{y} 〈 {\tilde{y}}_{t}, {\tilde{y}}_{t} 〉 {\tilde{D}}_{y})}^{- 1} [\begin{matrix} I \\ 0 \end{matrix}] = [\begin{matrix} I \\ - {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1} 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, 1} 〉 / \sqrt{T} \end{matrix}] {(〈 {\tilde{y}}_{t, 1}^{π}, {\tilde{y}}_{t, 1}^{π} 〉 / T)}^{- 1}

where

{\tilde{y}}_{t, 1}^{π} = {\tilde{y}}_{t, 1} - 〈 {\tilde{y}}_{t, 1}, {\tilde{y}}_{t, •} 〉 {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1} {\tilde{y}}_{t, •} .

From this we get using

E {\tilde{ε}}_{t, •} {\tilde{ε}}_{t, •}^{'} = E {\tilde{y}}_{t, •} {\tilde{y}}_{t, •}^{'} - X_{\circ, •} E {\tilde{y}}_{t, •} {\tilde{y}}_{t, •}^{'}

:

\begin{matrix} T (I_{c} - {\hat{X}}_{1, 1}) & = & (〈 {\tilde{ε}}_{t, 1}, {\tilde{ε}}_{t, 1} 〉 - 〈 {\tilde{ε}}_{t, 1}, {\tilde{ε}}_{t, •} 〉 {〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, •} 〉}^{- 1} 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, 1} 〉) {(〈 {\tilde{y}}_{t, 1}^{π}, {\tilde{y}}_{t, 1}^{π} 〉 / T)}^{- 1} + o_{P} (1), \\ \sqrt{T} Δ X_{2, 1} & = & (- 〈 {\tilde{ε}}_{t, •}, {\tilde{ε}}_{t, 1} 〉 + (I - X_{\circ, •}) 〈 {\tilde{y}}_{t, •}, {\tilde{y}}_{t, 1} 〉) {(〈 {\tilde{y}}_{t, 1}^{π}, {\tilde{y}}_{t, 1}^{π} 〉 / T)}^{- 1} + o_{P} (1) \\ \sqrt{T} Δ X_{1, 2} & = & - E {\tilde{ε}}_{t, 1} {\tilde{ε}}_{t, •}^{'} {(E {\tilde{y}}_{t, •} {\tilde{y}}_{t, •}^{'})}^{- 1} + o_{P} (1) . \end{matrix}

Thus

T \sum_{i = 1}^{c} (1 - {\hat{σ}}_{i}^{2}) =

\begin{matrix} = & t r [(〈 {\tilde{ε}}_{t, 1}, {\tilde{ε}}_{t, 1} 〉 - E {\tilde{ε}}_{t, 1} {\tilde{ε}}_{t, •}^{'} {(E {\tilde{y}}_{t, •} {\tilde{y}}_{t, •}^{'})}^{- 1} {(I - X_{0, •})}^{- 1} E {\tilde{ε}}_{t, •} {\tilde{ε}}_{t, 1}^{'}) {(〈 {\tilde{y}}_{t, 1}, {\tilde{y}}_{t, 1} 〉 / T)}^{- 1}] + o_{P} (1) \\ = & t r [(〈 {\tilde{ε}}_{t, 1}, {\tilde{ε}}_{t, 1} 〉 - E {\tilde{ε}}_{t, 1} {\tilde{ε}}_{t, •}^{'} {(E {\tilde{ε}}_{t, •} {\tilde{ε}}_{t, •}^{'})}^{- 1} E {\tilde{ε}}_{t, •} {\tilde{ε}}_{t, 1}^{'}) {(〈 {\tilde{y}}_{t, 1}, {\tilde{y}}_{t, 1} 〉 / T)}^{- 1}] + o_{P} (1) \overset{d}{\to} Z . \end{matrix}

Appendix C.3. Proof of Theorem 3

The proof of Theorem 3 follows the same path as the proof of Theorem 1. In (A5) it was shown that the asymptotic distribution of

T ({\tilde{A}}_{11} - A_{\circ, 11})

depends on

\begin{matrix} 〈 {\tilde{x}}_{t + 1, j} - x_{t + 1, j}, x_{t, k} 〉, 〈 {\tilde{x}}_{t, j} - x_{t, j}, x_{t, k} 〉 & , & 〈 ε_{t}, x_{t, j} 〉, 〈 x_{t, k}, x_{t, j} 〉 / T \end{matrix}

for

j, k = 0, . . ., S / 2

. Note that

δ x_{t + i} = {\tilde{x}}_{t + i} - x_{t + i} = ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) {\tilde{z}}_{t + i, p} + o_{P} (T^{- 1})

for

i = 0, 1

. Then the results of Lemma A5 show that the first c columns of

({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p})

converge to a random variable (below denoted as

Z_{Γ}

) when multiplied with T while the remaining columns converge in distribution when multiplied with

\sqrt{T}

. Therefore

〈 δ x_{t + i}, x_{t, k} 〉 = T ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) \frac{〈 {\tilde{z}}_{t + i, p}, x_{t, k} 〉}{T} + o_{P} (1) = T ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) [\begin{matrix} I_{c} \\ 0 \end{matrix}] \frac{〈 {\tilde{z}}_{t + i, 1}, x_{t, k} 〉}{T} + o_{P} (1) .

Due to the definition (A1),

{\tilde{z}}_{t, 1} = {[x_{t, j}]}_{j = 0, . . ., S / 2} + o (T^{- 1})

and hence (using

A_{\circ} = d i a g (A_{\circ, u}, A_{\circ, •})

)

〈 {\tilde{z}}_{t + 1, 1}, x_{t, k} 〉 / T = A_{\circ, u} 〈 {\tilde{z}}_{t, 1}, x_{t, k} 〉 / T + o (1) .

Considering now the complex-valued representation and using the notation

Δ Γ_{1} : = T ({\tilde{Γ}}_{p}^{'} - {[Γ^{'}]}_{1 : p}) [\begin{matrix} I_{c} \\ 0 \end{matrix}], S_{j} = [0_{c_{j}, \sum_{i < j} c_{i}}, I_{c_{j}}, 0_{c_{j}, \sum_{i > j} c_{j}}]

where

S_{j} {\tilde{z}}_{t, 1} = x_{t, j, C}

, it follows that the contribution of these two terms to the limiting distribution of the diagonal block corresponding to the unit root

z_{j}

amounts to (using

〈 x_{t, j, C}, x_{t, k, C} 〉 / T \to 0

for

k \neq j

and

δ x_{t, j, C} = {\tilde{x}}_{t, j, C} - x_{t, j, C}

)

〈 δ x_{t + 1, j, C}, x_{t, j, C} 〉 - A_{\circ, j j} 〈 δ x_{t, j, C}, x_{t, j, C} 〉 =

\begin{matrix} = & S_{j} Δ Γ_{1} A_{\circ, u} \frac{〈 {\tilde{z}}_{t, 1}, x_{t, j, C} 〉}{T} - A_{\circ, j j} S_{j} Δ Γ_{1} \frac{〈 {\tilde{z}}_{t, 1}, x_{t, j, C} 〉}{T} + o_{P} (1) \\ = & S_{j} Δ Γ_{1} S_{j}^{'} A_{\circ, j j} \frac{〈 x_{t, j, C}, x_{t, j, C} 〉}{T} - A_{\circ, j j} S_{j} Δ Γ_{1} S_{j}^{'} \frac{〈 x_{t, j, C}, x_{t, j, C} 〉}{T} + o_{P} (1) \\ = & S_{j} Δ Γ_{1} S_{j}^{'} \bar{z_{j}} \frac{〈 x_{t, j, C}, x_{t, j, C} 〉}{T} - \bar{z_{j}} S_{j} Δ Γ_{1} S_{j}^{'} \frac{〈 x_{t, j, C}, x_{t, j, C} 〉}{T} + o_{P} (1) = o_{P} (1) . \end{matrix}

Therefore, for the diagonal blocks in (A5) these two terms do not contribute and the asymptotic distribution is determined by

T 〈 K_{\circ, j} ε_{t}, x_{t, j} 〉 {〈 x_{t, j}, x_{t, j} 〉}^{- 1}

for which the asymptotic results are provided in Lemma A1. This also shows that estimating the state does not change the asymptotic distribution in the diagonal blocks as the impact of

{\tilde{Γ}}_{p} - Γ_{p}

is of lower order.

In order to derive the distribution of the sum of the eigenvalues note that as in the proof of Theorem 2, according to Lemma A2 the sum of the eigenvalues of

\tilde{A}

converging to

z_{j}

obeys the following second order approximation:

\begin{matrix} T \sum_{i = 1}^{c_{j}} ({\hat{λ}}_{i} - z_{j}) & = & T t r [U_{j}^{'} (\hat{A} - A_{\circ}) [U_{j} - A_{\circ} {(z_{j})}^{†} (\hat{A} - A_{\circ}) U_{j}]] + o_{P} (T^{- 1}) \\ = & T t r [{\hat{A}}_{\circ, j j} - z_{j} I_{c_{j}}] + o_{P} (1) \end{matrix}

since

(\hat{A} - A_{\circ}) U_{j} = O ({(log T)}^{a} T^{- 1})

in this case implying that the second order terms vanish. Thus we obtain the asymptotic distribution under the null hypothesis as the limiting distribution of

T t r [〈 K_{\circ, j, C} ε_{t}, x_{t, j, C} 〉 {〈 x_{t, j, C}, x_{t, j, C} 〉}^{- 1}] .

It is easy to verify that this test statistic is pivotal for complex and real unit roots. This proves Theorem 3.

Appendix C.4. Proof of Theorem 4

The result for

{\tilde{C}}_{m}

can be shown using the results of [4]. As the eigenvalues are insensitive to changes in the basis we can assume without restriction of generality that the only unit root components in

T X_{t}^{(m)}

are contained in the first

c_{m}

rows:

c_{t}^{(m)} : = T X_{t}^{(m)} = [\begin{matrix} c_{t, u}^{(m)} \\ c_{t, •}^{(m)} \end{matrix}], {\tilde{D}}_{c} = [\begin{matrix} T^{- 1} I_{c_{m}} & 0 \\ 0 & I_{n - c_{m}} \end{matrix}] .

Due to the filtering,

c_{t, •}^{(m)}

is stationary while

c_{t, u}^{(m)}

contains the unit root

z_{m}

. Then the relevant matrix

{\hat{X}}_{m}

can be written as

{\hat{X}}_{m} : = 〈 c_{t - 1}^{π}, p_{t}^{π} 〉 {〈 p_{t}^{π}, p_{t}^{π} 〉}^{- 1} 〈 p_{t}^{π}, c_{t - 1}^{π} 〉 {〈 c_{t - 1}^{π}, c_{t - 1}^{π} 〉}^{- 1} .

Since

p_{t} = K ε_{t - 1} + \sum_{j = 1, j \neq m}^{S} α_{j} β_{j}^{'} X_{t - 1}^{(j)} + [0, {\tilde{α}}_{m}] c_{t - 1}^{(m)},

we consequently have

p_{t}^{π} = K ε_{t - 1}^{π} + [0, {\tilde{α}}_{m}] c_{t - 1}^{π}

. Therefore, for the three components of

{\hat{X}}_{m}

we obtain with appropriate definitions of the random variables

S_{m}, T_{m}

and using standard asymptotics

\begin{matrix} 〈 p_{t}^{π}, p_{t}^{π} 〉 = 〈 K ε_{t - 1}^{π} + {\tilde{α}}_{m} c_{t - 1, •}^{π}, K ε_{t - 1}^{π} + {\tilde{α}}_{m} c_{t - 1, •}^{π} 〉 & \to & K (E ε_{t - 1} ε_{t - 1}^{'}) K^{'} + {\tilde{α}}_{m} E c_{t - 1, •}^{Π} {(c_{t - 1, •}^{Π})}^{'} {\tilde{α}}_{m}^{'} > 0, \\ 〈 p_{t}^{π}, c_{t - 1}^{π} 〉 = 〈 K ε_{t - 1}^{π} + {\tilde{α}}_{m} c_{t - 1, •}^{π}, c_{t - 1}^{π} 〉 & \overset{d}{\to} & [S_{m}, {\tilde{α}}_{m} E c_{t - 1, •}^{Π} {(c_{t - 1, •}^{Π})}^{'}], \\ 〈 p_{t}^{π}, c_{t - 1}^{π} 〉 {〈 c_{t - 1}^{π}, c_{t - 1}^{π} 〉}^{- 1} {\tilde{D}}_{c}^{- 1} & = & [0, {\tilde{α}}_{m}] + 〈 K ε_{t - 1}^{π}, c_{t - 1}^{π} 〉 {〈 c_{t - 1}^{π}, c_{t - 1}^{π} 〉}^{- 1} {\tilde{D}}_{c}^{- 1} \\ 〈 K ε_{t - 1}^{π}, c_{t - 1}^{π} 〉 {〈 c_{t - 1}^{π}, c_{t - 1}^{π} 〉}^{- 1} {\tilde{D}}_{c}^{- 1} & \overset{d}{\to} & [T_{m}, 0] . \end{matrix}

Correspondingly the first block column

{\hat{X}}_{m, u}

of

{\hat{X}}_{m}

converges to zero such that

T {\hat{X}}_{m, u}

converges in distribution while the second block column converges in probability without normalization. This shows that

\begin{matrix} T \sum_{i = 1}^{c_{m}} {\hat{λ}}_{i} & = & T t r [U_{m}^{'} ({\hat{X}}_{m} - X_{m}) [U_{m} - X_{m}^{†} ({\hat{X}}_{m} - X_{m}) U_{m}] + o_{P} (1) = t r [T {\hat{X}}_{m, u u}] + o_{P} (1) \end{matrix}

converges in distribution. The limit is given in [4].

For the case of the estimated state note that the difference between the estimated and the true state is given as

{\tilde{x}}_{t} - x_{t} = {\tilde{Γ}}_{p}^{'} {\tilde{z}}_{t, p} - Γ_{p}^{'} {\tilde{z}}_{t, p} - {\bar{A}}^{p} x_{t - p} = {({\tilde{Γ}}_{p} - Γ_{p})}^{'} {\tilde{z}}_{t, p} - {\bar{A}}^{p} x_{t - p} .

The strict minimum-phase assumption and the assumption on the increase of

p = p (T)

implies that the second term can be neglected being

o_{P} (T^{- 1})

. Furthermore

{({\tilde{Γ}}_{p} - Γ_{p})}^{'} {\tilde{z}}_{t, p} = {({\tilde{Γ}}_{p} - Γ_{p})}^{'} {\tilde{D}}_{z}^{- 1} {\tilde{D}}_{z} {\tilde{z}}_{t, p}, {({\tilde{Γ}}_{p} - Γ_{p})}^{'} {\tilde{D}}_{z}^{- 1} = O_{P} (T^{- 1 / 2}) .

Using this it can be concluded that

\begin{matrix} 〈 {\hat{p}}_{t}, {\hat{p}}_{t} 〉 = 〈 p_{t}, p_{t} 〉 + O_{P} (T^{- 1 / 2}) & , & 〈 {\hat{p}}_{t}, {\hat{c}}_{t - 1, •}^{(m)} 〉 = 〈 p_{t}, c_{t - 1, •}^{(m)} 〉 + O_{P} (T^{- 1 / 2}), \\ 〈 {\hat{p}}_{t}, {\hat{c}}_{t - 1, u}^{(m)} 〉 = 〈 p_{t}, c_{t - 1, u}^{(m)} 〉 + O_{P} (T^{- 1 / 2}) & , & 〈 {\hat{c}}_{t, u}^{(m)}, {\hat{c}}_{t, u}^{(k)} 〉 = 〈 c_{t, u}^{(m)}, c_{t, u}^{(k)} 〉 + O_{P} (1) . \end{matrix}

These equations imply that the difference between the expression using the true state and the one using the estimated state converges to zero, implying that the two tests accept and reject jointly asymptotically under the null hypothesis.

References

Rodrigues, P.M.; Taylor, A. Alternative estimators and unit root tests for seasonal autoregressive processes. J. Econom. 2004, 120, 35–73. [Google Scholar] [CrossRef]
Johansen, S.; Schaumburg, E. Likelihood Analysis of Seasonal Cointegration. J. Econom. 1999, 88, 301–339. [Google Scholar] [CrossRef]
Hylleberg, S.; Engle, R.; Granger, C.; Yoo, B. Seasonal Integration and Cointegration. J. Econom. 1990, 44, 215–238. [Google Scholar] [CrossRef]
Cubadda, G. Complex Reduced Rank Models For Seasonally Cointegrated Time Series. Oxf. Bull. Econ. Stat. 2001, 63, 497–511. [Google Scholar] [CrossRef]
Cubadda, G.; Omtzigt, P. Small-sample improvements in the statistical analysis of seasonally cointegrated systems. Comput. Stat. Data Anal. 2005, 49, 333–348. [Google Scholar] [CrossRef][Green Version]
Ahn, S.K.; Cho, S.; Seong, B. Inference of Seasonal Cointegration: Gaussian Reduced Rank Estimation and Tests for Various Types of Cointegration. Oxford Bull. Econ. Stat. 2004, 66, 261–284. [Google Scholar] [CrossRef]
Vivas, E.; Allende-Cid, H.; Salas, R. A Systematic Review of Statistical and Machine Learning Methods for Electrical Power Forecasting with Reported MAPE Score. Entropy 2020, 22, 1412. [Google Scholar] [CrossRef]
García-Martos, C.; Rodríguez, J.; Sánchez, M.J. Forecasting electricity prices and their volatilities using Unobserved Components. Energy Econ. 2011, 33, 1227–1239. [Google Scholar] [CrossRef]
Dufour, J.M.; Stevanović, D. Factor-augmented VARMA models with macroeconomic applications. J. Bus. Econ. Stat. 2013, 31, 491–506. [Google Scholar] [CrossRef]
Dias, G.; Kapetanios, G. Estimation and forecasting in vector autoregressive moving average models for rich datasets. J. Econom. 2018, 202, 75–91. [Google Scholar] [CrossRef]
Foroni, C.; Marcellino, M.; Stevanović, D. Mixed-frequency models with moving-average components. J. Appl. Econom. 2019, 34, 688–706. [Google Scholar] [CrossRef]
Kascha, C.; Trenkler, C. Simple Identification and specification of cointegrated VARMA models. J. Appl. Econom. 2015, 30, 675–702. [Google Scholar] [CrossRef]
Ravenna, F. Vector autoregressions and reduced form representations of DSGE models. J. Monet. Econ. 2007, 54, 2048–2064. [Google Scholar] [CrossRef]
Komunjer, I.; Zhu, Y. Likelihood ratio testing in linear state space models: An application to dynamic stochastic general equilibrium models. J. Econom. 2020, 218, 561–586. [Google Scholar] [CrossRef]
Bauer, D.; Wagner, M. A State Space Canonical Form for Unit Root Processes. Econom. Theory 2012, 28, 1313–1349. [Google Scholar] [CrossRef]
Larimore, W.E. System Identification, reduced order filters and modeling via canonical variate analysis. In Proceedings of the 1983 American Control Conference, San Francisco, CA, USA, 22–24 June 1983; pp. 445–451. [Google Scholar]
Bauer, D. Comparing the CCA subspace method to quasi maximum likelihood methods in the case of no exogenous inputs. J. Time Ser. Anal. 2006, 26, 631–668. [Google Scholar] [CrossRef]
Bauer, D. Using Subspace Methodes for Estimating ARMA models for multivariate time series with conditionally heteroskedastic innovations. Econom. Theory 2008, 24, 1063–1092. [Google Scholar]
Bauer, D. Using Subspace Methods to Model Long-Memory Processes. In Theory and Applications of Time Series Analysis. ITISE 2018. Contributions to Statistics; Valenzuela, O., Rojas, F., Pomares, H., Rojas, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
Bauer, D.; Wagner, M. Estimating Cointegrated Systems Using Subspace Algorithms. J. Econom. 2002, 111, 47–84. [Google Scholar] [CrossRef]
Bauer, D. Estimating linear dynamical systems using subspace methods. Econom. Theory 2005, 21, 181–211. [Google Scholar] [CrossRef]
Hannan, E.J.; Deistler, M. The Statistical Theory of Linear Systems; John Wiley: New York, NY, USA, 1998. [Google Scholar]
Chatelin, F. Eigenvalues of Matrices; John Wiley & Sons: Hoboken, NJ, USA, 1993. [Google Scholar]
Bauer, D. Asymptotic Distribution of Estimators in Reduced Rank Regression Settings When the Regressors Are Integrated. Technical Report. 2012. Available online: http://arxiv.org/abs/1211.1439 (accessed on 26 March 2021).
Phillips, P.C.B.; Durlauf, S.N. Multiple Time Series Regression with Integrated Processes. Rev. Econ. Stud. 1986, LIII, 473–495. [Google Scholar] [CrossRef]
Carrasco, M.; Chen, X. Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models. Econom. Theory 2002, 18, 17–39. [Google Scholar] [CrossRef]
Bauer, D.; Wagner, M. Autoregressive Approximations to MFI(1) Processes; Technical Report; Department for Mathematical Methods in Economics: TU Wien, Austria, 2004. [Google Scholar]
Bierens, H. Nonparametric cointegration analysis. J. Econom. 1997, 77, 379–404. [Google Scholar] [CrossRef]
Wagner, M. A Comparison of Johansen’s, Bierens’ and the Subspace Algorithm Method for Cointegration Analysis. Oxf. Bull. Econ. Stat. 2004, 66, 399–424. [Google Scholar] [CrossRef]
Johansen, S.; Nielsen, M. The cointegrated vector autoregressive model with general deterministic terms. J. Econom. 2018, 202, 214–229. [Google Scholar] [CrossRef]
Lee, H. Maximum Likelihood Inference on Cointegration and Seasonal Cointegration. J. Econom. 1992, 54, 1–47. [Google Scholar] [CrossRef]
Bauer, D.; Wagner, M. Using subspace algorithm cointegration analysis: Simulation performance and application to the term structure. Comput. Stat. Data Anal. 2009, 53, 1954–1973. [Google Scholar] [CrossRef]
Bauer, D. Order Estimation for Subspace Methods. Automatica 2001, 37, 1561–1573. [Google Scholar] [CrossRef]
Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2006. [Google Scholar]
Qu, Z.; Perron, P. A Modified Information Criterion for Cointegration Tets Based on a VAR Approximation. Econom. Theory 2007, 23, 638–658. [Google Scholar] [CrossRef]
Mulla, R. Hourly Energy Consumption. Available online: www.kaggle.com/robikscube/hourly-energy-consumption/ (accessed on 22 January 2021).
del Barrio Castro, T.; Rodrigues, P.M.M.; Taylor, A.M.R. Temporal Aggregation of Seasonally Near-Integrated Processes. J. Time Ser. Anal. 2019, 40, 872–886. [Google Scholar] [CrossRef]
Bauer, D. Almost sure bounds on the estimation error for ols estimators when the regressors include certain MFI(1) processes. Econom. Theory 2009, 25, 571–582. [Google Scholar] [CrossRef]
Ahn, S.; Reinsel, G. Estimation of Partially Nonstationary Vector Autoregressive Models with Seasonal Behaviour. J. Econom. 1994, 1994 62, 317–350. [Google Scholar] [CrossRef]
Bauer, D.; Deistler, M.; Scherrer, W. Consistency and Asymptotic Normality of some Subspace Algorithms for Systems Without Observed Inputs. Automatica 1999, 35, 1243–1254. [Google Scholar] [CrossRef]

Figure 1. Eigenvalues around

z = i

of 1000 replications when

γ = 0.2

(

c_{π / 2} = 1

).

Figure 1. Eigenvalues around

z = i

of 1000 replications when

γ = 0.2

(

c_{π / 2} = 1

).

Figure 2. Relationship between hit rates and chosen values of f and k, illustration for the VARMA dgp using A₂. The lower x-axes show f or k, above are the choice frequencies of the selection criteria.

Figure 3. Empirical power of the different test procedures (VARMA dgp with

A_{2}

). Twice the Monte Carlo standard error is 0.005.

Figure 3. Empirical power of the different test procedures (VARMA dgp with

A_{2}

). Twice the Monte Carlo standard error is 0.005.

Figure 4. Mean of absolute value of one day ahead prediction error over all four components. CVA (blue), AR (red) and long AR (black). Dash-dot lines refer to the t-distribution.

Figure 5. Results of the unit root tests for all seasonal unit roots jointly.

Figure 6. Electricity consumption data.

Figure 7. Periodic patterns from dummy variables.

Figure 8. BIC values for univariate models and multivariate model (dashed line; divided by four to fit).

Figure 9. Results for the hourly datasets.

Figure 10. Residual analysis.

Figure 11. Results for the hourly datasets.

Figure 12. Poles (x) and zeros (o) of the transfer functions corresponding to the three models: Mod 1 (red), Mod 2 (blue), Mod 3 (magenta).

Table 1. Eigenvalues of the coefficient matrix of the companion form.

		j
		1	2	3	4	5	6	7	8
$γ = 0.2$	$z_{j}$	−1	1	i	−i	0.126 + i0.99	0.126 − i0.99	−0.790	0.737
$γ = 0.2$	$\| z_{j} \|$	1	1	1	1	0.998	0.998	0.790	0.737
$γ = 0$	$μ_{j}$	−1	i	−i	1	i	−i	0.775	−0.775
$γ = 0$	$\| μ_{j} \|$	1	1	1	1	1	1	0.775	0.775

Table 2. Hit rates for the different tests (VAR dgp). Twice the maximum (over all entries) Monte Carlo standard error is 0.005.

		0		$π / 2$					$π$
	T	$Λ$	J	$Λ$	JS	Q1	Q2	Q3	$Λ$	J
$γ = 0$	50	0.685	0.348	0.351	0.903	0.844	0.851	0.844	0.681	0.343
	100	0.841	0.732	0.490	0.925	0.900	0.902	0.900	0.831	0.724
	200	0.897	0.951	0.841	0.934	0.925	0.924	0.925	0.876	0.936
	500	0.931	0.938	0.916	0.949	0.941	0.942	0.941	0.927	0.948
$γ = 0.2$	50	0.550	0.367	0.811	0.796	0.777	0.778	0.788	0.604	0.297
	100	0.711	0.801	0.087	0.920	0.913	0.908	0.908	0.799	0.806
	200	0.907	0.922	0.855	0.954	0.949	0.948	0.947	0.854	0.939
	500	0.944	0.953	0.927	0.939	0.938	0.938	0.936	0.924	0.942

Table 3. Hit rates for the different tests (VARMA dgp). Twice the maximum (over all entries) Monte Carlo standard error is 0.005.

		0		$π / 2$					$π$
	T	$Λ$	J	$Λ$	JS	Q1	Q2	Q3	$Λ$	J
$A_{1}$	50	0.890	0.003	0.906	0.024	0.027	0.032	0.025	0.897	0.008
	100	0.928	0.434	0.944	0.755	0.783	0.783	0.761	0.930	0.440
	200	0.936	0.937	0.923	0.925	0.915	0.916	0.915	0.925	0.924
	500	0.852	0.901	0.853	0.919	0.906	0.904	0.904	0.853	0.894
$A_{2}$	50	0.863	0.008	0.785	0.062	0.047	0.063	0.039	0.867	0.006
	100	0.917	0.500	0.880	0.582	0.596	0.596	0.571	0.916	0.518
	200	0.931	0.927	0.882	0.908	0.915	0.913	0.911	0.919	0.922
	500	0.824	0.882	0.786	0.878	0.860	0.859	0.861	0.812	0.865

Table 4. Mean gaps between estimated and true cointegrating spaces (VARMA dgp).

2 M C s e

denotes twice the maximal Monte Carlo standard error for the corresponding row.

Table 4. Mean gaps between estimated and true cointegrating spaces (VARMA dgp).

2 M C s e

denotes twice the maximal Monte Carlo standard error for the corresponding row.

			0			$π / 2$				$π$
	T	2MCse	CVA	J	GARR	CVA	JS	cRRR	GARR	CVA	J	GARR
$A_{1}$	50	0.016	0.116	0.189	0.192	0.091	0.147	0.130	0.130	0.111	0.192	0.197
	100	0.004	0.047	0.048	0.048	0.039	0.035	0.035	0.035	0.047	0.046	0.046
	200	0.003	0.023	0.019	0.019	0.019	0.016	0.016	0.016	0.024	0.019	0.019
	500	0.003	0.012	0.007	0.007	0.008	0.008	0.006	0.006	0.011	0.007	0.007
$A_{2}$	50	0.016	0.174	0.245	0.242	0.250	0.349	0.331	0.331	0.165	0.231	0.234
	100	0.004	0.072	0.061	0.061	0.098	0.080	0.078	0.078	0.069	0.060	0.060
	200	0.003	0.031	0.026	0.026	0.047	0.036	0.034	0.034	0.032	0.027	0.027
	500	0.003	0.016	0.011	0.010	0.021	0.015	0.013	0.013	0.017	0.011	0.011

Table 5. Percentage of accept (minimum for all unit root frequencies) and reject (maximum for non unit root frequencies) of

Λ (1)

test statistic.

Table 5. Percentage of accept (minimum for all unit root frequencies) and reject (maximum for non unit root frequencies) of

Λ (1)

test statistic.

	Unit Root Frequencies					Non Unit Root Frequencies
T	norm	G1	IG1	IG2	IG3	norm	G1	IG1	IG2	IG3
104	0.94	0.89	0.87	0.88	0.87	0.87	0.82	0.79	0.82	0.79
208	0.98	0.96	0.95	0.94	0.96	0.78	0.75	0.72	0.72	0.69
312	0.97	0.96	0.96	0.95	0.95	0.00	0.00	0.00	0.00	0.00

Table 6. Summary of data sets.

Region	Daily Obs. (4263 est., 577 val.)					Hourly Obs. (102,291 est., 13,845 val.)
Region	Mean	Mean(log)	Std.(log)	AIC	BIC	Mean(log)	Std.(log)	AIC	BIC
AEP	371,844	12.82	0.127	43	12	9.63	0.168	782	532
DAYTON	48,897	10.79	0.144	43	3	7.60	0.193	772	531
DOM	262,727	12.47	0.158	17	3	9.28	0.215	795	554
DUQ	39,837	10.58	0.130	23	7	7.40	0.177	800	529

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bauer, D.; Buschmeier, R. Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method. Entropy 2021, 23, 436. https://doi.org/10.3390/e23040436

AMA Style

Bauer D, Buschmeier R. Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method. Entropy. 2021; 23(4):436. https://doi.org/10.3390/e23040436

Chicago/Turabian Style

Bauer, Dietmar, and Rainer Buschmeier. 2021. "Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method" Entropy 23, no. 4: 436. https://doi.org/10.3390/e23040436

APA Style

Bauer, D., & Buschmeier, R. (2021). Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method. Entropy, 23(4), 436. https://doi.org/10.3390/e23040436

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method

Abstract

1. Introduction

2. Model Set and Assumptions

3. Canonical Variate Analysis

4. Asymptotic Properties of the System Estimators

5. Inference Based on the Subspace Estimators

6. Deterministic Terms

7. Simulations

7.1. VAR Processes

7.2. VARMA Processes

7.3. Robustness of Unit Root Tests for Daily Data

8. Application

9. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Conflicts of Interest

Appendix A. Supporting Material

Appendix A.1. Complex Valued Canonical Form

Appendix A.2. Perturbation of Eigendecompositions

Appendix A.3. Random Transformation of Systems

Appendix B. Reduced Rank Regression with Integrated Variables

Appendix C. Proofs of the Theorems

Appendix C.1. Proof of Theorem 1

Appendix C.1.1. Proof of Theorem 1 (I)

Appendix C.1.2. Proof of Theorem 1 (II)

Appendix C.1.3. Proof of Theorem 1 (III)

Appendix C.2. Proof of Theorem 2

Appendix C.3. Proof of Theorem 3

Appendix C.4. Proof of Theorem 4

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI