Asymptotic Properties of Estimators for Seasonally Cointegrated State Space Models Obtained Using the CVA Subspace Method

This paper investigates the asymptotic properties of estimators obtained from the so-called CVA (canonical variate analysis) subspace algorithm proposed by Larimore (1983) in the case where the data is generated by a minimal state space system containing unit roots at the seasonal frequencies such that the yearly difference is a stationary vector autoregressive moving average (VARMA) process. The empirically most important special cases of such data generating processes are the I(1) case as well as the case of seasonally integrated quarterly or monthly data. However, datasets with a higher sampling rate, such as hourly, daily or weekly observations, are increasingly available, for example for electricity consumption. In these cases the vector error correction (VECM) representation of the vector autoregressive (VAR) model is not very helpful, as it demands the parameterization of one matrix per seasonal unit root. Even for weekly series this amounts to 52 matrices using yearly periodicity; for hourly data this is prohibitive. For such processes estimation using quasi-maximum likelihood maximization is extremely hard, since the Gaussian likelihood typically has many local maxima while the parameter space is often high-dimensional. Additionally, estimating a large number of models to test hypotheses on the cointegrating rank at the various unit roots becomes practically impossible, for example for weekly data. This paper shows that in this setting CVA provides consistent estimators of the transfer function generating the data, making it a valuable initial estimator for subsequent quasi-likelihood maximization. Furthermore, the paper proposes new tests for the cointegrating rank at the seasonal frequencies, which are easy to compute and numerically robust, making the method suitable for automatic modeling. 
A simulation study demonstrates by example that for processes of moderate to large dimension the new tests may outperform traditional tests based on long VAR approximations in sample sizes typically found in quarterly macroeconomic data. Further simulations show that the unit root tests are robust with respect to different distributions for the innovations as well as with respect to GARCH-type conditional heteroskedasticity. Moreover, an application to Kaggle data on hourly electricity consumption by different American providers demonstrates the usefulness of the method in practice. Therefore the CVA algorithm provides a very useful initial guess for subsequent quasi maximum likelihood estimation and also delivers relevant information on the cointegrating ranks at the different unit root frequencies. It is thus a useful tool, for example in (but not limited to) automatic modeling applications where a large number of time series involving a substantial number of variables need to be modelled in parallel.


Introduction
Many time series show seasonal patterns that, according to [1] for example, cannot be modeled appropriately using seasonal dummies because they exhibit a slowly trending behavior typical for unit root processes.
To model such processes in the vector autoregressive (VAR) framework, Ref. [2] (abbreviated as JS in the following) extend the error correction representation for seasonally integrated autoregressive processes pioneered by [3] to the multivariate case. This vector error correction formulation (VECM) models the yearly differences of a process observed S times per year. The model includes systems having unit roots at some or all of the possible locations z_j = exp(2πij/S), j = 0, ..., S − 1, of seasonal unit roots. In JS all unit roots are assumed to be simple such that the process of yearly differences is stationary.
In this setting JS propose an estimator for the autoregressive polynomial subject to restrictions on its rank (the so-called cointegrating rank) at the unit roots z_j, based on an iterative scheme focusing on one pair of complex conjugate unit roots (or on the unit roots z_j = 1 or z_j = −1, respectively) at a time. The main idea here is the reformulation of the model using the so-called vector error correction representation. Besides estimators, JS also derive likelihood ratio tests for the cointegrating rank at the various unit roots.
Refs. [4,5] propose simpler estimation schemes based on complex reduced rank regression (cRRR in the following). They also show that their numerically simpler algorithm leads to test statistics for the cointegrating rank that are asymptotically equivalent to the quasi maximum likelihood tests of JS. These schemes still typically alternate between cRRR problems corresponding to different unit roots until convergence, although a one step version estimating only once at each unit root exists. Ref. [6] provides updating equations for quasi maximum likelihood estimation in situations where constraints on the parameters prohibit focusing on one unit root at a time.
The leading case here is that of quarterly data (S = 4), where potential unit roots are located at ±1 and ±i, implying that the VECM representation contains four potentially rank restricted matrices. However, time series of much higher sampling frequency, such as hourly, daily or weekly observations, are increasingly available. In such cases it is unrealistic that all unit roots are present. If a unit root is not present, the corresponding matrix in the VECM is of full rank. Therefore, in situations with only a few unit roots present, the VECM requires a large number of parameters to be estimated. Also, in cases with a long period length (such as, for example, hourly data with yearly cycles) usage of the VECM involves estimating coefficient matrices for all lags up to at least one year.
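The parameter count argument can be made explicit with a few lines of arithmetic. The sketch below uses s = 5 variables as an illustrative assumption and counts one potentially rank restricted s × s matrix per seasonal unit root, ignoring lag augmentation and deterministic terms:

```python
# One potentially rank restricted s x s matrix per seasonal unit root z_j,
# j = 0, ..., S-1, in the VECM of the yearly differences. Illustrative counts
# only; s = 5 is an assumption, not a value from the paper.
def vecm_matrix_count(S):
    return S

def vecm_param_count(s, S):
    return S * s * s

for name, S in [("quarterly", 4), ("weekly", 52), ("hourly", 24 * 365)]:
    print(name, vecm_matrix_count(S), vecm_param_count(5, S))
```

For quarterly data this yields the four matrices mentioned above; for weekly data 52 matrices, and for hourly data with yearly cycles 8760 matrices, which illustrates why the VECM becomes impractical at high sampling rates.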
In general, for processes of moderate to large dimension the VAR framework involves estimation of a large number of parameters, which potentially can be avoided by using the more flexible vector autoregressive moving average (VARMA) framework or the, in a sense, equivalent state space framework. This setting has been used in empirical research for the modeling of electricity markets, see the survey [7] for a long list of contributions. In particular, ref. [8] use the model described below without formal verification of the asymptotic theory for the quasi maximum likelihood estimation.
Recently, ref. [9] show that in the setting of dynamic factor models, typically used for observation processes of high dimension, the common assumption that the factors are generated using a vector autoregression, jointly with the assumption that the idiosyncratic component is white noise (or more generally generated using a VAR or VARMA model independent of the factors), leads to a VARMA process. Also a number of papers (see for example [10][11][12]) show that in their empirical applications the usage of VARMA models instead of approximations using the VAR model leads to superior prediction performance. This, jointly with the fact that the linearization of dynamic stochastic general equilibrium (DSGE) models leads to state space models, see e.g., [13], has fuelled recent interest in VARMA (and thus state space) modeling, in particular in macroeconomics, see for example [14].
In this respect, quasi maximum likelihood estimation is the most often used approach for inference. Due to the typically highly non-convex nature of the quasi likelihood function (using the Gaussian density) in the VARMA setting, the criterion function shows many local maxima where the optimization can easily get stuck. Randomization alone does not solve the problem efficiently, as typically the parameter space is high-dimensional causing problems of the curse of dimensionality type. Moreover, VARMA modeling requires a full specification of the state space unit root structure of the process, see [15]. The state space unit root structure specifies the number of common trends at each seasonal frequency (see below for definitions). For data of weekly or higher sampling frequency it is unlikely that the state space unit root structure is known prior to estimation. Testing all possible combinations is numerically infeasible in many cases.
As an attractive alternative in this respect the class of subspace algorithms is investigated in this paper. One particular member of this class, the so called canonical variate analysis (CVA) introduced by [16] (in the literature the algorithm is often called canonical correlation analysis; CCA), has been shown to provide system estimators which (under the assumption of known system order) are asymptotically equivalent to quasi maximum likelihood estimation (using the Gaussian likelihood) in the stationary case [17]. CVA shares a number of robustness properties in the stationary case with VAR estimators: [18] shows that CVA produces consistent estimators of the underlying transfer function in situations where the innovations are conditionally heteroskedastic processes of considerable generality. Ref. [19] shows that CVA provides consistent estimators of the transfer function even for stationary fractionally integrated processes, if the order of the system tends to infinity as a function of the sample size at a sufficient rate.
In the I(1) case [20] introduce a heuristic adaptation of the algorithm using the assumption of known cointegrating rank in order to show consistency for the corresponding transfer function estimators. However, the specification of the cointegrating rank is no easy task in itself. In case of misspecification of the cointegrating rank the properties of this approach are unclear. Ref. [21] states without proof that also the original CVA algorithm delivers consistent estimates in the I(1) case without the need to impose the true cointegrating rank.
Furthermore for I(1) processes [20] proposed various tests for the cointegrating rank and compared them to tests in the Johansen framework showing superior finite sample performance in particular for multivariate data sets with a large dimension of the modeled process.
This paper builds on these results and shows that CVA can also be used in the seasonally integrated case. The main contributions of the paper are: (i) It is shown that the original CVA algorithm in the seasonally integrated case provides strongly consistent system estimators under the assumption of known system order (thus delivering the currently unpublished proof of the claim in the I(1) case in [21]). (ii) Upper bounds for the order of convergence for the estimated system matrices are given, establishing the familiar superconsistency for the estimation of the cointegrating spaces at all unit roots. (iii) Several tests for separate (that is for each unit root irrespective of the specification at the other potential unit roots) determination of the seasonal cointegrating ranks are proposed which are based on the estimated systems and are simple to implement.
The derivation of the asymptotic properties of the estimators is complemented by a simulation study and an application, both demonstrating the potential of CVA and one of the suggested tests. Jointly our results imply that CVA constitutes a very reasonable initial estimate for subsequent quasi likelihood maximization in the VARMA case. Moreover the method provides valuable information on the number of unit roots present in the process, which can be used for subsequent investigation at the very least by providing upper bounds on the number of common trends present at each unit root frequency. Contrary to the JS approach in the VAR framework these tests can be performed in parallel for all unit roots, eliminating the interdependence of the results inherent in the VECM representation. Moreover, they do not use the VECM representation involving a large number of parameters in the case of high sampling rates.
These properties make CVA a useful tool in automatic modeling of multivariate (with a substantial number of variables) seasonally (co-)integrated processes.
The paper is organized as follows: in the next section the model set and the main assumptions of the paper are presented. The estimation methods are described in Section 3. Section 4 states the consistency results. Inference on the cointegrating ranks is proposed in Section 5. Data preprocessing is discussed in Section 6. The simulations are contained in Section 7, while Section 8 discusses an application to real world data. Section 9 concludes the paper. Appendix A contains supporting material, Appendix C provides the proofs of the main results of this paper, which are based on preliminary results presented in Appendix B.
Throughout the paper we will use the symbols o(g_T) and O(g_T) to denote orders of almost sure convergence, where T denotes the sample size: x_T = o(g_T) if x_T/g_T → 0 almost surely, and x_T = O(g_T) if x_T/g_T is bounded almost surely for large enough T (that is, there exists a constant M < ∞ such that lim sup_{T→∞} x_T/g_T ≤ M a.s.). Furthermore, o_P(g_T) and O_P(g_T) denote the corresponding in-probability versions.

Model Set and Assumptions
In this paper state space processes (y_t)_{t∈Z}, y_t ∈ R^s, are considered which are defined as the solutions to the following equations for given white noise (ε_t)_{t∈Z}, ε_t ∈ R^s, E ε_t = 0, E ε_t ε_t' = Ω > 0:

x_{t+1} = A x_t + K ε_t,
y_t = C x_t + ε_t.     (1)

Here x_t ∈ R^n denotes the unobserved state and A ∈ R^{n×n}, C ∈ R^{s×n} and K ∈ R^{n×s} define the state space system, typically written as the tuple (A, C, K).
In this paper we consider, without restriction of generality, only minimal state space systems in innovations representation. For a minimal system the integer n is called the order of the system. As is well known (cf. e.g., [22]), minimal systems are only identified up to the choice of the basis of the state space. Two minimal systems (A, C, K) and (Ã, C̃, K̃) are observationally equivalent if and only if there exists a nonsingular matrix T ∈ R^{n×n} such that Ã = T A T^{−1}, C̃ = C T^{−1}, K̃ = T K. For two observationally equivalent systems the impulse response sequences k_0 = I_s, k_{j+1} = C A^j K = C̃ Ã^j K̃, j = 0, 1, ..., coincide.
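The invariance of the impulse response sequence under a change of state basis can be checked numerically. The system below is an arbitrary illustrative example (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 4, 2
# An arbitrary illustrative system (A, C, K)
A = 0.5 * rng.standard_normal((n, n))
C = rng.standard_normal((s, n))
K = rng.standard_normal((n, s))

# Nonsingular state transformation T and the transformed system
T = rng.standard_normal((n, n)) + 5 * np.eye(n)
Ti = np.linalg.inv(T)
A2, C2, K2 = T @ A @ Ti, C @ Ti, T @ K

# Impulse responses k_{j+1} = C A^j K coincide for both representations
for j in range(6):
    assert np.allclose(C @ np.linalg.matrix_power(A, j) @ K,
                       C2 @ np.linalg.matrix_power(A2, j) @ K2)
print("observationally equivalent: impulse responses coincide")
```

This is exactly the reason why only the transfer function, not a particular (A, C, K) triple, is identified from the data.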
Ref. [15] shows that the structure of the Jordan normal form of the matrix A determines the properties (such as stationarity) of the solutions to (1) for t ∈ Z. Eigenvalues of A on the unit circle lead to unit root processes in the sense of [15], who also define a state space unit root structure indicating the location and multiplicity of unit roots. A process (y_t)_{t∈Z} with state space unit root structure Ω_S = {(0, (c_0)), (2π/S, (c_1)), ..., (π, (c_{S/2}))} for some even integer S is called multi frequency I(1) (in short MFI(1)). Even S is chosen because it simplifies the notation by implying that S/2 also is an integer and z = −1 is a seasonal unit root; by adjusting the notation appropriately all results hold true for odd S as well.
If, moreover, such a process is observed for S periods per year, it is called seasonal MFI(1). In this case the canonical form in [15] takes the following form:

A = diag(A_0, A_1, ..., A_{S/2}, A_•), C = [C_0, C_{1,R}, C_{1,I}, ..., C_{S/2}, C_•], K = [K_0', K_{1,R}', K_{1,I}', ..., K_{S/2}', K_•']',     (2)

with A_0 = I_{c_0}, A_{S/2} = −I_{c_{S/2}} and, for 0 < j < S/2,

A_j = [ cos(ω_j) I_{c_j}   sin(ω_j) I_{c_j} ; −sin(ω_j) I_{c_j}   cos(ω_j) I_{c_j} ],

where ω_j = 2πj/S, j = 0, ..., S/2, denote the unit root frequencies and C_{j,R} ∈ R^{s×c_j}, C_{j,I} ∈ R^{s×c_j}, K_{j,R} ∈ R^{c_j×s}, K_{j,I} ∈ R^{c_j×s}, where 0 ≤ c_j ≤ s, 0 ≤ j ≤ S/2. Furthermore, for C_{j,C} := C_{j,R} − iC_{j,I} it holds that C_{j,C}' C_{j,C} = I_{c_j} and K_{j,C} = K_{j,R} + iK_{j,I} is of full row rank and positive upper triangular (C_{0,I} = C_{S/2,I} = 0, K_{0,I} = K_{S/2,I} = 0), see [15] for details. Finally, |λ_max(A_•)| < 1, where λ_max(A) denotes an eigenvalue of the matrix A with maximal modulus. The stable subsystem (A_•, C_•, K_•) is assumed to be in echelon canonical form (see [22]). Using this notation the assumptions on the data generating process (dgp) in this paper can be stated as follows:

Assumption 1. The process (y_t)_{t∈Z} is generated by a minimal state space system (A, C, K) of order n in the canonical form (2). Furthermore the stability assumption |λ_max(A_•)| < 1 and the strict minimum-phase assumption ρ_0 := |λ_max(A − KC)| < 1 hold. The state at time t = 1 takes an arbitrary value in its unit root components, while the stationary component x_{1,•} is drawn from its stationary distribution. The noise process (ε_t)_{t∈Z} is assumed to be a strictly stationary ergodic martingale difference sequence with respect to the filtration F_t with zero conditional mean E(ε_t | F_{t−1}) = 0, deterministic conditional variance E(ε_t ε_t' | F_{t−1}) = Ω > 0 and finite fourth moments.
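The block structure of the unit root part of A can be illustrated numerically. The sketch below assumes quarterly data (S = 4) with one common trend at each of the frequencies 0, π/2 and π, the standard rotation-block convention for the real canonical form, and a placeholder stable block:

```python
import numpy as np

def rotation_block(omega, c):
    # 2c x 2c real block with eigenvalues exp(+i*omega) and exp(-i*omega),
    # each of multiplicity c
    R = np.array([[np.cos(omega),  np.sin(omega)],
                  [-np.sin(omega), np.cos(omega)]])
    return np.kron(R, np.eye(c))

def block_diag(*blocks):
    n = sum(b.shape[0] for b in blocks)
    M = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        M[i:i+k, i:i+k] = b
        i += k
    return M

# Quarterly data: unit roots at z = 1 (frequency 0), z = +-i (frequency pi/2)
# and z = -1 (frequency pi), one common trend each; stable block is a placeholder
A = block_diag(np.eye(1), rotation_block(np.pi / 2, 1), -np.eye(1),
               np.array([[0.5]]))
moduli = np.sort(np.abs(np.linalg.eigvals(A)))
print(moduli)  # four eigenvalues on the unit circle, one stable eigenvalue
```

The rotation blocks place complex conjugate eigenvalue pairs exp(±iω_j) on the unit circle, while all eigenvalues of the stable block lie strictly inside it.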
Due to the block diagonal form of A the state equations are in a convenient form: partitioning the state vector accordingly, the blocks (x_{t,j})_{t∈Z}, x_{t,j} ∈ R^{δ_j c_j} (δ_j = 1 for j ∈ {0, S/2} and δ_j = 2 else), for c_j > 0 are unit root processes with state space unit root structure {(ω_j, (c_j))}. Finally, (x_{t,•})_{t∈Z} is assumed to be stationary due to the assumptions on x_{1,•}. If (ỹ_t)_{t∈N} denotes a different solution to the state space equations corresponding to a different initial state x̃_1, then (for t > 1)

ỹ_t − y_t = C A^{t−1} (x̃_1 − x_1).

Note that C_j A_j^{t−1} z_{12} = cos(ω_j t) z_1 + sin(ω_j t) z_2, 0 < j < S/2 (for appropriate vectors z_1, z_2 derived from z_{12} := x̃_{1,j} − x_{1,j}). Therefore the sum of these terms over j can be modeled using a constant and seasonal dummies. The term C_• A_•^{t−1} (x̃_{1,•} − x_{1,•}) tends to zero at an exponential rate as t → ∞ and hence does not influence the asymptotics. Assumption 1 implies that the yearly difference (y_t − y_{t−S})_{t∈Z} is a stationary VARMA process. Thus the process according to Assumption 1 is a unit root process in the sense of [15]. Note that we do not assume that all unit roots are contained, such that the spectral density of the stationary process (y_t − y_{t−S})_{t∈Z} may contain zeros due to overdifferencing and hence the process potentially is not stably invertible. The special form of A_0 implies that I(1) processes are a special case of our dgp, while I(d), d > 1, d ∈ N, processes are not allowed for.

Canonical Variate Analysis
The main idea of CVA is that, given the state, the system equations (1) are linear in the system matrices. Therefore, based on an estimate of the state sequence, the system can be estimated using least squares regression. The estimate of the state is based on the following equation (for details see for example [17]): Let Y+_{t,f} := [y_t', y_{t+1}', ..., y_{t+f−1}']' denote the vector of stacked future observations for some integer f ≥ n, and let Y−_{t,p} := [y_{t−1}', y_{t−2}', ..., y_{t−p}']' denote the vector of stacked past observations for some integer p ≥ n. Then an estimate of β_1 in the regression Y+_{t,f} = β_1 Y−_{t,p} + N_{t,f} is obtained from the reduced rank regression (RRR) under the rank constraint rank(β_1) = n. Denoting by Γ̂_f^+ the sample covariance matrix of Y+_{t,f}, by Γ̂_p^− that of Y−_{t,p} and by Γ̂_{fp} their sample cross covariance, the singular value decomposition

(Γ̂_f^+)^{−1/2} Γ̂_{fp} (Γ̂_p^−)^{−1/2} = Û Ŝ V̂' = Û_n Ŝ_n V̂_n' + R̂_n

results in the estimate Ô_f K̂_p with Ô_f := (Γ̂_f^+)^{1/2} Û_n Ŝ_n^{1/2} and K̂_p := Ŝ_n^{1/2} V̂_n' (Γ̂_p^−)^{−1/2}. Here the symmetric matrix square root is used. The definition is, however, not of importance and other square roots such as Cholesky factors could be used. Û_n ∈ R^{fs×n} denotes the matrix whose columns are the left singular vectors corresponding to the n largest singular values, which are the diagonal entries of Ŝ_n := diag(σ̂_1, σ̂_2, ..., σ̂_n), σ̂_1 ≥ ... ≥ σ̂_n > 0, and V̂_n ∈ R^{ps×n} contains the corresponding right singular vectors as its columns. R̂_n denotes the approximation error.
The system estimate (Â, Ĉ, K̂) is then obtained using the estimated state x̂_t := K̂_p Y−_{t,p}, t = p + 1, ..., T + 1, via regression in the system equations.
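The steps above can be sketched numerically. The following is a minimal illustration of the CVA pipeline for a small stationary example system; the system matrices, dimensions, sample size and the use of symmetric matrix square roots are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
s, n, f, p, T = 2, 2, 4, 4, 2000

# Simulate a small stationary system in innovations form (illustrative values)
A = np.array([[0.8, 0.1], [0.0, 0.5]])
C = np.eye(2)
K = 0.2 * np.eye(2)          # A - K C is stable, so strictly minimum-phase
x = np.zeros(n)
y = np.empty((T, s))
for t in range(T):
    e = rng.standard_normal(s)
    y[t] = C @ x + e
    x = A @ x + K @ e

# Stacked past Y-_{t,p} (most recent observation first) and future Y+_{t,f}
ts = range(p, T - f + 1)
Ym = np.array([y[t - 1::-1][:p].ravel() for t in ts])
Yp = np.array([y[t:t + f].ravel() for t in ts])
N = len(Ym)

Gpp = Ym.T @ Ym / N          # sample covariance of the past
Gff = Yp.T @ Yp / N          # sample covariance of the future
Gfp = Yp.T @ Ym / N          # cross covariance

def invsqrt(M):
    # symmetric inverse matrix square root
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

# Weighted SVD of the RRR step; singular values = canonical correlations
U, sv, Vt = np.linalg.svd(invsqrt(Gff) @ Gfp @ invsqrt(Gpp))
Khat_p = np.diag(sv[:n] ** 0.5) @ Vt[:n] @ invsqrt(Gpp)

xhat = Ym @ Khat_p.T         # estimated state sequence
y0 = y[list(ts)]
Chat = np.linalg.lstsq(xhat, y0, rcond=None)[0].T
eps = y0 - xhat @ Chat.T     # innovation residuals
Z = np.hstack([xhat[:-1], eps[:-1]])
AK = np.linalg.lstsq(Z, xhat[1:], rcond=None)[0].T
Ahat, Khat = AK[:, :n], AK[:, n:]
print(np.sort(np.abs(np.linalg.eigvals(Ahat))))  # close to 0.5 and 0.8
```

The estimated state is regressed on in two steps, mirroring the text: first y_t on x̂_t for Ĉ, then x̂_{t+1} on x̂_t and the residuals for Â and K̂.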
In the algorithm a specific decomposition of the rank n matrix Ô_f K̂_p into the two factors Ô_f and K̂_p is used. It is obvious that every other decomposition of Ô_f K̂_p produces an estimated state sequence in a different coordinate system, leading to a different observationally equivalent representation of the same transfer function estimator. Therefore, with respect to consistency of the transfer function estimator it is sufficient to show that there exists a factorization of Ô_f K̂_p leading to convergent system matrix estimators (Ã, C̃, K̃), even if this factorization cannot be used in actual computations, as it requires information not known at the time of estimation.
In order to generate a consistent initial guess for subsequent quasi likelihood optimization in the set of all state space systems corresponding to processes with state space unit root structure Ω_S := {(ω_0, (c_0)), ..., (ω_{S/2}, (c_{S/2}))}, however, we will derive a realizable (for known integers c_j and matrices E_j such that E_j C_{•,j,C} = I_{c_j}) consistent system estimate. To this end note that consistency of the transfer function implies (see for example [23]) that the eigenvalues λ̂_l of Â converge (in a specific sense) to the eigenvalues of A. Therefore, transforming Â into complex Jordan normal form (where Â is almost surely diagonalizable) and ordering the eigenvalues such that groups of eigenvalues λ̂_l(j), l = 1, ..., c_j, converging to z_j are grouped together, we obtain a realizable system (Ǎ, Č, Ǩ) where the diagonal blocks of the block diagonal matrix Ǎ corresponding to the unit roots converge to a diagonal matrix with the eigenvalues z_j on the diagonal. Replacing Ǎ_{j,C} by the limit A_{j,C} and transforming the estimates to the real Jordan normal form, we obtain estimates (Ȃ, Č, Ǩ) corresponding to unit root processes with state space unit root structure Ω_S. Note, however, that this representation does not necessarily converge, as perturbation analysis only implies convergence of the eigenspaces. Therefore in a final step the estimate (Ȃ, Č, Ǩ) is converted such that we obtain convergence of the system matrix estimates. In the class of observationally equivalent systems, block diagonal transformations of the form T = diag(T_0, T_1, T̄_1, ..., T_{S/2}, I) do not change the matrix Ȃ_C. Here the basis of the stable subsystem can be chosen such that the corresponding transformed (Ȃ_•, C̆_•, K̆_•) is uniquely defined using an overlapping echelon form (see [22], Section 2.6). The impact of such transformations on the blocks of C is given by Č_{j,C} T_j^{−1}. 
Therefore, if for each j = 0, ..., S/2 a matrix E_j ∈ C^{s×c_j} is known such that E_j C_{•,j,C} ∈ C^{c_j×c_j} is nonsingular, the restriction E_j C̆_{j,C} = I_{c_j} picks a unique representative (Ȃ, C̆, K̆) of the class of systems observationally equivalent to (Ȃ, Č, Ǩ).
Note that this estimate (Ȃ, C̆, K̆) is realizable if the integers c_j (needed to identify the c_j eigenvalues of Â closest to z_j), the matrices E_j (needed to fix a basis for x_{t,j}) and the index corresponding to the overlapping echelon form for the stable part are known. Furthermore, this estimate corresponds to a process with state space unit root structure Ω_S and hence can be used as a starting value for quasi likelihood maximization.
Finally, in this section it should be noted that the estimate of the state x̂_t here mainly serves the purpose of obtaining an estimator for the state space system. Based on this estimate, Kalman filtering techniques can be used to obtain different estimates of the state sequence. The relation between these different estimates is unclear and so is their usage for inference. For this paper the state estimates x̂_t are only an intermediate step in the CVA algorithm.

Asymptotic Properties of the System Estimators
As follows from the last section, the central step in the CVA procedure is a RRR problem involving stationary and nonstationary components. The asymptotic properties of the solution to such RRR problems are derived in Theorem 3.2 of [24]. Using these results the following theorem can be proved (see Appendix C.1):

Theorem 1. Let the process (y_t)_{t∈Z} be generated according to Assumption 1. Let (Â, Ĉ, K̂) denote the CVA estimators of the system matrices using the assumption of correctly specified order n, with f ≥ n finite and not depending on the sample size, and with p growing at least proportionally to log T while p = o((log T)^ā) for some real ā < ∞. Assume that the stable subsystem is in echelon canonical form and that for each j = 0, ..., S/2 there exists a row selector matrix E_j ∈ R^{s×c_j} such that E_j C_{•,j,C} is non-singular. Then:
(I) The transfer function estimator corresponding to (Â, Ĉ, K̂) is consistent.
(II) The realizable transformed estimates (Ȃ, C̆, K̆) converge; in particular the blocks of C̆ corresponding to the unit roots converge at order O_P((log T)^a/T) for some integer a < ∞, while the remaining system matrix estimates are consistent at the usual stationary rate.
(III) If the noise is assumed to be an iid sequence, then results (I) and (II) hold almost surely.
Besides stating consistency in the seasonally integrated case, the theorem also improves on the results of [20] in the I(1) case by showing that no adaptation of CVA is needed in order to obtain consistent estimators of the impulse response sequence or the system matrices. Note that this consistency result for the impulse response sequence concerns both the short-run and the long-run dynamics. In particular it implies that short-run prediction coefficients are consistent. Moreover, the theorem establishes strong consistency rather than the weak consistency obtained in [20]. (II) establishes orders of convergence which, however, apply only to a transformed system that requires knowledge of the integers c_j and matrices E_j to be realized. No tight bounds for the integer a are derived, since they do not seem to be of much value.
Note that the assumptions on the innovations rule out conditionally heteroskedastic processes. However, since the proof mostly relies on convergence properties for covariance estimators for stationary processes and continuous mapping theorems for integrated processes, it appears likely that the results can be extended to conditionally heteroskedastic processes as well. For the stationary cases this follows directly from the arguments in [18], while for integrated processes results (using different assumptions on the innovations) given for example in [25] can be used. The conditions of [25] hold for example in a large number of GARCH type processes, see [26]. The combination of the different sets of assumptions on the innovations is not straightforward, however, and hence would further complicate the proofs. We refrain from including them.
It is worth pointing out that due to the block diagonal structure of A the result implies consistency of the blocks C̆_j corresponding to the unit root z_j (or the corresponding complex pair) of order almost T^{−1}. Using the complex valued canonical form this implies consistent estimation of C_{•,j,C} by the corresponding C̆_{j,C}. In the canonical form this matrix determines the cointegrating relations (both the static as well as the dynamic ones, for details see [15]) as the unitary complement to this matrix. It thus follows that CVA delivers estimators for the cointegrating relations at the various unit roots that are (super-)consistent. In fact, the proof can be extended to show convergence in distribution of (C̆ − C)D_x^{−1} for a suitable scaling matrix D_x. This distribution could be used in order to derive tests for cointegrating relations. However, preliminary simulations indicate that these estimates and hence the corresponding tests are not optimal and can be improved upon by quasi maximum likelihood estimation in the VARMA setting initialized by the CVA estimates. Therefore we refrain from presenting these results.
Note that the assumptions impose the restriction ρ_0 > 0, excluding VAR systems. This is done solely for stating a uniform lower bound on the increase of p as a function of T. This bound is related to the lag length selection achieved using BIC, see [27]. In the VAR case the lag length estimator using BIC will converge to the true order and thus remain finite. All results hold true if in the VAR case a fixed (that is, independent of the sample size) p ≥ n is used.

Inference Based on the Subspace Estimators
Besides consistency of the impulse response sequence, the specification of the integers c_0, ..., c_{S/2} is also of interest. First, following [20], this information can be obtained by detecting the unity singular values in the RRR step of the procedure. Second, from the system representation (2) it is clear that the location of the unit roots is determined by the eigenvalues of A on the unit circle: the integers c_j denote the number of eigenvalues at the corresponding locations on the unit circle (provided the eigenvalues are simple). Due to perturbation theory (see for example Lemma A2) we know that the eigenvalues of Â will converge (for T → ∞) to the eigenvalues of A, and the distribution of the mean of all eigenvalues of Â converging to an eigenvalue of A can be derived based on the distribution of the estimation error Â − A. This can be used to derive tests on the number of eigenvalues at a particular location on the unit circle. Third, if n ≤ s the state process is a VAR(1) process and hence in some cases allows for inference on the number of cointegrating relations and thus also on the integers c_j as outlined in [4]. Tests based on these three arguments will be discussed below.

Theorem 2. Under the assumptions of Theorem 1, the test statistic T ∑_{j=1}^{c} (1 − σ̂_j), formed from the c largest singular values of the RRR step, converges in distribution. The limit depends on ε_{t,1} and ε_{t,•} (for the definition of ε_{t,1} and ε_{t,•} see the proof in Appendix C.2) and on a c-dimensional Brownian motion W(r) whose variance is determined by the unstable subsystem, with A_u denoting the c × c heading submatrix of A and K_u denoting the submatrix of K composed of the first c rows, such that (A_u, C_u, K_u) denotes the unstable subsystem.
The theorem is proved in Appendix C.2, where also the many nuisance parameters of the limiting random variable are explained and defined. The proof also corrects an error in Theorem 4 of [20], where the wrong distribution is given since the second order terms were neglected.
As the distribution is not pivotal and in particular contains information that is unknown when performing the RRR step, it is not of much interest for direct application. Nevertheless the order of convergence allows for the derivation of simple consistent estimators of the number of common trends: Let ĉ_T denote the number of singular values calculated in the RRR that exceed 1 − h(T)/T for arbitrary h(T) → ∞, h(T) < T, h(T)/T → 0, for T → ∞. Then it is a direct consequence of Theorem 2 in combination with σ̂_j → σ_j < 1, j > c, that ĉ_T → c in probability, implying consistent estimation of c. Based on these results also estimators for c could be derived, for example along the lines of [28]. However, as [29] shows, such estimators have not performed well in simulations and thus are not considered subsequently.
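The estimator ĉ_T just described amounts to a one-line computation. In the sketch below, h(T) = log T is one admissible illustrative choice of threshold function, and the singular values are made-up example numbers:

```python
import numpy as np

def estimate_num_common_trends(singular_values, T, h):
    # c_T estimator: number of RRR singular values exceeding 1 - h(T)/T,
    # where h(T) -> infinity and h(T)/T -> 0 as T grows
    return int(np.sum(np.asarray(singular_values) > 1 - h / T))

# Illustrative singular values with two near-unity canonical correlations
sv = [0.9995, 0.998, 0.84, 0.51, 0.22]
print(estimate_num_common_trends(sv, T=1000, h=np.log(1000)))  # → 2
```

With T = 1000 the threshold is 1 − log(1000)/1000 ≈ 0.9931, so exactly the two near-unity singular values are counted.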
The singular values do not provide information on the location of the unit roots. This additional information is contained in the eigenvalues of the matrix A:

Theorem 3. Under the assumptions of Theorem 1, for each unit root z_m let μ̂_m denote the average of the deviations λ̂_l − z_m over the c_m eigenvalues λ̂_l of Â closest to z_m. Then T μ̂_m converges in distribution to a functional of B(r), where B(r) denotes a c_m-dimensional Brownian motion with zero expectation and variance I_{c_m} for z_m = ±1 and a complex Brownian motion with expectation zero and variance equal to the identity matrix else.
Therefore the estimated eigenvalues can be used in order to obtain a test on the number of common trends at a particular frequency for each frequency separately, with the test distribution obtained as the limit of T μ̂_m. The distribution thus does not depend on the presence of other unit roots or stationary components of the state. Furthermore it can be seen that it is independent of the noise variance and of the matrix K_{•,m,C}. Hence critical values are easily obtained from simulations. Also note that the limiting distribution is identical for all complex unit roots. Therefore, for each seasonal unit root location z_m we can order the eigenvalues of the estimated matrix Â with increasing distance to z_m. Then, starting from the hypothesis H_0: c_m = c̃ (for a reasonable c̃ obtained, e.g., from a plot of the eigenvalues), one can perform the test with statistic T μ̂_m. If the test rejects, then the hypothesis H_0: c_m = c̃ − 1 is tested, until the hypothesis is not rejected anymore, or H_0: c_m = 1 is reached. This is then the last test. If H_0 is rejected again, no unit root is found at this location. Otherwise we do not have evidence against c_m = 1. In any case, the system needs to be estimated only once and the calculation of the test statistics is easy, even for all seasonal unit roots jointly.
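The sequential procedure can be sketched as follows. The function names are ours, and the critical values are placeholder inputs that would in practice be simulated from the limiting distribution in Theorem 3:

```python
import numpy as np

def unit_root_statistic(Ahat, z_m, c):
    # |mean of (z_m - lambda)| over the c eigenvalues of Ahat closest to z_m;
    # T times this quantity is used as the test statistic
    eig = np.linalg.eigvals(Ahat)
    closest = eig[np.argsort(np.abs(eig - z_m))[:c]]
    return np.abs(z_m - closest.mean())

def sequential_rank_test(Ahat, z_m, c_start, T, crit):
    # crit[c]: critical value for H0: c_m = c; placeholder inputs here,
    # in practice simulated from the limiting distribution
    for c in range(c_start, 0, -1):
        if T * unit_root_statistic(Ahat, z_m, c) <= crit[c]:
            return c     # H0: c_m = c not rejected
    return 0             # every H0 rejected: no unit root found at z_m

# Example: eigenvalues {i, -i, 0.5}; testing at z_m = i finds one unit root
Ahat = np.array([[0.0, 1.0, 0.0],
                 [-1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.5]])
print(sequential_rank_test(Ahat, 1j, c_start=2, T=100, crit={1: 5.0, 2: 5.0}))  # → 1
```

Note that only a single estimate Â is needed: the same matrix is reused for every candidate rank and every unit root location.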
The third option for obtaining tests is to use the tests derived in [4] based on the JS framework for VARs. In the case n ≤ s the state process x_{t+1} = A x_t + K ε_t is a seasonally integrated VAR(1) process (for n > s the noise variance is singular). The corresponding VECM representation contains one potentially rank restricted coefficient matrix per unit root z_m. Note that in this VAR(1) setting no additional stationary regressors of the form p(L) x_{t−j} occur. Also no seasonal dummies are needed, but they could be added to the equation. In this setting [4] suggests to use the eigenvalues λ̂_i (ordered with increasing modulus) of a reduced rank regression moment matrix (where the superscript π denotes residuals with respect to the remaining regressors) as the basis for a trace-type test statistic Ĉ_m scaled by δ_m, where δ_m = 2 for complex unit roots and δ_m = 1 for real unit roots. In the I(1) case this leads to the familiar Johansen trace test; for seasonal unit roots a different asymptotic distribution is obtained.

Theorem 4. Under the assumptions of Theorem 1, let Ĉ_m be calculated based on the estimated state and let C̃_m denote the same statistic based on the true state. Then for n ≤ s it holds that Ĉ_m and C̃_m converge to the same limiting distribution, a functional of B(r), where B(r) is a real Brownian motion for z_m = ±1 or a complex Brownian motion else.
Thus again under the null hypothesis the test statistic based on the estimated state and the one based on the true state reject jointly asymptotically with probability one. Therefore for n ≤ s the tests of JS can be used to obtain information on the number of common cycles, ignoring the fact that the estimated state is used in place of the true state process.
Having presented three distinct approaches for obtaining information on the number and location of unit roots, the question arises which one to use in practice. In the following, a number of considerations are given in this respect.
The criterion based on the singular values given in Theorem 2 provides only limited information, as it only yields the overall number of unit roots. Since the limiting distribution is not pivotal it cannot be used for tests, and the choice of the cutoff value h(T) is somewhat arbitrary. Nevertheless, using a relatively large value one obtains a useful upper bound on c which can be included in the typical sequential procedures for tests on the c_j.
Using the results of Theorem 4 has the advantage of using a framework that is well known to many researchers. It is remarkable that in terms of the asymptotic distributions there is no difference involved in using the estimated state in place of the true state. The assumption n ≤ s, however, is somewhat restrictive except in situations with a large s.
Finally, the results of Theorem 3 provide simple-to-use tests for all unit roots, independently of the specification of the model at the remaining unit roots. Again it is remarkable that, under the null, inference is identical for the known and the estimated state.
Since our estimators are not quasi maximum likelihood estimators, the question of a comparison with the usual likelihood ratio tests arises. For VAR models, simulation exercises documented in Section 7 below demonstrate that there are situations where the proposed tests outperform tests in the VAR framework. Comparisons with tests in the state space framework (or equivalently the VARMA framework) are complicated by the fact that no results are currently available in the literature for this framework. One difference, however, is that quasi likelihood ratio tests in the VARMA setting require a full specification of the c_j values for all unit roots. This introduces interdependencies such that the tests for one unit root depend on the specification of the cointegrating rank at the other roots. These interdependencies can be broken by performing tests based on alternative specifications for each unit root. The test based on Theorem 3 does not require this but can be performed based on the same estimate Â. This is seen as an advantage.
The question of the comparison of the empirical size in finite samples as well as power to local alternatives between the CVA based tests and tests based on quasi-likelihood ratios is left as a research question.

Deterministic Terms
Up to now it has been assumed that no deterministic terms appear in the model contrary to common practice. In the VAR framework dealing with trends is complicated by the usage of the VECM representation, see e.g., [30]. In the state space framework used in this paper, however, deterministic terms are easily incorporated.

Theorem 5.
Let the process (y_t)_{t∈Z} be generated according to Assumption 1 and assume that the process (ỹ_t)_{t∈Z} is observed, where ỹ_t = y_t + Φ d_t for deterministic terms d_t. Then, if the CVA estimation is applied to ỹ_t, the results of Theorem 1 hold, i.e., the system is estimated consistently and the orders of convergence for the transformed system (Ă, C, K) hold true. In this sense the results are robust to operations typically termed preprocessing of the data, such as demeaning and deseasonalizing using seasonal dummies. More general preprocessing steps, such as detrending or the extraction of more general deterministic terms analogous to [30], can be investigated along the same lines.

Simulations
The estimation of the seasonal cointegration ranks and spaces is usually carried out via quasi maximum likelihood methods that originated from the VAR model class. Typical estimators in this setting are those of [2,4,5,31]. In the first two experiments we focus on the estimation of the cointegrating spaces and the specification of the cointegration ranks in the classical situation of quarterly data and show that there are certain situations in which CVA estimators and the test in Theorem 3 possess finite sample properties superior to those of the methods above. In a third experiment the test performance is evaluated for a daily sampling rate. Moreover, the prediction accuracy of CVA is investigated as well as its robustness to innovations exhibiting behaviors often encountered at such higher sampling rates. All simulations are carried out using 1000 replications.
To investigate the practical usefulness of the proposed procedures we generate quarterly data, first using two VAR dgps of dimension s = 2 and then two more general VARMA dgps with s = 8. Each pair contains dgps with different state space unit root structures. From all four dgps, samples of size T ∈ {50, 100, 200, 500} are generated with initial values set to zero. Although none of the dgps contains deterministic terms, the data is adjusted for a constant and quarterly seasonal dummies as in [5]. For reasons of comparability, the adjustment for deterministic terms is done before estimation.
In the third experiment we generate daily data with dimension s = 4 from a state space system including unit roots corresponding to weekly frequencies (that is a period length of seven days). In the simulations we use several years of data (excluding new year's day to account for 52 weeks of seven days each). The first 200 observations are discarded to include the effects of different starting values. In this example the focus lies on a comparison of the prediction accuracy. Furthermore we investigate the robustness of the test procedures to conditional heteroskedasticity of the GARCH type as well as to non-normality of the innovations.
To assess the performance of specifying the cointegrating rank at unit root z using CVA, a test statistic Λ(c) is constructed from the results in Theorem 3, based on the eigenvalues λ̂_1, ..., λ̂_n of Â ordered increasingly according to their distance from z. Note that a similar test in [20] only uses the c-th largest eigenvalue, whereas here the average over the c eigenvalues nearest to z is taken. Critical values have been obtained by simulation using large sample sizes (sample size 2000 (JS) and 5000 (CVA); 10,000 replications). In our first two experiments the use of Λ(c) is compared with variants of the likelihood ratio test from [2] (JS), [4] (Q_1), and [5] (Q_2, Q_3). Q_1 is Cubadda's trace test for complex-valued data, Q_2 takes the information at frequency π/2 into account when the analysis is carried out at frequency 3π/2, and Q_3 iterates between π/2 and 3π/2 in the alternating reduced rank regression (ARR) of [5]. For the procedure of [2] the likelihood maximization at frequency π/2 is carried out using numerical optimization (BFGS) with initial values obtained from an unrestricted regression.
All tests are evaluated by comparing the percentages of correctly detected numbers of common trends, or hit rates, with 0.95, the hit rate to be expected at a nominal significance level of 0.05. The testing procedure is the same for all tests: at each frequency one starts from a null hypothesis of s unit roots against the alternative of fewer than s unit roots. In case of rejection, s − 1 unit roots are tested against fewer than s − 1, and so on, until zero unit roots remain under the alternative.
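The top-down sequential procedure can be sketched generically as follows (the callables for the statistic and critical value are placeholders for whichever test is used):

```python
def sequential_trend_test(test_stat, crit_val, s):
    """Top-down sequential procedure used for all tests: start from the
    null of s unit roots; on rejection move to s-1, and so on. Returns
    the first non-rejected number of common trends (0 if all hypotheses
    are rejected). test_stat(c) and crit_val(c) are user-supplied."""
    for c in range(s, 0, -1):
        if test_stat(c) <= crit_val(c):
            return c
    return 0

# hypothetical statistic that is large whenever more than 2 trends are imposed
chosen = sequential_trend_test(lambda c: 10.0 if c > 2 else 1.0,
                               lambda c: 3.84, s=4)
# chosen == 2
```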
For the first two experiments the estimation performance of CVA for the simultaneous estimation of the seasonal cointegrating spaces is compared with the maximum likelihood estimates of [2,4,31] (cRRR) and with an iterative procedure (Generalized ARR or GARR) of [5]. The comparison is carried out by means of the gap metric, measuring the distance between the true and the estimated cointegrating space as in [32]. The smaller the mean gap over all replications, the better the estimation performance. Throughout, a difference between two mean gaps or two hit rates is considered statistically significant if it is larger than twice the Monte Carlo standard error.
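For concreteness, one standard way of computing such a distance between two subspaces spanned by basis matrices is the spectral norm of the difference of the orthogonal projectors (the sine of the largest principal angle); the exact variant used in [32] may differ in detail:

```python
import numpy as np

def gap_metric(B1, B2):
    """Distance between the column spaces of two (real) basis matrices,
    computed as the spectral norm of the difference of the orthogonal
    projectors onto the two spaces. Equals 0 for identical spaces and
    1 for orthogonal ones."""
    Q1, _ = np.linalg.qr(B1)
    Q2, _ = np.linalg.qr(B2)
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T, 2)

b1 = np.array([[1.0], [0.0]])
same = gap_metric(b1, 2.5 * b1)                   # identical spaces
orth = gap_metric(b1, np.array([[0.0], [1.0]]))   # orthogonal spaces
```

For complex-valued cointegrating spaces (complex unit roots) the transposes would be replaced by conjugate transposes.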
For all procedures used in this section, an AR lag length has to be chosen first. For CVA this can be done using the AIC as in ( [33], Section 5), as is done in the third experiment.
In the first two experiments, where sample sizes are rather small, we estimate the lag length by minimizing the corrected AIC (AICc) ([34], p. 432), yielding k̂_AICc, which benefits the simulation results; for larger sample sizes the two criteria lead to the same choices. Because we work with quarterly data, the lag length is then chosen as k̂ = max{k̂_AICc, 4}.
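A sketch of this lag length selection for a VAR is given below. The generic small-sample correction AICc = AIC + 2d(d+1)/(T_eff − d − 1), with d the number of freely estimated parameters, is one common variant; the exact form in [34] may differ, so this is illustrative only:

```python
import numpy as np

def var_aicc(y, k_max):
    """Select a VAR lag length by minimizing a corrected AIC and floor
    the result at 4 lags (quarterly data), as described in the text."""
    T, s = y.shape
    best_k, best_val = 1, np.inf
    for k in range(1, k_max + 1):
        Y = y[k:]                                           # targets y_t
        X = np.hstack([y[k - j - 1:T - j - 1] for j in range(k)])  # lags
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)
        E = Y - X @ B
        T_eff = T - k
        Sigma = E.T @ E / T_eff                             # residual covariance
        d = k * s * s                                       # parameter count
        aic = T_eff * np.log(np.linalg.det(Sigma)) + 2 * d
        aicc = aic + 2 * d * (d + 1) / max(T_eff - d - 1, 1)
        if aicc < best_val:
            best_k, best_val = k, aicc
    return max(best_k, 4)   # floor at 4 lags for quarterly data
```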
Other information criteria could be chosen here. An anonymous referee also suggested the application of the Modified Akaike Information Criterion (MAIC) of [35], proposed there for the I(1)-case. In an attempt to apply it to the seasonally integrated case considered here, it performed considerably worse than the AICc. Thus we refrain from using the MAIC in the following and also omit the results of that attempt. They can be obtained from the authors upon request.
For CVA the truncation indices f and p are chosen as f̂ = p̂ = 2k̂ ([33], Section 5). The system order n is estimated by minimizing the singular value criterion (SVC) of ([33], Section 5), where σ̂_i denotes the i-th largest singular value from the singular value decomposition of Ξ̂⁺_f β̂_1 Ξ̂⁻_p (Step 2 of CVA). Note that the selection of the number of states by SVC is made less influential insofar as n̂ = max{c_0 + 2c_{π/2} + c_π, n̂_SVC}, where n̂_SVC denotes the SVC estimate of the system order.
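The order selection step can be sketched as follows. The penalty term below (d(n)·log T/T with d(n) = 2ns) is a generic choice illustrating the structure of such singular value criteria; the exact penalty of [33] may differ:

```python
import numpy as np

def svc_order(sing_vals, s, T, c0=0, c_half=0, c_pi=0):
    """Order selection sketch: minimize a singular value criterion of the
    generic form SVC(n) = sigma_{n+1}^2 + 2*n*s*log(T)/T, then enforce the
    lower bound n >= c0 + 2*c_{pi/2} + c_pi needed to accommodate all
    hypothesized unit roots."""
    sv = np.append(np.asarray(sing_vals, dtype=float), 0.0)
    crit = [sv[n] ** 2 + 2 * n * s * np.log(T) / T
            for n in range(len(sing_vals) + 1)]
    n_svc = int(np.argmin(crit))
    return max(c0 + 2 * c_half + c_pi, n_svc)

# two dominant singular values -> order 2; the unit root bound can raise it
n_hat = svc_order([0.9, 0.8, 0.05, 0.01], s=2, T=1000)
```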
In Section 7.1 we start with the two VAR dgps and find that the likelihood-based procedures are mostly superior. Continuing with the VARMA dgps in Section 7.2, CVA performs better: it is superior for the smaller sample sizes in terms of size and gap, and better for all sample sizes in terms of power. Section 7.3 evaluates the performance of the unit root tests for larger sample sizes, together with the prediction performance in this setting. We find that the tests are robust to the distribution of the innovations as well as to conditional heteroskedasticity of the GARCH type. Furthermore, the empirical size of the tests lies close to the nominal size already for moderate sample sizes, where the tests also show almost perfect power properties.

VAR Processes
The VAR dgps considered in this paper are VAR processes with white noise innovations (ε_t)_{t∈Z}, whose coefficient matrices depend on a parameter γ. This dgp is adopted from [5] with a slight adjustment to Π_4. As can be seen from Table 1, the dgps possess unit roots at frequencies 0, π, and π/2, where c_{π/2} = 2 for γ = 0 and c_{π/2} = 1 for γ = 0.2. Note that in all cases the order of integration equals 1, while the number of common cycles at π/2 is varied. Table 2 exhibits the hit rates from the application of the different test statistics. At frequencies 0 and π, Λ is compared with the trace test of Johansen (J; based on [31] for the unit root z = −1), whereas at π/2 it competes with JS, Q_1, Q_2, and Q_3. All competitors are likelihood-based tests, which is the term we use when comparing Λ to them as a whole. The results for 0 and π are very similar for both dgps in that Λ scores more hits than the likelihood-based tests when the sample size is small, T ∈ {50, 100}. Convergence of its finite sample distribution is slower than for the other test statistics, however, as J is closer to 0.95 from T = 200 on. For T = 500 the distribution of Λ only seems to have converged to its asymptotic distribution at frequency 0 when c_{π/2} = 2, whereas convergence of the likelihood-based tests has occurred in all cases.
At π/2 the likelihood ratio test of JS strictly dominates all implementations of [5] for all sample sizes and both dgps. It also dominates the CVA-based test procedure, with one exception: when c_{π/2} = 1 and T = 50, Λ scores slightly, but significantly, more hits than the likelihood ratio test of JS. Surprisingly, Λ is drastically worse for T = 100 with only 8.7%, only to recover to 85.5% for T = 200.
The behavior of Λ is explained by z 5 and z 6 being close to ±i when c π/2 = 1, cf. Table 1. For future reference we will call the corresponding roots false unit roots.
For T = 50 the estimates of the eigenvalues corresponding to actual unit roots are, in contrast to the false unit roots, often not very close to ±i. Thus the latter are mistaken for actual unit roots (cf. the first panel in Figure 1), leading to a hit rate of 81.1%, one that is even larger than the rates at 0 and π. As the sample size increases, the eigenvalue estimates of the true unit roots become more and more accurate, visible from the second and third panel in Figure 1, and are accordingly detected correctly more often. For T = 100, however, the false unit roots are still frequently detected, such that often two instead of just one unit root are found by Λ, resulting in a hit rate of only 8.7%. For T ∈ {200, 500} Λ is able to distinguish the false unit roots from the true ones and the detection rate approaches the asymptotic rate, with 85.5% and 92.7%, respectively.

[Figure 1. Estimated eigenvalues in the complex plane, panels for T = 200 and T = 500.]

When the VAR dgp without false unit roots and c_{π/2} = 2 is considered, the hit rates of Λ at π/2 are monotonously increasing in the sample size again. The rates are smaller than those of the likelihood-based tests, however, and also clearly worse than those of Λ at 0 and π, cf. Table 2 again.
Taken together, at frequencies 0 and π which correspond to real-valued unit roots, the use of Λ was advantageous for T = 50. It also scored more hits for T = 100 and c π/2 = 1. For higher sample sizes the likelihood-based tests clearly dominate Λ at these two frequencies. At π/2 this superiority of the likelihood-based tests for all sample sizes and both dgps continues. The example also points to a general weakness: if the sample size is low and false unit roots are present, it can be difficult for Λ to distinguish them from actual unit roots.

VARMA Processes
The second setup consists of VARMA data generated by state space systems (A_r, C_r, K_r), r = 1, 2, as in (1), where the matrices A_1 and A_2 are constructed as in (2). These two choices yield the same state space unit root structures as those of the two VAR dgps, with c_{π/2} = 1 and c_{π/2} = 2 for A_1 and A_2, respectively. The other two system matrices K_r ∈ R^{(2+2r)×s} and C_r ∈ R^{s×(2+2r)} with s = 8 are drawn randomly from a standard normal distribution in each replication, and (ε_t)_{t∈Z} is multivariate normal white noise with an identity covariance matrix. Note that these systems lie within the VARMA model class, so the dgp is contained in the VAR setting only when the lag length increases as a function of the sample size. While superiority of the CVA approach in such a setting might be expected, it is far from obvious. Moreover, using a long VAR approximation is the industry norm in such situations.
From the hit rates in Table 3 it can be seen that the combination of large s, small T, and a minimal lag length of four renders the likelihood-based tests useless at all frequencies, with hit rates below ten percent for T = 50. Λ, in contrast, does not suffer from this problem and is already close to 95% for this sample size. Only when T = 200 do the likelihood-based tests appear to work, exhibiting hit rates close to 95%. For all tests alike, however, it is striking that hit rates move away from 95% when T = 500. This behavior is most pronounced for Λ: from T = 200 to T = 500 its hit rate drops from 93.1% to 82.4% at frequency 0 when A_2 is used. This phenomenon is a consequence of the fact that f and k in the algorithm are chosen in a data-dependent way. An inspection of how the hit rates depend on f and k, and a comparison with the actually selected f̂, k̂, reveals that for T = 500 values of f and k that are too large are chosen too often, leaving room for improvement in the hit rates, cf. Figure 2. The figure stresses an important point: the performance of the unit root tests is heavily influenced by the selected lag lengths for all procedures. We tested a number of different information criteria in this respect; AICc turned out to be the best criterion overall, but not uniformly. Figure 2 indicates advantages of BIC over AIC in this example, as BIC on average selects smaller lag lengths, associated here with higher hit rates.
To study the power of the different procedures, the transition dynamics A r in (9) are multiplied by ρ ∈ {0.8, 0.85, 0.9, 0.95} so that the systems do not contain unit roots at any of the frequencies. Here empirical power is defined as the frequency of choosing zero common trends. This is why for ρ = 1, when there are in fact common trends present in our specifications, the empirical power values plotted in Figure 3 are not equal to the actual size we could define as one minus the hit rate: our measure of empirical power in this situation only counts the false test conclusion of zero common trends, but there are of course multiple ways the testing procedure could conclude falsely. As expected, rejection of the null hypothesis is easiest when ρ is small and is very difficult when it is close to 1, cf. Figure 3 for the case of A 2 .
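The effect of this scaling is easy to verify numerically: multiplying the transition matrix by ρ < 1 scales every eigenvalue by ρ, so all roots move strictly inside the unit circle and the system becomes stationary. A minimal check with a rotation matrix (eigenvalues at ±i):

```python
import numpy as np

# shrinking the transition dynamics removes all unit roots
A = np.array([[0.0, -1.0], [1.0, 0.0]])   # eigenvalues +-i (unit roots at pi/2)
for rho in (0.8, 0.85, 0.9, 0.95):
    moduli = np.abs(np.linalg.eigvals(rho * A))
    assert np.all(moduli < 1.0)           # stationary for every rho < 1
```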
Further, there are almost no differences among the likelihood-based tests over all combinations of sample size and frequency, only for T = 100 is JS significantly worse than the Q i , i = 1, 2, 3 at π/2. It is also clearly visible at all frequencies that the likelihood-based tests possess no or only very limited power when T = 50 and T = 100, respectively. Λ, in contrast, is clearly more powerful in these cases. As the sample size increases to T = 200, the power of each test improves, still Λ remains the most powerful option. Only for T = 500 have the differences almost vanished with small, but significant, advantages for Λ at 0 and π.
The results are the same when A 1 is used and c π/2 = 1 and all of the differences described here are statistically significant.
Next the estimation performance of CVA is evaluated by calculating the gaps between the true and the estimated cointegrating spaces. At all frequencies these gaps are compared with those of the GARR procedure of [5], which cycles through the frequencies. At π/2, CVA and GARR are additionally compared with our implementation of JS and the cRRR of [4], whereas at 0 and π they are compared with the usual Johansen procedure. All estimates are conditional on the true state space unit root structure in the sense that the minimal number of states used is larger or equal to the number of unit roots over all frequencies. Other than imposing this minimum state dimension, the estimation of the order using SVC is not influenced. The likelihood-based procedures, on the other hand, take the unit root structure as given, i.e., they do not perform cointegrating rank testing for this estimation exercise.
From the results in Table 4 it can be noted first that the likelihood-based procedures show mostly equal mean gaps. Only for π/2 and T = 50 and both dgps does JS possess significantly larger gaps than cRRR and GARR and other differences are not statistically significant. Thus it does not matter in our example whether the iterative procedure is used or not.
Second, CVA is again superior for T = 50 where it exhibits mean gaps that are significantly smaller than those of the other estimators at all frequencies. This advantage is turned around for higher sample sizes, though: mean gaps are smaller for the likelihood-based procedures when T ∈ {100, 200, 500} and A 2 is used, if only slightly. When A 1 is used instead, mean gaps do not differ significantly from each other at π/2 when T > 50 and at 0, π when T = 100 and those of CVA are only very modestly worse when T ∈ {200, 500} at 0, π.  Thus, when it comes to estimating the cointegrating spaces, CVA is superior for T = 50 and equally good or only slightly worse than the likelihood-based procedures for higher sample sizes. For the systems analyzed, decreasing c π/2 leads to gaps that are smaller for all methods and these improvements are slightly larger for CVA than for the other estimators.

Robustness of Unit Root Tests for Daily Data
In this last simulation example we examine the robustness of the proposed procedures with regard to test performance and prediction accuracy with respect to the innovation distribution and the existence of conditional heteroskedasticity of the GARCH-type, as these features are often observed in data of higher frequency, for example in financial applications. While our asymptotic results do not depend on the distribution of the innovations (subject to the assumptions), the assumptions do not include GARCH effects. Nevertheless, the theory in [25,26] suggests that the tests might be robust also in this respect.
The entries of the matrices C and K are chosen as independent standard normally distributed random variables as before.
A process (y_t)_{t=1,...,T} is generated by filtering an independent identically distributed innovation process (ε_t)_{t=−199,...,T+1} through the system (A, C, K). The first 200 observations are discarded and the last ones are used for validation purposes. A total of 1000 replications is generated, where in each replication a different system is chosen.
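This simulation design can be sketched as follows (a minimal sketch; the function name and the default of standard normal innovations are ours, and the t-distributed and GARCH innovations used later would be plugged in instead):

```python
import numpy as np

def simulate_state_space(A, C, K, T, burn=200, rng=None):
    """Generate y_t from the innovations form
        x_{t+1} = A x_t + K eps_t,   y_t = C x_t + eps_t,
    starting from x_0 = 0 and discarding the first `burn` observations,
    as in the simulation design of the text."""
    rng = np.random.default_rng(rng)
    n, s = A.shape[0], C.shape[0]
    x = np.zeros(n)
    out = np.empty((T, s))
    for t in range(burn + T):
        eps = rng.standard_normal(s)
        y = C @ x + eps
        x = A @ x + K @ eps
        if t >= burn:
            out[t - burn] = y
    return out

# toy stationary example
y = simulate_state_space(0.5 * np.eye(2), np.eye(2), 0.3 * np.eye(2), T=100, rng=0)
```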
With the generated data three different estimates are obtained. First, an autoregressive model (called AR in the following) is estimated with the lag length chosen by AIC with maximal lag length √T. Second, an autoregressive model with a large lag length (called ARlong) is estimated; this estimate hints at the behavior of an autoregression with lag length equal to a full year, which would correspond to estimating a VECM without rank restrictions when accounting for yearly differences. The third method is CVA, where f = p = 2k̂_AIC is chosen. The order is estimated by minimizing SVC; however, we correct orders smaller than n = 7 upwards, since smaller orders would preclude finding all unit roots.
First, we compare the prediction accuracy of the three methods for two different distributions of the innovations: beside the standard normal distribution, the Student t-distribution with v = 5 degrees of freedom (scaled to unit variance) is used. This distribution has considerably heavier tails than the normal distribution but is nevertheless covered by our assumptions. Figure 4 provides the out-of-sample one-day-ahead mean absolute prediction errors (over all coordinates) for the sample sizes T = 364 days (one year), T = 1092 (three years) and T = 3276 (nine years). The long AR model is estimated with lag lengths of 8 weeks for the smallest, 10 weeks for the medium, and 12 weeks for the largest sample size. The figure presents the results for the normally distributed innovations as well as for the t-distributed ones. For the two larger sample sizes the mean absolute prediction error of CVA is smaller in all cases; since the maximal standard error of the estimated means over 1000 replications for T = 1092 and T = 3276 amounts to 0.05, this allows the conclusion that CVA performs better for these sample sizes. For the smallest sample size, by contrast, results are mixed: for CVA the results for the heavy tailed distribution are much worse than for the normal distribution, whereas for the larger sample sizes the differences are small. For T = 364 there are no statistically significant differences between the three methods, with a standard error over the replications of the mean absolute error of 0.18 for normally distributed and 3.4 for t-distributed innovations. CVA seems to suffer more from a few very large errors: in terms of root mean square errors the CVA results for T = 364 are comparatively worse, while in terms of the 95% percentiles CVA performs best also for the smallest sample size.
The long AR models are clearly worse than the two other approaches. This happens even if we are still far from using a full year as the lag length.
With regard to the unit root tests we investigate the tests of H_0: c_m = 1 versus H_1: c_m = 0 at all frequencies 2πm/364, m = 0, ..., 363. The data generating process features unit roots with c_m = 1 at the seven frequencies 2πk/7, k = 0, ..., 6. Therefore the tests should not reject at these seven frequencies, but should reject at all others.
Consequently we compare the minimum of the non-rejection rates over the seven unit root frequencies (called empirical size below) as well as the minimum of the rejection rates over the non-unit-root frequencies ω_j = 2πj/364, j ≠ 52k, k = 0, 1, ..., 6 (called empirical power below).
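The correspondence between the two frequency grids is easy to check: with 364 tested frequencies 2πm/364 and weekly unit roots 2πk/7, the unit root hypotheses sit exactly at the indices m = 52k.

```python
import numpy as np

m = np.arange(364)
unit_idx = 52 * np.arange(7)                       # m = 52k, k = 0,...,6
assert np.allclose(2 * np.pi * unit_idx / 364, 2 * np.pi * np.arange(7) / 7)
non_unit_idx = np.setdiff1d(m, unit_idx)
# empirical size uses non-rejections at the 7 unit root indices,
# empirical power uses rejections at the remaining 357 frequencies
```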
For the larger sample sizes the empirical size is practically 95% while the empirical power is 100%. For T = 364 we obtain an empirical size of 90% for the normal distribution and 91.5% for the t-distribution. The worst empirical power equals 89.3% (normal) and 87.6% (t-distribution). Hence even for one year of data the discrimination properties of the unit root tests are good and we do not observe differences between the normal distribution for the innovations and the heavy tailed t-distribution.
Finally we compare the empirical size and power of the tests for the various unit roots for smaller sample sizes T ∈ {104, 208, 312, 416, 520}. For these experiments we consider univariate GARCH models for the components of the innovations, ε_{t,i} = σ_{t,i} η_{t,i} with σ²_{t,i} = ω + α ε²_{t−1,i} + β σ²_{t−1,i}, i = 1, ..., 4, where (η_{t,i})_{t∈Z} is independent and identically standard normally distributed and α, β ≥ 0 are reals. It follows that the component processes (ε_{t,i})_{t∈Z} show conditional heteroskedasticity, the persistence of which is governed by α + β: here 0 < α + β < 1 implies stationarity, while α + β = 1 implies persistent conditional heteroskedasticity, usually termed I-GARCH. We include five different processes for the innovations, among them IG3 with α = 0.2, β = 0.8. For the five sample sizes, 1000 replications of the estimates using the CVA algorithm are obtained. For each estimate we calculate the test statistic for testing H_0: c_m = 1 versus H_1: c_m = 0 for m = 0, ..., 363, corresponding to the unit roots z_m = exp(2πim/364). This set of unit roots contains all seven unit roots exp(2πik/7), k = 0, ..., 6. Figure 5 provides the mean over the 1000 replications of the test statistics Λ(1) for z_m, m = 0, ..., 363, and the five sample sizes. It can be seen that the test Λ(1) is able to pinpoint the seven unit roots present in the data generating process fairly accurately even for sample size T = 104. The zoom on the region around the unit root frequency 2π/7 shows that already for T = 312 the mean value at the adjacent frequency 2π·53/364 is larger than the cutoff value of the test (the dashed horizontal line).
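The GARCH(1,1) recursion used for the innovations can be sketched as follows (a sketch assuming standard normal η; the choice of ω, set here for unit unconditional variance where it exists, is illustrative):

```python
import numpy as np

def garch11_innovations(T, alpha, beta, omega=None, rng=None):
    """Simulate univariate GARCH(1,1) innovations
        eps_t = sigma_t * eta_t,
        sigma_t^2 = omega + alpha * eps_{t-1}^2 + beta * sigma_{t-1}^2,
    with standard normal eta_t. By default omega targets unconditional
    variance 1 when alpha + beta < 1; for I-GARCH (alpha + beta = 1) a
    small positive omega is used (illustrative choice)."""
    rng = np.random.default_rng(rng)
    if omega is None:
        omega = 1.0 - alpha - beta if alpha + beta < 1 else 1e-3
    sig2 = omega / (1.0 - alpha - beta) if alpha + beta < 1 else 1.0
    eps = np.empty(T)
    for t in range(T):
        eps[t] = np.sqrt(sig2) * rng.standard_normal()
        sig2 = omega + alpha * eps[t] ** 2 + beta * sig2
    return eps

# IG3 from the list above: alpha = 0.2, beta = 0.8 (I-GARCH, persistent)
e = garch11_innovations(520, 0.2, 0.8, rng=0)
```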
Figure 5. Results of the unit root tests for all seasonal unit roots jointly. (a) Mean of unit root test statistics. (b) Zoom of mean unit root tests.

Table 5 lists the minimum of the achieved percentages of non-rejections of the test statistic over the seven unit root frequencies as well as the maximum over all non-unit-root frequencies. It can be seen that for all GARCH models and T = 312 the test rejects unit roots at every non-unit-root frequency in all replications, while the empirical size is close to the nominal 5%. For small sample sizes the tests are slightly undersized, while for T = 208 a slight oversizing is observed. The two larger sample sizes are omitted as the tests perform perfectly there. It follows from the examples presented in this subsection that the test is robust also in small samples with respect to heavy tailed distributions of the innovations (subject to the assumptions). Furthermore, a remarkable robustness with respect to GARCH-type conditional heteroskedasticity is observed.

Application
In this section we apply CVA to the modeling of electricity consumption using a data set from [36]. The dataset contains hourly consumption data (in megawatts) for a number of US regions, scraped from the webpage of PJM Interconnection LLC, a regional transmission organization. The set of regions has changed over time; thus the data set contains many missing values. It also contains data aggregated into regions called east and west, which are not used subsequently.
In order to avoid problems with missing values, we restrict the analysis to four regions, for which data over the same time period is available: American Electric Power (AEP; in the following printed in blue), the Dayton Power and Light Company (DAYTON; black), Dominion Virginia Power (DOM; red) and Duquesne Light Co. (DUQ; green). We use data from 1 May 2005 until 31 July 2018. In this period only 3 data points are missing for the four regions and their imputation is handled by interpolation of the corresponding previous values. One observation in this sample is an obvious outlier which is corrected for analogously.
The data is split into an estimation sample covering observations up to the end of 2016 (102,291 observations on 4263 days) and a validation sample containing data from 2017 and 2018 (13,845 observations on 577 days). Data is equally sampled but contains two-hour segments when switching from winter to summer time or back. Table 6 contains some summary statistics.

Table 6. Summary of data sets.

[Table 6. Summary per region: daily observations (4263 est., 577 val.) and hourly observations (102,291 est., 13,845 val.).]

Figure 6 provides an overview of the data: panel (a) shows the full data on an hourly basis, while (b) presents the aggregation to daily frequency. Panel (c) zooms in on a two-year stretch of daily consumption. Panel (d) finally provides hourly data for the first month of the validation data. The figures clearly document strong daily, weekly and yearly patterns. These seasonal fluctuations appear somewhat regular, with changes over time; it is hence not clear whether a fixed seasonal pattern is appropriate. Also note that the sampling frequency is hourly, such that a year covers roughly 8760 observations. In the following we estimate (on the estimation part) and compare (on the validation part) a number of different models, first for the full hourly data set and afterwards for the aggregated daily data. As a benchmark we use univariate AR models including deterministic seasonal patterns for daily, weekly and yearly variations. Subsequently we estimate models using CVA, including different sets of such seasonal patterns.
First, fixed periodic patterns are estimated using dummy variables. We model the natural logarithm of consumption (to reduce problems due to heteroskedasticity) and include dummies for weekdays and hours as well as sine and cosine terms corresponding to the first 20 Fourier frequencies with respect to annual periodicity. The corresponding results are shown in Figure 7. There is clearly substantial periodic variation, and, as expected, the four data sets show very similar patterns.
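The deterministic regression described above can be sketched as follows. This is a hypothetical minimal implementation, not the authors' code: the functions `seasonal_design` and `deseasonalize` and the exact dummy layout are illustrative assumptions; `y` is one region's log-consumption as a 1-D array at hourly frequency.

```python
import numpy as np

def seasonal_design(n_obs, hours_per_year=365 * 24, n_fourier=20):
    """Hour-of-day dummies, weekday dummies and annual Fourier terms."""
    t = np.arange(n_obs)
    hour = t % 24
    weekday = (t // 24) % 7
    cols = [np.ones(n_obs)]
    # hour-of-day dummies (one level dropped to avoid collinearity)
    for h in range(1, 24):
        cols.append((hour == h).astype(float))
    # weekday dummies (one level dropped)
    for d in range(1, 7):
        cols.append((weekday == d).astype(float))
    # first 20 sine/cosine pairs at annual periodicity
    for j in range(1, n_fourier + 1):
        w = 2 * np.pi * j * t / hours_per_year
        cols.append(np.sin(w))
        cols.append(np.cos(w))
    return np.column_stack(cols)

def deseasonalize(y):
    """OLS fit of the deterministic terms; returns the residual series."""
    X = seasonal_design(len(y))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```

The residual series would then be passed on to the AR and CVA modeling steps discussed next.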
After the extraction of these deterministic terms the next step is univariate autoregressive (AR) modeling. Figure 8 shows the BIC values of AR models of lag lengths zero to 800 for the four series as well as the BIC of a multivariate AR model for the same number of lags. The chosen values are given in Table 6.  The BIC curve is extremely flat for the univariate models. Noticeable drops in BIC occur around lag 24 (one day), 144 (six days), 168 (one week), 336 (two weeks), 504 (three weeks). BIC selects large lag lengths from 529 (DUQ) up to 554 (DOM). AIC selects lag lengths close to the maximum allowed with a minimum at 772 lags. The BIC pattern of the multivariate model differs in that the two drops at two and three weeks are missing. Instead, the optimal BIC value is obtained at lag 194, well below the optimal lag lengths in the univariate cases. AIC here opts for lag length 531, just over 22 days.
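The lag-length selection just described can be sketched for the univariate case as follows; a hypothetical helper, not the authors' code, fitting AR(k) by OLS on a common effective sample so that the information criteria are comparable across k (the paper uses lags up to 800, which follows the same scheme at higher cost).

```python
import numpy as np

def ar_ic(y, k_max):
    """Return (bic, aic) arrays for AR(k), k = 1..k_max."""
    y = np.asarray(y, dtype=float)
    T = len(y) - k_max              # common effective sample size
    bic, aic = np.empty(k_max), np.empty(k_max)
    for k in range(1, k_max + 1):
        # regressors: intercept and lags 1..k, aligned on the common sample
        X = np.column_stack([y[k_max - j: len(y) - j] for j in range(1, k + 1)])
        X = np.column_stack([np.ones(T), X])
        yy = y[k_max:]
        beta, *_ = np.linalg.lstsq(X, yy, rcond=None)
        sigma2 = np.mean((yy - X @ beta) ** 2)
        bic[k - 1] = np.log(sigma2) + k * np.log(T) / T
        aic[k - 1] = np.log(sigma2) + 2 * k / T
    return bic, aic
```

The selected orders are then `k_bic = 1 + np.argmin(bic)` and `k_aic = 1 + np.argmin(aic)`.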
Subsequently CVA is applied with f = k_BIC and p = k_AIC as estimated for the multivariate model. This differs from the usual recommendation of f = p = 2k_AIC in order to avoid numerical problems with huge matrices. The order is chosen according to SVC, resulting in n̂ = 240. The corresponding model is termed Mod 1 in the following. Note that this configuration of f, n̂ does not fulfill the requirements of our asymptotic theory: the bound f ≥ n ensures that the matrix O_f has full column rank. Generically this is already the case for fs ≥ n, leading to a less restrictive assumption. In practice a too small value of f would be signalled by an n̂ estimated close to the maximum, which is not the case here.
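The core CVA steps referred to throughout this section (RRR of the stacked future on the stacked past, state extraction via the leading singular vectors, then least squares for the system matrices) can be sketched as follows. This is a hypothetical bare-bones implementation under simplifying assumptions (no deterministic terms, plain sample covariance weighting), not the authors' code.

```python
import numpy as np

def cva(y, f, p, n):
    """y: (T, s) array; returns order-n estimates (A, C, K) and singular values."""
    T, s = y.shape
    rows = range(p, T - f + 1)
    # past stack Y-_{t,p} = [y_{t-1}', ..., y_{t-p}']', future Y+_{t,f} = [y_t', ..., y_{t+f-1}']'
    Ym = np.column_stack([y[t - p:t][::-1].ravel() for t in rows])
    Yp = np.column_stack([y[t:t + f].ravel() for t in rows])
    N = Ym.shape[1]
    Spp, Sff, Sfp = Ym @ Ym.T / N, Yp @ Yp.T / N, Yp @ Ym.T / N
    # whitening: with L L' = S^{-1} one has L' S L = I
    Lp = np.linalg.cholesky(np.linalg.inv(Spp))
    Lf = np.linalg.cholesky(np.linalg.inv(Sff))
    U, sv, Vt = np.linalg.svd(Lf.T @ Sfp @ Lp)
    Khat = Vt[:n] @ Lp.T                 # maps the past into the state
    x = Khat @ Ym                        # estimated state sequence
    y_now = y[p:T - f + 1].T             # y_t aligned with the state
    C = y_now @ np.linalg.pinv(x)
    eps = y_now - C @ x                  # innovation estimates
    Z = np.vstack([x[:, :-1], eps[:, :-1]])
    AK = x[:, 1:] @ np.linalg.pinv(Z)    # x_{t+1} = A x_t + K eps_t
    return AK[:, :n], C, AK[:, n:], sv
```

The singular values `sv` are the sample canonical correlations on which the SVC order selection and, in the unit root setting, the rank information discussed in the Conclusions are based.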
As a second model we use only the weekday dummies and neglect the other deterministic terms. Again AIC (k_AIC = 531) and BIC (k_BIC = 195) are used to determine the optimal lag length in the multivariate AR model. The corresponding CVA estimated model uses n̂ = 245 according to SVC, resulting in Mod 2.
The third model uses only a constant as deterministic term. Similar AIC (555) and BIC (195) lag lengths are selected. A state space model, Mod 3, is estimated using CVA with n̂ = 209. Figure 9 provides information on the results. Panel (a) shows the coefficients of the univariate AR models: lags around one day and around one to three weeks play the biggest role for all four datasets. Panel (b) shows that the multivariate models lead to better one-step-ahead predictions in terms of the root mean square error (RMSE). Mod 1 and Mod 2 show practically equivalent out-of-sample prediction errors for all four data sets, while Mod 3 delivers the best out-of-sample fit for all four regions.
Figure 9. Results for the hourly datasets: (a) AR coefficients; (b) RMSE on the validation data set.
In financial applications in particular, data of high sampling frequency shows persistent behaviour, also in terms of conditional heteroskedasticity, as well as heavy-tailed innovation distributions. For our data sets Figure 10 below provides some information in this respect for the residuals of Mod 3. Panel (a) plots the residuals for the year 2018 (contained in the validation period): large deviations occur occasionally, while otherwise the residuals vary in a tight band around 0. The kernel density estimates for the normalized (to unit variance) residuals on the full validation data set in panel (b) show the typical heavy-tailed shape. Panel (c) contains an ACF plot for the four regions, again calculated on the full validation sample. It demonstrates that the model successfully eliminates almost all autocorrelation, with only a few ACF values outside the confidence interval. Panel (d) provides the ACF plot for the squared innovations to examine GARCH-type effects. While GARCH effects are clearly visible, the ACF drops to zero fast, with occasional positive values (except perhaps for the Duquesne data).
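The diagnostics in panels (c) and (d) amount to sample ACFs of the residuals and of their squares, with an approximate 95% band. A small illustrative helper (hypothetical names, not the authors' code) could look as follows:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def garch_screen(resid, max_lag=48):
    """ACF of residuals and of squared residuals plus a rough 95% band."""
    band = 1.96 / np.sqrt(len(resid))
    return acf(resid, max_lag), acf(resid ** 2, max_lag), band
```

Values of the squared-residual ACF well outside the band, as for the Duquesne data, indicate GARCH-type conditional heteroskedasticity.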
Applying the eigenvalue-based test Λ(1) for c = 1 at all Fourier frequencies ω_j = 2πj/(365 · 24), we find that for Mod 2 and Mod 3 the largest p-value is obtained at ω_365, corresponding to a period length of one day: 0.0187 for Mod 2 (test statistic 6.6) and 0.02 for Mod 3 (test statistic 6.5). Thus the unit root at frequency ω_365 is not rejected at the 1% level, but is rejected at the 5% level. All other unit roots are rejected at every usual significance level. For Mod 1 the test statistic at ω_365 equals 41.2, corresponding to a p-value of practically 0. This implies that, on top of a deterministic daily pattern, the series show strong persistence at the daily period. Excluding the hourly dummies pulls the roots closest to ω_365 closer to the unit circle, so that the unit root tests do not reject, and improves the one-step-ahead forecasts; including the dummies weakens the evidence for a unit root while leading to worse predictions.
The analysis is repeated with the data aggregated to daily sampling frequency. The aggregation reduces the required lag lengths, as is visible from Table 6 in the univariate case, and hence we use CVA with the recommended f = p = 2k_AIC. Besides the univariate models, in this case also a naive model predicting today's consumption by yesterday's consumption is used. Three multivariate models are estimated: Mod 1 contains weekday dummies and sine and cosine terms for the first twenty Fourier frequencies corresponding to a period of one year; Mod 2 contains only the weekday dummies; Mod 3 uses only the constant. Figure 11 provides the out-of-sample RMSE for one-day-ahead predictions (panel (a)) and seven-days-ahead predictions (panel (b)).
Figure 11. Results for the daily datasets: (a) RMSE of one-day-ahead predictions; (b) RMSE of seven-day-ahead predictions.
It can be seen that both Mod 1 and Mod 2 beat the univariate AR models in terms of one-step-ahead prediction error, while Mod 3 performs better for seven-days-ahead prediction. Mod 1 performs on par with Mod 2 for one-step-ahead prediction but better in predicting seven steps ahead. In Figure 12 the poles and zeros of the three estimated state space models are plotted. Here the poles (marked with 'x') are the eigenvalues of the matrix A; these are the inverses of the determinantal roots of the autoregressive matrix polynomial in the equivalent VARMA representation. The zeros are the inverses of the zeros of the determinant of the MA polynomial. For Mod 3, with only a constant, poles close to the unit circle at the frequencies 2πj/7, j = 1, ..., 6, arise to capture the weekly pattern. The other two models show only one pole close to the unit circle, a real pole at almost z = 1. The pole for Mod 1 is closer to the unit circle than the one for Mod 2 (see panel (b)).
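For an innovations-form state space model x_{t+1} = A x_t + K e_t, y_t = C x_t + e_t, the plotted poles are the eigenvalues of A and the plotted zeros are the eigenvalues of A − KC (a standard result for the inverse transfer function). A minimal sketch, with hypothetical helper names:

```python
import numpy as np

def poles_zeros(A, C, K):
    """Poles and zeros of the innovations-form state space model (A, C, K)."""
    poles = np.linalg.eigvals(A)
    zeros = np.linalg.eigvals(A - K @ C)
    return poles, zeros

def near_unit_circle(roots, tol=0.02):
    """Flag roots whose modulus lies within `tol` of one."""
    return roots[np.abs(np.abs(roots) - 1.0) < tol]
```

Applied to the three estimated models, `near_unit_circle` would single out the roots that generate the persistent cyclical behaviour discussed in the text.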
For Mod 3 we obtain p-values for the tests of the three complex unit roots of 0.05 (ω = 2π/7), 0.165 (4π/7) and 0.01 (6π/7); hence none of them is statistically significant at level α = 0.01. The corresponding test for z = 1 shows a p-value of 0.004, providing evidence against the null hypothesis of the root being present. For Mod 1 the p-value of the test for z = 1 is 0.28 and hence we cannot reject the null. Mod 2 provides a p-value of 0.023 and hence weak evidence for the presence of the unit root; this is also visible in the distance of the nearest pole to the point z = 1 in Figure 12. Jointly this indicates that the location and strength of the persistence due to the estimated roots is influenced by the presence of deterministic terms: if the deterministic terms are not included in the model, the cyclical patterns are generated by poles situated close to the unit circle.
The decision whether unit roots exist on top of the deterministic seasonality is not easy in all cases: for the daily data the locations of the poles indicate that deterministic seasonality suffices to capture the weekly fluctuations, while a unit root at z = 1 appears to be needed to capture the yearly variations. For the hourly data there is evidence that the daily cycle is best captured with a unit root at frequency ω_365, which leads to the best predictive fit. Finally note that temporal aggregation from hourly to daily data implies that the frequency ω_365 of the hourly data aliases to the frequency ω = 0 in the daily data. Therefore the stronger evidence for a unit root at z = 1 found in the daily data might be a consequence of the unit root at frequency ω_365 found for the hourly data, compare [37].
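The aliasing argument can be illustrated numerically with simulated data (a self-contained sketch, not the paper's data): an hourly seasonal random walk x_t = x_{t−24} + e_t has unit roots at all 24 seasonal frequencies, including 2π/24, which is ω_365 in the notation above. Aggregating it to daily totals produces a process with a unit root only at frequency zero, since the daily increments are sums of independent innovations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 3000
x = np.zeros(24 * n_days)
e = rng.standard_normal(24 * n_days)
for t in range(24, len(x)):
    x[t] = x[t - 24] + e[t]                   # seasonal random walk, period 24
daily = x.reshape(n_days, 24).sum(axis=1)     # temporal aggregation to daily totals
d = np.diff(daily)
# `daily` behaves like a frequency-zero random walk: its levels are extremely
# persistent while its first differences `d` are white noise
```

In this stylized setting the aggregated series is exactly a random walk, mirroring the mechanism by which the hourly unit root at ω_365 could surface as the z = 1 root found in the daily data.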
The system matrix estimates, as well as the evidence in support of unit roots at ω_365 for hourly data and at z = 1 for daily data, obtained from the CVA modeling can be taken as starting points for subsequent quasi maximum likelihood estimation.

Conclusions
In this paper the asymptotic properties of CVA estimators for seasonally integrated unit root processes are investigated. The main results can be summarized as follows:
• CVA provides consistent estimators for long-run and short-run dynamics without knowledge of the location and number of unit roots. Hence the algorithm is robust with respect to the presence of trending components at frequency zero as well as at the other seasonal unit root frequencies.
• The singular values calculated in the RRR step reveal information on the total number of unit roots. The distance of the singular values to one can be used to construct a consistent estimator of this quantity.
• The eigenvalues of Â can be used to test for the number of common trends. Under the null hypothesis these tests are asymptotically equivalent to the corresponding tests using the true state, making the derivation of asymptotic results and the simulation of the test distribution simple.
• An analogous statement holds for the Johansen trace test in the I(1) case and analogous tests in the MFI(1) case calculated on the basis of the estimated state in the restrictive setting of n ≤ s. Under the null hypothesis these tests reject and accept asymptotically jointly with the corresponding tests calculated using the true state.
• From the simulation exercises we conclude that CVA performs best when the DGP is of the more general VARMA type, the process dimension is moderate to large and the sample size is small. Then it is superior to the likelihood-based procedures based on VAR approximations in terms of estimation performance and the size and power of Λ, the test developed from CVA. For larger sample sizes the likelihood-based procedures are clearly superior with respect to the size of the corresponding tests, whereas Λ remains the best test choice in terms of empirical power. The estimation performance is about equal for all procedures when the sample size is large, with slight advantages for the likelihood-based procedures.
• The simulations also demonstrate that the unit root test results are robust with respect to the distribution of the innovation sequence as well as to some forms of conditional heteroskedasticity of the GARCH type.
Because of the promising performance of CVA, and in particular its robustness, it can be recommended as a simple way to extract information on the number of common trends from the estimated matrix of transition dynamics. This information can be used to reduce the uncertainty in a subsequent likelihood ratio analysis, where quasi maximum likelihood estimates can be obtained starting from the CVA estimates. Since the CVA estimates can be computed numerically fast for a range of orders, they are a valuable starting point for the empirical modeling of time series potentially involving seasonal cointegration. Moreover they can also be used in situations where the number of seasons is large or even unclear, as in the hourly data sets of our case study.

Conflicts of Interest:
The authors declare no conflict of interest.
∫_0^1 W_j dB_j,C =: M_j, where W_j = W_j,R + i W_j,I = K_j,C B_j,C, K_j,C = K_j,R + i K_j,I, B_j,C = B_j,R + i B_j,I, and B_j,R, B_j,I are two independent Brownian motions with covariance matrix Ω. For j = 0 and j = S/2 the results hold analogously.
Proof. Most evaluations in (I) are standard, see for example Lemma 4 in [38].
(II) follows from the results in Section 4 of [2] for the complex valued representations or [39] for the corresponding real case.
Then for each circle B(λ_j, δ) around λ_j not containing any other eigenvalue of A there exist, from some t onwards,
• c_j eigenvalues of Â_t in the circle B(λ_j, δ) around λ_j,
• a basis Û_t,j of the space spanned by the eigenspaces of these c_j eigenvalues such that V_j Û_t,j = I_{c_j},
• a sequence of matrices B̂_t,j = V_j Â_t Û_t,j ∈ C^{c_j × c_j}.
Here Σ = U(Λ − I_n λ_j)^+ U^{-1}, where diag(s_1, ..., s_n)^+ = diag(s_1^+, ..., s_n^+) with x^+ = 1/x for x ≠ 0 and x^+ = 0 else; that is, (Λ − I_n λ_j)^+ denotes a quasi-inverse. Furthermore, for ρ = ‖δA_t‖ < 1 we obtain: The results follow directly from Section 2.9 of [23], see in particular Proposition 2.9.1 and the discussion below this proposition. Further note that the results hold for each root separately, and hence the restriction j = 1 needs to hold only for the investigated root for the results to apply. Finally note that a second order approximation Û_t,j = Z_0 + Z_1 + Z_2 and B̂_t,j = C_0 + C_1 + C_2 is accurate to the order o(‖δA_t‖²).
x and denote the sequence of transformed systems as (Â, Ĉ, K̂) = (S_T Ã S_T^{-1}, C̃ S_T^{-1}, S_T K̃). Let the block entries of S_0 be denoted as S_ij and the blocks of ∆S as ∆S_ij. Then: Proof. The proof follows from straightforward computations using the various orders of convergence, neglecting higher order terms.

Appendix C. Proofs of the Theorems
Appendix C.1. Proof of Theorem 1
For proving consistency of the transfer function estimators it is sufficient to find a (possibly) random matrix S̃_T such that the least squares estimates (Ã, C̃, K̃) of one representation (A, C, K) of the true system obtained using x̃_t := S̃_T x̂_t converge (a.s.) to (A, C, K). This is done in two steps: first a particular basis (which is not realizable in practice) is chosen such that ‖K̃_p − K_p‖ = o(1) sufficiently fast; second, the regressions in the system equations based on the resulting state estimator x̃_t are shown to be consistent. The derivation of the first step also provides an approximation of the error term which can be used to derive the asymptotic distribution. The central step in CVA is the solution of the RRR problem. The following proof draws heavily on the results of [24] (henceforth called BRRR), collected in Lemma A4 for easier reference. As in BRRR, in order to derive the asymptotic properties we first transform the vectors to separate stationary and nonstationary terms. To achieve the separation let Z_t = [y_{t−1}', y_{t−2}', ..., y_{t−S}']' ∈ R^{sS}. Then for p = kS we obtain: It is easy to see that for each j the process (Z_{rS−j})_{r∈N} is an I(1) process. Moreover the strict minimum-phase condition for (A_•, C_•, K_•) implies that the strict minimum-phase condition also holds for the system corresponding to (Z_{rS−j})_{r∈N}. [20] shows that in T_S Z_t the first c components are integrated while the remaining sS − c components are stationary. Then consider for p = kS < t ≤ T − f + 1 (using O_{S,1}^† = (O_{S,1}' O_{S,1})^{-1} O_{S,1}') z̃_t,p := Here O_{f,⊥} is a matrix such that O_{f,⊥}' O_{f,1} = 0, O_{f,⊥}' O_{f,⊥} = I. Obviously z̃_t,p is a linear transformation of Y−_t,p and ỹ_t of Y+_t,f. It can be shown that the linear transformation is nonsingular such that there is a one-one relation between Y−_t,p and z̃_t,p.
In z̃_t,p and ỹ_t only the first c components are unit root processes, the remaining components being stationary.
For p > kS the final p − kS block rows of z̃_t,p are defined as y_{t−(k−1)S−j} − y_{t−kS−j}, j = 1, ..., p − kS. Clearly these components are also stationary.
Since ‖Γ̂_{•,p} − Γ_{•,p}‖ = o(1) due to the definition of Γ̂_{•,p} and the continuity of the solution of the eigenvalue problem, it follows that ‖Ô_{•,p} − O_{•,p}‖ = o(1) and therefore lim sup_T det(Ô_{•,p}' Ô_{•,p}) > 0. As in Lemma 6 of [40] it can be shown that ‖Γ_{•,p} − [Γ_•]_{1:p}‖ = o(T^{-1}) and O_{•,p} = O_• + o(T^{-1}) for the range of p given in Theorem 1, since these matrices correspond to a stationary problem. Hence the chosen normalization of Γ̂_{•,p} can be used a.s. Next, in order to obtain the convergence of Ḡ to Γ_p, Lemma A.6 of BRRR is slightly extended to the current situation (for details and notation see there). Lemma A.6 of BRRR contains three parts: BRRR(I) gives bounds on the error in the matrices (with l_T = log T). Furthermore, using the expression given in Lemma A.7 of BRRR: This shows the result.
The transformations in the representation lead to an estimator Ḡ taking the place of K_p. Using S̃_T as defined in Lemma A5, the corresponding estimator Γ̂_p = S̃_T D̂ ... Proof. First note that the regression of Y+_t,f onto Y−_t,p includes the time points t = p + 1, ..., T − f + 1, whereas for estimating the system matrices we can use x̂_t, t = p + 1, ..., T + 1, and y_t, t = p + 1, ..., T. Thus in this proof we use ⟨a_t, b_t⟩_{p+1}^T := T^{-1} ∑_{t=p+1}^T a_t b_t' instead of ⟨a_t, b_t⟩ = T^{-1} ∑_{t=p+1}^{T−f+1} a_t b_t'. The following orders of convergence are straightforward to derive using the results of Lemma A1, ‖Ā_p‖ = o(T^{-1}), ‖(Γ̂_p − [Γ]_{1:p}) D_z^{-1}‖ = O((log T)^a) and x̂_t − x_t = (Γ̂_p − [Γ]_{1:p}) z̃_{t,p} − Ā_p x_{t−p}, t > p, according to Lemma A5 and Lemma A1 for the range of p given in Theorem 1:
⟨ε_t, x̂_t − x_t⟩_{p+1}^T = O(p(log T)^a / T), ⟨D_z z̃_{t,p}, x̂_t − x_t⟩_{p+1}^T = O(p(log T)^a / T^{1/2}),
⟨D_z z̃_{t+1,p}, x̂_t − x_t⟩_{p+1}^T = O(p(log T)^a / T^{1/2}), ⟨D_x x_t, x̂_t − x_t⟩_{p+1}^T = O(p(log T)^a / T^{1/2}),
⟨x̂_t − x_t, x̂_t − x_t⟩_{p+1}^T = O(p²(log T)^a / T).
Using these orders of convergence we obtain D̃_x ⟨x̃_t, x̃_t⟩_{p+1}^T D̃_x = D̃_x ⟨x_t, x_t⟩_{p+1}^T D̃_x + O(p²(log T)^a / T^{1/2}) > 0 a.s. From Lemma A1 also (D̃_x ⟨x̃_t, x̃_t⟩_{p+1}^T D̃_x)^{-1} = (D̃_x ⟨x_t, x_t⟩_{p+1}^T D̃_x)^{-1}(1 + o(1)) = O((log T)^a). Therefore consistency of the estimate in particular is established. Next, analogously (using the notation δx_t = x̃_t − x_t) we obtain (Ã − A)D^{-1} and therefore consistency of Ã is established. Finally it follows that ⟨ε̃_t, ε̃_t⟩_{p+1}^T = Ω + O(p²(log T)^a / T^{1/2}). Furthermore, since ε̃_t denotes the residuals of the regression of y_t onto x̃_t, it follows that ⟨ε̃_t, x̃_t⟩_{p+1}^T = 0. These expressions do not only show consistency of a specific order, but also give the relevant highest order terms for the asymptotic distribution, which are used below. As ĈÂ^j K̂ = C̃Ã^j K̃ → CA^j K = C_• A_•^j K_•, Lemma A6 establishes consistency of the impulse response sequence ĈÂ^j K̂ (thus proving Theorem 1 (I)) as well as, jointly with p = O((log T)^a), the rate of convergence O((log T)^a / T^{1/2}) for the non-realizable choice of the basis and the impulse response sequence CA^j K. In order to arrive at the canonical representation (Ǎ, Č, Ǩ) two steps are performed: first the reordered Jordan normal form is calculated, afterwards the matrices C̃_j,C are transformed such that E_j Č_j,C = I_{c_j} holds. We follow these steps below.
In the first step a transformation matrix Û needs to be found such that Û Ã Û^{-1} is in reordered Jordan normal form. In this respect Ã and A are used in Lemma A2. Accordingly Û_t = [Û_{t,1}, ..., Û_{t,S/2}, Û_{t,•}] can be defined such that V_j Û_{t,j} = I_{c_j}, where U ∈ R^{n×n} corresponds to the transformation from A to A_• as given in the theorem. An appropriate choice of z̃_{t,1} leads to U = I_n. Furthermore the basis of the space spanned by the columns of Û_{t,•}, where Û_{t,j}' Û_{t,•} = 0, can be chosen such that [0, I]Û_{t,•} = I for large enough T.
A first order approximation according to Lemma A2 then leads, for j = 0, ..., S/2, to ‖Û_{t,j} − U_j‖ = O((log T)^a T^{-1}) and thus also ‖Û_t − U‖ = O((log T)^a T^{-1}). Consequently the order of convergence of the transformed system (Â, Ĉ, K̂) is unchanged. In a second step an upper triangular transformation matrix Ũ can be found transforming (Â, Ĉ, K̂) such that the transition matrix corresponds to the reordered Jordan normal form. Due to the upper block triangularity of this transform we can apply Lemma A3 to show that the order of convergence remains identical. For the second step note that Lemma A3 provides the required terms. An application of the block diagonal transformation S_T = diag(E_1 C̃_{1,C}, ..., E_{S/2} C̃_{S/2,C}, S_{T,•}), where S_{T,•} transforms the stationary subsystem to echelon form, concludes the proof.
In order to derive the distribution of the sum of the eigenvalues note that, as in the proof of Theorem 2, according to Lemma A2 the sum of the eigenvalues of Ã converging to z_j obeys the second order approximation T tr[Â_{•,jj} − z_j I_{c_j}] + o_P(1), since (Â − A_•)U_j = O((log T)^a T^{-1}) in this case, implying that the second order terms vanish. Thus we obtain the asymptotic distribution under the null hypothesis as the limiting distribution of T tr[⟨K_{•,j,C}' ε_t, x_{t,j,C}⟩ ⟨x_{t,j,C}, x_{t,j,C}⟩^{-1}].
It is easy to verify that this test statistic is pivotal for complex and real unit roots. This proves Theorem 3.