1. Introduction
Subspace algorithms such as the canonical variate analysis (CVA) (
Larimore, 1983) are used to estimate linear dynamical state space systems for (multivariate) time series. CVA is popular because it is numerically cheap (it consists of a series of regressions), it is asymptotically equivalent to quasi-maximum likelihood estimation (qMLE, using the Gaussian likelihood) for stationary processes, and it is robust to the existence of simple unit roots (see
Bauer, 2005 for a survey). It has also been shown to provide consistent estimates when GARCH effects are present in the innovation process (
Bauer, 2008) and even for fractionally integrated processes when the system order tends to infinity at an appropriate rate as the sample size grows (
Bauer, 2018). Beyond asymptotic properties, finite sample properties of subspace algorithms, which also apply to CVA, have recently been investigated (
He et al., 2025; see also
Tsiamis et al., 2023).
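The algorithm fits a state space system in innovation form
$$y_t = C x_t + \varepsilon_t, \qquad x_{t+1} = A x_t + B \varepsilon_t$$
to an observed time series
$y_t \in \mathbb{R}^s$, $t = 1, \dots, T$. Here
$A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times s}$, $C \in \mathbb{R}^{s \times n}$ define the state space system with system order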
n. In this paper, we will assume that the system is minimal (the state dimension cannot be reduced; see (
Hannan & Deistler, 1988), Chapter 1, for details) and stable (such that all eigenvalues of
A are smaller than one in modulus).
The innovation representation corresponds to the Wold representation of the process $(y_t)_{t \in \mathbb{Z}}$ if and only if the eigenvalues of the matrix $A - BC$ are inside or on the unit circle. In this case, $|\lambda_{\max}(A - BC)| \le 1$, where $\lambda_{\max}(M)$ denotes a maximum modulus eigenvalue of the matrix M.
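The two eigenvalue conditions can be checked numerically. The following is a minimal sketch, assuming NumPy and standard normal innovations; the function names and the numerical system matrices are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_innovation_form(A, B, C, T, rng):
    """Simulate y_t = C x_t + e_t, x_{t+1} = A x_t + B e_t, started at x = 0."""
    n, s = B.shape
    x = np.zeros(n)
    y = np.empty((T, s))
    for t in range(T):
        e = rng.standard_normal(s)  # standard normal innovations (assumption)
        y[t] = C @ x + e
        x = A @ x + B @ e
    return y

def max_modulus(M):
    """Largest modulus among the eigenvalues of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Hypothetical stable and stably invertible first-order system.
A1, B1, C1 = np.array([[0.8]]), np.array([[0.5]]), np.array([[1.0]])
# Over-differenced white noise y_t = e_t - e_{t-1} written in innovation form.
A2, B2, C2 = np.array([[0.0]]), np.array([[1.0]]), np.array([[-1.0]])

rho_1 = max_modulus(A1 - B1 @ C1)  # strictly inside the unit circle
rho_2 = max_modulus(A2 - B2 @ C2)  # on the unit circle
y = simulate_innovation_form(A1, B1, C1, T=200, rng=np.random.default_rng(0))
```

The second system is stable, but the eigenvalue of $A - BC$ lies on the unit circle, which is the boundary case studied in this paper.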
The asymptotic properties of CVA documented in the literature for data generated from a state space system in innovation representation are restricted to the case of stably invertible processes, where the strict inequality
$|\lambda_{\max}(A - BC)| < 1$ holds, that is, all eigenvalues of $A - BC$ lie strictly inside the unit circle. However, this restriction may be violated, in particular for economic data, if the data are transformed to stationarity by temporal differencing without taking possible co-integrating relations into account. If co-integrating relations between the component variables exist and the whole time series is differenced, this leads to over-differencing in some directions, introducing spectral zeros at frequency
$\omega = 0$. A similar situation occurs if HP filtering is used to extract a trend:
Sakarya and de Jong (
2022) show that the spectrum of the detrended cyclical process then has a zero at frequency zero. Similar effects occur for seasonal differencing.
Processes that have spectral zeros are often called non-invertible.
Funovits (
2024), dealing with structural modelling involving non-invertible autoregressive moving average (ARMA) processes, excludes spectral zeros from his model class and argues that such processes are in fact invertible in the sense that long autoregressive approximations exist. These approximations, however, have non-standard properties; see below. Hence, we use the term ‘stably invertible’ for the case in which all eigenvalues of $A - BC$ lie strictly inside the unit circle.
In the presence of spectral zeros, the properties of the subspace estimators are currently unknown. This study closes this gap using results from
Poskitt (
2006) related to the autoregressive approximation of processes with spectral zeros. We show that CVA provides consistent estimators for the impulse response sequence also in the case of some spectral zeros at frequency
$\omega = 0$. From the proof, it is clear that analogous results also hold for zeros at different frequencies. Consistency is obtained for the integer parameter
p of CVA (corresponding to the lag length of an autoregressive approximation) tending to infinity at a certain rate. We investigate the asymptotic bias arising for finite lag lengths and show that for typical choices, it is not negligible, as it tends to zero more slowly than
$T^{-1/2}$, the typical convergence rate involved in asymptotic normality.
The significance of this result is demonstrated in a simulation exercise whose main implications are that inference in the over-differenced situation is unreliable, be it using CVA or qMLE or prediction error method (PEM) estimators (each suffering from different problems). The simulations illustrate the advantages of working with the original data rather than preprocessed data involving seasonal adjustment or differencing to transform the data to stationarity.
This paper is organized as follows: in the next section, the main idea of CVA is presented. Afterwards, the results of this study are described in
Section 3. Demonstrations are provided in
Section 4.
Section 5 concludes the paper. Proofs are relegated to the appendices.
2. Canonical Variate Analysis
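The CVA method of estimation is performed in three steps and uses two integers,
$f$ and $p$ (‘future’ and ‘past’), and information on the system order
n (compare
Bauer, 2005):
1. Obtain an estimate $\hat{x}_t$ of the state $x_t$ for each time point $t$ of the estimation sample.
2. Estimate C by regressing $y_t$ onto $\hat{x}_t$. This step provides residuals $\hat{\varepsilon}_t$.
3. Estimate A and B by regressing $\hat{x}_{t+1}$ onto $\hat{x}_t$ and $\hat{\varepsilon}_t$.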
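The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the author's implementation: it assumes the textbook CVA weighting (a singular value decomposition of the sample canonical correlations between stacked future and past observations), and the function names, the simulated scalar system and the values of f and p are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

def cva(y, n, f, p):
    """Sketch of the three CVA steps; returns (A, B, C) in some state basis."""
    T, s = y.shape
    idx = range(p, T - f + 1)
    past = np.array([y[t - p:t][::-1].ravel() for t in idx])  # y_{t-1}, ..., y_{t-p}
    fut = np.array([y[t:t + f].ravel() for t in idx])         # y_t, ..., y_{t+f-1}
    N = len(past)
    Wf = inv_sqrt(fut.T @ fut / N)
    Wp = inv_sqrt(past.T @ past / N)
    # Step 1: state estimate from the n dominant canonical directions of the past.
    _, _, Vt = np.linalg.svd(Wf @ (fut.T @ past / N) @ Wp)
    x = past @ (Vt[:n] @ Wp).T
    # Step 2: regress y_t on the estimated state to obtain C and residuals.
    y0 = y[p:T - f + 1]
    C = np.linalg.lstsq(x, y0, rcond=None)[0].T
    e = y0 - x @ C.T
    # Step 3: regress the next state on the current state and the residuals.
    Z = np.hstack([x[:-1], e[:-1]])
    AB = np.linalg.lstsq(Z, x[1:], rcond=None)[0].T
    return AB[:, :n], AB[:, n:], C

# Hypothetical stably invertible scalar system: A = 0.8, B = 0.5, C = 1.
rng = np.random.default_rng(1)
T = 5000
eps = rng.standard_normal(T)
y = np.empty((T, 1))
state = 0.0
for t in range(T):
    y[t, 0] = state + eps[t]
    state = 0.8 * state + 0.5 * eps[t]

A_hat, B_hat, C_hat = cva(y, n=1, f=5, p=10)
k1 = (C_hat @ B_hat).item()          # first impulse response coefficient, truth 0.5
k2 = (C_hat @ A_hat @ B_hat).item()  # second impulse response coefficient, truth 0.4
```

The state basis returned by the sketch is arbitrary, so only basis-invariant quantities such as the impulse response coefficients should be compared with the data-generating system.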
The essential idea of CVA lies in the estimation of
, which uses the representation of the joint vector
for some integer
as the state space system implies:
where
contains the impulse response coefficients and
denotes the observability matrix, which has full column rank due to minimality. Furthermore
denotes the regression coefficient for explaining
by
for integer
, leading to the approximation
. Then
denotes the approximation error.
Under the strict minimum-phase assumption implying
for
, we may use
, leading to
to infer that the variance of the approximation error
can be bounded by
such that it is of order
, where
. If in that case,
is used, we obtain
such that the variance of the approximation error is of order
. If
, this implies that the approximation error is negligible in the usual
asymptotics. In the stably invertible case, it is known that the lag length of a vector autoregressive approximation of the process
selected using information criteria like
BIC tends to infinity roughly as
(
Hannan & Deistler, 1988, see Theorems 6.6.3 and 6.6.4). Thus, selecting
by the
AIC optimal autoregressive lag length has been suggested (
Bauer, 2005). For integrated processes, twice this number has been suggested such that the approximation error decays as
.
For
, this argument does not work any more.
Poskitt (
2006) shows that also in this case, the approximation error decreases to zero, albeit not at the same speed, as shown in the following example:
Example 1. Consider $y_t = \varepsilon_t - \varepsilon_{t-1}$ for independent identically distributed white noise $(\varepsilon_t)$ with expectation zero and variance $\sigma^2$. This can be represented in state space form as $x_{t+1} = \varepsilon_t$, $y_t = -x_t + \varepsilon_t$, and hence $A = 0$, $B = 1$, $C = -1$ and $A - BC = 1$. Following Poskitt (2006), we see that , implying that
Denoting , we obtain such that the approximation error . It follows that . Thus the approximation error tends to zero in mean square, but the variance is of order $p^{-1}$ and does not decrease exponentially.
This example is typical. The same arguments show that the variance of the approximation error for for a stationary process with non-singular spectral density at $\omega = 0$ (not necessarily white noise) is at most of order $p^{-1}$. In order for this error to be negligible in typical asymptotic analyses, we require (see Example 2 below). Thus, in this setting, p must increase much faster than the usual rate proportional to $\log T$ for stably invertible processes to produce a negligible bias term.
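The slow decay in Example 1 can be verified numerically. The sketch below assumes the over-differenced white noise $y_t = \varepsilon_t - \varepsilon_{t-1}$ with $\sigma^2 = 1$, whose autocovariances are $\gamma(0) = 2$, $\gamma(1) = -1$ and zero otherwise, and uses the Levinson-Durbin recursion to compute the best linear predictor of order p. The closed forms used for comparison, coefficients $-(1 - j/(p+1))$ and prediction error variance $(p+2)/(p+1)$, are the standard ones for this process and are stated here as a check, not quoted from the text above. The function names are hypothetical.

```python
def levinson_durbin(gamma, p):
    """Return (phi, v): coefficients of the best linear predictor
    y_t ~ sum_j phi[j-1] * y_{t-j} of order p and its error variance."""
    phi, v = [], gamma[0]
    for k in range(1, p + 1):
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        kappa = acc / v  # reflection coefficient at order k
        phi = [phi[j] - kappa * phi[k - 2 - j] for j in range(k - 1)] + [kappa]
        v *= 1.0 - kappa ** 2
    return phi, v

def overdifferenced_acov(p, sigma2=1.0):
    """Autocovariances gamma(0), ..., gamma(p) of y_t = e_t - e_{t-1}."""
    return [2.0 * sigma2, -sigma2] + [0.0] * (p - 1)

# Excess of the order-p prediction error variance over sigma^2 = 1.
excess = {}
for p in (1, 5, 20, 50):
    phi, v = levinson_durbin(overdifferenced_acov(p), p)
    excess[p] = v - 1.0
```

The excess variance equals $1/(p+1)$: it vanishes as p grows, but only at rate $p^{-1}$ rather than exponentially.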
Note that in the over-differenced situation with
, the properties of
BIC optimal lag length selection do not follow from Theorem 6.6.3 (
Hannan & Deistler, 1988). The arguments in the corresponding proofs, however, suggest that the corresponding estimate
is close to the minimizer of
, where
denotes the one-step prediction error variance for an autoregression of order
p. In the example above, one obtains
. The minimizer of
equals the minimizer of
. Approximating
as
, one obtains
as the minimizer. This illustrates that in the over-differenced situation, one might expect a relatively large choice of the lag length roughly proportional to
in the autoregressive approximation chosen by
BIC. The simulations in
Section 4 demonstrate this behaviour.
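The heuristic can be illustrated with a short computation. The sketch assumes the usual univariate form of BIC, $\log \sigma_p^2 + p \log T / T$, and the prediction error variance $\sigma_p^2 = (p+2)/(p+1)$ of over-differenced white noise with unit innovation variance. Both are assumptions made for this illustration. Under the approximation $\log(1 + 1/(p+1)) \approx 1/(p+1)$, the minimizer is $\sqrt{T/\log T} - 1$. Function names are hypothetical.

```python
import math

def bic_population(p, T):
    """Assumed population BIC: log sigma_p^2 + p*log(T)/T with sigma_p^2 = (p+2)/(p+1)."""
    return math.log((p + 2) / (p + 1)) + p * math.log(T) / T

def bic_minimizer(T, pmax):
    """Integer lag length minimizing the assumed population criterion."""
    return min(range(pmax + 1), key=lambda p: bic_population(p, T))

def approx_minimizer(T):
    """Minimizer after replacing log(1 + 1/(p+1)) by 1/(p+1)."""
    return math.sqrt(T / math.log(T)) - 1.0

# (exact integer minimizer, approximation) for a few sample sizes
table = {T: (bic_minimizer(T, pmax=T // 2), approx_minimizer(T))
         for T in (100, 1000, 10000)}
```

In this sketch the selected lag length grows like a power of T, much faster than the logarithmic growth known for stably invertible processes.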
3. Results
Poskitt (
2006) derives results for the estimation accuracy of the autoregressive approximation coefficients. In his Theorem 5, he states that uniformly in
for some upper bound
and using
, we have
where
denotes almost sure convergence at the given rate. Here
denotes the autoregressive coefficients in a lag
p approximation for
obtained from
and
, which are the corresponding least squares estimates.
Poskitt (
2006) uses a univariate setting; however, the extension to multivariate time series in our framework is obvious.
In this study, we do not investigate autoregressive processes but state space processes with spectral zeros. We focus on the case of simple spectral zeros obtained by one-time over-differencing:
Assumption 1. The stationary process is generated using a rational, stable, and stably invertible transfer function (which has all its zeros and poles outside the unit circle) and an orthonormal matrix , , where is an integer, as (L denoting the backward-shift operator and )
Here denotes a zero mean ergodic, stationary, martingale difference sequence with respect to the sequence of sigma fields spanned by the past of , fulfilling
Furthermore, .
We use the same noise assumptions as
Poskitt (
2006) and
Hannan and Deistler (
1988). The conditional homoskedasticity assumption
can be replaced by the weaker assumption
, allowing for GARCH-type processes (
Bauer, 2008). We do not attempt to weaken the assumption of finite fourth moments. Clearly such processes have a spectral density of rank
(which is hence singular for
) for
due to the differencing. At all other frequencies, the rank equals
s, since
is assumed to be stably invertible.
These assumptions are fulfilled when examining first differences of a co-integrated I(1) process. Examples of such processes include multivariate time series, wherein some components are stationary, while others show the random walk-like behaviour typical for I(1) processes. In this situation, differencing the whole multivariate process will over-difference the stationary components. This can be avoided by only differencing the integrated components, if such a partitioning of the multivariate time series exists, which of course requires knowledge on which components are stationary and which are integrated.
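The rank drop can be made concrete with a small example. The sketch assumes a hypothetical bivariate process whose first component is white noise and whose second component is a random walk, both with unit innovation variance. Differencing both components gives the transfer function $\mathrm{diag}(1 - z, 1)$, and the spectral density follows directly.

```python
import numpy as np

def spectral_density(omega):
    """Spectral density of the differenced hypothetical process at frequency omega."""
    z = np.exp(-1j * omega)
    # Transfer function of (e1_t - e1_{t-1}, e2_t): the stationary component is
    # over-differenced, the integrated component becomes white noise.
    k = np.array([[1.0 - z, 0.0], [0.0, 1.0]])
    return k @ k.conj().T / (2.0 * np.pi)

rank_at_zero = np.linalg.matrix_rank(spectral_density(0.0))
rank_elsewhere = np.linalg.matrix_rank(spectral_density(np.pi / 2))
```

At frequency zero the rank is one, while at any other frequency it equals the dimension of the process, here two.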
Often, however, all components of a multivariate time series will be integrated, while linear combinations are stationary. This corresponds to a process where the number of common trends driving the time series is smaller than the dimension of the time series, the difference being equal to the co-integrating rank. This is reasonable in systems with many variables where inference on the co-integrating rank is difficult. But even in smaller systems, the decision on the number of co-integrating relations is sometimes not simple, as documented in examples in (
Johansen, 1995). If in such situations, the co-integrating rank is specified as too small or if, for simplicity, the full process is differenced, spectral zeros result.
To apply this result in our setting, note that the straightforward multivariate extension
to Theorem 2 of
Palma and Bondon (
2003)’s work implies that
is bounded from below by
for
according to Assumption 1.
This implies that the bound above amounts to , which tends to zero, if for . Note, however, that for this rate of increase, the approximation error (with variance of order ) is larger than and hence dominates the asymptotic distribution of terms like ; see Theorem 2 below.
The results from the autoregressive setting can be used here almost immediately if is fixed and , where depends on the sample size. This implies that for the approximation of , the unrestricted estimate equals an autoregressive model for . Here and below, we use the notation for two processes and . This matrix —which in the limit has rank n but in finite samples is of full rank—is then low-rank-approximated, leading to the estimate of .
In order to identify the factors
and
from the product, we use a selector matrix
such that
. Such a matrix always exists (cf. for example the overlapping echelon forms; Section 2.6 of
Hannan and Deistler (
1988)). Since the results below correspond to estimates of the impulse response coefficients (which are invariant in this respect), this choice of the state basis can be assumed without restriction of generality.
The second and third step of CVA then amount to least squares using the estimate
of the state. If instead we had access to the state approximation
as well as population instead of sample moments, we would obtain the following matrices:
If the approximation errors tend to zero and the convergence of sample covariances to population quantities is uniform in
p, then consistency for
follows (for the proof, see
Appendix A):
Theorem 1. Let the process be generated according to Assumption 1. Let the CVA procedure be applied with not depending on T and for such that .
Then for as . Consequently, almost surely in that case.
Note that these two error bounds are influenced differently by the integer p: a large p value reduces the approximation errors such as but increases the sampling error . It is the first one that tends to zero more slowly for spectral zeros than in the stably invertible case.
Example 2. Consider, as in Example 1, for white noise . Then and . It follows that Thus . This shows for the special case that the system for fixed p is a biased estimate of the true system . The bias is of order . This is typical, as is demonstrated by the next result.
Theorem 2. Let the process be generated as in Theorem 1.
- (I) Assume that . Then we obtain
- (II) provides an upper bound on the achievable error in the sense that .
In order for this bound of the bias to be asymptotically negligible, has to grow faster than . The bound may be conservative; however, Example 2 shows that, in general, cannot be smaller than such that the bias is of order . Even in this case, the required increase is faster than the upper bound used above, such that with our methods, we cannot derive results for the asymptotic distribution of the system estimates.
Additionally note that typically the upper bound for selecting the lag length is such that the bias derived above will dominate the asymptotics.
4. Illustration
To demonstrate the theoretical results, this section discusses a simulation exercise and applies the approaches to a Danish dataset that
Johansen (1995) used as a demonstration example for co-integration analysis in the vector error correction framework.
4.1. Simulations
We simulate a test example to indicate the relative performance of three different estimators for the original as well as the once-differenced time series:
CVA: The subspace procedure described above, where p is chosen according to twice the AIC optimal lag length in an autoregressive approximation.
qMLE: Maximum likelihood estimation based on the Gaussian likelihood. Here both stability and strict minimum-phase assumptions are imposed using a barrier function approach. The data are assumed to be stationary and the corresponding likelihood is optimized.
PEM: Prediction error methods use the assumption of . With this initialization, the Kalman filter collapses to the inverse system. Again, stability and strict minimum-phase assumptions are imposed using a barrier function approach.
qMLE and PEM are initialized using the CVA estimate but also employ randomization to reduce the probability of being trapped in local minima. For the data-generating process, we use a bivariate system:
Here
denotes a bivariate standard normal error process. Consequently, the process is a bivariate I(1) process with independent innovations. The first component is stationary, while the second is integrated. We apply the estimation procedures to
and to
, which has a state space representation of order three:
We generate datasets of sample size and estimate a state space system with for and for . Hereby is chosen, where denotes the lag length of an autoregressive approximation chosen using AIC with used as the upper bound for the original data and for the differenced data.
The results can be seen in
Figure 1. Plot (a) shows the mean squared error of the impulse response estimates multiplied by the sample size
T. For
, we calculate the impulse response corresponding to the differenced series. A convergence rate of order
would imply that the curves level off for large sample sizes. While this seems to be the case for CVA and PEM applied to the original data, all curves for the differenced series show an increase for the larger sample size. The superiority of the estimates for the original series is visible for all three estimators. CVA applied to the differenced series performs worst, particularly for large sample sizes. Additionally, it is visible that qMLE performs worse than PEM for the integrated process. This happens as the qMLE assumes stationary initialization of the state process in the Kalman filter. In the integrated case, this leads to very large initial uncertainty of the state, reflected in the poorer estimation. This is also visible in the estimates of the
element of
, the first element of the impulse response, as shown in plots (c) for
and (d) for
. All but the qMLE estimates appear to be Gaussian-distributed around the correct mean −0.5. In plot (c), the superiority of the estimates for the original series is visible in a higher concentration around
. Plot (b) provides mean selected lag lengths, documenting that for the original series the lag length selection roughly behaves as
(printed as red stars), while for the differenced series, the increase is much faster.
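The different behaviour of the selected lag lengths can be reproduced in a small experiment. The sketch below is not the simulation design of this section: it uses a hypothetical univariate white noise series and its first difference, a least squares autoregression on a common evaluation sample, and the AIC penalty $2p/T$. The function name, seed, sample size and upper bound are assumptions.

```python
import numpy as np

def aic_lag(y, pmax):
    """AIC-selected autoregressive lag length using a common evaluation sample."""
    T = len(y)
    target = y[pmax:]
    best_p, best_aic = 0, np.log(np.mean(target ** 2))
    for p in range(1, pmax + 1):
        # Column j holds the observations lagged by j periods.
        X = np.column_stack([y[pmax - j:T - j] for j in range(1, p + 1)])
        coef = np.linalg.lstsq(X, target, rcond=None)[0]
        aic = np.log(np.mean((target - X @ coef) ** 2)) + 2.0 * p / T
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p

rng = np.random.default_rng(0)
e = rng.standard_normal(2001)
p_orig = aic_lag(e[1:], pmax=60)       # white noise: short lag lengths suffice
p_diff = aic_lag(np.diff(e), pmax=60)  # over-differenced series: long lags needed
```

For the over-differenced series, short autoregressions leave a large part of the prediction error variance unexplained, so the criterion favours much longer lag lengths than for the original series.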
While the impulse response estimates for the large sample size appear to be Gaussian-distributed, this is not the case for all system-dependent quantities.
Figure 2a,b provide density estimates for the estimate of the sum of all entries of
. In (a), for the original data, one notices a bias for the qMLE estimates (compensating for the large initial state uncertainty), while in (b), for the differenced data, the CVA estimates show a downward bias. Plot (c) provides the deviation of the mean of the largest modulus of the eigenvalues of
for the original process (with true value
), as well as for the differenced process (with true value 1).
The larger downward bias for the qMLE estimates for the original data and the CVA estimates for the differenced data is clearly visible but diminishes with sample size. Note that for
onward for the differenced data for PEM and qMLE, the mean maximum modulus almost reaches the maximum allowed value, implying that almost all values are equal to the maximum. Plot (d) provides the ecdf for
of the maximum modulus of the eigenvalues of
. While for CVA, the maximum modulus is almost always estimated below 0.98, for PEM and qMLE, the opposite is true, with a large point mass located at 0.99. Only in 6% of the cases is the maximum modulus smaller than
. Therefore the maximum modulus for the qMLE and PEM estimates for the differenced data does not follow a normal distribution, not even for
.
Consequently, for the differenced series, inference provides obstacles for qMLE and PEM in the form of non-Gaussian distributions with discrete components and a dominating bias for CVA.
4.2. The Danish Dataset
The simulation evidence from the last subsection is briefly complemented with a real-world data set. The Danish dataset was used by
Johansen (
1995) for the illustration of co-integration modeling using the vector error correction representation. The dataset comprises four time series observed quarterly over the span of 55 quarters from 1974:1 to 1987:3. The variables are log real money (m2), log real income (y), the bank deposit rate for interest-bearing deposits (ide), and the bond rate (ibo). From the plots on p. 26 of
Johansen (
1995)’s work and the accompanying univariate tests, one may conclude that all four series appear to be integrated with stationary differences.
Johansen (
1995) uses seasonal dummies as deterministics.
Johansen (
1995) tests for the number of co-integrating relations and states on p. 112 ‘From Table 7.1 we find that there is no clear evidence for co-integration. […] We choose to maintain the hypothesis that r = 1, that is, that there is one co-integrating relation’. This is a typical situation where the number of co-integrating relations (if any) is not certain. Economic theory would imply the existence of such a relation, but the tests on the data are borderline and give no clear-cut results.
In addition to the estimation of the model,
Johansen (
1995) also tests—under the assumption of one co-integrating relation—for restrictions on the co-integrating relation.
We used the dataset to estimate models for the original data as well as for the differenced process. Seasonal dummies are included. We use CVA and PEM as in the simulation section. The results are provided in
Table 1.
For the original data,
was used as in
Johansen (
1995). As deterministic terms, quarterly dummies were added. For PEM and CVA,
resulted in the best fit (according to
AIC). For PEM, three common trends corresponding to one co-integrating relation minimize
AIC.
For the differenced data, was chosen. Lag length selection using AIC is very unstable, though. The results depend heavily on the upper limit if a uniform evaluation sample is used. For PEM and CVA, was selected for the differenced data, which corresponds to the first difference of the model for the original data.
We can see from
Table 1 that PEM for the original data results in the best one-step-ahead prediction error except for y, where CVA performs better, and ide, where PEM on the differenced series provides better forecasts.
In terms of the zeros of the transfer function for the original and for the differenced data, the PEM estimates provide zeros on the boundary to being unstable. For the differenced data, six out of the eight zeros have a modulus larger than . Also for the original data, the PEM model has three zeros with a modulus larger than . Thus, these estimates are also not well behaved.
Computing the spectral density at zero frequency according to the PEM model for the differenced data, we obtain for the square roots of the eigenvalues . This indicates that the rank is either two or three. For the first difference of the PEM model for the original data, we obtain , since there are only three common trends contained in the model. The strength of the third is, thus, also very weak.
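A computation of this kind can be sketched as follows. For a stable system in innovation form with innovation covariance $\Sigma$, the transfer function at $z = 1$ is $K(1) = I + C (I - A)^{-1} B$, and the spectral density at frequency zero is proportional to $K(1) \Sigma K(1)'$. The sketch omits the factor $1/(2\pi)$ as a normalization choice, and the example system is hypothetical; it is not one of the estimated models.

```python
import numpy as np

def sqrt_eigs_at_zero(A, B, C, Sigma):
    """Square roots of the eigenvalues of K(1) Sigma K(1)' (ascending order).
    Requires I - A to be invertible, i.e., no unit root in A."""
    n, s = A.shape[0], C.shape[0]
    K1 = np.eye(s) + C @ np.linalg.solve(np.eye(n) - A, B)
    F0 = K1 @ Sigma @ K1.T
    eig = np.linalg.eigvalsh((F0 + F0.T) / 2.0)  # symmetrize against rounding
    return np.sqrt(np.clip(eig, 0.0, None))

# Hypothetical over-differenced bivariate system (e1_t - e1_{t-1}, e2_t).
A = np.array([[0.0]])
B = np.array([[1.0, 0.0]])
C = np.array([[-1.0], [0.0]])
roots = sqrt_eigs_at_zero(A, B, C, np.eye(2))
```

A value close to zero among the returned roots indicates a direction in which the spectrum is (nearly) singular at frequency zero.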
With the PEM model for the original data, we can use the trace statistic in the state space error correction (SPECM) formulation including a restricted constant (see
Matuschek, 2020), providing the eigenvalues
. The corresponding trace statistics equal
, resulting in rank
with the usual sequential testing technique (using critical values
).
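The sequential procedure can be sketched as follows, assuming that the trace statistic takes the same form as in the vector error correction framework, $-T \sum_{i > r} \log(1 - \lambda_i)$. The eigenvalues and critical values in the example are placeholders chosen for illustration; they are neither the estimates from this section nor tabulated critical values.

```python
import math

def trace_statistics(eigenvalues, T):
    """Trace statistics for the hypotheses rank <= r, r = 0, 1, ..."""
    lam = sorted(eigenvalues, reverse=True)
    return [-T * sum(math.log(1.0 - l) for l in lam[r:]) for r in range(len(lam))]

def sequential_rank(stats, critical_values):
    """Smallest r whose hypothesis is not rejected; full rank if all are rejected."""
    for r, (stat, crit) in enumerate(zip(stats, critical_values)):
        if stat < crit:
            return r
    return len(stats)

# Placeholder inputs, for illustration only.
stats = trace_statistics([0.5, 0.1], T=100)
rank = sequential_rank(stats, critical_values=[20.0, 12.0])
```

Testing starts at rank zero and moves to higher ranks until the statistic falls below the corresponding critical value.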
Within the SPECM, one can also test restrictions on the co-integrating relation, which is estimated as
, normalized such that the coefficient for m2 equals 1. The first entry corresponds to the restricted constant. These values are close to the ones by
Johansen (
1995), shown in Section 7.3.1. One notices that the coefficient for y is close to
. Economic theory further conjectures that the coefficients for ibo and ide should sum to zero. Testing this restriction on
within the SPECM, we get a test statistic of
, which corresponds to a
p-value of 0.14 for the limiting
distribution. The restricted estimate
. This is to be compared to
, as provided by
Johansen (
1995) with a different model and different specification of the deterministics (i.e., leaving out the seasonal dummies).
Finally, the spectral density of the differenced process in the direction of the co-integrating relation (omitting the coefficient corresponding to the constant and scaling the vector to have a norm equal to one) is, according to the model for the first differences, equal to . The smallest eigenvalue of the spectrum equals and the second to last equals , while the largest eigenvalue equals . This hints at the singularity of the spectrum at zero for the differenced process as well as at the fact that the co-integrating relation found points in the direction of the singularity.
Thus, the analysis with the state space systems adds evidence for Johansen’s choice of $r = 1$ and provides no evidence against the equalities in the co-integrating vectors suggested by economic theory.