Article

Using Subspace Algorithms for the Estimation of Linear State Space Models for Over-Differenced Processes

by
Dietmar Bauer
Department of Business Administration and Economics, Bielefeld University, Universitätsstrasse 25, D-33615 Bielefeld, Germany
Econometrics 2026, 14(1), 12; https://doi.org/10.3390/econometrics14010012
Submission received: 12 December 2025 / Revised: 11 February 2026 / Accepted: 26 February 2026 / Published: 28 February 2026

Abstract

Subspace algorithms like canonical variate analysis (CVA) are regression-based methods for the estimation of linear dynamic state space models. They have been shown to deliver accurate (consistent and asymptotically equivalent to quasi-maximum likelihood estimation using the Gaussian likelihood) estimators for stably invertible stationary autoregressive moving average (ARMA) processes. These results assume that the spectral density corresponding to the state space system has no zeros on the unit circle. In this technical study, we consider vector processes made stationary by applying differencing to all variables, ignoring potential co-integrating relations. This leads to spectral zeros violating the above-mentioned assumption. We show consistency for the CVA estimators, closing a gap in the literature. However, a simulation exercise shows that over-differencing (while leading to consistent estimation of the transfer function) also complicates inference for CVA estimators, not just maximum likelihood-based estimators. This is also demonstrated in a real-world data example. The result also applies to seasonal differencing. The present paper hence suggests working with the original data rather than with differences.

1. Introduction

Subspace algorithms such as the canonical variate analysis (CVA) (Larimore, 1983) are used for the estimation of linear dynamical state space systems for (multivariate) time series. CVA is popular since it is numerically cheap, consisting of a series of regressions, asymptotically equivalent to quasi-maximum likelihood estimation (using the Gaussian likelihood; qMLE) for stationary processes, and it is robust to the existence of simple unit roots (see Bauer, 2005 for a survey). It has also been shown to provide consistent estimates for GARCH effects being present in the innovation process (Bauer, 2008) and even fractionally integrated processes when the system order tends to infinity with appropriate sample sizes (Bauer, 2018). In addition to asymptotic properties, recently, finite sample properties have also been investigated for subspace algorithms, which also apply to CVA (He et al., 2025; see also Tsiamis et al., 2023).
The algorithm fits a state space system in innovation form
$$ y_t = C x_t + \varepsilon_t, \quad x_{t+1} = A x_t + B \varepsilon_t, \quad t \in \mathbb{Z}, $$
to an observed time series $y_t \in \mathbb{R}^s$, $t = 1, \dots, T$. Here $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times s}$, $C \in \mathbb{R}^{s \times n}$ define the state space system with system order $n$. In this paper, we will assume that the system is minimal (the state dimension cannot be reduced; see Hannan & Deistler, 1988, Chapter 1, for details) and stable (such that all eigenvalues of $A$ are smaller than one in modulus).
The innovation representation corresponds to the Wold representation of the process $(y_t)_{t \in \mathbb{Z}}$ if and only if the eigenvalues of the matrix $\underline{A} = A - BC$ are inside or on the unit circle. In this case, $|\lambda_{max}(\underline{A})| \le 1$, where $\lambda_{max}(M)$ denotes a maximum modulus eigenvalue of the matrix $M$.
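As a small numerical illustration of this eigenvalue condition, the following sketch computes $|\lambda_{max}(A - BC)|$ for two scalar systems. The function name `max_modulus_inverse` and the example matrices are hypothetical choices made only for illustration; they are not taken from the study's code.

```python
import numpy as np

def max_modulus_inverse(A, B, C):
    """Return |lambda_max(A - B C)| for a system (A, B, C) in innovation form."""
    A_bar = np.asarray(A, float) - np.asarray(B, float) @ np.asarray(C, float)
    return float(np.max(np.abs(np.linalg.eigvals(A_bar))))

# Illustrative scalar systems (assumed values).
# First system: A - BC = 0.7 - 1 = -0.3, strictly inside the unit circle.
stably_invertible = max_modulus_inverse([[0.7]], [[1.0]], [[1.0]])
# Second system: y_t = eps_t - eps_{t-1}, so A - BC = 0 - (-1) = 1, on the unit circle.
over_differenced = max_modulus_inverse([[0.0]], [[-1.0]], [[1.0]])
print(stably_invertible, over_differenced)
```

The first system satisfies the strict inequality, while the second sits exactly on the boundary that the remainder of the paper is concerned with.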
The asymptotic properties for CVA when the data are generated from a state space system in innovation representation documented in the literature are restricted to the case of stably invertible processes where the strict inequality $|\lambda_{max}(\underline{A})| < 1$ holds. However, this restriction may be violated in particular for economic data if the data are transformed to stationarity by temporal differencing without taking possible co-integrating relations into account. If co-integrating relations between the component variables exist and the whole time series is differenced, this leads to over-differencing in some directions, introducing spectral zeros at frequency $\omega = 0$. A similar situation occurs if HP filtering is used to extract a trend: Sakarya and de Jong (2022) show that the spectrum of the detrended cyclical process then has a zero at frequency zero. Similar effects occur for seasonal differencing.
Processes that have spectral zeros are often called non-invertible. Funovits (2024), dealing with structural modelling involving non-invertible autoregressive moving average (ARMA) processes, excludes spectral zeros from his model class and argues that such processes in fact are invertible in the sense that long autoregressive approximations exist. These approximations, however, have non-standard properties; see below. Hence, we use the term ‘stably invertible’ for $|\lambda_{max}(\underline{A})| < 1$.
In such a situation, the properties of the subspace estimators are currently unknown. This study closes this gap using results from Poskitt (2006) related to the autoregressive approximation of processes with spectral zeros. We show that CVA provides consistent estimators for the impulse response sequence also in the case of some spectral zeros at $\omega = 0$. From the proof, it is clear that analogous results also hold for zeros at different frequencies. Consistency is obtained for the integer parameter $p$ of CVA (corresponding to the lag length of an autoregressive approximation) tending to infinity at a certain rate. We investigate the asymptotic bias arising for finite lag lengths and show that for typical choices, it is not negligible, as it tends to zero slower than $1/\sqrt{T}$, the typical convergence rate involved in asymptotic normality.
The significance of this result is demonstrated in a simulation exercise whose main implications are that inference in the over-differenced situation is unreliable, be it using CVA or qMLE or prediction error method (PEM) estimators (each suffering from different problems). The simulations illustrate the advantages of working with the original data rather than preprocessed data involving seasonal adjustment or differencing to transform the data to stationarity.
This paper is organized as follows: in the next section, the main idea of CVA is presented. Afterwards, the results of this study are described in Section 3. Demonstrations are provided in Section 4. Section 5 concludes the paper. Proofs are relegated to the appendices.

2. Canonical Variate Analysis

The CVA method of estimation is performed in three steps and uses two integers, $f, p$ (‘future’ and ‘past’), and information on the system order $n$ (compare Bauer, 2005):
  • Obtain an estimate $\hat{x}_t$ of the state $x_t$ for $t = p+1, \dots, T+1$.
  • Estimate $C$ by regressing $y_t$ onto $\hat{x}_t$. This step provides residuals $\hat{\varepsilon}_t = y_t - \hat{C} \hat{x}_t$, $t = p+1, \dots, T$.
  • Estimate $A$ and $B$ by regressing $\hat{x}_{t+1}$ onto $\hat{x}_t$ and $\hat{\varepsilon}_t$, $t = p+1, \dots, T$.
The essential idea of CVA lies in the estimation of $x_t$, which uses the representation of the joint vector $Y_t^+ = (y_t', y_{t+1}', \dots, y_{t+f-1}')'$ for some integer $f \ge n$ that the state space system implies:
$$ Y_t^+ = \mathcal{O}_f x_t + \mathcal{E}_f E_t^+, \quad x_t = \mathcal{K}_p Y_t^- + \delta x_t(p), $$
where $\mathcal{E}_f \in \mathbb{R}^{fs \times fs}$ contains the impulse response coefficients and
$$ \mathcal{O}_f = (C', A'C', \dots, (A^{f-1})'C')' \in \mathbb{R}^{fs \times n} $$
denotes the observability matrix, which has full column rank due to minimality. Furthermore
$$ \mathcal{K}_p = \mathbb{E} x_t (Y_t^-)' \left( \mathbb{E} Y_t^- (Y_t^-)' \right)^{-1} \in \mathbb{R}^{n \times ps} $$
denotes the regression coefficient for explaining $x_t$ by $Y_t^- = (y_{t-1}', \dots, y_{t-p}')' \in \mathbb{R}^{ps}$ for integer $p \ge n$, leading to the approximation $x_t(p) = \mathcal{K}_p Y_t^-$. Then $\delta x_t(p) = x_t - x_t(p)$ denotes the approximation error.
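The three steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the state is taken as the first $n$ canonical variates of the past, the weights are Cholesky factors of the sample covariances of future and past, the order $n$ is treated as known, and the function name `cva_sketch` as well as the simulated scalar system are hypothetical.

```python
import numpy as np

def cva_sketch(y, n, f, p):
    """Three-step CVA sketch for a (T, s) array y; returns estimates (A, B, C)."""
    T, s = y.shape

    def past(t):      # Y_t^- = (y_{t-1}', ..., y_{t-p}')'
        return np.concatenate([y[t - j] for j in range(1, p + 1)])

    def future(t):    # Y_t^+ = (y_t', ..., y_{t+f-1}')'
        return np.concatenate([y[t + j] for j in range(f)])

    ts = range(p, T - f + 1)                        # 0-based times with full past and future
    Yp = np.array([past(t) for t in ts])
    Yf = np.array([future(t) for t in ts])
    N = len(Yp)
    Spp, Sff, Sfp = Yp.T @ Yp / N, Yf.T @ Yf / N, Yf.T @ Yp / N
    beta = np.linalg.solve(Spp, Sfp.T).T            # unrestricted regression of future on past
    Lf, Lp = np.linalg.cholesky(Sff), np.linalg.cholesky(Spp)
    # SVD of the weighted coefficient matrix; keep the n leading right singular vectors.
    _, _, Vt = np.linalg.svd(np.linalg.solve(Lf, beta @ Lp), full_matrices=False)
    K = np.linalg.solve(Lp.T, Vt[:n].T).T           # estimate of K_p, shape (n, p*s)
    # Step 1: state estimates for t = p, ..., T (0-based indexing).
    X = np.array([past(t) for t in range(p, T + 1)]) @ K.T
    # Step 2: regress y_t on the estimated state and keep the residuals.
    C = np.linalg.lstsq(X[:-1], y[p:], rcond=None)[0].T
    resid = y[p:] - X[:-1] @ C.T
    # Step 3: regress the next state on the current state and the residuals.
    W = np.linalg.lstsq(np.hstack([X[:-1], resid]), X[1:], rcond=None)[0]
    return W[:n].T, W[n:].T, C

# Hypothetical stably invertible scalar system: y_t = x_t + e_t, x_{t+1} = 0.7 x_t + e_t.
rng = np.random.default_rng(0)
T = 5000
e = rng.standard_normal(T)
y = np.zeros((T, 1))
x = 0.0
for t in range(T):
    y[t, 0] = x + e[t]
    x = 0.7 * x + e[t]

A_hat, B_hat, C_hat = cva_sketch(y, n=1, f=5, p=10)
print(A_hat, C_hat @ B_hat)   # compare with A = 0.7 and the impulse response C B = 1
```

The state basis is only determined up to a nonsingular transformation, so the comparison is made on basis-invariant quantities (the eigenvalue of $A$ and the product $CB$) rather than on $B$ and $C$ separately.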
Under the strict minimum-phase assumption implying $\underline{A}^p \to 0$ for $p \to \infty$, we may use $\mathcal{K}_p = [B, \underline{A} B, \underline{A}^2 B, \dots, \underline{A}^{p-1} B]$, leading to
$$ x_t = \mathcal{K}_p Y_t^- + \underline{A}^p x_{t-p} $$
to infer that the variance of the approximation error $\delta x_t(p)$ can be bounded by $\underline{A}^p (\mathbb{E} x_{t-p} x_{t-p}') (\underline{A}^p)'$ such that it is of order $O(\rho_0^{2p})$, where $1 > \rho_0 > |\lambda_{max}(\underline{A})|$. If in that case, $p = p(T) = -c (\log T) / (2 \log \rho_0)$ is used, we obtain
$$ \rho_0^p = \exp\left( -c \, \frac{\log T}{2 \log \rho_0} \log \rho_0 \right) = \exp(-c (\log T)/2) = T^{-c/2} $$
such that the variance of the approximation error is of order $T^{-c}$. If $c > 1$, this implies that the approximation error is negligible in the usual $\sqrt{T}$ asymptotics. In the stably invertible case, it is known that the lag length of a vector autoregressive approximation of the process $(y_t)_{t \in \mathbb{Z}}$ selected using information criteria like BIC tends to infinity roughly as $-\log T / (2 \log \rho_0)$ (Hannan & Deistler, 1988, see Theorems 6.6.3 and 6.6.4). Thus, selecting $p(T)$ by the AIC optimal autoregressive lag length has been suggested (Bauer, 2005). For integrated processes, twice this number has been suggested such that the approximation error decays as $1/T$.
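The relation between the lag length rule and the decay of the approximation error can be checked with a few lines of arithmetic. The values of $T$, $\rho_0$ and $c$ below are arbitrary illustrative assumptions, and `lag_length` is a hypothetical helper name.

```python
import math

def lag_length(T, rho0, c=1.0):
    """p(T) = -c * log(T) / (2 * log(rho0)) for 0 < rho0 < 1 (real-valued, not rounded)."""
    return -c * math.log(T) / (2.0 * math.log(rho0))

T, rho0, c = 1000, 0.5, 2.0            # illustrative values (assumptions)
p = lag_length(T, rho0, c)
# rho0**p and T**(-c/2) coincide up to floating point error.
print(p, rho0 ** p, T ** (-c / 2))
```

As $\rho_0$ approaches one, $\log \rho_0$ approaches zero and the required lag length grows without bound, which is why this argument breaks down in the over-differenced case.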
For $\rho_0 = 1$, this argument does not work any more. Poskitt (2006) shows that also in this case, the approximation error decreases to zero, albeit not at the same speed, as shown in the following example:
Example 1.
Consider $y_t = \varepsilon_t - \varepsilon_{t-1} \in \mathbb{R}^s$ for independent identically distributed white noise $(\varepsilon_t)_{t \in \mathbb{Z}}$ with expectation zero and variance $\Omega > 0$. This can be represented in state space form as
$$ y_t = I_s x_t + \varepsilon_t, \quad x_{t+1} = 0_{s \times s} x_t - I_s \varepsilon_t $$
and hence $x_t = -\varepsilon_{t-1}$ and $(A, B, C) = (0_{s \times s}, -I_s, I_s)$. Following Poskitt (2006), we see that $\mathcal{K}_p = -\left[ \frac{p}{p+1} I_s, \frac{p-1}{p+1} I_s, \dots, \frac{1}{p+1} I_s \right]$, implying that
$$ x_t(p) = \mathcal{K}_p Y_t^- = -\frac{1}{p+1} \sum_{j=1}^{p} (p+1-j)(\varepsilon_{t-j} - \varepsilon_{t-j-1}) = -\frac{1}{p+1} \sum_{j=1}^{p} (p+1-j) \varepsilon_{t-j} + \frac{1}{p+1} \sum_{j=2}^{p+1} (p+2-j) \varepsilon_{t-j} $$
$$ = -\frac{1}{p+1} \left( p \varepsilon_{t-1} - \sum_{j=2}^{p} \varepsilon_{t-j} - \varepsilon_{t-p-1} \right) = -\varepsilon_{t-1} + \frac{1}{p+1} \sum_{j=1}^{p+1} \varepsilon_{t-j}. $$
Denoting $\bar{\varepsilon}_{t-1}(p) = \sum_{j=1}^{p+1} \varepsilon_{t-j} / (p+1)$, we obtain $x_t = x_t(p) - \bar{\varepsilon}_{t-1}(p)$, $\varepsilon_t(p) = y_t - x_t(p) = \varepsilon_t - \bar{\varepsilon}_{t-1}(p)$ such that the approximation error is $\delta x_t(p) = -\bar{\varepsilon}_{t-1}(p)$. It follows that $\mathbb{E} \delta x_t(p) \delta x_t(p)' = \frac{1}{p+1} \Omega$. Thus the approximation error tends to zero in mean square, but the variance is of order $1/p$ and does not decrease exponentially.
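The projection coefficients and the error variance of Example 1 can be verified from population moments in the scalar case. The sketch below assumes $s = 1$ and $\Omega = 1$; the function name `example1_projection` is hypothetical.

```python
import numpy as np

def example1_projection(p):
    """Project x_t = -eps_{t-1} on (y_{t-1}, ..., y_{t-p}) for y_t = eps_t - eps_{t-1}."""
    # Covariance of the past: variance 2 on the diagonal, lag-one covariance -1.
    gamma = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
    # Only y_{t-1} is correlated with x_t: E[-eps_{t-1} * y_{t-1}] = -1.
    cov_x_past = np.zeros(p)
    cov_x_past[0] = -1.0
    K = np.linalg.solve(gamma, cov_x_past)       # regression coefficients (gamma is symmetric)
    err_var = 1.0 - K @ gamma @ K                # Var(x_t) minus Var(x_t(p))
    return K, err_var

p = 8
K, err_var = example1_projection(p)
expected_K = -np.arange(p, 0, -1) / (p + 1)      # -[p, p-1, ..., 1] / (p + 1)
print(np.allclose(K, expected_K), err_var, 1.0 / (p + 1))
```

The error variance equals $1/(p+1)$ for every $p$, so doubling the lag length only halves the variance instead of squaring a geometric factor.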
This example is typical. The same arguments show that the variance of the approximation error for $y_t = \Delta u_t = u_t - u_{t-1}$ for a stationary process $(u_t)_{t \in \mathbb{Z}}$ with non-singular spectral density at $\omega = 0$ (not necessarily white noise) is at most of order $p^{-1}$. In order for this error to be negligible in typical asymptotic analyses, we require $p^{-1} \sqrt{T} \to 0$ (see Example 2 below). Thus, in this setting, $p$ must increase much faster than the usual rate proportional to $\log T$ for stably invertible processes to produce a negligible bias term.
Note that in the over-differenced situation with $\rho_0 = 1$, the properties of BIC optimal lag length selection do not follow from Theorem 6.6.3 (Hannan & Deistler, 1988). The arguments in the corresponding proofs, however, suggest that the corresponding estimate $\hat{p}$ is close to the minimizer of $\log \det \Omega_p + 2 p s^2 / T$, where $\Omega_p$ denotes the one-step prediction error variance for an autoregression of order $p$. In the example above, one obtains $\Omega_p = \Omega \frac{p+2}{p+1}$. The minimizer of $\log \det \Omega_p + 2 p s^2 / T$ equals the minimizer of $\log \det \Omega + s \log \left( (p+2)/(p+1) \right) + 2 (p+1) s^2 / T$. Approximating $\log \left( 1 + \frac{1}{p+1} \right)$ as $\frac{1}{p+1}$, one obtains $\sqrt{T/(2s)} - 1$ as the minimizer. This illustrates that in the over-differenced situation, one might expect a relatively large choice of the lag length roughly proportional to $\sqrt{T}$ in the autoregressive approximation chosen by BIC. The simulations in Section 4 demonstrate this behaviour.
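The quality of this approximation can be checked by minimizing the criterion over a grid of integers. The sketch assumes $s = 1$ by default and drops the constant $\log \det \Omega$; the function name `criterion_minimizer` and the grid bound `p_max` are hypothetical choices.

```python
import numpy as np

def criterion_minimizer(T, s=1, p_max=500):
    """Integer p minimizing s*log((p+2)/(p+1)) + 2*p*s**2/T (additive constants dropped)."""
    p = np.arange(1, p_max + 1)
    crit = s * np.log((p + 2) / (p + 1)) + 2 * p * s**2 / T
    return int(p[np.argmin(crit)])

for T in (100, 400, 1600):
    # Exact integer minimizer next to the approximation sqrt(T / (2 s)) - 1 for s = 1.
    print(T, criterion_minimizer(T), np.sqrt(T / 2) - 1)
```

Quadrupling the sample size roughly doubles the minimizer, in contrast to the logarithmic growth in the stably invertible case.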

3. Results

Poskitt (2006) derives results for the estimation accuracy of the autoregressive approximation coefficients. In his Theorem 5, he states that uniformly in $0 < p \le H_T$ for some upper bound $H_T = O(\sqrt{T / \log T})$ and using $Q_T^2 = (\log T)/T$, we have
$$ \sum_{j=1}^{p} | \hat{\alpha}_p(j) - \alpha_p(j) |^2 = O\left( p \, \lambda_{min}(\Gamma_p)^{-2} Q_T^2 \right) $$
where $O(.)$ denotes almost sure convergence at the given rate. Here $\alpha_p(j)$ denotes the autoregressive coefficients in a lag $p$ approximation for $(y_t)_{t \in \mathbb{Z}}$ obtained from
$$ [\alpha_p(1), \dots, \alpha_p(p)] = \mathbb{E} y_t (Y_t^-)' \Gamma_p^{-1}, \quad \Gamma_p = \mathbb{E} Y_t^- (Y_t^-)' $$
and $\hat{\alpha}_p(j)$ denotes the corresponding least squares estimates. Poskitt (2006) uses a univariate setting; however, the extension to multivariate time series in our framework is obvious.
In this study, we do not investigate autoregressive processes but state space processes with spectral zeros. We focus on the case of simple spectral zeros obtained by one-time over-differencing:
Assumption 1.
The stationary process $(y_t)_{t \in \mathbb{Z}}$, $y_t \in \mathbb{R}^s$, is generated using a rational, stable, and stably invertible transfer function $k(z) = I_s + \sum_{j=1}^{\infty} K_j z^j$, $K_j = C A^{j-1} B$ (which has all its zeros and poles outside the unit circle) and an orthonormal matrix $M = [M_c, M_{s-c}] \in \mathbb{R}^{s \times s}$, $M'M = I_s$, $M_c \in \mathbb{R}^{s \times c}$, where $0 < c \le s$ is an integer, as ($L$ denoting the backward-shift operator and $\Delta = (1 - L)$)
$$ (y_t)_{t \in \mathbb{Z}} = M \begin{pmatrix} \Delta I_c & 0 \\ 0 & I_{s-c} \end{pmatrix} M' k(L) (\varepsilon_t)_{t \in \mathbb{Z}}. $$
Here $(\varepsilon_t)_{t \in \mathbb{Z}}$ denotes a zero mean ergodic, stationary, martingale difference sequence with respect to the sequence $\mathcal{F}_t$ of sigma fields spanned by the past of $\varepsilon_t$, fulfilling
$$ \mathbb{E}(\varepsilon_t | \mathcal{F}_{t-1}) = 0, \quad \mathbb{E}(\varepsilon_t \varepsilon_t' | \mathcal{F}_{t-1}) = \mathbb{E}(\varepsilon_t \varepsilon_t') = \Omega. $$
Furthermore, $\mathbb{E} \varepsilon_{t,j}^4 < \infty$, $j = 1, \dots, s$.
We use the same noise assumptions as Poskitt (2006) and Hannan and Deistler (1988). The conditional homoskedasticity assumption $\mathbb{E}(\varepsilon_t \varepsilon_t' | \mathcal{F}_{t-1}) = \mathbb{E}(\varepsilon_t \varepsilon_t')$ can be replaced by the weaker assumption $\lim_{k \to \infty} \mathbb{E}(\varepsilon_t \varepsilon_t' | \mathcal{F}_{t-k}) = \Omega$, allowing for GARCH-type processes (Bauer, 2008). We do not attempt to weaken the assumption of finite fourth moments. Clearly such processes have a spectral density of rank $s - c$ at $\omega = 0$ (which is hence singular for $c > 0$) due to the differencing. At all other frequencies, the rank equals $s$, since $k(L)$ is assumed to be stably invertible.
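The rank deficiency at frequency zero can be illustrated with a hypothetical bivariate case in which $M = I_2$, $k(z) = I_2$ and $c = 1$, so that only the first component is over-differenced. The helper `transfer_at_one` and the matrices below are assumptions chosen for illustration; the spectral density at zero is computed as $k(1) \Omega k(1)' / (2\pi)$ with $\Omega = I_2$.

```python
import numpy as np

def transfer_at_one(A, B, C):
    """k(1) = I + C (I - A)^{-1} B for a stable system (A, B, C)."""
    A, B, C = (np.atleast_2d(np.asarray(M, float)) for M in (A, B, C))
    n, s = A.shape[0], C.shape[0]
    return np.eye(s) + C @ np.linalg.solve(np.eye(n) - A, B)

# Hypothetical system: y_{1,t} = eps_{1,t} - eps_{1,t-1} (over-differenced), y_{2,t} = eps_{2,t}.
A = [[0.0]]
B = [[-1.0, 0.0]]
C = [[1.0], [0.0]]
k1 = transfer_at_one(A, B, C)
Omega = np.eye(2)
spectral_at_zero = k1 @ Omega @ k1.T / (2 * np.pi)
print(k1)
print(np.linalg.matrix_rank(spectral_at_zero))   # rank s - c for this example
```

With $s = 2$ and $c = 1$, the matrix $k(1)$ has a zero row in the over-differenced direction and the rank of the spectral density at zero drops to one.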
These assumptions are fulfilled when examining first differences of a co-integrated I(1) process. Examples of such processes include multivariate time series, wherein some components are stationary, while others show the random walk-like behaviour typical for I(1) processes. In this situation, differencing the whole multivariate process will over-difference the stationary components. This can be avoided by only differencing the integrated components, if such a partitioning of the multivariate time series exists, which of course requires knowledge on which components are stationary and which are integrated.
Often, however, all components of a multivariate time series will be integrated, while linear combinations are stationary. This corresponds to a process where the number of common trends driving the time series is smaller than the dimension of the time series, the difference being equal to the co-integrating rank. This is reasonable in systems with many variables where inference on the co-integrating rank is difficult. But even in smaller systems, the decision on the number of co-integrating relations is sometimes not simple, as documented in examples in (Johansen, 1995). If in such situations, the co-integrating rank is specified as too small or if, for simplicity, the full process is differenced, spectral zeros result.
To apply this result in our setting, note that the straightforward multivariate extension to Theorem 2 of Palma and Bondon (2003) implies that $\lambda_{min}(\mathbb{E} Y_t^- (Y_t^-)')$ is bounded from below by $\underline{c} \, p^{-2}$ for $(y_t)_{t \in \mathbb{Z}}$ according to Assumption 1.
This implies that the bound above amounts to $p^5 \log T / T$, which tends to zero if $p = c T^{\delta}$ for $0 < \delta < 0.2$. Note, however, that for this rate of increase, the approximation error (with variance of order $p^{-1}$) is larger than $O(1/\sqrt{T})$ and hence dominates the asymptotic distribution of terms like $\sqrt{T}(\hat{A} - A)$; see Theorem 2 below.
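The $p^{-2}$ order of the smallest eigenvalue can be seen numerically in the simplest over-differenced case, $y_t = \varepsilon_t - \varepsilon_{t-1}$ with $s = 1$ and unit innovation variance (an illustrative assumption, as is the function name `lambda_min_gamma`). In that case the covariance matrix of the past is tridiagonal and its smallest eigenvalue is $2 - 2\cos(\pi/(p+1))$.

```python
import numpy as np

def lambda_min_gamma(p):
    """Smallest eigenvalue of the p x p covariance matrix of the past of y_t = eps_t - eps_{t-1}."""
    gamma = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
    return float(np.linalg.eigvalsh(gamma)[0])   # eigvalsh returns eigenvalues in ascending order

for p in (10, 50, 200):
    lam = lambda_min_gamma(p)
    # The rescaled value lam * (p + 1)^2 approaches pi^2 as p grows.
    print(p, lam, lam * (p + 1) ** 2)
```

Since the sampling error bound involves the inverse squared smallest eigenvalue, this decay contributes a factor of order $p^4$, which together with the leading factor $p$ gives the $p^5$ in the bound.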
The results from the autoregressive setting can be used here almost immediately if $f \ge n$ is fixed and $p = f \tilde{p}$, where $\tilde{p} = \tilde{p}(T) = o(T^{\delta})$ depends on the sample size. This implies that for the approximation of $x_t$, the unrestricted estimate $\hat{\beta}_{f,p} = \langle Y_t^+, Y_t^- \rangle \langle Y_t^-, Y_t^- \rangle^{-1}$ equals an autoregressive model for $Y_t^+$. Here and below, we use the notation $\langle a_t, b_t \rangle = T^{-1} \sum_{t=p+1}^{T-f} a_t b_t'$ for two processes $(a_t)_{t \in \mathbb{Z}}$ and $(b_t)_{t \in \mathbb{Z}}$. This matrix $\hat{\beta}_{f,p}$ (which in the limit has rank $n$ but in finite samples is of full rank) is then low-rank-approximated, leading to the estimate $\hat{\mathcal{O}}_f \hat{\mathcal{K}}_p$ of $\mathcal{O}_f \mathcal{K}_p$.
In order to identify the factors $\hat{\mathcal{O}}_f$ and $\hat{\mathcal{K}}_p$ from the product, we use a selector matrix $S_f \in \mathbb{R}^{n \times fs}$ such that $S_f \mathcal{O}_f = I_n$. Such a matrix always exists (cf. for example the overlapping echelon forms; Section 2.6 of Hannan and Deistler (1988)). Since the results below correspond to estimates of the impulse response coefficients (which are invariant in this respect), this choice of the state basis can be assumed without restriction of generality.
The second and third step of CVA then amount to least squares using the estimate $\hat{x}_t = \hat{\mathcal{K}}_p Y_t^-$ of the state. If instead we had access to the state approximation $x_t(p) = \mathcal{K}_p Y_t^-$ as well as population instead of sample moments, we would obtain the following matrices:
$$ A_p = \mathbb{E} x_{t+1}(p) x_t(p)' \left( \mathbb{E} x_t(p) x_t(p)' \right)^{-1}, \quad B_p = \mathbb{E} x_{t+1}(p) \varepsilon_t(p)' \left( \mathbb{E} \varepsilon_t(p) \varepsilon_t(p)' \right)^{-1}, \quad C_p = \mathbb{E} y_t x_t(p)' \left( \mathbb{E} x_t(p) x_t(p)' \right)^{-1}. $$
If the approximation errors tend to zero and the convergence of sample covariances to population quantities is uniform in $p$, then consistency for $p \to \infty$ follows (for the proof, see Appendix A):
Theorem 1.
Let the process $(y_t)_{t \in \mathbb{Z}}$ be generated according to Assumption 1. Let the CVA procedure be applied with $f \ge n$ not depending on $T$ and $p = p(T) \to \infty$ for $T \to \infty$ such that $p(T) = o(T^{\delta})$, $0 < \delta < 0.2$.
Then
$$ \max \{ \| \hat{A} - A_p \|, \| \hat{B} - B_p \|, \| \hat{C} - C_p \| \} = O\left( \sqrt{p(T)^5 \log(T) / T} \right), \quad \max \{ \| A - A_p \|, \| B - B_p \|, \| C - C_p \| \} \to 0 $$
for $p = p(T) \to \infty$ as $T \to \infty$. Consequently, $\hat{C} \hat{A}^j \hat{B} \to C A^j B$, $j = 0, 1, 2, \dots$, almost surely in that case.
Note that these two error bounds are differently influenced by the integer $p$: a large $p$ value reduces the approximation errors such as $\| A_p - A \|$ but increases the sampling error $\| \hat{A} - A_p \|$. It is the first one that tends to zero slower for spectral zeros than in the stably invertible case.
Example 2.
Consider, as in Example 1, $y_t = \Delta \varepsilon_t$ for white noise $(\varepsilon_t)_{t \in \mathbb{Z}}$. Then $x_t(p) = -\varepsilon_{t-1} + \bar{\varepsilon}_{t-1}(p)$ and $\varepsilon_t(p) = \varepsilon_t - \bar{\varepsilon}_{t-1}(p)$. It follows that $\mathbb{E} x_t(p) \varepsilon_t(p)' = 0$,
$$ \mathbb{E} \varepsilon_t(p) \varepsilon_t(p)' = \frac{p+2}{p+1} \Omega, \quad \mathbb{E} x_t(p) x_t(p)' = \frac{p}{p+1} \Omega, \quad \mathbb{E} x_{t+1}(p) \varepsilon_t(p)' = -\left( 1 - \frac{1}{(p+1)^2} \right) \Omega, \quad \mathbb{E} x_{t+1}(p) x_t(p)' = -\frac{1}{(p+1)^2} \Omega. $$
Thus $A_p = A - I_s \frac{1}{p(p+1)}$, $B_p = B + \frac{1}{p+1} I_s$, $C_p = C$.
This shows for the special case that the system $(A_p, B_p, C_p)$ for fixed $p$ is a biased estimate of the true system $(0, -I_s, I_s)$. The bias is of order $p^{-1}$. This is typical, as is demonstrated by the next result.
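These population matrices can be reproduced by writing each quantity as a linear combination of the innovations and taking inner products of the coefficient vectors. The sketch assumes $s = 1$ and $\Omega = 1$, so that covariances reduce to dot products; the function name `example2_matrices` is hypothetical.

```python
import numpy as np

def example2_matrices(p):
    """(A_p, B_p, C_p) for y_t = eps_t - eps_{t-1} with s = 1 and unit innovation variance."""
    # Coefficient vectors on (eps_t, eps_{t-1}, ..., eps_{t-p-1}); index i corresponds to eps_{t-i}.
    m = p + 2
    x_t = np.zeros(m)
    x_t[1:p + 2] = 1.0 / (p + 1)       # average of eps_{t-1}, ..., eps_{t-p-1}
    x_t[1] -= 1.0                      # minus eps_{t-1}
    x_next = np.zeros(m)
    x_next[0:p + 1] = 1.0 / (p + 1)    # average of eps_t, ..., eps_{t-p}
    x_next[0] -= 1.0                   # minus eps_t
    e_t = np.zeros(m)
    e_t[1:p + 2] = -1.0 / (p + 1)      # minus the average of eps_{t-1}, ..., eps_{t-p-1}
    e_t[0] = 1.0                       # plus eps_t
    y_t = np.zeros(m)
    y_t[0], y_t[1] = 1.0, -1.0         # eps_t - eps_{t-1}
    A_p = (x_next @ x_t) / (x_t @ x_t)
    B_p = (x_next @ e_t) / (e_t @ e_t)
    C_p = (y_t @ x_t) / (x_t @ x_t)
    return A_p, B_p, C_p

p = 5
A_p, B_p, C_p = example2_matrices(p)
# Compare with A - 1/(p(p+1)), B + 1/(p+1) and C for the true system (0, -1, 1).
print(A_p, -1.0 / (p * (p + 1)))
print(B_p, -1.0 + 1.0 / (p + 1))
print(C_p)
```

The deviation of `B_p` from the true value of minus one equals $1/(p+1)$, which is the first-order bias discussed in the text.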
Theorem 2.
Let the process $(y_t)_{t \in \mathbb{Z}}$ be generated as in Theorem 1.
(I) Assume that $\mathbb{E} \delta x_t(p) \delta x_t(p)' = O(g_p)$. Then we obtain
$$ \max \{ \| A - A_p \|, \| B - B_p \|, \| C - C_p \| \} = O(g_p^{1/2}). $$
(II) $g_p = 1/p$ provides an upper bound on the achievable error in the sense that $\mathbb{E} \delta x_t(p) \delta x_t(p)' = O(1/p)$.
In order for this bound of the bias to be asymptotically negligible, $p$ has to grow faster than $T^{1/2}$. The bound may be conservative; however, Example 2 shows that, in general, $g_p$ cannot be smaller than $1/p^2$ such that the bias is of order $O(1/p)$. Even in this case, the required increase is faster than the upper bound $H_T = \sqrt{T / \log T}$ used above, such that with our methods, we cannot derive results for the asymptotic distribution of the system estimates.
Additionally note that typically the upper bound for selecting the lag length is $H_T = c T^{1/4}$ such that the bias derived above will dominate the asymptotics.

4. Illustration

To demonstrate the theoretical results, this section discusses a simulation exercise and applies the approaches to a Danish dataset that Johansen (1995) used as a demonstration example for co-integration analysis in the vector error correction framework.

4.1. Simulations

We simulate a test example to indicate the relative performance of three different estimators for the original as well as the once-differenced time series:
  • CVA: The subspace procedure described above, where p is chosen according to twice the AIC optimal lag length in an autoregressive approximation.
  • qMLE: Maximum likelihood estimation based on the Gaussian likelihood. Here both stability and strict minimum-phase assumptions are imposed using a barrier function approach. The data is assumed to be stationary and the corresponding likelihood is optimized.
  • PEM: Prediction error methods use the assumption of $x_0 = 0$. With this initialization, the Kalman filter collapses to the inverse system. Again, stability and strict minimum-phase assumptions are imposed using a barrier function approach.
qMLE and PEM are initialized using the CVA estimate but also employ randomization to reduce the probability of being trapped in local minima. For the data-generating process, we use a bivariate system:
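$$ y_t = x_t + e_t, \quad x_{t+1} = \begin{pmatrix} 0.7 & 0 \\ 0 & 1 \end{pmatrix} x_t + \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix} e_t. $$
Here $(e_t)_{t \in \mathbb{Z}}$ denotes a bivariate standard normal error process. Consequently, the process is a bivariate I(1) process with independent innovations. The first component is stationary, while the second is integrated. We apply the estimation procedures to $(y_t)_{t \in \mathbb{Z}}$ and to $\Delta (y_t)_{t \in \mathbb{Z}}$, which has a state space representation of order three:
$$ \Delta y_t = \begin{pmatrix} 0 & -0.3 & 0 \\ 0 & 0 & -0.5 \end{pmatrix} x_t + e_t, \quad x_{t+1} = \begin{pmatrix} 0.7 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} x_t + \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{pmatrix} e_t. $$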
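The relation between the two representations can be checked numerically: the impulse response of the order-three system should equal the coefficients of $(1 - z) k(z)$, where $k(z)$ is the transfer function of the system for the levels. The sketch below is an illustration, not the study's code; it uses the matrices as read here, with entries $-0.3$ and $-0.5$ in the output matrix of the differenced system, and the helper name `impulse_response` is hypothetical.

```python
import numpy as np

def impulse_response(A, B, C, horizon):
    """K_0 = I and K_j = C A^{j-1} B for j = 1, ..., horizon."""
    K, A_pow = [np.eye(C.shape[0])], np.eye(A.shape[0])
    for _ in range(horizon):
        K.append(C @ A_pow @ B)
        A_pow = A_pow @ A
    return np.array(K)

# System for the levels and order-three system for the first differences.
A, B, C = np.diag([0.7, 1.0]), np.diag([1.0, 0.5]), np.eye(2)
Ad = np.array([[0.7, 0, 0], [1, 0, 0], [0, 0, 0]], float)
Bd = np.array([[1, 0], [0, 0], [0, 1]], float)
Cd = np.array([[0, -0.3, 0], [0, 0, -0.5]], float)

K = impulse_response(A, B, C, 20)
Kd = impulse_response(Ad, Bd, Cd, 20)
differenced = K.copy()
differenced[1:] -= K[:-1]                      # coefficients of (1 - z) k(z)
same_response = np.allclose(Kd, differenced)

max_mod_levels = np.abs(np.linalg.eigvals(A - B @ C)).max()
max_mod_diff = np.abs(np.linalg.eigvals(Ad - Bd @ Cd)).max()
print(same_response, max_mod_levels, max_mod_diff)
```

The maximum modulus eigenvalue of $A - BC$ is $0.5$ for the levels and $1$ for the differenced system, so only the latter violates the strict minimum-phase condition.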
We generate $M = 1000$ datasets of sample size $T \in \{100, 200, 400, 800, 1600\}$ and estimate a state space system with $n = 2$ for $(y_t)_{t \in \mathbb{Z}}$ and $n = 3$ for $\Delta (y_t)_{t \in \mathbb{Z}}$. Here $\hat{f} = \hat{p} = 2 \hat{k}_{AIC}$ is chosen, where $\hat{k}_{AIC}$ denotes the lag length of an autoregressive approximation chosen using AIC with $p = 10$ used as the upper bound for the original data and $p = 20$ for the differenced data.
The results can be seen in Figure 1. Plot (a) of that figure shows the mean squared error of the impulse response estimates times the sample size $T$. For $(y_t)_{t \in \mathbb{Z}}$, we calculate the impulse response corresponding to the differenced series. A convergence rate of order $O_P(T^{-1/2})$ would imply that the curves level off for large sample sizes. While this seems to be the case for CVA and PEM applied to the original data, all curves for the differenced series show an increase for the larger sample sizes. The superiority of the estimates for the original series is visible for all three estimators. CVA applied to the differenced series performs worst, particularly for large sample sizes. Additionally, it is visible that qMLE performs worse than PEM for the integrated process. This happens as the qMLE assumes stationary initialization of the state process in the Kalman filter. In the integrated case, this leads to very large initial uncertainty of the state, reflected in the poorer estimation. This is also visible in the estimates of the $(2,2)$ element of $K_1 = CB$, the first element of the impulse response, as shown in plots (c) for $T = 200$ and (d) for $T = 1600$. All but the qMLE estimates appear to be Gaussian-distributed around the correct mean $-0.5$. In plot (c), the superiority of the estimates for the original series is visible in a higher concentration around $-0.5$. Plot (b) provides mean selected lag lengths, documenting that for the original series the lag length selection roughly behaves as $-\log T / (2 \log \rho_0)$ (printed as red stars), while for the differenced series, the increase is much faster.
While the impulse response estimates for the large sample size appear to be Gaussian-distributed, this is not the case for all system-dependent quantities. Figure 2a,b provide density estimates for the estimate of the sum of all entries of $\underline{A} = A - BC$. In (a), for the original data, one notices a bias for the qMLE estimates (compensating for the large initial state uncertainty), while in (b), for the differenced data, the CVA estimates show a downward bias. Plot (c) provides the deviation of the mean of the largest modulus of the eigenvalues of $\underline{A}$ for the original process (with true value $0.5$), as well as for the differenced process (with true value 1). The larger downward bias for the qMLE estimates for the original data and the CVA estimates for the differenced data is clearly visible but diminishes with sample size. Note that for $T = 400$ onward for the differenced data for PEM and qMLE, the mean maximum modulus almost reaches the maximum allowed value, implying that almost all values are equal to the maximum. Plot (d) provides the ecdf for $T = 400$ of the maximum modulus of the eigenvalues of $\underline{A}$. While for CVA, the maximum modulus is almost always estimated below 0.98, for PEM and qMLE, the opposite is true, with a large point mass located at 0.99. Only in 6% of the cases is the maximum modulus smaller than $0.989$. Therefore the maximum modulus for the qMLE and PEM estimates for the differenced data does not follow a normal distribution, not even for $T = 1600$.
Consequently, for the differenced series, inference faces obstacles for all three estimators: non-Gaussian distributions with discrete components for qMLE and PEM, and a dominating bias for CVA.

4.2. The Danish Dataset

The simulation evidence from the last subsection is briefly complemented with a real-world data set. The Danish dataset was used by Johansen (1995) for the illustration of co-integration modeling using vector error correction representation. The dataset comprises four time series observed quarterly over the span of 55 quarters from 1974:1 to 1987:3. The variables are log real money (m2), log real income (y), the bank deposit rate for interest-bearing deposits (ide), and the bond rate (ibo). From the plots on p. 26 of Johansen (1995)’s work and the accompanying univariate tests, one may conclude that all four series appear to be integrated with stationary differences. Johansen (1995) uses seasonal dummies as deterministics.
Johansen (1995) tests for the number of co-integrating relations and states on p. 112 ‘From Table 7.1 we find that there is no clear evidence for co-integration. […] We choose to maintain the hypothesis that r = 1, that is, that there is one co-integrating relation’. This is a typical situation where the number of co-integrating relations (if any) is not certain. Economic theory would imply its existence, but the data provides borderline tests and no clear-cut results.
In addition to the estimation of the model, Johansen (1995) also tests—under the assumption of one co-integrating relation—for restrictions on the co-integrating relation.
We used the dataset to estimate models for the original data as well as for the differenced process. Seasonal dummies are included. We use CVA and PEM as in the simulation section. The results are provided in Table 1.
For the original data, $\hat{p} = 2$ was used as in Johansen (1995). As deterministic terms, quarterly dummies were added. For PEM and CVA, $\hat{n} = 7$ resulted in the best fit (according to AIC). For PEM, three common trends corresponding to one co-integrating relation minimize AIC.
For the differenced data, $\hat{p} = 2$ was chosen. Lag length selection using AIC is very unstable, though. The results depend heavily on the upper limit if a uniform evaluation sample is used. For PEM and CVA, $\hat{n} = 8$ was selected for the differenced data, which corresponds to the first difference of the model for the original data.
We can see from Table 1 that PEM for the original data results in the best one-step-ahead prediction error except for y, where CVA performs better, and ide, where PEM on the differenced series provides better forecasts.
In terms of the zeros of the transfer function for the original and for the differenced data, the PEM estimates provide zeros on the boundary to being unstable. For the differenced data, six out of the eight zeros have a modulus larger than $0.9$. Also for the original data, the PEM model has three zeros with a modulus larger than $0.9$. Thus, these estimates are also not well behaved.
Computing the spectral density at zero frequency according to the PEM model for the differenced data, we obtain for the square roots of the eigenvalues $(0.0476, 0.0175, 0.0065, 0.0023)$. This indicates that the rank is either two or three. For the first difference of the PEM model for the original data, we obtain $(0.0416, 0.0168, 0.0006, 0.000)$, since there are only three common trends contained in the model. The strength of the third is, thus, also very weak.
With the PEM model for the original data, we can use the trace statistic in the state space error correction (SPECM) formulation including a restricted constant (see Matuschek, 2020), providing the eigenvalues $(0.9030, 0.2732, 0.1202, 0.0246)$. The corresponding trace statistics equal $(195.52, 25.02, 8.11, 1.32)$, resulting in rank $r = 1$ with the usual sequential testing technique (using critical values $(53.92, 35.08, 20.27, 9.12)$).
Within the SPECM, one can also test restrictions on the co-integrating relation, which is estimated as $\hat{\beta} = (5.83, 1.00, 1.08, 4.52, 2.37)$, normalized such that the coefficient for m2 equals 1. The first entry corresponds to the restricted constant. These values are close to the ones by Johansen (1995), shown in Section 7.3.1. One notices that the coefficient for y is close to 1. Economic theory further conjectures that the coefficients for ibo and ide should sum to zero. Testing this restriction on $\hat{\beta}$ within the SPECM, we get a test statistic of $3.92$, which corresponds to a p-value of $0.14$ for the limiting $\chi^2_2$ distribution. The restricted estimate is $\tilde{\beta} = (6.11, 1.00, 1.00, 4.73, 4.73)$. This is to be compared to $(6.21, 1.00, 1.00, 5.88, 5.88)$, as provided by Johansen (1995) with a different model and different specification of the deterministics (i.e., leaving out the seasonal dummies).
Finally, according to the model for the first differences, the spectral density of the differenced process in the direction of the co-integrating relation $\tilde{\beta}$ (omitting the coefficient corresponding to the constant and scaling the vector to have a norm equal to one) equals $1.5148 \times 10^{-5}$. The smallest eigenvalue of the spectrum equals $5.18 \times 10^{-6}$ and the second to last equals $4.26 \times 10^{-5}$, while the largest eigenvalue equals $0.0023$. This hints at the singularity of the spectrum at zero for the differenced process as well as at the fact that the co-integrating relation found points in the direction of the singularity.
Thus, the analysis with the state space systems adds evidence to Johansen’s choice of r = 1 as well as no evidence against the equalities in the co-integrating vectors suggested by economic theory.

5. Conclusions

In this technical paper, we show that working with first differences does not invalidate consistency for CVA. This is a relief in situations where one is not sure about the existence of co-integrating relations.
Inference, on the other hand, becomes more complicated: when some of the variables are over-differenced, the asymptotic distribution of the CVA estimates is typically dominated by the bias term. Note that for partially over-differenced series, the quasi-maximum likelihood estimators are also known to not be asymptotically normal, as the true system lies at the boundary of the parameter space. This implies that inference is close to impossible for over-differenced processes.
The results also imply that a higher order of differencing and spectral zeros at other frequencies introduced, for example, from yearly differencing, can be dealt with using exactly the same methods, as the results from Poskitt (2006) and Palma and Bondon (2003) also hold for these cases. In such situations, the consistency of CVA estimators of the impulse response sequence again follows for $p = p(T) \to \infty$ increasing at a sufficiently slow rate.

Funding

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation—Projektnummer 469278259), which is gratefully acknowledged.

Data Availability Statement

The Danish data utilized in the study are openly available in the package ‘urca’ (Unit Root and Cointegration Tests for Time Series Data) at CRAN (https://cran.r-project.org/web/packages/urca/index.html) under the DOI 10.32614/CRAN.package.urca. See also Pfaff (2008).

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Proof of Theorem 1

Note that $\hat\alpha(p) = \langle Y_t^+, Y_t^- \rangle \langle Y_t^-, Y_t^- \rangle^{-1}$ is an autoregressive approximation of $Y_t^+$ by $Y_t^-$ if $p = f\tilde p$ for some integer $\tilde p$:
$$Y_t^+ = \alpha(p) Y_t^- + U_t^+ .$$
Poskitt (2006)’s Theorem 5 then implies that $\|\hat\alpha(p) - \alpha(p)\|_2^2 = O(p^5 Q_T^2)$ using
$$(\lambda_{\min}(\Gamma_p))^{-1} = O(p^2)$$
in that case. The proof of this result by Poskitt (2006) can be easily extended to the case of general p in our setting.
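The order $O(p^2)$ of the inverse of the smallest eigenvalue can be illustrated in the simplest over-differenced case. The sketch below assumes the univariate process $y_t = \varepsilon_t - \varepsilon_{t-1}$ with unit innovation variance (our choice for illustration, not the setting of the theorem), for which $\Gamma_p$ is tridiagonal with $2$ on the diagonal and $-1$ next to it, and $\lambda_{\min}(\Gamma_p) = 4\sin^2(\pi / (2(p+1)))$.

```python
import numpy as np


def gamma_p_overdifferenced(p):
    """Covariance matrix of p consecutive values of y_t = e_t - e_{t-1}, Var(e_t) = 1."""
    return 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)


lags = [5, 10, 20, 40, 80]

# eigvalsh returns eigenvalues in ascending order, so index 0 is the smallest.
inv_lambda_min = [1.0 / np.linalg.eigvalsh(gamma_p_overdifferenced(p))[0] for p in lags]

# If 1 / lambda_min grows like p^2, these ratios stay bounded.
ratios = [value / p**2 for value, p in zip(inv_lambda_min, lags)]

for p, value, ratio in zip(lags, inv_lambda_min, ratios):
    print(f"p={p:3d}  1/lambda_min={value:10.2f}  ratio to p^2={ratio:.4f}")
```

The ratios remain bounded as $p$ grows, in line with the quadratic rate used in the proof.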
Clearly, $\alpha(p) = \mathcal{O}_f \mathcal{K}_p$ is of rank $n$, as $\mathcal{E}_f E_t^+$ is orthogonal to $Y_t^-$. CVA then uses an SVD of $\hat W_f \hat\alpha(p) \hat W_p$ or, equivalently, the SVD of
$$\hat W_f \hat\alpha(p) \langle Y_t^-, Y_t^- \rangle \hat\alpha(p)' (\hat W_f)'$$
to obtain a rank $n$ approximation, where $\hat W_f = \langle Y_t^+, Y_t^+ \rangle^{-1/2}$ (the square root denotes the Cholesky decomposition). Due to the uniform convergence of the sample covariances, we obtain $\|\hat W_f - W_f\| = O(Q_T)$ for fixed $f$, since the Cholesky factorization is differentiable for positive definite matrices.
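The weighted SVD step can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names are ours, no deterministic terms are handled, $\hat W_p$ is taken as the Cholesky factor of $\langle Y_t^-, Y_t^- \rangle$ (which makes the two SVD formulations above equivalent), and the simulated data are a differenced stationary AR(1) process chosen only to have an over-differenced example.

```python
import numpy as np


def stack_lags(y, f, p):
    """Rows hold Y_t^+ = [y_t, ..., y_{t+f-1}] and Y_t^- = [y_{t-1}, ..., y_{t-p}]."""
    T = y.shape[0]
    idx = range(p, T - f + 1)
    y_plus = np.array([y[t:t + f].ravel() for t in idx])
    y_minus = np.array([y[t - p:t][::-1].ravel() for t in idx])
    return y_plus, y_minus


def cva_state(y, f, p, n):
    """Rank-n CVA step: weighted SVD of the regression of the future on the past."""
    y_plus, y_minus = stack_lags(y, f, p)
    N = y_plus.shape[0]
    g_pp = y_plus.T @ y_plus / N    # <Y+, Y+>
    g_mm = y_minus.T @ y_minus / N  # <Y-, Y->
    g_pm = y_plus.T @ y_minus / N   # <Y+, Y->
    alpha_hat = np.linalg.solve(g_mm, g_pm.T).T  # <Y+, Y-> <Y-, Y->^{-1}
    l_f = np.linalg.cholesky(g_pp)  # W_f is the inverse of this factor
    l_p = np.linalg.cholesky(g_mm)  # W_p is this factor (assumption, see lead-in)
    weighted = np.linalg.solve(l_f, alpha_hat @ l_p)
    _, sing_vals, vt = np.linalg.svd(weighted, full_matrices=False)
    k_hat = np.linalg.solve(l_p.T, vt[:n].T).T   # K_p estimate: V_n' W_p^{-1}
    return sing_vals, y_minus @ k_hat.T          # estimated states as rows


# Illustrative data: a stationary AR(1) process that is differenced once.
rng = np.random.default_rng(0)
e = rng.standard_normal((400, 2))
u = np.zeros_like(e)
for t in range(1, 400):
    u[t] = 0.5 * u[t - 1] + e[t]
y = np.diff(u, axis=0)

sing_vals, x_hat = cva_state(y, f=2, p=6, n=2)
```

With this weighting the singular values are the sample canonical correlations between future and past, and the estimated state has identity sample covariance by construction.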
Now $\|\alpha(p)\|_\infty = O(p)$ ($\|\cdot\|_\infty$ denoting the row-sum norm), as can be seen, for example, from the Levinson–Whittle algorithm (see Hannan & Deistler, 1988, p. 218). It follows that $\|\alpha(p)\|_1 = O(1)$ (column-sum norm, here equivalent to the maximum entry due to finite $f$), $\alpha(p) E Y_t^- (Y_t^-)' = E Y_t^+ (Y_t^-)'$, and $\|\alpha(p)\|_2 = O(p)$. Consequently,
$$\| \hat\alpha(p) \langle Y_t^-, Y_t^- \rangle \hat\alpha(p)' - \alpha(p) E Y_t^- (Y_t^-)' \alpha(p)' \| = O(p^{5/2} Q_T).$$
The properties of the SVD then imply $\|\hat{\mathcal O}_f - \mathcal O_f\|_2 = O(p^{5/2} Q_T)$, which in turn leads to $\|\hat{\mathcal K}_p - \mathcal K_p\|_2 = O(p^{5/2} Q_T)$. The key here is the differentiable dependence of the eigenspace belonging to an eigenvalue on the matrix; see Chatelin (1993). This applies here as $\mathcal O_f$ spans the orthocomplement of the eigenspace belonging to the eigenvalue zero. The convergence of $\hat{\mathcal O}_f$ then requires fixing a basis of this space, which is achieved by $S_f \mathcal O_f = I_n$. We then use the same normalization for $\hat{\mathcal O}_f$, such that $S_f \hat{\mathcal O}_f = I_n$, to obtain $\|\hat{\mathcal O}_f - \mathcal O_f\|_2 = O(p^{5/2} Q_T)$. As $\mathcal O_f' \mathcal O_f \ge I_n$, we have with $\mathcal K_p = (\mathcal O_f' \mathcal O_f)^{-1} \mathcal O_f' \alpha(p)$ and $\hat{\mathcal K}_p = (\hat{\mathcal O}_f' \hat{\mathcal O}_f)^{-1} \hat{\mathcal O}_f' \hat\alpha(p)$ that $\|\hat{\mathcal K}_p - \mathcal K_p\|_2 = O(p^{5/2} Q_T)$.
The remainder of the proof then follows from providing error bounds for terms involving
$$\hat x_t(p) - x_t(p) = (\hat{\mathcal K}_p - \mathcal K_p) Y_t^- .$$
For example,
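$$\begin{aligned}
\langle \hat x_t, \hat x_t \rangle &= \langle \hat x_t - x_t(p), \hat x_t \rangle + \langle x_t(p), \hat x_t - x_t(p) \rangle + \langle x_t(p), x_t(p) \rangle \\
&= (\hat{\mathcal K}_p - \mathcal K_p) \langle Y_t^-, Y_t^- \rangle \hat{\mathcal K}_p' + \mathcal K_p \langle Y_t^-, Y_t^- \rangle (\hat{\mathcal K}_p - \mathcal K_p)' + \langle x_t(p), x_t(p) \rangle \\
&= (\hat{\mathcal K}_p - \mathcal K_p) E Y_t^- x_t(p)' + E x_t(p) (Y_t^-)' (\hat{\mathcal K}_p - \mathcal K_p)' + \langle x_t(p), x_t(p) \rangle + o(p^{5/2} Q_T) \\
&= \langle x_t(p), x_t(p) \rangle + O(p^{5/2} Q_T)
\end{aligned}$$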
where the next-to-last error bound follows from replacing estimates with limits. All evaluations are simple and hence omitted.
These arguments show that, uniformly for $0 < p \le H_T$, the difference between the estimates using $\hat x_t$ and those using $x_t(p)$ is of order $O(p^{5/2} Q_T)$.
Considering $\langle x_t(p), x_t(p) \rangle - E x_t(p) x_t(p)'$, we see that
$$\| \alpha(p) ( \langle Y_t^-, Y_t^- \rangle - \Gamma_p ) \alpha(p)' \| = O(p^2 Q_T)$$
since $\|\alpha(p)\|_1 = O(1)$ (the entries of $\alpha(p)$ are uniformly bounded). This holds uniformly in $p < H_T$. Similar results show that for $p$ large enough, this error rate carries over to the difference in the estimators such that (see also the proof of Theorem 2):
$$\max \{ \|\hat A - A_p\|_2, \|\hat B - B_p\|_2, \|\hat C - C_p\|_2 \} = O(p^{5/2} Q_T) = o(1)$$
for $p \le H_T$.
Next, when investigating $A_p - A$, for example, the difference of second moments such as $E x_t(p) x_t(p)' - E x_t x_t'$ is essential. The convergence of these differences to zero follows from $E \delta x_t(p) (\delta x_t(p))' \to 0$, as the approximation error converges to zero; compare Lemma 1 of Poskitt (2006). This finishes the proof.

Appendix B. Proof of Theorem 2

(I) The system $(A_p, B_p, C_p)$ is obtained using $x_t(p)$, while $(A, B, C)$ uses the same regression based on the state $x_t$. Both systems are obtained using the differentiable mapping
$$C_p = \Gamma_{yx}(p) \Gamma_{xx}(p)^{-1}, \qquad A_p = \Gamma_{x_1 x}(p) \Gamma_{xx}(p)^{-1},$$
$$B_p = \Gamma_{x_1 \varepsilon}(p) \Gamma_{\varepsilon\varepsilon}(p)^{-1} = \left( \Gamma_{x_1 y}(p) - \Gamma_{x_1 x}(p) C_p' \right) \left( \Gamma_{yy} - \Gamma_{yx}(p) C_p' - C_p \Gamma_{xy}(p) + C_p \Gamma_{xx}(p) C_p' \right)^{-1} .$$
Here $\Gamma_{ab}(p)$ denotes the covariances, where the indices indicate the various processes: $x$ stands for $x_t(p)$, $y$ for $y_t$, and $x_1$ for $x_{t+1}(p)$. It follows that the system matrices $(A_p, B_p, C_p)$ are a function of $\Gamma_{yx}(p)$, $\Gamma_{xx}(p)$, $\Gamma_{x_1 x}(p)$, and $\Gamma_{x_1 y}(p)$.
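The mapping from second moments to system matrices can be written out directly. The sketch below replaces the population covariances by sample moments and, as an assumption made only for illustration, uses the exactly simulated state of a small hypothetical system in place of $x_t(p)$. Function and variable names are ours.

```python
import numpy as np


def system_from_moments(x, x_next, y):
    """Regression-based (A, B, C) from state, next state and output samples (rows = time)."""
    N = x.shape[0]
    g_xx = x.T @ x / N
    g_yx = y.T @ x / N
    g_yy = y.T @ y / N
    g_x1x = x_next.T @ x / N
    g_x1y = x_next.T @ y / N
    c = g_yx @ np.linalg.inv(g_xx)
    a = g_x1x @ np.linalg.inv(g_xx)
    # Moments involving the residual eps_t = y_t - C x_t, expanded as in the text.
    g_x1e = g_x1y - g_x1x @ c.T
    g_ee = g_yy - g_yx @ c.T - c @ g_yx.T + c @ g_xx @ c.T
    b = g_x1e @ np.linalg.inv(g_ee)
    return a, b, c


# Hypothetical system in innovation form, used only to generate test data.
rng = np.random.default_rng(1)
a_true = np.array([[0.5, 0.1], [0.0, 0.3]])
b_true = np.array([[1.0], [0.5]])
c_true = np.array([[1.0, -0.5]])
T = 5000
eps = rng.standard_normal((T, 1))
states = np.zeros((T + 1, 2))
for t in range(T):
    states[t + 1] = a_true @ states[t] + b_true @ eps[t]
outputs = states[:T] @ c_true.T + eps

a_hat, b_hat, c_hat = system_from_moments(states[:T], states[1:], outputs)
```

Because the simulated states satisfy the state equation exactly, `b_hat` reproduces `b_true` up to rounding, while `a_hat` and `c_hat` differ from the true matrices by sampling error.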
Since the map is differentiable, we linearize around $E y_t x_t'$, $E x_t x_t'$, $E x_{t+1} x_t'$, and $E x_{t+1} y_t'$. Then the difference between the two systems is a function of the differences between these quantities:
$$E y_t (x_t - x_t(p))', \quad E x_t x_t' - E x_t(p) x_t(p)', \quad E x_{t+1} x_t' - E x_{t+1}(p) x_t(p)', \quad E (x_{t+1} - x_{t+1}(p)) y_t' .$$
In all these terms, applying the Cauchy–Schwarz inequality implies that the highest order terms correspond to the square root of
$$E \delta x_t(p)' \delta x_t(p) = O(g_p).$$
(II) Consider the state space representation of the process according to Assumption 1:
$$y_t = \begin{bmatrix} C & -M_c \end{bmatrix} \tilde x_t + \varepsilon_t, \qquad \tilde x_{t+1} = \begin{bmatrix} x_{t+1} \\ M_c' u_t \end{bmatrix} = \begin{bmatrix} A & 0 \\ M_c' C & 0 \end{bmatrix} \tilde x_t + \begin{bmatrix} B \\ M_c' \end{bmatrix} \varepsilon_t .$$
Here $M_c \in \mathbb{R}^{s \times c}$ is the orthonormal block column defined in the assumption. This representation is minimal if $A$ is stable and $A - BC$ does not have a zero eigenvalue. If $A - BC$ has a zero eigenvalue, the representation is still observable but not necessarily controllable. In the non-minimal case, coordinates in the second block of $\tilde x_t$ can be omitted, and the resulting state is a subvector of the vector given above. This does not change the maximum modulus of the eigenvalues of $A$ or $A - BC$ for the reduced minimal system.
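The extended state space representation can be checked numerically. The sketch below assumes that the over-differenced output is generated as $y_t = u_t - M_c M_c' u_{t-1}$ with $u_t = C x_t + \varepsilon_t$ and $x_{t+1} = A x_t + B \varepsilon_t$, which is our reading of the data-generating mechanism. All matrices are hypothetical. The extended state is $\tilde x_t = (x_t', (M_c' u_{t-1})')'$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, c, T = 2, 2, 1, 50

# Hypothetical stable system for the stationary process u_t.
A = np.array([[0.5, 0.2], [0.0, 0.3]])
B = np.array([[1.0, 0.0], [0.3, 0.7]])
C = np.array([[1.0, 0.0], [0.5, 1.0]])
M_c = np.array([[1.0], [1.0]]) / np.sqrt(2.0)  # orthonormal s x c block column

eps = rng.standard_normal((T, s))

# Direct construction: generate u_t, then difference in the direction of M_c.
x = np.zeros(n)
u_prev = np.zeros(s)
y_direct = np.zeros((T, s))
for t in range(T):
    u = C @ x + eps[t]
    y_direct[t] = u - M_c @ (M_c.T @ u_prev)
    x = A @ x + B @ eps[t]
    u_prev = u

# Extended system with state [x_t; M_c' u_{t-1}].
A_ext = np.block([[A, np.zeros((n, c))], [M_c.T @ C, np.zeros((c, c))]])
B_ext = np.vstack([B, M_c.T])
C_ext = np.hstack([C, -M_c])

x_ext = np.zeros(n + c)
y_ext = np.zeros((T, s))
for t in range(T):
    y_ext[t] = C_ext @ x_ext + eps[t]
    x_ext = A_ext @ x_ext + B_ext @ eps[t]

max_abs_diff = float(np.max(np.abs(y_direct - y_ext)))
ext_moduli = np.sort(np.abs(np.linalg.eigvals(A_ext)))
```

Both constructions yield the same output sequence, and the extended transition matrix only adds zero eigenvalues to those of $A$.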
Let $\tilde u_t = M' y_t$ such that, according to the generation of $(y_t)_{t \in \mathbb{Z}}$, we obtain:
$$\tilde u_t = \begin{bmatrix} u_{t,c} - u_{t-1,c} \\ u_{t,s-c} \end{bmatrix}$$
where $(u_t)_{t \in \mathbb{Z}}$ is a stationary process with a non-singular spectrum. With this notation, the state process $(\tilde x_t)_{t \in \mathbb{Z}}$ has two blocks, where the first block corresponds to the state process for $(u_t)_{t \in \mathbb{Z}}$ and the second block contains (a subset of coordinates of) $u_{t-1,c}$.
Using the summation as in Example 1, we may approximate u t given u ˜ t :
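$$u_t(p) = \begin{bmatrix} \frac{p}{p+1} I_c & 0 & \frac{p-1}{p+1} I_c & 0 & \cdots & \frac{1}{p+1} I_c & 0 \\ 0 & I_{s-c} & 0 & 0 & \cdots & 0 & 0 \end{bmatrix} \begin{bmatrix} \tilde u_t \\ \tilde u_{t-1} \\ \vdots \\ \tilde u_{t-p+1} \end{bmatrix} = \begin{bmatrix} u_{t,c} - \bar u_{t,c}(p) \\ u_{t,s-c} \end{bmatrix} .$$
The error term $\bar u_{t,c}(p)$ denotes the average over $p+1$ values of a stationary process and hence has variance of order $O(1/(p+1))$.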
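The averaging argument rests on the identity $\sum_{j=0}^{p-1} \frac{p-j}{p+1} (u_{t-j} - u_{t-j-1}) = u_t - \frac{1}{p+1} \sum_{j=0}^{p} u_{t-j}$. The following sketch verifies it for a scalar series with arbitrary values; the function name and the choice $p = 7$ are illustrative assumptions.

```python
import numpy as np


def reconstruct_from_differences(du, p):
    """Weighted sum of the last p differences with weights (p - j) / (p + 1), j = 0..p-1.

    du[k] holds u_{k+1} - u_k; the result refers to the most recent time point.
    """
    weights = (p - np.arange(p)) / (p + 1.0)
    recent_first = du[::-1][:p]  # differences ordered from time t backwards
    return float(weights @ recent_first)


rng = np.random.default_rng(3)
p = 7
u = rng.standard_normal(p + 1)  # u_{t-p}, ..., u_t

approx = reconstruct_from_differences(np.diff(u), p)
target = u[-1] - u.mean()       # u_t minus the average over p + 1 values

print("weighted sum of differences:", approx)
print("u_t minus average:          ", target)
```

The weighted sum of $p$ differences therefore recovers $u_t$ up to the average over $p + 1$ values, which is the error term discussed above.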
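Next, the state $x_t = \sum_{j=1}^{\infty} \underline{A}^{j-1} B u_{t-j}$ can be approximated by $\tilde x_t(p) = \sum_{j=1}^{p} \underline{A}^{j-1} B u_{t-j}(p)$. Jointly we obtain:
$$\tilde x_t = \begin{bmatrix} \tilde x_t(p) \\ u_{t-1,c}(p) \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^{p} \underline{A}^{j-1} B (u_{t-j} - u_{t-j}(p)) + \underline{A}^{p} x_{t-p} \\ u_{t-1,c} - u_{t-1,c}(p) \end{bmatrix} = \begin{bmatrix} \tilde x_t(p) \\ u_{t-1,c}(p) \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^{p} \underline{A}^{j-1} B \bar u_{t-j,c}(p) + \underline{A}^{p} x_{t-p} \\ \bar u_{t-1,c}(p) \end{bmatrix} .$$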
The variance of the second term can be bounded by $O(\max(\tfrac{1}{p}, \rho^p))$, where $0 < \rho < 1$ denotes the maximum modulus of the eigenvalues of $\underline{A}$.
Since $u_{t-p,c}(p)$ involves lags of $\tilde u_t$ back to $2p-1$, we have constructed an approximation of the state based on $2p$ lags whose approximation error has variance of order $O(\tfrac{1}{p})$. Since $x_t(p)$ is a minimum variance approximation, the corresponding approximation error is smaller. This concludes the proof.

Notes

1
In the literature, this is also called ‘strictly minimum-phase’.
2
This deviates from the definition above and is only used to motivate the size of the approximation error.
3
The main argument is the lower bound on the eigenvalues of the matrix Γ p ; see the proof of Theorem 1 in Appendix A for details.
4
The only change required is to replace the univariate bound $0 < \alpha \le g \le \beta < \infty$ by the matrix bound $0 < \alpha I_s \le g \le \beta I_s < \infty$ in the sense of positive definite matrices. This implies the same bound in their Equation (8).
5
For numerical reasons, the value is cut off at 0.99 to ensure stable invertibility.

References

  1. Bauer, D. (2005). Estimating linear dynamical systems using subspace methods. Econometric Theory, 21, 181–211.
  2. Bauer, D. (2008). Using subspace methods for estimating ARMA models for multivariate time series with conditionally heteroskedastic innovations. Econometric Theory, 24, 1063–1092.
  3. Bauer, D. (2018, September 19–21). Using subspace methods to model long-memory processes. International Conference on Time Series and Forecasting (pp. 171–185), Granada, Spain.
  4. Chatelin, F. (1993). Eigenvalues of matrices. John Wiley & Sons.
  5. Funovits, B. (2024). Identifiability and estimation of possibly non-invertible SVARMA Models: The normalised canonical WHF parametrisation. Journal of Econometrics, 241, 105766.
  6. Hannan, E. J., & Deistler, M. (1988). The statistical theory of linear systems. John Wiley.
  7. He, J., Ziemann, I., Rojas, C. R., Qin, S. J., & Hjalmarsson, H. (2025). Finite sample analysis of open-loop subspace identification methods. arXiv, arXiv:2501.16639.
  8. Johansen, S. (1995). Likelihood-based inference in cointegrated vector auto-regressive models. Oxford University Press.
  9. Larimore, W. E. (1983, June 22–24). System identification, reduced order filters and modeling via canonical variate analysis. 1983 American Control Conference (pp. 445–451), San Francisco, CA, USA.
  10. Matuschek, L. (2020). Essays on cointegration analysis in the state space framework [Doctoral dissertation, Technische Universität Dortmund].
  11. Palma, W., & Bondon, P. (2003). On the eigenstructure of generalized fractional processes. Statistics and Probability Letters, 65, 93–101.
  12. Pfaff, B. (2008). Analysis of integrated and cointegrated time series with R. Springer.
  13. Poskitt, D. S. (2006). Autoregressive approximation in nonstandard situations: The fractionally integrated and non-invertible case. Annals of the Institute of Statistical Mathematics, 59, 697–725.
  14. Sakarya, N., & de Jong, R. M. (2022). The spectral analysis of the Hodrick–Prescott filter. Journal of Time Series Analysis, 43, 479–489.
  15. Tsiamis, A., Ziemann, I., Matni, N., & Pappas, G. J. (2023). Statistical learning theory for control: A finite-sample perspective. IEEE Control Systems Magazine, 43, 67–97.
Figure 1. (a) Mean squared norm of impulse response estimation error times sample size. (b) Mean lag length optimizing AIC. (c) Density estimate of estimates of $K_1(2,2)$ for $T = 200$. (d) Density estimate of estimates of $K_1(2,2)$ for $T = 1600$.
Figure 2. Density estimate of the estimated sum of all entries of $\underline{A}$ for $T = 400$ (a) for the original data and (b) for the differenced data. (c) Deviation of mean of largest modulus from true value. (d) Empirical cumulative distribution function (ecdf) of the largest eigenvalue of $\underline{A}$ for differenced data and $T = 400$.
Table 1. Mean square error in sample for the three estimators, for the original data, and for the differenced data.
| Dataset | Estimator | m2 | y | ibo | ide |
|---|---|---|---|---|---|
| original data | VECM(1,1) | 0.0195 | 0.0215 | 0.0077 | 0.0054 |
| | PEM | 0.0163 | 0.0207 | 0.0075 | 0.0048 |
| | CVA | 0.0219 | 0.0197 | 0.0085 | 0.0049 |
| differenced data | VAR(2) | 0.0202 | 0.0201 | 0.0073 | 0.0053 |
| | PEM | 0.0168 | 0.0202 | 0.0070 | 0.0048 |
| | CVA | 0.0200 | 0.0217 | 0.0088 | 0.0061 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bauer, D. Using Subspace Algorithms for the Estimation of Linear State Space Models for Over-Differenced Processes. Econometrics 2026, 14, 12. https://doi.org/10.3390/econometrics14010012
