State-Space Models on the Stiefel Manifold with a New Approach to Nonlinear Filtering

We develop novel multivariate state-space models wherein the latent states evolve on the Stiefel manifold and follow a conditional matrix Langevin distribution. The latent states correspond to time-varying reduced rank parameter matrices, like the loadings in dynamic factor models and the parameters of cointegrating relations in vector error-correction models. The corresponding nonlinear filtering algorithms are developed and evaluated by means of simulation experiments.


Introduction
The coefficient matrix of explanatory variables in multivariate time series models can be rank deficient due to some modelling assumptions, and the parameter constancy of the rank deficient matrix may be questionable. This may happen, for example, in the factor model, which constructs very few factors from a large number of macroeconomic and financial predictors, while the factor loadings are suspected to be time-varying. Stock and Watson (2002) state that it is reasonable to suspect temporal instability in factor loadings, and later Stock and Watson (2009) and Breitung and Eickmeier (2011) find empirical evidence of such instability. Another setting where instability may arise is in cointegrating relations (see e.g. Bierens and Martins (2010)), hence in the reduced rank cointegrating parameter matrix of a vector error-correction model.

† Department of Statistics, Uppsala University, P.O. 513, SE-75120, Uppsala, Sweden, email address: yukai.yang@statistik.uu.se (corresponding author)
There are solutions in the literature to the modelling of the temporal instability of reduced rank parameter matrices. Such parameters are typically regarded as unobserved random components and in most cases are modelled as random walks on a Euclidean space; see for example Del Negro and Otrok (2008) and Eickmeier et al. (2014). In these works, the noise component of the latent processes (the factor loadings) is assumed to have a diagonal covariance matrix in order to alleviate the computational complexity and make the estimation feasible, especially when the dimension of the system is high. However, the random walk assumption on the Euclidean space cannot guarantee the orthonormality of the factor loading (or cointegration) matrix, although it is this type of assumption that identifies the loading (or cointegration) space. Hence, other identification restrictions on the Euclidean space are needed. Moreover, the diagonality of the error covariance matrix of the latent processes is not invariant to a permutation of the variables: a covariance matrix that is diagonal under one variable ordering is generally not diagonal under another.
In this work we develop new state-space models on the Stiefel manifold, which do not suffer from the problems on the Euclidean space. It is noteworthy that Chikuse (2006) also develops state-space models on the Stiefel manifold. The key difference between Chikuse (2006) and our work is that we keep the Euclidean space for the measurement evolution of the observable variables, while Chikuse (2006) puts them on the Stiefel manifold, which is not relevant for modelling economic time series. By specifying the time-varying reduced rank parameter matrices on the Stiefel manifold, their orthonormality is obtained by construction, and therefore their identification is guaranteed.
The corresponding recursive nonlinear filtering algorithms are developed to estimate the a posteriori distributions of the latent processes of the reduced rank matrices. By applying the matrix Langevin distribution to the a priori distributions of the latent processes, conjugate a posteriori distributions are achieved, which gives great convenience in the computational implementation of the filtering algorithms. The predictive step of the filtering requires solving an integral on the Stiefel manifold, which does not have a closed form. To compute this integral, we resort to the Laplace method.
The paper is organized as follows. Section 2 introduces the general framework of the vector models with time-varying reduced rank parameters. Two specific forms of the time-varying reduced rank parameters, which the paper is focused on, are given. Section 3 discusses some problems in the prevalent literature on modelling the time dependence of the time-varying reduced rank parameters, which underlie our modelling choices. Then, in Section 4 we present the novel state-space models on the Stiefel manifold. Section 5 presents the nonlinear filtering algorithms that we develop for the new state-space models.
Section 6 presents several simulation based examples. Finally, Section 7 concludes and gives possible research extensions.
2 Vector models with time-varying reduced rank parameters

Consider the multivariate time series model with partly time-varying parameters

y t = A t x t + Bz t + ε t , t = 1, ..., T, (2.1)

where y t is a (column) vector of dependent variables of dimension p, x t and z t are vectors of explanatory variables of dimensions q 1 and q 2 , A t and B are p × q 1 and p × q 2 matrices of parameters, and ε t is a vector belonging to a white noise process of dimension p, with positive-definite covariance matrix Ω. For quasi-maximum likelihood estimation, we further assume that ε t ∼ N p (0, Ω).
The distinction between x t and z t is introduced to separate the explanatory variables that have time-varying coefficients (A t ) from those that have fixed coefficients (B). In the sequel we always consider that x t is not void (i.e. q 1 > 0). The explanatory variables may contain lags of y t , and the remaining stochastic elements (if any) of these vectors are assumed to be weakly exogenous.
We assume furthermore that the time-varying parameter matrix A t has reduced rank r < min(p, q 1 ). This assumption can be formalized by decomposing A t as α t β t ′, where α t and β t are p × r and q 1 × r full rank matrices, respectively. If we allow both α t and β t to be time-varying, the model is not well focused and hard to interpret, and its identification is very difficult. Hence, we focus on the cases where either α t or β t is time-varying, that is, on the following two cases:

Case 1: A t = α t β ′, where α t is time-varying and β is time-invariant; (2.2)

Case 2: A t = α β t ′, where β t is time-varying and α is time-invariant. (2.3)

There are papers in the literature arguing that the temporal instability of the parameters in both stationary and non-stationary macroeconomic data does exist and cannot be overlooked. For example, Swanson (1998) and Rothman et al. (2001) give convincing examples when investigating the Granger causal relationship between money and output using nonlinear vector error-correction models. They model the instability in α by means of regime-switching mechanisms governed by some observable variable. An alternative to that modelling approach is to regard α t as a totally latent process.
The case 1 model also includes as a particular case the factor model with time-varying factor loadings. In the factor model context, the factors f t are extracted from a number of observable predictors x t by using the r linear combinations f t = β ′x t . Note that f t is latent since β is unknown. Then the corresponding factor model (neglecting the Bz t term) takes the form

y t = α t f t + ε t , (2.4)

where α t is the matrix of the time-varying factor loadings. The representation is quite flexible in the sense that y t can be equal to x t , in which case we reach exactly the same representation as Stock and Watson (2002), but we also allow them to be distinct. In Stock and Watson (2002), the factor loading matrix α is time-invariant and the identification is obtained by imposing the constraints α = q 1 β and α ′β = β ′α = α ′α/q 1 = q 1 β ′β = I r . Notice that if α is time-varying but β time-invariant, these constraints cannot be imposed.
The case 2 model (equations (2.1) and (2.3)) can be used to account for time-varying long-run relations in cointegrated time series, as β t is changing. Bierens and Martins (2010) show that this may be the case for the long run purchasing power parity. In the case 2 model, there exist p − r linearly independent vectors α ⊥ that span the left null space of α, such that α ⊥ ′A t = 0. Therefore, the case 2 model implies that the time-varying parameter matrix β t vanishes in the structural vector model for any column vector γ ∈ sp(α ⊥ ), where sp(α ⊥ ) denotes the space spanned by the columns of α ⊥ ; the temporal instability can thus be removed in this way. Moreover, x t does not explain any variation of γ ′y t .
Another possible application for the case 2 model is the instability in the factor composition. Considering the factor model y t = αf t + ε t , with time-invariant factor loading α, the factor composition may be slightly evolving through β t in f t = β t x t .
3 Issues about the specification of the time-varying reduced rank parameter In the previous section, we have introduced two models with time-varying reduced rank parameters. In this section, in order to motivate our choices presented in Section 4, we discuss the specification in the literature of the dynamic process governing the evolution of the time-varying parameters.
Since the sequences α t or β t in the two cases are unobservable in practice, it is quite natural to write the two models in state-space form, with a measurement equation and a state (transition) equation; see e.g. Koopman (1974), Durbin and Koopman (2012), and more recently Casals et al. (2016).
Consider for example the factor model (2.4) with time-varying factor loading α t ; the following discussion can be easily adapted to the cointegration model, where only β t is time-varying. The traditional state-space framework on the Euclidean space assumes that the elements of the time-varying matrix α t evolve like random walks on the Euclidean space, see for example Del Negro and Otrok (2008) and Eickmeier et al. (2014). That is,

vec(α t+1 ) = vec(α t ) + η t , (3.1)

where vec denotes the vectorization operator, and the sequence η t is assumed to be a Gaussian strong white noise process with constant positive definite covariance matrix Σ η .
Thus, (2.1) and (3.1) form a vector state-space model, and the Kalman filter technique can be applied for estimating α t .
A first problem with the model (3.1) is that the latent random walk evolution on the Euclidean space has undesirable geometric implications. Consider the special case p = 2 and r = 1: in Figure 1, points 1-3 are possible locations of the latent variable vec(α t ) = (α 1t , α 2t )′. Suppose that the next state α t+1 evolves as in (3.1) with a diagonal covariance matrix Σ η . The circles centered around points 1-3 are contour lines such that, say, almost all the probability mass lies inside the circles. The straight lines OA and OB are tangent to circle 1, with A and B the tangent points; the straight lines OC and OD are tangent to circle 2; and the straight lines OE and OF are tangent to circle 3. The angle between the tangent lines depends on the location of the point: in general, the more distant a point is from the origin, the smaller the corresponding angle (apart from some special elliptical contours).
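The shrinking-angle geometry can be checked numerically: for a circular contour of radius ρ centred at distance d from the origin, the angle between the two tangent lines from O is 2 arcsin(ρ/d), which decreases in d. A minimal sketch of ours (the radius and distances are illustrative, not taken from Figure 1):

```python
import numpy as np

def tangent_angle(d, rho=0.5):
    """Angle at the origin between the two tangent lines to a circle
    of radius rho centred at distance d (requires d > rho)."""
    return 2.0 * np.arcsin(rho / d)

# The further the contour centre is from the origin, the narrower the
# angle, i.e. the tighter the implied distribution over subspaces.
angles = [tangent_angle(d) for d in (1.0, 2.0, 4.0)]
print(angles)  # strictly decreasing
```

This makes the asymmetry concrete: the same diagonal Σ η implies very different amounts of variation in the spanned subspace depending on where the current state happens to sit.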
The plot shows that the distributions of the next subspace based on the current point differ for different subspaces (angle for 3 and 2 smaller than angle for 1); even for the same subspace (points 2 and 3), the distribution of the subspace is different (angle for 3 smaller than angle for 2).
- insert Figure 1 about here -

A second problem is the identification issue. The pair α t and β should be identified before we can proceed with the estimation of (2.1) and (3.1). If both α and β are time-invariant, it is common to assume the orthonormality (or asymptotic orthonormality) α ′α/q 1 = I r or α ′α = I r to identify the factors and then to estimate them by using the principal components method. But when α t is evolving as in (3.1), the orthonormality of α t can never be guaranteed for all t on the Euclidean space.
An alternative solution to the identification problem is to normalize the time-invariant part β as (I r , b ′)′. The normalization is valid when the upper r × r block of β is invertible; if it is not, one can always permute the rows of β to find an invertible r × r submatrix for such a normalization. The permutation can be performed by left-multiplying β by a permutation matrix P to make its upper block invertible. In practice, it should be noted that the choice of the permutation matrix P is usually arbitrary.
Even though the model defined by (2.1) and (3.1) is identified by some normalized β, if one does not impose any constraint on the elements of the positive definite covariance matrix Σ η , the estimation can be very difficult due to computational complexity. A feasible solution is to assume that η t is cross-sectionally uncorrelated. This restriction reduces the number of parameters, alleviates the complexity of the model, and makes the estimation much more efficient, but it may be too strong and imposes a priori information on the data. A third problem then arises. In the following two propositions, we show that any design like (2.1) and (3.1) with the restriction that Σ η is diagonal is problematic, in the sense that it may lead to a contradiction, since the normalization of β is arbitrarily chosen.
Proposition 3.1. Suppose that the reduced rank coefficient matrix A t in (2.1) with rank r has the decomposition (2.2). By choosing some permutation matrix P β (q 1 × q 1 ), the time-invariant component β can be linearly normalized if the r × r upper block b 1 in P β β = (b 1 ′, b 2 ′)′ is invertible. The corresponding linear normalization is β̃ = P β β b 1 −1 , and the time-varying component is re-identified as α̃ t = α t b 1 ′. Assume that the time-varying component evolves by following vec(α̃ t+1 ) = vec(α̃ t ) + η t . Consider another permutation P β * ≠ P β with the corresponding β̃ * , α̃ t * , b 1 * and η t * . The variance-covariance matrices of η t and η t * are both diagonal if and only if b 1 = b 1 * .

Proposition 3.2. Suppose that the reduced rank coefficient matrix A t in (2.1) with rank r has the decomposition (2.3). By choosing some permutation matrix P α (p × p), the time-invariant component α can be linearly normalized if the r × r upper block a 1 in P α α = (a 1 ′, a 2 ′)′ is invertible. The corresponding linear normalization is α̃ = P α α a 1 −1 , and the time-varying component is re-identified as β̃ t = β t a 1 ′. Assume that the time-varying component evolves by following vec(β̃ t+1 ) = vec(β̃ t ) + η β t . Consider another permutation P α * ≠ P α with the corresponding α̃ * , β̃ t * , a 1 * and η β * t . The variance-covariance matrices of η β t and η β * t are both diagonal if and only if a 1 = a 1 * .
Proof. See Appendix B.
The two corollaries below follow Propositions 3.1 and 3.2 immediately, showing that the assumption that the variance-covariance matrix Σ η is diagonal for any linear normalization is inappropriate.

Corollary 3.3. Given the settings in Proposition 3.1, the variance-covariance matrices of the error vectors based on different linear normalizations cannot both be diagonal if b 1 ≠ b 1 * , where b 1 and b 1 * are the upper block square matrices of the respective normalizations.

Corollary 3.4. Given the settings in Proposition 3.2, the variance-covariance matrices of the error vectors in forms like (3.7), based on different linear normalizations, cannot both be diagonal if a 1 ≠ a 1 * , where a 1 and a 1 * are the upper block square matrices in forms like (3.5).
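The mechanism behind the corollaries can be illustrated numerically: switching from normalization b 1 to b 1 * rescales the error as vec(η * ) = (M ⊗ I p ) vec(η) with M = b 1 * b 1 −1 , and a covariance that is diagonal under the first normalization is generically not diagonal after the transformation. A small sketch of ours, with arbitrary illustrative matrices (none taken from the paper):

```python
import numpy as np

p, r = 3, 2

# Diagonal error covariance of vec(eta) under the first normalization.
Sigma = np.diag([1.0, 0.5, 2.0, 1.5, 0.8, 1.2])

# Two different upper blocks b1 and b1* (illustrative values).
b1 = np.array([[2.0, 0.3], [0.1, 1.5]])
b1_star = np.array([[1.0, 0.5], [0.4, 2.0]])

# Change of normalization: vec(eta*) = (M kron I_p) vec(eta).
M = b1_star @ np.linalg.inv(b1)
T = np.kron(M, np.eye(p))
Sigma_star = T @ Sigma @ T.T

off = Sigma_star - np.diag(np.diag(Sigma_star))
print(np.abs(off).max())  # nonzero: Sigma* is not diagonal
```

So imposing diagonality under one (arbitrary) permutation rules out diagonality under essentially every other, which is the contradiction the corollaries formalize.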
One may argue that there is a chance for the two covariance matrices to both be diagonal, i.e., when b 1 = b 1 * . It should be noticed that the condition b 1 = b 1 * does not imply that P = P * . Instead, it implies that the two permutation matrices move the same variables, in the same order, to the upper part of β. If this is the case, the two permutation matrices P and P * are distinct but equivalent, as the order of the variables in the lower part is immaterial for the linear normalization.
Since the choice of the permutation P and the corresponding linear normalization is arbitrary in practice, being simply the order of the variables in x t (y t for case 2), models with different P tell different stories about the data. In fact, the model has been over-identified by the assumption that Σ η must be diagonal. Consequently, the model becomes β-normalization dependent, and the β-normalization imposes additional information on the data. This can be serious when the forecasts from models with distinct normalizations give totally different results. A solution to this "unexpected" problem may be to try all possible normalizations and do model selection, that is, after estimating every possible model, pick the best one according to an information criterion. However, this solution is not always feasible, because the number of possible permutations, which is equal to q 1 (q 1 − 1)...(q 1 − r + 1), can be huge. When the number of predictors is large, which is common in practice, the estimation of each possible model based on a different normalization becomes a very demanding task. Stock and Watson (2002) propose the assumption that the cross-sectional dependence between the elements of η t is weak and that the variances of the elements shrink as the sample size increases. Then the aforementioned problem may not be so serious, as, intuitively, different normalizations with diagonal covariance matrix Σ η may produce approximately or asymptotically the same results.
We have shown that modelling the time-varying parameter matrix in (2.2) as a process like (3.1) on the Euclidean space involves several problems. Firstly, the evolution of the subspace spanned by the latent process on the Euclidean space is geometrically awkward. Secondly, the process does not comply with the orthonormality assumption used to identify the pair α t and β, so a linear normalization is employed instead of orthonormality. Thirdly, the state-space model on the Euclidean space suffers from the curse of dimensionality, and hence the diagonality of the covariance of the errors is often used with the linear normalization in order to alleviate the computational complexity when the dimension is high.
This leads to two other problems: firstly, the diagonality assumption is inappropriate in the sense that different linear normalizations may lead to a contradiction; secondly, model selection can be a tremendous task when there are many predictors.
In the following section, we propose that the time-varying parameter matrices α t and β t evolve on the Stiefel manifold, instead of the Euclidean space, and we show that the corresponding state-space models do not suffer from the aforementioned problems.
4 State-space models on the Stiefel manifold

The Stiefel manifold and the matrix Langevin distribution
Before presenting the state-space models on the Stiefel manifold, we introduce some concepts and terms. The Stiefel manifold V a,b , for dimensions a and b such that a ≥ b, is the set of a × b real matrices X with orthonormal columns, that is, X ′X = I b ; for b = 1 it reduces to the unit hypersphere in the a-dimensional Euclidean space. The link with the modelling presented in Section 2 and developed in the next subsection is that the time-varying matrix α t of (2.2) is assumed to be evolving in V p,r (instead of a Euclidean space), and β t of (2.3) in V q 1 ,r . Hence, each α t and β t is by definition orthonormal.
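As a quick numerical illustration of ours (not from the paper), a point on V a,b can be generated by taking the Q factor of the QR decomposition of a Gaussian matrix; with a sign convention on the R factor, this draw is in fact uniformly distributed on the manifold:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 5, 2

# QR of an a x b Gaussian matrix: Q has orthonormal columns, so it is
# a point on the Stiefel manifold V_{a,b}.
Z = rng.normal(size=(a, b))
Q, R = np.linalg.qr(Z)
Q = Q * np.sign(np.diag(R))  # sign convention that makes the draw uniform

assert np.allclose(Q.T @ Q, np.eye(b))  # X'X = I_b by construction
```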
We also need to replace the assumption (3.1) that the distribution of vec(α t+1 ) conditional on vec(α t ) is N p×r (vec(α t ), Σ η ) by an appropriate distribution defined on V p,r , and likewise for vec(β t+1 ). A convenient distribution for this purpose is the matrix Langevin distribution (also known as the von Mises-Fisher distribution), denoted by M L(a, b, F ). A random matrix X ∈ V a,b follows a matrix Langevin distribution if and only if it has the probability density function

f (X) = etr{F ′X} / 0 F 1 (a/2; F ′F/4), (4.1)

where etr{Q} stands for exp{tr{Q}} for any square matrix Q, F is an a × b matrix, and 0 F 1 (a/2; F ′F/4) is the (0, 1)-type hypergeometric function with arguments a/2 and F ′F/4. The hypergeometric function 0 F 1 is unusual in that it takes a matrix argument, see Herz (1955); it is the normalizing constant of the density defined in (4.1), that is,

0 F 1 (a/2; F ′F/4) = ∫ V a,b etr{F ′X}[dX],

where [dX] stands for the differential form of a Haar measure on the Stiefel manifold, constructed from the exterior product ∧ of the differentials of the column vectors x i of X.
The density function (4.1) can be obtained from a normal density for a random matrix Z with mean matrix M and a suitable covariance structure Σ, by conditioning Z to lie on V a,b ; the parameter F of (4.1) is then equal to M Σ −1 .
The matrix F has the singular value decomposition F = U DV ′, where U (a × b) has orthonormal columns, V (b × b) is orthogonal, and D is the diagonal matrix of singular values; each pair of column vectors in U and V corresponds to a singular value in D. Notice that the hypergeometric function in (4.1) depends on F only through D, that is, 0 F 1 (a/2; F ′F/4) = 0 F 1 (a/2; D 2 /4). It can be shown that the density function (4.1) attains its maximum value exp(tr{D}) / 0 F 1 (a/2; D 2 /4) at the modal orientation X m = U V ′; see Khatri and Mardia (1977), and Mardia (1975).
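The location of the mode is easy to verify numerically: tr{F ′X} equals the sum of the singular values at X m = U V ′ and is smaller at any other Stiefel point. A small check of ours with an arbitrary F:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 4, 2
F = 3.0 * rng.normal(size=(a, b))  # arbitrary parameter matrix

# Thin SVD F = U D V'; the mode of ML(a, b, F) is X_m = U V'.
U, D, Vt = np.linalg.svd(F, full_matrices=False)
X_mode = U @ Vt

def log_kernel(X, F):
    """log etr{F'X}, i.e. the log density up to the constant."""
    return np.trace(F.T @ X)

# At the mode the log-kernel equals sum of singular values, and it
# dominates the value at random Stiefel points.
for _ in range(200):
    Q, _ = np.linalg.qr(rng.normal(size=(a, b)))
    assert log_kernel(X_mode, F) >= log_kernel(Q, F)
print(log_kernel(X_mode, F), D.sum())  # equal
```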
The density function (4.1) is rotationally symmetric around X m , in the sense that the density at H 1 XH 2 is the same as that at X for all orthogonal matrices H 1 (of dimension a × a) and H 2 (of dimension b × b) such that H 1 U = U and H 2 V = V (hence H 1 X m H 2 = X m ). In Figure 2, the modal orientation U = (1/√2, 1/√2)′ is shown for the densities of Figure 3, together with the point at which the density is minimal, namely −U .
The densities are shown on Figure 3 as functions of the angle θ shown on Figure 2, for θ between 0 and 2π, instead of being shown as lines above the unit circle. Rotational symmetry in this example means that if we premultiply the random vector X by any orthogonal 2 × 2 matrix H 1 that does not modify the modal orientation, the densities are not changed.
- insert Figure 2 about here -

- insert Figure 3 about here -

Models
Chikuse (2006) develops a state-space model whose observable and latent variables are both evolving on Stiefel manifolds. For economic data, it is not appropriate to assume that the observable variables evolve on a Stiefel manifold, so that we keep the assumption that y t evolves on a Euclidean space in the measurement equation (2.1).
We define two state-space models corresponding to the case 1 and case 2 models introduced in Section 2, with latent processes evolving on the Stiefel manifold and following conditional matrix Langevin distributions:

Model 1: y t = α t β ′x t + Bz t + ε t , α t+1 |α t ∼ M L(p, r, a t DV ′); (4.4)

Model 2: y t = αβ t ′x t + Bz t + ε t , β t+1 |β t ∼ M L(q 1 , r, b t DV ′); (4.5)

with the constraints that a t V ′ = α t and b t V ′ = β t , respectively. We assume in addition that the error ε t and α t+1 or β t+1 are mutually independent. The parameters of the M L distributions of the models are chosen so that the previous state of α t or β t is the modal orientation of the next state. Thus the transitions of the latent processes are random walks on the Stiefel manifold, evolving in the matrix Langevin way.
The models (4.4) and (4.5) are not yet identified, due to the fact that the pairs between a t or b t and the nuisance parameter V can be arbitrarily chosen, and therefore the time-invariant β and α are not identified either. The identification problem can be solved by imposing V = I r . The identified version of the models is then

Model 1: y t = α t β ′x t + Bz t + ε t , α t+1 |α t ∼ M L(p, r, α t D); (4.6)

Model 2: y t = αβ t ′x t + Bz t + ε t , β t+1 |β t ∼ M L(q 1 , r, β t D). (4.7)

The new state-space models in (4.6) and (4.7) do not have the problems mentioned in Section 3, due to the fact that both α t and β t are points on the Stiefel manifold. By construction, orthonormality is ensured, that is, α t ′α t = I r for Model 1 and, similarly, β t ′β t = I r for Model 2. If the space spanned by the columns of α t (or the columns of β t ) is subjected to a rotation, the model is fundamentally unchanged. Indeed, in the case of Model 1, let H be an orthogonal matrix (p × p), and define the rotation α̃ t = Hα t . Then α̃ t ′α̃ t = α t ′H ′Hα t = α t ′α t = I r . A similar reasoning holds for Model 2.
Simpler versions of the models in (4.6) and (4.7) are obtained by assuming that the evolutions of α t and β t are independent of their previous states, with the same modal orientations α * and β * across time:

Model 1 * : α t+1 ∼ M L(p, r, α * D); (4.8)

Model 2 * : β t+1 ∼ M L(q 1 , r, β * D). (4.9)

If we instead assume that the random variation of α t+1 in (4.6) or β t+1 in (4.7) stays inside the subspace spanned by α t (hence α 0 ) or β t (hence β 0 ), we obtain another two state-space models, in which the conditional distributions of α t+1 and β t+1 become truncated matrix Langevin distributions restricted to that subspace. These two models can be interesting if the spaces spanned by the time-varying α t and β t are expected to be invariant over time.
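For r = 1 the matrix Langevin distribution M L(p, 1, κ α t ) reduces to the von Mises-Fisher distribution on the unit sphere, so the latent random walk of Model 1 can be simulated with a standard rejection sampler in the spirit of Wood (1994). The sketch below is ours (the concentration κ is an arbitrary choice), not an algorithm given in the paper:

```python
import numpy as np

def sample_vmf(mu, kappa, rng):
    """One draw from the von Mises-Fisher distribution on the unit
    sphere in R^p with mean direction mu and concentration kappa."""
    p = mu.size
    b = (-2*kappa + np.sqrt(4*kappa**2 + (p - 1)**2)) / (p - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa*x0 + (p - 1)*np.log(1 - x0**2)
    while True:  # rejection step for the component w along mu
        z = rng.beta((p - 1)/2, (p - 1)/2)
        w = (1 - (1 + b)*z) / (1 - (1 - b)*z)
        if kappa*w + (p - 1)*np.log(1 - x0*w) - c >= np.log(rng.uniform()):
            break
    v = rng.normal(size=p)            # uniform direction orthogonal to mu
    v -= (v @ mu) * mu
    v /= np.linalg.norm(v)
    return w*mu + np.sqrt(1 - w**2)*v

# Random walk on V_{p,1}: each state is the modal direction of the next.
rng = np.random.default_rng(0)
p, T, kappa = 5, 50, 100.0
alpha = np.zeros((T, p))
alpha[0] = np.eye(p)[0]
for t in range(1, T):
    alpha[t] = sample_vmf(alpha[t - 1], kappa, rng)

assert np.allclose(np.linalg.norm(alpha, axis=1), 1.0)  # stays on V_{p,1}
```

By construction every simulated state is exactly orthonormal, which is the identification property the Euclidean random walk (3.1) cannot deliver.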
The quasi-likelihood functions for Model 1 and Model 2 based on Gaussian errors take the forms (4.12) and (4.13), respectively. We treat the initial values α 0 and β 0 as parameters to be estimated, but of course they can be regarded as given.

5 The filtering algorithms
In this section, for the models (4.6) and (4.7) defined in the previous section, we propose nonlinear filtering algorithms to estimate the a posteriori distributions of the latent processes based on the Gaussian error assumption in the measurement equations.
We start with Model 1, which has time-varying α t . The filtering algorithm consists of two steps, the prediction step

f (α t |F t−1 ) = ∫ f (α t |α t−1 )f (α t−1 |F t−1 )[dα t−1 ], (5.1)

and the updating step

f (α t |F t ) ∝ f (y t |α t , F t−1 )f (α t |F t−1 ), (5.2)

where the symbol [dα t−1 ] stands for the differential form for a Haar measure on the Stiefel manifold. The predictive density in (5.1) represents the a priori distribution of the latent variable before observing the information at time t. The updating density (5.2), which is also called the filtering density, represents the a posteriori distribution of the latent variable after observing the information at time t.
The prediction step is quite tricky in the sense that, even if we can find the joint distribution of α t and α t−1 , which is the product f (α t |α t−1 )f (α t−1 |F t−1 ), we must integrate out α t−1 over the Stiefel manifold. The density kernel f (α t−1 |F t−1 ) appearing in the integral in (5.1) comes from the previous updating step and is quite straightforward, as it is proportional to the product of the density function of y t−1 and the predicted density of α t−1 ; see the updating step (5.2).
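To see the prediction step concretely, take the smallest case p = 2, r = 1, where V 2,1 is the unit circle and the matrix Langevin density reduces to a von Mises density in the angle θ. The prediction integral is then a circular convolution that can be evaluated by brute-force quadrature. This is an illustration of ours (the concentrations are arbitrary):

```python
import numpy as np
from scipy.special import i0  # modified Bessel function I_0

# Angle grid over the circle V_{2,1}
grid = np.linspace(0.0, 2*np.pi, 2000, endpoint=False)
dtheta = grid[1] - grid[0]

def von_mises(theta, mu, kappa):
    """ML(2, 1, kappa*(cos mu, sin mu)') as a density in the angle."""
    return np.exp(kappa*np.cos(theta - mu)) / (2*np.pi*i0(kappa))

# Filtering density at time t-1 (illustrative choice) ...
filt_prev = von_mises(grid, 1.0, 8.0)
# ... pushed through the prediction step: integrate the transition
# kernel f(theta_t | theta_{t-1}) against it over the manifold.
pred = np.array([np.sum(von_mises(th, grid, 5.0)*filt_prev)*dtheta
                 for th in grid])

total = np.sum(pred)*dtheta
print(total)  # ~1: the predicted kernel is still a proper density
```

On the circle this brute-force integration is trivial; in higher dimensions no such closed or cheap quadrature is available, which is what motivates the Laplace approximation below.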
The initial condition for the filtering algorithm can be a Dirac delta function f (α 0 |F 0 ) that is infinite at α 0 = U 0 , where U 0 is the modal orientation, and zero otherwise, while the integral ∫ f (α 0 |F 0 )[dα 0 ] is exactly equal to one.
Another contribution of this paper is that we propose to approximate this integral by using the Laplace method; see Wong (1989, Chapters 2 and 9) for a detailed exposition.
The Laplace method can be applied here because the Taylor expansion on which it is based is valid in a neighbourhood of any point on the Stiefel manifold. It follows that, as p → ∞, the integral (5.5) can be approximated in closed form by a matrix Langevin kernel. For instance, given f (α 2 |F 1 ) ∝ etr{DU 1 ′α 2 }, it can be shown that f (α 2 |F 2 ) has the same form as (5.4) with H 2 = − 1/2 β ′x 2 x 2 ′β, C 2 = U 1 D + Ω −1 (y 2 − Bz 2 )x 2 ′β. Thus, by induction, we have the following proposition for the recursive filtering algorithm for state-space Model 1.
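The flavour of the approximation can be reproduced in the p = 2, r = 1 case, where the Stiefel integral of a Langevin kernel becomes ∫ exp(κ cos θ) dθ = 2π I 0 (κ), while the Laplace method (second-order expansion of the exponent at its maximum θ = 0, where the Hessian is −κ) gives exp(κ)√(2π/κ). A quick numerical comparison of ours, not the paper's derivation:

```python
import numpy as np
from scipy.special import i0  # modified Bessel function I_0

def exact(kappa):
    """Exact value of the circle integral: 2*pi*I_0(kappa)."""
    return 2*np.pi*i0(kappa)

def laplace(kappa):
    """Laplace approximation exp(kappa)*sqrt(2*pi/kappa)."""
    return np.exp(kappa)*np.sqrt(2*np.pi/kappa)

def rel_err(kappa):
    return abs(laplace(kappa) - exact(kappa)) / exact(kappa)

for k in (5.0, 50.0, 500.0):
    print(k, rel_err(k))  # error shrinks as the concentration grows
```

The pattern mirrors the remark below that the approximation improves with the dimension: the more concentrated (or higher-dimensional) the integrand, the better the second-order expansion at the mode captures the whole integral.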
Proposition 5.1. Given the state-space Model 1 in (4.6) with the quasi-likelihood function (4.12) based on Gaussian errors, the Laplace approximation based recursive filtering algorithm for α t consists of the predictive density (5.13) and the filtering density (5.14), with the filtered modal orientation U t given in (5.15).

Likewise, we have the recursive filtering algorithm for the state-space Model 2.
Proposition 5.2. Given the state-space Model 2 in (4.7) with the quasi-likelihood function (4.13) based on Gaussian errors, the Laplace approximation based recursive filtering algorithm for β t consists of the predictive density (5.16) and the corresponding filtering density.

Several remarks related to the propositions follow.
Remark 5.3. The distributions of predicted and updated α t and β t in the recursive filtering algorithms are conjugate.
The predictive distribution and the updating or filtering distribution are both known as the matrix Langevin-Bingham (or matrix Bingham-von Mises-Fisher) distribution; see for example Khatri and Mardia (1977). This feature is desirable as it gives great convenience in the computational implementation of the filtering algorithms.
Remark 5.4. When estimating the predicted distribution of α t and β t , a numerical optimization for finding U t−1 is required.
There are several efficient line-search based optimization algorithms available in the literature which can be easily implemented and applied. See Absil et al. (2008, Chapter 4) for a detailed exposition.
Remark 5.5. The predictive distributions in (5.13) and (5.16) are Laplace type approximations. Therefore, the dimensions of the data y t in Model 1 and the predictors in Model 2 are expected to be high enough in order to achieve good approximations.
For high-dimensional factor models that use a large number of predictors, the filtering algorithms are natural choices for modelling possible temporal instability, while a small value of the rank r implies dimension reduction in forecasting. In the next section, our finding from simulation is that, even for small p and q 1 , the approximations of the modal orientations can be very good.
Remark 5.6. The recursive filtering algorithms make it possible to use both maximum likelihood estimation and the Bayesian analysis for the proposed state-space models.
Next, we consider the models in (4.8) and (4.9). The corresponding filtering algorithms are similar to those of Propositions 5.1 and 5.2, except that the predictive density for Model 1 * is simply the (exact) matrix Langevin density M L(p, r, α * D) of α t , and analogously for Model 2 * . We have the following remarks for both models.
Remark 5.7. The predictive distributions do not depend on any previous information, which is due to the assumption of sequentially independent latent processes.
Remark 5.8. The predictive and filtering distributions for Model 1 * and Model 2 * are not approximations.
We do not need to approximate an integral like (5.5): since f (α t |F t−1 ) does not depend on α t−1 in Model 1 * and f (β t |F t−1 ) does not depend on β t−1 in Model 2 * , these densities can be moved directly outside the integral.
The smoothing distribution is defined to be the a posteriori distribution of the latent parameters given all the observations. We have the following two propositions for the smoothing distributions of the state-space models.
Proposition 5.9. The smoothing distribution of Model 1 is given by

f (α 1 , ..., α T |F T ) ∝ ∏ t=1,...,T f (y t |α t , F t−1 )f (α t |α t−1 ).

Proposition 5.10. The smoothing distribution of Model 2 is given by

f (β 1 , ..., β T |F T ) ∝ ∏ t=1,...,T f (y t |β t , F t−1 )f (β t |β t−1 ).

There is no closed form for the smoothing distributions, as the corresponding normalizing constants are unknown. Hoff (2009) develops a Gibbs sampling algorithm that can be used to sample from these smoothing distributions.

6 Evaluation of the filtering algorithms by simulation experiments
To investigate the performance of the filtering algorithm in Proposition 5.1, we consider several settings based on data generated from Model 1 in (4.6) for different values of its parameters.
Recall that at each iteration of the recursive algorithm, the predictive density kernel in (5.13) is a Laplace type approximation of the true predictive density, which takes an integral form as in (5.5), and hence the resulting filtering density is an approximation as well. It is of great interest to check the performance of the approximation under different settings. Since the exact filtering distributions of the latent process are not available, we resort to comparing the true (i.e. generated) value α t with the filtered modal orientation at time t from the filtering distribution f (α t |F t ), which is U t as defined in (5.15). The modal orientations are expected to be distributed around the true values across time if the algorithm performs well.
A measure of distance between two points on the Stiefel manifold is then needed for the comparison. We consider the squared Frobenius norm of the difference between two matrices or column vectors,

F 2 (X, Y ) = tr{(X − Y )′(X − Y )}. (6.1)

If the two matrices or column vectors X and Y are points on the Stiefel manifold, then it holds that F 2 (X, Y ) = 2r − 2tr{X ′Y } ∈ [0, 4r], and F 2 (X, Y ) attains the minimum 0 when X = Y (closest) and the maximum 4r when X = −Y (furthest). Thus, we employ the normalized distance δ t (X, Y ) = F 2 (X, Y )/(4r), which is free of the matrix dimension.
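The identity and the bounds of the normalized distance are easy to verify numerically (a sketch of ours):

```python
import numpy as np

def frob2(X, Y):
    """Squared Frobenius distance (6.1)."""
    return np.linalg.norm(X - Y, 'fro')**2

def delta(X, Y, r):
    """Normalized distance F2/(4r), lying in [0, 1]."""
    return frob2(X, Y) / (4.0*r)

rng = np.random.default_rng(0)
r = 2
Q1, _ = np.linalg.qr(rng.normal(size=(5, r)))
Q2, _ = np.linalg.qr(rng.normal(size=(5, r)))

# On the Stiefel manifold, F2(X, Y) = 2r - 2 tr{X'Y}.
assert np.isclose(frob2(Q1, Q2), 2*r - 2*np.trace(Q1.T @ Q2))
assert np.isclose(delta(Q1, Q1, r), 0.0)   # X = Y: closest
assert np.isclose(delta(Q1, -Q1, r), 1.0)  # X = -Y: furthest
```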
Note that the modal orientation of the filtering distribution is not supposed to be consistent for the true value of the latent process as the sample size T increases.
As a matter of fact, the sample size is irrelevant to consistency, as can be seen from the filtering density (5.14). We should note that the filtering distribution in (5.14) also has a concentration (or dispersion), determined by H t , J (the inverse of Ω) and C t (the current information, i.e. y t , x t and z t ) together with the parameters, while the previous information has limited influence, only through the orthonormal matrix U t−1 .
Since the concentration of the filtering distribution does not shrink with the increase of the sample size, we use T = 100 in all the experiments. If the filtering distribution has a high concentration, the filtered modal orientations are expected to be close to the true values, and hence the normalized distances close to zero and less dispersed.
The data generating process follows Model 1 in (4.6). Since we input the true parameters into the filtering algorithm, the difference y_t − Bz_t is perfectly known, so there is no need to consider the effect of Bz_t. Thus, it is natural to exclude Bz_t from the data generating process.
We consider settings with different combinations of
• T = 100, the sample size;
• p ∈ {2, 3, 10, 20}, the dimension of the dependent variable y_t;
• r ∈ {1, 2}, the rank of the matrix A_t;
• x_t, the explanatory variable vector, of dimension q_1 = 3, ensuring that q_1 > r always holds; each x_t is sampled independently (over time) from N_3(0, I_3).
The simulation-based experiment for each setting consists of the following three steps:
1. We sample from Model 1 using the identified version in (4.6): first simulate α_t given α_{t−1}, and then y_t given α_t. We save the sequence of the latent process α_t, t = 1, ..., T.
2. We then apply the filtering algorithm to the sampled data to obtain the filtered modal orientations U_t, t = 1, ..., T.
3. We compute the normalized distances δ_t(α_t, U_t) and report them by plotting them against time t.
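The three steps above can be sketched as follows. This is a toy stand-in, not the paper's algorithm: exact sampling from the conditional matrix Langevin step and the filter (5.13)-(5.15) are replaced by Gaussian perturbations followed by a QR retraction onto the manifold, and the "filtered" orientation U_t is mocked as a small perturbation of α_t, merely to show the shape of the resulting δ_t series.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, r = 100, 10, 1
# Hypothetical noise scales standing in for the Langevin concentration:
# a larger concentration d corresponds to a smaller step on the manifold.
step, filt_err = 0.1, 0.05

def retract(M):
    """Map a full-rank p x r matrix back onto the Stiefel manifold via QR
    (a retraction used as a stand-in for exact matrix Langevin sampling)."""
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))  # fix column signs for continuity

alpha = retract(rng.standard_normal((p, r)))  # alpha_0
deltas = []
for t in range(T):
    alpha = retract(alpha + step * rng.standard_normal((p, r)))   # latent step
    U = retract(alpha + filt_err * rng.standard_normal((p, r)))   # mock filter output
    deltas.append(np.sum((alpha - U) ** 2) / (4 * r))             # delta_t
print(np.mean(deltas))
```

In the actual experiments U_t comes from the filtering recursion, and the δ_t series is plotted against t; here the mock series merely illustrates a distance sequence that is stable at a low level.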
We use the same seed (equal to one) for the underlying random number generator throughout the experiments, so that all the results can be replicated. Figure 8 depicts the results. The normalized distances are stable at a low level for the case p = 3 with r = 1, but at a high level (around 0.5) for p = 3 with r = 2. A higher concentration (d = 800) reduces the latter level to about 0.12, as can be seen in the lower plot of Figure 8. We conclude that the approximation of the true filtering distribution tends to fail as the matrix α_t approaches a square matrix, that is, when p ≈ r; therefore the filtering algorithms proposed in this paper seem appropriate when p is sufficiently larger than r.
-insert Figure 8 about here -

All the previous experiments are based on the true initial value α_0, but in practice this is unknown. The filtering algorithm may be sensitive to the choice of the initial value.
In the following experiments, we examine the effect of a wrong initial value. The setting is p ∈ {2, 10, 20}, r = 1, ρ = 0.1 and d = 50, and we use as initial value −α_0, the point on the Stiefel manifold furthest from the true one. Figure 9 depicts the results. In all the experiments the normalized distances move towards zero, so the filtered values approach the true values in no more than 20 steps. Thereafter, the level and dispersion of the distance series are similar to those in Figure 4, where the true initial value is used. We conclude that the effect of a wrongly chosen initial value is temporary.
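A minimal numerical illustration of this experiment's logic (a toy, not the paper's filter): the antipodal start −α_0 sits at the maximal normalized distance 1, and a sequence of hypothetical correction steps pulling towards α_0, each followed by a QR retraction back onto the manifold, drives δ towards a small noise floor within a couple of dozen steps.

```python
import numpy as np

def retract(M):
    """QR retraction onto the Stiefel manifold, with column signs fixed."""
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(1)
p, r = 10, 1
alpha0 = retract(rng.standard_normal((p, r)))

def delta(X, Y):
    return np.sum((X - Y) ** 2) / (4 * r)

U = -alpha0                    # antipodal initial value
d_start = delta(alpha0, U)
print(d_start)                 # starts at the maximum, 1

# Toy correction steps (a hypothetical stand-in for the filter's updates):
# pull towards alpha0, add a little noise, retract back onto the manifold.
for t in range(20):
    U = retract(U + 0.3 * (alpha0 - U) + 0.05 * rng.standard_normal((p, r)))
d_end = delta(alpha0, U)
print(d_end)                   # close to zero after 20 steps
```

The small noise term matters here: the antipode is an unstable point of the pull-and-retract dynamics, so any perturbation off it lets the iterates escape and converge towards α_0, mirroring the temporary effect of the wrong initial value seen in Figure 9.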
-insert Figure 9 about here -

Concluding remarks
In this paper, we discuss the modelling of the time dependence of time-varying reduced rank parameters in multivariate time series models and develop novel state-space models whose latent states evolve on the Stiefel manifold. Almost all existing models in the literature deal only with the case where the evolution of the latent processes takes place on a Euclidean space, and we point out that this approach can be problematic.
These problems motivate the development of the novel state-space models. The matrix Langevin distribution is proposed to specify the sequential evolution of the corresponding latent processes on the Stiefel manifold. Nonlinear filtering algorithms for the new models are designed, in which the integral in the prediction step is approximated by the Laplace method. An advantage of the matrix Langevin distribution is that the a priori and a posteriori distributions of the latent variables are conjugate. The new models can be useful when the temporal instability of some parameters of multivariate models is suspected, for example in cointegration models with time-varying short-run adjustment or time-varying long-run relations, and in factor models with time-varying factor loadings.
Further research is needed in several directions. The most obvious one is the implementation of estimation methods, which can be maximum likelihood or Bayesian inference, and the investigation of their properties. This will enable us to apply the models to data.
In this paper, we only consider the case where the latent variables evolve on the Stiefel manifold in a 'random walk' way. It will be interesting to consider the case where the latent variables evolve on the Stiefel manifold but in a mean-reverting way.
A Proof of Proposition 3.1

In the model (2.1) with the decomposition (2.2), both the rows of β and the order of the variables x_t are permuted by P_β as follows: A_t x_t = α_t β' x_t = α_t β' P_β' P_β x_t. The time-invariant component β can be linearly normalized if the r × r upper block b_1 in (3.2) is invertible. The corresponding linear normalization defined in (3.3) then follows, where α̃_t = α_t b_1' is the new time-varying component following the evolution (3.4).
Consider another permutation P*_β ≠ P_β. Similarly, the upper r × r block b*_1 of P*_β β is also invertible, and we can derive the corresponding evolution of the new time-varying component α̃*_t. Assume that the error vector η^α_t in (3.4) has zero mean and a diagonal variance-covariance matrix. From (A.1)-(A.3) it follows that

α̃*_t = α̃_t β̃' P_β P*_β' κ, (A.7)

where the q_1 × r matrix κ satisfies β̃*' κ = I_r. The existence of κ is guaranteed by the fact that β̃ has full rank and so does β̃*.

B Proof of Proposition 3.2
In the model (2.1) with the decomposition (2.3), the rows of α are permuted by P_α as follows:

A_t x_t = α β_t' x_t = P_α' P_α α β_t' x_t. (B.1)

Notice that we can remove P_α' in the equation, which means that we choose not to permute back to the original order of the dependent variables y_t. Writing P_α α = (a_1', a_2')', the linear normalization (3.6) is obtained by

P_α' P_α α β_t' x_t = P_α' (a_1', a_2')' β_t' x_t = P_α' (a_1', a_2')' a_1^{-1} a_1 β_t' x_t = P_α' α̃ β̃_t' x_t, (B.2)

where β̃_t = β_t a_1' is the new time-varying component following the evolution (3.7).

C Figures