Geometric Learning of Hidden Markov Models via a Method of Moments Algorithm

We present a novel algorithm for learning the parameters of hidden Markov models (HMMs) in a geometric setting where the observations take values in Riemannian manifolds. In particular, we lift a recent second-order method of moments algorithm that incorporates non-consecutive correlations to the more general setting where observations take values in a Riemannian symmetric space of non-positive curvature and the observation likelihoods are Riemannian Gaussians. The resulting algorithm decouples into a Riemannian Gaussian mixture model estimation algorithm followed by a sequence of convex optimization procedures. We demonstrate through examples that the learner can achieve significantly improved speed and numerical accuracy compared to existing learners.


Introduction
Hidden Markov models (HMMs) describe states with Markovian dynamics that are hidden in the sense that they are only accessible via observations by a noisy sensor. Specifically, at every time-step k, an observation y_k is sampled from an observation space Y according to the HMM's observation likelihoods, which specify the probability of making a particular observation, conditioned on the system being in a certain state. Despite their structural simplicity, HMMs have become a standard tool in the modeling of stochastic time-series [1] in recent decades and have found applications in a wide range of fields including computational biology [2,3], signal and image analysis [4], speech recognition [5,6], and financial modeling [7].
In order to apply an HMM, it is often necessary to estimate its parameters from data. The standard approach to estimating the parameters of an HMM is using a maximum likelihood (ML) criterion. Numerical algorithms for computing the ML estimate are dominated by iterative local-search procedures that aim to maximize the likelihood of observed data, such as the expectation-maximization (EM) algorithm [1,4]. Unfortunately, these schemes are only guaranteed to converge to local stationary points of the typically non-convex likelihood function and as a result often become trapped in local optima. Thus, to have a chance of converging to a global optimum, a good initialization is usually required. Another drawback of such methods is the significant computational cost associated with long runtimes due to costly iterations for large datasets.
In order to overcome such challenges, methods of moments have been introduced for HMMs [8][9][10][11][12][13][14]. Originally, these methods relied on empirical estimation of correlations between consecutive pair- or triplet-wise observations to compute estimates of the HMM parameters. Although computationally attractive, such methods suffered from a loss of accuracy due to a focus on low-order correlations in the data. In response, Mattila et al. [15,16] extended these methods to include non-consecutive correlations in the data, resulting in improved accuracy while retaining their attractive computational properties.

Hidden Markov models with manifold-valued observations
The development and analysis of statistical procedures and optimization algorithms on manifolds and nonlinear spaces more broadly have been the subject of intense and growing research interest in recent decades due to the ubiquity of manifold-valued data in a wide range of applications [17][18][19][20][21][22][23]. Since the application of Euclidean algorithms to such data often has a significantly negative impact on the accuracy and interpretability of the results, it is necessary to devise algorithms that respect the intrinsic geometry of the data. In this work, we turn our attention to HMMs with observations in a Riemannian manifold [24,25]. In particular, we restrict our attention to the class of models with observations in Riemannian symmetric spaces of non-positive curvature, which include hyperbolic spaces, as well as spaces of real, complex, and quaternionic positive definite matrices. We have three motivations for this restriction: (1) standard operations on such spaces have relatively favorable computational properties due to symmetries, (2) there exists a theory of Riemannian Gaussian distributions on such spaces together with associated algorithms such as Riemannian Gaussian mixture estimation [26,27], and (3) they are applicable to a substantial class of problems involving manifold-valued data, including applications with data in the form of covariance matrices [27].

Contributions and paper outline
Our main contribution in this paper is to extend the second-order method of moments algorithm with non-consecutive correlations developed by Mattila et al. [15,16] to the setting of HMMs with observations in a Riemannian symmetric space of non-positive curvature, where the observation likelihoods take the form of Riemannian Gaussians [27,28]. The paper is organized as follows. In Section 2, we describe HMMs with manifold-valued observations and review the necessary geometric background. In Section 3, we review the method of moments algorithms for HMMs and describe how they manifest in the geometric setting. In Section 4, we present a number of simulations based on these algorithms and conclude with a discussion in Section 5.

Notation
We denote the i-th entry of a vector by [·]_i, and the element at row i and column j of a matrix by [·]_ij. Vectors are assumed to be column vectors unless transposed. The vector of all ones is denoted 1. Inequalities between vectors and matrices are interpreted elementwise. The operator diag acts on a vector and returns the diagonal matrix with that vector on its diagonal and all other elements zero. The matrix Frobenius norm is denoted ‖·‖_F. The probability of an event A is denoted P(A).

Hidden Markov models on manifolds
We consider a discrete-time hidden Markov model with a finite-state Markov chain on the state space X = {1, . . . , N} and a time-homogeneous N × N transition probability matrix P with elements

[P]_ij = P(x_{k+1} = j | x_k = i), i, j = 1, . . . , N,

where x_k denotes the state at time k. The initial and stationary distributions of the HMM exist under appropriate assumptions and are denoted by π_0 ∈ R^N and π_∞ ∈ R^N, respectively. The HMM is said to be stationary if π_0 = π_∞.
We assume that the states are hidden and can only be accessed through observations in a Riemannian symmetric space Y of non-positive curvature, so that the Riemannian Gaussian distribution with probability density function

p(y | ȳ, σ) = (1/Z(σ)) exp(−d(y, ȳ)² / (2σ²)) (2)

with respect to the Riemannian volume measure dv(y) on Y is well-defined for any ȳ ∈ Y and σ > 0, as outlined in [27]. Here d(·, ·) denotes the Riemannian distance function on Y and Z(σ) denotes the normalization factor of the Riemannian Gaussian, whose efficient computation has been the subject of interest in recent years [28][29][30][31]. We assume that the observations are sampled from Y according to the conditional probability densities

B = { p(y | x_k = j) = p(y | ȳ_j, σ_j) }

for j = 1, . . . , N, where p(·| ȳ_j, σ_j) is a Riemannian Gaussian density function of the form (2) with mean ȳ_j ∈ Y and dispersion σ_j > 0.
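For concreteness, on the Poincaré disk (used in the simulations below) the Riemannian distance has a simple closed form, and the density (2) can be sketched up to its normalization factor. This is a minimal illustration with function names of our choosing; the computation of Z(σ) is treated separately in the cited references.

```python
import numpy as np

def poincare_distance(z, w):
    """Hyperbolic distance between points of the open unit disk,
    represented as complex numbers with |z|, |w| < 1."""
    num = 2 * abs(z - w) ** 2
    den = (1 - abs(z) ** 2) * (1 - abs(w) ** 2)
    return np.arccosh(1 + num / den)

def gaussian_density_unnormalized(y, y_bar, sigma):
    """Riemannian Gaussian density of the form (2) up to the factor
    1/Z(sigma), which depends only on sigma."""
    return np.exp(-poincare_distance(y, y_bar) ** 2 / (2 * sigma ** 2))

# The distance from the origin reduces to log((1 + r) / (1 - r)):
print(poincare_distance(0.0, 0.5))  # log(3) ≈ 1.0986
```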
To use an HMM for applications such as filtering or prediction, its model parameters must be specified or estimated in advance. This task can be formulated as the following learning problem for HMMs: Problem 1. Given a sequence y_1, . . . , y_D of observations in Y generated by an HMM of known state space X = {1, . . . , N}, estimate the conditional probability densities B and the matrix of transition probabilities P.
The learning problem is well-posed under the standard assumptions that the HMM is ergodic (irreducible and aperiodic) and identifiable [4,10,15,16]. A special case of the learning problem that is worth noting is that of the known-sensor HMM, in which the observation likelihoods B are assumed to be known. Known-sensor HMMs are motivated by applications in which the sensor is designed by the user, such as a target tracking system whose sensor specifications can be determined prior to deployment.
Various methods since the inception of HMMs have focused on maximizing the likelihood jointly in B and P; however, recent efforts have demonstrated the potential of methods that decouple the problem [12,13] and estimate B and P sequentially. Specifically, in parametric-output HMMs (e.g., Gaussian HMMs), the observation likelihoods are estimated via a general mixture model learner as a first step, followed by identification of the transition matrix P as a second step [12]. In the first step, assuming that the underlying Markov chain behaves well (e.g., is recurrent) and mixes rapidly, in stationarity each observation y_k from the HMM can be interpreted as having been sampled from the mixture density

p(y) = Σ_{j=1}^{N} [π_∞]_j p(y | ȳ_j, σ_j). (4)

Since we are assuming that the observation likelihoods belong to the family of isotropic Riemannian Gaussians on Y, the density (4) can be estimated using one of several algorithms for the estimation of mixtures of Riemannian Gaussian distributions, including expectation-maximization (EM) [26,27], stochastic EM [32], and online variants [33]. The second step is then equivalent to the identification of a known-sensor HMM.

Method of moments for HMMs
We begin with a brief review of the method of moments algorithm for HMMs developed by Mattila et al. in [15]. The significance of this work is that it extends previous method of moments algorithms for HMMs that were based on correlations between consecutive pair- or triplet-wise observations to include non-consecutive correlations in the data. In doing so, the authors improve the accuracy of the approach by reducing the volume of neglected information inherent in the data while maintaining the computationally attractive properties of previous method of moments algorithms.
Before presenting the algorithm in the setting of HMMs with manifold-valued observations, we briefly summarize the key steps of the second-order algorithm of Mattila et al. [15] in the simplest setting, where the observations take values in a finite observation alphabet {1, . . . , Y} with a known N × Y observation matrix B with elements

[B]_ij = P(y_k = j | x_k = i). (5)

Methods of moments for HMMs (e.g., [8][9][10][11][12][13][14]) involve the empirical estimation of low-order correlations in the data, such as pairs P[y_k, y_{k+1}] or triplets P[y_k, y_{k+1}, y_{k+2}], followed by computation of the HMM parameter estimates by minimizing the discrepancy between the empirical estimates and their analytical expressions via a series of convex optimization problems. Mattila et al. [15] extend such methods to include non-consecutive correlations of the form P[y_k, y_{k+τ}] with τ = 1, 2, . . . , τ̄, where the number τ̄ is a user-defined lag parameter.
The lag-τ second-order moments M_2(k, τ) ∈ R^{Y×Y} of the HMM are defined as the matrices with elements

[M_2(k, τ)]_ij = P[y_k = i, y_{k+τ} = j],

where i, j = 1, . . . , Y and τ ≥ 0. The case τ = 0 reduces to the first-order moments [M_1(k)]_i = P[y_k = i], where M_1(k) ∈ R^Y, which for notational convenience is expressed as a special case of second-order moments by writing M_2(k, 0) = diag(M_1(k)). For a stationary HMM (i.e., π_0 = π_∞), the lag-τ second-order moments are related to the HMM parameters according to

M_2(τ) = B^T diag(π_∞) P^τ B

for any τ > 0 [15].
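As a numerical sanity check of this relation, the following sketch simulates a small stationary HMM with a finite observation alphabet (toy parameters of our choosing) and compares the empirical lag-τ joint probabilities with B^T diag(π_∞) P^τ B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stationary HMM: N = 2 hidden states, Y = 3 observation symbols.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # transition probabilities (rows sum to 1)
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])      # observation probabilities (rows sum to 1)
pi_inf = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution: pi P = pi

# Simulate D observations with the chain started in stationarity.
D, tau = 100_000, 2
x = rng.choice(2, p=pi_inf)
y = np.empty(D, dtype=int)
for k in range(D):
    y[k] = rng.choice(3, p=B[x])
    x = rng.choice(2, p=P[x])

# Empirical joint probabilities [M2_hat]_ij ≈ P[y_k = i, y_{k+tau} = j].
M2_hat = np.zeros((3, 3))
for k in range(D - tau):
    M2_hat[y[k], y[k + tau]] += 1
M2_hat /= D - tau

# Analytical moments: M2(tau) = B^T diag(pi_inf) P^tau B.
M2 = B.T @ np.diag(pi_inf) @ np.linalg.matrix_power(P, tau) @ B
print(np.abs(M2_hat - M2).max())  # Monte Carlo error, shrinks as D grows
```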
The lag-τ second-order moments can be empirically estimated from data as M̂_2(τ) according to

[M̂_2(τ)]_ij = 1/(D − τ) Σ_{k=1}^{D−τ} I{y_k = i} I{y_{k+τ} = j} (8)

for τ = 0, 1, . . . , τ̄, where D is the number of observations and I denotes the indicator function. The next step in the method is moment matching: the discrepancy between the empirical estimate M̂_2(τ) and its analytical expression is minimized by solving the following convex (quadratic) optimization problems:

Â(τ) = argmin_A ‖M̂_2(τ) − B^T A B‖_F², subject to A ≥ 0, 1^T A 1 = 1,

for τ = 1, . . . , τ̄, and we set Â(0) = diag(π_∞).
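A minimal sketch of the moment-matching step: with the constraints dropped, the unconstrained minimizer of ‖M̂_2(τ) − B^T A B‖_F² has a closed form via pseudoinverses (in practice the constrained problem would be handed to a QP solver; the toy parameters and function names here are ours):

```python
import numpy as np

def match_moments(M2_hat, B):
    """Unconstrained least-squares solution of
        min_A || M2_hat - B^T A B ||_F^2
    via pseudoinverses. The constraints A >= 0 and 1^T A 1 = 1 of the
    full quadratic program are omitted in this sketch; in practice they
    are enforced by a QP solver or by projection."""
    return np.linalg.pinv(B.T) @ M2_hat @ np.linalg.pinv(B)

# Sanity check on exact moments: recover A(1) = diag(pi_inf) P.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pi_inf = np.array([2.0 / 3.0, 1.0 / 3.0])
A_true = np.diag(pi_inf) @ P
A_hat = match_moments(B.T @ A_true @ B, B)
print(np.abs(A_hat - A_true).max())  # ~0 (exact recovery up to round-off)
```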
The output of the above moment matching procedure is a sequence Â(0), . . . , Â(τ̄). In the final step, we use this sequence to estimate the transition matrix P by solving the least-squares problem

P̂ = argmin_{P ≥ 0, P1 = 1} Σ_{τ=1}^{τ̄} ‖Â(τ − 1)P − Â(τ)‖_F², (11)

which incorporates information from every lag by construction.
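This final step can be sketched as a stacked linear least-squares solve, using the relation A(τ) = A(τ − 1)P that holds for a stationary HMM [15]; here the simplex constraints are restored by a naive clip-and-renormalize rather than a constrained solver, and the toy parameters are ours:

```python
import numpy as np

def estimate_P(A_seq):
    """Least-squares estimate of the transition matrix from the sequence
    A(0), ..., A(taubar), stacking the relations A(tau) = A(tau - 1) P.
    Row-stochasticity (P >= 0, P 1 = 1) is restored by a naive
    clip-and-renormalize; a constrained solver is the principled choice."""
    lhs = np.vstack(A_seq[:-1])
    rhs = np.vstack(A_seq[1:])
    P_hat, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    P_hat = np.clip(P_hat, 0.0, None)
    return P_hat / P_hat.sum(axis=1, keepdims=True)

# Sanity check: exact A(tau) = diag(pi_inf) P^tau returns P itself.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi_inf = np.array([2.0 / 3.0, 1.0 / 3.0])
A_seq = [np.diag(pi_inf) @ np.linalg.matrix_power(P, t) for t in range(4)]
print(np.abs(estimate_P(A_seq) - P).max())  # ~0
```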
The dominant contribution to the computational cost of the above algorithm is independent of the data size D and scales linearly with the number of included lags τ̄. In contrast, each iteration of the EM algorithm has a complexity of O(N²D). In addition to these favorable computational properties, it is shown in [15,16] that the above algorithm is strongly consistent under reasonable assumptions: as the number of samples grows, the estimate of the transition matrix P converges to its true value.

Geometric learning of HMMs using method of moments
We now return to the problem of estimating the parameters of an HMM with observations in a Riemannian manifold Y via an extension of the second-order method of moments presented earlier. We assume the conditional probability densities to be given by Riemannian Gaussians of the form (2). The first stage of the process is to estimate the means and dispersions of the observation densities from data by employing a Riemannian Gaussian mixture learner [27,32,33]. In the case of a known-sensor HMM, this is unnecessary, as the observation densities are known a priori. In the next stage, we use a kernel trick outlined in [12,16] to extend the pairwise correlations M_2(τ) between discrete-valued observations to an analogous quantity H(τ) ∈ R^{N×N} applicable in the setting of continuous observation spaces. H(τ) is related to the parameters of the HMM according to

H(τ) = K^T diag(π_∞) P^τ K (12)

for τ = 1, . . . , τ̄, where π_∞ is the HMM stationary distribution, which can be estimated from (4), and K ∈ R^{N×N} is defined as

[K]_ij = ∫_Y p(y | ȳ_i, σ_i) p(y | ȳ_j, σ_j) dv(y). (13)

The N × N matrix K in (13) is called the effective observation matrix in [12,16] and replaces the N × Y observation matrix (5). We can compute K using Monte Carlo techniques based on sampling from Riemannian Gaussians [27].
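A sketch of the Monte Carlo computation of K. Scalar Euclidean Gaussians are used here purely as a stand-in so that the estimate can be checked against a closed form (the product of two Euclidean Gaussian densities integrates to another Gaussian density); on a manifold the samples would instead be drawn from Riemannian Gaussians. The function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_pdf(y, mean, sigma):
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def effective_observation_matrix(means, sigmas, n_samples=200_000):
    """Monte Carlo estimate of [K]_ij = E_{y ~ p(.|ybar_i, sigma_i)}[p(y | ybar_j, sigma_j)].
    Scalar Euclidean Gaussians stand in for Riemannian Gaussians, for which
    the samples would instead be drawn on the manifold."""
    N = len(means)
    K = np.empty((N, N))
    for i in range(N):
        samples = rng.normal(means[i], sigmas[i], size=n_samples)
        for j in range(N):
            K[i, j] = gaussian_pdf(samples, means[j], sigmas[j]).mean()
    return K

means, sigmas = np.array([0.0, 2.0]), np.array([1.0, 0.5])
K_mc = effective_observation_matrix(means, sigmas)
# Euclidean closed form for comparison: [K]_ij = N(m_i; m_j, sigma_i^2 + sigma_j^2).
K_exact = np.array([[gaussian_pdf(means[i], means[j], np.hypot(sigmas[i], sigmas[j]))
                     for j in range(2)] for i in range(2)])
print(np.abs(K_mc - K_exact).max())  # Monte Carlo error
```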
The elements of the left-hand side of (12) can be interpreted as expectations with respect to the joint probability distribution of y_k and y_{k+τ}, and can therefore be empirically estimated from HMM observations as

[Ĥ(τ)]_ij = 1/(D − τ) Σ_{k=1}^{D−τ} p(y_k | ȳ_i, σ_i) p(y_{k+τ} | ȳ_j, σ_j),

in analogy with the empirical estimate (8) employed in the case of HMMs with a discrete observation space.
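The estimator can be sketched as follows, again with scalar Euclidean Gaussians standing in for the Riemannian densities (names ours); the degenerate i.i.d. check at the end exploits the fact that H(1) then factorizes into an outer product:

```python
import numpy as np

def gaussian_pdf(y, mean, sigma):
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def estimate_H(y, means, sigmas, tau):
    """Empirical estimate of [H(tau)]_ij as the sample mean of
    p(y_k | ybar_i, sigma_i) p(y_{k+tau} | ybar_j, sigma_j) over k,
    with scalar Euclidean Gaussians standing in for Riemannian ones."""
    # F[i, k] = p(y_k | ybar_i, sigma_i), evaluated for all k at once.
    F = np.stack([gaussian_pdf(y, m, s) for m, s in zip(means, sigmas)])
    D = y.shape[0]
    return F[:, :D - tau] @ F[:, tau:].T / (D - tau)

# Degenerate check: for i.i.d. y ~ N(0, 1) (a one-state "chain"), H(1)
# factorizes into the outer product of the overlaps int N(y; m_i, s_i^2) N(y; 0, 1) dy.
rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=50_000)
H1 = estimate_H(y, np.array([0.0, 1.0]), np.array([1.0, 1.0]), tau=1)
```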
Following the estimation of Ĥ(τ) and the computation of K, the moment matching procedure now takes the form of minimizing the discrepancy between the empirical estimate Ĥ(τ) and the corresponding analytical expressions in (12). Specifically, in the case of the known-sensor HMM, we solve the following sequence of convex (quadratic) optimization problems:

Â(τ) = argmin_A ‖Ĥ(τ) − K^T A K‖_F², subject to A ≥ 0, 1^T A 1 = 1,

for τ = 1, . . . , τ̄, and set Â(0) = diag(π_∞).
The output is once again a sequence Â(0), . . . , Â(τ̄), which is used to compute an estimate of the transition matrix P by solving (11).
To summarize, the algorithm follows a two-stage procedure to learn, from data, the parameters of an HMM with observations in a Riemannian manifold admitting well-defined Gaussian densities of the form (2). In stage 1, Riemannian Gaussian mixture estimation is employed to compute estimates of the conditional likelihoods B, which are then used in stage 2 to compute an estimate of the transition probabilities P by solving a series of convex optimization problems.

Simulations
We now present the results of several numerical experiments on learning HMMs with manifold-valued observations. In the first example, observations take place in the Poincaré disk model of hyperbolic 2-space. Poincaré models of hyperbolic spaces have been a subject of increasing interest in machine learning in recent years due to their ability to efficiently represent hierarchical data [34]. In the second example, we consider a model with observations in the manifold of 2 × 2 symmetric positive definite (SPD) matrices equipped with the standard affine-invariant Rao-Fisher metric [26].
We employed the second-order method of moments algorithm of Section 3.2 to learn the parameters of this HMM from observations alone. The model was fitted on 20 HMM chains, each with 10,000 observations. In our implementation, we used the mixture estimation algorithm of [26] to estimate the density (4). The full results are reported in Table 1, where the true and estimated Gaussian means are denoted by ȳ_i and ŷ_i, respectively. On repeating the experiment with varying τ̄ and the same random seed (and hence the same estimates of the means and dispersions by construction), we observed that incorporating non-consecutive data (i.e., τ̄ > 1) up to τ̄ = 3 significantly improved our estimate of P and produced a more accurate estimate than alternative algorithms [24,25]. Comparing the empirical performance of our algorithm to the numerical results reported in [24], we observed that our algorithm performed competitively while requiring only a fraction of the runtime for the same number of observations. In comparison to the online learning algorithm of [25], which we applied to the same learning problem, we observed improved performance for τ̄ > 1, with the method of moments algorithm with τ̄ = 3 producing the most accurate estimate of P out of all considered methods. Interestingly, the runtime of our algorithm was not noticeably affected by the choice of τ̄ in this example, since the mixture estimation and the computation of K (13) accounted for the dominant contribution to the computational cost.

We now consider an HMM with N = 5 hidden states that are accessible through noisy observations in the manifold of 2 × 2 SPD matrices, generated from a Riemannian Gaussian model with means ȳ_i and standard deviations σ_i given in Table 2.
Here the Riemannian distance function d(·, ·) is given by

d(y, z) = ‖log(y^{−1/2} z y^{−1/2})‖_F,

and the normalization factor Z(σ) of the Riemannian Gaussian admits a closed-form analytical expression in the 2 × 2 case [26]. While the expression for the Riemannian distance function holds for SPD matrices of any dimension, the closed-form expression for Z(σ) does not. Nonetheless, Z(σ) can be directly computed or approximated for higher-dimensional SPD matrices [26][27][28][29][30][31].
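This distance can be computed with a single symmetric eigendecomposition; a minimal NumPy sketch (function names ours):

```python
import numpy as np

def sym_apply(mat, fun):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    return (v * fun(w)) @ v.T

def spd_distance(y, z):
    """Affine-invariant distance d(y, z) = ||log(y^{-1/2} z y^{-1/2})||_F
    between symmetric positive definite matrices."""
    y_inv_sqrt = sym_apply(y, lambda w: w ** -0.5)
    inner = y_inv_sqrt @ z @ y_inv_sqrt   # again SPD
    return np.linalg.norm(sym_apply(inner, np.log), 'fro')

# d(I, diag(e, 1)) = ||diag(1, 0)||_F = 1; the distance is also invariant
# under congruence y -> a y a^T for any invertible a.
print(spd_distance(np.eye(2), np.diag([np.e, 1.0])))  # ≈ 1.0
```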
The data were generated using a fixed transition matrix P for the underlying Markov chain. We employed our proposed geometric second-order method of moments algorithm with τ̄ = 1 to sequentially estimate the underlying Gaussian model and the probability transition matrix from 10,000 observations. The results of the Gaussian mixture estimation procedure are reported in Table 2 and demonstrate a high level of accuracy. The estimated Riemannian Gaussian model, with means ŷ_i and standard deviations σ̂_i, as well as the observations used to learn the model, are visualized in Figure 1.
The estimated transition matrix P̂ yields a relative approximation error of

‖P − P̂‖_F / ‖P‖_F = 0.050 (23)

with respect to the Frobenius norm.

Conclusion
In this paper, we have shown that the recent method of moments algorithms for HMMs can be generalized to geometric settings in which observations take place in Riemannian manifolds. We observe through simple numerical simulations that the documented advantages of method of moments algorithms, including their competitive accuracy and attractive computational and statistical properties, may continue to hold in the geometric setting. Nonetheless, we expect unique computational challenges to arise in applications involving high-dimensional Riemannian manifolds. Specifically, using Markov chain Monte Carlo (MCMC) algorithms to compute the effective observation matrix K defined in (13) may become prohibitively expensive in high dimensions, which is not the case in the Euclidean setting, as K admits a closed-form analytic expression for multivariate Gaussian HMMs. Thus, a key technical challenge for the effective application of the proposed algorithm in problems involving high-dimensional manifolds is to devise algorithms for the efficient and scalable computation of K. Further developments of the approach may include extensions to models that incorporate third- or higher-order moments or more elaborate dynamics and control inputs.