Bayesian Inference of Recurrent Switching Linear Dynamical Systems with Higher-Order Dependence

Many complicated dynamical phenomena can be broken down into simpler pieces and efficiently described by a system that switches among a set of conditionally linear dynamical modes. Building on switching linear dynamical systems (SLDS), we develop a new model that extends SLDS to better discover these dynamical modes. In the proposed model, the linear dynamics of the latent variables are described by a higher-order vector autoregressive process, which makes it feasible to evaluate higher-order dependency relationships in the dynamics. In addition, the transition of switching states is determined by a stick-breaking logistic regression, overcoming the limitation of a restricted geometric state duration and recovering the symmetric dependency between the switching states and the latent variables from asymmetric relationships. Furthermore, logistic regression evidence potentials can appear as conditionally Gaussian potentials by utilizing the Pólya-gamma augmentation strategy. Filtering and smoothing algorithms and Bayesian inference for parameter learning in the proposed model are presented. The utility and versatility of the proposed model are demonstrated on synthetic data and public functional magnetic resonance imaging data. Our model improves upon current methods for learning switching linear dynamical modes, which will facilitate the identification and assessment of the dynamics of complex systems.


Introduction
Complex systems frequently exhibit multiple levels of abstraction in their descriptions [1]. For example, a computer program can be characterized by the collection of functions it calls, the sequence in which it carries out statements, or the assembly instructions it sends to the CPU. This assertion is true for a multitude of natural systems. Brain activity can be classified based on either broad psychological states or the activation of individual ion channels. The necessary amount of specificity may differ based on the particular task being performed [2]. Through the identification of these behavioral units and their interdependence, we can obtain a deeper understanding of the intricate mechanisms that give rise to complex natural occurrences. Furthermore, modern machine learning offers powerful tools to help model the dynamics of complex systems. The toolbox has recently been improved to incorporate more versatile elements, such as Gaussian processes [3] and neural networks [4], into probabilistic time series models.
Time series analysis encompasses several methodologies and models, such as stationary process models, spectral models, state space models, and non-linear models [5]. Among them, state space methods are regarded as more versatile and adept at addressing a broader range of problems than the other models. Hidden Markov models (HMM) and switching linear dynamical systems (SLDS) are two well-known state space models. In numerous real-world time series situations, the present condition of a dynamic system is connected to its condition in prior time intervals. Both HMM and SLDS can address such problems and have been widely used in several problem fields. Examples include human motion [6,7], computer vision [8], speech recognition [9], econometrics [10], machine learning [11] and neuroscience [12,13].
An HMM is a discrete Markov process with doubly embedded stochastic models, namely, (i) an unobservable hidden process characterized by a Markov chain and (ii) an observation process determined by the hidden states. HMM-based models have garnered significant interest and demonstrated their utility in the signal processing community for accurately simulating the intricate temporal progression of signals. More precisely, HMM have a lengthy track record in signal processing, particularly in the field of speech processing, where they have achieved notable success [14]. The SLDS allows for the modeling of nonlinear time series data by dividing it into subsequences controlled by linear dynamics. Indeed, the generative model of SLDS is slightly different from that of HMM, in that there is an additional set of latent variables between the switching states and the observations. In addition, an SLDS differs from an HMM by choosing from a collection of linear Gaussian dynamics that evolve continuously, instead of a standard Gaussian mixture density as in HMM. The SLDS can be viewed as an expansion of the HMM, where each HMM state, or mode, is linked to a linear dynamical process. One further benefit of SLDS is its ability to handle time series data with high dimensionality. From a generative model perspective, high-dimensional time series can often be cheaply represented by a dynamical process specified on a low-dimensional manifold, and this representation benefits from the model structure of SLDS. Nevertheless, employing HMM directly on time series with high dimensionality is likely to result in overfitting [15]. Therefore, SLDS offers the potential for increased descriptive capability [1].
Although SLDS offers advanced predictive models, the dynamics from the data are only stated under certain conditions, which poses some shortcomings of the generative model and may limit its application. First, the continuously evolving linear Gaussian dynamics of the latent variables are essentially equivalent to a first-order switching vector autoregressive (VAR) process, i.e., a VAR(1) process, which may not be suitable when there are higher-order dependency relationships in the time series. Higher-order autoregressive interactions are common in time series data [16,17]. Furthermore, determining an appropriate autoregressive order must trade off model likelihood against complexity, constituting a non-trivial task in applications [18]. Indeed, the restricted VAR(1) process involved in the linear dynamics of the latent variables limits the ability of SLDS to recover higher-order dependency relationships in dynamical phenomena.
Second, the switching state transition in SLDS follows the Markov assumption, which means that the duration time (i.e., the time interval spent in a certain state) is restricted to a geometric distribution [19]. Thus, more weight is given to shorter consecutive time periods within a certain state, which means states switch frequently. This may not be appropriate when the system spends a long time in one state [20]. In addition, the state transitions are based only on the previous state and are unrelated to the latent variables. If a discrete switch happens when the system enters a particular area of the state space, the SLDS cannot recognize this relationship. While the hidden semi-Markov model (HSMM), based on the semi-Markov assumption, is an extension of the HMM that alleviates this issue, it should be noted that state transitions in HSMM still have no association with the previous observations [20].
To address the mentioned limitations, an extension of SLDS, the recurrent switching linear dynamical systems with higher-order dependence (HO-rSLDS), is presented. The HO-rSLDS provides a new method that improves the modeling of dynamics in complex systems, by enabling a higher-order dependence in the linear Gaussian dynamics of latent variables and allowing the switching state transition probabilities to hinge on the preceding values of the latent variables. Specifically, to address the first limitation, we enhance the generative process of the latent variables by making the current latent variable depend directly on the state values at the previous several time steps, rather than only the previous one time step. Therefore, the proposed model is capable of capturing higher-order autoregressive relationships in the data. To address the second limitation, we utilize stick-breaking logistic regression to determine the switching state transition [21]. In addition, we leverage a Pólya-gamma augmentation approach [22] to improve block inference updating. This enhancement transforms specific logistic regression evidence potentials into conditionally Gaussian potentials, facilitating efficient Bayesian inference procedures. In this way, the transition of switching states is directly associated with the previous switching state and latent variable. Moreover, since the switching states and the latent variables are mutually interdependent in HO-rSLDS, the symmetric dependency among them is recovered, enhancing the generative process of both switching states and latent variables.
Similar to SLDS, exact inference in an HO-rSLDS is intractable, which impedes efficient model estimation and parameter learning [8]. In this study, the HO-rSLDS is learned through variational inference, an approximate scheme that maximizes the evidence lower bound, i.e., minimizes the Kullback-Leibler divergence between a restricted family of functions of the model parameters and the actual posterior distribution [23]. Moreover, we propose message-passing algorithms over the switching states and latent variables in HO-rSLDS to evaluate the sufficient statistics for further parameter updates.
The primary contributions of the novel approach suggested in this study are outlined below:
1. HO-rSLDS assumes the linear dynamics of the latent variables can be described by a higher-order VAR process, which makes it feasible to uncover and evaluate long-term dependency relationships in dynamical phenomena;
2. Stick-breaking logistic regression is applied to determine the switching state transition in HO-rSLDS. By this means, the transition probabilities are time-varying and can be adjusted according to the previous latent variable, which overcomes the limitation of the restricted geometric state duration imposed by the Markov assumption and recovers the symmetric dependency between the switching states and the latent variables;
3. The Pólya-gamma augmentation strategy permits efficient Bayesian inference algorithms. In addition, we propose message-passing algorithms, including a forward Kalman filter and a backward Kalman smoother for HO-rSLDS, which facilitate the parameter updates in variational inference.
The rest of this study is organized as follows. In Section 2, we discuss the related models. In Section 3, we present the model learning algorithms in detail. In Section 4, we demonstrate the performance of the proposed HO-rSLDS on simulated data. Our method is applied to public functional magnetic resonance imaging (fMRI) data as well, and the results are described in Section 5. Finally, we wrap up our conclusions in Section 6.

HMM and HSMM
HMMs are mathematical models that analyze dynamic systems by using random processes with hidden states. An HMM is constructed using a Markov chain and is composed of two layers: a layer representing hidden states and a layer representing observations. The observed sequences are allocated to their respective hidden states according to the observation probability distribution. In the context of HMM, the observations are solely influenced by the current state and are not influenced by any preceding states [20]. The entire state sequence can be presented as z_{1:T} = {z_t}_{t=1}^{T}, which can be viewed as a sequence over a finite set Z with cardinality |Z|. The state transition probability from state i to j is defined as π_{ij} = p(z_{t+1} = j | z_t = i). The distribution of observations y_t ∈ R^n given a specific state is denoted by p(y_t | z_t, θ_j), where θ_j denotes the emission parameters of state j. Above all, the HMM can be described as follows, where F(·) denotes the emission distribution.
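As a concrete illustration of this generative process, a minimal sampler for an HMM with Gaussian emissions might look as follows. This is a sketch: the parameter names (`pi0`, `P`, `means`, `covs`) are our own, not the paper's notation.

```python
import numpy as np

def sample_hmm(pi0, P, means, covs, T, rng):
    """Sample (z_{1:T}, y_{1:T}) from an HMM with Gaussian emissions.

    pi0: (K,) initial distribution; P: (K, K) transition matrix with
    P[i, j] = p(z_{t+1} = j | z_t = i); means[k], covs[k]: emission
    parameters of state k.  Names are illustrative only.
    """
    K = len(pi0)
    z = np.empty(T, dtype=int)
    y = np.empty((T, means.shape[1]))
    z[0] = rng.choice(K, p=pi0)
    y[0] = rng.multivariate_normal(means[z[0]], covs[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(K, p=P[z[t - 1]])          # Markov state transition
        y[t] = rng.multivariate_normal(means[z[t]], covs[z[t]])  # emission
    return z, y
```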
Compared to HMM, the HSMM formalism improves upon the standard HMM by incorporating a random state duration time, chosen from a state-specific distribution, into the generative process. The state remains constant until the period elapses, at which juncture there is a Markov transition to a fresh state. By using this approach, the distribution of state durations is not limited to a geometric form [24,25]. The random variable D_t represents the duration of a state entered at time t. The associated probability mass function is denoted as p(D_t | z_t = j), and the HSMM can be described as follows, where ν_s indexes the state shared by state segment s. A graphical illustration of HMM and HSMM is shown in Figure 1. HMM or HSMM typically uses a Gaussian mixture as the emission distribution, which fails to capture potential dependencies in the observed time series. An effective approach to tackle this problem is to employ a VAR emission model, which describes the behavior of the time series through linear historical interactions among the observed time series. In this way, the emission model in HMM or HSMM can be specifically expressed as follows, where p denotes the maximum lag order, W_{k,l} are n × n matrices representing the k-th state autoregressive coefficient matrices for lag l, and the covariance matrix is Σ_k. Subsequently, we refer to the H(S)MM with a VAR(p) emission model as the AR(p)-H(S)MM.

Figure 1. Graphical illustration of the HMM and HSMM. An HSMM enhances the generative process of the standard HMM with a random state duration time, which is drawn from some state-specific distribution when the state is entered. K denotes the cardinality of the hidden state space, which can be a finite or infinite value.

SLDS
An SLDS differs from a hidden Markov model (HMM) by choosing from a collection of linear Gaussian dynamics that evolve continuously, instead of a standard Gaussian mixture density as in HMM. Therefore, SLDS offers the potential for increased descriptive capability. Similar to HMM, at each time t = 1, 2, . . . , T there is a discrete switching state z_t ∈ {1, 2, . . . , K} that evolves according to Markov dynamics: {π_k}_{k=1}^{K} constitutes the Markov transition matrix and π_k ∈ [0, 1]^K is its k-th row. The initial state z_1 is assumed to be sampled from an initial distribution π_0 parameterized by a prior parameter α_0, i.e., z_1 | α_0 ∼ π_0(α_0). Furthermore, a continuous latent variable x_t ∈ R^m follows conditionally linear dynamics depending on the state value at the previous time step, and the switching state z_t determines the linear dynamical system at time t, where A^{(0)}_{z_{t+1}} ∈ R^{m×m} denotes the coefficient matrix and Q^{(0)}_{z_{t+1}} ∈ R^{m×m} is the covariance matrix. At each time t, a linear Gaussian observation y_t ∈ R^n is produced from the associated latent continuous variable, where C^{(0)} ∈ R^{n×m} and S^{(0)} ∈ R^{n×n}. Usually, n is much larger than m, such that the dependence structure of the high-dimensional observation data can be evaluated based on the low-dimensional latent variable space. A graphical illustration of SLDS is given in Figure 2.
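The SLDS generative process described above can be sketched in a few lines. This is an illustrative simulation under the stated assumptions (a single emission pair (C, S), no link from z_t to y_t), not the paper's reference code:

```python
import numpy as np

def sample_slds(pi0, P, A, Q, C, S, T, rng):
    """Sample (z, x, y) from an SLDS:
    z_t follows Markov dynamics, x_t = A[z_t] x_{t-1} + v_t with
    v_t ~ N(0, Q[z_t]), and y_t = C x_t + w_t with w_t ~ N(0, S)."""
    K, m = len(pi0), A.shape[1]
    n = C.shape[0]
    z = np.empty(T, dtype=int)
    x = np.empty((T, m))
    y = np.empty((T, n))
    z[0] = rng.choice(K, p=pi0)
    x[0] = rng.multivariate_normal(np.zeros(m), Q[z[0]])
    for t in range(1, T):
        z[t] = rng.choice(K, p=P[z[t - 1]])                    # switch state
        x[t] = A[z[t]] @ x[t - 1] \
            + rng.multivariate_normal(np.zeros(m), Q[z[t]])    # latent VAR(1)
    for t in range(T):
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(n), S)
    return z, x, y
```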

Figure 2. The graphical model of the SLDS, which consists of switching states (z_t), latent variables (x_t) and observations (y_t). Note that we assume it does not contain a link from z_t to y_t.
In its classical form, however, there are some shortcomings in the generative model of SLDS, which may limit its application. First, the continuously evolving linear Gaussian dynamics of x_t are essentially equivalent to a first-order switching vector autoregressive (VAR) process, i.e., the VAR(1) process, which may not be suitable when there are higher-order dependency relationships in x_t or y_t. Substituting Equation (9) into Equation (10), and assuming C^{(0)⊤} C^{(0)} = I and w_t approximately zero, one immediately obtains that SLDS describes the dependency relationships by a VAR(1) process, since y_t ≈ C^{(0)} A^{(0)}_{z_t} C^{(0)⊤} y_{t−1}. Second, the state transition of z_t follows the Markov assumption, which means that the duration time is restricted to a geometric distribution [19]. Thus, more weight is given to shorter consecutive time periods within a certain state, which means states switch frequently. This may not be appropriate when the system spends a long time in one state [20]. In addition, the state z_{t+1} transitions based only on the previous state z_t, and z_{t+1} | z_t is unrelated to the latent variable x_t. If a discrete switch happens when the system enters a particular area of the state space, the SLDS will not be able to recognize this relationship. To address these issues, the proposed HO-rSLDS enhances the generative process of x_t and the switching state transition, as demonstrated in the following subsections.

Recurrent Switching Linear Dynamic System with Higher-Order Dependence
In this section, we describe the proposed HO-rSLDS, which comprises two main components: a switching linear dynamical system with higher-order dependence (HO-SLDS, without recurrence), and the stick-breaking logistic regression together with the Pólya-gamma augmentation for HO-SLDS. Each component is demonstrated below.

HO-SLDS
The switching state transition in HO-SLDS is the same as that in SLDS. However, the continuous latent variable x_t ∈ R^m follows conditionally linear dynamics depending on the state values at the previous p time steps, rather than only the previous one time step, and the switching state z_t determines the linear dynamical system at time t. Indeed, the evolution of the latent variable dynamics in x_t can be considered a VAR(p) process. The initial x̃_0, given the initial discrete state z_0, is supposed to be normally distributed with mean µ_0 and covariance Σ_0, i.e., x̃_0 ∼ N(µ_0, Σ_0). Similarly, at each time t, a linear Gaussian observation y_t ∈ R^n is produced from the associated latent continuous variable; note that here we assume C_{z_t} and S_{z_t} depend on z_t. Usually, n is much larger than m, such that the dependence structure of the high-dimensional observation data can be evaluated based on the low-dimensional latent variable space. The system parameters consist of the discrete Markov transition matrix and the collection of linear dynamical system matrices.
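A standard way to work with the VAR(p) latent dynamics is to stack the last p latent states into an augmented state x̃_t = (x_t, . . . , x_{t−p+1}), which turns the VAR(p) into a VAR(1) in companion form and allows first-order machinery (such as the Kalman recursions of Section 3) to be reused. A sketch, with the helper name `companion_form` being ours:

```python
import numpy as np

def companion_form(A_lags, Q):
    """Stack a VAR(p) process x_t = sum_l A_lags[l] x_{t-l-1} + v_t into a
    VAR(1) on the augmented state (x_t, ..., x_{t-p+1}).

    A_lags: (p, m, m) lag coefficient matrices; Q: (m, m) noise covariance.
    Returns the companion transition matrix F and the augmented noise
    covariance G (noise only enters the first block)."""
    p, m, _ = A_lags.shape
    F = np.zeros((p * m, p * m))
    F[:m, :] = np.concatenate(A_lags, axis=1)   # top block row: [A_1 ... A_p]
    F[m:, :-m] = np.eye((p - 1) * m)            # shift the history down
    G = np.zeros((p * m, p * m))
    G[:m, :m] = Q                               # noise enters only x_t
    return F, G
```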

Stick-Breaking Logistic Regression and Pólya-Gamma Augmentation for HO-SLDS
The other component included in HO-rSLDS is a stick-breaking logistic regression. We utilize a Pólya-gamma augmentation technique to improve block inference updates [21]. This approach allows certain logistic regression evidence potentials to be transformed into conditionally Gaussian potentials in an augmented distribution, which facilitates the Bayesian inference algorithms.
Consider a logistic regression model with regressors x ∈ R^m that maps to a categorical distribution over the switching state z ∈ {1, 2, . . . , K}, where R ∈ R^{(K−1)×m} represents a weight matrix and r ∈ R^{K−1} denotes a bias vector. We use a stick-breaking link function π_SB : R^{K−1} → [0, 1]^K, which transforms a real vector into a normalized probability vector based on the stick-breaking process, with k-th component π_SB,k(ν) = σ(ν_k) ∏_{j<k} (1 − σ(ν_j)), where ν_k represents the k-th component of ν and σ(ν) = (1 + e^{−ν})^{−1} is the logistic function. The probability mass function p(z | x) then follows, where I(·) is the indicator function.
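The stick-breaking link can be computed directly. The following sketch (function names are illustrative) maps ν ∈ R^{K−1} to a K-dimensional probability vector, with the last entry collecting the remaining stick:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def pi_sb(nu):
    """Stick-breaking link: map nu in R^{K-1} to a probability vector in R^K.
    pi_k = sigma(nu_k) * prod_{j<k} (1 - sigma(nu_j)); the K-th entry is the
    leftover stick length."""
    s = sigmoid(np.asarray(nu, dtype=float))
    stick = 1.0
    out = []
    for sk in s:
        out.append(sk * stick)   # break off a fraction of the remaining stick
        stick *= (1.0 - sk)
    out.append(stick)            # remainder goes to the last state
    return np.array(out)

def pi_sb_from_x(R, r, x):
    """Transition probabilities p(z' = k | x) = pi_SB(R x + r)."""
    return pi_sb(R @ x + r)
```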
The posterior density p(x|z) is non-Gaussian and does not allow straightforward Bayesian inference for this regression model, because the likelihood p(z|x) does not align with a Gaussian prior density p(x). To solve this problem, we introduce Pólya-gamma auxiliary variables ω = {ω_k}_{k=1}^{K} so that the conditional density p(x | z, ω) becomes Gaussian [22]. Specifically, the conditional density of each ω_k follows a Pólya-gamma distribution. A random variable X has a Pólya-gamma distribution with parameters b > 0 and c ∈ R, denoted X ∼ PG(b, c), if X = (1/(2π²)) Σ_{k=1}^{∞} g_k / ((k − 1/2)² + c²/(4π²)), where the g_k ∼ Gamma(b, 1) are independent Gamma random variables. Conditioned on the auxiliary variables, p(x | z, ω) is Gaussian with mean vector Ω^{−1}κ and covariance matrix Ω^{−1}, which are calculated from the augmented potentials, thus allowing efficient block updates while maintaining a Gaussian posterior in the Bayesian inference.
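The infinite-sum representation above suggests a simple (if inefficient) truncated sampler; practical implementations use the exact sampler of Polson et al. [22], so the following is only an illustration, together with the closed-form mean E[PG(b, c)] = (b/(2c)) tanh(c/2):

```python
import numpy as np

def sample_pg(b, c, n_samples, rng, n_terms=200):
    """Approximate PG(b, c) draws via the truncated sum-of-gammas
    representation X = (1/(2*pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2/(4*pi^2)),
    with g_k ~ Gamma(b, 1).  Illustrative only: the truncation introduces a
    small downward bias of order 1/n_terms."""
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(shape=b, scale=1.0, size=(n_samples, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

def pg_mean(b, c):
    """E[PG(b, c)] = (b / (2c)) * tanh(c / 2), with limit b/4 as c -> 0."""
    return b / 4.0 if c == 0 else (b / (2.0 * c)) * np.tanh(c / 2.0)
```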
The HO-rSLDS splits the latent space into K sections, with each section following its own linear dynamics, by including the recurrence (Equation (17)) in the transition density of z_t. Moreover, we place Gaussian priors on R_{z_t} and r_{z_t}. In this way, the probability mass function p(z_{t+1} | x_t) can be expressed accordingly. The Pólya-gamma augmentation focuses on exactly the above densities, utilizing the integral identity (e^ν)^a / (1 + e^ν)^b = 2^{−b} e^{κν} ∫_0^∞ e^{−ων²/2} p_PG(ω | b, 0) dω, where κ = a − b/2 and p_PG(ω | b, 0) refers to the density of the Pólya-gamma distribution PG(b, 0), which is independent of ν [22]. Combining Equations (24) and (25), p(z_{t+1} | x_t) can be expressed as the marginalization of a component of the probability distribution p(z_{t+1} | x_t, ω_t), where ω_t ∈ R^{K−1} represents a vector of auxiliary variables. Conditioned on ν_t, the augmented density follows, where Ω_t = diag(ω_t) and κ_{t+1} = (κ_{t+1,1}, . . . , κ_{t+1,K−1}). After augmentation, the conditional density on x_t becomes conditionally Gaussian, simplifying the evaluation of the integrals needed for message passing during the Bayesian inference. Finally, we summarize the proposed HO-rSLDS as above. To learn the HO-rSLDS utilizing Bayesian inference, we assign conjugate priors to each component of the model parameters θ, where Dir and MNIW denote the Dirichlet and matrix normal inverse Wishart distributions, respectively, and R_{k,i} denotes the i-th row of R_k. A graphical illustration of HO-rSLDS is given in Figure 3.

Figure 3. The graphical model of the HO-rSLDS, where there is higher-order dependence involved in x_t (blue) and a recurrent dependence from x_{t−1} to z_t (red).
3.1. Update for q(x_{0:T})

Since E_{−q(x_{0:T})} ln p(θ) is not a function of x_{0:T}, Equation (36) can be written as ln q(x_{0:T}) ∝ E_{−q(x_{0:T})} ln p(z_{1:T}, x_{0:T}, ω_{1:T}, y_{1:T} | θ). From Equation (42), it can be observed that the structure of the variational posterior ln q(x_{0:T}) resembles the joint log-likelihood function of a time-varying linear dynamical system with higher-order dependence (HO-LDS). Thus, ln q(x_{0:T}) can be re-expressed as in Equation (43), with the set of variational parameters involved in Equation (43) denoted θ̃_{x_{0:T}}. Each component of θ̃_{x_{0:T}} can be precisely determined by matching the coefficients of each term in Equations (42) and (43). Algorithm 1 outlines an efficient scheme for this purpose.
Algorithm 1: Obtaining θ̃_{x_{0:T}}, where Ã_{z_t}(j) is defined as the submatrix of Ã_{z_t} associated with x_{t−j} in Ã_{z_t} x̃_{t−1}.
Thus, we perform message passing on a time-varying HO-LDS parameterized by θ̃_{x_{0:T}} to obtain the posterior distribution q(x_{0:T}) and the sufficient statistics used in the updates for the other factors. To be concrete, the message passing comprises the generalized Kalman filter and Kalman smoother for the time-varying HO-LDS, which are demonstrated by Theorems 1 and 2. To prove the theorems, we need the following lemmas [26]:

Lemma 1. If random variables X and Y follow Gaussian probability distributions p(X) and p(Y | X), then the joint distribution of X and Y, as well as the marginal distribution of Y, are Gaussian and available in closed form.

Lemma 2. If the random variables X and Y follow a joint Gaussian probability distribution, then the marginal and conditional distributions of X and Y are Gaussian and available in closed form.

We now demonstrate Theorems 1 and 2 in turn, which provide the Kalman filter and smoother equations for q(x_{0:T}) in the time-varying HO-LDS.
Theorem 1. The Kalman filter equations for the time-varying HO-LDS (Equation (43)) parameterized by θ̃_{x_{0:T}} can be evaluated in closed form, which results in Gaussian distributions. The distribution parameters can be calculated using the Kalman filter prediction, update and merging steps detailed in the following proof. The recursion begins with the prior mean µ^m_0 and covariance Σ^m_0 (i.e., µ̃_0 and Σ̃_0).
Proof. The Gaussian filter distributions (Equations (50)-(52)) can be obtained by the following steps:
1. Prediction step. By Lemma 1, the joint distribution of x_t and x̃_{t−1} given y_{1:t−1} is N(µ^{(1)}, Σ^{(1)}), and by Lemma 2 the marginal distribution of x_t follows.
2. Update step. By Lemma 1, the joint distribution of y_t and x_t given y_{1:t−1} is N(µ^{(2)}, Σ^{(2)}). By Lemma 2, the conditional distribution of x_t given y_{1:t} follows.
3. Merging step. By Lemma 1, the joint distribution of x̃_{t−1}, x_t, and y_t given y_{1:t−1} follows, and by Lemma 2 the conditional distribution of x̃_{t−1} and x_t given y_{1:t} follows. Finally, the marginal distribution of x̃_t is obtained, where µ^{(4)}(x̃_t) and Σ^{(4)}(x̃_t) are the marginal mean vector and covariance matrix of x̃_t.
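The prediction and update steps above can be summarized, for a generic (possibly companion-form) linear Gaussian system, by one standard Kalman recursion step. This sketch uses our own helper name and the textbook formulas, not the paper's exact time-varying parameterization:

```python
import numpy as np

def kalman_step(mu, Sigma, A, Q, C, S, y):
    """One predict + update step of the Kalman filter for
    x_t = A x_{t-1} + v_t, y_t = C x_t + w_t (v ~ N(0, Q), w ~ N(0, S)).
    In the HO-LDS case, A is the companion matrix of the stacked state."""
    # Prediction step: p(x_t | y_{1:t-1})
    mu_pred = A @ mu
    Sig_pred = A @ Sigma @ A.T + Q
    # Update step: condition on y_t
    resid = y - C @ mu_pred                      # innovation
    Sy = C @ Sig_pred @ C.T + S                  # innovation covariance
    K = np.linalg.solve(Sy, C @ Sig_pred).T      # gain Sig_pred C^T Sy^{-1}
    mu_new = mu_pred + K @ resid
    Sig_new = Sig_pred - K @ Sy @ K.T
    return mu_new, Sig_new
```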
The Kalman filter for the time-varying HO-LDS in Theorem 1 calculates the estimates based on measurements collected up to and including time step t. After obtaining the filtering posterior state distributions, Theorem 2 outlines the results for calculating the marginal posterior distributions for each time step based on all data up to the final time step T, which is described as follows:

Theorem 2. The Kalman smoother equations for the time-varying HO-LDS (Equation (43)) parameterized by θ̃_{x_{0:T}} can be evaluated in closed form, which results in Gaussian distributions. The smoothing distribution and filtering distribution of the last time step are identical, allowing for the recursive computation of the smoothing distribution for all time steps by beginning from the last step and moving backward.
Proof. First, by Lemma 1, the joint distribution of x̃_{t−1} and x_t given y_{1:t−1} follows. According to the Markov property of the states, we obtain the corresponding conditional distribution. Then, the joint distribution of x̃_{t−1} and x_t given all the data follows, and thus the marginal distribution of x̃_{t−1} is obtained. The distribution p(x_{t−1} | y_{1:T}) can then be directly obtained as a marginal of p(x̃_{t−1} | y_{1:T}). Therefore, we have derived the concrete expressions of the Gaussian smoother distributions (Equations (69)-(71)). Thus, we can obtain the posterior distribution q(x_{0:T} | y_{1:T}) as well as the expectations involved in the updates for the other factors, which are E_{q(x_{0:T})}(x_t), E_{q(x_{0:T})}(x_t x_t^⊤) and E_{q(x_{0:T})}(x̃_t x_{t+1}^⊤).
3.2. Update for q(ω_{1:T})

q(ω_{1:T}) can be updated using a closed-form expression by the Pólya-gamma augmentation. Equations (37) and (41) result in the following update for q(ω_{1:T}) = ∏_{t=1}^{T} ∏_{k=1}^{K−1} q(ω_{t,k}). Moreover, the expectation of the posterior Pólya-gamma distribution is available in closed form [22].

3.3. Update for q(z_{1:T})

Since E_{−q(z_{1:T})} ln p(θ) is not a function of z_{1:T}, Equation (35) can be written as ln q(z_{1:T}) ∝ E_{−q(z_{1:T})} ln p(z_{1:T}, x_{0:T}, ω_{1:T}, y_{1:T} | θ). From Equation (86), it is evident that the resulting factor q(z_{1:T}) has the form of an HMM parameterized by q(x_{1:T}, ω_{1:T}, θ), and the expected sufficient statistics required for updating the other factors can be obtained by message passing algorithms. We define ln π̃_{z_1} := E_{−q(z_{1:T})} ln p(z_1 | θ), ln ã_{z_t z_{t+1}} := E_{−q(z_{1:T})} ln p(z_{t+1} | z_t, x_t, ω_t, θ), ln b̃_{z_t}(x_t) := E_{−q(z_{1:T})} ln p(x_t | z_t, θ), and the set of the HMM parameters is denoted θ̃_{z_{1:T}}. Note that the densities p(z_1 | θ), p(z_{t+1} | z_t, x_t, ω_t, θ) and p(x_t | z_t, θ) belong to the exponential family, so these expectations can be evaluated in closed form. The computation of the HMM parameters by Equation (86) is summarized in Algorithm 2.
We apply a standard forward-backward algorithm for an HMM, where the forward recursion begins with α_1(i) ∝ π̃_i b̃_i(x_1) and the backward recursion follows.

Algorithm 2: Obtaining θ̃_{z_{1:T}}

We can then obtain the sufficient statistics used in the updates for the other factors.

3.4. Update for q(θ)

Given the sufficient statistics of q(z_{1:T}) and q(x_{1:T}) obtained by message passing, the conjugate posterior distribution of the model parameters can be updated in closed form by evaluating Equation (38). For each parameter, we present the optimized variational distribution in the following:
• Update q(π_0). Simplifying Equation (38) and extracting the terms related to π_0 results in a Dirichlet posterior, where α_{0,k} are prior parameters of p(π_0).
• Update q(µ_0, Σ_0). The posterior q(µ_0, Σ_0) can be obtained from the expectations of q(x_{0:T}). Simplifying Equation (38) and extracting the related terms yields the posterior, whose variational parameters can be optimized in closed form.
• Update q(C_k, S_k). Simplifying Equation (38) and extracting the terms related to (C_k, S_k) yields the posterior, whose variational parameters can be optimized in closed form; here ∆_k and the associated hyperparameters are prior parameters of p(C_k, S_k).
• Update q(R_k). Simplifying Equation (38) and extracting the terms related to R_k yields the posterior, which can be optimized row-wise; the corresponding hyperparameters are prior parameters of p(R_{k,i}). To derive Equation (107), one only needs to expand the quadratic form and use the matrix trace.
• Update q(r_k). Simplifying Equation (38) and extracting the terms related to r_k yields the posterior, which can be optimized in closed form, where µ_k are prior parameters of p(r_k).
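The forward-backward recursion used in the q(z_{1:T}) update (Section 3.3) can be sketched as follows, taking precomputed likelihood values b̃_k(x_t) as input. This is a generic normalized implementation, not the paper's exact code:

```python
import numpy as np

def forward_backward(pi0, P, B):
    """Normalized forward-backward for an HMM.
    pi0: (K,) initial probabilities; P: (K, K) transition matrix;
    B: (T, K) per-time likelihood values b_k(x_t), passed precomputed.
    Returns posteriors gamma[t, k] = p(z_t = k | data)."""
    T, K = B.shape
    alpha = np.empty((T, K))
    beta = np.empty((T, K))
    alpha[0] = pi0 * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                         # forward recursion
        alpha[t] = (alpha[t - 1] @ P) * B[t]
        alpha[t] /= alpha[t].sum()                # rescale for stability
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):                # backward recursion
        beta[t] = P @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Because the per-step rescaling constants are uniform within each time step, renormalizing alpha * beta row-wise recovers the exact smoothed marginals.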

Initialization
To obtain a reliable variational inference, we initialize the model parameters and latent states with reasonable values by using the following procedure:
1. Probabilistic principal component analysis (PPCA) [27] is conducted on the data to initialize the continuous latent variables x_{0:T} and the parameters C;
2. An AR(p)-HMM is fitted to initialize the switching states z_{1:T} and the parameters {A_k, Q_k}. The autoregressive order p is determined by the AIC, BIC, and HQ criteria;
3. To alleviate the possible and undesirable dependence on ordering that arises from the stick-breaking formulation during inference, we adopt a strategy of greedily fitting a decision list [21] to identify the most suitable permutation of the switching states for the stick-breaking process. Specifically, we start by performing a greedy search on permutations by creating a decision list based on ((x_t, z_t), z_{t+1}) pairs, where (o_1, . . . , o_K) is a permutation of (1, . . . , K), and p_1, . . . , p_K are predicates that depend on (x_t, z_t) and return true or false. In our framework, these predicates are defined by logistic functions, where ε ∈ (0, 1) should be predetermined. To determine o_1 and r_1, we used the maximum a posteriori estimate of the model for each of the K potential states. For the k-th logistic regression, the inputs are x_{0:T} and the outputs are y_t = I(z_{t+1} = k). We chose the logistic regression model with the largest likelihood as the first output. Then we excluded time points where z_{t+1} = o_1 from the data and proceeded with K − 1 logistic regressions to predict the subsequent output, o_2, and so on. After cycling through all K results, we obtained the permutation of z_{1:T}. In addition, R_{0,j} served as an initialization of the recurrence weights (R).

Numerical Experiments
To investigate the performance of the proposed HO-rSLDS, we generated synthetic datasets through the following steps:
1. Benchmark model settings. We take the AR-HSMM as a benchmark model to generate synthetic time series. Specifically, the dimensionality of the synthetic time series is n = 5, and the total length is T = 1000. The state number is K = 3, and the autoregressive order p takes values in {1, 2, 3}.

2. Generating (A_k, Q_k). (A_k, Q_k) corresponds to the emission parameters involved in the VAR process of the latent variables x_t (Equation (11)). The autoregressive coefficient matrices A_k are generated with 50% sparsity (defined as the proportion of zero elements), i.e., 50% of the elements in A_k are 0. The non-zero elements of A_k are generated to be positive or negative with equal probability, and their absolute values are sampled uniformly between 0.2 and 0.5. The covariance matrices Q_k are generated as Q_k = P_k^⊤ diag(σ_1, . . . , σ_d) P_k, with P_k an orthogonal matrix and each σ_i positive. The matrix P_k is constructed by orthogonalizing a random matrix whose entries are simulated from a standard normal distribution, while each σ_i is uniformly sampled in the interval [1, 3]. Additionally, a rejection step is performed to ensure that the sampled (A_k, Q_k) constitute a stable VAR process.

3. Generating the switching state sequence z_{1:T}. To generate z_{1:T}, we set the state transition probabilities π_{jk} to 0.5 for j ≠ k = 1, 2, 3. Note that in an HSMM, the self-transition probability is 0. Furthermore, we simulate the state duration time in two cases. In the first case, the state duration time is sampled from a geometric distribution (the between-state transition probabilities are set to 0.1). In the second case, the state duration time is directly sampled from {1, . . . , 10} with specified sampling probabilities. Since the second case does not correspond to a parametric distribution, we denote these two cases as geometric and nonparametric, respectively. In addition, the initial state is sampled uniformly at random from the K states.
4. Generating (C_k, S_k). (C_k, S_k) corresponds to the emission parameters generating the Gaussian observations y_t (Equation (12)). We generate (C_k, S_k) in the same manner as (A_k, Q_k).

5.
Repeat the above procedure 100 times to obtain 100 synthetic data by setting different random seeds when generating (A k , Q k , C k , S k ).We investigated the performance of HO-rSLDS by comparing with the competing models, while the estimation accuracy is evaluated by the normalized mean squared error (NMSE) between the true and estimated emission model parameters and the signal-to-noise ratio (SNR) of the inferred optimized switching state sequence ( ẑ1:T ), given by where ∥ • ∥ F denotes Frobenius norm, Ĉk and ẑ1:T are estimated values.For AR-HMM, the NMSE is evaluated based on the true and estimated VAR emission parameters.To obtain ẑ1:T , we set ẑt = arg max k γ t (k).In addition, we avoid the optimization about discrete state number (K) and the dependence order (p) in HO-rSLDS by directly setting them to the true value.
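The rejection sampling of step 2 can be sketched with NumPy. This is a sketch under the stated assumptions: `random_var_params` is a hypothetical helper, and the stability check uses the standard companion-matrix criterion (all eigenvalues strictly inside the unit circle):

```python
import numpy as np

def random_var_params(d, p, rng, sparsity=0.5, max_tries=1000):
    """Sample VAR(p) coefficients A = [A_1, ..., A_p] (each d x d, ~50% zeros,
    non-zero magnitudes uniform in [0.2, 0.5] with random signs) and a
    covariance Q = P^T diag(sigma) P, rejecting unstable coefficient draws."""
    for _ in range(max_tries):
        A = rng.uniform(0.2, 0.5, size=(p, d, d))
        A *= rng.choice([-1.0, 1.0], size=A.shape)   # random signs
        A *= rng.random(size=A.shape) > sparsity     # ~50% sparsity
        # Companion form: VAR(p) is stable iff the spectral radius is < 1.
        comp = np.zeros((d * p, d * p))
        comp[:d, :] = np.concatenate(A, axis=1)
        comp[d:, :-d] = np.eye(d * (p - 1))
        if np.max(np.abs(np.linalg.eigvals(comp))) < 1.0:
            break
    else:
        raise RuntimeError("no stable draw found")
    # Orthogonalize a standard-normal matrix; sigma_i ~ Uniform[1, 3].
    P, _ = np.linalg.qr(rng.normal(size=(d, d)))
    Q = P.T @ np.diag(rng.uniform(1.0, 3.0, size=d)) @ P
    return A, Q
```

The same helper can generate (C_k, S_k), since step 4 reuses the procedure of step 2.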
The estimated NMSE and SNR for the three models are summarized in Tables 1 and 2, respectively. HO-rSLDS outperforms AR-HMM and SLDS, with lower average NMSE and higher average SNR for all combinations of autoregressive order and state duration distribution. When p = 1, for both geometric and nonparametric state durations, SLDS slightly outperforms AR-HMM, which may be because the synthetic data are generated by a structure close to an SLDS. When p > 1 and long-term dependence is involved in the latent variables or observations, the ordering reverses, since AR-HMM with higher p can capture these temporal relationships in the observation data while SLDS cannot. Moreover, when the transition of the underlying discrete states follows a nonparametric duration distribution, the estimation accuracy of AR-HMM and SLDS degrades, while HO-rSLDS remains stable with good accuracy. This is because state transitions in both AR-HMM and SLDS follow a Markov assumption and are therefore restricted to geometric duration distributions: more weight is given to shorter consecutive periods within a state, so the regimes switch frequently, which may be inappropriate without prior evidence of geometric state durations. By comparison, HO-rSLDS is robust to the state duration distribution, since it improves the state transition through the stick-breaking logistic regression: the current state z_t depends not only on the preceding state z_{t−1} but also on the latent variable x_{t−1} and, through it, on the previous observations.
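The claim that the Markov assumption implies geometric state durations can be checked with a quick simulation (a minimal sketch; the function name is hypothetical): with self-transition probability q, run lengths are geometric with mean 1 / (1 − q).

```python
import numpy as np

def markov_state_durations(self_prob, n_steps, rng):
    """Simulate a binary Markov chain with self-transition probability
    `self_prob` and return the lengths of consecutive same-state runs."""
    z = [0]
    for _ in range(n_steps - 1):
        z.append(z[-1] if rng.random() < self_prob else 1 - z[-1])
    durations, run = [], 1
    for a, b in zip(z, z[1:]):
        if a == b:
            run += 1
        else:
            durations.append(run)
            run = 1
    durations.append(run)
    return np.array(durations)
```

With q = 0.9, the empirical mean duration concentrates near 1 / (1 − 0.9) = 10, matching the geometric law that HO-rSLDS is not bound by.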
A more exhaustive comparison is illustrated in Figure 5, where HO-rSLDS outperforms the competing models with better estimation accuracy and statistical significance in all cases (t-test, ** p < 0.01).

Dynamic Functional Connectivity Analysis in fMRI Data
To show that HO-rSLDS also works well with real-world data, we compare it with standard SLDS on public functional magnetic resonance imaging (fMRI) data. The data come from the Human Connectome Project (HCP) 1200 Parcellation Timeseries Netmats (HCP1200-PTN) dataset (available at https://www.humanconnectome.org, accessed on 6 March 2024); see [28] for a detailed description of the acquisition protocol. In this study, we apply HO-rSLDS to data from 20 unrelated subjects in the HCP1200-PTN release, with dimension d = 15 and time series length T_0 = 4800 per subject, to perform dynamic functional connectivity (DFC) analysis, which is used to analyze the dynamic temporal coherence among endogenous fluctuations in distributed brain regions [29,30]. In general, HO-rSLDS and SLDS assume that there are metastable states in the brain with characteristic connectivity patterns, and the connectivity pattern can be represented directly by statistical correlations. Specifically, by Lemma 1, the functional connectivity pattern of state k in both HO-rSLDS and SLDS can be obtained from the fitted model parameters.

There are two unknown hyperparameters: the state number K and the dependence order p. We choose K by the mixture minimum description length (MMDL), which accounts for the optimal code length of each state in a mixture model [31]:

MMDL(K) := −L(K) + (D/2) Σ_{k=1}^{K} ln γ̄_k,

where L(K) is the variational lower bound evaluated during model inference, D is the number of free parameters per state, and γ̄_k := Σ_{t=1}^{T} γ_t(k). The concept of MMDL is elucidated by the minimum description length principle in [32]. The state number K is determined by minimizing the MMDL over a specified list of candidates. In addition, we fit a VAR model to the fMRI time series and determine the dependence order p by the AIC, BIC and HQ criteria [33]:

AIC(p) := ln det(Σ̂) + 2pd²/T_0,
BIC(p) := ln det(Σ̂) + (ln T_0 / T_0) pd²,
HQ(p) := ln det(Σ̂) + (2 ln ln T_0 / T_0) pd²,

where Σ̂ is the estimated noise covariance of the VAR model. To keep the model order small enough for feasible computation, we choose the lag order at which AIC/BIC/HQ show no further substantial decrease at higher orders [34]. As illustrated in Figure 6, we set K = 4 according to the minimal MMDL and p = 4, since the downward trend of AIC, BIC and HQ flattens at higher orders.

In Figure 7, we present the DFC patterns derived from the proposed HO-rSLDS and the standard SLDS. The 15 brain regions producing the fMRI time series are indexed by the numbers 1 to 15. The states are ordered by their fractional occupancy (defined as the percentage of time allocated to that state); the 4 states from SLDS have a more even fractional occupancy distribution. Moreover, the DFC patterns derived from SLDS are more similar to one another than those from HO-rSLDS; there is only a slight difference between State4 and the other three SLDS states. We measure the difference between the DFC patterns of two states with the cosine similarity

cos(F_1, F_2) := tr(F_1^⊤ F_2) / (∥F_1∥_F ∥F_2∥_F),

where F_1 and F_2 are square matrices. By this measure, the average pairwise cosine similarity is 0.9189 for SLDS and 0.6731 for HO-rSLDS. Although the true DFC patterns of these fMRI data are unknown, HO-rSLDS uncovers richer DFC structure than standard SLDS owing to its modeling advantages. For example, State4 of HO-rSLDS, with the lowest fractional occupancy, can be regarded as a strongly connected state whose absolute functional connectivity is markedly larger than that of the other three states. Moreover, compared with standard SLDS, the four states derived from HO-rSLDS show clearly positive or negative functional connectivity.
In addition, the functional connectivity of State1 and State2 is similar, which yields a pairwise cosine similarity (0.7776) above the HO-rSLDS average. This similarity is not clearly exhibited by State3 (for example, the negative connectivity between region 8 and regions 11, 12 and 13). These distinctions cannot be obtained by standard SLDS.
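The cosine similarity between two DFC matrices, and its average over all state pairs, can be computed as follows (a minimal sketch with hypothetical helper names):

```python
import numpy as np

def cosine_sim(F1, F2):
    """Cosine similarity between two connectivity matrices, treating each
    as a flattened vector: tr(F1^T F2) / (||F1||_F ||F2||_F)."""
    return np.sum(F1 * F2) / (np.linalg.norm(F1) * np.linalg.norm(F2))

def mean_pairwise_sim(mats):
    """Average cosine similarity over all unordered pairs of state patterns."""
    sims = [cosine_sim(mats[i], mats[j])
            for i in range(len(mats)) for j in range(i + 1, len(mats))]
    return float(np.mean(sims))
```

Applied to the four state-specific DFC matrices of each model, `mean_pairwise_sim` yields the averaged similarities reported above (0.9189 for SLDS, 0.6731 for HO-rSLDS).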

Conclusions
In this study, we have established a new method, HO-rSLDS, an extension of SLDS that improves the statistical modeling of dynamics in complex systems. First, HO-rSLDS addresses the problem of discovering higher-order temporal dependence in the latent variables or observations by allowing the current latent variable to depend on several previous time steps, as in AR-HMM. Second, HO-rSLDS improves the switching state transition by utilizing stick-breaking logistic regression and Pólya-gamma augmentation, making the transition dependent on the latent variable and thereby overcoming the limitation of the Markov assumption, which restricts state durations to a geometric distribution. Moreover, the symmetric dependency between the switching states and the latent variable is recovered from the asymmetric relationships of standard SLDS. We have presented detailed inference algorithms for message passing and parameter learning in HO-rSLDS. In numerical experiments, HO-rSLDS outperformed the competing models with higher estimation accuracy and robustness. The utility and versatility of HO-rSLDS have also been demonstrated on real-world data: the application to fMRI data for DFC analysis indicates the superiority of the method in discovering rich information about DFC patterns. A potential limitation of HO-rSLDS is that the state number and dependence order must be predetermined in applications, and this hyperparameter optimization is a non-trivial task. Future research will improve the proposed model with Bayesian nonparametric methods so that the complicated hyperparameter optimization can be avoided [35]. The proposed model improves the current methods for learning switching linear dynamical modes, which will facilitate the identification and assessment of the dynamics of complex systems.

Figure 1.
Figure 1. Graphical model of HMM (a) and HSMM (b). An HMM typically consists of two layers: a hidden state layer based on a Markov chain and an observation layer dependent on the current state. An HSMM enhances the generative process of the standard HMM with a random state duration time, drawn from a state-specific distribution when the state is entered. K denotes the cardinality of the hidden state space, which can be finite or infinite.

Figure 4
Figure 4 displays all dimensions and the true state sequence of one synthetic dataset.

Figure 4.
Figure 4. Simulated time series with dimension d = 5 and total time points T = 1000 (the last 200 time points are shown). The five dimensions are displayed in different colors. The color intervals indicate the state sequence.

Figure 5.
Figure 5. Box plots of NMSE and SNR derived from AR-HMM, SLDS and HO-rSLDS over the 100 numerical experiments. HO-rSLDS outperforms the competing models, with significantly lower NMSE and higher SNR on average for all combinations of autoregressive order and state duration distribution (t-test, ** p < 0.01).

Figure 6.
Figure 6. (a) State number (K) determined by MMDL. The best K = 4 is chosen at the minimal MMDL. The y-axis is logarithmically scaled to highlight the variation. (b) Dependence order p determined by the AIC, BIC and HQ criteria. The ideal p is set to 4, since there is no further substantial decrease at higher orders.

Figure 7.
Figure 7. DFC patterns derived from HO-rSLDS (a) and standard SLDS (b). The 15 brain regions are indexed by the numbers 1 to 15, and the states are ordered by their fractional occupancy (number in brackets). The color bar is shared by the two groups of heatmaps.

Table 1.
NMSE derived from AR-HMM, SLDS and HO-rSLDS on the synthetic time series with different state duration distributions and autoregressive orders in the generative model. All results are averaged over the 100 experiments (mean ± std).

Table 2.
SNR derived from AR-HMM, SLDS and HO-rSLDS on the synthetic time series with different state duration distributions and autoregressive orders in the generative model. All results are averaged over the 100 experiments (mean ± std).