Variational Beta Process Hidden Markov Models with Shared Hidden States for Trajectory Recognition

The hidden Markov model (HMM) is a vital model for trajectory recognition. Since the number of hidden states in an HMM is important and hard to determine, many nonparametric methods, such as hierarchical Dirichlet process HMMs and Beta process HMMs (BP-HMMs), have been proposed to determine it automatically. Among these methods, the sampled BP-HMM models the shared information among different classes, which has been proved effective in several trajectory recognition scenarios. However, the existing BP-HMM maintains a state transition probability matrix for each trajectory, which is inconvenient for classification. Furthermore, the approximate inference of the BP-HMM is based on sampling methods, which usually take a long time to converge. To develop an efficient nonparametric sequential model that can capture cross-class shared information for trajectory recognition, we propose a novel variational BP-HMM, in which the hidden states can be shared among different classes and each class chooses its own hidden states and maintains a unified transition probability matrix. In addition, we derive a variational inference method for the proposed model, which is more efficient than sampling-based methods. Experimental results on a synthetic dataset and two real-world datasets show that, compared with the sampled BP-HMM and other related models, the variational BP-HMM has better performance in trajectory recognition.


Introduction
Trajectory recognition is important and meaningful in many practical applications, such as human activity recognition [1], speech recognition [2], handwritten character recognition [3] and navigation tasks with mobile robots [4]. In most practical applications, the trajectory is affected by the hidden features corresponding to each point. The hidden Markov model (HMM) [2], the hierarchical conditional random field (HCRF) [5,6] and HMM-based models, such as the hierarchical Dirichlet process hidden Markov model (HDP-HMM) [7], the Beta process hidden Markov model (BP-HMM) [8-10] and the Gaussian mixture model hidden Markov model (GMM-HMM) [2], are used to model sequential data and identify their classes [11-14].
The HMM is a popular model which has been applied widely in human activity recognition [1,15], speech recognition [2,16] and remote target tracking [2,17]. Besides, the HMM is becoming a significant building block of smart cities and Industry 4.0 [18,19], implemented in extensive applications such as driving behavior prediction [20] and Internet of Things (IoT) signature anomaly detection [21]. One drawback of the HMM is that the number of hidden states must be fixed in advance, typically selected manually or by cross-validation. To address this problem, several methods based on model selection are employed, such as BIC [22], as well as Bayesian nonparametric priors like the BP [23] and the HDP [24]. Directly using the original HMM for classification has another disadvantage: each HMM is trained for one class separately, so information from different classes cannot be shared. It is worth mentioning that the sampled BP-HMM proposed by Fox et al. [9] can not only learn the number of hidden features automatically but also discover the features shared between different classes, which has been proved to be meaningful for human activity trajectory recognition. The sampled BP-HMM learns the shared states among different classes by jointly modeling all trajectories together: a hidden state indicator with a BP prior is introduced for each trajectory, and thus a state transition matrix is maintained for each trajectory. When used for classification, the sampled BP-HMM calculates the class-specific transition matrix by averaging the transition matrices of the trajectories from the corresponding class. However, from the perspectives of both performance and efficiency, if the sampled BP-HMM [1,7] is used for classification, there is still much room for improvement.
From the perspective of performance, the classification procedure in the sampled BP-HMM [1] is too rough to make full use of the trained model, in which the state transition matrix for each class is calculated by averaging the transition matrices of all the trajectories. Obviously, this leads to a loss of information, especially when the training set contains ambiguous trajectories, for instance, when a "running" class contains some "jogging" trajectories. One naive remedy is to select the K best HMMs for each class; however, selecting representatives for each class costs plenty of time. In order to take account of both performance and efficiency, we change the way of modeling data in BP-HMMs. Different from previous versions of the BP-HMM [1,8-10,25], in the variational BP-HMM, an HMM is created for each class instead of for each trajectory.
From the perspective of efficiency, the existing approximate inference for the BP-HMM is based on sampling methods [1,9] which often converge slowly. This drawback of the sampled BP-HMM [1] is inconvenient to practical applications. To provide a faster convergence rate than sampling methods, we develop variational inference for the BP-HMM. If the variational lower bound is unchanged or almost unchanged, the iteration will stop. To be amenable to the variational method, we use the stick-breaking construction of the BP [26] instead of the Indian buffet process (IBP) construction [27] in the sampled BP-HMM.
In this paper, we propose a variational BP-HMM for trajectory recognition, in which both the way of modeling the data and the inference method are novel compared with the previous sampled BP-HMM. On the one hand, the new method of modeling trajectories enables the model to obtain better classification performance. Specifically, the hidden states can be optionally shared, and the class-specific state indicator is more suitable for classification than the trajectory-specific state indicator in the sampled BP-HMM. The transition matrix is actually learned from the data instead of being obtained by averaging all the trajectory-specific transitions. On the other hand, the derived variational inference of the BP-HMM makes the model more efficient. In particular, we use the two-parameter BP as the prior of the class-specific state indicator, which is more flexible than the one-parameter Indian buffet process in the sampled BP-HMM. We apply our model to the navigation task of mobile robots and to human activity trajectory recognition. Experimental results on synthetic and real-world data show that the proposed variational BP-HMM with shared hidden states has advantages for trajectory recognition.
The remainder of this paper is organized as follows. Section 2 gives an overview of the BP and HMM. In Section 3, we review the model assumptions of the sampled BP-HMM. In Section 4, we present the proposed variational BP-HMM, including the model setting and its variational inference procedure. Experimental results on both synthetic and real-world datasets are reported in Section 5. Finally, Section 6 gives the conclusion and future research directions.

Preliminary Knowledge
In order to explain the variational BP-HMM more clearly, the key related background, including the BP and the HMM, is introduced in the following subsections.

Beta Process
The BP was defined by Hjort [28] for applications in survival analysis. It has since found significant application as a nonparametric prior for latent factor models [23,26], and has been used as a nonparametric prior for selecting the hidden state set of the HMM [8,9,25]. The BP was originally defined on the positive real line (R+) and later extended to more general spaces Ω (e.g., R).
A BP, B ∼ BP(α, B_0), is a positive Lévy process. Here, α is the concentration parameter and B_0 is a fixed base measure on Ω. Let γ = B_0(Ω). A draw from BP(α, B_0) can be written as

B = ∑_k π_k δ_{ω_k},

where the {ω_k} are the atoms of B with weights π_k ∈ (0, 1). If B_0 is continuous, the Lévy measure of the BP is expressed as

ν(dπ, dω) = α π^{-1} (1 − π)^{α−1} dπ B_0(dω).

If B_0 is discrete, in the form B_0 = ∑_k q_k δ_{ω_k}, the atoms of B and B_0 have the same locations and the weights are drawn as

π_k ∼ Beta(α q_k, α(1 − q_k)).

Moreover, for a finite approximation with K atoms and π_k ∼ Beta(αγ/K, α(1 − γ/K)), B converges to a BP as K → ∞ [29]. The BP is conjugate to a class of Bernoulli processes, denoted BeP(B). In this article, we focus on the discrete case, where for B = ∑_k π_k δ_{ω_k} a Bernoulli process draw F ∼ BeP(B) can be expressed as

F = ∑_k b_k δ_{ω_k}, with b_k ∼ Bernoulli(π_k).

The pair (F, B) is called the Beta-Bernoulli process. Similarly to the Dirichlet process, which has two principal methods for drawing samples, (1) the Chinese restaurant process [30] and (2) the stick-breaking process [31], the BP generates samples using the Indian buffet process (IBP) [23] and the stick-breaking process [29].
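The Beta-Bernoulli construction can be sketched numerically with a finite-K approximation. The snippet below is illustrative only; the values of K, α and γ are arbitrary choices, not taken from the paper:

```python
import numpy as np

def sample_beta_bernoulli(K=50, alpha=2.0, gamma=5.0, n_draws=3, seed=0):
    """Finite-K sketch of the Beta-Bernoulli process: draw atom weights
    pi_k ~ Beta(alpha*gamma/K, alpha*(1 - gamma/K)), which converges to a
    draw from BP(alpha, B0) as K -> infinity, then draw F ~ BeP(B)."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(alpha * gamma / K, alpha * (1.0 - gamma / K), size=K)
    # Each Bernoulli process draw switches atom k on with probability pi_k.
    F = rng.random((n_draws, K)) < pi
    return pi, F
```

Since E[B(Ω)] = γ, each Bernoulli process draw selects about γ atoms on average, which is how the BP controls the number of hidden features in BP-HMMs.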
The original IBP can be seen as a special case of the general BP, i.e., an IBP is a one-parameter BP. Similarly to the Chinese restaurant process, the IBP is described in terms of customers choosing dishes. It can also be employed to construct two-parameter BPs with some details changed. Specifically, the procedure for constructing BP(α, B_0), with γ = B_0(Ω), is as follows:
1. The first customer takes the first Poisson(γ) dishes.
2. The nth customer then takes each previously sampled dish k with probability m_k/(α + n − 1), where m_k is the number of customers who have already sampled dish k. He also takes Poisson(αγ/(α + n − 1)) new dishes.
The BP has been shown to be the de Finetti mixing distribution underlying the Indian buffet process, and an algorithm has been presented to generate the BP [23].
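The two-parameter customers-and-dishes procedure above can be simulated directly. The following is a minimal sketch (parameter values are arbitrary illustrations):

```python
import numpy as np

def sample_ibp(n_customers, alpha=1.0, gamma=3.0, seed=0):
    """Two-parameter Indian buffet process: returns a binary customer-by-dish
    matrix Z. Customer n takes old dish k with probability m_k/(alpha + n - 1)
    and Poisson(alpha*gamma/(alpha + n - 1)) new dishes."""
    rng = np.random.default_rng(seed)
    dishes = []            # m_k: how many customers have taken dish k so far
    rows = []              # per-customer sets of chosen dish indices
    for n in range(1, n_customers + 1):
        chosen = set()
        for k, m_k in enumerate(dishes):
            if rng.random() < m_k / (alpha + n - 1):
                chosen.add(k)
        # For n = 1 this is Poisson(gamma) new dishes, as in step 1.
        n_new = rng.poisson(alpha * gamma / (alpha + n - 1))
        for _ in range(n_new):
            dishes.append(0)
            chosen.add(len(dishes) - 1)
        for k in chosen:
            dishes[k] += 1
        rows.append(chosen)
    Z = np.zeros((n_customers, len(dishes)), dtype=int)
    for n, chosen in enumerate(rows):
        Z[n, list(chosen)] = 1
    return Z
```

The matrix Z plays the same role as the state selection matrix F in the sampled BP-HMM: row n indicates which global features the nth object owns.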
The stick-breaking construction of the BP, B ∼ BP(α, B_0), is provided by Paisley et al. [29]. It is formulated as

B = ∑_{i=1}^∞ ∑_{j=1}^{C_i} V_{ij}^{(i)} ∏_{l=1}^{i−1} (1 − V_{ij}^{(l)}) δ_{ω_{ij}},
C_i ∼ Poisson(γ), V_{ij}^{(l)} ∼ Beta(1, α), ω_{ij} ∼ B_0/γ.

These equations show that in every round (indexed by i), C_i atoms are selected; their weights follow an i-times stick-breaking process in which each break has Beta(1, α) probability, and C_i is drawn from Poisson(γ).
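As a concrete sketch, the rounds-based construction can be simulated directly, truncated at a fixed number of rounds (the truncation level and parameter values below are arbitrary illustrations):

```python
import numpy as np

def stick_breaking_bp(n_rounds=15, alpha=2.0, gamma=3.0, seed=2):
    """Rounds-based stick-breaking draw of the atom weights of B ~ BP(alpha, B0):
    in round i, C_i ~ Poisson(gamma) atoms each get weight
    V^{(i)} * prod_{l<i}(1 - V^{(l)}), with every break V ~ Beta(1, alpha)."""
    rng = np.random.default_rng(seed)
    weights = []
    for i in range(1, n_rounds + 1):
        C_i = rng.poisson(gamma)
        for _ in range(C_i):
            # i independent Beta(1, alpha) breaks for this atom.
            V = rng.beta(1.0, alpha, size=i)
            w = V[-1] * np.prod(1.0 - V[:-1])
            weights.append(w)
    return np.array(weights)
```

Atoms appearing in later rounds receive geometrically smaller weights, which is exactly the property the simplified representation of Section 4 exploits.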

Hidden Markov Models
The HMM [2] is a state space model in which each sequence is modeled by a Markov chain of discrete latent variables, with each observation conditioned on the state of the corresponding latent variable. HMMs are thus appropriate for data varying over time, where the data can be considered to be generated by a process that switches between different phases or states at different time points. The HMM has proved to be a valuable tool in human activity recognition, speech recognition and many other popular areas [32].

An HMM consists of a hidden state sequence z = {z_1, . . . , z_T} taking values in a state set Ω_1 of size K and an observation sequence x = {x_1, . . . , x_T}. The state sequence follows a first-order Markov chain with transition matrix Π, where π_{ij} = p(z_t = j | z_{t−1} = i). If x_t is discrete with value set Ω_2 of size D, φ is a K × D matrix with elements φ_{ij} = p(x_t = j | z_t = i), i ∈ Ω_1, j ∈ Ω_2. Π and φ are named the transition matrix and the emission matrix, respectively. If x_t is continuous, the emission matrix is replaced by an emission distribution, where φ is often defined as a distribution such as the Gaussian p(x_t | z_t = k) = N(x_t | μ_k, Σ_k), k ∈ Ω_1. In the fully Bayesian framework, μ_k and Σ_k can be regarded as random variables with priors such as the normal-inverse-Wishart or Gaussian-Gamma distribution.
The marginal likelihood is often used to evaluate how well an HMM fits the trajectories. Therefore, the HMM is usually trained by maximizing the marginal likelihood over the training trajectories. The Baum-Welch (BW) algorithm, an EM method, is the standard algorithm for learning the parameters of HMMs. The parameters include the transition matrix, the initial state distribution and the emission matrix (or the emission distribution's parameters). In the BW algorithm, the forward-backward algorithm is employed to calculate the marginal probability. It should be noted that since the BW algorithm can only find a local optimum, multiple initializations are usually used to obtain better solutions. Given the learned parameters, the most likely state sequence corresponding to a trajectory is required in many practical applications; the Viterbi algorithm is an effective method to obtain this most probable state sequence.
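As an illustration of the Viterbi decoding step, here is a minimal log-space implementation. This is a generic sketch, not the paper's code; the per-time emission likelihoods are assumed to be precomputed and passed in:

```python
import numpy as np

def viterbi(pi0, A, B):
    """Most probable state path for an HMM, computed in log space.
    pi0: (K,) initial state probs; A: (K,K) transition matrix;
    B: (T,K) emission likelihoods p(x_t | z_t = k)."""
    T, K = B.shape
    logd = np.log(pi0) + np.log(B[0])          # delta_1(k)
    back = np.zeros((T, K), dtype=int)         # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[from, to]
        back[t] = np.argmax(scores, axis=0)    # best predecessor per state
        logd = scores[back[t], np.arange(K)] + np.log(B[t])
    # Backtrack from the best final state.
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

For a sticky two-state chain with emissions clearly favoring states 0, 0, 1, 1, the decoder recovers that path.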
HMMs are a kind of generative model; they model the distribution of all the observed data. In trajectory classification tasks, such as activity trajectory recognition, different HMMs are used to model different classes of trajectories separately. After training these HMMs, the parameters of the different HMMs are used to evaluate a newly arrived trajectory and find its most probable class. Specifically, to model multiple trajectories from different classes, a separate HMM is defined for each class of trajectories, with θ_c representing its parameters. Given the trained HMMs, the class label y* of a new test trajectory x* is determined according to

y* = argmax_c p(x* | θ_c),

where p(x* | θ_c) can be calculated using the forward-backward algorithm.
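This per-class scoring scheme can be sketched as follows, using one-dimensional Gaussian emissions and a scaled forward pass. The parameters below are toy values for illustration, not learned ones:

```python
import numpy as np

def hmm_loglik(x, pi0, A, means, var=1.0):
    """log p(x | theta) for a 1-D Gaussian-emission HMM via the scaled forward pass."""
    B = np.exp(-0.5 * (x[:, None] - means[None, :]) ** 2 / var) / np.sqrt(2 * np.pi * var)
    alpha = pi0 * B[0]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[t]
        c = alpha.sum()            # scaling constant = p(x_t | x_{1:t-1})
        loglik += np.log(c)
        alpha = alpha / c
    return loglik

def classify(x, class_params):
    """y* = argmax_c log p(x | theta_c): score x under each class's HMM."""
    return max(class_params, key=lambda c: hmm_loglik(x, *class_params[c]))
```

With two single-state HMMs whose emission means are 0 and 5, a sequence hovering around 5 is assigned to the second class.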

The Sampled BP-HMM
The sampled BP-HMM [9] is proposed to discover the available hidden states and the sharing patterns among different classes. It jointly models multiple trajectories and learns a state transition matrix for each trajectory. The sampled BP-HMM is successfully applied to trajectory recognition tasks, such as human activity trajectory recognition [1,10].
The sampled BP-HMM uses HMMs to model all the trajectories from all the classes and uses the BP as the prior of the indicator variables, with one indicator corresponding to each trajectory. Suppose X = {X^(1), X^(2), . . . , X^(n), . . . , X^(N)}, N ∈ N+, where X^(n) is the nth trajectory. Each trajectory is modeled by an HMM. These HMMs share a global hidden feature set Ω of infinite size. The sampled BP-HMM uses a hidden state selection matrix F of size N × ∞ to indicate the available states for each trajectory, i.e., f_nk ∈ {0, 1} indicates whether the nth HMM owns the kth state. The prior of the transition matrix Π^(n) for each trajectory is related to F: the transition probability vector of the nth HMM from state j is

π_j^(n) ∼ Dir([γ, . . . , γ + κ, . . . , γ] ⊗ f_n),

where the extra mass κ is placed on the self-transition entry, and the initial state probability vector π_0^(n) is also related to F,

π_0^(n) ∼ Dir([γ, . . . , γ] ⊗ f_n).

Similarly to the standard HMM, the latent variable z^(n) is a discrete sequence with

z_t^(n) ∼ π^(n)_{z_{t−1}^(n)},

and the emission distribution of the nth HMM is

x_t^(n) ∼ N(μ_{z_t^(n)}, Σ_{z_t^(n)}).

In order to build a nonparametric model, the hidden state selection matrix F is constructed by a BP-BeP.
From the characteristics of BPs, we can see that the greater the concentration parameter α, the sparser the hidden state selection matrix F, while a greater γ leads to more hidden features. Given the above model assumptions, the sampled BP-HMM uses Gibbs sampling to train the model and a gradient-based method to learn the parameters. With the state transition matrix for each trajectory learned, the average state transition matrix for each class can be calculated by the mean operation. New test trajectories are then classified according to their likelihoods conditional on each class.

The Proposed Variational BP-HMM
In this section, we introduce the proposed variational BP-HMM, which has more reasonable assumptions and a more efficient inference procedure than the sampled BP-HMM. We first describe the key points of our model and present a stick-breaking representation of the BP that allows for variational inference. Then we give the joint distribution of the proposed BP-HMM and its variational inference.

BP-HMM with the Shared Hidden State Space and Class Specific Indicators
As introduced above, the existing sampled BP-HMM can jointly learn trajectories from different classes by sharing the same hidden state space. It can also automatically determine the available states and the corresponding transition matrices for each trajectory by introducing the state selection matrix F. However, in the sampled BP-HMM, the state transition matrix and initial probabilities are trajectory-specific, and it is not appropriate to average these transition matrices and probabilities to obtain a class-level matrix and probabilities.
In order to model trajectories from different classes more reasonably, we introduce a shared hidden state space and class-specific indicators. We define a state selection vector f_c for each class, which is used to distinguish the differences between classes, and define the initial state probabilities π_0^(c) and the transition matrix rows π_j^(c) for each class, which are used to capture the commonness within one class. The transition probability vector of the cth class from state j is

π_j^(c) ∼ Dir([γ, . . . , γ + κ, . . . , γ] ⊗ f_c),

where the extra mass κ is placed on the self-transition entry, and the initial state probability vector is

π_0^(c) ∼ Dir([γ, . . . , γ] ⊗ f_c).

Similarly to the standard HMM, the latent variable z^(n) for the nth trajectory is a discrete sequence with

z_t^(n) ∼ π^(y_n)_{z_{t−1}^(n)},

where y_n denotes the class of the nth trajectory. In its way of modeling, the proposed new version of the BP-HMM differs from the sampled BP-HMM [1], which learns an HMM for each trajectory, and also from traditional HMMs, which learn an HMM for each class separately. The proposed BP-HMM uses all the sequences from different classes to jointly train a whole BP-HMM, with each component HMM corresponding to one class. Therefore, the proposed BP-HMM can better model trajectories from multiple classes and can further make better classifications.
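To make the masked Dirichlet prior concrete, the following sketch draws one class's transition matrix given its binary indicator f_c. The γ and κ values are arbitrary, and the extra self-transition mass κ is an assumption following the sticky-HMM convention:

```python
import numpy as np

def sample_class_transitions(f_c, gamma=1.0, kappa=5.0, seed=3):
    """Draw a class-specific transition matrix: row j is a Dirichlet over the
    states selected by the binary indicator f_c, with extra mass kappa on the
    self-transition entry (a sketch of pi_j ~ Dir([gamma,...,gamma+kappa,...] * f_c))."""
    rng = np.random.default_rng(seed)
    K = len(f_c)
    active = np.flatnonzero(f_c)
    Pi = np.zeros((K, K))
    for j in active:
        conc = np.full(K, gamma)
        conc[j] += kappa                       # bias toward staying in state j
        draw = rng.gamma(conc[active])         # normalized Gamma draws = Dirichlet
        Pi[j, active] = draw / draw.sum()
    return Pi
```

Rows and columns of unselected states are identically zero, so each class effectively runs an HMM on its own subset of the shared state space.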

A Simpler Representation for Beta Process
Besides the model assumptions, the proposed variational BP-HMM uses a different representation of the BP. As introduced in Section 2, the IBP construction of the BP describes the process through conditional distributions. This kind of representation is only suitable for sampling methods, similarly to the Chinese restaurant construction of DPs. Therefore, unlike the sampled BP-HMM, which uses the IBP construction of the BP to lend itself to a Gibbs sampler, we use the stick-breaking construction of the BP to accommodate variational inference. There has been some work on constructing stick-breaking representations of BPs for variational inference. A stick-breaking construction was given for the IBP, which is closely related to the BP and can be seen as a one-parameter BP [26]. The two-parameter BP was also constructed through stick-breaking processes to serve variational inference [29]. Recently, a simpler representation of the two-parameter BP based on the stick-breaking construction was developed to simplify variational inference [33]. In order to approximate the posterior of the BP with variational Bayesian methods more easily, we adopt this simpler representation [33]. Let d_k mark the round in which the kth atom appears, that is,

d_k = r if ∑_{i=1}^{r−1} C_i < k ≤ ∑_{i=1}^{r} C_i.

Note that δ(·) is a binary indicator which equals 1 if its argument is true. Using the latent indicators, the representation of B in (6) is simplified as

B = ∑_{k=1}^∞ V_k e^{−T_k} δ_{ω_k}, with T_k | d_k, α ∼ Gamma(d_k − 1, α),

with ω and V drawn as before.
Here we should note that each d_k does not itself have a distribution; rather, the cardinality of {k : d_k = r} is drawn from Poisson(γ). In addition, T_k = 0 with probability one when d_k = 1. In this BP, the atoms are ω_k = {μ_k, Σ_k}, and Gamma priors with hyper-parameters {a_1, a_2} and {b_1, b_2} are placed on α and γ:

α ∼ Gamma(a_1, a_2), γ ∼ Gamma(b_1, b_2).
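A quick numerical sketch of this simplified representation, with the round indices d and the value of α chosen arbitrarily for illustration:

```python
import numpy as np

def sample_bp_weights(d, alpha=2.0, seed=4):
    """Atom weights pi_k = V_k * exp(-T_k), with V_k ~ Beta(1, alpha) and
    T_k | d_k ~ Gamma(d_k - 1, rate alpha); T_k = 0 w.p. 1 when d_k = 1."""
    rng = np.random.default_rng(seed)
    d = np.asarray(d)
    V = rng.beta(1.0, alpha, size=len(d))
    T = np.zeros(len(d))
    later = d > 1                                     # first-round atoms keep T_k = 0
    T[later] = rng.gamma(d[later] - 1, 1.0 / alpha)   # numpy uses scale = 1/rate
    return V * np.exp(-T)
```

The factor e^{−T_k} replaces the explicit product of stick-breaking remainders: atoms from later rounds accumulate larger T_k, hence exponentially smaller weights.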

Joint Distribution of the Proposed BP-HMM
Assume that the total number of classes is C and the number of trajectories is N. Let X represent all the trajectories, let W = {α, γ, {V_k}, {T_k}, {d_k}, F, {π_j^(c)}, {ω_k}, Z} represent the set of all latent variables in the model, let θ be the set of all the hyper-parameters, and let Y be the set of all the class labels. The probabilistic graphical model is shown in Figure 2, and the joint likelihood factorizes as

p(X, W | θ, Y) = p(α)p(γ)p(d | γ) ∏_k p(V_k)p(T_k | d_k, α)p(ω_k) ∏_c p(f_c | V, T) ∏_j p(π_j^(c) | f_c) ∏_n p(z^(n) | π^(y_n)) p(X^(n) | z^(n), ω).

The likelihood p(X | W, θ) is defined as a multivariate normal distribution by

p(x_t^(n) | z_t^(n) = k) = N(x_t^(n) | μ_k, Σ_k).

The prior distributions of the parameters W and the detailed setup are given in Appendix A.

Fig. 2 The probabilistic graphical model of the proposed variational BP-HMM. Here z^(n) = {z_1^(n), . . . , z_T^(n)} is the hidden state sequence of the nth trajectory, and y^(n) is the class label of the nth trajectory, which indicates from which class the state transition probabilities are chosen. In this graphical model, we omit the hyper-parameters.

Variational Inference for the Proposed BP-HMM
We use a factorized variational distribution over all the latent variables to approximate the intractable posterior p(W | X, θ). Two truncations are set in the inference: one truncates the number of hidden states at K and the other truncates the number of rounds at R. Specifically, we assume the variational distribution

Q = q(α)q(γ) ∏_{k=1}^{K} q(V_k)q(T_k)q(d_k)q(μ_k, Σ_k) ∏_{c=1}^{C} ∏_{k=1}^{K} q(f_ck) ∏_{j=0}^{K} q(π_j^(c)) ∏_{n=1}^{N} q(z^(n)).

Note that V_k and T_k do not have conjugate posteriors; their variational distributions are therefore selected for better accuracy and convenience. Here a*_0k is an estimate of the initial probability of state k, a*_{j1 j2} (j1 > 0, j2 > 0) is an estimate of the probability of the transition from state j1 to j2, and b*_tj is an estimate of the emission probability density given that the system is in state j at time point t. In order to simplify the presentation, we do not use the class sub-index and write a_i = {a_ij}, j = 1, . . . , K. Let φ be the set of variational parameters. We expand the lower bound as L(X, φ) = E_Q[ln P(X, W | θ)] − E_Q[ln Q], which is expressed in detail in Appendix B.

Parameter Update
In the framework of the variational mean-field approximation, the parameters of some variational distributions can be solved analytically using

ln q*(w_i) = E_{q(W\w_i)}[ln p(X, W | θ)] + const.

However, when the prior distribution and the posterior distribution over a latent variable are not conjugate, the variational distribution over this variable has no analytical solution. The parameters of such a variational distribution should instead be optimized through gradient-based methods, with the variational lower bound as the objective.
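As a sanity check of the coordinate update formula, consider the textbook conjugate case of mean-field inference for a Gaussian with unknown mean and precision, where every ln q* = E[ln p] + const update is available in closed form. This is a generic illustration, not the paper's model:

```python
import numpy as np

def mean_field_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate-ascent mean field q(mu)q(tau) for x_i ~ N(mu, 1/tau),
    mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0)."""
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0
    for _ in range(n_iter):
        # q(mu) = N(mu_N, 1/lam_N): depends on the current E[tau].
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # q(tau) = Gamma(a_N, b_N): depends on the current moments of q(mu).
        a_N = a0 + (N + 1) / 2.0
        E_sq = np.sum((x - mu_N) ** 2 + 1.0 / lam_N) \
               + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)
        b_N = b0 + 0.5 * E_sq
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```

Each factor's update is the exponentiated expected log joint under the other factors, exactly the formula above; cycling the two updates monotonically increases the lower bound.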
In our model, the variational distributions q(α), q(γ), q(μ_k, Σ_k), q(d_k), q(π_k), q(Z) have closed-form solutions, and we can derive their parameter update formulas according to (23). Since the variational distributions q(V_k), q(T_k), q(f_ck) cannot be solved analytically, we update their parameters with the corresponding gradients. Next, we give the way of calculating the variational distributions and show the procedure for training the variational BP-HMM in Algorithm 1. The detailed parameter update formulas and the gradients with respect to the parameters are presented in Appendix C.

Algorithm 1 Training the variational BP-HMM
...
5: for each trajectory n do
6:   Update q(z^(n))
7:   Calculate q(z_t^(n) = k) and q(z_t^(n) = j, z_{t+1}^(n) = k)
8: end for
9: for each class c do
10:  Update each q(π_k^(c)), k = 0, . . . , K
...

Remarks
Note that when updating the BP parameters, we need to calculate the expectation

E_q[ln p(f_ck | V_k, T_k)] = E_q[f_ck] E_q[ln(V_k e^{−T_k})] + (1 − E_q[f_ck]) E_q[ln(1 − V_k e^{−T_k})],

of which the second term is intractable. Following the work in [33], we use a Taylor expansion of E_q[ln(1 − V_k e^{−T_k})] about the point one,

E_q[ln(1 − V_k e^{−T_k})] ≈ −∑_{m=1}^{M} (1/m) E_q[V_k^m e^{−m T_k}].

For clarity, we denote each term (1/m) E_q[V_k^m e^{−m T_k}] in the Taylor expansion by k^(m).
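The accuracy of this truncated expansion can be checked numerically. The sketch below compares the M-term series against a Monte Carlo estimate, using the closed-form moments E[V^m] = α · m! · Γ(α)/Γ(α + m + 1) for V ~ Beta(1, α) and E[e^{−mT}] = (α/(α + m))^{d_k − 1} for T ~ Gamma(d_k − 1, rate α); the parameter values are arbitrary:

```python
import numpy as np
from math import gamma as G

def taylor_vs_mc(alpha=2.0, dk=3, M=10, n_mc=200000, seed=5):
    """Compare the M-term Taylor approximation of E[ln(1 - V e^{-T})]
    (from ln(1 - x) = -sum_m x^m / m) with a Monte Carlo estimate."""
    rng = np.random.default_rng(seed)
    V = rng.beta(1.0, alpha, n_mc)
    T = rng.gamma(dk - 1, 1.0 / alpha, n_mc)    # numpy scale = 1/rate
    mc = np.mean(np.log1p(-V * np.exp(-T)))
    taylor = 0.0
    for m in range(1, M + 1):
        EVm = alpha * G(1 + m) * G(alpha) / G(1 + m + alpha)
        EeT = (alpha / (alpha + m)) ** (dk - 1)
        taylor -= EVm * EeT / m
    return mc, taylor
```

With M = 10 terms the two estimates agree to roughly three decimal places, since V e^{−T} concentrates well below one for atoms in later rounds.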

Classification
Our model is applicable to trajectory recognition tasks such as human activity trajectory recognition. We use the proposed variational BP-HMM to model all the training data from different classes, with each component HMM corresponding to a class. Given the learned model with hyper-parameters and variational parameters {θ, φ}, a new test trajectory x* can be classified according to its marginal likelihood p(x* | θ, φ). Denoting by y* the label of the test trajectory, the classification criterion can be expressed as

y* = argmax_c p(x* | a*(c), b*, θ, φ),

where a*_jk(c) is an estimate of the probability of the transition from state j to k in the cth class.
The likelihood can be calculated through the forward-backward algorithm. This classification mechanism is more reasonable than the method in [1], as the transition matrix is actually learned.

Experiment
To demonstrate the effectiveness of our model on trajectory recognition, we conduct experiments on one synthetic dataset and two real-world datasets; the detailed data statistics are given in Table 1 and the following subsections. We compare our model with HCRF, LSTM [34], HMM-BIC and the sampled BP-HMM. In particular, in HCRF, the number of hidden states is set to 15 and the window size is set to 0. In LSTM, we use a recurrent neural network with one hidden layer as the architecture. In HMM-BIC, the state number is selected from the range [1, 20]. In the sampled BP-HMM, the hyper-parameters are set according to Sun et al. [1]. In the variational BP-HMM, the hyper-parameters {a_1, a_2, b_1, b_2, r, κ} are randomly initialized and selected by maximizing the variational lower bound, and the emission hyper-parameters are initialized with k-means. The state truncation parameter in the variational BP-HMM is set according to the specific dataset, e.g., K = 7 for the synthetic data and K = 20 for the two real-world datasets. All experiments are repeated ten times with different training/test splits, and the average classification accuracy with the standard deviation is reported.

Synthetic Data
The synthetic data, called control chart patterns (CCP), have some quantifiable similarities. They contain four pattern types and can be downloaded from the UCI machine learning repository. The CCP are trajectories that show the level of a machine parameter plotted against time. 400 trajectories are artificially generated following four pattern equations [35]; for example, the normal pattern (Normal) is y(t) = m + rs, where m = 3, s = 2 and 0 < r < 1. The classification results are shown in Table 2, obtained through five-fold cross-validation. To illustrate that the sharing patterns have been learned by our method, Hinton diagrams of the variational parameter V are given in Figure 4, where the occurrence probabilities of the hidden states are represented by the sizes of the blocks. For example, we can find that IT and DT share the 4th, 5th and 6th features. We compare our method with HCRF, LSTM, HMM-BIC and the sampled BP-HMM; as can be seen from Table 2, our method outperforms all the other methods.
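For reference, the normal pattern above can be generated as follows. The trajectory length T = 60 is an assumption (the source does not state it here), and r is drawn uniformly at each time step:

```python
import numpy as np

def normal_pattern(T=60, m=3.0, s=2.0, seed=6):
    """One 'Normal' control chart trajectory y(t) = m + r*s,
    with r ~ Uniform(0, 1) drawn independently per time step."""
    rng = np.random.default_rng(seed)
    r = rng.random(T)
    return m + r * s
```

The other three pattern types follow analogous closed-form equations with trend or shift terms added [35].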
In this experiment, the sharing patterns contribute to improving the performance. Since an HMM is created for each class of trajectories in our proposed method instead of each trajectory in the sampled BP-HMM, our method has better performance than the sampled BP-HMM.

Human Activity Trajectory Recognition
Human activity trajectory recognition (HATR) [36] is important in many applications such as health care. In our HATR experiment, parking lot data are collected from video [1]. We use the manually tagged data [1], which consist of 300 trajectories with 50 trajectories per class. Six classes are defined: "passing through south street" (PTSS), "passing through east street" (PTES), "going around" (GA), "crossing park horizontally" (CPH), "wandering in front of building" (WFB) and "walking in top street" (WTS). As shown in [1], the sampled BP-HMM is the best among HCRF, LSTM, HMM-BIC and the sampled BP-HMM for HATR. Here we use the same training and test data to compare the variational BP-HMM with the sampled BP-HMM. Table 3 compares the classification accuracy of the proposed VBP-HMM with HCRF, LSTM, HMM-BIC and the sampled BP-HMM in HATR. The results are obtained through five-fold cross-validation. As can be seen from Table 3, the accuracy of our method is 0.96, while the accuracy of the sampled BP-HMM is 0.91 [1]. The detailed confusion matrix for our method is given in Table 4. The state sharing patterns learned by the variational BP-HMM are displayed with Hinton diagrams in Figure 5, in which GA and CPH, as well as GA and WTS, are more likely to share states. This good performance verifies the superiority of modeling an HMM for each class. Moreover, we visualize some examples of correct classification and misclassification in Figures 6 and 7. As illustrated in Figure 7, the misclassified trajectories often contain deceptive subpatterns; for example, the CPH trajectory in subfigure (d) contains a back turn and a left turn like the GA class.

Wall-Following Navigation Task
We perform the Wall-Following navigation task (WFNT), in which data are collected from sensors on the mobile robot SCITOS-G5 [4]. We treat this task as trajectory recognition over historical sensor readings and select the dataset with two ultrasound sensors, because it keeps the cost as low as possible for civil applications while maintaining acceptable accuracy. There are 187 trajectories in the data, and four classes need to be recognized: "front distance" (F), "left distance" (L), "right distance" (R) and "back distance" (B). We randomly select 40 training trajectories with 10 per class. The confusion matrix of classification is shown in Table 5, and the state sharing patterns learned by the variational BP-HMM are displayed with Hinton diagrams in Figure 8, where R and F, as well as R and B, share a small number of states. The comparison of classification accuracy for our method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM is shown in Table 6. The results are obtained by five-fold cross-validation. Our method clearly outperforms the sampled BP-HMM, because we create an HMM for each class of trajectories rather than for each trajectory. Although the sharing patterns are not obvious in this experiment, our method still performs better than the other methods. As analyzed above, sharing patterns among different classes are learned automatically by our model, which helps to localize precisely the differences between classes; when there is no sharing pattern among classes, this advantage is weakened.

Performance Analysis
In our experiments, the results show that the proposed variational BP-HMM achieves a great improvement over the sampled BP-HMM, which averages transition matrices over the trajectories of each class. We attribute the advantages of the variational BP-HMM to the following reasons. Due to the small amount of training data in our experiments, the performance of LSTM is not satisfactory. HMM-BIC finds an optimal state number through model selection but cannot make use of the information shared among classes; its performance is the second best overall. Although the sampled BP-HMM can share hidden states among classes, it does not make correct use of the shared information in classification and thus does not obtain better results. Our proposed variational BP-HMM constructs a mechanism to learn shared hidden states by introducing state indicator variables and maintains class-specific state transition matrices, which are very helpful for classification tasks.
Moreover, we report the total running time of the variational BP-HMM, HMM-BIC, LSTM, HCRF and the sampled BP-HMM in Table 7, where we can see that the variational BP-HMM runs much more efficiently than the sampled BP-HMM. This is attributed to the efficiency of variational methods. Although the sampled BP-HMM and the variational BP-HMM have similar time complexity, due to the sampling operations, the running time of the sampled BP-HMM is usually several times that of the variational BP-HMM. In other words, the variational BP-HMM converges much faster than the sampled BP-HMM. Besides, compared with HMM-BIC, our method takes only about twice the time while achieving significant performance improvements. Overall, we conclude that the proposed variational BP-HMM is an effective and efficient method for trajectory recognition.

Conclusions
In this paper, we have proposed a novel variational BP-HMM for modeling and recognizing trajectories. The proposed variational BP-HMM has a shared hidden state space, which captures the commonality of cross-category data, and class-specific indicators, which distinguish the data from different classes. As a result, in the variational BP-HMM, multiple HMMs are used to model multiple classes of trajectories among which a hidden state space is shared.
The more reasonable assumptions of the proposed model make it more suitable for jointly modeling trajectories over all classes and thus for trajectory recognition. Experimental results on both synthetic and real-world data have verified that the proposed variational BP-HMM can find the feature sharing patterns among different classes, which helps to better model trajectories and further improve classification performance.
Appendix A. Prior Distributions of the Parameters

We use d_k to account for the round in which an atom appears. The two terms p(T_k | d_k, α) and p(d | γ) are given by Paisley et al. [33], with

p(T_k | d_k, α) = Gamma(T_k; d_k − 1, α),

and p(d | γ) determined by the fact that the number of atoms in each round, C_r = ∑_k δ(d_k = r), is drawn from Poisson(γ). In p(T_k | d_k, α), the indicator d_k selects the Gamma prior parameters of T_k; moreover, the term p(T_k | d_k, α) in (21) is removed if d_k = 1. The binary indicator δ(∑_{r'≥r} ∑_{k=1}^∞ δ(d_k = r') > 0) in p(d | γ) means that at least one of the K indexed atoms occurs in round r or a later round.

Appendix B. The Lower Bound L(X, φ)
We expand the lower bound as L(X, φ) = E_Q[ln P(X, W | θ)] − E_Q[ln Q], which is expressed below. Note that we multiply the entropy of T_k, −E_q[ln q(T_k)], by the variational probability ϕ_k(r > 1), as done in [33], to keep the entropy of T_k from blowing up when ϕ_k(1) → 1, where ϕ_k(r > 1) = ∑_{r>1} ϕ_kr = E_q[δ(d_k > 1)].

Appendix C. Coordinate Update for the Key Distributions
Appendix C.1. Coordinate Update for q(f_ck)
Since f_ck enters the Dirichlet distribution of π_k, the update for q(f_ck) cannot be obtained analytically. A gradient ascent algorithm is therefore used for updating f_ck, c ∈ {1, . . . , C}, based on the derivative of L with respect to υ_ck.

Appendix C.2. Coordinate Update for q(d_k)
The update for each ϕ_k is given below for r = 1, . . . , R. Let

ρ(r) = (r − 1)(Ψ(k_1) − ln k_2) − ln Γ(r − 1) + (r − 2)(Ψ(u_k) − ln v_k).
The cases r = 1 and r > 1 of the update involve the counts n_1k = ∑_{c=1}^{C} υ_ck and n_0k = C − n_1k, and the constant ξ = ∑_{x=1}^{∞} x^{−2} ln(x) ≈ 0.9375. For j = 0, π_j^(c) is the prior probability of the hidden states. Thus, the expressions of the posterior distributions over the hidden states are

q(z_t = j) = ι_j(t)β_j(t) / ∑_{j=1}^{K} ι_j(t)β_j(t),
q(z_t = j, z_{t+1} = k) = ι_j(t) a*_jk b*_{t+1,k} β_k(t+1) / ∑_{k=1}^{K} ι_k(t)β_k(t),

where ι and β denote the forward and backward messages. Now we can update the variational parameters in the BP-HMM according to the above equations. We judge the convergence of the updates according to the change of the lower bound.