Article

Variational Beta Process Hidden Markov Models with Shared Hidden States for Trajectory Recognition

School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(10), 1290; https://doi.org/10.3390/e23101290
Submission received: 9 August 2021 / Revised: 23 September 2021 / Accepted: 28 September 2021 / Published: 30 September 2021

Abstract

The hidden Markov model (HMM) is a vital model for trajectory recognition. Because the number of hidden states in an HMM is important and hard to determine, many nonparametric methods such as hierarchical Dirichlet process HMMs and Beta process HMMs (BP-HMMs) have been proposed to determine it automatically. Among these methods, the sampled BP-HMM models the shared information among different classes, which has been proved effective in several trajectory recognition scenarios. However, the existing BP-HMM maintains a state transition probability matrix for each trajectory, which is inconvenient for classification. Furthermore, the approximate inference of the BP-HMM is based on sampling methods, which usually take a long time to converge. To develop an efficient nonparametric sequential model that can capture cross-class shared information for trajectory recognition, we propose a novel variational BP-HMM, in which the hidden states can be shared among different classes and each class chooses its own hidden states and maintains a unified transition probability matrix. In addition, we derive a variational inference method for the proposed model, which is more efficient than sampling-based methods. Experimental results on a synthetic dataset and two real-world datasets show that, compared with the sampled BP-HMM and other related models, the variational BP-HMM has better performance in trajectory recognition.

1. Introduction

Trajectory recognition is important and meaningful in many practical applications, such as human activity recognition [1], speech recognition [2], handwritten character recognition [3] and navigation tasks with mobile robots [4]. In most practical applications, the trajectory is affected by the hidden features corresponding to each point. The hidden Markov model (HMM) [2], the hierarchical conditional random field (HCRF) [5,6] and HMM-based models, such as the hierarchical Dirichlet process hidden Markov model (HDP-HMM) [7], the Beta process hidden Markov model (BP-HMM) [8,9,10] and the Gaussian mixture model hidden Markov model (GMM-HMM) [2], are used to model sequential data and identify their classes [11,12,13,14].
The HMM is a popular model which has been applied widely in human activity recognition [1,15], speech recognition [2,16] and remote target tracking [2,17]. Besides, the HMM is becoming a significant building block of smart cities and Industry 4.0 [18,19], and has been implemented in applications such as driving behavior prediction [20] and the detection of internet of things (IoT) signature anomalies [21]. One drawback of the HMM is that the number of hidden states has to be fixed in advance, typically by model selection or cross-validation. To address this problem, several methods based on model selection are employed, such as BIC [22], or Bayesian nonparametric priors such as the BP [23] and the HDP [24]. Besides, directly using the original HMM for classification has another disadvantage: each HMM is trained for one class separately, so information from different classes cannot be shared. It is worth mentioning that the sampled BP-HMM proposed by Fox et al. [9] can not only learn the number of hidden features automatically but also capture the features shared between different classes, which has been proved meaningful for human activity trajectory recognition. The sampled BP-HMM learns the shared states among different classes by jointly modeling all trajectories together, in which a hidden state indicator with a BP prior is introduced for each trajectory, and thus a state transition matrix is maintained for each trajectory. When used for classification, the sampled BP-HMM calculates the class-specific transition matrix by averaging the transition matrices of the trajectories from the corresponding class. However, from the perspective of either performance or efficiency, if the sampled BP-HMM [1,7] is used for classification, there is still much room for improvement.
From the perspective of performance, the classification procedure in the sampled BP-HMM [1] is too rough to make full use of the trained model: the state transition matrix for each class is calculated by averaging the transition matrices of all the trajectories. Obviously, this leads to a loss of information, especially when the training set has some ambiguous trajectories, for instance, a "running" class containing some "jogging" trajectories. One naive solution is to select the K best HMMs for each class; however, selecting representatives for each class is time-consuming. In order to take account of both performance and efficiency, we change the way of modeling data in BP-HMMs. Unlike previous versions of BP-HMMs [1,8,9,10,25], in variational BP-HMMs, an HMM is created for each class instead of for each trajectory.
From the perspective of efficiency, the existing approximate inference for the BP-HMM is based on sampling methods [1,9], which often converge slowly. This drawback of the sampled BP-HMM [1] is inconvenient for practical applications. To provide a faster convergence rate than sampling methods, we develop variational inference for the BP-HMM: the iteration stops when the variational lower bound is unchanged or almost unchanged. To be amenable to the variational method, we use the stick-breaking construction of the BP [26] instead of the Indian buffet process (IBP) construction [27] used in the sampled BP-HMM.
In this paper, we propose a variational BP-HMM for trajectory recognition, in which both the way of modeling the data and the inference method are novel compared with the previous sampled BP-HMM. On the one hand, the new way of modeling trajectories enables the model to obtain better classification performance. Specifically, the hidden states can be optionally shared, and the class-specific state indicator is more suitable for classification than the trajectory-specific state indicator in the sampled BP-HMM. The transition matrix is actually learned from the data instead of being obtained by averaging all the trajectory-specific transitions. On the other hand, the derived variational inference of the BP-HMM makes the model more efficient. In particular, we use the two-parameter BP as the prior of the class-specific state indicator, which is more flexible than the one-parameter Indian buffet process in the sampled BP-HMM. We apply our model to the navigation task of mobile robots and to human activity trajectory recognition. Experimental results on synthetic and real-world data show that the proposed variational BP-HMM with shared hidden states has advantages for trajectory recognition.
The remainder of this paper is organized as follows. Section 2 gives an overview of the BP and HMM. In Section 3, we review the model assumption of the sampled BP-HMM. In Section 4, we present the proposed variational BP-HMM including the model setting and its variational inference procedure. Experimental results on both synthetic and real-world datasets are reported in Section 5. Finally, Section 6 gives the conclusion and future research directions.

2. Preliminary Knowledge

In order to explain the variational BP-HMM more clearly, the key related background, including the BP and the HMM, is introduced in the following subsections.

2.1. Beta Process

The BP was defined by Hjort [28] for applications in survival analysis. It has found significant application as a nonparametric prior for latent factor models [23,26], and has been used as a nonparametric prior for selecting the hidden state set of the HMM [8,9,25]. Originally, the BP was defined on the positive real line $\mathbb{R}^+$ and was then extended to more general spaces $\Omega$ (e.g., $\mathbb{R}$).
A BP, $B \sim \mathrm{BP}(\alpha, B_0)$, is a positive Lévy process. Here, $\alpha$ is the concentration parameter and $B_0$ is a fixed base measure on $\Omega$. Let $\gamma = B_0(\Omega)$. The $\mathrm{BP}(\alpha, B_0)$ is formulated as
$$B = \sum_{k=1}^{\infty} \pi_k \delta_{\omega_k}, \qquad \omega_k \overset{i.i.d.}{\sim} \frac{1}{\gamma} B_0,$$
where $\{\omega_k\}$ are the atoms of $B$. If $B_0$ is continuous, the Lévy measure of the BP is expressed as
$$\nu(d\omega, d\pi) = \alpha\, \pi^{-1} (1 - \pi)^{\alpha - 1}\, d\pi\, B_0(d\omega).$$
If $B_0$ is discrete, of the form $B_0 = \sum_k q_k \delta_{\omega_k}$, the atoms in $B$ and $B_0$ share the same locations. It can be represented as follows:
$$B_K = \sum_{k=1}^{K} \pi_k \delta_{\omega_k}, \qquad \pi_k \overset{i.i.d.}{\sim} \mathrm{Beta}\!\left(\alpha \frac{\gamma}{K},\; \alpha\Big(1 - \frac{\gamma}{K}\Big)\right), \qquad \omega_k \overset{i.i.d.}{\sim} \frac{1}{\gamma} B_0.$$
As $K \to \infty$, $B_K \to B$, and $B$ represents a BP [29].
The BP is conjugate to a class of Bernoulli processes, denoted $\mathrm{BeP}(B)$. For example, we define a Bernoulli process $F \sim \mathrm{BeP}(B)$. In this article, we focus on the discrete case, where $B = \sum_k \pi_k \delta_{\omega_k}$ with each $\pi_k \in [0, 1]$; the Bernoulli process can then be expressed as $F = \sum_k b_k \delta_{\omega_k}$, where $b_k$ is an independent Bernoulli variable with success probability $\pi_k$. If $B$ is a BP, then
$$B \sim \mathrm{BP}(\alpha, B_0), \qquad F \sim \mathrm{BeP}(B)$$
is called the Beta-Bernoulli process.
Similarly to the Dirichlet process, which has two principal methods for drawing samples, (1) the Chinese restaurant process [30] and (2) the stick-breaking process [31], the BP generates samples using the Indian buffet process (IBP) [23] and the stick-breaking process [29].
The original IBP can be seen as a special case of the general BP, i.e., an IBP is a one-parameter BP. Similarly to the Chinese restaurant process, the IBP is described in terms of customers choosing dishes. It can also be employed to construct two-parameter BPs, with some details changed. Specifically, the procedure for constructing $\mathrm{BP}(\alpha, B_0)$ with $\gamma = B_0(\Omega)$ is as follows:
  • The first customer takes the first $\mathrm{Poisson}(\gamma)$ dishes.
  • The $n$th customer then takes each previously sampled dish $k$ with probability $\frac{m_k}{\alpha + n - 1}$, where $m_k$ is the number of customers who have already sampled dish $k$. He also takes $\mathrm{Poisson}\!\left(\frac{\alpha \gamma}{\alpha + n - 1}\right)$ new dishes.
The BP has been shown to be the de Finetti mixing distribution underlying the Indian buffet process, and an algorithm has been presented to generate the BP [23].
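As a rough illustration of the two steps above, the following minimal NumPy sketch draws the binary customer-by-dish matrix of a two-parameter IBP; the function name and layout are our own choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ibp_sample(n_customers, alpha, gamma):
    """Sketch of the two-parameter IBP: returns a binary
    customers-by-dishes matrix following the two steps above."""
    dish_counts = []                                   # m_k for each dish k
    rows = []
    for n in range(1, n_customers + 1):
        # take each existing dish k with probability m_k / (alpha + n - 1)
        row = [rng.random() < m / (alpha + n - 1) for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # then take Poisson(alpha * gamma / (alpha + n - 1)) new dishes
        n_new = rng.poisson(alpha * gamma / (alpha + n - 1))
        dish_counts.extend([1] * n_new)
        rows.append(row + [True] * n_new)
    F = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for n, row in enumerate(rows):
        F[n, :len(row)] = row
    return F
```

For $n = 1$ the Poisson rate reduces to $\gamma$, matching the first step.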
The stick-breaking process of the BP, $B \sim \mathrm{BP}(\alpha, B_0)$, is provided by Paisley et al. [29]. It is formulated as follows:
$$B = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} V_{ij}^{(i)} \prod_{l=1}^{i-1} \big(1 - V_{ij}^{(l)}\big)\, \delta_{\omega_{ij}}, \qquad C_i \overset{i.i.d.}{\sim} \mathrm{Poisson}(\gamma), \qquad V_{ij}^{(l)} \overset{i.i.d.}{\sim} \mathrm{Beta}(1, \alpha), \qquad \omega_{ij} \overset{i.i.d.}{\sim} \frac{1}{\gamma} B_0.$$
The above equations show that in every round (indexed by $i$), $C_i$ atoms are selected, where $C_i$ is drawn from $\mathrm{Poisson}(\gamma)$; the weight of each atom follows an $i$-fold stick-breaking process in which each break is drawn from $\mathrm{Beta}(1, \alpha)$.
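A minimal sketch of this construction, truncated to a finite number of rounds (the truncation level and function name are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def bp_stick_breaking(alpha, gamma, n_rounds=20):
    """Truncated stick-breaking draw of the BP weights; atom
    locations omega_ij are omitted for brevity."""
    weights = []
    for i in range(1, n_rounds + 1):          # round index i
        C_i = rng.poisson(gamma)              # number of atoms in round i
        for _ in range(C_i):
            V = rng.beta(1.0, alpha, size=i)  # V_ij^(1), ..., V_ij^(i)
            # weight = V^(i) * prod_{l < i} (1 - V^(l))
            weights.append(V[-1] * np.prod(1.0 - V[:-1]))
    return np.array(weights)
```

Atoms appearing in later rounds receive geometrically smaller weights on average, which is what makes the truncation reasonable.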

2.2. Hidden Markov Models

The HMM [2] is a state space model in which each sequence is governed by a Markov chain of discrete latent variables, with each observation conditioned on the state of the corresponding latent variable. HMMs are therefore appropriate for modeling data that vary over time and can be considered to be generated by a process that switches between different phases or states at different time points. The HMM has proved to be a valuable tool in human activity recognition, speech recognition and many other popular areas [32].
Suppose that the trajectory observation $X = \{x_1, \ldots, x_T\}$ is a $T \times d$ matrix and $Z = \{z_1, \ldots, z_T\}$ is a $T$-dimensional latent variable vector taking values in a set $\Omega_1$ of size $K$. The joint distribution of $X$ and $Z$ is expressed as
$$p(X, Z \mid \theta) = p(z_1 \mid \pi_0) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \Pi) \prod_{t=1}^{T} p(x_t \mid z_t, \phi),$$
where $\theta = \{\pi_0, \Pi, \phi\}$; $\Pi$ is a $K \times K$ matrix with $\pi_{jk} = p(z_{t+1} = k \mid z_t = j)$, $t \in \{1, \ldots, T-1\}$, $j, k \in \Omega_1$, and $\sum_k \pi_{jk} = 1$; and $\pi_0$ is a $K$-dimensional vector with $\pi_{0k} = p(z_1 = k)$, $k \in \Omega_1$, and $\sum_k \pi_{0k} = 1$. Furthermore, in the nonparametric version of the HMM, the rows of $\Pi$ can be assumed to obey a Dirichlet distribution, i.e., $\pi_j \sim \mathrm{Dir}(\alpha_1, \alpha_2, \ldots, \alpha_K)$ with $\sum_k \alpha_k = 1$. The probabilistic graphical model is represented in Figure 1.
If $x_t$ is discrete with value set $\Omega_2$ of size $D$, $\phi$ is a $K \times D$ matrix with elements $\phi_{ij} = p(x_t = j \mid z_t = i)$, $i \in \Omega_1$, $j \in \Omega_2$. $\Pi$ and $\phi$ are named the transition matrix and the emission matrix, respectively. If $x_t$ is continuous, the emission matrix is replaced by an emission distribution, where $\phi$ is often defined through a distribution such as a Gaussian, $p(x_t \mid z_t = k) = \mathcal{N}(x_t \mid \mu_k, \Sigma_k)$, $k \in \Omega_1$. In the fully Bayesian framework, $(\mu_k, \Sigma_k)$ can be regarded as random variables with a prior such as the normal inverse Wishart or a Gaussian-Gamma distribution.
The marginal likelihood is often used to evaluate how well an HMM fits the trajectories. Therefore, the HMM is usually trained by maximizing the marginal likelihood over the training trajectories. The Baum–Welch (BW) algorithm, an EM method, is the classical algorithm for learning the parameters of HMMs, which include the transition matrix, the initial state distribution and the emission matrix (or the emission distribution's parameters). In the BW algorithm, the forward-backward algorithm is employed to calculate the marginal probabilities; a log-space sketch of the forward pass is given below. It should be noted that since the BW algorithm can only find a local optimum, multiple initializations are usually used to obtain better solutions. Given the learned parameters, the most likely state sequence corresponding to a trajectory is required in many practical applications; the Viterbi algorithm is an effective method to obtain the most probable state sequence.
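The following sketch computes the log marginal likelihood $\ln p(X \mid \theta)$ of a Gaussian-emission HMM by a scaled forward pass; parameter names are ours, and the per-step shift by the row maximum is only for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hmm_log_likelihood(X, pi0, Pi, means, covs):
    """Scaled forward pass returning ln p(X | theta) for one trajectory.
    X: (T, d) observations; pi0: (K,) initial probabilities;
    Pi: (K, K) transition matrix; means/covs: emission parameters."""
    T, K = len(X), len(pi0)
    logB = np.column_stack([multivariate_normal.logpdf(X, means[k], covs[k])
                            for k in range(K)])  # ln p(x_t | z_t = k)
    alpha = pi0 * np.exp(logB[0] - logB[0].max())
    ll = logB[0].max() + np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, T):
        alpha = np.exp(logB[t] - logB[t].max()) * (alpha @ Pi)
        ll += logB[t].max() + np.log(alpha.sum()); alpha /= alpha.sum()
    return ll
```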
HMMs are a kind of generative model: they model the distribution of all the observed data. In trajectory classification tasks, such as activity trajectory recognition, different HMMs are used to model different classes of trajectories separately. After training these HMMs, the parameters of the different HMMs are used to evaluate a newly arrived trajectory and find the most probable class. Specifically, to model multiple trajectories from different classes, a separate HMM is defined for each class of trajectories, where $\theta_c$ represents its parameters. Given the trained HMMs, the class label $y^*$ of a new test trajectory $x^*$ is determined according to
$$y^* = \arg\max_c \ln p(x^* \mid \theta_c),$$
where $p(x^* \mid \theta_c)$ can be calculated using the forward-backward algorithm.
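Reusing the hmm_log_likelihood sketch above, this classification rule is a one-line argmax over the per-class parameter sets (here stored as a hypothetical list of tuples):

```python
import numpy as np

def classify(x_new, thetas):
    """Assign x_new to the class whose HMM gives it the highest
    log marginal likelihood, as in the argmax criterion above.
    thetas: list of (pi0, Pi, means, covs) tuples, one per class."""
    scores = [hmm_log_likelihood(x_new, *theta) for theta in thetas]
    return int(np.argmax(scores))
```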

3. The Sampled BP-HMM

The sampled BP-HMM [9] was proposed to discover the available hidden states and the sharing patterns among different classes. It jointly models multiple trajectories and learns a state transition matrix for each trajectory. The sampled BP-HMM has been successfully applied to trajectory recognition tasks, such as human activity trajectory recognition [1,10].
The sampled BP-HMM uses HMMs to model all the trajectories from all the classes and uses the BP as the prior of the indicator variables, with one indicator per trajectory. Suppose $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(n)}, \ldots, X^{(N)}\}$, $N \in \mathbb{N}^+$, where $X^{(n)}$ is the $n$th trajectory. Each trajectory is modeled by an HMM. These HMMs share a global hidden state set $\Omega$ of infinite size. The sampled BP-HMM uses a hidden state selection matrix $F$ of size $N \times \infty$ to indicate the available states for each trajectory, i.e., $f_{nk} \in \{0, 1\}$ indicates whether the $n$th HMM owns the $k$th state. The prior of the transition matrix $\Pi^{(n)}$ for each trajectory is related to $F$. The transition matrix of the $n$th HMM is
$$\pi_j^{(n)} \sim \mathrm{Dir}\big([r, r, \ldots, r + \kappa, r, \ldots] \odot f_n\big), \quad j > 0,$$
and the initial state probability vector $\pi_0^{(n)}$ is also related to $F$,
$$\pi_0^{(n)} \sim \mathrm{Dir}\big([r, r, \ldots, r, r, \ldots] \odot f_n\big),$$
where $\odot$ denotes the element-wise product. Similarly to the standard HMM, the latent variable $z^{(n)}$ is a discrete sequence with
$$z_1^{(n)} \sim \pi_0^{(n)}, \qquad z_{t+1}^{(n)} \mid z_t^{(n)} \sim \pi_{z_t^{(n)}}^{(n)}, \qquad t = 1, \ldots, T,$$
and the emission distribution of the $n$th HMM is
$$x_t^{(n)} \mid z_t^{(n)} \sim \mathcal{N}\big(\mu_{z_t^{(n)}}, \Sigma_{z_t^{(n)}}\big), \qquad (\mu_k, \Sigma_k) \sim \mathrm{NIW}(u_0, \lambda_0, \Phi_0, \upsilon_0).$$
In order to build a nonparametric model, the hidden state selection matrix $F$ is constructed by a Beta-Bernoulli process:
$$B \sim \mathrm{BP}(\alpha, B_0), \qquad f_i \mid B \sim \mathrm{BeP}(B).$$
From the characteristics of BPs, we can see that the greater the concentration parameter $\alpha$, the sparser the hidden state selection matrix $F$, while a greater $\gamma$ leads to more hidden features.
Given the above model assumptions, the sampled BP-HMM uses Gibbs sampling to train the model and gradient-based methods to learn the parameters. With the state transition matrix of each trajectory learned, the average state transition matrix of each class is calculated by the mean operation. New test trajectories are classified according to their likelihoods conditioned on each class.

4. The Proposed Variational BP-HMM

In this section, we introduce the proposed variational BP-HMM, which has more reasonable assumptions and a more efficient inference procedure than the sampled BP-HMM. We first describe the key points of our model and present our stick-breaking representation for the BP, which allows for variational inference. Then we give the joint distribution of the proposed BP-HMM and the variational inference for it.

4.1. BP-HMM with the Shared Hidden State Space and Class Specific Indicators

As introduced above, the existing sampled BP-HMM can jointly learn the trajectories from different classes by sharing a common hidden state space. It can also automatically determine the available states and the corresponding transition matrices for a trajectory by introducing the state selection matrix $F$. However, in the sampled BP-HMM, the state transition matrix and initial probabilities are trajectory-specific, and it is not appropriate to average these transition matrices and probabilities to obtain an average matrix and probabilities for each class.
In order to model trajectories from different classes more reasonably, we introduce a shared hidden state space and class-specific indicators. We define a state selection vector $f_c$ for each class, which is used to distinguish the differences between classes, and define the state initial probabilities $\pi_0$ and transition matrix rows $\pi_j$ for each class, which are used to capture the commonness within one class. The transition matrix of the $c$th class from state $j$ is
$$\pi_j^{(c)} \sim \mathrm{Dir}\big([r, r, \ldots, r + \kappa, r, \ldots] \odot f_c\big), \quad j > 0,$$
and the initial state probability vector $\pi_0^{(c)}$ is also related to $f_c$,
$$\pi_0^{(c)} \sim \mathrm{Dir}\big([r, r, \ldots, r, r, \ldots] \odot f_c\big).$$
Similarly to the standard HMM, the latent variable $z^{(n)}$ for the $n$th trajectory is a discrete sequence with
$$z_1^{(n)} \sim \pi_0^{(y_n)}, \qquad z_{t+1}^{(n)} \mid z_t^{(n)} \sim \pi_{z_t^{(n)}}^{(y_n)}, \qquad t = 1, \ldots, T,$$
where $y_n$ denotes the class of the $n$th trajectory.
In terms of modeling, the proposed new version of the BP-HMM is different from the sampled BP-HMM [1], which learns an HMM for each trajectory, and it is also different from traditional HMMs, which learn an HMM for each class separately. The proposed BP-HMM uses all the sequences from different classes to jointly train a whole BP-HMM, with each HMM corresponding to one class. Therefore, the proposed BP-HMM can better model the trajectories from multiple classes and can further achieve better classification.

4.2. A Simpler Representation for Beta Process

Besides the model assumptions, the proposed variational BP-HMM uses a different representation of the BP. As introduced in Section 2, the IBP construction of the BP describes the process by conditional distributions. This kind of representation is only suitable for sampling methods, similarly to the Chinese restaurant construction of DPs. Therefore, differently from the sampled BP-HMM, which uses the IBP construction of the BP to lend itself to a Gibbs sampler, we use the stick-breaking construction of the BP to suit variational inference. There is some existing work on constructing stick-breaking representations of BPs for variational inference. A stick-breaking construction has been given for the IBP, which is closely related to the BP and can be seen as a one-parameter BP [26]. The two-parameter BP has also been constructed through stick-breaking processes to serve variational inference [29]. Recently, a simpler representation of the two-parameter BP based on the stick-breaking construction was developed to allow simpler variational inference [33]. In order to approximate the posterior of the BP with variational Bayesian methods more easily, we adopt this simpler representation of the BP [33]. Let $d_k$ mark the round in which the $k$th atom appears. That is,
$$d_k = 1 + \sum_{i=1}^{\infty} \delta\Big(\sum_{j=1}^{i} C_j < k\Big).$$
Note that $\delta(\cdot)$ is a binary indicator which equals 1 if its argument is true. Using the latent indicators, the representation of $B$ in (6) is simplified as
$$B = \sum_{k=1}^{\infty} V_k^{(d_k)} \prod_{l=1}^{d_k - 1} \big(1 - V_k^{(l)}\big)\, \delta_{\omega_k},$$
with $\omega$ and $V$ drawn as before.
Let $T_k = -\sum_{l < d_k} \ln\big(1 - V_k^{(l)}\big)$. Since each individual term $-\ln\big(1 - V_k^{(l)}\big) \overset{i.i.d.}{\sim} \mathrm{Exponential}(\alpha)$, it follows that $T_k \sim \mathrm{Gamma}(d_k - 1, \alpha)$. This gives the following representation of the BP,
$$B = \sum_{k=1}^{\infty} V_k e^{-T_k} \delta_{\omega_k}, \qquad V_k \overset{i.i.d.}{\sim} \mathrm{Beta}(1, \alpha), \qquad T_k \sim \mathrm{Gamma}(d_k - 1, \alpha),$$
$$\sum_{k=1}^{\infty} \delta(d_k = r) \overset{i.i.d.}{\sim} \mathrm{Poisson}(\gamma), \quad r \in \mathbb{N}^+, \qquad \omega_k \overset{i.i.d.}{\sim} \frac{1}{\gamma} B_0.$$
Here we should notice that each $d_k$ does not itself have a distribution; rather, the cardinality of $\{k : d_k = r\}$ is drawn from $\mathrm{Poisson}(\gamma)$. In addition, $T_k = 0$ with probability one when $d_k = 1$. In this BP, the atom is $\omega_k = \{\mu_k, \Sigma_k\}$, and Gamma priors with hyper-parameters $\{a_1, a_2\}$ and $\{b_1, b_2\}$ are given to $\alpha$ and $\gamma$:
$$\alpha \sim \mathrm{Gamma}(a_1, a_2), \qquad \gamma \sim \mathrm{Gamma}(b_1, b_2).$$
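A minimal sketch of drawing the weights $V_k e^{-T_k}$ under this simplified representation (the truncation level and function name are our own choices; $\alpha$ is used as the rate of the Gamma draws):

```python
import numpy as np

rng = np.random.default_rng(0)

def bp_simple_draw(alpha, gamma, n_rounds=20):
    """Draw BP weights pi_k = V_k * exp(-T_k) under the simplified
    representation above, truncated at n_rounds rounds."""
    counts = rng.poisson(gamma, size=n_rounds)         # atoms per round r
    d = np.repeat(np.arange(1, n_rounds + 1), counts)  # round index d_k
    V = rng.beta(1.0, alpha, size=d.size)
    # T_k ~ Gamma(d_k - 1, rate=alpha) for d_k > 1, and T_k = 0 for d_k = 1
    T = np.where(d > 1,
                 rng.gamma(np.maximum(d - 1, 1), 1.0 / alpha), 0.0)
    return V * np.exp(-T)
```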

4.3. Joint Distribution of the Proposed BP-HMM

Assume that the total number of classes is $C$ and the number of trajectories is $N$. Let $X$ represent the data, let $W = \{\alpha, \gamma, \{\mu_k, \Sigma_k\}, \{d_k\}, \{V_k\}, \{T_k\}, \{f_{ck}\}, \{\pi_k^{(c)}\}, Z\}$ represent the set of all latent variables in the model, let $\theta$ be the set of all hyper-parameters, and let $Y$ be the set of all class labels.
The probabilistic graphical model is shown in Figure 2, where its joint likelihood is
$$p(X, W \mid \theta) = p(X \mid W, \theta) \times p(W \mid \theta).$$
The likelihood $p(X \mid W, \theta)$ is defined through multivariate normal emissions by
$$p(X \mid W, \theta) = \prod_{n=1}^{N} \prod_{t=1}^{T} \prod_{k=1}^{K} \mathcal{N}\big(x_t^{(n)} \mid \mu_k, \Sigma_k\big)^{\delta(z_t^{(n)} = k)}.$$
The prior distribution of the parameter W and detailed setup are expressed in Appendix A.

4.4. Variational Inference for the Proposed BP-HMM

We use a factorized variational distribution over all the latent variables to approximate the intractable posterior $p(W \mid X, \theta)$. Two truncations are set in the inference: one is the truncation of the number of hidden states at $K$, and the other is the truncation of the number of rounds at $R$. Specifically, we assume the variational distribution as
$$Q = q(\alpha)\, q(\gamma) \prod_{k=1}^{K} \Big\{ q(\mu_k, \Sigma_k)\, q(d_k)\, q(V_k)\, q(T_k) \times \prod_{c=1}^{C} q(f_{ck})\, q\big(\pi_k^{(c)}\big) \Big\} \prod_{c=1}^{C} q\big(\pi_0^{(c)}\big) \prod_{n=1}^{N} q\big(z^{(n)}\big),$$
where
$$\begin{aligned}
q(\alpha) &= \mathrm{Gamma}(\alpha \mid k_1, k_2), \qquad q(\gamma) = \mathrm{Gamma}(\gamma \mid \tau_1, \tau_2),\\
q(\mu_k, \Sigma_k) &= \mathrm{NIW}(\mu_k, \Sigma_k \mid u_k, \lambda_k, \Phi_k, \upsilon_k), \qquad q(d_k) = \mathrm{Multinomial}(d_k \mid \varphi_k),\\
q(V_k) &= \mathrm{Beta}(V_k \mid \tau_{k1}, \tau_{k2}), \qquad q(T_k) = \mathrm{Gamma}(T_k \mid u_k, v_k),\\
q(f_{ck}) &= \mathrm{Bernoulli}(f_{ck} \mid \upsilon_{ck}), \qquad q\big(\pi_k^{(c)}\big) = \mathrm{Dir}\big(\pi_k^{(c)} \mid r_{k1}^{(c)}, r_{k2}^{(c)}, \ldots, r_{kK}^{(c)}\big),\\
q\big(z^{(n)} \mid y_n\big) &= \prod_{t=1}^{T} \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \big(a_{k_1 k_2}^{(y_n)}\big)^{\delta(z_t^{(n)} = k_1,\, z_{t+1}^{(n)} = k_2)} \times \prod_{k=1}^{K} \big(a_{0k}^{(y_n)}\big)^{\delta(z_0^{(n)} = k)} \prod_{t=1}^{T} \prod_{k=1}^{K} \big(b_{tk}^{(y_n)}\big)^{\delta(z_t^{(n)} = k)}.
\end{aligned}$$
It is obvious that $V_k$ and $T_k$ do not have conjugate posteriors; thus their variational distributions are chosen for better accuracy and convenience. Here $a_{0k}^*$ is an estimate of the initial state probability, $a_{j_1 j_2}^*$ with $j_1 > 0$ and $j_2 > 0$ is an estimate of the probability of transitioning from state $j_1$ to $j_2$, and $b_{tj}^*$ is an estimate of the emission probability density given that the system is in state $j$ at time point $t$. In order to simplify the representation, we do not use sub-indices; here $a_i = \{a_{ij}\}$, $j = 1, \ldots, K$. Let $\phi$ be the set of variational parameters. We expand the lower bound as $\mathcal{L}(X, \phi) = \mathbb{E}_Q[\ln p(X, W \mid \theta)] - \mathbb{E}_Q[\ln Q]$, which is expressed in detail in Appendix B.

4.5. Parameter Update

In the framework of the variational mean-field approximation, the parameters of some variational distributions can be solved analytically using
$$\ln q(w_j) = \mathbb{E}_{q(W \setminus w_j)}\big[\ln p(X, W \mid \theta)\big] + \mathrm{const}.$$
However, when the prior distribution and the posterior distribution over a latent variable are not conjugate, the variational distribution over this variable has no analytical solution; its parameters should then be optimized through gradient-based methods with the variational lower bound as the objective.
In our model, the variational distributions $q(\alpha)$, $q(\gamma)$, $q(\mu_k, \Sigma_k)$, $q(d_k)$, $q(\pi_k)$ and $q(Z)$ have closed-form solutions, and we can get their parameter update formulas according to the mean-field formula above. The variational distributions $q(V_k)$, $q(T_k)$ and $q(f_{ck})$ cannot be solved analytically, so we update their parameters by the corresponding gradients. Next, we give the way of calculating the variational distributions and show the procedure for training the variational BP-HMM in Algorithm 1. The detailed parameter update formulas and the gradients with respect to the parameters are presented in Appendix C.
Algorithm 1 Variational Inference for the Proposed BP-HMM
  1: Initialize $\theta$ and $\phi$.
  2: Given $R$ and threshold, initialize RunTime = 0.
  3: while $|\mathcal{L} - \mathcal{L}_{\mathrm{old}}| \geq$ threshold and RunTime < $R$ do
  4:   $\mathcal{L}_{\mathrm{old}} = \mathcal{L}$
  5:   for each trajectory $n$ do
  6:     Update $q(z^{(n)})$
  7:     Calculate $q(z_t^{(n)} = k)$ and $q(z_t^{(n)} = k_1, z_{t+1}^{(n)} = k_2)$
  8:   end for
  9:   for each class $c$ do
 10:     Update each $q(\pi_k^{(c)})$, $k = 0, \ldots, K$
 11:     Update each $q(f_{ck})$, $k = 1, \ldots, K$
 12:   end for
 13:   for each $k = 1, \ldots, K$ do
 14:     Update $q(\mu_k, \Sigma_k)$, $q(d_k)$, $q(T_k)$, $q(V_k)$
 15:   end for
 16:   Update $q(\alpha)$, $q(\gamma)$
 17:   Calculate $\mathcal{L}$; RunTime = RunTime + 1
 18: end while

4.5.1. Calculation for q ( α ) , q ( γ ) , q ( μ k , Σ k ) , q ( d k ) , q ( π k ( c ) ) , q ( Z )

$$\begin{aligned}
\ln q(\alpha) &= \mathbb{E}_q\Big[\ln p(\alpha) + \sum_{k=1}^{K} \big(\ln p(V_k \mid \alpha) + \ln p(T_k \mid d_k, \alpha)\big)\Big],\\
\ln q(\gamma) &= \mathbb{E}_q\Big[\ln p(\gamma) + \sum_{k=1}^{K} \ln p(d_k \mid \gamma)\Big],\\
\ln q(\mu_k, \Sigma_k) &= \mathbb{E}_q\Big[\ln p(\mu_k, \Sigma_k \mid \theta) + \sum_{n=1}^{N} \sum_{t=1}^{T} \ln p\big(x^{(n)} \mid z^{(n)}, \mu_k, \Sigma_k\big)\Big],\\
\ln q(d_k) &= \mathbb{E}_q\Big[\ln p(d \mid \gamma) + \ln p(T_k \mid d_k, \alpha) + \sum_{c=1}^{C} \ln p(f_{ck} \mid V_k, T_k, d_k)\Big],\\
\ln q\big(\pi_k^{(c)}\big) &= \mathbb{E}_q\Big[\ln p\big(\pi_k^{(c)} \mid f_{ck}, r, \kappa\big) + \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c) \ln p\big(z_{t+1}^{(n)} \mid \pi_k^{(c)}, z_t^{(n)} = k\big)\Big],\\
\ln q\big(z^{(n)}\big) &= \mathbb{E}_q\Big[\ln p\big(x^{(n)} \mid z^{(n)}, \mu_k, \Sigma_k\big) + \ln p\big(z^{(n)} \mid \Pi^{(y_n)}\big)\Big].
\end{aligned}$$

4.5.2. Optimization for q ( V k ) , q ( T k ) , q ( f c k )

The variational parameters of $q(V_k)$, $q(T_k)$ and $q(f_{ck})$ are $\{\tau_{k1}, \tau_{k2}\}$, $\{u_k, v_k\}$ and $\{\upsilon_{ck}\}$, respectively. They are updated by gradient-based methods, for which the gradients of the lower bound $\mathcal{L}$ with respect to these parameters should be calculated.

4.5.3. Remarks

Note that when updating the BP parameters, we should calculate the expectation
$$\mathbb{E}_q[\ln p(f_{ck} \mid V_k, T_k)] = \upsilon_{ck}\, \mathbb{E}_q\big[\ln V_k e^{-T_k}\big] + (1 - \upsilon_{ck})\, \mathbb{E}_q\big[\ln\big(1 - V_k e^{-T_k}\big)\big],$$
of which the second term is intractable. Following the work in [33], we use a Taylor expansion of $\mathbb{E}_q[\ln(1 - V_k e^{-T_k})]$ about the point one,
$$\mathbb{E}_q\big[\ln\big(1 - V_k e^{-T_k}\big)\big] \approx -\sum_{m=1}^{M} \frac{1}{m} \mathbb{E}_q\big[\big(V_k e^{-T_k}\big)^m\big].$$
For clarity, we define each term $\frac{1}{m} \mathbb{E}\big[\big(V_k e^{-T_k}\big)^m\big]$ in the Taylor expansion using the notation $\Delta_k(m)$ as
$$\Delta_k(m) = \frac{1}{m} \frac{\Gamma(\tau_{k1} + \tau_{k2})}{\Gamma(\tau_{k1} + \tau_{k2} + m)} \frac{\Gamma(\tau_{k1} + m)}{\Gamma(\tau_{k1})} \Big(\frac{v_k}{v_k + m}\Big)^{u_k} = \frac{1}{m} \prod_{i=1}^{m} \frac{\tau_{k1} + i - 1}{\tau_{k1} + \tau_{k2} + i - 1} \Big(\frac{v_k}{v_k + m}\Big)^{u_k},$$
and define $\Delta_k(\cdot) = \sum_{m=1}^{M} \Delta_k(m)$. Therefore, $\mathbb{E}_q[\ln(1 - V_k e^{-T_k})] \approx -\Delta_k(\cdot)$.
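A small sketch of computing these Taylor terms in log space (the function name is ours; the Gamma is parameterized by shape $u_k$ and rate $v_k$):

```python
import numpy as np
from scipy.special import gammaln

def delta_terms(tau_k1, tau_k2, u_k, v_k, M=10):
    """Terms Delta_k(m) = (1/m) E[V_k^m] E[exp(-m T_k)] for m = 1..M,
    with V_k ~ Beta(tau_k1, tau_k2) and T_k ~ Gamma(u_k, rate=v_k)."""
    terms = []
    for m in range(1, M + 1):
        # E[V^m] for a Beta variable, computed via log-Gamma functions
        log_EV = (gammaln(tau_k1 + tau_k2) - gammaln(tau_k1 + tau_k2 + m)
                  + gammaln(tau_k1 + m) - gammaln(tau_k1))
        # E[exp(-m T)] is the Gamma moment generating function at -m
        log_ET = u_k * (np.log(v_k) - np.log(v_k + m))
        terms.append(np.exp(log_EV + log_ET) / m)
    return np.array(terms)

# E_q[ln(1 - V_k e^{-T_k})] is then approximated by -delta_terms(...).sum()
```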

4.6. Classification

Our model is applicable to trajectory recognition tasks such as human activity trajectory recognition. We use the proposed variational BP-HMM to model all the training data from different classes, with each HMM corresponding to a class. Given the learned model with the hyperparameters and variational parameters $\{\theta, \phi\}$, a new test trajectory $x^*$ can be classified according to its marginal likelihood $p(x^* \mid \theta, \phi)$. Denoting $y^*$ as the label of the test trajectory, the classification criterion can be expressed as
$$y^* = \arg\max_c \ln p\big(x^* \mid \{a_k^{*(c)}, u_k, \lambda_k, \upsilon_k, \Phi_k\}, a_0^{*(c)}\big) = \arg\max_c \ln \int p\big(x^* \mid \{\mu_k, \Sigma_k\}, z\big)\, p\big(z \mid \{a_k^{*(c)}\}\big) \prod_{k=1}^{K} p\big(\mu_k, \Sigma_k \mid u_k, \lambda_k, \upsilon_k, \Phi_k\big)\, dz\, d\mu_k\, d\Sigma_k,$$
where $a_{jk}^{*(c)}$ is an estimate of the probability of transitioning from state $j$ to $k$ in the $c$th class. The likelihood can be calculated through the forward-backward algorithm.
This classification mechanism is more reasonable than the method in [1], as the transition matrix is actually learned.

5. Experiment

To demonstrate the effectiveness of our model on trajectory recognition, we conduct experiments on one synthetic dataset and two real-world datasets; the detailed data statistics are given in Table 1 and the following subsections. We compare our model with HCRF, LSTM [34], HMM-BIC and the sampled BP-HMM. In particular, in HCRF, the number of hidden states is set to 15 and the size of the window is set to 0. In LSTM, we use a recurrent neural network with one hidden layer as the architecture. In HMM-BIC, the state number is selected from the range $[1, 20]$. In the sampled BP-HMM, the hyperparameters are set according to Sun et al. [1]. In the variational BP-HMM, the hyperparameters $\{a_1, a_2, b_1, b_2, r, \kappa\}$ are randomly initialized and selected by maximizing the variational lower bound, and the emission hyperparameters are initialized with k-means. The state truncation parameter in the variational BP-HMM is set according to the specific dataset, e.g., $K = 7$ for the synthetic data and $K = 20$ for the two real-world datasets. All experiments are repeated ten times with different training/test splits, and the average classification accuracy with the standard deviation is reported.

5.1. Synthetic Data

The synthetic data, called control chart patterns (CCP), have some quantifiable similarities. They contain four pattern types and can be downloaded from the UCI machine learning repository. The CCP are trajectories that show the level of a machine parameter plotted against time. 400 trajectories are artificially generated by the following four equations [35] (a generation sketch follows the equations):
1. Normal pattern (Normal): $y_t = m + rs$, where $m = 3$, $s = 2$ and $0 < r < 1$.
2. Cyclic pattern (Cyclic): $y_t = m + rs + a \sin(2\pi t / T)$, where $0 < a, T < 15$.
3. Increasing trend (IT): $y_t = m + rs + gt$, where $0.2 < g < 0.5$.
4. Decreasing trend (DT): $y_t = m + rs - gt$, where $0.2 < g < 0.5$.
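A minimal NumPy sketch of this generator, following the equations and ranges as stated above; the trajectory length and the lower bound of 1 on the cycle period $T$ (to avoid division by zero) are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ccp_trajectory(pattern, length=60, m=3.0, s=2.0):
    """Generate one control-chart trajectory of the given pattern."""
    t = np.arange(1, length + 1)
    noise = rng.uniform(0.0, 1.0, size=length) * s      # r * s, 0 < r < 1
    if pattern == "Normal":
        return m + noise
    if pattern == "Cyclic":
        a = rng.uniform(0.0, 15.0)                      # amplitude, 0 < a < 15
        T = rng.uniform(1.0, 15.0)                      # period (assumed >= 1)
        return m + noise + a * np.sin(2.0 * np.pi * t / T)
    g = rng.uniform(0.2, 0.5)                           # 0.2 < g < 0.5
    if pattern == "IT":
        return m + noise + g * t
    if pattern == "DT":
        return m + noise - g * t
    raise ValueError(f"unknown pattern: {pattern}")

# 400 trajectories: 100 per class
data = {p: np.stack([ccp_trajectory(p) for _ in range(100)])
        for p in ("Normal", "Cyclic", "IT", "DT")}
```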
Figure 3 shows the generated synthetic data. In this experiment, 20 trajectories are used for training, with 5 trajectories for each class. The classification results are presented in Table 2; they are obtained through five-fold cross-validation. In order to illustrate that the sharing patterns have been learned by our method, the Hinton diagrams of the variational parameter $V$ are given in Figure 4, where the occurrence probabilities of the hidden states are represented by the sizes of the blocks. For example, we can see that IT and DT share the 4th, 5th and 6th features.
We compare our method with HCRF, LSTM, HMM-BIC and the sampled BP-HMM. As we can see from Table 2, our method outperforms all the other methods.
In this experiment, the sharing patterns contribute to improving the performance. Since in our proposed method an HMM is created for each class of trajectories instead of for each trajectory as in the sampled BP-HMM, our method performs better than the sampled BP-HMM.

5.2. Human Activity Trajectory Recognition

Human activity trajectory recognition (HATR) [36] is important in many applications such as health care. In our human activity trajectory recognition experiment, parking lot data are collected from video [1]. We use the manually tagged data of [1], which contain 300 trajectories with 50 trajectories for each class. Six classes are defined: "passing through south street" (PTSS), "passing through east street" (PTES), "going around" (GA), "crossing park horizontally" (CPH), "wandering in front of building" (WFB) and "walking in top street" (WTS). As seen from [1], the sampled BP-HMM is the best method for HATR among HCRF, LSTM, HMM-BIC and the sampled BP-HMM. Here we use the same training and test data to compare the variational BP-HMM with the sampled BP-HMM. Table 3 shows the comparison of the classification accuracy of the proposed method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM in HATR. The results are obtained through five-fold cross-validation. As can be seen from Table 3, the accuracy of our method is 0.96, while the accuracy of the sampled BP-HMM is 0.91 [1]. The detailed confusion matrix for our method is given in Table 4. The state sharing patterns learned by the variational BP-HMM are displayed with Hinton diagrams in Figure 5, in which GA and CPH, as well as GA and WTS, are more likely to share states. The good performance verifies the superiority of modeling an HMM for each class. Moreover, we visualize some examples of correct classification and misclassification in Figure 6 and Figure 7. As illustrated in Figure 7, the misclassified trajectories often contain deceptive subpatterns; for example, the CPH trajectory in subfigure (d) contains a back turn and a left turn like the GA class.

5.3. Wall-Following Navigation Task

We perform the Wall-Following navigation task (WFNT), in which data are collected from the sensors on the mobile robot SCITOS-G5 [4]. We treat this task as trajectory recognition over historical sensor data, and select the dataset with two ultrasound sensors, since it keeps the cost as low as possible for civil applications while retaining acceptable accuracy. There are 187 trajectories in the data, and four classes need to be recognized: "front distance" (F), "left distance" (L), "right distance" (R) and "back distance" (B). We randomly select 40 training trajectories, with 10 for each class. The confusion matrix of the classification is shown in Table 5, and the state sharing patterns learned by the variational BP-HMM are displayed with Hinton diagrams in Figure 8, where R and F, as well as R and B, share a small number of states.
The comparison of the classification accuracy of our method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM is shown in Table 6. The results are obtained by five-fold cross-validation. Our method is clearly better than the sampled BP-HMM, because we create an HMM for each class of trajectories rather than for each trajectory. Although the sharing patterns are not obvious in this experiment, our method still performs better than the other methods. As we have analyzed, sharing patterns among different classes are learned automatically by our model, which helps to precisely localize the differences between classes. When there is no sharing pattern among classes, this advantage is weakened.

5.4. Performance Analysis

In our experiments, the results show that the proposed variational BP-HMM achieves a great improvement over the sampled BP-HMM, which averages the transitions over the trajectories of each class. We attribute the advantages of the variational BP-HMM to the following reasons. Due to the small amount of training data in our experiments, the performance of LSTM is not satisfactory. HMM-BIC finds an optimal state number through model selection, but it cannot make use of the information shared among classes; its performance is the second-best overall. Although the sampled BP-HMM can share hidden states among classes, it does not make correct use of the shared information in classification and thus does not obtain better results. Our proposed variational BP-HMM constructs a mechanism to learn shared hidden states by introducing state indicator variables, and maintains class-specific state transition matrices, which are very helpful for classification tasks.
Moreover, we give the total time cost of the variational BP-HMM, HMM-BIC, LSTM, HCRF and the sampled BP-HMM in Table 7, where we can see that the variational BP-HMM runs much more efficiently than the sampled BP-HMM. This is attributed to the efficiency of the variational methods. Although the sampled BP-HMM and the variational BP-HMM have similar time complexity, due to the sampling operation the running time of the sampled BP-HMM is usually several times that of the variational BP-HMM. In other words, the variational BP-HMM converges much faster than the sampled BP-HMM. Besides, compared with HMM-BIC, our method takes only about twice the time while achieving significant performance improvements. Above all, we can conclude that the proposed variational BP-HMM is an effective and efficient method for trajectory recognition.

6. Conclusions

In this paper, we have proposed a novel variational BP-HMM for modeling and recognizing trajectories. The proposed variational BP-HMM has a shared hidden state space, which is used to capture the commonality of the cross-category data, and class-specific indicators, which are used to distinguish the data from different classes. As a result, in the variational BP-HMM, multiple HMMs are used to model multiple classes of trajectories, among which a hidden state space is shared.
The more reasonable assumptions of the proposed model make it more suitable for jointly modeling trajectories over all classes and further performing trajectory recognition. Experimental results on both synthetic and real-world data have verified that the proposed variational BP-HMM can find the feature sharing patterns among different classes, which helps to better model the trajectories and further improve the classification performance. Moreover, compared with the sampled BP-HMM, the derived variational inference for the proposed BP-HMM reduces the time cost of the training procedure. The experimental time records also show the efficiency of the proposed variational BP-HMM.

Author Contributions

Conceptualization, J.Z. and S.S.; methodology, J.Z. and S.S.; Software, J.Z., Y.Z. and H.D.; formal analysis, J.Z. and Y.Z.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z., Y.Z. and S.S.; supervision, J.Z. and S.S.; Visualization, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the NSFC Projects 62006078 and 62076096, the Shanghai Municipal Project 20511100900, Shanghai Knowledge Service Platform Project ZF1213, the Shanghai Chenguang Program under Grant 19CG25, the Open Research Fund of KLATASDS-MOE and the Fundamental Research Funds for the Central Universities.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. The data presented in this study are openly available in UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/index.php, accessed on 27 September 2021, reference number [4,35,36].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Prior Distribution of the Parameter W

Denote the vector of round counts $\big(\sum_{k=1}^{\infty} \delta(d_k = r)\big)_{r \geq 1}$ as $d$. The prior distribution of the parameter $W$ is expressed as
$$\begin{aligned}
p(W \mid \theta) = {}& p(\alpha)\, p(\gamma)\, p(d \mid \gamma) \times \prod_{k=1}^{\infty} \Big\{ p(V_k \mid \alpha)\, p(T_k \mid d_k, \theta)\, p(\mu_k, \Sigma_k \mid \theta) \times \prod_{c=1}^{C} p(f_{ck} \mid V_k, T_k, d_k)\, p\big(\pi_k^{(c)} \mid f_c, \theta\big) \Big\}\\
& \times \prod_{c=1}^{C} p\big(\pi_0^{(c)} \mid f_c, \theta\big) \times \prod_{n=1}^{N} \Big\{ \prod_{t=1}^{T} \prod_{k_1=1}^{\infty} \prod_{k_2=1}^{\infty} \big(\pi_{k_1 k_2}^{(y_n)}\big)^{\delta(z_t^{(n)} = k_1,\, z_{t+1}^{(n)} = k_2)} \times \prod_{k=1}^{\infty} \big(\pi_{0k}^{(y_n)}\big)^{\delta(z_0^{(n)} = k)} \Big\}.
\end{aligned}$$
We use
$$p(f_{ck} \mid V_k, T_k, d_k) = p(f_{ck} \mid V_k)^{\delta(d_k = 1)}\, p(f_{ck} \mid V_k, T_k)^{\delta(d_k > 1)}$$
to account for the round in which an atom appears. The two terms $p(T_k \mid d_k, \alpha)$ and $p(d \mid \gamma)$ are given by Paisley et al. [33],
$$p(T_k \mid d_k, \alpha) = \left[\frac{\alpha^{\nu_k - 1}}{\Gamma(\nu_k - 1)}\, T_k^{\nu_k - 2}\, e^{-\alpha T_k}\right]^{\delta(d_k > 1)},$$
where $\nu_k = \sum_{r \geq 2} r\, \delta(d_k = r)$, so that this is the $\mathrm{Gamma}(d_k - 1, \alpha)$ density whenever $d_k > 1$, and
$$p(d \mid \gamma) = \prod_{r=1}^{\infty} \frac{\gamma^{\sum_k \delta(d_k = r)}}{\big(\sum_k \delta(d_k = r)\big)!}\, e^{-\gamma \delta_r}, \qquad \delta_r = \delta\Big(\sum_{r' \geq r} \sum_{k=1}^{K} \delta(d_k = r') > 0\Big).$$
In $p(T_k \mid d_k, \alpha)$, the indicator $d_k$ is used for selecting the Gamma prior parameters of $T_k$; moreover, the term $p(T_k \mid d_k, \alpha)$ vanishes when $d_k = 1$. The binary indicator $\delta_r$ in $p(d \mid \gamma)$ means that at least one of the $K$ indexed atoms occurs in round $r$ or a later round.

Appendix B. The Lower Bound L X , ϕ

We expand the lower bound as $\mathcal{L}(X, \phi) = \mathbb{E}_Q[\ln p(X, W \mid \theta)] - \mathbb{E}_Q[\ln Q]$, which is expressed as
$$\begin{aligned}
\mathcal{L}(X, \phi) = {}& \sum_{n=1}^{N} \Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{E}_q\big[\delta(z_t^{(n)} = k) \ln p(x_t \mid \mu_k, \Sigma_k, \theta)\big] + \sum_{t=1}^{T-1} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \mathbb{E}_q\big[\delta(z_t^{(n)} = k_1, z_{t+1}^{(n)} = k_2) \ln \pi_{k_1 k_2}^{(y_n)}\big]\\
& + \sum_{k=1}^{K} \mathbb{E}_q\big[\delta(z_0^{(n)} = k) \ln \pi_{0k}^{(y_n)}\big] \Big\} + \sum_{c=1}^{C} \Big\{ \sum_{k=0}^{K} \mathbb{E}_q\big[\ln p(\pi_k^{(c)} \mid f_c, \theta)\big] + \sum_{k=1}^{K} \mathbb{E}_q\big[\delta(d_k = 1) \ln p(f_{ck} \mid V_k)\big]\\
& + \sum_{k=1}^{K} \mathbb{E}_q\big[\delta(d_k > 1) \ln p(f_{ck} \mid V_k, T_k)\big] \Big\} + \sum_{k=1}^{K} \Big\{ \mathbb{E}_q\big[\ln p(\mu_k, \Sigma_k \mid \theta)\big] + \mathbb{E}_q\big[\ln p(T_k \mid \alpha, d_k)\big] + \mathbb{E}_q\big[\ln p(V_k \mid \alpha)\big] \Big\}\\
& + \sum_{r=1}^{\infty} \mathbb{E}_q\Big[\ln p\Big(\sum_k \delta(d_k = r) \,\Big|\, \gamma\Big)\Big] + \mathbb{E}_q[\ln p(\alpha)] + \mathbb{E}_q[\ln p(\gamma)] - \mathbb{E}_Q\big[\ln Q_{\setminus T}\big] - \sum_{k=1}^{K} \varphi_k(r > 1)\, \mathbb{E}_q\big[\ln q(T_k)\big],
\end{aligned}$$
where $Q_{\setminus T}$ denotes the variational distribution excluding the factors $q(T_k)$. Note that we multiply the entropy of $T_k$, $-\mathbb{E}_{q(T_k)}[\ln q(T_k)]$, by the variational probability $\varphi_k(r > 1)$, as done in [33], to keep the entropy of $T_k$ from blowing up when $\varphi_k(1) \to 1$, where $\varphi_k(r > 1) = \sum_{r > 1} \varphi_k(r) = \mathbb{E}_q[\delta(d_k > 1)]$.

Appendix C. Coordinate Update for the Key Distributions

Appendix C.1. Coordinate Update for q(fck)

Since the coupling between the Dirichlet distribution of $\pi_k$ and $f_{ck}$ prevents a direct analytical solution, the gradient ascent algorithm is used for updating $q(f_{ck})$, $c \in \{1, \ldots, C\}$. The derivative of $\mathcal{L}$ with respect to $\upsilon_{ck}$ is
$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \upsilon_{ck}} = {}& \sum_{j=0}^{K} \Big\{ \big(r^{(c)} - 1\big)\Big[\Psi\big(r_{jk}^{(c)}\big) - \Psi\Big(\sum_{t=1}^{K} r_{jt}^{(c)}\Big)\Big] \delta(j \neq k) + \big(r^{(c)} + \kappa - 1\big)\Big[\Psi\big(r_{jk}^{(c)}\big) - \Psi\Big(\sum_{t=1}^{K} r_{jt}^{(c)}\Big)\Big] \delta(j = k)\\
& + \Psi\Big(\sum_{t=1}^{K} \big[\upsilon_{ct}\, r^{(c)}\, \delta(t \neq j) + \upsilon_{cj}\big(r^{(c)} + \kappa\big) \delta(t = j)\big]\Big) \times \big[\big(r^{(c)} + \kappa\big)\delta(j = k) + r^{(c)}\delta(j \neq k)\big]\\
& - \Psi\big(\upsilon_{ck}\, r^{(c)}\big)\, r^{(c)}\, \delta(j \neq k) - \Psi\big(\upsilon_{ck}\big(r^{(c)} + \kappa\big)\big)\big(r^{(c)} + \kappa\big)\delta(j = k) \Big\}\\
& + \varphi_k(1)\big[\Psi(\tau_{k1}) - \Psi(\tau_{k2})\big] + \varphi_k(r > 1)\Big[\Psi(\tau_{k1}) - \Psi(\tau_{k1} + \tau_{k2}) - \frac{u_k}{v_k} + \Delta_k(\cdot)\Big] - \ln \upsilon_{ck} + \ln(1 - \upsilon_{ck}).
\end{aligned}$$

Appendix C.2. Coordinate Update for q(dk)

The update for each $\varphi_k$ is given below for $r = 1, \ldots, R$. Let
$$\rho_r = (r - 1)\big(\Psi(k_1) - \ln k_2\big) - \ln \Gamma(r - 1) + (r - 2)\big(\Psi(u_k) - \ln v_k\big).$$
If $r = 1$,
$$\varphi_k(1) \propto \exp\Big\{ n_{0k}\big(\Psi(\tau_{k2}) - \Psi(\tau_{k1} + \tau_{k2})\big) - \xi \sum_{i \neq k} \varphi_i(1) \Big\}.$$
If $r \geq 2$,
$$\varphi_k(r) \propto \exp\Big\{ n_{1k}\Big[\Psi(\tau_{k1}) - \Psi(\tau_{k1} + \tau_{k2}) - \frac{u_k}{v_k}\Big] - n_{0k}\, \Delta_k(\cdot) + H[q(T_k)] + \rho_r - \xi \sum_{i \neq k} \varphi_i(r) - \frac{\tau_1}{\tau_2} \sum_{j=2}^{r} \sum_{k' \neq k} \sum_{r'=1}^{j-1} \varphi_{k'}(r') \Big\},$$
where $n_{1k} = \sum_{c=1}^{C} \upsilon_{ck}$, $n_{0k} = C - n_{1k}$, and $\xi \approx 0.9375$ is a constant arising from the bound $x - x^2 \ln x$.

Appendix C.3. Coordinate Update for q(Vk)

We use the gradient ascent algorithm to jointly update $\tau_{k1}, \tau_{k2}$ by the gradients of the lower bound with respect to them. Let $\lambda_1 = n_{0k}\varphi_k(1) - n_{1k} - \frac{k_1}{k_2} - 1 + \tau_{k1} + \tau_{k2}$, $\lambda_2 = n_{0k}\varphi_k(r > 1)$, $\lambda_3 = n_{1k} + 1 - \tau_{k1}$ and $\lambda_4 = n_{0k}\varphi_k(1) + \frac{k_1}{k_2} - \tau_{k2}$. The derivatives are
$$\frac{\partial \mathcal{L}}{\partial \tau_{k1}} = \lambda_3 \Psi'(\tau_{k1}) + \lambda_1 \Psi'(\tau_{k1} + \tau_{k2}) - \lambda_2 \frac{\partial \Delta_k(\cdot)}{\partial \tau_{k1}}, \qquad \frac{\partial \mathcal{L}}{\partial \tau_{k2}} = \lambda_4 \Psi'(\tau_{k2}) + \lambda_1 \Psi'(\tau_{k1} + \tau_{k2}) - \lambda_2 \frac{\partial \Delta_k(\cdot)}{\partial \tau_{k2}}.$$
Since $\Psi(x)$ can be expanded as $\Psi(x) = -\gamma_0 + \sum_{k=0}^{\infty} \Big(\frac{1}{k + 1} - \frac{1}{x + k}\Big)$ (with $\gamma_0$ the Euler–Mascheroni constant) and its derivative is $\Psi'(x) = \sum_{k=0}^{\infty} \frac{1}{(x + k)^2}$, we can get
$$\frac{\partial \Delta_k(\cdot)}{\partial \tau_{k1}} = \sum_{m=1}^{M} \Big\{ \frac{1}{m} \Big(\frac{v_k}{v_k + m}\Big)^{u_k} \prod_{i=1}^{m} \frac{\tau_{k1} + i - 1}{\tau_{k1} + \tau_{k2} + i - 1} \times \big[\Psi(\tau_{k1} + m) - \Psi(\tau_{k1}) + \Psi(\tau_{k1} + \tau_{k2}) - \Psi(\tau_{k1} + \tau_{k2} + m)\big] \Big\},$$
and
$$\frac{\partial \Delta_k(\cdot)}{\partial \tau_{k2}} = \sum_{m=1}^{M} \Big\{ \frac{1}{m} \Big(\frac{v_k}{v_k + m}\Big)^{u_k} \prod_{i=1}^{m} \frac{\tau_{k1} + i - 1}{\tau_{k1} + \tau_{k2} + i - 1} \times \big[\Psi(\tau_{k1} + \tau_{k2}) - \Psi(\tau_{k1} + \tau_{k2} + m)\big] \Big\}.$$

Appendix C.4. Coordinate Update for q(Tk)

We use the gradient ascent algorithm to jointly update $u_k, v_k$ by the gradients of the lower bound with respect to them. The derivatives are
$$\frac{\partial \mathcal{L}}{\partial u_k} = \Psi'(u_k) \sum_{r > 1} (r - 2)\varphi_k(r) + \varphi_k(r > 1)\Big(1 - \frac{n_{1k} + k_1/k_2}{v_k}\Big) - n_{0k} \frac{\partial \Delta_k(\cdot)}{\partial u_k} + (1 - u_k)\Psi'(u_k),$$
$$\frac{\partial \mathcal{L}}{\partial v_k} = -\frac{1}{v_k} \sum_{r > 1} (r - 2)\varphi_k(r) + \varphi_k(r > 1) \times \frac{u_k}{v_k^2}\Big(n_{1k} + \frac{k_1}{k_2}\Big) - n_{0k} \frac{\partial \Delta_k(\cdot)}{\partial v_k} - \frac{1}{v_k},$$
where
$$\frac{\partial \Delta_k(\cdot)}{\partial u_k} = \sum_{m=1}^{M} \Big\{ \frac{1}{m} \prod_{i=1}^{m} \frac{\tau_{k1} + i - 1}{\tau_{k1} + \tau_{k2} + i - 1} \times \Big(\frac{v_k}{v_k + m}\Big)^{u_k} \big[\ln v_k - \ln(v_k + m)\big] \Big\}, \qquad \frac{\partial \Delta_k(\cdot)}{\partial v_k} = \sum_{m=1}^{M} \Big\{ \frac{1}{m} \prod_{i=1}^{m} \frac{\tau_{k1} + i - 1}{\tau_{k1} + \tau_{k2} + i - 1} \times u_k \Big(\frac{v_k}{v_k + m}\Big)^{u_k - 1} \frac{m}{(v_k + m)^2} \Big\}.$$

Appendix C.5. Coordinate Update for q(α)

The update formulae for $k_1, k_2$ are
$$k_1 = K + \sum_{k=1}^{K} \sum_{r > 1}^{R} (r - 1)\varphi_k(r) + a_1, \qquad k_2 = -\sum_{k=1}^{K} \mathbb{E}[\ln(1 - V_k)] + \sum_{k=1}^{K} \mathbb{E}[T_k]\, \varphi_k(r > 1) + a_2.$$
Note that $\varphi_k(1)$ has no effect on the update of $\alpha$.

Appendix C.6. Coordinate Update for q(γ)

The update formulae for $\tau_1, \tau_2$ are
$$\tau_1 = K + b_1, \qquad \tau_2 = \sum_{r=1}^{R} \Big\{ 1 - \prod_{k=1}^{K} \sum_{r'=1}^{r-1} \varphi_k(r') \Big\} + b_2.$$
It can be seen that $\tau_1$ does not change across iterations, while $\tau_2$ depends on $\varphi_k(r)$.

Appendix C.7. Coordinate Update for q(μk,Σk)

The variational parameter update of $q(\mu_k, \Sigma_k)$ is analytical, and the update formulae are
$$\begin{aligned}
u_k &= \frac{\sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)\, x_t^{(n)} + \lambda_0 u_0}{\lambda_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)}, \qquad \lambda_k = \lambda_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k), \qquad \upsilon_k = \upsilon_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k),\\
\Phi_k &= \lambda_0 u_0 u_0^{\top} + \sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)\, x_t^{(n)} x_t^{(n)\top} + \Phi_0 - \frac{1}{\lambda_0 + \sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)} \Big(\sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)\, x_t^{(n)} + \lambda_0 u_0\Big) \Big(\sum_{n=1}^{N} \sum_{t=1}^{T} q(z_t^{(n)} = k)\, x_t^{(n)} + \lambda_0 u_0\Big)^{\top}.
\end{aligned}$$
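These updates translate directly into a few lines of NumPy; the following sketch (the function name and array layout are our own) computes the four quantities for one state $k$ from stacked observations and their responsibilities:

```python
import numpy as np

def niw_update(X, resp, u0, lam0, Phi0, nu0):
    """Variational NIW update for one state k.
    X:    (n_points, d) observations x_t^(n) stacked over all trajectories
    resp: (n_points,)  responsibilities q(z_t^(n) = k)"""
    Nk = resp.sum()                       # total soft count for state k
    Sx = resp @ X                         # sum of q(z_t = k) * x_t
    lam_k = lam0 + Nk
    nu_k = nu0 + Nk
    u_k = (Sx + lam0 * u0) / lam_k
    S = (X * resp[:, None]).T @ X         # weighted scatter matrix
    m = Sx + lam0 * u0
    Phi_k = lam0 * np.outer(u0, u0) + S + Phi0 - np.outer(m, m) / lam_k
    return u_k, lam_k, Phi_k, nu_k
```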

Appendix C.8. Coordinate Update for q(πj(c))

In order to update $\pi_j^{(c)}$, two cases, $j > 0$ and $j = 0$, should be analyzed. For $j > 0$, the logarithmic distribution of $\pi_j^{(c)}$ is updated as
$$\ln q\big(\pi_j^{(c)}\big) = \mathbb{E}_q\Big[\ln p\big(\pi_j^{(c)} \mid f_c, r, \kappa\big) + \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c) \ln p\big(z_{t+1}^{(n)} \mid \pi_j^{(c)}, z_t^{(n)}\big)\Big],$$
where
$$p\big(z_{t+1}^{(n)} \mid \pi_j^{(y_n)}, z_t^{(n)} = j\big) = \prod_{k=1}^{K} \big(\pi_{jk}^{(y_n)}\big)^{\delta(z_t^{(n)} = j,\, z_{t+1}^{(n)} = k)},$$
and
$$p\big(\pi_j^{(c)} \mid f_c, r^{(c)}, \kappa\big) = \frac{1}{B} \prod_{k \neq j}^{K} \big(\pi_{jk}^{(c)}\big)^{r^{(c)} f_{ck} - 1}\, \big(\pi_{jj}^{(c)}\big)^{(r^{(c)} + \kappa) f_{cj} - 1}.$$
Here $B$ is the normalizing constant of the Dirichlet distribution. We can get
$$\ln q\big(\pi_j^{(c)}\big) = \sum_{k \neq j}^{K} \Big\{ \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c)\, q\big(z_t^{(n)} = j, z_{t+1}^{(n)} = k\big) + r^{(c)} \upsilon_{ck} - 1 \Big\} \ln \pi_{jk}^{(c)} + \Big( \big(r^{(c)} + \kappa\big)\upsilon_{cj} - 1 + \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c)\, q\big(z_t^{(n)} = j, z_{t+1}^{(n)} = j\big) \Big) \ln \pi_{jj}^{(c)} + \mathrm{const}.$$
Thus the parameters $r_j^{(c)}$ of the Dirichlet distribution $q\big(\pi_j^{(c)}\big)$ are
$$r_{jk}^{(c)} = \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c)\, q\big(z_t^{(n)} = j, z_{t+1}^{(n)} = k\big) + r^{(c)} \upsilon_{ck}, \quad k \neq j,$$
and
$$r_{jj}^{(c)} = \sum_{n=1}^{N} \sum_{t=1}^{T-1} \delta(y_n = c)\, q\big(z_t^{(n)} = j, z_{t+1}^{(n)} = j\big) + \big(r^{(c)} + \kappa\big)\upsilon_{cj}.$$
For $j = 0$, $\pi_0^{(c)}$ is the initial probability of the hidden states. We can obtain
$$r_{0k}^{(c)} = \sum_{n=1}^{N} \delta(y_n = c)\, q\big(z_1^{(n)} = k\big) + r^{(c)} \upsilon_{ck}.$$

Appendix C.9. Coordinate Update for q(Z)

For each class of trajectories,
$$a_{jk}^{*(c)} = \exp\big(\mathbb{E}_q\big[\ln \pi_{jk}^{(c)}\big]\big) = \exp\Big(\Psi\big(r_{jk}^{(c)}\big) - \Psi\Big(\sum_{i=1}^{K} r_{ji}^{(c)}\Big)\Big),$$
$$b_{tj}^{*(n)} = \exp\big(\mathbb{E}_q\big[\ln p\big(x_t^{(n)} \mid \mu_j, \Sigma_j\big)\big]\big) = \exp\Big\{ -\frac{p}{2}\ln 2\pi - \frac{1}{2}\mathbb{E}_q\big[\ln|\Sigma_j|\big] - \frac{1}{2}\Big[\frac{d}{\lambda_j} + \big(x_t^{(n)} - u_j\big)^{\top} \big(\upsilon_j - p - 1\big)\Phi_j^{-1} \big(x_t^{(n)} - u_j\big)\Big] \Big\},$$
where $p$ is the dimension of $x_t^{(n)}$ and
$$\mathbb{E}_q\big[\ln|\Sigma_j|\big] = -\sum_{i=1}^{p} \Psi\Big(\frac{\upsilon_j + 1 - i}{2}\Big) - p \ln 2 + \ln|\Phi_j|.$$
From the above, we need the marginal probabilities $q(z_t^{(n)} = j)$ and $q(z_t^{(n)} = j, z_{t+1}^{(n)} = k)$. The detailed calculations are as follows; both quantities can be calculated by the forward-backward algorithm. The forward procedure is
$$\iota_k(t) = P\big(x_1, x_2, \ldots, x_t, z_t = k \mid W, \Theta\big), \qquad \iota_k(1) = a_{0k}^*\, b_{1k}^*, \qquad \iota_j(t + 1) = b_{(t+1)j}^* \sum_{k=1}^{K} \iota_k(t)\, a_{kj}^*.$$
The backward procedure is
$$\beta_k(t) = P\big(x_{t+1}, \ldots, x_T \mid z_t = k, W, \Theta\big), \qquad \beta_k(T) = 1, \qquad \beta_k(t) = \sum_{j=1}^{K} \beta_j(t + 1)\, a_{kj}^*\, b_{(t+1)j}^*.$$
Thus, the expressions of the posterior distributions are
$$q(z_t = j) = \frac{\iota_j(t)\, \beta_j(t)}{\sum_{j'=1}^{K} \iota_{j'}(t)\, \beta_{j'}(t)}, \qquad q(z_t = j, z_{t+1} = k) = \frac{\iota_j(t)\, a_{jk}^*\, \beta_k(t + 1)\, b_{(t+1)k}^*}{\sum_{k'=1}^{K} \iota_{k'}(t)\, \beta_{k'}(t)}.$$
Now we can update the variational parameters in the BP-HMM according to the above equations. We judge the convergence of this update according to the change of the lower bound.
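A compact scaled implementation of these recursions (a sketch; the per-step scaling is our own numerical-stability choice, and a0, A, B stand for the point estimates $a_0^*$, $a_{jk}^*$ and $b_{tj}^*$ above):

```python
import numpy as np

def forward_backward(a0, A, B):
    """Return gamma[t, j] = q(z_t = j), xi[t, j, k] = q(z_t = j, z_{t+1} = k)
    and the log marginal likelihood for one trajectory.
    a0: (K,) initial weights; A: (K, K) transitions; B: (T, K) emissions."""
    T, K = B.shape
    alpha = np.zeros((T, K)); beta = np.ones((T, K)); c = np.zeros(T)
    alpha[0] = a0 * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # scaled forward recursion
        alpha[t] = B[t] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):               # scaled backward recursion
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # already normalized
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None])
    return gamma, xi, np.log(c).sum()
```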

References

  1. Sun, S.; Zhao, J.; Gao, Q. Modeling and recognizing human trajectories with beta process hidden Markov models. Pattern Recognit. 2015, 48, 2407–2417. [Google Scholar] [CrossRef]
  2. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [Google Scholar] [CrossRef]
  3. Braiek, E.; Aouina, N.; Abid, S.; Cheriet, M. Handwritten characters recognition based on SKCS-polyline and hidden Markov model (HMM). In Proceedings of the International Symposium on Control, Communications and Signal Processing, Hammamet, Tunisia, 21–24 March 2004; pp. 447–450. [Google Scholar]
  4. Freire, A.L.; Barreto, G.A.; Veloso, M.; Varela, A.T. Short-term memory mechanisms in neural network learning of robot navigation tasks: A case study. In Proceedings of the Latin American Robotics Symposium, Valparaiso, Chile, 29–30 October 2009; pp. 1–6. [Google Scholar]
  5. Gao, Q.B.; Sun, S.L. Trajectory-based human activity recognition using hidden conditional random fields. In Proceedings of the International Conference on Machine Learning and Cybernetics, Xi’an, China, 15–17 July 2012; Volume 3, pp. 1091–1097. [Google Scholar]
  6. Bousmalis, K.; Zafeiriou, S.; Morency, L.P.; Pantic, M. Infinite hidden conditional random fields for human behavior analysis. IEEE Trans. Neural Netw. Learn. Syst. 2012, 24, 170–177. [Google Scholar] [CrossRef] [PubMed]
  7. Gao, Q.; Sun, S. Trajectory-based human activity recognition with hierarchical Dirichlet process hidden Markov models. In Proceedings of the International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013; pp. 456–460. [Google Scholar]
  8. Fox, E.B.; Hughes, M.C.; Sudderth, E.B.; Jordan, M.I. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. Ann. Appl. Stat. 2014, 8, 1281–1313. [Google Scholar] [CrossRef] [Green Version]
  9. Fox, E.; Jordan, M.I.; Sudderth, E.B.; Willsky, A.S. Sharing features among dynamical systems with beta processes. Adv. Neural Inf. Process. Syst. 2009, 22, 549–557. [Google Scholar]
  10. Gao, Q.B.; Sun, S.L. Human activity recognition with beta process hidden Markov models. In Proceedings of the International Conference on Machine Learning and Cybernetics, Tianjin, China, 14–17 July 2013; Volume 2, pp. 549–554. [Google Scholar]
  11. Gao, Y.; Villecco, F.; Li, M.; Song, W. Multi-scale permutation entropy based on improved LMD and HMM for rolling bearing diagnosis. Entropy 2017, 19, 176. [Google Scholar] [CrossRef] [Green Version]
  12. Filippatos, A.; Langkamp, A.; Kostka, P.; Gude, M. A sequence-based damage identification method for composite rotors by applying the Kullback–Leibler divergence, a two-sample Kolmogorov–Smirnov test and a statistical hidden Markov model. Entropy 2019, 21, 690. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Granada, I.; Crespo, P.M.; Garcia-Frías, J. Combining the Burrows-Wheeler transform and RCM-LDGM codes for the transmission of sources with memory at high spectral efficiencies. Entropy 2019, 21, 378. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Li, C.; Pourtaherian, A.; van Onzenoort, L.; Ten, W.E.T.a.; de With, P.H.N. Infant facial expression analysis: Towards a real-time video monitoring system using R-CNN and HMM. IEEE J. Biomed. Health Inform. 2021, 25, 1429–1440. [Google Scholar] [CrossRef] [PubMed]
  15. Li, J.; Todorovic, S. Action shuffle alternating learning for unsupervised action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, virtual meeting, 19–25 June 2021; pp. 12628–12636. [Google Scholar]
  16. Zhou, W.; Michel, W.; Irie, K.; Kitza, M.; Schlüter, R.; Ney, H. The rwth ASR system for ted-lium release 2: Improving hybrid HMM with specaugment. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 7839–7843. [Google Scholar]
  17. Zhu, Y.; Yan, Y.; Komogortsev, O. Hierarchical HMM for eye movement classification. In European Conference on Computer Vision Workshops; Springer: Cham, Switzerland, 2020; pp. 544–554. [Google Scholar]
  18. Lom, M.; Pribyl, O.; Svitek, M. Industry 4.0 as a part of smart cities. In Proceedings of the 2016 Smart Cities Symposium Prague (SCSP), Prague, Czech Republic, 26–27 May 2016; pp. 1–6. [Google Scholar]
  19. Castellanos, H.G.; Varela, J.A.E.; Zezzatti, A.O. Mobile Device Application to Detect Dangerous Movements in Industrial Processes Through Intelligence Trough Ergonomic Analysis Using Virtual Reality. In The International Conference on Artificial Intelligence and Computer Vision; Springer: Cham, Switzerland, 2021; pp. 202–217. [Google Scholar]
  20. Deng, Q.; Söffker, D. Improved driving behaviors prediction based on fuzzy logic-hidden markov model (fl-hmm). In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 2003–2008. [Google Scholar]
  21. Fouad, M.A.; Abdel-Hamid, A.T. On Detecting IoT Power Signature Anomalies using Hidden Markov Model (HMM). In Proceedings of the 2019 31st International Conference on Microelectronics (ICM), Cairo, Egypt, 15–18 December 2019; pp. 108–112. [Google Scholar]
  22. Nascimento, J.C.; Figueiredo, M.A.; Marques, J.S. Trajectory classification using switched dynamical hidden Markov models. IEEE Trans. Image Process. 2009, 19, 1338–1348. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Thibaux, R.; Jordan, M.I. Hierarchical beta processes and the Indian buffet process. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 21–24 March 2007; pp. 564–571. [Google Scholar]
  24. Teh, Y.W.; Jordan, M.I.; Beal, M.J.; Blei, D.M. Sharing clusters among related groups: Hierarchical Dirichlet processes. Adv. Neural Inf. Process. Syst. 2004, 17, 1385–1392. [Google Scholar]
  25. Hughes, M.C.; Fox, E.; Sudderth, E.B. Effective split-merge monte carlo methods for nonparametric models of sequential data. Adv. Neural Inf. Process. Syst. 2012, 25, 1295–1303. [Google Scholar]
  26. Teh, Y.W.; Görür, D.; Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, San Juan, Puerto Rico, 21–24 March 2007; pp. 556–563. [Google Scholar]
  27. Griffiths, T.L.; Ghahramani, Z. Infinite latent feature models and the Indian buffet process. Adv. Neural Inf. Process. Syst. 2005, 18, 475–482. [Google Scholar]
  28. Hjort, N.L. Nonparametric Bayes estimators based on beta processes in models for life history data. Ann. Stat. 1990, 18, 1259–1294. [Google Scholar] [CrossRef]
  29. Paisley, J.W.; Zaas, A.K.; Woods, C.W.; Ginsburg, G.S.; Carin, L. A stick-breaking construction of the beta process. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 847–854. [Google Scholar]
  30. Ferguson, T.S. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1, 209–230. [Google Scholar] [CrossRef]
  31. Sethuraman, J. A constructive definition of Dirichlet priors. Statist. Sin. 1994, 4, 639–650. [Google Scholar]
  32. Cao, Y.; Li, Y.; Coleman, S.; Belatreche, A.; McGinnity, T.M. Adaptive hidden Markov model with anomaly states for price manipulation detection. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 318–330. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Paisley, J.W.; Carin, L.; Blei, D.M. Variational Inference for Stick-Breaking Beta Process Priors. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 889–896. [Google Scholar]
  34. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  35. Alcock, R.J.; Manolopoulos, Y. Time-series similarity queries employing a feature-based approach. In Proceedings of the 7th Hellenic Conference on Informatics, Ioannina, Greece, 26–29 August 1999; pp. 27–29. [Google Scholar]
  36. Ziaeefard, M.; Bergevin, R. Semantic human activity recognition: A literature review. Pattern Recognit. 2015, 48, 2329–2345. [Google Scholar] [CrossRef]
Figure 1. The probabilistic graphical model for an HMM, where $X = \{x_1, x_2, \ldots, x_T\}$ represents an observation sequence and $Z = \{z_1, z_2, \ldots, z_T\}$ represents the corresponding hidden state sequence.
Figure 2. The probabilistic graphical model of the proposed variational BP-HMM. $X^{(n)} = \{x_1^{(n)}, x_2^{(n)}, \ldots, x_T^{(n)}\}$ is the $n$th observed trajectory, $z^{(n)} = \{z_1^{(n)}, z_2^{(n)}, \ldots, z_T^{(n)}\}$ is the hidden state sequence of the $n$th trajectory, and $y^{(n)}$ is the class label of the $n$th trajectory, which indicates choosing the state transition probabilities of the class it belongs to. In this graphical model, we omit the hyper-parameters.
Figure 3. Examples of control chart patterns.
Figure 4. Selection results of hidden states for the four classes on control chart patterns: normal pattern (Normal), cyclic pattern (Cyclic), increasing trend (IT) and decreasing trend (DT). The occurrence probabilities $q(f_{ck})$ of the hidden states are represented by the sizes of the green blocks; larger blocks indicate higher occurrence probabilities.
Figure 5. Selection results of hidden states for the different classes on human activity trajectory recognition. The occurrence probabilities $q(f_{ck})$ of the hidden states are represented by the sizes of the green blocks; larger blocks indicate higher occurrence probabilities.
Figure 6. Correct classification results of the HATR dataset for the classes PTSS, PTES, GA, CPH, WFB and WTS, respectively.
Figure 7. Misclassification results of the HATR dataset for the three classes CPH, WFB and WTS, which are misclassified as GA, WTS and CPH, respectively.
Figure 8. Selection results of hidden states for the different classes on the Wall-Following navigation recognition. The occurrence probabilities $q(f_{ck})$ of the hidden states are represented by the sizes of the green blocks; larger blocks indicate higher occurrence probabilities.
Table 1. Data statistics for the CCP, HATR and WFNT datasets and corresponding classes.

Datasets | #Train Trajectories | #Classes (Descriptions)
CCP      | 20 (5/class)        | 4 (Normal, Cyclic, IT, DT)
HATR     | 300 (50/class)      | 6 (PTSS, PTES, GA, CPH, WFB, WTS)
WFNT     | 40 (10/class)       | 4 (F, L, R, B)
Table 2. Comparison of the classification accuracy of the proposed method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM on CCP.

Approach | HCRF        | LSTM        | HMM-BIC     | SBP-HMM     | VBP-HMM
CCP      | 0.88 ± 0.03 | 0.95 ± 0.02 | 0.97 ± 0.01 | 0.96 ± 0.02 | 1.00 ± 0.00
Table 3. Comparison of the classification accuracy of the proposed method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM in HATR.

Approach | HCRF        | LSTM        | HMM-BIC     | SBP-HMM     | VBP-HMM
HATR     | 0.68 ± 0.03 | 0.75 ± 0.03 | 0.95 ± 0.02 | 0.91 ± 0.02 | 0.96 ± 0.02
Table 4. Classification accuracy (confusion matrix) for human activity trajectory recognition.

Predicted Class | PTSS | PTES | GA   | CPH  | WFB  | WTS
PTSS            | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
PTES            | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00
GA              | 0.00 | 0.00 | 0.97 | 0.00 | 0.00 | 0.00
CPH             | 0.00 | 0.00 | 0.03 | 0.96 | 0.00 | 0.00
WFB             | 0.00 | 0.00 | 0.00 | 0.03 | 1.00 | 0.23
WTS             | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.77
Table 5. The confusion matrix of the Wall-Following navigation recognition.

Predicted Class | F    | L    | R    | B
F               | 0.95 | 0.18 | 0.00 | 0.00
L               | 0.02 | 0.73 | 0.00 | 0.00
R               | 0.03 | 0.09 | 0.95 | 0.00
B               | 0.00 | 0.00 | 0.05 | 1.00
Table 6. Comparison of the classification accuracy of the proposed method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM on WFNT.

Approach | HCRF        | LSTM        | HMM-BIC     | SBP-HMM     | VBP-HMM
WFNT     | 0.80 ± 0.03 | 0.73 ± 0.08 | 0.86 ± 0.04 | 0.85 ± 0.02 | 0.89 ± 0.01
Table 7. Comparison of the total time cost (s) of the proposed method VBP-HMM versus HCRF, LSTM, HMM-BIC and the sampled BP-HMM in the experiments.

Approach | HCRF | LSTM | HMM-BIC | SBP-HMM | VBP-HMM
CCP      | 19   | 66   | 54      | 2151    | 117
HATR     | 118  | 13   | 93      | 2521    | 205
WFNT     | 100  | 58   | 115     | 1819    | 312
