# Algorithms for Hidden Markov Models Restricted to Occurrences of Regular Expressions


## Abstract


## 1. Introduction

## 2. Methods

#### 2.1. Hidden Markov Models

An HMM generates an observed sequence $y_{1:T} = y_1 y_2 \ldots y_T \in \Sigma^*$ and a hidden sequence $x_{1:T} = x_1 x_2 \ldots x_T \in H^*$, where $\Sigma$ and $H$ are finite alphabets of observables and hidden states, respectively. The hidden sequence is a realization of a Markov process that explains the hidden properties of the observed data. We can formally define an HMM as consisting of a finite alphabet of hidden states $H = \{h_1, h_2, \ldots, h_N\}$, a finite alphabet of observables $\Sigma = \{o_1, o_2, \ldots, o_M\}$, a vector $\Pi = (\pi_{h_i})_{1 \le i \le N}$, where $\pi_{h_i} = \mathbb{P}(X_1 = h_i)$ is the probability of the hidden sequence starting in state $h_i$, a matrix $A = \{a_{h_i,h_j}\}_{1 \le i,j \le N}$, where $a_{h_i,h_j} = \mathbb{P}(X_t = h_j \mid X_{t-1} = h_i)$ is the probability of a transition from state $h_i$ to state $h_j$, and a matrix $B = \{b_{h_i,o_j}\}_{1 \le i \le N,\, 1 \le j \le M}$, where $b_{h_i,o_j} = \mathbb{P}(Y_t = o_j \mid X_t = h_i)$ is the probability of state $h_i$ emitting $o_j$.

Figure 1 shows a simple HMM for gene prediction with hidden alphabet $H = \{N\} \cup \{C_i, R_i \mid 1 \le i \le 3\}$. In practice, models used for gene finding are much more complex, but this model captures the essential aspects of a gene finder.

**Figure 1.**A Hidden Markov Model (HMM) for gene prediction. Each box represents a hidden state, and the numbers inside are the emission probabilities of each nucleotide. Numbers on arcs are transition probabilities between hidden states.

An HMM can be used to analyze an observed sequence $y_{1:T}$. The likelihood of a given observed sequence can be computed using the forward algorithm [3], while the Viterbi algorithm [3] and the posterior-Viterbi algorithm [22] are used for predicting a corresponding hidden sequence. All these algorithms run in $O(TN^2)$ time, using $O(TN)$ space.

#### 2.1.1. The Forward Algorithm

The forward algorithm computes the likelihood of the observed sequence $y_{1:T}$ by summing the joint probability of the observed and hidden sequences over all possible hidden sequences, $x_{1:T}$. This is given by:

$$\mathbb{P}(y_{1:T}) = \sum_{x_{1:T}} \mathbb{P}(y_{1:T}, x_{1:T}) = \sum_{x_{1:T}} \pi_{x_1} b_{x_1,y_1} \prod_{t=2}^{T} a_{x_{t-1},x_t} b_{x_t,y_t} \qquad (1)$$

Here, $\mathbb{P}(y_{1:T}, x_{1:T})$ is the probability of generating $y_{1:T}$ with $x_{1:T}$ as the hidden sequence: the HMM starts in state $x_1$ and emits $y_1$ from $x_1$, and for all $t = 2, \ldots, T$, it makes a transition from state $x_{t-1}$ to $x_t$ and emits $y_t$ from $x_t$.

Rather than summing over all $N^T$ hidden sequences directly, the forward algorithm fills a table, $\alpha$, with entries

$$\alpha_t(x_t) = \mathbb{P}(y_{1:t}, x_t) = \sum_{x_{1:t-1}} \mathbb{P}(y_{1:t}, x_{1:t}) \qquad (2)$$

being the probability of observing $y_{1:t}$ and being in state $x_t$ at time $t$. The recursion is:

$$\alpha_1(x_1) = \pi_{x_1} b_{x_1,y_1}, \qquad \alpha_t(x_t) = b_{x_t,y_t} \sum_{x_{t-1}} \alpha_{t-1}(x_{t-1})\, a_{x_{t-1},x_t} \qquad (3)$$

Finally, the likelihood is obtained as $\mathbb{P}(y_{1:T}) = \sum_i \alpha_T(h_i)$.
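To make the recursion concrete, here is a minimal Python/NumPy sketch on a hypothetical two-state, two-symbol HMM; the parameters are invented for illustration and are not those of the model in Figure 1.

```python
import numpy as np

# Hypothetical toy HMM: two hidden states, two observables.
pi = np.array([0.6, 0.4])              # initial probabilities, Pi
A  = np.array([[0.7, 0.3],             # transition matrix, A[i, j] = a_{h_i, h_j}
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],             # emission matrix, B[i, o] = b_{h_i, o}
               [0.2, 0.8]])

def forward(y):
    """Fill the alpha table: alpha[t, i] = P(y_{1:t}, X_t = h_i)."""
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, y[0]]                      # base case of Equation (3)
    for t in range(1, T):
        alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)  # recursion of Equation (3)
    return alpha

y = [0, 1, 0]
alpha = forward(y)
likelihood = alpha[-1].sum()   # P(y_{1:T}) = sum_i alpha_T(h_i)
print(likelihood)
```

The matrix-vector product computes all $N$ sums of the recursion at once, giving the stated $O(TN^2)$ time and $O(TN)$ space.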

#### 2.1.2. The Viterbi Algorithm

The Viterbi algorithm computes a most likely decoding, i.e., a hidden sequence, $x_{1:T}$, that maximizes the joint probability of the observed and hidden sequences (1). It uses the same type of approach as the forward algorithm: a new table, $\omega$, is defined by $\omega_t(x_t) = \max_{x_{1:t-1}} \{\mathbb{P}(y_{1:t}, x_{1:t})\}$, the probability of a most likely decoding ending in $x_t$ at time $t$, having observed $y_{1:t}$. This can be obtained as follows:

$$\omega_1(x_1) = \pi_{x_1} b_{x_1,y_1}, \qquad \omega_t(x_t) = b_{x_t,y_t} \max_{x_{t-1}} \{\omega_{t-1}(x_{t-1})\, a_{x_{t-1},x_t}\} \qquad (4)$$

A most likely decoding is retrieved by backtracking through $\omega$ from entry $\mathrm{argmax}_{h_i} \{\omega_T(h_i)\}$.
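A sketch of the Viterbi recursion with backpointers, again on an invented two-state example (all numbers hypothetical):

```python
import numpy as np

# Hypothetical toy HMM, as in the forward sketch.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def viterbi(y):
    """omega[t, j] = max over x_{1:t-1} of P(y_{1:t}, x_{1:t-1}, X_t = h_j)."""
    T, N = len(y), len(pi)
    omega = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)            # argmax pointers for backtracking
    omega[0] = pi * B[:, y[0]]
    for t in range(1, T):
        scores = omega[t - 1][:, None] * A        # scores[i, j] = omega_{t-1}(i) * a_{i,j}
        back[t] = scores.argmax(axis=0)
        omega[t] = B[:, y[t]] * scores.max(axis=0)
    # Backtrack from argmax_i omega_T(h_i).
    x = [int(omega[-1].argmax())]
    for t in range(T - 1, 0, -1):
        x.append(int(back[t][x[-1]]))
    return x[::-1]

print(viterbi([0, 1, 0]))
```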

#### 2.1.3. The Posterior Decoding Algorithm

The posterior decoding algorithm computes a decoding, $x_{1:T}$, such that each $x_t = \mathrm{argmax}_{h_i} \{\gamma_t(h_i)\}$ has the highest posterior probability $\gamma_t(h_i) = \mathbb{P}(X_t = h_i \mid y_{1:T})$. If we let $\beta_t(h_i) = \mathbb{P}(y_{t+1:T} \mid X_t = h_i)$, we have:

$$\gamma_t(h_i) = \mathbb{P}(X_t = h_i \mid y_{1:T}) = \frac{\alpha_t(h_i)\, \beta_t(h_i)}{\mathbb{P}(y_{1:T})} \qquad (5)$$

The table $\beta$ is filled backwards in time, computing $\beta_t(h_i)$ by using a recursion similar to Equation (3). Thus, to compute the posterior decoding, we first fill out $\alpha_t(h_i)$ and $\beta_t(h_i)$ for all $t$ and $i$ and then compute the decoding by $x_t = \mathrm{argmax}_{h_i} \{\gamma_t(h_i)\}$.
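The forward-backward computation behind the posterior decoding can be sketched as follows; the backward pass for $\beta$ mirrors the forward recursion of Equation (3) run in reverse (parameters again invented, not those of Figure 1):

```python
import numpy as np

# Hypothetical toy HMM, as in the previous sketches.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def posterior_decoding(y):
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                      # beta_T(h_i) = 1
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = B[:, y[t]] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    gamma = alpha * beta / alpha[-1].sum()      # gamma[t, i] = P(X_t = h_i | y_{1:T})
    return gamma.argmax(axis=1)                 # x_t = argmax_i gamma_t(h_i)

print(posterior_decoding([0, 1, 0]))
```

Each row of `gamma` sums to one, since it is a probability distribution over the hidden states at time $t$.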

#### 2.1.4. The Posterior-Viterbi Algorithm

A posterior decoding may be syntactically incorrect, in the sense that it can contain transitions that have zero probability in the HMM. The posterior-Viterbi algorithm instead computes $\mathrm{argmax}_{x_{1:T} \in A_p} \{\prod_{t=1}^{T} \gamma_t(x_t)\}$, where $A_p$ is the set of syntactically correct decodings. To compute this, a new table, $\tilde{\gamma}$, is defined by ${\tilde{\gamma}}_t(x_t) = \max_{x_{1:t-1} \in A_p} \{\prod_{s=1}^{t} \gamma_s(x_s)\}$, the maximum posterior probability of a decoding from $A_p$ ending in $x_t$ at time $t$. The table is filled using the recursion:

$$\tilde{\gamma}_1(x_1) = \gamma_1(x_1), \qquad \tilde{\gamma}_t(x_t) = \gamma_t(x_t) \max_{x_{t-1}\,:\, a_{x_{t-1},x_t} > 0} \{\tilde{\gamma}_{t-1}(x_{t-1})\} \qquad (6)$$

The decoding from $A_p$ with the highest posterior probability is retrieved by backtracking through $\tilde{\gamma}$ from entry $\mathrm{argmax}_{h_i} \{\tilde{\gamma}_T(h_i)\}$. We note that, provided that the posterior decoding algorithm returns a decoding from $A_p$, the posterior-Viterbi algorithm will return the same decoding.
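The recursion can be sketched in a few lines; here the posterior table $\gamma$ is hardcoded with invented numbers, since its computation was covered above, and the transition support (which moves are syntactically valid) is given as a Boolean matrix:

```python
import numpy as np

# Hypothetical posterior probabilities gamma[t, i] and transition support.
gamma = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
allowed = np.array([[1, 1], [1, 1]], dtype=bool)      # allowed[i, j] <=> a_{h_i,h_j} > 0

T, N = gamma.shape
gt = np.zeros((T, N))                                 # gamma-tilde table
bp = np.zeros((T, N), dtype=int)                      # backpointers
gt[0] = gamma[0]
for t in range(1, T):
    for j in range(N):
        prev = np.where(allowed[:, j], gt[t - 1], 0.0)  # only syntactically valid moves
        bp[t, j] = prev.argmax()
        gt[t, j] = gamma[t, j] * prev.max()

# Backtrack from argmax_i gamma-tilde_T(h_i).
x = [int(gt[-1].argmax())]
for t in range(T - 1, 0, -1):
    x.append(int(bp[t, x[-1]]))
decoding = x[::-1]
print(decoding)
```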

#### 2.2. Automata

Let $H = \{h_1, h_2, \ldots, h_N\}$ be the hidden alphabet of an HMM. Let $r$ be a regular expression over $H$, and let $FA_H(r) = (Q, H, q_0, A, \delta)$ [23] be the deterministic finite automaton (DFA) that recognizes the language described by $(h_1 \mid h_2 \mid \ldots \mid h_N)^*(r)$, where $Q$ is the finite set of states, $q_0 \in Q$ is the initial state, $A \subseteq Q$ is the set of accepting states and $\delta : Q \times H \to Q$ is the transition function. $FA_H(r)$ accepts any string that has $r$ as a suffix. We construct the DFA $EA_H(r) = (Q, H, q_0, A, \delta_E)$ as an extension of $FA_H(r)$, where $\delta_E$ is defined by:

$$\delta_E(q, h) = \begin{cases} \delta(q_0, h) & \text{if } q \in A \\ \delta(q, h) & \text{otherwise} \end{cases}$$

That is, $EA_H(r)$ restarts every time it reaches an accepting state. Figure 2 shows $FA_H(r)$ and $EA_H(r)$ for $r = (NC_1) \mid (R_1 N)$ with the hidden alphabet $H = \{N\} \cup \{C_i, R_i \mid 1 \le i \le 3\}$ of the HMM from Figure 1. Both automata have $Q = \{0, 1, 2, 3, 4\}$, $q_0 = 0$, and $A = \{2, 4\}$. State 1 marks the beginning of $NC_1$, while state 3 corresponds to the beginning of $R_1 N$. State 2 accepts $NC_1$, and state 4 accepts $R_1 N$. As $C_2$, $C_3$, $R_2$ and $R_3$ are not part of $r$, reading any of them makes the automaton restart by transitioning to state 0 from all states. We left these transitions out of the figure for clarity. The main difference between $FA_H(r)$ and $EA_H(r)$ is that they correspond to overlapping and non-overlapping occurrences, respectively. For example, on the input string $R_1 N C_1$, $FA_H(r)$ first finds $R_1 N$ using state 4, from which it transitions to state 2 and matches $NC_1$. However, after $EA_H(r)$ recognizes $R_1 N$, it transitions back to state 0, not matching $NC_1$. The algorithms we provide are independent of which of the two automata is used, and therefore, all that remains is to switch between them when needed. In our implementation, we used an automata library for Java [24] to obtain $FA_H(r)$, which we then converted to $EA_H(r)$.
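The restart construction is straightforward once $\delta$ is available. Below is a small Python sketch with the transitions of Figure 2 reconstructed by hand from the description above (our actual implementation used the Java automata library [24], not this code); it reproduces the overlapping vs. non-overlapping behavior on the input $R_1 N C_1$:

```python
# States: 0 = start, 1 = seen N, 2 = accept NC1, 3 = seen R1, 4 = accept R1N.
q0, accepting = 0, {2, 4}

def delta(q, h):
    """FA(r): suffix-matching DFA, transitions reconstructed by hand."""
    if h == 'R1':
        return 3                       # R1 always starts a potential R1 N match
    if h == 'N':
        return 4 if q == 3 else 1      # after R1, N completes R1 N; else N starts NC1
    if h == 'C1':
        return 2 if q in (1, 4) else 0 # a state ending in N followed by C1 accepts NC1
    return 0                           # C2, C3, R2, R3 restart the automaton

def delta_E(q, h):
    """EA(r): restart from q0 after every accepting state."""
    return delta(q0, h) if q in accepting else delta(q, h)

def count_matches(step, word):
    q, k = q0, 0
    for h in word:
        q = step(q, h)
        k += q in accepting
    return k

word = ['R1', 'N', 'C1']
print(count_matches(delta, word))      # overlapping occurrences
print(count_matches(delta_E, word))    # non-overlapping occurrences
```

On $R_1 N C_1$, the FA counts two matches ($R_1 N$, then the overlapping $NC_1$), while the EA counts one, exactly as described in the text.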

**Figure 2.** Two automata, $FA_H(r)$ and $EA_H(r)$, for the pattern $r = (NC_1) \mid (R_1 N)$, with $H = \{N\} \cup \{C_i, R_i \mid 1 \le i \le 3\}$, $Q = \{0, 1, 2, 3, 4\}$, $q_0 = 0$, and $A = \{2, 4\}$. States 1, 2 and 3, 4 are used for matching sequences ending with $NC_1$ and $R_1 N$, respectively, as marked with gray boxes. The two automata differ only with respect to transitions from accepting states: the dotted transition belongs to $FA_H(r)$ and the dashed one to $EA_H(r)$. For clarity, the figure omits the transitions going from all states to state 0 on $C_2$, $C_3$, $R_2$ and $R_3$.

## 3. Results and Discussion

#### 3.1. The Restricted Forward Algorithm

Given a regular expression $r$ over the hidden alphabet, let $O_r(x_{1:T})$ be the number of matches of $r$ in $x_{1:T}$. We wish to estimate $O_r$ by using its probability distribution. We do this by running the HMM and $FA_H(r)$ in parallel. Let $FA_H(r)_t$ be the state in which the automaton is after $t$ transitions, and define

$$\hat{\alpha}_t(x_t, k, q) = \sum_{\{x_{1:t-1}\,:\, O_r(x_{1:t}) = k\}} \mathbb{P}(y_{1:t}, x_{1:t}, FA_H(r)_t = q)$$

to be the entries of a new table, $\hat{\alpha}$, where $k = 0, \ldots, m$ and $m \le T$ is the maximum number of pattern occurrences in a hidden sequence of length $T$. The table entries are the probabilities of having observed $y_{1:t}$, being in hidden state $x_t$ and automaton state $q$ at time $t$ and having seen $k$ occurrences of the pattern, corresponding to having visited accepting states $k$ times. Letting $\delta^{-1}(q, h_i) = \{q' \mid \delta(q', h_i) = q\}$ be the automaton states from which a transition to $q$ exists, using hidden state $h_i$, and $\mathbb{1}$ being the indicator function, mapping a Boolean expression to one if it is satisfied and to zero otherwise, we have that:

$$\hat{\alpha}_t(x_t, k, q) = b_{x_t,y_t} \sum_{q' \in \delta^{-1}(q,\, x_t)} \sum_{x_{t-1}} \hat{\alpha}_{t-1}(x_{t-1},\, k - \mathbb{1}[q \in A],\, q')\, a_{x_{t-1},x_t} \qquad (7)$$

with the base case $\hat{\alpha}_1(x_1, \mathbb{1}[q \in A], q) = \pi_{x_1} b_{x_1,y_1}$ for $q = \delta(q_0, x_1)$, and zero elsewhere. The distribution of $O_r$ is then obtained from the last column of the table as $\mathbb{P}(O_r = k \mid y_{1:T}) = \sum_{i,q} \hat{\alpha}_T(h_i, k, q) / \mathbb{P}(y_{1:T})$, where $\mathbb{P}(y_{1:T}) = \sum_i \alpha_T(h_i) = \sum_{i,k,q} \hat{\alpha}_T(h_i, k, q)$.

This gives an $O(TN^2 m |Q|^2)$ running time and a space consumption of $O(TNm|Q|)$. In practice, both time and space consumption can be reduced. The restricted forward algorithm can be run for gradually increasing values of $k$, up to a $k_{\max}$ for which $\mathbb{P}(\text{at most } k_{\max} \text{ occurrences of } r \mid y_{1:T})$ is greater than, e.g., 99.99%. This $k_{\max}$ is generally significantly smaller than $m$, while the expectation of the number of matches of $r$ can be reliably calculated from this truncated distribution. The space consumption can be reduced to $O(N|Q|)$, because the calculation at time $t$ for a specific value, $k$, depends only on the results at time $t - 1$ for $k$ and $k - 1$.
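The following sketch implements the restricted forward recursion on a deliberately tiny hypothetical setup: a two-state HMM (invented parameters) and the one-symbol pattern $r = h_1$, whose two-state DFA accepts exactly after reading $h_1$, so that $O_r$ simply counts how often the hidden path visits $h_1$. The loop over the previous automaton state `qp` plays the role of the sum over $\delta^{-1}(q, x_t)$ in Equation (7):

```python
import numpy as np

# Hypothetical two-state HMM and two-state DFA for the pattern r = h1.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
q0, accepting = 0, {1}
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}  # delta[q, h_i]

def restricted_forward(y):
    """ah[t, i, k, q] = P(y_{1:t}, X_t = h_i, O_r(x_{1:t}) = k, automaton in q)."""
    T, N, Qn = len(y), len(pi), 2
    m = T                                 # here the pattern has length 1, so m = T
    ah = np.zeros((T, N, m + 1, Qn))
    for i in range(N):                    # base case: the first symbol may already match
        q = delta[q0, i]
        ah[0, i, int(q in accepting), q] = pi[i] * B[i, y[0]]
    for t in range(1, T):
        for j in range(N):
            for qp in range(Qn):          # previous automaton state q'
                q = delta[qp, j]
                dk = int(q in accepting)  # indicator: a new occurrence is completed
                for k in range(dk, m + 1):
                    ah[t, j, k, q] += B[j, y[t]] * (ah[t - 1, :, k - dk, qp] @ A[:, j])
    joint = ah[-1].sum(axis=(0, 2))       # sum out hidden and automaton states at time T
    return joint / joint.sum()            # P(O_r = k | y_{1:T})

dist = restricted_forward([0, 1, 0])
print(dist)
```

The returned vector is the full posterior distribution of the number of pattern occurrences, from which the expectation or a truncation point $k_{\max}$ can be read off directly.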

#### 3.2. Restricted Decoding Algorithms

The decoding algorithms can be restricted in a similar manner, to return only decodings, $x_{1:T}$, for which $O_r(x_{1:T}) \in [l, u]$, where $l$ and $u$ are set to, for example, the expected number of occurrences, which can be calculated from the distribution. The restricted decoding algorithms are built in the same way as the restricted forward algorithm was obtained: a new table is defined, which is filled using a simple recursion. The evaluation of the table is followed by backtracking to obtain a sequence of hidden states, which contains between $l$ and $u$ occurrences of the pattern. The two restricted decoding algorithms use $O(TNu|Q|)$ space and $O(TN^2 u |Q|^2)$ time.

The restricted Viterbi algorithm uses a table, $\hat{\omega}$, whose entries are the probabilities of a most likely decoding containing $k$ pattern occurrences, ending in state $x_t$ and automaton state $q$ at time $t$ and having observed $y_{1:t}$:

$$\hat{\omega}_t(x_t, k, q) = \max_{\{x_{1:t-1}\,:\, O_r(x_{1:t}) = k\}} \{\mathbb{P}(y_{1:t}, x_{1:t}, FA_H(r)_t = q)\}$$

with $k = 0, \ldots, u$. The corresponding recursion is:

$$\hat{\omega}_t(x_t, k, q) = b_{x_t,y_t} \max_{q' \in \delta^{-1}(q,\, x_t)} \max_{x_{t-1}} \{\hat{\omega}_{t-1}(x_{t-1},\, k - \mathbb{1}[q \in A],\, q')\, a_{x_{t-1},x_t}\} \qquad (8)$$

A most likely restricted decoding is retrieved by backtracking through $\hat{\omega}$ from entry $\mathrm{argmax}_{h_i,\, k \in [l,u],\, q} \{\hat{\omega}_T(h_i, k, q)\}$.
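A sketch of the restricted Viterbi algorithm on the same tiny hypothetical setup as before (two-state HMM with invented parameters, one-symbol pattern $r = h_1$ so that $O_r$ counts visits to $h_1$), including the backtracking that recovers a most likely decoding with $O_r(x_{1:T}) \in [l, u]$:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
q0, accepting = 0, {1}
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}  # delta[q, h_i]

def restricted_viterbi(y, l, u):
    """Most probable decoding with between l and u pattern occurrences."""
    T, N, Qn = len(y), len(pi), 2
    wh = np.zeros((T, N, u + 1, Qn))               # the hat-omega table
    bp = {}                                        # backpointers for nonzero cells
    for i in range(N):
        q = delta[q0, i]
        k = int(q in accepting)
        if k <= u:
            wh[0, i, k, q] = pi[i] * B[i, y[0]]
    for t in range(1, T):
        for j in range(N):
            for qp in range(Qn):                   # previous automaton state q'
                q = delta[qp, j]
                dk = int(q in accepting)
                for k in range(dk, u + 1):
                    cand = wh[t - 1, :, k - dk, qp] * A[:, j] * B[j, y[t]]
                    i = int(cand.argmax())
                    if cand[i] > wh[t, j, k, q]:
                        wh[t, j, k, q] = cand[i]
                        bp[t, j, k, q] = (i, k - dk, qp)
    # Pick the best end cell with k in [l, u], then backtrack.
    best = max(((wh[T - 1, i, k, q], i, k, q)
                for i in range(N) for k in range(l, u + 1) for q in range(Qn)),
               key=lambda c: c[0])
    assert best[0] > 0, "no decoding with k in [l, u]"
    _, i, k, q = best
    x = [i]
    for t in range(T - 1, 0, -1):
        i, k, q = bp[t, i, k, q]
        x.append(i)
    return x[::-1]

print(restricted_viterbi([0, 1, 0], 0, 3))   # unrestricted maximum
print(restricted_viterbi([0, 1, 0], 2, 2))   # forced to exactly two occurrences
```

With a loose bound the sketch reproduces the ordinary Viterbi decoding; tightening the bound forces the decoder onto the best path with the requested number of occurrences.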

Similarly, the restricted posterior-Viterbi algorithm uses a table, $\hat{\gamma}$, whose entries are the maximum posterior probabilities of a decoding from $A_p$ containing $k$ pattern occurrences, ending in state $x_t$ and automaton state $q$ at time $t$:

$$\hat{\gamma}_t(x_t, k, q) = \max_{\{x_{1:t-1} \in A_p\,:\, O_r(x_{1:t}) = k\}} \{\mathbb{P}(x_{1:t}, FA_H(r)_t = q \mid y_{1:T})\}$$

We have:

$$\hat{\gamma}_t(x_t, k, q) = \gamma_t(x_t) \max_{q' \in \delta^{-1}(q,\, x_t)} \max_{x_{t-1}\,:\, a_{x_{t-1},x_t} > 0} \{\hat{\gamma}_{t-1}(x_{t-1},\, k - \mathbb{1}[q \in A],\, q')\} \qquad (9)$$

The restricted decoding with the highest posterior probability is retrieved by backtracking through $\hat{\gamma}$ from entry $\mathrm{argmax}_{h_i,\, k \in [l,u],\, q} \{\hat{\gamma}_T(h_i, k, q)\}$.

#### 3.3. Experimental Results on Simulated Data

To evaluate the algorithms, we simulated observed sequences of varying lengths from the HMM in Figure 1, 500 for each length, and searched for the pattern $r = (NC_1) \mid (R_1 N)$, corresponding to the start of a gene. For each of the sequences, we estimated the number of overlapping pattern occurrences with the expected number of pattern occurrences, computed using the restricted forward algorithm, which we then used to run the restricted decoding algorithms. We also computed the prediction given by the two unrestricted decoding algorithms for comparison.

**Figure 3.**Normalized difference, $\frac{\text{estimate}}{\text{true value}}-1$, between the true number of pattern occurrences, the number given by the two unrestricted decoding algorithms and the expected number of pattern occurrences computed using the restricted forward algorithm. For each sequence length, we show the median of the normalized differences in the 500 experiments, together with the 0.025 and 0.975 quantiles, given as error bars.

#### 3.3.1. Counting Pattern Occurrences

#### 3.3.2. Quality of Predictions

#### Nucleotide level

#### Gene level

**Figure 4.**Error types at the gene level. A predicted gene is considered one true positive if it overlaps with at least 50% of a true gene and one false positive if there is no true gene with which it overlaps by at least 50%. Each true gene for which there is no predicted gene that overlaps by at least 50% counts as one false negative. True genes that are covered more than 50% by predicted genes, but for which there is no single predicted gene that covers a minimum of 50% are disregarded.

In terms of running time, the restricted algorithms incurred a factor $k|Q|^2 = 7 \cdot 5^2 = 175$ slowdown relative to the unrestricted ones, as the average expectation of the number of patterns per sequence was $k = 7$.

**Figure 5.**Prediction quality at the nucleotide level given by average sensitivity, specificity and Matthew's correlation coefficient (MCC) for the decoding algorithms. We ran the restricted decoding algorithms using the expectation calculated from the distribution returned by the restricted forward algorithm. The plots show a zoom-in of the three measures. As both sensitivity and specificity are between zero and one, the Y-axes in these two plots have the same scale.

**Figure 6.**Prediction quality at the gene level given by average recall, precision and F-score for the decoding algorithms. We ran the restricted decoding algorithms using the expectation calculated from the distribution returned by the restricted forward algorithm. The plots show a zoom-in of the three measures with the same scale on the Y axes.

## 4. Conclusions

Given a prior distribution for the position of the $k$-th pattern match, the restricted Viterbi algorithm could potentially be further extended to incorporate this distribution while calculating the joint probability of observed and hidden sequences. Weighted transducers [26] are sequence modeling tools similar to HMMs, and analyzing patterns in the hidden sequence can potentially also be done by composition of the transducers, which describe the HMM and the automaton.

## Acknowledgments

## Conflicts of Interest

## References

- Chong, J.; Yi, Y.; Faria, A.; Satish, N.; Keutzer, K. Data-parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors. Proceedings of the 1st Annual Workshop on Emerging Applications and Many Core Architecture, Beijing, China, June 2008; pp. 23–35.
- Gales, M.; Young, S. The application of hidden Markov models in speech recognition. Found. Trends Signal Process.
**2007**, 1, 195–304. [Google Scholar] - Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE
**1989**, 77, 257–286. [Google Scholar] - Li, J.; Gray, R. Image Segmentation and Compression Using Hidden Markov Models; Springer: Berlin/Heidelberg, Germany, 2000; Volume 571. [Google Scholar]
- Karplus, K.; Barrett, C.; Cline, M.; Diekhans, M.; Grate, L.; Hughey, R. Predicting protein structure using only sequence information. Proteins Struct. Funct. Bioinformatics
**1999**, 37, 121–125. [Google Scholar] - Krogh, A.; Brown, M.; Mian, I.; Sjolander, K.; Haussler, D. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol.
**1994**, 235, 1501–1531. [Google Scholar] - Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol.
**2001**, 305, 567–580. [Google Scholar] - Eddy, S. Multiple Alignment Using Hidden Markov Models. Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, UK, 16–19 July 1995; Volume 3, pp. 114–120.
- Eddy, S. Profile hidden Markov models. Bioinformatics
**1998**, 14, 755–763. [Google Scholar] - Lunter, G. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics
**2007**, 23, i289–i296. [Google Scholar] - Mailund, T.; Dutheil, J.Y.; Hobolth, A.; Lunter, G.; Schierup, M.H. Estimating divergence time and ancestral effective population size of bornean and sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet.
**2011**, 7, e1001319. [Google Scholar] - Siepel, A.; Haussler, D. Phylogenetic Hidden Markov Models. In Statistical Methods in Molecular Evolution; Nielsen, R., Ed.; Springer: New York, NY, USA, 2005; pp. 325–351. [Google Scholar]
- Antonov, I.; Borodovsky, M. GeneTack: Frameshift identification in protein-coding sequences by the Viterbi algorithm. J. Bioinforma. Comput. Biol.
**2010**, 8, 535–551. [Google Scholar] - Lukashin, A.; Borodovsky, M. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res.
**1998**, 26, 1107–1115. [Google Scholar] - Krogh, A.; Mian, I.S.; Haussler, D. A hidden Markov model that finds genes in E.coli DNA. Nucleic Acids Res.
**1994**, 22, 4768–4778. [Google Scholar] - Aston, J.A.D.; Martin, D.E.K. Distributions associated with general runs and patterns in hidden Markov models. Ann. Appl. Stat.
**2007**, 1, 585–611. [Google Scholar] - Fu, J.; Koutras, M. Distribution theory of runs: A Markov chain approach. J. Am. Stat. Assoc.
**1994**, 89, 1050–1058. [Google Scholar] - Nuel, G. Pattern Markov chains: Optimal Markov chain embedding through deterministic finite automata. J. Appl. Probab.
**2008**, 45, 226–243. [Google Scholar] - Wu, T.L. On finite Markov chain imbedding and its applications. Methodol. Comput. Appl. Probab.
**2011**, 15, 453–465. [Google Scholar] - Lladser, M.; Betterton, M.; Knight, R. Multiple pattern matching: A Markov chain approach. J. Math. Biol.
**2008**, 56, 51–92. [Google Scholar] - Nicodeme, P.; Salvy, B.; Flajolet, P. Motif statistics. Theor. Comput. Sci.
**2002**, 287, 593–617. [Google Scholar] - Fariselli, P.; Martelli, P.L.; Casadio, R. A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins. BMC Bioinformatics
**2005**, 6. [Google Scholar] [CrossRef] - Thompson, K. Programming techniques: Regular expression search algorithm. Commun. ACM
**1968**, 11, 419–422. [Google Scholar] - Møller, A. dk.brics.automaton—Finite-State Automata and Regular Expressions for Java. 2010. Available online: http://www.brics.dk/automaton/ (accessed on 16 January 2012).
- Burset, M.; Guigo, R. Evaluation of gene structure prediction programs. Genomics
**1996**, 34, 353–367. [Google Scholar] - Mohri, M. Weighted Automata Algorithms. In Handbook of Weighted Automata; Springer: Berlin/Heidelberg, Germany, 2009; pp. 213–254. [Google Scholar]

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Tataru, P.; Sand, A.; Hobolth, A.; Mailund, T.; Pedersen, C.N.S.
Algorithms for Hidden Markov Models Restricted to Occurrences of Regular Expressions. *Biology* **2013**, *2*, 1282-1295.
https://doi.org/10.3390/biology2041282
