
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.

A Hidden Markov Model (HMM) is a probabilistic model for sequential data with an underlying hidden structure. Because of their computational and analytical tractability, HMMs are widely used, especially in speech recognition [

Patterns in the hidden structure are, however, often more relevant to study than the full hidden structure itself. When modeling proteins, one might be interested in neighboring secondary structures that differ, while for sequence alignments, the pattern could capture specific characteristics, such as long indels. In phylogenetic analysis, changes in the tree along the sequence are most relevant, while when investigating coding regions of DNA data, patterns corresponding to genes are the main focus.

Counting the number of occurrences of such patterns can be approached (as in the methods based on [

We present a fundamentally different approach to compute the distribution of the number of pattern occurrences and show how it can be used to improve the prediction of the hidden structure. We use regular expressions as patterns and employ their deterministic finite automata to keep track of occurrences. The use of automata to describe occurrences of patterns in Markov sequences has been described previously in [

The remainder of this paper is organized as follows: we start by introducing Hidden Markov Models and automata; we continue by presenting our restricted algorithms, which we then validate experimentally.

A Hidden Markov Model (HMM) [ generates a hidden sequence x_{1:T} = x_1 x_2 … x_T and an observed sequence y_{1:T} = y_1 y_2 … y_T. It is specified by a set of hidden states {h_1, h_2, …, h_N}, an alphabet of observables {o_1, o_2, …, o_M}, initial probabilities π_{h_i} (1 ≤ i ≤ N), where π_{h_i} = ℙ(x_1 = h_i) is the probability of starting in state h_i, transition probabilities A_{h_i,h_j} (1 ≤ i, j ≤ N), where A_{h_i,h_j} = ℙ(x_t = h_j ∣ x_{t−1} = h_i) is the probability of moving from state h_i to state h_j, and emission probabilities B_{h_i,o_j}, where B_{h_i,o_j} = ℙ(y_t = o_j ∣ x_t = h_i) is the probability of emitting o_j from state h_i.


A Hidden Markov Model (HMM) for gene prediction. Each box represents a hidden state, and the numbers inside are the emission probabilities of each nucleotide. Numbers on arcs are transition probabilities between hidden states.

HMMs can be used to generate sequences of observables, but their main application is the analysis of an observed sequence, y_{1:T}. The classical analyses described below can all be performed in O(TN^2) time, using dynamic programming.

The forward algorithm [ computes the probability ℙ(y_{1:T}) of observing the sequence y_{1:T}. The generative process behind this probability is that the model first enters a hidden state x_1 and emits y_1 from x_1, and, for every subsequent position t, moves from x_{t−1} to x_t and emits y_t from x_t.

The forward algorithm finds ℙ(y_{1:T}) by computing the joint probabilities α_t(i) = ℙ(y_{1:t}, x_t = h_i) via the recursion α_t(i) = B_{h_i,y_t} Σ_j α_{t−1}(j) A_{h_j,h_i}, with base case α_1(i) = π_{h_i} B_{h_i,y_1}; finally, ℙ(y_{1:T}) = Σ_i α_T(i).
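As an illustration, the forward recursion can be sketched in Java for a toy two-state HMM; the probabilities below are arbitrary and are not the parameters of the gene model from the paper.

```java
public class Forward {
    // Toy two-state HMM (illustrative numbers, not the paper's gene model).
    static final double[] PI = {0.6, 0.4};                // initial probabilities
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}}; // transitions A[i][j]
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}}; // emissions B[i][o]

    // Forward algorithm: returns P(y_{1:T}) by summing alpha_T(i) over i.
    static double likelihood(int[] y) {
        int n = PI.length;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) alpha[i] = PI[i] * B[i][y[0]]; // base case
        for (int t = 1; t < y.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int i = 0; i < n; i++) s += alpha[i] * A[i][j];
                next[j] = s * B[j][y[t]];
            }
            alpha = next;
        }
        double p = 0;
        for (double a : alpha) p += a;
        return p;
    }

    public static void main(String[] args) {
        System.out.println(likelihood(new int[]{0, 1, 0}));
    }
}
```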

The Viterbi algorithm [ finds a most likely hidden sequence given the observed sequence y_{1:T} by computing δ_t(i) = max_{x_{1:t−1}} ℙ(x_{1:t}, y_{1:t}), where the maximum is over hidden sequences ending in x_t = h_i.

After computing the δ table, a most likely hidden sequence is recovered by backtracking, starting from the entry argmax_i δ_T(i).
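A corresponding sketch of the Viterbi algorithm, with backtracking pointers, for the same kind of toy two-state HMM (the parameters are again illustrative):

```java
public class Viterbi {
    static final double[] PI = {0.6, 0.4};                // initial probabilities
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}}; // transitions A[i][j]
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}}; // emissions B[i][o]

    // Viterbi algorithm: a most likely hidden sequence for observations y.
    static int[] decode(int[] y) {
        int T = y.length, n = PI.length;
        double[][] delta = new double[T][n];
        int[][] back = new int[T][n];                     // backtracking pointers
        for (int i = 0; i < n; i++) delta[0][i] = PI[i] * B[i][y[0]];
        for (int t = 1; t < T; t++)
            for (int j = 0; j < n; j++) {
                int best = 0;
                for (int i = 1; i < n; i++)
                    if (delta[t - 1][i] * A[i][j] > delta[t - 1][best] * A[best][j]) best = i;
                back[t][j] = best;
                delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][y[t]];
            }
        int[] x = new int[T];
        for (int i = 1; i < n; i++)                        // start from argmax_i delta_T(i)
            if (delta[T - 1][i] > delta[T - 1][x[T - 1]]) x[T - 1] = i;
        for (int t = T - 1; t > 0; t--) x[t - 1] = back[t][x[t]];
        return x;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(decode(new int[]{0, 0, 1, 1})));
    }
}
```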

The posterior decoding [ finds, for each position t in the observed sequence y_{1:T}, the hidden state h_i that maximizes the posterior probability γ_t(i) = ℙ(x_t = h_i ∣ y_{1:T}) = α_t(i) · ℙ(y_{t+1:T} ∣ x_t = h_i) / ℙ(y_{1:T}).

The backward algorithm [ computes the probabilities β_t(i) = ℙ(y_{t+1:T} ∣ x_t = h_i) by a recursion analogous to that of the forward algorithm, run from the end of the observed sequence towards the beginning.
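The backward recursion and the resulting posterior decoding can be sketched together as follows, again for an illustrative toy two-state HMM:

```java
public class Posterior {
    static final double[] PI = {0.6, 0.4};                // initial probabilities
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}}; // transitions A[i][j]
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}}; // emissions B[i][o]

    // Posterior decoding: for every position t, pick the state maximizing
    // alpha_t(i) * beta_t(i), which is proportional to P(x_t = h_i | y).
    static int[] decode(int[] y) {
        int T = y.length, n = PI.length;
        double[][] alpha = new double[T][n], beta = new double[T][n];
        for (int i = 0; i < n; i++) alpha[0][i] = PI[i] * B[i][y[0]];
        for (int t = 1; t < T; t++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int i = 0; i < n; i++) s += alpha[t - 1][i] * A[i][j];
                alpha[t][j] = s * B[j][y[t]];
            }
        for (int i = 0; i < n; i++) beta[T - 1][i] = 1.0;  // backward base case
        for (int t = T - 2; t >= 0; t--)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    beta[t][i] += A[i][j] * B[j][y[t + 1]] * beta[t + 1][j];
        int[] x = new int[T];
        for (int t = 0; t < T; t++)
            for (int i = 1; i < n; i++)
                if (alpha[t][i] * beta[t][i] > alpha[t][x[t]] * beta[t][x[t]]) x[t] = i;
        return x;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(decode(new int[]{0, 1, 0})));
    }
}
```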

The posterior decoding algorithm often computes decodings that are very accurate locally, but it may return syntactically incorrect decodings; the posterior-Viterbi algorithm [ remedies this by finding, among the decodings permitted by the model, one that maximizes the product of the posterior probabilities γ_t(x_t), computed via a Viterbi-like table δ^p_t(i).

After computing the δ^p table, the decoding is recovered by backtracking, starting from the entry argmax_i δ^p_T(i).

In this paper, we are interested in patterns over the hidden alphabet {h_1, h_2, …, h_N}. A pattern is given by a regular expression R over this alphabet, for example, R = (h_1 ∣ h_2 ∣ … ∣ h_N). To detect occurrences of R, we use a deterministic finite automaton A(R) with an initial state q_0 ∈ Q and a set of accepting states.

Essentially, A(R) is run along the hidden sequence, and each time an accepting state is entered, one occurrence of R has been matched. To be able to count subsequent occurrences after a match has been found, a second automaton is used. The algorithms we provide are independent of which of the two automata is used, and therefore, all that remains is to switch between them when needed. In our implementation, we used an automata library for Java [
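The counting mechanism can be illustrated with a small hand-built DFA for the two-symbol pattern h_1 h_2 (encoded as symbols 0 and 1); here the outgoing transitions of the accepting state mimic those of the initial state, playing the role of the second automaton described above. Both the pattern and the transition table are illustrative, not taken from the paper.

```java
public class PatternCount {
    // DFA over the hidden alphabet {0, 1} recognizing the illustrative
    // pattern "0 1". delta[q][symbol] = next state; state 2 is accepting.
    // State 0: nothing matched; state 1: a '0' has just been read;
    // state 2: a full match has just been completed.
    static final int[][] DELTA = {{1, 0}, {1, 2}, {1, 0}};
    static final int ACCEPT = 2;

    // Run the automaton along a hidden sequence and count how often an
    // accepting state is entered, i.e., how often the pattern occurs.
    static int count(int[] hidden) {
        int q = 0, matches = 0;
        for (int h : hidden) {
            q = DELTA[q][h];
            if (q == ACCEPT) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(count(new int[]{0, 1, 1, 0, 0, 1})); // prints 2
    }
}
```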

Two automata for the example pattern (h_1) ∣ (h_1 …) over the hidden alphabet. The initial state is q_0 = 0, and accepting states are marked; the two automata differ only in states 2 and 3.

Consider an HMM as defined previously, and let R be a pattern over its hidden alphabet with automaton A(R).

Let O_R(x_{1:T}) denote the number of occurrences of R in the hidden sequence x_{1:T}. The restricted forward algorithm computes, for every position t, hidden state h_i, occurrence count k and automaton state q, the probability α̂_t(i, k, q) = Σ_{x_{1:t−1}: O_R(x_{1:t}) = k} ℙ(x_{1:t}, y_{1:t}), where the sum is over hidden prefixes that end in x_t = h_i, contain exactly k occurrences of R and leave the automaton in state q after reading x_{1:t}. The table is filled by a recursion analogous to that of the forward algorithm, additionally following the transitions of the automaton.

These probabilities allow for the evaluation of the distribution of the number of occurrences, conditioned on the observed data: ℙ(O_R(x_{1:T}) = k ∣ y_{1:T}) = Σ_{i,q} α̂_T(i, k, q) / Σ_{i,k,q} α̂_T(i, k, q), where the denominator equals ℙ(y_{1:T}).
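A sketch of the restricted forward computation, combining a toy two-state HMM with a hand-built DFA for the illustrative two-symbol pattern "0 1"; all parameters are made up for illustration, and counts above k_max are collapsed into the last entry.

```java
public class RestrictedForward {
    static final double[] PI = {0.6, 0.4};                // initial probabilities
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}}; // transitions A[i][j]
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}}; // emissions B[i][o]
    // DFA for the illustrative pattern "0 1" over the hidden alphabet;
    // state 2 is accepting, and its transitions mimic the initial state.
    static final int[][] DELTA = {{1, 0}, {1, 2}, {1, 0}};
    static final int ACCEPT = 2;

    // Restricted forward algorithm: distribution of the number of pattern
    // occurrences in the hidden sequence, conditioned on observations y.
    // dist[kMax] collects all counts >= kMax (collapsed).
    static double[] countDistribution(int[] y, int kMax) {
        int n = PI.length, Q = DELTA.length;
        // tab[i][k][q]: prob. of emitting y_{1:t}, ending in hidden state i,
        // with k pattern occurrences so far and the automaton in state q.
        double[][][] tab = new double[n][kMax + 1][Q];
        for (int i = 0; i < n; i++) {
            int q = DELTA[0][i];                          // automaton reads x_1 = i
            int k0 = Math.min(kMax, q == ACCEPT ? 1 : 0);
            tab[i][k0][q] += PI[i] * B[i][y[0]];
        }
        for (int t = 1; t < y.length; t++) {
            double[][][] next = new double[n][kMax + 1][Q];
            for (int i = 0; i < n; i++)
                for (int k = 0; k <= kMax; k++)
                    for (int q = 0; q < Q; q++) {
                        if (tab[i][k][q] == 0) continue;
                        for (int j = 0; j < n; j++) {
                            int q2 = DELTA[q][j];          // automaton reads x_t = j
                            int k2 = Math.min(kMax, k + (q2 == ACCEPT ? 1 : 0));
                            next[j][k2][q2] += tab[i][k][q] * A[i][j] * B[j][y[t]];
                        }
                    }
            tab = next;
        }
        double[] dist = new double[kMax + 1];
        double total = 0;
        for (int i = 0; i < n; i++)
            for (int k = 0; k <= kMax; k++)
                for (int q = 0; q < Q; q++) { dist[k] += tab[i][k][q]; total += tab[i][k][q]; }
        for (int k = 0; k <= kMax; k++) dist[k] /= total; // condition on y
        return dist;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(countDistribution(new int[]{0, 1, 0, 1}, 2)));
    }
}
```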

The restricted forward algorithm retains the O(TN^2) shape of the classical forward algorithm, but both its running time and its space consumption grow by a factor proportional to the number of automaton states and to the number of occurrence counts tracked; counts above a chosen maximum, k_max, can be collapsed into a single value, so that ℙ(O_R(x_{1:T}) ≥ k_max ∣ y_{1:T}) is still available.

The aim of the restricted decoding algorithms is to obtain a sequence of hidden states, x_{1:T}, in which the number of pattern occurrences, O_R(x_{1:T}), lies in a given interval [l, u]. Like the restricted forward algorithm, both run in O(TN^2) time, scaled by the number of automaton states and occurrence counts.

The entries in the table for the restricted Viterbi algorithm contain the probability of a most likely decoding containing exactly k pattern occurrences: δ̂_t(i, k, q) = max_{x_{1:t−1}: O_R(x_{1:t}) = k} ℙ(x_{1:t}, y_{1:t}), where, as before, the decodings considered end in x_t = h_i and leave the automaton in state q. The backtracking starts in the entry argmax_{i, k∈[l,u], q} δ̂_T(i, k, q).
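The restricted Viterbi recursion can be sketched in the same illustrative setting (toy HMM, hand-built DFA for the pattern "0 1"); for brevity, this version returns only the probability of a best decoding whose occurrence count lies in [l, u], and the backtracking pointers that would recover the decoding itself are omitted.

```java
public class RestrictedViterbi {
    static final double[] PI = {0.6, 0.4};                // initial probabilities
    static final double[][] A = {{0.7, 0.3}, {0.4, 0.6}}; // transitions A[i][j]
    static final double[][] B = {{0.9, 0.1}, {0.2, 0.8}}; // emissions B[i][o]
    // DFA for the illustrative pattern "0 1"; state 2 is accepting.
    static final int[][] DELTA = {{1, 0}, {1, 2}, {1, 0}};
    static final int ACCEPT = 2;

    // Probability of a most likely decoding with a pattern occurrence
    // count in [l, u]; decodings exceeding u occurrences are pruned.
    static double best(int[] y, int l, int u) {
        int n = PI.length, Q = DELTA.length;
        double[][][] d = new double[n][u + 1][Q];
        for (int i = 0; i < n; i++) {
            int q = DELTA[0][i];
            int k0 = q == ACCEPT ? 1 : 0;
            if (k0 <= u) d[i][k0][q] = PI[i] * B[i][y[0]];
        }
        for (int t = 1; t < y.length; t++) {
            double[][][] next = new double[n][u + 1][Q];
            for (int i = 0; i < n; i++)
                for (int k = 0; k <= u; k++)
                    for (int q = 0; q < Q; q++) {
                        if (d[i][k][q] == 0) continue;
                        for (int j = 0; j < n; j++) {
                            int q2 = DELTA[q][j];
                            int k2 = k + (q2 == ACCEPT ? 1 : 0);
                            if (k2 > u) continue;          // too many occurrences
                            double p = d[i][k][q] * A[i][j] * B[j][y[t]];
                            if (p > next[j][k2][q2]) next[j][k2][q2] = p;
                        }
                    }
            d = next;
        }
        double bestP = 0;                                  // max over k in [l, u]
        for (int i = 0; i < n; i++)
            for (int k = l; k <= u; k++)
                for (int q = 0; q < Q; q++) bestP = Math.max(bestP, d[i][k][q]);
        return bestP;
    }

    public static void main(String[] args) {
        // Require at least one occurrence of the pattern in the decoding.
        System.out.println(best(new int[]{0, 0, 1, 1}, 1, 2));
    }
}
```

With l = 0 and u large enough, the result coincides with the unrestricted Viterbi probability, which gives a convenient sanity check.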

For the restricted posterior-Viterbi algorithm, we compute the highest product of posterior state probabilities of a decoding from positions 1 to t containing exactly k pattern occurrences: δ̂^p_t(i, k, q), again restricted to decodings ending in x_t = h_i with the automaton in state q, where the posterior probabilities are computed with respect to the full observed sequence y_{1:T}.

The backtracking starts in the entry argmax_{i, k∈[l,u], q} δ̂^p_T(i, k, q).

We implemented the algorithms in Java and validated and evaluated their performance experimentally as follows: we first generated a test set consisting of 500 pairs of observed and hidden sequences for each length considered, using the pattern (h_1) ∣ (h_1 … from the earlier example.

Normalized difference between the estimated and the true number of pattern occurrences.

We compared the predictive power of the two decoding algorithms in the original and restricted versions, using the expectation for the number of pattern occurrences. For each length in the test set, we measured the quality of each method, both at the nucleotide and gene level, following the analysis in [

To investigate the quality at the nucleotide level, we compared the decoding and the true hidden state position by position. Each position can be classified as a true positive (predicted as part of a gene when it was part of a gene), a true negative (predicted as non-coding when it was non-coding), a false positive (predicted as part of a gene when it was non-coding) or a false negative (predicted as non-coding when it was part of a gene). Using the total numbers of true positives (

Sensitivity and specificity are always between zero and one and relate to how well the algorithms are able to find genes (true positives) and non-coding regions (true negatives), respectively. MCC reflects the overall correctness and lies between −1 and 1, where 1 represents perfect prediction.
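The three nucleotide-level measures are computed directly from the four counts: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and the standard MCC formula. The counts used in main below are made up for illustration.

```java
public class NucleotideMetrics {
    // Sensitivity, specificity and Matthews correlation coefficient
    // from position-wise counts of TP, TN, FP and FN.
    static double sensitivity(long tp, long fn) { return tp / (double) (tp + fn); }
    static double specificity(long tn, long fp) { return tn / (double) (tn + fp); }

    static double mcc(long tp, long tn, long fp, long fn) {
        double num = (double) tp * tn - (double) fp * fn;
        double den = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return den == 0 ? 0 : num / den;                  // convention: 0 if undefined
    }

    public static void main(String[] args) {
        System.out.println(sensitivity(90, 10));          // prints 0.9
        System.out.println(specificity(80, 20));          // prints 0.8
        System.out.println(mcc(90, 80, 20, 10));
    }
}
```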

When looking at the decoding position by position, correctly predicted genes do not contribute equally to the measures; rather, the longer the gene, the more it contributes. However, it is also interesting to see how well the genes are recovered, independent of their length. To measure this, we consider a predicted gene as one true positive if it overlaps at least 50% of a true gene, and as one false positive if there is no true gene with which it overlaps by at least 50%. Each true gene for which there is no predicted gene that overlaps it by at least 50% counts as one false negative; see

They are all between zero and one and reflect how well the true genes have been recovered. The recall gives the fraction of true genes that have been found, while the precision gives the fraction of the predicted genes that are true genes. The F-score is the harmonic mean of the two.

Error types at the gene level. A predicted gene is considered one true positive if it overlaps with at least 50% of a true gene and one false positive if there is no true gene with which it overlaps by at least 50%. Each true gene for which there is no predicted gene that overlaps by at least 50% counts as one false negative. True genes that are covered more than 50% by predicted genes, but for which there is no single predicted gene that covers a minimum of 50% are disregarded.

Apart from these experiments, we also ran the algorithms on the annotated genome; here, the restricted algorithms incurred a 7 · 5^2 = 175-fold slowdown, as the average expectation of the number of patterns per sequence was

Prediction quality at the nucleotide level given by average sensitivity, specificity and Matthews correlation coefficient (MCC) for the decoding algorithms. We ran the restricted decoding algorithms using the expectation calculated from the distribution returned by the restricted forward algorithm. The plots show a zoom-in of the three measures. As both sensitivity and specificity are between zero and one, the Y-axes in these two plots have the same scale.

Prediction quality at the gene level given by average recall, precision and F-score for the decoding algorithms. We ran the restricted decoding algorithms using the expectation calculated from the distribution returned by the restricted forward algorithm. The plots show a zoom-in of the three measures with the same scale on the Y axes.

We have introduced three novel algorithms that efficiently combine the theory of Hidden Markov Models with automata and pattern matching to recover pattern occurrences in the hidden sequence. First, we computed the distribution of the number of pattern occurrences by using an algorithm similar to the forward algorithm. This problem has been treated in [

From the occurrence number distribution, we calculated the expected number of pattern matches, which estimated the true number of occurrences with high precision. We then used the distribution to alter the prediction given by the two most widely used decoding algorithms: the Viterbi algorithm and the posterior-Viterbi algorithm. We have shown that in the case of the Viterbi algorithm, which finds the best global prediction, using the expected number of pattern occurrences greatly improves the prediction, both at the nucleotide and gene level. However, in the case of the posterior-Viterbi algorithm, which finds the best local prediction, such an addition only fragments the predicted genes, leading to a poorer prediction. Overall, deciding which algorithm is best depends on the final measure used, but as our focus was on finding genes, we conclude that the restricted Viterbi algorithm showed the best result.

As the distribution obtained from the restricted forward algorithm facilitates the calculation of the distribution of the waiting time until the occurrence of the k-th pattern match, the restricted Viterbi algorithm could potentially be further extended to incorporate this distribution while calculating the joint probability of observed and hidden sequences. Weighted transducers [

Our method can presumably be used with already existing HMMs to improve their prediction, by using patterns that reflect the problems studied using the HMMs. For example, in [

We are grateful to Jens Ledet Jensen for useful discussions in the initial phase of this study.

The authors declare no conflict of interest.