Abstract
Practitioners have used hidden Markov models (HMMs) in different problems for about sixty years. Conditional random fields (CRFs) are an alternative to HMMs and appear in the literature as different and somewhat competing models. We propose two contributions. First, we show that basic linear-chain CRFs (LC-CRFs), considered as different from HMMs, are in fact equivalent to them, in the sense that for each LC-CRF there exists an HMM, which we specify, whose posterior distribution is identical to the given LC-CRF. Second, we show that it is possible to reformulate the generative Bayesian classifiers maximum posterior mode (MPM) and maximum a posteriori (MAP), used in HMMs, as discriminative ones. The last point is of importance in many fields, especially in natural language processing (NLP), as it shows that in some situations dropping HMMs in favor of CRFs is not necessary.
1. Introduction
Let $Z_{1:N} = (X_{1:N}, Y_{1:N})$ be a stochastic sequence, with $X_{1:N} = (X_1, \ldots, X_N)$ and $Y_{1:N} = (Y_1, \ldots, Y_N)$. The random variables $X_1, \ldots, X_N$ take their values in a finite set $\Lambda$, while $Y_1, \ldots, Y_N$ take their values either in a discrete or in a continuous set $\Omega$. Realizations of $X_{1:N}$ are hidden while realizations of $Y_{1:N}$ are observed, and the problem we deal with is to estimate $x_{1:N}$ from $y_{1:N}$. We deal with Bayesian methods of estimation, which require a probabilistic model. A probabilistic model is a distribution, or a family of distributions, denoted by $p(x_{1:N}, y_{1:N})$ or $p(x_{1:N} \mid y_{1:N})$. We are interested in the case of dependent $X_1, \ldots, X_N$. The simplest model taking this dependence into account is the well-known hidden Markov model (HMM) [,,,,], whose distribution is given by
$$p(x_{1:N}, y_{1:N}) = p(x_1)\,p(y_1 \mid x_1)\prod_{n=1}^{N-1} p(x_{n+1} \mid x_n)\,p(y_{n+1} \mid x_{n+1}). \quad (1)$$
Throughout the paper, we consider HMMs such that the transition probabilities $p(x_{n+1} \mid x_n)$ and the emission probabilities $p(y_n \mid x_n)$ in Equation (1) are non-null. HMMs allow recursive fast computation of Bayesian estimators, called “classifiers” in this paper, and recalled below. In spite of their simplicity, HMMs are very robust and provide quite satisfactory results in many applications.
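As a minimal illustration of Equation (1), the joint probability of a short sequence can be computed directly; the sketch below assumes a stationary HMM with a finite observation set, and the arrays `pi`, `A`, and `B` are hypothetical parameter names, not notation from the paper.

```python
import numpy as np

# Hypothetical stationary HMM parameters (finite observation set assumed).
# pi[i]   = p(x_1 = i)
# A[i, j] = p(x_{n+1} = j | x_n = i)
# B[i, k] = p(y_n = k | x_n = i)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def hmm_joint(x, y):
    """Joint probability p(x_{1:N}, y_{1:N}) factorized as in Equation (1)."""
    p = pi[x[0]] * B[x[0], y[0]]
    for n in range(len(x) - 1):
        p *= A[x[n], x[n + 1]] * B[x[n + 1], y[n + 1]]
    return p

print(hmm_joint([0, 1, 1], [0, 1, 1]))
```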
Moreover, conditional random fields (CRFs) [,] also allow estimating $X_{1:N}$ from $Y_{1:N}$. Their definition differs from that of HMMs in that in CRFs one directly considers $p(x_{1:N} \mid y_{1:N})$, and neither $p(y_{1:N} \mid x_{1:N})$ nor $p(y_{1:N})$ is needed to perform the estimation. The distribution of general LC-CRFs is written as
$$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})}\exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}, y_{1:N}) + \sum_{n=1}^{N} U_n(x_n, y_{1:N})\right]. \quad (2)$$
In this paper, we consider the following basic LC-CRF:
$$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})}\exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}) + \sum_{n=1}^{N} U_n(x_n, y_n)\right], \quad (3)$$
with $\kappa(y_{1:N})$ the normalizing constant.
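For a small $N$ and a finite $\Lambda$, the normalization $\kappa(y_{1:N})$ in Equation (3) can be computed by brute force. The following sketch, with hypothetical potential arrays `V` and `U`, is only meant to make the definition concrete; it is not an efficient implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, n_states, n_obs = 4, 3, 5

# Hypothetical potentials of the basic LC-CRF (3):
# V[n, i, j] = V_n(x_n = i, x_{n+1} = j), U[n, i, k] = U_n(x_n = i, y_n = k)
V = rng.normal(size=(N - 1, n_states, n_states))
U = rng.normal(size=(N, n_states, n_obs))

def unnormalized(x, y):
    """exp of the sum of the pairwise and unary potentials of Equation (3)."""
    s = sum(V[n, x[n], x[n + 1]] for n in range(N - 1))
    s += sum(U[n, x[n], y[n]] for n in range(N))
    return np.exp(s)

def crf_posterior(x, y):
    """p(x_{1:N} | y_{1:N}) with kappa(y_{1:N}) computed by exhaustive summation."""
    kappa = sum(unnormalized(xp, y)
                for xp in itertools.product(range(n_states), repeat=N))
    return unnormalized(x, y) / kappa

y = [0, 2, 1, 4]
print(crf_posterior([0, 1, 1, 2], y))
```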
Authors usually consider the two families, HMMs and LC-CRFs, as different [,,,,,,,]: the former are classified in the category of “generative” models, while the latter are classified in the category of “discriminative” models.
In this paper, we investigate the relationship between HMMs (Equation (1)) and LC-CRFs (Equation (3)). We immediately see that if $p(x_{1:N}, y_{1:N})$ is of the form (Equation (1)), then $p(x_{1:N} \mid y_{1:N})$ is of the form (Equation (3)). Indeed, $p(x_{n+1} \mid x_n)$ and $p(y_n \mid x_n)$ in Equation (1) being non-zero, one can take $V_n(x_n, x_{n+1}) = \log p(x_{n+1} \mid x_n)$ for $n = 1$, …, $N-1$, $U_n(x_n, y_n) = \log p(y_n \mid x_n)$ for $n = 2$, …, $N$, and $U_1(x_1, y_1) = \log\left[p(x_1)\,p(y_1 \mid x_1)\right]$. Thus, the posterior distribution of each HMM (Equation (1)) satisfying the positivity conditions assumed above is an LC-CRF (Equation (3)). Our first contribution is to show the converse: for a given LC-CRF $p(x_{1:N} \mid y_{1:N})$ of form (Equation (3)), there is an HMM $q(x_{1:N}, y_{1:N})$ of form (Equation (1)) such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$. Moreover, we give an exact computation of the HMM parameters $q(x_1)$, $q(y_1 \mid x_1)$, $q(x_{n+1} \mid x_n)$, $q(y_{n+1} \mid x_{n+1})$ from the LC-CRF parameters $V_n$, $U_n$. Such a general result, without additional constraints on the LC-CRF, is new.
Our second contribution is related to the computation of Bayesian classifiers. In some situations, CRFs are preferred over HMMs because the Bayesian classifiers based on CRFs are computed without considering the distributions $p(y_{1:N} \mid x_{1:N})$, which are difficult to model. This is particularly the case in natural language processing (NLP). Indeed, when considering an HMM (Equation (1)) and calculating the Bayesian classifiers “maximum posterior mode” (MPM) and “maximum a posteriori” (MAP) in the standard way, we use $p(y_n \mid x_n)$, and this is the reason why LC-CRFs are preferred over HMMs. In this paper, we show that the HMM-based MPM and MAP can also be computed without using $p(y_n \mid x_n)$. Let us also recall that each Bayesian “discriminative” classifier based on an LC-CRF (Equation (3)) is identical to the “generative” Bayesian classifier based on an HMM (Equation (1)) whose posterior distribution is the given LC-CRF, since Bayesian classifiers only depend on the posterior distribution. Thus, the distinction between “generative” classifiers and “discriminative” classifiers is misleading: they are all “discriminative”, but they can be computed in a “generative” way, using $p(y_{1:N} \mid x_{1:N})$, or in a discriminative manner, without using it. Our second contribution is therefore to show that the HMM-based MPM and MAP, usually computed in a generative manner, can also be computed in a discriminative manner. This is important because it shows that abandoning HMMs in favor of CRFs in the aforementioned situations is not justified. Indeed, combined with the first contribution, it shows that the use of the MPM or MAP based on an HMM (Equation (1)) is as interesting as the use of the MPM or MAP based on an LC-CRF (Equation (3)).
Let us give some more technical details on the two contributions.
- 1.
- We establish an equivalence between HMMs (Equation (1)) and basic linear-chain CRFs (Equation (3)), which completes the results presented in [].
Let us notice that comparing the two models directly is somewhat misleading, since HMMs and CRFs are defined by distributions on different spaces. To be precise, we adopt the following definition:
Definition 1.
Let $X_{1:N} = (X_1, \ldots, X_N)$ and $Y_{1:N} = (Y_1, \ldots, Y_N)$ be the two stochastic sequences defined above.
- (i)
- We will call “model” a distribution $p(x_{1:N}, y_{1:N})$;
- (ii)
- We will call “conditional model” a distribution $p(x_{1:N} \mid y_{1:N})$;
- (iii)
- We will say that a model $p(x_{1:N}, y_{1:N})$ is “equivalent” to a conditional model $q(x_{1:N} \mid y_{1:N})$ if there exists a distribution $r(y_{1:N})$ such that $p(x_{1:N}, y_{1:N}) = r(y_{1:N})\,q(x_{1:N} \mid y_{1:N})$;
- (iv)
- We will say that a family of models A is “equivalent” to a family of conditional models B if for each model in A there exists an equivalent conditional model in B.
According to Definition 1, HMMs are particular “models”, while CRFs are particular “conditional models”. Then a particular HMM model cannot be equal to a particular conditional CRF model, but it can be equivalent to the latter.
Our contribution is to show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)). In addition, we specify, for each LC-CRF $p(x_{1:N} \mid y_{1:N})$, a particular HMM $q(x_{1:N}, y_{1:N})$ such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$.
Finally, the core of our first contribution is the following. Let $q(x_{1:N}, y_{1:N})$ be an HMM (Equation (1)), with parameters $q(x_1)$, $q(x_{n+1} \mid x_n)$, and $q(y_n \mid x_n)$. Taking $r(y_{1:N}) = q(y_{1:N})$, it is immediate to see that $q(x_{1:N} \mid y_{1:N})$ is an equivalent CRF. The converse is not immediate. Is a given CRF $p(x_{1:N} \mid y_{1:N})$ equivalent to a certain HMM? If yes, can we find $r(y_{1:N})$ such that $r(y_{1:N})\,p(x_{1:N} \mid y_{1:N})$ is an HMM? Moreover, can we give its (Equation (1)) form? Answering these questions in the simple linear-chain CRF case is our first contribution. More precisely, we show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)), and we specify, for each LC-CRF $p(x_{1:N} \mid y_{1:N})$, a particular HMM $q(x_{1:N}, y_{1:N})$ given in the form (Equation (1)), such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$.
- 2.
- We show that the “generative” estimators MPM and MAP in HMM are computable in a “discriminative” manner, exactly as in LC-CRF.
One of the interests of HMMs and CRFs is that in both of them there exist Bayesian classifiers which allow estimating $x_{1:N}$ from $y_{1:N}$ in a reasonable computation time. As examples, let us consider the “maximum posterior mode” (MPM), also called “maximum of posterior margins”, defined by
$$\forall n \in \{1, \ldots, N\}: \quad \hat{x}_n = \arg\max_{x_n \in \Lambda} p(x_n \mid y_{1:N}), \quad (4)$$
and the “maximum a posteriori” (MAP), defined by
$$\hat{x}_{1:N} = \arg\max_{x_{1:N} \in \Lambda^N} p(x_{1:N} \mid y_{1:N}). \quad (5)$$
Note that, like any other Bayesian classifier, the MPM and the MAP are independent of the marginal distribution $p(y_{1:N})$. This means that in any generative model $p(x_{1:N}, y_{1:N})$, any related Bayesian classifier is strictly the same as the one related to the equivalent (in the sense of Definition 1) CRF model $p(x_{1:N} \mid y_{1:N})$. We see that the distinction between “generative” and “discriminative” classifiers is not justified: all Bayesian classifiers are discriminative. However, in an HMM the related MPM and MAP classifiers are classically computed using $p(y_n \mid x_n)$, while this is not the case in an LC-CRF. We show that both the MPM and the MAP in an HMM can also be computed in a “discriminative” way, without using $p(y_n \mid x_n)$. Thus, the feasibility of using the MPM and the MAP in an HMM is strictly the same as that of their use in an LC-CRF, which is our second contribution. One of the consequences is that the use of the MPM and MAP in the two families, HMMs and LC-CRFs, presents exactly the same interest, in particular in NLP. This shows that abandoning HMMs in favor of LC-CRFs in NLP because of the “generative” nature [,,,,,,,] of their related Bayesian classifiers was not justified.
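To make the two classifiers concrete, the sketch below computes the MPM (Equation (4)) and the MAP (Equation (5)) by exhaustive enumeration for a tiny chain; `posterior` stands for any function returning $p(x_{1:N} \mid y_{1:N})$, for instance the `crf_posterior` sketch above. This brute-force version is only useful as a reference for checking the fast recursions discussed later in the paper.

```python
import itertools
import numpy as np

def brute_force_mpm_map(posterior, y, N, n_states):
    """Exhaustive MPM (Equation (4)) and MAP (Equation (5)) for tiny chains."""
    states = list(itertools.product(range(n_states), repeat=N))
    probs = np.array([posterior(x, y) for x in states])
    # MAP: the single sequence maximizing the posterior.
    x_map = states[int(np.argmax(probs))]
    # MPM: maximize each marginal p(x_n | y_{1:N}) separately.
    marginals = np.zeros((N, n_states))
    for x, p in zip(states, probs):
        for n in range(N):
            marginals[n, x[n]] += p
    x_mpm = tuple(int(np.argmax(marginals[n])) for n in range(N))
    return x_mpm, x_map

# Example with the crf_posterior sketch above (same hypothetical N, n_states, y):
# print(brute_force_mpm_map(crf_posterior, [0, 2, 1, 4], 4, 3))
```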
2. Related Works
Concerning the first contribution, several authors have noticed similarities between LC-CRFs and HMMs in previous works. We first note that trying to compare the two families directly is somewhat incorrect, as they are defined on different spaces; in particular, they cannot be equal. It is well known that the posterior distribution of an HMM is an LC-CRF. Conversely, showing that for a given CRF $p(x_{1:N} \mid y_{1:N})$ it is possible to find an HMM $q(x_{1:N}, y_{1:N})$ such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$ is more difficult and, to our knowledge, no general result has been published, except []. However, let us mention [], where the authors show a result similar to ours assuming an additional constraint on the LC-CRF considered. In [], the authors comment on similarities and differences between the LC-CRFs and HMMs considered here; however, the problem of searching for an HMM equivalent to a given LC-CRF is not addressed. In this paper, we show how to compute, from the LC-CRF given by Equation (3), an HMM verifying $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$, without any additional constraints. Concerning the second contribution, authors generally distinguish between “discriminative” classifiers, linked to discriminative models, and “generative” classifiers, linked to generative models. As mentioned above, our contribution is to show that the MPM and MAP based on generative HMMs can also be considered as discriminative classifiers. To our knowledge, there is no work on such conversions, except [].
3. Equivalence between HMMs and Simple Linear-Chain CRFs
We will use the following Lemma:
Lemma 1.
Let $X_{1:N} = (X_1, \ldots, X_N)$ be a random sequence taking its values in a finite set $\Lambda$. Then
- (i)
- $X_{1:N}$ is a Markov chain if and only if (iff) there exist functions $\varphi_1, \ldots, \varphi_{N-1}$ from $\Lambda^2$ to $\mathbb{R}_+$ such that
$$p(x_{1:N}) \propto \varphi_1(x_1, x_2)\,\varphi_2(x_2, x_3)\cdots\varphi_{N-1}(x_{N-1}, x_N), \quad (6)$$
where “$\propto$” means “proportional to”;
- (ii)
- For the Markov chain $X_{1:N}$ defined by $\varphi_1$, …, $\varphi_{N-1}$ verifying Equation (6), $p(x_1)$ and $p(x_{n+1} \mid x_n)$ are given by
$$p(x_1) = \frac{\beta_1(x_1)}{\sum_{x_1 \in \Lambda}\beta_1(x_1)}, \quad (7)$$
$$p(x_{n+1} \mid x_n) = \frac{\varphi_n(x_n, x_{n+1})\,\beta_{n+1}(x_{n+1})}{\beta_n(x_n)}, \quad (8)$$
where $\beta_N$, …, $\beta_1$ are defined with the following backward recursion:
$$\beta_N(x_N) = 1; \qquad \beta_n(x_n) = \sum_{x_{n+1} \in \Lambda}\varphi_n(x_n, x_{n+1})\,\beta_{n+1}(x_{n+1}).$$
Proof of Lemma.
- Let $X_{1:N}$ be Markov: $p(x_{1:N}) = p(x_1)\,p(x_2 \mid x_1)\cdots p(x_N \mid x_{N-1})$. Then (Equation (6)) is verified by $\varphi_1(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1)$, $\varphi_2(x_2, x_3) = p(x_3 \mid x_2)$, …, $\varphi_{N-1}(x_{N-1}, x_N) = p(x_N \mid x_{N-1})$.
- Conversely, let $p(x_{1:N})$ verify (Equation (6)). Thus $p(x_{1:N}) = \frac{1}{K}\,\varphi_1(x_1, x_2)\cdots\varphi_{N-1}(x_{N-1}, x_N)$ with $K$ a constant. This implies that for each $n = 1$, …, $N-1$ we have
$$p(x_{n+1} \mid x_{1:n}) = \frac{\sum_{x_{n+2:N}}\varphi_n(x_n, x_{n+1})\,\varphi_{n+1}(x_{n+1}, x_{n+2})\cdots\varphi_{N-1}(x_{N-1}, x_N)}{\sum_{x_{n+1:N}}\varphi_n(x_n, x_{n+1})\,\varphi_{n+1}(x_{n+1}, x_{n+2})\cdots\varphi_{N-1}(x_{N-1}, x_N)}, \quad (9)$$
which only depends on $x_n$ and thus shows that $X_{1:N}$ is Markov.
Moreover, let us set $\beta_N(x_N) = 1$ and $\beta_n(x_n) = \sum_{x_{n+1:N}}\varphi_n(x_n, x_{n+1})\cdots\varphi_{N-1}(x_{N-1}, x_N)$ for $n = 1$, …, $N-1$. On the one hand, we see that $\beta_n(x_n) = \sum_{x_{n+1} \in \Lambda}\varphi_n(x_n, x_{n+1})\,\beta_{n+1}(x_{n+1})$, so that the $\beta_n$ verify the backward recursion in (ii). On the other hand, according to (Equation (9)) we have $p(x_{n+1} \mid x_n) = \frac{\varphi_n(x_n, x_{n+1})\,\beta_{n+1}(x_{n+1})}{\beta_n(x_n)}$. As $p(x_1) = \frac{\beta_1(x_1)}{\sum_{x_1 \in \Lambda}\beta_1(x_1)}$, (Equation (7)) and (Equation (8)) are verified, which ends the proof. □
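The construction in the lemma is easy to check numerically. The sketch below (hypothetical array names, tiny chain) builds the $\beta_n$ by the backward recursion, derives the initial law (Equation (7)) and the transitions (Equation (8)), and compares the resulting Markov distribution with the normalized product of the $\varphi_n$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, n_states = 4, 3

# Hypothetical positive pairwise functions phi_n(x_n, x_{n+1}) as in Equation (6).
phi = rng.uniform(0.5, 2.0, size=(N - 1, n_states, n_states))

# Backward recursion of Lemma 1: beta_N = 1, beta_n = sum_{x_{n+1}} phi_n * beta_{n+1}.
beta = [None] * N
beta[N - 1] = np.ones(n_states)
for n in range(N - 2, -1, -1):
    beta[n] = phi[n] @ beta[n + 1]

# Equation (7): initial law; Equation (8): transitions of the Markov chain.
p_x1 = beta[0] / beta[0].sum()
trans = [phi[n] * beta[n + 1][None, :] / beta[n][:, None] for n in range(N - 1)]

def p_markov(x):
    """Distribution of the Markov chain reconstructed by Lemma 1."""
    p = p_x1[x[0]]
    for n in range(N - 1):
        p *= trans[n][x[n], x[n + 1]]
    return p

# Check against the normalized product of the phi_n (Equation (6)).
Z = sum(np.prod([phi[n, xs[n], xs[n + 1]] for n in range(N - 1)])
        for xs in itertools.product(range(n_states), repeat=N))
x = (0, 2, 1, 1)
direct = np.prod([phi[n, x[n], x[n + 1]] for n in range(N - 1)]) / Z
print(np.isclose(p_markov(x), direct))  # expected: True
```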
Proposition 1 below shows that the LC-CRF defined by (Equation (3)) is equivalent to an HMM defined by Equation (1). In addition, the distributions $q(x_1)$, $q(x_{n+1} \mid x_n)$, and $q(y_n \mid x_n)$ in Equation (1) defining an equivalent HMM are computed from $V_n$ and $U_n$. To the best of our knowledge, except some first weaker results in [], these results are new.
Proposition 1.
Let $Z_{1:N} = (X_{1:N}, Y_{1:N})$ be a stochastic sequence, with $X_{1:N} = (X_1, \ldots, X_N)$ and $Y_{1:N} = (Y_1, \ldots, Y_N)$. Each $Z_n = (X_n, Y_n)$ takes its values in $\Lambda \times \Omega$, with $\Lambda$ and $\Omega$ finite. If $p(x_{1:N} \mid y_{1:N})$ is an LC-CRF with the distribution defined by
$$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})}\exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}) + \sum_{n=1}^{N} U_n(x_n, y_n)\right], \quad (10)$$
then (Equation (10)) is the posterior distribution of the HMM
$$q(x_{1:N}, y_{1:N}) = q(x_1)\,q(y_1 \mid x_1)\prod_{n=1}^{N-1} q(x_{n+1} \mid x_n)\,q(y_{n+1} \mid x_{n+1}), \quad (11)$$
with
and, for $n = 1$, …, $N-1$:
where
and $\beta_1$, …, $\beta_N$ are given by the backward recursion
Proof of Proposition 1.
Let us consider the functions $\varphi_1$, …, $\varphi_{N-1}$ defined on $\Lambda^2$ by
According to the lemma, they define a Markov chain , with . Let us denote its distribution by . As with constant, we have . Let us show that verifies (Equations (11)–(16)). According to the lemma, considering , …, defined with (Equation (16)), and defined with (Equation (15)), we have
(12) being directly implied by the lemma, this ends the proof. □
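Proposition 1 can be checked numerically. The sketch below follows the route of the proof: the emission probabilities are taken proportional to $\exp U_n$, the remaining factors are gathered into pairwise functions, and Lemma 1 turns them into a Markov chain. The particular way the normalizing factors $\gamma_n(x_n) = \sum_{y_n \in \Omega}\exp U_n(x_n, y_n)$ are absorbed into the $\varphi_n$ is our own choice, and the sketch is not claimed to reproduce Equations (12)–(16) literally; it only checks that the resulting HMM has the prescribed posterior.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, n_states, n_obs = 4, 3, 2

# A basic LC-CRF (10): hypothetical potentials V_n and U_n.
V = rng.normal(size=(N - 1, n_states, n_states))
U = rng.normal(size=(N, n_states, n_obs))

# Emission probabilities of the candidate HMM: q(y_n | x_n) proportional to exp U_n(x_n, y_n).
gamma = np.exp(U).sum(axis=2)                  # gamma_n(x_n)
emis = np.exp(U) / gamma[:, :, None]           # q(y_n | x_n)

# Pairwise functions absorbing exp(V_n) and the gamma_n factors (our choice).
phi = np.exp(V)
for n in range(N - 1):
    phi[n] *= gamma[n][:, None]
phi[N - 2] *= gamma[N - 1][None, :]

# Lemma 1: backward recursion, initial law, and transitions of q(x_{1:N}).
beta = [None] * N
beta[N - 1] = np.ones(n_states)
for n in range(N - 2, -1, -1):
    beta[n] = phi[n] @ beta[n + 1]
q_x1 = beta[0] / beta[0].sum()
trans = [phi[n] * beta[n + 1][None, :] / beta[n][:, None] for n in range(N - 1)]

def hmm_joint(x, y):
    """q(x_{1:N}, y_{1:N}) of the constructed HMM (form of Equation (11))."""
    p = q_x1[x[0]] * emis[0, x[0], y[0]]
    for n in range(N - 1):
        p *= trans[n][x[n], x[n + 1]] * emis[n + 1, x[n + 1], y[n + 1]]
    return p

def crf_posterior(x, y):
    """p(x_{1:N} | y_{1:N}) of the LC-CRF (10), by brute-force normalization."""
    def unnorm(xs):
        return np.exp(sum(V[n, xs[n], xs[n + 1]] for n in range(N - 1))
                      + sum(U[n, xs[n], y[n]] for n in range(N)))
    return unnorm(x) / sum(unnorm(xs)
                           for xs in itertools.product(range(n_states), repeat=N))

y, x = (0, 1, 1, 0), (2, 0, 1, 1)
post_hmm = hmm_joint(x, y) / sum(hmm_joint(xs, y)
                                 for xs in itertools.product(range(n_states), repeat=N))
print(np.isclose(post_hmm, crf_posterior(x, y)))  # expected: True
```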
4. Discriminative Classifiers in Generative HMMs
One of the interests of HMMs, and of some CRFs with hidden discrete finite data, lies in the possibility of fast analytic computation of Bayesian classifiers. As examples of classic Bayesian classifiers, let us consider the MPM (Equation (4)) and the MAP (Equation (5)). However, in some domains such as NLP, CRFs are preferred to HMMs for the following reasons. As the HMM is a generative model, the MPM and MAP used in HMMs are also called “generative”, and people consider that the HMM-based MPM and MAP require the knowledge of $p(y_n \mid x_n)$. Then people consider it improper to use the HMM-based MPM and MAP in situations where the distributions $p(y_n \mid x_n)$ are hard to handle. We show that this reason is not valid. More precisely, we show two points:
- (i)
- First, we notice that, whatever the model $p(x_{1:N}, y_{1:N})$ considered, all Bayesian classifiers are independent of $p(y_{1:N})$, so that the distinction between “generative” and “discriminative” classifiers is misleading: they are all discriminative;
- (ii)
- Second, we show that the “generative” computation of the MPM and MAP in HMMs is not intrinsic to HMMs but is due to their particular classic parameterization (Equation (1)). In other words, changing the parametrization, it is possible to compute the HMM-based MPM and MAP without using $p(y_n \mid x_n)$ or $p(y_{1:N} \mid x_{1:N})$.
The first point is rather immediate: we note that a Bayesian classifier $\hat{s}_L$ is defined by a loss function $L$ through
$$\hat{s}_L(y_{1:N}) = \arg\min_{x'_{1:N} \in \Lambda^N} E\left[L(x'_{1:N}, X_{1:N}) \mid Y_{1:N} = y_{1:N}\right];$$
it is thus immediate to notice that $\hat{s}_L(y_{1:N})$ only depends on the posterior distribution $p(x_{1:N} \mid y_{1:N})$. This implies that it is the same in a generative model and in its equivalent (within the meaning of Definition 1) discriminative model.
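For example (a classic fact recalled here for completeness, not a result of the paper), the MPM (Equation (4)) and the MAP (Equation (5)) correspond to the two usual 0-1 loss functions:
$$L_1(x'_{1:N}, x_{1:N}) = \sum_{n=1}^{N}\mathbf{1}[x'_n \neq x_n] \;\Rightarrow\; \hat{s}_{L_1}(y_{1:N}) = \left(\arg\max_{x_n \in \Lambda} p(x_n \mid y_{1:N})\right)_{1 \le n \le N},$$
$$L_2(x'_{1:N}, x_{1:N}) = \mathbf{1}[x'_{1:N} \neq x_{1:N}] \;\Rightarrow\; \hat{s}_{L_2}(y_{1:N}) = \arg\max_{x_{1:N} \in \Lambda^N} p(x_{1:N} \mid y_{1:N}),$$
and both are indeed functions of $p(x_{1:N} \mid y_{1:N})$ only.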
We show (ii) by separately considering the MPM and MAP cases.
4.1. Discriminative Computing of HMM-Based MPM
To show (ii), let us consider Equation (1) with $p(y_n \mid x_n) = \frac{p(x_n \mid y_n)\,p(y_n)}{p(x_n)}$. It becomes
$$p(x_{1:N}, y_{1:N}) = p(x_1)\,\frac{p(x_1 \mid y_1)\,p(y_1)}{p(x_1)}\prod_{n=1}^{N-1} p(x_{n+1} \mid x_n)\,\frac{p(x_{n+1} \mid y_{n+1})\,p(y_{n+1})}{p(x_{n+1})}. \quad (21)$$
We see that (Equation (21)) is of the form $p(x_{1:N}, y_{1:N}) = f(x_{1:N}, y_{1:N})\,g(y_{1:N})$, where $g(y_{1:N}) = p(y_1)\cdots p(y_N)$ does not depend on $x_1$, …, $x_N$. This implies that $p(x_{1:N} \mid y_{1:N})$ does not depend on $p(y_1)$, …, $p(y_N)$ either. Then the marginal posterior distributions $p(x_n \mid y_{1:N})$, and consequently the MPM estimates $\hat{x}_n = \arg\max_{x_n \in \Lambda} p(x_n \mid y_{1:N})$, do not depend on $p(y_1)$, …, $p(y_N)$ either. Thus, the HMM-based MPM classifier also verifies the “discriminative classifier” definition.
How can we compute $p(x_n \mid y_{1:N})$? It is classically computed using “forward” probabilities $\alpha_n(x_n)$ and “backward” ones $\beta_n(x_n)$, defined with
$$\alpha_n(x_n) = p(x_n, y_{1:n}), \quad (23)$$
$$\beta_n(x_n) = p(y_{n+1:N} \mid x_n); \quad (24)$$
then
$$p(x_n \mid y_{1:N}) = \frac{\alpha_n(x_n)\,\beta_n(x_n)}{\sum_{x_n \in \Lambda}\alpha_n(x_n)\,\beta_n(x_n)}, \quad (25)$$
with all $\alpha_n(x_n)$ and $\beta_n(x_n)$ computed using the following forward and backward recursions []:
$$\alpha_1(x_1) = p(x_1)\,p(y_1 \mid x_1); \qquad \alpha_{n+1}(x_{n+1}) = p(y_{n+1} \mid x_{n+1})\sum_{x_n \in \Lambda}\alpha_n(x_n)\,p(x_{n+1} \mid x_n); \quad (26)$$
$$\beta_N(x_N) = 1; \qquad \beta_n(x_n) = \sum_{x_{n+1} \in \Lambda}p(y_{n+1} \mid x_{n+1})\,p(x_{n+1} \mid x_n)\,\beta_{n+1}(x_{n+1}). \quad (27)$$
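A direct transcription of these recursions is given below; the stationary arrays `pi`, `A`, and `B` are the same hypothetical parameters as in the earlier sketch, and no scaling is used, so the sketch is only suitable for short sequences.

```python
import numpy as np

def forward_backward(pi, A, B, y):
    """Classic recursions (26)-(27) and posterior marginals (25) for an HMM (1)."""
    N, K = len(y), len(pi)
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    alpha[0] = pi * B[:, y[0]]                      # alpha_1(x_1) = p(x_1) p(y_1 | x_1)
    for n in range(N - 1):
        alpha[n + 1] = B[:, y[n + 1]] * (alpha[n] @ A)
    beta[N - 1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, y[n + 1]] * beta[n + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)   # p(x_n | y_{1:N}) for each n

# MPM estimate (Equation (4)): componentwise argmax of the marginals.
# x_mpm = forward_backward(pi, A, B, y).argmax(axis=1)
```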
Setting $p(y_n \mid x_n) = \frac{p(x_n \mid y_n)\,p(y_n)}{p(x_n)}$ in these recursions, and recalling that $p(x_n \mid y_{1:N})$ does not depend on $p(y_1)$, …, $p(y_N)$, we can arbitrarily modify them. Let us consider the uniform distribution over $\Omega$, so that $p(y_1) = \cdots = p(y_N) = \frac{1}{|\Omega|}$. Then (Equation (26)) and (Equation (27)) become
$$\alpha_1(x_1) = \frac{1}{|\Omega|}\,p(x_1 \mid y_1); \qquad \alpha_{n+1}(x_{n+1}) = \frac{1}{|\Omega|}\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\sum_{x_n \in \Lambda}\alpha_n(x_n)\,p(x_{n+1} \mid x_n); \quad (28)$$
$$\beta_N(x_N) = 1; \qquad \beta_n(x_n) = \frac{1}{|\Omega|}\sum_{x_{n+1} \in \Lambda}\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\,p(x_{n+1} \mid x_n)\,\beta_{n+1}(x_{n+1}), \quad (29)$$
and we still have
$$p(x_n \mid y_{1:N}) = \frac{\alpha_n(x_n)\,\beta_n(x_n)}{\sum_{x_n \in \Lambda}\alpha_n(x_n)\,\beta_n(x_n)}. \quad (30)$$
Finally, we see that (Equation (30)) is independent of the constant factor $\frac{1}{|\Omega|}$ appearing in (Equations (28) and (29)), so that we can take it equal to $1$. Then we can state the following proposition.
Proposition 2.
Let $Z_{1:N} = (X_{1:N}, Y_{1:N})$ be an HMM (Equation (1)). Let us define “discriminative forward” quantities $\tilde{\alpha}_1$, …, $\tilde{\alpha}_N$, and “discriminative backward” ones $\tilde{\beta}_1$, …, $\tilde{\beta}_N$, by the following forward and backward recursions:
$$\tilde{\alpha}_1(x_1) = p(x_1 \mid y_1); \qquad \tilde{\alpha}_{n+1}(x_{n+1}) = \frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\sum_{x_n \in \Lambda}\tilde{\alpha}_n(x_n)\,p(x_{n+1} \mid x_n); \quad (31)$$
$$\tilde{\beta}_N(x_N) = 1; \qquad \tilde{\beta}_n(x_n) = \sum_{x_{n+1} \in \Lambda}\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\,p(x_{n+1} \mid x_n)\,\tilde{\beta}_{n+1}(x_{n+1}); \quad (32)$$
then
$$p(x_n \mid y_{1:N}) = \frac{\tilde{\alpha}_n(x_n)\,\tilde{\beta}_n(x_n)}{\sum_{x_n \in \Lambda}\tilde{\alpha}_n(x_n)\,\tilde{\beta}_n(x_n)}. \quad (33)$$
Consequently, we can compute the MPM classifier in a discriminative manner, only using the distributions $p(x_n \mid y_n)$, $p(x_{n+1} \mid x_n)$, and $p(x_n)$, for $n = 1$, …, $N$.
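A possible transcription of Proposition 2 is sketched below. The arrays `pi` and `A` are the same hypothetical stationary parameters as before; `p_x_given_y` is an $(N, |\Lambda|)$ array whose row $n$ holds $p(x_n \mid y_n)$ (for instance the output of a logistic regression, see Remark 2); the prior marginals $p(x_n)$ are obtained by propagating `pi` through `A`.

```python
import numpy as np

def discriminative_forward_backward(pi, A, p_x_given_y):
    """Compute p(x_n | y_{1:N}) via Equations (31)-(33), without using p(y_n | x_n)."""
    N, K = p_x_given_y.shape
    # Prior marginals p(x_n), needed in the ratios of Equations (31)-(32).
    p_x = np.zeros((N, K))
    p_x[0] = pi
    for n in range(N - 1):
        p_x[n + 1] = p_x[n] @ A
    ratio = p_x_given_y / p_x                      # p(x_n | y_n) / p(x_n)
    alpha = np.zeros((N, K))
    beta = np.zeros((N, K))
    alpha[0] = p_x_given_y[0]                      # alpha_1(x_1) = p(x_1 | y_1)
    for n in range(N - 1):
        alpha[n + 1] = ratio[n + 1] * (alpha[n] @ A)
    beta[N - 1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (ratio[n + 1] * beta[n + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)  # Equation (33)

# MPM without p(y_n | x_n):
# x_mpm = discriminative_forward_backward(pi, A, p_x_given_y).argmax(axis=1)
```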
Note that this result is similar to the result in [], with a different proof.
Remark 1.
Let us notice that, according to Equations (31) and (32), it is possible to compute $\tilde{\alpha}_n$ and $\tilde{\beta}_n$ by a very slight adaptation of the classic programs computing $\alpha_n$ and $\beta_n$ with the recursions (Equations (26) and (27)). All we have to do is to replace $p(y_n \mid x_n)$ with $\frac{p(x_n \mid y_n)}{p(x_n)}$. Of course, $\tilde{\alpha}_n \neq \alpha_n$ and $\tilde{\beta}_n \neq \beta_n$, but (Equation (33)) holds and thus $p(x_n \mid y_{1:N})$ is computable, allowing the MPM.
Remark 2.
We see that we can compute the MPM in an HMM only using $p(x_n \mid y_n)$, $p(x_{n+1} \mid x_n)$, and $p(x_n)$. This means that in supervised classification, where we have a learning sample, we can use any parametrization to estimate them. For example, we can model $p(x_n \mid y_n)$ with logistic regression, as is commonly done in CRFs. It is important to note that such a parametrization is unusual for HMMs; however, what matters is that the model remains the same.
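As an illustration of this remark, the minimal sketch below fits a logistic regression on a hypothetical labeled sample (`features` and `labels` are invented names, and scikit-learn is assumed to be available); its predicted probabilities provide the $p(x_n \mid y_n)$ required by the discriminative recursions of Proposition 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical supervised sample: one feature vector per position, one hidden label per position.
rng = np.random.default_rng(3)
features = rng.normal(size=(500, 4))     # observations y_n encoded as feature vectors
labels = rng.integers(0, 2, size=500)    # hidden states x_n

# p(x_n | y_n) modeled by a logistic regression, as suggested in Remark 2.
clf = LogisticRegression(max_iter=1000).fit(features, labels)

# For a new sequence of observations, row n of p_x_given_y estimates p(x_n | y_n);
# it can be fed directly to the discriminative forward-backward sketch above.
new_sequence = rng.normal(size=(6, 4))
p_x_given_y = clf.predict_proba(new_sequence)
print(p_x_given_y.shape)                 # (6, number of states)
```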
Remark 3.
When using the MPM to deal with a concrete problem, Proposition 2 implies that talking about a comparison between LC-CRFs and HMMs is somewhat incorrect: there is only one model. However, there are two different parametrizations, and this can produce two different results. Indeed, the estimation of the parameters can give two models of the same nature (the posterior distribution of an HMM) but situated differently with respect to the optimal model, the one best suited to the data. It would therefore be more correct to speak of a comparison of two parametrizations of the HMM, each associated with its own parameter estimator. The same is true concerning the MAP discussed in the next subsection.
4.2. Discriminative Computing of HMM-Based MAP: Discriminative Viterbi
Let $Z_{1:N} = (X_{1:N}, Y_{1:N})$ be an HMM (Equation (1)). The Bayesian MAP classifier (Equation (5)) based on such an HMM is computed with the following Viterbi algorithm []. For each $n = 1$, …, $N$, and each $x_n \in \Lambda$, let $\hat{x}_{1:n}(x_n)$ be the path $(x_1, \ldots, x_{n-1}, x_n)$ verifying
$$\hat{x}_{1:n}(x_n) = \arg\max_{x_{1:n-1} \in \Lambda^{n-1}} p(x_{1:n}, y_{1:n}). \quad (34)$$
We see that $\hat{x}_{1:n}(x_n)$ is a path maximizing $p(x_{1:n}, y_{1:n})$ over all paths ending in $x_n$. Then, having the paths $\hat{x}_{1:n}(x_n)$ and the probabilities $p(\hat{x}_{1:n}(x_n), y_{1:n})$ for each $x_n \in \Lambda$, one determines, for each $x_{n+1} \in \Lambda$, the paths $\hat{x}_{1:n+1}(x_{n+1}) = (\hat{x}_{1:n}(x_n^*), x_{n+1})$ and the probabilities $p(\hat{x}_{1:n+1}(x_{n+1}), y_{1:n+1})$, searching $x_n^*$ with
$$x_n^* = \arg\max_{x_n \in \Lambda}\; p(\hat{x}_{1:n}(x_n), y_{1:n})\,p(x_{n+1} \mid x_n)\,p(y_{n+1} \mid x_{n+1}). \quad (35)$$
Setting $p(y_{n+1} \mid x_{n+1}) = \frac{p(x_{n+1} \mid y_{n+1})\,p(y_{n+1})}{p(x_{n+1})}$ in Equation (35), we see that the $x_n$ which verifies Equation (35) is the same as the one which maximizes $p(\hat{x}_{1:n}(x_n), y_{1:n})\,p(x_{n+1} \mid x_n)\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}$, so that we can suppress the factor $p(y_{n+1})$, which does not depend on $x_n$. In other words, we can replace Equation (35) with
$$x_n^* = \arg\max_{x_n \in \Lambda}\; p(\hat{x}_{1:n}(x_n), y_{1:n})\,p(x_{n+1} \mid x_n)\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}. \quad (36)$$
Finally, we propose the following discriminative version of the Viterbi algorithm:
- Set $\hat{x}_{1:1}(x_1) = (x_1)$ for each $x_1 \in \Lambda$, with initial scores $p(x_1 \mid y_1)$;
- For each $n = 1$, …, $N-1$, and each $x_{n+1} \in \Lambda$, apply the modified recursion (Equation (36)) to find a path $\hat{x}_{1:n+1}(x_{n+1})$ from the paths $\hat{x}_{1:n}(x_n)$ (for all $x_n \in \Lambda$) and their scores (for all $x_n \in \Lambda$);
- End setting $\hat{x}_{1:N} = \hat{x}_{1:N}(\hat{x}_N)$, where $\hat{x}_N$ maximizes the final score over $x_N \in \Lambda$.
As with the MPM above, we see that we can find $\hat{x}_{1:N}$ using only $p(x_n \mid y_n)$, $p(x_{n+1} \mid x_n)$, and $p(x_n)$, exactly as in the CRF case. As above, it appears that dropping HMMs in some NLP tasks on the grounds that the MAP is a “generative” classifier is not justified. In particular, in a supervised stationary framework, the distributions $p(x_n \mid y_n)$, $p(x_{n+1} \mid x_n)$, and $p(x_n)$ can be estimated in the same way as in the LC-CRF case.
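A possible transcription of this discriminative Viterbi is sketched below (our own formulation, same hypothetical inputs as the discriminative forward-backward sketch). Since every suppressed $p(y_n)$ factor is common to all paths, the returned path coincides with the MAP path of the classic Viterbi algorithm.

```python
import numpy as np

def discriminative_viterbi(pi, A, p_x_given_y):
    """MAP path of an HMM (1) computed without using p(y_n | x_n)."""
    N, K = p_x_given_y.shape
    # Prior marginals p(x_n) for the ratios p(x_n | y_n) / p(x_n).
    p_x = np.zeros((N, K))
    p_x[0] = pi
    for n in range(N - 1):
        p_x[n + 1] = p_x[n] @ A
    ratio = p_x_given_y / p_x
    # Work in the log domain for numerical stability.
    log_delta = np.log(p_x_given_y[0])           # initial scores log p(x_1 | y_1)
    back = np.zeros((N, K), dtype=int)
    for n in range(N - 1):
        scores = log_delta[:, None] + np.log(A) + np.log(ratio[n + 1])[None, :]
        back[n + 1] = scores.argmax(axis=0)      # best predecessor of each x_{n+1}
        log_delta = scores.max(axis=0)
    # Backtrack from the best final state to obtain the MAP path.
    path = [int(log_delta.argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```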
5. Discussion and Conclusions
We have proposed two results. First, we have shown that the basic LC-CRF (Equation (3)) is equivalent to a classical HMM (Equation (1)), in the sense that one can find an HMM whose posterior distribution is exactly the given LC-CRF. More precisely, we specified the way to calculate the parameters $p(x_1)$, $p(x_{n+1} \mid x_n)$, and $p(y_n \mid x_n)$ which define Equation (1) from the parameters $V_n$ and $U_n$ which define Equation (3). Second, noting that all Bayesian classifiers are discriminative in the sense that they do not depend on the observation distribution, we showed that classifiers based on HMMs, usually considered as “generative”, can also be considered as discriminative classifiers. More specifically, we have proposed discriminative methods for computing the classic maximum posterior mode (MPM) and maximum a posteriori (MAP) classifiers based on HMMs. The first result shows that basic LC-CRFs and classical HMMs are equally general. The second shows that, at the application level, HMMs offer strictly the same processing power as LC-CRFs, at least with regard to the MPM and the MAP.
The practical interest of our contributions is as follows. Until now, some authors considered LC-CRFs and HMMs to be equivalent without presenting rigorous proof, and others considered LC-CRFs to be more general []. Partly because of this uncertainty, CRFs have often been preferred over HMMs. We have proved, at least in the particular framework considered, that the two models were equivalent, and the abandonment of HMMs in favor of CRFs was not always justified. In other words, faced with a particular application, there is no reason to choose CRFs systematically. However, we also cannot say that HMMs should be chosen. Finally, our contribution is likely to encourage practitioners to consider both models, or rather, according to Remark 3, both parametrizations of the same model, on an equal footing.
We considered basic LC-CRFs and HMMs, which form a limited, yet widely used, framework. LC-CRFs and HMMs can be extended in different directions. The general question to study is to find an extension of HMMs that would be equivalent to the general CRF (Equation (2)). This seems to be a hard problem. However, one can study some existing extensions of HMMs and ask what kind of CRFs they would be equivalent to. For example, HMMs have been extended to “pairwise Markov models” (PMMs []), and the question is therefore: what kind of CRFs would be equivalent to PMMs? Another kind of extension consists of adding a latent process to the pair $(X_{1:N}, Y_{1:N})$. In the case of CRFs this leads to hidden CRFs (HCRFs []), and in the case of HMMs this leads to triplet Markov models (TMMs []). Finally, CRFs were generalized to semi-Markov CRFs [], and HMMs were generalized to hidden semi-Markov models, with explicit distribution of the exact sojourn time in a given state [], or with explicit distribution of the minimal sojourn time in a given state []. The study of the relations between these different extensions of CRFs and HMMs is an interesting perspective for further investigations.
Author Contributions
Conceptualization, E.A., E.M., and W.P.; methodology, E.A., E.M., and W.P.; validation, E.A., E.M., and W.P.; investigation, E.A., E.M., and W.P.; writing—original draft preparation, W.P.; writing—review and editing, E.A., E.M., and W.P; supervision, W.P.; project administration, W.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partly funded by the French Government Agency ASSOCIATION NATIONALE RECHERCHE TECHNOLOGIES (ANRT), as part of the CIFRE (“Conventions industrielles de formation par la recherche”) scholarship.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Stratonovich, R.L. Conditional Markov Processes. In Non-Linear Transformations of Stochastic Processes; Pergamon Press: Oxford, UK, 1965; pp. 427–453.
- Baum, L.E.; Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Ann. Math. Stat. 1966, 37, 1554–1563.
- Rabiner, L.; Juang, B. An Introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3, 4–16.
- Ephraim, Y. Hidden Markov Processes. IEEE Trans. Inf. Theory 2002, 48, 1518–1569.
- Cappé, O.; Moulines, E.; Ryden, T. Inference in Hidden Markov Models; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2005.
- Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001.
- Sutton, C.; McCallum, A. An Introduction to Conditional Random Fields. Found. Trends Mach. Learn. 2012, 4, 267–373.
- Jurafsky, D.; Martin, J.H. Speech and Language Processing; Kindle edition; Prentice Hall Series in Artificial Intelligence; Prentice Hall: Hoboken, NJ, USA, 2014.
- Ng, A.; Jordan, M. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Adv. Neural Inf. Process. Syst. 2001, 14, 841–848.
- He, H.; Liu, Z.; Jiao, R.; Yan, G. A Novel Nonintrusive Load Monitoring Approach based on Linear-Chain Conditional Random Fields. Energies 2019, 12, 1797.
- Condori, G.C.; Castro-Gutierrez, E.; Casas, L.A. Virtual Rehabilitation Using Sequential Learning Algorithms. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 639–645.
- Fang, M.; Kodamana, H.; Huang, B.; Sammaknejad, N. A Novel Approach to Process Operating Mode Diagnosis using Conditional Random Fields in the Presence of Missing Data. Comput. Chem. Eng. 2018, 111, 149–163.
- Saa, J.F.D.; Cetin, M. A Latent Discriminative Model-Based Approach for Classification of Imaginary Motor Tasks from EEG Data. J. Neural Eng. 2012, 9, 026020.
- Azeraf, E.; Monfrini, E.; Pieczynski, W. On Equivalence between Linear-Chain Conditional Random Fields and Hidden Markov Chains. In Proceedings of the International Conference on Agents and Artificial Intelligence, Virtual, 3–5 February 2022.
- Liliana, D.Y.; Basaruddin, C. A Review on Conditional Random Fields as a Sequential Classifier in Machine Learning. In Proceedings of the International Conference on Electrical Engineering and Computer Science (ICECOS), Palembang, Indonesia, 22–23 August 2017; pp. 143–148.
- Ayogu, I.I.; Adetunmbi, A.O.; Ojokoh, B.A.; Oluwadare, S.A. A Comparative Study of Hidden Markov Model and Conditional Random Fields on a Yorùba Part-of-Speech Tagging Task. In Proceedings of the IEEE International Conference on Computing Networking and Informatics (ICCNI), Lagos, Nigeria, 29–31 October 2017; pp. 1–6.
- McCallum, A.; Freitag, D.; Pereira, F.C. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML 2000, 17, 591–598.
- Song, S.L.; Zhang, N.; Huang, H.T. Named Entity Recognition based on Conditional Random Fields. Clust. Comput. J. Netw. Softw. Tools Appl. 2019, 22, S5195–S5206.
- Heigold, G.; Ney, H.; Lehnen, P.; Gass, T.; Schluter, R. Equivalence of Generative and Log-Linear Models. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 1138–1148.
- Azeraf, E.; Monfrini, E.; Vignon, E.; Pieczynski, W. Hidden Markov Chains, Entropic Forward-Backward, and Part-Of-Speech Tagging. arXiv 2020, arXiv:2005.10629.
- Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Ann. Math. Stat. 1970, 41, 164–171.
- Viterbi, A. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
- Pieczynski, W. Pairwise Markov Chains. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 634–639.
- Quattoni, A.; Wang, S.B.; Morency, L.-P.; Collins, M.; Darrell, T. Hidden Conditional Random Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1848–1852.
- Pieczynski, W.; Hulard, C.; Veit, T. Triplet Markov Chains in Hidden Signal Restoration. In Proceedings of the SPIE’s International Symposium on Remote Sensing, Crete, Greece, 22–27 September 2002.
- Sarawagi, S.; Cohen, W. Semi-Markov Conditional Random Fields for Information Extraction. Adv. Neural Inf. Process. Syst. 2004, 17.
- Yu, S.-Z. Hidden Semi-Markov Models. Artif. Intell. 2010, 174, 215–243.
- Li, H.; Derrode, S.; Pieczynski, W. Adaptive On-Line Lower Limb Locomotion Activity Recognition of Healthy Individuals Using Semi-Markov Model and Single Wearable Inertial Sensor. Sensors 2019, 19, 4242.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).