1. Introduction
Let $Z_{1:N} = (Z_1, \ldots, Z_N)$ be a stochastic sequence, with $Z_n = (X_n, Y_n)$. The random variables $X_1$, …, $X_N$ take their values in a finite set $\Lambda$, while $Y_1$, …, $Y_N$ take their values either in a discrete or continuous set $\Omega$. Realizations of $X_{1:N} = (X_1, \ldots, X_N)$ are hidden, while realizations of $Y_{1:N} = (Y_1, \ldots, Y_N)$ are observed, and the problem we deal with is to estimate $x_{1:N}$ from $y_{1:N}$. We deal with Bayesian methods of estimation, which require a probabilistic model. A probabilistic model is a distribution, or a family of distributions, denoted by $p(x_{1:N}, y_{1:N})$ or $p(x_{1:N} \mid y_{1:N})$. We are interested in the case of dependent $X_1$, …, $X_N$.
The simplest model taking this dependence into account is the well-known hidden Markov model (HMM) [1,2,3,4,5], whose distribution is given by
$$p(x_{1:N}, y_{1:N}) = p(x_1)\,p(y_1 \mid x_1)\prod_{n=2}^{N} p(x_n \mid x_{n-1})\,p(y_n \mid x_n). \quad (1)$$
In the whole paper, we consider HMMs such that $p(x_n \mid x_{n-1})$ and $p(y_n \mid x_n)$ in Equation (1) are non-null. HMMs allow recursive fast computation of Bayesian estimators, called "classifiers" in this paper and recalled below. In spite of their simplicity, HMMs are very robust and provide quite satisfactory results in many applications.
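To fix ideas, the following minimal sketch (not part of the original formulation) evaluates the joint distribution of Equation (1) in the stationary, finite-observation case; the array names pi, A, and B are illustrative conventions for $p(x_1)$, $p(x_{n+1} \mid x_n)$, and $p(y_n \mid x_n)$.

```python
import numpy as np

def hmm_log_joint(pi, A, B, x, y):
    """log p(x_{1:N}, y_{1:N}) of Equation (1), assuming a stationary HMM.

    pi : (K,) initial distribution p(x_1)
    A  : (K, K) transitions, A[i, j] = p(x_{n+1} = j | x_n = i)
    B  : (K, M) emissions, B[i, m] = p(y_n = m | x_n = i)
    x, y : integer-coded state and observation sequences of equal length N
    """
    logp = np.log(pi[x[0]]) + np.log(B[x[0], y[0]])
    for n in range(1, len(x)):
        logp += np.log(A[x[n - 1], x[n]]) + np.log(B[x[n], y[n]])
    return logp
```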
Moreover, conditional random fields (CRFs) [6,7] also allow estimating $x_{1:N}$ from $y_{1:N}$. Their definition differs from that of HMMs in that in CRFs, one directly considers $p(x_{1:N} \mid y_{1:N})$, and neither $p(x_{1:N})$ nor $p(y_{1:N} \mid x_{1:N})$ is needed to perform the estimation. The distribution of general linear-chain CRFs (LC-CRFs) is written as
$$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})}\exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}, y_{1:N}) + \sum_{n=1}^{N} U_n(x_n, y_{1:N})\right]. \quad (2)$$
In this paper, we consider the following basic LC-CRF:
$$p(x_{1:N} \mid y_{1:N}) = \frac{1}{\kappa(y_{1:N})}\exp\left[\sum_{n=1}^{N-1} V_n(x_n, x_{n+1}) + \sum_{n=1}^{N} U_n(x_n, y_n)\right], \quad (3)$$
with $\kappa(y_{1:N})$ the normalizing constant.
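As an illustration of Equation (3) only, the following brute-force sketch makes the role of the normalizing constant $\kappa(y_{1:N})$ concrete for a toy example; it assumes, for simplicity, homogeneous potentials (one matrix V for every transition, one matrix U for every position) and integer-coded observations, which is a special case of Equation (3).

```python
import itertools
import numpy as np

def lccrf_posterior_bruteforce(V, U, y, K):
    """p(x_{1:N} | y_{1:N}) of Equation (3) for every x_{1:N}, kappa summed by brute force.

    V : (K, K) pairwise potential V(x_n, x_{n+1});  U : (K, M) potential U(x_n, y_n).
    Only usable for small N and K; meant as an illustration, not an algorithm.
    """
    N = len(y)
    unnorm = {}
    for x in itertools.product(range(K), repeat=N):
        score = sum(V[x[n], x[n + 1]] for n in range(N - 1)) + sum(U[x[n], y[n]] for n in range(N))
        unnorm[x] = np.exp(score)
    kappa = sum(unnorm.values())                 # normalizing constant kappa(y_{1:N})
    return {x: p / kappa for x, p in unnorm.items()}
```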
Authors usually consider the two families HMMs and LC-CRFs as different [6,7,8,9,10,11,12,13], classifying the former in the category of "generative" models and the latter in the category of "discriminative" models.
In this paper, we investigate the relationship between HMMs (Equation (1)) and LC-CRFs (Equation (3)). We immediately see that if $p(x_{1:N}, y_{1:N})$ is of the form (Equation (1)), then $p(x_{1:N} \mid y_{1:N})$ is of the form (Equation (3)). Indeed, $p(x_n \mid x_{n-1})$ and $p(y_n \mid x_n)$ in Equation (1) being non-zero, one can take $V_1(x_1, x_2) = \log[p(x_1)\,p(x_2 \mid x_1)]$, $V_n(x_n, x_{n+1}) = \log p(x_{n+1} \mid x_n)$ for $n = 2$, …, $N-1$, $U_n(x_n, y_n) = \log p(y_n \mid x_n)$ for $n = 1$, …, $N$, and $\kappa(y_{1:N}) = p(y_{1:N})$. Thus, the posterior distribution of each HMM (Equation (1)) with the positivity conditions assumed above is an LC-CRF (Equation (3)). Our first contribution is to show the converse: for a given LC-CRF $p(x_{1:N} \mid y_{1:N})$ of form (Equation (3)), there is an HMM $q(x_{1:N}, y_{1:N})$ of form (Equation (1)) such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$. Moreover, we give the exact computation of the HMM parameters $q(x_1)$, $q(x_{n+1} \mid x_n)$, $q(y_n \mid x_n)$ from the LC-CRF parameters $V_n$, $U_n$. Such a general result, without additional constraints on the LC-CRF, is new.
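In the stationary, finite-observation case, the HMM-to-LC-CRF direction just described amounts to taking logarithms of the HMM parameters. The sketch below is a minimal illustration of this choice of potentials, with the same illustrative array names pi, A, B as above; it is not the converse construction, which is the object of the first contribution.

```python
import numpy as np

def hmm_to_lccrf_potentials(pi, A, B):
    """Potentials of Equation (3) whose LC-CRF is the posterior of the HMM (pi, A, B).

    Returns (V1, V, U) such that, up to the normalizing constant kappa(y_{1:N}),
    p(x_{1:N} | y_{1:N}) is proportional to
    exp[ V1(x_1, x_2) + sum_{n=2}^{N-1} V(x_n, x_{n+1}) + sum_n U(x_n, y_n) ].
    """
    V1 = np.log(pi)[:, None] + np.log(A)   # V_1(x_1, x_2) = log[p(x_1) p(x_2 | x_1)]
    V = np.log(A)                          # V_n(x_n, x_{n+1}) = log p(x_{n+1} | x_n), n >= 2
    U = np.log(B)                          # U_n(x_n, y_n) = log p(y_n | x_n)
    return V1, V, U
```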
Our second contribution is related to the computation of Bayesian classifiers. In some situations, CRFs are preferred over HMMs because the Bayesian classifiers based on CRFs are computed without considering the distributions $p(y_n \mid x_n)$, which are difficult to model. This is particularly the case in automatic natural language processing (NLP). Indeed, when considering an HMM (Equation (1)) and calculating the Bayesian classifiers "maximum of posterior margins" (MPM) and "maximum a posteriori" (MAP) in the standard way, we use $p(y_n \mid x_n)$, and this is the reason why LC-CRFs are preferred over HMMs. In this paper, we show that the HMM-based MPM and MAP can also be computed without using $p(y_n \mid x_n)$. Let us also recall that each Bayesian "discriminative" classifier based on an LC-CRF (Equation (3)) is identical to the "generative" Bayesian classifier based on an HMM (Equation (1)) whose posterior distribution is that LC-CRF, since Bayesian classifiers only depend on the posterior distribution. Thus, the distinction between "generative" classifiers and "discriminative" classifiers is misleading: they are all "discriminative", but they can be computed in a "generative" way, using $p(y_n \mid x_n)$, or in a discriminative manner, without using $p(y_n \mid x_n)$. Thus, we can say that our second contribution is to show that the HMM-based MPM and MAP, usually computed in a generative manner, can also be computed in a discriminative manner. This is important because it shows that abandoning HMMs in favor of CRFs in the aforementioned situations is not justified. Indeed, together with the first contribution, it shows that the use of the MPM or MAP based on an HMM (Equation (1)) is as interesting as the use of the MPM or MAP based on an LC-CRF (Equation (3)).
Let us give some more technical details on the two contributions.
- 1.
We establish an equivalence between HMMs (Equation (1)) and basic linear-chain CRFs (Equation (3)), which completes the results presented in [14].
Let us notice that wanting to compare the two models directly is somewhat misleading. Indeed, HMMs and CRFs are defined with distributions on different spaces. To be precise, we adopt the following definition:
Definition 1. Let $X_{1:N} = (X_1, \ldots, X_N)$, $Y_{1:N} = (Y_1, \ldots, Y_N)$ be the two stochastic sequences defined above.
- (i)
We will call "model" a distribution $p(x_{1:N}, y_{1:N})$;
- (ii)
We will call "conditional model" a distribution $p(x_{1:N} \mid y_{1:N})$;
- (iii)
We will say that a model $q(x_{1:N}, y_{1:N})$ is "equivalent" to a conditional model $p(x_{1:N} \mid y_{1:N})$ if there exists a distribution $r(y_{1:N})$ such that $q(x_{1:N}, y_{1:N}) = p(x_{1:N} \mid y_{1:N})\,r(y_{1:N})$;
- (iv)
We will say that a family of models A is “equivalent” to a family of conditional models B if for each model in A there exists an equivalent conditional model in B.
According to Definition 1, HMMs are particular “models”, while CRFs are particular “conditional models”. Then a particular HMM model cannot be equal to a particular conditional CRF model, but it can be equivalent to the latter.
Our contribution is to show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)). In addition, we specify, for each LC-CRF $p(x_{1:N} \mid y_{1:N})$, a particular HMM $q(x_{1:N}, y_{1:N})$ such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$.
Finally, the core of our first contribution is the following. Let $p(x_{1:N}, y_{1:N})$ be an HMM (Equation (1)), with parameters $p(x_1)$, $p(x_{n+1} \mid x_n)$, $p(y_n \mid x_n)$. Taking $r(y_{1:N}) = p(y_{1:N})$, it is immediate to see that $p(x_{1:N} \mid y_{1:N})$ is an equivalent CRF. The converse is not immediate. Is a given CRF $p(x_{1:N} \mid y_{1:N})$ equivalent to a certain HMM? If yes, can we find $r(y_{1:N})$ such that $p(x_{1:N} \mid y_{1:N})\,r(y_{1:N})$ is an HMM? Moreover, can we give its (Equation (1)) form? Answering these questions in the simple linear-chain CRF case is our first contribution. More precisely, we show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)), and we specify, for each LC-CRF $p(x_{1:N} \mid y_{1:N})$, a particular HMM $q(x_{1:N}, y_{1:N})$, given in the form (Equation (1)), such that $q(x_{1:N} \mid y_{1:N}) = p(x_{1:N} \mid y_{1:N})$.
- 2.
We show that the "generative" estimators MPM and MAP in HMMs are computable in a "discriminative" manner, exactly as in LC-CRFs.
One of the interests of HMMs and CRFs is that in both of them there exist Bayesian classifiers, which allow estimating $x_{1:N}$ from $y_{1:N}$ in a reasonable computing time. As examples, let us consider the "maximum of posterior margins" (MPM) classifier, defined, for each $n = 1$, …, $N$, with
$$\hat{x}_n = \arg\max_{x_n \in \Lambda} p(x_n \mid y_{1:N}), \quad (4)$$
and the "maximum a posteriori" (MAP) classifier, defined with
$$\hat{x}_{1:N} = \arg\max_{x_{1:N} \in \Lambda^N} p(x_{1:N} \mid y_{1:N}). \quad (5)$$
Note that, like any other Bayesian classifier, MPM and MAP are independent from $p(y_{1:N})$. This means that in any generative model $p(x_{1:N}, y_{1:N})$, any related Bayesian classifier is strictly the same as the one related to the equivalent (in the meaning of Definition 1) CRF model $p(x_{1:N} \mid y_{1:N})$. We see that the distinction between "generative" and "discriminative" classifiers is not justified: all Bayesian classifiers are discriminative. However, in HMMs the related MPM and MAP classifiers are classically computed using $p(y_n \mid x_n)$, while this is not the case in LC-CRFs. We show that both MPM and MAP in HMMs can also be computed in a "discriminative" way, without using $p(y_n \mid x_n)$. Thus, the feasibility of using MPM and MAP in HMMs is strictly the same as that of their use in LC-CRFs, which is our second contribution. One of the consequences is that the use of MPM and MAP in the two families HMMs and LC-CRFs presents exactly the same interest, in particular in NLP. This shows that abandoning HMMs in favor of LC-CRFs in NLP because of the "generative" nature [6,7,8,9,15,16,17,18] of their related Bayesian classifiers was not justified.
4. Discriminative Classifiers in Generative HMMs
One of the interests of HMMs and of some CRFs with hidden discrete finite data lies in the possibility of fast analytic computation of Bayesian classifiers. As examples of classic Bayesian classifiers, let us consider the MPM (Equation (4)) and the MAP (Equation (5)). However, in some domains such as NLP, CRFs are preferred to HMMs for the following reasons. As the HMM is a generative model, the MPM and MAP used in HMMs are also called "generative", and people consider that the HMM-based MPM and MAP require the knowledge of $p(y_n \mid x_n)$. They therefore consider it improper to use the HMM-based MPM and MAP in situations where the distributions $p(y_n \mid x_n)$ are hard to handle. We show that this reason is not valid. More precisely, we show two points:
- (i)
First, we notice that regardless of the distribution $p(x_{1:N}, y_{1:N})$ considered, all Bayesian classifiers are independent from $p(y_{1:N})$, so that the distinction between "generative" and "discriminative" classifiers is misleading: they are all discriminative;
- (ii)
Second, we show that the need to use $p(y_n \mid x_n)$ when computing the MPM and MAP in HMMs is not intrinsic to HMMs but is due to their particular classic parameterization (Equation (1)). In other words, changing the parametrization, it is possible to compute the HMM-based MPM and MAP in a "discriminative" manner, without using $p(y_n \mid x_n)$ or $p(y_{1:N} \mid x_{1:N})$.
The first point is rather immediate: we note that a Bayesian classifier $\hat{s}_L$ is defined by a loss function $L \colon \Lambda^N \times \Lambda^N \to \mathbb{R}^{+}$ through
$$\hat{s}_L(y_{1:N}) = \arg\min_{x_{1:N}} E\big[L(X_{1:N}, x_{1:N}) \mid Y_{1:N} = y_{1:N}\big];$$
it is thus immediate to notice that $\hat{s}_L$ only depends on $p(x_{1:N} \mid y_{1:N})$. This implies that it is the same in a generative model and in its equivalent (within the meaning of Definition 1) discriminative model.
We show (ii) by separately considering the MPM and MAP cases.
4.1. Discriminative Computing of HMM-Based MPM
To show (ii), let us consider Equation (1) with $p(y_n \mid x_n) = \frac{p(x_n \mid y_n)\,p(y_n)}{p(x_n)}$. It becomes
$$p(x_{1:N}, y_{1:N}) = p(x_1 \mid y_1)\,p(y_1)\prod_{n=2}^{N} p(x_n \mid x_{n-1})\,\frac{p(x_n \mid y_n)\,p(y_n)}{p(x_n)}. \quad (21)$$
We see that (Equation (21)) is of the form $p(x_{1:N}, y_{1:N}) = h(y_{1:N})\,g(x_{1:N}, y_{1:N})$, where $h(y_{1:N}) = p(y_1)\cdots p(y_N)$ does not depend on $x_1$, …, $x_N$. This implies that $p(x_{1:N} \mid y_{1:N})$ does not depend on $p(y_1)$, …, $p(y_N)$ either. Then $p(x_n \mid y_{1:N}) = \sum_{x_1, \ldots, x_{n-1}, x_{n+1}, \ldots, x_N} p(x_{1:N} \mid y_{1:N})$, so that $p(x_n \mid y_{1:N})$ does not depend on $p(y_1)$, …, $p(y_N)$ either. Thus, the HMM-based classifier MPM also verifies the "discriminative classifier" definition.
How to compute $p(x_n \mid y_{1:N})$? It is classically computable using "forward" probabilities $\alpha_1(x_1)$, …, $\alpha_N(x_N)$ and "backward" ones $\beta_1(x_1)$, …, $\beta_N(x_N)$, defined with
$$\alpha_n(x_n) = p(x_n, y_{1:n}); \qquad \beta_n(x_n) = p(y_{n+1:N} \mid x_n),$$
then
$$p(x_n \mid y_{1:N}) = \frac{\alpha_n(x_n)\,\beta_n(x_n)}{\sum_{x_n \in \Lambda}\alpha_n(x_n)\,\beta_n(x_n)},$$
with all $\alpha_n(x_n)$ and $\beta_n(x_n)$ computed using the following forward and backward recursions [21]:
$$\alpha_1(x_1) = p(x_1)\,p(y_1 \mid x_1); \qquad \alpha_{n+1}(x_{n+1}) = \Big[\sum_{x_n \in \Lambda}\alpha_n(x_n)\,p(x_{n+1} \mid x_n)\Big]\,p(y_{n+1} \mid x_{n+1}); \quad (26)$$
$$\beta_N(x_N) = 1; \qquad \beta_n(x_n) = \sum_{x_{n+1} \in \Lambda} p(x_{n+1} \mid x_n)\,p(y_{n+1} \mid x_{n+1})\,\beta_{n+1}(x_{n+1}). \quad (27)$$
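For reference, a compact sketch of the classic recursions (Equations (26) and (27)) in the stationary, finite-observation case is given below; the array names follow the illustrative conventions used above.

```python
import numpy as np

def classic_forward_backward(pi, A, B, y):
    """Posterior marginals p(x_n | y_{1:N}) via the classic recursions (26) and (27).

    pi : (K,) p(x_1);  A : (K, K) p(x_{n+1} | x_n);  B : (K, M) p(y_n | x_n);  y : observations.
    """
    N, K = len(y), len(pi)
    alpha = np.zeros((N, K))
    beta = np.ones((N, K))
    alpha[0] = pi * B[:, y[0]]                          # alpha_1(x_1) = p(x_1) p(y_1 | x_1)
    for n in range(N - 1):
        alpha[n + 1] = (alpha[n] @ A) * B[:, y[n + 1]]  # Equation (26)
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, y[n + 1]] * beta[n + 1])    # Equation (27)
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)       # p(x_n | y_{1:N})
```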
Setting $p(y_n \mid x_n) = \frac{p(x_n \mid y_n)\,p(y_n)}{p(x_n)}$ and recalling that $p(x_n \mid y_{1:N})$ does not depend on $p(y_1)$, …, $p(y_N)$, we can arbitrarily modify them. Let us replace each $p(y_n)$ with a constant $c > 0$, which amounts to considering the uniform distribution over the observations, so that $p(y_1) = \cdots = p(y_N) = c$. Then (Equation (26)) and (Equation (27)) become
$$\tilde{\alpha}_1(x_1) = c\,p(x_1 \mid y_1); \qquad \tilde{\alpha}_{n+1}(x_{n+1}) = c\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\sum_{x_n \in \Lambda}\tilde{\alpha}_n(x_n)\,p(x_{n+1} \mid x_n);$$
$$\tilde{\beta}_N(x_N) = 1; \qquad \tilde{\beta}_n(x_n) = c\sum_{x_{n+1} \in \Lambda}\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\,p(x_{n+1} \mid x_n)\,\tilde{\beta}_{n+1}(x_{n+1});$$
and we still have
$$p(x_n \mid y_{1:N}) = \frac{\tilde{\alpha}_n(x_n)\,\tilde{\beta}_n(x_n)}{\sum_{x_n \in \Lambda}\tilde{\alpha}_n(x_n)\,\tilde{\beta}_n(x_n)}.$$
Finally, we see that this last ratio is independent from $c$, so that we can take $c = 1$. Then we can state the following proposition.
Proposition 2. Let $p(x_{1:N}, y_{1:N})$ be an HMM (Equation (1)). Let us define "discriminative forward" quantities $\alpha_1^{D}$, …, $\alpha_N^{D}$ and "discriminative backward" ones $\beta_1^{D}$, …, $\beta_N^{D}$ by the following forward and backward recursions:
$$\alpha_1^{D}(x_1) = p(x_1 \mid y_1); \qquad \alpha_{n+1}^{D}(x_{n+1}) = \frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\sum_{x_n \in \Lambda}\alpha_n^{D}(x_n)\,p(x_{n+1} \mid x_n); \quad (31)$$
$$\beta_N^{D}(x_N) = 1; \qquad \beta_n^{D}(x_n) = \sum_{x_{n+1} \in \Lambda}\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\,p(x_{n+1} \mid x_n)\,\beta_{n+1}^{D}(x_{n+1}); \quad (32)$$
then
$$p(x_n \mid y_{1:N}) = \frac{\alpha_n^{D}(x_n)\,\beta_n^{D}(x_n)}{\sum_{x_n \in \Lambda}\alpha_n^{D}(x_n)\,\beta_n^{D}(x_n)}. \quad (33)$$
Consequently, we can compute the MPM classifier in a discriminative manner, only using $p(x_1)$, …, $p(x_N)$, $p(x_2 \mid x_1)$, …, $p(x_N \mid x_{N-1})$, and $p(x_1 \mid y_1)$, …, $p(x_N \mid y_N)$.
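The following sketch implements the recursions of Proposition 2 under the same illustrative conventions; the posteriors $p(x_n \mid y_n)$ can come from any model (e.g., a logistic regression, as discussed in Remark 2 below), and $p(y_n \mid x_n)$ is never used.

```python
import numpy as np

def discriminative_forward_backward(p_x, A, p_x_given_y):
    """Posterior marginals p(x_n | y_{1:N}) via Equations (31)-(33) of Proposition 2.

    p_x         : (N, K) marginals p(x_n)
    A           : (K, K) transitions p(x_{n+1} | x_n)
    p_x_given_y : (N, K) posteriors p(x_n | y_n)
    """
    N, K = p_x_given_y.shape
    ratio = p_x_given_y / p_x                                 # p(x_n | y_n) / p(x_n)
    alpha_d = np.zeros((N, K))
    beta_d = np.ones((N, K))
    alpha_d[0] = p_x_given_y[0]                               # Equation (31), initialization
    for n in range(N - 1):
        alpha_d[n + 1] = (alpha_d[n] @ A) * ratio[n + 1]      # Equation (31)
    for n in range(N - 2, -1, -1):
        beta_d[n] = A @ (ratio[n + 1] * beta_d[n + 1])        # Equation (32)
    post = alpha_d * beta_d
    return post / post.sum(axis=1, keepdims=True)             # Equation (33)
```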
Note that this result is similar to the result in [21], with a different proof.
Remark 1. Let us notice that, according to Equations (31) and (32), it is possible to compute $\alpha_n^{D}$ and $\beta_n^{D}$ by a very slight adaptation of the classic computing programs giving the classic $\alpha_n$ and $\beta_n$ with recursions (Equations (26) and (27)). All we have to do is to replace $p(y_n \mid x_n)$ with $\frac{p(x_n \mid y_n)}{p(x_n)}$. Of course, $\alpha_n^{D} \neq \alpha_n$ and $\beta_n^{D} \neq \beta_n$, but (Equation (33)) holds and thus $p(x_n \mid y_{1:N})$ is computable, allowing the MPM.
Remark 2. We see that we can compute the MPM in HMMs only using $p(x_1)$, …, $p(x_N)$, $p(x_2 \mid x_1)$, …, $p(x_N \mid x_{N-1})$, and $p(x_1 \mid y_1)$, …, $p(x_N \mid y_N)$. This means that in supervised classification, where we have a learning sample, we can use any parametrization to estimate them. For example, we can model $p(x_n \mid y_n)$ with logistic regression, as currently done in CRFs. It is important to note that such a parametrization is unusual; however, what matters is that the model remains the same.
Remark 3. When using the MPM to deal with a concrete problem, Proposition 2 implies that talking about a comparison between LC-CRFs and HMMs is somewhat incorrect. Indeed, there is only one model. However, there are two different parameterizations, and this can produce two different results. Indeed, the estimation of the parameters can give two models of the same nature (posterior distribution of an HMM) but unequally situated with respect to the optimal model, the best suited to the data. It would therefore be more correct to speak of a comparison of two parametrizations of HMM, each associated with its own parameter estimator. The same is true concerning the MAP discussed in the next paragraph.
4.2. Discriminative Computing of HMM-Based MAP: Discriminative Viterbi
Let $p(x_{1:N}, y_{1:N})$ be an HMM (Equation (1)). The Bayesian MAP classifier (Equation (5)) based on such an HMM is computed with the following Viterbi algorithm [22]. For each $n = 1$, …, $N$, and each $x_n \in \Lambda$, let $x^*_{1:n}(x_n)$ be the path $(x_1, \ldots, x_{n-1}, x_n)$ verifying
$$p\big(x^*_{1:n}(x_n), y_{1:n}\big) = \max_{x_1, \ldots, x_{n-1}} p(x_{1:n}, y_{1:n}). \quad (34)$$
We see that $x^*_{1:n}(x_n)$ is a path maximizing $p(x_{1:n}, y_{1:n})$ over all paths ending in $x_n$. Then, having the paths $x^*_{1:n}(x_n)$ and the probabilities $p(x^*_{1:n}(x_n), y_{1:n})$ for each $x_n$, one determines, for each $x_{n+1}$, the paths $x^*_{1:n+1}(x_{n+1})$ and the probabilities $p(x^*_{1:n+1}(x_{n+1}), y_{1:n+1})$, searching
$$\tilde{x}_n(x_{n+1}) = \arg\max_{x_n \in \Lambda}\big[p(x^*_{1:n}(x_n), y_{1:n})\,p(x_{n+1} \mid x_n)\,p(y_{n+1} \mid x_{n+1})\big], \quad (35)$$
with $x^*_{1:n+1}(x_{n+1}) = \big(x^*_{1:n}(\tilde{x}_n(x_{n+1})), x_{n+1}\big)$.
Setting in Equation (35) $p(y_{n+1} \mid x_{n+1}) = \frac{p(x_{n+1} \mid y_{n+1})\,p(y_{n+1})}{p(x_{n+1})}$, we see that the $x_n$ which verifies Equation (35) is the same as the $x_n$ which maximizes $p(x^*_{1:n}(x_n), y_{1:n})\,p(x_{n+1} \mid x_n)\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}$, so that we can suppress $p(y_{n+1})$. In other words, we can replace Equation (35) with
$$\tilde{x}_n(x_{n+1}) = \arg\max_{x_n \in \Lambda}\Big[\mu_n(x_n)\,p(x_{n+1} \mid x_n)\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}\Big], \quad (36)$$
where $\mu_n(x_n) = p(x^*_{1:n}(x_n), y_{1:n})/[p(y_1)\cdots p(y_n)]$ is computed recursively by $\mu_1(x_1) = p(x_1 \mid y_1)$ and $\mu_{n+1}(x_{n+1}) = \max_{x_n \in \Lambda}\big[\mu_n(x_n)\,p(x_{n+1} \mid x_n)\big]\,\frac{p(x_{n+1} \mid y_{n+1})}{p(x_{n+1})}$, so that neither Equation (36) nor the final maximization over $x_N$ uses $p(y_n \mid x_n)$.
Finally, we propose the following discriminative version of the Viterbi algorithm:
- -
Set $\mu_1(x_1) = p(x_1 \mid y_1)$ and $x^*_{1:1}(x_1) = (x_1)$ for each $x_1 \in \Lambda$;
- -
For each $n = 1$, …, $N-1$, and each $x_{n+1} \in \Lambda$, apply Equation (36) to find the path $x^*_{1:n+1}(x_{n+1})$ from the paths $x^*_{1:n}(x_n)$ (for all $x_n$) and the quantities $\mu_n(x_n)$ (for all $x_n$);
- -
End setting $\hat{x}_{1:N} = x^*_{1:N}(\hat{x}_N)$, with $\hat{x}_N = \arg\max_{x_N \in \Lambda}\mu_N(x_N)$.
As with the MPM above, we see that we can find $\hat{x}_{1:N}$ using only $p(x_1)$, …, $p(x_N)$, $p(x_2 \mid x_1)$, …, $p(x_N \mid x_{N-1})$, and $p(x_1 \mid y_1)$, …, $p(x_N \mid y_N)$, exactly as in the CRF case. As above, it appears that dropping HMMs in some NLP tasks on the grounds that the MAP is a "generative" classifier is not justified. In particular, in the supervised stationary framework, the distributions $p(x_1)$, $p(x_{n+1} \mid x_n)$, and $p(x_n \mid y_n)$ can be estimated in the same way as in the LC-CRF case.
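Under the same illustrative conventions, a minimal sketch of the discriminative Viterbi algorithm (in the log domain, for numerical stability) could look as follows; again, only $p(x_n)$, $p(x_{n+1} \mid x_n)$, and $p(x_n \mid y_n)$ are used.

```python
import numpy as np

def discriminative_viterbi(p_x, A, p_x_given_y):
    """MAP state sequence computed with the discriminative Viterbi algorithm.

    p_x         : (N, K) marginals p(x_n)
    A           : (K, K) transitions p(x_{n+1} | x_n)
    p_x_given_y : (N, K) posteriors p(x_n | y_n)
    """
    N, K = p_x_given_y.shape
    log_ratio = np.log(p_x_given_y) - np.log(p_x)   # log[p(x_n | y_n) / p(x_n)]
    log_A = np.log(A)
    mu = np.log(p_x_given_y[0])                     # mu_1(x_1) = p(x_1 | y_1), in log domain
    back = np.zeros((N, K), dtype=int)
    for n in range(N - 1):
        scores = mu[:, None] + log_A                # mu_n(x_n) p(x_{n+1} | x_n), in log domain
        back[n + 1] = scores.argmax(axis=0)         # best predecessor, Equation (36)
        mu = scores.max(axis=0) + log_ratio[n + 1]  # mu_{n+1}(x_{n+1})
    path = [int(mu.argmax())]                       # hat{x}_N
    for n in range(N - 1, 0, -1):
        path.append(int(back[n][path[-1]]))         # backtrack hat{x}_{1:N}
    return path[::-1]
```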
5. Discussion and Conclusions
We have proposed two results. First, we have shown that the basic LC-CRF (Equation (3)) is equivalent to a classical HMM (Equation (1)), in that one can find an HMM whose posterior distribution is exactly the given LC-CRF. More precisely, we specified the way to calculate the parameters $p(x_1)$, $p(x_{n+1} \mid x_n)$, and $p(y_n \mid x_n)$ which define Equation (1) from the parameters $V_n$, $U_n$ which define Equation (3). Second, noting that all Bayesian classifiers are discriminative in the sense that they do not depend on the observation distribution, we showed that classifiers based on HMMs, usually considered as "generative", can also be considered as discriminative classifiers. More specifically, we have proposed discriminative methods for computing the classic maximum of posterior margins (MPM) and maximum a posteriori (MAP) classifiers based on HMMs. The first result shows that basic LC-CRFs and classical HMMs are equally general. The second shows that at the application level, HMMs offer strictly the same processing power, at least with regard to the MPM and MAP, as LC-CRFs.
The practical interest of our contributions is as follows. Until now, some authors considered LC-CRFs and HMMs to be equivalent without presenting rigorous proof, while others considered LC-CRFs to be more general [19]. Partly because of this uncertainty, CRFs have often been preferred over HMMs. We have proved, at least in the particular framework considered, that the two models are equivalent, and that the abandonment of HMMs in favor of CRFs was not always justified. In other words, faced with a particular application, there is no reason to choose CRFs systematically. However, we also cannot say that HMMs should be chosen. Finally, our contribution is likely to encourage practitioners to consider both models, or rather, according to Remark 3, both parametrizations of the same model, on an equal footing.
We considered basic LC-CRFs and HMMs, which form a limited yet widely used framework. LC-CRFs and HMMs can be extended in different directions. The general question to study is to search for an extension of HMMs which would be equivalent to the general CRF (Equation (2)). This seems to be a hard problem. However, one can study some existing extensions of HMMs and wonder what kind of CRFs they would be equivalent to. For example, HMMs have been extended to "pairwise Markov models" (PMMs [23]), and the question is therefore what kind of CRFs would be equivalent to PMMs. Another kind of extension consists of adding a third, latent process to the pair $(X_{1:N}, Y_{1:N})$. In the case of CRFs this leads to hidden CRFs (HCRFs [24]), and in the case of HMMs this leads to triplet Markov models (TMMs [25]). Finally, CRFs were generalized to semi-Markov CRFs [26], and HMMs were generalized to hidden semi-Markov models, with explicit distribution of the exact sojourn time in a given state [27], or with explicit distribution of the minimal sojourn time in a given state [28]. The study of the relations between these different extensions of CRFs and HMMs is an interesting perspective for further investigation.