Next Article in Journal
How Optimal Transport Can Tackle Gender Biases in Multi-Class Neural Network Classifiers for Job Recommendations
Next Article in Special Issue
Using Epidemiological Models to Predict the Spread of Information on Twitter
Previous Article in Journal
A Scheduling Solution for Robotic Arm-Based Batching Systems with Multiple Conveyor Belts
Previous Article in Special Issue
About the Performance of a Calculus-Based Approach to Building Model Functions in a Derivative-Free Trust-Region Algorithm
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Communication

Equivalence between LC-CRF and HMM, and Discriminative Computing of HMM-Based MPM and MAP

1
Watson Department, IBM GSB France, avenue de l’Europe,92270 Bois-Colombes, France
2
SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris, 91120 Palaiseau, France
*
Author to whom correspondence should be addressed.
Algorithms 2023, 16(3), 173; https://doi.org/10.3390/a16030173
Submission received: 13 January 2023 / Revised: 8 March 2023 / Accepted: 15 March 2023 / Published: 21 March 2023
(This article belongs to the Special Issue Mathematical Models and Their Applications IV)

Abstract

:
Practitioners have used hidden Markov models (HMMs) in different problems for about sixty years. Moreover, conditional random fields (CRFs) are an alternative to HMMs and appear in the literature as different and somewhat concurrent models. We propose two contributions: First, we show that the basic linear-chain CRFs (LC-CRFs), considered as different from HMMs, are in fact equivalent to HMMs in the sense that for each LC-CRF there exists an HMM—that we specify—whose posterior distribution is identical to the given LC-CRF. Second, we show that it is possible to reformulate the generative Bayesian classifiers maximum posterior mode (MPM) and maximum a posteriori (MAP), used in HMMs, as discriminative ones. The last point is of importance in many fields, especially in natural language processing (NLP), as it shows that in some situations dropping HMMs in favor of CRFs is not necessary.

1. Introduction

Let Z 1 : N = ( Z 1 , , Z N )  be a stochastic sequence, with Z n = ( X n , Y n ) . The random variables X 1 , , X N take their values in a finite set Λ , while Y 1 , , Y N take their values either in a discrete or continuous set Ω . Realizations of X 1 : N = ( X 1 , , X N ) are hidden while realizations of Y 1 : N = ( Y 1 , , Y N ) are observed, and the problem we deal with is to estimate X 1 : N = x 1 : N from Y 1 : N = y 1 : N . We deal with Bayesian methods of estimation, which requires some probabilistic model. Probabilistic model is a distribution—or a family of distributions—which is denoted by p ( z 1 : N ) , or p ( x 1 : N , y 1 : N ) . We are interested in the case of dependent Z 1 , , Z N . The simplest model taking into account this dependence is the well-known hidden Markov model (HMM) [1,2,3,4,5], whose distribution is given with
p x 1 : N , y 1 : N = p x 1 p y 1 x 1 n = 1 N 1 p x n + 1 x n p y n + 1 x n + 1
In the whole paper, we consider HMMs such that p x n + 1 x n and p y n + 1 x n + 1 in Equation (1) are non-null. HMMs allow recursive fast computation of Bayesian estimators, called “classifiers” in this paper, and recalled below. In spite of their simplicity, HMMs are very robust and provide quite satisfactory results in many applications.
Moreover, conditional random fields (CRFs) [6,7] also allow estimating X 1 : N = x 1 : N from Y 1 : N = y 1 : N . Their definition is different from the definition of HMMs in that in CRFs, one directly considers p ( x 1 : N | y 1 : N ) , and neither p ( x 1 : N , y 1 : N ) nor p ( y 1 : N | x 1 : N ) is needed to perform the estimation. The distribution of general LC-CRFs is written as such:
p ( x 1 : N | y 1 : N ) = p x 1 y 1 : N n = 1 N 1 p x n + 1 x n , y 1 : N
In this paper, we consider the following basic LC-CRF:
p x 1 : N | y 1 : N = 1 κ ( y 1 : N ) e x p n = 1 N 1 V n x n , x n + 1 + n = 1 N U n x n , y n
with κ ( y 1 : N ) the normalizing constant.
Authors usually consider the two families HMMs and LC-CRFs as different [6,7,8,9,10,11,12,13]. They classify the former in the category of “generative models”, while they classify the latter in the category of “discriminative” models.
In this paper, we investigate the relationship between HMMs (Equation (1)) and LC-CRFs (Equation (3)). We immediately see that if p x 1 : N , y 1 : N is of the form (Equation (1)), then p x 1 : N | y 1 : N is of the form (Equation (3)). Indeed, p x n + 1 x n and p y n + 1 x n + 1 in Equation (1) being non-zero, one can take V 1 x 1 , x 2 = L o g [ p ( x 1 , x 2 ) ] , V n x n , x n + 1 = L o g [ p x n + 1 x n ] for n = 2 , …, N 1 , U n x n , y n = L o g [ p y n x n ] , for n = 1 , …, N , and κ y 1 : N = p ( y 1 : N ) . Thus, the posterior distribution p x 1 : N | y 1 : N of each HMM (Equation (1)) with the positivity conditions assumed above is a LC-CRF (Equation (3)). Our first contribution is to show the converse: for a given LC-CRF p x 1 : N | y 1 : N of form (Equation (3)), there is an HMM q x 1 : N , y 1 : N of form (Equation (3)) such that q x 1 : N | y 1 : N = p x 1 : N | y 1 : N . Moreover, we give exact computation of the HMM parameters q x 1 , q x n + 1 x n , q y n + 1 x n + 1 , from the LC-CRF parameters V n x n , x n + 1 , U n x n , y n . Such a general result, without additional constraints on LC-CRF, is new.
Our second contribution is related to the computation of Bayesian classifiers. In some situations, CRFs are preferred over HMMs because the Bayesian classifiers based on CRFs are computed without considering p y n x n distributions, which are difficult to model. This is particularly the case with automatic natural language processing (NLP). Indeed, when considering an HMM (Equation (1)) and calculating the Bayesian classifiers “maximum posterior margins” (MPM) and “maximum posterior” (MAP) in the standard way, we use p y n x n and this is the reason why LC-CRFs are preferred over HMMs. In this paper, we show that the HMM-based MPM and MAP can also be computed without using p y n x n . Also recall that each Bayesian “discriminative” classifier based on LC-CRF (Equation (3)) is identical to the “generative” Bayesian classifier based on an HMM (Equation (1)), since the posterior distribution p x 1 : N | y 1 : N of an HMM gives the one of a LC-CRF. Indeed, Bayesian classifiers only depend on the posterior distribution. Thus, the distinction between “generative” classifiers and “discriminative” classifiers is misleading, they are all “discriminative”, but they can be computed in a “generative” way, using p y n x n , or in a discriminative manner, without using p y n x n . Thus, we can say that our second contribution is to show that the HMM-based MPM and MAP, usually computed in a generative manner, can also be computed in a discriminative manner. This is important because it shows that abandoning HMMs in favor of CRFs in the aforementioned situations is not justified. Indeed, attached to the first contribution, it shows that the use of the MPM or MAP based on HMM (Equation (1)) is as interesting as the use of the MPM or MAP based on LC-CRF (Equation (3)).
Let us give some more technical details on the two contributions.
  • We establish an equivalence between HMMs (Equation (1)) and basic linear-chain CRFs (Equation (3)), which completes the results presented in [14].
Let us notice that wanting to compare the two models directly is somewhat misleading. Indeed, HMMs and CRFs are defined with distributions on different spaces. To be precise, we adopt the following definition:
Definition 1.
Let  X 1 , , X N , Y 1 , , Y N  be the two stochastic sequences defined above.
(i) 
We will call “model” a distribution  p x 1 : N , y 1 : N ;
(ii) 
We will call “conditional model” a distribution  p x 1 : N | y 1 : N ;
(iii) 
We will say that a model  p x 1 : N , y 1 : N  is “equivalent” to a conditional model  q x 1 : N | y 1 : N  if there exists a distribution  r y 1 : N  such that  p x 1 : N , y 1 : N = q x 1 : N | y 1 : N r y 1 : N ;
(iv) 
We will say that a family of models A is “equivalent” to a family of conditional models B if for each model  p x 1 : N , y 1 : N  in A there exists an equivalent conditional model  q x 1 : N | y 1 : N  in B.
According to Definition 1, HMMs are particular “models”, while CRFs are particular “conditional models”. Then a particular HMM model cannot be equal to a particular conditional CRF model, but it can be equivalent to the latter.
Our contribution is to show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)). In addition, we specify, for each LC-CRF q x 1 : N | y 1 : N , a particular HMM p x 1 : N , y 1 : N such that p x 1 : N | y 1 : N = q x 1 : N | y 1 : N .
Finally, the core of our first contribution is the following. Let p x 1 : N , y 1 : N , θ be an HMM (Equation (1)), with parameters θ . Taking r y 1 : N = p y 1 : N , θ , it is immediate to see that p x 1 : N | y 1 : N , θ is an equivalent CRF. The converse is not immediate. Is a given CRF p x 1 : N | y 1 : N , θ equivalent to a certain HMM? If yes, can we find r y 1 : N such that p x 1 : N | y 1 : N , θ r y 1 : N is an HMM? Moreover, can we give its (Equation (1)) form? Answering to these questions in a simple linear-chain CRF case is our first contribution. More precisely, we show that the family of LC-CRFs (Equation (3)) is equivalent to the family of HMMs (Equation (1)), and we specify, for each LC-CRF p x 1 : N | y 1 : N , a particular HMM q x 1 : N , y 1 : N given in the form (Equation (1)), such that p x 1 : N | y 1 : N = q x 1 : N | y 1 : N .
2.
We show that the “generative” estimators MPM and MAP in HMM are computable in a “discriminative” manner, exactly as in LC-CRF.
One of the interests of HMMs and CRFs is that in both of them there exist Bayesian classifiers, which allow estimating x 1 : N from y 1 : N in a reasonable computer time. As examples, let us consider the “maximum of posterior margins” (MPM) defined with:
g y 1 : N = x ^ 1 : N = x ^ 1 , , x ^ N [ n = 1 , , N , p ( x ^ n y 1 : N = s u p x n ( p x n y 1 : N ) ]
and the “maximum a posteriori” (MAP), defined with
g y 1 : N = x ^ 1 : N [ p ( x ^ 1 : N y 1 : N = s u p x 1 : N ( p x 1 : N y 1 : N ) ]
Note that likely to any other Bayesian classifier, MPM and MAP are independent from p ( y 1 : N ) . This means that in any generative model p ( x 1 : N , y 1 : N ) , any related Bayesian classifier is strictly the same as the one related to the equivalent (in the meaning of Definition 1) CRF model p ( x 1 : N | y 1 : N ) . We see that the distinction between “generative” and “discriminative” classifiers is not justified: all Bayesian classifiers are discriminative. However, in HMM the related MPM and MAP classifiers are computed using p y n x n , while this is not the case in LC-CRF. We show that both MPM and MAP in HMM can also be computed in a “discriminative” way, without using p y n x n . Thus, the feasibility of using MPM and MAP in HMM is strictly the same as that of their use in LC-CRF, which is our second contribution. One of the consequences is that the use of MPM and MAP in the two families HMMs and LC-CRFs presents exactly the same interest, in particular in NLP. This shows that abandoning HMMs in favor of LC-CRFs in NLP because of the “generative” nature [6,7,8,9,15,16,17,18] of their related Bayesian classifiers was not justified.

2. Related Works

Concerning the first contribution, several authors noticed similarities between LC-CRFs and HMMs in different previous works. Our first remark is to notice that trying to compare the two families directly is somewhat incorrect, as they are defined on different spaces. In particular, they cannot be equal. It is well-known that the posterior distribution p x 1 : N | y 1 : N of an HMM p x 1 : N , y 1 : N is a LC-CRF. Conversely, showing that for a given CRF p x 1 : N | y 1 : N it is possible to find an HMM q x 1 : N , y 1 : N such that q x 1 : N | y 1 : N = p x 1 : N | y 1 : N is more difficult and at our knowledge, there is no general results, except [14], published. However, let us mention [19] where authors show a similar result to ours assuming an additional constraint on the LC-CRF considered. In [7], authors comment similarities and differences between LC-CRFs and HMMs considered here; however, the problem of searching an HMM equivalent to a given LC-CRF is not addressed. In this paper we show how to compute, from the LC-CRF p x 1 : N | y 1 : N given with Equation (3), an HMM q x 1 : N , y 1 : N verifying q x 1 : N | y 1 : N = p x 1 : N | y 1 : N , without any additional constraints. Concerning the second contribution, the authors generally distinguish between “discriminative” classifiers, linked to discriminative models, and “generative” classifiers, linked to generative models. As mentioned above, our contribution is to show that the MPMs and MAPs based on generative HMMs can also be considered as discriminative classifiers. To our knowledge, there is no work on such conversions, except [20].

3. Equivalence between HMMs and Simple Linear-Chain CRFs

We will use the following Lemma:
Lemma 1. 
Let  W 1 : N = ( W 1 , , W N )  be random sequence, taking its values in a finite set  Δ . Then
(i) 
W 1 : N  is a Markov chain if and only if (iff) there exist  N 1  functions  φ 1 , , φ N 1  from  Δ 2  to  R +  such that
p ( w 1 , , w N ) φ 1 ( w 1 , w 2 ) φ N 1 ( w N 1 , w N )
where means “proportional to”;
(ii) 
For the HMM defined with  φ 1 , …, φ N 1  verifying Equation (6),  p w 1  and  p w n + 1 | w n  are given with
p w 1 = β 1 ( w 1 ) w 1 β 1 ( w 1 ) ; p w n + 1 | w n = φ n ( w n , w n + 1 ) β n + 1 ( w n + 1 ) β n w n
where  β 1 w 1 , …, β N w N  are defined with the following backward recursion:
β N w N = 1 ,   β n w n = w n + 1 φ n ( w n , w n + 1 ) β n + 1 ( w n + 1 )
Proof of Lemma.
  • Let W 1 : N be Markov: p w 1 , , w N = p w 1 p w 2 | w 1 p w 3 | w 2 p w N | w N 1 . Then (Equation (6)) is verified by φ 1 w 1 , w 2 = p w 1 p w 2 | w 1 , φ 2 w 2 , w 3 = p w 3 | w 2 , …, φ N 1 w N 1 , w N = p w N | w N 1 .
  • Conversely, let p w 1 , , w N verifies (Equation (6)). Thus p w 1 , , w N = K φ 1 ( w 1 , w 2 ) φ N 1 ( w N 1 , w N ) with K constant. This implies that for each n = 1 , …, N 1 we have
    p w n + 1 | w 1 , , w n = p w 1 , , w n , w n + 1 p w 1 , , w n = w n + 2 , , w N , φ 1 ( w 1 , w 2 ) φ n w n , w n + 1 φ n + 1 w n + 1 , w n + 2 φ N 1 ( w N 1 , w N ) w n + 1 , w n + 2 , , w N , φ 1 ( w 1 , w 2 ) φ n w n , w n + 1 φ n + 1 w n + 1 , w n + 2 φ N 1 ( w N 1 , w N ) = φ n w n , w n + 1 w n + 2 , , w N , φ n + 1 w n + 1 , w n + 2 φ N 1 ( w N 1 , w N ) w n + 1 , w n + 2 , , w N , φ n w n , w n + 1 φ n + 1 w n + 1 , w n + 2 φ N 1 ( w N 1 , w N ) = p w n + 1 | w n
    which shows that p w 1 , , w N is Markov.
Moreover, let us set β n w n = w n + 1 , w n + 2 , , w N φ n w n , w n + 1 φ N 1 ( w N 1 , w N ) for n = 1 , …, N 1 . On the one hand, we see that β n w n = w n + 1 φ n ( w n , w n + 1 ) β n + 1 ( w n + 1 ) . On the other hand, according to (Equation (9)) we have p w n + 1 | w n = φ n ( w n , w n + 1 ) β n + 1 ( w n + 1 ) β n w n . As p w 1 = β 1 ( w 1 ) w 1 β 1 ( w 1 ) , (Equation (7)) and (Equation (8)) are verified, which ends the proof. □
Proposition 1 below shows that the LC-CRF defined with (Equation (3)) is equivalent to an HMM defined with Equation (1). In addition, p x 1 , p x n + 1 x n , and p y n x n in Equation (1) defining an equivalent HMM are computed from V n x n , x n + 1 and U n x n , y n . To the best of our knowledge, except some first weaker results in [19], these results are new.
Proposition 1.
Let  Z 1 : N = ( Z 1 , , Z N )  be a stochastic sequence, with Z n = ( X n , Y n ) . Each  ( X n , Y n )  takes its values in  Λ × Ω , with  Λ  and  Ω  finite. If  Z 1 : N  is a LC-CRF with the distribution  p x 1 : N | y 1 : N  defined by
p x 1 : N | y 1 : N = 1 κ ( y 1 : N ) e x p n = 1 N 1 V n x n , x n + 1 + n = 1 N U n x n , y n
then (Equation (10)) is the posterior distribution of the HMM
q x 1 : N , y 1 : N = q 1 x 1 q y 1 | x 1 n = 1 N 1 q x n + 1 | x n q y n + 1 | x n + 1
with
q x 1 , y 1 = β 1 x 1 , y 1 x 1 , y 1 β 1 x 1 , y 1
and, for  n = 1 , , N 1 :
q x n + 1 | x n = ψ x n + 1 e x p [ V n x n , x n + 1 ] x n + 1 ψ x n + 1 e x p [ V n x n , x n + 1 ]
q y n + 1 | x n + 1 = e x p [ U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1 ψ x n + 1
where
ψ x n + 1 = y n + 1 e x p [ U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1
and  β 1 x 1 , y 1 , …, β N x N , y N are given by the backward recursion
β N x N , y N = 1
β n x n , y n = x n + 1 , y n + 1 e x p [ V n x n , x n + 1 + U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1
Proof of Proposition 1.
Let us consider functions φ 1 , …, φ N defined on [ Λ × Ω ] 2 by
φ 1 x 1 , y 1 , x 2 , y 2 = e x p V 1 x 1 , x 2 + U 1 x 1 , y 1 + U 2 x 2 , y 2
φ n x n , y n , x n + 1 , y n + 1 = e x p [ V n x n , x n + 1 + U n + 1 x n + 1 , y n + 1 ] ,   for   n = 2 , , N 1
According to the lemma, they define a Markov chain Z 1 : N = ( Z 1 , , Z N ) , with Z n = ( X n , Y n ) . Let us denote its distribution by q ( z 1 : N ) = q x 1 : N , y 1 : N . As q x 1 : N , y 1 : N = K e x p n = 1 N 1 V n x n , x n + 1 + n = 1 N U n x n , y n with K constant, we have q x 1 : N | y 1 : N = p x 1 : N | y 1 : N . Let us show that q x 1 : N , y 1 : N verifies (Equations (11)–(16)). According to the lemma, considering β 1 x 1 , y 1 , …, β N x N , y N defined with (Equation (16)), and ψ x n + 1 defined with (Equation (15)), we have
q x n + 1 , y n + 1 | x n , y n = φ n x n , y n , x n + 1 , y n + 1 β n + 1 x n + 1 , y n + 1 x n + 1 , y n + 1 φ n x n , y n , x n + 1 , y n + 1 β n + 1 x n + 1 , y n + 1 = e x p [ V n x n , x n + 1 + U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1 x n + 1 , y n + 1 e x p [ V n x n , x n + 1 + U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1 =
e x p [ V n x n , x n + 1 ] e x p [ U n + 1 x n + 1 , y n + 1 ] β n + 1 x n + 1 , y n + 1 x n + 1 ψ x n + 1 e x p [ V n x n , x n + 1 ] = e x p [ V n x n , x n + 1 ] x n + 1 ψ x n + 1 e x p [ V n x n , x n + 1 ] exp U n + 1 x n + 1 , y n + 1 β n + 1 x n + 1 , y n + 1 = ψ x n + 1 e x p [ V n x n , x n + 1 ] x n + 1 ψ x n + 1 e x p [ V n x n , x n + 1 ] exp U n + 1 x n + 1 , y n + 1 β n + 1 x n + 1 , y n + 1 ψ x n + 1 = q x n + 1 | x n q y n + 1 | x n + 1
(12) being directly implied by the lemma, this ends the proof. □

4. Discriminative Classifiers in Generative HMMs

One of the interests of HMMs and some CRFs with hidden discrete finite data lies in possibilities of analytic fast computation of Bayesian classifiers. As examples of classic Bayesian classifiers, let us consider MPM (Equation (4)) and MAP (Equation (5)). However, in some domains such as NLP, CRFs are preferred to HMMs for the following reasons. As HMM is a generative model, MPM and MAP used in HMM are also called “generative”, and people consider that HMM-based MPM and MAP require the knowledge of p y n | x n . Then people consider it as improper to use the HMM-based MPM and MAP in situations where the distributions p y n | x n are hard to handle. We show that this reason is not valid. More precisely, we show two points:
(i)
First, we notice that regardless of the distribution p x 1 : N , y 1 : N , all Bayesian classifiers are independent from p y 1 : N , so that the distinction between « generative » and « discriminative » classifiers is misleading: they are all discriminative;
(ii)
Second, we show “discriminative” computation of MPM and MAP in HMMs is not intrinsic to HMMs but is due to its particular classic parameterization Equation (1). In other words, changing the parametrization, it is possible to compute the HMM-based MPM and MAP without using p y 1 : N | x 1 : N or p y 1 : N .
The first point is rather immediate: we note that Bayesian classifier g L is defined by a loss function L : Ω 2 R + through
g L y 1 : N = x ^ 1 : N [ E L g L y 1 : N , X 1 : N y 1 : N = i n f x 1 : N E L x 1 : N , X 1 : N y 1 : N ]
it is thus immediate to notice that g L y 1 : N only depends on p x 1 : N | y 1 : N . This implies that it is the same in a generative model or its equivalent (within the meaning of Definition 1) discriminative model.
We show (ii) by separately considering the MPM and MAP cases.

4.1. Discriminative Computing of HMM-Based MPM

To show (ii), let us consider Equation (1) with p y n x n = p y n p x n | y n p x n . It becomes,
p x 1 : N , y 1 : N = p x 1 y 1 n = 2 T p x n x n 1 p x n | y n p x n n = 1 N p y n
We see that (Equation (21)) is of the form p x 1 : N , y 1 : N = h ( x 1 : N , y 1 : N ) n = 1 N p y n , where h ( x 1 : N , y 1 : N ) does not depend on p y 1 , …, p y N . This implies that h x n , y 1 : N = ( x 1 : n 1 , x n + 1 : N ) h x 1 : N , y 1 : N does not depend on p y 1 , …, p y N either. Then p x n | y 1 : N = p x n , y 1 : N p y 1 : N = h x n , y 1 : N n = 1 N p y n [ x n h x n , y 1 : N ] n = 1 N p y n = h x n , y 1 : N [ x n h x n , y 1 : N ] , so that
p x n | y 1 : N = ( x 1 : n 1 , x n + 1 : N ) p x 1 y 1 t = 2 N p x n x n 1 p x n | y n p x n ( x 1 : N ) p x 1 y 1 t = 2 N p x n x n 1 p x n | y n p x n
neither depends on p y 1 , …, p y N . Thus, the HMM-based classifier MPM also verifies the “discriminative classifier” definition.
How to compute p x n | y 1 : N ? It is classically computable using “forward” probabilities α n x n and “backward” ones β n x n defined with
α n x n = p x n , y 1 : n
β n x n = p y n + 1 : N | x n
then
p x n | y 1 : N = α n x n β n x n x n α n x n β n x n
with all α n x n and β n x n computed using the following forward and backward recursions [21]:
α 1 x 1 = p ( x 1 ) p y 1 x 1 ;   α n + 1 x n + 1 = x n p x n + 1 x n p y n + 1 x n + 1 α n ( x n )
β N x N = 1 ;   β n x n = x n + 1 p x n + 1 x n p y n + 1 x n + 1 β n + 1 x n + 1
Setting p y n x n = p y n p x n | y n p x n and recalling that p x n | y 1 : N does not depend on p y 1 , …, p y N , we can arbitrarily modify them. Let us consider the uniform distribution over Ω , so that p y 1 = = p y N = 1 # Ω = c . Then (Equation (26)) and (Equation (27)) become
α n + 1 * x n + 1 = x n p x n + 1 x n c p x n + 1 | y n + 1 p x n + 1 α n * x n
β n * x n = x n + 1 p x n + 1 x n c p x n + 1 | y n + 1 p x n + 1 β n + 1 * x n + 1
and we still have
p x n | y 1 : N = α n * x n β n * x n x n α n * β n * x n
Finally, we see that p x n | y 1 : N is independent from c , so that we can take c = 1 . Then we can state the following proposition.
Proposition 2.
Let  X 1 , , X N , Y 1 , , Y N  be an HMM Equation (1). Let us define “discriminative forward” quantities  α 1 D x 1 , …, α N D x N , and “discriminative backward” ones  β 1 D x 1 , …, β N D x N by the following forward and backward recursions:
α 1 D x 1 = p x 1 y 1 ; α n + 1 D x n + 1 = x n p x n + 1 x n p x n + 1 | y n + 1 p x n + 1 α n D ( x n )
β N D x N = 1 ; β n D x n = x n + 1 p x n + 1 x n p x n + 1 | y n + 1 p x n + 1 β n + 1 D x n + 1
then
p x n | y 1 : N = α n D x n β n D x n x n α n D x n β n D x n
Consequently, we can compute the MPM classifier in a discriminative manner, only using  p ( x 1 ) , …, p ( x N ) , p x 2 x 1 , …, p x N x N 1 , and p x 1 y 1 , …, p x N y N .
Note that this result is similar to the result in [21], with a different proof.
Remark 1.
Let us notice that according to Equations (31) and (32), it is possible to compute  α n D x n and β n D x n  by a very slight adaptation of classic computing programs giving classic  α n x n = p x n , y 1 : n  and  β n x n = p y n + 1 : N | x n  with recursions (Equations (26) and (27)). All we have to do is to replace p y n + 1 x n + 1  with  p x n + 1 | y n + 1 p x n + 1 . Of course,  α n D x n p x n , y 1 : n  and β n D x n p y n + 1 : N | x n , but (Equation (33)) holds and thus  p x n | y 1 : N  is computable allowing MPM.
Remark 2.
We see that we can compute the MPM in HMM only using  p x 1 , …, p x N , p x 2 x 1 , …, p x N x N 1 , and  p x 1 | y 1 , …, p x N | y N . This means that in supervised classification, where we have a learn sample, we can use any parametrization to estimate them. For example, we can model them with logistic regression, as currently done in CRFs. It is of importance to note that such a parametrization is unusual; however, what is important is that the model remains the same.
Remark 3.
When using the MPM to deal with a concrete problem, Proposition 2 implies that talking about a comparison between LC-CRFs and HMMs is somewhat incorrect. Indeed, there is only one model. However, there are two different parameterizations, and this can produce two different results. Indeed, the estimation of the parameters can give two models of the same nature (posterior distribution of an HMM) but unequally situated with respect to the optimal model, the best suited to the data. It would therefore be more correct to speak of a comparison of two parametrizations of HMM, each associated with its own parameter estimator. The same is true concerning the MAP discussed in the next paragraph.

4.2. Discriminative Computing of HMM-Based Map: Discriminative Viterbi

Let X 1 , , X N , Y 1 , , Y N be an HMM (Equation (1)). The Bayesian MAP classifier (Equation (5)) based on such HMM is computed with the following Viterbi algorithm [22]. For each n = 1 , , N , and each x n , let x 1 : n 1 m a x x n = ( x 1 m a x , , x n 1 m a x ) ( x n ) be the path x 1 m a x , …, x n 1 m a x verifying
p ( x 1 : n 1 m a x x n , x n , y 1 : n ) = s u p x 1 : n 1 p ( x 1 : n 1 , x n , y 1 : n )
We see that x 1 : n 1 m a x x n is a path maximizing p ( x 1 : n 1 , x n | y 1 : n ) over all paths ending in x n . Then having the paths x 1 : n 1 m a x x n and the probabilities p ( x 1 : n 1 m a x x n , x n , y 1 : n ) for each x n , one determines, for each x n + 1 , the paths x 1 : n m a x x n + 1 and the probabilities p ( x 1 : n m a x x n + 1 , x n + 1 , y 1 : n + 1 ) = p x 1 : n 1 m a x x n m a x , x n m a x , x n + 1 , y 1 : n + 1 , searching x n m a x with
p x 1 : n 1 m a x x n m a x , x n m a x , x n + 1 , y 1 : n + 1 ) = sup x n [ p ( x 1 : n 1 m a x x n , x n , y 1 : n ) p ( x n + 1 | x n ) p ( y n + 1 | x n + 1 ) ]
Setting in Equation (35) p y n + 1 x n + 1 = p y n + 1 p x n + 1 | y n + 1 p x n + 1 , we see that x n m a x which verifies Equation (35) is the same that x n m a x which maximizes p ( x 1 : n 1 m a x x n , x n , y 1 : n ) p ( x n + 1 | x n ) p x n + 1 | y n + 1 p x n + 1 , so that we can suppress p y n + 1 . In other words, we can replace Equation (35) with
p x 1 : n 1 m a x x n m a x , x n m a x , x n + 1 , y 1 : n + 1 ) = sup x n [ p ( x 1 : n 1 m a x x n , x n , y 1 : n ) p ( x n + 1 | x n ) p x n + 1 | y n + 1 p x n + 1 ]
Finally, we propose the following discriminative version of the Viterbi algorithm:
-
Set x 1 m a x = argmax x n [ p x 1 y 1 ] ;
-
For each n = 1 , …, N 1 , and each x n + 1 , apply (3.17) to find a path x 1 : n m a x x n + 1 from the paths x 1 : n 1 m a x x n (for all x n ), and the probabilities p x 1 : n m a x x n + 1 , x n + 1 , y 1 : n + 1 (for all x n + 1 );
-
End setting x 1 : N m a x = argmax x N [ p x 1 : N 1 m a x x N , x N , y 1 : N ] .
As with MPM above, we see that we can find x 1 : N m a x with the only use of p ( x 1 ) , …, p ( x N ) , p x 2 x 1 , …, p x N x N 1 , and p x 1 y 1 , …, p x N y N , exactly as in CRF case. As above, it appears that dropping HMMs in some NLP tasks on the grounds that MAP is a “generative” classifier, is not justified. In particular, in supervised stationary framework, distributions p ( x n ) , p x n + 1 x n , and p x n y n can be estimated in the same way as in LC-CRFs case.

5. Discussion and Conclusions

We have proposed two results. First, we have shown that the basic LC-CRF (Equation (3)) is equivalent to a classical HMM (Equation (1)) in that one can find an HMM whose posterior distribution is exactly the given LC-CRF. More precisely, we specified the way to calculate the parameters p x 1 , p x n + 1 x n , and p y n x n which define Equation (1) from the parameters V n x n , x n + 1 , U n x n , y n which define Equation (3). Second, noting that all Bayesian classifiers are discriminative in the sense that they do not depend on the observation distribution, we showed that classifiers based on HMMs, usually considered as “generative”, can also be considered as discriminative classifiers. More specifically, we have proposed discriminative methods for computing classic maximum posterior mode (MPM) and maximum a posteriori (MAP) classifiers based on HMMs. The first result shows that LC-CRFs are as general as classical HMMs. The second shows that at the application level, HMMs offer strictly the same processing power, at least with regard to MPM and MAP, as LC-CRFs.
The practical interest of our contributions is as follows. Until now, some authors considered LC-CRFs and HMMs to be equivalent without presenting rigorous proof, and others considered LC-CRFs to be more general [19]. Partly because of this uncertainty, CRFs have often been preferred over HMMs. We have proved, at least in the particular framework considered, that the two models were equivalent, and the abandonment of HMMs in favor of CRFs was not always justified. In other words, faced with a particular application, there is no reason to choose CRFs systematically. However, we also cannot say that HMMs should be chosen. Finally, our contribution is likely to encourage practitioners to consider both models, or rather, according to Remark 3, both parametrizations of the same model, on an equal footing.
We considered basic LC-CRFs and HMMs, which are a limited, yet widely used framework. LC-CRFs and HMMs can be extended in different directions. The general question to study is to search an extension of HMM which would be equivalent to the general CRF (Equation (2)). This seems to be a hard problem. However, one can study some existing extensions of HMMs and wonder what kind of CRFs they would be equivalent to. For example, HMMs have been extended to “pairwise Markov models” (PMMs [23]), and the question is therefore what kind of CRFs would be equivalent to PMMs? Another kind of extension consists of adding a latent process U 1 : N to the pair ( X 1 : N , Y 1 : N ) . In the case of CRFs this leads to hidden CRFs (HCRFs [24]), and in the case of HMMs this leads to triplet Markov models (TMMs [25]). Finally, CRFs were generalized to semi-Markov CRFs [26], and HMMs were generalized to hidden semi-Markov models, with explicit distribution of the exact sojourn time in a given state [27], or with explicit distribution of the minimal sojourn time in a given state [28]. The study of the relations between these different extensions of CRFs and HMMs is an interesting perspective for further investigations.

Author Contributions

Conceptualization, E.A., E.M., and W.P.; methodology, E.A., E.M., and W.P.; validation, E.A., E.M., and W.P.; investigation, E.A., E.M., and W.P.; writing—original draft preparation, W.P.; writing—review and editing, E.A., E.M., and W.P; supervision, W.P.; project administration, W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the French Government Agency ASSOCIATION NATIONALE RECHERCHE TECHNOLOGIES (ANRT), as part of the CIFRE (“Conventions industrielles de formation par la recherché”) scholarship.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stratonovich, R.L. Conditional Markov Processes. In Non-Linear Transformations of Stochastic Processes; Pergamon Press: Oxford, UK, 1965; pp. 427–453. [Google Scholar]
  2. Baum, L.E.; Petrie, T. Statistical Inference for Probabilistic Functions of Finite state Markov Chains. Ann. Math. Stat. 1966, 37, 1554–1563. [Google Scholar] [CrossRef]
  3. Rabiner, L.; Juang, B. An Introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3, 4–16. [Google Scholar] [CrossRef]
  4. Ephraim, Y. Hidden Markov Processes. IEEE Trans. Inf. Theory 2002, 48, 1518–1569. [Google Scholar] [CrossRef] [Green Version]
  5. Cappé, O.; Moulines, E.; Ryden, T. Inference in Hidden Markov Models; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  6. Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001. [Google Scholar]
  7. Sutton, C.; McCallum, A. An Introduction to Conditional Random Fields. Found. Trends Mach. Learn. 2012, 4, 267–373. [Google Scholar] [CrossRef]
  8. Jurafsky, D.; Martin, J.H. Speech and Language Processing; Kindle edition; Prentice Hall Series in Artificial Intelligence; Prentice Hall: Hoboken, NJ, USA, 2014. [Google Scholar]
  9. Ng, A.; Jordan, M. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Adv. Neural Inf. Process. Syst. 2001, 14, 841–848. [Google Scholar]
  10. He, H.; Liu, Z.; Jiao, R.; Yan, G. A Novel Nonintrusive Load Monitoring Approach based on Linear-Chain Conditional Random Fields. Energies 2019, 12, 1797. [Google Scholar] [CrossRef] [Green Version]
  11. Condori, G.C.; Castro-Gutierrez, E.; Casas, L.A. Virtual Rehabilitation Using Sequential Learning Algorithms. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 639–645. [Google Scholar] [CrossRef] [Green Version]
  12. Fang, M.; Kodamana, H.; Huang, B.; Sammaknejad, N. A Novel Approach to Process Operating Mode Diagnosis using Conditional Random Fields in the Presence of Missing Data. Comput. Chem. Eng. 2018, 111, 149–163. [Google Scholar] [CrossRef]
  13. Saa, J.F.D.; Cetin, M. A Latent Discriminative model-based Approach for Classification of Imaginary Motor tasks from EEG data. J. Neural Eng. 2012, 9, 026020. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  14. Azeraf, E.; Monfrini, E.; Pieczynski, W. On Equivalence between Linear-Chain Conditional Random Fields and Hidden Markov Chains. In Proceedings of the International Conference on Agents and Artificial Intelligence, Virtual, 3–5 February 2022. [Google Scholar]
  15. Liliana, D.Y.; Basaruddin, C. A Review on Conditional Random Fields as a Sequential Classifier in Machine Learning. In Proceedings of the International Conference on Electrical Engineering and Computer Science (ICECOS), Palembang, Indonesia, 22–23 August 2017; pp. 143–148. [Google Scholar]
  16. Ayogu, I.I.; Adetunmbi, A.O.; Ojokoh, B.A.; Oluwadare, S.A. A Comparative Study of Hidden Markov Model and Conditional Random Fields on a Yorùba Part-of-Speech Tagging task. In Proceedings of the IEEE International Conference on Computing Networking and Informatics (ICCNI), Lagos, Nigeria, 29–31 October 2017; pp. 1–6. [Google Scholar]
  17. McCallum, A.; Freitag, D.; Pereira, F.C. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML 2000, 17, 591–598. [Google Scholar]
  18. Song, S.L.; Zhang, N.; Huang, H.T. Named Entity Recognition based on Conditional Random Fields. Clust. Comput. J. Netw. Softw. Tools Appl. 2019, 22, S5195–S5206. [Google Scholar] [CrossRef]
  19. Heigold, G.; Ney, H.; Lehnen, P.; Gass, T.; Schluter, R. Equivalence of Generative and Log-Linear Models. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 1138–1148. [Google Scholar] [CrossRef]
  20. Azeraf, E.; Monfrini, E.; Vignon, E.; Pieczynski, W. Hidden Markov Chains, Entropic Forward-Backward, and Part-Of-Speech Tagging. arXiv 2020, arXiv:2005.10629. [Google Scholar]
  21. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Ann. Math. Stat. 1970, 41, 164–171. [Google Scholar] [CrossRef]
  22. Viterbi, A. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269. [Google Scholar] [CrossRef] [Green Version]
  23. Pieczynski, W. Pairwise Markov Chains. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 634–639. [Google Scholar] [CrossRef] [Green Version]
  24. Quattoni, A.; Wang, S.B.; Morency, L.-P.; Collins, M.; Darrell, T. Hidden Conditional Random Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1848–1852. [Google Scholar] [CrossRef] [PubMed]
  25. Pieczynski, W.; Hulard, C.; Veit, T. Triplet Markov Chains in hidden signal restoration. In Proceedings of the SPIE’s International Symposium on Remote Sensing, Crete, Greece, 22–27 September 2002. [Google Scholar]
  26. Sarawagi, S.; Cohen, W. Semi-Markov conditional random fields for information extraction. Adv. Neural Inf. Process. Syst. 2004, 17. [Google Scholar]
  27. Yu, S.-Z. Hidden semi-Markov models. Artif. Intell. 2010, 174, 215–243. [Google Scholar] [CrossRef] [Green Version]
  28. Li, H.; Derrode, S.; Pieczynski, W. Adaptive on-line lower limb locomotion activity recognition of healthy individuals using semi-Markov model and single wearable inertial sensor. Sensors 2019, 19, 4242. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Azeraf, E.; Monfrini, E.; Pieczynski, W. Equivalence between LC-CRF and HMM, and Discriminative Computing of HMM-Based MPM and MAP. Algorithms 2023, 16, 173. https://doi.org/10.3390/a16030173

AMA Style

Azeraf E, Monfrini E, Pieczynski W. Equivalence between LC-CRF and HMM, and Discriminative Computing of HMM-Based MPM and MAP. Algorithms. 2023; 16(3):173. https://doi.org/10.3390/a16030173

Chicago/Turabian Style

Azeraf, Elie, Emmanuel Monfrini, and Wojciech Pieczynski. 2023. "Equivalence between LC-CRF and HMM, and Discriminative Computing of HMM-Based MPM and MAP" Algorithms 16, no. 3: 173. https://doi.org/10.3390/a16030173

APA Style

Azeraf, E., Monfrini, E., & Pieczynski, W. (2023). Equivalence between LC-CRF and HMM, and Discriminative Computing of HMM-Based MPM and MAP. Algorithms, 16(3), 173. https://doi.org/10.3390/a16030173

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop