Article

Markov Observation Models and Deepfakes

by
Michael A. Kouritzin
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
Mathematics 2025, 13(13), 2128; https://doi.org/10.3390/math13132128
Submission received: 31 May 2025 / Revised: 21 June 2025 / Accepted: 26 June 2025 / Published: 29 June 2025

Abstract

Herein, expanded Hidden Markov Models (HMMs) are considered as potential deepfake generation and detection tools. The most specific model is the HMM, while the most general is the pairwise Markov chain (PMC). In between, the Markov observation model (MOM) is proposed, where the observations form a Markov chain conditionally on the hidden state. An expectation-maximization (EM) analog to the Baum–Welch algorithm is developed to estimate the transition probabilities as well as the initial hidden-state-observation joint distribution for all the models considered. This new EM algorithm also includes a recursive log-likelihood equation so that model selection can be performed (after parameter convergence). Once models have been learnt through the EM algorithm, deepfakes are generated through simulation, while they are detected using the log-likelihood. Our three models were compared empirically in terms of their generative and detective ability. PMC and MOM consistently produced the best deepfake generator and detector, respectively.

1. Introduction

Hidden Markov Models (HMMs) were introduced in papers by Baum and Petrie [1] and Baum and Eagon [2]. Traditional HMMs have enjoyed tremendous modelling success in applications like computational finance (see e.g., Petropoulos et al. [3]), single-molecule kinetic analysis (see Nicolai [4]), animal tracking (see Sidrow et al. [5]), forecasting commodity futures (see Date et al. [6]) and protein folding (see Stigler et al. [7]). The unobservable hidden HMM states $X$ form a discrete-time Markov chain, and the observation process $Y$ is some distorted, corrupted, partial information or measurement of the current state of $X$, satisfying the following condition:
$$P(Y_n \in A \mid X_n, X_{n-1}, \ldots, X_1) = P(Y_n \in A \mid X_n).$$
These emission probabilities, $P(Y_n \in A \mid X_n)$, have a conditional probability mass function $y \mapsto b_{X_n}(y)$.
Perhaps the most common challenges in HMMs are calibrating the model, decoding the hidden sequence from the observation sequence and real-time belief propagation, i.e., filtering. The first problem is solved recursively in the HMM setting by the Baum–Welch re-estimation algorithm, an application of the Expectation-Maximization (EM) algorithm that actually predates the general EM algorithm. The second, decoding problem is solved by the Viterbi algorithm (see Viterbi [8], Rabiner [9], Shinghal and Toussaint [10]), which is a dynamic programming algorithm. The filtering problem is also solved effectively after calibration using a recursive algorithm that is similar to part of the Baum–Welch algorithm. In practice, there can be numeric problems, like a multitude of local maxima to trap the Baum–Welch algorithm, or inefficient matrix operations when the state size is large but the hidden state resides in a small subset most of the time. In these cases, it can be advisable to use particle filters or other alternative methods, which are not the subject of this paper (see instead Cappé et al. [11] for more information). The forward and backward propagation probabilities of the Baum–Welch algorithm also tend to become very small over time, a phenomenon known as the small-number problem. While satisfactory results can sometimes be obtained by (often logarithmic) rescaling, this small-number problem is still a severe limitation of the Baum–Welch algorithm. However, the independent emission form of the observation modelling undertaken in HMMs can be even more fundamentally limiting.
The autoregressive HMM (AR-HMM) and, more generally, the pairwise Markov chain (PMC) were introduced to allow more extensive and practical observation models. For the AR-HMM, the observations take the following structure:
$$Y_n = \beta_0(X_n) + \beta_1(X_n)\,Y_{n-1} + \cdots + \beta_p(X_n)\,Y_{n-p} + \varepsilon_n, \tag{1}$$
where $\{\varepsilon_n\}_{n=1}^{\infty}$ is a (usually zero-mean Gaussian) i.i.d. sequence of random variables, and the autoregressive coefficients are functions of the current hidden state $X_n$. The AR-HMM has experienced strong success in applications like speech recognition (see Bryan and Levinson [12]), the diagnosis of blood infections (see Stanculescu et al. [13]) and the study of climate patterns (see Xuan [14]). One advantage of the AR-HMM is that the Baum–Welch algorithm can still be used (see Bryan and Levinson [12]).
The general PMC model from Pieczynski [15] only assumes that ( X , Y ) is jointly Markov. Derrode and Pieczynski [16], Derrode and Pieczynski [17] and Kuljus and Lember [18] explain the generality of the PMC and give some interesting subclasses of this model. It is now well understood how to filter and decode PMCs. In fact, Kuljus and Lember [18] solve the decoding problem in great generality, while Derrode and Pieczynski [17] use Baum–Welch-like recursions to produce the filter. Both Derrode and Pieczynski [16] and Derrode and Pieczynski [17] assume reversibility of the PMC and have the observations living in a continuous space. To our knowledge, the Baum–Welch rate re-estimation algorithm has not been validated in general for PMCs. Our first goal is to develop and validate this Baum–Welch algorithm for PMCs, while at the same time estimating hidden initial states and overcoming the small-number problem mentioned above by using alternative variables in our forward and backward recursions. Our resulting EM algorithm will apply to many big data problems.
Our second goal is to show the applicability of HMMs and PMCs, as well as a model called the Markov Observation Model (MOM), which falls part way between HMMs and PMCs in deepfake detection and generation. The key to producing and detecting deepfakes is to bring in an element that is easily calculated, yet often overlooked in HMMs and PMCs: likelihood. During training, as well as during detection, likelihood can be used in the place of the discriminator in a Generative Adversarial Network (GAN), while simulation plays the part of the generator. Naturally, the expectation-maximization algorithm also plays a key role in this deepfake application, as explained below.
Our third goal is subtler. Just because the PMC model is more general than the HMM, and the Baum–Welch algorithm can be extended to learn either model, does not mean one should pronounce the death of the HMM. The problem is that the additional generality leads, in general, to a more complicated likelihood with a multitude of maxima for the EM algorithm to become trapped in or choose from. It can become a virtually impossible task to learn a global, or even a useful, maximum. Hence, the performance of the PMC model as a hidden Markov structure can be sub-optimal compared to the performance of the HMM or MOM, as we shall show empirically. Alternatively, the global maximum of the PMC may not be what is wanted. For these reasons, we promote the MOM and, in fact, show that it performs the best in simple deepfake detection, while the PMC generates the best deepfakes.
The HMM and nonlinear filtering theory (NFT) can each be thought of as a nonlinear generalization of the Kalman filter (see Kalman [19], Kalman and Bucy [20]). The recent analogues (see [21]) of the celebrated Fujisaki–Kallianpur–Kunita and the Duncan–Mortensen–Zakai equations (see [22,23,24,25,26] for some original and general results) of NFT to continuous-time Markov chain observations provide further evidence of the closeness of the HMM and NFT. The hidden state, called the signal in NFT, can be a general Markov process model and live in a general state space, but there is no universal EM algorithm for identifying the model, like the Baum–Welch algorithm, nor a dynamic programming algorithm for identifying a most likely hidden-state path, like the Viterbi algorithm. Rather, the goals in NFT are usually to compute filters, predictors and smoothers, for which there are no exact closed-form solutions, except in isolated cases (see [27]), and approximations have to be used. Like HMMs, nonlinear filtering has enjoyed widespread application. For instance, the subfield of nonlinear particle filtering, also known as sequential Monte Carlo, has a number of powerful algorithms (see Elfring [28], Pitt and Shephard [29], Del Moral et al. [30], Kouritzin [31], Chopin and Papaspiliopoulos [32]) and has been applied to numerous problems in areas like Bayesian inference (Chopin [33], Kloek and van Dijk [34], van Dijk and Kloek [35]), bioinformatics (Hajiramezanali et al. [36]), economics and mathematical finance (Creal [37]), intracellular movement (Maroulas and Nebenführ [38]), fault detection (D'Amato et al. [39]), pharmacokinetics (Bonate [40]), geosciences (Van Leeuwen et al. [41]), and many other fields. Still, like in HMMs, the observations in nonlinear filter models are largely limited to distorted, corrupted, partial observations of the signal. NFT is used successfully in deepfake generation and detection herein. However, the simplicity of the EM and likelihood algorithms for HMMs, MOMs and PMCs is a compelling advantage in the deepfake application, and likely also in some of these other applications of NFT.
The layout of this paper is as follows: In the next section, we explain the models, in particular the Markov observation models, and how they can be simulated. In Section 3 the filter and likelihood calculations are derived. In Section 4, EM techniques are used to derive an analog to the Baum–Welch algorithm for identifying the system (probability) parameters. In particular, joint recursive formulas for the hidden-state and observation transition probabilities, as well as the initial hidden-state-observation joint distribution, are derived. Section 5 contains our deepfake application and results. Section 6 is devoted to connecting the limit points of the EM-type algorithm to the maxima of the conditional likelihood, given the observations. Finally, Section 7 clarifies our contributions, makes our most basic conclusions and suggests some future work the author hopes will be undertaken.

2. Models and Simulation

Let $N \in \mathbb{N}$ be some final time. We first clarify the HMM assumption of independent emission probabilities.
Under the HMM,
$$P(Y_1 = y_1, \ldots, Y_N = y_N \mid \{X_i\}_{i=1}^{N}) = \prod_{i=1}^{N} b_{X_i}(y_i) \quad \forall\, y_i, \tag{2}$$
where $y \mapsto b_x(y)$ is a probability mass function for each $x$. Otherwise, the HMM and PMC are explained elsewhere.
Next, we explain how the MOM generalizes the HMM and fits into the PMC. Suppose $O$ is some discrete observation space. In the MOM, like in the HMM, the hidden state is a homogeneous Markov chain $X$ on some discrete (finite or countable) state space $E$ with one-step transition probabilities $p_{x\hat{x}}$ for $x, \hat{x} \in E$. Contrary to the HMM, the MOM allows self-dependence in the observations (this is illustrated by rightward arrows between the $Y$s in Figure 1). In particular, MOM observations $Y$ are a (conditional) Markov chain, given the hidden state, with the following transition probabilities:
$$P(Y_{n+1} = y \mid \{X_i = x_i\}_{i=0}^{n+1}, \{Y_j = y_j\}_{j=0}^{n}) = q_{y_n y}(x_{n+1}), \quad x_0, \ldots, x_{n+1} \in E;\; y, y_n \in O. \tag{3}$$
These do not affect the hidden-state transitions, in the sense that
$$P(X_{n+1} = \hat{x} \mid X_n = x, \{X_i\}_{i<n}, \{Y_j\}_{j\le n}) = p_{x\hat{x}}, \quad x, \hat{x} \in E,\; n \in \mathbb{N}_0. \tag{4}$$
Still, (3) implies that
$$P(Y_{n+1} = y \mid \{X_i\}_{i=0}^{n+1}, \{Y_j\}_{j\le n}) = P(Y_{n+1} = y \mid X_{n+1}, Y_n), \quad y \in O, \tag{5}$$
i.e., that the new observation only depends upon the new hidden state (as well as the past observation). Equations (3) and (4) imply that the hidden-state observation pair $(X, Y)$ is jointly Markov with joint one-step transition probabilities:
$$P(X_{n+1} = x, Y_{n+1} = y \mid X_n = x_n, Y_n = y_n) = p_{x_n x}\, q_{y_n y}(x), \quad x, x_n \in E;\; y, y_n \in O. \tag{6}$$
The joint Markov property then implies that
$$P(X_{n+1} = x, Y_{n+1} = y \mid X_1 = x_1, Y_1 = y_1, X_2 = x_2, Y_2 = y_2, \ldots, X_n = x_n, Y_n = y_n) = p_{x_n x}\, q_{y_n y}(x).$$
Notice that this generalizes the emission probability to
$$P(Y_n = y \mid X_n, X_{n-1}, \ldots, X_1; Y_{n-1}, \ldots, Y_1) = P(Y_n = y \mid Y_{n-1}, X_n) = q_{Y_{n-1} y}(X_n),$$
so the MOM generalizes the HMM by just taking $q_{Y_{n-1} y}(X_n) = b_{X_n}(y)$, a state-dependent probability mass function. To see how the MOM is related to the AR-HMM, we rewrite (1) as
$$\underbrace{\begin{pmatrix} Y_n \\ Y_{n-1} \\ Y_{n-2} \\ \vdots \\ Y_{n-p+1} \end{pmatrix}}_{\mathbf{Y}_n} = \begin{pmatrix} \beta_1(X_n) & \beta_2(X_n) & \beta_3(X_n) & \cdots & \beta_p(X_n) \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} \underbrace{\begin{pmatrix} Y_{n-1} \\ Y_{n-2} \\ Y_{n-3} \\ \vdots \\ Y_{n-p} \end{pmatrix}}_{\mathbf{Y}_{n-1}} + \begin{pmatrix} \beta_0(X_n) + \varepsilon_n \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \tag{7}$$
which, given the hidden state $X_n$, gives an explicit formula for $\mathbf{Y}_n$ in terms of only $\mathbf{Y}_{n-1}$ and some independent noise $\varepsilon_n$. Hence, $\{\mathbf{Y}_n\}$ is obviously conditionally Markov, and $\{(X_n, \mathbf{Y}_n)\}$ is an MOM. We have not claimed that this subsumes the AR-HMM yet, because $\{\varepsilon_n\}$ is usually Gaussian in the AR-HMM, and we handle the case of discrete noise herein. This will be further discussed in Section 7.
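To make the companion-form rewrite (7) concrete, here is a minimal numeric sketch in Python (the function name and the coefficient values are hypothetical, chosen only for illustration): stacking the last p observations turns the order-p autoregression into a first-order recursion, which is Markov given the hidden state.

```python
# Toy illustration of the companion-form rewrite (7) with p = 2.
import numpy as np

def ar_companion_step(Y_stack, beta, eps):
    """One step of (7): maps (Y_{n-1}, ..., Y_{n-p}) to (Y_n, ..., Y_{n-p+1})."""
    p = len(Y_stack)
    A = np.vstack([beta[1:], np.eye(p)[:-1]])  # companion matrix of (7)
    shift = np.zeros(p)
    shift[0] = beta[0] + eps                   # intercept plus noise enter on top
    return A @ Y_stack + shift

Y_stack = np.array([0.5, -0.2])                # (Y_{n-1}, Y_{n-2})
beta = np.array([0.1, 0.6, 0.3])               # (beta_0, beta_1, beta_2) for one hidden state
print(ar_companion_step(Y_stack, beta, eps=0.05))  # -> [0.39, 0.5]
```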
A subtlety that arises with the MOM over the HMM is that we need an enlarged initial distribution, since we have a $Y_0$ that is not observed (see Figure 1). Rather, we think of starting up the observation process at time 1, even though there were observations to be made prior to this time. Further, since we generally do not know the model parameters, we need a means to estimate this initial distribution
$$P(X_0 \in dx_0, Y_0 \in dy_0) = \mu(dx_0, dy_0).$$
It is worth noting that the MOM resembles the stationary PMC under Condition (H) in Pieczynski [15], which forces the hidden state to be Markov by Proposition 2.2 of Pieczynski [15].

Simulation

Any PMC is characterized by an initial distribution $\mu$ on $E \times O$ and a joint transition probability $p_{x,y}^{\hat{x},\hat{y}}$ for its hidden state and observations. In particular,
$$p_{x,y}^{\hat{x},\hat{y}} = p_{x\hat{x}}\, q_{y\hat{y}}(\hat{x}) \tag{8}$$
for the MOM and
$$p_{x,y}^{\hat{x},\hat{y}} = p_{x\hat{x}}\, b_{\hat{x}}(\hat{y}) \tag{9}$$
for the HMM. In any case, the marginal transitions are denoted as
$$p_{x,y}^{\hat{x}} = \sum_{\hat{y}} p_{x,y}^{\hat{x},\hat{y}} \qquad \text{and} \qquad p_{x,y}^{\hat{y}} = \sum_{\hat{x}} p_{x,y}^{\hat{x},\hat{y}}. \tag{10}$$
Together, $(\mu, p)$ characterize a $(\mu, p)$-PMC. The initial distribution $\mu$ gives the distribution of $(X_0, Y_0)$ for the MOM and PMC, while an initial distribution $\mu_X$ gives the distribution of $X_1$ for the HMM by convention. This convention makes sense since the MOM and PMC have observation history to model in some unknown $Y_0$. In the case of the HMM, an initial $(X_1, Y_1)$ can then be drawn from $\mu(x,y) = \mu_X(x)\, b_x(y)$.
The simulation of the HMM, MOM and PMC observations is performed in the same way: Begin by drawing $(X_0, Y_0)$ ($(X_1, Y_1)$ for the HMM) from $\mu$, continue the simulation using $p_{x,y}^{\hat{x},\hat{y}}$, and then finally throw out the hidden state $X$ (as well as $Y_0$ for the MOM and PMC) to leave the observation process $Y$.
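For concreteness, the following is a minimal simulation sketch in Python (the paper presents no code, so the integer coding of E and O, the array conventions and the function names here are our own assumptions): `mu` is an (e, o) array holding the joint law of $(X_0, Y_0)$ and `P` is an (e, o, e, o) array holding $p_{x,y}^{\hat{x},\hat{y}}$.

```python
# A minimal PMC simulation sketch; states and observations are integer-coded.
import numpy as np

def simulate_pmc(mu, P, N, rng=None):
    """Draw (X_0, Y_0) from mu, take N joint steps with P, return only Y_1..Y_N."""
    if rng is None:
        rng = np.random.default_rng()
    e, o = mu.shape
    idx = rng.choice(e * o, p=mu.ravel())           # initial (X_0, Y_0) from mu
    x, y = divmod(idx, o)
    ys = []
    for _ in range(N):
        idx = rng.choice(e * o, p=P[x, y].ravel())  # joint step (x, y) -> (x', y')
        x, y = divmod(idx, o)
        ys.append(y)
    return np.array(ys)                             # hidden path and Y_0 discarded

def mom_joint(p_xx, q):
    """Assemble the MOM joint transitions (8): P[x, y, x', y'] = p_{x x'} q_{y y'}(x')."""
    return np.einsum('ab,cdb->acbd', p_xx, q)       # q[y, y', x'] = q_{y y'}(x')
```

The HMM version would instead assemble $P[x, y, \hat{x}, \hat{y}] = p_{x\hat{x}}\, b_{\hat{x}}(\hat{y})$ as in (9) and draw $(X_1, Y_1)$ from $\mu_X$ and $b$.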

3. Likelihood, Filter and Predictor

A PMC is parameterized by its initial distribution $\mu$ and joint transition probability $p$ for its hidden state and observations. Its ability to fit a given sequence of observations $Y_1, \ldots, Y_n$ up to time $n$ is naturally judged by its likelihood:
$$L_n = L_n(\mu, p) = P(Y_1, \ldots, Y_n) = P^{\mu,p}(Y_1, \ldots, Y_n) \quad \text{for all } n \ge 1, \text{ with } L_0 = 1. \tag{11}$$
Here, $P^{\mu,p}$ is a probability measure under which $(X, Y)$ is a $(\mu, p)$-PMC. Therefore, given several PMC models $(\mu^1, p^1), \ldots, (\mu^m, p^m)$, perhaps found by different runs of an expectation-maximization algorithm, as well as an observation data sequence $Y_1, \ldots, Y_N$, one can use the likelihoods $\{L_n(\mu^i, p^i)\}_{i=1}^{m}$ to judge which model best fits the data. Each run of the EM algorithm would converge to a local maximum of the likelihood function, and then the likelihood function could be used to determine which of these produces a higher maximum. Since the MOM and HMM are PMCs (with specific $p$ given in (8) and (9)), this test extends to judging the best MOM and best HMM.
In applications like filtering, the hidden state has significance, and estimating (the distribution of) it is important. The (optimal) filter is the (conditional) hidden-state probability mass function
$$\pi_n(x) \doteq P(X_n = x \mid Y_1, \ldots, Y_n), \quad x \in E,\; n \ge 1. \tag{12}$$
We first work with the PMC, and then extract the MOM and HMM from these calculations. The likelihood and filter can be computed together in real time using the forward probability
$$\alpha_0(x, y) = P(Y_0 = y, X_0 = x), \qquad \alpha_n(x) = P(Y_1, \ldots, Y_n, X_n = x), \quad 1 \le n \le N, \tag{13}$$
which is motivated by the Baum–Welch algorithm. Then, it follows from (11)–(13) that
$$\pi_n(x) = \frac{\alpha_n(x)}{\sum_{\xi} \alpha_n(\xi)} = \frac{\alpha_n(x)}{L_n}, \quad \text{so} \quad L_n = \sum_{\xi} \alpha_n(\xi) \;\; \forall\, n \ge 1, \quad \text{and} \quad \pi_0(x, y) = \alpha_0(x, y). \tag{14}$$
Moreover, we obtain, based on the multiplication rule, the joint Markov property and (13), the following:
$$\begin{aligned} \alpha_n(x) &= P(Y_1, \ldots, Y_n, X_n = x) = \sum_{x_{n-1}} P(Y_1, \ldots, Y_n, X_{n-1} = x_{n-1}, X_n = x) \\ &= \sum_{x_{n-1}} P(Y_1, \ldots, Y_{n-1}, X_{n-1} = x_{n-1})\, P(X_n = x, Y_n \mid Y_1, \ldots, Y_{n-1}, X_{n-1} = x_{n-1}) \\ &= \sum_{x_{n-1}} \alpha_{n-1}(x_{n-1})\, p_{x_{n-1}, Y_{n-1}}^{x, Y_n}, \end{aligned} \tag{15}$$
which can be solved recursively for $n = 2, 3, \ldots, N-1, N$, starting (according to (13)) at
$$\alpha_1(x_1) = \sum_{x_0} \sum_{y_0} \mu(x_0, y_0)\, p_{x_0, y_0}^{x_1, Y_1}. \tag{16}$$
Recall that $\alpha_0 = \mu$ is assigned differently. On a computer, we do not recurse $\alpha_n$, due to the risk of underflow (the small-number problem), but rather we revert back to the filter $\pi_n$. Using (14) and (15), one finds that the forward recursion for $\pi$ is
$$\rho_n(x) = \sum_{x_{n-1}} \pi_{n-1}(x_{n-1})\, p_{x_{n-1}, Y_{n-1}}^{x, Y_n}, \qquad \pi_n(x) = \frac{\rho_n(x)}{a_n}, \qquad a_n = \sum_{x_n} \rho_n(x_n), \tag{17}$$
which can be solved forward for $n = 2, 3, \ldots, N-1, N$, starting at
$$\pi_1(x) = \frac{\sum_{x_0} \sum_{y_0} \mu(x_0, y_0)\, p_{x_0, y_0}^{x, Y_1}}{a_1}, \qquad a_1 = \sum_{x_1} \sum_{x_0} \sum_{y_0} \mu(x_0, y_0)\, p_{x_0, y_0}^{x_1, Y_1}. \tag{18}$$
This immediately implies that $L_1 = a_1$, and then, by using (14), (17) and induction, that
$$L_n = a_1 a_2 \cdots a_n, \quad \text{i.e.,} \quad L_n = L_{n-1}\, a_n, \quad L_0 = 1. \tag{19}$$
Thus, the filter and likelihood can be computed in real time (after initialization) via the recursions in (17) and (19).
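As an illustration, a normalized forward pass along the lines of (17)–(19) might look as follows (a sketch under the same array conventions as the simulation sketch in Section 2; the likelihood is accumulated in logs, which is exactly how the small-number problem is avoided):

```python
# Normalized forward filter and log-likelihood, following (17)-(19).
import numpy as np

def filter_loglik(mu, P, Y):
    """Return the filters pi_1..pi_N, the normalizers a_1..a_N and log L_N."""
    # Startup (18): rho_1(x) = sum_{x0, y0} mu(x0, y0) p_{x0,y0}^{x, Y_1}.
    rho = np.einsum('ab,abc->c', mu, P[:, :, :, Y[0]])
    a = rho.sum()
    pi, loglik = rho / a, np.log(a)
    pis, norms = [pi], [a]
    for n in range(1, len(Y)):
        rho = pi @ P[:, Y[n - 1], :, Y[n]]  # (17): propagate through the joint step
        a = rho.sum()
        pi = rho / a
        loglik += np.log(a)                 # (19): log L_n = log L_{n-1} + log a_n
        pis.append(pi)
        norms.append(a)
    return pis, norms, loglik
```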
Once the filter has been computed, predictors can also be computed using Chapman–Kolmogorov-type equations. For example, it follows from the multiplication rule and the Markov property that the one-step predictor is
$$\begin{aligned} P(Y_{n+1} = y_{n+1} \mid Y_1, \ldots, Y_n) &= \sum_{x_n, x_{n+1}} \frac{P(Y_{n+1} = y_{n+1}, X_{n+1} = x_{n+1}, X_n = x_n, Y_1, \ldots, Y_n)}{P(Y_1, \ldots, Y_n)} \\ &= \sum_{x_n, x_{n+1}} P(Y_{n+1} = y_{n+1}, X_{n+1} = x_{n+1} \mid X_n = x_n, Y_1, \ldots, Y_n)\, P(X_n = x_n \mid Y_1, \ldots, Y_n) \\ &= \sum_{x_n, x_{n+1}} p_{x_n, Y_n}^{x_{n+1}, y_{n+1}}\, \pi_n(x_n), \end{aligned} \tag{20}$$
which reduces to
$$P(Y_{n+1} = y_{n+1} \mid Y_1, \ldots, Y_n) = \sum_{x_n, x_{n+1}} p_{x_n x_{n+1}}\, q_{Y_n y_{n+1}}(x_{n+1})\, \pi_n(x_n) \tag{21}$$
and
$$P(Y_{n+1} = y_{n+1} \mid Y_1, \ldots, Y_n) = \sum_{x_n, x_{n+1}} p_{x_n x_{n+1}}\, b_{x_{n+1}}(y_{n+1})\, \pi_n(x_n), \tag{22}$$
respectively, in the cases of the MOM and HMM.
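As a usage sketch (reusing the hypothetical `filter_loglik` helper above), the one-step predictor (20) is a single tensor contraction of the current filter with the joint transitions:

```python
# One-step observation predictor (20): contract the filter with the joint
# transitions and sum out the next hidden state.
import numpy as np

def predict_next_obs(pi, P, y_last):
    """P(Y_{n+1} = y' | Y_1..Y_n) for each y', given the filter pi_n and Y_n = y_last."""
    return np.einsum('a,aby->y', pi, P[:, y_last])
```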
In non-real-time applications, we strengthen our hidden-state estimates to include future observations via the joint path filter
$$\Pi_{n-1,n}(x, \hat{x}) = P(X_{n-1} = x, X_n = \hat{x} \mid Y_1, \ldots, Y_N), \tag{23}$$
which is a joint pmf for $n = 2, \ldots, N$. To compute the joint path filter, we first let
$$\begin{aligned} \beta_0(x_0, x_1, y) &= P(Y_1, \ldots, Y_N \mid X_0 = x_0, X_1 = x_1, Y_0 = y), \\ \beta_n(x_n, x_{n+1}) &= P(Y_{n+1}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x_{n+1}, Y_n), \quad 0 < n < N-1, \\ \beta_{N-1}(x_{N-1}, x_N) &= P(Y_N \mid X_{N-1} = x_{N-1}, X_N = x_N, Y_{N-1}) = \frac{p_{x_{N-1}, Y_{N-1}}^{x_N, Y_N}}{p_{x_{N-1}, Y_{N-1}}^{x_N}}, \end{aligned} \tag{24}$$
where the last equality follows from the definition of conditional probability, and the normalized versions of $\beta$:
$$\chi_n(x, \hat{x}) = \frac{\beta_n(x, \hat{x})}{a_{n+1} \cdots a_N}, \quad n = 1, \ldots, N-1, \qquad \text{and} \qquad \chi_0(x, \hat{x}, y) = \frac{\beta_0(x, \hat{x}, y)}{a_1 \cdots a_N}. \tag{25}$$
Notice that we include an extra variable $y$ in $\alpha_0, \beta_0$. This is because we do not see the first observation $Y_0$, so we have to consider all possibilities and treat it like another hidden state. Then, based on (11), (13), the Markov property, (19) and (14), the following is obtained:
$$\begin{aligned} \Pi_{n-1,n}(x, \hat{x}) &= \frac{P(X_{n-1} = x, X_n = \hat{x}, Y_1, \ldots, Y_N)}{P(Y_1, \ldots, Y_N)} = \frac{\alpha_{n-1}(x)\, P(X_n = \hat{x}, Y_n, \ldots, Y_N \mid X_{n-1} = x, Y_1, \ldots, Y_{n-1})}{L_N} \\ &= \frac{\pi_{n-1}(x)\, P(X_n = \hat{x}, Y_n, \ldots, Y_N \mid X_{n-1} = x, Y_{n-1})}{a_n \cdots a_N}, \end{aligned} \tag{26}$$
so based on (24)–(26),
$$\begin{aligned} \Pi_{n-1,n}(x, \hat{x}) &= \frac{\pi_{n-1}(x)}{a_n \cdots a_N} \cdot \frac{P(X_n = \hat{x}, Y_n, \ldots, Y_N, X_{n-1} = x, Y_{n-1})}{P(X_n = \hat{x}, X_{n-1} = x, Y_{n-1})} \cdot \frac{P(X_n = \hat{x}, X_{n-1} = x, Y_{n-1})}{P(X_{n-1} = x, Y_{n-1})} \\ &= \frac{\pi_{n-1}(x)\, P(Y_n, \ldots, Y_N \mid X_n = \hat{x}, X_{n-1} = x, Y_{n-1})\, P(X_n = \hat{x} \mid X_{n-1} = x, Y_{n-1})}{a_n \cdots a_N} \\ &= \pi_{n-1}(x)\, \chi_{n-1}(x, \hat{x})\, p_{x, Y_{n-1}}^{\hat{x}} \end{aligned} \tag{27}$$
for $n = 2, 3, \ldots, N$. This means that there are two ways to compute the (marginal) path filter directly from (27):
$$\Pi_n(x) = P(X_n = x \mid Y_1, \ldots, Y_N) = \pi_n(x) \sum_{x_{n+1}} \chi_n(x, x_{n+1})\, p_{x, Y_n}^{x_{n+1}} \tag{28}$$
for $n = 1, 2, \ldots, N-1$, and
$$\Pi_n(x) = P(X_n = x \mid Y_1, \ldots, Y_N) = \sum_{x_{n-1}} \chi_{n-1}(x_{n-1}, x)\, p_{x_{n-1}, Y_{n-1}}^{x}\, \pi_{n-1}(x_{n-1}) \tag{29}$$
for $n = 2, 3, \ldots, N$. These all become computationally effective through a backward recursion for $\chi$. It also follows from (24), the definition of conditional probability, the Markov property, partitioning and our transition probabilities that
$$\begin{aligned} \beta_n(x_n, x) &= P(Y_{n+1}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x, Y_n) \\ &= P(Y_{n+2}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x, Y_{n+1}, Y_n)\, P(Y_{n+1} \mid X_n = x_n, X_{n+1} = x, Y_n) \\ &= P(Y_{n+2}, \ldots, Y_N \mid X_{n+1} = x, Y_{n+1})\, \frac{p_{x_n, Y_n}^{x, Y_{n+1}}}{p_{x_n, Y_n}^{x}} \\ &= \sum_{x' \in E} P(Y_{n+2}, \ldots, Y_N \mid X_{n+2} = x', X_{n+1} = x, Y_{n+1})\, P(X_{n+2} = x' \mid X_{n+1} = x, Y_{n+1})\, \frac{p_{x_n, Y_n}^{x, Y_{n+1}}}{p_{x_n, Y_n}^{x}} \\ &= \frac{p_{x_n, Y_n}^{x, Y_{n+1}}}{p_{x_n, Y_n}^{x}} \sum_{x'} \beta_{n+1}(x, x')\, p_{x, Y_{n+1}}^{x'}, \end{aligned} \tag{30}$$
so normalizing by (25), the following can be obtained:
$$\chi_n(x_n, x) = \frac{p_{x_n, Y_n}^{x, Y_{n+1}}}{a_{n+1}\, p_{x_n, Y_n}^{x}} \sum_{x'} \chi_{n+1}(x, x')\, p_{x, Y_{n+1}}^{x'}, \tag{31}$$
which can be solved backward for $n = N-1, N-2, \ldots, 3, 2, 1$, starting from
$$\chi_N(x_N, x_{N+1}) = 1. \tag{32}$$
The $n = 0$ values for $\pi$ and $\chi$ become
$$\chi_0(x_0, x_1, y) = \frac{p_{x_0, y}^{x_1, Y_1}}{a_1\, p_{x_0, y}^{x_1}} \sum_{x'} \chi_1(x_1, x')\, p_{x_1, Y_1}^{x'}, \tag{33}$$
$$\pi_0(x, y) = \alpha_0(x, y) = \mu(x, y) \tag{34}$$
to account for the fact that we do not see $Y_0$ as the data turns on at time 1. With $\chi_0$ in hand, we can estimate the joint distribution of $(X_0, Y_0)$, which are the remaining hidden variables. It follows from Bayes' rule, (11), (19), the multiplication rule, (24) and (25) that
$$\begin{aligned} \Pi_0(x, y) &= P(X_0 = x, Y_0 = y \mid Y_1, \ldots, Y_N) = \frac{P(Y_1, \ldots, Y_N \mid X_0 = x, Y_0 = y)\, P(X_0 = x, Y_0 = y)}{L_N} \\ &= \frac{\sum_{x_1} P(Y_1, \ldots, Y_N \mid X_1 = x_1, X_0 = x, Y_0 = y)\, P(X_1 = x_1 \mid X_0 = x, Y_0 = y)\, \mu(x, y)}{a_1 \cdots a_N} \\ &= \mu(x, y) \sum_{x_1} \chi_0(x, x_1, y)\, p_{x, y}^{x_1} \end{aligned} \tag{35}$$
for all $x \in E$, $y \in O$.
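Pulling the backward pass together, a sketch of (31), (32) and (28) under the same conventions might read as follows (for brevity it omits the time-0 quantities (33)–(35), and it assumes the relevant transition probabilities are positive so the division in (31) is safe):

```python
# Backward recursion (31)-(32) and marginal path filter (28).
import numpy as np

def smooth(P, Y, pis, norms):
    """Marginal smoothers Pi_1..Pi_N from the saved filters pis and normalizers norms."""
    N = len(Y)
    e = P.shape[0]
    chi = np.ones((e, e))                      # (32): chi_N = 1
    Pi = [None] * N
    Pi[N - 1] = pis[N - 1]                     # no future data, so Pi_N = pi_N
    for n in range(N - 1, 0, -1):              # math index n = N-1, ..., 1
        pmarg = P[:, Y[n - 1]].sum(axis=-1)    # marginal p_{x, Y_n}^{x'}
        back = (chi * P[:, Y[n]].sum(axis=-1)).sum(axis=1)
        chi = P[:, Y[n - 1], :, Y[n]] / (norms[n] * pmarg) * back  # (31)
        Pi[n - 1] = pis[n - 1] * (chi * pmarg).sum(axis=1)         # (28)
    return Pi
```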
The pathspace filter and likelihood algorithm is given in Algorithm 1.
Algorithm 1: Path filter and likelihood for PMC
(Algorithm 1 is presented as an image in the published version.)
    The first part of Algorithm 1, up to the first set of outputs, runs in real time, as the observations arrive, and provides the real-time filter and likelihood. For real-time applications, one would stop there, or else add predictors not included in Algorithm 1 but given as an example in (20). Otherwise, one can refine the estimates of the hidden states based on future observations, which then provides the pathspace filters and is the key to learning a model. This is the second part of Algorithm 1, and is explained below. But first, we note that the recursions developed so far are easily tuned to an MOM or HMM.

3.1. MOM Adjustments

For the MOM, we use (8). We leave (13), (14) and (19) unchanged, so (17) and (18) become
$$\rho_n(x) = q_{Y_{n-1} Y_n}(x) \sum_{x_{n-1} \in E} \pi_{n-1}(x_{n-1})\, p_{x_{n-1} x}, \qquad \pi_n(x) = \frac{\rho_n(x)}{a_n}, \qquad a_n = \sum_{x_n} \rho_n(x_n), \tag{36}$$
for all $x \in E$, which can be solved forward for $n = 2, 3, \ldots, N-1, N$, starting at
$$\pi_1(x) = \frac{\sum_{x_0} \sum_{y_0} \mu(x_0, y_0)\, p_{x_0 x}\, q_{y_0 Y_1}(x)}{a_1}, \qquad a_1 = \sum_{x_1} \sum_{x_0} \sum_{y_0} \mu(x_0, y_0)\, p_{x_0 x_1}\, q_{y_0 Y_1}(x_1). \tag{37}$$
The backward recursions change a little more, starting with (24) and (25), which change to
$$\begin{aligned} \beta_0(x_1, y) &= P(Y_1, \ldots, Y_N \mid X_1 = x_1, Y_0 = y), \\ \beta_n(x_{n+1}) &= P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}, Y_n), \quad 0 < n < N-1, \\ \beta_{N-1}(x_N) &= P(Y_N \mid X_N = x_N, Y_{N-1}) = q_{Y_{N-1} Y_N}(x_N), \end{aligned} \tag{38}$$
and the normalized versions
$$\chi_n(\hat{x}) = \frac{\beta_n(\hat{x})}{a_{n+1} \cdots a_N}, \quad n = 1, \ldots, N-1, \qquad \text{and} \qquad \chi_0(\hat{x}, y) = \frac{\beta_0(\hat{x}, y)}{a_1 \cdots a_N}, \tag{39}$$
since
$$P(Y_{n+1}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x_{n+1}, Y_n) = P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}, Y_n) \tag{40}$$
by Lemma 1 (to follow). Then, (27) becomes
$$\Pi_{n-1,n}(x, \hat{x}) = \pi_{n-1}(x)\, \chi_{n-1}(\hat{x})\, p_{x \hat{x}} \tag{41}$$
for $n = 2, 3, \ldots, N$. This then implies the obvious simplifications of (28) and (29) to
$$\Pi_n(x) = \pi_n(x) \sum_{x_{n+1}} \chi_n(x_{n+1})\, p_{x x_{n+1}} \qquad \text{and} \qquad \Pi_n(x) = \chi_{n-1}(x) \sum_{x_{n-1}} p_{x_{n-1} x}\, \pi_{n-1}(x_{n-1}) \tag{42}$$
for $n = 1, 2, \ldots, N-1$ and $n = 2, 3, \ldots, N$, respectively. Then, (31) becomes
$$\chi_n(x) = \frac{q_{Y_n Y_{n+1}}(x)}{a_{n+1}} \sum_{x'} \chi_{n+1}(x')\, p_{x x'} \tag{43}$$
by (5), which is solved backwards starting from $\chi_N(x_{N+1}) = 1$. The values at $n = 0$ become
$$\chi_0(x_1, y) = \frac{q_{y Y_1}(x_1)}{a_1} \sum_{x'} \chi_1(x')\, p_{x_1 x'}, \qquad \pi_0(x, y) = \mu(x, y) \tag{44}$$
and
$$\Pi_0(x, y) = \mu(x, y) \sum_{x_1} \chi_0(x_1, y)\, p_{x x_1} \tag{45}$$
for all $x \in E$, $y \in O$.

3.2. HMM Adjustments

For the HMM, we use (9). We have an MOM with the specific
$$q_{y \hat{y}}(\hat{x}) = b_{\hat{x}}(\hat{y})$$
that also starts at $n = 1$ with $\mu(x, y) = \mu_X(x)\, b_x(y)$, instead of $n = 0$. This creates modest changes or simplifications for the filter startup:
$$\rho_1(x) = b_x(Y_1)\, \mu_X(x), \qquad a_1 = \sum_{x} \rho_1(x), \qquad \pi_1(x) = \frac{\rho_1(x)}{a_1}. \tag{46}$$
But otherwise, (36) holds with just the substitution $q_{y \hat{y}}(\hat{x}) = b_{\hat{x}}(\hat{y})$.
To handle the backward recursion, we first reduce the general definition of $\beta$ in (24), using (2), to
$$\begin{aligned} \beta_n(x_{n+1}) &= P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}), \quad 0 < n < N-1, \\ \beta_{N-1}(x_N) &= P(Y_N \mid X_N = x_N) = b_{x_N}(Y_N), \end{aligned} \tag{47}$$
and the normalized versions
$$\chi_n(x) = \frac{\beta_n(x)}{a_{n+1} \cdots a_N}, \quad n = 1, \ldots, N-1. \tag{48}$$
There are no $\alpha_0, \pi_0, \beta_0$ or $\chi_0$ variables for the HMM. The HMM's backward-recursion simplifications are based on the following result.
Lemma 1.
For the MOM and the HMM,
$$P(Y_{n+1}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x_{n+1}, Y_n) = \begin{cases} P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}, Y_n) & \text{for the MOM}, \\ P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}) & \text{for the HMM}. \end{cases} \tag{49}$$
Proof. 
For the MOM, we have
$$\begin{aligned} \frac{P(Y_n, \ldots, Y_N, X_n = x_n, X_{n+1} = x_{n+1})}{P(Y_n, X_n = x_n, X_{n+1} = x_{n+1})} &= \frac{\sum_{x_{n+2}, \ldots, x_N} P(X_n = x_n, Y_n)\, p_{x_n x_{n+1}}\, q_{Y_n Y_{n+1}}(x_{n+1})\, p_{x_{n+1} x_{n+2}} \cdots p_{x_{N-1} x_N}\, q_{Y_{N-1} Y_N}(x_N)}{P(X_n = x_n, Y_n)\, p_{x_n x_{n+1}}} \\ &= \sum_{x_{n+2}, \ldots, x_N} q_{Y_n Y_{n+1}}(x_{n+1})\, p_{x_{n+1} x_{n+2}}\, q_{Y_{n+1} Y_{n+2}}(x_{n+2}) \cdots p_{x_{N-1} x_N}\, q_{Y_{N-1} Y_N}(x_N) \\ &= P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}, Y_n). \end{aligned} \tag{50}$$
In the case of the HMM, it follows from the multiplication rule, the tower property and (2) that
$$P(Y_{n+1}, \ldots, Y_N \mid X_n = x_n, X_{n+1} = x_{n+1}, Y_n) = \sum_{x_{n+2}, \ldots, x_N} b_{x_{n+1}}(Y_{n+1})\, p_{x_{n+1} x_{n+2}}\, b_{x_{n+2}}(Y_{n+2}) \cdots p_{x_{N-1} x_N}\, b_{x_N}(Y_N) = P(Y_{n+1}, \ldots, Y_N \mid X_{n+1} = x_{n+1}), \tag{51}$$
which establishes the desired dependence.    □
Finally, the initial probability estimate comes from Bayes' rule, (11), (24) and (25):
$$\Pi_1(x) = P(X_1 = x \mid Y_1, \ldots, Y_N) = \frac{P(Y_1, \ldots, Y_N \mid X_1 = x)\, P(X_1 = x)}{P(Y_1, \ldots, Y_N)} = \frac{\beta_1(x)\, \mu_X(x)}{L_N} = \chi_1(x)\, \mu_X(x). \tag{52}$$

4. Probability Estimation via EM Algorithm

In this section, we develop a recursive expectation-maximization algorithm that can be used to create convergent estimates for the transition and initial probabilities of our models. We leave the theoretical justification of convergence to Section 6.
The main goal of developing an EM algorithm is to find $p_{x,y}^{\hat{x},\hat{y}}$ for all $x, \hat{x} \in E$, $y, \hat{y} \in O$ and $\mu(x, y)$ for all $x \in E$, $y \in O$. Noting that every time step is considered to be a transition in a discrete-time Markov chain, we would ideally set the following:
$$p_{x,y}^{\hat{x},\hat{y}} = \frac{\text{expected transitions } (x,y) \text{ to } (\hat{x},\hat{y}) \text{ given observations}}{\text{expected occurrences of } (x,y) \text{ given observations}} = \frac{1_{Y_1 = \hat{y}}\, P(Y_0 = y, X_0 = x, X_1 = \hat{x} \mid Y_1, \ldots, Y_N) + \sum_{n=2}^{N} 1_{Y_{n-1} = y, Y_n = \hat{y}}\, P(X_{n-1} = x, X_n = \hat{x} \mid Y_1, \ldots, Y_N)}{P(Y_0 = y, X_0 = x \mid Y_1, \ldots, Y_N) + \sum_{n=2}^{N} 1_{Y_{n-1} = y}\, P(X_{n-1} = x \mid Y_1, \ldots, Y_N)}, \tag{53}$$
which means that we must compute $P(Y_0 = y, X_0 = x, X_1 = \hat{x} \mid Y_1, \ldots, Y_N)$, $P(Y_0 = y, X_0 = x \mid Y_1, \ldots, Y_N)$ and, using (23) and (28), $\Pi_n(x)$ for all $0 \le n \le N$ and $\Pi_{n-1,n}(x, \hat{x})$ for all $1 \le n \le N$, to get this transition probability estimate. Now, by Bayes' rule, ((11), (19)), ((24), (25)) and ((13), (14)), we obtain the following:
$$P(Y_0 = y, X_0 = x, X_1 = \hat{x} \mid Y_1, \ldots, Y_N) = \frac{P(Y_1, \ldots, Y_N \mid X_1 = \hat{x}, X_0 = x, Y_0 = y)\, P(X_1 = \hat{x}, X_0 = x, Y_0 = y)}{a_1 \cdots a_N} = \chi_0(x, \hat{x}, y)\, p_{x,y}^{\hat{x}}\, \pi_0(x, y), \tag{54}$$
so
$$\Pi_{0,1}(x, \hat{x}) = \sum_{y} \pi_0(x, y)\, p_{x,y}^{\hat{x}}\, \chi_0(x, \hat{x}, y) \tag{55}$$
and so
$$\Pi_0(x) = \sum_{y, \hat{x}} \pi_0(x, y)\, p_{x,y}^{\hat{x}}\, \chi_0(x, \hat{x}, y). \tag{56}$$
Here, $\pi_n$ and $\chi_n$ are computed recursively in (17) and (31) using the prior estimates of $p_{x,y}^{\hat{x},\hat{y}}$ and $\mu$.
Expectation-maximization algorithms use these types of formulas and prior estimates to produce better estimates. We take estimates for $p_{x,y}^{\hat{x},\hat{y}}$ and $\mu(x, y)$ and obtain new estimates for these quantities iteratively using (53), (54), (27), (35) and (28):
$$p'^{\,\hat{x},\hat{y}}_{x,y} = \frac{1_{Y_1 = \hat{y}}\, \pi_0(x, y)\, p_{x,y}^{\hat{x}}\, \chi_0(x, \hat{x}, y) + \sum_{n=1}^{N-1} 1_{Y_n = y, Y_{n+1} = \hat{y}}\, \pi_n(x)\, p_{x,y}^{\hat{x}}\, \chi_n(x, \hat{x})}{\pi_0(x, y) \sum_{x_1} p_{x,y}^{x_1}\, \chi_0(x, x_1, y) + \sum_{n=1}^{N-1} 1_{Y_n = y}\, \pi_n(x) \sum_{x_{n+1}} p_{x,y}^{x_{n+1}}\, \chi_n(x, x_{n+1})}, \tag{57}$$
and using (35),
$$\mu'(x, y) = \mu(x, y) \sum_{x_1} \chi_0(x, x_1, y)\, p_{x,y}^{x_1}. \tag{58}$$
Remark 1.
(1) Different iterations of $p_{x,y}^{\hat{x},\hat{y}}$ and $\mu(x, y)$ will be used on the left- and right-hand sides of (57) and (58). The new estimates on the left are denoted as $p'^{\,\hat{x},\hat{y}}_{x,y}$ and $\mu'(x, y)$.
(2) Setting the marginal $p_{x,y}^{\hat{x}} = 0$ or the probability $\mu(x, y) = 0$ will result in it staying zero for all updates. This effectively removes this parameter from the EM optimization update, and should be avoided unless it is known that one of these should be 0.
(3) If there are no successive observations with $Y_n = y$ and $Y_{n+1} = \hat{y}$ in the actual observation sequence, then all new estimates $p'^{\,\hat{x},\hat{y}}_{x,y}$ will either be set to 0 or close to it. They might not be exactly zero, due to the first term in the numerator of (57), where we could have an estimate of $Y_0 = y$ and an observed $Y_1 = \hat{y}$.
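To make the re-estimation step concrete, here is a condensed single EM iteration in Python (our own sketch, not the paper's Algorithm 2: for brevity it drops the time-0 terms involving $\chi_0$ and $\pi_0$ in (57) and the $\mu$ update (58), reuses the hypothetical `filter_loglik` helper from Section 3, and assumes the positivity conditions discussed below so the division in (31) is safe):

```python
# One condensed EM iteration for the PMC: accumulate the pair smoothers (27)
# into the numerator and denominator of (57), indexed by the observed pair.
import numpy as np

def em_step(mu, P, Y):
    pis, norms, _ = filter_loglik(mu, P, Y)
    e, o = mu.shape
    N = len(Y)
    num = np.zeros_like(P)
    den = np.zeros((e, o))
    chi = np.ones((e, e))                      # (32)
    for n in range(N - 1, 0, -1):              # pairs (n, n+1) for n = N-1, ..., 1
        pmarg = P[:, Y[n - 1]].sum(axis=-1)
        back = (chi * P[:, Y[n]].sum(axis=-1)).sum(axis=1)
        chi = P[:, Y[n - 1], :, Y[n]] / (norms[n] * pmarg) * back  # (31)
        pair = pis[n - 1][:, None] * chi * pmarg                   # (27)
        num[:, Y[n - 1], :, Y[n]] += pair      # expected transitions (x,Y_n)->(x',Y_{n+1})
        den[:, Y[n - 1]] += pair.sum(axis=1)   # expected occurrences of (x, Y_n)
    seen = den[:, :, None, None]
    return np.where(seen > 0, num / np.where(seen > 0, seen, 1.0), P)
```

Iterating `em_step` while monitoring the recursive log-likelihood from `filter_loglik` until it stabilizes mirrors the parameter-convergence criterion described above.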
We now have everything required for our EM algorithms, which are given for the PMC, MOM and HMM cases in Algorithms 2, 3 and 4, respectively.
These algorithms start with initial estimates $p_{x,y}^{\hat{x},\hat{y},1}, \mu^1(x, y)$ of $p_{x,y}^{\hat{x},\hat{y}}, \mu(x, y)$, and refine them successively to new estimates $p_{x,y}^{\hat{x},\hat{y},2}, \mu^2(x, y)$; $p_{x,y}^{\hat{x},\hat{y},3}, \mu^3(x, y)$; etc. It is important to know that our estimates $\{p_{x,y}^{\hat{x},\hat{y},k}, \mu^k(x, y)\}$ improve as $k \to \infty$.
Lemma 3 (below) will be used to ensure that an initially positive estimate stays positive as $k$ increases, which is important in our proofs in Section 6. The following lemma follows easily from (31)–(33), (17), (18), (34), induction and the fact that $\sum_{x'} p_{x,Y_{n+1}}^{x'} = 1$. A sensible initialization of our EM algorithm would ensure that the condition $p_{x,Y_n}^{\hat{x},Y_{n+1}} > 0$ holds.
Lemma 2.
Suppose $p_{x,Y_n}^{\hat{x},Y_{n+1}} > 0$ for all $x, \hat{x} \in E$ and $n \in \{1, \ldots, N-1\}$. Then,
1. $\chi_m(x, \hat{x}) > 0$ for all $x, \hat{x} \in E$ and $m \in \{1, \ldots, N-1\}$;
2. $\chi_0(x, \hat{x}, y) > 0$ for any $x, \hat{x} \in E$, $y \in O$, such that $p_{x,y}^{\hat{x},Y_1} > 0$;
3. $\pi_m(x) > 0$ for all $x \in E$ and $m \in \{1, \ldots, N\}$ if, in addition, $\sum_{x_0, y_0} \mu(x_0, y_0)\, p_{x_0,y_0}^{\hat{x},Y_1} > 0$ for all $\hat{x} \in E$;
4. $\pi_0(x, y) > 0$ if $\mu(x, y) > 0$.
The following result is the key to ensuring that our non-zero parameters stay non-zero. It follows from the prior lemma, as well as (57), (58) and (31).
Lemma 3.
Suppose $N \ge 2$ and $p_{x,Y_n}^{\hat{x},Y_{n+1}} > 0$ for all $x, \hat{x} \in E$ and $n \in \{1, \ldots, N-1\}$. Then,
1. $p'^{\,\hat{x},\hat{y}}_{x,y} > 0$ if $p_{x,y}^{\hat{x},\hat{y}} > 0$; $\{Y_n = y, Y_{n+1} = \hat{y}\}$ occurs for some $n$; and $\sum_{x_0, y_0} \mu(x_0, y_0)\, p_{x_0,y_0}^{x,Y_1} > 0$ for all $x, x_0 \in E$;
2. $\mu'(x, y) > 0$ if $\mu(x, y) > 0$ and there exists $\hat{x}$ such that $p_{x,y}^{\hat{x},Y_1} > 0$.
Algorithm 2: EM algorithm for PMC
(Algorithm 2 is presented as an image in the published version.)
Algorithm 3: EM algorithm for MOM
(Algorithm 3 is presented as an image in the published version.)
Algorithm 4: EM algorithm for HMM
(Algorithm 4 is presented as an image in the published version.)

5. Deepfake Application

Motivated by [42], we considered our three hidden-state models in deepfake generation and detection. In particular, we used the models' EM, simulation and Bayes' factor capabilities to generate and detect deepfakes of real coin-flip sequences, and then compared them to determine which of the three is the best at both generation and detection.
We first created 137 real sequences of 400 coin flips by generating independent fair Bernoulli trials. Another 137 hand-fake sequences of 200 coin flips were created by students with knowledge of undergraduate probability. They were told to make them look real in order to fool both humans and machines. Note that we worked with coin-flip sequences of length 200, except for training with the real sequences, where length 400 was used so that length was not a defining factor of the real sequences. This added length to the real sequences did not bias any of the HMM, MOM or PMC over the others, as it was consistent for all.
We used HMM, MOM and PMC simulation with a single hidden-state variable taking $s$ possible values (henceforth referred to as $s$ states) to generate deepfake sequences of 200 coin flips based on the 137 real sequences. To do this, we first learnt each of the 137 real sequences using the EM algorithms with $s+1$ hidden states for each model, creating three collections of 137 parameter sets for each $s$. Then, we simulated a sequence from each set of parameters, throwing the hidden states away, creating three collections of 137 observation coin-flip sequences for each $s$. These were the HMM-, MOM- and PMC-type deepfake sequences. Note that learning was conducted on the 400-long real sequences (to remove noise from the parameters), but we created 200-long deepfake sequences.
Once all five sets of (real, fake and deepfake) data had been collected, we ran 100 training and testing trials at each selected $s$ and averaged over these trials. For each trial, we randomly and independently split the 137 (hand) fake sequences into 110 training and 27 testing sequences, i.e., an 80-to-20 split. Conversely, we regenerated the 137 real sequences and the three sets of 137 deepfake sequences using, respectively, independent random numbers and Markov chain simulation with their models, but still divided these sets into 110 training and 27 testing sequences. We then trained the HMM, MOM and PMC with $s$ hidden states on each of these sets of 110 training sequences. Note that since the deepfake sequences were generated with $s+1$ hidden states, the actual model generating these sequences could not be identified. At this point, we had 110 sets of HMM parameters (i.e., HMM models) for each of the real, hand-fake, HMM, MOM and PMC training sequences in that trial. Similarly, we had 550 sets of MOM parameters and 550 sets of PMC parameters.
Detection for each testing sequence was carried out using all the models. In a trial, each of the five sets of 27 sequences was run against the 550 HMM, 550 MOM and 550 PMC models. A sequence was then predicted by the HMM to be real, hand-fake, HMM-generated, MOM-generated or PMC-generated based on HMM likelihood with $s$ hidden states. In particular, a sequence was predicted to be real if the sum of the log-likelihood over the 110 real HMM models was higher than that over the 110 hand-fake, 110 HMM, 110 MOM and 110 PMC HMM models. In the same way, it was predicted to be hand-fake, HMM, MOM or PMC by the HMM. This same procedure was repeated for the MOM and for the PMC, and then for the remaining 99 trials, using the regeneration method mentioned above. The results were averaged and put into Table 1, Table 2 and Table 3 for the cases $s = 3, 5$ and $7$, respectively.
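In code, the voting detector amounts to summing log-likelihoods over each class's trained models (a sketch assuming the hypothetical `filter_loglik` helper from Section 3, with `models` mapping each class label to its list of 110 trained (mu, P) pairs):

```python
# Likelihood-voting classifier: predict the class whose trained models assign
# the test sequence the highest total log-likelihood.
def classify(Y, models):
    scores = {label: sum(filter_loglik(mu, P, Y)[2] for mu, P in fits)
              for label, fits in models.items()}
    return max(scores, key=scores.get)

# e.g. models = {"real": [...], "handfake": [...], "hmm": [...],
#                "mom": [...], "pmc": [...]}
```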

6. Convergence of Probabilities

In this section, we establish the convergence properties of the transition probabilities and the initial distribution { p x , y x ^ , y ^ k , μ k ( x , y ) } that we derived in Section 4. Our method adapts the ideas of Baum et al. [43], Liporace [44] and Wu [45] to our setting.
We think of the transition probabilities and initial distribution as parameters, and let $\Theta$ denote all of the non-zero transition and initial distribution probabilities in $(p, \mu)$. Let $e = |E|$ and $o = |O|$ be the cardinalities of the hidden and observation spaces, and set $\bar{d} = e\,o$. Then, $p: (E \times O)^2 \to [0,1]$ has a domain space of cardinality $\bar{d}^2$, and $\mu: E \times O \to [0,1]$ has a domain space of cardinality $e \times o$. Combined, this leads to $\bar{d}^2 + e\,o$ parameters. However, we are removing the values that will be set to zero and adding sum-to-one constraints to consider a constrained optimization problem on $(0, \infty)^d$ for some $d \le \bar{d}^2 + e\,o$. Removing these zero possibilities gives us the necessary regularity for our re-estimation procedure. However, it is not enough to just remove them at the beginning. We have to ensure that zero parameters will not creep in during our iterations, or else we will be doing such things as taking logarithms of 0. Lemma 3 suggests that estimates not initially set to zero will not occur as zero in later iterations. In general, we will assume the following:
Definition 1.
A sequence of estimates $\{p^k, q^k, \mu^k\}$ is zero-separating if
1. $p_{x,y}^{\hat{x},\hat{y},1} > 0$ iff $p_{x,y}^{\hat{x},\hat{y},k} > 0$ for all $k = 1, 2, 3, \ldots$;
2. $\mu^1(x, y) > 0$ iff $\mu^k(x, y) > 0$ for all $k = 1, 2, 3, \ldots$.
Here, iff stands for if and only if.
This means that we can potentially optimize over the $p, \mu$ that we initially do not set to zero. Henceforth, we factor the zero $p, \mu$ out of $\Theta$, consider $\Theta \in (0, \infty)^d$ with $d \le \bar{d}^2 + e\,o$ and define the parameterized mass functions
$$p_{y_0, y_1, \ldots, y_N}(x_0, x_1, \ldots, x_N; \Theta) = p_{x_0,y_0}^{x_1,y_1}\, p_{x_1,y_1}^{x_2,y_2} \cdots p_{x_{N-1},y_{N-1}}^{x_N,y_N}\, \mu(x_0, y_0) \tag{59}$$
in terms of the non-zero values only. The observable likelihood
$$P_{Y_1, \ldots, Y_N}(\Theta) = \sum_{x_0, x_1, \ldots, x_N} \sum_{y_0} p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta) \tag{60}$$
is not changed by removing the zero values of $p, \mu$, and this removal allows us to define the re-estimation function
$$Q_{Y_1, \ldots, Y_N}(\Theta, \Theta') = \sum_{x_0, \ldots, x_N} \sum_{y_0} p_{y_0, Y_1, \ldots, Y_N}(x_0, \ldots, x_N; \Theta)\, \ln p_{y_0, Y_1, \ldots, Y_N}(x_0, \ldots, x_N; \Theta'). \tag{61}$$
Note: Here, and in the sequel, the summation in $P, Q$ above is only over the non-zero combinations. We would not include an $x_i, x_{i+1}$ pair where $p_{x_i, Y_i}^{x_{i+1}, Y_{i+1}} = 0$, nor an $x_0, y_0$ pair where $\mu(x_0, y_0) = 0$. Hence, our parameter space is
$$\Gamma = \Big\{\Theta \in (0, \infty)^d : \sum_{\hat{x}, \hat{y}} p_{x,y}^{\hat{x},\hat{y}} = 1, \; \sum_{x,y} \mu(x, y) = 1\Big\}.$$
Later, we will consider the extended parameter space
$$K = \Big\{\Theta \in [0, 1]^d : \sum_{\hat{x}, \hat{y}} p_{x,y}^{\hat{x},\hat{y}} = 1, \; \sum_{x,y} \mu(x, y) = 1\Big\}$$
to contain the limit points. Note that in both $\Gamma$ and $K$, $\Theta$ ranges only over the $p_{x,y}^{\hat{x},\hat{y}}$ and $\mu(x, y)$ that are not just set to 0 (before limits).
Then, equating $Y_0$ with $y_0$ to ease notation, one obtains the following:
$$Q(\Theta, \Theta') = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \mu(x_0, y_0) \left[\sum_{m=1}^{N} \ln p'^{\,x_m,Y_m}_{x_{m-1},Y_{m-1}} + \ln \mu'(x_0, y_0)\right]. \tag{62}$$
The re-estimation function is used to interpret the EM algorithm we derived earlier. We impose the following condition to ensure everything is well defined.
  • (Zero) The EM estimates are zero-separating.
The following result is motivated by Theorem 3 of Liporace [44].
Theorem 1.
Suppose (Zero) holds. The expectation-maximization solutions (57) and (58) derived in Section 4 are the unique critical point of the re-estimation function $\Theta' \mapsto Q(\Theta, \Theta')$, subject to $\Theta'$ forming probability mass functions. This critical point is a maximum taking values in $(0, 1]^d$ for the $d$ explained above.
We consider this as an optimization problem over the open set $(0, \infty)^d$, but with the constraint that we have mass functions, so the values have to be in the set $(0, 1]^d$.
Proof. 
One obtains, based on (62) as well as the constraint $\sum_{\hat{x},\hat{y}} p'^{\,\hat{x},\hat{y}}_{x,y} = 1$, that the maximum must satisfy
$$0 = \frac{\partial}{\partial p'^{\,\hat{x},\hat{y}}_{x,y}} \left[ Q(\Theta, \Theta') - \lambda \Big( \sum_{\xi, \theta} p'^{\,\xi,\theta}_{x,y} - 1 \Big) \right] = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{\sum_{m=1}^{N} 1_{x_{m-1} = x, Y_{m-1} = y}\, 1_{x_m = \hat{x}, Y_m = \hat{y}}}{p'^{\,\hat{x},\hat{y}}_{x,y}}\, \mu(x_0, y_0) - \lambda, \tag{63}$$
where $\lambda$ is a Lagrange multiplier and $Y_{m-1} = y$ means $Y_0 = y_0$ when $m = 1$. Multiplying by $p'^{\,\hat{x},\hat{y}}_{x,y}$, summing over $\hat{x}, \hat{y}$ and then using (11), (35) and (28) and then (19), (14) and (25), one determines that
$$\begin{aligned} \lambda &= \sum_{m=1}^{N} \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] 1_{x_{m-1} = x, Y_{m-1} = y}\, \mu(x_0, y_0) \\ &= P(X_0 = x, Y_0 = y, Y_1, \ldots, Y_N) + \sum_{m=2}^{N} 1_{Y_{m-1} = y}\, P(X_{m-1} = x, Y_1, \ldots, Y_N) \\ &= \Pi_0(x, y)\, L_N + \sum_{m=2}^{N} 1_{Y_{m-1} = y}\, \Pi_{m-1}(x)\, L_N \\ &= \sum_{x_1} \beta_0(x, x_1, y)\, p_{x,y}^{x_1}\, \alpha_0(x, y) + \sum_{m=2}^{N} \sum_{x_m} 1_{Y_{m-1} = y}\, \beta_{m-1}(x, x_m)\, p_{x,Y_{m-1}}^{x_m}\, \alpha_{m-1}(x). \end{aligned} \tag{64}$$
Substituting (64) into (63) and repeating the argument in (64), but with (27) instead of (28), one determines that
$$\begin{aligned} p'^{\,\hat{x},\hat{y}}_{x,y} &= \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{\sum_{m=1}^{N} 1_{x_{m-1} = x, Y_{m-1} = y, x_m = \hat{x}, Y_m = \hat{y}}}{\lambda}\, \mu(x_0, y_0) \\ &= \frac{1_{Y_1 = \hat{y}}\, P(X_0 = x, Y_0 = y, X_1 = \hat{x}, Y_1, \ldots, Y_N) + \sum_{m=2}^{N} 1_{Y_{m-1} = y, Y_m = \hat{y}}\, P(X_{m-1} = x, X_m = \hat{x}, Y_1, \ldots, Y_N)}{\sum_{x_1} \beta_0(x, x_1, y)\, p_{x,y}^{x_1}\, \alpha_0(x, y) + \sum_{m=2}^{N} \sum_{x_m} 1_{Y_{m-1} = y}\, \beta_{m-1}(x, x_m)\, p_{x,Y_{m-1}}^{x_m}\, \alpha_{m-1}(x)} \\ &= \frac{1_{Y_1 = \hat{y}}\, \chi_0(x, \hat{x}, y)\, p_{x,y}^{\hat{x}}\, \pi_0(x, y) + \sum_{m=2}^{N} 1_{Y_{m-1} = y, Y_m = \hat{y}}\, \chi_{m-1}(x, \hat{x})\, p_{x,Y_{m-1}}^{\hat{x}}\, \pi_{m-1}(x)}{\sum_{x_1} \chi_0(x, x_1, y)\, p_{x,y}^{x_1}\, \pi_0(x, y) + \sum_{m=2}^{N} \sum_{x_m} 1_{Y_{m-1} = y}\, \chi_{m-1}(x, x_m)\, p_{x,Y_{m-1}}^{x_m}\, \pi_{m-1}(x)}. \end{aligned} \tag{65}$$
To explain the first term in the numerator in the last equality, we use the multiplication rule and (24) to find
$$P(X_0 = x, Y_0 = y, X_1 = \hat{x}, Y_1, \ldots, Y_N) = \beta_0(x, \hat{x}, y)\, P(X_0 = x, Y_0 = y, X_1 = \hat{x}) = \chi_0(x, \hat{x}, y)\, L_N\, \pi_0(x, y)\, p_{x,y}^{\hat{x}},$$
from which it follows easily.
Finally, for a maximum, one also requires
$$0 = \frac{\partial}{\partial \mu'(x, y)} \left[ Q(\Theta, \Theta') - \lambda \Big( \sum_{\xi \in E, \theta \in O} \mu'(\xi, \theta) - 1 \Big) \right] = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{1_{x_0 = x}\, 1_{y_0 = y}}{\mu'(x, y)}\, \mu(x_0, y_0) - \lambda, \tag{66}$$
where $\lambda$ is a Lagrange multiplier. Multiplying by $\mu'(x, y)$ and summing over $x, y$, one obtains that
$$\lambda = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \mu(x_0, y_0) = P(Y_1, \ldots, Y_N) = L_N. \tag{67}$$
Substituting (67) into (66), one obtains by (35) that
$$\mu'(x, y) = \frac{\sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] 1_{x_0 = x}\, 1_{y_0 = y}\, \mu(x_0, y_0)}{L_N} = \frac{P(X_0 = x, Y_0 = y, Y_1, \ldots, Y_N)}{L_N} = \pi_0(x, y) \sum_{x_1} \chi_0(x, x_1, y)\, p_{x,y}^{x_1}. \tag{68}$$
Now, we have established that the EM algorithm of Section 4 corresponds to the unique critical point of $\Theta' \mapsto Q(\Theta, \Theta')$. Moreover, all mixed partial derivatives of $Q$ in the components of $\Theta'$ are 0, while
$$\frac{\partial^2 Q_{Y_1, Y_2, \ldots, Y_N}(\Theta, \Theta')}{\partial (p'^{\,\hat{x},\hat{y}}_{x,y})^2} = -\sum_{y_0; x_0, \ldots, x_N} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{\sum_{m=1}^{N} 1_{x_{m-1} = x, Y_{m-1} = y, x_m = \hat{x}, Y_m = \hat{y}}}{(p'^{\,\hat{x},\hat{y}}_{x,y})^2}\, \mu(x_0, y_0) \tag{69}$$
and
$$\frac{\partial^2 Q_{Y_1, Y_2, \ldots, Y_N}(\Theta, \Theta')}{\partial \mu'(x, y)^2} = -\sum_{y_0; x_0, \ldots, x_N} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{1_{y_0 = y, x_0 = x}}{\mu'(x, y)^2}\, \mu(x_0, y_0). \tag{70}$$
Hence, the Hessian matrix is diagonal with negative values along its diagonal, and the critical point is a maximum. □
The upshot of this result is that if the EM algorithm produces parameters $\{\Theta^k\} \subset \Gamma$, then $Q(\Theta^k, \Theta^{k+1}) \ge Q(\Theta^k, \Theta^k)$.
Now, we have the following result, based on Theorem 2.1 of Baum et al. [43], that establishes that the observable likelihood is also increasing, i.e., $P(\Theta^{k+1}) \ge P(\Theta^k)$.
Lemma 4.
Suppose (Zero) holds. $Q(\Theta, \Theta') \ge Q(\Theta, \Theta)$ implies $P(\Theta') \ge P(\Theta)$. Moreover, $Q(\Theta, \Theta') > Q(\Theta, \Theta)$ implies $P(\Theta') > P(\Theta)$.
Proof. 
$\ln(t)$ for $t > 0$ has convex inverse $\exp(t)$. Hence, by Jensen's inequality,
$$\begin{aligned} \frac{Q(\Theta, \Theta') - Q(\Theta, \Theta)}{P(\Theta)} &= \ln \exp\!\left( \sum_{x_0, x_1, \ldots, x_N} \sum_{y_0} \ln\!\left[ \frac{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta')}{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta)} \right] \frac{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta)}{P(\Theta)} \right) \\ &\le \ln \left( \sum_{x_0, x_1, \ldots, x_N} \sum_{y_0} \frac{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta')}{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta)} \cdot \frac{p_{y_0, Y_1, \ldots, Y_N}(x_0, x_1, \ldots, x_N; \Theta)}{P(\Theta)} \right) = \ln \frac{P(\Theta')}{P(\Theta)}, \end{aligned}$$
and the result follows. □
The stationary points of P and Q are also related.
Lemma 5.
Suppose (Zero) holds. A point $\Theta \in \Gamma$ is a critical point of $P(\Theta)$ if, and only if, it is a fixed point of the re-estimation function, i.e., $Q(\Theta; \Theta) = \max_{\Theta'} Q(\Theta; \Theta')$, since $Q$ is differentiable on $(0, \infty)^d$ in $\Theta'$.
Proof. 
The following derivatives are equal:
$$\frac{\partial P_{Y_1, \ldots, Y_N}(\Theta)}{\partial p_{x,y}^{\hat{x},\hat{y}}} = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] \frac{\sum_{m=1}^{N} 1_{x_{m-1} = x, Y_{m-1} = y, x_m = \hat{x}, Y_m = \hat{y}}}{p_{x,y}^{\hat{x},\hat{y}}}\, \mu(x_0, y_0) = \left. \frac{\partial Q_{Y_1, Y_2, \ldots, Y_N}(\Theta, \Theta')}{\partial p'^{\,\hat{x},\hat{y}}_{x,y}} \right|_{\Theta' = \Theta}, \tag{71}$$
which are defined since $p_{x,y}^{\hat{x},\hat{y}} \neq 0$. Similarly,
$$\frac{\partial P_{Y_1, \ldots, Y_N}(\Theta)}{\partial \mu(x, y)} = \sum_{x_0, \ldots, x_N} \sum_{y_0} \left[\prod_{n=1}^{N} p_{x_{n-1},Y_{n-1}}^{x_n,Y_n}\right] 1_{(x_0, y_0) = (x, y)} = \left. \frac{\partial Q_{Y_1, Y_2, \ldots, Y_N}(\Theta, \Theta')}{\partial \mu'(x, y)} \right|_{\Theta' = \Theta}. \tag{72}$$
□
We can rewrite (65) and (68) in recursive form, with the values of $\pi$ and $\chi$ substituted in, to find that
$$\Theta^{k+1} = M(\Theta^k),$$
where $M$ is a continuous function. Moreover, $P: K \to [0, 1]$ is continuous and satisfies $P(\Theta^k) \le P(M(\Theta^k))$ from the above. Now, we have established everything we need for the following result, which follows from the proof of Theorem 1 of Wu [45].
Theorem 2.
Suppose (Zero) holds. Then, $\{\Theta^k\}_{k=1}^{\infty}$ is relatively compact, all its limit points (in $K$) are stationary points of $P$ producing the same likelihood value, $P(\Theta^*)$ say, and $P(\Theta^k)$ converges monotonically to $P(\Theta^*)$.
Wu [45] provides several interesting results in the context of general EM algorithms to guarantee convergence to local or global maxima under certain conditions. However, the point of this paper is to introduce a new model and algorithms with just enough theory to justify the algorithms. Hence, we do not consider theory under any special cases here, but rather refer the reader to Wu [45].

7. Discussion and Conclusions

We have established a new expectation-maximization (EM) algorithm that converges to the parameters of general pairwise Markov chains and Markov observation models, generalizing the Baum–Welch algorithm for hidden Markov models. Our extension not only expands the model itself, but also identifies the initial distribution and solves the small-number problem. We have shown that the likelihood, filter, and (observation) predictor are all easily computable in real time using a recursion like the forward equation in the EM algorithm (after the parameters have converged). We have shown that the pathspace filter for the conditional distribution of the hidden state, given all the observations, is also computable using the results of both the forward and backward equations. We invented a GAN-like setup using the likelihoods of known models (with a voting scheme) for detection and simulation (throwing away the hidden component) for the generation part. Finally, we have shown how all our new technology might be combined to solve interesting problems like deepfake generation and detection. Work that is currently underway shows great promise for the application of these methods in areas like fraud detection, statistical process control and deepfake detection. It appears that comparable results in these domains cannot be obtained with existing approaches in the literature, which would validate the present work as more than just theory.
I was asked a couple of intriguing questions by the anonymous reviewers, which I will begin to discuss within this paragraph on potential future work. All our development focused on the discrete-space case. However, the classical Baum–Welch algorithm for HMMs also holds in the continuous (nearly) Gaussian case. A similar generalization to the one we made here should establish an EM algorithm for (nearly) Gauss–Markov coupled hidden-state observation pairs. Then, one would be in a position to properly establish the EM algorithm for the usual AR-HMM with Gaussian noise using the representation (7). Continuing in this direction, one could wonder whether there are EM-based forward–backward equations to estimate the parameters in an ARMA-HMM or ARIMA-HMM, both of which would satisfy an equation like the following:
$$Y_n = \beta_0(X_n) + \beta_1(X_n)\, Y_{n-1} + \cdots + \beta_p(X_n)\, Y_{n-p} + \varepsilon_n + \theta_1(X_n)\, \varepsilon_{n-1} + \cdots + \theta_q(X_n)\, \varepsilon_{n-q},$$
where $\{\beta_i(x)\}$ and $\{\theta_i(x)\}$ are parameters that depend upon the state of a hidden Markov chain, and $\{\varepsilon_i\}$ is an i.i.d. noise sequence. (Here, the parameters $\theta$ would take different values, or the equation might be rearranged, if we had an ARIMA model instead of an ARMA model.) These observation equations are not naturally Markov. Indeed, they are close to the ARFIMA models that are used to simulate long-range-dependent sequences. However, the ARMA-HMM and ARIMA-HMM still have linear observation equations with a finite number of parameters and dependence upon a hidden Markov chain. It would be intriguing to investigate whether the EM method can be extended to handle these cases, and whether there are analogs to the forward and backward equations that can be combined to estimate all the parameters in these models.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Baum, L.E.; Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Ann. Math. Stat. 1966, 37, 1554–1563.
  2. Baum, L.E.; Eagon, J.A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Am. Math. Soc. 1967, 73, 360–363.
  3. Petropoulos, A.; Chatzis, S.P.; Xanthopoulos, S. A novel corporate credit rating system based on Student's-t hidden Markov models. Expert Syst. Appl. 2016, 53, 87–105.
  4. Nicolai, C. Solving ion channel kinetics with the QuB software. Biophys. Rev. Lett. 2013, 8, 191–211.
  5. Sidrow, E.; Heckman, N.; Fortune, S.M.; Trites, A.W.; Murphy, I.; Auger-Méthé, M. Modelling multi-scale, state-switching functional data with hidden Markov models. Can. J. Stat. 2022, 50, 327–356.
  6. Date, P.; Mamon, R.; Tenyakov, A. Filtering and forecasting commodity futures prices under an HMM framework. Energy Econ. 2013, 40, 1001–1013.
  7. Stigler, J.; Ziegler, F.; Gieseke, A.; Gebhardt, J.C.M.; Rief, M. The Complex Folding Network of Single Calmodulin Molecules. Science 2011, 334, 512–516.
  8. Viterbi, A.J. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 1967, 13, 260–269.
  9. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286.
  10. Shinghal, R.; Toussaint, G.T. Experiments in text recognition with the modified Viterbi algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 184–193.
  11. Cappé, O.; Moulines, E.; Rydén, T. Inference in Hidden Markov Models; Springer: Berlin, Germany, 2007.
  12. Bryan, J.D.; Levinson, S.E. Autoregressive Hidden Markov Model and the Speech Signal. Procedia Comput. Sci. 2015, 61, 328–333.
  13. Stanculescu, I.; Williams, C.K.I.; Freer, Y. Autoregressive Hidden Markov Models for the Early Detection of Neonatal Sepsis. IEEE J. Biomed. Health Inform. 2014, 18, 1560–1570.
  14. Xuan, T. Autoregressive Hidden Markov Model with Application in an El Nino Study. Master's Thesis, University of Saskatchewan, Saskatoon, Canada, 2004.
  15. Pieczynski, W. Pairwise Markov chains. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 634–639.
  16. Derrode, S.; Pieczynski, W. Unsupervised data classification using pairwise Markov chains with automatic copula selection. Comput. Stat. Data Anal. 2013, 63, 81–98.
  17. Derrode, S.; Pieczynski, W. Unsupervised classification using hidden Markov chain with unknown noise copulas and margins. Signal Process. 2016, 128, 8–17.
  18. Kuljus, K.; Lember, J. Pairwise Markov Models and Hybrid Segmentation Approach. Methodol. Comput. Appl. Probab. 2023, 25, 67.
  19. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45.
  20. Kalman, R.E.; Bucy, R.S. New Results in Linear Filtering and Prediction Theory. ASME J. Basic Eng. 1961, 83, 95–108.
  21. Kouritzin, M.A. Sampling and filtering with Markov chains. Signal Process. 2024, 225, 109613.
  22. Zakai, M. On the optimal filtering of diffusion processes. Z. Wahrsch. Verw. Geb. 1969, 11, 230–243.
  23. Fujisaki, M.; Kallianpur, G.; Kunita, H. Stochastic differential equations for the nonlinear filtering problem. Osaka J. Math. 1972, 9, 19–40.
  24. Kurtz, T.G.; Ocone, D.L. Unique characterization of conditional distributions in nonlinear filtering. Ann. Probab. 1988, 16, 80–107.
  25. Kouritzin, M.A.; Long, H. On extending classical filtering equations. Stat. Probab. Lett. 2008, 78, 3195–3202.
  26. Kurtz, T.G.; Nappo, G. The Filtered Martingale Problem. In The Oxford Handbook of Nonlinear Filtering; Oxford University Press: Oxford, UK, 2010.
  27. Kouritzin, M.A. On exact filters for continuous signals with discrete observations. IEEE Trans. Autom. Control 1998, 43, 709–715.
  28. Elfring, J.; Torta, E.; van de Molengraft, R. Particle Filters: A Hands-On Tutorial. Sensors 2021, 21, 438.
  29. Pitt, M.K.; Shephard, N. Filtering Via Simulation: Auxiliary Particle Filters. J. Am. Stat. Assoc. 1999, 94, 590–591.
  30. Del Moral, P.; Kouritzin, M.A.; Miclo, L. On a class of discrete generation interacting particle systems. Electron. J. Probab. 2001, 6, 1–26.
  31. Kouritzin, M.A. Residual and Stratified Branching Particle Filters. Comput. Stat. Data Anal. 2017, 111, 145–165.
  32. Chopin, N.; Papaspiliopoulos, O. An Introduction to Sequential Monte Carlo; Springer Nature: Cham, Switzerland, 2020.
  33. Chopin, N. Central Limit Theorem for Sequential Monte Carlo Methods and its Application to Bayesian Inference. Ann. Stat. 2004, 32, 2385–2411.
  34. Kloek, T.; van Dijk, H.K. Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo. Econometrica 1978, 46, 1–19.
  35. van Dijk, H.K.; Kloek, T. Experiments with some alternatives for simple importance sampling in Monte Carlo integration. In Bayesian Statistics, Vol. II; Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M., Eds.; North-Holland and Valencia University Press: Amsterdam, The Netherlands, 1984; ISBN 0-444-87746-0.
  36. Hajiramezanali, E.; Imani, M.; Braga-Neto, U.; Qian, X.; Dougherty, E.R. Scalable optimal Bayesian classification of single-cell trajectories under regulatory model uncertainty. BMC Genom. 2019, 20 (Suppl. S6), 435.
  37. Creal, D. A Survey of Sequential Monte Carlo Methods for Economics and Finance. Econom. Rev. 2012, 31, 245–296.
  38. Maroulas, V.; Nebenführ, A. Tracking Rapid Intracellular Movements: A Bayesian Random Set Approach. Ann. Appl. Stat. 2015, 9, 926–949.
  39. D'Amato, E.; Notaro, I.; Nardi, V.A.; Scordamaglia, V. A Particle Filtering Approach for Fault Detection and Isolation of UAV IMU Sensors: Design, Implementation and Sensitivity Analysis. Sensors 2021, 21, 3066.
  40. Bonate, P. Pharmacokinetic-Pharmacodynamic Modeling and Simulation; Springer: Berlin, Germany, 2011.
  41. Van Leeuwen, P.J.; Künsch, H.R.; Nerger, L.; Potthast, R.; Reich, S. Particle filters for high-dimensional geoscience applications: A review. Q. J. R. Meteorol. Soc. 2019, 145, 2335–2365.
  42. Kouritzin, M.A.; Newton, F.; Orsten, S.; Wilson, D.C. On Detecting Fake Coin Flip Sequences. IMS Collect. 2008, 4, 107–122.
  43. Baum, L.E.; Petrie, T.; Soules, G.; Weiss, N. A Maximization Technique Occurring in Statistical Analysis of Probabilistic Functions in Markov Chains. Ann. Math. Stat. 1970, 41, 164–171.
  44. Liporace, L.A. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Trans. Inf. Theory 1982, 28, 729–734.
  45. Wu, C.F.J. On the Convergence Properties of the EM Algorithm. Ann. Statist. 1983, 11, 95–103.
Figure 1. Markov observation model structure. (The figure appears as an image in the published version.)
Table 1. Generative and detection ability with $s = 3$. Blue highlight indicates that this detection method is the best detector, while orange indicates that the generation method is the most difficult to detect by this detection method.

                      Real (%)   Handfake (%)   HMM (%)   MOM (%)   PMC (%)   Overall (%)
HMM detection          99.96       93.36         76.89     78.25     59.79      81.65
Standard deviation      0.357       3.590        25.343     9.841    27.386     10.076
MOM detection          99.03       89.39         98.39     91.31     77.11      91.11
Standard deviation      2.250       0.612         2.347     9.370     5.129      2.148
PMC detection         100          70.14         95.18     90.04     88.07      88.69
Standard deviation      0.0         2.243         1.990     3.491     5.519      1.402
Overall detection      99.66       84.30         90.15     86.53     74.99      87.15
Standard deviation      0.759       1.425         8.510     4.677     9.343      3.466
Table 2. Generative and detection ability with $s = 5$.

                      Real (%)   Handfake (%)   HMM (%)   MOM (%)   PMC (%)   Overall (%)
HMM detection         100          94.79         73.61     64.89     63.25      79.31
Standard deviation      0           3.383        27.013    24.905    19.987     11.739
MOM detection          98.79       89.29         95.32     87.90     79.96      90.30
Standard deviation      2.101       0.001         3.685    11.203     9.868      3.040
PMC detection          96.71       70.82         89.54     84.18     92.32      86.71
Standard deviation      2.470       1.688         1.917     3.526     4.607      1.218
Overall detection      98.5        84.97         86.16     78.99     78.51      85.44
Standard deviation      1.081       1.260         9.110     9.179     7.587      4.062
Table 3. Generative and detection ability with $s = 7$.

                      Real (%)   Handfake (%)   HMM (%)   MOM (%)   PMC (%)   Overall (%)
HMM detection         100          95.00         41.5      55.68     33.89      65.21
Standard deviation      0           3.003        29.270    28.099    22.608     12.141
MOM detection          98.76       89.29         96.96     90.52     90.82      93.29
Standard deviation      2.166       0.001         3.419    12.049     7.998      2.531
PMC detection          99.82       73.25         95.75     94.21     88.32      90.27
Standard deviation      0.782       2.298         1.736     2.723     5.464      1.230
Overall detection      99.53       85.85         78.07     80.14     71.01      82.92
Standard deviation      0.768       1.260         9.989    10.231     8.198      4.154