Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM

Lartigue, Thomas; Durrleman, Stanley; Allassonnière, Stéphanie

doi:10.3390/a15030078

Open AccessEditor’s ChoiceArticle

Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM

by

Thomas Lartigue

^1,2,*

,

Stanley Durrleman

¹

and

Stéphanie Allassonnière

³

¹

Aramis Project-Team, Inria, 75012 Paris, France

²

CMAP, CNRS, École Polytechnique, Institut Polytechnique de Paris, 91120 Palaiseau, France

³

HeKA, Centre de Recherche des Cordeliers, Université de Paris, INRIA, INSERM, Sorbonne Université, 75012 Paris, France

^*

Author to whom correspondence should be addressed.

Algorithms 2022, 15(3), 78; https://doi.org/10.3390/a15030078

Submission received: 3 February 2022 / Revised: 21 February 2022 / Accepted: 23 February 2022 / Published: 25 February 2022

(This article belongs to the Special Issue Stochastic Algorithms and Their Applications)

Download

Browse Figures

Review Reports Versions Notes

Abstract

The Expectation Maximisation (EM) algorithm is widely used to optimise non-convex likelihood functions with latent variables. Many authors modified its simple design to fit more specific situations. For instance, the Expectation (E) step has been replaced by Monte Carlo (MC), Markov Chain Monte Carlo or tempered approximations, etc. Most of the well-studied approximations belong to the stochastic class. By comparison, the literature is lacking when it comes to deterministic approximations. In this paper, we introduce a theoretical framework, with state-of-the-art convergence guarantees, for any deterministic approximation of the E step. We analyse theoretically and empirically several approximations that fit into this framework. First, for intractable E-steps, we introduce a deterministic version of MC-EM using Riemann sums. A straightforward method, not requiring any hyper-parameter fine-tuning, useful when the low dimensionality does not warrant a MC-EM. Then, we consider the tempered approximation, borrowed from the Simulated Annealing literature and used to escape local extrema. We prove that the tempered EM verifies the convergence guarantees for a wider range of temperature profiles than previously considered. We showcase empirically how new non-trivial profiles can more successfully escape adversarial initialisations. Finally, we combine the Riemann and tempered approximations into a method that accomplishes both their purposes.

Keywords:

Expectation Maximisation; exponential family; approximate EM; Riemann sums; tempering; annealing

MSC:

62F99; 49N30

1. Introduction

The Expectation Maximisation (EM) algorithm was introduced by Dempster, Laird and Rubin (DLR, [1]) to maximise likelihood functions

g (θ)

defined from inherent latent variables z and that were non-convex and had intricate gradients and Hessians. The EM algorithm estimates

θ

in an iterative fashion, starting from a certain initial value

θ_{0}

and where the new estimate

θ_{n + 1}

at step

n + 1

is function the estimate

θ_{n}

from the previous step n. In addition to presenting the method, DLR provide convergence guarantees on the sequence of estimated parameters

{θ_{n}}_{n}

, namely that it converges towards a critical point of the likelihood function as the step n of the procedure grows. Their flawed proof for this result was later corrected by Wu [2], and more convergence guarantees were studied by Boyles [3]. Since some likelihood functions are too complex to apply DLR’s raw version of the EM, authors of later works have proposed alternative versions, usually with new convergence guarantees. On the one hand, when the maximisation step (M step) is problematic, other optimisation methods such as coordinate descent [2] or gradient descent [4] have been proposed. On the other hand, several works introduce new versions of the algorithm where the expectation step (E step), which can also be intractable, is approximated. Most of them rely on Monte Carlo (MC) methods and Stochastic Approximations (SA) to estimate this expectation. Notable examples include the Stochastic Approximation EM (SAEM, [5]), the Monte Carlo EM (MC-EM, [6]), the Markov Chain Monte Carlo EM (MCMC-EM, [7]), the MCMC-SAEM [8,9] and the Approximate SAEM [10]. Random approximation of the E step have also been used in the case where the data are too voluminous to allow a full E step computation. A non-exhaustive list includes the Incremental EM [11,12], the Online EM [13] and, more recently, the stochastic EM with variance reduction (sEM-vr) [14], the fast Incremental EM (FIEM) [15], the Stochastic Path-Integrated Differential EstimatoR EM (SPIDER-EM) [16], and the mini-batch MCMC-SAEM [17]. Most of these variants come with their own theoretical convergence guarantees for the models of the exponential family. Recent works have also provided theoretical analysis of the EM algorithm outside of the exponential family, with locally strongly-concave log-likelihood function around the global maxima by Balakrishnan et al. [18], or without such strong-concavity assumption by Dwivedi et al. [19].

The stochastically approximated EM algorithms constitute an extensive catalogue of methods. Indeed, there are many possible variants of MCMC samplers that can be considered, as well as the additional parameters, such as the burn-in period length and the gain decrease sequence, that have to be set. All these choices have an impact on the convergence of these “EM-like” algorithms and making the appropriate ones for each problem can be overwhelming, see [20,21,22], among others, for discussions on tuning the MC-EM alone. On several cases, one might desire to dispose of a simpler method, possibly non-stochastic, and non-parametric to run an EM-like algorithm for models with no closed forms. However the literature is lacking in that regards. The Quasi-Monte Carlo EM, introduced by Pan and Thompson [23], is a deterministic version of Monte Carlo EM, however theoretical guarantees are not provided. In that vein, Jank [24] introduces the randomised Quasi-Monte Carlo EM, which is not deterministic, and does not have theoretical guarantees either. Another example of deterministic approximation can be found in the Variational EM [25,26,27], where the conditional distribution is approximated within a certain distribution family, and optimised over before each E step. Note that with such a design, if the distance between the true conditional and the chosen family is non-zero, then the approximation can never converge towards the true conditional.

In addition to intractable E steps, EM procedures also commonly face a second issue: their convergence, when guaranteed, can be towards any maximum. This theoretical remark has crucial numerical implications. Indeed, most of the time, convergence is reached towards a sub-optimal local maximum, usually very dependent on the initialisation. To tackle this issue and improve the solutions of the algorithm, other types of, usually deterministic, approximations of the E step have been proposed. One notable example is the tempering (or “annealing”) of the conditional probability function used in the E step. Instead of replacing an intractable problem by a tractable one, the tempering approximation is used to find better local maxima of the likelihood profile during the optimisation process, in the spirit of the Simulated Annealing of Kirkpatrick et al. [28] and the Parallel Tempering (or Annealing MCMC) of Swendsen and Wang [29], Geyer and Thompson [30]. The deterministic annealing EM was introduced by Ueda and Nakano [31] with a decreasing temperature profile; another temperature profile was proposed by Naim and Gildea [32]. Contrary to most of the studies on stochastic approximations, these two works do not provide theoretical convergence guarantees for the proposed tempered methods, which, as a consequence, does not provide insight on the choice of the temperature scheme. Moreover, the tempered methods do not allow the use of the EM in case of an intractable E step. In their tempered SAEM algorithm, Allassonniere and Chevallier [10] combine the stochastic and tempering approximations, which allows the SAEM to run, even with an intractable E step, while benefiting from the improved optimisation properties brought by the tempering. In addition, theoretical convergence guarantees are provided. However, this method is once again stochastic and parametric.

Overall, most of the literature on approximated E steps focuses on stochastic approximations that estimate intractable conditional probability functions. The few purely deterministic approximations proposed, such as the tempered/annealed EM, are used for other purposes, improving the optimisation procedure, and lack convergence guarantees.

In this paper, we prove that, with mild model assumptions and regularity conditions on the approximation, any Deterministic Approximate EM benefits from the state of the art theoretical convergence guarantees of [2,4,5]. This theorem covers several already existing methods, such as the tempered EM, and paves the way for the exploration of new ideas. To illustrate this, we study a small selection of methods with practical applications that verify our theorem’s hypotheses. First, for E steps without closed form, we propose to use Riemann sums to estimate the intractable normalising factor. This “Riemann approximation EM” is a deterministic, less parametric, alternative to the MC-EM and its variants. It is suited to the small dimension cases, where the analytic estimation of the intractable integral is manageable, and MC estimation unnecessary. Second, we prove that the deterministic annealed EM (or “tempered EM”) of Ueda and Nakano [31] is a member of our general deterministic class as well. We prove that the convergence guarantees are achieved with almost no condition of the temperature scheme, justifying the use of a wider range of temperature profile than those proposed by [31] and [32]. Finally, since the Riemann and tempered approximations are two separate methods that fulfil very different practical purposes, we also propose to associate the two approximations in the “tempered Riemann approximation EM” when both their benefits are desired.

In Section 2.1, we introduce our general class of deterministic approximated versions of the EM algorithm and prove their convergence guarantees, for models of the exponential family. We discuss the “Riemann approximation EM” in Section 2.2, the “tempered EM” in Section 2.3, and their association, “tempered Riemann approximation EM”, in Section 2.4. In Section 3, we study empirically each of these three methods. Section 4 discusses and concludes this work.

We demonstrate empirically that the Riemann EM converges properly on a model with and an intractable E step, and that adding the tempering to the Riemann approximation allows in addition to get away from the initialisation and recover the true parameters. On a tractable Gaussian Mixture Model, we compare the behaviours and performances of the tempered EM and the regular EM. In particular, we illustrate that the tempered EM is able to escape adversarial initialisations, and consistently reaches better values of the likelihood than the unmodified EM, in addition to better estimating the model parameters.

2. Materials and Methods

2.1. Deterministic Approximate EM Algorithm and Its Convergence for the Exponential Family

2.1.1. Context and Motivation

In this section, we propose a new class of deterministic EM algorithms with approximated E step. This class of algorithms is very general. In particular, it includes methods with deterministic approximations of intractable normalisation constant. It also includes optimisation-oriented approximations, such as the annealing approximation used to escape sub-optimal local extrema of the objective. In addition, combinations of these two types of approximations are also covered. We prove that members of this class benefit from the same convergence guarantees that can be found in the state of the art references [2,4,5] for the classical EM algorithm, and under similar model assumptions. The only condition on the approximated distribution being that it converges towards the real conditional probability distribution with a

l_{2}

regularity. Like the authors of [5,7,10], we work with probability density functions belonging to the exponential family. The specific properties of which are providen in the hypothesis

M 1

of Theorem 1.

EM algorithms are most often considered in the context of the missing data problem, see [5]. The formulation of the problem is the following: we observe a random variable x, described by a parametric probability density function (pdf) noted

g (θ)

with parameter

θ \in Θ \subset R^{l}, l \in N^{*}

. We assume that there exists a hidden variable z informing the behaviour of the observed variable x such that

g (θ)

can be expressed as the integral of the complete likelihood

h (z; θ)

:

g (θ) = \int_{z} h (z; θ) μ (d z)

, with

μ

the reference measure. We denote

p_{θ} (z) : = h (z; θ) / g (θ)

the conditional density function of z. For a given sample x, the goal is to maximise the likelihood

g (θ)

with respect to

θ

. In all these notations, and throughout the rest of the paper, we omit x as a variable since it is considered fixed once and for all, and everything is done conditionally to x. In particular, in many applications, the observed data x is made of

N \in N^{*}

samples:

x : = (x^{(1)}, x^{(N)})

. In such cases, the sample size

N < + \infty

is supposed finite and fixed once and for all. We are never in the asymptotic regime

N ⟶ + \infty

.

The foundation of the EM algorithm is that while

ln g (θ)

is hard to maximise in

θ

, the functions

θ \mapsto ln h (z; θ)

and even

θ \mapsto E_{z} [ln h (z; θ)]

are easier to work with because of the information added by the latent variable z (or its distribution). In practice however, the actual value of z is unknown and its distribution

p_{θ} (z)

dependent on

θ

. Hence, the EM was introduced by DLR [1] as the two-stages procedure starting from an initial point

θ_{0}

and iterated over the number of steps n:

(E) With the current parameter $θ_{n}$ , calculate the conditional probability $p_{θ_{n}} (z)$ ;
(M) To get $θ_{n + 1}$ , maximise in $θ \in Θ$ the function $θ \mapsto E_{z \sim p_{θ_{n}} (z)} [ln h (z; θ)]$ ;

Which can be summarised as:

θ_{n + 1} : = T (θ_{n}) : = \underset{θ \in Θ}{argmax} E_{z \sim p_{θ_{n}} (z)} [ln h (z; θ)],

(1)

where we call T the point to point map in

Θ

corresponding to one EM step. We will not redo the basic theory of the exact EM here, but this procedure noticeably increase

g (θ_{n})

at each new step n. However, as discussed in the introduction, in many cases, one may prefer to or need to use an approximation of

p_{θ_{n}} (z)

instead of the exact analytical value.

In the following, we consider a deterministic approximation of

p_{θ} (z)

noted

{\tilde{p}}_{θ, n} (z)

which depends on the current step n and on which we make no assumption at the moment. The resulting steps, defining the “Approximate EM”, can be written under the same form as (1):

θ_{n + 1} : = F_{n} (θ_{n}) : = \underset{θ \in Θ}{argmax} E_{z \sim {\tilde{p}}_{θ_{n}, n} (z)} [ln h (z; θ)],

(2)

where

{\{F_{n}\}}_{n \in N}

is the sequence of point to point maps in

Θ

associated with the sequence of approximations

{\{{\tilde{p}}_{θ, n} (z)\}}_{θ \in Θ; n \in N}

. In order to ensure the desired convergence guarantees, we add a slight modification to this EM sequence: re-initialisation of the EM sequence onto increasing compact sets. This modification was introduced by Chen et al. [33] and adapted by Fort and Moulines [7]. Assume that you dispose of an increasing sequence of compacts

{\{K_{n}\}}_{n \in N}

such that

\cup_{n \in N} K_{n} = Θ

and

θ_{0} \in K_{0}

. Define

j_{0} : = 0

. Then, the transition

θ_{n + 1} = F_{n} (θ_{n})

is accepted only if

F_{n} (θ_{n})

belongs to the current compact

K_{j_{n}}

, otherwise the sequence is reinitialised at

θ_{0}

. The steps of the resulting algorithm, called Stable Approximate EM, can be written as:

\{\begin{matrix} i f F_{n} (θ_{n}) \in K_{j_{n}}, then θ_{n + 1} = F_{n} (θ_{n}), and j_{n + 1} : = j_{n} \\ i f F_{n} (θ_{n}) \notin K_{j_{n}}, then θ_{n + 1} = θ_{0}, and j_{n + 1} : = j_{n} + 1 . \end{matrix}

(3)

This re-initialisation of the EM sequence may seem like a hurdle, however, this truncation is only a theoretical requirement. In practice, the first compact

K_{0}

is taken so large that it covers the most probable areas of

Θ

and the algorithms (2) and (3) are identical as long as the sequence

{\{θ_{n}\}}_{n}

does not diverges towards the border of

Θ

.

2.1.2. Theorem

In the following, we will state the convergence Theorem of Equation (3) and provide a brief description of the main steps of the proof.

Theorem 1 (Convergence of the Stable Approximate EM).

Let

{\{θ_{n}\}}_{n \in N}

be a sequence of the Stable Approximate EM defined in Equation (3). Let us assume two sets of hypotheses:

Conditions M1–3 on the model.
- M1. $Θ \subseteq R^{l}$ , $X \subseteq R^{d}$ and $μ$ is a $σ$ -finite positive Borel measure on $X$ . Let $ψ : Θ \to R$ , $ϕ : Θ \to R^{q}$ and $S : X \to S \subseteq R^{q}$ . Define $L : S \times Θ \to R$ , $h : X \times Θ \to R_{+}^{*}$ and $g : Θ \to R_{+}^{*}$ as:
  
  $\begin{matrix} L (s; θ) & : = ψ (θ) + 〈s, ϕ (θ)〉, \\ h (z; θ) & : = exp (L (S (z); θ)), \\ g (θ) & : = \int_{z \in X} h (z; θ) μ (d z) . \end{matrix}$
- M2. Assume that
  - (a*) $ψ$ and $ϕ$ are continuous on $Θ$ ;
  - (b) for all $θ \in Θ$ , $\bar{S} (θ) : = \int_{z} S (z) p_{θ} (z) μ (d z)$ is finite and continuous on $Θ$ ;
  - (c) there exists a continuous function $\hat{θ} : S \to Θ$ such that for all $s \in S, L (s; \hat{θ} (s)) = \sup_{θ \in Θ} L (s; θ)$ ;
  - (d) g is positive, finite and continuous on $Θ$ and the level set $\{θ \in Θ, g (θ) \geq M\}$ is compact for any $M > 0$ .
- M3. Define $L : = \{θ \in Θ | \hat{θ} \circ \bar{S} (θ) = θ\}$ and, for any $g^{*} \in R_{+}^{*}$ , $L_{g^{*}} : = \{θ \in L | g (θ) = g^{*}\}$ . Assume either that:
  - (a) The set $g (L)$ is compact or
  - ( $a^{'}$ ) for all compact sets $K \subseteq Θ,$ $g (K \cap L)$ is finite.
The conditions on the approximation. Assume that ${\tilde{p}}_{θ, n} (z)$ is deterministic. Let $S (z) = {\{S_{u} (z)\}}_{u = 1}^{q}$ . For all indices u, for any compact set $K \subseteq Θ$ , one of the two following configurations holds:

$\{\begin{matrix} \int_{z} S_{u}^{2} (z) d z < \infty, \\ \sup_{θ \in K} \int_{z} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0 . \end{matrix}$

(4)

Or

$\{\begin{matrix} \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ} (z) d z < \infty, \\ \sup_{θ \in K} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0 . \end{matrix}$

(5)

Then,

(i)

(a): $lim_{n \to \infty} j_{n} < \infty$ and $\sup_{n \in N} ∥θ_{n}∥ < \infty$ , with probability 1;
(b): $g (θ_{n})$ converges towards a connected component of $g (L)$ .

(ii)

Let

C l : P (Θ) \to Θ

be the set closure function and

d : Θ \times P (Θ) \to R_{+}

be any point-to-set distance function within

Θ

. If

g (L \cap C l ({\{θ_{n}\}}_{n \in N}))

has an empty interior, then

\exists g^{*} \in R_{+}^{*}

such that:

\begin{matrix} g (θ_{n}) \underset{n \to \infty}{⟶} g^{*}, \\ d (θ_{n}, L_{g^{*}}) \underset{n \to \infty}{⟶} 0 . \end{matrix}

Remark 1.

When g is differentiable on $Θ$ , its stationary points are the stable points of the EM, i.e., $L = \{θ \in Θ | \nabla g (θ) = 0\}$ .
The M1–3 conditions in Theorem 1 are almost identical to the similarly named M1–3 conditions in [7]. The only difference is in $M 2$ $(a^{*})$ , where we remove the hypothesis that S has to be a continuous function of z, since it is not needed when the approximation is not stochastic.
The property $\int_{z} S_{u}^{2} (z) d z < \infty$ of the condition (4) can seem hard to verify since S is not integrated here against a probability density function. However, when z is a finite variable, as is the case for finite mixtures, this integral becomes a finite sum. Hence, condition (4) is better adapted to finite mixtures, while condition (5) is better adapted to continuous hidden variables
The two sufficient conditions (4) and (5) involve a certain form of $l_{2}$ convergence of ${\tilde{p}}_{θ, n}$ towards $p_{θ}$ . If the latent variable z is continuous, this excludes countable (and finite) approximations such as sums of Dirac functions, since they have a measure of zero. In particular, this excludes Quasi-Monte Carlo approximations. However, one look at the proof of Theorem 1 (in Appendix A.1) or at the following sketch of proof reveals it is actually sufficient to verify $\sup_{θ \in K} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ \underset{n \to \infty}{⟶} 0$ for any compact set K. Where ${\tilde{S}}_{n} (θ) : = \int_{z} S (z) {\tilde{p}}_{θ, n} (z) μ (d z)$ denotes the approximated E step in the Stable Approximate EM. This condition can be verified by finite approximations.

2.1.3. Sketch of Proof

The detailed proof of this result can be found in Appendix A.1, we propose here an abbreviated version where we highlight the key steps.

Two intermediary propositions, introduced and proven in Fort and Moulines [7], are instrumental in the proof of Theorem 1. These two propositions are called Propositions 9 and 11 by Fort and Moulines, and used to prove their Theorem 3. Within our framework, with the new condition

M 2 (a^{*})

and the absence of Monte Carlo sum, the reasoning for verifying the conditions of applicability of the two proposition is quite different from [7] and will be highlighted below. Let us state these two propositions using the notations of this paper:

Proposition 1 (“Proposition 9”).

Consider a parameter space

Θ \subseteq R^{l}

, a point to point map T on

Θ

, a compact

K \subset Θ

and a subset

L \subseteq Θ

such that

L \cap K

is compact. Let g be a

C^{0}

, Lyapunov function relative to

(T, L)

. Assume that there exist a

K -

valued sequence

{\{θ_{n}\}}_{n \in N^{*}} \in K^{N}

such that:

lim_{n \to \infty} | g (θ_{n + 1}) - g \circ T (θ_{n}) | = 0 .

Then

${\{g (θ_{n})\}}_{n \in N}$ converges towards a connected component of $g (L \cap K)$
If $g (L \cap K)$ has an empty interior, then $\exists g^{*} \in R_{+}^{*}$ such that ${\{g (θ_{n})\}}_{n}$ converges towards $g^{*}$ . Moreover, ${\{θ_{n}\}}_{n}$ converges towards the set $L_{g^{*}} \cap K$ . Where $L_{g^{*}} : = \{θ \in L | g (θ) = g^{*}\}$ .

Proposition 2 (“Proposition 11”).

Consider a parameter space

Θ \subseteq R^{l}

, a subset

\subseteq Θ

, a point to point map T on

Θ

and a sequence of point to point maps

{\{F_{n}\}}_{n \in N^{*}}

also on

Θ

. Let

{\{θ_{n}\}}_{n \in N} \in Θ^{N}

be a sequence defined from

{\{F_{n}\}}_{n}

by the Stable Approximate EM equation (3) for some increasing sequence

{\{K_{n}\}}_{n \in N}

of compacts of

Θ

. Let

{\{j_{n}\}}_{n \in N}

be the corresponding sequence of indices, also defined in (3), such that

\forall n, θ_{n} \in K_{j_{n}}

. We assume:

(a)

the C1–2 conditions of Proposition 10 of [7].

-: (C1) There exists g, a $C^{0}$ Lyapunov function relative to $(T, L)$ such that for all $M > 0$ , $\{θ \in Θ | g (θ) > M\}$ is compact, and:

$Θ = \cup_{n \in N} \{θ \in Θ | g (θ) > n^{- 1}\} .$
-: (C2) $g (L)$ is compact OR (C2’) $g (L \cap K)$ is finite for all compact $K \subseteq Θ$ .

(b)

\forall θ \in K_{0}, lim_{n \to \infty} | g \circ F_{n} - g \circ T | (θ) = 0

(c)

∀ compact

K \subseteq Θ

:

lim_{n \to \infty} | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{θ_{n} \in K} = 0 .

Then:

With probability 1,

\underset{n \to \infty}{\lim \sup} j_{n} < \infty

and

{\{θ_{n}\}}_{n}

is a compact sequence.

Remark 2.

In [7], condition

C 1

of Proposition 2 is mistakenly written as:

Θ = \cup_{n \in N} \{θ \in Θ | W (θ) > n\} .

This is a typo that we have corrected here.

The proof of Theorem 1 is structured as follows: verifying the conditions of Proposition 2, applying Proposition 2, verifying the conditions of Proposition 1 and finally applying Proposition 1.

Verifying the conditions of Proposition 2. Let g be the observed likelihood function defined in hypothesis

M 1

of Theorem 1, T the exact EM point to point map defined in (1),

L : = \{θ \in Θ | T (θ) = θ\}

the set of its stable points,

{\{F_{n}\}}_{n}

a sequence of approximated point to point map as defined in (2). With some increasing sequence

{\{K_{n}\}}_{n \in N}

of compacts of

Θ

, let

{\{θ_{n}, j_{n}\}}_{n \in N}

be defined from the Stable Approximate EM equation (3).

By design of the regular EM, the following two properties are true. First, g is a

C^{0}

, Lyapunov function relative to

(T, L)

. Second, the map T can be written

T : = \hat{θ} \circ \bar{S}

, with the

\hat{θ}

and

\bar{S}

defined in condition

M 2

of Theorem 1. These properties, in conjunction with hypotheses

M 1 - 3

of Theorem 1, directly imply that condition

(a)

of Proposition 2 is verified.

The steps to verify conditions

(b)

and

(c)

differ from those in the proof of [7]. Let

{\tilde{S}}_{n} (θ_{n}) = \int_{z} S (z) {\tilde{p}}_{θ, n} (z) μ (d z)

be the approximated E step in the Stable Approximate EM, such that

F_{n} = \hat{θ} \circ {\tilde{S}}_{n}

. By using uniform continuity properties on compacts, we prove that the following condition is sufficient to verify both

(a)

and

(b)

:

\forall compact K, \sup_{θ \in K} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ \underset{n \to \infty}{⟶} 0 .

(6)

Then, upon replacing

{\tilde{S}}_{n}

and

\bar{S}

by their integral forms, it becomes clear that each of the hypotheses (4) or (5) of Theorem 1 is sufficient to verify (6). In the end, all conditions are verified to apply Proposition 2.

Applying Proposition 2. The application of Proposition 2 to the Stable Approximate EM tells us that with probability 1:

\underset{n \to \infty}{\lim \sup} j_{n} < \infty

and

{\{θ_{n}\}}_{n \in N}

is a compact sequence.

Which is specifically the result

(i) (a)

of Theorem 1.

Verifying the conditions of Proposition 1. We already have that the likelihood g is a

C^{0}

, Lyapunov function relative to

(T, L)

. Thanks to Proposition 2, we have that

K : = C l ({\{θ_{n}\}}_{n})

is a compact. Then,

L \cap K

is also compact thanks to hypothesis

M 3

. Moreover, by definition, the EM sequence verifies:

{\{θ_{n}\}}_{n} \in K^{N}

. The last condition that remains to be shown to apply Proposition 1 is that:

lim_{n \to \infty} | g (θ_{n + 1}) - g \circ T (θ_{n}) | = 0 .

If we apply

(c)

of Proposition 2 with

K = C l ({\{θ_{n}\}}_{n})

, we get an almost identical result, but with

θ_{n + 1}

replaced by

F_{n} (θ_{n})

. However,

F_{n} (θ_{n}) \neq θ_{n + 1}

only when

j_{n + 1} = j_{n} + 1

:

| g (θ_{n + 1}) - g \circ T (θ_{n}) | = | g (θ_{0}) - g \circ T (θ_{n}) | 𝟙_{j_{n + 1} = j_{n} + 1} + | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{j_{n + 1} = j_{n}} .

We have proven with Proposition 2 that there is only a finite number of such increments. Hence, when n is large enough,

F_{n} (θ_{n}) = θ_{n + 1}

always, and we have the desired result.

Applying Proposition 1. Since we verify all we need to apply the conclusions of Proposition 1:

${\{g (θ_{n})\}}_{n \in N}$ converges towards a connected component of $g (L \cap C l ({\{θ_{n}\}}_{n})) \subset g (L)$ .
If $g (L \cap C l ({\{θ_{n}\}}_{n}))$ has an empty interior, then the sequence ${\{g (θ_{n})\}}_{n \in N}$ converges towards a $g^{*} \in R$ and ${\{θ_{n}\}}_{n}$ converges towards the set $L_{g^{*}} \cap C l ({\{θ_{n}\}}_{n}) \subseteq L_{g^{*}}$ .

Which are both respectively exactly

(i) (b)

and

(i i)

of Theorem 1 and conclude the proof of the Theorem.

2.2. Riemann Approximation EM

2.2.1. Context and Motivation

In this section, we introduce one specific case of Approximate EM useful in practice: approximating the conditional probability density function

p_{θ} (z)

at the E step by a Riemann sum, in the scenario where the latent variable z is continuous and bounded. We call this procedure the “Riemann approximation EM”. After motivating this approach, we prove that it is an instance of the Approximate EM algorithm and verifies the hypotheses of Theorem 1, therefore benefits from the convergence guarantees.

Consider the case where z is a continuous variable and its conditional probability

p_{θ} (z)

is a continuous function. Even when

h (z; θ)

can be computed point by point, a closed form may not exist for the renormalisation term

g (θ) = \int_{z} h (z; θ) d z

. In that case, this integral is usually approximated stochastically with a Monte Carlo estimation, see for instance [5,7,10]. When the dimension is reasonably small, a deterministic approximation through Riemann sums can also be performed. For the user this can be a welcome simplification, since MCMC methods have a high hyper-parameter complexity (burn-in duration, gain decrease sequence, Gibbs sampler, etc.), whereas the Riemann approximation involves only the position of the Riemann intervals. The choice of which is very guided by the well known theories of integration (Lagrange, Legendre, etc.), and demonstrated in our experiments to have little impact.

We introduce the Riemann approximation as a member of the Approximate EM class. Since z is supposed bounded in this section, without loss of generality, we will assume that z is a real variable and

z \in [0, 1]

. We recall that

p_{θ} (z) = h (z; θ) / g (θ) = h (z; θ) / \int_{z} h (z; θ) d z

. Instead of using the exact joint likelihood

h (z; θ)

, we define a sequence of step functions

{\{{\tilde{h}}_{n}\}}_{n \in N^{*}}

as:

{\tilde{h}}_{n} (z; θ) : = h (⌊ φ (n) z ⌋ / φ (n); θ)

. Where

φ

is an increasing function from

N^{*} \to N^{*}

such that

φ (n) \underset{n \to \infty}{⟶} \infty

. For the sake of simplicity, we will take

φ = I d

, hence

{\tilde{h}}_{n} (z; θ) = h (⌊ n z ⌋ / n; θ)

. The following result, however, can be applied to any increasing function

φ

with

φ (n) \underset{n \to \infty}{⟶} \infty

.

With these steps functions, the renormalising factor

{\tilde{g}}_{n} (θ) : = \int_{z} {\tilde{h}}_{n} (z; θ) d z

is now a finite sum. That is:

{\tilde{g}}_{n} (θ) = \frac{1}{n} \sum_{k = 0}^{n - 1} h (⌊ k z ⌋ / n; θ)

. The approximate conditional probability

{\tilde{p}}_{n} (θ)

is then naturally defined as:

{\tilde{p}}_{n} (θ) : = {\tilde{h}}_{n} (z; θ) / {\tilde{g}}_{n} (θ)

. Thanks to the replacement of the integral by the finite sum, this deterministic approximation is much easier to compute than the real conditional probability.

2.2.2. Theorem and Proof

We state and prove the following Theorem for the convergence of the EM with a Riemann approximation.

Theorem 2.

Assume that the hidden variable z is continuous and bounded. Consider the approximation

{\tilde{p}}_{n, θ} (z) : = \frac{h (⌊ n z ⌋ / n; θ)}{\int_{z^{'}} h (⌊ n z^{'} ⌋ / n; θ) d z^{'}} .

We call “Riemann approximation EM” the associated Stable Approximate EM. Under conditions

M 1 - 3

of Theorem 1, if

z \mapsto S (z)

is continuous, then the conclusions of Theorem 1 hold for the Riemann approximation EM.

Proof.

Under the current assumptions, it is sufficient to verify condition (4) in order to apply Theorem 1. Without loss of generality, we will assume that the bounded variable z is contained in

[0, 1]

. Since S is continuous, the first part of the condition is easily verified:

\int_{z = 0}^{1} S_{u}^{2} (z) d z < \infty

.

For the second part of the condition, we consider a compact

K \subseteq Θ

. First, note that

h (z; θ) = exp (ψ (θ) + 〈S (z), ϕ (θ)〉)

is continuous in

(z, θ)

, hence uniformly continuous on the compact set

[0, 1] \times K

. Additionally, we have:

\begin{matrix} 0 < m : = \min_{(z, θ) \in [0, 1] \times K} h (z; θ) \leq h (z; θ), \\ h (z; θ) \leq \max_{(z, θ) \in [0, 1] \times K} h (z; θ) = : M < \infty . \end{matrix}

where m and M are constants independent of z and

θ

. This also means that

m \leq g (θ) = \int_{z = 0}^{1} h (z; θ) \leq M

. Moreover, since

{\tilde{h}}_{n} (z; θ) = h (⌊n z⌋ / n; θ)

, then we also have

\forall z \in [0, 1], θ \in K, n \in N, m \leq {\tilde{h}}_{n} (z; θ) \leq M

and

m \leq {\tilde{g}}_{n} (θ) = \int_{z = 0}^{1} {\tilde{h}}_{n} (z; θ) \leq M

.

As h is uniformly continuous,

\forall ϵ > 0, \exists δ > 0, \forall (z, z^{'}) \in {[0, 1]}^{2}, (θ, θ^{'}) \in K^{2}

:

|z - z^{'}| \leq δ and ∥θ - θ^{'}∥ \leq δ \Rightarrow |h (z; θ) - h (z^{'}; θ^{'})| \leq ϵ .

By definition,

⌊n z⌋ / n - z \leq 1 / n

. Hence

\exists N \in N, \forall n \geq N, ⌊n z⌋ / n - z \leq δ

. As a consequence:

\forall ϵ > 0, \exists N \in N, \forall n \geq N, \forall (z, θ) \in [0, 1] \times K, |h (z; θ) - {\tilde{h}}_{n} (z; θ)| \leq ϵ .

In other words,

{{\tilde{h}}_{n}}_{n}

converges uniformly towards h. Let us fix

ϵ

and N now. Then,

\forall n \geq N, \forall (z, θ) \in [0, 1] \times K

:

\begin{matrix} {\tilde{p}}_{θ, n} (z) - p_{θ} (z) & = \frac{{\tilde{h}}_{n} (z; θ)}{\int_{z} {\tilde{h}}_{n} (z; θ) d z} - \frac{h (z; θ)}{\int_{z} h (z; θ) d z} \\ = \frac{{\tilde{h}}_{n} (z; θ) - h (z; θ)}{\int_{z} {\tilde{h}}_{n} (z; θ) d z} + h (z; θ) \frac{\int_{z} (h (z; θ) - {\tilde{h}}_{n} (z; θ)) d z}{\int_{z} h (z; θ) d z \int_{z} {\tilde{h}}_{n} (z; θ) d z} \\ \leq \frac{ϵ}{m} + M \frac{ϵ}{m^{2}} \\ = ϵ \frac{m + M}{m^{2}} . \end{matrix}

Hence, for this

ϵ, \forall n \geq N :

\sup_{θ \in K} \int_{z = 0}^{1} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \leq ϵ^{2} {(\frac{m + M}{m^{2}})}^{2},

which, by definition, means that

\sup_{θ \in K} \int_{z = 0}^{1} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0 .

With this, condition (4) is fully verified, and the conclusions of Theorem 1 are applicable, which concludes the proof. □

2.3. Tempered EM

2.3.1. Context and Motivation

In this section, we consider another particular case of Deterministic Approximate EM: the Tempered EM (or “tmp-EM”), first introduced in [31]. We first prove that under mild conditions, tmp-EM verifies the hypotheses of Theorem 1, hence benefits from the state of the art EM convergence guarantees. In particular, we prove that the choice of the temperature profile is almost completely free. This justifies the use of a wider array of temperature profiles than the ones specified in [31,32]. Then, we demonstrate the interest of tmp-EM with non-monotonous profiles, on Mixture estimation problems with hard to escape initialisations.

Tempering or annealing is a common technique in the world of non-convex optimisation. With a non-convex objective function g, naively following the gradients usually leads to undesirable local extrema close to the initial point. A common remedy is to elevate g to the power

\frac{1}{T_{n}}

, with

{T_{n}}_{n \in N}

a sequence of temperatures tending towards 1 as the number n of steps of the procedure increases. When

T_{n} > 1

, the shape of the new objective function

g^{\frac{1}{T_{n}}}

is flattened, making the gradients less strong, and the potential wells less attractive, but without changing the hierarchy of the values. This allows the optimisation procedure to explore more before converging. This concept is put in application in many state of the art procedures. The most iconic probably being the Simulated Annealing, introduced and developed in [28,34,35], where in particular

T_{n} ⟶ 0

instead of 1. It is one of the few optimisation technique proven to find global optimum of non-convex functions. The Parallel Tempering (or Annealing MCMC) developed in [29,30,36] also makes use of these ideas to improve the MCMC simulation of a target probability distribution.

Since non-convex objectives are a common occurrence in the EM literature, [31] introduced the Deterministic Annealed EM, where, in the E step, the conditional probability is replaced by a tempered approximated distribution:

{\tilde{p}}_{n, θ} (z) \propto p_{θ}^{\frac{1}{T_{n}}} (z) \propto h {(z; θ)}^{\frac{1}{T_{n}}}

(renormalised such that

\int_{z} {\tilde{p}}_{n, θ} (z) d z = 1

). For the temperature profile they consider specific sequences

{T_{n}}_{n \in N}

decreasingly converging to 1. Another specific, non-monotonous, temperature scheme was later proposed by Naim and Gildea [32]. In both cases, theoretical convergence guarantees are lacking. Later, in [10], tempering has been applied to the SAEM, with convergence guarantees provided for any temperature scheme

T_{n} ⟶ 1

.

With Theorem 3, we show that the Deterministic Anealing EM, or tmp-EM, is a specific case of the Deterministic Approximate EM of Section 2.1, hence can benefit from the same convergence properties. In particular, we show that any temperature scheme

{T_{n}}_{n} \in {(R^{*})}^{N}, T_{n} \underset{n \to \infty}{⟶} 1

guarantees the state of the art convergence.

Remark 3.

Elevating

p_{θ} (z)

to the power

\frac{1}{T_{n}}

, as is done here and in [31,32], is not equivalent to elevating to the power

\frac{1}{T_{n}}

the objective function

g (θ)

, which would be expected for a typical annealed or tempered optimisation procedure. It is not equivalent either to elevating to the power

\frac{1}{T_{n}}

the proxy function

E_{z \sim p_{θ_{n}} (z)} [h (z; θ)]

that is optimised in the M step. Instead, the weights

p_{θ_{n}} (z)

(or equivalently, the terms

h (z; θ_{n})

) used in the calculation of

E_{z \sim p_{θ_{n}} (z)} [h (z; θ)]

are the tempered terms. This still results in the desired behaviour and is only a more “structured” tempering. Indeed, with this tempering, it is the estimated distribution of the latent variable z that are made less unequivocal, with weaker modes, at each step. This forces the procedure to spend more time considering different configurations for those variables, which renders as a result the optimised function

E_{z \sim p_{θ_{n}} (z)} [h (z; θ)]

more ambiguous regarding which

θ

is the best, just as intended. Then, when n large, the algorithm is allowed to converge, as

T_{n} ⟶ 1

and

E_{z \sim {\tilde{p}}_{n, θ} (z)} ⟶ E_{z \sim p_{θ} (z)}

.

2.3.2. Theorem

We now provide the convergence Theorem for the Approximate EM with the tempering approximation. In particular, this result highlights that there are almost no constraints on the temperature profile to achieve convergence.

Theorem 3.

Let

T_{n}

be a sequence of non-zero real numbers. Consider the approximation introduced in [31]:

{\tilde{p}}_{n, θ} (z) : = \frac{p_{θ}^{\frac{1}{T_{n}}} (z)}{\int_{z^{'}} p_{θ}^{\frac{1}{T_{n}}} (z^{'}) d z^{'}} .

We call “Tempered EM” the associated Stable Approximate EM. Define

\bar{B} (1, ϵ)

be the closed ball centred in 1 and with radius

ϵ \in R_{+}

. Under conditions M1–3 of Theorem 1, if

T_{n} \underset{n \to \infty}{⟶} 1

and for any compact

K \subseteq Θ, \exists ϵ \in] 0, 1 [, \forall α \in \bar{B} (1, ϵ)

:

(T1): $\sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z < \infty$ ,
(T2): $\forall u \in 〚 1, q 〛, \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z < \infty$ ,

then the conclusions of Theorem 1 hold for the Tempered EM.

Remark 4.

The added condition that

(T 1)

and

(T 2)

must hold for all

α

in a ball

\bar{B} (1, ϵ)

is very mild. Indeed, in Section 2.3.4, we show classical examples that easily verify the much stronger condition that

(T 1)

and

(T 2)

hold for all

α \in R_{+}^{*}

.

2.3.3. Sketch of Proof

The detailed proof of Theorem 3 can be found in Appendix A.2, we propose here an abbreviated version where we highlight the key steps.

Under the current assumptions, it is sufficient to verify the second part of condition (5) in order to apply Theorem 1. To that end, we must control the integral

\int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z,

for all

θ

in a compact

K \subseteq Θ

. First, with a Taylor development in the term

(\frac{1}{T_{n}} - 1)

, which converges toward 0 when

n \to \infty

, we control the difference

{({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2}

:

{(\frac{p_{θ} {(z)}^{\frac{1}{T_{n}}}}{\int_{z^{'}} p_{θ} {(z^{'})}^{\frac{1}{T_{n}}}} - p_{θ} (z))}^{2} \leq 2 {(\frac{1}{T_{n}} - 1)}^{2} p_{θ} {(z)}^{2} ({(ln p_{θ} (z) e^{a (z, θ, T_{n})})}^{2} A (θ, T_{n}) + B (θ, T_{n})) .

where the terms

A (θ, T_{n})

,

B (θ, T_{n})

and

a (z, θ, T_{n})

come from the Taylor development. Then, we can control the integral of interest:

\begin{matrix} \int_{z} \frac{{(\frac{p_{θ} {(z)}^{\frac{1}{T_{n}}}}{\int_{z^{'}} p_{θ} {(z^{'})}^{\frac{1}{T_{n}}}} - p_{θ} (z))}^{2}}{p_{θ} (z)} d z \leq & 2 {(\frac{1}{T_{n}} - 1)}^{2} A (θ, T_{n}) \int_{z} p_{θ} (z) e^{2 a (z, θ, T_{n})} {ln}^{2} p_{θ} (z) d z \\ + 2 {(\frac{1}{T_{n}} - 1)}^{2} B (θ, T_{n}) . \end{matrix}

(7)

From the properties of the Taylor development, we prove that

A (θ, T_{n})

and

B (θ, T_{n})

both have upper bounds involving only

\int_{z} p_{θ} (z) ln p_{θ} (z)

and

\int_{z} p_{θ} {(z)}^{\frac{1}{T_{n}}} ln p_{θ} (z)

. In a similar fashion, the term

\int_{z} p_{θ} (z) e^{2 a (z, θ, T_{n})} {ln}^{2} p_{θ} (z)

has an upper bound involving only

\int_{z} p_{θ} (z) {ln}^{2} p_{θ} (z) d z

and

\int_{z} p_{θ} {(z)}^{\frac{2}{T_{n}} - 1} {ln}^{2} p_{θ} (z) d z

.

Using the hypotheses of the Theorem, we prove that for any

α \in \bar{B} (1, ϵ)

and

θ \in K

the terms

\int_{z} p_{θ} {(z)}^{α} ln p_{θ} (z)

and

\int_{z} p_{θ} {(z)}^{α} {ln}^{2} p_{θ} (z)

are both upper bounded by a constant C independent of

θ

and

α

.

Since

T_{n} \underset{n \to \infty}{⟶} 1

, then when n is large enough,

\frac{1}{T_{n}} \in \bar{B} (1, ϵ)

and

\frac{2}{T_{n}} - 1 \in \bar{B} (1, ϵ)

. Hence, the previous result applies to the upper bounds of

A (θ, T_{n})

,

B (θ, T_{n})

and

\int_{z} p_{θ} (z) e^{2 a (z, θ, T_{n})}

{ln}^{2} p_{θ} (z) d z

. As a result, these three terms are respectively upper bounded by

C_{1}

,

C_{2}

and

C_{3}

, three constants independent of

θ

and

T_{n}

.

The inequality (7) then becomes:

\int_{z} \frac{1}{p_{θ} (z)} {(\frac{p_{θ} {(z)}^{\frac{1}{T_{n}}}}{\int_{z^{'}} p_{θ} {(z^{'})}^{\frac{1}{T_{n}}}} - p_{θ} (z))}^{2} d z \leq 2 {(\frac{1}{T_{n}} - 1)}^{2} C_{1} C_{2} + 2 {(\frac{1}{T_{n}} - 1)}^{2} C_{3} .

By taking the supremum in

θ \in K

and the limit when

n ⟶ \infty

, we get the desired result:

\sup_{θ \in K} \int_{z} \frac{1}{p_{θ} (z)} {(\frac{p_{θ} {(z)}^{\frac{1}{T_{n}}}}{\int_{z^{'}} p_{θ} {(z^{'})}^{\frac{1}{T_{n}}}} - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0 .

With condition (5) verified, the conclusions of Theorem 1 are applicable, which concludes the proof.

2.3.4. Examples of Models That Verify the Conditions

In this section, we illustrate that the conditions of Theorem 3 are easily met by common models. We take two examples, first the Gaussian Mixture Model (GMM) where the latent variables belong to a finite space, then the Poisson count with random effect, where the latent variables live in a continuous space. As mentioned, these examples are shown to not only verify all hypotheses of Theorem 3, but also to verify

(T 1)

and

(T 2)

for any

α \in R_{+}^{*}

.

Gaussian Mixture Model

Despite being one of the most common models the EM is applied to, the GMM have many known irregularities and pathological behaviours, see [37,38]. Although some recent works, such as [19,39], tackled the theoretical analysis of EM for GMM, none of the convergence results for the traditional EM and its variants proposed by [2,4,5,7] apply to the GMM. The hypothesis that the GMM fail to verify is the condition that the level lines have to be compact (

M 2

(d)

in this paper). All is not lost however for the GMM, indeed, the model verifies all the other hypotheses of the general Theorem 1 as well as the tempering hypotheses of Theorem 3. Moreover, in this paper as in the others, the only function of the unverified hypothesis

M 2

(d)

is to ensure in the proof that the EM sequence stays within a compact. The latter condition is the actual relevant property to guarantee convergence at this stage of the proof. This means that, in practice, if an tempered EM sequence applied to a GMM is observed to remain within a compact, then all the conditions for convergence are met, Theorem 3 applies, and the sequence is guaranteed to converge towards a critical point of the likelihood function.

In the following, we provide more details on the GMM likelihoods and the theorem hypotheses they verify. First, note that the GMM belongs to the exponential family with the complete likelihood:

\begin{matrix} h (z; θ) = \prod_{i = 1}^{N} \sum_{k = 1}^{K} e x p (\frac{1}{2} ( & - {(x^{(i)} - μ_{k})}^{T} Θ_{k} (x^{(i)} - μ_{k}) \\ + ln (|Θ_{k}|) + 2 ln (π_{k}) - p ln (2 π))) 𝟙_{z^{(i)} = k}, \end{matrix}

(8)

and the observed likelihood:

\begin{matrix} g (θ) = \prod_{i = 1}^{N} \sum_{k = 1}^{K} e x p (\frac{1}{2} ( & - {(x^{(i)} - μ_{k})}^{T} Θ_{k} (x^{(i)} - μ_{k}) \\ + ln (|Θ_{k}|) + 2 ln (π_{k}) - p ln (2 π))) . \end{matrix}

(9)

This is an exponential model with parameter:

θ : = ({\{π_{k}\}}_{k = 1}^{K}, {\{μ_{k}\}}_{k = 1}^{K}, {\{Θ_{k}\}}_{k = 1}^{K}) \in \{{\{π_{k}\}}_{k} \in {[0, 1]}^{K} | \sum_{k} π_{k} = 1\} \otimes R^{p \times K} \otimes {S_{p}^{+ +}}^{K} .

where

S_{p}^{+ +}

is the cone of symmetric positive definite matrices of size p. The verification of conditions M1–3 for the GMM (except

M 2

(d)

of course) is a classical exercise since these are the conditions our Theorem shares with any other EM convergence result on the exponential family. We focus here on the hypotheses specific to our Deterministic Approximate EM.

Condition on

\int_{z} p_{θ}^{α} (z) d z

. Let

α \in R_{+}^{*}

, in the finite mixture case, the integrals on z are finite sums:

\int_{z} p_{θ}^{α} (z) d z = \sum_{k} p_{θ}^{α} (z = k) .

Which is continuous in

θ

since

θ \mapsto p_{θ} (z = k) = h (z = k; θ) / g (θ)

is continuous. Hence

\forall α \in R_{+}^{*}, \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z < \infty .

Condition on

\int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z

. The previous continuity argument is still valid.

Poisson Count with Random Effect

This model is discussed in [7], the authors prove, among other things, that this model verifies the conditions

M 1 - 3

. Here is a brief description of the model: the hidden variables

{\{z^{(i)}\}}_{i = 1}^{N}

are distributed according to an autoregressive process of order 1:

z^{(i)} : = a z^{(i - 1)} + σ ϵ_{i}

, with

| a | < 1

,

σ > 0

and the

{\{ϵ^{(i)}\}}_{i = 1}^{N}

are iid standard gaussian. Conditionally to

{\{z^{(i)}\}}_{i = 1}^{N}

, the observed variables

x^{(i)}

are independent Poisson variables with parameter

exp (θ + z^{(i)})

. The complete likelihood of the model, not accounting for irrelevant constants, is:

h (z; θ) = e^{θ \sum_{i = 1}^{N} x^{(i)}} . e x p (- e^{θ} \sum_{i = 1}^{N} e^{z^{(i)}}) .

(10)

g (θ) = \int_{z} h (z; θ) d z

can be computed analytically up to a constant:

\begin{matrix} g (θ) & = \int_{z \in R^{N}} h (z; θ) d z \\ = e^{θ \sum_{i = 1}^{N} x^{(i)}} \int_{z \in R^{N}} e x p (- e^{θ} \sum_{i = 1}^{N} e^{z^{(i)}}) d z \\ = e^{θ \sum_{i = 1}^{N} x^{(i)}} \prod_{i = 1}^{N} \int_{z^{(i)} \in R} e x p (- e^{θ} e^{z^{(i)}}) d z^{(i)} \\ = e^{θ \sum_{i = 1}^{N} x^{(i)}} {(\int_{u \in R_{+}} \frac{e x p (- u)}{u} d u)}^{N} \\ = e^{θ \sum_{i = 1}^{N} x^{(i)}} E_{1} {(0)}^{N}, \end{matrix}

(11)

where

E_{1} (0)

is a finite, non zero, constant, called “exponential integral”, in particular independent of

α

and

θ

.

Condition on

\int_{z} p_{θ}^{α} (z) d z

. Let K be a compact in

Θ

.

We have

p_{θ} (z) = \frac{h (z; θ)}{g (θ)}

. Let us compute

\int_{z} h {(z; θ)}^{α}

for any positive

α

. The calculations work as in Equation (11):

\begin{matrix} \int_{z \in R^{N}} h {(z; θ)}^{α} & = e^{α θ \sum_{i = 1}^{N} x^{(i)}} \prod_{i = 1}^{N} \int_{z^{(i)} \in R} e x p (- α e^{θ} e^{z^{(i)}}) d z^{(i)} \\ = e^{α θ \sum_{i = 1}^{N} x^{(i)}} E_{1} {(0)}^{N} . \end{matrix}

Hence:

\int_{z} p_{θ}^{α} (z) d z = E_{1} {(0)}^{(1 - α) N} .

Since

E_{1} (0)

is finite, non zero, and independent of

θ

, we easily have:

\forall α \in R_{+}^{*}, \underset{θ \in K}{s u p} \int_{z} p_{θ}^{α} (z) d z < \infty .

θ

does not even have to be restricted to a compact.

Condition on

\int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z

. Let K be a compact in

Θ

and

α

a positive real number.

In this Poisson count model,

S (z) = \sum_{i = 1}^{N} e^{z^{(i)}} \in R

. We have:

S^{2} (z) p_{θ}^{α} (z) = {(\sum_{i = 1}^{N} e^{z^{(i)}})}^{2} \frac{e x p (- α e^{θ} \sum_{i = 1}^{N} e^{z^{(i)}})}{E_{1} {(0)}^{α N}} .

(12)

First, let us prove that the integral is finite for any

θ

. We introduce the variables

u_{k} : = \sum_{l = 1}^{k} e^{z_{l}}

. The Jacobi matrix is triangular and its determinant is

\prod_{k} e^{z^{(i)}} = \prod_{k} u_{k}

.

\begin{matrix} \int_{z} S^{2} (z) p_{θ}^{α} (z) d z & = \frac{\int_{z \in R^{N}} {(\sum_{k} e^{z^{(i)}})}^{2} e x p (- α e^{θ} \sum_{k} e^{z^{(i)}}) d z}{E_{1} {(0)}^{α N}} . \end{matrix}

Which is proportional to:

\int_{u_{1} = 0}^{+ \infty} u_{1} \int_{u_{2} = u_{1}}^{+ \infty} u_{2} . . . \int_{u_{N} = u_{N - 1}}^{+ \infty} u_{N}^{3} e^{- α e^{θ} u_{N}} d u_{N} . . . d u_{2} d u_{1} .

where we removed the finite constant

\frac{1}{E_{1} {(0)}^{α N}}

for clarity. This integral is finite for any

θ

because the exponential is the dominant term around

+ \infty

. Let us now prove that

θ \mapsto \int_{z} S^{2} (z) p_{θ}^{α} (z) d z

is continuous. From Equation (12), we have that

$z \mapsto S^{2} (z) p_{θ}^{α} (z)$ is measurable on $R^{N}$ .
$θ \mapsto S^{2} (z) p_{θ}^{α} (z)$ is continuous on K (and on $Θ = R$ ).
With $θ_{M} : = \underset{θ \in K}{m i n} θ$ , then $\forall θ \in K, 0 \leq S^{2} (z) p_{θ}^{α} (z) \leq S^{2} (z) p_{θ_{M}}^{α} (z)$

Since we have proven that

S^{2} (z) p_{θ_{M}}^{α} (z) < \infty

, then we can apply the intervertion Theorem and state that

θ \mapsto \int_{z} S^{2} (z) p_{θ}^{α} (z) d z

is continuous.

It directly follows that:

\forall α \in R_{+}^{*}, \sup_{θ \in K} \int_{z} S^{2} (z) p_{θ}^{α} (z) d z < \infty .

Note that after the change of variable, the integral could be computed explicitly, but involves N successive integration of polynomial × exponential function products of the form

P (x) e^{- α e^{θ} x}

. This would get tedious, especially since after each successful integration, the product with the next integration variable

u_{k - 1}

increases by one the degree of the polynomial, i.e., starting from 3, the degree ends up being

N + 2

. We chose a faster path.

2.4. Tempered Riemann Approximation EM

Context, Theorem and Proof

The Riemann approximation of Section 2.2 makes the EM computations possible in hard cases, when the conditional distribution has no analytical form for instance. It is an alternative to the many stochastic approximation methods (SAEM, MCMC-SAEM, etc.) that are commonly used in those cases. The tempering approximation of Section 2.3 is used to escape the initialisation by allowing the procedure to explore more the likelihood profile before committing to convergence. We showed that both these approximation are particular cases of the wider class of Deterministic Approximate EM, introduced in Section 2.1. However, since they fulfil different purposes, it is natural to use them in coordination and not as alternatives of one another. In this section, we introduce another instance of the Approximate EM: a combination of the tempered and Riemann sum approximations. This “tempered Riemann approximation EM” (tmp-Riemann approximation) can compute EM steps when there is no closed form thanks to the Riemann sums as well as escape the initialisation thanks to the tempering. For a bounded latent variable

z \in [0, 1]

, we define the approximation as:

{\tilde{p}}_{n, θ} (z) : = h {(⌊ n z ⌋ / n; θ)}^{\frac{1}{T_{n}}} / \int_{z^{'}} h {(⌊ n z^{'} ⌋ / n; θ)}^{\frac{1}{T_{n}}} d z^{'}

, for a sequence

{T_{n}}_{n} \in {(R_{+}^{*})}^{N}, T_{n} \underset{n \to \infty}{⟶} 1

.

In the following Theorem, we prove that the tempered Riemann approximation EM verifies the applicability conditions of Theorem 1 with no additional hypothesis from the regular Riemann approximation EM covered by Theorem 2.

Theorem 4.

Under conditions

M 1 - 3

of Theorem 1, and when z is bounded, the (Stable) Approximate EM with

{\tilde{p}}_{n, θ} (z) : = \frac{h {(⌊ n z ⌋ / n; θ)}^{\frac{1}{T_{n}}}}{\int_{z^{'}} h {(⌊ n z^{'} ⌋ / n; θ)}^{\frac{1}{T_{n}}} d z^{'}}

, which we call “tempered Riemann approximation EM”, verifies the remaining conditions of applicability of Theorem 1 as long as

z \mapsto S (z)

is continuous and

{T_{n}}_{n} \in {(R_{+}^{*})}^{N}, T_{n} \underset{n \to \infty}{⟶} 1

.

Proof.

This proof of Theorem 4 is very similar to the proof of Theorem 2 for the regular Riemann approximation EM. The first common element is that for the tempered Riemann approximation EM, the only remaining applicability condition of the general Theorem 1 to prove is also:

\forall compact K \subseteq Θ, \sup_{θ \in K} \int_{z} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0 .

In the proof of Theorem 2, we proved that having the uniform convergence of the approximated complete likelihood

{{\tilde{h}}_{n}}_{n}

towards the real h - with both

{\tilde{h}}_{n} (z; θ)

and

h (z; θ)

uniformly bounded - was sufficient to fulfil this condition. Hence, we prove In this section, that these sufficient properties still hold, even with the tempered Riemann approximation, where

{\tilde{h}}_{n} (z; θ) : = h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}

.

We recall that

h (z; θ)

is uniformly continuous on the compact set

[0, 1] \times K

, and verifies:

0 < m \leq h (z; θ) \leq M < \infty .

where m and M are constants independent of z and

θ

.

Since

T_{n} > 0, T_{n} \underset{n \to \infty}{⟶} 1

, then the sequence

{1 / T_{n}}_{n}

is bounded. Since

{\tilde{h}}_{n} (z; θ) = h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}

, with

0 < m \leq h (⌊n z⌋ / n; θ) \leq M < \infty

for any

z, θ

and n, then we also have:

0 < m^{'} \leq {\tilde{h}}_{n} (z; θ) \leq M^{'} < \infty,

with

m^{'}

and

M^{'}

constants independent of

z, θ

and n.

We have seen in the proof of Theorem 2, that:

\forall ϵ > 0, \exists N \in N, \forall n \geq N, \forall (z, θ) \in [0, 1] \times K, |h (z; θ) - h (⌊n z⌋ / n; θ)| \leq ϵ .

To complete the proof, we control in a similar way the difference

h (⌊n z⌋ / n; θ) - h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}

. The function

(h, T) \in [m, M] \times [T_{m i n}, T_{m a x}] \mapsto h^{\frac{1}{T}} \in R

is continuous on a compact, hence uniformly continuous in

(h, T)

. As a consequence:

\forall ϵ > 0, \exists δ > 0, \forall (h, h^{'}) \in {[m, M]}^{2}, (T, T^{'}) \in {[T_{m i n}, T_{m a x}]}^{2},

|h - h^{'}| \leq δ and |T - T^{'}| \leq δ \Rightarrow |h^{\frac{1}{T}} - {(h^{'})}^{\frac{1}{T^{'}}})| \leq ϵ .

Hence, with

N \in N

such that

\forall n \geq N, |T_{n} - 1| \leq δ

, we have:

\begin{matrix} \forall n \geq N, & \forall (z, θ) \in [0, 1] \times K, \\ |h (⌊n z⌋ / n; θ) - h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}| \leq ϵ . \end{matrix}

In the end,

\forall ϵ > 0, \exists N \in N, \forall n \geq N, \forall (z, θ) \in [0, 1] \times K

:

\begin{matrix} |h (z; θ) - {\tilde{h}}_{n} (z; θ)| & = |h (z; θ) - h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}| \\ \leq |h (z; θ) - h (⌊n z⌋ / n; θ)| + |h (⌊n z⌋ / n; θ) - h {(⌊n z⌋ / n; θ)}^{\frac{1}{T_{n}}}| \\ \leq 2 ϵ . \end{matrix}

In other words, we have the uniform convergence of

{{\tilde{h}}_{n}}

towards h. From there, we conclude following the same steps as in the proof of Theorem 2. □

3. Results

In this section, we describe experiments that explore each of the three studied methods: Riemann EM, tempered EM and tempered Riemann.

3.1. Riemann Approximation EM: Two Applications

3.1.1. Application to a Gaussian Model with the Beta Prior

We demonstrate the interest of the method on a example with a continuous bounded random variable following a Beta distribution

z \sim B e t a (α, 1)

, and an observed random variable following

x \sim N (λ z, σ^{2})

. In other words, with

ϵ \sim N (0, 1)

independent of z:

x = λ z + σ ϵ .

This results in a likelihood belonging to the exponential family:

h (z; θ) = \frac{α z^{α - 1}}{\sqrt{2 π σ^{2}}} e x p (- \frac{{(x - λ z)}^{2}}{2 σ^{2}}) .

Since z is bounded, and everything is continuous in the parameter

(α, λ, σ^{2})

, this model easily verifies each of the conditions M1–3. The E step with this model involves the integral

\int_{z} z^{α} e x p (- \frac{{(x - λ z)}^{2}}{2 σ^{2}}) d z

, a fractional moment of the Gaussian distribution. Theoretical formulas exists for these moments, see [40], however they involve Kummer’s confluent hypergeometric functions, which are infinite series. Instead, we use the Riemann approximation to run the EM algorithm with this model:

{\tilde{h}}_{n} (z; θ) : = h (⌊ φ (n) z ⌋ / φ (n); θ)

. As done previously, we take, without loss of generality,

φ (n) : = n

for the sake of simplicity. The E step only involves the n different values taken by the step function probabilities

h (⌊ n z ⌋ / n; θ)

:

{\tilde{p}}_{θ, n}^{(i)} (\frac{k}{n}) = \frac{h^{(i)} (\frac{k}{n}; θ)}{\frac{1}{n} \sum_{l = 0}^{n - 1} h^{(i)} (\frac{l}{n}; θ)} .

where the exponent

(i)

indicates the index of the observation

x^{(i)}

. To express the corresponding M step in a digest way, let us define the operator

Ψ^{(i)} : R^{[0, 1]} ⟶ R

such that, for any

f : [0, 1] ⟶ R

:

Ψ^{(i)} \circ f = \sum_{k = 0}^{n - 1} {\tilde{p}}_{θ, n}^{(i)} (\frac{k}{n}) \int_{z = k / n}^{(k + 1) / n} f (z) d z .

Then, the M step can be expressed as:

\begin{matrix} \frac{1}{\hat{α}} & = - \frac{1}{N} \sum_{i = 1}^{N} Ψ^{(i)} \circ ln (z), \\ \hat{λ} & = \frac{\sum_{i = 1}^{N} Ψ^{(i)} \circ (x^{(i)} z)}{\sum_{i = 1}^{N} Ψ^{(i)} \circ z^{2}}, \\ {\hat{σ}}^{2} & = \frac{1}{N} \sum_{i = 1}^{N} Ψ^{(i)} \circ {(x^{(i)} - \hat{λ} z)}^{2} . \end{matrix}

(13)

where we took the liberty of replacing f by

f (z)

in these equations for the sake of simplicity. Here N is the total number of observations:

x : = (x^{(1)}, . . ., x^{(N)})

iid.

We test this algorithm on synthetic data. With real values

α = 2, λ = 5, σ = 1.5

, we generate a dataset with

N = 100

observations and run the Riemann EM with random initialisation. This simulation is ran 2000 times. We observe that the Riemann EM is indeed able to increase the likelihood, despite the EM being originally intractable. On Figure 1, we display the average trajectory, with standard deviation, of the negative log-likelihood −

ln (g (θ))

during the Riemann EM procedure. The profile is indeed decreasing. The standard deviation around the average value is fairly high, since each run involves a different dataset and a different random initialisation, hence different value of the likelihood, but the decreasing trend is the same for all of the runs. We also display the average relative square errors on the parameters at the end of the algorithm. They are all small, with reasonably small standard deviation, which indicates that the algorithm consistently recovers correctly the parameters.

To evaluate the impact of the number of Riemann intervals

φ (n)

, we run a second experiment where we compare four different profiles over 50 simulations:

\begin{matrix} (low) φ_{1} (n) & : = n + 1 \\ (medium) φ_{2} (n) & : = n + 100 \\ (high) φ_{3} (n) & : = n + 1000 \\ (linear) φ_{4} (n) & : = 10 \times n + 1 . \end{matrix}

The results are displayed on Figure 2. We can see that, despite the very different profiles, the optimisations are very similar. The “low” profile performs slightly worst, which indicates that a high number of Riemann intervals is most desirable in practice. As long as this number is high enough, Figure 2 suggests that the performances will not depend too much on the profile.

3.1.2. Application in Two Dimensions

The difficulty faced by Riemann methods in general is their geometric complexity when the dimension increases. In this section, we propose a similar experiment in two dimensions to show that the method is still functional and practical in that case.

For this 2D-model, we consider two latent independent Beta random variables

z_{1} \sim B e t a (α_{1}, 1)

and

z_{2} \sim B e t a (α_{2}, 1)

, and two observed variables defined as:

\begin{matrix} x_{1} & = λ_{1} z_{1} + z_{2} + σ_{1} ϵ_{1} \\ x_{2} & = z_{1} + λ_{2} z_{2} + σ_{2} ϵ_{2}, \end{matrix}

with

ϵ_{1} \sim N (0, 1)

,

ϵ_{2} \sim N (0, 1)

, and

(z_{1}, z_{2}, ϵ_{1}, ϵ_{2})

independent. The 2-dimension version of the Riemann E step with n intervals on each dimension is:

{\tilde{p}}_{θ, n}^{(i)} (\frac{k_{1}}{n}, \frac{k_{2}}{n}) = \frac{h^{(i)} (\frac{k_{1}}{n}, \frac{k_{2}}{n}; θ)}{\frac{1}{n^{2}} \sum_{l_{1}, l_{2} = 0}^{n - 1} h^{(i)} (\frac{l_{1}}{n}, \frac{l_{2}}{n}; θ)} .

As before, we define an operator

Ψ^{(i)} : R^{{[0, 1]}^{2}} ⟶ R

such that, for any

f : {[0, 1]}^{2} ⟶ R

:

Ψ^{(i)} \circ f = \sum_{k_{1}, k_{2} = 0}^{n - 1} {\tilde{p}}_{θ, n}^{(i)} (\frac{k_{1}}{n}, \frac{k_{2}}{n}) \int_{z_{1}, z_{2} = k / n}^{(k + 1) / n} f (z_{1}, z_{2}) d z .

Then, the M step can be expressed as:

\begin{matrix} \frac{1}{{\hat{α}}_{1}} & = - \frac{1}{N} \sum_{i = 1}^{N} Ψ^{(i)} \circ ln (z_{1}), \\ {\hat{λ}}_{1} & = \frac{\sum_{i} Ψ^{(i)} \circ (x_{1}^{(i)} z_{1} - z_{2} z_{1})}{\sum_{i} Ψ^{(i)} \circ z_{1}^{2}}, \\ {\hat{σ}}_{1} & = \frac{1}{N} \sum_{i = 1}^{N} Ψ^{(i)} \circ {(x_{1}^{(i)} - {\hat{λ}}_{1} z_{1} - z_{2})}^{2}, \end{matrix}

with symmetric formulas for

{\hat{α}}_{2}, {\hat{λ}}_{2}

and

{\hat{σ}}_{2}

.

For the simulations, we take

(α_{1}, α_{2}) = (1, 3)

,

(λ_{1}, λ_{2}) = (10, - 10)

and

(σ_{1}, σ_{2}) = (2, 3)

. From the previous experiment, we keep only the two least costly profiles: “low”

φ_{1} (n) : = n + 1

and “medium”

φ_{2} (n) : = n + 100

. We also consider two new, sub-linear, profiles “square root”

φ_{5} (n) : = ⌊ \sqrt{n} ⌋ + 1

and “5 square root”

φ_{6} (n) : = 5 \times ⌊ \sqrt{n} ⌋

.

φ_{5} (n)

and

φ_{6} (n)

are designed to have linear complexity even in 2-dimensions.

The results of the EM algorithm runs are displayed on Figure 3. On the left, we follow the number of Riemann squares mapping the 2D space. The difference in computational complexity between profiles is accentuated by the higher dimension. In particular, “medium” performs 6.7 times more computations than “low” and 18.4 times more than “5 square root”. However, on the right of Figure 3, we observe that these three profiles perform similar optimisations. This observation justifies cutting computation costs by using lower resolution profiles to compensate the higher dimension.

3.2. Tempered EM: Application to Mixtures of Gaussian

3.2.1. Context and Experimental Protocol

In this section, we will assess the capacity of tmp-EM to escape from deceptive local maxima. We compare the classical EM to tmp-EM with both a monotonous and an oscillating temperature profile on a very well know toy example: likelihood maximisation within the Gaussian Mixture Model. We confront the algorithms to situations where the true classes have increasingly more ambiguous positions, combined with initialisations designed to be hard to escape from. Although the EM is an optimisation procedure, and the log-likelihood reached is a critical metric, in this example, we put more emphasis on the correct positioning of the cluster centroids, that is to say on the recovery of the

μ_{k}

. The other usual metrics are also in favour of tmp-EM, and can be found in Supplementary Materials.

For the sake of comparison, the experimental design is similar to the one in [10] on the tmp-SAEM. It is as follows: we have three clusters of similar shape and same weight. One is isolated and easily identifiable. The other two are next to one another, in a more ambiguous configuration. Figure 4 represents the three, gradually more ambiguous configurations. Each configuration is called a “parameter family”.

We use two different initialisation types to reveal the behaviours of the three EMs. The first—which we call “barycenter”—puts all three initial centroids at the centre of mass of all the observed data points. However, none of the EM procedures would move from this initial state if the three GMM centroids were at the exact same position, hence we actually apply a tiny perturbation to make them all slightly distinct. The blue crosses on Figure 5 represent a typical barycenter initialisation. With this initialisation method, we assess whether the EM procedures are able to correctly estimate the positions of the three clusters, despite the ambiguity, when starting from a fairly neutral position, providing neither direction nor misdirection. On the other hand, the second initialisation type - which we call “2v1” - is voluntarily misguiding the algorithm by positioning two centroids on the isolated right cluster and only one centroid on the side of the two ambiguous left clusters. The blue crosses on Figure 6 represent a typical 2v1 initialisation. This initialisation is intended to assess whether the methods are able to escape the potential well in which they start and make theirs centroids traverse the empty space between the left and right clusters to reach their rightful position. For each of the three parameter families represented on Figure 4, 1000 datasets with 500 observations each are simulated, and the three EMs are ran with both the barycenter and the 2v1 initialisation.

For tmp-EM, we try two profiles. First, a simple decreasing exponential profile as seen in [31]:

T_{n} = 1 + (T_{0} - 1) exp (- r . n)

. Through a grid search, the values

T_{0} = 5

,

r = 2

for the barycenter initialisation and

T_{0} = 100 r = 1.5

for the 2v1 initialisation are picked for this profile. Since Theorem 3 only requires

T_{n} ⟶ 1

, we also propose an oscillating profile inspired from [10]. The exact formula of these oscillations is:

T_{n} = t h (\frac{n}{2 r}) + (T_{0} - b \frac{2 \sqrt{2}}{3 π}) a^{n / r} + b s i n c (\frac{3 π}{4} + \frac{n}{r})

. Where

0 < T_{0}

,

0 < r

,

0 < b

and

0 < a < 1

. The oscillations in this profile are meant to achieve a two-regimes behaviour. When the temperature reaches low values, the convergence speed is momentarily increased which has the effect of “locking-in” some of the most obviously good decisions of the algorithm. Then, the temperature is re-increased to continue the exploration on the other, more ambiguous, parameters. Those two regimes are alternated in succession with gradually smaller oscillations, resulting in a multi-scale procedure that “locks-in” gradually harder decisions. For some hyper-parameter combinations, the sequence

T_{n}

can have a (usually small) finite number of negative values. Since only the asymptotic behaviour of

T_{n}

is the step n matters for convergence, then the theory allows a finite number of negative values. However, in practice, at least for the sake of interpretation, one may prefer to use only positive values for

T_{n}

. In which case, one can either restrain themselves to parameter combinations that result in no negatives values for

T_{n}

, or enforce positivity by taking

T_{n} \leftarrow m a x (T_{n}, ϵ)

with a certain

ϵ > 0

.

For our experiments, we select the hyper-parameter values with a grid-search. The “normalised”

s i n c

function is used

s i n c (x) = s i n (π x) / (π x)

and the chosen tempering parameters are

T_{0} = 5, r = 2, a = 0.6, b = 20

for the experiments with the barycenter initialisation, and

T_{0} = 100, r = 1.5, a = 0.02, b = 20

for the 2v1 initialisation. We have two different sets of tempering hyper-parameters values, one for each of the two very different initialisation types. However, these values then remain the same for the three different parameter families and for every data generation within them. This underlines that the method is not excessively sensitive to the tempering parameters, and that the prior search for good hyper-parameter values is a worthwhile time investment. Likewise, a simple experiment with 6 clusters, in supplementary materials, demonstrates that the same hyper-parameters can be kept over different initialisation (and different data generations as well) when they were made in a non-adversarial way, by drawing random initial centroids uniformly among the data points.

3.2.2. Experimental Results Analysis

In this section, we analyse the results of EM, decreasing tmp-EM and oscillating tmp-EM over all the simulations.

First, an illustration: Figure 5 and Figure 6 depict the final states of EM and oscillating tmp-EM on one typical simulation for each of the three ambiguity level (the three parameter families) starting from the barycenter and 2v1 initialisation respectively. The simulated data are represented by the green crosses. The initial centroids are in blue. The orange cross represents the estimated centroids positions

{\hat{μ}}_{k}

, and the orange confidence ellipses are visual representations of the estimated covariance matrices

{\hat{Σ}}_{k}

. In Supplementary Materials, we show step by step the path taken by the estimated parameters of tmp-EM before convergence, providing much more detail on the method’s behaviours. These illustrative examples show oscillating tmp-EM better succeeding at clustering recovery than the classical EM. The results over all simulations are aggregated in Table 1, and confirm this observation.

Table 1 presents the average and the standard deviation of the relative

l_{2}

error on

μ_{k}

of the EMs. For each category, the better result over the three EM is highlighted in bold. The recovery of the true class averages

μ_{k}

is spotlighted as it is the essential success metric for this experiment. More specifically, class 1 and 2, the two leftmost classes, are the hardest to correctly recover and the ones whose estimation is the differentiating criterion between the algorithms. Indeed, as can be seen in Table 1,

μ_{3}

is always well estimated by all methods. Hence, in the following, we discuss the error on

μ_{1}

and

μ_{2}

.

With the barycenter initialisation, classical EM and decreasing tmp-EM have similar average relative error levels. With classical EM actually being slightly better. However, oscillating tmp-EM is much better than both of them, with error levels smaller by a factor of 10 on parameter families 1 and 2, and by a factor of 5 on parameter family 3. The standard deviation of oscillating tmp-EM is also lower, by a factor of roughly 3 on parameter families 1 and 2, and by a factor of 2 on parameter family 3. With the 2v1 initialisation, all error levels are higher. This time, decreasing tmp-EM is better in average than classical EM by a factor of 1.7 to 1.4, depending on the parameter family. In turn, oscillating tmp-EM is better than decreasing tmp-EM by a factor of 3.1 to 3.4 depending on the parameter family. Its standard deviation is also lower by a factor of about 2.

Overall, oscillating tmp-EM dominates the simulation. Its error rates on the recovery of

μ_{1}

and

μ_{2}

are always the best, and they remain at low levels even with the most adversarial initialisations. To bolster this last point, we underline that the highest relative error reached by oscillating tmp-EM over all the various scenarios (0.39 on parameter family 3 with 2v1 initialisation) is still lower than the lowest relative error of both classical EM (0.52 on parameter family 1 with barycenter initialisation) and decreasing tmp-EM (0.60 on parameter family 1 with barycenter initialisation).

3.3. Tempered Riemann Approximation EM: Application to a Gaussian Model with Beta Prior

We illustrate the method with the model of Section 3.1.1:

h (z; θ) = \frac{α z^{α - 1}}{\sqrt{2 π σ^{2}}} e x p (- \frac{{(y - λ z)}^{2}}{2 σ^{2}}) .

We apply the tempered Riemann approximation. As in Section 3.1.1, the resulting conditional probability density is a step function defined by the n different values it takes on

[0, 1]

. For the observation

x^{(i)}

,

\forall k \in 〚 0, n - 1 〛

:

{\tilde{p}}_{θ, n}^{(i)} (\frac{k}{n}) = \frac{h^{(i)} {(\frac{k}{n}; θ)}^{\frac{1}{T_{n}}}}{\frac{1}{n} \sum_{l = 0}^{n - 1} h^{(i)} {(\frac{l}{n}; θ)}^{\frac{1}{T_{n}}}} .

The M step, seen in Equation (13), is unchanged. We compare the tempered Riemann EM to the simple Riemann EM on a case where the parameters are ambiguous. With real parameters

α = 0.1, λ = 10, σ = 0.8

, for each of the 100 simulations, the algorithms are initialised at

α_{0} = 10, λ_{0} = 1, σ_{0} = 7

. The initialisation is somewhat adversarial, since the mean and variance of the marginal distribution of y are approximately the same with the real of the initialisation parameter, even though the distribution is different. Figure 7 shows that the tempered Riemann EM better escapes the initialisation than the regular Riemann EM, and reaches errors on the parameters orders of magnitude below. The tempering parameters are here

T_{0} = 150, r = 3, a = 0.02, b = 40

.

4. Discussion and Conclusions

We proposed the Deterministic Approximate EM class to bring together the many possible deterministic approximations of the E step. We proved a unified Theorem, with mild conditions on the approximation, which ensures the convergence of the algorithms in this class. Then, we showcased members of this class that solve the usual practical issues of the EM algorithm. For intractable E steps in low dimension, we introduced the Riemann approximation EM, a less parametric and deterministic alternative to the extensive family of MC-EM. We showed on an empirical intractable example how the Riemann approximation EM was able to increase the likelihood and recover every parameter in a satisfactory manner with its simplest design, and no hyper parameter optimisation.

Second, we studied the tempered EM, introduced by Ueda and Nakano [31] to escape the attractive sub-optimal local extrema of non-convex objectives. We proved that tmp-EM is a specific case of the Deterministic Approximate EM, benefiting from the convergence property as long as the temperature profile converges towards 1. This mild condition justifies the use of many more temperature profiles than the ones tried in [31,32]. To illustrate the interest of complex, non-monotonous, temperature profiles, we demonstrated on experiments with adversarial initial positions the superiority of an oscillating profile over a simple decreasing one.

Finally, we added the Riemann approximation in order to apply the tempering in intractable cases. We were then able to show that the tmp-Riemann approximation massively improved the performances of the Riemann approximation, when the initialisation is ambiguous.

Future works will improve both methods. The Riemann approximation will be generalised to be applicable even when the latent variable is not bounded, and an intelligent slicing of the integration space will improve the computational performances in high dimension. Regarding the tempered EM, since the theory allows the usage of any temperature profile, the natural next step is to look for efficient profiles with few hyper-parameters for fast tuning. Afterwards, implementing an adaptive tuning of the temperature parameters during the procedure will remove the necessity for preliminary grid search altogether.

Supplementary Materials

The following are available at https://www.mdpi.com/article/10.3390/a15030078/s1.

Author Contributions

Conceptualization, T.L. and S.A.; methodology, T.L. and S.A.; software, T.L.; validation, T.L. and S.A.; formal analysis, T.L. and S.A.; investigation, T.L.; resources, S.D. and S.A.; data curation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, T.L., S.D. and S.A.; visualization, T.L.; supervision, S.D and S.A.; project administration, S.A.; funding acquisition, S.D and S.A. All authors have read and agreed to the published version of the manuscript.

Funding

The research leading to these results has received funding from the European Research Council (ERC) under grant agreement No 678304, European Union’s Horizon 2020 research and innovation program under grant agreement No 666992 (EuroPOND) and No 826421 (TVB-Cloud), and the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06 (IHU-A-ICM).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of the Two Main Theorems

In this Appendix, we provide in full details the proofs that were sketched in the main body of the paper. Appendix A.1 details the proof of convergence for the Deterministic Approximate EM algorithm, our central result, Theorem 1 of the paper. Appendix A.2 details the proof of convergence for tmp-EM, Theorem 3 of the paper.

Appendix A.1. Proof of the General Theorem

In this Section, we prove Theorem 1, which guarantees the convergence of the Deterministic Approximate EM algorithm.

The proof is based on the application of Propositions 1 and 2, taken from Fort and Moulines [7].

We need to prove that, under the conditions of Theorem 1, we verify the conditions of Propositions 1 and 2. Then we will have the results announced in Theorem 1.

Appendix A.1.1. Verifying the Conditions of Proposition 2

g is the likelihood function of a model of the curved exponential family. Let T be the point to point map describing the transition between

θ_{n}

and

θ_{n + 1}

in the exact EM algorithm.

L

the set of stationary points by T:

L : = \{θ \in Θ | T (θ) = θ\}

. (Note that if g is differentiable, the general properties of the EM tell us that its critical points of g are the stationary points:

L = \{θ \in Θ | \nabla g (θ) = 0\}

). Additionally, g is a

C^{0}

Lyapunov function associated to

(T, L)

. Let

{\{θ_{n}\}}_{n}

be the sequence defined by the stable approximate EM with

{F_{n}}_{n \in N}

our sequence of point to point maps.

We verify that under this framework—and with the assumptions of Theorem 1—we check the conditions of Proposition 2.

As in [7], M1–3 directly implies C1–2.

Let us show that we have the last two conditions for Proposition 2:

\forall θ \in K_{0}, lim_{n \to \infty} | g \circ F_{n} - g \circ T | (θ) = 0,

(A1)

and

\forall compact K \subseteq Θ, lim_{n \to \infty} | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{θ_{n} \in K} = 0 .

(A2)

We focus on (A2), since (A1) is easier to verify and will come from the same reasoning. The first steps are similar to [7]. We underline the most consequent deviations from the proof of [7] when they occur.

Equivalent Formulation of the Convergence

We write Equation (A2) under an equivalent form. Let

{\tilde{S}}_{n} (θ) : = {\{\int_{z} S_{u} (z) {\tilde{p}}_{θ, n} (z) d z\}}_{u = 1}^{q}

and

\bar{S} (θ) : = {\{\int_{z} S_{u} (z) p_{θ} (z) d z\}}_{u = 1}^{q}

. Then

F_{n} (θ_{n}) = \hat{θ} ({\tilde{S}}_{n} (θ_{n}))

and

T (θ_{n}) = \hat{θ} (\bar{S} (θ_{n}))

. Hence

| g \circ F_{n} (u_{n}) - g \circ T (u_{n}) | = | g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) |

. To show Equation (A2):

| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \underset{n \to \infty}{\to} 0,

it is sufficient and necessary to have:

\forall ϵ > 0, \exists N \in N, \forall n \geq N, | g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \leq ϵ .

An other equivalent formulation is that there are a finite number of integers n such that

| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ

, in other words:

\forall ϵ > 0, \sum_{n = 1}^{\infty} 𝟙_{| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ} < \infty .

Use the Uniform Continuity

We aim to relate the proximity between the images

g \circ \hat{θ}

of to the proximity between the antecedents of

g \circ \hat{θ}

. The function

g \circ \hat{θ} : R^{q} \to R

is continuous, but not necessarily uniformly continuous on

R^{q}

. As a consequence, we will need to restrict ourselves to a compact to get uniform continuity properties. We already have a providen compact K.

\tilde{S} : Θ \to R^{l}

is continuous, hence

S (K)

is a compact as well. Let

δ

be a strictly positive real number. Let

\bar{S} (K, δ) : = \{s \in R^{q} | \underset{t \in K}{i n f} | | \bar{S} (t) - s | | \leq δ\}

. Where we use any norm

| | . | |

on

R^{q}

since they are all equivalent.

\bar{S} (K, δ)

is a compact set as well. As a consequence

g \circ θ

is uniformly continuous on

\bar{S} (K, δ)

, which means that:

\forall ϵ > 0, \exists η (ϵ, δ) > 0, \forall x, y \in \bar{S} (K, δ), ∥x - y∥ \leq η (ϵ, δ) \Rightarrow | g \circ \hat{θ} (x) - g \circ \hat{θ} (y) | \leq ϵ .

(A3)

Let us show that, with

α : = m i n (δ, η (ϵ, δ))

,

\forall n,

| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ \Rightarrow ∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} > α .

(A4)

To that end, we show that:

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \leq α \Rightarrow | g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \leq ϵ .

Let us assume that

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \leq α

.

If

θ_{n} \notin K

, then

| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} = 0 \leq ϵ

.

If, in contrary,

θ_{n} \in K

, then

\bar{S} (θ_{n}) \in \bar{S} (K) \subset \bar{S} (K, δ)

.

Since

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ = ∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \leq α \leq δ

, then

{\tilde{S}}_{n} (θ_{n}) \in \bar{S} (K, δ)

.

Since

(\bar{S} (θ_{n}), {\tilde{S}}_{n} (θ_{n})) \in \bar{S} {(K, δ)}^{2}

and

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ \leq α \leq η (ϵ, δ)

, then we get from Equation (A3)

| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \leq ϵ .

In both cases, we get that:

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \leq α \Rightarrow | g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \leq ϵ,

which proves Equation (A4).

Sufficient Condition for Convergence

We use Equation (A4) to find a sufficient condition for (A2). This part differs from [7] as our approximation is not defined as a random sum. Equation (A4) is equivalent to

𝟙_{| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ} \leq 𝟙_{∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} > α} .

From that, we get

\forall ϵ > 0, \exists α > 0 \sum_{n = 1}^{\infty} 𝟙_{| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ} \leq \sum_{n = 1}^{\infty} 𝟙_{∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} > α} .

As a consequence, if

\forall α > 0, \sum_{n = 1}^{\infty} 𝟙_{∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} > α} < \infty

Then

\forall ϵ > 0, \sum_{n = 1}^{\infty} 𝟙_{| g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} > ϵ} < \infty

In other, equivalent, words:

\begin{matrix} If & ∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \underset{n \to \infty}{⟶} 0 \\ Then & | g \circ \hat{θ} ({\tilde{S}}_{n} (θ_{n})) - g \circ \hat{θ} (\bar{S} (θ_{n})) | 𝟙_{θ_{n} \in K} \underset{n \to \infty}{⟶} 0 . \end{matrix}

(A5)

Hence, having for all compact sets

K \subset Θ, ∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \underset{n \to \infty}{⟶} 0

is sufficient to have the desired condition (A2). Similarly, we find that

\forall θ \in K_{0}

:

\begin{matrix} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ \underset{n \to \infty}{⟶} 0 \\ \Rightarrow & | g \circ \hat{θ} ({\tilde{S}}_{n} (θ)) - g \circ \hat{θ} (\bar{S} (θ)) | \underset{n \to \infty}{⟶} 0, \end{matrix}

(A6)

which provides us a sufficient condition for (A1).

Further Simplifications of the Desired Result with Successive Sufficient Conditions

We find another, simpler, sufficient condition for (A2) from Equation (A5). This part is unique to our proof and absent from [7]. It is here that we relate the formal conditions of Proposition 2 to the specific assumptions of our Theorem 1.

We first remove the dependency on the terms

{θ_{n}}_{n}

of the EM sequence:

∥{\tilde{S}}_{n} (θ_{n}) - \bar{S} (θ_{n})∥ 𝟙_{θ_{n} \in K} \leq \sup_{θ \in K} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ .

(A7)

From Equations (A5)–(A7) we get that:

\forall compact K \subset Θ, \sup_{θ \in K} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ \underset{n \to \infty}{⟶} 0,

is a sufficient condition to have both Equations (A1) and (A2).

To show that the hypotheses of Theorem 1 imply this sufficient condition, we express it in integral form. Let

S = {\{S_{u}\}}_{u = 1, . . ., q}

. We recall that

{\tilde{S}}_{n} (θ) = {\{\int_{z} S_{u} (z) {\tilde{p}}_{θ, n} (z) d z\}}_{u = 1}^{q}

and

\bar{S} (θ) = {\{\int_{z} S_{u} (z) p_{θ} (z) d z\}}_{u = 1}^{q}

. Hence:

{\tilde{S}}_{n} (θ) - \bar{S} (θ) = {\{\int_{z} S_{u} (z) ({\tilde{p}}_{θ, n} (z) - p_{θ} (z)) d z\}}_{u = 1}^{q} .

These q terms can be upper bounded by two different terms depending on the existence of the involved quantities:

\int_{z} S_{u} (z) ({\tilde{p}}_{θ, n} (z) - p_{θ} (z)) d z \leq {(\int_{z} S_{u} {(z)}^{2} d z)}^{\frac{1}{2}} {(\int_{z} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z)}^{\frac{1}{2}},

and

\int_{z} S_{u} (z) ({\tilde{p}}_{θ, n} (z) - p_{θ} (z)) d z \leq {(\int_{z} S_{u} {(z)}^{2} p_{θ} (z) d z)}^{\frac{1}{2}} {(\int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z)}^{\frac{1}{2}} .

As a consequence, if

\int_{z} S_{u} {(z)}^{2} d z

exists, then it is sufficient to show have:

\sup_{θ \in K} \int_{z} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0,

and if

\int_{z} S_{u} {(z)}^{2} p_{θ} (z) d z

exists, then it is sufficient to show have:

\sup_{θ \in K} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0 .

Among the assumptions of Theorem 1 is one that states that for all compacts

K \subseteq Θ

, one of those scenarios has to be true. Hence our sufficient condition is met.

Conclusions

With the hypotheses of Theorem 1, we have

\forall compact K \subseteq Θ, \sup_{θ \in K} ∥{\tilde{S}}_{n} (θ) - \bar{S} (θ)∥ \underset{n \to \infty}{⟶} 0,

which is a sufficient condition to verify both Equations (A1) and (A2). With these two conditions, we can apply Proposition 2.

Appendix A.1.2. Applying Proposition 2

Since we verify all the conditions of Proposition 2, we can apply its conclusions:

With probability 1, \underset{n \to \infty}{lim \sup} p_{n} < \infty and {\{θ_{n}\}}_{n} compact sequence,

which is specifically the result

(i) (a)

of Theorem 1.

Appendix A.1.3. Verifying the Conditions of Proposition 1

With Proposition 1, we prove the remaining points of Theorem 1:

(i) (b)

and

(i i)

.

For the application of Proposition 1:

$C l ({\{θ_{n}\}}_{n})$ (set closure) plays the part of the compact K
$\{θ \in Θ | T (θ) = θ\}$ plays the part of the set $L$
$L \cap K$ is also compact thanks to hypothesis $M 3$
The likelihood g is the $C^{0}$ Lyapunov function with regards to $(T, L)$
${\{θ_{n}\}}_{n}$ is the K valued sequence (since K is $C l ({\{θ_{n}\}}_{n})$ ).

The last condition that remains to be shown to apply Proposition 1 is that:

lim_{n \to \infty} | g (θ_{n + 1}) - g \circ T (θ_{n}) | = 0 .

We have more or less already proven that, in the previous section of the Proof, with

F_{n} (θ_{n})

in place of

θ_{n + 1}

. The only indices where

F_{n} (θ_{n}) \neq θ_{n + 1}

are when the value of the sequence

p_{n}

experiences an increment of 1. We have proven with Proposition 2 that there is only a finite number of such increments.

| g (θ_{n + 1}) - g \circ T (θ_{n}) | = | g (θ_{0}) - g \circ T (θ_{n}) | 𝟙_{p_{n + 1} = p_{n} + 1} + | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{p_{n + 1} = p_{n}} .

Since there is only a finite number of increments of the value of

p_{n}

, then

\exists N \in N, \forall n \geq N, 𝟙_{p_{n + 1} = p_{n} + 1} = 0

and

𝟙_{p_{n + 1} = p_{n}} = 1

. In other words:

\begin{matrix} \exists N \in N, \forall n \geq N, | g (θ_{n + 1}) - g \circ T (θ_{n}) | & = | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | \\ \exists N \in N, \forall n \geq N, | g (θ_{n + 1}) - g \circ T (θ_{n}) | & = | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{θ_{n} \in C l ({\{θ_{k}\}}_{k})} . \end{matrix}

Since

θ_{n}

is always in

C l ({\{θ_{k}\}}_{k})

by definition. Additionally Proposition 2 tells us that

C l ({\{θ_{k}\}}_{k})

is a compact. Moreover, in order to use Proposition 2 in the first place, we had proven that:

\forall compact K \subseteq Θ, lim_{n \to \infty} | g \circ F_{n} (θ_{n}) - g \circ T (θ_{n}) | 𝟙_{θ_{n} \in K} = 0 .

We can apply this directly with

K = C l ({\{θ_{k}\}}_{k})

to conclude the desired result:

lim_{n \to \infty} | g (θ_{n + 1}) - g \circ T (θ_{n}) | = 0

Hence we verify all the conditions to apply Proposition 1.

Appendix A.1.4. Applying Proposition 1

Since we verify all we need, we have the conclusions of Proposition 1:

${\{g (θ_{n})\}}_{n \in N}$ converges towards a connected component of $g (L \cap C l ({\{θ_{n}\}}_{n})) \subset g (L)$
If $g (L \cap C l ({\{θ_{n}\}}_{n}))$ has an empty interior, then ${\{g (θ_{n})\}}_{n \in N}$ converges towards a $g^{*} \in R$ and ${\{θ_{n}\}}_{n}$ converges towards $L_{g^{*}} \cap C l ({\{θ_{n}\}}_{n})$ . Where $L_{g^{*}} : = \{θ \in L | g (θ) = g^{*}\}$

Both points are respectively the statements

(i) (b)

and

(i i)

of Theorem 1.

Which concludes the proof of the Theorem.

Appendix A.2. Proof of the Tempering Theorem

In this Section, we prove Theorem 3 of the main paper, the convergence of the tempered EM algorithm. For that, we need to show that we verify each of the hypotheses of the more general Theorem 1.

We already assumed the conditions M1, M2 and M3 in the hypotheses of Theorem 3. To apply Theorem 1, we need to show that when

{\tilde{p}}_{θ, n} (z) : = \frac{p_{θ}^{\frac{1}{T_{n}}} (z)}{\int_{z^{'}} p_{θ}^{\frac{1}{T_{n}}} (z^{'}) d z^{'}}

, then

\forall compact K \subseteq Θ

, one of the two following configurations holds:

\int_{z} S {(z)}^{2} d z < \infty and \sup_{θ \in K} \int_{z} {({\tilde{p}}_{θ, n} (z) - p_{θ} (z))}^{2} d z \underset{n \to \infty}{⟶} 0,

or

\underset{θ \in K}{s u p} \int_{z} S {(z)}^{2} p_{θ} (z) d z < \infty and \underset{θ \in K}{s u p} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0 .

Since we have assumed:

\forall compact K \in Θ, \forall α \in \bar{B} (1, ϵ), \forall u, \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z < \infty,

then we already verify the first half of the second configuration for all the compacts K. Hence it is sufficient to prove that:

\forall compact K \in Θ, \sup_{θ \in K} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0,

(A8)

to have the desired result. The rest of the proof is dedicated to this goal.

Appendix A.2.1. Taylor Development

We use the Taylor’s formula of the first order with the mean-value form of the reminder. For a derivable function f:

f (x) = f (0) + f^{'} (a) x, a \in [0, x],

(A9)

where the interval

[0, x]

has a flexible meaning since x could be negative.

We apply it to:

f (x) = e^{x}, f^{'} (x) = e^{x}, f (x) = 1 + x e^{a}, a \in [0, x],

and:

f (x) = \frac{1}{1 + x}, f^{'} (x) = - \frac{1}{{(1 + x)}^{2}}, f (x) = 1 - \frac{x}{{(1 + a)}^{2}}, a \in [0, x] .

To make the upcoming calculation more readable, we momentarily replace

p_{θ} (z)

by simply p and

T_{n}

by T.

\begin{matrix} p^{\frac{1}{T}} & = p (p^{\frac{1}{T} - 1}) \\ = p e^{(\frac{1}{T} - 1) ln p} \\ = p + (\frac{1}{T} - 1) p ln p e^{a}, a \in [0, (\frac{1}{T} - 1) ln p], \end{matrix}

where

a = a (z, θ, T_{n})

since it depends on the value of

p_{θ} (z)

and

T_{n}

. Provided that the following quantities are defined, we have:

\int_{z} p^{\frac{1}{T}} = 1 + (\frac{1}{T} - 1) \int_{z} p ln p e^{a},

Hence:

\frac{1}{\int_{z} p^{\frac{1}{T}}} = 1 - (\frac{1}{T} - 1) \frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}}, b \in [0, (\frac{1}{T} - 1) \int_{z} p ln p e^{a}],

where

b = b (θ, T_{n})

since it depends on the value of

T_{n}

the integral over z of a function of z and

θ

.

In the end, we have:

\frac{p^{\frac{1}{T}}}{\int_{z} p^{\frac{1}{T}}} = p + (\frac{1}{T} - 1) p ln p e^{a} (1 - (\frac{1}{T} - 1) \frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}}) - (\frac{1}{T} - 1) p \frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}} .

(A10)

Since for any real numbers

{(x + y)}^{2} \leq 2 (x^{2} + y^{2})

, then:

\begin{matrix} {(\frac{p^{\frac{1}{T}}}{\int_{z} p^{\frac{1}{T}}} - p)}^{2} & \leq 2 {(\frac{1}{T} - 1)}^{2} p^{2} ({(ln p e^{a})}^{2} {(1 - (\frac{1}{T} - 1) \frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}})}^{2} + {(\frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}})}^{2}) \\ = 2 {(\frac{1}{T} - 1)}^{2} p^{2} ({(ln p e^{a})}^{2} A + B) . \end{matrix}

where

A = A (θ, T_{n})

and

B = B (θ, T_{n})

. So far the only condition that has to be verified for all the involved quantities to be defined is that

\int_{z} p ln p e^{a}

exists. With this Taylor development on hand, we state, prove and apply two lemmas which allow us to get (A8) and conclude the proof of the theorem.

Appendix A.2.2. Two Intermediary Lemmas

The two following lemmas provides every result we need to finish the proof.

Lemma A1.

With

p_{θ} (z) = e x p (ψ (θ) + 〈S (z), ϕ (θ〉),

then

\int_{z} p_{θ}^{α} (z) {ln}^{2} p_{θ} (z) d z \leq 2 ψ {(θ)}^{2} \int_{z} p_{θ}^{α} (z) d z + 2 {∥ϕ (θ)∥}^{2} . \sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) .

and

\int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z \leq |ψ (θ)| \int_{z} p_{θ}^{α} (z) d z + ∥ϕ (θ)∥ . {(\sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) \int_{z} p_{θ}^{α} (z))}^{\frac{1}{2}} .

Proof.

For the first inequality, using the fact that

{(a + b)}^{2} \leq 2 (a^{2} + b^{2})

, we have:

\int_{z} p_{θ}^{α} (z) {ln}^{2} p_{θ} (z) d z \leq 2 ψ {(θ)}^{2} \int_{z} p_{θ}^{α} (z) d z + 2 \int_{z} p_{θ}^{α} (z) {〈S (z), ϕ (θ)〉}^{2},

We use Cauchy-Schwartz:

{〈S (z), ϕ (θ)〉}^{2} \leq {∥ϕ∥}^{2} {∥S (z)∥}^{2} = {∥ϕ∥}^{2} \sum_{u} S_{u} {(z)}^{2},

to get the desired result:

\int_{z} p_{θ}^{α} (z) {ln}^{2} p_{θ} (z) d z \leq 2 ψ {(θ)}^{2} \int_{z} p_{θ}^{α} (z) d z + 2 {∥ϕ (θ)∥}^{2} . \sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) .

For the second inequality, we start with Cauchy-Schwartz on

〈\int_{z} S (z) p_{θ}^{α} (z), ϕ (θ)〉

:

\int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z \leq |ψ (θ)| \int_{z} p_{θ}^{α} (z) d z + ∥ϕ (θ)∥ . ∥\int_{z} S (z) p_{θ}^{α} (z)∥ .

Moreover, since:

\int_{z} S_{u} (z) p_{θ}^{α} (z) d z \leq {(\int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z)}^{\frac{1}{2}} {(\int_{z} p_{θ}^{α} (z) d z)}^{\frac{1}{2}},

then

\int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z \leq |ψ (θ)| \int_{z} p_{θ}^{α} (z) d z + ∥ϕ (θ)∥ . {(\sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) \int_{z} p_{θ}^{α} (z))}^{\frac{1}{2}} .

□

Lemma A2.

With K compact and

ϵ \in R_{+}^{*}

,

p_{θ} (z) = e x p (ψ (θ) + 〈S (z), ϕ (θ〉),

and

{\tilde{p}}_{θ, n} (z) : = \frac{p_{θ}^{\frac{1}{T_{n}}} (z)}{\int_{z^{'}} p_{θ}^{\frac{1}{T_{n}}} (z^{'}) d z^{'}},

if

(i): $T_{n} \in R_{+}^{*} \underset{n \to \infty}{⟶} 1$ ,
(ii): $\sup_{θ \in K} ψ (θ) < \infty$ ,
(iii): $\sup_{θ \in K} ∥ϕ (θ)∥ < \infty$ ,
(iv): $\forall α \in \bar{B} (1, ϵ), \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z < \infty$ ,
(v): $\forall α \in \bar{B} (1, ϵ), \forall u, \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z < \infty$ .

then

\sup_{θ \in K} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0 .

Proof.

Provided that the following integrals exist, we have, thanks to the Taylor development:

\begin{matrix} \int_{z} \frac{1}{p} {(\frac{p^{\frac{1}{T}}}{\int_{z} p^{\frac{1}{T}}} - p)}^{2} & \leq 2 \int_{z} {(\frac{1}{T} - 1)}^{2} p ({(ln p e^{a})}^{2} A + B) \\ = 2 {(\frac{1}{T} - 1)}^{2} A \int_{z} p e^{2 a} {ln}^{2} p + 2 {(\frac{1}{T} - 1)}^{2} B . \end{matrix}

(A11)

In this proof, we find finite upper bounds independent of

θ

and

T_{n}

for

A (θ, T_{n})

,

B (θ, T_{n})

and

\int_{z} p e^{2 a} {ln}^{2} p

, then—since

(\frac{1}{T_{n}} - 1) ⟶ 0

—we have the desired result.

We start by studying

A (θ, T_{n}) = {(1 - (\frac{1}{T} - 1) \frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}})}^{2}

. The first term of interest here is

\int_{z} p ln p e^{a}

. We have:

\begin{matrix} a & \in [0, (\frac{1}{T} - 1) ln p], \\ e^{a} & \in [1, p^{\frac{1}{T} - 1}], \\ p ln p e^{a} & \in [p ln p, p^{\frac{1}{T}} ln p] . \end{matrix}

where we recall that the interval is to be taken in a flexible sense, since we do not now a priory which bound is the largest and which is the smallest. What we have without doubt though is:

|p ln p e^{a}| \leq m a x (|p ln p|, |p^{\frac{1}{T}} ln p|) .

We find an upper bound on both those term. Let

α \in \bar{B} (1, ϵ)

, the second result of Lemma A1 provides us:

\int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z \leq |ψ (θ)| \int_{z} p_{θ}^{α} (z) d z + ∥ϕ (θ)∥ . {(\sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) \int_{z} p_{θ}^{α} (z))}^{\frac{1}{2}} .

Thanks to the hypotheses (ii), (iii), (iv) and (v), we have:

\begin{matrix} \int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z & \leq \sup_{θ \in K} |ψ (θ)| . \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z \\ + \sup_{θ \in K} ∥ϕ (θ)∥ . \sum_{u} {(\sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z))}^{\frac{1}{2}} . {(\sup_{θ \in K} \int_{z} p_{θ}^{α} (z))}^{\frac{1}{2}} \\ = : C (α) \\ < \infty . \end{matrix}

The upper bound

C (α)

in the previous inequality is independent of

θ

and z but still dependant of the exponent

α

. However, since

\bar{B} (1, ϵ)

is closed ball, hypotheses (iv) and (v) can be rephrased as:

\begin{matrix} (iv) & \sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z < \infty, \\ (v) & \forall u, \sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) d z < \infty . \end{matrix}

Hence we can actually take the supremum over

α

in the right term of the inequation as well. We have:

\int_{z} p_{θ}^{α} (z) |ln p_{θ} (z)| d z

\begin{matrix} \leq \sup_{θ \in K} |ψ (θ)| . \sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z \\ + \sup_{θ \in K} ∥ϕ (θ)∥ . \sum_{u} {(\sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z))}^{\frac{1}{2}} . {(\sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} p_{θ}^{α} (z))}^{\frac{1}{2}} \\ = : C^{'} \\ < \infty . \end{matrix}

This new upper bound

C^{'}

is independent of

α

.

Since

T_{n} \mapsto 1

, then

\exists N \in N, \forall n \geq N, \frac{1}{T_{n}} \in \bar{B} (1, ϵ)

. Hence for

n \geq N

, we can apply the previous inequation to either

α = 1

or

α = \frac{1}{T_{n}}

, which provides us that

\int_{z} p_{θ} (z) |ln p_{θ} (z)|

,

\int_{z} p_{θ}^{\frac{1}{T_{n}}} (z) |ln p_{θ} (z)|

and their supremum in

θ

are all finite, all of them upper bounded by

C^{'}

.

In the end, when

n \geq N

, we have the control

\sup_{θ \in K} |\int_{z} p ln p e^{a}| < C^{'}

.

The next term to control is

\frac{1}{{(1 + b)}^{2}}

.

Since

b \in [0, (\frac{1}{T} - 1) \int_{z} p ln p e^{a}],

then

| b | \leq (\frac{1}{T} - 1) \sup_{θ \in K} \int_{z} p ln p e^{a}

. We already established that for all

n \geq N

,

\sup_{θ \in K} |\int_{z} p ln p e^{a}| \leq C^{'} < \infty

, hence

\sup_{θ \in K} | b (θ, T_{n}) | \underset{T_{n} ⟶ 1}{⟶} 0

. In particular,

\exists N^{'} \in N, \forall n \geq N^{'}, \forall θ \in K

we have

| b (θ, T_{n}) | \leq \frac{1}{2}

. In that case:

\begin{matrix} {(1 + b)}^{2} > {(1 - | b |)}^{2} \geq \frac{1}{4} \\ \frac{1}{{(1 + b)}^{2}} < \frac{1}{{(1 - | b |)}^{2}} \leq 4 . \end{matrix}

In the end, when

n \geq m a x (N, N^{'})

, for any

θ \in K

:

\begin{matrix} A (θ, T_{n}) & \leq 2 + 2 {(\frac{1}{T_{n}} - 1)}^{2} {(\frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}})}^{2} \\ \leq 2 + 32 {(\frac{1}{T_{n}} - 1)}^{2} {(\sup_{θ \in K} \int_{z} p ln p e^{a})}^{2} \\ \leq 2 + 32 {(\frac{1}{T_{n}} - 1)}^{2} C^{' 2} \\ \leq 2 + 32 ϵ^{2} C^{' 2} \\ = : C_{1} . \end{matrix}

This upper bound does not depend en

θ

anymore and the part in

T_{n}

simply converges towards 0 when

T_{n} ⟶ 1

.

Treating the term

B (θ, T_{n}) = {(\frac{\int_{z} p ln p e^{a}}{{(1 + b)}^{2}})}^{2} \leq 16 {(\sup_{θ \in K} \int_{z} p ln p e^{a})}^{2} \leq 16 C^{' 2} = : C_{2}

is immediate after having dealt with

A (θ, T_{n})

.

We now treat the term

\int_{z} p e^{2 a} {ln}^{2} p

in the exact same fashion as we did

A (θ, T_{n})

:

\begin{matrix} p ln p e^{a} & \in [p ln p, p^{\frac{1}{T}} ln p] \\ \Rightarrow p {(ln p e^{a})}^{2} & \in [p {ln}^{2} p, p^{\frac{2}{T} - 1} {ln}^{2} p] \\ \Rightarrow p {(ln p e^{a})}^{2} & \leq m a x (p {ln}^{2} p, p^{\frac{2}{T} - 1} {ln}^{2} p) . \end{matrix}

We control those two terms as previously. First we apply Lemma A1 (its first result this time) with

α \in \bar{B} (1, ϵ)

.

\int_{z} p_{θ}^{α} (z) {ln}^{2} p_{θ} (z) d z \leq 2 ψ {(θ)}^{2} \int_{z} p_{θ}^{α} (z) d z + 2 {∥ϕ (θ)∥}^{2} . \sum_{u} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) .

Thanks to the hypotheses (ii), (iii), (iv) and (v), we can once again take the supremum of the bound over

θ \in K

, then over

α \in \bar{B} (1, ϵ)

and conserve finite quantities:

\begin{matrix} \int_{z} p_{θ}^{α} (z) {ln}^{2} p_{θ} (z) d z & \leq 2 \sup_{θ \in K} ψ {(θ)}^{2} . \sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} p_{θ}^{α} (z) d z \\ + 2 \sup_{θ \in K} {∥ϕ (θ)∥}^{2} . \sum_{u} \sup_{α \in \bar{B} (1, ϵ)} \sup_{θ \in K} \int_{z} S_{u}^{2} (z) p_{θ}^{α} (z) \\ = : C_{3} \\ < \infty . \end{matrix}

The previous result is true for

α = 1

, and since once again

\exists N^{''}, \forall n \geq N^{''}, \frac{2}{T_{n}} - 1 \in \bar{B} (1, ϵ) \cap R_{+}^{*}

, it is also true for

α = \frac{2}{T_{n}} - 1

when n is large enough.

C_{3}

is independent of z,

θ

and

T_{n}

.

In the end

\forall n \geq N^{''}, \int_{z} p e^{2 a} {ln}^{2} p \leq C_{3} < \infty

.

We replace the three terms

A (θ, T_{n})

,

B (θ, T_{n})

and

\int_{z} p e^{2 a} {ln}^{2} p

by their upper bounds in the inequality (A11). When

n \geq m a x (N, N^{'}, N^{''})

:

\int_{z} \frac{1}{p} {(\frac{p^{\frac{1}{T}}}{\int_{z} p^{\frac{1}{T}}} - p)}^{2} \leq 2 {(\frac{1}{T_{n}} - 1)}^{2} C_{1} C_{3} + 2 {(\frac{1}{T_{n}} - 1)}^{2} C_{2} .

Which converges towards 0 when

T_{n} ⟶ 1

, i.e., when

n ⟶ \infty

. This concludes the proof of the lemma. □

Appendix A.2.3. Verifying the Conditions of Lemma A2

Now that the lemmas are proven, all that remains is to apply Lemma A2.

(i): We have $T_{n} \in R_{+}^{*} \underset{n \to \infty}{⟶} 1$ by hypothesis.
(ii) and (iii): $\sup_{θ \in K} ψ (θ) < \infty$ and $\sup_{θ \in K} ∥ϕ (θ)∥ < \infty$ are implied by the fact that $ψ (θ) = ψ^{'} (θ) - ln g (θ)$ and $ϕ (θ)$ are continuous
(iv) and (v): are also hypotheses of the theorem.
Hence we can apply Lemma A2. This means that:

$\sup_{θ \in K} \int_{z} {(\frac{{\tilde{p}}_{θ, n} (z)}{p_{θ} (z)} - 1)}^{2} p_{θ} (z) d z \underset{n \to \infty}{⟶} 0 .$

with this last condition verified, we can apply Theorem 1, which concludes the proof.

References

Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1977, 39, 1–22. [Google Scholar]
Wu, C.J. On the convergence properties of the EM algorithm. Ann. Stat. 1983, 11, 95–103. [Google Scholar] [CrossRef]
Boyles, R.A. On the convergence of the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1983, 45, 47–50. [Google Scholar] [CrossRef]
Lange, K. A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1995, 57, 425–437. [Google Scholar] [CrossRef]
Delyon, B.; Lavielle, M.; Moulines, E. Convergence of a stochastic approximation version of the EM algorithm. Ann. Stat. 1999, 27, 94–128. [Google Scholar] [CrossRef]
Wei, G.C.; Tanner, M.A. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Stat. Assoc. 1990, 85, 699–704. [Google Scholar] [CrossRef]
Fort, G.; Moulines, E. Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Stat. 2003, 31, 1220–1259. [Google Scholar] [CrossRef]
Kuhn, E.; Lavielle, M. Maximum likelihood estimation in nonlinear mixed effects models. Comput. Stat. Data Anal. 2005, 49, 1020–1038. [Google Scholar] [CrossRef]
Allassonnière, S.; Kuhn, E.; Trouvé, A. Construction of Bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli 2010, 16, 641–678. [Google Scholar] [CrossRef]
Allassonnière, S.; Chevallier, J. A New Class of EM Algorithms. Escaping Local Maxima and Handling Intractable Sampling. Comput. Stat. Data Anal. 2021, 159, 107159. [Google Scholar]
Neal, R.M.; Hinton, G.E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models; Springer: Dordrecht, The Netherlands, 1998; pp. 355–368. [Google Scholar]
Ng, S.K.; McLachlan, G.J. On the choice of the number of blocks with the incremental EM algorithm for the fitting of normal mixtures. Stat. Comput. 2003, 13, 45–55. [Google Scholar] [CrossRef]
Cappé, O.; Moulines, E. On-line expectation–maximization algorithm for latent data models. J. R. Stat. Soc. Ser. (Stat. Methodol.) 2009, 71, 593–613. [Google Scholar] [CrossRef]
Chen, J.; Zhu, J.; Teh, Y.W.; Zhang, T. Stochastic expectation maximization with variance reduction. Adv. Neural Inf. Process. Syst. 2018, 31, 7967–7977. [Google Scholar]
Karimi, B.; Wai, H.T.; Moulines, E.; Lavielle, M. On the global convergence of (fast) incremental expectation maximization methods. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 2837–2847. [Google Scholar]
Fort, G.; Moulines, E.; Wai, H.T. A Stochastic Path Integral Differential EstimatoR Expectation Maximization Algorithm. Adv. Neural Inf. Process. Syst. 2020, 34, 16972–16982. [Google Scholar]
Kuhn, E.; Matias, C.; Rebafka, T. Properties of the stochastic approximation EM algorithm with mini-batch sampling. Stat. Comput. 2020, 30, 1725–1739. [Google Scholar] [CrossRef]
Balakrishnan, S.; Wainwright, M.J.; Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Stat. 2017, 45, 77–120. [Google Scholar] [CrossRef]
Dwivedi, R.; Ho, N.; Khamaru, K.; Wainwright, M.J.; Jordan, M.I.; Yu, B. Singularity, misspecification and the convergence rate of EM. Ann. Stat. 2020, 48, 3161–3182. [Google Scholar] [CrossRef]
Booth, J.G.; Hobert, J.P. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. R. Stat. Soc. Ser. (Stat. Methodol.) 1999, 61, 265–285. [Google Scholar]
Levine, R.A.; Casella, G. Implementations of the Monte Carlo EM algorithm. J. Comput. Graph. Stat. 2001, 10, 422–439. [Google Scholar] [CrossRef]
Levine, R.A.; Fan, J. An automated (Markov chain) Monte Carlo em algorithm. J. Stat. Comput. Simul. 2004, 74, 349–360. [Google Scholar] [CrossRef]
Pan, J.X.; Thompson, R. Quasi-Monte Carlo EM algorithm for MLEs in generalized linear mixed models. In COMPSTAT; Physica-Verlag: Heidelberg, Germany, 1998; pp. 419–424. [Google Scholar]
Jank, W. Quasi-Monte Carlo sampling to improve the efficiency of Monte Carlo EM. Comput. Stat. Data Anal. 2005, 48, 685–701. [Google Scholar] [CrossRef]
Attias, H. Inferring Parameters and Structure of Latent Variable Models by Variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; pp. 21–30. [Google Scholar]
Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. 2008, 25, 131–146. [Google Scholar] [CrossRef]
Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by simulated annealing. Science 1983, 220, 671–680. [Google Scholar] [CrossRef] [PubMed]
Swendsen, R.H.; Wang, J.S. Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett. 1986, 57, 2607. [Google Scholar] [CrossRef]
Geyer, C.J.; Thompson, E.A. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 1995, 90, 909–920. [Google Scholar] [CrossRef]
Ueda, N.; Nakano, R. Deterministic annealing EM algorithm. Neural Netw. 1998, 11, 271–282. [Google Scholar] [CrossRef]
Naim, I.; Gildea, D. Convergence of the EM algorithm for Gaussian mixtures with unbalanced mixing coefficients. In Proceedings of the 29th International Conference on Machine Learning; Omnipress: Madison, WI, USA, 2012; pp. 1655–1662. [Google Scholar]
Chen, H.F.; Guo, L.; Gao, A.J. Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds. Stoch. Process. Their Appl. 1987, 27, 217–231. [Google Scholar] [CrossRef][Green Version]
Van Laarhoven, P.J.; Aarts, E.H. Simulated annealing. In Simulated Annealing: Theory and Applications; Springer: Dordrecht, The Netherlands, 1987; pp. 7–15. [Google Scholar]
Aarts, E.; Korst, J. Simulated Annealing and Boltzmann Machines; John Wiley and Sons Inc.: New York, NY, USA, 1988. [Google Scholar]
Hukushima, K.; Nemoto, K. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. 1996, 65, 1604–1608. [Google Scholar] [CrossRef]
Titterington, D.; Smith, A.; Makov, U. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, NY, USA, 1985. [Google Scholar]
Ho, N.; Nguyen, X. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Ann. Stat. 2016, 44, 2726–2755. [Google Scholar] [CrossRef]
Dwivedi, R.; Ho, N.; Khamaru, K.; Wainwright, M.; Jordan, M.; Yu, B. Sharp Analysis of Expectation-Maximization for Weakly Identifiable Models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 1866–1876. [Google Scholar]
Winkelbauer, A. Moments and absolute moments of the normal distribution. arXiv 2012, arXiv:1209.4340. [Google Scholar]

Figure 1. (a) Average values, with standard deviation, over 2000 simulations of the negative log-likelihood along the steps of the Riemann EM. The Riemann EM increases the likelihood. (b) Average and standard deviation of the relative parameter reconstruction errors at the end of the Riemann EM.

Figure 2. (a) Visual representation of the number of Riemann intervals over the EM steps for each profile

φ_{i}

. The total number of Riemann intervals computed over 100 EM iterations are: 5150 for “low”, 14,950 for “medium”, 50,500 for “linear” and 104,950 for “high”. (b) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The results are fairly similar, in particular between “medium”, “high” and “linear”.

Figure 2. (a) Visual representation of the number of Riemann intervals over the EM steps for each profile

φ_{i}

. The total number of Riemann intervals computed over 100 EM iterations are: 5150 for “low”, 14,950 for “medium”, 50,500 for “linear” and 104,950 for “high”. (b) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The results are fairly similar, in particular between “medium”, “high” and “linear”.

Figure 3. (a) Visual representation of the number of Riemann intervals over the EM steps for each profile

φ_{i}

. In higher dimension, the computational complexity of the profiles are very different. More precisely, the total number of Riemann squares computed over 100 EM iterations are: 4534 for “square root”, 125,662 for “5 square root”, 348,550 for “low” and 2,318,350 for “medium”. (b) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The “square root” profile performs poorly. However, the other three are comparable despite their different computational complexities.

Figure 3. (a) Visual representation of the number of Riemann intervals over the EM steps for each profile

φ_{i}

. In higher dimension, the computational complexity of the profiles are very different. More precisely, the total number of Riemann squares computed over 100 EM iterations are: 4534 for “square root”, 125,662 for “5 square root”, 348,550 for “low” and 2,318,350 for “medium”. (b) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The “square root” profile performs poorly. However, the other three are comparable despite their different computational complexities.

Figure 4. 500 sample points from a Mixture of Gaussians with 3 classes. The true centroid of each Gaussian are depicted by black crosses, and their true covariance matrices are represented by the confidence ellipses of level 0.8, 0.99 and 0.999 around the centre. Each sub-figure corresponds to one of the three different versions of the true parameters. From (a) to (c): the true

μ_{k}

of the two left clusters (

μ_{1}

and

μ_{2}

) are getting closer while everything else stays identical.

Figure 4. 500 sample points from a Mixture of Gaussians with 3 classes. The true centroid of each Gaussian are depicted by black crosses, and their true covariance matrices are represented by the confidence ellipses of level 0.8, 0.99 and 0.999 around the centre. Each sub-figure corresponds to one of the three different versions of the true parameters. From (a) to (c): the true

μ_{k}

of the two left clusters (

μ_{1}

and

μ_{2}

) are getting closer while everything else stays identical.

Figure 5. Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) when the initialisation is made at the barycenter of all data points (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithms (orange cross), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In those example, tmp-EM managed to correctly identify the position of the three real centroids.

Figure 6. Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) when the initialisation is made by selecting two points in the isolated cluster and one in the lower ambiguous cluster (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithms (orange cross), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In those examples, although EM kept two centroids on the isolated cluster, tmp-EM managed to correctly identify the position of the three real centroids.

Figure 7. Results over many simulations of the Riemann EM and tmp-Riemann EM on the Beta-Gaussian model. The tempered Riemann EM reaches relative errors on the real parameters that are orders of magnitude below the Riemann EM with no temperature. The likelihood reached is also lower with the tempering.

Table 1. Average and standard deviation of the relative error on

μ_{k}

,

\frac{{∥{\hat{μ}}_{k} - μ_{k}∥}^{2}}{{∥μ_{k}∥}^{2}}

, made by EM, tmp-EM with decreasing temperature and tmp-EM with oscillating temperature over 1000 simulated dataset with two different initialisations. The three different parameter families, described in Figure 4, correspond to increasingly ambiguous positions of classes 1 and 2. For both initialisations type, the identification of these two clusters is drastically improved by the tempering. Best results highlighted in bold.

Table 1. Average and standard deviation of the relative error on

μ_{k}

,

\frac{{∥{\hat{μ}}_{k} - μ_{k}∥}^{2}}{{∥μ_{k}∥}^{2}}

, made by EM, tmp-EM with decreasing temperature and tmp-EM with oscillating temperature over 1000 simulated dataset with two different initialisations. The three different parameter families, described in Figure 4, correspond to increasingly ambiguous positions of classes 1 and 2. For both initialisations type, the identification of these two clusters is drastically improved by the tempering. Best results highlighted in bold.

		EM		tmp-EM (Decreasing T)		tmp-EM (Decreasing Oscillating T)
Parameter
Family	cl.	barycenter	2v1	barycenter	2v1	barycenter	2v1
1	1	0.52 (1.01)	1.52 (1.24)	0.60 (1.08)	0.87 (1.20)	0.04 (0.26)	0.29 (0.64)
	2	0.55 (1.05)	1.53 (1.25)	0.64 (1.10)	0.96 (1.25)	0.05 (0.31)	0.30 (0.64)
	3	0.01 (0.06)	0.01 (0.03)	0.01 (0.10)	0.01 (0.02)	0.03 (0.17)	0.03 (0.19)
2	1	1.00 (1.42)	1.69 (1.51)	0.96 (1.41)	1.10 (1.46)	0.09 (0.47)	0.37 (0.86)
	2	1.03 (1.44)	1.71 (1.52)	1.08 (1.46)	1.11 (1.46)	0.12 (0.57)	0.32 (0.79)
	3	0.01 (0.05)	0.02 (0.03)	0.01 (0.03)	0.01 (0.04)	0.01 (0.05)	0.04 (0.22)
3	1	1.56 (1.75)	1.79 (1.77)	1.63 (1.76)	1.38 (1.71)	0.31 (0.97)	0.39 (0.98)
	2	1.51 (1.74)	1.88 (1.76)	1.52 (1.74)	1.30 (1.68)	0.30 (0.93)	0.39 (0.97)
	3	0.02 (0.04)	0.02 (0.04)	0.01 (0.03)	0.02 (0.06)	0.01 (0.04)	0.07 (0.30)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lartigue, T.; Durrleman, S.; Allassonnière, S. Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM. Algorithms 2022, 15, 78. https://doi.org/10.3390/a15030078

AMA Style

Lartigue T, Durrleman S, Allassonnière S. Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM. Algorithms. 2022; 15(3):78. https://doi.org/10.3390/a15030078

Chicago/Turabian Style

Lartigue, Thomas, Stanley Durrleman, and Stéphanie Allassonnière. 2022. "Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM" Algorithms 15, no. 3: 78. https://doi.org/10.3390/a15030078

APA Style

Lartigue, T., Durrleman, S., & Allassonnière, S. (2022). Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM. Algorithms, 15(3), 78. https://doi.org/10.3390/a15030078

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM

Abstract

1. Introduction

2. Materials and Methods

2.1. Deterministic Approximate EM Algorithm and Its Convergence for the Exponential Family

2.1.1. Context and Motivation

2.1.2. Theorem

2.1.3. Sketch of Proof

2.2. Riemann Approximation EM

2.2.1. Context and Motivation

2.2.2. Theorem and Proof

2.3. Tempered EM

2.3.1. Context and Motivation

2.3.2. Theorem

2.3.3. Sketch of Proof

2.3.4. Examples of Models That Verify the Conditions

Gaussian Mixture Model

Poisson Count with Random Effect

2.4. Tempered Riemann Approximation EM

Context, Theorem and Proof

3. Results

3.1. Riemann Approximation EM: Two Applications

3.1.1. Application to a Gaussian Model with the Beta Prior

3.1.2. Application in Two Dimensions

3.2. Tempered EM: Application to Mixtures of Gaussian

3.2.1. Context and Experimental Protocol

3.2.2. Experimental Results Analysis

3.3. Tempered Riemann Approximation EM: Application to a Gaussian Model with Beta Prior

4. Discussion and Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Proofs of the Two Main Theorems

Appendix A.1. Proof of the General Theorem

Appendix A.1.1. Verifying the Conditions of Proposition 2

Equivalent Formulation of the Convergence

Use the Uniform Continuity

Sufficient Condition for Convergence

Further Simplifications of the Desired Result with Successive Sufficient Conditions

Conclusions

Appendix A.1.2. Applying Proposition 2

Appendix A.1.3. Verifying the Conditions of Proposition 1

Appendix A.1.4. Applying Proposition 1

Appendix A.2. Proof of the Tempering Theorem

Appendix A.2.1. Taylor Development

Appendix A.2.2. Two Intermediary Lemmas

Appendix A.2.3. Verifying the Conditions of Lemma A2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI