# Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Deterministic Approximate EM Algorithm and Its Convergence for the Exponential Family

#### 2.1.1. Context and Motivation

- (E) With the current parameter ${\theta}_{n}$, calculate the conditional probability ${p}_{{\theta}_{n}}\left(z\right)$;
- (M) To get ${\theta}_{n+1}$, maximise in $\theta \in \mathsf{\Theta}$ the function $\theta \mapsto {\mathbb{E}}_{z\sim {p}_{{\theta}_{n}}}\left[\mathrm{ln}\,h(z;\theta )\right]$.
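As an illustration of these two steps, here is a minimal sketch on a hypothetical two-component 1-D Gaussian mixture (a toy example, not the paper's general setting), where both the E step and the M step have closed forms:

```python
import numpy as np

def em_gmm_1d(x, theta0, n_iter=50):
    """Exact EM for a two-component 1-D Gaussian mixture.

    Illustrative toy model only; theta = (pi, mu1, mu2, s1, s2)."""
    pi, mu1, mu2, s1, s2 = theta0
    for _ in range(n_iter):
        # (E) conditional probability of the hidden label given x and theta_n
        d1 = pi * np.exp(-0.5 * ((x - mu1) / s1) ** 2) / s1
        d2 = (1.0 - pi) * np.exp(-0.5 * ((x - mu2) / s2) ** 2) / s2
        r = d1 / (d1 + d2)  # responsibilities of component 1
        # (M) closed-form maximiser of the expected complete log-likelihood
        pi = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1.0 - r) * x).sum() / (1.0 - r).sum()
        s1 = np.sqrt((r * (x - mu1) ** 2).sum() / r.sum())
        s2 = np.sqrt(((1.0 - r) * (x - mu2) ** 2).sum() / (1.0 - r).sum())
    return pi, mu1, mu2, s1, s2

# Synthetic data: two well-separated clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
pi, mu1, mu2, s1, s2 = em_gmm_1d(x, theta0=(0.5, -1.0, 1.0, 1.0, 1.0))
```

In this deterministic, exact setting every quantity in the E step is computable; the approximate EM studied below is needed precisely when the conditional ${p}_{{\theta}_{n}}\left(z\right)$ (here, the responsibilities) is intractable.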

#### 2.1.2. Theorem

**Theorem 1** (Convergence of the Stable Approximate EM)**.**

**Conditions M1–3 on the model.**

**M1.** $\mathsf{\Theta}\subseteq {\mathbb{R}}^{l}$, $\mathcal{X}\subseteq {\mathbb{R}}^{d}$ and $\mu$ is a $\sigma$-finite positive Borel measure on $\mathcal{X}$. Let $\psi :\mathsf{\Theta}\to \mathbb{R}$, $\varphi :\mathsf{\Theta}\to {\mathbb{R}}^{q}$ and $S:\mathcal{X}\to \mathcal{S}\subseteq {\mathbb{R}}^{q}$. Define $L:\mathcal{S}\times \mathsf{\Theta}\to \mathbb{R}$, $h:\mathcal{X}\times \mathsf{\Theta}\to {\mathbb{R}}_{+}^{*}$ and $g:\mathsf{\Theta}\to {\mathbb{R}}_{+}^{*}$ as:$$L(s;\theta ):=\psi \left(\theta \right)+\left\langle s,\varphi \left(\theta \right)\right\rangle \,,\qquad h(z;\theta ):=\mathrm{exp}\,L(S\left(z\right);\theta )\,,\qquad g\left(\theta \right):={\int}_{z}h(z;\theta )\,\mu \left(dz\right)\,.$$The density of the complete model is then ${p}_{\theta}\left(z\right):=h(z;\theta )/g\left(\theta \right)$.

**M2.** Assume that:
- (a*) $\psi$ and $\varphi$ are continuous on $\mathsf{\Theta}$;
- (b) for all $\theta \in \mathsf{\Theta}$, $\overline{S}\left(\theta \right):={\int}_{z}S\left(z\right){p}_{\theta}\left(z\right)\mu \left(dz\right)$ is finite and continuous on $\mathsf{\Theta}$;
- (c) there exists a continuous function $\widehat{\theta}:\mathcal{S}\to \mathsf{\Theta}$ such that for all $s\in \mathcal{S},\phantom{\rule{0.277778em}{0ex}}L(s;\widehat{\theta}\left(s\right))=\underset{\theta \in \mathsf{\Theta}}{\mathrm{sup}}\phantom{\rule{0.166667em}{0ex}}L(s;\theta )$;
- (d) g is positive, finite and continuous on $\mathsf{\Theta}$ and the level set $\left\{\theta \in \mathsf{\Theta}\,|\,g\left(\theta \right)\ge M\right\}$ is compact for any $M>0$.

**M3.** Define $\mathcal{L}:=\left\{\theta \in \mathsf{\Theta}\,|\,\widehat{\theta}\circ \overline{S}\left(\theta \right)=\theta \right\}$ and, for any ${g}^{*}\in {\mathbb{R}}_{+}^{*}$, ${\mathcal{L}}_{{g}^{*}}:=\left\{\theta \in \mathcal{L}\,|\,g\left(\theta \right)={g}^{*}\right\}$. Assume either that:
- (a) the set $g\left(\mathcal{L}\right)$ is compact, or
- (${a}^{\prime}$) for all compact sets $K\subseteq \mathsf{\Theta}$, $g\left(K\cap \mathcal{L}\right)$ is finite.

**The conditions on the approximation.** Assume that ${\tilde{p}}_{\theta ,n}\left(z\right)$ is deterministic. Let $S\left(z\right)={\left({S}_{u}\left(z\right)\right)}_{u=1}^{q}$. For all indices u and for any compact set $K\subseteq \mathsf{\Theta}$, one of the two following configurations holds. Either condition (4):$$\int_{z}{S}_{u}^{2}\left(z\right)dz<\infty \quad \text{and}\quad \underset{\theta \in K}{\mathrm{sup}}\int_{z}{\left({\tilde{p}}_{\theta ,n}\left(z\right)-{p}_{\theta}\left(z\right)\right)}^{2}dz\underset{n\to \infty}{\longrightarrow}0\,,$$or condition (5):$$\underset{\theta \in K}{\mathrm{sup}}\int_{z}{S}_{u}^{2}\left(z\right)\,{p}_{\theta}\left(z\right){\left(\frac{{\tilde{p}}_{\theta ,n}\left(z\right)}{{p}_{\theta}\left(z\right)}-1\right)}^{2}dz\underset{n\to \infty}{\longrightarrow}0\,.$$
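Why a condition of this kind suffices can be sketched in one line (a hedged reconstruction, assuming, as Remark 1 below indicates, that condition (4) pairs the square-integrability of ${S}_{u}$ with an ${l}_{2}$ convergence of ${\tilde{p}}_{\theta ,n}$ towards ${p}_{\theta}$): by Cauchy–Schwarz, for every $\theta \in K$,

```latex
\left| \tilde{S}_{n,u}(\theta) - \overline{S}_u(\theta) \right|
  = \left| \int_z S_u(z)\left( \tilde{p}_{\theta,n}(z) - p_\theta(z) \right) dz \right|
  \le \left( \int_z S_u^2(z)\, dz \right)^{\!1/2}
      \left( \int_z \left( \tilde{p}_{\theta,n}(z) - p_\theta(z) \right)^2 dz \right)^{\!1/2},
```

so the approximated E step converges uniformly on $K$ towards the exact one.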

- (i)
- (a) $\underset{n\to \infty}{\mathrm{lim}}\,{j}_{n}<\infty$ and $\underset{n\in \mathbb{N}}{\mathrm{sup}}\,\left\lVert {\theta}_{n}\right\rVert <\infty$, with probability 1;
- (b) $g\left({\theta}_{n}\right)$ converges towards a connected component of $g\left(\mathcal{L}\right)$.

- (ii) Let $Cl:\mathcal{P}\left(\mathsf{\Theta}\right)\to \mathcal{P}\left(\mathsf{\Theta}\right)$ be the set closure function and $d:\mathsf{\Theta}\times \mathcal{P}\left(\mathsf{\Theta}\right)\to {\mathbb{R}}_{+}$ be any point-to-set distance function within $\mathsf{\Theta}$. If $g\left(\mathcal{L}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)\right)$ has an empty interior, then $\exists {g}^{*}\in {\mathbb{R}}_{+}^{*}$ such that:$$g\left({\theta}_{n}\right)\underset{n\to \infty}{\longrightarrow}{g}^{*}\,,\qquad d({\theta}_{n},{\mathcal{L}}_{{g}^{*}})\underset{n\to \infty}{\longrightarrow}0\,.$$

**Remark 1.**

- When g is differentiable on $\mathsf{\Theta}$, its stationary points are the stable points of the EM, i.e., $\mathcal{L}=\left\{\theta \in \mathsf{\Theta}\,|\,\nabla g(\theta )=0\right\}$.
- The M1–3 conditions in Theorem 1 are almost identical to the similarly named M1–3 conditions in [7]. The only difference is in M2 (a*), where we remove the hypothesis that S has to be a continuous function of z, since it is not needed when the approximation is not stochastic.
- The property ${\int}_{z}{S}_{u}^{2}\left(z\right)dz<\infty $ of the condition (4) can seem hard to verify since S is not integrated here against a probability density function. However, when z is a finite variable, as is the case for finite mixtures, this integral becomes a finite sum. Hence, condition (4) is better adapted to finite mixtures, while condition (5) is better adapted to continuous hidden variables.
- The two sufficient conditions (4) and (5) involve a certain form of ${l}_{2}$ convergence of ${\tilde{p}}_{\theta ,n}$ towards ${p}_{\theta}$. If the latent variable z is continuous, this excludes countable (and finite) approximations such as sums of Dirac functions, since they have a measure of zero. In particular, this excludes Quasi-Monte Carlo approximations. However, one look at the proof of Theorem 1 (in Appendix A.1), or at the following sketch of proof, reveals that it is actually sufficient to verify $\underset{\theta \in K}{\mathrm{sup}}\,\left\lVert {\tilde{S}}_{n}\left(\theta \right)-\overline{S}\left(\theta \right)\right\rVert \underset{n\to \infty}{\longrightarrow}0$ for any compact set K, where ${\tilde{S}}_{n}\left(\theta \right):={\int}_{z}S\left(z\right){\tilde{p}}_{\theta ,n}\left(z\right)\mu \left(dz\right)$ denotes the approximated E step in the Stable Approximate EM. This condition can be verified by finite approximations.
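This sufficient condition is easy to probe numerically. Below is a small sketch on a hypothetical toy model with $z\in \left[0,1\right]$, $h(z;\theta )={e}^{\theta z}$ and $S\left(z\right)=z$ (chosen only because $\overline{S}\left(\theta \right)$ then has a closed form), where ${\tilde{p}}_{\theta ,n}$ is a midpoint-Riemann discretisation; the sup over a compact K of $|{\tilde{S}}_{n}\left(\theta \right)-\overline{S}\left(\theta \right)|$ indeed vanishes as the grid is refined:

```python
import numpy as np

# Toy latent model on [0, 1]: h(z; theta) = exp(theta * z), S(z) = z.
# (Hypothetical choice, made so that the exact E step has a closed form.)

def S_bar(theta):
    # Exact E step: S̄(θ) = ∂_θ log ∫₀¹ exp(θ z) dz = e^θ/(e^θ − 1) − 1/θ
    return np.exp(theta) / (np.exp(theta) - 1.0) - 1.0 / theta

def S_tilde(theta, n):
    # Approximated E step: p_θ is replaced by its normalised values on an
    # n-point midpoint grid (a finite, Riemann-type approximation)
    z = (np.arange(n) + 0.5) / n
    w = np.exp(theta * z)
    return (z * w).sum() / w.sum()

# sup over a compact K (θ = 0 excluded to keep the closed form well defined)
K = np.concatenate([np.linspace(-2.0, -0.1, 20), np.linspace(0.1, 2.0, 20)])
errors = [max(abs(S_tilde(t, n) - S_bar(t)) for t in K) for n in (10, 100, 1000)]
```

Here `errors` decreases as the number of Riemann intervals grows, which is exactly the uniform convergence of the approximated E step that the proof needs.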

#### 2.1.3. Sketch of Proof

**Proposition 1** (“Proposition 9”)**.**

- ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards a connected component of $g(\mathcal{L}\cap K)$;
- If $g(\mathcal{L}\cap K)$ has an empty interior, then $\exists {g}^{*}\in {\mathbb{R}}_{+}^{*}$ such that ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards ${g}^{*}$. Moreover, ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ converges towards the set ${\mathcal{L}}_{{g}^{*}}\cap K$, where ${\mathcal{L}}_{{g}^{*}}:=\left\{\theta \in \mathcal{L}\,|\,g\left(\theta \right)={g}^{*}\right\}$.

**Proposition 2** (“Proposition 11”)**.**

- (a) the C1–2 conditions of Proposition 10 of [7]:
  - **(C1)** There exists g, a ${C}^{0}$ Lyapunov function relative to $(T,\mathcal{L})$, such that for all $M>0$, $\left\{\theta \in \mathsf{\Theta}\,|\,g\left(\theta \right)\ge M\right\}$ is compact, and:$$\mathsf{\Theta}={\cup}_{n\in \mathbb{N}}\left\{\theta \in \mathsf{\Theta}\,|\,g\left(\theta \right)>{n}^{-1}\right\}\,;$$
  - **(C2)** $g\left(\mathcal{L}\right)$ is compact OR **(C2’)** $g(\mathcal{L}\cap K)$ is finite for all compact $K\subseteq \mathsf{\Theta}$.

- (b) $\forall \theta \in {K}_{0},\phantom{\rule{1.em}{0ex}}{\displaystyle \underset{n\to \infty}{\mathrm{lim}}}\,|g\circ {F}_{n}-g\circ T|\left(\theta \right)=0$;
- (c) $\forall$ compact $K\subseteq \mathsf{\Theta}$:$$\underset{n\to \infty}{\mathrm{lim}}|g\circ {F}_{n}\left({\theta}_{n}\right)-g\circ T\left({\theta}_{n}\right)|\,{\mathbb{1}}_{{\theta}_{n}\in K}=0\,.$$

**Remark 2.**

**Verifying the conditions of Proposition 2.** Let g be the observed likelihood function defined in hypothesis M1 of Theorem 1, T the exact EM point-to-point map defined in (1), $\mathcal{L}:=\left\{\theta \in \mathsf{\Theta}\,|\,T(\theta )=\theta \right\}$ the set of its stable points, and ${\left({F}_{n}\right)}_{n\in \mathbb{N}}$ a sequence of approximated point-to-point maps as defined in (2). With some increasing sequence ${\left({K}_{n}\right)}_{n\in \mathbb{N}}$ of compacts of $\mathsf{\Theta}$, let ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ be defined from the Stable Approximate EM Equation (3).

**Applying Proposition 2.** The application of Proposition 2 to the Stable Approximate EM tells us that, with probability 1, $\underset{n\to \infty}{\mathrm{lim}}\,{j}_{n}<\infty$ and $\underset{n\in \mathbb{N}}{\mathrm{sup}}\,\left\lVert {\theta}_{n}\right\rVert <\infty$.

**Verifying the conditions of Proposition 1.** We already have that the likelihood g is a ${C}^{0}$ Lyapunov function relative to $(T,\mathcal{L})$. Thanks to Proposition 2, we have that $K:=Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)$ is a compact. Then, $\mathcal{L}\cap K$ is also compact thanks to hypothesis M3. Moreover, by definition, the EM sequence ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ is K-valued. The last condition that remains to be shown to apply Proposition 1 is that:$$\underset{n\to \infty}{\mathrm{lim}}\left|g\circ {F}_{n}\left({\theta}_{n}\right)-g\circ T\left({\theta}_{n}\right)\right|=0\,.$$

**Applying Proposition 1.** Since all its conditions are verified, we can apply the conclusions of Proposition 1:

- ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards a connected component of $g\left(\mathcal{L}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)\right)$.
- If $g\left(\mathcal{L}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)\right)$ has an empty interior, then the sequence ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards a ${g}^{*}\in \mathbb{R}$ and ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ converges towards the set ${\mathcal{L}}_{{g}^{*}}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)$.

#### 2.2. Riemann Approximation EM

#### 2.2.1. Context and Motivation

#### 2.2.2. Theorem and Proof

**Theorem 2.**

**Proof.**

#### 2.3. Tempered EM

#### 2.3.1. Context and Motivation

**Remark 3.**

#### 2.3.2. Theorem

**Theorem 3.**

- (T1) $\underset{\theta \in K}{\mathrm{sup}}\,{\int}_{z}{p}_{\theta}^{\alpha}\left(z\right)dz<\infty \,$,
- (T2) $\forall u\in \u301a1,q\u301b,\phantom{\rule{1.em}{0ex}}\underset{\theta \in K}{\mathrm{sup}}\,{\int}_{z}{S}_{u}^{2}\left(z\right){p}_{\theta}^{\alpha}\left(z\right)dz<\infty \,$,
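To make the role of the temperature concrete, here is a small sketch of a tempered E step on a hypothetical two-component 1-D Gaussian mixture (`tempered_responsibilities` is an illustrative name, not the paper's code): the conditional distribution of the hidden class is raised to the power $1/{T}_{n}$ and re-normalised, which flattens the responsibilities when ${T}_{n}>1$ and recovers the exact E step when ${T}_{n}=1$.

```python
import numpy as np

def tempered_responsibilities(x, pis, mus, sigmas, T):
    """Tempered E step for a 1-D Gaussian mixture (illustrative sketch):
    the conditional class probabilities are raised to the power 1/T and
    re-normalised, i.e. the tempered conditional is proportional to p^(1/T)."""
    # log joint density of (x, class), up to a constant independent of the class
    logp = (np.log(pis)
            - np.log(sigmas)
            - 0.5 * ((x[:, None] - mus) / sigmas) ** 2)
    logp = logp / T                            # tempering: exponent 1/T
    logp -= logp.max(axis=1, keepdims=True)    # stabilised softmax
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

x = np.array([0.0, 1.0, 4.0])
pis = np.array([0.5, 0.5])
mus = np.array([0.0, 4.0])
sigmas = np.array([1.0, 1.0])
r_exact = tempered_responsibilities(x, pis, mus, sigmas, T=1.0)  # plain E step
r_hot = tempered_responsibilities(x, pis, mus, sigmas, T=10.0)   # flattened
```

With a high temperature, the responsibilities move towards the uniform distribution, which is what lets the tempered EM escape the basin of attraction of a poor initialisation before ${T}_{n}\to 1$ restores the exact E step.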

**Remark 4.**

#### 2.3.3. Sketch of Proof

#### 2.3.4. Examples of Models That Verify the Conditions

#### Gaussian Mixture Model

#### Poisson Count with Random Effect

- $z\mapsto {S}^{2}\left(z\right){p}_{\theta}^{\alpha}\left(z\right)$ is measurable on ${\mathbb{R}}^{N}$.
- $\theta \mapsto {S}^{2}\left(z\right){p}_{\theta}^{\alpha}\left(z\right)$ is continuous on K (and on $\mathsf{\Theta}=\mathbb{R}$).
- With ${\theta}_{M}:=\underset{\theta \in K}{\mathrm{min}}\,\theta$, we have $\forall \theta \in K,\phantom{\rule{0.277778em}{0ex}}0\le {S}^{2}\left(z\right){p}_{\theta}^{\alpha}\left(z\right)\le {S}^{2}\left(z\right){p}_{{\theta}_{M}}^{\alpha}\left(z\right)$.

#### 2.4. Tempered Riemann Approximation EM

#### Context, Theorem and Proof

**Theorem 4.**

**Proof.**

## 3. Results

#### 3.1. Riemann Approximation EM: Two Applications

#### 3.1.1. Application to a Gaussian Model with the Beta Prior

#### 3.1.2. Application in Two Dimensions

#### 3.2. Tempered EM: Application to Mixtures of Gaussian

#### 3.2.1. Context and Experimental Protocol

#### 3.2.2. Experimental Results Analysis

#### 3.3. Tempered Riemann Approximation EM: Application to a Gaussian Model with Beta Prior

## 4. Discussion and Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Proofs of the Two Main Theorems

#### Appendix A.1. Proof of the General Theorem

#### Appendix A.1.1. Verifying the Conditions of Proposition 2

#### Equivalent Formulation of the Convergence

#### Use the Uniform Continuity

#### Sufficient Condition for Convergence

#### Further Simplifications of the Desired Result with Successive Sufficient Conditions

#### Conclusions

#### Appendix A.1.2. Applying Proposition 2

#### Appendix A.1.3. Verifying the Conditions of Proposition 1

- $Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)$ (set closure) plays the part of the compact K;
- $\left\{\theta \in \mathsf{\Theta}\,|\,T\left(\theta \right)=\theta \right\}$ plays the part of the set $\mathcal{L}$;
- $\mathcal{L}\cap K$ is also compact thanks to hypothesis M3;
- the likelihood g is the ${C}^{0}$ Lyapunov function with regards to $(T,\mathcal{L})$;
- ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ is the K-valued sequence (since K is $Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)$).

#### Appendix A.1.4. Applying Proposition 1

- ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards a connected component of $g\left(\mathcal{L}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)\right)$;
- If $g\left(\mathcal{L}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)\right)$ has an empty interior, then ${\left(g\left({\theta}_{n}\right)\right)}_{n\in \mathbb{N}}$ converges towards a ${g}^{*}\in \mathbb{R}$ and ${\left({\theta}_{n}\right)}_{n\in \mathbb{N}}$ converges towards ${\mathcal{L}}_{{g}^{*}}\cap Cl\left({\left({\theta}_{n}\right)}_{n\in \mathbb{N}}\right)$, where ${\mathcal{L}}_{{g}^{*}}:=\left\{\theta \in \mathcal{L}\,|\,g\left(\theta \right)={g}^{*}\right\}$.

#### Appendix A.2. Proof of the Tempering Theorem

#### Appendix A.2.1. Taylor Development

#### Appendix A.2.2. Two Intermediary Lemmas

**Lemma A1.**

**Proof.**

**Lemma A2.**

- (i) ${T}_{n}\in {\mathbb{R}}_{+}^{*}$ and ${T}_{n}\underset{n\to \infty}{\longrightarrow}1$,
- (ii) $\underset{\theta \in K}{\mathrm{sup}}\,\psi \left(\theta \right)<\infty $,
- (iii) $\underset{\theta \in K}{\mathrm{sup}}\,\left\lVert \varphi \left(\theta \right)\right\rVert <\infty $,
- (iv) $\forall \alpha \in \overline{\mathcal{B}}(1,\u03f5),\phantom{\rule{1.em}{0ex}}\underset{\theta \in K}{\mathrm{sup}}\,{\int}_{z}{p}_{\theta}^{\alpha}\left(z\right)dz<\infty $,
- (v) $\forall \alpha \in \overline{\mathcal{B}}(1,\u03f5),\phantom{\rule{0.277778em}{0ex}}\forall u,\phantom{\rule{1.em}{0ex}}\underset{\theta \in K}{\mathrm{sup}}\,{\int}_{z}{S}_{u}^{2}\left(z\right){p}_{\theta}^{\alpha}\left(z\right)dz<\infty $.

**Proof.**

#### Appendix A.2.3. Verifying the Conditions of Lemma A2

- (i) We have ${T}_{n}\in {\mathbb{R}}_{+}^{*}$ and ${T}_{n}\underset{n\to \infty}{\longrightarrow}1$ by hypothesis.
- (ii) and (iii) $\underset{\theta \in K}{\mathrm{sup}}\,\psi \left(\theta \right)<\infty $ and $\underset{\theta \in K}{\mathrm{sup}}\,\left\lVert \varphi \left(\theta \right)\right\rVert <\infty $ are implied by the fact that $\psi \left(\theta \right)={\psi}^{\prime}\left(\theta \right)-\mathrm{ln}\,g\left(\theta \right)$ and $\varphi \left(\theta \right)$ are continuous on the compact K.
- (iv) and (v) are also hypotheses of the theorem.

Hence we can apply Lemma A2. This means that:$$\underset{\theta \in K}{\mathrm{sup}}\,{\int}_{z}{S}_{u}^{2}\left(z\right)\,{p}_{\theta}\left(z\right){\left(\frac{{\tilde{p}}_{\theta ,n}\left(z\right)}{{p}_{\theta}\left(z\right)}-1\right)}^{2}dz\underset{n\to \infty}{\longrightarrow}0\,.$$

## References

1. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) **1977**, 39, 1–22.
2. Wu, C.J. On the convergence properties of the EM algorithm. Ann. Stat. **1983**, 11, 95–103.
3. Boyles, R.A. On the convergence of the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) **1983**, 45, 47–50.
4. Lange, K. A gradient algorithm locally equivalent to the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) **1995**, 57, 425–437.
5. Delyon, B.; Lavielle, M.; Moulines, E. Convergence of a stochastic approximation version of the EM algorithm. Ann. Stat. **1999**, 27, 94–128.
6. Wei, G.C.; Tanner, M.A. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Stat. Assoc. **1990**, 85, 699–704.
7. Fort, G.; Moulines, E. Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Stat. **2003**, 31, 1220–1259.
8. Kuhn, E.; Lavielle, M. Maximum likelihood estimation in nonlinear mixed effects models. Comput. Stat. Data Anal. **2005**, 49, 1020–1038.
9. Allassonnière, S.; Kuhn, E.; Trouvé, A. Construction of Bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli **2010**, 16, 641–678.
10. Allassonnière, S.; Chevallier, J. A New Class of EM Algorithms. Escaping Local Maxima and Handling Intractable Sampling. Comput. Stat. Data Anal. **2021**, 159, 107159.
11. Neal, R.M.; Hinton, G.E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models; Springer: Dordrecht, The Netherlands, 1998; pp. 355–368.
12. Ng, S.K.; McLachlan, G.J. On the choice of the number of blocks with the incremental EM algorithm for the fitting of normal mixtures. Stat. Comput. **2003**, 13, 45–55.
13. Cappé, O.; Moulines, E. On-line expectation–maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **2009**, 71, 593–613.
14. Chen, J.; Zhu, J.; Teh, Y.W.; Zhang, T. Stochastic expectation maximization with variance reduction. Adv. Neural Inf. Process. Syst. **2018**, 31, 7967–7977.
15. Karimi, B.; Wai, H.T.; Moulines, E.; Lavielle, M. On the global convergence of (fast) incremental expectation maximization methods. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 2837–2847.
16. Fort, G.; Moulines, E.; Wai, H.T. A Stochastic Path Integral Differential EstimatoR Expectation Maximization Algorithm. Adv. Neural Inf. Process. Syst. **2020**, 34, 16972–16982.
17. Kuhn, E.; Matias, C.; Rebafka, T. Properties of the stochastic approximation EM algorithm with mini-batch sampling. Stat. Comput. **2020**, 30, 1725–1739.
18. Balakrishnan, S.; Wainwright, M.J.; Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Stat. **2017**, 45, 77–120.
19. Dwivedi, R.; Ho, N.; Khamaru, K.; Wainwright, M.J.; Jordan, M.I.; Yu, B. Singularity, misspecification and the convergence rate of EM. Ann. Stat. **2020**, 48, 3161–3182.
20. Booth, J.G.; Hobert, J.P. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **1999**, 61, 265–285.
21. Levine, R.A.; Casella, G. Implementations of the Monte Carlo EM algorithm. J. Comput. Graph. Stat. **2001**, 10, 422–439.
22. Levine, R.A.; Fan, J. An automated (Markov chain) Monte Carlo EM algorithm. J. Stat. Comput. Simul. **2004**, 74, 349–360.
23. Pan, J.X.; Thompson, R. Quasi-Monte Carlo EM algorithm for MLEs in generalized linear mixed models. In COMPSTAT; Physica-Verlag: Heidelberg, Germany, 1998; pp. 419–424.
24. Jank, W. Quasi-Monte Carlo sampling to improve the efficiency of Monte Carlo EM. Comput. Stat. Data Anal. **2005**, 48, 685–701.
25. Attias, H. Inferring Parameters and Structure of Latent Variable Models by Variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; pp. 21–30.
26. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
27. Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. **2008**, 25, 131–146.
28. Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by simulated annealing. Science **1983**, 220, 671–680.
29. Swendsen, R.H.; Wang, J.S. Replica Monte Carlo simulation of spin-glasses. Phys. Rev. Lett. **1986**, 57, 2607.
30. Geyer, C.J.; Thompson, E.A. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. **1995**, 90, 909–920.
31. Ueda, N.; Nakano, R. Deterministic annealing EM algorithm. Neural Netw. **1998**, 11, 271–282.
32. Naim, I.; Gildea, D. Convergence of the EM algorithm for Gaussian mixtures with unbalanced mixing coefficients. In Proceedings of the 29th International Conference on Machine Learning; Omnipress: Madison, WI, USA, 2012; pp. 1655–1662.
33. Chen, H.F.; Guo, L.; Gao, A.J. Convergence and robustness of the Robbins–Monro algorithm truncated at randomly varying bounds. Stoch. Process. Their Appl. **1987**, 27, 217–231.
34. Van Laarhoven, P.J.; Aarts, E.H. Simulated annealing. In Simulated Annealing: Theory and Applications; Springer: Dordrecht, The Netherlands, 1987; pp. 7–15.
35. Aarts, E.; Korst, J. Simulated Annealing and Boltzmann Machines; John Wiley and Sons Inc.: New York, NY, USA, 1988.
36. Hukushima, K.; Nemoto, K. Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn. **1996**, 65, 1604–1608.
37. Titterington, D.; Smith, A.; Makov, U. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, NY, USA, 1985.
38. Ho, N.; Nguyen, X. Convergence rates of parameter estimation for some weakly identifiable finite mixtures. Ann. Stat. **2016**, 44, 2726–2755.
39. Dwivedi, R.; Ho, N.; Khamaru, K.; Wainwright, M.; Jordan, M.; Yu, B. Sharp Analysis of Expectation-Maximization for Weakly Identifiable Models. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 1866–1876.
40. Winkelbauer, A. Moments and absolute moments of the normal distribution. arXiv **2012**, arXiv:1209.4340.

**Figure 1.** (**a**) Average values, with standard deviation, over 2000 simulations of the negative log-likelihood along the steps of the Riemann EM. The Riemann EM increases the likelihood. (**b**) Average and standard deviation of the relative parameter reconstruction errors at the end of the Riemann EM.

**Figure 2.** (**a**) Visual representation of the number of Riemann intervals over the EM steps for each profile ${\phi}_{i}$. The total number of Riemann intervals computed over 100 EM iterations is: 5150 for “low”, 14,950 for “medium”, 50,500 for “linear” and 104,950 for “high”. (**b**) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The results are fairly similar, in particular between “medium”, “high” and “linear”.

**Figure 3.** (**a**) Visual representation of the number of Riemann intervals over the EM steps for each profile ${\phi}_{i}$. In higher dimension, the computational complexities of the profiles are very different. More precisely, the total number of Riemann squares computed over 100 EM iterations is: 4534 for “square root”, 125,662 for “5 square root”, 348,550 for “low” and 2,318,350 for “medium”. (**b**) For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The “square root” profile performs poorly. However, the other three are comparable despite their different computational complexities.

**Figure 4.** 500 sample points from a Mixture of Gaussians with 3 classes. The true centroids of each Gaussian are depicted by black crosses, and the true covariance matrices are represented by the confidence ellipses of level 0.8, 0.99 and 0.999 around the centre. Each sub-figure corresponds to one of the three different versions of the true parameters. From (**a**) to (**c**): the true ${\mu}_{k}$ of the two left clusters (${\mu}_{1}$ and ${\mu}_{2}$) get closer while everything else stays identical.

**Figure 5.** Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) **when the initialisation is made at the barycenter of all data points** (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithm (orange crosses), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In these examples, tmp-EM managed to correctly identify the positions of the three real centroids.

**Figure 6.** Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) **when the initialisation is made by selecting two points in the isolated cluster and one in the lower ambiguous cluster** (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithm (orange crosses), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In these examples, although EM kept two centroids on the isolated cluster, tmp-EM managed to correctly identify the positions of the three real centroids.

**Figure 7.** Results over many simulations of the Riemann EM and tmp-Riemann EM on the Beta-Gaussian model. The tempered Riemann EM reaches relative errors on the real parameters that are orders of magnitude below those of the Riemann EM with no temperature. The final negative log-likelihood is also lower with the tempering.

**Table 1.** Average and standard deviation of the relative error on ${\mu}_{k}$, $\frac{\left\lVert {\widehat{\mu}}_{k}-{\mu}_{k}\right\rVert_{2}}{\left\lVert {\mu}_{k}\right\rVert_{2}}$, made by EM, tmp-EM with decreasing temperature and tmp-EM with oscillating temperature over 1000 simulated datasets with two different initialisations. The three different parameter families, described in Figure 4, correspond to increasingly ambiguous positions of classes 1 and 2. For both initialisation types, the identification of these two clusters is drastically improved by the tempering. Best results highlighted in **bold**.

| Parameter Family | cl. | EM (barycenter) | EM (2v1) | tmp-EM, decreasing T (barycenter) | tmp-EM, decreasing T (2v1) | tmp-EM, oscillating T (barycenter) | tmp-EM, oscillating T (2v1) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.52 (1.01) | 1.52 (1.24) | 0.60 (1.08) | 0.87 (1.20) | 0.04 (0.26) | 0.29 (0.64) |
| 1 | 2 | 0.55 (1.05) | 1.53 (1.25) | 0.64 (1.10) | 0.96 (1.25) | 0.05 (0.31) | 0.30 (0.64) |
| 1 | 3 | 0.01 (0.06) | 0.01 (0.03) | 0.01 (0.10) | 0.01 (0.02) | 0.03 (0.17) | 0.03 (0.19) |
| 2 | 1 | 1.00 (1.42) | 1.69 (1.51) | 0.96 (1.41) | 1.10 (1.46) | 0.09 (0.47) | 0.37 (0.86) |
| 2 | 2 | 1.03 (1.44) | 1.71 (1.52) | 1.08 (1.46) | 1.11 (1.46) | 0.12 (0.57) | 0.32 (0.79) |
| 2 | 3 | 0.01 (0.05) | 0.02 (0.03) | 0.01 (0.03) | 0.01 (0.04) | 0.01 (0.05) | 0.04 (0.22) |
| 3 | 1 | 1.56 (1.75) | 1.79 (1.77) | 1.63 (1.76) | 1.38 (1.71) | 0.31 (0.97) | 0.39 (0.98) |
| 3 | 2 | 1.51 (1.74) | 1.88 (1.76) | 1.52 (1.74) | 1.30 (1.68) | 0.30 (0.93) | 0.39 (0.97) |
| 3 | 3 | 0.02 (0.04) | 0.02 (0.04) | 0.01 (0.03) | 0.02 (0.06) | 0.01 (0.04) | 0.07 (0.30) |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lartigue, T.; Durrleman, S.; Allassonnière, S.
Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM. *Algorithms* **2022**, *15*, 78.
https://doi.org/10.3390/a15030078
