# A Proximal Point Algorithm for Minimum Divergence Estimators with Application to Mixture Models


## Abstract


## 1. Introduction

## 2. A Description of the Algorithm

#### 2.1. General Context and Notations

#### 2.2. EM Algorithm and Tseng’s Generalization

#### 2.3. Generalization of Tseng’s Algorithm

## 3. Some Convergence Properties of ${\varphi}^{k}$

**Definition 1.**

**Remark 1.**

- A0. The functions $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$ and $D_{\psi}$ are lower semicontinuous;
- A1. The functions $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$, $D_{\psi}$ and $\nabla_1 D_{\psi}$ are defined and continuous on, respectively, $\Phi$, $\Phi \times \Phi$ and $\Phi \times \Phi$;
- AC. The function $\varphi \mapsto \nabla \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$ is defined and continuous on $\Phi$;
- A2. $\Phi^0$ is a compact subset of $\mathrm{int}(\Phi)$;
- A3. $D_{\psi}(\varphi, \overline{\varphi}) > 0$ for all $\overline{\varphi} \neq \varphi \in \Phi$.
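The sequence $(\varphi^k)_k$ analyzed in this section is generated by proximal iterations of the form $\varphi^{k+1} \in \arg\min_{\varphi} \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T}) + D_{\psi}(\varphi, \varphi^k)$. A minimal numerical sketch of the generic proximal-point scheme follows, with a quadratic proximal term standing in for the divergence $D_{\psi}$ and a toy nonconvex objective; the names, the inner optimizer, and the stopping rule are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(objective, phi0, beta=1.0, tol=1e-8, max_iter=200):
    """Generic proximal-point iteration with a quadratic proximal term:
    phi^{k+1} = argmin_phi objective(phi) + (beta/2) * ||phi - phi^k||^2."""
    phi = np.asarray(phi0, dtype=float)
    for _ in range(max_iter):
        res = minimize(
            lambda p: objective(p) + 0.5 * beta * np.sum((p - phi) ** 2),
            phi,
        )
        phi_new = res.x
        # Stop when consecutive iterates are (numerically) identical
        if np.linalg.norm(phi_new - phi) < tol:
            return phi_new
        phi = phi_new
    return phi

# Toy objective: smooth and nonconvex, with a stationary point near 0.967
est = proximal_point(lambda p: (p[0] ** 2 - 1) ** 2 + p[0] / 4, np.array([2.0]))
```

Each iterate decreases the objective (the proximal term is zero at the current point and positive elsewhere), which is the mechanism behind the convergence results below.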

**Proposition 1.**

**Proof.**

**Proposition 2.**

- (a) If AC is verified, then any limit point of $(\varphi^k)_k$ is a stationary point of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$;
- (b) If AC is dropped, then any limit point of $(\varphi^k)_k$ is a "generalized" stationary point of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$, i.e., zero belongs to the subgradient of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi}|p_{\varphi_T})$ evaluated at the limit point.

**Proof.**

**Proposition 3.**

**Proof.**

**Corollary 1.**

**Proof.**

**Proposition 4.**

**Proof.**

## 4. Case Studies

#### 4.1. An Algorithm With Theoretically Global Infimum Attainment

#### 4.2. The Two-Component Gaussian Mixture
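This section works with a two-component Gaussian mixture in which the proportion and the two component means are the unknown parameters (as in the simulation of Section 5, where the true values are $\lambda = 0.35$, $\mu_1 = -2$, $\mu_2 = 1.5$). As a baseline for comparison, a minimal EM sketch for this model, assuming unit component variances (an assumption inferred from the simulation setup; the code is illustrative, not the code used in Section 5):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, lam, mu1, mu2, n_iter=300):
    """EM for the mixture lam*N(mu1, 1) + (1 - lam)*N(mu2, 1),
    unit variances fixed; only (lam, mu1, mu2) are estimated."""
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1
        w1 = lam * norm.pdf(x, mu1, 1.0)
        w2 = (1.0 - lam) * norm.pdf(x, mu2, 1.0)
        tau = w1 / (w1 + w2)
        # M-step: closed-form updates for this model
        lam = tau.mean()
        mu1 = np.sum(tau * x) / np.sum(tau)
        mu2 = np.sum((1 - tau) * x) / np.sum(1 - tau)
    return lam, mu1, mu2

# Simulated sample from the true mixture (0.35, -2, 1.5)
rng = np.random.default_rng(0)
z = rng.random(2000) < 0.35
x = np.where(z, rng.normal(-2.0, 1.0, 2000), rng.normal(1.5, 1.0, 2000))
lam, mu1, mu2 = em_gaussian_mixture(x, 0.5, -1.0, 1.0)
```

The EM iteration is itself a proximal-point iteration for the likelihood with a particular entropic proximal term, which is the starting point of Tseng's generalization discussed in Section 2.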

**Conclusion 1.**

**Conclusion 2.**

**Conclusion 3.**

**Remark 2.**

## 5. Simulation Study

**Remark 3.** In the Gaussian mixture, integrals were computed with the `distrExIntegrate` function of package `distrEx`. It is a slight modification of the standard function `integrate` that performs a Gauss–Legendre quadrature when `integrate` returns an error. In the Weibull mixture, we used the `integral` function from package `pracma`. Function `integral` includes a variety of adaptive numerical integration methods, such as Kronrod–Gauss quadrature, Romberg's method, Gauss–Richardson quadrature, Clenshaw–Curtis (not adaptive) and (adaptive) Simpson's method. Although function `integral` is slow, it performs better than the alternatives even when the integrand behaves relatively badly.
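The fallback behavior described above (adaptive quadrature first, a fixed-order Gauss–Legendre rule when it fails) can be imitated as follows. This is an illustrative Python analogue of what `distrExIntegrate` does in R, not its actual implementation; the function name and the default order are assumptions.

```python
import numpy as np
from scipy.integrate import quad

def integrate_with_fallback(f, a, b, order=100):
    """Try adaptive quadrature first; on failure, fall back to a
    fixed-order Gauss-Legendre rule on [a, b]."""
    try:
        val, _err = quad(f, a, b)
        return val
    except Exception:
        # Gauss-Legendre nodes/weights on [-1, 1], rescaled to [a, b]
        nodes, weights = np.polynomial.legendre.leggauss(order)
        x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
        return 0.5 * (b - a) * np.sum(weights * f(x))

# Sanity check: integral of exp(-x^2) over a wide interval is close to sqrt(pi)
area = integrate_with_fallback(lambda x: np.exp(-x ** 2), -5.0, 5.0)
```

The fixed-order rule is not adaptive, so it trades accuracy on badly behaved integrands for robustness: it always returns a value instead of raising an error.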

#### 5.1. The Two-Component Gaussian Mixture Revisited

#### 5.2. The Two-Component Weibull Mixture Model
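This section uses a two-component Weibull mixture whose true shape parameters, per Table 2, are $\lambda = 0.35$, $\nu_1 = 1.2$, $\nu_2 = 2$. A short sketch of the mixture density, assuming unit scale parameters (the scales are not restated here, so unit scales are an illustrative assumption):

```python
import numpy as np
from scipy.stats import weibull_min

def weibull_mixture_pdf(x, lam, nu1, nu2, s1=1.0, s2=1.0):
    """Density of lam*Weibull(nu1, scale s1) + (1-lam)*Weibull(nu2, scale s2).
    Unit scales s1 = s2 = 1 are an illustrative assumption."""
    return (lam * weibull_min.pdf(x, nu1, scale=s1)
            + (1.0 - lam) * weibull_min.pdf(x, nu2, scale=s2))

# Density on a grid, with the true shapes from Table 2
grid = np.linspace(0.01, 5.0, 500)
dens = weibull_mixture_pdf(grid, 0.35, 1.2, 2.0)
```

Unlike the Gaussian case, the Weibull mixture is supported on the positive half-line, which is why the integration routine discussed in Remark 3 matters here.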

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; Wiley: Hoboken, NJ, USA, 2007.
- Tseng, P. An Analysis of the EM Algorithm and Entropy-Like Proximal Point Methods. Math. Oper. Res. **2004**, 29, 27–44.
- Chrétien, S.; Hero, A.O. Generalized Proximal Point Algorithms and Bundle Implementations. Available online: http://www.eecs.umich.edu/techreports/systems/cspl/cspl-316.pdf (accessed on 25 July 2016).
- Goldstein, A.; Russak, I. How good are the proximal point algorithms? Numer. Funct. Anal. Optim. **1987**, 9, 709–724.
- Chrétien, S.; Hero, A.O. Acceleration of the EM algorithm via proximal point iterations. In Proceedings of the IEEE International Symposium on Information Theory, Cambridge, MA, USA, 16–21 August 1998.
- Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci. **1963**, 8, 95–108. (In German)
- Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivar. Anal. **2009**, 100, 16–36.
- Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B **1984**, 46, 440–464.
- Broniatowski, M.; Keziou, A. Minimization of divergences on sets of signed measures. Stud. Sci. Math. Hung. **2006**, 43, 403–442.
- Liese, F.; Vajda, I. On Divergences and Informations in Statistics and Information Theory. IEEE Trans. Inf. Theory **2006**, 52, 4394–4412.
- Al Mohamad, D. Towards a better understanding of the dual representation of phi divergences. arXiv **2016**, arXiv:1506.02166.
- Toma, A.; Broniatowski, M. Dual divergence estimators and tests: Robustness results. J. Multivar. Anal. **2011**, 102, 20–36.
- Rockafellar, R.T.; Wets, R.J.B. Variational Analysis, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 1998.
- Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika **1998**, 85, 549–559.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B **1977**, 39, 1–38.
- Wu, C.F.J. On the Convergence Properties of the EM Algorithm. Ann. Stat. **1983**, 11, 95–103.
- Ostrowski, A. Solution of Equations and Systems of Equations; Academic Press: Cambridge, MA, USA, 1966.
- Chrétien, S.; Hero, A.O. On EM algorithms and their proximal generalizations. ESAIM Probab. Stat. **2008**, 12, 308–326.
- Berge, C. Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces, and Convexity; Dover Publications: Mineola, NY, USA, 1963.
- Meister, A. Deconvolution Problems in Nonparametric Statistics; Springer: Berlin/Heidelberg, Germany, 2009.
- Jiménez, R.; Shao, Y. On robustness and efficiency of minimum divergence estimators. Test **2001**, 10, 241–248.
- Nelder, J.A.; Mead, R. A Simplex Method for Function Minimization. Comput. J. **1965**, 7, 308–313.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013.

**Figure 1.** Decrease of the (estimated) Hellinger divergence between the true density and the estimated model at each iteration in the Gaussian mixture. The left panel shows the values of the kernel-based dual Formula (3); the right panel shows the values of the classical dual Formula (2). Values are plotted on a logarithmic scale, $\log(1+x)$.

**Table 1.** The mean and the standard deviation of the estimates and of the errors committed in a 100-run experiment on a two-component Gaussian mixture. The true set of parameters is $\lambda = 0.35$, $\mu_1 = -2$, $\mu_2 = 1.5$.

| Estimation Method | λ | sd(λ) | $\mu_1$ | sd($\mu_1$) | $\mu_2$ | sd($\mu_2$) | TVD | sd(TVD) |
|---|---|---|---|---|---|---|---|---|
| **Without outliers** | | | | | | | | |
| Classical MD$\phi$DE | 0.349 | 0.049 | −1.989 | 0.207 | 1.511 | 0.151 | 0.061 | 0.029 |
| New MD$\phi$DE–Silverman | 0.349 | 0.049 | −1.987 | 0.208 | 1.520 | 0.155 | 0.062 | 0.029 |
| MDPD $a=0.5$ | 0.360 | 0.053 | −1.997 | 0.226 | 1.489 | 0.135 | 0.065 | 0.025 |
| EM (MLE) | 0.360 | 0.054 | −1.989 | 0.204 | 1.493 | 0.136 | 0.064 | 0.025 |
| **With 10% outliers** | | | | | | | | |
| Classical MD$\phi$DE | 0.357 | 0.022 | −2.629 | 0.094 | 1.734 | 0.111 | 0.146 | 0.034 |
| New MD$\phi$DE–Silverman | 0.352 | 0.057 | −1.756 | 0.224 | 1.358 | 0.132 | 0.087 | 0.033 |
| MDPD $a=0.5$ | 0.364 | 0.056 | −1.819 | 0.218 | 1.404 | 0.132 | 0.078 | 0.030 |
| EM (MLE) | 0.342 | 0.064 | −2.617 | 0.288 | 1.713 | 0.172 | 0.150 | 0.034 |

**Table 2.** The mean and the standard deviation of the estimates and of the errors committed in a 100-run experiment on a two-component Weibull mixture. The true set of parameters is $\lambda = 0.35$, $\nu_1 = 1.2$, $\nu_2 = 2$.

| Estimation Method | λ | sd(λ) | $\nu_1$ | sd($\nu_1$) | $\nu_2$ | sd($\nu_2$) | TVD | sd(TVD) |
|---|---|---|---|---|---|---|---|---|
| **Without outliers** | | | | | | | | |
| Classical MD$\phi$DE | 0.356 | 0.066 | 1.245 | 0.228 | 2.055 | 0.237 | 0.052 | 0.025 |
| New MD$\phi$DE–Silverman | 0.387 | 0.067 | 1.229 | 0.241 | 2.145 | 0.289 | 0.058 | 0.029 |
| MDPD $a=0.5$ | 0.354 | 0.068 | 1.238 | 0.230 | 2.071 | 0.345 | 0.056 | 0.029 |
| EM (MLE) | 0.355 | 0.066 | 1.245 | 0.228 | 2.054 | 0.237 | 0.052 | 0.025 |
| **With 10% outliers** | | | | | | | | |
| Classical MD$\phi$DE | 0.250 | 0.085 | 1.089 | 0.300 | 1.470 | 0.335 | 0.092 | 0.037 |
| New MD$\phi$DE–Silverman | 0.349 | 0.076 | 1.122 | 0.252 | 1.824 | 0.324 | 0.067 | 0.034 |
| MDPD $a=0.5$ | 0.322 | 0.077 | 1.158 | 0.236 | 1.858 | 0.344 | 0.060 | 0.029 |
| EM (MLE) | 0.259 | 0.095 | 0.941 | 0.368 | 1.565 | 0.325 | 0.095 | 0.035 |
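The TVD columns in Tables 1 and 2 report the total variation distance between the estimated and the true densities, $\mathrm{TVD}(p, q) = \frac{1}{2}\int |p - q|$. A numerical sketch for the Gaussian case, using the parameterization of Table 1 and unit component variances (function names are illustrative, not the simulation code):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def mixture_pdf(x, lam, mu1, mu2):
    """Two-component Gaussian mixture with unit variances."""
    return lam * norm.pdf(x, mu1, 1.0) + (1.0 - lam) * norm.pdf(x, mu2, 1.0)

def tvd(params_hat, params_true, lower=-10.0, upper=10.0):
    """Total variation distance: 0.5 * integral of |p_hat - p_true|."""
    diff = lambda x: abs(mixture_pdf(x, *params_hat)
                         - mixture_pdf(x, *params_true))
    val, _err = quad(diff, lower, upper, limit=200)
    return 0.5 * val

# Error of a slightly perturbed estimate against the true (0.35, -2, 1.5)
err = tvd((0.349, -1.989, 1.511), (0.35, -2.0, 1.5))
```

The integration interval is truncated to $[-10, 10]$, which is harmless here since both mixtures have negligible mass outside it.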

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Al Mohamad, D.; Broniatowski, M.
A Proximal Point Algorithm for Minimum Divergence Estimators with Application to Mixture Models. *Entropy* **2016**, *18*, 277.
https://doi.org/10.3390/e18080277
