# A Proximal Point Algorithm for Minimum Divergence Estimators with Application to Mixture Models^{†}

Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie CURIE, 4 place Jussieu, 75005 Paris, France

Author to whom correspondence should be addressed.

This paper is an extended version of our paper published in the 2nd Conference on Geometric Science of Information, Palaiseau, France, 28–30 October 2015.

Academic Editors: Frédéric Barbaresco and Frank Nielsen

Received: 11 June 2016 / Revised: 20 July 2016 / Accepted: 21 July 2016 / Published: 27 July 2016

(This article belongs to the Special Issue Differential Geometrical Theory of Statistics)

Estimators derived from a divergence criterion such as $\phi$-divergences are generally more robust than the maximum likelihood ones. We are interested in particular in the so-called minimum dual $\phi$-divergence estimator (MD$\phi$DE), an estimator built using a dual representation of $\phi$-divergences. We present in this paper an iterative proximal point algorithm that permits the calculation of such an estimator. The algorithm contains by construction the well-known Expectation Maximization (EM) algorithm. Our work is based on the paper of Tseng on the likelihood function. We provide some convergence properties by adapting the ideas of Tseng. We improve Tseng’s results by relaxing the identifiability condition on the proximal term, a condition which is not verified for most mixture models and is hard to verify for “non mixture” ones. Convergence of the EM algorithm in a two-component Gaussian mixture is discussed in the spirit of our approach. Several experimental results on mixture models are provided to confirm the validity of the approach.

The Expectation Maximization (EM) algorithm is a well-known method for calculating the maximum likelihood estimator of a model with incomplete data. For example, when working with mixture models in the context of clustering, the labels or classes of observations are unknown during the training phase. Several variants of the EM algorithm have been proposed (see [1]). Another way to look at the EM algorithm is as a proximal point problem (see [2,3]). Indeed, one may rewrite the conditional expectation of the complete log-likelihood as a sum of the log-likelihood function and a distance-like function over the conditional densities of the labels given an observation. Generally, the proximal term has a regularization effect in the sense that a proximal point algorithm is more stable and frequently outperforms classical optimization algorithms (see [4]). Chrétien and Hero [5] prove superlinear convergence of a proximal point algorithm derived from the EM algorithm. Notice that EM-type algorithms usually enjoy no more than linear convergence.

Taking into consideration the need for robust estimators, and the fact that the maximum likelihood estimator (MLE) is the least robust estimator among the class of divergence-type estimators that we present below, we generalize the EM algorithm (and the version of Tseng [2]) by replacing the log-likelihood function by an estimator of a $\phi$-divergence between the true distribution of the data and the model. A $\phi$-divergence in the sense of Csiszár [6] is defined in the same way as in [7] by:

$$D_{\phi}(Q,P)=\int \phi\left(\frac{dQ}{dP}(y)\right)dP(y),$$

where $\phi$ is a nonnegative strictly convex function. Examples of such divergences are the Kullback–Leibler (KL) divergence, the modified KL divergence, and the Hellinger distance, among others. All these well-known divergences belong to the class of Cressie–Read functions [8] defined by

$$\phi_{\gamma}(x)=\frac{x^{\gamma}-\gamma x+\gamma -1}{\gamma(\gamma -1)}\quad\text{for}\quad\gamma \in \mathbb{R}\backslash \{0,1\},$$

for $\gamma =1,0,\frac{1}{2}$ respectively. For $\gamma \in \{0,1\}$, the limit is calculated, and we denote $\phi_{0}(x)=-\log x+x-1$ for the case of the modified KL and $\phi_{1}(x)=x\log x-x+1$ for the KL.
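As a small illustration, the Cressie–Read family can be evaluated numerically, with the two limiting cases handled separately. This is only a sketch: the function name `cressie_read` is an assumption made for this example and does not come from the paper.

```python
import math

def cressie_read(x: float, gamma: float) -> float:
    """Cressie-Read function phi_gamma(x) for x > 0 (hypothetical helper name)."""
    if x <= 0:
        raise ValueError("x must be positive")
    if gamma == 0:   # limit gamma -> 0: modified KL
        return -math.log(x) + x - 1
    if gamma == 1:   # limit gamma -> 1: KL
        return x * math.log(x) - x + 1
    return (x**gamma - gamma * x + gamma - 1) / (gamma * (gamma - 1))

# phi_gamma(1) = 0 for every gamma.
for g in (0.5, 0.0, 1.0, 2.0):
    assert abs(cressie_read(1.0, g)) < 1e-12

# The general formula approaches the limiting cases as gamma -> 0 and gamma -> 1.
near_zero = cressie_read(2.0, 1e-6)
near_one = cressie_read(2.0, 1 - 1e-6)
```

For $\gamma=\frac{1}{2}$ the formula simplifies to $2(\sqrt{x}-1)^{2}$, which is the Hellinger case.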

Since the calculation of the $\phi$-divergence uses the unknown true distribution, we need to estimate it. We consider the dual estimator of the divergence introduced independently by [9,10]. The use of this estimator is motivated by several reasons. Its minimum coincides with the MLE for $\phi(t)=-\log(t)+t-1$. In addition, it has the same form for discrete and continuous models, and does not require any partitioning or smoothing.

Let $(P_{\varphi})_{\varphi \in \Phi}$ be a parametric model with $\Phi \subset \mathbb{R}^{d}$, and denote $\varphi_{T}$ the true set of parameters. Let $dy$ be the Lebesgue measure defined on $\mathbb{R}$. Suppose that $\forall \varphi \in \Phi$, the probability measure $P_{\varphi}$ is absolutely continuous with respect to $dy$ and denote $p_{\varphi}$ the corresponding probability density. The dual estimator of the $\phi$-divergence given an $n$-sample $y_{1},\cdots ,y_{n}$ is given by:

$$\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})=\sup_{\alpha \in \Phi}\left\{\int \phi^{\prime}\left(\frac{p_{\varphi}}{p_{\alpha}}\right)(x)\,p_{\varphi}(x)\,dx-\frac{1}{n}\sum_{i=1}^{n}\phi^{\#}\left(\frac{p_{\varphi}}{p_{\alpha}}\right)(y_{i})\right\},\tag{2}$$

with $\phi^{\#}(t)=t\phi^{\prime}(t)-\phi(t)$. Al Mohamad [11] argues that this formula works well under the model; however, when we are not under the model, this quantity largely underestimates the divergence between the true distribution and the model. The same reference proposes the following modification:

$$\tilde{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})=\int \phi^{\prime}\left(\frac{p_{\varphi}}{K_{n,w}}\right)(x)\,p_{\varphi}(x)\,dx-\frac{1}{n}\sum_{i=1}^{n}\phi^{\#}\left(\frac{p_{\varphi}}{K_{n,w}}\right)(y_{i}),\tag{3}$$

where $K_{n,w}$ is the Rosenblatt–Parzen kernel estimate with window parameter $w$. Whether it is $\widehat{D}_{\phi}$ or $\tilde{D}_{\phi}$, the minimum dual $\phi$-divergence estimator (MD$\phi$DE) is defined as the argument of the infimum of the dual approximation:

$$\widehat{\varphi}_{n}=\underset{\varphi \in \Phi}{\arg\inf}\;\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}}),\tag{4}$$

$$\tilde{\varphi}_{n}=\underset{\varphi \in \Phi}{\arg\inf}\;\tilde{D}_{\phi}(p_{\varphi},p_{\varphi_{T}}).\tag{5}$$
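To make the kernel-based criterion concrete, the following sketch evaluates it for a Gaussian location model $\mathcal{N}(\mu,1)$ with the Hellinger choice $\gamma=\frac{1}{2}$ and minimizes it over a grid of candidate means. Everything specific here is an assumption made for illustration: the model, the Gaussian kernel, the window chosen by Silverman's rule, the Riemann-sum integration, and the helper names. It uses $\phi_{\gamma}^{\prime}(t)=(t^{\gamma-1}-1)/(\gamma-1)$ and $\phi_{\gamma}^{\#}(t)=(t^{\gamma}-1)/\gamma$, which follow from the Cressie–Read formula for $\gamma \notin \{0,1\}$.

```python
import numpy as np

def normal_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kde(x, sample, w):
    """Rosenblatt-Parzen estimate with a Gaussian kernel and window w."""
    x = np.atleast_1d(x)
    return normal_pdf(x[:, None], sample[None, :], w).mean(axis=1)

def kernel_dual_divergence(mu, sample, w, grid, gamma=0.5):
    """Kernel-based dual estimate for the model N(mu, 1); gamma not in {0, 1}."""
    dphi = lambda t: (t ** (gamma - 1) - 1) / (gamma - 1)   # phi_gamma'
    phi_sharp = lambda t: (t ** gamma - 1) / gamma          # t * phi'(t) - phi(t)
    p_grid = normal_pdf(grid, mu)
    ratio_grid = p_grid / kde(grid, sample, w)
    dx = grid[1] - grid[0]
    integral = np.sum(dphi(ratio_grid) * p_grid) * dx       # Riemann sum
    ratio_obs = normal_pdf(sample, mu) / kde(sample, sample, w)
    return integral - np.mean(phi_sharp(ratio_obs))

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=300)
w = 1.06 * sample.std() * len(sample) ** (-0.2)   # Silverman's rule (assumption)
grid = np.linspace(sample.min() - 4, sample.max() + 4, 1001)
candidates = np.linspace(-2, 2, 81)
values = [kernel_dual_divergence(m, sample, w, grid) for m in candidates]
mu_hat = candidates[int(np.argmin(values))]   # grid version of the arg inf
```

A grid search is used only to keep the sketch dependency-free; any numerical optimizer could replace it.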

Asymptotic properties and consistency of these two estimators can be found in [7,11]. Robustness properties were also studied using the influence function approach in [11,12]. The kernel-based MD$\phi$DE (5) seems to be a better estimator than the classical MD$\phi$DE (4) in the sense that the former is robust whereas the latter is generally not. Under the model, the estimator given by (4) is, however, more efficient, especially when the true density of the data is unbounded. More investigation is needed in the context of unbounded densities, since we may use asymmetric kernels in order to improve the efficiency of the kernel-based MD$\phi$DE; see [11] for more details.

In this paper, we propose calculation of the MD$\phi$DE using an iterative procedure based on the work of Tseng [2] on the log-likelihood function. This procedure has the form of a proximal point algorithm, and extends the EM algorithm. Our convergence proof demands some regularity (continuity and differentiability) of the estimated divergence with respect to the parameter vector $\varphi$, which is not easily checked using (2). Recent results in the book of Rockafellar and Wets [13] provide sufficient conditions to prove continuity and differentiability of supremal functions of the form of (2) with respect to $\varphi$. Differentiability with respect to $\varphi$ still remains a very hard task; therefore, our results cover cases when the objective function is not differentiable.

The paper is organized as follows. In Section 2, we present the general context, together with the derivation of our algorithm from the EM algorithm by way of Tseng’s generalization. In Section 3, we present some convergence properties. In Section 4, we discuss a variant of the algorithm with a theoretical global infimum, an example of the two-component Gaussian mixture model, and a convergence proof of the EM algorithm in the spirit of our approach. Finally, Section 5 contains simulations confirming our claim about the efficiency and the robustness of our approach in comparison with the MLE. The algorithm is also applied to the so-called minimum density power divergence (MDPD) introduced by [14].

Let $(X,Y)$ be a couple of random variables with joint probability density function $f(x,y|\varphi)$ parametrized by a vector of parameters $\varphi \in \Phi \subset \mathbb{R}^{d}$. Let $(X_{1},Y_{1}),\cdots ,(X_{n},Y_{n})$ be $n$ copies of $(X,Y)$ independently and identically distributed. Finally, let $(x_{1},y_{1}),\cdots ,(x_{n},y_{n})$ be $n$ realizations of the $n$ copies of $(X,Y)$. The $x_{i}$'s are the unobserved data (labels) and the $y_{i}$'s are the observations. The vector of parameters $\varphi$ is unknown and needs to be estimated. The observed data $y_{i}$ are supposed to be real numbers, and the labels $x_{i}$ belong to a space $\mathcal{X}$ not necessarily finite unless mentioned otherwise. The marginal density of the observed data is given by $p_{\varphi}(y)=\int f(x,y|\varphi)\,dx$, where $dx$ is a measure defined on the label space (for example, the counting measure if we work with mixture models).

For a parametrized function $f$ with a parameter $a$, we write $f(x|a)$. We use the notation $\varphi^{k}$ for sequences with the index above. The derivatives of a real valued function $\psi$ defined on $\mathbb{R}$ are denoted $\psi^{\prime},\psi^{\prime\prime},$ etc. We denote $\nabla f$ the gradient of a real function $f$ defined on $\mathbb{R}^{d}$. For a generic function of two (vectorial) arguments $D(\varphi|\theta)$, $\nabla_{1}D(\varphi|\theta)$ denotes the gradient with respect to the first (vectorial) variable. Finally, for any set $A$, we use $int(A)$ to denote the interior of $A$.

The EM algorithm estimates the unknown parameter vector by (see [15]):

$$\varphi^{k+1}=\underset{\Phi}{\arg\max}\;\mathbb{E}\left[\log f(\mathbf{X},\mathbf{Y}|\varphi)\,\middle|\,\mathbf{Y}=\mathbf{y},\varphi^{k}\right],$$

where $\mathbf{X}=(X_{1},\cdots ,X_{n})$, $\mathbf{Y}=(Y_{1},\cdots ,Y_{n})$ and $\mathbf{y}=(y_{1},\cdots ,y_{n})$. By independence between the couples $(X_{i},Y_{i})$, the previous iteration may be written as:

$$\begin{aligned}\varphi^{k+1}&=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\mathbb{E}\left[\log f(X_{i},Y_{i}|\varphi)\,\middle|\,Y_{i}=y_{i},\varphi^{k}\right]\\&=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(f(x,y_{i}|\varphi)\right)h_{i}(x|\varphi^{k})\,dx,\end{aligned}$$

where $h_{i}(x|\varphi^{k})=\frac{f(x,y_{i}|\varphi^{k})}{p_{\varphi^{k}}(y_{i})}$ is the conditional density of the labels (at step $k$) given $y_{i}$, which we suppose to be positive $dx$-almost everywhere. It is well-known that the EM iterations can be rewritten as a difference between the log-likelihood and a Kullback–Leibler distance-like function. Indeed,

$$\begin{aligned}\varphi^{k+1}&=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(h_{i}(x|\varphi)\times p_{\varphi}(y_{i})\right)h_{i}(x|\varphi^{k})\,dx\\&=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(p_{\varphi}(y_{i})\right)h_{i}(x|\varphi^{k})\,dx+\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(h_{i}(x|\varphi)\right)h_{i}(x|\varphi^{k})\,dx\\&=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\log\left(p_{\varphi}(y_{i})\right)+\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(\frac{h_{i}(x|\varphi)}{h_{i}(x|\varphi^{k})}\right)h_{i}(x|\varphi^{k})\,dx\\&\qquad\qquad+\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(h_{i}(x|\varphi^{k})\right)h_{i}(x|\varphi^{k})\,dx.\end{aligned}$$

The final line is justified by the fact that $h_{i}(x|\varphi^{k})$ is a density in $x$; therefore, it integrates to 1. The additional term does not depend on $\varphi$ and, hence, can be omitted. We now have the following iterative procedure:

$$\varphi^{k+1}=\underset{\Phi}{\arg\max}\sum_{i=1}^{n}\log\left(p_{\varphi}(y_{i})\right)+\sum_{i=1}^{n}\int_{\mathcal{X}}\log\left(\frac{h_{i}(x|\varphi)}{h_{i}(x|\varphi^{k})}\right)h_{i}(x|\varphi^{k})\,dx.\tag{7}$$
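The decomposition of the EM objective into a log-likelihood, a Kullback–Leibler-type term and a constant can be checked numerically. The sketch below does so for a two-component Gaussian mixture with unit variances and labels in $\{0,1\}$. The mixture, the parameter values and the helper names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def joint(y, lam, mu):
    """f(x, y_i | phi) for labels x in {0, 1}: rows are observations, columns labels."""
    weights = np.array([lam, 1.0 - lam])
    dens = np.exp(-0.5 * (y[:, None] - np.asarray(mu)[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return weights * dens

def posterior(y, lam, mu):
    """h_i(x | phi) = f(x, y_i | phi) / p_phi(y_i)."""
    f = joint(y, lam, mu)
    return f / f.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 60), rng.normal(2, 1, 40)])
phi_k = (0.5, (-1.0, 1.0))   # current iterate: (proportion, means)
phi = (0.3, (-2.5, 1.5))     # arbitrary candidate value

h_k, h = posterior(y, *phi_k), posterior(y, *phi)
q_value = np.sum(np.log(joint(y, *phi)) * h_k)        # conditional expectation
loglik = np.sum(np.log(joint(y, *phi).sum(axis=1)))   # sum_i log p_phi(y_i)
prox = np.sum(np.log(h / h_k) * h_k)                  # minus a KL-type distance
const = np.sum(np.log(h_k) * h_k)                     # does not depend on phi
```

With these quantities, `q_value` equals `loglik + prox + const` up to rounding, and `prox` is nonpositive, so maximizing the EM objective penalizes moves away from $\varphi^{k}$.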

The previous iteration has the form of a proximal point maximization of the log-likelihood, i.e., a perturbation of the log-likelihood by a distance-like function defined on the conditional densities of the labels. Tseng [2] generalizes this iteration by allowing any nonnegative convex function $\psi$ to replace the $t\mapsto -\log(t)$ function. Tseng’s recurrence is defined by:

$$\varphi^{k+1}=\underset{\varphi}{\arg\sup}\;J(\varphi)-D_{\psi}(\varphi ,\varphi^{k}),$$

where $J$ is the log-likelihood function and $D_{\psi}$ is given by:

$$D_{\psi}(\varphi ,\varphi^{k})=\sum_{i=1}^{n}\int_{\mathcal{X}}\psi\left(\frac{h_{i}(x|\varphi)}{h_{i}(x|\varphi^{k})}\right)h_{i}(x|\varphi^{k})\,dx,\tag{8}$$

for any real nonnegative convex function $\psi$ such that $\psi(1)=\psi^{\prime}(1)=0$. $D_{\psi}(\varphi_{1},\varphi_{2})$ is nonnegative, and $D_{\psi}(\varphi_{1},\varphi_{2})=0$ if and only if $\forall i,\ h_{i}(x|\varphi_{1})=h_{i}(x|\varphi_{2})$ $dx$-almost everywhere.

We use the relationship between maximizing the log-likelihood and minimizing the Kullback–Leibler divergence to generalize the previous algorithm. We, therefore, replace the log-likelihood function by an estimate of a $\phi$-divergence $D_{\phi}$ between the true distribution and the model. We use the dual estimators of the divergence presented earlier in the introduction, (2) or (3), which we denote in the same manner $\widehat{D}_{\phi}$, unless mentioned otherwise. Our new algorithm is defined by:

$$\varphi^{k+1}=\underset{\varphi}{\arg\inf}\;\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})+\frac{1}{n}D_{\psi}(\varphi ,\varphi^{k}),$$

where $D_{\psi}(\varphi ,\varphi^{k})$ is defined by (8). When $\phi(t)=-\log(t)+t-1$, it is easy to see that we get recurrence (7). Indeed, for the case of (2) we have:

$$\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})=\sup_{\alpha}\frac{1}{n}\sum_{i=1}^{n}\log\left(p_{\alpha}(y_{i})\right)-\frac{1}{n}\sum_{i=1}^{n}\log\left(p_{\varphi}(y_{i})\right).$$
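The reduction can be verified term by term from the definition of the dual estimator (2). A short worked computation, using only $\phi(t)=-\log(t)+t-1$ and $\phi^{\#}(t)=t\phi^{\prime}(t)-\phi(t)$, and assuming that both $p_{\varphi}$ and $p_{\alpha}$ integrate to 1:

$$\phi^{\prime}(t)=1-\frac{1}{t},\qquad \phi^{\#}(t)=t-1-\left(-\log t+t-1\right)=\log t,$$

$$\int \phi^{\prime}\left(\frac{p_{\varphi}}{p_{\alpha}}\right)(x)\,p_{\varphi}(x)\,dx=\int \left(p_{\varphi}(x)-p_{\alpha}(x)\right)dx=0,$$

$$-\frac{1}{n}\sum_{i=1}^{n}\phi^{\#}\left(\frac{p_{\varphi}}{p_{\alpha}}\right)(y_{i})=\frac{1}{n}\sum_{i=1}^{n}\log p_{\alpha}(y_{i})-\frac{1}{n}\sum_{i=1}^{n}\log p_{\varphi}(y_{i}).$$

Only the first sum depends on $\alpha$, so the supremum acts on it alone, which gives the displayed expression.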

Since the first term in $\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ does not depend on $\varphi$, it plays no role in the arg inf defining $\varphi^{k+1}$, and we easily get (7). The same applies for the case of (3). For notational simplicity, from now on, we redefine $D_{\psi}$ with a normalization by $n$, i.e.,

$${D}_{\psi}(\varphi ,{\varphi}^{k})=\frac{1}{n}\sum _{i=1}^{n}{\int}_{\mathcal{X}}\psi \left(\frac{{h}_{i}\left(x\right|\varphi )}{{h}_{i}\left(x\right|{\varphi}^{k})}\right){h}_{i}\left(x\right|{\varphi}^{k})dx.$$

Hence, our set of algorithms is redefined by:

$$\varphi^{k+1}=\underset{\varphi}{\arg\inf}\;\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})+D_{\psi}(\varphi ,\varphi^{k}).\tag{11}$$
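A minimal sketch of this recurrence is given below, under several simplifying assumptions that are not part of the paper: a two-component Gaussian mixture with known means and unit variances, a single unknown proportion `lam`, the case $\phi(t)=-\log(t)+t-1$ (so that the estimated divergence equals the normalized negative log-likelihood up to a constant), the proximal function $\psi(t)=\frac{1}{2}(\sqrt{t}-1)^{2}$, and a grid search in place of a numerical optimizer. All function names are hypothetical.

```python
import numpy as np

MU = np.array([-2.0, 2.0])   # known component means (illustrative assumption)

def mixture_terms(y, lam):
    """f(x, y_i | lam) for labels x in {0, 1}, unit variances."""
    return np.array([lam, 1.0 - lam]) * np.exp(-0.5 * (y[:, None] - MU) ** 2) / np.sqrt(2 * np.pi)

def posterior(y, lam):
    f = mixture_terms(y, lam)
    return f / f.sum(axis=1, keepdims=True)

def neg_loglik(y, lam):
    """-(1/n) sum_i log p_lam(y_i): the dual estimate for phi_0, up to a constant."""
    return -np.mean(np.log(mixture_terms(y, lam).sum(axis=1)))

def prox_term(y, lam, lam_k, psi):
    """Normalized D_psi(lam, lam_k) over the finite label space."""
    h, h_k = posterior(y, lam), posterior(y, lam_k)
    return np.mean(np.sum(psi(h / h_k) * h_k, axis=1))

def proximal_point(y, lam0, psi, grid, n_iter=30, tol=1e-10):
    lam, history = lam0, [neg_loglik(y, lam0)]
    for _ in range(n_iter):
        objective = [neg_loglik(y, l) + prox_term(y, l, lam, psi) for l in grid]
        lam = grid[int(np.argmin(objective))]   # grid version of the arg inf
        history.append(neg_loglik(y, lam))
        if history[-2] - history[-1] < tol:     # stop when the criterion stalls
            break
    return lam, history

psi = lambda t: 0.5 * (np.sqrt(t) - 1.0) ** 2   # psi(1) = psi'(1) = 0
rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-2, 1, 70), rng.normal(2, 1, 30)])
grid = np.linspace(0.01, 0.99, 99)
lam_hat, history = proximal_point(y, grid[49], psi, grid)   # start at lam = 0.5
```

Because the current iterate always belongs to the grid and the proximal term vanishes there, the values stored in `history` cannot increase from one iteration to the next. The stopping rule simply monitors the decrease of the criterion between two iterations.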

We will see later that this iteration forces the divergence to decrease and that, under suitable conditions, it converges to a (local) minimum of $\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$. It follows that algorithm (11) is a way to calculate both the MD$\phi$DE (4) and the kernel-based MD$\phi$DE (5).

We show here how, according to some possible situations, one may prove convergence of the algorithm defined by (11). Let $\varphi^{0}$ be a given initialization, and define

$$\Phi^{0}:=\{\varphi \in \Phi :\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})\le \widehat{D}_{\phi}(p_{\varphi^{0}},p_{\varphi_{T}})\},$$

which we suppose to be a subset of $int(\Phi)$. The idea of defining this set in this context is inherited from the paper of Wu [16], which provided the first correct proof of convergence for the EM algorithm. Before going any further, we recall the following definition of a (generalized) stationary point.

Let $f:{\mathbb{R}}^{d}\to \mathbb{R}$ be a real valued function. If f is differentiable at a point ${\varphi}^{*}$ such that $\nabla f\left({\varphi}^{*}\right)=0$, we then say that ${\varphi}^{*}$ is a stationary point of f. If f is not differentiable at ${\varphi}^{*}$ but the subgradient of f at ${\varphi}^{*}$, say $\partial f\left({\varphi}^{*}\right)$, exists such that $0\in \partial f\left({\varphi}^{*}\right)$, then ${\varphi}^{*}$ is called a generalized stationary point of f.

Throughout the paper, the subgradient is defined for any function, not necessarily convex (see Definition 8.3 in [13] for more details).

We will be using the following assumptions:

- A0. Functions $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ and $D_{\psi}$ are lower semicontinuous;
- A1. Functions $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$, $D_{\psi}$ and $\nabla_{1}D_{\psi}$ are defined and continuous on, respectively, $\Phi$, $\Phi \times \Phi$ and $\Phi \times \Phi$;
- AC. Function $\varphi \mapsto \nabla \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ is defined and continuous on $\Phi$;
- A2. $\Phi^{0}$ is a compact subset of $int(\Phi)$;
- A3. $D_{\psi}(\varphi ,\overline{\varphi})>0$ for all $\overline{\varphi}\ne \varphi \in \Phi$.

Recall also that we suppose that $h_{i}(x|\varphi)>0$, $dx$-a.e. We relax the convexity assumption on the function $\psi$. We only suppose that $\psi$ is nonnegative and that $\psi(t)=0$ iff $t=1$. In addition, $\psi^{\prime}(1)=0$.

Continuity and differentiability assumptions on the function $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ for the case of (3) can be easily checked using Lebesgue theorems. The continuity assumption for the case of (2) can be checked using Theorem 1.17 or Corollary 10.14 in [13]. Differentiability can also be checked using Corollary 10.14 or Theorem 10.31 in the same book. As for $D_{\psi}$, continuity and differentiability can be obtained merely by fulfilling the conditions of Lebesgue theorems. When working with mixture models, we only need the continuity and differentiability of $\psi$ and of the functions $h_{i}$. The latter is easily deduced from regularity assumptions on the model. For assumption A2, there is no universal method; see Section 4.2 for an example. Assumption A3 can be checked using Lemma 2 in [2].

We start the convergence properties by proving that the objective function $\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ decreases along the sequence $(\varphi^{k})_{k}$, and give a possible set of conditions for the existence of the sequence $(\varphi^{k})_{k}$.

Proposition 1. (a) Assume that the sequence $(\varphi^{k})_{k}$ is well defined in $\Phi$; then $\widehat{D}_{\phi}(p_{\varphi^{k+1}},p_{\varphi_{T}})\le \widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})$, and (b) $\forall k,\ \varphi^{k}\in \Phi^{0}$. (c) Assume A0 and A2 are verified; then the sequence $(\varphi^{k})_{k}$ is defined and bounded. Moreover, the sequence $\left(\widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})\right)_{k}$ converges.

We prove (a). We have by definition of the arginf:

$$\widehat{D}_{\phi}(p_{\varphi^{k+1}},p_{\varphi_{T}})+D_{\psi}(\varphi^{k+1},\varphi^{k})\le \widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})+D_{\psi}(\varphi^{k},\varphi^{k}).$$

We use the fact that $D_{\psi}(\varphi^{k},\varphi^{k})=0$ for the right-hand side and that $D_{\psi}(\varphi^{k+1},\varphi^{k})\ge 0$ for the left-hand side of the previous inequality. Hence, $\widehat{D}_{\phi}(p_{\varphi^{k+1}},p_{\varphi_{T}})\le \widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})$.

We prove (b) using the decreasing property previously proved in (a). By induction, we have $\forall k,\ \widehat{D}_{\phi}(p_{\varphi^{k+1}},p_{\varphi_{T}})\le \widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})\le \cdots \le \widehat{D}_{\phi}(p_{\varphi^{0}},p_{\varphi_{T}})$. The result follows directly by definition of $\Phi^{0}$.

We prove (c) by induction on $k$. For $k=0$, clearly $\varphi^{0}$ is well defined since we choose it. The choice of the initial point $\varphi^{0}$ may influence the convergence of the sequence; see the example of the Gaussian mixture in Section 4.2. Suppose, for some $k\ge 0$, that $\varphi^{k}$ exists. We prove that the infimum is attained in $\Phi^{0}$. Let $\varphi \in \Phi$ be any vector at which the optimized function takes a value no greater than its value at $\varphi^{k}$, i.e., $\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})+D_{\psi}(\varphi ,\varphi^{k})\le \widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})+D_{\psi}(\varphi^{k},\varphi^{k})$. We have:

$$\begin{array}{ccc}\hfill {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})& \le & {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})+{D}_{\psi}(\varphi ,{\varphi}^{k})\hfill \\ & \le & {\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})+{D}_{\psi}({\varphi}^{k},{\varphi}^{k})\hfill \\ & \le & {\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\hfill \\ & \le & {\widehat{D}}_{\phi}({p}_{{\varphi}^{0}},{p}_{{\varphi}_{T}}).\hfill \end{array}$$

The first line follows from the nonnegativity of $D_{\psi}$. As $\widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})\le \widehat{D}_{\phi}(p_{\varphi^{0}},p_{\varphi_{T}})$, we have $\varphi \in \Phi^{0}$. Thus, the infimum can be calculated for vectors in $\Phi^{0}$ instead of $\Phi$. Since $\Phi^{0}$ is compact and the optimized function is lower semicontinuous (the sum of two lower semicontinuous functions), the infimum exists and is attained in $\Phi^{0}$. We may now define $\varphi^{k+1}$ to be a vector whose corresponding value is equal to the infimum.

Convergence of the sequence ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right)}_{k}$ comes from the fact that it is non increasing and bounded. It is non increasing by virtue of (a). Boundedness comes from the lower semicontinuity of $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})$. Indeed, $\forall k,{\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\ge {inf}_{\varphi \in {\Phi}^{0}}{\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})$. The infimum of a proper lower semicontinuous function on a compact set exists and is attained on this set. Hence, the quantity ${inf}_{\varphi \in {\Phi}^{0}}{\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})$ exists and is finite. This ends the proof. □

Compactness in part (c) can be replaced by inf-compactness of the function $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ and continuity of $D_{\psi}$ with respect to its first argument. The convergence of the sequence $\left(\widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})\right)_{k}$ is an interesting property, since, in general, there is no theoretical guarantee, or it is difficult to prove, that the whole sequence $(\varphi^{k})_{k}$ converges. It may also continue to fluctuate around a minimum. The decrease of the error criterion $\widehat{D}_{\phi}(p_{\varphi^{k}},p_{\varphi_{T}})$ between two iterations helps us decide when to stop the iterative procedure.

Proposition 2. Suppose A1 is verified, $\Phi^{0}$ is closed and $\{\varphi^{k+1}-\varphi^{k}\}\to 0$.

- (a) If AC is verified, then any limit point of $(\varphi^{k})_{k}$ is a stationary point of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$;
- (b) If AC is dropped, then any limit point of $(\varphi^{k})_{k}$ is a “generalized” stationary point of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$, i.e., zero belongs to the subgradient of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$ calculated at the limit point.

We prove $\left(a\right)$. Let ${\left({\varphi}^{{n}_{k}}\right)}_{k}$ be a convergent subsequence of ${\left({\varphi}^{k}\right)}_{k}$ which converges to ${\varphi}^{\infty}$. First, ${\varphi}^{\infty}\in {\Phi}^{0}$, because ${\Phi}^{0}$ is closed and the subsequence $\left({\varphi}^{{n}_{k}}\right)$ is a sequence of elements of ${\Phi}^{0}$ (proved in Proposition 1b).

Let us now show that the subsequence $\left({\varphi}^{{n}_{k}+1}\right)$ also converges to ${\varphi}^{\infty}$. We simply have:

$$\begin{array}{ccc}\hfill \parallel {\varphi}^{{n}_{k}+1}-{\varphi}^{\infty}\parallel & \le & \parallel {\varphi}^{{n}_{k}}-{\varphi}^{\infty}\parallel +\parallel {\varphi}^{{n}_{k}+1}-{\varphi}^{{n}_{k}}\parallel .\hfill \end{array}$$

Since ${\varphi}^{k+1}-{\varphi}^{k}\to 0$ and ${\varphi}^{{n}_{k}}\to {\varphi}^{\infty}$, we conclude that ${\varphi}^{{n}_{k}+1}\to {\varphi}^{\infty}$.

By definition of $\varphi^{n_{k}+1}$, it attains the infimum in recurrence (11). Since $\varphi^{n_{k}+1}\in \Phi^{0}$, which we supposed to be a subset of $int(\Phi)$, the gradient of the optimized function is zero at this point:

$$\nabla {\widehat{D}}_{\phi}({p}_{{\varphi}^{{n}_{k}+1}},{p}_{{\varphi}_{T}})+\nabla {D}_{\psi}({\varphi}^{{n}_{k}+1},{\varphi}^{{n}_{k}})=0.$$

Using the continuity assumptions A1 and AC of the gradients, one can pass to the limit with no problem:

$$\nabla {\widehat{D}}_{\phi}({p}_{{\varphi}^{\infty}},{p}_{{\varphi}_{T}})+\nabla {D}_{\psi}({\varphi}^{\infty},{\varphi}^{\infty})=0.$$

However, $\nabla D_{\psi}(\varphi^{\infty},\varphi^{\infty})=0$ because, for any $\varphi \in \Phi$,

$$\nabla D_{\psi}(\varphi ,\varphi)=\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{X}}\frac{\nabla h_{i}(x|\varphi)}{h_{i}(x|\varphi)}\psi^{\prime}\left(\frac{h_{i}(x|\varphi)}{h_{i}(x|\varphi)}\right)h_{i}(x|\varphi)\,dx=\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{X}}\nabla h_{i}(x|\varphi)\,\psi^{\prime}(1)\,dx,$$

which is equal to zero since $\psi^{\prime}(1)=0$. This implies that $\nabla \widehat{D}_{\phi}(p_{\varphi^{\infty}},p_{\varphi_{T}})=0$.
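The vanishing of the gradient of $D_{\psi}$ on the diagonal can be observed numerically with a central finite difference. The sketch below assumes a two-component Gaussian mixture with known means, unit variances and one unknown proportion, together with $\psi(t)=\frac{1}{2}(\sqrt{t}-1)^{2}$. The data values and helper names are hypothetical.

```python
import numpy as np

MU = np.array([-2.0, 2.0])   # known means of an illustrative two-component mixture

def posterior(y, lam):
    """h_i(x | lam) for labels x in {0, 1}, unit variances."""
    f = np.array([lam, 1.0 - lam]) * np.exp(-0.5 * (y[:, None] - MU) ** 2)
    return f / f.sum(axis=1, keepdims=True)

def d_psi(y, lam, lam_k, psi):
    """Normalized proximal term D_psi(lam, lam_k)."""
    h, h_k = posterior(y, lam), posterior(y, lam_k)
    return np.mean(np.sum(psi(h / h_k) * h_k, axis=1))

psi = lambda t: 0.5 * (np.sqrt(t) - 1.0) ** 2   # psi(1) = psi'(1) = 0
y = np.array([-2.3, -1.1, 0.2, 1.7, 2.4])
lam_k, eps = 0.4, 1e-4

value_at_diagonal = d_psi(y, lam_k, lam_k, psi)
# Derivative with respect to the first argument, evaluated at lam = lam_k.
central_diff = (d_psi(y, lam_k + eps, lam_k, psi) - d_psi(y, lam_k - eps, lam_k, psi)) / (2 * eps)
```

Both `value_at_diagonal` and `central_diff` are (numerically) zero, which reflects the fact that $D_{\psi}(\cdot,\varphi^{k})$ is minimized at $\varphi^{k}$.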

We prove (b). We use again the definition of the arginf. As the optimized function is not necessarily differentiable at the points of the sequence $(\varphi^{k})_{k}$, a necessary condition for $\varphi^{k+1}$ to attain the infimum is that 0 belongs to the subgradient of the optimized function at $\varphi^{k+1}$. Since $D_{\psi}(\varphi ,\varphi^{k})$ is assumed to be differentiable, the optimality condition is translated into:

$$-\nabla {D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})\in \partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})\phantom{\rule{1.em}{0ex}}\forall k.$$

Since ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})$ is continuous, then its subgradient is outer semicontinuous (see [13] Chapter 8, Proposition 7). We use the same arguments presented in (a) to conclude the existence of two subsequences ${\left({\varphi}^{{n}_{k}}\right)}_{k}$ and ${\left({\varphi}^{{n}_{k}+1}\right)}_{k}$ which converge to the same limit ${\varphi}^{\infty}$. By definition of outer semicontinuity, and since ${\varphi}^{{n}_{k}+1}\to {\varphi}^{\infty}$, we have:

$$\underset{\varphi^{n_{k}+1}\to \varphi^{\infty}}{\limsup}\;\partial \widehat{D}_{\phi}(p_{\varphi^{n_{k}+1}},p_{\varphi_{T}})\subset \partial \widehat{D}_{\phi}(p_{\varphi^{\infty}},p_{\varphi_{T}}).\tag{12}$$

We want to prove that $0\in {lim\; sup}_{{\varphi}^{{n}_{k}+1}\to {\varphi}^{\infty}}\partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{{n}_{k}+1}},{p}_{{\varphi}_{T}})$. By definition of the (outer) limsup (see [13] Chapter 4, Definition 1 or Chapter 5B):

$$\underset{\varphi \to {\varphi}^{\infty}}{lim\; sup}\partial {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})=\left\{u|\exists {\varphi}^{k}\to {\varphi}^{\infty},\exists {u}^{k}\to u\phantom{\rule{4.pt}{0ex}}\text{with}\phantom{\rule{4.pt}{0ex}}{u}^{k}\in \partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right\}.$$

In our scenario, the sequence converging to $\varphi^{\infty}$ is $(\varphi^{n_{k}+1})_{k}$, $u=0$ and $u^{k}=-\nabla_{1}D_{\psi}(\varphi^{n_{k}+1},\varphi^{n_{k}})$, which belongs to $\partial \widehat{D}_{\phi}(p_{\varphi^{n_{k}+1}},p_{\varphi_{T}})$ by the optimality condition. The continuity of $\nabla_{1}D_{\psi}$ with respect to both arguments and the fact that the two subsequences $\varphi^{n_{k}+1}$ and $\varphi^{n_{k}}$ converge to the same limit imply that $u^{k}\to -\nabla_{1}D_{\psi}(\varphi^{\infty},\varphi^{\infty})=0$. Hence, $u=0\in \limsup_{\varphi^{n_{k}+1}\to \varphi^{\infty}}\partial \widehat{D}_{\phi}(p_{\varphi^{n_{k}+1}},p_{\varphi_{T}})$. By inclusion (12), we get our result:

$$0\in \partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{\infty}},{p}_{{\varphi}_{T}}).$$

This ends the proof. □

The assumption $\{\varphi^{k+1}-\varphi^{k}\}\to 0$ used in Proposition 2 is not easy to check unless one has a closed-form formula for $\varphi^{k}$. The following proposition gives a method to prove such an assumption. This method seems simpler, but its hypotheses are not verified in many mixture models (see Section 4.2 for a counterexample).

Proposition 3. Assume that A1, A2 and A3 are verified; then $\{\varphi^{k+1}-\varphi^{k}\}\to 0$. Thus, by Proposition 2 (according to whether AC is verified or not), any limit point of the sequence $(\varphi^{k})_{k}$ is a (generalized) stationary point of $\varphi \mapsto \widehat{D}_{\phi}(p_{\varphi},p_{\varphi_{T}})$.

By contradiction, let us suppose that $\varphi^{k+1}-\varphi^{k}$ does not converge to 0. Then there exist $\epsilon >0$ and a subsequence such that $\parallel \varphi^{N_{0}(k)+1}-\varphi^{N_{0}(k)}\parallel >\epsilon ,\ \forall k\ge k_{0}$. Since $(\varphi^{k})_{k}$ belongs to the compact set $\Phi^{0}$, there exists a convergent subsequence $(\varphi^{N_{1}\circ N_{0}(k)})_{k}$ such that $\varphi^{N_{1}\circ N_{0}(k)}\to \overline{\varphi}$. The sequence $(\varphi^{N_{1}\circ N_{0}(k)+1})_{k}$ belongs to the compact set $\Phi^{0}$; therefore, we can extract a further subsequence $(\varphi^{N_{2}\circ N_{1}\circ N_{0}(k)+1})_{k}$ such that $\varphi^{N_{2}\circ N_{1}\circ N_{0}(k)+1}\to \tilde{\varphi}$. Moreover, $\overline{\varphi}\ne \tilde{\varphi}$, since the distance between consecutive terms of the subsequence stays above $\epsilon$. Finally, since the sequence $(\varphi^{N_{1}\circ N_{0}(k)})_{k}$ is convergent, a further subsequence also converges to the same limit $\overline{\varphi}$. We have proved the existence of a subsequence of $(\varphi^{k})_{k}$ such that $\varphi^{N(k)+1}-\varphi^{N(k)}$ does not converge to 0 and such that $\varphi^{N(k)+1}\to \tilde{\varphi}$, $\varphi^{N(k)}\to \overline{\varphi}$ with $\overline{\varphi}\ne \tilde{\varphi}$.

The real sequence ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right)}_{k}$ converges as proved in Proposition 1c. As a result, both sequences ${\widehat{D}}_{\phi}({p}_{{\varphi}^{N\left(k\right)+1}},{p}_{{\varphi}_{T}})$ and ${\widehat{D}}_{\phi}({p}_{{\varphi}^{N\left(k\right)}},{p}_{{\varphi}_{T}})$ converge to the same limit, being subsequences of the same convergent sequence. In the proof of Proposition 1, we can deduce the following inequality:

$$\widehat{D}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})+{D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})\le \widehat{D}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}}),$$

which is also verified for any substitution of k by $N\left(k\right)$. By passing to the limit on k, we get ${D}_{\psi}(\tilde{\varphi},\overline{\varphi})\le 0$. However, the distance-like function ${D}_{\psi}$ is nonnegative, so that it becomes zero. Using assumption A3, ${D}_{\psi}(\tilde{\varphi},\overline{\varphi})=0$ implies that $\tilde{\varphi}=\overline{\varphi}$. This contradicts the hypothesis that ${\varphi}^{k+1}-{\varphi}^{k}$ does not converge to 0.

The second part of the Proposition is a direct result of Proposition 2. □

Under assumptions of Proposition 3, the set of accumulation points of ${\left({\varphi}^{k}\right)}_{k}$ is a connected compact set. Moreover, if $\varphi \mapsto \widehat{D}({p}_{\varphi},{p}_{{\varphi}_{T}})$ is strictly convex in the neighborhood of a limit point of the sequence ${\left({\varphi}^{k}\right)}_{k}$, then the whole sequence ${\left({\varphi}^{k}\right)}_{k}$ converges to a local minimum of $\widehat{D}({p}_{\varphi},{p}_{{\varphi}_{T}})$.

Since the sequence ${\left({\varphi}^{k}\right)}_{k}$ is bounded and verifies ${\varphi}^{k+1}-{\varphi}^{k}\to 0$, Theorem 28.1 in [17] implies that the set of accumulation points of ${\left({\varphi}^{k}\right)}_{k}$ is a connected compact set. It is not empty since ${\Phi}^{0}$ is compact. The remainder of the proof is a direct result of Theorem 3.3.1 from [18]. The strict concavity of the objective function around an accumulation point is replaced here by the strict convexity of the estimated divergence. □

Proposition 3 and Corollary 1 describe what we may hope to obtain from the sequence ${\varphi}^{k}$. Convergence of the whole sequence is tied to a local convexity assumption in the neighborhood of a limit point. Although simple, this assumption remains difficult to check since we do not know where the limit points might be. In addition, assumption A3 is very restrictive, and is not verified in mixture models.

Propositions 2 and 3 were developed for the likelihood function in the paper of Tseng [2]. Similar results for a general class of functions replacing ${\widehat{D}}_{\phi}$ and ${D}_{\psi}$ which may not be differentiable (but still continuous) are presented in [3]. In these results, assumption A3 is essential. Although in [18] this problem is avoided, their approach demands that the log-likelihood has $-\infty $ limit as $\parallel \varphi \parallel \to \infty $. This is simply not verified for mixture models. We present a similar method to the one in [18] based on the idea of Tseng [2] of using the set ${\Phi}^{0}$ which is valid for mixtures. We lose, however, the guarantee of consecutive decrease of the sequence ${\left({\varphi}^{k}\right)}_{k}$.

Assume A1, AC and A2 verified. Any limit point of the sequence ${\left({\varphi}^{k}\right)}_{k}$ is a stationary point of $\varphi \to \widehat{D}({p}_{\varphi},{p}_{{\varphi}_{T}})$. If AC is dropped, then 0 belongs to the subgradient of $\varphi \mapsto \widehat{D}({p}_{\varphi},{p}_{{\varphi}_{T}})$ calculated at the limit point.

If ${\left({\varphi}^{k}\right)}_{k}$ converges to, say, ${\varphi}^{\infty}$, then the result follows directly from Proposition 2.

Suppose now that ${\left({\varphi}^{k}\right)}_{k}$ does not converge. Since ${\Phi}^{0}$ is compact and $\forall k,{\varphi}^{k}\in {\Phi}^{0}$ (proved in Proposition 1), there exists a subsequence ${\left({\varphi}^{{N}_{0}\left(k\right)}\right)}_{k}$ such that ${\varphi}^{{N}_{0}\left(k\right)}\to \tilde{\varphi}$. Let us take the subsequence ${\left({\varphi}^{{N}_{0}\left(k\right)-1}\right)}_{k}$. This subsequence does not necessarily converge, but it is still contained in the compact set ${\Phi}^{0}$, so that we can extract a further subsequence ${\left({\varphi}^{{N}_{1}\circ {N}_{0}\left(k\right)-1}\right)}_{k}$ which converges to, say, $\overline{\varphi}$. Now, the subsequence ${\left({\varphi}^{{N}_{1}\circ {N}_{0}\left(k\right)}\right)}_{k}$ converges to $\tilde{\varphi}$, because it is a subsequence of ${\left({\varphi}^{{N}_{0}\left(k\right)}\right)}_{k}$. We have proved so far the existence of two convergent subsequences ${\varphi}^{N\left(k\right)-1}$ and ${\varphi}^{N\left(k\right)}$ with a priori different limits. For simplicity and without any loss of generality, we will consider these subsequences to be ${\varphi}^{k}$ and ${\varphi}^{k+1}$, respectively.

Conserving previous notations, suppose that ${\varphi}^{k+1}\to \tilde{\varphi}$ and ${\varphi}^{k}\to \overline{\varphi}$. We use again inequality (13):

$$\widehat{D}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})+{D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})\le \widehat{D}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}}).$$

By taking the limits of the two parts of the inequality as k tends to infinity, and using the continuity of the two functions, we have

$$\widehat{D}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}})+{D}_{\psi}(\tilde{\varphi},\overline{\varphi})\le \widehat{D}({p}_{\overline{\varphi}},{p}_{{\varphi}_{T}}).$$

Recall that under A1-2, the sequence ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right)}_{k}$ converges, so that it has the same limit for any subsequence, i.e., $\widehat{D}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}})=\widehat{D}({p}_{\overline{\varphi}},{p}_{{\varphi}_{T}})$. We also use the fact that the distance-like function ${D}_{\psi}$ is non negative to deduce that ${D}_{\psi}(\tilde{\varphi},\overline{\varphi})=0$. Looking closely at the definition of this divergence (10), we get that if the sum is zero, then each term is also zero since all terms are nonnegative. This means that:

$$\forall i\in \{1,\cdots ,n\},\phantom{\rule{1.em}{0ex}}{\int}_{\mathcal{X}}\psi \left(\frac{{h}_{i}\left(x\right|\tilde{\varphi})}{{h}_{i}\left(x\right|\overline{\varphi})}\right){h}_{i}\left(x\right|\overline{\varphi})dx=0.$$

The integrands are nonnegative functions, so they vanish almost everywhere with respect to the measure $dx$ defined on the space of labels.

$$\forall i\in \{1,\cdots ,n\},\phantom{\rule{1.em}{0ex}}\psi \left(\frac{{h}_{i}\left(x\right|\tilde{\varphi})}{{h}_{i}\left(x\right|\overline{\varphi})}\right){h}_{i}\left(x\right|\overline{\varphi})=0\phantom{\rule{1.em}{0ex}}dx-a.e.$$

The conditional densities ${h}_{i}$ are supposed to be positive (which can be ensured by a suitable choice of the initial point ${\varphi}^{0}$), i.e., ${h}_{i}\left(x\right|\overline{\varphi})>0,dx-a.e.$ Hence, $\psi \left(\frac{{h}_{i}\left(x\right|\tilde{\varphi})}{{h}_{i}\left(x\right|\overline{\varphi})}\right)=0,dx-a.e.$ On the other hand, ψ is chosen in a way that $\psi \left(z\right)=0$ iff $z=1$. Therefore:

$$\forall i\in \{1,\cdots ,n\},\phantom{\rule{1.em}{0ex}}{h}_{i}\left(x\right|\tilde{\varphi})={h}_{i}\left(x\right|\overline{\varphi})\phantom{\rule{1.em}{0ex}}dx-a.e.$$

Since ${\varphi}^{k+1}$ is, by definition, an infimum of $\varphi \mapsto \widehat{D}({p}_{\varphi},{p}_{{\varphi}_{T}})+{D}_{\psi}(\varphi ,{\varphi}^{k})$, then the gradient of this function is zero on ${\varphi}^{k+1}$. It results that:

$$\nabla \widehat{D}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})+\nabla {D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})=0,\phantom{\rule{1.em}{0ex}}\forall k.$$

Taking the limit on k, and using the continuity of the derivatives, we get that:

$$\nabla \widehat{D}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}})+\nabla {D}_{\psi}(\tilde{\varphi},\overline{\varphi})=0.$$

Let us write explicitly the gradient of the second divergence:

$$\nabla {D}_{\psi}(\tilde{\varphi},\overline{\varphi})=\sum _{i=1}^{n}{\int}_{\mathcal{X}}\frac{\nabla {h}_{i}\left(x\right|\tilde{\varphi})}{{h}_{i}\left(x\right|\overline{\varphi})}{\psi}^{\prime}\left(\frac{{h}_{i}\left(x\right|\tilde{\varphi})}{{h}_{i}\left(x\right|\overline{\varphi})}\right){h}_{i}\left(x\right|\overline{\varphi})dx.$$

We use now the identities (14), and the fact that ${\psi}^{\prime}\left(1\right)=0$, to deduce that:

$$\nabla {D}_{\psi}(\tilde{\varphi},\overline{\varphi})=0.$$

This entails using (15) that $\nabla \widehat{D}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}})=0$.

Comparing the proved result with the notation considered at the beginning of the proof, we have proved that the limit of the subsequence ${\left({\varphi}^{{N}_{1}\circ {N}_{0}\left(k\right)}\right)}_{k}$ is a stationary point of the objective function. Therefore, the final step is to deduce the same result on the original convergent subsequence ${\left({\varphi}^{{N}_{0}\left(k\right)}\right)}_{k}$. This is simply due to the fact that ${\left({\varphi}^{{N}_{1}\circ {N}_{0}\left(k\right)}\right)}_{k}$ is a subsequence of the convergent sequence ${\left({\varphi}^{{N}_{0}\left(k\right)}\right)}_{k}$, hence they have the same limit.

When assumption AC is dropped, similar arguments to those used in the proof of Proposition 2b. are employed. The optimality condition in (11) implies:
$$-\nabla {D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})\in \partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})\phantom{\rule{1.em}{0ex}}\forall k.$$

Function $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})$ is continuous, hence its subgradient is outer semicontinuous and:

$$\underset{{\varphi}^{k+1}\to {\varphi}^{\infty}}{lim\; sup}\partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})\subset \partial {\widehat{D}}_{\phi}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}}).$$

By definition of the limsup:
$$\underset{\varphi \to {\varphi}^{\infty}}{lim\; sup}\partial {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}_{T}})=\left\{u|\exists {\varphi}^{k}\to {\varphi}^{\infty},\exists {u}^{k}\to u\phantom{\rule{4.pt}{0ex}}\text{with}\phantom{\rule{4.pt}{0ex}}{u}^{k}\in \partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right\}.$$

In our scenario, the role of the sequence ${\varphi}^{k}$ in this definition is played by ${\varphi}^{k+1}$, with $u=0$ and ${u}^{k}=-{\nabla}_{1}{D}_{\psi}({\varphi}^{k+1},{\varphi}^{k})$. We have proved above in this proof that ${\nabla}_{1}{D}_{\psi}(\tilde{\varphi},\overline{\varphi})=0$ using only the convergence of ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}_{T}})\right)}_{k}$, inequality (13) and the properties of ${D}_{\psi}$. Assumption AC was not needed. Hence, ${u}^{k}\to 0$. This proves that $u=0\in {lim\; sup}_{{\varphi}^{k+1}\to {\varphi}^{\infty}}\partial {\widehat{D}}_{\phi}({p}_{{\varphi}^{k+1}},{p}_{{\varphi}_{T}})$. Finally, using the inclusion (16), we get our result:

$$0\in \partial {\widehat{D}}_{\phi}({p}_{\tilde{\varphi}},{p}_{{\varphi}_{T}}),$$

which ends the proof. □

The proof of the previous proposition is very similar to the proof of Proposition 2. The key idea is to use the sequence of conditional densities ${h}_{i}\left(x\right|{\varphi}^{k})$ instead of the sequence ${\varphi}^{k}$. According to the application, one may be interested only in Proposition 1 or in Propositions 2–4. If one is interested in the parameters, Propositions 2 to 4 should be used, since we need a stable limit of ${\left({\varphi}^{k}\right)}_{k}$. If we are only interested in minimizing an error criterion ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ between the estimated distribution and the true one, Proposition 1 should be sufficient.

We present a variant of algorithm (11) which ensures theoretically the convergence to a global infimum of the objective function ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ as soon as there exists a convergent subsequence of ${\left({\varphi}^{k}\right)}_{k}$. The idea is the same as Theorem 3.2.4 in [18]. Define ${\varphi}^{k+1}$ by:

$${\varphi}^{k+1}=\underset{\varphi}{arg\; inf}{\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})+{\beta}_{k}{D}_{\psi}(\varphi ,{\varphi}^{k}).$$

The proof of convergence is very simple and does not depend on the differentiability of any of the two functions ${\widehat{D}}_{\phi}$ or ${D}_{\psi}$. We only assume A1 and A2 to be verified. Let ${\left({\varphi}^{N\left(k\right)}\right)}_{k}$ be a convergent subsequence. Let ${\varphi}^{\infty}$ be its limit. This is guaranteed by the compactness of ${\Phi}^{0}$ and the fact that the whole sequence ${\left({\varphi}^{k}\right)}_{k}$ resides in ${\Phi}^{0}$ (see Proposition 1b). Suppose also that the sequence ${\left({\beta}_{k}\right)}_{k}$ converges to 0 as k goes to infinity.

Now assumptions of Theorem 3.2.4. from [18] are verified. Thus, using the same lines from the proof of this theorem (inverting all inequalities since we are minimizing instead of maximizing), we may prove that ${\varphi}^{\infty}$ is a global infimum of the estimated divergence, that is

$${\widehat{D}}_{\phi}({p}_{{\varphi}^{\infty}},{p}_{{\varphi}^{T}})\le {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}}),\phantom{\rule{2.em}{0ex}}\forall \varphi \in \Phi .$$

The problem with this approach is that it depends heavily on the fact that the infimum at each step of the algorithm is calculated exactly. This does not happen in general unless the function ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})+{\beta}_{k}{D}_{\psi}(\varphi ,{\varphi}^{k})$ is convex or we have an algorithm that can perfectly solve nonconvex optimization problems. (In this case, there is no point in applying an iterative proximal algorithm; we would have used the optimization algorithm directly on the objective function ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$.) Although in our approach we use a similar assumption to prove the consecutive decrease of ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$, we can replace the infimum calculation in (11) by two requirements. At each step, we require a local infimum of ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})+{D}_{\psi}(\varphi ,{\varphi}^{k})$ at which the value of $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ is less than its value at the previous term ${\varphi}^{k}$ of the sequence. If we can no longer find any local minimum verifying this requirement, the procedure stops with ${\varphi}^{k+1}={\varphi}^{k}$. This ensures that all the proofs presented in this paper remain valid with no change.

We suppose that the model ${\left({p}_{\varphi}\right)}_{\varphi \in \Phi}$ is a mixture of two Gaussian densities, and that we are only interested in estimating the means $\mu =({\mu}_{1},{\mu}_{2})\in {\mathbb{R}}^{2}$ and the proportion $\lambda \in [\eta ,1-\eta ]$. The role of η is to avoid cancellation of any of the two components, and to keep the hypothesis ${h}_{i}\left(x\right|\varphi )>0$ for $x=1,2$ verified. We also suppose that the component variances are equal to one (${\sigma}_{i}=1$). The model takes the form

$${p}_{\lambda ,\mu}\left(x\right)=\frac{\lambda}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{(x-{\mu}_{1})}^{2}}+\frac{1-\lambda}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{(x-{\mu}_{2})}^{2}}.$$

Here, $\Phi =[\eta ,1-\eta ]\times {\mathbb{R}}^{2}$. The regularization term ${D}_{\psi}$ is defined by (8) where:

$${h}_{i}\left(1\right|\varphi )=\frac{\lambda {e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1})}^{2}}}{\lambda {e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1})}^{2}}+(1-\lambda ){e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{2})}^{2}}},\phantom{\rule{1.em}{0ex}}{h}_{i}\left(2\right|\varphi )=1-{h}_{i}\left(1\right|\varphi ).$$

Functions ${h}_{i}$ are clearly of class ${\mathcal{C}}^{1}$(int(Φ)), and so is ${D}_{\psi}$. We prove that ${\Phi}^{0}$ is closed and bounded, which is sufficient to conclude its compactness, since the space $[\eta ,1-\eta ]\times {\mathbb{R}}^{2}$ provided with the Euclidean distance is complete.
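As an illustration, the conditional label probabilities ${h}_{i}$ and the proximal term can be evaluated as in the following minimal sketch. It assumes that the integral over the label space reduces to a sum over the two labels, and that $\psi \left(t\right)=\frac{1}{2}{(\sqrt{t}-1)}^{2}$; the function names are hypothetical and chosen for this example only.

```python
import math

def posterior_first(y, lam, mu1, mu2):
    """h_i(1 | phi): probability that observation y comes from the first component."""
    a = lam * math.exp(-0.5 * (y - mu1) ** 2)
    b = (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)
    return a / (a + b)

def psi_hellinger(t):
    """psi(t) = (sqrt(t) - 1)^2 / 2, which vanishes only at t = 1."""
    return 0.5 * (math.sqrt(t) - 1.0) ** 2

def proximal_term(ys, phi, phi_k, psi=psi_hellinger):
    """D_psi(phi, phi_k) = sum_i sum_x psi(h_i(x|phi) / h_i(x|phi_k)) * h_i(x|phi_k).

    Assumption: the label space is {1, 2} with the counting measure.
    phi and phi_k are tuples (lambda, mu1, mu2).
    """
    total = 0.0
    for y in ys:
        h_new = posterior_first(y, *phi)
        h_old = posterior_first(y, *phi_k)
        # Sum over the two labels: first component, then second component.
        for p_new, p_old in ((h_new, h_old), (1.0 - h_new, 1.0 - h_old)):
            total += psi(p_new / p_old) * p_old
    return total
```

The term is zero when both arguments coincide and positive when the conditional probabilities differ at some observation.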

If we are using the dual estimator of the $\phi -$divergence given by (2), then assumption A0 can be verified using the maximum theorem of Berge [19]. There is still a great difficulty in studying the properties (closedness or compactness) of the set ${\Phi}^{0}$. Moreover, all convergence properties of the sequence ${\varphi}^{k}$ require the continuity of the estimated $\phi -$divergence ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ with respect to φ. In order to prove the continuity of the estimated divergence, we need to assume that Φ is compact, i.e., assume that the means are included in an interval of the form $[{\mu}_{min},{\mu}_{max}]$. Now, using Theorem 10.31 from [13], $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ is continuous and differentiable almost everywhere with respect to φ.

The compactness assumption of Φ implies directly the compactness of ${\Phi}^{0}$. Indeed,

$$\begin{array}{ccc}\hfill {\Phi}^{0}& =& \left\{\varphi \in \Phi ,{\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})\le {\widehat{D}}_{\phi}({p}_{{\varphi}^{0}},{p}_{{\varphi}^{T}})\right\}\hfill \\ & =& {\widehat{D}}_{\phi}{({p}_{\varphi},{p}_{{\varphi}^{T}})}^{-1}\left((-\infty ,{\widehat{D}}_{\phi}({p}_{{\varphi}^{0}},{p}_{{\varphi}^{T}})]\right).\hfill \end{array}$$

${\Phi}^{0}$ is then the inverse image by a continuous function of a closed set, so it is closed in Φ. Hence, it is compact.

Using Propositions 4 and 1, if $\Phi =[\eta ,1-\eta ]\times {[{\mu}_{min},{\mu}_{max}]}^{2}$, the sequence ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}^{T}})\right)}_{k}$ defined through Formula (2) converges and there exists a subsequence $\left({\varphi}^{N\left(k\right)}\right)$ which converges to a stationary point of the estimated divergence. Moreover, every limit point of the sequence ${\left({\varphi}^{k}\right)}_{k}$ is a stationary point of the estimated divergence.

If we are using the kernel-based dual estimator given by (3) with a Gaussian kernel density estimator, then function $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ is continuously differentiable over Φ even if the means ${\mu}_{1}$ and ${\mu}_{2}$ are not bounded. For example, take $\phi ={\phi}_{\gamma}$ defined by (1). There is one condition which relates the window of the kernel, say w, with the value of γ. Indeed, using Formula (3), we can write

$${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})=\frac{1}{\gamma -1}\int \frac{{p}_{\varphi}^{\gamma}}{{K}_{n,w}^{\gamma -1}}\left(y\right)dy-\frac{1}{\gamma n}\sum _{i=1}^{n}\frac{{p}_{\varphi}^{\gamma}}{{K}_{n,w}^{\gamma}}\left({y}_{i}\right)-\frac{1}{\gamma (\gamma -1)}.$$

In order to study the continuity and the differentiability of the estimated divergence with respect to φ, it suffices to study the integral term. We have

$$\frac{{p}_{\varphi}^{\gamma}}{{K}_{n,w}^{\gamma -1}}\left(y\right)=\frac{{\left(\frac{\lambda}{\sqrt{2\pi}}exp\left[-\frac{1}{2}{(y-{\mu}_{1})}^{2}\right]+\frac{1-\lambda}{\sqrt{2\pi}}exp\left[-\frac{1}{2}{(y-{\mu}_{2})}^{2}\right]\right)}^{\gamma}}{{\left(\frac{1}{nw}{\sum}_{i=1}^{n}exp\left[-\frac{{(y-{y}_{i})}^{2}}{2{w}^{2}}\right]\right)}^{\gamma -1}}.$$

The dominating term at infinity in the numerator is $exp(-\gamma {y}^{2}/2)$, whereas it is $exp(-(\gamma -1){y}^{2}/\left(2{w}^{2}\right))$ in the denominator. For the integrand to be bounded by an integrable function independently of $\varphi =(\lambda ,\mu )$, it suffices that $-\gamma +(\gamma -1)/{w}^{2}<0$. That is, $-\gamma {w}^{2}+\gamma -1<0$, which is equivalent to $\gamma ({w}^{2}-1)>-1$. This argument also holds if we differentiate the integrand with respect to λ or either of the means ${\mu}_{1}$ or ${\mu}_{2}$. For $\gamma =2$ (the Pearson’s ${\chi}^{2}$), we need ${w}^{2}>1/2$. For $\gamma =1/2$ (the Hellinger), there is no condition on w.
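The tail condition $-\gamma +(\gamma -1)/{w}^{2}<0$ is easy to check for a given pair $(\gamma ,w)$. The helper below is a small sketch written for this purpose; its name is hypothetical, and it only encodes the inequality stated above, not a full integrability proof.

```python
def bandwidth_condition_holds(gamma, w):
    """Return True when -gamma + (gamma - 1) / w**2 < 0.

    gamma is the index of the phi_gamma divergence and w > 0 is the kernel window.
    """
    if w <= 0:
        raise ValueError("the kernel window w must be positive")
    return -gamma + (gamma - 1.0) / w ** 2 < 0

# Pearson chi-square (gamma = 2): the condition holds only when w**2 > 1/2.
for w_squared in (0.4, 0.6):
    print(2.0, w_squared, bandwidth_condition_holds(2.0, w_squared ** 0.5))

# Hellinger (gamma = 1/2): the condition holds for every positive window.
for w in (0.01, 1.0, 100.0):
    print(0.5, w, bandwidth_condition_holds(0.5, w))
```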

Closedness of ${\Phi}^{0}$ is proved similarly to the previous case. Boundedness, however, must be treated differently since Φ is not necessarily compact and is supposed to be $\Phi =[\eta ,1-\eta ]\times {\mathbb{R}}^{2}$. For simplicity, take $\phi ={\phi}_{\gamma}$. The idea is to choose an initialization ${\varphi}^{0}$ for the proximal algorithm in such a way that ${\Phi}^{0}$ does not include unbounded values of the means. Continuity of $\varphi \mapsto {\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ permits calculation of the limits when either (or both) of the means tends to infinity. If both the means go to infinity, then ${p}_{\varphi}\left(x\right)\to 0,\forall x$. Thus, for $\gamma \in (0,\infty )\backslash \left\{1\right\}$, we have ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})\to \frac{1}{\gamma (\gamma -1)}$. For $\gamma <0$, the limit is infinity. If only one of the means tends to ∞, then the corresponding component vanishes from the mixture. Thus, if we choose ${\varphi}^{0}$ such that:

$$\begin{array}{ccc}\hfill {\widehat{D}}_{\phi}({p}_{{\varphi}^{0}},{p}_{{\varphi}^{T}})& <& min\left(\frac{1}{\gamma (\gamma -1)},\underset{\lambda ,\mu}{inf}{\widehat{D}}_{\phi}({p}_{(\lambda ,\infty ,\mu )},{p}_{{\varphi}^{T}})\right)\phantom{\rule{4.pt}{0ex}}\text{if}\phantom{\rule{4.pt}{0ex}}\gamma \in (0,\infty )\backslash \left\{1\right\},\hfill \end{array}$$

$$\begin{array}{ccc}\hfill {\widehat{D}}_{\phi}({p}_{{\varphi}^{0}},{p}_{{\varphi}^{T}})& <& \underset{\lambda ,\mu}{inf}{\widehat{D}}_{\phi}({p}_{(\lambda ,\infty ,\mu )},{p}_{{\varphi}^{T}})\phantom{\rule{2.em}{0ex}}\phantom{\rule{4.pt}{0ex}}\text{if}\phantom{\rule{4.pt}{0ex}}\gamma <0,\hfill \end{array}$$

then the algorithm starts at a point of Φ whose function value is below the limits of ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ at infinity. By Proposition 1, the algorithm will continue to decrease the value of ${\widehat{D}}_{\phi}({p}_{\varphi},{p}_{{\varphi}^{T}})$ and never goes back to the limits at infinity. In addition, the definition of ${\Phi}^{0}$ allows us to conclude that if ${\varphi}^{0}$ is chosen according to conditions (18) and (19), then ${\Phi}^{0}$ is bounded. Thus, ${\Phi}^{0}$ becomes compact. Unfortunately, the value of ${inf}_{\lambda ,\mu}{\widehat{D}}_{\phi}({p}_{(\lambda ,\infty ,\mu )},{p}_{{\varphi}^{T}})$ can only be calculated numerically. We will see next that in the case of the likelihood function, a similar condition will be imposed for the compactness of ${\Phi}^{0}$, and there will be no need for any numerical calculation.

Using Propositions 4 and 1, under conditions (18) and (19) the sequence ${\left({\widehat{D}}_{\phi}({p}_{{\varphi}^{k}},{p}_{{\varphi}^{T}})\right)}_{k}$ defined through Formula (3) converges and there exists a subsequence $\left({\varphi}^{N\left(k\right)}\right)$ that converges to a stationary point of the estimated divergence. Moreover, every limit point of the sequence ${\left({\varphi}^{k}\right)}_{k}$ is a stationary point of the estimated divergence.

In the case of the likelihood $\phi \left(t\right)=-log\left(t\right)+t-1$, the set ${\Phi}^{0}$ can be written as:

$$\begin{array}{ccc}\hfill {\Phi}^{0}& =& \left\{\varphi \in \Phi ,{J}_{\mathcal{N}}\left(\varphi \right)\ge {J}_{\mathcal{N}}\left({\varphi}^{0}\right)\right\}\hfill \\ & =& {J}_{\mathcal{N}}^{-1}\left([{J}_{\mathcal{N}}\left({\varphi}^{0}\right),+\infty )\right),\hfill \end{array}$$

where ${J}_{\mathcal{N}}$ is the log-likelihood function of the Gaussian mixture model. The log-likelihood function ${J}_{\mathcal{N}}$ is clearly of class ${\mathcal{C}}^{1}$(int(Φ)). We prove that ${\Phi}^{0}$ is closed and bounded, which is sufficient to conclude its compactness, since the space $[\eta ,1-\eta ]\times {\mathbb{R}}^{2}$ provided with the Euclidean distance is complete.

Closedness. The set ${\Phi}^{0}$ is the inverse image by a continuous function (the log-likelihood) of a closed set. Therefore it is closed in $[\eta ,1-\eta ]\times {\mathbb{R}}^{2}$.

Boundedness. By contradiction, suppose that ${\Phi}^{0}$ is unbounded, then there exists a sequence ${\left({\varphi}^{l}\right)}_{l}$ which tends to infinity. Since ${\lambda}^{l}\in [\eta ,1-\eta ]$, then either of ${\mu}_{1}^{l}$ or ${\mu}_{2}^{l}$ tends to infinity. Suppose that both ${\mu}_{1}^{l}$ and ${\mu}_{2}^{l}$ tend to infinity, we then have ${J}_{\mathcal{N}}\left({\varphi}^{l}\right)\to -\infty $. Any finite initialization ${\varphi}^{0}$ will imply that ${J}_{\mathcal{N}}\left({\varphi}^{0}\right)>-\infty $ so that $\forall \varphi \in {\Phi}^{0},{J}_{\mathcal{N}}\left(\varphi \right)\ge {J}_{\mathcal{N}}\left({\varphi}^{0}\right)>-\infty $. Thus, it is impossible for both ${\mu}_{1}^{l}$ and ${\mu}_{2}^{l}$ to go to infinity.

Suppose that ${\mu}_{1}^{l}\to \infty $, and that ${\mu}_{2}^{l}$ converges (or that ${\mu}_{2}^{l}$ is bounded; in such a case we extract a convergent subsequence) to ${\mu}_{2}$. The limit of the likelihood has the form:

$$L(\lambda ,\infty ,{\mu}_{2})=\prod _{i=1}^{n}\frac{(1-\lambda )}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{2})}^{2}},$$

which is bounded by its value for $\lambda =0$ and ${\mu}_{2}=\frac{1}{n}{\sum}_{i=1}^{n}{y}_{i}$. Indeed, since $1-\lambda \le 1$, we have:

$$L(\lambda ,\infty ,{\mu}_{2})\le \prod _{i=1}^{n}\frac{1}{\sqrt{2\pi}}{e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{2})}^{2}}.$$

The right-hand side of this inequality is the likelihood of a Gaussian model $\mathcal{N}({\mu}_{2},1)$, so that it is maximized when ${\mu}_{2}=\frac{1}{n}{\sum}_{i=1}^{n}{y}_{i}$. Thus, if ${\varphi}^{0}$ is chosen in a way that ${J}_{\mathcal{N}}\left({\varphi}^{0}\right)>{J}_{\mathcal{N}}\left(0,\infty ,\frac{1}{n}{\sum}_{i=1}^{n}{y}_{i}\right)$, the case when ${\mu}_{1}$ tends to infinity and ${\mu}_{2}$ is bounded would never be allowed. For the other case where ${\mu}_{2}\to \infty $ and ${\mu}_{1}$ is bounded, we choose ${\varphi}^{0}$ in a way that ${J}_{\mathcal{N}}\left({\varphi}^{0}\right)>{J}_{\mathcal{N}}\left(1,\frac{1}{n}{\sum}_{i=1}^{n}{y}_{i},\infty \right)$. In conclusion, with a choice of ${\varphi}^{0}$ such that:

$${J}_{\mathcal{N}}\left({\varphi}^{0}\right)>max\left[{J}_{\mathcal{N}}\left(0,\infty ,\frac{1}{n}\sum _{i=1}^{n}{y}_{i}\right),\phantom{\rule{0.277778em}{0ex}}{J}_{\mathcal{N}}\left(1,\frac{1}{n}\sum _{i=1}^{n}{y}_{i},\infty \right)\right],$$

the set ${\Phi}^{0}$ is bounded.

This condition on ${\varphi}^{0}$ is very natural and means that we need to begin at a point at least better than the extreme cases where we only have one component in the mixture. This can be easily verified by choosing a random vector ${\varphi}^{0}$, and calculating the corresponding log-likelihood value. If ${J}_{\mathcal{N}}\left({\varphi}^{0}\right)$ does not verify the previous condition, we draw another random vector, and repeat until the condition is satisfied.
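The rejection procedure just described can be sketched as follows. Both extreme cases in the condition reduce to the log-likelihood of a single Gaussian $\mathcal{N}(\overline{y},1)$ centered at the sample mean, so a single bound suffices. The sampling ranges (means drawn uniformly between the smallest and largest observation), the default value of η, and all function names are assumptions made for this example.

```python
import math
import random

def log_likelihood(ys, lam, mu1, mu2):
    """J_N(phi): log-likelihood of the two-component unit-variance Gaussian mixture."""
    total = 0.0
    for y in ys:
        dens = (lam * math.exp(-0.5 * (y - mu1) ** 2)
                + (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)) / math.sqrt(2.0 * math.pi)
        total += math.log(dens)
    return total

def one_component_bound(ys):
    """Log-likelihood of N(ybar, 1), the common value of both one-component extremes."""
    ybar = sum(ys) / len(ys)
    return sum(-0.5 * (y - ybar) ** 2 - 0.5 * math.log(2.0 * math.pi) for y in ys)

def draw_initial_point(ys, eta=0.05, max_tries=1000, rng=None):
    """Draw random (lambda, mu1, mu2) until the log-likelihood beats the bound.

    Returns None if no acceptable point is found within max_tries draws.
    """
    rng = rng or random.Random(0)
    bound = one_component_bound(ys)
    lo, hi = min(ys), max(ys)
    for _ in range(max_tries):
        phi0 = (rng.uniform(eta, 1.0 - eta), rng.uniform(lo, hi), rng.uniform(lo, hi))
        if log_likelihood(ys, *phi0) > bound:
            return phi0
    return None
```

For example, `draw_initial_point([-3.1, -2.9, -3.0, 3.0, 2.9, 3.1])` returns a triple whose log-likelihood exceeds that of the best single-component fit.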

Using Propositions 4 and 1, under condition (20) the sequence ${\left({J}_{\mathcal{N}}\left({\varphi}^{k}\right)\right)}_{k}$ converges and there exists a subsequence $\left({\varphi}^{N\left(k\right)}\right)$ which converges to a stationary point of the likelihood function. Moreover, every limit point of the sequence ${\left({\varphi}^{k}\right)}_{k}$ is a stationary point of the likelihood.

Assumption A3 is not fulfilled (this part applies to all the aforementioned situations). As mentioned in the paper of Tseng [2], for the two-component Gaussian mixture example, by changing ${\mu}_{1}$ and ${\mu}_{2}$ by the same amount and suitably adjusting λ, the value of ${h}_{i}\left(x\right|\varphi )$ would be unchanged. We explore this more thoroughly by writing the corresponding equations. Let us suppose that, for distinct $\varphi$ and ${\varphi}^{\prime}$, we have ${D}_{\psi}\left(\varphi \right|{\varphi}^{\prime})=0$. By definition, ${D}_{\psi}$ is a sum of nonnegative terms, which implies that all terms need to be equal to zero. The following lines are equivalent $\forall i\in \{1,\cdots ,n\}$:

$$\begin{array}{ccc}\hfill {h}_{i}\left(0\right|\lambda ,{\mu}_{1},{\mu}_{2})& =& {h}_{i}\left(0\right|{\lambda}^{\prime},{\mu}_{1}^{\prime},{\mu}_{2}^{\prime}),\hfill \\ \hfill \frac{\lambda {e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1})}^{2}}}{\lambda {e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1})}^{2}}+(1-\lambda ){e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{2})}^{2}}}& =& \frac{{\lambda}^{\prime}{e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1}^{\prime})}^{2}}}{{\lambda}^{\prime}{e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{1}^{\prime})}^{2}}+(1-{\lambda}^{\prime}){e}^{-\frac{1}{2}{({y}_{i}-{\mu}_{2}^{\prime})}^{2}}},\hfill \\ \hfill log\left(\frac{1-\lambda}{\lambda}\right)-\frac{1}{2}{({y}_{i}-{\mu}_{2})}^{2}+\frac{1}{2}{({y}_{i}-{\mu}_{1})}^{2}& =& log\left(\frac{1-{\lambda}^{\prime}}{{\lambda}^{\prime}}\right)-\frac{1}{2}{({y}_{i}-{\mu}_{2}^{\prime})}^{2}+\frac{1}{2}{({y}_{i}-{\mu}_{1}^{\prime})}^{2}.\hfill \end{array}$$

Looking at this set of n equations as an equality of two polynomials on y of degree 1 at n points, we deduce that as we have two distinct observations, say, ${y}_{1}$ and ${y}_{2}$, the two polynomials need to have the same coefficients. Thus, the set of n equations is equivalent to the following two equations:

$$\left\{\begin{array}{ccc}{\mu}_{1}-{\mu}_{2}& =& {\mu}_{1}^{\prime}-{\mu}_{2}^{\prime}\\ log\left(\frac{1-\lambda}{\lambda}\right)+\frac{1}{2}{\mu}_{1}^{2}-\frac{1}{2}{\mu}_{2}^{2}& =& log\left(\frac{1-{\lambda}^{\prime}}{{\lambda}^{\prime}}\right)+\frac{1}{2}{{\mu}_{1}^{\prime}}^{2}-\frac{1}{2}{{\mu}_{2}^{\prime}}^{2}.\end{array}\right.$$

These two equations with three variables have an infinite number of solutions. Take, for example, ${\mu}_{1}=0,\phantom{\rule{3.33333pt}{0ex}}{\mu}_{2}=1,\phantom{\rule{3.33333pt}{0ex}}\lambda =\frac{1}{1+{e}^{-1/2}}\approx 0.62,\phantom{\rule{3.33333pt}{0ex}}{\mu}_{1}^{\prime}=\frac{1}{2},\phantom{\rule{3.33333pt}{0ex}}{\mu}_{2}^{\prime}=\frac{3}{2},\phantom{\rule{3.33333pt}{0ex}}{\lambda}^{\prime}=\frac{1}{2}$.
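The system above leaves one degree of freedom: shifting both means by the same amount δ keeps the first equation satisfied, and the second one then determines the new proportion through $log\left(\frac{1-{\lambda}^{\prime}}{{\lambda}^{\prime}}\right)=log\left(\frac{1-\lambda}{\lambda}\right)-\delta ({\mu}_{1}-{\mu}_{2})$. The following sketch builds such a pair of distinct parameters and checks numerically that the conditional probabilities coincide; the function names are hypothetical.

```python
import math

def posterior_first(y, lam, mu1, mu2):
    """Conditional probability that y belongs to the first component."""
    a = lam * math.exp(-0.5 * (y - mu1) ** 2)
    b = (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)
    return a / (a + b)

def shifted_parameters(lam, mu1, mu2, delta):
    """Shift both means by delta and adjust lambda so that the posteriors are unchanged."""
    # New log-odds of the second component, from the second equation of the system.
    log_odds = math.log((1.0 - lam) / lam) - delta * (mu1 - mu2)
    lam_new = 1.0 / (1.0 + math.exp(log_odds))
    return lam_new, mu1 + delta, mu2 + delta

phi = (2.0 / 3.0, 0.0, 1.0)
phi_prime = shifted_parameters(*phi, delta=0.5)
for y in (-2.0, 0.0, 0.7, 3.0):
    # The two columns agree up to floating-point rounding.
    print(y, posterior_first(y, *phi), posterior_first(y, *phi_prime))
```

Since the two parameter vectors differ while all ${h}_{i}$ agree, the proximal term between them is zero, which is exactly the failure of assumption A3.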

The previous conclusion can be extended to any two-component mixture of exponential families having the form:

$${p}_{\varphi}\left(y\right)=\lambda {e}^{{\sum}_{i=1}^{{m}_{1}}{\theta}_{1,i}{y}^{i}-F\left({\theta}_{1}\right)}+(1-\lambda ){e}^{{\sum}_{i=1}^{{m}_{2}}{\theta}_{2,i}{y}^{i}-F\left({\theta}_{2}\right)}.$$

One may write the corresponding n equations. The polynomial in ${y}_{i}$ has a degree of at most $max({m}_{1},{m}_{2})$. Thus, if one has $max({m}_{1},{m}_{2})+1$ distinct observations, the two polynomials will have the same set of coefficients. Finally, if $({\theta}_{1},{\theta}_{2})\in {\mathbb{R}}^{d-1}$ with $d>max({m}_{1},{m}_{2})$, then assumption A3 does not hold.

Unfortunately, we have no information about the difference between consecutive terms $\parallel {\varphi}^{k+1}-{\varphi}^{k}\parallel $ except for the case of $\psi \left(t\right)=\phi \left(t\right)=-log\left(t\right)+t-1$ which corresponds to the classical EM recurrence:

$${\lambda}^{k+1}=\frac{1}{n}\sum _{i=1}^{n}{h}_{i}\left(0\right|{\varphi}^{k}),\phantom{\rule{1.em}{0ex}}{\mu}_{1}^{k+1}=\frac{{\sum}_{i=1}^{n}{y}_{i}{h}_{i}\left(0\right|{\varphi}^{k})}{{\sum}_{i=1}^{n}{h}_{i}\left(0\right|{\varphi}^{k})},\phantom{\rule{1.em}{0ex}}{\mu}_{2}^{k+1}=\frac{{\sum}_{i=1}^{n}{y}_{i}{h}_{i}\left(1\right|{\varphi}^{k})}{{\sum}_{i=1}^{n}{h}_{i}\left(1\right|{\varphi}^{k})}.$$

Tseng [2] has shown that we can prove directly that ${\varphi}^{k+1}-{\varphi}^{k}$ converges to 0.
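The classical EM recurrence for this model can be written in a few lines, which also makes it easy to observe the difference between consecutive iterates numerically. This is a minimal sketch with hypothetical function names; it does not enforce the constraint $\lambda \in [\eta ,1-\eta ]$, and it assumes the posterior weights of both components stay away from zero.

```python
import math

def posterior_first(y, lam, mu1, mu2):
    """Conditional probability that y belongs to the first component."""
    a = lam * math.exp(-0.5 * (y - mu1) ** 2)
    b = (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)
    return a / (a + b)

def log_likelihood(ys, lam, mu1, mu2):
    """Log-likelihood of the two-component unit-variance Gaussian mixture."""
    return sum(
        math.log((lam * math.exp(-0.5 * (y - mu1) ** 2)
                  + (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)) / math.sqrt(2.0 * math.pi))
        for y in ys
    )

def em_step(ys, lam, mu1, mu2):
    """One EM update of (lambda, mu1, mu2) with known unit variances."""
    h = [posterior_first(y, lam, mu1, mu2) for y in ys]
    n, s = len(ys), sum(h)
    lam_new = s / n
    mu1_new = sum(y * hi for y, hi in zip(ys, h)) / s
    mu2_new = sum(y * (1.0 - hi) for y, hi in zip(ys, h)) / (n - s)
    return lam_new, mu1_new, mu2_new

def run_em(ys, phi0, n_iter=200):
    """Return the list of iterates phi^0, phi^1, ..., phi^n_iter."""
    path = [phi0]
    for _ in range(n_iter):
        path.append(em_step(ys, *path[-1]))
    return path
```

Running `run_em` on a sample and printing `max(abs(u - v) for u, v in zip(path[k + 1], path[k]))` along the path shows the gap between consecutive iterates shrinking, while `log_likelihood` does not decrease.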

We summarize the results of 100 experiments on 100 samples by giving the average of the estimates and the error committed, and the corresponding standard deviation. The error criterion is the total variation distance (TVD), which is calculated using the ${L}^{1}$ distance. Indeed, the Scheffé Lemma (see [20] (Page 129)) states that:

$$\underset{A\in {\mathcal{B}}_{n}\left(\mathbb{R}\right)}{sup}\left|{P}_{\varphi}\left(A\right)-{P}_{{\varphi}^{T}}\left(A\right)\right|=\frac{1}{2}{\int}_{\mathbb{R}}\left|{p}_{\varphi}\left(y\right)-{p}_{{\varphi}^{T}}\left(y\right)\right|dy.$$
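In practice, the right-hand side of this identity can be approximated by numerical integration. The sketch below uses a plain trapezoidal rule on a truncated interval; the integration bounds, the grid size, and the function names are assumptions of this example and would need to be adapted to the range of the means.

```python
import math

def mixture_pdf(y, lam, mu1, mu2):
    """Density of the two-component unit-variance Gaussian mixture."""
    return (lam * math.exp(-0.5 * (y - mu1) ** 2)
            + (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2)) / math.sqrt(2.0 * math.pi)

def total_variation(phi_a, phi_b, lo=-20.0, hi=20.0, n=20001):
    """Approximate 0.5 * integral of |p_a - p_b| over [lo, hi] by the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for j in range(n):
        y = lo + j * step
        weight = 0.5 if j in (0, n - 1) else 1.0  # endpoints count for half
        total += weight * abs(mixture_pdf(y, *phi_a) - mixture_pdf(y, *phi_b))
    return 0.5 * step * total
```

As a sanity check, when both mixtures degenerate to single Gaussians $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$, the result should be close to the known value $2\Phi (1/2)-1$, where $\Phi$ denotes the standard normal distribution function.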

The TVD gives a measure of the maximum error we may commit when we use the estimated model in lieu of the true distribution. We consider the Hellinger divergence for estimators based on $\phi -$divergences, which corresponds to $\phi \left(t\right)=\frac{1}{2}{(\sqrt{t}-1)}^{2}$. We prefer the Hellinger divergence because we hope to obtain robust estimators without loss of efficiency (see [21]). ${D}_{\psi}$ is calculated with $\psi \left(t\right)=\frac{1}{2}{(\sqrt{t}-1)}^{2}$. The kernel-based MD$\phi $DE is calculated using the Gaussian kernel, and the window is calculated using Silverman’s rule. We included in the comparison the minimum density power divergence (MDPD) estimator of [14]. The estimator is defined by:

$$\begin{array}{ccc}\hfill {\widehat{\varphi}}_{n}& =& \underset{\varphi \in \Phi}{arg\; inf}\int {p}_{\varphi}^{1+a}\left(z\right)dz-\frac{a+1}{a}\frac{1}{n}\sum _{i=1}^{n}{p}_{\varphi}^{a}\left({y}_{i}\right)\hfill \\ & =& \underset{\varphi \in \Phi}{arg\; inf}{\mathbb{E}}_{{P}_{\varphi}}\left[{p}_{\varphi}^{a}\right]-\frac{a+1}{a}{\mathbb{E}}_{{P}_{n}}\left[{p}_{\varphi}^{a}\right],\hfill \end{array}$$

where $a\in (0,1]$. This is a Bregman divergence and is known to have good efficiency and robustness for a good choice of the tradeoff parameter. According to the simulation results in [11], the value $a=0.5$ seems to give a good tradeoff between robustness against outliers and good performance under the model. Notice that the MDPD coincides with the MLE when $a$ tends to zero. Thus, the methodology presented in this article is applicable to this estimator, and the proximal point algorithm can be used to calculate the MDPD. The proximal term will be kept the same, i.e., $\psi \left(t\right)=\frac{1}{2}{(\sqrt{t}-1)}^{2}$.
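
The MDPD criterion can be evaluated numerically for a given parameter value. The following Python sketch does so for a two-component Gaussian mixture with unit variances; the model choice, the trapezoid integration on a truncated interval and the name `mdpd_objective` are assumptions made for illustration, and the minimization itself (Nelder–Mead in the experiments) is not shown.

```python
import math


def mixture_pdf(y, lam, mu1, mu2):
    """Density of lam * N(mu1, 1) + (1 - lam) * N(mu2, 1) at y."""
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return c * (lam * math.exp(-0.5 * (y - mu1) ** 2)
                + (1.0 - lam) * math.exp(-0.5 * (y - mu2) ** 2))


def mdpd_objective(y, lam, mu1, mu2, a=0.5, lo=-15.0, hi=15.0, n=6000):
    """Integral of p^(1+a) minus ((a+1)/a) times the sample mean of p^a."""
    h = (hi - lo) / n
    # Trapezoid rule for the integral of p^(1+a) over [lo, hi].
    integral = 0.5 * (mixture_pdf(lo, lam, mu1, mu2) ** (1.0 + a)
                      + mixture_pdf(hi, lam, mu1, mu2) ** (1.0 + a))
    for i in range(1, n):
        integral += mixture_pdf(lo + i * h, lam, mu1, mu2) ** (1.0 + a)
    integral *= h
    # Empirical term: average of p^a over the observations.
    empirical = sum(mixture_pdf(yi, lam, mu1, mu2) ** a for yi in y) / len(y)
    return integral - (a + 1.0) / a * empirical
```

For a single standard Gaussian the integral term has the closed form ${(2\pi )}^{-a/2}/\sqrt{1+a}$, which gives a convenient check of the numerical integration.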

(Note on the robustness of the estimators used) In Section 3, we proved under mild conditions that the proximal point algorithm (11) ensures the decrease of the estimated divergence. This means that when we use the dual Formulas (2) and (3), the proximal point algorithm (11) returns at convergence the estimators defined by (4) and (5), respectively. Similarly, if we use the density power divergence of Basu et al. [14], then the proximal point algorithm returns at convergence the MDPD defined by (22). The robustness properties of the dual estimators (4) and (5) are studied in [12] and [11], respectively, using the influence function (IF) approach. The robustness properties of the MDPD are likewise studied using the IF approach in [14]. The MD$\phi $DE (4) generally has an unbounded IF (see [12] Section 3.1), whereas the IF of the kernel-based MD$\phi $DE may be bounded, for example in a Gaussian model and for any $\phi -$divergence with $\phi ={\phi}_{\gamma}$ and $\gamma \in (0,1)$ (see [11] Example 2). The MDPD generally has a bounded IF if the tradeoff parameter $a$ is positive, in particular in the Gaussian model. The MDPD becomes more robust as the tradeoff parameter $a$ increases (see Section 3.3 in [14]). Therefore, we should expect the proximal point algorithm to produce robust estimators in the case of the kernel-based MD$\phi $DE and the MDPD, and thus to obtain better results than the MLE calculated using the EM algorithm.

Simulations from two mixture models are given below—a Gaussian mixture and a Weibull mixture. The MLE for both mixtures was calculated using the EM algorithm.

Optimizations were carried out using the Nelder–Mead algorithm [22] in the statistical software R [23]. Numerical integrations in the Gaussian mixture were calculated using the `distrExIntegrate` function of package `distrEx`. It is a slight modification of the standard function `integrate`: it performs a Gauss–Legendre quadrature when function `integrate` returns an error. In the Weibull mixture, we used the `integral` function from package `pracma`. Function `integral` includes a variety of numerical integration methods such as Gauss–Kronrod quadrature, Romberg’s method, Gauss–Richardson quadrature, Clenshaw–Curtis (not adaptive) and (adaptive) Simpson’s method. Although function `integral` is slow, it performs better than other functions even when the integrand is relatively badly behaved.
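
For readers unfamiliar with Gauss–Legendre quadrature, the following Python sketch shows a composite 3-point rule. It only illustrates the type of quadrature mentioned above and is not the implementation found in `distrEx`; the function name `gauss_legendre` and the number of panels are hypothetical choices.

```python
import math

# Nodes and weights of the 3-point Gauss-Legendre rule on [-1, 1].
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre(f, lo, hi, panels=200):
    """Composite 3-point Gauss-Legendre quadrature of f over [lo, hi]."""
    width = (hi - lo) / panels
    total = 0.0
    for k in range(panels):
        mid = lo + (k + 0.5) * width  # centre of the k-th panel
        for t, w in zip(_NODES, _WEIGHTS):
            # Map the reference node t in [-1, 1] onto the current panel.
            total += w * f(mid + 0.5 * width * t)
    return 0.5 * width * total
```

The 3-point rule integrates polynomials of degree up to five exactly on each panel, which is why a fixed rule of this kind is a reasonable fallback when an adaptive routine fails.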

We consider the Gaussian mixture (17) presented earlier with true parameters $\lambda =0.35$, ${\mu}_{1}=-2$, ${\mu}_{2}=1.5$ and known variances equal to 1. Contamination was done by adding, in the original sample, random observations from the uniform distribution $\mathcal{U}[-5,-2]$ to the five lowest values, and random observations from the uniform distribution $\mathcal{U}[2,5]$ to the five largest values. Results are summarized in Table 1. The EM algorithm was initialized according to condition (20). This condition gave good results under the model, whereas it did not always result in good estimates when outliers were added (the proportion converged towards 0 or 1); in those cases the EM algorithm was reinitialized manually.
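
The contamination scheme can be sketched in Python as follows. The sketch reads the description as shifting the five smallest values by draws from $\mathcal{U}[-5,-2]$ and the five largest values by draws from $\mathcal{U}[2,5]$; this reading, the sample size of 100, the seed and the function names are assumptions.

```python
import random


def sample_gaussian_mixture(n, lam, mu1, mu2, rng):
    """Draw n points from lam * N(mu1, 1) + (1 - lam) * N(mu2, 1)."""
    return [rng.gauss(mu1, 1.0) if rng.random() < lam else rng.gauss(mu2, 1.0)
            for _ in range(n)]


def contaminate(sample, rng, k=5):
    """Shift the k lowest values downwards and the k largest values upwards."""
    y = sorted(sample)
    for i in range(k):
        y[i] += rng.uniform(-5.0, -2.0)      # push the lower tail further left
        y[-1 - i] += rng.uniform(2.0, 5.0)   # push the upper tail further right
    return y


rng = random.Random(2016)  # fixed seed so the example is reproducible
clean = sample_gaussian_mixture(100, 0.35, -2.0, 1.5, rng)
dirty = contaminate(clean, rng)
```

With 100 observations, modifying ten of them corresponds to the 10% contamination level reported in the results.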

Figure 1 shows the values of the estimated divergence for both Formulas (2) and (3) on a logarithmic scale at each iteration of the algorithm.

Concerning our simulation results, the total variation distances of all four estimation methods are very close when we are under the model. When we added outliers, the classical MD$\phi $DE was as sensitive as the maximum likelihood estimator: the error of both more than doubled. Both the kernel-based MD$\phi $DE and the MDPD are clearly robust, since the total variation distance of these estimators increased only slightly under contamination.

We consider a two-component Weibull mixture with unknown shapes ${\nu}_{1}=1.2,{\nu}_{2}=2$ and a proportion $\lambda =0.35$. The scales are known and equal to ${\sigma}_{1}=0.5,{\sigma}_{2}=2$. The density function is given by:

$${p}_{\varphi}\left(x\right)=2\lambda {\nu}_{1}{\left(2x\right)}^{{\nu}_{1}-1}{e}^{-{\left(2x\right)}^{{\nu}_{1}}}+(1-\lambda )\frac{{\nu}_{2}}{2}{\left(\frac{x}{2}\right)}^{{\nu}_{2}-1}{e}^{-{\left(\frac{x}{2}\right)}^{{\nu}_{2}}}.$$
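
The density above is the general Weibull mixture with the known scales $0.5$ and $2$ substituted in. A small Python sketch of this correspondence is given below; the function and argument names (`nu1`, `nu2` for the two shape parameters, `sigma1`, `sigma2` for the scales) are illustrative.

```python
import math


def weibull_pdf(x, shape, scale):
    """Weibull density (shape/scale) * (x/scale)^(shape-1) * exp(-(x/scale)^shape)."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


def weibull_mixture_pdf(x, lam, nu1, nu2, sigma1=0.5, sigma2=2.0):
    """Two-component Weibull mixture with known scales sigma1 and sigma2."""
    return (lam * weibull_pdf(x, nu1, sigma1)
            + (1.0 - lam) * weibull_pdf(x, nu2, sigma2))
```

With `sigma1=0.5` the first component becomes $2{\nu}_{1}{(2x)}^{{\nu}_{1}-1}{e}^{-{(2x)}^{{\nu}_{1}}}$, and with `sigma2=2.0` the second becomes $\frac{{\nu}_{2}}{2}{(x/2)}^{{\nu}_{2}-1}{e}^{-{(x/2)}^{{\nu}_{2}}}$, which matches the displayed formula term by term.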

Contamination was done by replacing 10 randomly chosen observations of each sample by 10 i.i.d. observations drawn from a Weibull distribution with shape $\nu =0.9$ and scale $\sigma =3$. Results are summarized in Table 2. Notice that it would have been better to use asymmetric kernels to build the kernel-based MD$\phi $DE, since their use is advised for positive-supported distributions in order to reduce the bias at zero (see [11] for a detailed comparison with symmetric kernels). This is not, however, the goal of this paper. In addition, the use of symmetric kernels in this mixture model gave satisfactory results.
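
A possible way to reproduce this replacement scheme is sketched below in Python. The sample size of 100, the seed and the function names are assumptions; note that `random.weibullvariate(alpha, beta)` takes the scale first and the shape second.

```python
import random


def sample_weibull_mixture(n, lam, nu1, nu2, rng, sigma1=0.5, sigma2=2.0):
    """Draw n points from the two-component Weibull mixture with known scales."""
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    return [rng.weibullvariate(sigma1, nu1) if rng.random() < lam
            else rng.weibullvariate(sigma2, nu2)
            for _ in range(n)]


def replace_with_outliers(sample, rng, k=10, shape=0.9, scale=3.0):
    """Replace k randomly chosen observations by Weibull(shape, scale) draws."""
    y = list(sample)
    for i in rng.sample(range(len(y)), k):
        y[i] = rng.weibullvariate(scale, shape)
    return y


rng = random.Random(2016)  # fixed seed so the example is reproducible
clean = sample_weibull_mixture(100, 0.35, 1.2, 2.0, rng)
dirty = replace_with_outliers(clean, rng)
```

Unlike the Gaussian experiment, the outliers here replace observations rather than shift them, so the sample size is unchanged and at most ten positions differ between `clean` and `dirty`.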

Simulation results in Table 2 confirm once more the validity of our proximal point algorithm and the clear robustness of both the kernel-based MD$\phi $DE and the MDPD.

We introduced in this paper a proximal-point algorithm that permits the calculation of divergence-based estimators. We studied the theoretical convergence of the algorithm and verified it in a two-component Gaussian mixture. We performed several simulations which confirmed that the algorithm works and is a viable way to calculate divergence-based estimators. We also applied our proximal algorithm to a Bregman divergence estimator (the MDPD), and the algorithm succeeded in producing the MDPD. Further investigation of the role of the proximal term, together with a comparison with direct optimization methods to show the practical use of the algorithm, may be considered in future work.

The authors are grateful to Laboratoire de Statistique Théorique et Appliquée, Université Pierre et Marie Curie, for financial support.

Michel Broniatowski proposed the use of a proximal-point algorithm to calculate the MD$\phi $DE and proposed basing the work on the paper of Tseng [2]. Diaa Al Mohamad proposed the generalization in Section 2.3 and provided all of the convergence results in Section 3. Diaa Al Mohamad also conceived the simulations. Finally, Michel Broniatowski contributed to improving the text written by Diaa Al Mohamad. Both authors have read and approved the final manuscript.

The authors declare no conflict of interest.

1. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions; Wiley: Hoboken, NJ, USA, 2007.
2. Tseng, P. An Analysis of the EM Algorithm and Entropy-Like Proximal Point Methods. Math. Oper. Res. **2004**, 29, 27–44.
3. Chrétien, S.; Hero, A.O. Generalized Proximal Point Algorithms and Bundle Implementations. Available online: http://www.eecs.umich.edu/techreports/systems/cspl/cspl-316.pdf (accessed on 25 July 2016).
4. Goldstein, A.; Russak, I. How good are the proximal point algorithms? Numer. Funct. Anal. Optim. **1987**, 9, 709–724.
5. Chrétien, S.; Hero, A.O. Acceleration of the EM algorithm via proximal point iterations. In Proceedings of the IEEE International Symposium on Information Theory, Cambridge, MA, USA, 16–21 August 1998.
6. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci. **1963**, 8, 95–108. (In German)
7. Broniatowski, M.; Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivar. Anal. **2009**, 100, 16–36.
8. Cressie, N.; Read, T.R.C. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B **1984**, 46, 440–464.
9. Broniatowski, M.; Keziou, A. Minimization of divergences on sets of signed measures. Stud. Sci. Math. Hung. **2006**, 43, 403–442.
10. Liese, F.; Vajda, I. On Divergences and Informations in Statistics and Information Theory. IEEE Trans. Inf. Theory **2006**, 52, 4394–4412.
11. Al Mohamad, D. Towards a better understanding of the dual representation of phi divergences. 2016; arXiv:1506.02166.
12. Toma, A.; Broniatowski, M. Dual divergence estimators and tests: Robustness results. J. Multivar. Anal. **2011**, 102, 20–36.
13. Rockafellar, R.T.; Wets, R.J.B. Variational Analysis, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 1998.
14. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika **1998**, 85, 549–559.
15. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B **1977**, 39, 1–38.
16. Wu, C.F.J. On the Convergence Properties of the EM Algorithm. Ann. Stat. **1983**, 11, 95–103.
17. Ostrowski, A. Solution of Equations and Systems of Equations; Academic Press: Cambridge, MA, USA, 1966.
18. Chrétien, S.; Hero, A.O. On EM algorithms and their proximal generalizations. ESAIM Probabil. Stat. **2008**, 12, 308–326.
19. Berge, C. Topological Spaces: Including a Treatment of Multi-valued Functions, Vector Spaces, and Convexity; Dover Publications: Mineola, NY, USA, 1963.
20. Meister, A. Deconvolution Problems in Nonparametric Statistics; Springer: Berlin/Heidelberg, Germany, 2009.
21. Jiménez, R.; Shao, Y. On robustness and efficiency of minimum divergence estimators. Test **2001**, 10, 241–248.
22. Nelder, J.A.; Mead, R. A Simplex Method for Function Minimization. Comput. J. **1965**, 7, 308–313.
23. The R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013.

Table 1. Results for the two-component Gaussian mixture: averages and standard deviations (sd) of the estimates and of the TVD over 100 experiments.

Estimation Method | λ | sd (λ) | ${\mu}_{1}$ | sd (${\mu}_{1}$) | ${\mu}_{2}$ | sd (${\mu}_{2}$) | TVD | sd (TVD) |
---|---|---|---|---|---|---|---|---|
Without Outliers | | | | | | | | |
Classical MD$\phi $DE | 0.349 | 0.049 | -1.989 | 0.207 | 1.511 | 0.151 | 0.061 | 0.029 |
New MD$\phi $DE–Silverman | 0.349 | 0.049 | -1.987 | 0.208 | 1.520 | 0.155 | 0.062 | 0.029 |
MDPD $a=0.5$ | 0.360 | 0.053 | -1.997 | 0.226 | 1.489 | 0.135 | 0.065 | 0.025 |
EM (MLE) | 0.360 | 0.054 | -1.989 | 0.204 | 1.493 | 0.136 | 0.064 | 0.025 |
With $10\%$ Outliers | | | | | | | | |
Classical MD$\phi $DE | 0.357 | 0.022 | -2.629 | 0.094 | 1.734 | 0.111 | 0.146 | 0.034 |
New MD$\phi $DE–Silverman | 0.352 | 0.057 | -1.756 | 0.224 | 1.358 | 0.132 | 0.087 | 0.033 |
MDPD $a=0.5$ | 0.364 | 0.056 | -1.819 | 0.218 | 1.404 | 0.132 | 0.078 | 0.030 |
EM (MLE) | 0.342 | 0.064 | -2.617 | 0.288 | 1.713 | 0.172 | 0.150 | 0.034 |

Table 2. Results for the two-component Weibull mixture: averages and standard deviations (sd) of the estimates and of the TVD over 100 experiments.

Estimation Method | λ | sd (λ) | ${\nu}_{1}$ | sd (${\nu}_{1}$) | ${\nu}_{2}$ | sd (${\nu}_{2}$) | TVD | sd (TVD) |
---|---|---|---|---|---|---|---|---|
Without Outliers | | | | | | | | |
Classical MD$\phi $DE | 0.356 | 0.066 | 1.245 | 0.228 | 2.055 | 0.237 | 0.052 | 0.025 |
New MD$\phi $DE–Silverman | 0.387 | 0.067 | 1.229 | 0.241 | 2.145 | 0.289 | 0.058 | 0.029 |
MDPD $a=0.5$ | 0.354 | 0.068 | 1.238 | 0.230 | 2.071 | 0.345 | 0.056 | 0.029 |
EM (MLE) | 0.355 | 0.066 | 1.245 | 0.228 | 2.054 | 0.237 | 0.052 | 0.025 |
With $10\%$ Outliers | | | | | | | | |
Classical MD$\phi $DE | 0.250 | 0.085 | 1.089 | 0.300 | 1.470 | 0.335 | 0.092 | 0.037 |
New MD$\phi $DE–Silverman | 0.349 | 0.076 | 1.122 | 0.252 | 1.824 | 0.324 | 0.067 | 0.034 |
MDPD $a=0.5$ | 0.322 | 0.077 | 1.158 | 0.236 | 1.858 | 0.344 | 0.060 | 0.029 |
EM (MLE) | 0.259 | 0.095 | 0.941 | 0.368 | 1.565 | 0.325 | 0.095 | 0.035 |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).