A Proximal Point Algorithm for Minimum Divergence Estimators with Application to Mixture Models

Estimators derived from a divergence criterion such as $\varphi-$divergences are generally more robust than the maximum likelihood ones. We are interested in particular in the so-called MD$\varphi$DE, an estimator built using a dual representation of $\varphi$--divergences. We present in this paper an iterative proximal point algorithm which permits to calculate such estimator. This algorithm contains by its construction the well-known EM algorithm. Our work is based on the paper of \citep{Tseng} on the likelihood function. We provide several convergence properties of the sequence generated by the algorithm, and improve the existing results by relaxing the identifiability condition on the proximal term, a condition which is not verified for most mixture models and hard to be verified for non mixture ones. Since convergence analysis uses regularity conditions (continuity and differentiability) of the objective function, which has a supremal form, we find it useful to present some analytical approaches for studying such functions. Convergence of the EM algorithm is discussed here again in a Gaussian and Weibull mixtures in the spirit of our approach. Simulations are provided to confirm the validity of our work and the robustness of the resulting estimators against outliers.


Introduction
The EM algorithm is a well known method for calculating the maximum likelihood estimator of a model where incomplete data is considered. For example, when working with mixture models in the context of clustering, the labels or classes of observations are unknown during the training phase. Several variants of the EM algorithm were proposed, see McLachlan and Krishnan [2007]. Another way to look at the EM algorithm is as a proximal point problem, see Chrétien and Hero [1998b] and Tseng [2004]. Indeed, one may rewrite the conditional expectation of the complete loglikelihood as a sum of the log-likelihood function and a distance-like function over the conditional densities of the labels provided an observation. Generally, the proximal term has a regularization effect in the sense that a proximal point algorithm is more stable and frequently outperforms classical optimization algorithms, see Goldstein and Russak [1987]. Chrétien and Hero Chrétien and Hero [1998a] prove superlinear convergence of a proximal point algorithm derived by the EM algorithm. Notice that EM-type algorithms usually enjoy no more than linear convergence. Taking into consideration the need for robust estimators, and the fact that the MLE is the least robust estimator among the class of divergence-type estimators which we present below, we generalize the EM algorithm (and the version in Tseng [2004]) by replacing the log-likelihood function by an estimator of a ϕ−divergence between the true distribution of the data and the model. A ϕ−divergence in the sense of Csiszár Csiszár [1963] is defined in the same way as Broniatowski and Keziou [2009] by: where ϕ is a nonnegative strictly convex function. Examples of such divergences are: the Kullback-Leibler (KL) divergence for ϕ(t) = t log(t)−t+1, the modified KL divergence for ϕ(t) = − log(t)+ t − 1, the hellinger distance for ϕ(t) = 1 2 ( √ t − 1) among others. All these well-known divergences belong to the class of Cressie-Read functions Cressie and Read [1984] defined by Since the ϕ−divergence calculus uses the unknown true distribution, we need to estimate it. We consider the dual estimator of the divergence introduced independently by Broniatowski and Keziou [2006] and Liese and Vajda [2006]. The use of this estimator is motivated by many reasons. Its minimum coincides with the MLE for ϕ(t) = − log(t) + t − 1. Besides, it has the same form for discrete and continuous models, and does not consider any partitioning or smoothing. Let (P φ ) φ∈Φ be a parametric model with Φ ⊂ R d , and denote φ T the true set of parameters. Let dy be the Lebesgue measure defined on R. Suppose that ∀φ ∈ Φ, the probability measure P φ is absolutely continuous with respect to dy and denote p φ the corresponding probability density. The dual estimator of the ϕ−divergence given an n−sample y 1 , · · · , y n is given by: with ϕ # (t) = tϕ (t) − ϕ(t). AL Mohamad Al Mohamad [2016] argues that this formula works well under the model, however, when we are not, this quantity largely underestimates the divergence between the true distribution and the model, and proposes following modification: where K n,w is the Rosenblatt-Parzen kernel estimate with window parameter w. Whether it iŝ D ϕ , orD ϕ , the minimum dual ϕ−divergence estimator (MDϕDE) is defined as the argument of the infimum of the dual approximation: Asymptotic properties and consistency of these two estimators can be found in Broniatowski and Keziou [2009] and Al Mohamad [2016]. Robustness properties were also studied using the influence function approach in Toma and Broniatowski [2011] and Al Mohamad [2016]. The kernel-based MDϕDE (5) seems to be a better estimator than the classical MDϕDE (4) in the sense that the former is robust whereas the later is generally not. Under the model, the estimator given by (4) is, however, more efficient especially when the true density of the data is unbounded 1 , see Al Mohamad [2016] for a brief comparison.
Here in this paper, we propose to calculate the MDϕDE using an iterative procedure based on the work of Tseng [2004] on the log-likelihood function. This procedure has the form of a proximal point algorithm, and extends the EM algorithm. Our convergence proof demands some regularity of the estimated divergence with respect to the parameter vector which is not simply checked using (2). Recent results in the book of Rockafellar and Wets Rockafellar and Wets [1998] provide sufficient conditions to solve this problem. Differentiability with respect to φ still remains a very hard task, therefore, our results cover cases when the objective function is not differentiable. The paper is organized as follows: In Sect. 1, we present the general context. We also present the derivation of our algorithm from the EM algorithm and passing by Tseng's generalization. In Sect. 2, we present some convergence properties. We discuss in Sect. 3 a variant of the algorithm with a theoretical global infimum, and an exambple on the two-gaussian mixture model and a convergence proof of the EM algorithm in the spirit of our approach. Finally, Sect. 4 contains simulations confirming our claim about the efficiency and the robustness of our approach in comparison with the MLE. The algorithm is also applied to the so-called MDPD introduced by Basu et al. [1998].

General Context and Notations
Let (X, Y ) be a couple of random variables with joint probability density function f (x, y|φ) parametrized by a vector of parameters φ ∈ Φ ⊂ R d . Let (X 1 , Y 1 ), · · · , (X n , Y n ) be n copies of (X, Y ) independently and identically distributed. Finally, let (x 1 , y 1 ), · · · , (x n , y n ) be n realizations of the n copies of (X, Y ). The x i 's are the unobserved data (labels) and the y i 's are the observations. The vector of parameters φ is unknown and need to be estimated. The observed data y i are supposed to be real numbers, and the labels x i belong to a space X not necessarily finite unless mentioned otherwise. The marginal density of the observed data is given by p φ (y) = f (x, y|φ)dx, where dx is a measure defined on the label space (for example, the counting measure if we work with mixture models).
For a parametrized function f with a parameter a, we write f (x|a). We use the notation φ k for sequences with the index above. Derivatives of a real valued function ψ defined on R are written as ψ , ψ , etc. We use ∇f for the gradient of a real function f defined on R d , and J f for the matrix of second order partial derivatives. For a generic function of two (vectorial) arguments D(φ|θ), then ∇ 1 D(φ|θ) denotes the gradient with respect to the first (vectorial) variable. Finally, for any set A, we use int(A) to denote the interior of A.

EM Algorithm and Tseng's Generalization
The EM algorithm estimates the unknown parameter vector by (see Dempster et al. [1977]): where X = (X 1 , · · · , X n ), Y = (Y 1 , · · · , Y n ) and y = (y 1 , · · · , y n ). By independence between the couples (X i , Y i )'s, previous iteration may be written as: where h i (x|φ k ) = f (x,y i |φ k ) p φ k (y i ) is the conditional density of the labels (at step k) provided y i which we suppose to be positive dx−almost everywhere. It is well-known that the EM iterations can be rewritten as a difference between the log-likelihood and a Kullback-Liebler distance-like function. Indeed, The final line is justified by the fact that h i (x|φ) is a density, therefore it integrates to 1. The additional term does not depend on φ and, hence, can be omitted. We now have the following iterative procedure: The previous iteration has the form of a proximal point maximization of the log-likelihood, i.e. a perturbation of the log-likelihood by a distance-like function defined on the conditional densities of the labels. Tseng Tseng [2004] generalizes this iteration by allowing any nonnegative convex function ψ to replace the t → − log(t) function. Tseng's recurrence is defined by: where J is the log-likelihood function and D ψ is given by: for any real nonnegative convex function ψ such that ψ(1) = ψ (1) = 0. D ψ (φ 1 , φ 2 ) is nonnegative, and D ψ (φ 1 , φ 2 ) = 0 if and only if ∀i, h i (x|φ 1 ) = h i (x|φ 2 ) dx−almost everywhere.

Generalization of Tseng's Algorithm
We use the relation between maximizing the log-likelihood and minimizing the Kullback-Liebler divergence to generalize the previous algorithm. We, therefore, replace the log-likelihood function by an estimate of a ϕ−divergence D ϕ between the true distribution and the model. We use the dual estimator of the divergence presented earlier in the introduction (2) or (3) which we denote in the same mannerD ϕ unless mentioned otherwise. Our new algorithm is defined by: where D ψ (φ, φ k ) is defined by (8). When ϕ(t) = − log(t) + t − 1, it is easy to see that we get recurrence (7). Indeed, for the case of (2) we have: log(p φ (y i )).
Using the fact that the first term inD ϕ (p φ , p φ T ) does not depend on φ, so it does not count in the arg inf defining φ k+1 , we easily get (7). The same applies for the case of (3). For notational simplicity, from now on, we redefine D ψ with a normalization by n, i.e.
Hence, our set of algorithms is redefined by: We will see later that this iteration forces the divergence to decrease and that under suitable conditions, it converges to a (local) minimum ofD ϕ (p φ , p φ T ). It results that, algorithm (11) is a way to calculate both the MDϕDE (4) and the kernel-based MDϕDE (5).

Some Convergence Properties of φ k
We show here how, according to some possible situations, one may prove convergence of the algorithm defined by (11). Let φ 0 be a given initialization, and let which we suppose to be a subset of int(Φ). The idea of defining such a set in this context is inherited from the paper of [Wu, 1983] which provided the first correct proof of convergence for the EM algorithm. Before going any further, we recall the following Definition of a (generalized) stationary point.
Definition 1. Let f : R d → R be a real valued function. If f is differentiable at a point φ * such that ∇f (φ * ) = 0, we then say that φ * is a stationary point of f . If f is not differentiable at φ * but the subgradient of f at φ * , say ∂f (φ * ), exists such that 0 ∈ ∂f (φ * ), then φ * is called a generalized stationary point of f .
We will be using the following assumptions: , D ψ and ∇ 1 D ψ are defined and continuous on, respectively, Φ, Φ × Φ and Φ × Φ; Recall also that we suppose that h i (x|φ) > 0, dx − a.e. We relax the convexity assumption of function ψ. We only suppose that ψ is nonnegative and ψ(t) = 0 iff t = 1. Besides ψ (t) = 0 if t = 1. Continuity and differentiability assumptions of function φ →D ϕ (p φ |p φ T ) for the case of (3) can be easily checked using Lebesgue theorems. Continuity assumption for the case of (2) can be checked using Theorem 1.17 or Corollary 10.14 in Rockafellar and Wets [1998]. Differentiability can also be checked using Corollary 10.14 or Theorem 10.31 in the same book. In what concerns D ψ , continuity and differentiability can be obtained merely by fulfilling Lebesgue theorems conditions. When working with mixture models, we only need the continuity and differentiability of ψ and functions h i . The later is easily deduced from regularity assumptions on the model. For assumption A2, there is no universal method, see paragraph (3.2) for an Example. Assumption A3 can be checked using Lemma 2 in Tseng [2004]. We start the convergence properties by proving that the objective functionD ϕ (p φ |p φ T ) decreases alongside the the sequence (φ k ) k , and a possible set of conditions for the existence of the sequence (φ k ) k .
Proof. We prove (a). we have by Definition of the arginf: We use the fact that D ψ (φ k , φ k ) = 0 for the right hand and that D ψ (φ k+1 , φ k ) ≥ 0 for the left hand side of the previous inequality.
We prove (b). Using the decreasing property previously proved in (a), we have by recurrence The result follows for both algorithms directly by Definition of Φ 0 . We prove (c). By induction on k. For k = 0, clearly φ 0 = (λ 0 , θ 0 ) is well defined (a choice we make 2 ). Suppose for some k ≥ 0 that φ k = (λ k , θ k ) exists. We prove that the infimum is attained in Φ 0 . Let φ ∈ Φ be any vector at which the value of the optimized function has a value less than We have: The first line follows from the non negativity of Thus, the infimum can be calculated for vectors in Φ 0 instead of Φ. Since Φ 0 is compact and the optimized function is lower semicontinuous (the sum of two lower semicontinuous functions), then the infimum exists and is attained in Φ 0 . We may now define φ k+1 to be a vector whose corresponding value is equal to the infimum. Convergence of the sequence (D ϕ (p φ k , p φ T )) k comes from the fact that it is non increasing and bounded. It is non increasing by virtue of (a). Boundedness comes from the lower semicontinuity The infimum of a proper lower semicontinuous function on a compact set exists and is attained on this set. Hence, the quantity inf φ∈Φ 0D ϕ (p φ , p φ T ) exists and is finite. This ends the proof.
Compactness in part (c) can be replaced by inf-compactness of function φ →D ϕ (p φ |p φ T ) and continuity of D ψ with respect to its first argument. The convergence of the sequence (D ϕ (φ k |φ T )) k is an interesting property, since in general there is no theoretical guarantee, or it is difficult to prove that the whole sequence (φ k ) k converges. It may also continue to fluctuate around a minimum.
The decrease of the error criterionD ϕ (φ k |φ T ) between two iterations helps us decide when to stop the iterative procedure.
Let's show now that the subsequence (φ n k +1 ) also converges to φ ∞ . We simply have: By Definition of φ n k +1 , it verifies the infimum in recurrence (11), so that the gradient of the optimized function is zero: Using the continuity assumptions A1 and AC of the gradients, one can pass to the limit with no problem: We prove (b). We use again the Definition of the arginf. As the optimized function is not necessarily differentiable at the points of the sequence φ k , a necessary condition for φ k+1 to be an infimum is that 0 belongs to the subgradient of the function on φ k+1 . Since D ψ (φ, φ k ) is assumed to be differentiable, the optimality condition is translated into: is continuous, then its subgradient is outer semicontinuous (see [Rockafellar and Wets, 1998] Chap 8, Proposition 7). We use the same arguments presented in (a) to conclude the existence of two subsequences (φ n k ) k and (φ n k +1 ) k which converge to the same limit φ ∞ . By Definition of outer semicontinuity, and since φ n k +1 → φ ∞ , we have: We want to prove that 0 ∈ lim sup φ n k +1 →φ ∞ ∂D ϕ (p φ n k +1 , p φ T ). By Definition of limsup 3 : . 3 We use here the Definition corresponding to the outer limit, see [Rockafellar and Wets, 1998] Chap 4, Definition 1 or Chap 5-B.
In our scenario, φ = φ n k +1 , φ k = φ n k +1 , u = 0 and u k = ∇ 1 D ψ (φ n k +1 , φ n k ). The continuity of ∇ 1 D ψ with respect to both arguments and the fact that the two subsequences φ n k +1 and φ n k converge to the same limit, imply that (12), we get our result: This ends the proof.
The assumption {φ k+1 − φ k } → 0 used in Proposition 2 is not easy to be checked unless one has a close formula of φ k . The following Proposition gives a method to prove such assumption. This method seems simpler, but it is not verified in many mixture models, see Sect. (3.2) for a counter Example.
Proof. By contradiction, let's suppose that φ k+1 − φ k does not converge to 0. There exists a subsequence such that ) k is convergent, a further subsequence also converges to the same limitφ. We have proved the existence of a subsequence of ( The real sequenceD ϕ (p φ k , p φ T ) k converges as proved in Proposition 1-c. As a result, both se-quencesD ϕ (p φ N (k)+1 , p φ T ) andD ϕ (p φ N (k) , p φ T ) converge to the same limit being subsequences of the same convergent sequence. In the proof of Proposition 1, we can deduce the following inequality: which is also verified to any substitution of k by N (k). By passing to the limit on k, we get D ψ (φ,φ) ≤ 0. However, the distance-like function D ψ is positive, so that it becomes zero. Using assumption A3, D ψ (φ,φ) = 0 implies thatφ =φ. This contradicts the hypothesis that φ k+1 − φ k does not converge to 0. The second part of the Proposition is a direct result of Proposition 2.

Corollary 1. Under assumptions of Proposition 3, the set of accumulation points of
Proof. Since the sequence (φ) k is bounded and verifies φ k+1 − φ k → 0, then Theorem 28.1 in [Ostrowski, 1966] implies that the set of accumulation points of (φ k ) k is a connected compact set. It is not empty since Φ 0 is compact. Let φ ∞ be a limit point of (φ k ) k . The assumption about strict convexity ofD(p φ , p φ T ) in a neighborhood of φ ∞ implies that it is isolated in the sense that if there are another limit pointφ, then there is ε > 0 such that φ ∞ −φ > ε. Hence, the set of accumulation points can be written as the union of at least two disjoint open sets which contradicts the connectedness property. Thus, φ ∞ is the only limit point of the sequence (φ k ). To end the proof, we need to show that the whole sequence converge. By contradiction, if it does not converge, there exists then ε > 0 and an infinity of terms which verifies φ N 0 (k) − φ ∞ > ε. By compactness of Φ 0 , one may extract a subsequence of (φ N 0 (k) ) k , say (φ N 1 •N 0 (k) ) k , which converges to someφ. Moreover, by continuity of the euclidean norm, Contradiction is reached by uniqueness of the limit point of the sequence (φ k ) k .
Proposition 3 and Corollary 1 describe what we may hope to get of the sequence φ k . Convergence of the whole sequence is bound by a local convexity assumption in a neighborhood of a limit point. Although simple, this assumption remains difficult to be checked since we do not know where might be the limit points. Besides, assumption A3 is very restrictive, and is not verified in mixture models. Proposition 2 and 3 were developed for the likelihood function in the paper of Tseng [2004]. Similar results for a general class of functions replacingD ϕ and D ψ which may not be differentiable (but still continuous) are presented in Chrétien and Hero [1998b]. In these results, assumption A3 is essential. Although Chrétien and Hero [2008] overcomes this problem, their approach demands that the log-likelihood has −∞ limit as φ → ∞. This is simply not verified for mixture models. We present a similar method to Chrétien and Hero [2008] based on the idea of Tseng [2004] of using the set Φ 0 which is valid for mixtures. We lose, however, the guarantee of consecutive decrease of the sequence.
Proposition 4. Assume A1, AC and A2 verified. Any limit point of the sequence (φ k ) k is a stationary point of φ →D(p φ , p φ T ). If AC is dropped, then 0 belongs to the subgradient of φ →D(p φ , p φ T ) calculated at the limit point.
Proof. If (φ k ) k converges to, say, φ ∞ , then the result falls simply from Proposition 2. If (φ k ) k does not converge. Since Φ 0 is compact and ∀k, φ k ∈ Φ 0 (proved in Proposition 1), there exists a subsequence (φ N 0 (k) ) k such that φ N 0 (k) →φ. Let's take the subsequence (φ N 0 (k)−1 ) k . This subsequence does not necessarily converge; still it is contained in the compact Φ 0 , so that we can extract a further subsequence (φ N 1 •N 0 (k)−1 ) k which converges to, say,φ. Now, the subsequence (φ N 1 •N 0 (k) ) k converges toφ, because it is a subsequence of (φ N 0 (k) ) k . We have proved until now the existence of two convergent subsequences φ N (k)−1 and φ N (k) with a priori different limits. For simplicity and without any loss of generality, we will consider these subsequences to be φ k and φ k+1 respectively. Conserving previous notations, suppose that φ k+1 →φ and φ k →φ. We use again inequality (13): By taking the limits of the two parts of the inequality as k tends to infinity, and using the continuity of the two functions, we havê Recall that under A1-2, the sequence D ϕ (p φ k , p φ T ) k converges, so that it has the same limit for any subsequence, i.e.D(pφ, p φ T ) =D(pφ, p φ T ). We also use the fact that the distance-like function D ψ is non negative to deduce that D ψ (φ,φ) = 0. Looking closely at the Definition of this divergence (10), we get that if the sum is zero, then each term is also zero since all terms are non negative. This means that: ∀i ∈ {1, · · · , n}, The integrands are non negative functions, so they vanish almost ever where with respect to the measure dx defined on the space of labels.
Comparing the proved result with the notation considered at the beginning of the proof, we have proved that the limit of the subsequence (φ N 1 •N 0 (k) ) k is a stationary point of the objective function. Therefore, The final step is to deduce the same result on the original convergent subsequence (φ N 0 (k) ) k . This is simply due to the fact that (φ N 1 •N 0 (k) ) k is a subsequence of the convergent sequence (φ N 0 (k) ) k , hence they have the same limit.
The proof of the previous proposition is very similar to the proof of Proposition 2. The key idea is to use the sequence of conditional densities h i (x|φ k ) instead of the sequence φ k . According to the application, one may be interested only in Proposition 1 or in Propositions 2-4. If one is interested in the parameters, Propositions 2 to 4 should be used, since we need a stable limit of (φ k ) k . If we are only interested in minimizing an error criterionD ϕ (p φ , p φ T ) between the estimated distribution and the true one, Proposition 1 should be sufficient.

Case studies 3.1 An algorithm with theoretically global infimum attainment
We present a variant of algorithm (11) which ensures theoretically the convergence to a global infimum of the objective functionD ϕ (p φ , p φ T ) as long as there exists a convergent subsequence of (φ k ) k . The idea is the same as Theorem 3.2.4 in [Chrétien and Hero, 2008]. Define φ k+1 by: The proof of convergence is very simple and does not depend on the differentiability of any of the two functionsD ϕ or D ψ . We only assume A1 and A2 to be verified. Let (φ N (k) ) k be a convergent subsequence. Let φ ∞ be its limit. This is guaranteed by the compactness of Φ 0 and the fact that the whole sequence (φ k ) k resides in Φ 0 (see Proposition 1-b). Suppose also that the sequence (β k ) k converges to 0 as k goes to infinity. Let φ by a vector of Φ which has a value ofD ϕ (p φ , p φ T ) strictly inferior to the value of the same By Definition of φ N (k) , we have: which holds for every φ in the whole set Φ. Using the non negativity of the term As we pass to the limit on k, recall firstly that (β k ) k converges to 0, so that any subsequence (β N (k) ) k also converges to 0. Secondly, the continuity assumption on D ψ implies that, since . By compactness of Φ 0 and Proposition 1-b, we have φ ∞ ∈ Φ 0 . The continuity again of D ψ will imply that the quantity D ψ (φ, φ ∞ ) is finite. Finally, inequality (18) now implies that: which contradicts with the choice of φ verifying (17). Hence, φ ∞ is a global infimum on Φ. The problem with this approach is that it depends heavily on the fact that the supremum on each step of the algorithm is calculated exactly. This does not happen in general unless function is convex or that we dispose of an algorithm which can solve perfectly non convex optimization problems 5 . Although in our approach, we use a similar assumption to prove the consecutive decreasing ofD ϕ (p φ , p φ T ), we can replace the infimum calculus in (11) by two things. We require that at each step we find a local infimum ofD ϕ (p φ , p φ T )+D ψ (φ, φ k ) whose evaluation with φ →D ϕ (p φ , p φ T ) is less than the previous term of the sequence φ k . If we can no longer find any local minima verifying the claim, the procedure stops with φ k+1 = φ k . This ensures the availability of all the proofs presented in this paper with no change.

The two-component Gaussian mixture
We suppose that the model (p φ ) φ∈Φ is a mixture of two gaussian densities, and that we are only interested in estimating the means µ = (µ 1 , µ 2 ) ∈ R 2 and the proportion λ ∈ [η, 1 − η]. The use of η is to avoid cancellation of any of the two components, and to keep the hypothesis h i (x|φ) > 0 for x = 1, 2 verified. We also suppose that the components variances are reduced (σ i = 1). The model takes the form In the case of ϕ(t) = − log(t) + t − 1, the set Φ 0 is given by Φ The regularization term D ψ is defined by (8) where: Functions h i are clearly of class C 1 (int(Φ)), hence, assumptions A1 and AC are verified. We prove that Φ 0 is closed and bounded which is sufficient to conclude its compactness, since the space [η, 1 − η] × R 2 provided with the euclidean distance is complete. If we are using the dual estimator of the ϕ−divergence given by (2), then assumption A0 can be verified using the maximum theorem of Berge [1963]. There is still a great difficulty in studying the properties (closedness or compactness) of the set Φ 0 . Moreover, all convergence properties of the sequence φ k require the continuity of the estimated ϕ−divergenceD ϕ (p φ , p φ T ) with respect to φ. In order to prove the continuity of the estimated divergence, we need to assume that Φ is compact, i.e. assume that the means are included in an interval of the form [µ min , µ max ]. Now, using Theorem 10.31 from Rockafellar and Wets [1998], φ →D ϕ (p φ , p φ T ) is continuous and differentiable almost everywhere with respect to φ. The compactness assumption of Φ implies directly the compactness of Φ 0 . Indeed Φ 0 is then the inverse image by a continuous function of a closed set, so it is closed in Φ. Hence, it is compact. (2) converges and there exists a subsequence (φ N (k) ) which converges to a stationary point of the estimated divergence. Moreover, every limit point of the sequence (φ k ) k is a stationary point of the estimated divergence.

Conclusion 1. Using Propositions 4 and 1, if
If we are using the kernel-based dual estimator given by (3) with a Gaussian kernel density estimator, then function φ →D ϕ (p φ , p φ T ) is continuously differentiable over Φ even if the means µ 1 and µ 2 are not bounded. For example, take ϕ = ϕ γ defined by (1). There is one condition which relates the window of the kernel, say w, with the value of γ; γ(w 2 − 1) > −1. For γ = 2 (the Pearson's χ 2 ), we need that w 2 > 1/2. For γ = 1/2 (the Hellinger), there is no condition on w. Closedness of Φ 0 is proved similarly to the previous case. Boundedness is however must be treated differently since Φ is not necessarily compact and is supposed to be Φ = [η, 1 − η] s × R s . For simplicity take ϕ = ϕ γ . The idea is to choose φ 0 an initialization for the proximal algorithm in a way that Φ 0 does not include unbounded values of the means. Continuity of φ →D ϕ (p φ , p φ T ) permits to calculate the limits when either (or both) of the means tends to infinity. If both means goes to infinity, then p φ (x) → 0, ∀x. Thus, for γ ∈ (0, ∞) \ {1}, we haveD ϕ (p φ , p φ T ) → 1 γ(γ−1) . For γ < 0, the limit is infinity. If only one of the means tends to ∞, then the corresponding component vanishes from the mixture. Thus, if we choose φ 0 such that: then the algorithm starts at a point of Φ whose function value is inferior to the limits ofD ϕ (p φ , p φ T ) at infinity. By Proposition 1, the algorithm will continue to decrease the value ofD ϕ (p φ , p φ T ) and never goes back to the limits at infinity. Besides, the Definition of Φ 0 permits to conclude that if φ 0 is chosen according to condition (20,21), then Φ 0 is bounded. Thus, Φ 0 becomes compact. Unfortunately the value of inf λ,µDϕ (p (λ,∞,µ) , p φ T ) can be calculated but numerically. We will see next that in the case of Likelihood function, a similar condition will be imposed for the compactness of Φ 0 , and there will be no need for any numerical calculus. Using Propositions 4 and 1,under condition (20,21) the sequence (D ϕ (p φ k , p φ T )) k defined through formula (3) converges and there exists a subsequence (φ N (k) ) which converges to a stationary point of the estimated divergence. Moreover, every limit point of the sequence (φ k ) k is a stationary point of the estimated divergence.

Conclusion 2.
In the case of the likelihood ϕ(t) = − log(t) + t − 1, the set Φ 0 can be written as: where J is the log-likelihood function. Function J is clearly of class C 1 on (int(Φ)). We prove that Φ 0 is closed and bounded which is sufficient to conclude its compactness, since the space [η, 1 − η] s × R s provided with the euclidean distance is complete. Closedness. The set Φ 0 is the inverse image by a continuous function (the log-likelihood). Therefore it is closed in [η, 1 − η] s × R s . Boundedness. By contradiction, suppose that Φ 0 is unbounded, then there exists a sequence (φ l ) l which tends to infinity. Since λ l ∈ [η, 1 − η], then either of µ l 1 or µ l 2 tends to infinity. Suppose that both µ l 1 and µ l 2 tend to infinity, we then have J(φ l ) → −∞. Any finite initialization φ 0 will imply that J(φ 0 ) > −∞ so that ∀φ ∈ Φ 0 , J(φ) ≥ J(φ 0 ) > −∞. Thus, it is impossible for both µ l 1 and µ l 2 to go to infinity. Suppose that µ l 1 → ∞, and that µ l 2 converges 6 to µ2. The limit of the likelihood has the form: which is bounded by its value for λ = 0 and µ 2 = 1 n n i=1 y i . Indeed, since 1 − λ ≤ 1, we have: The right hand side of this inequality is the likelihood of a Gaussian model N (µ 2 , 0), so that it is maximized when µ 2 = 1 n n i=1 y i . Thus, if φ 0 is chosen in a way that J(φ 0 ) > J 0, ∞, 1 n n i=1 y i , the case when µ 1 tends to infinity and µ 2 is bounded would never be allowed. For the other case 6 Normally, µ l 2 is bounded; still, we can extract a subsequence which converges.
where µ 2 → ∞ and µ 1 is bounded, we choose φ 0 in a way that J(φ 0 ) > J 1, 1 n n i=1 y i , ∞ . In conclusion, with a choice of φ 0 such that: the set Φ 0 is bounded. This condition on φ 0 is very natural and means that we need to begin at a point at least better than the extreme cases where we only have one component in the mixture. This can be easily verified by choosing a random vector φ 0 , and calculate the corresponding log-likelihood value. If J(φ 0 ) does not verify the previous condition, we draw again another random vector until satisfaction. (22) the sequence (J(φ k )) k converges and there exists a subsequence (φ N (k) ) which converges to a stationary point of the likelihood function. Moreover, every limit point of the sequence (φ k ) k is a stationary point of the likelihood.

Conclusion 3. Using Propositions 4 and 1, under condition
Assumption A3 is not fulfilled (this part applies for all aforementioned situations). As mentioned in the paper of [Tseng, 2004], for the two Gaussian mixture Example, by changing µ 1 and µ 2 by the same amount and suitably adjusting λ, the value of h i (x|φ) would be unchanged. We explore this more thoroughly by writing the corresponding equations. Let's suppose, by absurd, that for distinct φ and φ we have D ψ (φ|φ ) = 0. By Definition of D ψ , it is given by a sum of non negative terms, which implies that all terms need to be equal to zero. The following lines are equivalent ∀i ∈ {1, · · · , n}: Looking at this set of n equations as an equality of two polynomials on y of degree 1 at n points, we deduce that as we dispose of two distinct observations, say, y 1 and y 2 , the two polynomials need to have the same coefficients. Thus the set of n equations is equivalent to the following two equations: These two equations with three variables have an infinite number of solutions. Take for example µ 1 = 0, µ 2 = 1, λ = 2 3 , µ 1 = 1 2 , µ 2 = 3 2 , λ = 1 2 .
Remark 1. The previous conclusion can be extended to any two-component mixture of exponential families having the form: One may write the corresponding n equations. The polynomial of y i has a degree of at most max(m 1 , m 2 ). Thus, if one disposes of max(m 1 , m 2 )+1 distinct observations, the two polynomials will have the same set of coefficients. Finally, if (θ 1 , θ 2 ) ∈ R d−1 with d > max(m 1 , m 2 ), then assumption A3 does not hold.
Unfortunately, we have no an information about the difference between consecutive terms φ k+1 − φ k except for the case of ψ(t) = ϕ(t) = − log(t) + t − 1 which corresponds to the classical EM recurrence: . Tseng Tseng [2004] has shown that we can prove directly that φ k+1 − φ k converges to 0.

Simulation Study
We summarize the results of 100 experiments on 100-samples by giving the average of the estimates and the error committed, and the corresponding standard deviation. The criterion error is the total variation distance (TVD) which is calculated using the L1 distance. Indeed, the Scheffé lemma (see Meister [2009] page 129.) states that: The TVD gives a measure of the maximum error we may commit when we use the estimated model in lieu of the true distribution. We consider the Hellinger divergence for estimators based on ϕ−divergences which corresponds to ϕ(t) = 1 2 ( √ t − 1) 2 . Our preference of the Hellinger divergence is that we hope to obtain robust estimators without loss of efficiency, see Jiménz and Shao [2001]. D ψ is calculated with ψ(t) = 1 2 ( √ t − 1) 2 . The kernel-based MDϕDE is calculated using the Gaussian kernel, and the window is calculated using Silverman's rule. We included in the comparison the minimum density power divergence (MDPD) of Basu et al. [1998]. The estimator is defined by:φ where a ∈ (0, 1]. This is a Bregman divergence and is known to have good efficiency and robustness for a good choice of the tradeoff parameter. According to the simulation results in Al Mohamad [2016], the value of a = 0.5 seems to give a good tradeoff between robustness against outliers and a good performance under the model. Notice that the MDPD coincides with MLE when a tends to zero. Thus, our methodology presented here in this article, is applicable on this estimator and the proximal-point algorithm can be used to calculate the MDPD. The proximal term will be kept the same, i.e ψ(t) = 1 2 ( √ t − 1) 2 . Simulations from two mixture models are given below; a Gaussian mixture and a Weibull mixture. The MLE for both mixtures was calculated using the EM algorithm. Optimizations were carried out using the Nelder-Mead algorithm Nelder and Mead [1965] under the statistical tool R R Core Team [2013]. Numerical integrations in the Gaussian mixture were calculated using the distrExIntegrate function of package distrEx. It is a slight modification of the standard function integrate. It performs a Gauss-Legendre quadrature when function integrate returns an error. In the Weibull mixture, we used the integral function from package pracma 7 . Although being slow, it performs better than other functions even if the integrand has a relatively bad behavior.

The two-component Gaussian mixture revisited
We consider the Gaussian mixture (19) presented earlier with true parameters λ = 0.35, µ 1 = −2, µ 2 = 1.5 and known variances equal to 1. Contamination was done by adding in the original sample to the 5 lowest values random observations from the uniform distribution U[−5, −2]. We also added to the 5 largest values random observations from the uniform distribution U[2, 5]. Results are summarized in Table (1). The EM algorithm was initialized according to condition (22). This condition gave good results when we are under the model whereas it did not result always in good estimates (the proportion converged towards 0 or 1) when outliers were added, and thus the EM algorithm was reinitialized manually. Figure (1) shows the values of the estimated divergence for both formulas (2) and (3)   In what concerns our simulation results. The total variation of all four estimation methods is very close when we are under the model. When we added outliers, the classical MDϕDE was as sensitive as the maximum likelihood estimator. The error was doubled. Both the kernel-based MDϕDE and the MDPD are clearly robust since the total variation of these estimators under contamination has slightly increased.

The two-component Weibull mixture model
We consider a two-component Weibull mixture with unknown shapes ν 1 = 1.2, ν 2 = 2 and a proportion λ = 0.35. The scales are known an equal to σ 1 = 0.5, σ 2 = 2. The desity function is given by: Contamination was done by replacing 10 observations of each sample chosen randomly by 10 i.i.d. observations drawn from a Weibull distribution with shape ν = 0.9 and scale σ = 3. Results are summarized in Table (2). Notice that it would have been better to use asymmetric kernels in order to build the kernel-based MDϕDE since their use in the context of positive-supported distributions is advised in order to reduce the bias at zero, see Al Mohamad [2016] for a detailed comparison with symmetric kernels. This is not however the goal of this paper, and besides, the use of symmetric kernels in this mixture model gave satisfactory results.
Simulations results in table 2 confirms once more the validity of our proximal-point algorithm and the clear robustness of both the kernel-based MDϕDE and the MDPD.

Conclusions and perspectives
We introduced in this paper a proximal-point algorithm which permit to calculate divergencebased estimators. We studied the theoretical convergence of the algorithm and verified it on a two-component Gaussian mixture. We performed several simulations which confirmed that the algorithm works and is a way to calculate divergence-based estimators. We also applied our proximal algorithm on a Bregman divergence estimator (the MDPD), and the algorithm succeeded to produce the MDPD. Further investigations about the role of the proximal term are needed and may be considered in a future work.