Deterministic Approximate EM Algorithm; Application to the Riemann Approximation EM and the Tempered EM

The Expectation Maximisation (EM) algorithm is widely used to optimise non-convex likelihood functions with latent variables. Many authors have modified its simple design to fit more specific situations. For instance, the Expectation (E) step has been replaced by Monte Carlo (MC), Markov Chain Monte Carlo or tempered approximations, among others. Most of the well-studied approximations belong to the stochastic class. By comparison, the literature is lacking when it comes to deterministic approximations. In this paper, we introduce a theoretical framework, with state-of-the-art convergence guarantees, for any deterministic approximation of the E step. We analyse theoretically and empirically several approximations that fit into this framework. First, for intractable E steps, we introduce a deterministic version of MC-EM using Riemann sums: a straightforward method, requiring no hyper-parameter fine-tuning, that is useful when the dimensionality is low enough that an MC-EM is unwarranted. Then, we consider the tempered approximation, borrowed from the Simulated Annealing literature and used to escape local extrema. We prove that the tempered EM verifies the convergence guarantees for a wider range of temperature profiles than previously considered. We showcase empirically how new non-trivial profiles can more successfully escape adversarial initialisations. Finally, we combine the Riemann and tempered approximations into a method that accomplishes both their purposes.


Introduction
The Expectation Maximisation (EM) algorithm was introduced by Dempster, Laird and Rubin (DLR, Dempster et al. (1977)) to maximise likelihood functions g(θ), defined from inherent latent variables z, that are non-convex and have intricate gradients and Hessians. The EM algorithm estimates θ in an iterative fashion, starting from an initial value θ_0, where the new estimate θ_{n+1} at step n+1 is a function of the estimate θ_n from the previous step n. In addition to presenting the method, DLR provide convergence guarantees on the sequence of estimated parameters {θ_n}_n, namely that it converges towards a critical point of the likelihood function as the step n of the procedure grows. Their flawed proof of this result was later corrected by Wu (1983), and further convergence guarantees were studied by Boyles (1983). Since some likelihood functions are too complex to apply DLR's raw version of the EM, later works have proposed alternative versions, usually with new convergence guarantees. On the one hand, when the maximisation step (M step) is problematic, other optimisation methods, such as coordinate descent (Wu, 1983) or gradient-based updates (Lange, 1995), have been substituted. On the other hand, when the E step is intractable, stochastic versions, such as Monte Carlo and Markov Chain Monte Carlo variants, have been introduced to approximate intractable conditional probability functions. The few purely deterministic approximations proposed, such as the tempered/annealed EM, are used for other purposes, improving the optimisation procedure, and lack convergence guarantees.
In this paper, we prove that, with mild model assumptions and regularity conditions on the approximation, any Deterministic Approximate EM benefits from the state-of-the-art theoretical convergence guarantees of Wu (1983); Lange (1995); Delyon et al. (1999). This theorem covers several existing methods, such as the tempered EM, and paves the way for the exploration of new ideas. To illustrate this, we study a small selection of methods with practical applications that verify our theorem's hypotheses. First, for E steps without closed form, we propose to use Riemann sums to estimate the intractable normalising factor. This "Riemann approximation EM" is a deterministic, less parametric, alternative to the MC-EM and its variants. It is suited to small-dimension cases, where the analytic estimation of the intractable integral is manageable and MC estimation unnecessary. Second, we prove that the deterministic annealed EM (or "tempered EM") of Ueda and Nakano (1998) is a member of our general deterministic class as well. We prove that the convergence guarantees are achieved with almost no condition on the temperature scheme, justifying the use of a wider range of temperature profiles than those proposed by Ueda and Nakano (1998) and Naim and Gildea (2012). Finally, since the Riemann and tempered approximations are two separate methods that fulfil very different practical purposes, we also propose to associate the two approximations in the "tempered Riemann approximation EM" when both their benefits are desired.
In Section 2.1, we introduce our general class of deterministic approximated versions of the EM algorithm and prove their convergence guarantees, for models of the exponential family. We discuss the "Riemann approximation EM" in Section 2.2, the "tempered EM" in Section 2.3, and their association, "tempered Riemann approximation EM", in Section 2.4. In Section 3, we study empirically each of these three methods. Section 4 discusses and concludes this work.
We demonstrate empirically that the Riemann EM converges properly on a model with an intractable E step, and that adding the tempering to the Riemann approximation additionally allows the procedure to escape the initialisation and recover the true parameters. On a tractable Gaussian Mixture Model, we compare the behaviours and performances of the tempered EM and the regular EM. In particular, we illustrate that the tempered EM is able to escape adversarial initialisations and consistently reaches better values of the likelihood than the unmodified EM, in addition to better estimating the model parameters.

Context and Motivation
In this section, we propose a new class of deterministic EM algorithms with an approximated E step. This class of algorithms is very general. In particular, it includes methods with deterministic approximations of an intractable normalisation constant. It also includes optimisation-oriented approximations, such as the annealing approximation used to escape sub-optimal local extrema of the objective. In addition, combinations of these two types of approximations are also covered. We prove that members of this class benefit from the same convergence guarantees that can be found in the state-of-the-art references (Wu, 1983; Lange, 1995; Delyon et al., 1999) for the classical EM algorithm, under similar model assumptions. The only condition on the approximated distribution is that it converges towards the real conditional probability distribution with an l_2 regularity. Like the authors of Delyon et al. (1999); Fort and Moulines (2003); Allassonnière and Chevallier (2021), we work with probability density functions belonging to the exponential family, the specific properties of which are provided in hypothesis M1 of Theorem 1. EM algorithms are most often considered in the context of the missing data problem, see Delyon et al. (1999). The formulation of the problem is the following: we observe a random variable x, described by a parametric probability density function (pdf) noted g(θ) with parameter θ ∈ Θ ⊂ R^l, l ∈ N*. We assume that there exists a hidden variable z informing the behaviour of the observed variable x, such that g(θ) can be expressed as the integral of the complete likelihood h(z; θ): g(θ) = ∫_z h(z; θ)µ(dz), with µ the reference measure. We denote p_θ(z) := h(z; θ)/g(θ) the conditional density function of z. For a given sample x, the goal is to maximise the likelihood g(θ) with respect to θ.
In all these notations, and throughout the rest of the paper, we omit x as a variable since it is considered fixed once and for all, and everything is done conditionally on x. In particular, in many applications, the observed data x is made of N ∈ N* samples: x := (x^{(1)}, ..., x^{(N)}). In such cases, the sample size N < +∞ is supposed finite and fixed once and for all. We are never in the asymptotic regime N → +∞.
The foundation of the EM algorithm is that while ln g(θ) is hard to maximise in θ, the functions θ → ln h(z; θ) and even θ → E_z[ln h(z; θ)] are easier to work with, thanks to the information added by the latent variable z (or its distribution). In practice, however, the actual value of z is unknown and its distribution p_θ(z) depends on θ. Hence, the EM was introduced by DLR (Dempster et al. (1977)) as the two-stage procedure starting from an initial point θ_0 and iterated over the step number n: (E) with the current parameter θ_n, calculate the conditional probability p_{θ_n}(z); (M) update the parameter as θ_{n+1} := argmax_θ E_{z∼p_{θ_n}}[ln h(z; θ)]. This can be summarised as:
θ_{n+1} = T(θ_n), (1)
where we call T the point-to-point map in Θ corresponding to one EM step. We will not redo the basic theory of the exact EM here, but this procedure noticeably increases g(θ_n) at each new step n. However, as discussed in the introduction, in many cases one may prefer, or need, to use an approximation of p_{θ_n}(z) instead of the exact analytical value. In the following, we consider a deterministic approximation of p_θ(z), noted p̃_{θ,n}(z), which depends on the current step n and on which we make no assumption at the moment. The resulting steps, defining the "Approximate EM", can be written in the same form as (1):
θ_{n+1} = F_n(θ_n), (2)
where {F_n}_{n∈N} is the sequence of point-to-point maps in Θ associated with the sequence of approximations {p̃_{θ,n}(z)}_{θ∈Θ, n∈N}. In order to ensure the desired convergence guarantees, we add a slight modification to this EM sequence: re-initialisation of the EM sequence onto increasing compact sets. This modification was introduced by Chen et al. (1987) and adapted by Fort and Moulines (2003). Assume that we have an increasing sequence of compacts {K_n}_{n∈N} such that ∪_{n∈N} K_n = Θ and θ_0 ∈ K_0. Define j_0 := 0. Then, the transition θ_{n+1} = F_n(θ_n) is accepted only if F_n(θ_n) belongs to the current compact K_{j_n}; otherwise the sequence is re-initialised at θ_0.
The steps of the resulting algorithm, called Stable Approximate EM, can be written as:
if F_n(θ_n) ∈ K_{j_n}, then θ_{n+1} = F_n(θ_n) and j_{n+1} := j_n;
if F_n(θ_n) ∉ K_{j_n}, then θ_{n+1} = θ_0 and j_{n+1} := j_n + 1. (3)
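The truncated update above can be sketched in a few lines of code. Everything below is an illustrative placeholder: `F` stands for an arbitrary approximate EM map F_n, `in_compact` for membership in the compact K_j, and the toy contraction is not one of the paper's models.

```python
def stable_approximate_em(theta0, F, in_compact, n_steps=100):
    """Stable Approximate EM: accept F_n(theta_n) only if it stays inside the
    current compact K_{j_n}; otherwise restart from theta0 and grow j.

    F(n, theta)          -- hypothetical approximate EM map F_n
    in_compact(j, theta) -- membership test for the compact K_j
    """
    theta, j = theta0, 0
    for n in range(n_steps):
        cand = F(n, theta)
        if in_compact(j, cand):
            theta = cand                  # accepted transition: j_{n+1} = j_n
        else:
            theta, j = theta0, j + 1      # re-initialise: j_{n+1} = j_n + 1
    return theta, j

# Toy illustration: F_n contracts towards 2.0; compacts K_j = [-10(j+1), 10(j+1)].
theta_hat, j_final = stable_approximate_em(
    theta0=0.0,
    F=lambda n, t: t + 0.5 * (2.0 - t),
    in_compact=lambda j, t: abs(t) <= 10.0 * (j + 1),
)
```

In this well-behaved toy run the sequence never leaves K_0, so no re-initialisation occurs, which matches the remark below that the truncation is mostly a theoretical device.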
This re-initialisation of the EM sequence may seem like a hurdle; however, this truncation is only a theoretical requirement. In practice, the first compact K_0 is taken so large that it covers the most probable areas of Θ, and the algorithms (2) and (3) are identical as long as the sequence {θ_n}_n does not diverge towards the border of Θ.

Theorem
In the following, we state the convergence Theorem for the Stable Approximate EM (3) and provide a brief description of the main steps of the proof.
• The conditions on the approximation. Assume that p̃_{θ,n}(z) is deterministic. Let S(z) = {S_u(z)}_{u=1}^q. For all indices u and for any compact set K ⊆ Θ, one of the two following configurations (conditions (4) or (5) below) holds. Then: (i) (a) lim_{n→∞} j_n < ∞ and sup_{n∈N} ‖θ_n‖ < ∞, with probability 1; (b) g(θ_n) converges towards a connected component of g(L).
(ii) Let Cl : P(Θ) → P(Θ) be the set-closure function and d : Θ × P(Θ) → R_+ be any point-to-set distance function within Θ. If g(L ∩ Cl({θ_n}_{n∈N})) has an empty interior, then ∃ g* ∈ R such that {g(θ_n)}_{n∈N} converges towards g* and d(θ_n, L_{g*} ∩ Cl({θ_n}_n)) → 0.
Remark 1.
• The M1-3 conditions in Theorem 1 are almost identical to the similarly named M1-3 conditions in Fort and Moulines (2003). The only difference is in M2(a*), where we remove the hypothesis that S has to be a continuous function of z, since it is not needed when the approximation is not stochastic.
• The property ∫_z S_u^2(z)dz < ∞ of condition (4) can seem hard to verify, since S is not integrated here against a probability density function. However, when z is a finite variable, as is the case for finite mixtures, this integral becomes a finite sum. Hence, condition (4) is better adapted to finite mixtures, while condition (5) is better adapted to continuous hidden variables.
• The two sufficient conditions (4) and (5) involve a certain form of l_2 convergence of p̃_{θ,n} towards p_θ.
If the latent variable z is continuous, this excludes countable (and finite) approximations such as sums of Dirac functions, since they have a measure of zero. In particular, this excludes Quasi-Monte Carlo approximations. However, one look at the proof of Theorem 1 (in Appendix A.1), or at the following sketch of proof, reveals that it is actually sufficient to verify sup_{θ∈K} ‖S̄_n(θ) − S̄(θ)‖ →_{n→∞} 0 for any compact set K, where S̄_n(θ) := ∫_z S(z)p̃_{θ,n}(z)µ(dz) denotes the approximated E step in the Stable Approximate EM. This condition can be verified by finite approximations.
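The weakened sup-norm condition above can be checked numerically on a toy example. The model below is purely illustrative (an assumption, not one of the paper's models): p_θ(z) ∝ e^{θz} on [0, 1] with statistic S(z) = z, approximated by a finite (Dirac-comb) distribution on n grid points, exactly the kind of finite approximation the remark allows.

```python
import numpy as np

def sbar_exact(theta):
    """Exact S-bar(theta) = E_{p_theta}[S(z)] for S(z) = z and
    p_theta(z) proportional to exp(theta * z) on [0, 1] (closed-form mean)."""
    if abs(theta) < 1e-12:
        return 0.5
    return 1.0 / (1.0 - np.exp(-theta)) - 1.0 / theta

def sbar_finite(theta, n):
    """S-bar_n(theta) under a finite Dirac-comb approximation of p_theta
    supported on n midpoints of [0, 1]."""
    grid = (np.arange(n) + 0.5) / n
    w = np.exp(theta * grid)
    w /= w.sum()                       # renormalise the finite weights
    return float((grid * w).sum())

# sup over a compact K = [-3, 3] of |S-bar_n - S-bar|, for two values of n
K = np.linspace(-3.0, 3.0, 13)
sup_err = lambda n: max(abs(sbar_finite(t, n) - sbar_exact(t)) for t in K)
# sup_err(1000) is far smaller than sup_err(10): the sup condition holds
```

Even though this Dirac-comb approximation fails the l_2 conditions (4)-(5), the supremum of ‖S̄_n − S̄‖ over the compact still vanishes as n grows, as the remark claims.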

Sketch of Proof
The detailed proof of this result can be found in Appendix A.1; we propose here an abbreviated version highlighting the key steps. Two intermediary propositions, introduced and proven in Fort and Moulines (2003), are instrumental in the proof of Theorem 1. These two propositions are called Propositions 9 and 11 by Fort and Moulines, and are used to prove their Theorem 3. Within our framework, with the new condition M2(a*) and the absence of a Monte Carlo sum, the reasoning for verifying the applicability conditions of the two propositions is quite different from Fort and Moulines (2003) and is highlighted below. Let us state these two propositions using the notations of this paper: Proposition 2 ("Proposition 9"). Consider a parameter space Θ ⊆ R^l, a point-to-point map T on Θ, a compact K ⊂ Θ and a subset L ⊆ Θ such that L ∩ K is compact. Let g be a C^0, Lyapunov function relative to (T, L). Assume that there exists a K-valued sequence {θ_n}_{n∈N*} ∈ K^N such that:
Proposition 3 ("Proposition 11"). Consider a parameter space Θ ⊆ R^l, a subset L ⊆ Θ, a point-to-point map T on Θ and a sequence of point-to-point maps {F_n}_{n∈N*}, also on Θ. Let {θ_n}_{n∈N} ∈ Θ^N be a sequence defined from {F_n}_n by the Stable Approximate EM equation (3) for some increasing sequence {K_n}_{n∈N} of compacts of Θ. Let {j_n}_{n∈N} be the corresponding sequence of indices, also defined in (3), such that ∀n, θ_n ∈ K_{j_n}. We assume: (a) the C1-2 conditions of Proposition 10 of Fort and Moulines (2003).
Remark 2. In Fort and Moulines (2003), condition C1 of Proposition 3 is mistakenly written as: This is a typo that we have corrected here.
The proof of Theorem 1 is structured as follows: verifying the conditions of Proposition 3, applying Proposition 3, verifying the conditions of Proposition 2 and finally applying Proposition 2.
Verifying the conditions of Proposition 3. Let g be the observed likelihood function defined in hypothesis M1 of Theorem 1, T the exact EM point-to-point map defined in (1), L := {θ ∈ Θ | T(θ) = θ} the set of its stable points, and {F_n}_n a sequence of approximated point-to-point maps as defined in (2). With some increasing sequence {K_n}_{n∈N} of compacts of Θ, let {θ_n, j_n}_{n∈N} be defined by the Stable Approximate EM equation (3).
By design of the regular EM, the following two properties are true. First, g is a C^0, Lyapunov function relative to (T, L). Second, the map T can be written T := θ̂ ∘ S̄, with θ̂ and S̄ defined in condition M2 of Theorem 1. These properties, in conjunction with hypotheses M1-3 of Theorem 1, directly imply that condition (a) of Proposition 3 is verified.
The steps to verify conditions (b) and (c) differ from those in the proof of Fort and Moulines (2003). Let S̄_n(θ) = ∫_z S(z)p̃_{θ,n}(z)µ(dz) be the approximated E step in the Stable Approximate EM, such that F_n = θ̂ ∘ S̄_n. Using uniform-continuity properties on compacts, we prove that the following condition is sufficient to verify both (b) and (c): sup_{θ∈K} ‖S̄_n(θ) − S̄(θ)‖ →_{n→∞} 0 for any compact K ⊆ Θ (condition (6)). Then, upon replacing S̄_n and S̄ by their integral forms, it becomes clear that each of the hypotheses (4) or (5) of Theorem 1 is sufficient to verify (6). In the end, all conditions are verified and Proposition 3 can be applied.
Applying Proposition 3. The application of Proposition 3 to the Stable Approximate EM tells us that, with probability 1: lim sup_{n→∞} j_n < ∞ and {θ_n}_{n∈N} is a compact sequence.
This is precisely the result (i)(a) of Theorem 1.
Verifying the conditions of Proposition 2. We already have that the likelihood g is a C^0, Lyapunov function relative to (T, L). Thanks to Proposition 3, we have that K := Cl({θ_n}_n) is compact. Then, L ∩ K is also compact thanks to hypothesis M3. Moreover, by definition, the EM sequence verifies {θ_n}_n ∈ K^N. The last condition that remains to be shown in order to apply Proposition 2 is the following: if we apply (c) of Proposition 3 with K = Cl({θ_n}_n), we get an almost identical result, but with θ_{n+1} replaced by F_n(θ_n). However, F_n(θ_n) ≠ θ_{n+1} only when j_{n+1} = j_n + 1. We have proven with Proposition 3 that there is only a finite number of such increments. Hence, when n is large enough, F_n(θ_n) = θ_{n+1} always, and we have the desired result.
Applying Proposition 2. Since all its conditions are verified, we can apply Proposition 2 and obtain the following conclusions: • {g(θ_n)}_{n∈N} converges towards a connected component of g(L ∩ Cl({θ_n}_n)) ⊂ g(L).
• If g(L ∩ Cl({θ_n}_n)) has an empty interior, then the sequence {g(θ_n)}_{n∈N} converges towards a g* ∈ R and {θ_n}_n converges towards the set L_{g*} ∩ Cl({θ_n}_n) ⊆ L_{g*}.
These are respectively exactly (i)(b) and (ii) of Theorem 1, which concludes the proof.

Context and Motivation
In this section, we introduce one specific case of Approximate EM useful in practice: approximating the conditional probability density function p_θ(z) at the E step by a Riemann sum, in the scenario where the latent variable z is continuous and bounded. We call this procedure the "Riemann approximation EM". After motivating this approach, we prove that it is an instance of the Approximate EM algorithm and verifies the hypotheses of Theorem 1, and therefore benefits from its convergence guarantees.
Consider the case where z is a continuous variable and its conditional probability p_θ(z) is a continuous function. Even when h(z; θ) can be computed point by point, a closed form may not exist for the renormalisation term g(θ) = ∫_z h(z; θ)dz. In that case, this integral is usually approximated stochastically with a Monte Carlo estimation, see for instance Delyon et al. (1999); Fort and Moulines (2003); Allassonnière and Chevallier (2021). When the dimension is reasonably small, a deterministic approximation through Riemann sums can also be performed. For the user, this can be a welcome simplification, since MCMC methods have a high hyper-parameter complexity (burn-in duration, gain decrease sequence, Gibbs sampler, etc.), whereas the Riemann approximation involves only the position of the Riemann intervals. The choice of these positions is well guided by the classical theory of numerical integration (Lagrange, Legendre, etc.), and is demonstrated in our experiments to have little impact.
We introduce the Riemann approximation as a member of the Approximate EM class. Since z is assumed bounded in this section, without loss of generality, we take z to be a real variable with z ∈ [0, 1]. We recall that p_θ(z) = h(z; θ)/g(θ) = h(z; θ)/∫_z h(z; θ)dz. Instead of using the exact joint likelihood h(z; θ), we define a sequence of step functions {h̃_n}_{n∈N*} as: h̃_n(z; θ) := h(⌊ϕ(n)z⌋/ϕ(n); θ), where ϕ is an increasing function from N* to N* such that ϕ(n) →_{n→∞} ∞. For the sake of simplicity, we take ϕ = Id, hence h̃_n(z; θ) = h(⌊nz⌋/n; θ). The following result, however, applies to any increasing function ϕ with ϕ(n) →_{n→∞} ∞.
With these step functions, the renormalising factor g̃_n(θ) := ∫_z h̃_n(z; θ)dz is now a finite sum: g̃_n(θ) = (1/n) Σ_{k=0}^{n−1} h(k/n; θ). The approximate conditional probability is then naturally defined as: p̃_{n,θ}(z) := h̃_n(z; θ)/g̃_n(θ). Thanks to the replacement of the integral by the finite sum, this deterministic approximation is much easier to compute than the real conditional probability.
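The Riemann-approximated E step above amounts to a few array operations. The joint likelihood `h` below (proportional to e^{θz} on [0, 1]) is an illustrative assumption used only to have a closed-form normaliser to compare against.

```python
import numpy as np

def riemann_conditional(h, theta, n):
    """Riemann approximation of p_theta(z) = h(z; theta) / g(theta) on [0, 1].

    Evaluates the step function h-tilde_n on the grid of left endpoints k/n,
    and the normaliser g-tilde_n(theta) = (1/n) * sum_k h(k/n; theta).
    `h` is a hypothetical point-wise-computable joint likelihood.
    """
    grid = np.arange(n) / n                  # left endpoints k/n
    h_vals = h(grid, theta)
    g_n = h_vals.mean()                      # (1/n) * sum h(k/n; theta)
    return h_vals / g_n, g_n                 # p-tilde_{n,theta} on grid, g-tilde_n

# Toy joint likelihood h(z; theta) = exp(theta * z) on [0, 1]
h = lambda z, th: np.exp(th * z)
p_tilde, g_n = riemann_conditional(h, theta=1.0, n=10_000)
# g-tilde_n approaches the exact normaliser: integral of e^z over [0,1] = e - 1
```

By construction the discrete approximation renormalises exactly: the grid average of p̃ equals 1, mirroring ∫_z p̃_{n,θ}(z)dz = 1.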

Theorem and Proof
We state and prove the following Theorem for the convergence of the EM with a Riemann approximation.
Theorem 4. We call "Riemann approximation EM" the associated Stable Approximate EM. Under conditions M1-3 of Theorem 1, if z → S(z) is continuous, then the conclusions of Theorem 1 hold for the Riemann approximation EM.
Proof. Under the current assumptions, it is sufficient to verify condition (4) in order to apply Theorem 1. Without loss of generality, we assume that the bounded variable z is contained in [0, 1]. Since S is continuous, the first part of the condition is easily verified. For the second part of the condition, we consider a compact K ⊆ Θ. First, note that h(z; θ) = exp(ψ(θ) + ⟨S(z), φ(θ)⟩) is continuous in (z, θ), hence uniformly continuous on the compact set [0, 1] × K. Additionally, we have bounds m ≤ h(z; θ) ≤ M, where m and M are constants independent of z and θ. This also means that m ≤ g̃_n(θ) ≤ M. By definition, |⌊nz⌋/n − z| ≤ 1/n. Hence, ∃N ∈ N, ∀n ≥ N, |⌊nz⌋/n − z| ≤ δ. As a consequence, {h̃_n}_n converges uniformly towards h. Let us now fix ε and the corresponding N.
Hence, for this ε, the corresponding bound holds ∀n ≥ N, which, by definition, means that condition (4) is fully verified; the conclusions of Theorem 1 are then applicable, which concludes the proof.

Context and Motivation
In this section, we consider another particular case of Deterministic Approximate EM: the Tempered EM (or "tmp-EM"), first introduced in Ueda and Nakano (1998). We first prove that, under mild conditions, tmp-EM verifies the hypotheses of Theorem 1, hence benefits from the state-of-the-art EM convergence guarantees. In particular, we prove that the choice of the temperature profile is almost completely free. This justifies the use of a wider array of temperature profiles than the ones specified in Ueda and Nakano (1998); Naim and Gildea (2012). Then, we demonstrate the interest of tmp-EM with non-monotonous profiles, on mixture estimation problems with hard-to-escape initialisations.
Tempering or annealing is a common technique in the world of non-convex optimisation. With a non-convex objective function g, naively following the gradients usually leads to undesirable local extrema close to the initial point. A common remedy is to raise g to the power 1/T_n, with {T_n}_{n∈N} a sequence of temperatures tending towards 1 as the number n of steps of the procedure increases. When T_n > 1, the new objective function g^{1/T_n} is flattened, making the gradients weaker and the potential wells less attractive, but without changing the hierarchy of the values. This allows the optimisation procedure to explore more before converging. This concept is applied in many state-of-the-art procedures, the most iconic probably being Simulated Annealing, introduced and developed in Kirkpatrick et al. (1983); Van Laarhoven and Aarts (1987); Aarts and Korst (1988), where in particular T_n → 0 instead of 1. It is one of the few optimisation techniques proven to find the global optimum of non-convex functions. Parallel Tempering (or Annealing MCMC), developed in Swendsen and Wang (1986); Geyer and Thompson (1995); Hukushima and Nemoto (1996), also makes use of these ideas to improve the MCMC simulation of a target probability distribution.
Since non-convex objectives are a common occurrence in the EM literature, Ueda and Nakano (1998) introduced the Deterministic Annealed EM where, in the E step, the conditional probability is replaced by a tempered approximated distribution: p̃_{n,θ}(z) ∝ p_θ(z)^{1/T_n} (renormalised such that ∫_z p̃_{n,θ}(z)dz = 1). For the temperature profile, they consider specific sequences {T_n}_{n∈N} decreasingly converging to 1. Another specific, non-monotonous temperature scheme was later proposed by Naim and Gildea (2012). In both cases, theoretical convergence guarantees are lacking. Later, in Allassonnière and Chevallier (2021), tempering was applied to the SAEM, with convergence guarantees provided for any temperature scheme T_n → 1.
With Theorem 5, we show that the Deterministic Annealed EM, or tmp-EM, is a specific case of the Deterministic Approximate EM of Section 2.1, and hence can benefit from the same convergence properties. In particular, we show that any temperature scheme {T_n}_n ∈ (R*)^N with T_n →_{n→∞} 1 guarantees the state-of-the-art convergence.
Remark 3. Elevating p_θ(z) to the power 1/T_n, as is done here and in Ueda and Nakano (1998); Naim and Gildea (2012), is not equivalent to elevating the objective function g(θ) to the power 1/T_n, which would be expected from a typical annealed or tempered optimisation procedure. It is not equivalent either to elevating to the power 1/T_n the proxy function E_{z∼p_{θ_n}(z)}[ln h(z; θ)] that is optimised in the M step. Instead, the weights p_{θ_n}(z) (or equivalently, the terms h(z; θ_n)) used in the calculation of E_{z∼p_{θ_n}(z)}[ln h(z; θ)] are the tempered terms. This still results in the desired behaviour and is only a more "structured" tempering. Indeed, with this tempering, it is the estimated distribution of the latent variable z that is made less unequivocal, with weaker modes, at each step. This forces the procedure to spend more time considering different configurations for those variables, which in turn renders the optimised function E_{z∼p_{θ_n}(z)}[ln h(z; θ)] more ambiguous regarding which θ is best, just as intended. Then, for large n, the algorithm is allowed to converge, as T_n → 1 and E_{z∼p̃_{n,θ}(z)} → E_{z∼p_θ(z)}.
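For a finite mixture, the tempered weights described in this remark are a one-line operation on the E-step responsibilities. The three-component weight vector below is an arbitrary illustration, not taken from the paper's experiments.

```python
import numpy as np

def tempered_responsibilities(log_h, T):
    """Temper mixture responsibilities: raise the conditional weights
    p_theta(z = k) to the power 1/T and renormalise, in the spirit of
    Ueda and Nakano's tempered E step.

    log_h -- array of log h(z = k; theta) for each component k
    T     -- temperature; T > 1 flattens the weights, T = 1 recovers plain EM
    """
    logits = log_h / T
    logits -= logits.max()         # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

log_h = np.log(np.array([0.7, 0.2, 0.1]))
r_hot = tempered_responsibilities(log_h, T=10.0)  # nearly uniform: weaker modes
r_em = tempered_responsibilities(log_h, T=1.0)    # exact responsibilities
```

At high temperature the mode at 0.7 is heavily attenuated, which is exactly the "less unequivocal" latent distribution the remark describes; as T → 1 the plain E step is recovered.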

Theorem
We now provide the convergence Theorem for the Approximate EM with the tempering approximation. In particular, this result highlights that there are almost no constraints on the temperature profile to achieve convergence.
Theorem 5. Let {T_n}_n be a sequence of non-zero real numbers, and consider the tempering approximation introduced in Ueda and Nakano (1998). We call "Tempered EM" the associated Stable Approximate EM. Let B(1, ε) be the closed ball centred at 1 with radius ε ∈ R*_+. Under conditions M1-3 of Theorem 1, if T_n →_{n→∞} 1 and, for any compact K ⊆ Θ, conditions (T1) and (T2) hold for all α ∈ B(1, ε), then the conclusions of Theorem 1 hold for the Tempered EM.
Remark 4. The added condition that (T1) and (T2) must hold for all α in a ball B(1, ε) is very mild. Indeed, in Section 2.3.4, we show classical examples that easily verify the much stronger condition that (T1) and (T2) hold for all α ∈ R*_+.

Sketch of Proof
The detailed proof of Theorem 5 can be found in Appendix A.2; we propose here an abbreviated version highlighting the key steps. Under the current assumptions, it is sufficient to verify the second part of condition (5) in order to apply Theorem 1. To that end, we must control the integral ∫_z (p̃_{θ,n}(z) − p_θ(z))² dz for all θ in a compact K ⊆ Θ. First, with a Taylor development in the term (1/T_n − 1), which converges towards 0 when n → ∞, we control the difference (p̃_{θ,n}(z) − p_θ(z))², where the terms A(θ, T_n), B(θ, T_n) and a(z, θ, T_n) come from the Taylor development. Then, we can control the integral of interest. From the properties of the Taylor development, we prove that A(θ, T_n) and B(θ, T_n) both have upper bounds involving only ∫_z p_θ(z) ln p_θ(z)dz and ∫_z p_θ(z)^{1/T_n} ln p_θ(z)dz. In a similar fashion, the term ∫_z p_θ(z)e^{2a(z,θ,T_n)} ln² p_θ(z)dz has an upper bound involving only ∫_z p_θ(z) ln² p_θ(z)dz and ∫_z p_θ(z)^{2/T_n − 1} ln² p_θ(z)dz. Using the hypotheses of the Theorem, we prove that, for any α ∈ B(1, ε) and θ ∈ K, the terms ∫_z p_θ(z)^α ln p_θ(z)dz and ∫_z p_θ(z)^α ln² p_θ(z)dz are both upper bounded by a constant C independent of θ and α.
Since T_n →_{n→∞} 1, when n is large enough, 1/T_n ∈ B(1, ε) and 2/T_n − 1 ∈ B(1, ε). Hence, the previous result applies to the upper bounds of A(θ, T_n), B(θ, T_n) and ∫_z p_θ(z)e^{2a(z,θ,T_n)} ln² p_θ(z)dz. As a result, these three terms are respectively upper bounded by C_1, C_2 and C_3, three constants independent of θ and T_n.
The inequality (7) then simplifies accordingly. By taking the supremum over θ ∈ K and the limit when n → ∞, we get the desired result. With condition (5) verified, the conclusions of Theorem 1 are applicable, which concludes the proof.

Examples of Models That Verify the Conditions
In this section, we illustrate that the conditions of Theorem 5 are easily met by common models. We take two examples, first the Gaussian Mixture Model (GMM) where the latent variables belong to a finite space, then the Poisson count with random effect, where the latent variables live in a continuous space. As mentioned, these examples are shown to not only verify all hypotheses of Theorem 5, but also to verify (T 1) and (T 2) for any α ∈ R * + .
Gaussian Mixture Model. Despite being one of the most common models the EM is applied to, the GMM has many known irregularities and pathological behaviours, see Titterington et al. (1985); Ho et al. (2016). Although some recent works, such as Dwivedi et al. (2020a,b), tackled the theoretical analysis of EM for the GMM, none of the convergence results for the traditional EM and its variants proposed by Wu (1983); Lange (1995); Delyon et al. (1999); Fort and Moulines (2003) apply to the GMM. The hypothesis that the GMM fails to verify is the condition that the level lines have to be compact (M2(d) in this paper). All is not lost, however, for the GMM: indeed, the model verifies all the other hypotheses of the general Theorem 1 as well as the tempering hypotheses of Theorem 5. Moreover, in this paper as in the others, the only function of the unverified hypothesis M2(d) is to ensure in the proof that the EM sequence stays within a compact. The latter condition is the actual relevant property to guarantee convergence at this stage of the proof. This means that, in practice, if a tempered EM sequence applied to a GMM is observed to remain within a compact, then all the conditions for convergence are met, Theorem 5 applies, and the sequence is guaranteed to converge towards a critical point of the likelihood function.
In the following, we provide more details on the GMM likelihoods and the theorem hypotheses they verify. First, note that the GMM belongs to the exponential family, with the complete likelihood: h(z; θ) = Π_{i=1}^N Π_{k=1}^K (π_k N(x^{(i)}; μ_k, Σ_k))^{1{z^{(i)}=k}}, and the observed likelihood: g(θ) = Π_{i=1}^N Σ_{k=1}^K π_k N(x^{(i)}; μ_k, Σ_k). This is an exponential model with parameter: θ = {(π_k, μ_k, Σ_k)}_{k=1}^K, Σ_k ∈ S_p^{++}, where S_p^{++} is the cone of symmetric positive definite matrices of size p. The verification of conditions M1-3 for the GMM (except M2(d), of course) is a classical exercise, since these are the conditions our Theorem shares with any other EM convergence result on the exponential family. We focus here on the hypotheses specific to our Deterministic Approximate EM.
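The complete likelihood above can be evaluated directly once each sample is assigned a component. The snippet below is a simplified sketch: it assumes spherical covariances Σ_k = sigma2[k]·I for brevity, rather than the general Σ_k ∈ S_p^{++} of the model, and the data values are arbitrary.

```python
import numpy as np

def gmm_log_complete_likelihood(x, z, pi, mu, sigma2):
    """log h(z; theta) for a GMM sketch with spherical covariances
    sigma2[k] * I (a simplifying assumption): sample x[i] is assigned to
    component z[i], with weights pi, means mu, scalar variances sigma2."""
    d = x.shape[1]
    ll = 0.0
    for i, k in enumerate(z):
        diff = x[i] - mu[k]
        ll += (np.log(pi[k])
               - 0.5 * d * np.log(2.0 * np.pi * sigma2[k])
               - 0.5 * diff @ diff / sigma2[k])
    return ll

x = np.array([[0.1, -0.2]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
s2 = np.array([1.0, 1.0])
ll_good = gmm_log_complete_likelihood(x, [0], pi, mu, s2)  # nearby component
ll_bad = gmm_log_complete_likelihood(x, [1], pi, mu, s2)   # distant component
```

Assigning the sample to the nearby component yields a larger complete log-likelihood, which is what the E-step responsibilities quantify.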
Condition on ∫_z p_θ^α(z)dz. Let α ∈ R*_+. In the finite mixture case, the integrals over z are finite sums of terms continuous in θ, hence the condition follows by continuity on any compact. Condition on ∫_z S_u²(z)p_θ^α(z)dz. The previous continuity argument is still valid.
Poisson Count with Random Effect. This model is discussed in Fort and Moulines (2003), where the authors prove, among other things, that it verifies the conditions M1-3. Here is a brief description of the model: the hidden variables {z^{(i)}}_{i=1}^N are distributed according to an autoregressive process of order 1: z^{(i)} := a z^{(i−1)} + σ ε^{(i)}, with |a| < 1, σ > 0, and the {ε^{(i)}}_{i=1}^N iid standard Gaussian. Conditionally on {z^{(i)}}_{i=1}^N, the observed variables x^{(i)} are independent Poisson variables with parameter exp(θ + z^{(i)}). The complete likelihood of the model, not accounting for irrelevant constants, can be written explicitly, and g(θ) = ∫_z h(z; θ)dz can be computed analytically up to a constant, where E_1(0) is a finite, non-zero constant, called the "exponential integral", in particular independent of α and θ.
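The generative model just described is easy to simulate, which is useful for sanity-checking any EM implementation on it. The initialisation convention for z^{(0)} below is an assumption (the paper does not specify one), as are the parameter values.

```python
import numpy as np

def simulate_poisson_count(theta, a, sigma, N, rng):
    """Simulate the Poisson count model with AR(1) random effect:
    z_i = a * z_{i-1} + sigma * eps_i (eps_i iid standard Gaussian),
    x_i ~ Poisson(exp(theta + z_i)) conditionally on z."""
    z = np.empty(N)
    z[0] = sigma * rng.standard_normal()      # z_0 convention: an assumption
    for i in range(1, N):
        z[i] = a * z[i - 1] + sigma * rng.standard_normal()
    x = rng.poisson(np.exp(theta + z))
    return x, z

rng = np.random.default_rng(0)
x, z = simulate_poisson_count(theta=1.0, a=0.5, sigma=0.3, N=500, rng=rng)
```

With |a| < 1 the latent chain is stationary, so the observed counts fluctuate around exp(θ) times a log-normal correction.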
Condition on ∫_z p_θ^α(z)dz. Let K be a compact in Θ. We have p_θ(z) = h(z; θ)/g(θ). Let us compute ∫_z h(z; θ)^α dz for any positive α. The calculations work as in Equation (11). Since E_1(0) is finite, non-zero, and independent of θ, we easily obtain the desired bound: θ does not even have to be restricted to a compact.
Condition on ∫_z S_u²(z)p_θ^α(z)dz. Let K be a compact in Θ and α a positive real number. In this Poisson count model, S(z) = Σ_{i=1}^N e^{z^{(i)}} ∈ R. First, let us prove that the integral ∫_z S²(z)p_θ^α(z)dz is finite for any θ. We introduce the variables u_k := Σ_{l=1}^k e^{z^{(l)}}. The Jacobi matrix of this change of variables is triangular, with determinant Π_k e^{z^{(k)}}. The resulting integral is proportional to the expression obtained after removing the finite constant (1/E_1(0))^{αN} for clarity. This integral is finite for any θ, because the exponential is the dominant term around +∞. Let us now prove that θ → ∫_z S²(z)p_θ^α(z)dz is continuous. From Equation (12), we obtain a dominating function: since we have proven that ∫_z S²(z)p^α_{θ_M}(z)dz < ∞, we can apply the interversion theorem (interchanging limit and integral) and state that θ → ∫_z S²(z)p_θ^α(z)dz is continuous. The desired bound on the compact K directly follows. Note that, after the change of variables, the integral could be computed explicitly, but it involves N successive integrations of polynomial × exponential products of the form P(x)e^{−α e^θ x}. This would get tedious, especially since, after each successful integration, the product with the next integration variable u_{k−1} increases the degree of the polynomial by one, i.e. starting from 3, the degree ends up being N + 2. We chose a faster path.

Context, Theorem and Proof
The Riemann approximation of Section 2.2 makes the EM computations possible in hard cases, for instance when the conditional distribution has no analytical form. It is an alternative to the many stochastic approximation methods (SAEM, MCMC-SAEM, etc.) that are commonly used in those cases. The tempering approximation of Section 2.3 is used to escape the initialisation by allowing the procedure to explore the likelihood profile more before committing to convergence. We showed that both these approximations are particular cases of the wider class of Deterministic Approximate EM, introduced in Section 2.1. However, since they fulfil different purposes, it is natural to use them in coordination rather than as alternatives to one another. In this section, we introduce another instance of the Approximate EM: a combination of the tempered and Riemann sum approximations. This "tempered Riemann approximation EM" (tmp-Riemann approximation) can compute EM steps when there is no closed form, thanks to the Riemann sums, as well as escape the initialisation, thanks to the tempering. For a bounded latent variable z ∈ [0, 1], we define the approximation as: p̄_{n,θ}(z) := h(⌊nz⌋/n; θ)^{1/T_n} / ∫_z h(⌊nz⌋/n; θ)^{1/T_n} dz, for a sequence {T_n}_n ∈ (R*_+)^N with T_n → 1 as n → ∞. In the following Theorem, we prove that the tempered Riemann approximation EM verifies the applicability conditions of Theorem 1 with no additional hypothesis beyond those of the regular Riemann approximation EM covered by Theorem 4. Proof. This proof of Theorem 6 is very similar to the proof of Theorem 4 for the regular Riemann approximation EM. The first common element is that, for the tempered Riemann approximation EM, the only remaining applicability condition of the general Theorem 1 to prove is the same. In the proof of Theorem 4, we proved that the uniform convergence of the approximated complete likelihood {h̄_n}_n towards the real h (with both h̄_n(z; θ) and h(z; θ) uniformly bounded) was sufficient to fulfil this condition.
Hence, in this section, we prove that these sufficient properties still hold with the tempered Riemann approximation, where h̄_n(z; θ) := h(⌊nz⌋/n; θ)^{1/T_n}. We recall that h(z; θ) is uniformly continuous on the compact set [0, 1] × K, and verifies m ≤ h(z; θ) ≤ M, where m and M are constants independent of z and θ.
We have seen in the proof of Theorem 4 how to control the difference between h(⌊nz⌋/n; θ) and h(z; θ). To complete the proof, we control in a similar way the difference between h(⌊nz⌋/n; θ) and h(⌊nz⌋/n; θ)^{1/T_n}. The map (h, T) → h^{1/T} ∈ R is continuous on a compact, hence uniformly continuous in (h, T). As a consequence, with N ∈ N such that ∀n ≥ N, |T_n − 1| ≤ δ, we obtain the desired bound. In other words, we have the uniform convergence of {h̄_n}_n towards h. From there, we conclude following the same steps as in the proof of Theorem 4.

Results
In this section, we describe experiments that explore each of the three studied methods: the Riemann EM, the tempered EM and the tempered Riemann EM.

Application to a Gaussian Model with the Beta Prior
We demonstrate the interest of the method on an example with a continuous bounded latent variable following a Beta distribution, z ∼ Beta(α, 1), and an observed random variable x ∼ N(λz, σ^2). In other words, with ε ∼ N(0, 1) independent of z: x = λz + σε. This results in a likelihood belonging to the exponential family. Since z is bounded, and everything is continuous in the parameter (α, λ, σ^2), this model easily verifies each of the conditions M1-3. The E step with this model involves the integral ∫_z z^α exp(−(x−λz)^2/(2σ^2)) dz, a fractional moment of the Gaussian distribution. Theoretical formulas exist for these moments, see Winkelbauer (2012); however, they involve Kummer's confluent hypergeometric functions, which are infinite series. Instead, we use the Riemann approximation to run the EM algorithm with this model: h̄_n(z; θ) := h(⌊ϕ(n)z⌋/ϕ(n); θ).
As done previously, we take ϕ(n) := n, without loss of generality and for the sake of simplicity. The E step only involves the n different values taken by the step-function probabilities h(⌊nz⌋/n; θ), where the exponent (i) indicates the index of the observation x^(i). To express the corresponding M step compactly, let us define the operator Ψ^(i) : R^[0,1] → R such that, for any f : [0, 1] → R, Ψ^(i)(f) is the corresponding Riemann-approximated conditional expectation. Then, the M step can be expressed in terms of Ψ^(i), where we took the liberty of replacing f by f(z) in these equations for the sake of simplicity. Here N is the total number of observations: x := (x^(1), ..., x^(N)), iid. We test this algorithm on synthetic data. With real values α = 2, λ = 5, σ = 1.5, we generate a dataset with N = 100 observations and run the Riemann EM with random initialisation. This simulation is run 2000 times. We observe that the Riemann EM is indeed able to increase the likelihood, despite the EM being originally intractable. In Figure 1, we display the average trajectory, with standard deviation, of the negative log-likelihood −ln(g(θ)) during the Riemann EM procedure. The profile is indeed decreasing. The standard deviation around the average value is fairly high, since each run involves a different dataset and a different random initialisation, hence a different value of the likelihood, but the decreasing trend is the same for all runs. We also display the average relative square errors on the parameters at the end of the algorithm. They are all small, with reasonably small standard deviations, which indicates that the algorithm consistently recovers the parameters correctly.
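To make the procedure concrete, here is a minimal sketch of the Riemann EM for this Beta-Gaussian model. This is our illustration, not the authors' implementation: the function name `riemann_em` is ours, the grid excludes z = 0 (to avoid the singularity of the Beta density when α < 1), and the closed-form M step is our own derivation from the complete log-likelihood N ln α + (α−1) Σ E[ln z] − Σ E[(x−λz)^2]/(2σ^2) − N ln σ.

```python
import numpy as np

def riemann_em(x, n_grid=100, n_iter=50, theta0=(1.0, 1.0, 1.0)):
    """Riemann-approximate EM for z ~ Beta(alpha, 1), x | z ~ N(lambda*z, sigma^2).

    The E step replaces the intractable integrals over z in [0, 1] by
    Riemann sums on the grid k/n_grid, k = 1..n_grid.
    """
    alpha, lam, sigma = theta0
    z = np.arange(1, n_grid + 1) / n_grid            # Riemann grid on (0, 1]
    for _ in range(n_iter):
        # E step: step-function posterior weights per observation,
        # proportional to h(z_k; theta) on each grid cell
        log_h = ((alpha - 1) * np.log(z)[None, :]
                 - (x[:, None] - lam * z[None, :]) ** 2 / (2 * sigma ** 2))
        w = np.exp(log_h - log_h.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        Ez = w @ z                                    # E[z | x_i]
        Ez2 = w @ z ** 2                              # E[z^2 | x_i]
        Elogz = w @ np.log(z)                         # E[ln z | x_i]
        # M step (closed form for this exponential-family model)
        alpha = -len(x) / Elogz.sum()
        lam = (x * Ez).sum() / Ez2.sum()
        sigma = np.sqrt(np.mean(x ** 2 - 2 * lam * x * Ez + lam ** 2 * Ez2))
    return alpha, lam, sigma
```

Each E step costs O(N × n_grid) evaluations of the step function, with no sampling and no tuning beyond the grid size.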
The results are displayed in Figure 2. We can see that, despite the very different profiles, the optimisations are very similar. The "low" profile performs slightly worse, which indicates that a high number of Riemann intervals is most desirable in practice. As long as this number is high enough, Figure 2 suggests that the performance will not depend too much on the profile.

Application in Two Dimensions
The difficulty faced by Riemann methods in general is their geometric complexity as the dimension increases. In this section, we propose a similar experiment in two dimensions to show that the method is still functional and practical in that case.
Figure 2: (a). The total number of Riemann intervals computed over 100 EM iterations: 5150 for "low", 14,950 for "medium", 50,500 for "linear" and 104,950 for "high". (b). For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The results are fairly similar, in particular between "medium", "high" and "linear".
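The cost growth with dimension can be sketched in a few lines: a two-dimensional Riemann sum over [0, 1]^2 requires n^2 evaluations per E step, against n in one dimension. This sketch uses a midpoint variant for simplicity, whereas the paper's construction uses left endpoints ⌊nz⌋/n; the function name is ours.

```python
import numpy as np

def riemann_sum_2d(f, n):
    """Approximate the integral of f over [0, 1]^2 with an n x n Riemann grid.

    The number of function evaluations grows as n^2, which is why the
    per-step number of intervals (the profile phi(n)) matters much more
    in dimension 2 than in dimension 1.
    """
    k = (np.arange(n) + 0.5) / n                 # midpoints in each dimension
    zz1, zz2 = np.meshgrid(k, k, indexing="ij")  # n x n grid: n^2 evaluations
    return f(zz1, zz2).sum() / n ** 2

# Example: integral of exp(-(z1 + z2)) over the unit square,
# whose exact value is (1 - e^{-1})^2
approx = riemann_sum_2d(lambda a, b: np.exp(-(a + b)), n=200)
```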

Context and Experimental Protocol
In this section, we assess the capacity of tmp-EM to escape from deceptive local maxima. We compare the classical EM to tmp-EM, with both a monotonous and an oscillating temperature profile, on a well-known toy example: likelihood maximisation within the Gaussian Mixture Model. We confront the algorithms with situations where the true classes have increasingly ambiguous positions, combined with initialisations designed to be hard to escape from. Although the EM is an optimisation procedure, and the log-likelihood reached is a critical metric, in this example we put more emphasis on the correct positioning of the cluster centroids, that is to say, on the recovery of the µ_k. The other usual metrics are also in favour of tmp-EM, and can be found in the supplementary materials.
Figure 3: (a). Visual representation of the number of Riemann intervals over the EM steps for each profile ϕ_i. In higher dimension, the computational complexities of the profiles are very different. More precisely, the total number of Riemann squares computed over 100 EM iterations are: 4534 for "square root", 125,662 for "5 square root", 348,550 for "low" and 2,318,350 for "medium". (b). For each profile, average evolution of the negative log-likelihood, with standard deviation, over 50 simulations. The "square root" profile performs poorly. However, the other three are comparable despite their different computational complexities.
For the sake of comparison, the experimental design is similar to the one in Allassonnière and Chevallier (2021) on the tmp-SAEM. It is as follows: we have three clusters of similar shape and equal weight. One is isolated and easily identifiable. The other two are next to one another, in a more ambiguous configuration. Figure 4 represents the three gradually more ambiguous configurations. Each configuration is called a "parameter family".
We use two different initialisation types to reveal the behaviours of the three EMs. The first, which we call "barycenter", puts all three initial centroids at the centre of mass of all the observed data points. However, none of the EM procedures would move from this initial state if the three GMM centroids were at the exact same position, hence we actually apply a tiny perturbation to make them all slightly distinct. The blue crosses on Figure 5 represent a typical barycenter initialisation. With this initialisation method, we assess whether the EM procedures are able to correctly estimate the positions of the three clusters, despite the ambiguity, when starting from a fairly neutral position, providing neither direction nor misdirection. On the other hand, the second initialisation type, which we call "2v1", voluntarily misguides the algorithm by positioning two centroids on the isolated right cluster and only one centroid on the side of the two ambiguous left clusters. The blue crosses on Figure 6 represent a typical 2v1 initialisation. This initialisation is intended to assess whether the methods are able to escape the potential well in which they start and make their centroids traverse the empty space between the left and right clusters to reach their rightful positions. For each of the three parameter families represented on Figure 4, 1000 datasets with 500 observations each are simulated, and the three EMs are run with both the barycenter and the 2v1 initialisation.
For tmp-EM, we try two profiles. First, a simple decreasing exponential profile, as seen in Ueda and Nakano (1998): T_n = 1 + (T_0 − 1) exp(−r·n). Through a grid search, the values T_0 = 5, r = 2 for the barycenter initialisation and T_0 = 100, r = 1.5 for the 2v1 initialisation are picked for this profile. Since Theorem 5 only requires T_n → 1, we also propose an oscillating profile inspired from Allassonnière and Chevallier (2021). The exact formula of these oscillations is: T_n = tanh(n/r) + (T_0 − b sinc(3π/4)) a^{n/r} + b sinc(3π/4 + n/r), where 0 < T_0, 0 < r, 0 < b and 0 < a < 1. The oscillations in this profile are meant to achieve a two-regime behaviour. When the temperature reaches low values, the convergence speed is momentarily increased, which has the effect of "locking in" some of the most obviously good decisions of the algorithm.
Then, the temperature is re-increased to continue the exploration of the other, more ambiguous, parameters. These two regimes alternate in succession, with gradually smaller oscillations, resulting in a multi-scale procedure that "locks in" gradually harder decisions. For some hyper-parameter combinations, the sequence T_n can have a (usually small) finite number of negative values. Since only the asymptotic behaviour of T_n as the step n grows matters for convergence, the theory allows a finite number of negative values. However, in practice, at least for the sake of interpretation, one may prefer to use only positive values of T_n. In that case, one can either restrict oneself to parameter combinations that result in no negative values of T_n, or enforce positivity by taking T_n ← max(T_n, ε) with a certain ε > 0.
For our experiments, we select the hyper-parameter values with a grid search. The "normalised" sinc function is used, sinc(x) = sin(πx)/(πx), and the chosen tempering parameters are T_0 = 5, r = 2, a = 0.6, b = 20 for the experiments with the barycenter initialisation, and T_0 = 100, r = 1.5, a = 0.02, b = 20 for the 2v1 initialisation. We have two different sets of tempering hyper-parameter values, one for each of the two very different initialisation types. However, these values then remain the same across the three parameter families and every data generation within them. This underlines that the method is not excessively sensitive to the tempering parameters, and that the prior search for good hyper-parameter values is a worthwhile time investment. Likewise, a simple experiment with 6 clusters, in the supplementary materials, demonstrates that the same hyper-parameters can be kept over different initialisations (and different data generations as well) when they are made in a non-adversarial way, by drawing random initial centroids uniformly among the data points.
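The tempered E step for a Gaussian mixture can be sketched as follows: the usual responsibilities are raised to the power 1/T_n and renormalised, which flattens the conditional distribution while T_n > 1 and recovers the exact E step as T_n → 1. This is our own minimal one-dimensional sketch, using the decreasing exponential profile given above; the function names are ours and the oscillating profile is omitted.

```python
import numpy as np

def temperature(n, T0=5.0, r=2.0):
    """Decreasing exponential profile T_n = 1 + (T0 - 1) exp(-r n)."""
    return 1.0 + (T0 - 1.0) * np.exp(-r * n)

def tempered_responsibilities(x, pis, mus, sigmas, T):
    """Tempered E step for a 1-D Gaussian mixture.

    The posterior class probabilities (pi_k * N(x_i; mu_k, sigma_k^2))
    are raised to the power 1/T and renormalised: T > 1 flattens the
    responsibilities, letting the procedure keep exploring.
    """
    log_r = (np.log(pis)[None, :]
             - 0.5 * np.log(2 * np.pi * sigmas ** 2)[None, :]
             - (x[:, None] - mus[None, :]) ** 2 / (2 * sigmas ** 2)[None, :])
    log_r = log_r / T                            # tempering
    log_r -= log_r.max(axis=1, keepdims=True)    # numerical stabilisation
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```

At T = 1 this is exactly the classical E step; the M step is unchanged, computed from the tempered responsibilities.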

Experimental Results Analysis
In this section, we analyse the results of EM, decreasing tmp-EM and oscillating tmp-EM over all the simulations. First, an illustration: Figures 5 and 6 depict the final states of EM and oscillating tmp-EM on one typical simulation for each of the three ambiguity levels (the three parameter families), starting from the barycenter and 2v1 initialisations respectively. The simulated data are represented by the green crosses. The initial centroids are in blue. The orange crosses represent the estimated centroid positions μ̂_k, and the orange confidence ellipses are visual representations of the estimated covariance matrices Σ̂_k. In the supplementary materials, we show step by step the path taken by the estimated parameters of tmp-EM before convergence, providing much more detail on the method's behaviour. These illustrative examples show oscillating tmp-EM succeeding better at cluster recovery than the classical EM. The results over all simulations are aggregated in Table 1, and confirm this observation. Table 1 presents the average and the standard deviation of the relative l_2 error on µ_k for the three EMs. For each category, the best result over the three EMs is highlighted in bold. The recovery of the true class averages µ_k is spotlighted, as it is the essential success metric for this experiment. More specifically, classes 1 and 2, the two leftmost classes, are the hardest to recover correctly and the ones whose estimation is the differentiating criterion between the algorithms. Indeed, as can be seen in Table 1, µ_3 is always well estimated by all methods. Hence, in the following, we discuss the error on µ_1 and µ_2.
Figure 5: Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) when the initialisation is made at the barycenter of all data points (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithms (orange crosses), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In these examples, tmp-EM managed to correctly identify the positions of the three real centroids.
Figure 6: Typical final positioning of the centroids by EM (first row) and tmp-EM with oscillating profile (second row) when the initialisation is made by selecting two points in the isolated cluster and one in the lower ambiguous cluster (blue crosses). The three columns represent the three gradually more ambiguous parameter sets. Each figure represents the positions of the estimated centroids after convergence of the EM algorithms (orange crosses), with their estimated covariance matrices (orange confidence ellipses). In each simulation, 500 sample points were drawn from the real GMM (small green crosses). In these examples, although EM kept two centroids on the isolated cluster, tmp-EM managed to correctly identify the positions of the three real centroids.
With the barycenter initialisation, classical EM and decreasing tmp-EM have similar average relative error levels, with classical EM actually being slightly better. However, oscillating tmp-EM is much better than both of them, with error levels smaller by a factor of 10 on parameter families 1 and 2, and by a factor of 5 on parameter family 3. The standard deviation of oscillating tmp-EM is also lower, by a factor of roughly 3 on parameter families 1 and 2, and by a factor of 2 on parameter family 3. With the 2v1 initialisation, all error levels are higher. This time, decreasing tmp-EM is better on average than classical EM, by a factor of 1.4 to 1.7 depending on the parameter family. In turn, oscillating tmp-EM is better than decreasing tmp-EM by a factor of 3.1 to 3.4 depending on the parameter family. Its standard deviation is also lower, by a factor of about 2.
Overall, oscillating tmp-EM dominates the simulation. Its error rates on the recovery of µ_1 and µ_2 are always the best, and they remain at low levels even with the most adversarial initialisations. To bolster this last point, we underline that the highest relative error reached by oscillating tmp-EM over all the various scenarios (0.39, on parameter family 3 with the 2v1 initialisation) is still lower than the lowest relative error of both classical EM (0.52, on parameter family 1 with the barycenter initialisation) and decreasing tmp-EM (0.60, on parameter family 1 with the barycenter initialisation).
Table 1: Average and standard deviation of the relative error on µ_k, ‖μ̂_k − µ_k‖_2 / ‖µ_k‖_2, made by EM, tmp-EM with decreasing temperature and tmp-EM with oscillating temperature, over 1000 simulated datasets with two different initialisations. The three parameter families, described in Figure 4, correspond to increasingly ambiguous positions of classes 1 and 2. For both initialisation types, the identification of these two clusters is drastically improved by the tempering. Best results highlighted in bold.

Tempered Riemann Approximation EM: Application to a Gaussian Model with Beta Prior
We illustrate the method with the model of Section 3.1.1, to which we apply the tempered Riemann approximation. As in Section 3.1.1, the resulting conditional probability density is a step function defined by the n different values it takes on [0, 1], for each observation x^(i) and ∀k ∈ {0, ..., n − 1}. The M step, seen in Equation (13), is unchanged. We compare the tempered Riemann EM to the simple Riemann EM in a case where the parameters are ambiguous. With real parameters α = 0.1, λ = 10, σ = 0.8, for each of the 100 simulations, the algorithms are initialised at α_0 = 10, λ_0 = 1, σ_0 = 7. The initialisation is somewhat adversarial, since the mean and variance of the marginal distribution of x are approximately the same under the real and the initialisation parameters, even though the distributions are different. Figure 7 shows that the tempered Riemann EM escapes the initialisation better than the regular Riemann EM, and reaches errors on the parameters that are orders of magnitude lower. The tempering parameters here are T_0 = 150, r = 3, a = 0.02, b = 40.
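A sketch of the tempered Riemann E step for one observation of this Beta-Gaussian model: the step-function density values h(k/n; θ)^{1/T} are computed on the grid and renormalised. This is our illustration only; the function name is ours, the grid excludes z = 0 to avoid the singularity of the Beta density when α < 1, and the constant factor α of h cancels in the normalisation so it is dropped.

```python
import numpy as np

def tempered_riemann_weights(x_i, alpha, lam, sigma, n_grid, T):
    """Tempered Riemann E step for z ~ Beta(alpha, 1), x | z ~ N(lam*z, sigma^2).

    Returns the grid and the step-function conditional probabilities,
    proportional to h(z_k; theta)^(1/T) on each cell: the Riemann
    discretisation combined with the tempering.
    """
    z = np.arange(1, n_grid + 1) / n_grid
    # log h(z_k; theta) up to a theta-only constant
    log_h = (alpha - 1) * np.log(z) - (x_i - lam * z) ** 2 / (2 * sigma ** 2)
    log_w = log_h / T                 # tempering: exponent 1/T
    log_w -= log_w.max()              # numerical stabilisation
    w = np.exp(log_w)
    return z, w / w.sum()
```

For T > 1 the weights are flatter than the plain Riemann weights, which is what lets the combined method keep exploring before T_n reaches 1.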

Discussion and Conclusions
We proposed the Deterministic Approximate EM class to bring together the many possible deterministic approximations of the E step. We proved a unified Theorem, with mild conditions on the approximation, which ensures the convergence of the algorithms in this class. Then, we showcased members of this class that solve the usual practical issues of the EM algorithm. For intractable E steps in low dimension, we introduced the Riemann approximation EM, a less parametric and deterministic alternative to the extensive family of MC-EM. We showed on an empirical example with an intractable E step how the Riemann approximation EM was able to increase the likelihood and recover every parameter in a satisfactory manner with its simplest design and no hyper-parameter optimisation.
Second, we studied the tempered EM, introduced by Ueda and Nakano (1998) to escape the attractive sub-optimal local extrema of non-convex objectives. We proved that tmp-EM is a specific case of the Deterministic Approximate EM, benefiting from the convergence property as long as the temperature profile converges towards 1. This mild condition justifies the use of many more temperature profiles than the ones tried in Ueda and Nakano (1998); Naim and Gildea (2012). To illustrate the interest of complex, non-monotonous temperature profiles, we demonstrated, in experiments with adversarial initial positions, the superiority of an oscillating profile over a simple decreasing one.
Finally, we added the Riemann approximation in order to apply the tempering in intractable cases. We were then able to show that the tmp-Riemann approximation massively improves the performance of the Riemann approximation when the initialisation is ambiguous.
Future works will improve both methods. The Riemann approximation will be generalised to be applicable even when the latent variable is not bounded, and an intelligent slicing of the integration space will improve the computational performances in high dimension. Regarding the tempered EM, since the theory allows the usage of any temperature profile, the natural next step is to look for efficient profiles with few hyper-parameters for fast tuning. Afterwards, implementing an adaptive tuning of the temperature parameters during the procedure will remove the necessity for preliminary grid search altogether.
Acknowledgements The research leading to these results has received funding from the European Research Council (ERC) under grant agreement No 678304, European Union's Horizon 2020 research and innovation program under grant agreement No 666992 (EuroPOND) and No 826421 (TVB-Cloud), and the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and reference ANR-10-IAIHU-06 (IHU-A-ICM).

A Proofs of the Two Main Theorems
In this Appendix, we provide in full detail the proofs that were sketched in the main body of the paper. Section A.1 details the proof of convergence for the Deterministic Approximate EM algorithm, our central result, Theorem 1 of the paper. Section A.2 details the proof of convergence for tmp-EM, Theorem 5 of the paper.

A.1 Proof of the General Theorem
In this Section, we prove Theorem 1, which guarantees the convergence of the Deterministic Approximate EM algorithm.
The proof is based on the application of Propositions 2 and 3, taken from Fort and Moulines (2003). We need to prove that, under the conditions of Theorem 1, we verify the conditions of Propositions 2 and 3. Then we will have the results announced in Theorem 1.

A.1.1 Verifying the Conditions of Proposition 3
g is the likelihood function of a model of the curved exponential family. Let T be the point-to-point map describing the transition between θ_n and θ_{n+1} in the exact EM algorithm, and L the set of stationary points of T: L := {θ ∈ Θ | T(θ) = θ}. (Note that if g is differentiable, the general properties of the EM tell us that the critical points of g are exactly these stationary points: L = {θ ∈ Θ | ∇g(θ) = 0}.) Additionally, g is a C^0 Lyapunov function associated to (T, L). Let {θ_n}_n be the sequence defined by the stable approximate EM, with {F_n}_{n∈N} our sequence of point-to-point maps.
We verify that, under this framework and with the assumptions of Theorem 1, the conditions of Proposition 3 hold.
We focus on (15), since (14) is easier to verify and follows from the same reasoning. The first steps are similar to Fort and Moulines (2003). We underline the most consequential deviations from the proof of Fort and Moulines (2003) when they occur. We begin by rewriting (15) in an equivalent form.
Use the Uniform Continuity We aim to relate the proximity between the images of g∘θ̂ to the proximity between the antecedents of g∘θ̂. The function g∘θ̂ : R^q → R is continuous, but not necessarily uniformly continuous on R^q. As a consequence, we need to restrict ourselves to a compact set to get uniform continuity properties. We already have a given compact K. S : Θ → R^q is continuous, hence S(K) is a compact as well. Let δ be a strictly positive real number, and let S̄(K, δ) := {s ∈ R^q | inf_{t∈K} ||S(t) − s|| ≤ δ}.

Further Simplifications of the Desired Result with Successive Sufficient Conditions
We find another, simpler, sufficient condition for (15) from Equation (18). This part is unique to our proof and absent from Fort and Moulines (2003). It is here that we relate the formal conditions of Proposition 3 to the specific assumptions of our Theorem 1. We first remove the dependency on the terms {θ_n}_n of the EM sequence. From Equations (18)-(20), we get that the following is a sufficient condition to have both Equations (14) and (15). To show that the hypotheses of Theorem 1 imply this sufficient condition, we express it in integral form. Let S = {S_u}_{u=1,...,q}. We recall that S̄_n(θ) = (∫_z S_u(z) p̄_{θ,n}(z) dz)_{u=1}^q and S̄(θ) = (∫_z S_u(z) p_θ(z) dz)_{u=1}^q.