The Expected Missing Mass under an Entropy Constraint

In Berend and Kontorovich (2012), the following problem was studied: A random sample of size t is taken from a world (i.e., probability space) of size n; bound the expected value of the probability of the set of elements not appearing in the sample (unseen mass) in terms of t and n. Here we study the same problem, where the world may be countably infinite, and the probability measure on it is restricted to have an entropy of at most h. We provide tight bounds on the maximum of the expected unseen mass, along with a characterization of the measures attaining this maximum.


Introduction
Let S be a finite probability space. Without loss of generality, suppose that S = {1, 2, …, n}. Let p = (p_1, p_2, …, p_n) be a probability measure on S. Suppose that a random sample X_1, X_2, …, X_t is drawn from S according to p. The missing mass is the random variable U_t, defined by

U_t = ∑_{i=1}^n p_i · 1[i ∉ {X_1, …, X_t}].

In words, U_t is the total probability mass of the set of those elements of S not observed at all in the sample. From the definition of U_t, it is easy to verify that EU_t = ∑_{i=1}^n p_i (1 − p_i)^t. When we wish to make the dependence on the measure p explicit, we will write E_p U_t instead of EU_t.
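The identity EU_t = ∑_{i=1}^n p_i (1 − p_i)^t can be checked numerically. The following minimal sketch (the function names are our own, not the paper's) computes the exact expectation and compares it with a Monte Carlo estimate of U_t:

```python
import random

def expected_missing_mass(p, t):
    """Exact expectation: EU_t = sum_i p_i * (1 - p_i)**t."""
    return sum(pi * (1.0 - pi) ** t for pi in p)

def sample_missing_mass(p, t, rng):
    """One realization of U_t: total mass of the atoms unseen in a sample of size t."""
    seen = set(rng.choices(range(len(p)), weights=p, k=t))
    return sum(pi for i, pi in enumerate(p) if i not in seen)

# Monte Carlo check of the identity on a small example.
p, t = [0.5, 0.3, 0.2], 2
rng = random.Random(0)
estimate = sum(sample_missing_mass(p, t, rng) for _ in range(20000)) / 20000
exact = expected_missing_mass(p, t)  # 0.5*0.5**2 + 0.3*0.7**2 + 0.2*0.8**2 = 0.4
```

The Monte Carlo average converges to the exact value as the number of repetitions grows.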
One of the earliest mentions of the missing mass is in Good–Turing frequency estimation [1]. The latter is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. This estimator has been used extensively in many machine learning tasks. For example, in the field of natural language modeling, for any sample of words there is a set of words not occurring in that sample; the total probability mass of the words not in the sample is the so-called missing mass [2]. Another example of using Good–Turing missing mass estimation is in [3], where the total summed probability of all patterns not observed in the training data is estimated. In [4], Berend and Kontorovich bounded the expectation of the missing mass from above in terms of t and n. (Additionally, deviation bounds were provided in [5].) Moreover, they have shown that:
1. Every local maximum p of EU_t (where, without loss of generality, we consider only vectors p with p_1 ≤ p_2 ≤ … ≤ p_n) consists of one "heavy" atom and n − 1 "light" ones of identical size, where the possibility "heavy" = "light" is not excluded.
2.
There exists a threshold τ = τ(n) > n such that (a) for t ≤ τ, there is a unique global maximum, and (b) for t > τ, there is a unique global maximum of a different form.
For a countably infinite set S, one cannot in general provide a non-trivial upper bound on EU_t in terms of t only. Indeed, for each n, consider the probability measure on ℕ supported on {1, 2, …, n}, giving equal probabilities to these n atoms. Clearly, EU_t = (1 − 1/n)^t ≥ 1 − t/n in this case, and the right-hand side becomes arbitrarily close to 1 as n grows. In [4], an upper bound on EU_t was shown in terms of a quantity l(p), which roughly measures the size of sets of atoms of comparable mass, and a universal constant c (for an exact definition, we refer to [4]). The bound given in (1) is non-trivial only if the sequence (p_i)_{i=1}^∞ decreases sufficiently fast. Such results may be useful, as shown in [6–9]. Another restriction that makes the problem interesting is that the entropy of p is bounded above by some given value. A similar restriction can be found in [10] in the context of discrete distribution estimation under ℓ1 loss. In this work, we study the possibility of providing tight bounds on EU_t under the restriction of a bound on the entropy. Thus, we can formulate our problem as follows:

sup_p E_p U_t, (2)
subject to: p is a probability measure on S, (3)
H(p) ≤ h, (4)

where h ≥ 0 is the maximal allowed entropy.
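The uniform construction above can be verified directly. This snippet (ours) checks that EU_t = (1 − 1/n)^t for the uniform measure on n atoms, and that for fixed t this quantity can be pushed arbitrarily close to 1 by taking n large:

```python
def uniform_eu(n, t):
    # EU_t for the uniform measure on n atoms: n * (1/n) * (1 - 1/n)**t
    return (1.0 - 1.0 / n) ** t

# Bernoulli's inequality gives (1 - 1/n)**t >= 1 - t/n, so no bound
# in terms of t alone can be non-trivial over all finite supports.
```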
In the case of distributions over countably infinite spaces we set S = ℕ; otherwise, S = {1, 2, …, n}, or in short, S = [n]. Note that we are looking for the supremum since, in the case S = ℕ, it is not a priori clear that the maximum exists (in fact, it turns out that the maximum does exist; see Theorem 2). Additionally, we will show that the maximum is attained by a measure with finite support, which leads us to study the problem for distributions over finite spaces. We also study the structure of local and global maxima and obtain some results analogous to those of [4].

Main Results
Our first result is that in the case S = ℕ, an optimal solution exploits all the available entropy. Denote the entropy of a probability measure p by H(p):

H(p) = −∑_i p_i ln p_i.

Proposition 1. Let S = ℕ, and let p = (p_1, p_2, …) be a probability measure on S. If H(p) < h, then there exists a probability measure p′ = (p′_1, p′_2, …) on S for which H(p′) = h and E_{p′} U_t > E_p U_t.
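Proposition 1 rests on the fact that splitting an atom increases both the entropy and the expected missing mass. A quick numerical check of both monotonicity claims (a sketch of our own, not code from the paper):

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def eu(p, t):
    return sum(x * (1 - x) ** t for x in p)

def split_first_atom(p, frac):
    """Split atom 0 of mass p[0] into two atoms of masses frac*p[0] and (1-frac)*p[0]."""
    return [frac * p[0], (1 - frac) * p[0]] + list(p[1:])

p = [0.6, 0.4]
q = split_first_atom(p, 0.5)  # [0.3, 0.3, 0.4]
```

Both H and EU_t strictly increase under the split, exactly as used in the proof of Proposition 1.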
Corollary 1. In the problem given by (2)–(4), we may replace (4) by H(p) = h.
In Theorem 1, we refer to the case S = [n] and show that an optimal solution of (2)–(4) cannot assume more than four distinct non-zero values. We do not know whether in some cases there are indeed atoms of four distinct sizes in the optimal solution.
In the case S = [n], it is easy to see that E_p U_t is continuous with respect to p, and thus EU_t attains its maximum. On the other hand, when S = ℕ, this is not a priori clear. Our next result shows that EU_t attains its maximum in this case as well.
Theorem 2.
(i) For each h > 0, the function EU_t attains its maximum.
(ii) If p is a global maximum point of EU_t, then p has finite support.
In particular, EU_t ≤ h/ln t.
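This bound can be sanity-checked numerically: for any fixed p, applying it with h = H(p) gives EU_t ≤ H(p)/ln t. A small check of our own over a few distributions and sample sizes:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def eu(p, t):
    return sum(x * (1 - x) ** t for x in p)

distributions = [
    [0.5, 0.25, 0.125, 0.125],
    [0.9, 0.05, 0.05],
    [1.0 / 8] * 8,
]
# Pairs (EU_t, H(p)/ln t); the first coordinate should never exceed the second.
checks = [(eu(p, t), entropy(p) / math.log(t))
          for p in distributions for t in (2, 10, 100)]
```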
In Theorem 4 we show that given a fixed h, we cannot significantly improve the upper bound from Theorem 3.

Theorem 4.
For fixed h and every α > 1, if t is large enough, then there exists a distribution p with H(p) ≤ h for which E_p U_t ≥ h/(α ln t).
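To see that bounds of this shape are achievable, one can experiment with a two-level family: one heavy atom of mass 1 − ε together with ε/δ light atoms of mass δ each. This family is our own illustration (the proof in the sequel uses its own explicit construction, and the parameter names ε, δ are ours); for suitable ε and δ it has entropy at most h while E_p U_t exceeds h/(α ln t):

```python
import math

def two_level_entropy(eps, delta):
    """Entropy of: one atom of mass 1-eps, plus m = eps/delta atoms of mass delta."""
    m = round(eps / delta)
    return -(1 - eps) * math.log(1 - eps) - m * delta * math.log(delta)

def two_level_eu(eps, delta, t):
    """EU_t for the same measure, in closed form."""
    m = round(eps / delta)
    heavy = (1 - eps) * eps ** t            # heavy atom missed with prob. eps**t
    light = m * delta * (1 - delta) ** t    # each light atom missed with prob. (1-delta)**t
    return heavy + light

# Illustrative parameters: h = 1, alpha = 2, t = 10**6, delta = 1/(10*t).
h, alpha, t = 1.0, 2.0, 10**6
eps, delta = 0.058, 1e-7
```

With these values the entropy stays below h while the expected missing mass clears the threshold h/(α ln t).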
As mentioned earlier, the parameter t represents the size of the sample. It appears that the optimization problem cannot be solved analytically for an arbitrary fixed t. The following results relate to the case t = 1. Obviously, this case is not typical, as one would hardly try to learn much from a sample of size 1. Yet it may be instructive, as in this case we obtain almost the best possible results.
Proposition 2. For every probability measure p on S with H(p) ≤ h, we have EU_1 ≤ 1 − e^{−h}.
Next, we describe the structure of an optimal solution for the case S = [n] with t = 1.
Proposition 3. Let S = [n], and let p = (p_1, p_2, …, p_n) be an optimal solution of the problem, where ln(k − 1) < h ≤ ln k, k ≤ n. Then (after sorting) p is of the form

p = (0, …, 0, p_{n−k+1}, p_{n−k+2}, …, p_n),  with p_{n−k+1} ≤ p_{n−k+2} = ⋯ = p_n.

That is, the non-zero atoms of p consist of one "light" atom and k − 1 "heavy" ones.
Denote the mass p_{n−k+1} of the light atom in the proposition by p, and that of the heavy ones by q. In view of Proposition 3, it suffices to consider the case k = n, namely ln(n − 1) < h ≤ ln n. For ln(n − 1) < h ≤ ln n, Proposition 4 gives a tight upper bound on EU_1.
Remark 1. When h = ln n (or h = ln(n − 1)), there is equality without the last term (see the proof of Proposition 3); at these points, the last term indeed vanishes. Inside the interval (ln(n − 1), ln n), Proposition 4 provides an improvement over Proposition 2. In Figure 1, we plot the exact value of max EU_1 (calculated numerically using MATLAB) against the bounds of Propositions 2 and 4.

Proofs
Proof of Proposition 1. Change the measure p by splitting some atom i into two atoms of sizes p′_i and p″_i, where 0 < p′_i < p_i and p″_i = p_i − p′_i. Let p′ = (p_1, …, p_{i−1}, p′_i, p″_i, p_{i+1}, …) be the new measure. The entropy of p is smaller than that of p′, and similarly, splitting any number of atoms, we increase both the entropy and EU_t. Now, take the first atom, for example, and split it into k sub-atoms, the first k − 1 of which are of size p each and the k-th of size p′, where 0 ≤ p ≤ p_1/k and p′ = p_1 − (k − 1)p, and k is still to be determined. For sufficiently large k and p = p_1/k, the entropy of the new measure becomes arbitrarily large, and in particular exceeds h. Take such a k, and consider the entropy of the obtained measure as p grows continuously from 0 to p_1/k. For p = 0, we have basically the original measure (and thus an entropy less than h), while for p = p_1/k the entropy is larger than h. Hence, for an appropriate intermediate value of p, the entropy is exactly h. The measure obtained for this p proves our claim.
Proof of Theorem 1. Write down the Lagrangian of the problem. The first-order conditions state that a certain function f vanishes at every non-zero p_i. We claim that f vanishes at most three times in (0, 1). Indeed, f(x) = 0 exactly when (8) holds. Denote the left-hand side of (8) by g(x). By Rolle's theorem, for every two points x_1 < x_2 at which (8) holds, there is an intermediate point x_1 < ξ < x_2 such that g′(ξ) = 0. Now, g′ clearly vanishes at no more than two points, so (8) holds for at most three values of x. It follows that each p_i assumes one of up to (the same) four values: the (at most) three solutions of (8), along with 0.
Before we prove Theorem 2, we need two auxiliary lemmas. For h ≥ 0, let X_h be the subset of ℓ1 consisting of all non-increasing sequences (p_1, p_2, …) satisfying the following properties:
1. p_i ≥ 0 for each i and ∑_{i=1}^∞ p_i = 1;
2. H(p) ≤ h.

Lemma 1. The set X_h is compact with respect to the ℓ1 metric.

Proof of Lemma 1. Let (p^n)_{n=1}^∞ be a sequence in X_h, say p^n = (p_{n1}, p_{n2}, …) for n ≥ 1. We want to show that it has a convergent subsequence in X_h. Employing the diagonal method, we may assume that p^n converges component-wise. Let p = (p_1, p_2, …) be the limit. It is clear that p has non-negative and non-increasing entries, so we only need to show that ∑_{i=1}^∞ p_i = 1, that H(p) ≤ h, and that the convergence is in ℓ1.
Assume first that ∑_{i=1}^∞ p_i > 1. Then there exists an index i_0 such that ∑_{i=1}^{i_0} p_i > 1. Hence, for sufficiently large n, we have ∑_{i=1}^{i_0} p_{ni} > 1, which is impossible. Next, assume that ∑_{i=1}^∞ p_i = 1 − δ for some δ > 0, and let i_0 be an integer, to be determined later. We have ∑_{i=1}^{i_0} p_{ni} ≤ 1 − δ/2, and hence ∑_{i>i_0} p_{ni} ≥ δ/2, for all sufficiently large n. Note that for every q = (q_1, q_2, …) ∈ X_h we have q_i ≤ (q_1 + q_2 + ⋯ + q_i)/i ≤ 1/i. Now we can bound from below the tail entropy of p^n:

−∑_{i>i_0} p_{ni} ln p_{ni} ≥ ∑_{i>i_0} p_{ni} ln i_0 ≥ (δ/2) ln i_0.

Taking i_0 large enough, we can make the right-hand side larger than h, which is impossible. Hence ∑_{i=1}^∞ p_i = 1. Also, since all the terms −p_{ni} ln p_{ni} are non-negative, Fatou's lemma yields H(p) ≤ lim inf_n H(p^n) ≤ h.
To prove convergence in ℓ1, let ε > 0. Since ∑_{i=1}^∞ p_i = 1, we can find an i_0 such that ∑_{i>i_0} p_i < ε. Due to the component-wise convergence, for sufficiently large n we have ∑_{i=1}^{i_0} |p_{ni} − p_i| < ε. For such n we also have ∑_{i>i_0} p_{ni} = 1 − ∑_{i=1}^{i_0} p_{ni} ≤ ∑_{i>i_0} p_i + ε < 2ε. Thus

∥p^n − p∥_1 ≤ ∑_{i=1}^{i_0} |p_{ni} − p_i| + ∑_{i>i_0} p_{ni} + ∑_{i>i_0} p_i < 4ε.

Hence we have convergence in ℓ1. This proves the lemma.
Example 1.Note that the subset of X h consisting of all those vectors whose entropy is exactly h is not compact.
For arbitrary fixed t, the quantity EU_t assigns to each point of X_h a real number; we will denote this function by EU_t.
Lemma 2. The mapping EU_t : X_h → ℝ is Lipschitz with constant 1 with respect to the ℓ1 metric.

Proof of Lemma 2. Consider the function f(x) = x(1 − x)^t, so that EU_t(p) = ∑_i f(p_i). Let M be the Lipschitz constant of f. According to Lemma 7 from [4], the candidates for assuming the maximum of |f′(x)| are the points x_1 = 0 and x_2 = 2/(t + 1). Since |f′(0)| = 1 and |f′(2/(t + 1))| = ((t − 1)/(t + 1))^{t−1} ≤ 1, we obtain M = 1, and the lemma follows.
Proof of Theorem 2.
(i) Follows from Lemma 1 and Lemma 2.
(ii) Suppose that p = (p_1, p_2, …) does not have a finite support. Consider the finite-dimensional problem obtained by fixing all but the first n_0 entries of p, for some n_0, namely the problem (9)–(11). Theorem 1 is still applicable to (9)–(11) with a minor variation: in the beginning of the proof, replace the Lagrangian by its analogue for (9)–(11) and proceed as previously. Since p ∈ X_h maximizes EU_t, the vector of its first n_0 entries is a global optimum of this finite-dimensional problem, and hence, by Theorem 1, these entries cannot assume more than four distinct values. On the other hand, since p has infinite support, its entries form a positive non-increasing sequence converging to 0, and therefore assume infinitely many distinct values. Choosing n_0 so large that the first n_0 entries of p assume more than four distinct values, we arrive at a contradiction.
Proof of Theorem 3. For 0 < x < 1:

−ln x = ∑_{k=1}^∞ (1 − x)^k / k. (12)

For each term on the right-hand side of (12) and for 1 ≤ k ≤ t − 1, we have (1 − x)^k / k ≥ (1 − x)^t / k. This gives us

−ln x ≥ (1 − x)^t ∑_{k=1}^{t−1} 1/k ≥ (1 − x)^t ln t,

and therefore x(1 − x)^t ≤ −x ln x / ln t. Finally,

EU_t = ∑_i p_i (1 − p_i)^t ≤ (1/ln t) ∑_i (−p_i ln p_i) = H(p)/ln t ≤ h/ln t.

Proof of Theorem 4. Let t > 1 be an integer and α > 1. Define p appropriately in terms of t, h and α; a direct computation then shows that H(p) ≤ h and E_p U_t ≥ h/(α ln t).

Proof of Proposition 2. Let P be the random variable assigning to each atom i ∈ S its probability: P(i) = p_i. Denote I = −ln P. Then

EU_1 = ∑_i p_i (1 − p_i) = E[1 − P] = E[1 − e^{−I}],  and  E[I] = H(p).

The function f(x) = 1 − e^{−x} is concave, so by Jensen's inequality:

EU_1 = E[f(I)] ≤ f(E[I]) = 1 − e^{−H(p)} ≤ 1 − e^{−h}.

Remark 2. If h = ln k for some positive integer k, then the bound is attained for the uniform distribution on a space of k points.
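For t = 1 we have EU_1 = 1 − ∑_i p_i², which makes the Jensen bound 1 − e^{−H(p)} easy to check numerically. In this sketch (our own), the bound is tight for uniform distributions and strict otherwise:

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def eu1(p):
    """EU_1 = sum_i p_i*(1 - p_i) = 1 - sum_i p_i**2."""
    return 1.0 - sum(x * x for x in p)

uniform4 = [0.25] * 4      # h = ln 4: the Jensen bound should be attained
skewed = [0.5, 0.3, 0.2]   # non-uniform: the bound should be strict
```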
Proof of Proposition 3. First, in the case h = ln k, there is a unique optimal solution, namely the measure p* that is uniform on k of the n points. It is straightforward to check that p* is feasible, attains the upper bound in Proposition 2, and is therefore optimal. Moreover, p* is unique, because any feasible non-uniform choice of p* leads to a strict inequality in the application of Jensen's inequality in the proof of Proposition 2. Thus, we deal with the case of strict inequalities, ln(k − 1) < h < ln k.
We start by showing that any optimal solution (p*_1, p*_2, …, p*_n) assumes at most two distinct non-zero values. Write down the Lagrangian. The first-order conditions yield, at any optimal point, f(p*_i) = 0 for every i with p*_i > 0, for a suitable function f. The function f vanishes at most twice in (0, 1), because its derivative f′(x) = −2 − λ_2/x vanishes at most once. Thus, the non-zero p*_i's assume at most two distinct values. In fact, if all of them were equal, we would have h = ln k, where k is the number of non-zero p*_i's, contrary to our assumption; hence the p*_i's assume exactly two distinct values.
Disposing of the points of mass 0, we may assume that all n points of S have positive mass. Denote the number of "light" atoms by ℓ. We will show that EU_1 decreases as we increase ℓ. Denote the mass of a "light" atom by p and write down the entropy constraint with ℓ "light" atoms and n − ℓ "heavy" ones:

−ℓp ln p − (1 − ℓp) ln((1 − ℓp)/(n − ℓ)) = h.

Now, define the function F(ℓ, p) by:

F(ℓ, p) = −ℓp ln p − (1 − ℓp) ln((1 − ℓp)/(n − ℓ)) − h.

Note that we treat ℓ as a continuous variable. The equation F(ℓ, p) = 0 implicitly defines the function p(ℓ). Using the implicit function theorem, we can write an analytic expression for dp/dℓ = −(∂F/∂ℓ)/(∂F/∂p). Now write down EU_1 as a function of ℓ and take the derivative with respect to ℓ. Notice that the quantity (1 − ℓp(ℓ))/(n − ℓ) is actually the mass of a "heavy" atom, so to simplify notation we put q(ℓ) = (1 − ℓp(ℓ))/(n − ℓ). Substituting the expression for dp/dℓ, we obtain an expression (14) for ∂EU_1/∂ℓ. To show that EU_1 decreases as we increase ℓ, it is enough to check that ∂EU_1/∂ℓ < 0. It suffices to work out the second term in the product of (14). Using the change of variables y = q(ℓ)/p(ℓ) − 1, and noting that y > 0 and hence y ln(1 + y) > 0, it is enough to check that the numerator is negative, which is indeed the case. It follows that ℓ should be as small as possible, which means (since there is at least one light atom) that ℓ = 1. Finally, as there is one light atom and n − 1 heavy ones, the entropy h lies in the interval (ln(n − 1), ln n); reverting to the original notation, ln(k − 1) < h < ln k.
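The monotonicity claim at the heart of this proof (EU_1 decreases as the number ℓ of light atoms grows, with the entropy held fixed) can be checked numerically. The sketch below is our own; it relies on the fact that the entropy of this one-parameter family increases with the light mass p (indeed dH/dp = ℓ ln(q/p) > 0 for p < q), so bisection for p at a fixed entropy level is justified:

```python
import math

def family_entropy(l, p, n):
    """Entropy of l light atoms of mass p and n-l heavy atoms of mass (1-l*p)/(n-l)."""
    q = (1 - l * p) / (n - l)
    return -l * p * math.log(p) - (n - l) * q * math.log(q)

def light_mass(l, n, h, iters=200):
    """Bisect for p in (0, 1/n) with family_entropy(l, p, n) == h,
    using the monotonicity of the entropy in p."""
    lo, hi = 1e-15, 1.0 / n
    for _ in range(iters):
        mid = (lo + hi) / 2
        if family_entropy(l, mid, n) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def family_eu1(l, n, h):
    p = light_mass(l, n, h)
    q = (1 - l * p) / (n - l)
    return 1.0 - l * p * p - (n - l) * q * q

n, h = 5, 1.5  # ln 4 < 1.5 < ln 5
values = [family_eu1(l, n, h) for l in range(1, n)]  # EU_1 for l = 1, 2, 3, 4
```

As the proof asserts, the computed values of EU_1 are strictly decreasing in ℓ, so ℓ = 1 is optimal.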