
Article

# The Expected Missing Mass under an Entropy Constraint

1 Department of Mathematics, Ben-Gurion University, Beer Sheva 84105, Israel
2 Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel
* Author to whom correspondence should be addressed.
Entropy 2017, 19(7), 315; https://doi.org/10.3390/e19070315
Received: 7 June 2017 / Revised: 20 June 2017 / Accepted: 26 June 2017 / Published: 29 June 2017
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

## Abstract

In Berend and Kontorovich (2012), the following problem was studied: A random sample of size t is taken from a world (i.e., probability space) of size n; bound the expected value of the probability of the set of elements not appearing in the sample (unseen mass) in terms of t and n. Here we study the same problem, where the world may be countably infinite, and the probability measure on it is restricted to have an entropy of at most h. We provide tight bounds on the maximum of the expected unseen mass, along with a characterization of the measures attaining this maximum.

## 1. Introduction

Let S be a finite probability space. Without loss of generality, suppose that $S = \{1, 2, \ldots, n\}$. Let $p = (p_1, p_2, \ldots, p_n)$ be a probability measure on S. Suppose that a random sample $X_1, X_2, \ldots, X_t$ is drawn from S according to $p$. The missing mass is the random variable $U_t$, defined by:
$U_t = \sum_{i=1}^{n} p_i \, \mathbb{1}_{\{X_1 \neq i \,\wedge\, \cdots \,\wedge\, X_t \neq i\}}.$
In words, $U_t$ is the total probability mass of the set of those elements of S not observed at all in the sample. Since atom i is missed by all t independent draws with probability $(1 - p_i)^t$, linearity of expectation gives $E U_t = \sum_{i=1}^{n} p_i (1 - p_i)^t$. When we wish to make the dependence on the measure $p$ explicit, we will write $E_p U_t$ instead of $E U_t$.
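The closed form for the expected missing mass can be checked against a direct simulation. The following is a minimal sketch, not part of the original article; the function names, the example measure and the number of trials are illustrative assumptions.

```python
import random

def expected_missing_mass(p, t):
    """Closed form: sum over atoms of p_i * (1 - p_i)**t."""
    return sum(pi * (1.0 - pi) ** t for pi in p)

def simulate_missing_mass(p, t, trials=20000, seed=0):
    """Monte Carlo estimate: average mass of the atoms absent from a sample of size t."""
    rng = random.Random(seed)
    atoms = range(len(p))
    total = 0.0
    for _ in range(trials):
        seen = set(rng.choices(atoms, weights=p, k=t))
        total += sum(pi for i, pi in enumerate(p) if i not in seen)
    return total / trials

# Hypothetical example measure on four atoms, sample size t = 3.
p = [0.5, 0.3, 0.1, 0.1]
exact = expected_missing_mass(p, t=3)
estimate = simulate_missing_mass(p, t=3)
print(f"closed form: {exact:.4f}, simulated: {estimate:.4f}")
```

The two numbers should agree up to Monte Carlo noise, which shrinks as `trials` grows.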
One of the earliest mentions of the missing mass is in Good–Turing frequency estimation [1]. The latter is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. This estimator has been used extensively in many machine learning tasks. For example, in the field of natural language modeling, for any sample of words, there is a set of words not occurring in that sample. The total probability mass of the words not in the sample is the so-called missing mass [2]. Another example of using Good–Turing missing mass estimation is in [3], where the total summed probability of all patterns not observed in the training data is estimated. In [4], Berend and Kontorovich showed that the expectation of the missing mass is bounded above as follows:
$E U_t \le \begin{cases} e^{-t/n}, & t \le n, \\ \dfrac{n}{e t}, & t > n. \end{cases}$
(Additionally, deviation bounds were provided in [5].)
Moreover, they have shown that:
• Every local maximum $p$ of $E U_t$ is of the form
$p_1 = p_2 = \cdots = p_{n-1} \le p_n,$
(where without loss of generality we consider only vectors $p$ with $p_1 \le p_2 \le \cdots \le p_n$). That is, $p$ consists of one “heavy” atom and $n - 1$ “light” ones of identical size, where the possibility of “heavy” = “light” is not excluded.
• There exists a threshold $\tau = \tau(n) > n$ such that:
(a) For $t \le \tau$, there is a unique global maximum:
$p_1 = p_2 = \cdots = p_{n-1} = p_n = \frac{1}{n}.$
(b) For $t > \tau$, there is a unique global maximum, and it has the form:
$p_1 = p_2 = \cdots = p_{n-1} < p_n.$
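The upper bound quoted from [4] can be compared numerically with the expected missing mass of a few measures on a world of size n. This is only an illustrative sketch: the helper names, the choice n = 10 and the particular measures (uniform, one heavy atom with n − 1 equal light ones, and a random one) are assumptions made here, and the value of the threshold $\tau(n)$ is not computed.

```python
import math
import random

def expected_missing_mass(p, t):
    return sum(pi * (1.0 - pi) ** t for pi in p)

def bk_upper_bound(n, t):
    """Bound from [4]: exp(-t/n) for t <= n, and n / (e * t) for t > n."""
    return math.exp(-t / n) if t <= n else n / (math.e * t)

def heavy_light(n, light):
    """n - 1 light atoms of mass `light` plus one heavy atom holding the rest."""
    return [light] * (n - 1) + [1.0 - (n - 1) * light]

n = 10
rng = random.Random(1)
results = {}
for t in (1, 5, 10, 50, 200):
    uniform = [1.0 / n] * n
    skewed = heavy_light(n, light=0.5 / n)
    weights = [rng.random() for _ in range(n)]
    arbitrary = [w / sum(weights) for w in weights]
    values = [expected_missing_mass(q, t) for q in (uniform, skewed, arbitrary)]
    results[t] = (values, bk_upper_bound(n, t))
    print(t, [round(v, 5) for v in values], round(results[t][1], 5))
```

Each printed row lists the three expected missing masses followed by the bound, so one can see how much slack the bound leaves for each sample size.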
For a countably infinite set S, one cannot generally provide a non-trivial upper bound on $E U_t$ in terms of t only. Indeed, for each n, consider the probability measure on $\mathbb{N}$ supported on $\{1, 2, \ldots, n\}$, giving equal probabilities to these n atoms. In this case $E U_t = \left(1 - \frac{1}{n}\right)^t \ge 1 - \frac{t}{n}$ by Bernoulli's inequality, and the right-hand side becomes arbitrarily close to 1 as n grows. In [4] it was shown that
$E U t ≤ l ( p ) c t ,$
where $l(p)$ roughly measures the size of sets of atoms of comparable mass and c is a universal constant (for an exact definition, we refer to [4]). The bound given in (1) is non-trivial only if the sequence $(p_i)_{i=1}^{\infty}$ decreases “sufficiently fast”. Such results may be useful, as shown in [6,7,8,9]. Another possible restriction that makes the problem interesting is that the entropy of $p$ is bounded above by some given value. A similar restriction can be found in [10] in the context of discrete distribution estimation under $\ell_1$ loss. In this work, we study the possibility of providing tight bounds on $E U_t$ under the restriction of some bound on the entropy. Thus, we can formulate our problem as follows:
$\sup_{p} \sum_{i \in S} p_i (1 - p_i)^t$ (2)
subject to
$\sum_{i \in S} p_i = 1,$ (3)
$\sum_{i \in S} p_i \ln \frac{1}{p_i} \le h,$ (4)
where $h \ge 0$ is the maximal allowed entropy.
In the case of distributions over countably infinite spaces we set $S = \mathbb{N}$; otherwise, $S = \{1, 2, \ldots, n\}$, or in short, $S = [n]$. Note that we are looking for the supremum, since in the case $S = \mathbb{N}$ it is not a priori clear that the maximum exists (in fact, it turns out that the maximum does exist; see Theorem 2). Additionally, we will show that the maximum is obtained for a measure with finite support, which leads us to study the problem for the case of distributions over finite spaces. We also study the structure of local and global maxima and obtain some results analogous to [4].
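To make the optimization problem concrete, the sketch below evaluates the objective and checks the two constraints for a few finitely supported candidate measures. It is an illustration only: the function names, the tolerance and the candidate measures are assumptions, and no attempt is made here to solve the problem.

```python
import math

def entropy(p):
    """Shannon entropy in nats; atoms of mass 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def objective(p, t):
    """The quantity being maximized: the expected missing mass for sample size t."""
    return sum(pi * (1.0 - pi) ** t for pi in p)

def is_feasible(p, h, tol=1e-12):
    """p is a probability vector whose entropy is at most h (up to tol)."""
    return (all(pi >= 0.0 for pi in p)
            and abs(sum(p) - 1.0) <= tol
            and entropy(p) <= h + tol)

h = math.log(3)
candidates = {
    "uniform on 3 atoms": [1 / 3, 1 / 3, 1 / 3],
    "uniform on 4 atoms": [0.25] * 4,            # entropy ln 4 exceeds h
    "one heavy, three light": [0.7, 0.1, 0.1, 0.1],
}
report = {name: (is_feasible(p, h), objective(p, t=2))
          for name, p in candidates.items()}
for name, (ok, value) in report.items():
    print(f"{name}: feasible={ok}, objective={value:.4f}")
```

Only the feasible candidates compete for the supremum; the uniform measure on four atoms is excluded because it uses more entropy than h allows.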

## 2. Main Results

Our first result is that in the case of $S = \mathbb{N}$, an optimal solution exploits all the available entropy. Denote the entropy of a probability measure $p$ by $H(p)$:
$H(p) = \sum_{i \in S} p_i \ln \frac{1}{p_i}.$
Proposition 1.
Let $S = \mathbb{N}$, and let $p = (p_1, p_2, \ldots)$ be a probability measure on S. If $H(p) < h$, then there exists a probability measure $p' = (p_1', p_2', \ldots)$ on S for which $H(p') = h$ and $E_{p'} U_t > E_p U_t$.
Corollary 1.
In the problem given by (2)–(4), we may replace (4) by
$\sum_{i=1}^{\infty} p_i \ln \frac{1}{p_i} = h.$
In Theorem 1, we refer to the case $S = [n]$ and show that an optimal solution of (2)–(4) cannot assume more than four distinct non-zero values.
Theorem 1.
Let $S = [n]$, and $p = (p_1, p_2, \ldots, p_n)$ be any locally optimal solution. Then, the $p_i$’s assume at most four non-zero values; i.e., if the $p_i$’s are sorted, then for some indices $j, k, l$ and m, we have $p_1 = \cdots = p_j = 0 < p_{j+1} = \cdots = p_k \le p_{k+1} = \cdots = p_l \le p_{l+1} = \cdots = p_m \le p_{m+1} = \cdots = p_n$.
We do not know whether in some cases there are indeed atoms of four distinct sizes in the optimal solution.
In the case $S = [n]$, it is easy to see that $E_p U_t$ is continuous with respect to $p$, and thus $E U_t$ attains its maximum. On the other hand, when $S = \mathbb{N}$, it is not a priori clear. Our next result shows that $E U_t$ attains its maximum in this case as well.
Theorem 2.
Let $S = \mathbb{N}$.
(i) For each $h > 0$, the function $E U_t$ attains its maximum.
(ii) If $p$ is a global maximum point of $E U_t$, then $p$ has a finite support.
Denote $H_t = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{t}$. Recall that $\ln t \le H_t \le \ln t + 1$ for each t.
Theorem 3.
For all $t \ge 1$, we have $E U_t \le \frac{h}{H_{2t-1}}$.
In particular, $E U_t \le \frac{h}{\ln t}$.
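The bound of Theorem 3 can be probed numerically by taking h to be the entropy of each tested measure. The sketch below is an illustration under assumptions made here (random finitely supported measures, the seed, and the helper names); it is a sanity check, not a proof.

```python
import math
import random

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def expected_missing_mass(p, t):
    return sum(pi * (1.0 - pi) ** t for pi in p)

def harmonic(m):
    """H_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / j for j in range(1, m + 1))

def theorem3_bound(h, t):
    return h / harmonic(2 * t - 1)

rng = random.Random(7)
checks = []
for _ in range(200):
    n = rng.randint(2, 30)
    weights = [rng.random() ** 3 for _ in range(n)]   # cubing skews the masses
    total = sum(weights)
    p = [w / total for w in weights]
    t = rng.randint(1, 50)
    checks.append((expected_missing_mass(p, t), theorem3_bound(entropy(p), t)))

worst_ratio = max(value / bound for value, bound in checks)
print(f"largest observed ratio value / bound: {worst_ratio:.3f}")
```

A ratio below 1 for every tested measure is consistent with the theorem; how close it gets to 1 depends on the measures sampled.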
In Theorem 4 we show that given a fixed h, we cannot significantly improve the upper bound from Theorem 3.
Theorem 4.
For fixed h and every $\alpha > 1$, if t is large enough then there exists a distribution $p$ with $H(p) \le h$, for which $E_p U_t \ge \frac{h}{\alpha \ln t}$.
As mentioned earlier, the parameter t represents the size of the sample. It appears that the optimization problem cannot be solved analytically for an arbitrary fixed t. The following results relate to the case $t = 1$. Obviously, this case is not typical, as one would hardly try to learn much from a sample of size 1. Yet, it may be instructive, as in this case we obtain almost the best possible results.
Proposition 2.
$E U_1 \le 1 - e^{-h}, \quad h > 0.$
Next, we describe the structure of an optimal solution for the case $S = [n]$ with $t = 1$.
Proposition 3.
Let $S = [n]$, and let $p = (p_1, p_2, \ldots, p_n)$ be an optimal solution of the problem
$\max \sum_{i=1}^{n} p_i (1 - p_i)$
subject to
$\sum_{i=1}^{n} p_i = 1,$ (6)
$\sum_{i=1}^{n} p_i \ln \frac{1}{p_i} = h,$ (7)
where $\ln(k-1) < h \le \ln k$, $k \le n$. Then (after sorting), $p$ is of the form
$p_1 = p_2 = \cdots = p_{n-k} = 0, \quad p_{n-k+1} \le p_{n-k+2} = \cdots = p_n.$
That is, the non-zero atoms of $p$ consist of one “light” atom and $k - 1$ “heavy” ones.
Denote the mass $p_{n-k+1}$ of the light atom in the proposition by p and that of the heavy ones by q. In view of Proposition 3, it suffices to consider the case $k = n$, namely $\ln(n-1) < h \le \ln n$. For $\ln(n-1) < h \le \ln n$, the following proposition gives a tight upper bound on $E U_1$:
Proposition 4.
$E U_1 \le 1 - e^{-h} - \left| e^{-h} \left( 2p - 1 + 2p \ln p + 2ph \right) + (n-1) q^2 - p^2 \right|$.
Remark 1.
When $h = \ln n$ (or $h = \ln(n-1)$), there is an equality without the last term (see the proof of Proposition 3). At these points, the last term indeed vanishes. Inside the interval $(\ln(n-1), \ln n)$, Proposition 4 provides an improvement over Proposition 2. In Figure 1, we plot the exact value of max $E U_1$ (calculated numerically using MATLAB) against the bounds of Propositions 2 and 4 for $h \in [\ln 3, \ln 4]$. It appears that the additional term in Proposition 4 captures most of the error in Proposition 2.
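The comparison described in Remark 1 can be reproduced approximately without MATLAB. The sketch below makes several assumptions that are not in the original: it parametrizes the measure by the light mass p instead of by h, it takes the value of $E U_1$ for the one-light-atom measure of Proposition 3 as the exact maximum, and it applies the absolute value of the correction term, which is what the Jensen refinement used in the proof of Proposition 4 yields.

```python
import math

def one_light_atom_bounds(n, p):
    """One light atom of mass p and n - 1 heavy atoms of equal mass q.

    Returns (h, exact E[U_1], Proposition 2 bound, Proposition 4 bound).
    """
    q = (1.0 - p) / (n - 1)
    h = -p * math.log(p) - (n - 1) * q * math.log(q)
    exact = 1.0 - p * p - (n - 1) * q * q
    prop2 = 1.0 - math.exp(-h)
    correction = (math.exp(-h) * (2 * p - 1 + 2 * p * math.log(p) + 2 * p * h)
                  + (n - 1) * q * q - p * p)
    prop4 = prop2 - abs(correction)
    return h, exact, prop2, prop4

n = 4
rows = [one_light_atom_bounds(n, p) for p in (0.01, 0.05, 0.1, 0.15, 0.2, 0.25)]
for h, exact, prop2, prop4 in rows:
    print(f"h={h:.4f}  exact={exact:.5f}  Prop.4={prop4:.5f}  Prop.2={prop2:.5f}")
```

For n = 4 the light mass ranges over (0, 1/4], which sweeps h over $(\ln 3, \ln 4]$, the interval used in Figure 1.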

## 3. Proofs

Proof of Proposition 1.
Change the measure $p$ by splitting some atom i into two atoms of sizes $p_i'$ and $p_i''$ where $0 < p_i' < p_i$ and $p_i'' = p_i - p_i'$. Let $\tilde{p} = (p_1, \ldots, p_{i-1}, p_i', p_i'', p_{i+1}, \ldots)$ be the new measure. The entropy of $p$ is smaller than that of $\tilde{p}$ since
$p_i \ln \frac{1}{p_i} = [p_i' + p_i''] \ln \frac{1}{p_i} < p_i' \ln \frac{1}{p_i'} + p_i'' \ln \frac{1}{p_i''}.$
Now
$p_i (1 - p_i)^t = [p_i' + p_i''] (1 - p_i)^t < p_i' (1 - p_i')^t + p_i'' (1 - p_i'')^t,$
which implies that $E_p U_t < E_{\tilde{p}} U_t$. Similarly, splitting any number of atoms, we increase both the entropy and $E U_t$.
Now, take the first atom, for example, and split it into k sub-atoms, the first $k - 1$ of which are of size p each and the k-th of size $p'$, where $0 \le p \le \frac{p_1}{k}$ and $p' = p_1 - (k-1)p$, and k is still to be determined. The entropy of the new measure is
$(k-1)\, p \ln \frac{1}{p} + (p_1 - (k-1)p) \ln \frac{1}{p_1 - (k-1)p} + \sum_{i=2}^{\infty} p_i \ln \frac{1}{p_i}.$
For sufficiently large k and $p = \frac{p_1}{k}$, this entropy becomes arbitrarily large, and in particular exceeds h. Take such a k, and consider the entropy of the obtained measure as p grows continuously from 0 to $\frac{p_1}{k}$. For $p = 0$, we have basically the original measure (and thus an entropy less than h), while for $p = \frac{p_1}{k}$ the entropy is larger than h. Hence for an appropriate intermediate value of p, the entropy is exactly h. The measure obtained for this p proves our claim. ☐
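The first step of the proof, that splitting an atom raises both the entropy and the expected missing mass, is easy to observe on a small example. The sketch below is illustrative; the helper `split_atom`, the example measure and the split fraction are assumptions introduced here.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def expected_missing_mass(p, t):
    return sum(pi * (1.0 - pi) ** t for pi in p)

def split_atom(p, i, fraction):
    """Replace atom i by two atoms of masses fraction * p[i] and (1 - fraction) * p[i]."""
    part = p[i] * fraction
    return p[:i] + [part, p[i] - part] + p[i + 1:]

t = 4
before = [0.6, 0.3, 0.1]
after = split_atom(before, 0, 0.25)   # the atom of mass 0.6 becomes 0.15 and 0.45

print("entropy:     ", entropy(before), "->", entropy(after))
print("missing mass:", expected_missing_mass(before, t), "->",
      expected_missing_mass(after, t))
```

Repeating the split on further atoms keeps increasing both quantities, which is what lets the proof raise the entropy all the way up to h.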
Proof of Theorem 1.
Write down the Lagrangian:
$L(p, \lambda_1, \lambda_2) = \sum_{i=1}^{n} p_i (1 - p_i)^t + \lambda_1 \left( \sum_{i=1}^{n} p_i - 1 \right) + \lambda_2 \left( \sum_{i=1}^{n} p_i \ln \frac{1}{p_i} - h \right).$
The first-order conditions yield:
$\frac{\partial L}{\partial p_i} = (1 - p_i)^t - t p_i (1 - p_i)^{t-1} + \lambda_1 - \lambda_2 (\ln p_i + 1) = 0.$
Denote:
$f(x) = (1 - x)^t - t x (1 - x)^{t-1} + \lambda_1 - \lambda_2 (\ln x + 1), \quad 0 < x < 1.$
We have:
$f'(x) = t (1 - x)^{t-2} [x(t+1) - 2] - \frac{\lambda_2}{x}.$
We claim that $f'$ vanishes at most three times in $(0, 1)$. Indeed, $f'(x) = 0$ when
$x\, t\, (1 - x)^{t-2} [x(t+1) - 2] = \lambda_2.$ (8)
Denote the left-hand side of (8) by $g(x)$. Then:
$g'(x) = -t (1 - x)^{t-3} [t^2 x^2 + t(x - 4)x + 2].$
For every two points $x_1, x_2$ for which (8) holds, there is an intermediate point $x_1 < \xi < x_2$ such that $g'(\xi) = 0$. Now, $g'$ clearly vanishes at no more than two points in $(0, 1)$, since the bracketed factor is a quadratic in x, so that (8) holds for at most three values of x. Hence $f'$ has at most three zeros, $f$ has at most four, and it follows that each non-zero $p_i$ assumes one of up to (the same) four values. ☐
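The derivative formula for g and the count of its sign changes can be double-checked numerically. The sketch below compares the stated $g'$ with a central finite difference and counts sign changes of $g'$ on a grid; the step size, the grid and the tested values of t are assumptions chosen here for illustration.

```python
def g(x, t):
    """Left-hand side of (8): x * t * (1 - x)**(t - 2) * (x * (t + 1) - 2)."""
    return x * t * (1.0 - x) ** (t - 2) * (x * (t + 1) - 2.0)

def g_prime(x, t):
    """Derivative of g as stated in the proof."""
    return -t * (1.0 - x) ** (t - 3) * (t * t * x * x + t * (x - 4.0) * x + 2.0)

def central_difference(func, x, t, step=1e-6):
    return (func(x + step, t) - func(x - step, t)) / (2.0 * step)

def sign_changes(values):
    signs = [v > 0 for v in values if v != 0.0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

grid = [k / 1000 for k in range(50, 951)]
max_error = max(abs(central_difference(g, x, t) - g_prime(x, t))
                for t in (3, 5, 10) for x in grid)
changes = {t: sign_changes([g_prime(x, t) for x in grid]) for t in (3, 5, 10)}
print("largest finite-difference discrepancy:", max_error)
print("sign changes of g' on the grid:", changes)
```

At most two sign changes of $g'$ on the grid is what the argument needs in order to cap the number of solutions of (8) at three.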
Before we prove Theorem 2 we need two auxiliary lemmas. For $h \ge 0$, let $X_h$ be the subset of $\ell_1$ consisting of all non-increasing sequences $(p_1, p_2, \ldots)$ satisfying the following properties:
• $p_i \ge 0$ for each i and $\sum_{i=1}^{\infty} p_i = 1.$
• $H(p) \le h.$
Lemma 1.
$X_h$ is compact under the $\ell_1$ metric.
Proof of Lemma 1.
Let $(p_n)_{n=1}^{\infty}$ be a sequence in $X_h$, say $p_n = (p_{n1}, p_{n2}, \ldots)$ for $n \ge 1$. We want to show that it has a convergent subsequence in $X_h$. Employing the diagonal method, we may assume that $p_n$ converges component-wise. Let $p = (p_1, p_2, \ldots)$ be the limit. It is clear that $p$ has non-negative and non-increasing entries, so we only need to show that $\sum_{i=1}^{\infty} p_i = 1$, that $H(p) \le h$, and that $p_n \xrightarrow[n \to \infty]{} p$ in $\ell_1$.
Assume first that $\sum_{i=1}^{\infty} p_i > 1$. Then, there exists an index $i_0$ such that $\sum_{i=1}^{i_0} p_i > 1$. Hence for sufficiently large n, we have $\sum_{i=1}^{i_0} p_{ni} > 1$, which is a contradiction. Hence $\sum_{i=1}^{\infty} p_i \le 1$. Now assume that $\sum_{i=1}^{\infty} p_i < 1$. Put $\varepsilon = 1 - \sum_{i=1}^{\infty} p_i$. Let $i_0$ be an integer, to be determined later. We have $\sum_{i=1}^{i_0} p_{ni} < 1 - \frac{\varepsilon}{2}$ for all sufficiently large n. Note that for every $q = (q_1, q_2, \ldots) \in X_h$ we have $q_i \le \frac{1}{i}(q_1 + q_2 + \cdots + q_i) \le \frac{1}{i}$. Now we can bound from below the tail entropy of $p_n$:
$\sum_{i=i_0+1}^{\infty} p_{ni} \ln \frac{1}{p_{ni}} > \sum_{i=i_0+1}^{\infty} p_{ni} \ln i_0 > \frac{\varepsilon}{2} \ln i_0.$
Taking $i_0$ large enough, we can make the right-hand side larger than h, which is impossible. Hence $\sum_{i=1}^{\infty} p_i = 1$.
We now show similarly that $H(p) \le h$. Assume that $\sum_{i=1}^{\infty} p_i \ln \frac{1}{p_i} > h$. Then there exists an $i_0$ such that $\sum_{i=1}^{i_0} p_i \ln \frac{1}{p_i} > h$. Then, however, $\sum_{i=1}^{i_0} p_{ni} \ln \frac{1}{p_{ni}} > h$ for sufficiently large n, which yields a contradiction.
To prove convergence in $\ell_1$, we estimate $\| p_n - p \|_1 = \sum_{i=1}^{\infty} | p_{ni} - p_i |$. Let $\varepsilon > 0$. Since $\sum_{i=1}^{\infty} p_i = 1$, we can find an $i_0$ such that $\sum_{i=i_0+1}^{\infty} p_i < \frac{\varepsilon}{6}$. Due to the component-wise convergence, for sufficiently large n we have $\sum_{i=1}^{i_0} | p_{ni} - p_i | < \frac{\varepsilon}{6}$. For such n we also have $\sum_{i=i_0+1}^{\infty} p_{ni} < \frac{\varepsilon}{3}$ since
$\sum_{i=1}^{i_0} | p_{ni} - p_i | < \frac{\varepsilon}{6} \;\Rightarrow\; \left| \sum_{i=1}^{i_0} (p_{ni} - p_i) \right| < \frac{\varepsilon}{6} \;\Rightarrow\; \sum_{i=1}^{i_0} p_{ni} > \sum_{i=1}^{i_0} p_i - \frac{\varepsilon}{6} > 1 - \frac{\varepsilon}{3} \;\Rightarrow\; \sum_{i=i_0+1}^{\infty} p_{ni} < \frac{\varepsilon}{3}.$
Thus we have
$\sum_{i=1}^{\infty} | p_{ni} - p_i | = \sum_{i=1}^{i_0} | p_{ni} - p_i | + \sum_{i=i_0+1}^{\infty} | p_{ni} - p_i | < \frac{\varepsilon}{6} + \sum_{i=i_0+1}^{\infty} | p_{ni} | + \sum_{i=i_0+1}^{\infty} | p_i | < \frac{\varepsilon}{6} + \frac{\varepsilon}{3} + \frac{\varepsilon}{6} < \varepsilon.$
Hence we have convergence in $ℓ 1$. This proves the lemma. ☐
Example 1.
Note that the subset of $X_h$ consisting of all those vectors whose entropy is exactly h is not compact. Let us demonstrate this fact, say, for $h = \ln 2$. We choose $p_n = \bigl( x_n, \underbrace{\tfrac{1 - x_n}{n}, \tfrac{1 - x_n}{n}, \ldots, \tfrac{1 - x_n}{n}}_{n \text{ times}} \bigr)$, $n \ge 3$, where $x_n$ will be defined momentarily. For arbitrary fixed $n \ge 3$, put:
$f_n(x) = -x \ln x - (1 - x) \ln \frac{1 - x}{n}, \quad 0 \le x \le 1.$
We claim that there exists a unique solution $x_n$ to the equation $f_n(x) = \ln 2$. Indeed, this follows readily from the fact that $f_n(x)$ is concave and $f_n(1) = 0 < \ln 2 < \ln n = f_n(0)$. Denoting $t_n = 1 - x_n$, we have
$-(1 - t_n) \ln(1 - t_n) - t_n \ln t_n + t_n \ln n = \ln 2.$
Hence
$t_n = \frac{1}{\ln n} \left[ \ln 2 + t_n \ln t_n + (1 - t_n) \ln(1 - t_n) \right] \le \frac{\ln 2}{\ln n},$
so that
$x_n \ge 1 - \frac{\ln 2}{\ln n}, \quad n \ge 3,$
and in particular $x_n \xrightarrow[n \to \infty]{} 1$. Thus, $p_n \xrightarrow[n \to \infty]{} (1, 0, 0, 0, \ldots)$ while $H(p_n) = \ln 2$ and $H((1, 0, 0, 0, \ldots)) = 0$, which completes the example.
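The sequence $x_n$ of Example 1 can be computed numerically to see how slowly it approaches 1. The sketch below uses bisection, which is a choice made here rather than anything prescribed by the article; it relies on the concavity of $f_n$, whose maximum is at $x = 1/(n+1)$, so that $f_n$ is decreasing on the search interval.

```python
import math

def f_n(x, n):
    """Entropy of the measure (x, (1-x)/n, ..., (1-x)/n) with n equal small atoms."""
    return -x * math.log(x) - (1.0 - x) * math.log((1.0 - x) / n)

def solve_x_n(n, target=math.log(2), iters=100):
    """Root of f_n(x) = target on [1/(n+1), 1], where f_n decreases from ln(n+1) to 0."""
    lo, hi = 1.0 / (n + 1), 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f_n(mid, n) > target:
            lo = mid          # entropy still above the target: move right
        else:
            hi = mid
    return lo

solutions = {n: solve_x_n(n) for n in (3, 10, 1000, 10 ** 6)}
for n, x in solutions.items():
    lower = 1.0 - math.log(2) / math.log(n)
    print(f"n={n:>8}  x_n={x:.6f}  lower bound={lower:.6f}")
```

The printed lower bound is the estimate $1 - \ln 2 / \ln n$ from the example, so the convergence of $x_n$ to 1 is only logarithmic in n.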
For arbitrary fixed t, the quantity $E U t$ assigns to each point in $X h$ a real number. We will denote this function by $E U t$.
Lemma 2.
The mapping $E U_t$: $X_h \longrightarrow \mathbb{R}$ is Lipschitz with constant 1 with respect to the $\ell_1$ metric.
Proof of Lemma 2.
Consider the function $f : [0, 1] \longrightarrow \mathbb{R}$ given by $f(x) = x (1 - x)^t$, $0 \le x \le 1$. Let M be the Lipschitz constant for $f(x)$. According to Lemma 7 from [4], the candidates for assuming the maximum of $|f'(x)|$ are the points $x_1 = 0$ and $x_2 = \frac{2}{1 + t}$. Now $|f'(x_1)| = 1$ and $|f'(x_2)| = \left( 1 - \frac{2}{1 + t} \right)^{t-1} \le 1$. Hence the Lipschitz constant for $f(x)$ is 1. It follows that if $p, p' \in X_h$, then:
$\left| \sum_{i=1}^{\infty} p_i (1 - p_i)^t - \sum_{i=1}^{\infty} p_i' (1 - p_i')^t \right| = \left| \sum_{i=1}^{\infty} \left( p_i (1 - p_i)^t - p_i' (1 - p_i')^t \right) \right| \le \sum_{i=1}^{\infty} \left| p_i - p_i' \right|.$
☐
Proof of Theorem 2.
(i)
Follows from Lemma 1 and Lemma 2.
(ii)
Suppose that $p = (p_1, p_2, \ldots)$ does not have a finite support. Then, we can find an $n_0$ such that the first $n_0$ entries $p_1, p_2, \ldots, p_{n_0}$ of $p$ assume more than four different values. Put $\tilde{p} = (p_1, p_2, \ldots, p_{n_0})$ and let $c = p_1 + p_2 + \cdots + p_{n_0}$ and $\tilde{h} = H(\tilde{p})$. Consider the optimization problem
$\max_{p} \sum_{i=1}^{n_0} p_i (1 - p_i)^t$ (9)
subject to
$\sum_{i=1}^{n_0} p_i \ln \frac{1}{p_i} \le \tilde{h},$ (10)
$\sum_{i=1}^{n_0} p_i = c.$ (11)
Theorem 1 is still applicable to (9)–(11) with a minor variation. In the beginning of the proof, replace the Lagrangian by
$L(p, \lambda_1, \lambda_2) = \sum_{i=1}^{n} p_i (1 - p_i)^t + \lambda_1 \left( \sum_{i=1}^{n} p_i - c \right) + \lambda_2 \left( \sum_{i=1}^{n} p_i \ln \frac{1}{p_i} - \tilde{h} \right)$
and proceed as previously. Since $p \in X_h$ maximizes $E U_t$, the vector $\tilde{p}$ is a global optimum of this finite-dimensional problem. By Theorem 1, $\tilde{p}$ cannot assume more than four distinct values, which is a contradiction.
☐
Proof of Theorem 3.
For $0 < x < 1$:
$\ln(1 - x) = -x - \frac{x^2}{2} - \cdots - \frac{x^{2t-1}}{2t-1} - \cdots < -x - \frac{x^2}{2} - \cdots - \frac{x^{2t-1}}{2t-1} = \left( -x - \frac{x^{2t-1}}{2t-1} \right) + \cdots + \left( -\frac{x^{t-k}}{t-k} - \frac{x^{t+k}}{t+k} \right) + \cdots + \left( -\frac{x^{t-1}}{t-1} - \frac{x^{t+1}}{t+1} \right) - \frac{x^t}{t}.$ (12)
For each term on the right-hand side of (12) and for $1 \le k \le t - 1$, we have
$-\frac{x^{t-k}}{t-k} - \frac{x^{t+k}}{t+k} < -\frac{x^t}{t-k} - \frac{x^t}{t+k}.$
Indeed,
$-\frac{x^{t-k}}{t-k} - \frac{x^{t+k}}{t+k} + \frac{x^t}{t-k} + \frac{x^t}{t+k} = -x^{t-k} (1 - x^k) \left( \frac{1}{t-k} - \frac{x^k}{t+k} \right) < 0.$
This gives us
$-x - \frac{x^2}{2} - \cdots - \frac{x^{2t-1}}{2t-1} - \cdots < -x^t - \frac{x^t}{2} - \cdots - \frac{x^t}{t} - \cdots - \frac{x^t}{2t-1} = -x^t H_{2t-1}.$
Hence, substituting $x = 1 - p$ for $0 < p < 1$,
$\ln p < -(1 - p)^t H_{2t-1},$
and therefore
$(1 - p)^t < \frac{-\ln p}{H_{2t-1}}.$
Finally, multiplying by $p_i$ and summing over the atoms,
$E U_t = \sum_{i=1}^{\infty} p_i (1 - p_i)^t \le \frac{h}{H_{2t-1}}.$
☐
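The pointwise inequality $\ln p < -(1 - p)^t H_{2t-1}$ at the heart of this proof can be spot-checked on a grid. The sketch below is an illustration only; the grid of p values and the range of t are assumptions chosen here, kept away from the endpoints where rounding error would dominate.

```python
import math

def harmonic(m):
    """H_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / j for j in range(1, m + 1))

def gap(p, t):
    """-(1 - p)**t * H_{2t-1} - ln p, which the proof shows is positive on (0, 1)."""
    return -((1.0 - p) ** t) * harmonic(2 * t - 1) - math.log(p)

grid = [k / 100 for k in range(1, 100)]
smallest_gap = min(gap(p, t) for t in range(1, 41) for p in grid)
print("smallest gap on the grid:", smallest_gap)
```

The gap is smallest for p close to 1 and t = 1, where the inequality reduces to $\ln p < -(1 - p)$.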
Proof of Theorem 4.
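Let $t > 1$ be an integer and $\alpha > 1$. Define $p$ by:
$p_1 = p_2 = \cdots = p_t = \frac{h}{\alpha t \ln t}, \quad p_{t+1} = 1 - \frac{h}{\alpha \ln t}.$
For such t:
$E_p U_t = \left( 1 - \frac{h}{\alpha \ln t} \right) \left( \frac{h}{\alpha \ln t} \right)^t + \frac{h}{\alpha \ln t} \left( 1 - \frac{h}{\alpha t \ln t} \right)^t \ge \frac{h}{\alpha \ln t} \left( 1 - \frac{h}{\alpha t \ln t} \right)^t.$
Now:
$\left( 1 - \frac{h}{\alpha t \ln t} \right)^t = \left[ \left( 1 - \frac{h}{\alpha t \ln t} \right)^{t \ln t} \right]^{1 / \ln t} \xrightarrow[t \to \infty]{} \left( e^{-h/\alpha} \right)^0 = 1.$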
☐
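The measure constructed in this proof is easy to build and inspect for concrete parameters. The sketch below does so for hypothetical values h = 1, α = 2 and t = 1000 (assumptions made here). For finite t the expected missing mass falls slightly short of $h / (\alpha \ln t)$, with a ratio that tends to 1 as t grows.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def expected_missing_mass(p, t):
    return sum(pi * (1.0 - pi) ** t for pi in p)

def theorem4_measure(h, alpha, t):
    """t light atoms of mass h / (alpha * t * ln t) each, plus one heavy atom."""
    light_total = h / (alpha * math.log(t))   # combined mass of the light atoms
    return [light_total / t] * t + [1.0 - light_total]

h, alpha, t = 1.0, 2.0, 1000
p = theorem4_measure(h, alpha, t)
target = h / (alpha * math.log(t))
value = expected_missing_mass(p, t)
print(f"entropy={entropy(p):.4f} (limit h={h}), "
      f"E U_t={value:.5f}, h/(alpha ln t)={target:.5f}, ratio={value / target:.3f}")
```

The entropy of this measure stays below h, so it is feasible for the constrained problem while its expected missing mass is of order $h / \ln t$.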
Proof of Proposition 2.
Let P be the random variable assigning to each atom $i \in S$ its probability:
$P(i) = p_i, \quad i \in S.$
Denote $I = -\ln P$. Then:
$E U_1 = \sum_{i \in S} p_i (1 - p_i) = E[1 - P] = E[1 - e^{-I}].$
The function $f(x) = 1 - e^{-x}$ is concave, so by Jensen’s inequality:
$E U_1 = E[1 - e^{-I}] \le 1 - e^{-E[I]} = 1 - e^{-E[-\ln P]} = 1 - e^{\sum_{i \in S} p_i \ln p_i} = 1 - e^{-h}.$
☐
Remark 2.
If $h = ln k$ for some positive integer k, then the bound is attained for the uniform distribution on a space of k points.
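Proposition 2 and Remark 2 can be illustrated numerically by using the entropy of each measure as h. In the sketch below the random measures, the seed and the helper names are assumptions introduced for the example.

```python
import math
import random

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def missing_mass_t1(p):
    """E[U_1] = sum of p_i * (1 - p_i)."""
    return sum(pi * (1.0 - pi) for pi in p)

def prop2_bound(h):
    return 1.0 - math.exp(-h)

# Remark 2: the uniform measure on k points attains the bound for h = ln k.
k = 5
uniform = [1.0 / k] * k
uniform_value = missing_mass_t1(uniform)
uniform_bound = prop2_bound(math.log(k))

# Non-uniform measures stay below the bound evaluated at their own entropy.
rng = random.Random(3)
pairs = []
for _ in range(100):
    weights = [rng.random() for _ in range(rng.randint(2, 12))]
    total = sum(weights)
    p = [w / total for w in weights]
    pairs.append((missing_mass_t1(p), prop2_bound(entropy(p))))

print(uniform_value, uniform_bound)
print(max(value - bound for value, bound in pairs))
```

The first line prints two equal numbers (up to rounding), and the second prints a non-positive number, consistent with equality in Jensen's inequality holding only for uniform measures.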
Proof of Proposition 3.
First, in the case where $h = \ln k$ we have a unique optimal solution, which is $p_1^* = p_2^* = \cdots = p_{n-k}^* = 0 < p_{n-k+1}^* = p_{n-k+2}^* = \cdots = p_n^* = \frac{1}{k}$. It is straightforward to check that $p^*$ is feasible and attains the upper bound in Proposition 2, and is therefore optimal. Moreover, $p^*$ is unique because any feasible non-uniform choice of $p^*$ leads to a strict inequality in Jensen’s inequality that was used in Proposition 2.
Thus, we deal with the case of strict inequalities, $\ln(k-1) < h < \ln k$. We start by showing that any optimal solution $(p_1^*, p_2^*, \ldots, p_n^*)$ assumes at most two non-zero distinct values. Write down the Lagrangian:
$L(p_1, p_2, \ldots, p_n, \lambda_1, \lambda_2) = \sum_{i=1}^{n} p_i (1 - p_i) + \lambda_1 \left( \sum_{i=1}^{n} p_i - 1 \right) + \lambda_2 \left( \sum_{i=1}^{n} p_i \ln \frac{1}{p_i} - h \right).$
The first-order conditions yield, at any optimal point,
$\frac{\partial L}{\partial p_i} = 1 - 2 p_i + \lambda_1 - \lambda_2 (\ln p_i + 1) = 0,$
for every i with $p_i^* > 0$. Define the function f by $f(x) = 1 - 2x + \lambda_1 - \lambda_2 - \lambda_2 \ln x$. The function vanishes at most twice in $(0, 1)$ because its derivative $f'(x) = -2 - \frac{\lambda_2}{x}$ vanishes at most once. Thus, the non-zero $p_i^*$s assume at most two distinct values. In fact, if all were equal, we would have $h = \ln k$, where k is the number of non-zero $p_i^*$s, so in the present case there are exactly two distinct values for the $p_i^*$s. Disposing of the points of mass 0, we may assume that all n points of S have positive mass. Denote the number of “light” atoms by $\ell$. We will show that $E U_1$ decreases as we increase $\ell$. Denote the mass of a “light” atom by p and write down the entropy constraint with $\ell$ “light” atoms and $n - \ell$ “heavy” ones:
$-\ell p \ln p - (1 - \ell p) \ln \frac{1 - \ell p}{n - \ell} = h.$
Now, define the function $F(\ell, p)$ by:
$F(\ell, p) = -\ell p \ln p - (1 - \ell p) \ln \frac{1 - \ell p}{n - \ell} - h.$
Note that we treat $\ell$ as a continuous variable. The equation $F(\ell, p) = 0$ implicitly defines the function $p(\ell)$. Using the implicit function theorem, we can write an analytic expression for $\frac{dp}{d\ell}$:
$\frac{dp}{d\ell} = -\frac{p(\ell)}{\ell} + \frac{\frac{1 - \ell p(\ell)}{n - \ell} - p(\ell)}{\ell \left[ \ln \frac{1 - \ell p(\ell)}{n - \ell} - \ln p(\ell) \right]}.$
Now write down $E U_1$ as a function of $\ell$ and take the derivative with respect to $\ell$:
$E U_1 = \ell p(\ell) (1 - p(\ell)) + (1 - \ell p(\ell)) \left( 1 - \frac{1 - \ell p(\ell)}{n - \ell} \right).$
$\frac{d E U_1}{d\ell} = \left( p(\ell) + \ell p'(\ell) \right) \left( 1 - p(\ell) \right) - \ell p(\ell) p'(\ell) - \left( p(\ell) + \ell p'(\ell) \right) \left( 1 - \frac{1 - \ell p(\ell)}{n - \ell} \right) - \frac{1 - \ell p(\ell)}{(n - \ell)^2} \left( -n p(\ell) - n \ell p'(\ell) + \ell^2 p'(\ell) + 1 \right).$
Notice that the term $\frac{1 - \ell p(\ell)}{n - \ell}$ is actually the mass of the “heavy” atom, so to simplify notation we put $q(\ell) = \frac{1 - \ell p(\ell)}{n - \ell}$. Substituting the expression for $\frac{dp}{d\ell}$, we obtain:
$\frac{\partial E U_1}{\partial \ell} = \left( q(\ell) - p(\ell) \right)^2 \left[ \frac{2}{\ln \frac{q(\ell)}{p(\ell)}} - \frac{2 p(\ell)}{q(\ell) - p(\ell)} - 1 \right] = \left( q(\ell) - p(\ell) \right)^2 \left[ \frac{2}{\ln \frac{q(\ell)}{p(\ell)}} - \frac{2}{\frac{q(\ell)}{p(\ell)} - 1} - 1 \right].$ (14)
To show that $E U_1$ decreases as we increase $\ell$, it is enough to check that $\frac{\partial E U_1}{\partial \ell} < 0$. It suffices to work out the second term in the product of (14). Using the change of variables $y = \frac{q(\ell)}{p(\ell)} - 1$, we may write:
$\frac{2}{\ln \frac{q(\ell)}{p(\ell)}} - \frac{2}{\frac{q(\ell)}{p(\ell)} - 1} - 1 = \frac{2}{\ln(1 + y)} - \frac{2}{y} - 1.$
It is straightforward to check that $\frac{2}{\ln(1 + y)} - \frac{2}{y} - 1 < 0$ for $y > 0$:
$\frac{2}{\ln(1 + y)} - \frac{2}{y} - 1 = \frac{2y - 2 \ln(1 + y) - y \ln(1 + y)}{y \ln(1 + y)}.$
Notice that $y \ln(1 + y) > 0$, and hence it is enough to check that the numerator is negative. Indeed,
$\left[ 2y - 2 \ln(1 + y) - y \ln(1 + y) \right]_{y = 0} = 0$
and
$\frac{d}{dy} \left[ 2y - 2 \ln(1 + y) - y \ln(1 + y) \right] = \frac{y}{1 + y} - \ln(1 + y).$
Now $\left[ \frac{y}{1 + y} - \ln(1 + y) \right]_{y = 0} = 0$ and $\frac{d}{dy} \left[ \frac{y}{1 + y} - \ln(1 + y) \right] = \frac{1}{(1 + y)^2} - \frac{1}{1 + y} < 0$ for $y > 0$. Thus, $\frac{\partial E U_1}{\partial \ell} < 0$.
It follows that $\ell$ should be as small as possible, which means (since there is at least one light atom) that $\ell = 1$. Finally, as there is one light atom and $n - 1$ heavy ones, the entropy h lies in the interval $(\ln(n-1), \ln n)$. Reverting to the original notations, we have $\ln(k-1) < h < \ln k$. ☐
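The monotonicity in the number of light atoms can be observed numerically for integer values of $\ell$. In the sketch below, for each $\ell$ the light mass p is found by bisection so that the entropy equals h, and $E U_1$ is then evaluated. The bisection solver, the choice n = 5 and the value h = 1.5 are assumptions made here for illustration.

```python
import math

def solve_light_mass(n, ell, h, iters=200):
    """Mass p of each of `ell` light atoms such that the entropy equals h.

    With ell light atoms of mass p and n - ell heavy atoms sharing the rest,
    the entropy increases from ln(n - ell) to ln(n) as p runs over (0, 1/n].
    """
    def entropy(p):
        q = (1.0 - ell * p) / (n - ell)
        return -ell * p * math.log(p) - (n - ell) * q * math.log(q)

    lo, hi = 0.0, 1.0 / n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def missing_mass_t1(n, ell, p):
    """E[U_1] = 1 - (sum of squared masses) for the light/heavy configuration."""
    q = (1.0 - ell * p) / (n - ell)
    return 1.0 - ell * p * p - (n - ell) * q * q

n, h = 5, 1.5          # ln 4 < 1.5 < ln 5, so all n atoms carry positive mass
values = [missing_mass_t1(n, ell, solve_light_mass(n, ell, h))
          for ell in range(1, n)]
for ell, value in enumerate(values, start=1):
    print(f"light atoms: {ell}, E U_1 = {value:.5f}")
```

The printed values decrease as the number of light atoms grows, in line with the conclusion that a single light atom is optimal.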
Proof of Proposition 4.
We use the following refinement of Jensen’s inequality [11]: For any random variable X and concave function $\phi$,
$\phi(E X) - E \phi(X) \ge \Bigl| E \bigl| \phi(X) - \phi(E X) \bigr| - \bigl| \phi_+'(E X) \bigr| \cdot E \bigl| X - E X \bigr| \Bigr|,$ (15)
where $\phi_+'$ denotes the right-hand derivative of $\phi$. For $I = -\ln P$ and $\phi(x) = 1 - e^{-x}$, the left-hand side of (15) is
$\phi(E[I]) - E[\phi(I)] = 1 - e^{-h} - E U_1.$
The right-hand side of (15) gives:
$\Bigl| E \bigl| \phi(I) - \phi(E[I]) \bigr| - \bigl| \phi_+'(E[I]) \bigr| \cdot E \bigl| I - E[I] \bigr| \Bigr|$
$= \Bigl| E \bigl| (1 - e^{-I}) - (1 - e^{-h}) \bigr| - e^{-h} E \bigl| -\ln P - h \bigr| \Bigr|$
$= \Bigl| p \cdot \bigl| e^{-h} - p \bigr| + (n-1) q \cdot \bigl| e^{-h} - q \bigr| - e^{-h} \bigl[ p \cdot \bigl| -\ln p - h \bigr| + (n-1) q \cdot \bigl| -\ln q - h \bigr| \bigr] \Bigr|$
$= \Bigl| p (e^{-h} - p) - (n-1) q (e^{-h} - q) - e^{-h} \bigl[ p (-\ln p - h) - (n-1) q (-\ln q - h) \bigr] \Bigr|$
$= \Bigl| p e^{-h} - p^2 - (n-1) q e^{-h} + (n-1) q^2 - e^{-h} \bigl( -2p \ln p - h - 2ph + h \bigr) \Bigr|$
$= \Bigl| e^{-h} \bigl( 2p - 1 + 2p \ln p + 2ph \bigr) + (n-1) q^2 - p^2 \Bigr|.$
☐

## Acknowledgments

Daniel Berend is supported by the Milken Families Foundation Chair in Mathematics. Aryeh Kontorovich is supported in part by the Israel Science Foundation (grant No. 755/15), Paypal and IBM.

## Author Contributions

The authors contributed equally to this work.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

1. Good, I.J. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 1953, 40, 237–264.
2. McAllester, D.; McAllester, R.E. On The Convergence Rate of Good-Turing Estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford, CA, USA, 28 June–1 July 2000; pp. 1–6.
3. Haslinger, R.; Pipa, G.; Lewis, L.D.; Nikolić, D.; Williams, Z.; Brown, E. Encoding through Patterns: Regression Tree-Based Neuronal Population Models. Neural Comput. 2013, 25, 1953–1993.
4. Berend, D.; Kontorovich, A. The Missing Mass Problem. Stat. Probab. Lett. 2012, 82, 1102–1110.
5. Berend, D.; Kontorovich, A. On The Concentration of the Missing Mass. Electron. Commun. Probab. 2013, 18, 1–7.
6. Kontorovich, A.; Hendler, D.; Menahem, E. Metric Anomaly Detection via Asymmetric Risk Minimization. In SIMBAD 2011. Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 7005, pp. 17–30.
7. Luo, H.P.; Schapire, R. Towards Minimax Online Learning with Unknown Time Horizon. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
8. Sadeqi, M.; Holte, R.C.; Zilles, S. Detecting Mutex Pairs in State Spaces by Sampling. In Australasian Conference on Artificial Intelligence (2013); Lecture Notes in Computer Science; Springer: Cham, Switzerland; Volume 8272, pp. 490–501.
9. Ben-Hamou, A.; Boucheron, S.; Ohannessian, M.I. Concentration Inequalities in the Infinite Urn Scheme for Occupancy Counts and the Missing Mass, with Applications. Bernoulli 2017, 23, 249–287.
10. Han, Y.J.; Jiao, J.T.; Weissman, T. Minimax Estimation of Discrete Distributions under $\ell_1$ Loss. IEEE Trans. Inf. Theory 2015, 61, 6343–6354.
11. Hussain, S.; Pečarić, J. An Improvement of Jensen’s Inequality with Some Applications. Asian Eur. J. Math. 2009, 2, 85–94.
Figure 1. Max $E U 1$ vs. the bounds provided by Propositions 2 and 4.