Article

The Expected Missing Mass under an Entropy Constraint

Daniel Berend, Aryeh Kontorovich and Gil Zagdanski
1 Department of Mathematics, Ben-Gurion University, Beer Sheva 84105, Israel
2 Department of Computer Science, Ben-Gurion University, Beer Sheva 84105, Israel
* Author to whom correspondence should be addressed.
Entropy 2017, 19(7), 315; https://doi.org/10.3390/e19070315
Submission received: 7 June 2017 / Revised: 20 June 2017 / Accepted: 26 June 2017 / Published: 29 June 2017
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Abstract

In Berend and Kontorovich (2012), the following problem was studied: A random sample of size t is taken from a world (i.e., probability space) of size n; bound the expected value of the probability of the set of elements not appearing in the sample (unseen mass) in terms of t and n. Here we study the same problem, where the world may be countably infinite, and the probability measure on it is restricted to have an entropy of at most h. We provide tight bounds on the maximum of the expected unseen mass, along with a characterization of the measures attaining this maximum.

1. Introduction

Let S be a finite probability space. Without loss of generality, suppose that $S = \{1, 2, \ldots, n\}$. Let $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ be a probability measure on S. Suppose that a random sample $X_1, X_2, \ldots, X_t$ is drawn from S according to $\mathbf{p}$. The missing mass is the random variable $U_t$, defined by:
$$U_t = \sum_{i=1}^{n} p_i \,\mathbf{1}_{\{X_1 \neq i, \ldots, X_t \neq i\}}.$$
In words, $U_t$ is the total probability mass of the set of those elements of S not observed at all in the sample. According to the definition of $U_t$, it is easy to verify that $\mathbb{E}U_t = \sum_{i=1}^{n} p_i (1 - p_i)^t$. When we wish to make the dependence on the measure $\mathbf{p}$ explicit, we will write $\mathbb{E}_{\mathbf{p}} U_t$ instead of $\mathbb{E}U_t$.
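As a quick numerical illustration (our addition, not part of the original paper), the following Python sketch, assuming only numpy and hypothetical function names of our choosing, computes $\mathbb{E}U_t = \sum_i p_i(1-p_i)^t$ exactly and also estimates the missing mass by simulation:

import numpy as np

def expected_missing_mass(p, t):
    # Exact value of E[U_t] = sum_i p_i * (1 - p_i)^t for a probability vector p.
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** t))

def simulated_missing_mass(p, t, trials=20000, seed=0):
    # Monte Carlo estimate: draw t samples, add up the mass of the unseen atoms.
    p = np.asarray(p, dtype=float)
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        sample = rng.choice(len(p), size=t, p=p)
        unseen = np.setdiff1d(np.arange(len(p)), sample)
        total += p[unseen].sum()
    return total / trials

if __name__ == "__main__":
    p = np.full(10, 0.1)                      # uniform distribution on 10 atoms
    print(expected_missing_mass(p, 5))        # (1 - 1/10)^5 ~ 0.5905
    print(simulated_missing_mass(p, 5))       # should be close to the exact value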
One of the earliest mentions of the missing mass is in Good–Turing frequency estimation [1]. The latter is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species. This estimator has been used extensively in many machine learning tasks. For example, in the field of natural language modeling, for any sample of words, there is a set of words not occurring in that sample. The total probability mass of the words not in the sample is the so-called missing mass [2]. Another example of using Good–Turing missing mass estimation is in [3], where the total summed probability of all patterns not observed in the training data is estimated. In [4], Berend and Kontorovich showed that the expectation of the missing mass is bounded above as follows:
$$\mathbb{E}U_t \le \begin{cases} e^{-t/n}, & t \le n, \\[2pt] \dfrac{n}{et}, & t > n. \end{cases}$$
(Additionally, deviation bounds were provided in [5].)
Moreover, they have shown that:
  • Every local maximum $\mathbf{p}$ of $\mathbb{E}U_t$ is of the form
    $$p_1 = p_2 = \cdots = p_{n-1} \le p_n$$
    (where without loss of generality we consider only vectors $\mathbf{p}$ with $p_1 \le p_2 \le \cdots \le p_n$). That is, $\mathbf{p}$ consists of one “heavy” atom and $n-1$ “light” ones of identical size, where the possibility of “heavy” = “light” is not excluded.
  • There exists a threshold $\tau = \tau(n) > n$ such that:
    (a)
    For $t \le \tau$, there is a unique global maximum:
    $$p_1 = p_2 = \cdots = p_{n-1} = p_n = \frac{1}{n}.$$
    (b)
    For $t > \tau$, there is a unique global maximum, and it has the form:
    $$p_1 = p_2 = \cdots = p_{n-1} < p_n.$$
For a countably infinite set S, one cannot generally provide a non-trivial upper bound on $\mathbb{E}U_t$ in terms of t only. Indeed, for each n, consider the probability measure on $\mathbb{N}$ supported on $\{1, 2, \ldots, n\}$, giving equal probabilities to these n atoms. Clearly, $\mathbb{E}U_t \ge 1 - \frac{t}{n}$ in this case, and the right-hand side becomes arbitrarily close to 1 as n grows. In [4] it was shown that
$$\mathbb{E}U_t \le \frac{l(\mathbf{p})\, c}{t}, \qquad (1)$$
where $l(\mathbf{p})$ roughly measures the size of sets of atoms of comparable mass and c is a universal constant (for the exact definition, we refer to [4]). The bound given in (1) is non-trivial only if the sequence $(p_i)_{i=1}^{\infty}$ decreases “sufficiently fast”. Such results may be useful, as shown in [6,7,8,9]. Another possible restriction that makes the problem interesting is that the entropy of $\mathbf{p}$ is bounded above by some given value. A similar restriction can be found in [10] in the context of discrete distribution estimation under $\ell_1$ loss. In this work, we study the possibility of providing tight bounds on $\mathbb{E}U_t$ under the restriction of some bound on the entropy. Thus, we can formulate our problem as follows:
$$\sup_{\mathbf{p}} \sum_{i \in S} p_i (1 - p_i)^t \qquad (2)$$
subject to
$$\sum_{i \in S} p_i = 1, \qquad (3)$$
$$\sum_{i \in S} p_i \ln\frac{1}{p_i} \le h, \qquad (4)$$
where $h \ge 0$ is the maximal allowed entropy.
In the case of distributions over countably infinite spaces we set $S = \mathbb{N}$; otherwise, $S = \{1, 2, \ldots, n\}$, or in short, $S = [n]$. Note that we are looking for the supremum, since in the case $S = \mathbb{N}$ it is not a priori clear that the maximum exists (in fact, it turns out that the maximum does exist—see Theorem 2). Additionally, we will show that the maximum is attained by a measure with finite support, which leads us to study the problem for the case of distributions over finite spaces. We also study the structure of local and global maxima and obtain some results analogous to [4].
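For the finite case $S = [n]$, the optimization problem (2)–(4) can also be explored numerically. The sketch below is our own illustration (the paper does not prescribe a solver); it assumes scipy, uses a generic SLSQP routine with random restarts, and therefore only approximates the supremum.

import numpy as np
from scipy.optimize import minimize

def approx_max_missing_mass(n, t, h, restarts=20, seed=0):
    # Approximate sup_p sum_i p_i (1 - p_i)^t subject to sum_i p_i = 1 and H(p) <= h.
    eps = 1e-12
    neg_objective = lambda p: -np.sum(p * (1.0 - p) ** t)
    entropy = lambda p: -np.sum(np.clip(p, eps, 1.0) * np.log(np.clip(p, eps, 1.0)))
    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},        # constraint (3)
        {"type": "ineq", "fun": lambda p: h - entropy(p)},       # constraint (4): H(p) <= h
    ]
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(restarts):
        p0 = rng.dirichlet(np.ones(n))
        res = minimize(neg_objective, p0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * n, constraints=constraints)
        if res.success:
            best = max(best, -res.fun)
    return best

if __name__ == "__main__":
    print(approx_max_missing_mass(n=5, t=3, h=np.log(3)))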

2. Main Results

Our first result is that in the case of $S = \mathbb{N}$, an optimal solution exploits all the available entropy. Denote the entropy of a probability measure $\mathbf{p}$ by $H(\mathbf{p})$:
$$H(\mathbf{p}) = \sum_{i \in S} p_i \ln\frac{1}{p_i}.$$
Proposition 1.
Let $S = \mathbb{N}$, and let $\mathbf{p} = (p_1, p_2, \ldots)$ be a probability measure on S. If $H(\mathbf{p}) < h$, then there exists a probability measure $\mathbf{p}' = (p'_1, p'_2, \ldots)$ on S for which $H(\mathbf{p}') = h$ and $\mathbb{E}_{\mathbf{p}'} U_t > \mathbb{E}_{\mathbf{p}} U_t$.
Corollary 1.
In the problem given by (2)–(4), we may replace (4) by
$$\sum_{i=1}^{\infty} p_i \ln\frac{1}{p_i} = h.$$
In Theorem 1, we refer to the case $S = [n]$ and show that an optimal solution of (2)–(4) cannot assume more than four distinct non-zero values.
Theorem 1.
Let $S = [n]$, and let $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ be any locally optimal solution. Then, the $p_i$’s assume at most four non-zero values; i.e., if the $p_i$’s are sorted, then for some indices $j, k, l$ and $m$, we have
$$p_1 = \cdots = p_j = 0 < p_{j+1} = \cdots = p_k \le p_{k+1} = \cdots = p_l \le p_{l+1} = \cdots = p_m \le p_{m+1} = \cdots = p_n.$$
We do not know whether in some cases there are indeed atoms of four distinct sizes in the optimal solution.
In the case $S = [n]$, it is easy to see that $\mathbb{E}_{\mathbf{p}} U_t$ is continuous with respect to $\mathbf{p}$, and thus $\mathbb{E}U_t$ attains its maximum. On the other hand, when $S = \mathbb{N}$, this is not a priori clear. Our next result shows that $\mathbb{E}U_t$ attains its maximum in this case as well.
Theorem 2.
Let $S = \mathbb{N}$.
(i)
For each $h > 0$, the function $\mathbb{E}U_t$ attains its maximum.
(ii)
If $\mathbf{p}$ is a global maximum point of $\mathbb{E}U_t$, then $\mathbf{p}$ has finite support.
Denote $H_t = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{t}$. Recall that $\ln t \le H_t \le \ln t + 1$ for each t.
Theorem 3.
For all $t \ge 1$, we have $\mathbb{E}U_t \le \frac{h}{H_{2t-1}}$.
In particular, $\mathbb{E}U_t \le \frac{h}{\ln t}$.
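As a sanity check of Theorem 3 (our addition, assuming numpy), the following snippet verifies $\mathbb{E}U_t \le h/H_{2t-1}$ on randomly drawn distributions:

import numpy as np

def harmonic(m):
    return sum(1.0 / k for k in range(1, m + 1))

def check_theorem_3(n=50, t=7, trials=1000, seed=1):
    # Largest observed value of E[U_t] - h / H_{2t-1}; it should never be positive.
    rng = np.random.default_rng(seed)
    H = harmonic(2 * t - 1)
    worst = -np.inf
    for _ in range(trials):
        p = np.clip(rng.dirichlet(np.ones(n)), 1e-300, 1.0)
        eut = np.sum(p * (1.0 - p) ** t)
        h = -np.sum(p * np.log(p))
        worst = max(worst, eut - h / H)
    return worst

if __name__ == "__main__":
    print(check_theorem_3())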
In Theorem 4 we show that given a fixed h, we cannot significantly improve the upper bound from Theorem 3.
Theorem 4.
For fixed h and every $\alpha > 1$, if t is large enough then there exists a distribution $\mathbf{p}$ with $H(\mathbf{p}) \le h$, for which $\mathbb{E}_{\mathbf{p}} U_t \ge \frac{h}{\alpha \ln t}$.
As mentioned earlier, the parameter t represents the size of the sample. It appears that the optimization problem cannot be solved analytically for an arbitrary fixed t. The following results relate to the case t = 1. Obviously, this case is not typical, as one would hardly try to learn much from a sample of size 1. Yet, it may be instructive, as in this case we obtain almost the best possible results.
Proposition 2.
$$\mathbb{E}U_1 \le 1 - e^{-h}, \qquad h > 0.$$
Next, we describe the structure of an optimal solution for the case S = [ n ] with t = 1 .
Proposition 3.
Let $S = [n]$, and let $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ be an optimal solution of the problem
$$\max \sum_{i=1}^{n} p_i (1 - p_i)$$
subject to
$$\sum_{i=1}^{n} p_i = 1, \qquad (6)$$
$$\sum_{i=1}^{n} p_i \ln\frac{1}{p_i} = h, \qquad (7)$$
where $\ln(k-1) < h \le \ln k$, $k \le n$. Then (after sorting), $\mathbf{p}$ is of the form
$$p_1 = p_2 = \cdots = p_{n-k} = 0, \qquad p_{n-k+1} \le p_{n-k+2} = \cdots = p_n.$$
That is, the non-zero atoms of $\mathbf{p}$ consist of one “light” atom and $k-1$ “heavy” ones.
Denote the mass $p_{n-k+1}$ of the light atom in the proposition by p and that of the heavy ones by q. In view of Proposition 3, it suffices to consider the case $k = n$, namely $\ln(n-1) < h \le \ln n$. For this range of h, the following proposition gives a tight upper bound on $\mathbb{E}U_1$:
Proposition 4.
$$\mathbb{E}U_1 \le 1 - e^{-h} - \Big[ e^{-h}\big(2p - 1 + 2p\ln p + 2ph\big) + (n-1)q^2 - p^2 \Big].$$
Remark 1.
When $h = \ln n$ (or $h = \ln(n-1)$), there is equality without the last term (see the proof of Proposition 3); at these points, the last term indeed vanishes. Inside the interval $(\ln(n-1), \ln n)$, Proposition 4 provides an improvement over Proposition 2. In Figure 1, we plot the exact value of $\max \mathbb{E}U_1$ (calculated numerically using MATLAB) against the bounds of Propositions 2 and 4 for $h \in [\ln 3, \ln 4]$. It appears that the additional term in Proposition 4 captures most of the error in Proposition 2.
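The paper’s Figure 1 values were computed in MATLAB; the sketch below is an assumed re-implementation in Python (scipy) that exploits the structure from Proposition 3 (one light atom of mass p and k-1 heavy atoms of mass q, where k plays the role of n in Proposition 4) to compute max E U_1 together with the bounds of Propositions 2 and 4 on (ln 3, ln 4):

import numpy as np
from scipy.optimize import brentq

def light_mass(h, k):
    # Mass p of the single "light" atom when there are k-1 "heavy" atoms of mass
    # q = (1 - p)/(k - 1) and the entropy equals h (Proposition 3 structure).
    def entropy_gap(p):
        q = (1.0 - p) / (k - 1)
        return -p * np.log(p) - (1.0 - p) * np.log(q) - h
    return brentq(entropy_gap, 1e-12, 1.0 / k)

def compare_bounds(h, k=4):
    p = light_mass(h, k)
    q = (1.0 - p) / (k - 1)
    exact = p * (1 - p) + (k - 1) * q * (1 - q)                  # max E[U_1]
    prop2 = 1.0 - np.exp(-h)                                     # Proposition 2
    extra = np.exp(-h) * (2 * p - 1 + 2 * p * np.log(p) + 2 * p * h) \
            + (k - 1) * q ** 2 - p ** 2
    prop4 = prop2 - extra                                        # Proposition 4
    return exact, prop4, prop2

if __name__ == "__main__":
    for h in np.linspace(np.log(3) + 1e-3, np.log(4) - 1e-3, 5):
        print(round(h, 4), compare_bounds(h))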

3. Proofs

Proof of Proposition 1.
Change the measure $\mathbf{p}$ by splitting some atom i into two atoms of sizes $p'_i$ and $p''_i$, where $0 < p'_i < p_i$ and $p''_i = p_i - p'_i$. Let $\tilde{\mathbf{p}} = (p_1, \ldots, p_{i-1}, p'_i, p''_i, p_{i+1}, \ldots)$ be the new measure. The entropy of $\mathbf{p}$ is smaller than that of $\tilde{\mathbf{p}}$ since
$$p_i \ln\frac{1}{p_i} = \big[p'_i + p''_i\big]\ln\frac{1}{p_i} < p'_i \ln\frac{1}{p'_i} + p''_i \ln\frac{1}{p''_i}.$$
Now
$$p_i (1 - p_i)^t = \big[p'_i + p''_i\big](1 - p_i)^t < p'_i (1 - p'_i)^t + p''_i (1 - p''_i)^t,$$
which implies that $\mathbb{E}_{\mathbf{p}} U_t < \mathbb{E}_{\tilde{\mathbf{p}}} U_t$. Similarly, splitting any number of atoms, we increase both the entropy and $\mathbb{E}U_t$.
Now, take the first atom, for example, and split it into k sub-atoms, the first $k-1$ of which are of size p each and the k-th of size $p'$, where $0 \le p \le \frac{p_1}{k}$ and $p' = p_1 - (k-1)p$, and k is still to be determined. The entropy of the new measure is
$$(k-1)\, p \ln\frac{1}{p} + \big(p_1 - (k-1)p\big)\ln\frac{1}{p_1 - (k-1)p} + \sum_{i=2}^{\infty} p_i \ln\frac{1}{p_i}.$$
For sufficiently large k and $p = \frac{p_1}{k}$, this entropy becomes arbitrarily large, and in particular exceeds h. Take such a k, and consider the entropy of the obtained measure as p grows continuously from 0 to $\frac{p_1}{k}$. For $p = 0$, we have basically the original measure (and thus an entropy less than h), while for $p = \frac{p_1}{k}$ the entropy is larger than h. Hence for an appropriate intermediate value of p, the entropy is exactly h. The measure obtained for this p proves our claim. ☐
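A tiny numerical illustration of the splitting step above (our addition, assuming numpy): splitting an atom strictly increases both the entropy and $\mathbb{E}U_t$.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def expected_missing_mass(p, t):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** t))

if __name__ == "__main__":
    t = 4
    p = np.array([0.5, 0.3, 0.2])
    p_split = np.array([0.2, 0.3, 0.3, 0.2])   # the atom of mass 0.5 split into 0.2 + 0.3
    print(entropy(p), entropy(p_split))                                    # entropy goes up
    print(expected_missing_mass(p, t), expected_missing_mass(p_split, t))  # so does E[U_t]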
Proof of Theorem 1.
Write down the Lagrangian:
$$L(\mathbf{p}, \lambda_1, \lambda_2) = \sum_{i=1}^{n} p_i (1 - p_i)^t + \lambda_1\left(\sum_{i=1}^{n} p_i - 1\right) + \lambda_2\left(\sum_{i=1}^{n} p_i \ln\frac{1}{p_i} - h\right).$$
The first-order conditions yield:
$$\frac{\partial L}{\partial p_i} = (1 - p_i)^t - t p_i (1 - p_i)^{t-1} + \lambda_1 - \lambda_2(\ln p_i + 1) = 0.$$
Denote:
$$f(x) = (1 - x)^t - t x (1 - x)^{t-1} + \lambda_1 - \lambda_2(\ln x + 1), \qquad 0 < x < 1.$$
We have:
$$f'(x) = t(1 - x)^{t-2}\big[x(t+1) - 2\big] - \frac{\lambda_2}{x}.$$
We claim that $f'$ vanishes at most three times in $(0, 1)$. Indeed, $f'(x) = 0$ when
$$x\, t(1 - x)^{t-2}\big[x(t+1) - 2\big] = \lambda_2. \qquad (8)$$
Denote the left-hand side of (8) by $g(x)$. Then:
$$g'(x) = -t(1 - x)^{t-3}\big[t^2 x^2 + t(x - 4)x + 2\big].$$
For every two points $x_1 < x_2$ for which (8) holds, there is an intermediate point $x_1 < \xi < x_2$ such that $g'(\xi) = 0$. Now, $g'$ clearly vanishes at no more than two points, so that (8) holds for at most three values of x. By Rolle’s theorem, f therefore vanishes at most four times in $(0, 1)$, and since every non-zero $p_i$ satisfies $f(p_i) = 0$, each $p_i$ assumes one of up to (the same) four values. ☐
Before we prove Theorem 2 we need two auxiliary lemmas. For $h \ge 0$, let $X_h$ be the subset of $\ell_1$ consisting of all non-increasing sequences $(p_1, p_2, \ldots)$ satisfying the following properties:
  • $p_i \ge 0$ for each i and $\sum_{i=1}^{\infty} p_i = 1$.
  • $H(\mathbf{p}) \le h$.
Lemma 1.
$X_h$ is compact under the $\ell_1$ metric.
Proof of Lemma 1.
Let $(\mathbf{p}^n)_{n=1}^{\infty}$ be a sequence in $X_h$, say $\mathbf{p}^n = (p_{n1}, p_{n2}, \ldots)$ for $n \ge 1$. We want to show that it has a convergent subsequence in $X_h$. Employing the diagonal method, we may assume that $\mathbf{p}^n$ converges component-wise. Let $\mathbf{p} = (p_1, p_2, \ldots)$ be the limit. It is clear that $\mathbf{p}$ has non-negative and non-increasing entries, so we only need to show that $\sum_{i=1}^{\infty} p_i = 1$, that $H(\mathbf{p}) \le h$, and that $\mathbf{p}^n \to \mathbf{p}$ in $\ell_1$.
Assume first that $\sum_{i=1}^{\infty} p_i > 1$. Then, there exists an index $i_0$ such that $\sum_{i=1}^{i_0} p_i > 1$. Hence for sufficiently large n, we have $\sum_{i=1}^{i_0} p_{ni} > 1$, which is a contradiction. Hence $\sum_{i=1}^{\infty} p_i \le 1$. Now assume that $\sum_{i=1}^{\infty} p_i < 1$. Put $\varepsilon = 1 - \sum_{i=1}^{\infty} p_i$. Let $i_0$ be an integer, to be determined later. We have $\sum_{i=1}^{i_0} p_{ni} < 1 - \frac{\varepsilon}{2}$ for all sufficiently large n. Note that for every $\mathbf{q} = (q_1, q_2, \ldots) \in X_h$ we have $q_i \le \frac{1}{i}(q_1 + q_2 + \cdots + q_i) \le \frac{1}{i}$. Now we can bound from below the tail entropy of $\mathbf{p}^n$:
$$\sum_{i=i_0+1}^{\infty} p_{ni} \ln\frac{1}{p_{ni}} > \sum_{i=i_0+1}^{\infty} p_{ni} \ln i_0 > \frac{\varepsilon}{2}\ln i_0.$$
Taking $i_0$ large enough, we can make the right-hand side larger than h, which is impossible. Hence $\sum_{i=1}^{\infty} p_i = 1$.
We now show similarly that $H(\mathbf{p}) \le h$. Assume that $\sum_{i=1}^{\infty} p_i \ln\frac{1}{p_i} > h$. Then there exists an $i_0$ such that $\sum_{i=1}^{i_0} p_i \ln\frac{1}{p_i} > h$. Then, however, $\sum_{i=1}^{i_0} p_{ni} \ln\frac{1}{p_{ni}} > h$ for sufficiently large n, which yields a contradiction.
To prove convergence in $\ell_1$, we estimate $\|\mathbf{p}^n - \mathbf{p}\|_1 = \sum_{i=1}^{\infty} |p_{ni} - p_i|$. Let $\varepsilon > 0$. Since $\sum_{i=1}^{\infty} p_i = 1$, we can find an $i_0$ such that $\sum_{i=i_0+1}^{\infty} p_i < \frac{\varepsilon}{6}$. Due to the component-wise convergence, for sufficiently large n we have $\sum_{i=1}^{i_0} |p_{ni} - p_i| < \frac{\varepsilon}{6}$. For such n we also have $\sum_{i=i_0+1}^{\infty} p_{ni} < \frac{\varepsilon}{3}$ since
$$\sum_{i=1}^{i_0} |p_{ni} - p_i| < \frac{\varepsilon}{6} \;\Longrightarrow\; \sum_{i=1}^{i_0} p_{ni} > \sum_{i=1}^{i_0} p_i - \frac{\varepsilon}{6} > 1 - \frac{\varepsilon}{3} \;\Longrightarrow\; \sum_{i=i_0+1}^{\infty} p_{ni} < \frac{\varepsilon}{3}.$$
Thus we have
$$\sum_{i=1}^{\infty} |p_{ni} - p_i| = \sum_{i=1}^{i_0} |p_{ni} - p_i| + \sum_{i=i_0+1}^{\infty} |p_{ni} - p_i| < \frac{\varepsilon}{6} + \sum_{i=i_0+1}^{\infty} p_{ni} + \sum_{i=i_0+1}^{\infty} p_i < \frac{\varepsilon}{6} + \frac{\varepsilon}{3} + \frac{\varepsilon}{6} < \varepsilon.$$
Hence we have convergence in 1 . This proves the lemma. ☐
Example 1.
Note that the subset of $X_h$ consisting of all those vectors whose entropy is exactly h is not compact. Let us demonstrate this fact, say, for $h = \ln 2$. We choose
$$\mathbf{p}^n = \Big(x_n, \underbrace{\tfrac{1 - x_n}{n}, \tfrac{1 - x_n}{n}, \ldots, \tfrac{1 - x_n}{n}}_{n\ \text{times}}\Big), \qquad n \ge 3,$$
where $x_n$ will be defined momentarily. For arbitrary fixed $n \ge 3$, put:
$$f_n(x) = -x\ln x - (1 - x)\ln\frac{1 - x}{n}, \qquad 0 \le x \le 1.$$
We claim that there exists a unique solution $x_n$ to the equation $f_n(x) = \ln 2$. Indeed, this follows readily from the fact that $f_n(x)$ is concave and $f_n(1) = 0 < \ln 2 < \ln n = f_n(0)$. Denoting $t_n = 1 - x_n$, we have
$$-(1 - t_n)\ln(1 - t_n) - t_n\ln t_n + t_n\ln n = \ln 2.$$
Hence
$$t_n = \frac{1}{\ln n}\Big(\ln 2 + t_n\ln t_n + (1 - t_n)\ln(1 - t_n)\Big) \le \frac{\ln 2}{\ln n},$$
so that
$$x_n \ge 1 - \frac{\ln 2}{\ln n}, \qquad n \ge 3,$$
and in particular $x_n \to 1$ as $n \to \infty$. Thus, $\mathbf{p}^n \to (1, 0, 0, 0, \ldots)$, while $H(\mathbf{p}^n) = \ln 2$ and $H((1, 0, 0, 0, \ldots)) = 0$, which completes the example.
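A quick numerical check of Example 1 (our sketch, assuming scipy): solving $f_n(x) = \ln 2$ for a few values of n shows $x_n$ approaching 1 while the entropy of $\mathbf{p}^n$ stays exactly $\ln 2$.

import numpy as np
from scipy.optimize import brentq

def x_n(n, h=np.log(2)):
    # Unique solution of f_n(x) = -x ln x - (1 - x) ln((1 - x)/n) = h.
    f = lambda x: -x * np.log(x) - (1.0 - x) * np.log((1.0 - x) / n) - h
    return brentq(f, 0.5, 1.0 - 1e-12)   # f > 0 at 0.5 and f < 0 just below 1

if __name__ == "__main__":
    for n in (3, 10, 100, 10000):
        print(n, x_n(n), 1.0 - np.log(2) / np.log(n))   # x_n -> 1, matching the bound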
For arbitrary fixed t, the quantity $\mathbb{E}U_t$ assigns a real number to each point of $X_h$; we regard it as a function on $X_h$ and denote this function by $\mathbb{E}U_t$ as well.
Lemma 2.
The mapping $\mathbb{E}U_t : X_h \to \mathbb{R}$ is Lipschitz with constant 1 with respect to the $\ell_1$ metric.
Proof of Lemma 2.
Consider the function $f : [0, 1] \to \mathbb{R}$ given by $f(x) = x(1 - x)^t$, $0 \le x \le 1$. Let M be the Lipschitz constant of f. According to Lemma 7 from [4], the candidates for assuming the maximum of $|f'(x)|$ are the points $x_1 = 0$ and $x_2 = \frac{2}{1 + t}$. Now $|f'(x_1)| = 1$ and $|f'(x_2)| = \big(1 - \frac{2}{1 + t}\big)^{t-1} \le 1$. Hence $M = 1$. It follows that if $\mathbf{p}, \mathbf{p}' \in X_h$, then:
$$\left|\sum_{i=1}^{\infty} p_i (1 - p_i)^t - \sum_{i=1}^{\infty} p'_i (1 - p'_i)^t\right| \le \sum_{i=1}^{\infty} \left|p_i (1 - p_i)^t - p'_i (1 - p'_i)^t\right| \le \sum_{i=1}^{\infty} |p_i - p'_i|.$$
 ☐
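An empirical companion to Lemma 2 (our addition, assuming numpy): for random pairs of distributions, the gap $|\mathbb{E}_{\mathbf{p}} U_t - \mathbb{E}_{\mathbf{p}'} U_t|$ never exceeds the $\ell_1$ distance.

import numpy as np

def expected_missing_mass(p, t):
    return float(np.sum(p * (1.0 - p) ** t))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, n = 5, 20
    worst = -np.inf
    for _ in range(1000):
        p, q = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
        gap = abs(expected_missing_mass(p, t) - expected_missing_mass(q, t))
        worst = max(worst, gap - np.abs(p - q).sum())
    print(worst)   # stays below 0, consistent with the Lipschitz bound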
Proof of Theorem 2.
(i) 
Follows from Lemmas 1 and 2, since a continuous (indeed, Lipschitz) function on a compact metric space attains its maximum.
(ii) 
Suppose that $\mathbf{p} = (p_1, p_2, \ldots)$ does not have a finite support. Then, we can find an $n_0$ such that the first $n_0$ entries $p_1, p_2, \ldots, p_{n_0}$ of $\mathbf{p}$ assume more than four different values. Put $\tilde{\mathbf{p}} = (p_1, p_2, \ldots, p_{n_0})$ and let $c = p_1 + p_2 + \cdots + p_{n_0}$ and $\tilde{h} = H(\tilde{\mathbf{p}})$. Consider the optimization problem
$$\max_{\mathbf{p}} \sum_{i=1}^{n_0} p_i (1 - p_i)^t \qquad (9)$$
subject to
$$\sum_{i=1}^{n_0} p_i \ln\frac{1}{p_i} \le \tilde{h}, \qquad (10)$$
$$\sum_{i=1}^{n_0} p_i = c. \qquad (11)$$
Theorem 1 is still applicable to (9)–(11) with a minor variation. In the beginning of the proof, replace the Lagrangian by
$$L(\mathbf{p}, \lambda_1, \lambda_2) = \sum_{i=1}^{n_0} p_i (1 - p_i)^t + \lambda_1\left(\sum_{i=1}^{n_0} p_i - c\right) + \lambda_2\left(\sum_{i=1}^{n_0} p_i \ln\frac{1}{p_i} - \tilde{h}\right)$$
and proceed as previously. Since $\mathbf{p} \in X_h$ maximizes $\mathbb{E}U_t$, the vector $\tilde{\mathbf{p}}$ is a global optimum of this finite-dimensional problem. By Theorem 1, $\tilde{\mathbf{p}}$ cannot assume more than four distinct values, a contradiction.
 ☐
Proof of Theorem 3.
For 0 < x < 1 :
$$\ln(1 - x) = -\Big(x + \frac{x^2}{2} + \cdots\Big) < -\Big(x + \frac{x^2}{2} + \cdots + \frac{x^{2t-1}}{2t-1}\Big) = -\left[\Big(x + \frac{x^{2t-1}}{2t-1}\Big) + \cdots + \Big(\frac{x^{t-k}}{t-k} + \frac{x^{t+k}}{t+k}\Big) + \cdots + \Big(\frac{x^{t-1}}{t-1} + \frac{x^{t+1}}{t+1}\Big) + \frac{x^t}{t}\right]. \qquad (12)$$
For each pair of terms on the right-hand side of (12) and for $1 \le k \le t-1$, we have
$$-\frac{x^{t-k}}{t-k} - \frac{x^{t+k}}{t+k} < -\frac{x^t}{t-k} - \frac{x^t}{t+k}.$$
Indeed,
$$-\frac{x^{t-k}}{t-k} - \frac{x^{t+k}}{t+k} + \frac{x^t}{t-k} + \frac{x^t}{t+k} = -x^{t-k}\big(1 - x^k\big)\left(\frac{1}{t-k} - \frac{x^k}{t+k}\right) < 0.$$
This gives us
$$-x - \frac{x^2}{2} - \cdots - \frac{x^{2t-1}}{2t-1} < -x^t - \frac{x^t}{2} - \cdots - \frac{x^t}{t} - \cdots - \frac{x^t}{2t-1} = -x^t H_{2t-1}.$$
Hence, substituting $x = 1 - p$,
$$\ln p < -(1 - p)^t H_{2t-1},$$
and therefore
$$(1 - p)^t < \frac{\ln\frac{1}{p}}{H_{2t-1}}.$$
Finally,
$$\mathbb{E}U_t = \sum_{i=1}^{\infty} p_i (1 - p_i)^t < \frac{\sum_{i=1}^{\infty} p_i \ln\frac{1}{p_i}}{H_{2t-1}} \le \frac{h}{H_{2t-1}}.$$
 ☐
Proof of Theorem 4.
Let $t > 1$ be an integer and $\alpha > 1$. Define $\mathbf{p}$ by:
$$p_1 = p_2 = \cdots = p_t = \frac{h}{\alpha t \ln t}, \qquad p_{t+1} = 1 - \frac{h}{\alpha \ln t}.$$
For t large enough, $H(\mathbf{p}) \le h$ (indeed, $H(\mathbf{p}) \to \frac{h}{\alpha} < h$ as $t \to \infty$). For such t:
$$\mathbb{E}_{\mathbf{p}} U_t = \left(1 - \frac{h}{\alpha \ln t}\right)\left(\frac{h}{\alpha \ln t}\right)^t + \frac{h}{\alpha \ln t}\left(1 - \frac{h}{\alpha t \ln t}\right)^t \ge \frac{h}{\alpha \ln t}\left(1 - \frac{h}{\alpha t \ln t}\right)^t.$$
Now:
$$\left(1 - \frac{h}{\alpha t \ln t}\right)^t = \left[\left(1 - \frac{h}{\alpha t \ln t}\right)^{t \ln t}\right]^{1/\ln t} \xrightarrow[t \to \infty]{} \left(e^{-h/\alpha}\right)^0 = 1.$$
Hence $\mathbb{E}_{\mathbf{p}} U_t \ge (1 - o(1))\frac{h}{\alpha \ln t}$, and applying the same construction with a slightly smaller value of $\alpha$ yields $\mathbb{E}_{\mathbf{p}} U_t \ge \frac{h}{\alpha \ln t}$ for all sufficiently large t.
 ☐
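To see the construction from the proof of Theorem 4 in action, here is a small Python check (ours, assuming numpy). The parameters h = 1 and alpha = 2 are arbitrary choices; as the theorem requires, the entropy constraint $H(\mathbf{p}) \le h$ only holds once t is large enough.

import numpy as np

def theorem4_construction(h, alpha, t):
    # t light atoms of mass h/(alpha*t*ln t) plus one heavy atom carrying the rest.
    light = h / (alpha * t * np.log(t))
    p = np.concatenate([np.full(t, light), [1.0 - t * light]])
    entropy = float(-np.sum(p * np.log(p)))
    eut = float(np.sum(p * (1.0 - p) ** t))
    return entropy, eut, h / (alpha * np.log(t))

if __name__ == "__main__":
    h, alpha = 1.0, 2.0
    for t in (100, 1000, 10000):
        H, eut, target = theorem4_construction(h, alpha, t)
        print(t, round(H, 4), round(eut, 6), round(target, 6))
    # H stays below h for these t, and E[U_t] approaches h / (alpha * ln t).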
Proof of Proposition 2.
Let P be the random variable assigning to each atom $i \in S$ its probability:
$$P(i) = p_i, \qquad i \in S.$$
Denote $I = \ln\frac{1}{P}$. Then:
$$\mathbb{E}U_1 = \sum_{i \in S} p_i (1 - p_i) = \mathbb{E}[1 - P] = \mathbb{E}\big[1 - e^{-I}\big].$$
The function $f(x) = 1 - e^{-x}$ is concave, so by Jensen’s inequality:
$$\mathbb{E}U_1 = \mathbb{E}\big[1 - e^{-I}\big] \le 1 - e^{-\mathbb{E}[I]} = 1 - e^{\mathbb{E}[\ln P]} = 1 - e^{\sum_{i \in S} p_i \ln p_i} = 1 - e^{-h}.$$
 ☐
Remark 2.
If h = ln k for some positive integer k, then the bound is attained for the uniform distribution on a space of k points.
Proof of Proposition 3.
First, in the case where $h = \ln k$ we have a unique optimal solution, namely $p_1^* = p_2^* = \cdots = p_{n-k}^* = 0 < p_{n-k+1}^* = p_{n-k+2}^* = \cdots = p_n^* = \frac{1}{k}$. It is straightforward to check that $\mathbf{p}^*$ is feasible and attains the upper bound in Proposition 2, and is therefore optimal. Moreover, $\mathbf{p}^*$ is unique because any feasible non-uniform choice leads to a strict inequality in the application of Jensen’s inequality in the proof of Proposition 2.
Thus, we deal with the case of strict inequalities, $\ln(k-1) < h < \ln k$. We start by showing that any optimal solution $(p_1^*, p_2^*, \ldots, p_n^*)$ assumes at most two distinct non-zero values. Write down the Lagrangian:
$$L(p_1, p_2, \ldots, p_n, \lambda_1, \lambda_2) = \sum_{i=1}^{n} p_i (1 - p_i) + \lambda_1\left(\sum_{i=1}^{n} p_i - 1\right) + \lambda_2\left(\sum_{i=1}^{n} p_i \ln\frac{1}{p_i} - h\right).$$
The first-order conditions yield, at any optimal point,
$$\frac{\partial L}{\partial p_i} = 1 - 2p_i + \lambda_1 - \lambda_2(\ln p_i + 1) = 0,$$
for every i with $p_i^* > 0$. Define the function f by $f(x) = 1 - 2x + \lambda_1 - \lambda_2 - \lambda_2 \ln x$. The function vanishes at most twice in $(0, 1)$ because its derivative $f'(x) = -2 - \frac{\lambda_2}{x}$ vanishes at most once. Thus, the non-zero $p_i^*$’s assume at most two distinct values. In fact, if they were all equal, the entropy would be $\ln k'$, where $k'$ is the number of non-zero $p_i^*$’s, contradicting $\ln(k-1) < h < \ln k$; hence there are exactly two distinct values. Disposing of the points of mass 0, we may assume that all n points of S have positive mass. Denote the number of “light” atoms by $\ell$. We will show that $\mathbb{E}U_1$ decreases as we increase $\ell$. Denote the mass of a “light” atom by p and write down the entropy constraint with $\ell$ “light” atoms and $n - \ell$ “heavy” ones:
$$-\ell\, p \ln p - (1 - \ell p)\ln\frac{1 - \ell p}{n - \ell} = h.$$
Now, define the function $F(\ell, p)$ by:
$$F(\ell, p) = -\ell\, p \ln p - (1 - \ell p)\ln\frac{1 - \ell p}{n - \ell} - h.$$
Note that we treat $\ell$ as a continuous variable. The equation $F(\ell, p) = 0$ implicitly defines the function $p(\ell)$. Using the implicit function theorem, we can write an analytic expression for $\frac{dp}{d\ell}$:
$$\frac{dp}{d\ell} = -\frac{\partial F/\partial \ell}{\partial F/\partial p} = \frac{\dfrac{1 - \ell p(\ell)}{n - \ell} - p(\ell) - p(\ell)\Big(\ln\dfrac{1 - \ell p(\ell)}{n - \ell} - \ln p(\ell)\Big)}{\ell\Big(\ln\dfrac{1 - \ell p(\ell)}{n - \ell} - \ln p(\ell)\Big)}.$$
Now write down $\mathbb{E}U_1$ as a function of $\ell$ and take the derivative with respect to $\ell$:
$$\mathbb{E}U_1 = \ell\, p(\ell)\big(1 - p(\ell)\big) + \big(1 - \ell p(\ell)\big)\left(1 - \frac{1 - \ell p(\ell)}{n - \ell}\right).$$
$$\frac{d\,\mathbb{E}U_1}{d\ell} = p(\ell)\big(1 - p(\ell)\big) + \ell\big(1 - 2p(\ell)\big)\frac{dp}{d\ell} - \Big(p(\ell) + \ell\frac{dp}{d\ell}\Big)\left(1 - \frac{1 - \ell p(\ell)}{n - \ell}\right) + \big(1 - \ell p(\ell)\big)\,\frac{\big(p(\ell) + \ell\frac{dp}{d\ell}\big)(n - \ell) - \big(1 - \ell p(\ell)\big)}{(n - \ell)^2}.$$
Notice that the term $\frac{1 - \ell p(\ell)}{n - \ell}$ is actually the mass of a “heavy” atom, so to simplify notation we put $q(\ell) = \frac{1 - \ell p(\ell)}{n - \ell}$. Substituting the expression for $\frac{dp}{d\ell}$, we obtain:
$$\frac{d\,\mathbb{E}U_1}{d\ell} = \big(q(\ell) - p(\ell)\big)^2\left[\frac{2}{\ln\frac{q(\ell)}{p(\ell)}} - \frac{2p(\ell)}{q(\ell) - p(\ell)} - 1\right] = \big(q(\ell) - p(\ell)\big)^2\left[\frac{2}{\ln\frac{q(\ell)}{p(\ell)}} - \frac{2}{\frac{q(\ell)}{p(\ell)} - 1} - 1\right]. \qquad (14)$$
To show that $\mathbb{E}U_1$ decreases as we increase $\ell$, it is enough to check that $\frac{d\,\mathbb{E}U_1}{d\ell} < 0$, and for this it suffices to show that the second factor in the product in (14) is negative. Using the change of variables $y = \frac{q(\ell)}{p(\ell)} - 1 > 0$, we may write:
$$\frac{2}{\ln\frac{q(\ell)}{p(\ell)}} - \frac{2}{\frac{q(\ell)}{p(\ell)} - 1} - 1 = \frac{2}{\ln(1 + y)} - \frac{2}{y} - 1.$$
It is straightforward to check that $\frac{2}{\ln(1 + y)} - \frac{2}{y} - 1 \le 0$:
$$\frac{2}{\ln(1 + y)} - \frac{2}{y} - 1 = \frac{2y - 2\ln(1 + y) - y\ln(1 + y)}{y\ln(1 + y)}.$$
Notice that $y\ln(1 + y) > 0$, and hence it is enough to check that the numerator is negative. Indeed,
$$\big[2y - 2\ln(1 + y) - y\ln(1 + y)\big]_{y=0} = 0$$
and
$$\frac{d}{dy}\big[2y - 2\ln(1 + y) - y\ln(1 + y)\big] = \frac{y}{1 + y} - \ln(1 + y).$$
Now $\Big[\frac{y}{1 + y} - \ln(1 + y)\Big]_{y=0} = 0$ and $\frac{d}{dy}\Big[\frac{y}{1 + y} - \ln(1 + y)\Big] = \frac{1}{(1 + y)^2} - \frac{1}{1 + y} < 0$ for $y > 0$. Thus, $\frac{d\,\mathbb{E}U_1}{d\ell} < 0$.
It follows that $\ell$ should be as small as possible, which means (since there is at least one light atom) that $\ell = 1$. Finally, as there is one light atom and $n - 1$ heavy ones, the entropy h lies in the interval $(\ln(n-1), \ln n)$. Reverting to the original notation, we have $\ln(k-1) < h < \ln k$. ☐
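A numerical companion to the argument above (ours, assuming scipy): fixing n and h, we solve the entropy constraint for the light mass p at each $\ell = 1, \ldots, n-1$ and observe that $\mathbb{E}U_1$ is largest at $\ell = 1$, i.e., with a single light atom.

import numpy as np
from scipy.optimize import brentq

def eu1_with_light_atoms(n, h, ell):
    # ell atoms of "light" mass p, n - ell atoms of "heavy" mass q = (1 - ell*p)/(n - ell),
    # with the entropy pinned to h; returns the corresponding E[U_1].
    def entropy_gap(p):
        q = (1.0 - ell * p) / (n - ell)
        return -ell * p * np.log(p) - (1.0 - ell * p) * np.log(q) - h
    p = brentq(entropy_gap, 1e-12, 1.0 / n - 1e-12)
    q = (1.0 - ell * p) / (n - ell)
    return ell * p * (1 - p) + (n - ell) * q * (1 - q)

if __name__ == "__main__":
    n, h = 6, np.log(5.5)                              # ln(n - 1) < h < ln n
    for ell in range(1, n):
        print(ell, eu1_with_light_atoms(n, h, ell))    # decreasing in ell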
Proof of Proposition 4.
We use the following refinement of Jensen’s inequality [11]: for any random variable X and concave function $\phi$,
$$\phi(\mathbb{E}X) - \mathbb{E}\phi(X) \ge \Big|\,\mathbb{E}\big|\phi(X) - \phi(\mathbb{E}X)\big| - \phi'_+(\mathbb{E}X)\cdot\mathbb{E}\big|X - \mathbb{E}X\big|\,\Big|, \qquad (15)$$
where $\phi'_+$ denotes the right-hand derivative of $\phi$. For $I = \ln\frac{1}{P}$ and $\phi(x) = 1 - e^{-x}$, the left-hand side of (15) is
$$\phi(\mathbb{E}[I]) - \mathbb{E}[\phi(I)] = 1 - e^{-h} - \mathbb{E}U_1.$$
The right-hand side of (15) gives (recall that $\mathbb{E}[I] = h$, that $\mathbf{p}$ consists of one light atom of mass p and $n-1$ heavy atoms of mass q each, and that $p \le e^{-h} \le q$):
$$\begin{aligned}
\Big|\,\mathbb{E}\big|\phi(I) - \phi(\mathbb{E}[I])\big| - \phi'_+(\mathbb{E}[I])\cdot\mathbb{E}\big|I - \mathbb{E}[I]\big|\,\Big|
&= \Big|\,\mathbb{E}\big|e^{-h} - e^{-I}\big| - e^{-h}\,\mathbb{E}\big|\ln\tfrac{1}{P} - h\big|\,\Big| \\
&= \Big|\, p\big(e^{-h} - p\big) + (n-1)q\big(q - e^{-h}\big) - e^{-h}\Big[p\big({-\ln p} - h\big) + (n-1)q\big(\ln q + h\big)\Big]\Big| \\
&= e^{-h}\big(2p - 1 + 2p\ln p + 2ph\big) + (n-1)q^2 - p^2,
\end{aligned}$$
where the last equality uses $p + (n-1)q = 1$ and $-p\ln p - (n-1)q\ln q = h$. Combining the two sides of (15) yields the proposition.
 ☐

Acknowledgments

Daniel Berend is supported by the Milken Families Foundation Chair in Mathematics. Aryeh Kontorovich is supported in part by the Israel Science Foundation (grant No. 755/15), Paypal and IBM.

Author Contributions

The authors contributed equally to this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Good, I.J. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 1953, 40, 237–264.
  2. McAllester, D.; Schapire, R.E. On the Convergence Rate of Good-Turing Estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, Stanford, CA, USA, 28 June–1 July 2000; pp. 1–6.
  3. Haslinger, R.; Pipa, G.; Lewis, L.D.; Nikolić, D.; Williams, Z.; Brown, E. Encoding through Patterns: Regression Tree-Based Neuronal Population Models. Neural Comput. 2013, 25, 1953–1993.
  4. Berend, D.; Kontorovich, A. The Missing Mass Problem. Stat. Probab. Lett. 2012, 82, 1102–1110.
  5. Berend, D.; Kontorovich, A. On the Concentration of the Missing Mass. Electron. Commun. Probab. 2013, 18, 1–7.
  6. Kontorovich, A.; Hendler, D.; Menahem, E. Metric Anomaly Detection via Asymmetric Risk Minimization. In SIMBAD 2011; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 7005, pp. 17–30.
  7. Luo, H.P.; Schapire, R. Towards Minimax Online Learning with Unknown Time Horizon. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
  8. Sadeqi, M.; Holte, R.C.; Zilles, S. Detecting Mutex Pairs in State Spaces by Sampling. In Australasian Conference on Artificial Intelligence 2013; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2013; Volume 8272, pp. 490–501.
  9. Ben-Hamou, A.; Boucheron, S.; Ohannessian, M.I. Concentration Inequalities in the Infinite Urn Scheme for Occupancy Counts and the Missing Mass, with Applications. Bernoulli 2017, 23, 249–287.
  10. Han, Y.J.; Jiao, J.T.; Weissman, T. Minimax Estimation of Discrete Distributions under ℓ1 Loss. IEEE Trans. Inf. Theory 2015, 61, 6343–6354.
  11. Hussain, S.; Pečarić, J. An Improvement of Jensen’s Inequality with Some Applications. Asian Eur. J. Math. 2009, 2, 85–94.
Figure 1. $\max \mathbb{E}U_1$ vs. the bounds provided by Propositions 2 and 4.
