Article

Redundancy of Exchangeable Estimators

by Narayana P. Santhanam 1,*, Anand D. Sarwate 2,* and Jae Oh Woo 3,*
1 Department of Electrical Engineering, University of Hawaii at Manoa, 2540 Dole Street, Honolulu, HI 96822, USA
2 Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA
3 Applied Mathematics Program, Yale University, 51 Prospect St, New Haven, CT 06511, USA
* Authors to whom correspondence should be addressed.
Entropy 2014, 16(10), 5339-5357; https://doi.org/10.3390/e16105339
Submission received: 19 July 2014 / Revised: 23 September 2014 / Accepted: 8 October 2014 / Published: 13 October 2014
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson–Dirichlet priors (or “Chinese restaurant processes”) and the Pitman–Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.

1. Introduction

A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, require inferences based on observed sequences of symbols in which the sequence length, or sample size, is comparable to or even smaller than the size of the set of symbols, the alphabet. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the vocabulary.
Inference in this setting has received a lot of attention, from Laplace [1–3] in the 18th century, to Good [4] in the mid-20th century, to an explosion of work in the statistics [5–13], information theory [14–17,21,22] and machine learning [23–26] communities in the last few decades. A major strand in the information theory literature on the subject has been based on the notion of patterns. The pattern of a sequence characterizes the repeat structure in the sequence, which is the information that can be described well (see Orlitsky et al. [27] for formal characterizations of this idea). The statistical literature has emphasized the importance of exchangeability, which generalizes the notion of independence.
We consider measures over infinite sequences X_1, X_2, . . ., where the X_i come from a countable (possibly infinite) set, the alphabet. Let 𝒫 be the collection of all distributions over countable (potentially infinite) alphabets. For p ∈ 𝒫, let p^(n) denote the product distribution corresponding to an independent and identically distributed (i.i.d.) sample X_1^n = (X_1, X_2, …, X_n), where X_i ~ p. Let 𝒫^(n) be the collection of all such i.i.d. distributions on length-n sequences drawn from countable alphabets. Let 𝒫^∞ be the collection of all measures over infinite sequences X_1, X_2, . . . of symbols from countable alphabets, where the {X_i} are i.i.d. according to some distribution in 𝒫. These measures are constructed by extending to the Borel sigma algebra the i.i.d. probability assignments on finite-length sequences, namely 𝒫^(n), n ≥ 1. We call 𝒫^∞ the set of i.i.d. measures.
Based on a sample X_1^n = (X_1, X_2, …, X_n) from an unknown p^(n) ∈ 𝒫^(n) (or, equivalently, the corresponding measure in 𝒫^∞), we want to create an estimator q_n that assigns probabilities to length-n sequences. We are interested in the behavior of the sequence of estimators {q_n : n = 1, 2, . . .}. With some abuse of notation, we will use q to denote the estimator q_n when the sample size n is clear from context. We want q_n to approximate p^(n) well; in particular, we would like q_n to neither overestimate nor underestimate the probability of sequences of length n under the true p^(n).
Suppose that there exist R_n > 0 and A_n > 0 such that, for each p ∈ 𝒫^∞, we have:
$$ p^{(n)}\left(\left\{ X_1^n : q_n(X_1^n) > R_n\, p^{(n)}(X_1^n) \right\}\right) < \frac{1}{A_n}. $$
If A_n is any function of n that grows sufficiently fast with n, the sequence {q_n} does not asymptotically overestimate the probabilities of length-n sequences by a factor larger than R_n, with probability one, no matter what measure p generated the sequences.
Protecting against underestimation is not so simple. The redundancy of an estimator q_n (defined formally in Section 2.4) for a length-n sequence x_1^n measures how closely q_n(x_1^n) matches:
$$ \max_{p^{(n)} \in \mathcal{P}^{(n)}} p^{(n)}(x_1^n), $$
the largest probability assigned to x_1^n by any distribution in 𝒫^(n). The redundancy of the estimator itself either maximizes the redundancy over sequences or takes the expectation over all sequences. Ideally, we want the estimator redundancy to grow sublinearly in the sequence length n, so that the per-sample redundancy vanishes as n → ∞. If so, we call the estimator universal for 𝒫^∞. Redundancy thus captures how well q performs against the collection 𝒫^∞, but the connections between estimation problems and compression run deeper.
In this paper, we consider estimators formed by taking a measure (prior) on 𝒫. Different priors induce different distributions on the data X_1^n. We think of the prior as randomly choosing a distribution p in 𝒫, with the observed data X_1^n generated according to this p. How much information about the underlying distribution p can we obtain from the data (assuming we know the prior)? Indeed, a well-known result [28–30] shows that the redundancy of the best possible estimator for 𝒫^∞ equals the maximum (over all choices of priors) information (in bits) that is present about the underlying source in a length-n sequence generated in this manner.
Redundancy is well defined for finite alphabets; recent work [16] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract the identities of symbols and indicate only their relative order of appearance. For example, the pattern of FEDERER is 1232424, while that of PATTERN is 1233456. The crux of the idea is that instead of considering the set of measures over infinite sequences, we consider the set of measures induced over the patterns of the sequences. It then follows that our estimate q_Ψ is a measure over patterns. While the variables in the sequence are i.i.d., the corresponding pattern corresponds to an exchangeable random partition. We can associate a predictive distribution with the pattern probability estimator q_Ψ. This is an estimate of the distribution of X_{n+1} given the previous observations, and it assigns a probability to the event that X_{n+1} will be “new” (has not appeared in X_1^n) and probabilities to the events that X_{n+1} takes on one of the values that has been seen so far.
The above view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [31] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [11,32] collected in [33]. One of the most popular exchangeable random partition processes is the “Chinese restaurant process” [10], which is a special case of the Poisson–Dirichlet or Pitman–Yor process [13,34]. These processes can be viewed as prior distributions on the set of all discrete distributions that can be used as the basis for estimating probabilities and computing predictive distributions.
In this paper, we evaluate the performance of the sequential estimators corresponding to these exchangeable partition processes. As described before, 𝒫 is the collection of all distributions over countable (potentially infinite) alphabets, and 𝒫^∞ is the collection of all i.i.d. measures with single-letter marginals in 𝒫. Let 𝒫^∞_Ψ be the collection of all measures over patterns induced by measures in 𝒫^∞. We evaluate the redundancy of estimators based on the Chinese restaurant process (CRP), the Pitman–Yor (PY) process and the Ewens sampling formula against 𝒫^∞_Ψ.
In the context of sequential estimation, early work [16] showed that for the collection 𝒫^∞_Ψ of measures over patterns, universal estimators do exist: the normalized redundancy is O(n^{1/2}). More recent work [18,19] proved tight bounds on worst-case and average redundancy; these results show that there are sequential estimators with normalized redundancy of O(n^{1/3}). However, these estimators are computationally intensive and (generally speaking) infeasible in practice. Acharya et al. [20] demonstrated a linear-time estimator with average redundancy O(n^{1/2}), improving over the earlier constructions achieving O(n^{2/3}) [16]. By contrast, estimators such as the CRP or the PY estimator were not developed in a universal compression framework, but have been very successful from a practical standpoint. The goal of this paper is to understand these Bayesian estimators from the universal compression perspective.
For the case of the estimators studied in nonparametric Bayesian statistics, our results show that they are in general neither weakly nor strongly universal when compressing patterns or equivalently exchangeable random partitions. While the notion of redundancy is in some sense different from other measures of accuracy, such as the concentration of the posterior distribution about the true distribution, the parameters of the CRPs or the PY processes that do compress well often correspond to the maximum likelihood estimates obtained from the sample.
Because we choose to measure redundancy in the worst case over p, the underlying alphabet size may be arbitrarily large with respect to the sample size n. Consequently, for a fixed sample of size n, the number of symbols could be large, for example, a constant fraction of n. The CRP and PY estimators do not have good redundancy against such samples, since such samples are not what these estimators are designed for. However, we can show that a mixture of CRP estimators is weakly universal. This mixture is formed by combining individually optimized CRP estimators that (implicitly) assume a bound on the support of p. If such a bound is known in advance, we can derive much tighter bounds on the redundancy. In this setting, the two-parameter Poisson–Dirichlet (or Pitman–Yor) estimator is superior to the estimator derived from the Chinese restaurant process.
In order to describe our results, we require a variety of definitions from different research communities. In the next section, we describe this preliminary material and place it in context before describing the main results in Section 3.

2. Preliminaries

In this paper, we use the “big-O” notation. A function f(n) = O(g(n)) if there exists a positive constant C such that, for sufficiently large n, |f(n)| ≤ C|g(n)|. A function f(n) = Ω(g(n)) if there exists a positive constant C′ such that, for sufficiently large n, |f(n)| ≥ C′|g(n)|. A function f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)).
Let 𝒫_k denote the set of all probability distributions on alphabets of size k and 𝒫_∞ the set of all probability distributions on countably infinite alphabets, and let:
$$ \mathcal{P} = \Bigl( \bigcup_{k \ge 1} \mathcal{P}_k \Bigr) \cup \mathcal{P}_\infty $$
be the set of all discrete distributions irrespective of support and support size.
For a fixed p ∈ 𝒫, let x_1^n = (x_1, x_2, …, x_n) be a sequence drawn i.i.d. according to p. We denote the pattern of x_1^n by ψ_1^n. The pattern is formed by taking ψ_1 = 1 and:
$$ \psi_i = \begin{cases} \psi_j & \text{if } x_i = x_j \text{ for some } j < i, \\ 1 + \max_{j < i} \psi_j & \text{if } x_i \ne x_j \text{ for all } j < i. \end{cases} $$
For example, the pattern of x_1^7 = FEDERER is ψ_1^7 = 1232424. Let Ψ^n be the set of all patterns of length n. We write p(ψ^n) for the probability that a length-n sequence generated by p has pattern ψ^n. For a pattern ψ_1^n, we write φ_μ for the number of symbols that appear μ times in ψ_1^n, and m = ∑_μ φ_μ is the number of distinct symbols in ψ_1^n. We call φ_μ the prevalence of μ. Thus, for FEDERER, we have φ_1 = 2, φ_2 = 1, φ_3 = 1 and m = 4.
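As a quick illustration of these definitions, the following minimal Python sketch (ours, not part of the paper) computes the pattern of a sequence and its prevalences φ_μ; running it on FEDERER reproduces the example above.

```python
from collections import Counter

def pattern(seq):
    """Replace each symbol by the order (1-indexed) in which it first appears."""
    first_seen = {}
    out = []
    for x in seq:
        if x not in first_seen:
            first_seen[x] = len(first_seen) + 1
        out.append(first_seen[x])
    return out

def prevalences(psi):
    """Return {mu: phi_mu}: the number of distinct symbols appearing mu times."""
    counts = Counter(psi)            # how often each distinct symbol appears
    return dict(Counter(counts.values()))

psi = pattern("FEDERER")
print(psi)                 # [1, 2, 3, 2, 4, 2, 4]
print(prevalences(psi))    # {1: 2, 3: 1, 2: 1}, i.e., phi_1 = 2, phi_2 = 1, phi_3 = 1
```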

2.1. Exchangeable Partition Processes

An exchangeable random partition refers to a sequence (C_n : n ∈ ℕ), where C_n is a random partition of the set [n] = {1, 2, . . . , n}, satisfying the following conditions: (i) the probability that C_n is a particular partition depends only on the vector (s_1, s_2, . . . , s_n), where s_k is the number of parts of size k in the partition; and (ii) the realizations of the sequence are consistent, in that all of the parts of C_n are also parts of the partition C_{n+1}, except that the new element n + 1 may either form a new part of C_{n+1} by itself or join one of the existing parts of C_n.
For a sequence X_1, . . . , X_n from a discrete alphabet, one can partition the set [n] into component sets {A_x}, where A_x = {i : X_i = x} is the set of positions in which x has appeared. When the {X_i} are drawn i.i.d. from a distribution in 𝒫, the corresponding sequence of random partitions is called a paintbox process.
The remarkable Kingman representation theorem [8] states that the probability measure induced by any exchangeable random partition is a mixture of paintbox processes, where the mixture is taken using a probability measure (a “prior” in Bayesian terminology) on the class of paintbox processes. Since each paintbox process corresponds to a discrete probability measure (the one from which the i.i.d. X_i generating the paintbox process are drawn), the prior may be viewed as a distribution on the set of probability measures on a countable alphabet. For technical reasons, the alphabet is assumed to be hybrid, with both a discrete and a continuous part, and one needs to work with the space of ordered probability vectors (see [35] for details).

2.2. Dirichlet Priors and Chinese Restaurant Processes

Not surprisingly, special classes of priors give rise to special classes of exchangeable random partitions. One particularly nice class of priors on the set of probability measures on a countable alphabet is that of the Poisson–Dirichlet priors [5,36,37] (sometimes called Dirichlet processes, since they live on the infinite-dimensional space of probability measures and generalize the usual finite-dimensional Dirichlet distribution).
The Chinese restaurant process (or CRP) is related to the so-called Griffiths–Engen–McCloskey (GEM) distribution with parameter θ, denoted by GEM(θ). Consider W_1, W_2, . . . drawn i.i.d. according to a Beta(1, θ) distribution, and set:
$$ p_1 = W_1, \qquad p_i = W_i \prod_{j < i} (1 - W_j), \quad i > 1. $$
This can be interpreted as follows: take a stick of unit length and break it into pieces of size W1 and 1 – W1. Now take the piece of size 1 – W1 and break off a W2 fraction of that. Continue in this way. The resulting lengths of the sticks create a distribution on a countably infinite set. The distribution of the sequence p = (p1, p2, . . .) is the GEM(θ) distribution.
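The stick-breaking recipe above translates directly into code. The following sketch (an illustration of ours, not taken from the paper; it assumes numpy is available) draws a truncated GEM(θ) vector:

```python
import numpy as np

def gem_sample(theta, num_atoms=1000, rng=None):
    """Draw the first `num_atoms` stick lengths of a GEM(theta) random vector."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.beta(1.0, theta, size=num_atoms)         # W_i ~ Beta(1, theta)
    # remaining[i] = prod_{j < i} (1 - W_j), with remaining[0] = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    return w * remaining                              # p_i = W_i * prod_{j<i}(1 - W_j)

p = gem_sample(theta=2.0, num_atoms=2000)
print(p.sum())                  # close to 1 for a long enough truncation
print(np.sort(p)[::-1][:5])     # sorting in decreasing order gives (a truncation of) PD(theta)
```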

Remark

Let π denote the elements of p sorted in decreasing order, so that π_1 ≥ π_2 ≥ ···. Then, the distribution of π is the Poisson–Dirichlet distribution PD(θ) as defined by Kingman.
Another popular class of distributions on probability vectors is the Pitman–Yor family of distributions [13], also known as the two-parameter Poisson–Dirichlet family of distributions PD(α, θ). The two parameters here are a discount parameter α ∈ [0, 1] and a strength parameter θ > −α. The distribution PD(α, θ) can be generated in a similar way to the Poisson–Dirichlet distribution PD(θ) = PD(0, θ) described earlier. Let each W_i be drawn independently according to a Beta(1 − α, θ + iα) distribution, and again, set:
$$ \tilde{p}_1 = W_1, \qquad \tilde{p}_i = W_i \prod_{j < i} (1 - W_j), \quad i > 1. $$
A similar “stick-breaking” interpretation holds here, as well. Now, let p be the sequence (p̃_1, p̃_2, . . .) sorted in descending order. The distribution of p is PD(α, θ). If we have α < 0 and θ = r|α| for an integer r, we may obtain a symmetric Dirichlet distribution of dimension r.

2.3. Pattern Probability Estimators

Given a sample x_1^n with pattern ψ^n, we would like to produce a pattern probability estimator. This is a function of the form q(ψ_{n+1} | ψ^n) that assigns a probability to each symbol previously seen in ψ^n, as well as a probability of seeing a new symbol. In this paper, we will investigate two different pattern probability estimators based on Bayesian models.
The Ewens sampling formula [38–40], which has its origins in theoretical population genetics, is a formula for the probability mass function of a marginal of a CRP corresponding to a fixed population size. In other words, it specifies the probability of an exchangeable random partition of [n] that is obtained when one uses the Poisson–Dirichlet PD(θ) prior to mix paintbox processes. Because of the equivalence between patterns and exchangeable random partitions, it estimates the probability of a pattern ψ_1^n via the following formula:
$$ q_\theta^{\mathrm{CRP}}(\psi_1, \ldots, \psi_n) = \frac{\theta^{m}}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{\mu=1}^{n} \bigl[(\mu-1)!\bigr]^{\varphi_\mu}. \qquad (1) $$
Recall that φ_μ is the number of symbols that appear μ times in ψ_1^n. In particular, the predictive distribution associated with the Ewens sampling formula, or Chinese restaurant process, is:
$$ q_\theta^{\mathrm{CRP}}(\psi \mid \psi_1, \ldots, \psi_n) = \begin{cases} \dfrac{\mu}{n+\theta} & \psi \text{ appeared } \mu \text{ times in } \psi_1, \ldots, \psi_n; \\[6pt] \dfrac{\theta}{n+\theta} & \psi \text{ is a new symbol.} \end{cases} \qquad (2) $$
More generally, one can define the Pitman–Yor predictor (for α ∈ [0, 1] and θ > −α) as:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(\psi \mid \psi_1, \ldots, \psi_n) = \begin{cases} \dfrac{\mu - \alpha}{n+\theta} & \psi \text{ appeared } \mu \text{ times in } \psi_1, \ldots, \psi_n; \\[6pt] \dfrac{\theta + m\alpha}{n+\theta} & \psi \text{ is a new symbol,} \end{cases} \qquad (3) $$
where m is the number of distinct symbols in ψ_1^n. The probability assigned by the Pitman–Yor predictor to a pattern ψ_1^n is therefore:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1, \ldots, \psi_n) = \frac{\theta(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{\mu=1}^{n} \left( \frac{\Gamma(\mu-\alpha)}{\Gamma(1-\alpha)} \right)^{\varphi_\mu}. \qquad (4) $$
Note that Γ(μ − α)/Γ(1 − α) = (μ − 1 − α)(μ − 2 − α) ⋯ (1 − α).
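The closed-form expression (4) is simply the product of the sequential predictive probabilities in (3). The following Python sketch (for illustration only; the function names are ours) computes a pattern probability both ways so the two can be checked against each other; setting α = 0 recovers the CRP/Ewens case of (1) and (2).

```python
from collections import Counter
from math import prod, gamma

def py_sequential(psi, alpha, theta):
    """Multiply the Pitman-Yor predictive probabilities along the pattern."""
    p, counts = 1.0, Counter()
    for n_seen, s in enumerate(psi):                 # n_seen symbols observed so far
        m = len(counts)                              # distinct symbols so far
        if s in counts:                              # previously seen symbol
            p *= (counts[s] - alpha) / (n_seen + theta)
        else:                                        # new symbol
            p *= (theta + m * alpha) / (n_seen + theta)
        counts[s] += 1
    return p

def py_closed_form(psi, alpha, theta):
    """Evaluate the closed-form pattern probability."""
    counts, n = Counter(psi), len(psi)
    m = len(counts)
    num = prod(theta + j * alpha for j in range(1, m))   # (theta+alpha)...(theta+(m-1)alpha)
    den = prod(theta + i for i in range(1, n))           # (theta+1)...(theta+n-1)
    gammas = prod(gamma(mu - alpha) / gamma(1 - alpha) for mu in counts.values())
    return (num / den) * gammas

psi = [1, 2, 3, 2, 4, 2, 4]                              # pattern of FEDERER
print(py_sequential(psi, 0.5, 1.0), py_closed_form(psi, 0.5, 1.0))   # equal
print(py_sequential(psi, 0.0, 1.0), py_closed_form(psi, 0.0, 1.0))   # CRP case
```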

2.4. Strong Universality Measures: Worst-Case and Average

How should we measure the quality of a pattern probability predictor q? We investigate two criteria here: the worst-case and the average-case redundancy. The redundancy of q on a given pattern ψn is:
$$ R(q, \psi^n) \overset{\mathrm{def}}{=} \sup_{p} \log \frac{p(\psi^n)}{q(\psi^n)}. $$
The worst-case redundancy of q is defined to be:
$$ \hat{R}(q) \overset{\mathrm{def}}{=} \max_{\psi^n \in \Psi^n} \sup_{p} \log \frac{p(\psi^n)}{q(\psi^n)} = \sup_{p} \max_{\psi^n \in \Psi^n} \log \frac{p(\psi^n)}{q(\psi^n)}. $$
Recall that p(ψn) just denotes the probability that a length-n sequence generated by p has pattern ψn; it is unnecessary to specify the support here. Since the set of length-n patterns is finite, there is no need for a supremum in the outer maximization above. The worst-case redundancy is often referred to as the per-sequence redundancy, as well.
The average-case redundancy replaces the maximum over patterns with an expectation under p:
$$ \bar{R}(q) \overset{\mathrm{def}}{=} \sup_{p} \mathbb{E}_p\!\left[ \log \frac{p(\psi^n)}{q(\psi^n)} \right] = \sup_{p} D(p \,\|\, q), $$
where D(·||·) is the Kullback–Leibler divergence or relative entropy. That is, the average-case redundancy is nothing but the worst-case Kullback–Leibler divergence between the distribution p and the predictor q.
A pattern probability estimator is considered “good” if the worst-case or average-case redundancies are sublinear in n, that is, if R̂(q)/n → 0 and R̄(q)/n → 0 as n → ∞. Succinctly put, redundancy that is sublinear in n implies that the underlying probability of a sequence can be estimated accurately almost surely. Redundancy is one way to measure the “frequentist” properties of the Bayesian approaches we consider in this paper, and it refers to the compressibility of the distribution from an information-theoretic perspective.
As mentioned in the Introduction, redundancy differs from notions, such as the concentration of the posterior distribution about the true distribution. However, the parameters of the CRPs or the PY processes that compress well often correspond to the maximum likelihood (ML) estimates from the sample.

2.5. Weak Universality

In the previous section, we considered guarantees that hold over the entire model class; both the worst-case and the average-case redundancy involve taking a supremum over the entire model class. Therefore, the strong guarantees, average or worst-case, hold uniformly over the model class. However, as we will see, exchangeable estimators, in particular the Chinese restaurant process and Pitman–Yor estimators, are tuned, by the appropriate choice of parameters, towards specific kinds of sources rather than towards all i.i.d. models. This behavior is better captured by looking at the model-dependent convergence of the exchangeable estimators, which is known as weak universality. Specifically, consider a collection of i.i.d. measures over infinite sequences, and let 𝒫_Ψ be the collection of measures induced on patterns by this collection. We say that an estimator q is weakly universal for 𝒫_Ψ if, for all p ∈ 𝒫_Ψ:
$$ \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_p\!\left[ \log \frac{p(\psi^n)}{q(\psi^n)} \right] = 0. $$

3. Strong Redundancy

We now describe our main results on the redundancy of estimators derived from the prior distributions on 𝒫.

3.1. Chinese Restaurant Process Predictors

Previously [41], it was shown by some of the authors that the worst-case and average-case redundancies for the CRP estimator are both Ω(n log n), which means that it is not strongly universal. However, this negative result follows because the CRP estimator is tuned not to the entire i.i.d. class of distributions, but to a specific subclass of i.i.d. sources depending on the choice of parameter. To investigate this further, we analyze the redundancy of the CRP estimator when we have a bound on the number m of distinct elements in the pattern ψ_1^n.
Chinese restaurant processes q_θ^CRP(·) with parameter θ are known to generate exchangeable random partitions in which the number of distinct parts M satisfies M/log n → θ almost surely as the sample size n increases; see, e.g., [42]. Equivalently, the CRP generates patterns with M distinct symbols, where M/log n → θ. The following theorem, however, reverses the above setting. Here, we are given an i.i.d. sample of data of length n with m symbols (how the data were generated is not important), but we pick the parameter of a CRP estimator that describes the pattern of the data well. While it is satisfying that the chosen parameter matches the ML estimate of θ based on the number of observed symbols, note that this need not be the only parameter choice that works.

Theorem 1

[Redundancy for CRP estimators] Consider the estimator q_θ^CRP(ψ_1^n) in (1) and (2). Then, for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies:
$$ m \le C \cdot \frac{n (\log\log n)^2}{\log n}, $$
the redundancy of the predictor q_θ^CRP(ψ_1^n) with ⌈θ⌉ = m/log n satisfies:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le 3C \cdot \frac{n (\log\log n)^3}{\log n} = o(n). $$

Proof

The number of patterns with prevalence {φμ} is:
$$ \frac{n!}{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}, $$
and therefore:
$$ p(\psi_1^n) \le \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{n!}, \qquad (5) $$
since patterns with prevalence {φμ} all have the same probability.
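As a sanity check on this counting step (a sketch of ours, not part of the paper), one can enumerate all patterns of a small length n by brute force and verify that the number of patterns with a given prevalence profile matches the formula above:

```python
from collections import Counter
from math import factorial, prod

def all_patterns(n):
    """Enumerate all patterns of length n (restricted growth strings starting at 1)."""
    def extend(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for s in range(1, max(prefix, default=0) + 2):   # reuse a symbol or add one new one
            yield from extend(prefix + [s])
    return extend([])

def profile(psi):
    """Prevalence profile of a pattern, as sorted (mu, phi_mu) pairs."""
    return tuple(sorted(Counter(Counter(psi).values()).items()))

def count_formula(n, prof):
    """n! / prod_mu (mu!)^phi_mu phi_mu!"""
    return factorial(n) // prod(factorial(mu) ** phi * factorial(phi) for mu, phi in prof)

n = 6
observed = Counter(profile(psi) for psi in all_patterns(n))
assert all(count == count_formula(n, prof) for prof, count in observed.items())
print("counting formula verified for n =", n)
```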
Using the upper bound (5) on p(ψ_1^n) together with (1) yields:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{\prod_{\mu=1}^{n} [(\mu-1)!]^{\varphi_\mu}\, \theta^m} + \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} = \log\!\left( \prod_{\mu=1}^{n} \mu^{\varphi_\mu} \right) + \log\!\left( \frac{1}{\theta^m} \prod_{\mu=1}^{n} \varphi_\mu! \right) + \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!}. \qquad (6) $$
Let θ̄ = ⌈θ⌉. The following bound follows from Stirling’s approximation:
$$ \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} \le \frac{(\bar\theta+n)!}{\bar\theta!\, n!} \le \frac{(\bar\theta+n)^{\bar\theta+n}}{\bar\theta^{\bar\theta}\, n^{n}} = \left( \frac{\bar\theta+n}{\bar\theta} \right)^{\bar\theta} \left( \frac{\bar\theta+n}{n} \right)^{n}. \qquad (7) $$
The first term in (6) can be upper bounded by log (n/m)^m, since the product ∏_μ μ^{φ_μ} is maximized, subject to ∑_μ μφ_μ = n and ∑_μ φ_μ = m, when all m symbols appear μ = n/m times. The second term is also maximized when all symbols appear the same number of times, corresponding to φ_μ = m for a single μ. Therefore:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \left( \frac{n}{m} \right)^{m} + \log \frac{m!}{\theta^m} + \log \left( \frac{\bar\theta+n}{\bar\theta} \right)^{\bar\theta} \left( 1 + \frac{\bar\theta}{n} \right)^{n}. $$
Now, (1 + θ̄/n)^n ≤ e^θ̄ for sufficiently large n, so:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \left( \frac{n}{m} \right)^{m} + \log \frac{m!}{\theta^m} + \log \left( \frac{(\bar\theta+n)\,e}{\bar\theta} \right)^{\bar\theta}. \qquad (8) $$
Choose θ̄ = m/log n. This gives the bound:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le m \log \frac{n}{m} + \log \frac{m!}{m^m}\left( \frac{\bar\theta}{\theta} \right)^{m} + m \log\log n + \log \left( \frac{(\bar\theta+n)\,e}{\bar\theta} \right)^{\bar\theta}. $$
The second term is negative for sufficiently large m. Therefore:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le m \log \frac{n}{m} + m \log\log n + \frac{m}{\log n} \log\!\left( 2 + \frac{n \log n}{m} \right). \qquad (9) $$
Noting that the function above is monotonic in m for n ≥ 16, we choose:
$$ m = C\, \frac{n (\log\log n)^2}{\log n}, $$
and the bound becomes:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le C\, \frac{n (\log\log n)^2}{\log n} \log\!\left( \frac{\log n}{(\log\log n)^2} \right) + C\, \frac{n (\log\log n)^3}{\log n} + C\, n \left( \frac{\log\log n}{\log n} \right)^{2} \log\!\left( 2 + \left( \frac{\log n}{\log\log n} \right)^{2} \right) \le 3C\, \frac{n (\log\log n)^3}{\log n} = o(n). $$
This theorem is slightly dissatisfying, since it requires us to have a bound on the number of distinct symbols in the pattern. In Section 4, we take mixtures of CRP estimators to arrive at estimators that are weakly universal.

3.2. Pitman–Yor Predictors

We now turn to the more general class of Pitman–Yor predictors. We obtain a result similar to that for the CRP estimator, but now we can handle all patterns with m = o(n) distinct symbols.
As before, the context for the following theorem is this: we are given an i.i.d. sample of data of length n with m symbols (again, how the data were generated is not important), but we pick the parameters of a PY estimator that describes the pattern of the data well. The chosen PY estimator is not necessarily the best one, but it will help us construct the weakly universal estimator in later sections of this paper.
We also note that the choice of the parameter θ below is analogous to our choice when α = 0 (reducing to the CRP case). For patterns generated by a PY process with 0 < α ≤ 1, the number of distinct symbols grows as n^α. It is known that in this regime the choice of θ is not distinguishable [13]. What is known, however, is that the choice of θ remains o(n^α), which is achieved by the selection of θ below. As the reader will note, as long as 0 < α < 1 is fixed, the theorem below will help us construct weakly universal estimators further on.

Theorem 2

[Worst-case redundancy] Consider the estimator q_{α,θ}^PY(ψ_1^n). Then, for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies m = o(n), the worst-case redundancy of the predictor q_{α,θ}^PY(ψ_1^n) with θ = m/log n satisfies:
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} = o(n). $$

Proof

For a pattern ψ_1^n, from the definition of q_{α,θ}^PY(ψ_1^n) in (4) and the bound (5),
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} \le \log\!\left( \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{n!} \cdot \frac{(\theta+1)\cdots(\theta+n-1)}{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha)} \prod_{\mu=1}^{n} \left( \frac{\Gamma(1-\alpha)}{\Gamma(\mu-\alpha)} \right)^{\varphi_\mu} \right). $$
We can bound the components separately. First, as before we have:
$$ \prod_{\mu=1}^{n} \varphi_\mu! \le m!. $$
Since θ > −α, we have θ + α > 0 and:
$$ (\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha) \ge (\theta+\alpha)\cdot\alpha\cdot(2\alpha)\cdots((m-2)\alpha) = (\theta+\alpha)\,(m-2)!\,\alpha^{m-2}. $$
Again, letting θ̄ = ⌈θ⌉, from the same arguments as in (7) and (8),
$$ \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} \le \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta}. $$
Finally, note that (1 − α)(2 − α) · · · (μ − 1 − α) ≥ (1 − α)(μ − 2)!, so:
$$ \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}}{\prod_{\mu=1}^{n} \bigl[(1-\alpha)(2-\alpha)\cdots(\mu-1-\alpha)\bigr]^{\varphi_\mu}} \le \prod_{\mu=1}^{n} \left( \frac{\mu!}{(1-\alpha)(\mu-2)!} \right)^{\varphi_\mu} \le \frac{\prod_{\mu=1}^{n} \mu^{2\varphi_\mu}}{(1-\alpha)^{m}} \le \frac{(n/m)^{2m}}{(1-\alpha)^{m}}. $$
Putting this together:
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} \le \log \frac{m!}{(\theta+\alpha)\,(m-2)!\,\alpha^{m-2}} + \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta} + \log \frac{(n/m)^{2m}}{(1-\alpha)^{m}} \le 2m \log \frac{n}{m} + (m-2) \log \frac{1}{(1-\alpha)\alpha} + \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta} + \log \frac{m^2}{\theta+\alpha} + \log \frac{1}{(1-\alpha)^2}. \qquad (12) $$
If m = o(n), then the right side above is o(n), as desired.
It is well known that the Pitman–Yor process can produce patterns whose relative frequency is zero, e.g., the pattern 1^k 2 3 ⋯ (n − k). Therefore, it is not surprising that the worst-case and average-case redundancies can be bad. However, as the next theorem shows, the actual redundancy of the Pitman–Yor estimator is Θ(n), which is significantly better than the lower bound of Ω(n log n) proven in Santhanam and Madiman [41] for Chinese restaurant processes.

Theorem 3

[Redundancies] Consider the estimator q_{α,θ}^PY(ψ_1^n). Then, for sufficiently large n, the worst-case and average-case redundancies satisfy:
$$ \hat{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) = \Theta(n) \qquad \text{and} \qquad \bar{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) = \Theta(n). $$
That is, q_{α,θ}^PY(·) is neither strongly nor weakly universal.

Proof

For the upper bound, we start with (12) and note that in the worst case, m = O(n), so R̂(q_{α,θ}^PY(·)) = O(n) and, a fortiori, R̄(q_{α,θ}^PY(·)) = O(n).
For the lower bound, consider the patterns 11 · · · 1 and 12 · · · n. For the Pitman–Yor estimator,
$$ q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)\; q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n) = \frac{\theta\,(1-\alpha)\cdots(n-1-\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} \cdot \frac{\theta\,(\theta+\alpha)\cdots(\theta+(n-1)\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} = \frac{(1-\alpha)(\theta+\alpha)}{(\theta+1)^2} \cdot \frac{(2-\alpha)(\theta+2\alpha)}{(\theta+2)^2} \cdots \frac{(n-1-\alpha)(\theta+(n-1)\alpha)}{(\theta+n-1)^2}. $$
For j ≥ 1, 0 < α < 1 and α + θ > 0, we show in the claim proven below that:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \max\left\{ \tfrac{1}{2}, \alpha \right\}. $$
Therefore, each term is less than one. Then, for α < 1, there exists a constant 0 < c < 1, such that:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)\; q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n) \le c^{n}. $$
Thus:
$$ \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)} + \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \ge n \log \frac{1}{c}. $$
Let the distribution p1 be a singleton, so p1(1 · · · 1) = 1. For any small δ > 0, we can find a distribution pn, such that pn(12 · · · n) = 1 − δ by choosing pn to be uniform on a sufficiently large set. Thus:
$$ \hat{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) \ge \max\left\{ \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)},\ \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right\} \ge \frac{1}{2}\left( \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)} + \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right) + \log(1-\delta) \ge \frac{n}{2} \log \frac{1}{c} + \log(1-\delta). $$
This shows that R̂(q_{α,θ}^PY(·)) = Ω(n). Furthermore,
$$ \bar{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) \ge \max\left\{ (1-\delta) \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)},\ (1-\delta) \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right\} \ge (1-\delta)\left( \frac{n}{2} \log \frac{1}{c} + \log(1-\delta) \right), $$
so R̄(q_{α,θ}^PY(·)) = Ω(n).
All that remains is to prove the following claim:

Claim 1

For j ≥ 1, 0 < α < 1 and α + θ > 0, we have:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \max\left\{ \tfrac{1}{2}, \alpha \right\}. $$

Proof

First, assume that 0 < α < 1/2. Then, the inequality is:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \frac{1}{2}. $$
Equivalently, we need to show:
$$ (1-2\alpha)\,j^2 + 2\alpha^2 j + \theta^2 + 2\alpha\theta \ge 0. $$
Since 1 − 2α > 0, the quadratic is always nondecreasing on j ≥ 1. Therefore, the positive integer j = 1 minimizes the quadratic above, and the value of the quadratic at j = 1 is:
$$ 1 - 2\alpha + 2\alpha^2 + \theta^2 + 2\alpha\theta = (\alpha-1)^2 + (\theta+\alpha)^2 \ge 0. $$
Next, assume that 1/2 ≤ α < 1. Then, the inequality is:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \alpha. $$
Equivalently, we need to show:
$$ \bigl((2\alpha-1)\theta + \alpha^2\bigr)\,j + \alpha\theta(\theta+1) \ge 0. \qquad (14) $$
Since 2α − 1 ≥ 0 and θ > −α,
$$ (2\alpha-1)\theta + \alpha^2 \ge -(2\alpha-1)\alpha + \alpha^2 = \alpha(1-\alpha) > 0. $$
Therefore, the minimum of the left-hand side of (14) over j ≥ 1 is achieved at j = 1. Note that αθ² ≥ −α³ − 2α²θ, since α(α + θ)² ≥ 0. Therefore, the value of the minimum is:
$$ (2\alpha-1)\theta + \alpha^2 + \alpha\theta(\theta+1) = \alpha\theta^2 + (3\alpha-1)\theta + \alpha^2 \ge -\alpha^3 + (-2\alpha^2+3\alpha-1)\theta + \alpha^2 \ge -(-2\alpha^2+3\alpha-1)\alpha + \alpha^2 - \alpha^3 = \alpha^3 - 2\alpha^2 + \alpha = \alpha(\alpha-1)^2 \ge 0. $$
Note that −2α² + 3α − 1 ≥ 0 for 1/2 ≤ α < 1, and the claim follows.
The theorem follows.

4. Weak Universality

In this section, we show how to modify the CRP or PY estimators to obtain weakly universal estimators. The CRP and PY cases are identical; therefore, we only work out the CRP case.
For all i ≥ 1 and j ≥ 1, let:
$$ c_{i,j} = \frac{1}{i(i+1)\,j(j+1)}, $$
so that ∑_{i,j} c_{i,j} = 1. Let q̃_{i,j}^CRP(·) be the CRP measure over patterns with θ = i/log j. Consider the following measure over patterns of infinite sequences, which assigns, for all n and all patterns ψ^n of length n, the probability:
$$ q^{*}(\psi^n) = \sum_{i,j} c_{i,j}\, \tilde{q}_{i,j}^{\mathrm{CRP}}(\psi^n). \qquad (15) $$
We will show that q* is a weakly universal measure over patterns of i.i.d. sequences.
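For concreteness, the following Python sketch (our illustration, not the paper's code) evaluates a truncated version of q*: it sums finitely many terms c_{i,j} q̃^CRP_{i,j}(ψ^n), with each CRP term computed from the predictive rule (2); the truncation starts j at 2 so that log j > 0.

```python
from collections import Counter
from math import log

def crp_pattern_prob(psi, theta):
    """CRP/Ewens probability of a pattern via the sequential predictive rule (2)."""
    p, counts = 1.0, Counter()
    for n_seen, s in enumerate(psi):
        p *= (counts[s] / (n_seen + theta)) if s in counts else (theta / (n_seen + theta))
        counts[s] += 1
    return p

def q_star_truncated(psi, max_i=50, max_j=50):
    """Truncation of q*(psi) = sum_{i,j} c_{i,j} qtilde^CRP_{i,j}(psi) with theta = i / log j."""
    total = 0.0
    for i in range(1, max_i + 1):
        for j in range(2, max_j + 1):                       # j >= 2 avoids log(1) = 0
            c_ij = 1.0 / (i * (i + 1) * j * (j + 1))
            total += c_ij * crp_pattern_prob(psi, theta=i / log(j))
    return total

print(q_star_truncated([1, 2, 3, 2, 4, 2, 4]))              # a lower bound on q* for this pattern
```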
To do so, we will need the following two lemmas. Lemma 4 is a useful “folk” inequality that we believe is attributed to Minkowski. Lemma 5 relates the expected number of distinct symbols in length n sequences of an i.i.d. process to its entropy and is of independent interest. The result not only strengthens a similar result in [43], but also provides a different and more compact proof.

Lemma 4

For n ≥ 1, let x_1 ≥ x_2 ≥ ⋯ ≥ x_n ≥ 0 and y_1 ≥ y_2 ≥ ⋯ ≥ y_n ≥ 0 be two sorted sequences. Then:
$$ \frac{1}{n} \sum_{l=1}^{n} x_l y_l \ \ge\ \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)\left( \frac{1}{n} \sum_{j=1}^{n} y_j \right) \ \ge\ \frac{1}{n} \sum_{l=1}^{n} x_l y_{n+1-l}. $$

Proof

The left inequality of the lemma follows by noting that:
$$ \left( \sum_{i=1}^{n} x_i \right)\left( \sum_{j=1}^{n} y_j \right) = \sum_{k=0}^{n-1} \sum_{l=1}^{n} x_l y_{l+k}, $$
where the index l + k is taken cyclically (modulo n), and that each inner sum ∑_l x_l y_{l+k} is maximized at k = 0, since both sequences are sorted in the same direction. The right inequality of the lemma can be proven similarly, but will not be used in the paper.

Lemma 5

For all discrete i.i.d. processes P with entropy rate (or marginal entropy) H, let M_n be the random variable counting the number of distinct symbols in a sample of length n drawn from P. The following bound holds:
$$ \mathbb{E}[M_n] \le \frac{nH}{\log n} + 1. $$

Proof

Let P(i) = pi. We begin by noting that:
$$ H = \sum_{i} p_i \log \frac{1}{p_i} = \sum_{i} p_i \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j}, $$
where the second equality follows by the Taylor series expansion:
$$ -\log p_i = -\log\bigl(1 - (1-p_i)\bigr) = \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j}. $$
The inner summation in the equation above is bounded from below as follows:
$$ \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j} \ \ge\ \sum_{j=1}^{n} \frac{(1-p_i)^{j}}{j} \ \overset{(a)}{\ge}\ \frac{1}{n} \sum_{l=1}^{n} \frac{1}{l} \sum_{m=1}^{n} (1-p_i)^{m} \ \ge\ \frac{\log n}{n} \cdot \frac{(1-p_i)}{p_i} \bigl(1 - (1-p_i)^{n}\bigr), $$
where (a) follows from Minkowski’s inequality in Lemma 4, and the last inequality holds because ∑_{l=1}^{n} 1/l ≥ log n. Thus,
$$ H = \sum_{i} p_i \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j} \ \ge\ \frac{\log n}{n} \sum_{i} (1-p_i)\bigl(1 - (1-p_i)^{n}\bigr) \ \ge\ \frac{\log n}{n} \bigl( \mathbb{E}[M_n] - 1 \bigr), $$
where for the second inequality we use E[M_n] = ∑_i (1 − (1 − p_i)^n) and ∑_i p_i (1 − (1 − p_i)^n) ≤ ∑_i p_i ≤ 1.
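Lemma 5 is easy to probe numerically. The following Monte Carlo sketch (ours, not from the paper; natural logarithms, numpy assumed) estimates E[M_n] for a truncated geometric marginal and compares it with the bound nH/log n + 1:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 200                                        # truncated geometric support {0, ..., k-1}
probs = 0.5 ** np.arange(1, k + 1)
probs /= probs.sum()
H = float(-(probs * np.log(probs)).sum())      # marginal entropy in nats

n = 10_000
reps = 200

def distinct_count():
    sample = rng.choice(k, size=n, p=probs)
    return np.unique(sample).size

empirical = np.mean([distinct_count() for _ in range(reps)])
bound = n * H / np.log(n) + 1
print(f"E[M_n] ~ {empirical:.1f}  <=  nH/log n + 1 = {bound:.1f}")   # the bound is loose but holds
```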

Theorem 6

[Weak universality for CRP mixtures] For all discrete i.i.d. processes p with a finite entropy rate,
$$ D(p \,\|\, q^{*}) = o(n). $$
That is, q* is weakly universal.

Proof

We write the divergence between p and q* in (15) as an expected log ratio and condition on the value of M_n:
$$ \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \right] = P\!\left( M_n > \frac{n(\log\log n)^2}{\log n} \right) \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \,\middle|\, M_n > \frac{n(\log\log n)^2}{\log n} \right] + P\!\left( M_n \le \frac{n(\log\log n)^2}{\log n} \right) \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \,\middle|\, M_n \le \frac{n(\log\log n)^2}{\log n} \right]. \qquad (18) $$
Consider the estimator q̃_{i,j}^CRP(ψ_1^n) in q* corresponding to i = M_n and j = log n. This is the estimator q_θ^CRP(ψ_1^n) with θ = M_n/log n. From the proof of Theorem 1, we have:
$$ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \le \log \frac{p(\psi_1^n)}{c_{i,j}\, \tilde{q}_{i,j}^{\mathrm{CRP}}(\psi_1^n)} \le \log \frac{1}{c_{i,j}} + \log \frac{p(\psi_1^n)}{\tilde{q}_{i,j}^{\mathrm{CRP}}(\psi_1^n)} \le \log\bigl( M_n (M_n+1)(\log n)(\log n + 1) \bigr) + \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)}. \qquad (19) $$
We will bound the two terms in (19) in the two regimes for M_n.
The result of Theorem 1 says that if M_n ≤ n (log log n)^2 / log n, then:
$$ \log \frac{1}{c_{i,j}} + \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)} \le \log\bigl( M_n (M_n+1)(\log n)(\log n+1) \bigr) + o(n). $$
Thus, this term is o(n).
If M_n > n (log log n)^2 / log n, then we first apply Markov’s inequality using the previous lemma:
$$ P\!\left( M_n > \frac{n(\log\log n)^2}{\log n} \right) \le \frac{\log n}{n(\log\log n)^2}\left( \frac{nH}{\log n} + 1 \right) \le \frac{H}{(\log\log n)^2} + \frac{\log n}{n(\log\log n)^2}. $$
Therefore, for all finite-entropy processes, this probability goes to zero as n → ∞. Looking at the term in q* corresponding to q_θ^CRP(ψ_1^n) with θ = M_n/log n and using the fact that M_n ≤ n, we see that the first term in (19) is upper bounded as O(log n). For the second term, we appeal to (9) in the proof of Theorem 1:
$$ \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)} \le M_n \log \frac{n}{M_n} + M_n \log\log n + \frac{M_n}{\log n} \log\!\left( 2 + \frac{n \log n}{M_n} \right) \le n + n \log\log n + \frac{n}{\log n} \log\bigl( 2 + n \log n \bigr) \le 3 n \log\log n. $$
Plugging these terms into (18):
$$ D(p \,\|\, q^{*}) \le \left( \frac{H}{(\log\log n)^2} + \frac{\log n}{n(\log\log n)^2} \right) \cdot O(n \log\log n) + 1 \cdot o(n) = o(n). $$
The preceding theorem shows that the mixture of CRP estimators q* is weakly universal. However, note that q* is not itself a CRP estimator. An identical construction is possible for the PY estimators, as well. The convergence of the weakly universal q* depends on the entropy of the source, as well as on the number of distinct symbols in a sample of size n.
While it would be tempting to predict the performance of the estimator q* for larger sample sizes N > n, such a task requires a more careful analysis. In general, it may be impossible to non-trivially bound the number of distinct symbols M_N in the larger sample from a smaller sample of size n, as the following example shows.

Example 1

Let n = √N. Consider a set containing the following two distributions: (i) p over ℕ, which assigns probability 1 − 1/n^{3/2} = 1 − 1/N^{3/4} to the atom 1 and splits the probability 1/N^{3/4} equally among the elements of the set {2, . . . , N² + 1}; and (ii) p′, which simply assigns probability one to the atom 1. A sample of size n from either p or p′ is 1^n with probability at least 1 − 1/N^{1/4}, no matter what the underlying source is; therefore, we cannot distinguish between these sources with probability 1 − 1/N^{1/4} from a sample of size n.
However, a sample of size N from p has Θ(N^{1/4}) distinct symbols on average, while that of p′ will have only one element. It follows that if all we know is that the unknown distribution comes from this set, then, with high probability under the unknown source, we cannot predict from a sample of length n whether the number of symbols in a sample of size N will remain one or not. Furthermore, by changing the ratio of n and N (and, therefore, the probability of the symbol 1 under p), we can make the expected number of symbols in an N-length sample under p as large as we want.
However, it is possible to impose restrictions on the class of distributions that ensure that we can predict the number of symbols in longer samples. In future work, we will borrow from the data-derived consistency formulations of [44] to characterize when such prediction is possible.

5. Conclusions and Future Work

In this note, we investigated the worst-case and average-case redundancies of pattern probability estimators derived from priors on 𝒫 that are popular in Bayesian statistics. Both the CRP and Pitman–Yor estimators give a vanishing per-symbol redundancy for patterns whose number of distinct symbols m is sufficiently small. The Pitman–Yor estimator requires only that m = o(n), which is an improvement over the CRP. However, when m can be arbitrarily large (or the alphabet size is arbitrarily large), the worst-case and average-case redundancies do not scale as o(n). Here, again, the Pitman–Yor estimator is superior, in that its redundancies scale as Θ(n), as opposed to the Ω(n log n) of the CRP estimator. While these results show that these estimators are not strongly universal, we constructed a mixture of CRP estimators (which is not itself a CRP estimator) that is weakly universal. One of the estimators derived in [16] is exchangeable and has near-optimal worst-case redundancy of O(√n). Kingman’s results imply that this estimator corresponds to a prior on 𝒫; however, this prior is as yet unknown. Finding this prior may reveal interesting new classes of priors beyond the Poisson–Dirichlet priors.

Acknowledgments

The authors thank the American Institute of Mathematics and NSF for sponsoring a workshop on probability estimation, as well as A. Orlitsky and K. Viswanathan, who co-organized the workshop with the first author. They additionally thank P. Diaconis, M. Dudik, F. Chung, R. Graham, S. Holmes, O. Milenkovic, O. Shayevitz, A. Wagner, J. Zhang and M. Madiman for helpful discussions.
Narayana P. Santhanam was supported by a startup grant from the University of Hawaii and NSF Grants CCF-1018984 and CCF-1065632. Anand D. Sarwate was supported in part by the California Institute for Telecommunications and Information Technology (CALIT2) at the University of California, San Diego. Jae Oh Woo was supported by NSF Grant CCF-1065494 and CCF-1346564.

Author Contributions

All authors have worked on this manuscript together.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Laplace, P.S. Philosphical Essay on Probabilities: Translated From the Fifth French Edition of 1825; Sources in the History of Mathematics and Physical Sciences (Book 13); Springer Verlag: New York, NY, USA, 1995. [Google Scholar]
  2. De Morgan, A. An Essay on Probabilities, and on Their Application to Life Contingencies and Insurance Offices; Longman, Orme, Brown, Green & Longmans: London, UK, 1838. [Google Scholar]
  3. De Morgan, A. Theory of Probabilities. In Encyclopædia Metropolitana; Smedley, E., Rose, H.J., Rose, H.J., Eds.; Volume II (Pure Sciences); B. Fellowes et. al: London, UK, 1845; pp. 393–490. [Google Scholar]
  4. Good, I. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264. [Google Scholar]
  5. Blackwell, D.; MacQueen, J.B. Ferguson distributions via Pólya urn schemes. Ann. Stat 1973, 1, 353–355. [Google Scholar]
  6. Kingman, J.F.C. Random Discrete Distributions. J. R. Stat. Soc. B 1975, 37, 1–22. [Google Scholar]
  7. Kingman, J.F.C. Random Partitions in Population Genetics. Proc. R. Soc 1978, 361, 1–20. [Google Scholar]
  8. Kingman, J.F.C. The Representation of Partition Structures. J. London Math. Soc 1978, 2, 374–380. [Google Scholar]
  9. Diaconis, P.; Freedman, D. De Finetti’s Generalizations of Exchangeability. In Studies in Inductive Logic and Probability; Jeffrey, RC., Ed.; Volume 2, University of California Press: Berkeley/Los Angeles, CA, USA, 1980; pp. 233–250. [Google Scholar]
  10. Aldous, D.J. Exchangeability and Related Topics. In École d’Été de Probabilités de Saint-Flour, XIII—1983; Lecture Notes in Mathematics, Volume 1117; Springer: Berlin/Heidelberg, Germany, 1985; pp. 1–198. [Google Scholar]
  11. Zabell, S. Predicting the unpredictable. Synthese 1992, 90, 205–232. [Google Scholar]
  12. Pitman, J. Exchangeable and partially exchangeable random partitions. Probab. Theory Relat. Fields 1995, 102, 145–158. [Google Scholar]
  13. Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab 1997, 25, 855–900. [Google Scholar]
  14. Clarke, B.; Barron, A. Information theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Theory 1990, 36, 453–471. [Google Scholar]
  15. Clarke, B.; Barron, A. Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Stat. Plan. Inference 1994, 41, 37–60. [Google Scholar]
  16. Orlitsky, A.; Santhanam, N.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inf. Theory 2004, 50, 1469–1481. [Google Scholar]
  17. Orlitsky, A.; Santhanam, N.; Zhang, J. Always Good Turing: Asymptotically optimal probability estimation. Science 2003, 302, 427–431. [Google Scholar]
  18. Acharya, J.; Das, H.; Jafarpour, A.; Orlitsky, A.; Suresh, A. Tight Bounds for Universal Compression of Large Alphabets. Proceedings of IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, 7–12 July 2013.
  19. Acharya, J.; Das, H.; Orlitsky, A. Tight Bounds on Profile Redundancy and Distinguishability. Proceedings of 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 3257–3265.
  20. Acharya, J.; Jafarpour, A.; Orlitsky, A.; Suresh, A. Optimal Probability Estimation with Applications to Prediction and Classification. Proceeding of Conference on Learning Theory, Princeton NJ, USA, 12–14 June 2013; pp. 764–796.
  21. Ryabko, B. Compression based methods for nonparametric on-line prediction, regression, classification and density estimation of time series. In Festschrift in Honor of Jorma Rissanen on the Occasion of His 75th Birthday; Grünwald, P., Myllymäki, P., Tabus, I., Weinberger, M., Yu, B., Eds.; Tampere International Center for Signal Processing: Tampere, Finland, 2008; pp. 271–288. [Google Scholar]
  22. Wagner, A.B.; Viswanath, P.; Kulkarni, S.R. Probability Estimation in the Rare-Events Regime. IEEE Trans. Inf. Theory 2011, 57, 3207–3229. [Google Scholar]
  23. Nadas, A. Good, Jelinek, Mercer, and Robins on Turing’s estimate of probabilities. Am. J. Math. Manag. Sci 1991, 11, 229–308. [Google Scholar]
  24. Gale, W.; Church, K. What is wrong with adding one? In Corpus Based Research into Language; Oostdijk, N., de Haan, P., Eds.; Rodopi: Amsterdam, The Netherlands, 1994; pp. 189–198. [Google Scholar]
  25. McAllester, D.; Schapire, R. On the convergence rate of Good Turing estimators. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), Palo Alto, CA, USA, 28 June– 1 July 2000; Cesa-Bianchi, N., Goldman, SA., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2000; pp. 1–6. [Google Scholar]
  26. Drukh, E.; Mansour, Y. Concentration Bounds on Unigrams Language Models. J. Mach. Learn. Res 2005, 6, 1231–1264. [Google Scholar]
  27. Orlitsky, A.; Santhanam, N.; Viswanathan, K.; Zhang, J. On modeling profiles instead of values; Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 7–11 July 2004, Meek, C., Halpern, J., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 426–435. [Google Scholar]
  28. Gallager, R. Source Coding with Side Information and Universal Coding. In Technical Report LIDS-P-937; M.I.T: Cambridge, MA, USA, 1976. [Google Scholar]
  29. Davisson, L.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inf. Theory 1980, 26, 166–174. [Google Scholar]
  30. Ryabko, B.Y. Coding of a source with unknown but ordered probabilities. Probl. Inf. Transm 1979, 15, 134–138. [Google Scholar]
  31. Kingman, J. The Mathematics of Genetic Diversity; CBMS-NSF Regional Conference Series in Applied Mathematics, Volume 34; SIAM: Philadelphia, PA, USA, 1980. [Google Scholar]
  32. Zabell, S. The Continuum of Inductive Methods Revisited. In The Cosmos of Science: Essays of Exploration; Earman, J., Norton, JD., Eds.; The University of Pittsburgh Press: Pittsburgh, PA, USA, 1997; Volume Chapter 12. [Google Scholar]
  33. Zabell, S. Symmetry and Its Discontents: Essays on the History of Inductive Probability; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  34. Pitman, J. Random discrete distributions invariant under size-biased permutation. Adv. Appl. Probab 1996, 28, 525–539. [Google Scholar]
  35. Pitman, J. Combinatorial Stochastic Processes. In École d’Été de Probabilités de Saint-Flour, XXXII—2002; Lecture Notes in Mathematics, Volume 1875; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  36. Ferguson, T. A Bayesian analysis of some nonparametric problems. Ann. Stat 1973, 1, 209–230. [Google Scholar]
  37. Ramamoorthi, R.; Srikanth, K. Dirichlet processes. In Encyclopedia of Statistical Sciences; John Wiley and Sons: New York, NY, USA, 2007. [Google Scholar]
  38. Ewens, W.J. The Sampling Theory of Selectively Neutral Alleles. Theor. Popul. Biol 1972, 3, 87–112. [Google Scholar]
  39. Karlin, S.; McGregor, J. Addendum to a Paper of W. Ewens. Theor. Popul. Biol 1972, 3, 113–116. [Google Scholar]
  40. Watterson, G.A. The Sampling Theory of Selectively Neutral Alleles. Adv. Appl. Probab 1974, 6, 463–488. [Google Scholar]
  41. Santhanam, N.; Madiman, M. Patterns and exchangeability. Proceedings of 2010 IEEE International Symposium on Information Theory (ISIT), Austin, TX, USA, 13–18 June 2010; pp. 1483–1487.
  42. Carlton, M.A. Applications of the Two-Parameter Poisson-Dirichlet Distribution. Ph.D. Thesis, University of California, Los Angeles, Los Angeles, CA, USA, 1999. [Google Scholar]
  43. Orlitsky, A.; Santhanam, N.P.; Viswanathan, K.; Zhang, J. Limit results on pattern entropy. IEEE Trans. Inf. Theory 2006, 52, 2954–2964. [Google Scholar]
  44. Santhanam, N.; Anantharam, V.; Kavcic, A.; Szpankowski, W. Data driven weak universal redundancy. Proceedings of IEEE Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 1877–1881.
