Redundancy of Exchangeable Estimators

Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson-Dirichlet priors (or "Chinese restaurant processes") and the Pitman-Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.


Introduction
A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling, and DNA microarray analysis, require inferences based on observed sequences of symbols in which the sequence length, or sample size, is comparable to or even smaller than the size of the set of symbols, the alphabet. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the vocabulary.
Inference in this setting has received a lot of attention, from Laplace [21,7,8] in the 18th century, to Good [15] in the mid-20th century, to an explosion of work in the statistics [2,17,18,19,9,1,39,28,31], information theory [4,5,26,25,34,37] and machine learning [23,13,22,10] communities in the last few decades. A major strand in the information theory literature on the subject has been based on the notion of patterns. The pattern of a sequence characterizes the repeat structure in the sequence, which is the information that can be described well (see Orlitsky et al. [24] for formal characterizations of this idea). The statistical literature has emphasized the importance of exchangeability, which generalizes the notion of independence.
We consider measures over infinite sequences X_1, X_2, ..., where the X_i come from a countable (possibly infinite) set, the alphabet. Let I be the collection of all distributions over countable (potentially infinite) alphabets. For p in I, let p^(n) denote the product distribution corresponding to an independent and identically distributed (i.i.d.) sample X_1^n = (X_1, X_2, ..., X_n) with X_i ~ p. Let I^(n) be the collection of all such i.i.d. distributions on length-n sequences drawn from countable alphabets. Let I^∞ be the collection of all measures over infinite sequences X_1, X_2, ... of symbols from countable alphabets, where the {X_i} are i.i.d. according to some distribution in I. These measures are constructed by extending to the Borel sigma algebra the i.i.d. probability assignments on finite-length sequences, namely I^(n), n ≥ 1. We call I^∞ the set of i.i.d. measures.
Based on a sample X_1^n = (X_1, X_2, ..., X_n) from an unknown p^(n) in I^(n) (or equivalently, the corresponding measure in I^∞), we want to create an estimator q_n which assigns probabilities to length-n sequences. We are interested in the behavior of the sequence of estimators {q_n : n = 1, 2, ...}. With some abuse of notation, we will use q to denote the estimator q_n when the sample size n is clear from context. We want q_n to approximate p^(n) well; in particular, we would like q_n to neither overestimate nor underestimate the probability of sequences of length n under the true p^(n).
Suppose that there exist R_n > 0 and A_n > 0 such that for each p in I we have

p^(n)( { X_1^n : q_n(X_1^n) > R_n p^(n)(X_1^n) } ) < 1/A_n.
If A_n grows sufficiently fast with n (for instance, fast enough that the sum of 1/A_n converges, allowing a Borel-Cantelli argument), the sequence {q_n} does not asymptotically overestimate probabilities of length-n sequences by a factor larger than R_n with probability 1, no matter which measure p in I generated the sequences. Protecting against underestimation is not so simple. The redundancy of an estimator q_n (defined formally in Section 2.4) on a length-n sequence x_1^n measures how closely q_n(x_1^n) matches the largest probability assigned to x_1^n by any distribution in I^(n). The redundancy of the estimator itself is then obtained either by maximizing this quantity over all sequences or by taking its expectation over sequences. Ideally, we want the estimator redundancy to grow sublinearly in the sequence length n, so that the per-sample redundancy vanishes as n → ∞. If so, we call the estimator universal for I. Redundancy thus captures how well q performs against the collection I, but the connections between estimation problems and compression run deeper.
In this paper we consider estimators formed by taking a measure (prior) on I. Different priors induce different distributions on the data X_1^n. We think of the prior as randomly choosing a distribution p in I; the observed data X_1^n is then generated according to this p. How much information about the underlying distribution p can we obtain from the data (assuming we know the prior)? Indeed, a well-known result [14,6,33] shows that the redundancy of the best possible estimator for I^∞ equals the maximum, over all choices of prior, of the information (in bits) about the underlying source that is present in a length-n sequence generated in this manner.
Redundancy is well defined for finite alphabets; recent work [26] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract away the identities of symbols and indicate only the relative order of appearance. For example, the pattern of FEDERER is 1232424, while that of PATTERN is 1233456. The crux of the idea is that instead of considering the set of measures I^∞ over infinite sequences, we consider the set of measures induced over patterns of the sequences. It then follows that our estimate q_Ψ is a measure over patterns. While the variables in the sequence are i.i.d., the corresponding pattern is merely exchangeable. We can associate a predictive distribution to the pattern probability estimator q_Ψ. This is an estimate of the distribution of X_{n+1} given the previous observations; it assigns a probability to the event that X_{n+1} is "new" (has not appeared in X_1^n), as well as probabilities to the events that X_{n+1} takes on one of the values seen so far.
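The pattern of a sequence is easy to compute: replace each symbol by the order of its first appearance. A minimal Python sketch (the function name is ours):

```python
def pattern(seq):
    """Replace each symbol by the order (1-indexed) of its first appearance."""
    index, out = {}, []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1  # next unused pattern index
        out.append(index[s])
    return out

print(pattern("FEDERER"))  # [1, 2, 3, 2, 4, 2, 4]
print(pattern("PATTERN"))  # [1, 2, 3, 3, 4, 5, 6]
```

Note that any two sequences differing only by a relabeling of symbols produce the same pattern, which is exactly the information the pattern is meant to retain.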
The above view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [20] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [39,40] collected in [41]. One of the most popular exchangeable random partition processes is the "Chinese restaurant process," [1], which is a special case of the Poisson-Dirichlet or Pitman-Yor process [29,31]. These processes can be viewed as prior distributions on the set of all discrete distributions that can be used as the basis for estimating probabilities and computing predictive distributions.
In this paper we evaluate the performance of the sequential estimators corresponding to these exchangeable partition processes. As described before, I is the collection of all distributions over countable (potentially infinite) alphabets, and I^∞ is the collection of all i.i.d. measures with single-letter marginals in I. Let I_Ψ be the collection of all measures over patterns induced by measures in I^∞. We evaluate the redundancy of estimators based on the Chinese restaurant process (CRP), the Pitman-Yor (PY) process, and the Ewens sampling formula against I_Ψ.
In the context of sequential estimation, universal estimators do exist [26] for the collection I_Ψ of measures over patterns: the redundancy can scale as O(n^{1/2}) (for a computationally intensive method) or O(n^{2/3}) (for a linear-time estimator). At the same time, estimators such as the CRP and PY estimators, which were not developed within a universal compression framework, have been very successful from a practical standpoint. Therefore, this paper attempts to understand these estimators from the universal compression perspective.
For the estimators studied in nonparametric Bayesian statistics, our results show that they are in general neither weakly nor strongly universal when compressing patterns or, equivalently, exchangeable random partitions. While the notion of redundancy is in some sense different from other measures of accuracy, such as concentration of the posterior distribution about the true distribution, the parameters of the CRP or PY processes that do compress well often correspond to the maximum likelihood estimates obtained from the sample.
Because we choose to measure redundancy in the worst case over p, the underlying alphabet size may be arbitrarily large with respect to the sample size n. Consequently, for a fixed sample of size n, the number of distinct symbols could be large, for example a constant fraction of n. The CRP and PY estimators do not have good redundancy against such samples, since these are not the cases for which the estimators were designed. But we can show that a mixture of CRP estimators is weakly universal. This mixture is formed from individual CRP estimators, each of which (implicitly) assumes a bound on the support of p. If such a bound is known in advance, we can derive much tighter bounds on the redundancy. In this setting the two-parameter Poisson-Dirichlet (or Pitman-Yor) estimator is superior to the estimator derived from the Chinese restaurant process.
In order to describe our results, we require a variety of definitions from different research communities. In the next section we describe this preliminary material and place it in context before describing the main results in Section 3.

Preliminaries
In this paper we use the "big-O" notation. A function f(n) = O(g(n)) if there exists a positive constant C such that for sufficiently large n, |f(n)| ≤ C|g(n)|. A function f(n) = Ω(g(n)) if there exists a positive constant C such that for sufficiently large n, |f(n)| ≥ C|g(n)|. A function f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)).
Let I_k denote the set of all probability distributions on alphabets of size k, let I_∞ be the set of all probability distributions on countably infinite alphabets, and let I be the set of all discrete distributions irrespective of support and support size.
For a fixed p, let x_1^n = (x_1, x_2, ..., x_n) be a sequence drawn i.i.d. according to p. We denote the pattern of x_1^n by ψ_1^n. The pattern is formed by taking ψ_1 = 1 and, for i > 1, setting ψ_i = ψ_j if x_i = x_j for some j < i, and ψ_i = 1 + max_{j<i} ψ_j otherwise. For example, the pattern of x_1^7 = FEDERER is ψ_1^7 = 1232424. Let Ψ^n be the set of all patterns of length n. We write p(ψ^n) for the probability that a length-n sequence generated by p has pattern ψ^n. For a pattern ψ_1^n, we write φ_µ for the number of symbols that appear µ times in ψ_1^n, and m = Σ_µ φ_µ for the number of distinct symbols in ψ_1^n. We call φ_µ the prevalence of µ. Thus for FEDERER we have φ_1 = 2, φ_2 = 1, φ_3 = 1, and m = 4.
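The prevalences are simple to compute from a pattern; a small Python sketch (the function name is ours):

```python
from collections import Counter

def prevalences(psi):
    """Return ({mu: phi_mu}, m): how many symbols appear mu times,
    and the number m of distinct symbols in the pattern psi."""
    counts = Counter(psi)           # multiplicity of each pattern index
    phi = Counter(counts.values())  # phi_mu = number of indices with multiplicity mu
    return dict(phi), len(counts)

phi, m = prevalences([1, 2, 3, 2, 4, 2, 4])  # pattern of FEDERER
print(sorted(phi.items()), m)  # [(1, 2), (2, 1), (3, 1)] 4
```

As a sanity check, the prevalences always satisfy Σ_µ µ φ_µ = n and Σ_µ φ_µ = m.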

Exchangeable partition processes
An exchangeable random partition is a sequence (C_n : n in N), where C_n is a random partition of the set [n] = {1, 2, ..., n}, satisfying the following conditions: (i) the probability that C_n is a particular partition depends only on the vector (s_1, s_2, ..., s_n), where s_k is the number of parts of size k in the partition; and (ii) the realizations of the sequence are consistent, in that all the parts of C_n are also parts of the partition C_{n+1}, except that the new element n + 1 may either form a new part of C_{n+1} by itself or join one of the existing parts of C_n.
For a sequence X_1, ..., X_n from a discrete alphabet, one can partition the set [n] into component sets {A_x}, where A_x = {i : X_i = x} is the set of indices at which x has appeared. When the {X_i} are drawn i.i.d. from a distribution in I, the corresponding sequence of random partitions is called a paintbox process.
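The map from a sequence to the partition {A_x} is straightforward; a small Python sketch (the function name is ours) that forgets the symbol labels and keeps only the blocks:

```python
from collections import defaultdict

def partition_of(seq):
    """Partition [n] = {1, ..., n} into the blocks A_x = {i : X_i = x}."""
    blocks = defaultdict(list)
    for i, x in enumerate(seq, start=1):
        blocks[x].append(i)
    return sorted(blocks.values())  # forget the labels; keep only the blocks

print(partition_of("FEDERER"))  # [[1], [2, 4, 6], [3], [5, 7]]
```

The resulting unlabeled partition carries exactly the same information as the pattern of the sequence.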
The remarkable Kingman representation theorem [19] states that the probability measure induced by any exchangeable random partition is a mixture of paintbox processes, where the mixture is taken using a probability measure ("prior" in Bayesian terminology) on the class of paintbox processes. Since each paintbox process corresponds to a discrete probability measure (the one such that i.i.d. X_i drawn from it produce the paintbox process), the prior may be viewed as a distribution on the set of probability measures on a countable alphabet. For technical reasons, the alphabet is assumed to be hybrid, with a discrete part as well as a continuous part, and one also needs to work with the space of ordered probability vectors (see [30] for details).

Dirichlet priors and Chinese restaurant processes
Not surprisingly, special classes of priors give rise to special classes of exchangeable random partitions. One particularly nice class of priors on the set of probability measures on a countable alphabet is that of the Poisson-Dirichlet priors [12,2,32] (sometimes called Dirichlet processes since they live on the infinite-dimensional space of probability measures and generalize the usual finite-dimensional Dirichlet distribution).
The Chinese restaurant process (or CRP) is related to the so-called Griffiths-Engen-McCloskey (GEM) distribution with parameter θ, denoted by GEM(θ). Consider W_1, W_2, ... drawn i.i.d. according to a Beta(1, θ) distribution, and set

p_1 = W_1,    p_i = W_i (1 - W_1)(1 - W_2) ... (1 - W_{i-1})  for i ≥ 2.

This can be interpreted as follows: take a stick of unit length and break it into pieces of size W_1 and 1 - W_1. Now take the piece of size 1 - W_1 and break off a W_2 fraction of it. Continue in this way. The resulting stick lengths define a distribution on a countably infinite set. The distribution of the sequence p = (p_1, p_2, ...) is the GEM(θ) distribution.

Remark
Let π denote the elements of p sorted in decreasing order, so that π_1 ≥ π_2 ≥ .... Then the distribution of π is the Poisson-Dirichlet distribution PD(θ) as defined by Kingman.

Another popular class of distributions on probability vectors is the Pitman-Yor family of distributions [31], also known as the two-parameter Poisson-Dirichlet family of distributions PD(α, θ). The two parameters here are a discount parameter α in [0, 1] and a strength parameter θ > -α. The distribution PD(α, θ) can be generated in a similar way to the Poisson-Dirichlet distribution PD(θ) = PD(0, θ) described earlier. Let each W_i be drawn independently according to a Beta(1 - α, θ + iα) distribution, and again set

p̃_1 = W_1,    p̃_i = W_i (1 - W_1)(1 - W_2) ... (1 - W_{i-1})  for i ≥ 2.

The same "stick-breaking" interpretation holds here as well. Now let p be the sequence p̃ sorted in descending order. The distribution of p is PD(α, θ). If α < 0 and θ = r|α| for an integer r, we obtain a symmetric Dirichlet distribution of dimension r.
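The stick-breaking construction is easy to simulate. A Python sketch (the function name is ours) for the unsorted weights p̃, assuming α in [0, 1) so that the Beta parameters stay positive; α = 0 gives GEM(θ):

```python
import random

def stick_breaking(theta, alpha=0.0, k=1000, rng=random):
    """First k unsorted weights of the two-parameter stick-breaking scheme:
    W_i ~ Beta(1 - alpha, theta + i*alpha), weight_i = W_i * prod_{j<i}(1 - W_j).
    alpha = 0.0 reduces to GEM(theta).  Requires 0 <= alpha < 1 and theta > -alpha."""
    weights, remaining = [], 1.0
    for i in range(1, k + 1):
        w = rng.betavariate(1.0 - alpha, theta + i * alpha)
        weights.append(w * remaining)   # length of the piece broken off
        remaining *= 1.0 - w            # length of the remaining stick
    return weights

probs = stick_breaking(theta=5.0, k=2000, rng=random.Random(1))
print(sum(probs))  # close to 1; the mass beyond the first k weights is the remainder
```

Truncating at k weights leaves a small unassigned remainder, which is why the printed sum falls just short of 1.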

Pattern probability estimators
Given a sample x_1^n with pattern ψ^n, we would like to produce a pattern probability estimator. This is a function of the form q(ψ^{n+1} | ψ^n) that assigns a probability to each symbol previously seen in ψ^n, as well as a probability of seeing a new symbol. In this paper we investigate two different pattern probability estimators based on Bayesian models.
The Ewens sampling formula [11,16,38], which has its origins in theoretical population genetics, is a formula for the probability mass function of a marginal of a CRP corresponding to a fixed population size. In other words, it specifies the probability of an exchangeable random partition of [n] that is obtained when one uses the Poisson-Dirichlet PD(θ) prior to mix paintbox processes. Because of the equivalence between patterns and exchangeable random partitions, it estimates the probability of a pattern ψ_1^n via the following formula:

q_θ^CRP(ψ_1^n) = ( θ^m / (θ(θ+1)...(θ+n-1)) ) ∏_µ ((µ-1)!)^{φ_µ},    (1)

where m is the number of distinct symbols in ψ_1^n. Recall that φ_µ is the number of symbols that appear µ times in ψ_1^n. In particular, the predictive distribution associated to the Ewens sampling formula or Chinese restaurant process is

q_θ^CRP(ψ_{n+1} = j | ψ_1^n) = n_j/(n+θ) if j appeared n_j times in ψ_1^n, and θ/(n+θ) if j is new.    (2)

More generally, one can define the Pitman-Yor predictor (for α in [0, 1] and θ > -α) as

q_{α,θ}^PY(ψ_{n+1} = j | ψ_1^n) = (n_j - α)/(n+θ) if j appeared n_j times in ψ_1^n, and (θ + mα)/(n+θ) if j is new,    (3)

where m is the number of distinct symbols in ψ_1^n. The probability assigned by the Pitman-Yor predictor to a pattern ψ_1^n is therefore

q_{α,θ}^PY(ψ_1^n) = ( ∏_{i=1}^{m-1}(θ + iα) / ((θ+1)(θ+2)...(θ+n-1)) ) ∏_{j=1}^m ∏_{k=1}^{n_j-1}(k - α).    (4)
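Concretely, the pattern probability is obtained by multiplying the predictive probabilities step by step. A minimal Python sketch (the function name is ours; α = 0 recovers the CRP / Ewens sampling formula):

```python
import math

def py_pattern_logprob(psi, theta, alpha=0.0):
    """Log-probability of a pattern under the Pitman-Yor predictor,
    computed sequentially; alpha = 0 gives the CRP / Ewens case."""
    logp, counts = 0.0, {}
    for n, j in enumerate(psi):        # n = number of symbols seen so far
        if j not in counts:            # pattern index j is "new"
            m = len(counts)            # distinct symbols seen so far
            logp += math.log((theta + m * alpha) / (n + theta))
        else:                          # j was seen counts[j] times before
            logp += math.log((counts[j] - alpha) / (n + theta))
        counts[j] = counts.get(j, 0) + 1
    return logp

# CRP with theta = 1 on the pattern 11: 1 * (1 / (1 + 1)) = 1/2
print(math.exp(py_pattern_logprob([1, 1], theta=1.0)))  # approximately 0.5
```

For a fixed pattern length, the probabilities over all patterns sum to 1, e.g. for n = 2 the patterns 11 and 12 receive (1 - α)/(1 + θ) and (θ + α)/(1 + θ) respectively.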

Strong universality measures: Worst-case and average
How should we measure the quality of a pattern probability estimator q? We investigate two criteria here: the worst-case and the average-case redundancy. The redundancy of q on a given pattern ψ^n is

R(q, ψ^n) = sup_{p in I} log ( p(ψ^n) / q(ψ^n) ).

The worst-case redundancy of q is defined to be

R̂(q) = max_{ψ^n in Ψ^n} R(q, ψ^n).

Recall that p(ψ^n) just denotes the probability that a length-n sequence generated by p has pattern ψ^n; it is unnecessary to specify the support here. Since the set of length-n patterns is finite, there is no need for a supremum in the outer maximization above. The worst-case redundancy is also often referred to as the per-sequence redundancy.
The average-case redundancy replaces the max over patterns with an expectation under p:

R̄(q) = sup_{p in I} Σ_{ψ^n in Ψ^n} p(ψ^n) log ( p(ψ^n) / q(ψ^n) ) = sup_{p in I} D( p || q ),

where D(·||·) is the Kullback-Leibler divergence or relative entropy, here between the distributions induced on length-n patterns. That is, the average-case redundancy is nothing but the worst-case Kullback-Leibler divergence between the distribution p and the estimator q.
A pattern probability estimator is considered "good" if the worst-case or average-case redundancy is sublinear in n, i.e., R̂(q)/n → 0 or R̄(q)/n → 0 as n → ∞. Succinctly put, redundancy that is sublinear in n implies that the underlying probability of a sequence can be estimated accurately almost surely. Redundancy is one way to measure the "frequentist" properties of the Bayesian approaches we consider in this paper, and it refers to the compressibility of the distribution from an information-theoretic perspective.
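As a toy illustration of the average-case criterion for one fixed source (not the supremum over all of I), one can enumerate all length-3 sequences from a uniform distribution over two symbols, collect the induced pattern distribution, and compute its divergence from the CRP estimator. A short Python sketch (helper names are ours); the divergence for this single p is a lower bound on the average-case redundancy:

```python
import itertools, math

def pattern(seq):
    idx, out = {}, []
    for s in seq:
        idx.setdefault(s, len(idx) + 1)
        out.append(idx[s])
    return tuple(out)

def crp_prob(psi, theta):
    p, counts = 1.0, {}
    for n, j in enumerate(psi):
        p *= (theta if j not in counts else counts[j]) / (n + theta)
        counts[j] = counts.get(j, 0) + 1
    return p

# Pattern distribution induced by p = uniform over {a, b} at n = 3.
n, symbols = 3, "ab"
p_psi = {}
for seq in itertools.product(symbols, repeat=n):
    psi = pattern(seq)
    p_psi[psi] = p_psi.get(psi, 0.0) + 0.5 ** n

# KL divergence (bits) between the induced pattern distribution
# and the CRP estimator with theta = 1.
kl = sum(pp * math.log2(pp / crp_prob(psi, 1.0)) for psi, pp in p_psi.items())
print(f"D(p || q_CRP) = {kl:.3f} bits")  # about 0.335 bits
```

Repeating this over many sources p (and normalizing by n) gives a numerical feel for whether a given estimator's per-sample redundancy is shrinking.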
As mentioned in the introduction, redundancy differs from notions such as concentration of the posterior distribution about the true distribution. However, the parameters of the CRP or PY processes that compress well often correspond to the ML estimates from the sample.

Weak universality
In the previous section, we considered guarantees that hold over the entire model class; both the worst-case and average-case criteria involve taking a supremum over the entire model class. Therefore the strong guarantees, average or worst-case, hold uniformly over the model class. However, as we will see, exchangeable estimators, in particular the Chinese restaurant process and Pitman-Yor estimators, are tuned by their choice of parameters towards specific kinds of sources rather than all i.i.d. models. This behavior is better captured by looking at the model-dependent convergence of the exchangeable estimators, which is known as weak universality. Specifically, let P^∞ be a collection of i.i.d. measures over infinite sequences, and let P_Ψ^∞ be the collection of measures induced on patterns by P^∞. We say an estimator q is weakly universal for a class P_Ψ^∞ if for all p in P^∞,

lim sup_{n→∞} (1/n) D( p(ψ^n) || q(ψ^n) ) = 0,

where the divergence is between the distributions induced on length-n patterns.

Strong redundancy
We now describe our main results on the redundancy of estimators derived from the prior distributions on I.

Chinese restaurant process predictors
Previously [36] it was shown by some of the authors that the worst-case and average-case redundancies for the CRP estimator are both Ω(n log n), which means it is not strongly universal. However, this negative result arises because the CRP estimator is tuned not to the entire i.i.d. class of distributions, but to a specific subclass of i.i.d. sources depending on the choice of parameter. To investigate this further, we analyze the redundancy of the CRP estimator when we have a bound on the number m of distinct elements in the pattern ψ_1^n. Chinese restaurant processes q_θ^CRP(·) with parameter θ are known to generate exchangeable random partitions in which the number of distinct parts M satisfies M/log n → θ almost surely as the sample size n increases; see, e.g., [3]. Equivalently, the CRP generates patterns with M distinct symbols, where M/log n → θ. The following theorem, however, reverses the above setting. Here we are given an i.i.d. sample of length n with m symbols (how the data was generated is not important), and we pick the parameter of a CRP estimator that describes the pattern of the data well. While it is satisfying that the chosen parameter matches the ML estimate of the number of symbols, note that this need not be the only parameter choice that works.

Theorem 1. [Worst-case redundancy] Consider the estimator q_θ^CRP defined in (1) and (2). Then for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies m ≤ C · n(log log n)^2/log n, the redundancy of the predictor q_θ^CRP(ψ_1^n) with θ = m/log n satisfies

sup_{p in I} log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ m log(n/m) + m log log n + (m/log n) log 2 + 2m = o(n).

Proof The number of patterns with prevalences {φ_µ} is

n! / ( ∏_µ (µ!)^{φ_µ} φ_µ! ),

and therefore

p(ψ_1^n) ≤ ( ∏_µ (µ!)^{φ_µ} φ_µ! ) / n!,    (5)

since patterns with the same prevalences {φ_µ} all have the same probability.
Using the upper bound in (5) on p(ψ_1^n) and the formula (1) for q_θ^CRP yields

log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ Σ_µ φ_µ log µ + Σ_µ log φ_µ! + log ( θ(θ+1)...(θ+n-1) / n! ) - m log θ.    (6)

Let θ̂ = ⌈θ⌉. Since θ(θ+1)...(θ+n-1)/n! ≤ θ̂(θ̂+1)...(θ̂+n-1)/n! = C(n+θ̂-1, θ̂-1), the following bound follows from Stirling's approximation:

log ( θ(θ+1)...(θ+n-1) / n! ) ≤ (θ̂-1) log ( 2en/(θ̂-1) ) ≤ (m/log n) log 2 + m + (m/log n)(log e + log log n - log m).    (7)

The first term in (6) can be upper bounded by m log(n/m), since Σ_µ φ_µ log µ, subject to Σ_µ φ_µ = m and Σ_µ µφ_µ = n, is maximized when the argument of the log(·) is µ = n/m. The second term is likewise maximized when all symbols appear the same number of times, corresponding to φ_µ = m for a single µ, so it is at most log m! ≤ m log m. Therefore

log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ m log(n/m) + m log m + log ( θ(θ+1)...(θ+n-1)/n! ) - m log θ.    (8)

Choose θ = m/log n, so that -m log θ = -m log m + m log log n. The last term in (7) is negative for sufficiently large m, and is at most m for n ≥ 16 in any case. Therefore

log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ m log(n/m) + m log log n + (m/log n) log 2 + 2m.    (9)
Noting that the bound above is monotonically increasing in m in the range of interest, we choose m = C n(log log n)^2/log n, and the bound becomes

log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ ( Cn(log log n)^2/log n ) log ( log n/(C(log log n)^2) ) + Cn(log log n)^3/log n + ( Cn(log log n)^2/(log n)^2 ) log 2 + 2Cn(log log n)^2/log n = O( n(log log n)^3/log n ) = o(n). □

This theorem is slightly dissatisfying, since it requires us to have a bound on the number of distinct symbols in the pattern. In Section 4, we take mixtures of CRP estimators to arrive at estimators that are weakly universal.

Pitman-Yor predictors
We now turn to the more general class of Pitman-Yor predictors. We can obtain a result similar to that for the CRP estimator, but one that handles all patterns with m = o(n) distinct symbols.
As before, the context for the following theorem is this: we are given an i.i.d. sample of length n with m symbols (again, without any consideration of how the data was generated), and we pick the parameters of a PY estimator that describes the pattern of the data well. The chosen PY estimator is not necessarily the best one, but it will help us construct the weakly universal estimator in later sections of this paper.
We also note that the choice of the parameter θ below is analogous to our choice in the α = 0 case (which reduces to the CRP). For patterns generated by a PY process with 0 < α ≤ 1, the number of distinct symbols grows as n^α. It is known that in this regime different choices of θ are not statistically distinguishable [31]; what is known is that θ remains o(n^α), which the selection of θ below achieves. As the reader will note, as long as 0 < α < 1 is fixed, the theorem below will help us construct weakly universal estimators further on.

Theorem 2. [Worst-case redundancy] Consider the estimator q_{α,θ}^PY(ψ_1^n) defined in (3) and (4). Then for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies m = o(n), the worst-case redundancy of the predictor q_{α,θ}^PY(ψ_1^n) with θ = m/log n satisfies

sup_{p in I} log ( p(ψ_1^n)/q_{α,θ}^PY(ψ_1^n) ) = o(n).

Proof For a pattern ψ_1^n, from the definition of q_{α,θ}^PY(ψ_1^n) in (4) and the upper bound (5) on p(ψ_1^n),

log ( p(ψ_1^n)/q_{α,θ}^PY(ψ_1^n) ) ≤ Σ_µ φ_µ log µ + Σ_µ log φ_µ! + log ( (θ+1)...(θ+n-1)/n! ) - Σ_{i=1}^{m-1} log(θ+iα) + Σ_{j=1}^m Σ_{k=1}^{n_j-1} log ( k/(k-α) ).

We can bound the components separately. First, as before, we have

Σ_µ φ_µ log µ + Σ_µ log φ_µ! ≤ m log(n/m) + m log m.

Again, letting θ̂ = ⌈θ⌉, from the same arguments as in (7),

log ( (θ+1)...(θ+n-1)/n! ) ≤ θ log n + θ log e ≤ 2m.

Moreover, -Σ_{i=1}^{m-1} log(θ+iα) ≤ -(m-1) log α - log (m-1)! ≤ -m log m + m log(e/α) + O(log m), while log(k/(k-α)) ≤ (α/(1-α)) (log e)/k gives

Σ_{j=1}^m Σ_{k=1}^{n_j-1} log ( k/(k-α) ) ≤ ( α log e/(1-α) ) Σ_{j=1}^m ( 1 + log n_j ) ≤ ( α log e/(1-α) ) ( m + m log(n/m) ).

Putting this together,

log ( p(ψ_1^n)/q_{α,θ}^PY(ψ_1^n) ) ≤ ( 1 + α log e/(1-α) ) m log(n/m) + C_α m + O(log m)

for a constant C_α depending only on α. If m = o(n), the right-hand side is o(n), as desired. □

It is well known that the Pitman-Yor process can produce patterns whose relative frequency is 0, e.g. the pattern 1^k 2 3 ... (n - k), in which the first symbol repeats k times and is followed by distinct new symbols. Therefore it is not surprising that the worst-case and average-case redundancies can be bad. However, as the next theorem shows, the actual redundancy of the Pitman-Yor estimator is Θ(n), which is significantly better than the lower bound of Ω(n log n) proved in Santhanam and Madiman [36] for Chinese restaurant processes.

Weak universality
In this section, we show how to modify the CRP or PY estimators to obtain weakly universal estimators. The CRP and PY cases are identical, so we only work out the CRP case. For all i ≥ 1 and j ≥ 1, let

c_{i,j} = 1/( i(i+1)j(j+1) ),

so that Σ_{i,j} c_{i,j} = 1. Let q̂_{i,j}^CRP(·) be the CRP measure over patterns with θ = i/j. Consider the following measure over patterns of infinite sequences, which assigns, for all n and all patterns ψ^n of length n, the probability

q*(ψ^n) = Σ_{i,j} c_{i,j} q̂_{i,j}^CRP(ψ^n).    (16)

We will show that q* is a weakly universal measure over patterns of i.i.d. sequences.
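Numerically, q* can only be evaluated by truncating the double sum. The sketch below (names are ours) uses a finite grid of (i, j) with CRP parameter θ = i/j, so the truncated value slightly underestimates the full mixture:

```python
def crp_prob(psi, theta):
    """Sequential CRP probability of a pattern (Ewens sampling formula)."""
    p, counts = 1.0, {}
    for n, j in enumerate(psi):
        p *= (theta if j not in counts else counts[j]) / (n + theta)
        counts[j] = counts.get(j, 0) + 1
    return p

def q_star(psi, terms=200):
    """Truncated mixture Sum_{i,j} c_{i,j} q_{i,j}(psi) with
    c_{i,j} = 1/(i(i+1)j(j+1)); a lower bound on the full mixture."""
    total = 0.0
    for i in range(1, terms + 1):
        for j in range(1, terms + 1):
            c = 1.0 / (i * (i + 1) * j * (j + 1))
            total += c * crp_prob(psi, i / j)
    return total

# Every CRP gives the length-1 pattern probability 1, so the truncated
# mixture recovers (1 - 1/(terms+1))^2, i.e. nearly the full weight.
print(q_star((1,)))
```

The weights c_{i,j} sum to 1 because Σ_{i≥1} 1/(i(i+1)) telescopes to 1, and each component q̂_{i,j} is itself a probability measure over patterns.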
To do so, we will need the following two lemmas. Lemma 4 is a useful "folk" inequality that we believe is attributed to Minkowski. Lemma 5 relates the expected number of distinct symbols in length-n sequences of an i.i.d. process to its entropy, and is of independent interest. The result not only strengthens a similar result in [27] but also provides a different and more compact proof.

Lemma 4. Let x_1 ≥ x_2 ≥ ... ≥ 0 and y_1 ≥ y_2 ≥ ... ≥ 0 be nonincreasing nonnegative sequences. Then

( Σ_{l=1}^n x_l ) ( Σ_{l=1}^n y_l ) ≤ n Σ_{l=1}^n x_l y_l.

Proof For every cyclic shift k, we have Σ_{l=1}^n x_l y_{(l+k) mod n} ≤ Σ_{l=1}^n x_l y_l, since both sequences are sorted in the same direction; summing over the n shifts yields the inequality. The reverse inequality, for oppositely sorted sequences, can be proven similarly, but will not be used in the paper. □

Lemma 5. Let p in I have entropy H, and let M_n denote the number of distinct symbols in a length-n i.i.d. sample from p. Then

E[M_n] ≤ nH/log n + 1.

Proof Let P(i) = p_i. We begin by noting that

H = Σ_i p_i log(1/p_i) = Σ_i p_i Σ_{l=1}^∞ (1-p_i)^l / l,

where the second equality follows by the Taylor series expansion

log ( 1/(1-x) ) = Σ_{l=1}^∞ x^l / l.

The right summation in the equation above is bounded below as follows:

Σ_{l=1}^∞ (1-p_i)^l / l ≥ Σ_{l=1}^n (1-p_i)^l / l ≥ (1/n) ( Σ_{l=1}^n (1-p_i)^l ) ( Σ_{l=1}^n 1/l ) ≥ (log n/n) Σ_{l=1}^n (1-p_i)^l,

where the second inequality follows from Minkowski's inequality in Lemma 4 (with x_l = (1-p_i)^l and y_l = 1/l) and the last inequality holds because Σ_{l=1}^n 1/l ≥ log n. Thus,

H ≥ (log n/n) Σ_i p_i Σ_{l=1}^n (1-p_i)^l = (log n/n) Σ_i (1-p_i)( 1-(1-p_i)^n ) ≥ (log n/n) ( E[M_n] - 1 ),

where for the second inequality we use

E[M_n] = Σ_i ( 1 - (1-p_i)^n )  and  Σ_i p_i ( 1 - (1-p_i)^n ) ≤ 1.

Rearranging yields the lemma. (Here log denotes the natural logarithm; the ratio H/log n is base-independent.) □

Theorem 6. Let p be any i.i.d. measure in I^∞ whose marginal has finite entropy. Then

lim sup_{n→∞} (1/n) D( p(ψ^n) || q*(ψ^n) ) = 0.

That is, q* is weakly universal.
Proof We write the divergence between p and q* in (16) as an expected log ratio, conditioning on M_n, the number of distinct symbols observed:

D( p || q* ) = Σ_m P(M_n = m) E[ log ( p(ψ_1^n)/q*(ψ_1^n) ) | M_n = m ].    (19)

Consider the estimator q̂_{i,j}^CRP(ψ_1^n) in q* corresponding to i = M_n and j = log n (rounded to an integer). This is the estimator q_θ^CRP(ψ_1^n) with θ = M_n/log n. Since q*(ψ_1^n) ≥ c_{M_n, log n} q̂_{M_n, log n}^CRP(ψ_1^n), we have

log ( p(ψ_1^n)/q*(ψ_1^n) ) ≤ log ( M_n(M_n+1)(log n)(log n+1) ) + log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ).    (20)

We will bound the two terms in (20) in two regimes for M_n. The result of Theorem 1 says that if M_n < n(log log n)^2/log n, then

log ( p(ψ_1^n)/q*(ψ_1^n) ) ≤ log ( M_n(M_n+1)(log n)(log n+1) ) + o(n).

Thus this term is o(n). If M_n > n(log log n)^2/log n, then we first apply Markov's inequality using the previous lemma:

P( M_n > n(log log n)^2/log n ) ≤ ( log n/(n(log log n)^2) ) ( nH/log n + 1 ) ≤ H/(log log n)^2 + log n/(n(log log n)^2).
Therefore, for all finite-entropy processes this probability goes to 0 as n → ∞. Looking at the term in q* corresponding to q_θ^CRP(ψ_1^n) with θ = M_n/log n and using the fact that M_n ≤ n, we see that the first term in (20) is upper bounded as O(log n). For the second term we appeal to (9) in the proof of Theorem 1:

log ( p(ψ_1^n)/q_θ^CRP(ψ_1^n) ) ≤ M_n log(n/M_n) + M_n log log n + (M_n/log n) log 2 + 2M_n ≤ n + n log log n + (n/log n) log 2 + 2n ≤ 3n log log n,

where the last inequality holds for sufficiently large n.
Plugging these terms into (19), we obtain

D( p || q* ) ≤ o(n) + ( H/(log log n)^2 + log n/(n(log log n)^2) ) ( 3n log log n + O(log n) ) = o(n),

which proves the theorem. □

The preceding theorem shows that the mixture of CRP estimators q* is weakly universal. However, note that q* is not itself a CRP estimator. An identical construction is possible for the PY estimators as well. The convergence of the weakly universal q* depends on the entropy of the source, as well as on the number of distinct symbols in a sample of size n.
While it would be tempting to predict the performance of the estimator q* for larger sample sizes N ≥ n, such a task requires a more careful analysis. In general, it may be impossible to nontrivially bound M_N, the number of distinct symbols, based on the smaller sample of size n, as the following example shows. Example 1. Let n = √N. Consider a model class containing the following two distributions: (i) p, over the natural numbers, which assigns probability 1 - 1/n^{3/2} = 1 - 1/N^{3/4} to the symbol 1 and splits the remaining probability 1/N^{3/4} equally among the elements of the set {2, ..., N^2 + 1}; and (ii) p', which simply assigns probability 1 to the symbol 1. A sample of size n from either p or p' is 1^n with probability at least 1 - 1/N^{1/4}, no matter what the underlying source is; therefore we cannot distinguish between these sources with probability 1 - 1/N^{1/4} from a sample of size n.
But a sample of size N from p has O(N^{1/4}) distinct symbols on average, while a sample from p' will contain only 1 element. It follows that if all we know is that the unknown distribution comes from this class, then with high probability under the unknown source we cannot predict from a sample of length n whether the number of symbols in a sample of size N will remain 1 or not. Furthermore, by changing the ratio of n and N (and therefore the probability of the symbol 1 under p), we can make the expected number of symbols in a length-N sample under p as large as we want. However, it is possible to impose restrictions on the class of distributions that ensure that we can predict the number of symbols in longer samples. In future work, we will borrow from the data-derived consistency formulations of [35] to characterize when we will be able to predict the number of symbols in longer samples.

Conclusions and future work
In this note we investigated the worst-case and average-case redundancies of pattern probability estimators derived from priors on I that are popular in Bayesian statistics. Both the CRP and Pitman-Yor estimators give vanishing redundancy per symbol for patterns whose number of distinct symbols m is sufficiently small. The Pitman-Yor estimator requires only that m = o(n), which is an improvement over the CRP. However, when m can be arbitrarily large (or the alphabet size is arbitrarily large), the worst-case and average-case redundancies do not scale as o(n). Here again the Pitman-Yor estimator is superior, in that its redundancy scales as Θ(n), as opposed to the Ω(n log n) of the CRP estimator. While these results show that these estimators are not strongly universal, we constructed a mixture of CRP estimators (which is not itself a CRP estimator) that is weakly universal.
On the other hand, one of the estimators derived in [26] is exchangeable and has near-optimal worst-case redundancy, growing as O(√n). By Kingman's results, this estimator can be obtained using a prior on I; however, this prior is as yet unknown. Finding it may reveal interesting new classes of priors beyond the Poisson-Dirichlet priors.