Article

Redundancy of Exchangeable Estimators

by Narayana P. Santhanam 1,*, Anand D. Sarwate 2,* and Jae Oh Woo 3,*
1 Department of Electrical Engineering, University of Hawaii at Manoa, 2540 Dole Street, Honolulu, HI 96822, USA
2 Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA
3 Applied Mathematics Program, Yale University, 51 Prospect St, New Haven, CT 06511, USA
* Authors to whom correspondence should be addressed.
Entropy 2014, 16(10), 5339-5357; https://doi.org/10.3390/e16105339
Submission received: 19 July 2014 / Revised: 23 September 2014 / Accepted: 8 October 2014 / Published: 13 October 2014
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson–Dirichlet priors (or “Chinese restaurant processes”) and the Pitman–Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.

1. Introduction

A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, require inferences based on observed sequences of symbols in which the sequence length, or sample size, is comparable to or even smaller than the size of the set of symbols, the alphabet. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the vocabulary.
Inference in this setting has received a lot of attention, from Laplace [1–3] in the 18th century, to Good [4] in the mid-20th century, to an explosion of work in the statistics [5–13], information theory [14–17,21,22] and machine learning [23–26] communities in the last few decades. A major strand in the information theory literature on the subject has been based on the notion of patterns. The pattern of a sequence characterizes the repeat structure in the sequence, which is the information that can be described well (see Orlitsky et al. [27] for formal characterizations of this idea). The statistical literature has emphasized the importance of exchangeability, which generalizes the notion of independence.
We consider measures over infinite sequences X_1, X_2, . . ., where the X_i come from a countable (possibly infinite) set, the alphabet. Let 𝒫 be the collection of all distributions over countable (potentially infinite) alphabets. For p ∈ 𝒫, let p^(n) denote the product distribution corresponding to an independent and identically distributed (i.i.d.) sample X_1^n = (X_1, X_2, …, X_n), where X_i ~ p. Let 𝒫^(n) be the collection of all such i.i.d. distributions on length-n sequences drawn from countable alphabets. Let 𝒫^∞ be the collection of all measures over infinite sequences X_1, X_2, . . . of symbols from countable alphabets, where the {X_i} are i.i.d. according to some distribution in 𝒫. These measures are constructed by extending to the Borel sigma algebra the i.i.d. probability assignments on finite-length sequences, namely 𝒫^(n), n ≥ 1. We call 𝒫^∞ the set of i.i.d. measures.
Based on a sample X_1^n = (X_1, X_2, …, X_n) from an unknown p^(n) ∈ 𝒫^(n) (or, equivalently, the corresponding measure in 𝒫^∞), we want to create an estimator q_n that assigns probabilities to length-n sequences. We are interested in the behavior of the sequence of estimators {q_n : n = 1, 2, . . .}. With some abuse of notation, we will use q to denote the estimator q_n when the sample size n is clear from context. We want q_n to approximate p^(n) well; in particular, we would like q_n to neither overestimate nor underestimate the probability of sequences of length n under the true p^(n).
Suppose that there exist R_n > 0 and A_n > 0 such that, for each p ∈ 𝒫^∞, we have:
$$ p^{(n)}\left(\left\{ X_1^n : q_n(X_1^n) > R_n\, p^{(n)}(X_1^n) \right\}\right) < \frac{1}{A_n}. $$
If A_n is any function of n that grows sufficiently fast with n, the sequence {q_n} does not asymptotically overestimate the probabilities of length-n sequences by a factor larger than R_n, with probability one, no matter what measure p generated the sequences.
Protecting against underestimation is not so simple. The redundancy of an estimator q_n (defined formally in Section 2.4) for a length-n sequence x_1^n measures how closely q_n(x_1^n) matches:
$$ \max_{p^{(n)} \in \mathcal{P}^{(n)}} p^{(n)}(x_1^n), $$
the largest probability assigned to x_1^n by any distribution in 𝒫^(n). The redundancy of the estimator itself either maximizes the redundancy over sequences or takes the expectation over all sequences. Ideally, we want the estimator redundancy to grow sublinearly in the sequence length n, so that the per-sample redundancy vanishes as n → ∞. If so, we call the estimator universal for 𝒫^∞. Redundancy thus captures how well q performs against the collection 𝒫^∞, but the connections between estimation problems and compression run deeper.
In this paper, we consider estimators formed by taking a measure (prior) on 𝒫. Different priors induce different distributions on the data X_1^n. We think of the prior as randomly choosing a distribution p in 𝒫, with the observed data X_1^n generated according to this p. How much information about the underlying distribution p can we obtain from the data (assuming we know the prior)? Indeed, a well-known result [28–30] shows that the redundancy of the best possible estimator for 𝒫^∞ equals the maximum (over all choices of priors) information (in bits) that is present about the underlying source in a length-n sequence generated in this manner.
Redundancy is well defined for finite alphabets; recent work [16] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract the identities of symbols and indicate only their relative order of appearance. For example, the pattern of FEDERER is 1232424, while that of PATTERN is 1233456. The crux of the idea is that instead of considering the set of measures over infinite sequences, we consider the set of measures induced over the patterns of the sequences. It then follows that our estimate q_Ψ is a measure over patterns. While the variables in the sequence are i.i.d., the corresponding pattern corresponds to an exchangeable random partition. We can associate a predictive distribution with the pattern probability estimator q_Ψ. This is an estimate of the distribution of X_{n+1} given the previous observations, and it assigns a probability to the event that X_{n+1} will be “new” (has not appeared in X_1^n) and probabilities to the events that X_{n+1} takes on one of the values that has been seen so far.
The above view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [31] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [11,32] collected in [33]. One of the most popular exchangeable random partition processes is the “Chinese restaurant process” [10], which is a special case of the Poisson–Dirichlet or Pitman–Yor process [13,34]. These processes can be viewed as prior distributions on the set of all discrete distributions that can be used as the basis for estimating probabilities and computing predictive distributions.
In this paper, we evaluate the performance of the sequential estimators corresponding to these exchangeable partition processes. As described before, 𝒫 is the collection of all distributions over countable (potentially infinite) alphabets, and 𝒫^∞ is the collection of all i.i.d. measures with single-letter marginals in 𝒫. Let 𝒫^∞_Ψ be the collection of all measures over patterns induced by measures in 𝒫^∞. We evaluate the redundancy of estimators based on the Chinese restaurant process (CRP), the Pitman–Yor (PY) process and the Ewens sampling formula against 𝒫^∞_Ψ.
In the context of sequential estimation, early work [16] showed that for the collection 𝒫^∞_Ψ of measures over patterns, universal estimators do exist: the normalized redundancy is O(n^{1/2}). More recent work [18,19] proved tight bounds on worst-case and average redundancy; these results show that there are sequential estimators with normalized redundancy of O(n^{1/3}). However, these estimators are computationally intensive and (generally speaking) infeasible in practice. Acharya et al. [20] demonstrated a linear-time estimator with average redundancy O(n^{1/2}), improving over the earlier constructions achieving O(n^{2/3}) [16]. By contrast, estimators such as the CRP or the PY estimator were not developed in a universal compression framework, but have been very successful from a practical standpoint. The goal of this paper is to understand these Bayesian estimators from the universal compression perspective.
For the case of the estimators studied in nonparametric Bayesian statistics, our results show that they are in general neither weakly nor strongly universal when compressing patterns or equivalently exchangeable random partitions. While the notion of redundancy is in some sense different from other measures of accuracy, such as the concentration of the posterior distribution about the true distribution, the parameters of the CRPs or the PY processes that do compress well often correspond to the maximum likelihood estimates obtained from the sample.
Because we choose to measure redundancy in the worst case over p, the underlying alphabet size may be arbitrarily large with respect to the sample size n. Consequently, for a fixed sample of size n, the number of symbols could be large, for example, a constant fraction of n. The CRP and PY estimators do not have good redundancy against such samples, since such samples are not what these estimators are designed for. However, we can show that a mixture of CRP estimators is weakly universal. This mixture is formed by combining individually optimized CRP estimators that (implicitly) assume a bound on the support of p. If such a bound is known in advance, we can derive much tighter bounds on the redundancy. In this setting, the two-parameter Poisson–Dirichlet (or Pitman–Yor) estimator is superior to the estimator derived from the Chinese restaurant process.
In order to describe our results, we require a variety of definitions from different research communities. In the next section, we describe this preliminary material and place it in context before describing the main results in Section 3.

2. Preliminaries

In this paper, we use the “big-O” notation. A function f(n) = O(g(n)) if there exists a positive constant C such that, for sufficiently large n, |f(n)| ≤ C|g(n)|. A function f(n) = Ω(g(n)) if there exists a positive constant C′ such that, for sufficiently large n, |f(n)| ≥ C′|g(n)|. A function f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)).
Let 𝒫_k denote the set of all probability distributions on alphabets of size k and 𝒫_∞ the set of all probability distributions on countably infinite alphabets, and let:
$$ \mathcal{P} = \Bigl( \bigcup_{k \ge 1} \mathcal{P}_k \Bigr) \cup \mathcal{P}_\infty $$
be the set of all discrete distributions irrespective of support and support size.
For a fixed p ∈ 𝒫, let x_1^n = (x_1, x_2, …, x_n) be a sequence drawn i.i.d. according to p. We denote the pattern of x_1^n by ψ_1^n. The pattern is formed by taking ψ_1 = 1 and:
$$ \psi_i = \begin{cases} \psi_j & \text{if } x_i = x_j \text{ for some } j < i, \\ 1 + \max_{j < i} \psi_j & \text{if } x_i \ne x_j \text{ for all } j < i. \end{cases} $$
For example, the pattern of x_1^7 = FEDERER is ψ_1^7 = 1232424. Let Ψ^n be the set of all patterns of length n. We write p(ψ^n) for the probability that a length-n sequence generated by p has pattern ψ^n. For a pattern ψ_1^n, we write φ_μ for the number of symbols that appear μ times in ψ_1^n, and m = ∑_μ φ_μ is the number of distinct symbols in ψ_1^n. We call φ_μ the prevalence of μ. Thus, for FEDERER, we have φ_1 = 2, φ_2 = 1, φ_3 = 1 and m = 4.
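As a quick illustration of these definitions, the following minimal Python sketch (ours, not part of the paper) computes the pattern of a sequence and its prevalences φ_μ; running it on FEDERER reproduces the example above.

```python
from collections import Counter

def pattern(seq):
    """Replace each symbol by the order (1-indexed) in which it first appears."""
    first_seen = {}
    out = []
    for x in seq:
        if x not in first_seen:
            first_seen[x] = len(first_seen) + 1
        out.append(first_seen[x])
    return out

def prevalences(psi):
    """Return {mu: phi_mu}: the number of distinct symbols appearing mu times."""
    counts = Counter(psi)            # how often each distinct symbol appears
    return dict(Counter(counts.values()))

psi = pattern("FEDERER")
print(psi)                 # [1, 2, 3, 2, 4, 2, 4]
print(prevalences(psi))    # {1: 2, 3: 1, 2: 1}, i.e., phi_1 = 2, phi_2 = 1, phi_3 = 1
```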

2.1. Exchangeable Partition Processes

An exchangeable random partition refers to a sequence (C_n : n ∈ ℕ), where C_n is a random partition of the set [n] = {1, 2, . . . , n}, satisfying the following conditions: (i) the probability that C_n is a particular partition depends only on the vector (s_1, s_2, . . . , s_n), where s_k is the number of parts of size k in the partition; and (ii) the realizations of the sequence are consistent, in that all of the parts of C_n are also parts of the partition C_{n+1}, except that the new element n + 1 may either form a new part of C_{n+1} by itself or join one of the existing parts of C_n.
For a sequence X_1, . . . , X_n from a discrete alphabet, one can partition the set [n] into component sets {A_x}, where A_x = {i : X_i = x} is the set of positions in which x has appeared. When the {X_i} are drawn i.i.d. from a distribution in 𝒫, the corresponding sequence of random partitions is called a paintbox process.
The remarkable Kingman representation theorem [8] states that the probability measure induced by any exchangeable random partition is a mixture of paintbox processes, where the mixture is taken using a probability measure (a “prior” in Bayesian terminology) on the class of paintbox processes. Since each paintbox process corresponds to a discrete probability measure (the one from which the i.i.d. X_i generating the paintbox process are drawn), the prior may be viewed as a distribution on the set of probability measures on a countable alphabet. For technical reasons, the alphabet is assumed to be hybrid, with both a discrete and a continuous part, and one needs to work with the space of ordered probability vectors (see [35] for details).

2.2. Dirichlet Priors and Chinese Restaurant Processes

Not surprisingly, special classes of priors give rise to special classes of exchangeable random partitions. One particularly nice class of priors on the set of probability measures on a countable alphabet is that of the Poisson–Dirichlet priors [5,36,37] (sometimes called Dirichlet processes, since they live on the infinite-dimensional space of probability measures and generalize the usual finite-dimensional Dirichlet distribution).
The Chinese restaurant process (or CRP) is related to the so-called Griffiths–Engen–McCloskey (GEM) distribution with parameter θ, denoted by GEM(θ). Consider W_1, W_2, . . . drawn i.i.d. according to a Beta(1, θ) distribution, and set:
$$ p_1 = W_1, \qquad p_i = W_i \prod_{j < i} (1 - W_j), \quad i > 1. $$
This can be interpreted as follows: take a stick of unit length and break it into pieces of size W1 and 1 – W1. Now take the piece of size 1 – W1 and break off a W2 fraction of that. Continue in this way. The resulting lengths of the sticks create a distribution on a countably infinite set. The distribution of the sequence p = (p1, p2, . . .) is the GEM(θ) distribution.
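The stick-breaking recipe above translates directly into code. The following sketch (an illustration of ours, not taken from the paper; it assumes numpy is available) draws a truncated GEM(θ) vector:

```python
import numpy as np

def gem_sample(theta, num_atoms=1000, rng=None):
    """Draw the first `num_atoms` stick lengths of a GEM(theta) random vector."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.beta(1.0, theta, size=num_atoms)         # W_i ~ Beta(1, theta)
    # remaining[i] = prod_{j < i} (1 - W_j), with remaining[0] = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    return w * remaining                              # p_i = W_i * prod_{j<i}(1 - W_j)

p = gem_sample(theta=2.0, num_atoms=2000)
print(p.sum())                  # close to 1 for a long enough truncation
print(np.sort(p)[::-1][:5])     # sorting in decreasing order gives (a truncation of) PD(theta)
```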

Remark

Let π denote the elements of p sorted in decreasing order, so that π_1 ≥ π_2 ≥ ···. Then, the distribution of π is the Poisson–Dirichlet distribution PD(θ) as defined by Kingman.
Another popular class of distributions on probability vectors is the Pitman–Yor family of distributions [13], also known as the two-parameter Poisson–Dirichlet family of distributions PD(α, θ). The two parameters here are a discount parameter α ∈ [0, 1] and a strength parameter θ > −α. The distribution PD(α, θ) can be generated in a similar way to the Poisson–Dirichlet distribution PD(θ) = PD(0, θ) described earlier. Let each W_i be drawn independently according to a Beta(1 − α, θ + iα) distribution, and again, set:
$$ \tilde{p}_1 = W_1, \qquad \tilde{p}_i = W_i \prod_{j < i} (1 - W_j), \quad i > 1. $$
A similar “stick-breaking” interpretation holds here, as well. Now, let p be the sequence (p̃_1, p̃_2, . . .) sorted in descending order. The distribution of p is PD(α, θ). If we have α < 0 and θ = r|α| for an integer r, we may obtain a symmetric Dirichlet distribution of dimension r.

2.3. Pattern Probability Estimators

Given a sample x_1^n with pattern ψ^n, we would like to produce a pattern probability estimator. This is a function of the form q(ψ_{n+1} | ψ^n) that assigns a probability to each symbol previously seen in ψ^n, as well as a probability of seeing a new symbol. In this paper, we will investigate two different pattern probability estimators based on Bayesian models.
The Ewens sampling formula [38–40], which has its origins in theoretical population genetics, is a formula for the probability mass function of a marginal of a CRP corresponding to a fixed population size. In other words, it specifies the probability of an exchangeable random partition of [n] that is obtained when one uses the Poisson–Dirichlet PD(θ) prior to mix paintbox processes. Because of the equivalence between patterns and exchangeable random partitions, it estimates the probability of a pattern ψ_1^n via the following formula:
$$ q_\theta^{\mathrm{CRP}}(\psi_1, \ldots, \psi_n) = \frac{\theta^{m}}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{\mu=1}^{n} \bigl[(\mu-1)!\bigr]^{\varphi_\mu}. \qquad (1) $$
Recall that φ_μ is the number of symbols that appear μ times in ψ_1^n. In particular, the predictive distribution associated with the Ewens sampling formula, or Chinese restaurant process, is:
$$ q_\theta^{\mathrm{CRP}}(\psi \mid \psi_1, \ldots, \psi_n) = \begin{cases} \dfrac{\mu}{n+\theta} & \psi \text{ appeared } \mu \text{ times in } \psi_1, \ldots, \psi_n; \\[6pt] \dfrac{\theta}{n+\theta} & \psi \text{ is a new symbol.} \end{cases} \qquad (2) $$
More generally, one can define the Pitman–Yor predictor (for α ∈ [0, 1] and θ > −α) as:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(\psi \mid \psi_1, \ldots, \psi_n) = \begin{cases} \dfrac{\mu - \alpha}{n+\theta} & \psi \text{ appeared } \mu \text{ times in } \psi_1, \ldots, \psi_n; \\[6pt] \dfrac{\theta + m\alpha}{n+\theta} & \psi \text{ is a new symbol,} \end{cases} \qquad (3) $$
where m is the number of distinct symbols in ψ_1^n. The probability assigned by the Pitman–Yor predictor to a pattern ψ_1^n is therefore:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1, \ldots, \psi_n) = \frac{\theta(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{\mu=1}^{n} \left( \frac{\Gamma(\mu-\alpha)}{\Gamma(1-\alpha)} \right)^{\varphi_\mu}. \qquad (4) $$
Note that Γ(μ − α)/Γ(1 − α) = (μ − 1 − α)(μ − 2 − α) ⋯ (1 − α).
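The closed-form expression (4) is simply the product of the sequential predictive probabilities in (3). The following Python sketch (for illustration only; the function names are ours) computes a pattern probability both ways so the two can be checked against each other; setting α = 0 recovers the CRP/Ewens case of (1) and (2).

```python
from collections import Counter
from math import prod, gamma

def py_sequential(psi, alpha, theta):
    """Multiply the Pitman-Yor predictive probabilities along the pattern."""
    p, counts = 1.0, Counter()
    for n_seen, s in enumerate(psi):                 # n_seen symbols observed so far
        m = len(counts)                              # distinct symbols so far
        if s in counts:                              # previously seen symbol
            p *= (counts[s] - alpha) / (n_seen + theta)
        else:                                        # new symbol
            p *= (theta + m * alpha) / (n_seen + theta)
        counts[s] += 1
    return p

def py_closed_form(psi, alpha, theta):
    """Evaluate the closed-form pattern probability."""
    counts, n = Counter(psi), len(psi)
    m = len(counts)
    num = prod(theta + j * alpha for j in range(1, m))   # (theta+alpha)...(theta+(m-1)alpha)
    den = prod(theta + i for i in range(1, n))           # (theta+1)...(theta+n-1)
    gammas = prod(gamma(mu - alpha) / gamma(1 - alpha) for mu in counts.values())
    return (num / den) * gammas

psi = [1, 2, 3, 2, 4, 2, 4]                              # pattern of FEDERER
print(py_sequential(psi, 0.5, 1.0), py_closed_form(psi, 0.5, 1.0))   # equal
print(py_sequential(psi, 0.0, 1.0), py_closed_form(psi, 0.0, 1.0))   # CRP case
```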

2.4. Strong Universality Measures: Worst-Case and Average

How should we measure the quality of a pattern probability predictor q? We investigate two criteria here: the worst-case and the average-case redundancy. The redundancy of q on a given pattern ψn is:
$$ R(q, \psi^n) \overset{\mathrm{def}}{=} \sup_{p} \log \frac{p(\psi^n)}{q(\psi^n)}. $$
The worst-case redundancy of q is defined to be:
$$ \hat{R}(q) \overset{\mathrm{def}}{=} \max_{\psi^n \in \Psi^n} \sup_{p} \log \frac{p(\psi^n)}{q(\psi^n)} = \sup_{p} \max_{\psi^n \in \Psi^n} \log \frac{p(\psi^n)}{q(\psi^n)}. $$
Recall that p(ψn) just denotes the probability that a length-n sequence generated by p has pattern ψn; it is unnecessary to specify the support here. Since the set of length-n patterns is finite, there is no need for a supremum in the outer maximization above. The worst-case redundancy is often referred to as the per-sequence redundancy, as well.
The average-case redundancy replaces the maximum over patterns with an expectation under p:
$$ \bar{R}(q) \overset{\mathrm{def}}{=} \sup_{p} \mathbb{E}_p\!\left[ \log \frac{p(\psi^n)}{q(\psi^n)} \right] = \sup_{p} D(p \,\|\, q), $$
where D(·||·) is the Kullback–Leibler divergence or relative entropy. That is, the average-case redundancy is nothing but the worst-case Kullback–Leibler divergence between the distribution p and the predictor q.
A pattern probability estimator is considered “good” if the worst-case or average-case redundancies are sublinear in n, that is, if R̂(q)/n → 0 and R̄(q)/n → 0 as n → ∞. Succinctly put, redundancy that is sublinear in n implies that the underlying probability of a sequence can be estimated accurately almost surely. Redundancy is one way to measure the “frequentist” properties of the Bayesian approaches we consider in this paper, and it refers to the compressibility of the distribution from an information-theoretic perspective.
As mentioned in the Introduction, redundancy differs from notions, such as the concentration of the posterior distribution about the true distribution. However, the parameters of the CRPs or the PY processes that compress well often correspond to the maximum likelihood (ML) estimates from the sample.

2.5. Weak Universality

In the previous section, we considered guarantees that hold over the entire model class; both the worst-case and the average-case redundancy involve taking a supremum over the entire model class. Therefore, the strong guarantees, average or worst-case, hold uniformly over the model class. However, as we will see, exchangeable estimators, in particular the Chinese restaurant process and Pitman–Yor estimators, are tuned, by the appropriate choice of parameters, towards specific kinds of sources rather than towards all i.i.d. models. This behavior is better captured by looking at the model-dependent convergence of the exchangeable estimators, which is known as weak universality. Specifically, consider a collection of i.i.d. measures over infinite sequences, and let 𝒫_Ψ be the collection of measures induced on patterns by this collection. We say that an estimator q is weakly universal for 𝒫_Ψ if, for all p ∈ 𝒫_Ψ:
$$ \limsup_{n \to \infty} \frac{1}{n}\, \mathbb{E}_p\!\left[ \log \frac{p(\psi^n)}{q(\psi^n)} \right] = 0. $$

3. Strong Redundancy

We now describe our main results on the redundancy of estimators derived from the prior distributions on 𝒫.

3.1. Chinese Restaurant Process Predictors

Previously [41], it was shown by some of the authors that the worst-case and average-case redundancies for the CRP estimator are both Ω(n log n), which means that it is not strongly universal. However, this negative result follows because the CRP estimator is tuned not to the entire i.i.d. class of distributions, but to a specific subclass of i.i.d. sources depending on the choice of parameter. To investigate this further, we analyze the redundancy of the CRP estimator when we have a bound on the number m of distinct elements in the pattern ψ_1^n.
Chinese restaurant processes q_θ^CRP(·) with parameter θ are known to generate exchangeable random partitions in which the number of distinct parts M satisfies M/log n → θ almost surely as the sample size n increases; see, e.g., [42]. Equivalently, the CRP generates patterns with M distinct symbols, where M/log n → θ. The following theorem, however, reverses the above setting. Here, we are given an i.i.d. sample of data of length n with m symbols (how the data were generated is not important), but we pick the parameter of a CRP estimator that describes the pattern of the data well. While it is satisfying that the chosen parameter matches the ML estimate of θ based on the number of observed symbols, note that this need not be the only parameter choice that works.

Theorem 1

[Redundancy for CRP estimators] Consider the estimator q_θ^CRP(ψ_1^n) in (1) and (2). Then, for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies:
$$ m \le C \cdot \frac{n (\log\log n)^2}{\log n}, $$
the redundancy of the predictor q_θ^CRP(ψ_1^n) with ⌈θ⌉ = m/log n satisfies:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le 3C \cdot \frac{n (\log\log n)^3}{\log n} = o(n). $$

Proof

The number of patterns with prevalence {φμ} is:
$$ \frac{n!}{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}, $$
and therefore:
$$ p(\psi_1^n) \le \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{n!}, \qquad (5) $$
since patterns with prevalence {φμ} all have the same probability.
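As a sanity check on this counting step (a sketch of ours, not part of the paper), one can enumerate all patterns of a small length n by brute force and verify that the number of patterns with a given prevalence profile matches the formula above:

```python
from collections import Counter
from math import factorial, prod

def all_patterns(n):
    """Enumerate all patterns of length n (restricted growth strings starting at 1)."""
    def extend(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for s in range(1, max(prefix, default=0) + 2):   # reuse a symbol or add one new one
            yield from extend(prefix + [s])
    return extend([])

def profile(psi):
    """Prevalence profile of a pattern, as sorted (mu, phi_mu) pairs."""
    return tuple(sorted(Counter(Counter(psi).values()).items()))

def count_formula(n, prof):
    """n! / prod_mu (mu!)^phi_mu phi_mu!"""
    return factorial(n) // prod(factorial(mu) ** phi * factorial(phi) for mu, phi in prof)

n = 6
observed = Counter(profile(psi) for psi in all_patterns(n))
assert all(count == count_formula(n, prof) for prof, count in observed.items())
print("counting formula verified for n =", n)
```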
Using the upper bound (5) on p(ψ_1^n) together with (1) yields:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{\prod_{\mu=1}^{n} [(\mu-1)!]^{\varphi_\mu}\, \theta^m} + \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} = \log\!\left( \prod_{\mu=1}^{n} \mu^{\varphi_\mu} \right) + \log\!\left( \frac{1}{\theta^m} \prod_{\mu=1}^{n} \varphi_\mu! \right) + \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!}. \qquad (6) $$
Let θ̄ = ⌈θ⌉. The following bound follows from Stirling’s approximation:
$$ \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} \le \frac{(\bar\theta+n)!}{\bar\theta!\, n!} \le \frac{(\bar\theta+n)^{\bar\theta+n}}{\bar\theta^{\bar\theta}\, n^{n}} = \left( \frac{\bar\theta+n}{\bar\theta} \right)^{\bar\theta} \left( \frac{\bar\theta+n}{n} \right)^{n}. \qquad (7) $$
The first term in (6) can be upper bounded by log (n/m)^m, since the product ∏_μ μ^{φ_μ} is maximized, subject to ∑_μ μφ_μ = n and ∑_μ φ_μ = m, when all m symbols appear μ = n/m times. The second term is also maximized when all symbols appear the same number of times, corresponding to φ_μ = m for a single μ. Therefore:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \left( \frac{n}{m} \right)^{m} + \log \frac{m!}{\theta^m} + \log \left( \frac{\bar\theta+n}{\bar\theta} \right)^{\bar\theta} \left( 1 + \frac{\bar\theta}{n} \right)^{n}. $$
Now, (1 + θ̄/n)^n ≤ e^θ̄ for sufficiently large n, so:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le \log \left( \frac{n}{m} \right)^{m} + \log \frac{m!}{\theta^m} + \log \left( \frac{(\bar\theta+n)\,e}{\bar\theta} \right)^{\bar\theta}. \qquad (8) $$
Choose θ̄ = m/log n. This gives the bound:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le m \log \frac{n}{m} + \log \frac{m!}{m^m}\left( \frac{\bar\theta}{\theta} \right)^{m} + m \log\log n + \log \left( \frac{(\bar\theta+n)\,e}{\bar\theta} \right)^{\bar\theta}. $$
The second term is negative for sufficiently large m. Therefore:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le m \log \frac{n}{m} + m \log\log n + \frac{m}{\log n} \log\!\left( 2 + \frac{n \log n}{m} \right). \qquad (9) $$
Noting that the function above is monotonic in m for n ≥ 16, we choose:
$$ m = C\, \frac{n (\log\log n)^2}{\log n}, $$
and the bound becomes:
$$ \log \frac{p(\psi_1^n)}{q_\theta^{\mathrm{CRP}}(\psi_1^n)} \le C\, \frac{n (\log\log n)^2}{\log n} \log\!\left( \frac{\log n}{(\log\log n)^2} \right) + C\, \frac{n (\log\log n)^3}{\log n} + C\, n \left( \frac{\log\log n}{\log n} \right)^{2} \log\!\left( 2 + \left( \frac{\log n}{\log\log n} \right)^{2} \right) \le 3C\, \frac{n (\log\log n)^3}{\log n} = o(n). $$
This theorem is slightly dissatisfying, since it requires us to have a bound on the number of distinct symbols in the pattern. In Section 4, we take mixtures of CRP estimators to arrive at estimators that are weakly universal.

3.2. Pitman–Yor Predictors

We now turn to the more general class of Pitman–Yor predictors. We obtain a result similar to that for the CRP estimator, but now we can handle all patterns with m = o(n) distinct symbols.
As before, the context for the following theorem is this: we are given an i.i.d. sample of data of length n with m symbols (again, how the data were generated is not important), but we pick the parameters of a PY estimator that describes the pattern of the data well. The chosen PY estimator is not necessarily the best one, but it will help us construct the weakly universal estimator in later sections of this paper.
We also note that the choice of the parameter θ below is analogous to our choice when α = 0 (reducing to the CRP case). For patterns generated by a PY process with 0 < α ≤ 1, the number of distinct symbols grows as n^α. It is known that in this regime the choice of θ is not distinguishable [13]. What is known, however, is that the choice of θ remains o(n^α), which is achieved by the selection of θ below. As the reader will note, as long as 0 < α < 1 is fixed, the theorem below will help us construct weakly universal estimators further on.

Theorem 2

[Worst-case redundancy] Consider the estimator q_{α,θ}^PY(ψ_1^n). Then, for sufficiently large n and for patterns ψ_1^n whose number of distinct symbols m satisfies m = o(n), the worst-case redundancy of the predictor q_{α,θ}^PY(ψ_1^n) with θ = m/log n satisfies:
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} = o(n). $$

Proof

For a pattern ψ_1^n, from the definition of q_{α,θ}^PY(ψ_1^n) in (4) and the bound (5),
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} \le \log\!\left( \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}\, \varphi_\mu!}{n!} \cdot \frac{(\theta+1)\cdots(\theta+n-1)}{(\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha)} \prod_{\mu=1}^{n} \left( \frac{\Gamma(1-\alpha)}{\Gamma(\mu-\alpha)} \right)^{\varphi_\mu} \right). $$
We can bound the components separately. First, as before we have:
$$ \prod_{\mu=1}^{n} \varphi_\mu! \le m!. $$
Since θ > −α, we have θ + α > 0 and:
$$ (\theta+\alpha)(\theta+2\alpha)\cdots(\theta+(m-1)\alpha) \ge (\theta+\alpha)\cdot\alpha\cdot(2\alpha)\cdots((m-2)\alpha) = (\theta+\alpha)\,(m-2)!\,\alpha^{m-2}. $$
Again, letting θ̄ = ⌈θ⌉, from the same arguments as in (7) and (8),
$$ \log \frac{\theta(\theta+1)\cdots(\theta+n-1)}{n!} \le \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta}. $$
Finally, note that (1 − α)(2 − α) · · · (μ − 1 − α) ≥ (1 − α)(μ − 2)!, so:
$$ \frac{\prod_{\mu=1}^{n} [\mu!]^{\varphi_\mu}}{\prod_{\mu=1}^{n} \bigl[(1-\alpha)(2-\alpha)\cdots(\mu-1-\alpha)\bigr]^{\varphi_\mu}} \le \prod_{\mu=1}^{n} \left( \frac{\mu!}{(1-\alpha)(\mu-2)!} \right)^{\varphi_\mu} \le \frac{\prod_{\mu=1}^{n} \mu^{2\varphi_\mu}}{(1-\alpha)^{m}} \le \frac{(n/m)^{2m}}{(1-\alpha)^{m}}. $$
Putting this together:
$$ \log \frac{p(\psi_1^n)}{q_{\alpha,\theta}^{\mathrm{PY}}(\psi_1^n)} \le \log \frac{m!}{(\theta+\alpha)\,(m-2)!\,\alpha^{m-2}} + \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta} + \log \frac{(n/m)^{2m}}{(1-\alpha)^{m}} \le 2m \log \frac{n}{m} + (m-2) \log \frac{1}{(1-\alpha)\alpha} + \bar\theta \log \frac{(\bar\theta+n)\,e}{\bar\theta} + \log \frac{m^2}{\theta+\alpha} + \log \frac{1}{(1-\alpha)^2}. \qquad (12) $$
If m = o(n), then the right side above is o(n), as desired.
It is well known that the Pitman–Yor process can produce patterns whose relative frequency is zero, e.g., the pattern 1^k 2 3 ⋯ (n − k). Therefore, it is not surprising that the worst-case and average-case redundancies can be bad. However, as the next theorem shows, the actual redundancy of the Pitman–Yor estimator is Θ(n), which is significantly better than the lower bound of Ω(n log n) proven in Santhanam and Madiman [41] for Chinese restaurant processes.

Theorem 3

[Redundancies] Consider the estimator q_{α,θ}^PY(ψ_1^n). Then, for sufficiently large n, the worst-case and average-case redundancies satisfy:
$$ \hat{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) = \Theta(n) \qquad \text{and} \qquad \bar{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) = \Theta(n). $$
That is, q_{α,θ}^PY(·) is neither strongly nor weakly universal.

Proof

For the upper bound, we start with (12) and note that in the worst case, m = O(n), so R̂(q_{α,θ}^PY(·)) = O(n) and, a fortiori, R̄(q_{α,θ}^PY(·)) = O(n).
For the lower bound, consider the patterns 11 · · · 1 and 12 · · · n. For the Pitman–Yor estimator,
$$ q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)\; q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n) = \frac{\theta\,(1-\alpha)\cdots(n-1-\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} \cdot \frac{\theta\,(\theta+\alpha)\cdots(\theta+(n-1)\alpha)}{\theta(\theta+1)\cdots(\theta+n-1)} = \frac{(1-\alpha)(\theta+\alpha)}{(\theta+1)^2} \cdot \frac{(2-\alpha)(\theta+2\alpha)}{(\theta+2)^2} \cdots \frac{(n-1-\alpha)(\theta+(n-1)\alpha)}{(\theta+n-1)^2}. $$
For j ≥ 1, 0 < α < 1 and α + θ > 0, we show in the claim proven below that:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \max\left\{ \tfrac{1}{2}, \alpha \right\}. $$
Therefore, each term is less than one. Then, for α < 1, there exists a constant 0 < c < 1, such that:
$$ q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)\; q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n) \le c^{n}. $$
Thus:
$$ \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)} + \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \ge n \log \frac{1}{c}. $$
Let the distribution p1 be a singleton, so p1(1 · · · 1) = 1. For any small δ > 0, we can find a distribution pn, such that pn(12 · · · n) = 1 − δ by choosing pn to be uniform on a sufficiently large set. Thus:
$$ \hat{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) \ge \max\left\{ \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)},\ \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right\} \ge \frac{1}{2}\left( \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)} + \log \frac{1}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right) + \log(1-\delta) \ge \frac{n}{2} \log \frac{1}{c} + \log(1-\delta). $$
This shows that R̂(q_{α,θ}^PY(·)) = Ω(n). Furthermore,
$$ \bar{R}\bigl(q_{\alpha,\theta}^{\mathrm{PY}}(\cdot)\bigr) \ge \max\left\{ (1-\delta) \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(11\cdots1)},\ (1-\delta) \log \frac{1-\delta}{q_{\alpha,\theta}^{\mathrm{PY}}(12\cdots n)} \right\} \ge (1-\delta)\left( \frac{n}{2} \log \frac{1}{c} + \log(1-\delta) \right), $$
so R̄(q_{α,θ}^PY(·)) = Ω(n).
All that remains is to prove the following claim:

Claim 1

For j ≥ 1, 0 < α < 1 and α + θ > 0, we have:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \max\left\{ \tfrac{1}{2}, \alpha \right\}. $$

Proof

First, assume that 0 < α < 1/2. Then, the inequality is:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \frac{1}{2}. $$
Equivalently, we need to show:
$$ (1-2\alpha)\,j^2 + 2\alpha^2 j + \theta^2 + 2\alpha\theta \ge 0. $$
Since 1 − 2α > 0, the quadratic is always nondecreasing on j ≥ 1. Therefore, the positive integer j = 1 minimizes the quadratic above, and the value of the quadratic at j = 1 is:
$$ 1 - 2\alpha + 2\alpha^2 + \theta^2 + 2\alpha\theta = (\alpha-1)^2 + (\theta+\alpha)^2 \ge 0. $$
Next, assume that 1/2 ≤ α < 1. Then, the inequality is:
$$ \frac{(j-\alpha)(\theta+j\alpha)}{(\theta+j)^2} \le \alpha. $$
Equivalently, we need to show:
$$ \bigl((2\alpha-1)\theta + \alpha^2\bigr)\,j + \alpha\theta(\theta+1) \ge 0. \qquad (14) $$
Since 2α − 1 ≥ 0 and θ > −α,
$$ (2\alpha-1)\theta + \alpha^2 \ge -(2\alpha-1)\alpha + \alpha^2 = \alpha(1-\alpha) > 0. $$
Therefore, the minimum of the left-hand side of (14) over j ≥ 1 is achieved at j = 1. Note that αθ² ≥ −α³ − 2α²θ, since α(α + θ)² ≥ 0. Therefore, the value of the minimum is:
$$ (2\alpha-1)\theta + \alpha^2 + \alpha\theta(\theta+1) = \alpha\theta^2 + (3\alpha-1)\theta + \alpha^2 \ge -\alpha^3 + (-2\alpha^2+3\alpha-1)\theta + \alpha^2 \ge -(-2\alpha^2+3\alpha-1)\alpha + \alpha^2 - \alpha^3 = \alpha^3 - 2\alpha^2 + \alpha = \alpha(\alpha-1)^2 \ge 0. $$
Note that −2α² + 3α − 1 ≥ 0 for 1/2 ≤ α < 1, and the claim follows.
The theorem follows.

4. Weak Universality

In this section, we show how to modify the CRP or PY estimators to obtain weakly universal estimators. The CRP and PY cases are identical; therefore, we only work out the CRP case.
For all i ≥ 1 and j ≥ 1, let:
$$ c_{i,j} = \frac{1}{i(i+1)\,j(j+1)}, $$
so that ∑_{i,j} c_{i,j} = 1. Let q̃_{i,j}^CRP(·) be the CRP measure over patterns with θ = i/log j. Consider the following measure over patterns of infinite sequences, which assigns, for all n and all patterns ψ^n of length n, the probability:
$$ q^{*}(\psi^n) = \sum_{i,j} c_{i,j}\, \tilde{q}_{i,j}^{\mathrm{CRP}}(\psi^n). \qquad (15) $$
We will show that q* is a weakly universal measure over patterns of i.i.d. sequences.
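For concreteness, the following Python sketch (our illustration, not the paper's code) evaluates a truncated version of q*: it sums finitely many terms c_{i,j} q̃^CRP_{i,j}(ψ^n), with each CRP term computed from the predictive rule (2); the truncation starts j at 2 so that log j > 0.

```python
from collections import Counter
from math import log

def crp_pattern_prob(psi, theta):
    """CRP/Ewens probability of a pattern via the sequential predictive rule (2)."""
    p, counts = 1.0, Counter()
    for n_seen, s in enumerate(psi):
        p *= (counts[s] / (n_seen + theta)) if s in counts else (theta / (n_seen + theta))
        counts[s] += 1
    return p

def q_star_truncated(psi, max_i=50, max_j=50):
    """Truncation of q*(psi) = sum_{i,j} c_{i,j} qtilde^CRP_{i,j}(psi) with theta = i / log j."""
    total = 0.0
    for i in range(1, max_i + 1):
        for j in range(2, max_j + 1):                       # j >= 2 avoids log(1) = 0
            c_ij = 1.0 / (i * (i + 1) * j * (j + 1))
            total += c_ij * crp_pattern_prob(psi, theta=i / log(j))
    return total

print(q_star_truncated([1, 2, 3, 2, 4, 2, 4]))              # a lower bound on q* for this pattern
```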
To do so, we will need the following two lemmas. Lemma 4 is a useful “folk” inequality that we believe is attributed to Minkowski. Lemma 5 relates the expected number of distinct symbols in length n sequences of an i.i.d. process to its entropy and is of independent interest. The result not only strengthens a similar result in [43], but also provides a different and more compact proof.

Lemma 4

For n ≥ 1, let x_1 ≥ x_2 ≥ ⋯ ≥ x_n ≥ 0 and y_1 ≥ y_2 ≥ ⋯ ≥ y_n ≥ 0 be two sorted sequences. Then:
$$ \frac{1}{n} \sum_{l=1}^{n} x_l y_l \ \ge\ \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)\left( \frac{1}{n} \sum_{j=1}^{n} y_j \right) \ \ge\ \frac{1}{n} \sum_{l=1}^{n} x_l y_{n+1-l}. $$

Proof

The left inequality of the lemma follows by noting that:
$$ \left( \sum_{i=1}^{n} x_i \right)\left( \sum_{j=1}^{n} y_j \right) = \sum_{k=0}^{n-1} \sum_{l=1}^{n} x_l y_{l+k}, $$
where the index l + k is taken cyclically (modulo n), and that each inner sum ∑_l x_l y_{l+k} is maximized at k = 0, since both sequences are sorted in the same direction. The right inequality of the lemma can be proven similarly, but will not be used in the paper.

Lemma 5

For all discrete i.i.d. processes P with entropy rate (or marginal entropy) H, let M_n be the random variable counting the number of distinct symbols in a sample of length n drawn from P. The following bound holds:
$$ \mathbb{E}[M_n] \le \frac{nH}{\log n} + 1. $$

Proof

Let P(i) = pi. We begin by noting that:
$$ H = \sum_{i} p_i \log \frac{1}{p_i} = \sum_{i} p_i \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j}, $$
where the second equality follows by the Taylor series expansion:
$$ -\log p_i = -\log\bigl(1 - (1-p_i)\bigr) = \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j}. $$
The inner summation in the equation above is bounded from below as follows:
$$ \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j} \ \ge\ \sum_{j=1}^{n} \frac{(1-p_i)^{j}}{j} \ \overset{(a)}{\ge}\ \frac{1}{n} \sum_{l=1}^{n} \frac{1}{l} \sum_{m=1}^{n} (1-p_i)^{m} \ \ge\ \frac{\log n}{n} \cdot \frac{(1-p_i)}{p_i} \bigl(1 - (1-p_i)^{n}\bigr), $$
where (a) follows from Minkowski’s inequality in Lemma 4, and the last inequality holds because ∑_{l=1}^{n} 1/l ≥ log n. Thus,
$$ H = \sum_{i} p_i \sum_{j=1}^{\infty} \frac{(1-p_i)^{j}}{j} \ \ge\ \frac{\log n}{n} \sum_{i} (1-p_i)\bigl(1 - (1-p_i)^{n}\bigr) \ \ge\ \frac{\log n}{n} \bigl( \mathbb{E}[M_n] - 1 \bigr), $$
where for the second inequality we use E[M_n] = ∑_i (1 − (1 − p_i)^n) and ∑_i p_i (1 − (1 − p_i)^n) ≤ ∑_i p_i ≤ 1.
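Lemma 5 is easy to probe numerically. The following Monte Carlo sketch (ours, not from the paper; natural logarithms, numpy assumed) estimates E[M_n] for a truncated geometric marginal and compares it with the bound nH/log n + 1:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 200                                        # truncated geometric support {0, ..., k-1}
probs = 0.5 ** np.arange(1, k + 1)
probs /= probs.sum()
H = float(-(probs * np.log(probs)).sum())      # marginal entropy in nats

n = 10_000
reps = 200

def distinct_count():
    sample = rng.choice(k, size=n, p=probs)
    return np.unique(sample).size

empirical = np.mean([distinct_count() for _ in range(reps)])
bound = n * H / np.log(n) + 1
print(f"E[M_n] ~ {empirical:.1f}  <=  nH/log n + 1 = {bound:.1f}")   # the bound is loose but holds
```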

Theorem 6

[Weak universality for CRP mixtures] For all discrete i.i.d. processes p with a finite entropy rate,
$$ D(p \,\|\, q^{*}) = o(n). $$
That is, q* is weakly universal.

Proof

We write the divergence between p and q* in (15) as an expected log ratio and condition on the value of M_n:
$$ \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \right] = P\!\left( M_n > \frac{n(\log\log n)^2}{\log n} \right) \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \,\middle|\, M_n > \frac{n(\log\log n)^2}{\log n} \right] + P\!\left( M_n \le \frac{n(\log\log n)^2}{\log n} \right) \mathbb{E}_p\!\left[ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \,\middle|\, M_n \le \frac{n(\log\log n)^2}{\log n} \right]. \qquad (18) $$
Consider the estimator q̃_{i,j}^CRP(ψ_1^n) in q* corresponding to i = M_n and j = log n. This is the estimator q_θ^CRP(ψ_1^n) with θ = M_n/log n. From the proof of Theorem 1, we have:
$$ \log \frac{p(\psi_1^n)}{q^{*}(\psi_1^n)} \le \log \frac{p(\psi_1^n)}{c_{i,j}\, \tilde{q}_{i,j}^{\mathrm{CRP}}(\psi_1^n)} \le \log \frac{1}{c_{i,j}} + \log \frac{p(\psi_1^n)}{\tilde{q}_{i,j}^{\mathrm{CRP}}(\psi_1^n)} \le \log\bigl( M_n (M_n+1)(\log n)(\log n + 1) \bigr) + \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)}. \qquad (19) $$
We will bound the two terms in (19) in the two regimes for M_n.
The result of Theorem 1 says that if M_n ≤ n (log log n)^2 / log n, then:
$$ \log \frac{1}{c_{i,j}} + \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)} \le \log\bigl( M_n (M_n+1)(\log n)(\log n+1) \bigr) + o(n). $$
Thus, this term is o(n).
If M_n > n (log log n)^2 / log n, then we first apply Markov’s inequality using the previous lemma:
$$ P\!\left( M_n > \frac{n(\log\log n)^2}{\log n} \right) \le \frac{\log n}{n(\log\log n)^2}\left( \frac{nH}{\log n} + 1 \right) \le \frac{H}{(\log\log n)^2} + \frac{\log n}{n(\log\log n)^2}. $$
Therefore, for all finite-entropy processes, this probability goes to zero as n → ∞. Looking at the term in q* corresponding to q_θ^CRP(ψ_1^n) with θ = M_n/log n and using the fact that M_n ≤ n, we see that the first term in (19) is upper bounded as O(log n). For the second term, we appeal to (9) in the proof of Theorem 1:
$$ \log \frac{p(\psi_1^n)}{q_{\theta}^{\mathrm{CRP}}(\psi_1^n)} \le M_n \log \frac{n}{M_n} + M_n \log\log n + \frac{M_n}{\log n} \log\!\left( 2 + \frac{n \log n}{M_n} \right) \le n + n \log\log n + \frac{n}{\log n} \log\bigl( 2 + n \log n \bigr) \le 3 n \log\log n. $$
Plugging these terms into (18):
$$ D(p \,\|\, q^{*}) \le \left( \frac{H}{(\log\log n)^2} + \frac{\log n}{n(\log\log n)^2} \right) \cdot O(n \log\log n) + 1 \cdot o(n) = o(n). $$
The preceding theorem shows that the mixture of CRP estimators q* is weakly universal. However, note that q* is not itself a CRP estimator. An identical construction is possible for the PY estimators, as well. The convergence of the weakly universal q* depends on the entropy of the source, as well as on the number of distinct symbols in a sample of size n.
While it would be tempting to predict the performance of the estimator q* for larger sample sizes N > n, such a task requires a more careful analysis. In general, it may be impossible to non-trivially bound the number of distinct symbols M_N in the larger sample from a smaller sample of size n, as the following example shows.

Example 1

Let n = √N. Consider a set containing the following two distributions: (i) p over ℕ, which assigns probability 1 − 1/n^{3/2} = 1 − 1/N^{3/4} to the atom 1 and splits the probability 1/N^{3/4} equally among the elements of the set {2, . . . , N² + 1}; and (ii) p′, which simply assigns probability one to the atom 1. A sample of size n from either p or p′ is 1^n with probability at least 1 − 1/N^{1/4}, no matter what the underlying source is; therefore, we cannot distinguish between these sources with probability 1 − 1/N^{1/4} from a sample of size n.
However, a sample of size N from p has Θ(N^{1/4}) distinct symbols on average, while that of p′ will have only one element. It follows that if all we know is that the unknown distribution comes from this set, then, with high probability under the unknown source, we cannot predict from a sample of length n whether the number of symbols in a sample of size N will remain one or not. Furthermore, by changing the ratio of n and N (and, therefore, the probability of the symbol 1 under p), we can make the expected number of symbols in an N-length sample under p as large as we want.
However, it is possible to impose restrictions on the class of distributions that ensure that we can predict the number of symbols in longer samples. In future work, we will borrow from the data-derived consistency formulations of [44] to characterize when such prediction is possible.

5. Conclusions and Future Work

In this note, we investigated the worst-case and average-case redundancies of pattern probability estimators derived from priors on 𝒫 that are popular in Bayesian statistics. Both the CRP and Pitman–Yor estimators give a vanishing per-symbol redundancy for patterns whose number of distinct symbols m is sufficiently small. The Pitman–Yor estimator requires only that m = o(n), which is an improvement over the CRP. However, when m can be arbitrarily large (or the alphabet size is arbitrarily large), the worst-case and average-case redundancies do not scale as o(n). Here, again, the Pitman–Yor estimator is superior, in that its redundancies scale as Θ(n), as opposed to the Ω(n log n) of the CRP estimator. While these results show that these estimators are not strongly universal, we constructed a mixture of CRP estimators (which is not itself a CRP estimator) that is weakly universal. One of the estimators derived in [16] is exchangeable and has near-optimal worst-case redundancy of O(√n). Kingman’s results imply that this estimator corresponds to a prior on 𝒫; however, this prior is as yet unknown. Finding this prior may reveal interesting new classes of priors beyond the Poisson–Dirichlet priors.

Acknowledgments

The authors thank the American Institute of Mathematics and NSF for sponsoring a workshop on probability estimation, as well as A. Orlitsky and K. Viswanathan, who co-organized the workshop with the first author. They additionally thank P. Diaconis, M. Dudik, F. Chung, R. Graham, S. Holmes, O. Milenkovic, O. Shayevitz, A. Wagner, J. Zhang and M. Madiman for helpful discussions.
Narayana P. Santhanam was supported by a startup grant from the University of Hawaii and NSF Grants CCF-1018984 and CCF-1065632. Anand D. Sarwate was supported in part by the California Institute for Telecommunications and Information Technology (CALIT2) at the University of California, San Diego. Jae Oh Woo was supported by NSF Grant CCF-1065494 and CCF-1346564.

Author Contributions

All authors have worked on this manuscript together.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Laplace, P.S. Philosphical Essay on Probabilities: Translated From the Fifth French Edition of 1825; Sources in the History of Mathematics and Physical Sciences (Book 13); Springer Verlag: New York, NY, USA, 1995. [Google Scholar]
  2. De Morgan, A. An Essay on Probabilities, and on Their Application to Life Contingencies and Insurance Offices; Longman, Orme, Brown, Green & Longmans: London, UK, 1838. [Google Scholar]
  3. De Morgan, A. Theory of Probabilities. In Encyclopædia Metropolitana; Smedley, E., Rose, H.J., Rose, H.J., Eds.; Volume II (Pure Sciences); B. Fellowes et. al: London, UK, 1845; pp. 393–490. [Google Scholar]
  4. Good, I. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264. [Google Scholar]
  5. Blackwell, D.; MacQueen, J.B. Ferguson distributions via Pólya urn schemes. Ann. Stat 1973, 1, 353–355. [Google Scholar]
  6. Kingman, J.F.C. Random Discrete Distributions. J. R. Stat. Soc. B 1975, 37, 1–22. [Google Scholar]
  7. Kingman, J.F.C. Random Partitions in Population Genetics. Proc. R. Soc 1978, 361, 1–20. [Google Scholar]
  8. Kingman, J.F.C. The Representation of Partition Structures. J. London Math. Soc 1978, 2, 374–380. [Google Scholar]
  9. Diaconis, P.; Freedman, D. De Finetti’s Generalizations of Exchangeability. In Studies in Inductive Logic and Probability; Jeffrey, RC., Ed.; Volume 2, University of California Press: Berkeley/Los Angeles, CA, USA, 1980; pp. 233–250. [Google Scholar]
  10. Aldous, D.J. Exchangeability and Related Topics. In École d’Été de Probabilités de Saint-Flour, XIII—1983; Lecture Notes in Mathematics, Volume 1117; Springer: Berlin/Heidelberg, Germany, 1985; pp. 1–198. [Google Scholar]
  11. Zabell, S. Predicting the unpredictable. Synthese 1992, 90, 205–232. [Google Scholar]
  12. Pitman, J. Exchangeable and partially exchangeable random partitions. Probab. Theory Relat. Fields 1995, 102, 145–158. [Google Scholar]
  13. Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab 1997, 25, 855–900. [Google Scholar]
  14. Clarke, B.; Barron, A. Information theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Theory 1990, 36, 453–471. [Google Scholar]
  15. Clarke, B.; Barron, A. Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Stat. Plan. Inference 1994, 41, 37–60. [Google Scholar]
  16. Orlitsky, A.; Santhanam, N.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inf. Theory 2004, 50, 1469–1481. [Google Scholar]
  17. Orlitsky, A.; Santhanam, N.; Zhang, J. Always Good Turing: Asymptotically optimal probability estimation. Science 2003, 302, 427–431. [Google Scholar]
  18. Acharya, J.; Das, H.; Jafarpour, A.; Orlitsky, A.; Suresh, A. Tight Bounds for Universal Compression of Large Alphabets. Proceedings of IEEE International Symposium on Information Theory (ISIT), Istanbul, Turkey, 7–12 July 2013.
  19. Acharya, J.; Das, H.; Orlitsky, A. Tight Bounds on Profile Redundancy and Distinguishability. Proceedings of 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 3257–3265.
  20. Acharya, J.; Jafarpour, A.; Orlitsky, A.; Suresh, A. Optimal Probability Estimation with Applications to Prediction and Classification. Proceeding of Conference on Learning Theory, Princeton NJ, USA, 12–14 June 2013; pp. 764–796.
  21. Ryabko, B. Compression based methods for nonparametric on-line prediction, regression, classification and density estimation of time series. In Festschrift in Honor of Jorma Rissanen on the Occasion of His 75th Birthday; Grünwald, P., Myllymäki, P., Tabus, I., Weinberger, M., Yu, B., Eds.; Tampere International Center for Signal Processing: Tampere, Finland, 2008; pp. 271–288. [Google Scholar]
  22. Wagner, A.B.; Viswanath, P.; Kulkarni, S.R. Probability Estimation in the Rare-Events Regime. IEEE Trans. Inf. Theory 2011, 57, 3207–3229. [Google Scholar]
  23. Nadas, A. Good, Jelinek, Mercer, and Robins on Turing’s estimate of probabilities. Am. J. Math. Manag. Sci 1991, 11, 229–308. [Google Scholar]
  24. Gale, W.; Church, K. What is wrong with adding one? In Corpus Based Research into Language; Oostdijk, N., de Haan, P., Eds.; Rodopi: Amsterdam, The Netherlands, 1994; pp. 189–198. [Google Scholar]
  25. McAllester, D.; Schapire, R. On the convergence rate of Good Turing estimators. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT 2000), Palo Alto, CA, USA, 28 June– 1 July 2000; Cesa-Bianchi, N., Goldman, SA., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 2000; pp. 1–6. [Google Scholar]
  26. Drukh, E.; Mansour, Y. Concentration Bounds on Unigrams Language Models. J. Mach. Learn. Res 2005, 6, 1231–1264. [Google Scholar]
  27. Orlitsky, A.; Santhanam, N.; Viswanathan, K.; Zhang, J. On modeling profiles instead of values; Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 7–11 July 2004, Meek, C., Halpern, J., Eds.; AUAI Press: Arlington, VA, USA, 2004; pp. 426–435. [Google Scholar]
  28. Gallager, R. Source Coding with Side Information and Universal Coding. In Technical Report LIDS-P-937; M.I.T: Cambridge, MA, USA, 1976. [Google Scholar]
  29. Davisson, L.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inf. Theory 1980, 26, 166–174. [Google Scholar]
  30. Ryabko, B.Y. Coding of a source with unknown but ordered probabilities. Probl. Inf. Transm 1979, 15, 134–138. [Google Scholar]
  31. Kingman, J. The Mathematics of Genetic Diversity; CBMS-NSF Regional Conference Series in Applied Mathematics, Volume 34; SIAM: Philadelphia, PA, USA, 1980. [Google Scholar]
  32. Zabell, S. The Continuum of Inductive Methods Revisited. In The Cosmos of Science: Essays of Exploration; Earman, J., Norton, JD., Eds.; The University of Pittsburgh Press: Pittsburgh, PA, USA, 1997; Volume Chapter 12. [Google Scholar]
  33. Zabell, S. Symmetry and Its Discontents: Essays on the History of Inductive Probability; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  34. Pitman, J. Random discrete distributions invariant under size-biased permutation. Adv. Appl. Probab 1996, 28, 525–539. [Google Scholar]
  35. Pitman, J. Combinatorial Stochastic Processes. In École d’Été de Probabilités de Saint-Flour, XXXII—2002; Lecture Notes in Mathematics, Volume 1875; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  36. Ferguson, T. A Bayesian analysis of some nonparametric problems. Ann. Stat 1973, 1, 209–230. [Google Scholar]
  37. Ramamoorthi, R.; Srikanth, K. Dirichlet processes. In Encyclopedia of Statistical Sciences; John Wiley and Sons: New York, NY, USA, 2007. [Google Scholar]
  38. Ewens, W.J. The Sampling Theory of Selectively Neutral Alleles. Theor. Popul. Biol 1972, 3, 87–112. [Google Scholar]
  39. Karlin, S.; McGregor, J. Addendum to a Paper of W. Ewens. Theor. Popul. Biol 1972, 3, 113–116. [Google Scholar]
  40. Watterson, G.A. The Sampling Theory of Selectively Neutral Alleles. Adv. Appl. Probab 1974, 6, 463–488. [Google Scholar]
  41. Santhanam, N.; Madiman, M. Patterns and exchangeability. Proceedings of 2010 IEEE International Symposium on Information Theory (ISIT), Austin, TX, USA, 13–18 June 2010; pp. 1483–1487.
  42. Carlton, M.A. Applications of the Two-Parameter Poisson-Dirichlet Distribution. Ph.D. Thesis, University of California, Los Angeles, Los Angeles, CA, USA, 1999. [Google Scholar]
  43. Orlitsky, A.; Santhanam, N.P.; Viswanathan, K.; Zhang, J. Limit results on pattern entropy. IEEE Trans. Inf. Theory 2006, 52, 2954–2964. [Google Scholar]
  44. Santhanam, N.; Anantharam, V.; Kavcic, A.; Szpankowski, W. Data driven weak universal redundancy. Proceedings of IEEE Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 1877–1881.
