Characterizing the Asymptotic Per-Symbol Redundancy of Memoryless Sources over Countable Alphabets in Terms of Single-Letter Marginals

The minimum expected number of bits needed to describe a random variable is its entropy, assuming knowledge of the distribution of the random variable. On the other hand, universal compression describes data supposing that the underlying distribution is unknown, but that it belongs to a known set P of distributions. Since universal descriptions are not matched exactly to the underlying distribution, the number of bits they use on average is higher, and the excess over the entropy is the redundancy. In this paper, we study the redundancy incurred by the universal description of strings of positive integers (Z+), the strings being generated independently and identically distributed (i.i.d.) according to an unknown distribution over Z+ in a known collection P. We first show that if describing a single symbol incurs finite redundancy, then P is tight, but that the converse does not always hold. If a single symbol can be described with finite worst-case regret (a formulation more stringent than the redundancy above), then it is known that describing length n i.i.d. strings incurs only vanishing (to zero) redundancy per symbol as n increases. On the contrary, we show it is possible that the description of a single symbol from an unknown distribution of P incurs finite redundancy, yet the description of length n i.i.d. strings incurs a constant (> 0) redundancy per symbol encoded. We then give a sufficient condition on single-letter marginals, such that length n i.i.d. samples will incur vanishing redundancy per symbol encoded.


Introduction
A number of statistical inference problems of significant contemporary interest, such as text classification, language modeling and DNA microarray analysis, are what are called large alphabet problems. They require inference on sequences of symbols where the symbols come from a set (alphabet) with a size comparable to, or even larger than, the sequence length. For instance, language models for speech recognition estimate distributions over English words using text samples much smaller than the English vocabulary.
An abstraction behind several of these problems is universal compression over large alphabets. The general idea here is to model the problem at hand with a collection of models P instead of a single distribution. The model underlying the data is assumed or known to belong to the collection P, but the exact identity of the model remains unknown. Instead, we aim to use a universal description of data.
The universal description uses more bits on average (averaged over the random sample) than if the underlying model were known, and the additional number of bits used by the universal description is called the redundancy against the true model. The average excess bits over the entropy of the true model will be referred to as the model redundancy for that model. Since one does not know the true model in general, a common approach is to consider collection redundancy, or simply redundancy, which is the supremum of the model redundancy, the supremum being taken over all models of the collection.
Typically, we look at sequences of i.i.d. symbols, and therefore, we usually refer to the redundancy of distributions over length n sequences obtained by i.i.d. sampling from distributions in P. The length n of the sequences considered will typically be referred to as the sample size.
The nuances of prediction, compression or estimation where the alphabet size and sample size are roughly equal are not well captured by studying a collection over a finite alphabet as the sample size is increased to infinity. Rather, they are better captured when we begin with a countably infinite support and let the sample size approach infinity, or when we let the alphabet size scale as a function of the sample size. However, the collection of all i.i.d. distributions over countably infinite supports has infinite redundancy, which renders most estimation or prediction problems impossible. Therefore, there are several alternative formulations to tackle language modeling, classification and estimation questions over large alphabets.
Patterns: One line of work is the patterns [1] approach that considers the compression of the pattern of a sequence rather than the sequence itself. Patterns abstract the identities of symbols and indicate only the relative order of appearance. For example, the pattern of TATTLE is 121134, while that of HONOLULU is 12324545. The point to note is that patterns of length n i.i.d. sequences can be compressed (no matter what the underlying countably infinite alphabet is) with redundancy that grows sublinearly in n [1]; therefore, the excess bits needed to describe patterns are asymptotically vanishing per symbol encoded. Indeed, insights learned in this line of work will be used to understand the compression of sequences, as well, in this paper.
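The pattern map is straightforward to compute. A minimal sketch (the helper name `pattern` is ours) reproduces the two examples above:

```python
def pattern(s):
    """Return the pattern of a string: each symbol is replaced by the
    1-based index of its first appearance."""
    first_index = {}
    out = []
    for sym in s:
        if sym not in first_index:
            first_index[sym] = len(first_index) + 1
        out.append(str(first_index[sym]))
    return "".join(out)

print(pattern("TATTLE"))    # 121134
print(pattern("HONOLULU"))  # 12324545
```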

Envelope on Model Classes:
A second line of work considers restricted model classes for applications, particularly where the collection of models can be described in terms of an envelope [2]. This approach leads to an understanding of the worst-case formulations. In particular, we are interested in the result that if the worst-case regret (different from, and a more stringent formulation than, the redundancy described here) of describing a single sample is finite, then the per-symbol redundancy diminishes to zero. We will interpret this result towards the end of the Introduction. While envelope classes are usually chosen so that they are compressible in the worst case, a natural extension is the possibility of choosing classes that are only average-case, but not worst-case, compressible. For this, we need to understand how the single-letter average-case redundancy of a class influences the redundancy of compressing strings sampled i.i.d. from distributions in the class, which is the focus of this paper.

Data-derived Consistency:
A third line of work forgoes the uniform convergence framework underlying redundancy or regret formulations. This is useful for large or infinite alphabet model collections that have poor or no redundancy guarantees, and it asks a question that cannot be answered with the approaches above. In this line of work, one obtains results on the model redundancy described above instead of (the collection) redundancy. For example, a model collection is said to be weakly compressible if there is a universal measure that ensures that for all models, the model redundancy normalized by the sample size (per-symbol) diminishes to zero. The rate at which the per-symbol model redundancy diminishes to zero depends on the underlying model and for some models could be arbitrarily slower than for others. Given a particular block length n, however large, there may hence be no non-trivial guarantee that holds over the entire model collection, unlike in the redundancy formulation.
However, if we add the additional constraint that we should estimate the rate of convergence from the data, we get the data-derived consistency formulations in [3]. Fundamental to further research in this direction is a better understanding of how the single-letter redundancy (that of P) relates to the redundancy of length n strings (that of P^n). The primary theme of this paper is to collect such results on the redundancy of classes over countably infinite support.
In the fixed alphabet setting, this connection is well understood. If the alphabet has size k, the redundancy of P is easily seen to be always finite (in fact, ≤ log k), and that of P^n scales as ((k − 1)/2) log n. However, when P does not have a finite support, the above bounds are meaningless.
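For k = 2, the ((k − 1)/2) log n behavior is easy to check numerically using the Krichevsky–Trofimov (add-1/2) mixture, a standard universal code for binary i.i.d. sources. The sketch below (our own illustration, not taken from the paper) computes the worst-case regret of the KT code, which upper bounds the average-case redundancy and grows as (1/2) log2 n plus a constant:

```python
import math

def worst_regret(n):
    """Worst-case regret (bits) of the KT add-1/2 mixture over binary strings of length n."""
    # half[m] = log2 of prod_{i<m} (i + 1/2); the KT probability of a string
    # with n0 zeros and n1 ones is prod(i+1/2) * prod(j+1/2) / n!.
    half = [0.0]
    for i in range(n):
        half.append(half[-1] + math.log2(i + 0.5))
    logfact = sum(math.log2(t + 1) for t in range(n))  # log2 n!
    best = 0.0
    for n0 in range(n + 1):  # only the counts matter, not the order
        n1 = n - n0
        kt = half[n0] + half[n1] - logfact                    # log2 KT probability
        ml = (n0 * math.log2(n0 / n) if n0 else 0.0) + \
             (n1 * math.log2(n1 / n) if n1 else 0.0)          # log2 ML probability
        best = max(best, ml - kt)
    return best

for n in (100, 1000, 10000):
    print(n, round(worst_regret(n), 2), round(0.5 * math.log2(n), 2))
```

The gap between the two printed columns stays roughly constant as n grows, consistent with the ((k − 1)/2) log n rate.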
Redundancy Capacity Theorem: On the other hand, the redundancy of a collection P over a countably infinite support may be infinite. However, what about the case where the redundancy of a collection P over Z+ is finite? Now, a well-known redundancy-capacity [4] argument can be used to interpret the redundancy, equating it to the amount of information we can get about the source from the data. In this case, finite (respectively, infinite) redundancy of P implies that a single symbol contains a finite (respectively, infinite) amount of information about the model.
The natural question then is the following. If a collection P over Z+ has finite redundancy, does it imply that the redundancy of length n i.i.d. strings from P grows sublinearly? Equivalently, do finite redundancy collections behave similarly to their fixed alphabet counterparts? If true, roughly speaking, such a result would inform us that as the universal encoder sees more and more of the sequence, it learns less and less of the underlying model. This would be in line with our intuition, where seeing more data pins down the model; therefore, the more data we have already seen, the less there is to learn. Yet, as we will show, that is not the case.
Results: To understand these nuances, we first show that if the redundancy of a collection P of distributions over Z+ is finite, then P is tight. This turns out to be a useful tool to check if the redundancy is finite in [3], for example.
However, in a departure from other worst-case regret formulations, as in [2], we demonstrate that it is possible for a class P to have finite redundancy, yet the asymptotic per-symbol redundancy of strings sampled i.i.d. from P is bounded away from zero. Therefore, roughly speaking, no matter how much of the sequence the universal encoder has seen, it learns at least a constant number of bits about the underlying model each time it sees an additional symbol. No matter how much data we see, there is more to learn about the underlying model! We finally obtain a sufficient condition on a class P such that the asymptotic per-symbol redundancy of length n i.i.d. strings diminishes to zero.

Notation and Background
We introduce the notation used in the paper, as well as some prior results that will be used. Following information-theoretic conventions, log indicates logarithms to base two and ln to base e. In this paper, we let Z+ = {1, 2, 3, ...} be the set of positive integers and N = {0, 1, 2, ...} be the set of non-negative integers.

Redundancy
The notation used here is mostly standard, but we include it for completeness. Let P be a collection of distributions over Z+. Let P^n be the set of distributions over length n sequences obtained by i.i.d. sampling from distributions in P.
P^∞ is the collection of measures over infinite length sequences of Z+ obtained by i.i.d. sampling, constructed as follows. Observe that Z+^n is countable for every n. For simplicity of exposition, we will think of each length n string x as a subset of Z+^∞, namely the set of all semi-infinite strings of positive integers that begin with x. Each subset of Z+^n can therefore be identified with a subset of Z+^∞. Now, the collection J of all subsets of Z+^n, over all n ∈ Z+, is a semi-algebra [5]. The probability i.i.d. sampling assigns to a finite union of disjoint sets in J is the sum of the probabilities assigned to the components of the union. Therefore, there is a sigma-algebra over the uncountable set Z+^∞ that extends J and matches the probabilities assigned to sets in J by i.i.d. sampling. The reader can take each measure in P^∞ to be the measure on the minimal sigma-algebra that extends J and matches the probabilities i.i.d. sampling assigns to sets in J. See, e.g., [5] for a development of elementary measure theory that lays out the above steps.
Let q be a measure over infinite sequences of positive integers. The length n redundancy of P^∞ is

R_n(P^∞) = inf_q sup_{p∈P} E_p log ( p(X_1^n) / q(X_1^n) ),    (1)

which we call the redundancy of length n sequences, the length n i.i.d. redundancy, or simply the length n redundancy. The single-letter redundancy refers to the special case when n = 1. We often normalize R_n(P^∞) in (1) by the block length n. We will call R_n(P^∞)/n the per-symbol length n redundancy.
In particular, note the distinction between the single-letter and the per-symbol length n redundancy. In the definition (1), we do not require q to be i.i.d.. The single-letter redundancy would correspond to taking the infimum in (1) only over the restricted class of i.i.d. measures, while the per-symbol length n redundancy allows the infimum over all possible measures q. Thus, the per-symbol length n redundancy is upper bounded by the single-letter redundancy. Any difference between the two can be thought of as the advantage accrued because the universal measure learns the underlying measure p.
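A toy instance of this learning advantage (our own illustration, using the Bernoulli family over a two-letter alphabet rather than a class over all of Z+): a fixed i.i.d. coding distribution pays a constant number of excess bits per symbol, while the adaptive Laplace add-one predictor, which learns the source from its counts, pays only O((log n)/n) per symbol.

```python
import math

p, n = 0.9, 1000  # source: i.i.d. with P(symbol "1") = 0.9

# Per-symbol redundancy when coding with the fixed i.i.d. measure q = Bernoulli(1/2):
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
iid_excess = 1.0 - h  # each symbol costs 1 bit instead of h bits, on average

# Expected per-symbol redundancy of the adaptive Laplace add-one predictor.
# Its probability for a string depends only on the counts (n0 zeros, n1 ones):
#   q_L(string) = n0! * n1! / (n + 1)!
def lg2fact(m):
    return math.lgamma(m + 1) / math.log(2)

red = 0.0
for k in range(n + 1):  # k = number of ones
    logw = lg2fact(n) - lg2fact(k) - lg2fact(n - k) \
           + k * math.log2(p) + (n - k) * math.log2(1 - p)  # log2 binomial weight
    w = 2.0 ** logw
    true_lp = k * math.log2(p) + (n - k) * math.log2(1 - p)
    laplace_lp = lg2fact(k) + lg2fact(n - k) - lg2fact(n + 1)
    red += w * (true_lp - laplace_lp)
print(iid_excess, red / n)  # constant vs. vanishing per-symbol redundancy
```

The adaptive predictor's per-symbol cost is two orders of magnitude smaller here, and shrinks further as n grows.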
In this paper, our primary goal is to understand the connections between the single-letter redundancy, on the one hand, and the behavior of the length n i.i.d. redundancy, on the other. As mentioned in the Introduction, the length n redundancy is the capacity of a channel from P to Z+^n, where the conditional probability distribution over Z+^n given p ∈ P is simply the distribution p over length n sequences. Roughly speaking, it quantifies how much information about the source we can extract from the sequence.
We will often speak of the per-symbol length n redundancy, which is simply the length n redundancy normalized by n, i.e., R_n(P^∞)/n. Furthermore, the quantity lim sup_{n→∞} R_n(P^∞)/n is the asymptotic per-symbol redundancy. Whether the asymptotic per-symbol redundancy is zero (we will equivalently say that the asymptotic per-symbol redundancy diminishes to zero, to keep in line with prior literature) is in many ways a litmus test for compression, estimation and other related problems. Loosely speaking, if R_n(P^∞)/n → 0, the redundancy-capacity interpretation [4] mentioned above implies that after a point, there is little further information to be learned when we see an additional symbol, no matter what the underlying source is. In this sense, this is the case where we can actually learn the underlying model at a uniform rate over the entire class.
We note that it is possible to define an even more stringent notion, the worst-case regret. For length n sequences, this is

R̂_n(P^∞) = inf_q sup_{p∈P} max_{x_1^n} log ( p(x_1^n) / q(x_1^n) ).

Single-letter regret is the special case where n = 1, and the asymptotic per-symbol regret is the limit as n → ∞ of the length n regret normalized by n. We will not concern ourselves with the worst-case formulation in this paper, but mention it in passing for comparison. In the worst-case setting, finite single-letter regret is necessary and sufficient [2] for the asymptotic per-symbol worst-case regret to diminish to zero. Yet, we show in this paper that this is not necessarily the case for redundancy: it is quite possible that collections with finite single-letter redundancy have asymptotic per-symbol redundancy bounded away from zero.

Patterns
Recent work [1] has formalized a similar framework for countably infinite alphabets. This framework is based on the notion of patterns of sequences, which abstract the identities of symbols and indicate only the relative order of appearance. For example, the pattern of PATTERN is 1233456. The k-th distinct symbol of a string is given an index k when it first appears, and that index is used every time the symbol appears henceforth. The crux of the patterns approach is to consider the set of measures induced over patterns of the sequences instead of considering the set of measures P over infinite sequences. Denote the pattern of a string x by Ψ(x). There is only one possible pattern of strings of length one (no matter what the alphabet, the pattern of a length one string is 1), two possible patterns of strings of length two (11 and 12), and so on. The number of possible patterns of length n is the n-th Bell number [1], and we denote the set of all possible length n patterns by Ψ^n. The measure induced on patterns by a corresponding measure p on infinite sequences of positive integers assigns to any pattern ψ the probability p(ψ) = p({x : Ψ(x) = ψ}).
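The Bell-number count can be checked directly for small n (helper names ours): the number of distinct patterns of length n strings matches the n-th Bell number computed from the standard recurrence B_{m+1} = Σ_k C(m,k) B_k.

```python
from itertools import product
from math import comb

def bell(n):
    """n-th Bell number via the recurrence B_{m+1} = sum_{k=0}^{m} C(m, k) * B_k."""
    b = [1]  # B_0
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]

def pattern(s):
    """Pattern of a sequence: each symbol replaced by the index of first appearance."""
    first_index = {}
    return tuple(first_index.setdefault(sym, len(first_index) + 1) for sym in s)

# Brute-force count of distinct patterns of length n; an alphabet of n symbols
# suffices, since a length n pattern uses at most n distinct indices.
for n in range(1, 6):
    count = len({pattern(s) for s in product(range(n), repeat=n)})
    print(n, count, bell(n))  # the two counts agree: 1, 2, 5, 15, 52
```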
In [1], the length n pattern redundancy was shown to be upper bounded by π(log e)√(2n/3). It was also shown in [6] that there is a single measure q over infinite length sequences whose pattern redundancy satisfies a bound of the same O(√n) order for all n simultaneously. Let the measure induced on patterns by q be denoted by q_Ψ for convenience. We can interpret the probability estimator q_Ψ as a sequential prediction procedure that estimates the probability that the symbol X_{n+1} will be "new" (has not appeared in X_1^n) and the probability that X_{n+1} takes a value that has been seen so far. This view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [7] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [8,9] collected in [10].

Cumulative Distributions and Tight Collections
For our purposes, the cumulative distribution function of any probability distribution p on Z+ (respectively, N) is a function F_p : R ∪ {∞} → [0, 1] defined in the following (slightly unconventional) way. We let F_p(0) = 0 in case the support is Z+ (respectively, F_p(−1) = 0 if the support is N). We then define F_p on points in the support of p in the way cumulative distribution functions are normally defined; specifically, for all y in the support of p, F_p(y) = Σ_{j≤y} p(j).
We let F_p(−∞) := 0 and F_p(∞) := 1. Finally, we extend the definition of F_p to all real numbers by linearly interpolating between the values defined already.
Let F_p^{-1} : [0, 1] → R ∪ {∞} denote the inverse function of F_p, defined as follows. To begin with, F_p^{-1}(1) is the smallest positive integer y such that F_p(y) = 1. It follows [11] then that F_p^{-1} is well defined on all of [0, 1]. A collection P of distributions on Z+ is defined to be tight if for all γ > 0, there is an N_γ such that sup_{p∈P} p(X > N_γ) < γ.
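The construction above can be sketched as follows for a distribution whose support has gaps (the helper `make_cdf` and the example distribution are our own); linear interpolation bridges the gaps, and F_p^{-1}(1) is recovered as the smallest point where F_p reaches one:

```python
def make_cdf(p):
    """Piecewise-linear CDF F_p for a distribution p (a dict) on Z+,
    with F_p(0) = 0 and linear interpolation between defined points."""
    pts = [(0, 0.0)]
    c = 0.0
    for y in sorted(p):
        c += p[y]
        pts.append((y, c))

    def F(x):
        if x >= pts[-1][0]:
            return 1.0
        for (x0, f0), (x1, f1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return f0 + (f1 - f0) * (x - x0) / (x1 - x0)
        return 0.0  # x below the start of the domain

    def Finv(z):
        """Smallest x with F(x) >= z, found by inverting each linear piece."""
        for (x0, f0), (x1, f1) in zip(pts, pts[1:]):
            if f0 <= z <= f1 and f1 > f0:
                return x0 + (x1 - x0) * (z - f0) / (f1 - f0)
        return float("inf")

    return F, Finv

# A distribution supported on {1, 2, 4}: interpolation bridges the gap at 3.
p = {1: 0.5, 2: 0.25, 4: 0.25}
F, Finv = make_cdf(p)
print(F(1), F(3), F(4))  # 0.5 0.875 1.0
print(Finv(1.0))         # 4.0: the smallest y with F_p(y) = 1
```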

Redundancy and Tightness
We focus on the single-letter redundancy in this section and explore the connections between the single-letter redundancy of a collection P and the tightness of P.

Lemma 1.
A collection P over N with bounded single-letter redundancy is tight. Namely, if the single-letter redundancy of P is finite, then for any γ > 0, there is an N_γ such that sup_{p∈P} p(X > N_γ) < γ.

Proof. Since P has bounded single-letter redundancy, fix a distribution q over N such that sup_{p∈P} D(p||q) is finite, where D(p||q) is the Kullback-Leibler divergence between p and q, and define R := sup_{p∈P} D(p||q). We will first show that for all p ∈ P and any m > 0,

p( log (p(X)/q(X)) > m ) ≤ (R + (2 log e)/e) / m.    (2)
To see Equation (2), let S be the set of all x ∈ N such that p(x) < q(x). A well-known convexity argument shows that the partial contribution to the KL divergence from S satisfies

Σ_{x∈S} p(x) log (p(x)/q(x)) ≥ p(S) log (p(S)/q(S)) ≥ −(log e)/e,

and hence Σ_{x∉S} p(x) log (p(x)/q(x)) ≤ R + (2 log e)/e. Then, Equation (2) follows by a simple application of Markov's inequality to the non-negative terms.
We will now use Equation (2) to complete the proof of the lemma. Specifically, we will show that for all γ > 0 there is an N_γ with sup_{p∈P} p(x > N_γ) < γ, where m* is the smallest integer such that (R + (2 log e)/e)/m* < γ/2. We prove this by partitioning the tail {x : x > N_γ} into two sets: (i) the set W_1 = {x > N_γ : log (p(x)/q(x)) > m*}. Clearly, W_1 ⊆ {x : log (p(x)/q(x)) > m*}, and thus p(W_1) ≤ (R + (2 log e)/e)/m* < γ/2, where the first inequality follows from Equation (2) and the second from the choice of m*.
(ii) the set W_2 = {x > N_γ : log (p(x)/q(x)) ≤ m*}. By definition, all x ∈ W_2 satisfy log (p(x)/q(x)) ≤ m*, or, equivalently, p(x) ≤ q(x) 2^{m*}. Hence, choosing N_γ large enough that q(x > N_γ) < γ/2^{m*+1}, we have

p(W_2) ≤ 2^{m*} q(W_2) ≤ 2^{m*} q(x > N_γ) < γ/2.

The lemma follows. □

The converse is not necessarily true. Tight collections need not have finite single-letter redundancy, as the following example demonstrates.
Construction: Consider the following collection I of distributions over Z+. First, partition the set of positive integers into the sets T_i, i ∈ N, where T_i = {2^i, ..., 2^{i+1} − 1}. Note that |T_i| = 2^i. Now, I is the collection of all possible distributions that can be formed as follows: for all i ∈ N, pick exactly one element of T_i and assign probability 1/((i+1)(i+2)) to the chosen element of T_i. (Choosing the support as above implicitly assumes the axiom of choice.) Note that the set I is uncountably infinite. □

Corollary 2. The set I of distributions is tight.

Proof. For all p ∈ I and all k ∈ N, p(X ≥ 2^{k+1}) = Σ_{i>k} 1/((i+1)(i+2)) = 1/(k+2); namely, all tails are uniformly bounded over the collection I. Put another way, for all γ > 0 and all distributions p ∈ I, p(X ≥ 2^{⌈1/γ⌉}) < γ. □

On the other hand:

Proposition 1. The collection I does not have finite redundancy.

Proof. Suppose q is any distribution over Z+. We will show that there exists p ∈ I such that E_p log (p(X)/q(X)) is not finite. Since the entropy of every p ∈ I is finite, we just have to show that for any distribution q over Z+, there exists p ∈ I such that E_p log (1/q(X)) is not finite.
Consider any distribution q over Z+. Observe that for all i, |T_i| = 2^i. It follows that for all i, there is an x_i ∈ T_i such that q(x_i) ≤ 1/2^i. However, by construction, I contains a distribution p* that has for its support the set {x_i : i ∈ N} identified above, with p*(x_i) = 1/((i+1)(i+2)). Then E_{p*} log (1/q(X)) ≥ Σ_{i≥0} i/((i+1)(i+2)), which diverges. The KL divergence from p* to q is therefore not finite, and the proposition follows, since q is arbitrary. □
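These calculations are easy to verify numerically (an illustrative sketch of our own; the truncation point 10^5 is arbitrary):

```python
def mass(i):
    # mass assigned to the single chosen element of T_i, i = 0, 1, 2, ...
    return 1.0 / ((i + 1) * (i + 2))

N = 10 ** 5

# The masses telescope: sum_{i=0}^{N-1} mass(i) = 1 - 1/(N+1), so they sum to one.
total = sum(mass(i) for i in range(N))
print(total)

# Uniform tail bound behind Corollary 2: the mass beyond T_k is exactly 1/(k+2),
# no matter which representatives were chosen.
k = 10
tail = sum(mass(i) for i in range(k + 1, N))
print(tail, 1.0 / (k + 2))

# Behind Proposition 1: with q(x_i) <= 2^{-i}, E_{p*} log2(1/q(X)) >= sum_i i * mass(i),
# whose partial sums grow without bound (roughly like ln N):
for M in (10 ** 2, 10 ** 3, 10 ** 5):
    print(M, sum(i * mass(i) for i in range(M)))
```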

Length n Redundancy
We study how the single-letter properties of a collection P of distributions influence the compression of length n strings obtained by i.i.d. sampling from distributions in P. Namely, we try to characterize when the length n redundancy of P^∞ grows sublinearly in the block length n.

Lemma 3. Let P be a collection of distributions over a countable support X. For some m ∈ Z+, consider m pairwise disjoint subsets S_i ⊂ X (1 ≤ i ≤ m), and let δ > 1/2. If there exist p_1, ..., p_m ∈ P such that p_i(S_i) ≥ δ for all 1 ≤ i ≤ m, then for all distributions q over X, sup_{p∈P} D(p||q) ≥ δ log m.
In particular, if there are infinitely many pairwise disjoint sets S_i, i ∈ Z+, and distributions p_i ∈ P such that p_i(S_i) ≥ δ, then the redundancy is infinite.
Proof. This is a simplified formulation of the distinguishability concept in [4]. For a proof, see, e.g., [12]. □

Counterexample
We now show that it is possible for the single-letter redundancy of a collection B of distributions to be finite, yet the asymptotic per-symbol redundancy (the length n redundancy of B^∞ normalized by n) remains bounded away from zero as the block length goes to infinity. To show this, we construct such a collection B.
Construction: As before, partition the set Z+ into T_i = {2^i, ..., 2^{i+1} − 1}, i ∈ N. Recall that T_i has 2^i elements. For all 0 < ε ≤ 1, let n_ε = ⌈1/ε⌉. Let 1 ≤ j ≤ 2^{n_ε}, and let p_{ε,j} be a distribution on Z+ that assigns probability 1 − ε to the number one (or, equivalently, to the set T_0) and probability ε to the j-th smallest element of T_{n_ε}, namely the number 2^{n_ε} + j − 1. B (mnemonic for binary, since every distribution has a support of size two) is the collection of distributions p_{ε,j} for all 0 < ε ≤ 1 and 1 ≤ j ≤ 2^{n_ε}. B^∞ is the set of measures over infinite sequences of numbers corresponding to i.i.d. sampling from B. □

We first verify that the single-letter redundancy of B is finite.
Proposition 2. Let q be the distribution that assigns q(T_i) = 1/((i+1)(i+2)) uniformly within each T_i, so that for all j ∈ T_i, q(j) = 1/(2^i (i+1)(i+2)). Then the single-letter redundancy of B against q is finite. However, the redundancy of compressing length n sequences from B^∞ scales linearly with n.
Proof. For all n, define 2^n pairwise disjoint subsets S_i of Z+^n, 1 ≤ i ≤ 2^n, where S_i is the set of all length n strings containing at most two distinct numbers (one and 2^n + i − 1) and at least one occurrence of 2^n + i − 1. Clearly, for distinct i and j between one and 2^n, S_i and S_j are disjoint. Furthermore, the measure corresponding to p_{1/n,i} ∈ B^∞ assigns S_i the probability 1 − (1 − 1/n)^n ≥ 1 − 1/e > 1/2. From Lemma 3, it follows that the length n redundancy of B^∞ is lower bounded by (1 − 1/e) log 2^n = (1 − 1/e) n. □

In a preview of what is to come, we notice that though the single-letter redundancy of the class B over Z+ is finite, the single-letter tail redundancy,

inf_q sup_{p∈B} Σ_{x>M} p(x) log ( p(x)/q(x) ),

does not diminish to zero: it remains bounded away from zero for all M. In fact, in the next section, we relate the single-letter tail redundancy above diminishing to zero to sublinear growth of the i.i.d. length n redundancy.
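The key probability in the proof is easy to check numerically (our own sketch):

```python
import math

# Under p_{1/n, i}, a length n i.i.d. string lands in S_i exactly when the rare
# symbol 2**n + i - 1 (probability 1/n) occurs at least once:
for n in (10, 100, 1000):
    p_hit = 1 - (1 - 1 / n) ** n
    print(n, p_hit)  # decreases toward 1 - 1/e, and always exceeds it

# Lemma 3 with delta = 1 - 1/e (> 1/2) and m = 2**n disjoint sets then lower
# bounds the length n redundancy of B^infinity by delta * log2(2**n), i.e.
# (1 - 1/e) * n bits: a per-symbol redundancy bounded away from zero.
delta = 1 - 1 / math.e
print(delta)  # per-symbol lower bound, in bits
```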

Sufficient Condition
In this section, we show a sufficient condition on the single-letter marginals of P and its redundancy that allows the i.i.d. length n redundancy of P^∞ to grow sublinearly with n. This condition is, however, not necessary, and the characterization of a condition that is both necessary and sufficient remains open.
For all ε > 0, let A_{p,ε} be the set of all elements in the support of p with probability ≥ ε, and let T_{p,ε} = Z+ − A_{p,ε}. Let G_0 = {φ}, where φ denotes the empty string. For all i, let G_i be the set of length i strings x_1^i in which every element of A_{p,ε} appears, where, in a minor abuse of notation, we use {x_1, ..., x_i} to denote the set of distinct symbols in the string x_1^i. Let B_0 = {}, and let B_i be the complementary set of length i strings in which some element of A_{p,ε} does not appear. Observe from an argument similar to the coupon collector problem that the following holds.

Lemma 4. For all i ≥ 2, p(B_i) ≤ 1/((i + 1) ln(i + 1)).
Proof. The proof follows from an elementary union bound. □

Theorem 5. Suppose P is a collection of distributions over Z+ whose entropies are uniformly bounded and whose redundancy is finite; we will denote R := inf_q sup_{p∈P} E_p log (p(X)/q(X)). Recall that for any distribution p, the set T_{p,δ} denotes the part of the support of p all of whose probabilities are < δ. Suppose further that there is a distribution q_1 over Z+ against which the single-letter tail redundancy vanishes:

lim_{δ→0} sup_{p∈P} Σ_{x∈T_{p,δ}} p(x) log ( p(x)/q_1(x) ) = 0.    (3)
Then, the redundancy of length n distributions obtained by i.i.d. sampling from distributions in P, denoted by R_n(P^∞), grows sublinearly: R_n(P^∞) = o(n).

Remark
If the conditions of the theorem are met, we can always assume without loss of generality that there is a single distribution q_1 that satisfies (3) and simultaneously has finite redundancy. To see this, suppose a distribution q_1' satisfies the finite-redundancy condition, while a different distribution q_1'' satisfies the tail-redundancy condition (3). It is easy to verify that the distribution q_1 that assigns to any x ∈ Z+ the probability q_1(x) = (q_1'(x) + q_1''(x))/2 satisfies both conditions simultaneously.
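The verification rests on a pointwise bound (we write q_1' and q_1'' for the two distributions of the remark; the display is our one-line sketch of the easy verification): since the mixture dominates half of each component,

```latex
q_1(x) \;=\; \frac{q_1'(x) + q_1''(x)}{2} \;\ge\; \frac{q_1'(x)}{2}
\quad\Longrightarrow\quad
\log\frac{p(x)}{q_1(x)} \;\le\; 1 + \log\frac{p(x)}{q_1'(x)},
```

and symmetrically with q_1''; hence each of the two expected-redundancy conditions degrades by at most one bit under the mixture, which preserves both finiteness and the vanishing limit in (3).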
We will construct q such that lim sup_{n→∞} (1/n) sup_{p∈P} E_p log (p(X_1^n)/q(X_1^n)) = 0. Recall that q_Ψ is the optimal universal pattern encoder over patterns of i.i.d. sequences defined in Section 2.2. Furthermore, recall that the redundancy of P is finite and that q_1 is the universal distribution over Z+ that attains redundancy R for P.
The universal encoder q is now defined as follows: the pattern of the sequence is encoded with q_Ψ, and, for each x_1^{i−1}, the identity of a symbol x_i appearing for the first time is encoded with q_1. Namely, we use an optimal universal pattern encoder over patterns of i.i.d. sequences and encode any new symbol using a universal distribution over P. We now bound the redundancy of q as defined above.
We have, for all p ∈ P^∞, a decomposition of the redundancy of q into two terms, which we bound in turn.
The first term, normalized by n, can be upper bounded by splitting the length i − 1 sequences into the sets G_{i−1} and B_{i−1} and using separate bounds on each set that hold uniformly over the entire model collection; here, H_p denotes the entropy of p, and the bound on the B_{i−1} part follows from Lemma 4. From Condition (3) of the Theorem, the contribution accounted for by the tails vanishes in the limit. The first term in the resulting bound is non-negative; hence, the limit has to equal zero. The equality (a) follows from Cesàro's lemma, which asserts that for any sequence {a_i, i ∈ Z+} with a_i < ∞ for all i, if lim_{i→∞} a_i exists, then

lim_{n→∞} (1/n) Σ_{i=1}^n a_i = lim_{i→∞} a_i.

For the second term, observe that

Σ_{x^n} p(x^n) Σ_{j=1}^n log ( p(x_j | ψ_1^j, x_1^{j−1}) / q(x_j | ψ_1^j, x_1^{j−1}) ) ≤ R + Σ_{x^n} p(x^n) Σ_{j=2}^n log ( p(x_j | ψ_1^j, x_1^{j−1}) / q(x_j | ψ_1^j, x_1^{j−1}) ).
The entropy of p_k ∈ U is therefore 1 + h(1/k^2). Note that the redundancy of U is finite, too. To see this, define

R^+ := log Σ_{x∈Z+} sup_{k∈N} p_k(x),

and observe that the distribution q given by q(x) = 2^{−R^+} sup_{k∈N} p_k(x), x ∈ Z+,
satisfies, for all p_k ∈ U and all x,

log ( p_k(x)/q(x) ) ≤ R^+,    (5)

implying that the redundancy of U is ≤ R^+ + 2. Furthermore, Equation (5) implies that the worst-case regret is finite, and from [2], the length n redundancy of U^∞ grows sublinearly. Now, pick an integer m ∈ Z+; for all p_k ∈ U with k ≥ m, one verifies that the tail condition of Theorem 5 fails to hold uniformly over U. Thus, the per-symbol length n redundancy of U^∞ diminishes to zero, while U does not satisfy all of the requirements of Theorem 5. Therefore, the conditions of Theorem 5 are only sufficient, not necessary.

Open Problems
We have demonstrated that finite single-letter redundancy of a collection P of distributions over a countably infinite support does not imply that the asymptotic per-symbol redundancy of i.i.d. samples from P diminishes to zero. This is in contrast to the scenario for worst-case regret, where finiteness of the single-letter worst-case regret is both necessary and sufficient for the asymptotic per-symbol regret to diminish to zero. We have also demonstrated sufficient conditions on the collection P for the asymptotic per-symbol redundancy of i.i.d. samples to diminish to zero. However, as we show, the sufficient conditions we provide are not necessary.
It is yet open to find a condition on single-letter marginals that is both necessary and sufficient for the asymptotic per-symbol redundancy to diminish to zero.
