# A Lower-Bound for the Maximin Redundancy in Pattern Coding

by
Aurélien Garivier
CNRS, Telecom ParisTech, Laboratoire Traitement et Communication de l’Information, 75013 Paris, France
Entropy 2009, 11(4), 634-642; https://doi.org/10.3390/e11040634
Submission received: 1 September 2009 / Accepted: 20 October 2009 / Published: 22 October 2009

## Abstract

We show that the maximin average redundancy in pattern coding is eventually larger than $1.84 \left( \frac{n}{\log n} \right)^{1/3}$ for messages of length n. This improves recent results on pattern redundancy, although it does not fill the gap between known lower- and upper-bounds. The pattern of a string is obtained by replacing each symbol by the index of its first occurrence. The problem of pattern coding is of interest because strongly universal codes have been proved to exist for patterns, while universal message coding is impossible for memoryless sources on an infinite alphabet. The proof uses fine combinatorial results on partitions with small summands.

## 1. Introduction

#### 1.1. Universal Coding

Let $P$ be a stationary source on an alphabet A, both known by the coder and the decoder. Let $X = (X_n)_{n \geqslant 1}$ be a random process with distribution $P$. For a positive integer n, we denote by $X_1^n$ the vector of the n first components of X and by $P^n$ the distribution of $X_1^n$ on $A^n$. We denote the logarithm with base 2 by log and the natural logarithm by ln. Shannon’s classical bound [1] states that the average bit length of codewords for any coding function is lower-bounded by the n-th order entropy $H(X_1^n)$; moreover, this codelength can be nearly achieved, see [2]. One important idea in the proof of this result is the following: every code on the strings of length n is associated with a coding distribution $q_n$ on $A^n$ in such a way that the code length for x is $-\log q_n(x)$, and reciprocally any distribution $q_n$ on $A^n$ can be associated with a coding function whose code length is approximately $-\log q_n(x)$. When $P$ is ergodic, its entropy rate $H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1^n)$ exists; it is a tight lower bound on the number of bits required per character.
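As a numerical illustration (not part of the original paper), the coding-distribution viewpoint can be checked directly: the expected ideal codelength $E_P[-\log_2 q(X)]$ is never below the entropy, with equality exactly when $q = P$. A minimal Python sketch, with function names of our choosing:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a finite distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_codelength(p, q):
    """Expected ideal codelength E_P[-log2 q(X)] in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]          # true source distribution
q = [1 / 3, 1 / 3, 1 / 3]      # mismatched coding distribution

h = entropy(p)                 # 1.5 bits
assert expected_codelength(p, q) >= h          # Shannon's lower bound
assert abs(expected_codelength(p, p) - h) < 1e-12   # equality when q = p
```

The difference `expected_codelength(p, q) - entropy(p)` is exactly the per-letter redundancy of the coding distribution q on the memoryless source p.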
If $P$ is only known to be an element $P_\theta$ of some class $C = \{ P_\theta : \theta \in \Theta \}$, universal coding consists in finding a single code, or equivalently a single sequence of coding distributions $(q_n)_n$, approaching the entropy rate for all sources $P_\theta \in C$ at the same time. Such versatility has a price: for any given source $P_\theta$, there is an additional cost called the (expected) redundancy $R(q_n, \theta)$ of the coding distribution $q_n$, defined as the difference between the expected code length and the n-th order entropy $H(X_1^n)$. Two criteria measure the universality of $q_n$:
• First, a deterministic approach judges the performance of $q_n$ in the worst case by the maximal redundancy $R^+(q_n, \Theta) = \sup_{\theta \in \Theta} R(q_n, \theta)$. The lowest achievable maximal redundancy is called the minimax redundancy:
$R^+(n, \Theta) = \min_{q_n} \max_{\theta} R(q_n, \theta)$
• Second, a Bayesian approach consists in providing Θ with a prior distribution π, and then considering the expected redundancy $E_\pi[R(q_n, \theta)]$ (the expectation here is taken over θ). Let $q_n^\pi$ be the coding distribution minimizing $E_\pi[R(q_n, \theta)]$. The maximin redundancy $R^-(n, \Theta)$ of class $C$ is the supremum of $E_\pi[R(q_n^\pi, \theta)]$ over all possible prior distributions π:
$R^-(n, \Theta) = \max_{\pi} \min_{q_n} E_\pi[R(q_n, \theta)]$
A classical minimax theorem (see [3]) states that mild hypotheses are sufficient to ensure that $R^-(n, \Theta) = R^+(n, \Theta)$. Class $C$ is said to be strongly universal if $R^+(n, \Theta) = o(n)$: then universal coding is possible uniformly on $C$. An important result by Rissanen [4] asserts that if the parameter set Θ is k-dimensional, and if there exists a $\sqrt{n}$-consistent estimator for θ, then
$R^-(n, \Theta) = R^+(n, \Theta) = \frac{k}{2} \log n + O(1)$ (1)
This well-known bound has many applications in information theory, often related to the Minimum Description Length principle. It is remembered as a “rule of thumb”: redundancy is $\frac{1}{2} \log n$ for each parameter of the model. This result covers a large variety of cases, among others memoryless processes, Markov chains, context tree sources and hidden Markov chains. However, further generalizations have been investigated. Shields (see [5]) proved that no coder can achieve a non-trivial redundancy rate on all stationary ergodic processes. Csiszár and Shields [6] gave an example of a non-parametric class of intermediate complexity, the renewal processes, for which $R^-(n, \Theta)$ and $R^+(n, \Theta)$ are both of order $\sqrt{n}$. If alphabet A is not known, or if its size is not insignificant compared to n, Rissanen’s bound (1) is uninformative. If the alphabet A is infinite, Kieffer [7] showed that no universal coding is possible even for the class of memoryless processes.
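The “$\frac{1}{2} \log n$ per parameter” rule of thumb can be illustrated with the classical Krichevsky–Trofimov mixture for binary memoryless sources ($k = 1$ free parameter). This standard construction is not discussed in the paper, and the function names below are ours; a classical bound states that its pointwise redundancy with respect to the maximum-likelihood codelength stays within $\frac{1}{2}\log n + 1$:

```python
import math

def kt_codelength(x):
    """Codelength (bits) of the Krichevsky-Trofimov mixture on a binary string:
    the i-th symbol b is predicted with probability (counts[b] + 1/2) / i."""
    counts = [0, 0]
    logq = 0.0
    for i, b in enumerate(x):
        logq += math.log2((counts[b] + 0.5) / (i + 1))
        counts[b] += 1
    return -logq

def ml_codelength(x):
    """n times the empirical (maximum-likelihood) entropy of x."""
    n, k = len(x), sum(x)
    h = 0.0
    for c in (k, n - k):
        if c > 0:
            h -= (c / n) * math.log2(c / n)
    return n * h

n = 1024
x = [0, 1] * (n // 2)                    # any binary string works here
red = kt_codelength(x) - ml_codelength(x)
assert 0 <= red <= 0.5 * math.log2(n) + 2    # ~ (k/2) log n with k = 1
```

The redundancy is nonnegative because the KT mixture assigns each string at most its maximum-likelihood probability, and the $\frac{1}{2} \log n$ growth matches Rissanen’s bound (1) with $k = 1$.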

#### 1.2. Dictionary and Pattern

Those negative results prompted the idea of coding separately the structure of string x and the symbols present in x. It was first introduced by Åberg in [8] as a solution to the multi-alphabet coding problem, where the message x contains only a small subset of the known alphabet A. It was further studied and motivated in a series of articles by Shamir [9,10,11,12] and by Jevtić, Orlitsky, Santhanam and Zhang [13,14,15,16] for practical applications: the alphabet is unknown and has to be transmitted separately anyway (for instance, transmission of a text in an unknown language), or the alphabet is very large in comparison to the message (consider the case of images with $k = 2^{24}$ colors, or of texts when taking words as the alphabet units).
To explain the notion of pattern, let us take the example of [9]: string $x =$ “abracadabra” is made of $n = 11$ characters. The information it conveys can be separated into two blocks:
• a dictionary $\Delta = \Delta(x)$, defined as the sequence of distinct characters present in x in order of appearance; in the example $\Delta = (a, b, r, c, d)$.
• a pattern $\psi = \psi(x)$, defined as the sequence of positive integers pointing to the index of each letter in Δ; here, $\psi = 12314151231$.
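This decomposition is easy to compute; the following sketch (ours, with hypothetical function names) reproduces the “abracadabra” example:

```python
def dictionary_and_pattern(x):
    """Split a string into its dictionary (symbols in order of first
    occurrence) and its pattern (1-based index of each symbol in the
    dictionary)."""
    dictionary, index, pattern = [], {}, []
    for ch in x:
        if ch not in index:
            index[ch] = len(dictionary) + 1   # patterns are 1-based
            dictionary.append(ch)
        pattern.append(index[ch])
    return dictionary, pattern

d, psi = dictionary_and_pattern("abracadabra")
assert d == ["a", "b", "r", "c", "d"]
assert psi == [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]   # i.e. 12314151231
```

Concatenating the two blocks recovers x exactly, which is why dictionary and pattern can be coded separately.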
Let $\mathcal{P}_n$ be the set of all possible patterns of n-strings. For instance, $\mathcal{P}_1 = \{ 1 \}$, $\mathcal{P}_2 = \{ 11, 12 \}$, $\mathcal{P}_3 = \{ 111, 112, 121, 122, 123 \}$. Using the same notation as in [15], we call multiplicity $\mu_j(\psi)$ of symbol j in pattern $\psi \in \mathcal{P}_n$ the number of occurrences of j in ψ; the multiplicity of pattern ψ is the vector $\mu(\psi) = (\mu_1(\psi), \mu_2(\psi), \dots)$ made of all symbols’ multiplicities; in the former example, $\mu = (5, 2, 2, 1, 1, 0, \dots)$. Note that $\sum_{j=1}^n \mu_j = n$. Moreover, the profile of pattern ψ provides, for every multiplicity μ, its frequency in $\mu(\psi)$. It can be formally defined as the multiplicity of ψ’s multiplicity: $\phi_\mu(\psi) = \left| \{ j : \mu_j(\psi) = \mu \} \right|$. The profile of string “abracadabra” is $(2, 2, 0, 0, 1, 0, \dots)$, as two symbols (c and d) appear once, two symbols (b and r) appear twice and one symbol (a) appears five times. We denote by $\Phi_n$ the set of possible profiles for patterns of length n, so that $\Phi_1 = \{ (1) \}$, $\Phi_2 = \{ (2, 0), (0, 1) \}$, $\Phi_3 = \{ (3, 0, 0), (1, 1, 0), (0, 0, 1) \}$. Note that $\sum_{\mu=1}^n \mu \phi_\mu = n$. As explained in [15], there is a one-to-one mapping between $\Phi_n$ and the set of unordered partitions of the integer n. In Section 3, this point will be used and made precise.
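Multiplicity and profile are equally simple to compute; the sketch below (ours, not from the paper) checks both identities above on the pattern of “abracadabra”:

```python
from collections import Counter

def multiplicity(pattern):
    """mu_j = number of occurrences of symbol j in the pattern."""
    c = Counter(pattern)
    return [c[j] for j in range(1, max(pattern) + 1)]

def profile(pattern):
    """phi_mu = number of symbols whose multiplicity equals mu."""
    mu = multiplicity(pattern)
    c = Counter(mu)
    return [c[m] for m in range(1, max(mu) + 1)]

psi = [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]      # pattern of "abracadabra"
assert multiplicity(psi) == [5, 2, 2, 1, 1]
assert profile(psi) == [2, 2, 0, 0, 1]       # two 1s, two 2s, one 5
# the profile identity: sum over mu of mu * phi_mu equals n
assert sum(m * f for m, f in enumerate(profile(psi), start=1)) == len(psi)
```

Reading the nonzero multiplicities in decreasing order, $5 + 2 + 2 + 1 + 1 = 11$, exhibits the unordered partition of n associated with the profile.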

#### 1.3. Pattern Coding

Any process X from a source $P_\theta$ induces a pattern process $\Psi = \psi(X)$ with marginal distributions on $\mathcal{P}_n$ defined by $P_\theta(\Psi_1^n = \psi) = P_\theta\left( \psi(X_1^n) = \psi \right)$. Thus, we can define the n-th block pattern entropy $H(\Psi_1^n)$. For stationary ergodic $P_\theta$, Orlitsky et al. [16] prove that the pattern entropy rate $H(\Psi) = \lim_{n \to \infty} \frac{1}{n} H(\Psi_1^n)$ exists and is equal to $H(X)$ (whether this quantity is finite or not). This result was independently discovered by Gemelos and Weissman [17].
In the sequel, we shall consider only the case of memoryless sources $P θ$, with marginal distributions $p θ$ on a (possibly infinite) alphabet $A$. Hence, Θ will be the set parameterizing all probability distributions on $A$.
Obviously, the pattern process they induce is not memoryless. But as patterns convey less information than the initial strings, coding them seems to be an easier task. The expected pattern redundancy of a coding distribution $q_n$ on $\mathcal{P}_n$ can be defined by analogy as the difference between the expected code length under distribution $P_\theta$ and the n-th block pattern entropy:
$R_\Psi(q_n, \theta) = E_\theta\left[ -\log q_n(\Psi_1^n) \right] - H(\Psi_1^n)$
As the alphabet is unknown, the maximal pattern redundancy $R_\Psi^+(q_n, \Theta)$ must be defined as the supremum of $R_\Psi(q_n, \theta)$ over all alphabets A and all memoryless distributions on A. The minimax pattern redundancy $R_\Psi^+(n, \Theta)$ is then defined as the infimum of $R_\Psi^+(q_n, \Theta)$ over $q_n$. Similarly, the maximin pattern redundancy $R_\Psi^-(n, \Theta)$ is defined as the supremum, with respect to all possible alphabets A and all prior distributions π, of the lowest achievable average redundancy, that is:
$R_\Psi^-(n, \Theta) = \sup_{A, \pi} \inf_{q_n} E_\pi\left[ R_\Psi(q_n, \theta) \right]$

## 2. Theorem

The true order of magnitude of $R_\Psi^-(n, \Theta)$ and $R_\Psi^+(n, \Theta)$ is still uncertain. However, Orlitsky et al. in [15] and Shamir in [11] proved that for some constants $c_1$ and $c_2$ it holds that $c_1 n^{1/3 - \epsilon} \leqslant R_\Psi^-(n, \Theta) \leqslant R_\Psi^+(n, \Theta) \leqslant c_2 \sqrt{n}$. There is hence a gap between upper- and lower-bounds. This gap has been reduced in an article by Shamir [10], where the upper-bound is improved. The following theorem contributes to the evaluation of $R_\Psi^-(n, \Theta)$ by providing a slightly better and more explicit lower-bound, the proof of which is particularly elegant.
Theorem 1. For all integers n large enough, the maximin pattern redundancy is lower-bounded as:
$R_\Psi^-(n, \Theta) \geqslant 1.84 \left( \frac{n}{\log n} \right)^{1/3}$
Gil Shamir [18] suggests that a bound of similar order can be obtained by properly updating (B12) in [11]. The proof provided in this paper was elaborated independently; both use the channel capacity inequality described in Section 3. However, it is interesting to note that they rely on different ideas (unordered partitions of integers and Bernstein’s inequality here, sphere packing arguments and inhomogeneous grids there). An important difference appears in the treatment of the quantization, see Equation 2. [11] provides fine relations between the minimax average redundancy and the alphabet size. The approach presented here does not discriminate between alphabet sizes; in a short and elegant proof, it leads to a slightly better bound for infinite alphabets.

## 3. Proof

We use here a standard technique for lower-bounds (see [19]): the n-th order maximin redundancy is bounded from below by (and asymptotically equivalent to) the capacity of the channel joining an input variable W with distribution π on Θ to the output variable $\Psi_1^n$ with conditional probabilities $P_\theta(\Psi_1^n)$. Let $H(\Psi_1^n | W)$ be the conditional entropy of $\Psi_1^n$ given W, and let $I(\Psi_1^n; W)$ denote the mutual information of these two random variables, see [2]. Then from [19] and [4] we know that the inequality
$R_\Psi^-(n, \Theta) \geqslant I(\Psi_1^n; W)$
holds for all alphabets $A$ and all prior distributions π on the set of memoryless distributions on $A$: it is sufficient to give a lower-bound for the mutual information between parameter W and observation Ψ. In words, $R_\Psi^-(n, \Theta)$ is larger than the logarithm of the number of memoryless sources that can be distinguished from one observation of $\Psi_1^n$.
Given the positive integer n, let $c = c_n$ be an integer growing with n to infinity in a way defined later, let λ be a positive constant to be specified later, let $d = \lambda \sqrt{c}$ and let $A = \{ 1, \dots, c \}$. We denote by $\Theta_{c,d}$ the set of all unordered partitions of c made of summands at most equal to d:
$\Theta_{c,d} = \left\{ \theta = (\theta_1, \dots, \theta_c) : d \geqslant \theta_1 \geqslant \dots \geqslant \theta_c \geqslant 0, \ \sum_{j=1}^{c} \theta_j = c \right\}$
Then $\Theta_c \triangleq \Theta_{c,c}$ is the set of all unordered partitions of c. Let also $\Phi_{c,d}$ be the subset of $\Phi_c$ containing the profiles of all patterns $\psi \in \mathcal{P}_c$ whose symbols appear at most d times:
$\Phi_{c,d} = \left\{ \phi \in \Phi_c : \phi_\mu = 0 \text{ for all } \mu > d \right\}$
There is a one-to-one mapping $\chi_c$ between $\Theta_c$ and $\Phi_c$ defined by
$\chi_c(\theta) = \phi, \quad \text{where } \phi_\mu = \left| \{ j : \theta_j = \mu \} \right| \text{ for } 1 \leqslant \mu \leqslant c$
It is immediately verified that $\chi_c(\Theta_{c,d}) = \Phi_{c,d}$. In [20], citing [21], Dixmier and Nicolas show the existence of an increasing function f such that $\ln |\Theta_{c,d}| \sim f(\lambda) \sqrt{c}$ as $c \to \infty$, where $\lambda = d / \sqrt{c}$. Numerous properties of the function f, and numerical values, are given in [20]; notably, f is an infinitely differentiable and concave function which satisfies $f(\lambda) = -2\lambda \ln \lambda + 2\lambda + O(\lambda^3)$ when $\lambda \to 0$ and $f(\lambda) = \pi \sqrt{2/3} - (\sqrt{6}/\pi) \exp(-\pi \lambda / \sqrt{6})$ when $\lambda \to \infty$.
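For small values of c, the cardinality of $\Theta_{c,d}$ can be checked exhaustively with the classical recursion on the largest allowed summand; this sketch (ours, helper name hypothetical) is only an illustration of the sets whose asymptotic size Dixmier and Nicolas describe:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_partitions(c, d):
    """Number of unordered partitions of c with all summands <= d,
    i.e. the cardinality of Theta_{c,d}."""
    if c == 0:
        return 1
    if c < 0 or d == 0:
        return 0
    # either use at least one summand equal to d, or no summand equal to d
    return num_partitions(c - d, d) + num_partitions(c, d - 1)

assert num_partitions(5, 5) == 7     # all partitions of 5
assert num_partitions(5, 2) == 3     # 2+2+1, 2+1+1+1, 1+1+1+1+1
assert num_partitions(10, 10) == 42  # the classical value p(10)
```

The unconstrained count $|\Theta_c| = p(c)$ grows like $\exp(\pi \sqrt{2c/3})$, consistent with the limit $f(\lambda) \to \pi \sqrt{2/3}$ as $\lambda \to \infty$.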
For $\theta \in \Theta_{c,d}$, let $p_\theta$ be the distribution on $A$ defined by $p_\theta(i) = \frac{\theta_i}{c}$, and let $P_\theta$ be the memoryless process with marginal distribution $p_\theta$. Let W be a random variable with uniform distribution on the set $\Theta_{c,d}$. Let X be a random process such that, conditionally on the event $\{ W = \theta \}$, the distribution of X is $P_\theta$, and let $\Psi = \psi(X)$ be the induced pattern process.
We want to bound $I(\Psi_1^n; W) = H(W) - H(W | \Psi_1^n)$ from below. As $H(W) = \log |\Theta_{c,d}|$,
we need to find an upper-bound for $H(W | \Psi_1^n)$. The idea of the proof is the following. From Fano’s inequality, upper-bounding $H(W | \Psi_1^n)$ reduces to finding a good estimator $\hat{\theta}$ for W: conditionally on $W = \theta$, string $X_1^n$ is a memoryless process with distribution $P_\theta$ and we aim at recovering parameter θ from its pattern $\Psi_1^n$. Each parameter is here an unordered partition of the integer c with small summands. Let $T_j$ be the number of occurrences of the j-th most frequent symbol in ψ. Then $T = (T_1, T_2, \dots)$ constitutes a random unordered partition of n. We show that by “shrinking” T by a factor $c/n$ we build an unordered partition $\hat{\theta}$ of c that is equal to parameter θ with high probability, see Figure 1. Note that only partitions with small summands are considered: this allows a better uniform control on the probabilities of deviation of each symbol’s frequency, while the cardinality of $\Theta_{c,d}$ remains of the same (logarithmic) order as that of $\Theta_c$. Parameters c and d are chosen in order to optimize the rate in Theorem 1, while the value of $\lambda = d / \sqrt{c}$ is chosen at the end to maximize the constant.
Figure 1. The profile of pattern ψ forms a partition of n that can be “shrunk” to θ, the parameter partition of c, with high probability.
Let us now give the details of the proof. If $W = \theta$ and if we observe string $X_1^n = x$ having pattern $\Psi_1^n = \psi \in \mathcal{P}_n$, we construct an estimator of θ in the following way: let $\phi(\psi)$ be the profile of ψ, and let $T = (T_j)_{j \geqslant 1}$ be the corresponding partition of n. For $j \geqslant 1$, let $\hat{\theta}_j = \left[ \frac{c T_j}{n} \right]$, where $[x]$ denotes the nearest integer to x. Observe that, as alphabet $A$ contains only c different symbols, for all $j > c$ we have $T_j = \hat{\theta}_j = \theta_j = 0$.
The distribution of T is difficult to study, but it is closely related to much simpler random variables. For $1 \leqslant i \leqslant n$ and $j \geqslant 1$, let $U_j^i = 1_{X_i = j}$; as $U_j^i$ has a Bernoulli distribution with parameter $\frac{\theta_j}{c}$, and as process X is memoryless, we observe that $U_j \triangleq \sum_{i=1}^n U_j^i$, the number of occurrences of symbol j in x, has a binomial distribution $\mathcal{B}\left( n, \frac{\theta_j}{c} \right)$. Let $\tilde{\theta}_j = \left[ \frac{c U_j}{n} \right]$ and $\tilde{\theta} = (\tilde{\theta}_j)_{j \geqslant 1}$; $\tilde{\theta}$ would be an estimator of θ if we had access to x, but here estimators may only be constructed from ψ. However, there is a strong connection between $\hat{\theta}$ and $\tilde{\theta}$: the symbols in x are in one-to-one correspondence with the symbols in ψ. Hence, T is just the order statistics of U: $T_j = U_{(j)}$ and thus $\hat{\theta}_j = \tilde{\theta}_{(j)}$.
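The shrinking estimator is easy to simulate; the sketch below (ours, not from the paper) samples from $p_\theta$ by drawing a uniform integer in $\{0, \dots, c-1\}$, which is exact because the summands $\theta_j$ are integers summing to c, and then shrinks the symbol counts. Since the pattern only reveals the order statistics of the counts, the estimate is sorted:

```python
import random

def estimate_partition(theta, n, seed=0):
    """Sample X_1..X_n i.i.d. from p_theta(j) = theta_j / c, then estimate
    theta by rounding the counts shrunk by c/n; only the order statistics
    of the counts are visible from the pattern, so the result is sorted."""
    c = sum(theta)
    rng = random.Random(seed)
    counts = [0] * len(theta)
    for _ in range(n):
        r = rng.randrange(c)            # uniform over c equal mass units
        acc = 0
        for j, t in enumerate(theta):
            acc += t
            if r < acc:
                counts[j] += 1
                break
    return sorted(round(c * u / n) for u in counts)

theta = [1, 1, 2, 3, 5]                 # a partition of c = 12 (sorted)
theta_hat = estimate_partition(theta, n=100000)
assert theta_hat == theta               # recovery succeeds w.h.p. for large n
```

With $n$ this large, each count deviates from its mean $n \theta_j / c$ by far less than $n / (2c)$, so the rounding recovers every summand.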
Now, if $\left| U_j - \frac{n \theta_j}{c} \right| < \frac{n}{2c}$ then $\tilde{\theta}_j = \theta_j$. Thus, if for all j in the set $\{ 1, \dots, c \}$ it holds that $\left| U_j - \frac{n \theta_j}{c} \right| < \frac{n}{2c}$, then $\tilde{\theta} = \theta$ and $\tilde{\theta}$, as a monotone sequence, is equal to its order statistics $\hat{\theta}$. It follows, using the union bound, that:
$P_\theta\left( \hat{\theta} \neq \theta \right) \leqslant \sum_{j=1}^{c} P_\theta\left( \left| U_j - \frac{n \theta_j}{c} \right| \geqslant \frac{n}{2c} \right)$
We chose the parameter set $\Theta_{c,d}$ so that all summands in partition θ are small with respect to c. Consequently, the variance of the $U_j^i$ is uniformly bounded: $\mathrm{Var}\left[ U_j^i \right] \leqslant \frac{\theta_j}{c} \leqslant \frac{d}{c} = \frac{\lambda}{\sqrt{c}}$. Recall the following Bernstein inequality [22]: if $Y_1, \dots, Y_n$ are independent random variables such that $Y_i$ takes its values in $[-b, b]$ and such that $\mathrm{Var}[Y_i] \leqslant v$, and if $S = Y_1 + \dots + Y_n$, then for any positive x it holds that:
$P\left( \left| S - E[S] \right| \geqslant x \right) \leqslant 2 \exp\left( - \frac{x^2}{2 \left( n v + b x / 3 \right)} \right)$
Using this inequality for the variables $Y_i = U_j^i - \frac{\theta_j}{c}$ (with $b = 1$, $v = \frac{\lambda}{\sqrt{c}}$ and $x = \frac{n}{2c}$), we obtain, for every $j \leqslant c$:
$P_\theta\left( \left| U_j - \frac{n \theta_j}{c} \right| \geqslant \frac{n}{2c} \right) \leqslant 2 \exp\left( - \frac{n}{8 \lambda c^{3/2} + \frac{4c}{3}} \right)$
Thus, we obtain from the union bound above:
$P_\theta\left( \hat{\theta} \neq \theta \right) \leqslant 2 c \exp\left( - \frac{n}{8 \lambda c^{3/2} + \frac{4c}{3}} \right)$
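Bernstein’s inequality can be checked empirically on the binomial deviations used here; this Monte Carlo sketch (ours) compares the empirical deviation frequency of a centered binomial sum with the bound:

```python
import math
import random

def bernstein_bound(n, v, b, x):
    """Two-sided Bernstein bound on P(|S - E S| >= x) for a sum S of n
    independent variables with |Y_i| <= b and Var[Y_i] <= v."""
    return 2.0 * math.exp(-x * x / (2.0 * (n * v + b * x / 3.0)))

rng = random.Random(1)
n, p, trials, x = 1000, 0.05, 2000, 18.0
hits = sum(
    1 for _ in range(trials)
    if abs(sum(1 for _ in range(n) if rng.random() < p) - n * p) >= x
)
# the empirical deviation frequency never exceeds the Bernstein bound
assert hits / trials <= bernstein_bound(n, v=p * (1 - p), b=1.0, x=x)
```

In the proof, the same bound is applied with $v = \lambda / \sqrt{c}$, which is exactly where the restriction to partitions with small summands pays off.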
Now, using Fano’s inequality [2]:
$H(W | \Psi_1^n) \leqslant 1 + P\left( \hat{\theta} \neq W \right) \log |\Theta_{c,d}|$
Hence,
$I(\Psi_1^n; W) \geqslant \left( 1 - P\left( \hat{\theta} \neq W \right) \right) \log |\Theta_{c,d}| - 1$
By choosing c of order $\left( \frac{n}{\log n} \right)^{2/3}$, the error probability vanishes while $\log |\Theta_{c,d}|$ is of order $\sqrt{c} = \left( \frac{n}{\log n} \right)^{1/3}$, and we get a lower-bound proportional to $\frac{f(\lambda)}{\lambda^{1/3}} \left( \frac{n}{\log n} \right)^{1/3}$.
By looking at the table of f given at page 151 of [20], we see that the function $\lambda \mapsto f(\lambda) / \lambda^{1/3}$ reaches its maximum around $\lambda = 0.8$; for that choice, $f(\lambda) \approx 2.07236$ and we obtain, for all n large enough:
$R_\Psi^-(n, \Theta) \geqslant 1.84 \left( \frac{n}{\log n} \right)^{1/3}$
which concludes the proof of Theorem 1.

## Acknowledgment

I would particularly like to thank Professor Jean-Louis Nicolas, from the Institut Girard Desargues, who very kindly helped with the combinatorial arguments.

## References

1. Shannon, C.E. A mathematical theory of communication. Bell System Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar]
2. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons Inc.: New York, NY, USA, 1991. [Google Scholar]
3. Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inform. Theory 1997, 43, 1276–1280. [Google Scholar] [CrossRef]
4. Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory 1984, 30, 629–636. [Google Scholar] [CrossRef]
5. Shields, P.C. Universal redundancy rates do not exist. IEEE Trans. Inform. Theory 1993, 39, 520–524. [Google Scholar] [CrossRef]
6. Csiszár, I.; Shields, P.C. Redundancy rates for renewal and other processes. IEEE Trans. Inform. Theory 1996, 42, 2065–2072. [Google Scholar] [CrossRef]
7. Kieffer, J.C. A unified approach to weak universal source coding. IEEE Trans. Inform. Theory 1978, 24, 674–682. [Google Scholar] [CrossRef]
8. Åberg, J.; Shtarkov, Y.M.; Smeets, B.J. Multialphabet Coding with Separate Alphabet Description. In Proceedings of Compression and complexity of sequences; Press, I.C.S., Ed.; IEEE: Palermo, Italy, 1997; pp. 56–65. [Google Scholar]
9. Shamir, G.I.; Song, L. On the entropy of patterns of i.i.d. sequences. In Proceedings of 41st Annual Allerton Conference on Communication, Control and Computing; Curran Associates, Inc.: Monticello, IL, USA, 2003; pp. 160–169. [Google Scholar]
10. Shamir, G.I. A new redundancy bound for universal lossless compression of unknown alphabets. In Proceedings of the 38th Annual Conference on Information Sciences and Systems - CISS; IEEE: Princeton, NJ, USA, 2004; pp. 1175–1179. [Google Scholar]
11. Shamir, G.I. Universal lossless compression with unknown alphabets-the average case. IEEE Trans. Inform. Theory 2006, 52, 4915–4944. [Google Scholar] [CrossRef]
12. Shamir, G.I. On the MDL principle for i.i.d. sources with large alphabets. IEEE Trans. Inform. Theory 2006, 52, 1939–1955. [Google Scholar] [CrossRef]
13. Orlitsky, A.; Santhanam, N.P. Speaking of infinity. IEEE Trans. Inform. Theory 2004, 50, 2215–2230. [Google Scholar] [CrossRef]
14. Jevtić, N.; Orlitsky, A.; Santhanam, N.P. A lower bound on compression of unknown alphabets. Theoret. Comput. Sci. 2005, 332, 293–311. [Google Scholar] [CrossRef]
15. Orlitsky, A.; Santhanam, N.P.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inform. Theory 2004, 50, 1469–1481. [Google Scholar] [CrossRef]
16. Orlitsky, A.; Santhanam, N.P.; Viswanathan, K.; Zhang, J. Limit Results on Pattern Entropy of Stationary Processes. In Proceedings of the 2004 IEEE Information Theory workshop; IEEE: San Antonio, TX, USA, 2004; pp. 2954–2964. [Google Scholar]
17. Gemelos, G.; Weissman, T. On the entropy rate of pattern processes; Technical report hpl-2004-159; HP Laboratories Palo Alto: San Antonio, TX, USA, 2004. [Google Scholar]
18. Shamir, G.I. University of Utah, Electrical and Computer Engineering. Private communication, 2006.
19. Davisson, L.D. Universal noiseless coding. IEEE Trans. Inform. Theory 1973, IT-19, 783–795. [Google Scholar] [CrossRef]
20. Dixmier, J.; Nicolas, J.L. Partitions sans petits sommants. In A Tribute to Paul Erdös; Cambridge University Press: New York, NY, USA, 1990; Chapter 8; pp. 121–152. [Google Scholar]
21. Szekeres, G. An asymptotic formula in the theory of partitions. Quart. J. Math. Oxford 1951, 2, 85–108. [Google Scholar] [CrossRef]
22. Massart, P. Ecole d’Eté de Probabilité de Saint-Flour XXXIII; LNM; Springer-Verlag: London, UK, 2003; Chapter 2. [Google Scholar]
