
We show that the maximin average redundancy in pattern coding is eventually larger than $1.84\,(n/\log n)^{1/3}$ for messages of length $n$. This improves recent results on pattern redundancy, although it does not fill the gap between known lower and upper bounds. The pattern of a string is obtained by replacing each symbol by the index of its first occurrence. The problem of pattern coding is of interest because strongly universal codes have been proved to exist for patterns, while universal message coding is impossible for memoryless sources on an infinite alphabet. The proof uses fine combinatorial results on partitions with small summands.


Universal Coding
Let $P$ be a stationary source on an alphabet $A$, both known by the coder and the decoder. Let $X = (X_n)_{n \in \mathbb{N}}$ be a random process with distribution $P$. For a positive integer $n$, we denote by $X_1^n$ the vector of the first $n$ components of $X$ and by $P^n$ the distribution of $X_1^n$ on $A^n$. We denote the logarithm with base 2 by $\log$ and the natural logarithm by $\ln$. Shannon's classical bound [1] states that the average bit length of codewords for any coding function is lower-bounded by the $n$-th order entropy $H(X_1^n) = E[-\log P^n(X_1^n)]$; moreover, this code length can be nearly approached, see [2]. One important idea in the proof of this result is the following: every code on the strings of length $n$ is associated with a coding distribution $q_n$ on $A^n$ in such a way that the code length for $x$ is $-\log q_n(x)$; reciprocally, any distribution $q_n$ on $A^n$ can be associated with a coding function whose code length is approximately $-\log q_n(x)$. When $P$ is ergodic, its entropy rate $H(X) = \lim_{n\to\infty} \frac{1}{n} H(X_1^n)$ exists. It is a tight lower bound on the number of bits required per character.
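The correspondence between codes and coding distributions can be checked on a toy example. The following sketch (illustrative, not from the paper) assigns each symbol the Shannon code length $\lceil -\log_2 q(x) \rceil$, verifies Kraft's inequality (which guarantees that a prefix code with these lengths exists), and compares the expected length to the entropy:

```python
import math

def shannon_code_lengths(q):
    """Code lengths ceil(-log2 q(x)) for a coding distribution q."""
    return {x: math.ceil(-math.log2(p)) for x, p in q.items() if p > 0}

def kraft_sum(lengths):
    """Sum of 2^{-l(x)}; a value <= 1 guarantees a prefix code with these lengths."""
    return sum(2.0 ** -l for l in lengths.values())

def expected_length(q, lengths):
    return sum(p * lengths[x] for x, p in q.items() if p > 0)

def entropy(q):
    return -sum(p * math.log2(p) for p in q.values() if p > 0)

# Toy dyadic source on a 4-letter alphabet (assumed example)
q = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
L = shannon_code_lengths(q)
assert kraft_sum(L) <= 1.0                                  # a prefix code exists
assert entropy(q) <= expected_length(q, L) < entropy(q) + 1  # Shannon's bound
```

Since the toy distribution is dyadic, the expected code length here equals the entropy exactly (1.75 bits); for general distributions the overhead is strictly between 0 and 1 bit.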
If $P$ is only known to be an element $P_\theta$ of some class $\mathcal{C} = \{P_\theta : \theta \in \Theta\}$, universal coding consists in finding a single code, or equivalently a single sequence of coding distributions $(q_n)_n$, approaching the entropy rate for all sources $P_\theta \in \mathcal{C}$ at the same time. Such versatility has a price: for any given source $P_\theta$, there is an additional cost called the (expected) redundancy $R(q_n, \theta)$ of the coding distribution $q_n$, defined as the difference between the expected code length $E_\theta[-\log q_n(X_1^n)]$ and the $n$-th order entropy $H(X_1^n)$. Two criteria measure the universality of $q_n$:

• First, a deterministic approach judges the performance of $q_n$ in the worst case by the maximal redundancy $R^+(q_n, \Theta) = \sup_{\theta\in\Theta} R(q_n, \theta)$. The lowest achievable maximal redundancy is called the minimax redundancy: $$R^+(n, \Theta) = \inf_{q_n} \sup_{\theta\in\Theta} R(q_n, \theta).$$

• Second, a Bayesian approach consists in providing $\Theta$ with a prior distribution $\pi$, and then considering the expected redundancy $E_\pi[R(q_n, \theta)]$ (the expectation here is taken over $\theta$). Let $q_n^\pi$ be the coding distribution minimizing this criterion; the lowest achievable average redundancy, maximized over priors, is the maximin redundancy $$R^-(n, \Theta) = \sup_\pi \inf_{q_n} E_\pi[R(q_n, \theta)].$$

A classical minimax theorem (see [3]) states that mild hypotheses are sufficient to ensure that $R^-(n, \Theta) = R^+(n, \Theta)$. Class $\mathcal{C}$ is said to be strongly universal if $R^+(n, \Theta) = o(n)$: then universal coding is possible uniformly on $\mathcal{C}$. An important result by Rissanen [4] asserts that if the parameter set $\Theta$ is $k$-dimensional, and if there exists a $\sqrt{n}$-consistent estimator for $\theta$, then $$R^+(n, \Theta) \sim \frac{k}{2} \log n. \qquad (1)$$ This well-known bound has many applications in information theory, often related to the Minimum Description Length principle. It is remembered as a "rule of thumb": redundancy is $\frac{1}{2}\log n$ for each parameter of the model. This result actually covers a large variety of cases, among others: memoryless processes, Markov chains, context tree sources, and hidden Markov chains. However, further generalizations have been investigated. Shields (see [5]) proved that no coder can achieve a non-trivial redundancy rate on all stationary ergodic processes. Csiszár and Shields [6] gave an example of a non-parametric class of intermediate complexity, the renewal processes, for which $R^-(n, \Theta)$ and $R^+(n, \Theta)$ are both of order $O(\sqrt{n})$. If alphabet $A$ is not known, or if its size is not insignificant compared to $n$, Rissanen's bound (1) is uninformative. If the alphabet $A$ is infinite, Kieffer [7] showed that no universal coding is possible, even for the class of memoryless processes.

Dictionary and Pattern
Those negative results prompted the idea of coding separately the structure of string $x$ and the symbols present in $x$. It was first introduced by Åberg in [8] as a solution to the multi-alphabet coding problem, where the message $x$ contains only a small subset of the known alphabet $A$. It was further studied and motivated in a series of articles by Shamir [9]–[12] and by Jevtić, Orlitsky, Santhanam and Zhang [13]–[16] for practical applications: the alphabet is unknown and has to be transmitted separately anyway (for instance, the transmission of a text in an unknown language), or the alphabet is very large in comparison to the message (consider the case of images with $k = 2^{24}$ colors, or of texts when taking words as the alphabet units).
To explain the notion of pattern, let us take the example of [9]: string $x$ = "abracadabra" is made of $n = 11$ characters. The information it conveys can be separated into two blocks:

• a dictionary $\Delta = \Delta(x)$, defined as the sequence of distinct characters present in $x$ in order of appearance; in the example, $\Delta = (a, b, r, c, d)$;

• a pattern $\psi = \psi(x)$, defined as the sequence of positive integers pointing to the index of each letter in $\Delta$; here, $\psi = 12314151231$.
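The decomposition above can be computed in one pass over the string; a minimal Python sketch (illustrative helper, not from the paper):

```python
def dictionary_and_pattern(x):
    """Split a string into its dictionary (distinct symbols in order of
    first appearance) and its pattern (1-based first-occurrence indices)."""
    delta, index = [], {}
    for ch in x:
        if ch not in index:
            delta.append(ch)
            index[ch] = len(delta)  # index of first occurrence, starting at 1
    psi = [index[ch] for ch in x]
    return delta, psi

delta, psi = dictionary_and_pattern("abracadabra")
# delta == ['a', 'b', 'r', 'c', 'd']
# psi   == [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Together, $\Delta$ and $\psi$ determine $x$ exactly, so coding the pair loses no information.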

Pattern Coding
Any process $X$ from a source $P_\theta$ induces a pattern process $\Psi = (\Psi_n)_{n\in\mathbb{N}}$, with marginal distributions on the set $\mathcal{P}_n$ of patterns of length $n$ defined by $$P_\theta\left(\Psi_1^n = \psi\right) = P_\theta\left(\psi(X_1^n) = \psi\right).$$ For stationary ergodic $P_\theta$, Orlitsky et al. [16] prove that the pattern entropy rate $\lim_{n\to\infty} \frac{1}{n} H(\Psi_1^n)$ exists and is equal to $H(X)$ (whether this quantity is finite or not). This result was independently discovered by Gemelos and Weissman [17].
In the sequel, we shall consider only the case of memoryless sources $P_\theta$, with marginal distribution $p_\theta$ on a (possibly infinite) alphabet $A$. Hence, $\Theta$ will be the set parameterizing all probability distributions on $A$.
Obviously, the pattern process they induce is not memoryless. But as patterns convey less information than the initial strings, coding them seems to be an easier task. The expected pattern redundancy of a coding distribution $q_n$ on $\mathcal{P}_n$ can be defined by analogy as the difference between the expected code length under distribution $P_\theta$ and the $n$-th block pattern entropy: $$R_\Psi(q_n, \theta) = E_\theta\left[-\log q_n(\Psi_1^n)\right] - H(\Psi_1^n).$$ As the alphabet is unknown, the maximal pattern redundancy $R^+_\Psi(q_n, \Theta)$ must be defined as the supremum of $R_\Psi(q_n, \theta)$ over all alphabets $A$ and all memoryless distributions on $A$. Of course, the minimax pattern redundancy $R^+_\Psi(n, \Theta)$ is defined as the infimum of $R^+_\Psi(q_n, \Theta)$ over $q_n$. Similarly, the maximin pattern redundancy $R^-_\Psi(n, \Theta)$ is defined as the supremum, with respect to all possible alphabets $A$ and all prior distributions $\pi$, of the lowest achievable average redundancy, that is: $$R^-_\Psi(n, \Theta) = \sup_{A,\, \pi} \inf_{q_n} E_\pi\left[R_\Psi(q_n, \theta)\right].$$

Theorem
There is still uncertainty on the true order of magnitude of $R^-_\Psi(n, \Theta)$ and $R^+_\Psi(n, \Theta)$. However, Orlitsky et al. in [15] and Shamir in [11] proved that for some constants $c_1$ and $c_2$ it holds that $$c_1\, n^{1/3} \;\leq\; R^-_\Psi(n, \Theta) \;\leq\; R^+_\Psi(n, \Theta) \;\leq\; c_2\, \sqrt{n}.$$ There is hence a gap between the upper and lower bounds. This gap has been reduced in an article by Shamir [10], where the upper bound is improved to $O(n^{2/5})$. The following theorem contributes to the evaluation of $R^-_\Psi(n, \Theta)$ by providing a slightly better and more explicit lower bound, the proof of which is particularly elegant.
Theorem 1. For all integers $n$ large enough, the maximin pattern redundancy is lower-bounded as: $$R^-_\Psi(n, \Theta) \;\geq\; 1.84 \left(\frac{n}{\log n}\right)^{1/3}.$$ Gil Shamir [18] suggests that a bound of similar order can be obtained by properly updating (B12) in [11]. The proof provided in this paper was elaborated independently; both of them use the channel capacity inequality described in Section 3. However, it is interesting to note that they rely on different ideas (unordered partitions of integers and Bernstein's inequality here, sphere-packing arguments or inhomogeneous grids there). An important difference appears in the treatment of the quantization, see Equation (2). [11] provides fine relations between the minimax average redundancy and the alphabet size. The approach presented here does not discriminate between alphabet sizes; in a short and elegant proof, it leads to a slightly better bound for infinite alphabets.

Proof
We use here a standard technique for lower bounds (see [19]): the $n$-th order maximin redundancy is bounded from below by (and asymptotically equivalent to) the capacity of the channel joining an input variable $W$ with distribution $\pi$ on $\Theta$ to the output variable $\Psi_1^n$ with conditional probabilities $P_\theta(\Psi_1^n)$. Let $H(\Psi_1^n \mid W)$ be the conditional entropy of $\Psi_1^n$ given $W$, and let $I(\Psi_1^n; W) = H(\Psi_1^n) - H(\Psi_1^n \mid W)$ denote the mutual information of these two random variables, see [2]. Then from [19] and [4] we know that the inequality $$R^-_\Psi(n, \Theta) \;\geq\; I(\Psi_1^n; W)$$ holds for all alphabets $A$ and all prior distributions $\pi$ on the set of memoryless distributions on $A$: it is sufficient to give a lower bound for the mutual information $I(\Psi_1^n; W)$ between parameter $W$ and observation $\Psi_1^n$. In words, $R^-_\Psi(n, \Theta)$ is larger than the logarithm of the number of memoryless sources that can be distinguished from one observation of $\Psi_1^n$. Given the positive integer $n$, let $c = c_n$ be an integer growing with $n$ to infinity in a way defined later, let $\lambda$ be a positive constant to be specified later, let $d = \lambda\sqrt{c}$ and let $A = \{1, \ldots, c\}$. We denote by $\Theta_{c,d}$ the set of all unordered partitions of $c$ made of summands at most equal to $d$: $$\Theta_{c,d} = \left\{ \theta = (\theta_j)_{j\geq 1} : \theta_1 \geq \theta_2 \geq \cdots \geq 0,\ \sum_{j\geq 1} \theta_j = c,\ \theta_1 \leq d \right\},$$ so that $\Theta_c = \Theta_{c,c}$ is the set of all unordered partitions of $c$. Let also $\Phi_{c,d}$ be the subset of $\Phi_c$ containing the profiles of all patterns $\psi \in \mathcal{P}_c$ whose symbols appear at most $d$ times (the profile of a pattern being the unordered multiset of its symbol frequencies). There is a one-to-one mapping $\chi_c$ between $\Theta_c$ and $\Phi_c$, which maps each partition to the profile whose frequencies are the summands of the partition.
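The sets $\Theta_{c,d}$ can be enumerated explicitly for small $c$; a brute-force sketch (hypothetical helper, not from the paper), which can also be used to observe numerically how restricting the summands to $d$ affects the cardinality:

```python
def partitions(c, d):
    """All unordered partitions of c with summands at most d,
    listed as nonincreasing tuples (a brute-force view of Theta_{c,d})."""
    if c == 0:
        return [()]
    out = []
    # Choose the largest summand first, then partition the remainder
    # with summands no larger than it.
    for first in range(min(c, d), 0, -1):
        for rest in partitions(c - first, first):
            out.append((first,) + rest)
    return out

assert len(partitions(5, 5)) == 7   # all 7 partitions of 5
assert len(partitions(5, 2)) == 3   # (2,2,1), (2,1,1,1), (1,1,1,1,1)
```

This exhaustive recursion is exponential in $c$ and only meant to make the definition concrete; the proof relies on the asymptotics of $|\Theta_{c,d}|$, not on enumeration.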
In [20], citing [21], Dixmier and Nicolas show the existence of an increasing function $f$ such that $$\ln\left|\Theta_{c,\lambda\sqrt{c}}\right| \sim f(\lambda)\,\sqrt{c} \quad \text{as } c \to \infty.$$ Numerous properties of function $f$, and numerical values, are given in [20]; notably, $f$ is an infinitely differentiable, concave function which satisfies $f(\lambda) \to \pi\sqrt{2/3}$ as $\lambda \to \infty$. For $\theta \in \Theta_{c,d}$, let $p_\theta$ be the distribution on $A$ defined by $p_\theta(i) = \frac{\theta_i}{c}$, and let $P_\theta$ be the memoryless process with marginal distribution $p_\theta$. Let $W$ be a random variable with uniform distribution on the set $\Theta_{c,d}$. Let $X = (X_n)_{n\in\mathbb{N}^+}$ be a random process such that, conditionally on the event $\{W = \theta\}$, the distribution of $X$ is $P_\theta$, and let $\Psi = (\Psi_n)_{n\in\mathbb{N}^+}$ be the induced pattern process.
We want to bound $I(\Psi_1^n; W) = H(W) - H(W \mid \Psi_1^n)$ from below; as $H(W) = \log|\Theta_{c,d}|$, we need to find an upper bound for $H(W \mid \Psi_1^n)$. The idea of the proof is the following. From Fano's inequality, upper-bounding $H(W \mid \Psi_1^n)$ reduces to finding a good estimator $\hat\theta$ for $W$: conditionally on $W = \theta$, string $X_1^n$ is a memoryless process with distribution $P_\theta$, and we aim at recovering parameter $\theta$ from its pattern $\Psi_1^n$. Each parameter $\theta = (\theta_j)_{j\geq 1}$ is here an unordered partition of the integer $c$ with small summands. Let $T_j$ be the number of occurrences of the $j$-th most frequent symbol in $\psi$. Then $T = (T_j)_{j\geq 1}$ constitutes a random unordered partition of $n$. We show that by "shrinking" $T$ by a factor $c/n$ we build an unordered partition $\hat\theta$ of $c$ that is equal to parameter $\theta$ with high probability, see Figure 1. Note that only partitions with small summands are considered: this allows a better uniform control on the probabilities of deviation of each symbol's frequency, while the cardinality of $\Theta_{c,d}$ remains of the same (logarithmic) order as that of $\Theta_c$. Parameters $c$ and $d$ are chosen in order to optimize the rate in Theorem 1, while the value of $\lambda = d/\sqrt{c}$ is chosen at the end to maximize the constant.
Figure 1. The profile of pattern $\psi$ forms a partition of $n$ that can be "shrunk" to recover $\theta$, the parameter partition of $c$, with high probability.
Let us now give the details of the proof. If $W = \theta$ and if we observe string $X_1^n = x$ having pattern $\Psi_1^n = \psi \in \mathcal{P}_n$, we construct an estimator $\hat\theta = (\hat\theta_j)_{j\geq 1}$ of $\theta$ in the following way: let $\varphi(\psi)$ be the profile of $\psi$, and let $T = (T_j)_{j\geq 1} = \chi_n^{-1}(\varphi(\psi))$ be the corresponding partition of $n$. For $j \leq c$, let $$\hat\theta_j = \left[\frac{c\,T_j}{n}\right],$$ where $[x]$ denotes the nearest integer to $x$. Observe that as alphabet $A$ contains only $c$ different symbols, for all $j > c$ we have $T_j = \hat\theta_j = \theta_j = 0$.
The distribution of $T$ is difficult to study, but it is closely related to much simpler random variables. For $1 \leq i \leq n$ and $j \geq 1$, let $U_j^i = \mathbb{1}\{X_i = j\}$; as $U_j^i$ has a Bernoulli distribution with parameter $\frac{\theta_j}{c}$, and as process $X$ is memoryless, we observe that $U_j = \sum_{i=1}^n U_j^i$, the number of occurrences of symbol $j$ in $x$, has a binomial distribution $\mathcal{B}\left(n, \frac{\theta_j}{c}\right)$. Letting $\tilde\theta_j = \left[\frac{c\,U_j}{n}\right]$, $\tilde\theta$ would be an estimator of $\theta$ if we had access to $x$, but here estimators may only be constructed from $\psi$. However, there is a strong connection between $\hat\theta$ and $\tilde\theta$: the symbols in $x$ are in one-to-one correspondence with the symbols in $\psi$. Hence, $T$ is just the order statistics of $U$: $T_j = U_{(j)}$ and thus $\hat\theta_j = \tilde\theta_{(j)}$.
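The shrinking estimator can be simulated numerically. The sketch below (assumed toy parameters, not from the paper) samples a memoryless string from $p_\theta$, sorts the symbol counts to obtain $T$, and rounds $cT_j/n$; since the symbol counts of $x$ and of its pattern coincide, working on $x$ directly is equivalent:

```python
import random

def shrink_estimator(x, c, n):
    """Estimate the parameter partition theta from the symbol frequencies of x:
    T = counts sorted in decreasing order, theta_hat_j = nearest integer to c*T_j/n."""
    counts = {}
    for ch in x:
        counts[ch] = counts.get(ch, 0) + 1
    T = sorted(counts.values(), reverse=True)  # order statistics U_(j)
    return tuple(round(c * t / n) for t in T)

random.seed(0)
theta = (3, 3, 2, 1, 1)                 # a partition of c = 10 with small summands
c, n = sum(theta), 100_000
# Symbol j has probability theta_j / c under p_theta:
symbols = [j for j, t in enumerate(theta) for _ in range(t)]
x = random.choices(symbols, k=n)        # memoryless sample from p_theta
assert shrink_estimator(x, c, n) == theta   # recovery holds with high probability
```

With $n$ this large relative to $c$, each frequency $U_j/n$ concentrates well within $1/(2c)$ of $\theta_j/c$, so the rounding recovers $\theta$ almost surely in practice.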

Now, if $$\left|\frac{c\,U_j}{n} - \theta_j\right| < \frac{1}{2},$$ then $\tilde\theta_j = \theta_j$. Thus, if for all $j$ in the set $\{1, \ldots, c\}$ it holds that $\left|\frac{c U_j}{n} - \theta_j\right| < \frac{1}{2}$, then $\tilde\theta = \theta$ and $\tilde\theta$, as an ordered sequence, is equal to its own order statistics $\hat\theta$. It follows that $$P\left(\hat\theta \neq \theta\right) \leq P\left(\exists j \in \{1, \ldots, c\} : \left|\frac{c U_j}{n} - \theta_j\right| \geq \frac{1}{2}\right)$$ and hence, using the union bound: $$P\left(\hat\theta \neq \theta\right) \leq \sum_{j=1}^{c} P\left(\left|\frac{c U_j}{n} - \theta_j\right| \geq \frac{1}{2}\right).$$ We chose the parameter set $\Theta_{c,d}$ so that all summands in partition $\theta$ are small with respect to $c$. Consequently, the variance of the $(U_j^i)_{i,j}$ is uniformly bounded: $$\mathrm{Var}\left[U_j^i\right] = \frac{\theta_j}{c}\left(1 - \frac{\theta_j}{c}\right) \leq \frac{d}{c} = \frac{\lambda}{\sqrt{c}}.$$ Recall the following Bernstein inequality [22]: if $Y_1, \ldots, Y_n$ are independent random variables such that $Y_i$ takes its values in $[-b, b]$ and such that $\mathrm{Var}[Y_i] \leq v$, and if $S = Y_1 + \cdots + Y_n$, then for any positive $x$ it holds that: $$P\left(S - E[S] \geq x\right) \leq \exp\left(-\frac{x^2/2}{n v + b x/3}\right).$$
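As a sanity check, Bernstein's inequality can be compared against a Monte Carlo estimate of the tail of a centered binomial sum (illustrative parameters, not from the paper):

```python
import math
import random

def bernstein_bound(n, v, b, x):
    """Bernstein upper bound on P(S - E[S] >= x) for a sum of n independent
    variables, each bounded by b with variance at most v."""
    return math.exp(-(x * x / 2) / (n * v + b * x / 3))

random.seed(1)
n, p = 1000, 0.1
b, v = 1.0, p * (1 - p)     # Y_i = 1{X_i = 1} - p lies in [-b, b]
x = 50.0                    # deviation threshold (>5 standard deviations here)
trials = 20_000
hits = sum(
    sum(random.random() < p for _ in range(n)) - n * p >= x
    for _ in range(trials)
)
# The empirical tail frequency never exceeds the Bernstein bound:
assert hits / trials <= bernstein_bound(n, v, b, x)
```

For these parameters the bound is about $8 \times 10^{-6}$, far below the Gaussian scale of the sum, which is exactly the regime exploited in the proof: small per-symbol variances $\lambda/\sqrt{c}$ make the union bound over $c$ terms manageable.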