A Lower-Bound for the Maximin Redundancy in Pattern Coding

Garivier, Aurélien

doi:10.3390/e11040634

Open AccessArticle

A Lower-Bound for the Maximin Redundancy in Pattern Coding

by

Aurélien Garivier

CNRS, Telecom ParisTech, Laboratoire Laboratoire Traitement et Communication de l’Information, 75013 Paris, France

Entropy 2009, 11(4), 634-642; https://doi.org/10.3390/e11040634

Submission received: 1 September 2009 / Accepted: 20 October 2009 / Published: 22 October 2009

(This article belongs to the Special Issue Information and Entropy)

Download

Browse Figure

Versions Notes

Abstract

:

We show that the maximin average redundancy in pattern coding is eventually larger than

1.84 {(\frac{n}{log n})}^{1 / 3}

for messages of length n. This improves recent results on pattern redundancy, although it does not fill the gap between known lower- and upper-bounds. The pattern of a string is obtained by replacing each symbol by the index of its first occurrence. The problem of pattern coding is of interest because strongly universal codes have been proved to exist for patterns while universal message coding is impossible for memoryless sources on an infinite alphabet. The proof uses fine combinatorial results on partitions with small summands.

Keywords:

universal coding; pattern; minimax

1. Introduction

1.1. Universal Coding

Let

P

be a stationary source on an alphabet A, both known by the coder and the decoder. Let

X = {(X_{n})}_{n \in N}

be a random process with distribution

P

. For a positive integer n, we denote by

X_{1}^{n}

the vector of the n first components of X and by

P^{n}

the distribution of

X_{1}^{n}

on

A^{n}

. We denote the logarithm with base 2 by log and the natural logarithm by ln. Shannon’s classical bound [1] states the average bit length of codewords for any coding function is lower-bounded by the n-th order entropy

H (X_{1}^{n}) = E [- log P^{n} (X_{1}^{n})]

; moreover, this codelength can be nearly approached, see [2]. One important idea in the proof of this result is the following: every code on the strings of length n is associated with a coding distribution

q_{n}

on

A^{n}

in such a way that the code length for x is

- log q_{n} (x)

, and reciprocally any distribution

q_{n}

on

A^{n}

can be associated with a coding function whose code length is approximately

- log q_{n} (x)

. When

P

is ergodic, its entropy rate

H (X) = {lim}_{n \to \infty} \frac{1}{n} H (X_{1}^{n})

exists. It is a tight lower bound on the number of bits required per character.

If

P

is only known to be an element

P_{θ}

of some class

C = \{P_{θ} : θ \in Θ\}

, universal coding consists in finding a single code, or equivalently a single sequence of coding distributions

{(q_{n})}_{n}

, approaching the entropy rate for all sources

P_{θ} \in C

at the same time. Such versatility has a price: for any given source

P_{θ}

, there is an additional cost called the (expected) redundancy

R (q_{n}, θ)

of the coding distribution

q_{n}

that is defined as the difference between the expected code length

E_{θ} [- log q_{n} (X_{1}^{n})]

and the n-th order entropy

H (X_{1}^{n})

. Two criteria measure the universality of

q_{n}

:

First, a deterministic approach judges the performance of $q_{n}$ in the worst case by the maximal redundancy $R^{+} (q_{n}, Θ) = {sup}_{θ \in Θ} R (q_{n}, θ)$ The lowest achievable maximal redundancy is called minimax redundancy:

$R^{+} (n, Θ) = min_{q_{n}} max_{θ} R (q_{n}, θ)$
Second, a Bayesian approach consists in providing Θ with a prior distribution π, and then considering the expected redundancy $E_{π} [R (q_{n}, θ)]$ (the expectation is here taken over θ). Let $q_{n}^{π}$ be the coding distribution minimizing $E_{π} [R (q_{n}, θ)]$ The maximin redundancy $R^{-} (n, Θ)$ of class $C$ is the supremum of all $E_{π} [R (q_{n}^{π}, θ)]$ over all possible prior distributions π:

$R^{-} (n, Θ) = max_{π} min_{q_{n}} E_{π} [R (q_{n}, θ)]$

A classical minimax theorem (see [3]) states that mild hypotheses are sufficient to ensure that

R^{-} (n, Θ) = R^{+} (n, Θ)

. Class

C

is said to be strongly universal if

R^{+} (n, Θ) = o (n)

: then universal coding is possible uniformly on

C

. An important result by Rissanen [4] asserts that if the parameter set Θ is k-dimensional, and if there exists a

\sqrt{n} -

consistent estimator for θ, then

R^{-} (n, Θ) = R^{+} (n, Θ) = \frac{k}{2} log n + O (1)

(1)

This well-known bound has many applications in information theory, often related to the Minimum Description Length Principle. It is remembered as a “rule of thumb” that redundancy is

1 / 2 log n

for each parameter of the model. This result actually covers a large variety of cases, among others: memoryless processes, Markov chains, Context tree sources, hidden Markov chains. However, further generalization have been investigated. Shields (see [5]) proved that no coder can achieve a non-trivial redundancy rate on all stationary ergodic processes. Csiszár and Shields [6] gave an example of a non-parametric, intermediate complexity class, known as renewal processes, for which

R^{-} (n, Θ)

and

R^{+} (n, Θ)

are both of order

O (\sqrt{n})

. If alphabet A is not known, or if its size is not insignificant compared to n, Rissanen’s bound (1) is uninformative. If the alphabet A is infinite, Kieffer [7] showed that no universal coding is possible even for the class of memoryless processes.

1.2. Dictionary and Pattern

Those negative results prompted the idea of coding separately the structure of string x and the symbols present in x. It was first introduced by Åberg in [8] as a solution to the multi-alphabet coding problem, where the message x contains only a small subset of the known alphabet A. It was further studied and motivated in a series of articles by Shamir [9,10,11,12] and by Jevtić, Orlitsky, Santhanam and Zhang [13,14,15,16] for practical applications: the alphabet is unknown and has to be transmitted separately anyway (for instance, transmission of a text in an unknown language), or the alphabet is very large in comparison to the message (consider the case of images with

k = 2^{24}

colors, or texts when taking words as the alphabet units).

To explain the notion of pattern, let us take the example of [9]: string

x =

“abracadabra” is made of

n = 11

characters. The information it conveys can be separated in two blocks:

a dictionary $Δ = Δ (x)$ defined as the sequence of different characters present in x in order of appearance; in the example $Δ = (a, b, r, c, d)$ .
a pattern $ψ = ψ (x)$ defined as the sequence of positive integers pointing to the indices of each letter in Δ; here, $ψ = 12314151231$ .

Let

P^{n}

be the set of all possible patterns of n-strings. For instance,

P^{1} = {1}

,

P^{2} = {11, 12}

,

P^{3} = {111, 112, 121, 122, 123}

. Using the same notations as in [15], we call multiplicity

μ_{j} (ψ)

of symbol j in pattern

ψ \in P^{n}

the number of occurrences of j in ψ; the multiplicity of pattern ψ is the vector made of all symbol’s multiplicities:

μ (ψ) = (μ_{j} (ψ)) 1 ⩽ j ⩽ n

—in the former example,

μ = (5, 2, 2, 1, 1, 0, \dots)

. Note that

\sum_{j = 1}^{n} μ_{j} = n

. Moreover, the profile

ϕ = {(ϕ_{μ})}_{μ ⩾ 1}

of pattern ψ provides, for every multiplicity μ, its frequency in

μ (ψ)

. It can be formally defined as the multiplicity of ψ’s multiplicity:

μ (μ (ψ))

. The profile of string “abracadabra” is

(2, 2, 0, 0, 1, 0, \dots)

as two symbols (c and d) appear once, two symbols (b and r) appear twice and one symbol (a) appears five times. We denote by

Φ^{n}

the set of possible profiles for patterns of length n, so that

Φ^{1} = {(1)}

,

Φ^{2} = {(2, 0), (0, 1)}

,

Φ^{3} = {(3, 0, 0), (1, 1, 0), (0, 0, 1)}

. Note that

\sum_{μ = 1}^{n} μ ϕ_{μ} = n

. As explained in [15], there is one-to-one mapping between

Φ^{n}

and the set of unordered partitions of integer n. In Section 3, this point will be used and specified.

1.3. Pattern Coding

Any process X from a source

P_{θ}

induces a pattern process

Ψ = {(Ψ_{n})}_{n \in N}

with marginal distributions on

P^{n}

defined by

P_{θ} (Ψ_{1}^{n} = ψ) = \sum_{ψ (x) = ψ} P_{θ} (X_{1}^{n} = x)

. Thus, we can define a n-th block pattern entropy

H (Ψ_{1}^{n}) = E_{θ} [- log P_{θ} (Ψ_{1}^{n})]

. For stationary ergodic

P_{θ}

, Orlitsky & al. [16] prove that the pattern entropy rate

H (Ψ) = {lim}_{n \to \infty} \frac{1}{n} H (Ψ_{1}^{n})

exists and is equal to

H (X)

(whether this quantity is finite or not). This result was independently discovered by Gemelos and Weissman [17].

In the sequel, we shall consider only the case of memoryless sources

P_{θ}

, with marginal distributions

p_{θ}

on a (possibly infinite) alphabet

A

. Hence, Θ will be the set parameterizing all probability distributions on

A

.

Obviously, the process they induce on

{(P^{n})}_{n \in N}

is not memoryless. But as patterns convey less information than the initial strings, coding them seems to be an easier task. The expected pattern redundancy of a coding distribution

q_{n}

on

P^{n}

can be defined by analogy as the difference between the expected code length under distribution

P_{θ}

and the n-th block pattern entropy:

\begin{matrix} R_{Ψ} (q_{n}, θ) & = & E_{θ} [- log q_{n} (Ψ_{1}^{n})] - H (Ψ_{1}^{n}) \\ = & \sum_{ψ \in Ψ^{n}} P_{θ} (ψ) log \frac{P_{θ} (ψ)}{q_{n} (ψ)} \end{matrix}

As the alphabet is unknown, the maximal pattern redundancy

R_{Ψ}^{+} (q_{n}, Θ)

must be defined as the maximum of

R_{Ψ}^{+} (q_{n}, θ)

over all alphabets A and all memoryless distributions on A. Of course, the minimax pattern redundancy

R_{Ψ}^{+} (n, Θ)

is defined as the lower-bound of

R_{Ψ}^{+} (q_{n}, Θ)

in

q_{n}

. Similarly, the maximin pattern redundancy

R_{Ψ}^{-} (n, Θ)

is defined as the supremum with respect to all possible alphabets A and all prior distributions π of the lowest achievable average redundancy, that is:

R_{Ψ}^{-} (n, Θ) = sup_{A, π} inf_{q_{n}} E_{π} [R_{Ψ} (q_{n}, θ)]

2. Theorem

There is still uncertainty on the true order of magnitude of

R_{Ψ}^{-} (n, Θ)

and

R_{Ψ}^{+} (n, Θ)

. However, Orlistky & et al. in [15] and Shamir in [11] proved that for some constants

c_{1}

and

c_{2}

it holds that

c_{1} n^{1 / 3 - ϵ} ⩽ R_{Ψ}^{-} (n, Θ) ⩽ R_{Ψ}^{+} (n, Θ) ⩽ c_{2} \sqrt{n}

. There is hence a gap between upper- and lower-bounds. This gap has been reduced in an article by Shamir [10] where the upper-bound is improved to

O (n^{2 / 5})

. The following theorem contributes to the evaluation of

R_{Ψ}^{-} (n, Θ)

, by providing a slightly better and more explicit lower-bound, the proof of which is particularly elegant.

Theorem 1 For all integers n large enough, the maximin pattern redundancy is lower-bounded as:

R_{Ψ}^{-} (n, Θ) ⩾ 1.84 {(\frac{n}{log n})}^{1 / 3}

Gil Shamir [18] suggests that a bound of similar order can be obtained by properly updating (B12) in [11]. The proof provided in this paper was elaborated independently; both of them use the channel capacity inequality described in Section 3. However, it is interesting to note that they rely on different ideas (unordered partitions of integers and Bernstein’s inequality here, sphere packing arguments or inhomogeneous grids there). An important difference appears in the treatment of the quantization, see Equation 2. [11] provides fine relations between the minimax average redundancy and the alphabet size. The approach presented here does not discriminate between alphabet sizes; in a short and elegant proof, it leads to a slightly better bound for infinite alphabets.

3. Proof

We use here standard technique for lower-bounds (see [19]): the n-th order maximin redundancy is bounded from below by (and asymptotically equivalent to) the capacity of the channel joining an input variable W with distribution π on Θ to the output variable

Ψ_{1}^{n}

with conditional probabilities

P_{θ} (Ψ_{1}^{n})

. Let

H (Ψ_{1}^{n} | W)

be the conditional entropy of

Ψ_{1}^{n}

given W, and let

I (Ψ_{1}^{n}; W) = H (Ψ_{1}^{n}) - H (Ψ_{1}^{n} | W)

denote the mutual information of these two random variables, see [2]. Then from [19] and [4] we know that inequality

R_{Ψ}^{-} (n, Θ) ⩾ I (Ψ_{1}^{n}; W)

holds for all alphabets

A

and all prior distributions π on the set of memoryless distributions on

A

: it is sufficient to give a lower-bound for the mutual information

I (Ψ_{1}^{n}; W)

between parameter W and observation Ψ. In words,

R_{Ψ}^{-} (n, Θ)

is larger than the logarithm of the number of memoryless sources that can be distinguished from one observation of

Ψ_{1}^{n}

.

Given the positive integer n, let

c = c_{n}

be an integer growing with n to infinity in a way defined later, let λ be a positive constant to be specified later, let

d = λ \sqrt{c}

and let

A = {1, \dots, c}

We denote by

Θ^{c, d}

the set of all unordered partitions of c made of summands at most equal to d:

Θ^{c, d} = \{θ = {(θ_{j})}_{j \in N^{+}} : d ⩾ θ_{1} ⩾ θ_{2} ⩾ \dots and \sum_{j = 1}^{\infty} θ_{j} = c\}

Then

Θ^{c} ≜ Θ^{c, c}

is the set of all unordered partitions of c. Let also

Φ^{c, d}

be the subset of

Φ^{c}

containing the profiles of all patterns

ψ \in P^{c}

whose symbols appear at most d times:

Φ^{c, d} = \{ϕ = (ϕ_{1}, \dots, ϕ_{d}) \in N^{d} : \sum_{μ = 1}^{d} μ ϕ_{μ} = c\}

There is a one-to-one mapping

χ_{c}

between

Θ^{c}

and

Φ^{c}

defined by

\{\begin{matrix} χ_{c} {(θ)}_{μ} & = & |{i : θ_{i} = μ}|; \\ χ_{c}^{- 1} {(ϕ)}_{j} & = & \{\begin{matrix} 0 & if \sum_{i = 1}^{d} ϕ_{i} < j, \\ max {μ : \sum_{i = μ}^{d} ϕ_{i} ⩾ j} & else . \end{matrix} \end{matrix}

It is immediately verified that

χ (Θ^{c, d}) = Φ^{c, d}

. In [20], citing [21], Dixmier and Nicolas show the existence of an increasing function

f : R^{+} \to [0, π \sqrt{\frac{2}{3}}[

such that

ln |Θ^{c, d}| = f (λ) \sqrt{c} (1 + o (1))

as

c \to \infty

, where

λ = d / \sqrt{c}

. Numerous properties of function f, and numerical values, are given in [20]; notably, f is an infinitely derivable and concave function which satisfies

f (λ) = - 2 λ log λ + 2 λ + O (λ^{3})

when

λ \to 0

and

f (λ) = π \sqrt{2 / 3} - \sqrt{6} / π exp (- π λ / \sqrt{6})

when

λ \to \infty

.

For

θ \in Θ^{c, d}

, let

p_{θ}

be the distribution on

A

defined by

p_{θ} (i) = \frac{θ_{i}}{c}

, and let

P_{θ}

be the memoryless process with marginal distribution

p_{θ}

. Let W be a random variable with uniform distribution on the set

Θ^{c, d}

. Let

X = {(X_{n})}_{n \in N^{+}}

be a random process such that conditionally on the event

{W = θ}

, then the distribution of X is

P_{θ}

, and let

Ψ = {(Ψ_{n})}_{n \in N^{+}}

be the induced pattern process.

We want to bound

I (Ψ_{1}^{n}; W) = H (W) - H (W | Ψ_{1}^{n})

from below. As

H (W) = log |Θ^{c, d}| = f (λ) log e \sqrt{c} (1 + o (1))

we need to find an upper-bound for

H (W | Ψ_{1}^{n})

. The idea of the proof is the following. >From Fano’s inequality, upper-bounding

H (W | Ψ_{1}^{n})

reduces to finding a good estimator

\hat{θ}

for W: conditionally on

W = θ

, string

X_{1}^{n}

is a memoryless process with distribution

P_{θ}

and we aim at recovering parameter θ from its pattern

Ψ_{1}^{n}

. Each parameter

θ = {(θ_{j})}_{j ⩾ 1}

is here an unordered partition with small summands of integer c. Let

T_{j}

be the number of occurrences of j-th most frequent symbol in ψ. Then

T = {(T_{j})}_{j ⩾ 1}

constitutes a random unordered partition of n. We show that by “shrinking” T by a factor

c / n

we build a unordered partition

\hat{θ}

of c that is equal to parameter θ with high probability, see Figure 1. Note that only partitions with small summands are considered: this allows to have a better uniform control on the probabilities of deviation of each symbol’s frequency, while the cardinality of

Θ^{c, d}

remains of same (logarithmic) order as that of

Θ^{c}

. Parameters c and d are chosen in order to optimize the rate in Theorem 1, while the value of

λ = d / \sqrt{c}

is chosen at the end to maximize the constant.

Figure 1. The profile of pattern ψ forms a partition of n that can be “shrunk” to θ, the parameter partition of c, with high probability.

Let us now give the details of the proof. If

W = θ

and if we observe string

X_{1}^{n} = x

having pattern

Ψ_{1}^{n} = ψ \in P^{n}

, we construct an estimator

\hat{θ} = {({\hat{θ}}_{j})}_{1 ⩽ j ⩽ c}

of θ in the following way: let

ϕ (ψ)

be the profile of ψ, and

T = {(T_{j})}_{j ⩾ 1} = χ_{n}^{- 1} (ϕ (ψ))

be the corresponding partition of n. For

j ⩾ c

, let

{\hat{θ}}_{j} = [\frac{T_{j} c}{n}]

, where

[x]

denotes the nearest integer of x. Observe that as alphabet

A

contains only c different symbols, for all

j > c

we have

T_{j} = {\hat{θ}}_{j} = θ_{j} = 0

.

The distribution of T is difficult to study, but is very related to much simpler random variables. For

1 ⩽ i ⩽ n

and

j ⩾ 1

, let

U_{j}^{i} = 1_{X_{i} = j}

; as

U_{j}^{i}

has a Bernoulli distribution with parameter

\frac{θ_{j}}{c}

, and as process X is memoryless, we observe that

U_{j} ≜ \sum_{i = 1}^{n} U_{j}^{i}

, the number of occurrences of symbol j in x, has a binomial distribution

B (n, \frac{θ_{j}}{c})

. Let

{\tilde{θ}}_{j} = [\frac{U_{j} c}{n}]

, and

\tilde{θ} = {({\tilde{θ}}_{j})}_{j ⩾ 1}

;

\tilde{θ}

would be an estimator of θ if we had access to x, but here estimators may only be constructed from ψ. However, there is a strong connection between

\hat{θ}

and

\tilde{θ}

: the symbols in x are in one-to-one correspondence with the symbols in ψ. Hence, T is just the order statistics of U:

T_{j} = U_{(j)}

and thus

{\hat{θ}}_{j} = {\tilde{θ}}_{(j)}

.

Now, if

|\frac{U_{j} c}{n} - θ_{j}| < \frac{1}{2}

then

{\tilde{θ}}_{j} = θ_{j}

. Thus, if for all j in the set

{1, \dots, c}

it holds that

|\frac{U_{j} c}{n} - θ_{j}| < \frac{1}{2}

, then

\tilde{θ} = θ

and

\tilde{θ}

, as an increasing sequence, is equal to its order statistics

\hat{θ}

. It follows that

⋂_{j = 1}^{c} \{|\frac{U_{j} c}{n} - θ_{j}| < \frac{1}{2}\} \subset \{\hat{θ} = θ\}

(2)

and hence, using the union bound:

P_{θ} (\hat{θ} \neq θ) ⩽ P_{θ} (⋃_{j = 1}^{c} \{|\frac{U_{j} c}{n} - θ_{j}| ⩾ \frac{1}{2}\}) ⩽ \sum_{j = 1}^{c} P_{θ} (|\frac{U_{j}}{n} - \frac{θ_{j}}{c}| ⩾ \frac{1}{2 c})

(3)

We chose parameter set

Θ^{c, d}

so that all summands in partition θ are small with respect to c. Consequently, the variance of the

{(U_{j}^{i})}_{i, j}

is uniformly bounded:

Var [U_{j}^{i}] = \frac{θ_{j}}{c} (1 - \frac{θ_{j}}{c}) ⩽ \frac{d}{c}

. Recall the following Bernstein inequality [22]: if

Y_{1}, \dots, Y_{n}

are independent random variables such that

Y_{i}

takes its values in

[- b, b]

and such that

Var [Y_{i}] ⩽ v

, and if

S = Y_{1} + \dots + Y_{n}

, then for any positive x it holds that:

P (S - E [S] ⩾ x) ⩽ exp (- \frac{x^{2} / 2}{n (v + x / 3)})

Using this inequality for the

{(U_{j}^{i})}_{1 ⩽ i ⩽ n}

, we obtain:

P_{θ} (|\frac{U_{j}}{n} - \frac{θ_{j}}{c}| ⩾ \frac{1}{2 c}) ⩽ 2 e^{- \frac{n / 4 c^{2}}{2 (d / c + 1 / 6 c)}} = 2 e^{- \frac{n}{8 c (d + 1 / 6)}}

Thus, we obtain from (3):

P (\hat{θ} \neq θ) = \frac{1}{|Θ^{c, d}|} \sum_{θ \in Θ^{c, d}} P_{θ} (\hat{θ} \neq θ) ⩽ 2 c e^{- \frac{n}{8 c (d + 1 / 6)}}

Now, using Fano’s inequality [2]:

\begin{matrix} H (W | Ψ_{1}^{n}) & ⩽ & H (W | \hat{θ}) \\ ⩽ & P (W \neq \hat{θ}) log |Θ^{c, d}| + log 2 \\ ⩽ & 2 c e^{- \frac{n}{8 λ c^{3 / 2}}} f (λ) \sqrt{c} (1 + o (1)) \end{matrix}

Hence,

\begin{matrix} R_{Ψ}^{-} (n, Θ) & ⩾ & I (Ψ_{1}^{n}; W) = H (W) - H (W | Ψ_{1}^{n}) \\ ⩾ & f (λ) log e \sqrt{c} (1 + o (1)) - 2 c e^{- \frac{n}{8 λ c^{3 / 2}}} f (λ) log e \sqrt{c} (1 - o (1)) \\ = & f (λ) log e \sqrt{c} (1 - 2 c e^{- \frac{n}{8 λ c^{3 / 2}}} - o (1)) \end{matrix}

By choosing

c = {(\frac{n}{\frac{16}{3} λ log n})}^{2 / 3}

we get:

\begin{matrix} R_{Ψ}^{-} (n, Θ) & ⩾ & f (λ) log e {(\frac{n}{\frac{16}{3} λ log n})}^{1 / 3} (1 - 2 {(\frac{n}{\frac{16}{3} λ log n})}^{2 / 3} e^{- \frac{2}{3} log n} - o (1)) \\ = & \frac{f (λ)}{λ^{1 / 3}} log e {(\frac{3 n}{16 log n})}^{1 / 3} (1 - o (1)) \end{matrix}

By looking at the table of f given at page 151 of [20], we see that function

λ \to f (λ) / λ^{1 / 3}

reaches its maximum around

λ = 0.8

; for that choice,

f (λ) \approx 2.07236

and we obtain:

R_{Ψ}^{-} (n, Θ) ⩾ 1.843 {(\frac{n}{log n})}^{1 / 3} (1 - o (1))

Acknowledgment

I would particularly like to thank professor Jean-Louis Nicolas from Institut Girard Desargues, who very kindly helped for the combinatorial arguments.

References

Shannon, C.E. A mathematical theory of communication. Bell System Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar]
Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons Inc.: New York, NY, USA, 1991. [Google Scholar]
Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inform. Theory 1997, 43, 1276–1280. [Google Scholar] [CrossRef]
Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory 1984, 30, 629–636. [Google Scholar] [CrossRef]
Shields, P.C. Universal redundancy rates do not exist. IEEE Trans. Inform. Theory 1993, 39, 520–524. [Google Scholar] [CrossRef]
Csiszár, I.; Shields, P.C. Redundancy rates for renewal and other processes. IEEE Trans. Inform. Theory 1996, 42, 2065–2072. [Google Scholar] [CrossRef]
Kieffer, J.C. A unified approach to weak universal source coding. IEEE Trans. Inform. Theory 1978, 24, 674–682. [Google Scholar] [CrossRef]
Åberg, J.; Shtarkov, Y.M.; Smeets, B.J. Multialphabet Coding with Separate Alphabet Description. In Proceedings of Compression and complexity of sequences; Press, I.C.S., Ed.; IEEE: Palermo, Italy, 1997; pp. 56–65. [Google Scholar]
Shamir, G.I.; Song, L. On the entropy of patterns of i.i.d. sequences. In Proceedings of 41st Annual Allerton Conference on Communication, Control and Computing; Curran Associates, Inc.: Monticello, IL, USA, 2003; pp. 160–169. [Google Scholar]
Shamir, G.I. A new redundancy bound for universal lossless compression of unknown alphabets. In Proceedings of the 38th Annual Conference on Information Sciences and Systems - CISS; IEEE: Princeton, NJ, USA, 2004; pp. 1175–1179. [Google Scholar]
Shamir, G.I. Universal lossless compression with unknown alphabets-the average case. IEEE Trans. Inform. Theory 2006, 52, 4915–4944. [Google Scholar] [CrossRef]
Shamir, G.I. On the MDL principle for i.i.d. sources with large alphabets. IEEE Trans. Inform. Theory 2006, 52, 1939–1955. [Google Scholar] [CrossRef]
Orlitsky, A.; Santhanam, N.P. Speaking of infinity. IEEE Trans. Inform. Theory 2004, 50, 2215–2230. [Google Scholar] [CrossRef]
Jevtić, N.; Orlitsky, A.; Santhanam, N.P. A lower bound on compression of unknown alphabets. Theoret. Comput. Sci. 2005, 332, 293–311. [Google Scholar] [CrossRef]
Orlitsky, A.; Santhanam, N.P.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inform. Theory 2004, 50, 1469–1481. [Google Scholar] [CrossRef]
Orlitsky, A.; Santhanam, N.P.; Viswanathan, K.; Zhang, J. Limit Results on Pattern Entropy of Stationary Processes. In Proceedings of the 2004 IEEE Information Theory workshop; IEEE: San Antonio, TX, USA, 2004; pp. 2954–2964. [Google Scholar]
Gemelos, G.; Weissman, T. On the entropy rate of pattern processes; Technical report hpl-2004-159; HP Laboratories Palo Alto: San Antonio, TX, USA, 2004. [Google Scholar]
Shamir, G.I.; From University of Utah, Electrical and Computer Ingeneering. Private communication, 2006.
Davisson, L.D. Universal noiseless coding. IEEE Trans. Inform. Theory 1973, IT-19, 783–795. [Google Scholar] [CrossRef]
Dixmier, J.; Nicolas, J.L. Partitions sans petits sommants. In A Tribute to Paul Erdös; Cambridge University Press: New York, NY, USA, 1990; Chapter 8; pp. 121–152. [Google Scholar]
Szekeres, G. An asymptotic formula in the theory of partitions. Quart. J. Math. Oxford 1951, 2, 85–108. [Google Scholar] [CrossRef]
Massart, P. Ecole d’Eté de Probabilité de Saint-Flour XXXIII; LNM; Springer-Verlag: London, UK, 2003; Chapter 2. [Google Scholar]

© 2009 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.

Share and Cite

MDPI and ACS Style

Garivier, A. A Lower-Bound for the Maximin Redundancy in Pattern Coding. Entropy 2009, 11, 634-642. https://doi.org/10.3390/e11040634

AMA Style

Garivier A. A Lower-Bound for the Maximin Redundancy in Pattern Coding. Entropy. 2009; 11(4):634-642. https://doi.org/10.3390/e11040634

Chicago/Turabian Style

Garivier, Aurélien. 2009. "A Lower-Bound for the Maximin Redundancy in Pattern Coding" Entropy 11, no. 4: 634-642. https://doi.org/10.3390/e11040634

APA Style

Garivier, A. (2009). A Lower-Bound for the Maximin Redundancy in Pattern Coding. Entropy, 11(4), 634-642. https://doi.org/10.3390/e11040634

Article Menu

A Lower-Bound for the Maximin Redundancy in Pattern Coding

Abstract

1. Introduction

1.1. Universal Coding

1.2. Dictionary and Pattern

1.3. Pattern Coding

2. Theorem

3. Proof

Acknowledgment

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI