1. Introduction
The relation between compression and prediction dates back at least as far as William of Ockham in the fourteenth century. This relation was not properly formalized, however, until the notion of Kolmogorov complexity was developed in the twentieth century [1,2,3]. Since then, there have been many efforts to harness compression algorithms for prediction, with a number of researchers focusing particularly on prediction for prefetching either disk pages or web pages. We refer the reader to the text by Cesa-Bianchi and Lugosi [4] and the references therein for a thorough discussion. For example, Krishnan and Vitter [5] showed that a prefetcher based on LZ78 [6] is asymptotically competitive with the best finite-state prefetcher. For prefetching, however, the alphabet of possible elements (i.e., all pages on the disk or web) is huge. In this paper, therefore, we investigate what effect the size of the alphabet has on predictability.
Krishnan and Vitter considered the problem of pure prefetching, in which the prefetcher can replace all the contents of the cache between each page request. They combined a predictor by Hannan [7], which is asymptotically competitive against any memoryless predictor, with an instantaneous version of LZ78, thus obtaining a predictor such that, for any finite-state predictor and any sequence, their predictor's success rate converges to or exceeds the finite-state predictor's success rate with probability approaching 1 as the sequence length increases. (As Krishnan and Vitter noted, a similar construction for predicting binary sequences was given by Feder, Merhav and Gutman [8].) Notice that this bound cannot be improved to hold with certainty instead of with high probability: the predictor must be randomized because, for any deterministic predictor, an adversary can choose each bit of the sequence to be the opposite of what the predictor guesses. Krishnan and Vitter's proof is based on the fact that, for any finite-state compressor and any sequence, LZ78's asymptotic compression ratio is at most that of the finite-state compressor; this necessarily involves the assumption that both the alphabet and the context length used in prediction are fixed. It is known what effect the size of the alphabet has on compressibility, both when we make certain assumptions about the source of the sequence [9,10] and when we want to bound the size of the encoding in terms of the kth-order empirical entropy of the sequence [11,12]. We will define a notion of predictability that is analogous to empirical entropy and use similar arguments to give nearly tight bounds on how large the alphabet can be before we cannot guarantee good prediction.
The rest of this paper is laid out as follows: in
Section 2 we briefly review several notions of entropy in computer science—Shannon entropy, empirical entropy, Rényi entropy and min-entropy—before defining empirical predictability; in
Section 3 we show that we can achieve good expected prediction in terms of a sequence's kth-order empirical predictability when $\sigma^{k+1} \le \epsilon^3 n$, where k is the length of the contexts considered when making predictions, σ is the size of the alphabet, n is the length of the sequence and ε is a positive constant; in Section 4 we show we cannot achieve the same bound when $\sigma^k \ge n^2$.
A preliminary version of these results [13] was reported at the 9th Canadian Workshop on Information Theory (CWIT '05) while the author was at the University of Toronto.
2. Empirical Predictability
Shannon [14] defined the entropy of a random variable to be our uncertainty about its value. Specifically, if a random variable X takes on one of σ values according to a probability distribution $P = p_1, \ldots, p_\sigma$, then
\[ H(X) = \sum_{i=1}^{\sigma} p_i \log \frac{1}{p_i} . \]
The base of the logarithm determines the unit of uncertainty; in computer science, the base is usually assumed to be 2, with the result that the unit is the bit (i.e., our uncertainty about the outcome of flipping a fair coin). Throughout the rest of this paper we write log to mean $\log_2$. Given P, the expected number of bits needed to encode the value of X is at least H(X) and less than H(X) + 1.
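As a concrete illustration (the function name and the sample distributions are our own choices), the following short Python sketch computes H(X), in bits, for a distribution given as a list of probabilities:

    import math

    def shannon_entropy(p):
        """Shannon entropy, in bits, of a probability distribution given as a list."""
        return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

    print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a fair coin flip
    print(shannon_entropy([0.9, 0.1]))   # about 0.47 bits: a biased coin is less uncertain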
The 0th-order empirical entropy $H_0(S)$ of a sequence $S = s_1 \cdots s_n$ is simply our uncertainty about an element chosen uniformly at random from S, i.e.,
\[ H_0(S) = \sum_{a \in S} \frac{\mathrm{occ}(a, S)}{n} \log \frac{n}{\mathrm{occ}(a, S)} , \]
where $a \in S$ means that element a occurs in S and $\mathrm{occ}(a, S)$ is its frequency. For $k \ge 1$, the kth-order empirical entropy $H_k(S)$ of S is our expected uncertainty about the random variable $s_i$ in the following experiment: i is chosen uniformly at random between 1 and n; if $i \le k$, then we are told $s_i$; if $i > k$, then we are told $s_{i-k} \cdots s_{i-1}$ and asked to guess $s_i$. Specifically,
\[ H_k(S) = \frac{1}{n} \sum_{|\alpha| = k} |S_\alpha| \, H_0(S_\alpha) , \]
where $S_\alpha$ is the concatenation of the elements in S immediately following occurrences of the k-tuple α. Notice that $|S_\alpha|$ equals the number of occurrences of α in S, unless α is a suffix of S, in which case it is 1 less. Given a table listing the frequency of each (k+1)-tuple in S, it takes about $n H_k(S)$ bits to encode S. For further discussion of empirical entropy, we refer readers to Manzini's analysis [15] of the Burrows–Wheeler Transform.
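To make these definitions concrete, here is a minimal Python sketch that computes $H_0(S)$ and $H_k(S)$ directly from them; the function names and the sample sequence "abracadabra" are our own illustrative choices:

    import math
    from collections import Counter, defaultdict

    def h0(s):
        """0th-order empirical entropy: uncertainty about an element drawn uniformly from s."""
        n = len(s)
        return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

    def hk(s, k):
        """kth-order empirical entropy: (1/n) * sum over k-tuples alpha of |S_alpha| * H_0(S_alpha)."""
        n = len(s)
        followers = defaultdict(list)              # followers[alpha] plays the role of S_alpha
        for i in range(n - k):
            followers[s[i:i + k]].append(s[i + k])
        return sum(len(sa) * h0(sa) for sa in followers.values()) / n

    s = "abracadabra"                              # an illustrative sequence of our own choosing
    print(h0(s), hk(s, 1), hk(s, 2))               # the entropies shrink as the context length grows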
The Rényi entropy of order t of X is defined as
\[ H_t(X) = \frac{1}{1 - t} \log \sum_{i=1}^{\sigma} p_i^t \]
for $t \ge 0$ and $t \ne 1$, where the random variable X again takes on values according to the probability distribution P. The Rényi entropy of order 0 of X is the logarithm of the size of the support of P. The limit of the Rényi entropy of order t of X as t approaches 1 is the Shannon entropy H(X) of X; as t approaches infinity, the limit is $\log \frac{1}{\max_i p_i}$, which is often called the min-entropy of X. Min-entropy is related to predictability because, given P and asked to guess the value of X, our best strategy is to choose the most probable value and, thus, guess correctly with probability $\max_i p_i$.
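The following small sketch checks these limits numerically for one sample distribution (again an illustrative choice of ours): the Rényi entropy is close to the Shannon entropy for t near 1 and close to the min-entropy for large t.

    import math

    def renyi_entropy(p, t):
        """Renyi entropy of order t (t >= 0, t != 1), in bits."""
        return math.log2(sum(pi ** t for pi in p)) / (1.0 - t)

    p = [0.5, 0.25, 0.125, 0.125]
    shannon = sum(pi * math.log2(1.0 / pi) for pi in p)    # 1.75 bits
    min_entropy = math.log2(1.0 / max(p))                  # 1 bit
    print(renyi_entropy(p, 1.001), shannon)                # nearly equal near t = 1
    print(renyi_entropy(p, 200.0), min_entropy)            # approaches the min-entropy as t grows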
If we are asked to guess the value of an element chosen uniformly at random from the sequence S with no context given, then our best strategy is to choose the most frequent element and, thus, guess correctly with probability $\max_{a \in S} \mathrm{occ}(a, S) / n$. Following the example of empirical entropy, we call this probability the 0th-order empirical predictability $\lambda_0(S)$ of S. We define the kth-order empirical predictability $\lambda_k(S)$ of S to be the expected predictability of the random variable $s_i$ in the following experiment: i is chosen uniformly at random between 1 and n; if $i \le k$, then we are told $s_i$; if $i > k$, then we are told $s_{i-k} \cdots s_{i-1}$ and asked to guess $s_i$. Specifically,
\[ \lambda_k(S) = \frac{1}{n} \left( k + \sum_{|\alpha| = k} |S_\alpha| \, \lambda_0(S_\alpha) \right) \]
for $k \ge 1$, where $S_\alpha$ is again the concatenation of the elements in S immediately following occurrences of the k-tuple α
. For example, if
then
,
and all higher-order empirical predictabilities are 1. In other words, if someone asks us to guess an element chosen uniformly at random from
then, given no context, we should choose
, in which case the probability of our prediction being correct is
. If we are given the preceding element (or told there is no preceding element) and it is not an
, then we can answer with certainty; if it is an
, which has probability
, then we should guess either
or
and be right with probability
; overall, the probability of our prediction being correct is
. If we are given the two preceding elements (or told how many preceding elements there are), then we can always answer with certainty.
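Since the numerical values in the example above depend on the particular sequence, the following sketch computes the kth-order empirical predictability directly from the definition; the sequence "abracadabra" is only an illustrative stand-in of ours:

    from collections import Counter, defaultdict

    def predictability(s, k):
        """kth-order empirical predictability of s; positions i <= k count as known."""
        n = len(s)
        if k == 0:
            return max(Counter(s).values()) / n
        followers = defaultdict(list)              # followers[alpha] plays the role of S_alpha
        for i in range(n - k):
            followers[s[i:i + k]].append(s[i + k])
        correct = k + sum(max(Counter(sa).values()) for sa in followers.values())
        return correct / n

    s = "abracadabra"                              # an illustrative sequence of our own choosing
    print([predictability(s, k) for k in range(4)])    # [5/11, 9/11, 1.0, 1.0]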
Given a table listing the frequency of each (k+1)-tuple in S, we can build a static predictor that, after seeing k elements, always predicts the element that follows that k-tuple most often in S; this predictor guesses correctly $n \lambda_k(S) - k$ times when predicting all the elements in S, which is optimal for a static predictor that uses contexts of length at most k.
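As a sanity check on this claim, the sketch below builds that static predictor from the frequency table of (k+1)-tuples and counts its correct guesses over the same illustrative sequence; for k = 1 and "abracadabra" the count is 8, which is $n \lambda_1(S) - 1$:

    from collections import Counter, defaultdict

    def static_predictor_correct(s, k):
        """Correct guesses made over s by the static predictor that, after each k-tuple,
        predicts the element that most often follows that k-tuple in s."""
        table = defaultdict(Counter)               # frequency table of (k+1)-tuples
        for i in range(len(s) - k):
            table[s[i:i + k]][s[i + k]] += 1
        best = {alpha: c.most_common(1)[0][0] for alpha, c in table.items()}
        return sum(1 for i in range(len(s) - k) if best[s[i:i + k]] == s[i + k])

    print(static_predictor_correct("abracadabra", 1))   # 8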
3. Upper Bound
Having defined the kth-order empirical predictability $\lambda_k(S)$ of S, it is natural to ask when an adaptive predictor with no advance knowledge of S can achieve success rate $\lambda_k(S)$. Whenever both k and the size σ of the alphabet are fixed, Krishnan and Vitter's predictor [5] almost certainly achieves a success rate asymptotically approaching $\lambda_k(S)$ as n goes to infinity. If S is a randomly chosen permutation, however, then $\lambda_k(S) = 1$ for every $k \ge 1$ but the expected success rate of any predictor without advance knowledge of S approaches 0 as
n increases. In this section we show that if $\sigma^{k+1} \le \epsilon^3 n$, for a constant $\epsilon > 0$, then an adaptive predictor can achieve expected success rate $\lambda_k(S) - O(\epsilon)$ on any sufficiently long sequence. For simplicity we assume that k is given although, in practice, a predictor should find an optimal or nearly optimal context length by itself.
The most obvious 0th-order predictor is the one that always guesses that the next element will be the most frequent element seen so far. Hannan [7] randomized this predictor and obtained a predictor A whose expected success rate on a sequence T converges to $\lambda_0(T)$ when $|T| / \sigma \to \infty$; specifically, A's expected success rate is at least $\lambda_0(T) - O\bigl(\sqrt{\sigma / |T|}\bigr)$. We now consider the most obvious generalization $A_k$ of Hannan's predictor to use contexts of a given length k: after seeing a k-tuple α, we apply Hannan's predictor to the concatenation of the elements seen so far that immediately followed occurrences of α; i.e., $A_k$'s guess for $s_i$ is A's guess after processing $(s_1 \cdots s_{i-1})_\alpha$, where $\alpha = s_{i-k} \cdots s_{i-1}$ and the subscript notation is as defined in Section 2.
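The sketch below is only a rough illustration of this structure: it keeps one perturbed-frequency ("follow the perturbed leader") predictor per length-k context. The perturbation used here and its scale are simplifications of ours and are not claimed to reproduce Hannan's exact construction or its guarantees.

    import random
    from collections import Counter, defaultdict

    class ContextualPredictor:
        """One randomized frequency-based predictor per k-tuple context, in the spirit
        of applying a Hannan-style predictor to each subsequence S_alpha."""

        def __init__(self, k, alphabet, noise=1.0):
            self.k = k
            self.alphabet = list(alphabet)
            self.noise = noise                        # magnitude of the random perturbation
            self.counts = defaultdict(Counter)        # counts[alpha][a]: occurrences of a after alpha so far

        def predict(self, context):
            counts = self.counts[context]
            return max(self.alphabet, key=lambda a: counts[a] + self.noise * random.random())

        def update(self, context, element):
            self.counts[context][element] += 1

    def success_rate(s, k, trials=50):
        """Average fraction of elements of s guessed correctly, over several randomized runs."""
        total = 0.0
        for _ in range(trials):
            p = ContextualPredictor(k, set(s))
            correct = 0
            for i in range(k, len(s)):
                context = s[i - k:i]
                correct += (p.predict(context) == s[i])
                p.update(context, s[i])
            total += correct / len(s)
        return total / trials

    print(success_rate("abracadabra" * 20, k=1))       # close to the 1st-order empirical predictability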
Fix $\epsilon > 0$ and assume $\sigma^{k+1} \le \epsilon^3 n$. Consider the subsequences into which $A_k$ partitions S before applying A, and let L be the subset of them that are each of length at least $\sigma / \epsilon^2$. Notice that A achieves expected success rate at least $\lambda_0(S_\alpha) - O(\epsilon)$ on any subsequence $S_\alpha \in L$ so, by linearity of expectation, $A_k$ achieves expected success rate at least
\[ \frac{1}{n} \sum_{S_\alpha \in L} |S_\alpha| \bigl( \lambda_0(S_\alpha) - O(\epsilon) \bigr) \ge \frac{1}{n} \sum_{S_\alpha \in L} |S_\alpha| \, \lambda_0(S_\alpha) - O(\epsilon) . \]
On the other hand, the total length of the subsequences not in L is less than $\sigma^k \cdot \sigma / \epsilon^2 \le \epsilon n$, so
\[ \frac{1}{n} \sum_{S_\alpha \in L} |S_\alpha| \, \lambda_0(S_\alpha) \ge \lambda_k(S) - \epsilon - \frac{k}{n} . \]
Therefore, when S is sufficiently long, $A_k$ achieves expected success rate $\lambda_k(S) - O(\epsilon)$.
Theorem 1. If the n elements of a sufficiently long sequence S are chosen arbitrarily from an alphabet of size σ and $\sigma^{k+1} \le \epsilon^3 n$, for a constant $\epsilon > 0$, then $A_k$ achieves expected success rate $\lambda_k(S) - O(\epsilon)$.
4. Lower Bound
Compression researchers (see, e.g., [16] for a survey) have shown how to store S in $n H_k(S) + o(n \log \sigma)$ bits for all $k \le (1 - \epsilon) \log_\sigma n$ simultaneously, where ε is a positive constant. In a previous paper [11] we showed that it is impossible to prove a worst-case bound of this form when $k \ge \log_\sigma n$: in σ-ary De Bruijn cycles [17] of order k, each k-tuple appears exactly once, so such cycles have length $\sigma^k$ and kth-order empirical entropy 0; there are $(\sigma!)^{\sigma^{k-1}} / \sigma^k$ such cycles [18] and $\log \bigl( (\sigma!)^{\sigma^{k-1}} / \sigma^k \bigr) = \Theta(\sigma^k \log \sigma)$; by the pigeonhole principle, there is no injective mapping from σ-ary strings of length n with kth-order empirical entropy 0 to binary strings of length $o(n \log \sigma)$; therefore, if $k \ge \log_\sigma n$, then in the worst case we cannot store S in $c \, n H_k(S) + o(n \log \sigma)$ bits for any coefficient c.
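To illustrate the first of these points, the sketch below builds a binary (σ = 2) De Bruijn sequence of order k by the standard greedy "prefer-one" rule and checks that every k-tuple occurs exactly once, so that each context determines its follower and the kth-order empirical entropy is 0; the binary case and this particular construction are our own choices for illustration.

    from collections import Counter, defaultdict

    def de_bruijn_binary(k):
        """Greedy 'prefer-one' construction of a binary De Bruijn sequence of order k,
        returned as a linear string of length 2**k + k - 1 containing every k-tuple once."""
        s = "0" * k
        seen = {s}
        while True:
            for bit in "10":                           # prefer appending a 1
                candidate = (s[-(k - 1):] + bit) if k > 1 else bit
                if candidate not in seen:
                    seen.add(candidate)
                    s += bit
                    break
            else:
                return s

    k = 4
    s = de_bruijn_binary(k)
    tuples = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    print(len(s), len(tuples), max(tuples.values()))   # 19 16 1: every 4-tuple occurs exactly once

    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    print(all(len(f) <= 1 for f in followers.values()))    # True: each context is deterministic, so H_k(s) = 0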
In a recent paper [12] we used similar but more sophisticated arguments to show that, if $k \ge (1 + \epsilon) \log_\sigma n$ for some positive constant ε, then in the expected case we cannot store S in $c \, n H_k(S) + o(n \log \sigma)$ bits for any coefficient c; if $k \ge (2 + \epsilon) \log_\sigma n$, then with high probability we cannot store S in that many bits for any coefficient c.
We now turn our attention to proving lower bounds for prediction and show that, if $\sigma^k \ge n^2$ and the elements of S are chosen uniformly at random, then no predictor without advance knowledge of S can achieve an expected success rate close to $\lambda_k(S)$. Notice that $n \lambda_k(S)$ is, by definition, at least the number of distinct k-tuples in S minus 1: for any distinct k-tuple α that occurs in S and is not a suffix of S, the optimal static predictor described in Section 2 correctly guesses the element after at least one occurrence of α in S. Suppose $\sigma^k \ge n^2$, i.e., $k \ge 2 \log_\sigma n$, and let d be the number of distinct k-tuples in S. Janson, Lonardi and Szpankowski [19] showed that the expected number of distinct k-tuples in S is $n - k + 1 - O(n^2 / \sigma^k)$, so
\[ \mathrm{E}[\lambda_k(S)] \ge \frac{\mathrm{E}[d] - 1}{n} \ge 1 - \frac{k + O(1)}{n} = 1 - o(1) . \]
On the other hand, no predictor without advance knowledge of S can achieve an expected success rate greater than $1 / \sigma$.
Theorem 2. If the n elements of a sequence S are chosen uniformly at random from an alphabet of size σ and $\sigma^k \ge n^2$, then S's expected kth-order empirical predictability is $1 - o(1)$ but no predictor without advance knowledge of S can achieve an expected success rate greater than $1 / \sigma$.
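A small simulation is consistent with this contrast; the alphabet size, sequence length and context length below are arbitrary illustrative choices of ours, picked so that repeated k-tuples are rare.

    import random
    from collections import Counter, defaultdict

    def predictability(s, k):
        """kth-order empirical predictability of the list s; positions i <= k count as known."""
        n = len(s)
        followers = defaultdict(list)
        for i in range(n - k):
            followers[tuple(s[i:i + k])].append(s[i + k])
        return (k + sum(max(Counter(f).values()) for f in followers.values())) / n

    random.seed(0)
    sigma, n, k = 100, 1000, 3                      # sigma**k is large relative to n, so repeats are rare
    s = [random.randrange(sigma) for _ in range(n)]

    print(predictability(s, k))                     # close to 1: almost every k-tuple in s is distinct

    # An online predictor that guesses the most frequent follower of the current context seen so far:
    counts = defaultdict(Counter)
    correct = 0
    for i in range(k, n):
        context = tuple(s[i - k:i])
        guess = counts[context].most_common(1)[0][0] if counts[context] else 0
        correct += (guess == s[i])
        counts[context][s[i]] += 1
    print(correct / n, 1 / sigma)                   # the observed success rate stays near 1/sigma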