A Note on Sequence Prediction over Large Alphabets

Building on results from data compression, we prove nearly tight bounds on how well sequences of length $n$ can be predicted in terms of the size $\sigma$ of the alphabet and the length $k$ of the context considered when making predictions. We compare the performance achievable by an adaptive predictor with no advance knowledge of the sequence to the performance achievable by the optimal static predictor using a table listing the frequency of each $(k + 1)$-tuple in the sequence. We show that, if the elements of the sequence are chosen uniformly at random, then an adaptive predictor can compete in the expected case if $k \le \log_\sigma n - 3 - \epsilon$, for a constant $\epsilon > 0$, but not if $k \ge \log_\sigma n$.


Introduction
The relation between compression and prediction dates back at least as far as William of Ockham in the fourteenth century. This relation was not properly formalized, however, until the notion of Kolmogorov complexity was developed in the twentieth century [1,2,3]. Since then, there have been many efforts to harness compression algorithms for prediction, with a number of researchers focusing particularly on prediction for prefetching either disk pages or web pages. We refer the reader to the text by Cesa-Bianchi and Lugosi [4] and references therein for a thorough discussion. For example, Krishnan and Vitter [5] showed that a prefetcher based on LZ78 [6] is asymptotically competitive with the best finite-state prefetcher. For prefetching, however, the alphabet of possible elements (i.e., all pages on the disk or web) is huge. In this paper we investigate, therefore, what effect the size of the alphabet has on predictability.
Krishnan and Vitter considered the problem of pure prefetching, in which the prefetcher can replace all the contents of the cache between each page request. They combined a predictor by Hannan [7], which is asymptotically competitive against any memoryless predictor, with an instantaneous version of LZ78, thus obtaining a predictor such that, for any finite-state predictor and any sequence, their predictor's success rate converges to or exceeds the finite-state predictor's success rate with probability approaching 1 as the sequence length increases. (As Krishnan and Vitter noted, a similar construction for predicting binary sequences was given by Feder, Merhav and Gutman [8].) Notice that this bound cannot be improved to hold with certainty instead of with high probability: the predictor must be randomized because, for any deterministic predictor, an adversary can choose each bit of the sequence to be the opposite of what the predictor guesses. Krishnan and Vitter's proof is based on the fact that, for any finite-state compressor and any sequence, LZ78's asymptotic compression ratio is at most that of the finite-state compressor; this necessarily involves the assumption that both the alphabet and the context length used in prediction are fixed. It is known what effect the size of the alphabet has on compressibility, both when we make certain assumptions about the source of the sequence [9,10] and when we want to bound the size of the encoding in terms of the $k$th-order empirical entropy of the sequence [11,12]. We will define a notion of predictability that is analogous to empirical entropy and use similar arguments to give nearly tight bounds on how large the alphabet can be before we cannot guarantee good prediction.
The rest of this paper is laid out as follows: in Section 2 we briefly review several notions of entropy in computer science (Shannon entropy, empirical entropy, Rényi entropy and min-entropy) before defining empirical predictability; in Section 3 we show that we can achieve good expected prediction in terms of a sequence's $k$th-order empirical predictability when $k \le \log_\sigma n - 3 - \epsilon$, where $k$ is the length of the contexts considered when making predictions, $\sigma$ is the size of the alphabet, $n$ is the length of the sequence and $\epsilon > 0$ is a constant; in Section 4 we show we cannot achieve the same bound when $k \ge \log_\sigma n$. A preliminary version of these results [13] was reported at the 9th Canadian Workshop on Information Theory (CWIT '05) while the author was at the University of Toronto.

Empirical Predictability
Shannon [14] defined the entropy of a random variable to be our uncertainty about its value. Specifically, if a random variable $X$ takes on one of $\sigma$ values according to a probability distribution $P = p_1, \ldots, p_\sigma$, then
\[
H(X) = \sum_{i=1}^{\sigma} p_i \log \frac{1}{p_i} .
\]
The base of the logarithm determines the unit of uncertainty; in computer science, the base is usually assumed to be 2, with the result that the unit is the bit (i.e., our uncertainty about the outcome of flipping a fair coin). Throughout the rest of this paper we write $\log$ to mean $\log_2$. Given $P$, the expected number of bits needed to encode the value of $X$ is at least $H(X)$ and less than $H(X) + 1$.
The 0th-order empirical entropy $H_0(S)$ of a sequence $S[1..n]$ is simply our uncertainty about an element chosen uniformly at random from $S$, i.e.,
\[
H_0(S) = \frac{1}{n} \sum_{a \in S} \mathrm{occ}(a, S) \log \frac{n}{\mathrm{occ}(a, S)} ,
\]
where $a \in S$ means that element $a$ occurs in $S$ and $\mathrm{occ}(a, S)$ is its frequency. For $k \ge 1$, the $k$th-order empirical entropy $H_k(S)$ of $S$ is our expected uncertainty about the random variable $s[i]$ in the following experiment: $i$ is chosen uniformly at random between 1 and $n$; if $i \le k$, then we are told $s[i]$; if $i > k$, then we are told $s[i-k..i-1]$ and asked to guess $s[i]$. Specifically,
\[
H_k(S) = \frac{1}{n} \sum_{|\alpha| = k} |S_\alpha| \, H_0(S_\alpha) ,
\]
where $S_\alpha$ is the concatenation of the elements in $S$ immediately following occurrences of the $k$-tuple $\alpha$.
Notice that $|S_\alpha| = \mathrm{occ}(\alpha, S)$ unless $\alpha$ is a suffix of $S$, in which case it is 1 less. Given a table listing the frequency of each $(k + 1)$-tuple in $S$, it takes about $n H_k(S)$ bits to encode $S$. For further discussion of empirical entropy, we refer readers to Manzini's analysis [15] of the Burrows-Wheeler Transform.
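As an illustration of the definitions above, the following Python sketch computes $H_0(S)$ and $H_k(S)$ directly from the formulas. The function names `h0`, `followers` and `hk` are our own illustrative choices, and the code favours clarity over efficiency.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """0th-order empirical entropy of seq, in bits per element."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())


def followers(seq, k):
    """Map each k-tuple alpha to S_alpha, the elements right after alpha."""
    out = defaultdict(list)
    for i in range(k, len(seq)):
        out[tuple(seq[i - k:i])].append(seq[i])
    return out


def hk(seq, k):
    """kth-order empirical entropy: (1/n) * sum of |S_alpha| * H_0(S_alpha)."""
    if k == 0:
        return h0(seq)
    n = len(seq)
    return sum(len(s) * h0(s) for s in followers(seq, k).values()) / n


if __name__ == "__main__":
    for k in range(3):
        print(k, hk("TORONTO", k))
```

For the string TORONTO, the only ambiguous context of length 1 is O (followed once by R and once by N), so `hk("TORONTO", 1)` evaluates to 2/7, and `hk("TORONTO", 2)` is 0.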
The Rényi entropy of order $t$ of $X$ is defined as
\[
H_t(X) = \frac{1}{1 - t} \log \sum_{i=1}^{\sigma} p_i^t
\]
for $0 \le t \ne 1$, where the random variable $X$ again takes on values according to the probability distribution $P = p_1, \ldots, p_\sigma$. The Rényi entropy of order 0 of $X$ is the logarithm of the size of the support of $P$. The limit of the Rényi entropy of order $t$ of $X$ as $t$ approaches 1 is the Shannon entropy $H(X)$ of $X$; as $t$ approaches infinity, the limit is $-\log \sup_{1 \le j \le \sigma} p_j$, which is often called the min-entropy of $X$. Min-entropy is related to predictability because, given $P$ and asked to guess the value of $X$, our best strategy is to choose the most probable value and, thus, guess correctly with probability $\max_{1 \le i \le \sigma} \{p_i\}$.
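The limiting behaviour described above can be checked numerically. The following Python sketch is only an illustration; the function names and the example distribution are our own assumptions.

```python
from math import log2


def shannon_entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return sum(q * log2(1 / q) for q in p if q > 0)


def min_entropy(p):
    """Min-entropy: minus the log of the largest probability."""
    return -log2(max(p))


def renyi_entropy(p, t):
    """Renyi entropy of order t >= 0, with the limits at t = 1 and infinity."""
    if t == 1:
        return shannon_entropy(p)
    if t == float("inf"):
        return min_entropy(p)
    return log2(sum(q ** t for q in p if q > 0)) / (1 - t)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.125, 0.125]   # hypothetical example distribution
    for t in (0, 0.5, 1, 2, 200, float("inf")):
        print(t, renyi_entropy(p, t))
```

For this distribution the order-0 value is $\log 4 = 2$, the Shannon entropy is 1.75, and the min-entropy is 1, so the best single guess is correct with probability $2^{-1} = 1/2$.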
If we are asked to guess the value of an element $s[i]$ chosen uniformly at random from the sequence $S$ with no context given, then our best strategy is to choose the most frequent element and, thus, guess correctly with probability $\max_{a \in S} \mathrm{occ}(a, S) / n$. Following the example of empirical entropy, we call this probability the 0th-order empirical predictability $P_0(S)$ of $S$. We define the $k$th-order empirical predictability $P_k(S)$ of $S$ to be the expected predictability of the random variable $s[i]$ in the following experiment: $i$ is chosen uniformly at random between 1 and $n$; if $i \le k$, then we are told $s[i]$; if $i > k$, then we are told $s[i-k..i-1]$ and asked to guess $s[i]$. Specifically,
\[
P_k(S) = \frac{1}{n} \Bigl( k + \sum_{|\alpha| = k} |S_\alpha| \, P_0(S_\alpha) \Bigr) ,
\]
where $S_\alpha$ is again the concatenation of the elements in $S$ immediately following occurrences of the $k$-tuple $\alpha$. For example, if $S = \mathrm{TORONTO}$ then $P_0(S) = 3/7 \approx 0.429$,
\[
P_1(S) = \frac{1}{7} \bigl( 1 + P_0(\mathrm{T}) + 2 P_0(\mathrm{RN}) + P_0(\mathrm{O}) + 2 P_0(\mathrm{OO}) \bigr) = \frac{6}{7} \approx 0.857
\]
and all higher-order empirical predictabilities are 1. In other words, if someone asks us to guess an element chosen uniformly at random from TORONTO then, given no context, we should choose O, in which case the probability of our prediction being correct is 3/7. If we are given the preceding element (or told there is no preceding element) and it is not an O, then we can answer with certainty; if it is an O, which has probability 2/7, then we should guess either R or N and be right with probability 1/2; overall, the probability of our prediction being correct is 6/7. If we are given the two preceding elements (or told how many preceding elements there are), then we can always answer with certainty. Given a table listing the frequency of each $(k + 1)$-tuple in $S$, we can build a static predictor that, after seeing $k$ elements, always predicts the element that follows that $k$-tuple most often in $S$; this predictor guesses correctly $n P_k(S)$ times when predicting all the elements in $S$, which is optimal for a static predictor that uses contexts of length at most $k$.
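The TORONTO example can be reproduced with a few lines of Python. This is a minimal sketch of the definition of $P_k(S)$; the function name `pk` is our own, and exact fractions are used so that the results can be compared with the values in the text.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def pk(seq, k):
    """kth-order empirical predictability of seq as an exact fraction."""
    n = len(seq)
    k = min(k, n)
    # followers[alpha] counts the elements that immediately follow alpha.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[tuple(seq[i - k:i])][seq[i]] += 1
    # The first k elements are given to us; in each context the optimal
    # static predictor is right as often as the most common follower occurs,
    # which equals |S_alpha| * P_0(S_alpha).
    correct = k + sum(max(c.values()) for c in followers.values())
    return Fraction(correct, n)


if __name__ == "__main__":
    for k in range(4):
        print(k, pk("TORONTO", k))
```

Here `pk("TORONTO", 0)` is 3/7, `pk("TORONTO", 1)` is 6/7, and `pk("TORONTO", k)` is 1 for every larger `k`, matching the worked example.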

Upper Bound
Having defined the $k$th-order predictability $P_k(S)$ of $S$, it is natural to ask when an adaptive predictor with no advance knowledge of $S$ can achieve success rate $P_k(S)$. Whenever both $k$ and the size $\sigma$ of the alphabet are fixed, Krishnan and Vitter's predictor [5] almost certainly achieves a success rate asymptotically approaching $P_k(S)$ as $n$ goes to infinity. If $S$ is a randomly chosen permutation, however, then $P_1(S) = 1$ but the expected success rate of any predictor without advance knowledge of $S$ approaches 0 as $n$ increases. In this section we show that if $k \le \log_\sigma n - 3 - \epsilon$, for a constant $\epsilon > 0$, then an adaptive predictor can achieve expected success rate $P_k(S)$ on any sufficiently long sequence. For simplicity we assume that $k$ is given although, in practice, a predictor should find an optimal or nearly optimal context length by itself.
The most obvious 0th-order predictor is the one that always guesses that the next element will be the most frequent element seen so far. Hannan [7] randomized this predictor and obtained a predictor $A$ whose expected success rate converges to $P_0(S)$ when $n = \omega(\sigma^3)$. We now consider the most obvious generalization $A_k$ of Hannan's predictor to use contexts of a given length $k$: after seeing a $k$-tuple $\alpha$, we apply Hannan's predictor to the subsequence of elements consisting of the concatenation of elements so far that immediately followed occurrences of $\alpha$; i.e.,
\[
A_k(S[1..i]) = A\bigl( (S[1..i])_\alpha \bigr) \quad \text{with } \alpha = S[i-k+1..i] .
\]
Fix $\epsilon > 0$ and assume $k \le \log_\sigma n - 3 - \epsilon$. Consider the subsequences into which $A_k$ partitions $S$ before applying $A$, and let $L$ be the subset of them that are each of length at least $\sigma^{3 + \epsilon/2}$. Notice that $A$ achieves expected success rate $P_0(S')$ on any subsequence $S' \in L$ so, by linearity of expectation, $A_k$ achieves expected success rate at least
\[
\frac{1}{n} \sum_{S' \in L} |S'| \, P_0(S') .
\]
On the other hand, there are at most $\sigma^k$ subsequences, one per context, so the total length of the subsequences not in $L$ is less than
\[
\sigma^k \cdot \sigma^{3 + \epsilon/2} \le \sigma^{\log_\sigma n - 3 - \epsilon} \cdot \sigma^{3 + \epsilon/2} = \frac{n}{\sigma^{\epsilon/2}} .
\]
Therefore, when $S$ is sufficiently long, $A_k$ achieves expected success rate $P_k(S)$.
Theorem 1. If the $n$ elements of a sufficiently long sequence $S$ are chosen arbitrarily from an alphabet of size $\sigma$ and $k < \log_\sigma n - 3 - \epsilon$, for a constant $\epsilon > 0$, then $A_k$ achieves expected success rate $P_k(S)$.
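To make the structure of a context-partitioned adaptive predictor concrete, the following Python sketch keeps one table of counts per context and guesses a randomly perturbed leader within the current context. It is not Hannan's predictor: the uniform perturbation and its square-root scaling, the `noise` parameter, and the class and function names are all assumptions made for illustration, and nothing here is claimed to satisfy the guarantee of Theorem 1. Contexts shorter than $k$ at the start of the sequence are simply treated as contexts in their own right.

```python
import random
from collections import defaultdict, deque


class ContextPredictor:
    """Per-context perturbed follow-the-leader (illustrative stand-in for A_k)."""

    def __init__(self, alphabet, k, noise=1.0, seed=0):
        self.alphabet = list(alphabet)
        self.noise = noise                  # hypothetical perturbation scale
        self.rng = random.Random(seed)
        self.recent = deque(maxlen=k)       # the last k elements seen
        # counts[context][a] = number of times a has followed context so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self):
        seen = self.counts[tuple(self.recent)]
        scale = self.noise * sum(seen.values()) ** 0.5
        # Guess the element with the largest randomly perturbed count.
        return max(self.alphabet,
                   key=lambda a: seen[a] + self.rng.uniform(0.0, scale))

    def observe(self, symbol):
        self.counts[tuple(self.recent)][symbol] += 1
        self.recent.append(symbol)


def success_rate(seq, alphabet, k, **kwargs):
    """Fraction of elements of seq guessed correctly, predicting online."""
    predictor = ContextPredictor(alphabet, k, **kwargs)
    hits = 0
    for symbol in seq:
        hits += predictor.predict() == symbol
        predictor.observe(symbol)
    return hits / len(seq)


if __name__ == "__main__":
    seq = "ab" * 500
    print(success_rate(seq, "ab", 0), success_rate(seq, "ab", 1))
```

On the alternating sequence above, contexts of length 1 make every element after the first few predictable, whereas with no context the two elements are equally frequent and the predictor can do little better than guessing.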

Lower Bound
Compression researchers (see, e.g., [16] for a survey) have shown how to store $S$ in $n H_k(S) + o(n \log \sigma)$ bits for all $k \le (1 - \epsilon) \log_\sigma n$ simultaneously, where $\epsilon$ is a positive constant. In a previous paper [11] we showed that it is impossible to prove a worst-case bound of this form when $k \ge \log_\sigma n$:
• in $\sigma$-ary De Bruijn cycles [17] of order $k$, each $k$-tuple appears exactly once, so such cycles have length $\sigma^k$ and $k$th-order empirical entropy 0;
• there are $(\sigma!)^{\sigma^{k-1}} / \sigma^k$ such cycles [18] and $\log_2 \bigl( (\sigma!)^{\sigma^{k-1}} / \sigma^k \bigr) = \Theta(\sigma^k \log \sigma)$;
• by the pigeonhole principle, there is no injective mapping from $\sigma$-ary strings of length $n$ with $k$th-order empirical entropy 0 to binary strings of length $o(n \log \sigma)$;
• therefore, if $k \ge \log_\sigma n$, then in the worst case we cannot store $S$ in $\lambda n H_k(S) + o(n \log \sigma)$ bits for any coefficient $\lambda$.
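The counting formula in the argument above is easy to check by brute force for very small parameters. The following Python sketch (function names are our own) evaluates $(\sigma!)^{\sigma^{k-1}} / \sigma^k$ and compares it with an exhaustive enumeration of cyclic strings in which every $k$-tuple appears exactly once; the enumeration is exponential in $\sigma^k$ and is only meant for tiny cases.

```python
from itertools import product
from math import factorial


def num_de_bruijn_cycles(sigma, k):
    """Number of sigma-ary De Bruijn cycles of order k >= 1, by the formula."""
    return factorial(sigma) ** (sigma ** (k - 1)) // sigma ** k


def brute_force_count(sigma, k):
    """Count De Bruijn cycles by checking every string of length sigma**k."""
    n = sigma ** k
    total = 0
    for cyc in product(range(sigma), repeat=n):
        # Read the string cyclically and collect all windows of length k.
        windows = {tuple(cyc[(i + j) % n] for j in range(k)) for i in range(n)}
        total += len(windows) == n      # every k-tuple appears exactly once
    # Each cycle was counted once per rotation, and its n rotations are
    # distinct because its n windows are distinct.
    return total // n


if __name__ == "__main__":
    for sigma, k in [(2, 2), (2, 3), (3, 2)]:
        print(sigma, k, num_de_bruijn_cycles(sigma, k),
              brute_force_count(sigma, k))
```

For example, there are 2 binary De Bruijn cycles of order 3 and 24 ternary cycles of order 2, and both functions agree on these values.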
In a recent paper [12] we used similar but more sophisticated arguments to show that, if $k \ge (1 + \epsilon) \log_\sigma n$ for some positive constant $\epsilon$, then in the expected case we cannot store $S$ in $\lambda n H_k(S) + o(n \log \sigma)$ bits for any coefficient $\lambda = o(n^\epsilon)$; if $k \ge (2 + \epsilon) \log_\sigma n$, then with high probability we cannot store $S$ in that many bits for any coefficient $\lambda$.
We now turn our attention to proving lower bounds for prediction and show that, if $k \ge \log_\sigma n$ and the elements of $S$ are chosen uniformly at random, then no predictor without advance knowledge of $S$ can achieve an expected success rate close to $P_k(S)$. Notice that $n P_k(S)$ is, by definition, at least the number of distinct $k$-tuples in $S$ minus 1: for any distinct $k$-tuple $\alpha$ that occurs in $S$ and is not a suffix of $S$, the optimal static predictor described in Section 2 correctly guesses the element after at least one occurrence of $\alpha$ in $S$. Suppose $k \ge \log_\sigma n$ (implying that $\sigma \ge 2$) and let $c = \sigma^k / n \ge 1$. Janson, Lonardi and Szpankowski [19] showed that the expected number of distinct $k$-tuples in $S$ is, up to lower-order terms,
\[
\sigma^k \bigl( 1 - e^{-n / \sigma^k} \bigr) = c n \bigl( 1 - e^{-1/c} \bigr) \ge n \Bigl( 1 - \frac{1}{e} \Bigr) ,
\]
where the inequality holds because $c (1 - e^{-1/c})$ is increasing in $c$ for $c \ge 1$. On the other hand, no predictor without advance knowledge of $S$ can achieve an expected success rate greater than $1/\sigma \le 1/2$, because each element is uniformly distributed regardless of what precedes it.
Theorem 2. If the $n$ elements of a sequence $S$ are chosen uniformly at random from an alphabet of size $\sigma$ and $k \ge \log_\sigma n$, then $S$'s expected $k$th-order empirical predictability $E[P_k(S)] \ge 1 - 1/e > 0.632$ but no predictor without advance knowledge of $S$ can achieve an expected success rate greater than $1/\sigma \le 1/2$.
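The gap in Theorem 2 can be observed in a small experiment. The following Python sketch draws a uniformly random sequence with $\sigma^k = n$, counts how often the optimal static predictor is correct, and compares this with one simple predictor that has no advance knowledge (it guesses that each element repeats the previous one). The parameter values, the seed and the choice of baseline predictor are our own assumptions; a single run illustrates the theorem but of course proves nothing.

```python
import random
from collections import Counter, defaultdict
from math import e


def simulate(sigma, k, n, seed=0):
    """Return (static hits, distinct k-tuples, baseline hits) for a random S."""
    rng = random.Random(seed)
    seq = [rng.randrange(sigma) for _ in range(n)]

    # Optimal static predictor: n * P_k(S) correct guesses.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[tuple(seq[i - k:i])][seq[i]] += 1
    static_hits = k + sum(max(c.values()) for c in followers.values())

    # Number of distinct k-tuples occurring in S.
    distinct = len({tuple(seq[i:i + k]) for i in range(n - k + 1)})

    # Baseline without advance knowledge: guess the previous element again.
    baseline_hits = sum(seq[i] == seq[i - 1] for i in range(1, n))
    return static_hits, distinct, baseline_hits


if __name__ == "__main__":
    sigma, k = 4, 6
    n = sigma ** k                      # so that k = log_sigma n and c = 1
    static_hits, distinct, baseline_hits = simulate(sigma, k, n)
    print("P_k(S)             ", static_hits / n)
    print("distinct k-tuples/n", distinct / n, "vs 1 - 1/e =", 1 - 1 / e)
    print("baseline rate      ", baseline_hits / n, "vs 1/sigma =", 1 / sigma)
```

In such runs the static predictor's success rate should lie well above $1 - 1/e$, while the baseline stays near $1/\sigma$.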