Time-Universal Data Compression

Abstract: Nowadays, a variety of data compressors (or archivers) is available, each of which has its merits, and it is impossible to single out the best ones. Thus, one faces the problem of choosing the best method to compress a given file, and this problem becomes more important the larger the file is. It seems natural to try all the compressors and then choose the one that gives the shortest compressed file, then transfer (or store) the index number of the best compressor (it requires log m bits, if m is the number of compressors available) and the compressed file. The only problem is the time, which essentially increases due to the need to compress the file m times (in order to find the best compressor). We suggest a method of data compression whose performance is close to optimal, but for which the extra time needed is relatively small: the ratio of this extra time to the total time of calculation can be limited, in an asymptotic manner, by an arbitrary positive constant. In short, the main idea of the suggested approach is as follows: in order to find the best, try all the data compressors, but, when doing so, use for compression only a small part of the file. Then apply the best data compressor to the whole file. Note that there are many situations where it may be necessary to find the best data compressor out of a given set. In such a case, it is often done by comparing compressors empirically. One of the goals of this work is to turn such a selection process into a part of the data compression method, automating and optimizing it.


Introduction
Nowadays lossless data compressors, or archivers, are widely used in systems of information transmission and storage. Modern data compressors are based on the results of the theory of source coding, as well as on the experience and intuition of their developers. Among the theoretical results, we note, first of all, such deep concepts as entropy, information, and methods of source coding discovered by Shannon [1]. The next important step was made by Fitingoff [2] and Kolmogorov [3], who described the first universal code, as well as by Krichevsky, who described the first such code with minimal redundancy [4]. Practically used data compressors are now based on the PPM universal code [5] (which is used along with the arithmetic code [6]), the Lempel-Ziv (LZ) compression methods [7], the Burrows-Wheeler transform [8] (which is used along with the book-stack (or MTF) code [9][10][11]), the class of grammar-based codes [12,13] and some others [14][15][16]. All these codes are universal. This means that, asymptotically, the length of the compressed file goes to the smallest possible value (i.e., the Shannon entropy per letter), if the compressed sequence is generated by a stationary source.
In particular, the universality of practically used codes means that we cannot compare their performance theoretically, because all of them have the same limit compression ratio. On the other hand, the experiments show that the performance of different data compressors depends on the file being compressed, and it is impossible to single out the best ones or even remove the worst ones. Thus, there is no theoretical or experimental way to select the best data compressors for practical use. Hence, if someone is going to compress a file, he should first select the appropriate data compressor, preferably giving the best compression. The following obvious two-step method can be applied: first, try all available compressors and choose the one that gives the shortest compressed file. Then store a binary representation of its number followed by the compressed file. When decoding, the decoder first reads the number of the selected data compressor, and then decodes the rest of the file with the selected data compressor. An obvious drawback of this approach is the need to spend a lot of time in order to first compress the file with all the compressors.
In this paper we show that there exists a method that encodes the file with a (close to) optimal compressor, but uses a relatively small amount of extra time. In short, the main idea of the suggested approach is as follows: in order to find the best, try all the compressors, but, when doing so, use for compression only a small part of the file. Then apply the best data compressor to compress the whole file. Based on experiments and some theoretical considerations, we can say that under certain conditions this procedure is quite effective. That is why we call such methods "time-universal." It is important to note that the problems of data compression and time series prediction are very close mathematically (see, for example, [17]). That is why the proposed approach can be directly applied to time series forecasting.
To the best of our knowledge, the suggested approach to data compression is new, but the idea of organizing the computation of several algorithms in such a way that each of them runs during certain intervals of time, with the course of the computation depending on intermediate results, is widely used in the theory of algorithms, randomness testing and artificial intelligence; see [18][19][20][21].

The Statement of the Problem and Preliminary Example
Let there be a set of data compressors F = {ϕ_1, ϕ_2, ...} and let x_1x_2... be a sequence of letters from a finite alphabet A, whose initial part x_1...x_n should be compressed by some ϕ ∈ F. Let v_i be the time spent on encoding one letter by the data compressor ϕ_i, and suppose that all v_i are upper-bounded by a certain constant v_max, i.e., sup_{i=1,2,...} v_i ≤ v_max. (It is possible that v_i is unknown beforehand.) The considered task is to find a data compressor from F which compresses x_1...x_n in such a way that the total time spent on all calculations and compressions does not exceed T(1 + δ) for some δ > 0. Note that T = v_max n is the minimum time that must be reserved for the compression, and δT is the additional time that can be used to find a good compressor (among ϕ_1, ϕ_2, ...). It is important to note that we can estimate δ without knowing the speeds v_1, v_2, ....
If the number of data compressors in F is finite, say F = {ϕ_1, ϕ_2, ..., ϕ_m}, m ≥ 2, and one chooses ϕ_k to compress the file x_1x_2...x_n, he can use the following two-step procedure: encode the file as ⟨k⟩ ϕ_k(x_1x_2...x_n), where ⟨k⟩ is the ⌈log m⌉-bit binary presentation of k. (The decoder first reads ⌈log m⌉ bits and finds k, then it finds x_1x_2...x_n by decoding ϕ_k(x_1x_2...x_n).) Now our goal is to generalize this approach to the case of an infinite F = {ϕ_1, ϕ_2, ...}. For this purpose we take a probability distribution ω = ω_1, ω_2, ... such that all ω_i > 0. The following is an example of such a distribution: ω_k = 1/(k(k + 1)), k = 1, 2, .... Clearly, it is a probability distribution, because ω_k = 1/k − 1/(k + 1) and, hence, the sum ∑_k ω_k telescopes to 1. Now we should take into account the length of the codeword which presents the number k, because those lengths must be different for different k. So, we should find such a ϕ_k that the value −log ω_k + |ϕ_k(x_1x_2...x_n)| is close to minimal. As earlier, the first part, −log ω_k, is used for encoding the number k (codes achieving this are well known; see, e.g., [22]). The decoder first finds k and then x_1x_2...x_n using the decoder corresponding to ϕ_k. Based on this consideration, we give the following

Definition 1. We call any method that encodes a sequence x_1x_2...x_n, n ≥ 1, x_i ∈ A, by a binary word of the length −log ω_j + |ϕ_j(x_1x_2...x_n)| for some ϕ_j ∈ F, a time-adaptive code, and denote it by Φ^δ_compr. The output of Φ^δ_compr is the word ⟨ω_j⟩ ϕ_j(x_1x_2...x_n), where ⟨ω_j⟩ is the ⌈−log ω_j⌉-bit word that encodes j, whereas the time of encoding is not greater than T(1 + δ) (here T = v_max n). If for a time-adaptive code Φ^δ_compr the following equation is valid:

lim_{n→∞} (1/n) ( |Φ^δ_compr(x_1x_2...x_n)| − min_{ϕ_i ∈ F} ( −log ω_i + |ϕ_i(x_1x_2...x_n)| ) ) = 0,

we call it time-universal.

Comment 1. It will be convenient to reckon that the whole sequence is compressed not letter-by-letter, but by sub-words, each of which is, say, a few kilobytes in length. More formally, let, as before, there be a sequence x_1x_2..., where x_i, i = 1, 2, ..., are sub-words whose length (say, L) can be a few kilobytes.
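As a minimal sketch of this bookkeeping (the helper names are ours, not from the text): the distribution ω_k = 1/(k(k + 1)) assigns the compressor index k a codeword of ⌈−log2 ω_k⌉ bits, and the quantity to be minimized over k is the sum of this index cost and the compressed length.

```python
import math

def omega(k: int) -> float:
    # omega_k = 1/(k(k+1)) = 1/k - 1/(k+1); the sum over k >= 1 telescopes to 1.
    return 1.0 / (k * (k + 1))

def index_code_length(k: int) -> int:
    # Bits needed to encode the compressor index k: ceil(-log2 omega_k).
    return math.ceil(-math.log2(omega(k)))

def total_length(k: int, compressed_len_bits: int) -> int:
    # The value to minimize over k: -log omega_k + |phi_k(x_1...x_n)|.
    return index_code_length(k) + compressed_len_bits
```

For instance, k = 1 costs only 1 bit, while larger indices cost roughly 2 log2 k bits, so a compressor with a large index must compress noticeably better to be worth choosing.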
In this case x_i ∈ {0, 1}^{8L}. Comment 2. Here and below we do not take into account the time required for the calculation of log ω_i and some other auxiliary calculations. If in a certain situation this time is not negligible, it is possible to reduce T in advance by the required value.
This description and the following discussion are fairly formal, so we give a brief preliminary example of a time-adaptive code. To do this, we took 22 data compressors from [23] and 14 files of different lengths. For each file we applied the following three-step scheme: first we took 1% of the file and sequentially compressed it with all the data compressors. Then we selected the three best compressors, took 5% of the file, and sequentially compressed it with the three compressors selected. Finally, we selected the best of these compressors and compressed the file with this compressor. Thus, the total extra time is limited by 22 × 0.01 + 3 × 0.05 = 0.37, i.e., δ ≤ 0.37. Table 1 contains the obtained data. Table 2 shows that the larger the file, the better the compression. The following table gives some insight into the effect of the extra time. Here we used the same three-step scheme, but the sizes of the parts were 2% and 10% for the first and the second step, respectively, while the extra time was 0.74. From the tables it can be seen that the performance of the considered scheme increases significantly when the additional time increases. It is worth noting that if one applied all 22 data compressors to the whole file, the extra time would be 21 instead of 0.74.

Theoretical Consideration
Suppose that there is a file x_1x_2...x_n and data compressors ϕ_1, ..., ϕ_m, n ≥ 1, m ≥ 1. Let, as before, v_i be the time spent on encoding one letter by the data compressor ϕ_i. The goal is to find the data compressor ϕ_j, j = 1, ..., m, that compresses the file x_1x_2...x_n in the best way within the time T(1 + δ).
Apparently, the following two-step method is the simplest.
Step 1. Take the initial part x_1...x_r of the file, choosing r so that the extra time δT is not exceeded when this part is compressed by all m compressors.
Step 2. Calculate ϕ_1(x_1...x_r), ..., ϕ_m(x_1...x_r).
Step 3. Calculate s = arg min_{i=1,...,m} |ϕ_i(x_1...x_r)|.
Step 4. Compress the whole file x_1x_2...x_n by ϕ_s and compose the codeword ⟨s⟩ ϕ_s(x_1...x_n), where ⟨s⟩ is the ⌈log m⌉-bit word with the presentation of s.
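The steps above can be sketched as follows. The stdlib compressors zlib, bz2 and lzma stand in for F = {ϕ_1, ..., ϕ_m}, and the prefix length r = δn/m is our illustrative choice for keeping the trial compressions within roughly δT; neither is prescribed by the text.

```python
import bz2
import lzma
import math
import zlib

# Stand-ins for the set F = {phi_1, ..., phi_m}; any byte-level compressors work.
COMPRESSORS = {1: zlib.compress, 2: bz2.compress, 3: lzma.compress}

def two_step_compress(data: bytes, delta: float = 0.6):
    m = len(COMPRESSORS)
    # Step 1: take a prefix of about delta*n/m letters, so that running all m
    # compressors on it costs at most roughly delta * T extra time.
    r = max(1, int(delta * len(data) / m))
    prefix = data[:r]
    # Step 2: compress the prefix with every compressor.
    trial = {k: len(f(prefix)) for k, f in COMPRESSORS.items()}
    # Step 3: s = argmin_i |phi_i(x_1...x_r)|.
    s = min(trial, key=trial.get)
    # Step 4: compress the whole file with phi_s; the codeword would be the
    # ceil(log2 m)-bit index of s followed by the compressed file.
    index_bits = math.ceil(math.log2(m))
    return s, index_bits, COMPRESSORS[s](data)
```

With m = 3 the index costs ⌈log2 3⌉ = 2 bits, and the extra work is one pass of each compressor over an r-letter prefix instead of the whole file.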
It will be shown that even this simple method is time-universal. On the other hand, there are a lot of quite reasonable approaches to building time-adaptive codes. For example, it could be natural to try a three-step procedure, which was considered in the previous section (see Tables 1 and 2), as well as many other versions. Probably, it could be useful to use multidimensional optimization approaches, such as machine learning, so-called deep learning, etc. That is why we consider only some general conditions needed for time-universality.
Let us give some needed definitions. Suppose a time-adaptive data compressor Φ is applied to x = x_1...x_t. For any ϕ_i we define τ_i(t) = max{r : ϕ_i(x_1...x_r) was calculated when the extra time δT was exhausted}.
(iii) for any t the method Φ(x_1...x_t) uses a compressor ϕ_s whose performance on the tested part is the best, i.e., for any i,

|ϕ_s(x_1...x_{τ_s(t)})| / τ_s(t) ≤ |ϕ_i(x_1...x_{τ_i(t)})| / τ_i(t).

Then Φ(x_1...x_n) is time-universal, that is,

lim_{n→∞} (1/n) ( |Φ(x_1...x_n)| − min_i ( −log ω_i + |ϕ_i(x_1...x_n)| ) ) = 0.

A proof is given in Appendix A, but here we give some informal comments. First, note that property (i) means that any data compressor will participate in the competition to find the best one. Second, if the sequence x_1x_2... is generated by a stationary source and all the ϕ_i are universal codes, then property (iii) is valid with probability 1 (see, for example, [22]). Hence, the theorem is valid for this case. Besides, note that this theorem is valid for the methods described earlier.

Experiments
We conducted several experiments to evaluate the effectiveness of the proposed approach in practice. For this purpose we took 20 data compressors from the "squeeze chart (lossless data compression benchmarks)", http://www.squeezechart.com/index.html, and files from http://corpus.canterbury.ac.nz/descriptions/ and http://tolstoy.ru/creativity/90-volume-collection-of-theworks/ (information about their sizes is given in the tables below). It is worth noting that we did not change the collection of data compressors or the files during the experiments. The results are presented in the following tables, where the expression "worst/best" means the ratio of the longest length of the compressed file to the shortest one (over the different data compressors). More formally, worst/best = max_{i,j=1,...,20} (|ϕ_i|/|ϕ_j|). The expression "chosen/best" is the similar value for the chosen data compressor and the best one. The ratio "chosen best" is the frequency of occurrence of the event "the best compressor was selected". Table 3 shows the results of the two-step method, where we took 3% in the first step. Thus, the total extra time is limited by 20 × 0.03 = 0.6, i.e., δ ≤ 0.6. Table 3. Two-step compression. Extra time δ = 20 × 0.03 = 0.6. Here the ratio "chosen best" means the proportion of cases in which the best method was chosen. Table 4 shows the effect of the extra time δ on the efficiency of the method (in this case we took 5% in the first step). Table 4. Two-step compression. Extra time δ = 20 × 0.05 = 1. Table 5 contains information about the three-step method. Here we took 3% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 5% of the file. Hence, the extra time equals 20 × 0.03 + 5 × 0.05 = 0.85. Table 5. Three-step compression. Extra time δ = 20 × 0.03 + 5 × 0.05 = 0.85. Table 6 gives an example of the four-step method.
Here we took 1% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 2% of each file. Based on the obtained data, we chose the three best and tested them on 5% parts. Finally, the best of them was used for compression of the whole file. Hence, the extra time equals 20 × 0.01 + 5 × 0.02 + 3 × 0.05 = 0.45. Table 6. Four-step compression. Extra time δ = 20 × 0.01 + 5 × 0.02 + 3 × 0.05 = 0.45. If we compare Table 6 and Table 3, we can see that the performance of the four-step method is better than that of the two-step method, while the extra time is significantly smaller for the four-step method. The same is valid for the considered example of the three-step method.
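In all these experiments the extra time has the same form: at each step, a certain number of compressors is run on a certain fraction of the file, so δ is the sum of the products. A one-line helper (ours) reproduces the δ values quoted above:

```python
def extra_time(step_counts, step_fractions):
    """Extra time delta of a multi-step scheme: at each step, `count` compressors
    are each run on a `fraction` of the file, so delta = sum(count * fraction)."""
    return sum(c * f for c, f in zip(step_counts, step_fractions))
```

For example, the four-step scheme gives extra_time([20, 5, 3], [0.01, 0.02, 0.05]) = 0.45, smaller than the 0.6 of the two-step scheme with a 3% prefix.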

(The columns of the tables above are: Length of File (bytes); Number of Files; Ratio "Chosen Best"; Average "Worst/Best"; Average "Chosen/Best".)
We can see that the three- and four-step methods make sense because they make it possible to reduce the additional time while maintaining the better quality of the method. Also, we can draw another important conclusion. All tables show that the method is more efficient for large files. Indeed, the ratio "chosen best" increases and the average value "chosen/best" decreases as the file length increases. Moreover, the average value "worst/best" increases as the file length increases.

The Time-Universal Code for Stationary Ergodic Sources
In this section we describe a time-universal code for stationary sources. It is based on the optimal universal codes for Markov chains developed by Krichevsky [4,24] and on the twice-universal code [25]. Denote by M_i, i = 1, 2, ..., the set of Markov chains with memory (connectivity) i, and let M_0 be the set of Bernoulli sources. For a stationary ergodic µ and an integer r we denote by h_r(µ) the r-order entropy (per letter) and let h_∞(µ) be the limit entropy; see [22] for definitions.
Krichevsky [4,24] described the codes ψ_0, ψ_1, ..., which are asymptotically optimal for M_0, M_1, ..., correspondingly. If the sequence x_1x_2...x_t, x_i ∈ A, is generated by a source µ ∈ M_i, the following inequality is valid almost surely (a.s.) as t grows:

|ψ_i(x_1...x_t)|/t − h_i(µ) ≤ C (log t)/t.    (8)

(Here C is a constant.) The length of a codeword of the twice-universal code ρ is defined as the following "mixture":

|ρ(x_1...x_t)| = −log ∑_{i=0}^∞ ω_{i+1} 2^{−|ψ_i(x_1...x_t)|}.    (9)

(It is well known in information theory [22] that there exists a code with such codeword lengths, because ∑_{x_1...x_t ∈ A^t} 2^{−|ρ(x_1...x_t)|} = 1.) This code is called twice-universal because for any M_i, i = 0, 1, ..., and µ ∈ M_i the inequality (8) is valid (with a different C). Besides, for any stationary ergodic source µ, a.s.

lim_{t→∞} |ρ(x_1...x_t)|/t = h_∞(µ).
Let us estimate the time of the calculations necessary when using ρ. First, note that it suffices to sum a finite number of terms in (9), because all the terms 2^{−|ψ_i(x_1...x_t)|} are equal for i ≥ t. On the other hand, the number of different terms grows as t → ∞ and, hence, the encoder should calculate 2^{−|ψ_i(x_1...x_t)|} for a growing number of indices i. It is known [24] that the time spent on coding one letter is close for the different codes ψ_i. Hence, the time spent for encoding one letter by the code ρ grows to infinity as t grows. The time-universal code Ψ^δ described below has the same asymptotic performance, but the time it spends for encoding one letter is bounded by a constant.
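The mixture length in (9) can be computed directly from the component codeword lengths; the following sketch (function names ours) uses the distribution ω_i = 1/(i(i + 1)) introduced earlier:

```python
import math

def omega(i: int) -> float:
    # The same example distribution as before: omega_i = 1/(i(i+1)).
    return 1.0 / (i * (i + 1))

def mixture_length(component_lengths):
    """Codeword length of the twice-universal mixture (9):
    |rho| = -log2 sum_i omega_{i+1} * 2^(-|psi_i|),
    where component_lengths[i] = |psi_i(x_1...x_t)| in bits, i = 0, 1, ...."""
    total = sum(omega(i + 1) * 2.0 ** (-l) for i, l in enumerate(component_lengths))
    return -math.log2(total)
```

Note that the mixture never loses more than the index cost: |ρ| ≤ min_i (|ψ_i| − log2 ω_{i+1}), since the sum is at least its largest term.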
Step 1. Using the extra time δT, calculate ψ_0(x_1...x_t), ψ_1(x_1...x_t), ..., ψ_{m(t)}(x_1...x_t), where m(t) is the largest index for which these calculations fit into the available time.
Step 2. Find such a j, 0 ≤ j ≤ m(t), that the value −log ω_{j+1} + |ψ_j(x_1...x_t)| is minimal.
Step 3. Calculate the codeword ψ_j(x_1...x_t) and output ⟨j⟩ ψ_j(x_1...x_t), where ⟨j⟩ is the ⌈−log ω_{j+1}⌉-bit codeword of j. The decoding is obvious.
Theorem 2. Let x_1x_2... be a sequence generated by a stationary ergodic source µ and let the code Ψ^δ be applied. Then this code is time-universal, i.e., a.s.

lim_{t→∞} |Ψ^δ(x_1...x_t)| / t = h_∞(µ).

Conflicts of Interest:
The author declares no conflict of interest.
Appendix A

Proof of Theorem 2. It is known in information theory [22] that h_r(µ) ≥ h_{r+1}(µ) ≥ h_∞(µ) for any r and (by definition) lim_{r→∞} h_r(µ) = h_∞(µ). Let ε > 0 and let r be an integer such that h_r(µ) − h_∞(µ) < ε. From (11) we can see that there exists such a t_1 that m(t) ≥ r if t ≥ t_1. Taking into account (8) and (11), we can see that there exists a t_2 for which a.s. | |ψ_r(x_1...x_t)|/t − h_r(µ) | < ε if t > t_2. From the description of Ψ^δ (Step 3) we can see that there exists such a t_3 > max{t_1, t_2} for which a.s.