Nowadays, a variety of data compressors (or archivers) is available, each of which has its merits, and it is impossible to single out the best one. Thus, one faces the problem of choosing the best method to compress a given file, and this problem becomes more important as the file grows larger. It seems natural to try all the compressors and choose the one that gives the shortest compressed file, then transfer (or store) the index number of the best compressor (this requires $\lceil \log m \rceil$ bits, if m is the number of compressors available) together with the compressed file. The only problem is the time, which increases substantially because the file must be compressed m times (in order to find the best compressor). We suggest a method of data compression whose performance is close to optimal, but for which the extra time needed is relatively small: the ratio of this extra time to the total time of calculation can be bounded, asymptotically, by an arbitrary positive constant. In short, the main idea of the suggested approach is as follows: in order to find the best compressor, try all the data compressors, but use only a small part of the file when doing so. Then apply the best data compressor to the whole file. Note that there are many situations where it may be necessary to find the best data compressor out of a given set. In such a case, it is often done by comparing compressors empirically. One of the goals of this work is to turn such a selection process into a part of the data compression method, automating and optimizing it.
data compression; universal coding; time-series forecasting
Nowadays lossless data compressors, or archivers, are widely used in systems of information transmission and storage. Modern data compressors are based on the results of the theory of source coding, as well as on the experience and intuition of their developers. Among the theoretical results, we note, first of all, such deep concepts as entropy, information, and methods of source coding discovered by Shannon [1]. The next important step was taken by Fitingof [2] and Kolmogorov [3], who described the first universal code, as well as by Krichevsky, who described the first such code with minimal redundancy [4].
The data compressors used in practice today are based on the PPM universal code [5] (which is used along with the arithmetic code [6]), the Lempel–Ziv (LZ) compression methods [7], the Burrows–Wheeler transform [8] (which is used along with the book-stack (or MTF) code [9,10,11]), the class of grammar-based codes [12,13] and some others [14,15,16]. All these codes are universal. This means that, asymptotically, the length of the compressed file per letter goes to the smallest possible value (i.e., the Shannon entropy per letter), if the compressed sequence is generated by a stationary source.
In particular, the universality of practically used codes means that we cannot compare their performance theoretically, because all of them have the same limit ratio of compression. On the other hand, experiments show that the performance of different data compressors depends on the file being compressed, and it is impossible to single out the best one or even to remove the worst ones. Thus, there is no theoretical or experimental way to select the best data compressor for practical use. Hence, anyone who is going to compress a file should first select an appropriate data compressor, preferably the one giving the best compression. The following obvious two-step method can be applied: first, try all available compressors and choose the one that gives the shortest compressed file. Then store a byte representation of its number followed by the compressed file. When decoding, the decoder first reads the number of the selected data compressor, and then decodes the rest of the file with the selected data compressor. An obvious drawback of this approach is the need to spend a lot of time in order to first compress the file with all the compressors.
In this paper we show that there exists a method that encodes the file with the (close to) optimal compressor, but uses a relatively small extra time. In short, the main idea of the suggested approach is as follows: in order to find the best, try all the compressors, but, when doing it, use for compression only a small part of the file. Then apply the best data compressor for the compression of the whole file. Based on experiments and some theoretical considerations, we can say that under certain conditions this procedure is quite effective. That is why we call such methods “time-universal.”
It is important to note that the problems of data compression and time series prediction are very close mathematically (see, for example, [17]). That is why the proposed approach can be directly applied to time series forecasting.
To the best of our knowledge, the suggested approach to data compression is new, but the idea of organizing the computation of several algorithms in such a way that each of them runs during certain intervals of time, and its schedule depends on intermediate results, is widely used in the theory of algorithms, randomness testing and artificial intelligence; see [18,19,20,21].
2. The Statement of the Problem and Preliminary Example
Let there be a set of data compressors $F = \{\varphi_1, \varphi_2, \ldots\}$ and let $x_1 x_2 \ldots$ be a sequence of letters from a finite alphabet A, whose initial part $x_1 \ldots x_n$ should be compressed by some $\varphi \in F$. Let $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$ and suppose that all $v_i$ are upper-bounded by a certain constant $v_{\max}$, i.e., $\sup_{i = 1, 2, \ldots} v_i \le v_{\max}$. (It is possible that $v_i$ is unknown beforehand.)
The considered task is to find a data compressor from F which compresses $x_1 \ldots x_n$ in such a way that the total time spent for all calculations and compressions does not exceed $T(1 + \delta)$ for some $\delta > 0$, where $T = v_{\max} n$. Note that $T$ is the minimum time that must be reserved for compression and $\delta T$ is the additional time that can be used to find a good compressor (among $\varphi_1, \varphi_2, \ldots$). It is important to note that we can estimate $\delta$ without knowing the speeds $v_1, v_2, \ldots$.
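If the set of data compressors F is finite, say, $F = \{\varphi_1, \ldots, \varphi_m\}$, $m \ge 2$, and one chooses $\varphi_k$ to compress the file $x_1 \ldots x_n$, one can use the following two-step procedure: encode the file as $\langle k \rangle \varphi_k(x_1 \ldots x_n)$, where $\langle k \rangle$ is the $\lceil \log m \rceil$-bit binary presentation of k. (The decoder first reads $\lceil \log m \rceil$ bits and finds k, then it finds $x_1 \ldots x_n$ by decoding $\varphi_k(x_1 \ldots x_n)$.) Now our goal is to generalize this approach to the case of infinite $F = \{\varphi_1, \varphi_2, \ldots\}$. For this purpose we take a probability distribution $\omega = \omega_1, \omega_2, \ldots$ such that all $\omega_i > 0$. The following is an example of such a distribution:
$$\omega_i = \frac{1}{i(i+1)}, \quad i = 1, 2, 3, \ldots$$
Clearly, it is a probability distribution, because $\omega_i = 1/i - 1/(i+1)$.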
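As a worked check, suppose the example distribution is $\omega_i = 1/(i(i+1))$ (an assumption made here for illustration; any distribution with all positive terms would serve the same purpose). Writing each term as a difference makes the sum telescope:

```latex
\sum_{i=1}^{\infty} \omega_i
  = \sum_{i=1}^{\infty} \left( \frac{1}{i} - \frac{1}{i+1} \right)
  = \lim_{N \to \infty} \left( 1 - \frac{1}{N+1} \right)
  = 1 .
```

All terms are positive and sum to one, so every index receives a finite codeword length.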
Now we should take into account the length of a codeword which presents the number k, because those lengths must be different for different k. So, we should find such $\varphi_k$ that the value
$$\lceil -\log \omega_k \rceil + |\varphi_k(x_1 \ldots x_n)|$$
is close to minimal. As earlier, the first part, $\lceil -\log \omega_k \rceil$ bits, is used for encoding the number k (codes achieving this are well known). The decoder first finds k and then recovers $x_1 \ldots x_n$ using the decoder corresponding to $\varphi_k$. Based on this consideration, we give the following definitions.
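We call any method that encodes a sequence $x_1 x_2 \ldots x_n$, $n \ge 1$, $x_i \in A$, by a binary word of length $\lceil -\log \omega_j \rceil + |\varphi_j(x_1 x_2 \ldots x_n)|$ for some $\varphi_j \in F$ a time-adaptive code and denote it by $\hat{\Phi}^{\delta}$. The output of $\hat{\Phi}^{\delta}$ is the following word:
$$\langle \omega_i \rangle \, \varphi_i(x_1 x_2 \ldots x_n),$$
where $\langle \omega_i \rangle$ is the $\lceil -\log \omega_i \rceil$-bit word that encodes i, whereas the time of encoding is not greater than $T(1 + \delta)$ (here $T = v_{\max} n$).
If for a time-adaptive code $\hat{\Phi}^{\delta}$ the following equation is valid
$$\lim_{t \to \infty} \frac{|\hat{\Phi}^{\delta}(x_1 x_2 \ldots x_t)|}{t} = \inf_{i = 1, 2, \ldots} \lim_{t \to \infty} \frac{|\varphi_i(x_1 x_2 \ldots x_t)|}{t},$$
this code is called time-universal.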
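The trade-off between the cost of naming a compressor and the length of its output can be sketched in a few lines. This is only an illustration under stated assumptions: the three standard-library codecs stand in for the compressors of the set F, the distribution is assumed to be 1/(k(k+1)), and the search is exhaustive, so it ignores the time budget that a time-adaptive code must respect. The function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-ins for the compressors in F; not the ones used in the paper.
COMPRESSORS = [zlib.compress, bz2.compress, lzma.compress]


def index_cost_bits(k: int) -> int:
    """ceil(-log2(omega_k)) for the assumed omega_k = 1 / (k * (k + 1))."""
    n = k * (k + 1)
    return (n - 1).bit_length()  # exact integer ceil(log2(n)) for n >= 1


def total_bits(k: int, data: bytes) -> int:
    """Index cost plus compressed length, in bits, for compressor number k (1-based)."""
    return index_cost_bits(k) + 8 * len(COMPRESSORS[k - 1](data))


def best_index(data: bytes) -> int:
    """Exhaustive choice of the k that minimises the total codeword length."""
    return min(range(1, len(COMPRESSORS) + 1), key=lambda k: total_bits(k, data))
```

Under this assumed distribution the index cost grows only logarithmically in k, so a compressor with a large index has to gain just a few bits on the data to win.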
It will be convenient to reckon that the whole sequence is compressed not letter-by-letter, but by sub-words, each of which is, say, a few kilobytes in length. More formally, let, as before, there be a sequence $x_1 x_2 \ldots$, where $x_i$, $i = 1, 2, \ldots$, are sub-words whose length (say, L) can be a few kilobytes. In this case $x_i \in \{0, 1\}^{8L}$ when L is measured in bytes.
Here and below we do not take into account the time required for the calculation of $\lceil -\log \omega_i \rceil$ and some other auxiliary calculations. If in a certain situation this time is not negligible, it is possible to reduce $\delta T$ in advance by the required value.
This description and the following discussion are fairly formal, so we give a brief preliminary example of a time-adaptive code. To do this, we took 22 data compressors and 14 files of different lengths. For each file we applied the following three-step scheme: first we took 1% of the file and sequentially compressed it with all the data compressors. Then we selected the three best compressors, took 5% of the file, and sequentially compressed it with the three compressors selected. Finally, we selected the best of these compressors and compressed the file with this compressor. Thus, the total extra time is limited to 22 × 0.01 + 3 × 0.05 = 0.37, i.e., $\delta = 0.37$. Table 1 contains the obtained data.
Table 2 shows that the larger the file, the better the compression. The following table gives some insight into the effect of the extra time. Here we used the same three-step scheme, but the size of the parts was 2% and 10% for the first step and the second, respectively, while the extra time was 0.74.
From the tables it can be seen that the performance of the considered scheme increases significantly when the additional time increases. It is worth noting that if one applied all 22 data compressors to the whole file, the extra time would be 21 instead of 0.74.
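The staged scheme used in this preliminary example can be sketched as a small tournament. This is a minimal sketch, not the experimental setup itself: the three standard-library codecs are stand-ins for the 22 compressors, and the names `staged_selection` and `COMPRESSORS` are hypothetical. Each stage compresses a prefix of the file with the surviving candidates and keeps the best few.

```python
import bz2
import lzma
import zlib

# Stand-in compressors (an assumption; the experiments used a different set).
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def staged_selection(data, compressors, stages):
    """Pick a compressor by successive elimination.

    stages is a list of (fraction_of_file, number_to_keep) pairs; the last
    pair should keep exactly one candidate.
    """
    candidates = list(compressors)
    for fraction, keep in stages:
        # Use only a prefix of the file for this round of the competition.
        sample = data[: max(1, int(fraction * len(data)))]
        candidates.sort(key=lambda name: len(compressors[name](sample)))
        candidates = candidates[:keep]
    return candidates[0]


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 5000
    # 1% of the file for everyone, then 5% for the two best of three.
    winner = staged_selection(text, COMPRESSORS, [(0.01, 2), (0.05, 1)])
    print(winner, len(COMPRESSORS[winner](text)))
```

The winner on a short prefix need not be the winner on the whole file, which is why the later stages use longer prefixes for fewer candidates.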
3. The Time-Universal Code for the Finite Set of Data Compressors
3.1. Theoretical Consideration
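Suppose that there is a file $x_1 x_2 \ldots x_n$ and data compressors $\varphi_1, \ldots, \varphi_m$, $m \ge 2$. Let, as before, $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$, $v_{\max} = \max_{i = 1, \ldots, m} v_i$, and $T = v_{\max} n$.
The goal is to find the data compressor $\varphi_j$, $j = 1, \ldots, m$, that compresses the file $x_1 \ldots x_n$ in the best way in time $T(1 + \delta)$.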
Apparently, the following two-step method is the simplest.
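Step 1. Calculate $r = \lfloor \delta n / m \rfloor$.
Step 2. Compress the file $x_1 \ldots x_r$ by $\varphi_1$ and find the length of the compressed file $|\varphi_1(x_1 \ldots x_r)|$, then, likewise, find $|\varphi_2(x_1 \ldots x_r)|$, etc.
Step 3. Calculate $s = \arg\min_{i = 1, \ldots, m} |\varphi_i(x_1 \ldots x_r)|$.
Step 4. Compress the whole file $x_1 \ldots x_n$ by $\varphi_s$ and compose the codeword $\langle s \rangle \varphi_s(x_1 \ldots x_n)$, where $\langle s \rangle$ is the $\lceil \log m \rceil$-bit word with the presentation of s.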
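The four steps above can be sketched as an encoder and decoder pair. The sketch rests on assumptions that should be read as such: the prefix length is taken to be the integer part of δn/m, which keeps the m trial compressions within δn letters in total; three standard-library codecs stand in for the compressors; and the index is stored in one whole byte rather than in ⌈log m⌉ bits. The names `CODECS`, `two_step_encode` and `two_step_decode` are hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-ins for the m compressors and their decoders (an assumption).
CODECS = [
    (zlib.compress, zlib.decompress),
    (bz2.compress, bz2.decompress),
    (lzma.compress, lzma.decompress),
]


def two_step_encode(data: bytes, delta: float) -> bytes:
    """Choose a codec on a short prefix, then compress the whole input with it."""
    m = len(CODECS)
    r = max(1, int(delta * len(data) / m))  # Step 1: assumed prefix length
    sample = data[:r]
    sizes = [len(enc(sample)) for enc, _ in CODECS]  # Step 2: trial compressions
    s = min(range(m), key=sizes.__getitem__)  # Step 3: index of the shortest
    return bytes([s]) + CODECS[s][0](data)  # Step 4: index byte + payload


def two_step_decode(blob: bytes) -> bytes:
    """Read the index byte and decode the rest with the matching codec."""
    s = blob[0]
    return CODECS[s][1](blob[1:])


if __name__ == "__main__":
    text = b"time-universal data compression " * 2000
    packed = two_step_encode(text, delta=0.6)
    print(packed[0], len(packed), two_step_decode(packed) == text)
```

With three codecs and δ = 0.6 each codec is tried on 20% of the input, so the trial work amounts to at most 60% of one full pass if every codec runs at the bounding speed.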
It will be shown that even this simple method is time-universal. On the other hand, there are many quite reasonable approaches to building time-adaptive codes. For example, it could be natural to try the three-step procedure considered in the previous section (see Table 1 and Table 2), as well as many other versions. It could probably also be useful to apply multidimensional optimization approaches, such as machine learning, so-called deep learning, etc. That is why we consider only some general conditions needed for time-universality.
Let us give some needed definitions. Suppose, a time-adaptive data-compressor is applied to . For any we define
Let there be an infinite word $x_1 x_2 \ldots$ and a time-adaptive method $\hat{\Phi}^{\delta}$ which is based on the finite set of data compressors $F = \{\varphi_1, \ldots, \varphi_m\}$. If its additional time of calculation is not greater than $\delta T$ and the following properties are valid:
(i) the limits exist for ,
(iii) for any t the method uses such a compressor for which, for any i
Then is time universal, that is
A proof is given in Appendix A, but here we give some informal comments. First, note that property (i) means that any data compressor will participate in the competition to find the best one. Second, if the sequence $x_1 x_2 \ldots$ is generated by a stationary source and all $\varphi_i$ are universal codes, then the property (iii) is valid with probability 1. Hence, this theorem is valid for this case. Besides, note that this theorem is valid for the methods described earlier.
We conducted several experiments to evaluate the effectiveness of the proposed approach in practice. For this purpose we took 20 data compressors from the “squeeze chart (lossless data compression benchmarks)”, http://www.squeezechart.com/index.html and files from the sites http://corpus.canterbury.ac.nz/descriptions/ and http://tolstoy.ru/creativity/90-volume-collection-of-the-works/ (information about their size is given in the tables below). It is worth noting that we did not change the collection of data compressors and files during the experiments. The results are presented in the following tables, where the expression “worst/best” means the ratio of the longest length of the compressed file to the shortest one (over the different data compressors). More formally, it is $\max_{i = 1, \ldots, m} |\varphi_i(x_1 \ldots x_n)| / \min_{i = 1, \ldots, m} |\varphi_i(x_1 \ldots x_n)|$. The expression “chosen/best” is the analogous ratio for the chosen data compressor and the best one. The ratio “chosen best” is the frequency of occurrence of the event “the best compressor was selected”.
Table 3 shows the results of the two-step method, where we took 3% in the first step. Thus, the total extra time is limited to 20 × 0.03 = 0.6, i.e., $\delta = 0.6$.
Here the ratio “chosen best” is the proportion of cases in which the best method was chosen.
Table 4 shows the effect of the extra time on the efficiency of the method (In this case we took 5% in the first step).
Table 5 contains information about the three-step method. Here we took 3% in the first step and then took five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 5% from the file. Hence, the extra time equals 20 × 0.03 + 5 × 0.05 = 0.85.
Table 6 gives an example of a four-step method. Here we took 1% in the first step and then took five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 2% from each file. Based on the obtained data, we chose the three best and tested them on 5% parts. At last, the best of them was used for compression of the whole file. Hence, the extra time equals 20 × 0.01 + 5 × 0.02 + 3 × 0.05 = 0.45.
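The extra-time bookkeeping for these multi-step schemes is simple enough to automate. The sketch below assumes, as the text does, that testing c compressors on p% of the file costs at most c × p% of T, since every compressor is bounded by the same per-letter time. The function name `extra_time` and the (percent, survivors) encoding of a schedule are hypothetical conveniences.

```python
def extra_time(num_compressors, stages):
    """Upper bound on the extra time, as a fraction of T.

    stages is a list of (percent_of_file, survivors) pairs: every current
    candidate is run on percent_of_file percent of the input, after which
    only `survivors` candidates remain for the next stage.
    """
    candidates = num_compressors
    total_percent = 0
    for percent, survivors in stages:
        total_percent += candidates * percent
        candidates = survivors
    return total_percent / 100


if __name__ == "__main__":
    # Schedules taken from the text: (number of compressors, stages).
    schedules = {
        "two-step, 3%": (20, [(3, 1)]),
        "four-step, 1% / 2% / 5%": (20, [(1, 5), (2, 3), (5, 1)]),
        "preliminary three-step, 1% / 5%": (22, [(1, 3), (5, 1)]),
    }
    for name, (m, stages) in schedules.items():
        print(name, extra_time(m, stages))
```

Percentages are kept as integers until the final division so that the sums stay exact.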
If we compare Table 6 and Table 3, we can see that the performance of the four-step method is better than that of the two-step method, while the extra time is significantly smaller for the four-step method. The same is valid for the considered example of the three-step method.
We can see that the three- and four-step methods make sense because they make it possible to reduce the additional time while maintaining better quality of the method. Also, we can draw another important conclusion. All tables show that the method is more efficient for large files. Indeed, the ratio “chosen/best” and the average value “chosen/best” decrease as the file length increases. Moreover, the average value “worst/best” increases as the file length increases.
4. The Time-Universal Code for Stationary Ergodic Sources
In this section we describe a time-universal code for stationary sources. It is based on optimal universal codes for Markov chains, developed by Krichevsky [4,24] and the twice-universal code . Denote by , the set of Markov chains with memory (connectivity) i, and let be the set of Bernoulli sources. For stationary ergodic and an integer r we denote by the r-order entropy (per letter) and let be the limit entropy; see for definitions .
Krichevsky [4,24] described the codes which are asymptotically optimal for , correspondingly. If the sequence , , is generated by a source , the following inequalities are valid almost surely (a.s.):
where t grows. (Here C is a constant.) The length of a codeword of the twice-universal code is defined as the following “mixture”:
(It is well-known in information theory  that there exists a code with such codeword lengths, because = .) This code is called twice-universal because for any , , and the equality (8) is valid (with different C). Besides, for any stationary ergodic source a.s.
Let us estimate the time of calculations necessary when using . First, note that it suffices to sum a finite number of terms in (9), because all the terms are equal for . On the other hand, the number of different terms grows, where and, hence, the encoder should calculate for growing number i’s. It is known  that the time spent on coding one letter is close for different codes .
Hence, the time spent on encoding one letter by the code grows to infinity when t grows. The time-universal code described below has the same asymptotic performance, but the time spent on encoding one letter is a constant.
In order to describe the time-universal code we give some definitions. Let, as before, v be an upper-bound of the time spent for encoding one letter by any , be the generated word,
Denote by the following method:
Step 1. Calculate and
Step 2. Find such a j that
Step 3. Calculate the codeword and output
where is the -bit codeword of j. The decoding is obvious.
Let be a sequence generated by a stationary source and the code be applied. Then this code is time-universal, i.e., a.s.
This research was funded by Russian Foundation for Basic Research grant number 18-29-03005.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A
Let and be such a data compressor that = . Having taken into account that the set of data compressors F is finite, we can see that for any there exists such that for all and
From (ii) we obtain that there exists such that for all . Let and be applied to . Suppose that a data-compressor was chosen, when was applied. Hence,
It is true for any , hence, . The theorem is proven. □
It is known in Information Theory  that ≥ ≥ for any r and (by definition) = . Let and r be such an integer that < . From (11) we can see that there exists such that if . Taking into account (8) and (11), we can see that there exists for which a.s. < if . From the description of (the step 3) we can see that there exists such for which a.s.
if . By definition,
Having taken into account that is an arbitrary number and two latest inequalities as well as the fact that a.s. = , we obtain (12). The theorem is proven. □
References
Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
Fitingof, B.M. Optimal encoding for unknown and changing statistics of messages. Probl. Inform. Transm. 1966, 2, 3–11.
Kolmogorov, A.N. Three approaches to the quantitative definition of information. Probl. Inform. Transm. 1965, 1, 3–11.
Krichevsky, R. A relation between the plausibility of information about a source and encoding redundancy. Probl. Inform. Transm. 1968, 4, 48–57.
Cleary, J.; Witten, I. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 1984, 32, 396–402.
Kieffer, J.C.; Yang, E.H. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 2000, 46, 737–754.
Yang, E.H.; Kieffer, J.C. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform. I. Without context models. IEEE Trans. Inf. Theory 2000, 46, 755–777.
Drmota, M.; Reznik, Y.A.; Szpankowski, W. Tunstall code, Khodak variations, and random walks. IEEE Trans. Inf. Theory 2010, 56, 2928–2937.
Ryabko, B. A fast on-line adaptive code. IEEE Trans. Inf. Theory 1992, 38, 1400–1404.