1. Introduction
Nowadays, lossless data compressors, or archivers, are widely used in systems of information transmission and storage. Modern data compressors are based on the results of the theory of source coding, as well as on the experience and intuition of their developers. Among the theoretical results, we note, first of all, such deep concepts as entropy, information, and the methods of source coding discovered by Shannon [1]. The next important step was made by Fitingoff [2] and Kolmogorov [3], who described the first universal code, as well as by Krichevsky, who described the first such code with minimal redundancy [4].
The data compressors in practical use today are based on the PPM universal code [5] (which is used along with the arithmetic code [6]), the Lempel–Ziv (LZ) compression methods [7], the Burrows–Wheeler transform [8] (which is used along with the book-stack, or move-to-front (MTF), code [9,10,11]), the class of grammar-based codes [12,13], and some others [14,15,16]. All these codes are universal. This means that, asymptotically, the length of the compressed file goes to the smallest possible value (i.e., the Shannon entropy per letter) if the compressed sequence is generated by a stationary source.
In particular, the universality of practically used codes means that their performance cannot be compared theoretically, because all of them attain the same limiting compression ratio. On the other hand, experiments show that the performance of different data compressors depends on the file being compressed, and it is impossible to single out the best one or even to remove the worst ones. Thus, there is no theoretical or experimental way to select the best data compressor in advance. Hence, if someone is going to compress a file, he should first select the appropriate data compressor, preferably the one giving the best compression. The following obvious two-step method can be applied: first, try all available compressors and choose the one that gives the shortest compressed file; then store a byte representation of its number followed by the compressed file. When decoding, the decoder first reads the number of the selected data compressor and then decodes the rest of the file with it. An obvious drawback of this approach is the need to spend a lot of time compressing the file with all the compressors first.
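For concreteness, the following sketch illustrates this baseline scheme; the three-compressor pool and the one-byte header are stand-ins chosen only for the illustration.

```python
import bz2
import lzma
import zlib

# A stand-in pool of compressors; real archivers would be used the same way.
COMPRESSORS = [zlib.compress, bz2.compress, lzma.compress]
DECOMPRESSORS = [zlib.decompress, bz2.decompress, lzma.decompress]

def compress_naive(data: bytes) -> bytes:
    """Compress the whole file with every method, keep the shortest output,
    and prefix it with a one-byte number of the chosen compressor."""
    candidates = [c(data) for c in COMPRESSORS]
    best = min(range(len(candidates)), key=lambda i: len(candidates[i]))
    return bytes([best]) + candidates[best]

def decompress_naive(blob: bytes) -> bytes:
    """Read the compressor number, then decode the rest of the file with it."""
    return DECOMPRESSORS[blob[0]](blob[1:])
```

The drawback noted above is visible here: selection alone costs as much as compressing the entire file once per available compressor.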
In this paper we show that there exists a method that encodes the file with a (close to) optimal compressor but uses relatively little extra time. In short, the main idea of the suggested approach is as follows: in order to find the best compressor, try all of them, but use only a small part of the file for these trials; then apply the best data compressor to the compression of the whole file. Based on experiments and some theoretical considerations, we can say that under certain conditions this procedure is quite effective. That is why we call such methods “time-universal.”
It is important to note that the problems of data compression and time series prediction are mathematically very close (see, for example, [17]). That is why the proposed approach can be directly applied to time series forecasting.
To the best of our knowledge, the suggested approach to data compression is new, but the idea of organizing the computation of several algorithms in such a way that each of them works during certain intervals of time, with their course depending on intermediate results, is widely used in the theory of algorithms, randomness testing, and artificial intelligence; see [18,19,20,21].
2. The Statement of the Problem and Preliminary Example
Let there be a set of data compressors $F = \{\varphi_1, \varphi_2, \ldots\}$ and let $x_1 x_2 \ldots$ be a sequence of letters from a finite alphabet $A$, whose initial part $x_1 \ldots x_t$ should be compressed by some $\varphi \in F$. Let $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$, and suppose that all $v_i$ are upper-bounded by a certain constant $v$, i.e., $v = \sup_i v_i$. (It is possible that $v$ is unknown beforehand.)
The considered task is to find a data compressor from $F$ which compresses $x_1 \ldots x_t$ in such a way that the total time spent for all calculations and compressions does not exceed $(1 + \delta)\, v\, t$ for some $\delta > 0$. Note that $v t$ is the minimum time that must be reserved for compression, and $\delta v t$ is the additional time that can be used to find a good compressor (among $\varphi_1, \varphi_2, \ldots$). It is important to note that we can estimate $\delta$ without knowing the speeds $v_i$.
If the number of data compressors $F$ is finite, say, $F = \{\varphi_1, \ldots, \varphi_n\}$, and one chooses $\varphi_k$ to compress the file $x_1 \ldots x_t$, he can use the following two-step procedure: encode the file as $\bar{k}\, \varphi_k(x_1 \ldots x_t)$, where $\bar{k}$ is the $\lceil \log n \rceil$-bit binary presentation of $k$. (The decoder first reads $\lceil \log n \rceil$ bits and finds $k$; then it finds $x_1 \ldots x_t$ by decoding $\varphi_k(x_1 \ldots x_t)$.) Now our goal is to generalize this approach to the case of infinite $F = \{\varphi_1, \varphi_2, \ldots\}$. For this purpose we take a probability distribution $\omega = \{\omega_1, \omega_2, \ldots\}$ such that all $\omega_i > 0$. The following is an example of such a distribution:
$$ \omega_i = \frac{1}{i(i+1)}, \quad i = 1, 2, \ldots $$
Clearly, it is a probability distribution, because $\sum_{i=1}^{\infty} \frac{1}{i(i+1)} = 1$.
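Indeed, the sum telescopes:

```latex
\sum_{i=1}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=1}^{\infty}\left(\frac{1}{i}-\frac{1}{i+1}\right)
  = \lim_{n\to\infty}\left(1-\frac{1}{n+1}\right)
  = 1 .
```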
Now we should take into account the length of the codeword which presents the number $k$, because those lengths must be different for different $k$. So, we should find such $\varphi_k$ that the value $\lceil -\log \omega_k \rceil + |\varphi_k(x_1 \ldots x_t)|$ is close to minimal (here and below $|u|$ denotes the length of the word $u$). As earlier, the first part of $\lceil -\log \omega_k \rceil$ bits is used for encoding the number $k$ (codes achieving this length are well known; see, e.g., [22]). The decoder first finds $k$ and then $x_1 \ldots x_t$ using the decoder corresponding to $\varphi_k$. Based on this consideration, we give the following
Definition 1. We call any method that encodes a sequence $x_1 \ldots x_t$, $x_i \in A$, $t \ge 1$, by a binary word of the length $\lceil -\log \omega_i \rceil + |\varphi_i(x_1 \ldots x_t)|$ for some $\varphi_i \in F$, a time-adaptive code, and denote it by $\Phi_\delta$. The output of $\Phi_\delta$ is the following word:
$$ \Phi_\delta(x_1 \ldots x_t) = \bar{i}\, \varphi_i(x_1 \ldots x_t), $$
where $\bar{i}$ is the $\lceil -\log \omega_i \rceil$-bit word that encodes $i$, whereas the time of encoding is not greater than $(1 + \delta)\, v\, t$ (here $\delta > 0$). If for a time-adaptive code $\Phi_\delta$ the following equation is valid
$$ \lim_{t \to \infty} \frac{1}{t} \Big( |\Phi_\delta(x_1 \ldots x_t)| - \min_{\varphi \in F} |\varphi(x_1 \ldots x_t)| \Big) = 0, $$
this code is called time-universal.

Comment 1. It will be convenient to reckon that the whole sequence is compressed not letter-by-letter, but by sub-words, each of which is, say, a few kilobytes in length. More formally, let, as before, there be a sequence $x_1 x_2 \ldots$, where the $x_i \in A^L$ are sub-words whose length (say, $L$) can be a few kilobytes. In this case the alphabet $A$ is replaced by $A^L$.
Comment 2. Here and below we do not take into account the time required for the calculation of the lengths $|\varphi_i(\cdot)|$ and some other auxiliary calculations. If in a certain situation this time is not negligible, it is possible to reduce $\delta$ in advance by the required value.
This description and the following discussion are fairly formal, so we give a brief preliminary example of a time-adaptive code. To do this, we took 22 data compressors from [23] and 14 files of different lengths. For each file we applied the following three-step scheme: first, we took 1% of the file and sequentially compressed it with all the data compressors. Then we selected the three best compressors, took 5% of the file, and sequentially compressed it with the three compressors selected. Finally, we selected the best of these compressors and compressed the whole file with it. Thus, the total extra time is limited by 22 × 0.01 + 3 × 0.05 = 0.37, i.e., $\delta = 0.37$.
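A minimal sketch of such a multi-stage selection is given below; the small pool and the stage parameters (prefix fraction, number of survivors) stand in for the 22 archivers and the 1%/5% stages of the experiment.

```python
import bz2
import lzma
import zlib

POOL = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def staged_select(data: bytes, stages=((0.01, 3), (0.05, 1))):
    """Multi-stage selection: at each (prefix_fraction, keep) stage, compress
    a prefix of the file with every surviving compressor and keep the best."""
    survivors = list(POOL)
    for frac, keep in stages:
        prefix = data[: max(1, int(len(data) * frac))]
        survivors = sorted(survivors,
                           key=lambda name: len(POOL[name](prefix)))[:keep]
    winner = survivors[0]
    return winner, POOL[winner](data)  # finally, compress the whole file
```

With 22 compressors and these stage sizes, the trial work amounts to 22 × 0.01 + 3 × 0.05 = 0.37 of one full compression pass, which is exactly the $\delta = 0.37$ above.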
Table 1 contains the obtained data.
Table 2 shows that the larger the file, the better the compression, and it gives some insight into the effect of the extra time. Here we used the same three-step scheme, but the sizes of the parts were 2% and 10% for the first and second steps, respectively, while the extra time was 22 × 0.02 + 3 × 0.10 = 0.74.
From the tables it can be seen that the performance of the considered scheme increases significantly when the additional time increases. It is worth noting that if one applied all 22 data compressors to the whole file, the extra time would be 21 instead of 0.74.
3. The Time-Universal Code for the Finite Set of Data Compressors
3.1. Theoretical Consideration
Suppose that there is a file $x_1 \ldots x_t$ and data compressors $\varphi_1, \ldots, \varphi_n$. Let, as before, $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$, $i = 1, \ldots, n$, and let $v = \max_{i = 1, \ldots, n} v_i$. The goal is to find the data compressor $\varphi_s$, $s \in \{1, \ldots, n\}$, that compresses the file in the best way in time $(1 + \delta)\, v\, t$.
Apparently, the following method, consisting of a trial phase and a final compression, is the simplest.
Step 1. Calculate $r = \lfloor \delta t / n \rfloor$.
Step 2. Compress the file $x_1 \ldots x_r$ by $\varphi_1$ and find the length of the compressed file $|\varphi_1(x_1 \ldots x_r)|$; then, likewise, find $|\varphi_2(x_1 \ldots x_r)|$, etc.
Step 3. Calculate $s = \arg\min_{i = 1, \ldots, n} |\varphi_i(x_1 \ldots x_r)|$.
Step 4. Compress the whole file by $\varphi_s$ and compose the codeword $\bar{s}\, \varphi_s(x_1 \ldots x_t)$, where $\bar{s}$ is the $\lceil \log n \rceil$-bit word with the presentation of $s$.
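A direct rendering of Steps 1–4 follows; the prefix length $r = \lfloor \delta t / n \rfloor$ is the reconstruction used above, and a one-byte header replaces the $\lceil \log n \rceil$-bit word for simplicity.

```python
import bz2
import lzma
import zlib

COMPRESSORS = [zlib.compress, bz2.compress, lzma.compress]

def time_adaptive_encode(data: bytes, delta: float = 0.6) -> bytes:
    n, t = len(COMPRESSORS), len(data)
    r = max(1, int(delta * t) // n)                    # Step 1: trial prefix length
    lengths = [len(c(data[:r])) for c in COMPRESSORS]  # Step 2: trial compressions
    s = min(range(n), key=lengths.__getitem__)         # Step 3: best trial result
    return bytes([s]) + COMPRESSORS[s](data)           # Step 4: header + full pass
```

Each of the $n$ compressors processes only $r \le \delta t / n$ letters during the trials, so the extra time is at most $n \cdot r \cdot v \le \delta v t$, as required.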
It will be shown that even this simple method is time-universal. On the other hand, there are many quite reasonable approaches to building time-adaptive codes. For example, it could be natural to try a three-step procedure, as considered in the previous section (see Table 1 and Table 2), as well as many other versions. It could probably also be useful to apply multidimensional optimization approaches, such as machine learning, so-called deep learning, etc. That is why we consider only some general conditions needed for time-universality.
Let us give some needed definitions. Suppose a time-adaptive data compressor $\Phi_\delta$ is applied to $x_1 x_2 \ldots$. For any $t$ we define $t_i(t)$ as the length of the prefix of $x_1 x_2 \ldots$ to which the compressor $\varphi_i$ has been applied up to time $t$.

Theorem 1. Let there be an infinite word $x_1 x_2 \ldots$ and a time-adaptive method $\Phi_\delta$ which is based on a finite set of data compressors $F = \{\varphi_1, \ldots, \varphi_n\}$. If its additional time of calculation is not greater than $\delta v t$ and the following properties are valid:

(i) the limits $\hat{h}_i = \lim_{t \to \infty} |\varphi_i(x_1 \ldots x_{t_i(t)})| / t_i(t)$ exist for $i = 1, \ldots, n$, where $t_i(t) \to \infty$ as $t \to \infty$,

(ii) $\hat{h}_i = \lim_{t \to \infty} |\varphi_i(x_1 \ldots x_t)| / t$ for $i = 1, \ldots, n$,

(iii) for any $t$ the method uses such a compressor $\varphi_j$ for which, for any $i$,
$$ \frac{|\varphi_j(x_1 \ldots x_{t_j(t)})|}{t_j(t)} \le \frac{|\varphi_i(x_1 \ldots x_{t_i(t)})|}{t_i(t)} . $$

Then $\Phi_\delta$ is time-universal, that is,
$$ \lim_{t \to \infty} \frac{|\Phi_\delta(x_1 \ldots x_t)|}{t} = \min_{i = 1, \ldots, n} \hat{h}_i . $$

A proof is given in Appendix A, but here we give some informal comments. First, note that property (i) means that any data compressor will participate in the competition to find the best one. Second, if the sequence $x_1 x_2 \ldots$ is generated by a stationary source and all $\varphi_i$ are universal codes, then property (iii) is valid with probability 1 (see, for example, [22]). Hence, the theorem is valid in this case. Besides, note that this theorem is valid for the methods described earlier.
3.2. Experiments
We conducted several experiments to evaluate the effectiveness of the proposed approach in practice. For this purpose we took 20 data compressors from the “squeeze chart (lossless data compression benchmarks)”, http://www.squeezechart.com/index.html, and files from the sites http://corpus.canterbury.ac.nz/descriptions/ and http://tolstoy.ru/creativity/90-volume-collection-of-the-works/ (information about their sizes is given in the tables below). It is worth noting that we did not change the collection of data compressors and files during the experiments. The results are presented in the following tables, where the expression “worst/best” means the ratio of the longest length of the compressed file to the shortest one (over the different data compressors); more formally, $\max_{i} |\varphi_i(x_1 \ldots x_t)| \,/\, \min_{i} |\varphi_i(x_1 \ldots x_t)|$. The expression “chosen/best” is the similar ratio for the chosen data compressor and the best one. The value “chosen best” is the frequency of occurrence of the event “the best compressor was selected”.
Table 3 shows the results of the two-step method, where we took 3% of the file in the first step. Thus, the total extra time is limited by 20 × 0.03 = 0.6, i.e., $\delta = 0.6$.
Here the ratio “chosen best” means the proportion of cases in which the best method was chosen.
Table 4 shows the effect of the extra time $\delta$ on the efficiency of the method (in this case we took 5% in the first step).
Table 5 contains information about the three-step method. Here we took 3% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 5% of the file. Hence, the extra time equals 20 × 0.03 + 5 × 0.05 = 0.85.
Table 6 gives an example of the four-step method. Here we took 1% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 2% from each file. Based on the obtained data, we chose the three best and tested them on 5% parts. At last, the best of them was used for compression of the whole file. Hence, the extra time equals 20 × 0.01 + 5 × 0.02 + 3 × 0.05 = 0.45.
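The extra-time accounting used in these examples is a weighted sum over the stages; the small helper below (the function name is ours) makes the arithmetic explicit.

```python
def extra_time(stages):
    """Trial work as a fraction of one full-file compression pass;
    `stages` lists (number_of_compressors, prefix_fraction) pairs."""
    return sum(n * frac for n, frac in stages)

# The four-step scheme of Table 6: 20 compressors on 1%, 5 on 2%, 3 on 5%.
assert abs(extra_time([(20, 0.01), (5, 0.02), (3, 0.05)]) - 0.45) < 1e-9
```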
If we compare Table 6 and Table 3, we can see that the performance of the four-step method is better than that of the two-step method, while the extra time is significantly less for the four-step method. The same is valid for the considered example of the three-step method.
We can see that the three- and four-step methods make sense because they make it possible to reduce the additional time while maintaining the better quality of the method. We can also draw another important conclusion: all tables show that the method is more efficient for large files. Indeed, the frequency “chosen best” and the average value “chosen/best” improve as the file lengths increase. Moreover, the average value “worst/best” increases as the file lengths increase.
4. The Time-Universal Code for Stationary Ergodic Sources
In this section we describe a time-universal code for stationary sources. It is based on the optimal universal codes for Markov chains developed by Krichevsky [4,24] and on the twice-universal code [25]. Denote by $M_i$, $i \ge 1$, the set of Markov chains with memory (connectivity) $i$, and let $M_0$ be the set of Bernoulli sources. For a stationary ergodic source $\mu$ and an integer $r$ we denote by $h_r(\mu)$ the $r$-order entropy (per letter), and let $h_\infty(\mu)$ be the limit entropy; see [22] for definitions.
Krichevsky [4,24] described the codes $K_0, K_1, K_2, \ldots$ which are asymptotically optimal for $M_0, M_1, M_2, \ldots$, correspondingly. If the sequence $x_1 \ldots x_t$, $t \ge 1$, is generated by a source $\mu \in M_r$, the following inequalities are valid almost surely (a.s.):
$$ h_\infty(\mu) \le \frac{|K_r(x_1 \ldots x_t)|}{t} \le h_\infty(\mu) + \frac{C \log t}{t} \quad (8) $$
as $t$ grows. (Here $C$ is a constant.)
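For intuition, the memory-0 case can be realized with the Krichevsky–Trofimov estimator, which codes each next letter with probability (count + 1/2)/(n + |A|/2); the sketch below computes the resulting ideal code length and is an illustration rather than the exact construction of [4,24].

```python
from math import log2

def kt_code_length(seq: str, alphabet: str) -> float:
    """Ideal code length (in bits) of a memory-0 Krichevsky-Trofimov code."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for n, ch in enumerate(seq):
        p = (counts[ch] + 0.5) / (n + len(alphabet) / 2)  # KT estimate
        bits += -log2(p)           # arithmetic coding attains this up to O(1)
        counts[ch] += 1
    return bits
```

For a Bernoulli(1/2) binary sequence, kt_code_length(...) / t approaches 1 bit per letter, in agreement with the O(log t / t) redundancy in (8).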
The length of a codeword of the twice-universal code $R$ is defined as the following “mixture”:
$$ |R(x_1 \ldots x_t)| = \Big\lceil - \log \sum_{i=0}^{\infty} \omega_{i+1}\, 2^{-|K_i(x_1 \ldots x_t)|} \Big\rceil . \quad (9) $$
(It is well known in information theory [22] that there exists a code with such codeword lengths, because $\sum_{u \in A^t} 2^{-|R(u)|} \le 1$.) This code is called twice-universal because for any $M_i$, $i \ge 0$, and $\mu \in M_i$ the relation (8) is valid (with a different $C$). Besides, for any stationary ergodic source $\mu$, a.s.
$$ \lim_{t \to \infty} \frac{|R(x_1 \ldots x_t)|}{t} = h_\infty(\mu) . $$
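Once the lengths $|K_i(x_1 \ldots x_t)|$ are known, the mixture (9) is straightforward to evaluate; a minimal sketch with the weights $\omega_i = 1/(i(i+1))$ introduced earlier (helper name ours):

```python
from math import ceil, log2

def mixture_code_length(k_lengths):
    """|R(x)| from (9): ceil(-log2 sum_i w_{i+1} 2^{-|K_i(x)|}),
    where k_lengths[i] = |K_i(x_1...x_t)| and w_k = 1/(k(k+1))."""
    total = sum(2.0 ** (-li) / ((i + 1) * (i + 2))  # w_{i+1} = 1/((i+1)(i+2))
                for i, li in enumerate(k_lengths))
    return ceil(-log2(total))
```

For long files one would sum in the logarithmic domain to avoid underflow of the terms $2^{-|K_i(x)|}$.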
Let us estimate the time of calculation necessary when using $R$. First, note that it suffices to sum a finite number of terms in (9), because all the terms $|K_i(x_1 \ldots x_t)|$ are equal for $i \ge t$. On the other hand, the number of different terms grows as $t \to \infty$, and, hence, the encoder should calculate $K_i(x_1 \ldots x_t)$ for a growing number of $i$’s. It is known [24] that the time spent on coding one letter is close for the different codes $K_0, K_1, K_2, \ldots$.
Hence, the time spent for encoding one letter by the code $R$ grows to infinity as $t$ grows. The time-universal code described below has the same asymptotic performance, but the time spent for encoding one letter is bounded by a constant.
In order to describe the time-universal code $\hat{R}_\delta$ we give some definitions. Let, as before, $v$ be an upper bound on the time spent for encoding one letter by any $K_i$, let $x_1 x_2 \ldots$ be the generated word, and choose the number of candidate codes $m$ and the prefix length $r$ in such a way that the time of the preliminary calculations does not exceed $\delta v t$ (for example, $r = \lfloor \delta t / (m + 1) \rfloor$). Denote by $\hat{R}_\delta$ the following method:

Step 1. Calculate $K_0(x_1 \ldots x_r), K_1(x_1 \ldots x_r), \ldots, K_m(x_1 \ldots x_r)$ and the corresponding lengths $|K_0(x_1 \ldots x_r)|, \ldots, |K_m(x_1 \ldots x_r)|$.

Step 2. Find such a $j$ that $\lceil -\log \omega_{j+1} \rceil + |K_j(x_1 \ldots x_r)|$ is minimal.

Step 3. Calculate the codeword $K_j(x_1 \ldots x_t)$ and output $\bar{j}\, K_j(x_1 \ldots x_t)$, where $\bar{j}$ is the $\lceil -\log \omega_{j+1} \rceil$-bit codeword of $j$. The decoding is obvious.
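A sketch of Steps 1–3 is given below. The order-$j$ code length is an illustrative KT-style stand-in for $|K_j(\cdot)|$ (a context-wise analogue of the estimator sketched after (8)), and splitting the extra-time budget evenly among the $m + 1$ candidate orders is our illustrative choice.

```python
from math import ceil, log2

def markov_code_length(seq: str, alphabet: str, order: int) -> float:
    """KT-style ideal code length with separate counts per context of
    `order` preceding letters; a stand-in for |K_order(seq)|."""
    counts, bits = {}, 0.0
    for pos in range(order, len(seq)):
        ctx = seq[pos - order:pos]
        c = counts.setdefault(ctx, dict.fromkeys(alphabet, 0))
        n = sum(c.values())
        bits += -log2((c[seq[pos]] + 0.5) / (n + len(alphabet) / 2))
        c[seq[pos]] += 1
    return bits

def select_order(x: str, alphabet: str, delta: float, m: int) -> int:
    """Steps 1-2: score K_0..K_m on a prefix that fits the time budget
    delta * len(x); Step 3 then encodes the whole of x with the winner."""
    r = max(m + 1, int(delta * len(x)) // (m + 1))  # prefix length per code
    prefix = x[:r]
    def score(j):
        header = ceil(-log2(1.0 / ((j + 1) * (j + 2))))  # ceil(-log w_{j+1})
        return header + markov_code_length(prefix, alphabet, order=j)
    return min(range(m + 1), key=score)
```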
Theorem 2. Let $x_1 x_2 \ldots$ be a sequence generated by a stationary ergodic source $\mu$ and let the code $\hat{R}_\delta$ be applied. Then this code is time-universal, i.e., a.s.
$$ \lim_{t \to \infty} \frac{|\hat{R}_\delta(x_1 \ldots x_t)|}{t} = h_\infty(\mu) . $$