1. Introduction
Nowadays, lossless data compressors, or archivers, are widely used in systems of information transmission and storage. Modern data compressors are based on the results of the theory of source coding, as well as on the experience and intuition of their developers. Among the theoretical results, we note, first of all, such deep concepts as entropy, information, and the methods of source coding discovered by Shannon [1]. The next important step was made by Fitingoff [2] and Kolmogorov [3], who described the first universal code, as well as by Krichevsky, who described the first such code with minimal redundancy [4].
The data compressors in practical use today are based on the PPM universal code [5] (which is used along with the arithmetic code [6]), the Lempel–Ziv (LZ) compression methods [7], the Burrows–Wheeler transform [8] (which is used along with the book-stack, or move-to-front (MTF), code [9,10,11]), the class of grammar-based codes [12,13], and some others [14,15,16]. All these codes are universal. This means that, asymptotically, the length of the compressed file goes to the smallest possible value (i.e., the Shannon entropy per letter) if the compressed sequence is generated by a stationary source.
In particular, the universality of practically used codes means that their performance cannot be compared theoretically, because all of them attain the same limiting compression ratio. On the other hand, experiments show that the performance of different data compressors depends on the file being compressed, and it is impossible to single out the best one or even to remove the worst ones. Thus, there is no theoretical or experimental way to select the best data compressor in advance. Hence, if someone is going to compress a file, he should first select the appropriate data compressor, preferably the one giving the best compression. The following obvious two-step method can be applied: first, try all available compressors and choose the one that gives the shortest compressed file; then store a byte representation of its number followed by the compressed file. When decoding, the decoder first reads the number of the selected data compressor and then decodes the rest of the file with it. An obvious drawback of this approach is the need to spend a lot of time compressing the file with all the compressors first.
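For concreteness, the following sketch illustrates this baseline scheme; the three-compressor pool and the one-byte header are stand-ins chosen only for the illustration.

```python
import bz2
import lzma
import zlib

# A stand-in pool of compressors; real archivers would be used the same way.
COMPRESSORS = [zlib.compress, bz2.compress, lzma.compress]
DECOMPRESSORS = [zlib.decompress, bz2.decompress, lzma.decompress]

def compress_naive(data: bytes) -> bytes:
    """Compress the whole file with every method, keep the shortest output,
    and prefix it with a one-byte number of the chosen compressor."""
    candidates = [c(data) for c in COMPRESSORS]
    best = min(range(len(candidates)), key=lambda i: len(candidates[i]))
    return bytes([best]) + candidates[best]

def decompress_naive(blob: bytes) -> bytes:
    """Read the compressor number, then decode the rest of the file with it."""
    return DECOMPRESSORS[blob[0]](blob[1:])
```

The drawback noted above is visible here: selection alone costs as much as compressing the entire file once per available compressor.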
In this paper we show that there exists a method that encodes the file with a (close to) optimal compressor but uses relatively little extra time. In short, the main idea of the suggested approach is as follows: in order to find the best compressor, try all of them, but use only a small part of the file for these trials; then apply the best data compressor to the compression of the whole file. Based on experiments and some theoretical considerations, we can say that under certain conditions this procedure is quite effective. That is why we call such methods “time-universal.”
It is important to note that the problems of data compression and time series prediction are mathematically very close (see, for example, [17]). That is why the proposed approach can be directly applied to time series forecasting.
To the best of our knowledge, the suggested approach to data compression is new, but the idea of organizing the computation of several algorithms in such a way that each of them works during certain intervals of time, with their course depending on intermediate results, is widely used in the theory of algorithms, randomness testing, and artificial intelligence; see [18,19,20,21].
2. The Statement of the Problem and Preliminary Example
Let there be a set of data compressors $F = \{\varphi_1, \varphi_2, \ldots\}$ and let $x_1 x_2 \ldots$ be a sequence of letters from a finite alphabet $A$, whose initial part $x_1 \ldots x_t$ should be compressed by some $\varphi \in F$. Let $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$, and suppose that all $v_i$ are upper-bounded by a certain constant $v$, i.e., $v = \sup_i v_i$. (It is possible that $v$ is unknown beforehand.)
The considered task is to find a data compressor from $F$ which compresses $x_1 \ldots x_t$ in such a way that the total time spent for all calculations and compressions does not exceed $(1 + \delta)\, v\, t$ for some $\delta > 0$. Note that $v t$ is the minimum time that must be reserved for compression, and $\delta v t$ is the additional time that can be used to find a good compressor (among $\varphi_1, \varphi_2, \ldots$). It is important to note that we can estimate $\delta$ without knowing the speeds $v_i$.
If the number of data compressors $F$ is finite, say, $F = \{\varphi_1, \ldots, \varphi_n\}$, and one chooses $\varphi_k$ to compress the file $x_1 \ldots x_t$, he can use the following two-step procedure: encode the file as $\bar{k}\, \varphi_k(x_1 \ldots x_t)$, where $\bar{k}$ is the $\lceil \log n \rceil$-bit binary presentation of $k$. (The decoder first reads $\lceil \log n \rceil$ bits and finds $k$; then it finds $x_1 \ldots x_t$ by decoding $\varphi_k(x_1 \ldots x_t)$.) Now our goal is to generalize this approach to the case of infinite $F = \{\varphi_1, \varphi_2, \ldots\}$. For this purpose we take a probability distribution $\omega = \{\omega_1, \omega_2, \ldots\}$ such that all $\omega_i > 0$. The following is an example of such a distribution:
$$ \omega_i = \frac{1}{i(i+1)}, \quad i = 1, 2, \ldots $$
Clearly, it is a probability distribution, because $\sum_{i=1}^{\infty} \frac{1}{i(i+1)} = 1$.
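Indeed, the sum telescopes:

```latex
\sum_{i=1}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=1}^{\infty}\left(\frac{1}{i}-\frac{1}{i+1}\right)
  = \lim_{n\to\infty}\left(1-\frac{1}{n+1}\right)
  = 1 .
```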
Now we should take into account the length of the codeword which presents the number $k$, because those lengths must be different for different $k$. So, we should find such $\varphi_k$ that the value $\lceil -\log \omega_k \rceil + |\varphi_k(x_1 \ldots x_t)|$ is close to minimal (here and below $|u|$ denotes the length of the word $u$). As earlier, the first part of $\lceil -\log \omega_k \rceil$ bits is used for encoding the number $k$ (codes achieving this length are well known; see, e.g., [22]). The decoder first finds $k$ and then $x_1 \ldots x_t$ using the decoder corresponding to $\varphi_k$. Based on this consideration, we give the following
Definition 1. We call any method that encodes a sequence $x_1 \ldots x_t$, $x_i \in A$, $t \ge 1$, by a binary word of the length $\lceil -\log \omega_i \rceil + |\varphi_i(x_1 \ldots x_t)|$ for some $\varphi_i \in F$, a time-adaptive code, and denote it by $\Phi_\delta$. The output of $\Phi_\delta$ is the following word:
$$ \Phi_\delta(x_1 \ldots x_t) = \bar{i}\, \varphi_i(x_1 \ldots x_t), $$
where $\bar{i}$ is the $\lceil -\log \omega_i \rceil$-bit word that encodes $i$, whereas the time of encoding is not greater than $(1 + \delta)\, v\, t$ (here $\delta > 0$). If for a time-adaptive code $\Phi_\delta$ the following equation is valid
$$ \lim_{t \to \infty} \frac{1}{t} \Big( |\Phi_\delta(x_1 \ldots x_t)| - \min_{\varphi \in F} |\varphi(x_1 \ldots x_t)| \Big) = 0, $$
this code is called time-universal.

Comment 1. It will be convenient to reckon that the whole sequence is compressed not letter-by-letter, but by sub-words, each of which is, say, a few kilobytes in length. More formally, let, as before, there be a sequence $x_1 x_2 \ldots$, where the $x_i \in A^L$ are sub-words whose length (say, $L$) can be a few kilobytes. In this case the alphabet $A$ is replaced by $A^L$.
Comment 2. Here and below we do not take into account the time required for the calculation of the lengths $|\varphi_i(\cdot)|$ and some other auxiliary calculations. If in a certain situation this time is not negligible, it is possible to reduce $\delta$ in advance by the required value.
This description and the following discussion are fairly formal, so we give a brief preliminary example of a time-adaptive code. To do this, we took 22 data compressors from [23] and 14 files of different lengths. For each file we applied the following three-step scheme: first, we took 1% of the file and sequentially compressed it with all the data compressors. Then we selected the three best compressors, took 5% of the file, and sequentially compressed it with the three compressors selected. Finally, we selected the best of these compressors and compressed the whole file with it. Thus, the total extra time is limited by 22 × 0.01 + 3 × 0.05 = 0.37, i.e., $\delta = 0.37$.
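A minimal sketch of such a multi-stage selection is given below; the small pool and the stage parameters (prefix fraction, number of survivors) stand in for the 22 archivers and the 1%/5% stages of the experiment.

```python
import bz2
import lzma
import zlib

POOL = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

def staged_select(data: bytes, stages=((0.01, 3), (0.05, 1))):
    """Multi-stage selection: at each (prefix_fraction, keep) stage, compress
    a prefix of the file with every surviving compressor and keep the best."""
    survivors = list(POOL)
    for frac, keep in stages:
        prefix = data[: max(1, int(len(data) * frac))]
        survivors = sorted(survivors,
                           key=lambda name: len(POOL[name](prefix)))[:keep]
    winner = survivors[0]
    return winner, POOL[winner](data)  # finally, compress the whole file
```

With 22 compressors and these stage sizes, the trial work amounts to 22 × 0.01 + 3 × 0.05 = 0.37 of one full compression pass, which is exactly the $\delta = 0.37$ above.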
Table 1 contains the obtained data.
Table 2 shows that the larger the file, the better the compression, and it gives some insight into the effect of the extra time. Here we used the same three-step scheme, but the sizes of the parts were 2% and 10% for the first and second steps, respectively, while the extra time was 22 × 0.02 + 3 × 0.10 = 0.74.
From the tables it can be seen that the performance of the considered scheme increases significantly when the additional time increases. It is worth noting that if one applied all 22 data compressors to the whole file, the extra time would be 21 instead of 0.74.
3. The Time-Universal Code for the Finite Set of Data Compressors
3.1. Theoretical Consideration
Suppose that there is a file $x_1 \ldots x_t$ and data compressors $\varphi_1, \ldots, \varphi_n$. Let, as before, $v_i$ be the time spent on encoding one letter by the data compressor $\varphi_i$, $i = 1, \ldots, n$, and let $v = \max_{i = 1, \ldots, n} v_i$. The goal is to find the data compressor $\varphi_s$, $s \in \{1, \ldots, n\}$, that compresses the file in the best way in time $(1 + \delta)\, v\, t$.
Apparently, the following method, consisting of a trial phase and a final compression, is the simplest.
Step 1. Calculate $r = \lfloor \delta t / n \rfloor$.
Step 2. Compress the file $x_1 \ldots x_r$ by $\varphi_1$ and find the length of the compressed file $|\varphi_1(x_1 \ldots x_r)|$; then, likewise, find $|\varphi_2(x_1 \ldots x_r)|$, etc.
Step 3. Calculate $s = \arg\min_{i = 1, \ldots, n} |\varphi_i(x_1 \ldots x_r)|$.
Step 4. Compress the whole file by $\varphi_s$ and compose the codeword $\bar{s}\, \varphi_s(x_1 \ldots x_t)$, where $\bar{s}$ is the $\lceil \log n \rceil$-bit word with the presentation of $s$.
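A direct rendering of Steps 1–4 follows; the prefix length $r = \lfloor \delta t / n \rfloor$ is the reconstruction used above, and a one-byte header replaces the $\lceil \log n \rceil$-bit word for simplicity.

```python
import bz2
import lzma
import zlib

COMPRESSORS = [zlib.compress, bz2.compress, lzma.compress]

def time_adaptive_encode(data: bytes, delta: float = 0.6) -> bytes:
    n, t = len(COMPRESSORS), len(data)
    r = max(1, int(delta * t) // n)                    # Step 1: trial prefix length
    lengths = [len(c(data[:r])) for c in COMPRESSORS]  # Step 2: trial compressions
    s = min(range(n), key=lengths.__getitem__)         # Step 3: best trial result
    return bytes([s]) + COMPRESSORS[s](data)           # Step 4: header + full pass
```

Each of the $n$ compressors processes only $r \le \delta t / n$ letters during the trials, so the extra time is at most $n \cdot r \cdot v \le \delta v t$, as required.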
It will be shown that even this simple method is time-universal. On the other hand, there are many quite reasonable approaches to building time-adaptive codes. For example, it could be natural to try a three-step procedure, as considered in the previous section (see Table 1 and Table 2), as well as many other versions. It could probably also be useful to apply multidimensional optimization approaches, such as machine learning, so-called deep learning, etc. That is why we consider only some general conditions needed for time-universality.
Let us give some needed definitions. Suppose a time-adaptive data compressor $\Phi_\delta$ is applied to $x_1 x_2 \ldots$. For any $t$ we define $t_i(t)$ as the length of the prefix of $x_1 x_2 \ldots$ to which the compressor $\varphi_i$ has been applied up to time $t$.

Theorem 1. Let there be an infinite word $x_1 x_2 \ldots$ and a time-adaptive method $\Phi_\delta$ which is based on a finite set of data compressors $F = \{\varphi_1, \ldots, \varphi_n\}$. If its additional time of calculation is not greater than $\delta v t$ and the following properties are valid:

(i) the limits $\hat{h}_i = \lim_{t \to \infty} |\varphi_i(x_1 \ldots x_{t_i(t)})| / t_i(t)$ exist for $i = 1, \ldots, n$, where $t_i(t) \to \infty$ as $t \to \infty$,

(ii) $\hat{h}_i = \lim_{t \to \infty} |\varphi_i(x_1 \ldots x_t)| / t$ for $i = 1, \ldots, n$,

(iii) for any $t$ the method uses such a compressor $\varphi_j$ for which, for any $i$,
$$ \frac{|\varphi_j(x_1 \ldots x_{t_j(t)})|}{t_j(t)} \le \frac{|\varphi_i(x_1 \ldots x_{t_i(t)})|}{t_i(t)} . $$

Then $\Phi_\delta$ is time-universal, that is,
$$ \lim_{t \to \infty} \frac{|\Phi_\delta(x_1 \ldots x_t)|}{t} = \min_{i = 1, \ldots, n} \hat{h}_i . $$

A proof is given in Appendix A, but here we give some informal comments. First, note that property (i) means that any data compressor will participate in the competition to find the best one. Second, if the sequence $x_1 x_2 \ldots$ is generated by a stationary source and all $\varphi_i$ are universal codes, then property (iii) is valid with probability 1 (see, for example, [22]). Hence, the theorem is valid in this case. Besides, note that this theorem is valid for the methods described earlier.
3.2. Experiments
We conducted several experiments to evaluate the effectiveness of the proposed approach in practice. For this purpose we took 20 data compressors from the “squeeze chart (lossless data compression benchmarks)”, http://www.squeezechart.com/index.html, and files from the sites http://corpus.canterbury.ac.nz/descriptions/ and http://tolstoy.ru/creativity/90-volume-collection-of-the-works/ (information about their sizes is given in the tables below). It is worth noting that we did not change the collection of data compressors and files during the experiments. The results are presented in the following tables, where the expression “worst/best” means the ratio of the longest length of the compressed file to the shortest one (over the different data compressors); more formally, $\max_{i} |\varphi_i(x_1 \ldots x_t)| \,/\, \min_{i} |\varphi_i(x_1 \ldots x_t)|$. The expression “chosen/best” is the similar ratio for the chosen data compressor and the best one. The value “chosen best” is the frequency of occurrence of the event “the best compressor was selected”.
Table 3 shows the results of the two-step method, where we took 3% of the file in the first step. Thus, the total extra time is limited by 20 × 0.03 = 0.6, i.e., $\delta = 0.6$.
Here the ratio “chosen best” means the proportion of cases in which the best method was chosen.
Table 4 shows the effect of the extra time $\delta$ on the efficiency of the method (in this case we took 5% in the first step).
Table 5 contains information about the three-step method. Here we took 3% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 5% of the file. Hence, the extra time equals 20 × 0.03 + 5 × 0.05 = 0.85.
Table 6 gives an example of the four-step method. Here we took 1% in the first step and then took the five data compressors with the best performance. Then, in the second step, we tested those five data compressors taking 2% from each file. Based on the obtained data, we chose the three best and tested them on 5% parts. At last, the best of them was used for compression of the whole file. Hence, the extra time equals 20 × 0.01 + 5 × 0.02 + 3 × 0.05 = 0.45.
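The extra-time accounting used in these examples is a weighted sum over the stages; the small helper below (the function name is ours) makes the arithmetic explicit.

```python
def extra_time(stages):
    """Trial work as a fraction of one full-file compression pass;
    `stages` lists (number_of_compressors, prefix_fraction) pairs."""
    return sum(n * frac for n, frac in stages)

# The four-step scheme of Table 6: 20 compressors on 1%, 5 on 2%, 3 on 5%.
assert abs(extra_time([(20, 0.01), (5, 0.02), (3, 0.05)]) - 0.45) < 1e-9
```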
If we compare Table 6 and Table 3, we can see that the performance of the four-step method is better than that of the two-step method, while the extra time is significantly less for the four-step method. The same is valid for the considered example of the three-step method.
We can see that the three- and four-step methods make sense because they make it possible to reduce the additional time while maintaining the better quality of the method. We can also draw another important conclusion: all tables show that the method is more efficient for large files. Indeed, the frequency “chosen best” and the average value “chosen/best” improve as the file lengths increase. Moreover, the average value “worst/best” increases as the file lengths increase.
4. The Time-Universal Code for Stationary Ergodic Sources
In this section we describe a time-universal code for stationary sources. It is based on the optimal universal codes for Markov chains developed by Krichevsky [4,24] and on the twice-universal code [25]. Denote by $M_i$, $i \ge 1$, the set of Markov chains with memory (connectivity) $i$, and let $M_0$ be the set of Bernoulli sources. For a stationary ergodic source $\mu$ and an integer $r$ we denote by $h_r(\mu)$ the $r$-order entropy (per letter), and let $h_\infty(\mu)$ be the limit entropy; see [22] for definitions.
Krichevsky [4,24] described the codes $K_0, K_1, K_2, \ldots$ which are asymptotically optimal for $M_0, M_1, M_2, \ldots$, correspondingly. If the sequence $x_1 \ldots x_t$, $t \ge 1$, is generated by a source $\mu \in M_r$, the following inequalities are valid almost surely (a.s.):
$$ h_\infty(\mu) \le \frac{|K_r(x_1 \ldots x_t)|}{t} \le h_\infty(\mu) + \frac{C \log t}{t} \quad (8) $$
as $t$ grows. (Here $C$ is a constant.)
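For intuition, the memory-0 case can be realized with the Krichevsky–Trofimov estimator, which codes each next letter with probability (count + 1/2)/(n + |A|/2); the sketch below computes the resulting ideal code length and is an illustration rather than the exact construction of [4,24].

```python
from math import log2

def kt_code_length(seq: str, alphabet: str) -> float:
    """Ideal code length (in bits) of a memory-0 Krichevsky-Trofimov code."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for n, ch in enumerate(seq):
        p = (counts[ch] + 0.5) / (n + len(alphabet) / 2)  # KT estimate
        bits += -log2(p)           # arithmetic coding attains this up to O(1)
        counts[ch] += 1
    return bits
```

For a Bernoulli(1/2) binary sequence, kt_code_length(...) / t approaches 1 bit per letter, in agreement with the O(log t / t) redundancy in (8).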
The length of a codeword of the twice-universal code $R$ is defined as the following “mixture”:
$$ |R(x_1 \ldots x_t)| = \Big\lceil - \log \sum_{i=0}^{\infty} \omega_{i+1}\, 2^{-|K_i(x_1 \ldots x_t)|} \Big\rceil . \quad (9) $$
(It is well known in information theory [22] that there exists a code with such codeword lengths, because $\sum_{u \in A^t} 2^{-|R(u)|} \le 1$.) This code is called twice-universal because for any $M_i$, $i \ge 0$, and $\mu \in M_i$ the relation (8) is valid (with a different $C$). Besides, for any stationary ergodic source $\mu$, a.s.
$$ \lim_{t \to \infty} \frac{|R(x_1 \ldots x_t)|}{t} = h_\infty(\mu) . $$
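Once the lengths $|K_i(x_1 \ldots x_t)|$ are known, the mixture (9) is straightforward to evaluate; a minimal sketch with the weights $\omega_i = 1/(i(i+1))$ introduced earlier (helper name ours):

```python
from math import ceil, log2

def mixture_code_length(k_lengths):
    """|R(x)| from (9): ceil(-log2 sum_i w_{i+1} 2^{-|K_i(x)|}),
    where k_lengths[i] = |K_i(x_1...x_t)| and w_k = 1/(k(k+1))."""
    total = sum(2.0 ** (-li) / ((i + 1) * (i + 2))  # w_{i+1} = 1/((i+1)(i+2))
                for i, li in enumerate(k_lengths))
    return ceil(-log2(total))
```

For long files one would sum in the logarithmic domain to avoid underflow of the terms $2^{-|K_i(x)|}$.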
Let us estimate the time of calculation necessary when using $R$. First, note that it suffices to sum a finite number of terms in (9), because all the terms $|K_i(x_1 \ldots x_t)|$ are equal for $i \ge t$. On the other hand, the number of different terms grows as $t \to \infty$, and, hence, the encoder should calculate $K_i(x_1 \ldots x_t)$ for a growing number of $i$’s. It is known [24] that the time spent on coding one letter is close for the different codes $K_0, K_1, K_2, \ldots$.
Hence, the time spent for encoding one letter by the code $R$ grows to infinity as $t$ grows. The time-universal code described below has the same asymptotic performance, but the time spent for encoding one letter is bounded by a constant.
In order to describe the time-universal code $\hat{R}_\delta$ we give some definitions. Let, as before, $v$ be an upper bound on the time spent for encoding one letter by any $K_i$, let $x_1 x_2 \ldots$ be the generated word, and choose the number of candidate codes $m$ and the prefix length $r$ in such a way that the time of the preliminary calculations does not exceed $\delta v t$ (for example, $r = \lfloor \delta t / (m + 1) \rfloor$). Denote by $\hat{R}_\delta$ the following method:

Step 1. Calculate $K_0(x_1 \ldots x_r), K_1(x_1 \ldots x_r), \ldots, K_m(x_1 \ldots x_r)$ and the corresponding lengths $|K_0(x_1 \ldots x_r)|, \ldots, |K_m(x_1 \ldots x_r)|$.

Step 2. Find such a $j$ that $\lceil -\log \omega_{j+1} \rceil + |K_j(x_1 \ldots x_r)|$ is minimal.

Step 3. Calculate the codeword $K_j(x_1 \ldots x_t)$ and output $\bar{j}\, K_j(x_1 \ldots x_t)$, where $\bar{j}$ is the $\lceil -\log \omega_{j+1} \rceil$-bit codeword of $j$. The decoding is obvious.
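A sketch of Steps 1–3 is given below. The order-$j$ code length is an illustrative KT-style stand-in for $|K_j(\cdot)|$ (a context-wise analogue of the estimator sketched after (8)), and splitting the extra-time budget evenly among the $m + 1$ candidate orders is our illustrative choice.

```python
from math import ceil, log2

def markov_code_length(seq: str, alphabet: str, order: int) -> float:
    """KT-style ideal code length with separate counts per context of
    `order` preceding letters; a stand-in for |K_order(seq)|."""
    counts, bits = {}, 0.0
    for pos in range(order, len(seq)):
        ctx = seq[pos - order:pos]
        c = counts.setdefault(ctx, dict.fromkeys(alphabet, 0))
        n = sum(c.values())
        bits += -log2((c[seq[pos]] + 0.5) / (n + len(alphabet) / 2))
        c[seq[pos]] += 1
    return bits

def select_order(x: str, alphabet: str, delta: float, m: int) -> int:
    """Steps 1-2: score K_0..K_m on a prefix that fits the time budget
    delta * len(x); Step 3 then encodes the whole of x with the winner."""
    r = max(m + 1, int(delta * len(x)) // (m + 1))  # prefix length per code
    prefix = x[:r]
    def score(j):
        header = ceil(-log2(1.0 / ((j + 1) * (j + 2))))  # ceil(-log w_{j+1})
        return header + markov_code_length(prefix, alphabet, order=j)
    return min(range(m + 1), key=score)
```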
Theorem 2. Let $x_1 x_2 \ldots$ be a sequence generated by a stationary ergodic source $\mu$ and let the code $\hat{R}_\delta$ be applied. Then this code is time-universal, i.e., a.s.
$$ \lim_{t \to \infty} \frac{|\hat{R}_\delta(x_1 \ldots x_t)|}{t} = h_\infty(\mu) . $$