Article

Prediction and Evaluation of Zero Order Entropy Changes in Grammar-Based Codes

Department of Computer Science, FEECS, VSB-Technical University of Ostrava, 17. listopadu 15/12172, Ostrava 708 33, Czech Republic
* Author to whom correspondence should be addressed.
Entropy 2017, 19(5), 223; https://doi.org/10.3390/e19050223
Submission received: 30 January 2017 / Revised: 9 May 2017 / Accepted: 10 May 2017 / Published: 13 May 2017
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The change of zero order entropy is studied over different strategies of grammar production rule selection. Two major kinds of rules are distinguished: transformations, which leave the message size intact, and substitution functions, which change the message size. Relations for the change of zero order entropy are derived for both cases, and the conditions under which the entropy decreases are described. In this article, several different greedy strategies that reduce the zero order entropy, as well as the message size, are summarized, and the new strategy MinEnt is proposed. The resulting evolution of the zero order entropy is compared with the strategy of selecting the most frequent digram used in the Re-Pair algorithm.

1. Introduction

Entropy is a key concept in the measurement of the amount of information in information theory [1]. From the data compression perspective, this amount of information represents the lower limit of the achievable compression of some information source. Due to the well-known work by Shannon [2], we know that using fewer bits than the amount given by the entropy to represent a particular message or process would necessarily lead to the loss of some information and, as a consequence, to our inability to properly recover the former structure of the message.
This work is focused on the study of entropy in data compression, and therefore, our discussion will be restricted to only finite messages. These finite messages are formed by symbols, and in this perspective, the entropy can be understood as the lowest number of bits needed on average to uniquely represent each symbol in a message. There are messages for which the evaluation of entropy can be a very hard task, and so, we are often forced to satisfy ourselves with some approximation of entropy.
The simplest approximation is the one based on the probability distribution of symbols in a particular message. In this case, the symbols are viewed as independent entities, and their mutual relationships are not taken into account. A better approximation of entropy is based on conditional probabilities, when we also take into account how symbols follow each other. We can also approximate entropy by computing the bits-per-byte ratio of the message encoded by state-of-the-art data compression algorithms.
In this article, we study entropy at the level of independent symbols. This approximation of entropy is often called zero order entropy. There are two major data compression algorithms in use that compress messages almost to the rate given by zero order entropy: Huffman [3] and arithmetic [4] coding. The zero order entropy can be computed using the Shannon equation:
H(X) = -\sum_{x \in \Sigma} p(x) \log p(x)
where X stands for a random variable representing the probability distribution of symbols in the input message m, and p(x) is the probability of symbol x from the alphabet Σ. The expected length of the code for a particular symbol x is given by −log p(x). If the expected length of the code is multiplied by its probability, we obtain the average number of bits needed to represent the symbol x ∈ Σ. The size of the message using the expected lengths of codes is given as the product of the length of the message |m|, measured as a number of symbols, and the zero order entropy:
|m|^H = |m| \cdot H(X)
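To make the two quantities concrete, the following Python sketch computes the zero order entropy of Equation (1) and the entropic size of Equation (2) directly from the symbol counts of a message; the helper names are illustrative and not part of any published implementation.

    from collections import Counter
    from math import log2

    def zero_order_entropy(message):
        # H(X) = -sum of p(x) * log2 p(x) over the alphabet, Equation (1)
        counts = Counter(message)
        n = len(message)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    def entropic_size(message):
        # |m|^H = |m| * H(X), Equation (2), measured in bits
        return len(message) * zero_order_entropy(message)

    # e.g. entropic_size("aababcdcdb") evaluates to roughly 19.71 bits,
    # the value used later in Example 4.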
When we refer to the term entropic size of the message, we always mean the quantity given by Equation (2); it will be denoted using a superscript as |m|^H. We study how the entropic size of the message evolves when all occurrences of some m-gram are substituted for some other n-gram and vice versa. Two such substitutions are studied: transformations and compression functions. Transformations replace n-grams of the same length; a transformation leaves the message size intact, but since the probabilities of symbols change, the value of the zero order entropy also changes. Compression functions replace m-grams with n-grams, where m > n, and make the message smaller, but a change of the zero order entropy also occurs.
The main idea behind the concept of transformations is the following: consider Huffman coding, where more probable symbols are encoded by prefix codes that are shorter than, or at least of the same length as, those of less probable symbols. If the symbol β is more probable than the symbol γ, but in the context of some symbol α, γ is more frequent than β, then, if these two symbols following α are exchanged, the occurrences formerly encoded by the longer code of γ will instead be encoded by the shorter code of β. Under this assumption, it is possible to pre-process the data so that the frequency of more frequent symbols increases and the frequency of less frequent symbols decreases.

1.1. Notation and Terminology

  • The alphabet of the input message m of the size | m | is denoted by Σ and its size by | Σ | .
  • Greek symbols are used to denote variables representing symbols in the input message. For instance, suppose two digrams αα, αβ and the alphabet Σ = {0, 1}. Then, αα ∈ {00, 11} and αβ ∈ {00, 01, 10, 11}.
  • When we refer to the term entropy, we always mean Shannon’s entropy defined by Equation (1).
  • All logarithms are to base two.
  • Any quantity Q_i with a subscript i ∈ ℕ denotes consecutive states of the quantity between substitutions. For instance, the quantity Q_0 is the value of the quantity before any substitution is applied, and Q_1 is its value after some substitution is applied.

1.2. Order of Context and Entropy

1.2.1. Zero Order Context

When all symbols are interpreted as independent individual entities and no predecessors are taken into consideration, such a case is called zero order context. Zero order entropy is then computed as Shannon’s entropy of a random variable given by the probabilities of symbols in the input message.

1.2.2. N-th Order Context

When the probability distribution of symbols following a particular fixed-length prefix w is taken into consideration, and the length of the prefix is N, then the order of the context is N, and the N-th order entropy is computed as Shannon's entropy of the conditional distribution of symbols following all of the different prefixes w_i.

2. Previous Work

The class of algorithms dealing with exchanges of different n-grams and m-grams is called grammar-based algorithms. Their purpose is to provide a set of production rules inferring the content of the message. Using the Chomsky hierarchy, we identify two classes of formal grammars used in data compression: context-free grammars (CFG) and context-sensitive grammars (CSG). The context transformations presented in Section 3 belong to the CSG class; meanwhile, the compression functions belong to the CFG class. The problem of the search for the most compact context-free grammar representation of a message is NP-hard, unless P = NP [5]. Instead of searching for the optimal solution, many heuristic and greedy algorithms were proposed.
CFGs for data compression were first discussed by Nevill-Manning [6], followed by the proposal of the SEQUITUR algorithm [7]. SEQUITUR reads the sentence in a left-to-right manner so that each repeated pattern is transformed into a grammar rule. The grammar is maintained in such a way that two properties are fulfilled: digram uniqueness (no pair of adjacent symbols appears more than once in the grammar) and rule utility (every rule is used more than once). Kieffer and Yang [8] were the first who addressed data compression using CFGs from the information theoretic perspective; they showed that the LZ78 [9] algorithm can be interpreted as a CFG and that the proposed BISECTION algorithm forms a grammar-based universal lossless source code. BISECTION repeatedly halves the initial message into unique phrases of length 2^k, where k is an integer.
In the work of Yang and He [10], context-dependent grammars (CDG) for data compression were introduced. In CSG, the context is present on both sides of the production rules; meanwhile, in CDG, the context is defined only on the left side of the production rule.
One of the first concepts in greedy grammar-based codes was byte pair encoding (BPE) [11]. The BPE algorithm selects the most frequent digram and replaces it with some unused symbol. The main weakness of this approach is that the algorithm is limited to an alphabet consisting only of byte values. The concept of byte pair encoding was later revisited, and the limitation on the alphabet size was lifted independently by Nakamura and Murashima [12] and by Larsson and Moffat [13]; the resulting approach is called Re-Pair [13]. Re-Pair stands for recursive pairing, and it is a very active field of research [14,15]. It iteratively replaces the most frequent digrams with unused symbols until there is no digram that occurs more than once.
Unlike BPE, which codes digrams using only byte values, Re-Pair expects that the symbols of the final message will be encoded using some entropy coding algorithm. Approaches derived from Re-Pair are usually greedy, since each iteration of the algorithm depends on a successful search for the extremal value of some statistical quantity related to the input message. The study of Re-Pair from the perspective of ergodic sources is discussed in [16,17]. Neither BPE nor Re-Pair compresses the message into the least possible size; they rather form a trade-off between the message and dictionary sizes. Re-Pair-like algorithms are off-line, in the sense that they need more than one pass through the input message; meanwhile, SEQUITUR incrementally builds the grammar in a single pass. The Re-Pair algorithm has O(n) time complexity; it is easy to implement using linked lists and a priority queue. Further, it was shown in [18] that it can compress an input message of length n over an alphabet of size |Σ| into at most 2H_k + o(n log |Σ|) bits, where H_k is the k-th order entropy.
Our recent studies were focused on a special class of grammar transforms that leave the message size intact [19,20]. In the present paper, the class of grammar transformations is extended with the novel concept of higher order context transformations [21]. We provide examples of transformations and extend the evaluation of the entropy, resp. entropic size, reduction to the class of grammar compression algorithms, and we compare the evolution of the entropy, the entropic size and the resulting number of dictionary entries for Re-Pair and for our version of Re-Pair, called MinEnt, which is based on the selection of the pair of symbols reducing the entropic size of the message the most. Re-Pair finds application in areas such as searching in compressed data [22], compression of suffix arrays [23] or compression of inverted indexes [24], to name a few. These areas are also natural application fields for MinEnt. From the perspective of the number of passes through the message, the approaches discussed in this paper belong to the off-line algorithms.

3. Transformations and Compression Algorithms

In this section, we will describe and evaluate several invertible transformations T and substitution functions F so that for any two consecutive states of the message, m 0 and m 1 , before and after application of T or F, the following relation holds:
|m_1|^H < |m_0|^H
Measuring the size of the message by its entropic size is preferred, since using arithmetic coding one can achieve a compression rate very close to the zero order entropy, and so the size |m|^H is in theory attainable. Further, it allows the comparison of two distinct substitutions when their resulting sizes measured by the number of symbols are equal. The derivation of the equations for the computation of |m_1|^H, resp. Δ|m|^H = |m_0|^H − |m_1|^H, is provided in Section 4.

3.1. Transformations

Consider a transformation where we replace all occurrences of some symbol β with some symbol γ and vice versa; such a transformation is called a symmetry transformation, because it does not modify any measurable quantities related to the amount of information. The information content is changed when the replacement is performed in the context of another symbol α. Such a transformation corresponds to the exchange of all digrams αβ for αγ and vice versa. In this section, several different forms of transformations are distinguished and briefly described. Some properties of transformations and their proofs can be found in Appendix A.

3.1.1. Context Transformation

The concept of context transformations was first proposed in [25], and the results were presented in [19]. It is the simplest transformation that assumes a pair of digrams beginning with the same symbol when one of the digrams is initially missing in the input message.
Definition 1.
Context transformation (CT) is a mapping CT(αβ → αγ, w): Σ^n → Σ^n that replaces all digrams αβ with αγ, where p(α, γ) = 0 and β ≠ γ. Σ is the alphabet of the input message w, and n is the length of w.
The context transformation is applied from the end of the message to the beginning; its inverse is the context transformation with the exchanged digram pair, applied in the opposite direction. The proof of this property, with an explanation of why this is the only pairing of the function and its inverse, is left to Appendix A. The application of two consecutive context transformations and their inverse functions is presented in the following example:
Example 1.
abcdabacd| --CT(ab→aa)--> aacdaaacd| --CT(cd→cc)--> aaccaaacc| --CT(cc→cd)--> aacdaaacd| --CT(aa→ab)--> abcdabacd|
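The following Python sketch illustrates one reading of the definition that is consistent with Example 1: the forward CT scans the message from the end to the beginning and rewrites the second symbol of every matching digram, while the inverse is the reversed rule applied in the opposite direction. The function names are illustrative assumptions only.

    def ct_backward(m, a, b, g):
        # forward CT(ab -> ag): scan from the end of the message to the beginning
        s = list(m)
        for i in range(len(s) - 2, -1, -1):
            if s[i] == a and s[i + 1] == b:
                s[i + 1] = g
        return "".join(s)

    def ct_forward(m, a, b, g):
        # CT(ab -> ag) applied in the opposite direction (left to right)
        s = list(m)
        for i in range(len(s) - 1):
            if s[i] == a and s[i + 1] == b:
                s[i + 1] = g
        return "".join(s)

    m = "abcdabacd"
    t = ct_backward(ct_backward(m, "a", "b", "a"), "c", "d", "c")  # "aaccaaacc"
    r = ct_forward(ct_forward(t, "c", "c", "d"), "a", "a", "b")    # "abcdabacd" again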

3.1.2. Generalized Context Transformation

Context transformations were restricted to cases where one of the digrams is missing in the input message. This restriction is removed by the introduction of the generalized context transformation, first proposed in [20].
Definition 2.
Generalized context transformation (GCT) is a mapping GCT(αβ ↔ αγ, w): Σ^n → Σ^n that exchanges all occurrences of a digram αβ with a digram αγ and vice versa. Σ is the alphabet of the input message w, and n is the length of w.
Example 2.
aabcabab| --GCT(ab↔aa)--> abacaaaa| --GCT(aa↔ab)--> aabcabab
While both transformations CT and GCT swap occurrences of two different digrams beginning with the same symbol, they differ in the way they are applied and in how the inverse transformation is formed. GCT can be applied in both directions, and the inverse transformation GCT^{-1} is always applied in the direction opposite to the forward transformation. The algorithm based on the CT and GCT works as follows:
  • Find and apply a transformation T so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal.
  • Repeat Step 1 until no transformation can decrease the entropic size of the message.
It is also possible to define a transformation and its inverse so that all symbols constituting the replaced pairs differ, for instance ab ↔ cd; such a transformation is called a generic transformation GT. In this article, we have not proposed algorithms based on GT, but because the set of all generalized context transformations is a subset of the set of generic transformations, the proof of the existence of the inverse transformation is the same for both GCT and GT. The reader can find the proof in Appendix A.

3.1.3. Higher Order Context Transformation

Every time we apply any generalized context transformation G C T , we acquire knowledge about the positions of two distinct digrams in the message. We can either discard this knowledge or we can try to build on it. In the following definition, we define a transformation that is applied over positions where some other transformation was applied before:
Definition 3.
Let P(w, m) be the set of positions of the first symbol following the sub-message w in the message m, where w[i] ≠ w[0] for i > 0. If β, γ ≠ w[0], then the higher order context transformation (HOCT) is a mapping HOCT(wβ ↔ wγ, m, P(w, m)): Σ^n → Σ^n that exchanges all sub-messages wβ with sub-messages wγ and vice versa.
The restriction that the sub-message w has to satisfy, w[0] ≠ w[i] for i > 0, is closely related to the existence of the inverse transformation of HOCT. The properties related to HOCT and their proofs are left to Appendix A.
Let O = |w| be the size of the sub-message w from Definition 3; then O is the order of the HOCT. Any GCT(αβ ↔ αγ) is then the first order HOCT(αβ ↔ αγ, m, P(α, m)). Given that we have just applied some transformation m_1 = HOCT_1(wβ ↔ wγ, m, P(w, m)), we can decide to collect the positions of either w_1 = wβ or w_2 = wγ, collect the distribution of symbols at the positions P(w_i, m), and apply another HOCT(w_iρ ↔ w_iφ, m_1, P(w_i, m)). In this sense, HOCT is not used only to interchange different sub-messages; it also allows one to proceed with some other transformation HOCT of a higher order. The application of two consecutive HOCT transformations is presented in the following example:
Example 3.
abcdabcd| --HOCT(ab→ad, P(a,m)={1,5})--> adcdadcd| --HOCT(adc→add, P(ad,m)={2,6})--> adddaddd|
The H O C T transformation is a recursive application of G C T in the context of some prefix w. The steps of the algorithm are outlined as follows:
  • Find and apply HOCT(αβ ↔ αγ) over the set of positions P(α), so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal and Δ|m|^H > Lim.
  • If the frequency of αβ, resp. αγ, is larger than one, then repeat Step 1 over the set of positions P(αβ), resp. P(αγ), i.e., the positions where the HOCT from Step 1 was applied; otherwise, repeat Step 1 over the positions P(α), or return if no further HOCT passes the entropic size reduction condition.
The algorithm above is iteratively called for symbols sorted from the most frequent to the least frequent. The Lim variable can be used to reject transformations whose entropic size reduction is too small for them to be stored efficiently in the dictionary.
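A minimal sketch of a single HOCT step, following Definition 3: it swaps the symbols β and γ at the positions P(w, m) and returns the positions over which a transformation of the next higher order can continue. The helper name and the position bookkeeping are illustrative assumptions, not the authors' implementation.

    def hoct_apply(m, positions, beta, gamma):
        # HOCT(w*beta <-> w*gamma, m, P(w, m)): swap beta and gamma at the
        # positions of the first symbol following the prefix w (0-indexed).
        s = list(m)
        for i in positions:
            if s[i] == beta:
                s[i] = gamma
            elif s[i] == gamma:
                s[i] = beta
        # positions following the extended prefix w+gamma, usable by an
        # HOCT of order |w| + 1
        next_positions = [i + 1 for i in positions if i + 1 < len(s) and s[i] == gamma]
        return "".join(s), next_positions

    # Example 3 revisited: P(a, m) = {1, 5} for m = "abcdabcd"
    m1, p_ad = hoct_apply("abcdabcd", [1, 5], "b", "d")  # "adcdadcd", p_ad = [2, 6]
    m2, _ = hoct_apply(m1, p_ad, "c", "d")               # "adddaddd"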

3.2. Compression Functions

In the preceding section, we described three types of transformations that leave the message size intact. In this section, we focus on two approaches to the replacement of digrams with a new symbol. First, we describe the basic principles of the well-known Re-Pair algorithm, and then we propose a modification of Re-Pair called MinEnt.

3.2.1. Re-Pair

The main idea behind the Re-Pair algorithm is to repeatedly find the most frequent digram and replace all of its occurrences with a new symbol that is not yet present in the message. The algorithm can be described in the following steps:
  • Select the most frequent digram α β in message m.
  • Replace all occurrences of αβ with a new symbol γ.
  • Repeat Steps 1 and 2 until every digram appears only once.
In Step 2 of the algorithm, the pair α β together with a new symbol γ are stored in a dictionary. The implementation details of the Re-Pair algorithm are left to Section 3.2.2 regarding the proposed MinEnt algorithm.
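A deliberately naive Python sketch of the iteration above; the real implementation of Larsson and Moffat [13] uses linked lists and a priority queue, and the names used here are illustrative only.

    from collections import Counter

    def repair_naive(msg, new_symbols):
        # msg is a list of symbols; new_symbols yields symbols not present in msg
        dictionary = []
        while True:
            digrams = Counter(zip(msg, msg[1:]))
            if not digrams:
                break
            (a, b), freq = digrams.most_common(1)[0]
            if freq < 2:                      # stop when every digram appears only once
                break
            g = next(new_symbols)
            dictionary.append((g, a, b))      # Step 2: record the rule g -> ab
            out, i = [], 0
            while i < len(msg):               # replace non-overlapping occurrences
                if i + 1 < len(msg) and msg[i] == a and msg[i + 1] == b:
                    out.append(g)
                    i += 2
                else:
                    out.append(msg[i])
                    i += 1
            msg = out
        return msg, dictionary

    # e.g. repair_naive(list("aababcdcdb"), iter("efgh"))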

3.2.2. MinEnt

The MinEnt algorithm proposed in this article is derived from the Re-Pair algorithm. The main difference is in Step 1: instead of selecting the most frequent digram, we select the digram that minimizes |m_1|^H from Equation (3):
  • Select the digram αβ in the message m_0 so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal.
  • Replace all occurrences of αβ with a new symbol γ.
  • Repeat Steps 1 and 2 until every digram appears only once.
More precisely, let m_1 = MinEnt(m_0, αβ → γ) be the application of Step 1 and Step 2 of the MinEnt algorithm; then the digram αβ fulfills:
\arg\min_{\alpha, \beta \in \Sigma_0} |MinEnt(m_0, \alpha\beta \to \gamma)|^H
where Σ 0 is the alphabet of the message m 0 . To demonstrate the difference between Re-Pair and MinEnt, consider the following example:
Example 4.
m_0 = aababcdcdb
The entropic size of m_0 is |m_0|^H = 19.71 bits. There are two non-overlapping digrams that occur twice: ab and cd.
MinEnt(m_0, ab → e) = aeecdcdb
MinEnt(m_0, cd → e) = aababeeb
Based on the Re-Pair algorithm, we do not know which digram should be preferred, because both have the same frequency. In the MinEnt case, we can compute |m_1|^H for both cases, yielding |m_1|^H_{ab} = 18 bits and |m_1|^H_{cd} = 12.49 bits, and so the replacement cd → e will be the preferred one.
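The numbers in Example 4 can be reproduced with a small Python sketch that scores every candidate digram by the entropic size of the message after its replacement; this is a brute-force illustration of the MinEnt selection rule, not the priority-queue implementation described below.

    from collections import Counter
    from math import log2

    def entropic_size(msg):
        n = len(msg)
        return -sum(c * log2(c / n) for c in Counter(msg).values())

    def replace_digram(msg, a, b, g):
        out, i = [], 0
        while i < len(msg):
            if i + 1 < len(msg) and msg[i] == a and msg[i + 1] == b:
                out.append(g); i += 2
            else:
                out.append(msg[i]); i += 1
        return out

    m0 = list("aababcdcdb")                        # |m0|^H = 19.71 bits
    for a, b in [("a", "b"), ("c", "d")]:
        m1 = replace_digram(m0, a, b, "e")
        print(a + b, round(entropic_size(m1), 2))  # ab -> 18.0, cd -> 12.49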
The MinEnt and the Re-Pair strategies of digram selection are evaluated using the algorithm described in [13]. In the initialization phase of the algorithm, the input file is transformed into the linked list, and each input byte is converted into the unsigned integer value. In the next step, the linked list is scanned, and the frequencies and positions of all digrams are recorded. Frequencies of digrams, resp. the change of the entropic size of the message measured in bytes, are used as indices for the priority queue. The size of the queue is limited to the maximal frequency, resp. in the case of the MinEnt algorithm, the maximum entropic size decrease.
The algorithm iteratively selects the digram with the highest priority, replaces all occurrences of the digram with the newly-introduced symbol, decrements the counts of the neighboring digrams and increments the counts of the newly-introduced digrams. In the case of the MinEnt algorithm, we have to recompute the change of the entropic size of all digrams in the priority queue. We restrict the number of recomputed changes of the entropic size to the top 20 digrams with the highest priority, so that the time complexity of this additional step remains O(1). Both algorithms run in O(n) expected time; see [13] for details. The memory consumption is larger in the MinEnt case, because each digram has to be assigned an additional quantity: the value of the change of the entropic size of the message.

3.3. Discussion of the Transformation and Compression Function Selection Strategies

To demonstrate the behavior of the aforementioned algorithms, we proposed strategies for the selection of transformations and compression algorithms. We compared the evolution of the entropy of the alphabet, the entropic size of the message and the final size of the message given as the sum of the entropic size of the message and the upper boundary on the size of the dictionary (Section 3.3.1). The following strategies are compared:
  • G C T : selection of the generalized context transformation so that the decrease of entropy is maximal.
  • H O C T : selection of the higher order context transformation so that the decrease of entropy is maximal in the context of prefix w.
  • Re-Pair: selection of the most frequent digram and its replacement with an unused symbol.
  • MinEnt: selection of the most entropic size reducing digram and its replacement with an unused symbol.

3.3.1. The Upper Boundary on the Dictionary Entry Size

All transformations and compression functions are usually stored as entries in a dictionary. To be able to compare the effectiveness of transformations, we selected the worst-case entropy of each symbol, given by log |Σ_i|, where Σ_i is the alphabet and the subscript i denotes the number of applied transformations.
In the GCT and HOCT strategies, the size of the alphabet remains constant unless some symbols are completely removed, in which case the size of the alphabet decreases. The Re-Pair and MinEnt algorithms, which introduce new symbols, have an increasing alphabet size. The upper boundary on the resulting size of each dictionary entry |D| for the GCT and HOCT transformations is defined as:
|D| = 3 \log |\Sigma_0|
where |Σ_0| is the size of the initial alphabet. The Larsson and Moffat [13] version of Re-Pair introduces several efficient ways of dictionary encoding: the Bernoulli model, literal pair enumeration and interpolative encoding. In our experiments with Re-Pair and MinEnt, we used interpolative encoding to encode the dictionary.

3.3.2. Comparison of the Alphabet’s Entropy Evolution

Even though the transformations and compression functions pursue the same objective, minimization of the entropic size of the message, they achieve that by a different evolution of zero order entropy. Transformation-based strategies minimize zero order entropy; meanwhile, both compression strategies introduce new symbols, and as a result, zero order entropy grows. The initial values of the quantities of the examined test file are summarized in Table 1. The example of the comparison of the zero order entropy evolution of different strategies is provided in Figure 1a.
Both compression functions achieve a very similar resulting value of the zero order entropy. The Re-Pair strategy begins with the highest growth of entropy, but the increase slows down with the number of iterations as the frequency of each consecutive digram drops. As will be discussed in Section 4.2.2, digrams consisting of symbols with a lower frequency will be preferred by MinEnt, because they are able to achieve a larger decrease of the entropic size, and their replacement brings a lower cost in the zero order entropy increase. This behavior can be observed especially in the later iterations of the Re-Pair and MinEnt algorithms.
Both transformations reduce the value of zero order entropy. G C T initially drops faster, but in the end, it significantly slows down. The application of the H O C T strategy achieves the lowest resulting value of entropy, and the interesting fact is that it decreases at an almost constant rate. The behavior of entropy evolution for different values of the limit in H O C T is presented in Figure 1b. The unrestricted case ( L i m = 0 ) shows us the bottom limit of zero order entropy reduction using the H O C T strategy.

3.3.3. Comparison of Entropic Size Evolution

The selection of the most frequent digram produces the largest decrease of the number of symbols in each iteration. Surprisingly, the Re-Pair strategy does not necessarily converge to its minimum in a lower number of iterations than MinEnt. Figure 2 presents this behavior for the paper5 file of the Calgary corpus. Both approaches end with a similar number of symbols in the resulting message.
The MinEnt strategy achieves the lowest entropic size of the message, and at each iteration, the entropic size of the message is lower than in the case of the Re-Pair strategy, see Figure 3. The overall efficiency depends on our ability to compress the resulting dictionary.
A summary of the different transformation strategies is provided in Table 2. A summary of the compression functions is then given in Table 3. The least number of iterations was achieved by HOCT with Lim = |D|; this strategy also leads to the smallest final size |m_f|, but it should be emphasized that this final size |m_f| is a very pessimistic estimate, due to the construction of the size of the dictionary entries.
Even though the achieved results of both approaches are similar, we see that the resulting message size |m_f| and the alphabet size are lower in the case of MinEnt. The message size |m_f| is given by the sum of the entropic size of the message and the size of the dictionary stored by interpolative encoding. Using the values in the columns of Table 3, we express |m_f| = |m_i|^{H_i} + i·|D_i|; the term i·|D_i| represents the size of the dictionary, given as the product of the total number of iterations and the average number of bits needed to encode one iteration. See Table 4 and Table 5 for more results on files from the Calgary and Canterbury corpora.

4. Zero Order Entropy and Entropic Message Size Reduction

The primary purpose of context transformations and the other derived transformations is to reduce the zero order entropy measured by Shannon's entropy [2] defined in Equation (1). In this section, we show under what conditions a transformation or a compression function reduces the zero order entropy, resp. the entropic size of the message. Suppose that H_0 is the zero order entropy of the message m and H_1 is the zero order entropy after a transformation T is applied. The conditions under which the following inequality holds are the major subject of interest:
\Delta H = H_0 - H_1 = \sum_{x \in \Sigma} p_1(x) \log p_1(x) - \sum_{x \in \Sigma} p_0(x) \log p_0(x) > 0
Let Σ_T ⊆ Σ be the set of symbols whose frequencies before and after the transformation differ, and Σ_I ⊆ Σ the set of symbols whose frequencies are intact. For transformations, the inequality (5) can be further restricted to the set of symbols Σ_T, since the terms containing symbols from Σ_I cancel:
\Delta H = \sum_{x \in \Sigma_T} p_1(x) \log p_1(x) - \sum_{x \in \Sigma_T} p_0(x) \log p_0(x)
In the following paragraphs, we specify the form of the set Σ_T and the relations for the probabilities of its symbols after the transformations, so that the change of entropy given by Equation (6) can be computed before any transformation actually occurs.

4.1. Transformation of Probabilities

We begin with the simplest case: suppose the context transformation CT(αβ → αγ, w). Since only the probabilities of the symbols β and γ will change, Σ_CT = {β, γ}, and it is sufficient to express the probabilities only for β and γ:
p_1^{CT}(\beta) = p_0(\beta) - p_0(\alpha, \beta)
and:
p_1^{CT}(\gamma) = p_0(\gamma) + p_0(\alpha, \beta)
In the case of the generalized context transformation GCT(αβ ↔ αγ, w), the set Σ_GCT = Σ_CT is identical, and the probabilities transform according to:
p_1^{GCT}(\beta) = p_0(\beta) - p_0(\alpha, \beta) + p_0(\alpha, \gamma)
and:
p_1^{GCT}(\gamma) = p_0(\gamma) + p_0(\alpha, \beta) - p_0(\alpha, \gamma)
In the last case of the higher order transformation, the probabilities transform according to:
p_1^{HOCT}(\beta) = p_0(\beta) - p_0(w, \beta) + p_0(w, \gamma)
and:
p_1^{HOCT}(\gamma) = p_0(\gamma) + p_0(w, \beta) - p_0(w, \gamma)
In all cases, the set Σ T forms a binary alphabet. The following theorem then describes the condition for zero order entropy reduction:
Theorem 1.
Suppose the generalized context transformation GCT(αβ ↔ αγ). Let p_0(β) and p_0(γ) be the probabilities of the symbols before the transformation is applied, and let p_{0,max} = max{p_0(β), p_0(γ)}. After the transformation, the associated probabilities are p_1(β), p_1(γ) and p_{1,max} = max{p_1(β), p_1(γ)}. If p_{1,max} > p_{0,max}, then the generalized context transformation reduces the entropy.
The proof of Theorem 1 is based on the properties of entropy when only two letters from the alphabet Σ are considered. Let p(β) + p(γ) = c, where c ≤ 1; c is invariant, i.e., it does not change during the transformation. We can express one of these probabilities using the other one; for example, let p(γ) = c − p(β); this allows us to express the entropy function as a function of only one variable. A few examples of such functions are shown in Figure 4. The maximum value of the function is located at c/2, and it has two minima, at zero and at c.
Proof. 
Since the entropy function for two different letters is defined on the interval [0, c] and is concave with a maximum at c/2 and minima at zero and c, p_{0,max} has to be located in the interval [c/2, c]; but on that particular interval, the higher the maximum is, the lower the entropy is. So if we increase the maximum (or, equivalently, increase the absolute value of the difference |p(β) − p(γ)|), then the entropy will decrease. ☐
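The relations above allow the entropy change of a GCT to be predicted from the symbol and digram statistics alone, before the transformation is applied. The following hedged Python sketch combines Equation (6) with the GCT probability relations above; it assumes, as in Corollary A2, that α, β and γ are pairwise distinct, and the function name is illustrative.

    from collections import Counter
    from math import log2

    def predicted_gct_entropy_change(m, a, b, g):
        # Delta H of GCT(ab <-> ag) restricted to Sigma_T = {b, g};
        # a positive value means the zero order entropy decreases.
        n = len(m)
        sym = Counter(m)
        dig = Counter(zip(m, m[1:]))
        p0b, p0g = sym[b] / n, sym[g] / n
        pab, pag = dig[(a, b)] / n, dig[(a, g)] / n
        p1b = p0b - pab + pag          # GCT probability update for beta
        p1g = p0g + pab - pag          # GCT probability update for gamma

        def plogp(p):
            return p * log2(p) if p > 0 else 0.0

        return (plogp(p1b) + plogp(p1g)) - (plogp(p0b) + plogp(p0g))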

4.2. General Entropy Change Relations

In this section, we generalize the notion of the zero order entropy change to the exchange of any two words. The solution is divided into three parts. The first part deals with the set of symbols Σ_I whose frequency does not change before and after the substitution function is applied; the second part establishes relations for the set of symbols Σ_T whose probability changes, but whose initial and final frequencies are non-zero; the third part discusses symbols introduced to and removed from the alphabet. Let Σ_R be the set of removed symbols and Σ_N the set of introduced symbols; then we can split the sum in Equation (1), yielding:
H(X) = -\sum_{\alpha \in \{I, T, R, N\}} \sum_{x \in \Sigma_\alpha} p(x) \log p(x)
The four sets of symbols in Equation (13) exhibit different behaviors under the substitution function, and they will be discussed in separate sections. The entropic size of the message |m|^H can also be handled separately; let H(p(Σ_α)) be the portion of the entropy conveyed by symbols from the alphabet Σ_α, and let |m|^{H(Σ_α)} be the corresponding portion of the entropic size of the message; then we can split the resulting entropic size as we did before:
|m|^H = |m| [H(p(\Sigma_I)) + H(p(\Sigma_T)) + H(p(\Sigma_R)) + H(p(\Sigma_N))]
= |m| H(p(\Sigma_I)) + |m| H(p(\Sigma_T)) + |m| H(p(\Sigma_R)) + |m| H(p(\Sigma_N))
= |m|^{H(\Sigma_I)} + |m|^{H(\Sigma_T)} + |m|^{H(\Sigma_R)} + |m|^{H(\Sigma_N)}

4.2.1. The First Part: Symbols Remaining Intact by the Substitution Function

We begin with symbols that are not part of either of the substituted words s_1 or s_2. Suppose that the message m_0 of length |m_0| turns into some message m_1 of size |m_1|. Generally, |m_0| ≠ |m_1|, but in the special case of context transformations, these two quantities are equal. However, when compression or expansion of the message occurs, this part of the Shannon equation also changes, due to the change in the total number of symbols.
Suppose that the symbol x is initially in the message m 0 with the probability p 0 ( x ) . This probability can be expressed using the frequency f 0 ( x ) and the size of the message as:
p_0(x) = \frac{f_0(x)}{|m_0|}
Later, after the transformation is applied, the probability changes to:
p_1(x) = \frac{f_0(x)}{|m_0| + \Delta m}
where Δm is the change of the message size. In the case of context transformations, where the message size remains the same, the probability remains the same, as well as the part of the entropy formed by the non-transformed symbols.
When the two probabilities are placed in relation by some stretching factor c_1, we arrange them into the form:
p_1(x) = c_1 p_0(x)
The introduction of c_1 is motivated by the properties of logarithms: if we stayed with p_1(x) given by Equation (16), we would get the logarithm log(f_0(x)/(|m_0| + Δm)) = log f_0(x) − log(|m_0| + Δm); if, instead, we express p_1(x) using (17), then the argument of the logarithm is in product form, and log(c_1 p_0(x)) = log c_1 + log p_0(x). The factor c_1 can be expressed by substituting p_1(x) in Equation (17) with the terms of Equations (15) and (16), leading to:
\frac{f_0(x)}{|m_0| + \Delta m} = c_1 \frac{f_0(x)}{|m_0|}
c_1 = \frac{|m_0|}{|m_0| + \Delta m}
Then, the relation for zero order entropy after transformation will have the form:
H(p_1(\Sigma_I)) = -\sum_{x \in \Sigma_I} c_1 p_0(x) \log [c_1 p_0(x)]
= -c_1 \sum_{x \in \Sigma_I} p_0(x) [\log c_1 + \log p_0(x)]
= -c_1 \log c_1 \sum_{x \in \Sigma_I} p_0(x) - c_1 \sum_{x \in \Sigma_I} p_0(x) \log p_0(x)
= -c_1 \log c_1 \sum_{x \in \Sigma_I} p_0(x) + c_1 H(p_0(\Sigma_I))
= c_1 [H(p_0(\Sigma_I)) - \log c_1 \sum_{x \in \Sigma_I} p_0(x)]
An example of the behavior of H(p_1(Σ_I)) of the intact part is visualized in Figure 5. When compression of the message occurs, i.e., log c_1 > 0, the zero order entropy of the intact symbols increases. The less probability is conveyed by the symbols from Σ_I, the more their zero order entropy is sensitive to the change of c_1.
The final entropic size is given as follows (using |m_1| c_1 = |m_0|):
|m_1|^{H,\Sigma_I} = |m_1| H(p_1(\Sigma_I)) = |m_1| c_1 [H(p_0(\Sigma_I)) - \log c_1 \sum_{x \in \Sigma_I} p_0(x)]
= |m_0| H(p_0(\Sigma_I)) - |m_0| \log c_1 \sum_{x \in \Sigma_I} p_0(x)
= |m_0|^{H,\Sigma_I} - |m_0| \sum_{x \in \Sigma_I} p_0(x) \log c_1
= |m_0|^{H,\Sigma_I} - \log c_1 \sum_{x \in \Sigma_I} f_0(x)
If we apply one of the transformations, then Δm = 0 and, as a consequence, c_1 = 1; the last term on the right is zero due to log c_1 = 0, so Equation (20) tells us that the entropic size of the message carried by these symbols does not change during a transformation. When |Σ_T| is much smaller than |Σ_I|, it is convenient to rewrite Equation (20) in terms of Σ_T:
|m_1|^{H,\Sigma_I} = |m_0|^{H,\Sigma_I} - \log c_1 [|m_0| - \sum_{x \in \Sigma_T} f_0(x)]
Corollary 1.
No compression function ever increases the entropic size of the part of the message consisting of intact symbols.
Proof. 
The compression function has a value of c_1 larger than one; as a consequence, log c_1 > 0, and so |m_1|^{H,Σ_I} ≤ |m_0|^{H,Σ_I}. ☐
Equality in |m_1|^{H,Σ_I} ≤ |m_0|^{H,Σ_I} occurs when Σ_{x∈Σ_I} f_0(x) = 0, i.e., when there are no intact symbols. When expansion of the message occurs, then log c_1 < 0, and the second term on the right of Equation (20) becomes positive. Expansion of the message leads to an increase of the entropic size, whereas compression leads to a decrease of the entropic size of the intact symbols.
In each iteration of the Re-Pair algorithm, the most frequent digram is selected. This corresponds to the selection of the digram with the maximal value of log c_1, but it does not have to be the digram minimizing the entropic size of this part of the resulting message the most. Consider two digrams d_1 and d_2 whose frequencies are equal, f(d_1) = f(d_2); replacing either one with a new symbol yields the same stretching factor c_1, but not necessarily the same Σ_{x∈Σ_I} f_0(x). A larger reduction of the entropic size of the message is achieved when the compressed digrams or words consist of less frequent symbols.
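A quick numeric illustration of this point, using Equation (21) with hypothetical frequencies chosen only for the illustration: two digrams with the same frequency give the same c_1, but the digram built from rarer symbols leaves more frequency mass in Σ_I and therefore yields a larger reduction of |m|^{H,Σ_I}.

    from math import log2

    m0_len, f_digram = 1000, 50        # message length and frequency of either digram
    c1 = m0_len / (m0_len - f_digram)  # the same stretching factor for both candidates
    for name, f_symbols in [("frequent symbols", 400), ("rare symbols", 120)]:
        # f_symbols = sum of f0(x) over the two symbols forming the digram
        reduction = log2(c1) * (m0_len - f_symbols)  # drop of |m|^{H, Sigma_I}, Eq. (21)
        print(name, round(reduction, 1), "bits")     # ~44.4 bits vs. ~65.1 bits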

4.2.2. The Second Part: Symbols Participating in the Substitution Function

In the second case, the frequencies of the symbols and their total number change. The equation for the stretching factor c_2 is derived in the following way:
p_1(x) = \frac{f_1(x)}{|m_1|} = \frac{f_0(x) + \Delta f(x)}{|m_0| + \Delta m}
\frac{f_0(x) + \Delta f(x)}{|m_0| + \Delta m} = c_2 \frac{f_0(x)}{|m_0|}
The main difference between the two cases is that c_1 is a constant, whereas c_2 is a function of the particular symbol x:
c_2(x) = \frac{(f_0(x) + \Delta f(x)) |m_0|}{f_0(x) (|m_0| + \Delta m)} = \frac{f_0(x) + \Delta f(x)}{f_0(x)} c_1 = F(x) c_1
where in the last step we made the substitution F(x) = (f_0(x) + Δf(x)) / f_0(x). The rest of the derivation follows the derivation of Equation (19):
H(p_1(\Sigma_T)) = -\sum_{x \in \Sigma_T} p_0(x) c_2(x) \log c_2(x) - \sum_{x \in \Sigma_T} c_2(x) p_0(x) \log p_0(x)
The behavior of Equation (23) for different values of p 0 ( x ) is visualized in Figure 6. The substitution of less frequent symbols leads to a lower increase of zero order entropy.
The resulting entropic size simplifies given that:
|m_1| c_2(x) p_0(x) = f_0(x) + \Delta f(x) = f_1(x)
yields:
|m_1|^{H,\Sigma_T} = |m_1| H_T(p_1) = -\sum_{x \in \Sigma_T} [f_0(x) + \Delta f(x)] (\log c_2(x) + \log p_0(x))
We now analyze both terms in (25) from the perspective of different values of c_2(x). We are particularly interested in compression functions. For a compression function, c_1 > 1; symbols with Δf(x) < 0, i.e., symbols whose frequency decreases, have F(x) < 1. The positivity or negativity of log c_2 then depends on the value of the product F(x) c_1.
The case F(x) c_1 = 1 has the solution F(x) = 1/c_1, for which log c_2(x) = 0. The term log p_0(x) is always negative. The value of F(x) must be larger than 1/c_1 to decrease the zero order entropy conveyed by the symbol x, since then c_2(x) > 1 and, as a consequence, log c_2(x) > 0:
F(x) > \frac{1}{c_1}
\frac{f_0(x) + \Delta f(x)}{f_0(x)} > \frac{|m_0| + \Delta m}{|m_0|}
1 + \frac{\Delta f(x)}{f_0(x)} > 1 + \frac{\Delta m}{|m_0|}
\frac{\Delta f(x)}{f_0(x)} > \frac{\Delta m}{|m_0|}
\frac{|\Delta f(x)|}{f_0(x)} < \frac{|\Delta m|}{|m_0|}
\frac{|\Delta f(x)|}{|\Delta m|} < \frac{f_0(x)}{|m_0|} = p_0(x)
The introduction of the absolute values in the middle step of the derivation of Inequality (26) is allowed, since for compression functions the values of Δf(x) and Δm can only be negative. Suppose now that we have a digram d = αβ with α ≠ β, and we replace it by the newly-introduced γ; then Δm = Δf(α) = Δf(β). The left part of Inequality (26) becomes equal to one, so Inequality (26) cannot be satisfied, and log c_2(x) in this case will be negative and will always increase the amount of information carried by the symbols α and β.
Finally, we state the condition for the entropic size decrease:
Corollary 2.
The entropic size of the part of the message formed by symbol x decreases when:
\frac{\Delta f(x)}{f_0(x)} < \frac{-\log c_2(x)}{\log c_2(x) + \log p_0(x)}
Proof. 
|m_1|^H < |m_0|^H
-[f_0(x) + \Delta f(x)] [\log c_2(x) + \log p_0(x)] < -f_0(x) \log p_0(x)
-f_0(x) \log c_2(x) - \Delta f(x) \log p_0(x) - \Delta f(x) \log c_2(x) < 0
-\Delta f(x) [\log p_0(x) + \log c_2(x)] < f_0(x) \log c_2(x)
\frac{\Delta f(x)}{f_0(x)} < \frac{-\log c_2(x)}{\log c_2(x) + \log p_0(x)}
where the last step divides by log c_2(x) + log p_0(x) = log p_1(x) < 0, which flips the inequality. ☐

4.2.3. Third Part: Introduced and Removed Symbols

We begin with a symbol x that is completely removed from the message, so that initially p_0(x) ≠ 0, but p_1(x) = 0. This case is trivial; such a symbol has zero participation in the final value of the entropy and of the entropic size of the message. The remaining case we have to deal with is when initially the symbol x has zero probability, p_0(x) = 0, but after the substitution its probability increases to some p_1(x) ≠ 0. The final probability is given as:
p_1(x) = \frac{\Delta f(x)}{|m_0| + \Delta m}
Since the symbol x initially has zero participation in the entropy and the entropic size, it always leads to an increase of both quantities. For the set Σ_N of all such symbols, its portion of the total entropy is given by:
H(p_1(\Sigma_N)) = -\sum_{x \in \Sigma_N} \frac{\Delta f(x)}{|m_0| + \Delta m} \log \frac{\Delta f(x)}{|m_0| + \Delta m}
and the corresponding final entropic size is given by:
|m_1|^{H,\Sigma_N} = |m_1| H_N(p_1) = -\sum_{x \in \Sigma_N} \Delta f(x) [\log \Delta f(x) - \log(|m_0| + \Delta m)]
It is important to remark that it does not make much sense to introduce more than one symbol in one substitution function, because every additionally introduced symbol contributes another positive term to both quantities.

4.3. Calculation of Δ | m | H

At first glance, it seems that we need to evaluate all symbols to predict the zero order entropy, but instead it is possible to predict the exact change of the entropic size of the message after the application of a compression function by evaluating the entropic sizes given by Equations (21), (25) and (30), dealing only with the symbols x ∈ Σ \ Σ_I. In the particular case of the Re-Pair algorithm, there are only two symbols whose frequencies we need to know to evaluate the change of the entropic size of the message; suppose a compression function CF(αβ → γ) such that p_1(α) ≠ 0, p_1(β) ≠ 0 and p_0(γ) = 0; then the resulting change of the entropic size is given as:
\Delta |m|^H = |m_0| \log c_1 - \log c_1 \sum_{x \in \{\alpha, \beta\}} f_0(x) + \sum_{x \in \{\alpha, \beta\}} [f_0(x) \log c_2(x) + \Delta f(x) \log p_0(x) + \Delta f(x) \log c_2(x)] + \Delta f(\gamma) [\log \Delta f(\gamma) - \log(|m_0| + \Delta m)]
Finally, for Re-Pair it holds that, if α ≠ β, then Δm = Δf(α) = Δf(β) = −Δf(γ) = −f(α, β), and the absolute value of every Δ in (31) equals f(αβ). If α = β, then Δf(α)/2 = −Δf(γ).
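As a sanity check, Equation (31) can be evaluated for the replacement ab → e from Example 4 and compared against a direct recomputation of the entropic size; a short, self-contained Python sketch with illustrative names:

    from collections import Counter
    from math import log2

    def entropic_size(msg):
        n = len(msg)
        return -sum(c * log2(c / n) for c in Counter(msg).values())

    m0 = list("aababcdcdb")
    f_ab = 2                            # frequency of the replaced digram ab
    n0, dm = len(m0), -f_ab
    c1 = n0 / (n0 + dm)
    f0 = Counter(m0)

    delta = n0 * log2(c1) - log2(c1) * (f0["a"] + f0["b"])   # intact part, Eq. (21)
    for x in "ab":                                           # transformed part, Eq. (25)
        df = -f_ab
        c2 = (f0[x] + df) / f0[x] * c1
        p0 = f0[x] / n0
        delta += f0[x] * log2(c2) + df * log2(p0) + df * log2(c2)
    delta += f_ab * (log2(f_ab) - log2(n0 + dm))             # new symbol e, Eq. (30)

    direct = entropic_size(m0) - entropic_size(list("aeecdcdb"))  # 19.71 - 18
    # delta and direct both evaluate to roughly 1.71 bits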

5. Conclusions

We described three types of transformations for the preprocessing of messages so that the zero order entropy of the message drops and the resulting message can be encoded more efficiently using zero order entropy compression algorithms like Huffman or arithmetic coding.
We presented the relations that govern the change of the message size for transformations and compression functions. Transformations have the advantage that they do not modify the size of the alphabet; in the case of the digram substitution used by Re-Pair and by our proposed MinEnt strategy, the resulting size of the alphabet grows significantly, which brings additional complexity in the storage of the entropy coding model, i.e., the storage of the output alphabet.
The MinEnt strategy selects the digrams to be replaced by the minimal entropic size of the resulting message, and it was shown that in most cases the resulting message size is smaller than the one achieved by Re-Pair. We also showed that the two algorithms follow slightly different execution paths, as MinEnt prefers digrams that consist of less frequent symbols, whereas Re-Pair does not take this into consideration.
The compression functions have an advantage over the transformations, as they achieve a better resulting compression ratio. In future work, we will focus on the storage of the dictionary used by the transformation algorithms, because this area can significantly improve the resulting compression ratio. Further, we will focus on the description of the relation between the entropy coding model of the final message and the size of the final alphabet.

Acknowledgments

This work was supported by the project SP2017/100 Parallel processing of Big Data IV, of the Student Grant System, VSB-Technical University of Ostrava. The costs for open access were covered.

Author Contributions

Michal Vasinek realized this work and proposed and developed the implementation of the CT, GCT, HOCT and MinEnt algorithms. Jan Platos provided guidance during the writing process and revised the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The following sections present the properties of the transformations. Specifically, for each type of transformation, we provide a proof of the existence of the inverse transformation. Further, we describe how the frequencies of symbols are altered when the particular transformation is applied.

Appendix A.1. CT—Proof of the Correctness

This theorem defines the inverse transformation of the context transformation:
Theorem A1.
The context transformation CT^{-1} ≡ CT(αγ → αβ), applied in the direction opposite to the forward transformation, is the inverse transformation of the context transformation CT(αβ → αγ).
Proof. 
Let CT^{-1} ≡ CT(αγ → αβ); if CT^{-1} is the inverse, then the following must be true for any message m: CT^{-1}(CT(m)) = m. Suppose that we are passing the message m from the end to the beginning and that at positions i and i+1 the digram αβ is located; this digram is replaced by αγ. The next pair of positions explored is i−1 and i, but their value is independent of the preceding replacement, because the replacement took place at position i+1. So when CT^{-1} is applied at position i, it finds there the digram αγ and reverts it back to αβ. ☐
Other combinations of directions do not form a pair of a transformation and its inverse. We give an example for each such combination: CT(αβ → αα) and CT^{-1} over the message m = αβα: CT(αβα) = ααα, but CT^{-1}(ααα) = αββ ≠ αβα. Next, consider CT(αβ → αα) and CT^{-1} over the message m = ααβ: CT(ααβ) = ααα, but CT^{-1}(ααα) = αβα ≠ ααβ. In the last case, consider again CT(αβ → αα) and CT^{-1} over the message m = ααβ: CT(ααβ) = ααα, but CT^{-1}(ααα) = αββ ≠ ααβ.
Let f(αγ, m) denote the number of occurrences of a particular digram αγ in a message m; by the definition of CT, f_0(αγ, m) = 0. The following corollary tells us how many digrams αγ are introduced by the context transformation CT(αβ → αγ):
Corollary A1.
Under the assumption that α ≠ γ, the numbers of occurrences of the digrams αγ and αβ after the application of the transformation CT(αβ → αγ) are f_1(αγ, CT(m)) = f_0(αβ, m) and f_1(αβ, CT(m)) = 0.
Proof. 
The proof is a consequence of Theorem A1: since the replacements are independent of each other, each digram αβ is replaced by αγ, leaving f_1(αβ) = 0 and f_1(αγ) = f_0(αβ). ☐
The corollary allows us to precisely predict not only the frequencies of the interchanged digrams αβ and αγ, but also, as a consequence, the frequencies of the individual symbols after the transformation. The special case of transformations on the diagonal (see Definition A1) is discussed in the next paragraphs.

Appendix A.2. Diagonal Context Transformation

A diagonal transformation is a transformation where one of the digrams participating in the transformation is of the form αα. The resulting frequency of such a digram is unpredictable without knowledge of the distribution of all n-grams of the form α^n, where n ≥ 2, but we show that for any diagonal CT it is possible to predict the frequencies of the symbols α and β. The problems with predictability arise from the repetition of symbols.
Definition A1.
Diagonal context transformation is a context transformation of the form CT(αα → αβ) or CT(αβ → αα).
Consider two transformations, CT_1 ≡ CT(αα → αβ, ααα) = αββ and CT_2 ≡ CT(αβ → αα, αβα) = ααα. If Corollary A1 were also valid for diagonal transformations, then, for instance in the case of CT_1, the frequency would be f_1(αβ) = f_0(αα); but this is obviously not true. Instead, we see that the new frequency f(β) of the symbol β is f_1(β) = f_0(αα).
Suppose we have a message s = α^n; then CT_1(s) = αβ^{n−1}. We clearly see that the frequency f(αβ, CT_1(s)) = 1 and f(ββ, CT_1(s)) = n − 2, because the number of digrams in a message is given by the length of the message minus one. We can now express the frequency f(αβ, CT_1(m)) of the newly introduced occurrences of the digram αβ as a sum over all sub-messages enclosed in m of the form x s x, where x ≠ α, for all n ≥ 2. So we see that it is possible to precisely predict the change of the frequency of αβ, but it demands knowledge of the distribution of all enclosed sub-messages s.
From the other perspective, since each occurrence of the digram αα in the former message is transformed into αβ, we can see that f_1(β, CT_1(m)) = f_0(β) + f_0(αα) and f_1(α, CT_1(m)) = f_0(α, m) − f_0(αα).
Very similar behavior is observed in the second case, CT_2. The problem is the repetition of the pattern t = (αβ)^n, for which CT_2(t) = α^{2n} and f_1(αα) = 2n − 1. Again, without knowledge of all sub-messages t enclosed in m, we cannot predict the exact change of the frequency of either digram αα or αβ; but since we know that each pair αβ in the former message will be transformed into αα, we can again precisely predict the frequencies of the individual symbols: f_1(α) = f_0(α) + f_0(αβ) and f_1(β) = f_0(β) − f_0(αβ).
With the knowledge of the preceding discussion and of Corollary A1, we conclude that for any context transformation CT, we are able to compute the frequency and the corresponding probability of an arbitrary symbol after the application of any CT from the knowledge of the initial distribution of symbols and digrams. In [26], we showed that under certain conditions it is possible to process several context transformations simultaneously.

Appendix A.3. GCT—Frequencies Alteration

Corollary A2.
Under the assumption that α ≠ γ, α ≠ β and β ≠ γ, the numbers of occurrences of the digrams αγ and αβ after the application of the transformation GCT(αβ ↔ αγ) are f_1(αγ, GCT(m)) = f_0(αβ, m) and f_1(αβ, GCT(m)) = f_0(αγ, m).
Proof. 
Since each digram αβ, resp. αγ, is replaced by αγ, resp. αβ, and neither of the digrams influences the transformation of the other, their frequencies must interchange. ☐

Appendix A.4. Generic Transformation—Proof of Correctness

The generic transformation GT exchanges any two digrams. In the design of algorithms, we prefer GCT over GT, since the space from which generic transformations are selected is in this case of the order of |Σ|^4, and when alphabets of large size are dealt with, the search in such a space would be computationally very expensive.
Definition A2.
Generic transformation (GT) is a mapping GT(αβ ↔ γρ, w): Σ^n → Σ^n, where Σ is the alphabet of the input message w and n is the length of the input message, that exchanges all digrams αβ with digrams γρ and vice versa.
The inverse transformation of G C T and G T is defined by the following theorem:
Theorem A2.
The generic transformation GT^{-1}, defined as the generic transformation GT(αβ ↔ γρ) applied in the direction opposite to the forward transformation, is the inverse of the generic transformation GT(αβ ↔ γρ).
Proof. 
First, we show that it is sufficient to prove that, for any string s = x w x, it holds that GT^{-1}(GT(s)) = s, where x ∉ Σ_GT = {α, β, γ, ρ} and w[i] ∈ Σ_GT. Suppose that x is located at position p; then, for the digrams d at positions (p−1, p) and (p, p+1), it holds that GT(d) = d. So the first possible applications of GT can occur at positions (p−2, p−1) and (p+1, p+2), and these are independent, i.e., non-overlapping.
Next, we show that each replacement made by the forward transformation will be reverted by the inverse transformation. Take, for example, the transformation GT(αβ ↔ γρ) applied in the right-to-left direction. The last applied forward replacement, at positions (r, r+1), replaces, for instance, the digram αβ with γρ, leaving w[r, r+1] = γρ; the inverse transformation, by definition the same transformation applied in the opposite direction, reverts the digram γρ back to αβ. Now consider any triplet of positions (r−1, r, r+1) in the transformed message. The input of the inverse transformation at (r, r+1) depends on the result of the inverse transformation at the preceding pair of positions, but, as we saw, the first applied inverse reverted its digram correctly, so the state at positions (r+1, r+2) is exactly the state left by the forward transformation at these positions. Hence any other digram will be reverted correctly, because every preceding application of the inverse leaves the digram in the state left by the forward transformation, and this digram is trivially reverted back to its initial state. The same argument is valid for GT applied in the opposite direction, since applying GT to m in one direction is equivalent to applying it in the other direction to the mirror message m^T. ☐

Appendix A.5. HOCT—Proof of the Correctness

The following trivial Lemma will help us to formulate a theorem about inverse transformation to H O C T :
Lemma A1.
Let T = HOCT(wβ ↔ wγ, m, P(w, m)) be a higher order context transformation over the input message m. Given that we possess the knowledge of w and of the positions P(w, m), then T^{-1} = T.
Proof. 
Because we do not have to pass through the whole message in either the forward or the inverse transformation case, but only through the set of positions P(w, m), the symbol at a position i ∈ P(w, m), for instance m[i] = β, is switched by the HOCT to m[i] = γ, and by a repeated application of the HOCT it reverts back to m[i] = β. ☐
Lemma A1 is trivial, but it comes into play when P(w, m) is a product of some other higher order context transformation, i.e., the one with an order lower by one.
Theorem A3.
Let m_1 = HOCT_1(wα ↔ wβ, m, P(w, m)) and m_2 = HOCT_2(wαγ ↔ wαρ, m_1, P(wα, m_1)) be two higher order context transformations. Let T(m) = HOCT_2(HOCT_1(m)) be the composition of the two higher order context transformations over the input message m. Let HOCT_2^{-1} ≡ HOCT_2(wαγ ↔ wαρ, m_3, P(wβ, m_3)), where m_3 = HOCT_1(m_2); then the composition T^{-1} ≡ HOCT_2^{-1}(HOCT_1(m_2)) = m is the inverse transformation of T.
Several remarks on the formulation of Theorem A3: the transformations HOCT_1 and HOCT_2 are applied over two consecutive states of the message. The positions P(wα, m_1) correspond to the positions P(wβ, m), since the sub-messages wα were replaced by wβ in the application of HOCT_1. The inverse transformation HOCT_2^{-1} is applied instead over the positions P(wβ, m_3), since these positions have already been reverted back by HOCT_1.
The proof is based on the restriction that w[0] ≠ w[i], i > 0; it can be viewed as splitting the input message m into sub-messages s_i separated by w[0]. For instance, suppose that w[0] is the space character in an ordinary text; since, by Definition 3, no other character of w can be the space character, it follows that the possible transformations are applied to the words following the space character. Now, the fact that each s_i is enclosed by w[0], i.e., that the sub-messages do not overlap, allows us to handle each sub-message s_i independently.
Proof. 
For the two sets of positions, it holds that P(w, m) ∩ P(wα, m_1) = ∅, because the elements of the former are predecessors of the elements of the latter and the sub-messages s_i do not overlap. The locations of w in m and in m_2 are identical, as they were not modified by either transformation, i.e., P(w, m) = P(w, m_2). When we apply HOCT_1 again, it simply reverts the symbols at the positions given by P(w, m) according to Lemma A1, yielding the message state m_3. In the forward transformation, HOCT_2 was applied over the positions P(wα, m_1), but these are the former positions P(wβ, m), which have already been transformed back by the application of HOCT_1; hence P(wα, m_1) is equal to P(wβ, m_3), and when HOCT_2 is applied over the positions P(wβ, m_3), it exchanges the symbols γ and ρ and eventually yields m. ☐
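The bookkeeping of the proof can be traced on the small example m = abcabc that is also used below. The following sketch, with hand-picked position sets and a symbol-exchange helper of the same kind as in the previous sketch, is only an illustration under these assumptions, not a general implementation of the transformation composition.

```python
def hoct(message, positions, beta, gamma):
    # Exchange beta <-> gamma at the listed positions (symbols following w).
    out = list(message)
    for i in positions:
        if out[i] == beta:
            out[i] = gamma
        elif out[i] == gamma:
            out[i] = beta
    return "".join(out)

m   = "abcabc"
P_w = [1, 4]                   # positions following the context w = "a" in m
m1  = hoct(m, P_w, "b", "a")   # HOCT_1 exchanges a <-> b after "a": "aacaac"
P_2 = [2, 5]                   # positions following the context "aa" in m1
m2  = hoct(m1, P_2, "c", "a")  # HOCT_2 exchanges c <-> a after "aa": "aaaaaa"

# Inverse composition of Theorem A3: apply HOCT_1 again (Lemma A1), then
# HOCT_2 again over the same index set, which in m3 follows the reverted
# context "ab".
m3 = hoct(m2, P_w, "b", "a")   # "abaaba"
assert hoct(m3, P_2, "c", "a") == m
```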
The recursive application of Theorem A3 leads to the conclusion that this process can be repeated until there is no pair of symbols left other than those containing w[0] as one of the symbols α or β, or until we simply reach the end of the message.
Corollary A2, about the prediction of frequencies in the case of GCT, is also applicable in the case of HOCT, because the principle that the exact number of replacements is known remains valid, and we are therefore able to compute the future probabilities of symbols precisely before an arbitrary HOCT is applied.
If we implement the inverse algorithm as a sequential algorithm operating in the left-to-right manner, it is possible for one of the transformation symbols β, γ to be equal to w[0]. Consider the following example: m = abcabc, P(a, m) = {0, 3}, HOCT_1(ab → aa) and HOCT_2(aac → aaa), yielding the output message m_2 = aaaaaa. Applying the inverse transformation sequentially from left to right, we first replace aa by ab, yielding m_i = abaaaa; we then apply the replacement of aba by abc, yielding m_{i+1} = abcaaa. Because no further transformation is induced from abc, we know that the next symbol a is w[0], and we can repeat the preceding process starting from this a. The sufficient condition for introducing w[0] as the transformation symbol β or γ is that w contains no other occurrence of w[0] in w[i], i > 0, because the inverse process removes all introduced w[0] symbols from the transformed message during the left-to-right sequential inverse transformation.
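A short Python sketch of this left-to-right sequential inverse, hard-coded for exactly the example above (the exchange a ↔ b after the context a, the exchange a ↔ c after the reverted context ab, and the step over a fully reverted word), may make the scan explicit; it is our illustrative sketch under these assumptions, not the article's general algorithm.

```python
def inverse_sequential(m2):
    """Left-to-right sequential inverse for the example m = abcabc with
    HOCT_1(ab -> aa) and HOCT_2(aac -> aaa)."""
    m = list(m2)
    i = 0
    while i < len(m):
        if m[i] == "a" and i + 1 < len(m):                     # w[0] detected
            m[i + 1] = {"a": "b", "b": "a"}.get(m[i + 1], m[i + 1])      # revert HOCT_1
            if m[i] + m[i + 1] == "ab" and i + 2 < len(m):     # reverted context
                m[i + 2] = {"a": "c", "c": "a"}.get(m[i + 2], m[i + 2])  # revert HOCT_2
            i += 3   # the word starting at i is fully reverted; continue after it
        else:
            i += 1
    return "".join(m)

assert inverse_sequential("aaaaaa") == "abcabc"
```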

References

1. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: New York, NY, USA, 2006.
2. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
3. Huffman, D.A. A Method for the Construction of Minimum-Redundancy Codes. Proc. Inst. Radio Eng. 1952, 40, 1098–1101.
4. Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic Coding for Data Compression. Commun. ACM 1987, 30, 520–540.
5. Charikar, M.; Lehman, E.; Lehman, A.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The Smallest Grammar Problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576.
6. Nevill-Manning, C.G. Inferring Sequential Structure. Ph.D. Thesis, University of Waikato, Hamilton, New Zealand, May 1996.
7. Nevill-Manning, C.G.; Witten, I.H. Identifying Hierarchical Structure in Sequences: A Linear-time Algorithm. J. Artif. Int. Res. 1997, 7, 67–82.
8. Kieffer, J.C.; Yang, E.-H. Grammar Based Codes: A New Class of Universal Lossless Source Codes. IEEE Trans. Inf. Theory 2000, 46, 737–754.
9. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536.
10. Yang, E.; He, D. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform, Part 2: With context models. IEEE Trans. Inf. Theory 2003, 49, 2874–2894.
11. Gage, P. A New Algorithm for Data Compression. C Users J. 1994, 12, 23–38.
12. Nakamura, H.; Marushima, S. Data Compression by Concatenation of Symbol Pairs. In Proceedings of the IEEE International Symposium on Information Theory and Its Applications, Paris, France, 13–17 September 1996; pp. 496–499.
13. Larsson, N.J.; Moffat, A. Off-line dictionary-based compression. Proc. IEEE 2000, 88, 1722–1732.
14. Claude, F.; Farina, A.; Navarro, G. Re-Pair Compression of Inverted Lists. arXiv 2009.
15. Masaki, T.; Kida, T. Online Grammar Transformation Based on Re-Pair Algorithm. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 29 March–1 April 2016; pp. 349–358.
16. Grassberger, P. Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution. arXiv 2002.
17. Calcagnile, L.M.; Galatolo, S.; Menconi, G. Non-sequential recursive pair substitutions and numerical entropy estimates in symbolic dynamical systems. arXiv 2008.
18. Navarro, G.; Russo, L. Re-pair Achieves High-Order Entropy. In Proceedings of the Data Compression Conference, DCC 2008, Snowbird, UT, USA, 25–27 March 2008; p. 537.
19. Vasinek, M.; Platos, J. Entropy Reduction Using Context Transformations. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 26–28 March 2014; p. 431.
20. Vasinek, M.; Platos, J. Generalized Context Transformations—Enhanced Entropy Reduction. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 7–9 April 2015; p. 474.
21. Vasinek, M.; Platos, J. Higher Order Context Transformations. arXiv 2017.
22. Kida, T.; Matsumoto, T.; Shibata, Y.; Takeda, M.; Shinohara, A.; Arikawa, S. Collage System: A Unifying Framework for Compressed Pattern Matching. Theor. Comput. Sci. 2003, 298, 253–272.
23. González, R.; Navarro, G. Compressed Text Indexes with Fast Locate. In Proceedings of the 18th Annual Conference on Combinatorial Pattern Matching, CPM'07, London, ON, Canada, 9–11 July 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 216–227.
24. Claude, F.; Farina, A.; Navarro, G. Re-Pair Compression of Inverted Lists. arXiv 2009.
25. Vasinek, M. Kontextove Mapy a Jejich Aplikace (Context Maps and Their Applications). Master's Thesis, Vysoka Skola Banska—Technicka Univerzita Ostrava, Ostrava, Czech Republic, 2013.
26. Vasinek, M.; Platos, J. Parallel Approach to Context Transformations. Available online: http://ceur-ws.org/Vol-1343/paper4.pdf (accessed on 11 May 2017).
Figure 1. Comparison of zero order entropy evolution over the paper5 file from the Calgary corpus. (a) Evolution of zero order entropy for different strategies; (b) evolution of zero order entropy for different values of the limit (LIM) in the HOCT strategy.
Figure 2. Comparison of Re-Pair and MinEnt algorithms: evolution of the message size measured in the number of symbols over the paper5 file from the Calgary corpus.
Figure 3. Comparison of Re-Pair, MinEnt, GCT and HOCT algorithms: evolution of the entropic message size measured in bits per byte over the paper5 file from the Calgary corpus.
Figure 4. The entropy of two letters with different c_i ∈ {0.25, 0.5, 1.0}.
Figure 5. The portion of entropy H(p_1(Σ_I)) given by symbols from Σ_I as a function of c_1 for the constant H(p_0(Σ_I)) = 4 and p_0(x) ∈ {0.2, 0.5, 0.8}.
Figure 6. Dependency of H(p_1(Σ_T)) on different values of c_2 for three cases of p_0(x) ∈ {0.05, 0.1, 0.2}.
Table 1. Characteristics of the paper5 file from the Calgary corpus: the initial size of the alphabet |Σ|, the initial file size |m_0| measured in bytes, the initial entropy H_0 measured in bits and the initial entropic size |m_0|_H measured in bytes.

File Name   |Σ|   |m_0|    H_0     |m_0|_H
paper5      91    11,954   4.936   7376
Table 2. The comparison of transformation strategies using different criteria: LIM is the limiting size of the dictionary entry in bytes; i is the number of iterations; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; log|Σ_i| is the upper boundary on the amount of information needed to store one symbol in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Strategy   LIM   i      H_i     |m_i|_H   log|Σ_i|   |m_f|
GCT        0     2367   3.796   5674      6.508      11,451
GCT        |D|   127    4.260   6366      6.508      6676
HOCT       0     3821   2.786   4163      6.508      13,488
HOCT       4     1143   3.528   5272      6.508      8061
HOCT       8     525    3.830   5713      6.508      6994
HOCT       |D|   222    4.067   6078      6.508      6439
Table 3. The comparison of compression strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D_i| is the average number of bits needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Strategy   i      |Σ_i|   |m_i|   H_i     |m_i|_H   |D_i|    |m_f|
Re-Pair    1146   965     2832    9.283   3286      10.240   4753
MinEnt     1129   944     2798    9.395   3286      10.281   4737
Table 4. The comparison of strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

File Name      |m_0|      |Σ_0|   H_0     i        |Σ_i|    |m_i|     H_i      |m_i|_H     |D|/|Σ_i|   |m_f|
Calgary corpus
bib            111,261    81      5.257   5469     4216     15,159    11.410   21,621      11.589      29,544
book1          768,771    82      4.528   23,587   22,649   128,059   13.422   214,859     11.904      249,957
book2          610,856    96      4.681   21,147   18,501   82,446    13.213   136,165     12.829      170,079
news           377,109    98      5.226   20,079   13,602   55,500    12.809   88,863      12.761      120,892
obj1           21,504     256     5.929   1650     1475     6464      9.888    7990        10.918      10,242
obj2           246,814    256     6.280   14,635   9569     35,540    12.323   54,743      13.044      78,607
paper1         53,161     95      4.967   3559     2678     8800      10.890   11,979      11.360      17,033
paper2         82,199     91      4.506   4297     3753     14,102    11.235   19,805      11.181      25,811
paper3         46,526     84      4.588   2989     2575     9061      10.767   12,195      10.791      16,227
paper4         13,286     80      4.602   1194     997      3136      9.622    3772        10.130      5284
paper6         38,105     93      5.000   2834     2108     6670      10.585   8826        11.220      12,801
progc          39,611     92      5.282   2854     2066     6526      10.641   8681        11.254      12,696
progl          71,646     87      4.830   4162     2577     7216      10.851   9788        12.003      16,033
progp          49,379     89      4.823   3147     1684     4528      10.272   5814        11.952      10,516
trans          93,695     99      5.545   5918     2505     6513      10.968   8929        12.419      18,116
Canterbury corpus
alice29.txt    152,089    74      4.435   6733     6068     25,077    11.985   37,568      11.482      47,232
asyoulik.txt   125,179    68      4.889   5799     5293     23,532    11.774   34,634      10.932      42,559
bible.txt      4,047,392  63      4.260   81,229   71,256   386,094   15.017   724,728     14.525      872,215
cp.html        24,603     86      5.107   1785     1271     4242      9.590    5085        10.689      7470
E.coli         4,638,690  4       2.000   67,368   62,924   652,664   13.725   1,119,687   7.462       1,182,530
fields.c       11,150     90      4.924   927      658      1503      9.304    1748        10.822      3002
kennedy.xls    1,029,744  256     3.584   2446     2545     160,177   9.788    195,978     8.274       198,508
lcet10.txt     426,754    84      4.627   14,515   12,395   55,691    12.759   88,823      12.426      111,369
ptt5           513,216    159     1.049   5995     5697     30,463    11.424   43,503      11.178      51,880
random.txt     100,000    64      6.000   5065     5126     54,182    11.182   75,731      3.983       78,253
sum            38,240     255     5.447   3116     1749     6251      10.290   8041        11.912      12,681
world192.txt   2,473,400  94      5.024   55,473   47,150   212,647   14.552   386,808     13.973      483,705
xargs.1        4227       74      4.863   468      384      990       8.255    1022        9.811       1596
Table 5. The comparison of the strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

File Name      |m_0|      |Σ_0|   H_0     i        |Σ_i|    |m_i|     H_i      |m_i|_H     |D|/|Σ_i|   |m_f|
Calgary corpus
bib            111,261    81      5.201   5513     4150     15,103    11.307   21,346      11.717      29,421
book1          768,771    82      4.527   23,843   22,616   127,777   13.377   213,656     12.134      249,822
book2          610,856    96      4.793   20,852   17,997   80,814    13.170   133,045     12.847      166,533
news           377,109    98      5.190   20,118   13,388   55,347    12.697   87,845      12.918      120,333
obj1           21,504     256     5.948   1638     1418     6459      9.727    7853        11.135      10,133
obj2           246,814    256     6.260   14,673   9337     35,510    12.173   54,031      13.218      78,275
paper1         53,161     95      4.983   3579     2633     8726      10.741   11,716      11.413      16,822
paper2         82,199     91      4.601   4247     3612     13,797    11.088   19,123      11.221      25,080
paper3         46,526     84      4.665   3004     2529     8993      10.676   12,002      10.892      16,092
paper4         13,286     80      4.700   1136     930      3133      9.232    3615        10.288      5076
paper6         38,105     93      5.010   2841     2080     6662      10.398   8659        11.297      12,671
progc          39,611     92      5.199   2871     2034     6530      10.444   8525        11.307      12,583
progl          71,646     87      4.770   4175     2495     7134      10.742   9579        12.100      15,894
progp          49,379     89      4.869   3145     1631     4509      10.166   5730        12.194      10,524
trans          93,695     99      5.533   5916     2425     6515      10.736   8743        12.713      18,145
Canterbury corpus
alice29.txt    152,089    74      4.568   6649     5903     24,825    11.767   36,516      11.573      46,135
asyoulik.txt   125,179    68      4.808   5804     5220     23,359    11.563   33,764      11.018      41,758
bible.txt      4,047,392  63      4.343   77,117   66,593   386,092   14.649   706,991     14.543      847,187
cp.html        24,603     86      5.229   1748     1212     4313      9.462    5101        10.897      7482
E.coli         4,638,690  4       2.000   66,995   62,463   652,663   13.717   1,119,067   7.667       1,183,281
fields.c       11,150     90      5.008   868      587      1606      8.724    1751        11.013      2946
kennedy.xls    1,029,744  256     3.573   2612     2511     159,999   10.012   200,240     8.575       203,040
lcet10.txt     426,754    84      4.669   14,506   12,178   54,937    12.661   86,941      12.462      109,539
ptt5           513,216    159     1.210   23,203   6314     94,463    4.566    53,918      12.308      89,618
random.txt     100,000    64      5.999   5145     5209     54,011    11.235   75,854      4.075       78,475
sum            38,240     255     5.329   3130     1683     6245      10.034   7833        12.184      12,600
world192.txt   2,473,400  94      4.998   54,946   45,920   212,499   14.340   380,896     14.078      477,588
xargs.1        4227       74      4.898   342      326      1235      7.755    1197        9.988       1624
