Prediction and Evaluation of Zero Order Entropy Changes in Grammar-Based Codes

Abstract: The change of zero order entropy is studied over different strategies of grammar production rule selection. Two major types of rules are distinguished: transformations, which leave the message size intact, and substitution functions, which change the message size. Relations for the zero order entropy changes are derived for both cases, and the conditions under which the entropy decreases are described. In this article, several different greedy strategies reducing the zero order entropy, as well as the message size, are summarized, and a new strategy, MinEnt, is proposed. The resulting evolution of the zero order entropy is compared with the strategy of selecting the most frequent digram used in the Re-Pair algorithm.


Introduction
Entropy is a key concept in the measurement of the amount of information in information theory [1]. From the data compression perspective, this amount of information represents the lower limit of the achievable compression of some information source. Due to the well-known work by Shannon [2], we know that using fewer bits than the amount given by entropy to represent a particular message or process would necessarily lead to the loss of some information and, as a consequence, to our inability to properly recover the original structure of the message.
This work is focused on the study of entropy in data compression, and therefore, our discussion will be restricted to finite messages only. These finite messages are formed by symbols, and from this perspective, the entropy can be understood as the lowest number of bits needed on average to uniquely represent each symbol in a message. There are messages for which the evaluation of entropy can be a very hard task, and so, we are often forced to content ourselves with some approximation of entropy.
The simplest approximation is the one based on the probability distribution of symbols in a particular message. In this case, the symbols are viewed as independent entities, and their mutual relationships are not taken into account. A better approximation of entropy is based on conditional probabilities, where we also take into account how symbols follow each other. We can also approximate entropy by computing the bits-per-byte ratio of the message encoded by state-of-the-art data compression algorithms.
In this article, we study entropy at the level of independent symbols. This approximation of entropy is often called zero order entropy. There are two major data compression algorithms in use that compress messages almost to the rate given by zero order entropy: Huffman [3] and arithmetic [4] coding. The zero order entropy can be computed using the Shannon equation:

H(X) = −∑_{x∈Σ} p(x) log p(x), (1)

where X stands for a random variable representing the probability distribution of symbols in the input message m and p(x) is the probability of symbol x from alphabet Σ. The expected length of the code for the particular symbol x is given by −log p(x). If the expected length of the code is multiplied by its probability, we obtain the average number of bits needed to represent any symbol x ∈ Σ. The size of the message using the expected lengths of codes is given as the product of the length of the message |m|, measured as a number of symbols, and the zero order entropy:

|m|^H = |m| · H(X). (2)

When we refer to the term entropic size of the message, we always mean the quantity given by Equation (2), and it will be denoted using a superscript as |m|^H. We study how the entropic size of the message evolves when all occurrences of some m-gram are substituted for some other n-gram and vice versa. We study two such substitutions: transformations and compression functions. Transformations replace n-grams of the same length. A transformation leaves the message size intact, but since the probabilities of symbols change, the value of zero order entropy also changes. Compression functions replace m-grams by n-grams, where m > n, and make the message smaller, but a change in zero order entropy also occurs.
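As an illustration of Equations (1) and (2), the following minimal Python sketch (our own illustration; the function names are not from the paper) computes the zero order entropy and the entropic size of a message:

from collections import Counter
from math import log2

def zero_order_entropy(m: str) -> float:
    """Shannon entropy of the symbol distribution in m, Equation (1)."""
    counts = Counter(m)
    return -sum((f / len(m)) * log2(f / len(m)) for f in counts.values())

def entropic_size(m: str) -> float:
    """Entropic size |m|^H in bits, Equation (2)."""
    return len(m) * zero_order_entropy(m)

print(round(entropic_size("aababcdcdb"), 2))  # 19.71 bits, cf. Example 4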
The main idea behind the concept of transformations is the following: consider Huffman coding, where more probable symbols are encoded by prefix codes that are shorter than (or at most as long as) those of less probable symbols. If the symbol β is more probable than the symbol γ, but, in the context of some symbol α, γ is more frequent than β, then by exchanging these symbols wherever they follow α, the occurrences previously encoded by the longer code of γ will instead be encoded by the shorter code of β. Under this assumption, it is possible to pre-process data so that the frequency of more frequent symbols increases and the frequency of less frequent symbols decreases.

The following notation is used throughout the article:

• The alphabet of the input message m of the size |m| is denoted by Σ and its size by |Σ|.
• Greek symbols are used to denote variables representing symbols in the input message.
• When we refer to the term entropy, we always mean Shannon's entropy defined by Equation (1).
• All logarithms are to base two.
• Any quantity Q_i with a subscript i ∈ N denotes consecutive states of the quantity between substitutions. For instance, a quantity Q_0 is the value of the quantity before any substitution is applied, and Q_1 is the value of the quantity after some substitution is applied.

Zero Order Context
When all symbols are interpreted as independent individual entities and no predecessors are taken into consideration, we speak of the zero order context. Zero order entropy is then computed as Shannon's entropy of the random variable given by the probabilities of symbols in the input message.

N-th Order Context
In the case where the probability distribution of symbols following a particular fixed-length prefix w is taken into consideration, and the length of the prefix is N, the order of the context is N, and the N-th order entropy is computed as Shannon's entropy of the conditional distribution of symbols following all the different prefixes w_i.
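To make the distinction concrete, the following sketch (our own illustration, not code from the paper) computes the N-th order entropy as the entropy of the symbol following each length-N context, weighted by the probability of that context:

from collections import Counter, defaultdict
from math import log2

def nth_order_entropy(m: str, N: int) -> float:
    """Average entropy of the symbol following each length-N context."""
    followers = defaultdict(Counter)
    for i in range(len(m) - N):
        followers[m[i:i + N]][m[i + N]] += 1
    total = sum(sum(c.values()) for c in followers.values())
    h = 0.0
    for counts in followers.values():
        n_ctx = sum(counts.values())
        h_ctx = -sum((f / n_ctx) * log2(f / n_ctx) for f in counts.values())
        h += (n_ctx / total) * h_ctx  # weight the context by its probability
    return h

print(nth_order_entropy("aababcdcdb", 0))  # equals the zero order entropy
print(nth_order_entropy("aababcdcdb", 1))  # first order entropy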

Previous Work
The class of algorithms dealing with exchanges of different n-grams and m-grams is called grammar-based algorithms. Their purpose is to provide a set of production rules inferring the content of the message. Using the Chomsky hierarchy, we identify two classes of formal grammars used in data compression: context-free grammars (CFG) and context-sensitive grammars (CSG). The context transformations presented in Section 3 belong to the CSG class; meanwhile, the compression functions belong to the CFG class. The problem of the search for the most compact context-free grammar representation of a message is NP-hard, unless P = NP [5]. Instead of searching for the optimal solution, many heuristic and greedy algorithms were proposed.
CFGs for data compression were first discussed by Nevill-Manning [6], followed by the proposal of the SEQUITUR algorithm [7]. SEQUITUR reads the sentence in a left-to-right manner so that each repeated pattern is transformed into a grammar rule. The grammar is maintained in such a way that two properties are fulfilled: digram uniqueness (no pair of adjacent symbols appears more than once in the grammar) and rule utility (every rule is used more than once). Kieffer and Yang [8] were the first to address data compression using CFGs from the information-theoretic perspective; they showed that the LZ78 [9] algorithm can be interpreted as a CFG and that their proposed BISECTION algorithm forms a grammar-based universal lossless source code. BISECTION repeatedly halves the initial message into unique phrases of length 2^k, where k is an integer.
In the work of Yang and He [10], context-dependent grammars (CDG) for data compression were introduced. In a CSG, the context is present on both sides of the production rules; meanwhile, in a CDG, the context is defined only on the left side of the production rule.
One of the first concepts in greedy grammar-based codes was byte pair encoding (BPE) [11]. The BPE algorithm selects the most frequent digram and replaces it with some unused symbol. The main weakness of this approach is that the algorithm is limited to an alphabet consisting only of byte values. The concept of byte pair encoding was later revised, and the limitation on the alphabet size was lifted independently by Nakamura and Murashima [12] and by Larsson and Moffat [13]; the resulting approach is called Re-Pair [13]. Re-Pair stands for recursive pairing, and it is a very active field of research [14,15]. It iteratively replaces the most frequent digram with an unused symbol until there is no digram that occurs more than once.
Unlike BPE, which codes digrams using only byte values, Re-Pair expects that the symbols of the final message will be encoded using some entropy coding algorithm. Approaches derived from Re-Pair are usually greedy, since each iteration of the algorithm depends on a successful search for the extremal value of some statistical quantity related to the input message. The study of Re-Pair from the perspective of ergodic sources is discussed in [16,17]. Neither BPE nor Re-Pair compresses the message into the least possible size; rather, they form a trade-off between message and dictionary sizes. Re-Pair-like algorithms are off-line, in the sense that they need more than one pass through the input message; meanwhile, SEQUITUR incrementally builds the grammar in a single pass. Re-Pair is an algorithm with O(n) time complexity; it is easy to implement using linked lists and a priority queue. Further, it was shown in [18] that it can compress an input message of length n over an alphabet of size |Σ| into at most 2nH_k + o(n log |Σ|) bits, where H_k is the k-th order entropy.
Our recent studies focused on a special class of grammar transforms that leave the message size intact [19,20]. In the present paper, the class of grammar transformations is extended with the novel concept of higher order context transformations [21]. We provide examples of transformations, extend the evaluation of entropy, resp. entropic size, reduction to the class of grammar compression algorithms, and compare the evolution of entropy, entropic size and the resulting number of dictionary entries for Re-Pair and our version of Re-Pair, called MinEnt, which is based on the selection of the pair of symbols reducing the entropic size of the message the most. Re-Pair finds application in areas such as searching in compressed data [22], compression of suffix arrays [23] or compression of inverted indexes [24], to name a few. These areas are also natural application fields for MinEnt. From the perspective of the number of passes through the message, the approaches discussed in this paper belong to the off-line algorithms.

Transformations and Compression Algorithms
In this section, we will describe and evaluate several invertible transformations T and substitution functions F such that, for any two consecutive states of the message, m_0 and m_1, before and after the application of T or F, the following relation holds:

|m_1|^H < |m_0|^H. (3)

Measuring the size of the message by its entropic size is preferred, since, using arithmetic coding, one can achieve a compression rate very close to the zero order entropy, and so the size |m|^H is in theory attainable. Further, it allows the comparison of two distinct substitutions when their resulting sizes measured by the number of symbols are equal. The derivation of the equations for the computation of ∆|m|^H is provided in Section 4.

Transformations
Consider a transformation where we replace all occurrences of some symbol β by some symbol γ and vice versa; such a transformation is called a symmetry transformation, because it does not modify any measurable quantities related to the amount of information. The information content changes when the replacement is performed in the context of another symbol α. Such a transformation corresponds to the exchange of all digrams αβ for αγ and vice versa. In this section, several different forms of transformations are distinguished and briefly described. Some properties of transformations and their proofs can be found in Appendix A.

Context Transformation
The concept of context transformations was first proposed in [25], and the results were presented in [19]. It is the simplest transformation; it assumes a pair of digrams beginning with the same symbol, where one of the digrams is initially missing in the input message.

Definition 1. A context transformation (CT) is a mapping CT(αβ → αγ, w) : Σ^n → Σ^n that replaces all digrams αβ by αγ, where p(α, γ) = 0 and β ≠ γ. Σ is the alphabet of the input message w, and n is the length of w.
Let CT_← be the context transformation applied from the end of the message to the beginning and CT_→ the one applied in the opposite direction. The context transformation CT_→ is the inverse transformation of CT_←. The proof of this property, with an explanation of why this is the only pairing of the function and its inverse, is left to Appendix A. The application of two consecutive context transformations and their inverse functions is presented in Example 1.

Generalized Context Transformation
Context transformations were restricted to cases where one of the digrams was missing in the input message. This restriction is removed by the introduction of generalized context transformations, first proposed in [20].

Definition 2. A generalized context transformation (GCT) is a mapping GCT(αβ ↔ αγ, w) : Σ^n → Σ^n that exchanges all occurrences of a digram αβ with a digram αγ and vice versa. Σ is the alphabet of the input message w, and n is the length of w.

Example 2.
aabcabab |GCT_←(ab ↔ aa)| abacaaaa |GCT_→(aa ↔ ab)| aabcabab

While both transformations CT and GCT swap occurrences of two different digrams beginning with the same symbol, they differ in the way they are applied and in how the inverse transformation is formed. GCT can be applied in both directions, and the inverse transformation GCT^{-1} is always applied in the direction opposite to the forward transformation. The algorithm based on CT and GCT works as follows (a sketch of the transformation itself follows the list):

1. Find and apply the transformation T for which the decrease of the entropic size ∆|m|^H is maximal.
2. Repeat Step 1 until no transformation can decrease the entropic size of the message.
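The following sketch (our own illustration; the scan semantics are inferred from Example 2) applies a GCT by scanning the message one position at a time and swapping the second symbol of a matched digram; running the same exchange in the opposite direction inverts it:

def gct(m: str, alpha: str, beta: str, gamma: str, reverse: bool = False) -> str:
    """Exchange all digrams alpha+beta <-> alpha+gamma while scanning
    left-to-right (reverse=False) or right-to-left (reverse=True)."""
    s = list(m)
    positions = range(len(s) - 2, -1, -1) if reverse else range(len(s) - 1)
    for i in positions:
        if s[i] == alpha:
            if s[i + 1] == beta:
                s[i + 1] = gamma
            elif s[i + 1] == gamma:
                s[i + 1] = beta
    return "".join(s)

t = gct("aabcabab", "a", "b", "a", reverse=True)  # GCT_<-(ab <-> aa)
print(t)                                          # abacaaaa
print(gct(t, "a", "a", "b"))                      # GCT_->(aa <-> ab) -> aabcabab

This reproduces Example 2, including the cascading effect of overlapping digrams during the scan.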
It is also possible to define a transformation and its inverse so that all symbols constituting the replaced pairs differ, for instance ab ↔ cd; such a transformation is called a generic transformation GT. In this article, we do not propose algorithms based on GT, but because the set of all generalized context transformations is a subset of the set of generic transformations, the proof of the existence of the inverse transformation is the same for both GCT and GT. The reader can find the proof in Appendix A.

Higher Order Context Transformation
Every time we apply a generalized context transformation GCT, we acquire knowledge about the positions of two distinct digrams in the message. We can either discard this knowledge, or we can try to build on it. The following definition introduces a transformation that is applied over the positions where some other transformation was applied before:

Definition 3. Let P(w, m) be the set of positions of the first symbol following the sub-message w in the message m, where w satisfies w[0] ≠ w[i] for i > 0; then the higher order context transformation (HOCT) is a mapping HOCT(wβ ↔ wγ, m, P(w, m)) : Σ^n → Σ^n that exchanges all sub-messages wβ for sub-messages wγ and vice versa.
The restriction that the sub-message w has to satisfy, w[0] ≠ w[i] for i > 0, is closely related to the existence of the inverse transformation of HOCT. The properties related to HOCT and their proofs are left to Appendix A.
Let O = |w| be the size of the sub-message w from Definition 3; then O is the order of the HOCT. Any GCT(αβ ↔ αγ) is then a first order HOCT(αβ ↔ αγ, m, P(α, m)). Given that we have just applied some transformation m_1 = HOCT_1(wβ ↔ wγ, m, P(w, m)), we can decide to collect the positions of either w_1 = wβ or w_2 = wγ, collect the distribution of symbols at P(w_i, m) and apply another HOCT(w_iρ ↔ w_iφ, m_1, P(w_i, m)). In this sense, HOCT is used not only to interchange different sub-messages, but it also allows one to proceed with some other HOCT of a higher order. The application of two consecutive HOCT transformations is presented in Example 3. A HOCT is thus a recursive application of GCT in the context of some prefix w. The steps of the algorithm are outlined as follows (a sketch of the recursion follows the list):

1. Find and apply the HOCT(αβ ↔ αγ) over the set of positions P(α) for which the decrease of the entropic size ∆|m|^H is maximal.
2. If the frequency of αβ, resp. αγ, is larger than one, then repeat Step 1 over the set of positions P(αβ), resp. P(αγ), i.e., the positions where the HOCT from Step 1 was applied; otherwise, repeat Step 1 over the positions P(α), or return if no more HOCTs pass the entropic size reduction conditions.

The algorithm above is called iteratively for symbols sorted from the most frequent to the least frequent. The Lim variable can be used to rule out transformations whose entropic size reduction is too small for them to be efficiently stored in the dictionary.
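A simplified sketch of the recursive idea (our own illustration; it omits the bookkeeping of the full algorithm and the scan-direction subtleties discussed in Appendix A): the exchange is performed only at the positions following the prefix w, and those positions can be reused for the next, higher order step.

def hoct(m: list, w: str, beta: str, gamma: str) -> list:
    """Exchange beta <-> gamma at the positions following the prefix w,
    i.e., a sketch of HOCT(w*beta <-> w*gamma, m, P(w, m))."""
    k = len(w)
    positions = [i + k for i in range(len(m) - k)
                 if "".join(m[i:i + k]) == w]   # P(w, m)
    for p in positions:
        if m[p] == beta:
            m[p] = gamma
        elif m[p] == gamma:
            m[p] = beta
    return m

m = hoct(list("abcabc"), "a", "b", "a")   # first order: GCT(ab <-> aa)
m = hoct(m, "aa", "c", "a")               # second order: HOCT(aac <-> aaa)
print("".join(m))                         # aaaaaa, cf. the example in Appendix A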

Compression Functions
In the preceding section, we described three types of transformations that leave the message size intact. In this section, we focus on the description of two approaches to the replacement of digrams with a new symbol. First, we describe the basic principles of the well-known Re-Pair algorithm, and then we propose a modification of Re-Pair called MinEnt.

Re-Pair
The main idea behind the Re-Pair algorithm is to repeatedly find the most frequent digram and replace all of its occurrences with a new symbol that is not yet present in the message. The algorithm can be described in the following steps (a sketch follows the list):

1. Select the most frequent digram αβ in message m.
2. Replace all occurrences of αβ with a new symbol γ.
3. Repeat Steps 1 and 2 until every digram appears only once.

In Step 2 of the algorithm, the pair αβ, together with the new symbol γ, is stored in a dictionary. The implementation details of the Re-Pair algorithm are left to Section 3.2.2 regarding the proposed MinEnt algorithm.
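A naive sketch of these steps (our own illustration; the actual implementation in [13] uses linked lists and a priority queue to reach the stated time bounds):

from collections import Counter

def repair(m: list):
    """Repeatedly replace the most frequent digram with a fresh symbol."""
    dictionary = {}
    next_sym = max(m) + 1  # assumes a non-empty message of integer symbols
    while True:
        counts = Counter(zip(m, m[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:       # stop: every digram appears only once
            break
        dictionary[next_sym] = pair
        out, i = [], 0
        while i < len(m):  # left-to-right, non-overlapping replacement
            if i + 1 < len(m) and (m[i], m[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(m[i])
                i += 1
        m, next_sym = out, next_sym + 1
    return m, dictionary

msg, rules = repair([ord(c) for c in "aababcdcdb"])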

MinEnt
The MinEnt algorithm proposed in this article is derived from the Re-Pair algorithm. The main difference is in Step 1, where instead of selecting the most frequent digram, we select the digram that minimizes |m_1|^H from Equation (3):

1. Select the digram αβ in message m_0 for which the decrease of the entropic size ∆|m|^H is maximal.
2. Replace all occurrences of αβ with a new symbol γ.
3. Repeat Steps 1 and 2 until every digram appears only once.

More precisely, let m_1 = MinEnt(m_0, αβ → γ) denote the application of Steps 1 and 2 of the MinEnt algorithm; then the digram αβ fulfills:

αβ = argmin_{δε : δ,ε ∈ Σ_0} |MinEnt(m_0, δε → γ)|^H, (4)

where Σ_0 is the alphabet of the message m_0. To demonstrate the difference between Re-Pair and MinEnt, consider the message m_0 = aababcdcdb of Example 4:

MinEnt(m_0, ab → e) = aeecdcdb
MinEnt(m_0, cd → e) = aababeeb

Based on the Re-Pair algorithm, we do not know which digram should be preferred, because both have the same frequency. In the MinEnt case, we can compute |m_1|^H for both cases, yielding |m_1|^H_ab = 18 bits and |m_1|^H_cd = 12.49 bits, and so the replacement cd → e is the preferred one. The MinEnt and Re-Pair strategies of digram selection are evaluated using the algorithm described in [13]. In the initialization phase of the algorithm, the input file is transformed into a linked list, and each input byte is converted into an unsigned integer value. In the next step, the linked list is scanned, and the frequencies and positions of all digrams are recorded. The frequencies of the digrams, resp. the changes of the entropic size of the message measured in bytes, are used as indices for the priority queue. The size of the queue is limited to the maximal frequency, resp., in the case of MinEnt, to the maximal entropic size decrease.
The algorithm iteratively selects the digram with the highest priority, replaces all occurrences of the digram with the newly-introduced symbol, decrements the counts of neighboring digrams and increments the counts of newly-introduced digrams. In the case of the MinEnt algorithm, we have to recompute the change of the entropic size of all digrams in the priority queue. We restrict the number of recomputed changes of the entropic size to the top 20 digrams with the highest priority, so that the time complexity of this additional step remains O(1). Both algorithms run in O(n) expected time; see [13] for details. The memory consumption is larger in the MinEnt case, because each digram has to be assigned an additional quantity: the value of the change of the entropic size of the message.
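The numbers in the worked example above can be checked directly; the sketch below (ours, for illustration) evaluates the entropic size of the message after each candidate replacement and picks the minimizing digram, which is exactly the selection rule of Step 1:

from collections import Counter
from math import log2

def entropic_size(m: str) -> float:
    """|m|^H = -sum f(x) log(f(x)/|m|), Equation (2)."""
    return -sum(f * log2(f / len(m)) for f in Counter(m).values())

m0 = "aababcdcdb"
candidates = {d for d in (m0[i:i + 2] for i in range(len(m0) - 1))
              if m0.count(d) > 1}
for d in sorted(candidates):
    print(d, "->", round(entropic_size(m0.replace(d, "e")), 2), "bits")
best = min(candidates, key=lambda d: entropic_size(m0.replace(d, "e")))
print("MinEnt picks:", best)  # cd: 12.49 bits vs. ab: 18.0 bits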

Discussion of the Transformation and Compression Function Selection Strategies
To demonstrate the behavior of the aforementioned algorithms, we proposed strategies for the selection of transformations and compression functions. We compared the evolution of the entropy of the alphabet, the entropic size of the message and the final size of the message given as the sum of the entropic size of the message and the upper boundary on the size of the dictionary (Section 3.3.1). The following strategies are compared:

• GCT: selection of the generalized context transformation so that the decrease of entropy is maximal.
• HOCT: selection of the higher order context transformation so that the decrease of entropy is maximal in the context of the prefix w.
• Re-Pair: selection of the most frequent digram and its replacement with an unused symbol.
• MinEnt: selection of the digram reducing the entropic size the most and its replacement with an unused symbol.

The Upper Boundary on the Dictionary Entry Size
All transformations and compression functions are usually stored as entries in a dictionary. To be able to compare the effectiveness of transformations, we selected the worst case entropy of each symbol, given by log |Σ_i|, where Σ_i is the alphabet and the subscript i denotes the number of applied transformations.
In the GCT and HOCT strategies, the size of the alphabet remains constant, unless some symbols are completely removed, in which case the size of the alphabet decreases. The Re-Pair and MinEnt algorithms, which introduce new symbols, have an increasing alphabet size. The upper boundary on the resulting size of each dictionary entry |D| for the GCT and HOCT transformations is defined as a multiple of log |Σ_0|, with one log |Σ_0| term for each symbol stored in the entry, where |Σ_0| is the size of the initial alphabet. The Larsson and Moffat [13] version of Re-Pair introduces several efficient ways of dictionary encoding: the Bernoulli model, literal pair enumeration and interpolative encoding. In our experiments with Re-Pair and MinEnt, we used interpolative encoding to encode the dictionary.

Comparison of the Alphabet's Entropy Evolution
Even though the transformations and compression functions pursue the same objective, the minimization of the entropic size of the message, they achieve it through a different evolution of zero order entropy. Transformation-based strategies minimize zero order entropy; meanwhile, both compression strategies introduce new symbols, and as a result, zero order entropy grows. The initial values of the quantities of the examined test file are summarized in Table 1. An example of the comparison of the zero order entropy evolution of the different strategies is provided in Figure 1a. Both compression functions achieve a very similar resulting value of zero order entropy. The Re-Pair strategy begins with the highest growth of entropy, but the increase slows down with the number of iterations as the frequency of each consecutive digram drops. As will be discussed in Section 4.2.2, digrams consisting of symbols with a lower frequency are preferred by MinEnt, because they are able to achieve a larger decrease of the entropic size, and their replacement costs less in terms of the zero order entropy increase. This behavior can be observed especially in the later iterations of the Re-Pair and MinEnt algorithms.
Both transformations reduce the value of zero order entropy. GCT initially drops faster, but towards the end, it slows down significantly. The HOCT strategy achieves the lowest resulting value of entropy, and, interestingly, it decreases at an almost constant rate. The behavior of the entropy evolution for different values of the limit in HOCT is presented in Figure 1b. The unrestricted case (Lim = 0) shows the bottom limit of zero order entropy reduction using the HOCT strategy.

Comparison of Entropic Size Evolution
The selection of the most frequent digram produces the largest decrease in the number of symbols in each iteration. Surprisingly, the Re-Pair strategy does not necessarily converge to its minimum in a lower number of iterations than MinEnt. Figure 2 presents this behavior for the paper5 file of the Calgary corpus. Both approaches end with a similar number of symbols in the resulting message. The MinEnt strategy achieves the lowest entropic size of the message, and at each iteration, the entropic size of the message is lower than in the case of the Re-Pair strategy; see Figure 3. The overall efficiency depends on our ability to compress the resulting dictionary. A summary of the different transformation strategies is provided in Table 2. A summary of the compression functions is then given in Table 3. The lowest number of iterations was achieved by HOCT with LIM = |D|; this strategy also leads to the lowest final size |m_f|, but it should be emphasized that the resulting final size |m_f| is a very pessimistic estimate, due to the construction of the size of the dictionary entries.
Table 2. The comparison of transformation strategies using different criteria: LIM is the limiting size of the dictionary entry in bytes; i is the number of iterations; H_i is the final entropy measured in bits; |m_i|^H is the final entropic size measured in bytes; log |Σ_i| is the upper boundary on the amount of information needed to store one symbol in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Even though the achieved results of both approaches are similar, we see that the resulting message size |m_f| and the alphabet size are lower in the case of MinEnt. The message size |m_f| is given by the sum of the entropic size of the message and the size of the dictionary stored by interpolative encoding. Using the values in the columns of Table 3, we express |m_f| = |m_i| · H_i + i · |D_i|; the term i · |D_i| represents the size of the dictionary given as the product of the total number of iterations and the average number of bits needed to encode one iteration. See Tables 4 and 5 for more results on the files from the Calgary and Canterbury corpora.

Table 4. The comparison of strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|^H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Zero Order Entropy and Entropic Message Size Reduction
The primary purpose of context transformations and the other derived transformations is to reduce the zero order entropy measured by Shannon's entropy [2] defined in Equation (1). In this section, we show under which conditions the transformations and compression functions reduce the zero order entropy, resp. the entropic size of the message. Suppose that H_0 is the zero order entropy of a message m, and H_1 is the zero order entropy after a transformation T is applied. The conditions under which the following inequalities hold are the major subject of interest:

H_1 < H_0, resp. |m_1|^H < |m_0|^H. (5)
Let Σ_T ⊂ Σ be the set of symbols whose frequencies before and after the transformation differ, and let Σ_I ⊂ Σ be the set of symbols whose frequencies remain intact. For transformations, Inequality (5) can be further restricted to the set of symbols Σ_T only, since the terms containing symbols from Σ_I cancel out:

−∑_{x∈Σ_T} p_1(x) log p_1(x) < −∑_{x∈Σ_T} p_0(x) log p_0(x). (6)

In the following paragraphs, we specify the forms of the set Σ_T and the relations for the probabilities of its symbols after the transformations, so that the change of entropy given by Equation (6) can be computed before any transformation actually occurs.

Transformation of Probabilities
We begin with the simplest case: suppose the context transformation CT(αβ → αγ, w). Since only the probabilities of the symbols β and γ will change, Σ_CT = {β, γ}, and it is sufficient to express the probabilities only for β and γ:

p_1(β) = p_0(β) − f_0(αβ)/|m| (7)

and:

p_1(γ) = p_0(γ) + f_0(αβ)/|m|. (8)

In the case of the generalized context transformation GCT(αβ ↔ αγ, w), the set Σ_CT = Σ_GCT is identical, and the probabilities transform according to:

p_1(β) = p_0(β) − f_0(αβ)/|m| + f_0(αγ)/|m| (9)

and:

p_1(γ) = p_0(γ) + f_0(αβ)/|m| − f_0(αγ)/|m|. (10)

In the last case of the higher order transformation HOCT(wβ ↔ wγ), the probabilities transform according to:

p_1(β) = p_0(β) − f_0(wβ)/|m| + f_0(wγ)/|m| (11)

and:

p_1(γ) = p_0(γ) + f_0(wβ)/|m| − f_0(wγ)/|m|. (12)

In all cases, the set Σ_T forms a binary alphabet. The following theorem then describes the condition for zero order entropy reduction:

Theorem 1. Suppose the generalized context transformation GCT(αβ ↔ αγ). Let p_0(β) and p_0(γ) be the probabilities of the symbols before the transformation is applied, and let p_0,max = max{p_0(β), p_0(γ)}. After the transformation, the associated probabilities are p_1(β), p_1(γ) and p_1,max = max{p_1(β), p_1(γ)}. If p_1,max > p_0,max, then the generalized context transformation reduces entropy.
The proof of Theorem 1 is based on the properties of entropy when only two letters from the alphabet Σ are considered. Let p(β) + p(γ) = c, where c ≤ 1; c is invariant, i.e., it does not change during the transformation. We can express one of these probabilities using the other one; for example, let p(γ) = c − p(β); this allows us to express the entropy function as a function of only one variable.
A few examples of such functions are shown in Figure 4. The maximum value of the function is located at c/2, and it has two minimums, at zero and at c.

Proof. Since the entropy function for two different letters is defined on the interval [0, c] and it is concave with a maximum at c/2 and minimums at zero and c, p_0,max has to be located in the interval [c/2, c]; but on that particular interval, the higher the maximum is, the lower the entropy is. Thus, if we increase the maximum (or, equivalently, increase the absolute value of the difference |p(β) − p(γ)|), the entropy will decrease.
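The monotonicity claim in the proof can be checked numerically; the following sketch (ours) evaluates the two-symbol entropy h(p) = −p log p − (c − p) log (c − p) for a growing maximum on [c/2, c]:

from math import log2

def two_symbol_entropy(p: float, c: float) -> float:
    """Entropy portion of two symbols with p(beta) = p and p(gamma) = c - p."""
    return -sum(q * log2(q) for q in (p, c - p) if q > 0)

c = 0.5
for p_max in (0.25, 0.3, 0.4, 0.5):   # p_max in [c/2, c]
    print(p_max, round(two_symbol_entropy(p_max, c), 4))
# prints 1.0, 0.9855, 0.861, 0.5: the entropy falls as p_max grows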

General Entropy Change Relations
In this section, we generalize the notion of the zero order entropy change to the exchange of any two words. The solution is divided into three parts. The first part deals with the set of symbols Σ_I whose frequencies do not change before and after the substitution function is applied; the second part establishes relations for the set of symbols Σ_T whose probabilities change, but whose initial and final frequencies are non-zero; the third part discusses the symbols introduced to and removed from the alphabet. Let Σ_R be the set of removed symbols, and let Σ_N be the set of introduced symbols; then we can split the sum in Equation (1), yielding:

H(X) = H(p(Σ_I)) + H(p(Σ_T)) + H(p(Σ_R)) + H(p(Σ_N)). (13)

The four sets of symbols in Equation (13) exhibit different behaviors under the substitution function, and they will be discussed in separate sections. The entropic size of the message |m|^H can also be handled separately; let H(p(Σ_α)) be the portion of entropy conveyed by the symbols from an alphabet Σ_α, and let |m|^H(Σ_α) be the corresponding portion of the entropic size of the message; then we can split the resulting entropic size as we did before:

|m|^H = |m|^H(Σ_I) + |m|^H(Σ_T) + |m|^H(Σ_R) + |m|^H(Σ_N). (14)

The First Part: Symbols Remaining Intact by the Substitution Function
We begin with the symbols that are not part of either of the substituted words s_1 or s_2. Suppose that the message m_0 of length |m_0| turns into some message m_1 of length |m_1|. Generally, |m_0| ≠ |m_1|, but in the special case of context transformations, these two quantities are equal. However, when the compression or expansion of the message occurs, this part of the Shannon equation will also change due to the change in the total number of symbols.
Suppose that the symbol x is initially present in the message m_0 with the probability p_0(x). This probability can be expressed using the frequency f_0(x) and the size of the message as:

p_0(x) = f_0(x)/|m_0|. (15)

After the transformation is applied, the probability changes to:

p_1(x) = f_0(x)/(|m_0| + ∆m), (16)

where ∆m is the change of the message size. In the case of context transformations, where the message size remains the same, the probability remains the same as well, and so does the part of the entropy formed by the non-transformed symbols.
The two probabilities can be placed in relation by some stretching factor c_1, arranged into the form:

p_1(x) = c_1 p_0(x). (17)

The introduction of c_1 is motivated by the properties of logarithms: if we stayed with p_1(x) given by Equation (16), we would get the logarithm log (f_0(x)/(|m_0| + ∆m)) = log f_0(x) − log (|m_0| + ∆m); if we instead express p_1 using Equation (17), then the argument of the logarithm is in product form, and log (c_1 p_0) = log c_1 + log p_0. The factor c_1 can be expressed by substituting p_i(x) in Equation (17) by the terms in Equations (15) and (16), leading to:

c_1 = |m_0|/(|m_0| + ∆m). (18)

The relation for the zero order entropy of the intact symbols after the transformation then has the form:

H(p_1(Σ_I)) = −∑_{x∈Σ_I} c_1 p_0(x) (log c_1 + log p_0(x)). (19)

An example of the behavior of H(p_1(Σ_I)) of the intact part is visualized in Figure 5. When the compression of the message occurs, i.e., log c_1 > 0, the zero order entropy of the intact symbols increases. The less probability is conveyed by the symbols from Σ_I, the more their zero order entropy is sensitive to the change of c_1. The final entropic size is given as follows:

|m_1|^H(Σ_I) = |m_0|^H(Σ_I) − log c_1 ∑_{x∈Σ_I} f_0(x). (20)

If we apply one of the transformations, then ∆m = 0 and, as a consequence, c_1 = 1; the last term on the right will be zero due to log c_1 = 0, so Equation (20) tells us that the entropic size of the message carried by these symbols does not change during a transformation. When |Σ_T| is much smaller than |Σ_I|, it is convenient to rewrite Equation (20) in terms of Σ_T:

|m_1|^H(Σ_I) = |m_0|^H(Σ_I) − log c_1 (|m_0| − ∑_{x∈Σ_T∪Σ_R} f_0(x)). (21)

Corollary 1. No compression function ever increases the entropic size of the part of the message consisting of the intact symbols.
Proof. A compression function has a value of c_1 larger than one, and as a consequence, log c_1 > 0, and so, |m_1|^H(Σ_I) ≤ |m_0|^H(Σ_I). The equality occurs when ∑_{x∈Σ_I} f_0(x) = 0, i.e., when there are no intact symbols. When the expansion of the message occurs, then log c_1 < 0, and the second term of Equation (20) on the right changes to a positive number. Expansion of the message leads to an increase of the entropic size; meanwhile, compression leads to a decrease of the entropic size of the intact symbols.
In each iteration of the Re-Pair algorithm, the most frequent digram is selected. This corresponds to the selection of the digram with the maximal value of log c_1, but it does not have to be the digram minimizing the entropic size of this part of the resulting message the most. Consider two digrams d_1 and d_2 such that their frequencies are equal: f(d_1) = f(d_2); replacing either of them with a new symbol yields the same stretching factor c_1, but not necessarily the same ∑_{x∈Σ_I} f_0(x). A larger reduction of the entropic size of the message is achieved when the compressed digrams or words consist of less frequent symbols.

The Second Part: Symbols Participating in the Substitution Function
In the second case, the frequencies of the symbols and their total number change. The equation for the stretching factor c_2 is derived in the following way:

c_2(x) = p_1(x)/p_0(x) = ((f_0(x) + ∆f(x))/(|m_0| + ∆m)) · (|m_0|/f_0(x)). (22)

The main difference between the two cases is that c_1 is a constant; meanwhile, c_2 is a function of the particular symbol x.
Following the derivation of Equation (19), the entropy of the transformed symbols becomes:

H(p_1(Σ_T)) = −∑_{x∈Σ_T} F(x) c_1 p_0(x) (log (F(x) c_1) + log p_0(x)), (23)

where in the last step, we made the substitution:

c_2(x) = F(x) c_1, with F(x) = (f_0(x) + ∆f(x))/f_0(x). (24)

The rest of the derivation follows the derivation of Equation (19).
The behavior of Equation (23) for different values of p_0(x) is visualized in Figure 6. The substitution of less frequent symbols leads to a lower increase of zero order entropy. The resulting entropic size simplifies, given that (|m_0| + ∆m) c_2(x) p_0(x) = F(x) f_0(x) = f_1(x), and yields:

|m_1|^H(Σ_T) = −∑_{x∈Σ_T} f_1(x) (log c_2(x) + log p_0(x)). (25)

We now analyze both terms in Equation (25) from the perspective of different values of c_2(x). We are particularly interested in compression functions. We know that for a compression function, c_1 > 1; symbols with ∆f(x) < 0, i.e., symbols whose frequency decreases, will have F(x) < 1. The positivity or negativity of log c_2 then depends on the value of the product F(x) c_1.
The case F(x) c_1 = 1 has the solution F(x) = 1/c_1, for which log c_2(x) = 0. The term log p_0(x) is always negative. The value of F(x) must be larger than 1/c_1 to decrease the zero order entropy conveyed by the symbol x, since then c_2(x) > 1 and, as a consequence, log c_2(x) > 0. The condition F(x) > 1/c_1 is equivalent to:

|∆f(x)|/|∆m| < f_0(x)/|m_0| = p_0(x). (26)
The introduction of the absolute values in the middle step of the derivation of Inequality (26) is allowed, since with compression functions, the values of ∆f(x) and ∆m can only be negative. Suppose now that we have a digram d = αβ, with α ≠ β, and we replace it by the newly-introduced symbol γ; then ∆m = ∆f(α) = ∆f(β). The left part of Inequality (26) becomes equal to one, so Inequality (26) cannot be satisfied, and log c_2(x) in this case will be negative and will always increase the amount of information carried by the symbols α and β.
Finally, we state the condition for the entropic size decrease:

Corollary 2. The entropic size of the part of the message formed by the symbol x decreases when:

f_1(x) log c_2(x) + ∆f(x) log p_0(x) > 0. (27)

Proof. Comparing the term of Equation (25) belonging to x with its initial value −f_0(x) log p_0(x), their difference is −f_1(x) log c_2(x) − ∆f(x) log p_0(x), which is negative exactly when the condition above holds.

Third Part: Introduced and Removed Symbols
We begin with a symbol x that is completely removed from the message, so that initially p_0(x) ≠ 0, but p_1(x) = 0. This case is trivial, and it has zero participation in the final value of the entropy and the entropic size of the message. The remaining case we have to deal with is the case when the symbol x initially has zero probability, p_0(x) = 0, but after the substitution, its probability increases to some p_1(x) ≠ 0. The final probability is given as:

p_1(x) = f_1(x)/(|m_0| + ∆m). (28)

Since the symbol x initially has zero participation in the entropy and the entropic size, its introduction always leads to an increase of both quantities. For the set Σ_N of all such symbols, its portion of the total entropy is then given by:

H(p_1(Σ_N)) = −∑_{x∈Σ_N} p_1(x) log p_1(x), (29)

and the corresponding final entropic size is given by:

|m_1|^H(Σ_N) = −∑_{x∈Σ_N} f_1(x) log p_1(x). (30)

It is important to remark that it does not make much sense to introduce more than one symbol in one substitution function, because both quantities would then be added twice.

Calculation of ∆|m|^H
At first glance, it seems that we need to evaluate all the symbols to predict the zero order entropy; instead, it is possible to predict the exact change of the entropic size of the message after the application of the compression function by evaluating the entropic sizes given by Equations (21), (25) and (30), dealing only with the symbols x ∈ Σ \ Σ_I. In the particular case of the Re-Pair algorithm, there are only two symbols whose frequencies are sufficient to evaluate the change of the entropic size of the message; suppose a compression function CF(αβ → γ) such that p_1(α) ≠ 0, p_1(β) ≠ 0 and p_0(γ) = 0; then the resulting entropic size is given as:

|m_1|^H = |m_1|^H(Σ_I) + |m_1|^H({α, β}) + |m_1|^H({γ}), (31)

with the three terms given by Equations (21), (25) and (30), respectively.
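As a sanity check of this claim (our own illustration, not the paper's implementation), the sketch below predicts |m_1|^H for CF(αβ → γ) from the frequency table alone, without building m_1, and compares it with the direct recomputation; it assumes α ≠ β, γ ∉ Σ_0 and non-overlapping occurrences of αβ:

from collections import Counter
from math import log2

def entropic_size(m: str) -> float:
    return -sum(f * log2(f / len(m)) for f in Counter(m).values())

def predict_entropic_size(m0: str, a: str, b: str, g: str) -> float:
    """Predict |m_1|^H for CF(ab -> g) using only symbol frequencies."""
    fd = m0.count(a + b)         # frequency of the replaced digram
    n1 = len(m0) - fd            # each replacement shortens the message by one
    h = 0.0
    for x, fx in Counter(m0).items():
        f1 = fx - fd if x in (a, b) else fx
        if f1 > 0:
            h -= f1 * log2(f1 / n1)
    h -= fd * log2(fd / n1)      # contribution of the new symbol g, Equation (30)
    return h

m0 = "aababcdcdb"
print(round(predict_entropic_size(m0, "c", "d", "e"), 2))  # 12.49
print(round(entropic_size(m0.replace("cd", "e")), 2))      # 12.49, direct check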

Conclusions
We described three types of transformations for the preprocessing of messages, so that the zero order entropy of the message drops and the resulting message can be more efficiently encoded using zero order entropy compression algorithms like Huffman or arithmetic coding.
We presented the relations that govern the change of the entropic size of the message for transformations and compression functions. Transformations have the advantage that they do not modify the size of the alphabet; in the case of the digram substitution used by Re-Pair and our proposed MinEnt strategy, the resulting size of the alphabet grows significantly, and this brings additional complexity in the storage of the entropy coding model, i.e., the storage of the output alphabet.
The MinEnt strategy selects the digrams to be replaced by the minimal entropic size of the resulting message, and it was shown that, in most cases, the resulting message size is smaller than the one achieved by Re-Pair. We also showed that the two algorithms follow slightly different execution paths, as MinEnt prefers digrams that consist of less frequent symbols; meanwhile, Re-Pair does not take this into consideration.
The compression functions can take advantage of transformations to achieve a better resulting compression ratio. In future work, we will focus on the storage of the dictionary used by the transformation algorithms, because this area can significantly improve the resulting compression ratio.
Further, we will focus on the description of the relation between the entropy coding model of the final message and the size of the final alphabet.

Appendix A

...it follows that the possible transformations are applied to the words following the space character. Now, using the fact that s_i is enclosed by w[0], i.e., the occurrences do not overlap, allows us to handle each sub-message s_i independently.
Proof. For the two sets of positions, it holds that P(w, m) ∩ P(wα, m_1) = ∅, because the elements of the former are the predecessors of the latter and the occurrences of s do not overlap. The locations of w in m and m_2 are identical, as they were not modified during the HOCT, i.e., P(w, m) = P(w, m_2). When we apply HOCT_1 again, it simply reverts the symbols at the positions given by P(w, m) back according to Lemma A1, yielding the message state m_3. In the forward transformation, HOCT_2 was applied over the positions P(wα, m_1), but these are the former positions of P(wβ, m) that have already been transformed back by the application of HOCT_1, so P(wα, m_1) is equal to P(wβ, m_3), and when HOCT_2 is applied over the positions P(wβ, m_3), it exchanges the symbols γ and ρ and eventually yields m.
The recursive application of Theorem A3 leads to the conclusion that this process can be repeated until there is no other pair of symbols than those containing w[0] as one of the symbols α or β, or we simply reach the end of the message.
Corollary A2 about the prediction of frequencies in the case of GCT is also applicable in the case of HOCT, because the principle that the exact number of replacements is known remains valid, and we are able to precisely compute the future probabilities of the symbols before an arbitrary HOCT is applied.
If we implement the inverse algorithm as a sequential algorithm operating in the left-to-right manner, it is possible to have one of the transformation symbols β, γ equal to w[0]. Suppose the following example: m = abcabc, P(a, m) = {0, 3}, HOCT_1(ab ↔ aa) and HOCT_2(aac ↔ aaa), yielding the output message m_2 = aaaaaa. Now, applying the inverse transformation sequentially from left to right, we first replace aa by ab, yielding m_i = abaaaa; then, applying the replacement of aba by abc yields m_{i+1} = abcaaa; now, because there is no other transformation induced from abc, we know that the next symbol a is w[0], and we can repeat the preceding process again, starting from this a. The sufficient condition for the introduction of w[0] as the transformation symbol β or γ is that w contains no other w[0] at w[i], i > 0, because the inverse process removes all of the introduced w[0] symbols from the transformed message during the left-to-right sequential inverse transformation.

Example 4.

m_0 = aababcdcdb

The entropic size of m_0 is |m_0|^H = 19.71 bits. There are two non-overlapping digrams that occur twice: ab and cd.

Table 1.

Figure 1. Comparison of the zero order entropy evolution over the paper5 file from the Calgary corpus. (a) Evolution of zero order entropy for different strategies; (b) evolution of zero order entropy for different values of the limit (LIM) in the HOCT strategy.

Figure 2. Comparison of the Re-Pair and MinEnt algorithms: evolution of the message size measured in the number of symbols over the paper5 file from the Calgary corpus.

Figure 3. Comparison of the Re-Pair, MinEnt, GCT and HOCT algorithms: evolution of the entropic message size measured in bits per byte over the paper5 file from the Calgary corpus.

Table 3. The comparison of compression strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|^H is the final entropic size measured in bytes; |D_i| is the average number of bits needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Table 5. The comparison of the strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|^H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.