Article

Prediction and Evaluation of Zero Order Entropy Changes in Grammar-Based Codes

Department of Computer Science, FEECS, VSB-Technical University of Ostrava, 17. listopadu 15/12172, Ostrava 708 33, Czech Republic
* Author to whom correspondence should be addressed.
Entropy 2017, 19(5), 223; https://doi.org/10.3390/e19050223
Submission received: 30 January 2017 / Revised: 9 May 2017 / Accepted: 10 May 2017 / Published: 13 May 2017
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The change of zero order entropy is studied over different strategies of grammar production rule selection. Two major kinds of rules are distinguished: transformations, which leave the message size intact, and substitution functions, which change the message size. Relations for the change of zero order entropy are derived for both cases, and the conditions under which the entropy decreases are described. In this article, several different greedy strategies that reduce the zero order entropy, as well as the message size, are summarized, and the new strategy MinEnt is proposed. The resulting evolution of the zero order entropy is compared with the strategy of selecting the most frequent digram used in the Re-Pair algorithm.

1. Introduction

Entropy is a key concept in the measurement of the amount of information in information theory [1]. From the data compression perspective, this amount of information represents the lower limit of the achievable compression of some information source. Due to the well-known work by Shannon [2], we know that using fewer bits than the amount given by the entropy to represent a particular message or process would necessarily lead to the loss of some information and, as a consequence, to our inability to properly recover the former structure of the message.
This work is focused on the study of entropy in data compression, and therefore, our discussion will be restricted to only finite messages. These finite messages are formed by symbols, and in this perspective, the entropy can be understood as the lowest number of bits needed on average to uniquely represent each symbol in a message. There are messages for which the evaluation of entropy can be a very hard task, and so, we are often forced to satisfy ourselves with some approximation of entropy.
The simplest approximation is the one based on the probability distribution of symbols in a particular message. In this case, the symbols are viewed as independent entities, and their mutual relationships are not taken into account. A better approximation of entropy is based on conditional probabilities, when we also take into account how symbols follow each other. We can also approximate entropy by computing the bits-per-byte ratio of the message encoded by state-of-the-art data compression algorithms.
In this article, we study entropy at the level of independent symbols. This approximation of entropy is often called zero order entropy. There are two major data compression algorithms in use that compress messages almost to the rate given by zero order entropy: Huffman [3] and arithmetic [4] coding. The zero order entropy can be computed using the Shannon equation:
H(X) = -\sum_{x \in \Sigma} p(x) \log p(x)
where X stands for a random variable representing the probability distribution of symbols in the input message m, and p(x) is the probability of symbol x from the alphabet Σ. The expected length of the code for a particular symbol x is given by −log p(x). If the expected length of the code is multiplied by its probability, we obtain the average number of bits needed to represent the symbol x ∈ Σ. The size of the message using the expected lengths of codes is given as the product of the length of the message |m|, measured as a number of symbols, and the zero order entropy:
|m|^H = |m| \cdot H(X)
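To make the two quantities concrete, the following Python sketch computes the zero order entropy of Equation (1) and the entropic size of Equation (2) directly from the symbol counts of a message; the helper names are illustrative and not part of any published implementation.

    from collections import Counter
    from math import log2

    def zero_order_entropy(message):
        # H(X) = -sum of p(x) * log2 p(x) over the alphabet, Equation (1)
        counts = Counter(message)
        n = len(message)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    def entropic_size(message):
        # |m|^H = |m| * H(X), Equation (2), measured in bits
        return len(message) * zero_order_entropy(message)

    # e.g. entropic_size("aababcdcdb") evaluates to roughly 19.71 bits,
    # the value used later in Example 4.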
When we refer to the term entropic size of the message, we always mean the quantity given by Equation (2); it will be denoted using a superscript as |m|^H. We study how the entropic size of the message evolves when all occurrences of some m-gram are substituted for some other n-gram and vice versa. Two such substitutions are studied: transformations and compression functions. Transformations replace n-grams of the same length; a transformation leaves the message size intact, but since the probabilities of symbols change, the value of the zero order entropy also changes. Compression functions replace m-grams with n-grams, where m > n, and make the message smaller, but a change of the zero order entropy also occurs.
The main idea behind the concept of transformations is the following: consider Huffman coding, where more probable symbols are encoded by prefix codes that are shorter than, or at least of the same length as, those of less probable symbols. If the symbol β is more probable than the symbol γ, but in the context of some symbol α, γ is more frequent than β, then, if these two symbols following α are exchanged, the occurrences formerly encoded by the longer code of γ will instead be encoded by the shorter code of β. Under this assumption, it is possible to pre-process the data so that the frequency of more frequent symbols increases and the frequency of less frequent symbols decreases.

1.1. Notation and Terminology

  • The alphabet of the input message m of the size | m | is denoted by Σ and its size by | Σ | .
  • Greek symbols are used to denote variables representing symbols in the input message. For instance, suppose two digrams αα, αβ and the alphabet Σ = {0, 1}. Then, αα ∈ {00, 11} and αβ ∈ {00, 01, 10, 11}.
  • When we refer to the term entropy, we always mean Shannon’s entropy defined by Equation (1).
  • All logarithms are to base two.
  • Any quantity Q_i with a subscript i ∈ ℕ denotes consecutive states of the quantity between substitutions. For instance, the quantity Q_0 is the value of the quantity before any substitution is applied, and Q_1 is its value after some substitution is applied.

1.2. Order of Context and Entropy

1.2.1. Zero Order Context

When all symbols are interpreted as independent individual entities and no predecessors are taken into consideration, such a case is called zero order context. Zero order entropy is then computed as Shannon’s entropy of a random variable given by the probabilities of symbols in the input message.

1.2.2. N-th Order Context

When the probability distribution of symbols following a particular fixed-length prefix w is taken into consideration, and the length of the prefix is N, then the order of the context is N, and the N-th order entropy is computed as Shannon's entropy of the conditional distribution of symbols following all of the different prefixes w_i.

2. Previous Work

The class of algorithms dealing with exchanges of different n-grams and m-grams is called grammar-based algorithms. Their purpose is to provide a set of production rules inferring the content of the message. Using the Chomsky hierarchy, we identify two classes of formal grammars used in data compression: context-free grammars (CFG) and context-sensitive grammars (CSG). The context transformations presented in Section 3 belong to the CSG class; meanwhile, the compression functions belong to the CFG class. The problem of the search for the most compact context-free grammar representation of a message is NP-hard, unless P = NP [5]. Instead of searching for the optimal solution, many heuristic and greedy algorithms were proposed.
CFGs for data compression were first discussed by Nevill-Manning [6], followed by the proposal of the SEQUITUR algorithm [7]. SEQUITUR reads the sentence in a left-to-right manner so that each repeated pattern is transformed into a grammar rule. The grammar is maintained in such a way that two properties are fulfilled: digram uniqueness (no pair of adjacent symbols appears more than once in the grammar) and rule utility (every rule is used more than once). Kieffer and Yang [8] were the first who addressed data compression using CFGs from the information theoretic perspective; they showed that the LZ78 [9] algorithm can be interpreted as a CFG and that the proposed BISECTION algorithm forms a grammar-based universal lossless source code. BISECTION repeatedly halves the initial message into unique phrases of length 2^k, where k is an integer.
In the work of Yang and He [10], context-dependent grammars (CDG) for data compression were introduced. In CSG, the context is present on both sides of the production rules; meanwhile, in CDG, the context is defined only on the left side of the production rule.
One of the first concepts in greedy grammar-based codes was byte pair encoding (BPE) [11]. The BPE algorithm selects the most frequent digram and replaces it with some unused symbol. The main weakness of this approach is that the algorithm is limited to an alphabet consisting only of byte values. The concept of byte pair encoding was later revisited, and the limitation on the alphabet size was lifted independently by Nakamura and Murashima [12] and by Larsson and Moffat [13]; the resulting approach is called Re-Pair [13]. Re-Pair stands for recursive pairing, and it is a very active field of research [14,15]. It iteratively replaces the most frequent digrams with unused symbols until there is no digram that occurs more than once.
Unlike BPE, which codes digrams using only byte values, Re-Pair expects that the symbols of the final message will be encoded using some entropy coding algorithm. Approaches derived from Re-Pair are usually greedy, since each iteration of the algorithm depends on a successful search for the extremal value of some statistical quantity related to the input message. The study of Re-Pair from the perspective of ergodic sources is discussed in [16,17]. Neither BPE nor Re-Pair compresses the message into the least possible size; they rather form a trade-off between the message and dictionary sizes. Re-Pair-like algorithms are off-line, in the sense that they need more than one pass through the input message; meanwhile, SEQUITUR incrementally builds the grammar in a single pass. The Re-Pair algorithm has O(n) time complexity; it is easy to implement using linked lists and a priority queue. Further, it was shown in [18] that it can compress an input message of length n over an alphabet of size |Σ| into at most 2H_k + o(n log |Σ|) bits, where H_k is the k-th order entropy.
Our recent studies were focused on a special class of grammar transforms that leave the message size intact [19,20]. In the present paper, the class of grammar transformations is extended with the novel concept of higher order context transformations [21]. We provide examples of transformations and extend the evaluation of the entropy, resp. entropic size, reduction to the class of grammar compression algorithms, and we compare the evolution of the entropy, the entropic size and the resulting number of dictionary entries for Re-Pair and for our version of Re-Pair, called MinEnt, which is based on the selection of the pair of symbols reducing the entropic size of the message the most. Re-Pair finds application in areas such as searching in compressed data [22], compression of suffix arrays [23] or compression of inverted indexes [24], to name a few. These areas are also natural application fields for MinEnt. From the perspective of the number of passes through the message, the approaches discussed in this paper belong to the off-line algorithms.

3. Transformations and Compression Algorithms

In this section, we will describe and evaluate several invertible transformations T and substitution functions F so that for any two consecutive states of the message, m 0 and m 1 , before and after application of T or F, the following relation holds:
|m_1|^H < |m_0|^H
Measuring the size of the message by its entropic size is preferred, since using arithmetic coding one can achieve a compression rate very close to the zero order entropy, and so the size |m|^H is in theory attainable. Further, it allows the comparison of two distinct substitutions when their resulting sizes measured by the number of symbols are equal. The derivation of the equations for the computation of |m_1|^H, resp. Δ|m|^H = |m_0|^H − |m_1|^H, is provided in Section 4.

3.1. Transformations

Consider a transformation where we replace all occurrences of some symbol β with some symbol γ and vice versa; such a transformation is called a symmetry transformation, because it does not modify any measurable quantities related to the amount of information. The information content is changed when the replacement is performed in the context of another symbol α. Such a transformation corresponds to the exchange of all digrams αβ for αγ and vice versa. In this section, several different forms of transformations are distinguished and briefly described. Some properties of transformations and their proofs can be found in Appendix A.

3.1.1. Context Transformation

The concept of context transformations was first proposed in [25], and the results were presented in [19]. It is the simplest transformation that assumes a pair of digrams beginning with the same symbol when one of the digrams is initially missing in the input message.
Definition 1.
Context transformation (CT) is a mapping CT(αβ → αγ, w): Σ^n → Σ^n that replaces all digrams αβ with αγ, where p(α, γ) = 0 and β ≠ γ. Σ is the alphabet of the input message w, and n is the length of w.
The context transformation is applied from the end of the message to the beginning; its inverse is the context transformation with the exchanged digram pair, applied in the opposite direction. The proof of this property, with an explanation of why this is the only pairing of the function and its inverse, is left to Appendix A. The application of two consecutive context transformations and their inverse functions is presented in the following example:
Example 1.
abcdabacd| --CT(ab→aa)--> aacdaaacd| --CT(cd→cc)--> aaccaaacc| --CT(cc→cd)--> aacdaaacd| --CT(aa→ab)--> abcdabacd|
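The following Python sketch illustrates one reading of the definition that is consistent with Example 1: the forward CT scans the message from the end to the beginning and rewrites the second symbol of every matching digram, while the inverse is the reversed rule applied in the opposite direction. The function names are illustrative assumptions only.

    def ct_backward(m, a, b, g):
        # forward CT(ab -> ag): scan from the end of the message to the beginning
        s = list(m)
        for i in range(len(s) - 2, -1, -1):
            if s[i] == a and s[i + 1] == b:
                s[i + 1] = g
        return "".join(s)

    def ct_forward(m, a, b, g):
        # CT(ab -> ag) applied in the opposite direction (left to right)
        s = list(m)
        for i in range(len(s) - 1):
            if s[i] == a and s[i + 1] == b:
                s[i + 1] = g
        return "".join(s)

    m = "abcdabacd"
    t = ct_backward(ct_backward(m, "a", "b", "a"), "c", "d", "c")  # "aaccaaacc"
    r = ct_forward(ct_forward(t, "c", "c", "d"), "a", "a", "b")    # "abcdabacd" again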

3.1.2. Generalized Context Transformation

Context transformations were restricted to cases where one of the digrams is missing in the input message. This restriction is removed by the introduction of the generalized context transformation, first proposed in [20].
Definition 2.
Generalized context transformation (GCT) is a mapping GCT(αβ ↔ αγ, w): Σ^n → Σ^n that exchanges all occurrences of a digram αβ with a digram αγ and vice versa. Σ is the alphabet of the input message w, and n is the length of w.
Example 2.
aabcabab| --GCT(ab↔aa)--> abacaaaa| --GCT(aa↔ab)--> aabcabab
While both transformations CT and GCT swap occurrences of two different digrams beginning with the same symbol, they differ in the way they are applied and in how the inverse transformation is formed. GCT can be applied in both directions, and the inverse transformation GCT^{-1} is always applied in the direction opposite to the forward transformation. The algorithm based on the CT and GCT works as follows:
  • Find and apply a transformation T so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal.
  • Repeat Step 1 until no transformation can decrease the entropic size of the message.
It is also possible to define a transformation and its inverse so that all symbols constituting the replaced pairs differ, for instance ab ↔ cd; such a transformation is called a generic transformation GT. In this article, we have not proposed algorithms based on GT, but because the set of all generalized context transformations is a subset of the set of generic transformations, the proof of the existence of the inverse transformation is the same for both GCT and GT. The reader can find the proof in Appendix A.

3.1.3. Higher Order Context Transformation

Every time we apply any generalized context transformation G C T , we acquire knowledge about the positions of two distinct digrams in the message. We can either discard this knowledge or we can try to build on it. In the following definition, we define a transformation that is applied over positions where some other transformation was applied before:
Definition 3.
Let P(w, m) be the set of positions of the first symbol following the sub-message w in the message m, where w[i] ≠ w[0] for i > 0. If β, γ ≠ w[0], then the higher order context transformation (HOCT) is a mapping HOCT(wβ ↔ wγ, m, P(w, m)): Σ^n → Σ^n that exchanges all sub-messages wβ with sub-messages wγ and vice versa.
The restriction that the sub-message w has to satisfy, w[0] ≠ w[i] for i > 0, is closely related to the existence of the inverse transformation of HOCT. The properties related to HOCT and their proofs are left to Appendix A.
Let O = |w| be the size of the sub-message w from Definition 3; then O is the order of the HOCT. Any GCT(αβ ↔ αγ) is then the first order HOCT(αβ ↔ αγ, m, P(α, m)). Given that we have just applied some transformation m_1 = HOCT_1(wβ ↔ wγ, m, P(w, m)), we can decide to collect the positions of either w_1 = wβ or w_2 = wγ, collect the distribution of symbols at the positions P(w_i, m), and apply another HOCT(w_iρ ↔ w_iφ, m_1, P(w_i, m)). In this sense, HOCT is not used only to interchange different sub-messages; it also allows one to proceed with some other transformation HOCT of a higher order. The application of two consecutive HOCT transformations is presented in the following example:
Example 3.
abcdabcd| --HOCT(ab→ad, P(a,m)={1,5})--> adcdadcd| --HOCT(adc→add, P(ad,m)={2,6})--> adddaddd|
The H O C T transformation is a recursive application of G C T in the context of some prefix w. The steps of the algorithm are outlined as follows:
  • Find and apply HOCT(αβ ↔ αγ) over the set of positions P(α), so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal and Δ|m|^H > Lim.
  • If the frequency of αβ, resp. αγ, is larger than one, then repeat Step 1 over the set of positions P(αβ), resp. P(αγ), i.e., the positions where the HOCT from Step 1 was applied; otherwise, repeat Step 1 over the positions P(α), or return if no further HOCT passes the entropic size reduction condition.
The algorithm above is iteratively called for symbols sorted from the most frequent to the least frequent. The Lim variable can be used to reject transformations whose entropic size reduction is too small for them to be stored efficiently in the dictionary.
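A minimal sketch of a single HOCT step, following Definition 3: it swaps the symbols β and γ at the positions P(w, m) and returns the positions over which a transformation of the next higher order can continue. The helper name and the position bookkeeping are illustrative assumptions, not the authors' implementation.

    def hoct_apply(m, positions, beta, gamma):
        # HOCT(w*beta <-> w*gamma, m, P(w, m)): swap beta and gamma at the
        # positions of the first symbol following the prefix w (0-indexed).
        s = list(m)
        for i in positions:
            if s[i] == beta:
                s[i] = gamma
            elif s[i] == gamma:
                s[i] = beta
        # positions following the extended prefix w+gamma, usable by an
        # HOCT of order |w| + 1
        next_positions = [i + 1 for i in positions if i + 1 < len(s) and s[i] == gamma]
        return "".join(s), next_positions

    # Example 3 revisited: P(a, m) = {1, 5} for m = "abcdabcd"
    m1, p_ad = hoct_apply("abcdabcd", [1, 5], "b", "d")  # "adcdadcd", p_ad = [2, 6]
    m2, _ = hoct_apply(m1, p_ad, "c", "d")               # "adddaddd"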

3.2. Compression Functions

In the preceding section, we described three types of transformations that leave the message size intact. In this section, we focus on two approaches to the replacement of digrams with a new symbol. First, we describe the basic principles of the well-known Re-Pair algorithm, and then we propose a modification of Re-Pair called MinEnt.

3.2.1. Re-Pair

The main idea behind the Re-Pair algorithm is to repeatedly find the most frequent digram and replace all of its occurrences with a new symbol that is not yet present in the message. The algorithm can be described in the following steps:
  • Select the most frequent digram α β in message m.
  • Replace all occurrences of αβ with a new symbol γ.
  • Repeat Steps 1 and 2 until every digram appears only once.
In Step 2 of the algorithm, the pair α β together with a new symbol γ are stored in a dictionary. The implementation details of the Re-Pair algorithm are left to Section 3.2.2 regarding the proposed MinEnt algorithm.
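A deliberately naive Python sketch of the iteration above; the real implementation of Larsson and Moffat [13] uses linked lists and a priority queue, and the names used here are illustrative only.

    from collections import Counter

    def repair_naive(msg, new_symbols):
        # msg is a list of symbols; new_symbols yields symbols not present in msg
        dictionary = []
        while True:
            digrams = Counter(zip(msg, msg[1:]))
            if not digrams:
                break
            (a, b), freq = digrams.most_common(1)[0]
            if freq < 2:                      # stop when every digram appears only once
                break
            g = next(new_symbols)
            dictionary.append((g, a, b))      # Step 2: record the rule g -> ab
            out, i = [], 0
            while i < len(msg):               # replace non-overlapping occurrences
                if i + 1 < len(msg) and msg[i] == a and msg[i + 1] == b:
                    out.append(g)
                    i += 2
                else:
                    out.append(msg[i])
                    i += 1
            msg = out
        return msg, dictionary

    # e.g. repair_naive(list("aababcdcdb"), iter("efgh"))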

3.2.2. MinEnt

The MinEnt algorithm proposed in this article is derived from the Re-Pair algorithm. The main difference is in Step 1: instead of selecting the most frequent digram, we select the digram that minimizes |m_1|^H from Equation (3):
  • Select the digram αβ in the message m_0 so that the change of the entropic size Δ|m|^H = |m_0|^H − |m_1|^H is maximal.
  • Replace all occurrences of αβ with a new symbol γ.
  • Repeat Steps 1 and 2 until every digram appears only once.
More precisely, let m_1 = MinEnt(m_0, αβ → γ) be the application of Step 1 and Step 2 of the MinEnt algorithm; then the digram αβ fulfills:
\arg\min_{\alpha, \beta \in \Sigma_0} |MinEnt(m_0, \alpha\beta \to \gamma)|^H
where Σ 0 is the alphabet of the message m 0 . To demonstrate the difference between Re-Pair and MinEnt, consider the following example:
Example 4.
m_0 = aababcdcdb
The entropic size of m_0 is |m_0|^H = 19.71 bits. There are two non-overlapping digrams that occur twice: ab and cd.
MinEnt(m_0, ab → e) = aeecdcdb
MinEnt(m_0, cd → e) = aababeeb
Based on the Re-Pair algorithm, we do not know which digram should be preferred, because both have the same frequency. In the MinEnt case, we can compute |m_1|^H for both cases, yielding |m_1|^H_{ab} = 18 bits and |m_1|^H_{cd} = 12.49 bits, and so the replacement cd → e will be the preferred one.
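The numbers in Example 4 can be reproduced with a small Python sketch that scores every candidate digram by the entropic size of the message after its replacement; this is a brute-force illustration of the MinEnt selection rule, not the priority-queue implementation described below.

    from collections import Counter
    from math import log2

    def entropic_size(msg):
        n = len(msg)
        return -sum(c * log2(c / n) for c in Counter(msg).values())

    def replace_digram(msg, a, b, g):
        out, i = [], 0
        while i < len(msg):
            if i + 1 < len(msg) and msg[i] == a and msg[i + 1] == b:
                out.append(g); i += 2
            else:
                out.append(msg[i]); i += 1
        return out

    m0 = list("aababcdcdb")                        # |m0|^H = 19.71 bits
    for a, b in [("a", "b"), ("c", "d")]:
        m1 = replace_digram(m0, a, b, "e")
        print(a + b, round(entropic_size(m1), 2))  # ab -> 18.0, cd -> 12.49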
The MinEnt and the Re-Pair strategies of digram selection are evaluated using the algorithm described in [13]. In the initialization phase of the algorithm, the input file is transformed into the linked list, and each input byte is converted into the unsigned integer value. In the next step, the linked list is scanned, and the frequencies and positions of all digrams are recorded. Frequencies of digrams, resp. the change of the entropic size of the message measured in bytes, are used as indices for the priority queue. The size of the queue is limited to the maximal frequency, resp. in the case of the MinEnt algorithm, the maximum entropic size decrease.
The algorithm iteratively selects the digram with the highest priority, replaces all occurrences of the digram with the newly-introduced symbol, decrements the counts of the neighboring digrams and increments the counts of the newly-introduced digrams. In the case of the MinEnt algorithm, we have to recompute the change of the entropic size of all digrams in the priority queue. We restrict the number of recomputed changes of the entropic size to the top 20 digrams with the highest priority, so that the time complexity of this additional step remains O(1). Both algorithms run in O(n) expected time; see [13] for details. The memory consumption is larger in the MinEnt case, because each digram has to be assigned an additional quantity: the value of the change of the entropic size of the message.

3.3. Discussion of the Transformation and Compression Function Selection Strategies

To demonstrate the behavior of the aforementioned algorithms, we proposed strategies for the selection of transformations and compression algorithms. We compared the evolution of the entropy of the alphabet, the entropic size of the message and the final size of the message given as the sum of the entropic size of the message and the upper boundary on the size of the dictionary (Section 3.3.1). The following strategies are compared:
  • G C T : selection of the generalized context transformation so that the decrease of entropy is maximal.
  • H O C T : selection of the higher order context transformation so that the decrease of entropy is maximal in the context of prefix w.
  • Re-Pair: selection of the most frequent digram and its replacement with an unused symbol.
  • MinEnt: selection of the most entropic size reducing digram and its replacement with an unused symbol.

3.3.1. The Upper Boundary on the Dictionary Entry Size

All transformations and compression functions are usually stored as entries in a dictionary. To be able to compare the effectiveness of transformations, we selected the worst-case entropy of each symbol, given by log |Σ_i|, where Σ_i is the alphabet and the subscript i denotes the number of applied transformations.
In the GCT and HOCT strategies, the size of the alphabet remains constant unless some symbols are completely removed, in which case the size of the alphabet decreases. The Re-Pair and MinEnt algorithms, which introduce new symbols, have an increasing alphabet size. The upper boundary on the resulting size of each dictionary entry |D| for the GCT and HOCT transformations is defined as:
|D| = 3 \log |\Sigma_0|
where |Σ_0| is the size of the initial alphabet. The Larsson and Moffat [13] version of Re-Pair introduces several efficient ways of dictionary encoding: the Bernoulli model, literal pair enumeration and interpolative encoding. In our experiments with Re-Pair and MinEnt, we used interpolative encoding to encode the dictionary.

3.3.2. Comparison of the Alphabet’s Entropy Evolution

Even though the transformations and compression functions pursue the same objective, minimization of the entropic size of the message, they achieve that by a different evolution of zero order entropy. Transformation-based strategies minimize zero order entropy; meanwhile, both compression strategies introduce new symbols, and as a result, zero order entropy grows. The initial values of the quantities of the examined test file are summarized in Table 1. The example of the comparison of the zero order entropy evolution of different strategies is provided in Figure 1a.
Both compression functions achieve a very similar resulting value of the zero order entropy. The Re-Pair strategy begins with the highest growth of entropy, but the increase slows down with the number of iterations as the frequency of each consecutive digram drops. As will be discussed in Section 4.2.2, digrams consisting of symbols with a lower frequency will be preferred by MinEnt, because they are able to achieve a larger decrease of the entropic size, and their replacement brings a lower cost in the zero order entropy increase. This behavior can be observed especially in the later iterations of the Re-Pair and MinEnt algorithms.
Both transformations reduce the value of zero order entropy. G C T initially drops faster, but in the end, it significantly slows down. The application of the H O C T strategy achieves the lowest resulting value of entropy, and the interesting fact is that it decreases at an almost constant rate. The behavior of entropy evolution for different values of the limit in H O C T is presented in Figure 1b. The unrestricted case ( L i m = 0 ) shows us the bottom limit of zero order entropy reduction using the H O C T strategy.

3.3.3. Comparison of Entropic Size Evolution

The selection of the most frequent digram produces the largest decrease of the number of symbols in each iteration. Surprisingly, the Re-Pair strategy does not necessarily converge to its minimum in a lower number of iterations than MinEnt. Figure 2 presents this behavior for the paper5 file of the Calgary corpus. Both approaches end with a similar number of symbols in the resulting message.
The MinEnt strategy achieves the lowest entropic size of the message, and at each iteration, the entropic size of the message is lower than in the case of the Re-Pair strategy, see Figure 3. The overall efficiency depends on our ability to compress the resulting dictionary.
A summary of the different transformation strategies is provided in Table 2. A summary of the compression functions is then given in Table 3. The least number of iterations was achieved by HOCT with Lim = |D|; this strategy also leads to the smallest final size |m_f|, but it should be emphasized that this final size |m_f| is a very pessimistic estimate, due to the construction of the size of the dictionary entries.
Even though the achieved results of both approaches are similar, we see that the resulting message size |m_f| and the alphabet size are lower in the case of MinEnt. The message size |m_f| is given by the sum of the entropic size of the message and the size of the dictionary stored by interpolative encoding. Using the values in the columns of Table 3, we express |m_f| = |m_i|^{H_i} + i·|D_i|; the term i·|D_i| represents the size of the dictionary, given as the product of the total number of iterations and the average number of bits needed to encode one iteration. See Table 4 and Table 5 for more results on files from the Calgary and Canterbury corpora.

4. Zero Order Entropy and Entropic Message Size Reduction

The primary purpose of context transformations and the other derived transformations is to reduce the zero order entropy measured by Shannon's entropy [2] defined in Equation (1). In this section, we show under what conditions a transformation or a compression function reduces the zero order entropy, resp. the entropic size of the message. Suppose that H_0 is the zero order entropy of the message m and H_1 is the zero order entropy after a transformation T is applied. The conditions under which the following inequality holds are the major subject of interest:
\Delta H = H_0 - H_1 = \sum_{x \in \Sigma} p_1(x) \log p_1(x) - \sum_{x \in \Sigma} p_0(x) \log p_0(x) > 0
Let Σ_T ⊆ Σ be the set of symbols whose frequencies before and after the transformation differ, and Σ_I ⊆ Σ the set of symbols whose frequencies are intact. For transformations, the inequality (5) can be further restricted to the set of symbols Σ_T, since the terms containing symbols from Σ_I cancel:
\Delta H = \sum_{x \in \Sigma_T} p_1(x) \log p_1(x) - \sum_{x \in \Sigma_T} p_0(x) \log p_0(x)
In the following paragraphs, we specify the form of the set Σ_T and the relations for the probabilities of its symbols after the transformations, so that the change of entropy given by Equation (6) can be computed before any transformation actually occurs.

4.1. Transformation of Probabilities

We begin with the simplest case: suppose the context transformation CT(αβ → αγ, w). Since only the probabilities of the symbols β and γ will change, Σ_CT = {β, γ}, and it is sufficient to express the probabilities only for β and γ:
p_1^{CT}(\beta) = p_0(\beta) - p_0(\alpha, \beta)
and:
p_1^{CT}(\gamma) = p_0(\gamma) + p_0(\alpha, \beta)
In the case of the generalized context transformation GCT(αβ ↔ αγ, w), the set Σ_GCT = Σ_CT is identical, and the probabilities transform according to:
p_1^{GCT}(\beta) = p_0(\beta) - p_0(\alpha, \beta) + p_0(\alpha, \gamma)
and:
p_1^{GCT}(\gamma) = p_0(\gamma) + p_0(\alpha, \beta) - p_0(\alpha, \gamma)
In the last case of the higher order transformation, the probabilities transform according to:
p_1^{HOCT}(\beta) = p_0(\beta) - p_0(w, \beta) + p_0(w, \gamma)
and:
p_1^{HOCT}(\gamma) = p_0(\gamma) + p_0(w, \beta) - p_0(w, \gamma)
In all cases, the set Σ T forms a binary alphabet. The following theorem then describes the condition for zero order entropy reduction:
Theorem 1.
Suppose the generalized context transformation GCT(αβ ↔ αγ). Let p_0(β) and p_0(γ) be the probabilities of the symbols before the transformation is applied, and let p_{0,max} = max{p_0(β), p_0(γ)}. After the transformation, the associated probabilities are p_1(β), p_1(γ) and p_{1,max} = max{p_1(β), p_1(γ)}. If p_{1,max} > p_{0,max}, then the generalized context transformation reduces the entropy.
The proof of Theorem 1 is based on the properties of entropy when only two letters from the alphabet Σ are considered. Let p(β) + p(γ) = c, where c ≤ 1; c is invariant, i.e., it does not change during the transformation. We can express one of these probabilities using the other one; for example, let p(γ) = c − p(β); this allows us to express the entropy function as a function of only one variable. A few examples of such functions are shown in Figure 4. The maximum value of the function is located at c/2, and it has two minima, at zero and at c.
Proof. 
Since the entropy function for two different letters is defined on the interval [0, c] and is concave with a maximum at c/2 and minima at zero and c, p_{0,max} has to be located in the interval [c/2, c]; but on that particular interval, the higher the maximum is, the lower the entropy is. So if we increase the maximum (or, equivalently, increase the absolute value of the difference |p(β) − p(γ)|), then the entropy will decrease. ☐
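The relations above allow the entropy change of a GCT to be predicted from the symbol and digram statistics alone, before the transformation is applied. The following hedged Python sketch combines Equation (6) with the GCT probability relations above; it assumes, as in Corollary A2, that α, β and γ are pairwise distinct, and the function name is illustrative.

    from collections import Counter
    from math import log2

    def predicted_gct_entropy_change(m, a, b, g):
        # Delta H of GCT(ab <-> ag) restricted to Sigma_T = {b, g};
        # a positive value means the zero order entropy decreases.
        n = len(m)
        sym = Counter(m)
        dig = Counter(zip(m, m[1:]))
        p0b, p0g = sym[b] / n, sym[g] / n
        pab, pag = dig[(a, b)] / n, dig[(a, g)] / n
        p1b = p0b - pab + pag          # GCT probability update for beta
        p1g = p0g + pab - pag          # GCT probability update for gamma

        def plogp(p):
            return p * log2(p) if p > 0 else 0.0

        return (plogp(p1b) + plogp(p1g)) - (plogp(p0b) + plogp(p0g))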

4.2. General Entropy Change Relations

In this section, we generalize the notion of the zero order entropy change to the exchange of any two words. The solution is divided into three parts. The first part deals with the set of symbols Σ_I whose frequency does not change before and after the substitution function is applied; the second part establishes relations for the set of symbols Σ_T whose probability changes, but whose initial and final frequencies are non-zero; the third part discusses symbols introduced to and removed from the alphabet. Let Σ_R be the set of removed symbols and Σ_N the set of introduced symbols; then we can split the sum in Equation (1), yielding:
H(X) = -\sum_{\alpha \in \{I, T, R, N\}} \sum_{x \in \Sigma_\alpha} p(x) \log p(x)
The four sets of symbols in Equation (13) exhibit different behaviors under the substitution function, and they will be discussed in separate sections. The entropic size of the message |m|^H can also be handled separately; let H(p(Σ_α)) be the portion of the entropy conveyed by symbols from the alphabet Σ_α, and let |m|^{H(Σ_α)} be the corresponding portion of the entropic size of the message; then we can split the resulting entropic size as we did before:
|m|^H = |m| [H(p(\Sigma_I)) + H(p(\Sigma_T)) + H(p(\Sigma_R)) + H(p(\Sigma_N))]
= |m| H(p(\Sigma_I)) + |m| H(p(\Sigma_T)) + |m| H(p(\Sigma_R)) + |m| H(p(\Sigma_N))
= |m|^{H(\Sigma_I)} + |m|^{H(\Sigma_T)} + |m|^{H(\Sigma_R)} + |m|^{H(\Sigma_N)}

4.2.1. The First Part: Symbols Remaining Intact by the Substitution Function

We begin with symbols that are not part of either of the substituted words s_1 or s_2. Suppose that the message m_0 of length |m_0| turns into some message m_1 of size |m_1|. Generally, |m_0| ≠ |m_1|, but in the special case of context transformations, these two quantities are equal. However, when compression or expansion of the message occurs, this part of the Shannon equation also changes, due to the change in the total number of symbols.
Suppose that the symbol x is initially in the message m 0 with the probability p 0 ( x ) . This probability can be expressed using the frequency f 0 ( x ) and the size of the message as:
p_0(x) = \frac{f_0(x)}{|m_0|}
Later, after the transformation is applied, the probability changes to:
p_1(x) = \frac{f_0(x)}{|m_0| + \Delta m}
where Δm is the change of the message size. In the case of context transformations, where the message size remains the same, the probability remains the same, as well as the part of the entropy formed by the non-transformed symbols.
When the two probabilities are placed in relation by some stretching factor c_1, we arrange them into the form:
p_1(x) = c_1 p_0(x)
The introduction of c_1 is motivated by the properties of logarithms: if we stayed with p_1(x) given by Equation (16), we would get the logarithm log(f_0(x)/(|m_0| + Δm)) = log f_0(x) − log(|m_0| + Δm); if, instead, we express p_1(x) using (17), then the argument of the logarithm is in product form, and log(c_1 p_0(x)) = log c_1 + log p_0(x). The factor c_1 can be expressed by substituting p_1(x) in Equation (17) with the terms of Equations (15) and (16), leading to:
\frac{f_0(x)}{|m_0| + \Delta m} = c_1 \frac{f_0(x)}{|m_0|}
c_1 = \frac{|m_0|}{|m_0| + \Delta m}
Then, the relation for zero order entropy after transformation will have the form:
H(p_1(\Sigma_I)) = -\sum_{x \in \Sigma_I} c_1 p_0(x) \log [c_1 p_0(x)]
= -c_1 \sum_{x \in \Sigma_I} p_0(x) [\log c_1 + \log p_0(x)]
= -c_1 \log c_1 \sum_{x \in \Sigma_I} p_0(x) - c_1 \sum_{x \in \Sigma_I} p_0(x) \log p_0(x)
= -c_1 \log c_1 \sum_{x \in \Sigma_I} p_0(x) + c_1 H(p_0(\Sigma_I))
= c_1 [H(p_0(\Sigma_I)) - \log c_1 \sum_{x \in \Sigma_I} p_0(x)]
An example of the behavior of H(p_1(Σ_I)) of the intact part is visualized in Figure 5. When compression of the message occurs, i.e., log c_1 > 0, the zero order entropy of the intact symbols increases. The less probability is conveyed by the symbols from Σ_I, the more their zero order entropy is sensitive to the change of c_1.
The final entropic size is given as follows (using |m_1| c_1 = |m_0|):
|m_1|^{H,\Sigma_I} = |m_1| H(p_1(\Sigma_I)) = |m_1| c_1 [H(p_0(\Sigma_I)) - \log c_1 \sum_{x \in \Sigma_I} p_0(x)]
= |m_0| H(p_0(\Sigma_I)) - |m_0| \log c_1 \sum_{x \in \Sigma_I} p_0(x)
= |m_0|^{H,\Sigma_I} - |m_0| \sum_{x \in \Sigma_I} p_0(x) \log c_1
= |m_0|^{H,\Sigma_I} - \log c_1 \sum_{x \in \Sigma_I} f_0(x)
If we apply one of the transformations, then Δm = 0 and, as a consequence, c_1 = 1; the last term on the right is zero due to log c_1 = 0, so Equation (20) tells us that the entropic size of the message carried by these symbols does not change during a transformation. When |Σ_T| is much smaller than |Σ_I|, it is convenient to rewrite Equation (20) in terms of Σ_T:
|m_1|^{H,\Sigma_I} = |m_0|^{H,\Sigma_I} - \log c_1 [|m_0| - \sum_{x \in \Sigma_T} f_0(x)]
Corollary 1.
No compression function ever increases the entropic size of the part of the message consisting of intact symbols.
Proof. 
The compression function has a value of c_1 larger than one; as a consequence, log c_1 > 0, and so |m_1|^{H,Σ_I} ≤ |m_0|^{H,Σ_I}. ☐
Equality in |m_1|^{H,Σ_I} ≤ |m_0|^{H,Σ_I} occurs when Σ_{x∈Σ_I} f_0(x) = 0, i.e., when there are no intact symbols. When expansion of the message occurs, then log c_1 < 0, and the second term on the right of Equation (20) becomes positive. Expansion of the message leads to an increase of the entropic size, whereas compression leads to a decrease of the entropic size of the intact symbols.
In each iteration of the Re-Pair algorithm, the most frequent digram is selected. This corresponds to the selection of the digram with the maximal value of log c_1, but it does not have to be the digram minimizing the entropic size of this part of the resulting message the most. Consider two digrams d_1 and d_2 whose frequencies are equal, f(d_1) = f(d_2); replacing either one with a new symbol yields the same stretching factor c_1, but not necessarily the same Σ_{x∈Σ_I} f_0(x). A larger reduction of the entropic size of the message is achieved when the compressed digrams or words consist of less frequent symbols.
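A quick numeric illustration of this point, using Equation (21) with hypothetical frequencies chosen only for the illustration: two digrams with the same frequency give the same c_1, but the digram built from rarer symbols leaves more frequency mass in Σ_I and therefore yields a larger reduction of |m|^{H,Σ_I}.

    from math import log2

    m0_len, f_digram = 1000, 50        # message length and frequency of either digram
    c1 = m0_len / (m0_len - f_digram)  # the same stretching factor for both candidates
    for name, f_symbols in [("frequent symbols", 400), ("rare symbols", 120)]:
        # f_symbols = sum of f0(x) over the two symbols forming the digram
        reduction = log2(c1) * (m0_len - f_symbols)  # drop of |m|^{H, Sigma_I}, Eq. (21)
        print(name, round(reduction, 1), "bits")     # ~44.4 bits vs. ~65.1 bits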

4.2.2. The Second Part: Symbols Participating in the Substitution Function

In the second case, the frequencies of the symbols and their total number change. The equation for the stretching factor c_2 is derived in the following way:
p_1(x) = \frac{f_1(x)}{|m_1|} = \frac{f_0(x) + \Delta f(x)}{|m_0| + \Delta m}
\frac{f_0(x) + \Delta f(x)}{|m_0| + \Delta m} = c_2 \frac{f_0(x)}{|m_0|}
The main difference between the two cases is that c_1 is a constant, whereas c_2 is a function of the particular symbol x:
c_2(x) = \frac{(f_0(x) + \Delta f(x)) |m_0|}{f_0(x) (|m_0| + \Delta m)} = \frac{f_0(x) + \Delta f(x)}{f_0(x)} c_1 = F(x) c_1
where in the last step we made the substitution F(x) = (f_0(x) + Δf(x)) / f_0(x). The rest of the derivation follows the derivation of Equation (19):
H(p_1(\Sigma_T)) = -\sum_{x \in \Sigma_T} p_0(x) c_2(x) \log c_2(x) - \sum_{x \in \Sigma_T} c_2(x) p_0(x) \log p_0(x)
The behavior of Equation (23) for different values of p 0 ( x ) is visualized in Figure 6. The substitution of less frequent symbols leads to a lower increase of zero order entropy.
The resulting entropic size simplifies given that:
|m_1| c_2(x) p_0(x) = f_0(x) + \Delta f(x) = f_1(x)
yields:
|m_1|^{H,\Sigma_T} = |m_1| H_T(p_1) = -\sum_{x \in \Sigma_T} [f_0(x) + \Delta f(x)] (\log c_2(x) + \log p_0(x))
We now analyze both terms in (25) from the perspective of different values of c_2(x). We are particularly interested in compression functions. For a compression function, c_1 > 1; symbols with Δf(x) < 0, i.e., symbols whose frequency decreases, have F(x) < 1. The positivity or negativity of log c_2 then depends on the value of the product F(x) c_1.
The case F(x) c_1 = 1 has the solution F(x) = 1/c_1, for which log c_2(x) = 0. The term log p_0(x) is always negative. The value of F(x) must be larger than 1/c_1 to decrease the zero order entropy conveyed by the symbol x, since then c_2(x) > 1 and, as a consequence, log c_2(x) > 0:
F(x) > \frac{1}{c_1}
\frac{f_0(x) + \Delta f(x)}{f_0(x)} > \frac{|m_0| + \Delta m}{|m_0|}
1 + \frac{\Delta f(x)}{f_0(x)} > 1 + \frac{\Delta m}{|m_0|}
\frac{\Delta f(x)}{f_0(x)} > \frac{\Delta m}{|m_0|}
\frac{|\Delta f(x)|}{f_0(x)} < \frac{|\Delta m|}{|m_0|}
\frac{|\Delta f(x)|}{|\Delta m|} < \frac{f_0(x)}{|m_0|} = p_0(x)
The introduction of the absolute values in the middle step of the derivation of Inequality (26) is allowed, since for compression functions the values of Δf(x) and Δm can only be negative. Suppose now that we have a digram d = αβ with α ≠ β, and we replace it by the newly-introduced γ; then Δm = Δf(α) = Δf(β). The left part of Inequality (26) becomes equal to one, so Inequality (26) cannot be satisfied, and log c_2(x) in this case will be negative and will always increase the amount of information carried by the symbols α and β.
Finally, we state the condition for the entropic size decrease:
Corollary 2.
The entropic size of the part of the message formed by symbol x decreases when:
\frac{\Delta f(x)}{f_0(x)} < \frac{-\log c_2(x)}{\log c_2(x) + \log p_0(x)}
Proof. 
|m_1|^H < |m_0|^H
-[f_0(x) + \Delta f(x)] [\log c_2(x) + \log p_0(x)] < -f_0(x) \log p_0(x)
-f_0(x) \log c_2(x) - \Delta f(x) \log p_0(x) - \Delta f(x) \log c_2(x) < 0
-\Delta f(x) [\log p_0(x) + \log c_2(x)] < f_0(x) \log c_2(x)
\frac{\Delta f(x)}{f_0(x)} < \frac{-\log c_2(x)}{\log c_2(x) + \log p_0(x)}
where the last step divides by log c_2(x) + log p_0(x) = log p_1(x) < 0, which flips the inequality. ☐

4.2.3. Third Part: Introduced and Removed Symbols

We begin with a symbol x that is completely removed from the message, so that initially p_0(x) ≠ 0, but p_1(x) = 0. This case is trivial; such a symbol has zero participation in the final value of the entropy and of the entropic size of the message. The remaining case we have to deal with is when initially the symbol x has zero probability, p_0(x) = 0, but after the substitution its probability increases to some p_1(x) ≠ 0. The final probability is given as:
p_1(x) = \frac{\Delta f(x)}{|m_0| + \Delta m}
Since the symbol x initially has zero participation in the entropy and the entropic size, it always leads to an increase of both quantities. For the set Σ_N of all such symbols, its portion of the total entropy is given by:
H(p_1(\Sigma_N)) = -\sum_{x \in \Sigma_N} \frac{\Delta f(x)}{|m_0| + \Delta m} \log \frac{\Delta f(x)}{|m_0| + \Delta m}
and the corresponding final entropic size is given by:
|m_1|^{H,\Sigma_N} = |m_1| H_N(p_1) = -\sum_{x \in \Sigma_N} \Delta f(x) [\log \Delta f(x) - \log(|m_0| + \Delta m)]
It is important to remark that it does not make much sense to introduce more than one symbol in one substitution function, because every additionally introduced symbol contributes another positive term to both quantities.

4.3. Calculation of Δ | m | H

At first glance, it seems that we need to evaluate all symbols to predict the zero order entropy, but instead it is possible to predict the exact change of the entropic size of the message after the application of a compression function by evaluating the entropic sizes given by Equations (21), (25) and (30), dealing only with the symbols x ∈ Σ \ Σ_I. In the particular case of the Re-Pair algorithm, there are only two symbols whose frequencies we need to know to evaluate the change of the entropic size of the message; suppose a compression function CF(αβ → γ) such that p_1(α) ≠ 0, p_1(β) ≠ 0 and p_0(γ) = 0; then the resulting change of the entropic size is given as:
\Delta |m|^H = |m_0| \log c_1 - \log c_1 \sum_{x \in \{\alpha, \beta\}} f_0(x) + \sum_{x \in \{\alpha, \beta\}} [f_0(x) \log c_2(x) + \Delta f(x) \log p_0(x) + \Delta f(x) \log c_2(x)] + \Delta f(\gamma) [\log \Delta f(\gamma) - \log(|m_0| + \Delta m)]
Finally, for Re-Pair it holds that, if α ≠ β, then Δm = Δf(α) = Δf(β) = −Δf(γ) = −f(α, β), and the absolute value of every Δ in (31) equals f(αβ). If α = β, then Δf(α)/2 = −Δf(γ).
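As a sanity check, Equation (31) can be evaluated for the replacement ab → e from Example 4 and compared against a direct recomputation of the entropic size; a short, self-contained Python sketch with illustrative names:

    from collections import Counter
    from math import log2

    def entropic_size(msg):
        n = len(msg)
        return -sum(c * log2(c / n) for c in Counter(msg).values())

    m0 = list("aababcdcdb")
    f_ab = 2                            # frequency of the replaced digram ab
    n0, dm = len(m0), -f_ab
    c1 = n0 / (n0 + dm)
    f0 = Counter(m0)

    delta = n0 * log2(c1) - log2(c1) * (f0["a"] + f0["b"])   # intact part, Eq. (21)
    for x in "ab":                                           # transformed part, Eq. (25)
        df = -f_ab
        c2 = (f0[x] + df) / f0[x] * c1
        p0 = f0[x] / n0
        delta += f0[x] * log2(c2) + df * log2(p0) + df * log2(c2)
    delta += f_ab * (log2(f_ab) - log2(n0 + dm))             # new symbol e, Eq. (30)

    direct = entropic_size(m0) - entropic_size(list("aeecdcdb"))  # 19.71 - 18
    # delta and direct both evaluate to roughly 1.71 bits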

5. Conclusions

We described three types of transformations for the preprocessing of messages so that the zero order entropy of the message drops and the resulting message can be encoded more efficiently using zero order entropy compression algorithms like Huffman or arithmetic coding.
We presented the relations that govern the change of the message size for transformations and compression functions. Transformations have the advantage that they do not modify the size of the alphabet; in the case of the digram substitution used by Re-Pair and by our proposed MinEnt strategy, the resulting size of the alphabet grows significantly, which brings additional complexity in the storage of the entropy coding model, i.e., the storage of the output alphabet.
The MinEnt strategy selects the digrams to be replaced by the minimal entropic size of the resulting message, and it was shown that in most cases the resulting message size is smaller than the one achieved by Re-Pair. We also showed that the two algorithms follow slightly different execution paths, as MinEnt prefers digrams that consist of less frequent symbols, whereas Re-Pair does not take this into consideration.
The compression functions have an advantage over the transformations, as they achieve a better resulting compression ratio. In future work, we will focus on the storage of the dictionary used by the transformation algorithms, because this area can significantly improve the resulting compression ratio. Further, we will focus on the description of the relation between the entropy coding model of the final message and the size of the final alphabet.

Acknowledgments

This work was supported by the project SP2017/100 Parallel processing of Big Data IV, of the Student Grant System, VSB-Technical University of Ostrava. The costs for open access were covered.

Author Contributions

Michal Vasinek realized this work and proposed and developed the implementation of the CT, GCT, HOCT and MinEnt algorithms. Jan Platos provided guidance during the writing process and revised the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The following sections present the properties of the transformations. Specifically, for each type of transformation, we provide a proof of the existence of the inverse transformation. Further, we describe how the frequencies of symbols are altered when the particular transformation is applied.

Appendix A.1. CT—Proof of the Correctness

This theorem defines the inverse transformation of the context transformation:
Theorem A1.
The context transformation CT^{-1} ≡ CT(αγ → αβ), applied in the direction opposite to the forward transformation, is the inverse transformation of the context transformation CT(αβ → αγ).
Proof. 
Let CT^{-1} ≡ CT(αγ → αβ); if CT^{-1} is the inverse, then the following must be true for any message m: CT^{-1}(CT(m)) = m. Suppose that we are passing the message m from the end to the beginning and that at positions i and i+1 the digram αβ is located; this digram is replaced by αγ. The next pair of positions explored is i−1 and i, but their value is independent of the preceding replacement, because the replacement took place at position i+1. So when CT^{-1} is applied at position i, it finds there the digram αγ and reverts it back to αβ. ☐
Other combinations of directions do not form a pair of a transformation and its inverse. We give an example for each such combination: CT(αβ → αα) and CT^{-1} over the message m = αβα: CT(αβα) = ααα, but CT^{-1}(ααα) = αββ ≠ αβα. Next, consider CT(αβ → αα) and CT^{-1} over the message m = ααβ: CT(ααβ) = ααα, but CT^{-1}(ααα) = αβα ≠ ααβ. In the last case, consider again CT(αβ → αα) and CT^{-1} over the message m = ααβ: CT(ααβ) = ααα, but CT^{-1}(ααα) = αββ ≠ ααβ.
Let f(αγ, m) denote the number of occurrences of a particular digram αγ in a message m; by the definition of CT, f_0(αγ, m) = 0. The following corollary tells us how many digrams αγ are introduced by the context transformation CT(αβ → αγ):
Corollary A1.
Under the assumption that α ≠ γ, the numbers of occurrences of the digrams αγ and αβ after the application of the transformation CT(αβ → αγ) are f_1(αγ, CT(m)) = f_0(αβ, m) and f_1(αβ, CT(m)) = 0.
Proof. 
The proof is a consequence of Theorem A1: since the replacements are independent of each other, each digram αβ is replaced by αγ, leaving f_1(αβ) = 0 and f_1(αγ) = f_0(αβ). ☐
The corollary allows us to precisely predict not only the frequencies of the interchanged digrams αβ and αγ, but also, as a consequence, the frequencies of the individual symbols after the transformation. The special case of transformations on the diagonal (see Definition A1) is discussed in the next paragraphs.

Appendix A.2. Diagonal Context Transformation

A diagonal transformation is a transformation where one of the digrams participating in the transformation is of the form αα. The resulting frequency of such a digram is unpredictable without knowledge of the distribution of all n-grams of the form α^n, where n ≥ 2, but we show that for any diagonal CT it is possible to predict the frequencies of the symbols α and β. The problems with predictability arise from the repetition of symbols.
Definition A1.
Diagonal context transformation is a context transformation of the form CT(αα → αβ) or CT(αβ → αα).
Consider two transformations, CT_1 ≡ CT(αα → αβ, ααα) = αββ and CT_2 ≡ CT(αβ → αα, αβα) = ααα. If Corollary A1 were also valid for diagonal transformations, then, for instance in the case of CT_1, the frequency would be f_1(αβ) = f_0(αα); but this is obviously not true. Instead, we see that the new frequency f(β) of the symbol β is f_1(β) = f_0(αα).
Suppose we have a message s = α^n; then CT_1(s) = αβ^{n−1}. We clearly see that the frequency f(αβ, CT_1(s)) = 1 and f(ββ, CT_1(s)) = n − 2, because the number of digrams in a message is given by the length of the message minus one. We can now express the frequency f(αβ, CT_1(m)) of the newly introduced occurrences of the digram αβ as a sum over all sub-messages enclosed in m of the form x s x, where x ≠ α, for all n ≥ 2. So we see that it is possible to precisely predict the change of the frequency of αβ, but it demands knowledge of the distribution of all enclosed sub-messages s.
From the other perspective, since each occurrence of the digram αα in the former message is transformed into αβ, we can see that f_1(β, CT_1(m)) = f_0(β) + f_0(αα) and f_1(α, CT_1(m)) = f_0(α, m) − f_0(αα).
Very similar behavior is observed in the second case, CT_2. The problem is the repetition of the pattern t = (αβ)^n, for which CT_2(t) = α^{2n} and f_1(αα) = 2n − 1. Again, without knowledge of all sub-messages t enclosed in m, we cannot predict the exact change of the frequency of either digram αα or αβ; but since we know that each pair αβ in the former message will be transformed into αα, we can again precisely predict the frequencies of the individual symbols: f_1(α) = f_0(α) + f_0(αβ) and f_1(β) = f_0(β) − f_0(αβ).
With the knowledge of the preceding discussion and of Corollary A1, we conclude that for any context transformation CT, we are able to compute the frequency and the corresponding probability of an arbitrary symbol after the application of any CT from the knowledge of the initial distribution of symbols and digrams. In [26], we showed that under certain conditions it is possible to process several context transformations simultaneously.

Appendix A.3. GCT—Frequencies Alteration

Corollary A2.
Under the assumption that α ≠ γ, α ≠ β and β ≠ γ, the numbers of occurrences of the digrams αγ and αβ after the application of the transformation GCT(αβ ↔ αγ) are f_1(αγ, GCT(m)) = f_0(αβ, m) and f_1(αβ, GCT(m)) = f_0(αγ, m).
Proof. 
Since each digram αβ, resp. αγ, is replaced by αγ, resp. αβ, and neither of the digrams influences the transformation of the other, their frequencies must interchange. ☐

Appendix A.4. Generic Transformation—Proof of Correctness

The generic transformation GT exchanges any two digrams. In the design of algorithms, we prefer GCT over GT, since the space from which generic transformations are selected is in this case of the order of |Σ|^4, and when alphabets of large size are dealt with, the search in such a space would be computationally very expensive.
Definition A2.
Generic transformation (GT) is a mapping GT(αβ ↔ γρ, w): Σ^n → Σ^n, where Σ is the alphabet of the input message w and n is the length of the input message, that exchanges all digrams αβ with digrams γρ and vice versa.
The inverse transformation of G C T and G T is defined by the following theorem:
Theorem A2.
The generic transformation GT^{-1}, defined as the generic transformation GT(αβ ↔ γρ) applied in the direction opposite to the forward transformation, is the inverse of the generic transformation GT(αβ ↔ γρ).
Proof. 
First, we show that it is sufficient to prove that, for any string s = x w x, it holds that GT^{-1}(GT(s)) = s, where x ∉ Σ_GT = {α, β, γ, ρ} and w[i] ∈ Σ_GT. Suppose that x is located at position p; then, for the digrams d at positions (p−1, p) and (p, p+1), it holds that GT(d) = d. So the first possible applications of GT can occur at positions (p−2, p−1) and (p+1, p+2), and these are independent, i.e., non-overlapping.
Next, we show that each replacement made by the forward transformation will be reverted by the inverse transformation. Take, for example, the transformation GT(αβ ↔ γρ) applied in the right-to-left direction. The last applied forward replacement, at positions (r, r+1), replaces, for instance, the digram αβ with γρ, leaving w[r, r+1] = γρ; the inverse transformation, by definition the same transformation applied in the opposite direction, reverts the digram γρ back to αβ. Now consider any triplet of positions (r−1, r, r+1) in the transformed message. The input of the inverse transformation at (r, r+1) depends on the result of the inverse transformation at the preceding pair of positions, but, as we saw, the first applied inverse reverted its digram correctly, so the state at positions (r+1, r+2) is exactly the state left by the forward transformation at these positions. Hence any other digram will be reverted correctly, because every preceding application of the inverse leaves the digram in the state left by the forward transformation, and this digram is trivially reverted back to its initial state. The same argument is valid for GT applied in the opposite direction, since applying GT to m in one direction is equivalent to applying it in the other direction to the mirror message m^T. ☐

Appendix A.5. HOCT—Proof of the Correctness

The following trivial Lemma will help us to formulate a theorem about inverse transformation to H O C T :
Lemma A1.
Let T = HOCT(wβ ↔ wγ, m, P(w, m)) be a higher order context transformation over the input message m. Given that we possess the knowledge of w and of the positions P(w, m), then T^{-1} = T.
Proof. 
Because we do not have to pass through the whole message in either the forward or the inverse transformation case, but only through the set of positions P(w, m), the symbol at a position i ∈ P(w, m), for instance m[i] = β, is switched by the HOCT to m[i] = γ, and by a repeated application of the HOCT it reverts back to m[i] = β. ☐
Lemma A1 is trivial, but it comes into play when P(w, m) is a product of some other higher order context transformation, i.e., the one with an order lower by one.
Theorem A3.
Let m_1 = HOCT_1(wα ↔ wβ, m, P(w, m)) and m_2 = HOCT_2(wαγ ↔ wαρ, m_1, P(wα, m_1)) be two higher order context transformations. Let T(m) = HOCT_2(HOCT_1(m)) be the composition of the two higher order context transformations over the input message m. Let HOCT_2^{-1} ≡ HOCT_2(wαγ ↔ wαρ, m_3, P(wβ, m_3)), where m_3 = HOCT_1(m_2); then the composition T^{-1} ≡ HOCT_2^{-1}(HOCT_1(m_2)) = m is the inverse transformation of T.
Several remarks on the formulation of Theorem A3: the transformations HOCT_1 and HOCT_2 are applied over two consecutive states of the message. The positions P(wα, m_1) correspond to the positions P(wβ, m), since the sub-messages wα were replaced by wβ in the application of HOCT_1. The inverse transformation HOCT_2^{-1} is applied instead over the positions P(wβ, m_3), since these positions have already been reverted back by HOCT_1.
The proof is based on the restriction that w[0] ≠ w[i], i > 0; it can be viewed as splitting the input message m into sub-messages s_i separated by w[0]. For instance, suppose that w[0] is the space character in an ordinary text; since, by Definition 3, no other character of w can be the space character, it follows that the possible transformations are applied to the words following the space character. Now, the fact that each s_i is enclosed by w[0], i.e., that the sub-messages do not overlap, allows us to handle each sub-message s_i independently.
Proof. 
For the two sets of positions, it holds that P(w, m) ∩ P(wα, m_1) = ∅, because the elements of the former are predecessors of the elements of the latter and the sub-messages s_i do not overlap. The locations of w in m and in m_2 are identical, as they were not modified by either transformation, i.e., P(w, m) = P(w, m_2). When we apply HOCT_1 again, it simply reverts the symbols at the positions given by P(w, m) according to Lemma A1, yielding the message state m_3. In the forward transformation, HOCT_2 was applied over the positions P(wα, m_1), but these are the former positions P(wβ, m), which have already been transformed back by the application of HOCT_1; hence P(wα, m_1) is equal to P(wβ, m_3), and when HOCT_2 is applied over the positions P(wβ, m_3), it exchanges the symbols γ and ρ and eventually yields m. ☐
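The bookkeeping of the proof can be traced on the small example m = abcabc that is also used below. The following sketch, with hand-picked position sets and a symbol-exchange helper of the same kind as in the previous sketch, is only an illustration under these assumptions, not a general implementation of the transformation composition.

```python
def hoct(message, positions, beta, gamma):
    # Exchange beta <-> gamma at the listed positions (symbols following w).
    out = list(message)
    for i in positions:
        if out[i] == beta:
            out[i] = gamma
        elif out[i] == gamma:
            out[i] = beta
    return "".join(out)

m   = "abcabc"
P_w = [1, 4]                   # positions following the context w = "a" in m
m1  = hoct(m, P_w, "b", "a")   # HOCT_1 exchanges a <-> b after "a": "aacaac"
P_2 = [2, 5]                   # positions following the context "aa" in m1
m2  = hoct(m1, P_2, "c", "a")  # HOCT_2 exchanges c <-> a after "aa": "aaaaaa"

# Inverse composition of Theorem A3: apply HOCT_1 again (Lemma A1), then
# HOCT_2 again over the same index set, which in m3 follows the reverted
# context "ab".
m3 = hoct(m2, P_w, "b", "a")   # "abaaba"
assert hoct(m3, P_2, "c", "a") == m
```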
The recursive application of Theorem A3 leads to the conclusion that this process can be repeated until there is no pair of symbols left other than those containing w[0] as one of the symbols α or β, or until we simply reach the end of the message.
Corollary A2, about the prediction of frequencies in the case of GCT, is also applicable in the case of HOCT, because the principle that the exact number of replacements is known remains valid, and we are therefore able to compute the future probabilities of symbols precisely before an arbitrary HOCT is applied.
If we implement the inverse algorithm as a sequential algorithm operating in the left-to-right manner, it is possible for one of the transformation symbols β, γ to be equal to w[0]. Consider the following example: m = abcabc, P(a, m) = {0, 3}, HOCT_1(ab → aa) and HOCT_2(aac → aaa), yielding the output message m_2 = aaaaaa. Applying the inverse transformation sequentially from left to right, we first replace aa by ab, yielding m_i = abaaaa; we then apply the replacement of aba by abc, yielding m_{i+1} = abcaaa. Because no further transformation is induced from abc, we know that the next symbol a is w[0], and we can repeat the preceding process starting from this a. The sufficient condition for introducing w[0] as the transformation symbol β or γ is that w contains no other occurrence of w[0] in w[i], i > 0, because the inverse process removes all introduced w[0] symbols from the transformed message during the left-to-right sequential inverse transformation.
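A short Python sketch of this left-to-right sequential inverse, hard-coded for exactly the example above (the exchange a ↔ b after the context a, the exchange a ↔ c after the reverted context ab, and the step over a fully reverted word), may make the scan explicit; it is our illustrative sketch under these assumptions, not the article's general algorithm.

```python
def inverse_sequential(m2):
    """Left-to-right sequential inverse for the example m = abcabc with
    HOCT_1(ab -> aa) and HOCT_2(aac -> aaa)."""
    m = list(m2)
    i = 0
    while i < len(m):
        if m[i] == "a" and i + 1 < len(m):                     # w[0] detected
            m[i + 1] = {"a": "b", "b": "a"}.get(m[i + 1], m[i + 1])      # revert HOCT_1
            if m[i] + m[i + 1] == "ab" and i + 2 < len(m):     # reverted context
                m[i + 2] = {"a": "c", "c": "a"}.get(m[i + 2], m[i + 2])  # revert HOCT_2
            i += 3   # the word starting at i is fully reverted; continue after it
        else:
            i += 1
    return "".join(m)

assert inverse_sequential("aaaaaa") == "abcabc"
```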

References

1. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: New York, NY, USA, 2006.
2. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
3. Huffman, D.A. A Method for the Construction of Minimum-Redundancy Codes. Proc. Inst. Radio Eng. 1952, 40, 1098–1101.
4. Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic Coding for Data Compression. Commun. ACM 1987, 30, 520–540.
5. Charikar, M.; Lehman, E.; Lehman, A.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The Smallest Grammar Problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576.
6. Nevill-Manning, C.G. Inferring Sequential Structure. Ph.D. Thesis, University of Waikato, Hamilton, New Zealand, May 1996.
7. Nevill-Manning, C.G.; Witten, I.H. Identifying Hierarchical Structure in Sequences: A Linear-time Algorithm. J. Artif. Int. Res. 1997, 7, 67–82.
8. Kieffer, J.C.; Yang, E.-H. Grammar Based Codes: A New Class of Universal Lossless Source Codes. IEEE Trans. Inf. Theory 2000, 46, 737–754.
9. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536.
10. Yang, E.; He, D. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform, Part 2: With context models. IEEE Trans. Inf. Theory 2003, 49, 2874–2894.
11. Gage, P. A New Algorithm for Data Compression. C Users J. 1994, 12, 23–38.
12. Nakamura, H.; Marushima, S. Data Compression by Concatenation of Symbol Pairs. In Proceedings of the IEEE International Symposium on Information Theory and Its Applications, Paris, France, 13–17 September 1996; pp. 496–499.
13. Larsson, N.J.; Moffat, A. Off-line dictionary-based compression. Proc. IEEE 2000, 88, 1722–1732.
14. Claude, F.; Farina, A.; Navarro, G. Re-Pair Compression of Inverted Lists. arXiv 2009.
15. Masaki, T.; Kida, T. Online Grammar Transformation Based on Re-Pair Algorithm. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 29 March–1 April 2016; pp. 349–358.
16. Grassberger, P. Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution. arXiv 2002.
17. Calcagnile, L.M.; Galatolo, S.; Menconi, G. Non-sequential recursive pair substitutions and numerical entropy estimates in symbolic dynamical systems. arXiv 2008.
18. Navarro, G.; Russo, L. Re-pair Achieves High-Order Entropy. In Proceedings of the Data Compression Conference, DCC 2008, Snowbird, UT, USA, 25–27 March 2008; p. 537.
19. Vasinek, M.; Platos, J. Entropy Reduction Using Context Transformations. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 26–28 March 2014; p. 431.
20. Vasinek, M.; Platos, J. Generalized Context Transformations—Enhanced Entropy Reduction. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 7–9 April 2015; p. 474.
21. Vasinek, M.; Platos, J. Higher Order Context Transformations. arXiv 2017.
22. Kida, T.; Matsumoto, T.; Shibata, Y.; Takeda, M.; Shinohara, A.; Arikawa, S. Collage System: A Unifying Framework for Compressed Pattern Matching. Theor. Comput. Sci. 2003, 298, 253–272.
23. González, R.; Navarro, G. Compressed Text Indexes with Fast Locate. In Proceedings of the 18th Annual Conference on Combinatorial Pattern Matching, CPM'07, London, ON, Canada, 9–11 July 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 216–227.
24. Claude, F.; Farina, A.; Navarro, G. Re-Pair Compression of Inverted Lists. arXiv 2009.
25. Vasinek, M. Kontextove Mapy a Jejich Aplikace (Context Maps and Their Applications). Master's Thesis, Vysoka Skola Banska—Technicka Univerzita Ostrava, Ostrava, Czech Republic, 2013.
26. Vasinek, M.; Platos, J. Parallel Approach to Context Transformations. Available online: http://ceur-ws.org/Vol-1343/paper4.pdf (accessed on 11 May 2017).
Figure 1. Comparison of zero order entropy evolution over the paper5 file from the Calgary corpus. (a) Evolution of zero order entropy for different strategies; (b) evolution of zero order entropy for different values of the limit (LIM) in the HOCT strategy.
Figure 2. Comparison of Re-Pair and MinEnt algorithms: evolution of the message size measured in the number of symbols over the paper5 file from the Calgary corpus.
Figure 3. Comparison of Re-Pair, MinEnt, GCT and HOCT algorithms: evolution of the entropic message size measured in bits per byte over the paper5 file from the Calgary corpus.
Figure 4. The entropy of two letters with different c_i ∈ {0.25, 0.5, 1.0}.
Figure 5. The portion of entropy H(p_1(Σ_I)) given by symbols from Σ_I as a function of c_1 for the constant H(p_0(Σ_I)) = 4 and p_0(x) ∈ {0.2, 0.5, 0.8}.
Figure 6. Dependency of H(p_1(Σ_T)) on different values of c_2 for three cases of p_0(x) ∈ {0.05, 0.1, 0.2}.
Table 1. Characteristics of the paper5 file from the Calgary corpus: the initial size of the alphabet |Σ|, the initial file size |m_0| measured in bytes, the initial entropy H_0 measured in bits and the initial entropic size |m_0|_H measured in bytes.

File Name   |Σ|   |m_0|    H_0     |m_0|_H
paper5      91    11,954   4.936   7376
Table 2. The comparison of transformation strategies using different criteria: LIM is the limiting size of the dictionary entry in bytes; i is the number of iterations; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; log|Σ_i| is the upper boundary on the amount of information needed to store one symbol in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Strategy   LIM   i      H_i     |m_i|_H   log|Σ_i|   |m_f|
GCT        0     2367   3.796   5674      6.508      11,451
GCT        |D|   127    4.260   6366      6.508      6676
HOCT       0     3821   2.786   4163      6.508      13,488
HOCT       4     1143   3.528   5272      6.508      8061
HOCT       8     525    3.830   5713      6.508      6994
HOCT       |D|   222    4.067   6078      6.508      6439
Table 3. The comparison of compression strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D_i| is the average number of bits needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

Strategy   i      |Σ_i|   |m_i|   H_i     |m_i|_H   |D_i|    |m_f|
Re-Pair    1146   965     2832    9.283   3286      10.240   4753
MinEnt     1129   944     2798    9.395   3286      10.281   4737
Table 4. The comparison of strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

File Name      |m_0|      |Σ_0|   H_0     i        |Σ_i|    |m_i|     H_i      |m_i|_H     |D|/|Σ_i|   |m_f|
Calgary corpus
bib            111,261    81      5.257   5469     4216     15,159    11.410   21,621      11.589      29,544
book1          768,771    82      4.528   23,587   22,649   128,059   13.422   214,859     11.904      249,957
book2          610,856    96      4.681   21,147   18,501   82,446    13.213   136,165     12.829      170,079
news           377,109    98      5.226   20,079   13,602   55,500    12.809   88,863      12.761      120,892
obj1           21,504     256     5.929   1650     1475     6464      9.888    7990        10.918      10,242
obj2           246,814    256     6.280   14,635   9569     35,540    12.323   54,743      13.044      78,607
paper1         53,161     95      4.967   3559     2678     8800      10.890   11,979      11.360      17,033
paper2         82,199     91      4.506   4297     3753     14,102    11.235   19,805      11.181      25,811
paper3         46,526     84      4.588   2989     2575     9061      10.767   12,195      10.791      16,227
paper4         13,286     80      4.602   1194     997      3136      9.622    3772        10.130      5284
paper6         38,105     93      5.000   2834     2108     6670      10.585   8826        11.220      12,801
progc          39,611     92      5.282   2854     2066     6526      10.641   8681        11.254      12,696
progl          71,646     87      4.830   4162     2577     7216      10.851   9788        12.003      16,033
progp          49,379     89      4.823   3147     1684     4528      10.272   5814        11.952      10,516
trans          93,695     99      5.545   5918     2505     6513      10.968   8929        12.419      18,116
Canterbury corpus
alice29.txt    152,089    74      4.435   6733     6068     25,077    11.985   37,568      11.482      47,232
asyoulik.txt   125,179    68      4.889   5799     5293     23,532    11.774   34,634      10.932      42,559
bible.txt      4,047,392  63      4.260   81,229   71,256   386,094   15.017   724,728     14.525      872,215
cp.html        24,603     86      5.107   1785     1271     4242      9.590    5085        10.689      7470
E.coli         4,638,690  4       2.000   67,368   62,924   652,664   13.725   1,119,687   7.462       1,182,530
fields.c       11,150     90      4.924   927      658      1503      9.304    1748        10.822      3002
kennedy.xls    1,029,744  256     3.584   2446     2545     160,177   9.788    195,978     8.274       198,508
lcet10.txt     426,754    84      4.627   14,515   12,395   55,691    12.759   88,823      12.426      111,369
ptt5           513,216    159     1.049   5995     5697     30,463    11.424   43,503      11.178      51,880
random.txt     100,000    64      6.000   5065     5126     54,182    11.182   75,731      3.983       78,253
sum            38,240     255     5.447   3116     1749     6251      10.290   8041        11.912      12,681
world192.txt   2,473,400  94      5.024   55,473   47,150   212,647   14.552   386,808     13.973      483,705
xargs.1        4227       74      4.863   468      384      990       8.255    1022        9.811       1596
Table 5. The comparison of the strategies using different criteria: i is the number of iterations; |Σ_i| is the size of the final alphabet; |m_i| is the resulting size of the file measured as the number of symbols; H_i is the final entropy measured in bits; |m_i|_H is the final entropic size measured in bytes; |D|/|Σ_i| is the average number of bytes needed to store one phrase in the dictionary; and |m_f| is the final size of the file given as the sum of the entropic size of the message and the size of the dictionary, measured in bytes.

File Name      |m_0|      |Σ_0|   H_0     i        |Σ_i|    |m_i|     H_i      |m_i|_H     |D|/|Σ_i|   |m_f|
Calgary corpus
bib            111,261    81      5.201   5513     4150     15,103    11.307   21,346      11.717      29,421
book1          768,771    82      4.527   23,843   22,616   127,777   13.377   213,656     12.134      249,822
book2          610,856    96      4.793   20,852   17,997   80,814    13.170   133,045     12.847      166,533
news           377,109    98      5.190   20,118   13,388   55,347    12.697   87,845      12.918      120,333
obj1           21,504     256     5.948   1638     1418     6459      9.727    7853        11.135      10,133
obj2           246,814    256     6.260   14,673   9337     35,510    12.173   54,031      13.218      78,275
paper1         53,161     95      4.983   3579     2633     8726      10.741   11,716      11.413      16,822
paper2         82,199     91      4.601   4247     3612     13,797    11.088   19,123      11.221      25,080
paper3         46,526     84      4.665   3004     2529     8993      10.676   12,002      10.892      16,092
paper4         13,286     80      4.700   1136     930      3133      9.232    3615        10.288      5076
paper6         38,105     93      5.010   2841     2080     6662      10.398   8659        11.297      12,671
progc          39,611     92      5.199   2871     2034     6530      10.444   8525        11.307      12,583
progl          71,646     87      4.770   4175     2495     7134      10.742   9579        12.100      15,894
progp          49,379     89      4.869   3145     1631     4509      10.166   5730        12.194      10,524
trans          93,695     99      5.533   5916     2425     6515      10.736   8743        12.713      18,145
Canterbury corpus
alice29.txt    152,089    74      4.568   6649     5903     24,825    11.767   36,516      11.573      46,135
asyoulik.txt   125,179    68      4.808   5804     5220     23,359    11.563   33,764      11.018      41,758
bible.txt      4,047,392  63      4.343   77,117   66,593   386,092   14.649   706,991     14.543      847,187
cp.html        24,603     86      5.229   1748     1212     4313      9.462    5101        10.897      7482
E.coli         4,638,690  4       2.000   66,995   62,463   652,663   13.717   1,119,067   7.667       1,183,281
fields.c       11,150     90      5.008   868      587      1606      8.724    1751        11.013      2946
kennedy.xls    1,029,744  256     3.573   2612     2511     159,999   10.012   200,240     8.575       203,040
lcet10.txt     426,754    84      4.669   14,506   12,178   54,937    12.661   86,941      12.462      109,539
ptt5           513,216    159     1.210   23,203   6314     94,463    4.566    53,918      12.308      89,618
random.txt     100,000    64      5.999   5145     5209     54,011    11.235   75,854      4.075       78,475
sum            38,240     255     5.329   3130     1683     6245      10.034   7833        12.184      12,600
world192.txt   2,473,400  94      4.998   54,946   45,920   212,499   14.340   380,896     14.078      477,588
xargs.1        4227       74      4.898   342      326      1235      7.755    1197        9.988       1624
