Abstract
This study presents an analysis of RePair, a grammar compression algorithm known for its simple scheme and practical effectiveness. First, we show that the main process of RePair, that is, the step-by-step substitution of the most frequent symbol pairs, works within the corresponding most frequent maximal repeats. Then, we reveal the relation between maximal repeats and the grammars constructed by RePair. On the basis of this analysis, we further propose a novel variant of RePair, called MR-RePair, which substitutes the most frequent maximal repeats at once instead of substituting the most frequent pairs consecutively. The results of experiments comparing the sizes of the constructed grammars and the execution times of RePair and MR-RePair on several text corpora demonstrate that MR-RePair constructs more compact grammars than RePair, especially for highly repetitive texts.
1. Introduction
Grammar compression is a lossless data compression method. For a given text, grammar compression constructs a small context-free grammar that derives only the given text. While the problem of constructing the smallest such grammar for a given text is known to be NP-hard [], several approximation algorithms have been proposed. One of them is RePair [], an off-line grammar compression algorithm. Despite its simple scheme, RePair is known for its high compression performance in practice [,,], and hence, it has been studied comprehensively. Examples of studies on the RePair algorithm include its extension to an online algorithm [], practical improvements of its working time and space [,], applications to various fields [,,], and theoretical analyses of the sizes of the generated grammars [,,].
In the field of text processing, the repetitiveness of a text is considered an important property. Furthermore, it has been suggested that the extent of the repetitiveness of a given text directly relates to the number of maximal repeats in the text. Belazzougui et al. [] demonstrated theoretical relations between the number of extensions of maximal repeats and other well-known properties of a text, such as the number of factors in the Lempel–Ziv parsing and the number of runs in the Burrows–Wheeler transform. In addition, several text indexing data structures whose sizes are bounded by the number of extensions of maximal repeats have been proposed in the literature [,,].
In this study, we analyzed the properties of RePair, focusing on its relationship to maximal repeats. Although RePair has been extensively studied, to the best of our knowledge, no previous study has associated RePair with maximal repeats. Furthermore, we propose MR-RePair, a novel grammar compression algorithm based on the properties of maximal repeats. Several off-line grammar compression techniques based on the properties of (non-maximal) repeats have been proposed previously [,,]. Recently, Gańczorz and Jeż introduced a heuristic method that improves the practical compression ratio of RePair in terms of the grammar size []. However, none of the previously proposed methods use the properties of maximal repeats. In this study, we also show a theoretical guarantee on the size of the constructed grammars: under a specific condition, the size of the grammar constructed by MR-RePair is smaller than or equal to that constructed by RePair. Our experiments show that MR-RePair constructs smaller grammars than RePair. We emphasize that generating a grammar of small size is of great importance, since most, if not all, existing algorithms and data structures that work on grammar-compressed texts have running times dependent on the grammar size (see, e.g., [,,,,,,,] and the references therein) and not directly on the encoded size.
Contributions: The primary contributions of this study are as follows.
- We show interesting relations between maximal repeats and grammars constructed by RePair.
- We propose MR-RePair, which is a novel variant of RePair based on replacing the most frequent maximal repeats.
- We implement MR-RePair and experimentally demonstrate that MR-RePair produces smaller grammars than all tested implementations of RePair. For a highly repetitive text used in the experiments, MR-RePair decreased the size of the constructed grammar to about 55% of that of RePair.
The rest of this paper is organized as follows. In Section 2, we introduce basic notations and review maximal repeats, grammar compression, and RePair. In Section 3, we present an analysis of the properties of RePair and demonstrate its relationship to maximal repeats. The definition and implementation of MR-RePair and its comparison with RePair are provided in Section 4. In Section 5, we report the experimental results of comparing RePair and MR-RePair. Finally, in Section 6, we conclude the paper.
A preliminary version of this work was previously and partially presented at the Data Compression Conference (DCC 2019) []. Compared to the preliminary version, the additional and extended parts of this paper are as follows.
- We discuss how the sizes of the generated grammars differ depending on the implementation of RePair, and prove a lower bound on the maximum difference of the sizes (Definition 1 and Theorem 3).
- We describe Naïve-MR-RePair, a naïve version of our MR-RePair. Furthermore, we prove that there is a case where the grammar size of Naïve-MR-RePair becomes larger than that of RePair on the order of the logarithm of the length of the input string (Theorem 4).
- We performed our experiment again following the advice of a DCC reviewer.
- We provide the proofs of all lemmas, including those that were omitted in the preliminary version.
2. Preliminaries
In this section, we provide some notations and definitions used in the following sections. In addition, we recall grammar compression and review the RePair algorithm.
2.1. Basic Notations and Terms
Let Σ be an alphabet, that is, an ordered finite set of symbols. An element T of Σ* is called a string or a text, and |T| denotes its length. Let ε be the empty string, that is, the string of length 0 with |ε| = 0. Let T = T[1..n] ∈ Σ* be any text of length n. A string s is a substring of T if s = T[i..j] for some 1 ≤ i ≤ j ≤ n, where T[i..j] denotes the substring of T beginning at position i and ending at position j, and T[i] denotes the ith symbol of T. For a finite set S of texts, a text T is said to be a superstring of S if T contains every text of S as a substring. The frequency of s is the number of occurrences of s in the text as a substring. If there exists an isomorphism between two alphabets that maps one text onto the other symbol by symbol, the two texts are said to be isomorphic.
2.2. Maximal Repeats
Let s be a substring of a text T. If the frequency of s is greater than 1, s is called a repeat. A left (or right) extension of s is any substring of T of the form xs (or sx), where x ∈ Σ. We say that s is left (or right) maximal if every left (or right) extension of s occurs strictly fewer times in T than s does. Accordingly, s is a maximal repeat of T if s is both left and right maximal. In this paper, we only consider strings of length greater than 1 as maximal repeats. For example, the substring abra of abracadabra is a maximal repeat, whereas br is not.
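To make the definition concrete, the following is a minimal Python sketch (illustrative only, not part of any RePair implementation) that enumerates maximal repeats by brute force; the function name and its quadratic-time approach are ours.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def maximal_repeats(t: Sequence[str]) -> List[Tuple[Tuple[str, ...], int]]:
    """Brute-force enumeration of the maximal repeats of length >= 2 in t.

    t may be a plain string or a list of symbols/variables. A repeat s
    (frequency > 1) is reported if every left extension and every right
    extension of s occurs strictly fewer times than s does.
    Quadratic time and space; for illustration only.
    """
    n = len(t)
    freqs = Counter()
    for length in range(2, n):
        freqs.update(tuple(t[i:i + length]) for i in range(n - length + 1))
    result = []
    for s, f in freqs.items():
        if f < 2:
            continue
        k = len(s)
        occ = [i for i in range(n - k + 1) if tuple(t[i:i + k]) == s]
        # left maximal: the occurrences are not all preceded by one common symbol
        left_max = any(i == 0 for i in occ) or len({t[i - 1] for i in occ}) > 1
        # right maximal: the occurrences are not all followed by one common symbol
        right_max = (any(i + k == n for i in occ)
                     or len({t[i + k] for i in occ if i + k < n}) > 1)
        if left_max and right_max:
            result.append((s, f))
    return result

# Example from the text: abra is a maximal repeat of abracadabra, whereas br is not.
print(maximal_repeats("abracadabra"))  # [(('a', 'b', 'r', 'a'), 2)]
```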
2.3. Grammar Compression
A context-free grammar (CFG, or simply grammar) G is defined as a four-tuple G = (V, Σ, R, S), where V denotes an ordered finite set of variables, Σ denotes an ordered finite alphabet, R denotes a finite set of binary relations between V and (V ∪ Σ)*, called production rules (or rules), and S ∈ V denotes a special variable called the start variable. A production rule describes the situation where a variable is substituted, and is written in the form v → w, with v ∈ V and w ∈ (V ∪ Σ)*. For strings x and y over V ∪ Σ, we write x ⇒ y if there are strings u1, u2 and a rule v → w ∈ R such that x = u1 v u2 and y = u1 w u2, and we denote the reflexive transitive closure of ⇒ as ⇒*. A string w is said to be derived from v if v ⇒* w. We define a grammar G′ = (V′, Σ, R′, S′) as a subgrammar of G if V′ ⊆ V, R′ ⊆ R, and S′ ∈ V′.
Given a text T, grammar compression is a method for lossless text data compression that constructs a restricted CFG that uniquely derives the text T. For G to be deterministic, the production rule for each variable must be unique. In what follows, we assume that every grammar is deterministic and that every production rule has the form v → expr, where expr is either a single symbol a ∈ Σ or a sequence of variables v1 v2 ⋯ vk with vi ∈ V for all 1 ≤ i ≤ k.
To estimate the effectiveness of compression, we use the size of the constructed grammar, which is defined as the total length of the right-hand sides of all production rules of the grammar.
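As a small illustration (ours, not the paper's), a grammar can be stored as a mapping from variables to right-hand sides, and the grammar size is then the total length of those right-hand sides; the rules rewriting individual terminals into variables (Step 1 of RePair below) are omitted here.

```python
from typing import Dict, List

def grammar_size(rules: Dict[str, List[str]]) -> int:
    """Size of a grammar = total length of the right-hand sides of all rules."""
    return sum(len(rhs) for rhs in rules.values())

# Hypothetical rule set deriving "abracadabra" (variable names are illustrative):
rules = {
    "X1": ["a", "b"],                   # X1 -> ab
    "X2": ["X1", "r"],                  # X2 -> X1 r
    "X3": ["X2", "a"],                  # X3 -> X2 a   (X3 derives "abra")
    "S":  ["X3", "c", "a", "d", "X3"],  # start rule
}
print(grammar_size(rules))  # 2 + 2 + 2 + 5 = 11
```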
2.4. RePair
RePair is a grammar compression algorithm proposed by Larsson and Moffat []. For an input text T, let G = (V, Σ, R, S) be the grammar constructed by RePair. The RePair procedure can then be described by the following steps (a simplified code sketch is given after the list):
- Step 1. Replace each symbol a ∈ Σ with a new variable v and add v → a to R.
- Step 2. Find the most frequent pair p in T.
- Step 3. Replace every occurrence (or as many occurrences as possible, when p is a pair consisting of the same symbol) of p with a new variable v, and then add v → p to R.
- Step 4. Re-evaluate the frequencies of pairs for the updated text generated in Step 3. If the maximum frequency is 1, add the rule S → t for the current text t to R and terminate. Otherwise, return to Step 2.
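The following Python sketch of these steps is written for clarity rather than for the linear expected time of Lemma 1 (the rules of Step 1 are left implicit). The tie-breaking in the `max` call is an arbitrary choice of this sketch; as discussed in Section 3.2, exactly this kind of implementation detail determines the resulting MR-order.

```python
from collections import Counter
from itertools import count

def repair(text: str):
    """Simplified RePair: repeatedly replace a most frequent pair.

    Returns (rules, sequence): `rules` maps each new variable to the pair it
    replaces, and `sequence` is the right-hand side of the start rule.
    Quadratic-time sketch; frequencies are naively recounted in every round.
    """
    seq = list(text)
    rules = {}
    fresh = (f"X{i}" for i in count(1))
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = max(pairs.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        v = next(fresh)
        rules[v] = [a, b]
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(v)   # replace this occurrence; overlapping ones are skipped
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return rules, seq

rules, start = repair("abracadabra")
print(rules)  # {'X1': ['a', 'b'], 'X2': ['X1', 'r'], 'X3': ['X2', 'a']}
print(start)  # ['X3', 'c', 'a', 'd', 'X3']
```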
Figure 1 illustrates an example of the grammar generation process of RePair.
Figure 1.
An example of the grammar generation process of RePair for the text abracadabra. The generated grammar has a size of 16.
Lemma 1
([]). RePair works in linear expected time and in a number of words of working space bounded in terms of n, k, and k′, where n is the length of the source text, k denotes the cardinality of the source alphabet, and k′ denotes the cardinality of the final dictionary.
3. Analysis of RePair
This section presents an analysis of RePair with respect to its relationship to maximal repeats and introduces an important concept, called MR-order.
3.1. RePair and Maximal Repeats
The following theorem describes an essential property of RePair, that is, RePair recursively replaces the most frequent maximal repeats.
Theorem 1.
Let T be a given text and assume that no most frequent maximal repeat of T overlaps with itself. Let f be the frequency of the most frequent pairs of T, and let t be a text obtained after all pairs with frequency f in T have been replaced by variables. Then, there is a text s such that s is obtained after all maximal repeats with frequency f in T are replaced by variables, and s and t are isomorphic to each other.
We need two lemmas and a corollary to prove Theorem 1. The following lemma shows a fundamental relation between the most frequent maximal repeats and the most frequent pairs in a text.
Lemma 2.
A pair p of variables is most frequent in a text T if and only if p occurs once in exactly one of the most frequent maximal repeats of T.
Proof.
(⇒) Let r be a most frequent maximal repeat containing p as a substring. It is clear that p can occur only once in r, since otherwise the frequency of p would be greater than that of r, implying the existence of a maximal repeat that is more frequent than r and contradicting the assumption that r is a most frequent maximal repeat. Suppose that there exists a different most frequent maximal repeat r′ containing p as a substring. Similarly, p occurs only once in r′. Furthermore, since r and r′ can be obtained from p by left and right extensions, r and r′ have the same frequency as p, and any occurrence of p is contained in an occurrence of both r and r′. Since r′ cannot be a substring of r, there exists a string w that is a superstring of both r and r′ and has the same frequency as p. However, this contradicts that r and r′ are maximal repeats.
(⇐) Let r be the most frequent maximal repeat such that p occurs once in it. Since p occurs in r, the frequency of p is at least that of r. If p is not a most frequent pair in T, there exists a pair q in T whose frequency is greater than that of p. However, this implies that there is a maximal repeat whose frequency is greater than that of r, contradicting that r is most frequent. □
The following corollary is derived directly from Lemma 2.
Corollary 1.
For a given text, the frequency of the most frequent pairs and that of the most frequent maximal repeats are the same.
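Corollary 1 can be checked on small examples with the illustrative maximal_repeats() sketch from Section 2.2 (a consistency check, not a proof):

```python
from collections import Counter

def check_corollary1(t: str) -> bool:
    """True if the most frequent pair and the most frequent maximal repeat
    of t have the same frequency (reuses maximal_repeats() from Section 2.2)."""
    max_pair_freq = max(Counter(zip(t, t[1:])).values())
    max_mr_freq = max(f for _, f in maximal_repeats(t))
    return max_pair_freq == max_mr_freq

print(check_corollary1("abracadabra"))  # True: both maxima are 2
print(check_corollary1("abcdeabccde"))  # True: both maxima are 2
```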
The following lemma shows an important property of the most frequent maximal repeats.
Lemma 3.
The length of the overlap between any two occurrences of most frequent maximal repeats is at most 1.
Proof.
Let r1 = xw and r2 = wy be most frequent maximal repeats that have an overlapping occurrence with overlap w, where x and y are non-empty strings. If we assume that |w| ≥ 2, then, since r1 and r2 are most frequent maximal repeats, xw, wy, and w all have the same frequency, i.e., every occurrence of w is preceded by x and followed by y. This implies that xwy also has the same frequency as w, which contradicts that r1 and r2 are maximal repeats. □
Theorem 1 can now be proved based on the above lemmas and corollary.
Proof of Theorem 1.
According to Corollary 1, the frequency of the most frequent maximal repeats in T is f. Let p be one of the most frequent pairs in T. According to Lemma 2, there is a unique maximal repeat that is most frequent and contains p once. We denote such maximal repeat as r. Let us assume that there is a substring in T, where , , and . We denote and as and , respectively. There are the following two cases to consider:
(i)and. If , the replacement of p directly corresponds to the replacement of the most frequent maximal repeat, since . If , after p is replaced with a variable v, r is changed to . This occurs f times in the updated text, and according to Lemma 2, the frequency of every pair occurring in is still f. Because the maximum frequency of pairs does not increase, f is still the maximum frequency. Therefore, we replace all pairs contained in in the following steps, whereas and are not replaced. This holds for every occurrence of p, implying that replacing the most frequent pairs while the maximum frequency does not change, corresponds to replacing all pairs (old and new) contained in the most frequent maximal repeats of the same frequency until they are replaced by a single variable. Then, s can be generated by replacing r.
(ii)or. We consider the case where . Note that according to the assumption that is a maximal repeat. Suppose RePair replaces by a variable v before p is replaced. Note that according to Lemma 2, there is a maximal repeat occurring f times and including once (we denote the maximal repeat as ), and by assumption. According to Lemma 3, the length of the overlap of r and is at most 1, and then, only is a symbol present in both r and . After that, is no longer the most frequent maximal repeat because some of its occurrences are changed to . However, still occurs f times in the updated text. Since and , and is a maximal repeat. Then, will become a variable in subsequent steps, similarly to (i). Here, would also become a variable. Thus, we can generate s by first replacing and then replacing . Similarly, this holds for when and . □
3.2. MR-Order
According to Theorem 1, if there is just one most frequent maximal repeat in the current text, then RePair replaces all of its occurrences step by step. However, a problem arises if there are two or more most frequent maximal repeats and some of them overlap. In this case, the selection order of pairs (all of them most frequent, of course) determines the priority of maximal repeats. We call this order of selecting (summarizing) maximal repeats the maximal repeat selection order (or simply the MR-order). Note that the selection order of pairs depends on the implementation of RePair.
For instance, consider the text abcdeabccde, where abc and cde are the most frequent maximal repeats, each occurring twice. There are two MR-orders, depending on which of the two maximal repeats abc or cde is given priority. The results of the replacement using RePair under these MR-orders are (i) xyxcy with variables x and y such that x derives abc and y derives de, and (ii) wzwcz with variables z and w such that z derives cde and w derives ab. More precisely, there are 12 possible ways in which RePair can compress the text, with the following generated rule sets:
- ,
- ,
- ,
- ,
- ,
- ,
- ,
- ,
- ,
- ,
- ,
- .
Here, rule sets 1–6 have the same MR-order because abc takes priority over cde in all of them. Likewise, rule sets 7–12 have the same MR-order for the same reason: cde takes priority over abc.
If there are several distinct most frequent pairs with overlaps, RePair constructs grammars with different sizes according to the selection order of the pairs. For example, consider the text bcxdabcyabzdabvbcuda. There are three most frequent pairs, namely, ab, bc, and da, occurring three times each. If RePair takes ab first, the rule set of the generated grammar may become {A → ab, B → bc, C → dA, S → BxCcyAzCvBuda} and its size is 19. If RePair takes bc first, the rule set of the generated grammar may become {A → bc, B → da, S → AxBAyabzBbvAuB} and its size is 18.
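These two outcomes can be verified mechanically. The short check below uses rule sets consistent with the two replacement orders above (the variable names are ours); it re-expands each rule set into the original text and sums the right-hand-side lengths as defined in Section 2.3.

```python
def expand(rules, rhs):
    """Recursively expand a right-hand side back into plain text."""
    return "".join(expand(rules, rules[s]) if s in rules else s for s in rhs)

# Upper-case letters are variables; lower-case letters are terminals.
g_ab_first = {"A": list("ab"), "B": list("bc"), "C": list("dA"),
              "S": list("BxCcyAzCvBuda")}
g_bc_first = {"A": list("bc"), "B": list("da"),
              "S": list("AxBAyabzBbvAuB")}
for g in (g_ab_first, g_bc_first):
    print(expand(g, g["S"]), sum(len(rhs) for rhs in g.values()))
# bcxdabcyabzdabvbcuda 19
# bcxdabcyabzdabvbcuda 18
```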
Remark 1.
If there are several distinct pairs with the same maximum frequency, the size of the grammar generated by RePair depends on their replacement order.
However, the following theorem states that the MR-order rather than the replacement order of pairs determines the size of the grammar generated by RePair.
Theorem 2.
The sizes of grammars generated by RePair are the same if they are generated in the same MR-order.
Proof.
Let T be a variable sequence appearing in the grammar generation process of RePair and f be the maximum frequency of pairs in T. Suppose that is a variable sequence generated after RePair replaces every pair occurring f times. According to Theorem 1, all generated are isomorphic to one another, then the length of all of them is the same, regardless of the replacement order of pairs. Let be the most frequent maximal repeats of T with preceding all other maximal repeats in this MR-order. As a result, is converted into a variable, and according to Lemma 2, all pairs included in are distinct. Then, the size of the subgrammar which exactly derives is . This holds for the next prioritized maximal repeat (we denote it as ) with the following slight difference: the pattern actually replaced would be a substring of excluding its beginning or end if there are occurrences of overlap with . However, these strings are common in the same MR-order. Then, the sizes of generated subgrammars are the same, regardless of the order of selecting pairs. Similarly, this holds for all most frequent maximal repeats and every maximum frequency of pairs through the entire process of RePair. □
3.3. Greatest Size Difference of RePair
We consider the problem of determining the greatest size difference between possible outcomes of RePair.
Definition 1
(Greatest size difference). Let g and g′ be the sizes of any two possible grammars that can be generated by RePair for a given text. Then, the greatest size difference of RePair (GSDRP) is the maximum value of |g − g′| over all such pairs of grammars.
A lower bound of the GSDRP can be established according to the following theorem.
Theorem 3.
Given a text with a length of n, a lower bound of GSDRP is .
Proof.
Let B, L, and R be strings such that
where denote distinct symbols, and each occurrence of ♢ denotes a distinct symbol. Consider text . Here, , , ⋯, , , ⋯, are the most frequent maximal repeats with a frequency f in T. Let G and be grammars generated by RePair for T in different MR-order, such that (i) precedes all other maximal repeats and (ii) follows all other maximal repeats, respectively. We denote the sizes of G and as g and , respectively.
First, we consider G and how RePair generates it. The first rule generated by the replacement is considering the MR-order. After the replacement, L and R remain unchanged, whereas B becomes the following text:
Each pair in occurs only once in the entire text . This means that can never be shortened from the current length of . In the remaining steps, and (for ) are replaced. L and R are changed to texts with a length of each. Hence, the following holds:
Next, we consider and how RePair generates it. According to their MR-order, , ⋯, , , ⋯, are replaced before is selected. They do not overlap with each other, and after they are replaced, does not occur in the generated text. Therefore, there are rules in deriving and (for ), whereas the rule deriving is absent. L and R are changed to texts with a length of each, and B is changed to a text with a length . Hence, the following holds:
4. MR-RePair
The main strategy of the proposed method is to recursively replace the most frequent maximal repeats instead of the most frequent pairs.
In this section, we first explain a naïve version of our method, called Naïve-MR-RePair. Although it can perform poorly in certain cases, it is simple and helpful for understanding our main result. Then, we describe the proposed MR-RePair.
4.1. Naïve-MR-RePair
Definition 2
(Naïve-MR-RePair). For an input text T, let G = (V, Σ, R, S) be the grammar generated by Naïve-MR-RePair. Naïve-MR-RePair constructs G through the following steps:
- Step 1. Replace each symbol a ∈ Σ with a new variable v and add v → a to R.
- Step 2. Find the most frequent maximal repeat r in T.
- Step 3. Replace every occurrence (or as many occurrences as possible, when there are overlaps) of r in T with a new variable v and then add v → r to R.
- Step 4. Re-evaluate the frequencies of maximal repeats for the updated text generated in Step 3. If the maximum frequency is 1, add the rule S → t for the current text t to R and terminate. Otherwise, return to Step 2.
We can easily extend the concept of the MR-order to Naïve-MR-RePair.
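A minimal Python sketch of Naïve-MR-RePair, reusing the illustrative maximal_repeats() helper from Section 2.2 (the rules of Step 1 are again left implicit, and the tie-breaking among equally frequent maximal repeats is an arbitrary choice of this sketch):

```python
from itertools import count

def naive_mr_repair(text: str):
    """Naive-MR-RePair sketch: replace a most frequent maximal repeat at once.

    Occurrences are replaced greedily from left to right, so an occurrence
    that overlaps an already replaced one is skipped (Step 3).
    """
    seq = list(text)
    rules = {}
    fresh = (f"N{i}" for i in count(1))
    while True:
        mrs = maximal_repeats(seq)
        if not mrs:
            break
        r, _ = max(mrs, key=lambda sf: (sf[1], len(sf[0])))  # most frequent, then longest
        v = next(fresh)
        rules[v] = list(r)
        out, i = [], 0
        while i < len(seq):
            if tuple(seq[i:i + len(r)]) == r:
                out.append(v)
                i += len(r)      # any occurrence overlapping this one is skipped
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return rules, seq

rules, start = naive_mr_repair("abracadabra")
print(rules)  # {'N1': ['a', 'b', 'r', 'a']}
print(start)  # ['N1', 'c', 'a', 'd', 'N1']
```

Counting also the five rules of Step 1, the grammar obtained for abracadabra has size 5 + 4 + 5 = 14.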
Figure 2 illustrates an example of the grammar generation process of Naïve-MR-RePair. Figure 1 and Figure 2 explain why the strategy of using maximal repeats is more effective than that of using pairs. When compressing the text abracadabra, both RePair and Naïve-MR-RePair generate subgrammars deriving the most frequent maximal repeat abra. The rule set of the subgrammar generated by RePair is of the form {X1 → ab, X2 → X1 r, X3 → X2 a}, with a size of 6, whereas the rule set of the subgrammar generated by Naïve-MR-RePair is of the form {X → abra}, with a size of 4.
Figure 2.
An example of the grammar generation process of Naïve-MR-RePair for the text abracadabra. The generated grammar has a size of 14.
However, the following theorem indicates that the size of the grammar generated by Naïve-MR-RePair is larger than that generated by RePair in certain cases, even when they work in the same MR-order. Roughly speaking, this is caused by overlaps of maximal repeats. When an occurrence of the most frequent maximal repeat overlaps with another of its occurrences, a slight difference arises between the grammar construction processes of RePair and Naïve-MR-RePair from the viewpoint of maximal repeats; that is, the targeted maximal repeats differ between RePair and Naïve-MR-RePair (RePair replaces the targeted maximal repeat step by step, whereas Naïve-MR-RePair replaces it at once). Indeed, if maximal repeats are carefully embedded in a text so as to increase this difference, the case presented in the following theorem occurs.
Theorem 4.
Given a text T with a length of n, let and be the sizes of the grammars generated by RePair and Naïve-MR-RePair for T, respectively, assuming that RePair and Naïve-MR-RePair work in the same MR-order. Then, there is a case when holds. (We show a concrete example of this theorem in Appendix A.)
Proof.
Let and be the grammars generated by RePair and Naïve-MR-RePair, respectively. Let be the text generated just after Step 1 of RePair or Naïve-MR-RePair (the Step 1 is common in both algorithms), that is, such that and (for ), and (or ) be a subgrammar of (or ) deriving . Let , where , such that is the most frequent maximal repeat of , and . Note that . Here and are defined as follows:
- :
- Assume that for and for , then consists of
- rules with ,
- one rule and rules for , and
- one rule .
- :
- Assume that and for , then consists of
- one rule , and
- rules for and .
Let and be the sizes of and , respectively. Then, the following holds:
Here, with regard to the length of , we have
From these, holds. Hence, according to Equations (4) and (5), the following holds:
Therefore, holds for some , and the proposition holds. □
4.2. MR-RePair
As shown in Theorem 4, the grammar size of Naïve-MR-RePair can become larger than that of RePair because Naïve-MR-RePair cannot replace all occurrences of a most frequent maximal repeat if it overlaps with another occurrence of itself. In the remainder of this section, we describe MR-RePair, which is an improved version of Naïve-MR-RePair.
Definition 3
(MR-RePair). For an input text T, let G = (V, Σ, R, S) be the grammar generated by MR-RePair. MR-RePair constructs G through the following steps:
- Step 1. Replace each symbol a ∈ Σ with a new variable v and add v → a to R.
- Step 2. Find the most frequent maximal repeat r in T.
- Step 3. Check whether r[1] = r[|r|] and |r| ≥ 3, and if so, use r[1..|r| − 1] instead of r in Step 4.
- Step 4. Replace every occurrence of r with a new variable v and then add v → r to R.
- Step 5. Re-evaluate the frequencies of maximal repeats for the updated text generated in Step 4. If the maximum frequency is 1, add the rule S → t for the current text t to R and terminate. Otherwise, return to Step 2.
We can easily extend the concept of the MR-order to MR-RePair. It does not matter if Step 3 uses r[2..|r|] instead of r[1..|r| − 1]. MR-RePair can replace all occurrences of r even if it overlaps with itself in some occurrences since, according to Lemma 3, the length of the overlaps of the most frequent maximal repeats is at most 1. If r[1] = r[|r|] but r does not overlap with itself, then the pair consisting of the new variable v and the trimmed-off symbol becomes a most frequent maximal repeat after r[1..|r| − 1] is replaced by v, and it is replaced immediately. Similar to RePair, MR-RePair still cannot replace all of the occurrences if r is a pair consisting of the same symbol.
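A sketch of MR-RePair under the same assumptions as the previous sketches; the only change from the Naïve-MR-RePair sketch is the trimming of Step 3, coded here under the reading that the condition is r[1] = r[|r|] and |r| ≥ 3.

```python
from itertools import count

def mr_repair(text: str):
    """MR-RePair sketch, reusing maximal_repeats() from Section 2.2.

    Step 3: when the most frequent maximal repeat r starts and ends with the
    same symbol and |r| >= 3, its last symbol is trimmed off, so that
    occurrences overlapping by one symbol (Lemma 3) can all be replaced.
    """
    seq = list(text)
    rules = {}
    fresh = (f"M{i}" for i in count(1))
    while True:
        mrs = maximal_repeats(seq)
        if not mrs:
            break
        r, _ = max(mrs, key=lambda sf: (sf[1], len(sf[0])))
        if len(r) >= 3 and r[0] == r[-1]:
            r = r[:-1]                       # Step 3: trim the last symbol
        v = next(fresh)
        rules[v] = list(r)
        out, i = [], 0
        while i < len(seq):
            if tuple(seq[i:i + len(r)]) == r:
                out.append(v)
                i += len(r)
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return rules, seq

rules, start = mr_repair("abracadabra")
print(rules)  # {'M1': ['a', 'b', 'r'], 'M2': ['M1', 'a']}
print(start)  # ['M2', 'c', 'a', 'd', 'M2']
```

On abracadabra, abra is trimmed to abr, and the pair formed by the new variable and the trailing a is replaced in the next round; with the five rules of Step 1 counted, the resulting grammar has size 5 + 3 + 2 + 5 = 15.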
Figure 3 illustrates an example of the grammar generation process of MR-RePair. Although the size of the grammar generated by MR-RePair as shown in Figure 3 is larger than that generated by Naïve-MR-RePair as shown in Figure 2, it is still smaller than that generated by RePair as shown in Figure 1.
Figure 3.
An example of the grammar generation process of MR-RePair for the text abracadabra. The generated grammar has a size of 15.
Theorem 5.
Assume that RePair and MR-RePair work based on the same MR-order for a given text. Let g and g′ be the sizes of the grammars generated by RePair and MR-RePair, respectively. Then, g′ ≤ g holds.
Proof.
Assume that and are grammars generated by RePair and MR-RePair, respectively, for a given text T with a length of n. Let be the text generated just after Step 1 of RePair or Naïve-MR-RePair (the Step 1 is common in both algorithms), that is, such that and (for ).
Let be the maximum frequency of the maximal repeats in . According to Corollary 1, the maximum frequency of the pairs in is also . Let (or ) be a subgrammar of (or ) generated while RePair (or MR-RePair) replaces pairs (or maximal repeats) with the frequency , (or ) be the size of this subgrammar, and (or ) be the updated text after all pairs (or maximal repeats) with the frequency are replaced. Let be maximal repeats with frequency in assuming that they are prioritized in this order by the MR-order. Let (for ) be the length of the longest substring of such that there are variables that derive the substring in both and . Note that this substring is common to RePair and MR-RePair, and each is at least 2. Since RePair replaces such substring step by step and MR-RePair replaces it at once, the following holds:
From these,
Hence,
holds. According to Equation (6),
Hence,
holds. Therefore, according to Equations (8) and (9),
holds. The updated texts and are isomorphic for and . Let be the maximum frequency of the maximal repeats in (and ). Then, a similar analysis holds for and . Hence, holds similarly to Equation (10), and the updated texts and are isomorphic. Inductively, for every maximum frequency of maximal repeats , holds and the updated texts and are isomorphic. Let k be a natural number such that and , that is, k is the number of times that the maximum frequency decreases through the entire process of RePair and MR-RePair. Then,
hold. Recall that each symbol is replaced with a new variable in the first step both in RePair and in MR-RePair. is the size of the subgrammar consisting of the rules generated in the first step. Since every and , follows Equations (11) and (12), and thus, the proposition holds. holds when every length is 2. □
However, when the MR-orders of RePair and MR-RePair are different, the grammar generated by MR-RePair can be larger than that generated by RePair, as the following theorem indicates:
Theorem 6.
Unless the MR-orders of RePair and MR-RePair are the same, there is a case where the size of the grammar generated by MR-RePair becomes larger than that generated by RePair.
Proof.
We show a concrete example of the case stated in the proposition. Consider text . There are four most-frequent maximal repeats, , , , and . Let A and B be two different MR-orders such that A prioritizes the maximal repeats in order of , , , and B prioritizes the maximal repeats in order of , , , , respectively. MR-RePair working in A generates a grammar whose rules are , where S is the start variable. Meanwhile, RePair working in B generates a grammar whose rules are , where S is the start variable. The size of the grammar generated by MR-RePair working in A is 27, whereas the size of that generated by RePair working in B is 26. □
While Theorem 6 indicates that the grammar can be larger in MR-RePair than in RePair, in Section 5 we demonstrate that MR-RePair outperforms RePair in practice.
We can implement MR-RePair by extending the original implementation of RePair described in [], while retaining the same complexity.
Theorem 7.
Let G = (V, Σ, R, S) be the grammar generated by MR-RePair for a given text with a length of n. Then, MR-RePair works in O(n) expected time and in a number of words of space bounded in terms of n, k, and k′, where k and k′ denote the cardinalities of Σ and V, respectively.
Proof.
Compared to RePair, the additional operations performed by MR-RePair are (i) extending the selected pair to the left and right until it becomes a maximal repeat and (ii) checking and excluding either the beginning or the end of the obtained maximal repeat if they are the same. These additional operations can be realized using the same data structures as those employed in RePair. Hence, the space complexity of MR-RePair follows Lemma 1.
We can clearly execute operation (ii) in constant time. Hence, we consider how the time complexity is affected by operation (i). Let l be the length of the maximal repeat containing the focused pair, and let f be the frequency of the pair. Then, compared to RePair, MR-RePair requires additional time to check the left and right extensions for all occurrences of the focused pair. However, the replacement shortens the length of the entire text by at least f(l − 2). Therefore, MR-RePair works in O(n) expected time, according to the possible number of replacements throughout all steps of the algorithm. □
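As an illustration of operation (i), the following is one possible way (ours, not the paper's data-structure-based implementation) to grow a selected pair into a maximal repeat, assuming the start positions of all occurrences of the pair are at hand, as they are in typical RePair implementations:

```python
from typing import List, Sequence, Tuple

def extend_to_maximal_repeat(seq: Sequence[str],
                             occ: List[int]) -> Tuple[List[int], int]:
    """Grow a pair (given by the start positions of all its occurrences) into
    a maximal repeat: extend left while every occurrence is preceded by the
    same symbol, then right while every occurrence is followed by the same
    symbol. Returns the new start positions and the final length."""
    length = 2
    while all(i > 0 for i in occ) and len({seq[i - 1] for i in occ}) == 1:
        occ = [i - 1 for i in occ]
        length += 1
    while (all(i + length < len(seq) for i in occ)
           and len({seq[i + length] for i in occ}) == 1):
        length += 1
    return occ, length

seq = list("abracadabra")
occ, length = extend_to_maximal_repeat(seq, [0, 7])  # occurrences of the pair "ab"
print(seq[occ[0]:occ[0] + length])                   # ['a', 'b', 'r', 'a']
```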
Remark 2.
We can convert a grammar generated by RePair into that generated by MR-RePair by repeating the following transform: if a variable v appears only once on the right-hand sides of the other rules, the rule for v can be removed, and the single occurrence of v can be replaced with the right-hand side of the removed rule. However, the time and space complexity stated in Theorem 7 cannot be achieved in this manner, since additional operations and memory for searching and storing such variables are required.
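The transform described in the remark can be sketched as follows (illustrative only; the dictionary representation and the example rule set are ours):

```python
from typing import Dict, List

def inline_single_use_rules(rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Repeatedly inline every variable that occurs exactly once on the
    right-hand sides of the other rules, as described in Remark 2."""
    rules = {v: list(rhs) for v, rhs in rules.items()}
    changed = True
    while changed:
        changed = False
        for v in list(rules):
            uses = [(u, i) for u, rhs in rules.items() if u != v
                    for i, sym in enumerate(rhs) if sym == v]
            if len(uses) == 1:
                u, i = uses[0]
                rules[u][i:i + 1] = rules[v]  # splice in the right-hand side of v
                del rules[v]                  # drop the now-unused rule
                changed = True
    return rules

# A RePair-style rule set for abracadabra (terminal rules omitted, names ours):
repair_rules = {"X1": ["a", "b"], "X2": ["X1", "r"], "X3": ["X2", "a"],
                "S": ["X3", "c", "a", "d", "X3"]}
print(inline_single_use_rules(repair_rules))
# {'X3': ['a', 'b', 'r', 'a'], 'S': ['X3', 'c', 'a', 'd', 'X3']}
```

Each pass rescans all right-hand sides to find single-use variables, which is precisely the extra searching and bookkeeping the remark refers to.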
5. Experiments
We implemented MR-RePair and conducted comparative experiments. In particular, we compared the sizes of the constructed grammars and the execution times of the proposed MR-RePair, several existing RePair implementations, and Re-PairImp (https://bitbucket.org/IguanaBen/repairimproved), which was recently proposed in [] as an improvement of RePair.
As stated in Remark 1, the MR-order affects the size of a constructed grammar. In practice, the MR-order varies depending on the implementation of the priority queue that manages pairs. For this reason, we used four different implementations of RePair in the comparative analysis, implemented by Maruyama (https://code.google.com/archive/p/re-pair/), Navarro (https://www.dcc.uchile.cl/~gnavarro/software/index.html), Prezza (https://github.com/nicolaprezza/Re-Pair) [], and Wan (https://github.com/rwanwork/Re-Pair), the last of which we ran with level 0 (no heuristic option).
Table 1 lists the details of the texts used in the experiments. In particular, we employed three highly repetitive texts: a randomly generated text (rand77.txt), a Fibonacci string (fib41), and a German text (einstein.de.txt), the latter two selected from the Repetitive Corpus of the Pizza&Chili Corpus (http://pizzachili.dcc.uchile.cl/repcorpus.html). The randomly generated text, rand77.txt, consists of alphanumeric symbols and some special symbols. It was generated by concatenating 32 copies of a block that includes 1024 random strings of length 64 each, i.e., the size of the randomly generated text is 32 × 1024 × 64 = 2,097,152 bytes. In addition, we used three texts (E.coli, bible.txt, and world192.txt) selected from the Large Corpus (http://corpus.canterbury.ac.nz/descriptions/#large) to consider a real-data case. We executed each program seven times for each text and measured the elapsed CPU time only for the grammar generation process. We calculated the average time across five results, excluding the minimum and maximum values among the seven runs. The experiments were run on a computer equipped with an Intel(R) Core i7-8700 CPU (3.2–4.6 GHz, 6 cores) and 32 GB of RAM, running Ubuntu 16.04. All of the programs were compiled using gcc version 7.4 with the "-O3" option.
Table 1.
Text files used in our experiments.
Table 2 summarizes the experimental results. Unfortunately, Re-PairImp was unable to process fib41 in our experimental environment because of a lack of memory. Here, we excluded the rules generating a single terminal symbol from the number of rules, since they are the same for RePair and MR-RePair. As shown in the table, the sizes of the grammars constructed by the RePair implementations differ from each other for all texts except fib41. In all cases, MR-RePair is not inferior to RePair in terms of grammar size, even though Theorem 6 shows that the grammar generated by MR-RePair can be larger than that generated by RePair if their MR-orders differ. For rand77.txt, the number of rules and the size of the grammar for MR-RePair decreased to about 11% and 55% of those for RePair, respectively. Long maximal repeats occur more frequently in rand77.txt than in the other texts, and we consider this to be the main reason for the remarkable effectiveness of MR-RePair on this text.
Table 2.
Sizes of the generated grammars and execution times of the considered algorithms. Each cell in the table reports, from the top row, the number of generated rules, the total length of the right-hand sides of all rules except for the start variable, the length of the right-hand side of the start variable, and the total grammar size. The total grammar size presented in the fourth row is the sum of the values presented in the second and third rows. The fifth row, separated by a line, reports the execution time for compression in seconds. The best results are highlighted in bold.
For einstein.de.txt, the number of rules and the size of the grammar decreased to about 44% and 72% of those for RePair, respectively. By contrast, the effect of the improvement turned out to be limited for the texts from the Large Corpus, which are not highly repetitive. Note that fib41 does not contain any maximal repeats longer than 2 without overlaps; therefore, MR-RePair generated the same rules as RePair in this case. It should also be noted that MR-RePair runs at a speed comparable to the fastest implementation of RePair.
6. Conclusions
In this study, we analyzed the process of RePair and revealed that the RePair algorithm replaces the most frequent pairs step by step within their corresponding most frequent maximal repeats. On the basis of this analysis, we designed MR-RePair, a novel variant of RePair. Instead of consecutively substituting the most frequent pairs, MR-RePair performs one-time substitution of the most frequent maximal repeats. Furthermore, we implemented MR-RePair and compared the sizes of its constructed grammars to those of the grammars constructed by several implementations of RePair. Through the experiments, we confirmed the effectiveness of MR-RePair especially for highly repetitive texts.
We defined the greatest size difference between any two possible grammars that can be generated by RePair for a given text, naming it GSDRP, and demonstrated a lower bound on GSDRP for a given text of length n. We leave improving the lower bound and establishing an upper bound of GSDRP as future work.
We estimated the effectiveness of compression using the size of the generated grammars instead of the length of the encoded output. Reducing the grammar size has important implications, since the majority of the existing text algorithms applied to grammar-compressed texts, including grammar-based self-indexes [,], edit distance computation [], q-gram mining [,], and pattern matching [,,], have time/space complexities that depend on the input grammar size. For instance, the compressed indexes proposed by Claude and Navarro [,] can be built directly on MR-RePair grammar-compressed texts. Algorithms specifically designed for straight-line programs (SLPs), which are text compressions by grammars in Chomsky normal form, can also be easily modified to work on grammars that are not in Chomsky normal form, such as MR-RePair grammars. Hence, MR-RePair serves as a basis for practical improvements of these algorithms.
From the viewpoint of storing data more compactly, developing a method for encoding constructed grammars is another important issue. Implementing an efficient encoding method for MR-RePair is one of our future tasks.
Author Contributions
Conceptualization, I.F. and T.T.; methodology, I.F., T.T., Y.N., S.I., and H.B.; software, T.K.; writing—original draft preparation, I.F. and T.K.; writing—review and editing, I.F., T.T., Y.N., S.I., H.B., and T.K.; All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by JSPS KAKENHI Grant Numbers JP18K18002, JP17H01697, JP16H02783, JP18H04098, and JP18K11149, JST CREST Grant Number JPMJCR1402, and JST PRESTO Grant Number JPMJPR1922.
Acknowledgments
The authors would like to thank the people who provided the source codes.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Let , , and be the grammars generated by RePair, Naïve-MR-RePair, and MR-RePair, respectively. For a given text of length , let , , and be the sizes of , , and , respectively. Let us assume that , where such that is the most frequent maximal repeat of T and . Then, according to the proof of Theorem 4, holds for some m and w such that .
Figure A1 illustrates a specific example of the grammar generation process of RePair and for with and . The size is 18 in this example. Figure A2 illustrates an example of the process of Naïve-MR-RePair and for the same T. It can be noticed from the figures that the size is 19, and thus holds. As shown in Figure A2, Naïve-MR-RePair may fail to extract repetitive patterns in particular cases (such as abcd of in the running example). However, this problem can be solved using MR-RePair. Figure A3 illustrates an example of the process of MR-RePair and for the same . The size is 16, which is smaller than . Although the most frequent maximal repeat at the second replacement step is , MR-RePair replaces with a new variable , providing the additional Step 3 in Definition 3.
Figure A1.
Grammar generation process of RePair and its generated grammar for the text a. The grammar size is 18.
Figure A2.
Grammar generation process of Naïve-MR-RePair and its generated grammar for the text a. The grammar size is 19.
Figure A3.
Grammar generation process of MR-RePair and its generated grammar for the text a. The grammar size is 16.
References
- Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Sahai, A.; Shelat, A. The smallest grammar problem. IEEE Trans. Inf. Theory 2005, 51, 2554–2576. [Google Scholar] [CrossRef]
- Larsson, N.J.; Moffat, A. Off-line dictionary-based compression. Proc. IEEE 2000, 88, 1722–1732. [Google Scholar] [CrossRef]
- Claude, F.; Navarro, G. Fast and compact web graph representations. ACM Trans. Web 2010, 4, 16:1–16:31. [Google Scholar] [CrossRef]
- González, R.; Navarro, G. Compressed text indexes with fast locate. In Annual Symposium on Combinatorial Pattern Matching (CPM 2007); Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4580, pp. 216–227. [Google Scholar]
- Wan, R. Browsing and Searching Compressed Documents. Ph.D. Thesis, The University of Melbourne, Melbourne, Australia, 2003. [Google Scholar]
- Masaki, T.; Kida, T. Online Grammar Transformation Based on Re-Pair Algorithm. In Proceedings of the Data Compression Conference (2016 DCC), Snowbird, UT, USA, 29 March–1 April 2016; pp. 349–358. [Google Scholar]
- Bille, P.; Gørtz, I.L.; Prezza, N. Space-Efficient Re-Pair Compression. In Proceedings of the Data Compression Conference (DCC 2017), Snowbird, UT, USA, 4–7 April 2017; pp. 171–180. [Google Scholar]
- Sekine, K.; Sasakawa, H.; Yoshida, S.; Kida, T. Adaptive dictionary sharing method for Re-Pair algorithm. In Proceedings of the Data Compression Conference (DCC 2014), Snowbird, UT, USA, 26–28 March 2014; p. 425. [Google Scholar]
- Lohrey, M.; Maneth, S.; Mennicke, R. XML tree structure compression using RePair. Inf. Syst. 2013, 38, 1150–1167. [Google Scholar] [CrossRef]
- Tabei, Y.; Saigo, H.; Yamanishi, Y.; Puglisi, S.J. Scalable partial least squares regression on grammar-compressed data matrices. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), San Francisco, CA, USA, 13–17 August 2016; pp. 1875–1884. [Google Scholar]
- Navarro, G.; Russo, L.M. Re-pair Achieves High-Order Entropy. In Proceedings of the Data Compression Conference (DCC 2008), Snowbird, UT, USA, 25–27 March 2008; p. 537. [Google Scholar]
- Ochoa, C.; Navarro, G. RePair and All Irreducible Grammars are Upper Bounded by High-Order Empirical Entropy. IEEE Trans. Inf. Theory 2018, 1–5. [Google Scholar] [CrossRef]
- Belazzougui, D.; Cunial, F.; Gagie, T.; Prezza, N.; Raffinot, M. Composite Repetition-Aware Data Structures. In Annual Symposium on Combinatorial Pattern Matching (CPM 2015); Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; Volume 9133, pp. 26–39. [Google Scholar]
- Belazzougui, D.; Cunial, F. Fast label extraction in the CDAWG. In International Symposium on String Processing and Information Retrieval (SPIRE 2017); Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10508, pp. 161–175. [Google Scholar]
- Belazzougui, D.; Cunial, F. Representing the Suffix Tree with the CDAWG. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017); Leibniz International Proceedings in Informatics; Kärkkäinen, J., Radoszewski, J., Rytter, W., Eds.; Schloss Dagstuhl—Leibniz-Zentrum fuer Informatik: Wadern, Germany, 2017; Volume 78, pp. 7:1–7:13. [Google Scholar] [CrossRef]
- Takagi, T.; Goto, K.; Fujishige, Y.; Inenaga, S.; Arimura, H. Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression. In 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017); Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10508, pp. 304–316. [Google Scholar]
- Apostolico, A.; Lonardi, S. Off-line compression by greedy textual substitution. Proc. IEEE 2000, 88, 1733–1744. [Google Scholar] [CrossRef]
- Inenaga, S.; Funamoto, T.; Takeda, M.; Shinohara, A. Linear-time off-line text compression by longest-first substitution. In 10th International Symposium on String Processing and Information Retrieval (SPIRE 2003); Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2857, pp. 137–152. [Google Scholar]
- Nakamura, R.; Bannai, H.; Inenaga, S.; Takeda, M. Simple Linear-Time Off-Line Text Compression by Longest-First Substitution. In Proceedings of the Data Compression Conference (DCC 2007), Snowbird, UT, USA, 27–29 March 2007; pp. 123–132. [Google Scholar]
- Gańczorz, M.; Jeż, A. Improvements on Re-Pair Grammar Compressor. In Proceedings of the Data Compression Conference (DCC 2017), Snowbird, UT, USA, 4–7 April 2017; pp. 181–190. [Google Scholar]
- Claude, F.; Navarro, G. Self-Indexed Grammar-Based Compression. Fundam. Inform. 2011, 111, 313–337. [Google Scholar] [CrossRef]
- Claude, F.; Navarro, G. Improved Grammar-Based Compressed Indexes. In Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012), Cartagena de Indias, Colombia, 21–25 October 2012; pp. 180–192. [Google Scholar] [CrossRef]
- Gawrychowski, P. Faster Algorithm for Computing the Edit Distance between SLP-Compressed Strings. In Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012), Cartagena de Indias, Colombia, 21–25 October 2012; pp. 229–236. [Google Scholar]
- Goto, K.; Bannai, H.; Inenaga, S.; Takeda, M. Fast q-gram mining on SLP compressed strings. J. Discret. Algorithms 2013, 18, 89–99. [Google Scholar] [CrossRef][Green Version]
- Bille, P.; Cording, P.H.; Gørtz, I.L. Compact Q-gram Profiling of Compressed Strings. Theor. Comput. Sci. 2014, 550, 51–58. [Google Scholar] [CrossRef]
- Tomohiro, I.; Nishimoto, T.; Inenaga, S.; Bannai, H.; Takeda, M. Compressed automata for dictionary matching. Theor. Comput. Sci. 2015, 578, 30–41. [Google Scholar] [CrossRef]
- Jeż, A. Faster Fully Compressed Pattern Matching by Recompression. ACM Trans. Algorithms 2015, 11, 20:1–20:43. [Google Scholar] [CrossRef]
- Bille, P.; Cording, P.H.; Gørtz, I.L. Compressed Subsequence Matching and Packed Tree Coloring. Algorithmica 2017, 77, 336–348. [Google Scholar] [CrossRef]
- Furuya, I.; Takagi, T.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Kida, T. MR-RePair: Grammar Compression based on Maximal Repeats. In Proceedings of the Data Compression Conference (DCC 2019), Snowbird, UT, USA, 26–29 March 2019; pp. 508–517. [Google Scholar]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).