Practical Grammar Compression Based on Maximal Repeats †

Abstract: This study presents an analysis of RePair, which is a grammar compression algorithm known for its simple scheme, while also being practically effective. First, we show that the main process of RePair, that is, the step by step substitution of the most frequent symbol pairs, works within the corresponding most frequent maximal repeats. Then, we reveal the relation between maximal repeats and grammars constructed by RePair. On the basis of this analysis, we further propose a novel variant of RePair, called MR-RePair, which considers the one-time substitution of the most frequent maximal repeats instead of the consecutive substitution of the most frequent pairs. The results of the experiments comparing the size of constructed grammars and execution time of RePair and MR-RePair on several text corpora demonstrate that MR-RePair constructs more compact grammars than RePair does, especially for highly repetitive texts.


Introduction
Grammar compression is one of the lossless data compression methods. For a given text, grammar compression constructs a small size context free grammar that derives only the given text. While the problem of constructing the smallest such grammar for a given text is known to be NP-hard [1], several approximation algorithms have been proposed. One of them is RePair [2], which is an off-line grammar compression algorithm. Despite its simple scheme, RePair is known for its high compression in practice [3][4][5], and hence, it has been comprehensively studied. Some examples of studies on the RePair algorithm include its extension to an online algorithm [6], practical working time/space improvements [7,8], applications to various fields [3,9,10], and theoretical analysis of generated grammar sizes [1,11,12].
In the field of text processing, the repetitiveness of a text is considered an important property. Furthermore, it has been suggested that the extent of the repetitiveness of a given text directly relates to the number of maximal repeats in the text. Belazzougui et al. [13] demonstrated theoretical relations between the number of extensions of maximal repeats and other well-known properties of a text, such as the number of factors in the Lempel-Ziv parsing and the number of runs in the Burrows-Wheeler transform. Also, several text indexing data structures, whose sizes are bounded by the number of extensions of maximal repeats, have been proposed in the literature [14][15][16].
In this study, we analyzed the properties of RePair, focusing on their relationship to maximal repeats. Although RePair has been extensively studied, to the best of our knowledge, no previous study has associated RePair with maximal repeats. Furthermore, we propose MR-RePair, which is a novel grammar compression algorithm based on the property of maximal repeats. Several off-line grammar compression techniques based on the properties of (non-maximal) repeats have been proposed previously [17][18][19]. Recently, Gańczorz and Jeż introduced a heuristic method that improves the practical compression ratio of RePair in terms of the grammar size [20]. However, none of the previously proposed methods use the properties of maximal repeats. In this study, we aim to demonstrate that there is a theoretical guarantee for the size of constructed grammars; under a specific condition, the size of the grammar constructed by MR-RePair is smaller than or equal to that constructed by RePair. Our experiments show that MR-RePair constructs smaller grammars compared to RePair. We emphasize that generating a grammar of small size is of great importance since most, if not all, existing algorithms/data structures that work on grammar-compressed texts have running time dependent on the grammar sizes (see e.g., [21][22][23][24][25][26][27][28] and the references therein) and not directly on the encoded sizes.

Contributions:
The primary contributions of this study are as follows.
1. We show interesting relations between maximal repeats and grammars constructed by RePair.
2. We propose MR-RePair, which is a novel variant of RePair based on replacing the most frequent maximal repeats.
3. We implement MR-RePair and experimentally demonstrate that MR-RePair produces smaller grammars than all tested implementations of RePair. For a highly repetitive text used in the experiments, MR-RePair decreased the size of the constructed grammar to about 55% of that of RePair.
The rest of this paper is organized as follows. In Section 2, we introduce basic notations and review maximal repeats, grammar compression, and RePair. In Section 3, we present an analysis of the properties of RePair and demonstrate its relationship to maximal repeats. The definition and implementation of MR-RePair and its comparison with RePair are provided in Section 4. In Section 5, we report the experimental results of comparing RePair and MR-RePair. Finally, in Section 6, we conclude the paper.
A preliminary version of this work was previously and partially presented at the Data Compression Conference (DCC 2019) [29]. Compared to the preliminary version, the additional and extended parts of this paper are as follows.
1. We discuss how the sizes of generated grammars differ depending on the implementation of RePair, and prove that a lower bound of the maximum difference of the sizes is $\frac{1}{6}(\sqrt{6n+1}+13)$ (Definition 1 and Theorem 3).
2. We describe Naïve-MR-RePair, which is a naïve version of our MR-RePair. Furthermore, we prove that there is a case where the grammar size of Naïve-MR-RePair becomes larger than that of RePair by a term logarithmic in the length of the input string (Theorem 4).
3. We performed our experiment again following the advice of a DCC reviewer.
4. We now give all proofs of the lemmas, some of which were omitted in the previous version.

Preliminaries
In this section, we provide some notations and definitions to be used in the following sections. In addition, we recall grammar compression and review RePair.

Basic Notations and Terms
Let $\Sigma$ be an alphabet, that is, an ordered finite set of symbols. An element $T = t_1 \cdots t_n$ of $\Sigma^*$ is called a string or a text, where $|T| = n$ denotes its length. Let $\varepsilon$ be the empty string of length 0, that is, $|\varepsilon| = 0$. Let $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ and $T = t_1 \cdots t_n \in \Sigma^n$ be any text of length n. If $T = usw$ with $u, s, w \in \Sigma^*$, then s is called a substring of T. Let $T[i..j] = t_i \cdots t_j$ for any $1 \le i \le j \le n$ denote the substring of T beginning at i and ending at j, and let $T[i] = t_i$ denote the ith symbol of T. For a finite set S of texts, text T is said to be a superstring of S if T contains all texts of S as substrings. Let #occ(s) denote the frequency of s, i.e., the number of occurrences of s in a text as a substring. If there exists an isomorphism from an alphabet $\Sigma$ to another alphabet $\hat{\Sigma}$, texts $\Sigma^*$ and $\hat{\Sigma}^*$ are said to be isomorphic for $\Sigma$ and $\hat{\Sigma}$.

Maximal Repeats
Let s be a substring of text T. If the frequency of s is greater than 1, s is called a repeat. A left (or right) extension of s is any substring of T of the form ws (or sw), where $w \in \Sigma^+$. We say that s is left (or right) maximal if every left (or right) extension of s occurs strictly fewer times in T than s. Accordingly, s is a maximal repeat of T if s is a repeat that is both left and right maximal. In this paper, we only consider strings with a length of more than 1 as maximal repeats. For example, substring abra of T = abracadabra is a maximal repeat, whereas br is not.
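The definition above can be checked on small texts with a brute-force sketch. The function names `substring_counts` and `maximal_repeats` are illustrative and not part of the paper, and the quadratic enumeration of substrings is only meant for short examples.

```python
from collections import Counter

def substring_counts(text):
    """Count occurrences (overlaps included) of every non-empty substring."""
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts

def maximal_repeats(text, min_len=2):
    """Return {repeat: frequency} for all maximal repeats of length >= min_len."""
    counts = substring_counts(text)
    alphabet = set(text)
    found = {}
    for s, freq in counts.items():
        if freq < 2 or len(s) < min_len:
            continue
        # One-symbol extensions suffice: a longer extension cannot occur
        # more often than the one-symbol extension it contains.
        left_maximal = all(counts[c + s] < freq for c in alphabet)
        right_maximal = all(counts[s + c] < freq for c in alphabet)
        if left_maximal and right_maximal:
            found[s] = freq
    return found

print(maximal_repeats("abracadabra"))  # {'abra': 2}
```

Here br is rejected because its left extension abr occurs as often as br itself.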

Grammar Compression
A context free grammar (CFG or simply grammar) G is defined as a four-tuple $G = \{V, \Sigma, S, R\}$, where V denotes an ordered finite set of variables, $\Sigma$ denotes an ordered finite alphabet, R denotes a finite set of binary relations called production rules (or rules) between V and $(V \cup \Sigma)^*$, and $S \in V$ denotes a special variable called the start variable. A production rule describes how a variable is substituted and is written in the form $v \to w$, with $v \in V$ and $w \in (V \cup \Sigma)^*$. Let $X, Y \in (V \cup \Sigma)^*$. If there are $x_l, x, x_r, y \in (V \cup \Sigma)^*$ such that $X = x_l x x_r$, $Y = x_l y x_r$, and $x \to y \in R$, we write $X \Rightarrow Y$, and denote the reflexive transitive closure of $\Rightarrow$ as $\Rightarrow^*$. Let val(v) be a string derived from v, i.e., $v \Rightarrow^* val(v)$. We define grammar $\hat{G} = \{\hat{V}, \hat{\Sigma}, \hat{S}, \hat{R}\}$ as a subgrammar of G if $\hat{V} \subseteq V$, $\hat{\Sigma} \subseteq (V \cup \Sigma)$, and $\hat{R} \subseteq R$.
Given a text T, grammar compression is a method for lossless text data compression that constructs a restricted CFG uniquely deriving the text T. For G to be deterministic, the production rule for each variable $v \in V$ must be unique. In what follows, we assume that every grammar is deterministic and each production rule is $v_i \to expr_i$, where $expr_i$ is either $expr_i = a$ ($a \in \Sigma$) or a sequence of variables. For estimating the effectiveness of compression, we use the size of the constructed grammar, which is defined as the total length of the right-hand sides of all production rules of the grammar.

RePair
RePair is the grammar compression algorithm proposed in [2]. For an input text T, let $G = \{V, \Sigma, S, R\}$ be the grammar generated by RePair. RePair constructs G through the following steps:
Step 1. Replace each symbol $a \in \Sigma$ with a new variable $v_a$ and add $v_a \to a$ to R.
Step 2. Find the most frequent pair p in T.
Step 3. Replace every occurrence (or, as many occurrences as possible, when p is a pair consisting of the same symbol) of p with a new variable v, and then, add $v \to p$ to R.
Step 4. Re-evaluate the frequencies of pairs for the updated text generated in Step 3. If the maximum frequency is 1, add $S \to$ (current text) to R, and terminate. Otherwise, return to Step 2.
Figure 1 illustrates an example of the grammar generation process of RePair.
Figure 1. An example of the grammar generation process of RePair for the text abracadabra. The generated grammar has a size of 16.
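The RePair steps described above can be sketched as follows. This is a deliberately simple quadratic-time illustration, not the linear-time implementation of [2]; the variable names (`v_a`, `v1`, `<S>`), the helper functions, and the tie-breaking rule (the most frequent pair seen first from the left) are assumptions made for this example.

```python
from collections import Counter

def pair_counts(seq):
    """Count adjacent pairs; overlapping copies of (a, a) in 'aaa' count once."""
    counts, last_start = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if pair[0] == pair[1] and last_start.get(pair) == i - 1:
            continue  # overlaps the occurrence counted just before
        counts[pair] += 1
        last_start[pair] = i
    return counts

def replace_pair(seq, pair, var):
    """Replace occurrences of `pair` by `var`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(var)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def repair(text):
    """Return {variable: right-hand side} following Steps 1-4."""
    rules = {}
    for ch in text:                              # Step 1
        rules.setdefault("v_" + ch, (ch,))
    seq = ["v_" + ch for ch in text]
    fresh = 0
    while True:
        counts = pair_counts(seq)
        best = max(counts.values(), default=0)
        if best < 2:                             # Step 4: nothing repeats
            break
        pair = next(p for p, c in counts.items() if c == best)  # Step 2
        fresh += 1
        var = f"v{fresh}"
        rules[var] = pair                        # Step 3
        seq = replace_pair(seq, pair, var)
    rules["<S>"] = tuple(seq)
    return rules

def grammar_size(rules):
    """Total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())

def expand(rules, symbol="<S>"):
    """Derive the text generated by `symbol` (terminals are single characters)."""
    if symbol not in rules:
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])

rules = repair("abracadabra")
print(grammar_size(rules))  # 16
```

For abracadabra the sketch creates three pair rules on top of the five Step 1 rules, and `expand(rules)` reproduces the input text.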

Lemma 1 ([2]). RePair works in O(n) expected time and $5n + 4k^2 + 4k' + \lceil\sqrt{n+1}\rceil - 1$ words of space, where n is the length of the source text, k denotes the cardinality of the source alphabet, and k′ denotes the cardinality of the final dictionary.

Analysis of RePair
This section presents an analysis of RePair with respect to its relationship to maximal repeats and introduces an important concept, called MR-order.

RePair and Maximal Repeats
The following theorem describes an essential property of RePair, that is, RePair recursively replaces the most frequent maximal repeats.
Theorem 1. Let T be a given text, and assume that no most frequent maximal repeat of T has an occurrence that overlaps with another occurrence of itself. Let f be the frequency of the most frequent pairs of T, and t be a text obtained after all pairs with frequency f in T are replaced by variables. Then, there is a text s such that s is obtained after all maximal repeats with frequency f in T are replaced by variables, and s and t are isomorphic to each other.
We need two lemmas and a corollary to prove Theorem 1. The following lemma shows a fundamental relation between the most frequent maximal repeats and the most frequent pairs in a text.
Lemma 2. A pair p of symbols is most frequent in a text T if and only if p occurs once in exactly one of the most frequent maximal repeats of T.
Proof. (⇒) Let r be a most frequent maximal repeat containing p as a substring. It is clear that p can only occur once in r, since otherwise, #occ(p) > #occ(r) would hold, implying the existence of a maximal repeat that is more frequent than r, contradicting the assumption that r is most frequent. Suppose that there exists a different most frequent maximal repeat r′ containing p as a substring. Similarly, p occurs only once in r′. Furthermore, since r and r′ can be obtained by left and right extensions to p, #occ(r) = #occ(r′) = #occ(p), and any occurrence of p is contained in an occurrence of both r and r′. Since r′ cannot be a substring of r, there exists a string w that is a superstring of r and r′, such that #occ(w) = #occ(r) = #occ(r′) = #occ(p). However, this contradicts that r and r′ are maximal repeats.
(⇐) Let r be the most frequent maximal repeat such that p occurs once in it. By definition, #occ(r) = #occ(p). If p is not the most frequent symbol pair in T, there exists a pair p′ in T such that #occ(p′) > #occ(p) = #occ(r). However, this implies that there is a maximal repeat r′ with #occ(r′) = #occ(p′) > #occ(r), contradicting that r is most frequent.
The following corollary is derived directly from Lemma 2.
Corollary 1. For a given text, the frequency of the most frequent pairs and that of the most frequent maximal repeats are the same.
The following lemma shows an important property of the most frequent maximal repeats.
Lemma 3. The length of the overlap between any two occurrences of most frequent maximal repeats is at most 1.
Proof. Let xw and wy be the most frequent maximal repeats that have an overlapping occurrence xwy, where x, y, w ∈ Σ + . If we assume that |w| ≥ 2, since xw and wy are most frequent maximal repeats, it holds that #occ(w) = #occ(xw) = #occ(wy), i.e., every occurrence of w is preceded by x and followed by y. This implies that #occ(xwy) = #occ(xw) = #occ(wy) as well, but contradicts that xw and wy are maximal repeats.
Theorem 1 can now be proved based on the above lemmas and corollary.
Proof of Theorem 1. According to Corollary 1, the frequency of the most frequent maximal repeats in T is f. Let p be one of the most frequent pairs in T. According to Lemma 2, there is a unique maximal repeat that is most frequent and contains p once. We denote this maximal repeat as r. Let us assume that there is a substring zxpyw in T, where z, w ∈ Σ, x, y ∈ Σ*, and xpy = r. We denote r[1] and r[|r|] as ẋ and ẏ, respectively. There are the following two cases to consider:
(i) #occ(zẋ) < f and #occ(ẏw) < f. If |r| = 2, the replacement of p directly corresponds to the replacement of the most frequent maximal repeat, since p = r. If |r| > 2, after p is replaced with a variable v, r is changed to xvy. This occurs f times in the updated text, and according to Lemma 2, the frequency of every pair occurring in xvy is still f. Because the maximum frequency of pairs does not increase, f is still the maximum frequency. Therefore, we replace all pairs contained in xvy in the following steps, whereas zẋ and ẏw are not replaced. This holds for every occurrence of p, implying that replacing the most frequent pairs while the maximum frequency does not change corresponds to replacing all pairs (old and new) contained in the most frequent maximal repeats of the same frequency until they are replaced by a single variable. Then, s can be generated by replacing r.
(ii) #occ(zẋ) = f or #occ(ẏw) = f. We consider the case where #occ(zẋ) = f. Note that #occ(zxpy) < f according to the assumption that xpy is a maximal repeat. Suppose RePair replaces zẋ by a variable v before p is replaced. Note that according to Lemma 2, there is a maximal repeat occurring f times and including zẋ once (we denote the maximal repeat as r′), and r′ ≠ r by assumption. According to Lemma 3, the length of the overlap of r and r′ is at most 1, and then, only ẋ is a symbol present in both r and r′.
After that, xpy = r is no longer the most frequent maximal repeat because some of its occurrences are changed to vr [2..|r|]. However, r[2.

MR-Order
According to Theorem 1, if there is just one most frequent maximal repeat in the current text, then RePair replaces all of its occurrences step by step. However, a problem arises if there are two or more most frequent maximal repeats, with some of them overlapping. In this case, the selection order of the (most frequent) pairs affects the priority of maximal repeats. We call this order of selecting (summarizing) maximal repeats the maximal repeat selection order (or simply MR-order). Note that the selection order of pairs actually depends on the implementation of RePair.
For instance, consider the text abcdeabccde, where abc and cde are the most frequent maximal repeats occurring twice. There are two MR-orders, depending on which of the two maximal repeats abc or cde is given priority. The results of the replacement using RePair with the MR-order are (i) xyxcy with variables x and y such that $x \Rightarrow^* abc$ and $y \Rightarrow^* de$, and (ii) zwzcw with variables z and w such that $z \Rightarrow^* ab$ and $w \Rightarrow^* cde$. More precisely, there are 12 possible ways in which RePair can compress the text, each with its own generated rule set. Six of them have the same MR-order because abc precedes cde in all of them. The other six have the same MR-order for the same reason: cde precedes abc.
If there are several distinct most frequent pairs with overlaps, RePair constructs grammars with different sizes according to the selection order of the pairs. For example, consider the text bcxdabcyabzdabvbcuda. There are three most frequent pairs, namely, ab, bc, and da, occurring three times each. If RePair takes ab first, the rule set of the generated grammar may become $\{v_1 \to ab,\ v_2 \to bc,\ v_3 \to d v_1,\ S \to v_2 x v_3 c y v_1 z v_3 v v_2 u d a\}$ and its size is 19. If RePair takes da first, the rule set of the generated grammar may become $\{v_1 \to da,\ v_2 \to bc,\ S \to v_2 x v_1 v_2 y a b z v_1 b v v_2 u v_1\}$ and its size is 18.
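The effect of the selection order in this example can be reproduced with a small sketch. As in the example, Step 1 is omitted and pairs are formed directly over the input symbols. The parameter `first` (the pair forced at the first step when it is among the most frequent ones) and the fallback tie-breaking rule (the most frequent pair seen first from the left) are assumptions of this illustration.

```python
from collections import Counter

def pair_counts(seq):
    """Count adjacent pairs; overlapping copies of (a, a) in 'aaa' count once."""
    counts, last_start = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if pair[0] == pair[1] and last_start.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_start[pair] = i
    return counts

def replace_pair(seq, pair, var):
    """Replace occurrences of `pair` by `var`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(var)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def repair_rules(text, first=None):
    """RePair without Step 1; `first` is tried as the first replaced pair."""
    seq, rules = list(text), []
    while True:
        counts = pair_counts(seq)
        best = max(counts.values(), default=0)
        if best < 2:
            break
        tied = [p for p, c in counts.items() if c == best]
        pair = first if first in tied else tied[0]
        first = None                      # only the first choice is forced
        var = f"v{len(rules) + 1}"
        rules.append((var, pair))
        seq = replace_pair(seq, pair, var)
    return rules, seq

def size(rules, seq):
    """Size of the pair rules plus the start rule."""
    return sum(len(rhs) for _, rhs in rules) + len(seq)

text = "bcxdabcyabzdabvbcuda"
print(size(*repair_rules(text, first=("a", "b"))))  # 19
print(size(*repair_rules(text, first=("d", "a"))))  # 18
```

Taking ab first leaves dv1 as a repeated pair and costs one extra rule, which is where the difference of 1 comes from.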

Remark 1.
If there are several distinct pairs with the same maximum frequency, the size of the grammar generated by RePair depends on their replacement order.
However, the following theorem states that the MR-order rather than the replacement order of pairs determines the size of the grammar generated by RePair.
Theorem 2. The sizes of grammars generated by RePair are the same if they are generated in the same MR-order.
Proof. Let T be a variable sequence appearing in the grammar generation process of RePair and f be the maximum frequency of pairs in T. Suppose that T′ is a variable sequence generated after RePair replaces every pair occurring f times. According to Theorem 1, all generated T′ are isomorphic to one another, and hence they all have the same length, regardless of the replacement order of pairs. Let $r_1$ be the most frequent maximal repeat of T that precedes all other maximal repeats in this MR-order. As a result, $r_1$ is converted into a variable, and according to Lemma 2, all pairs included in $r_1$ are distinct. Then, the size of the subgrammar which exactly derives $r_1$ is $2(|r_1| - 1) + 1 = 2|r_1| - 1$. This holds for the next prioritized maximal repeat (we denote it as $r_2$) with the following slight difference: the pattern actually replaced would be a substring of $r_2$ excluding its beginning or end if there are occurrences of overlap with $r_1$. However, these strings are common in the same MR-order. Then, the sizes of generated subgrammars are the same, regardless of the order of selecting pairs. Similarly, this holds for all most frequent maximal repeats and every maximum frequency of pairs through the entire process of RePair.

Greatest Size Difference of RePair
We consider the problem of determining the greatest size difference between possible outcomes of RePair.
Definition 1 (Greatest size difference). Let g and g′ be the sizes of any two possible grammars that can be generated by RePair for a given text. Then, the greatest size difference of RePair (GSDRP) is max(|g − g′|).
A lower bound of the GSDRP can be established according to the following theorem.
Theorem 3. Given a text with a length of n, a lower bound of the GSDRP is $\frac{1}{6}(\sqrt{6n+1}+13)$.
Proof. Let B, L, and R be strings such that
$$B = l_1 x y r_1\, l_2 x y r_2 \cdots l_f x y r_f, \qquad L = \diamond l_1 x \diamond l_2 x \cdots \diamond l_f x, \qquad R = \diamond y r_1 \diamond y r_2 \cdots \diamond y r_f,$$
where $x, y, l_1, \ldots, l_f, r_1, \ldots, r_f$ denote distinct symbols, and each occurrence of $\diamond$ denotes a distinct symbol. Consider text $T = B L^{f-1} R^{f-1}$. Here, $xy, l_1 x, \cdots, l_f x, y r_1, \cdots, y r_f$ are the most frequent maximal repeats with a frequency f in T. Let G and G′ be grammars generated by RePair for T in different MR-orders, such that (i) xy precedes all other maximal repeats and (ii) xy follows all other maximal repeats, respectively. We denote the sizes of G and G′ as g and g′, respectively.
First, we consider G and how RePair generates it. The first rule generated by the replacement is $v_1 \to xy$ considering the MR-order. After the replacement, L and R remain unchanged, whereas B becomes the following text:
$$B_1 = l_1 v_1 r_1\, l_2 v_1 r_2 \cdots l_f v_1 r_f.$$
Each pair in $B_1$ occurs only once in the entire text $B_1 L^{f-1} R^{f-1}$. This means that $B_1$ can never be shortened from the current length of 3f. In the remaining steps, $l_i x$ and $y r_i$ (for $i = 1, \cdots, f$) are replaced. L and R are changed to texts with a length of 2f each. Hence, the following holds:
$$g = 2(2f + 1) + 3f + 2 \cdot 2f(f - 1) = 4f^2 + 3f + 2. \quad (1)$$
Next, we consider G′ and how RePair generates it. According to the MR-order, $l_1 x, \cdots, l_f x, y r_1, \cdots, y r_f$ are replaced before xy is selected. They do not overlap with each other, and after they are replaced, xy does not occur in the generated text. Therefore, there are 2f rules in G′ deriving $l_i x$ and $y r_i$ (for $i = 1, \cdots, f$), whereas the rule deriving xy is absent. L and R are changed to texts with a length of 2f each, and B is changed to a text with a length of 2f. Hence, the following holds:
$$g' = 2 \cdot 2f + 2f + 2 \cdot 2f(f - 1) = 4f^2 + 2f. \quad (2)$$
Let us denote the length of the original text $T = B L^{f-1} R^{f-1}$ by n. Then,
$$n = 4f + 2 \cdot 3f(f - 1) = 6f^2 - 2f, \quad (3)$$
that is, $f = \frac{1}{6}(\sqrt{6n+1} + 1)$, holds. According to Equations (1)-(3),
$$g - g' = f + 2 = \frac{1}{6}(\sqrt{6n+1} + 13)$$
holds and the theorem follows.
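The arithmetic behind the bound can be checked numerically. The sketch below is not taken from the paper: it assumes |B| = 4f and |L| = |R| = 3f, which is what the lengths stated in the proof imply (B shrinks to 3f after its f occurrences of xy are replaced, and L and R shrink to 2f after f pair replacements each), and it counts only the pair rules and the start rule.

```python
import math

def sizes(f):
    """Return (n, g, g_prime) for the construction with parameter f."""
    # Text length: |B| + (f - 1) * |L| + (f - 1) * |R|.
    n = 4 * f + 2 * (f - 1) * 3 * f
    # xy first: 2f + 1 pair rules, B stays at 3f, L and R copies shrink to 2f.
    g = 2 * (2 * f + 1) + 3 * f + 2 * (f - 1) * 2 * f
    # xy last: 2f pair rules, B shrinks to 2f, L and R copies shrink to 2f.
    g_prime = 2 * (2 * f) + 2 * f + 2 * (f - 1) * 2 * f
    return n, g, g_prime

for f in range(3, 8):
    n, g, g_prime = sizes(f)
    bound = (math.sqrt(6 * n + 1) + 13) / 6
    print(f, n, g - g_prime, bound)
```

Under these assumptions, 6n + 1 equals (6f − 1)², so the difference g − g′ = f + 2 coincides with the closed form in n for every f.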

MR-RePair
The main strategy of the proposed method is to recursively replace the most frequent maximal repeats instead of the most frequent pairs.
In this section, we first explain the naïve version of our method called Naïve-MR-RePair. Although it can have a bad performance in certain cases, it is simple and helpful in understanding our main result. Then, we describe the proposed MR-RePair.

Naïve-MR-RePair
Definition 2 (Naïve-MR-RePair). For an input text T, let G = {V, Σ, S, R} be the grammar generated by Naïve-MR-RePair. Naïve-MR-RePair constructs G through the following steps:
Step 1. Replace each symbol a ∈ Σ with a new variable $v_a$ and add $v_a \to a$ to R.
Step 2. Find the most frequent maximal repeat r in T.
Step 3. Replace every occurrence (or as many occurrences as possible, when there are overlaps) of r in T with a new variable v and then add v → r to R.
Step 4. Re-evaluate the frequencies of maximal repeats for the updated text generated in Step 3. If the maximum frequency is 1, add S → (current text) to R and terminate. Otherwise, return to Step 2.
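Steps 1-4 of Definition 2 can be sketched as follows. This is a brute-force illustration that enumerates all substrings, so it is only suitable for short inputs; the helper names, the variable naming scheme, and the tie-breaking rule (the first most frequent maximal repeat encountered) are assumptions of this example.

```python
from collections import Counter

def most_frequent_maximal_repeat(seq):
    """Return (repeat, frequency) for a most frequent maximal repeat
    of length >= 2, or (None, 1) if no such repeat exists."""
    n = len(seq)
    counts = Counter(tuple(seq[i:j]) for i in range(n) for j in range(i + 1, n + 1))
    symbols = set(seq)
    best, best_freq = None, 1
    for s, freq in counts.items():
        if len(s) < 2 or freq <= best_freq:
            continue
        if (all(counts[(c,) + s] < freq for c in symbols)
                and all(counts[s + (c,)] < freq for c in symbols)):
            best, best_freq = s, freq
    return best, best_freq

def replace_all(seq, pattern, var):
    """Replace non-overlapping occurrences of `pattern`, scanning left to right."""
    out, i, m = [], 0, len(pattern)
    while i < len(seq):
        if tuple(seq[i:i + m]) == pattern:
            out.append(var)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out

def naive_mr_repair(text):
    """Return {variable: right-hand side} following Steps 1-4 of Definition 2."""
    rules = {}
    for ch in text:                                  # Step 1
        rules.setdefault("v_" + ch, (ch,))
    seq = ["v_" + ch for ch in text]
    fresh = 0
    while True:
        repeat, _ = most_frequent_maximal_repeat(seq)  # Step 2
        if repeat is None:                           # Step 4: nothing repeats
            break
        fresh += 1
        var = f"v{fresh}"
        rules[var] = repeat                          # Step 3
        seq = replace_all(seq, repeat, var)
    rules["<S>"] = tuple(seq)
    return rules

rules = naive_mr_repair("abracadabra")
print(sum(len(rhs) for rhs in rules.values()))  # 14
```

For abracadabra a single rule deriving abra replaces the three pair rules that RePair needs, giving size 14 instead of 16.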
We can easily extend the concept of the MR-order to Naïve-MR-RePair. Figure 2 illustrates an example of the grammar generation process of Naïve-MR-RePair. Figures 1 and 2 explain why the strategy of using maximal repeats is more effective compared to that using pairs. When compressing the text $v_a v_b v_r v_a v_c v_a v_d v_a v_b v_r v_a$ (abracadabra after Step 1), RePair needs three rules of total size 6 to derive the maximal repeat $v_a v_b v_r v_a$, whereas Naïve-MR-RePair needs a single rule of size 4.
However, the following theorem indicates that the size of the grammar generated by Naïve-MR-RePair is larger than that generated by RePair in certain cases, even when they work in the same MR-order. Roughly speaking, this is caused by the overlaps of maximal repeats. When an occurrence of the most frequent maximal repeat overlaps with another occurrence of itself, a slight difference arises between the grammar construction processes of RePair and Naïve-MR-RePair from the viewpoint of maximal repeats, that is, the targeted maximal repeats vary between RePair and Naïve-MR-RePair (RePair replaces the targeted maximal repeat step by step and Naïve-MR-RePair replaces it at once). Indeed, if maximal repeats are carefully embedded in a text to increase this difference, the case presented in the following theorem occurs.

Theorem 4.
Given a text T with a length of n, let $g_{rp}$ and $g_{nmr}$ be the sizes of the grammars generated by RePair and Naïve-MR-RePair for T, respectively, assuming that RePair and Naïve-MR-RePair work in the same MR-order. Then, there is a case where $g_{nmr} = g_{rp} + \Omega(\log n)$ holds. (We show a concrete example of this theorem in Appendix A.)
Proof. Let $G_{rp} = \{V_{rp}, \Sigma_{rp}, S_{rp}, R_{rp}\}$ and $G_{nmr} = \{V_{nmr}, \Sigma_{nmr}, S_{nmr}, R_{nmr}\}$ be the grammars generated by RePair and Naïve-MR-RePair, respectively. Let T′ be the text generated just after Step 1 of RePair or Naïve-MR-RePair (Step 1 is common to both algorithms), that is, $T' = v_1 \cdots v_n$ such that $v_i \in V_{rp} \cap V_{nmr}$ and $v_i \to T[i] \in R_{rp} \cap R_{nmr}$ (for $i = 1, \cdots, n$), and $\hat{G}_{rp} = \{\hat{V}_{rp}, \hat{\Sigma}_{rp}, \hat{S}_{rp}, \hat{R}_{rp}\}$ (or $\hat{G}_{nmr} = \{\hat{V}_{nmr}, \hat{\Sigma}_{nmr}, \hat{S}_{nmr}, \hat{R}_{nmr}\}$) be a subgrammar of $G_{rp}$ (or $G_{nmr}$) deriving T′. Let $T' = (uw)^{2^{m+1}-1} u$, where $u \in V_{rp} \cap V_{nmr}$, $w \in (V_{rp} \cap V_{nmr})^+$ such that uwu is the most frequent maximal repeat of T′, and $m \in \mathbb{N}^+$. Note that $2^{m+1} - 1 = \sum_{i=0}^{m} 2^i$. Here, $\hat{R}_{rp}$ and $\hat{R}_{nmr}$ are defined as follows:
$\hat{R}_{rp}$: Assume that $x_i \in \hat{V}_{rp}$ for $1 \le i \le m$ and $y_j \in \hat{V}_{rp} \cup \hat{\Sigma}_{rp}$ for $1 \le j \le |w|$, then $\hat{R}_{rp}$ consists of
• $|w|$ rules $y_j \to y_l y_r$ with $val(y_{|w|}) = uw$,
• one rule $x_1 \to y_{|w|} y_{|w|}$ and log 2
$\hat{R}_{nmr}$: Assume that $d = |\hat{V}_{nmr}| = |\hat{R}_{nmr}|$ and $z_i \in \hat{V}_{nmr}$ for $1 \le i \le d$, then $\hat{R}_{nmr}$ consists of
Let $\hat{g}_{rp}$ and $\hat{g}_{nmr}$ be the sizes of $\hat{G}_{rp}$ and $\hat{G}_{nmr}$, respectively. Then, the following holds: Here, with regard to the length of T′, we have $n = (2(2^m - 1) + 1)(|w| + 1) + 1$, and From these, $d = m + 1$ holds. Hence, according to Equations (4) and (5), the following holds: Therefore, $\hat{g}_{nmr} > \hat{g}_{rp}$ holds for some $(m, |w|)$, and the proposition holds.

MR-RePair
The grammar size of Naïve-MR-RePair becomes larger than that of RePair, as shown in Theorem 4, because Naïve-MR-RePair cannot replace all occurrences of a most frequent maximal repeat if one occurrence overlaps with another occurrence of the same repeat. In the remainder of this section, we describe MR-RePair, which is an improved version of Naïve-MR-RePair.

Definition 3 (MR-RePair).
For an input text T, let G = {V, Σ, S, R} be the grammar generated by MR-RePair. MR-RePair constructs G through the following steps:
Step 1. Replace each symbol a ∈ Σ with a new variable $v_a$ and add $v_a \to a$ to R.
Step 2. Find the most frequent maximal repeat r in T.
Step 3. Check if |r| > 2 and r[1] = r[|r|], and if so, use r[1..|r| − 1] instead of r in Step 4.
Step 4. Replace every occurrence of r with a new variable v and then add v → r to R.
Step 5. Re-evaluate the frequencies of maximal repeats for the updated text generated in Step 4. If the maximum frequency is 1, add S → (current text) to R and terminate. Otherwise, return to Step 2.
We can easily extend the concept of the MR-order to MR-RePair. It does not matter whether Step 3 uses r[2..|r|] instead of r[1..|r| − 1]. MR-RePair can replace all occurrences of r even if some of its occurrences overlap with each other since, according to Lemma 3, the length of the overlaps of the most frequent maximal repeats is at most 1. Figure 3 illustrates an example of the grammar generation process of MR-RePair. Although the size of the grammar generated by MR-RePair as shown in Figure 3 is larger than that generated by Naïve-MR-RePair as shown in Figure 2, it is still smaller than that generated by RePair as shown in Figure 1.
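A brute-force sketch of MR-RePair, including the check that drops the last symbol of a maximal repeat whose first and last symbols coincide, might look as follows. It enumerates all substrings and is therefore unrelated to the linear-time implementation discussed later; helper names, variable names, and tie-breaking (the first most frequent maximal repeat encountered) are assumptions of this example.

```python
from collections import Counter

def most_frequent_maximal_repeat(seq):
    """Return a most frequent maximal repeat of length >= 2, or None."""
    n = len(seq)
    counts = Counter(tuple(seq[i:j]) for i in range(n) for j in range(i + 1, n + 1))
    symbols = set(seq)
    best, best_freq = None, 1
    for s, freq in counts.items():
        if len(s) < 2 or freq <= best_freq:
            continue
        if (all(counts[(c,) + s] < freq for c in symbols)
                and all(counts[s + (c,)] < freq for c in symbols)):
            best, best_freq = s, freq
    return best

def replace_all(seq, pattern, var):
    """Replace occurrences of `pattern` by `var`, scanning left to right."""
    out, i, m = [], 0, len(pattern)
    while i < len(seq):
        if tuple(seq[i:i + m]) == pattern:
            out.append(var)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out

def mr_repair(text):
    """Return {variable: right-hand side} built by the MR-RePair steps."""
    rules = {}
    for ch in text:                                   # Step 1
        rules.setdefault("v_" + ch, (ch,))
    seq = ["v_" + ch for ch in text]
    fresh = 0
    while True:
        repeat = most_frequent_maximal_repeat(seq)    # Step 2
        if repeat is None:                            # nothing repeats: stop
            break
        if len(repeat) > 2 and repeat[0] == repeat[-1]:
            repeat = repeat[:-1]                      # Step 3: drop last symbol
        fresh += 1
        var = f"v{fresh}"
        rules[var] = repeat                           # Step 4
        seq = replace_all(seq, repeat, var)
    rules["<S>"] = tuple(seq)
    return rules

rules = mr_repair("abracadabra")
print(sum(len(rhs) for rhs in rules.values()))  # 15
```

On abracadabra the repeat abra is shortened to abr, after which the pair formed by the new variable and a becomes a repeat of its own; the resulting size of 15 lies between the sizes obtained with the naïve variant and with RePair.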

Theorem 5.
Assume that RePair and MR-RePair work based on the same MR-order for a given text. Let $g_{rp}$ and $g_{mr}$ be the sizes of the grammars generated by RePair and MR-RePair, respectively. Then, $\frac{1}{2} g_{rp} < g_{mr} \le g_{rp}$ holds.
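Proof. Assume that $G_{rp} = \{V_{rp}, \Sigma_{rp}, S_{rp}, R_{rp}\}$ and $G_{mr} = \{V_{mr}, \Sigma_{mr}, S_{mr}, R_{mr}\}$ are grammars generated by RePair and MR-RePair, respectively, for a given text T with a length of n. Let T′ be the text generated just after Step 1 of RePair or MR-RePair (Step 1 is common to both algorithms), that is, $T' = v_1 \cdots v_n$ such that $v_i \in V_{rp} \cap V_{mr}$ and $v_i \to T[i] \in R_{rp} \cap R_{mr}$ (for $i = 1, \cdots, n$). Let $f_1$ be the maximum frequency of the maximal repeats in T′. According to Corollary 1, the maximum frequency of the pairs in T′ is also $f_1$. Let $\hat{G}^{(f_1)}_{rp}$ (or $\hat{G}^{(f_1)}_{mr}$) be a subgrammar of $G_{rp}$ (or $G_{mr}$) generated while RePair (or MR-RePair) replaces pairs (or maximal repeats) with the frequency $f_1$, and let $\hat{g}^{(f_1)}_{rp}$ (or $\hat{g}^{(f_1)}_{mr}$) be the size of this subgrammar. Let $l^{(f_1)}_i$ denote the length of the i-th substring that is replaced by a single variable during this phase. Note that this substring is common to RePair and MR-RePair, and each $l^{(f_1)}_i$ is at least 2. Since RePair replaces such a substring step by step, with $l^{(f_1)}_i - 1$ rules of size 2, and MR-RePair replaces it at once, with one rule of size $l^{(f_1)}_i$, the following holds:
$$\hat{g}^{(f_1)}_{rp} = \sum_i 2\,(l^{(f_1)}_i - 1), \qquad \hat{g}^{(f_1)}_{mr} = \sum_i l^{(f_1)}_i.$$
From these, since every $l^{(f_1)}_i \ge 2$, we obtain $\frac{1}{2}\hat{g}^{(f_1)}_{rp} < \hat{g}^{(f_1)}_{mr} \le \hat{g}^{(f_1)}_{rp}$. The same argument applies to every subsequent maximum frequency, and the texts obtained by the two algorithms after each phase have the same length. Since the rules generated in Step 1 and the right-hand side of the start rule have the same total length in both grammars, $\frac{1}{2} g_{rp} < g_{mr} \le g_{rp}$ follows, and thus, the proposition holds. $g_{mr} = g_{rp}$ holds when every replaced length is 2.
However, when the MR-orders of RePair and MR-RePair are different, the grammar generated by MR-RePair can be larger than that generated by RePair, as the following theorem indicates:
Theorem 6. Unless the MR-orders of RePair and MR-RePair are the same, there is a case where the size of the grammar generated by MR-RePair becomes larger than that generated by RePair.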
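Proof. We show a concrete example of the case stated in the proposition. Consider text abcxabcyabczcxcxcycyczcz. There are four most frequent maximal repeats, abc, cx, cy, and cz. Let A and B be two different MR-orders such that A prioritizes the maximal repeats in order of abc, cx, cy, cz and B prioritizes the maximal repeats in order of cx, cy, cz, abc, respectively. MR-RePair working in A generates a grammar whose rules are $\{v_a \to a,\ v_b \to b,\ v_c \to c,\ v_x \to x,\ v_y \to y,\ v_z \to z,\ v_1 \to v_a v_b v_c,\ v_2 \to v_c v_x,\ v_3 \to v_c v_y,\ v_4 \to v_c v_z,\ S \to v_1 v_x v_1 v_y v_1 v_z v_2 v_2 v_3 v_3 v_4 v_4\}$. Meanwhile, RePair working in B generates a grammar whose rules are $\{v_a \to a,\ v_b \to b,\ v_c \to c,\ v_x \to x,\ v_y \to y,\ v_z \to z,\ v_1 \to v_c v_x,\ v_2 \to v_c v_y,\ v_3 \to v_c v_z,\ v_4 \to v_a v_b,\ S \to v_4 v_1 v_4 v_2 v_4 v_3 v_1 v_1 v_2 v_2 v_3 v_3\}$, where S is the start variable. The size of the grammar generated by MR-RePair working in A is 27, whereas the size of that generated by RePair working in B is 26.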
While Theorem 6 indicates that the grammar can be larger in MR-RePair than in RePair, in Section 5 we demonstrate that MR-RePair outperforms RePair in practice.
We can implement MR-RePair by extending the original implementation of RePair described in [2] while keeping the same complexity.
Theorem 7. Let G = {V, Σ, S, R} be the grammar generated by MR-RePair for a given text with a length of n. Then, MR-RePair works in O(n) expected time and $5n + 4k^2 + 4k' + \lceil\sqrt{n+1}\rceil - 1$ words of space, where k and k′ denote the cardinalities of Σ and V, respectively.
Proof. Compared to RePair, the additional operations performed by MR-RePair are (i) extending the selected pair to left and right until it becomes a maximal repeat and (ii) checking and excluding either the beginning or the end of the obtained maximal repeat if they are the same. These additional operations can be realized using the same data structures as those employed in RePair. Then, the space complexity of MR-RePair follows Lemma 1.
We can clearly execute operation (ii) in constant time. Hence, we consider how the time complexity is affected by operation (i). Let l be the length of the maximal repeat containing the focused pair, and let f be the frequency of the pair. Then, compared to RePair, MR-RePair requires O(fl) more time to check the left and right extensions for all occurrences of the focused pair. However, the length of the entire text is shortened by at least f(l − 1) by the replacement. Therefore, summing over all replacements performed through the steps of the algorithm, MR-RePair works in O(n) expected time.

Remark 2.
We can convert a grammar of RePair to that of MR-RePair by repeating the following transform: If a variable v appears only once on the right-hand side of other rules, the rule can be removed for v, and the one occurrence of v can be replaced with the right-hand side of the removed rule. However, the time and space complexity stated in Theorem 7 cannot be achieved in this manner, since additional operations and memory for searching and storing such variables are required.
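The transform described in this remark can be sketched as follows. The function name `inline_single_use`, the dictionary representation of the rule set, and the input grammar (the pair rules that RePair may generate for abracadabra, with the Step 1 rules left out for brevity) are assumptions of this illustration.

```python
from collections import Counter

def inline_single_use(rules, start="S"):
    """Repeatedly remove a rule whose variable is used exactly once and
    paste its right-hand side at that single occurrence."""
    rules = {v: list(rhs) for v, rhs in rules.items()}   # work on a copy
    while True:
        uses = Counter(s for rhs in rules.values() for s in rhs if s in rules)
        once = [v for v in rules if v != start and uses[v] == 1]
        if not once:
            return rules
        v = once[0]
        body = rules.pop(v)
        for rhs in rules.values():
            if v in rhs:
                i = rhs.index(v)
                rhs[i:i + 1] = body       # substitute the single occurrence
                break

def grammar_size(rules):
    """Total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())

repair_rules = {
    "v1": ["a", "b"],
    "v2": ["v1", "r"],
    "v3": ["v2", "a"],
    "S": ["v3", "c", "a", "d", "v3"],
}
merged = inline_single_use(repair_rules)
print(grammar_size(repair_rules), grammar_size(merged))  # 11 9
print(merged["v3"])  # ['a', 'b', 'r', 'a']
```

The sketch recounts all variable uses after every removal, which illustrates the extra bookkeeping the remark refers to.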

Experiments
We implemented MR-RePair and conducted comparative experiments. In particular, we compared the sizes of constructed grammars and execution times of the proposed MR-RePair, several existing RePair implementations, and Re-PairImp (https://bitbucket.org/IguanaBen/repairimproved), which was recently proposed in [20] as an improvement of RePair.
As stated in Remark 1, the MR-order affects the size of a constructed grammar. In practice, the MR-order varies depending on the implementation of the priority queue that manages pairs. For this reason, we used four different implementations of RePair in the comparative analysis, namely those by Maruyama (https://code.google.com/archive/p/re-pair/), Navarro (https://www.dcc.uchile.cl/~gnavarro/software/index.html), Prezza (https://github.com/nicolaprezza/Re-Pair) [7], and Wan (https://github.com/rwanwork/Re-Pair; we ran it with level 0, i.e., no heuristic option).
Table 1 lists the details of the texts that we used in the experiments. In particular, we employed three texts as highly repetitive texts: one is a randomly generated text (rand77.txt), and the other two are a Fibonacci string (fib41) and a German text (einstein.de.txt) selected from the Repetitive Corpus of the Pizza&Chili Corpus (http://pizzachili.dcc.uchile.cl/repcorpus.html). The randomly generated text, rand77.txt, consists of alphanumeric symbols and some special symbols. It was generated by concatenating 32 copies of a block that includes 1024 random strings of length 64 each, i.e., the size of the randomly generated text is 64 × 1024 × 32 = 2,097,152 bytes. In addition, we used three texts (E.coli, bible.txt, and world192.txt) selected from the Large Corpus (http://corpus.canterbury.ac.nz/descriptions/#large) to consider a real-data case.
We executed each program seven times for each text and measured the elapsed CPU time only for the grammar generation process. We calculated the average time across five results, excluding the minimum and maximum values among the seven runs. The experiments were run on a computer equipped with a 6-core Intel(R) Core i7-8700 CPU (3.2-4.6 GHz) and 32 GB of RAM, running Ubuntu 16.04. All of the programs were compiled using gcc version 7.4 with the "-O3" option. Table 2 summarizes the experimental results.
Unfortunately, Re-PairImp was unable to process fib41 in our experimental environment because of a lack of memory. Here, we excluded the rules that generate a single terminal symbol from the number of rules, since these rules are the same for RePair and MR-RePair. As shown in the table, the sizes of the grammars constructed by the RePair implementations differ from each other for all texts except fib41. In all cases, MR-RePair is not inferior to RePair in terms of grammar size, although Theorem 6 shows that the grammar constructed by MR-RePair can be larger than that constructed by RePair if their MR-orders differ. For rand77.txt, the number of rules and the size of the grammar for MR-RePair decreased to about 11% and 55% of those for RePair, respectively. Long maximal repeats occur more frequently in rand77.txt than in the other texts, and we consider this to be a main reason for the remarkable effectiveness of MR-RePair on this text.
For einstein.de.txt, the number of rules and the size of the grammar decreased to about 44% and 72% of those for RePair, respectively. By contrast, the effect of the improvement was limited for the texts from the Large Corpus, which are not highly repetitive. Note that fib41 does not contain any maximal repeats longer than 2 without overlaps. Therefore, MR-RePair generated the same rules as RePair in this case. It should also be noted that MR-RePair runs at a speed comparable to the fastest implementation of RePair.
Table 2. Sizes of generated grammars and execution times of the considered algorithms. Each cell in the table lists, from the top row, the number of generated rules, the total length of the right-hand sides of all rules except that of the start variable, the length of the right-hand side of the start variable, and the total grammar size. The total grammar size in the fourth row is the sum of the values in the second and third rows. The fifth row, separated by a line, gives the execution time for compression in seconds. The best results are highlighted in bold.
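The four size-related quantities reported in each cell of Table 2 can be computed from a grammar as in the following sketch. The function name, the dictionary representation, and the toy grammar are hypothetical. The sketch also assumes that the rule count excludes the start rule and that rules generating a single terminal have already been left out.

```python
def grammar_stats(rules, start):
    """Return (num_rules, rules_len, start_len, total_size) for a grammar.

    rules maps each variable to its right-hand side (a list of symbols).
    Rules that derive a single terminal are assumed to be omitted already.
    """
    non_start = {var: rhs for var, rhs in rules.items() if var != start}
    num_rules = len(non_start)  # assumption: the start rule is not counted
    rules_len = sum(len(rhs) for rhs in non_start.values())
    start_len = len(rules[start])
    return num_rules, rules_len, start_len, rules_len + start_len


# Hypothetical toy grammar deriving "abcabcab".
toy = {
    "v1": ["a", "b"],
    "v2": ["v1", "c"],
    "S": ["v2", "v2", "v1"],
}
print(grammar_stats(toy, "S"))  # (2, 4, 3, 7)
```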

Conclusions
In this study, we analyzed the process of RePair and revealed that the RePair algorithm replaces the most frequent pairs step by step within their corresponding most frequent maximal repeats. On the basis of this analysis, we designed MR-RePair, a novel variant of RePair. Instead of consecutively substituting the most frequent pairs, MR-RePair performs one-time substitution of the most frequent maximal repeats. Furthermore, we implemented MR-RePair and compared the sizes of its constructed grammars to those of the grammars constructed by several implementations of RePair. Through the experiments, we confirmed the effectiveness of MR-RePair especially for highly repetitive texts.
We defined the greatest size difference of any two possible grammars that can be generated by RePair for a given text, naming it GSDRP. We demonstrated that a lower bound of GSDRP is $\frac{1}{6}(\sqrt{6n+1}+13)$ for a given text of length $n$. We leave improving the lower bound and showing an upper bound of GSDRP as future work.
We estimated the effectiveness of the compression using the size of the generated grammars instead of the length of the output bits. Reducing the grammar size has important implications since the majority of the existing text algorithms applied to grammar-compressed texts, including grammar-based self-indexes [21,22], edit distance computation [23], q-gram mining [24,25], and pattern matching [26][27][28], have time/space complexities that depend on the input grammar size. For instance, the compressed indexes proposed by Claude and Navarro [21,22] can be directly built on MR-RePair grammar-compressed texts. Algorithms designed specifically for straight-line programs (SLPs), which are grammar-compressed representations of texts in Chomsky normal form, can also be easily modified to work on grammars that are not in Chomsky normal form, such as MR-RePair grammars. Hence, MR-RePair serves as a base for practical improvements of these algorithms.
From the viewpoint of storing data more compactly, developing a method for encoding constructed grammars is another important issue. Implementing an efficient encoding method for MR-RePair is one of our future tasks.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Figures A1-A3 are provided to help in understanding the proof of Theorem 4. Let $G_{rp}$, $G_{nmr}$, and $G_{mr}$ be the grammars generated by RePair, Naïve-MR-RePair, and MR-RePair, respectively. For a given text $T = a_1 \cdots a_n$ ($a_i \in \Sigma$, $1 \le i \le n$) of length $|T| = n$, let $g_{rp}$, $g_{nmr}$, and $g_{mr}$ be the sizes of $G_{rp}$, $G_{nmr}$, and $G_{mr}$, respectively. Let us assume that $T = (aw)^{2(2^m-1)+1}a$, where $w \in \Sigma^+$ such that $awa$ is the most frequent maximal repeat of $T$ and $m \in \mathbb{N}^+$. Then, according to the proof of Theorem 4, $g_{nmr} > g_{rp}$ holds for some $m$ and $w$ such that $(m-1)(|w|-1) > 1$. Figure A1 illustrates a specific example of the grammar generation process of RePair and $G_{rp}$ for $T = (abcd)^7 a$ with $m = 2$ and $|w| = 3$. The size $g_{rp}$ is 18 in this example. Figure A2 illustrates an example of the process of Naïve-MR-RePair and $G_{nmr}$ for the same $T$. It can be seen from the figures that the size $g_{nmr}$ is 19, and thus $g_{nmr} > g_{rp}$ holds. As shown in Figure A2, Naïve-MR-RePair may fail to extract repetitive patterns in particular cases (such as $abcd$ of $(abcd)^7 a$ in the running example). However, this problem can be solved using MR-RePair. Figure A3 illustrates an example of the process of MR-RePair and $G_{mr}$ for the same $T = (abcd)^7 a$. The size $g_{mr}$ is 16, which is smaller than $g_{rp} = 18$. Although the most frequent maximal repeat at the second replacement step is $v_a v_b v_c v_d v_a$, MR-RePair replaces $v_a v_b v_c v_d$ with a new variable $v_1$, owing to the additional Step 3 in Definition 3.
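For readers who wish to trace the pair-replacement process on the running example, the following is a naive quadratic-time sketch of RePair. It is not the authors' implementation and uses no priority queue. Ties among equally frequent pairs are broken by first occurrence, overlapping occurrences are counted greedily from left to right, and terminals are kept directly in the rules instead of being derived through single-terminal rules. All of these are choices made here, so the resulting grammar and its size need not coincide with those in Figure A1.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of adjacent pairs, left to right."""
    counts = Counter()
    last_pos = {}  # start position of the last counted occurrence per pair
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the previous occurrence (e.g., "aaa")
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def repair(text):
    """Return (start_sequence, rules); variables are ints, terminals are str."""
    seq = list(text)
    rules = {}
    next_var = 1
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        # max() returns the first pair seen among those with the top count.
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break
        rules[next_var] = list(pair)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_var)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_var += 1
    return seq, rules


def expand(seq, rules):
    """Derive the text generated by the start sequence."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            stack.extend(reversed(rules[sym]))
        else:
            out.append(sym)
    return "".join(out)


text = "abcd" * 7 + "a"
start, rules = repair(text)
print(start, rules)
```

Expanding the start sequence with `expand(start, rules)` reproduces the input text, which is a quick sanity check for any alternative tie-breaking rule.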