<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Algorithms</journal-id>
<journal-title>Algorithms</journal-title>
<issn pub-type="epub">1999-4893</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/a4040262</article-id>
<article-id pub-id-type="publisher-id">algorithms-04-00262</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>The Smallest Grammar Problem as Constituents Choice and Minimal Grammar Parsing</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Carrascosa</surname><given-names>Rafael</given-names></name><xref ref-type="aff" rid="af1-algorithms-04-00262"><sup>1</sup></xref><xref ref-type="corresp" rid="c1-algorithms-04-00262"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Coste</surname><given-names>François</given-names></name><xref ref-type="aff" rid="af2-algorithms-04-00262"><sup>2</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Gallé</surname><given-names>Matthias</given-names></name><xref ref-type="aff" rid="af2-algorithms-04-00262"><sup>2</sup></xref><xref ref-type="corresp" rid="c1-algorithms-04-00262"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Infante-Lopez</surname><given-names>Gabriel</given-names></name><xref ref-type="aff" rid="af1-algorithms-04-00262"><sup>1</sup></xref></contrib></contrib-group>
<aff id="af1-algorithms-04-00262">
<label>1</label> Grupo de Procesamiento de Lenguaje Natural Universidad Nacional de Córdoba/Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina</aff>
<aff id="af2-algorithms-04-00262">
<label>2</label> Symbiose Project, IRISA/INRIA Rennes-Bretagne Atlantique, France</aff>
<author-notes>
<corresp id="c1-algorithms-04-00262">
<label>*</label>Authors to whom correspondence should be addressed; E-Mails: <email>rafacarrascosa@gmail.com</email> (R.C.); <email>mgalle@gmail.com</email> (M.G.).</corresp></author-notes>
<pub-date pub-type="collection">
<year>2011</year></pub-date>
<pub-date pub-type="epub">
<day>26</day>
<month>10</month>
<year>2011</year></pub-date>
<volume>4</volume>
<issue>4</issue>
<fpage>262</fpage>
<lpage>284</lpage>
<history>
<date date-type="received">
<day>12</day>
<month>10</month>
<year>2011</year></date>
<date date-type="accepted">
<day>14</day>
<month>10</month>
<year>2011</year></date></history>
<permissions>
<copyright-statement>© 2011 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2011</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time by parsing longer constituents with smaller ones. We propose new algorithms based on classical practical algorithms that use this optimization to find small grammars. Our algorithms consistently find smaller grammars on a classical benchmark, reducing the size by 10% in some cases. Moreover, our formulation allows us to define interesting bounds on the number of small grammars and to empirically compare different grammars of small size.</p></abstract>
<kwd-group>
<kwd>smallest grammar problem</kwd>
<kwd>hierarchical structure inference</kwd>
<kwd>optimal parsing</kwd>
<kwd>data discovery</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery.</p>
<p>The size of a smallest grammar can be considered a computable variant of Kolmogorov complexity, in which the Turing machine description of the sequence is restricted to context-free grammars. The problem is then decidable, but still hard: the problem of finding a smallest grammar with an approximation ratio smaller than 
<inline-formula>
<mml:math id="mm1" display="inline">
<mml:semantics id="sm1">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>8569</mml:mn></mml:mrow>
<mml:mrow>
<mml:mn>8568</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:semantics></mml:math></inline-formula> is NP-hard [<xref ref-type="bibr" rid="b1-algorithms-04-00262">1</xref>]. Nevertheless, a 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(log<sup>3</sup> <italic>n</italic>) approximation ratio—with <italic>n</italic> the length of the sequence—can be achieved by a simple algorithmic scheme based on an approximation to the shortest superstring problem [<xref ref-type="bibr" rid="b1-algorithms-04-00262">1</xref>] and a smaller 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(log <italic>n/g</italic>) (where <italic>g</italic> is the size of a smallest grammar) approximation ratio is possible through more complex mappings from the LZ77-factorization of the sequence to a context-free grammar with a balanced parsing tree [<xref ref-type="bibr" rid="b1-algorithms-04-00262">1</xref>,<xref ref-type="bibr" rid="b2-algorithms-04-00262">2</xref>].</p>
<p>If the grammar is small, storing the grammar instead of the sequence can be interesting from a data compression perspective. Kieffer and Yang developed the formal framework of compression by <italic>Grammar Based Codes</italic> from the viewpoint of information theory, defining irreducible grammars and demonstrating their universality [<xref ref-type="bibr" rid="b3-algorithms-04-00262">3</xref>]. Before this formalization, several algorithms that compress a sequence by means of a context-free grammar had already been proposed. The LZ78-factorization introduced by Ziv and Lempel in [<xref ref-type="bibr" rid="b4-algorithms-04-00262">4</xref>] can be interpreted as a context-free grammar. Let us remark that this is not true for LZ77, published one year before [<xref ref-type="bibr" rid="b5-algorithms-04-00262">5</xref>]. Moreover, it is a commonly used result that the size of an LZ77-factorization is a lower bound on the size of a smallest grammar [<xref ref-type="bibr" rid="b1-algorithms-04-00262">1</xref>,<xref ref-type="bibr" rid="b2-algorithms-04-00262">2</xref>]. The first approach that explicitly generated a context-free grammar with compression ability is <italic>Sequitur</italic> [<xref ref-type="bibr" rid="b6-algorithms-04-00262">6</xref>]. Like LZ77 and LZ78, Sequitur is an on-line algorithm that processes the sequence from left to right. It incrementally maintains a grammar generating the part of the sequence read so far, introducing and deleting productions to ensure that no digram (pair of adjacent symbols) occurs more than once and that each rule is used at least twice. Other algorithms consider the entire sequence before choosing which repeated substring will be rewritten by the introduction of a new rule. 
Most of these offline algorithms proceed in a greedy manner, selecting in each iteration one repeated word <italic>w</italic> according to a score function, replacing all the (non-overlapping) occurrences of the repeat <italic>w</italic> in the whole grammar by a new non-terminal <italic>N</italic>, and adding the new production <italic>N</italic> → <italic>w</italic> to the grammar. Different heuristics have been used to choose the repeat: the most frequent one [<xref ref-type="bibr" rid="b7-algorithms-04-00262">7</xref>], the longest [<xref ref-type="bibr" rid="b8-algorithms-04-00262">8</xref>] and the one that most reduces the size of the resulting grammar (C<sc>ompressive</sc> [<xref ref-type="bibr" rid="b9-algorithms-04-00262">9</xref>]). G<sc>reedy</sc> [<xref ref-type="bibr" rid="b10-algorithms-04-00262">10</xref>] belongs to this last family, but the score used for choosing the words is oriented toward directly optimizing the number of bits needed to encode the grammar rather than minimizing its size. The running time of Sequitur is linear, and linear-time implementations of the first two algorithms exist: R<sc>e</sc>P<sc>air</sc> [<xref ref-type="bibr" rid="b11-algorithms-04-00262">11</xref>] and L<sc>ongest</sc> F<sc>irst</sc> [<xref ref-type="bibr" rid="b12-algorithms-04-00262">12</xref>], while the existence of a linear-time algorithm for C<sc>ompressive</sc> and G<sc>reedy</sc> remains an open question.</p>
<p>In pattern discovery, a smallest grammar is a good candidate for being the one that generates the data according to Occam's razor principle. In that case, the grammar may not only be used for compressing the sequence but also to unveil its structure. Inference of the hierarchical structure of sequences was the initial motivation of Sequitur and has been the subject of several papers applying this scheme to DNA sequences [<xref ref-type="bibr" rid="b6-algorithms-04-00262">6</xref>,<xref ref-type="bibr" rid="b13-algorithms-04-00262">13</xref>,<xref ref-type="bibr" rid="b14-algorithms-04-00262">14</xref>], musical scores [<xref ref-type="bibr" rid="b15-algorithms-04-00262">15</xref>] or natural languages [<xref ref-type="bibr" rid="b7-algorithms-04-00262">7</xref>,<xref ref-type="bibr" rid="b16-algorithms-04-00262">16</xref>]. It can also be a first step to learn more general grammars along the lines of [<xref ref-type="bibr" rid="b17-algorithms-04-00262">17</xref>]. In all the latter cases, a slight difference in the size of the grammar, which would not matter for data compression, can dramatically change the results with respect to the structure. Thus, more sophisticated algorithms than those for data compression are needed.</p>
<p>In this article, which is an extended version of [<xref ref-type="bibr" rid="b18-algorithms-04-00262">18</xref>], we focus on how to choose occurrences that are going to be rewritten. This issue was generally handled straightforwardly in former algorithms by selecting <italic>all</italic> the non-overlapping occurrences in a left to right order. Moreover, once an occurrence had been chosen for being rewritten, the result was definitive and was not altered by the words chosen in the following iterations. In order to remedy these flaws, we show how to globally optimize the choice of the occurrences to be replaced by non-terminals. We are then able to improve classical greedy algorithms by introducing this optimization step at each iteration of the algorithm. This optimization allows us to define the smallest grammar problem in terms of two complementary optimization problems: the choice of non-terminals and the choice of their occurrences. We redefine the search space, prove that all solutions are contained in it and propose a new algorithm performing a wider search by adding the possibility to discard non-terminals previously included in the grammar. Thanks to our new formulation, we are able to analyze the number of different grammars with the same minimal size and present empirical results that measure the conservation of structure among them.</p>
<p>The outline of this paper is the following: in Section 2 we formally introduce the definitions and in Section 3 the classical offline algorithms. Section 4 contains our main contributions. In Section 4.1 we show how to optimize the choice of occurrences to be replaced by non-terminals for a set of words and then extend offline algorithms by optimizing the choice of the occurrences at each step in Section 4.2. We present our search space and show that this optimization can also be used directly to guide the search in a new algorithm in Section 4.3. We present experiments on a classical benchmark in Section 5 showing that the occurrence optimization consistently yields smaller grammars. In Section 6 we consider the number of smallest grammars that may exist and discuss the consequences of our results on structure discovery.</p></sec>
<sec>
<label>2.</label>
<title>Previous Work and Definitions</title>
<sec>
<label>2.1.</label>
<title>Definitions and Notation</title>
<p>We start by giving a few definitions and setting up the nomenclature that we use throughout the paper. A <italic>string s</italic> is a sequence of characters <italic>s</italic><sub>1</sub> … <italic>s<sub>n</sub></italic>, and its length is |<italic>s</italic>| = <italic>n</italic>. <italic>∊</italic> denotes the empty string, and <italic>s</italic>[<italic>i</italic> : <italic>j</italic>] = <italic>s<sub>i</sub> … s<sub>j</sub></italic>, with <italic>s</italic>[<italic>i</italic> : <italic>j</italic>] = <italic>∊</italic> if <italic>j</italic> &lt; <italic>i</italic>. We say that a word <italic>w</italic> occurs at position <italic>i</italic> if <italic>w</italic> = <italic>s</italic>[<italic>i</italic> : <italic>i</italic> + |<italic>w</italic>| − 1]. <italic>w</italic> is a repeat of <italic>s</italic> if it occurs more than once in <italic>s</italic> and |<italic>w</italic>| &gt; 1. We denote by <italic>repeats</italic>(<italic>s</italic>) the set of substrings of <italic>s</italic> that occur more than once and have length greater than one. For example, <italic>repeats</italic>(<italic>abaaab</italic>) = {<italic>ab, aa</italic>}, and <italic>aa</italic> occurs at positions 3 and 4.</p>
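<p>As an illustration of these definitions, the following minimal Python sketch enumerates occurrences and repeats naively. The function names <italic>occurrences</italic> and <italic>repeats</italic> are chosen here for illustration only, and the quadratic enumeration of substrings is an assumption made for brevity rather than an efficient method.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def occurrences(w, s):
    """Return the 1-based positions i such that w == s[i : i + |w| - 1]."""
    return [i + 1 for i in range(len(s) - len(w) + 1) if s.startswith(w, i)]


def repeats(s):
    """Return the substrings of s of length > 1 that occur more than once.

    Every substring of length at least two is counted once per starting
    position, so overlapping occurrences are counted as well.
    """
    counts = defaultdict(int)
    for i in range(len(s)):
        for j in range(i + 2, len(s) + 1):
            counts[s[i:j]] += 1
    return {w for w, c in counts.items() if c > 1}
```
]]></preformat>
<p>Positions are 1-based here, following the convention <italic>s</italic> = <italic>s</italic><sub>1</sub> … <italic>s<sub>n</sub></italic> used in the definitions.</p>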
<p>A <italic>context-free grammar</italic> is a tuple 〈Σ, 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>, <italic>S</italic>〉, where Σ is the finite set of <italic>terminals</italic> and 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/> the finite set of <italic>non-terminals</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/> and Σ disjoint. <italic>S</italic> ∈ 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/> is called the <italic>start symbol</italic> and 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> is the set of <italic>productions</italic>. Each production (also called <italic>rule</italic>) is of the form <italic>A</italic> → <italic>α</italic> where its <italic>left-hand side A</italic> is a non-terminal and its <italic>right-hand side α</italic> belongs to (Σ ∪ 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/>)*. We say 
<inline-formula>
<mml:math id="mm2" display="inline">
<mml:semantics id="sm2">
<mml:mrow>
<mml:mi>α</mml:mi>
<mml:mover>
<mml:mo>⇒</mml:mo>
<mml:mn>1</mml:mn></mml:mover>
<mml:mi>β</mml:mi></mml:mrow></mml:semantics></mml:math></inline-formula>, if <italic>α</italic> is of the form <italic>δCδ′, β</italic> = <italic>δγδ′</italic> and <italic>C</italic> → <italic>γ</italic> is a production. A sequence 
<inline-formula>
<mml:math id="mm3" display="inline">
<mml:semantics id="sm3">
<mml:mrow>
<mml:mi>α</mml:mi>
<mml:mover>
<mml:mo>⇒</mml:mo>
<mml:mn>1</mml:mn></mml:mover>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:mover>
<mml:mo>⇒</mml:mo>
<mml:mn>1</mml:mn></mml:mover>
<mml:mo>…</mml:mo>
<mml:mover>
<mml:mo>⇒</mml:mo>
<mml:mn>1</mml:mn></mml:mover>
<mml:mi>β</mml:mi></mml:mrow></mml:semantics></mml:math></inline-formula> is called a <italic>derivation</italic> and in this case we say that <italic>α produces β</italic> and that <italic>β derives</italic> from <italic>α</italic> (denoted by <italic>α</italic> ⇒ <italic>β</italic>).</p>
<p>Given a string <italic>α</italic> over (
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/> ∪ Σ)*, its <italic>constituents</italic> (<italic>cons</italic>(<italic>α</italic>)) are the possible strings of terminals that can be derived from <italic>α</italic>. Formally, <italic>cons</italic>(<italic>α</italic>) = {<italic>w</italic> ∈ Σ* : <italic>α</italic> ⇒ <italic>w</italic>}. The constituents of a grammar are all the constituents of its non-terminals. The <italic>language</italic> is the set of constituents of the start symbol <italic>S, cons</italic>(<italic>S</italic>).</p>
<p>Because the Smallest Grammar framework seeks a context-free grammar whose language contains one and only one string, the grammars we consider here neither branch (every non-terminal occurs at most once as a left-hand side of a rule) nor loop (if <italic>B</italic> occurs in any derivation starting with <italic>A</italic>, then <italic>A</italic> will not occur in a derivation starting with <italic>B</italic>). In this type of grammar, any substring of the grammar has a unique constituent, in which case we drop the set notation and define <italic>cons</italic>(<italic>α</italic>) as the only terminal string that can be derived from <italic>α</italic>. Note that if the grammar is in Chomsky Normal Form [<xref ref-type="bibr" rid="b19-algorithms-04-00262">19</xref>], it is equivalent to a straight-line program.</p>
<p>Several definitions of the grammar size exist. Following [<xref ref-type="bibr" rid="b9-algorithms-04-00262">9</xref>], we define the <italic>size of the grammar G</italic>, denoted by |<italic>G</italic>|, to be the length of its encoding by concatenation of its right-hand sides separated by end-of-rule markers: |<italic>G</italic>| =Σ<sub><italic>A</italic>→<italic>α</italic>∈
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(|<italic>α</italic>| + 1).</p></sec></sec>
<sec>
<label>3.</label>
<title>IRR</title>
<sec>
<label>3.1.</label>
<title>General Scheme</title>
<p>Most offline algorithms follow the same general scheme. First, the grammar is initialized with a unique initial rule <italic>S</italic> → <italic>s</italic> where <italic>s</italic> is the input sequence and then they proceed iteratively. At each iteration, a word <italic>ω</italic> occurring more than once in <italic>s</italic> is chosen according to a score function <italic>f</italic>, all the (non-overlapping) occurrences of <italic>ω</italic> in the grammar are replaced by a new non-terminal <italic>N<sub>ω</sub></italic> and a new production <italic>N<sub>ω</sub></italic> → <italic>ω</italic> is added to the grammar. We give pseudo-code for this general scheme that we name <italic>Iterative Repeat Replacement</italic> (IRR) in Algorithm 1. There, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> is the set of production rules being built: this defines a unique grammar <italic>G</italic>(
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) and therefore we define |
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>| = |<italic>G</italic>(
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>)|. The set of repeats of length greater than one in the right-hand sides of 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> is denoted by <italic>repeats</italic>(
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) and 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/><sub><italic>ω</italic>↦<italic>N</italic></sub> is the result of the substitution of <italic>ω</italic> by the new symbol <italic>N</italic> in the right-hand sides of 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> as detailed in the next paragraph.</p>
<p>When occurrences overlap, one has to specify which occurrences are to be replaced. One solution is to choose all the elements in the <italic>canonical list of non-overlapping occurrences</italic> of <italic>ω</italic> in <italic>s</italic>, which we define to be the list of non-overlapping occurrences of <italic>ω</italic> obtained in a greedy left-to-right way (all occurrences overlapping with the first selected occurrence are discarded, then the same is done with the second non-eliminated occurrence, <italic>etc.</italic>). This ensures that a maximal number of occurrences will be replaced. When searching for the smallest grammar, one has to consider not only the occurrences of a word in <italic>s</italic> but also its occurrences in the right-hand sides of rules that are currently part of the grammar. A canonical list of non-overlapping occurrences of <italic>ω</italic> can be defined for each right-hand side appearing in the set of production rules 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>. This provides a straightforward list of occurrences used in the scoring function or the replacement step by our pseudo-code defining IRR.</p>
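<p>The canonical list described above can be sketched as follows. This is a minimal illustration under stated assumptions: the name <italic>canonical_occurrences</italic> is hypothetical, a right-hand side is represented as any Python sequence of symbols, and positions are 0-based for convenience.</p>
<preformat><![CDATA[
```python
def canonical_occurrences(w, rhs):
    """Greedy left-to-right list of non-overlapping occurrences of w in rhs.

    w and rhs are sequences of symbols (strings, lists or tuples).
    Returned positions are 0-based start indices.
    """
    if len(w) == 0:
        raise ValueError("w must not be empty")
    w = tuple(w)
    positions = []
    i = 0
    while i + len(w) <= len(rhs):
        if tuple(rhs[i:i + len(w)]) == w:
            positions.append(i)
            i += len(w)  # skip every occurrence overlapping this one
        else:
            i += 1
    return positions
```
]]></preformat>
<p>For a whole set of production rules, the same function would simply be applied to each right-hand side in turn.</p>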
<p>The IRR schema instantiates different algorithms, depending on the score function <italic>f</italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) it uses. L<sc>ongest</sc> F<sc>irst</sc> corresponds to <italic>f</italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = <italic>f<sub>ML</sub></italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = |<italic>ω</italic>|. Choosing the most frequent repeat, as in R<sc>e</sc>P<sc>air</sc>, corresponds to using <italic>f</italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = <italic>f<sub>MF</sub></italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = <italic>o</italic><sub>
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(<italic>ω</italic>), where <italic>o</italic><sub>
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(<italic>ω</italic>) is the length of the canonical non-overlapping list of <italic>ω</italic> in the right-hand sides of rules in 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>. Note, however, that IRR is more general than R<sc>e</sc>P<sc>air</sc> and may select a word that is not a digram.</p>
<p>In order to derive a score function corresponding to C<sc>ompressive</sc>, note that replacing a word <italic>ω</italic> by a non-terminal results in a contraction of the grammar of (|<italic>ω</italic>| − 1)*<italic>o</italic><sub>
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(<italic>ω</italic>) and its inclusion in the grammar adds |<italic>ω</italic>| + 1 to the grammar size. This defines <italic>f</italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = <italic>f<sub>MC</sub></italic>(<italic>ω</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) = (|<italic>ω</italic>| − 1)*(<italic>o</italic><sub>
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(<italic>ω</italic>) − 1) − 2. We call these three algorithms IRR-ML (maximal length), IRR-MF (most frequent) and IRR-MC (maximal compression), respectively.</p>
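<p>The three score functions can be written down directly. The sketch below is illustrative only: the names <italic>f_ml</italic>, <italic>f_mf</italic>, <italic>f_mc</italic> and <italic>o</italic> mirror the notation above but are otherwise assumptions, and right-hand sides are represented as plain Python strings keyed by a non-terminal name.</p>
<preformat><![CDATA[
```python
def count_canonical(w, rhs):
    """Number of greedy left-to-right non-overlapping occurrences of w in rhs."""
    n, i = 0, 0
    while i + len(w) <= len(rhs):
        if rhs[i:i + len(w)] == w:
            n += 1
            i += len(w)
        else:
            i += 1
    return n


def o(w, rules):
    """Length of the canonical occurrence list of w over all right-hand sides."""
    return sum(count_canonical(w, rhs) for rhs in rules.values())


def f_ml(w, rules):
    """Maximal length: the score is |w|."""
    return len(w)


def f_mf(w, rules):
    """Most frequent: the score is the number of canonical occurrences."""
    return o(w, rules)


def f_mc(w, rules):
    """Maximal compression: (|w| - 1) * (o(w) - 1) - 2."""
    return (len(w) - 1) * (o(w, rules) - 1) - 2
```
]]></preformat>
<p>The maximal-compression score is the net reduction in grammar size: the contraction (|<italic>ω</italic>| − 1)·<italic>o</italic>(<italic>ω</italic>) minus the cost |<italic>ω</italic>| + 1 of the new rule equals (|<italic>ω</italic>| − 1)·(<italic>o</italic>(<italic>ω</italic>) − 1) − 2. For the single rule <italic>S</italic> → <italic>abcabcabc</italic> and <italic>ω</italic> = <italic>abc</italic>, the grammar size goes from 10 to 8, which matches a score of 2.</p>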
<array>
<tbody>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Algorithm 1</bold> Iterative Repeat Replacement.</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top">IRR(<italic>s, f</italic>)</td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Require:</bold> <italic>s</italic> is a sequence, and <italic>f</italic> is a score function</td></tr>
<tr>
<td align="right" valign="top">1:</td>
<td align="left" valign="top">
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> ← {<italic>N<sub>s</sub></italic> → <italic>s</italic>}</td></tr>
<tr>
<td align="right" valign="top">2:</td>
<td align="left" valign="top"><bold>while</bold> ∃<italic>ω</italic> : <italic>ω</italic> ← argmax<sub><italic>α</italic>∈repeats(
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/>)</sub> <italic>f</italic>(<italic>α</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>) ∧ |
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/><sub><italic>ω</italic>↦<italic>N<sub>ω</sub></italic></sub>| &lt; |
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>| <bold>do</bold></td></tr>
<tr>
<td align="right" valign="top">3:</td>
<td align="left" valign="top"> 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> ← 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/><sub><italic>ω</italic>↦<italic>N<sub>ω</sub></italic></sub> ∪ {<italic>N<sub>ω</sub></italic> → <italic>ω</italic>}</td></tr>
<tr>
<td align="right" valign="top">4:</td>
<td align="left" valign="top"><bold>end while</bold></td></tr>
<tr>
<td align="right" valign="top">5:</td>
<td align="left" valign="top"><bold>return</bold> <italic>G</italic>(
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>)</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr></tbody></array>
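<p>One possible executable reading of Algorithm 1 is sketched below in Python. Several choices are assumptions made for this illustration and are not prescribed by the pseudo-code: non-terminals are represented as tuples ("N", <italic>k</italic>) so that they cannot collide with terminal characters, repeats are enumerated naively instead of with a suffix tree, and ties in the score are broken in favour of the first candidate encountered. The helper names are hypothetical.</p>
<preformat><![CDATA[
```python
START = ("N", 0)  # start symbol; non-terminals are tuples, terminals are characters


def find(w, rhs):
    """Canonical (greedy left-to-right) non-overlapping occurrences of w in rhs."""
    hits, i = [], 0
    while i + len(w) <= len(rhs):
        if tuple(rhs[i:i + len(w)]) == w:
            hits.append(i)
            i += len(w)
        else:
            i += 1
    return hits


def size(rules):
    """|G|: sum over the rules of (length of the right-hand side + 1)."""
    return sum(len(rhs) + 1 for rhs in rules.values())


def repeats(rules):
    """Words of length > 1 occurring more than once in the right-hand sides."""
    counts = {}
    for rhs in rules.values():
        for i in range(len(rhs)):
            for j in range(i + 2, len(rhs) + 1):
                w = tuple(rhs[i:j])
                counts[w] = counts.get(w, 0) + 1
    return [w for w, c in counts.items() if c > 1]


def substitute(rules, w, name):
    """Replace the canonical occurrences of w by name and add the rule name -> w."""
    new_rules = {}
    for head, rhs in rules.items():
        hits, out, i = set(find(w, rhs)), [], 0
        while i < len(rhs):
            if i in hits:
                out.append(name)
                i += len(w)
            else:
                out.append(rhs[i])
                i += 1
        new_rules[head] = out
    new_rules[name] = list(w)
    return new_rules


def f_mc(w, rules):
    """Maximal-compression score (|w| - 1) * (o(w) - 1) - 2."""
    occ = sum(len(find(w, rhs)) for rhs in rules.values())
    return (len(w) - 1) * (occ - 1) - 2


def irr(s, f):
    """Iterative Repeat Replacement with score function f."""
    rules = {START: list(s)}
    fresh = 0
    while True:
        candidates = repeats(rules)
        if not candidates:
            break
        w = max(candidates, key=lambda a: f(a, rules))
        fresh += 1
        candidate_rules = substitute(rules, w, ("N", fresh))
        if size(candidate_rules) >= size(rules):
            break  # the best-scored repeat no longer shrinks the grammar
        rules = candidate_rules
    return rules


def expand(symbol, rules):
    """Terminal string derived from a symbol (its unique constituent)."""
    if symbol not in rules:
        return symbol
    return "".join(expand(x, rules) for x in rules[symbol])
```
]]></preformat>
<p>Calling <italic>irr</italic>(<italic>s</italic>, <italic>f_mc</italic>) returns the rule set as a dictionary; <italic>expand</italic>(START, rules) can be used to check that the grammar still generates <italic>s</italic>, and <italic>size</italic>(rules) gives |<italic>G</italic>|.</p>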
<p>The complexity of IRR when it uses one of these scores is 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>3</sup>) since for a sequence of size <italic>n</italic>, the computation of the scores involving only <italic>o</italic><sub>
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(<italic>ω</italic>) and |<italic>ω</italic>| of the 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>2</sup>) possible repeats can be done in 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>2</sup>) using a suffix tree structure [<xref ref-type="bibr" rid="b20-algorithms-04-00262">20</xref>] and the number of iterations is bounded by <italic>n</italic> since the size of the grammar strictly decreases at each step.</p>
<p>The grammars found by the three IRR algorithms, as well as by Sequitur and LZ78, are shown on a small example in <xref ref-type="fig" rid="f1-algorithms-04-00262">Figure 1</xref>. A comparison of the sizes of the grammars returned by these algorithms over a standard data compression corpus is presented in Section 5. These results confirm that IRR-MC finds the smallest grammars, as was suggested in [<xref ref-type="bibr" rid="b9-algorithms-04-00262">9</xref>]. Until now, no other polynomial-time algorithm (including theoretical algorithms that were designed to achieve a low approximation ratio [<xref ref-type="bibr" rid="b1-algorithms-04-00262">1</xref>,<xref ref-type="bibr" rid="b2-algorithms-04-00262">2</xref>,<xref ref-type="bibr" rid="b21-algorithms-04-00262">21</xref>]) has been shown, either theoretically or empirically, to perform better than IRR-MC.</p></sec>
<sec>
<label>3.2.</label>
<title>Limitations of IRR</title>
<p>Even though IRR algorithms are the best known practical algorithms for obtaining small grammars, they present some weaknesses. In the first place, their greedy strategy does not guarantee that the compression gain introduced by a selected word <italic>ω</italic> will still be interesting in the final grammar. Each time a later iteration selects a substring of <italic>ω</italic>, the length of the production rule is reduced; and each time a superstring is selected, its number of occurrences is reduced. Moreover, the first choices mark some breaking points, and future words may appear inside them or in other parts of the grammar, but never span these breaking points.</p>
<p>It could be argued that there may exist a score function that for every sequence scores the repeats in such a way that the order they are presented to IRR results in a smallest grammar. The following theorem proves that this is not the case.</p>
<sec>
<title>Theorem 1</title>
<p>There are infinitely many sequences <italic>s<sub>k</sub></italic>, over a constant alphabet, such that for any score function <italic>f</italic>, |<italic>IRR</italic>(<italic>s<sub>k</sub>, f</italic>)| is greater than the size of a smallest grammar for <italic>s<sub>k</sub></italic>.</p></sec>
<sec>
<title>Proof</title>
<p>Consider the infinite set of sequences
<disp-formula id="FD1">
<mml:math id="mm4" display="block">
<mml:semantics id="sm4">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>3</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>4</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>5</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>6</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>α</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>7</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>β</mml:mi>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>8</mml:mn></mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>γ</mml:mi>
<mml:mi>x</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>α, β</italic> and <italic>γ</italic> are any sequences of size greater than two, with pairwise disjoint alphabets not containing the symbol <italic>x</italic>. These sequences exploit the fact that IRR algorithms replace all possible occurrences of the selected word. Let us define <italic>G</italic>*(<italic>s<sub>k</sub></italic>) as the following grammar:
<disp-formula id="FD2">
<mml:math id="mm5" display="block">
<mml:semantics id="sm5">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>β</mml:mi></mml:msub>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:mi>B</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>γ</mml:mi></mml:msub>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>α</mml:mi></mml:msub>
<mml:mi>B</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>3</mml:mn></mml:msub>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>γ</mml:mi></mml:msub>
<mml:mi>B</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>4</mml:mn></mml:msub>
<mml:mi>B</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>α</mml:mi></mml:msub>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>5</mml:mn></mml:msub>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>β</mml:mi></mml:msub>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>6</mml:mn></mml:msub>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>7</mml:mn></mml:msub>
<mml:mi>B</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>8</mml:mn></mml:msub>
<mml:mi>C</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>α</mml:mi></mml:msub>
<mml:mi>x</mml:mi>
<mml:mspace width="1em"/>
<mml:mi>B</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>β</mml:mi></mml:msub>
<mml:mi>x</mml:mi>
<mml:mspace width="1em"/>
<mml:mi>C</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi>x</mml:mi>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>γ</mml:mi></mml:msub>
<mml:mi>x</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:semantics></mml:math></disp-formula>to which we add the rules of smallest grammars for <italic>α, β, γ</italic>, with start symbols <italic>S<sub>α</sub>, S<sub>β</sub>, S<sub>γ</sub></italic> respectively. The size of this grammar is |<italic>G</italic>*| = 42 + |<italic>G</italic><sub>min</sub>(<italic>α</italic>)| + |<italic>G</italic><sub>min</sub>(<italic>β</italic>)| + |<italic>G</italic><sub>min</sub>(<italic>γ</italic>)|, where <italic>G</italic><sub>min</sub>(<italic>s</italic>) denotes a smallest grammar for sequence <italic>s</italic>. Note that no IRR algorithm could generate <italic>G</italic>* and, moreover, by enumeration we find that the smallest possible grammar that can be obtained with an IRR algorithm has size 46 + |<italic>G</italic><sub>min</sub>(<italic>α</italic>)| + |<italic>G</italic><sub>min</sub>(<italic>β</italic>)| + |<italic>G</italic><sub>min</sub>(<italic>γ</italic>)|. This grammar has the same production rules as <italic>G</italic>*. In the case where <italic>α, β</italic> and <italic>γ</italic> are of size one (we then replace <italic>S<sub>α</sub></italic> by <italic>α</italic> and so on), this results in an approximation ratio of 
<inline-formula>
<mml:math id="mm6" display="inline">
<mml:semantics id="sm6">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>46</mml:mn></mml:mrow>
<mml:mrow>
<mml:mn>42</mml:mn></mml:mrow></mml:mfrac>
<mml:mo>≈</mml:mo>
<mml:mn>1.095</mml:mn></mml:mrow></mml:semantics></mml:math></inline-formula>. This is a general lower bound for <italic>any</italic> IRR algorithm.</p>
<p>Note that Theorem 1 makes no assumption about the score function <italic>f</italic> (such as polynomial running time). To find <italic>G</italic>*, the choice of which occurrences to rewrite must remain flexible, so that repeats introduced in later iterations can be taken into account.</p></sec></sec></sec>
<sec>
<label>4.</label>
<title>Choice of Occurrences</title>
<sec>
<label>4.1.</label>
<title>Global Optimization of Occurrences Replacement</title>
<p>Once an IRR algorithm has chosen a repeated word <italic>ω</italic>, it replaces all non-overlapping occurrences of that word in the current grammar by a new non-terminal <italic>N</italic> and then adds <italic>N</italic> → <italic>ω</italic> to the set of production rules. In this section, we propose to perform a global optimization of the replacement of occurrences, considering not only the last non-terminal but also all the previously introduced non-terminals. The idea is to allow occurrences of words to be kept (instead of being replaced by non-terminals) if replacing other occurrences of words overlapping them results in a better compression.</p>
<p>We propose to separate the choice of which terminal strings will be constituents of the final grammar from the choice of which occurrences of these constituents will be replaced by non-terminals. First, let us assume that a set of constituents {<italic>s</italic>} ∪ <italic>Q</italic> is given and that we want to find a smallest grammar whose constituent set is {<italic>s</italic>} ∪ <italic>Q</italic>. If we denote this set by {<italic>s</italic> = <italic>ω</italic><sub>0</sub>, <italic>ω</italic><sub>1</sub>, …, <italic>ω<sub>m</sub></italic>}, the grammar must be able to generate these constituents, so for each constituent <italic>ω<sub>i</sub></italic> it must have a non-terminal <italic>N<sub>i</sub></italic> such that <italic>ω<sub>i</sub></italic> = <italic>cons</italic>(<italic>N<sub>i</sub></italic>). In the smallest grammar problem, no unnecessary rule should be introduced, since the grammar has to generate only one sequence. More precisely, such a grammar must have exactly <italic>m</italic> + 1 non-terminals and associated production rules.</p>
<p>For this, we define a new problem, called the <italic>Minimal Grammar Parsing</italic> (MGP) Problem. An instance of this problem is a set of strings {<italic>s</italic>} ∪ <italic>Q</italic>, such that all strings of <italic>Q</italic> are substrings of <italic>s</italic>. A Minimal Grammar Parsing of {<italic>s</italic>} ∪ <italic>Q</italic> is a context-free grammar <italic>G</italic> = 〈Σ, 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>, <italic>S</italic>〉 such that:
<list list-type="order">
<list-item>
<p>all symbols of <italic>s</italic> are in Σ.</p></list-item>
<list-item>
<p><italic>S</italic> derives only <italic>s</italic>.</p></list-item>
<list-item>
<p>for each string <italic>ω<sub>ℓ</sub></italic> ∈ {<italic>s</italic>} ∪ <italic>Q</italic> there is a non-terminal <italic>N<sub>ℓ</sub></italic> that derives only <italic>ω<sub>ℓ</sub></italic> and 
<inline-graphic xlink:href="algorithms-04-00262i2.gif"/> = {<italic>N</italic><sub>0</sub> … <italic>N<sub>m</sub></italic>}.</p></list-item>
<list-item>
<p>for each non-terminal <italic>N</italic> there is a string <italic>s</italic>′ of {<italic>s</italic>} ∪ <italic>Q</italic> such that <italic>N</italic> derives <italic>s′</italic>.</p></list-item>
<list-item>
<p>|<italic>G</italic>| is of minimal size for all possible grammars that satisfy conditions 1–4.</p></list-item></list></p>
<p>Note that this is similar to the Smallest Grammar Problem, except that all constituents for the non-terminals of the grammar are also given. The MGP problem is related to the problem of static dictionary parsing [<xref ref-type="bibr" rid="b22-algorithms-04-00262">22</xref>] with the difference that the dictionary also has to be parsed. This recursive approach is partly what makes grammars interesting to both compression and structure discovery.</p>
<p>As an example, consider the sequence <italic>s</italic> = <italic>ababbababbabaabbabaa</italic> and suppose the constituents are {<italic>s, abbaba, bab</italic>}. This defines the set of non-terminals {<italic>N</italic><sub>0</sub>, <italic>N</italic><sub>1</sub>, <italic>N</italic><sub>2</sub>}, such that <italic>cons</italic>(<italic>N</italic><sub>0</sub>) = <italic>s</italic>, <italic>cons</italic>(<italic>N</italic><sub>1</sub>) = <italic>abbaba</italic> and <italic>cons</italic>(<italic>N</italic><sub>2</sub>) = <italic>bab</italic>. A minimal parsing is <italic>N</italic><sub>0</sub> → <italic>aN</italic><sub>2</sub><italic>N</italic><sub>2</sub><italic>N</italic><sub>1</sub><italic>N</italic><sub>1</sub><italic>a</italic>, <italic>N</italic><sub>1</sub> → <italic>abN</italic><sub>2</sub><italic>a</italic> and <italic>N</italic><sub>2</sub> → <italic>bab</italic>.</p>
<p>This problem can be solved in a classical way in polynomial time by searching for a shortest path in |<italic>Q</italic>| + 1 graphs as follows. Given the set of constituents {<italic>s</italic> = <italic>ω</italic><sub>0</sub>, <italic>ω</italic><sub>1</sub>, …, <italic>ω<sub>m</sub></italic>}:</p>
<list list-type="order">
<list-item>
<p>Let {<italic>N</italic><sub>0</sub>, <italic>N</italic><sub>1</sub>, …, <italic>N<sub>m</sub></italic>} be the set of non-terminals. Each <italic>N<sub>ℓ</sub></italic> will be the non-terminal whose constituent is <italic>ω<sub>ℓ</sub></italic>.</p></list-item>
<list-item>
<p>Define <italic>m</italic> + 1 directed acyclic graphs Γ<sub>0</sub> … Γ<sub><italic>m</italic></sub>, where Γ<italic><sub>ℓ</sub></italic> = 〈<italic>M<sub>ℓ</sub>, E<sub>ℓ</sub></italic>〉. If |<italic>ω<sub>ℓ</sub></italic>| = <italic>k</italic> then the graph Γ<italic><sub>ℓ</sub></italic> will have <italic>k</italic> + 1 nodes: <italic>M<sub>ℓ</sub></italic> = {1 … |<italic>ω<sub>ℓ</sub></italic>| + 1}. The edges <italic>E<sub>ℓ</sub></italic> are of two types:
<list list-type="alpha-lower">
<list-item>
<p>for every node <italic>i</italic> ≤ |<italic>ω<sub>ℓ</sub></italic>| there is an edge to node <italic>i</italic> + 1 labeled with <italic>ω<sub>ℓ</sub></italic>[<italic>i</italic>].</p></list-item>
<list-item>
<p>there will be an edge from node <italic>i</italic> to <italic>j</italic> + 1 labeled by <italic>N<sub>m</sub></italic> if there exists a non-terminal <italic>N<sub>m</sub></italic> different from <italic>N<sub>ℓ</sub></italic> such that <italic>ω<sub>ℓ</sub></italic>[<italic>i</italic> : <italic>j</italic>] = <italic>ω<sub>m</sub></italic>.</p></list-item></list></p></list-item>
<list-item>
<p>For each Γ<italic><sub>ℓ</sub></italic>, find a shortest path from 1 to |<italic>ω<sub>ℓ</sub></italic>| + 1.</p></list-item>
<list-item>
<p>Return the labels of these paths.</p></list-item></list>
<p>The right-hand side for non-terminal <italic>N<sub>ℓ</sub></italic> is the concatenation of the labels of a shortest path of Γ<italic><sub>ℓ</sub></italic>. Intuitively, an edge from node <italic>i</italic> to node <italic>j</italic> + 1 with label <italic>N<sub>m</sub></italic> represents a possible replacement of the occurrence <italic>ω<sub>ℓ</sub></italic>[<italic>i</italic> : <italic>j</italic>] by <italic>N<sub>m</sub></italic>. There may be more than one grammar parsing of minimal size. If <italic>Q</italic> is a subset of the repeats of the sequence <italic>s</italic>, we denote by <italic>mgp</italic>({<italic>s</italic>} ∪ <italic>Q</italic>) the set of production rules 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> corresponding to one of the minimal grammar parsing of {<italic>s</italic>} ∪ <italic>Q</italic>.</p>
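<p>The shortest-path procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used for the experiments: the names <italic>mgp</italic> and <italic>expand</italic> are hypothetical, nodes are numbered from 0 instead of 1, non-terminal <italic>N<sub>k</sub></italic> is represented by the integer <italic>k</italic>, and the constituents are assumed to be distinct and non-empty. Since every edge points to a higher-numbered node, a single left-to-right relaxation pass finds a shortest path.</p>
<preformat preformat-type="code"><![CDATA[
```python
def mgp(constituents):
    """Return one minimal right-hand side per constituent.

    constituents[0] is the whole sequence s; the remaining entries are
    distinct, non-empty substrings of s. A right-hand side is a list whose
    items are terminals (one-character strings) or integers k denoting N_k.
    """
    rules = []
    for l, w in enumerate(constituents):
        n = len(w)
        # Nodes are 0..n here (the text numbers them 1..n+1).
        dist = [0] + [n + 1] * n   # dist[i]: fewest symbols deriving w[:i]
        back = [None] * (n + 1)    # back[j]: (previous node, edge label)
        for i in range(n):         # nodes are already in topological order
            edges = [(i + 1, w[i])]                  # type (a): one terminal
            edges += [(i + len(u), k)                # type (b): occurrence of N_k
                      for k, u in enumerate(constituents)
                      if k != l and w.startswith(u, i)]
            for j, label in edges:
                if dist[i] + 1 < dist[j]:
                    dist[j], back[j] = dist[i] + 1, (i, label)
        rhs, i = [], n
        while i > 0:               # walk the shortest path backwards
            i, label = back[i]
            rhs.append(label)
        rules.append(rhs[::-1])
    return rules


def expand(rules, k=0):
    """Terminal string derived from non-terminal N_k."""
    return "".join(x if isinstance(x, str) else expand(rules, x)
                   for x in rules[k])


s = "ababbababbabaabbabaa"
rules = mgp([s, "abbaba", "bab"])
print([len(rhs) for rhs in rules])   # lengths of the three right-hand sides
print(expand(rules) == s)            # the start rule must derive s
```
]]></preformat>
<p>On the example above, the three right-hand sides have 6, 4 and 3 symbols, which matches the minimal parsing given in the text. The right-hand side returned for <italic>N</italic><sub>0</sub> may be a different shortest path of the same length.</p>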
<p>The list of occurrences of each constituent over the original sequence can be stored at the moment it is chosen. Supposing then that the graphs are created, and as the length of each constituent is bounded by <italic>n</italic> = |<italic>s</italic>|, the complexity of finding a shortest path for one graph with a classical dynamic programming algorithm lies in 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic> × <italic>m</italic>). Because there are <italic>m</italic> + 1 graphs, computing <italic>mgp</italic>({<italic>s</italic>} ∪ <italic>Q</italic>) is in 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic> × <italic>m</italic><sup>2</sup>).</p>
<p>Note that in practice the graph Γ<sub>0</sub> contains all the information for all other graphs: any Γ<italic><sub>ℓ</sub></italic> is a subgraph of Γ<sub>0</sub>. Therefore, we call Γ<sub>0</sub> the Grammar Parsing graph (<italic>GP-graph</italic>).</p></sec>
<sec>
<label>4.2.</label>
<title>IRR with Occurrence Optimization</title>
<p>We can now define a variant of IRR, called <italic>Iterative Repeat Choice with Occurrence Optimization</italic> (IRCOO), whose pseudo-code is given in Algorithm 2. Unlike IRR, what is maintained is a set of terminal strings, and the current grammar at each moment is a Minimal Grammar Parsing over this set of strings. Recall that <italic>cons</italic>(<italic>ω</italic>) gives the only terminal string that can be derived from <italic>ω</italic> (the “constituent”). Again, we define IRCOO-MC, IRCOO-MF and IRCOO-ML as the instances of Algorithm 2 where the score function is <italic>f<sub>MC</sub>, f<sub>MF</sub></italic> and <italic>f<sub>ML</sub></italic>, respectively.</p>
<p>The computation of the <italic>argmax</italic> function depends only on the number of repeats, assuming that <italic>f</italic> is evaluated in constant time, so that its complexity lies in 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>2</sup>). As for IRR, the total number of times the while loop is executed is bounded by <italic>n</italic>. The complexity of this generic scheme is thus 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic> × (<italic>n</italic><sup>2</sup> + <italic>n</italic> × <italic>m</italic><sup>2</sup>)).</p>
<array>
<tbody>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Algorithm 2</bold> Iterative Repeat Choice with Occurrence Optimization.</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top">IRCOO(<italic>s, f</italic>)</td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Require:</bold> <italic>s</italic> is a sequence, and <italic>f</italic> is a score function on words</td></tr>
<tr>
<td align="right" valign="top">1:</td>
<td align="left" valign="top">
<inline-graphic xlink:href="algorithms-04-00262i5.gif"/> ← {<italic>s</italic>}</td></tr>
<tr>
<td align="right" valign="top">2:</td>
<td align="left" valign="top">
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> ← {<italic>S</italic> → <italic>s</italic>}</td></tr>
<tr>
<td align="right" valign="top">3:</td>
<td align="left" valign="top"><bold>while</bold> (∃<italic>ω</italic> : <italic>ω</italic> ← argmax<sub><italic>α</italic>∈repeats(
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/>)</sub> <italic>f</italic>(<italic>α</italic>, 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>)) ∧ |<italic>mgp</italic>(
<inline-graphic xlink:href="algorithms-04-00262i5.gif"/> ∪ {<italic>cons</italic>(<italic>ω</italic>)})| &lt; |
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>| <bold>do</bold></td></tr>
<tr>
<td align="right" valign="top">4:</td>
<td align="left" valign="top"> 
<inline-graphic xlink:href="algorithms-04-00262i5.gif"/> ← 
<inline-graphic xlink:href="algorithms-04-00262i5.gif"/> ∪ {<italic>cons</italic>(<italic>ω</italic>)}</td></tr>
<tr>
<td align="right" valign="top">5:</td>
<td align="left" valign="top"> 
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/> ← <italic>mgp</italic>(
<inline-graphic xlink:href="algorithms-04-00262i5.gif"/>)</td></tr>
<tr>
<td align="right" valign="top">6:</td>
<td align="left" valign="top"><bold>end while</bold></td></tr>
<tr>
<td align="right" valign="top">7:</td>
<td align="left" valign="top"><bold>return</bold> <italic>G</italic>(
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>)</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr></tbody></array>
<p>As an example, consider again the sequence from Section 3.2, where <italic>α, β</italic> and <italic>γ</italic> have length one. After three iterations of IRCOO-MC the words <italic>xax, xbx</italic> and <italic>xcx</italic> are chosen, and an MGP of these words plus the original sequence results in <italic>G</italic>*.</p>
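<p>The loop of Algorithm 2 can be sketched in Python as below. This is a simplified illustration under several assumptions that do not come from the algorithm itself: all function names are hypothetical, repeats are collected over the current right-hand sides and overlapping occurrences are counted, the grammar size is taken to be one unit per right-hand-side symbol plus one per rule, and the score <italic>longest</italic> is only a stand-in and is not claimed to coincide with <italic>f<sub>ML</sub></italic>. The extra test that stops the loop when the constituent is already known is also an addition of this sketch.</p>
<preformat preformat-type="code"><![CDATA[
```python
def mgp(constituents):
    """Minimal grammar parsing by shortest paths; items are chars or rule indices."""
    rules = []
    for l, w in enumerate(constituents):
        n = len(w)
        dist, back = [0] + [n + 1] * n, [None] * (n + 1)
        for i in range(n):
            edges = [(i + 1, w[i])] + [(i + len(u), k)
                                       for k, u in enumerate(constituents)
                                       if k != l and w.startswith(u, i)]
            for j, label in edges:
                if dist[i] + 1 < dist[j]:
                    dist[j], back[j] = dist[i] + 1, (i, label)
        rhs, i = [], n
        while i > 0:
            i, label = back[i]
            rhs.append(label)
        rules.append(rhs[::-1])
    return rules


def size(rules):
    # Assumed size measure: one unit per symbol plus one per rule.
    return sum(len(rhs) + 1 for rhs in rules)


def cons(word, rules):
    """Terminal string derived from a word over terminals and rule indices."""
    return "".join(x if isinstance(x, str) else cons(rules[x], rules)
                   for x in word)


def rhs_repeats(rules):
    """Words of length >= 2 seen at least twice in the right-hand sides."""
    counts = {}
    for rhs in rules:
        for i in range(len(rhs)):
            for j in range(i + 2, len(rhs) + 1):
                word = tuple(rhs[i:j])
                counts[word] = counts.get(word, 0) + 1
    return [w for w, c in counts.items() if c >= 2]


def ircoo(s, f):
    constituents = [s]                    # line 1 of Algorithm 2
    rules = mgp(constituents)             # line 2: the single rule S -> s
    while True:
        candidates = rhs_repeats(rules)
        if not candidates:
            break
        omega = max(candidates, key=lambda w: f(w, rules))
        new = cons(omega, rules)
        if new in constituents:           # guard added by this sketch
            break
        trial = mgp(constituents + [new])
        if size(trial) >= size(rules):    # stop unless the grammar shrinks
            break
        constituents.append(new)          # line 4
        rules = trial                     # line 5
    return constituents, rules


def longest(word, rules):
    # Hypothetical stand-in score: length of the derived terminal string.
    return len(cons(word, rules))


s = "abcXabcYabc"
constituents, rules = ircoo(s, longest)
print(constituents, rules, size(rules))
```
]]></preformat>
<p>On this small input the sketch adds the single constituent <italic>abc</italic> and then stops, because the resulting right-hand sides contain no further repeat.</p>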
<p>IRCOO extends IRR by performing, at each step, a global optimization of the replaced occurrences, but it still relies on the classical score functions of IRR to choose the words to introduce. The result of the optimization can, however, be used directly to guide the search in a hill-climbing approach that we introduce in the next subsection.</p></sec>
<sec>
<label>4.3.</label>
<title>Widening the Explored Space</title>
<p>In this section we depart from IRR algorithms by taking the idea presented in IRCOO a step further. Here, we consider the optimization procedure (<italic>mgp</italic>) as a scoring function for sets of substrings.</p>
<p>We first introduce the search space, defined over all possible sets of substrings, for the Smallest Grammar Problem and, second, an algorithm performing a wider exploration of this search space than the classical greedy methods.</p>
<sec>
<label>4.3.1.</label>
<title>The Search Space</title>
<p>The <italic>mgp</italic> procedure lets us solve the problem of finding a smallest grammar given a fixed set of constituents. The Smallest Grammar Problem then reduces to finding a good set of constituents. This is the idea behind the search space we consider here.</p>
<p>Consider the lattice 〈ℛ(<italic>s</italic>), ⊆〉, where ℛ(<italic>s</italic>) is the collection of all possible sets of repeated substrings in <italic>s</italic>: ℛ(<italic>s</italic>) = 2<sup>repeats(<italic>s</italic>)</sup>. Every node of this lattice corresponds to a set of repeats of <italic>s</italic>. We then define a score function over the nodes of the lattice as <italic>score</italic>(<italic>η</italic>) = |mgp({<italic>s</italic>} ∪ <italic>η</italic>)|.</p>
<p>An algorithm over this search space will look for a local or global minimum. To define this, we first need some notation:</p>
<sec>
<title>Definition 1</title>
<p>Given a lattice 〈<italic>L</italic>, ⪯〉:
<list list-type="order">
<list-item>
<p>ancestors(<italic>η</italic>) = {<italic>θ</italic> ≠ <italic>η</italic> | <italic>η</italic> ⪯ <italic>θ</italic> ∧ (∄<italic>κ, κ</italic> ≠ <italic>η,θ s.t. η</italic> ⪯ <italic>κ</italic> ⪯ <italic>θ</italic>)}</p></list-item>
<list-item>
<p>descendants(<italic>η</italic>) = {<italic>θ</italic> ≠ <italic>η</italic> | <italic>θ</italic> ⪯ <italic>η</italic> ∧ (∄<italic>κ, κ</italic> ≠ <italic>η,θ s.t. θ</italic> ⪯ <italic>κ</italic> ⪯ <italic>η</italic>)}</p></list-item></list></p>
<p>The <italic>ancestors</italic> of node <italic>η</italic> are the nodes exactly “over” <italic>η</italic>, while the <italic>descendants</italic> of node <italic>η</italic> are the nodes exactly “under” <italic>η</italic>.</p>
<p>Now, we are able to define a global and local minimum.</p></sec>
<sec>
<title>Definition 2</title>
<p>Given a lattice 〈<italic>L</italic>, ⪯〉 and an associated score function over nodes, <italic>g</italic>(<italic>η</italic>):
<list list-type="order">
<list-item>
<p>A node <italic>η</italic> is a global <italic>minimum</italic> if <italic>g</italic>(<italic>η</italic>) ≤ <italic>g</italic>(<italic>θ</italic>) for all <italic>θ</italic> ∈ <italic>L</italic>.</p></list-item>
<list-item>
<p>A node <italic>η</italic> is a local <italic>minimum</italic> if <italic>g</italic>(<italic>η</italic>) ≤ <italic>g</italic>(<italic>θ</italic>) for all <italic>θ</italic> ∈ ancestors(<italic>η</italic>) ∪ descendants(<italic>η</italic>).</p></list-item></list></p>
<p>Unless otherwise noted, the default score function for nodes will be <italic>score</italic>(<italic>η</italic>) (defined above). Finally, let 
<inline-graphic xlink:href="algorithms-04-00262i6.gif"/>
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/>(<italic>s</italic>) be the set of all grammars of minimal size for the sequence <italic>s</italic>. Similarly, we define ℳ
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/>
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>({<italic>s</italic>} ∪ <italic>η</italic>) as the set of minimal grammars with constituents {<italic>s</italic>} ∪ <italic>η</italic>. With this definition, we can state formally that this lattice is a “good” search space:</p></sec>
<sec>
<title>Theorem 2</title>
<p>
<inline-formula>
<mml:math id="mm7" display="inline">
<mml:semantics id="sm7">
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo>∪</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>η</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>η</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mtext mathvariant="italic">is global minimum of</mml:mtext>
<mml:mrow>
<mml:mo>〈</mml:mo>
<mml:mrow>
<mml:mi>ℛ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>⊆</mml:mo></mml:mrow>
<mml:mo>〉</mml:mo></mml:mrow></mml:mrow>
</mml:mrow></mml:mrow></mml:munder>
<mml:mi>ℳ</mml:mi>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo>∪</mml:mo>
<mml:mi>η</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></inline-formula></p></sec>
<sec>
<title>Proof</title>
<p>To see the first inclusion (⊆), take a smallest grammar <italic>G</italic>*. All strings in <italic>cons</italic>(<italic>G</italic>*) have to be repeats of <italic>s</italic>, so <italic>cons</italic>(<italic>G</italic>*)\{<italic>s</italic>} corresponds to a node <italic>η</italic> in the lattice and <italic>G</italic>* has to be in ℳ
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/>
<inline-graphic xlink:href="algorithms-04-00262i3.gif"/>({<italic>s</italic>} ∪ <italic>η</italic>). Conversely (⊇), all grammars of the right expression have the same size, which is minimal, so they are all smallest grammars.</p>
<p>Because the problem is NP-hard, it is fruitless (supposing <italic>P</italic> ≠ <italic>NP</italic>) to search for an efficient algorithm that finds a global minimum. We therefore present an algorithm that looks for a local minimum in this search space.</p></sec></sec>
<sec>
<label>4.3.2.</label>
<title>The Algorithm</title>
<p>In contrast with classical methods, we can now define algorithms that also discard constituents. To perform a wider exploration of the search space, we propose a new algorithm that performs a succession of greedy searches in ascending and descending directions in the lattice until a local minimum is found.</p>
<p>This algorithm explores the lattice along a zigzag path; we therefore call it ZZ. It alternates between two phases: <italic>bottom-up</italic> and <italic>top-down</italic>. The bottom-up phase can be started at any node. It moves upwards in the lattice and, at each step, inspects all ancestors of the current node to find the one with the lowest score. It stops when no ancestor has a better score than the current node. Like bottom-up, the top-down phase can start at any node, but it moves downwards, looking for the node with the smallest score among the descendants of the current node. Going up or down from the current node is equivalent to adding a substring to, or removing one from, the set of substrings in the current node, respectively.</p>
<p>ZZ starts at the bottom node, that is, the node that corresponds to the grammar <italic>S</italic> → <italic>s</italic>, and it finishes when the score does not improve between two bottom-up-top-down iterations. Pseudo-code is given in Algorithm 3.</p>
<array>
<tbody>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Algorithm 3</bold> Zigzag algorithm.</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr>
<tr>
<td colspan="2" align="left" valign="top">ZZ(<italic>s</italic>)</td></tr>
<tr>
<td colspan="2" align="left" valign="top"><bold>Require:</bold> <italic>s</italic> is a sequence</td></tr>
<tr>
<td align="right" valign="top">1:</td>
<td align="left" valign="top">ℒ ← 〈ℛ(<italic>s</italic>), ⊆〉</td></tr>
<tr>
<td align="right" valign="top">2:</td>
<td align="left" valign="top"><italic>η</italic> ← ∅</td></tr>
<tr>
<td align="right" valign="top">3:</td>
<td align="left" valign="top"><bold>repeat</bold></td></tr>
<tr>
<td align="right" valign="top">4:</td>
<td align="left" valign="top"> <bold>while</bold> ∃<italic>η′</italic> ∈ ℒ : (<italic>η′</italic> ← <italic>argmin</italic><sub><italic>η′</italic>∈ancestors(<italic>η</italic>)</sub>score(<italic>η′</italic>)) ∧ score(<italic>η′</italic>) ≤ score(<italic>η</italic>) <bold>do</bold></td></tr>
<tr>
<td align="right" valign="top">5:</td>
<td align="left" valign="top">  <italic>η</italic> ← <italic>η′</italic></td></tr>
<tr>
<td align="right" valign="top">6:</td>
<td align="left" valign="top"> <bold>end while</bold></td></tr>
<tr>
<td align="right" valign="top">7:</td>
<td align="left" valign="top"> <bold>while</bold> ∃<italic>η′</italic> : (<italic>η′</italic> ← <italic>argmin</italic><sub><italic>η′</italic>∈descendants(<italic>η</italic>)</sub>score(<italic>η′</italic>)) ∧ score(<italic>η′</italic>) ≤ score(<italic>η</italic>) <bold>do</bold></td></tr>
<tr>
<td align="right" valign="top">8:</td>
<td align="left" valign="top">  <italic>η</italic> ← <italic>η′</italic></td></tr>
<tr>
<td align="right" valign="top">9:</td>
<td align="left" valign="top"> <bold>end while</bold></td></tr>
<tr>
<td align="right" valign="top">10:</td>
<td align="left" valign="top"><bold>until</bold> score(<italic>η</italic>) does not decrease</td></tr>
<tr>
<td align="right" valign="top">11:</td>
<td align="left" valign="top"><bold>return</bold> <italic>G</italic>(mgp({<italic>s</italic>} ∪ <italic>η</italic>))</td></tr>
<tr>
<td colspan="2" valign="bottom">
<hr/></td></tr></tbody></array>
<p>For example, suppose that there are 5 substrings that occur more than once in <italic>s</italic> and that they all have length greater than two. Let these strings be numbered from 0 to 4. We start the ZZ algorithm at the bottom node. It inspects nodes {0}, {1}, {2}, {3}, and {4}. Suppose that {1} produces the best grammar; ZZ then moves to that node and starts over, exploring the nodes above it. <xref ref-type="fig" rid="f2-algorithms-04-00262">Figure 2</xref> shows a part of the lattice being explored. Dotted arrows point to nodes that are explored, while full arrows point to nodes having the lowest score. Suppose that the algorithm then continues up until it reaches node {1, 2, 3}, where no ancestor has a lower score. ZZ then starts the top-down phase, going down to node {2, 3}, where no descendant has a lower score. At this point a bottom-up-top-down iteration is done and the algorithm starts over again. It goes up again and, say, reaches node {2, 3, 4}, where it stops. Bold circled nodes correspond to nodes where the algorithm switches phases; grey nodes correspond to nodes with the best score among their siblings.</p>
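<p>The alternation of the two phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, repeats are taken to be substrings of length at least two with at least two (possibly overlapping) occurrences, the score counts one unit per right-hand-side symbol plus one per rule, and a move is made only on a strict improvement, following the prose description rather than the ≤ test of Algorithm 3. Ties are broken by lexicographic order to keep the run deterministic.</p>
<preformat preformat-type="code"><![CDATA[
```python
def mgp(constituents):
    """Minimal grammar parsing by shortest paths; items are chars or rule indices."""
    rules = []
    for l, w in enumerate(constituents):
        n = len(w)
        dist, back = [0] + [n + 1] * n, [None] * (n + 1)
        for i in range(n):
            edges = [(i + 1, w[i])] + [(i + len(u), k)
                                       for k, u in enumerate(constituents)
                                       if k != l and w.startswith(u, i)]
            for j, label in edges:
                if dist[i] + 1 < dist[j]:
                    dist[j], back[j] = dist[i] + 1, (i, label)
        rhs, i = [], n
        while i > 0:
            i, label = back[i]
            rhs.append(label)
        rules.append(rhs[::-1])
    return rules


def score(s, eta):
    # Assumed size measure: one unit per symbol plus one per rule.
    return sum(len(rhs) + 1 for rhs in mgp([s] + sorted(eta)))


def repeats(s):
    """Substrings of length >= 2 that occur at least twice in s."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 2, len(s) + 1)}
    return {w for w in subs if s.find(w, s.find(w) + 1) != -1}


def zigzag(s):
    cands = repeats(s)
    eta = frozenset()                     # bottom node: the grammar S -> s
    best = score(s, eta)
    up = lambda e: [e | {w} for w in sorted(cands - e)]    # ancestors
    down = lambda e: [e - {w} for w in sorted(e)]          # descendants
    while True:
        before = best
        for neighbours in (up, down):     # bottom-up phase, then top-down
            while True:
                options = neighbours(eta)
                nxt = min(options, key=lambda e: score(s, e), default=None)
                if nxt is None or score(s, nxt) >= best:
                    break
                eta, best = nxt, score(s, nxt)
        if best >= before:                # no gain in a full zigzag
            return eta, best


eta, best = zigzag("abcdabcdabcd")
print(sorted(eta), best)
```
]]></preformat>
<p>The sketch recomputes every score from scratch, so it is only practical for short inputs. It is meant to show the control flow, not to match the complexity analysis that follows.</p>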
<sec>
<title>Computational Complexity</title>
<p>In the previous section we showed that the computational complexity of computing the score function for each node is 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic> × <italic>m</italic><sup>2</sup>), where <italic>n</italic> is the length of the target string and <italic>m</italic> is the number of substrings in the node. Every time ZZ looks for a substring to add or remove it has to inspect all possible candidates with the aim of finding the one that minimizes the score. Depending on the number of substrings that are already in the node, there might be at most 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>2</sup>) candidate strings. As a consequence, each step upwards or downwards is made in 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>2</sup> × <italic>n</italic> × <italic>m</italic><sup>2</sup>). Next, we need to give an upper bound for the length of the path that is potentially traversed by the algorithm. To define it, we first note two important properties: the score of the bottom node is equal to <italic>n</italic>, and the score of any node containing more than <italic>n</italic>/2 substrings is at least <italic>n</italic>. The bottom-up phase visits at most <italic>n</italic>/2 nodes and, consequently, the top-down phase can go down at most <italic>n</italic>/2 steps. Adding them together, a bottom-up-top-down iteration traverses at most <italic>n</italic> nodes. Now, each of these iterations decreases the score by at least 1; otherwise the algorithm stops. Since the initial score is <italic>n</italic> and the score is always positive, there can be at most <italic>n</italic> bottom-up-top-down iterations. This results in a complexity for the ZZ algorithm of 
<inline-graphic xlink:href="algorithms-04-00262i1.gif"/>(<italic>n</italic><sup>5</sup> × <italic>m</italic><sup>2</sup>).</p></sec></sec>
<sec>
<label>4.3.3.</label>
<title>Non-Monotonicity of the Search Space</title>
<p>We finish this section with a remark on the search space. In order to ease the understanding of the proof, we will suppose that the size of the grammar is defined as |<italic>G</italic>| = Σ<sub><italic>A</italic>→<italic>α</italic>∈
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(|<italic>α</italic>|). The proof extends easily if we consider our definition of size, but is more cumbersome (basically, instead of taking blocks of size two in the proof, take them of size three).</p>
<p>We have presented an algorithm that finds a local minimum on the search space. Locality is defined in terms of its direct neighborhood, but we will see that the local minimality of a node does not necessarily extend further:</p>
<sec>
<title>Proposition 1</title>
<p>The lattice 〈ℛ(<italic>s</italic>), ⊆〉 is not monotonic for the function score(<italic>η</italic>) = |<italic>mgp</italic>(<italic>η</italic> ∪ {<italic>s</italic>})|. That is, suppose <italic>η</italic> is a local minimum. There may be a node <italic>θ</italic> such that <italic>η</italic> ⪯ <italic>θ</italic> and score(<italic>η</italic>) &gt; score(<italic>θ</italic>).</p></sec>
<sec>
<title>Proof</title>
<p>Consider the following sequence:
<disp-formula id="FD3">
<mml:math id="mm8" display="block">
<mml:semantics id="sm8">
<mml:mrow>
<mml:mtext mathvariant="italic">abcd</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:mtext mathvariant="italic">cdef</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mtext mathvariant="italic">efab</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>3</mml:mn></mml:msub>
<mml:mtext mathvariant="italic">cdab</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>4</mml:mn></mml:msub>
<mml:mtext mathvariant="italic">efcd</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>5</mml:mn></mml:msub>
<mml:mtext mathvariant="italic">abef</mml:mtext>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>6</mml:mn></mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>7</mml:mn></mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>8</mml:mn></mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>9</mml:mn></mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>10</mml:mn></mml:mrow></mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>11</mml:mn></mml:mrow></mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>12</mml:mn></mml:mrow></mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>13</mml:mn></mml:mrow></mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>14</mml:mn></mml:mrow></mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>15</mml:mn></mml:mrow></mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>c</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>16</mml:mn></mml:mrow></mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>17</mml:mn></mml:mrow></mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>18</mml:mn></mml:mrow></mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>b</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>19</mml:mn></mml:mrow></mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:mn>20</mml:mn></mml:mrow></mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi>
<mml:mo>.</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>The set of possible constituents is {<italic>ab, bc, cd, de, ef, fa, da, fc, be</italic>}, none of which has size longer than 2. Note that the digrams that appear in the middle of the first blocks (of size four) appear repeated twice, while the others only once. Also, the six four-size blocks are all compositions of constituents {<italic>ab, cd, ef</italic>} (each of which is only repeated once at the end). Consider now the following grammar:
<disp-formula id="FD4">
<mml:math id="mm9" display="block">
<mml:semantics id="sm9">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>S</mml:mi></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:mi>c</mml:mi>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mi>e</mml:mi>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:mi>b</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>3</mml:mn></mml:msub>
<mml:mi>c</mml:mi>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:mi>b</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>4</mml:mn></mml:msub>
<mml:mi>e</mml:mi>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>5</mml:mn></mml:msub>
<mml:mi>a</mml:mi>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>6</mml:mn></mml:msub>
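<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>7</mml:mn></mml:msub>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>8</mml:mn></mml:msub>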
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>9</mml:mn></mml:msub>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>10</mml:mn></mml:msub>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>11</mml:mn></mml:msub>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>12</mml:mn></mml:msub>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>13</mml:mn></mml:msub>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>A</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>14</mml:mn></mml:msub>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>15</mml:mn></mml:msub>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>16</mml:mn></mml:msub>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>17</mml:mn></mml:msub>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>E</mml:mi></mml:msup>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>18</mml:mn></mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>b</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>19</mml:mn></mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>20</mml:mn></mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>f</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>c</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>E</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>A</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>a</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mi>A</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>a</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>c</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>E</mml:mi></mml:msup></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mo stretchy="false">→</mml:mo></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:semantics></mml:math></disp-formula>of size 68, which is a smallest grammar given this set of constituents. Moreover, adding any one of the three remaining constituents would increase the size of the grammar by one. However, adding all three of them would allow each block of size 4 to be parsed with only two symbols, and each of the three trailing blocks with only one symbol. This saves 9 symbols at a cost of only 6 (for the new right-hand sides).</p></sec></sec></sec></sec>
<sec>
<label>5.</label>
<title>Experiments</title>
<p>In this section we experimentally compare our algorithms with the classical ones from the literature. For this purpose we use the Canterbury Corpus [<xref ref-type="bibr" rid="b23-algorithms-04-00262">23</xref>], which is a standard corpus for comparing lossless data compression algorithms. <xref ref-type="table" rid="t1-algorithms-04-00262">Table 1</xref> lists the sequences of the corpus together with their length and number of repeats of length greater than one.</p>
<p>Published results are hard to compare directly: not all existing algorithms are publicly available, they resolve ties between words with the best score in different ways, they do not report results on a standard corpus, or they use different definitions of the size of a grammar. In order to standardize the experiments and the score, we implemented all the offline algorithms presented in this paper in the IRR framework. For the sake of completeness, we also add LZ78 and Sequitur to the comparison. Note that we have post-processed the output of the LZ78 factorizations to transform them into context-free grammars. The first series of experiments compares these classical algorithms, and its results are shown in the middle part of <xref ref-type="table" rid="t2-algorithms-04-00262">Table 2</xref>. On this benchmark, we can see that IRR-MC always outputs the smallest grammars, which are on average 4.22% smaller than those of the second best (IRR-MF), confirming the partial results of [<xref ref-type="bibr" rid="b9-algorithms-04-00262">9</xref>] and showing that IRR-MC is the current state-of-the-art practical algorithm for this problem.</p>
<p>Then, we evaluate how the optimization of occurrences improves IRR algorithms. As shown in the IRRCOO column of <xref ref-type="table" rid="t2-algorithms-04-00262">Table 2</xref>, each strategy for choosing the word is improved by introducing the occurrence optimization. The sole exceptions are for the MF strategy on 
<monospace>grammar.lsp</monospace> and <monospace>xargs.1</monospace>, but the difference in these cases is very small and the sequences are rather short. More importantly, we can see that IRRCOO-MC becomes the new state-of-the-art algorithm, proposing for each test a smaller grammar than IRR-MC, and being outperformed only on 
<monospace>plrabn12.txt</monospace> by IRRCOO-MF.</p>
<p>If given more time, these results can still be improved by using ZZ. As shown in column ZZ of <xref ref-type="table" rid="t2-algorithms-04-00262">Table 2</xref>, it obtains on average grammars 3.12% smaller than those of IRR-MC, a percentage that increases for the sequences containing natural language (for instance, for 
<monospace>alice29.txt</monospace> the gain is 8.04%), while it is lower for other sequences (only 0.1% for 
<monospace>kennedy.xls</monospace>, for example). For the latter case, one can remark that the compression ratio is already very high with IRR-MC and that it may be difficult or impossible to improve it, the last few points of percentage gain always being the hardest to achieve. As expected, ZZ improves over previous approaches mainly because it explores a much wider fraction of the search space. Interestingly enough, the family of IRRCOO algorithms also improves on state-of-the-art algorithms while keeping the greedy flavour and, more importantly, it does so with a complexity cost similar to that of pure greedy approaches. The price to be paid for computing grammars with ZZ is its computational complexity. This problem already showed up with 
<monospace>plrabn12.txt</monospace> and <monospace>lcet10.txt</monospace> (where each repeat individually does not compress the sequence much, so lots of iterations are necessary) and 
<monospace>ptt5</monospace> (which contains about 99 million repeats).</p></sec>
<sec>
<label>6.</label>
<title>Non-Uniqueness of Smallest Grammar and Structure Discovery</title>
<p>Depending on the implementation, the <italic>mgp</italic> algorithm could return different small grammars using the same set of constituents, because the <italic>GP-graph</italic> usually contains more than one shortest path and the algorithm may choose any of them. From the point of view of data discovery, all these grammars are equally interesting if we only consider their size, but different grammars might have different structures despite having the same size.</p>
<p>In this section we investigate these phenomena from two different perspectives. From a theoretical point of view, we provide bounds on the number of different smallest grammars, both globally and locally (fixing the set of constituents). From an empirical point of view, we explore and compare the actual structural variation among different grammars on some real-life sequences.</p>
<sec>
<label>6.1.</label>
<title>Bounds on the Number of Smallest Grammars</title>
<p>It is clear that a smallest grammar is not necessarily unique. Not so clear is how many smallest grammars there can be. First we will prove that for a certain family of sequences, any node of the search space corresponds to a smallest grammar. As in the proof of Proposition 1 and only to ease the understanding of the proof, we will use |<italic>G</italic>| = Σ<sub><italic>A</italic>→<italic>α</italic>∈
<inline-graphic xlink:href="algorithms-04-00262i4.gif"/></sub>(|<italic>α</italic>|) as the definition of grammar size.</p>
<sec>
<title>Proposition 2</title>
<p>Let <italic>m</italic>(<italic>n</italic>) = max<sub><italic>s</italic> : |<italic>s</italic>|=<italic>n</italic></sub>(number of global minima for 〈ℛ(<italic>s</italic>), ⊆〉).</p>
<p>Then, <italic>m</italic>(<italic>n</italic>) ∈ Ω(<italic>c<sup>n</sup></italic>), with <italic>c</italic> &gt; 1 constant.</p></sec>
<sec>
<title>Proof</title>
<p>It is sufficient to find one family of sequences for which the number of global minima is exponential. Consider the sequence
<disp-formula id="FD5">
<mml:math id="mm10" display="block">
<mml:semantics id="sm10">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∏</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>k</mml:mi></mml:munderover>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>of length <italic>n</italic> = 6<italic>k</italic> over an alphabet of size <italic>k</italic> + 2. The <italic>a<sub>i</sub></italic>'s are single symbols. The set of repeated substrings longer than one is {<italic>a<sub>i</sub>a<sub>i</sub></italic>, 1 ≤ <italic>i</italic> ≤ <italic>k</italic>}, of size <italic>k</italic>. For any subset of it, there is only one minimal grammar with this constituent set. Adding any remaining constituent to this grammar reduces the length of the axiom rule by 2 and leaves the other rules unchanged, but adds a new rule of size 2, so the grammar size does not change. The same happens when a constituent is removed. Therefore, every node of the lattice is a local minimum and, since all nodes yield grammars of the same size, also a global one. As there are 
<inline-formula>
<mml:math id="mm11" display="inline">
<mml:semantics id="sm11">
<mml:mrow>
<mml:msup>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mfrac>
<mml:mi>n</mml:mi>
<mml:mn>6</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:semantics></mml:math></inline-formula> nodes, this proves the proposition.</p>
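<p>The argument can be checked on small instances. The following Python sketch is an illustration only and not part of the original proof: it encodes the letters <italic>a<sub>i</sub></italic> as tuples and the two separators as strings (an encoding chosen here for convenience), and computes the size Σ|<italic>α</italic>| of the grammar obtained for every subset of the constituents <italic>a<sub>i</sub>a<sub>i</sub></italic>. The function names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def build_sequence(k):
    """Return s_k as a list of symbols: ('a', i) letters and '|1', '|2' separators."""
    seq = []
    for i in range(1, k + 1):
        a = ("a", i)
        seq += [a, a, "|1", a, a, "|2"]
    return seq


def grammar_size(seq, chosen):
    """Sum of right-hand-side lengths when each chosen a_i a_i becomes a rule.

    `chosen` is a set of indices i. Every occurrence of a_i a_i with i in
    `chosen` is replaced by one non-terminal in the axiom, and each rule
    A_i -> a_i a_i contributes 2 to the size.
    """
    axiom = []
    pos = 0
    while pos < len(seq):
        sym = seq[pos]
        is_pair = (
            isinstance(sym, tuple)
            and sym[1] in chosen
            and pos + 1 < len(seq)
            and seq[pos + 1] == sym
        )
        if is_pair:
            axiom.append(("A", sym[1]))  # non-terminal deriving a_i a_i
            pos += 2
        else:
            axiom.append(sym)
            pos += 1
    return len(axiom) + 2 * len(chosen)


def sizes_over_all_subsets(k):
    """Set of grammar sizes obtained over all 2^k subsets of constituents."""
    seq = build_sequence(k)
    indices = range(1, k + 1)
    return {
        grammar_size(seq, set(subset))
        for r in range(k + 1)
        for subset in combinations(indices, r)
    }
```
]]></preformat>
<p>Under these assumptions, <monospace>sizes_over_all_subsets(k)</monospace> contains a single value, 6<italic>k</italic>, for every <italic>k</italic>: each of the 2<sup><italic>k</italic></sup> nodes of the lattice gives a grammar of the same size.</p>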
<p>Now, we will suppose that the set of constituents is fixed and consider the number of smallest grammars that can be built with this set. Given a set of constituents <italic>Q</italic> we denote with 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub> the set of all the smallest grammars that can be formed with <italic>Q</italic>. Different implementations of the <italic>mgp</italic> algorithm may return different grammars from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub>.</p>
<p>As the following Proposition shows, there are cases where the number of different smallest grammars can grow exponentially for a given set of constituents.</p></sec>
<sec>
<title>Proposition 3</title>
<p>Let 
<inline-formula>
<mml:math id="mm12" display="inline">
<mml:semantics id="sm12">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>max</mml:mo></mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">Σ</mml:mi>
<mml:mi>n</mml:mi></mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>⊆</mml:mo>
<mml:mtext mathvariant="italic">repeats</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mrow>
<mml:mi>Q</mml:mi>
<mml:mo>∪</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>}</mml:mo></mml:mrow></mml:msub></mml:mrow>
<mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></inline-formula>.</p>
<p>Then <italic>m</italic>(<italic>n</italic>) ∈ Ω(<italic>c<sup>n</sup></italic>), with <italic>c</italic> &gt; 1 constant.</p></sec>
<sec>
<title>Proof</title>
<p>Let <italic>s<sub>k</sub></italic> be the sequence (<italic>aba</italic>)<sup><italic>k</italic></sup> and let <italic>Q</italic> be {<italic>ab, ba</italic>}. Then the <italic>mgp</italic> algorithm can parse each <italic>aba</italic> in exactly one of the following two ways: <italic>aA</italic> or <italic>Ba</italic>, where <italic>A</italic> and <italic>B</italic> derive <italic>ba</italic> and <italic>ab</italic> respectively. Since each occurrence of <italic>aba</italic> can be replaced by <italic>aA</italic> or <italic>Ba</italic> independently, there are 2<sup><italic>k</italic></sup> different ways of rewriting the body of the rule for <italic>s<sub>k</sub></italic>. Moreover, all of them have the same minimal length, so there are 2<sup><italic>k</italic></sup> grammars of the same (minimal) size that can be formed for <italic>s<sub>k</sub></italic> taking <italic>Q</italic> as its constituents. Since |<italic>s<sub>k</sub></italic>| = 3<italic>k</italic>, this number is exponential in the length of the sequence.</p>
<p>Proposition 2 is complementary to Proposition 3. In the former, we prove that the number of global minima (and therefore, the number of grammars with the same minimal size) is exponential. To prove it, we provided a sequence with the property that any constituent set would yield a grammar with minimal size. In the latter we show that even if the set is fixed, there can still be an exponential number of smallest grammars with this constituent set. Note, however, that for the proof of Proposition 2 we needed an unbounded alphabet.</p>
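<p>For small <italic>k</italic> this count can be confirmed by brute force. The sketch below is illustrative and is not the <italic>mgp</italic> algorithm itself: it enumerates every way of writing (<italic>aba</italic>)<sup><italic>k</italic></sup> as a concatenation of single letters and occurrences of constituents, and counts those of minimal length. Non-terminals are labelled by the upper-cased word they derive, a naming convention assumed here for readability.</p>
<preformat preformat-type="code"><![CDATA[
```python
def all_parses(s, constituents):
    """Yield every way to write s as a list of single letters and constituents."""
    if not s:
        yield []
        return
    # Option 1: consume one terminal symbol.
    for rest in all_parses(s[1:], constituents):
        yield [s[0]] + rest
    # Option 2: consume one occurrence of a constituent (one non-terminal).
    for word in constituents:
        if s.startswith(word):
            for rest in all_parses(s[len(word):], constituents):
                yield [word.upper()] + rest


def count_minimal_parses(k, constituents=("ab", "ba")):
    """Return (minimal body length, number of bodies of that length) for (aba)^k."""
    parses = list(all_parses("aba" * k, constituents))
    best = min(len(p) for p in parses)
    return best, sum(1 for p in parses if len(p) == best)
```
]]></preformat>
<p>The enumeration is exponential in the length of the sequence, so it is only meant for very small values of <italic>k</italic>.</p>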
<p>These two propositions suggest that it might not be possible to find one <italic>gold</italic> smallest grammar that could be used for structure discovery. If we consider only the size of the grammar, then all grammars in 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub> are equally interesting, but different grammars might have different structures despite having the same size.</p>
<p>In the next section we analyze the differences among grammars of the same smallest size for a few real-life sequences. We explore their structural variances using similarity metrics and from different points of view.</p></sec></sec>
<sec sec-type="methods">
<label>6.2.</label>
<title>An Empirical Analysis</title>
<p>We will now introduce a way of measuring the difference between any pair of grammars taken from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub>. Our measure is based on the fact that our grammars have a single string in their language and that string has only one derivation tree. Therefore it seems natural to compare different grammars with a metric that is commonly used to compare parse trees, such as Unlabeled Brackets <italic>F</italic><sub>1</sub> [<xref ref-type="bibr" rid="b24-algorithms-04-00262">24</xref>] (see also [<xref ref-type="bibr" rid="b25-algorithms-04-00262">25</xref>] for a different explanation), hereafter <italic>UF</italic><sub>1</sub>.</p>
<p>Summarizing the definition, <italic>UF</italic><sub>1</sub> measures the overlap of the <italic>bracket</italic> sets of two parse trees. Given a grammar with a single string <italic>s</italic> in its language (which is equivalent to a parse tree), the set of <italic>brackets</italic> for that grammar is defined as the set of pairs of indices in <italic>s</italic> that delimit a substring to which some non-terminal symbol expands. For instance, the initial symbol <italic>S</italic> of a grammar expands to <italic>s</italic>, so (0, |<italic>s</italic>| − 1) belongs to the bracket set of that grammar.</p>
<p>Full overlap of the bracket sets (<italic>UF</italic><sub>1</sub> = 1) implies that the two smallest grammars are equal and, conversely, two equal smallest grammars have <italic>UF</italic><sub>1</sub> = 1. The opposite extreme may also occur, even for two smallest grammars over the same fixed constituent set. Consider again the sequence <italic>s<sub>k</sub></italic> = (<italic>aba</italic>)<sup><italic>k</italic></sup>, as defined in Proposition 3, and take <italic>G</italic><sub>1</sub> as the grammar that rewrites every <italic>aba</italic> as <italic>aA</italic> while grammar <italic>G</italic><sub>2</sub> rewrites it as <italic>Ba</italic>. <italic>G</italic><sub>1</sub> and <italic>G</italic><sub>2</sub> do not share any bracket other than the one spanning the whole sequence, so <italic>UF</italic><sub>1</sub>(<italic>G</italic><sub>1</sub>,<italic>G</italic><sub>2</sub>) = 0 once this trivial bracket is set aside.</p>
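<p>A minimal sketch of this measure is given below. It assumes that brackets are stored as pairs of inclusive indices and computes <italic>F</italic><sub>1</sub> as the harmonic mean of precision and recall over the two bracket sets, which simplifies to twice the number of shared brackets divided by the total number of brackets. Whether the bracket spanning the whole sequence is counted is left as an option, since this is an assumption of the sketch rather than something fixed by the text; the helper that builds the brackets of the two (<italic>aba</italic>)<sup><italic>k</italic></sup> parses is likewise hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def unlabeled_f1(brackets_a, brackets_b, sequence_length=None):
    """Unlabeled bracket F1 between two bracket sets.

    Brackets are (start, end) index pairs, both inclusive. If
    sequence_length is given, the bracket covering the whole sequence is
    discarded first (an assumption of this sketch).
    """
    a, b = set(brackets_a), set(brackets_b)
    if sequence_length is not None:
        root = (0, sequence_length - 1)
        a.discard(root)
        b.discard(root)
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return 2.0 * shared / (len(a) + len(b))


def aba_brackets(k, left_first):
    """Brackets of (aba)^k when every block is parsed with one digram.

    left_first=True brackets the first two letters of each block,
    left_first=False brackets the last two.
    """
    offset = 0 if left_first else 1
    brackets = {(3 * i + offset, 3 * i + offset + 1) for i in range(k)}
    brackets.add((0, 3 * k - 1))  # the axiom spans the whole sequence
    return brackets
```
]]></preformat>
<p>With these helpers, the two parses of (<italic>aba</italic>)<sup><italic>k</italic></sup> that always bracket the same side of each block share only the root bracket, so their score is 0 when the root is discarded and small but non-zero otherwise.</p>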
<p>In our experiments we will use <italic>UF</italic><sub>1</sub> as the way to compare structure between different smallest grammars.</p>
<p>Propositions 2 and 3 show that there are sequences for which the number of smallest grammars and the number of grammars for a given set of constituents are both exponential. Both results are based on sequences that were specially designed to exhibit these behaviours, and it might be the case that such behaviour does not occur “naturally”. In order to shed light on this question we present four different experiments. The first focuses directly on how Proposition 3 behaves in practice: it computes the number of grammars of smallest size that can be built using the set of constituents found by the ZZ algorithm. The other three experiments analyze how different all these grammars are. First, we compare the structure of different grammars randomly sampled from the set of possible grammars. Then we compare structure as in the previous experiment, but with a metric that discriminates by the length of the constituents. Finally, we consider in how many different ways a single position can be parsed. The first experiment provides evidence that the situation of Proposition 3 is the general rule rather than an exceptional case. The remaining experiments suggest that there is a common structure among all possible grammars. In particular, they suggest that longer constituents are more stable than shorter ones and that, if for each position we only consider the longest bracket that encloses it, then almost all of these brackets are common to the smallest grammars.</p>
<p>All these experiments require a mechanism to compute different grammars sharing the same set of constituents. We provide an algorithm that not only does this but also computes the total number of grammars that can be defined for a given set of constituents. Let <italic>s</italic> be a sequence, and let <italic>Q</italic> be a set of constituents. An <italic>MGP-graph</italic> is a subgraph of the <italic>GP-graph</italic> used in Section 4.1 for the resolution of the MGP problem. Both have the same set of nodes but they differ in the set of edges: an edge belongs to an <italic>MGP-graph</italic> if and only if it is used in at least one of the shortest paths from node 0 to node <italic>n</italic> + 1 (the end node). As a consequence, every path in an <italic>MGP-graph</italic> connecting the start node with the end node is in fact a shortest path in the <italic>GP-graph</italic>, and every such path defines a different way to write the rule for the initial non-terminal <italic>S</italic>. Similar subgraphs can be built for each of the non-terminal rules, and collectively we will refer to them as the <italic>MGP-graphs</italic>. Using the <italic>MGP-graphs</italic> it is possible to form a smallest grammar by simply choosing a start-to-end path in every <italic>MGP-graph</italic>. It only remains to explain how to compute an <italic>MGP-graph</italic>. We do so for the starting symbol; the construction for the other non-terminals is analogous.</p>
<p>The algorithm traverses the <italic>GP-graph</italic>, deleting edges and keeping only those that belong to at least one shortest path. It does so in two phases. First, it traverses the graph from left to right. When it reaches node <italic>i</italic>, the length of the shortest path from node 0 is already known for every node <italic>j</italic> &lt; <italic>i</italic>. Since all incoming edges to node <italic>i</italic> in the <italic>GP-graph</italic> come from previous nodes, it is possible to calculate the length of the shortest path from node 0 to node <italic>i</italic>, and the algorithm keeps only the edges that are part of such paths, deleting all others. The first phase produces a graph that may still include paths that do not end in the final node. In order to filter out these remaining edges, the second phase of the algorithm runs a breadth-first search backwards from node <italic>n</italic> + 1, removing the edges that are not involved in any shortest path.</p>
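<p>The two phases can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles a single rule body given as a Python string, numbers the nodes 0 to |<italic>s</italic>| (one per boundary between symbols, which differs from the node numbering used in the text), gives every edge a cost of one symbol, and assumes non-empty constituents. All function names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def gp_edges(s, constituents):
    """Edges (start, end, label) of the GP-graph of s: one per terminal
    position and one per occurrence of a (non-empty) constituent."""
    edges = [(i, i + 1, s[i]) for i in range(len(s))]
    for word in constituents:
        start = s.find(word)
        while start != -1:
            edges.append((start, start + len(word), word))
            start = s.find(word, start + 1)
    return edges


def mgp_graph(s, constituents):
    """Return (shortest length, edges lying on some shortest path 0 -> len(s))."""
    n = len(s)
    incoming = {node: [] for node in range(n + 1)}
    for edge in gp_edges(s, constituents):
        incoming[edge[1]].append(edge)

    # Phase 1 (left to right): shortest distance from node 0, keeping only
    # the incoming edges that achieve it.
    dist = [0] + [None] * n
    kept = {node: [] for node in range(n + 1)}
    for node in range(1, n + 1):
        dist[node] = min(dist[e[0]] + 1 for e in incoming[node])
        kept[node] = [e for e in incoming[node] if dist[e[0]] + 1 == dist[node]]

    # Phase 2 (right to left): breadth-first search from the end node,
    # retaining only the edges from which the end node is still reachable.
    result, seen, queue = [], {n}, deque([n])
    while queue:
        node = queue.popleft()
        for edge in kept[node]:
            result.append(edge)
            if edge[0] not in seen:
                seen.add(edge[0])
                queue.append(edge[0])
    return dist[n], sorted(result)
```
]]></preformat>
<p>For the string <monospace>abc</monospace> with the single constituent <monospace>bc</monospace>, the edge labelled <monospace>b</monospace> survives the first phase (it lies on a shortest path to the node it enters) and is removed only by the second phase, which illustrates why both passes are needed.</p>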
<p>In our experiments, we use 
<monospace>alice29.txt, asyoulik.txt</monospace>, and 
<monospace>humdyst.chr</monospace>. The first two are sequences of natural language from the Canterbury corpus, while the last one is a DNA sequence from the historical corpus of DNA used for comparing DNA compressors [<xref ref-type="bibr" rid="b26-algorithms-04-00262">26</xref>]. In all cases, the set of constituents was produced by the ZZ algorithm.</p>
<sec>
<label>6.2.1.</label>
<title>Counting</title>
<p>Using the <italic>MGP-graphs</italic> it is possible to recursively calculate the number of shortest paths that exist from the start node <italic>n</italic><sub>0</sub> to any other node <italic>n<sub>i</sub></italic>. In the base case, the number of shortest paths to <italic>n</italic><sub>0</sub> is equal to one. In the recursive case, the number of shortest paths to a node <italic>n</italic><sub><italic>i</italic>+1</sub> is equal to the sum of the numbers of shortest paths to the nodes <italic>n<sub>j</sub>, j</italic> ≤ <italic>i</italic>, that have an edge arriving at node <italic>n</italic><sub><italic>i</italic>+1</sub>.</p>
<p><xref ref-type="table" rid="t3-algorithms-04-00262">Table 3</xref> shows the number of different grammars that exist for our sequences. The experiments provide evidence that the number of smallest grammars is huge and that enumerating them explicitly might not even be tractable by a computer. This huge number of grammars raises an important question: how similar are these grammars?</p></sec>
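<p>This recursion is a standard path count on a directed acyclic graph. The sketch below assumes that an <italic>MGP-graph</italic> is given as a list of (start, end, label) edges with start smaller than end and with node 0 as the start node; the second function builds by hand the <italic>MGP-graph</italic> of (<italic>aba</italic>)<sup><italic>k</italic></sup> from Proposition 3 and is only there to provide an example input. Both names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def count_shortest_paths(mgp_edges, end_node):
    """Number of start-to-end paths in an MGP-graph.

    mgp_edges holds (start, end, label) triples with start < end; node 0 is
    the start node. Every path in an MGP-graph is a shortest path of the
    GP-graph, so counting paths counts minimal rule bodies.
    """
    incoming = {}
    for start, end, _label in mgp_edges:
        incoming.setdefault(end, []).append(start)
    counts = {0: 1}  # base case: exactly one (empty) path reaches the start node
    for node in range(1, end_node + 1):
        counts[node] = sum(counts[src] for src in incoming.get(node, []))
    return counts[end_node]


def aba_mgp_edges(k):
    """Hand-built MGP-graph of (aba)^k for the constituents {ab, ba}."""
    edges = []
    for block in range(k):
        base = 3 * block
        edges += [
            (base, base + 1, "a"), (base + 1, base + 3, "ba"),
            (base, base + 2, "ab"), (base + 2, base + 3, "a"),
        ]
    return edges
```
]]></preformat>
<p>Because nodes are visited in increasing order, every count needed on the right-hand side of the recursion is available when it is used, and the whole computation is linear in the number of edges.</p>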
<sec>
<label>6.2.2.</label>
<title><italic>UF</italic><sub>1</sub> with Random Grammars</title>
<p>To answer this question, we sample elements from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub> and compare them using <italic>UF</italic><sub>1</sub> as was described at the beginning of Section 6.2. We compare pairs of grammars, and we estimate and report average similarity between elements in 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub> in <xref ref-type="table" rid="t4-algorithms-04-00262">Table 4</xref>.</p>
<p>In order to sample from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub> uniformly, we need to be able to sample each rule body, which means that we need to be able to sample complete paths (from start to end) in each <italic>MGP-graph</italic>. The previous algorithm for counting shortest paths can be extended into one that samples with uniform probability: starting at the end node, it works its way back to the start node by repeatedly choosing one of the incoming edges with a probability proportional to the number of shortest paths that go through that edge. The labels of the chosen edges form a rule body sampled uniformly among all possible ones.</p></sec>
<sec>
<label>6.2.3.</label>
<title>Accounting for Constituent Sizes: <italic>UF</italic><sub>1,<italic>k</italic></sub></title>
<p>Our second experiment aims to discover which class of brackets accounts for most of the difference. Note that the standard <italic>UF</italic><sub>1</sub> measure gives the same weight to a bracket of size 2 as to longer brackets. The following experiment is quite similar to the previous one but, in order to evaluate the impact of smaller brackets on the score, we introduce a new custom metric, <italic>UF</italic><sub>1,<italic>k</italic></sub>, which is the same as <italic>UF</italic><sub>1</sub> except that the brackets whose length is less than or equal to a given size <italic>k</italic> are ignored in the calculation. So when <italic>k</italic> = 1, <italic>UF</italic><sub>1,<italic>k</italic></sub> is equal to <italic>UF</italic><sub>1</sub>, but for larger values of <italic>k</italic> more and more brackets are ignored. <xref ref-type="table" rid="t5-algorithms-04-00262">Table 5</xref> reports the results for different values of <italic>k</italic>. For each sequence, the table contains two columns: one for <italic>UF</italic><sub>1,<italic>k</italic></sub> and one for the percentage of the total brackets that were included in the calculation.</p>
<p>As can be seen, the F-measure increases with <italic>k</italic>. This indicates that bigger brackets are found in most grammars of 
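<p>A possible reading of this metric is sketched below. It assumes that each grammar is represented by a set of (start, end) brackets with inclusive ends, and that the reported percentage is the fraction of brackets, pooled over the two grammars, that survive the length filter; both points are assumptions of this sketch, and the function name is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def uf1_k(brackets_a, brackets_b, k):
    """UF1 restricted to brackets spanning more than k symbols.

    Brackets are (start, end) pairs with inclusive ends, so a bracket
    spans end - start + 1 symbols. Both arguments are expected to be sets.
    Returns (score, fraction of brackets kept by the filter).
    """
    def long_enough(bracket):
        start, end = bracket
        return end - start + 1 > k

    kept_a = {bracket for bracket in brackets_a if long_enough(bracket)}
    kept_b = {bracket for bracket in brackets_b if long_enough(bracket)}
    total = len(brackets_a) + len(brackets_b)
    kept_fraction = (len(kept_a) + len(kept_b)) / total if total else 1.0
    if not kept_a and not kept_b:
        return 1.0, kept_fraction
    score = 2.0 * len(kept_a & kept_b) / (len(kept_a) + len(kept_b))
    return score, kept_fraction
```
]]></preformat>
<p>Since every bracket produced by a constituent spans at least two symbols, calling the function with <italic>k</italic> = 1 filters nothing out, in agreement with the remark that <italic>UF</italic><sub>1,1</sub> coincides with <italic>UF</italic><sub>1</sub>.</p>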
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub>, but it also shows that smaller brackets are much more numerous.</p></sec>
<sec>
<label>6.2.4.</label>
<title>Conserved Parsing among Minimal Grammar Parsings</title>
<p>In the previous section we analyzed how different the grammars are when brackets of different sizes are considered. Here, we consider the differences between the grammars at single positions. The objective of this experiment is to measure the number of different ways in which a single position of the original sequence <italic>s</italic> can be parsed by a minimal grammar. For this we consider the partial parse tree where only the first level is retained. Doing this for each minimal grammar, we compute for each position the number of different subtrees it belongs to. This is equivalent to the number of edges that cover that position in the <italic>MGP-graph</italic> of <italic>s</italic>. For instance, if this number is one for a given position, then in all minimal grammars the same occurrence of the same non-terminal expands over this position.</p>
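<p>Given the edges of the <italic>MGP-graph</italic> of <italic>s</italic>, this count reduces to a coverage computation. The sketch below assumes edges stored as (start, end, label) triples, where an edge covers the positions from start up to, but not including, end; this representation and the function names are assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def parses_per_position(mgp_edges, sequence_length):
    """For each position, count the distinct MGP-graph edges covering it.

    An edge (start, end, label) covers positions start .. end - 1.
    """
    coverage = [0] * sequence_length
    for start, end, _label in set(mgp_edges):
        for position in range(start, end):
            coverage[position] += 1
    return coverage


def fraction_with_unique_parse(coverage):
    """Share of positions covered by exactly one edge (non-empty input)."""
    return sum(1 for count in coverage if count == 1) / len(coverage)
```
]]></preformat>
<p>On the <italic>MGP-graph</italic> of <italic>abaaba</italic> with constituents {<italic>ab, ba</italic>}, every position is covered by two edges, whereas a sequence with a single minimal parse gives a coverage of one everywhere.</p>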
<p>On 
<monospace>alice29.txt</monospace>, 89% of the positions are parsed in exactly the same way. A histogram of the number of different parses per position can be seen in <xref ref-type="fig" rid="f3-algorithms-04-00262">Figure 3</xref>. Note that the y-axis is in logarithmic scale and that the number of positions decreases drastically as the number of parses increases: only 10% of the positions are parsed in two different ways, 1% in three, and all others in less than 0.2%.</p>
<p>Two regions presented peaks in the number of different parses. Both correspond to parts of the text with long runs of the same character (white space): one at the beginning, and one in the middle, within a poem.</p>
<p>While this experiment is restricted to the first level of the parse tree, it seems to indicate that the huge number of different minimal parses is due to a small number of positions where different parses have the same size. Most of the positions, however, are always parsed in the same way.</p></sec></sec>
<sec>
<label>6.3.</label>
<title>Some Remarks</title>
<p>Summarising this section, we have the following theoretical results:
<list list-type="bullet">
<list-item>
<p>There can be an exponential number of sets of constituents such that all of them produce smallest grammars (Proposition 2).</p></list-item>
<list-item>
<p>There might be two smallest grammars not sharing a single constituent other than <italic>s</italic> (from the proof of Proposition 2).</p></list-item>
<list-item>
<p>There can be an exponential number of smallest grammars even if the set of constituents is fixed (Proposition 3).</p></list-item>
<list-item>
<p>Two smallest grammars with the same set of constituents might not share a single bracket other than the one for <italic>s</italic> (from the proof of Proposition 3).</p></list-item></list></p>
<p>Thus, from a theoretical point of view, there may exist structures that are completely incompatible with each other and nevertheless equally interesting according to the minimal size criterion. Given these results, it may be naive to expect a single, correct, hierarchical structure to emerge from approximations of a smallest grammar.</p>
<p>Accordingly, our experiments showed that the number of different smallest grammars grows well beyond what can be enumerated explicitly.</p>
<p>Yet, the <italic>UF</italic><sub>1</sub> experiment suggested the existence of a common structure shared by the minimal grammars, while the <italic>UF</italic><sub>1,<italic>k</italic></sub> experiment showed that this conservation involves longer constituents more than shorter ones. The last experiment seems to indicate something similar, but concerning the conservation of first-level brackets (which may not be the same as the longest brackets). One interpretation of these experiments is that, in practice, the observed number of different grammars may be due to the combination of all the possible parses by less significant non-terminals, while a common informative structure is conserved among the grammars. In that case, identifying the relevant structure of the sequence would require distinguishing significant non-terminals, for instance by statistical tests. Meaningless constituents could also be discarded by modifying the choice function according to available background knowledge, or by shifting from a pure Occam's razor point of view to a more Minimum Description Length oriented objective, which would prevent the introduction of non-informative non-terminals into the grammars.</p></sec></sec>
<sec sec-type="conclusions">
<label>7.</label>
<title>Conclusions and Future Work</title>
<p>In this paper, we analyzed a new approach to the Smallest Grammar Problem, which consists in optimizing separately the choice of words that are going to be constituents and the choice of which occurrences of these constituents will be rewritten by non-terminals. Given a choice of constituents, we resolve the optimal choice of occurrences with the polynomial-time algorithm <italic>mgp</italic>. We improve classical offline algorithms by optimizing the choice of occurrences at each step. This overcomes a restriction of these classical algorithms that prevents them from finding smallest grammars.</p>
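<p>A minimal sketch of the occurrence-choice step is given below, assuming that it can be cast as a shortest-path computation on a graph with one node per boundary of the word, one unit-cost edge per character, and one unit-cost edge per occurrence of a constituent. The function names, the angle-bracket labels for non-terminals, the use of the total right-hand-side length as size measure, and the toy input are all assumptions made for illustration. This is not the implementation evaluated in this paper.</p>
<preformat><![CDATA[
```python
def occurrences(word, text):
    """Yield every start index of `word` in `text`, overlaps included."""
    start = text.find(word)
    while start != -1:
        yield start
        start = text.find(word, start + 1)


def minimal_parse(text, constituents):
    """Return a shortest sequence of symbols that spells `text`.

    Nodes are the boundaries 0..len(text). There is one unit-cost edge per
    character and one per occurrence of a constituent other than `text`.
    """
    n = len(text)
    incoming = [[] for _ in range(n + 1)]  # incoming[j]: (i, label) edges i -> j
    for i, ch in enumerate(text):
        incoming[i + 1].append((i, ch))
    for word in constituents:
        if not word or word == text:
            continue  # skip empty words and the word being parsed itself
        for i in occurrences(word, text):
            incoming[i + len(word)].append((i, f"<{word}>"))
    # Every edge goes left to right, so one forward pass finds shortest paths.
    best = [0] * (n + 1)
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        i, label = min(incoming[j], key=lambda edge: best[edge[0]])
        best[j], back[j] = best[i] + 1, (i, label)
    symbols, j = [], n
    while j > 0:  # walk the back-pointers from the end to the start
        j, label = back[j]
        symbols.append(label)
    return symbols[::-1]


def mgp_sketch(sequence, constituents):
    """Parse the sequence and each constituent with the same word set."""
    rules = {"S": minimal_parse(sequence, constituents)}
    for word in constituents:
        if word:
            rules[f"<{word}>"] = minimal_parse(word, constituents)
    return rules


# Hypothetical toy input.
rules = mgp_sketch("abcabcdabcabc", ["abcabc", "abc"])
total = sum(len(rhs) for rhs in rules.values())  # assumed size measure
for head, rhs in rules.items():
    print(head, "->", " ".join(rhs))
print("total right-hand-side length:", total)
```
]]></preformat>
<p>On this toy input the sequence is parsed as two occurrences of the longer constituent around the letter <italic>d</italic>, and the longer constituent is itself parsed as two occurrences of the shorter one.</p>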
<p>This separation allows the search space to be defined as a lattice over sets of repeats, where each node has an associated score corresponding to the size of the MGP of this node. We then propose a new algorithm that finds a local minimum. It explores this search space by adding repeats to, but also removing repeats from, the current set of words. Our experiments show that both approaches outperform state-of-the-art algorithms.</p>
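<p>The following sketch illustrates one way such a local search over the lattice could be organized, assuming greedy phases that alternately add and remove a single repeat as long as the score strictly decreases. The <monospace>score</monospace> callable stands in for the size of the minimal grammar parsing of a node. The score table below is hypothetical and was chosen only so that a removal step pays off. The function names are likewise assumptions, and the sketch is not claimed to reproduce the exact algorithm evaluated here.</p>
<preformat><![CDATA[
```python
def greedy_phase(current, best, candidates, score, move):
    """Apply the best single `move` repeatedly while it lowers the score."""
    changed = False
    while True:
        options = [move(current, c) for c in candidates]
        options = [o for o in options if o != current]
        if not options:
            return current, best, changed
        nxt = min(options, key=score)
        nxt_score = score(nxt)
        if nxt_score >= best:
            return current, best, changed
        current, best, changed = nxt, nxt_score, True


def zigzag_search(candidates, score):
    """Alternate add and remove phases until neither improves the score."""
    candidates = sorted(candidates)  # fixed order makes tie-breaking stable
    current = frozenset()
    best = score(current)
    while True:
        current, best, added = greedy_phase(
            current, best, candidates, score, lambda s, c: s | {c})
        current, best, removed = greedy_phase(
            current, best, candidates, score, lambda s, c: s - {c})
        if not (added or removed):
            return current, best


# Hypothetical scores for every subset of three candidate repeats.
table = {
    frozenset(): 10,
    frozenset("a"): 8, frozenset("b"): 9, frozenset("c"): 9,
    frozenset("ab"): 7, frozenset("ac"): 8, frozenset("bc"): 5,
    frozenset("abc"): 6,
}
result = zigzag_search("abc", table.__getitem__)
print(sorted(result[0]), result[1])
```
]]></preformat>
<p>With this table, the add phase climbs from the empty set to all three candidates (scores 10, 8, 7, 6), and the remove phase then drops one of them to reach a node of score 5 that the add phase alone would not have reached.</p>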
<p>We then analyzed more closely how this approach can be used for structure discovery, a main application of the smallest grammar problem. While in applications of data compression and Kolmogorov complexity we look for the size of a smallest grammar, in structure discovery the goal is the structure itself, as given by the grammar. We used our formalization of the smallest grammar problem to analyze the number of possible smallest grammars (globally and given a fixed set of constituents). We proved that there may be an exponential number of grammars with the same minimal size, and analyzed how different they are. Because finding the smallest grammar is intractable, we restricted ourselves here to studying the differences between smallest grammars given the best set of constituents our algorithms were able to find. While in theory there may exist several incompatible smallest grammars, our experiments seem to confirm that, in practical cases, there is an overall stability of the different parses. We think that these results can give both some warnings and clues on how to use the result of an approximation to the smallest grammar problem for structure discovery.</p>
<p>The optimization of the choice of occurrences opens new perspectives when searching for the smallest grammars, especially for the inference of the structure of sequences. In future work, we want to study how this scheme actually helps to find better structure in real applications. Our efficiency concerns in this paper were limited to avoiding exponential running time. In future work we plan to investigate more efficient algorithms and study the compromise between execution time and final grammar size.</p></sec></body>
<back>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-algorithms-04-00262" position="float">
<label>Figure 1.</label>
<caption>
<p>Grammars returned by classical algorithms on sequence <italic>abcdabgeabceabcd</italic>$.</p></caption>
<graphic xlink:href="algorithms-04-00262f1.gif"/></fig>
<fig id="f2-algorithms-04-00262" position="float">
<label>Figure 2.</label>
<caption>
<p>The fraction of the lattice that is explored by the ZZ algorithm.</p></caption>
<graphic xlink:href="algorithms-04-00262f2.gif"/></fig>
<fig id="f3-algorithms-04-00262" position="float">
<label>Figure 3.</label>
<caption>
<p>The x-axis shows the number of different symbols that expand to one position; the y-axis shows the number of positions that have this number of expansions. Note that the y-axis is in logarithmic scale.</p></caption>
<graphic xlink:href="algorithms-04-00262f3.gif"/></fig>
<table-wrap id="t1-algorithms-04-00262" position="float">
<label>Table 1.</label>
<caption>
<p>Corpus statistics.</p></caption>
<table frame="box" rules="cols">
<thead>
<tr>
<th align="left" valign="top"><bold>Sequence</bold></th>
<th align="left" valign="top"><bold>Length</bold></th>
<th align="left" valign="top"><bold># of Repeats</bold></th></tr>
<tr>
<th valign="bottom" colspan="3">
<hr/></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">alice29.txt</td>
<td align="left" valign="top">152,089</td>
<td align="left" valign="top">220,204</td></tr>
<tr>
<td align="left" valign="top">asyoulik.txt</td>
<td align="left" valign="top">125,179</td>
<td align="left" valign="top">152,695</td></tr>
<tr>
<td align="left" valign="top">cp.html</td>
<td align="left" valign="top">24,603</td>
<td align="left" valign="top">106,235</td></tr>
<tr>
<td align="left" valign="top">fields.c</td>
<td align="left" valign="top">11,150</td>
<td align="left" valign="top">56,132</td></tr>
<tr>
<td align="left" valign="top">grammar.lsp</td>
<td align="left" valign="top">3,721</td>
<td align="left" valign="top">12,780</td></tr>
<tr>
<td align="left" valign="top">kennedy.xls</td>
<td align="left" valign="top">1,029,744</td>
<td align="left" valign="top">87,427</td></tr>
<tr>
<td align="left" valign="top">lcet10.txt</td>
<td align="left" valign="top">426,754</td>
<td align="left" valign="top">853,083</td></tr>
<tr>
<td align="left" valign="top">plrabn12.txt</td>
<td align="left" valign="top">481,861</td>
<td align="left" valign="top">491,533</td></tr>
<tr>
<td align="left" valign="top">ptt5</td>
<td align="left" valign="top">513,216</td>
<td align="left" valign="top">99,944,933</td></tr>
<tr>
<td align="left" valign="top">sum</td>
<td align="left" valign="top">38,240</td>
<td align="left" valign="top">666,934</td></tr>
<tr>
<td align="left" valign="top">xargs.1</td>
<td align="left" valign="top">4,227</td>
<td align="left" valign="top">7,502</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-algorithms-04-00262" position="float">
<label>Table 2.</label>
<caption>
<p>Grammar sizes on the Canterbury corpus. Files on which ZZ did not finish are marked with a dash.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"/>
<th colspan="5" align="center" valign="top">Algorithms from the literature</th>
<th colspan="4" align="center" valign="top">Optimizing occurrences</th></tr>
<tr>
<th valign="bottom" colspan="10">
<hr/></th></tr>
<tr>
<th align="left" valign="bottom" rowspan="2">Sequences</th>
<th align="left" valign="bottom" rowspan="2">Sequitur</th>
<th align="left" valign="bottom" rowspan="2">LZ78</th>
<th colspan="3" align="center" valign="top">IRR</th>
<th colspan="3" align="center" valign="top">IRRCOO</th>
<th align="left" valign="bottom" rowspan="2">ZZ</th></tr>
<tr>
<th align="left" valign="bottom">MF</th>
<th align="left" valign="bottom">ML</th>
<th align="left" valign="bottom">MC</th>
<th align="left" valign="bottom">MF</th>
<th align="left" valign="bottom">ML</th>
<th align="left" valign="bottom">MC</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">alice29.txt</td>
<td align="left" valign="top">49,147</td>
<td align="left" valign="top">116,296</td>
<td align="left" valign="top">42,453</td>
<td align="left" valign="top">56,056</td>
<td align="left" valign="top"><bold>41,000</bold></td>
<td align="left" valign="top">39,794</td>
<td align="left" valign="top">52,351</td>
<td align="left" valign="top"><bold>39,251</bold></td>
<td align="left" valign="top"><bold>37,701</bold></td></tr>
<tr>
<td align="left" valign="top">asyoulik.txt</td>
<td align="left" valign="top">44,123</td>
<td align="left" valign="top">102,296</td>
<td align="left" valign="top">38,507</td>
<td align="left" valign="top">51,470</td>
<td align="left" valign="top"><bold>37,474</bold></td>
<td align="left" valign="top">36,822</td>
<td align="left" valign="top">48,133</td>
<td align="left" valign="top"><bold>36,384</bold></td>
<td align="left" valign="top"><bold>35,000</bold></td></tr>
<tr>
<td align="left" valign="top">cp.html</td>
<td align="left" valign="top">9,835</td>
<td align="left" valign="top">22,658</td>
<td align="left" valign="top">8,479</td>
<td align="left" valign="top">9,612</td>
<td align="left" valign="top"><bold>8,048</bold></td>
<td align="left" valign="top">8,369</td>
<td align="left" valign="top">9,313</td>
<td align="left" valign="top"><bold>7,941</bold></td>
<td align="left" valign="top"><bold>7,767</bold></td></tr>
<tr>
<td align="left" valign="top">fields.c</td>
<td align="left" valign="top">4,108</td>
<td align="left" valign="top">11,056</td>
<td align="left" valign="top">3,765</td>
<td align="left" valign="top">3,980</td>
<td align="left" valign="top"><bold>3,416</bold></td>
<td align="left" valign="top">3,713</td>
<td align="left" valign="top">3,892</td>
<td align="left" valign="top"><bold>3,373</bold></td>
<td align="left" valign="top"><bold>3,311</bold></td></tr>
<tr>
<td align="left" valign="top">grammar.lsp</td>
<td align="left" valign="top">1,770</td>
<td align="left" valign="top">4,225</td>
<td align="left" valign="top">1,615</td>
<td align="left" valign="top">1,730</td>
<td align="left" valign="top"><bold>1,473</bold></td>
<td align="left" valign="top">1,621</td>
<td align="left" valign="top">1,704</td>
<td align="left" valign="top"><bold>1,471</bold></td>
<td align="left" valign="top"><bold>1,465</bold></td></tr>
<tr>
<td align="left" valign="top">kennedy.xls</td>
<td align="left" valign="top">174,585</td>
<td align="left" valign="top">365,466</td>
<td align="left" valign="top">167,076</td>
<td align="left" valign="top">179,753</td>
<td align="left" valign="top"><bold>166,924</bold></td>
<td align="left" valign="top">166,817</td>
<td align="left" valign="top">179,281</td>
<td align="left" valign="top"><bold>166,760</bold></td>
<td align="left" valign="top"><bold>166,704</bold></td></tr>
<tr>
<td align="left" valign="top">lcet10.txt</td>
<td align="left" valign="top">112,205</td>
<td align="left" valign="top">288,250</td>
<td align="left" valign="top">92,913</td>
<td align="left" valign="top">130,409</td>
<td align="left" valign="top"><bold>90,099</bold></td>
<td align="left" valign="top">90,493</td>
<td align="left" valign="top">120,140</td>
<td align="left" valign="top"><bold>88,561</bold></td>
<td align="left" valign="top">–</td></tr>
<tr>
<td align="left" valign="top">plrabn12.txt</td>
<td align="left" valign="top">142,656</td>
<td align="left" valign="top">338,762</td>
<td align="left" valign="top">125,366</td>
<td align="left" valign="top">180,203</td>
<td align="left" valign="top"><bold>124,198</bold></td>
<td align="left" valign="top">114,959</td>
<td align="left" valign="top">164,728</td>
<td align="left" valign="top"><bold>117,326</bold></td>
<td align="left" valign="top">–</td></tr>
<tr>
<td align="left" valign="top">ptt5</td>
<td align="left" valign="top">55,692</td>
<td align="left" valign="top">106,456</td>
<td align="left" valign="top">45,639</td>
<td align="left" valign="top">56,452</td>
<td align="left" valign="top"><bold>45,135</bold></td>
<td align="left" valign="top">44,192</td>
<td align="left" valign="top">53,738</td>
<td align="left" valign="top"><bold>43,958</bold></td>
<td align="left" valign="top">–</td></tr>
<tr>
<td align="left" valign="top">sum</td>
<td align="left" valign="top">15,329</td>
<td align="left" valign="top">35,056</td>
<td align="left" valign="top">12,965</td>
<td align="left" valign="top">13,866</td>
<td align="left" valign="top"><bold>12,207</bold></td>
<td align="left" valign="top">12,878</td>
<td align="left" valign="top">13,695</td>
<td align="left" valign="top"><bold>12,114</bold></td>
<td align="left" valign="top"><bold>12,027</bold></td></tr>
<tr>
<td align="left" valign="top">xargs.1</td>
<td align="left" valign="top">2,329</td>
<td align="left" valign="top">5,309</td>
<td align="left" valign="top">2,137</td>
<td align="left" valign="top">2,254</td>
<td align="left" valign="top"><bold>2,006</bold></td>
<td align="left" valign="top">2,142</td>
<td align="left" valign="top">2,237</td>
<td align="left" valign="top"><bold>1,989</bold></td>
<td align="left" valign="top"><bold>1,972</bold></td></tr></tbody></table></table-wrap>
<table-wrap id="t3-algorithms-04-00262" position="float">
<label>Table 3.</label>
<caption>
<p>Sequence length, grammar size, number of constituents, and number of grammars for different sequences.</p></caption>
<table frame="box" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>Sequence</bold></th>
<th align="center" valign="top"><bold>
<monospace>humdyst.chr.seq</monospace></bold></th>
<th align="center" valign="top"><bold>
<monospace>asyoulik.txt</monospace></bold></th>
<th align="center" valign="top"><bold>
<monospace>alice29.txt</monospace></bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Sequence length</td>
<td align="center" valign="top">38,770</td>
<td align="center" valign="top">125,179</td>
<td align="center" valign="top">152,089</td></tr>
<tr>
<td align="left" valign="top">Grammar size</td>
<td align="center" valign="top">10,035</td>
<td align="center" valign="top">35,000</td>
<td align="center" valign="top">37,701</td></tr>
<tr>
<td align="left" valign="top">Number of constituents</td>
<td align="center" valign="top">576</td>
<td align="center" valign="top">2,391</td>
<td align="center" valign="top">2,749</td></tr>
<tr>
<td align="left" valign="top">Number of grammars</td>
<td align="center" valign="top">2 × 10<sup>497</sup></td>
<td align="center" valign="top">7 × 10<sup>968</sup></td>
<td align="center" valign="top">8 × 10<sup>936</sup></td></tr></tbody></table></table-wrap>
<table-wrap id="t4-algorithms-04-00262" position="float">
<label>Table 4.</label>
<caption>
<p>Different figures for a sample of 1000 grammars uniformly taken from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub>.</p></caption>
<table frame="box" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>Sequence</bold></th>
<th align="center" valign="top"><bold>
<monospace>humdyst.chr.seq</monospace></bold></th>
<th align="center" valign="top"><bold>
<monospace>asyoulik.txt</monospace></bold></th>
<th align="center" valign="top"><bold>
<monospace>alice29.txt</monospace></bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top"><italic>UF</italic><sub>1</sub> mean</td>
<td align="center" valign="top">66.02%</td>
<td align="center" valign="top">81.48%</td>
<td align="center" valign="top">77.81%</td></tr>
<tr>
<td align="left" valign="top"><italic>UF</italic><sub>1</sub> standard deviation</td>
<td align="center" valign="top">1.19%</td>
<td align="center" valign="top">1.32%</td>
<td align="center" valign="top">1.52%</td></tr>
<tr>
<td align="left" valign="top">Smallest <italic>UF</italic><sub>1</sub></td>
<td align="center" valign="top">62.26%</td>
<td align="center" valign="top">77.94%</td>
<td align="center" valign="top">73.44%</td></tr>
<tr>
<td align="left" valign="top">Largest <italic>UF</italic><sub>1</sub></td>
<td align="center" valign="top">70.64%</td>
<td align="center" valign="top">85.64%</td>
<td align="center" valign="top">83.44%</td></tr></tbody></table></table-wrap>
<table-wrap id="t5-algorithms-04-00262" position="float">
<label>Table 5.</label>
<caption>
<p><italic>UF</italic><sub>1,<italic>k</italic></sub> for different values of <italic>k</italic>, each from a sample of 100 grammars uniformly taken from 
<inline-graphic xlink:href="algorithms-04-00262i7.gif"/><sub><italic>Q</italic></sub>.</p></caption>
<table frame="box" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>k</bold></th>
<th colspan="2" align="left" valign="top"><bold>
<monospace>asyoulik.txt.ij</monospace></bold></th>
<th colspan="2" align="left" valign="top"><bold>
<monospace>alice29.txt.ij</monospace></bold></th>
<th colspan="2" align="left" valign="top"><bold>
<monospace>humdyst.chr.seq.ij</monospace></bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top"/>
<td align="left" valign="top"><italic>UF</italic><sub>1,<italic>k</italic></sub></td>
<td align="left" valign="top">Brackets</td>
<td align="left" valign="top"><italic>UF</italic><sub>1,<italic>k</italic></sub></td>
<td align="left" valign="top">Brackets</td>
<td align="left" valign="top"><italic>UF</italic><sub>1,<italic>k</italic></sub></td>
<td align="left" valign="top">Brackets</td></tr>
<tr>
<td align="left" valign="top">1</td>
<td align="left" valign="top">81.50%</td>
<td align="left" valign="top">100.00%</td>
<td align="left" valign="top">77.97%</td>
<td align="left" valign="top">100.00%</td>
<td align="left" valign="top">67.32%</td>
<td align="left" valign="top">100.00%</td></tr>
<tr>
<td align="left" valign="top">2</td>
<td align="left" valign="top">88.26%</td>
<td align="left" valign="top">50.86%</td>
<td align="left" valign="top">83.70%</td>
<td align="left" valign="top">53.14%</td>
<td align="left" valign="top">71.11%</td>
<td align="left" valign="top">45.99%</td></tr>
<tr>
<td align="left" valign="top">3</td>
<td align="left" valign="top">92.49%</td>
<td align="left" valign="top">29.57%</td>
<td align="left" valign="top">87.94%</td>
<td align="left" valign="top">32.42%</td>
<td align="left" valign="top">75.93%</td>
<td align="left" valign="top">37.54%</td></tr>
<tr>
<td align="left" valign="top">4</td>
<td align="left" valign="top">95.21%</td>
<td align="left" valign="top">19.85%</td>
<td align="left" valign="top">89.60%</td>
<td align="left" valign="top">22.01%</td>
<td align="left" valign="top">82.17%</td>
<td align="left" valign="top">15.69%</td></tr>
<tr>
<td align="left" valign="top">5</td>
<td align="left" valign="top">96.35%</td>
<td align="left" valign="top">11.78%</td>
<td align="left" valign="top">88.88%</td>
<td align="left" valign="top">14.36%</td>
<td align="left" valign="top">88.51%</td>
<td align="left" valign="top">3.96%</td></tr>
<tr>
<td align="left" valign="top">6</td>
<td align="left" valign="top">97.18%</td>
<td align="left" valign="top">8.23%</td>
<td align="left" valign="top">89.45%</td>
<td align="left" valign="top">9.66%</td>
<td align="left" valign="top">95.46%</td>
<td align="left" valign="top">1.24%</td></tr>
<tr>
<td align="left" valign="top">7</td>
<td align="left" valign="top">97.84%</td>
<td align="left" valign="top">5.72%</td>
<td align="left" valign="top">91.50%</td>
<td align="left" valign="top">6.44%</td>
<td align="left" valign="top">98.38%</td>
<td align="left" valign="top">0.44%</td></tr>
<tr>
<td align="left" valign="top">8</td>
<td align="left" valign="top">97.82%</td>
<td align="left" valign="top">3.83%</td>
<td align="left" valign="top">92.78%</td>
<td align="left" valign="top">4.30%</td>
<td align="left" valign="top">99.87%</td>
<td align="left" valign="top">0.19%</td></tr>
<tr>
<td align="left" valign="top">9</td>
<td align="left" valign="top">98.12%</td>
<td align="left" valign="top">2.76%</td>
<td align="left" valign="top">92.37%</td>
<td align="left" valign="top">2.95%</td>
<td align="left" valign="top">100.00%</td>
<td align="left" valign="top">0.06%</td></tr>
<tr>
<td align="left" valign="top">10</td>
<td align="left" valign="top">98.35%</td>
<td align="left" valign="top">1.88%</td>
<td align="left" valign="top">91.87%</td>
<td align="left" valign="top">2.10%</td>
<td align="left" valign="top">100.00%</td>
<td align="left" valign="top">0.04%</td></tr></tbody></table></table-wrap></sec>
<ack>
<p>The work described in this paper was partially supported by the Program of International Scientific Cooperation MINCyT - INRIA CNRS.</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-algorithms-04-00262"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Charikar</surname><given-names>M.</given-names></name><name><surname>Lehman</surname><given-names>E.</given-names></name><name><surname>Liu</surname><given-names>D.</given-names></name><name><surname>Panigrahy</surname><given-names>R.</given-names></name><name><surname>Prabhakaran</surname><given-names>M.</given-names></name><name><surname>Sahai</surname><given-names>A.</given-names></name><name><surname>Shelat</surname><given-names>A.</given-names></name></person-group><article-title>The smallest grammar problem</article-title><source>IEEE Trans. Inf. Theory</source><year>2005</year><volume>51</volume><fpage>2554</fpage><lpage>2576</lpage><pub-id pub-id-type="doi">10.1109/TIT.2005.850116</pub-id></citation></ref>
<ref id="b2-algorithms-04-00262"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rytter</surname><given-names>W.</given-names></name></person-group><article-title>Application of Lempel-Ziv factorization to the approximation of grammar-based compression</article-title><source>Theor. Comput. Sci.</source><year>2003</year><volume>302</volume><fpage>211</fpage><lpage>222</lpage><pub-id pub-id-type="doi">10.1016/S0304-3975(02)00777-6</pub-id></citation></ref>
<ref id="b3-algorithms-04-00262"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kieffer</surname><given-names>J.</given-names></name><name><surname>Yang</surname><given-names>E.H.</given-names></name></person-group><article-title>Grammar-based codes: a new class of universal lossless source codes</article-title><source>IEEE Trans. Inf. Theory</source><year>2000</year><volume>46</volume><fpage>737</fpage><lpage>754</lpage><pub-id pub-id-type="doi">10.1109/18.841160</pub-id></citation></ref>
<ref id="b4-algorithms-04-00262"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ziv</surname><given-names>J.</given-names></name><name><surname>Lempel</surname><given-names>A.</given-names></name></person-group><article-title>Compression of Individual Sequences via Variable-Rate Coding</article-title><source>IEEE Trans. Inf. Theory</source><year>1978</year><volume>24</volume><fpage>530</fpage><lpage>536</lpage><pub-id pub-id-type="doi">10.1109/TIT.1978.1055934</pub-id></citation></ref>
<ref id="b5-algorithms-04-00262"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ziv</surname><given-names>J.</given-names></name><name><surname>Lempel</surname><given-names>A.</given-names></name></person-group><article-title>A Universal Algorithm for Sequential Data Compression</article-title><source>IEEE Trans. Inf. Theory</source><year>1977</year><volume>23</volume><fpage>337</fpage><lpage>343</lpage><pub-id pub-id-type="doi">10.1109/TIT.1977.1055714</pub-id></citation></ref>
<ref id="b6-algorithms-04-00262"><label>6.</label><citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Nevill-Manning</surname><given-names>C.G.</given-names></name></person-group><article-title>Inferring Sequential Structure</article-title><source>Ph.D. Thesis</source><publisher-name>University of Waikato</publisher-name><publisher-loc>Hamilton, New Zealand</publisher-loc><year>1996</year></citation></ref>
<ref id="b7-algorithms-04-00262"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wolff</surname><given-names>J.</given-names></name></person-group><article-title>An algorithm for the segmentation of an artificial language analogue</article-title><source>Br. J. Psychol.</source><year>1975</year><volume>66</volume><fpage>79</fpage><lpage>90</lpage><pub-id pub-id-type="doi">10.1111/j.2044-8295.1975.tb01442.x</pub-id></citation></ref>
<ref id="b8-algorithms-04-00262"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Bentley</surname><given-names>J.</given-names></name><name><surname>McIlroy</surname><given-names>D.</given-names></name></person-group><article-title>Data compression using long common strings</article-title><conf-name>Proceedings of the Data Compression Conference</conf-name><publisher-name>IEEE</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1999</year><fpage>287</fpage><lpage>295</lpage></citation></ref>
<ref id="b9-algorithms-04-00262"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nevill-Manning</surname><given-names>C.</given-names></name><name><surname>Witten</surname><given-names>I.</given-names></name></person-group><article-title>On-line and off-line heuristics for inferring hierarchies of repetitions in sequences</article-title><source>Proc. IEEE</source><year>2000</year><volume>88</volume><fpage>1745</fpage><lpage>1755</lpage></citation></ref>
<ref id="b10-algorithms-04-00262"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Apostolico</surname><given-names>A.</given-names></name><name><surname>Lonardi</surname><given-names>S.</given-names></name></person-group><article-title>Off-line compression by greedy textual substitution</article-title><source>Proc. IEEE</source><year>2000</year><volume>88</volume><fpage>1733</fpage><lpage>1744</lpage><pub-id pub-id-type="doi">10.1109/5.892709</pub-id></citation></ref>
<ref id="b11-algorithms-04-00262"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larsson</surname><given-names>N.</given-names></name><name><surname>Moffat</surname><given-names>A.</given-names></name></person-group><article-title>Off-line dictionary-based compression</article-title><source>Proc. IEEE</source><year>2000</year><volume>88</volume><fpage>1722</fpage><lpage>1732</lpage><pub-id pub-id-type="doi">10.1109/5.892708</pub-id></citation></ref>
<ref id="b12-algorithms-04-00262"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nakamura</surname><given-names>R.</given-names></name><name><surname>Inenaga</surname><given-names>S.</given-names></name><name><surname>Bannai</surname><given-names>H.</given-names></name><name><surname>Funamoto</surname><given-names>T.</given-names></name><name><surname>Takeda</surname><given-names>M.</given-names></name><name><surname>Shinohara</surname><given-names>A.</given-names></name></person-group><article-title>Linear-Time Text Compression by Longest-First Substitution</article-title><source>Algorithms</source><year>2009</year><volume>2</volume><fpage>1429</fpage><lpage>1448</lpage><pub-id pub-id-type="doi">10.3390/a2041429</pub-id></citation></ref>
<ref id="b13-algorithms-04-00262"><label>13.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Lanctot</surname><given-names>J.K.</given-names></name><name><surname>Li</surname><given-names>M.</given-names></name><name><surname>Yang</surname><given-names>E.H.</given-names></name></person-group><article-title>Estimating DNA sequence entropy</article-title><conf-name>ACM-SIAM Symposium on Discrete Algorithms</conf-name><publisher-name>Society for Industrial and Applied Mathematics</publisher-name><publisher-loc>Philadelphia, PA, USA</publisher-loc><year>2000</year><fpage>409</fpage><lpage>418</lpage></citation></ref>
<ref id="b14-algorithms-04-00262"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Evans</surname><given-names>S.C.</given-names></name><name><surname>Kourtidis</surname><given-names>A.</given-names></name><name><surname>Markham</surname><given-names>T.</given-names></name><name><surname>Miller</surname><given-names>J.</given-names></name></person-group><article-title>MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress</article-title><source>EURASIP J. Bioinf. Sys. Biol.</source><year>2007</year><volume>2007</volume><fpage>1</fpage><lpage>16</lpage></citation></ref>
<ref id="b15-algorithms-04-00262"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nevill-Manning</surname><given-names>C.G.</given-names></name><name><surname>Witten</surname><given-names>I.H.</given-names></name></person-group><article-title>Identifying Hierarchical Structure in Sequences: A linear-time algorithm</article-title><source>J. Artif. Intell. Res.</source><year>1997</year><volume>7</volume><fpage>67</fpage><lpage>82</lpage></citation></ref>
<ref id="b16-algorithms-04-00262"><label>16.</label><citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Marcken</surname><given-names>C.D.</given-names></name></person-group><article-title>Unsupervised language acquisition</article-title><source>Ph.D. Thesis</source><publisher-name>Massachusetts Institute of Technology</publisher-name><publisher-loc>Cambridge, MA, USA</publisher-loc><year>1996</year></citation></ref>
<ref id="b17-algorithms-04-00262"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakakibara</surname><given-names>Y.</given-names></name></person-group><article-title>Efficient Learning of Context-Free Grammars from Positive Structural Examples</article-title><source>Inf. Comput.</source><year>1992</year><volume>97</volume><fpage>23</fpage><lpage>60</lpage><pub-id pub-id-type="doi">10.1016/0890-5401(92)90003-X</pub-id></citation></ref>
<ref id="b18-algorithms-04-00262"><label>18.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Carrascosa</surname><given-names>R.</given-names></name><name><surname>Coste</surname><given-names>F.</given-names></name><name><surname>Gallé</surname><given-names>M.</given-names></name><name><surname>Infante-Lopez</surname><given-names>G.</given-names></name></person-group><source>Choosing Word Occurrences for the Smallest Grammar Problem</source><publisher-name>LATA</publisher-name><publisher-loc>Trier, Germany</publisher-loc><month>May</month><day>24–28</day><year>2010</year></citation></ref>
<ref id="b19-algorithms-04-00262"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karpinski</surname><given-names>M.</given-names></name><name><surname>Rytter</surname><given-names>W.</given-names></name><name><surname>Shinohara</surname><given-names>A.</given-names></name></person-group><article-title>An efficient pattern-matching algorithm for strings with short descriptions</article-title><source>Nord. J. Comput.</source><year>1997</year><volume>4</volume><fpage>172</fpage><lpage>186</lpage></citation></ref>
<ref id="b20-algorithms-04-00262"><label>20.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Gusfield</surname><given-names>D.</given-names></name></person-group><source>Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology</source><publisher-name>Cambridge University Press</publisher-name><publisher-loc>Cambridge, UK</publisher-loc><year>1997</year></citation></ref>
<ref id="b21-algorithms-04-00262"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakamoto</surname><given-names>H.</given-names></name><name><surname>Maruyama</surname><given-names>S.</given-names></name><name><surname>Kida</surname><given-names>T.</given-names></name><name><surname>Shimozono</surname><given-names>S.</given-names></name></person-group><article-title>A Space-Saving Approximation Algorithm for Grammar-Based Compression</article-title><source>IEICE Trans. Inf. Syst.</source><year>2009</year><volume>92-D</volume><fpage>158</fpage><lpage>165</lpage></citation></ref>
<ref id="b22-algorithms-04-00262"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuegraf</surname><given-names>E.J.</given-names></name><name><surname>Heaps</surname><given-names>H.S.</given-names></name></person-group><article-title>A comparison of algorithms for data base compression by use of fragments as language elements</article-title><source>Inf. Storage Retr.</source><year>1974</year><volume>10</volume><fpage>309</fpage><lpage>319</lpage><pub-id pub-id-type="doi">10.1016/0020-0271(74)90069-2</pub-id></citation></ref>
<ref id="b23-algorithms-04-00262"><label>23.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Arnold</surname><given-names>R.</given-names></name><name><surname>Bell</surname><given-names>T.</given-names></name></person-group><article-title>A Corpus for the Evaluation of Lossless Compression Algorithms</article-title><conf-name>Proceedings of the Data Compression Conference</conf-name><publisher-name>IEEE</publisher-name><publisher-loc>Washington, DC, USA</publisher-loc><year>1997</year><fpage>201</fpage></citation></ref>
<ref id="b24-algorithms-04-00262"><label>24.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Abney</surname><given-names>S.</given-names></name><name><surname>Flickenger</surname><given-names>D.</given-names></name><name><surname>Gdaniec</surname><given-names>C.</given-names></name><name><surname>Grishman</surname><given-names>R.</given-names></name><name><surname>Harrison</surname><given-names>P.</given-names></name><name><surname>Hindle</surname><given-names>D.</given-names></name><name><surname>Ingria</surname><given-names>R.</given-names></name><name><surname>Jelinek</surname><given-names>F.</given-names></name><name><surname>Klavans</surname><given-names>J.</given-names></name><name><surname>Liberman</surname><given-names>M.</given-names></name><name><surname>Marcus</surname><given-names>M.</given-names></name><name><surname>Roukos</surname><given-names>S.</given-names></name><name><surname>Santorini</surname><given-names>B.</given-names></name><name><surname>Strzalkowski</surname><given-names>T.</given-names></name></person-group><article-title>Procedure for quantitatively comparing the syntactic coverage of English grammars</article-title><conf-name>Proceedings of the Workshop on Speech and Natural Language</conf-name><person-group person-group-type="editor"><name><surname>Black</surname><given-names>E.</given-names></name></person-group><publisher-name>Association for Computational Linguistics</publisher-name><publisher-loc>Stroudsburg, PA, USA</publisher-loc><year>1991</year><fpage>306</fpage><lpage>311</lpage></citation></ref>
<ref id="b25-algorithms-04-00262"><label>25.</label><citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Klein</surname><given-names>D.</given-names></name></person-group><article-title>The Unsupervised Learning of Natural Language Structure</article-title><source>Ph.D. Thesis</source><publisher-name>Stanford University</publisher-name><publisher-loc>Stanford, CA, USA</publisher-loc><year>2005</year></citation></ref>
<ref id="b26-algorithms-04-00262"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manzini</surname><given-names>G.</given-names></name><name><surname>Rastero</surname><given-names>M.</given-names></name></person-group><article-title>A simple and fast DNA compressor</article-title><source>Softw. Pract. Exp.</source><year>2004</year><volume>34</volume><fpage>1397</fpage><lpage>1411</lpage><pub-id pub-id-type="doi">10.1002/spe.619</pub-id></citation></ref></ref-list></back></article>
