This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time by parsing longer constituents with smaller ones. We propose new algorithms, based on classical practical algorithms, that use this optimization to find small grammars. Our algorithms consistently find smaller grammars on a classical benchmark, reducing the size by 10% in some cases. Moreover, our formulation allows us to define interesting bounds on the number of small grammars and to empirically compare different grammars of small size.
The smallest grammar problem—namely, finding a smallest context-free grammar that generates exactly one sequence—is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery.
The size of a smallest grammar can be considered a computable variant of Kolmogorov complexity, in which the Turing machine description of the sequence is restricted to context-free grammars. The problem is then decidable, but still hard: the problem of finding a smallest grammar with an approximation ratio smaller than
If the grammar is small, storing the grammar instead of the sequence can be interesting from a data compression perspective. Kieffer and Yang developed the formal framework of compression by
In pattern discovery, a smallest grammar is a good candidate for being the one that generates the data, according to Occam's razor principle. In that case, the grammar may be used not only to compress the sequence but also to unveil its structure. Inference of the hierarchical structure of sequences was the initial motivation of Sequitur and has been the subject of several papers applying this scheme to DNA sequences [
In this article, which is an extended version of [
The outline of this paper is the following: in Section 2 we formally introduce the definitions, and in Section 3 the classical offline algorithms. Section 4 contains our main contributions. In Section 4.1 we show how to optimize the choice of occurrences to be replaced by nonterminals for a set of words, and we then extend offline algorithms by optimizing the choice of the occurrences at each step in Section 4.2. We present our search space and show that this optimization can also be used directly to guide the search in a new algorithm in Section 4.3. We present experiments on a classical benchmark in Section 5, showing that the occurrence optimization consistently allows us to find smaller grammars. In Section 6 we consider the number of smallest grammars that may exist and discuss the consequences of our results on structure discovery.
We start by giving a few definitions and setting up the nomenclature that we use throughout the paper. A
A
Given a string
Because the Smallest Grammar framework seeks a context-free grammar whose language contains one and only one string, the grammars we consider here neither branch (every nonterminal occurs at most once as the left-hand side of a rule) nor loop (if
Several definitions of the grammar size exist. Following [
Most offline algorithms follow the same general scheme. First, the grammar is initialized with a unique initial rule
When occurrences overlap, one has to specify which occurrences have to be replaced. One solution is to choose all the elements in the
The IRR schema instantiates different algorithms, depending on the score function
In order to derive a score function corresponding to C
IRR(
The complexity of IRR when it uses one of these scores is
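To make the IRR scheme above concrete, its greedy loop can be sketched in Python as follows. This is a simplified illustration, not the exact implementation of the algorithms discussed here: the occurrence counts below may include overlapping matches, the function names are ours, and the stopping rule assumes a compression-gain score (such as the MC-style gain (occ − 1)(|w| − 1) − 2 used in the example afterwards).

```python
def irr(seq, score):
    """Iterative Repeat Replacement (sketch). `seq` is a list of symbols;
    `score(word, count)` ranks repeated words; the loop stops when no
    repeat has a positive score."""
    rules = {"S": list(seq)}  # start with the single rule S -> s
    fresh = 0
    while True:
        # count occurrences of candidate words in all right-hand sides
        # (simplification: counts may include overlapping occurrences)
        counts = {}
        for rhs in rules.values():
            for k in range(2, len(rhs) // 2 + 1):
                for i in range(len(rhs) - k + 1):
                    w = tuple(rhs[i:i + k])
                    counts[w] = counts.get(w, 0) + 1
        repeats = {w: c for w, c in counts.items() if c >= 2}
        if not repeats:
            break
        best = max(repeats, key=lambda w: score(w, repeats[w]))
        if score(best, repeats[best]) <= 0:
            break
        nt = "N%d" % fresh
        fresh += 1
        for lhs in list(rules):
            rules[lhs] = replace_left_to_right(rules[lhs], list(best), nt)
        rules[nt] = list(best)  # add the new rule after the rewriting
    return rules


def replace_left_to_right(rhs, word, nt):
    """Replace non-overlapping occurrences of `word` greedily from the left."""
    out, i, k = [], 0, len(word)
    while i < len(rhs):
        if rhs[i:i + k] == word:
            out.append(nt)
            i += k
        else:
            out.append(rhs[i])
            i += 1
    return out
```

For instance, `irr(list("abcdabcdabcd"), lambda w, c: (c - 1) * (len(w) - 1) - 2)` produces the rules S → N0 N0 N0 and N0 → a b c d.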
The grammars found by the three IRR algorithms, plus Sequitur and LZ78, are shown on a small example in
Even though IRR algorithms are the best known practical algorithms for obtaining small grammars, they present some weaknesses. In the first place, their greedy strategy does not guarantee that the compression gain introduced by a selected word
It could be argued that there might exist a score function that, for every sequence, scores the repeats in such a way that the order in which they are presented to IRR results in a smallest grammar. The following theorem proves that this is not the case.
There are infinitely many sequences
Consider the infinite set of sequences
Note that Theorem 1 does not make any assumptions about the possible score function
Once an IRR algorithm has chosen a repeated word
We propose to separate the choice of which terminal strings will be constituents of the final grammar from the choice of which occurrences of these constituents will be replaced by nonterminals. First, let us assume that a set of constituents {
For this, we define a new problem, called
all symbols of
for each string
for each nonterminal

Note that this is similar to the Smallest Grammar Problem, except that all constituents for the nonterminals of the grammar are also given. The MGP problem is related to the problem of static dictionary parsing [
As an example consider the sequence
This problem can be solved in polynomial time in a classical way, by searching for a shortest path in
Let {
Define
for every node
there will be an edge from node
For each Γ
Return the labels of these paths.
The right-hand side for nonterminal
The list of occurrences of each constituent over the original sequence can be stored at the moment it is chosen. Supposing then that the graphs are created, and as the length of each constituent is bounded by
Note that in practice the graph Γ_{0} contains all the information for all other graphs: any Γ
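The shortest-path formulation above can be sketched as follows. This is a minimal Python illustration under two assumptions: breadth-first search stands in for a generic shortest-path algorithm (all edges have unit weight), and the helper names `minimal_parse` and `mgp` are ours, not part of the formal definition.

```python
from collections import deque

def minimal_parse(w, constituents, names):
    """Shortest parse of string `w` using single symbols plus the given
    constituent strings: BFS over the parsing DAG on positions 0..|w|."""
    n = len(w)
    edges = [[] for _ in range(n)]  # edges[i] = list of (end, label)
    for i in range(n):
        edges[i].append((i + 1, w[i]))          # terminal edge
        for q in constituents:
            if q != w and w.startswith(q, i):   # occurrence of constituent
                edges[i].append((i + len(q), names[q]))
    prev = {0: None}
    queue = deque([0])
    while queue:                                # unit weights: BFS suffices
        i = queue.popleft()
        for j, label in edges[i] if i < n else []:
            if j not in prev:
                prev[j] = (i, label)
                queue.append(j)
    labels, j = [], n                           # reconstruct a shortest path
    while prev[j] is not None:
        i, label = prev[j]
        labels.append(label)
        j = i
    return list(reversed(labels))

def mgp(s, constituents):
    """Minimal Grammar Parsing (sketch): optimal right-hand sides for the
    axiom and for every constituent, given the constituent set."""
    names = {q: "N%d" % k
             for k, q in enumerate(sorted(constituents, key=lambda q: (len(q), q)))}
    grammar = {"S": minimal_parse(s, constituents, names)}
    for q in constituents:
        grammar[names[q]] = minimal_parse(q, constituents, names)
    return grammar
```

For example, `mgp("xabcabcy", {"abc"})` yields S → x N0 N0 y and N0 → a b c.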
We can now define a variant of IRR, called
The computation of the
IRRCOO(
As an example, consider again the sequence from Section 3.2, where
IRRCOO extends IRR by performing, at each step, a global optimization of the replaced occurrences, but it still relies on the classical score functions of IRR to choose the words to introduce. However, the result of the optimization can be used directly to guide the search in a hill-climbing approach, which we introduce in the next subsection.
In this section we depart from IRR algorithms by taking the idea presented in IRRCOO a step further. Here, we consider the optimization procedure (
We first introduce the search space, defined over all possible sets of substrings, for the Smallest Grammar Problem and, second, an algorithm performing a wider exploration of this search space than the classical greedy methods.
The
Consider the lattice 〈ℛ(
An algorithm over this search space will look for a local or global minimum. To define this, we first need some notation:
Given a lattice 〈
ancestors(
descendants(
The
Now, we are able to define a global and local minimum.
Given a lattice 〈
A node
A node
Unless otherwise noted the default score function for nodes will be
To see the first inclusion (⊆), take a smallest grammar
Because of the NP-hardness of the problem, it is fruitless (supposing
In contrast with classical methods, we can now define algorithms that also discard constituents. To perform a wider exploration of the search space, we propose a new algorithm performing a succession of greedy searches in ascending and descending directions in the lattice until a local minimum is found.
This algorithm explores the lattice along a zigzag path; we therefore denote it ZZ. It explores the lattice by alternating two different phases:
ZZ starts at the bottom node, that is, the node that corresponds to the grammar
ZZ(
For example, suppose that there are 5 substrings that occur more than once in
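A minimal sketch of the zigzag search follows, under the assumption that the score of a node is the total size of a minimal grammar parsing, taken here as the sum of the minimal parse lengths of the sequence and of every constituent (one of the size definitions mentioned earlier; the function names are ours).

```python
def mgp_size(s, constituents):
    """Node score: size of a minimal grammar for the constituent set,
    computed as the sum of minimal parse lengths (an assumption)."""
    def parse_len(w):
        n = len(w)
        dist = [0] + [n + 1] * n  # dist[i] = fewest symbols covering w[:i]
        for i in range(n):        # positions form a topological order
            dist[i + 1] = min(dist[i + 1], dist[i] + 1)
            for q in constituents:
                if q != w and w.startswith(q, i):
                    dist[i + len(q)] = min(dist[i + len(q)], dist[i] + 1)
        return dist[n]
    return parse_len(s) + sum(parse_len(q) for q in constituents)


def zz(s, repeats):
    """Zigzag search: alternate a greedy ascending phase (add the most
    helpful repeat) and a greedy descending phase (drop the least helpful
    constituent) until neither phase improves the score."""
    current, best = set(), mgp_size(s, set())
    improved = True
    while improved:
        improved = False
        while True:  # ascending phase
            gains = [(mgp_size(s, current | {r}), r) for r in repeats - current]
            if not gains or min(gains)[0] >= best:
                break
            best, r = min(gains)
            current.add(r)
            improved = True
        while True:  # descending phase
            gains = [(mgp_size(s, current - {r}), r) for r in current]
            if not gains or min(gains)[0] >= best:
                break
            best, r = min(gains)
            current.remove(r)
            improved = True
    return current, best
```

On the toy input `zz("xabcabcy", {"ab", "bc", "abc"})`, the search settles on the constituent set {"abc"} with score 7, and neither adding nor removing a repeat improves it further.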
In the previous section we showed that the computational complexity of computing the score function for each node is
We finish this section with a remark on the search space. In order to ease the understanding of the proof, we will suppose that the size of the grammar is defined as 
We have presented an algorithm that finds a local minimum on the search space. Locality is defined in terms of its direct neighborhood, but we will see that the local minimality of a node does not necessarily extend further:
The lattice 〈ℛ(
Consider the following sequence:
The set of possible constituents is {
In this section we experimentally compare our algorithms with the classical ones from the literature. For this purpose we use the Canterbury Corpus [
Not all existing algorithms are publicly available; moreover, they resolve in different ways the case where more than two words have the best score, they do not report results on a standard corpus, or they use different definitions of the size of a grammar. In order to standardize the experiments and scoring, we implemented all the offline algorithms presented in this paper in the IRR framework. For the sake of completeness, we also add LZ78 and Sequitur to the comparison. Note that we have post-processed the output of the LZ78 factorizations to transform them into context-free grammars. The first series of experiments aims at comparing these classical algorithms; the results are shown in the middle part of
Then, we evaluate how the optimization of occurrences improves IRR algorithms. As shown in the IRRCOO column of
If given more time, these results can still be improved by using ZZ. As shown in column ZZ of
Depending on the implementation, the
In this section we investigate these phenomena from two different perspectives. From a theoretical point of view, we provide bounds on the number of different smallest grammars, both globally and locally (fixing the set of constituents). And from an empirical point of view, we explore and compare the actual structural variances among different grammars on some reallife sequences.
It is clear that a smallest grammar is not necessarily unique. Not so clear is how many smallest grammars there can be. First we will prove that for a certain family of sequences, any node of the search space corresponds to a smallest grammar. As in the proof of Proposition 1 and only to ease the understanding of the proof, we will use 
Let
Then,
It is sufficient to find one family of sequences for which the number of global minima is exponential. Consider the sequence
Now, we will suppose that the set of constituents is fixed and consider the number of smallest grammars that can be built with this set. Given a set of constituents
As the following Proposition shows, there are cases where the number of different smallest grammars can grow exponentially for a given set of constituents.
Let
Then
Let
Proposition 2 is complementary to Proposition 3. In the former, we prove that the number of global minima (and therefore, the number of grammars with the same minimal size) is exponential. To prove it, we provided a sequence with the property that any constituent set would yield a grammar of minimal size. In the latter, we show that even if the set is fixed, there can still be an exponential number of smallest grammars with this constituent set. Note, however, that for the proof of Proposition 2 we needed an unbounded alphabet.
These two propositions suggest that it might not be possible to find one
In the next section we analyze the differences among grammars of the same smallest size for a few reallife sequences. We explore their structural variances using similarity metrics and from different points of view.
We will now introduce a way of measuring the difference between any pair of grammars taken from
Summarizing the definition, the
Full overlap of the brackets sets (
In our experiments we will use
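The bracket-based comparison can be computed as follows (a sketch in Python; brackets are represented as (start, end) spans over the original sequence, and how the bracket set is extracted from a grammar is left abstract here).

```python
def f_measure(brackets_a, brackets_b):
    """Unlabeled-bracket F-measure between two parses, each given as a
    set of (start, end) spans over the original sequence."""
    if not brackets_a or not brackets_b:
        return 0.0
    common = len(brackets_a & brackets_b)   # brackets shared by both parses
    if common == 0:
        return 0.0
    precision = common / len(brackets_a)
    recall = common / len(brackets_b)
    return 2 * precision * recall / (precision + recall)
```

For two parses sharing two of their three brackets each, precision and recall are both 2/3, and so is the F-measure.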
Propositions 2 and 3 suggest that there are sequences for which the number of smallest grammars and the number of grammars for a given set of constituents are both exponential. Both results are based on sequences that were specially designed to exhibit these behaviors, but it might be the case that this behavior does not occur “naturally”. In order to shed light on this topic, we present four different experiments. The first focuses directly on how Proposition 3 behaves in practice and consists in computing the number of grammars of smallest size that can be built using the set of constituents found by the ZZ algorithm. The other three experiments analyze how different all these grammars are. First, we compare the structure of different grammars randomly sampled from the set of possible grammars. Then we compare structure as in the previous experiment, but this time the metric used discriminates by the length of the constituents. Finally, we consider in how many different ways a single position can be parsed. The first experiment provides evidence supporting the idea that Proposition 3 is the general rule rather than an exceptional case. The second group of experiments suggests that there is a common structure among all possible grammars. In particular, it suggests that longer constituents are more stable than shorter ones and that, if for each position we only consider the longest bracket that encloses it, then almost all of these brackets are shared between smallest grammars.
All these experiments require a mechanism to compute different grammars sharing the same set of constituents. We provide an algorithm that not only computes these grammars but also computes the total number of grammars that can be defined for a given set of constituents. Let s be a sequence, and let Q be a set of constituents. An
The algorithm traverses the
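Counting the minimal parses reduces to counting shortest paths in the parsing DAG, which a single forward pass computes, as sketched below. The total number of grammars for a constituent set would then be the product of such counts over the sequence and every constituent; the function name is ours.

```python
def count_minimal_parses(w, constituents):
    """Length of a minimal parse of `w` and the number of distinct minimal
    parses, via shortest-path counting over the parsing DAG."""
    n = len(w)
    dist = [0] + [n + 1] * n   # dist[i]: fewest symbols covering w[:i]
    count = [1] + [0] * n      # count[i]: number of minimal parses of w[:i]
    for i in range(n):         # positions are already in topological order
        targets = [i + 1]      # terminal edge
        for q in constituents:
            if q != w and w.startswith(q, i):
                targets.append(i + len(q))   # constituent edge
        for j in targets:
            if dist[i] + 1 < dist[j]:
                dist[j], count[j] = dist[i] + 1, count[i]
            elif dist[i] + 1 == dist[j]:
                count[j] += count[i]
    return dist[n], count[n]
```

For instance, with the single constituent "aa", the sequence "aaa" has two minimal parses of length 2 (replacing either the first or the second occurrence), while "aaaa" has only one.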
In our experiments, we use
Using the
To answer this question, we sample elements from
In order to sample from
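One way to sample uniformly is sketched below, under the same parsing-DAG construction: walk backward from the final node, choosing each minimal incoming edge with probability proportional to the number of minimal parses passing through its source. The `<q>` labels marking constituent edges are our own convention.

```python
import random

def sample_minimal_parse(w, constituents, rng=random):
    """Sample uniformly among the minimal parses of `w` (a sketch)."""
    n = len(w)
    dist = [0] + [n + 1] * n
    count = [1] + [0] * n
    inc = [[] for _ in range(n + 1)]   # incoming (source, label) per node
    for i in range(n):
        targets = [(i + 1, w[i])]
        for q in constituents:
            if q != w and w.startswith(q, i):
                targets.append((i + len(q), "<%s>" % q))
        for j, label in targets:
            inc[j].append((i, label))
            if dist[i] + 1 < dist[j]:
                dist[j], count[j] = dist[i] + 1, count[i]
            elif dist[i] + 1 == dist[j]:
                count[j] += count[i]
    # backward pass: weight each minimal incoming edge by count[source]
    parse, j = [], n
    while j > 0:
        choices = [(i, lab) for i, lab in inc[j] if dist[i] + 1 == dist[j]]
        weights = [count[i] for i, _ in choices]
        i, lab = rng.choices(choices, weights=weights)[0]
        parse.append(lab)
        j = i
    return list(reversed(parse))
```

On "aaa" with the single constituent "aa", the sampler returns each of the two minimal parses, a &lt;aa&gt; and &lt;aa&gt; a, with probability 1/2.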
Our second experiment aims to discover which class of brackets makes the main difference. Note that the standard
As can be seen, the F-measure increases along
In the previous section we analyzed how different the grammars are, considering different sizes of brackets. Here, we consider the differences between the grammars at single positions. The objective of this experiment is to measure the number of different ways in which a single position of the original sequence
On
There were two regions that presented peaks in the number of different symbols. Both correspond to parts of the text with long runs of the same character (white spaces): one at the beginning, and one in the middle, within a poem.
While this experiment is restricted only to the first level of the parse tree, it seems to indicate that the huge number of different minimal parses is due to a small number of positions where different parses have the same size. Most positions, however, are always parsed in the same way.
Summarizing this section, we have the following theoretical results:
There can be an exponential number of sets of constituents such that all of them produce smallest grammars (Proposition 2).
There might be two smallest grammars not sharing a single constituent other than
There can be an exponential number of smallest grammars even if the set of constituents is fixed (Proposition 3).
Two smallest grammars with the same set of constituents might not share a single bracket other than the one for
Thus, from a theoretical point of view, there may exist structures that are completely incompatible with each other and nevertheless equally interesting according to the minimal-size criterion. Given these results, it may be naive to expect a single, correct, hierarchical structure to emerge from approximations of a smallest grammar.
Accordingly, our experiments showed that the number of different smallest grammars grows well beyond an explicitly tractable amount.
Yet, the
In this paper, we analyzed a new approach to the Smallest Grammar Problem, which consists in optimizing separately the choice of words that are going to be constituents and the choice of which occurrences of these constituents will be rewritten by nonterminals. Given a choice of constituents, we resolve the optimal choice of occurrences with the polynomial-time algorithm
This separation allows us to define the search space as a lattice over sets of repeats, where each node has an associated score corresponding to the size of the MGP of this node. We then propose a new algorithm that finds a local minimum. It explores this search space by adding, but also removing, repeats from the current set of words. Our experiments show that both approaches outperform state-of-the-art algorithms.
We then analyzed more closely how this approach can be used for structure discovery, a main application of the smallest grammar problem. While in applications to data compression and Kolmogorov complexity we look for the size of a smallest grammar, in structure discovery the structure itself given by the grammar is the goal. We used our formalization of the smallest grammar problem to analyze the number of possible smallest grammars (globally and given a fixed set of constituents). We proved that there may be an exponential number of grammars with the same minimal size, and analyzed how different they are. Because finding a smallest grammar is intractable, we contented ourselves here with studying the differences between smallest grammars given the best set of constituents our algorithms were able to find. While in theory there may exist several incompatible smallest grammars, our experiments seem to confirm that, in practical cases, there is an overall stability of the different parses. We think that these results can give both some warnings and some clues on how to use the result of an approximation to the smallest grammar problem for structure discovery.
The optimization of the choice of occurrences opens new perspectives when searching for the smallest grammars, especially for the inference of the structure of sequences. In future work, we want to study how this scheme actually helps to find better structure in real applications. Our efficiency concerns in this paper were oriented toward avoiding exponential behavior. In future work we plan to investigate more efficient algorithms and to study the compromise between execution time and final grammar size.
Grammars returned by classical algorithms on sequence
The fraction of the lattice that is explored by the ZZ algorithm.
The x-axis shows the number of different symbols that expand to one position; the y-axis shows the number of positions that have this number of expansions. Note that the y-axis is in logarithmic scale.
Corpus statistics.
 

alice29.txt  152,089  220,204 
asyoulik.txt  125,179  152,695 
cp.html  24,603  106,235 
fields.c  11,150  56,132 
grammar.lsp  3,721  12,780 
kennedy.xls  1,029,744  87,427 
lcet10.txt  426,754  853,083 
plrabn12.txt  481,861  491,533 
ptt5  513,216  99,944,933 
sum  38,240  666,934 
xargs.1  4,227  7,502 
Grammar sizes on the Canterbury corpus. The files on which ZZ did not finish are marked with a dash.
Algorithms from the literature  Optimizing occurrences  

 
Sequences  Sequitur  LZ78  IRR  IRRCOO  ZZ  
MF  ML  MC  MF  ML  MC  
alice29.txt  49,147  116,296  42,453  56,056  39,794  52,351  
asyoulik.txt  44,123  102,296  38,507  51,470  36,822  48,133  
cp.html  9,835  22,658  8,479  9,612  8,369  9,313  
fields.c  4,108  11,056  3,765  3,980  3,713  3,892  
grammar.lsp  1,770  4,225  1,615  1,730  1,621  1,704  
kennedy.xls  174,585  365,466  167,076  179,753  166,817  179,281  
lcet10.txt  112,205  288,250  92,913  130,409  90,493  120,140  –  
plrabn12.txt  142,656  338,762  125,366  180,203  114,959  164,728  –  
ptt5  55,692  106,456  45,639  56,452  44,192  53,738  –  
sum  15,329  35,056  12,965  13,866  12,878  13,695  
xargs.1  2,329  5,309  2,137  2,254  2,142  2,237 
Sequence length, grammar size, number of constituents, and number of grammars for different sequences.
Sequence length  38,770  125,179  152,089 
Grammar size  10,035  35,000  37,701 
Number of constituents  576  2391  2749 
Number of grammars  2 × 10^{497}  7 × 10^{968}  8 × 10^{936} 
Different figures for a sample of 1000 grammars uniformly taken from
66.02%  81.48%  77.81%  
1.19%  1.32%  1.52%  
Smallest 
62.26%  77.94%  73.44% 
Largest 
70.64%  85.64%  83.44% 
Brackets  Brackets  Brackets  
1  81.50%  100.00%  77.97%  100.00%  67.32%  100.00% 
2  88.26%  50.86%  83.70%  53.14%  71.11%  45.99% 
3  92.49%  29.57%  87.94%  32.42%  75.93%  37.54% 
4  95.21%  19.85%  89.60%  22.01%  82.17%  15.69% 
5  96.35%  11.78%  88.88%  14.36%  88.51%  3.96% 
6  97.18%  8.23%  89.45%  9.66%  95.46%  1.24% 
7  97.84%  5.72%  91.50%  6.44%  98.38%  0.44% 
8  97.82%  3.83%  92.78%  4.30%  99.87%  0.19% 
9  98.12%  2.76%  92.37%  2.95%  100.00%  0.06% 
10  98.35%  1.88%  91.87%  2.10%  100.00%  0.04% 
The work described in this paper was partially supported by the Program of International Scientific Cooperation MINCyT  INRIA CNRS.