# Parsing Expression Grammars and Their Induction Algorithm


## Abstract


## Featured Application

**PEG library for Python.**


## 1. Introduction

## 2. Definition of PEGs

- $\epsilon$, the empty string;
- $a$, the occurrence of a single symbol;
- $r*$, zero or more repetitions of the regular expression $r$;
- $r+$, one or more repetitions;
- $a|b$, nondeterministic choice, formally defined as $a+b$;
- $r_1 r_2$, concatenation;
- $(r)$, parentheses for grouping expressions.
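These operators can be tried out with Python's built-in `re` module; note that `re` implements the nondeterministic choice `a|b` by backtracking, in contrast to the prioritized choice of PEGs:

```python
import re

# fullmatch returns a match object if the whole string matches, else None
assert re.fullmatch(r'a*', 'aaa')        # zero or more repetitions
assert re.fullmatch(r'a+', 'a')          # one or more repetitions
assert not re.fullmatch(r'a+', '')       # + requires at least one symbol
assert re.fullmatch(r'a|b', 'b')         # nondeterministic choice
assert re.fullmatch(r'ab', 'ab')         # concatenation
assert re.fullmatch(r'(ab)*', 'abab')    # grouping combined with repetition
```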

- $\epsilon$, the empty string;
- $a$, any terminal, $a \in T$;
- $A$, any nonterminal, $A \in V$;
- $e_1 \gg e_2$, a sequence;
- $e_1 | e_2$, prioritized choice;
- $+e$, one or more repetitions;
- ${\sim}e$, a not-predicate.

- $\mathrm{consume}(\epsilon, x) = 0$.
- $\mathrm{consume}(a, ax) = 1$; $\mathrm{consume}(a, bx) = \mathrm{None}$; $\mathrm{consume}(a, \epsilon) = \mathrm{None}$.
- $\mathrm{consume}(A, x) = \mathrm{consume}(e, x)$ if $A \Leftarrow e$.
- If $\mathrm{consume}(e_1, x_1 x_2 y) = k$ and $\mathrm{consume}(e_2, x_2 y) = m$, then $\mathrm{consume}(e_1 \gg e_2, x_1 x_2 y) = k + m$; if $\mathrm{consume}(e_1, x) = \mathrm{None}$, then $\mathrm{consume}(e_1 \gg e_2, x) = \mathrm{None}$; if $\mathrm{consume}(e_1, x_1 y) = k$ and $\mathrm{consume}(e_2, y) = \mathrm{None}$, then $\mathrm{consume}(e_1 \gg e_2, x_1 y) = \mathrm{None}$.
- If $\mathrm{consume}(e_1, x_1 y) = k$, then $\mathrm{consume}(e_1 | e_2, x_1 y) = k$; if $\mathrm{consume}(e_1, x_1 y) = \mathrm{None}$ and $\mathrm{consume}(e_2, x_1 y) = k$, then $\mathrm{consume}(e_1 | e_2, x_1 y) = k$; if $\mathrm{consume}(e_1, y) = \mathrm{None}$ and $\mathrm{consume}(e_2, y) = \mathrm{None}$, then $\mathrm{consume}(e_1 | e_2, y) = \mathrm{None}$.
- If $\mathrm{consume}(e, x_1 y) = k$ and $\mathrm{consume}(+e, y) = n$, then $\mathrm{consume}(+e, x_1 y) = k + n$; if $\mathrm{consume}(e, x) = \mathrm{None}$, then $\mathrm{consume}(+e, x) = \mathrm{None}$; if $\mathrm{consume}(e, x_1 y) = k$ and $\mathrm{consume}(+e, y) = \mathrm{None}$, then $\mathrm{consume}(+e, x_1 y) = k$.
- If $\mathrm{consume}(e, x) = \mathrm{None}$, then $\mathrm{consume}({\sim}e, x) = 0$; if $\mathrm{consume}(e, x_1 y) = k$, then $\mathrm{consume}({\sim}e, x_1 y) = \mathrm{None}$.
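A direct transcription of these rules into Python may make them easier to follow. The tuple representation and function below are our own illustration, not the interface of the paper's library:

```python
def consume(e, x, rules=None):
    """Return the number of characters of x consumed by PEG expression e,
    or None on failure. Expressions are nested tuples, e.g. ('sym', 'a'),
    ('seq', e1, e2); rules maps nonterminal names to expressions."""
    rules = rules or {}
    kind = e[0]
    if kind == 'eps':                       # empty string: consumes nothing
        return 0
    if kind == 'sym':                       # a single terminal symbol
        return 1 if x[:1] == e[1] else None
    if kind == 'nt':                        # nonterminal A <= e
        return consume(rules[e[1]], x, rules)
    if kind == 'seq':                       # e1 >> e2
        k = consume(e[1], x, rules)
        if k is None:
            return None
        m = consume(e[2], x[k:], rules)
        return None if m is None else k + m
    if kind == 'alt':                       # prioritized choice e1 | e2
        k = consume(e[1], x, rules)
        return k if k is not None else consume(e[2], x, rules)
    if kind == 'plus':                      # one or more repetitions +e
        k = consume(e[1], x, rules)
        if not k:                           # failure, or zero-width match
            return k                        # (stop to avoid nontermination)
        n = consume(e, x[k:], rules)
        return k if n is None else k + n
    if kind == 'not':                       # not-predicate ~e
        return 0 if consume(e[1], x, rules) is None else None
    raise ValueError(f'unknown expression kind: {kind}')
```

For example, with the grammar $S \Leftarrow a \gg S \mid {\sim}(a \mid b)$ from Section 3.3, `consume` returns the full length of a string of a's and `None` for any string containing a b.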

## 3. Induction Algorithm

#### 3.1. Genetic Programming

- Initialize the population.
- Evaluate the individual programs in the current population and assign a numerical fitness to each individual.
- Until the emerging population is fully populated, repeat the following steps:
  - Select two individuals from the current population using a selection algorithm.
  - Perform genetic operations on the selected individuals.
  - Insert the result of crossover, i.e., the better of the two children, into the emerging population.
- If a termination criterion is fulfilled, go to step 5. Otherwise, replace the current population with the emerging one, saving the best individual (elitism strategy), and repeat steps 2–4.
- Present the best individual as the output of the algorithm.
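The steps above can be sketched as a short Python loop. Here `fitness` and `crossover` are placeholders for the paper's fitness evaluation and genetic operators; the population handling is a simplified illustration, not the authors' implementation:

```python
import random

def evolve(init_pop, fitness, crossover, generations=100, tour=3):
    """Generational GP loop with tournament selection and elitism."""
    pop = list(init_pop)                         # 1. initialize
    best = max(pop, key=fitness)                 # 2. evaluate; remember best
    for _ in range(generations):                 # 4. termination criterion
        new_pop = [best]                         # elitism: carry the best over
        while len(new_pop) < len(pop):           # 3. fill emerging population
            # select two parents by tournaments of size `tour`
            p1 = max(random.sample(pop, tour), key=fitness)
            p2 = max(random.sample(pop, tour), key=fitness)
            c1, c2 = crossover(p1, p2)           # genetic operation
            new_pop.append(max((c1, c2), key=fitness))  # keep better child
        pop = new_pop
        best = max(pop, key=fitness)
    return best                                  # 5. output best individual
```

Because the elite individual is always carried over, the best fitness in the population is nondecreasing across generations.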

#### 3.2. Deterministic Algorithm Used in Initializing a GP Population

**Algorithm 1:** Inferring a single expression

**Lemma 1.**

**Proof.**

**Lemma 2.**

**Proof.**

**Theorem 1.**

**Proof.**

#### 3.3. Python’s PEG Library Performance Evaluation

- a*
- (ab)*
- ((b|(aa))|(((a(bb))((bb)|(a(bb)))*)(aa)))*((a?)|(((a(bb))((bb)|(a(bb)))*)(a?)))
- a*((b|bb)aa*)*(b|bb|a*)
- (aa|bb)*((ba|ab)(bb|aa)*(ba|ab)(bb|aa)*)*(aa|bb)*
- ((a(ab)*(b|aa))|(b(ba)*(a|bb)))*
- a*b*a*b*

C++ EGG. The interpreter ran on a four-core Intel i7-965 (3.2 GHz) processor under Windows 10 with 12 GB of RAM.

The C++ library (EGG) outperformed its Python counterparts.

## 4. Results and Discussion

- Precision, $P=\mathit{tp}/(\mathit{tp}+\mathit{fp})$;
- Recall, $R=\mathit{tp}/(\mathit{tp}+\mathit{fn})$;
- F-score, $F1=2\times P\times R/(P+R)$;
- Accuracy, $ACC=(\mathit{tp}+\mathit{tn})/(\mathit{tp}+\mathit{tn}+\mathit{fp}+\mathit{fn})$;
- Area under the ROC curve, $AUC=(\mathit{tp}/(\mathit{tp}+\mathit{fn})+\mathit{tn}/(\mathit{fp}+\mathit{tn}))/2$;
- Matthews correlation coefficient, $\mathrm{MCC}=\frac{\mathit{tp}\times \mathit{tn}-\mathit{fp}\times \mathit{fn}}{\sqrt{(\mathit{tp}+\mathit{fp})(\mathit{tp}+\mathit{fn})(\mathit{tn}+\mathit{fp})(\mathit{tn}+\mathit{fn})}}$;

where the terms true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) compare the results of the classifier under test with trusted external judgments. In our case, tp is the number of correctly recognized amyloids, fp the number of nonamyloids recognized as amyloids, fn the number of amyloids recognized as nonamyloids, and tn the number of correctly recognized nonamyloids. The last column gives the CPU time of the computations (induction plus classification, in seconds).
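For reference, all six measures follow directly from the four confusion-matrix counts; the function below is a straightforward sketch, not the code used in the experiments:

```python
from math import sqrt

def scores(tp, tn, fp, fn):
    """Return (P, R, F1, ACC, AUC, MCC) from confusion-matrix counts."""
    p = tp / (tp + fp)                           # precision
    r = tp / (tp + fn)                           # recall
    f1 = 2 * p * r / (p + r)                     # F-score
    acc = (tp + tn) / (tp + tn + fp + fn)        # accuracy
    auc = (tp / (tp + fn) + tn / (fp + tn)) / 2  # area under the ROC curve
    mcc = (tp * tn - fp * fn) / sqrt(            # Matthews correlation coeff.
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return p, r, f1, acc, auc, mcc
```

For a perfect classifier (fp = fn = 0) every measure equals 1, while a classifier at chance level on a balanced set gives MCC = 0.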

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References


**Figure 3.** The number of symbols in a PEG (red line) and the number of letters in the respective test set (blue line).

**Figure 4.** Combined amyloid databases used in this work. Pos and Neg denote, respectively, positive and negative word counts in the database.

**Figure 5.** Combined amyloid databases used in this work. Pos and Neg denote, respectively, positive and negative word counts in the database.

**Figure 6.** Average error ($1-$ fitness accuracy) vs. expression length for different generations, based on random data with a two-letter alphabet, 100 words in each set of examples and counterexamples, and word lengths between 2 and 20.

| No. | $\lvert\Sigma\rvert$ | $\lvert X\rvert$ | $\lvert Y\rvert$ | $d_{\mathrm{min}}$ | $d_{\mathrm{max}}$ |
|---|---|---|---|---|---|
| 1 | 2 | 10 | 10 | 1 | 10 |
| 2 | 2 | 100 | 100 | 2 | 20 |
| 3 | 4 | 500 | 500 | 3 | 30 |
| 4 | 4 | 1000 | 1000 | 4 | 40 |
| 5 | 8 | 5000 | 5000 | 5 | 50 |
| 6 | 8 | 6000 | 6000 | 6 | 60 |
| 7 | 16 | 7000 | 7000 | 7 | 70 |
| 8 | 16 | 8000 | 8000 | 8 | 80 |
| 9 | 32 | 9000 | 9000 | 9 | 90 |
| 10 | 32 | 10,000 | 10,000 | 10 | 100 |

| No. | Description | PEG Grammar |
|---|---|---|
| 1 | Sequence of a's | $S \Leftarrow a \gg S \mid {\sim}(a \mid b)$ |
| 2 | Sequence of (ab)'s | $S \Leftarrow a \gg b \gg S \mid {\sim}(a \mid b)$ |
| 3 | Any string without an odd number of consecutive b's after an odd number of consecutive a's | $A \Leftarrow a \gg {\sim}a \mid a \gg a \gg A$; $B \Leftarrow b \gg {\sim}b \mid b \gg b \gg B$; $C \Leftarrow a \gg A \mid {\sim}a$; $D \Leftarrow b \gg B \mid {\sim}b$; $S \Leftarrow +(a \gg a \mid b) \gg S \mid A \gg D \gg S \mid {\sim}(a \mid b)$ |
| 4 | Any string without more than two consecutive b's | $S \Leftarrow (+a \mid \epsilon) \gg (+((b \gg b \mid b) \gg +a) \mid \epsilon) \gg (b \gg b \gg {\sim}(a \mid b) \mid b \gg {\sim}(a \mid b) \mid {\sim}(a \mid b))$ |
| 5 | Any string of even length that, making pairs, has an even number of (ab)'s or (ba)'s | $A \Leftarrow +(a \gg a \mid b \gg b) \mid \epsilon$; $B \Leftarrow a \gg b \mid b \gg a$; $S \Leftarrow A \gg (+(B \gg A \gg B \gg A) \mid \epsilon) \gg A \gg {\sim}(a \mid b)$ |
| 6 | Any string such that the difference between the numbers of a's and b's is a multiple of three | $A \Leftarrow a \gg (+(a \gg b) \mid \epsilon) \gg (b \mid a \gg a)$; $B \Leftarrow b \gg (+(b \gg a) \mid \epsilon) \gg (a \mid b \gg b)$; $S \Leftarrow +(A \mid B) \gg {\sim}(a \mid b) \mid {\sim}(a \mid b)$ |
| 7 | Zero or more a's followed by zero or more b's followed by zero or more a's followed by zero or more b's | $A \Leftarrow +a \mid \epsilon$; $B \Leftarrow +b \mid \epsilon$; $S \Leftarrow A \gg B \gg A \gg B \gg {\sim}(a \mid b)$ |

| Case No. | Positive/Negative | Word Length | PEG [s] | Grako [s] | Arpeggio [s] | EGG [s] |
|---|---|---|---|---|---|---|
| 1 | Positive | 1–100 | 0.01 | 0.21 | 0.02 | <0.01 |
| 1 | Positive | 101–1000 | 0.04 | 1.93 | 0.13 | <0.01 |
| 1 | Positive | 1001–10,000 | 0.44 | 17.62 | 0.92 | 0.01 |
| 1 | Positive | 10,001–100,000 | 5.93 | 182 | 9.82 | 0.01 |
| 1 | Negative | 1–100 | <0.01 | 0.11 | 0.01 | 0.06 |
| 1 | Negative | 101–1000 | 0.02 | 1.13 | 0.05 | 0.02 |
| 1 | Negative | 1001–10,000 | 0.24 | 9.68 | 0.42 | – |
| 1 | Negative | 10,001–100,000 | 2.41 | 81.95 | 3.58 | – |
| 2 | Positive | 1–100 | 0.01 | 0.19 | 0.01 | <0.01 |
| 2 | Positive | 101–1000 | 0.03 | 1.68 | 0.12 | <0.01 |
| 2 | Positive | 1001–10,000 | 0.12 | 5.64 | 0.4 | <0.01 |
| 2 | Positive | 10,001–100,000 | 1.2 | 47.95 | 3.38 | <0.01 |
| 2 | Negative | 1–100 | <0.01 | 0.11 | 0.01 | 0.01 |
| 2 | Negative | 101–1000 | 0.02 | 0.82 | 0.05 | 0.01 |
| 2 | Negative | 1001–10,000 | 0.07 | 3.19 | 0.18 | 0.08 |
| 2 | Negative | 10,001–100,000 | 0.59 | 24.54 | 1.42 | 0.05 |
| 4 | Positive | 1–100 | <0.01 | 0.2 | 0.02 | <0.01 |
| 4 | Positive | 101–1000 | 0.04 | 2.12 | 0.36 | <0.01 |
| 4 | Positive | 1001–10,000 | 0.26 | 12.37 | 2.28 | <0.01 |
| 4 | Positive | 10,001–100,000 | 2.40 | 102.19 | 20.29 | 0.01 |
| 4 | Negative | 1–100 | 0.01 | 0.27 | 0.02 | 0.03 |
| 4 | Negative | 101–1000 | 0.04 | 2.11 | 0.32 | 0.03 |
| 4 | Negative | 1001–10,000 | 0.25 | 11.93 | 2.01 | 0.22 |
| 4 | Negative | 10,001–100,000 | 2.32 | 98.61 | 17.98 | 0.22 |
| 5 | Positive | 1–100 | <0.01 | 0.2 | 0.02 | <0.01 |
| 5 | Positive | 101–1000 | 0.04 | 2.12 | 0.36 | <0.01 |
| 5 | Positive | 1001–10,000 | 0.26 | 12.37 | 2.28 | 0.01 |
| 5 | Positive | 10,001–100,000 | 2.4 | 102.19 | 20.29 | <0.01 |
| 5 | Negative | 1–100 | 0.01 | 0.27 | 0.02 | 0.05 |
| 5 | Negative | 101–1000 | 0.04 | 2.11 | 0.32 | 0.02 |
| 5 | Negative | 1001–10,000 | 0.25 | 11.93 | 2.01 | 0.08 |
| 5 | Negative | 10,001–100,000 | 2.32 | 98.61 | 17.98 | 0.08 |
| 6 | Positive | 1–100 | 0.01 | 0.23 | 0.02 | <0.01 |
| 6 | Positive | 101–1000 | 0.05 | 2.52 | 0.17 | <0.01 |
| 6 | Positive | 1001–10,000 | 0.40 | 17.64 | 1.21 | <0.01 |
| 6 | Positive | 10,001–100,000 | 1.67 | 71.70 | 4.93 | 0.01 |
| 6 | Negative | 1–100 | 0.01 | 0.34 | 0.02 | 0.03 |
| 6 | Negative | 101–1000 | 0.12 | 5.34 | 0.33 | 0.06 |
| 6 | Negative | 1001–10,000 | 1.05 | 44.86 | 2.61 | 0.09 |
| 6 | Negative | 10,001–100,000 | 4.61 | 187.90 | 10.42 | 0.17 |
| 7 | Positive | 1–100 | <0.01 | 0.23 | 0.02 | <0.01 |
| 7 | Positive | 101–1000 | 0.04 | 1.85 | 0.09 | <0.01 |
| 7 | Positive | 1001–10,000 | 0.17 | 7.94 | 0.42 | 0.01 |
| 7 | Positive | 10,001–100,000 | 1.91 | 73.4 | 3.92 | <0.01 |
| 7 | Negative | 1–100 | <0.01 | 0.16 | 0.01 | 0.03 |
| 7 | Negative | 101–1000 | 0.02 | 1.09 | 0.06 | 0.02 |
| 7 | Negative | 1001–10,000 | 0.11 | 5.37 | 0.28 | 0.15 |
| 7 | Negative | 10,001–100,000 | 1.26 | 49.92 | 2.64 | 0.10 |

| Parameters | Values |
|---|---|
| Objective | Evolve an expression classifying amino acid sequences according to examples and counterexamples |
| Terminal set | $\epsilon$, A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V |
| Function set | ${\sim}$, $\gg$, $\mid$ |
| Population size | 5 |
| Crossover probability | 1.0 |
| Selection | Tournament selection, size $r=3$ |
| Termination criterion | 6000 generations have passed |
| Maximum depth of tree after crossover | 100 |
| Initialization method | A special dedicated algorithm; $k=\lvert X\rvert=\lvert Y\rvert$ equals half of the cardinality of examples |

| Method | P | R | F1 | ACC | AUC | MCC | Time [s] |
|---|---|---|---|---|---|---|---|
| PEG | 0.627 | 0.294 | 0.400 | 0.739 | 0.610 | 0.291 | 0.3 |
| ABL | 0.645 | 0.183 | 0.286 | 0.728 | 0.571 | 0.232 | 412.9 |
| ADIOS | 0.329 | 0.633 | 0.433 | 0.508 | 0.544 | 0.082 | 15.6 |
| Blue-fringe | 0.367 | 0.303 | 0.332 | 0.639 | 0.541 | 0.088 | 0.9 |
| ECGI | 0.875 | 0.064 | 0.120 | 0.720 | 0.530 | 0.189 | 31.0 |
| Traxbar | 0.234 | 0.101 | 0.141 | 0.636 | 0.481 | −0.05 | 0.3 |
| SVM | 0.224 | 0.001 | 0.131 | 0.526 | 0.471 | −0.06 | 0.4 |


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wieczorek, W.; Unold, O.; Strąk, Ł. Parsing Expression Grammars and Their Induction Algorithm. *Appl. Sci.* **2020**, *10*, 8747.
https://doi.org/10.3390/app10238747
