# Parsing Expression Grammars and Their Induction Algorithm

## Abstract

## Featured Application

**PEG library for Python.**

## 1. Introduction

## 2. Definition of PEGs

- $\epsilon$, the empty string;
- a, symbol or string occurrence;
- $r*$, zero or more repetitions of the regular expression $r$;
- a+, one or more repetitions;
- $a|b$, non-deterministic choice of symbol, formally defined as $a+b$;
- ${r}_{1}{r}_{2}$, concatenation;
- $\left(r\right)$, parenthesis for grouping of expressions.

- $\epsilon$, the empty string;
- a, any terminal, $a\in T$;
- A, any nonterminal, $A\in V$;
- ${e}_{1}\gg {e}_{2}$, a sequence;
- ${e}_{1}|{e}_{2}$, prioritized choice;
- +e, one or more repetitions;
- ${\scriptstyle \sim}e$, a not-predicate.
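The operators listed above coincide with Python's `>>`, `|`, unary `+`, and `~`, so a PEG can be written directly as a Python expression. The following is a minimal sketch of that idea, not the library's actual API: the class names `Expr`, `Sym`, `Seq`, `Choice`, `Plus`, and `Not` are hypothetical and only build and print an expression tree.

```python
from dataclasses import dataclass


class Expr:
    """Base class mapping Python operators onto PEG operators (illustrative only)."""

    def __rshift__(self, other):  # e1 >> e2 : sequence
        return Seq(self, other)

    def __or__(self, other):      # e1 | e2 : prioritized choice
        return Choice(self, other)

    def __pos__(self):            # +e : one or more repetitions
        return Plus(self)

    def __invert__(self):         # ~e : not-predicate
        return Not(self)


@dataclass(frozen=True)
class Sym(Expr):
    char: str

    def __str__(self):
        return self.char


@dataclass(frozen=True)
class Seq(Expr):
    left: Expr
    right: Expr

    def __str__(self):
        return f"({self.left} >> {self.right})"


@dataclass(frozen=True)
class Choice(Expr):
    left: Expr
    right: Expr

    def __str__(self):
        return f"({self.left} | {self.right})"


@dataclass(frozen=True)
class Plus(Expr):
    expr: Expr

    def __str__(self):
        return f"+{self.expr}"


@dataclass(frozen=True)
class Not(Expr):
    expr: Expr

    def __str__(self):
        return f"~{self.expr}"


a, b = Sym("a"), Sym("b")
example = a >> b | ~(a | b)
print(example)  # ((a >> b) | ~(a | b))
```

In Python, the unary operators bind tighter than `>>`, which in turn binds tighter than `|`, so a sequence groups before a choice without extra parentheses.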

- $\mathrm{consume}(\epsilon,x)=0$.
- $\mathrm{consume}(a,ax)=1$; $\mathrm{consume}(a,bx)=\mathrm{None}$; $\mathrm{consume}(a,\epsilon)=\mathrm{None}$.
- $\mathrm{consume}(A,x)=\mathrm{consume}(e,x)$ if $A\Leftarrow e$.
- If $\mathrm{consume}({e}_{1},{x}_{1}{x}_{2}y)=k$ and $\mathrm{consume}({e}_{2},{x}_{2}y)=m$, then $\mathrm{consume}({e}_{1}\gg {e}_{2},{x}_{1}{x}_{2}y)=k+m$; if $\mathrm{consume}({e}_{1},x)=\mathrm{None}$, then $\mathrm{consume}({e}_{1}\gg {e}_{2},x)=\mathrm{None}$; if $\mathrm{consume}({e}_{1},{x}_{1}y)=k$ and $\mathrm{consume}({e}_{2},y)=\mathrm{None}$, then $\mathrm{consume}({e}_{1}\gg {e}_{2},{x}_{1}y)=\mathrm{None}$.
- If $\mathrm{consume}({e}_{1},{x}_{1}y)=k$, then $\mathrm{consume}({e}_{1}|{e}_{2},{x}_{1}y)=k$; if $\mathrm{consume}({e}_{1},{x}_{1}y)=\mathrm{None}$ and $\mathrm{consume}({e}_{2},{x}_{1}y)=k$, then $\mathrm{consume}({e}_{1}|{e}_{2},{x}_{1}y)=k$; if $\mathrm{consume}({e}_{1},y)=\mathrm{None}$ and $\mathrm{consume}({e}_{2},y)=\mathrm{None}$, then $\mathrm{consume}({e}_{1}|{e}_{2},y)=\mathrm{None}$.
- If $\mathrm{consume}(e,{x}_{1}y)=k$ and $\mathrm{consume}(+e,y)=n$, then $\mathrm{consume}(+e,{x}_{1}y)=k+n$; if $\mathrm{consume}(e,x)=\mathrm{None}$, then $\mathrm{consume}(+e,x)=\mathrm{None}$; if $\mathrm{consume}(e,{x}_{1}y)=k$ and $\mathrm{consume}(+e,y)=\mathrm{None}$, then $\mathrm{consume}(+e,{x}_{1}y)=k$.
- If $\mathrm{consume}(e,x)=\mathrm{None}$, then $\mathrm{consume}({\scriptstyle \sim}e,x)=0$; if $\mathrm{consume}(e,{x}_{1}y)=k$, then $\mathrm{consume}({\scriptstyle \sim}e,{x}_{1}y)=\mathrm{None}$.
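The rules above translate almost line by line into a recursive function. The sketch below is one possible reading of them, not the library's implementation: the nested-tuple encoding of expressions and the `rules` dictionary are assumptions made for this example, and the guard for a repetition whose body consumes nothing is an addition that keeps the recursion from running forever.

```python
from typing import Optional

# Hypothetical encoding chosen for this sketch:
#   ("eps",), ("sym", "a"), ("ref", "S"), ("seq", e1, e2),
#   ("alt", e1, e2), ("plus", e), ("not", e)


def consume(e, x: str, rules: dict) -> Optional[int]:
    """Return how many symbols of x the expression e consumes, or None on failure."""
    kind = e[0]
    if kind == "eps":                      # consume(eps, x) = 0
        return 0
    if kind == "sym":                      # one symbol if it matches, else None
        return 1 if x[:1] == e[1] else None
    if kind == "ref":                      # nonterminal: use its defining expression
        return consume(rules[e[1]], x, rules)
    if kind == "seq":                      # both parts must succeed, one after the other
        k = consume(e[1], x, rules)
        if k is None:
            return None
        m = consume(e[2], x[k:], rules)
        return None if m is None else k + m
    if kind == "alt":                      # prioritized choice: try e1 first
        k = consume(e[1], x, rules)
        return k if k is not None else consume(e[2], x, rules)
    if kind == "plus":                     # one or more repetitions
        k = consume(e[1], x, rules)
        if k is None:
            return None
        if k == 0:                         # guard added in this sketch (not in the rules)
            return 0
        n = consume(e, x[k:], rules)
        return k if n is None else k + n
    if kind == "not":                      # not-predicate: succeeds without consuming
        return 0 if consume(e[1], x, rules) is None else None
    raise ValueError(f"unknown expression kind: {kind!r}")


# S <= a >> S | ~(a | b)
a_or_b = ("alt", ("sym", "a"), ("sym", "b"))
rules = {"S": ("alt", ("seq", ("sym", "a"), ("ref", "S")), ("not", a_or_b))}

print(consume(("ref", "S"), "aaa", rules))         # 3
print(consume(("ref", "S"), "aab", rules))         # None
print(consume(("plus", ("sym", "a")), "aab", {}))  # 2
```

Because the function recurses once per consumed symbol, this sketch is only suitable for short inputs; a production parser would iterate or memoize instead.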

## 3. Induction Algorithm

#### 3.1. Genetic Programming

1. Initialize the population.
2. Evaluate the individual programs in the current population. Assign a numerical fitness to each individual.
3. Until the emerging population is fully populated, repeat the following steps:
   - Select two individuals in the current population using a selection algorithm.
   - Perform genetic operations on the selected individuals.
   - Insert the result of crossover, i.e., the better one out of two children, into the emerging population.
4. If a termination criterion is fulfilled, go to step 5. Otherwise, replace the current population with the emerged population, saving the best individual, and repeat steps 2–4 (elitism strategy).
5. Present the best individual as the output from the algorithm.
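The scheme above can be sketched as a short generic loop. This is an illustration under stated assumptions rather than the authors' implementation: `evolve`, `tournament`, and `one_point` are hypothetical names, elitism is read here as copying the best individual into the next population, the termination criterion is a fixed generation budget, and the toy problem (maximizing the number of ones in a bit list) merely stands in for evolving PEG expressions.

```python
import random


def tournament(population, scores, size, rng):
    """Return the fittest of `size` individuals drawn at random."""
    contestants = rng.sample(range(len(population)), size)
    return population[max(contestants, key=lambda i: scores[i])]


def evolve(population, fitness, crossover, generations, tournament_size=3, seed=0):
    """Generational GP loop with tournament selection, crossover, and elitism."""
    rng = random.Random(seed)
    population = list(population)          # initial population supplied by the caller
    best = max(population, key=fitness)
    for _ in range(generations):           # termination: generation budget exhausted
        scores = [fitness(ind) for ind in population]
        emerging = [best]                  # elitism: the best individual survives
        while len(emerging) < len(population):
            p1 = tournament(population, scores, tournament_size, rng)
            p2 = tournament(population, scores, tournament_size, rng)
            c1, c2 = crossover(p1, p2, rng)
            emerging.append(max(c1, c2, key=fitness))  # keep the better child
        population = emerging
        best = max(population, key=fitness)
    return best


def one_point(p1, p2, rng):
    """Toy crossover on equal-length lists: swap the tails at a random cut."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


init_rng = random.Random(1)
start = [[init_rng.randint(0, 1) for _ in range(12)] for _ in range(5)]
winner = evolve(start, fitness=sum, crossover=one_point, generations=50)
print(sum(winner), "ones out of", len(winner))
```

Since the best individual is always carried over, the best fitness in the population cannot decrease from one generation to the next.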

#### 3.2. Deterministic Algorithm Used in Initializing a GP Population

Algorithm 1: Inferring a single expression

**Proof.**

#### 3.3. Python’s PEG Library Performance Evaluation

- `a*`
- `(ab)*`
- `((b|(aa))|(((a(bb))((bb)|(a(bb)))*)(aa)))*((a?)|(((a(bb))((bb)|(a(bb)))*)(a?)))`
- `a*((b|bb)aa*)*(b|bb|a*)`
- `(aa|bb)*((ba|ab)(bb|aa)*(ba|ab)(bb|aa)*)*(aa|bb)*`
- `((a(ab)*(b|aa))|(b(ba)*(a|bb)))*`
- `a*b*a*b*`
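Regular expressions such as these can serve as a reference when labelling words as positive or negative. The following sketch uses Python's standard `re` module for three of the expressions above; the helper `label` and the exhaustive enumeration of short words are assumptions of this example and not necessarily how the test words in the experiments were produced.

```python
import re
from itertools import product

# Three of the expressions listed above, in Python regex syntax.
PATTERNS = {
    "a*": r"a*",
    "(ab)*": r"(ab)*",
    "a*b*a*b*": r"a*b*a*b*",
}


def label(pattern, alphabet="ab", max_len=4):
    """Split all words up to max_len into those the pattern fully matches and the rest."""
    compiled = re.compile(pattern)
    positive, negative = [], []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            (positive if compiled.fullmatch(word) else negative).append(word)
    return positive, negative


pos, neg = label(PATTERNS["(ab)*"])
print(pos)  # ['', 'ab', 'abab']
```

`re.fullmatch` is used instead of `re.match` so that the whole word, not just a prefix, has to belong to the language.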

C++ EGG. The interpreter ran on a four-core Intel i7-965 3.2 GHz processor under Windows 10 with 12 GB of RAM.

The C++ library (EGG) outperformed its Python counterparts.

## 4. Results and Discussion

- Precision, $P=\mathit{tp}/(\mathit{tp}+\mathit{fp})$;
- Recall, $R=\mathit{tp}/(\mathit{tp}+\mathit{fn})$;
- F-score, $F1=2\times P\times R/(P+R)$;
- Accuracy, $ACC=(\mathit{tp}+\mathit{tn})/(\mathit{tp}+\mathit{tn}+\mathit{fp}+\mathit{fn})$;
- Area under the ROC curve, $AUC=\left(\mathit{tp}/(\mathit{tp}+\mathit{fn})+\mathit{tn}/(\mathit{fp}+\mathit{tn})\right)/2$;
- Matthews correlation coefficient, $\mathrm{MCC}=\frac{\mathit{tp}\times \mathit{tn}-\mathit{fp}\times \mathit{fn}}{\sqrt{(\mathit{tp}+\mathit{fp})(\mathit{tp}+\mathit{fn})(\mathit{tn}+\mathit{fp})(\mathit{tn}+\mathit{fn})}}$;

where the terms true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) compare the results of the classifier under test with trusted external judgments. Thus, in our case, tp is the number of correctly recognized amyloids, fp is the number of nonamyloids recognized as amyloids, fn is the number of amyloids recognized as nonamyloids, and tn is the number of correctly recognized nonamyloids. The last column reports the CPU time of the computations (induction plus classification, in seconds).
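The measures defined above can be computed directly from the four confusion-matrix counts. The snippet below is a small sketch of those formulas; the function name `classification_metrics` and the example counts are made up for illustration, and no handling of zero denominators is included.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Compute P, R, F1, ACC, AUC, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    auc = (tp / (tp + fn) + tn / (fp + tn)) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "P": precision,
        "R": recall,
        "F1": f1,
        "ACC": accuracy,
        "AUC": auc,
        "MCC": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 0.8 and accuracy is 0.7, which is easy to verify by hand.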

## 5. Conclusions

**Figure 3.** The number of symbols in a PEG (red line) and the number of letters in the respective test set (blue line).

**Figure 4.** Combined amyloid databases used in this work. Pos and Neg denote, respectively, positive and negative word counts in the database.

**Figure 5.** Combined amyloid databases used in this work. Pos and Neg denote, respectively, positive and negative word counts in the database.

**Figure 6.** Average error ($1-\text{fitness accuracy}$) vs. expression length for different generations, based on random data with a two-letter alphabet, 100 words in each of the example and counter-example sets, and word lengths between 2 and 20.

No. | $\lvert\Sigma\rvert$ | $\lvert X\rvert$ | $\lvert Y\rvert$ | $d_{\min}$ | $d_{\max}$ |
---|---|---|---|---|---|
1 | 2 | 10 | 10 | 1 | 10 |
2 | 2 | 100 | 100 | 2 | 20 |
3 | 4 | 500 | 500 | 3 | 30 |
4 | 4 | 1000 | 1000 | 4 | 40 |
5 | 8 | 5000 | 5000 | 5 | 50 |
6 | 8 | 6000 | 6000 | 6 | 60 |
7 | 16 | 7000 | 7000 | 7 | 70 |
8 | 16 | 8000 | 8000 | 8 | 80 |
9 | 32 | 9000 | 9000 | 9 | 90 |
10 | 32 | 10,000 | 10,000 | 10 | 100 |

No. | Description | PEG Grammar |
---|---|---|
1 | Sequence of a’s | $S\Leftarrow a\gg S\mid {\scriptstyle \sim}(a\mid b)$ |
2 | Sequence of (ab)’s | $S\Leftarrow a\gg b\gg S\mid {\scriptstyle \sim}(a\mid b)$ |
3 | Any string without an odd number of consecutive b’s after an odd number of consecutive a’s | $A\Leftarrow a\gg {\scriptstyle \sim}a\mid a\gg a\gg A$; $B\Leftarrow b\gg {\scriptstyle \sim}b\mid b\gg b\gg B$; $C\Leftarrow a\gg A\mid {\scriptstyle \sim}a$; $D\Leftarrow b\gg B\mid {\scriptstyle \sim}b$; $S\Leftarrow +(a\gg a\mid b)\gg S\mid A\gg D\gg S\mid {\scriptstyle \sim}(a\mid b)$ |
4 | Any string without more than two consecutive b’s | $S\Leftarrow (+a\mid \epsilon)\gg (+((b\gg b\mid b)\gg +a)\mid \epsilon)\gg (b\gg b\gg {\scriptstyle \sim}(a\mid b)\mid b\gg {\scriptstyle \sim}(a\mid b)\mid {\scriptstyle \sim}(a\mid b))$ |
5 | Any string of even length that, making pairs, has an even number of (ab)’s or (ba)’s | $A\Leftarrow +(a\gg a\mid b\gg b)\mid \epsilon$; $B\Leftarrow a\gg b\mid b\gg a$; $S\Leftarrow A\gg (+(B\gg A\gg B\gg A)\mid \epsilon)\gg A\gg {\scriptstyle \sim}(a\mid b)$ |
6 | Any string such that the difference between the numbers of a’s and b’s is a multiple of three | $A\Leftarrow a\gg (+(a\gg b)\mid \epsilon)\gg (b\mid a\gg a)$; $B\Leftarrow b\gg (+(b\gg a)\mid \epsilon)\gg (a\mid b\gg b)$; $S\Leftarrow +(A\mid B)\gg {\scriptstyle \sim}(a\mid b)\mid {\scriptstyle \sim}(a\mid b)$ |
7 | Zero or more a’s followed by zero or more b’s followed by zero or more a’s followed by zero or more b’s | $A\Leftarrow +a\mid \epsilon$; $B\Leftarrow +b\mid \epsilon$; $S\Leftarrow A\gg B\gg A\gg B\gg {\scriptstyle \sim}(a\mid b)$ |

Case No. | Positive/Negative | Word Length | PEG [s] | Grako [s] | Arpeggio [s] | EGG [s] |
---|---|---|---|---|---|---|
1 | Positive | 1–100 | 0.01 | 0.21 | 0.02 | <0.01 |
1 | Positive | 101–1000 | 0.04 | 1.93 | 0.13 | <0.01 |
1 | Positive | 1001–10,000 | 0.44 | 17.62 | 0.92 | 0.01 |
1 | Positive | 10,001–100,000 | 5.93 | 182 | 9.82 | 0.01 |
1 | Negative | 1–100 | <0.01 | 0.11 | 0.01 | 0.06 |
1 | Negative | 101–1000 | 0.02 | 1.13 | 0.05 | 0.02 |
1 | Negative | 1001–10,000 | 0.24 | 9.68 | 0.42 | – |
1 | Negative | 10,001–100,000 | 2.41 | 81.95 | 3.58 | – |
2 | Positive | 1–100 | 0.01 | 0.19 | 0.01 | <0.01 |
2 | Positive | 101–1000 | 0.03 | 1.68 | 0.12 | <0.01 |
2 | Positive | 1001–10,000 | 0.12 | 5.64 | 0.4 | <0.01 |
2 | Positive | 10,001–100,000 | 1.2 | 47.95 | 3.38 | <0.01 |
2 | Negative | 1–100 | <0.01 | 0.11 | 0.01 | 0.01 |
2 | Negative | 101–1000 | 0.02 | 0.82 | 0.05 | 0.01 |
2 | Negative | 1001–10,000 | 0.07 | 3.19 | 0.18 | 0.08 |
2 | Negative | 10,001–100,000 | 0.59 | 24.54 | 1.42 | 0.05 |
4 | Positive | 1–100 | <0.01 | 0.2 | 0.02 | <0.01 |
4 | Positive | 101–1000 | 0.04 | 2.12 | 0.36 | <0.01 |
4 | Positive | 1001–10,000 | 0.26 | 12.37 | 2.28 | <0.01 |
4 | Positive | 10,001–100,000 | 2.40 | 102.19 | 20.29 | 0.01 |
4 | Negative | 1–100 | 0.01 | 0.27 | 0.02 | 0.03 |
4 | Negative | 101–1000 | 0.04 | 2.11 | 0.32 | 0.03 |
4 | Negative | 1001–10,000 | 0.25 | 11.93 | 2.01 | 0.22 |
4 | Negative | 10,001–100,000 | 2.32 | 98.61 | 17.98 | 0.22 |
5 | Positive | 1–100 | <0.01 | 0.2 | 0.02 | <0.01 |
5 | Positive | 101–1000 | 0.04 | 2.12 | 0.36 | <0.01 |
5 | Positive | 1001–10,000 | 0.26 | 12.37 | 2.28 | 0.01 |
5 | Positive | 10,001–100,000 | 2.4 | 102.19 | 20.29 | <0.01 |
5 | Negative | 1–100 | 0.01 | 0.27 | 0.02 | 0.05 |
5 | Negative | 101–1000 | 0.04 | 2.11 | 0.32 | 0.02 |
5 | Negative | 1001–10,000 | 0.25 | 11.93 | 2.01 | 0.08 |
5 | Negative | 10,001–100,000 | 2.32 | 98.61 | 17.98 | 0.08 |
6 | Positive | 1–100 | 0.01 | 0.23 | 0.02 | <0.01 |
6 | Positive | 101–1000 | 0.05 | 2.52 | 0.17 | <0.01 |
6 | Positive | 1001–10,000 | 0.40 | 17.64 | 1.21 | <0.01 |
6 | Positive | 10,001–100,000 | 1.67 | 71.70 | 4.93 | 0.01 |
6 | Negative | 1–100 | 0.01 | 0.34 | 0.02 | 0.03 |
6 | Negative | 101–1000 | 0.12 | 5.34 | 0.33 | 0.06 |
6 | Negative | 1001–10,000 | 1.05 | 44.86 | 2.61 | 0.09 |
6 | Negative | 10,001–100,000 | 4.61 | 187.90 | 10.42 | 0.17 |
7 | Positive | 1–100 | <0.01 | 0.23 | 0.02 | <0.01 |
7 | Positive | 101–1000 | 0.04 | 1.85 | 0.09 | <0.01 |
7 | Positive | 1001–10,000 | 0.17 | 7.94 | 0.42 | 0.01 |
7 | Positive | 10,001–100,000 | 1.91 | 73.4 | 3.92 | <0.01 |
7 | Negative | 1–100 | <0.01 | 0.16 | 0.01 | 0.03 |
7 | Negative | 101–1000 | 0.02 | 1.09 | 0.06 | 0.02 |
7 | Negative | 1001–10,000 | 0.11 | 5.37 | 0.28 | 0.15 |
7 | Negative | 10,001–100,000 | 1.26 | 49.92 | 2.64 | 0.10 |

Parameters | Values |
---|---|
Objective: | evolve expression classifying amino acid sequences according to examples and counterexamples |
Terminal set: | $\epsilon$, A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V |
Function set: | ${\scriptstyle \sim}$, $\gg$, $\mid$ |
Population size: | 5 |
Crossover probability: | 1.0 |
Selection: | Tournament selection, size $r=3$ |
Termination criterion: | 6000 generations have passed |
Maximum depth of tree after crossover: | 100 |
Initialization method: | A special dedicated algorithm, $k=\lvert X\rvert=\lvert Y\rvert$ equals half of the cardinality of examples |

Method | P | R | F1 | ACC | AUC | MCC | Time [s] |
---|---|---|---|---|---|---|---|
PEG | 0.627 | 0.294 | 0.400 | 0.739 | 0.610 | 0.291 | 0.3 |
ABL | 0.645 | 0.183 | 0.286 | 0.728 | 0.571 | 0.232 | 412.9 |
ADIOS | 0.329 | 0.633 | 0.433 | 0.508 | 0.544 | 0.082 | 15.6 |
Blue-fringe | 0.367 | 0.303 | 0.332 | 0.639 | 0.541 | 0.088 | 0.9 |
ECGI | 0.875 | 0.064 | 0.120 | 0.720 | 0.530 | 0.189 | 31.0 |
Traxbar | 0.234 | 0.101 | 0.141 | 0.636 | 0.481 | −0.05 | 0.3 |
SVM | 0.224 | 0.001 | 0.131 | 0.526 | 0.471 | −0.06 | 0.4 |

