Split-Based Algorithm for Weighted Context-Free Grammar Induction

The split-based method for weighted context-free grammar (WCFG) induction was formalised and verified on a comprehensive set of context-free languages. The WCFG is learned using a novel grammatical inference method. The proposed method learns a WCFG from both positive and negative samples, and the rule weights are estimated using a novel Inside–Outside Contrastive Estimation algorithm. The results showed that our approach outperforms other state-of-the-art methods in terms of F1 scores.


Introduction
The task of grammar or automata induction is a part of symbolic artificial intelligence [1] and is called grammatical inference or grammar induction [2]. Among the different subtasks of this scientific field, learning (stochastic or, more generally, weighted) context-free grammars (CFGs) from input data has been growing in importance due to its practical implications, such as the modelling of natural language and biological sequences.
Learning CFGs is known to be a hard task, and notable questions remain open [2]. According to Gold's theorem [3], CFGs cannot be learned from positive examples only, but in 1969 Horning proved that for effective probabilistic/stochastic CFG (PCFG) induction no negative evidence is obligatory [4]. It is imperative to note, however, that learning a PCFG from positive data only leads to overly general grammars, making it difficult to discriminate negative sequences in the input data. To overcome these difficulties, we have recently proposed a novel algorithm for weighted CFG (WCFG) learning [5,6]. The Weighted Grammar-based Classifier System (WGCS) is one of the few grammatical inference approaches that learn both the grammar structure (i.e., the rules) and the stochastic grammar parameters (i.e., the rule weights). Initially, the method was dedicated to learning crisp context-free grammars [7]; later, it was extended to weighted versions (including fuzzy [8] and stochastic [9] ones).
WGCS learns in an unsupervised manner from unannotated data, such as a structured corpus or treebank. There are other unsupervised grammatical inference methods, such as ABL [10], EMILE [11], ADIOS [12], and LS [13]. However, none of these methods induces both the structure and the parameters of a grammar.
The main contribution of this paper is to define and test a new version of the WGCS approach, in which the split concept has been employed to reveal the grammar structure. Although the split was used for the first time in [6], its verification was rudimentary and limited due to the unrepresentative bioinformatics dataset. Here, the new approach is formalised and tested over a comprehensive set of artificial CFG datasets, and its computational complexity is given. Moreover, the improved WGCS is compared with two state-of-the-art unsupervised methods dedicated to CFG learning: LS [13] and ADIOS [12]. Additionally, the rule weight estimation algorithm was improved by mitigating unbalanced data bias.
The rest of the paper is organised as follows. Section 2 gives some details about our approach. In Section 3, we present a test environment and eventually the results are reported in Section 4. Section 5 concludes the paper.

Weighted Grammar-Based Classifier System
WGCS belongs to the family of learning classifier systems [14] and is based on a previous version [7] that works only on context-free grammars without probabilities or weights. Following the grammatical inference setting, the WGCS system receives a set of labelled positive and negative sentences as input and outputs a WCFG. In WGCS, all grammar rules are in Chomsky Normal Form (CNF). The induction scheme of this model is shown in Figure 1, and the overall system architecture is shown in Figure 2.

Weighted Context-Free Grammar
A context-free grammar is a quadruple (N, T, S, R), where N is a finite set of nonterminal symbols disjoint from T, T is a finite set of terminal symbols, S ∈ N is the start symbol, and R is a finite set of rules of the form X → α, where X ∈ N and α ∈ (N ∪ T)*. A CFG is in CNF when each rule takes one of the two following forms: X → YZ, where X, Y, Z ∈ N, or X → t, where X ∈ N and t ∈ T.
A WCFG associates with each rule in R a positive number called the weight (assigning a weight of zero to a rule equates to excluding it from R). More formally, a WCFG is a 5-tuple (N, T, S, R, W), where (N, T, S, R) is a CFG and W is the finite set of rule weights given by a function φ such that φ(X → α) = w, where X → α ∈ R and w > 0 is a positive weight.
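To make the definition concrete, the 5-tuple can be sketched as a small data structure; this is an illustrative sketch, and the class and function names below are ours, not part of the WGCS implementation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str    # X ∈ N
    rhs: tuple  # (Y, Z) with Y, Z ∈ N, or (t,) with t ∈ T (CNF)

@dataclass
class WCFG:
    nonterminals: set
    terminals: set
    start: str
    weights: dict = field(default_factory=dict)  # plays the role of φ: Rule -> w > 0

    def add_rule(self, lhs, rhs, w):
        # a zero weight would equate to excluding the rule from R
        assert w > 0, "weights must be positive"
        self.weights[Rule(lhs, tuple(rhs))] = w

g = WCFG({"S", "A"}, {"a"}, "S")
g.add_rule("S", ("A", "A"), 1.0)  # nonterminal rule X -> Y Z
g.add_rule("A", ("a",), 0.5)      # terminal rule X -> t
```

Storing φ as a dictionary keyed by rules makes excluded rules (weight zero) simply absent, which matches the definition above.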

Grammar Initialisation
Grammar in the WGCS system is initialised in two ways. The first way is to load a previously prepared grammar from a file. The second way is to generate it automatically in the application. Based on the training set, the symbols in this set are the terminal symbols of the grammar, while their uppercase representations are the nonterminal symbols. Then, to generate the terminal rules, each nonterminal symbol is assigned to each terminal symbol. Nonterminal rules, in turn, are generated from all possible combinations of all nonterminal symbols. According to [15], the number of all production rules of the CFG is O(L³), where L is the length of the input sentence. However, it should be noted that generating all productions is a one-time operation and does not affect the complexity of the method. In practice, the number of rules generated in the inference process is significantly smaller than this upper limit. Theoretically, the number of production rules in the comparative methods (LS and ADIOS) is bounded by O(L).
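The automatic initialisation scheme described above can be sketched as follows; this is a simplified illustration and the identifiers are ours:

```python
from itertools import product

def init_grammar(training_set):
    """Generate the initial rule set from a training set of sentences.
    Terminals are the symbols occurring in the data; their uppercase
    forms serve as the nonterminals."""
    terminals = sorted({sym for sentence in training_set for sym in sentence})
    nonterminals = [t.upper() for t in terminals]
    # terminal rules: every nonterminal assigned to every terminal
    terminal_rules = [(A, (t,)) for A in nonterminals for t in terminals]
    # nonterminal rules: all |N|^3 combinations A -> B C
    nonterminal_rules = [(A, (B, C)) for A, B, C in product(nonterminals, repeat=3)]
    return terminal_rules, nonterminal_rules

t_rules, n_rules = init_grammar([["a", "b"], ["a", "a", "b"]])
# 2 terminals and 2 nonterminals give 2*2 = 4 terminal rules
# and 2^3 = 8 nonterminal rules
```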

Stochastic CKY Parser
To verify whether a given sentence belongs to a specific grammar, special algorithms called parsers are used. One of the most common CFG parsers based on dynamic programming is the Cocke-Kasami-Younger (CKY) parser [16,17]. Its extension used to parse stochastic CFGs is Stochastic CKY, first described in [18]. Both the classical and the stochastic CKY algorithms assume the grammar to be in CNF. The stochastic CKY algorithm is used in the WGCS system, and its operation is represented by Algorithm 1.
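As an illustration of the idea, a generic weighted CKY recogniser for CNF grammars can be sketched as follows. This is our simplified variant that accumulates the total derivation weight bottom-up; it is not necessarily identical to Algorithm 1, and the identifiers are ours:

```python
from collections import defaultdict

def cky_inside(words, terminal_rules, nonterminal_rules, start="S"):
    """Weighted CKY for a CNF grammar: chart[(i, j)] maps each nonterminal
    to the total weight of deriving words[i:j] from it. A sentence is
    accepted when the start symbol covers the whole sentence with
    positive weight."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                    # spans of length 1
        for (A, t), wt in terminal_rules.items():
            if t == w:
                chart[(i, i + 1)][A] += wt
    for span in range(2, n + 1):                     # longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                # split point
                for (A, B, C), wt in nonterminal_rules.items():
                    left = chart[(i, k)].get(B, 0.0)
                    right = chart[(k, j)].get(C, 0.0)
                    if left and right:
                        chart[(i, j)][A] += wt * left * right
    return chart[(0, n)].get(start, 0.0)

# toy grammar for a^n b^n: S -> A B | A X, X -> S B, A -> a, B -> b
trules = {("A", "a"): 1.0, ("B", "b"): 1.0}
nrules = {("S", "A", "B"): 1.0, ("S", "A", "X"): 1.0, ("X", "S", "B"): 1.0}
```

With the toy grammar above, `cky_inside(list("aabb"), trules, nrules)` returns a positive weight, while a string outside the language, such as `"aab"`, yields 0.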

Split Algorithm
This method is based on [19,20], where grammar is induced incrementally. During each iteration of the algorithm, a new nonterminal symbol X_j is created from another nonterminal symbol X_i through a split operation. The symbol selected for the split operation is the one used most often during rule weight estimation, i.e., having the largest count (see line 2 in Algorithm 2). This symbol is called the split symbol. Then, for all terminal rules X_i → t, we create new terminal rules, replacing X_i with X_j (see lines 3-5 in Algorithm 2). Next, new nonterminal rules are created in two ways:

1.
For the symbols X_i and X_j, create all possible rules (see line 6 in Algorithm 2). Since the rules are in CNF, their number is 2³ = 8.

2.
For all nonterminal rules of the form X_a → X_b X_c, where X_a, X_b, or X_c is X_i, create new nonterminal rules of the same form, replacing X_i with X_j. For multiple occurrences of X_i in a rule, create all combinations (see lines 7-20 in Algorithm 2).

To illustrate the split method, suppose there is a set of rules over the set of terminals T = {a, b} with the start symbol S. We select the symbol with the largest count (let it be Y) and create a new symbol, the split symbol Z. According to lines 3-5 of Algorithm 2, the new terminal rules {Z → a, Z → b} are generated. The new nonterminal rules are generated as follows.

1. From Y and the split symbol Z, create all 8 rules over the pair {Y, Z}.
2. From every existing nonterminal rule containing Y, create copies with Y replaced by Z in all combinations of positions.

Algorithm 1 Stochastic CKY

Algorithm 2 Split
 1: Input: grammar in CNF, rule counts
 2: Select the split symbol X_i (the nonterminal with the largest count) and create a new nonterminal X_j
 3: for all terminal rules X_i → t do
 4:     Create new terminal rule X_j → t
 5: end for
 6: Create all possible nonterminal rules from the two nonterminal symbols X_i and X_j
 7: for all nonterminal rules X_a → X_b X_c containing X_i do
 8:     if X_a == X_i and X_b == X_i then
 9:         Create new nonterminal rules: X_j → X_i X_c, X_i → X_j X_c, X_j → X_j X_c
10:     else if X_a == X_i and X_c == X_i then
11:         Create new nonterminal rules: X_j → X_b X_i, X_i → X_b X_j, X_j → X_b X_j
12:     else if X_b == X_i and X_c == X_i then
13:         Create new nonterminal rules: X_a → X_j X_i, X_a → X_i X_j, X_a → X_j X_j
14:     else if X_a == X_i then
15:         Create new nonterminal rule X_j → X_b X_c
16:     else if X_b == X_i then
17:         Create new nonterminal rule X_a → X_j X_c
18:     else if X_c == X_i then
19:         Create new nonterminal rule X_a → X_b X_j
20:     end if
21: end for
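The split operation can be sketched compactly in Python. This is our illustrative version (identifiers are ours): rules are stored as tuples, and the per-position case analysis of Algorithm 2 is expressed as one product over replacement options, which yields exactly the "all combinations" behaviour described above:

```python
from itertools import product

def split(terminal_rules, nonterminal_rules, counts, new_symbol):
    """One split step: pick the nonterminal with the largest count (X_i),
    clone its terminal rules for a fresh nonterminal X_j, add all rules
    over the pair {X_i, X_j}, and clone every nonterminal rule containing
    X_i with X_i replaced by X_j in all combinations of positions."""
    xi = max(counts, key=counts.get)  # split symbol X_i
    xj = new_symbol                   # fresh nonterminal X_j
    new_t = {(xj, t) for (a, t) in terminal_rules if a == xi}
    # all 8 CNF rules over the two symbols X_i and X_j
    new_n = set(product((xi, xj), repeat=3))
    # clone existing rules; rules without X_i reproduce themselves and
    # are deduplicated by the set union
    for (a, b, c) in nonterminal_rules:
        opts = [(s, xj) if s == xi else (s,) for s in (a, b, c)]
        new_n |= set(product(*opts))
    return terminal_rules | new_t, nonterminal_rules | new_n, xi, xj

trules = {("Y", "a"), ("Y", "b")}
nrules = {("S", "Y", "Y")}
t2, n2, xi, xj = split(trules, nrules, {"Y": 5, "S": 2}, "Z")
# Y is split into Z; S -> Y Y spawns S -> Y Z, S -> Z Y, S -> Z Z
```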

Rule Weight Estimation
After establishing the grammar structure, we can focus on fine-tuning the rule weights. The most common algorithm used for this purpose is inside-outside. It is a special case of the Expectation-Maximisation algorithm designed to estimate the parameters of a stochastic context-free grammar: originally the rule probabilities and, in our case, the rule weights. Two algorithms are described in this subsection: the original inside-outside algorithm and its extended version used in the WGCS system, which also uses negative sentences when estimating rule weights.

Inside-Outside
Baker introduced the inside-outside algorithm [21]. Its computational complexity is O(L³|N|³), where L is the sentence length and |N| is the number of nonterminal symbols in the grammar [22].
The inside-outside algorithm starts the estimation process from the initial probabilities/weights of the rules (usually assigned randomly). At each iterative step, it updates the probability/weight of the rule based on the frequency of the rule in the training set. To better understand the algorithm, let us introduce the basic nomenclature.
• Probability/weight of a rule:
  - for nonterminal rules: φ(X → YZ)
  - for terminal rules: φ(X → x)
• Probability/weight of deriving a sentence from the grammar:
  φ(S ⇒ W) = φ(S ⇒ w_1 w_2 . . . w_n),
  where ⇒ stands for sentence derivation, W stands for the sentence, and w_1 w_2 . . . w_n are the individual words of the sentence W.
• The inside probability is the probability of deriving, from a given nonterminal symbol X, the sequence of words w_i . . . w_j of the sentence W = w_1 . . . w_n:
  β_ij(X) = φ(X ⇒ w_i . . . w_j),
  where X is any nonterminal grammar symbol. Figure 3 shows the graphical interpretation of the inside probability for the nonterminal symbol Y, β_ij(Y).
• The outside probability is the probability of deriving, from the start symbol, the string w_1 . . . w_{i−1} X w_{j+1} . . . w_n:
  α_ij(X) = φ(S ⇒ w_1 . . . w_{i−1} X w_{j+1} . . . w_n).
  Figure 4 shows the graphical representation of the outside probability for the nonterminal symbol Y, α_ij(Y).
• The estimated number of uses of a rule in a single sentence W:
  - for nonterminal rules: count_W(X → YZ) = (1/φ(S ⇒ W)) · Σ_{1≤i≤k<j≤n} α_ij(X) · φ(X → YZ) · β_ik(Y) · β_{k+1,j}(Z)
  - for terminal rules: count_W(X → x) = (1/φ(S ⇒ W)) · Σ_{i: w_i = x} α_ii(X) · φ(X → x)
• The total estimated number of uses of a rule determines how often the rule occurs in all sentences of the training set:
  count(X → α) = Σ_i count_{W_i}(X → α),
  where X → α stands for a terminal or nonterminal rule and W_i for the successive sentences in the training set.
• The new weight/probability of a rule is calculated as the ratio of the total estimated number of uses of the rule to the summed estimated uses of all rules with the same left-hand-side symbol:
  φ'(X → α) = count(X → α) / Σ_γ count(X → γ),
  where X → γ stands for a rule with the same left-hand-side symbol as the rule in the numerator and any right-hand side γ admissible in CNF (either a terminal symbol or two nonterminal symbols).
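The final re-estimation step, the ratio of a rule's total estimated count to the summed counts of all rules sharing its left-hand-side symbol, can be sketched directly (identifiers are ours):

```python
from collections import defaultdict

def reestimate(counts):
    """M-step of inside-outside: the new weight of each rule is its total
    estimated count divided by the summed counts of all rules with the
    same left-hand-side symbol."""
    totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs]
            for (lhs, rhs), c in counts.items() if totals[lhs] > 0}

counts = {("S", ("A", "B")): 3.0, ("S", ("A", "A")): 1.0, ("A", ("a",)): 2.0}
weights = reestimate(counts)
# S -> A B gets 3/4, S -> A A gets 1/4, A -> a gets 1
```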

Inside-Outside Contrastive Estimation
The inside-outside contrastive estimation (IOCE) algorithm is an extended version of the inside-outside algorithm that includes the use of negative sentences in the estimation of rule weights. This approach is inspired by works using the classic contrastive estimation method [23,24]. However, it differs significantly from the solutions proposed in those works.
In IOCE, we introduce a so-called negative estimation factor for each rule, computed from count_negative(X → α), the total estimated number of uses of the rule in all negative sentences in the training set, and from the coefficient θ = number_of_positive_sentences / number_of_negative_sentences, which is introduced to mitigate the bias of imbalanced datasets.
Using this factor, we calculate the new weight of the rule. The general idea of the negative estimation factor is that the more often a rule appears in the set of negative sentences, the smaller the factor. By multiplying the factor by the current weight of the rule, we reduce its weight. When the rule does not appear even once in the set of negative sentences, the factor is equal to 1 and the weight of the rule does not change.
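The update can be illustrated with a small sketch. Note that the exact functional form of the factor used in WGCS is not reproduced above; the form 1/(1 + θ · count_negative) below is our assumption, chosen only to satisfy the stated properties (equal to 1 when the rule never occurs in negative sentences, decreasing as the negative count grows):

```python
def negative_factor(count_neg, theta):
    """Assumed functional form (not necessarily the authors' exact one):
    equals 1 when count_neg == 0 and shrinks as count_neg grows."""
    return 1.0 / (1.0 + theta * count_neg)

def ioce_update(weight, count_neg, n_pos, n_neg):
    """Multiply the current rule weight by the negative estimation factor."""
    theta = n_pos / n_neg  # mitigates imbalanced-dataset bias
    return weight * negative_factor(count_neg, theta)

w = ioce_update(0.8, 0.0, 100, 50)  # no negative uses: weight unchanged
```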

Removing Rules with Low Weight
In the WGCS system with the split algorithm, the grammar size increases with each iteration, which significantly affects the computational complexity of the IO/IOCE algorithms. Therefore, to prevent slowing down of the system and to maintain good sentence classification quality, the WGCS system implements a mechanism that cleans the grammar of rules with low weights. If a rule's weight falls below a threshold, the rule is removed from the system. There are two thresholds for deleting rules: 0.001 for nonterminal rules and 0.000001 for terminal rules. These values have been determined experimentally.
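The pruning mechanism follows directly from the two thresholds given above; a minimal sketch (rule representation and names are ours):

```python
NONTERMINAL_THRESHOLD = 0.001     # experimentally determined values
TERMINAL_THRESHOLD = 0.000001

def prune(weights):
    """Remove rules whose weight fell below the relevant threshold.
    In CNF, a rule is terminal exactly when its right-hand side
    has length 1."""
    def keep(rule, w):
        lhs, rhs = rule
        threshold = TERMINAL_THRESHOLD if len(rhs) == 1 else NONTERMINAL_THRESHOLD
        return w >= threshold
    return {rule: w for rule, w in weights.items() if keep(rule, w)}

pruned = prune({("S", ("A", "B")): 0.0005, ("A", ("a",)): 0.5})
# the nonterminal rule S -> A B (weight 0.0005 < 0.001) is removed
```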

Experimental Protocol and Complexity Analysis
WCFG is learned using WGCS according to the experimental protocol described in Algorithm 3. The worst-case run-time complexity of this algorithm can be evaluated as follows. Let k be the number of iterations of the WGCS algorithm, |G| the size of the grammar, z the number of sentences in the training set, n the number of IOCE iterations, |N| the number of nonterminal symbols, L the length of a sentence, and y the number of sentences in the validation set.

Algorithm 3 Experimental protocol
1: Initialise the grammar
2: for each of the k WGCS iterations do
3:     Run the split algorithm
4:     Estimate rule weights with IOCE
5:     Remove rules with low weights
6:     Test grammar with CKY on the training set
7: end for
8: Test best grammar with CKY on the validation set

In the algorithm above, for a worst-case evaluation it should be assumed that the run time of the split algorithm (line 3) is bounded by |G|, of the IOCE algorithm (line 4) by z · n · |N|³L³, of removing rules (line 5) by |G|, of testing the grammar (line 6) by z · L³|G|, and of testing the best grammar (line 8) by y · L³|G|. Thus, the total time to run lines 1-8 is

O(k · (|G| + z · n · |N|³L³ + |G| + z · L³|G|) + y · L³|G|).

Note that |G| = k · |T| + k³ and |N| = k, so |G| = O(|T| · k³), and the total running time is estimated as

O(z · n · |T| · k⁴L³ + y · |T| · k³L³).

Since the maximum number of terminals |T| can be replaced by (y + z)L, this gives

O(z · n · (y + z) · L · k⁴L³ + y · (y + z) · L · k³L³),

which reduces to

O((y + z) · k³L⁴(z · n · k + y)).

As we can see, the complexity of the proposed method is polynomially bounded with respect to the input size.

Datasets
For our experiments, 28 datasets have been prepared. Most of them were generated from random context-free grammars G_i obtained with the CFG/TestSet Generator [25]. The target finite samples, each containing about 200 words, were constructed as follows. Let Z_i = {w ∈ L_i : 1 ≤ |w| ≤ K_i}, where K_i is an integer from 10 to 20 and L_i = L(G_i). 100 words chosen randomly from the set Z_i, along with the optimal examples given by the generator, constituted the examples. Let z ∈ Z_i and y ∈ Σ* be words that differ by a few letters as a consequence of a swap, an insertion, or a deletion. 100 such randomly generated words y ∉ L_i, 1 ≤ |y| ≤ K_i, constituted the counterexamples.
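The swap/insertion/deletion perturbation used to generate counterexample candidates can be sketched as follows; this is our illustrative version, and the membership check y ∉ L(G_i), which filters out perturbed words that still belong to the language, is omitted here:

```python
import random

def perturb(word, alphabet, rng):
    """Derive a near-miss word by one random swap, insertion, or
    deletion, mirroring the counterexample construction described
    above (language-membership filtering not included)."""
    w = list(word)
    op = rng.choice(["swap", "insert", "delete"]) if len(w) > 1 else "insert"
    i = rng.randrange(len(w)) if w else 0
    if op == "swap":
        j = rng.randrange(len(w))
        w[i], w[j] = w[j], w[i]
    elif op == "insert":
        w.insert(i, rng.choice(alphabet))
    else:
        del w[i]
    return "".join(w)

rng = random.Random(0)
y = perturb("aabb", ["a", "b"], rng)
```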

Brief Description of Other Approaches
In [13], Wieczorek described a local search (LS) approach. Here, we present this method using a simple example. Assume that the set {ab, abab} constitutes the examples, while the set {a, b, ba, aba} constitutes the counterexamples. The first step is to construct a grammar that generates all the examples; a special algorithm has been devised for this purpose. It could produce the following grammar: S → AB, A → a, B → CD, C → b, D → AC | E, and E → ε. Then, in a loop, pairs of variables are merged as long as the grammar can be shortened. In the example, after merging B and C (into B), S and D, and S and E, we get the grammar: S → AB | ε, A → a, B → b | BS. Finally, unnecessary variables and rules are removed from the resultant grammar (in the example, the rule S → ε). Every step is controlled by means of the counterexamples so as to obtain a valid grammar.
The Automatic DIstillation Of Structure (ADIOS) model, which uses only examples, builds syntactic representations of a sample of language from unlabelled data [28]. It consists of two elements: (1) a Representational Data Structure (RDS) graph and (2) a Pattern Acquisition (PA) algorithm that constructs the RDS in an unsupervised manner. The goal of the PA algorithm is to detect patterns, i.e., repetitive sequences of "significant" strings occurring in the examples. In this respect, the PA algorithm is related to prior work on alignment-based learning and regular expression extraction from strings. However, the authors of ADIOS stress that their algorithm requires no prejudging of either the scope of the primitives or their classification. In the initial phase of the PA algorithm, the examples are segmented down to the smallest possible morphological constituents. In the second phase, the PA algorithm repeatedly scans the RDS graph for common patterns, which are then used to modify the graph. The ADIOS algorithm has been tested on a variety of linguistic and bioinformatics data with promising results.
The code of WGCS and LS along with the benchmarks are available at [29]. The code of ADIOS is available on request from the authors of this method.

Results
Our experiments were performed on an Intel Core i7-7567U CPU (3.5 GHz) with 32 GB RAM under the Windows 10 operating system. Three methods, i.e., our proposal WGCS and the two reference methods, local search (LS) and ADIOS, were used to infer grammars for the 28 benchmark datasets. Five-fold cross-validation was performed on each set and the results were averaged. To evaluate the classification quality of the compared methods, we use the classification results stored in a confusion matrix. Four scores were defined: tp, fp, fn, and tn, representing the numbers of true positives (correctly recognised positive sentences), false positives (negatives recognised as positives), false negatives (positives recognised as negatives), and true negatives (correctly recognised negatives), respectively. Based on the values stored in the confusion matrix, we calculate the widely used Precision, Recall (Sensitivity), and the combined F1 score. Precision is defined as P = tp/(tp + fp), Recall (Sensitivity) as R = tp/(tp + fn), and F1 as the harmonic mean of Precision and Recall, F1 = 2 · P · R/(P + R). Table 2 shows these results with respect to Precision (Pr), Recall (Rc), and F1 score. The table additionally shows the average number of productions and the average computation time obtained for each tested grammar by all methods, calculated over the five folds of the cross-validation.
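The three metrics follow directly from the confusion-matrix counts defined above (tn does not enter any of the three formulas):

```python
def scores(tp, fp, fn, tn):
    """Precision, Recall (Sensitivity), and F1 from confusion-matrix
    counts; tn is listed for completeness but unused by these metrics."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall / (precision + recall))  # harmonic mean
    return precision, recall, f1

p, r, f1 = scores(tp=8, fp=2, fn=2, tn=8)  # P = R = F1 = 0.8
```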
To find out whether the observed differences are statistically significant, we applied the Wilcoxon signed-rank test [30] to WGCS vs. LS and WGCS vs. ADIOS. In this test, the null hypothesis (H₀) states that the differences between the pairs follow a symmetric distribution around zero. The alternative hypothesis (H₁), on the other hand, states that they do not (i.e., the difference is not a coincidence). As we can see in Table 3, the p-values are small enough (all below 0.025) to reject H₀. Therefore, we can conclude that WGCS performs better than the two competing methods on the prepared benchmarks, although it is slower than ADIOS.

Conclusions
We have formalised and verified the split method in weighted context-free grammar induction. The new approach to weighted CFG learning has been applied within the framework of the Weighted Grammar-based Classifier System. Additionally, the inside-outside contrastive estimation algorithm was improved by correcting unbalanced data bias. The experiments conducted over 28 context-free languages showed that WGCS with splitting outperforms the state-of-the-art methods in terms of F1 scores.
Further work is ongoing to investigate the use of a combined split-merge method in discovering WCFGs. It should be noted that the split mechanism leads to a superlinear increase in the number of generated productions. It is also worth noting that grammar splitting is focused on specialising grammar productions, whereas grammar merging can generalise the model by merging some nonterminals. Grammar merging seems to be a promising approach to pruning unwanted structures [31].

Data Availability Statement:
The data presented in this study are available in [29].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ADIOS  Automatic DIstillation Of Structure
CFG    Context-Free Grammar
CKY    Cocke-Kasami-Younger (parser)
CNF    Chomsky Normal Form
IOCE   Inside-Outside Contrastive Estimation
LS     Local Search
PA     Pattern Acquisition
PCFG   Probabilistic (Stochastic) Context-Free Grammar
RDS    Representational Data Structure
WCFG   Weighted Context-Free Grammar
WGCS   Weighted Grammar-based Classifier System