
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

In order to capture effects of co-transcriptional folding, we extend stochastic context-free grammars such that the probability of applying a rule can depend on the length of the subword that is eventually generated from the symbols introduced by the rule, and we show that existing algorithms for training and for determining the most probable parse tree can easily be adapted to the extended model without loss of performance. Furthermore, we show that the extended model is suited to improve the quality of predictions of RNA secondary structures. The extended model may also be applied to other fields where stochastic context-free grammars are used, such as natural language processing. Additionally, some interesting questions in the field of formal languages arise from it.

Single-stranded RNA molecules consist of a sequence of nucleotides connected by phosphodiester bonds. Nucleotides differ only in the bases involved, namely adenine, cytosine, guanine and uracil. The sequence of bases is called the primary structure of the molecule.

When abstracting from the primary structure, secondary structures are often denoted as words over the alphabet Σ = {(, |, )}, where a corresponding pair of parentheses represents a pair of bases connected by a hydrogen bond, while a | stands for an unpaired base. For example, when starting transcription at the marked end, the structure from

The oldest and most commonly used method for computing the secondary structure is to determine the structure with minimum free energy. This was first done by Nussinov

While the energy models used today are much more sophisticated, taking into account, e.g., the types of bases involved in a pair, the types of bases located next to them,

A different approach is based on probabilistic modeling. An (ambiguous) stochastic context-free grammar (SCFG) that generates the primary structures is chosen such that the derivation trees for a given primary structure uniquely correspond to the possible secondary structures. The probabilities of this grammar are then trained either from molecules with known secondary structures or by expectation maximization ([

After this training, the probability distribution induced on the derivation trees will model the probability of the corresponding secondary structures, assuming the training data was representative and the grammar is actually capable of modeling this distribution. Thus the most probable derivation tree, i.e., secondary structure, is computed as the prediction ([

Many other approaches as well as extensions and modifications of the ones mentioned above have been suggested over the years. A recent overview can be found in [

With the exception of approaches that simulate the actual physical folding process (e.g., [

Since the simulation algorithms have the downside of being computationally expensive, it is desirable to add the effects of co-transcriptional folding to the traditional algorithms. In this paper we present an idea of how this can be achieved for the SCFG approach.

Due to co-transcriptional folding one would expect that the probability of two bases being paired depends on how far the bases are apart, and that the probability of a part of the molecule forming a specific motif depends on how large that part of the molecule is. In SCFGs the distance between two paired bases (resp. the size of a motif) is just the size of the subword that results from the rule application introducing the base pair as first and last symbol (resp. starting the motif), assuming such a rule application exists. Thus, to model this variability in the probabilities, we suggest the concept of length-dependent SCFGs, which extend SCFGs such that rule probabilities additionally depend on the size of the subword resulting from the rule application.
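To illustrate the idea (with a toy grammar and probabilities that are our own invention, not taken from any trained model), the only change with respect to an ordinary SCFG is that rule probabilities are looked up with the subword length as an additional key:

```python
# Toy illustration (our own construction): a plain SCFG stores one
# probability per rule, while a length-dependent SCFG keys the lookup
# additionally by the length of the subword the rule application derives.

# Plain SCFG: P(rule)
scfg = {("S", ("(", "S", ")")): 0.5, ("S", ("|",)): 0.5}

# Length-dependent SCFG: P(rule, length bucket). Short-range pairings are
# made more likely than long-distance ones, mimicking co-transcriptional
# folding; the bucket cutoff of 10 is arbitrary.
lscfg = {
    (("S", ("(", "S", ")")), "short"): 0.7,
    (("S", ("|",)), "short"): 0.3,
    (("S", ("(", "S", ")")), "long"): 0.4,
    (("S", ("|",)), "long"): 0.6,
}

def rule_prob(rule, subword_len, length_dependent=True):
    """Look up a rule probability, optionally conditioned on length."""
    if not length_dependent:
        return scfg[rule]
    bucket = "short" if subword_len <= 10 else "long"
    return lscfg[(rule, bucket)]
```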

We will present this extension formally in Section 2. In Sections 3 and 4 we show that existing training algorithms can easily be adapted to the new model without significant losses in performance.

We have compared the prediction quality of the modified model with the conventional one for different grammars and sets of RNA. The results, presented in detail in Section 5, show that taking the lengths into account yields an improvement in most cases.

We assume the reader is familiar with basic terms of context-free grammars. An introduction can be found in ([

A stochastic context-free grammar (SCFG) is a context-free grammar G = (I, T, R, S) together with a mapping P : R → [0, 1] such that for each A ∈ I we have Σ_{A → α ∈ R} P(A → α) = 1.

Words are generated as for usual context-free grammars; the product of the probabilities of the production rules used in a parse tree Δ provides its probability P(Δ). The probability P(w) of a word w ∈ T* is the sum of the probabilities of all parse trees generating w.

The grammar is called consistent if Σ_{w ∈ L(G)} P(w) = 1.
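The probability of a parse tree can be sketched as follows; the encoding of trees as nested tuples and the example probabilities are our own, for illustration only:

```python
# Sketch (our own minimal encoding): a parse tree is a nested tuple
# (rule, children); terminal leaves are omitted since only the rule
# applications contribute to the probability, which is their product.

def tree_prob(tree, P):
    """Probability of a parse tree: product of all rule probabilities."""
    rule, children = tree
    p = P[rule]
    for child in children:
        if isinstance(child, tuple):  # internal node, recurse
            p *= tree_prob(child, P)
    return p

# Tiny grammar S -> (S) | epsilon with example probabilities.
P = {"S->(S)": 0.4, "S->eps": 0.6}
# Tree deriving "(())": apply S->(S) twice, then S->eps.
tree = ("S->(S)", [("S->(S)", [("S->eps", [])])])
```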

As stated in the introduction we want to include the length of the generated subword in the rule probabilities in order to model that the probability of bases being paired depends on how far they are apart in the primary structure. To do this we first need to define the length of a rule application formally:

Let G = (I, T, R, S) be a context-free grammar and Δ a parse tree for a word w ∈ L(G). The length of an application of a rule A → α in Δ is the length of the subword of w derived from the symbols of α introduced by this application.

Now the straightforward way to make the rule probabilities depend on the length is to turn the mapping P into a mapping P : R × ℕ → [0, 1], demanding Σ_{A → α ∈ R} P(A → α, n) = 1 for every premise A and every length n. This, however, raises some problems.

First we note that some combinations of rule and length can never occur in a derivation; e.g., a rule A → a with a ∈ T can only ever be applied with length 1. For such pairs of premise and length we cannot demand Σ_{A → α ∈ R} P(A → α, n) = 1; instead we must allow all these probabilities to be 0.

Further problems arise from the fact that, while before the probability of a rule application was known as soon as the rule was chosen, it now depends on the length of the subword that will eventually be generated, and this length need not be determined at the time the rule is applied.

As an example consider the grammar characterised by the rules

Checking the grammar more closely we find two points where the length of the word generated from an intermediate symbol is not determined when the symbol is introduced during a derivation: At the start of the derivation the length of the generated word is not predetermined and when the rule

Both cases can be countered by adding the proper probability distributions to the model. However, while the start needs only a single probability distribution P_ℓ on the length of the generated word, the second case would require distributions on ℕ^{i−1} for a rule with i intermediate symbols on its right-hand side.

Combining the above considerations, length-dependent stochastic context-free grammars can be defined:

A length-dependent stochastic context-free grammar (LSCFG) is a context-free grammar (I, T, R, S) together with a mapping P : R × ℕ → [0, 1] and a probability distribution P_ℓ on ℕ, such that for every A ∈ I and every n ∈ ℕ the sum Σ_{A → α ∈ R} P(A → α, n) is either 1 or 0.

For a parse tree Δ, every occurrence of a symbol in Δ is assigned a length as follows:

Terminals are always assigned a length of 1.

Any intermediate symbol from which a word u ∈ T* is derived in Δ is assigned the length |u|.

The lengths assigned to the symbols on the right-hand side of a rule application thus add up to the length assigned to its premise.

Words are generated as for usual context-free grammars. The probability P(Δ) of a parse tree Δ generating the word w ∈ T* is the product of P(A → α, l) over all rule applications in Δ, where l is the length assigned to the premise of the application, multiplied by P_ℓ(|w|). The probability P(w) of a word w ∈ T* is again the sum of the probabilities of all parse trees generating w.

As before, the grammar is called consistent if Σ_{w ∈ L(G)} P(w) = 1.

This definition allows the productions to be equipped with arbitrary probabilities as long as, for any given pair (premise, length), they represent a probability distribution or are all 0. In the present paper we will however confine ourselves to grouping the lengths into finitely many intervals, a rule having the same probability for lengths in the same interval. This allows the probabilities to be stored as a vector and retrieved in the algorithms without further computation. As we will see, it is nevertheless powerful enough to yield a significant improvement over non-length-dependent SCFGs with respect to the quality of the prediction.
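A minimal sketch of this storage scheme (the interval boundaries and probabilities below are invented for illustration):

```python
import bisect

# Sketch: lengths are grouped into finitely many intervals and each rule
# stores one probability per interval, so a lookup is a binary search on
# the interval boundaries plus an indexing step.

BOUNDS = [10, 20, 40, 80]  # intervals [1,10], [11,20], [21,40], [41,80], [81,oo)

def interval_index(length):
    """Index of the interval containing the given length."""
    return bisect.bisect_left(BOUNDS, length)

# One probability vector per rule, indexed by interval (made-up values).
probs = {"S->(S)": [0.7, 0.6, 0.5, 0.4, 0.3]}

def rule_prob(rule, length):
    return probs[rule][interval_index(length)]
```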

Since for bottom-up parsing algorithms like CYK (see e.g., [...]) the subword derived from a rule application is determined before the rule itself is processed, the length is already known whenever a rule probability is needed; the only change required is to look up P(A → α, l) with the appropriate length l.

For probabilistic Earley parsing ([...]), which proceeds top-down, the length of the subword derived from a rule application is not known when the rule is predicted; the factor P(A → α, l) is therefore only multiplied in once the corresponding item is completed, as described in Section 4.

Neither of these changes influences the run-time significantly. The same holds for the changes to the training algorithms explained at the beginning of the following section.

When given a set of training data T, the goal of training is to determine rule probabilities such that

the grammar is consistent and

the likelihood Π_{t ∈ T} P(t) of the training data is maximal.

For (non-length-dependent) SCFGs it is well known how this can be achieved [

If we are training from full parse trees, the relative frequencies with which the rules occur (among all rules with the same left-hand side) are a maximum-likelihood estimator.
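This estimator amounts to a simple counting procedure; the following sketch (with a made-up set of observed rule applications) illustrates it:

```python
from collections import Counter

# Sketch of maximum-likelihood training from full parse trees: count how
# often each rule occurs and normalise per left-hand side.

def train(rule_occurrences):
    """rule_occurrences: list of (lhs, rhs) rule applications collected
    from the training parse trees. Returns relative frequencies."""
    counts = Counter(rule_occurrences)
    lhs_totals = Counter(lhs for lhs, _ in rule_occurrences)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

# Example: S -> (S) observed three times, S -> | once.
observed = [("S", "(S)"), ("S", "(S)"), ("S", "|"), ("S", "(S)")]
```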

When given a training set of words without known parse trees, the probabilities are estimated by expectation maximization (the inside-outside algorithm).

Given an initial estimate of the rule probabilities, it uses a modified version of the parsing algorithm to determine for each word the inside and outside probabilities α_{i,j} and β_{i,j}, from which the expected number of applications of each rule can be computed; these expected counts then take the place of the actual counts when the probabilities are re-estimated, and the procedure is iterated.

Both algorithms can easily be converted for LSCFGs: instead of determining the (actual resp. expected) number of occurrences of each rule globally, we determine separate counts for each length interval and use these to compute separate (expected) relative frequencies. This works since each occurrence (resp. probability of occurrence) is tied to a specific subword and is thus associated with a length.
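The per-interval counting can be sketched by carrying the subword length along with each occurrence; the two-bucket grouping below is our own simplification of the interval scheme used later:

```python
from collections import Counter

# Sketch of the length-dependent variant: each rule occurrence carries the
# length of the subword it derived, and counts are kept per length bucket
# (the bucketing function is ours, for illustration).

def bucket(length):
    return "short" if length <= 40 else "long"

def train_ld(occurrences):
    """occurrences: list of ((lhs, rhs), subword_length) pairs.
    Returns relative frequencies per (rule, bucket)."""
    counts = Counter((rule, bucket(l)) for rule, l in occurrences)
    totals = Counter((rule[0], bucket(l)) for rule, l in occurrences)
    return {key: c / totals[(key[0][0], key[1])] for key, c in counts.items()}

obs = [(("S", "(S)"), 8), (("S", "|"), 8), (("S", "(S)"), 60), (("S", "(S)"), 70)]
```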

What remains to be shown is that the modified versions of the algorithms still provide rule probabilities that yield consistent grammars and maximize the likelihood.

Our proof is based on the observation that an LSCFG is equivalent to an indexed grammar in which each intermediate symbol A is replaced by copies A_i, the index i giving the length of the subword that is to be generated from A_i.

This indexed grammar is not context-free since it contains infinitely many intermediate symbols and rules, but it can be converted into an SCFG by removing all symbols and rules with indices larger than some threshold n; the rules S′ → S_i introducing the indexed start symbols are then assigned the probabilities P_ℓ(i).

For technical reasons the following formal definition will introduce one additional modification: The rules _{i}_{i}_{α,i}_{α,i}_{α,i}

For _{ℓ}_{n}_{n}, T, R_{n}, S′, P_{n}

_{n}_{i}_{α,i}

Let ^{m}_{n}_{m}_{n}

Additionally _{n}_{0≤j≤n} _{ℓ}

The first claim is shown by structural induction on the derivations using the intuition given before Definition 4. The additional fact follows immediately from the first claim and the definition of the probabilities _{i}

Thus the probability distribution on _{n}_{ℓ}_{n}, n_{t∈
} |

If _{n}_{i}_{α,i}_{n}_{j}_{α,j}

The second restriction obviously only applies if lengths are to be grouped into intervals as we will do in this paper. However not all such groupings can yield a consistent grammar. The following gives, as we will show afterwards, a sufficient condition that they do.

Let _{i}^{i}^{*} _{i}_{j}^{j}^{*} _{j}

Intuitively to satisfy this condition we may not group lengths

Note that the partitioning into sets of one element each, which corresponds to not grouping lengths into intervals at all, is trivially consistent with each CFG

Let

A function

A pair of functions (_{ℓ}

Let _{ℓ}_{ℓ}

For _{n}_{n}, T, R_{n}, S′, P_{n}

_{n}, T, R_{n}, S′

By the definition of _{n}_{0} → _{0} → _{α,0} and _{α,0} → _{α,0} occurs in
_{α,0} does not occur in
_{0} nor

Thus

_{i}_{i}_{i}

By Definitions 4 and 6 we find _{n}^{*} _{n}_{i}_{α,i}_{n}_{α,i}_{i}_{i}_{B∈I} _{i,AB}_{i}_{B∈I} _{i,AB}

Now let _{i}_{i}_{i}_{i}_{i}_{i}_{B∈I} _{i,CB}_{i}_{B∈I} _{i,CB}_{i}_{i}_{B∈I} _{i,DB}_{i}_{B∈I} _{i,DB}_{i}

Let _{ℓ}

Let w_Δ denote the word generated by Δ. Then we find

On the right hand side of (

Thus by the definitions of

Let

From Theorem 11 of [

In order to find the most probable derivation for a given primary structure, we decided to employ a probabilistic Earley parser, since it allows the grammars to be used unmodified, while the commonly used CYK algorithm requires the grammars to be transformed into Chomsky normal form.

A (non-probabilistic) Earley parser operates on lists of items (also called dotted productions), representing partial derivations. We will write (i : k, X → λ₁ • λ₂) for the item expressing that a derivation from X was started at position k and that, after reading the first i symbols of the input, the subword w_{k+1} ⋯ w_i has been generated from λ₁.

The parser is initialized with (0 : 0, S′ → • S), where S′ is a new start symbol. The items for the positions i = 0, …, n are then generated by exhaustively applying the following three operations:

Scanner: If ∃ (i : k, X → λ₁ • aλ₂) with a ∈ T and w_{i+1} = a, add (i + 1 : k, X → λ₁a • λ₂).

Predictor: If ∃ (i : k, X → λ₁ • Aλ₂) and A → ν ∈ R, add (i : i, A → • ν).

Completer: If ∃ (i : j, A → ν •) and (j : k, X → λ₁ • Aλ₂), add (i : k, X → λ₁A • λ₂).

Intuitively the scanner advances the point past terminal symbols if they match the corresponding symbol of the parsed word, the predictor adds all the productions that might yield a valid extension of the following (nonterminal) symbol, and the completer advances the point past a nonterminal if one actually did.

We then have w ∈ L(G) if and only if the item (n : 0, S′ → S •) is generated.
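The scanner/predictor/completer scheme can be turned into a compact recognizer; the following is our own minimal sketch (recognition only, no probabilities, with a toy grammar for balanced secondary-structure strings):

```python
# Minimal Earley recognizer (our own sketch of the standard algorithm).
# An item (k, lhs, rhs, dot) in chart[i] means: a derivation from lhs was
# started at position k and rhs[:dot] has generated word[k:i].

def earley(word, rules, start="S"):
    n = len(word)
    chart = [set() for _ in range(n + 1)]
    chart[0].add((0, "S'", (start,), 0))
    for i in range(n + 1):
        changed = True
        while changed:  # closure under predictor and completer at position i
            changed = False
            for item in list(chart[i]):
                k, lhs, rhs, dot = item
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in rules:  # predictor
                        for alt in rules[sym]:
                            new = (i, sym, alt, 0)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True
                    elif i < n and word[i] == sym:  # scanner
                        chart[i + 1].add((k, lhs, rhs, dot + 1))
                else:  # completer: advance items waiting for lhs at position k
                    for k2, lhs2, rhs2, dot2 in list(chart[k]):
                        if dot2 < len(rhs2) and rhs2[dot2] == lhs:
                            new = (k2, lhs2, rhs2, dot2 + 1)
                            if new not in chart[i]:
                                chart[i].add(new); changed = True
    return (0, "S'", (start,), 1) in chart[n]

# Toy grammar over (, |, ): S -> (S)S | |S | empty
rules = {"S": [("(", "S", ")", "S"), ("|", "S"), ()]}
```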

If we want to determine the most probable derivation of a word with respect to a SCFG (either length-dependent or not) we need to keep track of the probabilities of partial derivations. This can simply be done by adding them to the items as an additional parameter.

The initialisation then adds (0 : 0, S′ → • S, 1) and the operations become:

Scanner: If ∃ (i : k, X → λ₁ • aλ₂, p) with a ∈ T and w_{i+1} = a, add (i + 1 : k, X → λ₁a • λ₂, p).

Predictor: If ∃ (i : k, X → λ₁ • Aλ₂, p) and A → ν ∈ R, add (i : i, A → • ν, 1).

Completer: If ∃ (i : j, A → ν •, p₁) and (j : k, X → λ₁ • Aλ₂, p₂), and ∄ (i : k, X → λ₁A • λ₂, p′) with p′ ≥ p₁ · p₂ · P_{ν,i−j}, where P_{ν,i−j} denotes the probability of the rule A → ν for length i − j, add (i : k, X → λ₁A • λ₂, p₁ · p₂ · P_{ν,i−j}), replacing any existing item (i : k, X → λ₁A • λ₂, p′) of smaller probability.

The modifications of scanner and predictor are straightforward. Since choosing the most probable sequence of steps for each partial derivation will lead to the most probable derivation overall, the completer maximises the overall probability by choosing the most probable alternative, whenever there are multiple possibilities for generating a subword.

If an item (n : 0, S′ → S •, p) is generated, then p · P_ℓ(n) is the probability of the most probable derivation of w, and the derivation itself can be reconstructed from the items that led to this final item.
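The core of the length-dependent completer, the only place where rule probabilities enter, can be sketched as follows (function and argument names are ours):

```python
# Sketch of the length-dependent completer step: when a derivation of A
# spanning positions j..i is completed, the rule probability is looked up
# with length i - j (only now known), and an existing item is replaced
# only if the new partial derivation is more probable (Viterbi maximisation).

def complete(best, i, j, k, head_item, p1, p2, rule_prob_at):
    """best: dict mapping (i, k, head_item) -> probability of the best
    partial derivation found so far; rule_prob_at(l) -> P(A -> nu, l)."""
    p = p1 * p2 * rule_prob_at(i - j)
    key = (i, k, head_item)
    if best.get(key, 0.0) < p:
        best[key] = p
        return True   # item added or improved
    return False      # a more probable alternative already exists
```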

For a more detailed introduction of probabilistic Earley parsing as well as a proof of correctness and hints on efficient implementation see ([

Differing from ([

In order to see if adding length-dependency actually improves the quality of the predictions of RNA secondary structures from stochastic context-free grammars, we used length-dependent and traditional versions of four different grammars to predict two sets of RNA molecules for which the correct secondary structure is already known. Both sets were split into a training set, which was used to train the grammars, and a benchmark set, for which secondary structures were predicted using the trained grammars. We then compared these predicted structures to the structures from the database, computing two commonly used criteria to measure the quality:

Both frequencies were computed over the complete set (instead of calculating individual scores for each molecule and taking the average of these).
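A pooled computation of the two measures can be sketched as follows; we assume here that both are taken over base pairs, sensitivity being the fraction of reference pairs that were predicted and specificity the fraction of predicted pairs that are correct (the paper's exact definitions may differ in detail):

```python
# Sketch: quality measures pooled over the whole benchmark set rather than
# averaged per molecule. Base pairs are represented as (i, j) index pairs.

def pooled_scores(molecules):
    """molecules: list of (predicted_pairs, reference_pairs) sets.
    Returns (sensitivity, specificity) over the complete set."""
    tp = sum(len(pred & ref) for pred, ref in molecules)
    n_pred = sum(len(pred) for pred, _ in molecules)
    n_ref = sum(len(ref) for _, ref in molecules)
    return tp / n_ref, tp / n_pred
```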

In [

Their training set consists of 139 large and 139 small subunit rRNAs; the benchmark dataset contains 225 RNase Ps, 81 SRPs and 97 tmRNAs. Both sets are available from

Additionally we wanted to see if length-dependent prediction can further improve the prediction quality for tRNA which is already predicted well by conventional SCFGs.

In order to do so we took the tRNA database from [

We used 4 different grammars for our experiments:

In each of the grammars

As we have stated before, the grammars are to generate the primary structures as words and encode the secondary structures in the derivation. Thus we have to distinguish (in the grammar) between terminal symbols representing unpaired bases and terminal symbols that are part of a base pair. In the above grammars the former are denoted by the symbol

Now simply introducing a rule for each combination of how the symbols

In order to include this idea without having to extend the theory in Section 3 we left

G1 and G2 have been taken from [

G4 has been taken from [

As stated in Section 2, we implemented length-dependency by grouping the lengths into intervals; the rule probabilities change only from one interval to the next, not within an interval.

Since the influence a change in length has on the probabilities most likely depends on the relative change rather than the absolute one, we decided to make the intervals longer as the subwords considered get longer. This also helps to keep the estimated probabilities accurate since naturally any training set will contain fewer data points per length as the length increases.

Aside from this consideration and the restrictions implied by Definition 5 (consistency of a set of intervals) there is no obvious criterion that helps with deciding on a set of intervals.

Thus we created several different sets ranging from approximately 10 to approximately 100 intervals, evaluating a subset of the prediction data with each of them. The results corresponded to our expectation that finer intervals would tend to improve the quality of prediction until at some point the amount of data available to estimate each probability would be too sparse and thus the quality would degrade. The following set of intervals yielded the best or close to the best results for all 4 grammars:

Lengths up to 40 were not grouped at all (
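Such a scheme of growing intervals can be sketched as follows; the widths beyond 40 (doubling at each step) are our own choice for illustration, not the exact set used in the experiments:

```python
# Sketch of an interval scheme in the spirit described above: lengths up
# to a cutoff each form their own interval, after which interval widths
# grow so the relative resolution stays roughly constant.

def interval_of(length, cutoff=40):
    """Map a subword length to its interval index (0-based)."""
    if length <= cutoff:
        return length - 1          # one interval per length: indices 0..39
    # beyond the cutoff, double the width of each successive interval
    idx, lo, width = cutoff, cutoff, cutoff
    while length > lo + width:
        lo += width
        width *= 2
        idx += 1
    return idx
```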

At first glance it may seem surprising that the same number of intervals can be used on G4 as on the smaller grammars, given its greater number of rules and thus the greater number of probabilities that have to be trained. However, a closer look shows that for all intermediate symbols in G4 except

We did the training and prediction using a length-dependent version of the Earley-style-parser from [

Looking at these results, it is immediately apparent that the predictions from G1 are too poor to be of any use, with or without lengths. Additionally, the higher complexity of G4 compared to G2 and G3 did not lead to an improvement in prediction quality.

Concerning the other grammars we note that adding length-dependency significantly improved the results on tRNA while they became worse on the mixed set.

A possible explanation for these results could be that the correct parameters for the folding differ between different types of RNA. In order to test this hypothesis, we took the three types of RNA in the mixed benchmark set and split the molecules of each type into a training set containing 2/3 of them and a benchmark set containing the remaining molecules. On these three sets we again did the training and prediction for the grammars G2 and G3, these being the most likely candidates for future use. The results are listed in

For each of the sets, G3 with lengths performed best, supporting our hypothesis. Additionally, while the non-length-dependent versions of the two grammars performed almost equally, the length-dependent version of G2 fell significantly behind G3, indicating that length-dependent prediction is more sensitive to the choice of grammar.

The considerations in Note 1 lead to the assumption that both versions should take about the same time on a given grammar. This was confirmed during our experiments, with none of the versions being consistently faster,

Concerning the different grammars, predictions using G1 were faster than those for G2 by a factor of ∼ 1.5. Between G2 and G3 resp. G3 and G4 the factor was ∼ 2.

We introduced an extension to the concept of stochastic context-free grammars that allows the probabilities of the productions to depend on the length of the generated subword.

Furthermore, we showed that existing algorithms that work on stochastic context-free grammars, such as training algorithms or algorithms determining the most likely parse tree, can easily be adapted to the new concept without significantly affecting their run-time or memory consumption.

Using the LSCFGs to predict the secondary structure of RNA molecules we found that if training and prediction are done on the same type of RNA, the grammar G3 with lengths always outperformed all of the non-length-dependent grammars we tested.

These results indicate that LSCFGs are indeed capable of giving better predictions than classic SCFGs. However further experiments will be needed to confirm these initial results on other data sets and determine good choices for grammar and length intervals.

While our extension to the concept of stochastic context-free grammars stemmed from one specific application, it is not application-specific. The concepts and methods presented in Sections 2-4 can immediately be applied to any other setting where SCFGs are used as a model, e.g., natural language processing. From our limited insight into that field, it appears possible that length-dependent grammars can be applied successfully there as well.

In addition to the applications, extending the concept of context-free grammars also gives rise to interesting questions in the field of formal language theory. The most obvious of these is whether adding length-dependencies changes the class of languages that can be generated. We have already been able to show that it does (for a simple example, take a grammar whose length-dependent version generates the non-context-free language {a^{n²} ∣ n ≥ 1}).

Example of an RNA secondary structure. Letters represent bases, the colored band marks the phosphodiester bonds, short edges mark hydrogen bonds. (The different colors only serve to identify the corresponding parts in the formal-language representation below.)

Grammar performance, given as sensitivity % (specificity %) rounded to full percent.

 | mixed set (without lengths) | mixed set (with lengths) | tRNA (without lengths) | tRNA (with lengths)
---|---|---|---|---
G1 | 2 (2) | 2 (3) | 6 (6) | 6 (20)
G2 | 48 (45) | 35 (27) | 80 (80) | 96 (79)
G3 | 40 (48) | 40 (41) | 78 (81) | 95 (96)
G4 | 39 (47) | 22 (54) | 78 (83) | 84 (95)

Grammar performance on the three single-type sets (one column pair per type), given as sensitivity % (specificity %) rounded to full percent.

 | without lengths | with lengths | without lengths | with lengths | without lengths | with lengths
---|---|---|---|---|---|---
G2 | 50 (47) | 46 (36) | 57 (52) | 42 (34) | 39 (36) | 41 (23)
G3 | 47 (45) | 52 (53) | 57 (52) | 59 (54) | 38 (36) | 45 (59)

We would like to thank the anonymous referees of the previous revisions for their helpful suggestions.