1. Introduction
In computational linguistics, the modeling of discontinuous structures in natural language, i.e., of structures that span two or more non-adjacent portions of the input string, is an important issue. In recent years, Linear Context-Free Rewriting System (LCFRS) [
1] has emerged as a formalism which is useful for this task. LCFRS is a mildly context-sensitive [
2] extension of CFG in which a single non-terminal can cover $k \geq 1$ continuous blocks of terminals. CFG is a special case of LCFRS where $k = 1$. In CFG, only embedded structures can be modelled. In LCFRS, in contrast, yields can be intertwined. The schematically depicted derivation trees in
Figure 1 illustrate the different domains of locality of CFG and LCFRS. As can be seen, in the schematic CFG derivation,
X contributes a continuous component
β to the full yield. In the schematic LCFRS derivation, however,
X contributes three non-adjacent components to the full yield. Intuitively speaking, the difference between LCFRS and CFG productions is that the former allow for several arguments on their non-terminals. Each argument spans a continuous part of the input string, i.e., an argument boundary denotes a discontinuity in the yield of the non-terminal. The production itself specifies how the lefthand side non-terminal yield is built from the yields of the righthand side non-terminals. In the example, the
X node in the schematic LCFRS tree would be derived from a production which, on its lefthand side, has the non-terminal X with three arguments. Note that CFG non-terminals can consequently be seen as LCFRS non-terminals with an implicit single argument.
LCFRS has been employed in various subareas of computational linguistics. Among others, it has been used for syntactic modeling. In syntax, discontinuous structures arise when a displacement occurs within the sentence, such as, e.g., with topicalization. Constructions causing discontinuities are frequent: In the German TiGer treebank [
3], about a fourth of all sentences contains at least one instance of discontinuity [
4]; in the English Penn Treebank (PTB) [
5], this holds for about a fifth of all sentences [
6]. Below, three examples of discontinuities are shown. In (1), the discontinuity is due to the fact that a part of the VP is moved to the front. In (2), the discontinuity comes from the topicalization of the pronoun
Darüber. (3), finally, is an example of a discontinuity which has several parts. It is due to a topicalization of the prepositional phrase
Ohne internationalen Schaden, and
scrambling [
7] of elements after the verb.
- (1)
Selbst besucht hat er ihn nie
Personally visited has he him never
“He has never visited him personally”
- (2)
Darüber muss nachgedacht werden
About it must thought be
“One must think about that”
- (3)
Ohne internationalen Schaden könne sich Bonn von dem Denkmal nicht distanzieren
Without international damage could itself Bonn from the monument not distance
“Bonn could not distance itself from the monument without international damage.”
The syntactic annotation in treebanks must account for discontinuity. In a constituency framework, this can be done by allowing crossing branches and grouping all parts of a discontinuous constituent under a single node. Similarly in dependency syntax [
8], words must be connected such that edges end up crossing. As an example,
Figure 2 shows both the constituency and dependency annotations of (2). While the discontinuous yields inhibit an interpretation of either structure as a CFG derivation tree, both can immediately be modeled as the derivation of an LCFRS [
4,
9].
In grammar engineering, LCFRS and related formalisms have been employed in two ways.
Grammatical Framework (GF) is actively used for multilingual grammar development and allows for an easy treatment of discontinuities, see [
11] for details. GF is an extension of Parallel Multiple Context-Free Grammar, which itself is a generalization of LCFRS [
12]. TuLiPA [
13] is a multi-formalism parser used in a development environment for variants of Tree Adjoining Grammar (TAG). It exploits the fact that LCFRS is a mildly context-sensitive formalism with high expressivity. LCFRS acts as a pivot formalism, i.e., instead of parsing directly with a TAG instance, TuLiPA parses with its equivalent LCFRS, obtained through a suitable grammar transformation [
14].
LCFRS has also been used for the modeling of non-concatenative morphology, i.e., for the description of discontinuous phenomena below word level, such as stem derivation in Semitic languages. In such languages, words are derived by combining a discontinuous root with a discontinuous template. In Arabic, for instance, combining the root
k-t-b with the template
i-a results in
kitab (“book”), combining it with
a-i results in
katib (“writer”), and combining it with
ma-∅
-a results in
maktab (“desk”). The authors of [
15,
16] use LCFRS for the modeling of such processes. Finally, in machine translation, suitable versions of LCFRS, or equivalent formalisms, have been proposed for the modeling of translational equivalence [
17,
18]. They offer the advantage that certain alignment configurations that cannot be modeled with synchronous variants of CFG can be induced by LCFRS [
19]. As an example, see
Figure 3, which shows an English sentence with its French translation, along with a grammar that models the alignment between both.
Parsing is a key task since it serves as a backend in many practical applications such as the ones mentioned above. The parsing of LCFRS has received attention both on the symbolic and on the probabilistic side. Symbolic parsing strategies, such as CYK and Earley variants have been presented [
20,
21,
22], as well as automaton-based parsing [
23] and data-driven probabilistic parsing techniques [
24,
25,
26]. To our knowledge, however, no LR strategy for LCFRS has so far been presented in the literature. LR parsing is an incremental shift-reduce parsing strategy in which the transitions between parser states are guided by an automaton which is compiled offline. LR parsers were first introduced for deterministic context-free languages [
27] and later generalized to context-free languages [
28,
29] and tree-adjoining languages [
30,
LR parsing algorithms also exist for conjunctive grammars and Boolean grammars [
32,
33,
34]. In this paper, we present an LR-style parser for LCFRS, extending our earlier work [
35]. Our parsing strategy is based on the incremental parsing strategy implemented by Thread Automata [
23].
The remainder of the article is structured as follows. In the following section, we introduce LCFRS and thread automata.
Section 3 introduces LR parsing for the context-free case.
Section 4 presents the LR algorithm for LCFRS along with an example. In particular,
Section 4.1 gives the intuition behind the algorithm,
Section 4.2 introduces the algorithms for automaton and parse table constructions, and
Section 4.3 presents the parsing algorithm.
Section 5 concludes the article.
2. Preliminaries
2.1. Linear Context-Free Rewriting Systems
We now introduce Linear Context-Free Rewriting Systems (LCFRS). In LCFRS, a single non-terminal can span $k \geq 1$ continuous blocks of a string. A CFG is simply a special case of an LCFRS in which $k = 1$. We notate LCFRS with the syntax of Simple Range Concatenation Grammars (SRCG) [
36], a formalism equivalent to LCFRS. Note that we restrict ourselves to the commonly used string rewriting version of LCFRS and omit the more general definition of LCFRS of [
37]. Furthermore, we assume that our LCFRSs are
monotone (see condition 3 below) and
ε-free (we can make this assumption without loss of generality, see [
20,
38]).
Definition 1 (LCFRS). A Linear Context-Free Rewriting System (LCFRS) [1,20] is a tuple $G = \langle N, T, V, P, S\rangle$ where N is a finite set of non-terminals with a function dim: $N \to \mathbb{N}$ determining the fan-out of each $A \in N$;
T and V are disjoint finite sets of terminals and variables;
$S \in N$ is the start symbol with $\dim(S) = 1$; and
P is a finite set of rewriting rules (or productions). For all $\gamma \in P$, the following holds.
1. γ has the form $A(\alpha_1, \ldots, \alpha_{\dim(A)}) \to A_1(X^{(1)}_1, \ldots, X^{(1)}_{\dim(A_1)}) \cdots A_m(X^{(m)}_1, \ldots, X^{(m)}_{\dim(A_m)})$ where $m \geq 0$, m being the rank of γ; $A_i \in N$ for $1 \leq i \leq m$; $X^{(i)}_j \in V$ for $1 \leq i \leq m$ and $1 \leq j \leq \dim(A_i)$; and $\alpha_i \in (T \cup V)^*$ for $1 \leq i \leq \dim(A)$; all $\alpha_i$ and $X^{(i)}_j$ are called arguments or components. γ may be abridged as $A(\vec{\alpha}) \to A_1(\vec{X}^{(1)}) \cdots A_m(\vec{X}^{(m)})$; and we define $\mathrm{lhs}(\gamma) = A$.
2. For all $x \in V$ there is at most one occurrence of x on each lefthand side and righthand side of γ; furthermore, there exists a lefthand side occurrence iff there exists a righthand side occurrence.
3. Variable occurrences in γ are ordered by a strict total order ≺ which is such that for all $X^{(i)}_j, X^{(i)}_k$ with occurrences in γ, if $X^{(i)}_j$ precedes $X^{(i)}_k$ in $\vec{X}^{(i)}$, $1 \leq i \leq m$, then $X^{(i)}_j$ also precedes $X^{(i)}_k$ in $\vec{\alpha}$.
The rank of G is given by the maximal rank of its productions and written as $\mathrm{rank}(G)$. G is of fan-out k if $k = \max\{\dim(A) \mid A \in N\}$ and is then written as a k-LCFRS. The set of component non-terminals is defined as $N_G = \{A_i \mid A \in N, 1 \leq i \leq \dim(A)\}$ (subscript G can be omitted if clear from the context).
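To make the definition concrete, the following sketch shows one possible Python encoding of LCFRS productions. The class and field names are ours, chosen purely for illustration; the paper itself does not prescribe any implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Production:
    """One LCFRS rule A(alpha_1, ..., alpha_dim(A)) -> A_1(...) ... A_m(...)."""
    lhs: str                                         # non-terminal A
    lhs_args: Tuple[Tuple[str, ...], ...]            # dim(A) argument tuples
    rhs: Tuple[Tuple[str, Tuple[str, ...]], ...]     # (A_l, variables of A_l)

    @property
    def rank(self) -> int:          # m, the number of righthand side elements
        return len(self.rhs)

    @property
    def fanout(self) -> int:        # dim(A), the number of lhs arguments
        return len(self.lhs_args)

# A hypothetical rule A(aXb, cY) -> B(X, Y); variables carry a '?' prefix
# as a purely illustrative convention to keep them apart from terminals.
gamma = Production(
    lhs="A",
    lhs_args=(("a", "?X", "b"), ("c", "?Y")),
    rhs=(("B", ("?X", "?Y")),),
)
assert gamma.rank == 1 and gamma.fanout == 2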
The LCFRS derivation is based on the instantiation of rules, the replacement of variable occurrences by strings of terminals.
Definition 2 (Instantiation). An instantiation of a production $\gamma \in P$ is given by a function $f: V_\gamma \to T^+$, where $V_\gamma$ is the set of all $x \in V$ occurring in γ. γ is instantiated if all variable occurrences x have been replaced by $f(x)$. A non-terminal in an instantiated production is an instantiated non-terminal.
We now describe the LCFRS derivation process. A (partial) derivation is specified as a pair $\langle A(\vec{\tau}), D\rangle$ called item. Thereby, $A(\vec{\tau})$ specifies the yield and D is the derivation tree. We use a bracketed representation of trees, e.g., $\gamma_0(\gamma_1\,\gamma_2)$ is a tree with a root node labeled $\gamma_0$ that has two daughters, a left daughter labeled $\gamma_1$ and a right daughter labeled $\gamma_2$.
Definition 3 (Derivation). Let $G = \langle N, T, V, P, S\rangle$ be an LCFRS. The set of derivable items with respect to G contains all and only those items that can be deduced with the following deduction rules (the subscript G on $\vdash$ may be omitted if clear from the context):
$\dfrac{}{\langle A(\vec{\tau}), \gamma\rangle}$ where $A(\vec{\tau}) \to \varepsilon$ is an instantiation of some rank-0 production $\gamma \in P$;
$\dfrac{\langle A_1(\vec{\tau}_1), D_1\rangle \;\cdots\; \langle A_m(\vec{\tau}_m), D_m\rangle}{\langle A(\vec{\tau}), \gamma(D_1 \cdots D_m)\rangle}$ where $A(\vec{\tau}) \to A_1(\vec{\tau}_1) \cdots A_m(\vec{\tau}_m)$ is an instantiation of some production $\gamma \in P$ of rank $m \geq 1$.
We can now define the string language of an LCFRS G on the basis of the yield of a non-terminal.
Definition 4 (Yield, String Language, and Tree Language). Let $G = \langle N, T, V, P, S\rangle$ be an LCFRS.
For every $A \in N$ we define $\mathrm{yield}(A) = \{\vec{\tau} \mid \langle A(\vec{\tau}), D\rangle \text{ is a derivable item for some derivation tree } D\}$.
The string language of G is defined as $L(G) = \{w \mid \langle w\rangle \in \mathrm{yield}(S)\}$.
The derivation tree language of G is defined as $D(G) = \{D \mid \langle S(w), D\rangle \text{ is a derivable item for some } w \in T^*\}$.
Note that this definition of a tree language captures the set of derivation trees, not the derived trees with crossing branches along the lines of
Figure 1 and
Figure 2.
Example 1 (LCFRS Derivation). Figure 4 shows an LCFRS that generates the language , , together with a derivation of the string . The derivation works as follows. First, the leaf nodes of the derivation tree are produced using the first deduction rule with the productions and . They introduce the first three as and the two bs. In order to insert three more as, we use the second deduction rule with production . This leads to a derivation tree with a root labeled that has the node as its single daughter. Finally the as and the bs are combined with the second deduction rule and production . This leads to a derivation tree with root label that has two daughters, the first one is the -derivation tree and the second one the node. The order of the daughters reflects the order of elements in the righthand sides of productions. Finally, we define some additional notation that we will need later: Let be a production.
$\mathrm{lhs}(\gamma)$ gives A, $\mathrm{lhs}(\gamma, i)$ gives the ith lefthand side argument and $\mathrm{lhs}(\gamma, i, j)$ its jth symbol; $\mathrm{rhs}(\gamma, l)$ gives the lth righthand side non-terminal, and $\mathrm{rhs}(\gamma, l, k)$ gives the kth component of the lth RHS element. Thereby, i, j and k start with index 0, and l starts with index 1. These functions have value ⊥ whenever there is no such element.
In the sense of dotted productions, we define a set of symbols denoting computation points of γ: we write $\gamma_{i.j}$ for γ with a dot preceding the jth symbol of the ith lefthand side argument, for all i, j for which this position exists (including the position directly after the last symbol of an argument).
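Under the hypothetical Production encoding of the earlier listing, these accessor functions could be sketched as follows; None plays the role of ⊥.

# Accessors over the Production class above; indices i, j, k are 0-based,
# l is 1-based, and None stands for the undefined value.
def lhs(gamma, i=None, j=None):
    if i is None:
        return gamma.lhs                       # lhs(gamma) = A
    if not 0 <= i < gamma.fanout:
        return None
    if j is None:
        return gamma.lhs_args[i]               # the ith lhs argument
    arg = gamma.lhs_args[i]
    return arg[j] if 0 <= j < len(arg) else None

def rhs(gamma, l, k=None):
    if not 1 <= l <= gamma.rank:
        return None
    nt, variables = gamma.rhs[l - 1]
    if k is None:
        return nt                              # rhs(gamma, l) = A_l
    return variables[k] if 0 <= k < len(variables) else None

assert lhs(gamma, 0, 1) == "?X" and rhs(gamma, 1) == "B"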
2.2. Thread Automata
Thread automata (TA) [
23] are a generic automaton model which can be parametrized to recognize different mildly context-sensitive languages. The TA for LCFRS (LCFRS-TA) implements a prefix-valid top-down incremental parsing strategy similar to the ones of [
21,
22]. In this paper, we restrict ourselves to introducing this type of TA.
As an example, let us consider the derivation of a word with the LCFRS from
Figure 5;
Figure 6 shows a sample run of a corresponding TA where only the successful configurations are given. We have one thread for each rule instantiation that is used in a parse. In our case, the derivation uses the rules α, β and γ once each; we therefore have one thread for the
α-rule, one for
β and one for
γ. The development of these threads is given in the three columns of
Figure 6.
Each thread contains at each moment of the TA run only a single symbol, either a non-terminal or a dotted production. The TA starts with the α-thread containing the start symbol. In a predict step, this is replaced with the dotted production corresponding to starting the α-rule. Now the dot precedes the first variable of the non-terminal A in the righthand side. We therefore start a new A-thread (we “call” a new A). In this thread, rule β is predicted, and the dot is moved over the terminal a in a scan step, which advances the content of the thread. Now, since the dot again precedes the first component of a righthand side non-terminal A, we call a new A-thread, which predicts γ this time. A scan allows us to complete the first component of the γ-rule. Here, since the dot is at the end of a component, i.e., precedes a gap, the γ-thread is suspended and its mother thread, i.e., the one that has called the γ-thread, becomes the active thread with the dot now moved over the variable for the first γ-component. Similarly, a second suspend operation leads to the α-thread becoming the active thread again with the dot moved over the finished component of the righthand side A. The dot now precedes the variable of the second component of this A, therefore the β-thread is resumed with the dot moving on to the second component. Similarly, a second resume operation leads to the γ-thread becoming active. The subsequent scan yields a completed A-item that can be published, which means that the mother thread becomes active with the dot moved over the variable for the last righthand side component. After another scan, again, we have a completed item and the α-thread becomes active with a content that signifies that we have found a complete S-yield. Since the entire input has been processed, the TA run was successful and the input is part of the language accepted by the TA.
More generally, an LCFRS-TA for some LCFRS $G = \langle N, T, V, P, S\rangle$ works as follows. As we have seen, the processing of a single instantiated rule is handled by a single thread which will traverse the lefthand side arguments of the rule. A thread is given by a pair $\langle p, c\rangle$, where $p \in \{1, \ldots, m\}^*$ with m the rank of G is the address, and c is the content of the thread. In our example in
Figure 6, addresses were left aside. They are, however, crucial for determining the mother thread to go back to in a suspend step. In
Figure 6, the α-thread has address ε, the β-thread address 1 and the γ-thread address 11, i.e., the address of the mother thread is a prefix of the address of the child thread, respectively. This is why a thread knows where to go back to in a suspend step. An automaton state is given by a tuple $\langle \mathcal{T}, p, i\rangle$ where $\mathcal{T}$ is a set of threads, the thread store, p is the address of the active thread, and i indicates that i tokens have been recognized. We use $\langle \{\langle \varepsilon, S\rangle\}, \varepsilon, 0\rangle$ as start state.
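As a rough illustration, a TA configuration can be encoded as follows. This is a sketch under the assumption that thread contents are either non-terminals or computation points encoded as (rule name, component index, dot position) triples; all names are ours.

from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

CompPoint = Tuple[str, int, int]     # (rule name, component index, dot position)
Content = Union[str, CompPoint]      # a non-terminal or a computation point

@dataclass
class TAConfiguration:
    threads: Dict[str, Content] = field(default_factory=dict)  # address -> content
    active: str = ""                 # address p of the active thread
    recognized: int = 0              # i tokens of the input have been read

# The start state: one thread at address epsilon holding the start symbol S.
start = TAConfiguration(threads={"": "S"}, active="", recognized=0)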
Definition 5 (LCFRS-TA). Let be an LCFRS. The LCFRS thread automaton for
G is a tuple with;
δ is a function from to such that if there is an l such that , and if ; and
Θ is a finite set of transitions as defined below.
Intuitively, a δ value j tells us that the next symbol to process is a variable that is an argument of the j-th righthand side non-terminal.
Definition 6 (Transitions of an LCFRS-TA). The transition set Θ
of an LCFRS-TA as in Definition 5 contains the following transitions:
Call: if and where .
Predict: if .
Scan: if .
Publish: if and .
Suspend:
- (i)
if .
- (ii)
if , , and .
Resume: if , and .
These are all transitions in Θ.
A transition roughly indicates that in the current thread store, α can be replaced with β while scanning a. Square brackets in α and β indicate parts that do not belong to the active thread. Call transitions start a new thread for a daughter non-terminal, i.e., a righthand side non-terminal. They move down in the parse tree. Predict transitions predict a new rule for a non-terminal A. Scan transitions read a lefthand side terminal while scanning the next input symbol. Publish marks the completion of a production, i.e., its full recognition. Suspend transitions suspend a daughter thread and resume the parent, i.e., move up in the parse tree. There are two cases: Either (i) the daughter is completely recognized or (ii) the daughter is not yet completely recognized, i.e., we have only finished one of its components. Resume transitions resume an already present daughter thread, i.e., move down into some daughter that has already been partly recognized.
Figure 7 shows the transitions for our sample grammar. Let us explain the semantics of these rules with these examples before moving to the general deduction rules for TA configurations. As already mentioned, a rule
with
and
expresses that with
being the content of the active thread, we can replace
with
while scanning
t. The predict, scan and publish rules are instances of this rule type.
in square brackets to the left characterize the content of the mother of the active thread while
in square brackets to the right characterize the content of a daughter thread where the address of the daughter depends on the one of the active thread and the
δ-value of the content of the active thread. Consider for instance the last suspend rule. It signifies that if the content of the active thread is
and its mother thread has content
, this mother thread becomes the new active thread with its content changed to
while the daughter thread does not change content. The resume rules, in contrast to this, move from the active thread to one of its daughters. Take for instance the last rule. It expresses that, if the active thread has content
and if this thread has a
th daughter thread with content
, we make this daughter thread the active one and change its content to
, i.e., move its dot over a gap.
This is not exactly the TA for LCFRS proposed in [
23] but rather the one from [
39], which is close to the Earley parser from [
21].
The set of configurations for a given input
is then defined by the deduction rules in
Figure 8 (the use of set union
in these rules assumes that the sets involved are disjoint). The accepting state of the automaton for some input
w is
.
3. LR Parsing
In an LR parser (see [
40], chapter 9), the parser actions are guided by an automaton, or rather a
parse table, which is compiled offline. Consider the context-free case. An LR parser for CFG is a guided shift-reduce parser, in which we first build the LR automaton. Its states are sets of dotted productions closed under prediction, and its transitions correspond to having recognized a part of the input, e.g., to moving the dot over a righthand side element after having scanned a terminal or recognized a non-terminal.
Consider for instance the CFG
. Its LR automaton is given in
Figure 9. The construction of this automaton starts from the prediction closure of
, which contains the two
S-productions with the dot at the beginning of the righthand side. The prediction closure of a set of dotted productions q is defined as follows: if $A \to \alpha \bullet B\beta \in q$, then $B \to \bullet\gamma \in q$ for all B-productions $B \to \gamma$ in the grammar. From a state q, a new state can be obtained by computing all items that can be reached by moving the dot over a certain symbol x in any of the dotted productions in q and then, again, building the prediction closure. The new state can be reached from the old one by an edge labeled with x. This process is repeated until all possible states have been constructed. In our example, when being in the initial state, we can for instance reach a new state by moving the dot over an a, and we can reach the acceptance state by moving the dot over an S.
This automaton guides the shift-reduce parsing in the following way: At every moment, the parser is in one of the states. We can shift an a if there is an outgoing a-transition. The state changes according to the transition. We can reduce if the state contains a completed production $A \to \alpha\bullet$. We then reduce with this production. The new state is obtained by following the A-transition, starting from the state reached before processing that rule.
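The prediction closure and the transition function just described can be sketched as follows for the context-free case. The example grammar here (S → aSb | ab) is a stand-in of our own, not the grammar of Figure 9.

# A stand-in CFG: rule 0 is S -> a S b, rule 1 is S -> a b.
GRAMMAR = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

def closure(items):
    """Prediction closure of a set of dotted productions (rule, dot)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (r, d) in list(items):
            rhs = GRAMMAR[r][1]
            if d < len(rhs):                         # dot precedes rhs[d]
                for i, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[d] and (i, 0) not in items:
                        items.add((i, 0))            # predict B -> . gamma
                        changed = True
    return frozenset(items)

def goto(state, x):
    """Move the dot over x in all items of state, then close again."""
    moved = {(r, d + 1) for (r, d) in state
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == x}
    return closure(moved) if moved else None

q0 = closure({(0, 0), (1, 0)})
q1 = goto(q0, "a")       # both S-rules with the dot moved over the initial a
assert (0, 1) in q1 and (1, 1) in q1 and (0, 0) in q1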
The information about possible shift and reduce operations in a state is precompiled into a parse table: Given an LR automaton with n states, we build a parse table with n rows. Each row i describes the possible parser actions associated with the state $q_i$, i.e., for each state and each possible shift or reduce operation, it tells us in which state to go after the operation. The table has $|T| + 1 + |N|$ columns: the first $|T|$ are indexed with the terminals, then follows a column for reduce operations and then a column for each of the non-terminals. For every pair of states $q_i$, $q_j$ that are linked by an a-edge for some $a \in T$, the field of $q_i$ and a contains s j, which indicates that in $q_i$, one can shift an a and then move to $q_j$. For every state $q_i$ that contains rule number j with a dot at the end of the righthand side, we add r j to the reduce field of $q_i$. Finally, for every pair of states $q_i$, $q_j$ that are linked by an A-edge for some $A \in N$, the field of $q_i$ and A contains j, which indicates that in $q_i$, after having completed an A, we can move to $q_j$. On the left of
Figure 10, we see the parse table resulting from the automaton in
Figure 9. The first $|T| + 1$ columns are the so-called action-part of the table while the last $|N|$ columns form the goto-part.
The parse table determines the possible reductions and shifts we can do at every moment. We start parsing with a stack containing only the initial state $q_0$. In a shift, we push the new terminal followed by the state indicated in the table onto the stack. In a reduction, we use the production indicated in the table. We pop its righthand side (in reverse order) and push its lefthand side onto the stack. The new state is the state indicated in the parse table field for the state preceding the lefthand side of this production on the stack and the lefthand side category of the production. On the right of
Figure 10, we see a sample trace of a parse according to our LR parse table. Note that this example allowed for deterministic LR parsing, which is not the case in general. There might be shift-reduce or reduce-reduce conflicts.
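The shift-reduce loop driven by such a table can be sketched as follows. Both the toy grammar (S → aSb | ab) and its hand-built tables are our own illustrations, not the table of Figure 10; the encoding of the actions is an assumption.

# Toy grammar again: rule 0 is S -> a S b, rule 1 is S -> a b.
GRAMMAR = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

# Hand-built tables; ("s", q) = shift to q, ("r", j) = reduce with rule j,
# ("acc", None) = accept. GOTO[q][A] is the state after completing an A.
ACTION = {
    0: {"a": ("s", 1)},
    1: {"a": ("s", 1), "b": ("s", 3)},
    2: {"$": ("acc", None)},
    3: {"b": ("r", 1), "$": ("r", 1)},
    4: {"b": ("s", 5)},
    5: {"b": ("r", 0), "$": ("r", 0)},
}
GOTO = {0: {"S": 2}, 1: {"S": 4}}

def lr_parse(word):
    stack = [0]                       # start with the initial state q0
    tokens = list(word) + ["$"]
    i = 0
    while True:
        entry = ACTION.get(stack[-1], {}).get(tokens[i])
        if entry is None:
            return False
        kind, arg = entry
        if kind == "acc":
            return True
        if kind == "s":               # shift: push terminal, then new state
            stack += [tokens[i], arg]
            i += 1
        else:                         # reduce with rule arg
            lhs, rhs = GRAMMAR[arg]
            del stack[len(stack) - 2 * len(rhs):]    # pop rhs (symbol/state pairs)
            stack += [lhs, GOTO[stack[-1]][lhs]]     # goto from the uncovered state

assert lr_parse("aabb") and not lr_parse("aab")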
4. LR for LCFRS
4.1. Intuition
We will now adapt the ideas underlying LR parsing for CFG to the case of LCFRS. The starting point is the set of possible threads that can occur in a TA for a given LCFRS. Our LR automaton for LCFRS precompiles predict, call and resume steps into single states, i.e., states are closed under these three operations.
As an example, consider the TA from
Section 2.2.
Figure 11 sketches some of the states one obtains for this TA.
As in the CFG case, we start with a state for the start symbol S and, in this state, add all predictions of S-rules with the dot at the left of the lefthand side component. Furthermore, for every item in some state q where the dot precedes a variable: If this variable is the ith component of a righthand side non-terminal A, we add to q all A-rules with the dot at the beginning of the ith component in the lefthand side. For $i = 1$, this is a call-and-predict operation, for $i > 1$ a resume operation.
Moving the dot over a terminal symbol a leads to a transition labeled with a that yields a new state. And moving the dot over the variable that is the ith component of some righthand side symbol A leads to a transition labeled with .
Our parser then has shift operations and two kinds of reduce operations (the two types of suspend operations in the TA). Either we reduce the ith component of some A where $i < \dim(A)$, or we reduce the last component of some A. In the first case, we have to keep track of the components we have already found, in particular of their derivation tree addresses. This is why we have to extend the states sketched in
Figure 11 to using dotted productions with addresses.
Our predict/resume closure operations then have to add the index of the daughter that gets predicted/resumed to the address. A problem here is that this can lead to states containing infinitely many items. An example is
in
Figure 11. The dotted production
leads, via the resume closure, to new instances of
and
, this time with a different address since these are daughters of the first. Consequently,
would be the set
,
,
,
,
,
,
, ...}. Such a problem of an infinite loop of predict or resume steps occurs whenever there is a left recursion in one of the components of a rule. We do not want to exclude such recursions in general. Therefore, in order to keep the states finite nevertheless, we allow the addresses to be regular expressions. Then
for instance becomes
,
,
. Concerning the transitions, here as well, we add the addresses of the item(s) that have triggered that transition. See
Figure 12 for an example. A new state always starts with an item with address
ε.
Once the automaton is constructed, the shift-reduce parser proceeds as follows: A configuration of the parser consists of a stack, a set of completed components and the remaining input. The completed components are of the form where r is a regular expression, is a pointer to the derivation tree node for this use of the rule and i is the component index. The stack has the form where is a regular expression followed by a state and points at a derivation tree node for a rule γ with .
For the input for instance, we would start with a stack and then, following the corresponding transition in the automaton, we can shift an a and move to . Since the shift transition tells us that this shift refers to the daughter with address 1, this address is concatenated to the preceding address. The new stack would be and, after a second shift, . In we can not only shift as but we can also reduce since there is an item with the dot at the end of the first A-component, . Such a reduce means that we remove this component (here a single a) from the stack, then create a node as part of the set of completed components, and then push its pointer and index (here ). The transition from the state preceding the a that is labeled gives the new state and address. The stack is then , and the set of completed components is . also tells us that a first component of a rule, this time , has been completed. It consists of a followed by a first A-component, therefore we have to remove a and (in reverse order) from the stack and then follow the transition from . The result is a stack and a completed components set . The completed components are, furthermore, equipped with an immediate dominance relation. In this example, immediately dominates . Whenever a reduce state concerns a component that is not the first one, we check in our set of completed components that we have already found a preceding component with the same address or (if it is a regular expression denoting a set) a compatible address. In this case, we increment the component index of the corresponding derivation tree node. In the end, in case of a successful parse, we have a stack and a component set giving the entire derivation tree with root node .
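Schematically, such a parser configuration could be represented as follows; all names, and in particular the state labels, are placeholders of our own rather than the states of Figure 11 and Figure 12.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class LRConfiguration:
    # Stack interleaving scanned terminals and component pointers with
    # (address regular expression, state) pairs.
    stack: List[object] = field(default_factory=list)
    # Completed components: pointer -> (address regex, rule name, component).
    completed: Dict[int, Tuple[str, str, int]] = field(default_factory=dict)
    # Immediate dominance: (mother pointer, daughter index, daughter pointer).
    dominance: Set[Tuple[int, int, int]] = field(default_factory=set)
    recognized: int = 0              # index j up to which the input is read

# Schematically, after two shifts of a, the configuration might look like:
conf = LRConfiguration(
    stack=[("", "q0"), "a", ("1", "q2"), "a", ("11", "q2")],
    recognized=2,
)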
4.2. Automaton and Parse Table Construction
The states of the LR-automaton are sets of pairs $\langle r, c\rangle$ where r is a regular expression over $\{1, \ldots, \mathrm{rank}(G)\}$ and c is a computation point. They represent predict and resume closures.
Definition 7. We define the predict/resume closure
of some set by the following deduction rules:Initial predict:
Predict/resume:
This closure is not always finite. However, if it is not, we obtain a set of items that can be represented by a finite set of pairs plus possibly pairs $\langle r, c\rangle$ such that r is a regular expression denoting a set of possible addresses. As an example for such a case, see
in
Figure 12. Let
be the set of all regular expressions over
.
Lemma 8. For every , it holds that if is infinite, then there is a finite set such that .
Proof. is finite. For each
, the set of possible addresses it might be combined with in a state that is the closure of
is generated by the CFG
with
This is a regular grammar, its string language can thus be characterized by a regular expression.
The construction of the set of states starts with . From this state, following a-transitions or component non-terminal transitions as sketched above, we generate new states. We first formalize these transitions:
Definition 9. Let be an LCFRS. Let . For every , every and every , we define
- -
and there is some l such that and and
- -
.
For every and every , we define- -
and and
- -
.
The set of states of our automaton is then the closure of under the application of the -functions. The edges in our automaton correspond to -transitions, where each edge is labeled with the corresponding pair or respectively. Furthermore, we add a state and an edge labeled from the start state to this state, which is the acceptance state.
The complete automaton we obtain for the grammar in
Figure 5 is shown in
Figure 13. For the sake of readability, we use dotted productions instead of the computation points
.
The number of possible states is necessarily finite since each state is the closure of some set containing only items with address ε. There are only finitely many such sets.
In the parse table, our operations are for shifting some terminal a followed by the old address concatenated with p and state q and for reducing the ith component of rule α. The two reduce operations can be distinguished by the component indices. These operations are entered into the first columns of the table, the so-called action part. Furthermore, the goto-part of the table tells where to go when traversing a component edge and which address to add then.
The parse table can be read off the automaton as follows:
Definition 10. Let be an LCFRS, M its LR-automaton with state set Q. Then we define the corresponding LR parse table as a pair with a table and a table. The content of these tables is as follows: iff ;
iff there is some such that .
iff .
Nothing else is in these tables.
Figure 14 shows the parse table for our example.
4.3. Parsing Algorithm
The parser has slightly more complex configurations than the CFG shift-reduce parser. Let
Q be the set of states of the LR automaton for a given LCFRS
G,
the state containing
and
the
accepting state containing
. Furthermore, let
be the set of possible pointers to derivation tree nodes. The parser configurations are then triples
consisting of
a stack ,
a set of constraints C that are either of the form with , , and or of the form with , .
and the index j up to which the input w has been recognized, .
Concerning the constraint set C, the first type of constraints describes nodes in the derivation tree while the second describes immediate dominance relations between them, more precisely the fact that one node is the dth daughter of another node. There is an immediate relation between the two: if , then the address of must be the regular expression of concatenated with the daughter index d of . Let be the set of indices from occurring in C.
The initial configuration is
. From a current configuration, a new one can be obtained by one of the following operations:
Shift: Whenever we have on top of the stack, the next input symbol is a and in our parse table we have , we push a followed by onto the stack and increment the index j.
Reduce: Whenever the top of the stack is
and we have a parse table entry
, we can reduce the
ith component of
γ, i.e., suspend
γ. Concretely, this means the following:
- -
Concerning the constraint set C, if , we add to C where is a new pointer.
- -
If , we check whether there is a in C such that the intersection of the languages denoted by and , and , is not empty. We then replace in C with where p is a regular expression denoting .
- -
Concerning the stack Γ, we remove elements (i.e., terminals/component pointers and their regular expressions and states) from Γ. Suppose the topmost state on the stack is now . If , we push followed by on the stack.
- -
Concerning the dominance relations, for every among the symbols we removed from the stack, we add where d is the corresponding daughter index.
- -
It can happen that, thanks to the new dominance relations, we can further restrict the regular expressions characterizing the addresses in C. For instance, if we had , and in our set, we would replace the address of with .
- -
The index j remains unchanged in reduce steps.
Note that the intersection of two deterministic finite state automata is quadratic in the size of the two automata. In LCFRS without left recursion in any of the components, the intersection is trivial since the regular expressions denote only a single path each.
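The compatibility check on addresses can thus be implemented as an emptiness test on the product of two finite automata. A minimal sketch, assuming the regular expressions have already been compiled into DFAs given as (start state, accepting states, transition dict):

from collections import deque

def intersects(dfa1, dfa2):
    """True iff L(dfa1) and L(dfa2) share a word (product construction)."""
    (s1, acc1, d1), (s2, acc2, d2) = dfa1, dfa2
    seen, queue = {(s1, s2)}, deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if p in acc1 and q in acc2:
            return True                       # a common word is reachable
        for (st, sym) in d1:                  # follow shared alphabet symbols
            if st == p and (q, sym) in d2:
                nxt = (d1[(p, sym)], d2[(q, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Address language 1+ (a chain of 1-daughters) vs. exactly "11":
A = (0, {1}, {(0, "1"): 1, (1, "1"): 1})
B = (0, {2}, {(0, "1"): 1, (1, "1"): 2})
assert intersects(A, B)                       # both accept the address 11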
As an example, we run the parser from
Section 4.2 with an input
. The trace is shown in
Figure 15. In this example, the immediate dominance constraints in
C are depicted as small trees. We start in
, and shift two
as, which leads to
. We have then fully recognized the first components of
γ and
β: We suspend them and keep them in the set of completed components, which takes us to
. Shifting the
b takes us to
, from where we can reduce, which finally takes us to
. From there, we can shift the remaining
a (to
), with which we have fully recognized
β. We can now reduce both
β and with that,
α, which takes us to the accepting state
.
The way the different parsing operations can lead to new configurations can be described by a corresponding parsing schema, i.e., a set of deduction rules and a goal item specification. This is given in the following definition (the use of set union
in these rules assumes that the sets involved are disjoint). (Comparing the configurations of an LCFRS-TA to the configurations of our parser, it becomes clear that one problem of the TA is that, because of possible call-predict loops, we can have infinitely many configurations in a TA. [
23] avoids this problem by a dynamic programming implementation of TAs. In our automata, this problem does not occur because of the regular expressions we use as addresses).
Definition 11 (LR Parsing: Deduction Rules). Let be an LCFRS, its LR parse table, and and as introduced above.
The set of possible configurations for the LR parser is then defined by the following deduction rules:Initial configuration:
Shift:
Reduce 1:
Reduce 2:
where in both reduce operations, for some stands for the nth component of the dth righthand side element of γ for some and .
is defined as the closure of C under application of the following rules:- (a)
Compilation of address information:
- (b)
Removal of daughters of completed rules:
Goal items: for some and with .
4.4. An Example with Crossing Dependencies
The running example we have considered so far generates only a context-free language. It was chosen for its simplicity and for the fact that it has a non-terminal with a fan-out greater than 1
and a left recursion in one of the components, which requires the use of regular expressions. Let us now consider the LCFRS in
Figure 16, an LCFRS for a language that is not context-free with a cross-serial dependency structure. We have again left recursions, this time in all components of the rules
and
. And the rank of the grammar is 2, not 1 as in the running example we have considered so far. This gives us more interesting derivation trees.
The LR automaton for this grammar is given in
Figure 17. Since we have a left recursion in both components of the
β rules, the predict and resume closures of the items that start the traversal of these components each contain an infinite number of addresses in the derivation tree, encoded in a regular expression (
or
). This automaton leads then to the parse table in
Figure 18. As can be seen, the grammar does not allow for deterministic parsing since, when being in
for instance and having completed the first component of an
A, this can lead to
or
, consequently there are several entries in the corresponding
goto-field.
Figure 19 gives a sample parse (only successful configurations) for the input
. When reducing the components of the
A yields (nodes
and
) and of the components of the
B yields (
), we do not know the exact address of these elements in the derivation tree. Only in the last reduce step does it become clear that the
and
nodes are immediate daughters of the root of the derivation tree, and their addresses are resolved to 1 and 2 respectively.
As mentioned, this grammar does not allow for deterministic parsing. With one additional lookahead, the grammar would become deterministic (the ε-possibility is to be taken when being at the end of one of the substrings of the input).
4.5. Deterministic Parsing
As mentioned, with one lookahead, LR parsing of the LCFRS in the previous example would become deterministic.
In general, when doing natural language processing, the grammars will always be ambiguous and deterministic parsing is not possible. Instead, some way to share structure in the style of
generalized LR [
29] parsing has to be found. In other applications, however, we might have LCFRSs that allow for a deterministic parsing, similar to LR(
k) parsing as in [
27].
Non-determinism with LCFRS can arise not only from shift-reduce or reduce-reduce conflicts as in CFGs but also from a conflict between different transitions that have the same component non-terminal but different node addresses (see the multiple entries in the table of our example). We therefore need lookaheads not only for reduce (and, if need be, shift) but also for the goto entries.
We now define, similar to the CFG case [
27], FIRST sets of
symbols for dotted productions, which contain all terminals that are reachable as first element of the yield, starting from the dotted production. For every
, let us define
as follows. (Remember that
gives the
jth symbol of the
ith component in the lefthand side of
γ, i.e., the element following the dot.)
if is not defined (i.e., the dot is at the end of the ith component);
if (i.e., the dot precedes a);
if , x is the lth component in a righthand side element of γ with non-terminal B, and there is a with such that .
Nothing else is in .
Besides this, we also need the notion of FOLLOW. FOLLOW sets are defined for all component non-terminals of our LCFRS. They contain the terminals that can immediately follow the yield of such a component non-terminal.
We have $\$ \in \mathrm{FOLLOW}(S_1)$, where S is the start symbol and $\$$ is a new symbol marking the end of the input.
For every production γ with being the jth element of the ith lefthand side component (indices start with and ) and x being the lth element of the yield of a nonterminal B in the righthand side (also starting with ), we have , provided is defined.
For every production γ with and with being the last element of the ith lefthand side component (indices start with ) and the lth element of the yield of a nonterminal B in the righthand side (also starting with ), we have .
Nothing else is in the sets of our component nonterminals.
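As an illustration, a fixpoint computation of such sets can be sketched as follows. For simplicity, this sketch computes FIRST only for component non-terminals (A, i), not for arbitrary dotted productions as in the definition above, and it reuses the hypothetical Production encoding and rhs accessor of the Section 2.1 listings.

def first_sets(productions):
    """Fixpoint: FIRST[(A, i)] = terminals that can begin the ith component
    of an A-yield. Assumes the epsilon-free encoding of the earlier listings."""
    first = {}
    changed = True
    while changed:
        changed = False
        for g in productions:
            for i, arg in enumerate(g.lhs_args):
                sym = arg[0]                      # first symbol of component i
                if sym.startswith("?"):           # a variable: inherit FIRST
                    l, k = var_home(g, sym)
                    src = first.get((rhs(g, l), k), set())
                else:                             # a terminal starts the yield
                    src = {sym}
                tgt = first.setdefault((g.lhs, i), set())
                if not src <= tgt:
                    tgt |= src
                    changed = True
    return first

def var_home(g, x):
    """The (l, k) such that x is the kth variable of the lth rhs element."""
    for l, (_, variables) in enumerate(g.rhs, start=1):
        if x in variables:
            return l, variables.index(x)
    raise ValueError(x)

# For the rule A(aXb, cY) -> B(X, Y) plus a hypothetical B(d, e) -> eps:
B_rule = Production(lhs="B", lhs_args=(("d",), ("e",)), rhs=())
assert first_sets([gamma, B_rule]) == {("A", 0): {"a"}, ("A", 1): {"c"},
                                       ("B", 0): {"d"}, ("B", 1): {"e"}}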
For LR(1) parsing, i.e., LR parsing with 1 lookahead, we replace the items in our states with triples consisting of the address (regular expression), the dotted production and the possible next terminal symbols. To this end, we define sets that are the FIRST sets for dots that are not at the end of components and otherwise the FOLLOW-sets of the component non-terminal:
if exists and otherwise.
A simple way to extend our automaton to LR(1) is by extending the items in the states as follows: Every item with dotted production is equipped with the corresponding lookahead set. Furthermore, transitions are also labeled with triples where the new third element that is added is the terminal in case of a shift transition and, in case of a transition that moves the dot over a non-terminal, the lookahead set of the result of moving the dot according to the transition.
Consider the sample automaton in
Figure 17.
has two outgoing
edges with different addresses. One (derivation tree node address
ε) arises from moving the dot over the variable
x in
. We have
, therefore this transition would be relabeled
. The other transition (address
) leads to
, the
set is
and therefore this transition would be relabeled
.
Consequently, the goto part of the parse table, which would now have entries for every combination of component non-terminal and lookahead symbol, would have at most one entry per field. Hence, LR(1) parsing of our sample grammar is deterministic.
Concerning the reduce steps in the parse table, we would also have a lookahead. Here we can use the elements from the FOLLOW set of the reduced components as lookaheads. The reduce entry for state for instance would have the lookaheads from .
Instead of this simple LR technique one can also compute the lookahead sets while constructing the automaton, in the style of canonical LR parsing for CFGs. This would allow for more precise predictions of next moves.
There are LCFRSs that do not allow for deterministic parsing, no matter the length k of the lookahead strings (the non-deterministic CFGs, which are 1-LCFRSs, are such grammars). But a precise characterization of the class of deterministic LCFRSs and an investigation of whether, for every LCFRS, we can find a weakly equivalent LCFRS (as is the case for CFG), is left for further research.
4.6. Soundness and Completeness of the LR Algorithm
In order to show soundness and completeness of the LR algorithm, we will relate it to the LCFRS-TA for the same grammar. We will slightly modify the deduction rules for TA by assuming that (1) call and predict steps are combined and the initial configuration is omitted, i.e., we start with the result of the first predict steps; (2) we omit the publish steps; and (3) we don’t delete threads of finished rules. The modified rules are given in
Figure 20. Concerning the LR automaton, we replace the regular expressions in our parsing configurations with concrete addresses. In principle, we can then have infinitely many configurations in cases of left recursions (just like in the TA case), i.e., in cases where a shift or goto transition comes with a regular expression denoting an infinite language. Furthermore, we do not delete daughters of completed rules in the constraint set either, i.e., we assume that rule 2(b) in Definition 11 concerning the application of
to the constraint set does not apply. Clearly, this does not change the languages recognized by these automata; it just increases the set of possible configurations. Since we no longer use regular expressions but single addresses, the deduction rules of the LR parser get simpler. Immediate dominance constraints are no longer needed since they follow from the addresses. The relevant deduction rules are given in
Figure 21.
Figure 22 shows the correspondence between TA configurations and LR configurations for our sample derivation of
.
We now define a function that gives for every LR configuration the set of corresponding TA configurations. The idea is the following: An LR configuration describes a path in our LR automaton. This path yields several possible thread contents.
Before giving the formal definition, let us consider two examples from
Figure 22 for illustration. Take as a first example the LR configuration
. The corresponding path in our automaton is
Each thread store in the corresponding TA configuration corresponds to a possible path through these states.
For every ε-element in the last state of this path (here or ), we compute the TA configurations and respectively for these elements:
Let us consider first the case of
. We first define a set
, starting with
:
- -
Since was obtained via the a-transition, from in , we add to .
- -
in turn is in the second state on the path since we obtained it via prediction from , which is added to as well.
- -
Again, arose out of a shift of a from , which is added as well and then, because of predict in , we add . This gives .
- -
We then remove all items where there is an item with the same address and rule in the set but with a lefthand side index that is more advanced. This yields , which is the thread set we were looking for.
- -
In a similar way, for the γ case, we obtain
In addition, we add all configurations that we can reach from the ones constructed so far via predict or resume. These are reflected in corresponding items in the last state of our path. In our sample case, and are both elements of the last state and arose directly out of via predict. Therefore, we add these to the thread store with being the active thread, which leads to and .
These are all configurations in .
Now consider as a second example
, corresponding to the following path:
Again, we have to go back from the ε-element in the last state in order to construct the thread store. The last element gives us .
Consequently, the configuration is and we construct starting with . Following the shift transition back adds to . , in turn, arose from a resume applied to , which is therefore added as well. was predicted from , which is added as well. Now we follow the -transition backwards and therefore add . Finally, since for a specific address and rule, we keep only the most advanced position, we obtain .
In order to define the corresponding TA-sets, we annotate the LR automaton with pointers that show, for each element c in any of the states, which other element has led to c, either because of moving the dot over a terminal or a component variable or because of a predict/resume closure operation. In the latter case, the pointer is labeled with the daughter index of the predict/resume operation. For this annotation, we assume that instead of pairs of regular expression and computation point, we use only pairs of address and computation point. This means, of course, that we have potentially infinite states and potentially infinite predecessor paths through a state and from one state to another.
Definition 12 (Predecessor Annotation). Let M be an LR-automaton with state set Q constructed from an LCFRS G. .
We define the predecessor annotation
of M as a labeled relation such that for every and every :If there is a such that and can be deduced from by the predict/resume rule from Definition 7, .
If and there is a and a such that for some , then .
Nothing else is in .
Now we can define the set of possible TA stores for a given LR-configuration.
For every set of constraints C, we define as the set of pointers in C. Furthermore, for every pointer , we define as the address associated with , and as its label .
For a given LR configuration, we compute the corresponding TA configurations via deduction rules. The items in these rules have the form with a thread store, an LR configuration, i the index in the input, p an address and an element of the topmost state in Γ. We start with an empty thread store and our LR configuration. The deduction rules then either move backwards inside the topmost Γ state or remove elements from Γ while adding corresponding threads to the thread store. In the end, we will have an empty stack Γ and then, the thread store together with the index i and the address p of the current thread gives us one of the TA configurations corresponding to the LR configuration we started with. p stays constant throughout the entire computation. represents the address of the daughter that has led to moving into the current state.
Definition 13. Let G be an LCFRS, its LR automaton. For every input and every configuration that can be deduced via for the input w, we define the set of TA configurations that can be deduced from it via the following deduction rules.
Current thread:
Predict/resume backwards:
Shift backwards:
Shift non-terminal backwards:
Empty stack:
Add completed components:
Current thread picks one element from the topmost state q in Γ as the starting element. The address of the corresponding thread is a concatenation of the address preceding q on the stack and the address of the chosen state element. Predict/resume backwards applies whenever the address of the current state element is not ε; we then follow the link and add a corresponding new thread to our thread store. Shift backwards follows a link that corresponds to a reversed edge in the automaton labeled with a terminal. In this case, we do not change the thread store since we already have a thread for the rule application in question with a more advanced index. We only remove the shifted terminal and state from the LR stack and we split the address of the state and the element within it according to the link. Shift non-terminal backwards is a similar move following a link that corresponds to a reversed edge in the automaton labeled with a component non-terminal. These four rules serve to reduce the LR stack while constructing threads out of it. This process terminates with a single application of Empty stack. Once the stack is empty, we have to take care of the components described in C. For all of these, if no corresponding thread exists, we add it by applying Add completed components.
Lemma 14. Let be an LCFRS of rank m. Let be the LCFRS-TA of G, and let M be the LR-automaton constructed for G with being the LR parse table one can read off M.
For every input word , with being the set of TA configurations and the set of LR configurations arising from parsing w with the TA or the LR parser:
Proof. By induction over the deduction rules of the TA automaton and the LR parser:
Initial configurations: can be added as initial configuration in the LR parser iff for every with are initial configurations in the TA and, furthermore, for any configuration one can obtain from one of these initial TA configurations via applications of Predict it holds that there is a corresponding element in . Conversely, except for , for every element in , there is a corresponding configuration in the TA containing this element and all its predecessors as threads.
Predict in the TA starting from any of the :
Elements in
are obtained by
- -
Initial predict:
- -
Predict closure on in the LR automaton:
with a link labeled j from the consequent item to the antecedent item.
Consequently, we obtain that the set of TA configurations we get by using only the initial rule and Predict is exactly , i.e., the union of all for LR configurations we can obtain by the initial rule.
Scan and Shift: We now assume that we have sets of TA configurations and of LR configurations such that .
To show: For any pair contained in the two sets and for any terminal a and any address p: are all the configurations in with current thread address p that allow for an application of Scan with a, the new configurations being iff allows to derive a new configuration via an application of Shift with the terminal a and address p, such that the following holds: where is the closure of under applications of Predict and Resume.
Let us repeat the operations Scan and Shift:
Scan in the TA:
Shift in the LR automaton:
Let for . According to our induction assumption, has a form such that there exists a with for all . Furthermore, the dot in precedes a terminal , we can consequently perform scans on and there is a shift transition . Consequently, we can obtain new TA configurations and a new LR configuration in the LR automaton.
According to the LR automaton construction, with having a predecessor link pointing to in q. Further elements in are obtained by predict and resume and are linked to the elements with address ε by a predecessor chain. The same predict and resume operations can be applied to the corresponding TA configurations as well. Therefore our induction assumption holds.
Suspend and Reduce: We assume again that we have sets of TA configurations and of LR configurations such that .
To show: For any pair contained in the two sets: there is a that allows to derive a new configuration via an application of suspend iff allows to derive a new configuration via an application of reduce 1/2 such that .
Let such that a suspend can be applied resulting in . According to our induction assumption, has a form such that there exists a (without loss of generality, we assume our grammars to be ε-free. Consequently, computation points with the dot at the end of a component cannot arise from predict or resume operations and therefore always have address ε). Furthermore, since in , the dot is at the end of the th component, we have .
Let us recall the suspend and reduce operations:
TA Suspend:
LR Reduce 1:
LR Reduce 2:
where for both reduce steps, any component pointer among the has an address for some .
The only difference between the two reduce steps is that in the first, a first component is found and therefore a new pointer added to C, while in the second, only the component index of an already present pointer in C is incremented.
In our case, there must be elements such that and . This means that for the resulting LR configuration, we obtain exactly the TA thread store plus possibly additional store entries obtained from predict or resume operations.
4.7. Comparison to TAG Parsing
As mentioned in the beginning, LR parsing has also been proposed for TAG [
31,
41]. The LR states in this case contain dotted productions that correspond to small subtrees (a node and its daughters) in the TAG elementary trees. They are closed under moving up and down in these trees, predicting adjunctions and predicting subtrees below foot nodes. There are two different transitions labeled with non-terminal symbols in the LR automata constructed for TAG: Transitions that correspond to having finished a subtree and moving the dot over the corresponding non-terminal node and transitions that correspond to having finished the part below a foot node and moving back into the adjoined auxiliary tree. The latter transitions are labeled with the adjunction site. In the parser, when following such a transition, the adjunction site is stored on the stack of the parser. Adjunctions can be nested, consequently, we can have a list (actually a stack) of such adjunction sites that we have to go back to later. This way to keep track of where to go back to once the adjoined tree is entirely processed corresponds to the derivation node addresses we have in the thread automata and the LCFRS LR parser since both mechanisms capture the derivation structure. The difference is that with TAG, we have a simple nested structure without any interference of nested adjunctions in different places. And the contribution of auxiliary trees in the string contains only a single gap, consequently only a single element in the daughter list of a node contains such information about where to go back to when having finished the auxiliary tree. Therefore each of these chains of adjunctions can be kept track of locally on the parser stack.
Restricting our LR parser to well-nested 2-LCFRS amounts to restricting it to TAG. This restriction means that our LCFRS rules have the following form: Assume that is a rule in such an LCFRS. Then we can partition the righthand side into minimal subsequences , ..., such that (a) the yields of these subsequences do not cross; (b) every subsequence either (i) contains only a single non-terminal or (ii) has a first element of fan-out 2 that wraps around the yields of the other non-terminals. In this latter case, the yields of the second through the last non-terminal can again be partitioned in the same way. Consider as an example a rule . Here, we can partition the righthand side into the daughter sequences B and and F that can be parsed one after the other without having to jump back and forth between them. Concerning , the C wraps around and, concerning this shorter sequence, D wraps around E. This means that during parsing, when processing the yield of one of these subsequences, we have to keep track of the last first component of some non-terminal that we have completed since this will be the first one for which we have to find a second component. We can push this information on a stack and have to access only the top-most element. There is no need to search through the entire partial derivation structure that has already been built in order to find a matching first component when reducing a second one. On the parse stack, it is enough to push the rule names for which we have completed a first component and still need to find a second. Pointers to the derivation structure are no longer needed. And node addresses are not needed either. Besides nestings between sisters as in this example, we can also have embedded nestings in the sense that a first component is part of some other component. In this case, we have to finish the lower non-terminal before finishing the higher. In reductions of first components, we can keep track of embedded first components by removing them from the stack and pushing them again after having added the new element. This makes sure that the second component of the most deeply embedded first component inside the rightmost first component of what we have processed so far must be completed next.
The LR automaton does not need node addresses any longer. Besides this, it looks exactly like the automata we have built so far. As an example consider the LR automaton in
Figure 23.
Whenever we reduce a first component of a rule with fan-out 2, we push the rule name on the component stack, and any stack symbols corresponding to daughters of this component (i.e., daughters whose first components are part of the reduced yield but not the corresponding second ones) are taken from this stack and then pushed on an extra stack attached to the new rule symbol, such that the rightmost daughter is the highest element on this embedded stack. Whenever we reduce a second component, we must have the corresponding rule name as the top stack element and then we can delete it. The top is defined as follows: If the top is an element without attached stack, it is itself the top of the entire stack. If not, the top of its attached stack is the top of the entire stack structure. An example is given in
Figure 24. This example involves two reduce steps with first components, the first in state
where the rule name
γ is pushed on the stack. In the second reduce step, we reduce the first component of
β. This contains a non-terminal
for which we have not seen the corresponding
yet. Therefore, one symbol is popped from the stack (for
), and the new
β is pushed with an attached stack containing the removed
γ. Whenever we reduce a second component of a rule, the top-most stack element must be the name of that rule.
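The stack discipline described here can be sketched as follows; the class and its operations are our own illustration of the mechanism, not an implementation from the paper.

class CompStack:
    """Component stack for well-nested 2-LCFRS: each entry is a rule name
    together with an attached stack of embedded pending first components."""
    def __init__(self):
        self.items = []            # list of [rule_name, attached CompStack]

    def reduce_first(self, rule, n_embedded):
        """Reduce a first component of rule whose yield contained n_embedded
        pending first components; they move to the new entry's attached stack."""
        attached = CompStack()
        for _ in range(n_embedded):
            attached.items.insert(0, self.items.pop())  # rightmost daughter on top
        self.items.append([rule, attached])

    def top(self):
        """The recursively defined top of the whole stack structure."""
        if not self.items:
            return None
        rule, attached = self.items[-1]
        return rule if not attached.items else attached.top()

    def reduce_second(self, rule):
        """Reduce a second component: rule must be the current top."""
        assert self.top() == rule
        stack = self
        while stack.items[-1][1].items:    # descend to the deepest attached stack
            stack = stack.items[-1][1]
        stack.items.pop()

# Mirroring the example: the first beta-component embeds gamma's first one,
# so gamma's second component has to be completed before beta's.
s = CompStack()
s.reduce_first("gamma", 0)
s.reduce_first("beta", 1)
assert s.top() == "gamma"
s.reduce_second("gamma")
assert s.top() == "beta"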