CYK Parsing over Distributed Representations

Parsing is a key task in computer science, with applications in compilers, natural language processing, syntactic pattern matching, and formal language theory. With the recent development of deep learning techniques, several artificial intelligence applications, especially in natural language processing, have combined traditional parsing methods with neural networks to drive the search in the parsing space, resulting in hybrid architectures using both symbolic and distributed representations. In this article, we show that existing symbolic parsing algorithms for context-free languages can cross the border and be entirely formulated over distributed representations. To this end, we introduce a version of the traditional Cocke–Younger–Kasami (CYK) algorithm, called distributed (D)-CYK, which is entirely defined over distributed representations. D-CYK uses matrix multiplication on real-valued matrices of a size independent of the length of the input string. These operations are compatible with recurrent neural networks. Preliminary experiments show that D-CYK approximates the original CYK algorithm. By showing that CYK can be entirely performed on distributed representations, we open the way to the definition of recurrent-layer neural networks that can process general context-free languages.


Introduction
The area of natural language parsing has been dominated for decades by the so-called symbolic paradigm of artificial intelligence, which embraces the collection of methods based on high-level, human-readable, symbolic representations. The Cocke-Younger-Kasami (CYK) algorithm (Cocke 1969; Younger 1967; Kasami 1965), the Earley algorithm (Earley 1970) and the shift-reduce algorithm (Sippu and Soisalon-Soininen 1988) are at the core of most common algorithms for natural language parsing, both constituency-based and dependency-based, and they all use symbolic representations for grammar rules, parser states and syntactic trees.
Starting in the early 1990s, with the surge of data-driven methods in natural language processing, grammar rules and their weights have been estimated from large data sets of syntactically annotated sentences, and probabilistic parsers and parsers based on discriminative models have flourished. Nonetheless, all these methods are based on the above-mentioned parsing algorithms, with grammar rules and parser states still symbolically represented. With the recent development of deep learning techniques, fast and accurate dependency parsers have been designed on top of the shift-reduce algorithm, where parser actions are parameterized using neural networks (Chen and Manning 2014; Ambati, Deoskar, and Steedman 2016). Furthermore, the use of recurrent neural networks has made it possible to design models that condition parsing actions on the entire syntactic derivation history (Titov and Henderson 2007; Dyer et al. 2016), resulting in considerable enhancement of parsing performance. Still, for all of these parsers based on neural networks, the underlying configuration of the parser is a stack of symbols, and the search space is represented at that level.
The main goal in the above-mentioned research is to inform parsing decisions using contextual information that is as rich as possible. This means that the context-free assumption of the underlying formalism is dropped. In the case of the neural network parsers, this is achieved using distributed representations and matrix multiplication operations embedded into the underlying network architecture. In this article we explore a somewhat tangential direction: we attempt to entirely remove the symbolic level of representation from the underlying parser. There are at least two existing lines of research in this direction. In the first line (Vinyals et al. 2014), parsing is seen as a sequence-to-sequence translation task, with the input sentence mapped to a linear syntactic interpretation, and training is done with long short-term memory (LSTM) networks over millions of sentences annotated by existing parsers. In this case it seems that the LSTM network learns two things: the associations among fragments of sentences and fragments of trees, and a way of recombining these fragments in the final interpretation. The grammar representation is then hidden in the weights of the LSTM. In the second line of research (Zanzotto and Dell'Arciprete 2013; Senay et al. 2015), both sentences and trees are represented as distributed vectors, and neural networks learn a way to map sentence vectors to tree vectors. However, the overall model does not have the ability to replicate distributed vectors for trees from distributed vectors for sentences. The resulting vectors are not accurate enough to have an impact on final tasks (Zanzotto and Dell'Arciprete 2013). Moreover, in this case as well the grammar is represented in an unpredictable way in the weights of the multi-layer perceptron or the LSTM.
In this article we take a rather different route, and show that traditional parsing algorithms can cross the border of distributed representations. We propose a version of the CYK algorithm, called D-CYK, that works entirely on distributed representations. This is achieved by transforming the parsing table at the base of the CYK algorithm into a real-valued square matrix, and by implementing the basic operations of the algorithm using matrix multiplication. Implementations of the CYK algorithm using matrix multiplication are well known in the literature (Graham and Harrison 1976; Valiant 1975), but they use symbolic representations, while in our proposal grammar symbols as well as constituent indices are all encoded into real numbers in a distributed way. A second, important difference is that in the standard CYK algorithm the parsing table has size (n + 1) × (n + 1), with n the length of the input string, while in our D-CYK the parsing table has size d × d, where d depends only on the distributed representation. This means that, to some extent, we can parse input sentences of different lengths without changing the size of the parsing table in the D-CYK algorithm. We are not aware of any other parsing algorithm for context-free grammars having such a property.
We report experiments on toy grammars showing that our D-CYK can successfully approximate the CYK algorithm. This novel approach opens the way to the investigation of parsing algorithms where the symbolic level of representation is entirely dropped.

Figure 1: A simple context-free grammar, the CYK parsing table P for the string aab, and the "distributed" representation P_left of P in Tetris-like notation.

Preliminaries
In this section we introduce the basics of the CYK algorithm and overview a class of distributed representations called holographic reduced representations.

CYK algorithm
The CYK algorithm is a classical dynamic programming algorithm for parsing based on context-free grammars. We provide here a brief description of the algorithm in order to introduce the notation used in later sections; we closely follow the presentation in (Graham and Harrison 1976). The algorithm requires context-free grammars in Chomsky Normal Form (CNF), in which each rule has the form A → BC or A → a, where A, B, C are nonterminal symbols and a is a terminal symbol. We write R to denote the set of all rules of the grammar. Given an input string w = a_1 ⋯ a_n, n ≥ 1, where each a_i is an alphabet symbol, the algorithm uses a 2-dimensional table P of size (n + 1) × (n + 1), where each entry stores a set of nonterminals representing partial parses of the input string. More precisely, for 0 ≤ i < j ≤ n, a nonterminal A belongs to the set P[i, j] if and only if there exists a parse tree with root A generating the substring a_{i+1} ⋯ a_j of w. Thus, w can be parsed if the initial nonterminal S of the grammar is added to P[0, n]. Algorithm 1 shows how table P is populated. P is first initialized using unary rules, at line 3. Then each entry P[i, j] is filled at line 9 by looking at pairs P[i, k] and P[k, j] and by using binary rules.
Algorithm 1 CYK(string w = a_1 ⋯ a_n, rule set R) return table P
1: for i ← 1 to n do
2:   for all (A → a_i) ∈ R do
3:     P[i−1, i] ← P[i−1, i] ∪ {A}
4: for j ← 2 to n do
5:   for i ← j−2 downto 0 do
6:     for k ← i+1 to j−1 do
7:       for all (A → BC) ∈ R do
8:         if B ∈ P[i, k] and C ∈ P[k, j] then
9:           P[i, j] ← P[i, j] ∪ {A}
10: return P

A running example is presented in Figure 1, showing a set R of grammar rules along with the table P produced by the algorithm when processing the input string w = aab. Since S ∈ P[0, 3], we conclude that w can be parsed by the grammar.
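To make the recognizer concrete, the following minimal Python sketch implements Algorithm 1. The grammar fragment is illustrative and consistent with the running example (unary rules D → a and E → b, plus two binary rules), but it is not guaranteed to be the exact grammar of Figure 1.

    # Minimal sketch of the symbolic CYK recognizer (Algorithm 1).
    # Unary rules are pairs (A, a); binary rules are triples (A, B, C).
    def cyk(w, unary, binary):
        """Return the (n+1) x (n+1) table P of nonterminal sets."""
        n = len(w)
        P = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        # Lines 1-3: initialize P[i-1, i] with unary rules A -> a_i.
        for i in range(1, n + 1):
            for A, a in unary:
                if a == w[i - 1]:
                    P[i - 1][i].add(A)
        # Lines 4-9: fill P[i, j] with binary rules over split points k.
        for j in range(2, n + 1):
            for i in range(j - 2, -1, -1):
                for k in range(i + 1, j):
                    for A, B, C in binary:
                        if B in P[i][k] and C in P[k][j]:
                            P[i][j].add(A)
        return P

    # Illustrative grammar fragment (not necessarily the one in Figure 1).
    unary = [("D", "a"), ("E", "b")]
    binary = [("S", "D", "F"), ("F", "D", "E")]
    P = cyk("aab", unary, binary)
    print("S" in P[0][3])  # True iff the grammar accepts "aab"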

Distributed Representations with Holographic Reduced Representations
Holographic reduced representations (HRR; Plate 1995) are distributed representations well-suited for our aim of encoding the 2-dimensional parsing table P of the CYK algorithm and for implementing the operation of selecting the content of its cells P [i, j]. In the following, we introduce the operations we use, along with a graphical way to represent their properties. The graphical representation is based on Tetris-like pieces.
The starting point of a distributed representation is how to encode symbols into vectors: symbol a can be encoded using a random vector a ∈ R^d drawn from a multivariate normal distribution a ∼ N(0, I/√d). These vectors are used as basis vectors for the Johnson-Lindenstrauss transform (Johnson and Lindenstrauss 1984) as well as for random indexing (Sahlgren 2005). The major property of these random vectors is that they are approximately orthonormal:

  aᵀa ≈ 1 and aᵀb ≈ 0 for a ≠ b.

Given the above representation of symbols, we can define a basic operation [·]⊕ and its approximate inverse [·]⊖. These operations take as input a symbol and produce a matrix in R^{d×d}, and are the basis for our encoding and decoding. The first operation is defined as

  [a]⊕ = A°Φ,

where A° is the circulant matrix of the vector a and Φ is a permutation matrix. This operation has a nice approximate inverse in

  [a]⊖ = Φᵀ(A°)ᵀ.

We then have

  [a]⊕[a]⊖ = A°ΦΦᵀ(A°)ᵀ = A°(A°)ᵀ ≈ I,

since Φ is a permutation matrix and therefore ΦΦᵀ = I, and

  [a]⊕[b]⊖ = A°(B°)ᵀ ≈ 0 for a ≠ b,

due to the fact that A° and B° are circulant matrices based on random vectors a, b ∼ N(0, I/√d). With the [·]⊕ and [·]⊖ operations at hand, we can now encode and decode strings, that is, finite sequences of symbols. As an example, the string abc can be represented as

  [a]⊕[b]⊕[c]⊕.

Using the above operations, we can also encode sets of strings. For instance, the string set S = {abS, DSa} is represented as the sum of matrix products

  S = [a]⊕[b]⊕[S]⊕ + [D]⊕[S]⊕[a]⊕.

We can then test whether abS ∈ S by computing the matrix product

  S [S]⊖[b]⊖[a]⊖ ≈ I,

providing a positive answer since the result is different from 0.
Not only can the operations above be used to encode sets, as just described; they can also be used to encode multi-sets, that is, they can keep a count of the number of occurrences of a given symbol/string within a collection. For instance, consider a multi-set with two occurrences of symbol a. This can be encoded by means of the sum [a]⊕ + [a]⊕ = 2[a]⊕, and decoding with [a]⊖ yields approximately 2I, recovering the count.

Figure 2: Tetris-like graphical representation of the pieces for the symbols in our running example.
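The following numpy sketch illustrates these operations under the reconstruction above. The circulant-matrix encoding, the random permutation, and the dimension d = 1024 follow the definitions in this section, but the code itself is our illustration, not the paper's implementation.

    import numpy as np
    from scipy.linalg import circulant

    d = 1024
    rng = np.random.default_rng(0)
    Phi = np.eye(d)[rng.permutation(d)]            # random permutation matrix

    def vec():
        # Random symbol vector with component std 1/sqrt(d), so that
        # a.a ~ 1 and a.b ~ 0 for independent draws.
        return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

    def enc(a):                                     # [a]+ = circulant(a) Phi
        return circulant(a) @ Phi

    def dec(a):                                     # [a]-, approximate inverse
        return Phi.T @ circulant(a).T

    a, b, S, D = (vec() for _ in range(4))

    # Encode the string set S = {abS, DSa} as a sum of matrix products.
    Sset = enc(a) @ enc(b) @ enc(S) + enc(D) @ enc(S) @ enc(a)

    # Test abS in S: peel symbols off from the right with approximate inverses.
    hit = Sset @ dec(S) @ dec(b) @ dec(a)
    miss = Sset @ dec(S) @ dec(a) @ dec(a)          # tests the absent string aaS
    print(np.trace(hit) / d, np.trace(miss) / d)    # ~1.0 versus ~0.0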
With these operations and with this Tetris metaphor, we can describe our model to encode P tables in matrices and to implement rule applications by means of matrix multiplication, as discussed in the next section.

The CYK algorithm on Distributed Representations
The distributed CYK algorithm (D-CYK) is our version of the CYK algorithm that runs over distributed representations using matrix algebra. Like the traditional CYK algorithm, it recognizes whether or not a string w can be generated by a context-free grammar with a set of rules R in Chomsky Normal Form. Yet, unlike the traditional CYK algorithm, the parsing table P and the rule set R are encoded through matrices in R^{d×d}, using the distributed representation of Section 2.2, and rule application is obtained with matrix algebra.
In this section we describe how the D-CYK algorithm encodes: (i) the table P by means of two matrices P_left and P_right; (ii) the rules in R by means of matrices R_u^(A) and R_b^(A), one pair for each nonterminal A. We then specify the steps of the D-CYK algorithm and illustrate its execution using the running example of Figure 1.

Encoding the Table P in Matrices P_left and P_right
The table P of the CYK algorithm can be seen as a collection of triples (i, j, X). More precisely, the collection of triples contains element (i, j, X) if and only if X ∈ P[i, j]. Given the representation of Section 2.2, the table P is encoded by means of two matrices P_left and P_right in R^{d×d}, each containing the collection of triples (i, j, X) in distributed representation, with positions i and j encoded by random vectors exactly as symbols are. More precisely, each triple (i, j, X) is encoded as

  P_left[i, j, X] = [i]⊖[j]⊕[X]⊕ and P_right[i, j, X] = [X]⊖[i]⊖[j]⊕.

Then matrix P_left is the sum of all elements P_left[i, j, X], encoding the collection of all triples from P. Similarly, P_right is the sum of all elements P_right[i, j, X]. To visualize this representation, the matrix P_left of our running example is represented in the Tetris-like notation in Figure 1, where we have used the pieces in Figure 2.
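As an illustration, the following sketch (reusing vec, enc, and dec from the previous snippet) builds P_left and P_right from an explicit list of triples; the symbol vectors and the list of triples follow the running example under the illustrative grammar used earlier, and the encoding itself is our reconstruction.

    symbols = {s: vec() for s in ["a", "b", "S", "D", "E", "F"]}
    positions = [vec() for _ in range(4)]           # string positions 0..3

    def build_P_left(triples):
        # Sum of [i]- [j]+ [X]+ over all triples (i, j, X).
        return sum(dec(positions[i]) @ enc(positions[j]) @ enc(symbols[X])
                   for i, j, X in triples)

    def build_P_right(triples):
        # Sum of [X]- [i]- [j]+ over all triples (i, j, X).
        return sum(dec(symbols[X]) @ dec(positions[i]) @ enc(positions[j])
                   for i, j, X in triples)

    triples = [(0, 1, "D"), (1, 2, "D"), (2, 3, "E"), (1, 3, "F"), (0, 3, "S")]
    P_left, P_right = build_P_left(triples), build_P_right(triples)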

Encoding and Using Unary Rules
The CYK algorithm uses symbols a_i from the input string w and unary rules to fill in cells P[i − 1, i], as seen in Algorithm 1. We simulate this step in D-CYK using our distributed representation and matrix operations. D-CYK represents the input a_i using the matrix P_left. Hence, before the application of unary rules we have

  P_left = Σ_{i=1..n} [i−1]⊖[i]⊕[a_i]⊕.

For processing unary rules we use the first part of the D-CYK algorithm, called D-CYK_unary and reported in Algorithm 2. D-CYK_unary takes P_left, encoding symbols from the input string, and produces updated matrices P_left and P_right encoding the nonterminal symbols resulting from the application of unary rules.
For the running example of Figure 1, the initial content of P_left is

  P_left = [0]⊖[1]⊕[a]⊕ + [1]⊖[2]⊕[a]⊕ + [2]⊖[3]⊕[b]⊕.

Taking as input the above matrix, D-CYK_unary produces two updated matrices P_left and P_right that additionally encode the triples (0, 1, D), (1, 2, D) and (2, 3, E). In D-CYK_unary, we use matrices R_u^(A), one for each nonterminal A in the left-hand side of some unary rule. These matrices are conceived to detect the applicability of rules of the form A → a, where a is some alphabet symbol, to matrix P_left, and are also used to update matrices P_left and P_right. Matrix R_u^(A) is defined as

  R_u^(A) = Σ_{(A → a) ∈ R} [a]⊖,

where R is the set of rules of the grammar. The operation between R_u^(A) and P_left (line 4 in Algorithm 2), which detects whether some rule A → a is applicable at position (i − 1, i) of the input string, is

  T ← σ([i−1]⊕ P_left R_u^(A) [i]⊖),

and its outcome is reinforced by the use of the sigmoid function σ. Hence, if some unary rule with left-hand side symbol A is applicable, the resulting matrix T is approximately the identity matrix I, else it is approximately the zero matrix 0. Then, the operations in lines 5 and 6,

  P_left ← P_left + [i−1]⊖[i]⊕[A]⊕ T and P_right ← P_right + [A]⊖[i−1]⊖[i]⊕ T,

add a non-zero matrix to P_left and P_right, respectively, only if rules for A are matched in the matrix P_left containing the input sentence.
Algorithm 2 D-CYK_unary(string w = a_1 a_2 ⋯ a_n, matrices R_u^(A)) return P_left, P_right
1: P_left ← Σ_{i=1..n} [i−1]⊖[i]⊕[a_i]⊕; P_right ← 0
2: for i ← 1 to n do
3:   for A ∈ nonterminals do
4:     T ← σ([i−1]⊕ P_left R_u^(A) [i]⊖)
5:     P_left ← P_left + [i−1]⊖[i]⊕[A]⊕ T
6:     P_right ← P_right + [A]⊖[i−1]⊖[i]⊕ T
7: return P_left, P_right

We describe the application of D-CYK_unary using the running example in Figure 1 and the Tetris-like representation. The two unary rules D → a and E → b are represented as R_u^(D) = [a]⊖ and R_u^(E) = [b]⊖. As already seen, given the input sequence aab, the matrix P_left is initialized as [0]⊖[1]⊕[a]⊕ + [1]⊖[2]⊕[a]⊕ + [2]⊖[3]⊕[b]⊕. We focus on the application of rule D → a to cell P[0, 1] of the parsing table, represented through matrices R_u^(D) and P_left, respectively. At steps 4 and 5 of Algorithm 2, taken together, we have

  P_left ← P_left + [0]⊖[1]⊕[D]⊕ σ([0]⊕ P_left R_u^(D) [1]⊖).

Since the detection term is approximately I, the update part of the assignment can be expressed as the addition of the new triple (0, 1, D), that is, of the matrix [0]⊖[1]⊕[D]⊕, to P_left.
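A Python sketch of this unary step follows, continuing the snippets above. The sigmoid gain, its centering at 1/2, and the element-wise mask with I (which the text introduces only for the binary step) are our assumptions, added here for numerical stability.

    def sigma(M, gain=10.0):
        # Steep sigmoid centered at 1/2: pushes entries toward 0 or 1.
        return 1.0 / (1.0 + np.exp(-gain * (M - 0.5)))

    def d_cyk_unary(w, unary_rules):
        n = len(w)
        # Line 1: encode the input terminals into P_left.
        P_left = sum(dec(positions[i - 1]) @ enc(positions[i]) @ enc(symbols[w[i - 1]])
                     for i in range(1, n + 1))
        P_right = np.zeros((d, d))
        # R_u^(A): sum of [a]- over all unary rules A -> a.
        R_u = {}
        for A, t in unary_rules:
            R_u[A] = R_u.get(A, 0) + dec(symbols[t])
        for i in range(1, n + 1):
            for A, R in R_u.items():
                # Line 4: ~I if some rule A -> a_i applies at (i-1, i), else ~0.
                T = sigma(enc(positions[i - 1]) @ P_left @ R @ dec(positions[i])) * np.eye(d)
                # Lines 5-6: gated insertion of the triple (i-1, i, A).
                P_left = P_left + dec(positions[i - 1]) @ enc(positions[i]) @ enc(symbols[A]) @ T
                P_right = P_right + dec(symbols[A]) @ dec(positions[i - 1]) @ enc(positions[i]) @ T
        return P_left, P_right

    P_left, P_right = d_cyk_unary("aab", [("D", "a"), ("E", "b")])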

Encoding and Using Binary Rules
To complete the specification of algorithm D-CYK, we describe here how to encode binary rules in such a way that these rules can fire over the distributed representation of table P through operations in our matrix algebra. We introduce the second part of the algorithm, called D-CYK_binary and specified in Algorithm 3, and we clarify why both P_left and P_right, introduced in the previous subsection, are needed. Binary rules in R with nonterminal symbol A in the left-hand side are all encoded in a matrix R_b^(A). This matrix is conceived for defining matrix operations that detect whether rules of the form A → BC, for some B and C, fire in position (i, j), given P_left and P_right. These operations result in a nearly identity matrix I for a specific position (i, j) if at least one rule encoded in R_b^(A) fires in position (i, j) over positions (i, k) and (k, j), for any value of k. This enables the insertion of new symbols in P_left and P_right.
Algorithm 3 D-CYK_binary(P_left, P_right, matrices R_b^(A)) return P_left, P_right
1: for j ← 2 to n do
2:   for i ← j−2 downto 0 do
3:     for A ∈ nonterminals do
4:       T ← σ([i]⊕ P_left R_b^(A) P_right [j]⊖) ⊙ I
5:       P_left ← P_left + [i]⊖[j]⊕[A]⊕ T
6:       P_right ← P_right + [A]⊖[i]⊖[j]⊕ T
7: return P_left, P_right

We encode the right-hand side of each binary rule A → BC in R as [B]⊖[C]⊕. All the right-hand sides of binary rules with symbol A in the left-hand side are then collected in matrix

  R_b^(A) = Σ_{(A → BC) ∈ R} [B]⊖[C]⊕.

Algorithm 3 uses these rules to determine whether a symbol A can fire in a position (i, j) for any k. The key part is line 4 of Algorithm 3, which selects elements in P_left and P_right according to the rules for A. Matrices P_left and P_right have been designed in such a way that, after the nonterminal symbols in the selected elements have been annihilated, the associated spans (i, k) and (k, j) merge into span (i, j):

  [i]⊖[k]⊕[B]⊕ · [B]⊖[C]⊕ · [C]⊖[k]⊖[j]⊕ ≈ [i]⊖[j]⊕.

Finally, the terms [i]⊕ and [j]⊖ are meant to check whether the span (i, j) has survived. If this is the case, the resulting matrix will be very close to the identity matrix I, otherwise it will be very close to the null matrix 0. To reinforce this matrix, similarly to Algorithm 2, we use the sigmoid function σ(x). Finally, we apply an element-wise multiplication with I, which is helpful to remove noise.
To visualize the behavior of the algorithm D-CYK_binary and the effect of using R_b^(A), let us focus on position (1, 3) and on a binary rule with right-hand side DE, encoded as [D]⊖[E]⊕. The operation at line 4 is

  T ← σ([1]⊕ P_left R_b^(A) P_right [3]⊖) ⊙ I,

where P_left and P_right are the matrices produced by D-CYK_unary. Since P_left contains the triple (1, 2, D) and P_right contains the triple (2, 3, E), the spans (1, 2) and (2, 3) merge into (1, 3), the detection term is approximately I, and the triple (1, 3, A) is added to P_left and P_right. With this second part of the algorithm, CYK has been reduced to D-CYK, which works with distributed representations and matrix operations compatible with neural networks.
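The binary step can be sketched in the same style, continuing the snippets above; the matrices R_b^(A) follow our reconstruction, and the grammar fragment remains the illustrative one used earlier.

    def d_cyk_binary(P_left, P_right, binary_rules, n):
        # R_b^(A): sum of [B]- [C]+ over all binary rules A -> B C.
        R_b = {}
        for A, B, C in binary_rules:
            R_b[A] = R_b.get(A, 0) + dec(symbols[B]) @ enc(symbols[C])
        for j in range(2, n + 1):
            for i in range(j - 2, -1, -1):
                for A, R in R_b.items():
                    # Line 4: ~I iff some A -> B C fires on (i, k), (k, j) for some k.
                    T = sigma(enc(positions[i]) @ P_left @ R @ P_right @ dec(positions[j])) * np.eye(d)
                    # Lines 5-6: gated insertion of the triple (i, j, A).
                    P_left = P_left + dec(positions[i]) @ enc(positions[j]) @ enc(symbols[A]) @ T
                    P_right = P_right + dec(symbols[A]) @ dec(positions[i]) @ enc(positions[j]) @ T
        return P_left, P_right

    P_left, P_right = d_cyk_binary(P_left, P_right, [("S", "D", "F"), ("F", "D", "E")], n=3)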

Experiments
The aim of these experiments is to show that the distributed CYK algorithm behaves like the original CYK algorithm. For this purpose, we do not need huge datasets, but rather small, well-defined sets of sentences derived from fixed grammars, as described in the following sections.

Experimental Set-up
We experimented with five different grammars with increasing sets of rules and a set of 2,000 sentences. The grammars are: G_0, the basic grammar with 8 rules, 3 of which are unary; G_1, with 25 rules, which expands only the unary rules of G_0; G_2, with 28 rules, which expands only the binary rules of G_1; G_3, with 34 rules, which expands only the binary rules of G_1; and, finally, G_4, with 41 rules, which expands only the binary rules of G_1. The set of sentences has been produced by randomly generating 2,000 sentences of different lengths using the grammar G_0. These sentences can be recognized by all five grammars, since the other grammars are obtained by adding rules to G_0.
As we want to understand whether D-CYK is able to reproduce the computation of the original CYK, we used the cell f1-measure (f1), which evaluates whether the distributed table is similar to the original table P. To evaluate f1, we decoded the distributed version P_left of P by using a simple decoding algorithm Dec (Algorithm 4).

Algorithm 4 Dec(P_left) return table P
1: for i ← 0 to n do
2:   for j ← i + 1 to n do
3:     for A ∈ nonterminals do
4:       if the triple (i, j, A) is detected in P_left then P[i, j] ← P[i, j] ∪ {A}
5: return P

We compared Dec(P_left) with the matrix P obtained by applying the traditional CYK algorithm with the corresponding grammar on the test sentences. By comparing Dec(P_left) and the corresponding P, we evaluated the cell precision and the cell recall by considering P as the oracle and Dec(P_left) as the system. f1 is then computed according to the traditional equation combining precision and recall.
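A sketch of the decoder, continuing the snippets above; the trace-based probe and the acceptance threshold on its normalized value are our assumptions.

    def decode(P_left, n, nonterminals, threshold=0.5):
        # Read nonterminal A back into cell P[i][j] when the probe is close to I.
        P = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i in range(n):
            for j in range(i + 1, n + 1):
                for A in nonterminals:
                    probe = enc(positions[i]) @ P_left @ dec(symbols[A]) @ dec(positions[j])
                    if np.trace(probe) / d > threshold:
                        P[i][j].add(A)
        return P

    # Cell precision/recall/f1 compare decode(P_left, ...) against the symbolic
    # table produced by the plain CYK implementation given earlier.
    P_hat = decode(P_left, 3, ["S", "D", "E", "F"])
    print(P_hat[0][3])   # expected to contain S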

Results
Results are encouraging, showing that, as the dimension of the matrices increases, D-CYK can approximate with its operations what is done by the traditional CYK. The f1-measure in fact increases with the dimension of the matrices. This is mainly due to an improvement of the cell symbol precision, as the cell symbol recall is substantially stable. Hence, as the dimension increases, D-CYK becomes more precise in replicating the original table.
The size of the grammar is instead a major issue: the precision of the algorithm is negatively affected by the number of rules, whereas the recall is substantially similar across the different grammars.
These results confirm that it is possible to transfer a traditional symbolic algorithm to a version that is entirely defined over distributed representations.

Conclusions and Future Work
In recent years, the predominance of symbolic, grammar-based syntactic parsers for natural language has been successfully challenged by neural networks, which are based on distributed representations. Years of results and understanding risk being lost. We proposed D-CYK, a distributed version of CYK, a classical parsing algorithm. Experiments show that D-CYK can perform the same task as the original CYK in this new setting.
Neural networks offer a tremendous opportunity to develop novel solutions for known tasks. Our solution opens an avenue to an innovative set of possibilities: revitalizing symbolic methods within neural networks. In fact, our algorithm is a first step towards the definition of a "complete distributed CYK algorithm" that builds trees in distributed representations during the computation. Moreover, it can foster the definition of recurrent layers of CYK-informed neural networks.