Sequence Tagging for Fast Dependency Parsing

Abstract: Dependency parsing has traditionally relied on shift-reduce or graph-based algorithms to identify binary dependency relations between the words in a sentence. In this study we adopt a radically different approach and cast full dependency parsing as a pure sequence tagging task. In particular, we apply a linearization function to the tree that yields one output label per token, conveying information about that word’s dependency relations. We then follow a supervised strategy and train a bidirectional long short-term memory network to predict such linearized trees. Contrary to previous studies attempting this, our results show that this approach leads to dependency parsing that is both accurate and fast. Furthermore, we obtain even faster and more accurate parsers by recasting the problem as multitask learning, with a twofold objective: to reduce the output vocabulary and to exploit hidden patterns coming from a second parsing paradigm (constituent grammars) used as an auxiliary task.


Introduction
One of the building blocks of Natural Language Processing (NLP) is parsing, which provides syntactic analyses of a text. The structure of a sentence is commonly represented as a constituency [1] or dependency tree [2]. Constituency grammar introduces the notion of constituents, where a sentence is decomposed into sub-phrases, while in dependency grammar words are connected according to their dependency relations (every word in a sentence depends on another word, which is defined as its head). An example of each tree structure is given in Figure 1. Various parsing algorithms have been developed for constituency and dependency parsing. For the latter, transition- and graph-based approaches have been the most widely used. In transition-based (or shift-reduce) dependency parsing, the best transition (create an arc between two words, do a shift, reduce a word, ...) is predicted at each timestep given the current configuration of the parser [3]. In contrast, a graph-based parser incrementally explores all possible parses of a sentence through graph fragments, and the tree with the highest score is chosen [4].
In this context, neural architectures have gained popularity in NLP, and long short-term memory (LSTM) networks have proven useful in many problems because of their ability to decide which information to remember [5]. This is especially useful in dependency parsing, where long-distance relationships between words have to be identified. In addition, it has been shown that these architectures can benefit from learning several tasks jointly, the so-called multitask learning (MTL) [6]. In MTL setups, it can also be helpful to add an auxiliary task, whose output is not of interest in itself but can be used to improve performance on the main task. This is due to the ability of the network to exploit hidden patterns that are present in the main task and the ability of the shared representations to prevent overfitting.

Method
Recent research has shown that constituency parsing can be reduced to sequence tagging, a structured prediction problem where a single output label is generated for each input token [7]. To do so, the syntactic trees need to be linearized through an encoding method, as shown in Figure 1a.
In a similar fashion, we propose to apply sequence tagging models to dependency parsing [8], using NCRF++ [9] as our sequence tagging framework. We propose a part-of-speech (PoS) tag-based encoding where the information about a token's head and dependency relation is encapsulated in a label of the form $(p_i, h_i, r_i)$. The first element $p_i$ encodes the relative distance to the token's head, counted only over words with the PoS tag $h_i$, and $r_i$ is the dependency relation between those two tokens. An example of an encoded dependency tree is shown in Figure 1b. For instance, the label for the token "control" is (-1, V, DOBJ), which means that the head is the first token to the left (-1) among those with the PoS tag V, and that the dependency relation is DOBJ.
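The PoS tag-based encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `encode` and `decode`, the dummy root at index 0 with the tag `ROOT`, and the toy sentence ("She takes control") are assumptions introduced here. Heads are 1-indexed token positions, with 0 standing for the dummy root.

```python
from typing import List, Tuple

Label = Tuple[int, str, str]  # (offset, PoS tag of the head, dependency relation)

ROOT_TAG = "ROOT"  # assumption: a dummy root token sits at index 0


def encode(pos: List[str], heads: List[int], rels: List[str]) -> List[Label]:
    """Turn a dependency tree into one label per token."""
    tags = [ROOT_TAG] + pos  # position 0 is the dummy root
    labels = []
    for i in range(1, len(tags)):
        head = heads[i - 1]
        if head == i:
            raise ValueError(f"token {i} cannot be its own head")
        head_tag = tags[head]
        if head < i:
            # Head is to the left: count tokens with the head's tag
            # between the head (inclusive) and the current token.
            offset = -sum(1 for j in range(head, i) if tags[j] == head_tag)
        else:
            # Head is to the right: count in the other direction.
            offset = sum(1 for j in range(i + 1, head + 1) if tags[j] == head_tag)
        labels.append((offset, head_tag, rels[i - 1]))
    return labels


def decode(pos: List[str], labels: List[Label]) -> List[int]:
    """Recover head positions from labels (assumes well-formed labels)."""
    tags = [ROOT_TAG] + pos
    heads = []
    for i, (offset, head_tag, _rel) in enumerate(labels, start=1):
        if offset == 0:
            raise ValueError(f"token {i} has a zero offset")
        step = -1 if offset < 0 else 1
        remaining = abs(offset)
        j = i
        while remaining > 0:
            j += step
            if not 0 <= j < len(tags):
                raise ValueError(f"label of token {i} points outside the sentence")
            if tags[j] == head_tag:
                remaining -= 1
        heads.append(j)
    return heads


# Hypothetical toy sentence: "She takes control"
pos = ["PRON", "V", "N"]
heads = [2, 0, 2]
rels = ["NSUBJ", "ROOT", "DOBJ"]

labels = encode(pos, heads, rels)
print(labels)                       # the label for "control" is (-1, 'V', 'DOBJ')
print(decode(pos, labels) == heads)
```

Because the offset counts only tokens carrying the head's PoS tag, the decoder needs the PoS tags of the sentence to map labels back to head positions. A label predicted by a tagger may be malformed, which is why the sketch raises an error instead of guessing a head.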
Furthermore, it has been shown that constituency parsing can benefit from MTL setups [10]. Hence, our model learns the dependency labels in a 2-task setup: one task consists of learning $(p_i, h_i)$, since these are the most closely related elements of the tuple, and the second task consists of learning the dependency relation $r_i$. Additionally, we explore whether constituency parsing as an auxiliary task can improve the performance of dependency parsing as the main task.
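The 2-task decomposition can be illustrated by splitting each label into two tag sequences, one per task. The sketch below is only an assumption about how such a split could be represented: the helper names `split_tasks` and `merge_tasks`, the string format of the head tag (for example `-1_V`), and the sample labels are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Label = Tuple[int, str, str]  # (offset, PoS tag of the head, dependency relation)


def split_tasks(labels: List[Label]) -> Tuple[List[str], List[str]]:
    """Split full labels into a head-attachment task and a relation task."""
    head_task = [f"{offset:+d}_{tag}" for offset, tag, _rel in labels]
    rel_task = [rel for _offset, _tag, rel in labels]
    return head_task, rel_task


def merge_tasks(head_task: List[str], rel_task: List[str]) -> List[Label]:
    """Recombine the two predicted sequences into full labels."""
    if len(head_task) != len(rel_task):
        raise ValueError("both tasks must label the same number of tokens")
    labels = []
    for head_tag, rel in zip(head_task, rel_task):
        # Split on the first underscore only, so PoS tags may contain "_".
        offset, tag = head_tag.split("_", 1)
        labels.append((int(offset), tag, rel))
    return labels


# Hypothetical labels for a three-token sentence
labels = [(1, "V", "NSUBJ"), (-1, "ROOT", "ROOT"), (-1, "V", "DOBJ")]
head_task, rel_task = split_tasks(labels)
print(head_task)
print(rel_task)
print(merge_tasks(head_task, rel_task) == labels)
```

The motivation for the split is the size of the output vocabulary: with a single task, the number of distinct labels can grow with the product of the number of head tags and the number of relations, whereas with two tasks each output layer only has to cover its own set.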

Results
We evaluate our models on the English Penn Treebank [11]. We use the standard metrics: Unlabeled and Labeled Attachment Score (UAS/LAS). Table 1 shows that our single-task model provides a good trade-off between speed and accuracy in comparison with existing transition- and graph-based models. In Table 2 we show that our model achieves even better performance when applying MTL: dependency parsing as tagging is learned better when treated as a 2-task problem (S-MTL). Finally, the best result for dependency parsing is achieved when adding constituency parsing as an auxiliary task (D-MTL-AUX). More experiments on various languages, and the speeds obtained with the MTL approach, are presented in [12].
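For reference, the two metrics can be computed as follows. UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency relation to be correct. This is a minimal sketch with a hypothetical function name and toy data; it does not reproduce the evaluation script used for the reported results, and evaluation conventions such as punctuation handling are not modeled.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head position, dependency relation) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    total = len(gold)
    # UAS: only the head has to match.
    uas_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    # LAS: head and relation both have to match.
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy example: four tokens, one wrong relation and one wrong head
gold = [(2, "NSUBJ"), (0, "ROOT"), (2, "DOBJ"), (2, "ADVMOD")]
pred = [(2, "NSUBJ"), (0, "ROOT"), (2, "IOBJ"), (3, "ADVMOD")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")  # UAS = 75.0, LAS = 50.0
```

LAS can never exceed UAS, since every token counted for LAS also has a correct head.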

Discussion
We have obtained a fast and accurate dependency parsing method, showing that the dependency parsing problem can be reduced to a conceptually simple sequence tagging task where dependency trees are encoded into labels. In this way, our research has put emphasis not only on the accuracy of dependency parsing but also on its speed, to make it feasible to parse the large amounts of data available today.

Figure 1. An example of syntactic trees for the same sentence represented under the constituency and dependency formalisms. Below the trees, the labels for each token that encode them.

Table 1. Speed and accuracy of our model compared with existing dependency parsers on the PTB test set. Speeds are taken from the original papers.

Table 2. Unlabeled Attachment Score (UAS), Labeled Attachment Score (LAS) and speed on a single-core CPU for the MTL models on the PTB test set. S-S: single model; S-MTL: 2-task model; D-MTL-AUX: model with constituency parsing as auxiliary task.