Dependency Parsing with Transformed Feature

Abstract: Dependency parsing is an important subtask of natural language processing. In this paper, we propose transform-based parsing, an embedding feature transforming method for graph-based parsing that directly exploits the inner similarity of features to extract information from all feature strings, including the un-indexed ones, and thereby alleviates the feature sparsity problem. The model converts the extracted features into transformed features by applying a feature weight matrix, which consists of similarities between feature strings. Because similar feature strings usually make this matrix rank-deficient, the strength of the constraints may be affected. However, we prove that the duplicate transformed features do not degrade the optimization algorithm, the margin infused relaxed algorithm. Moreover, the problem can be alleviated by reducing the number of nearest transformed features considered for each feature. In addition, to further improve parsing accuracy, a fusion parser is introduced to integrate transformed and original features. Our experiments verify that both the transform-based parser and the fusion parser improve parsing accuracy compared to the corresponding feature-based parser.


Introduction
Recently, traditional models based on discrete features have been employed in many typical natural language processing tasks, such as named entity recognition [1], word segmentation [2] and semantic similarity assessment [3]. In dependency parsing, many feature-based parsers, such as ZPar [4] and DuDuPlus [5], have achieved high parsing accuracy. There are two main kinds of parsing algorithms: graph-based parsing [6][7][8][9] and transition-based parsing [10,11]. Graph-based parsing searches for the best dependency tree among all trees that can be formed with the words in a sentence. However, the huge number of candidate trees for a long sentence slows down the search for the best tree. Typical transition-based parsing makes the highest-scoring decision at each step of the parsing process, using a greedy strategy based on a shift-reduce algorithm. The greedy strategy makes its computational complexity much lower than that of graph-based parsing. Thus, much research focuses on integrating neural network structures into transition-based parsing and exploiting word embeddings. Chen and Manning [12] introduced a neural network classifier and continuous representations into a greedy transition-based dependency parser in order to improve the accuracy. Dyer et al. [11] employed stack long short-term memory recurrent neural networks in a transition-based parser and obtained another state-of-the-art parsing accuracy. Word embedding (distributed word representation) is a widely used method that represents words as lower-dimensional real vectors and is beneficial to parsers. Usually, it is learned by unsupervised learning approaches. Because the method is developed from the distributional hypothesis, the embeddings contain semantic and syntactic information.
Most parsers omit feature strings whose occurrence frequency is lower than a threshold; we denote these as un-indexed feature strings. The low frequency is partly due to infrequent words, but such strings may still carry useful information because some synonyms of the infrequent words are frequent. Therefore, to utilize un-indexed feature strings and alleviate the sparsity problem, we propose transform-based dependency parsing built on the graph-based parsing algorithm. Our work changes nothing in the traditional feature-based parser except for adding a transforming stage between the feature extracting stage and the learning or decoding stage. In this stage, a sparse feature weight matrix is employed to dynamically transform original features into transformed features. The parser thus uses word embeddings directly to form the feature weight matrix, so that low-frequency features are taken into account; this offers another way to incorporate embeddings effectively. Furthermore, we also propose a fusion parser that combines a transform-based parser and a feature-based parser. The experimental results show that our proposed parsers are more accurate than the corresponding feature-based parsers and indicate that un-indexed feature strings are beneficial to the parser.

Related Works
Le and Zuidema [13] proposed an infinite-order generative dependency model using an Inside-Outside Recursive Neural Network, which allows information to flow both bottom-up and top-down. Reranking with this model showed a competitive improvement in parsing accuracy. Zhu et al. [14] introduced a recursive convolutional neural network model to capture syntactic and compositional-semantic representations. By using the model to rank dependency trees in a list of k-best candidates, their parser achieved very competitive parsing accuracy. Chen and Manning [12] introduced a neural network classifier for a greedy, transition-based dependency parser, which used a small number of dense features by exploiting continuous representations in order to improve the accuracy. Bansal et al. [10] extracted bucket features by creating an indicator feature per dimension of the word vector with a discontinuity bucketing function, and used clustered bit-string features similar to Brown clustering in addition to traditional features, which improved the accuracy. Chen et al. [9] proposed a method that automatically learns feature embeddings on a large dataset parsed with a pre-trained parser, and improved parsing accuracy by combining feature embeddings with traditional features. Using their FE-based feature templates, they generated embedding features to improve the parsing accuracy. The features used in parsing are usually selected by their occurrence counts, for example, only features occurring more than once are used in Bansal et al. [10], to alleviate the sparsity problem and the influence of noise. However, the omitted features do contain useful information, because their infrequency may be caused by rare words, some of which have frequent synonyms.

Graph-Based Dependency Parsing
A dependency tree describes the relationships among the words in a sentence. All nodes in the tree are words. Figure 1 shows an example of a dependency tree, where an arc indicates a dependency relation from a head to a modifier and "ROOT" is an artificial root token. Table 1 describes the main symbols used here. In this section, we first introduce basic dependency parsing. Then, the transform-based dependency parsing is presented. We employ a feature-based parser with the graph-based parsing algorithm as our underlying parser. Given a segmented sentence $x$ and its dependency tree $y$, the score of $y$ for $x$ is defined as follows:

$$\mathrm{score}(x, y) = \omega \cdot f(x, y) \qquad (1)$$

where $\omega \in \mathbb{R}^{N}$ is a weight vector for the features, learned in a training process, and $f(x, y)$ is a feature vector (given data $x$ and $y$, we abbreviate it as $f$ for simplicity), in which features refer to indexed feature strings extracted from the sentence $x$ and the tree $y$. After the features are extracted, the Margin Infused Relaxed Algorithm (MIRA) [15,16] is adopted to learn the weight vector $\omega$ in the following way:

$$\omega^{*} = \arg\min_{\omega} \lVert \omega \rVert \quad \text{s.t.} \quad \mathrm{score}(x, y) - \mathrm{score}(x, \bar{y}) \ge L(y, \bar{y}), \quad \forall (x, y) \in T,\ \bar{y} \in ds(x) \qquad (2)$$

where $\omega^{*}$ is an optimized weight vector, $T$ is the training treebank, $ds(x)$ is the set of all feasible dependency trees of $x$, and $L(y, \bar{y})$ is the number of words with an incorrect parent in the predicted tree $\bar{y}$. The condition in Equation (2) states that the parser with the optimized $\omega^{*}$ separates $y$ from any other tree $\bar{y} \in ds(x)$ by a margin of no less than $L(y, \bar{y})$, and minimizing the cost function prevents the norm of the weight vector from "blowing up". Since every feature corresponds to a feature string indexed in a lookup map, we do not distinguish between the two unless the feature strings are un-indexed; un-indexed strings are numerous because of the huge number of incorrect dependency trees and because of feature clipping.
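The scoring function and the margin condition above can be illustrated with a minimal Python sketch. It is not the parser's actual training code: it assumes sparse features stored in dictionaries and uses the common single-constraint (1-best) online simplification of MIRA, in which only the highest-scoring incorrect tree is considered and the weight change is kept as small as possible. The names `score`, `hamming_loss` and `mira_update` are hypothetical.

```python
from collections import Counter


def score(weights, features):
    """Linear score: dot product of the weight vector and sparse feature counts."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def hamming_loss(gold_heads, pred_heads):
    """L(y, y_bar): number of words whose predicted head is wrong."""
    return sum(1 for g, p in zip(gold_heads, pred_heads) if g != p)


def mira_update(weights, gold_feats, pred_feats, loss):
    """Single-constraint MIRA step (an assumed simplification of the full problem).

    Finds the smallest weight change such that the gold tree outscores the
    predicted tree by at least `loss`.
    """
    diff = Counter(gold_feats)
    diff.subtract(pred_feats)          # f(x, y) - f(x, y_bar)
    margin = score(weights, gold_feats) - score(weights, pred_feats)
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0:
        return dict(weights)           # identical feature vectors: nothing to do
    tau = max(0.0, (loss - margin) / norm_sq)
    updated = dict(weights)
    for name, value in diff.items():
        updated[name] = updated.get(name, 0.0) + tau * value
    return updated
```

With the full constraint set of Equation (2), a quadratic programming solver is needed instead of this closed-form step.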

Feature Similarity Information
The vector $f$ is generated from the parts into which the dependency trees are factored, and its $i$th element $f_i$ is the number of occurrences of the corresponding indexed feature string $C_i$ extracted from the parts by applying the $tp(C_i)$th template. A feature template $T$ draws information from a part. For a second-order part, the templates normally consist of atomic templates such as $T_{w}^{h+j}$ and $T_{p}^{d+j}$, together with $T_{dhdc}$ and $T_{dhd}$, where the superscript $h+j$ or $d+j$ denotes the position of the $j$th word after the parent or the child, and the subscripts $w$, $p$, $dhd$ and $dhdc$ denote the types, namely, the surface word, the part-of-speech (POS) tag, the direction between $h$ and $d$, and the directions among $h$, $d$ and $c$, respectively. The $i$th feature template is then given by Equation (3), where the function $n_T(i, pt, itm)$ returns the number of atomic templates for the position $pt \in \{h, d, c\}$ and the type $itm \in \{w, p\}$ in the $i$th feature template, and the binary operator $\cup$ concatenates two input strings. Since $T_{w}^{h}$ may capture a word in one part and a synonym of that word in another part, feature strings extracted with the same feature template may be similar. Therefore, there are many related feature strings, each of which contains information about other feature strings.
Bansal et al. [10] proposed a new bucket feature template $B_k$ to identify the equivalence of two embeddings at the $k$th dimension. This introduces a similarity $\sigma_k(C_i, C_j)$ between two feature strings $C_i$ and $C_j$ extracted with the same feature template, given in Equation (4), where $T_{*}^{*}(C)$ returns the atomic string extracted with the atomic template $T_{*}^{*}$ from the feature string $C$, or an empty string if the template used does not contain $T_{*}^{*}$, and $B_k(w)$ returns the value of the embedding of $w$ at the $k$th dimension. Chen et al. [9] utilize feature embeddings instead of word embeddings and introduce a new feature embedding template that consists of the index of the drawn dimension, the feature template that extracts the feature string, and the drawing function $B_k$. This also defines a similarity $\sigma_k(C_i, C_j)$ between two feature strings $C_i$ and $C_j$, given in Equation (5), where $B_k(C)$ draws the value of the embedding of $C$ at the $k$th dimension. However, many feature strings are omitted (un-indexed) because they occur rarely or are extracted from incorrect trees, and they may carry useful information. For an un-indexed feature string, the embedding is unknown. Thus, we use its inner structure to calculate its similarity to the indexed features. Similar to the work of Bansal et al. [10], we exploit another method that computes the similarity directly via word embeddings, as given in Equation (6), where $\sigma_w$ calculates the similarity between two words via their embeddings.
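As a rough illustration of computing a similarity from the inner structure of feature strings, the sketch below compares two feature strings atom by atom. The exact form of Equation (6) is not reproduced here, so the combination rule is an assumption: strings from different templates get similarity 0, non-word atoms (POS tags, directions) must match exactly, and word atoms contribute the dot product of their embeddings. The representation of a feature string as a template id plus a list of `(kind, value)` atoms, and the names `word_similarity` and `feature_similarity`, are hypothetical.

```python
def word_similarity(embeddings, w1, w2):
    """sigma_w: dot product of the two word embeddings.

    Assumption: words without an embedding fall back to exact string match.
    """
    v1, v2 = embeddings.get(w1), embeddings.get(w2)
    if v1 is None or v2 is None:
        return 1.0 if w1 == w2 else 0.0
    return sum(a * b for a, b in zip(v1, v2))


def feature_similarity(embeddings, template_a, atoms_a, template_b, atoms_b):
    """Similarity of two feature strings (assumed product over word atoms)."""
    if template_a != template_b or len(atoms_a) != len(atoms_b):
        return 0.0
    similarity = 1.0
    for (kind_a, value_a), (kind_b, value_b) in zip(atoms_a, atoms_b):
        if kind_a != kind_b:
            return 0.0
        if kind_a == "w":                      # surface word: compare embeddings
            similarity *= word_similarity(embeddings, value_a, value_b)
        elif value_a != value_b:               # POS tag or direction: exact match
            return 0.0
    return similarity
```

Because only the words inside the string are needed, the same function applies to un-indexed feature strings, which have no embedding of their own.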

Transform-Based Parsing
Given training and testing data, we first assume that there are $N$ indexed feature strings and $U$ feature strings in total. Together with Equation (6), a feature weight matrix $\Theta$ is obtained by Equation (7) to measure the similarity between the feature strings. Thus, a transformed feature $\tilde{f}_i$ is represented as a row $\theta_i$ of the feature weight matrix $\Theta$. Like $f_i$, $\tilde{f}_i$ corresponds to the indexed feature string $C_i$. Similar to the score in Equation (1), the transform-based score of a dependency tree is

$$\widetilde{\mathrm{score}}(x, y) = \tilde{\omega} \cdot \tilde{f}(x, y) \qquad (8)$$

where $\tilde{\omega}$ is a weight vector for transformed features, and $\tilde{f}(x, y)$ is a transformed feature vector extracted from $x$ and $y$. The transformed features are generated by $\Theta \cdot \hat{f}(x, y)$, where $\hat{f}(x, y)$ is the extended extractor of $f(x, y)$ that returns the un-indexed feature strings as well. After substituting Equation (8) into Equation (2), the learning problem can be rewritten as Equation (9), where $\tilde{\omega}^{*}$ is an optimized weight vector for transformed features.
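The transformation can be sketched in Python as a sparse matrix-vector product. This is an illustrative sketch under assumptions, not the paper's implementation: the matrix is stored column by column in a dictionary (`theta_columns`, a hypothetical name) that maps every extracted feature string, indexed or not, to its non-zero similarities with indexed features.

```python
from collections import defaultdict


def transform_features(theta_columns, extracted_counts):
    """Compute Theta * f_hat for one tree.

    theta_columns: {feature_string: {indexed_feature_id: similarity}}
    extracted_counts: {feature_string: occurrence count}, including
        un-indexed strings.
    """
    transformed = defaultdict(float)
    for feature_string, count in extracted_counts.items():
        for index, similarity in theta_columns.get(feature_string, {}).items():
            transformed[index] += similarity * count
    return dict(transformed)


def transformed_score(weights, transformed):
    """Transform-based score: dot product of weights and transformed features."""
    return sum(weights.get(i, 0.0) * value for i, value in transformed.items())
```

For example, an un-indexed string such as `"w=hound"` can still contribute to the score through its similarity to an indexed string such as `"w=dog"`.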

Optimizing Influence
Because some feature strings are similar, the rank $K$ of $\Theta$ may be less than $N$ and $U$, so that $N - K$ rows of $\Theta$ can be represented by the other rows. We can find $K$ linearly independent rows and rearrange them into the first $K$ rows. Then, the $i$th ($i > K$) row $\theta_i$ of $\Theta$, which represents the $i$th transformed feature, can be constructed as follows:

$$\theta_i = \sum_{j=1}^{K} \alpha_{i,j}\, \theta_j \qquad (10)$$

where $\alpha_{i,j}$ is the coefficient of the basis $\theta_j$. Thus, $N - K$ transformed features are redundant, but it can be proven that this does not influence the optimization procedure of MIRA. For this purpose, Problem (9) can be rewritten as Problem (11), where $\Delta\tilde{\omega} = \tilde{\omega}^{*} - \tilde{\omega}$. The dual problem of Problem (11) can be written as Problem (12), with $D$ and $e$ defined in Equation (13). Since there are $N - K$ dependent transformed features, we first drop them to explore how Problem (12) changes. Let $D_K$ and $e_K$ be the dropped versions of $D$ and $e$. Since $\mathrm{rank}(D_K) = \mathrm{rank}(D)$, we have $D = H^{T} \cdot D_K \cdot H$, where $H$ is a full-rank matrix. Moreover, it can be proved that Problem (12) can be rewritten in the form of Problem (14), with a correspondingly transformed feasible region. Therefore, redundant transformed features introduce an additional transforming matrix $H^{-1}$ that transforms $e$ and rotates the feasible region. However, the dropped version (14) and the original version (12) can be solved with quadratic programming in the same way, for example with Hildreth's algorithm [17]. After $\lambda$ is calculated, $\Delta\tilde{\omega}$ can be computed directly as $A^{T} \cdot \lambda$. Therefore, redundant transformed features do not affect the optimization itself, but they reweight the constraints through $e_K^{T} \cdot H^{-1}$, which may hurt the performance.
This problem is alleviated by considering only the $Q$ nearest transformed features for each feature, which results in a very sparse matrix $\Theta$. When $Q$ is larger, the transformed features capture more information about feature similarity. Hence, when the similarity metric $\sigma(C_i, C_j)$ is reliable, $Q$ should be assigned a relatively large value to fully exploit the similarity between features.
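To make the quadratic programming step more concrete, the following Python sketch applies Hildreth's coordinate-wise procedure to a small problem of the assumed standard form: minimize half the squared norm of the weight change subject to linear margin constraints `A * dw >= b`. The definitions of `A` and `b` are assumptions chosen for illustration (one row of `A` per constraint, holding a feature difference vector), and the function name `hildreth` is hypothetical. The weight change is recovered as the transpose of `A` times the multipliers, as in the text.

```python
def hildreth(A, b, iterations=200, tol=1e-10):
    """Solve min 0.5 * ||dw||^2 s.t. A dw >= b by coordinate ascent on the dual.

    A: list of constraint rows (lists of floats); b: list of required margins.
    Returns (lam, dw), where dw = A^T lam and lam >= 0.
    """
    m, n = len(A), len(A[0])
    lam = [0.0] * m
    dw = [0.0] * n                      # kept equal to A^T lam throughout
    for _ in range(iterations):
        max_change = 0.0
        for i in range(m):
            row = A[i]
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue                # empty constraint row: skip
            slack = b[i] - sum(a * w for a, w in zip(row, dw))
            new_lam = max(0.0, lam[i] + slack / norm_sq)
            delta = new_lam - lam[i]
            if delta != 0.0:
                lam[i] = new_lam
                for j in range(n):
                    dw[j] += delta * row[j]
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return lam, dw
```

The procedure also runs when the constraint rows are linearly dependent, so that the Gram matrix of the constraints is rank-deficient; for two identical constraints it simply leaves the multiplier of the second one at zero.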
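Restricting each feature to its Q nearest transformed features amounts to keeping only the Q largest similarities per feature string. A minimal sketch, assuming one sparse vector of similarities per feature string stored as a dictionary (the name `keep_q_nearest` is hypothetical):

```python
import heapq


def keep_q_nearest(similarities, q):
    """Keep the q entries with the highest similarity; drop the rest.

    similarities: {indexed_feature_id: similarity}
    """
    top = heapq.nlargest(q, similarities.items(), key=lambda item: item[1])
    return dict(top)
```

A smaller `q` gives a sparser matrix and smaller caches, at the cost of discarding some similarity information.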

Fusion Decoding
Fusion has proved helpful in classification tasks [18,19]. Considering that original and transformed features describe different views of a span, we fuse them together for possible improvement. Like some reranking frameworks, such as the forest reranking algorithm [20], but without needing the K-best output, we employ fusion decoding to integrate original and transformed features. The scoring function of this fusion decoding combines the transform-based and feature-based scoring functions, where $\eta$ is a hyper-parameter for fusion decoding that balances the two.
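One plausible form of such a combined score is a linear interpolation, sketched below. The interpolation form is an assumption; it is chosen because it is consistent with the behaviour reported in the experiments, where the parser is purely feature-based at η = 0 and purely transform-based at η = 1. The function name `fusion_score` is hypothetical.

```python
def fusion_score(feature_score, transformed_score, eta):
    """Interpolate the feature-based and transform-based scores.

    eta = 0 gives the pure feature-based score,
    eta = 1 gives the pure transform-based score.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return eta * transformed_score + (1.0 - eta) * feature_score
```

Because the combination is applied to scores rather than to final trees, the decoder can search with the fused score directly instead of reranking a K-best list.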

Implementation of Transform-Based Parsing
To verify the effectiveness of the transformed features and compare them with the base features in a graph-based parsing algorithm, we adopt a projective parser learned with MIRA to build the transform-based parser. Without loss of generality, we only consider first-order features over dependency parts [6] and second-order features over grandchild and sibling parts [21,22]. The training process of the transform-based parser can be divided into the following stages.
1. Building Feature Lookup Map: An indexed set $\{C_i\}_{i \in [1, |C_T|]}$ of feature strings is generated by enumerating the feature strings whose occurrence frequency in the training corpus is greater than five and assigning their indexes in order of occurrence. They then form a lookup map that identifies the indexed feature strings.
2. Caching Sentence Transformed Features: The feature strings of every sentence in the training corpus and of its possible dependency trees are extracted. Then, for each feature string $C^{*}$ (indexed or un-indexed), the column $\theta_{C^{*}}$ of $\Theta$ is constructed and cached, where the similarity $\sigma_w$ between words $w_1$ and $w_2$ is defined as the dot product of their embeddings.
3. Training: The parameters are learned with MIRA. Graph-based dependency parsing is known to be time-consuming.

In the stage of Caching Sentence Transformed Features, a feature string $C$ may occur many times and may be un-indexed. Moreover, the number of indexed feature strings is usually several million. Thus, the vector $\theta_C$ should be calculated dynamically with Equation (6) and cached to accelerate the extraction.
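The first two stages can be sketched in Python as follows. This is an illustrative sketch, not the actual implementation: `build_lookup_map` indexes the feature strings that occur more than five times, in order of first occurrence, and `ColumnCache` memoizes the similarity column of each feature string so that it is computed only once. Both names, and the `compute_column` callback that stands in for Equation (6), are hypothetical.

```python
from collections import Counter


def build_lookup_map(corpus_feature_strings, min_count=5):
    """Index feature strings occurring more than `min_count` times.

    corpus_feature_strings: iterable of lists of feature strings, one list
        per sentence. Indexes follow the order of first occurrence.
    """
    counts = Counter()
    first_seen = []
    for sentence_strings in corpus_feature_strings:
        for feature_string in sentence_strings:
            if feature_string not in counts:
                first_seen.append(feature_string)
            counts[feature_string] += 1
    lookup = {}
    for feature_string in first_seen:
        if counts[feature_string] > min_count:
            lookup[feature_string] = len(lookup)
    return lookup


class ColumnCache:
    """Compute the similarity column of a feature string once, then reuse it."""

    def __init__(self, compute_column):
        self._compute_column = compute_column   # stand-in for Equation (6)
        self._cache = {}

    def get(self, feature_string):
        if feature_string not in self._cache:
            self._cache[feature_string] = self._compute_column(feature_string)
        return self._cache[feature_string]
```

Caching by feature string rather than by sentence means that a frequent un-indexed string is compared against the indexed strings only once.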

Experiments
In this paper, the transform-based parser is compared with the feature-based parser on the English Penn Treebank (PTB) and the Chinese Treebank (CTB) version 5.0 [23] with gold-standard segmentation and POS tags. As the feature transform stage can be integrated into parsers of any order, without loss of generality, we only conduct the experiments on second-order parsers.

Experiments Setup
An open-source conversion utility, Penn2Malt, with head rules compiled by Zhang and Clark [24], is employed to generate CoNLL dependencies on the CTB data. The same split of CTB5 as defined by Zhang and Clark [24] is adopted, except that the training corpus is smaller than the original because of the time and storage consumed by training. In order to preserve the statistical characteristics of the original training corpus, we construct the training corpus by selecting every fifth sentence of the original training corpus; the split is shown in Table 2. For English, we use the head rules provided by Yamada and Matsumoto for the conversion, and the POS tags are predicted by the Stanford POS tagger (accuracy ≈ 97.2%). There are several popular models for efficiently and effectively learning word vectors, such as Global Vectors for word representation (GloVe) [25], skip-gram with negative sampling (SGNS) in Word2Vec [26,27] and syntax relative skip-gram with negative sampling (SSGNS) [28].
According to experiments reported by Suzuki et al. [29], GloVe and SGNS provide similar performance in many tasks. We therefore consider only two models, SGNS and SSGNS, in this paper. The training dataset for the embeddings is the Daily Xinhua News Agency part of the Chinese Gigaword Fifth Edition (LDC2011T13), segmented by the Stanford Word Segmenter [30], which is trained with the training dataset in Table 2. Since a word may have different POS tags with different meanings, attaching the POS tag of a word to the word itself may reduce ambiguity in word representation learning. Hence, the Stanford POS tagger [31] is employed to tag the words in the training dataset.
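Attaching a POS tag to each word before learning embeddings can be sketched in a few lines of Python. The `word_TAG` token format and the function name `attach_pos_tags` are assumptions made for illustration; the source does not specify the exact format used.

```python
def attach_pos_tags(tagged_sentence, separator="_"):
    """Turn (word, tag) pairs into single tokens such as 'book_NN'.

    The resulting tokens can be fed to an embedding trainer so that the
    same surface word with different tags gets different vectors.
    """
    return [f"{word}{separator}{tag}" for word, tag in tagged_sentence]
```

With this preprocessing, a noun reading and a verb reading of the same word are treated as two distinct vocabulary entries.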

Results and Analysis
We employ a feature-based parser with second-order parsing built by Ma and Zhao [32] as the baseline (the parser is a fourth-order parser, but we constrain it to second order because all experiments are conducted under the second order). Parsing accuracies are measured by the unlabeled attachment score (UAS), the percentage of words with the correct head, and the labeled attachment score (LAS), the percentage of words with the correct head and label; both ignore any token with the PU (punctuation) tag.
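The two metrics can be computed as in the following sketch, which assumes that each gold token is given as a `(head, label, pos_tag)` triple and each predicted token as a `(head, label)` pair; the function name `attachment_scores` and this data layout are hypothetical.

```python
def attachment_scores(gold, predicted, punct_tag="PU"):
    """Return (UAS, LAS) in percent, ignoring tokens tagged as punctuation.

    gold: list of (head, label, pos_tag) per token
    predicted: list of (head, label) per token, aligned with gold
    """
    scored = uas_hits = las_hits = 0
    for (gold_head, gold_label, pos_tag), (pred_head, pred_label) in zip(gold, predicted):
        if pos_tag == punct_tag:
            continue                      # punctuation is excluded from scoring
        scored += 1
        if gold_head == pred_head:
            uas_hits += 1
            if gold_label == pred_label:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * uas_hits / scored, 100.0 * las_hits / scored
```

Note that a token counts towards LAS only if its head is also correct, so LAS can never exceed UAS.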

Parsing Accuracy
At first, we evaluated our parser without considering the un-indexed features, but the accuracies of transform-based parsing were lower than those of the baseline. Henceforth, the experiments are done with the un-indexed features by default. The final scores evaluated on the test corpus are shown in Table 3. In the table, Trans is the same as the transition-based parser with stack long short-term memory (LSTM) introduced by Dyer et al. [11], except that it adopts the arc-eager transition system in order to use dynamic oracles [33], and dropout to prevent over-fitting; OrgEM0 represents the parser with the embedding dictionary learned by the SGNS model; OrgEM1 represents the parser with the dictionary based on the SSGNS model; MixOrgEM0 represents the fusion pipeline with OrgEM0. The UAS (LAS) of OrgEM0 is 0.757% (0.804%) higher than that of Baseline, and the scores of OrgEM1 are slightly better than those of OrgEM0. Similarly, the UAS (LAS) of OrgEM0 is 0.579% (0.324%) higher than that of Trans. The fusion parsers MixOrgEM0 and MixOrgEM1 integrate feature information and further improve the parsing accuracy, with UAS and LAS increasing by as much as 1.175% and 1.284%, respectively. For English, we observe results similar to those for Chinese. The UAS (LAS) of OrgEM0 is 0.617% (0.545%) higher than that of Baseline. The fusion versions improve the accuracy as well. Therefore, the transform-based parser improves parsing accuracy, and the SSGNS model performs better than the SGNS model in transform-based parsing because it takes word order into account.

The parameter selection is conducted on CTB, and we use the same parameters in the English experiments. The two hyper-parameters (Q, win) are determined by finding the maximum score on the development corpus. η is searched in every experiment in the fusion pipeline because enumerating it is cheap. These experiments use embedding dictionaries learned with the SGNS model, trained with the Word2Vec toolkit.

Firstly, for the hyper-parameter win, the search range is [1,10]. The parsing accuracies on the development corpus are compared in Figure 2. The transform-based parser improves the parsing accuracy with the embedding dictionaries for all window sizes. The best performance is achieved at win = 3, where the UAS and LAS are 1.08% and 0.95% higher than those of the baseline, respectively. Figure 3 indicates that the fusion pipeline can integrate the original and transformed features to improve the parsing performance further. The fusion parser is a pure feature-based parser (baseline) at η = 0, and a pure transform-based parser at η = 1. The best performance of the fusion pipeline is achieved at (win = 1, η = 0.8), where UAS and LAS are 1.16% and 1.14% higher than the scores of the baseline, respectively. Therefore, the window size is chosen as win = 1 in the following experiments.

Secondly, for the hyper-parameter Q, the search range is [5,100] because of limited computing resources. The results are shown in Figure 4. The larger Q is, the more relevant transformed features are retrieved for a feature, and the larger the transformed feature caches of the sentences become. However, the score reaches its maximum at Q = 10 and decreases as Q increases further. This indicates that a larger Q may introduce more noise into the transformed feature caches of the sentences, which is partially caused by noise in the word representations. Figure 5 shows the scores of the fusion pipeline. The biggest UAS improvement is achieved at (Q = 40, η = 0.6). Thus, we select the hyper-parameters (win = 1, Q = 40) in the following experiments.

In addition, for the parser with the SSGNS model, UAS and LAS are 85.07% and 82.7%, respectively. They are higher than the corresponding scores of the parser with the SGNS model. The dictionary learned by the SSGNS model considers word order and may contain more syntactic information. Therefore, the parser benefits from a dictionary with more syntactic information. Figure 6 shows that the scores of the fusion pipeline are also higher than those with the SGNS model.

Conclusions
In this paper, we propose a transform-based parser that utilizes transformed features instead of original features in order to exploit all feature strings. The transformed features are generated by using a feature weight matrix. It is proved that the redundant weights of transformed features, which are caused by the rank deficiency of the feature weight matrix, do not affect the learning of MIRA. The results show that the scores of the transform-based parsers with un-indexed feature strings are higher than those of the corresponding feature-based parser. Because original and transformed features play the same role in the learning algorithm, the fusion parser can be constructed simply by integrating the information of original and transformed features. The results show that the fusion parser outperforms the transform-based parser in terms of parsing accuracy.
Although our parser is a second-order parser, the approach can be adopted in fourth-order or even higher-order parsers and in some reranking frameworks to improve parsing accuracy. With word embeddings, the improvement in generalization is at the surface word level, and more investigation will be conducted to improve generalization at the level of the inner syntactic structure. In the future, we plan to customize the similarity function between feature strings to utilize the semantic and syntactic information provided by further embeddings.

Figure 2. The parsing accuracy under different win.

Figure 3. The fusion parsing accuracy under different win and η.

Figure 4. The parsing accuracy and cache size under different Q.

Figure 5.

Figure 6. The fusion parsing accuracy with the SSGNS model under different η.

Table 1. List of symbols used in this paper.
$x_i$ : the $i$th word of a sentence $x$
$T$ : feature template
$C$ : feature string
$C_i$ : the $i$th indexed feature string
$tp(C_i)$ : index of the feature template generating $C_i$
$C_T$ : set of indexed feature strings extracted from $T$, $\{C_i\}_{i \in [1, N]}$
$ds(x)$ : set of all feasible dependency trees of $x$
$L(y, \bar{y})$ : loss function for $\bar{y}$ from $y$
$\tilde{f}(\cdot)$ : transformed feature vector function
$\tilde{f}_i(\cdot)$ : transformed feature indicating function corresponding to $C_i$
$f(\cdot)$ : feature vector function
$f_i(\cdot)$ : feature indicating function corresponding to $C_i$
$N$ : number of indexed feature strings
$U$ : number of total feature strings
$c$ : window size for word embedding
$\eta$ : hyper-parameter for fusion decoding
$Q$ : number of nearest transformed features for a feature
$\omega$ : weight vector for features
$\tilde{\omega}$ : weight vector for transformed features
$\Theta$ : feature weight matrix
$\theta_i$ : $i$th row of $\Theta$
$\sigma_w(\cdot)$ : similarity between two words via embeddings
$\sigma(\cdot)$ : similarity between two feature strings

Table 2. The data split for training, testing and development.

Table 3. Results of the test corpus.