Bidirectional Long Short-Term Memory Network with a Conditional Random Field Layer for Uyghur Part-Of-Speech Tagging

Abstract: Uyghur is an agglutinative and morphologically rich language, which makes natural language processing tasks in Uyghur challenging. Word morphology is important in Uyghur part-of-speech (POS) tagging. However, POS tagging performance suffers from error propagation from morphological analyzers. To address this problem, we propose several models for POS tagging: conditional random fields (CRF), long short-term memory (LSTM), bidirectional LSTM networks (BI-LSTM), LSTM networks with a CRF layer, and BI-LSTM networks with a CRF layer. These models do not depend on stemming and word disambiguation for Uyghur and combine hand-crafted features with neural network models. State-of-the-art performance on Uyghur POS tagging is achieved on test data sets using the proposed approach: 98.41% accuracy on 15 labels and 95.74% accuracy on 64 labels, which are 2.71% and 4% improvements, respectively, over the CRF model results. Using engineered features, our model achieves further improvements of 0.2% (15 labels) and 0.48% (64 labels). The results indicate that the proposed method could be an effective approach for POS tagging in other morphologically rich languages.


Introduction
Part-of-speech (POS) tagging, which is a fundamental task in natural language understanding, has attracted considerable attention from researchers for various languages. In computational linguistics, this task involves labeling words in sentences with a unique POS tag according to their syntactic function in context. It plays an important role in natural language processing (NLP) and has been widely applied to high-level NLP tasks such as syntactic analysis, named entity recognition, and machine translation [1]. With the rise of social media and the development of electronic communication in Xinjiang, China, a large quantity of digital text in Uyghur is currently being produced. The information extracted from these texts can be used for different NLP tasks such as POS tagging for Uyghur. Because Uyghur is an agglutinative and morphologically rich language, this is an extremely challenging and interesting task. At present, state-of-the-art POS tagging accuracy is approximately 97% for English [2][3][4][5], approximately 96% for Chinese [6][7][8][9] on news text, and approximately 96.85% for Uyghur [10].
In Uyghur, words can be broadly divided into independent words, function words, and exclamatory words. Independent words include verbs and substantive words. Nouns, adjectives, numerals, quantifiers, pronouns, adverbs, and mimetic words belong to the class of substantive words [11]. Function words include three kinds of words: conjunctions, prepositions, and particles. Uyghur is an agglutinative language, meaning that several affixes (e.g., denoting person, number, case, or mood) are frequently attached to one word stem. Independent word affixes are divided into two main types: verbal affixes and substantive affixes. There are 150 verbal affixes and 65 different substantive affixes, which include 49 noun affixes, 57 numeral affixes, and 55 adjective affixes. In theory, the number of various combinations of nominal affixes is 1502. However, according to recent statistical analysis [10], only 368 combinations appear in practice. For instance, there are 21 different affix variants of the word weqe ("accident," "event," or "incident") in the corpus used in this paper (as shown in Table 1). It is necessary to perform morphological analysis of Uyghur words before POS tagging. If POS tagging is performed without stemming, different variants of the same word will be identified as different words, and a large number of unknown words will appear instead of different morphosyntactic variants (as shown in Table 1). To fully understand this issue, consider the following sentence (in Latin script):
alimjan ulugh alimimiz mehmud qeshqeri tughulghinining 1000-yilliqini xatirilesh ilmiy muhakime yighinida söz qildi.
Translation: Speech delivered by Alimjan at the 1000th anniversary conference of great scholar and lexicographer Mahmud al-Kashgari.
In this example, alim appears twice, as alimjan ("a person name") and alimimiz ("our scholar," or "our scientist"), and both instances are nouns. If alim, which is the more frequently used form, is the only form that appears in the training corpus, POS tagging would identify alimimiz as out of vocabulary (OOV). Unfortunately, (i) there is no open source stemming tool, (ii) the development cost of such a tool is high because the Uyghur language is a low-resource language and it has agglutinative and rich morphological features, and (iii) the performance of stemming affects the performance of POS tagging.
To address this problem, we propose embedding words and characters and using syllable features in a bidirectional long short-term memory network with a conditional random field layer (BI-LSTM-CRF). This method combines handcrafted features with a neural network model and is described in Section 3.5. The experimental results are described in Section 4.
However, when morphological analysis is performed, the number of ambiguous phenomena may increase. For example, at means "name", "horse", or "shoot" and is either a verb or a noun, atqin means "shoot it" and is a verb, and atlar means "horses" and is a noun. After stemming, these words become the ambiguous word at, which is quite difficult to distinguish. To our knowledge, there is no study that addresses this problem. As mentioned earlier, in the Uyghur language, different word classes take different affixes (common affixes are also present). For example, Uyghur nouns are inflected for number (singular and plural), case (nominative, accusative, dative, locative, ablative, genitive, similitude, locative-qualitative, limitative, equivalence) [11], and person (first, second, third), and verbs are conjugated for tense (present and past), person, voice (causative and passive), aspect (continuous), and mood. In addition, these affixes are typically attached to the stem in a relatively fixed order, e.g., the general order of attachment for nouns is number, person, and case. For example, the word atlirimning means "my horses'" and can be split into at (stem) + lir (plural) + im (first person) + ning (genitive case). Therefore, the affixes and their order in a word may indicate the class to which the word belongs. It is thus better to use intra-word information to capture syntactic and semantic information in Uyghur POS tagging. We obtain word shape information in our proposed model using character embedding, which is described in Section 3.6.
Several studies on Uyghur POS tagging employ a small POS tag set; however, only a few studies consider a large POS tag set that can support high-level NLP tasks with richer information. Moreover, most existing Uyghur POS tagging models are linear statistical models, such as hidden Markov models (HMMs), maximum entropy models (MEMs), and n-gram models, all of which are limited to using only past and future features.
Our main contributions in this paper are as follows: (i) We apply long short-term memory (LSTM) networks, bidirectional LSTM (BI-LSTM) networks, an LSTM network with a conditional random field layer (LSTM-CRF), and the BI-LSTM-CRF model to Uyghur POS tagging. We experimentally compare the performance of the models on Uyghur POS tagging data sets and show that this task can be effectively performed by neural networks and that competitive tagging accuracy can be obtained without handcrafted features. Moreover, we show that because the BI-LSTM-CRF model considers word- and sentence-level information and can fully use past and future input features, it is an effective method of performing the POS tagging task in morphologically rich languages. (ii) For the first time, we examine the performance of easily applied engineered features, such as syllable- and suffix-based features, with character embedding and word embedding in Uyghur POS tagging and further improve the performance. (iii) We demonstrate that our approach can achieve state-of-the-art performance on small and large tag sets.

Related Works
In recent years, several POS tagging approaches have been developed. Collobert et al. [12] proposed a learning algorithm that can be applied to POS tagging; their system learns internal representations on a large unlabeled training data set instead of exploiting man-made features. Its results on common data sets indicate that such an approach performs well. Ptaszynski and Momouchi [13] applied a handcrafted dictionary to Ainu POS tagging. Evaluation on a training set provided positive results. Zheng, Chen, and Xu [14] explored the feasibility of performing Chinese POS tagging using a deep learning method, in which a multilayer neural network [15] is used to discover relevant features in input sentences. In addition, dos Santos et al. [2] proposed a convolutional neural network that learns the character-level representation of words and then associates them with a word-level representation to perform POS tagging. The evaluation of the system on the Wall Street Journal and Mac-Morpho corpora obtained accuracies of 97.32% and 97.47%, respectively. Labeau, Löser, and Allauzen [16] introduced a POS tagging application that can infer word representations from a character stream without using any man-made features. Pan, Yan, Zhou, Yu, and Guo [17] presented a Khmer automatic POS tagging method based on a cascaded CRF model that achieved an accuracy of 95.44% on an open corpus. Abdulkareem and Tiun [18] designed and implemented several POS tagging models (such as k-nearest neighbor, naïve Bayes, and decision tree models) for Arabic tweets and achieved an accuracy of 87.97%.
POS tagging for Uyghur has drawn attention in recent years. For instance, Tahir, Tursun, and Rozi [19] attempted to label POS automatically by adopting a bigram model based on an HMM. Their tag set was designed for a speech synthesis system; data smoothing and unknown words were not considered. Najmidin, Mamat, and Ibrahim [20] presented n-gram-based POS tagging for Uyghur texts. The parameters and data smoothing of the n-gram model were analyzed, and the efficiencies of bigram and trigram models were compared. Wang, Zu, and Litifu [21] investigated functional suffix strings and discussed the feasibility of POS tagging. Their results indicate that such a method is useful for Uyghur and other Turkic languages. To capitalize on the context features, Imam, Maimaiti, Ibrayim, and Abdurixit [22] employed perceptron training and Viterbi algorithms for POS tagging. Palidan and Fang [10] presented a maximum-entropy-based POS tagging model that combines morphological features to handle multi-category word POS tagging and the data sparsity problem caused by inflection. Their results show that the suffix feature significantly improves the form type and unknown word tagging accuracy compared with other feature-based tagging models. Our model differs from the above models in that we use a powerful BI-LSTM-CRF network, which performs better than conventional statistical models, and in that we use word- and character-level BI-LSTMs to collect longer context information and to extract more useful character-level features for the Uyghur language.

CRF Model
A CRF [23] is an undirected graphical model that has been successfully applied in several sequence labeling tasks including word segmentation, POS tagging, and named entity recognition. The CRF model can avoid the limited feature selection of HMMs and MEMs by considering the correlations between labels in neighborhoods [4]. Furthermore, it can acquire a global optimum via a process of global feature normalization.
Let the observation sequence to be labeled be $S = \{s_1, s_2, \cdots, s_n\}$, where $s_i$ is the vector of the $i$th word, and let $L = \{l_1, l_2, \cdots, l_n\}$ be a sequence of labels for $S$, where $l_i$ is the label of the $i$th word. The linear-chain CRF model can then be written as

$$p(L \mid S; W, b) = \frac{\prod_{i=1}^{n} \psi_i(l_{i-1}, l_i, S)}{\sum_{L' \in \phi(S)} \prod_{i=1}^{n} \psi_i(l'_{i-1}, l'_i, S)},$$

where $\psi_i(l', l, S) = \exp\left(W_{l',l}^{T} s_i + b_{l',l}\right)$ is the potential function corresponding to a label pair $(l', l)$, $W_{l',l}$ is the weight vector, $b_{l',l}$ is the bias, and $\phi(S)$ denotes the set of possible label sequences for $S$.
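The definition above can be checked numerically on a toy problem. The following is a minimal sketch, not the CRFSharp implementation used in the experiments: it evaluates the linear-chain probability by brute-force enumeration of all label sequences, assuming a fixed start label `l_0 = 0` and made-up values for `W`, `b`, and `S` (all of these are illustrative assumptions).

```python
import itertools
import math


def potential(W, b, prev_label, label, s_i):
    """psi_i(l', l, S) = exp(W[l'][l] . s_i + b[l'][l])."""
    score = sum(w * x for w, x in zip(W[prev_label][label], s_i))
    return math.exp(score + b[prev_label][label])


def sequence_potential(W, b, labels, S, start):
    """Product of the pairwise potentials along one label sequence."""
    total, prev = 1.0, start
    for label, s_i in zip(labels, S):
        total *= potential(W, b, prev, label, s_i)
        prev = label
    return total


def crf_probability(W, b, labels, S, num_labels, start=0):
    """p(L | S): potential of L divided by the sum over all label sequences."""
    numerator = sequence_potential(W, b, labels, S, start)
    partition = sum(
        sequence_potential(W, b, candidate, S, start)
        for candidate in itertools.product(range(num_labels), repeat=len(S))
    )
    return numerator / partition


# Toy setup (assumed values): 2 labels, 2-dimensional word vectors, 3 words.
W = [[[0.5, -0.2], [0.1, 0.3]],
     [[-0.4, 0.2], [0.6, 0.0]]]
b = [[0.0, 0.1], [0.2, -0.1]]
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = {L: crf_probability(W, b, L, S, num_labels=2)
         for L in itertools.product(range(2), repeat=len(S))}
```

Enumeration is exponential in the sentence length and is only meant to make the normalization explicit; practical CRF implementations compute the denominator with dynamic programming.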

LSTM Model
An LSTM network is a special kind of recurrent neural network that is capable of learning long-term dependencies and can retrieve rich global information. An LSTM unit uses a series of multiplicative gates, such as input, output, and forget gates, and a memory cell to control the information flows in and out of the internal states of the network [24]. In addition, it determines the information that should be discarded or sent to the next time step. There are several slightly different versions of LSTM; here, we present a vanilla LSTM with the structure given in Figure 1.
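For time $t$, the multiplicative gates and memory cell are defined as follows:

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right),\\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right),\\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right),\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,\\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right),\\
h_t &= o_t \odot \tanh(C_t),
\end{aligned}
$$

where $\sigma(\cdot)$ is the non-linear sigmoid function, $x_t$ is the input at time $t$, $\odot$ denotes element-wise multiplication, and $f$, $i$, $o$, $C$, and $h$ are the vectors of the forget gate, input gate, output gate, memory cell, and hidden state, respectively. These vectors have the same size. Moreover, $W_f$, $W_i$, $W_o$, and $W_C$ denote the weight matrices and $b_f$, $b_i$, $b_o$, and $b_C$ represent the bias vectors.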
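To make the gate computations concrete, the following minimal sketch implements a single step of a vanilla LSTM in plain Python. It assumes the common formulation in which each gate applies its weight matrix to the concatenation of the previous hidden state and the current input; the `params` layout and the helper names are hypothetical, and the sketch is not the implementation used in our experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One vanilla LSTM step.

    params maps the gate names 'f', 'i', 'o', 'C' to (W, b) pairs, where each
    W has one row per hidden unit and len(h_prev) + len(x_t) columns.
    """
    z = list(h_prev) + list(x_t)  # concatenation [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)        # forget gate
    i_t = gate("i", sigmoid)        # input gate
    o_t = gate("o", sigmoid)        # output gate
    c_tilde = gate("C", math.tanh)  # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# With all-zero parameters every gate equals 0.5 and the candidate memory is 0,
# so the new memory cell is half of the previous one.
zero = ([[0.0] * 5 for _ in range(2)], [0.0, 0.0])
params = {name: zero for name in ("f", "i", "o", "C")}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.1, -0.1], [1.0, -2.0], params)
```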

Bidirectional LSTM Model
In sequence labeling tasks, it is beneficial to employ the previous and future input features over a given duration. However, the hidden state in a single forward LSTM captures previous features only and does not consider the future. Therefore, an elegant solution is BI-LSTM [25], which can be regarded as a stack of two LSTM layers. The previous features are extracted by a forward LSTM layer, and the future features are captured by a backward LSTM layer. In this way, we can effectively utilize the previous and future features; this alleviates the ambiguity problem mentioned in Section 1.
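The bidirectional scheme can be sketched independently of the recurrent cell. In the minimal example below, `step_fwd` and `step_bwd` are placeholders for the two LSTM layers (any function mapping an input and a previous state to a new state); the toy running-sum step is an assumption chosen only to make the data flow easy to follow.

```python
def run_direction(step, inputs, h0):
    """Apply a recurrent step left to right; return the state at every position."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(step_fwd, step_bwd, inputs, h0_fwd, h0_bwd):
    """Concatenate, per position, the forward state and the backward state."""
    forward = run_direction(step_fwd, inputs, h0_fwd)
    # Run over the reversed sequence, then restore the original order.
    backward = run_direction(step_bwd, inputs[::-1], h0_bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def running_sum(x, h):
    """Toy stand-in for an LSTM step: the state is the sum of inputs seen so far."""
    return [h[0] + x]


outputs = bidirectional(running_sum, running_sum, [1, 2, 3], [0], [0])
```

At each position the first component summarizes the current and preceding inputs and the second summarizes the current and following inputs, which is the property the BI-LSTM exploits.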

LSTM-CRF Model
For practical applications, the combination of a linear statistical model with a neural network has been proposed to prevent the problem that the performance of a neural network is largely determined by data. We implemented an LSTM-CRF [4] model consisting of an LSTM network and a CRF model. The basic idea is to use the LSTM layer to consider the previous input features and obtain sentence-level tag information from the CRF layer. Therefore, the output is an optimal tag sequence instead of mutually independent tags.
Formally, $X = \{x_1, x_2, \cdots, x_n\}$ represents a generic input sequence, $y = \{y_1, y_2, \cdots, y_n\}$ represents the tag sequence for $X$, and $P$ denotes an $n \times k$ probability matrix, where $k$ is the number of tag types. The optimal tag sequence can be obtained by maximizing the target function

$$s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}},$$

where $P_{i,j}$ is the probability that the $i$th word is tagged as the $j$th tag and $A$ is the state-transition matrix, where element $A_{i,j}$ is the probability of transferring from the $i$th tag to the $j$th tag.
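Maximizing a target function of this kind over all tag sequences is typically done with the Viterbi algorithm. The sketch below is a minimal illustration under the assumption that the score of a sequence is the sum of per-word entries of `P` and tag-to-tag entries of `A`, without special start or end tags; the entries are treated as arbitrary scores and the toy values are made up.

```python
import itertools


def sequence_score(P, A, tags):
    """s(X, y): per-word tag scores plus transition scores between neighbors."""
    emission = sum(P[i][t] for i, t in enumerate(tags))
    transition = sum(A[a][b] for a, b in zip(tags, tags[1:]))
    return emission + transition


def viterbi(P, A):
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(P), len(P[0])
    best = list(P[0])   # best[j]: best score of a prefix ending in tag j
    backpointers = []
    for i in range(1, n):
        new_best, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][j])
            new_best.append(best[prev] + A[prev][j] + P[i][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    tags = [last]
    for pointers in reversed(backpointers):
        tags.append(pointers[tags[-1]])
    tags.reverse()
    return tags, best[last]


# Toy scores (assumed values): 3 words, 3 tags.
P = [[1.0, 0.2, 0.1], [0.3, 0.9, 0.4], [0.2, 0.1, 1.5]]
A = [[0.1, 0.5, -0.3], [0.0, 0.2, 0.7], [-0.5, 0.3, 0.1]]
best_tags, best_score = viterbi(P, A)
```

The dynamic program visits each word once and compares all tag pairs, so its cost grows linearly with sentence length rather than exponentially as in exhaustive search.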

BI-LSTM-CRF Model
Similar to the LSTM-CRF model, the BI-LSTM-CRF model is constructed from a BI-LSTM network and a CRF model. The output vectors of BI-LSTM are fed into CRF using the structure given in Figure 2.

Word Embeddings
In NLP, word embeddings [26], also known as distributed word representations, can capture the semantic and syntactic features of a word and reduce the requirement for handcrafted features [2]. We use randomly initialized word embeddings with 256 dimensions.

Character Embeddings
According to dos Santos et al. [2] and Lample et al. [27], character-level representations can extract morphological features from words and are extremely useful, particularly for morphologically rich languages. Our proposed method is similar to that of Lample et al. [27]; we randomly initialize a character lookup table with every character and feed every character embedding of the words from the character lookup table to the bidirectional LSTMs. Then, we concatenate the forward and backward representations to derive character-level word representations, which have 50 dimensions. Finally, we concatenate this representation with the word-level representation from a word lookup table to form the final word embeddings.
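The assembly of the final word representation can be illustrated with a short sketch. It only shows the data flow (lookup, two directional passes, concatenation): `toy_encode` is a simple element-wise recurrence standing in for an LSTM pass, and the character embedding size and the 25 + 25 split of the 50 character-level dimensions are assumptions, not values taken from our configuration.

```python
import math
import random

WORD_DIM = 256     # word embedding size stated in the text
CHAR_DIM = 25      # assumed size of a character embedding
CHAR_HIDDEN = 25   # assumed size per direction, giving 50 after concatenation


def make_lookup(symbols, dim, rng):
    """Randomly initialized lookup table: symbol -> vector."""
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


def toy_encode(vectors):
    """Stand-in for an LSTM pass: final state of an element-wise recurrence."""
    h = [0.0] * CHAR_HIDDEN
    for v in vectors:
        h = [math.tanh(0.5 * a + b) for a, b in zip(h, v)]
    return h


def word_representation(word, word_table, char_table):
    """Word-level vector followed by forward and backward character summaries."""
    chars = [char_table[c] for c in word]
    forward = toy_encode(chars)
    backward = toy_encode(chars[::-1])
    return word_table[word] + forward + backward


rng = random.Random(0)
vocab = ["alim", "alimjan", "alimimiz"]
char_table = make_lookup(sorted(set("".join(vocab))), CHAR_DIM, rng)
word_table = make_lookup(vocab, WORD_DIM, rng)
rep = word_representation("alimimiz", word_table, char_table)
```

Because the character-level part is computed from the spelling, words that share a stem or suffix obtain related character-level summaries even when their word-level vectors are unrelated.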

Engineered Features
We built a candidate feature set to determine which features have the strongest influence on POS tagging. The candidate feature set consists of 11 features, as described below:
1. $F_{wc}$: The word feature, which represents the word itself.
2. $F_{suffix}$: The suffix of the word. We collected 153 unique suffixes.
3. $F_{s}$: The word without its suffix. The performance of the morphological analysis affects the accuracy of this stemming feature. However, we discovered that a word without a suffix typically corresponds to its stem. Hence, we substituted words without suffixes for stems.
4. $F_{st}$: The classification of the suffix. The set of suffixes is divided into eight types according to the POS type.
5. $F_{bt}$: The first-level POS label. We use this feature only when the tag set consists of 64 tags, and it is predicted using a 15-label tagging model with shared parameters.
In POS tagging, the syllable features of the words are crucial. It is better to extract syllables than stems in morphologically rich languages such as Uyghur. More accurate morphological information can be obtained through appropriate selection of syllables as features. Uyghur words can be composed of multiple syllables, which we express as follows:

$$\mathit{word} = \mathit{syl}_1\, \mathit{syl}_2 \cdots \mathit{syl}_n,$$

where $n$ is the number of syllables in a word. For most words, the initial syllables mainly describe the semantic information, which can be used to reduce the number of OOV words. The ensuing syllables mainly describe the suffix information, which can be used to distinguish different contexts. Given such considerations, we performed a statistical analysis of syllables to select the best syllable features. The statistics of the number of syllables per word that appear in our corpus are summarized in Table 2.
According to Table 2, the majority of Uyghur words have a syllable length of seven or less. Moreover, the length of the syllables that contain grammatical information is between one and four. Therefore, depending on the syllable length of the current token, we extract features with different syllable lengths. These syllable features are listed below.

6. $F_{f2}$: The first two syllables of a word.
7. $F_{la}$: All syllables except the first one.
8. $F_{l4}$: The last one to four syllables of a word.
9. $F_{l3}$: The last one to three syllables of a word.
10. $F_{l2}$: The last one or two syllables of a word.
11. $F_{l1}$: The last syllable of a word.
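The six syllable features can be sketched as simple slices over a syllabified word. The snippet below is an illustration under stated assumptions: the syllables are supplied by an external syllabifier, each feature is the joined string of the selected syllables, and "the last one to k syllables" is read as the last min(k, n) syllables of an n-syllable word. The function name and the hand-made syllabification of atlirimning are hypothetical.

```python
def syllable_features(syllables):
    """Return the six syllable-based features for one syllabified word."""
    n = len(syllables)

    def last(k):
        # Assumed reading: take at most k trailing syllables.
        return "".join(syllables[-min(k, n):])

    return {
        "F_f2": "".join(syllables[:2]),   # first two syllables
        "F_la": "".join(syllables[1:]),   # all syllables except the first
        "F_l4": last(4),
        "F_l3": last(3),
        "F_l2": last(2),
        "F_l1": last(1),
    }


# Hand-made syllabification used only for illustration.
features = syllable_features(["at", "li", "rim", "ning"])
short = syllable_features(["at"])
```

For a one-syllable word such as at, all trailing-syllable features collapse to the word itself and `F_la` is empty under this reading.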

Data Sets
At present, there is no widely known uniform specification for Uyghur POS tagging sets; however, there have been several attempts to establish a tagging standard. For example, Xinjiang Laboratory of Multi-Language Information Technology and Xinjiang Normal University have independently created their tagging standards.
Xinjiang Laboratory of Multi-Language Information Technology created a manually annotated Uyghur POS tagging corpus that contains over 1.2 million tokens. Its tag set uses 15 first-level POS labels (as shown in Table 3), 71 second-level POS labels, and 51 third-level POS labels. Here, we use the corpus of Xinjiang Laboratory of Multi-Language Information Technology and its first- and second-level POS tag sets for modeling and conducting experiments (only 64 labels are used for the second-level POS tag set in this work; all punctuation marks are classified as one tag). The corpus statistics are summarized in Table 4.

Results and Discussion
This section presents the results of training CRF (http://github.com/zhongkaifu/CRFSharp), LSTM, LSTM-CRF, BI-LSTM, and BI-LSTM-CRF with identical feature sets. For CRF training, the window size is 5 for $F_{wc}$ and 3 for other features; all other parameters are maintained at their default values. We used stochastic gradient descent with a fixed learning rate of 0.01 and a dropout rate of 0.5. Therefore, the differences in the results are entirely due to the different models.

Selection of Engineered Features
To determine which features are distinctive and more effective in POS tagging, we studied the effects of different engineered features on the CRF model.
Table 5 shows the accuracy of POS tagging for the CRF model with different feature combinations. Clearly, not all features are equally effective: a few features contribute more to the accuracy than others. For instance, the accuracies achieved using the combinations <$F_{wc}$ + $F_{s}$ + $F_{suffix}$> and <$F_{wc}$ + $F_{s}$ + $F_{suffix}$ + $F_{st}$> are not significantly different. Hence, to limit resource usage and speed up training, not all features are employed in the follow-up experiments; the combination <$F_{wc}$

Comparison of Different Models
To assess the influence of models with respect to word and character features, we trained each model with the word and character features from the same data set. In addition, in Section 3.6.3, we presented several engineered features that were shown to be effective in Section 4.2.1. We then used these features jointly with the word embedding feature; each feature corresponds to a 30-dimensional embedding vector. Let $F_{wn}$ be the word feature and $F_i$ the $i$th engineered feature. The sample can be defined as follows:

$$x = F_{wn} \oplus F_1 \oplus F_2 \oplus \cdots \oplus F_m,$$

where $m$ is the number of engineered features used and $\oplus$ is the direct connection (concatenation) operation. After the concatenation operation, the sample contains word features, e.g., semantic and syntactic information, and additional morphological information. We also trained CRF, LSTM-CRF, and BI-LSTM-CRF models with engineered features.
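The joint use of word and engineered features amounts to looking up one 30-dimensional vector per feature value and appending it to the word vector. The following minimal sketch illustrates this; the `FeatureEmbedder` class, the random initialization range, and the chosen feature names and values are hypothetical and serve only to show the resulting dimensions.

```python
import random

WORD_DIM = 256     # word embedding size stated in the text
FEATURE_DIM = 30   # each engineered feature maps to a 30-dimensional vector


def random_vector(dim, rng):
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


class FeatureEmbedder:
    """One randomly initialized lookup table per engineered feature (hypothetical)."""

    def __init__(self, feature_names, seed=0):
        self.rng = random.Random(seed)
        self.tables = {name: {} for name in feature_names}

    def lookup(self, name, value):
        table = self.tables[name]
        if value not in table:  # create the vector on first use
            table[value] = random_vector(FEATURE_DIM, self.rng)
        return table[value]


def build_sample(word_vector, features, embedder):
    """Concatenate the word vector with the embedding of every engineered feature."""
    sample = list(word_vector)
    for name, value in features.items():
        sample.extend(embedder.lookup(name, value))
    return sample


embedder = FeatureEmbedder(["F_s", "F_suffix", "F_l2"])
word_vector = random_vector(WORD_DIM, random.Random(1))
features = {"F_s": "at", "F_suffix": "lar", "F_l2": "atlar"}
sample = build_sample(word_vector, features, embedder)
```

With three engineered features, the sample has 256 + 3 × 30 = 346 dimensions; each additional feature adds another 30.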
To verify the effectiveness of the proposed method, we compare the results of different models in Table 6.
Table 6 provides the accuracies of the proposed models. It shows that the BI-LSTM-CRF model achieves accuracies of 98.41% and 95.74% on the 15- and 64-label test data sets, respectively, which are clearly higher than the accuracies of CRF and the other models. The reason for this phenomenon is that the CRF model typically requires several features, such as spelling and morphological features, to achieve good performance. In contrast, the LSTM-based models, i.e., BI-LSTM, LSTM-CRF, and BI-LSTM-CRF, are more robust and less reliant on non-word features. When additional features are used, BI-LSTM-CRF outperforms CRF and LSTM-CRF and obtains the highest accuracy for every data set. For 15 labels, the accuracies of BI-LSTM-CRF are 0.41% and 0.45% higher than those of CRF, and for 64 labels, the accuracies are 0.67% and 0.66% higher than those of CRF. There is no significant difference between LSTM-CRF and BI-LSTM-CRF. The performance of all models is further improved after the engineered features are added. The improvement for 64 labels is significant, namely 0.48% on the test set with the BI-LSTM-CRF model. This is because the engineered features effectively reduce data sparseness and provide rich morphological information. This phenomenon demonstrates that it is useful to add a few handcrafted features to the BI-LSTM-CRF and LSTM-CRF models when the training data set is limited and the tag set is extremely large, and that, for morphologically rich languages such as Uyghur, it could be more effective to jointly consider syllable- or morpheme-based representations that are larger than a character and carry rich morphological information.

Comparison with Different Configurations
In this experiment, to understand the behavior of BI-LSTM-CRF under different conditions, we performed an error analysis on the test set. Specifically, we partition each data set into in-vocabulary (IV) words, out-of-vocabulary (OOV) words, multi-category (MC) words, and in-vocabulary-and-single-category (IVASC) words. A word is considered IV if it appears in both the training and testing (or development) sets, and OOV words are those that appear in the testing (or development) set but not in the training set. MC words are those that can represent more than one part of speech in the whole data set, while IVASC words are the IV words that have only one part of speech. The statistics of the partition on each corpus are shown in Table 7. We can see in Table 8 that, for both tag sets, the Word-Char + Feature model performs best, followed by the Word-Char model. For the OOV words, the Char-based architectures, i.e., Char, Word-Char, and Word-Char + Feature, achieve large improvements over the baseline. This demonstrates that adding character-based embeddings makes the BI-LSTM-CRF model more powerful on OOV words. For the IV words, the Char architecture achieves only small improvements over the baseline on 15 labels, while on 64 labels the Char architecture is worse than the baseline. Interestingly, the character-level embeddings seem to have opposite effects on MC words. The Baseline + Feature model is competitive with the BI-LSTM-CRF models on MC words. We can also see in the results that engineered features largely improve the system performance, especially when dealing with OOV words. We think that the features capture quite different information and complement each other well. This result suggests that, for Uyghur POS tagging, the selected engineered features are very effective for both CRF and BI-LSTM-CRF models.
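The partition into IV, OOV, MC, and IVASC words can be expressed with a few set operations. The sketch below is an illustration under stated assumptions: data sets are lists of (word, tag) pairs, "the whole data set" is taken to be the union of the training and test data passed in, the MC set is restricted to words occurring in the test data, and the tags `N` and `V` as well as the tiny corpus are made up.

```python
from collections import defaultdict


def partition_words(train, test):
    """Split the test vocabulary into IV, OOV, MC, and IVASC word sets."""
    tags_of = defaultdict(set)
    for word, tag in list(train) + list(test):
        tags_of[word].add(tag)

    train_vocab = {word for word, _ in train}
    test_vocab = {word for word, _ in test}

    iv = test_vocab & train_vocab    # seen in training and test
    oov = test_vocab - train_vocab   # seen in test only
    mc = {w for w in test_vocab if len(tags_of[w]) > 1}   # more than one POS
    ivasc = {w for w in iv if len(tags_of[w]) == 1}       # IV with a single POS
    return {"IV": iv, "OOV": oov, "MC": mc, "IVASC": ivasc}


# Toy corpus with hypothetical tags.
train = [("at", "N"), ("at", "V"), ("atlar", "N"), ("alim", "N")]
test = [("at", "N"), ("alim", "N"), ("alimimiz", "N")]
subsets = partition_words(train, test)
```

Accuracy on each subset can then be computed by restricting the evaluation to test tokens whose word form belongs to that subset.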

Conclusions
We studied the POS tagging problem as a sequence labeling problem. We applied LSTM network-based models to Uyghur POS tagging and reported state-of-the-art tagging accuracy on small and large tag sets. Instead of using engineered features, the proposed method uses word- and character-based representations that capture morphological and orthographic information and achieves better accuracy than the CRF model, which relies heavily on handcrafted features and domain-specific knowledge. Furthermore, carefully selected engineered features were used to further improve the results for the CRF and BI-LSTM-CRF models.

Table 2. Syllable¹ statistics.
¹ A rule-based Uyghur syllabification tool (a web service interface is also available) developed by the natural language processing group of Xinjiang Laboratory of Multi-Language Information Technology is used. Its accuracy is over 99%. We can provide it to anyone for research purposes.

Table 3. First-level POS tagging set for Uyghur.

Table 6. Performance of the different models (%). "Word only" refers to the BI-LSTM-CRF model that uses word embeddings only, "Char only" refers to the BI-LSTM-CRF model that uses character-level embeddings only, and "+ Feature" refers to the model that uses the engineered features discussed in Section 4.2.1.

Table 7. Statistics of the partition on each corpus.

Table 8 illustrates the performances of BI-LSTM-CRF models on different subsets of words. The results of the CRF model are provided as a baseline.

Table 8. POS results on the test data set with BI-LSTM-CRF using different configurations (%). "Word" refers to word embeddings, "Char" refers to character-level embeddings, "Word-Char" refers to a combination of the previous two architectures, and "+ Feature" refers to the engineered features discussed in Section 4.2.1.