Learning Subword Embedding to Improve Uyghur Named-Entity Recognition

Abstract: Uyghur is a morphologically rich and typical agglutinating language, and morphological segmentation affects the performance of Uyghur named-entity recognition (NER). Common Uyghur NER systems use the word sequence as input and rely heavily on feature engineering. However, when only the word sequence is considered, semantic information cannot be fully learned, and the model easily suffers from data sparsity arising from morphological processes. To solve this problem, we provide a neural network architecture employing subword embedding with character embedding based on a bidirectional long short-term memory network with a conditional random field layer. Our experiments show that subword embedding can effectively enhance the performance of Uyghur NER, and the proposed method outperforms the model based on the word sequence.


Introduction
Many scholars study named-entity recognition (NER) because of its importance to natural language processing. NER uses sequence labeling to automatically recognize entities in text, including persons, locations, and organizations. Using deep learning, NER has achieved good performance for languages with large-scale datasets, such as English [1,2] and Chinese [3,4]. Strengthening the information construction of ethnic minority languages is a driving force in the development and social advancement of China. However, for Uyghur, an ethnic minority language in China, NER as a fundamental information-construction task still requires vast improvements. The main problem is that Uyghur is a morphologically rich and typical agglutinating language, in which a word may take many different forms through the attachment of affixes. This complex and rich morphology leads to extremely sparse data. Moreover, unlike in English, named entities do not begin with capital letters that could serve as distinctive features. Additionally, only a small quantity of annotated corpora exists, and there is no public corpus for Uyghur NER.
Currently, most research on Uyghur NER has adopted statistical machine-learning methods, including conditional random fields (CRF) [5] and hybrid approaches [6,7]. These methods depend excessively on handcrafted features and domain-specific knowledge resources, and the process of collecting such features and resources is inefficient and expensive. To avoid heavy feature engineering, our objective is to provide a neural network architecture that employs subword and character embedding based on a bidirectional long short-term memory (LSTM) network with a CRF layer to improve Uyghur NER performance.

Related Works
NER for English and German on the CoNLL-2003 dataset has drawn the attention of many researchers. Traditionally, NER systems have employed machine-learning techniques, including CRF [8], hidden Markov models [9], and support vector machines [10]. Handcrafted features and domain-specific knowledge resources (e.g., a manually annotated dataset) are needed as inputs to train these models.
With advances in deep learning, neural network models for sequence labeling have been applied very successfully to high-performance NER tasks. Collobert et al. [11] adopted an architecture based on convolutional neural networks (CNNs) with CRFs to solve sequence-tagging problems, which improved the performance and significantly reduced the dependency on task-specific engineering. Huang et al. [12] proposed a bidirectional (bi) LSTM with a CRF layer, achieving 90.10% F1 with both Senna embedding and gazetteer features. Chiu et al. [13] proposed a bi-LSTM-CNN architecture to detect word- and character-level features. It outperformed methods that relied on heavy feature engineering and achieved fairly good performance on CoNLL-2003 and OntoNotes 5.0. Lample et al. [1] presented a bi-LSTM-CRF architecture that obtained effective information from character-based word embedding. Rei et al. [14] presented an architecture that combines character-based word embedding with word embedding by using an attention mechanism, surpassing the architecture based on concatenating the word- and character-level representations. Ma et al. [2] offered a bi-LSTM-CNN-CRF neural network architecture that automatically benefits from word- and character-level representations. Shen et al. [15] used deep active learning for NER, but only on small-scale labeled data. However, these approaches are not particularly applicable to morphologically rich and agglutinating languages, so several studies have adapted them to obtain better performance, employing morphological embedding [16], phonological character representations [17], and morpheme-level representations [18], respectively.
Research on Uyghur NER is still in its early stages and mainly focuses on particular entity types. For example, Tashpolat et al. [6] employed a CRF with rule-based post-processing to achieve high performance on Uyghur person-name recognition via the analysis of agglutinative characteristics. Maimaiti et al. [7] presented a CRF model with rules for Uyghur location-name recognition by introducing different handcrafted features, especially syllables and similar words obtained via word embedding. Maihefureti et al. [19] researched rule-based Uyghur organization-name recognition, which depended upon syntactic and semantic knowledge. Halike et al. [20] implemented the recognition of times, numerals, and quantifiers using an approach that relied on a manual rule library. Our approach is different because we simultaneously identify persons, locations, and organizations. Recent advances notwithstanding, a morphologically rich language such as Uyghur requires a combination of word- and character-level embedding as input features, instead of handcrafted features and domain-specific knowledge.

Methodology
In this section, we describe the proposed neural network architecture. The word-based neural model is introduced first; it is a bi-LSTM-CRF model that improves the performance of Uyghur NER. Then, we propose the subword-based neural model, which takes a sequence of subwords as input. To illustrate the architectures, we take a sentence in Uyghur Latin script as an example, "niGmEt beyjiNdiki turalGusida turwatidu", which means "niGmEt lives in Beijing".

Word-Based Neural Model
We first introduce the word-based neural model, following the models presented by Lample et al. [1]. Figure 1 shows the neural network architecture.
Recurrent neural networks (RNNs) are neural network language models used for processing sequential data. In principle, RNNs can capture long-distance dependencies by leveraging historical information. In practice, however, they are not very effective for NER because they suffer from vanishing and exploding gradients [21]. LSTMs [22] have been proposed to overcome these shortcomings by incorporating a memory cell while exploiting long-term dependencies. An LSTM cell uses several gates to regulate the proportion of information to be stored vs. forgotten. Greff et al. [23] explored eight LSTM variants based on Vanilla LSTM [24] on three representative tasks and compared their performance, concluding that Vanilla LSTM performs well in all applications, whereas none of the eight variants improved performance significantly. Therefore, we update the LSTM cell at time $t$ in the same way as Vanilla LSTM, where $\sigma$ is the logistic sigmoid function; $\odot$ indicates the point-wise product; $x_t$, $o_t$, and $c_t$ are the input, output, and cell vectors, respectively; $h_t$ is the hidden vector at time $t$; $W$ indicates the weight matrices of the different gates; and $b$ represents the bias vectors. Thus, $h_t$ is determined by the input vector, $x_t$, and the hidden vector at the previous time step, $h_{t-1}$.
For many sequence-labeling tasks (such as NER), both past and future information are beneficial for predictions. It is therefore advisable to use a bi-LSTM to capture contextual information from both directions. This method has been proven successful for many tasks [25].
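For reference, the LSTM update described above can be sketched in one widely used form. This sketch omits peephole connections, and the subscripted weight names ($W_{xi}$, $W_{hi}$, and so on), the input gate $i_t$, the forget gate $f_t$, and the candidate cell vector $\tilde{c}_t$ are notational assumptions introduced here for illustration; they are not necessarily the exact parameterization used in the experiments.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this form, the forget gate $f_t$ scales how much of the previous cell vector is kept, and the input gate $i_t$ scales how much of the new candidate is written, which is the "stored vs. forgotten" trade-off mentioned in the text.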
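For a sequence of vectors, $X = (x_1, x_2, \ldots, x_n)$, the bi-LSTM computes forward representations, $\overrightarrow{h_t}$, and backward representations, $\overleftarrow{h_t}$:

$$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})$$

Using the model, the final representation of each word is obtained by concatenating the forward and backward representations:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

To get better feature combinations, the bi-LSTM contains a hidden layer at the top, so that we can encode a more reliable pattern for each word:

$$d_t = \tanh(W_d h_t)$$

where $W_d$ is a weight matrix for the hidden layer.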
In general, there are two ways to estimate current labels. The first uses a softmax layer that acts as an output layer to make tagging decisions independently. The softmax function is a normalized exponential function that predicts, for every word, a probability distribution over all possible labels:

$$p(y_t = j \mid d_t) = \frac{\exp(W_{o,j} d_t)}{\sum_{j'=1}^{k} \exp(W_{o,j'} d_t)}$$

where $p(y_t = j \mid d_t)$ is the probability that the label of the $t$th word, $y_t$, is $j$; $k$ is the number of all possible labels; and $W_{o,j}$ is the $j$th row of the output weight matrix, $W_o$. During model training, the negative log-probability of the correct labeling sequence is minimized:

$$E = -\sum_{t=1}^{n} \log p(y_t \mid d_t)$$

NER tags in the "beginning-inside-outside" format are subject to strong constraints; for example, an inside-organization tag (I-ORG) cannot follow a beginning-location tag (B-LOC) or an outside tag (O). Thus, the softmax layer is insufficient. The second way uses a CRF layer, which scores the label sequence at the sentence level instead of decoding each label independently. Thus, CRF tagging is ideal for NER tasks. Given a sequence of predictions, $y = (y_1, y_2, \ldots, y_n)$, its score can be defined as

$$S(x, y) = \sum_{i=1}^{n} P_{i,y_i} + \sum_{i=2}^{n} T_{y_i,y_{i-1}}$$

where $P$ is the matrix of scores output by the bi-LSTM, $P_{i,y_i}$ is the score of assigning tag $y_i$ to the $i$th word, and $T_{y_i,y_{i-1}}$ is the score of a transition from tag $y_{i-1}$ to tag $y_i$ in a sentence. Over the course of training, the log-probability of the correct tag sequence is maximized:

$$\log p(y \mid x) = S(x, y) - \log \sum_{\tilde{y} \in Y_x} \exp(S(x, \tilde{y}))$$

where $Y_x$ represents the set of all possible tag sequences. In the test stage, we used the Viterbi algorithm to predict the output sequence with maximal conditional probability.
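The sentence-level score and the Viterbi decoding step can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments: the function names, the toy score values, and the convention that the transition matrix is indexed as `transitions[previous][current]` are all assumptions made for this example, and start or end transitions are not modeled.

```python
def sequence_score(emissions, transitions, tags):
    """S(x, y): emission scores of the chosen tags plus tag-to-tag transition scores."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[prev][cur] for prev, cur in zip(tags, tags[1:]))
    return score


def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(emissions), len(emissions[0])
    best = list(emissions[0])  # best[j]: best score of any path ending in tag j
    backpointers = []
    for i in range(1, n):
        new_best, pointers = [], []
        for cur in range(k):
            # Best previous tag for reaching `cur` at position i.
            prev = max(range(k), key=lambda p: best[p] + transitions[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][cur] + emissions[i][cur])
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores (hypothetical values): 3 words, 3 tags.
emissions = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.0]]
transitions = [[0.0, 1.0, -1.0], [-0.5, 0.5, 0.2], [0.3, -2.0, 0.1]]
best_path, best_score = viterbi(emissions, transitions)
```

Because the decoder maximizes the same score that `sequence_score` computes, a strongly negative transition score is enough to keep an invalid tag pair out of the predicted sequence.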
Information 2019, 10, x

Subword-Based Neural Model
The input vector in the traditional bi-LSTM-CRF model takes a word as its basic unit. However, Uyghur is an agglutinating language in which a word comprises a stem and affixes. If only the word vector is considered, the semantic information cannot be fully learned, and the model suffers from data sparsity arising from morphological processes. Therefore, we use morphological segmentation to exploit smaller meaning-bearing units and improve performance. Morphological segmentation breaks words into meaning-bearing subword units called morphemes [26]. Thus, Uyghur morphological segmentation allows us to break words into more familiar units that have been previously observed. Uyghur morphological segmentation falls into two categories: single-point and multi-point. Single-point segmentation splits a word into a stem and a suffix, whereas multi-point segmentation is more fine-grained and further splits the suffix produced by single-point segmentation. The following example illustrates the difference.

Latin Uyghur: niGmEt beyjiNdiki turalGusida turwatidu. (niGmEt lives in Beijing.)
Single-point segmentation: niGmEt beyjiN/diki turalGu/sida tur/watidu
Multi-point segmentation: niGmEt beyjiN/diki turalGu/si/da tur/watidu
In this study, we use three methods derived from the Xinjiang University & Iflytek Voice and Language Joint Laboratory for Uyghur morphology segmentation. The differences among the methods are shown in Table 1. To mitigate the data sparsity problem, we propose a bi-LSTM-CRF model based on the subword sequence. This model comprises bi-LSTM and CRF layers, but it is distinct from the traditional model, because its input sequence is changed, and a tag for each subword is independently predicted. Additionally, we introduce subword embedding with character embedding as the input vectors of this model. Figure 2 shows the model structure.
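As a small illustration of how a segmented sentence becomes the input sequence of the subword-based model, the sketch below flattens the single-point segmentation from the example into a list of subwords. The function name, the use of "/" as the boundary marker, and the word-index bookkeeping are assumptions made for this example; the segmentation tools themselves are not reproduced here.

```python
def to_subword_sequence(segmented_sentence, boundary="/"):
    """Flatten a segmented sentence into subwords and record each subword's source word index."""
    subwords, word_index = [], []
    for i, word in enumerate(segmented_sentence.split()):
        for piece in word.split(boundary):
            subwords.append(piece)
            word_index.append(i)
    return subwords, word_index


single_point = "niGmEt beyjiN/diki turalGu/sida tur/watidu"
subwords, word_index = to_subword_sequence(single_point)
# subwords is the sequence fed to the model; word_index lets subword-level
# predictions be mapped back to the original words.
```

Keeping the word index alongside each subword is one simple way to relate subword-level tags back to word-level entities when evaluating.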

Word Embedding
Word embedding (i.e., distributed word representation) has become popular with researchers because of its ability to simultaneously capture semantic and syntactic information about words from a large unlabeled corpus [27]. To obtain high-quality word embedding, instead of randomly initializing the embedding, we prepare pre-trained word embedding from a large-scale unannotated dataset developed at the Xinjiang University and Iflytek Voice and Language Joint Laboratory. The dataset contains 1,891,895 sentences and has a vocabulary of 2,461,449 tokens. We adopt the skip-gram model of word2vec, provided by Gensim (https://radimrehurek.com/gensim/index.html), to train the word embedding, which we refer to as "pre-trained."
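To make the skip-gram objective concrete, the sketch below enumerates the (center, context) training pairs that a skip-gram model learns from for one sentence. It is only an illustration in plain Python: the function name and the window size are assumptions (the window used in the experiments is not reported), and a library such as Gensim handles pair generation and the actual vector training internally.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every token and its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["niGmEt", "beyjiNdiki", "turalGusida", "turwatidu"]
pairs = skipgram_pairs(sentence, window=1)
```

The same procedure applies unchanged when the tokens are subwords rather than words, which is why subword embedding can reuse the word-embedding training setup.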


Subword Embedding
We used the above Uyghur morphology segmentation methods to process the annotated dataset and took subwords as the basic training units for a skip-gram model of word2vec, in the same way as for word embedding. Training subword embedding that carries semantic information in this way assumes that every subword can stand independently. After segmentation, the subword vocabulary sizes corresponding to the bi-LSTM, SRILM-Ngram, and MaxMatch methods are 2,034,757; 2,109,530; and 2,051,620, respectively.

Character Embedding
Additionally, character-level features carry abundant information about the structure of an entity. Character embedding is not only useful for morphologically rich languages, but it also alleviates the out-of-vocabulary problem [26]. First, we randomly initialize a character lookup table that contains an embedding for every character. The embeddings of the characters in a word are then fed to a bi-LSTM network in both the forward and backward directions. Finally, the concatenation of the forward and backward representations from the bi-LSTM is used as the character-level feature of the word.

Datasets
Our models were evaluated on a manually annotated Uyghur NER corpus created at the Multilingual Information Technology Laboratory of Xinjiang University [28]. It contains 39,027 sentences and 102,360 named entities. Person (PER), location (LOC), and organization (ORG) entities account for approximately 27.81%, 41.60%, and 30.58%, respectively. The entity labels are annotated using IOB notation. We used 10-fold cross-validation to validate performance, where the training (train), development (dev), and test (test) sets accounted for 80%, 10%, and 10%, respectively. The statistics of the dataset are shown in Table 2. Note: The numbers in parentheses indicate the number of non-repeating tokens or entities. Sentence, Token, and NE refer to the number of sentences, tokens, and named entities in each data set.
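One way to realize a 10-fold scheme with an 80%/10%/10% split is to rotate the folds so that, in each round, one fold is the test set, one is the development set, and the remaining eight form the training set. The sketch below shows this under stated assumptions: the function name is hypothetical, the choice of the next fold as the development set is an assumption, and shuffling of the sentences is omitted for brevity.

```python
def ten_fold_splits(n_items, n_folds=10):
    """Yield (train, dev, test) index lists; fold k is test, fold k+1 is dev, the rest are train."""
    folds = [list(range(f, n_items, n_folds)) for f in range(n_folds)]
    for k in range(n_folds):
        dev_fold = (k + 1) % n_folds
        test = folds[k]
        dev = folds[dev_fold]
        train = [i for f in range(n_folds) if f not in (k, dev_fold) for i in folds[f]]
        yield train, dev, test


# Toy corpus of 100 sentence indices.
splits = list(ten_fold_splits(100))
```

Each sentence appears in the test set exactly once across the ten rounds, so the averaged test score covers the whole corpus.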

Training and Evaluation
Our models were trained using a back-propagation algorithm that updated the parameters for every training example [1]. During the training phase, we prepared 300-dimensional pre-trained word or subword embeddings using the skip-gram model to initialize the model. We set the maximum number of epochs to 100. The dimensions of the forward and backward LSTMs were set to 100. We used stochastic gradient descent with a learning rate of 0.01 and a gradient clipping of 5.0 for optimization. We used dropout with a probability of 0.5 to avoid overfitting. The final dimension of our character-based embedding of words was 50. Uyghur NER performance is measured by the F1 score, the harmonic mean of precision and recall, computed on the test set.
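The F1 computation can be sketched as follows. The sketch assumes entity-level, exact-match scoring over IOB tags (an entity counts as correct only if both its span and its type match), which is a common convention but is not spelled out in the text; the function names and the toy tag sequences are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a list of IOB tags."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        # A new entity starts at B-X, or at I-X that does not continue the current type.
        begins = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype)
        if start is not None and (tag == "O" or begins):
            entities.add((start, i, etype))
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one tag sequence."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["B-PER", "B-LOC", "O", "O"]
pred = ["B-PER", "O", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this toy case the single predicted entity is correct (precision 1.0) but one of the two gold entities is missed (recall 0.5), so F1 is 2/3.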

Experimental Results and Discussion
Results for the different morphological segmentations of the subword-based neural model that only considers subword embedding are shown in Table 3. The best performance (89.02% F1) was obtained with the bi-LSTM-based Uyghur morphological segmentation. The F1 scores of the other segmentation methods did not show a significant improvement. The reason may be that the SRILM-Ngram- and MaxMatch-based morphological segmentation methods are types of multi-point segmentation, causing excessive segmentation that leads to ambiguity for Uyghur NER. Furthermore, the accuracy of these two segmentation methods was relatively low. Therefore, the bi-LSTM-based morphological segmentation was used in the subsequent experiments.
We conducted many experiments with different models to understand their influence on the Uyghur NER system. We explored the impact of using word/subword embedding and character-level embedding. The baseline results are from Wang et al. [29], who used a semi-supervised approach based on CRF. Table 4 compares the word-based and subword-based neural models. Compared to the baseline, the neural network model has a slight advantage. We found that, when only word or subword embedding was used as input, the F1 score of the subword-based method was higher. When character-level embedding was added, the neural network model improved by at least 0.5% over using word vectors alone. The word-based neural model with character-level embedding performed best for ORG. However, the average F1 scores show that the subword-based model was more suitable than the word-based models.

OOV Error Comparison with Different Models
To further understand the behavior of the subword-based neural model, we performed error analysis on the test set. Specifically, we divided each dataset into in-vocabulary (IV) entities, out-of-training-vocabulary (OOTV) entities, out-of-embedding-vocabulary (OOEV) entities, and out-of-both-vocabulary (OOBV) entities. An entity is considered OOBV if at least one of its words is not in the training set and at least one of its words is not in the embedding vocabulary. The other three subsets are defined analogously. Table 5 shows the statistics of the division of each corpus. Table 6 illustrates the performance of the subword-based and word-based neural models on the different subsets of entities. When comparing the performance of the CRF statistical model and the bi-LSTM-CRF neural network model for each entity category, the version with only word/subword embedding had some difficulty correctly recognizing OOEV named entities. This demonstrates that the neural network model largely depended on input embedding. However, the subword-based neural model with character embedding achieved a 2% improvement over the previous best OOBV result. Thus, almost all improvements of the subword-based neural model via embedding were conducive to Uyghur NER.
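The four-way division can be sketched as a simple membership check. This is one reading of the definitions above, stated as an assumption: an entity is assigned by whether any of its words is missing from the training vocabulary, the embedding vocabulary, or both. The function name and the toy vocabularies are hypothetical.

```python
def entity_subset(entity_words, train_vocab, embedding_vocab):
    """Assign an entity to IV, OOTV, OOEV, or OOBV based on which vocabularies miss its words."""
    missing_train = any(w not in train_vocab for w in entity_words)
    missing_embedding = any(w not in embedding_vocab for w in entity_words)
    if missing_train and missing_embedding:
        return "OOBV"
    if missing_train:
        return "OOTV"
    if missing_embedding:
        return "OOEV"
    return "IV"


# Toy vocabularies (hypothetical contents).
train_vocab = {"niGmEt", "beyjiN"}
embedding_vocab = {"niGmEt", "turalGu"}
labels = {
    "niGmEt": entity_subset(["niGmEt"], train_vocab, embedding_vocab),
    "beyjiN": entity_subset(["beyjiN"], train_vocab, embedding_vocab),
    "turalGu": entity_subset(["turalGu"], train_vocab, embedding_vocab),
    "tur": entity_subset(["tur"], train_vocab, embedding_vocab),
}
```

Under this reading the four subsets are mutually exclusive, so their sizes sum to the total number of test entities.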

Conclusions
In this paper, we presented a subword-based neural network model based on bi-LSTM-CRF for Uyghur NER, which does not require handcrafted features or any knowledge sources to capture linguistic information. In our experiments, we used different Uyghur morphology segmentations and obtained very promising results compared to the word-based neural model. Further, subword embedding was conducive to system performance when the accuracy of morphology segmentation was higher or when no excessive morphology segmentation occurred. Even though Uyghur is a morphologically rich and low-resource language, subword embedding is a simple and effective remedy to achieve state-of-the-art performance on such NER datasets. Further work should be done to evaluate subword embedding across other natural language processing applications, such as machine translation. Additionally, a better generic neural network model using cross-lingual embedding will be explored to deal with low-resource and agglutinating language processing.

Table 1. Different morphology segmentation methods for Uyghur.


Table 2. Statistics of the entity type for the Uyghur named-entity recognition (NER) dataset.

Table 4. Comparison of performance on different neural models (%). Bold indicates the best result among the listed models for each entity category.

Table 5. Statistics of the division on each corpus.

Table 6. Comparison of performance on different subsets of entities (%).