Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

Abstract: Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging can be an upstream task for other NLP tasks, further improving their performance. Therefore, it is important to improve the accuracy of POS tagging. In POS tagging, bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used and achieves good performance. However, Bi-LSTM is not as powerful as Transformer in leveraging contextual information, since Bi-LSTM simply concatenates the contextual information from left-to-right and right-to-left. In this study, we propose a novel approach for POS tagging to improve the accuracy. For each token, all possible POS tags are obtained without considering context, and then rules are applied to prune these possible POS tags, which we call rule-based data preprocessing. In this way, the number of possible POS tags of most tokens can be reduced to one, and those tokens are considered to be correctly tagged. Finally, the POS tags of the remaining tokens are masked, and a model based on Transformer is used to predict only the masked POS tags, which enables it to leverage bidirectional contexts. Our experimental result shows that our approach leads to better performance than other methods using Bi-LSTM.


Introduction
Part-of-Speech (POS) tagging is one of the most important tasks in the field of natural language processing (NLP). It assigns a POS tag to each word in a given sentence. For a short and simple sentence "I like dogs", a POS tagger can easily identify the word I as a pronoun, the word like as a verb, and the word dogs as a noun. However, some words in complex sentences are difficult to tag correctly by POS taggers. The same word in a different context has different POS tags, which makes POS tagging a challenging task.
POS tagging can be an upstream task for other NLP tasks, such as semantic parsing [1], machine translation [2], and relation extraction [3], to improve their performance. Hence, improving the accuracy of POS tagging becomes an important goal.
For example, the dependency parser in the Stanza pipeline [4] takes the result of POS tagging as part of its input because POS tagging is helpful for dependency parsing [5]. Although current POS taggers have achieved 97.3% token accuracy, the sentence accuracy is not as high [6]. This may cause a performance loss for the dependency parser because it utilizes the POS tags of all tokens to extract the dependency parse tree of a sentence; a wrong POS tag for even a single word may result in the extraction of a wrong tree.
In recent years, most POS taggers have used bidirectional Long Short-Term Memory (Bi-LSTM) [7,8] for POS tagging. In addition to word-level embeddings, they append other types of embeddings to improve the accuracy. However, Bi-LSTM is not as powerful as Transformer [9] in leveraging contextual information, since Bi-LSTM simply concatenates contextual information from left-to-right and right-to-left. With the self-attention mechanism, deep learning models based on Transformer may deliver performance gains for POS tagging.
In this paper, we propose a novel approach to improve the accuracy of POS tagging, which includes rule-based data preprocessing and a deep learning model. Figure 1 shows an example of POS tagging with our approach. During the rule-based data preprocessing, for each token, all possible POS tags are obtained without considering the context. Then, rules are applied to prune these possible POS tags. After pruning, most of the tokens possess only one candidate POS tag, and they are considered to be correctly tagged by the data preprocessing. In the inference phase, the POS tags of the remaining tokens are masked, and a deep learning model is responsible for predicting the masked POS tags. The model is based on the Transformer's encoder, and a POS embedding layer is introduced to accept the POS tags assigned by the data preprocessing, which is helpful in predicting the POS tags of the remaining tokens. By combining the rule-based data preprocessing and deep learning, we obtain the POS tags of all tokens.

Figure 1. POS tagging for the sentence "The expressions are formed using functions". In the step of pruning possible POS tags, the POS tag VBD of the token formed is eliminated after one of our rules is used to tag the verb following the token are as VBN or VBG. For the tokens using and functions, their POS tags cannot be determined and are therefore masked. In the inference phase, the masked POS tags are predicted by a deep learning model.
The contributions can be summarized as follows: (1) To further improve the accuracy of POS tagging, we propose a novel approach that combines rule-based methods and deep learning. (2) We implement a rule-based method to tag some portion of the words, which can enhance the performance of POS tagging when combined with deep learning. (3) The proposed method utilizes self-attention to capture dependencies between words at any distance. Moreover, we mask a certain portion of the POS tags, and the model only predicts the masked POS tags, which enables the model to better exploit global contextual information. (4) We evaluate our method on a public dataset, on which it achieves a per-token tag accuracy of 98.6% and a whole-sentence correct rate of 76.04%. Experimental results demonstrate the effectiveness of the method.
This paper is organized as follows. In the next section, we review related work. Section 3 provides details of the rule-based data preprocessing. Section 4 presents the structure of the deep learning model. Section 5 gives the experimental settings and evaluation. Finally, we conclude the paper with a summary in Section 6 and give an outlook on future work in Section 7.

Related Work
In this section, we review the related work on the Penn Treebank POS tagset, POS tagging, and Transformer, respectively.

Penn Treebank Tagset
The Penn Treebank POS tagset [10], which contains 36 POS tags and 12 other tags, is widely used to annotate large corpora of English (See Table 1). To tag words in a given sentence with specific POS tags, the Penn Treebank POS tagset is adopted in this paper.

POS Tagging
There exist different methods for POS tagging, such as rule-based methods, methods based on linear statistic models, and deep learning methods based on Bi-LSTM.
Brill [11,12] proposes a trainable rule-based POS tagger, which can automatically construct rules and use them to tag all tokens in a given sentence. However, this is difficult to apply to real data due to the complexity of natural languages. Some works are based on linear statistical models, such as Conditional Random Fields (CRF) [13] and Hidden Markov Models [14]. These statistical models perform relatively well on corpora tagged with a coarse-grained tagset, but they do not perform as well as Bi-LSTM on corpora tagged with a fine-grained tagset [15].
In recent years, methods for POS tagging have mainly been based on Bi-LSTM, since it is a powerful model that captures time dynamics via recurrence [16]. There are several methods for learning vector representations of words, such as Word2Vec [17], fastText [18], and GloVe [19]. Bi-LSTM takes the vector representations as input and leverages the semantic information in the representations to assign a POS tag to each element. Wang et al. [20] use Bi-LSTM for POS tagging; in addition to the word embedding layer, a function is introduced to indicate the original case of words. Ling et al. [21] propose a C2W model based on LSTM, which composes representations of characters into representations of words. Their experimental results show that the C2W model achieves better performance than word lookup tables in POS tagging. Plank et al. [22] also use Bi-LSTM as a base model for POS tagging, where the input includes not only word-level embeddings but also character-level embeddings. The POS tagger in the Stanza pipeline [4] adopts a highway Bi-LSTM [23] with inputs coming from the concatenation of three sources: (1) a pretrained word embedding; (2) a trainable frequent word embedding; (3) a character-level embedding [24], and uses affine classifiers for each type of tag [25]. The above methods [4,20–22] improve the accuracy of POS tagging by enriching the input information.
To further improve the accuracy of POS tagging, some works [26–28] combine Bi-LSTM with CRF, since CRF can learn sentence-level tag information. In addition to CRF, Bi-LSTM can be integrated with adversarial neural networks to extract better features [29,30], which can also improve the accuracy. POS tagging with Bi-LSTM typically requires a large number of annotated samples. To address the lack of large numbers of training samples, some works [31–33] apply transfer learning to POS tagging.

Transformer
Vaswani et al. [9] propose Transformer, a network architecture different from recurrent neural networks (RNNs) and LSTMs, which is based solely on the self-attention mechanism and eschews recurrence and convolutions.
The self-attention mechanism in Transformer enables it to efficiently capture dependencies between words at any distance. The input to the self-attention function is composed of queries, keys of dimension d_k, and values. The queries, keys, and values are mapped into three representations Q, K, and V with three linear layers, and the attention is then computed on Q, K, and V.
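Concretely, the scaled dot-product attention of Vaswani et al. [9] is computed as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
```

The scaling factor 1/sqrt(d_k) counteracts the growth of the dot products with the key dimension, keeping the softmax in a well-conditioned regime.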
For NLP, Transformer architecture has become the de-facto standard [34] thanks to the self-attention mechanism. In particular, some pretrained language models based on Transformer, such as BERT [35] and its variant RoBERTa [36], have achieved state-of-the-art results on different NLP tasks. They belong to masked language modeling, which is more powerful than standard conditional language models in utilizing both left-to-right and right-to-left contextual information.
Considering the power of the Transformer, we propose to build a model for POS tagging based on Transformer.

Rule-Based Data Preprocessing
The rule-based data preprocessing, which consists of three steps, can acquire the POS tags of most of the tokens. The first step is producing all possible POS tags of each token. This is followed by pruning, where the context is considered to reduce the number of candidate POS tags of each token. The last step is masking the POS tags that are prepared for model training or inference.

Producing All Possible POS Tags
Without considering the context, all possible POS tags of each token are acquired through the processes of lemmatization and transformation. After these two processes, the correct POS tag in a given context is assured to be among the possible POS tags.
In the process of lemmatization, each token is lemmatized based on simple word deformation rules. Specifically, each token is converted into lemmas by modifying its suffixes. Then, a dictionary containing only lemmas and their basic POS tags (See Table 2) is used to check whether the lemmas are correct. For instance, the process of lemmatization for the token watches is as follows. Firstly, we check whether the token itself is a lemma. According to the dictionary, the token watches cannot be a lemma. Secondly, various possible ways of editing suffixes are tried, such as deleting the suffix s to acquire the token watche, but the token watche does not exist in the dictionary and, thus, cannot be the lemma of the token watches. The only way to get its lemma watch is to delete the suffix es. Finally, the dictionary is queried for basic POS tags. For the lemma watch, its basic POS tags are NN and VB.
In the process of transformation, lemmas with different basic POS tags are reverted to the token, which is also based on the deformation rules. Meanwhile, possible POS tags of the token can be acquired.
For the lemma watch with the POS tag NN, only by adding the suffix es can the lemma watch be transformed to the token watches, and its POS tag is identified as NNS. For the lemma watch with the POS tag VB, we can get the token watches with the POS tag VBZ. After the above processes, the possible POS tags for the token watches are NNS and VBZ.
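The producing step described above can be sketched as follows. This is a minimal illustration assuming a toy lemma dictionary and a handful of hypothetical suffix rules; the real dictionary and deformation rules are far larger.

```python
# Toy lemma dictionary mapping lemmas to their basic POS tags (see Table 2).
LEMMA_DICT = {"watch": {"NN", "VB"}, "dog": {"NN"}, "like": {"VB", "IN"}}

# Hypothetical deformation rules: (suffix to strip, lemma POS, inflected POS).
SUFFIX_RULES = [
    ("es", "NN", "NNS"),  # noun plural: watch -> watches
    ("es", "VB", "VBZ"),  # 3rd person singular: watch -> watches
    ("s", "NN", "NNS"),
    ("s", "VB", "VBZ"),
]

def possible_pos_tags(token):
    """Return all possible POS tags of a token without considering context."""
    tags = set()
    # The token may itself be a lemma with basic POS tags.
    tags |= LEMMA_DICT.get(token, set())
    # Try stripping suffixes to find a lemma (lemmatization), then map the
    # lemma's basic POS tag to the tag of the inflected form (transformation).
    for suffix, base_tag, inflected_tag in SUFFIX_RULES:
        if token.endswith(suffix):
            lemma = token[: -len(suffix)]
            if base_tag in LEMMA_DICT.get(lemma, set()):
                tags.add(inflected_tag)
    return tags

print(sorted(possible_pos_tags("watches")))  # ['NNS', 'VBZ']
```

As in the running example, stripping "es" recovers the lemma watch, whose basic tags NN and VB yield NNS and VBZ for watches; stripping only "s" yields watche, which the dictionary rejects.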
However, there are a small number of words whose possible POS tags cannot be acquired in the above way, including words with irregular deformations. As shown in Table 3, their possible POS tags are cached so that they can be obtained directly without the above process. Because the method is simple, it is possible to obtain wrong lemmas and, thus, impossible POS tags, but this is very rare. Even if impossible POS tags are obtained, it does not matter, since they can be filtered out in the next step. Furthermore, the deep learning model is used to predict the POS tag if the impossible POS tags cannot be filtered out.

Pruning out Possible POS Tags
Once all possible POS tags of each token are obtained, rules can be applied to prune the possible POS tags, whereby some POS tags are excluded from, or selected as, the candidate set.
It is almost impossible to rely entirely on rule-based methods to correctly label all tokens. Even if it were possible, an enormous number of rules would be required, and they might conflict with each other, which would make the algorithm time-consuming. Therefore, a compromise solution is adopted, where rare cases are ignored to reduce the number of rules. Instead of using rule-based methods to tag all tokens, our aim is to tag some portion of them. The sentences to be tagged are mostly declarative, and apart from elliptical sentences, a declarative sentence contains at least one finite verb. Hence, the POS tags of some verbs can be directly determined, such as the word am, the word does, the word have, modal verbs, and their inflections, and we start with these verbs to tag the tokens that follow them. For instance, in the sentence fragment "has been redecorated", the word has is tagged with VBZ, and then the word been is tagged with VBN. Finally, the word redecorated is tagged as VBN because of the word been.
The POS tag of a word is constrained by the POS tags of its surrounding words. As long as the POS tag of a word is determined, it can be used for reducing the candidate POS tags of the surrounding words.
In the previous step, there are words that have only one possible POS tag, such as DT, IN, CC, PRP, or PRP$. In most cases, these POS tags do not appear in the candidate POS tags of words with other POS tags. Therefore, the rules mainly focus on other POS tags.
For a word that has multiple candidate POS tags, tags are eliminated from its candidate set when:
• It follows a preposition or determiner, and these tags are VB, VBP, VBD, VBZ, and MD.
• It follows an adjective, and these tags are RB, RBR, RBS, VB, VBP, VBD, VBZ, and MD.
• It is followed by an adverb, and these tags are JJ, JJR, and JJS.
• It is followed by or follows a verb with the POS tag VB, VBP, VBD, VBZ, or MD, and these tags are VB, VBP, VBD, VBZ, and MD.
There are also several rules for selecting from the candidate POS tags:
• If it follows a preposition or determiner, or there are modifiers between the word and the preposition or determiner, and the word can be used as a noun but cannot be used as a modifier, then the word is tagged NN or NNS.
• If it is followed by a noun and its candidate POS tags contain JJ, JJR, JJS, VBN, or VBG, then these POS tags are selected as the new candidate POS tags.
• If it is followed by an adjective and its candidate POS tags contain RB, RBR, RBS, VB, VBP, VBD, VBN, or VBZ, then these POS tags are selected as the new candidate POS tags.
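As an illustration, the first exclusion rule above can be sketched as follows, assuming candidate tags are kept in per-token sets; the function name and data layout are illustrative, not taken from the paper's implementation.

```python
FINITE_VERB_TAGS = {"VB", "VBP", "VBD", "VBZ", "MD"}

def exclude_after_prep_or_det(prev_tagset, tagset):
    """If the previous word is a preposition (IN) or determiner (DT),
    eliminate finite-verb tags from this word's candidate set."""
    if prev_tagset in ({"IN"}, {"DT"}) and len(tagset) > 1:
        remaining = tagset - FINITE_VERB_TAGS
        if remaining:  # never empty a candidate set entirely
            return remaining
    return tagset

# "the watches": after the determiner, VBZ is eliminated and NNS remains.
print(exclude_after_prep_or_det({"DT"}, {"NNS", "VBZ"}))  # {'NNS'}
```

The other exclusion and selection rules follow the same shape: inspect the neighboring tagsets, then shrink the candidate set accordingly.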
For a word whose candidate POS tags contain only VB and VBP, its POS tag is identified as VB when:
• It is the first word in a sentence.
• It follows an adverb that is the first word in a sentence.
• It follows the word to, or there are adverbs between it and the word to.
For the word to, it is necessary to make a distinction between the POS tags TO and IN. Three simple situations are distinguished: (1) If it is followed by a verb with the POS tag VB, its POS tag is identified as TO. (2) If it is followed by a noun, an adjective, or a verb with the POS tag VBG or VBN, its POS tag is identified as IN. (3) If it is followed by an adverb, the POS tag of the word following the adverb needs to be observed: if that word is an adjective, the POS tag of to is identified as IN; otherwise, it is identified as TO.
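The three situations can be sketched directly; the tag arguments below stand in for lookups on the following words and are illustrative.

```python
def tag_of_to(next_tag, tag_after_adverb=None):
    """Return TO or IN for the word 'to' by the three situations above."""
    if next_tag == "VB":
        return "TO"  # (1) followed by a base-form verb
    if next_tag in {"NN", "NNS", "JJ", "VBG", "VBN"}:
        return "IN"  # (2) followed by a noun, adjective, or VBG/VBN verb
    if next_tag == "RB":  # (3) followed by an adverb: look one word further
        return "IN" if tag_after_adverb == "JJ" else "TO"
    return "TO"  # fallback for cases the three situations do not cover

print(tag_of_to("VB"))        # TO  ("to watch")
print(tag_of_to("NN"))        # IN  ("to school")
print(tag_of_to("RB", "JJ"))  # IN  (adverb then adjective)
```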
We have constructed the above rules, which cover the vast majority of cases. In addition to the construction of the rules, the order in which the rules are applied is critical. For example, in the sentence "He did make it", the fourth rule for excluding POS tags is not applicable if no other rule is considered. However, if the rule that tags the token did as VBD and then tags the token make as VB is applied before the fourth rule, there is no problem with the fourth rule. Since the number of rules is small, we can manually adjust the order to achieve the best performance.
The rules are iteratively applied to prune the POS tags until the candidate POS tags of each word no longer change. Algorithm 1 presents the pseudo-code of the pruning. We abstract the application of each rule as a process called ApplyRule. In the process ApplyRule(rule, words, sets), the rule denoted by the variable rule is applied to each word in the variable words to prune its candidate POS tags. For each word, its local contextual information, such as its candidate POS tagset, its surrounding words, the POS tags of the surrounding words, and the positions of the surrounding words, is accessed to determine whether the conditions of the rule are satisfied. If they are, some POS tags are excluded or selected from the candidate POS tagset of the word according to the rule. Otherwise, the rule processes the next word. After the rule is applied to all words, the process ApplyRule(rule, words, sets) returns a Boolean value that indicates whether there exists a POS tagset whose content changed relative to its content before the rule was applied. If the return value is true, the variable flag is updated to true and, thus, the variable changed is also updated to true. This causes a new iteration to be performed until the variable changed is equal to false. In most cases, the number of iterations through all rules (the number of iterations of the while loop in the pseudo-code) is not more than five. Therefore, the pruning is not time-consuming.
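The fixed-point loop of Algorithm 1 can be sketched as follows. Each rule is a function that mutates the candidate tagsets and returns True if any set changed; the single rule shown is an illustrative stand-in for the rules above.

```python
def prune(words, tagsets, rules):
    """Iteratively apply every rule until no candidate tagset changes."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule(words, tagsets):  # ApplyRule: True if any tagset changed
                changed = True
    return tagsets

def drop_finite_verbs_after_dt(words, tagsets):
    """Illustrative rule: after a determiner, remove finite-verb tags."""
    changed = False
    for i in range(1, len(words)):
        if tagsets[i - 1] == {"DT"}:
            remaining = tagsets[i] - {"VB", "VBP", "VBD", "VBZ", "MD"}
            if remaining and remaining != tagsets[i]:
                tagsets[i] = remaining
                changed = True
    return changed

words = ["the", "watches"]
tagsets = [{"DT"}, {"NNS", "VBZ"}]
print(prune(words, tagsets, [drop_finite_verbs_after_dt]))
# [{'DT'}, {'NNS'}]
```

The loop terminates because each rule only removes tags (or replaces a set with a subset), so the total number of candidate tags is strictly decreasing whenever a change occurs.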
After pruning, the candidate POS tagset of each token is obtained. If the candidate POS tagset of a token contains one POS tag, the token is considered to be correctly tagged. Otherwise, the POS tag of the token is predicted by deep learning models. In most instances, POS tags of most tokens (about 68% on the dataset mentioned in Section 5.1) can be determined. Ideally, POS tags of all tokens in a sentence are tagged correctly, which happens in simple and short sentences.

Masking POS Tags
There are two successful pretraining objectives: autoregressive language modeling and autoencoding [37]. BERT is based on denoising autoencoding, which masks 15% of all tokens at random and only predicts the masked tokens [35]. This allows BERT to utilize bidirectional contexts, which is more powerful than the shallow concatenation of a left-to-right and a right-to-left model.
Inspired by the idea of BERT, we mask a certain portion of the POS tags and only predict the masked POS tags. Figure 2 illustrates an example. Unlike the random masking in BERT, the POS tag of a token is masked if its candidate POS tagset contains more than one POS tag. In this manner, if the POS tag of a token is masked, our model is able to exploit the POS tags of its surrounding tokens to predict it. Moreover, the model focuses on learning how to label tokens whose POS tags are difficult to obtain through the data preprocessing.


Tagging with Transformer
Given a sentence w_1, w_2, . . . , w_N with POS tags y_1, y_2, . . . , y_N and mask indicators m_1, m_2, . . . , m_N, our deep learning model aims to predict the POS tag probability distribution. Here, m_i = 1 indicates that y_i is masked and m_i = 0 indicates that y_i is not masked.

Model
Our model uses the Transformer's encoder as a base model. With the masking, the encoder is able to make better use of bidirectional contexts than the Bi-LSTM that simply concatenates contextual information from left-to-right and right-to-left.
The input layer, shown in Figure 3, is composed of the word embedding layer, the POS embedding layer, and the position embedding layer. The word embedding is a vectorized representation of words. Similarly, the POS embedding represents specific POS tags. As shown in Figure 4, for tokens whose POS tags are masked, their POS embeddings are replaced with E_[MASK]. In addition to word embeddings and POS embeddings, position embeddings are required to indicate the position, because the Transformer eliminates recurrence.

As a result of the composition of the input embedding, the self-attention mechanism in the encoder allows the model to attend to information from word embeddings, POS embeddings, and position embeddings. This satisfies the fact that POS tagging for a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. Figure 5 shows the structure of the model, which aims to predict the masked POS tags. The model takes tokens and POS tags as input, and they are transformed into input embeddings in the input layer. The Transformer's encoder is able to exploit bidirectional contextual information thanks to the masking and the self-attention mechanism. After the computation of the encoder, a linear layer with a softmax function is used to compute the probability of each POS tag. Here, the model predicts the POS tags of token_2 and token_6.
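Given the dimensionalities reported later in the Settings (100-dimensional word embeddings, 10-dimensional POS embeddings, 110-dimensional position embeddings, and a 110-dimensional encoder), one plausible composition of the input embedding is to concatenate the word and POS embeddings and add the position embedding. This is an assumption for illustration, not a composition stated explicitly in the text; plain Python lists are used to keep the sketch self-contained.

```python
def input_embedding(word_emb, pos_emb, position_emb):
    """Assumed composition: concat(word_emb, pos_emb) + position_emb."""
    concat = word_emb + pos_emb              # list concatenation: 100 + 10 = 110
    assert len(concat) == len(position_emb)  # both 110-dimensional
    return [c + p for c, p in zip(concat, position_emb)]

word_emb = [0.1] * 100      # e.g., a GloVe vector for the token
pos_emb = [0.2] * 10        # embedding of the token's POS tag or of [MASK]
position_emb = [0.3] * 110  # absolute position embedding
vec = input_embedding(word_emb, pos_emb, position_emb)
print(len(vec))  # 110
```

Whatever the exact composition, the resulting vector carries word, POS, and position information into the encoder, which is what allows self-attention to condition on all three.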


Training
We construct ŷ, which represents the POS tags that are not masked, and the training objective is to maximize the likelihood p(ȳ | w_1, w_2, . . . , w_N, ŷ) on the training data, where ȳ represents the masked POS tags.

Inference
For the tokens of a given sentence, most are correctly tagged during the data preprocessing, and there is no need to use deep learning models to predict their POS tags. If the candidate POS tagset of a token w_i contains more than one POS tag, its POS tag is predicted by the model. The most likely POS tag of the token w_i can be chosen as argmax_{1 ≤ j ≤ k} p(y_i = j | w_1, . . . , w_N, ŷ), where k is the number of tag types and ŷ represents the POS tags of the tokens tagged by the data preprocessing.
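The argmax selection over the model's output distribution can be sketched as follows; the logits and tag names are illustrative.

```python
import math

def predict_tag(logits, tag_names):
    """Softmax over k tag types, then pick the most likely POS tag."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # p(y_i = j | sentence, known tags)
    return tag_names[max(range(len(probs)), key=probs.__getitem__)]

print(predict_tag([0.2, 2.5, -1.0], ["NN", "VBG", "JJ"]))  # VBG
```

Since the argmax of the softmax equals the argmax of the raw logits, the normalization is shown only to make the probability reading explicit.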

Dataset
Because the Penn Treebank WSJ dataset is not available to us, we use the Groningen Meaning Bank (GMB) dataset [38], which is a large semantically annotated corpus. The dataset contains annotations of all tokens with various tags, of which we only use the POS tags.
In the dataset, there are 62,010 sentences (1,354,149 tokens). After shuffling, 80% of the dataset is used as the training set and 20% as the test set for validation. After the data preprocessing, the POS tags of about 68% of the tokens are obtained. Therefore, the POS tags of about 32% of the tokens are masked so as to be predicted by the model.

Settings
To configure the model for training, the optimizer used is Adam [39] with a learning rate of 0.001, β 1 = 0.9 and β 2 = 0.999. The loss function is set to cross entropy loss, and only masked POS tags participate in the computation of the loss function.
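The restriction of the loss to masked positions can be sketched in plain Python; this is an illustration of the objective, not the actual training code, and the logits are toy values.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Average negative log-likelihood over positions where mask[i] == 1."""
    losses = []
    for logit_row, target, m in zip(logits, targets, mask):
        if m == 0:
            continue  # unmasked POS tags do not contribute to the loss
        exps = [math.exp(x) for x in logit_row]
        log_prob = math.log(exps[target] / sum(exps))
        losses.append(-log_prob)
    return sum(losses) / len(losses)

# Two tokens; only the second is masked and contributes to the loss.
logits = [[3.0, 0.0], [0.0, 0.0]]
targets = [0, 1]
mask = [0, 1]
print(round(masked_cross_entropy(logits, targets, mask), 4))  # 0.6931
```

In a framework implementation, the same effect is typically obtained by assigning unmasked positions an ignored target index so they are dropped from the loss.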
The word embedding layer is initialized with 100-dimensional GloVe word embeddings [19].
For the POS embedding layer, the dimensionality of POS embeddings is 10, which is enough to represent 48 Penn Treebank POS tags.
For the position embedding layer, absolute position embeddings are employed to encode the position, and the dimensionality is 110. Since the length of most sentences in the data source is less than 64, the max position is set to 64. If the length of a sentence is greater than 64, it will be truncated. Otherwise, it will be padded.
In the Transformer's encoder, two identical layers are stacked. For the multi-head attention sublayer, the number of attention heads is 11. For the feed-forward sublayer, the dimensionality of input and output is 110, and the dimensionality of the inner layer is 3072.
For the linear layer, the softmax function for multiclass classification is used, and its hidden size is 48.
To train the model, PyTorch [40], an open source machine learning framework, is adopted (version 1.7.0, with CUDA toolkit 10.1). In the following experiments, we use a batch size of 128 and train the model on an RTX 2060 for 100 epochs to report the results.

Evaluation
We evaluate our approach on two metrics: token accuracy and sentence accuracy. The token accuracy is the per-token tag accuracy, and the sentence accuracy is the whole-sentence correct rate. The sentence accuracy is generally lower than the token accuracy, because a sentence is considered correctly labeled only if all of its tokens are correctly tagged.
The accuracy is jointly determined by the rule-based data preprocessing and the model. As shown in Figure 6, the token accuracy is more than 96% after one epoch because the data preprocessing has tagged most tokens (about 68%) before training. During the training of the model, both the token accuracy and the sentence accuracy gradually improve. After 19 epochs, they reach a maximum on the test set. Specifically, the token accuracy increases to 98.60% and the sentence accuracy rises to 76.04%. For comparison with other methods, four models are chosen as baseline systems as follows.

• Bi-LSTM: A two-layer Bi-LSTM with hidden size 50 is used, where we do not load pretrained word embeddings into the word embedding layer.
• BLSTM RNN with word embedding [20]: In addition to a two-layer Bi-LSTM with hidden size 100, a function is introduced to indicate the original case of words. For a fair comparison, 100-dimensional GloVe word embeddings are adopted in the word embedding layer.
• C2W: A C2W model [21] is employed to generate 100-dimensional character-level embeddings of words, and a two-layer Bi-LSTM with hidden size 100 takes the embeddings as input for POS tagging. The C2W model is composed of a character embedding layer and a unidirectional LSTM with hidden size 100. The character embedding layer generates 50-dimensional embeddings of characters, which are fed into the unidirectional LSTM to produce 100-dimensional character-level embeddings of words.
• Highway Bi-LSTM: A two-layer highway Bi-LSTM [23] with hidden size 150 is adopted. The input to the highway Bi-LSTM comes from two parts: 100-dimensional GloVe word embeddings and 50-dimensional character-level embeddings generated by a C2W model. In this C2W model, the character embedding layer yields 10-dimensional embeddings of characters and the unidirectional LSTM produces 50-dimensional character-level embeddings of words.
To verify the impact of each component of our method, two baselines are constructed as follows.

• Transformer's Encoder: The rule-based data preprocessing is removed from our method to verify whether the data preprocessing delivers the performance gains. Without the rule-based data preprocessing, no portion of the POS tags can be masked, so the encoder of the Transformer is used to predict the POS tags of all tokens.
• MLP with the data preprocessing: To verify the effectiveness of the self-attention mechanism on POS tagging, the multi-head attention layers are removed from the Transformer's encoder, which degenerates it into a multilayer perceptron (MLP). With the rule-based data preprocessing, the MLP only predicts the masked POS tags. Without the self-attention layers, it is difficult to capture dependencies between words.

Table 4 shows the results of different methods on the GMB dataset. Our method outperforms all baselines in both token accuracy and sentence accuracy, achieving a token accuracy of 98.60% and a sentence accuracy of 76.04%. Comparing the Transformer's Encoder and the MLP with the data preprocessing, both the token accuracy and the sentence accuracy drop if either component is removed from our method, which shows that all components are indispensable. Among the other baselines, the C2W model performs better than the BLSTM RNN with word embedding, perhaps because the character-level embeddings contain richer semantic features than the word-level embeddings. Of all baseline systems, the highway Bi-LSTM performs best, which can be attributed to the highway network and its richer input information.
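The masking scheme shared by our method and the MLP baseline means that only positions left ambiguous by the preprocessing contribute to the training loss. A minimal pure-Python sketch of such a masked cross-entropy (the function name and list-based interface are illustrative, not the actual implementation):

```python
import math

def masked_tagging_loss(logits, gold, mask):
    """Cross-entropy averaged over masked positions only: tokens already
    resolved by the rule-based preprocessing (mask[i] is False) contribute
    no loss, so the model trains only on the ambiguous tags."""
    total, n = 0.0, 0
    for row, g, m in zip(logits, gold, mask):
        if not m:
            continue
        z = max(row)                                   # for numerical stability
        log_sum = z + math.log(sum(math.exp(v - z) for v in row))
        total += log_sum - row[g]                      # -log softmax(row)[g]
        n += 1
    return total / max(n, 1)
```

Excluding the already-tagged positions from the loss keeps the model from wasting capacity on predictions the rules have resolved, while their POS embeddings remain visible as context.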
Compared with the highway Bi-LSTM, our method achieves better results even without character-level embeddings; specifically, it boosts the sentence accuracy by about 3%. It is the combination of the rule-based data preprocessing and the self-attention-based deep learning model that further improves the accuracy.
The above observations indicate that the rule-based data preprocessing helps improve the accuracy, and that the self-attention mechanism brings performance gains by attending to the word embeddings, position embeddings, and POS embeddings of surrounding words. The results demonstrate the effectiveness of our method.

Conclusions
In this paper, we propose a novel approach for POS tagging that combines rule-based data preprocessing with a deep learning model based on the Transformer. During the rule-based data preprocessing, most tokens are tagged, which enables the model to utilize their POS tags when predicting the POS tags of the remaining tokens. By masking a certain portion of POS tags and applying self-attention, the model is able to leverage bidirectional contexts. This combination of rule-based methods with deep learning is helpful for research on POS tagging. Experiments on the GMB dataset validate the effectiveness of the proposed method, which achieves a token accuracy of 98.60% and a sentence accuracy of 76.04%.

Future Work
In the future, we plan to extend our approach to POS tagging in other languages. We also plan to refine the rule-based data preprocessing and the deep learning model to further improve the accuracy of POS tagging.
Existing works on rule-based methods focus on correctly tagging all words of a sentence, which is almost impossible due to the complexity of natural languages. In this paper, we provide a simple implementation that tags only a portion of the words. It is feasible and improves the accuracy of POS tagging when combined with deep learning. However, the rules in this study only consider the local context and thus cannot cover all cases, so there is still room for optimization in the construction of rules.
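To make the idea concrete, the following is a hypothetical sketch of such local-context pruning, assuming a context-free `lexicon` mapping each word to its candidate tags; the rule shown is an invented illustration, not one of the rules used in this study:

```python
def verb_after_pronoun(tokens, i, tags):
    """Invented example rule: directly after a personal pronoun,
    keep only the verb reading if one exists."""
    if i > 0 and tokens[i - 1].lower() in {"i", "we", "they"} and "VBP" in tags:
        return tags & {"VBP"}
    return tags

def prune_tags(tokens, lexicon, rules):
    """Start from all context-free candidate tags per token, then apply
    pruning rules over the local context; tokens left with exactly one
    candidate are considered tagged, the rest are masked for the model."""
    candidates = [set(lexicon.get(t.lower(), {"UNK"})) for t in tokens]
    for i, tags in enumerate(candidates):
        for rule in rules:
            if len(tags) == 1:
                break
            tags = rule(tokens, i, tags)
        candidates[i] = tags
    tagged = [next(iter(c)) if len(c) == 1 else "<MASK>" for c in candidates]
    return tagged, candidates
```

For a sentence like "I like dogs", a lexicon entry such as `{"like": {"VBP", "IN"}}` is ambiguous in isolation, but the local-context rule resolves it; tokens no rule can resolve remain masked and are left to the model.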
For the deep learning model, the Transformer is relatively insensitive to the position information of words, which leads to only a modest improvement in the token accuracy. The absolute position embeddings can be replaced with relative position embeddings [41] to enhance the performance of the Transformer. Additionally, ELMo [42] can be employed to obtain contextualized word representations in the input layer, which may allow the Transformer to make better use of contextual information.
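A minimal NumPy sketch of how relative positions can enter attention, in the spirit of [41]: a bias indexed by the clipped offset j - i is added to each attention score, so the weights depend on relative rather than absolute positions (single head, simplified shapes assumed for illustration):

```python
import numpy as np

def attention_with_relative_bias(q, k, rel_bias):
    """Single-head attention weights where a learned bias indexed by the
    clipped relative offset j - i is added to each score, so the model
    conditions on relative rather than absolute positions."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # j - i
    max_off = (rel_bias.shape[0] - 1) // 2
    scores = scores + rel_bias[np.clip(offsets, -max_off, max_off) + max_off]
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Here `rel_bias` is a learned vector of length 2·max_off + 1 shared across positions; offsets beyond the window are clipped to its edges.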
We believe that our study is beneficial to POS tagging for languages in which existing POS taggers perform poorly. Our approach can be applied to these languages by constructing language-specific rules in the data preprocessing and is expected to improve the accuracy.