VPN: Variation on Prompt Tuning for Named-Entity Recognition

Abstract: Recently, prompt-based methods have achieved promising performance on many natural language processing benchmarks. Despite their success in sentence-level classification tasks, prompt-based methods perform poorly on token-level tasks, such as named entity recognition (NER), owing to the sophisticated design of entity-related templates. Note that prompt tuning by its nature makes full use of the parameters of the masked language model (MLM) head, whereas previous methods only used the last hidden layer of language models (LMs), overlooking the power of the MLM head. In this work, we investigate how the semantic features of samples change after being processed by MLMs. Based on this characteristic, we design a prompt-tuning variant for NER. We let the pre-trained model predict, at each position, label words derived from the training dataset, and feed the generated logits (non-normalized probabilities) to a CRF layer. We evaluated our method on three popular datasets, and the experiments show that the proposed method outperforms state-of-the-art models on all three Chinese datasets.


Introduction
Recently, pre-trained language models (LMs), such as BERT [1] and RoBERTa [2], have achieved dominant performance on almost all natural language processing (NLP) tasks: one simply fine-tunes these LMs with an extra task-specific head on task-specific training data for the downstream task. Despite the effectiveness and simplicity of fine-tuning LMs, a wide gap remains between the objective functions of the pre-training and fine-tuning phases. A common conclusion in the literature [3,4] is that this mismatch results in the under-utilization of these powerful LMs.
Prompt-based approaches [5][6][7][8][9] have been proposed to address this problem. Unlike traditional supervised learning, which solely exploits the parameters in LMs with their rich distributed knowledge, prompt-based methods reformulate a downstream task's objective in the same form as the pre-training objective, directly modelling the probability of words without any task-specific layers [3]. As shown in Figure 1, sentiment classification, for example, identifies the sentiment y ∈ Y of a given input sentence X ∈ D. In traditional LM fine-tuning, we take the softmax over a special token, such as [CLS], and use the true label y in the loss function to further train the LM; the predicted label ŷ is then the predicted sentiment. In typical prompt tuning, we add a template T = [e_1, e_2, ..., [MASK], ..., e_t] containing a [MASK] special token to the original input sequence X, feed the new sequence X′ = [X, T] into the LM, and let the LM predict, at the [MASK] position, a target token in the vocabulary that indicates the sentiment of the original input. Recent efforts show that prompt-based methods have achieved promising results on many sentence-level NLP tasks, such as natural language inference [10] and sentence classification.
As a fundamental task, NER is irreplaceable in many downstream NLP tasks, such as event recognition, entity linking, etc. NER aims to assign the named entities mentioned in a sentence to pre-defined categories, such as location, person, organization, etc. Former efforts have often required an extra label-specific output dense layer, which is randomly initialized; this makes it difficult for the model to fit to an optimal point. Liu et al. [12] adopted prompt tuning for NER without introducing any parameters beyond those of the pre-trained model. They enumerated all possible entity spans and filled them into templates, meaning that inferring a sentence required feeding it into the model many times. Despite its effectiveness, this enumeration procedure is time-consuming and intolerable.
Ma et al. [13] proposed a template-free prompt-tuning model for few-shot NER. They eliminated templates and let the model predict class-related pivot words derived from unlabelled data, instead of the original words, at each entity position, while still predicting the original words at non-entity positions. In this way, inferring a sentence requires feeding it into the model only once. Their model gained a lot in few-shot settings while performing only moderately in rich-resource settings.
In this study, we propose a simple yet effective variation on prompt tuning for NER. In the BIO scheme, the tags B and I denote that the current word is at the beginning or inside of an entity, respectively, and O denotes that the current word is not part of an entity. In the IO scheme, the beginning of an entity is also tagged with I. The IO scheme, as in Ma et al. [13], can be used to find label words; however, it makes it difficult for the model to separate several consecutive homogeneous entities, and the correlations between tags are neglected. Furthermore, the beginning and interior of an entity often convey different semantic information. For instance, the word City in the LOC entity New York City is more likely to be predicted as I-LOC rather than B-LOC in the BIO scheme, while in the IO scheme the implicit semantic gaps between the three words are neglected and all three words in New York City are treated equivalently. We derive the top-K tag-wise label words in the BIO scheme according to their frequency of occurrence and the corresponding normalized frequency. We then let the pre-trained model predict the label words at each position and feed the generated logits (non-normalized probabilities) to a CRF layer to capture the correlations between tags. We do not introduce any parameters beyond those of the pre-trained model to obtain the logits of all tags at each position.
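To make the difference between the two schemes concrete, the following sketch (function names are our own, purely illustrative) decodes entity spans under both schemes and shows that two adjacent entities of the same type stay separate under BIO but merge under IO:

```python
def decode_bio(tags):
    """Decode (start, end) entity spans from a BIO tag sequence.
    A B- tag always opens a new span, so adjacent entities stay separate."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i))
            start = i if tag.startswith("B-") else None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def decode_io(tags):
    """Decode spans from an IO tag sequence: any run of I- tags is one span,
    so consecutive entities of the same type cannot be separated."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and start is None:
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans
```

For two adjacent PER entities, BIO gives ["B-PER", "B-PER"] and decodes to two spans, while IO gives ["I-PER", "I-PER"] and decodes to a single merged span.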
Our contributions are as follows: (i) We found that the semantic features of samples change only slightly after being processed by the MLM head; this property can improve the effectiveness of the NER task while avoiding the introduction of additional parameters. (ii) We propose a simple yet effective variation on prompt tuning for NER. (iii) We do not introduce any parameters beyond those of the pre-trained model to obtain the logits of all tags at each position. (iv) Experiments show that our proposed method outperforms the state-of-the-art models on three popular datasets.

Related Works
In this section, we briefly introduce studies related to prompt-tuning methods and prompt tuning for NER.

Prompt Tuning
As shown above, prompt-based methods reformulate the fine-tuning objective as a cloze-style objective, bridging the gap between the objectives of the pre-training and fine-tuning phases. GPT-3 [14] uses hand-crafted prompts and achieves very impressive performance on various tasks, especially in few-shot learning settings. Inspired by GPT-3, many attempts [15][16][17][18] at knowledge probing use hand-crafted prompts to boost models, and prompts have been widely used in relation classification [4], entailment classification, and natural language inference [3,5]. Automatically generated label words and templates [19,20] avoid labour-intensive prompt design. Recently, continuous prompts [8,21] have been proposed that use learnable continuous label words and templates rather than discrete words from the vocabularies of pre-trained models.

Prompt Tuning for NER
NER is a token-level classification task that is difficult for prompt tuning. According to a popular survey of prompt tuning [3], template design is complex for NER: to build templates, one needs to enumerate all possible entity spans and types and fill them into a pre-defined template, which is time-consuming and labour-intensive. Decoding time also increases significantly as the input sequence grows [12]. Ma et al. [13] proposed a one-pass decoding strategy for NER, discarding the complex template design and letting the LM predict a class-related pivot word (or label word) at each entity position. They claim that they did not introduce any parameters beyond those of the pre-trained model. However, they introduced extra biases when adding special tokens corresponding to the labels to the vocabulary of the pre-trained model and setting their biases to 0; thus, the original biases in the pre-trained model parameters are lost.

Problem Setup
Given a labelled dataset D, each sample consists of an input word sequence x_i = [x_{i,t}]_{t=1}^{T} and its corresponding label sequence y_i = [y_{i,t}]_{t=1}^{T}, where T denotes the sequence length and y_{i,t} ∈ Y is the entity type from a pre-defined entity type set Y. The NER task aims to predict the entity types of the input word sequences in the test dataset D_test split from D.

Label-Word Selection
As shown in Figure 2, assume we have m kinds of tags Y = {l_j}_{j=1}^{m} in dataset D. For each tag l_j, we find all words with the label l_j in the training samples; we then select the K most frequent words [c_{j,k}]_{k=1}^{K} as representatives of the label l_j. For each representative word c_{j,k}, its normalized frequency is denoted w_{j,k}, and the corresponding word index in the vocabulary is denoted d_{j,k}. Note that d_{j,k} depends on the specific pre-trained model.
In our implementation, we noticed that although an entity word will not appear in two entity categories, in the Chinese datasets the characters of one entity word may appear in both entity categories. This means different entity tags in the BIO scheme might share the same label word. For example, the character {"美"} (beautiful) occurs with the tag B-GPE in the GPE entity word {"美国"} (America), and it also occurs with the tag I-ORG in the ORG entity word {"国美电器"} (a household appliance retailer). Since our model relies heavily on the quality of label-word selection, this co-occurrence confuses the model when deciding which entity type a word containing the character {"美"} belongs to. To solve this issue, we designed Algorithm 1: if the same label word occurs under different tags, we assign it to the tag under which it occurs most often in the dataset. We first collect all the characters and sort them by number of occurrences for each tag. We then introduce a hyper-parameter threshold thr and sample thr * K label words and their occurrence counts for each tag; the sampled results form a tag-pair dict. The threshold thr ensures that K label words remain for each tag in the final filtered label-word dict. Next, we merge all word-occurrence pairs together; in this merged list, for each unique word we keep only the pair with the highest occurrence count and discard the rest. Then, for each tag, we enumerate the tag-pair dict, keeping only the pairs that survive in the merged list. Finally, we select the top K pairs for each tag; this tag-pair dict is our final filtered label-word dict.

Algorithm 1: Label-Word Selection and Filtration
Data: Dataset D; number of label words K; hyper-parameter thr
Result: Top-K tag-pairs dict Tag_pairs_dict
1:  for (word, tag) in D do
2:      count (word, word_num) with respect to tag
3:  end
4:  Tag_pairs_dict ← sort the (word, word_num) pairs with respect to each tag and select the top K * thr pairs
5:  P ← merge the pairs of all tags in Tag_pairs_dict
6:  for (word, word_num) in P do
7:      for each unique word, keep the pair with the largest word_num across all pairs and discard the rest
8:  end
9:  P ← the processed P
10: generate Tag_pairs_dict over the tags from the pairs present in both Tag_pairs_dict and P
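A minimal Python sketch of Algorithm 1 under our reading of the steps above (function and variable names are hypothetical):

```python
from collections import Counter, defaultdict

def select_label_words(samples, K, thr):
    """Label-word selection and filtration (Algorithm 1 sketch).
    samples: iterable of (word, tag) pairs from the training set.
    Returns {tag: [(word, count), ...]} with at most K pairs per tag."""
    # Steps 1-3: count word occurrences per tag.
    counts = defaultdict(Counter)
    for word, tag in samples:
        counts[tag][word] += 1
    # Step 4: for each tag, keep the thr*K most frequent (word, count) pairs.
    tag_pairs = {tag: c.most_common(int(thr * K)) for tag, c in counts.items()}
    # Steps 5-9: across all tags, keep each word only under the tag where it
    # occurs most often (resolves label words shared between entity types).
    best = {}
    for tag, pairs in tag_pairs.items():
        for word, num in pairs:
            if word not in best or num > best[word][1]:
                best[word] = (tag, num)
    # Step 10: filter each tag's list to its surviving words, then take top K.
    return {
        tag: [(w, n) for w, n in pairs if best[w][0] == tag][:K]
        for tag, pairs in tag_pairs.items()
    }
```

On the paper's own example, a character occurring 5 times under B-GPE and 2 times under I-ORG is assigned to B-GPE only.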

Dataset
The following real-world datasets are considered in our study. Table 1 shows the statistics of the datasets.

Implementation Details
For all our experiments, we used the bert-base-chinese (https://github.com/google-research/bert, accessed on 20 March 2023) pre-trained model as our backbone. The hidden size and number of layers of the backbone are 768 and 12, respectively. We implemented the experiments in the TensorFlow framework. The batch size was 8 across all experiments. The learning rate of the CRF layer was 1 × 10−3, and the learning rate of all other layers was 1 × 10−5, using the AdamW optimizer with a 0.1 warm-up ratio. For small datasets, such as Weibo, we set the total number of epochs to 50; for MSRA and OntoNotes 4.0, we set it to 20. For evaluation, we used the BIO scheme: tags B and I denote that the current word is at the beginning or inside of an entity, respectively, and tag O denotes that the current word is not part of an entity.
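The two learning rates and the warm-up schedule can be sketched as follows. This is a simplified sketch of the setup described above, not the actual TensorFlow implementation; the parameter-name check and scheduling details are assumptions.

```python
def lr_for(param_name, crf_lr=1e-3, base_lr=1e-5):
    """Per-parameter learning rate: the randomly initialized CRF layer is
    trained with a larger rate than the pre-trained backbone."""
    return crf_lr if param_name.startswith("crf") else base_lr

def warmup_lr(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warm-up: ramp the learning rate up to peak_lr over the first
    warmup_ratio fraction of training steps, then hold it constant."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr
```

Giving the freshly initialized CRF layer a 100x larger rate lets it catch up to the already well-trained backbone without disturbing the pre-trained weights.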

Modelling VPN
We let the LM predict several label words in the vocabulary and obtain the overall tag-related logits. These label words are relevant to tags rather than classes. In this way, we can also model the logits of positions labelled O and use the BIO scheme rather than the IO scheme, which allows a CRF layer to boost the model's performance.
In this work, we consider the NER task as a sequence-to-sequence task. Figure 3 shows the overall architecture of our proposed model. Given an input sequence x = {x_1, x_2, ..., x_T} and the corresponding label sequence y = {y_1, y_2, ..., y_T}, we embed each word using a pre-trained LM to obtain an embedded sequence E_emb ∈ R^{T×d_H}:

E_emb = [e(x_1), e(x_2), ..., e(x_T)],

where e(x_t) ∈ R^{d_H} is the last-layer hidden state of word x_t, and T and d_H denote the sequence length and the hidden dimension of the transformer model, respectively.

CRF Layer
To take full advantage of the pre-trained model, in line with BERT's pre-training stage, we calculate the word-prediction logits using the masked language model head:

logit_vocab = Dense_2(Dense_1(E_emb)),

where Dense_1(E_emb) ∈ R^{T×d_H}, Dense_2 ∈ R^{d_H×|V|}, logit_vocab ∈ R^{T×|V|}, and |V| denotes the cardinality of the vocabulary.

For each word x_t, we obtain the label logit through mean pooling over the top-K representative words of each entity tag, that is,

logit_label(t, j) = Σ_{k=1}^{K} w_{j,k} · logit_vocab(t, d_{j,k}),

where logit_label ∈ R^{T×m}. We then feed logit_label to a conditional random field (CRF) [26] layer. Implementation-wise, the CRF computes a score for a candidate output y given the context x (i.e., the input sequence), followed by a softmax operator to obtain the conditional likelihood:

p(y | x) = exp(s(x, y)) / Σ_{y′ ∈ Y_all} exp(s(x, y′)),   s(x, y) = Σ_{t=1}^{T} (logit_label(t, y_t) + A_{y_{t−1}, y_t}).

Here, Y_all is the set of all possible tag sequences, and the transition matrix A ∈ R^{m×m} characterizes the smoothness of the label sequence (the probability of switching between consecutive labels).
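The pooling step can be sketched in NumPy as follows. This is a sketch under our reading of the method: we assume frequency-weighted mean pooling, with each tag's weights w_{j,k} summing to 1.

```python
import numpy as np

def label_logits(logit_vocab, label_words):
    """Aggregate MLM-head vocabulary logits into per-tag logits.
    logit_vocab: (T, |V|) array of logits from the MLM head.
    label_words: {tag: [(vocab_index d_jk, weight w_jk), ...]},
                 with each tag's weights summing to 1.
    Returns a (T, m) array of tag logits plus the tag order used."""
    tags = sorted(label_words)
    out = np.zeros((logit_vocab.shape[0], len(tags)))
    for j, tag in enumerate(tags):
        for d, w in label_words[tag]:
            out[:, j] += w * logit_vocab[:, d]  # weighted mean over label words
    return out, tags
```

The resulting (T, m) matrix plays the role of the emission scores that the CRF layer consumes, so no label-specific dense layer is needed.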

Baselines
In this work, we evaluated our proposed model against several competitive baselines, all using BERT or BERT-related models as the backbone:
• BERT-tagger. BERT-tagger [1] is a strong baseline for token-level classification tasks such as NER;
• BERT+Glyce. Meng et al. [27] exploited glyph information to enrich the pictographic evidence in characters using historical Chinese scripts;
• BERT+FLAT. Li et al. [28] converted the character-word lattice structure into a flat structure of spans;
• BERT-MRC. Li et al. [29] reformulated NER as a machine reading-comprehension task.

Table 2 shows the F1 results of our main experiments. We first find that our model significantly outperforms all the baseline models, including the state-of-the-art models, on the three Chinese datasets. We attribute these across-the-board gains to the reuse of the MLM head from the original pre-trained model, which eliminates the need to design a label-specific output layer; the CRF layer also helps. On a small dataset such as Weibo, the other baseline models improved little over the vanilla BERT-tagger, while our model improved by 4-5%. On the large OntoNotes 4.0 dataset, all three baseline models improved by 3-4% over the vanilla BERT-tagger, while our model achieved an improvement of 4.89%. On the larger MSRA dataset, all models achieved satisfying results, and our model marginally outperformed the baselines. These results show that dataset size has a large impact: compared to the baselines, performance gains were greater on small datasets, such as Weibo NER, than on large ones. The baseline models all have class-related output layers whose parameters are randomly initialized, which may explain why they are harder to fit on smaller datasets.

Ablation Study
In Figure 4, we compare how varying the hyper-parameter K (the number of candidate label words) affects performance. Performance peaks at a moderate K and tapers off beyond it: too large a K introduces less helpful label words for each tag, while too small a K omits helpful ones, both hurting the model's performance. Furthermore, different datasets have different optimal K values, indicating that the data distribution also has a large impact.

Motivation of Our Method
When using prompt-based methods for sentence-classification tasks, researchers add a template with a special [MASK] token to the original input text and let the pre-trained model predict, at the [MASK] position, one of a set of label words, each representing a specific pre-defined class of the input text. In this way, the predictive ability of the [MASK] token is fully exploited. Intuitively, we wondered whether non-masked tokens also have this predictive ability. We conducted a sentence-restoration experiment to test this hypothesis: we fed the original input text to a pre-trained model and obtained the last hidden state of each token, then fed the hidden states into the masked language model head used in the pre-training phase and obtained logits over the pre-trained model's vocabulary.
We report token-level accuracy: at each position, we output the token with the largest logit and check whether it restores the original input text. For example, given the input sentence "今天出发去上海" (Today, I'm leaving for Shanghai), we want the model to output the original sentence. There are often tens of thousands of tokens in the pre-trained model's vocabulary, so restoring a plain, non-masked original token is not easy. We conducted our experiments on two datasets: AGNews [30] and The People's Daily. Table 3 shows the results. On the English dataset, AGNews, token-level accuracy is about 0.87, while on the Chinese dataset, The People's Daily, it is about 0.95, showing that non-masked tokens also have predictive ability. Note that the MLM head of the pre-trained model achieves these remarkable restoration results without any fine-tuning. Therefore, we can instead let the pre-trained model's MLM head predict other label words in its vocabulary. Token-classification tasks such as NER require distinguishing different token categories; we therefore assign each token category (i.e., entity tag type, e.g., B-LOC) its own label words and let the pre-trained model's MLM head predict the tag-related label words for each token. Pooling the predicted logits of the label words of a specific token category yields the logit that the token belongs to that category. Analogous to letting the mask token predict pre-defined label words in sentence classification, we let each non-masked token predict a set of label-word tokens to solve the named-entity task. Figure 5 shows the correspondence between our model on NER and prompt tuning on classification.
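The accuracy metric used in this experiment can be sketched as follows. In practice, `logits` would come from feeding the unmasked sentence through a pre-trained BERT and its MLM head (for instance via a library such as HuggingFace Transformers); obtaining them is outside this sketch.

```python
import numpy as np

def restoration_accuracy(logits, input_ids):
    """Token-level restoration accuracy: the fraction of positions where the
    argmax over the vocabulary equals the original (unmasked) token id."""
    preds = np.asarray(logits).argmax(axis=-1)  # (T,) predicted token ids
    return float((preds == np.asarray(input_ids)).mean())
```

An accuracy near 1 means the MLM head can recover non-masked tokens almost perfectly, which is exactly the property our method exploits.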

English Results and Future Work
Our model can be applied not only to Chinese but also to other languages. We conducted a series of experiments on the English datasets CoNLL 2003 [31] and OntoNotes 5.0 [32].

CoNLL 2003 is a named-entity-recognition dataset released as part of the CoNLL-2003 shared task on language-independent named-entity recognition. The data consist of eight files covering two languages, English and German; for each language there is a training file, a development file, a test file, and a large file of unannotated data. The English data were taken from the Reuters Corpus, which consists of Reuters news stories published between August 1996 and August 1997. For the training and development sets, ten days' worth of data were taken from the files for the end of August 1996; for the test set, the texts are from December 1996. The pre-processed raw data cover the month of September 1996.

OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic), with structural information (syntax and predicate-argument structure) and shallow semantics (word senses linked to an ontology, and coreference). OntoNotes Release 5.0 contains the content of earlier releases and adds source data from, and/or additional annotations for, newswire, broadcast news, broadcast conversation, telephone conversation, and web data in English and Chinese, and newswire data in Arabic. Here we use the English portion of OntoNotes 5.0. Table 4 shows the statistics of the datasets.

We compared our model with the vanilla BERT-tagger. We trained our model for 10 and 50 epochs on the OntoNotes 5.0 and CoNLL 2003 datasets, respectively; the other hyper-parameter settings remained the same as in the Chinese experiments. Table 5 shows our results on the two English datasets. From the F1 results, we can see that our model is slightly worse than the baseline.
The label-word selection procedure is of great importance in our model. We collect the label words and their weights from the raw datasets, and these label words are natural-language words. Note that natural-language words cannot be predicted directly by pre-trained models, so we need to convert them into tokens in the vocabulary of the pre-trained model. In Chinese, the smallest unit of text is a character, and the tokens in the vocabulary of a pre-trained model are almost all characters. For example, when we feed in the input sentence {"今天出发去上海"} (Today, I'm leaving for Shanghai), the tokenized output of the bert-base-chinese (https://github.com/google-research/bert, accessed on 20 March 2023) pre-trained model is {"今", "天", "出", "发", "去", "上", "海"} (in Chinese, the phrase "今天" means "today", "出发" means "leave", and "去上海" means "go to Shanghai"). The natural-language input sentence and the output tokens are almost the same, and the output tokens retain the semantics of the input sentence. However, things are different for English: English tokenizers of pre-trained models tend to split a natural word into sub-words. For example, the word miscellaneous expresses clear semantics, while the tokenized result mi, ##s, ##cell, aneous loses the original semantics of miscellaneous. Therefore, the reason we cannot obtain the best performance is probably that the tokenization procedure is more complex for English: even if we find suitable natural-language label words, it is still difficult to find tokens in the vocabulary of the pre-trained model that express the semantics hidden in the entity labels.
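The subword effect can be reproduced with a simplified greedy longest-match-first WordPiece tokenizer. This is a toy sketch with a made-up vocabulary; real BERT tokenizers add text normalization, a maximum-characters limit, and other details.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization (simplified).
    Non-initial pieces carry the '##' continuation prefix; a word with no
    valid segmentation maps to the unknown token."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:          # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces
```

A whole word stays intact only if it is in the vocabulary; otherwise it shatters into pieces like "mi", "##s", "##cell" that carry little of the original semantics, which is exactly the obstacle to English label-word selection described above.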
For future work, we will explore better label-word selection methods to find suitable tokens in the vocabularies of the pre-trained models to better express the semantics of tags. In the English dataset, we can choose not to use words that can be split into sub-words by tokenizers as our label words. Furthermore, we will explore generative pre-trained models, such as GPT-3, as our backbone model and let the model predict the label words.

Results per Entity Type
In Figures 6-10, we draw the confusion-matrix heat maps using sklearn [33], and we report the results per entity class. Tables 6-10 give the experimental results on Weibo NER, MSRA, OntoNotes 4.0, CoNLL 2003, and OntoNotes 5.0, respectively. From the results, we find that the scores on big datasets such as MSRA and OntoNotes 4.0 are much better than those on small datasets such as Weibo NER. Moreover, the total number of entities of a specific type strongly affects the results: in lines 3 and 5 of Table 6, the scores for the entity type LOC are much lower than those for PER, and the same holds in lines 3 and 5 of Table 8. Furthermore, in Figure 6 we notice that ORG entities are more likely than other entities to be predicted as GPE, and vice versa. This may be because the semantic information of these two entity types is very close, making it hard to find suitable label words to distinguish them. In the OntoNotes 4.0 dataset, the entity counts of ORG and PER are very close in Table 8, but the scores for ORG are much lower than those for PER. Observing Figure 8, we see misidentifications between the GPE, LOC, and ORG entity types, which may indicate that putting these three hard-to-distinguish entity types in the same dataset is unwise. In the CoNLL 2003 dataset, the entity count of LOC is much smaller than that of the other entity types, and accordingly its scores are notably lower; this suggests that a larger number of entities is required to train the model adequately and thereby improve performance. In the big OntoNotes 5.0 dataset, which has 18 entity classes, the entity distribution is unbalanced: in Table 10, entity types accounting for a small portion of the total entity count score much lower than those accounting for a large portion.

Conclusions
In this work, we proposed a simple yet effective variation on prompt tuning for Chinese NER. We adopted a one-pass decoding strategy, which significantly increases decoding speed. We let the LM predict several label words derived from the training dataset, converted into label tokens in the vocabulary of the pre-trained model, and retrieve the overall tag-related logits. These label words are relevant to tags rather than classes; in this way, we can also model the logits of positions labelled O and use the BIO scheme rather than the IO scheme, which allows a CRF layer to boost the model's performance. Experiments show that our proposed method outperforms state-of-the-art models on three popular datasets. On small datasets, such as Weibo, the other baseline models improved little over the vanilla BERT-tagger, while our model improved by 4-5%.