1. Introduction
Named entity recognition (NER) aims to identify specific, meaningful entities from text, such as LOCATION and ORGANIZATION, and classify them into predefined categories, as shown in
Figure 1. NER is an essential prerequisite for many natural language processing tasks, such as information extraction [1], question-answering systems [2], and machine translation [3].
In recent years, neural-network-based techniques have been widely applied in NER [4,5]. However, neural networks are data-driven machine learning methods, and the quantity of training data often limits their performance. Unfortunately, annotated data used for training are often scarce and expensive, especially in specific domains (e.g., the food safety risk domain). Therefore, there has been widespread interest in a challenging yet practical research field: few-shot NER.
One of the challenges of few-shot NER is how to accurately incorporate prior knowledge to effectively classify unseen entity types when confronted with only a few examples. Recently, similarity-based methods such as prototype networks have been extensively studied and have achieved great success in few-shot learning [6,7,8]. The core idea is to classify input examples from a new domain based on the similarity between their representations and those of each class in the support set. However, this approach experiences a significant drop in performance in the few-shot setting due to the limited representativeness of the data. The prompt-based approach [9,10], by manually or automatically adding prompt words to sentences, guides the model to learn more quickly and accurately while reducing the gap between pretraining and fine-tuning. This approach has shown remarkable performance in few-shot learning. However, these methods do not directly leverage the rich prior knowledge contained in label semantics.
In addition, Chinese NER is more challenging than English NER due to the relatively ambiguous entity mentions in Chinese, which limits feature representation and affects the accuracy of NER. Zhang and Yang [11] addressed this issue by using a lattice LSTM to represent entities in sentences and by incorporating potential lexical information into a character-based LSTM-CRF model. While this character-based representation effectively avoids segmentation errors, it requires the introduction of a complex external lexicon. Some more recent attempts have switched to span-based feature representations for Chinese NER [12,13], explicitly utilizing span-level information to address token-wise label dependency and better handle nested entities. However, these span-based feature representations only perform a simple concatenation of the start and end positions of the span without fully exploiting the internal information of the span, which limits the feature representation of the named entity. For example, the named entity “伊河谷食品科技有限公司 (Yihegu Food Technology Limited Company, Urumqi, China)” contains the internal information “科技有限公司 (Technology Limited Company)”, which makes it easier to classify as an ORGANIZATION entity.
In response to these challenges, we propose a model called SLNER with enhanced span and label semantic representations to tackle the challenges of Chinese few-shot NER. Specifically, SLNER utilizes two encoders. One encoder is used to encode the text and its spans. This module captures the head and tail information of spans using the biaffine attention mechanism and incorporates self-attention to capture the internal information of spans; these representations are then fused to obtain enhanced span representations, fully utilizing the internal composition of entity mentions to achieve more accurate feature representations. In contrast to traditional span-based methods that simply concatenate the start and end positions of entity mentions, our approach exploits the information of each token within an entity mention, providing sufficient and essential clues for entity recognition.
The other encoder is used to encode full label names. Label names are highly generalized summaries of specific entity categories and exhibit semantics similar to the entities themselves, which can provide additional prior knowledge in few-shot scenarios. Compared to traditional similarity-based methods (e.g., prototype networks), label semantics provide more generic similarity representations, especially when the target domain has a scarcity of samples.
Ultimately, our model learns to match span representations with label representations. We employ a two-stage training strategy using source and target domains, enabling the model to transfer knowledge from the high-resource source domain to the low-resource target domain.
Furthermore, to promote research and applications in low-resource domains, we developed an NER dataset named RISK, which was specifically designed for the food safety risk domain. RISK comprises 5 coarse-grained and 20 fine-grained entity types, each labeled and organized in a hierarchical structure of coarse-grained + fine-grained. We also conducted a performance evaluation of our model on the RISK dataset, and the experimental results demonstrate the challenging nature of the RISK dataset. Constructing an NER dataset in the food safety domain can drive advancements in related research areas such as food traceability and food safety regulation. Additionally, this dataset can serve as a foundation for the development of applications in food safety, including food safety warning systems and food recall management.
We have documented the experimental results on three sampled benchmark Chinese NER datasets and a self-built food safety risk domain dataset. Our contributions can be summarized as follows:
We propose a simple and effective model named SLNER, which leverages enhanced span representations and label semantics to address the issues of inadequate prior knowledge and limitations in feature representation in Chinese few-shot named entity recognition;
We created a challenging food safety risk domain dataset, RISK, which is divided into 5 coarse-grained and 20 fine-grained entity categories. This dataset provides data support for the development of named entity recognition applications in the domain of food safety;
Our proposed model achieved promising performance on the four sampled Chinese NER datasets (including our self-built dataset). Specifically, our model outperformed previous works by F1 margins ranging from 0.20% to 6.57% in different few-shot settings (following the settings of PCBERT) on the OntoNotes, MSRA, and Resume datasets. It also achieved promising F1 scores on our self-built RISK dataset.
3. Method
This section first formalizes the problem of few-shot named entity recognition (Section 3.1). Then, we present our model, SLNER, for Chinese few-shot NER (Section 3.2). The model consists of two encoders: one for encoding the text and its spans to obtain better feature representations (Section 3.3.1) and another for encoding the full label names to capture additional prior knowledge (Section 3.3.2). Additionally, we adopt a two-stage training strategy using source and target domains (Section 3.4). The details are outlined as follows.
3.1. Few-Shot NER Task Formalization
For the few-shot NER task, assume that we have a resource-rich source domain NER dataset, $\mathcal{D}_{src} = \{(X_i, Y_i)\}$, where $X_i = \{x_1, x_2, \ldots, x_n\}$ represents the $i$-th text, $x_j$ represents the $j$-th token ($1 \le j \le n$), and $Y_i$ represents the labels corresponding to the entity spans in the $i$-th text. We use $\mathcal{L}_{src}$ to denote the label set of the source domain dataset. Then, given a resource-scarce target domain dataset, $\mathcal{D}_{tgt}$, the number of texts in the target domain dataset is limited (i.e., $|\mathcal{D}_{tgt}| \ll |\mathcal{D}_{src}|$), and the label types in the target domain may differ from those in the source domain (i.e., $\mathcal{L}_{tgt} \neq \mathcal{L}_{src}$). We aim to leverage the knowledge from the source domain dataset to improve the model’s performance on the target domain dataset.
3.2. Overall Structure
The overall architecture of the SLNER model is illustrated in
Figure 5. For span representation, given a sentence $X = \{x_1, x_2, \ldots, x_n\}$ of length $n$, we use BERT as our encoder, which encodes the context of the $i$-th token in the sentence as follows:

$h_i = \mathrm{BERT}(x_i), \quad h_i \in \mathbb{R}^{d}$

where $d$ is the hidden dimension of the encoder, and the output dimension after passing the original sentence through the encoder is $n \times d$.
To further enhance the modelling of the sequential order of the text, the embedding representation obtained from BERT is then passed through a bidirectional LSTM layer. The forward LSTM network captures the hidden forward states (historical features), while the backward LSTM network captures the hidden backward states (future features), resulting in a context-aware encoding representation:

$\tilde{h}_i = [\overrightarrow{\mathrm{LSTM}}(h_i) ; \overleftarrow{\mathrm{LSTM}}(h_i)]$

At this stage, the output dimension of the original sentence through the bidirectional LSTM layer is $n \times 2d_l$, where $d_l$ is the hidden size of each LSTM direction.
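For illustration, the following is a minimal PyTorch sketch of this encoding step. The model name (bert-base-chinese), the LSTM hidden size, and the class and variable names are illustrative assumptions, not the exact configuration used in SLNER.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TextEncoder(nn.Module):
    """BERT encoder followed by a BiLSTM, as described in Section 3.2.
    Hyperparameters (model name, hidden sizes) are illustrative."""
    def __init__(self, bert_name="bert-base-chinese", lstm_hidden=384):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        d = self.bert.config.hidden_size  # encoder hidden dimension d
        # BiLSTM adds sequential-order modelling on top of BERT outputs.
        self.bilstm = nn.LSTM(d, lstm_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, input_ids, attention_mask):
        # Token-level contextual embeddings from BERT: (batch, n, d)
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        # Forward states capture history, backward states capture future;
        # output is (batch, n, 2 * lstm_hidden).
        h_tilde, _ = self.bilstm(h)
        return h_tilde

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = TextEncoder()
batch = tokenizer(["伊河谷食品科技有限公司发布产品召回公告"], return_tensors="pt")
out = encoder(batch["input_ids"], batch["attention_mask"])
print(out.shape)  # (1, seq_len, 768) with the assumed hidden sizes
```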
We predefine an n-gram value ($N$), which represents the maximum length of spans that can be formed in a text. The number of possible spans that can be formed in a sentence of length $n$ (with $n \ge N$) is given by:

$m = \sum_{k=1}^{N} (n - k + 1) = nN - \frac{N(N-1)}{2}$

Next, the token-embedding sequence obtained from the LSTM layer is used to construct the span-level feature vector representation ($H_{span}$) through a span extractor (details in Section 3.3.1). The dimension of $H_{span}$ is finally expanded to $m \times d$ through a feed-forward layer.
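As an illustration of how candidate spans are enumerated, the short sketch below lists all (start, end) index pairs up to a maximum span length N and checks the count against the closed-form expression above. The function name and the choice of 0-based, inclusive end indices are our own assumptions.

```python
def enumerate_spans(n: int, max_len: int):
    """All (start, end) token index pairs with end - start + 1 <= max_len.
    Indices are 0-based and inclusive; this convention is an assumption."""
    return [(i, j)
            for i in range(n)
            for j in range(i, min(i + max_len, n))]

n, N = 12, 4
spans = enumerate_spans(n, N)
# Closed form: n*N - N*(N-1)/2 candidate spans when n >= N.
assert len(spans) == n * N - N * (N - 1) // 2  # 42 spans here
```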
For label representation (details in Section 3.3.2), we manually define the appropriate full label name for each label. Similarly, we use BERT as the encoder and directly encode the label name in the same way as the text to obtain its global semantic feature ($h_{label}$). The difference from span encoding is that we further pass $h_{label}$ through a pooler layer to obtain the semantic feature vector representation ($v_{label}$), which serves as the final representation of the label:

$v_{label} = \phi(W_p h_{label} + b_p)$

where $W_p$ represents the weight parameters in the pooler layer, $b_p$ represents the bias parameters, and $\phi$ is the activation function. The dimension of the stacked label representations ($V_{label}$) needs to be expanded to $c \times d$, where $c$ is the number of entity categories.
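The label encoder can be sketched along the same lines. Below is a hedged example that encodes each full label name with BERT and applies a pooler-style projection to obtain one vector per entity category; the example label names, the Tanh activation, and the use of the [CLS] position are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class LabelEncoder(nn.Module):
    """Encodes full label names into one semantic vector per category."""
    def __init__(self, bert_name="bert-base-chinese", out_dim=768):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        d = self.bert.config.hidden_size
        # Pooler-style layer: linear projection plus an activation
        # (Tanh here is an assumption, not necessarily the paper's choice).
        self.pooler = nn.Sequential(nn.Linear(d, out_dim), nn.Tanh())

    def forward(self, label_names):
        batch = self.tokenizer(label_names, return_tensors="pt", padding=True)
        h = self.bert(**batch).last_hidden_state      # (c, max_len, d)
        # Take the [CLS] position as the global semantic feature of each label.
        return self.pooler(h[:, 0])                   # (c, out_dim)

labels = ["地点 (location)", "机构 (organization)", "其他 (other)"]  # illustrative
v_label = LabelEncoder()(labels)
print(v_label.shape)  # (3, 768): c label vectors of dimension d
```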
According to our approach, there is a correlation between labels and spans appearing in the text. Therefore, we capture this correlation through the dot product:

$S = H_{span} V_{label}^{\top}$

where the similarity matrix ($S$) has dimensions of $m \times c$. We use a standard linear classifier with a softmax function to predict the entity type for each span, resulting in the final predicted output ($\hat{Y}$):

$\hat{Y} = \mathrm{softmax}(S W_c + b_c)$

where $W_c$ is the trainable parameter of the classifier, and $b_c$ is the bias. Finally, we use the cross-entropy loss function to compute the loss, which measures the difference between the predicted results and the ground truth labels:

$\mathcal{L} = -\sum_{s=1}^{m} \sum_{t=1}^{c} Y_{s,t} \log \hat{Y}_{s,t}$
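A compact sketch of this matching step is shown below, under the assumption that span and label representations share the same dimension d; the function name, the linear classifier over the similarity scores, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def match_spans_to_labels(h_span, v_label, classifier, gold):
    """h_span: (m, d) enhanced span representations.
    v_label: (c, d) label semantic representations.
    classifier: nn.Linear(c, c) linear classifier over similarity scores.
    gold: (m,) gold entity-type index per span ("other" for non-entities)."""
    sim = h_span @ v_label.T                    # dot-product similarity, (m, c)
    logits = classifier(sim)                    # linear classifier
    probs = F.softmax(logits, dim=-1)           # predicted type distribution
    loss = F.cross_entropy(logits, gold)        # cross-entropy vs. gold labels
    return loss, probs.argmax(dim=-1)

m, c, d = 42, 3, 768
classifier = nn.Linear(c, c)
loss, pred = match_spans_to_labels(torch.randn(m, d), torch.randn(c, d),
                                   classifier, torch.randint(0, c, (m,)))
```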
3.3. Specific Structure
3.3.1. Enhanced Span Representation
In previous span-based NER models [35], it was common practice to concatenate the embedding information of the start-position token and the end-position token of the entity (referred to as the “outer span”) to represent the span of that entity, which is then used for the final classification decision:

$s_{(i,j)} = [\tilde{h}_i ; \tilde{h}_j]$

This approach lacks interaction between the start and end tokens and fails to fully utilize the informative content within the span. Moreover, this span representation is coarse-grained. To address these limitations, Yu et al. [26] proposed a biaffine decoder that utilizes two fully connected layers to enable interaction between the start and end tokens while simultaneously predicting the span type. However, in this biaffine method, the information within the span is still ignored.
To fully utilize the informative content within the span, we employ an enhanced span representation to generate the final span representation (as shown in Figure 6). Specifically, we pass the token-embedding information through the outer and inner span modules. The outer span module, similar to the biaffine decoder method, utilizes the biaffine attention mechanism to obtain the outer span representation:

$s^{outer}_{(i,j)} = \tilde{h}_i^{\top} U \tilde{h}_j + W [\tilde{h}_i ; \tilde{h}_j] + b$

where $\tilde{h}_i$ and $\tilde{h}_j$ represent the start and end token embeddings of spans in a text, respectively; $U$ and $W$ are learnable parameters; and $b$ is the bias. The dimension of the stacked outer span representations is expanded to $m \times d$.
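The outer span module can be sketched with a standard biaffine scoring layer. Only the general form (bilinear term plus linear term plus bias) is described above, so the projection sizes, the class name OuterSpan, and the decision to return a d-dimensional vector per span are assumptions.

```python
import torch
import torch.nn as nn

class OuterSpan(nn.Module):
    """Biaffine interaction between span start and end representations."""
    def __init__(self, d_in=768, d_out=768):
        super().__init__()
        self.U = nn.Parameter(torch.empty(d_out, d_in, d_in))  # bilinear term
        self.W = nn.Linear(2 * d_in, d_out)                    # linear term + bias
        nn.init.xavier_uniform_(self.U)

    def forward(self, h_start, h_end):
        # h_start, h_end: (m, d_in) embeddings of the span start/end tokens.
        bilinear = torch.einsum("md,odk,mk->mo", h_start, self.U, h_end)
        linear = self.W(torch.cat([h_start, h_end], dim=-1))
        return bilinear + linear                                # (m, d_out)

outer = OuterSpan()
s_outer = outer(torch.randn(42, 768), torch.randn(42, 768))
print(s_outer.shape)  # (42, 768): one outer representation per candidate span
```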
We designed the inner span module to capture the token-level information within the span. For this purpose, we use linear attention to generate information interaction for each token. Specifically, this module takes the span and the start- and end-position indices of the span as input. It starts by applying a feed-forward neural network (FFNN) to non-linearly transform the input representations, obtaining context-aware representations. Then, it computes normalized scores for each position. Finally, these representations are fed into a self-attention layer, which combines the representations of each position with those of other positions, weighted by the attention scores. This allows the model to capture potential relationships between the tokens within the span. The result is the inner span representation:

$\alpha_k = \underset{i \le k \le j}{\mathrm{softmax}}\big(\mathrm{FFNN}(\tilde{h}_k)\big), \qquad s^{inner}_{(i,j)} = \sum_{k=i}^{j} \alpha_k \tilde{h}_k$

where $\tilde{h}_k$ represents the hidden representation from the bidirectional LSTM, and $W_f$ and $b_f$ are the learnable weights and biases of the feed-forward neural network $\mathrm{FFNN}(\cdot)$, respectively. The indices $k \in \{i, \ldots, j\}$ correspond to the token indices within the span, where $i$ and $j$ represent the start and end indices of the span, respectively. When $i = j$ (indicating a span of length 1), we do not extract additional features and simply use the hidden representation ($\tilde{h}_i$).
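Our reading of the inner span module is summarized in the sketch below: a feed-forward transform, normalized per-token scores, and an attention-weighted combination over the tokens inside the span. The scoring network, the single-vector pooling, and the class name InnerSpan are explicit assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InnerSpan(nn.Module):
    """Attention over the tokens inside a span (start..end inclusive)."""
    def __init__(self, d=768):
        super().__init__()
        # Feed-forward network producing one score per token (assumed form).
        self.ffnn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, h, start, end):
        # h: (n, d) BiLSTM hidden states for the whole sentence.
        if start == end:                 # span of length 1: use the hidden state as is
            return h[start]
        tokens = h[start:end + 1]                      # (L, d) tokens inside the span
        scores = self.ffnn(tokens).squeeze(-1)         # un-normalized per-token scores
        alpha = F.softmax(scores, dim=0)               # normalized attention weights
        return (alpha.unsqueeze(-1) * tokens).sum(0)   # weighted combination, (d,)

inner = InnerSpan()
s_inner = inner(torch.randn(12, 768), start=2, end=5)
print(s_inner.shape)  # (768,)
```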
To predict the entity type, we integrate the outer span representation and the inner span representation in a gate network to obtain the weight coefficient ($g$) (as shown in Figure 7):

$g = \sigma\big(W_o s^{outer}_{(i,j)} + W_n s^{inner}_{(i,j)}\big)$

where $W_o$ and $W_n$ are trainable parameters of the gate network, and σ represents the sigmoid function. The dimension of $g$ is the same as that of the span representations.

The final enhanced span representation is obtained by weighting the inner span representation and the outer span representation using $g$:

$s_{(i,j)} = g \odot s^{inner}_{(i,j)} + (1 - g) \odot s^{outer}_{(i,j)}$

where $\odot$ represents element-wise multiplication, and the resulting enhanced span representations, stacked over all candidate spans, have dimensions of $m \times d$.
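The gate fusion amounts to a few lines of code. Which branch receives g and which receives 1 - g is not stated explicitly above, so the assignment below (g weighting the inner representation) and the class name SpanGate are assumptions.

```python
import torch
import torch.nn as nn

class SpanGate(nn.Module):
    """Gated fusion of the inner and outer span representations."""
    def __init__(self, d=768):
        super().__init__()
        self.w_outer = nn.Linear(d, d, bias=False)
        self.w_inner = nn.Linear(d, d, bias=False)

    def forward(self, s_outer, s_inner):
        # Weight coefficient g from both representations (sigmoid gate).
        g = torch.sigmoid(self.w_outer(s_outer) + self.w_inner(s_inner))
        # Element-wise weighted sum of the two representations.
        return g * s_inner + (1.0 - g) * s_outer

gate = SpanGate()
s_enhanced = gate(torch.randn(42, 768), torch.randn(42, 768))
print(s_enhanced.shape)  # (42, 768): enhanced span representations
```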
3.3.2. Label Representation
We believe that label semantics can provide additional prior knowledge. Label semantics carry the semantic information of entities in the same category, as this information is manually summarized and induced from a large amount of data. Therefore, when data are limited, especially in few-shot scenarios, we can introduce label semantics to allow our model to generalize from the available data. Furthermore, full label names are themselves mentions that appear in various contexts within the text, and their frequencies correlate, to some extent, with those of the entity words of their respective categories. Thus, there exists a semantic correlation between label names and the span tokens appearing in the text, and this correlation can be leveraged.
Considering that our label encoder is based on BERT and incorporates prior knowledge from pretraining, our label representation module allows any form of text to be used as input. This design not only enables easy and rapid expansion to unseen label sets in low-resource domains but also prevents the model from forgetting prior knowledge. We experimented with different label forms and analyzed their effects (
Section 5.1).
Table 1 presents the final forms of full label names used in this study (for the non-entity type in each dataset, we uniformly use “其他 (other)” as the label name).
3.4. Training Strategy
Compared to previous work on traditional NER neural architectures, our model does not require a new, randomly initialized top-layer classifier for new datasets with unseen label names. Therefore, our model allows domain transfer across different label categories, which is very beneficial for few-shot learning. On this basis, we adopt a two-stage training procedure. In the first stage, we pre-tune our model on the source dataset to obtain a source domain model rich in prior knowledge. In the second stage, we fine-tune the source model from the previous stage as the initial model on the target domain dataset. During model training, both encoders are updated at each iteration in both stages, which helps align the span-embedding space with the label-embedding space.
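The two-stage strategy amounts to ordinary fine-tuning with a label-agnostic matching head. The outline below is a hedged sketch: the optimizer, epoch counts, learning rate, and the helper names (two_stage_training, train_one_epoch, source_loader, target_loader) are hypothetical stand-ins, not the paper's reported settings.

```python
import torch

def two_stage_training(model, source_loader, target_loader,
                       src_epochs=5, tgt_epochs=20, lr=2e-5):
    """Stage 1: pre-tune on the resource-rich source domain.
    Stage 2: fine-tune the resulting model on the low-resource target domain.
    Both encoders (span and label) are updated in both stages, keeping the
    span- and label-embedding spaces aligned. Hyperparameters are illustrative."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for _ in range(src_epochs):                 # Stage 1: source domain
        train_one_epoch(model, source_loader, optimizer)

    # No new randomly initialized top-layer classifier is needed for the
    # target label set: spans are matched against encoded label names.
    for _ in range(tgt_epochs):                 # Stage 2: target domain
        train_one_epoch(model, target_loader, optimizer)
    return model

def train_one_epoch(model, loader, optimizer):
    model.train()
    for batch in loader:
        loss, _ = model(**batch)                # assumes model returns (loss, predictions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```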