Local Feature Enhancement for Nested Entity Recognition Using a Convolutional Block Attention Module

Abstract: Named entity recognition involves two main types: nested named entity recognition and flat named entity recognition. The span-based approach treats nested entities and flat entities uniformly by classifying entities on a span representation. However, the span-based approach ignores the local features within the entities and the relative position features between the head and tail tokens, which affects the performance of entity recognition. To address these issues, we propose a nested entity recognition model using a convolutional block attention module and rotary position embedding for local feature and relative position feature enhancement. Specifically, we apply rotary position embedding to the sentence representation and capture the semantic information between the head and tail tokens using a biaffine attention mechanism. Meanwhile, the convolution module captures the local features within the entity to generate the span representation. Finally, the two parts of the representation are fused for entity classification. Extensive experiments were conducted on five widely used benchmark datasets to demonstrate the effectiveness of our proposed model.


Introduction
Named entity recognition (NER) is a fundamental task in natural language processing that aims to identify words with specific meanings in text, e.g., person, organization, and location names, and label their types [1]. It is an essential supporting task for downstream tasks such as syntactic analysis [2], machine translation [3], automated question answering [4], and knowledge graphs [5].
The NER task was first proposed by Rau et al. [6] and has been widely used in areas such as information extraction. Initially, lexicon-based and rule-based approaches were commonly used in NER, relying mainly on string and pattern matching. Petasis et al. [7] proposed a method to better maintain rule-based NER and classification systems, remedying some shortcomings of lexicon-based and rule-based approaches, but it still has limitations. With the development of machine learning, statistics-based methods came into the public eye. Li et al. [8] were the first to apply Conditional Random Fields (CRFs) to NER and obtained good results, but the training time was too long. Mikhael et al. [9] were the first to apply the maximum entropy model to NER, combining a rule-based grammar with a maximum entropy model, but the time complexity of training was high. In recent years, with the development of deep learning, deep-learning-based methods have also been applied to NER, eliminating the need for the extensive manual feature engineering and expert domain knowledge required by traditional approaches. Peng et al. [10] combined Long Short-Term Memory (LSTM) with CRFs for the joint training of NER and word segmentation, with clear experimental improvements. In addition, methods based on convolutional neural networks [11] and hybrid neural networks [12] have also been widely applied to NER with good results.
The above methods are all oriented toward flat entities, i.e., entities that do not contain other entities within them. However, human languages are very complex, and not every token can be represented by a single label, so nested entities are prevalent in all areas of expertise [13]. A nested entity is a phenomenon where one or more short entities are contained within a long entity. As shown in Figure 1, the above methods cannot identify nested entities, resulting in a loss of information. To solve the nested entity problem, Ju et al. [14] proposed a dynamically stacked entity decoding model, but the inner entities cannot use the information of the outer entities. Some researchers have tried hypergraph-based approaches [15,16], but the hypergraph structure becomes very complex when sentences are too long or there are too many entity classes. Most researchers have adopted a span-based approach [17,18] to solve the nested entity problem. The span-based approach divides the sentence into a two-dimensional grid, where each span in the grid is represented by the semantic information of its head and tail tokens, and the nested entity problem is naturally solved by classifying each span. Zheng et al. [19] improved the span-based approach by locating entity boundaries and jointly learning the boundary detection and entity classification tasks. Yuan et al. [20] devised a span-based biaffine attention mechanism incorporating boundary information and used the biaffine mechanism to compute entity scores for NER. However, this approach does not consider the entity's local features or the relative position features between the head and tail tokens, which affects the performance of entity recognition.
To address these issues, we propose a nested entity recognition model that achieves local feature enhancement and relative position feature fusion using convolutional modules and rotary position embedding. Specifically, since the same token is represented differently at the beginning and end of an entity, we use two feedforward networks to differentiate between the head and tail sequences. Subsequently, rotary position embedding [21] is applied to the head and tail sequences, and a biaffine attention mechanism [22] produces a span representation with relative position information. At the same time, a convolutional module is used to capture a span representation composed of local features within the entity. Finally, the two parts of the span representation are fused for entity classification. Our main contributions are as follows:
1. A relative position feature between head and tail tokens was added to the span representation using rotary position embedding to improve the precision of entity recognition.
2. The local feature extraction was performed using a convolutional attention module to achieve local feature enhancement and improve model performance.
3. Channel and spatial attention were applied to differentiate the importance of different dimensions and tokens.

Related Work
NER is one of the critical steps in constructing a knowledge graph. In early methods, NER was treated as a sequence labeling task, often using Bidirectional Long Short-Term Memory (BiLSTM) [23] as the encoder and a Hidden Markov Model (HMM) [24] and CRFs as the decoders. Ratinov et al. [25] proposed an improved HMM architecture, which includes using different observation features and introducing transition features to enhance the performance of NER. Konkol et al. [26] constructed a Czech NER system based on CRFs. Before the emergence of pre-trained language models, the BiLSTM-CRF model had achieved excellent performance in NER tasks. With the advent of pre-trained language models, combining them with traditional sequence labeling models can further enhance performance.
However, the phenomenon of nested entities naturally exists in text, in which a long entity contains one or more short entities. Kim et al. [27] first proposed the concept of nested entities and constructed the biomedical nested entity recognition dataset GENIA. Traditional sequence labeling methods cannot solve the problem of complex nested entities. Therefore, many researchers have explored methods for nested entity recognition.
In the early stages, the layered-based model was mainly used for nested entity recognition. This model decodes nested entities by stacking multiple sequence labeling modules in a layered manner. For example, Ju et al. [14] recognized nested entities from the inside out by stacking BiLSTM-CRF modules. The model first identifies the inner layer of entities. If there are entities in the current layer, the model stacks another BiLSTM-CRF module on top of the current one until all nested entities are recognized. By stacking CRF models, Jiang et al. [28] recognized nested entities in electronic medical records. While this approach is intuitive and easy to implement, the problem of error propagation becomes more pronounced as the number of nested entity layers increases. A transition-based model performs the sequential parsing of characters in the entire sentence to achieve nested entity recognition. Wang et al. [29] introduced three components: Buffer, Stack, and Action. Buffer stores the unprocessed sentence, Stack stores the processed sentence, and Action processes the tokens in the sentence. Although transition-based methods do not suffer from error propagation, they have notable limitations. These methods can only handle entities consisting of two tokens; however, entities are often longer than that. The approach based on reading comprehension [30] cleverly transforms the named entity recognition task into a reading comprehension task. This approach converts each entity label into a question. For example, a question like "What is the organization entity in this sentence?" can be constructed for an organization entity. Then, the question and the target sentence are concatenated using a special token [SEP]. Finally, the entity is obtained based on the output position. However, this method lacks a fixed method for constructing questions, and the quality of the prediction heavily relies on the construction of the questions.
To address the shortcomings of the previous methods, span-based methods [31] are now mainly used for nested entity recognition. The span-based approach classifies each sub-span in a sentence. For example, Yu et al. [32] use a biaffine attention mechanism to generate a span representation from head and tail tokens for entity classification. However, this approach mainly considers only the information of the head and tail tokens of an entity, resulting in unsatisfactory entity classification performance.
Some work improves the performance of entity recognition by enhancing boundary information. For example, Tan et al. [33] enhance the span representation with additional boundary supervision. Gao et al. [34] refine the representation for entity classification via multi-task learning of boundary recognition. Xu et al. [35] concatenate the boundary recognition representation with the sentence representation for entity recognition. Although adding boundary information can improve the precision of entity recognition, the span-based model already uses the boundary information of entities to form a categorical representation, so the internal information of entities deserves more attention. In addition, the length of an entity is also an important basis for entity judgment.
Therefore, we propose a local-feature-enhanced nested entity recognition model, which uses a convolution module to capture local features for entity classification and rotary position embedding to obtain relative position features between head and tail tokens. In contrast to previous work, we consider multiple features for entity classification to improve the model's performance.

Methods
The structure of our model is shown in Figure 2; it mainly consists of four parts. First, the pre-trained language model and BiLSTM are used as the encoder to generate contextual semantic representations of the input sentences. After that, the biaffine attention mechanism is used to construct the representation of head-tail pairs, preparing for the subsequent entity classification. Meanwhile, the convolutional module is used to capture and refine the representation within the entity to achieve local feature enhancement. Finally, the representations from the biaffine attention mechanism and the convolution module are fused to perform entity classification. In this section, we elaborate on the details of each component.

Encoder
We use the pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) [36] and a BiLSTM as the encoder of the model. BERT is able to construct favorable sentence representations and is widely used in NLP tasks. Given a tokenized sentence $X = (x_1, \dots, x_N)$ with $N$ tokens, we feed the sentence into BERT to obtain the sentence representation $E = (e_1, \dots, e_N)$. Subsequently, to further model the contextual information, we use the BiLSTM to yield the final sentence representation $H = (h_1, \dots, h_N)$. BERT generates the word embeddings $e_i \in \mathbb{R}^{d_w}$, and the BiLSTM produces the contextual embeddings $h_i \in \mathbb{R}^{2d_h}$, where $d_h$ is the dimension of the BiLSTM hidden state, so the output dimension is $2d_h$.
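As a minimal sketch of this encoder stage (the dimensions are illustrative, and a random tensor stands in for the BERT output rather than an actual pre-trained model):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """BiLSTM over BERT token embeddings; output dimension is 2 * hidden_dim."""

    def __init__(self, emb_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, bert_embeddings):
        # bert_embeddings: (batch, seq_len, emb_dim) from the BERT encoder
        h, _ = self.bilstm(bert_embeddings)
        return h  # (batch, seq_len, 2 * hidden_dim)

# Stand-in for BERT output: random embeddings for a 10-token sentence.
x = torch.randn(1, 10, 768)
h = Encoder()(x)
print(h.shape)  # torch.Size([1, 10, 512])
```

In a real run, `bert_embeddings` would come from a pre-trained BERT model (e.g., via the HuggingFace Transformers library) rather than `torch.randn`.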

Head-Tail Pair Representation Module
The biaffine attention mechanism and rotary position embedding are mainly used to construct head-tail pair representations with relative position features, which form an essential part of subsequent entity classification. Since the same token is represented differently when it is located at the beginning or the end of an entity, we first use two feedforward networks to obtain the head-sequence representation and the tail-sequence representation: $h^{s}_i = \mathrm{FFN}_s(h_i)$ and $h^{e}_i = \mathrm{FFN}_e(h_i)$, where the weights $W_s, W_e \in \mathbb{R}^{d \times 2d_h}$ and biases $b_s, b_e \in \mathbb{R}^{d}$ of the two networks are trainable parameters. As the span-based model uses the semantic information of the head and tail tokens to determine whether the current span is an entity, it does not consider the relative position between the head and tail tokens, ignoring the fact that entities have a specific length limit. Therefore, introducing the relative position feature between the head and tail tokens into the span representation can improve the ability to recognize entities. Rotary position embedding combined with linear attention can capture relative position features using absolute position embedding, which is more suitable for span-based models than sinusoidal position embedding. We therefore apply rotary position embedding to both the head and tail sequences: $\tilde{h}^{s}_i = \mathrm{RoPE}(h^{s}_i)$ and $\tilde{h}^{e}_i = \mathrm{RoPE}(h^{e}_i)$, where the dimensions of $\tilde{h}^{s}_i$ and $\tilde{h}^{e}_i$ are the same as the input.
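The two properties that make rotary position embedding attractive here can be checked directly: the rotation preserves each representation, and the dot product between two rotated vectors depends only on the relative offset of their positions. A small numpy sketch (a simplified, single-vector form of RoPE, not the paper's exact implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at position pos.

    Each pair of dimensions (2j, 2j+1) is rotated by the angle
    pos * theta_j, with theta_j = base^(-2j/d), as in Su et al. [21].
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # (d/2,) rotation frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The rotation is orthogonal, so the norm of the representation is preserved.
print(np.allclose(np.linalg.norm(rope(q, 3)), np.linalg.norm(q)))  # True

# The dot product of two rotated vectors depends only on the relative
# offset between their positions (both pairs below have offset 4).
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 10) @ rope(k, 14)
print(np.allclose(s1, s2))  # True
```

The second property is what injects a relative position signal into the biaffine span scores without changing the representation dimension.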
$\mathrm{RoPE}(*)$ denotes the addition of rotary position embedding information to $*$. Taking $\mathrm{RoPE}(h^{s}_m)$ as an example: $m$ is the position of the token in the sentence, taking values $m \in [0, N)$, and the rotation frequencies are $\theta_j = 10000^{-2j/d}$, where $d$ is the token embedding dimension of the input sequence and $j \in [0, d/2)$. Our model uses the biaffine attention mechanism to generate the span representation of head-tail pairs, as it is more effective in capturing the correlation between head and tail information than directly concatenating the semantic information of the head and tail tokens. After rotary position embedding, the head and tail sequences are fed into the biaffine attention decoder to obtain the scoring tensor for entity classification.
The biaffine score is computed as $S(i, j) = \tilde{h}^{s\top}_i U \tilde{h}^{e}_j + W[\tilde{h}^{s}_i ; \tilde{h}^{e}_j] + b$, where $U$, $W$, and $b$ are trainable parameters. The scores for all spans form a tensor of size $N \times N \times |C|$, where $|C|$ is the number of entity types; the entry $S(i, j)$ scores the sub-span starting with token $x_i$ and ending with token $x_j$.
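A sketch of the biaffine scoring over all head-tail pairs, with hypothetical sizes ($d$ = representation dimension, $c$ = number of entity types); this follows the standard bilinear-plus-linear form and is not the authors' exact code:

```python
import torch
import torch.nn as nn

class Biaffine(nn.Module):
    """Score every (head, tail) token pair with a biaffine function."""

    def __init__(self, d=128, c=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(c, d, d) * 0.01)  # bilinear term
        self.W = nn.Linear(2 * d, c)                        # linear term + bias

    def forward(self, head, tail):
        # head, tail: (n, d) head-/tail-sequence representations
        n = head.size(0)
        # Bilinear part: s_bi[i, j, t] = head_i^T U_t tail_j
        s_bi = torch.einsum('id,tde,je->ijt', head, self.U, tail)
        # Linear part on the concatenated pair [head_i ; tail_j]
        pair = torch.cat([head.unsqueeze(1).expand(n, n, -1),
                          tail.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return s_bi + self.W(pair)  # (n, n, c) scoring tensor

scores = Biaffine()(torch.randn(10, 128), torch.randn(10, 128))
print(scores.shape)  # torch.Size([10, 10, 8])
```

The resulting $(n, n, c)$ tensor is exactly the two-dimensional span grid described above: row $i$ indexes head tokens, column $j$ indexes tail tokens.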

Local Feature Representation Module
The pre-trained language model and BiLSTM focus on constructing the global information of sentences rather than local information. However, local features within an entity are essential for judging the entity. Therefore, we use a Convolutional Block Attention Module (CBAM) [37] to capture local features. First, we perform a dimension extension on the sentence representation and capture local features using convolution: $F = \mathrm{GeLU}(\mathrm{Conv}_{3 \times 3}(H))$, where $F \in \mathbb{R}^{C \times N \times N}$, in which $C$ denotes the embedding dimension of BERT and the height and width are the same as the sentence length $N$. Gaussian Error Linear Units (GeLU) denotes the activation function, and $\mathrm{Conv}_{3 \times 3}$ denotes a convolution operation with a $3 \times 3$ kernel. Subsequently, since the importance placed on different channels varies, we use channel attention to assign different weights to each dimension: $F' = M_c(F) \otimes F$, where $F' \in \mathbb{R}^{C \times N \times N}$ denotes the output after channel attention and $\otimes$ denotes that the channel attention weights have been applied to $F$.
$M_c(F)$ is the channel attention weight for $F$: $M_c(F) = \sigma(W_1(W_0(\mathrm{AvgPool}(F))) + W_1(W_0(\mathrm{MaxPool}(F))))$, where $\sigma$ denotes the sigmoid activation function, $\mathrm{AvgPool}$ denotes the average pooling operation, and $\mathrm{MaxPool}$ denotes the maximum pooling operation. $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, where $r$ is a constant reduction ratio whose purpose is to reduce the parameter overhead. The spatial attention map is then generated according to the internal spatial relations of the features: $F'' = M_s(F') \otimes F'$, where $F'' \in \mathbb{R}^{C \times N \times N}$ and $\otimes$ denotes that the spatial attention weights have been applied to $F'$.
$M_s(F')$ is the spatial attention weight for $F'$: $M_s(F') = \sigma(f^{k \times k}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]))$, where $k$ denotes the size of the convolution kernel and $[\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]$ denotes the concatenation of the average pooling and maximum pooling results along the channel dimension. $F''$ is the local feature we generate, which will be used for subsequent entity classification.
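The channel and spatial attention steps above can be sketched as follows (the channel count, reduction ratio, and kernel size are illustrative placeholders, not the paper's settings):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel then spatial attention, as in Woo et al. [37]."""

    def __init__(self, channels=64, reduction=16, kernel=7):
        super().__init__()
        # Shared MLP for channel attention: C -> C/r -> C
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        # 2-channel [avg; max] map -> 1-channel spatial attention map
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, f):
        # f: (batch, C, H, W)
        # Channel attention: sigmoid(MLP(AvgPool(f)) + MLP(MaxPool(f)))
        avg = f.mean(dim=(2, 3))
        mx = f.amax(dim=(2, 3))
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))[:, :, None, None]
        f = f * mc
        # Spatial attention: sigmoid(conv([AvgPool; MaxPool] over channels))
        sp = torch.cat([f.mean(dim=1, keepdim=True),
                        f.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(sp))
        return f * ms  # same shape as the input

out = CBAM()(torch.randn(2, 64, 12, 12))
print(out.shape)  # torch.Size([2, 64, 12, 12])
```

Note that the module is shape-preserving, so its output can be fused directly with the biaffine span scores of the same grid size.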

Entity Classification
Finally, the features generated by the biaffine attention mechanism and the features generated by the convolution module are fused for the entity classification task. The fused output $\hat{y}(s)$ contains the probabilities of all entity types for the span $s$ starting with token $x_i$ and ending with token $x_j$; the type with the highest probability is taken as the predicted entity type of the span. In training, this study uses a cross-entropy loss function to optimize the model: $\mathcal{L} = -\sum_{s} y(s) \log \hat{y}(s)$, where $y(s)$ and $\hat{y}(s)$ denote the actual and predicted entity type distributions, respectively, and $\mathcal{L}$ represents the entity classification loss of our model.
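A minimal sketch of the prediction and loss computation over the span grid, assuming a hypothetical $(n, n, c)$ fused score tensor and an $(n, n)$ grid of gold type ids (with 0 standing for the non-entity class):

```python
import torch
import torch.nn.functional as F

n, c = 10, 8
scores = torch.randn(n, n, c)        # fused span scores (stand-in values)
gold = torch.randint(0, c, (n, n))   # gold entity type id per span

# Prediction: the type with the highest score/probability for each span.
pred = scores.argmax(dim=-1)         # (n, n)

# Cross-entropy between the predicted distribution (softmax of scores)
# and the actual type for every span in the grid.
loss = F.cross_entropy(scores.view(-1, c), gold.view(-1))
print(pred.shape, loss.item() > 0)
```

For nested NER, spans with $i > j$ (tail before head) would typically be masked out of the loss; that masking is omitted here for brevity.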

Experiments
In this section, we conduct extensive experiments on three nested entity recognition and two flat entity recognition datasets.

Datasets
We begin with an introduction to the five datasets used in the experiments. ACE2004 [38] is a dataset issued by the Linguistic Data Consortium (LDC) and is available in English, Arabic, and Chinese; we chose the English portion. This dataset aims to develop automatic content extraction techniques to support the automated processing of human language in textual form. It can be used for nested entity recognition in NER.
ACE2005 [39] is also a dataset published by LDC and includes English, Arabic, and Chinese training datasets. We chose the English-type dataset. This dataset can be used for nested NER and relation extraction tasks. There are 145,000 English words in the dataset with six entity types: person name, organization name, location name, time, currency, and percentage.
The GENIA [27] dataset was extracted from the biomedical literature to support the development of bio-text mining systems. The dataset contains 1999 abstracts from MEDLINE. The abstracts were collected using three medical subject terms: human, blood cell, and transcription factor. It can be used to develop and evaluate natural language processing (NLP) algorithms and tools, such as text classification and named entity recognition.
MSRA [40] is a dataset on Chinese NER from Microsoft Research Asia (MSRA), which contains over 50,000 Chinese entity recognition annotations in the following entity categories: place name, institution name, and person name.

Baselines
Xie et al. [42]: detection and identification of entities based on multi-granularity feature information.
Luan et al. [43]: NER via dynamic span graphs.
Straková et al. [44]: linearization of the multiple labels of nested entities into one label, followed by sequence labeling methods.
Tan et al. [33]: a span-based nested entity recognition model using boundary enhancement.
Fu et al. [45]: a view-based nested NER as constituency parsing with partially observed trees.
Wang et al. [13]: nested entity recognition using a stacked model in the shape of a pyramid.
Xu et al. [35]: construction of nested entity recognition models for span representation using additive attention mechanisms.
Gao et al. [34]: training using biaffine attention mechanisms and multi-task learning for boundary recognition.
Zhang and Yang [41]: a model for Chinese entity recognition tasks that explicitly utilizes word information and word order information is proposed.
Yan et al. [46]: an NER framework using an adaptive transformer encoder is proposed for modeling character-level features and word-level features.
Gui et al. [47]: a dictionary-based graph neural network model is used to deal with the Chinese entity recognition problem.
Kong et al. [48]: a multi-level CNN was constructed to capture both short-term and long-term contextual information for entity recognition.
Wu et al. [49]: a new lexicon-enhancement method that effectively alleviates the excessive memory and computational costs of previous lexicon-based approaches.

Hyperparameters and Evaluation Indicators
All our experiments were performed on a machine with an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10 GHz and two NVIDIA Tesla T4 GPU cards with 16 GB of video memory each, and the models were implemented with the PyTorch 1.12.0 framework.
In our model, we use the BERT-Large-Cased model for the ACE2004 and ACE2005 datasets, the dmis-lab/biobert-v1.1 model for the GENIA dataset, and the Chinese-RoBERTa-wwm-ext model for the Resume and MSRA datasets. The hyperparameters for the five datasets are shown in Table 1. We keep the hyperparameters consistent with the baseline models. During tuning, we mainly adjust the batch size and learning rate to obtain the most suitable parameters, and test with the best model selected on the validation set. We evaluate the performance of the model using precision (P), recall (R), and F1, calculated as follows: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, and $F1 = \frac{2 \times P \times R}{P + R}$, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
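These metrics can be computed directly from sets of predicted and gold entity spans, where an entity counts as correct only if its span and type both match exactly. A small illustration with toy data (not drawn from the paper's datasets):

```python
# Each entity is a (start, end, type) triple; exact match on all three.
gold = {(0, 1, "PER"), (3, 5, "ORG"), (7, 7, "LOC")}
pred = {(0, 1, "PER"), (3, 5, "PER"), (7, 7, "LOC"), (8, 9, "ORG")}

tp = len(gold & pred)        # true positives: exact span-and-type matches
precision = tp / len(pred)   # TP / (TP + FP)
recall = tp / len(gold)      # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.5 0.6667 0.5714
```

Here the (3, 5) span is found but mistyped, so it counts as both a false positive and a false negative, which is the standard strict-match convention in NER evaluation.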

Results
The performance of our model on the nested entity recognition task is shown in Table 2. The results demonstrate that our model effectively improves the performance of nested entity recognition. Our model achieves F1 values of 86.66%, 85.83%, and 81.33% on the three datasets, which are 0.06%, 0.43%, and 1.32% higher than the state-of-the-art baseline models, respectively. In particular, Gao et al. [34] used a multi-task learning model with a biaffine attention mechanism and boundary recognition for nested entity recognition. However, our model performed significantly better on all three datasets, demonstrating that relative position features and local features within the entity can be more effective for nested entity recognition. Table 2. Comparison of our model with the baseline model on the nested entity recognition dataset. The highest score is marked in bold.

[Table 2: per-model P (%), R (%), and F1 (%) on ACE2004, ACE2005, and GENIA; the table body was lost in extraction.]

Table 3 shows the results of our model and the baseline models on the flat entity recognition task. Our model achieves the best performance on the MSRA and Resume datasets, with F1 values of 95.79% and 96.47%, precision of 95.83% and 96.62%, and recall of 95.76% and 96.32%, respectively. The comparison models are all based on the sequence labeling method, but our model improves not only in precision but also in recall. This proves that our method is applicable not only to nested entities but also to flat entities. Table 3. Comparison of our model and the baseline model on the flat entity recognition dataset. The highest score is marked in bold.

To demonstrate the importance of relative position features and local features in the entity recognition task, ablation experiments were carried out by entity type. The results of the ablation experiments on the ACE2004 dataset are shown in Table 4, where BAM indicates that only the biaffine attention mechanism is used; RoPE-BAM indicates that relative position features are captured by rotary position embedding; and CBAM-RoPE-BAM indicates that local features are enhanced with the convolution module on top of RoPE-BAM. The results show that the F1 values of CBAM-RoPE-BAM and RoPE-BAM are 1.67% and 0.49% higher than those of BAM, respectively, and performance improves on all seven entity types. We believe this is because the relative position feature can effectively improve the precision of the model, while the local features within the entity help the model perform entity recognition more comprehensively; as a result, the recall of the model also improves. In addition, we conducted ablation experiments on the flat entity recognition dataset Resume, and the results are shown in Table 5. The CBAM-RoPE-BAM model improved the F1 values on several entity categories.
The precision was also significantly enhanced on several entity types, with improvements of 3.3% and 2.72% on the TITLE and ORG entity types, respectively. However, the recall of the RoPE-BAM model dropped dramatically to only 50% on the LOC entity type, which we believe is because the number of LOC entities in the training set is too small: their average length is about 5, but only half of the LOC entities in the test set have length 5, resulting in overfitting. The results of the ablation experiments on all datasets show that our proposed method is effective for nested NER, flat NER, Chinese NER, and English NER tasks, demonstrating the generalization ability of the CBAM-RoPE-BAM model.

We also explored the effect of the number of convolution modules on the GENIA dataset; the results are shown in Figure 3, where N denotes the number of convolution modules. It can be seen from the figure that the best performance is achieved when N = 5, and the performance with only one convolution module is still better than that without any, demonstrating that convolution modules effectively capture local features of entities for entity classification.

Conclusions
In this paper, we propose a novel nested entity recognition model to address the problems of single features and poor generalization ability, which lead to low recall in current methods. Specifically, we consider the length feature of entities to be an important factor affecting entity classification; therefore, this paper utilizes rotary position embedding to capture the relative position features between the head and tail tokens. Considering that the span-based method ignores the internal features of the entity, this paper uses the convolution module to enhance the local features. The results of extensive experiments on five benchmark datasets prove that the relative position features can improve the precision of entity recognition. At the same time, the local features inside the entity can effectively improve the generalization ability of the span-based model and thus increase the recall of the model.

Conflicts of Interest:
The authors declare no conflict of interest.