Incorporating Entity Type-Aware and Word–Word Relation-Aware Attention in Generative Named Entity Recognition

Abstract: Named entity recognition (NER) is a critical subtask in natural language processing. A deeper understanding of entity boundaries and entity types is particularly valuable when addressing the NER problem. Most previous sequential labeling models are task-specific, while recent years have witnessed the rise of generative models, which have the advantage of tackling NER tasks in the encoder–decoder framework. Despite their promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. In this paper, we propose a multiple attention framework that introduces entity-type embedding attention and word–word relation attention into the named entity recognition task. To improve the accuracy of entity-type mapping, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate this information into the model via the encoder's self-attention. To enhance the contextual information, we take the entity types as part of the input. Our method obtains a second attention from the hidden states of the entity types and utilizes it in the self- and cross-attention mechanisms in the decoder. We transform the entity boundary information in the sequence into word–word relations and incorporate the corresponding embedding into the cross-attention mechanism. Through word–word relation information, the method can learn more entity boundary information, thereby improving its entity recognition accuracy. We performed experiments on extensive NER benchmarks, including four flat and two long entity benchmarks. Our approach improves on or performs comparably to the best generative NER models. The experimental results demonstrate that our method can substantially enhance the capabilities of generative NER models.


Introduction
Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into predefined categories, such as the entity types of person, organization, and location. NER is one of the fundamental research problems in natural language processing, and has been widely adopted in information retrieval and question-answering systems [1][2][3]. Previous works [4][5][6][7][8][9] have addressed NER tasks with task-specific token-level sequential labeling or span-level classification methods. In token-level sequential labeling methods, each token is assigned a label to represent its entity type. Span-level classification methods, on the other hand, enumerate all possible spans in the sentences and classify them into predefined entity types.
Recently, sequence-to-sequence (seq2seq) generative approaches [10,11] have gained attention in the NER community due to their ability to jointly model all NER tasks in a unified framework. Although these models have shown promising results across all three NER categories (flat NER, nested NER, and discontinuous NER), they face two limitations that call for suitable methods to address them. First, these generative models are ineffective at utilizing information about entity boundaries. For example, consider a seq2seq method tasked with identifying entities in a complex sentence; the model might successfully recognize "New York" as an entity but fail to discern the boundary between "New York" and "University" in the phrase "New York University". It incorrectly identifies the entire phrase as one entity instead of two separate entities: the location "New York", and the organization "University". This ineffectiveness stems from the autoregressive decoding process inherent in seq2seq models, which fails to capture the interword relationships within a sentence that are essential for identifying entity boundaries. Current seq2seq models generate coarse-grained entities and leverage context insufficiently, so a more precise and fine-grained approach is required. Second, the seq2seq framework does not explicitly consider the effect of entity types on NER. Because entity-type generation is based on the compounding tokens, the incorrect generation of these tokens by the decoder can lead to misguided entity-type mapping information. Access to prior knowledge from external entity databases aids the model in precisely determining entity types, and these entity types can in turn shape the contextual learning of sentences, thereby improving the representation of entities. Therefore, it is crucial to incorporate entity-type mapping and entity boundary information into seq2seq NER models in order to overcome these limitations.
In this paper, we present a novel approach that addresses the limitations of existing seq2seq NER models by incorporating attention mechanisms built on the entity type and the word-word relation. Specifically, we leverage external knowledge bases such as Wikipedia to learn entity-type information, further improving entity-type mapping. We integrate entity-type distributions into the encoder. Within the decoder, we embed the entity type through self-attention and cross-attention mechanisms, improving the associative mapping between entities and their respective types. Furthermore, we integrate representations of word-word relations into the decoder in order to enhance the capability to distinguish entity boundaries.
Our proposed approach promotes the smooth incorporation of entity-type information and word pair relational knowledge into the seq2seq NER framework. The main contributions are summarized as follows:
• We propose a generative NER framework which merges entity-type embedding and word-word relation representation to improve named entity recognition performance.
• We leverage two novel attention mechanisms in the NER framework, namely, entity type-aware attention and word-word relation-aware attention, improving the interaction between entities, entity types, and word-word relations for better contextual information.
• We present a series of experiments demonstrating the effectiveness of our method against various baselines. Ablation studies further show the contribution of each component within our approach, confirming their individual effectiveness.

Named Entity Recognition
Named entity recognition (NER) is a significant research area in natural language processing, and various methods have been proposed for it. Traditional research on NER modeled it as a sequence labeling task, primarily focusing on flat NER [4][5][6][12]. The focus later shifted to complex-structure NER [13][14][15][16][17][18], which is studied separately. Different approaches to NER include sequence labeling, span-based, hypergraph-based, and generative methods.
NER is typically treated as a sequence labeling problem which assigns a tag to each token and uses a sequence model [4,6,14,19,20,21,22,23] to predict the labels of the sequences (e.g., BIO). Collobert et al. [20] introduced the linear-chain conditional random field (CRF) into convolutional neural networks (CNNs), casting sequence labeling as a problem of determining the respective likelihoods between adjacent tags. Strubell et al. [5] obtained features for the sentences via CNN. Following this work, Lample et al. [4] adopted a bidirectional LSTM with CRF to obtain the token representations. A number of studies [24,25] have combined CNNs and RNNs to extract features and learn word representations. Ju et al. [26] dynamically stacked flat NER layers in an LSTM model for nested NER. Tang et al. [27] extended BIO to the BIOHD label scheme for discontinuous NER. However, these methods primarily require the design of different tagging schemes, and do not effectively address structurally complex or longer inputs such as nested and long NER. In contrast, our generative model can handle these cases simultaneously. Furthermore, we incorporate entity-type embedding from external knowledge and the encoder, while word-word relation representation is merged to enhance entity boundary information.
Span-based methods commonly tackle complex NER tasks, e.g., nested NER. Researchers have proposed various approaches to obtain reasonable spans. Wang et al. [28] proposed a model which allows for interaction between spans from different layers. Yu et al. [15] utilized bi-affine attention to measure the possibility of a span being a mention. Tan et al. [29] first predicted the boundary and then performed classification over the span features. Ouchi et al. [30] built a feature space with similar entity spans. However, these span-based models must enumerate all possible spans, while our model directly generates the entities. Li et al. [9] and Zhang et al. [31] presented similar NER models, treating NER as a span-based machine reading comprehension task. However, these methods require the design of template-based questions and multiple accesses to models, which raises computational costs; our proposed framework is able to avoid these issues. Lin et al. [32] offered a method that first detects the type of an anchor word and then locates the entity's boundaries. Shen et al. [16] presented a two-stage object detection method for nested entities. In addition to the above issues, these models have problems with gaps or chaining errors in the results of different stages. This paper proposes a method that enables interaction between entity type and entity boundary information to directly generate entity sequences.
Hypergraph-based methods have been proposed to cover many possible mentions in a sentence effectively. Lu et al. [33] introduced a method for joint mention extraction and classification using hypergraphs, followed by similar work from Muis et al. [23,34], who utilized a multigraph representation to address overlapping NER. Katiyar et al. [35] developed a hypergraph representation for nested entities, leveraging features extracted from an RNN. Wang et al. [7,36] proposed the idea of neural segmental hypergraphs. However, these models have trouble dealing with long inputs or many entity categories, as their hypergraph structures become extremely complex. These methods additionally struggle with spurious structure and structural ambiguity during inference [37]. In this paper, we aim to further investigate a simple and efficient model that learns entity type and boundary information for NER.
Generative methods treat the entity span sequence as a generation task, aiming to generate entities and entity types directly without requiring a dedicated tagging schema or the enumeration of spans. Seq2seq methods have been proposed that directly generate entity label sequences from the input [5]. Strakova et al. [14] proposed a seq2seq method for nested NER that directly outputs the label of each token, associating the relation between words and labels via hard attention. Yan et al. [10] presented a pointer-based model which splices the label's embedding with the token representation. Zhang et al. [11] offered data augmentation from a causal perspective to generate entities. However, generative models do not fully utilize the entity boundary information implicit in the entity itself, which is critical for named entity recognition.

Attention Mechanism
Attention mechanisms have been employed in machine translation, machine comprehension, named entity recognition, and related natural language processing tasks. Attention-based methods have achieved impressive results, in that attention mechanisms can retain information from the whole input and model the interactions among the elements of the input sequence. In addition, attention mechanisms allow models to automatically concentrate on the essential parts of the information while ignoring the less relevant details, thereby enhancing the model's ability to process complex data. The transformer model, first introduced by Vaswani et al. [38], has revolutionized the field of named entity recognition with its self-attention and cross-attention mechanisms, allowing long-range dependencies within the data to be captured. Based on this, many works [39,40] have leveraged attention mechanisms to integrate features from various sources of information, such as character representations, word embeddings, and position embeddings. These integrated approaches enable models to focus selectively on the most pertinent information and select valuable knowledge, facilitating more nuanced language understanding. Ren et al. [39] proposed using an attention-based architecture over the word embedding and character-level component to learn the same semantic features for each word. Tan et al. [40] directly utilized attention mechanisms to capture the global dependencies of the input in order to enhance the performance of Chinese NER. TENER [41] is an adapted encoder based on attention mechanisms that merges the character- and word-level features. FLAT [42] uses a variant of self-attention to leverage the relative span position encoding.

Generative NER Task Formulation
In this section, we define the problem of seq2seq named entity recognition. Given an input sentence of n tokens $X = \{x_1, x_2, \ldots, x_n\}$, the goal of seq2seq NER is to generate a target sequence $Y = [s_{11}, f_{11}, \ldots, s_{1k}, f_{1k}, g_1, \ldots, s_{i1}, f_{i1}, \ldots, s_{ik}, f_{ik}, g_i]$. Here, s and f are the starting and ending indices of a span, k is the span index in an entity, and $g_i \in \{g_1, \ldots, g_N\}$ is the entity type, where N is the total number of possible entity types. The generation scheme is illustrated in Figure 1.
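To make the target format concrete, the following is a minimal sketch of how entities could be flattened into such a sequence and recovered from it. The helper names (`build_target_sequence`, `parse_target_sequence`), the use of inclusive token indices, and the use of type names as plain strings are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical container: (list of (start, end) spans, entity type name).
# Token indices are assumed to be 0-based and inclusive.
Entity = Tuple[List[Tuple[int, int]], str]
TargetItem = Union[int, str]


def build_target_sequence(entities: Sequence[Entity]) -> List[TargetItem]:
    """Flatten entities into [s_1, f_1, ..., s_k, f_k, g] groups."""
    target: List[TargetItem] = []
    for spans, entity_type in entities:
        for start, end in spans:
            target.extend([start, end])
        target.append(entity_type)  # the type closes the entity group
    return target


def parse_target_sequence(target: Sequence[TargetItem]) -> List[Entity]:
    """Recover entities from a flat sequence, dropping malformed groups."""
    entities: List[Entity] = []
    pending: List[int] = []
    for item in target:
        if isinstance(item, str):
            # A type token ends the current group; keep it only if the
            # preceding indices form complete (start, end) pairs.
            if pending and len(pending) % 2 == 0:
                spans = list(zip(pending[0::2], pending[1::2]))
                entities.append((spans, item))
            pending = []
        else:
            pending.append(item)
    return entities
```

For the sentence "Barack Obama visited New York", the entities `[([(0, 1)], "person"), ([(3, 4)], "location")]` would flatten to `[0, 1, "person", 3, 4, "location"]`. An entity with several spans (the discontinuous case) simply contributes more index pairs before its type.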

Methodology
In this section, we introduce our proposed methodology. We outline the overall framework and describe the details of the modules (entity type-aware attention and word-word relation-aware attention) used to implement our method. The entity generation process is introduced as well. Together, these components form the backbone of our framework.

Model Overview
Our model consists of an encoder that encodes the input sequence into its contextual embedding and a decoder that generates the output sequence with entity annotations. In addition to the entity generation task, we design a relation representation learning task over the token pairs to better capture the correlations among the entities. The entity token relation attention and entity type attention are introduced and fused into the encoder and decoder as shown in Figure 1. We present the details of each component separately in the following subsections.

Entity Type Aware Attention
Heterogeneous factors such as entity types and entity boundaries [15,43,44,45] greatly impact named entity recognition. In this section, we discuss the modeling of entity types in our seq2seq NER framework, allowing for interactions with the input sequences and guiding the model to learn more effective token representations. We merge the entity type-aware attention into both the encoder and the decoder.
In the encoder, we incorporate entity types as part of the input, which are concatenated with the given sentence. To achieve better representations of entities, we obtain the entity type-aware attention from an external entity base, which provides prior knowledge that can improve the NER task. Then, we feed the entity type-aware attention into the self-attention layers in the encoder.
As shown in Figure 2, the comprehensive representations of entity types are obtained from an external entity base and incorporated into the encoder through entity type attention.
The representation of an entity type T is a weighted sum of the entity representations. For example, assuming that the entity label set is G = {person, location, organization}, as shown in Figure 2, if entity type T = location contains the entity set $C_T$ = {Beijing, London, . . ., New York}, the initial embedding of entity type T can be obtained as follows:
$$E_T = \sum_{i=1}^{|C_T|} \theta_i e_i \qquad (1)$$
where $e_i$ is the embedding of the i-th entity in the entity set $C_T$ and $\theta_i$ is the weight of the entity, such as its frequency. Note that if the entity types are not in the external source, we may obtain entity type embeddings by random initialization or from the entity type tokens. To leverage the entity type information, we design an entity type-aware attention and integrate it into the corresponding self-attention layers. The vectors $K_E \in \mathbb{R}^{B \times N_h \times N \times d_k}$ and $V_E \in \mathbb{R}^{B \times N_h \times N \times d_k}$ represent the key and value of the entity type-aware attention, which are concatenated with the original key $K \in \mathbb{R}^{B \times N_h \times n \times d_k}$ and value $V \in \mathbb{R}^{B \times N_h \times n \times d_k}$ vectors in the encoder as follows:
$$\mathrm{head}_l = \mathrm{Attn}(Q, K_E \oplus K, V_E \oplus V) \qquad (2)$$
$$K_E, V_E = \mathrm{MLP}(E) \qquad (3)$$
where B is the batch size, $N_h$ denotes the number of heads, $\mathrm{head}_l$ is the head representation of the l-th layer, Attn is the attention function [38], $K_E$ and $V_E$ are obtained from the embedding E by an MLP layer [46], $E \in \mathbb{R}^{N \times d_h}$ is the embedding of entity types with N as the number of entity types, $Q \in \mathbb{R}^{B \times N_h \times n \times d_k}$ denotes the query of the attention, and ⊕ denotes concatenation. Figure 3 shows the merging of the entity type attention. The entity type-aware attention from the external entity knowledge base is only applied to the self-attention layers in the encoder.
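The weighted sum that defines an entity type embedding can be illustrated with a small sketch. The function name `entity_type_embedding`, the toy two-dimensional vectors, and the choice to normalize the frequency weights so that they sum to one are assumptions made for illustration; the paper only states that the weight may be, for example, the entity frequency.

```python
from typing import Dict, List


def entity_type_embedding(
    entity_vectors: Dict[str, List[float]],
    frequencies: Dict[str, float],
) -> List[float]:
    """Weighted sum of entity embeddings for one entity type.

    Assumption: theta_i is the entity frequency normalized over the set.
    """
    total = sum(frequencies[name] for name in entity_vectors)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    dim = len(next(iter(entity_vectors.values())))
    embedding = [0.0] * dim
    for name, vector in entity_vectors.items():
        theta = frequencies[name] / total  # normalized weight theta_i
        for d in range(dim):
            embedding[d] += theta * vector[d]
    return embedding
```

With toy vectors `{"Beijing": [1.0, 0.0], "London": [0.0, 1.0], "New York": [1.0, 1.0]}` and frequencies 2, 1, and 1, the resulting "location" embedding leans toward the most frequent entity.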
Figure 3. Concatenation in a self-attention layer. In the encoder, $K_E$ and $V_E$ are calculated from the entity type embedding of external entity knowledge via the type-aware MLP layer. In the decoder, $K_E$ and $V_E$ are calculated from the hidden states of entity type tokens in the encoder via the type-aware MLP layer.
In the decoder, E is taken from the hidden states of the entity type tokens, which are part of the input. Then, we use Equations (2) and (3) to apply the entity type-aware attention. In this process, we apply it to all self-attention and cross-attention layers.
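The effect of concatenating type-aware keys and values with the original ones can be sketched for a single head and a single query vector. This is a simplified illustration: the function names are hypothetical, batching and multiple heads are omitted, and placing the type entries before the token entries is an assumed ordering.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def type_aware_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    type_keys: List[Vector],
    type_values: List[Vector],
) -> Tuple[Vector, List[float]]:
    """Scaled dot-product attention over type entries plus token entries."""
    all_keys = type_keys + keys        # K_E concatenated with K (assumed order)
    all_values = type_values + values  # V_E concatenated with V
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale for key in all_keys
    ]
    weights = softmax(scores)
    dim = len(all_values[0])
    output = [
        sum(w * value[d] for w, value in zip(weights, all_values))
        for d in range(dim)
    ]
    return output, weights
```

Each query position therefore distributes its attention over N + n entries, so the entity type representations compete with, and can inform, the ordinary token context.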

Word-Word Relation-Aware Attention
Improving entity boundary detection is crucial for named entity recognition; therefore, we integrate information relevant to entity boundaries into our framework. We utilize word-word relation representations as feature information to learn about entity boundaries. Specifically, inspired by prior works [45,47,48,49] that enhance the relations between tokens in an entity for NER, we extract these word-word relation features to improve the representation of the predicted token in the decoder.
Given a sentence $X = \{x_1, \ldots, x_n\}$, we obtain the hidden representation $H = \{h_1, \ldots, h_n\}$ after the encoder layers. The word-word relation representation $r_{ij}$ of the word pair $(x_i, x_j)$ in the sentence is obtained through the bi-affine layer:
$$r_{ij} = \mathrm{MLP}(h_i)^{\top} W_1 \, \mathrm{MLP}(h_j) + W_2 \, (\mathrm{MLP}(h_i) \oplus \mathrm{MLP}(h_j)) + b$$
where $W_1$, $W_2$, and b denote the trainable parameters and MLP is the fully connected layer. The word-word relation representations $r_{ij}$ reflect the entity boundary information and the syntactic structure of the sentence. These representations are valuable knowledge with which the model can perceive and learn entity boundaries more accurately, which plays a vital role in identifying entities. As the above $r_{ij}$ cannot be directly used in our proposed word-word relation-aware attention mechanism, we feed the representations between words into an MLP layer to obtain useful features. This process also generates keys and values that match the attention mechanism inherent in the seq2seq model itself. Finally, the obtained word-word relation-aware attention key $K_R \in \mathbb{R}^{B \times N_h \times M \times d_k}$ and value $V_R \in \mathbb{R}^{B \times N_h \times M \times d_k}$ matrices are concatenated with the original key $K \in \mathbb{R}^{B \times N_h \times n \times d_k}$ and value $V \in \mathbb{R}^{B \times N_h \times n \times d_k}$, respectively, in the cross-attention layers to enhance the model generation with regard to the entity boundaries. The entity relation-aware attention is defined as follows:
$$\mathrm{head}_l = \mathrm{Attn}(Q^l, K_R^l \oplus K_E^l \oplus K^l, V_R^l \oplus V_E^l \oplus V^l)$$
where B is the batch size, $N_h$ denotes the number of heads, M is the length of the word-word relation-aware attention sequence obtained using the convolution operation, $K_R^l$ and $V_R^l$ are from $K_R$ and $V_R$, respectively, and $K_E^l$ and $V_E^l$ are calculated from the entity type embedding described in the above section. Here, however, the entity type embedding is taken from the hidden states of the entity types that form part of the input sequence. Figure 4 shows the word-word relation-aware attention incorporated into the cross-attention layer.
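A bi-affine scorer over word pairs can be sketched as follows. This is an illustrative simplification rather than the paper's exact layer: it applies the bilinear and linear terms directly to the hidden states (no preceding MLP), produces a single scalar per word pair instead of a relation vector, and all names and toy parameters are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_score(
    h_i: Vector, h_j: Vector, w1: Matrix, w2: Vector, b: float
) -> float:
    """Score one word pair: h_i^T W1 h_j + W2 [h_i ; h_j] + b."""
    bilinear = sum(
        h_i[a] * w1[a][c] * h_j[c]
        for a in range(len(h_i))
        for c in range(len(h_j))
    )
    linear = sum(w * x for w, x in zip(w2, h_i + h_j))  # concatenation term
    return bilinear + linear + b


def relation_matrix(
    hidden: List[Vector], w1: Matrix, w2: Vector, b: float
) -> Matrix:
    """Compute r_ij for every ordered word pair (x_i, x_j)."""
    return [
        [biaffine_score(h_i, h_j, w1, w2, b) for h_j in hidden]
        for h_i in hidden
    ]
```

Because the bilinear term is not symmetric in general, the score for (x_i, x_j) can differ from the score for (x_j, x_i), which lets such a layer encode ordered relations between words, such as which word begins a span and which word ends it.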

Entity Decoding
The decoder decodes the token embedding from the encoder to generate the entities. In particular, at step t, the decoder acquires the token embedding $h_t^d \in \mathbb{R}^{d_h}$ based on the encoder output and all the previously decoded tokens as follows:
$$h_t^d = \mathrm{Decoder}(H^e; \hat{Y}_{<t})$$
where $H^e \in \mathbb{R}^{n \times d_h}$ is the encoder output and $\hat{Y}_{<t} = [\hat{y}_1, \ldots, \hat{y}_{t-1}]$ is the token sequence generated before step t. To enhance the accuracy of the generated tokens, we introduce a context fusion layer to further decode the output. We join the entity type representations and the hidden states of the tokens in the encoder with the respective hidden states of the output, which can help to improve the representation of context within the sentence; here, E is the entity type representation in the encoder and ⊗ denotes the dot product. Finally, the output token index distribution $P_t$ is obtained from this fused representation. For a given sequence $X = \{x_1, x_2, \ldots, x_n\}$, we minimize the negative log-likelihood with respect to the corresponding ground truth labels, which can be defined as
$$\mathcal{L} = -\log p(Y \mid X) = -\sum_{t} \log p(y_t \mid X, Y_{<t})$$
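The training objective can be illustrated with a short sketch that sums the negative log-probability of the gold index at each decoding step. The assumption that the distribution at each step comes from a softmax over raw scores, and the helper names, are illustrative choices; the paper's context fusion layer is not modeled here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_nll(
    step_logits: Sequence[Sequence[float]], gold_indices: Sequence[int]
) -> float:
    """Negative log-likelihood of the gold target sequence.

    step_logits[t] holds the (hypothetical) scores over candidate output
    indices at step t; gold_indices[t] is the correct index at that step.
    """
    if len(step_logits) != len(gold_indices):
        raise ValueError("one gold index is required per decoding step")
    return -sum(
        log_softmax(logits)[gold]
        for logits, gold in zip(step_logits, gold_indices)
    )
```

A uniform distribution over two candidates at each of two steps gives a loss of 2 ln 2, and the loss falls toward zero as the scores of the gold indices grow relative to the others.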

Experiments
This section provides our detailed experimental setup, including the NER datasets evaluated, the backbones with parameter settings, the evaluation metrics, and the baseline models for comparison. In addition, we discuss the main results on the datasets compared to the baselines. This section aims to provide readers with insight into the validation of the proposed method and its performance against the baselines.
To further validate the effectiveness of the model, we evaluated the long NER datasets EBM-NLP (https://github.com/bepnye/EBM-NLP/, accessed on 12 November 2023) [57] and SemEval 2017 (https://scienceie.github.io/resources.html, accessed on 12 November 2023) [58] as well. EBM-NLP annotates PICO spans, defining the Participants, Interventions, Comparisons, and Outcomes in a clinical trial paper [59]. We processed it following the works [60][61][62]. SemEval 2017 is a task involving extracting keyphrases and relations from documents with mention-level keyphrase identification (the types are PROCESS, TASK, and MATERIAL), mention-level keyphrase classification, and mention-level semantic relation extraction between keyphrases. We merged the first two subtasks as the NER task and prepared it following [58]. The statistics of the datasets are listed in Tables 1 and 2.

Implementation Details
We adopted BART-Large as the backbone network. Following previous works [10,45], the encoder and decoder each had twelve layers with 1024-dimensional embeddings in our experiments. For the English datasets, we used the BART-Large model [63]. For the Chinese datasets, we used the BART-Large-Chinese model [64]. We used the AdamW [65] optimizer. We executed a grid search of the hyperparameters shown in Table 3 and selected the set of parameters that had the best performance on the validation set. The batch size was 32 for OntoNotes 5.0 and 16 for the others. Finally, we used a learning rate of $1 \times 10^{-5}$ for the BART-Large model and $5 \times 10^{-5}$ for the other components. When deriving the key $K_E$ and value $V_E$ of the self-attention from the entity type embedding, we employed a multilayer MLP similar to the proj_down-proj_up structure [66], with a down dimension of 512 and an up dimension of 1024. For most baseline methods, the hyperparameters were set according to the experimental configurations in the original papers. However, there were a few variations; for the SemEval 2017 and EBM-NLP datasets, the batch size and maximum sequence length were the same as those used for our proposed method.

Evaluation Metrics
An entity was considered to be correctly predicted if both the entity label and the boundary matched the ground truth. Following prior works [10,11,33], we computed the precision (P), recall (R), and F1 (F) scores for each dataset, utilizing the F1 score on the validation set as the criterion for selecting the optimal model. We ran each experiment five times and report the average metrics.
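The exact-match criterion can be sketched as a set comparison over entity tuples. The tuple layout `(sentence_id, start, end, label)`, the function name `entity_prf`, and the choice to pool all entities of a dataset before computing the scores (micro-averaging) are assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical entity record: (sentence_id, start, end, label).
EntityRecord = Tuple[int, int, int, str]


def entity_prf(
    predicted: Iterable[EntityRecord], gold: Iterable[EntityRecord]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact boundary-and-label matching."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A prediction with the right boundary but the wrong label, such as `(0, 3, 4, "ORG")` against a gold `(0, 3, 4, "LOC")`, counts as both a false positive and a false negative under this criterion.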

Comparative Baselines
We compared our method with several previous baselines:
• LSTM-CRF/Stack-LSTM [4] adopts a bidirectional LSTM with CRF, and ID-CNNs [5] offer iterated dilated convolutional neural networks.
• SH [7] proposes the use of hypergraphs to address the NER task.
• Seq2Seq [14] obtains tokens with the labels in a sequence.
• BiaffineNER [15] offers a bi-affine module for span-based models to explore spans.
• BartNER [10] offers a unified NER model with a pointer generating the start-end indexes of entities and types.
• DebiasNER [11] designs data augmentation to eliminate incorrect biases from a causality perspective.
• W2NER [37] models the unified NER as word-word relation classification to tackle the different NER tasks.
• LatticeLSTM [56] probes a lattice LSTM encoding the characters and words matching a lexicon.
• TENER [41] uses an encoder to consider character, word, direction, relative distance, and unscaled attention.
• LGN [67] uses a lexicon-based graph neural network with global semantics to interact among characters, words, and sentence semantics.
• FLAT [42] is a flat-lattice model that converts the lattice structure into a flat structure.
• Lexicon [68] uses lexical knowledge in Chinese NER based on a collaborative graph network.
• LR-CNN [69] uses a CNN-based approach with lexicons via a rethinking mechanism.
• PLTE [70] uses the characters and matches lexical words in parallel via the transformer.
• SoftLexicon [71] merges the word lexicon into the character representations and adjusts the character representation layer.
• MECT [72] uses a multi-metadata embedding-based cross-transformer that fuses the characters' structural information.
The results on all the datasets are shown in Tables 4-6. Several key observations can be made from the comparison results. Our model performs better than some sequence labeling, span-based, and generative models on most datasets. For CoNLL2003, our method outperforms the sequence labeling method [4], the span-based method [15], and the hypergraph-based method [7] by 2.6%, 1.02%, and 3.04%, respectively, in terms of F1 score. Compared with the seq2seq models (Seq2Seq [14], BartNER [10], and DebiasNER [11]), our method shows F1 increases of 0.56%, 1.02%, and 0.4%, respectively, on CoNLL2003. For OntoNotes 5.0, our method improves the F1 score by 4.0% and 0.34% compared with the ID-CNNs [5] and W2NER [37] baselines, respectively. Compared with the generative models BartNER and DebiasNER, there is an improvement of 0.46% and 0.42%, respectively, in terms of the F1 score. The Chinese NER datasets all show performance improvements to varying extents. Compared to the generative method BartNER, our method shows increases of 5.99%, 0.76%, 2.75%, and 7.68%, respectively. For SemEval 2017 and EBM-NLP, the performance of our method shows significant improvement compared with both BartNER and W2NER. These experimental results validate our hypothesis that our approach can effectively model word-word relations, thereby improving entity decoding. Moreover, the entity type information incorporated into our method further boosts model performance.
Overall, our approach shows only a slight improvement compared to W2NER. The reason for this is that W2NER unites the position region-aware representation of the grid and the relation representation of token pairs to estimate entities. Although our model uses word-word relation representation, it has only coarse granularity and no directionality during inference. It can also be seen that the improvement of our model over the baselines on the EBM-NLP dataset is marginal compared to PubMedBERT [61]. The primary reason for this is that PubMedBERT is a domain-specific pre-trained language model utilizing biomedical data. Nonetheless, our model is capable of attaining comparable or superior performance.

Analysis and Discussion
In this analysis and discussion section, we analyze the results obtained from the various components within our method to show the performance of our proposed approach.Then, to assess our method's robust ability to process long sentences and long entities, we explore the implications of our approach and compare its performance with that of the baselines.Finally, by analyzing some instances of existing incorrect predictions, we aim to provide insights for potential future work.This section contextualizes our results, and we expect that the analysis will benefit future work.

Ablation Study
To estimate the impact of the various components within our method, we conducted ablation studies by sequentially omitting each component, namely, word-word relation-aware attention (rel-att) and entity type-aware attention (type-att). In this section, we designate the seq2seq model devoid of "rel-att" and "type-att" as the "baseline". We refer to the model that includes word-word relation-aware attention as "+ rel-att". The same nomenclature applies to "+ type-att" and "+ rel-att & type-att".
The outcomes of the ablation studies are presented in Tables 7-9. The results indicate that our proposed method significantly outperforms the baseline when incorporating word-word relation attention (+ rel-att) or entity type attention (+ type-att). This validates the effectiveness of both word-word relation attention and entity type attention. Moreover, integrating both word-word relation attention and entity type-aware attention into an NER framework yields the most promising results. The size of the improvements varies subtly with the entity types, domains, and complexity of the entities. For the Chinese NER datasets, the improvement is relatively significant. In addition to their simple structures, the external entity type embeddings mitigate the diverse and complex expressions of Chinese entities, leading to a closer representation of entities within the same type. For example, the entity types in Weibo have similarities, such as per.nam and per.nom, representing specific and general persons, respectively (e.g., "张三" is labeled as per.nam, and "男人" is labeled as per.nom). Integrating external entities to construct the embedding of entity types can help to enhance the distinction between them. Furthermore, including the entity type as part of the input enhances the in-context learning within the sentence. Leveraging entity type-aware attention during decoding further reinforces the mapping between entity textual information and entity types. Additionally, given the traits of Chinese tokenization, introducing word pair relation representation helps the model to understand the entity boundary information. On the SemEval 2017 and EBM-NLP datasets, each of our proposed components performs better than the baseline. Our method can generate sequences of various lengths and learn more information from the entity type and word-word relation attention.

Effect on Long Sentences
In this section, we split the test set by sentence length in order to assess our method's ability to process long sentences. The results are shown in Figure 5. Specifically, we categorized the test sets into subsets based on sentence length (#words), with ranges set at [0-10, 10-20, 30-40, 50-60, ≥60]. Our method provides its largest overall gain compared to the generative model [10] on long sentences (≥60 words). Note that the amount of data with sentence lengths in [50-60, ≥60] is relatively small compared to the other groups; thus, the evaluation scores for these groups are relatively high. Nonetheless, our model performs better than the baseline models, particularly in long-sentence scenarios.

Effect on Long Entities
To verify the ability of our method to handle long entities, we report the experimental results in Figure 6. In this section, we selected BartNER and W2NER as baselines. Considering the number of long entities, we set the entity length in the range from 1 to 5 for CoNLL2003, 1 to 6 for SemEval 2017, and 1 to 8 for EBM-NLP. As can be observed, there is little change in performance on the sets with short entities. However, on CoNLL2003 with long entities (E(L) ≥ 5), the F1 score improves by up to 3%. For SemEval 2017, when the entity length is less than 5, the difference between the models is not noticeable; however, when it is greater than 5, the F1 score of our model is nearly 7% higher than that of the other models [10,37]. For EBM-NLP, our method provides the largest gain compared to the other models on long entities (E(L) ≥ 8).

Effectiveness for Entity Boundary
We ran experiments to analyze the effectiveness of entity boundary recognition. For example, for instance #1 in Table 10, the prediction "(中韩, gpe)" has an error in entity boundary detection even though the generated entity type is "gpe". In this analysis, we only consider the entity boundary metric and disregard whether the entity type is correct. The F1 scores of different models on the datasets are shown in Figure 7. It can be observed that our method has a positive impact on entity boundary detection. For CoNLL2003 and OntoNotes 4.0, our method performs slightly better than W2NER, which employs relative position representation and fine-grained token pair relation representation. For SemEval 2017, our proposed approach is significantly superior to the other models. These results are consistent with our expectation that incorporating the entity type-aware attention and word-word relation-aware attention into the generative NER framework can contribute to enhancing entity boundary detection performance.

Instance #2: the Office of Fair Trade called for British Airways/American to allow third-party access to their joint frequent flyer programme where the applicant does not have access to an equivalent programme .
Pred: (Office of Fair Trade, ORG), (British Airways American, ORG)
Gold: (Office of Fair Trade, ORG), (British Airways American, ORG)

Instance #3: These data were converted to standard triangulation language (STL) surface data as an aggregation of fine triangular meshes using 3D visualization and measurement software (Amira version X , FEI , Burlington , MA , USA).
Pred: (standard triangulation language, Process), (triangular meshes, Material), (3D visualization, Process)
Gold: (standard triangulation language, Process), (triangular meshes, Material), (3D visualization, Process), (Amira version X, Material)

Instance #4: South Africa's trip to Kanpur for the third test against India has given former England test cricketer Bob Woolmer the chance of a sentimental return to his birthplace .
Pred: (South Africa, LOC), (Kanpur, LOC), (India, LOC), (England, LOC), (Bob Woolmer, PER), (test cricketer, PER)
Gold: (South Africa, LOC), (Kanpur, LOC), (India, LOC), (England, LOC), (Bob Woolmer, PER)
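The boundary-only evaluation used in this section, in which a predicted span counts as correct regardless of its type, can be sketched as follows. The tuple layout (sentence id, start, end, type) and the function name `boundary_f1` are assumptions made for illustration; the sketch is not the evaluation script behind Figure 7.

```python
def boundary_f1(gold, pred):
    """F1 over entity spans only.

    gold, pred: iterables of (sent_id, start, end, type).
    The type is dropped, so only the span has to match.
    """
    gold_spans = {(s, b, e) for s, b, e, _ in gold}
    pred_spans = {(s, b, e) for s, b, e, _ in pred}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this metric, a prediction with the right span but the wrong type is counted as a hit, whereas a prediction with the right type but a shifted boundary is counted as a miss.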

Case Study
We selected a number of instances for analysis to inform future work in the field of NER. We classified the incorrect entities into four classes, as shown in Table 10.
Incorrect Boundaries: As instance #1 shows, the generated entity has an incorrect boundary. The Chinese sentence in instance #1 mentions two countries, which were misidentified as a single entity. In Chinese, the entity words are abbreviations of the two countries; the model did not learn this knowledge, which illustrates the difficulty of determining entity boundaries. For sentences with multiple nested entities, the model is more prone to misjudgment. Therefore, enhancing the learning of entity boundary information can improve model performance.
Distract Context: As instance #2 shows, our model predicts an incorrect entity type because the context is ambiguous: the entity may appear in a context similar to that of another type, or the sentence may lack descriptive context. The same tokens may have different meanings in different semantic contexts. To recognize entity types accurately, the model needs a comprehensive understanding of both the sentence context and the entity types.
Missing Entities: As instance #3 shows, the result misses an entity. This may be because the entity is rare or domain-specific: unbalanced learning makes the model tend to judge sentences with a similar context in favor of high-frequency entities. For this type of error, the model's understanding of critical information in sentences can be improved by enhancing the attention mechanisms and applying data augmentation.
Extra Entities: As instance #4 shows, our model predicts extra entities that look right but are not in the gold set. The reason for this may be that the entity appears repeatedly in other sentences or that the data are noisy. Combining NER with other tasks, such as entity boundary detection, entity linking, and entity disambiguation, can help to prevent excessive entity recognition.
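One possible way to assign prediction errors to the four classes above automatically is sketched below. The decision rules are assumptions made for illustration and are not taken from the analysis itself: an exact span with a different type is treated as a type error (the "Distract Context" class), a prediction overlapping a gold span with different boundaries is treated as a boundary error, a prediction with no overlap is treated as an extra entity, and a gold entity with no overlapping prediction is treated as missing. The function name `classify_errors` is hypothetical.

```python
def classify_errors(gold, pred):
    """Assign each mismatch in one sentence to an error class.

    gold, pred: lists of (start, end, type) with `end` exclusive.
    Returns a list of (category, entity) pairs.
    """
    gold_set, pred_set = set(gold), set(pred)
    gold_types = {(s, e): t for s, e, t in gold}

    def overlaps(a, b):
        # Half-open intervals [start, end) intersect.
        return a[0] < b[1] and b[0] < a[1]

    errors = []
    for p in sorted(pred_set - gold_set):
        span = (p[0], p[1])
        if span in gold_types:
            errors.append(("wrong_type", p))          # same span, other type
        elif any(overlaps(span, g) for g in gold_types):
            errors.append(("incorrect_boundary", p))  # partial overlap
        else:
            errors.append(("extra_entity", p))        # no gold counterpart
    for g in sorted(gold_set - pred_set):
        if not any(overlaps((g[0], g[1]), (p[0], p[1])) for p in pred_set):
            errors.append(("missing_entity", g))      # nothing predicted here
    return errors
```

Such rules capture only the surface form of an error; deciding why a type was mispredicted, for instance because of a distracting context, still requires manual inspection of the sentence.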

Conclusions
In this work, we have introduced a novel approach that merges entity type and word-word relation information into the generative NER framework to achieve better performance. Specifically, we combine entity type and word-word relation attention with the original attention in the backbone network, improving the model's ability to discriminate entity types and detect entity boundaries. We further take the entity types as special tokens and as part of the input for learning valuable knowledge from the context. Experiments on various benchmarks show the superior performance of our method. Integrating entity types as special tokens enriches the model's context learning, allowing for more precise entity recognition. Furthermore, introducing entity type attention strengthens the connection between entity tokens in the sentence and predefined entity types. We transform the entity boundary information into relations between word pairs and merge it into the proposed framework via the attention mechanism, including the syntactic and semantic relationships between words, to enhance the accuracy of entity boundary detection. However, our method has weaknesses in the potential increase of model complexity and evaluation time. To address these issues, we will explore further approaches, such as non-autoregressive methods, which can speed up the decoding. Because our proposed method primarily relies on a large amount of annotated data, we will focus on generative NER models based on large language models in a low-resource setting and consider how to integrate resources such as images. The practical implications of our proposed approach extend to relation extraction, where accurate entity recognition is crucial. We are also trying to use this method to improve knowledge graph construction.

Figure 1. The architecture of our method, which contains both entity type-aware and word-word relation-aware attentions in the seq2seq framework. We incorporate the word-word relation-aware attention into the decoder and the entity type-aware attention into both the encoder and decoder. For the attention concatenation, refer to the process shown in the following sections.

Figure 2. Entity type embedding from external entity knowledge.

Figure 4. Concatenation in a self-attention and cross-attention layer. In the decoder, K_E and V_E are calculated from the hidden states of entity type tokens in the encoder by the type-aware MLP layer, while K_R and V_R are calculated from the word-word relation representations by the bi-affine and relation-aware MLP layer.

Figure 6. Results on various datasets when entity length changes.

Figure 7. Results on entity boundary detection.

Table 1. Statistics of NER datasets other than SemEval 2017 and EBM-NLP.

Table 2. Statistics of the SemEval 2017 and EBM-NLP datasets.

Table 3. Hyperparameters used to train our model.

Table 4. Results on the CoNLL2003 and OntoNotes 5.0 datasets; results are statistically significant with p-value < 0.005. The best scores are in bold, while the second-best scores are underlined.

Table 5. Results on the Chinese NER datasets; results are statistically significant with p-value < 0.005. The best scores are in bold, while the second-best scores are underlined.

Table 6. Results on the Long NER SemEval 2017 and EBM-NLP datasets; results are statistically significant with p-value < 0.005. The best scores are in bold, while the second-best scores are underlined.

Table 7. Ablation studies on the CoNLL2003 and OntoNotes 5.0 datasets.

Table 8. Ablation studies on the Chinese NER datasets.

Table 9. Ablation studies on the Long NER SemEval 2017 and EBM-NLP datasets.

Table 10. Error analysis. Text in color indicates predicted entities that are incorrect.