HTLinker: A Head-to-Tail Linker for Nested Named Entity Recognition

Abstract: Named entity recognition (NER) aims to extract entities from unstructured text, and nested structures often exist between entities. However, most previous studies have focused on flat named entity recognition while ignoring nested entities. Moreover, the importance of words in a text should vary for different entity categories. In this paper, we propose a head-to-tail linker for nested NER. The proposed model exploits each extracted entity head as conditional information to locate the corresponding entity tails under different entity categories. This strategy takes part of the symmetric boundary information of an entity as a condition and effectively leverages textual information to improve entity boundary recognition. The proposed model also accounts for the variability in the semantic correlation between tokens for different entity heads under different entity categories. To verify the effectiveness of the model, extensive experiments were conducted on three datasets, ACE2004, ACE2005, and GENIA, yielding F1-scores of 80.5%, 79.3%, and 76.4%, respectively. The experimental results show that our model outperforms all compared methods.

Traditional NER models [11][12][13][14][15] are usually based on a single-layer sequence labeling approach, which assigns one label to each token. However, not all entities in a text exist independently, and nested structures may exist between different entities. As shown in Figure 1, the entity "European" is nested inside the entity "European Judicial Systems", so "European" is part of two entities. The nested structure of entities is a realistic phenomenon, and handling it can further improve the accuracy of the model in extracting entities.
In recent years, numerous models have been proposed for nested NER. For instance, hypergraph-based approaches [16][17][18] were proposed to deal with nested structures among entities. However, the construction of hypergraphs relies on extensive hand-crafted design. For a complex task, a common idea is to decompose it into simple modules. MGNER [19] divides NER into two steps, a detector and a classifier, which first locate the entity spans and then determine the entity category based on the span representation. However, this model ignores the boundary information of entities, resulting in inaccurately localized entity boundaries. Zheng et al. [20] exploited a boundary-aware method to precisely locate entity boundaries and then determine the entity category. However, this model ignores the connection between entity boundary identification and entity category classification.
Inspired by the hierarchical boundary tagger method [6,21] and the linking method used in relation extraction [7], we constructed a head-to-tail NER model that divides nested NER into two highly correlated steps. First, all entity heads in the text are identified; then the candidate entity head information is integrated and passed back to the text, and the corresponding entity tails are identified for the different entity categories. For a text with N entities, 1 + N sequence tagging operations are needed to extract the entities: one operation identifies the N entity heads, and N operations identify the corresponding entity tails of the different entity categories for each entity head. This approach has two advantages. First, identifying entities from head to tail can handle nested entities efficiently. Second, a strong correlation exists between the two steps: if entity head recognition makes an error, the text merely receives a wrong condition, and the model does not go on to recognize further spurious entities. In addition, the head-to-tail approach is based on boundary identification. The entity category is determined by integrating entity head-to-tail information with relative position information, and the head and tail of an entity exist symmetrically for the corresponding entity category. Compared with directly recognizing head-to-tail pairs, moving from entity boundary recognition to entity category determination extracts information more effectively. The main contributions are as follows:
• We introduce a novel head-to-tail NER model that extracts entities from head to tail and can handle the nested structures existing between different entities.
• To extract entities more accurately, the proposed model sequentially extracts entity head h, entity tail t, and entity category e. Nested NER is divided into two steps: identifying the entity heads, and then identifying the corresponding entity tails of each candidate entity head for the different entity categories.
• We conducted extensive experiments on three datasets: ACE2004, ACE2005, and GENIA. The F1-scores of HTLinker are 80.5%, 79.3%, and 76.4% on the three datasets, respectively. Compared to nine other models, the proposed method obtains the best results for extracting entities from unstructured texts.

Problem Formulation
Given a text sequence S, the purpose of nested named entity recognition is to extract entities of pre-defined entity categories. The extracted entities (h, t, e) have three main parts: entity head h, entity tail t, and entity category e. These entity categories are all derived from a pre-defined set E. The extracted entities are valid only if the boundaries of entities (h, t) and entity categories (e) are accurately located.
Note that there may be nested entities; e.g., the ORG "the First National Bank of America" contains the GPE "America". Previous methods locate entities in two steps: they first locate the entity position ( f (S) → (h, t)) and then determine the entity category ( f (S, h, t) → e). HTLinker instead locates the entity head first ( f (S) → h) and then integrates this information to locate the corresponding entity tail and entity category ( f (S, h) → (t, e)). Figure 2 illustrates the framework of HTLinker. Specifically, HTLinker consists of an encoder and a decoder. The encoder converts the given text sequence into a vectorized representation. The decoder consists of three main parts: the entity head tagger, the head feature transmitter, and the entity tail tagger. First, all possible entity heads are located by the entity head tagger. For each extracted candidate entity head h, the head feature transmitter passes its features to the sequence embedding, and the corresponding entity tail and entity category are then obtained by the entity tail tagger. Consider the example in Figure 2, where the goal is to extract all entities from the given text ". . . the First National Bank of America . . .", which contains the nested entities "the First National Bank of America" and "America". First, the entity head tagger locates the two candidate entity heads "the" and "America" in the text. Then, for the candidate entity head "the", its features are fused into the text sequence embedding. Finally, combining the text context information with the entity head information, the entity tail tagger locates the entity tail "America" corresponding to the candidate head "the" for the entity category ORG. Similarly, for the candidate entity head "America", the pair ("America", GPE) is identified.

Encoder
A text sequence S of length n needs to be encoded as a vectorized representation. The encoder uses the pre-trained model BERT, which mainly consists of an N-layer Transformer. Pre-trained BERT has a strong ability to characterize the semantics of text. Based on BERT, we transform the text sequence S into the sentence embedding H_S with the following procedure:

H_S = BERT(S)

where H_S ∈ R^{n×d_emb} and d_emb is the encoded token embedding dimension.

Entity Head Tagger
For the sequence embedding H_S from the encoder, the possible entity heads are located by sequence tagging. The entity head tagger ignores entity category information when positioning the heads of all entities in the text. Specifically, a binary head classifier assigns a probability value to each token, indicating the likelihood that the token is an entity head. The sigmoid function σ, with a value range from 0 to 1, is used to compute this probability:

p_i^head = σ(h_i^S W_head + b_head)

where W_head ∈ R^{d_emb×1} and b_head ∈ R^{n×1} are trainable matrices, p_i^head ∈ (0, 1) denotes the probability that the ith token of the given sequence is the head of an entity, h_i^S is the ith token embedding in H_S, and p_i^head is the ith element of P^head. Then, the positions of all candidate entity heads can be determined:

I^head = { i | p_i^head > t^head }

where I^head ∈ R^{1×m}, m denotes the number of candidate entity heads, and t^head is the judgment threshold for the entity head, which can be adjusted according to the training process.
To effectively locate the entity heads, the binary cross-entropy loss is gradually minimized during training:

L_head = −(1/n) Σ_{i=1}^{n} [ t_i^head log p_i^head + (1 − t_i^head) log(1 − p_i^head) ]

where t_i^head is 1 only if the ith token is the head of an entity; otherwise, it is 0.
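As a rough illustration, the head tagger's scoring, thresholding, and binary cross-entropy loss can be written in a few lines of numpy. The shapes follow the text; the weights and gold labels are toy values, not trained parameters:

```python
import numpy as np

# Entity head tagger sketch: a binary classifier scores each token, a sigmoid
# maps scores to probabilities, and tokens above the threshold t_head become
# candidate entity heads.

rng = np.random.default_rng(0)
n, d_emb = 6, 8                       # sequence length, token embedding size
H_S = rng.normal(size=(n, d_emb))     # stand-in for the BERT sequence embedding
W_head = rng.normal(size=(d_emb, 1))  # trainable weight
b_head = 0.0                          # trainable bias (toy scalar)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P_head = sigmoid(H_S @ W_head + b_head).ravel()   # p_i in (0, 1) per token
t_head = 0.5                                      # judgment threshold
I_head = np.flatnonzero(P_head > t_head)          # candidate head positions

# Binary cross-entropy against gold head labels (1 = token is an entity head).
gold = np.array([1, 0, 0, 0, 0, 1], dtype=float)
loss_head = -np.mean(gold * np.log(P_head) + (1 - gold) * np.log(1 - P_head))
```

Because every token is scored independently, any number of heads (including heads of nested entities) can exceed the threshold in a single pass.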

Head Feature Transmitter
For the kth candidate entity head i_k^head in I^head, to locate its corresponding entity tails and entity categories, the information of the entity head is fused into the sequence embedding H_S. This information includes both the position information and the semantic information of the entity head.
First, we consider a combination of the sequence embedding and a position embedding of the candidate entity head. The positions of all tokens relative to the candidate entity head are encoded as a relative positional embedding, which can be used to learn the span features of the entities:

I^rp = I^S − i_k^head,  H_RP = Embedding(I^rp)

where I^S ∈ R^{1×n} denotes the absolute positions of all tokens in the sequence, and I^rp ∈ R^{1×n} denotes the positions of all tokens relative to the candidate entity head. H_RP ∈ R^{n×d_rp} is a trainable parameter, randomly initialized at the beginning of training. Then, conditional layer normalization is employed to fuse the information of the candidate entity head with the sequence embedding H_S:

h_i^(out) = γ ⊙ ( (h_i^(in) − μ_i) / σ_i ) + β

where h_i^(in) is the ith token embedding in H_S and h_i^(out) is the ith token embedding in H_C. γ and β are trainable parameters generated from the conditional input (the candidate entity head), acting as the scale and bias of the normalization; μ_i and σ_i denote the mean and standard deviation of the token embedding h_i^(in), respectively. Finally, by combining H_RP and H_C, we obtain H, which contains both entity head information and contextual information:

H = [H_C ; H_RP]

where H ∈ R^{n×(d_emb+d_rp)}.
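A minimal numpy sketch of the transmitter follows, under the assumption that γ and β are produced from the head token's embedding by small linear maps (the exact parameterization of the conditional layer normalization is not spelled out in the text, so those maps are illustrative):

```python
import numpy as np

# Head feature transmitter sketch: relative-position embeddings plus
# conditional layer normalization (CLN) injecting the candidate head's
# embedding into every token.

rng = np.random.default_rng(1)
n, d_emb, d_rp = 6, 8, 4
H_S = rng.normal(size=(n, d_emb))              # stand-in sequence embedding
head_pos = 0                                   # candidate entity head index

# Relative positions of every token w.r.t. the head, looked up in a
# randomly initialized trainable table.
I_rp = np.arange(n) - head_pos                 # e.g. [0, 1, 2, 3, 4, 5]
rp_table = rng.normal(size=(2 * n, d_rp))      # covers offsets -n .. n-1
H_RP = rp_table[I_rp + n]                      # shift so indices are non-negative

# CLN: gamma and beta are generated from the head embedding (here via toy
# linear maps), then applied inside a standard layer norm.
c = H_S[head_pos]                              # condition: head token embedding
W_gamma = rng.normal(size=(d_emb, d_emb)) * 0.01
W_beta = rng.normal(size=(d_emb, d_emb)) * 0.01
gamma = 1.0 + c @ W_gamma                      # scale, near 1 initially
beta = c @ W_beta                              # bias, near 0 initially

mu = H_S.mean(axis=1, keepdims=True)
sigma = H_S.std(axis=1, keepdims=True)
H_C = gamma * (H_S - mu) / (sigma + 1e-12) + beta

# Concatenate contextual and positional features: H in R^{n x (d_emb + d_rp)}.
H = np.concatenate([H_C, H_RP], axis=1)
```

Because γ and β depend on the head, the same sentence yields a different fused representation for each candidate head, which is what lets one tail tagger serve all heads.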

Entity Tail Tagger
For the kth candidate entity head, the corresponding entity tails and entity categories are found by the entity tail tagger. Specifically, a binary tail classifier assigns a probability value to each token for each entity category, indicating the likelihood that the token is the entity tail of that category for the candidate entity head:

p_i^tail = σ(h_i W_tail^e + b_tail^e)

where W_tail^e ∈ R^{(d_emb+d_rp)×l} and b_tail^e ∈ R^{1×l} are trainable parameters, l is the number of entity categories, and h_i is the ith token embedding of H. Note that p_i^tail encodes not only the entity tail position but also the entity category.
To effectively identify the entity tails and entity categories corresponding to the candidate entity head, the binary cross-entropy loss is gradually minimized during training:

L_tail = −(1/(n·l)) Σ_{i=1}^{n} Σ_{j=1}^{l} [ t_ij^tail log p_ij^tail + (1 − t_ij^tail) log(1 − p_ij^tail) ]

where t_ij^tail is 1 only if the ith token in the text sequence is the tail of an entity of the jth entity category in E; otherwise, it is 0.
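For one candidate head, the tail tagger can be sketched the same way as the head tagger, except that each token now receives l sigmoid scores, one per entity category, so a single pass yields both the tail position and the category. Shapes follow the text; weights and gold labels are toy values:

```python
import numpy as np

# Entity tail tagger sketch: per-token, per-category sigmoid scores over the
# fused embedding H produced by the head feature transmitter.

rng = np.random.default_rng(2)
n, d_emb, d_rp, l = 6, 8, 4, 7        # 7 categories, as in the ACE datasets
H = rng.normal(size=(n, d_emb + d_rp))            # stand-in fused embedding
W_tail = rng.normal(size=(d_emb + d_rp, l))
b_tail = np.zeros(l)

P_tail = 1.0 / (1.0 + np.exp(-(H @ W_tail + b_tail)))   # shape (n, l)

t_tail = 0.5
tails = [(i, j) for i in range(n) for j in range(l) if P_tail[i, j] > t_tail]
# Each (i, j) means: token i is the tail of a category-j entity for this head.

# Per-element binary cross-entropy against gold (token, category) tail labels.
gold = np.zeros((n, l)); gold[5, 2] = 1.0         # toy: token 5 is a tail of category 2
loss_tail = -np.mean(gold * np.log(P_tail) + (1 - gold) * np.log(1 - P_tail))
```

Because the scores are independent per category, one head may legitimately yield tails under several categories, which is how the tagger outputs (t, e) jointly.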

Joint Learning
To learn the head and tail features of the entities in an integrated manner, the losses for locating the heads and tails of the entities are backpropagated together:

L = L_head + L_tail

To better update the gradients, the Adam [22] optimizer is employed to update the model.
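The joint objective is simply the sum of the two tagger losses, backpropagated once through both taggers and the shared encoder. The sketch below shows that sum together with a single standard Adam update step on a toy parameter (the learning rate matches the one reported in the experimental setup; everything else is illustrative):

```python
import numpy as np

# Joint learning sketch: L = L_head + L_tail, optimized with Adam.

def adam_step(theta, grad, m, v, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

loss_head, loss_tail = 0.31, 0.12     # toy loss values
loss = loss_head + loss_tail          # joint objective, backpropagated once

theta = np.array([1.0])               # toy shared parameter
grad = np.array([0.5])                # toy gradient of `loss` w.r.t. theta
theta, m, v = adam_step(theta, grad, np.zeros(1), np.zeros(1), t=1)
```

Summing the losses means a head-tagging error and a tail-tagging error both pull on the shared encoder, which is what couples the two steps during training.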

Datasets
To demonstrate the effectiveness of the proposed method, extensive experiments were implemented on three datasets: ACE2004, ACE2005 [23], and GENIA [24]. Table 1 presents the statistics of the three datasets, which we briefly describe next. The two ACE datasets (ACE2004 (https://catalog.ldc.upenn.edu/LDC2005T09, accessed on 5 June 2021) and ACE2005 (https://catalog.ldc.upenn.edu/LDC2006T06, accessed on 5 June 2021)) have been used for several natural language processing tasks, including named entity recognition [14,15], relation extraction [4][5][6], and event extraction [8,9]. The data in the ACE datasets are drawn from the news domain and contain seven entity categories: PER, LOC, ORG, GPE, VEH, WEA, and FAC. To effectively compare the proposed method with other named entity recognition models, the datasets were divided following the methods in previous work [16], with training, development, and test instances split 8:1:1. In addition, the percentages of nested entities in ACE2004 and ACE2005 are 45.7% and 39.8%, respectively.
The GENIA (http://www.geniaproject.org/genia-corpus/pos-annotation accessed on 5 June 2021) dataset is generally used for tasks such as named entity recognition [14,15] and event extraction [8,9]. The data contained in the dataset are derived from the biomedical domain, containing five entity categories: DNA, RNA, Protein, Cell_Line, and Cell_Type. To effectively compare the proposed method with other named entity recognition models, the dataset was divided following the steps in a previous work [17]. The split of training instances, validation instances, and test instances in the dataset is 8.1:0.9:1. In addition, about 21.6% of the entities in GENIA have a nested structure.

Baselines
We compared the proposed method with nine methods, including the following:
• Boundary-aware [20] applies a boundary-aware approach for extracting nested entities, which mitigates the error propagation in layered NER models.
• Anchor-region networks (ARNs) [28] leverage head-driven phrase structures for extracting nested entities.
• MGNER [19] applies a novel entity position detector to locate entities within a certain range around each token, and is able to extract nested or non-overlapping entities from unstructured texts.
• BiFlaG [29] identifies the inner entities by a GCN based on the identified outer entities to handle the nested entity issue.
Table 2 shows the hyperparameter settings for the experiments. The optimizer was Adam, the learning rate was 1.0 × 10^−5, and the maximum number of training epochs was 80. In addition, the thresholds for the entity head and entity tail position judgments were both set to 0.5; these thresholds can be adjusted to balance precision and recall. All experiments were based on TensorFlow and were run on an NVIDIA Tesla V100 GPU and an Intel Xeon E5-2698 CPU. To prevent over-fitting, training was terminated when the F1-score on the development set did not improve for ten consecutive epochs. The batch size was set to 8. The training time per epoch was 253, 288, and 469 s on ACE2004, ACE2005, and GENIA, respectively.

Evaluation Metrics
For a predicted entity extracted from the text, the extraction is valid only if its boundary location and entity category match those of the gold entity; i.e., entity head h, entity tail t, and entity category e must all be predicted correctly. To fairly compare the proposed method with other methods, three metrics (precision, recall, and F1-score) were used to evaluate its effectiveness. Table 3 presents the precision, recall, and F1-score of the nine NER models. First, HTLinker achieves better results in extracting nested named entities from the given texts than the nine baselines. Specifically, the F1-scores of HTLinker are 80.5%, 79.3%, and 76.4% on ACE2004, ACE2005, and GENIA, respectively, which are 1.0%, 4.2%, and 0.4% better than the best baselines. Second, although the precision of HTLinker is not the highest, HTLinker better balances precision and recall, which results in better performance on the main metric, the F1-score. Third, compared to another boundary-based NER model, Boundary-aware [20], HTLinker extracts entities better on GENIA: the F1-score, precision, and recall are higher by 1.7%, 0.1%, and 3.2%, respectively. Although Boundary-aware considers entity boundaries as a whole, a crucial issue is that it cannot effectively match entity heads and tails in text sequences. Compared with identifying entity category labels using entity boundary information as a condition, HTLinker is more effective because it identifies entity tails under different entity categories with the entity head as conditional input. Table 4 demonstrates the capability of HTLinker to extract the different elements of an entity (h, t, e). First, HTLinker is able to accurately locate the heads or tails of entities in a given text.
Second, HTLinker is able to locate the boundaries (h, t) of entities well. Specifically, the F1-scores of HTLinker for locating entity boundaries are 87.0%, 85.9%, and 80.2% on ACE2004, ACE2005, and GENIA, respectively. Finally, observing the effectiveness of HTLinker in identifying the different elements of entities, HTLinker achieves promising results in identifying entity boundaries. Tables 5 and 6 describe the performance of HTLinker in extracting different categories of entities. The proposed model performs similarly in identifying the same category of entities on the ACE2004 and ACE2005 datasets. Specifically, the difference in the F1-scores of HTLinker for extracting the four categories PER, LOC, ORG, and GPE is around 2%. In addition, HTLinker achieves better results on both ACE datasets when extracting PER entities, because the boundaries of PER entities have clear trigger words. On GENIA, HTLinker achieves its best result when extracting RNA entities, with an F1-score of 84.4%, owing to the distinct trigger words in RNA entities. However, the model is less effective in extracting DNA entities: their boundaries can be accurately located, but the entities are often misclassified as RNA, because the model, while locating entity boundaries accurately, is often confused by the similar boundary information of DNA and RNA. Figure 3 demonstrates the impact of the thresholds on entity extraction. The thresholds of the two taggers were kept equal in the experiments. First, as the thresholds of the two taggers increased, precision increased and recall decreased; both precision and recall remained above 70% over the tested threshold range. Second, the F1-score first increased and then decreased as the threshold was raised. At a threshold of 0.6, the F1-score reached its highest values of 81.0% and 79.6% on ACE2004 and ACE2005, respectively. At a threshold of 0.5, the F1-score reached its highest value of 76.4% on GENIA.
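The exact-match scoring used in the evaluation can be sketched as follows, with toy gold and predicted (h, t, e) sets; a prediction counts as a true positive only if head, tail, and category all match a gold entity:

```python
# Micro precision, recall, and F1 over exact-match (head, tail, category) triples.

def prf(gold, pred):
    """Return (precision, recall, f1) for sets of (h, t, e) triples."""
    tp = len(gold & pred)                     # exact matches on all three elements
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 5, "ORG"), (5, 5, "GPE")}
pred = {(0, 5, "ORG"), (5, 5, "LOC")}         # category error on the second entity
print(prf(gold, pred))                        # -> (0.5, 0.5, 0.5)
```

Note that the correctly bounded but miscategorized entity contributes nothing, which is why boundary F1 (Table 4) can exceed full-entity F1 (Table 3).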

Named Entity Recognition
NER is an essential task in information extraction (IE) and has attracted the interest of numerous researchers. Early works [11][12][13] relied on hand-crafted features to extract entities; the hidden Markov model (HMM) [30] and conditional random field (CRF) [31] were applied in these NER models. Later, deep learning approaches were widely used. CNN-CRF [14] automatically captures the semantic features of text using a convolutional neural network (CNN) combined with a CRF to extract entities. Then, bidirectional LSTM (BiLSTM) [15] was used to learn the semantic features of text, combined with a CRF to learn the correlation between entity labels. In addition to word-level representations, character-level information was considered to better use the information in the given text: LSTMs [32] and CNNs [33,34] were employed to learn character-level features, which overcame the out-of-vocabulary (OOV) issue. However, these methods only assign a single label to each token, which ignores the nested structure that exists between entities.

Nested Named Entity Recognition
Work in recent years has paid more attention to the existence of nested structures of entities in text.
To efficiently extract nested entities, hypergraphs [16,17] were constructed to identify the nested entities in texts. However, the ambiguous structure of the hypergraph affects the effectiveness of extracting nested entities during the inference process. To overcome this issue, Seg-Graph [18] applies a segmental hypergraph structure. In addition, the construction of hypergraphs relies on hand-crafted design. To more effectively construct the hypergraph structure, Hyper-Graph [27] is a novel hypergraph model based on the BILOU tagging scheme, using LSTM to automatically learn the structure of the hypergraph.
Nested entities can also be extracted from inside to outside or from outside to inside by stacking sequence labeling models. HMMs [11,12,35] and SVMs [36] were employed to construct multi-layer sequence labeling models to extract nested entities. However, these methods extract outside and inside entities independently, ignoring the dependencies between nested entities. Alex et al. [37] constructed two modules based on the CRF sequence labeling model and cascaded CRFs from inside to outside and from outside to inside to enhance the correlation between nested entities. However, this model struggles to handle nested entities of the same category. BiLSTM-CRF [34] is a stable and effective model for flat NER, so Ju et al. [25] extracted nested entities from text by stacking BiLSTM-CRF layers. However, stacking different layers of sequence labeling models causes error propagation.
Instead of sequence labeling models, span-based methods have also been used to identify nested entities. MGNER [19] is a novel framework that divides nested named entity recognition into two parts: entity span detection and span classification. The span representation is obtained by first locating the entity span, and then the entity category classification is performed based on the entity span information. This method achieved promising performance in extracting entities, but the boundaries of the extracted entities are susceptible to deviations. Zheng et al. [20] accurately located the boundary using the boundary-aware method and combined the two subtasks of boundary detection and span classification by parameter sharing.

Conclusions and Future Work
In this paper, we presented a head-to-tail named entity recognition model to extract nested or flat entities from a given text. The proposed model is a sequence-tagging approach that identifies entity boundaries and entity categories in two correlated steps. The entity boundary is divided into the entity head and the entity tail, and it is easier to identify the entity tails for different entity categories when each entity head serves as the prior condition. In addition, identifying the entity head and tail in two cascaded steps facilitates more accurate localization of entity boundaries: positioning an entity boundary requires two cascading decisions, and these more stringent conditions improve the accuracy of the model in extracting entity boundaries. The experimental results demonstrated the effectiveness of the proposed method, which achieved the best performance in comparison with nine baselines.
In the future, we will explore more effective methods for fusing the extracted entity head information to improve the accuracy of extracting entity tails for different entity categories.