Named Entity Recognition Networks Based on Syntactically Constrained Attention

Sun, Weiwei; Liu, Shengquan; Liu, Yan; Kong, Lingqi; Jian, Zhaorui

doi:10.3390/app13063993


by Weiwei Sun ^1,2, Shengquan Liu ^1,2,*, Yan Liu ^1,2, Lingqi Kong ^1,2 and Zhaorui Jian ^1,2

¹ College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

² Xinjiang Multilingual Information Technology Laboratory, College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(6), 3993; https://doi.org/10.3390/app13063993

Submission received: 22 February 2023 / Revised: 11 March 2023 / Accepted: 17 March 2023 / Published: 21 March 2023

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

The task of named entity recognition can be transformed into a machine reading comprehension task by associating the query and its context, which contains entity information, with the encoding layer. In this process, the model learns a priori knowledge about the entity, from the query, to achieve good results. However, as the length of the context and query increases, the model struggles with an increasing number of less relevant words, which can distract it from the task. Although attention mechanisms can help the model understand contextual semantic relations, without explicit constraint information, attention may be allocated to less task-relevant words, leading to a bias in the model’s understanding of the context. To address this problem, we propose a new model, the syntactic constraint-based dual-context aggregation network, which uses syntactic information to guide query and context modeling. By incorporating syntactic constraint information into the attention mechanism, the model can better determine the relevance of each word in the context of the task, and selectively focus on the relevant parts of the context. This enhances the model’s ability to read and understand the context, ultimately improving its performance in named entity recognition tasks. Extensive experiments on three datasets, ACE2004, ACE2005, and GENIA, show that this method achieves superior performance when compared to previous methods.

Keywords:

named entity recognition; machine reading comprehension; syntactic constraint information

1. Introduction

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that aims to detect entity spans and semantic categories in a given context. Current NER can be classified into two subclasses: flat NER and nested NER, depending on whether the entities are nested or not. Nested entities are those with nested spans, such as “[Bank of [China]]”, where both “[China]” and “[Bank of China]” are named entities. This type of entity is common in datasets such as ACE2004, ACE2005, and GENIA. Recently, Li et al. [1] proposed a unified approach to solve the nested entity problem. They transformed the named entity recognition task into a Machine Reading Comprehension (MRC) task, where they used the annotation information of the predefined tag set as a query (as shown in Table 1). They then used a pre-trained language model to learn the joint representation of the query and context, and finally predicted the label of each token. This method has achieved good results in named entity recognition tasks. Figure 1 illustrates the process of the model, based on machine reading comprehension in identifying named entities.

Compared with other methods for nested NER, the approach based on machine reading comprehension has its own advantages. Other methods lack prior semantic information about the entity type, so the resulting models have no semantic understanding of the entity type being extracted. In contrast, the machine reading comprehension approach benefits from a query, such as “find an organization in the context.”, that encourages the model to link the word “organization” in the query to the organization entities in the context. Moreover, the model can reduce the ambiguity of similar label categories by encoding a comprehensive description of each label category. However, Li et al. [1] overlooked a problem: if the context sequence is lengthy and does not emphasize words strongly associated with the task (referred to as strong words), the model may attend to all words, allocating more attention to weakly associated words (referred to as weak words). Under the influence of these words, the model’s focus on text components becomes biased.

The key to improving machine reading comprehension performance lies in the ability to effectively model linguistic knowledge from verbose details, as well as to overcome noise. Zhang et al. [2] suggested using syntax to guide the text modeling of context and questions, in order to obtain linguistically motivated word representations, and achieved good results in machine reading comprehension tasks. Inspired by Zhang et al. [2], we propose a syntactic constraint-based neural attention network that takes the dependencies of each word in a sentence as a constraint. We establish associations between related words under the constraint of syntactic structure clues, and guide the model to select strong words based on these associations. This provides more accurate attention signals, and ultimately alleviates the distraction problem that is caused by lengthy sentences. We analyze sentences using a syntactic dependency tree, which describes the dependencies between words, and use these dependencies as constraint information, in order to enhance the attention mechanism and to capture the syntactically relevant parts of each word of interest. Specifically, we used the natural language processing toolkit spaCy (https://spacy.io/ (accessed on 8 April 2021)) to generate a syntactic dependency tree of the context, obtaining the relevant nodes for each word in the context, as shown in Figure 2. Unlike Zhang et al. [2], who used syntax to guide the modeling of both the query and the context, we apply syntactic constraints only to the context, through two pruning strategies. Guiding the query with syntax is not suitable for the NER task, as our queries contain task-relevant prior knowledge that is useful for identifying ambiguous and nested entities.

In summary, our contribution is as follows:

For NER tasks, we propose a new Syntactic Constraint-Based Dual-Context Aggregation Network (SCAN-Net) model. SCAN-Net is the first model that uses a syntactic dependency tree to apply the dependencies of each word in a sentence, as a constraint, to the NER task.
To make better use of syntactic information in helping the model understand the text, we propose two syntactic constraint strategies.
Extensive experiments have been conducted on the ACE2004, ACE2005, and GENIA datasets, and the results demonstrate the effectiveness of the SCAN-Net model.

2. Related Work

2.1. Named Entity Recognition Task

Named Entity Recognition (NER) is an essential task in natural language processing, and it has traditionally been formulated as a sequence-labeling problem [3,4,5]. Conditional Random Field (CRF) models have been widely used for NER [6,7,8], and recent pre-trained language models, such as BERT and its variants, have also brought significant improvements in NER performance [9,10,11]. However, recognizing nested entities, where an entity can contain another entity, remains a challenging problem.

To address nested NER, different approaches have been proposed. Finkel et al. [12] transformed the nested NER task into a syntactic parsing task, while Lu et al. [13] proposed a hypergraph model, which was followed by several extended models [14,15,16]. Luan et al. [17] introduced a dynamically generated graph framework, and Li et al. [1] suggested treating NER as a reading comprehension task, in order to naturally solve the nested NER problem. Yu et al. [18] used a biaffine model based on multi-level BiLSTM to assign scores to all possible spans in the sentence, and Hou et al. [19] generated semantically enhanced entity embeddings to promote the learning of the semantic commonality of entity–context and entity–entity interactions. Yan et al. [20] built their framework on the pre-trained sequence-to-sequence model BART [21], and linearized the entities into a sequence using three entity representations, providing a novel reference method for nested NER tasks, while Li et al. [22] iteratively identified entity segments and performed relation classification, in order to determine overlapping or inherited entities. Finally, Yang et al. [23] converted the named entity recognition task into a constituency parsing task, and used a pointer network to track shared boundaries.

2.2. NLP Tasks Based on the MRC Approach

Recent work by Li et al. [24] proposed a model that treats the relation extraction task as a multi-round QA task. Zhao et al. [25] addressed the issue that, due to semantic diversity, a single query does not clearly describe the meaning of entities and relations in a multi-round QA task. To tackle this, the model they designed introduces a diverse question-and-answer mechanism, and devises two answer-selection strategies, in order to integrate different answers. Li et al. [1] proposed an MRC-based named entity recognition network with label knowledge, using entity label knowledge as a query. To make full use of the label knowledge, Yang et al. [26] encoded contextual and label knowledge separately, and integrated the label knowledge into contextual representations by using a semantic fusion module.

2.3. Application of Syntactic Information to NLP Tasks

Recent approaches have focused on learning context representations that are sensitive to syntactic structures, by utilizing syntactic dependency analysis. A syntactic dependency tree organizes the words of a context into a tree structure according to the linguistic relationships between them, which can guide the model in capturing the dependencies between words in a sentence. Prior studies have incorporated information from syntactic dependency trees by transforming dependencies into vectors. For instance, studies such as [27,28,29,30] leverage the structural information of the dependency tree to construct an adjacency matrix, and use graph neural networks (GNNs) to conduct sentiment analysis. Fu et al. [31] extracted sequence features and region features of the context using the dependency structure, and constructed a complete word graph, in order to extract implicit features between all word pairs, leading to significant improvements in the overlapping relational triple extraction task. Meanwhile, Tian et al. [32] utilized multi-head attention mechanisms to compute the relationship weights between any two words, allowing the model to exploit different dependencies accordingly. Instead of updating the representations of adjacent words using GNNs, our model prunes attention using syntactic information, in order to constrain attention to the relevant tokens, which is a more linguistically motivated approach than simply adding dependency features.

3. Methodology

3.1. Overview

Our objective was to design a neural network model that utilizes syntactic information, in order to constrain attention mechanisms, transforming the NER task into an MRC task, while also using sentence word dependencies, in order to further limit attention. The architecture of the model is illustrated in Figure 3.

3.2. Formalization of Tasks

Given a context $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ of n tokens in a training set D, the aim of the task is to answer questions $Q = \{q_{1}, q_{2}, \dots, q_{c}\}$, based on the information in each X, and to assign a label $y^{t} \in Y$ to each candidate span, where Y is a predefined set of entity types (e.g., ORG, LOC, etc.). The answers are candidate spans $({x_{i}^{t}}_{start}, {x_{j}^{t}}_{end})$ in X, where ${x_{i}^{t}}_{start}$ and ${x_{j}^{t}}_{end}$ indicate that the span $(x_{i}, x_{j})$ has the ground-truth label $y^{t}$.
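The formalization above can be made concrete with a small data structure. The following is only a minimal sketch: the class name `MRCExample`, the helper `build_examples`, the entity-type labels, and the query wordings are all hypothetical and chosen for illustration, using the nested “Bank of China” example from the Introduction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MRCExample:
    """One (query, context) pair together with its gold answer spans."""
    query: str                    # natural-language description of one entity type
    context: List[str]            # context tokens x_1 .. x_n
    label: str                    # entity type y^t that the query asks about
    spans: List[Tuple[int, int]]  # inclusive (start, end) token indices


def build_examples(context: List[str],
                   queries: Dict[str, str],
                   gold: Dict[str, List[Tuple[int, int]]]) -> List[MRCExample]:
    """Create one MRC example per entity type (one query per label)."""
    return [
        MRCExample(query=q, context=context, label=label,
                   spans=sorted(gold.get(label, [])))
        for label, q in queries.items()
    ]


context = ["Bank", "of", "China"]
# Hypothetical labels and queries, for illustration only.
queries = {
    "ORG": "find an organization in the context.",
    "GPE": "find a geo-political entity in the context.",
}
# Nested gold spans: "Bank of China" contains "China".
gold = {"ORG": [(0, 2)], "GPE": [(2, 2)]}
examples = build_examples(context, queries, gold)
```

Because each entity type gets its own query, the two overlapping spans end up in different examples, which is why this formulation handles nesting without any special tagging scheme.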

3.3. Coding Layer

The embedding layer converts word indices into continuous vectors. In SCAN-Net, we utilized BERT to encode the source sequence. BERT [11] is a pre-trained, deep, bidirectional transformer model that has achieved state-of-the-art performances in various NLP tasks. To process the query and context, we first tokenize [33] them, and then concatenate them into a single sequence, which serves as the input sequence for BERT.

[CLS] Query [SEP] Context [SEP]

BERT then receives the combined sequence and outputs the context representation matrix $H \in R^{n \times d}$.
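The input layout described above can be sketched as follows. This is an illustrative assumption rather than the authors' code: whitespace-split tokens stand in for BERT's subword tokenizer, and the function name `build_input` and the `context_mask` output are hypothetical helpers.

```python
def build_input(query_tokens, context_tokens):
    """Concatenate query and context as [CLS] query [SEP] context [SEP]."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS], the query and the first [SEP]; 1 for the rest.
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # Assumed helper mask: 1 only at context positions, where entities can occur.
    context_mask = [0] * (len(query_tokens) + 2) + [1] * len(context_tokens) + [0]
    return tokens, segment_ids, context_mask


tokens, segment_ids, context_mask = build_input(
    ["find", "an", "organization"], ["Bank", "of", "China"]
)
```

A mask of this kind is one simple way to restrict the later start and end predictions to context positions, so that query tokens are never returned as entities.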

3.4. Syntactically Constrained Attention Networks

First, the syntactic adjacency matrix $A = (a_{i,j}) \in R^{n \times n}$ of the text is constructed using the syntactic dependency tree. Each element $a_{i,j}$ of the syntactic adjacency matrix A indicates whether there is a dependency relationship between word $x_{i}$ and word $x_{j}$. Then, multi-head attention yields the attention score matrices $A_{score}^{i} \in R^{n \times n}$ $(1 \leq i \leq h)$, where h denotes the number of heads, as shown in Formula (1) below:

$$A_{score}^{i} = \frac{Q K^{T}}{\sqrt{d_{k}}} \tag{1}$$

where Q and K are the abstract matrices obtained by projecting the context vector $H \in R^{n \times d}$ from the BERT output. For each attention score matrix $A_{score}^{i}$, the goal is to make each word focus, as much as possible, on only the words in the sentence that have a dependency on itself, subject to the constraints of syntactic information. Syntactically constrained attention $A_{SC}^{i}$ is defined as follows: words with dependencies are selected separately for each head, based on the attention score matrix. Then, for each syntactically constrained attention $A_{SC}^{i}$, a softmax operation is applied. To obtain a syntactically constrained attention $A_{SC}^{i}$ that better helps the model understand the context, we designed two strategies to integrate the syntactic information with the attention matrix, namely, the soft-pruning syntactic enhancement attention strategy and the hard-pruning syntactic concentration attention strategy.
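As an illustration of the syntactic adjacency matrix A described above, the sketch below builds it from a list of head indices, one per token. The parse is hand-written for this example rather than produced by a parser, and two choices are assumptions not stated in the text: the matrix is made symmetric, and each word is connected to itself.

```python
def dependency_adjacency(heads, self_loops=True):
    """Build an n x n 0/1 adjacency matrix from per-token head indices.

    heads[i] is the index of the head of token i; the root points to itself.
    """
    n = len(heads)
    A = [[0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        if head != child:
            # Assumption: a dependency links both words, so A is symmetric.
            A[child][head] = 1
            A[head][child] = 1
    if self_loops:
        # Assumption: every word is also related to itself.
        for i in range(n):
            A[i][i] = 1
    return A


tokens = ["She", "visited", "Bank", "of", "China"]
# Hand-written parse: She <- visited (root), Bank <- visited, of <- Bank, China <- of.
heads = [1, 1, 1, 2, 3]
A = dependency_adjacency(heads)
```

With self-loops enabled, every row of A keeps at least one non-zero entry, which matters for any pruning step that removes the unrelated words entirely.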

Soft-pruning syntactic enhancement attention strategy (hereinafter referred to as the soft-pruning strategy): the attention weights of the strong words in each attention score matrix $A_{score}^{i}$ are increased, according to the syntactic dependency information provided by the adjacency matrix A, and the attention weights of the weak words are kept constant. After obtaining $A_{SC}^{i}$, the softmax operation is performed. The specific operations are shown in Formulas (2) and (3).

$$A_{SC}^{i} = A \oplus A_{score}^{i} \tag{2}$$

$$A_{SC_{score}}^{i} = \mathrm{softmax}(A_{SC}^{i}) \tag{3}$$

where $A_{SC}^{i}$ denotes the syntactically constrained attention of the ith head, and $A_{SC_{score}}^{i}$ denotes the sharpened constrained attention of the ith head, obtained by applying a softmax operation to $A_{SC}^{i}$.

Hard-pruning syntactic concentration attention strategy (hereinafter referred to as the hard-pruning strategy): the attention of weak words in each attention score matrix $A_{score}^{i}$ is directly removed, based on the syntactic dependency information provided by the adjacency matrix A. The specific operations are shown in Formulas (4) and (5).

$$A_{SC}^{i} = A \circ A_{score}^{i} \tag{4}$$

$$A_{SC_{score}}^{i} = \mathrm{softmax}(A_{SC}^{i}) \tag{5}$$

While both of these strategies can assist the model in comprehending the context, the principles behind them are different. The soft-pruning strategy aims to strengthen the model by increasing the attention weights of strong words, enabling the model to pay more attention to them while still considering weak words. On the other hand, the hard-pruning strategy restricts the model by eliminating the attention of weak words, forcing it to focus solely on the strong words that have not been eliminated.
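The two strategies can be compared on a toy attention row. The sketch below is a plain-Python illustration under stated assumptions: the combination operator in Formula (2) is read as element-wise addition, Formula (4) is read literally as an element-wise product, and `hard_prune_masked` is a variant that is not stated in the text, added only to show what exact removal of weak words would look like.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_prune(scores, A):
    """Formulas (2)-(3), reading the operator as element-wise addition."""
    return [softmax([s + a for s, a in zip(s_row, a_row)])
            for s_row, a_row in zip(scores, A)]


def hard_prune(scores, A):
    """Formulas (4)-(5) read literally: element-wise product, then softmax."""
    return [softmax([s * a for s, a in zip(s_row, a_row)])
            for s_row, a_row in zip(scores, A)]


def hard_prune_masked(scores, A):
    """Assumed variant: pruned scores become -inf, so their weight is exactly 0.

    Requires every row of A to keep at least one word (e.g. via self-loops).
    """
    return [softmax([s if a else -math.inf for s, a in zip(s_row, a_row)])
            for s_row, a_row in zip(scores, A)]


scores = [[1.0, 2.0, 3.0]]  # one word attending to three words
A = [[1, 0, 1]]             # the middle word is a weak (unrelated) word
soft = soft_prune(scores, A)
hard = hard_prune(scores, A)
masked = hard_prune_masked(scores, A)
```

In the literal reading, a weak word's score becomes 0 before the softmax, so it still receives a small positive weight; in the masked variant its weight is exactly zero. Which of the two the authors implemented is not specified in the text.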

After obtaining $A_{SC_{score}}^{i}$, the abstract matrix V is obtained by projecting the text vector $H \in R^{n \times d}$ output from BERT once more, in the same way as Q and K were calculated above. Afterwards, $A_{SC_{score}}^{i}$ is applied to the abstract matrix V to obtain the syntactically constrained contextual representation $H_{SC} \in R^{n \times d}$. The specific operations are shown in Formulas (6) and (7).

$$H_{SC} = A_{SC_{score}}^{i} \cdot V \tag{6}$$

$$V = W_{v} H + b_{v} \tag{7}$$

where $W_{v}$ and $b_{v}$ are the learnable weight matrix and bias, respectively, and the parameters are not shared among V, Q, and K.
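Formulas (6) and (7) amount to a linear projection followed by a weighted sum of value vectors. The following single-head sketch uses plain lists; it assumes tokens are stored as rows, so the projection is computed as H times the weight matrix, and all numbers are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def project_values(H, W_v, b_v):
    """Formula (7) in row-major form: V = H W_v + b_v."""
    return [[v + b for v, b in zip(row, b_v)] for row in matmul(H, W_v)]


def constrained_context(attn, V):
    """Formula (6) for one head: each output row is a weighted sum of rows of V."""
    return matmul(attn, V)


H = [[1.0, 0.0], [0.0, 1.0]]      # two tokens, d = 2 (toy values)
W_v = [[2.0, 0.0], [0.0, 2.0]]    # toy projection weights
b_v = [1.0, 1.0]                  # toy bias
attn = [[1.0, 0.0], [0.5, 0.5]]   # rows sum to 1, as after the softmax
V = project_values(H, W_v, b_v)
H_sc = constrained_context(attn, V)
```

The first token attends only to itself and keeps its own value vector, while the second token averages both value vectors.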

3.5. Semantic Aggregation Layer

Next, in order to effectively use the syntactically constrained contextual information, $H_{SC}$ is first passed through the feed-forward layer and then layer-normalized; finally, the resulting representation is aggregated with the context vector output by BERT to obtain the final output $\tilde{H} \in R^{n \times d}$, as shown in Formulas (8)–(10):

$$\tilde{H} = \alpha H + (1 - \alpha) \bar{H} \tag{8}$$

$$\bar{H} = LN(H_{SC}^{'}) \tag{9}$$

$$H_{SC}^{'} = \sigma(H_{SC}) \tag{10}$$

where $LN$ denotes the normalization operation, $\sigma$ denotes the activation function, and $\alpha$ denotes the hyperparameter that balances the original contextual information and the syntactically constrained contextual information.
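A minimal sketch of the aggregation step follows. Several details are assumptions, because the text does not fix them: the activation is taken to be ReLU, the normalization is a per-token layer normalization without learnable gain or bias, and the linear part of the feed-forward layer is omitted, as in Formula (10).

```python
import math


def relu(row):
    """Assumed activation function."""
    return [max(0.0, v) for v in row]


def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def aggregate(H, H_sc, alpha):
    """Interpolate between BERT output H and the normalized constrained view."""
    out = []
    for h_row, sc_row in zip(H, H_sc):
        h_bar = layer_norm(relu(sc_row))
        out.append([alpha * h + (1.0 - alpha) * b
                    for h, b in zip(h_row, h_bar)])
    return out


H = [[1.0, 2.0]]       # toy BERT output for one token
H_sc = [[-1.0, 3.0]]   # toy syntactically constrained representation
only_bert = aggregate(H, H_sc, alpha=1.0)
mixed = aggregate(H, H_sc, alpha=0.5)
```

Setting `alpha` to 1.0 recovers the plain BERT representation, so the interpolation weight directly controls how much the syntactic view can change the token features.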

3.6. Decoding Layer

Unlike Li et al. [1], who directly performed decoding operations on contextual features containing query information, we allow the model to analyze the context from multiple aspects by introducing syntactic constraint information. First, the model predicts all possible entity heads and tails, based on the multifaceted information of the context. Then, the predicted entity heads and entity tails are matched, one by one. The details are shown in Formulas (11)–(13).

$$P_{i}^{start} = W_{start} \tilde{H} + b_{start} \tag{11}$$

$$P_{j}^{end} = W_{end} \tilde{H} + b_{end} \tag{12}$$

$$P_{i_{start}, j_{end}} = \mathrm{sigmoid}(W_{match} \cdot \mathrm{concat}({\tilde{H}}_{x_{i}}^{start}, {\tilde{H}}_{x_{j}}^{end})) \tag{13}$$

where $P_{i}^{start}$ and $P_{j}^{end}$ denote the probabilities that the ith token in the context is an entity start position and that the jth token is an entity end position, respectively; in the experiments, each prediction is set to 1 if it exceeds a preset threshold, and 0 otherwise. $W_{start}$, $W_{end}$, and $W_{match}$ denote learnable weight matrices. $P_{i_{start}, j_{end}}$ denotes the probability that the ith token is the start index position and the jth token is the matching end index position.
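The thresholding and matching steps can be sketched as below. The threshold of 0.5 and all probability values are assumptions made for illustration; the text only says that a preset threshold is used.

```python
def decode_spans(p_start, p_end, p_match, threshold=0.5):
    """Return (start, end) spans whose start, end and match scores pass the threshold."""
    starts = [i for i, p in enumerate(p_start) if p > threshold]
    ends = [j for j, p in enumerate(p_end) if p > threshold]
    return [(i, j)
            for i in starts
            for j in ends
            if i <= j and p_match[i][j] > threshold]


context = ["Bank", "of", "China"]
# Made-up probabilities for one query over the three context tokens.
p_start = [0.9, 0.1, 0.8]
p_end = [0.2, 0.1, 0.95]
p_match = [
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
spans = decode_spans(p_start, p_end, p_match)
```

Because several start candidates may be matched with the same end candidate, overlapping spans such as (0, 2) and (2, 2) can both be returned, which is what allows nested entities to be extracted.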

3.7. Loss Functions

In the training phase, we pair X with two label sequences of length n, $Y_{start}$ and $Y_{end}$, which give the ground-truth label of each token $x_{i}$ as the start index or the end index of any entity. The total cross-entropy loss function $L_{span}$ is shown in Formulas (14)–(17).

$$L_{span} = L_{start} + L_{end} + \eta L_{match} \tag{14}$$

$$L_{start} = CE(P_{i}^{start}, Y_{start}) \tag{15}$$

$$L_{end} = CE(P_{j}^{end}, Y_{end}) \tag{16}$$

$$L_{match} = CE(P_{i_{start}, j_{end}}, Y_{start, end}) \tag{17}$$

where $L_{start}$, $L_{end}$, and $L_{match}$ denote the answer start loss, answer end loss, and start–end match loss, respectively, $Y_{start, end}$ denotes the ground truth of the start index matching the end index, and $\eta \in [0, 1]$ weights the match loss.
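The combined loss in Formula (14) can be illustrated with a short sketch. It assumes that CE is a mean binary cross-entropy over per-position probabilities and uses a placeholder value for the weight of the match term; the text does not specify either detail.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def span_loss(p_start, y_start, p_end, y_end, p_match, y_match, eta=0.5):
    """L_span = L_start + L_end + eta * L_match; eta=0.5 is a placeholder."""
    flat_p = [p for row in p_match for p in row]
    flat_y = [y for row in y_match for y in row]
    return (bce(p_start, y_start) + bce(p_end, y_end)
            + eta * bce(flat_p, flat_y))


# Made-up gold labels and predictions for a two-token context.
y_start, y_end = [1, 0], [0, 1]
y_match = [[0, 1], [0, 0]]
good = span_loss([0.9, 0.1], y_start, [0.1, 0.9], y_end,
                 [[0.1, 0.9], [0.1, 0.1]], y_match)
bad = span_loss([0.4, 0.6], y_start, [0.6, 0.4], y_end,
                [[0.6, 0.4], [0.6, 0.6]], y_match)
```

Confident, correct predictions give a lower total loss than uncertain or wrong ones, and setting `eta` to zero drops the span-matching term entirely.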

4. Materials and Methods

4.1. Datasets

This section details the experiments carried out on the ACE2004, ACE2005, and GENIA datasets. To fairly evaluate SCAN-Net on the ACE2004 and ACE2005 datasets, we used the same data split as Katiyar et al. [15] and Lin et al. [34]. For GENIA, we followed Katiyar et al. [15], using five types of entities and splitting the data into training, development, and test sets in a ratio of 8.1:0.9:1.0. The numbers of training, validation, and test samples for each entity class in the datasets are shown in Table 2.

4.2. Experimental Setups

For the ACE2004 and ACE2005 datasets, the model was trained on an NVIDIA GeForce RTX 3090. For the GENIA dataset, the experiments were conducted on an NVIDIA RTX A6000. The main hyperparameter settings for the models in this paper are shown in Table 3. The training procedure was implemented using PyTorch (https://pytorch.org/ (accessed on 1 March 2021)). On the ACE2004 and ACE2005 datasets, the best results were obtained with 8 attention heads; on the GENIA dataset, the best results were obtained with 16 attention heads. We applied dropout with different rates to different layers for regularization. The experiments use an early stopping strategy on the development set, where training is terminated early if the validation loss does not improve within 5 epochs.
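The early stopping rule described above can be sketched as a small function. The validation losses below are made-up numbers used only to show the behavior with a patience of 5 epochs, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value occurs at epoch 2.
val_losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.9, 0.95, 0.7]
stop_epoch = early_stopping_epoch(val_losses, patience=5)
```

In this toy run the loss fails to improve on its epoch-2 value for five consecutive epochs, so training stops at epoch 7 and the later improvement at epoch 8 is never seen; this is the usual trade-off when choosing the patience value.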

4.3. Results

The experimental results were compared with the following models as baselines. It is worth noting that, since two strategies were designed to obtain the syntactically constrained attention $A_{SC}^{i}$, only the model with the better overall results, which is based on the hard-pruning strategy, is compared here with the baselines. The baseline models we chose include the following: (1) Straková et al. [35] regarded nested NER as a sequence-to-sequence problem. (2) Luan et al. [17] proposed a generic method for information extraction using dynamically constructed span graphs with shared span representations. (3) Li et al. [1] transformed the NER task into an MRC task by formalizing entity extraction as the extraction of answer spans for a question. Since different queries are posed for different entity types, this paradigm is naturally able to alleviate the entity nesting problem. (4) Yu et al. [18] borrowed ideas from graph-based dependency parsing, provided a global view of the input via the biaffine model, and, finally, scored the start and end tokens in the sentence. (5) Hou et al. [19] injected the embedding of semantic-type words into the entity embedding, in order to reduce the differences in context commonalities, and then combined them with the existing entity embedding through linear aggregation to obtain good results. (6) Yan et al. [20] described the NER task as an entity sequence generation task, without specifically designing a tagging schema or enumerating spans. (7) Li et al. [22] first traversed all possible text spans to identify entity fragments, and then performed relation classification on all entity fragments, in order to determine whether they were overlapping or inherited. (8) Yang et al. [23] designed a representation method based on a pointer network to parse nested named entities.

Table 4 presents the performance of SCAN-Net against baseline approaches on the ACE2004, ACE2005, and GENIA datasets, showing that the best F1 scores achieved by our model are comparable to those of current advanced nested NER models. Notably, SCAN-Net outperforms Li et al. [1] by 0.28%, 0.79%, and 1.54% on the ACE2004, ACE2005, and GENIA datasets, respectively. This improvement is reasonable, since the syntactically constrained attention of SCAN-Net helps the model reduce the interference of redundant information in sentences, allowing it to pay more attention to important information in the context. This represents a significant improvement over models based on machine reading comprehension frameworks. Moreover, the improvement on the GENIA dataset, 1.54%, is substantial. We speculate that this is because BioBERT, a biomedical pre-trained model based on BERT, uses a large corpus of biomedical literature during pre-training, which introduces knowledge of the biomedical domain. Therefore, BioBERT performs exceptionally well on the biomedical named entity recognition dataset GENIA. The results demonstrate the effectiveness of our syntactically constrained attention mechanism for ambiguous and nested named entity recognition tasks.

4.4. Analysis and Discussion

4.4.1. Analysis of the Validity of Syntactic Information

In syntactically constrained attention networks, we use sentence dependencies as constraint information. Four sets of comparison experiments were conducted on the ACE2005 dataset, in order to demonstrate the positive impact of constraint information on the model: (1) by simply adding a multi-head self-attention mechanism; (2) by using a soft-pruning strategy; and (3) by using a hard-pruning strategy. The results of the experiments are shown in Table 5.

From Table 5, we can see that the soft-pruning strategy model improves the F1 score by 0.42% over the baseline, and the hard-pruning strategy model improves the F1 score by 0.79% over the baseline. This is because the hard-pruning strategy helps the model reduce the weight of redundant information in the sentences by using the dependencies as constraint information, thus retaining more important syntactic information; this indicates that syntactic information is effective for the NER task. In addition, the hard-pruning strategy model was more effective overall than the soft-pruning strategy model, which we speculate is because the soft-pruning strategy model only adds additional attention weight to strong words, and does not constrain the attention scattered over weak words. Although the attention weight of the strong words is increased, the weak words will still influence the model’s judgment. In contrast, the hard-pruning strategy model directly eliminates the attention from weak words, thus alleviating the attention distraction problem. The table also shows that the hard-pruning strategy model has a higher precision (P) than the soft-pruning strategy model, but a slightly lower recall (R). We speculate that this is because the syntactic dependency tree is generated only by the natural language processing toolkit, and not by manual annotation, so for some sentences, incorrect syntactic dependency trees may be generated, resulting in the hard-pruning strategy model ignoring some words that have some relevance to the task, and thus failing to recognize some named entities. This is the reason for the slightly lower recall of the hard-pruning model as compared to the soft-pruning model. However, the higher precision and F1 scores indicate that the hard-pruning strategy is overall better than the soft-pruning strategy, in terms of helping the model to focus on strong words and eliminate weak words.
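The trade-off between precision and recall discussed above can be illustrated with a span-level scorer. This is a generic sketch of exact-match precision, recall, and F1, with made-up spans; it is not the evaluation script used in the experiments.

```python
def span_prf(pred, gold):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up gold spans and two hypothetical systems.
gold = [(0, 2, "ORG"), (2, 2, "GPE"), (5, 6, "PER"), (8, 8, "LOC")]
# A cautious system: few predictions, all of them correct.
cautious = [(0, 2, "ORG"), (2, 2, "GPE")]
# A permissive system: finds more gold spans but also predicts wrong ones.
permissive = [(0, 2, "ORG"), (2, 2, "GPE"), (5, 6, "PER"),
              (0, 0, "ORG"), (3, 4, "LOC"), (7, 8, "PER")]
p_c, r_c, f_c = span_prf(cautious, gold)
p_p, r_p, f_p = span_prf(permissive, gold)
```

In this toy case the cautious system has higher precision and lower recall than the permissive one, yet still reaches the higher F1, mirroring the pattern described for the hard-pruning and soft-pruning models.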

4.4.2. The Impact of Semantic Dependencies on NER

Existing studies have shown that syntactic information, in particular that contained in dependency trees, is helpful for information extraction tasks. Most approaches convert the dependency trees into adjacency matrices and then encode the information in the graph through graph convolutional networks (GCNs), where, in each GCN layer, each node exchanges information with its neighbors through the connections between them. In this work, different models for using syntactic information were compared, in order to verify the effectiveness of our approach. (1) BiGCN: to capture features between front-to-back and back-to-front word pairs, a bi-directional GCN is used. (2) AGGCN: since dependency trees are obtained using complex feature engineering, an attention matrix formed by multi-head self-attention mechanisms is used to represent the semantic associations between words, and then a GCN is used to extract features. (3) SCAN-Net: our model. (4) BAGCN: we use BiGCN and AGGCN to aggregate dual-channel syntactic features. (5) BAAGCN: we use the three-channel method that combines BiGCN, AGGCN, and syntactic constraints. The experimental results are shown in Figure 4.

As can be seen from Figure 4, the SCAN-Net model achieved the best results when compared to the models using GCNs, with BiGCN improving the F1 score by 0.22% compared to the baseline, AGGCN improving the F1 score by 0.39% compared to the baseline, and the two-channel model and three-channel model improving the F1 score by 0.67% and 0.7%, respectively. Clearly, SCAN-Net shows a more significant improvement than the BiGCN and AGGCN models. For the three-channel model, although there is an improvement when compared to the baseline model, a slight decrease in performance was observed when compared to SCAN-Net, i.e., BiGCN and AGGCN suppress the performance of SCAN-Net. We speculate that the model can reduce noisy information under the effect of the SCAN-Net constraint information, thus extracting more important information. In contrast, although the features extracted by BiGCN and AGGCN also allow the model to focus more on important information with strong task relevance, they cannot eliminate the noisy information of weak words as SCAN-Net does, resulting in an overall effect inferior to that of SCAN-Net. It is worth noting that although the SCAN-Net model has a slightly lower recall when compared to the other non-baseline models, it has the highest F1 score and precision, which is consistent with the speculation in Section 4.4.1: although the SCAN-Net model using the hard-pruning strategy may ignore some strong words, resulting in the model failing to recognize some named entities, it can still achieve good results due to its excellent performance in rejecting weak words. The above experiments demonstrate that the syntactically constrained attention mechanism performs better than the graph convolutional network in the named entity recognition task, based on the reading comprehension approach.

4.5. Case Study

To specifically demonstrate the advantages of SCAN-Net over the baseline approach [1], a case study was conducted. As shown in Figure 5, in the first context, the baseline approach identifies every occurrence of “Dallas Lutheran School” as a location, which is clearly incorrect; the first “Dallas Lutheran School” should be labeled as an organization or governing body entity. In the second context, the baseline approach identifies every occurrence of “Starbucks” as an organization or management entity, but the category of the first “Starbucks” here should be location. In comparison, SCAN-Net is able to identify and correctly classify all of these entities. This is because, compared to the baseline approach, SCAN-Net incorporates the syntactic information contained in the text into the attention mechanism, allowing the model to expand its view of the phrase range, as well as to analyze the entity type of each word in the text more comprehensively, based on word dependencies. For example, in the second case, if the model could not link “Starbucks” and “at downtown” at the end, it would be difficult to distinguish the specific type of the entity.

5. Conclusions

In this paper, we propose a syntactically constrained attention-based neural network for named entity recognition, in order to address the problem of attention dispersion in traditional attention mechanisms in the NER task. First, by converting the NER task into an MRC task, the model can understand the deeper meaning of the named entities. Then, the syntactic information is introduced into the attention mechanism, in order to guide the model to focus on contextual information that is highly relevant to the task, thereby alleviating the attention-scattering problem in traditional attention models. The experimental results show that SCAN-Net is competitive with existing nested NER models.

Future research can explore the following directions. Firstly, while queries can provide valuable prior knowledge for named entity recognition, the knowledge provided may still be limited. Therefore, researchers can consider using knowledge graphs for data augmentation, as well as training a query knowledge base tailored for this task by using neural network models. Secondly, as the maximum input length for BERT is limited, the model may still struggle with processing extremely long sentences. To address this issue, researchers can explore the use of multi-hop reading comprehension techniques.

Author Contributions

Conceptualization, S.L.; methodology, W.S.; investigation, Z.J.; visualization, L.K.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (No. 2019D01C060) and the National Natural Science Foundation of China (No. 61966034).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; Li, J. A unified MRC framework for named entity recognition. arXiv 2019, arXiv:1910.11476.
2. Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; Wang, R. SG-Net: Syntax-guided machine reading comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9636–9643.
3. Liu, L.; Shang, J.; Xu, F.; Xiang, R.; Han, J. Empower sequence labeling with task-aware neural language model. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
4. Lin, Y.; Liu, L.; Ji, H.; Yu, D.; Han, J. Reliability-aware dynamic feature composition for name tagging. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 165–174.
5. Cao, Y.; Hu, Z.; Chua, T.-S.; Liu, Z.; Ji, H. Low-resource name tagging learned with weakly labeled data. arXiv 2019, arXiv:1908.09659.
6. Lafferty, J.; McCallum, A.; Pereira, F.C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA, 28 June–1 July 2001; pp. 282–289.
7. Finkel, J.R.; Grenager, T.; Manning, C. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; pp. 363–370.
8. Liu, X.; Zhang, S.; Wei, F.; Zhou, M. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 359–367.
9. Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, LA, USA, 1–6 June 2018; Volume 1, pp. 2227–2237.
10. Akbik, A.; Blythe, D.; Vollgraf, R. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018.
11. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
12. Finkel, J.R.; Manning, C.D. Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009.
13. Lu, W.; Roth, D. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015.
14. Muis, A.O.; Lu, W. Labeling gaps between words: Recognizing overlapping mentions with mention separators. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017.
15. Katiyar, A.; Cardie, C. Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1.
16. Wang, B.; Lu, W. Neural segmental hypergraphs for overlapping mention recognition. arXiv 2018, arXiv:1810.01817.
17. Luan, Y.; Wadden, D.; He, L.; Shah, A.; Ostendorf, M.; Hajishirzi, H. A general framework for information extraction using dynamic span graphs. arXiv 2019, arXiv:1904.03296.
18. Yu, J.; Bohnet, B.; Poesio, M. Named entity recognition as dependency parsing. arXiv 2020, arXiv:2005.07150.
19. Hou, F.; Wang, R.; He, J.; Zhou, Y. Improving entity linking through semantic reinforced entity embeddings. arXiv 2021, arXiv:2106.08495.
20. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A unified generative framework for various NER subtasks. arXiv 2021, arXiv:2106.01223.
21. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461.
22. Li, F.; Lin, Z.; Zhang, M.; Ji, D. A span-based model for joint overlapped and discontinuous named entity recognition. arXiv 2021, arXiv:2106.14373.
23. Yang, S.; Tu, K. Bottom-up constituency parsing and nested named entity recognition with pointer networks. arXiv 2021, arXiv:2110.05419.
24. Li, X.; Yin, F.; Sun, Z.; Li, X.; Yuan, A.; Chai, D.; Zhou, M. Entity-relation extraction as multiturn question answering. arXiv 2019, arXiv:1905.05529.
25. Zhao, T.; Yan, Z.; Cao, Y.; Li, Z. Asking effective and diverse questions: A machine reading comprehension based framework for joint entity-relation extraction. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 3948–3954.
26. Yang, P.; Cong, X.; Sun, Z.; Liu, X. Enhanced language representation with label knowledge for span extraction. arXiv 2021, arXiv:2111.00884.
27. Huang, B.; Carley, K.M. Syntax-aware aspect level sentiment classification with graph attention networks. arXiv 2019, arXiv:1909.02606.
28. Zhang, C.; Li, Q.; Song, D. Aspect-based sentiment classification with aspect-specific graph convolutional networks. arXiv 2019, arXiv:1909.03477.
29. Hou, X.; Huang, J.; Wang, G.; Qi, P.; Zhou, B. Selective attention based graph convolutional networks for aspect-level sentiment classification. arXiv 2021, arXiv:1910.10857.
30. Li, R.; Chen, H.; Feng, F.; Ma, Z.; Hovy, E. Dual graph convolutional networks for aspect-based sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; Volume 1.
31. Fu, T.J.; Li, P.H.; Ma, W.Y. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
32. Tian, Y.; Chen, G.; Song, Y.; Wan, X. Dependency-driven relation extraction with attentive graph convolutional networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; Volume 1.
33. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
34. Lin, H.; Lu, Y.; Han, X.; Sun, L. Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. arXiv 2019, arXiv:1906.03783.
35. Straková, J.; Straka, M.; Hajič, J. Neural architectures for nested NER through linearization. arXiv 2019, arXiv:1908.06926.

Figure 1. The process of recognizing entities, using a named entity recognition model, based on a machine reading comprehension framework.

Figure 2. Syntactic dependency tree and corresponding adjacency matrix, as generated for a given context. (a) Shows the syntactic dependency tree and (b) shows the corresponding normalized adjacency matrix of the syntactic dependency tree.

Figure 3. Illustration of the SCAN-Net model.

Figure 4. Effect of syntactic information on NER performance.

Figure 5. Case study of SCAN-Net model.

Table 1. Example of converting different entity categories into question queries. If the annotation information of the entity type exists, it is summarized as a query, as shown in the queries corresponding to the GPE, LOC and PER types in the ACE2005 dataset. If the label information of the entity type does not exist, the query is constructed in the form of template filling, as shown in the queries of the cell_line, DNA and protein types in the GENIA dataset, where the entity types are colored blue.

Dataset	Entity Type	Query
ACE2005	GPE	geographical political entities are geographical regions defined by political and or social groups, such as countries, nations, regions, cities, states, and the government and its people.
	LOC	location entities are limited to geographical entities, such as geographical areas and landmasses, mountains, bodies of water, and geological formations.
	PER	a person entity is limited to humans, including a single individual or a group.
GENIA	cell_line	find all cell line entities in the context.
	DNA	find all DNA entities in the context.
	protein	find all protein entities in the context.
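The two query-construction routes described in the caption of Table 1 can be sketched as follows. This is only an illustration under stated assumptions: the function `build_query` and the dictionary `ANNOTATION_QUERIES` are hypothetical names, the dictionary holds a single entry copied from the table, and replacing underscores with spaces is inferred from the `cell_line` row.

```python
# Queries summarized from annotation guidelines (one entry copied from Table 1).
ANNOTATION_QUERIES = {
    "PER": "a person entity is limited to humans, "
           "including a single individual or a group.",
}


def build_query(entity_type, annotation_queries=ANNOTATION_QUERIES):
    """Return the query for an entity type.

    Use the annotation-based description when one exists; otherwise fall back
    to template filling, as done for the GENIA types in Table 1.
    """
    if entity_type in annotation_queries:
        return annotation_queries[entity_type]
    readable = entity_type.replace("_", " ")  # e.g. "cell_line" -> "cell line"
    return f"find all {readable} entities in the context."


for entity_type in ["PER", "cell_line", "DNA", "protein"]:
    print(entity_type, "->", build_query(entity_type))
```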

Table 2. Dataset statistics. “#” denotes the number of sentences.

Dataset	# Train	# Dev	# Test
ACE2004	6200	745	812
ACE2005	7294	971	1057
GENIA	18,546	15,023	1669

Table 3. Hyperparameters of the model, where the ACE2004 and ACE2005 datasets use BERT-base-uncased and GENIA uses BioBERT-base-cased-v1.2.

Hyperparameter	Value
Weight Initialization	BERT
Batch Size	16
Learning Rate	$2.00 \times 10^{-5}$
Warmup Proportion	0.1
Optimizer	AdamW
Epochs	20
Dropout Rate	0.2–0.4
Max Sequence Length	300
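To make the interaction between the learning rate, warmup proportion, batch size, and epoch count in Table 3 concrete, the sketch below computes a learning-rate schedule in plain Python. Table 3 does not state the shape of the schedule, so linear warmup followed by linear decay is an assumption here, as is the function name `lr_at_step`. The step count also assumes one training example per ACE2005 training sentence (Table 2); in an MRC formulation each sentence may be paired with several queries, which would increase the real number of steps.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_proportion=0.1):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = round(total_steps * warmup_proportion)
    if warmup_steps and step < warmup_steps:
        return peak_lr * (step / warmup_steps)
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * ((total_steps - step) / remaining)


# Values from Tables 2 and 3: 7294 ACE2005 training sentences,
# batch size 16, 20 epochs.
steps_per_epoch = math.ceil(7294 / 16)
total_steps = steps_per_epoch * 20

for step in (0, total_steps // 20, total_steps // 10, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```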

Table 4. Performance comparison on the ACE2004, ACE2005, and GENIA datasets (P: precision, R: recall, F1: F1 score). * indicates results we obtained by running the authors' code. Bold numbers indicate the best results.

ACE2004
Model	P	R	F1
Straková et al. [35]	-	-	84.4
Luan et al. [17]	-	-	84.7
Yu et al. [18] *	85.42	85.92	85.67
Li et al. [1] *	86.38	85.07	85.72
SCAN-Net (ours)	86.91	85.11	86.0
ACE2005
Model	P	R	F1
Straková et al. [35]	-	-	84.33
Li et al. [1] *	85.48	84.36	84.92
Yu et al. [18] *	84.5	84.72	84.61
Hou et al. [19]	83.95	85.39	84.66
Yan et al. [20]	83.16	86.38	84.74
Li et al. [22]	-	-	84.30
Yang et al. [23]	84.61	86.43	85.53
SCAN-Net (ours)	85.98	85.45	85.71
GENIA
Model	P	R	F1
Straková et al. [35]	-	-	76.44
Li et al. [1] *	79.62	76.8	78.19
Yu et al. [18] *	79.43	78.32	78.87
Hou et al. [19]	79.45	78.94	79.19
Yan et al. [20]	78.87	79.6	79.23
Li et al. [22]	-	-	78.30
Yang et al. [23]	78.08	78.26	78.16
SCAN-Net (ours)	79.95	79.52	79.73
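As a consistency check on Table 4, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the SCAN-Net precision and recall values in the table; the function name `f1_score` is chosen here for illustration. Rounded to two decimals, the results agree with the F1 values reported for SCAN-Net on all three datasets.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SCAN-Net precision and recall from Table 4, in percent.
scan_net = {
    "ACE2004": (86.91, 85.11),
    "ACE2005": (85.98, 85.45),
    "GENIA": (79.95, 79.52),
}

for dataset, (p, r) in scan_net.items():
    print(dataset, round(f1_score(p, r), 2))
```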

Table 5. Ablation trials on the ACE2005 dataset. † indicates the use of a soft-pruning strategy and ‡ indicates the use of a hard-pruning strategy.

Model	P	R	F1
Baseline	85.48	84.36	84.92
+Attention	84.97	85.08	85.03
+Syntax Attention †	84.91	85.78	85.34
+Syntax Attention ‡	85.98	85.45	85.71

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

