Syntax-Informed Self-Attention Network for Span-Based Joint Entity and Relation Extraction

The current state-of-the-art framework for joint entity and relation extraction is based on span-level entity classification and relation identification between pairs of entity mentions. However, while it maintains an efficient exhaustive search over spans, it does not take syntactic features into consideration. As a result, the model may predict a relation between two entities because their entity types suggest one, even though the sentence does not actually relate them. In addition, although previous works have shown that extracting the local context benefits the task, they still lack in-depth learning of contextual features within that local context. In this paper, we propose to incorporate syntax knowledge into multi-head self-attention by employing part of the heads to focus on the syntactic parents of each token in a pruned dependency tree, and we use this mechanism to model the global context and fuse syntactic and semantic features. Furthermore, to obtain richer contextual features from the local context, we apply a local focus mechanism to entity pairs and their corresponding context. Combining these two strategies, we perform joint entity and relation extraction on the span level. Experimental results show that our model achieves significant improvements over strong competitors on both the Conll04 and SciERC datasets.


Introduction
Relation Extraction (RE) aims to find semantic relationships between pairs of entity mentions in unstructured texts, building on entity recognition. It is an essential task in information extraction and is widely applied in question answering [1], knowledge base population [2], and other knowledge-based tasks. The relational facts obtained from RE are often expressed as knowledge triplets {h, r, t}, where h and t represent the head and tail entity, respectively, and r represents the relation between them. One annotated example is illustrated in Figure 1. Previous works can be divided into two categories: pipelined methods and joint methods. The pipelined methods [3][4][5] divide the task into two steps: named entity recognition (NER) and relation extraction (RE). Although this division seems reasonable, it ignores the relevance between the two tasks. For example, entity recognition may classify "Michael Jackson" and "Chicago" as the person and location types, respectively, and it is easy to see that these two entity types are helpful for identifying the relation "Live-in". To tackle this problem, joint methods have been proposed. Different from pipelined models, joint models extract both entities and relations with shared parameters in an end-to-end architecture. They can effectively integrate the features of entities and relations and achieve better results on the task.
For the problem of extracting relations, the relational facts in sentences are often complicated: a sentence may contain multiple entities, an entity may participate in multiple relational triplets, and triplets may have overlapping entity pairs. In addition, entity mentions may be nested. Refs. [6][7][8] can only cover the situation where no triplets in a sentence share entities. Later, many works like [9][10][11] covered the situations where triplets have overlapping entity pairs and where an entity participates in different triplets. In these models, each token corresponds to a single fixed representation, and we usually refer to them as token-based models. However, some tokens may participate in different spans, such as "the minister of the department of education" and "the department of education". It is hard for token-level models to handle overlapping entities because of this natural limitation. To tackle this issue, span-based models have been proposed. Refs. [12,13] solve the problem of entity nesting by exhaustively searching all possible spans to complete the RE task. At present, span-based joint entity and relation extraction models can cover the most complex situations.
However, although span-based models can cover more cases, they still have shortcomings that affect the quality of extraction. First, while maintaining an efficient exhaustive search over spans, they do not take syntactic features into consideration. As a result, a relation may be predicted between two entities because their entity types suggest one, even though the entities are not actually related in the sentence. Employing additional syntactic features helps with this issue. Second, although previous state-of-the-art work like [13] has proven that the local context is superior to the global context for the RE task, simply applying a pooling method over the context to align feature dimensions can lose critical information. The local context therefore needs in-depth learning of contextual features.
Recently, the Transformer [14] has become a popular model applied to a wide range of Natural Language Processing (NLP) tasks. Multi-head self-attention is the core component of the Transformer: by dividing semantic features into multiple subspaces and performing multiple attention functions to compute attention scores for each contextual word, it can learn richer contextual features from the sentence. However, recent works [15][16][17][18] show that the power of multiple heads is not fully exploited, so how to make full use of these heads is a valuable question. In addition, BERT [19] is a language representation model based on a multi-layer bidirectional Transformer, which has achieved impressive improvements on many NLP tasks, such as sentiment classification [20], question answering [21], reading comprehension [22], and so on.
Based on the above observations, we propose a joint model built on pre-trained BERT, which combines syntax-informed and local context attention mechanisms. In the NER task, we map span-based entity mentions to specified types and then form candidate entity pairs. In the RE task, the above two attention mechanisms are applied to the candidate entity pairs. We use pruned dependency trees as our syntactic features and analyze the effect of different pruning methods on the results.
To recap, the main contributions of this work are as follows:
• We propose to incorporate syntax knowledge into multi-head self-attention by employing part of the heads to focus on the syntactic parents of each token in a pruned dependency tree. We use this mechanism to model the global context and fuse syntactic and semantic features, and we compare the effects of several different pruning methods on the results;
• Building on the first point, according to the positions of each pair of candidate entities from the NER task, we dynamically mask part of the sentence, keeping only the entity pair and the content between them, and perform a local focus mechanism on this context to learn richer contextual features for the RE task;
• We employ BERT as our pre-trained language model and fine-tune it during training. Experimental results show that our model achieves significant improvements over strong competitors on both the Conll04 and SciERC datasets.

Syntax-Based Relation Extraction Model
Syntactic features are beneficial for RE tasks because they can capture dependencies between different components of a sentence that may be ambiguous on the surface. In the past, statistical classifiers were at the core of relation extraction approaches, and many of them found that syntactic features play an important role. Among traditional methods, Ref. [23] proposed a tree-based kernel approach and introduced kernels defined over shallow parse representations of text. Ref. [24] introduced the simplified dependency path into tree kernels. They proved that such features can improve the performance of extraction. Among neural models, Ref. [25] first proposed to learn relation representations from the Shortest Dependency Path (SDP) through a Convolutional Neural Network (CNN), and Refs. [26,27] employed LSTMs to encode the SDP, which proved the effectiveness of syntactic feature encoding with neural models. In addition, they removed irrelevant information by trimming the trees and achieved better results. Beyond these preliminary attempts, some works modeled syntax trees by improving the feature extractors. Refs. [6,28] used Tree-LSTM [29] and Tree-GRU models, respectively, which are extended forms of RNNs over dependency trees with stronger modeling capabilities. Since the dependency tree structure is inherently graph-like, with the emergence of the Graph Convolutional Network (GCN) [30], Refs. [11,31] employed GCNs to generate syntactic representations. Experimental results proved that sequence models have complementary strengths to syntax-based models, and combining them can achieve impressive performance on their benchmarks. Ref. [32] used dependency trees to extract syntax-based importance scores for words with Ordered-Neuron Long Short-Term Memory Networks [33] (ON-LSTM) and injected these features into the model. Their experimental results showed that capturing the syntactic importance of words is beneficial for the RE task and gives greater generalization ability.
In addition, although the multi-head self-attention mechanism introduced with the Transformer [14] has been proven to have excellent performance on some NLP tasks, it lacks a hierarchical structure of the input sentence. Refs. [34,35] employed absolute and relative structural positions to encode structural features and integrated them into self-attention, respectively, achieving better performance on machine translation tasks. In summary, it is feasible to incorporate syntactic features into self-attention, which provides more possibilities for extending syntax-based models to other NLP tasks.

Joint Entity and Relation Extraction Model
Early joint models were mainly feature-based [36][37][38]. They often rely heavily on the quality of feature construction, which is time-consuming and lacks sufficient generalizability. With the success of deep learning on many NLP tasks, neural models make up for the shortcomings of feature-based models and have many applications in the field of RE. Ref. [6] proposed to employ an RNN-LSTM with a Tree-LSTM for joint entity and relation extraction, modeling syntactic and sequence information simultaneously. Ref. [39] proposed a hybrid neural network that employed a BiLSTM for entity recognition and a CNN for relation extraction without any handcrafted features. Refs. [40,41] proposed to model the task as a table filling problem, but this causes redundant information and increases the possibility of errors. Ref. [8] first proposed to model the task as a sequence labeling problem with the BIOES scheme. They synthesized entity types and relations into labels, and handcrafted features were not adopted. Although this avoids the problem of error propagation to some extent, it has shortcomings in the identification of overlapping relations. After that, Ref. [9] proposed to divide situations into three categories: Normal (none of the triplets have overlapping entities), Single Entity Overlap (SEO), and Entity Pair Overlap (EPO). They viewed the task as a triplet generation process based on seq2seq with a copy mechanism to cover all of the above situations. Refs. [6][7][8] only solved the Normal situation. After that, some works began to cover the SEO and EPO situations. Ref. [10] employed a BiLSTM to encode sequence information and modeled the RE task as a multi-head selection problem. In addition, a Conditional Random Field (CRF) was used for decoding, so that the model could identify entities and all possible relations between them to solve the overlapping relation issue. Based on the architecture of [10], Refs.
[42,43] incorporated different types of attention, but were still based on a sequence labeling scheme. Ref. [11] adopted a GCN to encode dependency trees and a BiLSTM as the sequence model, integrating them into an end-to-end architecture. Ref. [44] formulated joint extraction as a token pair linking problem and introduced a novel handshaking tagging scheme to align the boundary tokens of entity pairs under each relation type; their results showed that this method performs well on the relation extraction task. Due to the entity nesting issue, span-based methods have become popular. Refs. [12,13] completed the NER task by exhaustively searching spans, combined these spans into candidate entity pairs, and covered situations not considered in [9]. Ref. [45] dynamically constructed span graphs by selecting the most confident entity spans and linking these nodes with confidence weights; through propagation over the graph, the span representations can be refined iteratively. On the basis of [45], Ref. [46] replaced the RNN encoder with BERT and got better extraction results. Ref. [47] was also based on the Transformer and innovatively modeled the task as a multi-turn QA problem, but it required constructing question templates manually, which is time-consuming. Overall, the span-based method has achieved the best performance on multiple benchmarks among joint models, and using the Transformer for feature extraction usually performs better.

Methodology
In this section, we present our joint model, illustrated in Figure 2. We employ BERT as our language representation model and fine-tune it during training. First, the input sequence is mapped into BERT embeddings, and we adopt a pooling strategy to group the subword units of each word and obtain its word-level representation (see Section 3.2). Second, working on the span level lets the NER task better identify nested entity mentions; we use the feature embeddings from the first step to construct span-level representations, complete the NER task on the span level, and get candidate entity pairs (see Section 3.3). Third, to let the model attend to syntactic features, we adopt syntax-informed multi-head self-attention to model the global context (see Section 3.4). Fourth, we leverage the candidate entity pairs from the second step to extract the context between them, construct the local context representation from the features of the previous step, and apply the local focus mechanism on it; this attention over the local context helps the model learn in-depth contextual features (see Section 3.5). Finally, we get the final representations of entity pairs and their corresponding context to complete the RE task (see Section 3.6).

Task Definition
Our model is a unified framework to classify named entities and relations in entity pairs. The input of our model is a sequence of tokens W = {w_0, w_1, ..., w_n}. For the input sequence, we search for all possible spans that satisfy the specified requirements and get the candidate span set S. For each span s ∈ S, we compute a vector of entity type scores and choose the category with the highest score as its type. Spans whose prediction is none are filtered out. Then, from the candidate entity pair set S × S and the pre-defined relation set R, we extract the corresponding semantic relationships based on the context.
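The exhaustive span search described above can be sketched as follows; the function name and the maximum-length cutoff are illustrative (the actual threshold used in our experiments is discussed in the settings):

```python
def enumerate_spans(tokens, max_len):
    """Enumerate all contiguous spans of up to max_len tokens.

    Returns (start, end) index pairs, end inclusive, in left-to-right order.
    This is the candidate span set S before entity classification.
    """
    spans = []
    for i in range(len(tokens)):
        for j in range(i, min(i + max_len, len(tokens))):
            spans.append((i, j))
    return spans

print(enumerate_spans(["a", "b", "c"], 2))
# [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
```

Note that the number of candidate spans grows roughly linearly in sentence length once the length cap is applied, which is what keeps the exhaustive search tractable.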

Embedding Layer
Given a sentence W_raw = {w_0, w_1, ..., w_n} as a sequence of tokens, the embedding layer maps these tokens into feature vectors. In this layer, we employ BERT as our pre-trained language model. Before mapping these tokens, we transform the input into the sequence {[CLS], w_0, ..., w_n, [SEP]}, where [CLS] is a special token added in front of every input sample, and [SEP] is a special separator token.
Since the BERT tokenizer is built with WordPiece [48], which decomposes infrequent words into frequent subwords for unsupervised tokenization of the input, while syntactic features can only be represented on the word level, we need to align the subword units with their corresponding original words. Therefore, if a word has multiple subword units, we adopt average pooling to group these units and obtain the word-level representation. After that, we get the feature sequence X = {x_0, x_1, ..., x_l}, where l is the length of the input sequence.
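The subword-to-word pooling step can be sketched as follows; the function and its toy inputs are illustrative, and in the real model the vectors are BERT subword embeddings:

```python
def pool_subwords(subword_vecs, word_ids):
    """Average the subword vectors that belong to the same original word.

    subword_vecs: list of equal-length float lists, one per subword unit.
    word_ids: for each subword, the index of the word it was split from.
    Returns one averaged vector per word, in word order.
    """
    groups = {}
    for vec, wid in zip(subword_vecs, word_ids):
        groups.setdefault(wid, []).append(vec)
    pooled = []
    for wid in sorted(groups):
        vecs = groups[wid]
        dim = len(vecs[0])
        pooled.append([sum(v[d] for v in vecs) / len(vecs) for d in range(dim)])
    return pooled

# e.g. "playing" -> ["play", "##ing"]: the last two subwords share word id 1
vecs = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
ids = [0, 1, 1]
print(pool_subwords(vecs, ids))  # [[1.0, 0.0], [1.0, 2.0]]
```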

Span-Based Named Entity Recognition
First, we find all candidate spans in the input sentence. Since too many spans would take up too much memory, we set a threshold t on the length of candidate spans, and spans whose length exceeds t are discarded. We denote a span as s = {s_0, s_1, ..., s_n}, where n ≤ t. For span s, we apply max pooling over its token representations to create its representation. Since width is a distinguishing feature of a span, we integrate it into the span representation and denote the width feature as w. In addition, after BERT encodes the input sentence, the vector corresponding to the token [CLS] represents contextual features of the whole sentence. Since contextual information usually plays an important role in entity classification, we incorporate it into the span representation and denote the [CLS] vector as c. Finally, we concatenate all of the above features to get the final representation s' of a candidate span:

s' = maxpool(x_{s_0}, ..., x_{s_n}) ⊕ w ⊕ c,    (1)

where ⊕ represents the feature concatenation operator.
After getting the final representation s', we classify candidate spans into specific types. Let E be the set of pre-defined entity types (including none). We then employ a fully connected layer for the classification of entities:

ŷ_s = softmax(W_e · s' + b_e),    (2)

From (2), we get the score of each type for the span, and each span is assigned the entity type with the highest prediction score. All spans classified as none are filtered out; the others are combined into candidate entity pairs for the subsequent RE task.
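The span representation and scoring above can be sketched with toy dimensions; the weights here are illustrative placeholders, not trained values:

```python
import numpy as np

def span_representation(token_embs, start, end, width_emb, cls_vec):
    """s' = maxpool(span tokens) ⊕ width feature ⊕ [CLS] vector."""
    pooled = token_embs[start:end + 1].max(axis=0)  # max pooling over the span
    return np.concatenate([pooled, width_emb, cls_vec])

def entity_scores(span_rep, W, b):
    """One fully connected layer; argmax over the result gives the type."""
    return W @ span_rep + b

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))  # 5 tokens, embedding dim 4
rep = span_representation(tokens, 1, 2, np.zeros(2), np.ones(4))
print(rep.shape)  # (10,) = 4 (pooled) + 2 (width) + 4 ([CLS])
```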

Syntax-Informed Multi-Head Self-Attention
Multi-head self-attention is the core component of the Transformer [14]; it not only allows the model to jointly attend to information from different feature subspaces at different positions, but can also be parallelized. Each head learns a distinct attention function to attend to all tokens in the sequence, and the features of these heads are concatenated to create the final representation.
Many experimental results have shown that it has a powerful ability to learn context features and achieves impressive performance on neural machine translation benchmarks [14,49]. However, many studies find that the capability of multi-head self-attention has not been fully exploited [15][16][17][18], and we expect that incorporating syntactic features into the multi-head self-attention architecture will enhance its modeling capability. Therefore, we propose to incorporate syntax knowledge into multi-head self-attention by employing part of the heads to focus on the syntactic parents of each token in a pruned dependency tree, and use the rest of the heads to focus on semantic information in order to exploit the ability of the heads even further. In this layer, we apply this mechanism to the global context.
For semantic features, we let h_1 attention heads attend to them. For each head h, we map the input matrix X of N token representations into a query Q_h, key K_h, and value V_h with different learnable matrices:

Q_h = X W_h^Q,  K_h = X W_h^K,  V_h = X W_h^V,    (3)

After projecting, Q_h, K_h, and V_h are of dimensions N × d_Q, N × d_K, and N × d_V, with 1 ≤ h ≤ h_1, where d_Q, d_K, d_V are the dimensions of one head. Then, we use the scaled dot-product to calculate the attention weights:

W_h^Sem = softmax(Q_h K_h^T / √d_h),    (4)

where √d_h is a scaling factor, the square root of the per-head embedding dimension. The semantic attention weights are then multiplied by V_h for each token, giving the semantic-attended token representations:

H_h^Sem = W_h^Sem V_h,    (5)

Then, we concatenate the results of these heads to get the semantic representation M_Sem:

M_Sem = Concat(H_1^Sem, ..., H_{h_1}^Sem),    (6)

For syntactic features, we let h_2 attention heads attend to them and map the input matrix into query, key, and value as in (3). We use external tools to generate a dependency tree of the input sentence, and prune the tree with different methods according to the candidate entity pairs generated by the NER task. Similar to [6], we test three methods for pruning:
• ALL: retain the entire dependency tree;
• SDP: retain only the shortest dependency path between the two entities;
• LCA: retain the subtree rooted at the lowest common ancestor (LCA) of the two entities.
Then, we convert the pruned tree into an adjacency matrix, denoted A_prune, and calculate the attention weights as

W_h^Syn = softmax(Q_h K_h^T / √d_h + M_prune),  M_prune(i, j) = 0 if A_prune(i, j) = 1, else −∞,    (7)

where A_prune is an N × N binary (one-hot) matrix, and 1 ≤ h ≤ h_2.
Similarly, the syntactic attention weights are multiplied by V_h for each token, giving the syntax-attended token representations, which are concatenated into M_Syn:

H_h^Syn = W_h^Syn V_h,    (8)

M_Syn = Concat(H_1^Syn, ..., H_{h_2}^Syn),    (9)

After that, we concatenate the syntactic and semantic representations together and denote the result as M:

M = M_Sem ⊕ M_Syn,    (10)

Finally, we employ a residual connection. A linear transformation is applied to the feature representation M, added to the input matrix X (see Section 3.2), and followed by a layer normalization [50] operation to get the output representation O:

O = LayerNorm(X + M W_SA + b_SA),    (11)

where W_SA and b_SA are the parameters of the linear transformation, and W_SA has dimensionality d_f × d_f, where d_f is the dimension of the word embedding. We can also treat this transformation as a convolution operation with a kernel size of 1.
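A single syntactic head can be sketched as standard scaled dot-product attention whose scores are masked by the pruned-tree adjacency matrix, so each token can only attend along retained tree edges. The projection matrices below are random placeholders; a trained model would learn them:

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, A):
    """One attention head restricted to positions where A > 0."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(A > 0, scores, -1e9)          # keep only tree edges
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(1)
N, d = 4, 8
X = rng.normal(size=(N, d))
A = np.eye(N)  # toy adjacency: each token may attend only to itself
out = masked_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)), A)
print(out.shape)  # (4, 8)
```

With a real pruned tree, `A` would have a 1 at (i, j) whenever token j is kept as a syntactic neighbor of token i.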

Local Context Focus Layer
In this layer, our goal is to capture the internal correlations of the local context and let the model focus on the features that are beneficial for the RE task.
As shown in Figure 3, the local context is part of the input sentence. Based on O from Section 3.4, it is divided into three parts: the head entity s_head = {O_{h_0}, O_{h_1}, ..., O_{h_{l_1}}}, the tail entity s_tail = {O_{t_0}, O_{t_1}, ..., O_{t_{l_2}}}, and the context between the two entities c_rel = {O_{c_0}, O_{c_1}, ..., O_{c_{l_3}}}, where l_1, l_2, l_3 represent the lengths of the corresponding sequences. The other parts of the sentence are masked. For local context attention, we calculate the score between each token and the other tokens in parallel, and multiply the scores with the corresponding value features. On the local context, we apply multi-head self-attention (MHSA) to highlight the features that are beneficial to the task; the specific calculation can be seen in (3)∼(6) and (11). Finally, we get the locally attended context representation m.
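The dynamic masking step can be sketched as a simple slicing of the globally attended sequence into the three parts above; the index conventions and function name are illustrative:

```python
def local_context(O, head_span, tail_span):
    """Split the sequence O into head entity, in-between context, and tail
    entity; everything outside these three slices is masked out.

    head_span / tail_span: (start, end) token indices, end inclusive.
    """
    (h0, h1), (t0, t1) = sorted([head_span, tail_span])
    return {
        "head": O[h0:h1 + 1],
        "ctx":  O[h1 + 1:t0],   # tokens strictly between the two entities
        "tail": O[t0:t1 + 1],
    }

O = list("ABCDEFG")
parts = local_context(O, (0, 1), (5, 6))
print(parts)
# {'head': ['A', 'B'], 'ctx': ['C', 'D', 'E'], 'tail': ['F', 'G']}
```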

Relation Extraction
In this section, we complete the RE task. Based on m from Section 3.5, we divide it into the three parts m_head, m_ctx, and m_tail as well. We use a fully connected layer to classify the relations of candidate entity pairs. Since the three pieces of information have different lengths, we aggregate each of them with max pooling to obtain fixed-length vectors and concatenate them into a whole representation, together with the span width features w of the two entities. Then, we get

x_r = maxpool(m_head) ⊕ maxpool(m_ctx) ⊕ maxpool(m_tail) ⊕ w_head ⊕ w_tail,    (14)

ŷ_r = σ(W_r · x_r + b_r),    (15)

Since there may be multiple relations between two entities, we regard the RE task as a multi-label binary classification problem. From (15), each pair of candidate entities generates probabilities for all of the predefined relations, and we then set a threshold to filter out results with low confidence.
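The multi-label decision can be sketched as per-relation sigmoid scores with a confidence threshold (0.55 in our experiments); the logit values below are toy inputs, not model outputs:

```python
import math

def predict_relations(logits, relations, threshold=0.55):
    """Keep every relation whose sigmoid probability clears the threshold."""
    probs = {r: 1.0 / (1.0 + math.exp(-z)) for r, z in zip(relations, logits)}
    return sorted(r for r, p in probs.items() if p >= threshold)

rels = ["Live-in", "Work-for", "Kill"]
print(predict_relations([2.0, -1.0, 0.3], rels))
# ['Kill', 'Live-in']   (sigmoid(2.0) ≈ 0.88, sigmoid(0.3) ≈ 0.57)
```

Because each relation is scored independently, one entity pair can legitimately receive several relation labels, or none at all.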

Model Training
We employ BERT as our pre-trained language model, and its parameters will be fine-tuned during training. Since the joint model contains two tasks, the loss function will be designed as the sum of the two task losses.
Both parts of the loss function use the form of cross entropy:

L = L_entity + L_relation,    (16)

where the first part is the loss from classifying all candidate spans, and the second part is obtained from classifying the relations of candidate entity pairs.
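A minimal numerical sketch of the joint objective, assuming a softmax cross-entropy term for entity types and a binary cross-entropy term per relation label (all probabilities here are toy values):

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold class under a softmax output."""
    return -math.log(probs[gold])

def bce(p, y):
    """Binary cross-entropy for one relation label (y ∈ {0, 1})."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

entity_loss = cross_entropy([0.1, 0.7, 0.2], 1)  # span predicted as type 1
relation_loss = bce(0.8, 1)                      # one positive relation label
total = entity_loss + relation_loss              # L = L_entity + L_relation
print(round(total, 3))  # 0.58
```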

Datasets and Experimental Settings
We conduct experiments on two relation extraction datasets and achieve significant improvements on both corpora.
The first dataset is Conll04 [51], which contains sentences with annotated named entities and relations from news articles. We follow the processing method of [40] and split the data into training, validation, and test sets. The test set consists of 288 sentences. The training and validation sets contain 1153 sentences in total, and we use 20% of them for validation. Conll04 defines four entity types (Person, Organization, Location, and Other) and five relation categories (Live-in, Work-for, OrgBased-in, Kill, and Located-in).
The second dataset is SciERC [52], which contains 500 scientific abstracts that come from 12 AI conference/workshop proceedings in four AI communities. We process the dataset according to [52]. There are 551 sentences for the test set, and 1861 and 275 sentences for the training and validation sets, respectively. For SciERC, it defines six entity types (Method, Metric, Task, Material, Generic, and Other-ScientificTerm) and seven relation types (Feature-of, Used-for, Compare, Part-of, Hyponym-Of, Conjunction, and Evaluate-for).
Some statistics of the two datasets are shown in Table 1. In our experiments, we measure NER and RE performance by computing precision, recall, and F-measure scores on the test set. We adopt the standard formulas to calculate the scores:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R),    (17)

Following previous works, for Conll04, we report both the micro-average and macro-average F1 scores. For micro-F1, we calculate the metric globally by counting the total true positives, false negatives, and false positives. For macro-F1, we calculate the metric for each label and take their unweighted mean. A relation is correct if its type and its two entities are all correct, and an entity is considered correct if its span and type are both right. For SciERC, we report the micro-F1 score as in previous works, and, in the RE task, we do not consider whether the entity type is correct.
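The two averaging schemes can be sketched from per-label true-positive/false-positive/false-negative counts; the counts below are toy values:

```python
def f1(tp, fp, fn):
    """F1 from raw counts, with the usual zero-division conventions."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(counts):
    """Pool all counts globally, then compute one F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute per-label F1, then take the unweighted mean."""
    return sum(f1(*c) for c in counts) / len(counts)

counts = [(8, 2, 2), (1, 0, 3)]  # two labels: (tp, fp, fn)
print(round(micro_f1(counts), 3), round(macro_f1(counts), 3))  # 0.72 0.6
```

Note how the rare second label drags the macro score down much more than the micro score, which is why the two averages can rank systems differently.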
In this work, we employ BERT as our pre-trained model, and its parameters are fine-tuned during training. For the Conll04 dataset, we adopt BERT-base-cased from HuggingFace's Transformers [53]. For SciERC, the SciBERT [54] model is adopted, which is trained specifically on papers from the corpus of semanticscholar.org. For the acquisition of syntactic features, we use the Stanford dependency parser from the Stanford CoreNLP toolkit [55] to generate dependency structures. We train the model with a learning rate of 5 × 10^−5 and adopt the Adam optimizer with a weight decay of 0.01, using a linear warmup schedule. The dimension of the width embedding is 25, and the batch size is set to 2. The contextual embedding size from BERT is 768, and the threshold for filtering relations is 0.55. For all layers with multi-head self-attention, the number of heads is 8, and we use half of the heads to focus on syntactic features. For span pruning, we limit spans to a max length of 10. All of the hyper-parameters are tuned on the validation set.

Table 2 demonstrates the overall experimental results. For our joint model, we report both named entity recognition and relation extraction performance on the test set. By applying the syntax-informed self-attention mechanism with the LCA pruning method and the local context focus mechanism, our model outperforms all listed joint models and achieves significant improvements on the relation extraction task. For Conll04, our model outperforms the previous state-of-the-art model [14] by 1.70 micro-F1 and 1.61 macro-F1. To evaluate the generalizability of our method, we also train and evaluate the model on the SciERC dataset, where our model exceeds the previous state-of-the-art work by 1.67 micro-F1. In addition, our proposed model improves upon the other models in both precision and recall.
All results demonstrate that our model is very effective in joint entity and relation extraction tasks. Metrics: micro-average = † , macro-average = ‡ , not stated = *.

Analysis of the Pruning Effect of Syntax Trees
Syntactic features are beneficial for RE tasks because they can capture dependencies between different components of a sentence and are usually expressed as a tree structure. When we focus on extracting the semantic relationship between two entities in a sentence, some irrelevant information in the dependency tree may interfere with the analysis. Therefore, we compare several different methods for pruning the syntax tree to remove irrelevant information as much as possible. For details about the implementation, see Section 3.4.
The comparison results are shown in Tables 3 and 4. We find that, for both Conll04 and SciERC, the LCA method shows better performance than ALL and SDP. Intuitively, retaining the entire tree may keep too much irrelevant information, and keeping only the shortest dependency path will be too aggressive for removing information. Therefore, using the LCA method can retain effective features more reasonably. Our experiments have also verified this.
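As an illustration of why LCA pruning keeps a middle ground between ALL and SDP, the strategy can be sketched on a toy dependency tree; the `{child: parent}` encoding and function names are illustrative:

```python
def path_to_root(tree, node):
    """Walk parent pointers up to the root (the root maps to itself)."""
    path = [node]
    while tree[node] != node:
        node = tree[node]
        path.append(node)
    return path

def lca_keep_set(tree, a, b):
    """Nodes retained by LCA pruning: both root paths, cut at (and
    including) the lowest common ancestor of the two entity tokens."""
    pa, pb = path_to_root(tree, a), path_to_root(tree, b)
    ancestors = set(pa)
    lca = next(n for n in pb if n in ancestors)
    return set(pa[:pa.index(lca) + 1]) | set(pb[:pb.index(lca) + 1])

# Toy tree: 0 is the root; 1 and 2 attach to 0; 3 -> 1; 4 -> 2
tree = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2}
print(sorted(lca_keep_set(tree, 3, 4)))  # [0, 1, 2, 3, 4]
```

On this toy tree the LCA of tokens 3 and 4 is the root, so the whole path structure between them survives, whereas an SDP-style cut could drop intermediate modifiers like the "Commander" example discussed in Section 7.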

Ablation Tests
In this section, we conduct ablation tests on the Conll04 and SciERC validation sets, reported in Tables 5 and 6, to analyze the effectiveness of the different components of our proposed model. We test three variants of the model as follows:
• Full Model: use the complete model proposed above;
• Local Context Focus: remove the local attention module for the context, and explore its contribution to the overall performance of the model;
• Syntactic Feature Fusion: remove syntactic features from the multi-head self-attention module, and analyze their contribution.
When we remove the local context focus mechanism, the performance of the relation extraction task decreases for both datasets. It contributes 1.39 micro-F1 and 1.44 macro-F1 for Conll04, and 2.45 micro-F1 for SciERC. For the global context, we find that the performance of relation extraction becomes worse if we remove all syntactic heads. It contributes 2.59 micro-F1 and 2.84 macro-F1 for Conll04, and 1.34 micro-F1 for SciERC.
All of the results described above demonstrate that using part of the heads to focus on syntactic features is effective, and that the pruned syntax tree is an important feature for RE tasks. Moreover, by attending to the semantic information of the local context, the model can exploit more contextual features for the RE task.

Discussion
In this section, we evaluate the prediction results of our proposed model and select several representative samples to discuss the advantages and shortcomings of our model. In these cases, the blue parts are correct results, and the red ones are wrong. The content in the lower right corner of the right parenthesis denotes "entity pair number-left (L)/right (R) entity-relationship".
Case 1 (Effect of syntactic features): • Case 1 represents a situation where a relation is predicted between two entities because their entity types suggest one, but the entities are not actually related in the sentence. We draw a pruned syntax tree of the example using the LCA method, illustrated in Figure 4. The subtree in the bottom right corner represents the LCA tree corresponding to the two entities with correct results. For "Black Sea Fleet", we see that "Eduard Baltin" has a direct modifying relationship with it, but "Yevhen Saburov" does not. From the dependency tree, we can obtain the structural modifying relationships between entities, which has a positive impact on relation extraction performance. It is also easy to see that, if we adopted the SDP method, "Commander" would be cut off the tree; however, "Commander" is a strong indicator for extracting the relation "Work-for". Therefore, the LCA method keeps more of the important information.
Case 2 (Effect of local context focus): • Case 2 represents a situation where the sentence contains clue words. Although previous work like [13] has proved that the local context is superior to the global context for the task, it still lacks in-depth learning of contextual features within the local context. In the sentence, we can easily find that the phrase "took shelter in" is an important clue for relation extraction. When we add the local focus layer, the model classifies the relation as "Live-in" correctly, but if we omit this layer and simply use a pooling method over the local context to align feature dimensions, the model cannot identify the corresponding relation.
Case 3 (Negative sample): • Case 3 is a negative example: there are no straightforward clue words indicating the relationship "Kill" between the entities "Oswald" and "Kennedy". However, by reading the context, the two events "Kennedy was killed" and "Oswald was arrested" are logically and causally connected, and there should be a relationship "Kill" between the two entities.
Although our method makes the model achieve better performance on relation extraction tasks, it is obvious that our model lacks sufficient linguistic inference abilities. From Case 3, it can be found that semantic relational reasoning is an important issue because it allows our model to extract knowledge facts from the text more comprehensively. In addition, the relational reasoning problem usually involves the semantic synthesis of multiple sentences.
In this research field, Ref. [56] proposed a dataset for document-level relation extraction and summarized several types of relational reasoning, such as logical, coreference, and common-sense reasoning. Recently, many works like [57,58] have begun to tackle the problem of cross-sentence relational reasoning, which is an important research direction for the future.

Conclusions and Future Work
Enabling machines with cognitive and logical capabilities will be a hot topic in future research, and the knowledge graph will be the "brain" of intelligent applications. The entity and relation extraction task is an important step in constructing knowledge graphs, and comprehensively and accurately extracting knowledge facts from texts has important research and economic value.
In this paper, we propose a model to extract entities and relations from unstructured text, which incorporates syntax-informed multi-head self-attention and a local context focus mechanism in an end-to-end framework. To integrate syntactic features into a span-based joint extraction model, combined with the observation that the capability of many heads in the multi-head attention mechanism is not fully exploited, we propose to employ part of the heads to focus on the syntactic parents of each token in dependency trees with different pruning methods, fusing syntactic and semantic features. In addition, we apply a local focus mechanism to entity pairs and their corresponding context to get richer contextual features from the local context. Based on pre-trained BERT, the experiments on the Conll04 and SciERC datasets confirm that our model achieves significant improvements compared to strong competitors. We also find that, in the syntax-informed multi-head self-attention mechanism, pruning the syntax tree with the LCA method has the greatest positive impact on the relation extraction task.

In addition, there are some limitations of this work worth further discussion. At present, our model can only extract knowledge facts on the sentence level; however, a large number of knowledge facts are expressed across multiple sentences. In the future, we plan to study how to improve the accuracy of document-level relation extraction. Moreover, some semantic relationships need to be inferred by reading and synthesizing multiple sentences in context; giving the model powerful linguistic inference ability will be a highlight of our future research.