Dual Pointer Network for Fast Extraction of Multiple Relations in a Sentence

Relation extraction is a type of information extraction task that recognizes semantic relationships between entities in a sentence. Many previous studies have focused on extracting only one semantic relation between two entities in a single sentence. However, multiple entities in a sentence are associated through various relations. To address this issue, we propose a relation extraction model based on a dual pointer network with a multi-head attention mechanism. The proposed model finds n-to-1 subject-object relations using a forward object decoder. Then, it finds 1-to-n subject-object relations using a backward subject decoder. Our experiments confirmed that the proposed model outperformed previous models, with an F1-score of 80.8% for the ACE-2005 corpus and an F1-score of 78.3% for the NYT corpus.


Introduction
Relation extraction is a task that involves recognizing semantic relations (i.e., tuple structures; [subject, relation, object] triples) among entities in a sentence [1]. Zeng et al. [2] divided sentences into three types according to the degree of triple overlap: normal, entity pair overlap (EPO), and single entity overlap (SEO). In the normal type, the triples do not have overlapped entities; in the EPO type, some triples have an overlapped entity pair; and in the SEO type, some triples have an overlapped entity, but these triples do not have overlapped entity pairs. In this study, we focus on promptly extracting both the normal and SEO types because most relations are included in these types, as shown in Figure 1. We propose a dual pointer network model to efficiently extract multiple relations from a sentence through forward scanning (i.e., scanning from the first word to the last) and backward scanning (i.e., scanning from the last word to the first). The proposed model discovers an object of the current subject during forward scanning. Through forward scanning, all normal type relations can be found. However, SEO type relations are only partially found because a subject can point to only one object in the pointer network architecture. To address this limitation, the proposed model performs backward scanning to identify a subject of the current object.
The remainder of this paper is organized as follows. In Section 2, we review previous studies on relation extraction. Section 3 describes the proposed dual pointer network model. In Section 4, we elaborate on the experimental setup and results. Finally, we conclude the study in Section 5.

Previous Works
With the significant success of deep neural networks in the field of natural language processing, many researchers have proposed various relation extraction models based on convolutional neural networks (CNNs). These include the CNN model based on max-pooling [3], the CNN model based on multiple sizes of kernels [4], the combined CNN model [5], and the contextualized graph convolutional network (C-GCN) model [6]. Relation extraction models based on recurrent neural networks (RNNs) have also been proposed, including the long short-term memory (LSTM) model based on the dependency tree [7] and the LSTM model using the position-aware attention technique [8]. These models have focused on normal type extraction (i.e., extracting only one relation between two entities from a single sentence). However, many entities in a single sentence can form multiple relations. To resolve this problem, some studies have proposed multiple relation extraction. For example, Luan et al. [9] treated triples in sentences as a graph and proposed a multiple relation extraction model that iteratively extracts spans between triples in the graph.
In the present study, we propose a relation extraction model to simultaneously find all possible relations among multiple entities in a sentence. The proposed model is based on the pointer network [10]. The pointer network is a sequence-to-sequence (Seq2Seq) model in which an attention mechanism [11] is modified to learn the conditional probability of an output whose values correspond to positions in a given input sequence. We modify the pointer network to include dual decoders: an object decoder (a forward decoder) and a subject decoder (a backward decoder). The object decoder extracts n-to-1 relations, as shown in the following example: [Lee, employed, ABC Mart] and [his Father, Owner, ABC Mart] are extracted from the sentence. The subject decoder extracts 1-to-n relations, as shown in the following example: [Lee, employed, ABC Mart] and [Lee, Family, his Father] are extracted from the sentence.
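The way the two decoders complement each other can be sketched with the example triples above. This is a minimal illustration, not the authors' implementation: the dictionary layout, the function name `merge_pointers`, and the choice to take the union of both decoders' triples are assumptions made here for clarity.

```python
# Entities in order of appearance. The example triples come from the text;
# the dictionary layout below is a hypothetical illustration.
entities = ["Lee", "ABC Mart", "his Father"]

# Object decoder (forward): subject index -> (object index, relation).
# Each subject can point to only one object, so n-to-1 relations are covered.
object_pointers = {0: (1, "employed"), 2: (1, "Owner")}

# Subject decoder (backward): object index -> (subject index, relation).
# Each object can point to only one subject, so 1-to-n relations are covered.
subject_pointers = {1: (0, "employed"), 2: (0, "Family")}


def merge_pointers(entities, object_pointers, subject_pointers):
    """Union the triples found by both decoders (assumed merge rule)."""
    triples = set()
    for subj, (obj, rel) in object_pointers.items():
        triples.add((entities[subj], rel, entities[obj]))
    for obj, (subj, rel) in subject_pointers.items():
        triples.add((entities[subj], rel, entities[obj]))
    return triples


triples = merge_pointers(entities, object_pointers, subject_pointers)
for triple in sorted(triples):
    print(triple)
```

Neither decoder alone recovers all three triples: the forward pass misses [Lee, Family, his Father] because "Lee" already points to "ABC Mart", and the backward pass misses [his Father, Owner, ABC Mart] because "ABC Mart" already points to "Lee".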

Context and Entity Encoder
The context and entity encoder computes the degree of association between words and entities in a given sentence. Here, $\{x_1, x_2, \ldots, x_n\}$ and $\{e_1, e_2, \ldots, e_m\}$ refer to word and entity embedding vectors, respectively. Figure 3 illustrates the process of word and entity embedding. As shown in Figure 3, the word embedding vectors are concatenations of two types of embeddings: word-level GloVe [12] embeddings for representing the meaning of words and character-level CNN embeddings [13] for addressing out-of-vocabulary problems. The entity embedding vectors are concatenations of three types of embeddings: word-level CNN embedding for representing the meaning of entities composed of multiple words, character-level CNN embedding for addressing out-of-vocabulary problems, and entity type embedding for representing the categorical information of input entities. Each word in the word-level CNN embedding is represented by word-level GloVe embeddings. The word embedding vectors are used as input for a bidirectional LSTM network to obtain contextual information as follows:

$$\overrightarrow{c}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{c}_{i-1}), \quad \overleftarrow{c}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{c}_{i+1}), \quad C_i = [\overrightarrow{c}_i; \overleftarrow{c}_i],$$

where $x_i$ is an embedding vector of the i-th word in a sentence, and $[\overrightarrow{c}_i; \overleftarrow{c}_i]$ is a concatenation of $\overrightarrow{c}_i$ and $\overleftarrow{c}_i$, which represent the output vectors of a forward LSTM and a backward LSTM, respectively. The entity embedding vectors are used as input for a forward LSTM network because the entities are listed in the order they appear in a sentence, as shown below:

$$E_t = \mathrm{LSTM}(e_t, E_{t-1}),$$

where $e_t$ is an embedding vector of the t-th one among all entities occurring in a sentence, and $E_t$ is an output vector encoded by a forward LSTM. The output vectors of the bidirectional LSTM network $\{C_1, C_2, \ldots, C_n\}$ and the forward LSTM network $\{E_1, E_2, \ldots, E_m\}$ are used as input for the context-to-entity attention layer (as shown in Figure 2) to compute the relative degrees of association between words and entities. This is similar to the well-known multi-head attention mechanism [14], as shown below:

$$a_j = \mathrm{softmax}\left(\frac{Q_j K^{\top}}{\sqrt{d}}\right), \quad \mathrm{head}_j = a_j V, \quad O_t = \mathrm{FNN}([\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_k]),$$

where the query $Q$ is set to $E_t$, and the key $K$ and the value $V$ are set to the $C$'s. The query $Q$ is split into $k$ vectors, where $k$ is the number of heads. The attention score $a_j$ is calculated by a scaled dot product, where $d$ is a normalization factor. The context-to-entity layer output $O_t$ is determined through a fully-connected neural network (FNN) using a concatenation of the $k$ heads as input.
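The context-to-entity attention described above can be sketched in plain Python. This is a simplified illustration under stated assumptions rather than the model's actual layer: it omits the learned projection matrices and the final FNN, it splits the keys and values into heads in the same way as the query so that dimensions match, and the helper names (`softmax`, `split_heads`, `multi_head_attention`) and random inputs are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vec, n_heads):
    """Split one vector into n_heads equal-sized chunks."""
    size = len(vec) // n_heads
    return [vec[i * size:(i + 1) * size] for i in range(n_heads)]


def multi_head_attention(query, keys, values, n_heads):
    """Return (concatenated head outputs, per-head attention weights)."""
    q_heads = split_heads(query, n_heads)
    k_heads = [split_heads(k, n_heads) for k in keys]
    v_heads = [split_heads(v, n_heads) for v in values]
    d_k = len(q_heads[0])
    outputs, weights = [], []
    for h in range(n_heads):
        # Scaled dot product between the query chunk and every key chunk.
        scores = [
            sum(q * k for q, k in zip(q_heads[h], k_heads[j][h])) / math.sqrt(d_k)
            for j in range(len(keys))
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the value chunks for this head.
        head = [
            sum(attn[j] * v_heads[j][h][d] for j in range(len(values)))
            for d in range(d_k)
        ]
        outputs.extend(head)
    return outputs, weights


# Toy input: 5 contextual word vectors and one entity vector,
# using 8 heads of 32 units each (256 dimensions in total).
random.seed(0)
n_heads, dim = 8, 256
context = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
entity = [random.gauss(0, 1) for _ in range(dim)]
output, weights = multi_head_attention(entity, context, context, n_heads)
print(len(output), len(weights), len(weights[0]))
```

Each head yields a distribution over the five word positions, so an entity can attend to different words in different heads before the heads are concatenated.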

Dual Pointer Network Decoder
In a pointer network, attention weights indicate a position distribution over the encoding layer. Because attention is concentrated on only one position, the pointer network has a structural limitation when one entity forms relations with several entities (for instance, "Lee" in Figure 1). The proposed model adopts a dual pointer network decoder (see Figure 2) to overcome this limitation. The first decoder, called an object decoder, learns the position distribution from subjects to objects as follows:

$$h_t = [e_t; E_t], \quad D_t = \mathrm{LSTM}(h_t, D_{t-1}),$$

where $h_t$ is a concatenation of the entity embedding vector $e_t$ and the LSTM-encoded entity embedding vector $E_t$, and the decoding vector $D_t$ (i.e., the t-th entity to determine its objects) is calculated by the forward LSTM.
The second decoder, called a subject decoder (a backward decoder), learns the position distribution from objects to subjects and outputs $\hat{p}_t$ and $\hat{r}_t$, where $\hat{p}_t$ and $\hat{r}_t$ represent a position and a relation name of $e_t$'s subject, respectively. In Figure 1, "Lee" should point to both "ABC mart" and "his father." This problem cannot be solved using the conventional forward decoder because it cannot point to multiple targets. However, the subject decoder resolves this problem because "ABC mart" and "his father" can point to "Lee." Additionally, we adopt a multi-head attention mechanism to improve the performance of the dual pointer network; this is shown in the following equation:

$$\hat{p} = \arg\max\left(\frac{1}{k}\sum_{j=1}^{k} a_j\right), \quad \hat{r} = \arg\max\left(\mathrm{relu}\left(W_r[h_0; h_1; h_2; \ldots; h_k] + b_r\right)\right),$$

where the query $Q$ is set to $D_t$, and the key $K$ and the value $V$ are set to the encoded entity vectors. The position distribution $\hat{p}$ is calculated by an average of the $k$ multi-head attention vectors $a_j$, and the relation name $\hat{r}$ is determined through an FNN using a concatenation of the head outputs $h_0, \ldots, h_k$ as the input.
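The decoding step described above (average the per-head attention to pick a position, then classify the relation from the concatenated heads) can be sketched as follows. This is an illustrative toy under stated assumptions: the function name `decode_pointer`, the tiny hand-written weights, and the relation label set are hypothetical and not taken from the model.

```python
def decode_pointer(head_weights, head_outputs, rel_weight, rel_bias, relation_names):
    """Pick the pointed position and the relation label for one decoding step."""
    n_heads = len(head_weights)
    n_positions = len(head_weights[0])
    # Position: argmax of the attention distribution averaged over heads.
    avg = [sum(w[j] for w in head_weights) / n_heads for j in range(n_positions)]
    position = max(range(n_positions), key=avg.__getitem__)
    # Relation: argmax of relu(W [head_1; ...; head_k] + b).
    concat = [x for head in head_outputs for x in head]
    logits = [
        max(0.0, sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(rel_weight, rel_bias)
    ]
    relation = relation_names[max(range(len(logits)), key=logits.__getitem__)]
    return position, relation


# Toy step: two heads attending over three entity positions.
entities = ["Lee", "ABC Mart", "his Father"]
head_weights = [[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]]
head_outputs = [[1.0, 0.0], [0.0, 1.0]]
relation_names = ["none", "employed", "Family"]
rel_weight = [
    [0.5, 0.0, 0.0, 0.1],
    [1.0, 0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0],
]
rel_bias = [0.0, 0.0, 0.0]

position, relation = decode_pointer(
    head_weights, head_outputs, rel_weight, rel_bias, relation_names
)
print(entities[position], relation)
```

Because each step returns a single argmax position, one decoder can emit only one target per entity, which is the limitation the second decoder is meant to offset.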

Implementation Details
The context and entity encoder comprised 256 hidden units in each layer, and the dual pointer network decoder comprised 512 hidden units. We adopted a drop-out probability of 0.1 for all the LSTM cells. We used 8 heads, with 32 units per head, for the multi-head attention. The vocabulary size and word-embedding size were set to 16,925 and 300, respectively. The filter sizes of the CNNs for character and word embeddings were 3, 4, and 5, and the total number of filters was 100. Randomly initialized 50-dimensional vectors were used for the character and entity embeddings. A cross-entropy function was used as the cost function to maximize the log-probability as follows:

$$H(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log \hat{y}_i,$$

where $y$ is the target answer, $\hat{y}$ is the score distribution of the model prediction, and $N$ is the number of target classes. The loss is calculated by the cross-entropy combination of all targets and predictions. The weighting factor α was experimentally set to 0.6 as a scalar value.
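A minimal sketch of a weighted cross-entropy loss is given below. The text states only that the loss combines the cross-entropy of all targets and predictions with a weighting factor α = 0.6; how the terms are grouped is not specified, so the split into a position term and a relation term, and the names `cross_entropy` and `weighted_loss`, are assumptions made purely for illustration.

```python
import math


def cross_entropy(target_index, probs):
    """Cross-entropy for a one-hot target: -log of the target's probability."""
    return -math.log(probs[target_index])


def weighted_loss(position_loss, relation_loss, alpha=0.6):
    """Assumed combination: alpha weights one term, (1 - alpha) the other."""
    return alpha * position_loss + (1.0 - alpha) * relation_loss


# Toy predictions for one decoding step (values are illustrative).
position_probs = [0.1, 0.8, 0.1]   # distribution over entity positions
relation_probs = [0.2, 0.5, 0.3]   # distribution over relation labels

position_loss = cross_entropy(1, position_probs)
relation_loss = cross_entropy(1, relation_probs)
loss = weighted_loss(position_loss, relation_loss, alpha=0.6)
print(round(loss, 4))
```

Under this assumed grouping, a value of α above 0.5 places more weight on the first term than on the second.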

Datasets and Experimental Setting
We evaluated the proposed model using the following benchmark datasets.

ACE-05 corpus [15]: The ACE dataset includes seven major entity types and six major relation types. In its original form, the ACE-05 corpus is not suited to evaluating models that extract multiple triples from a sentence. Therefore, if some triples in the ACE-05 corpus share a sentence (i.e., some triples occur in the same sentence), the triples were merged, as shown in Figure 4.

NYT corpus [16]: This is a news corpus sampled from news articles published in the NYT. The training data are automatically labeled using distant supervision. The NYT corpus was manually converted to a relation extraction dataset by Zheng et al. [17]. We excluded sentences without relation facts from Zheng's corpus. Finally, we obtained 66,202 sentences in total. We used 59,581 sentences for training and 6,621 for testing.

We adopted the standard micro precision, recall, and F1 score to evaluate the results:

$$P = \frac{\#\,\text{correctly extracted triples}}{\#\,\text{extracted triples}}, \quad R = \frac{\#\,\text{correctly extracted triples}}{\#\,\text{gold triples}}, \quad F1 = \frac{2PR}{P + R}.$$
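Micro-averaged precision, recall, and F1 over extracted triples can be computed as sketched below. The function name `micro_prf` and the toy data are hypothetical, and the sketch assumes that a predicted triple counts as correct only when its subject, relation, and object all match a gold triple exactly; the text does not state the matching criterion.

```python
def micro_prf(gold_by_sentence, pred_by_sentence):
    """Micro-averaged precision, recall, and F1 over all sentences."""
    correct = predicted = gold = 0
    for gold_triples, pred_triples in zip(gold_by_sentence, pred_by_sentence):
        gold_set, pred_set = set(gold_triples), set(pred_triples)
        correct += len(gold_set & pred_set)   # exact-match triples (assumed)
        predicted += len(pred_set)
        gold += len(gold_set)
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one sentence with three gold triples, two of them predicted.
gold_triples = [[
    ("Lee", "employed", "ABC Mart"),
    ("his Father", "Owner", "ABC Mart"),
    ("Lee", "Family", "his Father"),
]]
pred_triples = [[
    ("Lee", "employed", "ABC Mart"),
    ("his Father", "Owner", "ABC Mart"),
]]

precision, recall, f1 = micro_prf(gold_triples, pred_triples)
print(precision, recall, f1)
```

Counts are pooled over all sentences before the ratios are taken, so sentences with many triples contribute more to the score than sentences with few.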

Experimental Results
In the first experiment, we evaluated the effectiveness of the multi-head attention in the dual pointer network decoder; the results are summarized in Table 1. The evaluation was performed using the ACE-05 corpus. In Table 1, single-head refers to a conventional attention mechanism proposed by Bahdanau et al. [11]. As shown in Table 1, the multi-head attention mechanism used in the proposed model demonstrated better performance than the single-head one. Then, using the ACE-05 corpus, we evaluated the effectiveness of multi-head attention in the context and entity encoder; the results are summarized in Table 2. In Table 2, BIDAF [18] refers to a machine reading comprehension (MRC) model based on a co-attention mechanism between a query and a context. C2Q and Q2C refer to the context-to-query attention and query-to-context attention used in the BIDAF model, respectively. As shown in Table 2, the multi-head attention mechanism used in the proposed model showed the best F1-score.
In the second experiment, we compared the proposed model with previous state-of-the-art models. Table 3 compares the performance of the proposed model with that of other models on the ACE-2005 dataset. In Table 3, SPTree [6] is a model that applies the dependency information between the entities. In FCM [19], handcrafted features are combined with word embeddings. DYGIE [9] dynamically generates spans between entities and the spans' representations. Span-Level [20] jointly performs entity mention detection and relation extraction. HRCNN [21] is a hybrid model of CNN, RNN, and FNN. Walk-Based [22] is a graph-based neural network model. As shown in Table 3, the proposed model outperformed all models across all metrics.
Table 4 compares the performance of the proposed model with existing models on the NYT corpus. In Table 4, NovelTag [17] is an end-to-end model that extracts entities and their relations based on a novel tagging scheme designed for relation extraction. MultiDecoder [2] is a Seq2Seq-based model that combines entity and relation extraction using a decoder with a copy mechanism. GraphRE [23] is a joint model that extracts entities and their relations using graph convolutional networks (GCNs) [24]. As shown in Table 4, the proposed model outperformed all models. It is not reasonable to directly compare the proposed model with these models because it requires gold-labeled entities, while the other models automatically extract entities from sentences. Nevertheless, the proposed model exhibited considerably better performance. If we adopt a state-of-the-art named entity tagger based on BERT [25] with F1-scores of 0.9 or more, the proposed model is expected to show F1-scores of 0.662 or more based on simple multiplication.
The cases where the proposed model incorrectly extracted relations are listed in Table 5. Most incorrect predictions were cases where the decoders pointed to incorrect subjects or objects, and these incorrect entities led to incorrect relation names, as shown in the first and third sentences in Table 5. In some cases, the decoder did not point to any subject or object. As a result, some triples in a sentence were omitted, as shown in the second sentence.

Conclusion
We proposed a relation extraction model to find all possible relations among multiple entities in a sentence simultaneously. The proposed model is based on pointer networks with multi-head attention mechanisms. To extract all possible relations from a sentence, we modified a single decoder into a dual decoder. In the dual decoder, the object decoder extracts n-to-1 subject-object relations, and the subject decoder extracts 1-to-n subject-object relations. The experiments with the ACE-05 corpus and the NYT corpus confirmed that the proposed model outperformed previous models, with F1-scores of 80.8% and 78.3%, respectively. Our future work will focus on an end-to-end model that directly extracts entities and their relations. In addition, we will focus on a method for improving performance using a large-scale language model like BERT [25].

Figure 2. Overall architecture of dual pointer networks for relation extraction.

Figure 2 illustrates the architecture of the proposed model. It consists of two parts: a context and entity encoder, and a dual pointer network decoder.

Table 5.
1. Gold triples: [Iraqi forces, Art, artillery] [Iraqi forces, Gen-aff, Iraqi]
   Predicted triples: [Iraqi forces, Part-whole, Iraqi] [Iraqi forces, Gen-aff, Iraqi]
2. Sentence: It is the first time they have had freedom of movement with cars and weapons since the start of the intifada
   Gold triples: [they, Art, cars] [they, Art, weapons]
   Predicted triples: [they, Art, cars]
3. Sentence: It was in northern Iraq today that an eight artillery round hit the site occupied by Kurdish fighters near Chamchamal
   Gold triples: [Kurdish fighters, Phys, the site] [the site, Phys, Chamchamal] [Kurdish, Gen-aff, Kurdish fighters] [the site, Part-whole, northern Iraq]
   Predicted triples: [Kurdish fighters, Phys, the site] [the site, Phys, Chamchamal] [the site, Part-whole, northern Iraq] [Kurdish fighters, Art, artillery]

Table 1. Performance for different attention mechanisms in the dual pointer network decoder.

Table 2. Performance for different attention mechanisms in the context and entity encoder.

Table 3. Performance comparison on ACE-2005 dataset.

Table 4. Performance comparison on NYT corpus.