Article

Enhanced Heterogeneous Graph Attention Network with a Novel Multilabel Focal Loss for Document-Level Relation Extraction

1 State Key Lab of Software Development Environment, Beihang University, Beijing 100191, China
2 School of Journalism, Communication University of China, Beijing 100024, China
* Author to whom correspondence should be addressed.
Entropy 2024, 26(3), 210; https://doi.org/10.3390/e26030210
Submission received: 8 January 2024 / Revised: 5 February 2024 / Accepted: 22 February 2024 / Published: 28 February 2024
(This article belongs to the Special Issue Natural Language Processing and Data Mining)

Abstract

Recent years have seen a rise in interest in document-level relation extraction, which is defined as extracting all relations between entities in multiple sentences of a document. Typically, there are multiple mentions corresponding to a single entity in this context. Previous research predominantly employed a holistic representation for each entity to predict relations, but this approach often overlooks valuable information contained in fine-grained entity mentions. We contend that relation prediction and inference should be grounded in specific entity mentions rather than abstract entity concepts. To address this, our paper proposes a two-stage mention-level framework based on an enhanced heterogeneous graph attention network for document-level relation extraction. Our framework employs two different strategies to model intra-sentential and inter-sentential relations between fine-grained entity mentions, yielding local mention representations for intra-sentential relation prediction and global mention representations for inter-sentential relation prediction. For inter-sentential relation prediction and inference, we propose an enhanced heterogeneous graph attention network to better model the long-distance semantic relationships and design an entity-coreference path-based inference strategy to conduct relation inference. Moreover, we introduce a novel cross-entropy-based multilabel focal loss function to address the class imbalance problem and multilabel prediction simultaneously. Comprehensive experiments have been conducted to verify the effectiveness of our framework. Experimental results show that our approach significantly outperforms the existing methods.

1. Introduction

Document-level relation extraction (DocRE) is defined as extracting all relationships between entities in multiple sentences of a document, and it plays a critical role in many natural language processing tasks, such as knowledge graph construction, knowledge discovery, and knowledge-based question answering [1,2,3]. Compared to sentence-level relation extraction [4], DocRE is more challenging for three main reasons: (1) each entity in DocRE may have multiple mentions distributed across multiple sentences; (2) relationships in the DocRE setting may span multiple sentences; (3) logical reasoning is required to identify many complex relationships. Given these characteristics of DocRE, relationships can be categorized into intra-sentential and inter-sentential relations. For instance, as shown in Figure 1, the entity mentions “Portland Golf Club” and “United States” appear in the same sentence, so the relationship “country” between them is an intra-sentential relation. In contrast, the entity mentions “Oregon” and “Washington County” span two sentences, and their relationship is an inter-sentential relation, which requires the model to build non-local representations and perform relational inference.
Faced with these difficulties in DocRE, two primary categories of related work have emerged: sequence-based and graph-based methods. Sequence-based models [5,6,7] fall short in capturing the intricate structural information necessary for modeling long-distance inter-sentential relationships. Therefore, recent approaches predominantly rely on graph structures constructed from the document. These graphs offer improved modeling of structural information and long-distance dependencies. However, these graph-based works mostly focus on concept-level entity pairs to conduct relation prediction and inference, leading to a loss of the context information contained in fine-grained entity mentions. We contend that relation prediction and inference should be grounded in specific entity mentions rather than abstract entity concepts. Furthermore, recent research [8] shows that there are in total three relation reasoning paths in the commonly used dataset for DocRE, consisting of an intra-sentential relation path and two inter-sentential relation paths (the coreference reasoning path and the logical inference path). These diverse relation path patterns pose a significant challenge to a model’s representational capacity and inference capabilities. To sum up, intra-sentential and inter-sentential relation extraction demand local mention-level representations and global mention-level representations, respectively, which is crucial for the model’s expressive capacity.
To address these challenges, we propose a two-stage mention-level framework based on an enhanced heterogeneous graph attention network for document-level relation extraction. Concretely, we first adopt a pre-coreference-resolution strategy to preprocess the input dataset. With the evidence clues in the dataset, we can transform the coreference reasoning path into either an intra-sentential relation path or a logical inference path. Then, we construct a mention-aware heterogeneous graph from the input document based on the dependency tree. Entity mentions are connected to the nodes of the graph according to our designed rules. Secondly, we adopt two different strategies for intra-sentential and inter-sentential relations. For intra-sentential relation prediction, we directly utilize local representations to model relationships within a sentence. For inter-sentential relations, which involve relation inference, we employ a graph neural network to generate global representations for entity mentions. Unlike previous graph-based methods that mostly adopt graph-convolutional-network-based models to build local and non-local representations, we design an efficient enhanced heterogeneous graph attention network, which allows us to better capture long-distance semantic relationships.
Furthermore, for relational logical reasoning, we design an entity-coreference path-based inference strategy to conduct relational inference. Finally, in order to address the class imbalance problem and multilabel prediction simultaneously, inspired by circle loss [9] and focal loss [10], we introduce a cross-entropy-based multilabel focal loss function. To verify the effectiveness of our framework, extensive experiments have been conducted. Experimental results show that our approach significantly outperforms the existing methods. We summarize the main contributions of this paper as follows:
  • We propose a two-stage mention-level framework for document-level relation extraction, which constructs a dependency-tree-based mention-aware heterogeneous graph and adopts different strategies for intra-sentential and inter-sentential relation prediction.
  • For inter-sentential relation prediction and inference, we propose an enhanced heterogeneous graph attention network to better model long-distance semantic relationships and design an entity-coreference path-based inference strategy to conduct relation inference.
  • We introduce a novel cross-entropy-based multilabel focal loss function to address the class imbalance problem and multilabel prediction simultaneously.
  • Comprehensive experiments have been conducted to verify the effectiveness of our framework. Experimental results show that our approach significantly outperforms the existing methods.

2. Related Work

Early work on relation extraction (RE) mainly concentrated on sentence-level RE [4,11,12,13,14,15], which focuses on learning local context representations for relation prediction. These works achieve great success in this idealized scenario. However, in real-world scenarios, numerous relational facts can only be identified across sentences [7,16], which is extremely difficult for sentence-level RE methods to handle. Recent attempts have been made to deal with document-level relation extraction (DocRE) through various methods that can be generally categorized into sequence-based and graph-based methods.
Sequence-based models [5,6,7] encode the whole document to output each word’s context hidden representation, which is then used to predict the relation between entities. These models are not capable of fully capturing the structural information to model inter-sentential relationships over long distances. Therefore, graph-based methods [17,18,19] have sprung up recently. Christopoulou et al. [18] introduce a new edge-oriented graph neural model designed for document-level relation extraction employing multi-instance learning. Constructing a document-level graph featuring diverse types of nodes and edges, the authors model intra- and inter-sentence pairs concurrently using an iterative algorithm applied to the graph edges. Zeng et al. [20] propose a double-graph-based method that constructs a heterogeneous mention-level graph and an entity-level graph. Firstly, they introduce a heterogeneous mention-level graph (hMG) for the interaction among different mentions. Secondly, they present an entity-level graph (EG) and suggest an innovative path reasoning mechanism to facilitate relational reasoning among entities. However, their entity-level graph sacrifices some fine-grained information. Wang et al. [21] introduce rhetorical structure theory to select appropriate evidence and to reason relations. By incorporating rhetorical structure theory (RST) as external knowledge, they aim to select appropriate evidence and demonstrate the reasoning process on a novel document graph, RST-GRAPH. This graph indicates valid semantic associations between multiple text units through RST and integrates a set of reasoning modules to capture evidence efficiently. Zeng et al. [22] propose a separate architecture to represent intra-sentential and inter-sentential relations, respectively. Additionally, they present a novel logical reasoning module that models logical reasoning as self-attention among representations of all entity pairs. Xu et al. [8] propose a discriminative relation reasoning framework that uses three sub-tasks to model intra- and inter-sentence relations. Building upon this framework, they introduce a discriminative reasoning network (DRN). In this network, they utilize both the heterogeneous graph context and the document-level context to represent distinct reasoning paths. Tan et al. [23] propose a semi-supervised framework utilizing three components—axial attention, adaptive focal loss, and knowledge distillation module—to deal with different problems in DocRE. Firstly, an axial attention module is employed to enhance performance on two-hop relations by learning interdependencies among entity pairs. Secondly, an adaptive focal loss is introduced to address the class imbalance issue in DocRE. Lastly, knowledge distillation is utilized to reconcile differences between human-annotated data and distantly supervised data. Different from most previous works based on concept-level entity representations, our method proposes a two-stage framework to deal with intra- and inter-sentential relations based on mention-level representations.

3. Methodology

In this section, we introduce our two-stage mention-level framework based on an enhanced heterogeneous graph attention network for document-level relation extraction. Figure 2 gives an overview of our method.

3.1. Task Formulation

Here, we first present the task formulation of document-level relation extraction. Given a document $\mathcal{D}$ consisting of $n$ sentences $\{s_0, s_1, s_2, \dots, s_{n-1}\}$ and a set of entities $E = \{e_0, e_1, e_2, \dots, e_{m-1}\}$, each entity $e_i$ in $E$ may have multiple mentions, denoted as $\{m_j\}_{j=0}^{k}$, distributed across different sentences of document $\mathcal{D}$. For pairs of entities in $E$, the objective of document-level relation extraction (DocRE) is to predict the relationship $y \in R$ of each entity pair, where $R$ is the predefined relation set including “NA”, which means there is no relation between the given entity pair. Note that there may be multiple relations between the same entity pair.
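For illustration, a minimal sketch of how a DocRE instance and its candidate entity pairs might be represented is given below; the class and field names are illustrative assumptions rather than the dataset’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

@dataclass
class Mention:
    entity_id: int   # index of the entity this mention refers to
    sent_id: int     # index of the sentence containing the mention
    start: int       # token offset of the first mention token in the document
    end: int         # token offset one past the last mention token

@dataclass
class DocREInstance:
    sentences: List[List[str]]            # n sentences, each a list of tokens
    mentions: List[Mention]               # all mentions of all entities
    # (head entity, tail entity) -> set of relation types; absent pairs mean "NA"
    labels: Dict[Tuple[int, int], FrozenSet[str]] = field(default_factory=dict)

def candidate_pairs(doc: DocREInstance) -> List[Tuple[int, int]]:
    """Every ordered pair of distinct entities is a classification instance."""
    entity_ids = sorted({m.entity_id for m in doc.mentions})
    return [(h, t) for h in entity_ids for t in entity_ids if h != t]
```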

3.2. Model Architecture

3.2.1. Pre-Coreference-Resolution

Recent research [8] shows that there are in total three relation reasoning paths in the commonly used dataset for DocRE, consisting of an intra-sentential relation path and two inter-sentential relation paths. The two inter-sentential relation paths can be further divided into the coreference reasoning path and the logical inference path. In our framework, we first use spaCy [24] to apply coreference resolution to the documents of the dataset, whereby coreference reasoning paths are converted into intra-sentential relation paths with the help of the annotated supporting evidence contained in the dataset. To better illustrate this process, we present corresponding examples of the three relation reasoning paths and of pre-coreference-resolution in Figure 3. For the intra-sentential relation path and the logical inference path, we detail our methods in the subsequent sections.
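To make this preprocessing step concrete, the following is a minimal sketch of the substitution applied after coreference resolution. It assumes that a coreference component (whichever is used) has already produced clusters of character spans; each non-representative mention is rewritten with the text of its cluster’s first mention so that coreferent mentions in other sentences become explicit.

```python
from typing import List, Tuple

# A cluster is a list of (char_start, char_end) spans that refer to the same entity.
# How the clusters are produced (e.g., by a coreference pipeline) is assumed here.
Cluster = List[Tuple[int, int]]

def resolve_coreferences(text: str, clusters: List[Cluster]) -> str:
    """Rewrite every non-representative mention in each cluster with the text of the
    cluster's first mention, so that pronoun/alias mentions in other sentences become
    explicit mentions of the entity."""
    replacements = []
    for cluster in clusters:
        rep_start, rep_end = cluster[0]
        rep_text = text[rep_start:rep_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, rep_text))
    # Apply replacements from right to left so earlier character offsets stay valid.
    for start, end, rep_text in sorted(replacements, reverse=True):
        text = text[:start] + rep_text + text[end:]
    return text

# Example: the pronoun "It" in the second sentence is replaced by "Portland Golf Club".
doc = "Portland Golf Club is a private golf club. It is located in Oregon."
print(resolve_coreferences(doc, [[(0, 18), (43, 45)]]))
```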

3.2.2. Mention-Aware Dependency Graph Construction

To take advantage of the rich semantic information in the dependency tree, we construct our mention-aware dependency graph according to a set of designed rules. Specifically, we first use spaCy to transform each sentence of document $\mathcal{D}$ into a syntactic dependency tree. Previous graph-based methods construct dependency trees over fine-grained tokens, which loses information about the entity mention as a whole. Therefore, we replace the token nodes of an entity mention with a corresponding mention node while keeping the fundamental semantic structure. Besides the semantic edges of the dependency tree, we connect mention nodes that refer to the same entity with a pre-defined edge type named “entity-coreference”, which contributes to further relation inference. For clarity, we illustrate the transformation in Figure 4.
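A rough sketch of this construction is shown below. It assumes mention spans are given by the dataset annotations; the graph library and the precise collapsing rules here are illustrative simplifications.

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with a dependency parser

def build_mention_aware_graph(text, mentions):
    """mentions: list of (entity_id, token_start, token_end) spans, assumed to come
    from the dataset annotations. Token nodes covered by a mention are collapsed into
    a single mention node, and dependency edges are redirected to that node."""
    doc = nlp(text)
    node_of = {i: f"tok:{i}" for i in range(len(doc))}   # token index -> node id
    graph = nx.MultiGraph()
    for k, (entity_id, start, end) in enumerate(mentions):
        graph.add_node(f"mention:{k}", entity_id=entity_id)
        for i in range(start, end):
            node_of[i] = f"mention:{k}"
    # Dependency edges between (possibly collapsed) nodes, labelled with the dependency type.
    for token in doc:
        u, v = node_of[token.i], node_of[token.head.i]
        if u != v:
            graph.add_edge(u, v, etype=token.dep_)
    # "entity-coreference" edges between mentions that refer to the same entity.
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if mentions[i][0] == mentions[j][0]:
                graph.add_edge(f"mention:{i}", f"mention:{j}", etype="entity-coreference")
    return graph
```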

3.2.3. Context Encoder

We use pre-trained language models (PLMs) as our basic encoder to obtain context representations of the input document. Formally, the input document $\mathcal{D} = \{w_i\}_{i=0}^{n}$ is fed to the PLM to output the local context representations, which is formulated as
H_{local} = \mathrm{PLM}(\mathcal{D})    (1)
where $H_{local}$ denotes the output hidden vectors of the pre-trained language model.
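A minimal sketch of this encoding step with the Hugging Face Transformers library is shown below; documents longer than the encoder’s input window would require a sliding-window strategy, which is omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

def encode_document(sentences):
    """Return one contextual hidden vector per sub-word token of the document (H_local)."""
    text = " ".join(sentences)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)   # shape: (sequence_length, hidden_size)
```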

3.2.4. Enhanced Heterogeneous Graph Attention Network

To tackle the challenge of the heterogeneous dependency-based graph, previous works mostly adopt graph-convolutional-network-based models such as R-GCN [25], which are not efficient for relation inference due to their neglect of edge-type information. To model long-distance relationships, we propose an effective enhanced heterogeneous graph attention network, which provides the global representations of entity mentions for inter-sentential relation inference.
Attention-incorporated edge-type information. Although R-GCN extends the graph convolutional network to the heterogeneous graph setting, it may not be optimal because it assigns equal importance to neighbors connected by different edge types. Inspired by graph attention networks [26], we integrate edge-type embeddings with an attention mechanism so that the model can focus on crucial information from specific edge types. Concretely, we embed each edge type into a d-dimensional vector. The attention can be formulated as follows:
\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}h_i \,\Vert\, \mathbf{W}h_j \,\Vert\, R_{\phi(i,j)}\right]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}h_i \,\Vert\, \mathbf{W}h_k \,\Vert\, R_{\phi(i,k)}\right]\right)\right)}    (2)
where $\mathcal{N}_i$ is the neighbor set of node $i$, $\mathbf{W}$ is a learnable transformation matrix, $\Vert$ denotes vector concatenation, and $R_{\phi(i,j)}$ is the edge-type embedding of the edge between node $i$ and node $j$. During training, the edge-type embedding matrix $R$ is learnable.
Residual attention. To make the graph network deeper, inspired by [27], we use a pre-activation residual connection in node aggregation. In addition, RealFormer [28] shows that residual attention is beneficial, so we also add a residual connection to the attention layer. The above process can be formulated as follows:
h_i^{l} = \sigma\Big(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{l}\,\mathbf{W}^{l}h_j^{l-1} + h_i^{l-1}\Big)    (3)
\alpha_{ij}^{l} = \frac{\exp\left(\mathrm{att\_score}^{l}(h_i^{l}, h_j^{l}) + \mathrm{att\_score}^{l-1}(h_i^{l-1}, h_j^{l-1})\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{att\_score}^{l}(h_i^{l}, h_k^{l}) + \mathrm{att\_score}^{l-1}(h_i^{l-1}, h_k^{l-1})\right)}    (4)
\mathrm{att\_score}^{l}(h_i^{l}, h_j^{l}) = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}^{l}h_i \,\Vert\, \mathbf{W}^{l}h_j \,\Vert\, R_{\phi(i,j)}\right]\right)    (5)
where σ is an activation function.
Multihead attention. To further boost the network’s expressive capacity, we use a multihead attention mechanism, as adopted by previous works. In this setting, the update rule for node i can be formulated as:
h_i^{l} = \sigma\Big(\big\Vert_{k=1}^{q}\sum_{j\in\mathcal{N}_i}\alpha_{ijk}^{l}\,\mathbf{W}_k^{l}h_j^{l-1} + h_i^{l-1}\Big)    (6)
where $\sigma$ is an activation function, $\Vert$ denotes the concatenation of vectors, $q$ is the number of heads, and $\alpha_{ijk}^{l}$ is calculated as in Equation (4).
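The following PyTorch sketch illustrates one layer that combines the three ingredients above (edge-type embeddings in the attention score, residual attention across layers, and multihead aggregation with a pre-activation residual). The dense adjacency handling, dimensions, and initialization are simplifying assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EHGATLayer(nn.Module):
    """One illustrative layer of the enhanced heterogeneous graph attention network."""

    def __init__(self, dim, num_edge_types, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.W = nn.Linear(dim, dim, bias=False)                      # shared node transform
        self.edge_emb = nn.Embedding(num_edge_types, self.head_dim)   # R_phi(i,j)
        self.a = nn.Parameter(torch.empty(num_heads, 3 * self.head_dim))
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, edge_type, mask, prev_scores=None):
        # h: (N, dim); edge_type: (N, N) long tensor of edge-type ids (non-edge entries
        # can be arbitrary, they are masked out); mask: (N, N) bool adjacency matrix.
        n = h.size(0)
        wh = self.W(h).view(n, self.num_heads, self.head_dim)          # (N, H, d_h)
        r = self.edge_emb(edge_type)                                   # (N, N, d_h)
        # Build [W h_i || W h_j || R_phi(i,j)] for every node pair and every head.
        hi = wh.unsqueeze(1).expand(n, n, self.num_heads, self.head_dim)
        hj = wh.unsqueeze(0).expand(n, n, self.num_heads, self.head_dim)
        rij = r.unsqueeze(2).expand(n, n, self.num_heads, self.head_dim)
        cat = torch.cat([hi, hj, rij], dim=-1)                          # (N, N, H, 3*d_h)
        scores = F.leaky_relu((cat * self.a).sum(-1))                   # att_score, (N, N, H)
        if prev_scores is not None:                                     # residual attention
            scores = scores + prev_scores
        masked = scores.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        alpha = torch.nan_to_num(torch.softmax(masked, dim=1))          # normalise over neighbours j
        out = torch.einsum("ijh,jhd->ihd", alpha, wh).reshape(n, -1)    # concatenate heads
        return torch.relu(out + h), scores                              # pre-activation residual on nodes
```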

3.2.5. Relation Classification

Intra-sentential relation. For intra-sentential relation prediction, we directly use the entity mentions’ local representations to identify the relation. Specifically, given an entity mention pair $\{m_h, m_t\}$ within a sentence, each mention’s local context embedding is calculated as $h = \frac{1}{l}\sum_{i=start}^{start+l} h_i$, where $start$ is the position at which the mention begins, $l$ is the length of the mention, and $h_i$ is the output hidden vector of the pre-trained language model at the $i$-th position. We first concatenate the entity-type embedding and the local mention context embedding from the PLM, such as RoBERTa [29]. To incorporate the semantic structure information within the sentence, we also use the lowest common ancestor of the two mention nodes in the dependency tree as additional structural information. We use a gated linear unit (GLU) to fuse the representations. Formally, the logits score of the mention pair can be calculated as follows:
\mathrm{score} = \mathbf{W}_o\left(\mathbf{W}_u\hat{h}_{local} \odot \sigma\left(\mathbf{W}_v\hat{h}_{local}\right)\right)    (7)
\hat{h}_{local} = \left[h_i \,\Vert\, h_j \,\Vert\, a_{ij}\right]    (8)
where $\Vert$ stands for the concatenation of vectors, $\odot$ is the point-wise multiplication, $\sigma$ is the sigmoid activation function, $a_{ij}$ is the hidden vector of the lowest common ancestor of the two mention nodes in the dependency tree, and $\mathbf{W}_u, \mathbf{W}_v \in \mathbb{R}^{s\times d}$ and $\mathbf{W}_o \in \mathbb{R}^{r\times s}$ are learnable parameters.
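A compact sketch of this intra-sentential scorer is given below; the dimension bookkeeping and the way the entity types and the lowest-common-ancestor vector are supplied are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntraSentenceScorer(nn.Module):
    """Concatenate the two mentions' local embeddings, their entity-type embeddings, and
    the hidden vector of their lowest common ancestor, then fuse with a gated linear unit."""

    def __init__(self, hidden, type_dim, num_types, proj, num_relations):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, type_dim)
        in_dim = 2 * (hidden + type_dim) + hidden        # [h_i || h_j || a_ij]
        self.W_u = nn.Linear(in_dim, proj)
        self.W_v = nn.Linear(in_dim, proj)
        self.W_o = nn.Linear(proj, num_relations)

    def mention_embedding(self, H_local, start, length):
        # Average of the PLM hidden vectors covering the mention.
        return H_local[start:start + length].mean(dim=0)

    def forward(self, H_local, head, tail, lca_vec):
        # head / tail: (start, length, entity_type_id); lca_vec: hidden vector of the
        # lowest common ancestor of the two mention nodes in the dependency tree.
        h_i = torch.cat([self.mention_embedding(H_local, head[0], head[1]),
                         self.type_emb.weight[head[2]]])
        h_j = torch.cat([self.mention_embedding(H_local, tail[0], tail[1]),
                         self.type_emb.weight[tail[2]]])
        h_hat = torch.cat([h_i, h_j, lca_vec])                       # \hat{h}_local
        gated = self.W_u(h_hat) * torch.sigmoid(self.W_v(h_hat))     # GLU fusion
        return self.W_o(gated)                                       # relation logits
```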
Inter-sentential relation. For inter-sentential relation prediction involving relation inference, the global representations of entity mentions are indispensable, which are obtained by our enhanced heterogeneous graph attention network (EHGAT). We argue that the inter-sentential logical reasoning path always appears in a specific sentence pattern. Therefore, we design an entity-coreference path-based inference strategy to capture composite relations. First, we give the definition of the compoundable mention pair of entity-coreference paths.
Definition 1. 
Given entity mentions $m_1$, $m_2$ within sentence $s_1$ and mentions $m_3$, $m_4$ within sentence $s_2$, if $m_2$ and $m_3$ refer to the same entity and $m_1$ and $m_4$ refer to different entities, then $m_1$ and $m_4$ are defined as the compoundable mention pair of the entity-coreference path.
According to the above definition, if there is a relation $R_1$ between $m_1$ and $m_2$ and a relation $R_2$ between $m_3$ and $m_4$, then a predefined relation is likely to exist between $m_1$ and $m_4$. For a document $\mathcal{D}$, we pick out all compoundable mention pairs of coreference paths. For each compoundable mention pair $m_s$ and $m_o$, we use a gated linear unit (GLU) to fuse the representations, and the logits score between them is formulated as follows:
\mathrm{score} = \mathbf{W}_o\left(\mathbf{W}_u\hat{h}_{global} \odot \sigma\left(\mathbf{W}_v\hat{h}_{global}\right)\right)    (9)
\hat{h}_{global} = \left[h_s \,\Vert\, h_o\right]    (10)
where $\Vert$ stands for the concatenation of vectors, $\odot$ is the point-wise multiplication, $h_s$ and $h_o$ are the global context embeddings of the mention pair $\{m_s, m_o\}$, $\sigma$ is the sigmoid activation function, and $\mathbf{W}_u, \mathbf{W}_v \in \mathbb{R}^{s\times d}$ and $\mathbf{W}_o \in \mathbb{R}^{r\times s}$ are learnable parameters. Note that these parameters differ from those used in the intra-sentential stage.
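To make Definition 1 operational, the following sketch enumerates compoundable mention pairs from a list of mention annotations; the tuple layout is an illustrative assumption.

```python
from itertools import permutations

def compoundable_pairs(mentions):
    """mentions: list of (mention_id, entity_id, sent_id). Following Definition 1, return
    ordered pairs (m1, m4) from different sentences that are bridged by two coreferent
    mentions (m2, m3) of a third entity, one in each of the two sentences."""
    pairs = set()
    for m2, m3 in permutations(mentions, 2):
        if m2[1] != m3[1] or m2[2] == m3[2]:
            continue                      # m2, m3 must share an entity but not a sentence
        for m1 in mentions:
            if m1[2] != m2[2] or m1[1] == m2[1]:
                continue                  # m1 is in m2's sentence but names another entity
            for m4 in mentions:
                if m4[2] != m3[2] or m4[1] == m3[1] or m4[1] == m1[1]:
                    continue              # m4 is in m3's sentence and names yet another entity
                pairs.add((m1[0], m4[0]))
    return pairs
```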

3.2.6. Multilabel Focal Loss Function

According to our analysis of the dataset for DocRE, there may be multiple relationships between a mention pair, so mention-level relation prediction should be regarded as a multilabel classification problem. Furthermore, most mention pairs have the “NA” relation, which means that negative instances far outnumber positive instances, leading to a class imbalance problem. To address these challenges simultaneously, inspired by circle loss [9] and focal loss [10], we introduce a cross-entropy-based multilabel focal loss function. Specifically, we first introduce a threshold class denoted “TH”. The logits scores of positive classes are expected to be higher than the logits score of the threshold class, and the logits scores of negative classes are expected to be lower than the logits score of the threshold class, which can be formulated as follows:
\mathcal{L}_{pos}^{n} = \log\Big(\exp\big(S_{TH}^{n}\big) + \sum_{i\in\Omega_{neg}}\exp\big(S_{i}^{n}\big)\Big) + \log\Big(\exp\big(-S_{TH}^{n}\big) + \sum_{j\in\Omega_{pos}}\exp\big(-S_{j}^{n}\big)\Big)    (11)
\mathcal{L}_{neg}^{k} = -\log\frac{\exp\big(S_{TH}^{k}\big)}{\sum_{i\in\Omega}\exp\big(S_{i}^{k}\big)}    (12)
where $\Omega_{neg}$ is the set of negative classes, $\Omega_{pos}$ is the set of positive classes, $\Omega$ denotes the set of all classes, and $S_i$ is the logits score of relation class $i$. Finally, the total loss over the dataset is calculated as:
\mathcal{L} = \sum_{n\in R_{pos}}\mathcal{L}_{pos}^{n} + \sum_{k\in R_{neg}}\big(1 - p_{k}(NA)\big)^{\gamma}\,\mathcal{L}_{neg}^{k}    (13)
p_{k}(NA) = \frac{\exp\big(S_{TH}^{k}\big)}{\sum_{i\in\Omega}\exp\big(S_{i}^{k}\big)}    (14)
where $R_{pos}$ is the set of positive instances, $R_{neg}$ is the set of negative instances, $\gamma$ is a hyperparameter, and $p_{k}(NA)$ is the probability of the “NA” relationship for instance $k$. With this focal-style loss function, the class imbalance problem is alleviated, leading to an improvement in performance.
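A PyTorch sketch of this loss, following the formulation above, is shown below; reserving index 0 for the “TH” class and summing over instances are our own simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, labels, gamma=2.0):
    """logits: (batch, num_classes) with index 0 reserved for the threshold class "TH".
    labels: (batch, num_classes) multi-hot gold relations (labels[:, 0] is unused).
    Positive instances use the threshold-based multilabel cross-entropy; negative
    instances (no gold relation) use a focal-weighted cross-entropy towards "TH"."""
    th_logit = logits[:, 0]
    is_positive = labels[:, 1:].sum(dim=1) > 0

    # Positive instances: push positive-class scores above TH and negative-class scores below TH.
    pos_mask = labels[:, 1:].bool()
    rel_logits = logits[:, 1:]
    neg_part = torch.logsumexp(
        torch.cat([th_logit.unsqueeze(1),
                   rel_logits.masked_fill(pos_mask, float("-inf"))], dim=1), dim=1)
    pos_part = torch.logsumexp(
        torch.cat([-th_logit.unsqueeze(1),
                   (-rel_logits).masked_fill(~pos_mask, float("-inf"))], dim=1), dim=1)
    loss_pos = (neg_part + pos_part)[is_positive].sum()

    # Negative instances: cross-entropy towards "TH", modulated by (1 - p(NA))^gamma.
    log_p_na = F.log_softmax(logits, dim=1)[:, 0]
    focal_weight = (1.0 - log_p_na.exp()) ** gamma
    loss_neg = (focal_weight * (-log_p_na))[~is_positive].sum()

    return loss_pos + loss_neg
```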

4. Experiments

This section presents the details of our experiments, including the datasets, settings, baselines, hyperparameters, and experimental results.

4.1. Dataset

We use DocRED [7], a widely used benchmark for document-level relation extraction, to evaluate our method. DocRED is the largest human-annotated dataset for document-level relation extraction, first proposed by Yao et al. [7]. It is constructed from Wikipedia and Wikidata and contains over 5000 documents, 132,375 entities, 96 relation types, and 56,354 relational facts. Unlike sentence-level relation extraction datasets, about 40.7% of the relational facts in DocRED are inter-sentential relations, which can only be extracted from multiple sentences. In addition, evidence for relational facts is available in DocRED, which is critical for our approach. Details of DocRED are presented in Table 1.

4.2. Baseline Methods

We use recent competitive models as baselines for comparison, including Coref [30], SSAN [31], GAIN [20], ATLOP [32], DocuNet [33], EIDER [34], SAIS [35], HAG [36], and AFLKD [23].
Coref [30] presents a language representation model named CorefBERT, strengthening the coreferential reasoning ability of the BERT model.
SSAN [31] formalizes entity structure for document-level relation extraction and effectively incorporates such structural priors into both contextual reasoning and structure reasoning of entities.
GAIN [20] constructs two levels of graph structures: mention-level graph and entity-level graph, based on which entity interactions and relational logical reasoning are modeled.
ATLOP [32] proposes adaptive thresholding and localized context pooling for document-level relation extraction.
DocuNet [33] formulates document-level relation extraction as a semantic segmentation task and introduces the document U-shaped network.
EIDER [34] proposes a three-stage DocRE framework. It comprises joint relation and evidence extraction, evidence-centered relation extraction, and fusion of extraction results, which take advantage of the evidence sentences.
SAIS [35] explicitly teaches the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps and further boosts the performance through evidence-based data augmentation and ensemble inference while reducing the computational cost.
HAG [36] proposes a heterogeneous affinity graph inference network, which utilizes coref-aware relation modeling and a noise suppression mechanism to address the long-distance reasoning challenges in document-level RE.
AFLKD [23] utilizes an axial attention module for learning the inter-dependency among entity pairs. It also proposes an adaptive focal loss to tackle the class imbalance problem and adopts knowledge distillation to make use of distantly supervised data in DocRED.

4.3. Implementation Details

Our model is implemented with PyTorch [37] and the Transformers library from Hugging Face [38]. We use the pre-trained language model RoBERTa-large [29] as the encoder of our framework. During training, our model is optimized with AdamW [39] in mixed-precision mode. We stack six layers in the enhanced heterogeneous graph attention network. The hyperparameter $\gamma$ of the multilabel focal loss is set to 2.
Following previous works [7,32,34], we adopt F1 and Ign F1 scores as the standard evaluation metrics. The difference between them is that relation facts shared between the training set and the dev/test sets are excluded from the Ign F1 calculation. For a fair comparison, all baseline models are based on RoBERTa-large [29], the same encoder our TSFGAT adopts. The experiments are conducted five times with different random seeds, and the average scores are reported. We train and evaluate our TSFGAT on four Tesla V100 16 GB GPUs.
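The stated optimization setup could look roughly like the following fragment; the model constructor, data loader, learning rate, and weight decay are illustrative placeholders rather than reported values.

```python
import torch
from torch.optim import AdamW

model = build_tsfgat()                   # hypothetical constructor for the full model
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)   # illustrative values
scaler = torch.cuda.amp.GradScaler()     # mixed-precision training

for batch in train_loader:               # train_loader assumed to yield prepared batches
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)              # assumed to return the multilabel focal loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```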

4.4. Experimental Result

We report the experimental results in Table 2. Our model significantly outperforms the existing baselines: it achieves 68.14 Ign_F1 and 69.93 F1 on the test set, a new state-of-the-art result on DocRED. This suggests that our approach can capture both the local context representations of mentions needed for intra-sentential relation extraction and the global context representations needed for inter-sentential relation extraction. It is worth noting that, among the above baselines, our method is the only one that incorporates both the evidence information contained in DocRED and syntactic structure information.

5. Analysis

In this section, we discuss the influence of each module on our method and provide details of the ablation experiments. Additionally, we conduct a case study to explain the inference capacity of our approach.

5.1. Ablation Study

Effect of two-stage strategy. To evaluate the effectiveness of our two-stage strategy, which predicts intra-sentential and inter-sentential relations separately, we conduct an ablation study on DocRED and report the result in the third row of Table 3. Specifically, in the “w/o two-stage strategy” setting, we delete the intra-sentential stage, which means that intra-sentential relations are also predicted with the global mention representations of the inter-sentential stage. The experimental result shows that with the one-stage strategy, the Ign_F1 and F1 scores drop by 2.82 and 3.31, respectively, demonstrating the effectiveness of our designed two-stage framework. This suggests that intra- and inter-sentential relation extraction require distinct local and non-local mention-level representations.
Effect of pre-coreference-resolution. To evaluate the effectiveness of pre-coreference-resolution, we conduct an ablation study on DocRED and report the result in the fourth row of Table 3. Specifically, in the “w/o pre-coreference-resolution” setting, we delete the pre-coreference-resolution process. The result shows that the Ign_F1 and F1 scores drop significantly, by 26.46 and 26.23, respectively. This confirms that our framework relies on pre-coreference-resolution to handle the coreference reasoning path, as illustrated in Section 3.2.1.
Effect of enhanced heterogeneous graph attention network. In order to prove the effectiveness of the proposed enhanced heterogeneous graph attention network (EHGAT) as a whole, we conduct an ablation study on DocRED and report the result in the fifth row of Table 3. Specifically, in the “w/o enhanced heterogeneous graph attention network” setting, similar to earlier efforts, we utilize R-GCN [25] in the inter-sentential stage. The experimental result shows that the Ign_F1 and F1 scores drop by 1.85 and 2.21, respectively. It demonstrates the superiority of our proposed enhanced heterogeneous graph attention network compared to the widely used R-GCN model in previous works.
Effect of attention with edge-type information. To analyze the role of attention with edge-type embedding in our proposed EHGAT, we conduct an ablation study on DocRED and report the result in Table 3. Specifically, in the “w/o attention with edge-type information” setting, we only use node information to calculate the attention instead. The result shows that when we abandon type information in the attention calculation of the heterogeneous graph, the Ign_F1 and F1 scores drop by 0.52 and 0.92, respectively. It suggests that incorporating type information is beneficial for the modeling of heterogeneous graphs.
Effect of residual attention. To analyze the effect of residual attention in our proposed EHGAT, we conduct an ablation study on DocRED and report the result in Table 3. Specifically, in the “w/o residual attention” setting, we use vanilla attention without residual connection. The result shows that compared to the full model, the Ign_F1 and F1 scores drop by 1.01 and 1.17, respectively. It proves that the residual attention mechanism is effective for our multilayer heterogeneous graph network.
Effect of multilabel focal loss. In order to assess the effect of the proposed multilabel focal loss, we conduct an ablation study on DocRED and report the result in Table 3. To be concrete, in the “w/o multilabel focal loss” setting, we adopt the vanilla binary cross-entropy loss instead, which is adopted by most previous works. The result shows that compared to the full model, the Ign_F1 and F1 scores drop by 3.77 and 4.29, respectively. It demonstrates the effectiveness of our proposed multilabel focal loss for the imbalanced multilabel dataset.

5.2. Case Study

To better demonstrate the inference capacity of our approach, we present several prediction cases from the development set of DocRED in Figure 5. Case 1 involves the coreference reasoning path; it is converted to intra-sentential relation prediction by our pre-coreference-resolution process, showing our method’s ability to handle the coreference reasoning path. As shown in Figure 5, Case 2 involves a relational logical inference path. To determine the relationship “country” between the entity mentions “Royal Swedish Academy of Sciences” and “Swedish”, the model should capture the logical inference path: (Johan Gottlieb Gahn, country of citizenship, Swedish) + (Gahn, member of, Royal Swedish Academy of Sciences) = (Royal Swedish Academy of Sciences, country, Swedish). Through our entity-coreference path-based inference strategy and enhanced heterogeneous graph attention network, our model obtains the global representations of the mentions “Royal Swedish Academy of Sciences” and “Swedish”, based on which relation inference is conducted to arrive at the correct answer “country”. Case 3 is similar to Case 2 in Figure 5, so we do not elaborate on it. These cases show our model’s capacity for explainable relation prediction and inference.

6. Conclusions

In this paper, we propose a two-stage mention-level framework for document-level relation extraction, which constructs a dependency-tree-based mention-aware heterogeneous graph and adopts different strategies for intra-sentential and inter-sentential relation prediction. For inter-sentential relation prediction and inference, we propose an enhanced heterogeneous graph attention network to better model long-distance semantic relationships and design an entity-coreference path-based inference strategy to conduct relation inference. Furthermore, we introduce a cross-entropy-based multilabel focal loss function to address the class imbalance problem and multilabel prediction simultaneously. A series of experiments are conducted on the widely used DocRE benchmark. Experimental results show that our approach significantly outperforms the existing methods, and further ablation analysis demonstrates the effectiveness of each component of our framework.

Author Contributions

Conceptualization, Y.C.; Methodology, Y.C.; Software, Y.C.; Validation, Y.C.; Investigation, B.S.; Resources, B.S.; Writing—original draft, Y.C.; Writing—review & editing, B.S.; Project administration, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. CUC23ZDTJ002).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy and ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, Y.; Li, W.; Dai, G.; Shang, X. Advancements in Complex Knowledge Graph Question Answering: A Survey. Electronics 2023, 12, 4395. [Google Scholar] [CrossRef]
  2. Wei, S.; Liang, Y.; Li, X.; Weng, X.; Fu, J.; Han, X. Chinese Few-Shot Named Entity Recognition and Knowledge Graph Construction in Managed Pressure Drilling Domain. Entropy 2023, 25, 1097. [Google Scholar] [CrossRef] [PubMed]
  3. Tian, H.; Zhang, X.; Wang, Y.; Zeng, D. Multi-task learning and improved TextRank for knowledge graph completion. Entropy 2022, 24, 1495. [Google Scholar] [CrossRef]
  4. Xu, J.; Chen, Y.; Qin, Y.; Huang, R.; Zheng, Q. A feature combination-based graph convolutional neural network model for relation extraction. Symmetry 2021, 13, 1458. [Google Scholar] [CrossRef]
  5. Verga, P.; Strubell, E.; McCallum, A. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. arXiv 2018, arXiv:1802.10569. [Google Scholar]
  6. Jia, R.; Wong, C.; Poon, H. Document-Level N-ary Relation Extraction with Multiscale Representation Learning. arXiv 2019, arXiv:1904.02347. [Google Scholar]
  7. Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, Z.; Huang, L.; Zhou, J.; Sun, M. DocRED: A large-scale document-level relation extraction dataset. arXiv 2019, arXiv:1906.06127. [Google Scholar]
  8. Xu, W.; Chen, K.; Zhao, T. Discriminative reasoning for document-level relation extraction. arXiv 2021, arXiv:2106.01562. [Google Scholar]
  9. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407. [Google Scholar]
  10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  11. Jiang, X.; Wang, Q.; Li, P.; Wang, B. Relation extraction with multi-instance multi-label convolutional neural networks. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 1471–1480. [Google Scholar]
  12. Huang, Y.Y.; Wang, W.Y. Deep residual learning for weakly-supervised relation extraction. arXiv 2017, arXiv:1707.08866. [Google Scholar]
  13. Soares, L.B.; FitzGerald, N.; Ling, J.; Kwiatkowski, T. Matching the blanks: Distributional similarity for relation learning. arXiv 2019, arXiv:1906.03158. [Google Scholar]
  14. Peng, H.; Gao, T.; Han, X.; Lin, Y.; Li, P.; Liu, Z.; Sun, M.; Zhou, J. Learning from context or names? an empirical study on neural relation extraction. arXiv 2020, arXiv:2010.01923. [Google Scholar]
  15. Yin, H.; Liu, S.; Jian, Z. Distantly Supervised Relation Extraction via Contextual Information Interaction and Relation Embeddings. Symmetry 2023, 15, 1788. [Google Scholar] [CrossRef]
  16. Cheng, Q.; Liu, J.; Qu, X.; Zhao, J.; Liang, J.; Wang, Z.; Huai, B.; Yuan, N.J.; Xiao, Y. HacRED: A Large-Scale Relation Extraction Dataset Toward Hard Cases in Practical Applications. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 2819–2831. [Google Scholar]
  17. Sahu, S.K.; Christopoulou, F.; Miwa, M.; Ananiadou, S. Inter-sentence relation extraction with document-level graph convolutional neural network. arXiv 2019, arXiv:1906.04684. [Google Scholar]
  18. Christopoulou, F.; Miwa, M.; Ananiadou, S. Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. arXiv 2019, arXiv:1909.00228. [Google Scholar]
  19. Wang, D.; Hu, W.; Cao, E.; Sun, W. Global-to-local neural networks for document-level relation extraction. arXiv 2020, arXiv:2009.10359. [Google Scholar]
  20. Zeng, S.; Xu, R.; Chang, B.; Li, L. Double graph based reasoning for document-level relation extraction. arXiv 2020, arXiv:2009.13752. [Google Scholar]
  21. Wang, H.; Qin, K.; Lu, G.; Yin, J.; Zakari, R.Y.; Owusu, J.W. Document-level relation extraction using evidence reasoning on RST-GRAPH. Knowl. Based Syst. 2021, 228, 107274. [Google Scholar] [CrossRef]
  22. Zeng, S.; Wu, Y.; Chang, B. Sire: Separate intra-and inter-sentential reasoning for document-level relation extraction. arXiv 2021, arXiv:2106.01709. [Google Scholar]
  23. Tan, Q.; He, R.; Bing, L.; Ng, H.T. Document-Level Relation Extraction with Adaptive Focal Loss and Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 1672–1681. [Google Scholar]
  24. Honnibal, M.; Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To Appear 2017, 7, 411–420. [Google Scholar]
  25. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; Van Den Berg, R.; Titov, I.; Welling, M. Modeling relational data with graph convolutional networks. In Proceedings of the Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, 3–7 June 2018; Proceedings 15. Springer: Cham, Switzerland, 2018; pp. 593–607. [Google Scholar]
  26. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  27. Li, G.; Xiong, C.; Thabet, A.; Ghanem, B. Deepergcn: All you need to train deeper gcns. arXiv 2020, arXiv:2006.07739. [Google Scholar]
  28. He, R.; Ravula, A.; Kanagal, B.; Ainslie, J. Realformer: Transformer likes residual attention. arXiv 2020, arXiv:2012.11747. [Google Scholar]
  29. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  30. Ye, D.; Lin, Y.; Du, J.; Liu, Z.; Li, P.; Sun, M.; Liu, Z. Coreferential reasoning learning for language representation. arXiv 2020, arXiv:2004.06870. [Google Scholar]
  31. Xu, B.; Wang, Q.; Lyu, Y.; Zhu, Y.; Mao, Z. Entity structure within and throughout: Modeling mention dependencies for document-level relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 14149–14157. [Google Scholar]
  32. Zhou, W.; Huang, K.; Ma, T.; Huang, J. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 14612–14620. [Google Scholar]
  33. Zhang, N.; Chen, X.; Xie, X.; Deng, S.; Tan, C.; Chen, M.; Huang, F.; Si, L.; Chen, H. Document-level relation extraction as semantic segmentation. arXiv 2021, arXiv:2106.03618. [Google Scholar]
  34. Xie, Y.; Shen, J.; Li, S.; Mao, Y.; Han, J. Eider: Empowering Document-level Relation Extraction with Efficient Evidence Extraction and Inference-stage Fusion. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 257–268. [Google Scholar]
  35. Xiao, Y.; Zhang, Z.; Mao, Y.; Yang, C.; Han, J. SAIS: Supervising and augmenting intermediate steps for document-level relation extraction. arXiv 2021, arXiv:2109.12093. [Google Scholar]
  36. Li, R.; Zhong, J.; Xue, Z.; Dai, Q.; Li, X. Heterogenous affinity graph inference network for document-level relation extraction. Knowl. Based Syst. 2022, 250, 109146. [Google Scholar] [CrossRef]
  37. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  38. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  39. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. An example from the DocRED dataset. Intra-sentential and inter-sentential relations are marked with blue and red lines, respectively [1,2,3,4,5,6,7,8].
Figure 2. Overview of our two-stage framework for intra-sentential and inter-sentential relation extractions.
Figure 3. Examples of three relation reasoning paths and pre-coreference-resolution. Note that after pre-coreference-resolution, the coreference reasoning paths would be converted into intra-sentential relation paths [1,2,3,7].
Figure 4. Illustration of construction procedure of mention-aware dependency graph. The circles represent the entity mention nodes in the graph. Mention nodes that refer to the same entity are connected with the “entity-coreference” edge.
Figure 5. Three cases for illustrating the predictions of our TSFGAT. Case 1 involves the coreference reasoning path. Cases 2 and 3 involve the relational logical inference path [1,2,7].
Table 1. Statistics of benchmark DocRED.
Statistic                        DocRED
# Train docs.                    3053
# Dev docs.                      1000
# Test docs.                     1000
# Distant docs.                  101,873
# Relations                      97
Avg. # entities per doc.         19.5
Avg. # mentions per entity       1.4
Avg. # relations per doc.        12.5
Table 2. Experimental results of F1 and Ign F1 scores (%) on the DocRED dataset. In bold are the highest results.
Model            Dev Ign_F1   Dev F1   Test Ign_F1   Test F1
Coref [30]       57.35        59.43    57.90         60.25
SSAN [31]        59.40        61.42    60.25         62.08
GAIN [20]        60.87        63.09    60.31         62.76
HAG [36]         60.85        63.06    60.78         60.82
ATLOP [32]       61.32        63.18    61.39         63.40
DocuNet [33]     62.23        64.12    62.39         64.55
EIDER [34]       62.34        64.27    62.85         64.79
SAIS [35]        62.23        65.17    63.44         65.11
AFLKD [23]       65.27        67.12    65.24         67.28
TSFGAT (ours)    67.57        69.87    68.14         69.93
Table 3. Ablation study on DocRED. Ign_F1 and F1 on test set are reported.
Model                                                  Ign_F1   F1
Full model                                             68.14    69.93
w/o two-stage strategy                                 65.32    66.80
w/o pre-coreference-resolution                         41.68    43.65
w/o enhanced heterogeneous graph attention network     66.29    67.72
w/o attention with edge-type information               67.62    69.01
w/o residual attention                                 67.13    68.76
w/o multilabel focal loss                              64.37    65.64

