Applied Sciences
  • Article
  • Open Access

6 February 2024

Joint Entity and Relation Extraction Model Based on Inner and Outer Tensor Dot Product and Single-Table Filling

1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 College of Computer Science and Technology, Changchun University, Changchun 130022, China
3 Ministry of Education Key Laboratory of Intelligent Rehabilitation and Barrier-Free Access for the Disabled, Changchun 130022, China
4 Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun 130022, China

Abstract

Joint relational triple extraction is a crucial step in constructing a knowledge graph from unstructured text. Recently, multiple methods have been proposed for extracting relationship triplets. Notably, end-to-end table-filling methods have garnered significant research interest due to their efficient extraction capabilities. However, existing approaches usually generate separate tables for each relationship, which neglects the global correlation between relationships and context, producing a large number of useless blank tables. This problem results in issues of redundant information and sample imbalance. To address these challenges, we propose a novel framework for joint entity and relation extraction based on a single-table filling method. This method incorporates all relationships as prompts within the text sequence and associates entity span information with relationship labels. This approach reduces the generation of redundant information and enhances the extraction capability for overlapping triplets. We utilize the internal and external multi-head tensor fusion approach to generate two sets of table feature vectors. These vectors are subsequently merged to capture a wider range of global information. Experimental results on the NYT and WebNLG datasets demonstrate the effectiveness of our proposed model, which maintains excellent performance, even in complex scenarios involving overlapping triplets.

1. Introduction

Extracting entity pairs and their relationships from unstructured text and constructing relationship triplets are important steps in building knowledge graphs [1,2] and question-answering systems [3,4]. Early relational triple extraction tasks were mainly carried out using a pipeline approach [5] consisting of two subtasks: entity extraction [6] and relationship classification [7]. However, such extraction methods often suffer from cascading errors and inefficiency, making it difficult to capture the dependencies between entities and relationships. Therefore, researchers have increasingly turned to joint extraction, attempting to extract triplets in an end-to-end manner to reduce cascading errors and exposure bias. For example, [8] proposed a two-stage extraction network that first extracts the subject and then extracts the corresponding relationship and object.
In recent years, end-to-end table-filling methods [9,10,11,12] have demonstrated powerful extraction capabilities, especially for complex sentences containing overlapping triplets. For instance, [10] divided joint extraction into three parts and completed the extraction in one stage, thereby avoiding exposure bias. Due to the complexity of human language, the number of entities and relationships in a sentence is uncertain, which increases the difficulty of extracting relationship triplets [13], specifically, the problem of relationship overlap [14]. The overlapping problem refers to situations where multiple triplets share common entities or relations; that is, the boundaries of entities or relations overlap, creating ambiguity in the extraction process. Cases of triplet overlap can generally be classified into three categories: SEO, EPO, and SOO, as shown in Figure 1. A sentence exhibits SEO if several of its triples share the same entity; for example, the triples (Thomas Davies, graduated, University of London) and (England, contains, University of London) share the entity "University of London". It exhibits EPO if there are multiple relationships between certain entity pairs; for example, the triples (France, contains, Paris) and (Paris, capital, France) both exist for the same entity pair "France" and "Paris". A triple is SOO if the subject and object are nested entities of each other; for example, the object "Thomas" in the triple (Thomas Davies, first name, Thomas) is a nested entity of the subject "Thomas Davies". The design of the table-filling scheme also affects the ability to extract complex entities and relationships, so designing an efficient and highly generalizable table-filling method has become a research focus.
Figure 1. Overall cases of normal, EPO, SEO, and SOO overlapping triples. Different colors represent different entities in the case sentence.
Currently, models based on table-filling methods for extracting triplets usually maintain a separate table for each relationship, where each entry in the table represents the existence of a certain relationship between token pairs [11,12]. However, this approach presents three issues: (1) It generates many useless relationship tables, and the number of positive samples in the tables with relationships is much smaller than the number of blank labels, making the model easily affected by negative samples during the learning process. (2) When extracting entity spans, it often relies on complex multi-class labels or additional tables to map the head and tail markers. The former requires a complex decoding mechanism to achieve triplet extraction, whereas the latter introduces more redundant information. (3) It fails to fully utilize the semantic information of relationships. These relationships are often mapped to an ID and cannot establish intrinsic connections with the context and relevant entities, making it difficult to capture fine-grained semantic information between relationships.
Inspired by ref. [15], in this paper, we propose a single-table-filling framework model, which no longer maintains multiple relational tables but instead uses a shared representation to express the relationship between entities. Specifically, the relationships are first extracted as words with a special meaning. These relationship words are then concatenated with the original text and transformed into word embeddings, which are input into the BERT [16] language model for uniform encoding. Unlike the table-filling approach in ref. [15], we rely on the entity span information within the relationship labels. This means that we construct continuous subject–relation pair labels and object–relation pair labels to identify the spans of subject and object entities, thereby addressing the problem of complex triple overlap. To capture more global information, BiGRU [17] is employed to extract additional hidden layer information, and a multi-head tensor dot-product operation is employed to encode table features. These encoded features are subsequently fused with the output of the self-attention mechanism within the transformer [18] to iteratively acquire the intermediate encoding of table cells. Finally, the sigmoid function is employed to obtain the probability table for the final result. To address the extreme imbalance of positive and negative samples, as well as difficult samples, we employ the focal loss [19] as a loss function to replace the traditional binary cross-entropy, as it exhibits better training performance in situations with imbalanced sample difficulty and positive-negative sample distribution.
The main contributions of this work are as follows:
(1)
We propose a novel table-filling scheme, which can extract multi-token entities with multiple relations end to end. Even when applied to a simple network architecture, it achieves high accuracy.
(2)
We propose a new framework model that combines the attention mechanism inside the transformer with the multi-head tensor dot-product results of sentence representations, enriching the feature vectors of the table and improving accuracy compared to extracting results using only the attention mechanism.
(3)
We apply the focal loss function to entity relation extraction table-filling methods. To the best of our knowledge, most current relation extraction models based on table-filling methods use cross-entropy as the loss function for training.
(4)
We evaluate the proposed model on two public datasets, NYT [20] and WebNLG [21], and select classic models from the past three years as baselines. Our model achieves the best accuracy on the NYT dataset and demonstrates higher training efficiency.

3. Methodology

In this section, we first present the definition of the joint relational triple extraction problem, followed by an introduction to our table annotation strategy and decoding algorithm. Finally, we provide a detailed description of the model architecture.

3.1. Problem Definition

Given a text sequence $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ ($i = 1, \ldots, n$) is the $i$-th word and $n$ is the length of the sequence, our objective is to extract the relational triples $T = \{(h_i, r_i, t_i)\}_{i=1}^{N}$, where $h_i$, $r_i$, and $t_i$ represent the subject, the relationship between the two entities, and the object, respectively, and $N$ is the number of triplets in the sequence. The subjects and objects are drawn from the entities $E = \{e_1, e_2, \ldots, e_k\}$ in text $X$, where $E$ is the set of all entities in the sentence and $k$ is the number of entities in the text. Each $r_i \in R$, where $R$ is the predefined set of relationships.

3.2. Table-Filling Schemes

First, we map each relationship to a single token, allowing this word to carry latent semantic information about the relationship. Then, we concatenate all relation words from the set $R$ with the sequence $X$ to form the complete text sequence $Z = \{x_1, \ldots, x_n, x_{n+1}, \ldots, x_{n+m}\}$. The length of $Z$ is $(n+m)$, where $m$ is the size of the set $R$.
Based on the input $Z$, we construct a table of size $(n+m)^2$, as shown in Figure 2. The cell indexed by the $i$-th row and $j$-th column has a label $t$ representing the token pair $(x_i, x_j) \in Z \times Z$. If $t \neq 0$, it indicates the existence of a relation for that token pair. The table is divided into three parts, with the blue section representing entity–entity extraction. If the label $t$ of $(x_i, x_j)$ in this section is non-zero, there is a predefined relationship between the entities with $x_i$ and $x_j$ as their tail tokens (or between single-token entities). For example, the label for (Davies, England) is 1, which means that there is a relationship between the entity ending with "Davies" and the entity ending with "England".
Figure 2. Example of the scheme. The bold part represents the relation sequence. The purple arrow represents searching the entire entity from back to front. The green area represents the entity of the object, the orange area represents the entity of the subject, and the blue area represents the aligned tokens at the tails of the subject and object. The lower triangle is symmetrical to the upper triangle. On the right are three triplets obtained based on this table.
The green section represents the extraction of subject relations. If the label $t$ of $(x_i, x_j)$ in this section is 1, it indicates an association between the word $x_i$ and the relation $r_j$. Note that in this case, $x_j$ ($j > n$) is no longer a word from $X$ but a relation word in $Z$. Vertically consecutive labels of 1 in this section signify that the corresponding consecutive tokens form a subject entity. For example, the label for (Davies, born) is 1, and the label for the previous token pair (Thomas, born) is also 1, which indicates that "Thomas Davies" is a subject and "born" is the corresponding relation.
The orange section represents the extraction of object relations and follows a similar principle. Horizontally consecutive labels of 1 in this section indicate that the entity formed by the continuous tokens in the corresponding columns is an object associated with the relation in that row. For example, if the label for (born, England) is 1, then "England" is an object, and the corresponding relation is "born". Likewise, the rows for "graduate" and "contains" intersect the column tokens of "University of London" in six labeled cells, indicating the association between "University of London" and the "graduate" and "contains" relations.
For the extraction of triplets, we integrate the results of these three components. First, we save the token pairs extracted from the entity–entity section into the set $\varepsilon$. The subject–relation extraction results are saved in the dictionary $D_s$, and the object–relation extraction results are saved in the dictionary $D_o$. A dictionary stores (key: value) mappings; here, the keys are the positions of entity tail tokens, and the values are the relations corresponding to those tokens. Next, we iterate through the set $\varepsilon$ to query the two dictionaries, mapping each tail position to the entity ending at that position. If a mapped subject and object ($h_i$ and $t_i$) share the same relation ($r_i$), they form a triplet. Finally, based on the tail positions, we traverse the dictionaries $D_s$ and $D_o$ from back to front to recover all tokens of each entity, complete the entity, and form the final relation triplet. The specific decoding process is shown in Algorithm 1.
Finally, the three parts of the label table are integrated, and after our decoding algorithm, we obtain the triplets (Thomas Davies, born, England), (Thomas Davies, graduate, University of London), and (England, contains, University of London). Our filling strategy has three main advantages:
(1)
All relations are unified in one table, eliminating a large number of redundant samples. Table-filling strategies such as GRTE [12] and OneRel [11] need to fill $n^2 \times m$ cells, whereas our strategy only requires $(m+n)^2$ cells.
(2)
It can solve the complex overlapping problems of EPO, SEO, and SOO.
(3)
The labels for each part of the relationship not only provide the relative position information of the corresponding entity (i.e., whether the entity is the subject or object) but also provide the span information of the entity (i.e., the specific position of the entity in the sentence). This allows our model to achieve a one-module one-step extraction process, further avoiding exposure bias.
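To make the scheme concrete, here is a minimal, self-contained sketch (our illustration, not the authors' released code) that fills the single $(n+m)^2$ table for one gold triple of the running example. It assumes word-level tokens and ignores the [CLS]/[SEP] offsets that a real BERT tokenizer would introduce:

```python
# Minimal sketch of the single-table labeling scheme (illustrative assumptions:
# word-level tokens, no [CLS]/[SEP] offsets).
import numpy as np

sentence = ["Thomas", "Davies", "was", "born", "in", "England"]
relations = ["born", "contains", "graduate"]  # toy relation set R (m = 3)
n, m = len(sentence), len(relations)
Z = sentence + relations                      # concatenated sequence Z of length n + m
table = np.zeros((n + m, n + m), dtype=int)

def fill_triple(sub_span, rel_idx, obj_span):
    """Mark one gold triple; spans are (start, end) token indices, inclusive."""
    # Entity-entity part (blue): align the tail tokens of subject and object.
    table[sub_span[1], obj_span[1]] = 1
    # Subject-relation part (green): vertically consecutive 1s in the relation
    # column mark the subject span.
    for i in range(sub_span[0], sub_span[1] + 1):
        table[i, n + rel_idx] = 1
    # Object-relation part (orange): horizontally consecutive 1s in the relation
    # row mark the object span.
    for j in range(obj_span[0], obj_span[1] + 1):
        table[n + rel_idx, j] = 1

fill_triple((0, 1), relations.index("born"), (5, 5))  # (Thomas Davies, born, England)
print(table)
```

Reading the filled table back recovers the triple: cell (1, 5) links the tails "Davies" and "England", while the "born" column and row delimit the two entity spans.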
Algorithm 1 Table decoding strategy
Input: table probability matrix $L \in \mathbb{R}^{(n+m) \times (n+m)}$, threshold $t$
Output: triplets $T$ extracted from sentence $Z$
1: $D_s = dict()$, $D_o = dict()$, $\varepsilon = set()$ // store subject–relation pairs, object–relation pairs, and entity pairs, respectively; $dict()$ creates a new dictionary, and $set()$ creates a new set
2: $L_{row}, L_{col} = torch.where(L \geq t)$ // extract the positions of all cells in $L$ whose values reach the threshold $t$; row positions are saved in $L_{row}$ and column positions in $L_{col}$
3: for each $(row, col) \in (L_{row}, L_{col})$ do
4:  if $row \neq 0$ and $col \neq 0$ and $row \leq n$ then
5:   if $col < n + 1$ then
6:    add $(row, col)$ to $\varepsilon$
7:   end if
8:  else if $col \geq n + 1$ then
9:   $col = col - n - 2$ // map the table column index to a relation index
10:   if $row \notin D_s$ then
11:    $D_s[row] = [\,]$
12:   end if
13:   add $col$ to $D_s[row]$
14:  end if
15: end for
16: Replace the dictionary $D_s$ with the dictionary $D_o$, transpose the input probability matrix, and repeat steps 3–15 to extract the object–relation part into $D_o$
17: $D_r = set()$ // a set storing the relations in a sentence
18: for each $(sub\_t, obj\_t) \in \varepsilon$ do
19:  if $sub\_t \in D_s$ and $obj\_t \in D_o$ then
20:   $D_r = D_s[sub\_t] \cap D_o[obj\_t]$
21:   for $rel \in D_r$ do
22:    $sub\_h = sub\_t$ // variable used to find the beginning of the subject
23:    $obj\_h = obj\_t$ // variable used to find the beginning of the object
24:    while $sub\_h - 1 \in D_s$ and $rel \in D_s[sub\_h - 1]$ do
25:     $sub\_h = sub\_h - 1$
26:    end while
27:    while $obj\_h - 1 \in D_o$ and $rel \in D_o[obj\_h - 1]$ do
28:     $obj\_h = obj\_h - 1$
29:    end while
30:    $T \leftarrow (sub\_h, sub\_t, rel, obj\_h, obj\_t)$
31:   end for
32:  end if
33: end for
34: return $T$
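For illustration, the following is a compact Python rendering of Algorithm 1 under the same simplifying assumptions as the sketch in Section 3.2 (word-level indices, no [CLS]/[SEP] offsets, so the index arithmetic of step 9 reduces to subtracting $n$); it is a sketch of the decoding logic, not the authors' implementation:

```python
# Sketch of Algorithm 1: decode triples from an (n+m) x (n+m) probability table.
import numpy as np

def decode(table, n, threshold=0.5):
    """Return triples (sub_head, sub_tail, rel, obj_head, obj_tail);
    n = sentence length, rows/cols >= n index the relation words."""
    rows, cols = np.where(table >= threshold)
    subj_rel, obj_rel, pairs = {}, {}, set()
    for r, c in zip(rows, cols):
        if r < n and c < n:                          # entity-entity part: tail pair
            pairs.add((r, c))
        elif r < n and c >= n:                       # subject-relation part
            subj_rel.setdefault(r, set()).add(c - n)
        elif r >= n and c < n:                       # object-relation part (handled
            obj_rel.setdefault(c, set()).add(r - n)  # directly, without transposing)
    triples = set()
    for sub_t, obj_t in pairs:
        shared = subj_rel.get(sub_t, set()) & obj_rel.get(obj_t, set())
        for rel in shared:
            sub_h, obj_h = sub_t, obj_t
            # Walk back from each tail while the preceding token carries the same
            # relation, recovering the full entity span.
            while sub_h - 1 in subj_rel and rel in subj_rel[sub_h - 1]:
                sub_h -= 1
            while obj_h - 1 in obj_rel and rel in obj_rel[obj_h - 1]:
                obj_h -= 1
            triples.add((sub_h, sub_t, rel, obj_h, obj_t))
    return triples
```

Running this on the toy table built in the Section 3.2 sketch yields {(0, 1, 0, 5, 5)}, i.e., (Thomas Davies, born, England).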

3.3. The Model Framework

The overall architecture of the model in this paper is shown in Figure 3. It mainly consists of three parts: the input layer, feature extraction layer, and table-generation layer.
Figure 3. The overall architecture of the model, which is divided into three parts: the input layer, feature extraction layer, and table-generation layer.
Input Layer. We use a pre-trained BERT-base model as the sentence encoder. The concatenated sentence $Z$ is input to obtain the token-level sentence representation $H \in \mathbb{R}^{(n+m) \times d_n}$:

$H = [t_1, t_2, \ldots, t_n, t_{n+1}, \ldots, t_{n+m}] = \mathrm{BERT}([x_1, x_2, \ldots, x_n, r_1, \ldots, r_m])$ (1)

where $x_i$ is a word in the text sequence $Z$; each token is first converted into an input embedding and then encoded by BERT to produce a contextual word embedding. $t_i$ denotes the encoded word embedding, and $d_n$ denotes its dimension (768 for the base model). We also extract the scores of the self-attention mechanism in the BERT encoder, which originate from the multi-head self-attention computation in each of the 12 transformer encoder layers; the output of each layer is based on the output of the previous layer. The specific formula [18] is as follows:

$\mathrm{Attention}(Q, K, V)_i = \mathrm{softmax}(S_i)\,V = \mathrm{softmax}\left(\frac{(H_{i-1}W_Q)(H_{i-1}W_K)^T}{\sqrt{d_k}}\right)H_{i-1}W_V$ (2)

where $S_i$ denotes the scores of the multi-head self-attention mechanism in the $i$-th layer of the BERT encoder; $W_Q$, $W_K$, and $W_V$ are the learnable mapping matrices for the query matrix $Q$, key matrix $K$, and value matrix $V$, respectively; $H_{i-1}$ is the hidden output of the previous encoder layer; and $d_k$ is the embedding dimension in the attention computation. Here, we take the attention scores of the final layer, denoted as $S_{12} \in \mathbb{R}^{(n+m) \times (n+m) \times d_h}$, where $d_h$ is the number of heads. Note that these attention scores are the unnormalized results, i.e., they have not undergone softmax normalization.
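As a practical note, the HuggingFace transformers implementation of BERT only exposes the softmax-normalized attention probabilities (via output_attentions=True), so unnormalized scores such as $S_{12}$ must be recomputed from the last layer's query/key projections, e.g., with a forward hook. A hedged sketch (module paths follow the current HuggingFace BERT implementation and may differ across versions):

```python
# Sketch: recover unnormalized last-layer attention scores from BERT.
import math
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased").eval()

captured = {}
def hook(module, inputs, output):
    captured["hidden"] = inputs[0]            # hidden states entering the layer

self_attn = model.encoder.layer[-1].attention.self   # 12th-layer BertSelfAttention
handle = self_attn.register_forward_hook(hook)

inputs = tokenizer("Thomas Davies was born in England.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

h = captured["hidden"]                                # (1, seq_len, 768)
heads, d_k = self_attn.num_attention_heads, self_attn.attention_head_size

def split_heads(x):                                   # (1, L, 768) -> (1, heads, L, d_k)
    b, L, _ = x.shape
    return x.view(b, L, heads, d_k).transpose(1, 2)

q, k = split_heads(self_attn.query(h)), split_heads(self_attn.key(h))
S12 = q @ k.transpose(-1, -2) / math.sqrt(d_k)        # pre-softmax scores (1, heads, L, L)
```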
Feature Extraction Layer. We use the representation $H$ from BERT's final layer as input to a BiGRU [17], which captures additional semantic information, enhances the token-level representations, and yields the vector representation $Y$:

$Y = \mathrm{BiGRU}(H)$ (3)

After the BiGRU update, we apply layer normalization, a tanh activation, and dropout to $Y$, yielding the input $U$ for the multi-head tensor dot-product operation:

$U = \tanh(\mathrm{LayerNorm}(Y))$ (4)
Through our experiments, we found that the multi-head tensor dot product and the multi-head self-attention mechanism exhibit similar performance when extracting table features in our model, but the computational cost of the multi-head tensor dot product is lower: its computation is similar to self-attention but omits the value matrix and the multiplication of feature vectors after the linear transformation. First, we map $U$ to the $Q$ and $K$ vectors through two linear transformations. Then, we multiply $Q$ by the transpose of $K$. Finally, we divide the result by $\sqrt{d_k}$ to obtain the table feature $P \in \mathbb{R}^{(n+m)^2 \times d_h}$. The specific formulas are as follows:

$Q_i = U_i W_q + b_q$ (5)

$K_i = U_i W_k + b_k$ (6)

$P_i = \frac{Q_i K_i^T}{\sqrt{d_k}}$ (7)

$\mu_i(x, y) = \frac{q_{i,x}\, k_{i,y}^T}{\sqrt{d_k}}$ (8)

Formulas (5) and (6) represent the linear transformation operations, where $W_q$ and $W_k$ are learnable parameter matrices, $b_q$ and $b_k$ are biases, and $Q, K \in \mathbb{R}^{(n+m) \times d_h \times d_k}$ are the query and key matrices, respectively. Here, $i \in \{1, \ldots, d_h\}$ denotes the head index of the mapping. Formula (7) gives the overall result, whereas Formula (8) gives the specific value $\mu_i(x, y)$ of each table cell, where $x$ and $y$ denote the row and column of the table, and $q_{i,x}$ and $k_{i,y}$ are the query and key vectors at the corresponding positions.
Table-Generation Layer. We add the result of Formula (7) to the $S_i$ of Formula (2), average over all heads, and compute the final probability matrix $L \in \mathbb{R}^{(n+m)^2}$ using the sigmoid function:

$L = \mathrm{sigmoid}\left(\frac{1}{\tau}\sum_{i}^{\tau}\left(P_i + S_i\right)\right)$ (9)

where $\tau$ is the number of heads. In contrast to our approach, the UniRel [15] model only recognizes discrete single-token labels, so the attention mechanism within BERT suffices for it to capture the necessary information; our model must capture continuous labels, which requires more global information and entity features. Compared to the GRTE [12] model, ours only requires binary classification to extract all entities, effectively reducing model complexity and improving the precision of triple extraction.
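Formulas (3)–(9) can be condensed into a small PyTorch module; the following sketch uses our own shapes and names (the released architecture may differ in detail):

```python
import math
import torch
import torch.nn as nn

class TableGenerator(nn.Module):
    """Sketch of the feature-extraction and table-generation layers (Eqs. (3)-(9))."""
    def __init__(self, d_model=768, num_heads=12, d_k=64, dropout=0.1):
        super().__init__()
        self.bigru = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)
        self.w_q = nn.Linear(d_model, num_heads * d_k)  # Eq. (5)
        self.w_k = nn.Linear(d_model, num_heads * d_k)  # Eq. (6)
        self.num_heads, self.d_k = num_heads, d_k

    def forward(self, H, S):
        # H: (batch, n+m, d_model) BERT output; S: (batch, heads, n+m, n+m) raw scores
        Y, _ = self.bigru(H)                               # Eq. (3)
        U = self.drop(torch.tanh(self.norm(Y)))            # Eq. (4)
        b, L, _ = U.shape
        Q = self.w_q(U).view(b, L, self.num_heads, self.d_k).transpose(1, 2)
        K = self.w_k(U).view(b, L, self.num_heads, self.d_k).transpose(1, 2)
        P = Q @ K.transpose(-1, -2) / math.sqrt(self.d_k)  # Eq. (7), per head
        return torch.sigmoid((P + S).mean(dim=1))          # Eq. (9): (batch, n+m, n+m)
```

Note that, unlike full self-attention (Eq. (2)), no value projection or output multiplication is performed, which is where the computational saving of the tensor dot product comes from.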

3.4. Model Optimization

We adopt a joint training approach and apply dropout and layer normalization to the BiGRU layer and the multi-head tensor dot-product layer to enhance generalization and prevent overfitting. Because the tables used in our method are larger than those of conventional table-filling methods, the number of negative samples within a single table far exceeds the number of positive samples. Additionally, the majority of samples are easy to classify (predicted probabilities close to 0 or 1), with difficult samples constituting only a small portion (predicted probabilities around 0.5). Traditional binary cross-entropy loss treats all these samples equally. We therefore train with the focal loss function, which performs better under imbalanced sample categories and an uneven distribution of easy and difficult samples. The cross-entropy loss is shown in Equation (10), and the focal loss in Equation (11):

$c_b = -\frac{1}{(n+m)^2}\sum_{i}^{n+m}\sum_{j}^{n+m}\left[\hat{p}_{i,j}\log p_{i,j} + (1-\hat{p}_{i,j})\log(1-p_{i,j})\right]$ (10)

$c_f = -\frac{1}{(n+m)^2}\sum_{i}^{n+m}\sum_{j}^{n+m}\left[\hat{p}_{i,j}\,\alpha(1-p_{i,j})^{\gamma}\log p_{i,j} + (1-\hat{p}_{i,j})(1-\alpha)\,p_{i,j}^{\gamma}\log(1-p_{i,j})\right]$ (11)

where $\alpha$ is the weight balancing positive and negative samples, $\gamma$ is the weight emphasizing hard examples, $p_{i,j}$ is the predicted value of the token-level cell $(i, j)$, and $\hat{p}_{i,j}$ is its ground-truth label.
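A minimal PyTorch sketch of Equation (11) as defined above (the exact implementation in the released code is not shown here):

```python
import torch

def table_focal_loss(pred, gold, alpha=0.5, gamma=2.0, eps=1e-8):
    """Focal loss over the probability table (Eq. (11)).
    pred: sigmoid outputs in [0, 1]; gold: binary ground-truth labels p_hat."""
    pos = gold * alpha * (1 - pred).pow(gamma) * torch.log(pred + eps)
    neg = (1 - gold) * (1 - alpha) * pred.pow(gamma) * torch.log(1 - pred + eps)
    return -(pos + neg).mean()
```

Setting $\gamma = 0$ and $\alpha = 0.5$ recovers, up to a constant factor, the binary cross-entropy of Equation (10).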

4. Experiments

This section begins by introducing the two datasets employed in this research. Subsequently, it elaborates on the experimental setup and the specific hyperparameter settings. Next, a comparative analysis is performed, pitting the proposed table-filling method and model against existing approaches to showcase their superior effectiveness. Finally, the impact of individual model components on the results is thoroughly investigated.

4.1. Datasets

We evaluate our method on two benchmark datasets, NYT [20] and WebNLG [21], both widely used in the study of joint relational triple extraction. The NYT dataset consists of over 60,000 sentences with 24 relations; its training, test, and validation sets contain 56,195, 5000, and 500 sentences, respectively. It was generated by distant supervision over New York Times articles. The WebNLG dataset is sourced from Wikipedia articles and comprises over 6000 sentences with 171/216 relations (depending on the version); 5019 sentences are used for training, 500 for validation, and 703 for testing. Many sentences in these datasets contain overlapping relationships or multiple triples, which enables us to evaluate the performance of our model on overlapping and multiple-triple problems. Each dataset has two versions: NYT* and WebNLG* only require matching the last word of each entity, whereas NYT and WebNLG require matching the entire entity span. Specific statistics are shown in Table 2.
Table 2. Statistics of the datasets.
Following a previous work [11], we divide the test set into three categories—SEO, EPO, and SOO—based on the overlap type. Additionally, we divide it into one or more triples based on the number of triples. Note that a sentence may have multiple overlaps. The details are shown in Table 3.
Table 3. Statistics on the number of triples in the test set, where N represents the number of triples in a sentence.

4.2. Parameter Settings

Our model is implemented in PyTorch and runs on an NVIDIA GeForce RTX 3090 24 GB GPU. For fairness, we use the BERT-base-cased model as the pre-trained language model. We employ the cosine annealing algorithm as the learning-rate schedule, with initial learning rates of $3 \times 10^{-5}$ and $5 \times 10^{-5}$ for NYT and WebNLG, respectively, and AdamW for parameter optimization. The batch sizes are 32 and 16, and the thresholds are 0.53 and 0.48 on NYT and WebNLG, respectively. The model is trained for 100 epochs with a dropout rate of 0.1. For the focal loss weights, $\alpha$ is set to 0.5 for the NYT dataset and 0.6/0.75 for WebNLG/WebNLG*, whereas $\gamma$ is set to 2 for all datasets. Following previous works [8,10,11,12,25], we set the maximum sentence length for testing to 100.
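The optimization setup described above corresponds to standard PyTorch components; a sketch for NYT follows, in which model and train_loader are assumed to exist and are not defined here:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# NYT settings from this section; WebNLG uses lr=5e-5 and batch size 16.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=100)   # cosine annealing over 100 epochs

for epoch in range(100):
    for batch in train_loader:                        # assumed DataLoader, batch size 32
        optimizer.zero_grad()
        probs = model(batch)                          # (batch, n+m, n+m) probability tables
        loss = table_focal_loss(probs, batch["labels"], alpha=0.5, gamma=2.0)
        loss.backward()
        optimizer.step()
    scheduler.step()
```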

4.3. Main Results

Table 4 summarizes the results of our model compared to other baseline methods on the two datasets. We evaluated the models using precision, recall, and $F_1$ score, counting an extracted triple as correct only when its subject, object, and relationship all matched the gold-standard annotations. We selected several classic and efficient models from the past three years for comparison:
Table 4. Performance comparison of different methods on the NYT and WebNLG datasets. The best is bold, and the next best is underlined.
  • CasRel [8] uses a cascaded framework for sequential extraction, which is relatively slow. Our model can extract all triplets in the text at once.
  • TPLinker [10] uses a one-stage token-linking approach for joint relation triplet extraction but requires multiple supporting modules. Our model only needs one module to extract complete triplets.
  • PRGC [25] divides joint relation triplet extraction into three subtasks, which can lead to exposure bias. In contrast, our model treats it as a holistic table-filling task, avoiding exposure bias.
  • EmRel [30] represents relations as embedding vectors but still requires multiple components. It refines the representation of entities and relations through an attention-based fusion module.
  • GRTE [12] is the state-of-the-art method for the NYT dataset, but it requires multiple class labels to determine a complete entity pair. Our method only requires binary classification.
  • OneRel [11] is the state-of-the-art method for the WebNLG dataset, achieving improved triplet recognition efficiency through a one-stage single-model approach. However, it still requires triplet extraction in a three-dimensional table, whereas ours only requires extraction in a two-dimensional table, reducing memory usage.
All the experimental results of the baselines were sourced directly from the original literature, and the performance of each model was compared using precision (Prec.), recall (Rec.), and $F_1$ score, calculated as follows:

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (12)

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (13)

$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (14)

where $TP$ denotes instances correctly predicted as positive, $FP$ denotes negative instances incorrectly predicted as positive, and $FN$ denotes positive instances incorrectly predicted as negative.
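Concretely, triple-level scores are computed by set comparison between predicted and gold triples, e.g.:

```python
def triple_prf(pred, gold):
    """Micro precision/recall/F1 over sets of (subject, relation, object) triples."""
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

pred = {("Thomas Davies", "born", "England")}
gold = {("Thomas Davies", "born", "England"),
        ("England", "contains", "University of London")}
print(triple_prf(pred, gold))   # (1.0, 0.5, 0.667)
```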
The experimental results show that our model achieved the highest $F_1$ score and precision on the NYT and NYT* datasets, outperforming the second-best model, GRTE, by 0.3% in $F_1$ score and by 0.8% and 0.6% in precision, respectively; recall, however, was slightly lower than GRTE's. This indicates that our model performs best when the number of relations is relatively small. On the WebNLG* dataset, our model achieved an $F_1$ score 0.2% lower than the previous best model, OneRel, but 0.6% higher precision, remaining competitive. We attribute this mainly to the fact that the model's prediction accuracy comes primarily from the attention mechanisms internal and external to the BERT encoder. The WebNLG* dataset contains up to 171 relations, and when these are combined with the original input text, the total sequence length exceeds 300; neither the attention calculation nor the multi-head tensor dot product excels at handling such long texts. Moreover, larger tables introduce more negative samples, and although we used focal loss to mitigate this, the problem remained. On the WebNLG dataset, although our model did not outperform the OneRel and GRTE models, it still exhibited good triple extraction ability. We see two possible reasons for this: (1) As with WebNLG*, the increase in the number of relations led to overly long synthetic text sequences, where the attention mechanism could not exert its advantage. (2) The relationship prompts consist of individual tokens, and when there are too many relations, similar prompts cause confusion and misjudgment.
The above results demonstrate that our model extracts triplets with satisfactory accuracy when the number of predefined relations in the dataset is relatively small; however, as the number of relations increases, its effectiveness declines.

4.4. Detailed Results on Complex Scenarios

In this section, we evaluate the extraction capability of our model on sentences with overlapping triplets and on single sentences containing multiple triplets. For comparison with previous models, we conducted experiments on the NYT* and WebNLG* subsets; the results are shown in Table 5 and Table 6. Our model achieved the best $F_1$ scores on 11 of the 20 subsets and the second-best scores on 3 subsets. In particular, on the NYT* dataset, which has a smaller number of relations, our model achieved the best scores in all overlapping situations. On the WebNLG* dataset, our model still obtained four best results, indicating good performance even on datasets with a large number of relations. However, as the number of predefined relations increases, the model's ability to distinguish semantically similar relations decreases, making sentences with multiple triplets more challenging. Overall, our model handles complex sentence structures well; although its recognition ability is affected as the number of relations grows, it can still extract triplets effectively in complex scenarios compared to other baseline models.
Table 5. F 1 values of different models for sentences with different numbers of triples. The best is bold, and the next best is underlined.
Table 6. F 1 values of different models for sentences with different overlapping patterns. The best is bold, and the next best is underlined.

4.5. Detailed Results on Different Subtasks

We further explored the results of our model on different subtasks, dividing the extraction of relation triples into two subtasks: entity pair recognition and relation classification. In this context, h represents the head, t represents the tail, and r represents the relation. Only when both the head and tail entities in ( h , r , t ) are correct can it be counted as a correct prediction. The specific results are shown in Table 7.
Table 7. Experimental results on subtasks, where ( h , t ) represents an entity pair, r represents a relation contained in a sentence, and ( h , r , t ) represents a complete relation triple. The best is bold.
Most of our model's test results on NYT were better than the baselines. Interestingly, our method often achieved higher precision at the expense of recall; in relation extraction, our model's precision was 1.3% higher than its recall. We analyze the specific reasons for this in Section 4.8. For the two subtasks on WebNLG, our model obtained the best precision scores but comparatively lower recall and $F_1$ scores. This is the main bottleneck of our model.

4.6. Efficiency of the Model

We evaluated the model's number of parameters, training time, memory required for training, and inference time on the NYT dataset, as shown in Table 8. Our model has slightly more parameters than TPLinker and OneRel, mainly due to the BiGRU module. In terms of training time, our model trains 1.5 times faster than TPLinker and 1.6 times faster than OneRel. A likely reason is that our model is trained as a single module, allowing batch processing of samples and fast tensor dot-product computation. Although OneRel is also trained as a single module, each of its linear layers requires a large number of parameters ($n^2 \times d_n \times 3$, where $d_n$ is the embedding dimension of the encoder and $n$ is the sentence length), resulting in more time consumption. TPLinker trains its head, tail, and relation components as three separate models, which takes relatively longer. Regarding training memory, our tensor space is limited to a two-dimensional plane and thus requires less memory. In terms of inference speed, TPLinker requires the most time, mainly because it must iterate over all token pairs and use token linking to determine head and tail entities, giving it the highest computational complexity, whereas our model and OneRel only iterate over tokens with specific markers. Our inference time is higher than OneRel's, mainly because complete triplet decoding in OneRel usually needs only three labels, whereas our labels are denser and require more iterations.
Table 8. The efficiency comparison of the model on the NYT dataset. Params represents the number of parameters of the encoder in the model. Training time refers to the time required to train an epoch, inference indicates the time required to infer a sentence, and memory indicates the video memory occupied during training. The batch size is set to 8. The best is bold.

4.7. Ablation Study

In this section, we conduct ablation experiments on the NYT dataset to demonstrate the effectiveness of various components in the proposed method. The specific results are shown in Table 9.
Table 9. Ablation experiments on the NYT dataset, where w/o indicates removing the module, ours refers to the proposed complete model, and BiLSTM indicates replacing the BiGRU module with a BiLSTM module. The best is bold.
We conducted five sets of experiments, with the first being our proposed complete model. In the second group, we removed the multi-head tensor dot-product operation and used only the scores from the multi-head self-attention as the final result; compared to the first group, the $F_1$ score decreased by 0.2%. Although the simpler structure improved efficiency, it could not fully utilize the semantic information in the language model, resulting in fewer identifiable triples, which also explains the decrease in precision and the increase in recall; adding extra sentence features is therefore beneficial to this model. In the third group, we removed the multi-head self-attention module from the encoder, resulting in a 0.3% decrease in the $F_1$ score, which indicates that the internal attention mechanism of the BERT encoder plays an important role in the filling framework. In the fourth group, we replaced the focal loss with the traditional binary cross-entropy loss, resulting in a 0.4% decrease in the $F_1$ score. In Figure 4, we can observe that the models converge after around 10,000 steps with either loss, but training with cross-entropy (green line) converges significantly more slowly. This is because focal loss assigns more weight to hard samples, making the model pay more attention to them, which is very helpful for this table-filling method. In the fifth group, we replaced the BiGRU [17] module with a BiLSTM module and found that BiGRU combines better with the attention mechanism, yielding better performance.
Figure 4. The model training process of different ablation experiments. The x-axis represents the number of training iteration steps in thousands, with a batch size of 32.
Figure 4 shows the convergence process of the ablation models on the NYT dataset. "ATT" denotes using only the internal attention calculation within the encoder, and "tensor" denotes using only the external tensor dot-product operation. All models converged after approximately 10,000 steps, but the full framework converged fastest, followed by the BiLSTM variant, then the internal-attention-only variant, then the external-tensor-only variant; training with cross-entropy converged slowest.

4.8. Case Study

In this section, we selected three sentences from the NYT dataset to analyze our model: the first contains normal triplets, the second contains EPO overlapping triplets, and the third contains SEO overlapping triplets. We compared the recognition results of the different ablation models, as shown in Table 10. For the first sentence, we compared the two loss functions and found that both models recognized incorrect entities due to insufficient information in the sentence. However, the model trained with cross-entropy recognized more incorrect relationships, possibly because of the semantic similarity between "lived" and "birth", which are difficult to distinguish, resulting in lower precision; during training, focal loss assigns more weight to such cases.
Table 10. Examples of normal, EPO, and SEO in the NYT dataset. Orange represents misrecognized relations, red represents misrecognized entities, and blue represents correctly recognized triples.
For the second sentence, we compared the model lacking internal encoder attention with the proposed complete model; the deficient model failed to recognize all triplets, possibly due to insufficient global information. For the third sentence, we compared the model without the auxiliary tensor dot-product operation with the complete model; the deficient model incorrectly identified "Glendale" as "Glenndale", perhaps because the overly simplistic architecture failed to fully capture the inherent correlation between entities and relationships in complex scenes. In conclusion, our model is relatively rigorous, predicting fewer triplets but with higher accuracy, which explains its high precision and low recall.

4.9. Parameter Analysis

We further explored the influence of hyperparameters on the model. We set the candidate values $\alpha \in \{0.5, 0.55, 0.6, 0.65\}$, with a fixed decay rate of 0.2 and a threshold of 0.5. The results are shown in Figure 5 and Figure 6. On the NYT dataset, precision and recall were stable for $\alpha$ values of 0.55 and 0.65, whereas for $\alpha$ values of 0.5 and 0.6, precision decreased and recall increased. On the WebNLG* dataset, sample imbalance is more pronounced, so the optimal $\alpha$ value is higher than on the NYT dataset.
Figure 5. Comparison of different values of the α value of the F 1 score on the NYT dataset.
Figure 6. Comparison of different values of the α value of the F 1 score on the WebNLG* dataset.

5. Conclusions

In this paper, we propose a new method for joint relational triple extraction using table filling. By transforming relations into prompt information and making entity spans depend on this information, complex triple extraction can be accomplished with only one table, effectively reducing redundant information and memory consumption during training. On this basis, we fuse the internal attention calculation of the BERT language model with a sentence-feature-based multi-head tensor dot product to extract richer table features, and we introduce a simple and effective loss function to address the sample imbalance problem in tables. Experimental results on two benchmark datasets demonstrate that our model is competitive in both efficiency and accuracy. In the future, we plan to explore solutions to the limitations of attention mechanisms on long text sequences. We also plan to introduce other neural architectures, such as graph neural networks [31], through ensemble learning [32,33] to improve extraction in long texts and multi-relational scenarios. Additionally, relation-first [34] approaches could be explored to shorten the relation sequence and enhance our model. Finally, we believe the table-filling method can be applied not only to joint triple extraction but can also be extended to other tasks in the future [35].

Author Contributions

Conceptualization, P.F. and D.O.; methodology, P.F. and L.Y.; software, R.W. and B.Z.; validation, P.F., D.O. and L.Y.; writing—original draft preparation, L.Y.; writing—review and editing, P.F. and L.Y.; visualization, P.F. and L.Y.; funding acquisition, P.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Development Plan Project of the Jilin Provincial Science and Technology Department (Key Technology Research on Risk Prediction and Assessment of Old Chronic Diseases Based on Medical Knowledge Graphs (2023JB405L07)).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. These data can be found here: https://drive.google.com/file/d/1RxBVMSTgBxhGyhaPEWPdtdX1aOmrUPBZ/view (accessed on 27 December 2023).

Acknowledgments

We would like to express our deepest gratitude to all those who have contributed to the completion of this research and the writing of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SEO  Single-Entity Overlap
EPO  Entity-Pair Overlap
SOO  Subject-Object Overlap
RNN  Recurrent Neural Network
BERT  Bidirectional Encoder Representations from Transformers
NLP  Natural Language Processing
BiGRU  Bidirectional Gated Recurrent Unit
NYT  New York Times
WebNLG  Web Generation from Natural Language Data
NER  Named Entity Recognition
GRTE  Global Feature-Oriented Relational Triple Extraction
UniRel  Unified Representation and Interaction for Joint Relational Triple Extraction
TPLinker  Single-Stage Joint Extraction of Entities and Relations Through Token Pair Linking
OneRel  Joint Entity and Relation Extraction with One Module in One Step
CasRel  Cascade Binary Tagging Framework for Relational Triple Extraction
BiLSTM  Bidirectional Long Short-Term Memory

References

  1. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Philip, S.Y. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 494–514. [Google Scholar] [CrossRef] [PubMed]
  2. Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
  3. Chen, Z.Y.; Chang, C.H.; Chen, Y.P.; Nayak, J.; Ku, L.W. UHop: An Unrestricted-Hop Relation Extraction Framework for Knowledge-Based Question Answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 3–5 June 2019; pp. 345–356. [Google Scholar]
  4. Bian, N.; Han, X.; Chen, B.; Sun, L. Benchmarking knowledge-enhanced commonsense question answering via knowledge-to-text transformation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 12574–12582. [Google Scholar]
  5. Chan, Y.S.; Roth, D. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 551–560. [Google Scholar]
  6. Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; Li, F. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 10965–10973. [Google Scholar]
  7. Guo, Z.; Zhang, Y.; Lu, W. Attention Guided Graph Convolutional Networks for Relation Extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 241–251. [Google Scholar]
  8. Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 1476–1488. [Google Scholar]
  9. Gupta, P.; Schütze, H.; Andrassy, B. Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2537–2547. [Google Scholar]
  10. Wang, Y.; Yu, B.; Zhang, Y.; Liu, T.; Zhu, H.; Sun, L. TPLinker: Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 1572–1582. [Google Scholar]
  11. Shang, Y.M.; Huang, H.; Mao, X. Onerel: Joint entity and relation extraction with one module in one step. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 11285–11293. [Google Scholar]
  12. Ren, F.; Zhang, L.; Yin, S.; Zhao, X.; Liu, S.; Li, B.; Liu, Y. A Novel Global Feature-Oriented Relational Triple Extraction Model based on Table Filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2646–2656. [Google Scholar]
  13. Wang, Z.; Nie, H.; Zheng, W.; Wang, Y.; Li, X. A novel tensor learning model for joint relational triplet extraction. IEEE Trans. Cybern. 2023. [Google Scholar] [CrossRef] [PubMed]
  14. Zeng, X.; Zeng, D.; He, S.; Liu, K.; Zhao, J. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 506–514. [Google Scholar]
  15. Tang, W.; Xu, B.; Zhao, Y.; Mao, Z.; Liu, Y.; Liao, Y.; Xie, H. UniRel: Unified Representation and Interaction for Joint Relational Triple Extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7087–7099. [Google Scholar]
  16. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186. [Google Scholar]
  17. Feng, P.; Zhang, X.; Zhao, J.; Wang, Y.; Huang, B. Relation Extraction Based on Prompt Information and Feature Reuse. Data Intell. 2023, 5, 824–840. [Google Scholar] [CrossRef]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  19. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  20. Riedel, S.; Yao, L.; McCallum, A. Modeling relations and their mentions without labeled text. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference—ECML PKDD 2010, Barcelona, Spain, 20–24 September 2010; Proceedings, Part III 21. Springer: Berlin/Heidelberg, Germany, 2010; pp. 148–163. [Google Scholar]
  21. Gardent, C.; Shimorina, A.; Narayan, S.; Perez-Beltrachini, L. Creating training corpora for nlg micro-planning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar]
  22. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017. [Google Scholar]
  23. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 2018, 114, 34–45. [Google Scholar] [CrossRef]
  24. Miwa, M.; Bansal, M. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016. [Google Scholar]
  25. Zheng, H.; Wen, R.; Chen, X.; Yang, Y.; Zhang, Y.; Zhang, Z.; Zhang, N.; Qin, B.; Ming, X.; Zheng, Y. PRGC: Potential Relation and Global Correspondence Based Joint Relational Triple Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual, 1–6 August 2021; pp. 6225–6235. [Google Scholar]
  26. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112. [Google Scholar]
  27. Cabot, P.L.H.; Navigli, R. REBEL: Relation extraction by end-to-end language generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual, 7–11 November 2021; pp. 2370–2381. [Google Scholar]
  28. Wang, Y.; Sun, C.; Wu, Y.; Zhou, H.; Li, L.; Yan, J. UniRE: A Unified Label Space for Entity Relation Extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual, 1–6 August 2021; pp. 220–231. [Google Scholar]
  29. Ma, Y.; Hiraoka, T.; Okazaki, N. Named entity recognition and relation extraction using enhanced table filling by contextualized representations. J. Nat. Lang. Process. 2022, 29, 187–223. [Google Scholar] [CrossRef]
  30. Xu, B.; Wang, Q.; Lyu, Y.; Shi, Y.; Zhu, Y.; Gao, J.; Mao, Z. EmRel: Joint Representation of Entities and Embedded Relations for Multi-triple Extraction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 659–665. [Google Scholar]
  31. Zhao, K.; Xu, H.; Cheng, Y.; Li, X.; Gao, K. Representation iterative fusion based on heterogeneous graph neural network for joint entity and relation extraction. Knowl.-Based Syst. 2021, 219, 106888. [Google Scholar] [CrossRef]
  32. Liu, J.; Zhao, S.; Wang, G. SSEL-ADE: A semi-supervised ensemble learning framework for extracting adverse drug events from social media. Artif. Intell. Med. 2018, 84, 34–49. [Google Scholar] [CrossRef] [PubMed]
  33. An, T.; Chen, Y.; Chen, Y.; Ma, L.; Wang, J.; Zhao, J. A machine learning-based approach to ERα bioactivity and drug ADMET prediction. Front. Genet. 2023, 13, 1087273. [Google Scholar] [CrossRef] [PubMed]
  34. Li, Z.; Fu, L.; Wang, X.; Zhang, H.; Zhou, C. RFBFN: A relation-first blank filling network for joint relational triple extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Dublin, Ireland, 22–27 May 2022; pp. 10–20. [Google Scholar]
  35. An, T.; Wang, J.; Zhou, B.; Jin, X.; Zhao, J.; Cui, G. Impact of strategy conformity on vaccination behaviors. Front. Phys. 2022, 10, 972457. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
