Joint Entity and Relation Extraction with Soft Pruning and GlobalPointer

In recent years, scholars have paid increasing attention to joint entity and relation extraction. However, the most difficult aspect of joint extraction is extracting overlapping triples. To address this problem, we propose a joint extraction model based on Soft Pruning and GlobalPointer, SGNet for short. First, the BERT pretraining model is used to obtain text word vector representations with contextual information, and then the local and non-local information of the word vectors is obtained through graph operations. Specifically, to address the loss of information caused by rule-based pruning strategies, we utilize a Gaussian Graph Generator and an attention-guided layer to construct a fully connected graph. We call this process soft pruning for short. Then, to achieve node message passing and information integration, we employ GCNs and a dense connection layer. Next, we use the GlobalPointer decoder to convert triple extraction into quintuple extraction so as to tackle the problem of overlapping triple extraction. The GlobalPointer decoder, unlike a typical feedforward neural network (FNN), can perform joint decoding. Finally, to evaluate model performance, we carried out experiments on two public datasets: NYT and WebNLG. The experiments show that SGNet performs substantially better on overlapping extraction and achieves good results on both datasets.


Introduction
The entities and the relationships between them are the main elements of a knowledge graph (KG) [1], and their general form is (subject entity, relation, object entity), referred to as an entity-relation triple for short. Most current knowledge graphs are stored in the form of triples, such as DBpedia [2], Freebase [3], NELL [4], Probase [5], and Wikidata [6]. Extracting triples from unstructured text is a fundamental task of information extraction in natural language processing (NLP) and a key step in knowledge graph construction (KGC) [7,8].
Pipeline-based relational triple extraction approaches were used at the beginning, such as DSPT [9] and kernel methods [10]. In the pipeline approach, the work is decomposed into two distinct subtasks: named entity recognition (NER) [11,12] and relation extraction (RE) [13,14]. This method first identifies the entities and then predicts the relationship of each entity pair. The pipeline approach has the advantages of simplicity, flexibility, and ease of implementation. However, it tends to overlook the interactions between entities and relations and is prone to error propagation [15].
To establish the interaction between these two subtasks, the joint extraction method was presented. Compared with the pipeline method, the joint extraction method integrates the information of entities and relations and effectively reduces error propagation; thus, it has received more and more attention from scholars. Joint extraction methods are divided into feature-based models [16][17][18][19] and neural-network-based models [20,21]. Feature-based models require more complex preprocessing, which introduces additional error. Furthermore, feature-based models necessitate manual feature extraction, which adds a significant burden. In contrast, feature learning in neural network models is performed by machines without human intervention.
The joint learning approach can integrate information from both tasks because it extracts entities and relations simultaneously [22]. An ideal joint extraction system should be able to adaptively handle multiple situations, especially triple extraction where entities overlap, i.e., multiple relations share a common entity, which most methods cannot handle. An example of triple overlap is shown in Figure 1. The triple overlapping problem is separated into three categories: normal, single entity overlap (SEO), and entity pair overlap (EPO). Some researchers have studied this problem. HRL [23] introduces reinforcement learning methods into seq2seq models to enhance the interactions between entities, which results in a significant improvement. GraphRel [24] utilizes graph convolutional networks (GCNs) [25] to build a relational graph that takes into account the relations between all words. Despite the success of joint learning approaches, there are still drawbacks. First, in previous work using graph neural networks to obtain node representations, the graph fed into the graph convolutional network is pruned. However, this is prone to missing information and does not guarantee that the network can learn the potential representations of all nodes. Second, with independent decoding of entities and relations, if wrong entities are identified, each relation is also assigned to these entities, resulting in a large number of invalid triples. When there are multiple relations between entity pairs, the classifier becomes confused and cannot make accurate judgments [26].
To solve the above problems, this paper proposes a joint entity and relation extraction model with Soft Pruning and GlobalPointer (SGNet), whose purpose is to improve the extraction accuracy of triples. Specifically, the graph module is first used to extract node features to obtain each token's local and global information. In particular, to avoid pruning, the graph we construct is fully connected. Each token is a node in the graph, and the edges are determined by using KL divergence to calculate the distribution difference between two nodes, where the distribution of each node is generated by a trainable Gaussian Graph Generator. Then, GCNs and a dense connection layer are used to obtain local and non-local information of the nodes. Finally, to realize joint decoding, we convert the triple (s, r, o) extraction procedure into quintuple (sh, st, r, oh, ot) extraction, where the subject head, subject tail, object head, and object tail are represented as sh, st, oh, and ot, respectively. We use GlobalPointer to decompose the quintuple into the pairs (sh, st), (oh, ot), (sh, oh|p), and (st, ot|p) for scoring.
Our contributions are as follows: (1) We employ a Gaussian Graph Generator to initialize the text graph to avoid the missing-information problem caused by pruning. Each word in the sentence is a node in the graph, and edges are obtained by computing the distribution difference between two nodes with KL divergence, to encourage information propagation between nodes with large distribution differences. (2) We transform triple extraction into a quintuple extraction task and decompose the quintuple extraction problem into scoring four token pairs. Constructing (sh, st) and (oh, ot) matrices using GlobalPointer, as well as (sh, oh|p) and (st, ot|p) matrices under given relations, allows for joint entity-relation extraction. (3) We conduct experiments to evaluate our model on two public datasets, NYT and WebNLG. The experimental results show that our model outperforms the baseline models in extracting both overlapping and non-overlapping triples, demonstrating the effectiveness of the graph module and the joint decoding module.

Related Work
Early relation extraction work generally adopts the pipeline method [27,28]: named entity recognition is performed on the unstructured text first, and then relation extraction is performed; that is, the entities are extracted first and then the relations between entity pairs are determined. While the pipeline approach is simple to use, both extraction models are quite flexible: the entity and relation models can use different datasets and do not require a corpus annotated with both entities and relations. However, this approach has the following drawbacks: (1) Error accumulation: errors in entity extraction affect the performance of relation extraction in the next phase. (2) Entity redundancy: since all entities are first paired exhaustively and the relations are then classified, candidate entity pairs with no association provide redundant information, increasing the error rate and the complexity. (3) Interaction separation: the dependency and internal relationship between the two subtasks are neglected.
To overcome the above drawbacks, scholars gradually proposed joint extraction models, in which a single model extracts the triples in the text, strengthening the connection between the two subtasks of entity extraction and relation extraction and alleviating error propagation [29]. Joint extraction models for entities and relations can be categorized into two types: those based on parameter sharing and those based on joint decoding. Parameter sharing models share the parameters of the input features or internal hidden layers. Miwa et al. proposed a recurrent-neural-network-based joint extraction model that uses a bidirectional long short-term memory network (Bi-LSTM) to encode entities first, and uses a Tree-LSTM, which considers dependency-tree-based information, to model the relations between entities [30]. However, this model only works at the sentence level and is only suitable for simple dependency parsing. Katiyar et al. proposed a pointer network decoding approach that, for the current entity, queries all entities before its position (forward query) and calculates attention scores [31]. Zeng et al. proposed a joint extraction model with a copy mechanism based on the sequence-to-sequence (Seq2Seq) idea, but the model's decoder only copies the last token of an entity, resulting in incomplete extraction of multi-token entities [32]. Joint extraction models based on parameter sharing impose no constraints on the sub-models, but the interaction between the entity and relation models is weak due to the use of independent decoding.
To enhance the interaction between the two sub-models, joint decoding algorithms have been proposed. Dai et al. proposed a unified joint extraction model that labels entities and relations according to a query word position p: it detects the entity at p and identifies entities at other positions that have relations with it [15]. Arzoo et al. used conditional random fields (CRFs) to simultaneously model the entity and relation models and obtained the entity and relation outputs through Viterbi decoding [33]. Zhang et al. proposed a globally optimized end-to-end relation extraction neural model with new LSTM features; furthermore, they proposed a new method to integrate syntactic information to facilitate global learning [34]. Zheng et al. proposed a special label type to transform entity recognition and relation classification into a sequence labeling problem: the sentence is encoded by the encoding layer, and the hidden vectors are fed into the decoding layer to directly obtain triples, so the extraction process is not divided into the two sub-processes of entity recognition and relation classification [35]. Wei et al. proposed a cascaded binary tagging framework: first, pointers annotate the start and end positions of subject entities to extract all possible subjects in the sentence, and then pointer annotation identifies all possible relations and object entities for each subject [26]. However, the difficulty of joint entity and relation extraction lies in overlapping relation extraction.
To overcome this difficulty in extracting overlapping triples, Yu et al. proposed a decomposition strategy that splits the extraction task into first extracting head entities and then extracting entity relations, with the two tasks sharing the coding layer; the two subtasks are further transformed into a multi-sequence labeling problem using a span-distance-based labeling scheme [36]. Wang et al. introduced a novel handshake tagging strategy that makes the following judgments for each word in a sentence: whether it is the beginning or the end of an entity, and whether it is the head or the tail of an entity under a particular relation. Such judgments improve the accuracy of overlapping triple recognition [37].
Different from the previous work of extracting triples, in this paper, we directly convert the triples into quintuples, then score the elements in the quintuple one by one, and finally use the joint decoding module to parse out the triples. Figure 2 shows the overall structure of our model, which consists of three parts: the BERT model, the graph module, and the joint decoding module. We will describe each section in detail below.

BERT Model
Before performing the graph operations, the sentence must be transformed into word embeddings. In this paper, we employ BERT [38] as the pretraining model to encode sentence semantic vectors. Compared with the traditional Word2Vec [39] word embedding method, the BERT model takes word position into account. Because a word can express distinct meanings in different positions, word position information cannot be disregarded during the word embedding process.
BERT is a pretraining model that captures both contextual and positional information. Given a sequence X with n tokens, we map X to a BERT input sequence X_Input = [x_0, x_1, . . . , x_{n+1}]. Here, x_0 represents the "[CLS]" token at the start of the sentence, and x_{n+1} represents the "[SEP]" token at the end of the sentence. After BERT encoding, the corresponding tokens are represented as V = [v_0, v_1, . . . , v_{n+1}]. Here, v_0, the representation of token "[CLS]", is considered the task-specific token of the entire sequence, and [v_1, . . . , v_n] is the word embedding we employ in the downstream task.
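As a minimal illustration of the input mapping above, the following sketch wraps an n-token sequence with the "[CLS]" and "[SEP]" markers so that x_0 and x_{n+1} sit at the expected positions (the function name is ours; real BERT tokenization is performed by a subword tokenizer):

```python
def to_bert_input(tokens):
    """Wrap an n-token sequence into the (n + 2)-token BERT input sequence."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]

x_input = to_bert_input(["Jobs", "founded", "Apple"])
# x_input[0] is the sentence-level "[CLS]" token, x_input[-1] is "[SEP]",
# and x_input[1:-1] are the word tokens whose embeddings v_1..v_n feed
# the downstream task.
```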

Graph Model
As illustrated in Figure 3, our graph module consists of graph generation, an attention-guided layer, a densely connected layer, and a linear combination layer. The graph module is shown with an example of four nodes and an adjacency matrix. The node embeddings and the adjacency matrix generated with KL divergence serve as inputs. Then, multi-head attention is employed to construct N attention-guided adjacency matrices, and the resulting matrices are fed into N separate densely connected layers, generating new representations. Finally, a linear combination layer is applied to integrate the outputs of the N different densely connected layers.

Gaussian Graph Generator
We utilize the Gaussian Graph Generator [40] to construct the graph's edges and thereby minimize the inaccuracy introduced by natural language tools. Specifically, we encode each node v_i of the BERT model output into a Gaussian distribution N(μ_i, σ_i^2) as follows:

μ_i = f_θ(v_i), σ_i = A(f_θ′(v_i)),

where f_θ and f_θ′ are two trainable neural networks and A is a nonlinear activation function. The SoftPlus function is used as the activation function because the standard deviation of a Gaussian distribution is confined to (0, +∞).
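As a toy sketch of this step (scalar features and hand-picked weights stand in for the trainable networks f_θ and f_θ′, which are our assumptions, not the model's parameters), the SoftPlus activation guarantees a strictly positive standard deviation:

```python
import math

def softplus(x):
    """SoftPlus activation: log(1 + e^x), with range (0, +inf)."""
    return math.log(1.0 + math.exp(x))

def gaussian_params(v, w_mu=0.5, w_sigma=-2.0):
    """Map a (scalar) node feature v to Gaussian parameters (mu, sigma)."""
    mu = w_mu * v                  # mu_i = f_theta(v_i), a linear stand-in
    sigma = softplus(w_sigma * v)  # sigma_i = SoftPlus(f_theta'(v_i)), always > 0
    return mu, sigma

mu, sigma = gaussian_params(3.0)
# sigma is strictly positive regardless of the sign of the pre-activation
```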

KL Divergence
In probability theory and information theory, KL divergence (Kullback-Leibler divergence), also known as relative entropy, is a way to describe the difference between two probability distributions P and Q. It can measure data objects whose geometric distances (such as cosine distance or Euclidean distance) are difficult to measure [41]. The closer the two distributions are, the smaller the KL divergence value. KL divergence is defined as follows:

D_KL(p ‖ q) = Σ_{i=1}^{N} p(x_i) log( p(x_i) / q(x_i) ),

where p(x) is the target distribution, q(x) is the matching distribution, x_i is the discrete random variable, and N is the length of the distribution.
To evaluate the differences between the distributions of two nodes, KL divergence is utilized to determine the connection strength between nodes. The propagation of messages between token representations with considerable semantic variance is encouraged, so as to capture the potential links between tokens. As an example, the semantic gap between "we" and "our" is extremely small, so the association strength between the two words is small and the weight assigned is negligible; the semantic gap between "Jobs" and "Apple" is large, so the connection strength between the two words is large and the weight assigned is significant. The edge weight between the i-th and j-th nodes is calculated as follows:

A_ij = KL( N(μ_i, σ_i^2) ‖ N(μ_j, σ_j^2) ),

where N(μ_i, σ_i^2) and N(μ_j, σ_j^2) are the distributions of the i-th node and the j-th node. Owing to the asymmetry of the KL divergence, we obtain a directed graph G = (V, A), where V denotes the set of all nodes and A denotes the adjacency matrix.
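For univariate Gaussians, the KL divergence has a closed form, which makes the asymmetry of the resulting adjacency matrix easy to see. The sketch below builds A from two toy node distributions (univariate rather than the multivariate distributions the model actually uses):

```python
import math

def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) )."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

nodes = [(0.0, 1.0), (2.0, 0.5)]  # (mu, sigma) for two toy nodes
A = [[kl_gauss(*p, *q) for q in nodes] for p in nodes]
# Diagonal entries are 0 (identical distributions); off-diagonal entries
# differ, A[0][1] != A[1][0], so the graph G = (V, A) is directed.
```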
GCNs are neural networks that operate directly on graph structures. Let us first look at how nodes are updated in multi-layer GCNs. Given a graph with n nodes, we can represent the graph with an n × n adjacency matrix A. In traditional GCN operations, A_ij = 1 and A_ji = 1 if there is an edge between node i and node j; otherwise, A_ij = 0 and A_ji = 0. The purpose of using GCNs is to aggregate adjacent nodes and learn the information of K-order neighbors. The convolution for node i at the l-th layer takes h^(l−1) as input and outputs h^(l):

h_i^(l) = ρ( Σ_{j=1}^{n} A_ij W^(l) h_j^(l−1) ),

where A denotes the adjacency matrix and W^(l) represents the weight matrix; ρ is the activation function, and h_i^(0) is the initial input representation v_i. In most previous research, the input graph was pruned. However, rule-based pruning strategies might eliminate crucial information from the full tree. We therefore use the attention-guided layer to turn the graph into N fully connected graphs [42]. The self-attention mechanism is used to create the adjacency matrices, which can be written as follows:

Ã^(t) = softmax( Q K^T / √d ),

where Q and K are the BERT output and d denotes the dimension of the BERT output.
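A single GCN layer update of the form above can be sketched with plain Python lists (scalar node features, a single shared weight, and ReLU as ρ; all toy assumptions):

```python
def gcn_layer(A, h, w=1.0):
    """One GCN update: h_i' = relu(sum_j A[i][j] * w * h[j])."""
    relu = lambda x: max(0.0, x)
    n = len(h)
    return [relu(sum(A[i][j] * w * h[j] for j in range(n)))
            for i in range(n)]

# 3-node chain graph 0-1-2 with self-loops
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
h0 = [1.0, 2.0, 3.0]
h1 = gcn_layer(A, h0)  # each node aggregates its own and its neighbours' features
```

Stacking such layers lets each node see K-order neighbors after K updates, which is the aggregation behavior described above.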
With the above operation, we do not need to rely on external NLP toolkits, and avoid pruning operations.
To capture more structural information in large graphs, we adopt dense connections [43]. With dense connections, we can train deeper models, allowing rich local and non-local information to be captured for learning a better graph representation. The input to the l-th sub-layer is the concatenation of the initial node representation and the node representations produced by layers 1, 2, . . . , l − 1.
Each densely connected layer has L sub-layers. Each sub-layer has multi-layer GCNs. Our graph model concatenates the outputs of each sub-layer to form new representations. Therefore, unlike GCN models, where the hidden dimension is more than or equal to the input dimension, our graph module's hidden dimension diminishes as the number of layers grows, improving parameter efficiency.
Thus, for N dense connections and the resulting N attention-guided adjacency matrices, we obtain the output representation of the t-th graph according to the following equation:

h^(t) = DCL_t( Ã^(t), h ),

where t = 1, 2, . . . , N, Ã^(t) is the t-th attention-guided adjacency matrix, and DCL_t denotes the t-th densely connected layer with weight matrix W^(t). Finally, the linear combination layer is applied to integrate the outputs of the N different densely connected layers. Formally, the output representation of the graph module after the linear combination layer can be written as

h_out = W_comb h_comb + b_comb,

where h_comb ∈ R^(d×N) is obtained by concatenating the outputs of the N densely connected layers, b_comb is a bias vector for the linear transformation, and W_comb ∈ R^((d×N)×d) is the weight matrix.
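The dense connection and linear combination steps can be sketched as follows, with scalar stand-ins for the node representations (the sub-layer functions and weights here are illustrative, not the model's):

```python
def dense_forward(x, sublayers):
    """Run sub-layers where layer l sees [x; h^(1); ...; h^(l-1)] as input."""
    outputs = []
    for f in sublayers:
        inp = [x] + outputs      # concatenation of initial and earlier outputs
        outputs.append(f(inp))
    return outputs

def linear_combination(flat_outputs, weights, bias=0.0):
    """Linear-combination layer: h_out = W_comb . concat(outputs) + b_comb."""
    return sum(w * v for w, v in zip(weights, flat_outputs)) + bias

# Three sub-layers that simply sum their (growing) inputs:
outs = dense_forward(1.0, [sum, sum, sum])
h_out = linear_combination(outs, [0.5, 0.25, 0.25])
```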

GlobalPointer Joint Decoder
At first glance, joint extraction appears to be the extraction of triples (s, p, o) (i.e., subject, predicate, object), but it is really the extraction of quintuples (sh, st, p, oh, ot), where sh and st are the head and tail positions of the subject entity, respectively, and oh and ot are the head and tail positions of the object entity, respectively. Here, we can convert the triple in Figure 1 into a quintuple, as shown in Figure 4. From the perspective of a probability map, it is only necessary to design a quintuple scoring function. However, if we enumerate all quintuples one by one, the total is far too large: suppose the sentence length is m and there are n relations; then, since each of the subject span and the object span has m(m + 1)/2 possibilities, the number of all quintuples is on the order of n × (m(m + 1)/2)^2. Evaluating all of these directly is infeasible, so a simplification is required. We may use the following decomposition strategy:

S(sh, st, p, oh, ot) = S(sh, st) + S(oh, ot) + S(sh, oh|p) + S(st, ot|p),

where S(sh, st) and S(oh, ot) denote the head and tail scores of the subject entity and object entity, respectively, and the subject entity and object entity are parsed out by S(sh, st) > 0 and S(oh, ot) > 0. S(sh, oh|p) means that the head features of the subject and object are used as their representations to conduct one matching; the relation can be resolved when S(sh, oh|p) > 0. If there is entity nesting, it is also necessary to consider S(st, ot|p).
S(sh, st) and S(oh, ot) are used to identify the entities corresponding to the subject and object, which is equivalent to a named entity recognition (NER) task, so we can use GlobalPointer [44] to complete it. S(sh, oh|p) identifies the specific relation p of the (sh, oh) pair; here, we also use GlobalPointer, but since sh > oh can occur, we remove the default lower-triangular mask of GlobalPointer. Finally, S(st, ot|p) is handled in the same way as S(sh, oh|p).
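A hypothetical decoding sketch of this scheme: given toy score matrices for S(sh, st), S(oh, ot), and S(sh, oh|p) (omitting the S(st, ot|p) check for brevity), triples are recovered by pairing every subject and object span whose relation-specific head score is positive. All names and values here are illustrative:

```python
def decode(subj_scores, obj_scores, rel_head_scores):
    """Recover ((sh, st), p, (oh, ot)) triples from positive score entries."""
    subjects = [(h, t) for h, row in enumerate(subj_scores)
                for t, s in enumerate(row) if s > 0]
    objects = [(h, t) for h, row in enumerate(obj_scores)
               for t, s in enumerate(row) if s > 0]
    triples = []
    for p, mat in rel_head_scores.items():
        for sh, st in subjects:
            for oh, ot in objects:
                if mat[sh][oh] > 0:   # S(sh, oh | p) > 0 resolves relation p
                    triples.append(((sh, st), p, (oh, ot)))
    return triples

# Toy 3-token sentence: subject span (0, 0), object span (2, 2),
# and relation "founded" aligning sh=0 with oh=2.
subj_scores = [[1, -1, -1], [-1, -1, -1], [-1, -1, -1]]
obj_scores  = [[-1, -1, -1], [-1, -1, -1], [-1, -1, 1]]
rel_head = {"founded": [[-1, -1, 1], [-1, -1, -1], [-1, -1, -1]]}
triples = decode(subj_scores, obj_scores, rel_head)
```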


Training and Prediction
The purpose of training in this paper is to use positive and negative samples to achieve model inference. As a result, we employ multi-label cross-entropy as the training loss function, which is defined as

L = log( 1 + Σ_{i∈N} e^{S_i} ) + log( 1 + Σ_{j∈P} e^{−S_j} ),

where P and N are the sets of positive and negative categories: positions of positive categories are labeled 1 and positions of negative categories are labeled 0, and S_i denotes the predicted score at the i-th position of the label matrix. However, the loss function defined in this way has a fatal flaw: the number of positive categories is much smaller than the number of negative categories, which makes constructing and transmitting the dense label matrix costly. Therefore, using the fact that N = A \ P, the loss function can be adjusted as follows:

L = log( 1 + Σ_{i∈A} e^{S_i} − Σ_{j∈P} e^{S_j} ) + log( 1 + Σ_{j∈P} e^{−S_j} ),

where A = P ∪ N. After the above adjustment, only the positive positions need to be stored, and the size of the label matrix is greatly reduced.
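A direct reading of the multi-label cross-entropy above (scores as plain Python floats; a sketch, not the training code): the loss is low when positive-class scores are above 0 and negative-class scores are below 0.

```python
import math

def multilabel_ce(pos_scores, neg_scores):
    """Multi-label cross-entropy over positive- and negative-class scores."""
    return (math.log(1 + sum(math.exp(s) for s in neg_scores))
            + math.log(1 + sum(math.exp(-s) for s in pos_scores)))

# Scores that separate the classes correctly yield a much lower loss than
# scores that invert them.
well_separated = multilabel_ce([5.0], [-5.0])
confused = multilabel_ce([-5.0], [5.0])
```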

Datasets
To validate the performance of the model, we conducted experiments on two public datasets: NYT [45] and WebNLG [46]. The labels in both datasets are triples, and the entities in the triples are not the entire entity span but the last word of the entity in the sentence. For example, for the entity New York, the entity recorded in the dataset is York. The datasets we use are from [32], and their statistics are shown in Table 1.

Evaluation Metrics
In our experiments, we maintain consistency with previous work. For an extracted relational triple, when the heads of the subject entity and object entity and the relation are all predicted correctly, we consider the triple correct. We follow common evaluation guidelines and report the standard micro precision (Prec.), recall (Rec.), and F1 score:

Prec. = TP / (TP + FP), Rec. = TP / (TP + FN), F1 = 2 × Prec. × Rec. / (Prec. + Rec.),

where TP means correctly predicting positive samples as positive, FN means wrongly predicting positive samples as negative, and FP means wrongly predicting negative samples as positive.
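The three metrics follow directly from the TP, FP, and FN counts:

```python
def prf(tp, fp, fn):
    """Micro precision, recall, and F1 from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

prec, rec, f1 = prf(tp=8, fp=2, fn=2)  # prec = rec = 0.8
```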

Implementation Details
Our model SGNet is implemented in PyTorch; parameter optimization uses AdamW [47] with a learning rate of 3 × 10^−5. The pretrained BERT model leverages the base-cased English model, and the maximum length of input sentences is set to 100. The batch size is 16 for training and 32 for validation on both NYT and WebNLG. The GPU used in the experiments is an NVIDIA GeForce RTX 3090 with 24 GB of video memory; the machine has 46 GB of RAM, and the programming language is Python 3.7. We choose the model that performs best on the validation set and then feed the test set to that model to obtain the output. Furthermore, we keep the parameters of the graph module the same as in [40]. Our paper uses a two-layer graph module, and the first-layer graph module does not need to be guided by the attention mechanism.

Main Results
We compare the following baseline models with our proposed model:

• NovelTagging [35]: applies a novel tagging method to transform the joint extraction of entities and relations into a sequence tagging problem, but it cannot tackle the overlap problem.
• CopyRE [32]: uses seq2seq to generate all triples of a sentence to solve the overlap problem, but this approach only considers a single token rather than multiple tokens.
• GraphRel [24]: a model that generates a weighted relation graph for each relation type and applies a GCN to predict relations between all entity pairs.
• OrderCopyRE [23]: an improved model of CopyRE that uses reinforcement learning to generate multiple triples.
• ETL-Span [36]: decomposes the joint extraction task into two subtasks: the first distinguishes all head entities that may be related to the target relation, and the second determines the corresponding tail entity and relation for each extracted head entity.
• WDec [48]: an improved model of CopyRE, which solves the problem that CopyRE misses multiple tokens.
• CasRel [26]: identifies the head entity first and then the tail entity under a particular relation.
• DualDec [49]: designs an efficient cascaded dual-decoder approach to address the extraction of overlapping relation triples, consisting of a text-specific relation decoder and a relation-corresponding entity decoder.
• RMAN [50]: not only considers the semantic features in the sentence but also leverages the relation type to label the entities to obtain a complete triple.

Table 2 reports the main results of our model and the other baseline models on both datasets. As can be seen from the table, our model outperforms all baseline models in F1 score. For the NYT dataset, our model achieves the best results in recall and F1 score, with F1 scores improving by 1.7%, 0.8%, and 5.9% over the CasRel framework, the DualDec model, and the RSAN model, respectively. For the WebNLG dataset, our model performs well, with F1 scores improving by 0.1%, 1.0%, and 7.4% over the CasRel framework, the DualDec model, and the RSAN model, respectively. In the experiments, to verify the effectiveness of the graph module and the joint decoding module, we removed the graph module and kept only the BERT model and the joint decoding module, naming the result SGNet WG. For the NYT dataset, the F1 score of the SGNet WG model is 0.6% higher than that of CasRel, which also uses only the BERT module and a decoding module. Furthermore, compared with SGNet WG, our model improves by 1.1% and 1.7% on NYT and WebNLG, respectively.
These encouraging results show that the graph module and joint decoding module can effectively assist the model to extract the relation triples. Although SGNet WG does not perform very well for the WebNLG dataset, SGNet performs better.
Previous studies have shown that once a model's F1 score on a dataset reaches 90+, it is already difficult to improve further, as this is close to the human limit [26]. Therefore, our model's results on the WebNLG dataset are close to those of the CasRel model. On the other hand, for the WebNLG dataset itself, the small amount of training data (5019 sentences) makes it particularly difficult to extract the 246 predefined relations.

Result Analysis on Different Sentence Types
Since overlapping triple extraction is a challenging problem for joint extraction, the majority of the models in Table 2 have low F1 scores. Specifically, the statistics in Table 1 show that most of the sentences in the NYT dataset belong to the normal class, while in the WebNLG dataset, most of the sentences belong to the EPO and SEO classes. This causes most previous models to fail at extracting overlapping triples.
To explore the performance of our model in extracting triples, we extracted triples from sentences containing different numbers of triples and compared the results with the baseline models, as shown in Table 3. As seen in the table, the capacity of most baseline models on both datasets to extract triples diminishes as the number of triples (N) in the sentence grows, whereas our model shows a rising trend. At N = 4, our model surpasses the DualDec model, which has the highest F1 score among the baseline models, while both show an upward trend. Our model outperforms the DualDec model in F1 score on the NYT dataset, with F1 scores improving by 0.7%, 0.8%, 0.3%, 0.4%, and 0.5%, respectively. Compared with DualDec, our model improves by 3.6%, 0.6%, 4.8%, 2.5%, and 6.3% on the WebNLG dataset, respectively. This shows that our model can deal with complex multi-relation sentences. Table 3 reports the F1 score of extracting relational triples from sentences with a different number (denoted as N) of triples. Further, to investigate the performance of our model in extracting overlapping relational triples, we extracted triples of different overlapping types from the sentences. According to the overlapping category, the sentences used in the test can be divided into three types: normal, SEO, and EPO. As shown in Figure 5, our method achieves the best results. For the normal class, i.e., sentences not containing overlapping relations, our model has a 0.7% higher F1 score than the best baseline model on the NYT dataset and outperforms most of the baseline models on the WebNLG dataset, with an F1 score only 0.2% lower than CasRel. For single entity overlap (SEO) and entity pair overlap (EPO), the F1 score of our model is higher than that of the best baseline model on both datasets. Compared to the baseline model with the highest F1 score, our model improves by 0.7% and 0.3% on the SEO class of NYT and WebNLG, respectively.
Additionally, our model improves by 0.6% and 0.3% for the EPO class. Compared to the CasRel model, the SGNet WG model improves by 0.2% on the NYT dataset and decreases by 0.2% on the WebNLG dataset in the SEO class, and it improves by 0.1% on the NYT dataset and 0.2% on the WebNLG dataset in the EPO class. Similarly, compared with SGNet WG, the improvement of the full SGNet model is small, but it remains higher than the CasRel model. Since the CasRel model is already effective at dealing with overlapping triples and SGNet surpasses it, our model is more capable of dealing with overlapping triples. Moreover, as described earlier, when a model's F1 score reaches 90+, it is saturated and there is very little room for improvement.
Table 4 shows a case study of the proposed SGNet. Since our model decodes by decomposing triples into quintuples, its output is still triples. The sentences in the table are divided into normal and overlapping cases. The first sentence contains only normal triples, and our model extracts all of them. The second example contains both SEO and EPO overlapping triples. As shown in the table, our model can completely extract the overlapping triples in the sentence, which shows its effectiveness.

Table 5 compares model efficiency. From the table we can observe that, even without the graph module, our model is better than TPLinker in terms of parallel processing. This encouraging result is due to the multi-label loss function mentioned earlier: because there are far fewer positive classes than negative classes, the dimension of the label matrix is greatly reduced, which speeds up training. In addition, the results show that introducing the graph module improves the model's F1 score, at the cost of longer training and inference times.
However, the training and inference times of the two models do not differ much. Therefore, even with the increased number of parameters, the efficiency of our model remains better than that of TPLinker.
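The multi-label loss mentioned above can be illustrated with a minimal NumPy sketch of a GlobalPointer-style multi-label categorical cross-entropy (the function name and array shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def multilabel_ce(scores, labels):
    """Multi-label categorical cross-entropy over span scores.

    `scores` holds the model's score for every candidate span under one
    relation; `labels` marks gold spans with 1. Only the sums over the
    (few) positive and (many) negative scores enter the loss, so no
    per-span softmax over the full label matrix is required.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # loss = log(1 + sum(exp(s_neg))) + log(1 + sum(exp(-s_pos)))
    return np.log1p(np.exp(neg).sum()) + np.log1p(np.exp(-pos).sum())
```

The loss pushes positive-span scores above zero and negative-span scores below zero, so a fixed threshold of 0 can be used at decoding time regardless of how unbalanced the two classes are.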

Discussion
In the introduction of this paper, we first analyzed the pipeline-based approach, which ignores the interaction between entities and relations and is prone to error propagation; this motivates the joint learning approach, on which our model is also based.
First, we adopt a joint decoding approach that decomposes triples into quintuples, strengthening the information exchange between tasks and preventing information imbalance in the feature extraction stage. In the experiments, SGNet_WG achieves a higher F1 score than most compared models, and in the overlapping-triple experiments its F1 score exceeds that of the best baseline model. Next, to better learn the semantic relation information of sentences, we introduce a graph neural network: by modeling text as a graph, nodes can pick up both local and non-local information. The experimental results show that SGNet improves the F1 score over SGNet_WG.
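The dense connection layer over GCN operations described above can be sketched as follows (a NumPy sketch; layer widths and parameter names are assumptions, not the paper's configuration):

```python
import numpy as np

def dense_gcn(x, adj, weights):
    """Densely connected GCN sketch.

    Each layer receives the concatenation of the input and all previous
    layer outputs, so deeper layers see both local and non-local (global)
    node information. `adj` is a normalized adjacency matrix over tokens.
    """
    outputs = [x]
    for W in weights:
        inp = np.concatenate(outputs, axis=-1)          # dense connection
        outputs.append(np.maximum(adj @ inp @ W, 0.0))  # propagate + ReLU
    return np.concatenate(outputs, axis=-1)
```

With node features of width 3 and two layers of width 2, the final representation concatenates all intermediate outputs, giving width 3 + 2 + 2 = 7; this is what lets later layers integrate information beyond their immediate neighborhood.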
Finally, due to the increased model depth, the training and inference times of our model are longer, as shown in the analysis of model complexity, but this cost is acceptable: in exchange, the extraction accuracy is improved.

Conclusions
We investigate extracting text features via graph operations, since unstructured text is non-Euclidean data and graphs represent such data well. However, recent research indicates that rule-based pruning may destroy vital information in the complete dependency tree; hence, this article constructs a fully connected graph instead. Furthermore, traditional graphs are built by external toolkits and are fixed once built, whereas the parameters of our graph generator can be optimized during training, giving it a degree of generalization ability. Because a conventional GCN only acquires local node features and frequently ignores global text features, we adopt a dense connection layer to obtain additional global information. Furthermore, to address the difficulty of extracting overlapping triples, we use the GlobalPointer joint decoding method, which subdivides triple extraction into quintuple extraction: the extraction of the entity head and entity tail under a specific relation is transformed into a form similar to scaled dot-product attention, so that overlapping entities can be identified through multiple relation-specific score matrices, thereby tackling the overlapping problem. The experimental results on the NYT and WebNLG datasets demonstrate the effectiveness of our model. In addition, our experiments show that our model can effectively extract overlapping triples and handle sentences with complex multi-relation structure.
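The scaled dot-product form of the span scoring can be sketched as follows (a simplified NumPy sketch for a single relation type; the projection names `Wq` and `Wk` are illustrative, and the rotary position encoding used by the full GlobalPointer is omitted):

```python
import numpy as np

def span_scores(h, Wq, Wk):
    """Score every (head, tail) token pair under one relation.

    h:  (seq_len, d) token representations from the encoder.
    Wq, Wk: relation-specific projections. The resulting score matrix has
    the same form as scaled dot-product attention, so overlapping entities
    can be read off from separate score matrices, one per relation.
    """
    q = h @ Wq                             # "query" vectors for span heads
    k = h @ Wk                             # "key" vectors for span tails
    return q @ k.T / np.sqrt(q.shape[-1])  # (seq_len, seq_len) span scores
```

Spans whose score exceeds a threshold (e.g., 0) are decoded as entities under that relation; doing this for both subject and object spans of every relation yields the quintuples described above.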
In the future, we will further improve the performance of this model to apply it to knowledge graphs and other fields.

Data Availability Statement: The study did not report any data.

Conflicts of Interest:
There are no conflicts to declare.