REEGAT: RoBERTa Entity Embedding and Graph Attention Networks Enhanced Sentence Representation for Relation Extraction

Abstract: Relation extraction is one of the most important intelligent information extraction technologies and can be used to construct and optimize services in intelligent communication systems (ICS). One issue with existing relation extraction approaches is that they use a one-sided sentence embedding as their final prediction vector, which degrades relation extraction performance. In this paper, we present an innovative relation extraction model, REEGAT (RoBERTa Entity Embedding and Graph Attention networks enhanced sentence representation), which incorporates the concept of enhanced word embedding from graph neural networks. The model first uses RoBERTa to obtain word embeddings and PyTorch embedding to obtain relation embeddings. Then, the multi-headed attention mechanism in GAT (graph attention network) is introduced to weight the word embeddings and relation embeddings, further enriching the meaning conveyed by the word embeddings. Finally, the entity embedding component obtains the sentence representation by pooling the word embeddings from GAT and combining them with the entity embeddings from named entity recognition. The weighted and pooled word embeddings contain more relational information, alleviating the one-sidedness problem of the sentence representation. The experimental findings demonstrate that our model outperforms other standard methods.


Introduction
Information extraction is one of the fundamental capabilities of intelligent communication systems (ICS) to provide intelligent services to people. Information extraction is used in many different industries built upon ICS, such as web data retrieval systems [1,2], recommendation systems [3,4], the intelligent medical field [5,6], emotion monitoring systems [7], and the wireless sensing field [8,9]. Moreover, information extraction is utilized by government departments to gain insights into public opinion [10,11], etc. In the field of information extraction, relation extraction, or RE for short, is a critical task. With RE, relationships between entities are extracted from unstructured text and stored as structured data. The most common form of storage for structured data here is a triad, i.e., (head entity, relationship to tail entity, tail entity), where entity refers to a word containing a proper name such as a person, place, or organization, and the relation extraction technique can extract the relation between entities.
At first, information extraction depended on pattern matching, in which a specialist or academic in the relevant field constructed a recognition template using the properties of a sample dataset. KNN-BLOCK DBSCAN, presented by Chen et al. [12], is an efficient clustering algorithm.

The main contributions of this paper are as follows:

1. In most current models, word embedding vectors are used directly to obtain sentence embedding vectors, so the resulting sentence embedding vectors lack relational information. We therefore use the attention mechanism of the graph attention network to weight the word embedding vectors of the text: the word embedding vector trained by RoBERTa is weighted with the relation embedding vector, so the weighted word embedding vector contains relational information. The sentence embedding vector obtained from this word embedding vector then also includes relational information.

2. We extract entities from the training corpus with the help of the WCL-BBCD model. The entities are converted into word embedding vectors, i.e., entity embedding vectors, and the sentence embedding vector and the entity embedding vector are stitched together to form the input vector of the final fully connected layer. As a result, the sentence embedding vector contains both relational and entity information, which gives it a richer meaning.

3. The REEGAT model is proposed to alleviate the one-sidedness problem based on the above approach. On the SemEval-2010 Task 8 and Wiki80 datasets, REEGAT surpasses the other models described in this study.
The remainder of this paper is organized as follows to present our work: Section 2 provides a review of related work in the field; Section 3 presents the suggested approach in full detail; Section 4 discusses and analyzes the experimental findings obtained through our system; and Section 5 summarizes and concludes the paper, highlighting the significance and implications of our research findings.

Relation Extraction
In natural language processing, a relation is a connection between two or more entities. RE aims to identify and extract these relations from text, resulting in structured data representing a triad with the corresponding entities.
The three primary approaches to relation extraction are rule-based, deep-learning-based, and traditional machine-learning-based methods. Rule-based methods rely on the expertise of linguistic experts to develop grammar rules that facilitate relation classification. While this approach may be effective in some domains, it may not be as versatile as other methods. A potential way to improve the effectiveness of text pattern methods, such as the one proposed by Nakashole et al. [20], is to incorporate prior knowledge, such as knowledge graphs, into the representation of entities and their relationships. This approach could enhance the accuracy of the extracted relationships and make them more useful for downstream applications.
The rule-based approach has the advantage of having high extraction accuracy in a constrained field. The disadvantage is that the construction of its rules requires a lot of human resources, and the portability of the completed regulations is poor.
Zhang et al. summarized, in the paper review [21], that the current traditional relation extraction methods are mainly based on statistical language models. In general, supervised classical machine-learning-based approaches can be divided into two categories: kernel-function-based methods [22] and feature-engineering-based methods [23]. Machine-learning-based methods that use feature engineering convert text features into vectors using machine learning algorithms, while methods based on kernel functions calculate entity similarity using kernel functions.
Most recently, deep-learning-based methods, such as graph neural networks (GNN) and distant supervision (DS), have received extensive attention from researchers. Wu et al. [24] summarized the development history, principles, applications, and research directions of graph neural networks. GNN were first proposed by Scarselli et al. [25] and can effectively represent the relationships between entities in the field of relation extraction. The distant supervision assumption holds that if two entities have a certain relationship in a knowledge base and appear together in a sentence, then that sentence must express the relationship between the two entities in some way. Graph convolutional networks (GCN) [26] belong to the GNN family and introduce convolution during training. Zhang et al. [27] presented a pruned-tree-based graph convolutional network (contextualized graph convolutional networks, C-GCN), which can compute the shortest path between two entities that may have a relationship. Guo et al. [28] presented an attention-based graph convolutional network (attention guided graph convolutional networks, AGGCN), which can automatically learn and select subgraph structures that are helpful for relation extraction tasks. Jin et al. [29] observed that previous works identify the relationship between only two entities at a time, ignoring the influence of other relationships between entities in the same context, and therefore proposed using GCN to learn the dependencies between relations. In addition, Wu et al. proposed a deep-reinforcement-learning-based method for autonomous target search by UAVs in complex disaster scenes [30].
In general, rule-based, supervised, and semi-supervised methods based on traditional machine learning and supervised methods based on deep learning are well-suited for RE in restricted domains. On the other hand, remotely supervised and unsupervised methods are more appropriate for RE in open domains.
Relation extraction is commonly considered a form of text classification task [14], which can be performed using a text classification model. Large-scale pre-trained language representation models based on transformer [19], such as BERT [15] and RoBERTa [17], have demonstrated exceptional performance on several natural language processing tasks, including text classification. Consequently, these models are also well-suited for relation extraction.
The basic steps to perform relation extraction tasks with large-scale pre-trained language representation models are:

1. Given a text to be classified, the embedding of each word in the text is obtained from a large-scale pre-trained language representation model.

2. The word embedding of the first token in the sentence is pooled to produce the sentence embedding. Tokens are the output of the text input to the tokenizer. The tokenizer of these large-scale pre-trained language representation models usually adds a starting token to the front of the sentence: BERT adds the "[CLS]" token, and RoBERTa adds the "<s>" token. The starting token is the first token in the sentence.

3. The final prediction vector for classification is obtained by feeding the sentence embedding of the first token into a fully connected layer. Typically, the fully connected layer's output dimension is set to the number of relation types, and the resulting vector represents the score assigned to each relation type for the input sentence. The relation type with the highest score is chosen as the relation extracted from the sentence.
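The three steps above can be sketched in PyTorch. Here a toy `nn.Embedding` layer stands in for a pre-trained encoder such as BERT or RoBERTa, and the class name, vocabulary size, and dimensions are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Steps 1-3: encode the text, pool the first token, score each relation type."""
    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        # Stand-in for a pre-trained encoder (BERT/RoBERTa would go here).
        self.encoder = nn.Embedding(1000, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_relations)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Step 1: (batch, seq_len, hidden) word embedding for every token.
        word_emb = self.encoder(token_ids)
        # Step 2: pool the first token ("[CLS]" / "<s>") as the sentence embedding.
        sent_emb = word_emb[:, 0, :]
        # Step 3: fully connected layer maps to one score per relation type.
        return self.classifier(sent_emb)

model = RelationHead(hidden_size=32, num_relations=10)
scores = model(torch.randint(0, 1000, (2, 12)))  # batch of 2 toy sentences
predicted = scores.argmax(dim=-1)                # highest-scoring relation type
```

The output dimension of the classifier equals the number of relation types, and `argmax` implements the final selection rule.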

BERT
In 2018, Google introduced BERT, a cutting-edge pre-trained language representation model that has gained significant popularity in recent years [15]. The bi-directional transformer, the central component of the BERT architecture, uses self-attention mechanisms to capture complex word relationships [19]. Token, segment, and position embeddings are added together to create the input vector for BERT. The bi-directional transformer then produces the output by merging the hidden-layer vectors. This architecture has demonstrated high performance on assorted natural language processing tasks, including text classification, text generation, and relation extraction.
The introduction of BERT marked a significant breakthrough in natural language processing. Among the most innovative features of the model is its pre-training method, which incorporates two fundamental techniques: masked language modeling (MLM) and next sentence prediction (NSP). MLM randomly selects 15% of the tokens for prediction; of these, 80% are replaced with "[MASK]", 10% are replaced with a random token, and 10% are left unchanged. NSP, on the other hand, focuses on the relationship between two sentences, which is essential for tasks such as natural language inference and question answering. Pre-trained BERT models allow downstream tasks such as named entity recognition to be completed with remarkable accuracy.
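The MLM corruption policy can be sketched in plain Python. The function name, toy vocabulary, and seed below are illustrative assumptions; only the 15%/80%/10%/10% policy comes from BERT's pre-training procedure:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # toy vocabulary for random swaps

def bert_mask(tokens, select_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            targets[i] = tok           # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK    # 80%: replace with the mask tag
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # else 10%: leave the token unchanged
    return corrupted, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20
corrupted, targets = bert_mask(tokens, seed=1)
```

Only the selected positions (`targets`) contribute to the MLM loss; the rest of the sequence is passed through untouched.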

Graph Neural Networks
The graph is a data structure that can effectively handle unstructured data and can be applied in many fields; Figure 1 shows three graph structures. For example, in social networking, a social network graph can be formed from users' social relationships to predict user types. In e-commerce, the interaction history between users and products can be analyzed to recommend products accurately. In citation networks, the citing and cited relationships between documents can be analyzed to classify documents. However, traditional neural networks, such as convolutional neural networks, are difficult to apply to graph-structured data because of the complex, diverse, and variable nature of graph structures. Since each graph has different nodes and each node has a different number of neighboring nodes, the translation invariance required by convolutional neural networks no longer holds, so neither sampling nor pooling operations can be performed directly on graph data. Graph-based neural networks have gained popularity in information extraction due to their ability to capture rich dependencies between entities. Specifically, GAT (graph attention network) [31] and GCN (graph convolutional network) [26] are two well-known graph neural network models that have been applied in NER and RE. In NER, each word is treated as a graph node, and edges are added between nodes based on their entity types. By contrast, in RE, the graph treats each word as a node, with weighted edges representing the probability of a relationship between words. The advantage of graph-based neural networks is that they can incorporate contextual information from the entire sentence, thus improving the accuracy of entity recognition and relation extraction.

Framework of the Proposed REEGAT Model
In the context of deep learning, relation extraction techniques likewise no longer need to incorporate feature engineering techniques such as those used in traditional machine learning. Named entity recognition needs to focus more on the meaning of the words themselves, while relation extraction needs to focus more on the meaning of the sentences themselves. In general, the sentence embedding vector obtained by pooling the word embedding vector of the first token in the sentence inevitably ignores the information of other words. If the ignored information concerns an entity word, it will affect the prediction performance of the model. Therefore, if only this sentence embedding vector is used as the input vector of the fully connected layer in the final prediction process, there is a certain problem of one-sidedness.
To alleviate the problem of one-sidedness caused by using only the sentence embedding vector as the final fully connected layer input vector, this section proposes the relation extraction model REEGAT (RoBERTa entity embedding with graph attention networks) based on a graph neural network. The REEGAT model uses the RoBERTa model as the word embedding encoder and the PyTorch embedding model as the relation embedding encoder. The general scheme of the model is shown in Figure 2. The REEGAT model alleviates the one-sidedness problem in two parts: the first adds relation information to the sentence embedding vector, and the second adds entity word information to it. The two parts are implemented as follows:

1. Adding relational information: This part first uses the attention mechanism in the graph attention network (GAT) model to weight the word embedding vectors of the text. The word embedding vector generated by the RoBERTa model is weighted with the relation embedding vector generated by the PyTorch embedding model. Then, the first weighted word embedding vector is used to weight the relation embedding vector. Finally, the first weighted relation embedding vector is used to weight the first weighted word embedding vector, so that the second weighted word embedding vector contains the relation information. The sentence embedding vector is then obtained by pooling the word embedding vector of the first token, and accordingly it also contains the relation information.

2. Adding entity word information: The WCL-BBCD model extracts the entities in the sentences, and the word embedding vectors of the corresponding words, i.e., the entity embedding vectors, are obtained from the twice-weighted word embedding vector produced in part 1. The entity embedding vector is then combined with the sentence embedding vector to form the final input vector. Using this sentence representation, with its richer meaning, in place of the original sentence embedding vector as the final input vector alleviates the problem of one-sidedness to a certain extent.
The flowchart of the REEGAT model is shown in Figure 2. RoBERTa is employed to produce word embedding, which the GAT model uses as input. The semantic representation of the word embedding is enhanced by the multi-head attention mechanism used by the GAT model to assign weights to the word embedding. Then, the entity embedding component is utilized to obtain the sentence embedding, using the weighted word embedding as input and the specific pooling algorithm. Next, the corresponding entity and its embedding are obtained through the entity recognition result of our proposed WCL-BBCD [32]. Finally, the sentence embedding and entity embedding are stitched to form the final prediction vector.
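The final stitching step can be sketched as follows. The shared embedding size `d`, the use of separate head- and tail-entity embeddings, and the count of 10 relation types are illustrative assumptions:

```python
import torch
import torch.nn as nn

d = 16                                # hypothetical shared embedding size
sent_emb = torch.randn(1, d)          # pooled sentence embedding
head_emb = torch.randn(1, d)          # head-entity embedding (e.g., from WCL-BBCD entities)
tail_emb = torch.randn(1, d)          # tail-entity embedding

# Stitch sentence and entity embeddings into the final prediction vector.
final_vec = torch.cat([sent_emb, head_emb, tail_emb], dim=-1)

# The fully connected layer maps the stitched vector to one score per relation type.
classifier = nn.Linear(3 * d, 10)
scores = classifier(final_vec)
```

Because the concatenated vector carries both sentence-level and entity-level information, the classifier sees a richer representation than the sentence embedding alone.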

Embedding Layer
The first component in the REEGAT model is the embedding layer. The primary purpose of using an embedding layer is to capture words' semantic and syntactic information in a dense, low-dimensional space, which can be easily fed into neural networks for further analysis.
The embedding layer consists of the RoBERTa model and the embedding model. The RoBERTa model is used as the word embedding encoder, and the PyTorch embedding model is used as the relation embedding encoder. The RoBERTa model is consistent with the BERT model in that it is also based on the bi-directional transformer. We chose the RoBERTa model because it differs from the BERT model in two respects.
The first difference between RoBERTa and BERT is the word segmentation algorithm. BERT utilizes WordPiece as its word segmentation algorithm, while RoBERTa uses BPE (byte pair encoding). BPE is implemented as follows: (1) build a character dictionary based on the corpus; (2) count the most frequent neighboring character pairs in the corpus and add the merged pairs to the dictionary; (3) repeat step (2) until the specified number of iterations is reached or the dictionary reaches a specified size. The difference between BPE and WordPiece is that WordPiece merges the subwords whose merge most increases the likelihood of the language model, whereas BPE directly merges the two subwords with the highest frequency of co-occurrence. RoBERTa uses BPE because its pre-training uses a larger corpus than BERT's, which reduces the overhead of calculating the likelihood value in WordPiece. In addition, BPE uses byte encoding to effectively alleviate the "unknown token" problem.
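The three BPE steps can be sketched as a plain-Python merge loop. The function name and toy corpus are illustrative assumptions; real BPE implementations also handle word frequencies and byte-level fallback:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    words = [list(w) for w in corpus]   # step (1): character-level dictionary
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # step (2): most frequent neighboring pair
        merges.append(best)
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words               # step (3): repeat until the budget is spent
    return merges, words

merges, words = bpe_merges(["low", "low", "lower"], num_merges=2)
```

On this toy corpus the first two merges are ("l", "o") and then ("lo", "w"), turning "low" into a single symbol after two iterations.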
The second difference between RoBERTa and BERT lies in the pre-training tasks, MLM and NSP. In BERT, masking is static: the masked tokens are tagged with "[MASK]" during the data preprocessing stage, so masking is performed only once. In contrast, RoBERTa uses dynamic masking, where random masking is performed anew each time a sequence is input to the model, so that even when a tremendous amount of data is input, the model is exposed to varied masking patterns for the same sequence rather than a single fixed one. In addition, RoBERTa removes the NSP task.
In summary, these two modifications make the performance of RoBERTa better than that of BERT. Thus, we chose RoBERTa instead of BERT to generate the word embedding.

Weighting Layer
After the embedding layer comes the weighting layer, which assigns different weights to the features learned by the embedding layer. The primary purpose of using a weighting layer is to allow the model to assign more importance to certain features or words based on their relevance to the task at hand. In choosing the model components, we compared GAT and GCN and finally chose GAT. Our requirement is that neighboring nodes should carry dependencies of differing importance. However, GCN cannot assign different weights to the neighboring nodes of a node in the graph, which prevents it from capturing such spatial information effectively. Furthermore, using GCN requires knowledge of the graph structure before training, which leads to poor performance of a trained GCN model on other graph structures, i.e., poor generalization ability.
GAT (graph attention network) was proposed by Veličković et al. in 2018. It uses the attention mechanism to determine the weighted sum of nearby node features, substituting the fixed aggregation function utilized in GCN. The benefit of introducing the attention mechanism is that it addresses the limitation of GCN, which cannot assign different weights to adjacent nodes of a graph node. By introducing the self-attention mechanism, information between adjacent nodes can be obtained without needing information from the entire graph. This makes GAT more efficient and effective than GCN at capturing the graph's dependencies and the relationships between nodes.
The REEGAT model which we propose comprises two graph attention layers. The graph attention layer uses the multi-headed attention mechanism to weight the relation embedding and the word embedding. RoBERTa provides the word embedding, and PyTorch embedding provides the relation embedding. The graph attention layer first uses the relation embedding to weight the word embedding through the multi-headed attention mechanism:

WE^1_attn = MultiHead(WE, RE)    (1)

where WE represents the word embedding, RE represents the relation embedding, MultiHead represents the multi-head attention mechanism, and WE^1_attn represents the output vector of the first multi-head attention mechanism.

Then, WE^1_attn and the word embedding WE are added to obtain the new vector WE_attn, which is spliced with WE to obtain the new word embedding WE_concat:

WE_attn = WE^1_attn + WE    (2)

WE_concat = [WE_attn ; WE]    (3)

Then, the dimensionality of WE_concat is reduced through a fully connected layer and activated by the sigmoid activation function:

weight = Sigmoid(Dense(WE_concat))    (4)

where Dense represents the fully connected layer and weight represents the weights. The output range of the sigmoid is (0, 1), so the range of the weight is also (0, 1). The final output is the weighted word embedding WE^1:

WE^1 = weight · WE    (5)

The weighted relation embedding is computed similarly to the weighted word embedding, simply by switching the positions of WE and RE in Equation (1). Since the length of each sentence in a batch is inconsistent, padding complements the sentences that are not long enough. However, the word embedding used for padding impacts the calculation of weights, so a mask must be added when first calculating RE_attn. If the mask is set to 1, the multi-headed attention mechanism sets the corresponding part of the key to "-inf". The reason for setting it to "-inf" is that the weight of this part of the key tends to 0 after the softmax, so it does not affect the weight calculation.
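A minimal PyTorch sketch of one such weighting layer follows the procedure just described: multi-head attention between words and relations, residual addition, concatenation, and a sigmoid gate. The class layout and dimensions are assumptions, and for simplicity padding is handled here by zeroing padded positions in the output rather than by the "-inf" key mask:

```python
import torch
import torch.nn as nn

class WeightingLayer(nn.Module):
    """Sketch of one graph attention weighting layer: relation embeddings
    attend over word embeddings, then a sigmoid gate re-weights the words."""
    def __init__(self, dim: int, heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dense = nn.Linear(2 * dim, dim)  # reduces the concatenated vector

    def forward(self, we, re, pad_mask):
        # Words as queries, relations as keys/values: MultiHead(WE, RE).
        we_attn, _ = self.attn(we, re, re)
        we_sum = we + we_attn                          # residual addition
        we_concat = torch.cat([we_sum, we], dim=-1)    # splice with original WE
        weight = torch.sigmoid(self.dense(we_concat))  # gate values in (0, 1)
        we1 = weight * we                              # weighted word embedding
        # Simplification: zero out padded positions instead of masking keys.
        return we1.masked_fill(pad_mask.unsqueeze(-1), 0.0)

layer = WeightingLayer(dim=8, heads=2)
we = torch.randn(2, 5, 8)                 # 2 sentences, 5 tokens each
re = torch.randn(2, 3, 8)                 # 3 relation embeddings per sentence
pad = torch.zeros(2, 5, dtype=torch.bool)
pad[0, 4] = True                          # last token of sentence 0 is padding
we1 = layer(we, re, pad)
```

Stacking two such layers, as REEGAT does, re-applies the attention with the already-weighted embeddings as inputs.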
The outputs of the first graph attention layer are the weighted word embedding and the weighted relation embedding. These two weighted vectors are input to the second graph attention layer, which again uses the relation embedding to weight the word embedding through the multi-headed attention mechanism; the re-weighted word embedding is the output of the second attention layer.

Entity Embedding Component
The entity embedding component is mainly proposed to acquire entity and sentence embedding. In NER, entity embedding is typically generated using methods such as WCL-BBCD, which extracts entities from the training corpus and transforms them into word embedding, thereby obtaining entity embedding. The sentence embedding is derived by pooling the word embedding computed by GAT. The algorithm is shown in Algorithm 1.

Algorithm 1 Pooling Algorithm
Require: WE^2
Ensure: PoE_i
1: WE^2_0 = WE^2[:, 0, :]  // take the word embedding of the first token
2: x = Dropout(WE^2_0)  // dropout regularization
3: x = Tanh(Dense(x))  // fully connected layer with tanh activation
4: PoE_i = Dropout(x)  // final dropout, yielding the pooled embedding

The following is a description of the algorithm. The pooling algorithm is a method for extracting features from word embeddings in natural language processing tasks. Given a matrix of word embeddings WE^2, the algorithm selects the embedding of the first token in the sequence, applies dropout regularization to it, and then passes it through a fully connected layer followed by a hyperbolic tangent activation function. Finally, the output is again subject to dropout regularization, resulting in the final pooled embedding PoE_i.
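A PyTorch sketch of this pooling procedure, with an illustrative embedding size and dropout probability:

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Sketch of the pooling algorithm: first-token selection,
    dropout, dense + tanh, then dropout again."""
    def __init__(self, dim: int, p: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p)
        self.dense = nn.Linear(dim, dim)

    def forward(self, we2: torch.Tensor) -> torch.Tensor:
        x = we2[:, 0, :]              # WE^2_0 = WE^2[:, 0, :] (first token)
        x = self.dropout(x)           # dropout regularization
        x = torch.tanh(self.dense(x)) # fully connected layer + tanh
        return self.dropout(x)        # final dropout yields PoE_i

pooler = Pooler(dim=16).eval()        # eval() disables dropout for this demo
we2 = torch.randn(2, 7, 16)           # 2 sentences, 7 tokens, 16-dim embeddings
poe = pooler(we2)
```

In evaluation mode the dropout layers are identity maps, so the output is simply the tanh-activated dense projection of the first token's embedding.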

Experimental Setup
In this work, we run experiments utilizing the SemEval-2010 Task 8 dataset [33] and the Wiki80 dataset [34]. The SemEval-2010 Task 8 dataset defines a task of relation extraction, that is, given two labeled words and sentences containing them, predict the proper relation from a given list of relation types. The dataset defines a total of "9+1" relations, in which each relation in the "9" relations can be subdivided into 2 relations according to the order of labeling nouns. For example, in the "9" relations, the causal relation can be further subdivided into 2 relations: "A causes B (cause-effect (e1, e2))" or "B causes A (cause-effect (e2, e1))". In this paper, we continue to use the official divided training set for training and the divided test set for testing. There are 8000 pieces of data in the training set and 2717 pieces in the test set. The Wiki80 dataset is extracted from the FewRel dataset by the NLP team at Tsinghua University. This dataset is also appropriate for the relation extraction task and is refined by hand without noise. In the Wiki80 dataset, 80 relations are defined. The officially partitioned training set contains 630 training data for each relation, and the test set contains 70 test data for each relation.
We use Precision, Recall, and F1 score as the evaluation standards for the experimental data when assessing the results.
A single machine with a single GPU, an Intel(R) Xeon(R) CPU running at 2.20 GHz with 25 GB of memory, and an NVIDIA Tesla P100 with 16 GB of memory, make up the experimental hardware environment for this paper.

Evaluation Metrics
In relation extraction, the model's performance is typically evaluated using the Precision, Recall, and F1 score metrics. These metrics are computed from the number of true positives (TP), false positives (FP), and false negatives (FN) in the data. TP is the number of samples correctly assigned to a relation type, FP is the number of samples incorrectly assigned to a relation type, and FN is the number of samples that should have been assigned to a relation type but were not. By computing these values, we can judge whether the model is accurate and thorough in classifying relation types.
Precision is the proportion of samples predicted as belonging to a relation type that actually belong to it:

Precision = TP / (TP + FP)

Recall is the proportion of the actual positive samples in the dataset that are correctly classified:

Recall = TP / (TP + FN)

The F1 score is a comprehensive index that balances the effects of both Precision and Recall; it is their harmonic mean:

F1 = 2 × Precision × Recall / (Precision + Recall)

For the above evaluation indicators, there are two calculation methods, namely "macro-averaged" and "micro-averaged". The macro-average calculates the Precision, Recall, and F1 score for each category individually and then averages each of the three indicators over all categories. The micro-average calculates the Precision, Recall, and F1 score over the overall data, without distinguishing categories. In this paper, the evaluation indicators used for RE are the macro-averaged Precision, Recall, and F1 score. The SemEval-2010 Task 8 dataset is evaluated using the official script "semeval2010_task8_scorer-v1.2.pl".
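The macro-averaged metrics can be sketched in plain Python; the function name and toy relation labels are illustrative, and ties in counting follow the per-class definitions above:

```python
from collections import Counter

def macro_prf(gold, pred):
    """Macro-averaged Precision/Recall/F1: compute each metric per
    relation type, then take the unweighted mean over types."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted type p, but it was wrong
            fn[g] += 1  # true type g was missed
    ps, rs, fs = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["cause-effect", "cause-effect", "other", "other"]
pred = ["cause-effect", "other", "other", "other"]
p, r, f = macro_prf(gold, pred)
```

A micro-average would instead pool the TP/FP/FN counts across all types before computing a single Precision, Recall, and F1.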

Baseline Models
In order to verify the validity of the REEGAT model proposed in this paper, this subsection conducts comparison experiments on the SemEval-2010 Task 8 and Wiki80 datasets and analyzes the experimental results. The comparison models set up for the experiments are:

1. C-GCN [27]: C-GCN uses BiLSTM to extract vectors containing semantic and syntactic information and uses them as input to the GCN.

3. AGGCN [28]: AGGCN is a graph convolutional neural network (GCN) that utilizes a multi-headed attention mechanism to transform a traditional graph into a fully connected weighted graph, enabling it to obtain information about neighbor nodes with different weights.

5. RoBERTa: RoBERTa, used in the REEGAT model, is the word embedding model.

6. R-BERT [35]: R-BERT is based on BERT and combines the target entity information to process the relation extraction task.

7. R-RoBERTa: R-RoBERTa is based on RoBERTa and combines the target entity information to process the relation extraction task.

8. RIFRE [36]: RIFRE is based on BERT and adds a heterogeneous graph neural network to mine the possible relations between entities.

9. ROFRE: ROFRE is constructed from RIFRE by replacing the word embedding model from BERT with RoBERTa. ROFRE is constructed by us for comparison purposes and has not been published elsewhere.

10. REEGAT: The model proposed in this paper.

Relation Extraction Results
The experiments related to RE were conducted in the aforementioned experimental environment, and the results obtained on the SemEval-2010 Task 8 and Wiki80 datasets are presented in Table 1. REEGAT incorporates the target entity information obtained by WCL-BBCD, proposed in our previous paper [32], while also adopting the idea of adding graph neural networks from RIFRE and changing the attention mechanism used in RIFRE from a self-attention mechanism to a multi-headed attention mechanism; these strategies effectively improve the prediction accuracy of RE. On the SemEval-2010 Task 8 dataset, the prediction accuracy of RIFRE is higher than that of R-BERT, and the proposed REEGAT obtains the highest Precision, Recall, and F1 scores, which are 1.30%, 0.31%, and 0.81% better than those of RIFRE, respectively. On the Wiki80 dataset, the prediction accuracy of R-BERT is higher than that of RIFRE, and the Precision, Recall, and F1 scores of REEGAT are improved by 0.78%, 0.76%, and 0.78%, respectively, compared with R-BERT.
Compared with the BERT model, the graph neural-network-based relation extraction model REEGAT proposed in this paper has three differences. First, it uses the RoBERTa model to replace the BERT model to train the word embedding vector; second, it proposes using the graph attention network model to weight the word embedding vector using the relational embedding vector; third, it proposes using the entity embedding component to stitch the entity embedding vector and the sentence embedding vector to form a new input vector. To verify the role of each component in REEGAT, ablation experiments are conducted in this paper. In addition, we add a ranking in the last column of the table, indicating the ranking of the total scores of Precision, Recall, and F1 score for each model in the previous section, to demonstrate the performance of GAT and the presence or absence of the entity embedding component on the improvement of the results. The components included in the comparative model in this paper are shown in Table 2. The following is an analysis of the experimental findings from the ablation experiments:

1.
The experiment outcomes show that GAT can significantly increase prediction accuracy. Specifically, RIFRE, which utilizes GAT, outperforms BERT. Similarly, the performance of ROFRE, which uses GAT, is better than that of RoBERTa, and the performance of REEGAT, which incorporates GAT, is better than that of R-RoBERTa (Figure 4). In summary, the ablation experiments show that the model's prediction accuracy can be significantly improved by using both the GAT and the entity embedding component.

Hyperparameter Study
Different hyperparameters can significantly affect the model's prediction accuracy, and selecting appropriate hyperparameters is crucial for achieving optimal performance. In this section, we experiment with the SemEval-2010 Task 8 dataset to investigate the impact of hyperparameters on the results. The experiments follow a controlled approach: the number of heads in the multi-head attention mechanism is varied while all other hyperparameters are kept constant, so that the effect of this single parameter on the model's performance can be observed. As shown in Figure 5, on the SemEval-2010 Task 8 dataset the REEGAT model achieved the highest macro-average F1 score (91.95) when using the multi-head attention mechanism with two heads. Specifically, the macro-average Precision peaked (91.38) with two heads and then decreased as the number of heads increased, whereas the macro-average Recall first decreased and then increased, peaking (93.07) when the number of heads was 16.

Figure 5. The effect of the number of heads on experimental outcomes.
As illustrated in Figure 5, on the Wiki80 dataset the macro-average Precision, Recall, and F1 score follow a consistent trend: they first decrease, then increase, and finally decrease again as the number of heads grows. All three peak when the number of heads is eight, at 87.83, 87.82, and 87.82, respectively.
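The controlled sweep described above can be sketched as follows. The tensors here are random placeholders; the main practical constraint is that PyTorch requires the embedding dimension to be divisible by the head count, which makes head counts such as 2, 4, 8, and 16 natural choices for a 768-dimensional model:

```python
# Sketch of the controlled head-count sweep: only num_heads varies,
# while the hidden size and all other settings stay fixed. The input
# is a random placeholder, not data from our experiments.
import torch
import torch.nn as nn

hidden = 768                        # must be divisible by each head count
x = torch.randn(1, 12, hidden)      # placeholder word embeddings

for num_heads in (2, 4, 8, 16):
    torch.manual_seed(0)            # fixed init for a fair comparison
    mha = nn.MultiheadAttention(hidden, num_heads=num_heads,
                                batch_first=True)
    out, _ = mha(x, x, x)
    # changing the head count changes how the 768 dimensions are split
    # across heads, but never the output shape
    assert out.shape == x.shape
    print(num_heads, tuple(out.shape))
```

Each run in the sweep would then be trained and scored as usual, producing one point per head count on the curves in Figure 5.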

Discussion
The proposed model outperformed earlier state-of-the-art models, achieving the best results on all three evaluation metrics used in the experiments on both the SemEval-2010 Task 8 and Wiki80 datasets. These findings support our working hypothesis: using the GAT model to strengthen the connection between the word embedding vectors and the relational embedding vectors, together with the entity embedding method, can alleviate the one-sidedness problem, and the experiments demonstrate that doing so effectively improves the accuracy of RE.
From a broader perspective, our study's practical contributions to the field of RE have important implications for various real-world applications. For instance, the proposed model's accuracy and efficiency can aid in extracting relationships between entities in diverse texts, such as biomedical literature, news articles, and social media posts. The improved performance of our model could also help to automate tasks that require analyzing a large volume of text data, such as sentiment analysis or topic modeling. Our research provides a valuable tool for professionals in various industries who rely on accurate and efficient RE.
Of course, the REEGAT model proposed in this paper still has certain limitations. The graph neural-network-based relation extraction model currently considers only the relation between two entities, whereas in practical applications a sentence may contain multiple entities with relationships among them. Therefore, to improve the generality of the model, relation extraction among multiple entities needs to be considered in future work.
In addition, Michel et al. [37] proposed a method for selecting the optimal number of attention heads according to the task and the available computing resources; in some cases, the number of heads affects the model's performance only slightly. In this paper, different numbers of heads are tested experimentally to obtain the results. We believe that introducing the method proposed by Michel et al. could further optimize our results under limited computing power, and in future research we will try to add this method to our experiments to obtain more efficient results.
Moving forward, there are several future directions for research in this field. One area of interest in NER research is using contrastive learning to enhance the quality of entity embeddings. Further exploration of the use of GAT in RE can be conducted, potentially incorporating additional features such as syntactic and semantic information. Finally, the representation of entities and their relationships can be improved using prior knowledge, such as knowledge graphs, in word embedding research.
In summary, our proposed model outperformed state-of-the-art models and achieved impressive results in all three evaluation metrics used in our experiments. Our findings reinforce the effectiveness of entity embedding and GAT in enhancing the accuracy of RE models. Moreover, our research's contribution to the field of RE extends to its potential to improve the performance of automated text analysis tasks in various industries. The impact of our research highlights the need for further exploration of entity embedding and GAT in RE and the potential for enhancing word embeddings with prior knowledge.

Conclusions
This paper proposes the REEGAT model and verifies its effectiveness on relation extraction tasks. REEGAT mainly uses a GAT and an entity embedding component to enrich the meaning expressed in the sentence embedding used as the prediction vector, effectively alleviating the one-sidedness of using only the sentence embedding as the final prediction vector. The proposed model has been experimentally verified and compared, demonstrating superior performance on the SemEval-2010 Task 8 and Wiki80 datasets and achieving the best results on all three evaluation metrics used in the experiments.
In the future, we will investigate the potential of contrastive learning in the RE field. Additionally, we may be able to increase the efficiency of various intelligent information extraction services provided through intelligent communication systems, including named entity recognition and relation extraction, by incorporating knowledge graphs into word embedding representation.

Acknowledgments:
The authors are grateful to the anonymous reviewers for their invaluable comments and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
Abbreviated description of symbols.