Entity-Centric Fully Connected GCN for Relation Classification

Abstract: Relation classification is an important task in natural language processing and a key step in constructing a knowledge graph, where automating it can greatly reduce the construction cost. The Graph Convolutional Network (GCN) is an effective model for relation classification: it models the dependency tree of a textual instance to extract the semantic features of relation mentions. Previous GCN-based methods treat each node equally. However, different words contribute differently to expressing a relation, and the entity mentions in the sentence matter most. In this paper, a novel GCN-based relation classifier is proposed, which treats the entity nodes as two global nodes in the dependency tree. These two global nodes connect directly to all other nodes, so they can aggregate information from the whole tree with only one convolutional layer. In this way, the method both reduces the complexity of the model and generates an expressive relation representation. Experimental results on two widely used datasets, SemEval-2010 Task 8 and TACRED, show that our model outperforms all the compared baselines, which illustrates that the model can effectively utilize the dependencies between nodes and improve the performance of relation classification.


Introduction
Relation classification is an important task in natural language processing (NLP) and a key step in many natural language applications. It extracts specific event or fact information from natural language text, which helps to automatically classify, extract, and reconstruct massive amounts of content. This information usually includes entities, relations, and events; for example, time, location, and key-person relations can be extracted from news. Therefore, relation classification is widely used in information extraction [1,2], knowledge base construction [3,4], and question answering [5,6] systems.
Most existing classification models are based on deep learning, such as recurrent neural networks (RNN), convolutional neural networks (CNN), and their improved variants [7][8][9][10][11][12]. In a relation classification model, natural language processing tools first convert the words in the text sequence into low-dimensional vectors. Then, feature extractors are employed to capture word representations and sentence-level semantic representations. Finally, the semantic relations between entities are predicted by a classifier. Predicates are usually very important when predicting the relation between entities. If the distance between an entity and the predicate is too long, the key semantic information may be difficult to encode. To solve this problem, grammatical dependency trees have been used to capture long-distance semantic information, simplify complex sentences, and extract key information [13]. Xu et al. [14] applied long short-term memory networks (LSTM) to the shortest dependency path (SDP) between entities. Cai et al. [15] proposed a recurrent convolutional neural network that uses a two-channel LSTM to encode the global pattern in the SDP and a CNN to capture the local features of every two adjacent words linked by a dependency.
To accurately obtain the semantic information between entities, early relation classification models applied neural networks to the shortest dependency path between entities. Xu et al. [14] proposed applying an LSTM to the word sequence on the shortest path. Liu et al. [16] proposed using an RNN to extract subtree features and a CNN to extract shortest-path features. Miwa et al. [17] simplified the dependency tree to the subtree under the lowest common ancestor (LCA) of the entities. However, models that run directly on the dependency tree are difficult to parallelize, because it is usually difficult to perform effective batch training on trees, so their computational efficiency is very low.
Kipf et al. [18] presented the GCN, which can successfully process non-Euclidean data. GCN models are also widely used in image recognition [19], visual question answering [20], and biology [21], where they have achieved good performance. However, the traditional GCN treats all nodes in the graph as equally important, ignoring the importance of the entities in the sentence. Moreover, because a GCN captures only the information of direct neighbor nodes in one convolution, a single layer captures only short-range context; longer-range context can be reached only by stacking more GCN layers. However, current research shows that increasing the number of GCN layers for relation classification makes the model more complex [22]. Increasing the number of GCN layers also causes over-smoothing of node features, in which local features converge to similar values. Therefore, it is worth exploring an extended GCN that can capture the key information in the graph and eliminate unnecessary noise. To address these problems, we use the entity nodes as global nodes and propose an entity-centric fully connected GCN (FCGCN) model. The model uses an effective graph convolution operation to encode the dependency structure of the input sentence and then extracts an entity-centric representation for reliable relation prediction. We evaluate the model on the popular SemEval 2010 Task 8 and TACRED datasets. Compared with previous models on these two datasets, the proposed model makes better use of the importance of entities in relation classification and obtains better results. When combined with the pruning strategy, the performance improves further.
In short, the main contributions of this paper are as follows:

• We propose a neural relation classification model based on the graph convolutional network that uses the two entities as global nodes; that is, the entity nodes have edges to all other nodes.

• We use the difference vector of the entity pair as part of the relation classification constraint to make the relation classification result more accurate.

• A detailed analysis of the model and the pruning technique shows that the pruning strategy and the proposed model have complementary advantages.

Materials and Methods
In this section, we propose a novel model for relation classification. The proposed model uses entity nodes as global nodes and makes full use of the importance of the entities themselves in relation classification to produce more accurate results.
The framework is mainly composed of four modules: (1) Sequence Encoding Module, (2) Attention Module, (3) Fully Connect Module, and (4) Relation Classification Module. The details of our model are shown in Figure 1. The objective function of the proposed model consists of two parts: the loss of the entity difference classifier, $L_1$, and the loss of the relation classifier, $L_2$. $L_1$ is employed to train the difference vector of the entity pair to be classified as the true relation type, and $L_2$ is employed to optimize the model by using the contextual semantic information of the sentence.

Sequence Encoding Module
Word embedding: This paper uses GloVe [23] to convert the words of a sentence into low-dimensional vectors. Suppose $x = [x_1, x_2, ..., x_n]$ represents a sentence, where $x_i$ represents the i-th word and n is the length of the sentence. The embedding of the i-th word is denoted as $e^w_i \in \mathbb{R}^{d_w}$, where $d_w$ is the dimension of the word embedding.
Position embedding: Inspired by Zeng et al. [9], our work also considers the position feature of words. Each word has two relative distances, one to each of the entities $e_1$ and $e_2$ in the sentence, which are embedded as $e^{p_1}_i \in \mathbb{R}^{d_p}$ and $e^{p_2}_i \in \mathbb{R}^{d_p}$. For example, in the sentence "Jack Ma is the founder of the Internet company Alibaba", with $e_1$ = "Jack Ma" and $e_2$ = "Alibaba", the relative distances of the word "company" to "Jack Ma" and "Alibaba" are −7 and 1. We concatenate the word vector with the position vectors to get the representation of each word:
$$e_i = [e^w_i; e^{p_1}_i; e^{p_2}_i]$$
where $e_i \in \mathbb{R}^m$, $m = d_w + 2d_p$, and $d_w$ and $d_p$ denote the sizes of the word and position vectors.
Bi-LSTM layer: In order to encode the sequence features and context information of sentences, we concatenate the forward and backward LSTM states in the Bi-LSTM layer:
$$\overrightarrow{L_i} = \overrightarrow{\mathrm{LSTM}}(e_i), \quad \overleftarrow{L_i} = \overleftarrow{\mathrm{LSTM}}(e_i), \quad \overleftrightarrow{L_i} = [\overrightarrow{L_i}; \overleftarrow{L_i}]$$
where $d_t$ represents the dimension of the hidden layer of the LSTM and $\overleftrightarrow{L_i}$ represents a vector that contains the bidirectional semantic features.
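The relative-distance feature used by the position embedding can be illustrated with a short sketch. The sign convention below (entity index minus token index, measured to the nearest entity boundary, and 0 inside the entity) is an assumption inferred from the "company" example; the function names and the whitespace tokenization are hypothetical and not taken from the paper.

```python
def signed_offset(i, start, end):
    """Offset from token i to an entity spanning tokens start..end (inclusive).

    Assumed convention: entity index minus token index, measured to the
    nearest entity boundary, and 0 for tokens inside the entity span.
    """
    if i < start:
        return start - i  # entity lies to the right of the token
    if i > end:
        return end - i    # entity lies to the left of the token
    return 0


def relative_positions(n_tokens, e1_span, e2_span):
    """Return one (offset_to_e1, offset_to_e2) pair per token."""
    return [
        (signed_offset(i, *e1_span), signed_offset(i, *e2_span))
        for i in range(n_tokens)
    ]


tokens = "Jack Ma is the founder of the Internet company Alibaba".split()
positions = relative_positions(len(tokens), e1_span=(0, 1), e2_span=(9, 9))
company = tokens.index("company")
print(tokens[company], positions[company])  # company (-7, 1)
```

Each offset would then index a learned position-embedding table of dimension $d_p$, which is omitted here.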

Attention Module
The attention mechanism can automatically calculate the importance of different words in a sentence. The Transformer model [24] shows that self-attention, which computes the correlation between the words of a sentence, achieves better results in sequence encoding.
In a text sequence, the importance of words often differs, and entity nodes are especially important for relation classification. Self-attention can capture the internal structure of a sentence by learning the interactions between its words, yielding an attention score between every pair of words. Therefore, more semantic information between words can be obtained through the self-attention mechanism. In this paper, we use multi-head attention to distinguish the importance of words in a sentence. The input is the sequence of word embeddings $s = [e^w_1, e^w_2, ..., e^w_n]$. In the self-attention process, we set the query, key, and value equal to $s$, that is, $Q = K = V = s \in \mathbb{R}^{n \times d_w}$. We get the hidden representation of the sentence as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_w}}\right)V$$
In the multi-head self-attention module, we pass Q, K, and V through linear transformations, input them into the scaled dot-product attention, and repeat these steps h times. The i-th head is therefore calculated as
$$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
The output of the multi-head attention layer is obtained by concatenating the outputs of the h heads and applying a linear transformation to generate the final attention values:
$$M = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^o$$
where $M = [m_1, m_2, ..., m_n]$ is the hidden representation of the sentence generated by multi-head self-attention, $W^Q_i$, $W^K_i$, and $W^V_i$ are the projection matrices of the i-th head, and $W^o \in \mathbb{R}^{d_t \times d_t}$ represents the output weight matrix.
To make better use of the information captured by multi-head self-attention, the same feedforward network is applied to each $m_i$ in the multi-head self-attention output $M$. Following the position-wise feedforward form of [24], the calculation is
$$\mathrm{FFN}(m_i) = \max(0, m_i M_1 + \beta_1) M_2 + \beta_2$$
where $M_1 \in \mathbb{R}^{d \times d_q}$ and $M_2 \in \mathbb{R}^{d_q \times d}$ represent the transformation matrices, $\beta_1$ and $\beta_2$ represent the biases, and $d_q$ represents the size of the hidden layer. Inspired by Vaswani et al. [24], we integrate the multi-head self-attention and the output of the feedforward layer through a residual connection [25], and then perform layer normalization [26]:
$$C = \mathrm{LayerNorm}(\acute{M} + \mathrm{FFN}(\acute{M}))$$
where $C = \{c_1, c_2, ..., c_n\} \in \mathbb{R}^{n \times d}$ and $\acute{M} = \mathrm{LayerNorm}(M + s)$ represents the residual connection between the sentence embedding and the multi-head self-attention layer output. The final word attention results are then obtained from $C$ using a square matrix $W \in \mathbb{R}^{d_w \times d_w}$ and a random query vector $r \in \mathbb{R}^{d_w \times 1}$.
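As a minimal sketch of the self-attention step in this module, the following pure-Python example computes single-head scaled dot-product attention with Q = K = V = s. It assumes toy 2-dimensional embeddings and deliberately omits the per-head linear projections, the feedforward network, the residual connections, and layer normalization. The helper names (`softmax`, `matmul`, `self_attention`) are illustrative and not from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(s):
    """Single-head scaled dot-product attention with Q = K = V = s."""
    d = len(s[0])
    k_t = [list(col) for col in zip(*s)]  # transpose of K
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(s, k_t)]
    weights = [softmax(row) for row in scores]  # one distribution per word
    return weights, matmul(weights, s)


# Three toy word vectors (hypothetical values).
s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, m = self_attention(s)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a probability distribution over the words of the sentence, so a word that is more similar to the query word receives a larger share of its attention.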

Fully Connect Module
The graph convolutional network [18] is an adaptation of the convolutional neural network [27] for encoding graphs. A graph with n nodes can be represented by an n × n adjacency matrix A, where $A_{ij} = 1$ if there is an edge from node i to node j. For the convolution operation of node i in the l-th layer, we denote the input vector, which is the node representation from layer l−1, as $h^{(l-1)}_i$ and the output vector as $h^{(l)}_i$. The convolution formula is as follows:
$$h^{(l)}_i = p\left(\sum_{j=1}^{n} A_{ij} W^{(l)} h^{(l-1)}_j + b^{(l)}\right)$$
where $W^{(l)}$ represents the weight matrix, $b^{(l)}$ represents the bias vector, and p represents the activation function (e.g., ReLU).
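The layer-wise propagation can be sketched in plain Python as follows. This is only an illustration under assumptions not stated in the paper: a three-node chain graph with undirected edges and no self-loops, one-hot input features, an identity weight matrix, and a zero bias, chosen so that the output shows which nodes contributed to each representation.

```python
def gcn_layer(adj, h, weight, bias):
    """One graph convolution: h_i' = ReLU(sum_j adj[i][j] * (W h_j) + b)."""
    n = len(adj)
    # Apply the weight matrix to every node vector once: W h_j.
    transformed = [
        [sum(w * x for w, x in zip(w_row, h_j)) for w_row in weight]
        for h_j in h
    ]
    out = []
    for i in range(n):
        summed = [
            bias[k] + sum(adj[i][j] * transformed[j][k] for j in range(n))
            for k in range(len(bias))
        ]
        out.append([max(0.0, v) for v in summed])  # ReLU
    return out


# Chain 0 - 1 - 2 without self-loops (toy graph, not from the paper).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

h1 = gcn_layer(adj, identity, identity, zero_bias)
h2 = gcn_layer(adj, h1, identity, zero_bias)
print(h1[0])  # [0.0, 1.0, 0.0]: after one layer node 0 sees only node 1
print(h2[0])  # [1.0, 0.0, 1.0]: node 2 reaches node 0 only after two layers
```

The second print shows why a plain GCN needs one extra layer for every additional hop of context.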
To emphasize the importance of entities in the dependency tree, and inspired by Guo et al. [28], this paper proposes a new extended dependency tree in which the entities are used as global nodes that connect directly to all other nodes. Since a node is not connected to itself in the dependency tree, the information of $h^{(l-1)}_i$ would never be carried over to $h^{(l)}_i$; it is therefore necessary to add a self-loop to each node. The schematic diagram is shown in Figure 2. The global nodes have three advantages. First, they emphasize the importance of entity nodes in relation classification and employ entity information to improve the accuracy of relation classification. Second, global nodes have a global view of the input graph: a global node gathers the information not only of its dependency neighbors but of all other nodes in a single convolution operation. Finally, global nodes help other nodes gather information, which promotes the propagation of node information. We also adopt the pruning strategy proposed by Zhang et al. [29], which allows entity nodes to gather more key information in sentences and eliminates irrelevant information. Therefore, we modify the convolution calculation using the following quantities: the input node representations $s = [e^w_1, e^w_2, ..., e^w_n]$; $d_i = \sum_{j=1}^{n} A_{ij}$, the out-degree of node i; and I, the n × n matrix of edges between the global nodes and the other nodes.
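The extended dependency tree described above can be sketched as an adjacency-matrix construction. The head-array input format, the undirected treatment of dependency edges, the toy six-token tree, and the function name `entity_centric_adjacency` are assumptions made for illustration; the paper's exact normalization is not reproduced here.

```python
def entity_centric_adjacency(heads, entity_nodes):
    """Build an n x n 0/1 adjacency matrix with global entity nodes.

    heads[i] is the index of token i's dependency head (-1 for the root).
    entity_nodes lists the token indices that act as global nodes.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:  # dependency edge, treated here as undirected
            adj[child][head] = adj[head][child] = 1
    for i in range(n):  # self-loop so a node keeps its own information
        adj[i][i] = 1
    for e in entity_nodes:  # global node: an edge to every other node
        for j in range(n):
            adj[e][j] = adj[j][e] = 1
    return adj


# Toy tree (hypothetical): edges 0-1, 2-1, 3-2, 4-3, 5-4; entities are 0 and 5.
heads = [1, -1, 1, 2, 3, 4]
adj = entity_centric_adjacency(heads, entity_nodes=[0, 5])
degrees = [sum(row) for row in adj]  # d_i = sum_j A_ij

print(adj[0])   # [1, 1, 1, 1, 1, 1]: the entity node reaches every node in one hop
print(adj[2])   # [1, 1, 1, 1, 0, 1]: a regular node gains only the two entity edges
```

With this matrix, a single convolution lets each entity node aggregate the whole sentence, while non-entity nodes reach each other in at most two hops through an entity.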

Relation Classification Module
After passing through the L-layer GCN, we get the hidden representation of each node. To use these hidden representations for relation classification, we first obtain the sentence representation $h_{sent}$ from the matrix representation of the sentence after the L-layer FCGCN.
Lin et al. [30] regard the relation r as a translation from the head entity to the tail entity ($e_{sub} + r \approx e_{obj}$) and use a margin-based ranking criterion to learn the embeddings of entities and relations. Their model then predicts the relation between two target entities simply by computing with the corresponding embedding vectors, and it achieves good results. These studies have shown that the difference between the word embeddings of an entity pair can reflect the semantic information of the relation between them. Therefore, we believe that the difference vector of the entities can also provide effective evidence for relation classification. Specifically, given a sentence with $r = e_{sub} - e_{obj}$, we use the difference vector of the entity pair as part of the relation classification constraint and obtain its predicted probability by passing $r$ through a multilayer perceptron, MLP(·), where $e_{sub}$ and $e_{obj}$ represent the entity vectors obtained after FCGCN. Finally, the vector obtained from the entity subtraction and $h_{sent}$ are concatenated to obtain the final vector. We feed this final representation into a feedforward neural network followed by a softmax operation to obtain the probability distribution over relations.
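The classification step can be sketched as follows: form the entity difference vector, concatenate it with the sentence representation, and apply a softmax classifier. A single linear layer stands in for the MLP and feedforward network, and all vectors, weights, and names (`linear`, `classify`) are hypothetical toy values rather than the paper's parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def linear(x, weight, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def classify(e_sub, e_obj, h_sent, weight, bias):
    """Relation probabilities from [e_sub - e_obj ; h_sent]."""
    diff = [a - b for a, b in zip(e_sub, e_obj)]  # r = e_sub - e_obj
    final = diff + h_sent                         # list concatenation
    return softmax(linear(final, weight, bias))


# Toy inputs: 2-dimensional entity vectors, 1-dimensional sentence vector,
# and two relation classes (all values are made up for illustration).
probs = classify(
    e_sub=[1.0, 0.5],
    e_obj=[0.2, 0.5],
    h_sent=[0.3],
    weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    bias=[0.0, 0.0],
)
print([round(p, 3) for p in probs])
```

Because the difference vector enters the final representation directly, two sentences with the same context but different entity pairs can still receive different relation distributions.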

Module Training
In this framework, the objective function of the proposed FCGCN model includes two parts: the loss of the entity difference classifier, $L_1$, and the loss of the relation classifier, $L_2$. Suppose there are N instances in the training set $T = \{T_1, T_2, ..., T_N\}$ with corresponding labels $\{y_1, y_2, ..., y_N\}$. We use cross entropy as the loss function of the difference vector classifier, $L_1$, and, similarly, cross entropy as the loss of the relation classifier, $L_2$. Finally, we use the $\ell_2$-norm to constrain the loss function, and the objective to minimize is
$$\min L = L_1 + L_2 \qquad (22)$$
In this paper, stochastic gradient descent is used for optimization.
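The two-part objective can be illustrated with a small numeric sketch. It assumes that each cross-entropy term is the mean negative log-likelihood of the gold label, which is one common definition; the paper's exact normalization and the $\ell_2$ term are not reproduced, and the probability values are made up.

```python
import math


def cross_entropy(prob_rows, labels):
    """Mean negative log-likelihood of the gold labels."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)


# Hypothetical predicted distributions for two training instances.
diff_probs = [[0.7, 0.3], [0.4, 0.6]]  # from the entity difference classifier
rel_probs = [[0.9, 0.1], [0.2, 0.8]]   # from the relation classifier
labels = [0, 1]

loss_1 = cross_entropy(diff_probs, labels)
loss_2 = cross_entropy(rel_probs, labels)
total = loss_1 + loss_2  # L = L_1 + L_2
print(round(loss_1, 4), round(loss_2, 4), round(total, 4))
```

Summing the two terms means that gradients from the difference classifier and from the relation classifier both update the shared encoder.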

Datasets
We verify the performance of FCGCN on two widely used relation classification datasets: TACRED and SemEval 2010 Task 8.
TACRED: The TACRED dataset contains 106,264 instances and 42 relation types (41 predefined semantic relations and a "None" relation, which means that no relationship holds between the entity pair) [31]. Each instance of the dataset is a sentence that contains a pair of entity mentions and is labeled with one of the 42 relation types.
SemEval 2010 Task 8: The SemEval 2010 Task 8 dataset is widely used for relation classification. It contains 10,717 instances with 9 relation types and a special "other" relation type [32]. This dataset is small, roughly 1/10 the size of TACRED. Each instance expresses a semantic relation between two marked entities. The training set contains 8000 instances and the test set contains 2717 instances.
On both datasets, we employ pre-trained 300-dimensional GloVe [23] vectors to initialize the word embeddings. The pruning distance K is set to 1, and the batch size is 50. The model is trained for 100 epochs. The details of the selected hyperparameters are summarized in Table 1.

Performance Comparison
The performance of the proposed model and the baselines on the TACRED dataset is shown in Table 2. We employ four types of models as baselines: (1) The logistic regression (LR) based model is a traditional method for relation classification, which uses the dependency information and the lexical information of sequences to form sentence-level features. (2) The convolutional neural network based model proposed by Nguyen et al. [10] develops a feature extractor that automatically captures the semantic features of sentences for relation classification. (3) The long short-term memory (LSTM) based methods include the PA-LSTM model [31], the tree-LSTM model [33], and the SDP-LSTM model [14]. The PA-LSTM model employs a position-aware attention mechanism and an LSTM encoder to extract the sentence features. The SDP-LSTM model focuses on the shortest dependency path between an entity pair, which contains the key semantic information. The tree-LSTM model encodes the whole tree structure to obtain the semantic information of sentences. (4) The graph convolutional network based models C-GCN, proposed by Zhang et al. [29], and AGGCN, proposed by Zhang et al. [34], are also introduced as baselines in our experiment.
From Table 2, we can observe that the GCN-based models achieve better performance than the other baselines and that our proposed method outperforms all the baselines, improving F1 by at least 0.7. The CNN-based model has the best precision score, 75.6, and the lowest recall score, only 47.5. Compared to AGGCN and C-GCN, the proposed FCGCN achieves better performance. We believe the reason is that the global entity nodes can effectively accumulate semantic information from the whole dependency tree and form more informative semantic features. The experimental results demonstrate the effectiveness of the proposed FCGCN model.
In order to evaluate the generality of the proposed model, we also conducted experiments on SemEval 2010 Task 8, mainly against dependency-based models, as shown in Table 3. The proposed model achieves an F1 score of 86.0 on SemEval 2010 Task 8 and outperforms the compared baselines.

Ablation Study
To demonstrate the effectiveness of the various components of the proposed model, we conduct a series of ablation experiments on the TACRED dataset. The results of the ablation study are shown in Table 4, from which we can observe that the performance of the proposed model drops whenever a component is removed. In particular, removing the pruning strategy decreases the F1 score by 2.2, which indicates the importance of the pruning mechanism. The performance drops by 0.7 and 0.9, respectively, when the Bi-LSTM layer and the multi-head attention layer are removed. It is worth noting that the performance drops by 1.5 when the entities are masked with random tokens, which indicates that the semantic information contained in the entity mentions is crucial for relation classification.
Effect of Pruning: Figure 3 shows the effect of pruning the tree on the FCGCN model, where K means that every token kept in the pruned tree is at most distance K from the dependency path in the LCA subtree. To show the effect of path-centric pruning, we compare the experimental results at different epochs as the pruning distance K changes. We conducted experiments with K ∈ {0, 1, 2, ∞} on the TACRED development set and compared them with the model without pruning. As shown in Figure 3, the performance peaks at K = 1 for all epochs, which confirms that pruning improves the proposed model.
Effect of Mask-entity: Figure 4 shows the effect of the FCGCN model with and without masked entities. To show the role of the entities themselves in the FCGCN model, we compare the experimental results at different epochs. As shown in Figure 4, the performance peaks at all epochs when the entities are not masked, which confirms the importance of the entities in relation extraction.
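The pruning distance K can be made concrete with a short sketch of path-centric pruning: keep every token whose distance to the dependency path between the two entities, inside the LCA subtree, is at most K. The head-array representation, the toy ten-token tree, and the function names are assumptions for illustration and are not the implementation used in the experiments.

```python
def path_to_root(heads, i):
    """Token indices from i up to the root (heads[root] == -1)."""
    path = [i]
    while heads[path[-1]] >= 0:
        path.append(heads[path[-1]])
    return path


def prune(heads, subj, obj, k):
    """Indices of tokens within distance k of the subj-obj dependency path."""
    up_s, up_o = path_to_root(heads, subj), path_to_root(heads, obj)
    common = set(up_s) & set(up_o)
    lca = next(node for node in up_s if node in common)  # lowest common ancestor
    on_path = set(up_s[: up_s.index(lca) + 1]) | set(up_o[: up_o.index(lca) + 1])
    kept = []
    for i in range(len(heads)):
        dist, node = 0, i
        # Walk up from token i until the path is reached or the tree is left.
        while node >= 0 and node not in on_path:
            node = heads[node]
            dist += 1
        if node >= 0 and dist <= k:  # node < 0 means i is outside the LCA subtree
            kept.append(i)
    return kept


# Toy tree (hypothetical): root 8; LCA of the entities 0 and 5 is token 3.
heads = [1, 3, 1, 8, 3, 4, 5, 6, -1, 8]
print(prune(heads, 0, 5, 0))             # [0, 1, 3, 4, 5]: the path only
print(prune(heads, 0, 5, 1))             # [0, 1, 2, 3, 4, 5, 6]
print(prune(heads, 0, 5, float("inf")))  # [0, 1, 2, 3, 4, 5, 6, 7]: whole LCA subtree
```

In this toy tree, K = 1 adds the tokens attached directly to the path, while tokens 8 and 9 lie outside the LCA subtree and are dropped for every K.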

Analysis of Bi-LSTM & Multi-Head Attention:
Bi-LSTM and the multi-head attention mechanism are widely used in deep learning based relation classification models. By combining the forward and backward LSTM states, each word in the sentence captures contextual semantic information. The multi-head attention mechanism helps the model pay more attention to the important words in the sentence rather than to words that introduce noise. It uses multiple queries computed in parallel to select multiple pieces of information from the input and assign attention weights to each word in the sentence, which gives it a global view. As shown in Table 4, the F1 scores obtained with and without Bi-LSTM and multi-head attention indicate that the two are complementary: Bi-LSTM captures sequence information, while multi-head attention captures global correlations and more semantics between words. Combining Bi-LSTM with the multi-head attention mechanism yields richer word-level and sentence-level representations, captures more semantic information, and improves the accuracy of relation classification.

Effect of Hyper-Parameters
In this section, we discuss through experiments the influence of two hyperparameters of the proposed method: the number of attention heads h and the number of graph convolutional layers $d_l$.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, so choosing the right number of attention heads is crucial to the proposed model. As can be seen from Figure 5a, the performance first improves as the number of attention heads increases. Using 3 heads in the multi-head self-attention layer produces the best performance; beyond 3 heads, the performance gradually decreases with each additional head. We then study the influence of the number of graph convolution layers with the number of attention heads fixed at 3. It can be seen from Figure 5b that the performance improves as the number of GCN layers increases, and the best result is achieved with two layers in FCGCN. Beyond two layers, each additional layer brings no further gain, and a multi-layer GCN for relation classification incurs high space complexity.

Conclusions
In this paper, we introduced a novel entity-centric fully connected graph convolutional network (FCGCN) for relation classification. In FCGCN, the global nodes connect directly to all other nodes. This operation emphasizes the importance of entity nodes for relation classification; in particular, the performance of the model improves greatly when the entities are not masked. In this way, the information of all nodes in the graph can be obtained directly with only one FCGCN layer. Moreover, the difference vector of the entity pair was introduced to constrain the relation between the entity pair, which effectively improves the relation classification results. The experimental results on the TACRED and SemEval 2010 Task 8 datasets show that FCGCN uses the semantic information in the dependency tree more comprehensively and achieves better results than the baselines. We also found that the F1 value reached 67.1 when the model was combined with the pruned dependency tree, which shows that the combination further improves the model.
In future work, we will explore a new pruning strategy to combine with the proposed model to reduce noise as much as possible, thereby further improving the performance of the model.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

CNN        Convolutional Neural Network
LSTM       Long Short-Term Memory
Bi-LSTM    Bi-directional Long Short-Term Memory
GCN        Graph Convolutional Network