Entity Relation Extraction Based on Entity Indicators

Abstract: Relation extraction aims to identify the semantic relationship between two specified named entities in a sentence. Because a sentence often contains several named entity pairs, a neural network is easily bewildered when learning a relation representation without position and semantic information about the considered entity pair. In this paper, instead of learning an abstract representation from raw inputs, task-related entity indicators are designed to enable a deep neural network to concentrate on the task-relevant information. By implanting entity indicators into a relation instance, the neural network effectively encodes the syntactic and semantic information of that instance. Organized, structured and unified entity indicators make the similarity between sentences that share the same or similar entity pairs, as well as the internal symmetry of a single sentence, more obvious. In the experiments, a systematic analysis was conducted to evaluate the impact of entity indicators on relation extraction. The method achieves state-of-the-art performance, exceeding the compared methods by more than 3.7%, 5.0% and 11.2% in F1 score on the ACE Chinese corpus, ACE English corpus and Chinese literature text corpus, respectively.


Introduction
Relation extraction is one of the fundamental information extraction (IE) tasks; it aims to identify the semantic relationship between two named entities in a sentence [1]. For example, given the sentence "Steve Jobs was the co-founder of Apple Inc.", an employee relation instance can be identified between "Steve Jobs" and "Apple Inc.". Finding the relation between two named entities has a wide range of applications, e.g., clinical decision support [2], drug discovery [3] and economic management [4]. The task is foundational to other natural language processing (NLP) tasks, such as knowledge graph construction [5], question answering [6] and natural language understanding [7]. Therefore, relation extraction has received extensive research attention [8]. Currently, the neural network is the most popular method for relation extraction, where a multilayer stacked architecture performs the designed feature transformations, e.g., the convolutional neural network (CNN) [9], recurrent neural network (RNN) [10] and attention mechanism [11]. This approach has the advantage of extracting high-order abstract features from raw inputs, avoiding the effort required to manually engineer features. The main problem in relation extraction is that a sentence often contains several named entities. Because relation types are asymmetrical, every ordered entity pair in a sentence must be considered a possible relation instance. This leads to a serious data imbalance problem. Furthermore, all entity pairs share the same context, weakening the discriminability of the features used to predict a relation instance. Therefore, obtaining entity position information is highly important for a neural network to concentrate on the considered entity pair. The contributions of this paper are as follows. (1) Entity indicators are designed to support relation extraction, and several types of entity indicators are proposed. These indicators are effective for capturing the semantic and structural information of a relation instance. (2) The entity indicators are evaluated on three public corpora, providing a systematic analysis of these indicators in supporting relation extraction. A performance comparison shows that our method considerably outperforms all compared works.
The rest of this paper is organized as follows. Section 2 discusses the related works on relation extraction. In Section 3, entity indicators are introduced. Section 4 evaluates the entity indicators based on three public corpora. The conclusion is given in Section 5.

Related Works
Most early research methods on relation extraction can be categorized into feature-based methods [20] and kernel-based methods [21]. In feature-based models, a shallow architecture (e.g., a support vector machine (SVM) [22] or maximum entropy (ME) [20]) makes a prediction based on categorical features. For example, Kambhatla et al. [20] combined lexical, syntactic and semantic features for relation extraction. Dashtipour et al. [23] proposed a scalable system for Persian named entity recognition (PNER) that combines a Persian grammatical rule-based approach with an SVM. Minard et al. [24] used external resources, e.g., word stems and VerbNet classes. Because the structural information of relation instances is important for relation extraction, Chen et al. [25] proposed a feature assembly method to generate combined features. In addition, Liu et al. [26] proposed a convolutional tree kernel-based method that incorporates semantic similarity measures. Panyam et al. [27] employed a graph kernel to encode edge labels and edge directions. The extensibility of kernel-based models is limited because manually designed distance functions (kernels) are used to compute the similarities between relation instances. Furthermore, the generation of dependency trees depends heavily on external toolkits, which are error-prone.
Because neural networks have the advantage of automatically extracting high-order representations from relation instances, they are widely used to support relation extraction. For example, Leng et al. [28] designed a deep architecture fusing both word-level and sentence-level information. Liu et al. [9] introduced an early neural model, in which a convolutional neural network (CNN) was designed to support relation extraction. In relation extraction, the entity position is important information pointing to the entity pair. Zeng et al. [29] utilized the relative distance between the current word and the target entity pair to encode position features under a CNN model. To utilize contextual features between the named entities, Zeng et al. [12] proposed a piecewise convolutional neural network (PCNN). Li et al. [30] combined the PCNN and an attention mechanism for relation extraction. Zhang et al. [31] applied a CNN model to shape classification in image processing. Wang et al. [32] used a bidirectional shortest dependency path (Bi-SDP) attention mechanism to capture the dependency information of words. To capture semantic dependencies in a relation instance, a long short-term memory (LSTM) model with an attention mechanism has also been applied [33,34]. Based on a graph neural network, Zhao et al. [35] proposed an N-gram Graph LSTM (NGG LSTM) model to capture sentence structure information. Chen et al. [13] used a multichannel deep neural network to partition a whole sentence into five parts, enabling the neural network to learn different representations for the same word.
Instead of investigating position embedding, Zhang et al. [18] utilized entity indicators to identify the start and end of the entity pair under a recurrent neural network. In this approach, indicators are placed on both sides of the entity pair to point to entity positions. An attention-based bidirectional long short-term memory (Att-BLSTM) model was proposed for obtaining contextual semantic dependencies from long texts [33]. Entity indicators were also used as the position indicators in this method. Based on the pretrained language model BERT [36], Soares et al. [17] placed four entity indicators on both sides of two named entities. The outputs of these four entity marker representations are concatenated as the relation representation. Zhong et al. [19] used entity type information to mark the entity pair and to learn the relation representation. Huang et al. [37] proposed a knowledge graph enhanced transformer encoder architecture to handle the semantic gap problem between word embeddings and entity embeddings.
The study of named entity recognition (NER) is also important for relation extraction. Because relation extraction relies on the correct positions of the entity pair in a sentence, NER for multiple languages has attracted increasing attention. McDonough et al. [38] studied geographic NER for modern French within a specified scope. Isozaki et al. [39] studied Japanese named entity recognition based on a rule generator and decision tree learning. Medical NER in Swedish and Spanish has also been studied [40]. In summary, the ability to capture position information about named entities is highly important for supporting relation extraction, and the study of relation extraction can also be extended to multiple languages. In feature-based models, features are combined to capture position information. Their shortcoming is that manually designed rules are needed to generate combined features. Furthermore, compared with distributed word representations, categorical features are not effective for encoding the semantic information of words. In kernel-based models, dependency trees are used to model sentence structure. The main problem of kernel-based models is that generating dependency trees depends strongly on external toolkits, which are error-prone. The process also depends on manually designed distance functions for computing the similarities between relation instances. The neural network is the most popular technique for relation extraction; position embedding, multichannel architectures, the PCNN and entity markers have been proposed to capture position information. The shortcoming of these methods is that they cannot simultaneously capture both position information and semantic information.

Methodology
A relation instance is defined as a triad I = ⟨r, a_1, a_2⟩ that consists of a relation mention (r) and two arguments (a_1, a_2). A relation mention is a sentence or a clause that mentions a relation, e.g., r = w_1, w_2, ..., w_N. Arguments refer to named entities in the relation mention.
In a deep neural network, word embedding transforms the one-hot representation of a word into a dense semantic space. The embedding operation is represented as:

[H^e_1, H^e_2, ..., H^e_N] = Embedding(r)

The output is referred to as H^e. The superscript e indicates that it is a hidden layer output by an embedding layer.
Let W_c ∈ R^{L×K} be a filter. Then, the convolutional operation is defined as:

H^c_i = f_c(W_c ⊙ H^e_{i:i+K−1} + b)

where f_c denotes a nonlinear function, ⊙ denotes the element-wise product over the window of K embeddings summed to a scalar, and b is a bias term. The implementation of the convolutional operation over H^e is then represented as:

H^c = [H^c_1, H^c_2, ..., H^c_{N−K+1}]

The output of the convolutional layer is referred to as H^c ∈ R^{(N−K+1)×N}, representing a high-order abstract representation generated from local patches of the input. It is often followed by a pooling layer to collect the salient features. A padding operation can be designed to generate a matrix H^c with the same column size as H^e.
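A single-filter version of the convolutional operation above can be sketched as follows. The dimensions, the use of tanh for f_c and the scalar bias are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def conv1d(H_e, W_c, b, f_c=np.tanh):
    """Single-filter convolution: H_c[i] = f_c(<W_c, window_i> + b),
    where window_i stacks the K embeddings H_e[i:i+K] (each of size L)
    to align with the filter W_c of shape (L, K)."""
    N, L = H_e.shape
    K = W_c.shape[1]
    out = np.empty(N - K + 1)
    for i in range(N - K + 1):
        window = H_e[i:i + K].T        # shape (L, K), same as W_c
        out[i] = f_c(np.sum(W_c * window) + b)
    return out

H_e = np.random.randn(10, 4)   # N=10 words, L=4 embedding dims
W_c = np.random.randn(4, 3)    # filter width K=3
H_c = conv1d(H_e, W_c, 0.1)
print(H_c.shape)               # (8,) = N - K + 1
```

Stacking the outputs of several such filters column-wise yields the matrix H^c described above.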
If the dependency between the inputs is introduced, the recurrent operation can be defined as:

H^r_i = f_r(U_r H^e_i + V_r H^r_{i−1} + b)

where f_r is a nonlinear function, and U_r, V_r and b are the parameters of the recurrent operation. The recurrent operation indicates that the output H^r_i depends on both the input H^e_i and the previous state H^r_{i−1}. Implementing the recurrent operation on H^e, the output H^r can be formalized as:

H^r = [H^r_1, H^r_2, ..., H^r_N]

The recurrent network has the advantage of capturing the dependency information in a sentence.
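The recurrence can be sketched as a simple forward pass. The zero initial state, tanh nonlinearity and the sizes are assumptions for illustration:

```python
import numpy as np

def rnn_forward(H_e, U_r, V_r, b):
    """Recurrent operation: H_r[i] = f_r(U_r @ H_e[i] + V_r @ H_r[i-1] + b),
    with f_r = tanh and a zero initial state. H_e has shape (N, L)."""
    N = H_e.shape[0]
    D = U_r.shape[0]
    H_r = np.zeros((N, D))
    prev = np.zeros(D)
    for i in range(N):
        prev = np.tanh(U_r @ H_e[i] + V_r @ prev + b)
        H_r[i] = prev
    return H_r

H_e = np.random.randn(6, 4)            # N=6 words, L=4 embedding dims
U_r = 0.1 * np.random.randn(5, 4)      # hidden size D=5
V_r = 0.1 * np.random.randn(5, 5)
H_r = rnn_forward(H_e, U_r, V_r, np.zeros(5))
print(H_r.shape)                       # (6, 5)
```

Note how each row depends on all previous inputs only through the single vector `prev`, which is exactly why long-distance influence decays, as discussed below.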
In a deep neural network, the CNN and RNN are mainly used to support the designed abstract feature transformations. They are usually followed by a fully connected layer (Conn(·)) and an output layer (Softmax(·)), which carry out a global adjustment and output normalized results.
The abilities of the CNN and RNN to capture structural information differ. In the CNN, the parameters W_c and b are shared across the whole input matrix H^e. Let w_i and w_j (i + K < j) be two named entities in a relation mention r, where K is the size of the convolutional filter. Because the distance between w_i and w_j is larger than K, the convolutional operation cannot capture the dependency between them. Due to the vanishing gradient problem, the recurrent network weakly captures long-distance dependencies in a sentence. In the recurrent operation, the semantic information of w_i is propagated as:

H^r_j = f_r(U_r H^e_j + V_r f_r(U_r H^e_{j−1} + V_r f_r(· · ·) + b) + b)

where the influence of w_i vanishes considerably. In this condition, neither the CNN nor the RNN can capture the dependency between the entity pair. Furthermore, named entities can be composed of arbitrary words. A neural network can easily be bewildered when learning a relation representation without position information for the named entities.
In this paper, entity indicators are proposed to capture the position information of a relation instance. Before introducing the entity indicator method, two traditional strategies for obtaining entity positions are presented; they will be compared with the entity indicator strategy.
Position Embedding: Let i and j denote the start positions of two named entities e_1 and e_2 in a sentence, respectively. Then, the coordinate of a word at position k is computed as (k − i, k − j). The coordinate is embedded into a higher-dimensional vector, concatenated with the word embedding and fed into a neural network. This process is formalized as follows:

H^e = [Embedding(r); PositionEmbedding(r)]

Let N, L and D represent the length of a relation mention, the dimension of the word embedding and the dimension of the position embedding, respectively. Then, H^e ∈ R^{N×(L+D)} is the output of the word embedding concatenated with the position embedding.
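The position embedding strategy can be sketched as follows. The paper embeds the coordinate pair into a single D-dimensional vector (H^e ∈ R^{N×(L+D)}); this sketch embeds each offset separately, giving width L + 2D, which is a common variant. Table sizes and the clipping range are assumptions:

```python
import numpy as np

def position_coordinates(n_words, i, j):
    """Coordinate (k - i, k - j) of every word: its offsets to the
    start positions i and j of the two named entities."""
    return [(k - i, k - j) for k in range(n_words)]

def add_position_embedding(H_word, i, j, pos_table, max_dist=50):
    """Concatenate each word embedding (row of H_word, N x L) with the
    position embeddings of its two offsets, looked up from pos_table
    of shape ((2 * max_dist + 1), D). Offsets are clipped to
    [-max_dist, max_dist] and shifted to valid table indices."""
    N = H_word.shape[0]
    rows = []
    for k in range(N):
        d1 = int(np.clip(k - i, -max_dist, max_dist)) + max_dist
        d2 = int(np.clip(k - j, -max_dist, max_dist)) + max_dist
        rows.append(np.concatenate([H_word[k], pos_table[d1], pos_table[d2]]))
    return np.stack(rows)

print(position_coordinates(5, 1, 3))
# [(-1, -3), (0, -2), (1, -1), (2, 0), (3, 1)]
```

In a real model, `pos_table` would be a trainable lookup table updated jointly with the word embeddings.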
Multichannel: Following the multichannel strategy [13], a relation mention is partitioned into several parts, and every part is viewed as an independent channel that uses a nonshared lookup table to transform every word in the channel into a vector representation. This approach can be formalized as follows:

H^{e,m} = Embedding_m(r_m), m = 1, ..., M

where r_m denotes the m-th part of the relation mention and each channel m has its own lookup table. The multiple channels enable the same word to express different semantic meanings in different channels. In the training process, these channels do not interact during recurrent propagation, enabling the neural network to learn different representations for the same word.

Entity Indicators
In a traditional relation extraction task, named entities are manually labeled and given as inputs; they have precise positions in a sentence. Inserting specific tokens at the boundaries of named entities enables a deep neural network to concentrate on the considered entity pair. Each indicator is encoded into the same representation, which can be seen as an "anchor" of a relation instance. Therefore, in addition to pointing to the positions of the arguments, the indicators are beneficial for learning the dependencies between the entity indicators and words.
Let r = w_1, ..., w_i, ..., w_{i+s}, ..., w_j, ..., w_{j+t}, ..., w_N be a relation mention, where e_1 = w_i, ..., w_{i+s} and e_2 = w_j, ..., w_{j+t} denote two named entities in r. Because relation types are asymmetrical, the relation mention r can generate two relation instances: I_1 = ⟨r, a_1 = e_1, a_2 = e_2⟩ and I_2 = ⟨r, a_1 = e_2, a_2 = e_1⟩. In a relation instance, four indicators are inserted on the two sides of arguments a_1 and a_2, denoted as l_11, l_12, l_21 and l_22, respectively. Then, the relation mentions for the relation instances I_1 and I_2 are revised as:

w_1, ..., l_11, w_i, ..., w_{i+s}, l_12, ..., l_21, w_j, ..., w_{j+t}, l_22, ..., w_N
w_1, ..., l_21, w_i, ..., w_{i+s}, l_22, ..., l_11, w_j, ..., w_{j+t}, l_12, ..., w_N

where l_11, l_12, l_21 and l_22 denote predefined tokens. When a relation mention implanted with these indicators is fed into a neural network, the network will "know" the positions of the arguments. Furthermore, entity indicators can encode more information, such as syntactic or semantic information. In this paper, three types of entity indicators are proposed: position indicators, semantic indicators and compound indicators. To simplify the discussion, the entity indicators of arguments a_1 and a_2 are denoted as a quadruple ⟨l_11, l_12, l_21, l_22⟩, representing the beginning and end boundaries of a_1 and a_2, respectively. Semantic Indicators: The entity type and subtype contain important semantic information about named entities. Therefore, using the entity type and subtype as entity indicators can capture both entity position information and entity semantic information. Entity indicators combined with entity types, subtypes and argument positions are divided into four subcategories: entity type indicator, entity subtype indicator, entity type with position and entity subtype with position.
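The indicator insertion described above can be sketched as follows. The token spellings such as `<e1>` are illustrative placeholders, not the paper's exact vocabulary:

```python
def insert_indicators(tokens, span1, span2,
                      inds=("<e1>", "</e1>", "<e2>", "</e2>")):
    """Insert the indicator quadruple <l11, l12, l21, l22> around the two
    argument spans (start, end inclusive, token indices)."""
    l11, l12, l21, l22 = inds
    # (insert position, token, 1 for opening / 0 for closing); inserting
    # from right to left keeps earlier positions valid, and at equal
    # positions the opening tag is inserted first so the closing tag of
    # the left entity ends up before the opening tag of the right one.
    marks = sorted([(span1[0], l11, 1), (span1[1] + 1, l12, 0),
                    (span2[0], l21, 1), (span2[1] + 1, l22, 0)],
                   key=lambda m: (m[0], m[2]), reverse=True)
    out = list(tokens)
    for pos, tok, _ in marks:
        out.insert(pos, tok)
    return out

tokens = "Steve Jobs was the co-founder of Apple Inc.".split()
# a1 = "Steve Jobs" (tokens 0-1), a2 = "Apple Inc." (tokens 6-7)
print(insert_indicators(tokens, (0, 1), (6, 7)))
# ['<e1>', 'Steve', 'Jobs', '</e1>', 'was', 'the', 'co-founder',
#  'of', '<e2>', 'Apple', 'Inc.', '</e2>']
```

For the reversed instance I_2, the same sentence would be marked with the a_1 indicators on "Apple Inc." and the a_2 indicators on "Steve Jobs".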
For example, let "PER" and "ORG" represent the entity types person and organization, respectively, and let "IND" and "GOV" represent the entity subtypes individual and government, respectively. Two indicator quadruples can then be generated, e.g., ⟨[PER_1], [/PER_1], [ORG_2], [/ORG_2]⟩ from the entity types and ⟨[IND_1], [/IND_1], [GOV_2], [/GOV_2]⟩ from the entity subtypes. Compound Indicators: The above semantic indicators have shown the ability to combine semantic information and positional information (e.g., [/PER_1]). This strategy can be extended further to generate more complex indicators, referred to as compound indicators. Compound indicators have a two-sided effect. On the one hand, they encode more syntactic or semantic information, which is beneficial for enhancing the discriminability of a neural network. On the other hand, they also lead to a sparse representation, which disperses the significance of the indicators. In this paper, two types of compound indicators are evaluated to demonstrate their influence on performance. The first utilizes the entity type and subtype simultaneously. In summary, three types and nine subtypes of entity indicators are proposed in this paper; they are listed in Table 1. In all the indicators, the entity type and subtype are manually annotated in the employed corpora; they are widely used in the field of information extraction. To generate the POS tags of words, two popular POS tools, Jieba and NLTK, were adopted for Chinese and English, respectively. In addition to the above three indicator types, syntactic indicators can be used. For example, [V_1] and [/N_1] indicate that, for the first argument, the left word is a verb and the right word is a noun. However, because POS tags are generated by external toolkits that are error-prone, they are not used independently.
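Since the entity type and subtype are already annotated in the corpus, semantic and compound indicator tokens can be generated mechanically. A hedged sketch follows; the exact token spelling (e.g., joining type and subtype with "-") is an assumption modeled on examples like [/PER_1] in the text:

```python
def make_indicators(etype1, etype2, sub1=None, sub2=None, with_position=True):
    """Build an indicator quadruple <l11, l12, l21, l22> from manually
    annotated entity types/subtypes; the token format is illustrative."""
    def tag(etype, sub, arg):
        base = etype if sub is None else f"{etype}-{sub}"
        suffix = f"_{arg}" if with_position else ""
        return f"[{base}{suffix}]", f"[/{base}{suffix}]"
    l11, l12 = tag(etype1, sub1, 1)
    l21, l22 = tag(etype2, sub2, 2)
    return l11, l12, l21, l22

# Entity type with position (cf. [/PER_1] in the text):
print(make_indicators("PER", "ORG"))
# ('[PER_1]', '[/PER_1]', '[ORG_2]', '[/ORG_2]')
# Compound indicator: entity type and subtype simultaneously:
print(make_indicators("PER", "ORG", "IND", "GOV"))
# ('[PER-IND_1]', '[/PER-IND_1]', '[ORG-GOV_2]', '[/ORG-GOV_2]')
```

The richer the token (type, subtype, argument position), the larger the indicator vocabulary, which illustrates the sparsity trade-off of compound indicators discussed above.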

Model
After entity indicators are inserted into relation mentions, they are ready for processing by a deep neural network. In this paper, we designed a simple but effective architecture to evaluate the effectiveness of entity indicators. The model is composed of an input layer, an embedding layer, a convolutional layer and an output layer; its architecture is shown in Figure 1. In the input layer, instead of directly inputting an original sentence, entity indicators are implanted into relation mentions to point to the position, syntactic and semantic information of the arguments.
In the embedding layer, three approaches are adopted to support word embedding. In the first approach, a randomly initialized lookup table is adopted. In the second approach, wiki-100 [41] and GoogleNews-vectors-negative-300 are adopted for Chinese and English word embeddings, respectively. The third approach is based on BERT [36], which is pretrained on external resources by an unsupervised method and is effective for capturing the semantic information of words. Furthermore, BERT is based on the Transformer [11], which can learn the dependencies between words.
The convolutional layer performs four one-dimensional convolutional operations with a kernel shape of 3 × 1 on the output of the embedding layer; the output of this operation is a vector of size 50. The convolutional layer automatically learns abstract representations from local features. Its output is fed into max-pooling and fully connected operations. The pooling layer collects salient features from the input, which reduces the number of model parameters and improves the generalizability of the model. A fully connected layer assembles the features for global regulation. In the output layer, a cross-entropy loss function is adopted to calculate the loss during the training process. (Our code implementing the deep learning model with entity indicators for relation extraction is available at: https://github.com/WeizheYang-SHIN/Entity_indicators_for_RE, accessed on 20 March 2021.)

Experiments
The experiments used an NVIDIA Tesla P40 GPU for training and testing under a Linux environment. In this section, three datasets, the ACE 2005 Chinese corpus, the ACE 2005 English corpus [42] and the Chinese literature text corpus (CLTC) [43], were adopted to evaluate the performance of entity indicators.
The ACE 2005 corpus is a classic dataset for automatic content extraction. It was collected from weblogs, broadcast news, newsgroups and broadcast conversation. The corpus is annotated with 7 entity types, 44 entity subtypes, 6 relation types and 18 relation subtypes. The Chinese corpus contains 628 documents with 9244 relation mentions that are used as positive instances. If two named entities have no predefined relation type, the instance is considered negative. To generate negative instances for training, the method of Chen et al. [25] is adopted, which generates 98,140 negative instances. The ACE English corpus yields 6583 positive instances and 97,534 negative instances.
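The enumeration of candidate instances can be sketched as follows. Because relation types are asymmetrical, every ordered entity pair is a candidate, and unannotated pairs become negative instances (function and label names are illustrative, not those of the referenced method):

```python
from itertools import permutations

def candidate_instances(entities, positive_pairs, negative_label="NEGATIVE"):
    """Enumerate every ordered entity pair in a sentence as a candidate
    relation instance; (e1, e2) and (e2, e1) are distinct candidates,
    and pairs without an annotated relation receive the negative label."""
    return [(a1, a2, positive_pairs.get((a1, a2), negative_label))
            for a1, a2 in permutations(entities, 2)]

ents = ["Steve Jobs", "Apple Inc."]
pos = {("Steve Jobs", "Apple Inc."): "EMPLOYEE"}
print(candidate_instances(ents, pos))
# [('Steve Jobs', 'Apple Inc.', 'EMPLOYEE'),
#  ('Apple Inc.', 'Steve Jobs', 'NEGATIVE')]
```

With m entities per sentence, this yields m·(m−1) candidates, which is why negatives vastly outnumber positives in the ACE statistics above.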
The Chinese literature text corpus [43] is a discourse-level corpus whose articles are collected from Chinese literature. In total, seven entity types and nine relation types are manually annotated according to a gold standard. In this corpus, entity relations hold only among four entity types: Thing, Person, Location and Organization. The corpus contains a total of 695 articles for training, 58 for validation and 84 for testing.
The experimental performance is measured by the evaluation indexes commonly used in the NLP field, namely, the precision rate (P), recall rate (R) and F1 score (F1). The total performance (referred to as "Total") is the macro-average over all positive relation types. In our experiments, the ACE 2005 corpus was divided 8:1:1 into training, validation and test sets. The fixed sentence length is set to 100.
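One reasonable reading of the "Total" metric above, macro-averaged P/R/F1 over the positive relation types with negatives excluded from the average, can be sketched as follows (the exact treatment of negatives in the original evaluation is an assumption):

```python
from collections import Counter

def macro_prf(gold, pred, negative_label="NEGATIVE"):
    """Per-type precision/recall/F1 over the positive relation types
    found in the gold labels, macro-averaged; negative instances only
    contribute as errors (false positives/negatives)."""
    types = {t for t in gold if t != negative_label}
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and g != negative_label:
            tp[g] += 1
        else:
            if p != negative_label:
                fp[p] += 1
            if g != negative_label:
                fn[g] += 1
    scores = []
    for t in types:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append((prec, rec, f1))
    n = max(len(scores), 1)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

gold = ["A", "A", "B", "NEGATIVE"]
pred = ["A", "B", "B", "NEGATIVE"]
print(macro_prf(gold, pred))   # -> approximately (0.75, 0.75, 0.667)
```

Macro-averaging weights every relation type equally, so rare relation types influence "Total" as much as frequent ones.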

Performance of Entity Indicators
In this section, based on the ACE Chinese corpus, ACE English corpus and CLTC corpus, all the entity indicators in Table 1 were evaluated to demonstrate their influence on the performance. In this experiment, word embeddings were initialized by two pretrained embedding models: wiki-100 [41] for Chinese and GoogleNews-vectors-negative-300 for English. The results are shown in Table 2, where the "None" model, implemented for comparison, was run on the relation mentions directly without any entity indicators. "×" means that the result is not available because no entity subtype is annotated in the CLTC corpus. Comparing the ACE Chinese corpus and the CLTC corpus shows that the performance is higher on the ACE Chinese corpus. This is because in the CLTC corpus, sentence semantics are often expressed in a subtle and special manner [44]: intuitions and feelings are usually conveyed through very complex and flexible sentence structures. The results show that distinguishing the beginning and end boundaries of named entities is beneficial for improving the performance (e.g., "P_D" and "P_TS"). When entity types are encoded into the entity indicators, the performance improves considerably. All the entity indicators considerably outperform the "None" strategy. As Table 2 shows, the performance improves as the entity indicators encode more relevant information.
When the POS tags are used in the indicator "C_PTSA", the ACE Chinese corpus shows an impressive improvement, outperforming "C_DTSA" by more than 12% in F1 score. An improvement was also observed on the CLTC corpus. On the other hand, on the English corpus, the performance unexpectedly decreased with POS tags. The reason for this difference may be that English is an alphabetic language in which many adjacent words are function words (e.g., articles and auxiliary words) that are more ambiguous and carry little lexical meaning, possibly affecting the performance.

Comparison with Other Strategies
In this experiment, the entity indicator method was compared with the position embedding and multichannel methods to demonstrate their abilities to capture the entity positions of a relation instance. The performance on the three corpora is shown in Table 3. The "None" setting is the same as in Table 2. In the position embedding model, the position coordinates of every word are embedded into a 25-dimensional vector. In the multichannel model [13], each channel employs an independent lookup table for word embedding, and the channels do not interact during recurrent propagation. In the entity indicator model, the "C_PTSA" encoding is used for the ACE Chinese and CLTC corpora, while the "C_DTSA" encoding is adopted for the ACE English corpus. To avoid the influence of external resources, all the word embeddings are initialized by randomly initialized lookup tables.
Position embedding and multichannel are two traditional strategies to support relation extraction [13,18]. A comparison of Tables 2 and 3 shows that they yield impressive improvements on the CLTC corpus. However, position embedding and multichannel have little effect on the ACE corpus. This is because ACE suffers from a serious data imbalance problem. Furthermore, because the relation types of ACE are asymmetrical, every positive instance (e.g., ⟨s, e_1, e_2⟩) has a corresponding negative instance (e.g., ⟨s, e_2, e_1⟩). In this condition, the position information of named entities is more complex, worsening the final performance.
Compared with the other strategies, entity indicators show remarkable improvements. On both the ACE corpora and the CLTC corpus, the performance increases stably. The reason for the improvement is that entity indicators encode positional information (e.g., entity boundary positions), syntactic information (e.g., POS tags) and semantic information (e.g., entity types and subtypes). Thus, they provide powerful support for relation extraction.

Evaluation on the Chinese Corpus
In the ACE Chinese corpus, the related works can be divided into those using shallow architecture and those using deep architecture. Some of them are listed in Table 4. Among the shallow architecture models, Yu et al. [45] proposed a convolutional tree kernel-based approach for relation extraction. Liu et al. [26] adopted a tree-kernel model that combines external semantic resources (HowNet) to support relation extraction. Chen [46] proposed a feature calculus strategy. In deep architecture models, Li et al. [47] proposed a lattice LSTM model that combines multigrained information for relation extraction. Chen et al. [48] designed a CNN-attention neural network model.
For comparison, the same settings as those of Chen et al. [48] were adopted in this experiment. Shallow architecture models divide the evaluation corpus 8:2 into training and test sets. In deep architecture models, the corpus is divided 8:1:1 into training, validation and test sets. In this experiment, two strategies are adopted to initialize the lookup table for word embedding. The first is "Random-CNN", which uses a randomly initialized lookup table. In the second ("BERT-CNN"), BERT [36] is adopted to support the word embedding; the embedding layer outputs 768-dimensional vectors. Because kernel-based models rely heavily on error-prone parsing trees, the methods reported by Yu et al. [45] and Liu et al. [26] show lower performance. Chen et al. [46] designed a feature calculus strategy to generate combined features, achieving state-of-the-art performance. In the works of Li et al. [47] and Chen et al. [48], the performance is worse because their neural networks are implemented directly on raw inputs and cannot utilize the positional and semantic information of named entities. Compared with the related works, entity indicators achieve the highest performance, outperforming the state-of-the-art by 3% in F1 score for relation types and by 7% in F1 score for relation subtypes.
Next, the entity indicator method was evaluated on the CLTC corpus, which was published by Peking University in 2017 [43]. It contains 695 articles for training, 58 for validation and 84 for testing. Because the training, validation and test articles are manually divided, comparison across different systems is convenient. Wen et al. [44] tested several models on this corpus, and the obtained performance characteristics are listed in Table 5, where the column "Arch." denotes the neural network architecture of the corresponding model. Among the compared models, Zhang et al. [54] proposed a multifeature fusion model that integrates multilevel features into deep neural network models; a convolutional layer is built on the Att-BLSTM model to capture word-level features and obtain more structural information. Wen et al. [44] obtained the highest performance by implementing two bidirectional LSTMs on the shortest dependency path of relation instances. Because Chinese sentences are insensitive to structure, parsing a Chinese sentence is error-prone and depends strongly on external toolkits; therefore, the performance is influenced by the parsing process. On the other hand, entity indicators are based on manually annotated named entities and contain precise position information about the relation arguments. Combined with POS tags and entity types, the entity indicators are effective for utilizing entity positional and semantic information. Compared with the SR-BRCNN model, the performance of this approach is improved by more than 11% in F1 score.

Evaluation on the English Corpus
In this experiment, the entity indicator is compared with several related works implemented on the ACE English corpus.
Kambhatla et al. [20] proposed a feature-based maximum entropy (ME) model. Zheng et al. [16] used multiple CNN convolution kernels of different sizes to extract features from raw inputs. Gormley et al. [55] presented a feature-rich compositional embedding model (FCM) that combines handcrafted features and word embeddings. Zhou et al. [56] adopted an SVM model with diverse lexical, syntactic and semantic knowledge. Zhong et al. [19] used the encoding of entity markers to represent the relation mention based on the BERT model. Chen et al. [46] proposed a feature calculus method, where combined features are generated to capture the structural information of sentences. The results are presented in Table 6.
The results show that entity indicators outperform the compared works by approximately 5% in F1 score for relation types and approximately 18.5% in F1 score for relation subtypes.

Conclusions
Unlike sentence classification, which makes a prediction based on a sentence representation, relation extraction must consider the semantic information between two named entities. Because a sentence often contains several named entities that share the same context, directly making a decision based on a sentence representation learned from raw inputs is not effective for supporting relation extraction. In this paper, entity indicators were proposed to capture the position information of a relation instance. Instead of operating on raw inputs, task-related entity indicators are inserted into each relation instance. This strategy lets the neural network "know" the position, syntactic and semantic information of the named entities in a relation instance. The uniformly structured indicators make the similarity between sentences and the internal symmetry of a single sentence more obvious. They also help neural networks learn semantic dependencies in a relation instance. Experiments have shown that this is a powerful approach for supporting relation extraction. In future work, the notion of entity indicators can be extended to other NLP tasks (e.g., named entity recognition) to help a neural network capture the structural and semantic information of a sentence.