An Attention-Based Model Using Character Composition of Entities in Chinese Relation Extraction

Abstract: Relation extraction is a vital task in natural language processing. It aims to identify the relationship between two specified entities in a sentence. Besides the information contained in the sentence, additional information about the entities has been shown to help relation extraction. However, the common sources of such information, namely entity types obtained by NER (Named Entity Recognition) and descriptions provided by knowledge bases, both have limitations. In Chinese relation extraction, there is another source of additional information that can overcome these limitations. Because Chinese characters usually have explicit meanings and carry more information than English letters, we suggest that the characters that constitute the entities can provide additional information that is helpful for the relation extraction task, especially on large-scale datasets. This assumption has never been verified before; the main obstacle is the lack of large-scale Chinese relation extraction datasets. In this paper, we first generate a large-scale Chinese relation extraction dataset based on a Chinese encyclopedia. Second, we propose an attention-based model that uses the characters composing the entities. The results on the generated dataset show that these characters provide useful information for the Chinese relation extraction task. With this information, the attention mechanism can recognize the crucial part of the sentence that expresses the relation. The proposed model outperforms the baseline models on our Chinese relation extraction dataset.


Introduction
Relation extraction aims to identify the relationship between two specified entities in a sentence. For example, from the sentence "LeBron James was born in Akron, Ohio.", we can extract the triple (LeBron James, Birthplace, Akron). Since it was first proposed, relation extraction has been one of the most critical tasks in NLP (Natural Language Processing) and has played a crucial role in QA (Question Answering), knowledge graph construction, and many other applications.
There have been many studies on relation extraction, both in English and in other languages. The methods have evolved from early rule-based methods, through traditional feature-based models such as SVM (Support Vector Machine) [1] and probabilistic graphical models [2], to neural network-based approaches [3,4]. At the same time, the research focus has shifted from supervised learning to distantly supervised learning [5].
Besides finding different ways of modeling the sentences, researchers also try to use additional information about the entities. Some studies use entity types [4] and entity descriptions [6]. However, both methods have limitations. The number of entity types produced by an NER (Named Entity Recognition) system is too small, especially for large-scale relation extraction. As for entity descriptions, even with a large knowledge base, only a small part of the entities in the dataset can be matched with appropriate descriptions.

In Chinese relation extraction, there is another way to provide information about the entities that overcomes these limitations. A notable difference between English and Chinese lies in the characters. English has only 26 letters, and most of them have no specific meaning. Chinese has thousands of frequently used characters, and many of them have explicit meanings. Based on this difference, we suggest that information about the entities, such as type, color, and location, can be obtained from the characters that constitute the entities. For example, as shown in Figure 1, the word '中国' consists of two Chinese characters, '中' and '国'. From the character '国', which can denote a country, we can infer that the word very probably refers to a country. Similarly, given the word '李小龙', we may infer that it refers to a person, because the first character '李' is a common family name, and the family name comes first in Chinese names. So, the character composition of the entities can provide more information than the entity types given by an NER system, and it is available for all entities without extra resources. As far as we know, the effect of this method has not been verified. The main reason is the lack of a large-scale open-domain dataset.

To verify the hypothesis, this paper creates a large-scale Chinese relation extraction dataset based on a Chinese encyclopedia. Based on this dataset, we propose an attention-based model to verify the effectiveness of the character information provided by entity compositions in the Chinese relation extraction task. The experimental results show that, with this information, the attention mechanism can recognize the crucial part of the sentence from which the relationship between the two entities can be inferred. Furthermore, the proposed model achieves better performance than baseline models that are widely used in relation extraction. The main contributions of this work are as follows.
First, we build the Baike dataset using distant supervision based on Baidubaike, a large-scale online Chinese encyclopedia, to address the lack of large-scale open-domain datasets. We describe the process of generating the dataset and analyze it from several aspects, such as the instance distribution and the label accuracy of each relation under distant supervision. After comparing it with other datasets, we believe our dataset is the most appropriate one for the large-scale Chinese relation extraction task.
Second, we propose the BLSTM-CCAtt (bidirectional LSTM using Character Composition Attention) model, an attention-based neural network model that uses the information provided by the Chinese character compositions of entities. Through this model, we show in detail how this information helps to infer the relation between two entities and how it works inside the model. Moreover, the experimental results show that the proposed model achieves the best F1 score among all the tested models on the Baike dataset.

Neural Network in Relation Extraction
Neural networks have become the mainstream in NLP studies and achieve the best performance in the relation extraction task. Socher et al. [3] propose the Recursive Matrix-Vector Model, which uses a recursive neural network to model the shortest dependency path (SDP) between entities in the sentence. Zeng et al. [4] introduce the Convolutional Neural Network (CNN) to the relation extraction task. These two studies are the earliest work using neural networks in relation classification, and their results are better than those of traditional feature-based methods. Zeng et al. [7] propose the Piecewise Convolutional Neural Network (PCNN), which splits the sentence into three parts at the two given entities and applies max-pooling to each part separately after the convolutional layer. Xu et al. [8] and Xu et al. [9] use a CNN and a Long Short-Term Memory network (LSTM), respectively, to model the SDP between the two given entities. Liu et al. [10] consider the subtrees attached to the SDP: before the SDP is modeled by a CNN, the embeddings of the subtrees obtained by a recursive neural network are appended. Since attention-based models improve the performance of many NLP tasks, attention is also used in relation classification. Zhou et al. [11] propose an LSTM model with attention. Wang et al. [12] choose a multi-layer CNN with attention. Both studies show better performance than the models without attention.

Distant Supervision
The lack of labeled data is a major problem in relation extraction. Especially in large-scale knowledge graph construction, which involves thousands of relations, the cost of labeling data manually is unacceptable. To solve this problem, Mintz et al. [5] propose distant supervision, which uses triples from Freebase to label unstructured text. Data generated by distant supervision is quite noisy. To alleviate the noise, Riedel et al. [13], Hoffmann et al. [14] and Surdeanu et al. [2] use graphical models to find which instances are labeled incorrectly. In neural relation extraction, Zeng et al. [7] are the first to use multi-instance learning. They use bags of sentences, instead of single sentences, as the input of their model. When training the model, they select the sentence with the highest predicted probability to update the parameters. Lin et al. [15] adopt attention to optimize instance selection. Qin et al. [16] use a Generative Adversarial Network (GAN) to address the problem of wrongly labeled instances.

Chinese Relation Extraction
Studies on Chinese relation extraction are far fewer than those on English. One crucial reason is the lack of large-scale datasets. Many previous studies use the ACE 2005 Chinese corpus (LDC2006T06), which is quite small for neural network-based methods. So, some studies build their own datasets. For example, Chen et al. [17] build a dataset that contains three types of relations and test multi-instance learning on it. Wen et al. [18] use a dataset based on Chinese SanWen and propose a structure-regularized neural network. Most of these studies work at either the word level or the character level. So, some studies use multi-grained models to take advantage of both levels. The latest of them, proposed by Li et al. [19], uses a lattice-based structure to dynamically integrate word-level features into a character-based method.

Dataset Construction
The dataset is a critical part of relation extraction. It determines whether a model trained on it can be applied to real-world problems. However, current Chinese relation extraction datasets are either too small or restricted to a specific domain. So, our goal is to create a large-scale open-domain Chinese relation extraction dataset. There are two common ways to create datasets. The first is to label all the data manually. This yields a high-quality dataset in which each instance is guaranteed to be right, but its cost makes it unsuitable for creating large-scale datasets. The second is the distant supervision method proposed by Mintz et al. [5], which uses known triples to label unstructured text. The quality of a dataset generated by distant supervision is lower than that of a manually labeled dataset, because wrongly labeled instances are introduced during automatic labeling, but the labeling cost is negligible by comparison. Therefore, we choose distant supervision to generate our dataset. In this section, we describe the process of generating our Chinese relation extraction dataset, which is named the Baike dataset.

Dataset Collection
The most widely used dataset in English is the NYT'10 dataset [13], which uses triples in Freebase to label raw text from the New York Times. Here we use Baidubaike, an online Chinese encyclopedia, to generate the dataset. Whereas Freebase provides triples that indicate relations between entities, Baidubaike is more like Wikipedia: it contains text, tables, and pictures that describe a real-world thing, which we treat as an entity. So, unlike the NYT'10 dataset, we can obtain both triples and text from the same source.
The detailed process is as follows. First, we crawl one million pages from Baidubaike. These pages contain tables and text, as Figure 2 shows. After filtering and disambiguation, each page can be treated as an introduction to an entity. In each page, the tables provide structured information about the entity, such as birthplace or profession. After excluding items that describe attributes of the entity, like height and weight, we obtain the relations of this entity, which form triples. Then, we use these triples to label the unstructured text in the page to get instances. All aliases are considered in the labeling process. After these steps, we get 1,496,491 instances in 1,444 relations. For most relations, the number of instances is so small that the model can hardly learn enough to distinguish them. So, we select the 53 relations that have more than 5000 instances as candidates. Unlike in Freebase, the relations described by tables in Baidubaike are irregular, and a large portion of them are ambiguous. Among the candidates, we eventually select 30 relations that have relatively clear interpretations. The negative samples, which indicate that the relation is not one of these 30 relations, are randomly chosen from the unselected relations. However, the data is quite imbalanced: the largest relation has more than 150,000 instances, while the smallest has only about 5000. To alleviate this problem, we randomly subsample the relations that contain too many instances.
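The labeling step above can be illustrated with a minimal sketch. It assumes that triples and aliases are already extracted, and it uses plain substring matching; the function name `label_sentences`, the data layout, and the example sentence are hypothetical and are not taken from the actual pipeline.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def label_sentences(sentences: List[str],
                    triples: List[Triple],
                    aliases: Dict[str, List[str]]) -> List[dict]:
    """Distant supervision: a sentence that mentions both entities of a
    known triple (under any alias) is labeled with that triple's relation."""
    instances = []
    for sentence in sentences:
        for head, relation, tail in triples:
            # Assumption: an entity matches if its name or any alias
            # occurs as a substring of the sentence.
            head_names = [head] + aliases.get(head, [])
            tail_names = [tail] + aliases.get(tail, [])
            head_hit = next((n for n in head_names if n in sentence), None)
            tail_hit = next((n for n in tail_names if n in sentence), None)
            if head_hit is not None and tail_hit is not None:
                instances.append({
                    "sentence": sentence,
                    "head": head_hit,
                    "tail": tail_hit,
                    "relation": relation,
                })
    return instances


# Illustrative data only.
example_sentences = ["李小龙出生于旧金山。", "今天天气很好。"]
example_triples = [("李小龙", "出生地", "旧金山")]
example_aliases = {"李小龙": ["Bruce Lee"]}
example_instances = label_sentences(example_sentences, example_triples, example_aliases)
```

Because any co-occurrence of the two entities is labeled, sentences that mention both entities without expressing the relation are labeled wrongly, which is the source of the noise discussed in this paper.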
After subsampling, we divide the data into training and testing sets in the proportion 50:1. Each relation is required to have at least 200 instances in the testing set. During the split, triples in the testing set are not allowed to appear in the training set, so that the results are less affected by over-fitting of the trained model. All the instances in the testing set are labeled manually to eliminate the influence of mislabeling on the evaluation.
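The constraint that no triple is shared between the two sets can be sketched as follows. This is a simplified illustration under stated assumptions: instances are dictionaries with `head`, `relation`, and `tail` keys (a hypothetical layout), whole triples are assigned to one side, and the per-relation minimum of 200 test instances is not enforced.

```python
import random
from collections import defaultdict


def split_by_triple(instances, test_fraction=1 / 51, seed=0):
    """Split instances so that every triple lands entirely in one set.

    test_fraction = 1/51 approximates a 50:1 train/test proportion.
    """
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["head"], inst["relation"], inst["tail"])].append(inst)

    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    target = len(instances) * test_fraction
    train, test = [], []
    for key in keys:
        # Fill the test set first, one whole triple at a time.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Illustrative data: 51 triples with two instances each.
demo = [{"head": f"h{i}", "relation": "r", "tail": f"t{i}", "sentence": f"s{i}-{j}"}
        for i in range(51) for j in range(2)]
demo_train, demo_test = split_by_triple(demo)
```

Grouping by triple before splitting is what prevents the model from being rewarded for memorizing entity pairs seen during training.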

Dataset Analysis
In this section, we analyze various aspects of the proposed dataset to provide a deeper understanding of it and to illustrate why it is more appropriate for the Chinese relation extraction task. The details of our dataset are shown in Table 1. The first column of the table lists the relation types in Chinese. The second column lists the interpretations of the relation types. Each of them is described as 'head entity type/relation description/tail entity type', like Wikipedia. The head and tail entities correspond to the subject and object of the triple. The entity types specify which kinds of entities can appear in the relation. The relation description describes the relationship between the two entities. The third and fourth columns give the numbers of triples and instances of each relation. The last column is the label accuracy of distant supervision, estimated during the manual labeling of the testing set. The accuracy of each relation is the number of correctly labeled instances divided by the total number of instances. The distribution of instances and triples is shown in Figure 3.

Proposed Model
In this section, we propose a neural network model named BLSTM-CCAtt that uses the character composition of the entities to provide additional information. The overall structure of our model is shown in Figure 4. The structure is similar to most previous models, which start with encoders and end with a softmax classifier. However, unlike other work, we use the character composition to provide additional information about the entities. There are three encoders in our model: one sentence encoder and two entity encoders. We compare several frequently used encoders to select the most appropriate one. After comparison and analysis, the bidirectional LSTM (BLSTM) is chosen for all three encoders. When encoding the sentence, the attention mechanism uses the outputs of the entity encoders as the query to assign weights to the words or characters and obtain the vector representation of the sentence. After a fully connected layer, a softmax classifier is used to classify the relationship between the entities.

Figure 4. Overall structure of the proposed model. $e^h_i$ and $e^t_i$ are the embeddings of the $i$-th input of the head and tail entity. $q$ and $r_e$ are calculated from the head entity representation $r^h_e$ and the tail entity representation $r^t_e$ in different ways. The weight $\alpha_i$ is calculated from $q$ and the hidden states $h$ using attention.

Embedding
Following most neural network models, the first step of our model is to transform the input tokens into low-dimensional vectors. When encoding sentences, the "input tokens" are words or characters, depending on whether the encoder is word-level or character-level. These input tokens are transformed into vectors by looking up pre-trained embeddings. The position feature [4] is used to mark the given entity pair. It is also transformed into vectors by looking up position embeddings. When encoding entities, the "input tokens" are the Chinese characters that compose the two entities. These characters are also transformed into vectors.
Given a sentence with $n$ input tokens $S = \{s_1, s_2, \ldots, s_n\}$, two marked entities $e_h$ and $e_t$, and an embedding matrix $E_s$ of dimension $d_c \times |V|$, where $d_c$ is a hyper-parameter indicating the dimension of the embedding vector and $V$ is the vocabulary, every input token $s_i$ is represented as a vector $v_i \in \mathbb{R}^{d_c}$ after projection into the embedding space. The position feature is widely used in previous work, and its effect has been verified. For each token $s_i$ in the sentence, we obtain two relative distances to the two entities. The distances are mapped to randomly initialized vectors $p^h_i$ and $p^t_i$, with $p_i \in \mathbb{R}^{d_p}$, where $d_p$ is a hyper-parameter indicating the dimension of the position vector. The final representation of token $s_i$ in the sentence is the concatenation of the token embedding and the two position embeddings:

$$r_i = [v_i : p^h_i : p^t_i]$$

The sentence is finally represented as $R = \{r_1, r_2, \ldots, r_n\}$. Entities are represented in a similar way. Each character $c_i$ of an entity $E = \{c_1, c_2, \ldots, c_m\}$ is mapped to $e_i \in \mathbb{R}^{d_c}$ using the embedding matrix $E_e$. The entity is finally represented as $E = \{e_1, e_2, \ldots, e_m\}$.
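The construction of the token representation can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the helper names, the initialization range, and the clipping of distances to a maximum value are assumptions introduced here.

```python
import random


def build_position_table(max_distance, dim, seed=0):
    """Randomly initialized position embeddings, one per relative distance."""
    rng = random.Random(seed)
    return {d: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for d in range(-max_distance, max_distance + 1)}


def clip(distance, max_distance):
    # Assumption: distances outside [-max_distance, max_distance] are clipped.
    return max(-max_distance, min(max_distance, distance))


def represent_sentence(token_vectors, head_index, tail_index, table, max_distance):
    """r_i = [v_i : p_i^h : p_i^t], the concatenation of the token embedding
    and the two position embeddings."""
    representations = []
    for i, v in enumerate(token_vectors):
        p_h = table[clip(i - head_index, max_distance)]  # distance to head entity
        p_t = table[clip(i - tail_index, max_distance)]  # distance to tail entity
        representations.append(list(v) + p_h + p_t)
    return representations


# Illustrative sizes: 5 tokens, d_c = 4, d_p = 2.
demo_tokens = [[0.0] * 4 for _ in range(5)]
demo_table = build_position_table(max_distance=3, dim=2)
demo_reps = represent_sentence(demo_tokens, head_index=0, tail_index=4,
                               table=demo_table, max_distance=3)
```

Each token vector therefore has dimension $d_c + 2 d_p$, and the same token receives a different representation depending on where the two entities are located.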

Encoders
As mentioned above, many models have been used to encode a given sentence, including CNNs, RNNs, and more complex neural networks. To select the appropriate encoder, we consider both word-level and character-level models. Le et al. [21] show that shallow-and-wide networks perform better than deep models with word inputs. On the other hand, deep models do perform better than shallow networks when the text input is represented as a sequence of characters. However, the properties of Chinese mean that character-level models for Chinese can be simpler than those for English. So, after comparing several models, we eventually select BLSTM as our sentence encoder. There are three reasons for using BLSTM in our model. First, BLSTM shows similar or even better performance when given the character composition information of entities. Second, LSTM-based models are easier to interpret than CNN-based models under the attention mechanism used in the next step. Last, BLSTM is simple compared with other complex models, which means it has fewer parameters and computes faster. The detailed encoding process is shown in Figure 4.

Given a sentence $R = \{r_1, r_2, \ldots, r_n\}$, the hidden states of the forward LSTM $H^f$ and the backward LSTM $H^b$ are

$$H^f = \{h^f_1, h^f_2, \ldots, h^f_n\}, \qquad H^b = \{h^b_1, h^b_2, \ldots, h^b_n\}$$

The final hidden states of the BLSTM sentence encoder are

$$H^s = \{h^s_1, h^s_2, \ldots, h^s_n\}$$

where each $h^s_i$ combines the forward state $h^f_i$ and the backward state $h^b_i$.

Given an entity $E = \{e_1, e_2, \ldots, e_m\}$, we use a BLSTM encoder to encode the entity in the same way as the sentence. The hidden states of the entity encoder are

$$H^e = \{h^e_1, h^e_2, \ldots, h^e_m\}$$

where $h^e_i$ is computed from $e_i$, the embedding of the $i$-th character of the entity, and the BLSTM calculation is the same as in the sentence encoder. After the BLSTM encoder, the average pooling of the hidden states, $r^i_e \in \mathbb{R}^{d_e}$ with $i \in \{h, t\}$, where $d_e$ is the size of the hidden states of the entity encoder, is used as the representation of the entity.

Attention
After encoding the sentence and the two entities, we use the attention mechanism to make the best use of the information provided by the character composition of the entities.
The attention mechanism is widely used in NLP tasks such as QA and machine translation. It aims to select the parts that are most relevant to a given query. The workflow of the attention mechanism is as follows. Given a series of states $V = \{v_1, v_2, \ldots, v_n\}$ with keys $K = \{k_1, k_2, \ldots, k_n\}$, where $k_i \in \mathbb{R}^{d_k}$, and one query $q \in \mathbb{R}^{d_q}$, the output $x$ is calculated as the attention vector $\alpha$ multiplied by the states $V$:

$$x = \alpha V \tag{5}$$

where $\alpha$ is calculated from $q$ and $K$ using the attention function $f_{att}$:

$$\alpha = \mathrm{softmax}(f_{att}(q, K)) \tag{6}$$

In most NLP tasks, the states $V$ are also used as the keys $K$. In our model, we use the hidden states of the sentence encoder $H^s$ as $V$, so the attention vector can be calculated as

$$\alpha = \mathrm{softmax}(f_{att}(q, H^s)) \tag{7}$$

There are several forms of the attention function $f_{att}$. The multiplicative form is frequently used and is selected in our model:

$$f_{att}(q, H^s) = q W H^s \tag{8}$$

where $W \in \mathbb{R}^{d_q \times d_V}$ is a parameter matrix.

There is no natural query $q$ in the relation extraction task. To solve this problem, we use the representations of the two entities, $r^h_e$ and $r^t_e$, where $h$ and $t$ indicate the head and tail entity, to generate the query $q$. Previous work [22] has demonstrated a property of word embeddings, for example w("China") − w("Beijing") = w("Japan") − w("Tokyo"). This means that the difference between two word embeddings can, to some extent, indicate the relationship between the two words. This is even clearer in knowledge graph embedding. The basic assumption of many knowledge graph embedding studies [23,24] is that, given a triple $(h, l, t)$, where $h$ and $t$ are two entities in relation $l$, the embeddings should satisfy $h + l = t$. Based on this assumption, the relation corresponds to the difference between the tail and head representations, so $q$ is calculated as follows:

$$q = r^t_e - r^h_e \tag{9}$$

The final representation of the sentence $r_s$ is calculated as follows:

$$r_s = \mathrm{softmax}(q W H^s) H^s \tag{10}$$

In order to emphasize the entity information, we concatenate the sentence and entity representations as the instance representation $r$:

$$r = [r_s : r_e] \tag{11}$$

where $r_e$ is the concatenation of the two entity representations.
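The attention step can be illustrated with a small pure-Python sketch. It assumes that the query is the difference between the tail and head entity representations, as suggested by the $h + l = t$ assumption, and that the hidden states are given as a list of vectors; the function names and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(head_rep, tail_rep, W, hidden_states):
    """Multiplicative attention over sentence hidden states.

    Assumption: q = tail_rep - head_rep. W has shape d_q x d_h.
    Returns the attention weights and the instance representation
    [r_s : head_rep : tail_rep].
    """
    q = [t - h for h, t in zip(head_rep, tail_rep)]
    d_h = len(hidden_states[0])
    # qW is a vector of size d_h.
    qW = [sum(q[i] * W[i][j] for i in range(len(q))) for j in range(d_h)]
    # One score per token: (qW) . h_i
    scores = [sum(a * b for a, b in zip(qW, h)) for h in hidden_states]
    alpha = softmax(scores)
    # r_s is the attention-weighted sum of the hidden states.
    r_s = [sum(alpha[i] * hidden_states[i][j] for i in range(len(hidden_states)))
           for j in range(d_h)]
    return alpha, r_s + list(head_rep) + list(tail_rep)


# Toy example with d_q = d_h = 2 and three tokens.
demo_alpha, demo_r = attend(head_rep=[0.0, 0.0], tail_rep=[1.0, 0.0],
                            W=[[1.0, 0.0], [0.0, 1.0]],
                            hidden_states=[[2.0, 0.0], [0.0, 0.0], [0.0, 5.0]])
```

In this toy case the first token receives the largest weight because its hidden state is aligned with the query, while the other two tokens receive equal, smaller weights.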

Multi-Instance Learning
Distant supervision [5] has dramatically reduced the cost of obtaining labeled data and made it possible to generate large-scale datasets. However, it is not perfect. Its primary shortcoming is the wrong label problem. To address this problem, multi-instance learning is introduced to the relation extraction task. Instead of a single sentence, the input of the network in multi-instance learning is a bag. Suppose there are $m$ bags $\{B_1, B_2, \ldots, B_m\}$ and the $k$-th bag contains $n$ instances $B_k = \{S_1, S_2, \ldots, S_n\}$ of the same entity pair. Rather than labeling each instance, multi-instance learning predicts the label of the bag. So, the method of calculating the representation of a bag is the key component of multi-instance learning. Several strategies have been used in previous work, such as selecting the instance with the highest probability [7], attention-based methods [6,15,25], adversarial training [26], and reinforcement learning [27,28].
In this work, we use sentence-level attention [15], which is simple and effective, as our multi-instance learning method. In this method, the representation $b$ of each bag is the weighted sum of the instance representations in the bag:

$$b = \alpha^s R \tag{12}$$

where $R = \{r_1, r_2, \ldots, r_n\}$ is the matrix of the instance representations, and $\alpha^s$ is calculated as follows:

$$\alpha^s = \mathrm{softmax}(R W^s l) \tag{13}$$

where $W^s$ is a weighted diagonal matrix and $l$ is the relation representation vector. The prediction probability $p$ of the bag is calculated as follows:

$$p = \mathrm{softmax}(L b + d) \tag{14}$$

In this equation, $L$ is the matrix of the relation representations, and $d$ is the bias vector. Cross-entropy is used as the objective function, and the Adam algorithm [29] is adopted to minimize it.
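The bag-level computation can be sketched in a few lines. This is a hedged illustration in the style of sentence-level attention [15]: the diagonal matrix is stored as a vector `w_diag`, and the function names and toy values are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_probabilities(instance_reps, w_diag, relation_vecs, bias, relation_index):
    """Attention over the instances of a bag, followed by relation prediction.

    instance_reps: list of instance vectors r_i
    w_diag:        diagonal of the weighting matrix
    relation_vecs: rows of the relation matrix, one vector per relation
    bias:          bias vector, one entry per relation
    relation_index: relation whose vector l is used as the attention query
    """
    l = relation_vecs[relation_index]
    # Score of each instance: r_i * diag(w) * l
    scores = [sum(r_j * w_j * l_j for r_j, w_j, l_j in zip(r, w_diag, l))
              for r in instance_reps]
    alpha = softmax(scores)
    dim = len(l)
    # Bag representation: attention-weighted sum of instance representations.
    bag = [sum(a * r[j] for a, r in zip(alpha, instance_reps)) for j in range(dim)]
    # Prediction: softmax over (relation matrix x bag + bias).
    logits = [sum(x * y for x, y in zip(rel, bag)) + b_k
              for rel, b_k in zip(relation_vecs, bias)]
    return alpha, softmax(logits)


# Toy bag with two instances and two relations.
demo_bag_alpha, demo_bag_p = bag_probabilities(
    instance_reps=[[1.0, 0.0], [0.0, 1.0]],
    w_diag=[1.0, 1.0],
    relation_vecs=[[3.0, 0.0], [0.0, 3.0]],
    bias=[0.0, 0.0],
    relation_index=0,
)
```

Instances that match the queried relation receive larger weights, so a wrongly labeled sentence in the bag contributes less to the bag representation.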

Experiments
In this section, we design a set of experiments to demonstrate the advantage of our model and explain how it works. First, we compare our model with several baseline models on our dataset. Second, we compare some popular encoders and try several ways of using the character composition information. This comparison shows that the method used in our model achieves the best performance. Then, we analyze how the attention mechanism used in our model works. Finally, we analyze the improvement brought by multi-instance learning on our dataset.

Experiment Result and Comparison
In this section, we compare the proposed model with several baseline models that are widely used in the relation extraction task. These models are as follows.
CNN [4], the first CNN model used in relation classification. In this paper, we do not use the lexical features, to avoid the influence of extra information obtained from other tools.
PCNN [7], a piecewise CNN model that improves the CNN model by modifying the max-pooling method and using multi-instance learning.
Att-BLSTM [11], an attention-based bidirectional LSTM model.
BLSTM-SelfAtt [31], a self-attention-based bidirectional LSTM model for sentence embedding. Here we add the position feature to mark the two entities.
All the models are tested on the proposed Baike dataset. The multi-instance learning methods are removed from all the tested models to exclude their effect. We conduct the experiments on both character-based and word-based versions of the models mentioned above. The AUC values and F1 scores of these models are shown in Table 3. When calculating the F1 score, the negative samples are excluded. Each number in Table 3 is the average over 10 runs. The results show that the proposed BLSTM-CCAtt model achieves the best performance among all the models at both the word level and the character level. The LSTM-based models perform better than the CNN-based ones. The BLSTM-CCAtt (proposed), Att-BLSTM, and BLSTM-SelfAtt models all use attention, and these attention-based models outperform the basic BLSTM model. The difference among the three is that BLSTM-CCAtt uses the character compositions of the two entities to generate the query q, while the other two models use randomly initialized vectors as the query q. This difference demonstrates the advantage of using the character compositions of entities. The F1 score of BLSTM-CCAtt is higher than that of the CNN-based models by about 1.5 and higher than that of the other attention-based models by about 0.4. The improvement of the BLSTM-CCAtt model is significant on our Baike dataset.
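Excluding the negative samples from the F1 score can be sketched as follows. The averaging scheme is not specified here, so this example assumes a micro-averaged F1 over the positive relations and a negative label named `NA`; both are assumptions made for illustration.

```python
def f1_excluding_negative(gold, pred, negative_label="NA"):
    """Micro-averaged F1 over positive relations only (assumed scheme).

    A prediction counts as a true positive only when it equals the gold
    label and that label is not the negative class.
    """
    true_pos = sum(1 for g, p in zip(gold, pred)
                   if g == p and g != negative_label)
    predicted_pos = sum(1 for p in pred if p != negative_label)
    gold_pos = sum(1 for g in gold if g != negative_label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / predicted_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy labels: one correct positive, one missed positive, one wrong positive.
demo_gold = ["A", "B", "NA", "A"]
demo_pred = ["A", "NA", "NA", "B"]
demo_f1 = f1_excluding_negative(demo_gold, demo_pred)
```

In the toy example, precision is 1/2 and recall is 1/3, so the score is 0.4; correctly predicted negative samples do not raise the score.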

Usage of the Character Composition Information
According to our hypothesis, character composition information brings extra information that benefits the relation extraction task. Many factors can affect the results of using this information, such as the choice of encoder and the way the information is used. Since many encoders have been used to encode sentences, there are several ways to use the character composition information. In this section, we test five popular encoders and three ways of using the character composition information to find out how to make the most of it. The results show that the method used in our model achieves the best result.
The five tested sentence encoders are CNN, PCNN, BLSTM, BLSTM-RES [32] and BLSTM-SelfAtt. Some of these encoders are used in previous work. In our method, these encoders are just one part of the model. The outputs of these encoders are used together with the character composition information, which is obtained by the entity encoders, to produce the final classification results.
There are three ways of using the character composition information. The first abandons the attention mechanism and directly concatenates the sentence representation from the sentence encoder with the entity representation from the entity encoders as the instance representation, which is used to predict the relationship between the entities. The second uses the attention mechanism, in which the entity representation is used to calculate the sentence representation. The calculated sentence representation is treated as the instance representation. The third is the proposed method, called Att&Con (Attention and Connection). In this method, the instance representation is the concatenation of the sentence representation calculated by the attention mechanism and the entity representation.
We try these methods on each encoder to find out in which situation the character composition information is used most effectively. However, not all encoders can use all three methods. For example, self-attention-based models usually do not need external queries, so the BLSTM-SelfAtt encoder only uses the first method. We do not use attention-based methods on CNN and PCNN because we believe that LSTM-based attention models are more interpretable in NLP tasks, although some studies use CNN-based attention models [12]. To emphasize the effect of character composition, we also test the sentence representation alone, with no character composition information. In this situation, the model is the same as in previous work. All the methods are tested at both the word level and the character level. All the tested combinations and their results are listed in Table 4.

From Table 4, we can see that the RNN-based encoders perform better than the CNN-based ones in all situations. With the connection method, the F1 score of each model improves, except for the BLSTM-SelfAtt model. The improvement at the character level is smaller than at the word level. The reason is that at the character level, the information provided by the character composition is already included in the sentence, whereas at the word level, this information is a useful supplement. There is no improvement for the BLSTM-SelfAtt model because the self-attention mechanism gives higher weights to important elements, so it can already capture enough information from the two entities. With the attention method, BLSTM gets better results at both the character level and the word level, while the performance of BLSTM-RES gets worse. We believe that, compared with the connection method, the attention method introduces the character composition information indirectly: it uses the information to find the crucial part of the sentence that decides the relation between the two entities. This mechanism fails in BLSTM-RES because, in BLSTM-RES, the attention mechanism tends to ignore most words in a sentence. The detailed analysis is given in Section 5.3. With the Att&Con method, the proposed BLSTM-CCAtt model, which uses the BLSTM encoder and Att&Con, gets the best result, since it makes the most of the character composition information.

Attention Analysis
The attention mechanism is a crucial part of the proposed BLSTM-CCAtt model. In this section, we illustrate how the attention mechanism works and examine whether it can find the crucial parts of the sentence using the character composition information. The crucial parts of the sentence are the words from which we can infer the relationship between the two entities. We explain the attention mechanism in three circumstances.
First, we focus on relations that can be deduced from the entities themselves, such as '性别' and '民族'. In these relations, the set of entities on one side is quite small, and these entities hardly appear in other relations. For example, given the entity '藏族', which denotes the Tibetan ethnic group, and a person name as the other entity in a sentence, the sentence very probably belongs to the '民族' relation, even when negative examples are considered. As shown in Figure 5, the weight that the attention assigns to the key entity is very high, and the other tokens in the sentence are nearly ignored, especially by the BLSTM-RES model. This is the simplest situation, and both models make similar choices.
Relation: 民族 (person/race belongs to/race). Head entity: 云丹久美 (YunDan kumi). Tail entity: 藏族 (Tibetan). Instance: YunDan kumi, a Tibetan male singer. He is the promoter of public benefit activities named 'Love 100' and the week champion of the TV show 'Avenue of Stars' in 2011.
Then, we consider relations in which the set of entities on one side is small but appears in more than one relation. Typical relations of this kind contain entities that represent countries, such as '国籍' and '所属国家'. A typical example is given in Figure 6. In this example, the word '中国', which means China, is the key clue for inferring the relation. However, the relation cannot be clearly judged from this word alone, because it can appear in both relations. The two models behave differently here. The BLSTM model emphasizes the word '中国' but also gives higher weight to other words, and through these words the classifier gets the information it needs to make the right decision. The BLSTM-Res model, on the other hand, focuses only on the key entity and ignores the other words, so it can hardly give the right answer.
In both kinds of relations mentioned above, entities play a crucial role in determining the relation, and our model focuses on the key entities that decide the relation. This raises the question of whether our model focuses only on the given entities. We therefore analyze some other relations in which the entities provide few clues and are less important than other keywords in the sentence. It is uncertain whether the proposed attention mechanism can still find the critical part. We select the '作者' and '歌曲原唱' relations, which meet these requirements, to analyze which part the attention focuses on. The result is shown in Figure 7. In the '作者' relation, the word '作者', which can be interpreted as 'writer', has a higher weight than the other items. In the '歌曲原唱' relation, both models focus on the word '演唱', which means 'singing', and the word '一首', which is a quantifier usually used for songs. So, in these relations, the attention mechanism can still find the critical part.
From all the situations mentioned above, we conclude that the character information provided by entity composition gives helpful clues for judging the relation. Besides that, we also find an interesting fact that may be closely related to the failure of the attention mechanism in the BLSTM-Res model. When making a decision, the BLSTM-Res model tends to allocate high weights to a few words and ignore the others compared with the BLSTM model, which leads to a loss of necessary information in the sentence. So, we also conclude that BLSTM is more suitable than BLSTM-Res as the sentence encoder in the proposed model.
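The attention behavior discussed in this analysis can be illustrated with a minimal sketch. It assumes plain dot-product scoring between an entity-derived query `q` and the encoder hidden states `h`, followed by a softmax; the scoring function actually used by the proposed model may differ. The `entropy` helper is a hypothetical diagnostic, not part of the paper, that measures how concentrated a weight distribution is: a low value corresponds to the BLSTM-Res tendency to put almost all weight on a few words. All numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, hidden_states):
    """Dot-product attention (an assumed scoring function).

    Returns one weight per token and the weighted sum of hidden states.
    """
    scores = [sum(q * x for q, x in zip(query, state)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


def entropy(weights):
    """Shannon entropy of the weights; lower means more peaked attention."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Toy example: three tokens with 2-dimensional hidden states (made-up values).
q = [1.0, 0.0]
h = [[0.1, 0.3], [2.0, 0.1], [0.2, 0.5]]
alpha, sentence_vector = attend(q, h)
peakedness = entropy(alpha)
```

In this toy input the second token receives the largest weight because its hidden state aligns best with the query. Comparing `entropy(alpha)` between two encoders on the same sentence is one simple way to quantify the observation that one model spreads its attention more widely than the other.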

Multi-instance Learning Analysis
In this section, we analyze how distant supervision influences the classification result and how multi-instance learning can improve the performance of the models on our dataset. To do so, we analyze the classification result of each relation for the proposed model at the word level. The result is shown in Table 5.
From Table 5, we find that the properties of the relation itself matter more than label accuracy. For example, in the '下辖地区' relation, the classification F1 score is 85.71 and 84.87 (with multi-instance learning) even though the label accuracy is only 33.33%. By contrast, in the '所属地区' relation, the F1 score is 58.16 and 60.43 (with multi-instance learning) although the label accuracy is 82.63%. This is very likely caused by the ambiguity of the relations. In the '所属地区' relation, the first entity may refer to a region, but it can also refer to an organization. Other relations, such as '类型' and '所属国家', have the same issue. When the multi-instance learning method is used, the F1 score of the proposed model improves from 87.30 to 87.89. We had hypothesized that relations with low label accuracy would improve more than those with high label accuracy. After analyzing the improvement of each relation, however, we find that the improvement is spread roughly evenly across relations and does not seem to be related to label accuracy.
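As a rough illustration of bag-level prediction, the sketch below assumes the common "at-least-one" form of multi-instance learning, in which all sentences that mention the same entity pair form a bag and the most confident instance decides the bag's relation. This section does not restate which multi-instance variant is used, so the aggregation rule, the function names, and the probabilities are assumptions for illustration only.

```python
from collections import defaultdict


def group_into_bags(instances):
    """Group (head, tail, relation_probs) instances by entity pair."""
    bags = defaultdict(list)
    for head, tail, probs in instances:
        bags[(head, tail)].append(probs)
    return dict(bags)


def predict_bag(bag):
    """At-least-one selection: the most confident instance labels the bag."""
    best_relation, best_prob = None, -1.0
    for probs in bag:
        for relation, p in probs.items():
            if p > best_prob:
                best_relation, best_prob = relation, p
    return best_relation, best_prob


# Hypothetical per-sentence classifier outputs for two entity pairs.
instances = [
    ("A", "B", {"国籍": 0.55, "所属国家": 0.45}),
    ("A", "B", {"国籍": 0.20, "所属国家": 0.80}),
    ("C", "D", {"国籍": 0.70, "所属国家": 0.30}),
]
bags = group_into_bags(instances)
predictions = {pair: predict_bag(bag) for pair, bag in bags.items()}
```

Under this rule a single noisy sentence cannot decide the label of a pair on its own, because a more confident sentence in the same bag overrides it.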

Conclusions and Future Work
Extra information that cannot be obtained directly from the sentence has been verified to be helpful in relation extraction. The information used by previous work, such as entity types obtained from NLP tools and knowledge bases, all has its limitations. Many Chinese characters have distinct meanings, and using the information they provide can improve many tasks in Chinese language processing. In Chinese relation extraction, the characters that constitute the entities can provide additional information. In this paper, we carry out several pieces of work to verify the effectiveness of this information.
First, to address the lack of datasets, we generate a dataset based on Baidubaike using distant supervision. Compared with previous datasets, our dataset is more appropriate for the large-scale open-domain Chinese relation extraction task. Second, we propose an attention-based model. By analyzing the attention mechanism, we find that with this information the model can effectively find the vital part of the sentence. Furthermore, the model achieves the best performance among all tested models. In addition, we analyze the relationship between label accuracy and classification result, and find that the critical factor is the complexity of each relation rather than label accuracy.
When comparing with previous models and selecting the encoders, this paper mainly uses representative models rather than the latest state-of-the-art models. The reason is that representative models are sufficient to demonstrate the effectiveness of introducing character information. Testing other models would supplement our work and is left for future work.

Figure 1. Information implied by Chinese characters.

Figure 4. Proposed attention-based model. $r_i$ represents the $i$-th input of the sentence; $e^h_i$ and $e^t_i$ are the embeddings of the $i$-th input of the head and tail entity. $q$ and $r_e$ are calculated from the head entity $r^h_e$ and the tail entity $r^t_e$ in different ways. The weight $\alpha_i$ is calculated from $q$ and the hidden states $h$ using attention.

Figure 5. Attention analysis of the '民族' relation. The darker the color, the higher the weight.

Table 1. Information of the proposed dataset.
Instance/triple distribution in the Baike dataset. The x-axis is the index of relations. The y-axis is the number of instances or triples.
First, we analyze the relation types in our dataset. Table 1 shows that the relation types in our dataset cover a broad scope. As shown in the last column of Table 1, the accuracy of distant supervision differs considerably among relations. The average accuracy of all the labeled relations is 68.28%. The "出版社" relation has the highest accuracy, 94.83%, while the accuracy of the "下辖地区" relation is only 26.72%. This difference might be reflected in the relation extraction result, so methods for reducing noise in the dataset are helpful in the extraction process.
Then, we compare our dataset with two frequently used datasets. As shown in Table 2, our dataset contains more relation types and instances. The Chinese SanWen dataset [20] contains 9 types of relations among 726 Chinese literature articles with 29,096 sentences. The ACE 2005 dataset contains 8,023 relation facts with 18 relation subtypes collected from newswires, broadcasts, and weblogs. Our dataset includes 463,788 instances in 30 relation types from different fields. Compared with the other datasets, our dataset is much larger and covers wider fields. In real-world open-domain relation extraction, there are many kinds of sentences and thousands of relations, and small-scale or domain-specific datasets are inadequate for the task. In conclusion, our dataset is more suitable for the large-scale open-domain relation extraction task.
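The source of the label noise behind these accuracy figures can be seen in a minimal sketch of distant supervision labeling. It assumes the simplest alignment rule, in which a sentence is labeled with a triple's relation whenever it mentions both entities; the dataset's actual construction pipeline may be more elaborate. The function names are hypothetical, the second sentence is invented, and the `is_correct` callback stands in for manual checking.

```python
def distant_label(sentences, triples):
    """Label a sentence with a triple's relation if it mentions both entities."""
    labeled = []
    for sentence in sentences:
        for head, relation, tail in triples:
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


def label_accuracy(labeled, is_correct):
    """Per-relation fraction of automatic labels judged correct."""
    totals, correct = {}, {}
    for item in labeled:
        relation = item[3]
        totals[relation] = totals.get(relation, 0) + 1
        if is_correct(item):
            correct[relation] = correct.get(relation, 0) + 1
    return {r: correct.get(r, 0) / totals[r] for r in totals}


triples = [("LeBron James", "Birthplace", "Akron")]
sentences = [
    "LeBron James was born in Akron, Ohio.",              # expresses the relation
    "LeBron James returned to Akron for a charity game.",  # invented noisy match
]
labeled = distant_label(sentences, triples)

# Stand-in for a manual judgment: only the first sentence expresses Birthplace.
manually_correct = {sentences[0]}
accuracy = label_accuracy(labeled, lambda item: item[0] in manually_correct)
```

Both sentences receive the Birthplace label, but only one of them expresses it, which is the kind of mismatch that per-relation label accuracy measures.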

Table 2. Comparison of datasets.

Table 3. AUC and F1 scores of different models.

Table 4. Comparison of F1 scores in different situations.

Figure. Relation: 作者 (literary work/writer/person). Head entity: 信号处理基础 (Foundations of Signal Processing). Tail entity: 杨浩 (Hao Yang). Instance: Foundations of Signal Processing is a book written by Hao Yang. It is published by Science Press in 2008.

Table 5. F1 score of each relation using the proposed model at the word level.