RGloVe: An Improved Approach of Global Vectors for Distributional Entity Relation Representation


Introduction
With the explosive growth and easy accessibility of web documents, extracting useful nuggets from irrelevant and redundant messages has become a cognitively demanding and time-consuming task. Under this circumstance, information extraction has been proposed to extract structured data from text documents. The automatic content extraction (ACE) program [1] provides an annotated corpus and evaluation criteria for a series of information extraction tasks. As an important level of information extraction, relation extraction aims to extract the relationships between named entities. Relation extraction is widely used in many fields, such as automatic construction of knowledge bases and information retrieval.
Traditional relation extraction is often limited to extracting pre-defined relation types. For example, ACE 2003 defines five relation types: AT (location relationships), NEAR (relative locations), PART (part-whole relationships), ROLE (the role a person plays in an organization) and SOCIAL (such as parent and sibling), which are further separated into 24 relation sub-types. However, it remains an open question whether traditional relation extraction is still effective for massive and heterogeneous corpora such as web documents [2][3][4]. To address this problem, much research has been done on extracting more abundant relations without requiring any specific relation type or a vocabulary such as verbs [5][6][7][8].
In light of the above, this paper aims to answer the following research questions:
RQ1 How can distributional entity representations be learned using only the statistical information of entity-entity co-occurrences?
RQ2 How can the relationships between entities be built, stored and queried without extracting predefined relation types or vocabulary?
To answer the above questions, we make the following contributions in this paper. For RQ1, this paper presents an improved model of global vectors called RGloVe, based on the idea of distributed representation. Global vectors (GloVe) [9] is an effective method to train distributional word representations from the global statistics of word co-occurrences in the whole corpus. In order to train the model more efficiently, we improve the traditional GloVe model by using the cosine similarity between entity vectors, instead of the dot product, to approximate the entity co-occurrences. Because cosine similarity normalizes each vector to unit length, it is intuitively more reasonable and converges more easily to a local optimum. For RQ2, instead of extracting relation types or a vocabulary, the task of distributional relation representation aims to extract a series of triples (e_1, e_2, ω). The weight ω is a real value which indicates the correlation of the two entities e_1 and e_2. In order to store these triples and facilitate their retrieval, we introduce the graph database Neo4j, where nodes represent the entities and edges represent the relationships between entities. The Cypher query language of Neo4j provides a highly accessible way to query different levels of relationships (e.g., friends of a friend).
The rest of this paper is organized as follows. We review related work on traditional relation extraction and distributional word representation in Section 2. In Section 3, we present the details of Neo4j and RGloVe. Section 4 shows the experimental results of quantitative representation between named entities. Finally, we conclude our work and point out future work in Section 5.

Related Work
Our work is inspired by traditional entity relation extraction, and our RGloVe model has its roots in distributional word representation. In this section, we briefly review related work on these two aspects.

Entity Relation Extraction
Relation extraction, as an important level of information extraction, has been widely researched. The proposed methods can be classified into three categories: supervised learning, semi-supervised learning and unsupervised learning. The most typical supervised methods are kernel-based methods [10][11][12][13][14][15]. Although supervised methods perform very well, their performance relies heavily on the availability of large amounts of manually labeled data. Therefore, many researchers have turned to semi-supervised learning [16][17][18][19], which can make full use of unlabeled data given a small amount of labeled data.
The above works are often limited to extracting pre-defined types, which makes them difficult to use in open domain applications. Recently, many unsupervised methods [20][21][22][23][24][25] for open domain relation extraction have been proposed to reduce the heavy manual labor. TextRunner [5] was the first Open IE (OIE) system, in which a large set of relational entity tuples was extracted without requiring any human labor and each tuple was then assigned a probability. Fader, Soderland and Etzioni [7] developed another Open IE system, REVERB, by introducing syntactic and lexical constraints on verb-centered relations. Tseng, Lee, Lin, Liao, Liu, Chen, Etzioni and Fader [8] presented the first Chinese Open IE system (CORE), which can extract entity relation triples from Chinese texts by combining many NLP techniques. Kalyanpur and Murdock [6] described entity-relation analysis in IBM Watson, which aimed to detect noun-centric phrases in the text. Distributional relation representation is similar to unsupervised methods: both aim to extract entity relations without requiring any specific relation types. Compared with previous works on Open IE, distributional relation representation focuses on the training of entity vectors instead of extracting more abundant features by introducing a series of NLP tools.

Distributional Word Representation
There are many effective methods of distributional word representation, such as one-hot representation, latent semantic analysis and distributed representation. One-hot representation is a sparse word vector whose dimension is equal to the size of the vocabulary. In the vector, there is a single 1 at the position of the corresponding word in the vocabulary, and all other elements are zero. One-hot representation therefore often suffers from the curse of dimensionality. Another problem of one-hot representation is that it is hard to find the relationship between word vectors. To solve the above problems, many researchers try to transform words into a low dimensional semantic space. For example, Landauer, et al. [26] reported the results of using latent semantic analysis (LSA), a high-dimensional linear associative model, to analyze a large corpus of natural text and generate a representation that captures the similarity of words and text. Turney [27] introduced Latent Relational Analysis (LRA), a method for measuring relational similarity, that is, the correspondence between relations. Sebastian, et al. [28] presented a novel framework for constructing semantic spaces that takes syntactic relations into account. Gamallo, et al. [29] concluded that Singular Value Decomposition (SVD) is a more efficient model for a number of word similarity extraction tasks. Distributed representation is another effective low dimensional vector representation. For example, Bengio, et al. [30] combined the n-gram model with a simple neural network architecture to learn distributional word representations. Collobert, et al. [31] presented a multilayer neural network architecture to learn distributional word representations in the window-based context of a word instead of the preceding context. Mikolov, et al. [32] proposed the continuous bag-of-words model (CBOW) and the continuous skip-gram model for learning distributional word representations. Pennington, Socher and Manning [9] proposed a global vector model that trains only on the nonzero elements of the co-occurrence matrix.

Methods
For the task of distributional relation representation, we propose an improved global vectors model called RGloVe, which can train the entity vectors more effectively. A graph database is then introduced to build, store and query the extracted entity relationships.

Co-Occurrence Matrix
We use a preprocessing tool to extract all the named entities in the whole corpus. If entity i and entity j occur in the same document, the two entities are regarded as co-occurring. Let the co-occurrence matrix be denoted by X, whose element X_{ij} represents the co-occurrence frequency of entity i and entity j. X_{ij} is computed from the locations of the two entities, where L_{di} denotes the location of entity i in a document d. In this way, the more distant two entities are in a document, the less relevant they are considered to be.
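The distance-discounted counting described above could be sketched as follows. This is only an illustration under stated assumptions: the text does not give the exact weighting formula here, so the sketch assumes that each document contributes 1 / |L_di - L_dj| for a pair of entities and that only the first mention of each entity is used. The function name `cooccurrence_weights` and the input layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(documents):
    """Accumulate distance-discounted co-occurrence weights X[(i, j)].

    documents: list of documents, each a list of (entity, position) pairs.
    Assumption (not from the paper): a pair contributes 1 / distance per
    document, measured between the first mentions of the two entities.
    """
    X = defaultdict(float)
    for doc in documents:
        first_pos = {}
        for entity, pos in doc:
            first_pos.setdefault(entity, pos)  # keep the first mention only
        # Sorting the names gives every pair one canonical key (a, b).
        for a, b in combinations(sorted(first_pos), 2):
            distance = abs(first_pos[a] - first_pos[b])
            X[(a, b)] += 1.0 / max(distance, 1)  # guard against zero distance
    return dict(X)


docs = [
    [("Obama", 0), ("Trump", 2)],
    [("Obama", 5), ("Trump", 6), ("USA", 9)],
]
X = cooccurrence_weights(docs)
```

Pairs that appear close together in many documents accumulate a large weight, while distant or rare pairs stay near zero.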

Distributional Entity Representation of RGloVe
Without abundant features, the statistics of entity co-occurrences in a corpus are the primary source of information available for distributional relation representation between entities. The global vectors method has been proposed to train word vectors by efficiently leveraging statistical information. In order to train the entity vectors more efficiently for distributional entity relation representation, we make some improvements to global vectors. Firstly, RGloVe uses the cosine similarity between entity vectors, instead of the dot product used in traditional global vectors, to approximate the entity co-occurrences. Secondly, RGloVe reduces the weight function to a linear function whose value is limited to between 0 and 1. Finally, RGloVe trains the entity vectors by AdaGrad [33].

Brief Review of Global Vectors
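The global vectors model seeks a function F that reproduces the ratios of co-occurrence probabilities. For a vector space with inherently linear structure, this function depends only on the difference of the two target word vectors. This idea can be expressed as,
$$F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{2}$$
where $w \in \mathbb{R}^d$ are word vectors, $\tilde{w} \in \mathbb{R}^d$ are context word vectors, and $P_{ik}$ is the co-occurrence probability of entity i and entity k.
To achieve symmetry, it is required that F be a homomorphism, modifying Equation (2) to,
$$F\big((w_i - w_j)^{T} \tilde{w}_k\big) = \frac{F(w_i^{T} \tilde{w}_k)}{F(w_j^{T} \tilde{w}_k)} \tag{3}$$
Letting F be the exponential function and adding two bias terms $b_i$, $\tilde{b}_k$ for $w_i$, $\tilde{w}_k$ respectively,
$$w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik} \tag{4}$$
To weight all co-occurrences differently, a non-decreasing weight function can be designed as,
$$f(x) = \begin{cases} (x / x_{cutoff})^{\alpha} & \text{if } x < x_{cutoff} \\ 1 & \text{otherwise} \end{cases} \tag{5}$$
where α and $x_{cutoff}$ are set to empirical values. Finally, the cost function, which combines a least squares regression model with the weight function f, is,
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \big)^2 \tag{6}$$
where V is the size of the vocabulary. The word vectors can be trained by AdaGrad. More details of the derivation can be found in [9].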
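As a rough illustration of the weighting function and the weighted least squares cost reviewed above, the following NumPy sketch evaluates both on a small dense matrix. The function names, the dense-matrix layout and the separate context arrays `W_ctx` and `b_ctx` are assumptions made for this example; the defaults for `x_cutoff` and `alpha` are the values reported in the experimental settings.

```python
import numpy as np


def glove_weight(x, x_cutoff=100.0, alpha=0.75):
    """Non-decreasing weight: (x / x_cutoff)^alpha below the cutoff, 1 above."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_cutoff, (x / x_cutoff) ** alpha, 1.0)


def glove_cost(W, W_ctx, b, b_ctx, X):
    """Weighted least squares cost over the nonzero entries of X.

    W, W_ctx: (V, d) arrays of target and context vectors.
    b, b_ctx: (V,) arrays of bias terms.
    X: dense (V, V) co-occurrence matrix (dense only to keep the sketch short).
    """
    i, k = np.nonzero(X)  # zero entries are skipped, so log(0) never occurs
    pred = np.sum(W[i] * W_ctx[k], axis=1) + b[i] + b_ctx[k]
    err = pred - np.log(X[i, k])
    return float(np.sum(glove_weight(X[i, k]) * err ** 2))
```

The cutoff caps the influence of very frequent pairs at 1, while rare pairs are down-weighted smoothly toward 0.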

Global Vectors for Distributional Relation Representation
Cosine similarity between entity vectors is a very effective quantitative representation of entity relations, which inspires us to study the co-occurrence probabilities from the point of view of the cosine function. If two entity vectors $w_i$ and $w_j$ have a very high degree of similarity, entity i will occur more frequently in the context of entity j. This idea can be expressed by replacing the dot product in the global vectors model with the cosine similarity $\cos(w_i, \tilde{w}_j) = w_i^{T} \tilde{w}_j / (\lVert w_i \rVert \, \lVert \tilde{w}_j \rVert)$. Letting F again be the exponential function and adding two bias terms $b_i$, $\tilde{b}_j$ for $w_i$, $\tilde{w}_j$ respectively, Equation (4) is changed to,
$$\cos(w_i, \tilde{w}_j) + b_i + \tilde{b}_j = \log X_{ij} \tag{8}$$
Compared with Equation (4), we conclude that it is more natural to approximate the co-occurrence matrix by cosine similarity than by dot product.
For the weighting function in global vectors, the cutoff is designed to ensure that large values of x are not overweighted. The main drawback of this weighting function is that it is hard to choose the empirical values of α and $x_{cutoff}$. In addition, unlike co-occurrences of words such as "the" and "and", entity co-occurrences do not suffer from extremely frequent co-occurrences. So, we simplify the weighting function as,
$$f(x) = \frac{x}{x_{max}} \tag{9}$$
where $x_{max}$ is the maximum value in the co-occurrence matrix. Figure 1 shows the two different weighting functions, where the blue one represents our simplified weighting function. Finally, the new cost function is given by,
$$J = \sum_{i,j=1}^{V} \frac{X_{ij}}{x_{max}} \big( \cos(w_i, \tilde{w}_j) + b_i + \tilde{b}_j - \log X_{ij} \big)^2 \tag{10}$$
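The cosine-based cost with the simplified linear weighting can be sketched in the same style. This is a minimal illustration rather than the authors' implementation: the function names, the dense matrix and the separate context arrays `W_ctx` and `b_ctx` are assumptions of the example.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rglove_cost(W, W_ctx, b, b_ctx, X):
    """Cost with cosine similarity in place of the dot product and the
    linear weight x / x_max, summed over the nonzero entries of X."""
    x_max = X.max()
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        err = cosine(W[i], W_ctx[j]) + b[i] + b_ctx[j] - np.log(X[i, j])
        total += (X[i, j] / x_max) * err ** 2
    return float(total)
```

Because the cosine ignores vector length, rescaling all entity vectors leaves this cost unchanged, so only the directions of the vectors and the bias terms have to fit log X_ij.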

Training by AdaGrad
The goal of training is to obtain optimal entity vectors by minimizing the cost function. Stochastic gradient descent is an effective optimization method for minimizing an objective function. However, standard stochastic gradient descent applies the same learning rate to every parameter. The adaptive gradient algorithm (AdaGrad) was proposed to solve this problem. AdaGrad adaptively assigns a different learning rate to each parameter $\theta_i$ by,
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^{2} + \varepsilon}} \, g_{t,i} \tag{11}$$
where η is the initial learning rate and ε is a small positive number. $g_t$ is the gradient of the cost function,
$$g_{t,i} = \frac{\partial J}{\partial \theta_i} \bigg|_{\theta = \theta_t} \tag{12}$$
It can be concluded from Equation (11) that AdaGrad updates a parameter more slowly the larger its accumulated past gradients are. Through the above training, we obtain all the entity vectors, which inherently represent the relations between entities. It is therefore natural to establish entity relationships by computing the cosine similarity of entity vectors.
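A single AdaGrad update can be sketched as below. The placement of ε inside the square root is one common convention and is an assumption here, as are the function and argument names; the default learning rate of 0.05 is the value reported in the experimental settings.

```python
import numpy as np


def adagrad_step(theta, grad, accum, eta=0.05, eps=1e-8):
    """One AdaGrad update.

    theta: parameter array; grad: gradient of the cost at theta;
    accum: running sum of squared gradients, same shape as theta.
    Returns the updated (theta, accum).
    """
    accum = accum + grad ** 2
    theta = theta - eta * grad / np.sqrt(accum + eps)
    return theta, accum


# Two steps with the same gradient: the second step is smaller, because the
# accumulated squared gradient has grown.
theta0 = np.array([1.0])
accum0 = np.zeros(1)
g = np.array([2.0])
theta1, accum1 = adagrad_step(theta0, g, accum0)
theta2, accum2 = adagrad_step(theta1, g, accum1)
```

Each parameter keeps its own accumulator, so frequently updated parameters are slowed down while rarely updated ones keep a step size close to η.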

Entity Relational Storage of Neo4j
Neo4j is a commercially supported open-source graph database. It stores data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. For entity relation representation, Neo4j records entities in nodes which have two properties: entity name and entity frequency. These nodes are linked by typed relations, each of which has the property of co-occurrence weight (X_{ij} in the co-occurrence matrix). Figure 2 shows the graph structure that we use for entity relation representation.
The cosine similarity between entity vectors is one of the most common methods for distributional entity relation representation. However, Neo4j is suited to storing relation types rather than real numbers. In order to derive relation types from the distributional relation representation between entities, this paper presents an unsupervised method based on entity vectors. Firstly, we choose the 24 relation sub-types of the ACE relation extraction task as our relation types. Then we use the traditional global vector model to train the relation type vectors on a large-scale corpus of Sina News. Finally, we choose the most probable relation type by calculating the cosine similarity between the entity vectors and each relation type vector.
Neo4j can store hundreds of millions of nodes and relations. Querying such a large amount of data requires a powerful query language. The declarative graph query language Cypher is designed to allow for expressive and efficient querying. Cypher is a human-readable query language which is similar to SQL. There are four common clauses for querying: START (the starting point in the graph), MATCH (the graph pattern to match), WHERE (filtering criteria) and RETURN (what to return).
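To make the storage and querying of triples more concrete, the sketch below prepares Cypher statements and their parameters in Python. It is only an illustration: the node label `Entity`, the generic relation type `RELATED`, the property names and the helper `triple_params` are hypothetical choices, and the statements are plain strings that would still have to be sent to a Neo4j server through a driver. The query uses three of the clauses listed above (MATCH, WHERE and RETURN).

```python
# Cypher statement that stores one (e1, e2, weight) triple. MERGE creates the
# nodes and the relation only if they do not exist yet.
MERGE_TRIPLE = (
    "MERGE (a:Entity {name: $e1}) "
    "MERGE (b:Entity {name: $e2}) "
    "MERGE (a)-[r:RELATED]->(b) "
    "SET r.weight = $weight"
)

# Cypher query for entities exactly two hops away from a given entity,
# in the spirit of a "friends of a friend" lookup.
FRIENDS_OF_FRIENDS = (
    "MATCH (a:Entity {name: $name})-[:RELATED*2]-(c:Entity) "
    "WHERE c.name <> $name "
    "RETURN DISTINCT c.name AS name"
)


def triple_params(triples):
    """Turn (e1, e2, weight) triples into parameter dicts for MERGE_TRIPLE."""
    return [{"e1": e1, "e2": e2, "weight": float(w)} for e1, e2, w in triples]


params = triple_params([("Obama", "Trump", 0.8), ("James", "Curry", 0.9)])
```

Passing values as parameters instead of formatting them into the query string keeps entity names with quotes or other special characters from breaking the statement.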

Experiments
In this section, we present and discuss the experimental results on the Chinese data set of Sina News. The flow diagram of our experiments is shown in Figure 3. First, the open tool ICTCLAS [34] is employed to conduct word segmentation, POS tagging and named entity recognition. The co-occurrence matrix is obtained by collecting the statistics of entity co-occurrences in the whole corpus. Then, we use our improved model RGloVe to train the entity vectors. Finally, we use the graph database Neo4j to build, store and query the extracted relationships.

Data Set and Experimental Settings
We choose the data set of Sina News, which contains 121,157 documents published between 1 March and 31 August 2015. These documents differ in length and cover various categories, including politics, economy, sports, entertainment, etc. After preprocessing, 127,128 named entities and 3,230,441 entity pairs are extracted from the whole corpus.
We perform a comparative experiment among Word2Vec [32], GloVe [9] and RGloVe. Word2Vec is a very popular neural network-based model for training word vectors. For Word2Vec, we choose the CBOW model with 25 iterations for relationship type vectors of 300 dimensions. For GloVe and RGloVe, we train the models using AdaGrad with an initial learning rate of 0.05. We run 100 iterations for entity vectors of 300 dimensions. For global vectors, we set $x_{cutoff} = 100$ and α = 3/4. Each model generates two sets of word vectors, $W$ and $\tilde{W}$, which are equivalent. The final entity vectors are given by the sum $W + \tilde{W}$.

Entity Vectors Presentation
Figures 4 and 5 intuitively present the vectors of GloVe and RGloVe. From the result of RGloVe, we can clearly see that James and Curry have similar vector curves because both are famous basketball players. Likewise, Obama and Trump have similar vector curves because both are presidents of the USA. Such regularities are hard to find in the result of GloVe.

Quantitative Representation Result and Discussion
Cosine similarity between entity vectors provides a very effective quantitative representation of entity relations. In this paper, we use three methods, Word2Vec, GloVe and RGloVe, to obtain the entity vectors. In order to compare the performance of these models, we make a series of assumptions and use three evaluation measures: error rate, top N precision and average accuracy of relationship classification.

Error Rate
It is assumed that if the cosine similarity of two entities in the co-occurrence matrix is less than zero, the tuple is regarded as a negative instance. The error rate is the ratio of all negative instances to the size of the co-occurrence matrix. Table 1 shows that our improved model RGloVe achieves a 9.59% lower error rate than traditional global vectors.
We first select the top N weight triples $T_g$ from the co-occurrence matrix as our ground truth. Then we define the similarity matrix, whose elements are the cosine similarities between two entity vectors. Finally, we choose the top N similarity triples $T_c$ from the similarity matrix as our comparative result. Top N precision is defined by,
$$P_N = \frac{|T_g \cap T_c|}{N} \tag{13}$$
Top N precision measures how well the similarity of entity vectors approximates the co-occurrence weights. Table 2 shows the experimental results with different sample sizes. We can see from the results that our improved global vectors model achieves a better estimation of the ground truth. The top N precision is very low for both Word2Vec and GloVe because they weaken extremely frequent co-occurrences. In our improved model, we relax this weakening effect by using a linear weighting function.
In order to evaluate our performance on relationship classification, we conduct a manual labeling scheme to annotate the relationship types between extracted entities. Three independent annotators were instructed to label 100 entity pairs for each relation sub-type. To measure the reliability of our annotation scheme, we conduct an agreement study by computing the value of Fleiss' kappa [35]. Fleiss' kappa is a statistical measure of the reliability of agreement between a fixed number of raters. For our annotation, we achieve a Fleiss' kappa value of 0.69, which is considered substantial agreement. Table 3 shows that our improved model RGloVe achieves a 2.5% higher average accuracy than traditional global vectors and is close to the supervised method of SVM.
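The evaluation measures used in this section can be sketched as follows. This is an illustrative reading of the definitions in the text, not the authors' evaluation code: the function names and data layouts are assumptions, top N precision is taken to be the overlap between the two top N sets divided by N, and Fleiss' kappa follows its standard textbook definition.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def error_rate(pairs, vectors):
    """Share of co-occurring pairs whose vectors have negative cosine."""
    negatives = sum(1 for a, b in pairs if cosine(vectors[a], vectors[b]) < 0)
    return negatives / len(pairs)


def top_n(scores, n):
    """Keys of the n largest values in a dict mapping pair -> score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def top_n_precision(weights, similarities, n):
    """Overlap of the top n pairs by co-occurrence weight and by similarity."""
    return len(top_n(weights, n) & top_n(similarities, n)) / n


def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) table of rating counts,
    where every row sums to the same number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_exp = p_item.mean(), np.sum(p_cat ** 2)
    return float((p_bar - p_exp) / (1 - p_exp))
```

For example, three raters who always agree yield a kappa of 1, while agreement at chance level yields a value near 0.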

Conclusions and Future Work
In this paper, we have proposed RGloVe, an improved global vectors method for distributional entity relation representation. Unlike traditional relation extraction, distributional relation representation aims to train entity vectors and measure the closeness of the relationship between two entities. The major advantage of distributional relation representation is that it is no longer limited to predefined relation types, which makes it easy to apply to open domain question answering and information retrieval.
The statistics of entity co-occurrences in a corpus are the primary source of information available for distributional relation representation between entities. We first obtain a co-occurrence matrix, each of whose elements represents the co-occurrence weight of two entities. Then, in order to train the entity vectors more efficiently, we develop the improved global vectors model RGloVe, which uses cosine similarity instead of the dot product to approximate the entity co-occurrences. Finally, the graph database Neo4j is introduced for building, storing and querying the relationships between named entities. The comparative experiments show the superiority of our method. In future work, we will explore better classification criteria than the cosine similarity between entity vectors and relationship type vectors. In addition, it is important to extend our model to experiments on English corpora.

Figure 2. The graph structure for visual representation.

Figure 3. Our experimental framework for distributional entity relation representation.

Figure 4. The result of GloVe.

Table 2. Top N precision.

Table 3. Average accuracy of relationship classification.