AN APPROACH FOR MEASURING SEMANTIC RELATEDNESS BETWEEN WORDS VIA RELATED TERMS

In this paper we propose a new approach for measuring semantic relatedness between words. The relatedness between two words is not measured directly, but is computed via a set of words highly related to them, which we call the set of determiner words. Our approach belongs to the family of web page count based measurement methods, but it also takes into account information expressing hierarchical and other types of relations between the words. The experimental results demonstrate the effectiveness of the proposed method.

Keywords: semantic relatedness, semantic similarity, information based measurement, information content


INTRODUCTION
Measures of relatedness or similarity are used in a variety of applications, such as information retrieval, automatic indexing, word sense disambiguation, and automatic text correction. Semantic similarity and semantic relatedness are sometimes used interchangeably in the literature. These terms, however, are not identical. Semantic relatedness indicates the degree to which words are associated via any type of semantic relationship (such as synonymy, meronymy, hyponymy, hypernymy, functional, associative, and other types). Semantic similarity is a special case of relatedness that takes into consideration only hyponymy/hypernymy relations. Relatedness measures may use a combination of the relationships existing between words, depending on the context or their importance. To illustrate the difference between similarity and relatedness, Resnik [1] provides the widely used example of car and gasoline. These terms are not very similar; they have only a few features in common. But they are closely related in a functional context, namely that cars use gasoline. A number of researchers use a distance measure as the opposite of similarity.
In this work we propose a new approach for measuring semantic relatedness between words. The main idea of the approach is that the semantic relatedness between two words is not measured directly, but is determined via a set of words highly related to them, which we call the set of determiner words. Our approach belongs to the family of web page count based measurement methods, but we take into account information expressing hierarchical and other types of relations between the words. Comparison of the experimental results with a benchmark set of human similarity ratings shows the effectiveness of the proposed approach.
The paper is organized as follows. Section 2 presents related work. In Section 3 the motivation for the proposed method is given. The method for evaluating semantic relatedness between words is discussed in Section 4, together with the implementation results. Our conclusions and future work are presented in the final section.

RELATED WORK
A number of semantic similarity methods have been developed. Generally these methods can be classified into two main categories: edge counting methods and information content methods. Edge counting methods, also known as path based methods, define the similarity of two words as a function of the length of the path linking the words and of the position of the words in the taxonomy. The work of Rada et al. [2] forms the basis of edge counting methods. They compute semantic relatedness in terms of the number of edges between the words in the taxonomy. The Leacock and Chodorow [3] measure takes into account the depth of the taxonomy in which the words are found: lch(c1, c2) = -log(length(c1, c2) / 2D), where length(c1, c2) is the number of nodes along the shortest path between the two nodes and D is the maximum depth of the taxonomy. The Wu and Palmer similarity metric [4] uses the depths of the two given words in the taxonomy, along with the depth of their least common subsumer (LCS): sim_wup = 2 * depth(LCS) / (depth(word1) + depth(word2)). Information content methods, also known as corpus based methods, measure the difference in information content of two words as a function of their probability of occurrence in a corpus. Such a method was first proposed by Resnik [1]. According to Resnik, the similarity of two words is equal to the information content (IC) of their least common subsumer: sim_res = IC(lcs(c1, c2)). However, because many word pairs may share the same LCS, and would therefore have identical similarity values, the Resnik measure may not be able to make fine grained distinctions [5]. Jiang and Conrath [6] and Lin [7] have developed measures that scale the information content of the subsuming concept by the information content of the individual concepts. Lin does this via a ratio, and Jiang and Conrath via a difference.
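The two path based formulas above can be made concrete on a toy taxonomy. The sketch below is only an illustration, not a WordNet implementation: the five-node taxonomy is invented for the example. It implements path length, depth, and LCS directly, and then the lch and wup measures on top of them (natural logarithm assumed for lch):

```python
import math

# A hypothetical miniature taxonomy (child -> parent), rooted at "entity".
PARENT = {
    "vehicle": "entity",
    "car": "vehicle",
    "train": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lcs(a, b):
    """Least common subsumer: the lowest ancestor of b that also subsumes a."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return None

def length(a, b):
    """Number of nodes on the shortest path between a and b, through the LCS."""
    common = lcs(a, b)
    return path_to_root(a).index(common) + path_to_root(b).index(common) + 1

def lch(a, b, max_depth):
    """Leacock-Chodorow: -log(length(a, b) / 2D)."""
    return -math.log(length(a, b) / (2.0 * max_depth))

def wup(a, b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

On this toy taxonomy wup(car, train) = 2*2/(3+3) ≈ 0.67, while wup(car, dog), whose LCS is the root, is only ≈ 0.33, which matches the intuition behind the measure.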
Gloss based methods define the relatedness between two words as a function of gloss overlap [8]. Banerjee and Pedersen [9] have proposed a method that computes the overlap score by extending the glosses of the words under consideration to include the glosses of related words in a hierarchy.
Many of these measures were initially defined in the context of the WordNet ontology [10]. WordNet is a lexical reference system that was created by a team of linguists and psycholinguists at Princeton University. WordNet is distinguished from traditional lexicons in that lexical information is organized according to word meanings, not according to word forms. As a result of this shift of emphasis toward word meanings, the core unit in WordNet is the synset. Synsets are sets of words that have the same meaning, that is, synonyms. A synset represents one concept, to which different word forms refer. For example, the set {car, auto, automobile, machine, motorcar} is a synset in WordNet and forms one basic unit of the WordNet lexicon. Although there are subtle differences in the meanings of synonyms, these are ignored in WordNet.
Some researchers define the semantic relatedness between words using the Web. Bollegala et al. [11] have proposed a method that exploits page counts and text snippets returned by a Web search engine to measure semantic similarity between words. Cilibrasi and Vitányi [12] developed a method that defines the relatedness between words via the Google Similarity Distance; they use the World Wide Web as the database and Google as the search engine. An approach to computing semantic relatedness using Wikipedia is proposed in [13]. Michael Strube and Simone Paolo Ponzetto also investigated the use of Wikipedia for computing semantic relatedness measures [14]. Li et al. [15] determined semantic similarity from a number of information sources, consisting of structural information from a taxonomy and information content from a corpus. Some similarity measures are based on applications of fuzzy set theory; in particular, a new fuzzy similarity measure with better performance than conventional similarity methods has been proposed in [16].

MOTIVATION
In this section we briefly focus on the drawbacks of Web oriented and WordNet oriented approaches in order to motivate our method. First we look at the Web oriented approach. Two linguistic factors negatively affect the results obtained from web based relatedness computation. These factors are synonymy, when many words refer to the same concept (for example, car and automobile), and polysemy, when many concepts are expressed by the same word (for example, Oracle). The impact of synonymy is that if a document contains one word of a synonym pair, the other synonym is usually not used in that document; authors prefer to use the same word to express the same meaning. For this reason, the similarity degree between synonymous words, computed via purely web based methods, is lower than it should be. For example, a Google search for "journey" returns 114,000,000 hits. (For calculating the NGD distance the following site was used: http://digitalhistory.uwo.ca/cgi-bin/ngd-calculator.cgi.) The number of hits for "voyage" is 113,000,000. The number of pages where "journey" and "voyage" occur together is 1,670,000. Using these data we obtain a Normalized Google Distance between the highly semantically similar words "journey" and "voyage" of NGD(journey, voyage) ≈ 0.90808. If we believed this result to be reliable, we would have to conclude that there is hardly any similarity between "journey" and "voyage".
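The NGD value quoted above can be reproduced approximately from the page counts alone. The sketch below follows the standard Normalized Google Distance formula of Cilibrasi and Vitányi; the total index size N is not stated in the text, so N = 10^10 is our assumption, which is why the computed value (≈ 0.94) differs slightly from the reported 0.90808:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy -- hit counts for each term alone
    fxy    -- hit count for the two terms together
    n      -- total number of indexed pages (assumed here, not from the text)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Page counts quoted in the text for "journey" and "voyage":
d = ngd(114_000_000, 113_000_000, 1_670_000, 1e10)  # d ~ 0.94 with N = 1e10
```

The distance is near 1 despite the two words being near-synonyms, which is exactly the synonymy drawback described above: near-synonyms rarely co-occur on the same pages.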
Polysemy has the opposite effect, causing documents that use the same word in different senses to be considered related when they should not be. For example, the word "cord" may be used in various senses (rope, automobile, rock group, spinal cord, ...). A Google search for "cord" returns 61,400,000 articles. But if we are interested only in the "spinal cord" meaning of the word, only approximately 148,000 articles meet our interest.
For these reasons, measuring semantic similarity based on a large search engine alone does not give the expected results. Certainly, the web contains abundant information about words and their relations. But the main problem is to find ways that allow us to extract only the useful, related information from this huge information store. Now we give an example that clearly indicates the drawbacks of WordNet based methods. The similarity values between "student" and "examination", computed by methods based on the WordNet ontology, are given in Table 1. (For calculating similarity please refer to: http://marimba.d.umn.edu/cgi-bin/similarity.cgi.) As can be seen from Table 1, the similarities by the hco, lin and res methods are equal to zero. The other methods return small similarity values between the words. Comparing the values from the tables, we can conclude that "student" and "animal" come out as more similar than "student" and "examination". These examples clearly indicate the difference between "relatedness" and "similarity". They also suggest that neither approach alone is sufficient for measuring relatedness between words. Before measuring relatedness we must clearly determine what we expect from relatedness, and choose measurement methods according to that expectation.
In the next section we propose a relatedness measure which may be useful for information retrieval applications.
To address these problems we propose a method that determines the similarity of words via related terms (like keywords for articles), which we call determiner words. For every word it is not difficult to find closely related terms. For example, if we say "student", the words "examination", "university", "instructor", and "young people" come to mind. We think that using a set of related words allows us to define a word more precisely.

THE METHOD
Let W1 and W2 be the words between which we want to measure relatedness. The method consists of the following steps.

2. Calculate the normalized values of relatedness between the determiners and the words W1 and W2.
Here freq(d_i, W1) is the number of pages where d_i and W1 occur together; freq(d_i, W2) is defined analogously, and maxfreq denotes the largest of these co-occurrence counts over the determiners, so that each normalized value rel(d_i, W) = freq(d_i, W) / maxfreq lies between 0 and 1. We consider that if a determiner word is highly related to a word, then the probability of the determiner occurring in the pages where the word appears is high. In the special case where d_i is a synonym, or near-synonym, of W1 (W2), we take rel(d_i, W1) = 1 (respectively rel(d_i, W2) = 1).

3. Calculate the relatedness between the words from the normalized values, using the co-occurrence factor α_i and the synonymy factor syn, defined as follows:

α_i = 2, if d_i occurs in the determiner sets of both W1 and W2; α_i = 1, otherwise.

syn = 1, if W1 and W2 are synonyms or near-synonyms; syn = 0, otherwise.

The sample. To illustrate our method we use the word pair (car, train) from the Rubenstein-Goodenough set [17]. We take W1 = train and W2 = car. The determiner sets of W1 and W2 are D1 = {rail, transport, vehicle, freight, passenger} and D2 = {automobile, motor, wheel, passenger, vehicle}. Since automobile is a synonym (or near-synonym) of car, we assume that automobile occurs in all the pages in which car occurs; in other words, the normalized value for (car, automobile) is equal to 1. The words vehicle and passenger are determiners of both words, so for these determiners the co-occurrence factor is equal to 2. Data about the numbers of hits are given in Table 3. ** indicates that the related value is not the real number of pages where automobile and car co-occur: since these words are synonyms, as the hit count we take the maximum of the hit counts of car with its determiner words.
According to the formulae we obtain that the relatedness between train and car is 0.54182719, as opposed to 6.31 on FC (note that the FC measure gives a number between 0 and 10).
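The steps of the method can be sketched in code. Two caveats: the hit counts below are hypothetical, since Table 3's actual numbers are not reproduced in the text, and the exact closed-form combination used in step 3 is not restated here, so the alpha-weighted mean of normalized values used below is our assumption, not necessarily the paper's exact rule:

```python
# Determiner sets from the example (W1 = train, W2 = car).
D1 = {"rail", "transport", "vehicle", "freight", "passenger"}
D2 = {"automobile", "motor", "wheel", "passenger", "vehicle"}

# Hypothetical co-occurrence counts (stand-ins for Table 3). Per the synonym
# rule, "automobile" is assigned car's maximum count so it normalizes to 1.
freq_train = {"rail": 900, "transport": 700, "vehicle": 500, "freight": 600,
              "passenger": 800, "automobile": 200, "motor": 150, "wheel": 100}
freq_car = {"rail": 50, "transport": 400, "vehicle": 900, "freight": 80,
            "passenger": 600, "automobile": 900, "motor": 800, "wheel": 700}

def normalized(freq_w):
    """Step 2: divide each count by the maximum count (maxfreq)."""
    m = max(freq_w.values())
    return {d: c / m for d, c in freq_w.items()}

def relatedness(d1, d2, f1, f2, syn=0):
    """Step 3 under our assumed combining rule: an alpha-weighted mean of
    the averaged normalized values, plus the synonymy factor, capped at 1."""
    r1, r2 = normalized(f1), normalized(f2)
    num = den = 0.0
    for d in d1 | d2:
        alpha = 2 if (d in d1 and d in d2) else 1  # co-occurrence factor
        num += alpha * (r1[d] + r2[d]) / 2
        den += alpha
    return min(1.0, num / den + syn)

score = relatedness(D1, D2, freq_train, freq_car)  # train vs. car, syn = 0
```

With these invented counts the score lands in the middle of the [0, 1] range, qualitatively consistent with the moderate relatedness the paper reports for this pair.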

IMPLEMENTATION
To implement our method we used WordNet and Wikipedia as information sources. As mentioned above, WordNet is a lexical database developed at Princeton by Miller and freely available. As of 2006 the WordNet database contained about 150,000 words organized in over 115,000 synsets, for a total of 207,000 word-sense pairs. However, WordNet does not include some named entities and specialized concepts. Wikipedia is a multilingual, web based, free content encyclopedia project operated by the Wikimedia Foundation, a non-profit organization. As of July 20, 2007, Wikipedia had approximately 7.8 million articles in 253 languages, 1.893 million of which were in the English edition.
For evaluating the proposed method we used the Miller-Charles dataset [10]. The Miller-Charles dataset consists of 30 word pairs rated by a group of 38 human subjects. The word pairs are rated on a scale from 0 (no similarity) to 4 (perfect synonymy). The dataset is considered a reliable benchmark for evaluating semantic similarity measures. Most researchers have used only 28 word pairs of the Miller-Charles set; these pairs were used in our experiments as well. In Table 4 the results of the experiments on the Miller-Charles dataset are presented. The correlation achieved by the proposed method (0.953) shows its high effectiveness.
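The reported value of 0.953 is presumably a Pearson correlation between the method's scores and the human ratings. A minimal sketch of that computation, with hypothetical ratings for five pairs standing in for the actual 28-pair data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ratings for illustration (not the actual dataset values):
human_ratings = [3.92, 3.84, 3.76, 1.66, 0.42]   # 0-4 human scale
method_scores = [0.98, 0.95, 0.90, 0.45, 0.10]   # 0-1 method scale
r = pearson(human_ratings, method_scores)
```

Note that Pearson correlation is invariant to linear rescaling, which is why scores on a 0-1 scale can be compared directly against human ratings on a 0-4 scale.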

CONCLUSION
A new approach for measuring the relatedness between words has been presented in this paper. The approach is based on using determiner words. The experimental results show the effectiveness of the method. But there are some problems with the application of the method. The main problem is choosing the determiner words. For this purpose, articles from Wikipedia may be used. Using common words as determiners is not recommended. Although there is no limit on the number of determiners, we think that 5-10 determiners per word are sufficient. The main direction of our future work is the design of an algorithm that selects determiners from information sources automatically.

1. Determine the sets of related words for W1 and W2. Let D1 = {d11, d12, d13, ..., d1n} and D2 = {d21, d22, d23, ..., d2m} be the sets of determiner words of W1 and W2, respectively. Next we form the set of common determiner words D as D = D1 ∪ D2. To avoid notational complexity we denote the elements of D by d, so that D = {d1, d2, d3, ..., dk}, where k is less than or equal to n + m.

Table 1. Similarity between the words student and examination

Table 2. Similarity between the words student and animal

Table 3. Determiners of the words train and car and their hit counts. * indicates that the related words are determiners of both words.

Table 4. Semantic similarity of human ratings and baselines on the Miller-Charles dataset