AUTOMATED EXTRACTION OF SEMANTIC WORD RELATIONS IN TURKISH LEXICON

This paper studies the extraction of semantic word relations found in the Turkish lexicon. The main goal of the study is to build an effective lexical-conceptual database and to contribute to natural language processing (NLP) studies in Turkish. The fundamental word relations studied are meronymy (part-whole), synonymy, antonymy and hypernymy (hierarchical). This study improves on an earlier work [1] on the semantic relations of the Turkish lexicon. It was inspired by well-known projects such as Rose [2], ThinkMap [3], and WordNet [4]. An online dictionary provided by the Turkish Language Foundation (TDK) [5] is used as the corpus in this study. The dictionary contains more than 63K lexemes. Morphological analysis is done with a tool called Zemberek [6]. The results are presented in terms of the noun pairs obtained and their accuracy.
Key Words: Turkish WordNet, Semantic word relations, NLP.


INTRODUCTION
The Turkish language is spoken with different accents and dialects in many geographical areas around the world [7]. Despite its widespread use, Turkish is a less-studied language in interdisciplinary fields such as computational linguistics (CL), NLP and artificial intelligence.
Words are the fundamental building blocks of the cognitive processes of communication, thinking, and decision making. While words are being learned, most of the information related to them is also kept in the background. Although the most commonly used dictionaries have been transferred to the electronic environment and used by information technologies in the last decade, they generally provide only the words and their definitions. Various useful features of words and the relationships among them cannot be represented, so this valuable data cannot be used by many other applications. This study aims to store words along with their various features and relationships in a knowledge base, and to implement a WordNet that can represent a wide variety of relationships between words [8].
Traditional dictionaries share some fundamental features, the most common being a word and its definition. This study brings together all the useful features provided in traditional dictionaries and additionally provides, as fundamental utilities, the insertion of new words and definitions, the description of different relationships between words, the association of words through these predefined relations, and the automatic inference of new relationships from the interaction of the relations. Meanwhile, the semantic annotations are preserved by keeping the link between the words and their various senses. An interface will be built that simulates the human language acquisition process and collects information over the internet through the contributions of many people. The system currently obtains the required knowledge from existing resources. However, the data formed in this environment will be checked by experts before being transferred to the knowledge base, and only the approved items will be allowed to permanently affect further processing steps.
While applications that capture specific features and relationships of words exist for English, such as WordNet [4] [9], and for other languages, these applications cannot be used for the Turkish language.
ThinkMap Visual Thesaurus [3] is an interactive dictionary and thesaurus which creates word maps that blossom with meanings and branch to related words. Its innovative display encourages exploration and learning. The word relations are represented by visual interactive components. Semantic inference requires, in addition to other resources, a database that includes the relationships between the words and terms of the language. There are various studies in the literature that create such databases.
The Teach Rose Project [2], started in the first quarter of 2007 for English, is closely related to this study. It simulates the learning mechanism of a child named Rose with an approach called Hive Mind. Hive Mind rests on the idea that if everyone contributes a tiny bit, much like bees in a beehive, a massive hive can be built. Rose simulates human intelligence by participating in dialogue with site visitors, building vocabulary, building associations, and asking questions.
The most commonly used resource in these studies is WordNet [4] [9], which includes synonym sets for nouns, verbs and adjectives and some semantic relations between them. WordNet first appeared after five years of laborious and time-consuming work and includes 150,000 word forms, each consisting of one or more words, and 115,000 synonym sets. WordNet uses a hierarchical structure that includes hypernym and hyponym relations. Hypernyms are extracted from descriptions, and new hypernyms are then obtained by further inference.
Information in WordNet is organized around logical groupings called synsets. Each synset consists of a list of synonymous words or collocations (e.g. "fountain pen", "take in"), and pointers that describe the relations between this synset and other synsets. A word or collocation may appear in more than one synset, and in more than one part of speech [10].
The following example illustrates this situation. The Turkish word "yüz" has senses such as "to swim", "a hundred" and "face". Whenever a relationship is needed between "sayı" (number) and "yüz", the sense "a hundred" has to be linked, and the remaining senses are irrelevant.
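The sense-level linking described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Sense` and `Lexicon`, the sense numbering, and the choice of "kind-of" as the relation between "yüz" and "sayı" are hypothetical and are not taken from the system described in this paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sense:
    """One sense of a lexeme, identified by its lemma and a sense index."""
    lemma: str
    index: int
    gloss: str


@dataclass
class Lexicon:
    """Hypothetical store: relations link senses, not bare word strings."""
    senses: dict = field(default_factory=dict)    # lemma -> list of Sense
    relations: set = field(default_factory=set)   # (relation, source, target)

    def add_sense(self, lemma, gloss):
        # Sense indices start at 1 and follow insertion order.
        sense = Sense(lemma, len(self.senses.get(lemma, [])) + 1, gloss)
        self.senses.setdefault(lemma, []).append(sense)
        return sense

    def relate(self, relation, source, target):
        self.relations.add((relation, source, target))

    def related(self, relation, target):
        # All source senses that stand in `relation` to `target`.
        return [s for (r, s, t) in self.relations if r == relation and t == target]


lex = Lexicon()
swim = lex.add_sense("yüz", "to swim")
hundred = lex.add_sense("yüz", "a hundred")
face = lex.add_sense("yüz", "face")
number = lex.add_sense("sayı", "number")

# Only the "a hundred" sense of "yüz" is linked to "sayı".
lex.relate("kind-of", hundred, number)
kinds_of_number = lex.related("kind-of", number)
```

Because the relation is attached to a `Sense` object rather than to the string "yüz", the senses "to swim" and "face" stay out of any query about "sayı".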

2.1. Rule Extraction
The study of words with the goal of understanding their meanings and how they relate to each other is a very large and complex field in itself. Rendering this information usable by a computer presents an even larger problem. The major goal is to analyze the definitions given in the Turkish XML lexicon to find the relationships between the words. To achieve this, the meaning of the defining sentences in the XML tags <kelime> and <grup_anlam> has to be analyzed, and the work has therefore concentrated on semantic knowledge. Typical relationships and a few examples that can be used in this application are given in Table 1.
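As a rough sketch of how headwords and definitions might be read from such a lexicon, the fragment below uses Python's standard `xml.etree.ElementTree` module. Only the tag names <kelime> and <grup_anlam> come from the text. The root element `sozluk`, the wrapper element `madde`, the nesting, and the function name `read_entries` are assumptions made for illustration, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <kelime> and <grup_anlam> are named in the text;
# the <sozluk> root, the <madde> wrapper and the nesting are assumed.
SAMPLE_XML = """
<sozluk>
  <madde>
    <kelime>aç</kelime>
    <grup_anlam>Yemek yemesi gereken, tok karşıtı</grup_anlam>
  </madde>
</sozluk>
"""


def read_entries(xml_text):
    """Return a list of (headword, [definitions]) pairs."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("madde"):
        headword = entry.findtext("kelime", default="").strip()
        definitions = [d.text.strip() for d in entry.iter("grup_anlam") if d.text]
        if headword:
            entries.append((headword, definitions))
    return entries


entries = read_entries(SAMPLE_XML)
```

Each returned pair can then be passed to the rule matching step, one definition at a time.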
Table 1.
Much of the work on semantic relations, from the perspective of extracting information from a dictionary, is done via the analysis of defining formulas. Defining formulas are phrasal patterns that occur often in dictionary definitions and suggest particular semantic relations.
For example, the relations part-of and made-of can be detected directly via the defining formulas <X1 is a part of X2> and <X1 is made of X2> whenever the definitions contain these patterns. Various similar rules have been defined to find the relationships between words. The frequency of each rule has then been calculated for the corresponding relations. Transitive and inverse relations have also been taken into account. A partial list of rules is provided in Table 2.
On the other hand, if the relations were too specific, it would be very hard to find formulas for rules in our lexicon of 63K entries. Generic rules were therefore defined, as shown in Table 2, which lists the most frequent defining formulas. The rest of the relations were added by looking through the definitions of the words and identifying which relations would be needed.
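The idea of matching defining formulas and counting how often each one fires can be sketched with regular expressions. The two formulas are the part-of and made-of examples from the text. The regular expressions, the function name `apply_rules`, and the sample sentences are illustrative assumptions and are not the rules of Table 2.

```python
import re
from collections import Counter

# Each rule pairs a relation name with a regular expression for its formula.
RULES = [
    ("part-of", re.compile(r"^(?P<x1>[\w ]+?) is a part of (?P<x2>[\w ]+?)\.?$")),
    ("made-of", re.compile(r"^(?P<x1>[\w ]+?) is made of (?P<x2>[\w ]+?)\.?$")),
]


def apply_rules(sentences):
    """Return the extracted (relation, X1, X2) triples and per-rule frequencies."""
    relations = []
    frequencies = Counter()
    for sentence in sentences:
        for name, pattern in RULES:
            match = pattern.match(sentence.strip())
            if match:
                relations.append((name, match.group("x1"), match.group("x2")))
                frequencies[name] += 1
    return relations, frequencies


# Made-up sentences, used only to exercise the patterns.
sentences = [
    "wheel is a part of car.",
    "table is made of wood.",
    "finger is a part of hand.",
    "a sentence that matches no formula.",
]
relations, frequencies = apply_rules(sentences)
```

Counting matches per rule in this way gives the kind of frequency information that can be used to decide which formulas are generic enough to keep.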

2.2. Extracted Relations
In this section, the synonymy, antonymy, amount-of and member-of relations from the object group are analyzed in detail. Additionally, the hierarchical relation is represented by the kind-of and member-of relations extracted from the definitions via defining formulas, as shown in the examples below, which are followed by illustrative sentences and the predicates that can be derived from them.
The symbol X is the word entry in the dictionary and Y is another word used in the definition of this word. The relation that obeys the given pattern is extracted between the words X and Y. The first rule of the antonymy relation is "X: W1 W2 … Wn, Y karşıtı." and an example is "aç: Yemek yemesi gereken, tok karşıtı". Here X matches "aç" (hungry), W1 W2 … Wn matches "Yemek yemesi gereken," and Y matches "tok" (satiated). Therefore the words "aç" and "tok" are antonyms. The defining formulas, illustrative examples and extracted relations for each category are given in Tables 3 to 8.
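The first antonymy rule can be approximated by a single regular expression. This is a minimal sketch and not the implementation used in the study. The name `ANTONYM_RULE`, the function `extract_antonym`, and the assumption that Y is a single word directly before "karşıtı" are all illustrative.

```python
import re

# "X: W1 W2 ... Wn, Y karşıtı."  (Y is assumed to be a single word)
ANTONYM_RULE = re.compile(
    r"^(?P<x>[^:]+):\s*(?P<words>.*),\s*(?P<y>\w+)\s+karşıtı\.?$"
)


def extract_antonym(entry):
    """Return (X, Y) if the entry matches the antonymy formula, else None."""
    match = ANTONYM_RULE.match(entry.strip())
    if match is None:
        return None
    return (match.group("x").strip(), match.group("y"))


pair = extract_antonym("aç:Yemek yemesi gereken, tok karşıtı")
no_pair = extract_antonym("aç:Yemek yemesi gereken")
```

For the example entry, `pair` is `("aç", "tok")`, while an entry without the "karşıtı" marker yields `None`. In Python 3, `\w` matches Turkish letters such as "ç" and "ş" in `str` patterns, so no special flags are needed.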

2.3. Morphological Analysis
Turkish is an agglutinative language that makes frequent use of affixes, specifically suffixes or endings [11]. One word can have many affixes, and these can also be used to create new words, such as a verb from a noun or a noun from a verbal root.
Most affixes indicate the grammatical function of the word [11]. The only native prefixes are alliterative intensifying syllables used with adjectives or adverbs.
The extensive use of affixes can give rise to long words. The following example shows the morphological structure of such a Turkish word [12]: Uygarlaştıramadıklarımızdanmışsınızcasına ("(behaving) as if you are among those whom we could not civilize/cause to become civilized"). Turkish extensively uses agglutination to form new words from nouns and verbal stems. The majority of Turkish words originate from the application of derivative suffixes to a relatively small set of core vocabulary. Therefore all words acquired from the patterns have to be morphologically parsed to obtain the word stems.
The main problem in our application is stemming the words. Stemming is the process of reducing inflected or derived words to a reduced form that may or may not be the morphological root of the word. The stem does not have to be the morphological root; it is sufficient that similar words map to the same stem, e.g. the words "call", "caller" and "calls" should map to the same stem "call" [13].
The following example is detected by one of the rules of the hypernymy relation: "Ölüm, yangın, deprem vb. olayların yarattığı üzüntü, keder, elem". The hypernymy relation is found between each of the words "ölüm" (death), "yangın" (fire) and "deprem" (earthquake) and the word "olayların" (of the events), which carries suffixes. Morphological analysis is needed to obtain the stem of this word. For this purpose Zemberek, an open source, platform independent, general purpose NLP library and toolset designed for Turkic languages, is used. Table 9 shows the analysis of the word "olayların", which has two results. Such a list may contain several different roots, so the true root cannot always be determined. Therefore the root of the first element of the list (Kök: olay) is accepted as the default root of the word. After this operation the new related word becomes "olay" (event).
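The hypernymy example and the "take the first root" heuristic can be sketched as follows. Several things here are assumptions: the shape of the rule ("A, B, C vb. Y ...") is read off the single example above, `analyze_stub` is a hand-written stand-in and does not reproduce the Zemberek API, and the second candidate it returns is a placeholder, since only the first root (olay) is given in the text.

```python
import re

# Assumed shape of the rule, inferred from the example: "A, B, C vb. Y ..."
HYPERNYM_RULE = re.compile(r"^(?P<items>\w+(?:,\s*\w+)*)\s+vb\.\s*(?P<y>\w+)")


def analyze_stub(word):
    """Stand-in for a morphological analyzer (NOT the Zemberek API).

    Returns candidate roots in the order an analyzer might list them.
    The second candidate below is a placeholder, not a real analysis.
    """
    candidates = {"olayların": ["olay", "<other candidate>"]}
    return candidates.get(word, [word])


def default_root(word, analyze=analyze_stub):
    """Accept the root of the first analysis as the default root."""
    candidates = analyze(word)
    return candidates[0] if candidates else word


def extract_hypernym_pairs(definition, analyze=analyze_stub):
    """Return (hyponym, hypernym) pairs, with the hypernym reduced to its root."""
    match = HYPERNYM_RULE.match(definition.strip())
    if match is None:
        return []
    hypernym = default_root(match.group("y").lower(), analyze)
    items = [w.strip().lower() for w in match.group("items").split(",")]
    return [(item, hypernym) for item in items]


pairs = extract_hypernym_pairs(
    "Ölüm, yangın, deprem vb. olayların yarattığı üzüntü, keder, elem"
)
```

With the stub above, `pairs` links "ölüm", "yangın" and "deprem" to "olay". A real pipeline would replace `analyze_stub` with calls to the morphological analyzer, and would need Turkish-aware lowercasing, because Python's `str.lower()` maps "I" to "i" rather than to "ı".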

RESULTS AND COMPARISON
This section presents the accuracy of the automatic detection of word relations. The results in the tables below indicate that some relations are hard to detect automatically from the definitions. Alternatively, one can infer that the rules employed are not sufficient and that further rules are necessary for these types of relations; the accuracy could be improved by adding such rules. On the other hand, some relations can be detected completely or at least generally without further modification, which is promising for other types of relations. Table 10 shows the accuracy of the classifier for each class, computed as the number of correctly classified items in that class divided by the total number of items in that class, expressed as a percentage. The overall (average) accuracy of the classifier is also given. Table 10 also lists the total number of outputs obtained by our implementation of the extraction algorithms for each relation.
Table 11 shows the number of relations obtained for each relation from the different rules and indicates that some relations are hard to detect automatically with certain rules. On the other hand, some rules detect relations completely or at least generally without further modification, which is promising for other types of relations.
The first column of Table 12 lists the rules of the hypernymy relation. The second column gives the total number of relations extracted by each rule. The columns named total and correct are used to calculate the accuracy of each rule for the hypernymy relation.
The accuracy of a rule is calculated from the correct and total columns as follows: Accuracy (%) = 100 × correct / total.
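The per-rule accuracy, taken here as the share of correct relations among the total for that rule, can be computed as in the sketch below. The counts are invented for illustration and are not results from Tables 10 to 12. Treating the overall accuracy as the unweighted mean of the per-rule values is also an assumption, since the text only calls it an average.

```python
def rule_accuracy(correct, total):
    """Percentage of correct relations among the total for one rule."""
    if total == 0:
        return 0.0
    return 100.0 * correct / total


def overall_accuracy(counts):
    """Unweighted mean of per-rule accuracies (an assumed definition)."""
    values = [rule_accuracy(c, t) for (c, t) in counts.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical (correct, total) counts per rule, for illustration only.
counts = {"Rule 1": (45, 50), "Rule 2": (30, 40)}
per_rule = {name: rule_accuracy(c, t) for name, (c, t) in counts.items()}
average = overall_accuracy(counts)
```

A weighted alternative would divide the sum of all correct counts by the sum of all totals, which gives more influence to rules that produce many relations.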

3.1. Error Sources
Experimental results show that it is difficult to extract word relations automatically in Turkish with high accuracy. Some of the sources of incorrect results are explained below.
Two nouns, or groups of nouns, may be joined to form subordinative conjunctions. Our relation extraction algorithm does not consider subordinative conjunctions when finding related words. For example, according to Rule 3 of the kind-of relation, the correct word related to "bal arısı" (honey bee) should be "eklem bacaklı" (arthropod). Such cases are not handled because subordinative conjunctions are difficult to detect in Turkish.

CONCLUSION
Words are the fundamental building blocks of the cognitive processes. While words are being learned, most of the information related to them is also kept in the background. To model the knowledge acquisition and communication abilities of humans in the computational domain to some extent, the simulation should start from the smallest units of human learning mechanisms. Therefore, this project works at the word level. The major goals of this study are to store words along with their various features and relationships in a knowledge base, to form a WordNet that can represent a wide variety of relationships between words, and to associate the words with their equivalent translations in other languages for applications in multilingual environments.
The design is implemented so that it is flexible, scalable and trainable by humans, which makes it possible to imitate the dynamic learning and processing mechanisms of human beings.
In our application, formulas are defined for relating words, using dictionary definitions as the starting point. These formulas are applied to the definitions of the words by a computer program. All the related words and their relations produced by the program are stored in files. The results indicate that some relations are hard to detect automatically from the definitions. On the other hand, some relations can be detected completely or at least generally without further modification, which is promising for other types of relations.

Table 3. Synonymy rules, examples and extracted relations

Table 4. Antonymy rules, examples and extracted relations

Table 5. Amount-of rules, examples and extracted relations

Table 6. Member-of rules, examples and extracted relations

Table 8. Group-of rules, examples and extracted relations

Table 9. Root and the suffix list in Zemberek

Table 10. Accuracy results for automatic detection of word relations

Table 11. Number of relationships obtained according to each rule

Table 12. Accuracy results for hypernymy relation