Constructing Uyghur Commonsense Knowledge Base by Knowledge Projection

Abstract: Although considerable effort has been devoted to building commonsense knowledge bases (CKBs), they are still not available for many low-resource languages such as Uyghur because of the high construction cost. To address this issue, we propose a cross-lingual knowledge-projection method that constructs an Uyghur CKB by projecting ConceptNet's Chinese facts into Uyghur. We use a Chinese–Uyghur bilingual dictionary to obtain high-quality translations of the entities in each fact and employ back-translation to eliminate entity-translation ambiguity. Moreover, to tackle the inner relation ambiguity in translated facts, we use hand-crafted rules to convert the structured facts into natural-language phrases and score the resulting Chinese–Uyghur bilingual phrase pairs with a bilingual semantic similarity scoring model. Experimental results show that the accuracy of our semantic similarity scoring model reaches 94.75% on our task, and that the method successfully projects 55,872 Chinese facts into Uyghur, yielding 67,375 Uyghur facts within a very short period.


Introduction
Knowledge Bases (KBs) play an important role in many natural-language processing (NLP) tasks such as question answering, web searching and dialog tasks [1,2]. KBs describe the knowledge about entities, relations, and their attributes. In a KB, each fact is a triple of the form (h, r, t) that indicates the head entity h and tail entity t are connected with a relationship named r, e.g., (Bat, CapableOf, Fly). As a part of KB, commonsense knowledge (CSK) is mainly referred to as background knowledge and used in natural-language processing tasks that require reasoning based on implicit knowledge [3][4][5][6]. However, many languages, especially low-resource languages including Uyghur, have no existing commonsense knowledge base (CKB) to use [7][8][9].
Constructing CKBs from scratch is very time-consuming and labor-intensive. However, many CKB resources are already available in resource-rich languages such as English and Chinese. A straightforward way to construct an Uyghur CKB is to translate a Chinese KB directly into Uyghur, based on the surface text of each fact, with an existing machine translation (MT) system or a bilingual dictionary. However, we find that this method suffers from ambiguity. For example, consider translating the Chinese fact (主机 <host computer>, CapableOf, 发热 <heat>) shown in Figure 1. The head entity "主机" (host computer) has six Uyghur translation candidates, including candidates meaning "main engine" and "central computer". The tail entity "发热" (heat) also has six translation candidates, including candidates meaning "heat" and "have a fever". Thus, (主机, CapableOf, 发热) generates 6 × 6 = 36 Uyghur translation candidates in total. There are two main challenges in disambiguating these translated triples effectively. The first is how to remove the semantically unrelated translation candidates for a single entity. The second is how to model the semantics of the Chinese fact and the Uyghur fact in a common semantic space.
In this paper, we address these two challenges by presenting a cross-lingual knowledge-projection method to translate the Chinese CKB into Uyghur. Given a Chinese fact, the method first uses a Chinese-Uyghur bilingual dictionary to translate the entities in the fact and uses back-translation to remove the semantically unrelated translation candidates. Then, the method converts each Chinese and Uyghur fact into a pair of parallel sentences using hand-crafted rule templates and obtains their sentence representations with a recursive autoencoder. Finally, the method encodes the source and target sentences in the same semantic space using a bidimensional attention network and calculates the distance between them to get the semantic similarity score.
Being the largest multilingual CKB, ConceptNet [10] connects words and phrases of natural language with labeled edges and maintains knowledge as a triple of two concepts and relations between them. The relations come from a fixed set. The latest release (v5.6.0) of ConceptNet has 369,687 unique Chinese facts (both head and tail nodes are Chinese) while the number of Uyghur facts is only 3872 (≈1.05%). We focus on ConceptNet in this paper. We project ConceptNet's Chinese facts to Uyghur by using a Chinese-Uyghur bilingual dictionary and a bilingual semantic similarity scoring model, automatically building an Uyghur CKB from existing Chinese CKBs, taking the advantages of the cross-lingual knowledge projection.
Suppose we project a Chinese fact $f^s$ into a target-side Uyghur fact and obtain $n$ candidate translations by dictionary translation. We denote these candidates as $f^t_1, f^t_2, \ldots, f^t_n$. Our goal is to estimate a projection score $h(f^t_i \mid f^s)$ and find the most appropriate Uyghur fact $\hat{f}^t$, the one that maximizes the score, as formulated in Equation (1):
$$\hat{f}^t = \arg\max_{f^t_i} h(f^t_i \mid f^s) \tag{1}$$
The structure of this paper is organized as follows: in Section 2 we discuss the related works; in Section 3 we introduce our cross-lingual knowledge-projection method; in Section 4 we present the experiments and analysis of the results. Conclusions will be given in the last section.

Related Works
Low-resource languages often suffer from a lack of annotated corpora to estimate high-performing neural network models for many NLP tasks. Cross-lingual knowledge projection is an efficient way to bridge the gap across languages.
Named-entity recognition (NER) for low-resource languages has benefited greatly from cross-lingual projection. Bharadwaj et al. [11] built a transfer model using phonetic features instead of lexical features. These features are not strictly language-independent but work well when languages share vocabulary but have spelling variations, as in the case of Turkish, Uzbek, and Uyghur. Mayhew et al. [12] used a lexicon to translate the available annotated data in one or several high-resource languages and learned a standard monolingual NER model. They evaluated their model on 7 diverse languages and improved the state of the art (SOTA) by an average of 5.5% F1 points. To improve the mapping of lexical items across languages, Xie et al. [13] proposed a method that finds translations based on bilingual word embeddings and uses self-attention to improve the robustness to word-order differences. Their method achieved SOTA NER performance on commonly tested languages.
Chen et al. [14], Wang et al. [15], and Klein et al. [16] represented concepts in multiple languages in a common vector space and ensured a concept in source language has a similar vector representation to its target-side counterpart. Xu et al. [17] treated the cross-lingual knowledge projection as a graph-matching problem and proposed a graph-attention-based solution, which matches all the entities in two topic entity graphs and jointly models the local matching information to derive a graph-level matching vector.
Manaal et al. [18] present a system that performs relation extraction (RE) on a sentence in the source language by translating the sentence into English, performing RE in English, and projecting the relation phrase back to the source-language sentence. Their method only needs an MT system from the source language to English, without any other analysis tools for the source language, and can extract relationships for any source language.
Due to the lack of training data for sentiment analysis, Jeremy et al. [19] introduced a bilingual sentiment embedding model for cross-lingual sentiment classification. Their model only requires a small bilingual lexicon, a source-language corpus annotated for sentiment, and monolingual word embeddings for each language. In experiments on three language combinations for sentence-level cross-lingual sentiment classification, their model outperforms the SOTA methods.
Several studies proposed methods for the one-to-one projection of facts. To expand a Chinese KB by leveraging English KB resources, Feng et al. [20] presented a gated neural network approach that maps the source triples and target triples into the same semantic vector space. Their experimental results showed that the model can successfully alleviate the projection ambiguity. The work by Naoki et al. [21] is the most closely related to our study. They treated cross-lingual knowledge projection as a structured version of the MT task and generated a training corpus from ConceptNet using hand-crafted rules for every type of relationship. By combining MT and a target-side knowledge-base completion model, they projected English CSK into Japanese and Chinese with high precision.

Data Preprocessing
Being a multilingual commonsense KB, ConceptNet contains facts from hundreds of languages. Each fact consists of five parts: the URI of the whole edge, the relationship expressed by the edge, the node at the head of the edge, the node at the tail of the edge, and JSON-structured additional information. To project the Chinese facts into Uyghur, we first keep only the facts in which both the head and tail entities are Chinese and convert them into the form $f^s = (e^s_1, r, e^s_2)$, where $e^s_1 \in E_1$, $r \in R$, and $e^s_2 \in E_2$. Here $E_1$ and $E_2$ denote the sets of head and tail nodes, and $R$ denotes the set of relationships between nodes.
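The filtering step can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the paper: it assumes the tab-separated assertions dump of ConceptNet, in which node URIs start with a language prefix such as `/c/zh/`, and the helper names `term` and `parse_chinese_facts` are hypothetical.

```python
def term(node_uri):
    """Extract the surface term from a node URI, e.g. '/c/zh/主机' -> '主机'."""
    return node_uri.split("/")[3]


def parse_chinese_facts(lines):
    """Yield (head, relation, tail) for edges whose two nodes are both Chinese.

    Each line is assumed to hold five tab-separated fields:
    edge URI, relation, head node, tail node, JSON metadata.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        _uri, rel, start, end = fields[:4]
        if start.startswith("/c/zh/") and end.startswith("/c/zh/"):
            # '/r/CapableOf' -> 'CapableOf'
            yield (term(start), rel.rsplit("/", 1)[-1], term(end))


# Two toy edges: one Chinese-Chinese, one English-English.
sample = [
    "/a/[/r/CapableOf/,/c/zh/主机/,/c/zh/发热/]\t/r/CapableOf\t/c/zh/主机\t/c/zh/发热\t{\"weight\": 1.0}",
    "/a/[/r/CapableOf/,/c/en/bat/,/c/en/fly/]\t/r/CapableOf\t/c/en/bat\t/c/en/fly\t{\"weight\": 1.0}",
]
facts = list(parse_chinese_facts(sample))
print(facts)
```

Only the first toy edge survives, because the second has English nodes on both sides.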

Dictionary-Based Entity Translation
To project a given Chinese fact $f^s = (e^s_1, r, e^s_2)$ into Uyghur, we need to translate the head and tail entities into Uyghur separately. To get high-quality entity translations, we use a Chinese-Uyghur bilingual dictionary to obtain the target fact $f^t = (e^t_1, r, e^t_2)$, where $e^t_1 \in \{e^t_{11}, e^t_{12}, \ldots, e^t_{1n}\}$ and $e^t_2 \in \{e^t_{21}, e^t_{22}, \ldots, e^t_{2m}\}$. As the example in Figure 1 shows, in most cases we get more than one candidate Uyghur entity for each Chinese entity after translation, but some of them are semantically unrelated to the original Chinese entity.
Back-translation [22] was first used in MT to enrich the training corpus. In this paper, we use it to eliminate entity-translation ambiguity. We first use a Uyghur-Chinese bilingual dictionary to translate each candidate entity back into Chinese, then compare the result with the original Chinese entity, and keep the candidate only if the two are equal.
After back-translation, we obtain the filtered head and tail entities $\acute{e}^t_1$ and $\acute{e}^t_2$ for a Chinese fact $f^s$. Combined with the relationship $r$, the translated facts can be expressed as $\acute{f}^t = (\acute{e}^t_1, r, \acute{e}^t_2)$, and their number is $\acute{n} \times \acute{m}$, where $\acute{n}$ and $\acute{m}$ are the numbers of remaining candidate head and tail entities, respectively.
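The dictionary translation and back-translation filter described above can be sketched as follows. The two dictionaries are toy data, and the strings prefixed with `ug_` are placeholders standing in for Uyghur words, not real dictionary entries. The sketch also assumes that a candidate is kept when the original Chinese entity appears among its back-translations, which is one possible reading of "equal" when a word has several back-translations.

```python
from itertools import product


def back_translation_filter(zh_entity, zh2ug, ug2zh):
    """Keep the Uyghur candidates whose back-translation recovers zh_entity."""
    candidates = zh2ug.get(zh_entity, [])
    return [ug for ug in candidates if zh_entity in ug2zh.get(ug, [])]


def project_fact(fact, zh2ug, ug2zh):
    """Return all candidate Uyghur facts for one Chinese fact (head, rel, tail)."""
    head, rel, tail = fact
    heads = back_translation_filter(head, zh2ug, ug2zh)
    tails = back_translation_filter(tail, zh2ug, ug2zh)
    # Every surviving head combines with every surviving tail.
    return [(h, rel, t) for h, t in product(heads, tails)]


# Toy dictionaries (placeholders, not real Uyghur entries).
ZH2UG = {
    "主机": ["ug_main_engine", "ug_central_computer", "ug_host"],
    "发热": ["ug_heat", "ug_fever"],
}
UG2ZH = {
    "ug_main_engine": ["发动机"],
    "ug_central_computer": ["主机"],
    "ug_host": ["主人"],
    "ug_heat": ["发热"],
    "ug_fever": ["发热", "发烧"],
}

candidates = project_fact(("主机", "CapableOf", "发热"), ZH2UG, UG2ZH)
print(len(candidates), candidates)
```

In this toy case the head keeps one of its three candidates and the tail keeps both, so 1 × 2 = 2 candidate facts remain. The remaining ambiguity between the two tail senses is what the similarity scoring model is meant to resolve.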

Rule-Based Conversion of Structured Knowledge
Although dictionary-based translation gives us correct translations of the head and tail entities separately, the generated Uyghur fact can still be semantically ambiguous once the entities are combined with the relationship r. For the example above, the generated Uyghur fact ( <host computer>, CapableOf, <have a fever>) does not semantically hold a CapableOf relationship. It is challenging to resolve this inner relation ambiguity within a single projected fact. We assume that the original Chinese fact is unambiguous, so that we can tackle the inner relation ambiguity in a translated fact by calculating the semantic similarity between the original and the projected fact. However, it is difficult to calculate the semantic similarity of two facts directly, because all facts in ConceptNet are stored as triples. Therefore, we write hand-crafted templates for every relationship, in Chinese and in Uyghur separately, to convert the structured facts into phrases. The similarity of two facts is then obtained as the semantic similarity of the parallel phrases generated by the templates. The hand-crafted rule templates for Chinese and Uyghur are shown in Table 1; e1 and e2 in a template are programmatically replaced by the head and tail entities of the fact.
Being an agglutinative language, Uyghur has many affixes, which carry important syntactic information. Many Uyghur affixes have several variants, and the variant attached to a particular stem is chosen to harmonize with the phonetic characteristics of that stem. For example, the plural affix has two variants " / ", and the choice between them must follow the phonetic harmony rule between stem and variants [23]. Aizimaiti et al. [24] proposed a rule-based variant-selection algorithm for Uyghur affixes based on Uyghur phonetic harmony. When replacing the entities in a template, we use their method to select the correct affix variant for each Uyghur entity.
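The template step can be sketched as below. The templates of Table 1 are not reproduced here, so the two Chinese templates are hypothetical examples, and the Uyghur templates use placeholder tokens such as `<CAPABLE_OF>` instead of real Uyghur wording. The `select_affix` hook marks where a harmony-based variant-selection routine such as that of [24] would plug in; its default does nothing.

```python
# Hypothetical templates, for illustration only (not the paper's Table 1).
ZH_TEMPLATES = {
    "CapableOf": "{e1}会{e2}",
    "IsA": "{e1}是{e2}",
}
UG_TEMPLATES = {
    "CapableOf": "{e1} {e2} <CAPABLE_OF>",
    "IsA": "{e1} {e2} <IS_A>",
}


def identity_affix(entity, relation, slot):
    """Placeholder for affix-variant selection: returns the entity unchanged."""
    return entity


def fact_to_phrase(fact, templates, select_affix=identity_affix):
    """Fill the relation's template with the head and tail entities."""
    e1, rel, e2 = fact
    template = templates[rel]
    return template.format(
        e1=select_affix(e1, rel, "e1"),
        e2=select_affix(e2, rel, "e2"),
    )


zh_phrase = fact_to_phrase(("主机", "CapableOf", "发热"), ZH_TEMPLATES)
ug_phrase = fact_to_phrase(
    ("ug_host_computer", "CapableOf", "ug_heat"), UG_TEMPLATES
)
print(zh_phrase)
print(ug_phrase)
```

Keeping affix selection behind a function argument lets the same filling routine serve both languages, with Chinese using the default.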

Bilingual Semantic Similarity Scoring Model
Bidimensional attention-based recursive autoencoders for learning bilingual phrase embeddings (BattRAE) were first proposed by Zhang et al. to evaluate the semantic similarity between a source phrase and a target phrase in an MT task [25,26]. We introduced the BattRAE model to score the semantic similarity of parallel phrases generated from the original and projected facts using the hand-crafted template. This model learns bilingual phrase embeddings according to the strengths of interactions between the linguistic items at different levels of granularity on the source side and the target side. Figure 2 shows the overall architecture of the BattRAE model.

Learning Multilevel Phrase Embeddings
We use recursive autoencoders (RAE, Figure 2a) to learn initial embeddings at different levels of phrases. By combining two children vectors from the bottom up recursively, RAE can generate low-dimensional vector representations for variable-sized sequences. The recursion procedure usually consists of two main steps: composition and reconstruction.
Composition: Given a list of words in a phrase $(x_1, x_2, x_3)$, each word is embedded into a $d$-dimensional continuous vector. RAE selects two neighboring children (e.g., $c_1 = x_1$ and $c_2 = x_2$) via some selection criterion and composes them into a parent embedding $y_1$, computed by Equation (2):
$$y_1 = f(W^{(1)} [c_1 : c_2] + b^{(1)}) \tag{2}$$
where $[c_1 : c_2] \in \mathbb{R}^{2d}$ is the concatenation of $c_1$ and $c_2$, $W^{(1)} \in \mathbb{R}^{d \times 2d}$ is a parameter matrix, $b^{(1)} \in \mathbb{R}^{d}$ is a bias term, and $f$ is an element-wise activation function such as $\tanh(\cdot)$, which is used in our experiments.
Reconstruction: After obtaining the $d$-dimensional representation of the parent $y_1$ in the composition step, we measure how well $y_1$ represents its children by reconstructing the original child nodes via the reconstruction layer in Equation (3):
$$[c'_1 : c'_2] = f(W^{(2)} y_1 + b^{(2)}) \tag{3}$$
where $c'_1$ and $c'_2$ are the reconstructed children, $W^{(2)} \in \mathbb{R}^{2d \times d}$, and $b^{(2)} \in \mathbb{R}^{2d}$. The minimum Euclidean distance between $[c_1 : c_2]$ and $[c'_1 : c'_2]$ is usually used as the selection criterion during composition. These two steps repeat until the embedding of the entire phrase is generated. While embedding, RAE also constructs a binary tree, whose structure is determined by the selection criterion used in composition. We use a greedy algorithm [27] based on the reconstruction error in Equation (4):
$$E_{rec}(x) = \sum_{y \in T(x)} \frac{1}{2} \left\| [c_1 : c_2]_y - [c'_1 : c'_2]_y \right\|^2 \tag{4}$$
where $y$ is an intermediate node of the binary tree $T(x)$, and the parameters $W^{(1)}$ and $W^{(2)}$ are learned to minimize the sum of reconstruction errors. In a binary tree learned by RAE, the leaves, internal nodes, and root hold the representations of words, sub-phrases, and the whole phrase, respectively, so RAE produces phrase embeddings at different levels. As shown in Figure 2a, RAE learns the representations of the source and target phrases in different semantic spaces, with dimensions $d_s$ and $d_t$, respectively.
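The composition and reconstruction steps, together with greedy tree construction, can be sketched in NumPy as follows. This is a forward-pass illustration with random, untrained parameters, not the training code of the paper. The function names are hypothetical, and the greedy rule shown, which merges the adjacent pair with the smallest reconstruction error at each step, is one common reading of the greedy algorithm of [27].

```python
import numpy as np


def compose(c1, c2, W1, b1):
    """Parent embedding y = tanh(W1 [c1 : c2] + b1)."""
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)


def reconstruction_error(c1, c2, y, W2, b2):
    """0.5 * squared distance between the children and their reconstruction."""
    reconstructed = np.tanh(W2 @ y + b2)
    diff = np.concatenate([c1, c2]) - reconstructed
    return 0.5 * float(diff @ diff)


def greedy_rae(words, W1, b1, W2, b2):
    """Build a binary tree bottom-up; return all node embeddings and the error.

    The returned matrix has one column per tree node (leaves first),
    so its shape is (d, 2n - 1) for a phrase of n words.
    """
    frontier = list(words)   # nodes that can still be merged
    all_nodes = list(words)  # every node created so far
    total_error = 0.0
    while len(frontier) > 1:
        best = None
        for i in range(len(frontier) - 1):
            y = compose(frontier[i], frontier[i + 1], W1, b1)
            err = reconstruction_error(frontier[i], frontier[i + 1], y, W2, b2)
            if best is None or err < best[0]:
                best = (err, i, y)
        err, i, y = best
        frontier[i:i + 2] = [y]  # replace the merged pair by its parent
        all_nodes.append(y)
        total_error += err
    return np.stack(all_nodes, axis=1), total_error


rng = np.random.default_rng(0)
d, n = 4, 3
W1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)
W2, b2 = rng.normal(size=(2 * d, d)), np.zeros(2 * d)
words = [rng.normal(size=d) for _ in range(n)]
M, total_error = greedy_rae(words, W1, b1, W2, b2)
print(M.shape, total_error)
```

The returned matrix stacks the leaf, internal, and root embeddings column by column, which is the input format the attention network below expects.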

Bidimensional Attention Network
We use the bidimensional attention network (Figure 2b) to incorporate the multilevel embeddings from RAE into phrase embeddings and further into the semantic similarity of bilingual phrases. We put the vectors from all nodes of a tree into the columns of a matrix of size $d \times (2n - 1)$ ($n_s$, $d_s$ for the source and $n_t$, $d_t$ for the target), where $d$ is the dimension of the embeddings and $n$ is the length of the phrase (RAE takes $n - 1$ composition steps, so the tree has $2n - 1$ nodes in total). We denote these matrices by $M_s$ and $M_t$ for the source and target tree, respectively. We then project all the embeddings into a common attention space by a non-linear projection function $f(Wx + b)$, as in Equations (5) and (6):
$$A_s = f(W^{(3)} M_s + b_A) \tag{5}$$
$$A_t = f(W^{(4)} M_t + b_A) \tag{6}$$
where the bias $b_A$ is added to every column. In this attention space, all the embeddings from the source tree can "interact" with all the embeddings from the target tree. We measure the interaction strength between the $i$-th projected source embedding and the $j$-th projected target embedding by Equation (7):
$$B_{i,j} = g(A_{s,i}^{\top} A_{t,j}) \tag{7}$$
where $g(\cdot)$ and $f(\cdot)$ are non-linear activation functions (we use $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, respectively), $A_s$ and $A_t$ are the projections of $M_s$ and $M_t$ into the attention space, $W^{(3)} \in \mathbb{R}^{d_a \times d_s}$ and $W^{(4)} \in \mathbb{R}^{d_a \times d_t}$ are transformation matrices, and $b_A \in \mathbb{R}^{d_a}$ is the bias term. Sharing the bias term between the two projections forces the model to encode the attention semantics in the transformation matrices rather than in the bias. Equation (7) defines a $(2n_s - 1) \times (2n_t - 1)$ matrix $B$, which is called the bidimensional attention matrix. Intuitively, this matrix is the result of handshakes between the source and target phrases at multiple levels of representation. We interpret the sum of the $i$-th row as the total strength that the $i$-th source node has on the semantic similarity between the two phrases, and the sum of the $j$-th column likewise for the $j$-th target node, as in Equation (8):
$$\tilde{a}_{s,i} = \sum_{j} B_{i,j}, \qquad \tilde{a}_{t,j} = \sum_{i} B_{i,j} \tag{8}$$
where $\tilde{a}_s \in \mathbb{R}^{2n_s - 1}$ and $\tilde{a}_t \in \mathbb{R}^{2n_t - 1}$ are the semantic matching score vectors. Because phrase lengths vary, we normalize these strengths with a softmax function: $a_s = \mathrm{Softmax}(\tilde{a}_s)$, $a_t = \mathrm{Softmax}(\tilde{a}_t)$. This forces $a_s$ and $a_t$ to become real-valued distributions in the attention space, known as attention weights. We then use them to obtain the final phrase representations by Equation (9):
$$p_s = \sum_{i} a_{s,i} M_{s,i}, \qquad p_t = \sum_{j} a_{t,j} M_{t,j} \tag{9}$$
where $M_{s,i}$ and $M_{t,j}$ denote the $i$-th column of $M_s$ and the $j$-th column of $M_t$, $p_s \in \mathbb{R}^{d_s}$, and $p_t \in \mathbb{R}^{d_t}$. Note that $p_s$ and $p_t$ are still located in their language-specific vector spaces.
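The attention computation can be sketched in NumPy as follows. This is an illustrative forward pass with random parameters, assuming the node matrices store one tree node per column and that the shared bias is broadcast over columns. The function name `bidimensional_attention` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


def bidimensional_attention(M_s, M_t, W3, W4, b_A):
    """Return the attention matrix, attention weights, and phrase embeddings.

    M_s: (d_s, N_s) source node embeddings, one column per tree node.
    M_t: (d_t, N_t) target node embeddings.
    """
    A_s = np.tanh(W3 @ M_s + b_A[:, None])  # (d_a, N_s), shared bias
    A_t = np.tanh(W4 @ M_t + b_A[:, None])  # (d_a, N_t)
    B = sigmoid(A_s.T @ A_t)                # (N_s, N_t) interaction strengths
    a_s = softmax(B.sum(axis=1))            # row sums -> source weights
    a_t = softmax(B.sum(axis=0))            # column sums -> target weights
    p_s = M_s @ a_s                         # weighted sum of source nodes
    p_t = M_t @ a_t                         # weighted sum of target nodes
    return B, a_s, a_t, p_s, p_t


rng = np.random.default_rng(0)
d_s, d_t, d_a = 4, 3, 5
M_s = rng.normal(size=(d_s, 5))  # 3-word source phrase -> 5 nodes
M_t = rng.normal(size=(d_t, 3))  # 2-word target phrase -> 3 nodes
W3 = rng.normal(size=(d_a, d_s))
W4 = rng.normal(size=(d_a, d_t))
b_A = np.zeros(d_a)
B, a_s, a_t, p_s, p_t = bidimensional_attention(M_s, M_t, W3, W4, b_A)
print(B.shape, p_s.shape, p_t.shape)
```

Because the weights come from a softmax, each phrase embedding is a convex combination of its tree-node embeddings and stays in its own language-specific space.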

Semantic Similarity
To measure the semantic similarity of a bilingual phrase pair, we first transform the learned phrase representations $p_s$ and $p_t$ into a common $d_{sim}$-dimensional semantic space by the non-linear projections in Equations (10) and (11):
$$s_s = f(W^{(5)} p_s + b_s) \tag{10}$$
$$s_t = f(W^{(6)} p_t + b_s) \tag{11}$$
where $W^{(5)} \in \mathbb{R}^{d_{sim} \times d_s}$, $W^{(6)} \in \mathbb{R}^{d_{sim} \times d_t}$, and $b_s \in \mathbb{R}^{d_{sim}}$ are the parameters. As in Equation (6), the same bias term is shared by both projections. Then, to get the final semantic similarity of the bilingual phrases, we calculate the cosine similarity of $s_s$ and $s_t$ by Equation (12) (Figure 2c):
$$s(f, e) = \frac{s_s^{\top} s_t}{\|s_s\| \, \|s_t\|} \tag{12}$$
where $f$ and $e$ indicate the source and target phrase, and $\|\cdot\|$ denotes the L2-norm of a vector. Based on this similarity, we define a semantic error that measures the semantic equivalence of the source and target phrases. Given a positive bilingual phrase pair $(f, e)$ with its negative samples $(f^-, e)$ and $(f, e^-)$, we use the max-margin error function in Equation (13):
$$E_{sem}(f, e) = \max(0, 1 + s(f, e^-) - s(f, e)) + \max(0, 1 + s(f^-, e) - s(f, e)) \tag{13}$$
Intuitively, minimizing this error maximizes the similarity of the positive pair and minimizes the similarity of the negative pairs. For each training instance $(f, e)$, the joint objective of BattRAE is defined by Equation (14):
$$J(\theta) = \alpha E_{rec}(f, e) + \beta E_{sem}(f, e) + R(\theta) \tag{14}$$
where $E_{rec}(f, e) = E_{rec}(f) + E_{rec}(e)$, $\alpha + \beta = 1$, and $R(\theta)$ is the regularization term.
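The scoring and loss computation can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the margin of 1 follows the usual max-margin formulation, and the regularization term is passed in as a plain number rather than computed from parameters.

```python
import numpy as np


def project(p, W, b):
    """Map a phrase embedding into the shared semantic space."""
    return np.tanh(W @ p + b)


def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_error(s_pos, s_neg_tgt, s_neg_src):
    """Max-margin error for one positive pair and its two negative pairs.

    s_pos: score of (f, e); s_neg_tgt: score of (f, e-); s_neg_src: (f-, e).
    """
    return (max(0.0, 1.0 + s_neg_tgt - s_pos)
            + max(0.0, 1.0 + s_neg_src - s_pos))


def joint_objective(e_rec, e_sem, alpha=0.125, reg=0.0):
    """alpha * E_rec + beta * E_sem + R(theta), with beta = 1 - alpha."""
    return alpha * e_rec + (1.0 - alpha) * e_sem + reg


rng = np.random.default_rng(0)
d_s, d_t, d_sim = 4, 3, 5
W5, W6 = rng.normal(size=(d_sim, d_s)), rng.normal(size=(d_sim, d_t))
b_s = np.zeros(d_sim)  # one bias shared by both projections
p_s, p_t = rng.normal(size=d_s), rng.normal(size=d_t)
score = cosine(project(p_s, W5, b_s), project(p_t, W6, b_s))
print(score)
print(semantic_error(0.2, 0.1, -1.0))
print(joint_objective(2.0, 4.0))
```

With a margin of 1 and cosine scores in [-1, 1], the error for a pair vanishes only when the positive score exceeds both negative scores by at least 1.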

Setup
• Facts dataset: Throughout the experiments, we use the facts obtained from ConceptNet version 5.6.0 (https://github.com/commonsense/conceptnet5/wiki/Downloads).
(2) choosing a random word from the phrase and replacing it with its farthest word, found by calculating the cosine distance over the whole vocabulary.
• Semantic similarity model hyperparameters: we set $d_s = d_t = d_a = d_{sim} = 50$ and $\alpha = 0.125$ (so that $\beta = 0.875$), and use the L-BFGS algorithm (libLBFGS, http://www.chokkan.org/software/liblbfgs/) to optimize the objective function.

Entity Filtering and Translation Performance
The filtering and dictionary-based entity-translation results are shown in Table 2. After filtering, we obtain 369,687 Chinese facts, which contain 67,400 unique head entities and 85,800 unique tail entities with 24 relations. Through dictionary-based entity translation, we translate 99,600 of these Chinese facts into Uyghur and get 2,900,000 translated Uyghur facts with 24,800 head and 27,900 tail entities and 23 relations. Because of entity-translation ambiguity, many of the generated facts are incorrect. By using dictionary-based back-translation, we filter out the incorrect or semantically unrelated entities on both the source and the translated side. The back-translation result shows that we effectively remove 95% of the incorrectly translated facts while losing only 36.4% of the facts on the source side: we retain 143,900 translated Uyghur facts for 63,400 Chinese facts after the dictionary-based translation step. Using the hand-crafted rule templates, we then generate 143,900 Uyghur-Chinese parallel sentences and score the semantic similarity of each pair with the trained bilingual semantic similarity scoring model.
There is no publicly available test set for Uyghur-Chinese bilingual phrase similarity measurement, so we randomly select 2000 scored bilingual sentences, using a combination of automatic test and manual check, to get the accuracy of the semantic scoring model.

Automatic Test: We also examine whether our model can recognize correct bilingual phrases, in other words, whether it assigns higher semantic similarity scores to the correct bilingual phrases. We use the semantic accuracy metric of [25] for this evaluation. Formally, given a pair of correct bilingual phrases $(f, e)$ and its incorrect counterpart $(f, e^-)$ (the target replaced with a non-translation target phrase) or $(f^-, e)$ (the source replaced with a non-translation source phrase), the semantic accuracy (SAcc) is defined as follows (taking $(f, e^-)$ as the example):
$$SAcc = \frac{1}{N} \sum_{(f, e)} \delta\big(s(f, e) > s(f, e^-)\big)$$
where $N$ is the number of test pairs and $\delta(\cdot)$ equals 1 if the condition holds and 0 otherwise.
Manual Check: We use crowdsourcing to check the semantic scores, as follows:
• As we use cosine similarity as the scoring metric, whose values range from −1 to 1, we set zero as the semantic similarity threshold.
• Workers check the Uyghur facts with the labels (1) "True, makes sense in every context" and (2) "False, does not make sense, or does not make sense in some contexts".
• Each Uyghur fact is judged by three workers.
• We aggregate the collected judgments by taking the median.
Finally, we keep the Uyghur facts that pass both the automatic test and the manual check. The performance of the scoring model is shown in Figure 3. The model achieves 94.75% accuracy on our task and works well for most relationship templates, except for the SymbolOf and Synonym relationships.
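The two evaluation steps can be sketched as follows. This is an illustration under assumptions: semantic accuracy is taken here to be the fraction of test pairs in which the correct pair scores higher than its corrupted counterpart, worker labels are encoded as 1 (True) and 0 (False), and the function names are hypothetical.

```python
from statistics import median


def semantic_accuracy(pos_scores, neg_scores):
    """Fraction of pairs where the correct pair outscores the corrupted one."""
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(1 for p, n in zip(pos_scores, neg_scores) if p > n)
    return hits / len(pos_scores)


def aggregate_judgments(labels):
    """Median of the worker labels (1 = True, 0 = False) for one fact."""
    return median(labels) == 1


# Toy scores for four (f, e) pairs and their corrupted (f, e-) versions.
pos = [0.8, 0.3, -0.1, 0.6]
neg = [0.1, 0.5, -0.4, 0.2]
sacc = semantic_accuracy(pos, neg)
print(sacc)
print(aggregate_judgments([1, 0, 1]))
```

With three workers and binary labels, taking the median amounts to a majority vote.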

Error Analysis
After analyzing the semantic scoring results, we find two types of errors:
• Unknown Word Error: Although we pretrained the word embeddings on a fairly large corpus, Uyghur is an agglutinative language, and some words are still not covered. Their embeddings are randomly initialized, which lowers the accuracy of the model. For example, the words (politician) and (president) do not get correct embeddings during training and testing, so the sentences that contain them get low scores.
• Template Error: We define a single template for each relationship type, which works well for most facts. However, for some verbs, the form of the dictionary-translated entity does not match the hand-crafted template, so the template generates ungrammatical sentences, especially for Uyghur. For example, Table 3 shows generated sentences with grammatical errors for the Causes relation. Because the dictionary returns Uyghur verbs in the wrong tense, the template generates ungrammatical sentences, which receive incorrect scores in testing.

Construct Uyghur CKB
To get the final projected Uyghur facts, we need to filter the bilingual phrases according to the semantic similarity score. We use two filtering strategies:
• For Chinese facts that have only a single candidate projected Uyghur fact, we keep this fact if its semantic score is greater than zero.
• For Chinese facts that have multiple candidate Uyghur facts, we sort the candidates by score and keep the highest-scoring one.
The results of filtering are shown in Figure 4. After all the above steps, we successfully project 55,872 Chinese facts into Uyghur and obtain 67,375 Uyghur facts.
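The two filtering strategies can be sketched as follows. The function name `select_uyghur_facts` and the toy facts are hypothetical, and the sketch follows the rules literally: the zero threshold is applied only to single-candidate facts, and ties among multiple candidates are broken by input order.

```python
from collections import defaultdict


def select_uyghur_facts(scored):
    """Pick one Uyghur fact per Chinese fact from (zh, ug, score) triples."""
    by_source = defaultdict(list)
    for zh_fact, ug_fact, score in scored:
        by_source[zh_fact].append((score, ug_fact))

    kept = {}
    for zh_fact, candidates in by_source.items():
        if len(candidates) == 1:
            # Single candidate: keep it only if the score is positive.
            score, ug_fact = candidates[0]
            if score > 0:
                kept[zh_fact] = ug_fact
        else:
            # Several candidates: keep the highest-scoring one.
            kept[zh_fact] = max(candidates, key=lambda c: c[0])[1]
    return kept


# Toy scored candidates; the 'ug_' strings are placeholders.
zh1 = ("主机", "CapableOf", "发热")
zh2 = ("蝙蝠", "CapableOf", "飞")
zh3 = ("猫", "IsA", "动物")
scored = [
    (zh1, ("ug_computer", "CapableOf", "ug_heat"), 0.71),
    (zh1, ("ug_computer", "CapableOf", "ug_fever"), 0.12),
    (zh2, ("ug_bat", "CapableOf", "ug_fly"), -0.30),
    (zh3, ("ug_cat", "IsA", "ug_animal"), 0.55),
]
kept = select_uyghur_facts(scored)
print(kept)
```

In the toy data, the first Chinese fact keeps its higher-scoring candidate, the second is dropped because its only candidate scores below zero, and the third is kept.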

Conclusions and Future Work
We propose a method to project knowledge stored in Chinese into the Uyghur language. We focus on CSK, which is required to understand human communication. The main challenges of this work are entity ambiguity and inner relationship ambiguity. To obtain the entity projection, our method uses a Chinese-Uyghur dictionary for entity translation and employs back-translation to resolve entity ambiguity. To resolve the inner relationship ambiguity, we write relationship templates to convert facts into bilingual phrases and use a semantic similarity scoring model to filter the facts. Experiments show that our method works well for our projection task: we successfully projected 55,872 Chinese facts into Uyghur and obtained 67,375 Uyghur facts. There are still more than 300,000 Chinese facts that cannot be translated correctly by the dictionary, and we plan to project them into Uyghur by using an existing Chinese-Uyghur MT system combined with the proposed semantic similarity scoring model.