Constructing Uyghur Commonsense Knowledge Base by Knowledge Projection
Abstract
1. Introduction
2. Related Works
3. Method
3.1. Data Preprocessing
3.2. Dictionary-Based Entity Translation
3.3. Rule-Based Conversion of Structured Knowledge
3.4. Bilingual Semantic Similarity Scoring Model
3.4.1. Learning Multilevel Phrase Embeddings
3.4.2. Bidimensional Attention Network
3.4.3. Semantic Similarity
4. Experiment
4.1. Setup
- Facts dataset: Throughout the experiments, we use facts obtained from ConceptNet version 5.6.0 (https://github.com/commonsense/conceptnet5/wiki/Downloads).
- Dictionary: We use a Chinese–Uyghur bilingual dictionary, which contains 328,000 unique Chinese terms and 531,000 unique Uyghur terms, to translate entities.
- Word embeddings dataset: We use the Word2Vec toolkit (https://github.com/tmikolov/word2vec) to pretrain word embeddings: 11,500,000 Chinese sentences provided by Sogou (http://www.sogou.com/labs/resource/list_yuliao.php) are used to train the Chinese word embeddings, and 1,500,000 Uyghur sentences crawled from the Tianshan website (http://uy.ts.cn/) are used to train the Uyghur word embeddings.
- Semantic similarity model training dataset: To obtain high-quality bilingual phrases for training the semantic scoring model, we use the Moses decoder (http://www.statmt.org/moses/) to perform forced decoding [28] on the CWMT2013 Chinese–Uyghur parallel corpus (https://www.cis.um.edu.mo/cwmt2014/en/cfp.html), which contains 109,000 parallel sentences, plus an additional 1,380,000 collected bilingual phrases. To generate negative samples for each training phrase, we use the two strategies introduced by Hübsch [29]: (1) taking a completely different phrase; and (2) choosing a random word in the phrase and replacing it with the word farthest from it, by cosine distance, over the whole vocabulary.
- Semantic similarity model hyperparameters: we set , (so that ), and use the L-BFGS algorithm (libLBFGS (http://www.chokkan.org/software/liblbfgs/)) to optimize the objective function.
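Negative-sampling strategy (2) above can be sketched as follows. This is a minimal illustration over a toy embedding table, not the authors' code; the function names and the plain-Python vector representation are assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def negative_sample(phrase, embeddings, rng=random):
    """Strategy (2): pick a random word of the phrase and swap it for the
    vocabulary word farthest from it (lowest cosine similarity)."""
    idx = rng.randrange(len(phrase))
    anchor = embeddings[phrase[idx]]
    farthest = min((w for w in embeddings if w != phrase[idx]),
                   key=lambda w: cosine(embeddings[w], anchor))
    corrupted = list(phrase)
    corrupted[idx] = farthest
    return corrupted
```

For example, with embeddings for "good", "great", and "bad", corrupting the one-word phrase ["good"] deterministically yields ["bad"], since "bad" has the lowest cosine similarity to "good".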
4.2. Experiment
4.2.1. Entity Filtering and Translation Performance
4.2.2. Bilingual Semantic Similarity Scoring Model Analysis
- As cosine similarity, whose values range from −1 to 1, is used as the scoring metric, we set zero as the semantic similarity threshold.
- Workers check the Uyghur facts with two labels: (1) “True, makes sense in every context”; (2) “False, does not make sense, or does not make sense in some contexts”.
- Each Uyghur fact is judged by three workers.
- We aggregate the collected judgments by taking the median.
- Unknown Word Error: Although we pretrained word embeddings on a fairly large corpus, Uyghur, being an agglutinative language, still has words that are not covered. These words hurt the accuracy of the model because their embeddings are randomly initialized. For example, the words سىياسىئون (politician) and پرېزىدېنت (president) do not receive correct embeddings during training and testing, so sentences containing them receive low scores.
- Template Error: We define a single template for each relation type, which works well for most facts. However, for some verbs, the form of the dictionary-translated entities does not match the hand-crafted template, so ungrammatical sentences are generated, especially in Uyghur. For example, Table 3 shows grammatically incorrect sentences generated for the Causes relation. Because the dictionary yields Uyghur verb translations in the wrong tense, the template produces ungrammatical sentences that receive incorrect scores at test time.
4.2.3. Construct Uyghur CKB
- For Chinese facts with only a single candidate projected Uyghur fact, we keep the fact if its semantic score is greater than zero.
- For Chinese facts with multiple candidate Uyghur facts, we sort the scores and keep the highest-scoring fact.
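The two selection rules above can be sketched as follows. This is a minimal illustration under the assumption that each candidate is a (fact, score) pair; `select_projection` is a hypothetical name, not the authors' code.

```python
def select_projection(candidates):
    """candidates: list of (uyghur_fact, score) pairs for one Chinese fact.
    A single candidate is kept only if its score is positive; among multiple
    candidates, the highest-scoring one is kept."""
    if not candidates:
        return None
    if len(candidates) == 1:
        fact, score = candidates[0]
        return fact if score > 0 else None
    # Multiple candidates: keep the one with the highest score.
    return max(candidates, key=lambda c: c[1])[0]
```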
5. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Cui, W.; Xiao, Y.; Wang, H.; Song, Y.; Hwang, S.-W.; Wang, W. KBQA: Learning question answering over QA corpora and knowledge bases. Proc. VLDB Endow. 2017, 10, 565–576.
- Xiong, C.; Power, R.; Callan, J. Explicit semantic ranking for academic search via knowledge graph embedding. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1271–1279.
- Davis, E.; Marcus, G. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM 2015, 58, 92–103.
- Young, T.; Cambria, E.; Chaturvedi, I.; Zhou, H.; Biswas, S.; Huang, M. Augmenting end-to-end dialogue systems with commonsense knowledge. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Ostermann, S.; Roth, M.; Modi, A.; Thater, S.; Pinkal, M. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, LA, USA, 5–6 June 2018; pp. 747–757.
- Ma, Y.; Peng, H.; Cambria, E. Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Qiu, L.; Zhang, H. Review of Development and Construction of Uyghur Knowledge Graph. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017; pp. 894–897.
- Abaidulla, Y.; Osman, I.; Tursun, M. Progress on Construction Technology of Uyghur Knowledge Base. In Proceedings of the 2009 International Symposium on Intelligent Ubiquitous Computing & Education, Chengdu, China, 15–16 May 2009.
- Yilahun, H.; Imam, S.; Hamdulla, A. A survey on Uyghur ontology. Int. J. Database Theory Appl. 2015, 8, 157–168.
- Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
- Bharadwaj, A.; Mortensen, D.; Dyer, C.; Carbonell, J. Phonologically aware neural model for named entity recognition in low resource transfer settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1462–1472.
- Mayhew, S.; Tsai, C.-T.; Roth, D. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 2536–2545.
- Xie, J.; Yang, Z.; Neubig, G.; Smith, N.A.; Carbonell, J. Neural Cross-Lingual Named Entity Recognition with Minimal Resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018.
- Chen, M.; Tian, Y.; Yang, M.; Zaniolo, C. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. arXiv 2016, arXiv:1611.03954.
- Wang, Z.; Lv, Q.; Lan, X.; Zhang, Y. Cross-lingual Knowledge Graph Alignment via Graph Convolutional Networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 349–357.
- Klein, P.; Ponzetto, S.P.; Glavaš, G. Improving neural knowledge base completion with cross-lingual projections. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017.
- Xu, K.; Wang, L.; Yu, M.; Feng, Y.; Song, Y.; Wang, Z.; Yu, D. Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network. arXiv 2019, arXiv:1905.11605.
- Faruqui, M.; Kumar, S. Multilingual open relation extraction using cross-lingual projection. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, Denver, CO, USA, 31 May–5 June 2015.
- Barnes, J.; Klinger, R.; Schulte im Walde, S. Bilingual sentiment embeddings: Joint projection of sentiment across languages. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
- Feng, X.; Tang, D.; Qin, B.; Liu, T. English-Chinese knowledge base translation with neural network. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), Osaka, Japan, 11–17 December 2016; pp. 2935–2944.
- Otani, N.; Kiyomaru, H.; Kawahara, D.; Kurohashi, S. Cross-lingual Knowledge Projection Using Machine Translation and Target-side Knowledge Base Completion. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1508–1520.
- Sennrich, R.; Haddow, B.; Birch, A. Improving neural machine translation models with monolingual data. arXiv 2015, arXiv:1511.06709.
- Munire, M.; Li, X.; Yang, Y. Construction of the Uyghur Noun Morphological Re-Inflection Model Based on Hybrid Strategy. Appl. Sci. 2019, 9, 722.
- Ainiwaer, A.; Jun, D.; Xiao, L.I. Rules and Algorithms for Uyghur Affix Variant Collocation. J. Chin. Inf. Process. 2018, 32, 27–33.
- Zhang, B.; Xiong, D.; Su, J.; Qin, Y. Alignment-Supervised Bidimensional Attention-Based Recursive Autoencoders for Bilingual Phrase Representation. IEEE Trans. Cybern. 2018.
- Zhang, B.; Xiong, D.; Su, J. BattRAE: Bidimensional attention-based recursive autoencoders for learning bilingual phrase embeddings. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
- Socher, R.; Pennington, J.; Huang, E.H.; Ng, A.Y.; Manning, C.D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, 27–31 July 2011; pp. 151–161.
- Wuebker, J.; Mauser, A.; Ney, H. Training phrase translation models with leaving-one-out. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 475–484.
- Hübsch, O. Core Fidelity of Translation Options in Phrase-Based Machine Translation. Bachelor’s Thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic, 2017.
Relationship | English | Chinese | Uyghur |
---|---|---|---|
IsA | e1 is part of e2 | e1是e2的一种 | e1 بولسا e2 (نىڭ) بىر تۈرى |
Causes | The effect of e1 is e2 | e1会e2 | e1 بولسا e2 بولىدۇ |
Desires | e2 wants to e1 | e1需要e2 | e1 e2(غا، قا، گە، كە) مۇھتاج |
CapableOf | e1 can e2 | e1会e2 | e2 e1 (نى) قىلالايدۇ |
SymbolOf | e1 represents e2 | e1代表e2 | e2 e1 (نى) ئىپادىلەيدۇ |
HasProperty | e2 is e1 | e1是e2的 | e1 e2 |
RelatedTo | e1 is related to e2 | e1跟e2有关 | e1 بىلەن e2 مۇناسىۋەتىلىك |
UsedFor | You can use e1 to e2 | e2的时候可能会用到e1 | e2 e1 (نى) ئىشلىتىشى مومكىن |
CausesDesire | e1 makes you want to e2 | e1让你想要e2 | e1 سىزنى e2 |
MadeOf | e1 is made of e2 | e1可以用e2制成 | e1 e2 ئارقىلىق ياسىلىدۇ |
NotDesires | e1 not desires e2 | e1不想e2 | e1 e2 (نى) خالىمايدۇ |
AtLocation | You are likely to find e1 in e2 | 你可以在e2找到e1 | سىز e2(دىن، تىن) e1(نى) تاپالايسز |
DerivedFrom | e2 is derived from e1 | e2源自e1 | e1 e2 (دىن، تىن) كەلگەن |
partOf | e2 is part of e1 | e2是e1的一部分 | e1 e2 (نىڭ) بىر قىسمى |
HasSubevent | One of the things you do when you e1 is e2 | 当e1时,可能会e2 | e1 بولسا e2 مومكىن |
Synonym | e1 and e2 are synonymous | e1和e2是同义词 | e1 بىلەن e2 مەنىداش سۆز |
HasA | e2 has e1 | e2有e1 | e2(دا، تا، دە، تە) e1 بار |
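Applying these templates reduces to simple placeholder substitution. The sketch below uses two rows of the table; the dictionary layout and the `generate` name are illustrative assumptions, not the authors' code.

```python
# Uyghur surface templates keyed by ConceptNet relation (subset of the table).
TEMPLATES = {
    "IsA": "e1 بولسا e2 (نىڭ) بىر تۈرى",
    "Causes": "e1 بولسا e2 بولىدۇ",
}

def generate(relation, head, tail):
    """Fill the e1/e2 placeholders of a relation template with the
    dictionary-translated head and tail entities."""
    template = TEMPLATES[relation]
    return template.replace("e1", head).replace("e2", tail)
```

As the Template Error analysis notes, this substitution is purely lexical: it cannot adjust verb tense or suffix allomorphs, which is exactly where the ungrammatical Uyghur sentences arise.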
Method | Chinese Head | Chinese Tail | Chinese Facts | Uyghur Head | Uyghur Tail | Uyghur Facts | Relation Count |
---|---|---|---|---|---|---|---|
Original | 67,464 | 85,832 | 369,687 | — | — | 0 | 24 |
Translation | 12,708 | 15,373 | 99,649 | 24,863 | 27,921 | 2,944,802 | 23 |
Back-translation | 7336 | 8089 | 63,406 | 8835 | 9873 | 143,926 | 17 |
Method | Chinese | Uyghur (Incorrect) | Uyghur (Correct) |
---|---|---|---|
Sentences | 难过会忧郁 (sadness causes melancholy) | كۆڭلى بۇزۇلماق بولسا قايغۇلۇق بولىدۇ | كۆڭلى بۇزۇلسا قايغۇرىدۇ |
 | 愚弄会翻脸 (fooling someone causes a falling-out) | كولدۇرلاتماق بولسا يۈز ئۆرۈمەك بولىدۇ | كولدۇرلاتسا يۈز ئۆرىدۇ |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Anwar, A.; Li, X.; Yang, Y.; Wang, Y. Constructing Uyghur Commonsense Knowledge Base by Knowledge Projection. Appl. Sci. 2019, 9, 3318. https://doi.org/10.3390/app9163318