Semi-Automatic Corpus Expansion and Extraction of Uyghur-Named Entities and Relations Based on a Hybrid Method
Abstract
1. Introduction
2. Related Works
3. Uyghur Relation Extraction Model
3.1. Task Definition
3.2. Feature Template
3.3. Rules
3.4. Hybrid Neural Network Model Training
3.4.1. Bidirectional LSTM-Encoding Layer
3.4.2. LSTM-Encoding Layer
3.4.3. LSTM-Decoding Layer
3.5. CRF Model
4. Experiments
- The corpus size was small.
- Many sentences contained no entities or relations.
- This paper constructed only a basic corpus of Uyghur named entities and relations, and no prior research on relation extraction in Uyghur had been undertaken.
4.1. Dataset
4.2. Experimental Results
- Fwc: the word feature that represents the word itself.
- Fsyllable: the syllable that represents the suffix and prefix of the word.
- Farg_type: the argument type, i.e., whether it is a first or second argument.
- Frelation_type: the relation type that represents the relation between two entities.
- Fposition: the entity position feature that represents the position of the word contained in each entity.
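The five feature groups listed above can be sketched as a per-token feature dictionary. This is a minimal illustration, not the authors' implementation: the `Token` class, the dictionary keys, and the use of fixed-length character affixes as a stand-in for syllable prefixes and suffixes are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    """One word with its (hypothetical) annotation fields."""
    word: str
    position: str = "O"                   # B, I, E, S inside an entity; O outside
    relation_type: Optional[str] = None   # e.g. "Org-Aff.Employment"
    arg_type: Optional[int] = None        # 1 = first argument, 2 = second argument


def token_features(tok: Token, affix_len: int = 3) -> Dict[str, str]:
    """Build the five feature groups for a single token."""
    return {
        "F_wc": tok.word,
        # Assumption: character affixes approximate syllable prefix/suffix.
        "F_syllable_prefix": tok.word[:affix_len],
        "F_syllable_suffix": tok.word[-affix_len:],
        "F_arg_type": str(tok.arg_type) if tok.arg_type is not None else "NONE",
        "F_relation_type": tok.relation_type or "NONE",
        "F_position": tok.position,
    }


example = Token("shirketi", position="I",
                relation_type="Org-Aff.Employment", arg_type=2)
print(token_features(example))
```

A real system would replace the affix slicing with a proper Uyghur syllable or morpheme segmenter; the dictionary layout itself carries over unchanged.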
4.3. Analysis and Discussion
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. dos Santos, C.N.; Xiang, B.; Zhou, B. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 626–634.
2. Han, X.; Liu, Z.; Sun, M. Neural knowledge acquisition via mutual attention between knowledge graph and text. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
3. Zeng, X.; Zeng, D.; He, S. Extracting Relational Facts by an End-to-End Neural Model with Copy Mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 506–514.
4. Luan, Y.; He, L.; Ostendorf, M.; Hajishirzi, H. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3219–3232.
5. Fader, A.; Zettlemoyer, L.; Etzioni, O. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 24–27 August 2014; pp. 1156–1165.
6. Xiang, Y.; Chen, Q.; Wang, X.; Qin, Y. Answer selection in community question answering via attentive neural networks. IEEE SPL 2017, 24, 505–509.
7. Lehmann, J.; Isele, R.M.; Jakob, A.J.; Bizer, C. DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web J. 2015, 6, 167–195.
8. Zhou, P.; Shi, W.; Tian, J. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 207–212.
9. Xu, K.; Feng, Y.; Huang, S. Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 536–540.
10. Zeng, D.; Liu, K. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1753–1762.
11. Lin, Y.; Shen, S.; Liu, Z. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1083–1106.
12. Xu, Y.; Mou, L.I. Classifying Relation via Long Short Term Memory Networks along Shortest Dependency Paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1785–1794.
13. Liu, Z.; Sun, M.; Lin, Y.; Xie, R. Knowledge representation learning: A review. J. Comput. Res. Dev. 2016, 53, 247–261.
14. Abiderexiti, K.; Maimaiti, M.; Yibulayin, T.; Wumaier, A. Annotation schemes for constructing Uyghur named entity relation corpus. In Proceedings of the 2016 International Conference on Asian Language Processing (IALP), Tainan, Taiwan, 21–23 November 2016; pp. 103–107.
15. Parhat, S.; Ablimit, M.; Hamdulla, A. A Robust Morpheme Sequence and Convolutional Neural Network-Based Uyghur and Kazakh Short Text Classification. Information 2019, 10, 387.
16. Takanobu, R.; Zhang, T.; Liu, J. A Hierarchical Framework for Relation Extraction with Reinforcement Learning. arXiv 2018, arXiv:1811.03925.
17. Nguyen, D.Q.; Verspoor, K. End-to-End Neural Relation Extraction Using Deep Biaffine Attention. In Proceedings of the 41st European Conference on Information Retrieval (ECIR 2019), Cologne, Germany, 14–18 April 2019; pp. 1–9.
18. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 260–270.
19. Li, Q.; Ji, H. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA, 23–25 June 2014; pp. 402–412.
20. Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1105–1116.
21. Li, Z.; Yang, Z.; Shen, C. Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text. BMC Med. Inform. Decis. Mak. 2019, 19, 22.
22. Ren, X.; Wu, Z.; He, W.; Qu, M.; Voss, C.R.; Ji, H.; Abdelzaher, T.F.; Han, J. Cotype: Joint extraction of typed entities and relations with knowledge bases. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1015–1024.
23. Liu, Z.; Xiong, C.; Sun, M. Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2395–2405.
24. Xu, P.; Barbosa, D. Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 3–5 June 2019; pp. 3201–3206.
25. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537.
26. Zhang, Z.; Han, X.; Liu, Z. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Fortezza da Basso, Florence, Italy, 28 July–2 August 2019; pp. 1441–1451.
27. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
28. Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; Shi, Y. Spoken language understanding using long short-term memory neural networks. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 189–194.
29. Yao, K.; Peng, B.; Zweig, G.; Yu, D.; Li, X.; Gao, F. Recurrent conditional random field for language understanding. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4077–4081.
30. Mintz, M.; Bills, S.; Snow, R.; Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 – ACL-IJCNLP ’09, Suntec, Singapore, 2–7 August 2009; pp. 1003–1011.
31. Li, Y.; Jiang, J.; Chieu, H.L.; Chai, K.M.A. Extracting relation descriptors with conditional random fields. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 8–13 November 2011; pp. 392–400.
32. Quan, C.; Hua, L.; Sun, X.; Bai, W. Multichannel convolutional neural network for biological relation extraction. Biomed Res. Int. 2016.
33. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1227–1236.
34. Zhang, M.; Zhang, Y.; Fu, G. End-to-end neural relation extraction with global optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; pp. 1730–1740.
35. Luo, L.; Yang, Z.; Yang, P.; Zhang, Y.; Wang, L.; Lin, H.; Wang, J. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 2017, 8, 1381–1388.
36. Wushouer, J.; Abulizi, W.; Abiderexiti, K.; Yibulayin, T.; Aili, M.; Maimaitimin, S. Building contemporary Uyghur grammatical information dictionary. In Worldwide Language Service Infrastructure. WLSI 2015; Murakami, Y., Lin, D., Eds.; Springer: Cham, Switzerland, 2016; pp. 137–144.
37. Aili, M.; Xialifu, A.; Maihefureti, M.; Maimaitimin, S. Building Uyghur dependency treebank: Design principles, annotation schema and tools. In Worldwide Language Service Infrastructure. WLSI 2015. Lecture Notes in Computer Science; Murakami, Y., Lin, D., Eds.; Springer: Cham, Switzerland, 2016; pp. 124–136.
38. Maimaiti, M.; Wumaier, A.; Abiderexiti, K.; Yibulayin, T. Bidirectional long short-term memory network with a conditional random field layer for Uyghur part-of-speech tagging. Information 2017, 8, 157.
39. Zheng, S.; Hao, Y.; Lu, D.; Bao, H.; Xu, J.; Hao, H.; Xu, B. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing 2017, 257, 59–66.
Type | Content (in Uyghur) | Content (in English) |
---|---|---|
Sentence | Alim Adilning ayali Aliye Ubul. | Alim Adil’s wife is Aliye Ubul |
First entity | Alim Adilning | Alim Adil’s |
Relation type | Personal.Family | Personal.Family |
Tail entity | Aliye Ubul | Aliye Ubul |
Uyghur | English | Tag |
---|---|---|
Xintian | Xintian | B_Org–Aff.Employment_2 |
shirketi | company | I_Org–Aff.Employment_2 |
kurghuchisi | founder | E_Org–Aff.Employment_2 |
Zilu | Zilu | S_Org–Aff.Employment_1 |
xongliyen | Honglian | B_Org–Aff.Employment_4 |
shirkiti | company | I_Org–Aff.Employment_4 |
kurghuchisi | founder | E_Org–Aff.Employment_4 |
dupëng | Du peng | B_Org–Aff.Employment_3 |
bilen | and | O |
körüxti | see | O |
. | . | O |
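The tagging example above can be read back into entity spans with a small decoder. The sketch below assumes that each non-`O` tag has the form `<position>_<relation type>_<argument index>` with positions B, I, E, and S, as the table suggests. How argument indices pair up into relation instances is not specified in this section, so the sketch stops at collecting labelled spans. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, str, int]  # (entity text, relation type, argument index)


def parse_tag(tag: str) -> Tuple[str, str, int]:
    """Split a tag such as 'B_<relation>_2' into (position, relation, index)."""
    position, rest = tag.split("_", 1)
    relation, index = rest.rsplit("_", 1)
    return position, relation, int(index)


def collect_spans(words: List[str], tags: List[str]) -> List[Span]:
    """Group consecutive tagged words into labelled entity spans."""
    spans: List[Span] = []
    current: List[str] = []
    label: Optional[Tuple[str, int]] = None  # label of the span being built

    def flush() -> None:
        nonlocal current, label
        if current and label is not None:
            spans.append((" ".join(current), label[0], label[1]))
        current, label = [], None

    for word, tag in zip(words, tags):
        if tag == "O":
            flush()
            continue
        position, relation, index = parse_tag(tag)
        # Start a new span on B/S or whenever the label changes.
        if position in ("B", "S") or label != (relation, index):
            flush()
            label = (relation, index)
        current.append(word)
        if position in ("E", "S"):
            flush()
    flush()
    return spans


REL = "Org–Aff.Employment"
words = ["Xintian", "shirketi", "kurghuchisi", "Zilu", "xongliyen",
         "shirkiti", "kurghuchisi", "dupëng", "bilen", "körüxti", "."]
tags = [f"B_{REL}_2", f"I_{REL}_2", f"E_{REL}_2", f"S_{REL}_1", f"B_{REL}_4",
        f"I_{REL}_4", f"E_{REL}_4", f"B_{REL}_3", "O", "O", "O"]
spans = collect_spans(words, tags)
for span in spans:
    print(span)
```

The decoder closes an open span when it meets an `O` tag, which is why the single-word entity tagged `B_…_3` in the table is still recovered.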
Feature Type | Template | Meaning |
---|---|---|
Atomic feature | Current word and the words of its upper and lower two windows; the window size is 5. | |
Atomic feature | Characteristics of the current word and the words of its upper and lower windows; the window size is 3. | |
Composite feature | Combination of the current word and the word in its upper window. | |
Composite feature | Combination feature of the current word and the word in its upper window. | |
Composite feature | Current word and its combination features. | |
Composite feature | Characteristics of the current word and the combination features of its upper and lower windows. | |
Feature Template | Feature Meaning | Representative Character | English |
---|---|---|---|
%x[−2, 0] | –2 rows from the current row, column 0 | Alim | Alim |
%x[−1, 0] | −1 row from the current row, column 0 | adilning | Adil’s |
%x[0, 0] | 0 rows from the current row, column 0 | ayali | wife |
%x[1, 0] | 1 row from the current row, column 0 | Aliye | Aliye |
%x[2, 0] | 2 rows from the current row, column 0 | Ubul | Ubul |
%x[3, 0] | 3 rows from the current row, column 0 | . | . |
%x[−1, 0]/%x[0, 0] | −1 row from the current row, column 0, combination of row 0 and column 0 | adilning/ayali | Adil’s/wife |
%x[0, 0]/%x[1, 0] | row 0, column 0, the combination of row 1 and column 0 | ayali/Aliye | wife/Aliye |
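The `%x[row, col]` macros in the table above address a cell relative to the current row. A minimal sketch of that expansion follows, using the example sentence with "ayali" as the current word. The `expand` function and the `<PAD>` placeholder for rows outside the sentence are assumptions made for illustration and do not reproduce any particular CRF toolkit's behaviour.

```python
import re
from typing import List

# Matches macros such as %x[-1, 0]; ASCII minus is used in code.
MACRO = re.compile(r"%x\[(-?\d+),\s*(\d+)\]")
PAD = "<PAD>"  # hypothetical placeholder for out-of-range rows


def expand(template: str, rows: List[List[str]], pos: int) -> str:
    """Replace every %x[row, col] macro with the addressed cell value."""
    def lookup(match) -> str:
        row = pos + int(match.group(1))   # offset relative to the current row
        col = int(match.group(2))         # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return PAD
    return MACRO.sub(lookup, template)


# One column per row: the word itself (column 0).
rows = [[w] for w in ["Alim", "adilning", "ayali", "Aliye", "Ubul", "."]]
current = 2  # index of "ayali"

print(expand("%x[-2, 0]", rows, current))           # Alim
print(expand("%x[0, 0]", rows, current))            # ayali
print(expand("%x[-1, 0]/%x[0, 0]", rows, current))  # adilning/ayali
print(expand("%x[0, 0]/%x[1, 0]", rows, current))   # ayali/Aliye
```

Composite templates need no special handling: each macro in the string is substituted independently and the separator is kept as written.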
Types | Subtypes |
---|---|
Part–Whole | Part–Whole.Geo |
Part–Whole | Part–Whole.Subsidiary |
Per–Social | Per–Social.Business |
Per–Social | Per–Social.Family |
Per–Social | Per–Social.Role |
Per–Social | Per–Social.Other |
Physical | Physical.Located |
Physical | Physical.Near |
Org–Aff | Org–Aff.Employment |
Org–Aff | Org–Aff.Investor–Shareholder |
Org–Aff | Org–Aff.Student–Alum |
Org–Aff | Org–Aff.Owner |
Org–Aff | Org–Aff.Founder |
Gen–Aff | Gen–Aff.Person–Age |
Gen–Aff | Gen–Aff.Organizationwebsite |
Relation Type | Arg1 | Arg2 |
---|---|---|
Physical.Located | PER | FAC, LOC, GPE |
Entity | adil | shangxeyde |
English | Adil | in Shanghai
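The argument-type constraint in the table above lends itself to a simple rule check. The sketch below is an illustration under assumptions rather than the authors' rule set. Only the `Physical.Located` entry comes from the table. The `is_valid` function, the policy of accepting relations with no listed constraint, and the GPE type assigned to "shangxeyde" in the usage line are hypothetical.

```python
from typing import Dict, Set, Tuple

# Allowed (Arg1 entity types, Arg2 entity types) per relation subtype.
# Only Physical.Located is taken from the table; other rows would be added alike.
CONSTRAINTS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "Physical.Located": ({"PER"}, {"FAC", "LOC", "GPE"}),
}


def is_valid(relation: str, arg1_type: str, arg2_type: str) -> bool:
    """Return True when both argument types satisfy the relation's constraint."""
    if relation not in CONSTRAINTS:
        return True  # assumption: relations without a listed rule are not filtered
    allowed_arg1, allowed_arg2 = CONSTRAINTS[relation]
    return arg1_type in allowed_arg1 and arg2_type in allowed_arg2


# "adil" is a person; "shangxeyde" is assumed here to be tagged GPE.
print(is_valid("Physical.Located", "PER", "GPE"))
print(is_valid("Physical.Located", "ORG", "GPE"))
```

A check of this kind can be applied after tagging to discard candidate relations whose argument types are incompatible.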
Type | Original Corpus [14] | Expanded Corpus |
---|---|---|
Documents | 571 | 1032 |
Filtered documents (relational documents) | 422 | 842 |
Sentences | 6173 | 17,765 |
Words | 27,846 | 1,142,241 |
Total Number of Entity Types | Entity Coverage (%) |
---|---|
217,605 | 88.4 |
Entity Type | Entity Coverage (%) |
---|---|
GPE | 44.51 |
ORG | 21.33 |
PER | 16.71 |
TTL | 8.46 |
LOC | 7.21 |
FAC | 1.41 |
AGE | 0.29 |
URL | 0.07 |
Number of Relations | Relation Coverage (%) |
---|---|
4307 | 51.60 |
ID | Relation Type | NUM | ID | Relation Type | NUM |
---|---|---|---|---|---|
1 | Org–Aff.Employment | 534 | 9 | Physical.Near | 725 |
2 | Org–Aff.Owner | 234 | 10 | Gen–Aff.Organizationwebsite | 556 |
3 | Per–Social.Family | 240 | 11 | Per–Social.Other | 525 |
4 | Per–Social.Role | 1437 | 12 | Gen–Aff.Person–Age | 167 |
5 | Part–Whole.Geo | 513 | 13 | Org–Aff.Student–Alum | 154 |
6 | Org–Aff.Founder | 454 | 14 | Part–Whole.Subsidiary | 1475 |
7 | Org–Aff.Investor–Shareholder | 634 | 15 | Per–Social.Business | 168 |
8 | Physical.Located | 159 | | | |
Type | Percentage of Type (%) | Coverage of Type (%) |
---|---|---|
Physical | 4.32 | 2.23 |
Part–Whole | 46.83 | 24.16 |
Gen–Aff | 0.49 | 0.25 |
Per–Social | 35.52 | 18.33 |
Org–Aff | 12.84 | 6.63 |
Relation Subtype | Types (%) | Subtypes (%) |
---|---|---|
Subsidiary | 34.53 | 17.8 |
Role | 33.83 | 17.4 |
Employment | 12.54 | 6.47 |
Per–Social | 35.52 | 18.33 |
Org–Aff | 12.84 | 6.63 |
Located | 3.74 | 41.93 |
Other | 1.25 | 65.65 |
Near | 45.58 | 45.30 |
Family | 32.44 | 65.23 |
Person–Age | 45.37 | 54.19 |
Organization Website | 43.12 | 76.06 |
Business | 32.00 | 43.00 |
Investor–Shareholder | 54.14 | 54.07 |
Student–Alum | 43.02 | 65.01 |
Ownership | 67.05 | 54.02 |
Founder | 54.09 | 65.05 |
Type | P (%) | R (%) | F1 (%) |
---|---|---|---|
80.00 | 37.98 | 57.14 |
Parameter Description | Value |
---|---|
Dimension of word embedding | 300 |
The number of hidden units in the encoding layer | 300 |
The number of hidden units in decoding layer | 300 |
Context window size of CNN module | 3 |
The filter number of CNN | 100 |
Dropout ratio | 0.3 |
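For reproducibility, the hyperparameters in the table above can be gathered in one configuration object. The values are those listed in the table. The class and field names are hypothetical and not tied to any particular framework.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HybridModelConfig:
    """Hyperparameters from the table; field names are illustrative."""
    word_embedding_dim: int = 300      # dimension of word embedding
    encoder_hidden_units: int = 300    # hidden units in the encoding layer
    decoder_hidden_units: int = 300    # hidden units in the decoding layer
    cnn_context_window: int = 3        # context window size of the CNN module
    cnn_num_filters: int = 100         # number of CNN filters
    dropout: float = 0.3               # dropout ratio


config = HybridModelConfig()
print(asdict(config))
```

Freezing the dataclass prevents the settings from being changed accidentally between training and evaluation runs.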
Model | Features | P (%) | R (%) | F1 (%) |
---|---|---|---|---|
Our CRF method | Fwc | 80.00 | 37.98 | 57.14 |
Our CRF method | Fwc + Frelation_type + Fposition, relation_type, arg_type | 85.00 | 42.34 | 61.34 |
Our CRF method | Fwc + Fposition + Frelation_type + Fposition, relation_type, arg_type, wc, position+relation_type+arg_type | 85.43 | 43.56 | 57.64 |
Our CRF method | Fwc + Fposition + Frelation_type + Fposition, relation_type, arg_type, wc, position | 84.77 | 41.94 | 56.11 |
Our CRF method | Fwc + Fposition + Frelation_type + Fposition, relation_type, arg_type, wc, position+relation_type+arg_type | 86.89 | 41.88 | 56.51 |
Our hybrid neural network + features | Fposition + Frelation_type + Farg_type | 65.2 | 38.1 | 48.7 |
Hybrid neural network [39] | - | 44.14 | 38.25 | 34.61 |
CNN [1] | - | 57.4 | 25.6 | 35.4 |
LSTM-LSTM [34] | - | 42.3 | 51.1 | 41.5 |
Type | P (%) | R (%) | F1 (%) |
---|---|---|---|
All | 83.33 | 26.32 | 40.00 |
Org–Aff.Employment | 75 | 33.33 | 46.15 |
Org–Aff.Owner | 100 | 50 | 66.67 |
Per–Social.Family | 75 | 33.33 | 46.15 |
Per–Social.Role | 100 | 20 | 33.33 |
Part–Whole.Geo | 66.67 | 42.34 | 51.78 |
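The F1 column in the per-type results above is the harmonic mean of precision and recall. A minimal sketch of that calculation, applied to three (P, R) pairs from the table, is shown here; the function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, taken from the per-type table.
print(round(f1_score(100, 50), 2))    # 66.67
print(round(f1_score(100, 20), 2))    # 33.33
print(round(f1_score(75, 33.33), 2))  # 46.15
```

Because the tabulated precision and recall are themselves rounded, a recomputed F1 can differ from a reported value in the last decimal place.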
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Halike, A.; Abiderexiti, K.; Yibulayin, T. Semi-Automatic Corpus Expansion and Extraction of Uyghur-Named Entities and Relations Based on a Hybrid Method. Information 2020, 11, 31. https://doi.org/10.3390/info11010031