Research on Uyghur Pattern Matching Based on Syllable Features
Abstract
1. Introduction
2. Related Research
3. Uyghur Language
3.1. Uyghur Alphabet
3.2. Morphological Changes of Words
3.3. Vowel Weakening
3.4. Syllable-Encoded Text
4. Uyghur String Matching
4.1. Basic Concepts
- uChar is a Uyghur Unicode character; its encoding range is U+0600–U+06FF.
- Sb is a syllable composed of several uChar. When Sb consists of three uChar, its structure is Sb [uChar1 uChar2 uChar3].
- Sc is the syllable encoding of Sb in the B16 encoding scheme [47]. Each Sc has the length of one Unicode character, and its encoding range lies in the Unicode Private Use Area (U+E000–U+F8FF).
- W is a Uyghur word composed of several Sb, W (Sb1Sb2…Sbn); its length is equal to the number of uChar in the word.
- Wz is the result of syllable segmentation and encoding of word W. When W has three syllables, its structure is Wz (Sc1Sc2Sc3), and the length of Wz is equal to the number of syllables of W.
- P is a pattern, here a noun stem; its structure is similar to W, and W = P + inflectional suffix.
- Pz is the syllable code of pattern P, and its structure is similar to Wz.
- T is a text containing n W, its structure is T (W0, W1, …, W (n−1)).
- Tz is a syllable-encoded compressed text, which is generated by T after syllable encoding. Its structure is Tz (Wz0, Wz1, …, Wz(n−1)).
- Structure matching: string S1 structurally matches string S2 when S1 is completely contained in S2 with its character sequence unchanged; the length of S1 is then ≤ the length of S2.
- Semantic matching: when W = P + inflectional suffixes, the semantics of pattern P are included in word W, and P and W match semantically. Sometimes vowel weakening changes the structure of W so that it no longer matches the structure of P. The length of pattern P is less than or equal to the length of word W.
- Matching result: when P matches a word W in T semantically or structurally, the complete W is returned. For example, when P = man and T = {"other", "manchu", "mankind", "man", "men"}, the result of structure matching for P is {"manchu", "mankind", "man"}, and the result of semantic matching is {"man", "mankind", "men"}.
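The structure matching definition can be illustrated with a minimal Python sketch that reproduces the "man" example. The function name `structure_match` is an assumption introduced for illustration and does not come from the paper; semantic matching is not modeled here because it requires morphological knowledge.

```python
def structure_match(pattern, words):
    """Return every word that contains `pattern` as an unbroken substring."""
    return [w for w in words if pattern in w]


T = ["other", "manchu", "mankind", "man", "men"]
print(structure_match("man", T))  # ['manchu', 'mankind', 'man']
```

Note that "men" is missed and "manchu" is returned, which is exactly the gap between structure matching and semantic matching described above.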
4.2. Retrieval Parameters and Calculation Formulas
4.3. Preparation of Experimental Corpus
- Type A corpus, generated by an algorithm. First, we select 22 high-frequency words based on word length, syllable structure, and number of syllables (Table 1); 11 of these words undergo vowel weakening (Word V.W) and cover all four vowel weakening types (a→i, a→e, ä→i, ä→e). We then design a morphology-based word generator built on stemming. Taking nouns as an example, this algorithm generates all 312 forms of P found in a dictionary [45] by adding 1–4 layers of suffixes. The experimental corpus T generated in this way covers 22 × 312 = 6864 forms of the 22 words. If the matching algorithm can match all 312 forms of a pattern P, then in theory it can recognize and match every form of P in any natural language environment, and hence recall = 1. Once recall = 1, corpora B and C can be used in the next experiments to test the pattern matching ability of the algorithm in a natural language environment.
- Type B corpus, made from natural language text. We collect real-world text and segment it into words to generate a word dataset. The dataset contains each word appearing in the corpus and its frequency of occurrence F. The corpus is Unicode-encoded text with a size of 46.8 MB and covers comprehensive news, agricultural technology, agency names, novels, natural sciences, dictionaries and encyclopedias, and social media short texts. It contains 136,523 unique words and 7434 unique syllables.
- Type A and type B corpora are lists of experimental words obtained through algorithmic derivation and database fuzzy query; type C is a paragraph of natural language text composed of several typical sentences.
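As a rough illustration of how a suffix-based word generator produces forms for the type A corpus, the sketch below attaches a single suffix layer to the stem quš. It is not the authors' generator: the suffix shapes are copied from the quš forms listed in the matching-expression table of this paper, they are valid only for this stem (other stems take different variants and may undergo vowel weakening), and the name `one_layer_forms` is hypothetical.

```python
# Suffix shapes as they appear on the stem "quš" in this paper's table.
PLURAL = ["lar"]
PERSONAL = ["um", "umiz", "uŋ", "uŋiz", "i"]
CASE = ["niŋ", "qa", "ni", "ta", "tin", "täk", "tiki", "qičä", "čä", "čilik"]


def one_layer_forms(stem):
    """Attach exactly one suffix to the stem (one suffix layer only)."""
    return [stem + suffix for suffix in PLURAL + PERSONAL + CASE]


forms = one_layer_forms("quš")
print(len(forms))  # 16
print(forms[:3])   # ['qušlar', 'qušum', 'qušumiz']
```

The full generator described above stacks 1–4 such layers and selects suffix variants per stem, which is how it reaches 312 forms per word.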
4.4. Matching of Existing Algorithms
- P matches T exactly, and Pz matches Tz exactly; for example, toxu and därya.
- P matches T exactly, but Pz matches Tz only partially; for example, quš.
- Both P in T and Pz in Tz produce matching failures; for example, naxša.
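The difference between the second and third cases can be made concrete with a small sketch that compares character-level containment (T) with syllable-level containment (Tz). Words are given as hand-written syllable lists based on forms quoted in this paper (quš + lar, qu + šum, nax + ši + niŋ); the real pipeline uses automatic syllable segmentation and B16 codes, which are replaced here by plain Python lists as a simplifying assumption. Both function names are hypothetical.

```python
def char_match(pattern, word):
    """Character-level structure matching on the joined syllables."""
    return "".join(pattern) in "".join(word)


def syllable_match(pattern, word):
    """Syllable-level matching: pattern syllables must appear contiguously."""
    m = len(pattern)
    return any(word[i:i + m] == pattern for i in range(len(word) - m + 1))


qus = ["quš"]
naxsa = ["nax", "ša"]

for pattern, word in [
    (qus, ["quš", "lar"]),           # qušlar: matches in T and in Tz
    (qus, ["qu", "šum"]),            # qušum: matches in T only
    (naxsa, ["nax", "ši", "niŋ"]),   # naxšiniŋ: weakened, fails in both
]:
    print("".join(word), char_match(pattern, word), syllable_match(pattern, word))
```

The qušum row fails at the syllable level because the stem-final consonant moves into the next syllable, and the naxšiniŋ row fails at both levels because vowel weakening changes the stem itself.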
5. Improvement of Matching Algorithm
5.1. Analysis
5.2. Solutions
- alma: four structures can express all other structures within T and Tz. They are alma * | alma + sx, almam * | almam + sx (first person), almaŋ * | almaŋ + sx (second person), and almi * | almi + sx (third person, weakened).
- quš: the pattern matches all forms of quš * in T, and in Tz it must cover quš + sx and qu + š# + sx. Here, we need to determine the value range of š#. According to the previous calculation, š# has 6408 possibilities. Observing š# reveals the following structures: quš + sx (stem, no person), qu + šum + sx (first person), qu + šuŋ + sx (second person), and qu + ši + sx (third person). All three personal forms of quš can therefore be represented with three structures, which is a very interesting phenomenon. It means that the value range of š# can be reduced from 6408 to only 3 (šum, šuŋ, ši), and the remaining 6405 can be ignored. Another useful result is that if the current value of š# is not one of these three, the word Wz in Tz can be judged not to satisfy the semantic matching condition Wz = Pz + inflectional suffix. For example, when Tz = {Wz1 = so+qu+šuš+niŋ (the war's…), Wz2 = tö+gi+qu+ši+niŋ (the ostrich's…)}, the third syllable of Wz1, šuš, is not in (šum, šuŋ, ši), so this method automatically excludes Wz1 and matches Wz2. When Pz is used to search Tz, some words that are semantically unrelated to Pz can thus be excluded from the search results without performing a semantic analysis; this further improves precision and makes retrieval faster. The method is also effective for words that undergo vowel weakening, such as alma. If the structure formed after a suffix is added to alma does not satisfy (al+mam, al+maŋ, al+mi), then the matching result is not related to alma; for example, almas (al+mas: diamond) is not semantically related to alma (apple).
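The restricted value range of š# can be sketched as a simple syllable filter. The code below works on hand-segmented syllable lists rather than B16 codes, and the names `ALLOWED_AFTER_QU` and `matches_qus` are assumptions made for this illustration; it only encodes the rule stated above (quš as a whole syllable, or qu followed by šum, šuŋ, or ši).

```python
ALLOWED_AFTER_QU = {"šum", "šuŋ", "ši"}  # the three admissible values of š#


def matches_qus(syllables):
    """True if the word contains quš + sx or qu + š# + sx with š# allowed."""
    for i, syllable in enumerate(syllables):
        if syllable == "quš":
            return True
        if (syllable == "qu" and i + 1 < len(syllables)
                and syllables[i + 1] in ALLOWED_AFTER_QU):
            return True
    return False


wz1 = ["so", "qu", "šuš", "niŋ"]        # soqušušniŋ (the war's...)
wz2 = ["tö", "gi", "qu", "ši", "niŋ"]   # tögiqušiniŋ (the ostrich's...)
print(matches_qus(wz1), matches_qus(wz2))  # False True
```

Because the check is a set lookup on one following syllable, unrelated words such as Wz1 are rejected without any semantic analysis.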
5.3. Improvement of BM Algorithm
Algorithm 1. BM-U (P, T) pattern matching algorithm

input: P (a stem), T (the text)

output: match_num

    match_num ← 0
    start position ← 0
    if IsVowelWeaken(P) then
        Pv ← shortest vowel-weakened form of P
    end if
    P_list ← append each personal suffix to stem P, Pv
    P_list_common_part ← get common part of P_list
    for i = 0 … P_list.items.count − 1 do
        P_list_remain_parts[i] ← P_list[i] − P_list_common_part
    end for
    if P_list_common_part.length = 0 then   // no common part
        for i = 0 … P_list.items.count − 1 do
            Boyer_Moore(T, start position, P_list[i])
        end for
        return match_num
    else
        do
            found = Boyer_Moore(T, start position, P_list_common_part)
            if found then
                for i = 0 … P_list_remain_parts.length − 1 do
                    P = P_list_common_part + P_list_remain_parts[i]
                    if pattern_match(P) then
                        match_num++
                        start position ← next start position
                        break   // found one; look for the next P_list_common_part
                    end if
                end for
                // remaining parts failed to match; look for the next P_list_common_part
                start position ← next start position
            else
                return match_num
            end if
        while start position < T.length − P_list_common_part.length
    end if
    return match_num
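The control flow of Algorithm 1 can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: Python's built-in `str.find` stands in for the Boyer-Moore search, the common part is taken to be the longest common prefix of the variants, and the variant list for alma is the one given in Section 5.2. The name `bm_u_count` is hypothetical.

```python
import os


def bm_u_count(text, variants):
    """Count occurrences of any variant: find the common part, then verify a remainder."""
    common = os.path.commonprefix(variants)
    if not common:
        # No shared part: search for each variant independently.
        return sum(text.count(v) for v in variants)
    remains = [v[len(common):] for v in variants]
    count, start = 0, 0
    while True:
        pos = text.find(common, start)  # stand-in for Boyer_Moore(T, start, common)
        if pos < 0:
            return count
        tail = pos + len(common)
        # Verify that one of the remaining parts follows the common part.
        if any(text.startswith(r, tail) for r in remains):
            count += 1
        start = pos + 1


alma_variants = ["alma", "almam", "almaŋ", "almi"]  # stem, personal forms, weakened form
print(bm_u_count("almilar alma almida alxan", alma_variants))  # 3
```

Searching once for the shared part and checking only the short remainders avoids running a separate full search for every variant of the stem.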
6. Experiment and Analysis
- We used the BM-U algorithm with the algorithm-generated type A corpus to test how well it matches the word forms of pattern P. If recall = 1, the new algorithm can recognize all word forms of pattern P, and the type B and type C corpora can then be used to measure its precision, accuracy, and F1-measure and to observe its recall in a natural corpus environment.
- We used the BM and BM-U algorithms to test the matching ability of patterns P and Pz on the natural language type B corpora T and Tz. These experiments show the matching performance of the two algorithms on the two encoding formats and allow us to calculate the impact of the algorithm and of the matching unit on the matching rate.
- To make the matching performance of the new algorithm easier to observe, both algorithms were used in demonstrative pattern matching experiments on natural language paragraphs from the type C corpus.
- We performed pattern matching on the syllable-encoded file format Tz for monosyllabic and non-syllabic strings and compared the character-based text T with the syllable-based text Tz.
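The precision, recall, and F1-measure named above can be computed as in the following sketch, which uses the standard definitions. The formulas of Section 4.2 are not reproduced in this text, so reading Pm as the number of words returned by the matcher and Pr as the number of relevant words is an assumption; with recall = 1 for the stem it on text T, this reading reproduces the reported values 0.07 and 0.13. The function name is hypothetical.

```python
def precision_recall_f1(matched, relevant, true_positive):
    """Standard retrieval metrics from raw counts."""
    precision = true_positive / matched if matched else 0.0
    recall = true_positive / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Stem "it" on text T: 380 words matched, 27 relevant, all 27 found.
p, r, f = precision_recall_f1(matched=380, relevant=27, true_positive=27)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.07 1.0 0.13
```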
6.1. BM-U Word Morphology Matching Ability
6.2. Matching Experiments on Natural Language Words
6.2.1. Experimental Results
6.2.2. Analysis of Experimental Results
6.3. Matching Experiments on Natural Language Sentences
6.4. Monosyllabic and Non-syllabic Retrieval
6.5. Comparison with Other Related Studies
7. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Ma, Z.F.; Yang, S.Y.; Guo, G.F. A fast improved pattern matching algorithm based on BM. Control Decis. 2013, 28, 1855–1858.
- Zhou, X.; Xu, B.; Qi, Y.; Li, J. MRSI: A Fast Pattern Matching Algorithm for Anti-virus Applications. In Proceedings of the International Conference on Networking, Cancun, Mexico, 13–18 April 2008; pp. 256–261.
- Du, F. A Faster Pattern Matching Algorithm for Intrusion Detection. Adv. Mater. Res. 2012, 532, 1414–1418.
- Tahir, M.; Sardaraz, M.; Ikram, A.A. EPMA: Efficient pattern matching algorithm for DNA sequences. Expert Syst. Appl. 2017, 80, 162–170.
- Gagie, T.; Gawrychowski, P.; Puglisi, S.J. Approximate pattern matching in LZ77-compressed texts. J. Discret. Algorithms 2015, 32, 64–68.
- Ablez, W.; Ayzohra; Niyaz, A.; Osman, I. Study on the Some Key Technology of Improving the Quality of Uyghur Search. Math. Pract. Theory 2013, 43, 119–123.
- Xue, P.Q.; Xian, Y.; Nurbol; Silamu, W. Sensitive information filtering algorithm based on Uyghur text information network research. Comput. Eng. Appl. 2018, 54, 236–241.
- Mahmoud, A.; Yusuf, H.; Zhang, J.; Zong, C.; Hamdulla, A. Name recognition in the Uyghur language based on fuzzy matching and syllable-character conversion. J. Tsinghua Univ. 2017, 57, 188–196.
- Kahaerjiang, A.; Tuergen, Y.; Tianfang, Y.; Aishan, W.; Aishan, M. An Improved Method for Uyghur Sentence Similarity Computation. J. Chin. Inf. Process. 2011, 25, 50–53.
- Boyer, R.S.; Moore, J.S. A Fast String Searching Algorithm. Commun. ACM 1977, 20, 762–772.
- Wu, S.; Manber, U. A Fast Algorithm for Multi-Pattern Searching; Technical Report TR-94-17; University of Arizona: Tucson, AZ, USA, 1994.
- Xiaohua, L. A Boyer-Moore Type String Matching Algorithm with Memory and Its Computational Complexity. J. Hunan Univ. Nat. Sci. 2008, 35, 84–88.
- Yipe, W. An Improved Wu-Manber Multi-pattern Matching Algorithm for Chinese Encoding. J. Chin. Comput. Syst. 2015, 36, 778–781.
- Nurmemet, Y.; Wushour, S.; Reyiman, T. Syllable based language model for large vocabulary continuous speech recognition of Uyghur. J. Tsinghua Univ. Sci. Technol. 2013, 53, 741–744.
- Mamateli, T. Context dependent syllable based speech synthesis system for Uyghur. Comput. Eng. Appl. 2011, 47, 141–143.
- Mahmut, M.; Turgun, I. A Research on Syllable Based Uyghur Text Proofreading System. In Proceedings of the Ninth National Conference on Computational Linguistics, CCL 2007, Dalian, China, 6–8 August 2007.
- Ranagul, D.; Askar, H.; Dilmurat, T. Acoustic Analysis on Prosodic Feature of CVC Type Syllable in Uyghur Language. Comput. Eng. 2011, 37, 193–195.
- Isabel, R.M.C. Comparison of Uyghur and Spanish syllables. China Natl. Exhib. 2018, 8, 119–121.
- Dawut, Y.; Abdureyim, H.; Yang, N.N. Research on Multiple Pattern Matching Algorithm for Uyghur. Comput. Eng. 2015, 41, 143–148.
- Tohti, T.; Huang, J.; Hamdulla, A.; Tan, X. Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur. Inf. Int. Interdiscip. J. 2019, 10, 246.
- Culpepper, J.S.; Moffat, A. Phrase-Based Pattern Matching in Compressed Text. In International Symposium on String Processing and Information Retrieval; Springer: Berlin/Heidelberg, 2006; Volume 4209, pp. 337–345.
- Huynh, T.N.; Hon, W.K.; Lam, T.W.; Sung, W.K. Approximate string matching using compressed suffix arrays. Theor. Comput. Sci. 2006, 352, 240–249.
- YongKang, X.; Guanglu, Y.; Songfeng, L. Approximate string matching algorithm based on compressed suffix array. Comput. Eng. Appl. 2015, 51, 139–142.
- Buluş, H.N.; Carus, A.; Mesut, A. A new word-based compression model allowing compressed pattern matching. Turk. J. Electr. Eng. Comput. Sci. 2017, 25, 3607–3622.
- Karkkainen, J.; Navarro, G.; Ukkonen, E. Approximate string matching on Ziv-Lempel compressed text. J. Discret. Algorithms 2003, 1, 313–338.
- Wang, J.F.; Li, Z.R.; Cai, C.Z.; Chen, Y.Z. Assessment of approximate string matching in a biomedical text retrieval problem. Comput. Biol. Med. 2005, 35, 717–724.
- Navarro, G.; Tarhio, J. LZgrep: A Boyer–Moore string matching tool for Ziv–Lempel compressed text. Softw. Pract. Exp. 2005, 35, 1107–1130.
- Quanzhu, Y.; Xiaojian, D.; Xueli, R.; Zhifeng, Z. Research of BWT-Boyer-Moore Compressed Domain Search Algorithm. Appl. Res. Comput. 2006, 23, 59–61.
- Narupiyakul, L.; Thomas, C.; Cercone, N.; Sirinaovakul, B. Thai Syllable-Based Information Extraction Using Hidden Markov Models. In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2004, Seoul, Korea, 15–21 February 2004; pp. 537–546.
- Hackett, P.G.; Oard, D.W. Comparison of word-based and syllable-based retrieval for Tibetan (poster session). In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, 30 September–1 October 2000.
- Oflazer, K.; Kuruoz, I. Tagging and Morphological Disambiguation of Turkish Text. In Proceedings of the Conference on Applied Natural Language Processing, Stuttgart, Germany, 13–15 October 1994; pp. 144–149.
- Hakkanitur, D.; Oflazer, K.; Tur, G. Statistical Morphological Disambiguation for Agglutinative Languages. Comput. Humanit. 2002, 36, 381–410.
- Xu, J.; Pan, J.; Yan, Y. Agglutinative Language Speech Recognition Using Automatic Allophone Deriving. Chin. J. Electron. 2016, 25, 328–333.
- Park, H.; Oh, K.; Choi, H.; Gweon, G. Constructing a paraphrase database for agglutinative languages. Data Knowl. Eng. 2019, 123, 101604.
- Saimaiti, A.; Wang, L.; Yibulayin, T. Learning Subword Embedding to Improve Uyghur Named-Entity Recognition. Inf. Int. Interdiscip. J. 2019, 10, 139.
- Mireguli, A.; Mijiti, A.; Aisikaer, A. A Morphological Analysis Based Algorithm for Uyghur Vowel Weakening Identification. J. Chin. Inf. Process. 2008, 22, 43–47.
- Dawel, A.; Hayrat, H. Study on the Rule-based Kazakh Word Lemmatization Algorithm. J. Xinjiang Univ. Nat. Sci. Ed. 2011, 28, 116–119.
- Saren, B. Research on the Causes of the Weakening and Even Disappearance of Short Vowels in Mongolian. J. Inn. Mong. Univ. Natl. Soc. Sci. 2005, 31, 29–31.
- Hayes, B.; Siptar, P.; Zuraw, K.; Londe, Z. Natural and Unnatural Constraints in Hungarian Vowel Harmony. Language 2009, 85, 822–863.
- Goldsmith, J.; Riggle, J. Information theoretic approaches to phonological structure: The case of Finnish vowel harmony. Nat. Lang. Linguist. Theory 2012, 30, 859–896.
- Genxiong, J. The Weakening and Dropping of Vowels in Mongolian Language. J. Inn. Mong. Univ. Natl. Soc. Sci. 2010, 36, 27–29.
- Qingxia, D.; Ling, W. Experimental Study on the Characteristics of the Weakened Syllables of Jingpo. J. Minzu Univ. China Philos. Soc. Sci. Ed. 2014, 5, 154–159.
- Jaworski, S. Phonetic and Phonological Vowel Reduction in Russian. Pozn. Stud. Contemp. Linguist. 2010, 46, 51–68.
- Bakovic, E. Vowel harmony and stem identity. San Diego Linguistic Papers 2003, 1, 1–42.
- Xinjiang Uygur Autonomous Region National Language Working Committee. Dictionary of Modern Uyghur Literature Language Orthography; Xinjiang People’s Publishing House: Urumqi, China, 1997.
- Wayit, A.; Jamila, W.; Turgun, I. Modern Uyghur automatic syllable segmentation method and its implementation. China Sci. 2015, 10, 957–961.
- Abliz, W.; Wu, H.; Maimaiti, M.; Wushouer, J.; Abiderexiti, K.; Yibulayin, T.; Wumaier, A. A Syllable-Based Technique for Uyghur Text Compression. Inf. Int. Interdiscip. J. 2020, 11, 172.
- Ainiwaer, A.; Jun, D.; Xiao, L.I. Rules and Algorithms for Uyghur Affix Variant Collocation. J. Chin. Inf. Process. 2018, 32, 27–33.
- Tuergen, I.; Kahaerjiang, A.; Aishan, W.; Maihemuti, M. A Survey of Central Asian Language Processing. J. Chin. Inf. Process. 2018, 32, 1–13+21.
No | Meaning | Word | Syllable | Meaning | Word (V.W) | Syllable | V.W. e.g. |
---|---|---|---|---|---|---|---|
1 | dog | it | vc | Man | är | vc | eri[ä→e] |
2 | bird | quš | cvc | Tea | čay | cvc | čeyi[a→e] |
3 | year | yil | cvc | Land | yär | cvc | yeri[ä→e] |
4 | sue | ärz | vcc | Ideal | ğayä | cv+cv | ğayini[ä→i] |
5 | chicken | toxu | cv+cv | Ox | kala | cv+cv | kalisi[a→i] |
6 | people | xälq | cvcc | Human | adäm | v+cvc | adimi[ä→i] |
7 | woman | ayal | v+cvc | Apple | alma | vc+cv | almilar[a→i] |
8 | river | därya | cvc+cv | Paper | qäğäz | cv+cvc | qäğizi[ä→i] |
9 | Roosevelt | Rozwelt | cvc+cvcc | Ethnic | millät | cvc+cvc | milliti[ä→i] |
10 | product | mähsulat | cvc+cv+cvc | America | Amerika | v+cv+cv+cv | amerikida[a→i] |
11 | airplane | ayrupilan | vc+cv+cv+cvc | Song | naxša | cvc+cv | naxšiniŋ [a→i] |
No. | P | Pm | Pr | Fr | F0 | P | Pm | Pv | Pvm | Pr | Fr | F0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | it | 380 | 27 | 302 | 6077 | är | 178 | eri | 264 | 63 | 34056 | 39971 |
2 | quš | 236 | 81 | 1811 | 2816 | čay | 85 | čeyi | 9 | 34 | 421 | 784 |
3 | yil | 672 | 130 | 21843 | 25951 | yär | 190 | yeri | 106 | 209 | 14876 | 16788 |
4 | ärz | 39 | 13 | 144 | 468 | ğayä | 4 | ğayi | 25 | 25 | 198 | 923 |
5 | toxu | 46 | 35 | 503 | 527 | kala | 22 | kali | 71 | 37 | 1934 | 2562 |
6 | xälq | 82 | 82 | 11294 | 11294 | adäm | 61 | adimi | 24 | 85 | 6091 | 6091 |
7 | ayal | 39 | 38 | 17303 | 17305 | alma | 118 | almi | 55 | 21 | 87 | 1796 |
8 | därya | 42 | 42 | 3776 | 3776 | qäğäz | 13 | qäğizi | 5 | 18 | 340 | 340 |
9 | rozwelt | 19 | 19 | 1618 | 1618 | millät | 22 | milliti | 10 | 29 | 1191 | 1191 |
10 | mähsulat | 60 | 60 | 2698 | 2698 | amerika | 2 | ameriki | 34 | 36 | 5147 | 5147 |
11 | ayrupilan | 29 | 29 | 975 | 975 | naxša | 1 | naxši | 26 | 27 | 335 | 335 |
Word (stem) | T (M/Mis) | Tz (M/Mis) | e.g., | Word (V.W) | T (M/Mis) | Tz (M/Mis) | e.g., |
---|---|---|---|---|---|---|---|
it | 312/0 | 172/140 | i+ti | är | 172/140 | 172/140 | e+ri |
quš | 312/0 | 172/140 | qu+šum | čay | 172/140 | 172/140 | če+yi |
yil | 312/0 | 172/140 | yi+li | yär | 172/140 | 172/140 | ye+ri |
ärz | 312/0 | 172/140 | är+zi | ğayä | 62/250 | 1/311 | ğayi+si |
toxu | 312/0 | 312/0 | | kala | 62/250 | 1/311 | kali+lar |
xälq | 312/0 | 172/140 | xäl+qi | adäm | 172/140 | 172/140 | adi+mi |
ayal | 312/0 | 172/140 | aya+li | alma | 62/250 | 1/311 | almi+da |
därya | 312/0 | 312/0 | | qäğäz | 172/140 | 172/140 | qäği+zi |
rozwelt | 312/0 | 172/140 | rozwelti+niŋ | millät | 172/140 | 172/140 | milli+ti |
mähsulat | 312/0 | 172/140 | mähsula+ti | amerika | 62/250 | 1/311 | ameriki+čä |
ayrupilan | 312/0 | 172/140 | ayrupila+ni | naxša | 62/250 | 1/311 | naxši+si |
Structure (P + Suffix) | Results (P + Suffix) | Matching Expression (P) | Matching Expression (Pv) |
---|---|---|---|
+ Null | alma; quš | alma*; quš* | al + ma + sx; quš + sx |
+ Plural | alma + lar = almilar; quš + lar = qušlar | almi*; quš* | al + mi + sx; quš + sx |
+ Personal | almam, alma + miz = almimiz, almaŋ, alma + ŋiz = almiŋiz, alma + si = almisi, alma + liri = almiliri; qušum, qušumiz, qušuŋ, qušuŋiz, quši, qušliri | almam*, almaŋ*, almi*; quš* | al + mam + sx, al + maŋ + sx, al + mi + sx; quš + sx, qu + šum + sx, qu + šuŋ + sx, qu + ši + sx |
+ Case | alminiŋ, almiğa, almini, almida, almidin, almidäk, almidiki, almiğičä, almičä, almičilik; qušniŋ, qušqa, qušni, qušta, quštin, quštäk, quštiki, qušqičä, quščä, quščilik | almi*; quš* | al + mi + sx; quš + sx |
No | P | Pm | Pvm | Pr | Alg | T_P | Tz_P | T_R | Tz_R | T_F | Tz_F | T_A | Tz_A |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | it | 380 | 0 | 27 | BM | 0.07 | 0.10 | 1.00 | 0.59 | 0.13 | 0.17 | 0.07 | 0.59 |
1 | it | 380 | 0 | 27 | BM-U | 0.07 | 0.09 | 1.00 | 1.00 | 0.13 | 0.16 | 0.07 | 0.24 |
2 | quš | 236 | 0 | 81 | BM | 0.34 | 0.46 | 1.00 | 0.81 | 0.51 | 0.59 | 0.34 | 0.61 |
2 | quš | 236 | 0 | 81 | BM-U | 0.34 | 0.44 | 1.00 | 1.00 | 0.51 | 0.61 | 0.34 | 0.56 |
3 | yil | 672 | 0 | 130 | BM | 0.19 | 0.43 | 1.00 | 0.75 | 0.32 | 0.54 | 0.19 | 0.76 |
3 | yil | 672 | 0 | 130 | BM-U | 0.19 | 0.34 | 1.00 | 0.96 | 0.32 | 0.50 | 0.19 | 0.62 |
4 | ärz | 39 | 0 | 13 | BM | 0.33 | 1.00 | 1.00 | 0.46 | 0.50 | 0.63 | 0.33 | 0.82 |
4 | ärz | 39 | 0 | 13 | BM-U | 0.33 | 0.46 | 1.00 | 0.85 | 0.50 | 0.59 | 0.33 | 0.62 |
5 | toxu | 46 | 0 | 35 | BM | 0.76 | 0.92 | 1.00 | 0.97 | 0.86 | 0.94 | 0.76 | 0.91 |
5 | toxu | 46 | 0 | 35 | BM-U | 0.76 | 0.92 | 1.00 | 0.97 | 0.86 | 0.94 | 0.76 | 0.91 |
6 | xälq | 82 | 0 | 82 | BM | 1.00 | 1.00 | 1.00 | 0.60 | 1.00 | 0.75 | 1.00 | 0.60 |
6 | xälq | 82 | 0 | 82 | BM-U | 1.00 | 1.00 | 1.00 | 0.91 | 1.00 | 0.96 | 1.00 | 0.91 |
7 | ayal | 39 | 0 | 38 | BM | 0.97 | 1.00 | 1.00 | 0.74 | 0.99 | 0.85 | 0.97 | 0.74 |
7 | ayal | 39 | 0 | 38 | BM-U | 0.97 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.97 | 1.00 |
8 | därya | 42 | 0 | 42 | BM | 1.00 | 1.00 | 1.00 | 0.93 | 1.00 | 0.96 | 1.00 | 0.93 |
8 | därya | 42 | 0 | 42 | BM-U | 1.00 | 1.00 | 1.00 | 0.93 | 1.00 | 0.96 | 1.00 | 0.93 |
9 | rozwelt | 19 | 0 | 19 | BM | 1.00 | 1.00 | 1.00 | 0.95 | 1.00 | 0.97 | 1.00 | 0.95 |
9 | rozwelt | 19 | 0 | 19 | BM-U | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
10 | mähsulat | 60 | 0 | 60 | BM | 1.00 | 1.00 | 1.00 | 0.63 | 1.00 | 0.78 | 1.00 | 0.63 |
10 | mähsulat | 60 | 0 | 60 | BM-U | 1.00 | 1.00 | 1.00 | 0.97 | 1.00 | 0.98 | 1.00 | 0.97 |
11 | ayrupilan | 29 | 0 | 29 | BM | 1.00 | 1.00 | 1.00 | 0.69 | 1.00 | 0.82 | 1.00 | 0.75 |
11 | ayrupilan | 29 | 0 | 29 | BM-U | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
No | P | Pm | Pvm | Pr | Alg | T_P | Tz_P | T_R | Tz_R | T_F | Tz_F | T_A | Tz_A |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | är | 178 | 264 | 63 | BM | 0.31 | 0.37 | 0.89 | 0.87 | 0.46 | 0.52 | 0.71 | 0.77 |
1 | är | 178 | 264 | 63 | BM-U | 0.14 | 0.23 | 1.00 | 0.98 | 0.25 | 0.37 | 0.14 | 0.52 |
2 | čay | 85 | 9 | 34 | BM | 0.32 | 0.38 | 0.79 | 0.79 | 0.45 | 0.51 | 0.31 | 0.46 |
2 | čay | 85 | 9 | 34 | BM-U | 0.36 | 0.43 | 1.00 | 1.00 | 0.53 | 0.60 | 0.36 | 0.52 |
3 | yär | 190 | 106 | 209 | BM | 0.91 | 0.93 | 0.82 | 0.79 | 0.86 | 0.85 | 0.81 | 0.81 |
3 | yär | 190 | 106 | 209 | BM-U | 0.71 | 0.80 | 1.00 | 0.97 | 0.83 | 0.87 | 0.71 | 0.80 |
4 | ğayä | 4 | 25 | 25 | BM | 0.50 | 1.00 | 0.08 | 0.04 | 0.14 | 0.08 | 0.14 | 0.17 |
4 | ğayä | 4 | 25 | 25 | BM-U | 0.86 | 1.00 | 1.00 | 1.00 | 0.93 | 1.00 | 0.86 | 1.00 |
5 | kala | 22 | 71 | 37 | BM | 0.14 | 0.30 | 0.08 | 0.08 | 0.10 | 0.13 | 0.43 | 0.56 |
5 | kala | 22 | 71 | 37 | BM-U | 0.40 | 0.51 | 1.00 | 0.95 | 0.57 | 0.66 | 0.40 | 0.61 |
6 | adäm | 61 | 24 | 85 | BM | 1.00 | 1.00 | 0.72 | 0.66 | 0.84 | 0.79 | 0.72 | 0.67 |
6 | adäm | 61 | 24 | 85 | BM-U | 1.00 | 1.00 | 1.00 | 0.88 | 1.00 | 0.94 | 1.00 | 0.88 |
7 | alma | 118 | 55 | 21 | BM | 0.02 | 0.20 | 0.10 | 0.05 | 0.03 | 0.08 | 0.22 | 0.86 |
7 | alma | 118 | 55 | 21 | BM-U | 0.12 | 0.34 | 1.00 | 1.00 | 0.22 | 0.51 | 0.12 | 0.77 |
8 | qäğäz | 13 | 5 | 18 | BM | 1.00 | 1.00 | 0.72 | 0.72 | 0.84 | 0.84 | 0.72 | 0.72 |
8 | qäğäz | 13 | 5 | 18 | BM-U | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
9 | millät | 22 | 10 | 29 | BM | 1.00 | 1.00 | 0.69 | 0.63 | 0.81 | 0.77 | 0.69 | 0.63 |
9 | millät | 22 | 10 | 29 | BM-U | 1.00 | 1.00 | 1.00 | 0.94 | 1.00 | 0.97 | 1.00 | 0.94 |
10 | Amerika | 2 | 34 | 36 | BM | 1.00 | 1.00 | 0.06 | 0.06 | 0.11 | 0.11 | 0.06 | 0.15 |
10 | Amerika | 2 | 34 | 36 | BM-U | 1.00 | 1.00 | 1.00 | 0.97 | 1.00 | 0.99 | 1.00 | 0.97 |
11 | naxša | 1 | 26 | 27 | BM | 1.00 | 1.00 | 0.04 | 0.04 | 0.07 | 0.07 | 0.04 | 0.04 |
11 | naxša | 1 | 26 | 27 | BM-U | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
No. | Formula | Not VW: ∆ | Not VW: Avg (∆) | VW: ∆ | VW: Avg (∆) |
---|---|---|---|---|---|
1 | | 0 | 0 | 0.39 | 0.04 |
2 | | 0 | 0 | 6.01 | 0.55 |
3 | | 0 | 0 | 3.62 | 0.33 |
4 | | 0 | 0 | 2.74 | 0.25 |
No. | Formula | Not VW: ∆ | Not VW: Avg (∆) | VW: ∆ | VW: Avg (∆) |
---|---|---|---|---|---|
5 | | 0.59 | 0.05 | 1.11 | 0.10 |
6 | | −0.41 | −0.04 | 5.7 | 0.52 |
7 | | 0.39 | 0.04 | 4.2 | 0.38 |
8 | | 1.1 | 0.10 | 4.16 | 0.38 |
Sample sentences (Uyghur): Bu almiliq bağdiki almilar bäk oxšaptu. Almaŋ bäk tatliqkän. Almida witamin köp. Alma Amerika alma šerkitiniŋ bälgisi. Akam Amerikidin maŋa Amerikida yasalğan alma telfuni äwätiptu. Amerikiliq dostum Amerikidiki yuqum sani 500 miŋdin ašti, Amerikini xuda saqlisun, Amerikiniŋ tibbi texnikisi ilğar, Amerikiliqlar bärdašliq beräläydu didi.

Sample sentences (English): The apples in this apple orchard grow very well. Your apple is delicious. Apples are rich in vitamins. The apple is the symbol of Apple Inc. of America. My brother mailed me an American-made iPhone from America. My American friend said that the number of infected people in America has exceeded 500,000, God bless America, America has advanced medical technology, and the American people can overcome the epidemic.

Keywords (apple, America) and match results:

Word | Meaning | BM | BM-U |
---|---|---|---|
almiliq bağ | apple orchard | N | Y |
almilar | apples | N | Y |
almaŋ | your apple | Y | Y |
almida | in (on) the apples | N | Y |
alma | apple | Y | Y |
amerika | America | Y | Y |
amerikidin | from America | N | Y |
amerikida yasalğan | made in America | N | Y |
amerikiliq | American | N | Y |
amerikidiki | in America | N | Y |
amerikini | America | N | Y |
amerikiniŋ | America's | N | Y |
amerikiliqlar | American people | N | Y |
String Type | P | T (M. num) | Tz (M. num) | Mis. e.g., |
---|---|---|---|---|
Monosyllable | ma | 7053 | 1902 | Or + man, mal |
Monosyllable | to | 2718 | 1410 | Toğ + ra, top |
Non-syllable | m, u, tt, mm, uu, ää | 39455, 37998, 2802, 686, 19, 9 | 5, 14, 0, 0, 0, 0 | |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Abliz, W.; Maimaiti, M.; Wu, H.; Wushouer, J.; Abiderexiti, K.; Yibulayin, T.; Wumaier, A. Research on Uyghur Pattern Matching Based on Syllable Features. Information 2020, 11, 248. https://doi.org/10.3390/info11050248