Research on Uyghur Pattern Matching Based on Syllable Features

Abstract: Pattern matching is widely used in fields such as information retrieval, natural language processing (NLP), data mining and network security. Research on pattern matching is also ongoing for Uyghur, a typical agglutinative, low-resource language with complex morphology, spoken by the Uyghur ethnic group in Xinjiang, China. Because of the characteristics of the language, pattern matching that uses characters or words as the basic unit performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes the Boyer–Moore-U (BM-U) algorithm, an improvement of the Boyer–Moore (BM) algorithm, together with a retrievable syllable coding format, both based on the syllable features of Uyghur. The algorithm uses syllable features to perform pattern matching, which effectively solves the vowel weakening problem and better matches words whose stem shape changes. In pattern matching experiments on vowel-weakened words, the precision, recall, F1-measure and accuracy of BM-U improve over BM by 4%, 55%, 33% and 25% on character-encoded text and by 10%, 52%, 38% and 38% on syllable-encoded text.


Introduction
Pattern matching is defined over a string T of length n (hereinafter, the text) and a string P of length m (m ≤ n; hereinafter, the pattern). The task is to find the starting position of the first occurrence, or of all occurrences, of pattern P in text T. If an occurrence is found, the match succeeds; otherwise, the match fails. Pattern matching is one of the basic research topics of computer science [1]. As an important text processing technology, it has been applied in many related areas, such as data processing, data compression, text editing, machine translation, search engines, virus and network intrusion detection, content filtering and genetic detection [2][3][4][5][6][7][8][9]. The quality of pattern matching directly affects the quality of this related research and the complexity of the algorithms involved.
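The definition above can be illustrated with a minimal brute-force matcher. This is only a sketch of the task itself (the function name find_all is an assumption introduced here), not one of the algorithms studied in this paper.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every starting position of pattern P in text T (brute force)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []  # the definition requires 0 < m <= n
    positions = []
    for start in range(n - m + 1):
        # Compare the window T[start : start + m] with P.
        if text[start:start + m] == pattern:
            positions.append(start)
    return positions


# An empty result means the match fails; otherwise it succeeds.
print(find_all("qoyuŋlarniŋ qoy", "qoy"))  # [0, 12]
```

The first element of the result is the first occurrence; the whole list gives all occurrences.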
According to China's sixth census in 2010, the Uyghur population is 10 million, and Uyghur is a low-resource language. From a technical point of view, since Windows Vista, iOS 8.0 and Android 4.0 began to fully support Uyghur at the system level, Uyghur online text resources and the volume of information to be processed have expanded rapidly, which has accelerated progress in Uyghur natural language processing (NLP). In April 2019, Tencent launched a machine translation tool on the WeChat platform that includes Uyghur-Chinese translation. In February 2020, Google Translate also added Uyghur support. At present, the agglutinative nature and morphological complexity of Uyghur are among the main difficulties in pattern matching research for the language.
In languages such as English, Chinese, and Uyghur, characters and words are constituent units of different granularities and are often used as the basic unit in pattern matching research. The large-scale growth of textual content in Uyghur has placed higher requirements on pattern matching for the language. Uyghur has rich morphological changes, and new word forms can be produced by concatenating different suffixes. Research on word pattern matching in Uyghur therefore faces two problems: (1) when characters are used as the basic unit, each word is composed of multiple characters, and matching efficiency is low; and (2) when words are used as the basic unit, the morphological complexity of words leads to low matching efficiency.
Analysis of Uyghur word structure and morphological changes shows that almost all words are composed of a certain number of syllables. This paper therefore designs a data format that uses the syllable as the basic data unit and studies the single-pattern matching task in Uyghur. The Boyer-Moore algorithm is improved by incorporating syllable feature information about the morphological changes of words, and pattern matching is performed on both ordinary text and the syllable-encoded text proposed in this paper. Experimental results show that the method performs well.
The main contributions of this paper are as follows.
(1) Our research on the structural features of Uyghur words and syllables, and the proposed searchable compression format based on syllables, will help improve the performance of existing pattern matching algorithms.
(2) We conducted in-depth research on the morphological changes of words caused by weakened vowels. Through a limited expansion of the pattern matching sequences, the mismatch caused by morphological changes is resolved, and semantically similar matching as well as the recall, precision, accuracy, and F1 values improve significantly.
(3) The research on pattern matching in this paper is also applicable to other syllabic agglutinative languages and can serve as a useful reference for pattern matching research on other languages of the same type.

Related Research
The related algorithms of pattern matching matured in the past few decades, and several classic algorithms have appeared [10,11]. Subsequent improvements have been made to these algorithms [12,13]. At present, the research on pattern matching pays more attention to application innovation and improvement in specific tasks, such as NLP, information retrieval, text filtering and network security. In Uyghur, the study of pattern matching started late.
Syllables, as one of the main features of Uyghur, have been widely studied in recent years. Research based on syllable feature information covers tasks such as speech recognition, speech synthesis, lexical analysis, named entity recognition, and spell checking [14][15][16][17][18]. A multi-pattern matching algorithm for Uyghur was the first to use syllable information [19]. This method uses the number and structure of syllables as one of the matching conditions to improve matching efficiency. However, it can only match words with consistent morphological features and cannot match words with weakened vowels or changed syllable structures. It also requires syllable segmentation and analysis of the syllable structure during the matching process. The Uyghur text filtering task has also been studied [20]; the authors used extended stems and an additional suffix library to improve pattern matching performance and deal with vowel weakening.
There are also many studies on pattern matching in compressed formats. The pattern matching algorithms differ according to the compression unit and compression algorithm. Usually, the short text [21], the suffix [22,23], the word [24], or the character string [25,26] is used as the pattern matching unit of the compressed content. Some studies have used the BM algorithm as the pattern matching algorithm in a compressed format [27,28]. Narupiyakul [29] and Paul G [30] treat syllables as the retrieval unit.

Morphological Changes of Words
Uyghur is a typical agglutinative language with strong derivational ability and rich morphological variation. Complex word morphology is the main feature of agglutinative languages [31][32][33][34][35]. The morphological structure of a Uyghur word is word = stem + [suffix]. There are two types of suffixes: inflectional and derivational. Adding a derivational suffix to a root or stem generates a new word, similar to work + man = workman. Adding an inflectional suffix only changes grammatical attributes of the original word, such as person, number, and case, similar to book + s = books. This paper discusses inflectional suffixes. Uyghur noun stems can take different suffixes and support the continuous concatenation of multiple suffixes. For example, the noun "قويۇڭلارنىڭ" (qoyuŋlarniŋ) is generated by adding three layers of suffixes to the stem qoy (sheep): (1) qoy + uŋ (your sheep); (2) qoyuŋ + lar (your sheep, sheep plural); (3) qoyuŋlar + niŋ (your sheep's, sheep plural).

Vowel Weakening
Phonetic harmony is very common in modern Uyghur, and one of its main manifestations is vowel weakening. Vowel weakening refers to the change of a vowel into another vowel when an additional element is attached to a stem containing specific vowels, such as Är (man) + i (third person) = Eri (his man, Ä→E); karwat (bed) + im (first person) = karwitim (my bed, a→i); Taš (stone) + iŋ (second person) = tešiŋ (your stone, a→e).
Mireguli et al. [36] proposed an algorithm to identify Uyghur vowel weakening based on word and syllable structure. Other languages have similar phenomena [37][38][39][40][41][42][43][44]. Uyghur vowel weakening occurs frequently in written form. There are special exceptions, such as Taj (crown) + i (third person) = Taji (crown, a→a). The weakening rules are complex, and rules alone cannot describe all cases completely. Of the 27,266 stem words collected from the orthographic dictionary, 13,843 (50.7%) are structurally subject to vowel weakening [45]. Although these include a certain number of irregular words, it is clear that vowel weakening is a very common phenomenon in Uyghur.

Syllable-Encoded Text
There are no special signs between Uyghur syllables. A syllable is pronounced the same alone as within a word [14]. Current Uyghur words use 12 syllable structures, where C denotes a consonant and V a vowel. Six are the basic structures V, VC, CV, CVC, VCC, and CVCC. The other six, CCV, CCVC, CCVCC, CVV, CVVC, and CCCV, are used to record foreign words; the CVV and CVVC structures with two Vs are used for words from Chinese or other languages that contain two adjacent vowels. This paper uses the syllable segmentation method described by Wayit et al. [46]. Wayit et al. [47] found that the 2000 most frequent Uyghur syllables cover 99% of words, and proposed the syllable coding scheme B16, in which each syllable is encoded with the same length as a Unicode character, within the Unicode Private Use Area (U+E000 to U+F8FF). This paper uses the coding scheme of Wayit et al. [47] to design a text format based on syllable encoding. The format changes the basic unit of string pattern matching from characters to syllables, which compresses strings while enabling syllable-based pattern matching.
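The idea of mapping each syllable to a single Private Use Area code point can be sketched as follows. This is a minimal illustration only: the actual code point assignment of the B16 scheme [47] is not reproduced here, so the order-of-appearance assignment and the names build_codebook, encode_word and decode_word are assumptions.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # Unicode Private Use Area


def build_codebook(syllables):
    """Assign one PUA code point per distinct syllable (hypothetical order)."""
    unique = list(dict.fromkeys(syllables))  # de-duplicate, keep order
    if len(unique) > PUA_END - PUA_START + 1:
        raise ValueError("more syllables than available code points")
    return {sb: chr(PUA_START + i) for i, sb in enumerate(unique)}


def encode_word(segmented, codebook):
    """Turn a syllable-segmented word into a string of syllable codes."""
    return "".join(codebook[sb] for sb in segmented)


def decode_word(encoded, codebook):
    """Recover the syllable list from an encoded word."""
    reverse = {sc: sb for sb, sc in codebook.items()}
    return [reverse[sc] for sc in encoded]


codebook = build_codebook(["quš", "qu", "šum", "al", "ma"])
wz = encode_word(["qu", "šum"], codebook)
print(len("qušum"), len(wz))  # 5 characters become 2 syllable codes
```

Syllables missing from the codebook raise a KeyError in this sketch; how rare syllables outside the high-frequency set are handled is left open here.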

Basic Concepts
The search-related symbols and supplementary definitions for string matching used in this paper are as follows.
1. uChar is a Uyghur Unicode character; its encoding range is U+0600 to U+06FF.
2. Sb is a syllable, composed of several uChar. When Sb is composed of three uChar, its structure is Sb [uChar1 uChar2 uChar3].
3. Sc is the syllable encoding of Sb in the B16 encoding scheme [47]. Each Sc has the same encoding length as a Unicode character, and its encoding range is the Unicode Private Use Area (U+E000 to U+F8FF).
4. W is a Uyghur word composed of several Sb, W (Sb1Sb2…Sbn); its length is equal to the number of uChar in the word.
5. Wz is the result of syllable segmentation and encoding of word W. When W has three syllables, its structure is Wz (Sc1Sc2Sc3), and the length of Wz is equal to the number of syllables of W.
6. P is a pattern and a noun stem; its structure is similar to that of W, and W = P + inflectional suffix.
7. Pz is the syllable code of pattern P; its structure is similar to that of Wz.
8. T is a text containing n words W; its structure is T (W0, W1, …, W(n-1)).
9. Tz is a syllable-encoded compressed text, generated from T by syllable encoding. Its structure is Tz (Wz0, Wz1, …, Wz(n-1)).
10. Structure matching: string S1 is completely contained in string S2 with its character sequence unchanged, and the length of S1 ≤ the length of S2.
11. Semantic matching: when W = P + inflectional suffixes, the semantics of pattern P are included in word W, and P and W match semantically. Vowel weakening sometimes changes the structure of W so that it no longer matches the structure of P. The length of pattern P is less than or equal to the length of word W.
12. Matching result: when P matches a word W in T semantically or structurally, the complete W is returned.
For example, when P = man, T = {"other", "manchu", "mankind", "man", "men"}, the result of P structure matching is {"manchu", "mankind", "man"}, and the result of semantic matching is {"man", "mankind", "men"}.
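Structure matching as defined above is plain substring containment, which the following sketch reproduces for the English example. The function name structure_matches is an assumption; semantic matching is not computed here because it relies on the relation W = P + inflectional suffix (or on manual labeling) rather than on string containment alone.

```python
def structure_matches(pattern, words):
    """Return the complete words that contain the pattern unchanged."""
    return [w for w in words if pattern in w]


T = ["other", "manchu", "mankind", "man", "men"]
print(structure_matches("man", T))  # ['manchu', 'mankind', 'man']
```

Note that "men" is missed by structure matching even though it is semantically related, which is the English analogue of the vowel weakening problem.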

Retrieval Parameters and Calculation Formulas
The ideal search results for this article are listed below (2)
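The four retrieval measures used in the experiments can be sketched as below, assuming the standard confusion-matrix definitions (TP, FP, FN, TN). This assumption is consistent with the recall figures quoted in the analysis of results (TP = 98, FN = 32 gives 0.75; TP = 125, FN = 5 gives 0.96). The function name retrieval_metrics and the zero FP and TN values in the usage line are placeholders, not values from the paper.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Recall depends only on TP and FN, so FP and TN are set to 0 here.
print(round(retrieval_metrics(98, 0, 32, 0)["recall"], 2))   # 0.75
print(round(retrieval_metrics(125, 0, 5, 0)["recall"], 2))   # 0.96
```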

Preparation of Experimental Corpus
Because of the morphological complexity of words in agglutinative languages, the usual pattern matching experiment prepares a corpus T of a certain size and then either randomly selects patterns P of different lengths or selects a certain number of highly related words as P (e.g., the top 10 words with the highest correlation degrees). This method does not ensure that all forms of a word can be matched, because some words change their structure more than once after suffixes are added, such as naxša+lar→naxšilar+iŋ→naxšiliriŋ (ša→ši, lar→lir). Moreover, some forms of a word rarely appear, and the experimental corpus T may not include them. This paper therefore prepares three types of experimental corpora.
1. Type A corpus, generated by an algorithm. First, 22 high-frequency words are selected based on word length, syllable structure, and number of syllables (Table 1); 11 of the words are subject to weakening (Word V.W). These weakened words cover all four vowel weakening types (a→i, a→e, ä→i, ä→e). We design a morphology-based word generator algorithm based on stemming. Taking nouns as an example, this algorithm can generate all 312 forms of P in a dictionary [45] by adding 1-4 layers of suffixes. The experimental corpus T generated by this algorithm covers 22 × 312 = 6864 forms of the 22 words. If the matching algorithm can match all 312 forms of P given P, then the algorithm can in theory recognize and match all forms of pattern P in any natural language environment, and hence recall = 1. When recall = 1, corpora B and C can be used in the next experiments to test the pattern matching ability of the algorithm in a natural language environment.
2. Type B corpus, made from natural language text. We collect an actual corpus and segment it into words to generate a word data set. The data set contains the words appearing in the corpus and their frequency of occurrence F. The corpus is Unicode-encoded text with a size of 46.8 MB and covers comprehensive news, agricultural technology, agency names, novels, natural sciences, dictionaries and encyclopedias, and social media short texts. It contains 136,523 unique words and 7434 unique syllables.
3. Type C corpus. Whereas type A and type B corpora are lists of experimental words obtained through algorithmic derivation and database fuzzy queries, type C is a paragraph of natural language text composed of several typical sentences.
Table 2 shows the statistical information of pattern P in the type B corpus.
In the table, P indicates a test stem, Pm is the number of words that fuzzy match P in structure, Pv is the weakened form of pattern P, Pvm is the number of words that fuzzy match Pv in structure, F0 is the occurrence frequency of all Pm and Pvm words in the corpus, and Pr is the number of words semantically related to pattern P. For example, when P = är (man), ärkäk (male) belongs to Pr, whereas ärkin (freedom) does not. Pr is labeled manually and is a subset of Pm and Pvm. Fr is the frequency of all Pr words in the corpus and is a part of F0. Pm and Pvm are obtained by fuzzy queries with SQL statements of the form: select words, frequency from table where words like '%P%'.
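The fuzzy query used to collect Pm and Pvm can be sketched with an in-memory SQLite database. The table name word_freq, the column names and the rows (including the frequencies) are invented for illustration and are not taken from the type B corpus.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_freq (word TEXT PRIMARY KEY, frequency INTEGER)")
# Toy rows with made-up frequencies, only to exercise the query.
conn.executemany(
    "INSERT INTO word_freq VALUES (?, ?)",
    [("alma", 120), ("almisi", 7), ("almas", 15), ("quš", 40)],
)


def fuzzy_query(conn, fragment):
    """Words (with frequency) that contain fragment anywhere: LIKE '%P%'."""
    rows = conn.execute(
        "SELECT word, frequency FROM word_freq "
        "WHERE word LIKE ? ORDER BY word",
        (f"%{fragment}%",),
    )
    return rows.fetchall()


pm = fuzzy_query(conn, "alma")    # structural fuzzy matches of P
pvm = fuzzy_query(conn, "almi")   # structural fuzzy matches of Pv
print(pm)   # [('alma', 120), ('almas', 15)]
print(pvm)  # [('almisi', 7)]
```

The query for P alone misses almisi, and it returns almas, which would still have to be excluded from Pr by manual labeling.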

Matching of Existing Algorithms
The Boyer-Moore (BM) algorithm is used to perform pattern matching on the type A experimental corpus, where Tz is the syllable-encoded text of corpus T. Table 3 shows the matching results. In the table, M indicates a successful match, Mis indicates a failed match, and "e.g." marks an example of a failed match. There are three cases of matching status.
1. P and T, as well as Pz and Tz, match exactly, for example: toxu and därya.
2. P and T match exactly, and Pz and Tz partially match, for example: quš.
3. P and T, as well as Pz and Tz, have matching failures, for example: naxša.

Analysis
According to experiments, to improve the degree of structural matching between P and T and between Pz and Tz, we must first solve the matching failure caused by changes in syllable structure. Below we use '*' for any string, '#' for any string that forms a syllable structure, and sx for any syllable.

Changes in syllable structure caused by weakened vowels
The word naxša in Table 3 is taken as an example. When the third-person suffix si is added, vowel weakening changes the morphological structure: W = naxša + si = naxši + si (a→i). When P = naxša, the match with W = naxšisi (his song) fails. To retrieve such forms, an algorithm is needed that determines from the morphological structure of P whether vowel weakening may occur and, if so, computes the weakened form Pv of P and uses Pv to find the forms that P fails to match.

Changes in the syllable structure caused by the addition of suffixes
Taking Pz = quš in Table 3 as an example, when the first-person suffix um is added, the syllable structure changes as follows: Wz = quš + um → qu + šum (cvc + vc → cv + cvc) (bird → my bird). The syllable structure of Wz cannot match Pz. If the structural variants of quš are written as quš* and qu+š#+sx, then the matching problem can be solved if, during matching, the algorithm can recognize that the second syllable satisfies š#. In terms of syllable structure, š# belongs to C#. According to the Uyghur syllable types, five structures may appear: CV, CVC, CVCC, CVV, and CVVC. With the first C fixed as the character š, and given that Uyghur has 24 consonants and 8 vowels, the theoretical number of candidate syllables for the second syllable š# is 8 (CV) + 8 × 24 (CVC) + 8 × 24 × 24 (CVCC) + 8 × 8 (CVV) + 8 × 8 × 24 (CVVC) = 6408.
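The size of the candidate set for š# can be checked by counting the fillings of the five C-initial structures with the first consonant fixed. The sketch below uses only the figures stated in the text (24 consonants, 8 vowels); the function name count_fillings is an assumption.

```python
N_CONSONANTS, N_VOWELS = 24, 8
C_INITIAL_STRUCTURES = ["CV", "CVC", "CVCC", "CVV", "CVVC"]


def count_fillings(structure):
    """Number of syllables of this structure once the first C is fixed."""
    count = 1
    for slot in structure[1:]:  # skip the fixed initial consonant
        count *= N_VOWELS if slot == "V" else N_CONSONANTS
    return count


counts = {s: count_fillings(s) for s in C_INITIAL_STRUCTURES}
print(counts)                 # per-structure counts
print(sum(counts.values()))   # 6408
```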

Solutions
The change of the syllable structure occurs between the last syllable of P and the first inflectional suffix. According to the rules [45] for attaching suffixes to nouns, adding a first layer of suffixes to P generates 18 kinds of word forms. For comparison and convenience, we selected alma (apple) and quš (bird) and added a first layer of suffixes to observe the change of the morphological structure. Table 4 shows the resulting forms. Here, we need to determine the value range of š#. According to the calculation above, š# has 6408 possibilities. Observing š#, the following structures occur: quš+sx (stem, no person), qu+šum+sx (first person), qu+šuŋ+sx (second person), and qu+ši+sx (third person). All personal forms of quš can thus be represented with three structures. This means that the value range of š# can be reduced from 6408 to only 3 (šum, šuŋ, ši), and the remaining 6405 can be ignored. A further consequence is that if the current value of š# is not one of these three, the word Wz in Tz can be judged not to meet the semantic matching condition Wz = Pz + inflectional suffix. For example, when Tz = {Wz1 = so+qu+šuš+niŋ (the war's…), Wz2 = tö+gi+qu+ši+niŋ (the ostrich's…)}, the third syllable of Wz1, šuš, is not in (šum, šuŋ, ši), so this method automatically excludes Wz1 and matches Wz2. When Pz is used to search Tz, some words that are not semantically related to Pz can therefore be excluded from the search results without performing a semantic analysis; this further improves the precision and speeds up retrieval. The method is also effective for words with weakening, such as alma.
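The exclusion step described above can be sketched as a syllable-level check. The function below is an illustration under assumptions (its name, its argument layout and the use of plain transliterated syllables instead of syllable codes are all introduced here); it accepts a word if the stem syllables appear intact, or if the re-syllabified stem prefix is followed by one of the allowed syllables.

```python
def semantic_candidate(word_syllables, stem_syllables,
                       split_prefix, allowed_next):
    """True if the word may satisfy Wz = Pz + inflectional suffix."""
    n, k = len(word_syllables), len(stem_syllables)
    # Case 1: the stem syllables appear unchanged (e.g. quš+lar).
    for i in range(n - k + 1):
        if word_syllables[i:i + k] == stem_syllables:
            return True
    # Case 2: the final consonant of the stem moved into the next
    # syllable (quš + um -> qu + šum); only listed syllables are accepted.
    p = len(split_prefix)
    for i in range(n - p):
        if (word_syllables[i:i + p] == split_prefix
                and word_syllables[i + p] in allowed_next):
            return True
    return False


allowed = {"šum", "šuŋ", "ši"}
wz1 = ["so", "qu", "šuš", "niŋ"]        # the war's...
wz2 = ["tö", "gi", "qu", "ši", "niŋ"]   # the ostrich's...
print(semantic_candidate(wz1, ["quš"], ["qu"], allowed))  # False
print(semantic_candidate(wz2, ["quš"], ["qu"], allowed))  # True
```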
If the structure of alma after a suffix is added does not satisfy (al+mam, al+maŋ, al+mi), then the matching result is not related to alma; for example, almas (al+mas: diamond) is not semantically related to alma (apple). It can be seen in Table 4 that two algorithms need to be designed to make recall = 1. The first algorithm determines whether P satisfies the weakening condition and, if so, calculates the weakened form Pv of P. The second algorithm adds personal suffixes according to the structural characteristics of P. Together, the two algorithms generate a pattern list for matching, PList = {P, P1, P2, P3 / Pv}, where P1, P2, and P3 are the results of adding the first-, second-, and third-person suffixes to P. The role of PList is to assist the BM algorithm and improve matching efficiency. Because vowel weakening is complicated and cannot be completely captured by rules, some special cases do not follow the rules: for example, tağ + i → teği (a → e, subject to rules), taš + i → teši (a → e, subject to rules), and taj + i → taji (a → a, not subject to rules). The word weakening algorithm handles these special cases with a special case library.
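A weakening step with a special case library might look like the sketch below. The rule used here is a deliberately simplified assumption inferred only from the examples in this paper (a or ä in the last syllable becomes e in monosyllabic stems and i in polysyllabic stems); it is not the authors' weakening algorithm, and real Uyghur weakening has more conditions. The names weaken, VOWELS and EXCEPTIONS are hypothetical.

```python
VOWELS = set("aäeiouöü")   # the 8 vowels in Latin transliteration
EXCEPTIONS = {"taj"}       # special case library: stems that do not weaken


def weaken(stem):
    """Return the weakened form Pv of a stem, or None if it stays unchanged."""
    stem = stem.lower()
    if stem in EXCEPTIONS:
        return None
    vowel_positions = [i for i, ch in enumerate(stem) if ch in VOWELS]
    if not vowel_positions:
        return None
    last = vowel_positions[-1]
    if stem[last] not in "aä":
        return None  # only a and ä are weakened
    # Assumed rule: one vowel -> monosyllabic stem -> e; otherwise -> i.
    replacement = "e" if len(vowel_positions) == 1 else "i"
    return stem[:last] + replacement + stem[last + 1:]


for p in ["naxša", "alma", "karwat", "taš", "är", "taj", "quš"]:
    print(p, "->", weaken(p))
```

The weakened stem can then be added to PList next to P and its personal forms, so that forms such as naxšisi or tešiŋ become reachable by the matcher.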

Improvement of BM algorithm
According to the above analysis, if we use the weakening processing algorithm and the suffix addition algorithm to calculate the matching pattern list PList according to P, then we can calculate the common part Pcommon_part of the PList as the matching pattern of the BM algorithm. When the algorithm matches one Pcommon_part, it uses the remaining Premain_parts to match. If the match is successful, it starts to find the next Pcommon_part. For a single syllable P with weakening, Pcommon_part = null may appear. At this time, the algorithm will match each pattern P in the PList independently. For example, when P = At (horse), Pv = Eti (A→E), there is a common symbol Hamza in T, and there is no common syllable in Tz. The improved new algorithm BM-U is shown in Algorithm 1:
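The following sketch illustrates the two-stage idea described in this section; it is not the original Algorithm 1 listing. It makes several assumptions that should be read as such: the common part is taken to be the longest common prefix of PList, the simpler Boyer-Moore-Horspool variant stands in for full Boyer-Moore, and the function names are introduced here. The matcher works on any sequence, so the same code runs on characters (T) and on syllable units (Tz).

```python
def common_prefix(seqs):
    """Longest common prefix of the patterns (assumed 'common part')."""
    prefix = []
    for items in zip(*seqs):
        if all(x == items[0] for x in items):
            prefix.append(items[0])
        else:
            break
    return tuple(prefix)


def horspool_find_all(text, pattern):
    """All start positions of pattern in text (Boyer-Moore-Horspool)."""
    pattern = tuple(pattern)
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character shifts from the first m - 1 pattern symbols.
    shift = {sym: m - 1 - i for i, sym in enumerate(pattern[:-1])}
    hits, pos = [], 0
    while pos <= n - m:
        if tuple(text[pos:pos + m]) == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


def bm_u_search(text, p_list):
    """Find the common part first, then verify one of the remaining parts."""
    p_list = [tuple(p) for p in p_list]
    common = common_prefix(p_list)
    if not common:  # e.g. at / eti: match every pattern independently
        hits = set()
        for p in p_list:
            hits.update(horspool_find_all(text, p))
        return sorted(hits)
    remains = {p[len(common):] for p in p_list}
    hits = []
    for pos in horspool_find_all(text, common):
        tail = pos + len(common)
        if any(tuple(text[tail:tail + len(r)]) == r for r in remains):
            hits.append(pos)
    return hits


# Character level (T): the weakened form almisi is now found.
print(bm_u_search("almas almisi alma",
                  ["alma", "almam", "almaŋ", "almi"]))       # [0, 6, 13]
# Syllable level (Tz, shown with plain syllables): al+mas is excluded.
tz = ("al", "mas", " ", "al", "mi", "si", " ", "al", "ma")
print(bm_u_search(tz, [("al", "ma"), ("al", "mam"),
                       ("al", "maŋ"), ("al", "mi")]))        # [3, 7]
```

In the character-level run almas is still matched structurally, while the syllable-level run excludes it, which mirrors the behaviour described for T and Tz.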

Experiment and Analysis
A total of four experiments were performed.
1. We used the BM-U algorithm to test the ability to match the word forms of pattern P, using the type A corpus generated by the algorithm. If recall = 1, the new algorithm can recognize all word forms of pattern P, and the type B and type C corpora can then be used to test the precision, accuracy, and F1-measure of the algorithm and to observe its recall in a natural corpus environment.
2. We used the BM and BM-U algorithms to test the matching ability of patterns P and Pz on the natural language type B corpora T and Tz. These experiments show the matching performance of the two algorithms on the two encoding formats and quantify the impact of changing the algorithm and the matching unit on the matching rate.
3. To make the matching performance of the new algorithm easier to observe, the two algorithms were used in demonstrative pattern matching experiments on natural language paragraphs from the type C corpus.
4. We conducted pattern matching on the syllable-encoded format Tz for monosyllabic and non-syllabic strings and compared character-based text T with syllable-based text Tz.

BM-U Word Morphology Matching Ability
The experimental method and patterns P are the same as those in Table 3, except that the algorithm is changed to BM-U. Using patterns P and Pz, the BM-U algorithm correctly matches all 312 forms of P and Pz in the type A corpora T and Tz, so recall = 1. The new algorithm therefore satisfies the condition and can be used for the pattern matching experiments on the natural language corpora B and C.

Experimental Results
The experimental results of the BM and BM-U algorithms on the type B natural language corpora T and Tz are shown in Tables 5 and 6. In the tables, Pm and Pvm indicate the number of words in T that can be fuzzy matched with P and Pv. Pr is the manually labeled number of words in (Pm + Pvm) that are semantically related to P (semantic match). Alg is the type of algorithm. T_P / Tz_P, T_R / Tz_R, T_F / Tz_F, and T_A / Tz_A represent the precision, recall, F1-measure, and accuracy values of the algorithm in T and Tz.

Analysis of Experimental Results
The pattern matching capabilities of the two algorithms on T and Tz (Table 5) were compared, and the comparison results are shown in Table 6. In the table, VW indicates the calculation result for the vowel-weakened words (Table 6), and Not VW indicates the calculation result for the words that are not weakened (Table 5); No. is the formula number; and P, R, A, and F indicate the precision, recall, accuracy, and F1-measure values. Taking formula No. 1 as an example, PT_BM-U is the P-value when T is retrieved using the BM-U algorithm, PT_BM is the P-value when T is retrieved using the BM algorithm, and ΔP is the sum, over the words in Table 5, of the BM-U P-value minus the BM P-value. ΔP > 0 indicates that the P-value of the BM-U algorithm is higher than that of the BM algorithm, and ΔP < 0 indicates that it is lower. Δ represents the sum of the increments over all n words (here n = 11); in formula No. 1, Δ = ΔP, and Avg(Δ) represents the average of the increments. Table 7 shows a comparison of the retrieval capabilities of the two algorithms on T. The basic matching unit of T is a character, and the basic matching unit of Tz is a syllable.

Table 7. Comparison of Boyer-Moore (BM) and Boyer-Moore-U (BM-U) retrieval of T. (Columns: No., Formula.)

For words without weakening, the improved algorithm has no effect on the matching efficiency in T: the values of T_P, T_F, T_A, and T_R are unchanged. Since the content of T is collected by the fuzzy search (%P%) on P, T_R = 1. For weakened words, the retrieval efficiency improves significantly with the improved algorithm. The improvement in R, F, and A is very clear, especially the average increase of R by 55%, mainly because the new algorithm can retrieve the weakened words. For example, when P = alma and T = {alma, almisi (his apple)}, the BM match result is {alma}, and the BM-U match result is {alma, almisi}. The new algorithm increases the F and A values on T by 33% and 25%, respectively. Compared with the R, F, and A values, the increase in the P-value is small (4% on average). Table 8 presents a comparison between retrieving T with the current BM algorithm and retrieving Tz with the BM-U algorithm proposed in this paper. For weakened words, all parameters increased significantly; for non-weakened words, the P, F, and A values increased, whereas the R-value decreased, since T_R = 1 always holds for the BM algorithm. The R-value of BM-U on Tz is clearly higher than that of BM on Tz, but Tz_R < 1 for some words. The decline in R arises mainly because partially misspelled words in T can still meet the character-based matching conditions, whereas in Tz they cannot meet the syllable-based matching conditions, resulting in Tz_R < 1 for BM-U. Taking No. 3 in Table 5 as an example, when P = yil, Tz_R = 0.75 (TP = 98, FN = 32) for the BM algorithm and Tz_R = 0.96 (TP = 125, FN = 5) for the BM-U algorithm, so the improved algorithm increases the R-value by 21 percentage points. However, five words whose syllable structure is changed by misspelling were still not retrieved (FN = 5): {bir+yildn, yild+din, yill+din, yill+rdin, yi+le+si+ri}.
The correct spellings of these five words are {bir+yil+din, yil+din, yil+din, yil+lir+din, yi+li+siri}. The improvement in the retrieval of weakened words is clearly visible. Figure 1 shows a comparison of the R-values of the weakened words for the two algorithms, and Figure 2 shows a comparison of the F and A-values of the weakened words for the two algorithms. Table 9 shows the pattern matching results of the two algorithms on character-based natural language sentences, with P = {alma, amerika}. The improvement allows the BM-U algorithm to match words with weakened vowels and improves the user search experience.
retrieval services, such as business information network (uqur.cn) and Kunlun network (uyghur.xjkunlun.gov.cn) have similar search performance.

Monosyllabic and Non-syllabic Retrieval
Because vowel weakening does not affect single syllables (independent syllables, as opposed to monosyllabic words) or non-syllabic strings, the retrieval results of the BM algorithm and the BM-U algorithm do not differ for them.

Monosyllabic retrieval
Retrieving single syllables in Tz is very convenient, and the two algorithms are equally efficient. When P = Sb, Pz = Sc. Table 10 shows the search results for the single syllables P = "ma" and P = "to". The number of structural matches in T far exceeds the number of matches in Tz: a search in T is equivalent to the fuzzy match %Sb%, whereas a search in Tz is equivalent to an exact match. The search results in T include other syllables that merely contain the pattern, such as those in or+man, mal, toğ+ra, and top. Implementing exact monosyllabic retrieval in T adds technical difficulty and extra time, because after a match is found the string must be segmented into syllables before it can be determined whether the match is an independent syllable rather than part of another syllable. Conversely, implementing fuzzy matching of single syllables in Tz also adds technical difficulty and time, because each syllable in Tz must be decoded before fuzzy matching can be executed.
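The difference between the two searches can be sketched with words written in the '+' syllable notation used in this paper. The function names are assumptions, and plain syllable strings stand in for the syllable codes of Tz.

```python
def char_level_hits(words, sb):
    """Search in T: the syllable string may occur anywhere (like %Sb%)."""
    return [w for w in words if sb in w.replace("+", "")]


def syllable_level_hits(words, sb):
    """Search in Tz: the pattern must equal a whole syllable unit."""
    return [w for w in words if sb in w.split("+")]


words = ["or+man", "mal", "al+ma", "toğ+ra", "top", "to+xu"]
print(char_level_hits(words, "ma"))      # ['or+man', 'mal', 'al+ma']
print(syllable_level_hits(words, "ma"))  # ['al+ma']
print(char_level_hits(words, "to"))      # ['toğ+ra', 'top', 'to+xu']
print(syllable_level_hits(words, "to"))  # ['to+xu']
```

In an actual Tz file each syllable is a single code point, so the exact syllable match is an ordinary one-symbol search and needs no segmentation at query time.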

Non-syllable retrieval
Tz is syllable-encoded text. Since the basic unit of data storage is the syllable, it is impossible to retrieve non-syllabic content in Tz. For example, the search for P = "mm" in Table 10 returns one result, because the text contains the abbreviation mm, which meets the matching conditions. Searching T is straightforward: the string can be searched for directly.
1. The authors of [19] designed two functions, Bohum_Sani (number of syllables) and Bohum_Xekli (syllable type), and proposed the multi-pattern matching algorithm Bohum-Ug, which was the first to apply syllables to pattern matching research. The algorithm first splits pattern P and text T into syllables. During pattern matching, it uses the Bohum_Sani function to compare the number of syllables. If the number of syllables is the same, it uses Bohum_Xekli to compare the syllable types, and then compares the characters of syllables with the same type. This algorithm requires syllable segmentation in advance. When the text T is large, the syllable segmentation consumes additional time. The final matching result is similar to that of the BM algorithm, and weakened words cannot be matched. The BM-U algorithm does not require syllable segmentation and can match weakened words. Because the matching mechanism of the BM algorithm is unchanged, the BM-U algorithm can be ported to all variants of the BM algorithm.
2. Tohti [20] proposed WM-Uy (Wu-Manber-Uy), a multi-pattern matching algorithm. Stem extraction is performed on the pattern P before pattern matching. After the stem has been matched successfully, the word suffix is checked: if the suffix is a derivational suffix, the match fails, and if the suffix is an inflectional suffix, the match succeeds. The WM-Uy algorithm differs from the single-pattern BM-U algorithm proposed in this paper in the following ways. (1) The WM-Uy algorithm cannot use stemming to match monosyllabic weakened words; for example, P = {Eti (his horse), Eri (his man), Eqi (the white)} cannot match the corresponding unweakened words W = {At, Ar, Ak}.
(2) According to Aizimaiti [48], the suffix library of the WM-Uy algorithm should include all 378 Uyghur suffixes (104 derivational suffixes and 274 inflectional suffixes). The BM-U algorithm has no suffix library; for nouns, it compares weakened forms at most four times. (3) The matching requirements of the WM-Uy algorithm differ from those of the BM-U algorithm. Under the requirements of the WM-Uy algorithm, when P = {Alma, Amerika}, words formed from the stem plus a derivational suffix, W = {Almizar and Almiliq (apple orchard), Amerikiliq (American), Almimu (apple is also ..., is it apple?), Amerikimu, Almiči (a person who deals in apples), Almixan (a female name derived from apple)}, are not matched, whereas in the BM-U algorithm these words satisfy the weakened forms of pattern P and can be matched. (4) The WM-Uy algorithm can match Almas (diamond), Almax (exchange) and other words that match P in structure but are not semantically related to Alma. This paper proposes a syllable-based searchable compressed text format Tz; when the Tz format is combined with BM-U, these semantically unrelated words can be excluded.

Conclusions
Uyghur is a very typical phonetic language. Each word is composed of syllables, and the pronunciation of characters and syllables is the same as that of words. The Tz format proposed in this paper is a searchable compressed text format based on syllable encoding. The original document doc (char), with the character as the basic unit, is changed to the document doc (Sb), with the syllable as the basic unit. If the Tz format is used as an auxiliary storage format for a text corpus, then, given that the average length of a Uyghur syllable is 2.4 characters, matching with a brute force algorithm is theoretically 2.4 times faster. The Tz format is more convenient for accurate retrieval and processing of natural language content in units of syllables, requires less space, and matches faster. It can exclude some semantically unrelated words without semantic analysis, but it requires a syllable encoding dictionary to be installed on the client. The design ideas of the Tz format can be used in other languages that can be segmented into syllables and have complex word form features [49]. Figure 3 shows the process of retrieving a text corpus using speech; vSb in the figure is the speech syllable corresponding to the text syllable Sb.
The BM-U algorithm proposed in this paper is designed on the basis of the original BM algorithm to address the complex morphology of Uyghur words. The retrieval object is Uyghur natural language content. The new design does not change the original search mechanism of the BM algorithm; it upgrades the original matching method to an extended matching method based on P_list, where P_list can be calculated from pattern P. This improvement can therefore be ported to other versions of the BM algorithm. When designing P_list, this paper only considers the relationship between the stem and the 1-level syllables attached to the stem, which is a syllable-based unigram method. If the content of P_list is extended to the 2-level or 3-level syllables attached to the stem, the problem becomes a syllable-based bigram or trigram problem. Moving from unigram to bigram and trigram increases the time consumption and technical difficulty but helps improve precision and accuracy. In theory, this extended matching idea can also be applied to multi-pattern matching methods such as Wu-Manber.
This paper mainly studies the pattern matching of nouns. Uyghur verbs have more suffix types and more suffixes than nouns, and their combination levels, structural changes, and attachment rules are more complicated. When designing a verb generator algorithm, the number of forms based on a single verb stem may, in theory, reach thousands. The BM-U algorithm proposed in this paper requires pattern P to be a stem. Uyghur stemming is itself an important basic research topic, and the stemming of verbs is the more difficult part. This study also found that spelling errors have a certain effect on the efficiency of pattern matching. These are our future research directions.
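The idea of keeping the search mechanism unchanged while replacing the single comparison with a comparison against P_list can be sketched as follows. This is a simplified illustration under stated assumptions, not the BM-U algorithm itself: it uses the Horspool variant of BM (bad-character shifts only), it treats the text as a flat list of syllables, it assumes every variant in P_list has the same number of syllables, and the P_list contents are written by hand rather than computed from P. The names `build_shift_table` and `search_with_p_list` are hypothetical.

```python
def build_shift_table(p_list):
    """Horspool-style shifts over syllables, taking the safest (smallest)
    shift across all variants so that no occurrence is skipped."""
    m = len(p_list[0])
    shift = {}
    for variant in p_list:
        for i, syllable in enumerate(variant[:-1]):
            shift[syllable] = min(shift.get(syllable, m), m - 1 - i)
    return shift


def search_with_p_list(text_syllables, p_list):
    """Return start indices where any variant in p_list occurs."""
    m = len(p_list[0])
    if any(len(v) != m for v in p_list):
        raise ValueError("this sketch assumes variants of equal length")
    variants = {tuple(v) for v in p_list}
    shift = build_shift_table(p_list)
    hits = []
    pos = 0
    while pos + m <= len(text_syllables):
        window = tuple(text_syllables[pos:pos + m])
        if window in variants:  # extended comparison against P_list
            hits.append(pos)
        # Shift by the last syllable of the window, as in Horspool.
        pos += shift.get(text_syllables[pos + m - 1], m)
    return hits


# Hand-written variants for the stem Alma: the stem and one weakened form.
p_list = [["Al", "ma"], ["Al", "mi"]]
# Syllables of Almiliq, Almas and Alma (assumed segmentation).
text = ["Al", "mi", "liq", "Al", "mas", "Al", "ma"]
hits = search_with_p_list(text, p_list)
print(hits)
```

In this example the weakened form in Almiliq and the plain stem Alma are both found, while Almas is skipped because its second syllable is not in any variant. Extending the variants with further attached syllables would correspond to the bigram and trigram cases mentioned above, at the cost of a larger P_list.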