A Syllable-Based Technique for Uyghur Text Compression

Abstract: To improve utilization of text storage resources and efficiency of data transmission, we proposed two syllable-based Uyghur text compression coding schemes. First, according to the statistics of syllable coverage of the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the code tables. To enable the coding schemes to process Uyghur texts mixed with other language symbols, we introduced a flag code in the compression process to distinguish the Unicode encodings that were not in the code table. The experiments showed that the 12-bit coding scheme had an average compression ratio of 0.3 on Uyghur text less than 4 KB in size and that the 16-bit coding scheme had an average compression ratio of 0.5 on text less than 2 KB in size. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short text and could be effectively applied to the compression of Uyghur short text for storage and applications.


Introduction
Network data on the internet continues to increase significantly each year; in 2018, mobile internet access traffic alone reached 71.1 billion GB in China. Text messaging and instant messaging consume huge amounts of storage and communication resources. Text compression is a type of lossless compression technology that improves storage space utilization and text transmission efficiency. Text compression mainly employs statistics-based and dictionary-based methods, which have distinct advantages and disadvantages depending on the specific application. Statistics-based methods use the statistical information of characters (or other basic units of the language) to generate shorter codes; examples include run-length coding, Shannon-Fano coding, and Huffman coding [1][2][3]. Run-length coding outperforms the other two when several consecutively repeated elements occur. Shannon-Fano coding builds its code tree top-down, which yields low coding efficiency and a long average code length, so it is rarely used in practical applications. Huffman coding encodes the sequence according to the probability of character occurrence, so that the average code length is the shortest; its compression efficiency is only average for characters with near-uniform probabilities of occurrence. Dictionary-based methods, such as the LZ77 and LZ78 algorithms [4,5], perform compression and decompression by constructing a dictionary mapping table. The LZ77 algorithm uses a sliding dynamic dictionary to store local historical information and replaces duplicate content with references to earlier occurrences.

Information 2020, 11, 172

In Table 1, the examples include the present-day Uyghur and Latinized transition forms. The syllabic structures of nos. 7-12 are used mainly to record words of foreign origin. Each syllable of structures no. 10 and no. 11 has two vowels; these are used mainly to record words with two vowels from Chinese and other languages, such as Zhonghua and Guangdong. The inherent feature of Uyghur syllables is that a syllable contains exactly one vowel and may contain no consonants, and thus the number of vowels in a word is theoretically equal to the number of syllables in the word. The following three problems need to be solved to implement syllable segmentation:

1. Some loanwords from Chinese have two vowels, such as tüän and hua.
2. In inherent Uyghur words, no more than one consonant appears in front of the vowel, but some loanwords from foreign languages have more than one consonant in front of a vowel, such as Stalin and Strategiyä.
3. When syllables are segmented, the two-vowel structures of certain Chinese loanwords and the multiple-consonant structures of certain foreign words can make the segmentation algorithm ambiguous. For example, syllabic type 11 (CVVC) is structurally a combination of type 3 (CV) and type 2 (VC); when a string with the CVVC structure occurs in a word, identifying whether it is one syllable or two is a key issue for syllable segmentation.
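The vowel-counting invariant above can be illustrated with a short sketch (the Latin vowel set and the example words are our own illustrative assumptions, not the segmentation method of Wayit et al.):

```python
# Illustrative only: for inherent Uyghur words, each syllable contains exactly
# one vowel, so counting vowels estimates the syllable count. The Latin vowel
# set below is an assumption covering common transliteration variants.
VOWELS = set("aäeéioöuü")

def estimated_syllable_count(latin_word: str) -> int:
    """Estimate syllable count by counting vowels (inherent words only)."""
    return sum(ch in VOWELS for ch in latin_word.lower())

print(estimated_syllable_count("qelem"))  # 2 (qe-lem)
print(estimated_syllable_count("tüän"))   # 2, although the loanword is a single two-vowel syllable
```

The second call shows problem 1 in action: the Chinese loanword contains two vowels in one syllable, so a naive vowel count overestimates the syllable count.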

Syllable Segmentation and Analysis
The corpus constructed in this paper included a collection of 52,718 articles from a variety of journals, government documents, scientific and literary works, and short documents such as social media posts. We performed syllable segmentation on all 713,716 unique words and expressions appearing in these articles, using the segmentation method described by Wayit et al. [29]. This segmentation yielded a total of 8621 different syllables belonging to the 12 syllabic structures. We counted the syllables for each syllabic structure; the statistics are given in Table 2.

Table 2. Syllable statistics per syllabic structure.

Structure   Theoretical      Actual   Frequency of Occurrence
V                     8           8                    31,653
VC                  192         172                    34,002
CV                  192         184                   593,850
CVC               4,608       2,992                   441,376
VCC               4,608         294                     1,319
CVCC            110,592       2,956                    11,667
CVV               1,536         260                     1,358
CVVC             36,864         394                     2,194
CCV               4,608         425                     1,358
CCVC            110,592         688                     2,829
CCVCC         2,654,208         151                       287
CCCV            110,592          97                       180

In Table 2, the theoretical syllable number reflects all of the syllables that can be composed of 24 consonants and 8 vowels in that structure. For example, the CCV syllabic structure can theoretically generate 24 * 24 * 8 = 4608 syllables. The actual number is the number of distinct syllables of that structure observed while segmenting the 713,716 words; for example, the CCV structure yielded only 425 actual syllables. The frequency of occurrence is the total number of occurrences of all syllables of that structure among all of the words. From the statistical results, we found that the six inherent syllabic structures accounted for the majority of syllable occurrences. According to the statistics in Table 2, the average syllable length (ASL) calculated by Equation (1) was 2.4 characters.

Figure 1 shows the Zipf's law distribution of the 8621 syllables in the corpus. All syllables are sorted in descending order of frequency f; the most frequent syllable ranks as r = 1. In the figure, the x-axis is the logarithm of the ranking r, and the y-axis is the logarithm of the frequency f. Figure 2 shows the syllable coverage: the x-axis is the top n syllables, and the y-axis is the logarithm of the sum of the frequencies of the top n syllables, log fn, calculated by Equation (2).
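As a quick check of Table 2, the theoretical counts and the average syllable length of Equation (1) can be recomputed from the tabulated figures (a sketch; frequencies copied from Table 2):

```python
# Frequency of occurrence per syllabic structure, copied from Table 2.
freq = {"V": 31653, "VC": 34002, "CV": 593850, "CVC": 441376,
        "VCC": 1319, "CVCC": 11667, "CCV": 1358, "CCVC": 2829,
        "CCVCC": 287, "CVV": 1358, "CVVC": 2194, "CCCV": 180}

# Theoretical syllable count: 24 choices per consonant C, 8 per vowel V.
def theoretical_count(structure: str) -> int:
    return 24 ** structure.count("C") * 8 ** structure.count("V")

print(theoretical_count("CCV"))  # 4608

# Average syllable length (Equation (1)): frequency-weighted character count.
total = sum(freq.values())
asl = sum(len(s) * f for s, f in freq.items()) / total
print(round(asl, 1))  # 2.4
```

The frequency-weighted mean over all 12 structures reproduces the ASL of 2.4 characters reported in the text.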

Selection of High-Frequency Syllables
For example, log r = 2.0 on the x-axis corresponds to log f = 0.605 on the y-axis. This means that the first 100 syllables with the highest frequency can cover 60% of the words in the corpus. After the 2000th syllable, the increase in the syllables does not significantly increase the coverage.


Syllable Coding
In this study, we used two coding schemes to encode the syllables: the 12-bit short coding scheme B12 and the 16-bit long coding scheme B16. The B12 scheme included an 11-bit syllable code table, and the B16 scheme included a 16-bit syllable code table. We used these two coding schemes in conjunction with Unicode encoding, which encoded the characters not in the syllable code tables. In actual application, to identify the characters not included in the code tables, we had to add identification flags to these Unicode codes.
In this paper, we used dicChar and dicSyllb to represent the characters and syllables encoded in the dictionary, and xChar and xSyllb to represent the characters and syllables not encoded in the dictionary. An xSyllb was composed of several dicChars, and SP was a space (U0020).

B12 Coding Scheme
The most common word division symbol in Uyghur is the space (U0020). In the B12 scheme, we set the first bit as the space flag bit. A flag bit of "1" indicated that the current syllable was followed by a space (i.e., it was the end syllable of a word). A flag bit of "0" indicated that no space followed the currently encoded syllable (i.e., it was the first or an intermediate syllable of a word).
Excluding the space flag bit, we called the remaining 11 bits of the 12-bit short code the syllable code bits. The 11-bit syllable code bits provided 2048 code positions; that is, the 11-bit syllable code table contained 2048 syllable codes. We classified the code positions as follows:

1. ASCII characters: The frequency of ASCII characters in Uyghur text was higher than that of other symbols, such as Chinese characters, so we treated each ASCII character as a syllable and reserved the first 128 code positions for ASCII characters (0x00-0x7F).
2. Uyghur characters: The code range of Uyghur characters was in the Unicode basic block (U0600-U06FF).
5. Flags: We reserved 15 positions for flags describing various situations in the data stream.
6. High-frequency Uyghur syllables: Excluding the previously mentioned syllable codes, 1862 code positions remained. As shown in Figure 1, the coverage of these 1862 syllables was around 98%; they comprised the most commonly used high-frequency syllables.

These characters belonged to dicChar, and the remaining characters belonged to xChar. xChar also included the non-Uyghur characters in (U0600-U06FF) and the other Uyghur characters in the Unicode extension areas (UFE70-UFEFF) and (UFB50-UFDFF).

B16 Coding Scheme
In the B16 coding scheme, the code length of a syllable was exactly equal to the length of a Unicode character code. This facilitated subsequent research on syllable-based retrieval of compressed text. We selected the Private Use Area (UE000-UF8FF), with 6400 positions, as the 16-bit long code block. The number of syllables occurring in the previously noted corpus was 8621, so this block was large enough to accommodate most of the syllables that occurred. This scheme did not use the space flag.
In this scheme, when an xSyllb was encountered in the text, we used its Unicode source code directly. When the text itself contained a character in the Private Use Area, we resolved the conflict by attaching an identifier flag.
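The B16 idea of mapping syllables onto Private Use Area code points can be sketched as follows (the syllable list is hypothetical; the starting code point 0xE003 is taken from the decompression section, which reserves the first Private Use Area positions for flags):

```python
# Sketch of the B16 mapping: one Private Use Area code point per syllable.
# The syllables here are illustrative placeholders, not the paper's table.
syllables = ["bir", "lik", "lar"]
encode = {s: 0xE003 + i for i, s in enumerate(syllables)}   # syllable -> code point
decode = {cp: s for s, cp in encode.items()}                # code point -> syllable

compressed = "".join(chr(encode[s]) for s in ["bir", "lik"])
print([hex(ord(c)) for c in compressed])                    # ['0xe003', '0xe004']
print("".join(decode[ord(c)] for c in compressed))          # birlik
```

Because each syllable occupies exactly one 16-bit code unit, the compressed string is itself a valid Unicode string, which is what enables syllable-based retrieval without decompression.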

Code Block Division
The code blocks of the two coding schemes above are shown in Table 3.

Table 3. Code ranges of the two coding schemes.

B12 Scheme Flags
The purpose of the identification flag is to identify the xChar and xSyllb that appear in the data stream. We first selected some of the corpora shown in Table 4 to gather statistics on the length of xChar strings; the length probabilities are shown in Figure 3. The number of xChars with string lengths of 1 and 2 was the highest, and the probability that the length was greater than 8 was very low. A length of 1 was found primarily in Chinese characters and other symbols (①⑴√ VI ‰, etc.), and a length of more than 2 was found mainly in Chinese characters. The flags and their meanings, based on these statistical results, are shown in Table 5.

Figure 3. Length probabilities of xChar strings.
Suppose a string S includes 2 ASCII characters a1, a2 (code table encodings A1, A2); 2 Russian characters r1, r2 (Unicode R1, R2); 10 Chinese characters c1-c10 (Unicode C1-C10); 1 dicSyllb composed of 2 Uyghur characters u1, u2 (dictionary encoding S1); 1 xSyllb composed of 3 Uyghur characters u3, u4, u5 (Unicode U3, U4, U5); and 3 Private Use Area characters e1, e2, e3 (Unicode E1 = 0xE000, E2 = 0xE001, E3 = 0xE002). The encoding result of string S is then

S_string = a1 a2 r1 r2 c1...c10 u1 u2 u3 u4 u5 e1 e2 e3,
S_encoding = SDB + A1 A2 + fXC(0x7F2) + R1 R2 + fXBB(0x7FA) + C1...C10 + fXBE(0xE002) + S1 + fXC(0x7F3) + U3 U4 U5 + fXC(0x7F3) + E1 E2 E3.
In the B12 scheme, the length of fXC and fXBB was 12 bits, and the length of fXBE was 16 bits. The advantage of this design was that, while continuously reading 12 bits, the decoder could easily identify an fXC and then directly read n xChars, where n = fXC - 0x7F0. When it encountered fXBB, it switched to reading 16 bits; when it read fXBE, the xChar sequence had ended, and it resumed reading 12 bits.

B16 Scheme Flags
The B16 scheme identification flags and their meanings are shown in Table 6. Suppose string S contains 2 ASCII characters a1, a2 (Unicode A1, A2); 2 Russian characters r1, r2 (Unicode R1, R2); 10 Chinese characters c1-c10 (Unicode C1-C10); 2 dicSyllb (dictionary encodings S3(U1,U2) = 0xE003 and S5(U3,U4) = 0xE005); 1 xSyllb (three Uyghur characters with Unicode U3, U4, U5); and 4 Private Use Area characters e3, e5, e3, and e5 (Unicode E3 = 0xE003, E5 = 0xE005).

We used the B16 scheme to research syllable-based text retrieval. When we retrieved the syllable pair S3S5 from the encoded string S_encoding, the S3S5 codes 0xE003 and 0xE005 appeared three times in total, and the last two occurrences carried the identifier flag fXC, which allowed them to be excluded. When retrieving e3e5, we encoded e3e5 into fXC + E3 + fXC + E5 before retrieval, according to the input content; this excluded the encoding of S3S5 without an identifier. If e3e5e3e5 had instead been encoded as fXBB + E3E5E3E5 + fXBE, then when e3e5 was retrieved, the system would encode e3e5 into fXBB + E3E5 + fXBE, and there would be no matching content in S_encoding. This is why the B16 scheme did not adopt the fXBB and fXBE flags of the B12 scheme.


Datagram and File Format
The format design of the compressed datagram and compressed file is shown in Figure 4. SDB was a 2-byte identifier indicating that the compressed data stream started from this scheme. CodetabID was a 1-byte code table ID in the dictionary; a dictionary could contain 255 code tables, and a code table represented a language or one of a language's coding schemes. After receiving the data stream, the decoder performed decoding according to the code table ID. Sdata was the data stream to be decoded, and ESD marked the end of the data stream. The ESD length was 12 bits in the B12 scheme and 16 bits in the B16 scheme.

In the adaptive dictionary method, the encoder generates a dictionary based on the text content and compresses it, so the compressed data stream or file carries dictionary information. During decoding, the data stream is read to determine whether the current item is an identifier or uncompressed data, and the final decompressed output is looked up in the dictionary based on the identification. Because a string S has different frequencies and positions in different texts T, its encoding in the compressed text Z also differs. In this study, we proposed a method that uses the syllable characteristics of natural language to generate a general static dictionary for compression and decompression. The actual compressed data carries no dictionary: regardless of the frequency and position of a string S in the text T, its encoding in the compressed text Z is the same.


Data Compression Process
Taking the B12 scheme as an example, the implementation process of the compression method proposed in this paper is shown in Figure 5. The process functions in the flowchart are as follows:

CheckXBlockEnd: Check the previous code first. If the previous encoding was an xChar and the count of the current xChar plus the preceding consecutive xChars exceeded 9, add an fXBE identifier to the current data stream.
SetXCharFlag: Append the fXC or fXBB flag to the data stream according to the current xChar/xSyllb length.
AddSPCode: Check the previous code first. If it was a dicSyllb/dicChar and no space flag bit had been added, add the space flag bit to the previous code; otherwise, add a 12-bit space code (0x020) to the data stream.
Finally, we generated the bit sequence of the text. If the total bit length was not divisible by 8, we appended "0" bits until it was; we then converted the bit sequence into a byte sequence for storage.
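The final byte-alignment step can be sketched as follows (a minimal illustration, not the authors' implementation; the code values in the usage lines are arbitrary):

```python
def pack_12bit(codes):
    """Pack 12-bit codes into bytes, padding the tail with '0' bits
    so the total bit length is divisible by 8 (as the B12 scheme does)."""
    bits = "".join(format(c & 0xFFF, "012b") for c in codes)
    bits += "0" * (-len(bits) % 8)                # zero-pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# One 12-bit space code 0x020 (space flag bit clear) -> 12 bits padded to 2 bytes.
print(pack_12bit([0x020]).hex())                  # 0200
# Three codes = 36 bits -> padded to 40 bits -> 5 bytes.
print(len(pack_12bit([0x123, 0x456, 0x789])))     # 5
```

Setting the space flag on a code is simply `code | 0x800`, since the flag occupies the first of the 12 bits.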
Figure 6 shows an example of B12 coding. The reading direction of the Uyghur text was from right to left, and the original reading direction of the other characters remained unchanged; the reading order of the text in Figure 6 was A, B...→K→M→L→O, P...→W. The original text to be compressed had two xChars (two Chinese characters, units L2 and M2) and one xSyllb (three characters, units T2, U2, V2). Line 1 was the result of the original word segmentation; Line 2 was the syllable segmentation of the words; Lines 3 to 6 were the Unicode encodings of the characters inside these syllables; Line 8 was a space flag bit; and Lines 9 and 10 were the final encoding results.

Compression Ratio

The compression ratio is an indicator used to evaluate the performance of a compression method. The compression ratio (CR) is calculated by Equation (3):

CR = S_C / S_O, (3)

where S_O is the size of the original text and S_C is the size of the compressed data, which consists of a Unicode encoding portion P_U and a binary encoding portion P_B.

Compression Ratio of the B12

Any element in P_B was a binary code obtained by matching the 12-bit short code table. The length of P_B is calculated by Equation (4):

P_B = SyllCount × L_enc, (4)

where L_enc = 12 bits. P_U consists of xChar and xSyllb. When one element of P_U was Pi = char1 char2 ... charn, there were two types of coding sequences: when the number of characters CharCount(Pi) < 10, Pi(n<10) = fXC char1 char2 ... charn; when CharCount(Pi) > 9, Pi(n>9) = fXBB char1 char2 ... charn fXBE. The P_U length, in bits, follows from the flag lengths (fXC and fXBB are 12 bits, fXBE is 16 bits) as Equations (5) and (6):

Len(Pi(n<10)) = 12 + 16n, (5)
Len(Pi(n>9)) = 12 + 16n + 16. (6)

According to Figure 6, the original text size was S_O = 592 bits. The original text had 19 syllables, of which dicSyllb = 10, plus one space (unit N2) that could not be marked with a space flag bit, so P_B = 10 × 12 bits + 12 bits = 132 bits. It also had two Chinese characters (units L2, M2), xChar = 2, giving P_i1 = 2 × 16 + 12 = 44 bits, and one xSyllb of three characters (units T2, U2, V2), giving P_i2 = 3 × 16 + 12 = 60 bits. In total, S_C = P_B + P_i1 + P_i2 = 132 + 44 + 60 = 236 bits, and the CR calculation result was CR = 236/592 = 0.399.

Compression Ratio of the B16

The calculation of the P_B part was the same as Equation (4), with L_enc = 16 bits. When an xSyllb appeared in P_U, its structure was Pis = uchar1 uchar2 ... ucharn.
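The worked example above reduces to a few lines of arithmetic (a sketch reproducing the Figure 6 numbers):

```python
# Reproducing the Figure 6 compression-ratio arithmetic (Equations (3)-(6)).
L_enc = 12                       # bits per 12-bit dictionary code
P_B = 10 * L_enc + L_enc         # 10 dicSyllb codes plus one space code = 132 bits
P_i1 = 12 + 16 * 2               # fXC flag + 2 Chinese xChars = 44 bits
P_i2 = 12 + 16 * 3               # fXC flag + 3-character xSyllb = 60 bits
S_C = P_B + P_i1 + P_i2          # 236 bits
S_O = 37 * 16                    # 37 Unicode characters at 16 bits each = 592 bits
print(S_C, round(S_C / S_O, 3))  # 236 0.399
```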

Average Coding Length
Average coding length was another indicator used to evaluate text compression performance; the shorter the average coding length, the more efficient the compression method. With S_C the size of the compressed text in bits, the average encoding length BPC (bits per character) was calculated by Equation (9):

BPC = S_C / CharCount. (9)
According to Figure 6, the compressed text size was S_C = 236 bits for 37 characters, so BPC = 6.378 bits, which was 9.622 bits fewer than the 16 bits per character used by Unicode encoding.
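Equation (9) applied to the same example (a two-line check):

```python
# BPC for the Figure 6 example: compressed size in bits over character count.
S_C, chars = 236, 37
bpc = S_C / chars
print(round(bpc, 3), round(16 - bpc, 3))  # 6.378 9.622
```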

Data Decompression
In the data decompression process, a compressed text file is restored to the original text. Decompression is the reverse of compression; we used the following procedures.

Decompression of B12 Scheme
We read the compressed data in bytes, converted each byte into an 8-bit binary value, and generated a continuous bit stream.

1. Intercept 12 bits; if the code was a dicSyllb, decode it and read the next 12 bits.
2. If the decoding result was fXC, intercept n Unicode characters continuously, where n = fXC - 0x7F0 and each character was 16 bits long.
3. If the decoding result was fXBB, intercept Unicode characters continuously until fXBE was read.
4. If no bits remained in the data stream, the decompression was complete; otherwise, repeat Step 1.

The decoding algorithm is shown in Algorithm 1.

Decompression of B16 Scheme

1. Read the compressed text in Unicode characters.

2. Intercept 1 character (2 bytes); if the encoding was in the range 0xE003-0xF8FF, decode it using the dictionary encoding table.
3. If the encoding was equal to fXC (0xE001), read the next Unicode character directly. Repeat Step 2. If no character data remained in the stream, the decompression was complete.
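The B12 decompression steps can be sketched as a decoding loop (a simplified illustration: the dictionary is a stand-in, only the fXC path is handled, and the fXC code range 0x7F1-0x7F9 is assumed from the n = fXC - 0x7F0 convention):

```python
def b12_decode(bits: str, dictionary: dict) -> str:
    """Decode a B12 bit stream given as a '0'/'1' string. Handles dictionary
    codes with the space flag bit and the fXC path (n = fXC - 0x7F0 xChars)."""
    out, pos = [], 0
    while pos + 12 <= len(bits):
        code = int(bits[pos:pos + 12], 2)
        pos += 12
        if 0x7F1 <= (code & 0x7FF) <= 0x7F9:        # fXC: read n 16-bit characters
            for _ in range((code & 0x7FF) - 0x7F0):
                out.append(chr(int(bits[pos:pos + 16], 2)))
                pos += 16
        else:                                       # dicSyllb/dicChar code
            out.append(dictionary[code & 0x7FF])
            if code & 0x800:                        # space flag bit set
                out.append(" ")
    return "".join(out)

# Hypothetical dictionary entry: syllable code 0x005 -> "la", space flag set.
print(repr(b12_decode(format(0x805, "012b"), {0x005: "la"})))  # 'la '
```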

Compression Experiments

We conducted two groups of experiments:

1. We randomly selected 15 texts of different sizes as the experimental corpus, based on Unicode encoding.
2. We used the short texts in Table 4 as the experimental corpus. This corpus contained 2908 short texts with a total size of 907,108 bytes; each short text was stored twice, in the UTF8 and UTF16 encoding formats. The corpus-related information is shown in Table 7.

For comparison, we used GZip, LZW, and BZip2. The LZW algorithm used Mark Nelson's implementation [30], GZip used the Microsoft .NET GZipStream class [31], and BZip2 used ICSharpCode.SharpZipLib.BZip2 (© Mike Kruger, version 0.86.0.518). Table 8 gives the comparison of the five compression methods, where S_O indicates the size of the corpus before compression, in bytes. The compression ratios are compared in Table 8 and shown in Figure 7.

Short-Text Compression
We selected 2908 pieces of short text and stored each piece once with UTF8 and once with UTF16 encoding. We compressed each piece using the five methods. S_O represented the total size of the 2908 files of the same encoding type, and S_C was the total size of the 2908 compressed files. The compression and decompression times were the sums over the 2908 files. The specific results are given in Table 9 and shown in Figure 8.

The distribution of Zipf's law for dicSyllb and xSyllb in the corpus is shown in Figure 9. The x-axis shows the ranking of syllables in descending order of syllable frequency, r = 1 indicates the syllable with the highest frequency, and the y-axis shows the frequency of the syllable corresponding to r. To facilitate comparison and observation, the frequency of xSyllb was represented by (−1) log f.

Experimental Analysis
The two coding schemes discussed in this paper offered certain advantages on short text. The B12 coding scheme performed best on text smaller than 4 KB, and the effect was most obvious when the text was smaller than 1 KB. The compression ratios remained stable at about 0.3 for B12 and 0.5 for B16. For short texts smaller than 200 bytes, the GZip, LZW, and BZip2 algorithms had compression ratios exceeding 1. As the text grew, the compression efficiency of GZip, BZip2, and LZW gradually improved and eventually exceeded that of B12 and B16.
The LZW, GZip, and BZip2 compressed files append the dictionary data needed for decompression. When the text is large, this dictionary data has little effect on the compression ratio, but it is why their compression ratios on short texts smaller than 200 bytes exceeded 1. The B12 and B16 schemes were based on syllable encoding. According to Equation (1), the average length of a Uyghur syllable was 2.4 characters; therefore, theoretically, no matter how large the text, the compression ratio was stable at around CR_B12 = 12/(2.4 × 16) = 0.31 and CR_B16 = 16/(2.4 × 16) = 0.42. The occurrence of xChar and xSyllb in the text, and of single-character units in the encoding table, affected the CR. The purpose of selecting high-frequency syllables and using the space flag bit in the B12 coding scheme was to reduce this effect and further improve the compression ratio.
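The container-overhead effect described above is easy to observe with general-purpose compressors (a sketch; the sample string is an arbitrary short UTF-16 message):

```python
import bz2
import gzip

# A short UTF-16 message of a few dozen bytes, like a typical SMS.
msg = "قىسقا ئۇچۇر".encode("utf-16-le")
print(len(msg))  # 22 bytes of input

# Fixed header/trailer overhead dominates on tiny inputs, so the ratio exceeds 1.
for name, compress in (("gzip", gzip.compress), ("bz2", bz2.compress)):
    out = compress(msg)
    print(name, len(out), round(len(out) / len(msg), 2))
```

Both compressors emit more bytes than they consume here, matching the reported ratios above 1 for texts under 200 bytes.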
In general compression methods, to obtain the best compression efficiency, the appended dictionary data is specific to the current text; the same character string may be represented by different encodings in different compressed data because of different frequencies, so further processing of compressed content requires decompression. In our syllable-based compression scheme, the encoding value and length are fixed, and the basic unit of the compressed data stream changes from characters to syllables. This offers certain advantages for syllable-based research and applications such as speech synthesis and speech recognition. When processing short-text content drawn from big data (e.g., WeChat messages and SMS), theoretically faster and more convenient retrieval could be obtained at a compression ratio of 0.3 to 0.5. The advantages of the B16 scheme in this regard are obvious: third-party retrieval tools would not need to decompress the content to perform syllable-based retrieval.

CR_best and CR_worst of the B12 Scheme
The compressed file size in the B12 scheme was S_C = P_B + P_U, where P_B was the compressed content and P_U was the reserved original-code part with its flags.

When P_U = 0, S_C depended directly on P_B. Because P_B = SyllCount × L_enc and L_enc = 12, P_B = SyllCount × 12, and the compression ratio (assuming the original text is in Unicode format) was CR = (SyllCount × 12)/(CharCount × 16). As shown in Tables 1 and 2, the longest syllabic structure was CCVCC, with five characters per syllable. If the content of a text file was a continuous sequence of CCVCC + SP (a 5-character syllable plus 1 space), the space was exactly marked with the space flag bit, and every 6 characters were encoded in one 12-bit code, so the best case was

CR_best = 12/(6 × 16) = 0.125.

Similarly, according to the design of the B12 data structure, the CR was worst when the text alternated one dicChar and one xChar. The size of this encoding structure was dicChar + fXC + xChar → 12 bits + 12 bits + 16 bits = 40 bits, an average of 20 bits per character. If the number of characters in the file was exactly a multiple of 2,

CR_worst = 40/(2 × 16) = 1.25.

Neither CR_best nor CR_worst depended on the original file size S_O. The B12 scheme was a special compression method for specific environments: it used the syllable characteristics of natural language for encoding and decoding, its compression ratio on natural Uyghur text was about 0.3, and the ratio was not directly related to the file size. The client must hold the same syllable encoding dictionary for compression and decompression (the current dictionary size is < 20 KB). The scheme was therefore suitable for compressing natural-language short texts. For example, applying B12 to product (e.g., pharmaceutical) instructions stored in a two-dimensional code could theoretically accommodate about two times more information, and WeChat and SMS transmission could save about two-thirds of the available communication resources.
If the text was not based on natural language (e.g., a random meaningless string such as "aaaaaabbbbbb"), the compression performance degraded significantly (up to CR_worst = 1.25), which is a disadvantage of this scheme and one of the subjects of our future research.
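The two bounds can be restated as one-line computations (reproducing the arithmetic above):

```python
# Best case: continuous "CCVCC + space" text, 6 characters per 12-bit code.
cr_best = 12 / (6 * 16)
# Worst case: alternating dicChar + xChar, 12 + 12 + 16 = 40 bits per 2 characters.
cr_worst = (12 + 12 + 16) / (2 * 16)
print(cr_best, cr_worst)  # 0.125 1.25
```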

Conclusions
Compression is a fundamental area of research in computer communications, with important theoretical significance and application value. Few studies have examined Uyghur text compression, and current compression techniques have shortcomings for this task. We proposed a Uyghur text compression method based on syllable features, using the B12 and B16 encoding schemes for experiments. Compared with other algorithms, such as LZW, GZip, and BZip2, the B12 scheme had higher compression efficiency when the text size was less than 4 KB; it can be applied effectively to the compressed transmission of short texts and QR codes. The advantage of the B16 scheme was that it could quickly retrieve syllable-based information in the compressed state. The compression method proposed in this paper could be applied to other agglutinative languages in the same language family with high similarity, such as Uzbek, Kazakh, and Kirgiz [32]. Future work will include studies of general syllable-based text compression methods for agglutinative languages and their applications, as well as full-text retrieval of short text based on syllables without decompression.

Conflicts of Interest:
The authors declare no conflict of interest.