Burrows–Wheeler Transform Based Lossless Text Compression Using Keys and Huffman Coding

Abstract: Text compression is one of the most significant research fields, and various algorithms for text compression have already been developed. This is a significant issue, as the use of internet bandwidth is considerably increasing. This article proposes a Burrows–Wheeler transform and pattern matching-based lossless text compression algorithm that uses Huffman coding in order to achieve an excellent compression ratio. In this article, we introduce an algorithm with two keys that are used to reduce the most frequently repeated characters after the Burrows–Wheeler transform. We then find patterns of a certain length from the reduced text and apply Huffman encoding. We compare our proposed technique with state-of-the-art text compression algorithms. Finally, we conclude that the proposed technique demonstrates a gain in compression ratio when compared to other compression techniques. A small drawback of our proposed method is that it is not well suited to the symmetric communication scenarios that Brotli targets.


Introduction
Managing the increasing amount of data produced by modern daily life activities is not a simple task for symmetric communications. In articles [1,2], it is reported that, on average, 4.4 zettabytes and 2.5 exabytes of data were produced per day in 2013 and 2015, respectively. On the other hand, the use of the internet is increasing. The total numbers of internet users were 2.4, 3.4, and 4.4 billion in 2014, 2016, and 2019, respectively [3]. Though hardware manufacturing companies are producing plenty of hardware in an attempt to provide a better solution for working with huge amounts of data, it is almost impossible to maintain these data without compression.
Compression is the representation of data in a reduced form, so that data can be saved using a small amount of storage and sent with a limited bandwidth [4][5][6][7]. There are two types of compression techniques: lossless and lossy [8,9]. Lossless compression reproduces data perfectly from its encoded bit stream, while in lossy compression, less significant information is removed [10,11].
There are various types of lossless text compression techniques, such as the Burrows-Wheeler transform (BWT), run-length coding, Huffman coding, arithmetic coding, LZ77, Deflate, LZW, Gzip, Bzip2, Brotli, etc. [12,13]. Statistical methods assign shorter variable-length binary codes to the most frequently repeated characters. Huffman and arithmetic coding are two examples of this type of statistical method; Huffman coding is one of the best algorithms in this category [14]. Dictionary-based methods, such as LZ77 and LZW, create a dictionary of substrings and assign each a particular pointer based on the substring's frequency. In [15], Robbi et al. propose a Blowfish encryption and LZW-based text compression procedure and show a better result than LZW coding. Deflate provides slightly poorer compression, but its encoding and decoding speeds are fast [16]. Although researchers have developed many lossless text compression algorithms, these do not fulfill the current demand, and researchers are still trying to develop more efficient algorithms.
From this point of view, in this paper we propose a lossless text compression procedure using the Burrows-Wheeler transform and Huffman coding. In our proposed method, we apply a technique with two keys that reduces only the characters repeated consecutively more than two times after the Burrows-Wheeler transform. Finally, we find all of the most frequent patterns and then apply Huffman coding for encoding. We explain our proposed method in detail and compare it against some popular text compression methods. Previous work is reviewed in Section 2, and the proposed technique is explained in Section 3. In Section 4, we present the experimental results and analysis and give further research directions. Finally, we conclude the article in Section 5.

Previous Works
Run-length coding is one of the simplest text compression algorithms. It stores symbols and their counts. When run-length coding is applied directly to data, it sometimes takes more storage than the original data [17]. Shannon-Fano coding generates variable-length prefix codes based on probabilities and provides better results than run-length coding, but it is not optimal, as it cannot guarantee the same tree in encoding and decoding. David Huffman developed a data compression algorithm, reported in [18], which is normally used as a part of many compression techniques. In this technique, a binary tree is generated by repeatedly connecting the two lowest probabilities, where the root of each subtree contains the sum of the two probabilities. The tree is then used to encode each symbol without ambiguity. However, Huffman coding cannot achieve an optimal code length when applied directly. Arithmetic coding outperforms Huffman coding in terms of average code length, but it takes a huge amount of time for encoding and decoding.
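The tree-building step just described can be sketched as follows; this is a minimal illustration in Python, not any particular paper's implementation. It repeatedly merges the two lowest-frequency entries and prefixes '0' or '1' onto the codes in each merged subtree:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table by repeatedly merging the two
    lowest-frequency subtrees."""
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, {symbol: partial code}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two smallest subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}         # left branch gets '0'
        merged.update({s: "1" + c for s, c in right.items()})  # right branch gets '1'
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")
```

Here the most frequent symbol, 'a', receives the shortest code, and the resulting code table is prefix-free, so the bit stream can be decoded without ambiguity.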
LZW is a dictionary-based lossless text compression algorithm and an updated version of LZ78 [19]. In this technique, a dictionary is created and initialized with all strings of length one. Subsequently, the longest string in the dictionary that matches the current input data is found. Although LZW is a good text compression technique, it is complicated by its searching complexity [20]. Deflate is also a lossless compression algorithm that compresses text by using LZSS and Huffman coding together, where LZSS is a derivative of LZ77. The Deflate procedure finds all of the duplicate substrings in a text; each substring is then replaced by a pointer to its first occurrence. The main limitation of Deflate is that its search for long duplicate substrings uses a lazy mechanism [21]. Gzip is another lossless, Deflate-based text compression algorithm that compresses text using LZ77 and Huffman coding [22]. The pseudo-code of LZ77 is reported in [23].
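The longest-match dictionary lookup described above can be sketched as a minimal LZW encoder; this illustration omits the code-width management a real implementation needs:

```python
def lzw_encode(text):
    """Minimal LZW encoder: the dictionary starts with all single
    characters and learns longer substrings as they are seen."""
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                         # extend the longest match
        else:
            output.append(dictionary[current])    # emit code for the match
            dictionary[current + ch] = next_code  # learn the new substring
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output, dictionary

output_codes, final_dict = lzw_encode("abababab")
```

On "abababab" the encoder emits five codes for eight characters, because the dictionary quickly learns "ab" and "aba".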
The Lempel-Ziv-Markov chain algorithm (LZMA), developed by Igor Pavlov, is a dictionary-based text compression technique similar to LZ77. LZMA uses a comparatively small amount of memory for decoding, and it is very well suited to embedded applications [24]. Bzip2, on the other hand, compresses a single file using the Burrows-Wheeler transform, the move-to-front (MTF) transform, and Huffman entropy coding. Although Bzip2 compresses more effectively than LZW, it is slower than Deflate but faster than LZMA [25]. PAQ8n is a lossless text compression method that incorporates the JPEG model into PAQ8l. The main limitation of PAQ8n is that it is very slow [26,27]. Brotli, which was developed by Google, is a lossless text compression method that performs compression using the lossless mode of LZ77, Huffman entropy coding, and second-order context modeling [28]. Brotli utilizes a predefined static dictionary holding 13,504 common words [29,30]. It cannot compress large files well because of its limited sliding window [31].
The Burrows-Wheeler transform (BWT) in [32] transforms a set of characters into runs of identical characters. It is completely reversible, and no extra information needs to be stored except the position of the last character. The transformed character set can then be easily compressed by run-length coding. The pseudo-codes of the forward and inverse Burrows-Wheeler transforms are reported in [33].
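The forward and inverse transforms can be sketched naively as follows; this rotation-sorting illustration with a sentinel character is O(n^2 log n) and is for exposition only, since practical implementations use suffix arrays:

```python
def bwt_forward(text, eos="\x03"):
    """Naive forward BWT: sort all rotations of text+sentinel and
    take the last column."""
    text += eos
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)

def bwt_inverse(last_column, eos="\x03"):
    """Naive inverse BWT: rebuild the sorted rotation table one column
    at a time, then pick the row ending in the sentinel."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]
```

For example, bwt_forward("banana") groups identical characters together ("annb\x03aa"), and bwt_inverse recovers the original text exactly, illustrating why the transform combines well with run-length coding.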

Proposed Method
There are many algorithms used to compress text; some examples are Huffman, run-length, LZW, Bzip2, Deflate, and Gzip coding-based algorithms [34][35][36][37][38]. Many algorithms focus on the encoding or decoding speed during text compression, while others concentrate on the average code length. Brotli provides better compression ratios than other state-of-the-art techniques for text compression. However, it uses a large static dictionary [30]. What makes our proposal special? We can apply our proposed method to a large file as well as a small file. The main limitation of BWT is that it takes a huge amount of storage and a lot of time to transform a large file [7,39,40]. Our first innovation is that we split a large file into a set of smaller files, each containing the same number of characters, and then apply the Burrows-Wheeler transform (BWT) to the smaller files individually to speed up the transformation. We do not use any static dictionary, because searching for a word or phrase in a dictionary is very complicated and time consuming [14]. We also modify the use of run-length coding after the BWT, because run-length coding stores each symbol and its count, and it only works well when characters are repeated often in a text. When a character appears alone, which normally happens, two values (the character and its count) are stored after encoding, which increases the use of storage. Our second change is that we replace only the characters repeated more than three times in a row by a key, the character, and its count; the key identifies the position of the character sequence in the reduced text. Huffman coding provides more compression if a text has a higher frequency of characters. We have analyzed ten large text files from [41], and the outcomes are shown in Figure 1. The figure shows that the frequency of lowercase letters in any text is always much higher than that of other character types: on average, the files contain 71.46% lowercase letters, 3.56% capital letters, 15.48% spaces, 2.40% newlines, 1.14% digits (0-9), and 5.96% other characters. We have calculated that, on average, the frequency of lowercase letters is 94.95%, 78.11%, 96.67%, 98.31%, and 91.72% higher than the frequency of capital letters, spaces, newlines, digits (0-9), and other characters, respectively.
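The splitting step described above can be sketched as follows; the chunk size of 4096 characters is an illustrative choice, as this excerpt does not state the value used:

```python
def split_into_chunks(text, chunk_size=4096):
    """Split `text` into pieces of at most `chunk_size` characters so a
    transform such as BWT can be applied to each piece independently."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Because each chunk is transformed on its own, the cost of sorting rotations grows with the chunk size rather than with the whole file, which is what makes the transform feasible for large files.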
Additionally, we have analyzed 5575 small text files of lengths less than or equal to 900 characters. As shown in Figure 2, a maximum of twenty identical characters are repeated consecutively at a time after the Burrows-Wheeler transform is applied to the small texts. There are twenty-six lowercase characters in the English alphabet. Accordingly, we replace the character count with a lowercase letter using the formula (character's count + 96), so that lowercase letters keep a high frequency and we can obtain a higher compression ratio. The above idea can only reduce the characters repeated four times or more at a time. However, a file can contain many other characters that are repeated two or three times. Our third change is to reduce the characters that are repeated exactly three times using the form (second key, the character). As a result, we store only two characters instead of three and further reduce the length of the file. Replacing characters that appear twice does not help to reduce the file length, so we keep these characters unchanged. We have analyzed eleven large text files after applying the Burrows-Wheeler transform and the text reduction techniques with the two keys explained above in order to find specific patterns. We find patterns of lengths two, three, four, five, and six, and the outcome of the analysis is demonstrated in Figure 3. This figure shows that the patterns of length two provide 63.03%, 79.02%, 81.49%, and 83.23% higher frequencies than the patterns of lengths three, four, five, and six, respectively. This is why we select patterns of length two and then apply the Huffman encoding technique for compression. This decreases the use of storage and increases the encoding and decoding speeds, because the detection of long patterns is relatively complex and normally yields much lower pattern frequencies. Figures 4 and 5, respectively, show the general block diagrams of the proposed encoding and decoding procedures. Additionally, the encoding and decoding procedures of the proposed method are given in Algorithms 1 and 2, respectively.
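The two-key reduction described above can be sketched as follows; the specific key byte values are our assumption for illustration, since this excerpt does not fix them:

```python
# Illustrative key bytes; the paper does not specify the key values here.
KEY1, KEY2 = "\x01", "\x02"

def reduce_runs(text):
    """Two-key run reduction: a run longer than three characters becomes
    KEY1 + character + chr(count + 96), so the count maps to a lowercase
    letter; a run of exactly three becomes KEY2 + character; shorter
    runs are copied unchanged."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # scan to the end of the run
        count = j - i
        if count > 3:
            out.append(KEY1 + text[i] + chr(count + 96))
        elif count == 3:
            out.append(KEY2 + text[i])
        else:
            out.append(text[i] * count)
        i = j
    return "".join(out)
```

For example, reduce_runs("aaaaabbbcc") shrinks "aaaaa" to KEY1 + "a" + "e" (since 5 + 96 = 101, the code of "e") and "bbb" to KEY2 + "b", while "cc" is kept as-is, matching the rule that pairs are not worth replacing.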

Experimental Results and Analysis
Some experimental results are shown and explained in this section. We made a comparison with some other methods in order to show the usefulness of our proposed method. However, it is essential to determine the comparison parameters before making any comparisons. Here, the state-of-the-art techniques and the proposed method are compared based on the compression ratio (CR), which is calculated using Equation (1) and is a very important measurement criterion in this context [37]. Additionally, the encoding and decoding times are considered in the comparison.

CR = Original text size / Compressed text size    (1)

There are many lossless text compression algorithms, but we select PAQ8n, Deflate, Bzip2, Gzip, LZMA, LZW, and Brotli for comparison in this article, because these are the state-of-the-art techniques in this area, and Brotli is one of the best methods among them. We use sample text files of different sizes from the UCI dataset for testing the aforementioned algorithms. Compression ratios are used to evaluate each method on the sample texts. We apply the state-of-the-art techniques and the proposed method to twenty different texts. Table 1 shows the experimental results in terms of compression ratios, and Figure 6 shows their graphical representation for quick comparison. Table 1 shows that, on average, LZW provides the lowest compression ratio (1.288) and Brotli the highest (1.667) among the state-of-the-art techniques. Although PAQ8n provides 3.04%, 4.3%, 5.5%, and 3.91% better results than Brotli for texts 3, 15, 16, and 20, respectively, Brotli shows 1.44%, 10.86%, 14.94%, 9.78%, 20.52%, and 22.74% more compression than PAQ8n, Deflate, Bzip2, Gzip, LZMA, and LZW, respectively, on average. It can be seen that the proposed technique provides better results, with a higher compression ratio (1.884) on average. Specifically, the proposed technique demonstrates, on average, a compression ratio 12.79% higher than PAQ8n, 21.13% higher than Deflate, 24.73% higher than Bzip2, 20.17% higher than Gzip, 31.63% higher than LZMA, 31.7% higher than LZW, and 11.52% higher than Brotli. We can see from Figure 6 that the compression ratio of the proposed technique is higher for every sample. We also calculate the encoding and decoding times, which are shown in Figures 7 and 8, respectively. For encoding, on average, LZMA and Brotli take the highest (5.8915 s) and the lowest (0.0131 s) amounts of time, respectively, and the proposed technique takes 0.0375 s. PAQ8n and LZMA are 45.65% and 99.36% slower than the proposed coding technique. On the other hand, the proposed strategy
takes 56.53%, 2.4%, 17.33%, 38.13%, and 65.07% more time than Deflate, Bzip2, Gzip, LZW, and Brotli, respectively. For decoding, on average, Brotli and LZMA take the lowest (0.007 s) and the highest (0.5896 s) amounts of time, respectively, and the proposed coding technique takes 0.0259 s. The proposed technique is 59.08% and 96.61% faster than PAQ8n and LZMA, respectively; it is 49.81%, 47.1%, 7.34%, 27.8%, and 72.97% slower than Deflate, Bzip2, Gzip, LZW, and Brotli, respectively. Considering both encoding and decoding times, we can conclude that our proposed coding method is faster than PAQ8n and LZMA and slower than the other methods mentioned in this article. Brotli performs the best of the state-of-the-art methods in terms of compression ratio, encoding time, and decoding time. However, our proposed method outperforms not only Brotli but also the other state-of-the-art lossless text compression techniques mentioned in this article in terms of the compression ratio.
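Equation (1) is straightforward to compute; for instance, a 1000-byte text compressed to 600 bytes has CR of roughly 1.667, the average reported above for Brotli:

```python
def compression_ratio(original_size, compressed_size):
    """Equation (1): CR = original text size / compressed text size."""
    return original_size / compressed_size
```

A larger CR means a smaller compressed file, which is why a higher average CR corresponds directly to a storage saving.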
Text compression has two notable aspects depending on its application: speed and storage efficiency. There are many applications, like instant messaging, where speed is more important. On the other hand, a higher compression ratio is the primary concern for data storage applications. Because the proposed method provides more compression, it works better for data storage applications. The compression ratio is inversely proportional to the total number of bits in a compressed file. Although the proposed method takes more time for encoding and decoding, a file compressed by the proposed method can be sent more quickly through a transmission medium, because it contains fewer bits than files compressed by the other methods. Steganography is a well-known technique used for information hiding and is very important for today's technology [42,43]. Additionally, we can use the proposed compression method with steganography when transferring a file securely over the Internet: we may first apply a steganography technique to a text file to obtain a stego-text and then compress the stego-text with the proposed method to get a more secure text. In this paper, we use the languages C++ and MATLAB (version 9.8.0.1323502 (R2020a)); Code::Blocks (20.03) and MATLAB are used as the coding environments. We also use an HP laptop with an Intel Core i3-3110M @ 2.40 GHz processor.
As a research direction, we can suggest from our investigation that Brotli is one of the best text compression methods. Brotli cannot provide satisfactory results for large file compression due to its limited sliding window. However, it is a relatively fast compression method. If the sliding window problem can be solved satisfactorily while maintaining the same speed, Brotli will perform well from every point of view. On the other hand, our proposed method is somewhat slow. If we can increase its encoding and decoding speed, it will give better results.

Conclusions
Lossless text compression is most significant when the communication channel is highly narrow-band and little storage is available. We have proposed a completely lossless text compression method using the Burrows-Wheeler transform, two keys, and Huffman coding. What distinguishes our proposed method? First, to speed up the transformation, we split a large text file into sets of smaller files, which ultimately increases the speed of compression. Second, we do not count all characters, as is done in run-length coding. We count only the characters that are repeated more than two times consecutively and replace the value of the character count with a lowercase letter to increase the frequency of characters in the text, as lowercase letters are the most frequent characters in each text. Third, we look for patterns of a certain length that have the highest frequency, so that we can get better results after applying Huffman coding.
The experimental outcomes show that the proposed method performs better than the seven algorithms that we compared it to: PAQ8n, Deflate, Bzip2, Gzip, LZMA, LZW, and Brotli. On the twenty sample texts, the proposed method gives, on average, 21.66% higher compression than the methods described in this article.
One good aspect of our method is that we do not use any static dictionary, which helps to speed up the compression somewhat. Another special feature is that we find patterns of the same length. As a result, the complexity of finding patterns is minimal, and the highest-frequency patterns are found, which leads to a better compression ratio.
A conventional algorithm takes inputs, executes a sequence of steps, and produces an output. A parallel algorithm, however, executes many instructions on various processing units at once and produces a final outcome by combining the individual outcomes, thereby speeding up the processing. In our future research on the development of the proposed technique, we will try to implement the method based on parallel processing to reduce the processing time.
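As a sketch of this direction, the chunks produced by the splitting step could be compressed concurrently and then recombined in order; here zlib merely stands in for the paper's per-chunk pipeline, which is not reproduced:

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

def compress_chunk(chunk):
    # zlib is only a placeholder for the proposed per-chunk
    # BWT/reduction/Huffman pipeline.
    return zlib.compress(chunk)

def parallel_compress(data, chunk_size=4096):
    """Compress fixed-size chunks concurrently, preserving their order
    so the outputs can be concatenated for decoding."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(compress_chunk, chunks))
```

Because the chunks are independent, each one can be decoded separately as well, so the same structure parallelizes decompression.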

Figure 1. Comparison of letters' frequency in the texts.

Figure 2. The highest frequencies of the same consecutive characters in the texts after the Burrows-Wheeler transform.

Figure 3. Frequency comparison of different patterns of different lengths.

Figure 4. The general block diagram of the proposed encoding technique.

Figure 5. The general block diagram of the proposed decoding technique.

Figure 6. Graphical representation of the compression ratios.