A Unique Perspective on Data Coding and Decoding

The concept of a loss-less data compression coding method is proposed, and a detailed description of each of its steps follows. Using the Calgary Corpus and Wikipedia data as the experimental samples and compared with existing algorithms, like PAQ or PPMstr, the new coding method could not only compress the source data, but also further re-compress the data produced by the other compression algorithms. The final files are smaller, and by comparison with the original compression ratio, at least 1% redundancy could be eliminated. The new method is simple and easy to realize. Its theoretical foundation is currently under study. The corresponding Matlab source code is provided in the Appendix.


Overview
Data compression is the art of reducing the number of bits needed to store or transmit data.Compression can be either lossless or lossy.Lossless compressed data can be decompressed to exactly its original value.An example is Morse code, in which each letter of the alphabet is coded as a sequence of dots and dashes.The most common letters in English, like E and T, have the shortest codes, while the least common ones like J, Q, X, and Z are assigned the longest codes [1].
Information theory places hard limits on what can and cannot be compressed losslessly.Lossless compression uses the statistical redundancy of data to compress and later recover original information without distortion.The volume of data in various information systems has rapidly increased, so transmitting data rapidly and easily and storing data without losses have become primary problems in OPEN ACCESS data processing.Lossless data compression technology is an important method to solve this problem [2,3] that has long been applied to many fields, but compressibility is affected by the statistical redundancy of data.Currently prevailing algorithms like WinZip, WinRAR, and other similar tools fail to recompress information that has already been compressed [4][5][6].
A method for lossless data coding and decoding is proposed in this communication.The new method has passed several experimental tests.Compared with currently available technologies, the method can not only compress source data without losses, but also further compresses data that have already been compressed with other compression tools, decreasing the space occupied by the compressed files and accordingly removing at least 1% redundancy based on the original compressibility (the redundancy removed differs with the data redundancy of the files).Furthermore, the new method features a simple algorithm and is easy to carry out.Chinese patents for this technology have already been applied for [7].

Concept of the Coding Algorithm
The concept of this coding algorithm is simple.As we know, if the data consists of contiguous "0s" or "1s", there is little information entropy, and the data redundancy is huge.This redundancy can be eliminated by some lossless compression algorithms.In the real world, the data that we use in computers consists of contiguous "0s" and "1s", for example, the data "00101110001100000100" consists of one "0", two "00s",one "000",one "00000", two "1s", one "11" and one "111".In the example, the point is that it has no "0000" sequence, so we can represent "00000" with "0000".Using this method, 1 bit of the data can be removed.In some communication experiments, it has been found that if the data is source data, and has no process coding, this method is quite workable, and can eliminate more redundancy.If the data is random or processed coding, we can also eliminate a little redundancy, even if only 1 bit.Of course, because of the limit of Shannon's information theory, our method does not work yet on data which has no redundancy.The concept of this coding method is new and different from the other algorithms, so its theoretical foundation needs to be researched in depth.This is our goal in future work.

Detailed Coding Algorithm Implementation
(1) The source data is read in binary mode.The binary sequence is then stored in new Array A. The first binary code of Array A is stored in Variable a.
(2) The length "from 0 until the next bit is 1" is substituted in Array A for all "0" in "from 0 until the next bit is 1".The length "from 1 until the next bit is 0" is substituted in Array A for all "1" in "from 1 until the next bit is 0." The sequence is then obtained for new Array B.
(3) The different elements in Array B are sorted and stored in new Array C in their order of occurrence.
(4) The elements of Array C are sequenced in ascending order and then stored in new Array D.
(5) The digit n is substituted for the n-th element in Array D. The obtained sequence is stored in new Array E. (6) The elements of Array D correspond one for one to those of Array E, and all elements of Array B are replaced.The obtained sequence is then stored into new Array F. (7) All elements of Array F are inversely transformed.Unlike in Step (1), the first code of Array F is the code stored in Variable a.The inversely transformed BC sequence is then stored in new Array G. (8) The elements in Array D are processed by "deducting the adjacent previous item from the next item", and the results are stored in new Array H.The first position where the element is equal to or bigger than 2 in Array H is determined, and the position in Array H is set to n.All elements before the position n in Array D are deleted, and the remaining elements are stored in new Array I in order.( 9) Array I and Array G are saved as binary files; this is the result of source file lossless compression, where: In Step ( 5), the digit n begins from 1.
In Step ( 8), the element position n begins from 1.
In Step ( 1), the source file can be any data that have not been compressed, i.e., data in common file formats, and can also be data that have already been compressed with PAQ, WinRK, WinZip, WinRAR, or other tools.

Detailed Decoding Algorithm Implementation
(1) The BC sequence in Array G is read.The first binary code of Array G is stored in Variable g. ( 2) The length "from 0 until the next bit is 1" is substituted in Array A for all "0" in "from 0 until the next bit is 1".The length "from 1 until the next bit is 0" is substituted in Array A for all "1" in "from 1 until the next bit is 0".The obtained sequence is then stored in new Array F.
(3) The different elements in Array F are sorted and stored in new Array E in their order of occurrence.
(4) The elements in Array E are sequenced in ascending order, and the sequence is stored in new Array D.
(5) Array I is read, and the number of elements in Array I is set to i.The elements in Array I are substituted for the i elements in Array D from the end.The obtained sequence is stored in new Array C. (6) The elements in Array D correspond one for one to those of Array C. All elements in Array F are replaced.The obtained sequence is stored in new Array B. (7) All elements in Array B are compared with those in Step (1).The first code of Array B is the code stored in Variable g, and the inversely transformed BC sequence is stored in new Array A. Array A is saved as a source data.

Illustration
Assume that there is a data W to code.

The Lossless Compression for W Is Given as Follows:
(1) The file is read in binary mode (A = {001001011101001100000111010111110}, a = 0); (2) The length "from 0 until the next bit is 1" is substituted in Array A for all "0" in "from 0 until the next bit is 1".The length "from 1 until the next bit is 0" is substituted in Array A for all "1" in "from 1 until the next bit is 0." The obtained sequence is then stored in new Array B (B = {21211311225311151}); (3) The different elements in Array B are sorted and stored in new Array C in their order of occurrence (C = {2135}); (4) The elements in Array C are sequenced in ascending order, and the obtained sequence is stored in new Array D (D = {1235}); (5) The digit n is substituted for the n-th element in Array D, and the obtained sequence is stored into new Array E (E = {1234}); (6) The elements in Array D correspond one for one to those in Array E. All elements in Array B are replaced, and the obtained sequence is stored in new Array F (F = {21211311224311141}); (7) All elements in Array F are inversely transformed unlike in Step (1).The first code of Array F is the code stored in Variable a, and the inversely transformed BC sequence is stored in new Array G (G = {0010010111010011000011101011110}); (8) The elements in Array D are processed by "deducting the adjacent previous item from the next item", and the result is stored in new Array H.The first position where the element is equal to or bigger than 2 in Array H is determined, and the position in Array H is set to n.All the elements before the position n in Array D are deleted, and the remaining elements are stored in new Array I in order (H = {112}, I = {5}); (9) Array I and Array G are saved as binary files; this is the result of source file lossless compression.
Analysis: the comparison between Array A and Array G indicates that the number of binary codes in Array G is two less than that in Array A; i.e., the space occupied by Array G is 2 bits less than that occupied by Array A. In this example, the length of Sequence A of the source file is 33 bits and that of Sequence G of the compressed file is 31 bits.This method thus removes 6% redundancy from Sequence A and saves 6% space.

The Decompression of the Compressed File W Is Given as Follows:
(1) The BC sequence in Array G of the compressed data is read (G = {0010010111010011000011 101011110}, g = 0); (2) The length "from 0 until the next bit is 1" is substituted in Array A for all "0" in "from 0 until the next bit is 1".The length "from 1 until the next bit is 0" is substituted in Array A for all "1" in "from 1 until the next bit is 0".The obtained sequence is then stored in new Array F (F = {21211311224311141}); (3) The different elements in Array F are sorted and stored in new Array E in their order of occurrence (E = {2134}); (4) The elements in Array E are sequenced in ascending order, and the sequence is stored in new Array D (D = {1234}); (5) Array I is read, and the number of elements in Array I is set to i.The elements in Array I are replaced for the i elements in Array D from the end.The obtained sequence is stored in new Array C (I = {5}, i = 1, C = {1235}); (6) The elements in Array D correspond one for one to those of Array C. All elements in Array F are replaced, and the obtained sequence is stored in new Array B (B = {21211311225311151}); (7) All elements in Array B are inversely transformed unlike in Step (1).The first code of Array B is the code stored in Variable g.The inversely transformed BC sequence is stored in new Array A. Array A is saved as a source data (A = {001001011101001100000111010111110}).

Coding Source Code in Matlab
Coding source code in Matlab is given in the Appendix.More information can be obtained on our website [8].

Benchmarks
A data compression benchmark measures compression ratio over a data set, and sometimes memory usage and speed on a particular computer.Some benchmarks evaluate only file size, in order to avoid hardware dependencies.Compression ratio is often measured by the size of the compressed output file, or in bits per character (bpc), meaning compressed bits per uncompressed byte.In either case, smaller numbers are better: 8 bpc means no compression, while 6 bpc means 25% compression or 75% of original size [1].

Calgary Corpus
The Calgary corpus is the oldest compression benchmark still in use.It was created in 1987 and described in a survey of text compression models in 1989 [9].It consists of 14 files with a total size of 3,141,622 bytes.These files are listed in Table 1

Large Text Compression
The Large Text Compression Benchmark consists of a single Unicode encoded XML file containing a dump of Wikipedia text from March 3, 2006, truncated to 1,000,000,000 bytes (the enwik9 file).Its stated goal is to encourage research into artificial intelligence, specifically, natural language processing.As of February 2010, 128 different programs (889 including different versions and options) had been evaluated for compressed file size (including the decompression program source or executable and any other needed files as a zip archive), speed, and memory usage.The benchmark is open, meaning that anyone can submit results [1].
Programs are ranked by compressed size with options selecting maximum compression where applicable.The best result obtained is 127,784,888 bytes by Dmitry Shkarin on July 21, 2009 for a customized version of Durilca using 13 GB memory.It took 1,398 seconds to compress and 1,797 seconds to decompress using a size-optimized decompression program on a 3.8 GHz quad core Q9650 with 16 GB memory under 64 bit Windows XP Pro.The data was preprocessed with a custom dictionary built from the benchmark and encoded with order 40 PPM.Durilca is a modified version of PPMonstr by the same author.PPMonstr is a slower but better at compressing than the PPMd program which is used for maximum compression in several archivers such as RAR, WinZip, 7-Zip, and FreeArc [1].

Maximum Compression
The maximum compression benchmark [10] has two parts: a set of 10 public files totaling 53 MB, and a private collection of 510 files totaling 301 MB.In the public data set (SFC or single file compression), each file is compressed separately and the sizes added.Programs are ranked by size only, with options set for best compression individually for each file.The set consists of the following 10 files as in Table 2:  (.txt, .exe, .dll, .pdf).WinRK uses a dictionary which is not included in the total size.208 programs are ranked.Zip 2.2 is ranked 163 rd , with a size of 14,948,761 bytes [10].
In the second benchmark or MFC (multiple file compression), programs are ranked by size, compression speed, decompression speed, and by a formula that combines size and speed with time scaled logarithmically.The data is not available for download.Files are compressed together to a single archive.If a compressor cannot create archives, then the files are collected into an uncompressed archive (TAR or QFC), which is then compressed.In the MFC test, paq8px is top ranked by size.FreeArc is top ranked by combined score, followed by NanoZip, WinRAR and 7-Zip.All are archivers that detect file types and apply different algorithms depending on the type [10].

Benchmark
Table 3 shows the experimental results using Durilca and PAQ8px to compress the English Wikipedia file and the Calgary Corpus.Table 4 shows the maximum compression benchmark result which used WinRK3.1.2and PAQ8px.

Discussion and Conclusions
The test data for the Large Text Compression Benchmark is the first 1,000,000,000 bytes of the English Wikipedia dump on March 3, 2006.It is 1.1 GB or 4.8 GB after decompressing with bzip2 [11].Results are also given for the first 108 bytes, which is also used for the Hutter Prize [12].These files can be downloaded from the prize website [12].Durilca is the number one and can compress it to 127,377,411 bytes [13].Now, using our algorithm, 123 bytes of redundancy can be eliminated in the 127,377,411 byte Durilca file.
This communication proposes a unique coding method.Its theoretical foundation is under study, but the method is already workable.It is different from other loss-less compression algorithms, so, it can compress the data which has processed by those other algorithms too.Using the compression benchmark samples, compared with the current technologies, the new method can not only compress source data without losses but also further compress data that have already been compressed by the other algorithms.Table 1 and 2 had been shown that at least 1% redundancy can be removed from the compressed data.As we know that the redundancy in the compressed data is hard to eliminate because of the information theory limit.Dr. Matt's PAQ is an excellent data compression algorithm, and ranks first in the compression field.As for the other algorithms, it is very hard to find the redundancy in the data after PAQ compression, but our algorithm can do it in some types of data, like the Calgary Corpus data, EXECUTABLE data or MS-WORD DOC FILE data.It has the same effect as WinRK.The new coding method also features a simple algorithm and is easy to carry out.The disadvantage is that the theoretical foundation still needs study, and is the subject of ongoing research.

Table 1 .
: Calgary Corpus files and their descriptions.

Table 2 .
Maximum Compression test files and its description (unit: byte).As of December 31, 2009 the top ranked program, giving a total size of 8,813,124 bytes, is PAQ8px, a context mixing algorithm with specialized models for JPEG images, BMP images, X86 code, text, and structured binary data.WinRK 3.1.2,another context mixing algorithm, is top ranked on four of the files