# A Survey on Data Compression Methods for Biological Sequences


## Abstract


## 1. Introduction

## 2. Protein Sequence Compression

For example, three proteome files, compressed individually with the `7z` compressor, would need 15.9 MB of storage; however, when the three files are compressed together, only 8.6 MB are needed. Therefore, with the increasing number of sequenced genomes and the availability of multiple proteomes, reference-based compression of proteins will play a key role. In the following paragraphs, several reference-free protein compression methods are described.
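The gain from joint compression can be reproduced in spirit with a small experiment; the sketch below uses synthetic, mutated copies of a random "proteome" and Python's `lzma` module as a stand-in for `7z` (all names and data are illustrative, not the actual proteomes of the example above):

```python
import lzma
import random

# Build three hypothetical, highly similar proteomes: mutated copies of a
# shared random backbone over the 20 amino-acid alphabet.
random.seed(0)
AA = 'ACDEFGHIKLMNPQRSTVWY'
backbone = ''.join(random.choice(AA) for _ in range(100_000))

def mutate(seq, rate=0.01):
    """Return a copy of seq with ~rate of positions substituted."""
    return ''.join(random.choice(AA) if random.random() < rate else c for c in seq)

proteomes = [mutate(backbone) for _ in range(3)]

# Compress each file separately vs. all three concatenated together.
separate = sum(len(lzma.compress(p.encode())) for p in proteomes)
joint = len(lzma.compress(''.join(proteomes).encode()))
print(separate, joint)  # joint exploits cross-file redundancy
```

Because the compressor's search window spans all three sequences in the joint case, the second and third copies are encoded mostly as references to the first, mirroring the 15.9 MB vs. 8.6 MB observation.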

The `CP` (Compress Protein) scheme, proposed in [17] as the first protein compression algorithm, employs probabilities to weight all contexts, up to a certain maximum length, based on their similarity to the current context. The `XM` (eXpert Model) method, presented in [9], encodes each symbol by estimating its probability based on the previous symbols. When a symbol is recognized as part of a repeat, the previous occurrences are used to find that symbol's probability distribution; the symbol is then encoded using arithmetic coding. In [13], three statistical models are proposed, based on analysing the correlations between amino acids at short and medium range.

In [18], an `LZ77`-type algorithm is used to encode long approximate repeats, and the `CTW` algorithm [20] is used to encode short approximate repeats; heuristics are also used to improve the compression ratio. The `ProtComp` algorithm, presented in [19], exploits approximate repeats and the mutation probabilities of amino acids to build an optimal substitution probability matrix (including the mutation probabilities).

The `ProtCompSecS` method, presented in [21], considers the encoding of protein sequences in connection with their annotated secondary structure information. This method combines the `ProtComp` algorithm [19] with a dictionary-based method that uses the `DSSP` (Dictionary of Protein Secondary Structure) database [23]. The `CaBLASTP` algorithm, proposed in [22], introduces a coarse database to store unique data coming from the original protein sequence database. In this method, sequence segments that align to previously seen sequences are added to a link index.

`CaBLASTP` is the combination of a dictionary-based compression algorithm and a sequence alignment algorithm. In [10], a dictionary-based scheme is proposed, which is based on the reduction of random or repeated protein sequences and on `ASCII` replacement.

## 3. Genomic Sequence Compression

#### 3.1. Reference-Free Methods

The `biocompress` algorithm [29] is based on the Ziv and Lempel compression method [30]: factors (repeats) and complementary factors (palindromes) in the target sequence are detected and then encoded using the length and the position of their earliest occurrences. The `biocompress-2` algorithm [31], an extension of `biocompress` [29], exploits the same methodology, adding order-2 arithmetic coding for regions where no significant repetition is found.

The `Cfact` algorithm [32] consists of two phases: (i) parsing; and (ii) encoding. In the parsing phase, the most significant repeats, i.e., the repeats whose encoding yields a local compression gain, are selected; to find such repeats, a suffix tree data structure [33] is used to locate the longest repeats. In the encoding phase, all non-repeats and first occurrences of repeats are encoded using two bits per base, while subsequent occurrences of repeats are encoded as pointers to their first occurrences, in the form $(\mathit{position},\mathit{length})$.
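The parse-and-point scheme above can be illustrated with a much-simplified sketch: a naive quadratic search stands in for the suffix tree, and Python tuples stand in for the 2-bit literal codes and $(\mathit{position},\mathit{length})$ pointers:

```python
def encode(seq, min_len=8):
    """Greedy sketch: emit ('lit', base) for new data and
    ('ptr', position, length) for later occurrences of repeats."""
    out, i = [], 0
    while i < len(seq):
        best_pos, best_len = 0, 0
        # Naive search for the longest earlier, non-overlapping match
        # (a suffix tree makes this efficient in the real algorithm).
        for j in range(i):
            l = 0
            while i + l < len(seq) and j + l < i and seq[j + l] == seq[i + l]:
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len >= min_len:          # only repeats that pay off
            out.append(('ptr', best_pos, best_len))
            i += best_len
        else:                            # costs 2 bits per base in Cfact
            out.append(('lit', seq[i]))
            i += 1
    return out

def decode(tokens):
    s = ''
    for t in tokens:
        if t[0] == 'lit':
            s += t[1]
        else:
            _, pos, length = t
            s += s[pos:pos + length]
    return s

seq = 'ACGTACGTTTACGTACGTGG'
assert decode(encode(seq)) == seq
```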

The `GenCompress` algorithm [34] was followed by the `DNACompress` algorithm [35]. In `GenCompress` [34], non-initial occurrences of repeats are encoded by their position and length. In `DNACompress` [35], two phases are considered: (i) finding all approximate repeats, including palindromes, using the `PatternHunter` search tool [36]; and (ii) encoding approximate repeats and non-repeats using the Ziv and Lempel compression method [30].

`NMLComp` [37] and `GeNML` [38] are two methods that utilize the normalized maximum likelihood (`NML`) model. The `NMLComp` algorithm [37] proposes a version of the `NML` model for discrete regression, for the purpose of encoding the approximate repeats, and then combines it with a first-order Markov model. The `GeNML` algorithm [38] introduces the following improvements over the methodology used in `NMLComp`: (i) restricting the approximate-repeat matches, to reduce the cost of searching for previous matches and to obtain a more efficient `NML` model; (ii) choosing the block sizes used for parsing the target sequence; and (iii) introducing scalable forgetting factors for the memory model.

The `WBTC` coding scheme [39] is used in the `DNACompact` algorithm [40]. In the first phase of this method, the target sequence is converted into words: the bases A, T, C and G are replaced with A, C, $<\mathit{space}>A$ and $<\mathit{space}>C$, respectively, i.e., the four-symbol space is transformed into a three-symbol space. In the second phase, the obtained sequence is encoded by `WBTC`. The advantage of `WBTC` is that it does not require saving frequencies or codewords with the compressed stream, since the codes of the words rely only on their ranks.
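The first-phase transform of `DNACompact` (four symbols onto three) can be sketched directly from the mapping given above; the helper names are illustrative:

```python
# DNACompact first-phase transform: A -> "A", T -> "C", C -> " A", G -> " C",
# so the resulting text uses only the three symbols {A, C, space} and forms
# space-separated "words" for the word-based coder.
FWD = {'A': 'A', 'T': 'C', 'C': ' A', 'G': ' C'}
INV = {'A': 'A', 'C': 'T', ' A': 'C', ' C': 'G'}

def to_words(dna):
    return ''.join(FWD[b] for b in dna)

def from_words(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == ' ':               # two-character code
            out.append(INV[text[i:i + 2]])
            i += 2
        else:                            # one-character code
            out.append(INV[text[i]])
            i += 1
    return ''.join(out)

s = 'ACGT'
t = to_words(s)                          # 'A A CC'
assert t == 'A A CC' and from_words(t) == s
```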

The `DNAEnc3` method employs several competing Markov models of different orders to obtain the probability distribution of the symbols in the sequences. The advantages of this method include: (i) a flexible programming technique that can handle models of order up to sixteen; (ii) the handling of inverted repeats; and (iii) probability estimates that cover the broad range of context depths used.

The `POMA` algorithm, proposed in [42], is based on comprehensive learning particle swarm optimization (`CLPSO`) [43,44] and on an adaptive intelligent single particle optimizer (`AdpISPO`)-based local search. In this method, an approximate repeat vector (`ARV`) codebook is designed and then optimized using `CLPSO` as well as `AdpISPO`, for compressing the target sequence. Candidate `ARV` codebooks, built from the approximate repeats with the fewest base variations, are encoded as particles in order to reach the optimal solution in `POMA`; the weighted fitness values are then used to select the leader particles in the swarm. Finally, an `AdpISPO`-based local search is exploited to fine-tune the leader particles.

The `DNA-COMPACT` (DNA COMpression based on a Pattern-Aware Contextual modeling Technique) algorithm [45] exploits complementary contextual models and consists of two phases. In the first phase, exact repeats and palindromes are searched for and then represented by a compact quadruplet. In the second phase, non-sequential contextual models are introduced to exploit the features of `DNA` sequences; the predictions of these models are then synthesized using a logistic regression model, which is shown to give less biased results than Bayesian averaging. It is worth noting that `DNA-COMPACT` can handle both reference-free and reference-based genome compression.

The `CoGI` (Compressing Genomes as an Image) algorithm [47] initially transforms the genomic sequence into a binary image (or bitmap); then, it uses a rectangular partition coding method [48] to compress that image. Finally, it exploits entropy coding for further compression of the encoded image as well as of the mismatches.

The `GeCo` algorithm [50], derived from [51,52], exploits a combination of context models of several orders for reference-free, as well as reference-based, genomic sequence compression. In this method, extended finite-context models (`XFCMs`), which are tolerant to substitution errors, are introduced. Furthermore, cache-hashes are employed in the high-order models to make the implementation of `GeCo` more flexible. The cache-hash uses a fixed hash function to simulate a specific structure, which is a middle point between a dictionary and a probabilistic model. To make `GeCo` flexible in terms of memory usage, the cache-hash keeps only the last hashed entries in memory; this way, the amount of memory required to run on any sequence is flexible and predictable.

#### 3.2. Reference-Based Methods

The `DNAzip` algorithm [55] was presented for compressing James Watson's genome; given the high similarity between human genomes, only the variation needs to be saved. In this method, the variation data are considered in three parts: (i) `SNPs` (single nucleotide polymorphisms), which are saved as a position value on a chromosome and the single nucleotide letter of the polymorphism; (ii) insertions of multiple nucleotides, which are saved as position values and the sequence of nucleotide letters to be inserted; and (iii) deletions of multiple nucleotides, which are saved as position values and the length of the deletion. Finally, the following techniques are used to further compress the target genome: variable-length integers for positions (`VINT`), delta positions (`DELTA`), `SNP` mapping (`DBSNP`) and k-mer partitioning (`KMER`).
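Two of the techniques named above, delta positions and variable-length integers, can be sketched as follows; a generic 7-bit varint with a continuation bit is assumed, and the positions are illustrative (the actual `VINT`/`DELTA` encodings in `DNAzip` may differ in detail):

```python
def delta_vint_encode(positions):
    """Delta-code sorted positions, then pack each delta as a varint:
    7 payload bits per byte, high bit set while more bytes follow."""
    out, prev = bytearray(), 0
    for p in sorted(positions):
        d = p - prev          # deltas are small when SNPs are clustered
        prev = p
        while True:
            byte = d & 0x7F
            d >>= 7
            out.append(byte | (0x80 if d else 0))
            if not d:
                break
    return bytes(out)

def delta_vint_decode(data):
    positions, cur, shift, val = [], 0, 0, 0
    for b in data:
        val |= (b & 0x7F) << shift
        if b & 0x80:          # continuation: more bytes of this delta
            shift += 7
        else:                 # delta complete: undo the delta coding
            cur += val
            positions.append(cur)
            val, shift = 0, 0
    return positions

snps = [1_000_003, 1_000_150, 1_002_000]   # hypothetical SNP positions
enc = delta_vint_encode(snps)
assert delta_vint_decode(enc) == snps
assert len(enc) < 3 * 4                    # smaller than three 32-bit ints
```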

In the `RLZ` algorithm [56], each genome is greedily parsed into factors, using the `LZ77` approach [30]. Then, through a single pass over the collection, the compression is done relative to the reference sequence, which is used as a dictionary. It is worth pointing out that `RLZ` supports random access to arbitrary substrings. Later, in [57], the authors presented an improved version of the `RLZ` method, which, firstly, uses non-greedy parsing and, secondly, exploits the correlation between the starting points of long factors and their positions in the reference sequence to improve the compression performance.

The `GRS` algorithm [58], which is based on the Unix `diff` utility, finds the longest subsequences that are identical in the reference and the target sequences. In this method, a similarity measure is calculated; if its value is larger than a given threshold, the difference between the reference and the target sequences is compressed with Huffman coding [59]; otherwise, the reference and target sequences are divided into smaller pieces and the same strategy, i.e., comparing the similarity measure against the threshold, is applied to those pieces.

The `GReEn` (Genome Re-sequencing Encoding) method, proposed in [60], exploits a probabilistic copy model. The probability distribution of each symbol in the target sequence is estimated either by an adaptive model (the copy model), which assumes that the symbols of the target sequence are a copy of the reference sequence, or by a static model relying on the frequencies of the symbols of the target sequence. These probability distributions are then sent to an arithmetic encoder [61]. `GReEn` includes two parameters that can be used to tune the performance: (i) the size of the k-mer used for searching copy models; and (ii) the number of prediction failures that the copy model can tolerate before it is restarted.

The `GDC` (Genome Differential Compressor) method [62], based on the `LZ77` approach [30], considers multiple genomes of the same species. It uses a number of extra reference phrases, extracted from other sequences, together with an `LZ`-parsing for the detection of approximate repeats. `GDC` has several variations, including normal, fast, advanced, advanced-R and ultra; `GDC`-ultra, which provides the maximum compression ratio among these variations, exploits multiple references. The `GDC 2` method [63], an improved version of `GDC`, exploits two-level `LZSS` factoring [64] to encode the sequences as a list of matches and literals; its advantage over `GDC` is the consideration of short matches after some longer ones. It is worth pointing out that both the `GDC` and `GDC 2` methods support random access to an individual sequence.

`COMRAD` (COMpression using RedundAncy of `DNA`) [66] is a dictionary construction method in which a disk-based technique, `RAY` [67], is exploited to identify exact repeats in large `DNA` datasets. In this method, the collection is repeatedly scanned to identify repeated substrings, which are then used to construct a corpus-wide dictionary. An advantage of `COMRAD` is its support for random access to individual sequences and subsequences, which removes the need to decode the whole dataset.

`PPM` method [69].

The `DNA-COMPACT` method [45], which can be used for both reference-free and reference-based genome compression, includes two phases. In the first phase, an adaptive mechanism, `rLZ`, is presented, which, firstly, considers a sliding window over the subsequences of the reference sequence and, secondly, searches in a bi-directional manner for the longest exact repeats in the current fragment of the target sequence. In the second phase, non-sequential contextual models are introduced and their predictions are synthesized using the logistic regression model.

In the `FRESCO` (Framework for REferential Sequence COmpression) method [70], a compressed suffix tree [65] of the reference sequence is exploited to match the prefixes of an input string with substrings of the reference sequence. In addition, three techniques are used to improve the compression performance: (i) selecting a good reference; (ii) rewriting the reference sequence; and (iii) second-order compression, i.e., reference-based compression of an already referentially compressed sequence.

The `ERGC` (Efficient Referential Genome Compressor) method, based on a divide-and-conquer strategy, is proposed in [73]. First, the reference and target sequences are divided into parts of equal size; then, a greedy, hashing-based algorithm is used to find one-to-one maps of similar regions between the two sequences; finally, the identical maps, as well as the dissimilar regions of the target sequence, are fed to the `PPMD` compressor [74] for further compression.

The `iDoComp` algorithm [75] consists of three phases. In the first phase, mapping generation, suffix arrays are exploited to parse the target sequence into the reference sequence. In the second phase, post-processing of the mapping, consecutive matches that can be merged into an approximate match are found and used to reduce the size of the mapping. Finally, in the third phase, entropy encoding, an adaptive arithmetic encoder is used for further compression of the mapping and the generation of the compressed file.

The `CoGI` method [47] can be used for reference-free as well as reference-based genome compression. In the latter case, a proper reference genome is first selected using two techniques, based on base co-occurrence entropy and on multi-scale entropy [76,77,78]. Next, the sequences are converted to bit-streams, using a static coding scheme, and the `XOR` (exclusive or) operation is applied between each target bit-stream and the reference bit-stream. The resulting bit-streams are then transformed into bit matrices, i.e., binary images (or bitmaps). Finally, the images are compressed using a rectangular partition coding method [48].
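The `XOR` step can be sketched as follows, assuming a static 2-bit code per base (the actual code used by `CoGI` may differ); identical regions between target and reference become runs of zeros, which the subsequent image/entropy coder compresses well:

```python
# Assumed static 2-bit code for the four bases (illustrative choice).
CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def xor_stream(target, reference):
    """XOR the 2-bit codes of target and reference, base by base;
    matching positions yield 0, mismatches yield a nonzero 2-bit value."""
    return [CODE[t] ^ CODE[r] for t, r in zip(target, reference)]

ref = 'ACGTACGT'
tgt = 'ACGTACGA'                 # one mismatch, at the last position
diff = xor_stream(tgt, ref)
assert diff[:-1] == [0] * 7 and diff[-1] != 0
```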

## 4. File Formats

#### 4.1. FASTA/Multi-FASTA

A sequence in FASTA format has two parts: (i) a header line, which starts with the “`>`” symbol; and (ii) the sequence data. The “Multi-FASTA” format, as shown in Figure 1, is the concatenation of different single-sequence FASTA files. In this section, several methods for compressing FASTA/Multi-FASTA files are described.
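A minimal reader for the two-part structure described above might look like this (an illustrative helper, not part of any of the surveyed tools):

```python
def read_multifasta(text):
    """Parse Multi-FASTA text into (header, sequence) pairs:
    a '>' line starts a record; subsequent lines are sequence data."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith('>'):
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records

demo = ">seq1 demo\nACGT\nACGA\n>seq2\nTTGA\n"
assert read_multifasta(demo) == [('seq1 demo', 'ACGTACGA'), ('seq2', 'TTGA')]
```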

The `BIND` algorithm [85] employs `7-Zip` for compressing the sequence headers. For the `DNA` reads, it uses bit-level encoding of the four nucleotide bases (A, T, C and G) and then splits the codes into two data streams. Next, it uses a unary coding scheme to represent the repeat patterns in these two data streams in a “block-length encoded” format. Finally, `BIND` uses the `LZMA` algorithm [86] to compress the block-length files.

The `DELIMINATE` (a combination of Delta encoding and progressive ELIMINATion of nuclEotide characters) method, proposed in [87], performs the compression in two phases. In the first phase, all non-ATGC characters, as well as all lower-case characters, are recorded, and a file without these characters is passed to the next step. In the second phase, the positions of the two most frequent nucleotide bases are delta encoded, and the two remaining bases are represented by a binary code. Finally, all generated files are further compressed with the `7-Zip` archiver.

The `MFCompress` method [90] exploits finite-context models, i.e., probabilistic models that estimate the probability of the next symbol in the sequence based on the k previous symbols (an order-k context). `MFCompress` encodes the sequence headers using single finite-context models and encodes the `DNA` sequences using multiple competing finite-context models [41], along with arithmetic coding.
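The order-k finite-context model idea can be sketched as follows; this minimal version uses Laplace smoothing and accumulates the ideal arithmetic-coding cost, -log2 P, per symbol (the actual model mixing in `MFCompress` is more elaborate):

```python
import math
from collections import defaultdict

def fcm_bits(seq, k=2, alphabet='ACGT'):
    """Estimate the coding cost (in bits) of seq under an adaptive
    order-k finite-context model with Laplace smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i in range(len(seq)):
        ctx = seq[max(0, i - k):i]                 # the k previous symbols
        total = sum(counts[ctx].values())
        p = (counts[ctx][seq[i]] + 1) / (total + len(alphabet))
        bits += -math.log2(p)                      # ideal arithmetic-coding cost
        counts[ctx][seq[i]] += 1                   # adaptive update after coding
    return bits

repetitive = 'ACGT' * 100
print(fcm_bits(repetitive, k=2))   # well below the 2 bits/base baseline
```

On a highly repetitive sequence the model quickly learns each context's successor, so the estimated cost falls far below the 2 bits/base needed by a fixed code.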

The `LEON` method [91] is based on a reference probabilistic de Bruijn Graph, which is built de novo from the reads and saved in a `Bloom` filter [92]. In this method, the reads are considered as paths in the de Bruijn Graph and are compressed by memorizing the anchoring k-mers and the list of bifurcations. Since the `LEON` scheme can exploit the same probabilistic de Bruijn Graph for encoding the quality scores, it can also be used for FASTQ files.

The `MetaCRAM` method, proposed in [93], integrates taxonomy identification, alignment, assembly and compression. For taxonomy identification, it exploits the `Kraken` method [94] to identify the mixture of species included in the sample. For alignment and assembly, `MetaCRAM` first uses `Bowtie2` [95] to align reads to the reference genomes, and then uses `IDBA-UD` [96] to find reference genomes for unaligned reads. Finally, `MetaCRAM` exploits Huffman [59], Golomb [97] and Extended Golomb [98] encodings for compressing the reads, and `QualComp` [99] for compressing the quality scores; unaligned reads are compressed with the reference-free `MFCompress` method [90]. The `MetaCRAM` method also supports the FASTQ file format.

#### 4.2. FASTQ

The FASTQ format stores both the sequences and their quality scores, using `ASCII` characters. This file format is considered the de facto standard for storing the output of high-throughput sequencing instruments [100]. A sequence in FASTQ format, as shown in Figure 2, has four parts: (i) the sequence identifier; (ii) the raw sequence letters; (iii) a “+” character, optionally followed by the same sequence identifier; and (iv) the quality scores. The following methods support the compression of FASTQ files.
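A minimal parser for the four-part record layout described above (illustrative only):

```python
def read_fastq(text):
    """Parse FASTQ text into (identifier, bases, qualities) triples,
    assuming the common four-lines-per-record layout."""
    lines = [l for l in text.splitlines() if l]
    records = []
    for i in range(0, len(lines), 4):
        ident, bases, plus, quals = lines[i:i + 4]
        assert ident.startswith('@') and plus.startswith('+')
        records.append((ident[1:], bases, quals))
    return records

demo = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFA\n"
assert read_fastq(demo) == [('read1', 'ACGT', 'IIII'),
                            ('read2', 'TTGA', 'FFFA')]
```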

The `GenCompress` method [101] first maps short sequences to a reference genome; then, it encodes the addresses of the short sequences, their lengths and their probable substitutions, using entropy coding algorithms, e.g., Golomb [97], Elias Gamma [102], `MOV` (Monotone Value) [103] or Huffman coding [59]. Similarly to `GenCompress`, the `G-SQZ` scheme [104] employs Huffman coding; however, it performs the compression without altering the relative order of the records. Also, `G-SQZ` supports random access to the compressed data.

The `DSRC` [54] and `DSRC 2` [105] methods follow a block-based approach. The `DSRC` method [54] groups blocks into superblocks; then, using the statistics of the blocks inside each superblock, it encodes each superblock independently. In this method, `LZ77`-style encoding, order-0 Huffman encoding and order-1 Huffman encoding are used for compressing the identifiers, the `DNA` reads and the quality scores, respectively. `DSRC 2` [105] is an improved version of `DSRC`, implemented using multi-threading. In this approach, Huffman encoding, as well as arithmetic encoding [106], were added to the existing algorithm used for compressing the `DNA` reads in `DSRC`; also, the quality scores can be compressed arithmetically.

`DNA` reads. Finally, it uses run-length-based methods for compressing the quality scores.

In the `Quip` scheme [108], arithmetic coding combined with models based on order-3 and higher-order Markov chains is exploited for compressing the read identifiers, the nucleotide sequences and the quality scores. This method has some variations, including “`Quip -r`” and “`Quip -a`”, which perform reference-based and assembly-based compression, respectively. In the assembly-based compression, the first 2.5 million reads are employed to assemble contigs, which can then be used instead of a reference sequence database. It is worth pointing out that `Quip` also supports the `SAM/BAM` file format.

The `SCALCE` method, proposed in [109], re-organizes the reads to obtain a higher compression ratio and speed. To perform this re-organization, long “core” substrings, derived from the locally consistent parsing (`LCP`) method [110,111,112] and shared between the reads, are considered. The reads are then “bucketed” and compressed together, using Lempel–Ziv variants within each bucket.

The `SeqDB` scheme [113] performs the compression in two phases. In the first phase, a byte-packing method, named `SeqPack`, interleaves the `DNA` reads with their corresponding quality scores by exploiting a `2D` array; using this array, only one byte is required to store a base and its quality score, instead of the two bytes used by `FASTQ`. In the second phase, the `Blosc` method [114] is exploited to compress the interleaved data; this method benefits from multi-threading, cache-aware blocking and parallelization through `SIMD` vectorization [115].
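The one-byte interleaving idea of `SeqPack` can be sketched as follows; the exact bit layout (2 bits for the base, 6 bits for the quality score) is an assumption made here for illustration:

```python
# One byte per (base, quality) pair: 2 high bits select the base and the
# 6 low bits store the quality score (0-63, i.e. Phred+33 ASCII '!'..'`').
BASES = 'ACGT'

def pack(read, quals):
    return bytes((BASES.index(b) << 6) | (ord(q) - 33)
                 for b, q in zip(read, quals))

def unpack(blob):
    read = ''.join(BASES[x >> 6] for x in blob)
    quals = ''.join(chr((x & 0x3F) + 33) for x in blob)
    return read, quals

r, q = 'ACGT', 'IIB!'
assert len(pack(r, q)) == 4                 # half the size of the raw pair
assert unpack(pack(r, q)) == (r, q)
```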

In [116], two methods, `Fqzcomp` and `Fastqz`, are proposed for compressing files in `FASTQ` format. The `Fqzcomp` method exploits a byte-wise arithmetic coder [117], together with context models tuned to the type of data. The `Fastqz` method employs the “`libzpaq`” compression library to specify the context models in `ZPAQ` format [118]. `ZPAQ` is based on `PAQ` [119], which adaptively combines the bit-wise predictions of several context models.

The `ORCOM` (Overlapping Reads COmpression with Minimizers) method [120] distributes the data into disk bins and compresses each bin; it uses “minimizers” [121] to place similar (overlapping) `DNA` reads into the same bin [94,122,123,124,125]. To compress the streams of data, on a bin-by-bin basis, `ORCOM` employs a context-based compressor, `PPMd` [126], or an arithmetic coder [106].

The `LW-FQZip` method, presented in [127], first eliminates redundancy from the FASTQ file. Then, it employs incremental and run-length-limited methods for compressing the identifiers and the quality scores, respectively. Next, to encode the reads, a light-weight mapping model is introduced, which maps them against a reference sequence. Finally, the `LZMA` algorithm [86] is utilized to compress all the processed data.

#### 4.3. SAM/BAM

A SAM file stores sequence alignment data in a number of fields, such as `QNAME`, `CIGAR`, `SEQ` and `QUAL`. It is worth pointing out that a “BAM” file is the binary version of a SAM file, compressed with the `BGZF` (Blocked `GNU` Zip Format) tool [3,128,129]. In this section, several methods proposed for compressing SAM/BAM files are described.

The `Goby` approach [131] introduces multiple techniques for compressing structured high-throughput sequencing (`HTS`) data: (i) field reordering; (ii) field modeling; (iii) template compression; and (iv) domain modeling. Also, `Goby` employs a number of engineering techniques, such as the protocol buffers middleware [132], a multi-tier data organization and a storage protocol for structured messages.

The `NGC` method [133] introduces a lossless approach for compressing read bases, namely vertical difference run-length encoding (`VDRLE`), which exploits the features of multiple mapped reads instead of considering each read individually, thereby reducing the number of required code words. `NGC` also introduces a lossy approach for compressing quality values, which maps all quality values within an interval to a single value. In addition, for the compression of the read names (the `QNAME` stream), `NGC` uses the longest common prefix of the last n reads.

The `Samcomp` method [116] has two variations, `Samcomp1` and `Samcomp2`. In `Samcomp1`, the same approach as in `Fqzcomp` [116] is used to compress the identifiers and the quality scores, while a per-coordinate model is used for compressing the reads. In `Samcomp2`, which can read `SAM` files in any order, a reference-difference model is exploited instead of the per-position model, using a history of previous bases matching the context.

The `DeeZ` method [134] exploits “tokenization” of the read names and compresses them with delta encoding. It also employs order-2 arithmetic coding, in a block-by-block manner, for compressing the quality scores and the mapping locations, thus providing random-access capability. It is worth pointing out that `gzip` [135] is used in `DeeZ` to compress the other fields.

## 5. Results

The methods were run on a machine with an Intel® Xeon® CPU E7320 and 256 GB of RAM; their performance was evaluated based on compression ratio, i.e., the ratio of the size of the original sequence to the size of the compressed sequence, compression/decompression time and compression/decompression memory usage. It is worth pointing out that some of the methods addressed in this paper are missing from the tables of results, for several reasons, which are described in [158].

The results for the `DELIMINATE` [87], `MFCompress` [90], `gzip` [135] and `LZMA` [86] methods, reported in Table 3, show that `MFCompress` outperforms the other methods in terms of compression ratio, except for the Rice dataset, which is the smallest of the datasets. In terms of the other performance parameters, namely compression/decompression time and compression/decompression memory, `gzip` outperforms the others.

The `Fqzcomp` [116], `Quip` [108], `DSRC` [54], `SCALCE` [109], `gzip` [135] and `LZMA` [86] methods have also been compared. Although the `SCALCE` method outperforms the other methods in terms of compression ratio, by making use of a non-reversible reordering of the reads, the `Fqzcomp` method is the best approach in terms of compression time. Also, `gzip`, which is a general-purpose method, performs better than the other methods in terms of decompression time, as well as compression/decompression memory usage.

For SAM/BAM data, the `DeeZ` [134] approach provides the best results in terms of compression ratio. However, in terms of the other compression metrics, namely compression/decompression time and compression/decompression memory usage, `gzip` performs better than the other methods. It is worth pointing out that the general-purpose algorithms in this comparison, i.e., `gzip` and `LZMA`, are unable to compress BAM files, since BAM files are already in a binary representation from which these methods cannot extract redundancy.

In terms of compression ratio, `iDoComp` [75] outperforms the other methods, including `GReEn` [60], `GeCo` [50] and `GDC 2` [63], in three cases: compressing HS8 with reference HSCHM8, HS11 with reference PT11, and RICE5 with reference RICE7. Also, `GDC 2` provides better results for the compression of HS11 with reference HSCHM11, as well as HSK16 with reference HS16, while `GeCo` outperforms the others for compressing HS11 with reference PA11. In terms of compression/decompression time, as well as decompression memory usage, the `GDC 2` approach has the best performance, except for three cases. Also, in terms of compression memory usage, the `iDoComp` approach performs better than the other approaches, except for one case. It is worth noting that, as clearly shown in the table, `GDC 2` has a prominent advantage over the other reference-based methods, namely, carrying out the compression and the decompression of sequences in far less time.

## 6. Discussion

## 7. Conclusions

For FASTA and Multi-FASTA data, `DELIMINATE` [87] and `MFCompress` [90] provide an effective increase in compression ratio, although, as expected, at the cost of some additional computational resources. Also, although `DELIMINATE` and `MFCompress` seem to provide equivalent performance (usually, `DELIMINATE` is faster and uses less memory, but does not compress as well as `MFCompress`), the choice of `MFCompress` for integration in a recent pipeline (`MetaCRAM` [93]) suggests that `MFCompress` is the current best choice for FASTA and Multi-FASTA data compression.

For FASTQ data, when the original read order does not need to be preserved, `SCALCE` [109] clearly seems to be the choice; read reordering is, in fact, a source of improvement in compression. If the original order needs to be preserved, then about $\log_2 n! \approx n\log_2 n - n\log_2 e$ additional bits are required for recovering the initial ordering of the n reads. If the aim is to always recover the original file after decompression, including the read ordering, then `Fqzcomp` [116] is the tool that currently offers the best performance, not only in terms of compression ratio, but also in terms of used resources. As in other cases, `gzip` [135] is the fastest method to decompress and also the one requiring the least memory to run, but it falls short in terms of compression ratio. On the other hand, `LZMA` [86] is competitive in terms of compression ratio, but at the cost of very high compression times.
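The reordering overhead formula can be checked numerically; the sketch below compares the exact log2(n!) with the approximation n*log2(n) - n*log2(e) for one million reads:

```python
import math

n = 1_000_000
# Exact: log2(n!) via the log-gamma function, since n! = Gamma(n + 1).
exact = math.lgamma(n + 1) / math.log(2)
# Stirling-style approximation used in the text.
approx = n * math.log2(n) - n * math.log2(math.e)

print(f"{exact / 8 / 1e6:.2f} MB to recover the ordering of {n} reads")
assert abs(exact - approx) / exact < 1e-3   # the approximation is very tight
```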

For SAM/BAM data, the `DeeZ` tool seems to be, currently, the best choice. Although it requires somewhat more memory than `Quip`, it provides a considerably better compression ratio for equivalent CPU times. Both `gzip` and `LZMA` fall short regarding compression ratio.

When the target and the reference sequences are close, `iDoComp` [75] and `GDC 2` [63] offer similar compression ratios, with `GDC 2` providing very fast decompression. However, when the “proximity” between the target and the reference is lower, `GeCo` provides about twice as much compression. One of the challenges here is that it is not always possible to know beforehand the degree of “proximity” between a certain target and reference; in fact, in a certain sense, reference-based compression provides an estimate of the degree of “proximity” between the two sequences, which is only available after compression is performed.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Muir, P.; Li, S.; Lou, S.; Wang, D.; Spakowicz, D.J.; Salichos, L.; Zhang, J.; Weinstock, G.M.; Isaacs, F.; Rozowsky, J.; et al. The real cost of sequencing: Scaling computation to keep pace with data generation. Genom. Biol. **2016**.
- Kahn, S. On the future of genomic data. Science **2011**, 331, 728–729.
- Alberti, C.; Mattavelli, M.; Hernandez, A.; Chiariglione, L.; Xenarios, I.; Guex, N.; Stockinger, H.; Schuepbach, T.; Kahlem, P.; Iseli, C.; et al. Investigation on Genomic Information Compression and Storage; ISO/IEC JTC 1/SC 29/WG 11 N15346; ISO: Geneva, Switzerland, 2015; pp. 1–28.
- Giancarlo, R.; Rombo, S.; Utro, F. Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief. Bioinform. **2014**, 15, 390–406.
- De Bruijn, N. A Combinatorial Problem. Available online: https://pure.tue.nl/ws/files/4442708/597473.pdf (accessed on 11 October 2016).
- Compeau, P.; Pevzner, P.; Tesler, G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. **2011**, 29, 987–991.
- Conway, T.; Bromage, A. Succinct data structures for assembling. Bioinformatics **2011**, 27, 479–486.
- Cao, M.; Dix, T.; Allison, L. A genome alignment algorithm based on compression. BMC Bioinform. **2010**, 11.
- Cao, M.; Dix, T.; Allison, L.; Mears, C. A simple statistical algorithm for biological sequence compression. In Proceedings of the DCC ’07: Data Compression Conference, Snowbird, UT, USA, 27–29 March 2007; pp. 43–52.
- Rafizul Haque, S.; Mallick, T.; Kabir, I. A new approach of protein sequence compression using repeat reduction and ASCII replacement. IOSR J. Comput. Eng. (IOSR-JCE) **2013**, 10, 46–51.
- Ward, M. Virtual Organisms: The Startling World of Artificial Life; Macmillan: New York, NY, USA, 2014.
- Wootton, J. Non-globular domains in protein sequences: Automated segmentation using complexity measures. Comput. Chem. **1994**, 18, 269–285.
- Benedetto, D.; Caglioti, E.; Chica, C. Compressing proteomes: The relevance of medium range correlations. EURASIP J. Bioinform. Syst. Biol. **2007**, 2007.
- Yu, J.; Cao, Z.; Yang, Y.; Wang, C.; Su, Z.; Zhao, Y.; Wang, J.; Zhou, Y. Natural protein sequences are more intrinsically disordered than random sequences. Cell. Mol. Life Sci. **2016**, 73, 2949–2957.
- The Human Proteome Project. Available online: http://www.thehpp.org (accessed on 11 October 2016).
- Three sequenced Neanderthal genomes. Available online: http://cdna.eva.mpg.de/neandertal (accessed on 11 October 2016).
- Nevill-Manning, C.; Witten, I. Protein is incompressible. In Proceedings of the DCC ’99: Data Compression Conference, Snowbird, UT, USA, 29–31 March 1999; pp. 257–266.
- Matsumoto, T.; Sadakane, K.; Imai, H. Biological sequence compression algorithms. Genom. Inform. **2000**, 11, 43–52.
- Hategan, A.; Tabus, I. Protein is compressible. In Proceedings of the 6th Nordic Signal Processing Symposium, Espoo, Finland, 9–11 June 2004; pp. 192–195.
- Willems, F.; Shtarkov, Y.; Tjalkens, T. The context tree weighting method: Basic properties. IEEE Trans. Inf. Theory **1995**, 41, 653–664.
- Hategan, A.; Tabus, I. Jointly encoding protein sequences and their secondary structure. In Proceedings of the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2007), Tuusula, Finland, 10–12 June 2007.
- Daniels, N.; Gallant, A.; Peng, J.; Cowen, L.; Baym, M.; Berger, B. Compressive genomics for protein databases. Bioinformatics **2013**, 29, 283–290.
- Kabsch, W.; Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers **1983**, 22, 2577–2637.
- Hayashida, M.; Ruan, P.; Akutsu, T. Proteome compression via protein domain compositions. Methods **2014**, 67, 380–385.
- Giancarlo, R.; Scaturro, D.; Utro, F. Textual data compression in computational biology: Algorithmic techniques. Comput. Sci. Rev. **2012**, 6, 1–25.
- Zhu, Z.; Zhang, Y.; Ji, Z.; He, S.; Yang, X. High-throughput DNA sequence data compression. Brief. Bioinform. **2013**, 16.
- Bakr, N.; Sharawi, A. DNA lossless compression algorithms: Review. Am. J. Bioinform. Res. **2013**, 3, 72–81.
- Wandelt, S.; Bux, M.; Leser, U. Trends in genome compression. Curr. Bioinform. **2014**, 9, 315–326.
- Grumbach, S.; Tahi, F. Compression of DNA sequences. In Proceedings of the DCC ’93: Data Compression Conference, Snowbird, UT, USA, 30 March–2 April 1993; pp. 340–350.
- Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory **1977**, 23, 337–343.
- Grumbach, S.; Tahi, F. A new challenge for compression algorithms: Genetic sequences. Inf. Process. Manag. **1994**, 30, 875–886.
- Rivals, E.; Delahaye, J.; Dauchet, M.; Delgrange, O. A guaranteed compression scheme for repetitive DNA sequences. In Proceedings of the DCC ’96: Data Compression Conference, Snowbird, UT, USA, 31 March–3 April 1996; p. 453.
- Ukkonen, E. On-line construction of suffix trees. Algorithmica **1995**, 14, 249–260.
- Chen, X.; Kwong, S.; Li, M.; Delgrange, O. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the 4th Annual International Conference on Research in Computational Molecular Biology (RECOMB ’00), Tokyo, Japan, 8–11 April 2000; pp. 107–117.
- Chen, X.; Li, M.; Ma, B. DNACompress: Fast and effective DNA sequence compression. Bioinformatics **2002**, 18, 1696–1698.
- Ma, B.; Tromp, J.; Li, M. PatternHunter: Faster and more sensitive homology search. Bioinformatics **2002**, 18, 440–445.
- Tabus, I.; Korodi, G.; Rissanen, J. DNA sequence compression using the normalized maximum likelihood model for discrete regression. In Proceedings of the DCC ’03: Data Compression Conference, Snowbird, UT, USA, 25–27 March 2003; pp. 253–262.
- Korodi, G.; Tabus, I. An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Trans. Inf. Syst. **2005**, 23, 3–34.
- Gupta, A.; Agarwal, S. A scheme that facilitates searching and partial decompression of textual documents. Int. J. Adv. Comput. Eng. **2008**, 1, 99–109.
- Gupta, A.; Agarwal, S. A novel approach for compressing DNA sequences using semi-statistical compressor. Int. J. Comput. Appl. **2011**, 33, 245–251.
- Pinho, A.; Ferreira, P.; Neves, A.; Bastos, C. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE **2011**, 6, e21588.
- Zhu, Z.; Zhou, J.; Ji, Z.; Shi, Y. DNA sequence compression using adaptive particle swarm optimization-based memetic algorithm. IEEE Trans. Evolut. Comput. **2011**, 15, 643–658.
- Liang, J.; Suganthan, P.; Deb, K. Novel composition test functions for numerical global optimization. In Proceedings of the IEEE Swarm Intelligence Symposium (SIS 2005), Pasadena, CA, USA, 8–10 June 2005; pp. 68–75.
- Liang, J.; Qin, A.; Suganthan, P.; Baskar, S. Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evolut. Comput. **2006**, 10, 281–295.
- Li, P.; Wang, S.; Kim, J.; Xiong, H.; Ohno-Machado, L.; Jiang, X. DNA-COMPACT: DNA compression based on a pattern-aware contextual modeling technique. PLoS ONE **2013**, 8, e80377.
- Guo, H.; Chen, M.; Liu, X.; Xie, M. Genome compression based on Hilbert space filling curve. In Proceedings of the 3rd International Conference on Management, Education, Information and Control (MEICI 2015), Shenyang, China, 29–31 May 2015; pp. 1685–1689.
- Xie, X.; Zhou, S.; Guan, J. CoGI: Towards compressing genomes as an image. IEEE/ACM Trans. Comput. Biol. Bioinform. **2015**, 12, 1275–1285.
- Mohamed, S.; Fahmy, M. Binary image compression using efficient partitioning into rectangular regions. IEEE Trans. Commun. **1995**, 43, 1888–1893.
- Chen, M.; Chen, J.; Zhang, Y.; Tang, M. Optimized context weighting based on the least square algorithm. In Wireless Communications, Networking and Applications, Proceedings of the 2014 International Conference on Wireless Communications, Networking and Applications (WCNA 2014), Shenzhen, China, 27–28 December 2014; Zeng, Q.-A., Ed.; Lecture Notes in Electrical Engineering; Springer: Berlin/Heidelberg, Germany, 2016; Volume 348, pp. 1037–1045.
- Pratas, D.; Pinho, A.; Ferreira, P. Efficient compression of genomic sequences. In Proceedings of the DCC ’16: Data Compression Conference, Snowbird, UT, USA, 30 March–1 April 2016; pp. 231–240.
- Pinho, A.J.; Pratas, D.; Ferreira, P.J. Bacteria DNA sequence compression using a mixture of finite-context models. In Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011; pp. 125–128.
- Pratas, D.; Pinho, A.J. Exploring deep Markov models in genomic data compression using sequence pre-analysis. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 2395–2399.
- Wandelt, S.; Leser, U. Adaptive efficient compression of genomes. Algorithms Mol. Biol. **2012**, 7.
- Deorowicz, S.; Grabowski, S. Compression of DNA sequence reads in FASTQ format. Bioinformatics **2011**, 27, 860–862.
- Christley, S.; Lu, Y.; Li, C.; Xie, X. Human genomes as email attachments. Bioinformatics **2009**, 25, 274–275.
- Kuruppu, S.; Puglisi, S.; Zobel, J. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. String Process. Inf. Retr. **2010**, 6393, 201–206.
- Kuruppu, S.; Puglisi, S.; Zobel, J. Optimized relative Lempel-Ziv compression of genomes. Conf. Res. Pract. Inf. Technol. Ser. **2011**, 113, 91–98.
- Wang, C.; Zhang, D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res. **2011**, 39, 5–10.
- Huffman, D. A method for the construction of minimum redundancy codes. Proc. IRE **1952**, 40, 1098–1101.
- Pinho, A.; Pratas, D.; Garcia, S. GReEn: A tool for efficient compression of genome resequencing data. Nucleic Acids Res. **2012**, 40.
- Rissanen, J. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev. **1976**, 20, 198–203.
- Deorowicz, S.; Grabowski, S. Robust relative compression of genomes with random access. Bioinformatics **2011**, 27, 2979–2986.
- Deorowicz, S.; Danek, A.; Niemiec, M. GDC 2: Compression of large collections of genomes. Sci. Rep. **2015**, 5.
- Storer, J.; Szymanski, T. Data compression via text substitution. J. ACM **1982**, 29, 928–951.
- Grossi, R.; Vitter, J. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing, Portland, OR, USA, 21–23 May 2000; pp. 397–406.
- Kuruppu, S.; Beresford-Smith, B.; Conway, T.; Zobel, J. Iterative dictionary construction for compression of large DNA data sets. IEEE/ACM Trans. Comput. Biol. Bioinform. **2012**, 9, 137–149.
- Cannane, A.; Williams, H. General-purpose compression for efficient retrieval. J. Assoc. Inf. Sci. Technol. **2001**, 52, 430–437.
- Dai, W.; Xiong, H.; Jiang, X.; Ohno-Machado, L. An adaptive difference distribution-based coding with hierarchical tree structure for DNA sequence compression. In Proceedings of the DCC ’13: Data Compression Conference, Snowbird, UT, USA, 20–22 March 2013; pp. 371–380.
- Cleary, J.; Witten, I. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. **1984**, 32, 396–402.
- Wandelt, S.; Leser, U. FRESCO: Referential compression of highly-similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. **2013**, 10, 1275–1288.
- Jung, S.; Sohn, I. Streamlined genome sequence compression using distributed source coding. Cancer Inform. **2014**, 13, 35–43.
- Pradhan, S.; Ramchandran, K. Distributed source coding using syndromes (DISCUS): Design and construction. IEEE Trans. Inf. Theory **2003**, 49, 626–643.
- Saha, S.; Rajasekaran, S. ERGC: An efficient referential genome compression algorithm. Bioinformatics **2015**, 31, 3468–3475.
- Moffat, A. Implementing the PPM data compression scheme. IEEE Trans. Commun. **1990**, 38, 1917–1921.
- Ochoa, I.; Hernaez, M.; Weissman, T. iDoComp: A compression scheme for assembled genomes. Bioinformatics **2015**, 31, 626–633.
- Costa, M.; Goldberger, A.; Peng, C. Multiscale entropy analysis of complex physiologic time series. Phys. Rev. Lett. **2002**, 89, 068102.
- Richman, J.; Moorman, J. Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. **2000**, 278, 2039–2049.
- Cosic, I. Macromolecular bioactivity: Is it resonant interaction between macromolecules?—Theory and applications. IEEE Trans. Biomed. Eng. **1994**, 41, 1101–1114.
- Hanus, P.; Dingel, J.; Chalkidis, G.; Hagenauer, J. Compression of whole genome alignments. IEEE Trans. Inf. Theory **2010**, 56, 696–705.
- Matos, L.; Pratas, D.; Pinho, A. A compression model for DNA multiple sequence alignment blocks. IEEE Trans. Inf. Theory **2013**, 59, 3189–3198.
- Matos, L.; Neves, A.; Pratas, D.; Pinho, A. MAFCO: A compression tool for MAF files. PLoS ONE **2015**, 10, e0116082.
- Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T.; et al. The variant call format and VCFtools. Bioinformatics **2011**, 27, 2156–2158.
- Layer, R.M.; Kindlon, N.; Karczewski, K.J.; Quinlan, A.R.; Exome Aggregation Consortium. Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods **2016**, 13, 63–65.
- Lipman, D.; Pearson, W. Rapid and sensitive protein similarity searches. Science **1985**, 227, 1435–1441.
- Bose, T.; Mohammed, M.; Dutta, A.; Mande, S. BIND—An algorithm for loss-less compression of nucleotide sequence data. J. Biosci. **2012**, 37, 785–789.
- LZMA. Available online: http://www.7-zip.org/sdk.html (accessed on 11 October 2016).
- Mohammed, M.; Dutta, A.; Bose, T.; Chadaram, S.; Mande, S. DELIMINATE—A fast and efficient method for loss-less compression of genomic sequences: Sequence analysis. Bioinformatics **2012**, 28, 2527–2529.
- Chen, W.; Lu, Y.; Lai, F.; Chien, Y.; Hwu, W. Integrating human genome database into electronic health record with sequence alignment and compression mechanism. J. Med. Syst. **2012**, 36, 2587–2597.
- Apostolico, A.; Fraenkel, A. Robust transmission of unbounded strings using Fibonacci representations. IEEE Trans. Inf. Theory **1987**, 33, 238–245.
- Pinho, A.; Pratas, D. MFCompress: A compression tool for FASTA and multi-FASTA data. Bioinformatics **2013**, 30, 117–118.
- Benoit, G.; Lemaitre, C.; Lavenier, D.; Drezen, E.; Dayris, T.; Uricaru, R.; Rizk, G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform. **2015**, 16.
- Kirsch, A.; Mitzenmacher, M. Less hashing, same performance: Building a better Bloom filter. J. Random Struct. Algorithms **2008**, 33, 187–218.
- Kim, M.; Zhang, X.; Ligo, J.G.; Farnoud, F.; Veeravalli, V.V.; Milenkovic, O. MetaCRAM: An integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinform. **2016**, 17.
- Wood, D.; Salzberg, S. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genom. Biol. **2014**, 15.
- Langmead, B.; Salzberg, S. Fast gapped-read alignment with Bowtie 2. Nat. Methods **2012**, 9, 357–359.
- Peng, Y.; Leung, H.C.; Yiu, S.M.; Chin, F.Y. IDBA-UD: A de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics **2012**, 28, 1420–1428.
- Golomb, S. Run-length encodings. IEEE Trans. Inf. Theory **1966**, 12, 399–401.
- Somasundaram, K.; Domnic, S. Extended Golomb code for integer representation. IEEE Trans. Multimed. **2007**, 9, 239–246.
- Ochoa, I.; Asnani, H.; Bharadia, D.; Chowdhury, M.; Weissman, T.; Yona, G. QualComp: A new lossy compressor for quality scores based on rate distortion theory. BMC Bioinform. **2013**, 14.
- Cock, P.; Fields, C.; Goto, N.; Heuer, M.; Rice, P. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. **2009**, 38, 1767–1771.
- Daily, K.; Rigor, P.; Christley, S.; Xie, X.; Baldi, P. Data structures and compression algorithms for high-throughput sequencing technologies. BMC Bioinform. **2010**, 11.
- Elias, P. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory **1975**, 21, 194–203.
- Baldi, P.; Benz, R.; Hirschberg, D.; Swamidass, S. Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. J. Chem. Inf. Model. **2007**, 47, 2098–2109.
- Tembe, W.; Lowey, J.; Suh, E. G-SQZ: Compact encoding of genomic sequence and quality data. Bioinformatics **2010**, 26, 2192–2194.
- Roguski, L.; Deorowicz, S. DSRC 2—Industry-oriented compression of FASTQ files. Bioinformatics **2014**, 30, 2213–2215.
- Salomon, D.; Motta, G. Handbook of Data Compression; Springer: Berlin, Germany, 2010.
- Bhola, V.; Bopardikar, A.; Narayanan, R.; Lee, K.; Ahn, T. No-reference compression of genomic data stored in FASTQ format. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA, USA, 12–15 November 2011; pp. 147–150.
- Jones, D.; Ruzzo, W.; Peng, X.; Katze, M. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. **2012**, 40.
- Hach, F.; Numanagic, I.; Alkan, C.; Sahinalp, S. SCALCE: Boosting sequence compression algorithms using locally consistent encoding. Bioinformatics **2012**, 28, 3051–3057.
- Sahinalp, S.; Vishkin, U. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science (FOCS), Burlington, VT, USA, 14–16 October 1996; pp. 320–328.
- Cormode, G.; Paterson, M.; Sahinalp, S.; Vishkin, U. Communication complexity of document exchange. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, CA, USA, 9–11 January 2000; pp. 197–206.
- Batu, T.; Ergun, F.; Sahinalp, S. Oblivious string embeddings and edit distance approximations. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Miami, FL, USA, 22–24 January 2006; pp. 792–801.
- Howison, M. High-throughput compression of FASTQ data with SeqDB. IEEE/ACM Trans. Comput. Biol. Bioinform. **2013**, 10, 213–218.
- Alted, F. Available online: http://www.blosc.org (accessed on 11 October 2016).
- Alted, F. Why modern CPUs are starving and what can be done about it. Comput. Sci. Eng. **2010**, 12, 68–71.
- Bonfield, J.; Mahoney, M. Compression of FASTQ and SAM format sequencing data. PLoS ONE **2013**, 8, e59190.
- Shelwien, E. Available online: http://compressionratings.com/i_ctxf.html (accessed on 11 October 2016).
- Mahoney, M. Available online: http://mattmahoney.net/dc/zpaq.html (accessed on 11 October 2016).
- Mahoney, M. Adaptive Weighing of Context Models for Lossless Data Compression; Technical Report CS-2005-16; Florida Institute of Technology CS Department: Melbourne, FL, USA, 2005.
- Grabowski, S.; Deorowicz, S.; Roguski, L. Disk-based compression of data from genome sequencing. Bioinformatics **2014**, 31, 1389–1395.
- Roberts, M.; Hayes, W.; Hunt, B.; Mount, S.; Yorke, J. Reducing storage requirements for biological sequence comparison. Bioinformatics **2004**, 20, 3363–3369.
- Movahedi, N.; Forouzmand, E.; Chitsaz, H. De novo co-assembly of bacterial genomes from multiple single cells. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, PA, USA, 4–7 October 2012; pp. 1–5.
- Li, Y.; Kamousi, P.; Han, F.; Yang, S.; Yan, X.; Suri, S. Memory efficient minimum substring partitioning. In Proceedings of the 39th International Conference on Very Large Data Bases (VLDB 2013), Trento, Italy, 26–30 August 2013; pp. 169–180.
- Chikhi, R.; Limasset, A.; Jackman, S.; Simpson, J.; Medvedev, P. On the representation of de Bruijn graphs. In Proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2014), Pittsburgh, PA, USA, 2–5 April 2014; pp. 35–55.
- Deorowicz, S.; Kokot, M.; Grabowski, S.; Debudaj-Grabysz, A. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics **2015**, 31, 1569–1576.
- Shkarin, D. PPM: One step to practicality. In Proceedings of the DCC ’02: Data Compression Conference, Snowbird, UT, USA, 2–4 April 2002; pp. 202–211.
- Zhang, Y.; Li, L.; Yang, Y.; Yang, X.; He, S. Light-weight reference-based compression of FASTQ data. BMC Bioinform. **2015**, 16, 188.
- Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R. The Sequence Alignment/Map format and SAMtools. Bioinformatics **2009**, 25, 2078–2079.
- The SAM/BAM Format Specification Working Group. Sequence Alignment/Map Format Specification. Available online: https://samtools.github.io/hts-specs/SAMv1.pdf (accessed on 11 October 2016).
- Fritz, M.; Leinonen, R.; Cochrane, G.; Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genom. Res. **2011**, 21, 734–740.
- Campagne, F.; Dorff, K.; Chambwe, N.; Robinson, J.; Mesirov, J. Compression of structured high-throughput sequencing data. PLoS ONE **2013**, 8, e79871.
- Varda, K. PB. Available online: https://github.com/google/protobuf (accessed on 11 October 2016).
- Popitsch, N.; Von Haeseler, A. NGC: Lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res. **2013**, 41.
- Hach, F.; Numanagic, I.; Sahinalp, S. DeeZ: Reference-based compression by local assembly. Nat. Methods **2014**, 11, 1081–1082.
- gzip. Available online: http://www.gzip.org (accessed on 11 October 2016).
- Rebico. Available online: http://bioinformatics.ua.pt/software/rebico (accessed on 11 October 2016).
- Human (GRC). Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq (accessed on 11 October 2016).
- Chimpanzee. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Pan_troglodytes/Assembled_chromosomes/seq (accessed on 11 October 2016).
- Rice5. Available online: ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_5.0 (accessed on 11 October 2016).
- CAMERA Prokaryotic Nucleotide. Available online: ftp://ftp.imicrobe.us/camera/camera_reference_datasets/10572.V10.fa.gz (accessed on 11 October 2016).
- ERR174310_1. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR174/ERR174310/ERR174310_1.fastq.gz (accessed on 11 October 2016).
- ERR174310_2. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR174/ERR174310/ERR174310_2.fastq.gz (accessed on 11 October 2016).
- ERR194146_1. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz (accessed on 11 October 2016).
- ERR194146_2. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_2.fastq.gz (accessed on 11 October 2016).
- NA12877_S1. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12877_S1.bam (accessed on 11 October 2016).
- NA12878_S1. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam (accessed on 11 October 2016).
- NA12882_S1. Available online: ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12882_S1.bam (accessed on 11 October 2016).
- Homo sapiens, GRC Reference Assembly—Chromosome 8. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p7_chr8.fa.gz (accessed on 11 October 2016).
- Homo sapiens, CHM Reference Assembly—Chromosome 8. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq/hs_alt_CHM1_1.1_chr8.fa.gz (accessed on 11 October 2016).
- Homo sapiens, GRC Reference Assembly—Chromosome 11. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p7_chr11.fa.gz (accessed on 11 October 2016).
- Homo sapiens, CHM Reference Assembly—Chromosome 11. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq/hs_alt_CHM1_1.1_chr11.fa.gz (accessed on 11 October 2016).
- Pan troglodytes (Chimpanzee) Reference Assembly, v3.0—Chromosome 11. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Pan_troglodytes/Assembled_chromosomes/seq/ptr_ref_Pan_tro_3.0_chr11.fa.gz (accessed on 11 October 2016).
- Pongo abelii (Orangutan) Reference Assembly—Chromosome 11. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Pongo_abelii/Assembled_chromosomes/seq/pab_ref_P_pygmaeus_2.0.2_chr11.fa.gz (accessed on 11 October 2016).
- Homo sapiens, GRC Reference Assembly—Chromosome 16. Available online: ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh38.p7_chr16.fa.gz (accessed on 11 October 2016).
- Homo sapiens, Korean Reference—Chromosome 16. Available online: ftp://ftp.kobic.re.kr/pub/KOBIC-KoreanGenome/fasta/chromosome_16.fa.gz (accessed on 11 October 2016).
- Oryza sativa (Rice), v5.0. Available online: ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_5.0 (accessed on 11 October 2016).
- Oryza sativa (Rice), v7.0. Available online: ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0 (accessed on 11 October 2016).
- Pratas, D. Available online: https://raw.githubusercontent.com/pratas/rebico/master/methods.txt (accessed on 11 October 2016).
- Li, H. BGT: Efficient and flexible genotype query across many samples. Bioinformatics **2015**.
- Sambo, F.; Di Camillo, B.; Toffolo, G.; Cobelli, C. Compression and fast retrieval of SNP data. Bioinformatics **2014**, 30, 3078–3085.
- Cao, M.D.; Dix, T.I.; Allison, L. A genome alignment algorithm based on compression. BMC Bioinform. **2010**, 11.
- Pratas, D.; Silva, R.M.; Pinho, A.J.; Ferreira, P.J. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci. Rep. **2015**, 5.
- Beller, T.; Ohlebusch, E. Efficient construction of a compressed de Bruijn graph for pan-genome analysis. In Combinatorial Pattern Matching; Springer: Berlin, Germany, 2015.
- Baier, U.; Beller, T.; Ohlebusch, E. Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform. Bioinformatics **2015**, 32, 497–504.
- Pinho, A.J.; Garcia, S.P.; Pratas, D.; Ferreira, P.J. DNA sequences at a glance. PLoS ONE **2013**, 8, e79922.
- Wandelt, S.; Leser, U. MRCSI: Compressing and searching string collections with multiple references. Proc. VLDB Endow. **2015**, 8, 461–472.

| File Format | Dataset | File Size (MB) |
|---|---|---|
| FASTA/Multi-FASTA | Human (GRC) [137] | 2,987 |
| | Chimpanzee [138] | 3,179 |
| | Rice5 [139] | 166 |
| | CAMERA Prokaryotic Nucleotide [140] | 43,945 |
| FASTQ | ERR174310_1 [141] | 51,373 |
| | ERR174310_2 [142] | 51,373 |
| | ERR194146_1 [143] | 209,316 |
| | ERR194146_2 [144] | 209,316 |
| BAM | NA12877_S1 [145] | 74,455 |
| | NA12878_S1 [146] | 108,290 |
| | NA12882_S1 [147] | 88,318 |

| File Format | Target Genome | File Size (MB) | Reference Genome | File Size (MB) |
|---|---|---|---|---|
| FASTA/Multi-FASTA | HS8 [148] | 138 | HSCHM8 [149] | 136 |
| | HS11 [150] | 128 | HSCHM11 [151] | 125 |
| | HS11 [150] | 128 | PT11 [152] | 126 |
| | HS11 [150] | 128 | PA11 [153] | 127 |
| | HSK16 [154] | 71 | HS16 [155] | 78 |
| | Rice5 [156] | 166 | Rice7 [157] | 164 |

**Table 3.** Reference-free compression results of genomic sequence data for the FASTA/Multi-FASTA file format.

| Dataset | Method | C-Ratio | C-Time (min) | D-Time (min) | C-Mem (MB) | D-Mem (MB) |
|---|---|---|---|---|---|---|
| Human (GRC) (2.92 GB) | DELIMINATE [87] | 4.95 | 15.08 | 5.90 | 685.55 | 73.54 |
| | MFCompress [90] | 4.96 | 34.47 | 29.48 | 1,240.45 | 1,266.49 |
| | gzip [135] | 3.59 | 8.85 | 0.50 | 6.87 | 6.87 |
| | LZMA [86] | 4.39 | 117.70 | 1.26 | 199.52 | 31.58 |
| Chimpanzee (3.10 GB) | DELIMINATE | 5.36 | 16.64 | 6.62 | 683.31 | 73.55 |
| | MFCompress | 5.43 | 33.93 | 28.55 | 1,377.85 | 1,374.18 |
| | gzip | 3.85 | 8.79 | 0.52 | 6.87 | 6.87 |
| | LZMA | 4.68 | 118.31 | 1.26 | 199.52 | 31.58 |
| Rice5 (0.16 GB) | DELIMINATE | 5.22 | 0.59 | 0.33 | 368.11 | 43.41 |
| | MFCompress | 5.15 | 2.15 | 1.72 | 1,169.64 | 1,082.49 |
| | gzip | 3.28 | 0.53 | 0.03 | 6.87 | 6.87 |
| | LZMA | 4.46 | 6.71 | 0.07 | 199.71 | 25.79 |
| CAMERA (42.91 GB) | DELIMINATE | 5.07 | 198.24 | 85.38 | 684.66 | 687.44 |
| | MFCompress | 5.39 | 571.74 | 223.61 | 2,764.16 | 2,668.16 |
| | gzip | 3.52 | 141.33 | 7.66 | 6.87 | 6.87 |
| | LZMA | 4.51 | 1,720.09 | 16.48 | 199.71 | 31.58 |

Dataset | Method | C-Ratio | C-Time (min) | D-Time (min) | C-Mem (MB) | D-Mem (MB)
---|---|---|---|---|---|---
ERR174310_1 (50.17 GB) | Fqzcomp [116] | 4.76 | 49.58 | 64.79 | 77.21 | 67.31
 | Quip [108] | 4.76 | 95.36 | 140.09 | 395.16 | 389.22
 | DSRC [54] | 4.64 | 140.67 | 177.37 | 10,608.25 | 12,227.26
 | SCALCE [109] | 4.97 | 259.72 | 75.12 | 5,270.25 | 1,053.52
 | gzip [135] | 2.90 | 119.32 | 9.48 | 6.87 | 6.87
 | LZMA [86] | 3.61 | 1,731.28 | 28.59 | 199.52 | 31.58
ERR174310_2 (50.17 GB) | Fqzcomp | 4.93 | 51.79 | 65.47 | 77.27 | 67.24
 | Quip | 4.93 | 94.76 | 123.29 | 394.69 | 391.01
 | DSRC | 4.82 | 150.67 | 159.58 | 11,109.85 | 11,867.62
 | SCALCE | 5.16 | 233.46 | 73.51 | 5,320.82 | 1,047.57
 | gzip | 2.98 | 117.65 | 9.12 | 6.87 | 6.87
 | LZMA | 3.74 | 1,716.42 | 27.74 | 199.71 | 31.58
ERR194146_1 (204.41 GB) | Fqzcomp | 4.87 | 186.94 | 234.54 | 79.36 | 67.21
 | Quip | 4.78 | 326.87 | 438.08 | 397.96 | 392.16
 | DSRC | 4.72 | 490.42 | 637.80 | 11,288.26 | 11,965.20
 | SCALCE | 6.25 | 956.79 | 292.31 | 5,288.01 | 1,053.46
 | gzip | 3.94 | 273.20 | 33.28 | 6.87 | 6.87
 | LZMA | 4.84 | 6,369.49 | 88.22 | 199.52 | 31.58
ERR194146_2 (204.41 GB) | Fqzcomp | 4.85 | 166.93 | 221.19 | 61.49 | 65.18
 | Quip | 4.77 | 296.14 | 426.60 | 396.16 | 392.41
 | DSRC | 4.70 | 600.75 | 672.58 | 11,306.27 | 14,600.30
 | SCALCE | 6.18 | 950.32 | 291.13 | 5,336.13 | 1,058.48
 | gzip | 3.90 | 272.37 | 33.69 | 6.87 | 6.87
 | LZMA | 4.80 | 6,295.37 | 89.15 | 199.52 | 31.58

Dataset | Method | C-Ratio | C-Time (min) | D-Time (min) | C-Mem (MB) | D-Mem (MB)
---|---|---|---|---|---|---
NA12877_S1 (72.71 GB) | Deez [134] | 1.72 | 440.80 | 505.27 | 2,368.35 | 5,803.17
 | Quip [108] | 1.41 | 548.97 | 584.99 | 464.84 | 459.89
 | gzip [135] | 1.00 | 72.85 | 12.11 | 6.87 | 6.87
 | LZMA [86] | 0.99 | 865.02 | 133.36 | 199.45 | 31.58
NA12878_S1 (105.75 GB) | Deez | 1.73 | 668.89 | 752.77 | 2,586.70 | 5,584.94
 | Quip | 1.38 | 802.84 | 893.90 | 460.81 | 459.61
 | gzip | 1.00 | 109.88 | 17.68 | 6.87 | 6.87
 | LZMA | 0.99 | 1,261.44 | 193.97 | 199.52 | 31.58
NA12882_S1 (86.25 GB) | Deez | 1.79 | 553.17 | 641.44 | 2,699.24 | 5,773.42
 | Quip | 1.38 | 665.75 | 720.99 | 463.47 | 459.09
 | gzip | 1.00 | 86.44 | 14.40 | 6.87 | 6.87
 | LZMA | 0.99 | 1,024.85 | 158.24 | 199.45 | 31.58

Dataset | Method | C-Ratio ^{a} | C-Time (min) | D-Time (min) | C-Mem (MB) | D-Mem (MB)
---|---|---|---|---|---|---
HS8 (138 MB), Ref ^{b}: HSCHM8 | iDoComp [75] | 241.77 | 9.95 | 28.48 | 787.58 | 787.58
 | GReEn [60] | 102.98 | 242.99 | 247.32 | 1,124.38 | 1,110.83
 | GeCo [50] | 70.65 | 693.67 | 579.50 | 4,897.60 | 4,897.65
 | GDC 2 [63] | 239.72 | 2.89 | 0.07 | 1,513.25 | 166.84
 | GDC [73] | 4.47 | 3.44 | 0.22 | 12,256.34 | 12,256.34
HS11 (128 MB), Ref: HSCHM11 | iDoComp | 104.08 | 12.36 | 17.72 | 758.23 | 693.89
 | GReEn | 60.74 | 200.41 | 233.40 | 1,061.21 | 1,042.00
 | GeCo | 67.81 | 517.44 | 509.67 | 4,897.79 | 4,897.73
 | GDC 2 | 110.97 | 2.78 | 0.06 | 1,415.80 | 111.89
 | GDC | 4.30 | 4.16 | 0.22 | 11,855.81 | 11,855.81
HS11 (128 MB), Ref: PT11 | iDoComp | 29.19 | 29.06 | 19.15 | 781.27 | 781.27
 | GReEn | 11.99 | 252.55 | 279.45 | 1,078.43 | 1,062.30
 | GeCo | 24.77 | 507.39 | 520.43 | 4,897.84 | 4,897.77
 | GDC 2 | 27.49 | 4.65 | 0.08 | 1,434.61 | 118.70
 | GDC | 4.30 | 4.35 | 0.22 | 11,853.75 | 11,853.75
HS11 (128 MB), Ref: PA11 | iDoComp | 6.44 | 130.77 | 96.27 | 1,128.72 | 1,128.72
 | GReEn | 4.21 | 323.82 | 278.12 | 1,065.21 | 1,060.59
 | GeCo | 12.34 | 517.65 | 524.30 | 4,897.84 | 4,897.77
 | GDC 2 | 6.21 | 18.45 | 0.16 | 1,459.12 | 310.48
 | GDC | 4.30 | 4.16 | 0.22 | 11,854.96 | 11,854.96
HSK16 (71 MB), Ref: HS16 | iDoComp | 173.93 | 5.56 | 10.66 | 461.78 | 461.78
 | GReEn | 45.88 | 104.35 | 113.94 | 765.93 | 778.14
 | GeCo | 45.37 | 362.98 | 296.28 | 4,897.79 | 4,897.73
 | GDC 2 | 193.34 | 2.23 | 0.04 | 963.97 | 83.57
 | GDC | 9.16 | 1.60 | 0.06 | 11,304.26 | 11,304.26
RICE5 (166 MB), Ref: RICE7 | iDoComp | 204.38 | 28.22 | 34.50 | 950.05 | 950.05
 | GReEn | 169.16 | 204.66 | 221.81 | 1,286.32 | 1,285.14
 | GeCo | 91.68 | 616.70 | 628.95 | 4,897.79 | 4,897.73
 | GDC 2 | 199.28 | 1.73 | 0.08 | 1,712.41 | 116.05
 | GDC | 4.44 | 5.36 | 0.27 | 11,789.49 | 11,789.49

^{a} The size of the reference sequences has not been included in the calculation of the compression ratio (C-Ratio); ^{b} **Ref:** reference sequence.
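Footnote (a) matters when comparing the reference-based methods above with the reference-free ones: the reference is assumed to be available to both encoder and decoder, so only the target's original and compressed sizes enter the ratio. A worked sketch of this convention (the byte counts below are illustrative, not taken from the table):

```python
def reference_based_c_ratio(target_size: int, compressed_size: int) -> float:
    """C-Ratio as defined in footnote (a): the reference sequence is shared
    out-of-band, so its size is excluded from the ratio."""
    return target_size / compressed_size

# A 138 MB target compressed to roughly 0.57 MB against a reference
# yields a ratio in the low hundreds, as in the iDoComp row for HS8.
target = 138 * 2**20          # hypothetical target size in bytes
compressed = 598_432          # hypothetical compressed size in bytes
print(round(reference_based_c_ratio(target, compressed), 2))
```

Counting the reference's size as part of the input would collapse these ratios toward those of the reference-free compressors, which is why the two families are not directly comparable on C-Ratio alone.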

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hosseini, M.; Pratas, D.; Pinho, A.J.
A Survey on Data Compression Methods for Biological Sequences. *Information* **2016**, *7*, 56.
https://doi.org/10.3390/info7040056
