# A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Method

#### 2.1. Weighted Context Models

#### 2.2. Weighted Stochastic Repeat Models

#### 2.3. Competitive Prediction Context Model

#### 2.4. Decompression

#### 2.5. Implementation

## 3. Results

**38,280,246**bytes (1.6139 BPS). This result is an improvement of 1% over Jarvis in mode 7. The trade-off is computational time and RAM, however still less than XM. Therefore, Jarvis is flexible and can be optimized to achieve considerably better compression ratios. The optimization, besides the choice of the best model, can be applied in a specific combination of the number of models, depths, estimator parameters, among many others.

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

AeCa | Aeropyrum camini—archaea |

AgPh | Aggregatibacter phage S1249—phage virus |

BPS | Bits per symbol |

BuEb | Bundibugyo ebolavirus—virus |

CPCM | Competitive prediction context model |

CTW | Context tree weighting |

DaRe | Danio rerio—fish |

DrMe | Drosophila miranda—fly |

EnIn | Entamoeba invadens—amoebozoa |

EsCo | Escherichia coli—bacteria |

GaGa | Gallus gallus—chicken |

GeCo | Genomic Compressor (tool) |

GPU | Graphical Processing Unit |

HaHi | Haloarcula hispanica—archaea |

HePy | Helicobacter pylori—bacteria |

HoSa | Homo sapiens—human |

LUT | Look Up Table |

NC | Normalized Compression |

OrSa | Oriza sativa—plant |

OS | Operating System |

PlFa | Plasmodium falciparum–protozoan |

RAM | Random Access Memory |

RLE | Run Length Encoding |

ScPo | Schizosaccharomyces pombe—fungi |

TB | TeraByte |

XM | eXpert-Model |

YeMi | Yellowstone lake—mimivirus |

## References

- Schatz, M.C.; Langmead, B. The DNA data deluge. IEEE Spectrum
**2013**, 50, 28–33. [Google Scholar] [CrossRef] [PubMed] - Mardis, E.R. DNA sequencing technologies: 2006–2016. Nat. Protocols
**2017**, 12, 213. [Google Scholar] [CrossRef] [PubMed] - Marco, D. Metagenomics: Theory, Methods and Applications; Horizon Scientific Press: Poole, UK, 2010. [Google Scholar]
- Duggan, A.T.; Perdomo, M.F.; Piombino-Mascali, D.; Marciniak, S.; Poinar, D.; Emery, M.V.; Buchmann, J.P.; Duchêne, S.; Jankauskas, R.; Humphreys, M.; et al. 17th century variola virus reveals the recent history of smallpox. Curr. Biol.
**2016**, 26, 3407–3412. [Google Scholar] [CrossRef] [PubMed] - Weber, W.; Fussenegger, M. Emerging biomedical applications of synthetic biology. Nat. Rev. Genet.
**2012**, 13, 21. [Google Scholar] [CrossRef] [PubMed] - Marciniak, S.; Perry, G.H. Harnessing ancient genomes to study the history of human adaptation. Nat. Rev. Genet.
**2017**, 18, 659. [Google Scholar] [CrossRef] [PubMed] - Stephens, Z.D.; Lee, S.Y.; Faghri, F.; Campbell, R.H.; Zhai, C.; Efron, M.J.; Iyer, R.; Schatz, M.C.; Sinha, S.; Robinson, G.E. Big data: Astronomical or genomical? PLoS Biol.
**2015**, 13, e1002195. [Google Scholar] [CrossRef] - Deorowicz, S.; Grabowski, S. Compression of DNA sequence reads in FASTQ format. Bioinformatics
**2011**, 27, 860–862. [Google Scholar] [CrossRef] - Hanus, P.; Dingel, J.; Chalkidis, G.; Hagenauer, J. Compression of whole genome alignments. IEEE Trans. Inf. Theory
**2010**, 56, 696–705. [Google Scholar] [CrossRef] - Matos, L.M.O.; Pratas, D.; Pinho, A.J. A compression model for DNA multiple sequence alignment blocks. IEEE Trans. Inf. Theory
**2013**, 59, 3189–3198. [Google Scholar] [CrossRef] - Mohammed, M.H.; Dutta, A.; Bose, T.; Chadaram, S.; Mande, S.S. DELIMINATE—A fast and efficient method for loss-less compression of genomic sequences. Bioinformatics
**2012**, 28, 2527–2529. [Google Scholar] [CrossRef] - Pinho, A.J.; Pratas, D. MFCompress: A compression tool for fasta and multi-fasta data. Bioinformatics
**2013**, 30, 117–118. [Google Scholar] [CrossRef] [PubMed] - Grabowski, S.; Deorowicz, S.; Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics
**2015**, 31, 1389–1395. [Google Scholar] [CrossRef] [PubMed] - Hach, F.; Numanagić, I.; Alkan, C.; Sahinalp, S.C. SCALCE: Boosting sequence compression algorithms using locally consistent encoding. Bioinformatics
**2012**, 28, 3051–3057. [Google Scholar] [CrossRef] [PubMed] - Layer, R.M.; Kindlon, N.; Karczewski, K.J.; Consortium, E.A.; Quinlan, A.R. Efficient genotype compression and analysis of large genetic-variation data sets. Nat. Methods
**2016**, 13, 63. [Google Scholar] [CrossRef] - Bonfield, J.K.; Mahoney, M.V. Compression of FASTQ and SAM format sequencing data. PLoS ONE
**2013**, 8, e59190. [Google Scholar] [CrossRef] - Wang, R.; Bai, Y.; Chu, Y.S.; Wang, Z.; Wang, Y.; Sun, M.; Li, J.; Zang, T.; Wang, Y. DeepDNA: A hybrid convolutional and recurrent neural network for compressing human mitochondrial genomes. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 270–274. [Google Scholar]
- Benoit, G.; Lemaitre, C.; Lavenier, D.; Drezen, E.; Dayris, T.; Uricaru, R.; Rizk, G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform.
**2015**, 16, 288. [Google Scholar] [CrossRef] - Ochoa, I.; Li, H.; Baumgarte, F.; Hergenrother, C.; Voges, J.; Hernaez, M. AliCo: A new efficient representation for SAM files. In Proceedings of the 2019 Data Compression Conference (DCC), Snowbird, UT, USA, 26–29 March 2019; pp. 93–102. [Google Scholar]
- Zhang, C.; Ochoa, I. VEF: A Variant Filtering tool based on Ensemble methods. bioRxiv
**2019**, 540286. [Google Scholar] [CrossRef] - Chandak, S.; Tatwawadi, K.; Ochoa, I.; Hernaez, M.; Weissman, T. SPRING: A next-generation compressor for FASTQ data. Bioinformatics
**2018**, 35, 2674–2676. [Google Scholar] [CrossRef] - Holley, G.; Wittler, R.; Stoye, J.; Hach, F. Dynamic alignment-free and reference-free read compression. J. Comput. Biol.
**2018**, 25, 825–836. [Google Scholar] [CrossRef] - Kumar, S.; Agarwal, S. WBMFC: Efficient and Secure Storage of Genomic Data. Pertanika J. Sci. Technol.
**2018**, 26, 4. [Google Scholar] - Dougherty, E.R.; Shmulevich, I.; Chen, J.; Wang, Z.J. (Eds.) Genomic Signal Processing and Statistics; Hindawi Publishing Corporation: London, UK, 2005. [Google Scholar]
- Grumbach, S.; Tahi, F. Compression of DNA sequences. In Proceedings of the Data Compression Conference (DCC 1993), Snowbird, UT, USA, 30 March–2 April 1993; pp. 340–350. [Google Scholar]
- Hernaez, M.; Pavlichin, D.; Weissman, T.; Ochoa, I. Genomic Data Compression. Annu. Rev. Biomed. Data Sci.
**2019**, 2, 19–37. [Google Scholar] [CrossRef] - Rieseberg, L.H. Chromosomal rearrangements and speciation. Trends Ecol. Evol.
**2001**, 16, 351–358. [Google Scholar] [CrossRef] - Roeder, G.S.; Fink, G.R. DNA rearrangements associated with a transposable element in yeast. Cell
**1980**, 21, 239–249. [Google Scholar] [CrossRef] - Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl. Acad. Sci. USA
**2015**, 112, 3439–3444. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Jeong, C.; di Rienzo, A. Adaptations to local environments in modern human populations. Curr. Opin. Genet. Dev.
**2014**, 29, 1–8. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Beres, S.; Kachroo, P.; Nasser, W.; Olsen, R.; Zhu, L.; Flores, A.; de la Riva, I.; Paez-Mayorga, J.; Jimenez, F.; Cantu, C.; et al. Transcriptome remodeling contributes to epidemic disease caused by the human pathogen Streptococcus pyogenes. mBio
**2016**, 7, e00403-16. [Google Scholar] [CrossRef] - Fumagalli, M.; Sironi, M. Human genome variability, natural selection and infectious diseases. Curr. Opin. Immunol.
**2014**, 30, 9–16. [Google Scholar] [CrossRef] - Long, H.; Sung, W.; Kucukyildirim, S.; Williams, E.; Miller, S.F.; Guo, W.; Patterson, C.; Gregory, C.; Strauss, C.; Stone, C.; et al. Evolutionary determinants of genome-wide nucleotide composition. Nat. Ecol. Evol.
**2018**, 2, 237. [Google Scholar] [CrossRef] - Golan, A. Foundations of Info-Metrics: Modeling and Inference with Imperfect Information; Oxford University Press: Oxford, UK, 2017. [Google Scholar]
- Hosseini, M.; Pratas, D.; Pinho, A.J. A survey on data compression methods for biological sequences. Information
**2016**, 7, 56. [Google Scholar] [CrossRef] - Wang, C.; Zhang, D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res.
**2011**, 39, e45. [Google Scholar] [CrossRef] - Kuruppu, S.; Puglisi, S.J.; Zobel, J. Optimized relative Lempel–Ziv compression of genomes. In Proceedings of the 34th Australian Computer Science Conference (ACSC-2011), Perth, Australia, 17–20 January 2011; Volume 11, pp. 91–98. [Google Scholar]
- Tembe, W.; Lowey, J.; Suh, E. G-SQZ: Compact encoding of genomic sequence and quality data. Bioinformatics
**2010**, 26, 2192–2194. [Google Scholar] [CrossRef] [PubMed] - Fritz, M.H.Y.; Leinonen, R.; Cochrane, G.; Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res.
**2011**, 21, 734–740. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Kozanitis, C.; Saunders, C.; Kruglyak, S.; Bafna, V.; Varghese, G. Compressing genomic sequence fragments using SlimGene. J. Comput. Biol.
**2011**, 18, 401–413. [Google Scholar] [CrossRef] [PubMed] - Pinho, A.J.; Pratas, D.; Garcia, S.P. GReEn: A tool for efficient compression of genome resequencing data. Nucleic Acids Res.
**2012**, 40, e27. [Google Scholar] [CrossRef] [PubMed] - Wandelt, S.; Leser, U. FRESCO: Referential compression of highly similar sequences. IEEE/ACM Trans. Comput. Biol. Bioinform.
**2013**, 10, 1275–1288. [Google Scholar] [CrossRef] - Deorowicz, S.; Danek, A.; Niemiec, M. GDC 2: Compression of large collections of genomes. Sci. Rep.
**2015**, 5, 1–12. [Google Scholar] [CrossRef] - Ochoa, I.; Hernaez, M.; Weissman, T. iDoComp: A compression scheme for assembled genomes. Bioinformatics
**2014**, 31, 626–633. [Google Scholar] [CrossRef] - Liu, Y.; Peng, H.; Wong, L.; Li, J. High-speed and high-ratio referential genome compression. Bioinformatics
**2017**, 33, 3364–3372. [Google Scholar] [CrossRef] - Shi, W.; Chen, J.; Luo, M.; Chen, M. High efficiency referential genome compression algorithm. Bioinformatics
**2018**, 35, 2058–2065. [Google Scholar] [CrossRef] - Saha, S.; Rajasekaran, S. NRGC: A novel referential genome compression algorithm. Bioinformatics
**2016**, 32, 3405–3412. [Google Scholar] [CrossRef] - Tang, Y.; Li, M.; Sun, J.; Zhang, T.; Zhang, J.; Zheng, P. TRCMGene: A two-step referential compression method for the efficient storage of genetic data. PLoS ONE
**2018**, 13, e0206521. [Google Scholar] [CrossRef] [PubMed] - Kolmogorov, A.N. Three approaches to the quantitative definition of information. Probl. Inf. Transm.
**1965**, 1, 1–7. [Google Scholar] [CrossRef] - Pratas, D.; Pinho, A.J. On the Approximation of the Kolmogorov Complexity for DNA Sequences. In Iberian Conference on Pattern Recognition and Image Analysis; Springer: Cham, Switzerland, 2017; pp. 259–266. [Google Scholar]
- Goyal, M.; Tatwawadi, K.; Chandak, S.; Ochoa, I. DeepZip: Lossless Data Compression using Recurrent Neural Networks. arXiv
**2018**, arXiv:1811.08162. [Google Scholar] - Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory
**1977**, 23, 337–343. [Google Scholar] [CrossRef] - Grumbach, S.; Tahi, F. A new challenge for compression algorithms: Genetic sequences. Inf. Process. Manag.
**1994**, 30, 875–886. [Google Scholar] [CrossRef] [Green Version] - Rivals, E.; Delahaye, J.P.; Dauchet, M.; Delgrange, O. A guaranteed compression scheme for repetitive DNA sequences. In Proceedings of the Data Compression Conference (DCC ’96), Snowbird, UT, USA, 31 March–3 April 1996; p. 453. [Google Scholar]
- Loewenstern, D.; Yianilos, P.N. Significantly lower entropy estimates for natural DNA sequences. In Proceedings of the Data Compression Conference (DCC ’97), Snowbird, UT, USA, 25–27 March 1997; pp. 151–160. [Google Scholar]
- Allison, L.; Edgoose, T.; Dix, T.I. Compression of strings with approximate repeats. In Proceedings of the Intelligent Systems in Molecular Biology (ISMB ’98), Montréal, QC, Canada, 28 June–1 July 1998; pp. 8–16. [Google Scholar]
- Apostolico, A.; Lonardi, S. Compression of biological sequences by greedy offline textual substitution. In Proceedings of the Data Compression Conference (DCC 2000), Snowbird, UT, USA, 28–30 March 2000; pp. 143–152. [Google Scholar]
- Chen, T.; Sullivan, G.J.; Puri, A.H. 263 (including H.263+) and other ITU-T video coding standards. In Multimedia Systems, Standards, and Networks; Puri, A., Chen, T., Eds.; Marcel Dekker: New York, NY, USA, 2000; pp. 55–85. [Google Scholar]
- Chen, X.; Li, M.; Ma, B.; Tromp, J. DNACompress: Fast and effective DNA sequence compression. Bioinformatics
**2002**, 18, 1696–1698. [Google Scholar] [CrossRef] - Ma, B.; Tromp, J.; Li, M. PatternHunter: Faster and more sensitive homology search. Bioinformatics
**2002**, 18, 440–445. [Google Scholar] [CrossRef] - Matsumoto, T.; Sadakane, K.; Imai, H. Biological sequence compression algorithms. Genome Inform.
**2000**, 11, 43–52. [Google Scholar] - Tabus, I.; Korodi, G.; Rissanen, J. DNA sequence compression using the normalized maximum likelihood model for discrete regression. In Proceedings of the Data Compression Conference (DCC 2003), Snowbird, UT, USA, 25–27 March 2003; pp. 253–262. [Google Scholar]
- Korodi, G.; Tabus, I. An efficient normalized maximum likelihood algorithm for DNA sequence compression. ACM Trans. Inf. Sys.
**2005**, 23, 3–34. [Google Scholar] [CrossRef] - Cherniavsky, N.; Ladner, R. Grammar-Based Compression of DNA Sequences; Technical Report; University of Washington: Seattle, WA, USA, 2004. [Google Scholar]
- Manzini, G.; Rastero, M. A simple and fast DNA compressor. Softw. Pract. Exp.
**2004**, 34, 1397–1411. [Google Scholar] [CrossRef] - Lee, A.J.T.; Chen, C. DNAC: An Efficient Compression Algorithm for DNA Sequences; National Taiwan University: Taipei, Taiwan, 2004; Volume 1. [Google Scholar]
- Behzadi, B.; Le Fessant, F. DNA compression challenge revisited. In Combinatorial Pattern Matching: Proceedings of CPM-2005; Springer: Jeju Island, Korea, 2005; Volume 3537, pp. 190–200. [Google Scholar]
- Cao, M.D.; Dix, T.I.; Allison, L.; Mears, C. A simple statistical algorithm for biological sequence compression. In Proceedings of the 2007 Data Compression Conference (DCC ’07), Snowbird, UT, USA, 27–29 March 2007; pp. 43–52. [Google Scholar]
- Vey, G. Differential direct coding: A compression algorithm for nucleotide sequence data. Database
**2009**, 2009. [Google Scholar] [CrossRef] [PubMed] - Mishra, K.N.; Aaggarwal, A.; Abdelhadi, E.; Srivastava, D. An efficient horizontal and vertical method for online dna sequence compression. Int. J. Comput. Appl.
**2010**, 3, 39–46. [Google Scholar] [CrossRef] - Rajeswari, P.R.; Apparao, A. GENBIT Compress-Algorithm for repetitive and non repetitive DNA sequences. Int. J. Comput. Sci. Inf. Technol.
**2010**, 2, 25–29. [Google Scholar] - Gupta, A.; Agarwal, S. A novel approach for compressing DNA sequences using semi-statistical compressor. Int. J. Comput. Appl.
**2011**, 33, 245–251. [Google Scholar] [CrossRef] - Gupta, A.; Agarwal, S. A scheme that facilitates searching and partial decompression of textual documents. Int. J. Adv. Comput. Eng.
**2008**, 1, 99–109. [Google Scholar] - Zhu, Z.; Zhou, J.; Ji, Z.; Shi, Y. DNA sequence compression using adaptive particle swarm optimization-based memetic algorithm. IEEE Trans. Evol. Comput.
**2011**, 15, 643–658. [Google Scholar] [CrossRef] - Pinho, A.J.; Pratas, D.; Ferreira, P.J.S.G. Bacteria DNA sequence compression using a mixture of finite-context models. In Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011. [Google Scholar]
- Pinho, A.J.; Ferreira, P.J.S.G.; Neves, A.J.R.; Bastos, C.A.C. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS ONE
**2011**, 6, e21588. [Google Scholar] [CrossRef] - Roy, S.; Khatua, S.; Roy, S.; Bandyopadhyay, S.K. An efficient biological sequence compression technique using lut and repeat in the sequence. arXiv
**2012**, arXiv:1209.5905. [Google Scholar] [CrossRef] - Satyanvesh, D.; Balleda, K.; Padyana, A.; Baruah, P. GenCodex—A Novel Algorithm for Compressing DNA sequences on Multi-cores and GPUs. In Proceedings of the IEEE 19th International Conference on High Performance Computing (HiPC), Pune, India, 18–22 December 2012. [Google Scholar]
- Bose, T.; Mohammed, M.H.; Dutta, A.; Mande, S.S. BIND—An algorithm for loss-less compression of nucleotide sequence data. J. Biosci.
**2012**, 37, 785–789. [Google Scholar] [CrossRef] - Li, P.; Wang, S.; Kim, J.; Xiong, H.; Ohno-Machado, L.; Jiang, X. DNA-COMPACT: DNA Compression Based on a Pattern-Aware Contextual Modeling Technique. PLoS ONE
**2013**, 8, e80377. [Google Scholar] [CrossRef] - Pratas, D.; Pinho, A.J. Exploring deep Markov models in genomic data compression using sequence pre-analysis. In Proceedings of the 22th European Signal Processing Conference (EUSIPCO 2014), Lisbon, Portugal, 1–5 September 2014; pp. 2395–2399. [Google Scholar]
- Sardaraz, M.; Tahir, M.; Ikram, A.A.; Bajwa, H. SeqCompress: An algorithm for biological sequence compression. Genomics
**2014**, 104, 225–228. [Google Scholar] [CrossRef] [PubMed] - Guo, H.; Chen, M.; Liu, X.; Xie, M. Genome compression based on Hilbert space filling curve. In Proceedings of the 3rd International Conference on Management, Education, Information and Control (MEICI 2015), Shenyang, China, 29–31 May 2015; pp. 29–31. [Google Scholar]
- Xie, X.; Zhou, S.; Guan, J. CoGI: Towards compressing genomes as an image. IEEE/ACM Trans. Comput. Biol. Bioinform.
**2015**, 12, 1275–1285. [Google Scholar] [CrossRef] [PubMed] - Mohamed, S.A.; Fahmy, M.M. Binary image compression using efficient partitioning into rectanglar regions. IEEE Trans. Commun.
**1995**, 43, 1888–1893. [Google Scholar] [CrossRef] - Pratas, D.; Pinho, A.J.; Ferreira, P.J.S.G. Efficient compression of genomic sequences. In Proceedings of the 2016 Data Compression Conference (DCC 2016), Snowbird, UT, USA, 30 March–1 April 2016; pp. 231–240. [Google Scholar]
- Pratas, D.; Hosseini, M.; Pinho, A.J. Substitutional Tolerant Markov Models for Relative Compression of DNA Sequences. In 11th International Conference on Practical Applications of Computational Biology & Bioinformatics; Springer: Cham, Switzerland, 2017; pp. 265–272. [Google Scholar]
- Pratas, D.; Hosseini, M.; Pinho, A.J. GeCo2: An optimized tool for lossless compression and analysis of DNA sequences. In 13th International Conference on Practical Applications of Computational Biology and Bioinformatics; Fdez-Riverola, F., Rocha, M., Mohamad, M.S., Zaki, N., Castellanos-Garzón, J.A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 137–145. [Google Scholar]
- Chen, M.; Shao, J.; Jia, X. Genome sequence compression based on optimized context weighting. Genet. Mol. Res. GMR
**2017**, 16. [Google Scholar] [CrossRef] - Mansouri, D.; Yuan, X. One-Bit DNA Compression Algorithm. In International Conference on Neural Information Processing; Springer: Berlin, Germany, 2018; pp. 378–386. [Google Scholar]
- Pratas, D.; Pinho, A.J. A DNA Sequence Corpus for Compression Benchmark. In International Conference on Practical Applications of Computational Biology & Bioinformatics; Springer: Cham, Switzerland, 2018; pp. 208–215. [Google Scholar]
- Sayood, K. Introduction to Data Compression; Morgan Kaufmann: Burlington, MA, USA, 2017. [Google Scholar]
- Bell, T.C.; Cleary, J.G.; Witten, I.H. Text Compression; Prentice Hall: Upper Saddle River, NJ, USA, 1990. [Google Scholar]
- Pinho, A.J.; Neves, A.J.R.; Afreixo, V.; Bastos, C.A.C.; Ferreira, P.J.S.G. A three-state model for DNA protein-coding regions. IEEE Trans. Biomed. Eng.
**2006**, 53, 2148–2155. [Google Scholar] [CrossRef] - Hosseini, M.; Pratas, D.; Pinho, A.J. On the role of inverted repeats in DNA sequence similarity. In International Conference on Practical Applications of Computational Biology & Bioinformatics; Springer: Cham, Switzerland, 2017; pp. 228–236. [Google Scholar]
- Pinho, A.J.; Neves, A.J.R.; Martins, D.A.; Bastos, C.A.C.; Ferreira, P.J.S.G. Finite-context models for DNA coding. In Signal Processing; Miron, S., Ed.; INTECH: Rijeka, Croatia, 2010; pp. 117–130. [Google Scholar]
- Ferreira, P.J.S.G.; Pinho, A.J. Compression-based normal similarity measures for DNA sequences. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), Florence, Italy, 4–9 May 2014; pp. 419–423. [Google Scholar]
- Moffat, A.; Neal, R.M.; Witten, I.H. Arithmetic coding revisited. ACM Trans. Inf. Syst.
**1998**, 16, 256–294. [Google Scholar] [CrossRef] [Green Version]

**Figure 1.**Timeline with the names of the proposed data compressors specifically for genomic sequences.

**Figure 2.**An architecture example of a competitive prediction between five Weighted context models (at left, represented with prefix C) and three Weighted stochastic repeat models (at right, represented with prefix R). Each model has a weight (W) and associated probabilities (P) that are calculated according to the respective memory model (M), where the suffix complements the notation. The tolerant context model ($CW5,CP5$) uses the same memory of model four ($CW4,CP4$), since they have the same context. Independently, the probabilities of the context models and repeat models are averaged according to the respective weight and redirected to the competitive prediction model. Finally, the probabilities of the model class with the highest probability (predicted) are redirected to the arithmetic encoder.

**Figure 3.**Repeat model example with k-mer size of 8. The H is a hash function that encapsulates a k-mer into a natural number on the hash table. Positions 14,251 and 14,275 stand for identical k-mers seen in the past of the sequence. Number 14,295 stands for the current position of the base being coded.

**Figure 4.**Competitive prediction context model (CPCM) example with context depth (k) of 5. The next symbol is S, and Z is the sequence with the best class of models estimated by the CPCM.

**Figure 5.**Bits per base (BPS) of compressing four sequences applying a CPCM context order variation for the first twelve modes of Jarvis. The four datasets are sorted according to different sizes; namely, the largest is HoSa (

**left**-

**top**), then, EnIn (

**right**-

**top**), AeCa (

**left**-

**bottom**), and YeMi (

**right**-

**bottom**).

**Figure 6.**Benchmark with size (

**a**) and speed (

**b**). For each sequence, the value of speed is calculated as compressed size (KB) divided by compression time (s). The mean of speed values for all datasets is calculated to obtain the average speed for each method. The CoGI compressor is not included because it is an outlier concerning this dataset.

**Figure 7.**Comparison of the fifteen compression modes available in Jarvis for the three largest sequences in the dataset (HoSa, GaGa, and DaRe). Compression ratios are in Bits Per Symbol (BPS) and Time in seconds. Times may not agree precisely with Table 2 because we rerun the tool. Each number, corresponding to the blue dots, stands for the mode/level used in Jarvis. We recall that additional levels or specific configurations can be set.

**Figure 8.**Comparison of the fifteen compression modes available in Jarvis and GeCo2 for the human chromosome Y sequence. Compression ratios are in Bits Per Symbol (BPS) and Time in seconds. Each number, corresponding to the blue dots, stands for the mode/level used in the respective compressor.

**Table 1.**Number of bytes needed to represent each DNA sequence given the respective data compressor (LZMA -9, PAQ8 -8, CoGi, GeCo, XM and Jarvis). We ran LZMA with the -9 flag (best option), PAQ8 with the -8 (best option), GeCo using “-tm 1:1:0:0/0 -tm 3:1:0:0/0 -tm 6:1:0:0/0 -tm 9:10:0:0/0 -tm 11:10:0:0/0 -tm 13:50:1:0/0 -tm 18:100:1:3/10 -c 30 -g 0.9”, GeCo2 with parameters from [88], and XM using 50 copy experts. The compression level used in Jarvis is depicted between parentheses, and it has been set according to the size of the sequence. The length of the sequences is present in Table 2.

ID | LZMA-9 | PAQ8-8 | CoGI | GeCo | GeCo2 | XM | Jarvis (level) |
---|---|---|---|---|---|---|---|

HoSa | 42,292,440 | 40,517,624 | 51,967,817 | 38,877,294 | 38,845,642 | 38,940,458 | 38,660,851 (7) |

GaGa | 36,179,650 | 34,490,967 | 40,846,177 | 33,925,250 | 33,877,671 | 33,879,211 | 33,699,821 (6) |

DaRe | 12,515,717 | 12,628,104 | 17,084,450 | 11,520,064 | 11,488,819 | 11,302,620 | 11,173,905 (5) |

OrSa | 9,348,183 | 9,280,037 | 11,999,580 | 8,671,732 | 8,646,543 | 8,470,212 | 8,448,959 (5) |

DrMe | 8,016,544 | 7,577,068 | 8,939,690 | 7,498,808 | 7,481,093 | 7,538,662 | 7,490,418 (5) |

EnIn | 5,785,343 | 5,761,090 | 7,210,867 | 5,196,083 | 5,170,889 | 5,150,309 | 5,087,286 (4) |

ScPo | 2,722,233 | 2,557,988 | 2,921,247 | 2,536,457 | 2,518,963 | 2,524,147 | 2,517,535 (4) |

PlFa | 2,097,979 | 1,959,623 | 2,411,342 | 1,944,036 | 1,925,726 | 1,925,841 | 1,924,430 (4) |

EsCo | 1,185,704 | 1,107,929 | 1,307,943 | 1,109,823 | 1,098,552 | 1,110,092 | 1,095,606 (4) |

HaHi | 985,096 | 904,074 | 1,124,483 | 906,991 | 902,831 | 913,346 | 899,464 (3) |

AeCa | 413,886 | 380,273 | 454,357 | 385,640 | 380,115 | 387,030 | 380,507 (3) |

HePy | 415,161 | 385,096 | 457,859 | 381,545 | 375,481 | 384,071 | 374,362 (3) |

YeMi | 19,262 | 16,835 | 19,805 | 17,167 | 16,798 | 16,861 | 16,861 (2) |

AgPh | 12,183 | 10,754 | 12,243 | 10,882 | 10,708 | 10,711 | 10,745 (2) |

BuEb | 5441 | 4668 | 5291 | 4774 | 4686 | 4642 | 4690 (1) |

Total | 121,994,822 | 117,582,130 | 146,763,151 | 112,986,546 | 112,744,517 | 112,558,213 | 111,785,440 |

**Table 2.**Computational time (in seconds) needed to represent each DNA sequence given the respective data compressor (LZMA, PAQ8, CoGi, GeCo, GeCo2, XM, and Jarvis). We ran LZMA with the -9 flag (best option), PAQ8 with the -8 (best option), GeCo using “-tm 1:1:0:0/0 -tm 3:1:0:0/0 -tm 6:1:0:0/0 -tm 9:10:0:0/0 -tm 11:10:0:0/0 -tm 13:50:1:0/0 -tm 18:100:1:3/10 -c 30 -g 0.9”, GeCo2 with parameters from [88], and XM using 50 copy experts. The compression level used in Jarvis is depicted between parentheses and it has been set according to the size of the sequence. The length scale of the sequences is in bases.

ID | Length | LZMA | PAQ8 | CoGI | GeCo | GeCo2 | XM | Jarvis |
---|---|---|---|---|---|---|---|---|

HoSa | 189,752,667 | 552.5 | 85,269.1 | 25.2 | 648.6 | 652.4 | 5,589.8 | 814.8 (7) |

GaGa | 148,532,294 | 468.7 | 64,898.9 | 19.9 | 503.2 | 494.7 | 3,633.9 | 412.3 (6) |

DaRe | 62,565,020 | 170.0 | 29,907.7 | 8.2 | 215.9 | 198.8 | 785.2 | 284.9 (5) |

OrSa | 43,262,523 | 112.9 | 20,745.1 | 5.8 | 192.4 | 138.3 | 489.7 | 234.5 (5) |

DrMe | 32,181,429 | 85.6 | 14,665.8 | 4.3 | 114.6 | 102.4 | 362.6 | 66.7 (5) |

EnIn | 26,403,087 | 66.0 | 11,183.6 | 3.7 | 95.8 | 82.5 | 279.8 | 101.1 (4) |

ScPo | 10,652,155 | 23.0 | 4,619.1 | 1.5 | 45.2 | 34.2 | 96.5 | 28.7 (4) |

PlFa | 8,986,712 | 18.3 | 4,133.9 | 1.2 | 39.7 | 35.3 | 84.4 | 25.4 (4) |

EsCo | 4,641,652 | 8.1 | 1,973.9 | 0.6 | 26.4 | 5.1 | 36.8 | 10.9 (4) |

HaHi | 3,890,005 | 6.9 | 1,738.1 | 0.5 | 23.7 | 4.4 | 39.1 | 7.1 (3) |

AeCa | 1,591,049 | 2.2 | 675.3 | 0.2 | 17.0 | 1.9 | 10.3 | 2.2 (3) |

HePy | 1,667,825 | 2.3 | 715.1 | 0.2 | 17.2 | 1.9 | 11.2 | 2.7 (3) |

YeMi | 73,689 | 0.1 | 32.6 | 0.0 | 12.3 | 0.1 | 0.9 | 0.2 (2) |

AgPh | 43,970 | 0.0 | 20.1 | 0.0 | 12.1 | 0.1 | 0.9 | 0.1 (2) |

BuEb | 18,940 | 0.0 | 9.1 | 0.0 | 12.2 | 0.1 | 0.7 | 0.1 (1) |

Total | 534,263,017 | 1516.6 | 240,587.4 | 71.3 | 1976.3 | 1742.2 | 11,421.8 | 1991.7 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Pratas, D.; Hosseini, M.; Silva, J.M.; Pinho, A.J.
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. *Entropy* **2019**, *21*, 1074.
https://doi.org/10.3390/e21111074

**AMA Style**

Pratas D, Hosseini M, Silva JM, Pinho AJ.
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models. *Entropy*. 2019; 21(11):1074.
https://doi.org/10.3390/e21111074

**Chicago/Turabian Style**

Pratas, Diogo, Morteza Hosseini, Jorge M. Silva, and Armando J. Pinho.
2019. "A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models" *Entropy* 21, no. 11: 1074.
https://doi.org/10.3390/e21111074