## Abstract

## 1. Introduction

## 2. Method

#### 2.1. Weighted Context Models

#### 2.2. Weighted Stochastic Repeat Models

#### 2.3. Competitive Prediction Context Model

#### 2.4. Decompression

#### 2.5. Implementation

## 3. Results

**38,280,246** bytes (1.6139 BPS). This result is an improvement of approximately 1% over Jarvis in mode 7. The trade-off is additional computational time and RAM, although both remain below those required by XM. Jarvis is therefore flexible and can be optimized to achieve considerably better compression ratios. Besides the choice of the best model, the optimization can target a specific combination of the number of models, their depths, and the estimator parameters, among many other settings.
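The relation between the byte count and the BPS figure above can be checked with a short calculation. The sketch below assumes that the sequence in question is HoSa, whose length is taken from Table 2 and whose mode 7 size is taken from Table 1; the function name `bits_per_symbol` is illustrative and is not part of Jarvis.

```python
def bits_per_symbol(compressed_bytes: int, sequence_length: int) -> float:
    """Average number of bits spent per symbol of the original sequence."""
    return 8 * compressed_bytes / sequence_length


# Assumption: the result refers to HoSa (length from Table 2).
HOSA_LENGTH = 189_752_667
OPTIMIZED_BYTES = 38_280_246   # size reported in the text
MODE7_BYTES = 38_660_851       # Jarvis (7) entry for HoSa in Table 1

optimized_bps = bits_per_symbol(OPTIMIZED_BYTES, HOSA_LENGTH)
improvement = 1 - OPTIMIZED_BYTES / MODE7_BYTES

print(f"{optimized_bps:.4f} BPS, {improvement:.1%} smaller than mode 7")
```

Under this assumption the computed value agrees with the reported 1.6139 BPS, and the relative gain over mode 7 is close to 1%.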

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
AeCa | Aeropyrum camini (archaea) |
AgPh | Aggregatibacter phage S1249 (phage virus) |
BPS | Bits per symbol |
BuEb | Bundibugyo ebolavirus (virus) |
CPCM | Competitive prediction context model |
CTW | Context tree weighting |
DaRe | Danio rerio (fish) |
DrMe | Drosophila miranda (fly) |
EnIn | Entamoeba invadens (amoebozoa) |
EsCo | Escherichia coli (bacteria) |
GaGa | Gallus gallus (chicken) |
GeCo | Genomic Compressor (tool) |
GPU | Graphics Processing Unit |
HaHi | Haloarcula hispanica (archaea) |
HePy | Helicobacter pylori (bacteria) |
HoSa | Homo sapiens (human) |
LUT | Look Up Table |
NC | Normalized Compression |
OrSa | Oryza sativa (plant) |
OS | Operating System |
PlFa | Plasmodium falciparum (protozoan) |
RAM | Random Access Memory |
RLE | Run Length Encoding |
ScPo | Schizosaccharomyces pombe (fungi) |
TB | TeraByte |
XM | eXpert-Model |
YeMi | Yellowstone lake mimivirus |

## References

**Figure 1.** Timeline with the names of the data compressors proposed specifically for genomic sequences.

**Figure 2.** An example architecture of a competitive prediction between five weighted context models (at left, represented with prefix C) and three weighted stochastic repeat models (at right, represented with prefix R). Each model has a weight (W) and associated probabilities (P) that are calculated according to the respective memory model (M), where the suffix complements the notation. The tolerant context model ($CW5,CP5$) shares the memory of model four ($CW4,CP4$), since they have the same context. The probabilities of the context models and of the repeat models are averaged independently, according to the respective weights, and redirected to the competitive prediction model. Finally, the probabilities of the model class predicted to perform best are redirected to the arithmetic encoder.
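The two stages described in this caption, weighted averaging inside each class followed by selection of one class, can be sketched as follows. This is a minimal illustration and not the Jarvis implementation: the names `mix` and `select_class`, the class labels `"CM"` and `"RM"`, and the example weights and probabilities are all assumptions made for the sketch.

```python
from typing import Dict, List, Sequence

ALPHABET = "ACGT"


def mix(probabilities: Sequence[Sequence[float]],
        weights: Sequence[float]) -> List[float]:
    """Weighted average of per-model distributions over ALPHABET."""
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        for i in range(len(ALPHABET))
    ]


def select_class(class_distributions: Dict[str, List[float]],
                 predicted_class: str) -> List[float]:
    """Return the distribution of the class chosen by the competitive model."""
    return class_distributions[predicted_class]


# Hypothetical outputs of two context models and one repeat model.
cm_mixed = mix([[0.7, 0.1, 0.1, 0.1], [0.4, 0.2, 0.2, 0.2]], [0.75, 0.25])
rm_mixed = mix([[0.1, 0.7, 0.1, 0.1]], [1.0])

# The class label would come from the competitive prediction model.
to_encoder = select_class({"CM": cm_mixed, "RM": rm_mixed}, "CM")
```

Only the distribution of the selected class reaches the arithmetic encoder; the other class is still evaluated so that its performance can be compared afterwards.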

**Figure 3.** Repeat model example with a k-mer size of 8. H is a hash function that maps a k-mer to a natural number indexing the hash table. Positions 14,251 and 14,275 correspond to identical k-mers seen earlier in the sequence. Number 14,295 is the current position of the base being coded.
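The lookup illustrated in this figure can be sketched with a hash table that maps each k-mer to the position where it was last followed by a symbol. The sketch is a simplification under stated assumptions: it keeps a single past position per k-mer and uses Python's built-in dictionary hashing in place of H, whereas the weighted stochastic repeat models of the paper track several repeats with weights. The function name `repeat_predictions` is hypothetical.

```python
from typing import Dict, List, Optional


def repeat_predictions(sequence: str, k: int = 8) -> List[Optional[str]]:
    """Predict each symbol from the latest earlier occurrence of its k-mer.

    For position i >= k, the preceding k-mer is looked up in the table. If it
    was seen before, the prediction is the symbol that followed it last time;
    otherwise the prediction is None.
    """
    table: Dict[str, int] = {}  # k-mer -> position of the symbol that followed it
    predictions: List[Optional[str]] = [None] * len(sequence)
    for i in range(k, len(sequence)):
        kmer = sequence[i - k:i]
        past = table.get(kmer)
        if past is not None:
            predictions[i] = sequence[past]
        table[kmer] = i  # remember the most recent occurrence
    return predictions


preds = repeat_predictions("ACGTACGTAC", k=4)
```

In this toy input the last two symbols are predicted correctly, because the 4-mers that precede them already occurred once in the sequence.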

**Figure 4.** Competitive prediction context model (CPCM) example with a context depth (k) of 5. The next symbol is S, and Z is the sequence with the best class of models estimated by the CPCM.
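One way to read this figure is as an order-k context model over Z, the history of which class performed best. The sketch below follows that reading and is an assumption, not the authors' code: the class labels, the counting scheme, the tie-breaking rule, and the rule that the best class is the one assigning the higher probability to the symbol S are all choices made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

CLASSES = ("CM", "RM")  # context models, repeat models (illustrative labels)


class CompetitivePredictor:
    """Order-k context model over the history Z of best-performing classes."""

    def __init__(self, k: int = 5) -> None:
        self.k = k
        self.history: List[str] = []  # Z in the figure
        self.counts: DefaultDict[Tuple[str, ...], Dict[str, int]] = defaultdict(
            lambda: {c: 0 for c in CLASSES}
        )

    def _context(self) -> Tuple[str, ...]:
        start = max(0, len(self.history) - self.k)
        return tuple(self.history[start:])

    def predict(self) -> str:
        """Class expected to code the next symbol best (ties go to "CM")."""
        counts = self.counts[self._context()]
        return max(CLASSES, key=lambda c: counts[c])

    def update(self, p_cm: float, p_rm: float) -> None:
        """Record which class gave the coded symbol S the higher probability."""
        best = "CM" if p_cm >= p_rm else "RM"
        self.counts[self._context()][best] += 1
        self.history.append(best)


cpcm = CompetitivePredictor(k=2)
first_guess = cpcm.predict()
for _ in range(5):
    cpcm.update(p_cm=0.3, p_rm=0.9)  # the repeat class wins five times
later_guess = cpcm.predict()
```

Because the decoder sees the same symbols and the same probabilities, it can replay the same updates and reach the same class decision without side information.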

**Figure 5.** Bits per symbol (BPS) obtained when compressing four sequences while varying the CPCM context order, for the first twelve modes of Jarvis. The four sequences are sorted by size: the largest is HoSa (**top left**), followed by EnIn (**top right**), AeCa (**bottom left**), and YeMi (**bottom right**).

**Figure 6.** Benchmark with size (**a**) and speed (**b**). For each sequence, the speed is calculated as the compressed size (KB) divided by the compression time (s). The average speed of each method is the mean of its speed values over all datasets. The CoGI compressor is not included because it is an outlier concerning this dataset.
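The speed measure defined in this caption can be written as a short function. The sketch assumes 1 KB = 1000 bytes and skips sequences whose reported time rounds to 0.0 s, since the ratio is undefined for them; neither choice is stated in the source. The function name `average_speed` is hypothetical, and the example values are the Jarvis entries for HoSa, GaGa, and DaRe from Tables 1 and 2.

```python
from statistics import mean
from typing import Sequence


def average_speed(sizes_bytes: Sequence[int],
                  times_s: Sequence[float],
                  bytes_per_kb: int = 1000) -> float:
    """Mean over sequences of compressed size (KB) divided by time (s)."""
    speeds = [
        (size / bytes_per_kb) / t
        for size, t in zip(sizes_bytes, times_s)
        if t > 0  # assumption: skip entries whose time is reported as 0.0
    ]
    return mean(speeds)


jarvis_sizes = [38_660_851, 33_699_821, 11_173_905]  # Table 1, bytes
jarvis_times = [814.8, 412.3, 284.9]                 # Table 2, seconds
jarvis_speed = average_speed(jarvis_sizes, jarvis_times)
```

Note that this is a mean of per-sequence ratios, not the ratio of the totals, so small sequences weigh as much as large ones.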

**Figure 7.** Comparison of the fifteen compression modes available in Jarvis for the three largest sequences in the dataset (HoSa, GaGa, and DaRe). Compression ratios are in bits per symbol (BPS) and times are in seconds. Times may not agree precisely with Table 2 because the tool was rerun. Each number next to a blue dot stands for the mode/level used in Jarvis. Additional levels or specific configurations can also be set.

**Figure 8.** Comparison of the fifteen compression modes available in Jarvis and GeCo2 for the human chromosome Y sequence. Compression ratios are in bits per symbol (BPS) and times are in seconds. Each number next to a blue dot stands for the mode/level used in the respective compressor.

**Table 1.** Number of bytes needed to represent each DNA sequence with the respective data compressor (LZMA -9, PAQ8 -8, CoGI, GeCo, GeCo2, XM, and Jarvis). We ran LZMA with the -9 flag (best option), PAQ8 with the -8 flag (best option), GeCo using “-tm 1:1:0:0/0 -tm 3:1:0:0/0 -tm 6:1:0:0/0 -tm 9:10:0:0/0 -tm 11:10:0:0/0 -tm 13:50:1:0/0 -tm 18:100:1:3/10 -c 30 -g 0.9”, GeCo2 with parameters from [88], and XM using 50 copy experts. The compression level used in Jarvis is shown in parentheses and was set according to the size of the sequence. The sequence lengths are given in Table 2.

ID | LZMA-9 | PAQ8-8 | CoGI | GeCo | GeCo2 | XM | Jarvis (level) |
---|---|---|---|---|---|---|---|
HoSa | 42,292,440 | 40,517,624 | 51,967,817 | 38,877,294 | 38,845,642 | 38,940,458 | 38,660,851 (7) |
GaGa | 36,179,650 | 34,490,967 | 40,846,177 | 33,925,250 | 33,877,671 | 33,879,211 | 33,699,821 (6) |
DaRe | 12,515,717 | 12,628,104 | 17,084,450 | 11,520,064 | 11,488,819 | 11,302,620 | 11,173,905 (5) |
OrSa | 9,348,183 | 9,280,037 | 11,999,580 | 8,671,732 | 8,646,543 | 8,470,212 | 8,448,959 (5) |
DrMe | 8,016,544 | 7,577,068 | 8,939,690 | 7,498,808 | 7,481,093 | 7,538,662 | 7,490,418 (5) |
EnIn | 5,785,343 | 5,761,090 | 7,210,867 | 5,196,083 | 5,170,889 | 5,150,309 | 5,087,286 (4) |
ScPo | 2,722,233 | 2,557,988 | 2,921,247 | 2,536,457 | 2,518,963 | 2,524,147 | 2,517,535 (4) |
PlFa | 2,097,979 | 1,959,623 | 2,411,342 | 1,944,036 | 1,925,726 | 1,925,841 | 1,924,430 (4) |
EsCo | 1,185,704 | 1,107,929 | 1,307,943 | 1,109,823 | 1,098,552 | 1,110,092 | 1,095,606 (4) |
HaHi | 985,096 | 904,074 | 1,124,483 | 906,991 | 902,831 | 913,346 | 899,464 (3) |
AeCa | 413,886 | 380,273 | 454,357 | 385,640 | 380,115 | 387,030 | 380,507 (3) |
HePy | 415,161 | 385,096 | 457,859 | 381,545 | 375,481 | 384,071 | 374,362 (3) |
YeMi | 19,262 | 16,835 | 19,805 | 17,167 | 16,798 | 16,861 | 16,861 (2) |
AgPh | 12,183 | 10,754 | 12,243 | 10,882 | 10,708 | 10,711 | 10,745 (2) |
BuEb | 5,441 | 4,668 | 5,291 | 4,774 | 4,686 | 4,642 | 4,690 (1) |
Total | 121,994,822 | 117,582,130 | 146,763,151 | 112,986,546 | 112,744,517 | 112,558,213 | 111,785,440 |

**Table 2.** Computational time (in seconds) needed to compress each DNA sequence with the respective data compressor (LZMA, PAQ8, CoGI, GeCo, GeCo2, XM, and Jarvis). We ran LZMA with the -9 flag (best option), PAQ8 with the -8 flag (best option), GeCo using “-tm 1:1:0:0/0 -tm 3:1:0:0/0 -tm 6:1:0:0/0 -tm 9:10:0:0/0 -tm 11:10:0:0/0 -tm 13:50:1:0/0 -tm 18:100:1:3/10 -c 30 -g 0.9”, GeCo2 with parameters from [88], and XM using 50 copy experts. The compression level used in Jarvis is shown in parentheses and was set according to the size of the sequence. Sequence lengths are given in bases.

ID | Length (bases) | LZMA | PAQ8 | CoGI | GeCo | GeCo2 | XM | Jarvis (level) |
---|---|---|---|---|---|---|---|---|
HoSa | 189,752,667 | 552.5 | 85,269.1 | 25.2 | 648.6 | 652.4 | 5,589.8 | 814.8 (7) |
GaGa | 148,532,294 | 468.7 | 64,898.9 | 19.9 | 503.2 | 494.7 | 3,633.9 | 412.3 (6) |
DaRe | 62,565,020 | 170.0 | 29,907.7 | 8.2 | 215.9 | 198.8 | 785.2 | 284.9 (5) |
OrSa | 43,262,523 | 112.9 | 20,745.1 | 5.8 | 192.4 | 138.3 | 489.7 | 234.5 (5) |
DrMe | 32,181,429 | 85.6 | 14,665.8 | 4.3 | 114.6 | 102.4 | 362.6 | 66.7 (5) |
EnIn | 26,403,087 | 66.0 | 11,183.6 | 3.7 | 95.8 | 82.5 | 279.8 | 101.1 (4) |
ScPo | 10,652,155 | 23.0 | 4,619.1 | 1.5 | 45.2 | 34.2 | 96.5 | 28.7 (4) |
PlFa | 8,986,712 | 18.3 | 4,133.9 | 1.2 | 39.7 | 35.3 | 84.4 | 25.4 (4) |
EsCo | 4,641,652 | 8.1 | 1,973.9 | 0.6 | 26.4 | 5.1 | 36.8 | 10.9 (4) |
HaHi | 3,890,005 | 6.9 | 1,738.1 | 0.5 | 23.7 | 4.4 | 39.1 | 7.1 (3) |
AeCa | 1,591,049 | 2.2 | 675.3 | 0.2 | 17.0 | 1.9 | 10.3 | 2.2 (3) |
HePy | 1,667,825 | 2.3 | 715.1 | 0.2 | 17.2 | 1.9 | 11.2 | 2.7 (3) |
YeMi | 73,689 | 0.1 | 32.6 | 0.0 | 12.3 | 0.1 | 0.9 | 0.2 (2) |
AgPh | 43,970 | 0.0 | 20.1 | 0.0 | 12.1 | 0.1 | 0.9 | 0.1 (2) |
BuEb | 18,940 | 0.0 | 9.1 | 0.0 | 12.2 | 0.1 | 0.7 | 0.1 (1) |
Total | 534,263,017 | 1,516.6 | 240,587.4 | 71.3 | 1,976.3 | 1,742.2 | 11,421.8 | 1,991.7 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

