Bioinformatics combines informatics and biology: computational tools and approaches are applied to the problems that biologists face in domains such as agriculture, medical science and the study of the living world.
DNA sequencing provides fundamental data for genomics, bioinformatics, biology and many other research areas. Sequence data carry significant genetic information that is not immediately apparent, which makes them a central element of bioinformatics and crucial for biological and clinical research in laboratories [1
]. A genomic (or DNA) sequence is a long string formed of four nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). This biomolecule not only contains the genetic information required for the functioning and development of all living beings but also aids medicine in detecting the risk of certain diseases, diagnosing genetic disorders, and identifying a specific individual [4
]. The human genome contains around three billion characters across 23 pairs of chromosomes, a gram of soil holds about 40 million bacterial cells, and certain amphibian species have genomes of more than 100 billion nucleotides [3
]. With rapid advances in DNA sequencing technology, the volume of biological data has increased significantly.
Many file formats are used to store the genomic data such as FASTA, FASTQ, BAM/SAM and VCF/BCF [7
]. In addition to the raw genomic or protein sequences, these formats store other information; FASTQ files, for example, add identifiers and quality scores. FASTA files, by contrast, add only a single identifier line at the head of the data, which represent nucleotides or amino acids [16
From 1982 to the present, the number of nucleotide bases generated in databases has doubled approximately every 18 months and their sizes can range from terabytes (TB) to petabytes (PB) [17
]. In June 2019, approximately 329,835,282,370 bases were generated [18
]. All of these data are stored in special databases, which have been developed by the scientific community such as the 1000 Genomes Project [6
], the International Cancer Genome Project [19
] and the ENCODE project [12
]. This advancement has caused the production of a vast amount of redundant data at a much higher rate than the older technologies, which will increase in the future and may go beyond the limit of storage and bandwidth capacity [3
]. Consequently, these data pose many challenges for bioinformatics researchers: storing, sharing, fast searching and performing operations on this large amount of genomic data become costly, require enormous space, and incur a large computation overhead for the encoding and decoding process [16
]. Sometimes the cost of storage exceeds other costs, which means that storage is the primary requirement for unprocessed data [9
To overcome these challenges efficiently, compression is a natural solution to the growing storage demands of DNA sequences [3
]. It reduces storage size and processing costs, supports fast search and information retrieval, and increases transmission speed over the internet when bandwidth is limited [13
]. In general, compression can be either lossless or lossy; in lossless compression, no information is lost and the original data can be recovered exactly, while in lossy compression, only an approximation of the original data can be recovered [23
]. Many researchers believe that lossless compression schemes are particularly needed for biological and medical data, which cannot afford to lose any part of their data [3
]. Thus, given the importance of data compression, lossless compression methods are recommended for DNA file formats such as FASTA and FASTQ.
Currently, universal text compression algorithms, including gzip [27
] and bzip2 [28
], are not efficient for genomic data because they were designed for English text. Moreover, DNA sequences consist of only four bases, so two bits should be sufficient to store each base, and they follow no clear rules like those of text files; general-purpose text compressors therefore cannot provide good compression results [9
For example, bzip2 compresses 9.7 MB of data to 2.8 MB, a compression ratio (CR) noticeably higher than 2 bits per base (bpb). This is far from satisfactory in terms of compression efficiency, so more effective compression techniques for genomic data are required.
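The bzip2 figure can be checked with a short calculation. The sketch below assumes, as is usual for plain DNA text, that the 9.7 MB input stores one base per byte; the function name `bits_per_base` is introduced here for illustration only.

```python
def bits_per_base(compressed_bytes: float, num_bases: float) -> float:
    """Compression ratio expressed as compressed bits per input base."""
    return compressed_bytes * 8 / num_bases


# Assumption: one byte per base, so 9.7 MB of input holds about 9.7 million bases.
original_mb = 9.7
compressed_mb = 2.8

bpb = bits_per_base(compressed_mb * 1e6, original_mb * 1e6)
print(f"{bpb:.2f} bpb")  # about 2.31 bpb, above the 2 bpb of a plain 2-bit code
```

Under that assumption, a fixed two-bit code would already shrink the same input to roughly a quarter of its size, which is why 2 bpb is the usual baseline for DNA compressors.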
Recently, several lossless compression algorithms have been suggested for the task at hand considering the features of these data.
Tahi and Grumbach introduced two general modes for DNA sequence lossless compression: horizontal mode and vertical mode [5
]. The horizontal (or reference-free) methods rely only on the genetic information of the target sequence, referring exclusively to its own substrings, whereas the vertical (or reference-based) methods use another genome sequence as a reference and store only the genomic locations and any mismatches instead of the sequences themselves [29
]. A significant advantage of referential compression is that, as the read length increases, it produces much better compression ratios than non-referential algorithms. However, reference-based compression techniques are not self-contained, and their success depends on the availability of a good reference genome sequence, which cannot always be ensured. Additionally, creating such databases or reference sequences is itself a challenging problem. Decompression requires the exact reference genome in order to match the genomic locations and extract the reads; hence, it is important to preserve the reference genome [5
Many studies on DNA compression exploit features of DNA sequences, such as the small alphabet size, repeats and palindromes, to achieve better performance. In general, these methods can be classified as statistical, substitutional or dictionary based [13
The first substitution-based compression algorithm found in the literature is BioCompress, which was developed in [30
]. Replicated subsequences, such as repeats, palindromes and complementary palindromes, are detected using the Ziv and Lempel method and then encoded using the repeat length and the position of their earliest occurrence. Two bits per base are used to encode non-repeat regions. BioCompress-2 [5
] is an extension of BioCompress that exploits the same methodology but applies order-2 arithmetic coding when no significant repetition is found.
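The two-bits-per-base baseline used for non-repeat regions can be sketched as follows. This is a minimal illustration, not BioCompress code: the particular code assignment (A=00, C=01, G=10, T=11) and the function names `pack` and `unpack` are assumptions made for the example.

```python
# Assumed code table; the source does not specify which base gets which code.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over ACGT into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a final chunk that holds fewer than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n])


packed = pack("ACGTAC")
print(len(packed))        # 2 bytes for 6 bases
print(unpack(packed, 6))  # ACGTAC
```

The sequence length has to be stored alongside the packed bytes, because padding in the last byte is otherwise indistinguishable from trailing A bases.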
The use of “approximate repeats” in the compression of DNA sequences began with GenCompress [31
], followed by its modified version DNACompress [32
], DNAPack [16
], GeNML [33
] and DNAEcompress [34
]. Two versions of GenCompress were proposed [31
]. GenCompress-1 uses the Hamming distance (substitutions only) to search for approximate repeats, while GenCompress-2 uses the edit distance (deletion, insertion and substitution) to encode the repeats. For non-repeat regions, order-2 arithmetic encoding is used. However, GenCompress-2 fails to compress larger DNA sequences. DNACompress, proposed in [32
] consists of two phases. In the first phase, Pattern Hunter software [35
] is used to identify the approximate repeats containing palindromes in the sequence. In the second phase, the Ziv and Lempel compression method is exploited to encode both approximate repeats and the non-repeat regions. In DNAPack [16
], the authors found a better set of repeats than those found by the previous algorithms. By using a dynamic programming approach, the longest approximate repeats and the approximate complementary palindrome repeats are detected and then encoded by using the Hamming distance (only substitutions). Non-repeat regions are encoded by the best choice from context tree weighting, an order-2 arithmetic coding, and two bits per base. DNAPack performed better than the previous compressors in terms of compression ratio (with an average equal to 1.714 bpb using standard benchmark data).
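The difference between the two distances used by these approximate-repeat methods can be illustrated with a short sketch. The implementations below are textbook versions written for this example (the function names are not taken from any of the cited tools): Hamming distance counts substitutions between equal-length strings, while edit distance also allows insertions and deletions.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# A one-base shift: every position mismatches, yet two edits suffice.
print(hamming("ACGTACGT", "CGTACGTA"))        # 8
print(edit_distance("ACGTACGT", "CGTACGTA"))  # 2
```

The shifted pair shows why an edit-distance model can recognize repeats that a substitution-only model misses, at the price of a more expensive search.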
The GeNML [33
] presents the following improvements to the approach utilized in the NMLComp method [36
]. First, it reduces the cost of searching for pointers to previous matches by restricting the approximate repeat matches; second, it chooses the block sizes used in parsing the target sequence; finally, it introduces adaptable forgetting factors for the memory model. Based on the statistical characteristics of the genomic sequence, the authors in [37
] introduce the XM (eXpert Model) method, which encodes each symbol by estimating its probability distribution from the previous nucleotides using a mixture of experts. The results of the experts are then combined and encoded by an arithmetic encoder. The DNASC algorithm [38
] compresses the DNA sequence horizontally in the first step, using an extended Lempel–Ziv style, and then vertically in the second step, using different block and window sizes.
], two bits per base are used to encode the sequence of DNA as the bit-preprocessing stage. The GENBIT compress tool [39
] divides the input DNA sequence into segments of eight bits (four bases). If consecutive blocks are the same, a specific bit “1” is introduced as a 9th bit; otherwise, the specific bit is “0”. If the fragment length is less than four bases, a unique two-bit code is assigned to each base. The best compression ratio for the entire genome is achieved when the number of repeated blocks is at its maximum; when repeated fragments are few or absent, the compression ratio is higher than two bits per base. The DNABIT Compress Algorithm [4
] splits the sequence into small fragments and compresses them while taking into account whether they have occurred before. To differentiate between exact and reverse repeat fragments, a binary bit is used in the preprocessing stage. This method achieves better compression and significantly improves the running time for larger genomes.
The concept of extended binary trees is used in the HuffBit Compress [40
]. It replaces the bases A, C, G and T with 0, 10, 110 and 111, respectively. However, the compression ratio will be higher than two bits per base if the frequency of ‘G’ or ‘T’ is higher than the other bases’ frequencies; this is the worst case, because both ‘G’ and ‘T’ are replaced by three bits. A combination of different text compression methods has also been proposed to compress genomic data. The improved RLE, proposed in [41
], is based on move-to-front (MTF), run length encoder (RLE) and delta encoding; this method is a good model to compress two or more DNA sequences.
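The HuffBit code table quoted above (A, C, G, T mapped to 0, 10, 110, 111) is a prefix code, so it can be encoded and decoded with a few lines. The sketch below uses bit strings for readability; the function names are chosen for this example and are not taken from the HuffBit implementation.

```python
HUFFBIT = {"A": "0", "C": "10", "G": "110", "T": "111"}
INVERSE = {code: base for base, code in HUFFBIT.items()}


def encode(seq: str) -> str:
    """Concatenate the fixed prefix codes of all bases."""
    return "".join(HUFFBIT[base] for base in seq)


def decode(bits: str) -> str:
    """Read bits until the buffer matches a code word, then emit its base."""
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in INVERSE:
            out.append(INVERSE[buffer])
            buffer = ""
    if buffer:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)


def bits_per_base(seq: str) -> float:
    return len(encode(seq)) / len(seq)


print(encode("ACGT"))         # 010110111
print(bits_per_base("AAAC"))  # 1.25, A-rich input beats 2 bpb
print(bits_per_base("GTGT"))  # 3.0, the worst case described above
```

With equal base frequencies the average is (1 + 2 + 3 + 3) / 4 = 2.25 bpb, so this fixed table only pays off when the bases with short codes dominate.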
The DNA-COMPACT (DNA COMpression) algorithm [42
], based on a pattern-aware contextual modeling technique, comprises two phases. First, it searches for exact repeats and palindromes by exploiting complementary contextual models and represents them by a compact quadruplet. Second, non-sequential contextual models are used, and the predictions of these models are synthesized using a logistic regression model.
A new compression method named SBVRLDNAComp is proposed in [43
]. In this method, the exact repeats are searched and then encoded in four different ways using different techniques, and then the optimal solution is applied to encode them. Finally, the obtained stream is compressed by LZ77. OBRLDNAComp [44
] is a two-pass compression algorithm. In the first pass, it searches for the optimal exact repeat within a sequence. Then, it scans the sequence horizontally from left to right, followed by a vertical scan from top to bottom, before compressing the results using a substitution technique. The seed-based lossless compression method for DNA sequence compaction presented in [45
] is a new substitutional compression scheme, similar to Lempel–Ziv, that compresses all the various types of repeats using an offline dictionary.
A combination of transformation methods and text compression methods was suggested in [46
]. In the first step, Burrows–Wheeler transform (BWT) rearranges the data blocks lexicographically. MTF followed by RLE is used in the second step. Finally, arithmetic coding is used to compress the previously generated data. A novel attempt with the multiple dictionaries based on Lempel–Ziv–Welch (LZW) and binary search has been presented in [47
]. The performance of this algorithm is tested using a non-real DNA dataset. To overcome the limitation of the HuffBit algorithm [40
], a modified HuffBit compression algorithm was introduced in 2018 [48
]. It encodes the bases according to their frequency, which means that the base that occurs more often will be encoded with the shortest code.
Recently, a cloud-based symbol table driven DNA compression algorithm (CBSTD) using R language was proposed in [6
]. In this technique, the entire DNA sequence is subdivided into blocks (each block contains four bases), and each block is then assigned to one of three categories according to the occurrences of its nucleotide bases. The first category is when all four symbols are different; a table of 24 symbols is created to encode each such block. The second is when two nonconsecutive symbols are the same; the block is then divided into two parts and encoded using another table of symbols. The third is when two or more consecutive symbols are the same, in which case the block is encoded with the bases followed by the number of occurrences. The authors in [6
] do not use real biological sequences to test the performance of their algorithm.
], the authors illustrate an efficient genome compression method based on the optimization of the context weighting, abbreviated as OPGC. The authors in [50
] introduced the OBComp, which is based on the frequencies of bases including other symbols than ACGT. Only a single bit is required to encode the two nucleotide bases that occur most frequently, and the positions of the two other bases will be recorded separately before removing them from the original file. The first output will be encoded using modified run length encoding [51
] and Huffman coding [40
], whereas the position files are transformed using data differencing, which can improve the final compression. One of the general compressors based on asymmetric binary coding is used to generate the final compressed data.
Generally, compression algorithms that look for approximate or even exact repeats have a high execution time overhead, as well as a high storage complexity, which makes them unsuitable as efficient compression techniques. The methods described next are reference-free methods designed for compressing files in FASTA and FASTQ formats.
Three of the selected tools compress FASTA files only. The first one is called BIND [52
]. It is a lossless FASTA/multi-FASTA compressor specialized for nucleotide sequence data. The header(s) are separated and compressed with 7-Zip; non-ATGC and lowercase characters are removed from the original data and their positions are stored separately. Each nucleotide (A, T, G and C) is represented by an individual binary code (00, 01, 10 and 11), and the codes are then split into two data streams. After that, the repeat patterns in these two data streams are encoded using a unary coding scheme. Finally, BIND uses the LZMA algorithm to compress the output files.
The DELIMINATE [7
] is an efficient lossless FASTA/multi-FASTA compressor that handles the header and sequence separately and performs the compression in two phases. First, all lowercase characters, as well as all non-ATGC characters, are recorded and eliminated from the original file. Second, the positions of the two most frequent nucleotide bases are transformed using delta encoding, and the two other bases are encoded with a binary encoding. Finally, all generated output files are archived using 7-Zip.
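The second phase described above can be illustrated with a simplified sketch. It is not the DELIMINATE implementation: it assumes the input already contains only uppercase A, C, G and T, keeps everything in Python lists and strings rather than files, and uses names (`encode`, `decode`, `delta_encode`) introduced purely for this example.

```python
from collections import Counter


def delta_encode(positions):
    """Store each position as its gap from the previous one."""
    deltas, prev = [], 0
    for p in positions:
        deltas.append(p - prev)
        prev = p
    return deltas


def delta_decode(deltas):
    positions, total = [], 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions


def encode(seq: str) -> dict:
    assert set(seq) <= set("ACGT"), "sketch assumes a cleaned ACGT-only input"
    ranked = [base for base, _ in Counter(seq).most_common()]
    frequent, rare = ranked[:2], ranked[2:]
    # Delta-encoded positions for the two most frequent bases.
    deltas = {
        base: delta_encode([i for i, x in enumerate(seq) if x == base])
        for base in frequent
    }
    # One bit per remaining base: 0 for the first rare base, 1 for the second.
    rest_bits = "".join(str(rare.index(x)) for x in seq if x in rare)
    return {"length": len(seq), "deltas": deltas, "rare": rare, "rest_bits": rest_bits}


def decode(enc: dict) -> str:
    out = [None] * enc["length"]
    for base, deltas in enc["deltas"].items():
        for p in delta_decode(deltas):
            out[p] = base
    bits = iter(enc["rest_bits"])
    for i, x in enumerate(out):
        if x is None:
            out[i] = enc["rare"][int(next(bits))]
    return "".join(out)


enc = encode("AACAAGTACA")
print(enc["deltas"]["A"])  # [0, 1, 2, 1, 3, 2]
print(decode(enc))         # AACAAGTACA
```

Delta encoding turns the growing position lists into small gap values, which a general-purpose back end such as the 7-Zip stage mentioned above can then compress more effectively.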
The MFCompress [8
] is one of the most efficient lossless non-referential compression algorithms for FASTA files compaction according to a recent survey [13
]. It divides the data into two separate kinds: one containing the nucleotide sequences, the other the headers of the FASTA records. The former are encoded using probabilistic models (multiple competing finite context models) together with arithmetic coding, whereas the headers are encoded using a single finite context model. The SeqCompress algorithm [9
] is based on delta encoding with Lempel–Ziv; it uses a statistical model and arithmetic coding to compress DNA sequences.
Next-generation sequencing (NGS) data are generally stored in FASTQ format. FASTQ files consist of millions to billions of records, and each record has four lines: a metadata (identifier) line; a DNA sequence read obtained from one fold of the oversampling; a quality score sequence estimating the error probability of the corresponding DNA bases; and another line, the same as the first one, which is generally ignored [1
]. Due to the different nature of the FASTQ data, each compressor processes the various data (identifiers, DNA reads and quality scores) separately. In this study, we are interested in the compression of DNA reads.
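Splitting a FASTQ file into its separate streams can be sketched as below. The parser assumes the common four-line layout in which the `+` separator (the line that may repeat the identifier) sits between the read and its quality string; the record type and function name are hypothetical, and the two example reads are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines; the '+' line is dropped."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None:
            return
        seq = next(stripped, None)
        plus = next(stripped, None)
        qual = next(stripped, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("read and quality string differ in length")
        yield FastqRecord(header[1:], seq, qual)


# Made-up example data.
example = """@read1
ACGTN
+
IIII#
@read2
GGCA
+read2
!!II"""

records = list(parse_fastq(example.splitlines()))
reads = [r.sequence for r in records]  # the stream this study focuses on
print(reads)                           # ['ACGTN', 'GGCA']
```

Once the identifiers, reads and quality strings are separated like this, each stream can be handed to a model suited to its statistics.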
], the authors present a new FASTQ compression algorithm named DSRC (DNA sequence reads compressor). It imposes a hierarchical structure on the compressed data by dividing the data into blocks and superblocks, and it encodes the superblocks independently to provide fast random access to any record. Each block represents b records and each superblock contains k blocks (b = 32 and k = 512 by default). In this method, LZ77-style encoding and order-0 Huffman encoding are used for compressing the DNA reads. Quip [53
] is a reference-free compression scheme that uses arithmetic coding based on order-3 and higher-order Markov chains for compressing all parts of FASTQ data. It also supports other file formats such as SAM and BAM.
In the Fqzcomp method [11
], a byte-wise arithmetic coder associated with order-k context models is exploited for compressing the nucleotide sequences in the FASTQ file. GTZ, a reference-free lossless FASTQ compressor, is proposed in [15
]. It integrates adaptive context modeling with multiple prediction modeling schemes to estimate the base-call probabilities. First, it transforms the DNA sequences with the Burrows–Wheeler transform (BWT), which rearranges the sequences into runs of similar characters. Then, the appearance probability of each character is calculated using zero-order and first-order prediction models. Finally, it compresses the generated data blocks with arithmetic coding.
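The run-forming effect of the BWT can be seen on a small repetitive sequence. The sketch below is the naive textbook construction (sort all rotations, keep the last column), which is quadratic and only suitable for illustration; production tools use suffix-array based methods. The `$` end marker is an assumption of this example and must not occur in the input.

```python
def bwt(text: str, end: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert end not in text, "end marker must not occur in the input"
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, end: str = "$") -> str:
    """Rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


transformed = bwt("ACACAC")
print(transformed)               # CCC$AAA
print(inverse_bwt(transformed))  # ACACAC
```

The alternating input becomes two runs of identical characters, which is the property that later modeling or run-length stages exploit.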
In this paper, we propose a new lossless compression method, DNAC-SEB, that uses two different schemes to encode and compress genomic data. Initially, based on the occurrence of the bases, the input data are transformed into several binary streams using a binary encoding scheme. After that, all the binary streams generated previously are divided into fixed-length blocks and encoded using the SEB method. This method encodes each block separately using two different techniques, and the optimal code is then chosen to encode the corresponding block. In the last step, a general-purpose compressor is used to generate the final compressed data. The proposed idea was inspired by [54
The majority of these methods assume data drawn only from the ACGT alphabet, ignoring other, unidentified bases that can appear in DNA sequence data, such as N; the proposed approach takes them into account. DNAC-SEB is a reference-free method: it does not depend on any specific reference genome or any patterns, and it may work with any type of DNA sequence, including sequences with non-ACGT characters. It encodes each block of data separately and does not require any additional information in the decoding step.
The remainder of this study is organized as follows. In the next section, we describe the compression method in detail. Then, we provide performance evaluation details and a discussion of the proposed method against state-of-the-art DNA compressors using different benchmarks. The last section presents the conclusion and some future work.