
Open Access Article

A Survey on Data Compression Methods for Biological Sequences

Institute of Electronics and Informatics Engineering of Aveiro/Department of Electronics, Telecommunications and Informatics (IEETA/DETI), University of Aveiro, 3810-193 Aveiro, Portugal
Author to whom correspondence should be addressed.
Academic Editor: Khalid Sayood
Information 2016, 7(4), 56
Received: 27 June 2016 / Revised: 23 September 2016 / Accepted: 29 September 2016 / Published: 14 October 2016
(This article belongs to the Special Issue Multimedia Information Compression and Coding)


The ever-increasing production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge: it reduces storage space and processing costs, and it speeds up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches that are specialized for biological data, including protein and DNA sequences. We also devote an important part of the paper to approaches proposed for compressing different file formats, such as FASTA, as well as FASTQ and SAM/BAM, which contain quality scores and metadata in addition to the biological sequences. We then compare the performance of several methods in terms of compression ratio, memory usage and compression/decompression time. Finally, we present some suggestions for future research on biological data compression.
Keywords: protein sequence; DNA sequence; reference-free compression; reference-based compression; FASTA; Multi-FASTA; FASTQ; SAM; BAM

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Hosseini, M.; Pratas, D.; Pinho, A.J. A Survey on Data Compression Methods for Biological Sequences. Information 2016, 7, 56.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Information, EISSN 2078-2489, published by MDPI AG, Basel, Switzerland