Open Access Article

A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce

Department of Computer Science, COMSATS University Islamabad, Attock Campus 43600, Pakistan
* Author to whom correspondence should be addressed.
Genes 2020, 11(2), 166; https://doi.org/10.3390/genes11020166
Received: 20 January 2020 / Revised: 31 January 2020 / Accepted: 1 February 2020 / Published: 5 February 2020
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
Next-generation sequencing (NGS) technologies produce huge amounts of biological data, whose analysis demands long processing times and large amounts of memory. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. We propose a fast and scalable workflow that integrates the Bowtie aligner with a Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed; the results are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. The experimental results show that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, and Heap integrated with the BWA and Bowtie aligners, as well as SparkGA and Halvade. The proposed workflow achieves a 22.46% higher F-score and a consistent accuracy of 99.80% on average; in comparison, its mean accuracy is 0.21% higher. Moreover, SNP mining has been performed to identify specific regions in genome sequences. All frameworks are run with the default memory management configuration, and the observations show that all workflows have approximately the same memory requirements. In future work, we intend to display the mined SNPs graphically for user-friendly interaction, and to analyze and optimize the memory requirements.
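The abstract reports results as F-score and accuracy without stating how they are computed. The following is a minimal sketch of the standard definitions, assuming SNP calls are compared against a truth set by (chromosome, position); the function names, the example calls, and the `n_sites` parameter are hypothetical and are not taken from the article.

```python
def confusion_counts(called, truth, n_sites):
    """Count TP, FP, FN, TN for SNP calls against a truth set.

    called, truth: iterables of (chromosome, position) tuples (assumed format).
    n_sites: total number of genome sites evaluated (assumed to be known).
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)      # SNPs called and present in the truth set
    fp = len(called - truth)      # SNPs called but absent from the truth set
    fn = len(truth - called)      # true SNPs that were missed
    tn = n_sites - tp - fp - fn   # remaining sites, correctly left uncalled
    return tp, fp, fn, tn


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (F1); 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all evaluated sites that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Toy example with made-up positions, for illustration only.
truth_snps = [("chr1", 100), ("chr1", 250), ("chr2", 40)]
called_snps = [("chr1", 100), ("chr2", 40), ("chr2", 77)]
tp, fp, fn, tn = confusion_counts(called_snps, truth_snps, n_sites=1000)
print(f"F-score: {f_score(tp, fp, fn):.4f}")
print(f"Accuracy: {accuracy(tp, fp, fn, tn):.4f}")
```

Because true negatives dominate when SNPs are rare, accuracy tends to sit close to 100% for any reasonable caller, so the F-score is usually the more discriminating of the two metrics.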
Keywords: DNA; NGS; SNP; Hadoop; Map-Reduce; accuracy; execution time
MDPI and ACS Style

Tahir, M.; Sardaraz, M. A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce. Genes 2020, 11, 166.
