Article

SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

Computer Engineering Lab, Delft University of Technology, Mekelweg 5, 2628 CD Delft, The Netherlands
*
Author to whom correspondence should be addressed.
Genes 2020, 11(1), 53; https://doi.org/10.3390/genes11010053
Received: 30 October 2019 / Revised: 1 December 2019 / Accepted: 10 December 2019 / Published: 3 January 2020
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries made by other analyses. However, many practical applications in this field are limited by the available computational resources and the long computing times needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale efficiently across multiple processors or compute nodes, which limits their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales the GATK RNA-seq variant calling pipeline across multiple cores in one node or across a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline takes more than 5 h to process a 32 GB dataset. In contrast, SparkRA reduces the overall computation time on the same node by about 4×, down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA further reduces this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy.
Keywords: GATK variant calling; RNA-seq; Apache Spark; scalability; computation time
MDPI and ACS Style

Al-Ars, Z.; Wang, S.; Mushtaq, H. SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark. Genes 2020, 11, 53. https://doi.org/10.3390/genes11010053

