Impact of Parallel and High-Performance Computing in Genomics

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Technologies and Resources for Genetics".

Deadline for manuscript submissions: closed (20 October 2019) | Viewed by 22823

Special Issue Editors

Lawrence Berkeley National Lab & DOE Joint Genome Institute, University of California, Merced, CA 95343, USA
Interests: large-scale genome sequence analyses, next-generation transcriptomics, metagenomics, machine learning, application of cloud computing and high-performance computing in genomics

Co-Guest Editor
Applied Mathematics, University of California, Merced, CA 95343, USA
Interests: dynamics and transmission of prion proteins in yeast; identification of structural variation from high throughput sequencing data; genome evolution and population dynamics

Co-Guest Editor
Faculty of Technology and Center for Biotechnology, Bielefeld University, 33615 Bielefeld, Germany
Interests: large-scale genomics and metagenomics; cloud computing; high-performance computing

Co-Guest Editor
School of Computer Science and Technology, University of Science and Technology of China, 443 Huangshan Rd, Hefei 230001, China
Interests: large-scale parallel computer system architecture; systems for storing and processing big data; reconfigurable computing for cognitive problems; parallel programming environment and tools; high-performance computing

Co-Guest Editor
Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Interests: bioinformatics; computational genomics; computational epidemiology; analysis of high-throughput sequencing data; statistical inference; discrete algorithms

Special Issue Information

Dear Colleagues,

In the past two decades, massively parallel next-generation sequencing technologies have in turn parallelized genomics projects. Human genome sequencing has evolved from sequencing a few individuals to the massively parallel sequencing of 100,000 individuals or single cells. Culturing and sequencing single microbes has been replaced by culture-independent metagenomics sequencing. The genomic data generated by a single project has grown from a few megabases to hundreds of gigabases or even terabases. Analyzing these datasets can reveal robust links between genotypes and phenotypes, illustrate the precise mechanisms of cellular changes in cancer, and uncover novel taxa with unprecedented metabolic capabilities, among many other applications. However, effectively and efficiently analyzing these massive datasets poses a significant challenge not only to the underlying computing infrastructures and programming models, but also to the algorithms that extract insights from the data and visualize them.

This Special Issue focuses on various “big data genomics” strategies that employ parallel programming paradigms to analyze extremely large genomics datasets. Its scope includes, but is not limited to, traditional task parallelism (e.g., OpenMP, MPI, GPU, and FPGA), data parallelism (MapReduce, Spark), and the more recent model parallelism (deep learning). We welcome submissions of reviews, research articles, and short communications. We also encourage the submission of manuscripts describing new ideas, in the form of “concept papers”.
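As a minimal illustration of the data-parallel pattern these paradigms share, the sketch below partitions a toy read set across worker threads and merges per-partition results. This is a pure-Python stand-in; the reads and the GC-content task are invented for illustration and do not come from any submission.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Toy reads standing in for a large sequencing dataset.
reads = ["ACGTACGT", "GGGCCCAA", "TTTTACGA", "CGCGATAT"] * 1000

def base_counts(chunk):
    """Map step: count bases in one partition of reads."""
    c = Counter()
    for read in chunk:
        for base in read:
            c[base] += 1
    return c

def partition(data, n):
    """Split the dataset into n roughly equal partitions."""
    k = (len(data) + n - 1) // n
    return [data[i:i + k] for i in range(0, len(data), k)]

# Scatter the partitions across workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = pool.map(base_counts, partition(reads, 4))

# Reduce step: merge per-partition counts.
total = Counter()
for p in partials:
    total.update(p)

gc_fraction = (total["G"] + total["C"]) / sum(total.values())
```

The same partition/map/reduce shape underlies MapReduce and Spark jobs; only the scale of the partitions and the execution engine differ.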

Dr. Zhong Wang
Prof. Suzanne Sindi
Dr. Alexander Sczyrba
Prof. Hong An
Prof. Alex Zelikovsky
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • massive parallel next-generation sequencing
  • metagenomics sequencing
  • human genome sequencing
  • big data genomics
  • task parallelism
  • data parallelism
  • MapReduce
  • Spark
  • model parallelism
  • deep learning

Published Papers (6 papers)

Research


23 pages, 3022 KiB  
Article
A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce
by Muhammad Tahir and Muhammad Sardaraz
Genes 2020, 11(2), 166; https://doi.org/10.3390/genes11020166 - 05 Feb 2020
Cited by 6 | Viewed by 3441
Abstract
Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses various issues such as high processing time and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Current SNP detection algorithms face several issues, e.g., computational overhead, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated on benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed; the results are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with the BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% better F-score and 99.80% consistent accuracy on average, with a 0.21% higher mean accuracy. Moreover, SNP mining has been performed to identify specific regions in genome sequences. All frameworks were implemented with the default memory-management configuration, and all workflows were observed to have approximately the same memory requirements. In the future, we intend to display the mined SNPs graphically for user-friendly interaction, and to further analyze and optimize memory requirements. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
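The map/shuffle/reduce structure of such a workflow can be sketched in a few lines. This is a simplified pure-Python stand-in: the toy reference, the pre-aligned reads, and the majority-vote caller are invented for illustration and are not the paper's Heap algorithm.

```python
from collections import defaultdict, Counter

# Toy reference and aligned reads (start position, sequence); a simplified
# stand-in for the Bowtie-aligned input the workflow consumes.
reference = "ACGTACGTAC"
aligned_reads = [
    (0, "ACGTTCGT"),
    (2, "GTTCGTAC"),
    (4, "TCGTAC"),
]

def map_phase(pos, seq):
    """Map: emit (reference position, observed base) pairs for one read."""
    return [(pos + i, base) for i, base in enumerate(seq)]

# Shuffle: group observed bases by reference position (the pileup).
pileup = defaultdict(list)
for pos, seq in aligned_reads:
    for key, base in map_phase(pos, seq):
        pileup[key].append(base)

def reduce_phase(pos, bases, min_depth=2):
    """Reduce: call a SNP when the consensus base differs from the reference."""
    if len(bases) < min_depth:
        return None
    consensus, _ = Counter(bases).most_common(1)[0]
    if consensus != reference[pos]:
        return (pos, reference[pos], consensus)
    return None

snps = [call for pos, bases in sorted(pileup.items())
        if (call := reduce_phase(pos, bases)) is not None]
```

In a real Hadoop job, the map and reduce functions run on different nodes and the shuffle is performed by the framework; the logic per position is the same.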

15 pages, 1017 KiB  
Article
SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark
by Zaid Al-Ars, Saiyi Wang and Hamid Mushtaq
Genes 2020, 11(1), 53; https://doi.org/10.3390/genes11010053 - 03 Jan 2020
Cited by 7 | Viewed by 3918
Abstract
The rapid proliferation of low-cost RNA-seq data has resulted in a growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating discoveries of other analysis results. However, many practical applications in this field are limited by the available computational resources and associated long computing time needed to perform the analysis. GATK has a popular best practices pipeline specifically designed for variant calling RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, thereby limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark based pipeline to efficiently scale up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a dataset of 32 GB. In contrast, SparkRA is able to reduce the overall computation time of the pipeline on the same single node by about 4×, reducing the computation time down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA is able to further reduce this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of the results. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)

13 pages, 1882 KiB  
Article
A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
by Ashley Cliff, Jonathon Romero, David Kainer, Angelica Walker, Anna Furches and Daniel Jacobson
Genes 2019, 10(12), 996; https://doi.org/10.3390/genes10120996 - 02 Dec 2019
Cited by 21 | Viewed by 4918
Abstract
As time progresses and technology improves, biological data sets are continuously increasing in size. New methods and new implementations of existing methods are needed to keep pace with this increase. In this paper, we present a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF). This new implementation enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs. Using this implementation, we also present a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more. We compare the new implementation of iRF with the previous R version and analyze its time to completion on two of the world’s fastest supercomputers, Summit and Titan. We also show iRF-LOOP’s ability to capture biologically significant results when creating Predictive Expression Networks. This new implementation of iRF will enable the analysis of biological data sets at scales that were previously not possible. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
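The leave-one-out loop at the heart of iRF-LOOP can be sketched as follows. This is a toy stand-in: absolute Pearson correlation replaces random-forest feature importance, and the three-gene expression matrix is invented for illustration.

```python
import math

# Toy expression matrix (gene -> per-sample values); a stand-in for the
# ~40,000-gene matrices the paper targets.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # largely unrelated
}

def correlation(x, y):
    """Pearson correlation, used here as a cheap proxy for RF importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def loop_network(expr):
    """Leave-one-out: for each target gene, score every other gene as a
    predictor; the scores become directed edge weights in the network."""
    edges = {}
    for target, y in expr.items():
        for feature, x in expr.items():
            if feature != target:
                edges[(feature, target)] = abs(correlation(x, y))
    return edges

network = loop_network(expression)
```

In the actual method, each per-gene model is an iterative random forest trained in parallel on an HPC system, and the feature-importance scores, not correlations, define the edges.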

17 pages, 4379 KiB  
Article
MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction
by Zahra Momeni and Mohammad Saniee Abadeh
Genes 2019, 10(12), 969; https://doi.org/10.3390/genes10120969 - 25 Nov 2019
Cited by 2 | Viewed by 2370
Abstract
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG-sites. The algorithm first clusters the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA), based on the MapReduce paradigm (MR-based PGA), is used to select age-related features for each age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. We consider 16 different healthy DNAm datasets related to human blood tissue that contain the relevant age information. These datasets are combined into a single set, which is randomly divided into training and test sets with a ratio of 7:3. We build a Gradient Boosting Regressor (GBR) model on the CpG-sites selected from the training set. To evaluate model accuracy, we compared our results with state-of-the-art approaches that used these datasets, and observed that our method performs better on the unseen test set, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R2) of 95.96% between age and DNAm. On the training data, the MAD and R2 are 1.27 years and 99.27%, respectively. Finally, we evaluate the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas parallelized execution on 3 computing machines with 32 processing cores each takes only 58 min in total. This shows that our proposed algorithm is both efficient and scalable. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
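The task-parallel fitness-evaluation step can be sketched as below. This is a toy stand-in: thread-based parallelism replaces the MapReduce framework, and the additive per-site score is an invented proxy for the paper's regression-based fitness function.

```python
import random
from concurrent.futures import ThreadPoolExecutor

random.seed(0)

N_SITES = 50   # toy pool of candidate CpG-sites
SUBSET = 8     # sites encoded by each chromosome
POP = 20       # population size

# Hypothetical per-site "age signal"; in the paper, the fitness of a site
# subset comes from a model trained on DNA-methylation data.
site_score = [random.random() for _ in range(N_SITES)]

def fitness(chromosome):
    """Evaluate one chromosome (a tuple of site indices)."""
    return sum(site_score[i] for i in chromosome)

def random_chromosome():
    return tuple(random.sample(range(N_SITES), SUBSET))

population = [random_chromosome() for _ in range(POP)]

# Evaluate all chromosomes in parallel: this is the task-parallel step
# that the MR-based PGA distributes across mappers.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(fitness, population))

best = population[max(range(POP), key=scores.__getitem__)]
```

Selection, crossover, and mutation would then produce the next generation; only the independent fitness evaluations need to be farmed out in parallel.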

17 pages, 3244 KiB  
Article
PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead
by Lingqi Zhang, Cheng Liu and Shoubin Dong
Genes 2019, 10(11), 886; https://doi.org/10.3390/genes10110886 - 04 Nov 2019
Cited by 10 | Viewed by 4093
Abstract
(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. However, the exponential growth of genome data requires multi-node solutions to handle large data volumes, which remains a challenge. Spark is a ubiquitous big data platform that has been exploited to assist genome alignment in meeting this challenge. Nonetheless, existing works that utilize Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by using the pipe operation in Spark. We additionally propose a pipeline structure and in-memory computation to further accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
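The pipe pattern that PipeMEM builds on, streaming a partition of reads through an external process, can be sketched with the standard library. This is a stand-in: a tiny Python subprocess replaces BWA-MEM, and the read lines are invented for illustration.

```python
import subprocess
import sys

# Toy read lines standing in for one Spark partition of sequencing data.
reads = ["acgtacgt", "ggccaatt", "ttagcgta"]

# External "aligner" stand-in: a small Python one-liner that upper-cases
# each input line, invoked the way Spark's pipe() streams a partition
# through an external command such as BWA-MEM.
cmd = [sys.executable, "-c",
       "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())"]

proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        text=True)
out, _ = proc.communicate("\n".join(reads) + "\n")
aligned = out.splitlines()
```

The key property, and the source of PipeMEM's low overhead, is that data is streamed through the external process line by line rather than serialized into the JVM and back.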

Other


8 pages, 194 KiB  
Perspective
Computational Strategies for Scalable Genomics Analysis
by Lizhen Shi and Zhong Wang
Genes 2019, 10(12), 1017; https://doi.org/10.3390/genes10121017 - 06 Dec 2019
Cited by 10 | Viewed by 3543
Abstract
The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine big genomics data. In this review, we survey some of these exciting developments in the application of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for computer scientists with an interest in genomics applications. Full article
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)