Open Access Article

MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction

Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran P.O. Box 14115-143, Iran
Institute for Research in Fundamental Sciences (IPM), School of Computer Science, Tehran P.O. Box 14115-143, Iran
Author to whom correspondence should be addressed.
Genes 2019, 10(12), 969
Received: 20 October 2019 / Revised: 12 November 2019 / Accepted: 15 November 2019 / Published: 25 November 2019
(This article belongs to the Special Issue Impact of Parallel and High-Performance Computing in Genomics)
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have reported an association between changes in DNAm and human age. The high-dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for the selection of age-related CpG-sites. The algorithm first clusters the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA) based on the MapReduce paradigm (MR-based PGA) selects the age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel), and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different DNAm datasets of healthy human blood tissue that contain the relevant age information. These datasets are combined into a single set, which is then randomly divided into train and test sets at a ratio of 7:3. We build a Gradient Boosting Regressor (GBR) model on the CpG-sites selected from the train set. To evaluate the model accuracy, we compared our results with state-of-the-art approaches that used these datasets and observed that our method performs better on the unseen test dataset, with a Mean Absolute Deviation (MAD) of 3.62 years and a correlation (R²) of 95.96% between age and DNAm. On the train data, the MAD and R² are 1.27 years and 99.27%, respectively. Finally, we evaluate the effect of parallelization on computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines with 32 processing cores each takes a total of only 58 min. This shows that our proposed algorithm is both efficient and scalable.
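To make the task-parallel evaluation of chromosomes more concrete, the following minimal Python sketch shows one way a genetic algorithm for CpG-site selection could score its population in a map step and pick the best candidate in a reduce step. It is not the authors' implementation. Several things here are assumptions made only for illustration: the function names (`fitness`, `evaluate_population`, `next_generation`, `select_sites`, `make_toy_data`), the synthetic data, the GA settings, the use of a thread pool as a stand-in for a MapReduce cluster, and the one-variable least-squares fit that replaces the GBR model. The age-range clustering stage and the data-parallel fitness calculation are omitted.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def fitness(chromosome, samples, ages):
    """MAD (in years) of a one-variable least-squares fit on the mean of the selected sites."""
    selected = [i for i, bit in enumerate(chromosome) if bit]
    if not selected:
        return float("inf")  # an empty selection cannot predict anything
    x = [sum(row[i] for i in selected) / len(selected) for row in samples]
    n = len(x)
    mx, my = sum(x) / n, sum(ages) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, ages))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum(abs(intercept + slope * xi - yi) for xi, yi in zip(x, ages)) / n


def evaluate_population(population, samples, ages, workers=4):
    """Map: score every chromosome concurrently. Reduce: keep the lowest-MAD pair."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda c: fitness(c, samples, ages), population))
    scored = list(zip(scores, population))
    best = reduce(lambda a, b: a if a[0] <= b[0] else b, scored)
    return scored, best


def next_generation(scored, rng, mutation_rate=0.05):
    """Binary tournament selection, one-point crossover, bit-flip mutation."""
    def pick():
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    children = []
    while len(children) < len(scored):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        children.append([bit ^ (rng.random() < mutation_rate) for bit in child])
    return children


def select_sites(samples, ages, pop_size=20, generations=30, seed=0):
    """Run the GA and return (best MAD, best site mask) over all evaluated chromosomes."""
    rng = random.Random(seed)
    n_sites = len(samples[0])
    population = [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(pop_size)]
    best = (float("inf"), None)
    for _ in range(generations):
        scored, gen_best = evaluate_population(population, samples, ages)
        if gen_best[0] < best[0]:
            best = gen_best
        population = next_generation(scored, rng)
    return best


def make_toy_data(n_samples=60, n_noise=5, seed=1):
    """Synthetic data (hypothetical): site 0 tracks age, the remaining sites are noise."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 80) for _ in range(n_samples)]
    samples = [
        [age / 100 + rng.gauss(0, 0.01)] + [rng.uniform(0.4, 0.6) for _ in range(n_noise)]
        for age in ages
    ]
    return samples, ages


if __name__ == "__main__":
    samples, ages = make_toy_data()
    best_mad, best_mask = select_sites(samples, ages)
    print(f"best MAD: {best_mad:.2f} years, selected sites: {best_mask}")
```

In CPython, threads do not run pure-Python code on several cores at once, so this sketch shows the structure of the map and reduce steps but not the speedup. A real deployment would distribute the map step across processes or cluster nodes.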
Keywords: age prediction; MapReduce; parallel genetic algorithm; CpG-site selection; GBR Model
MDPI and ACS Style

Momeni, Z.; Saniee Abadeh, M. MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction. Genes 2019, 10, 969.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
