Special Issue "Systems Analytics and Integration of Big Omics Data"

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Technologies and Resources for Genetics".

Deadline for manuscript submissions: closed (31 August 2018).

Special Issue Editor

Prof. Gary Hardiman
Guest Editor
1. School of Biological Sciences, Institute for Global Food Security (IGFS), Queen’s University Belfast, BT7 1NN Belfast, Northern Ireland, UK
2. Department of Medicine, Medical University of South Carolina, Charleston, SC 29425, USA
Interests: computational biology; genomics and genetics; big data; pharmacogenomics; endocrine disruption; systems biology

Special Issue Information

Dear Colleagues,

The emergence and global utilization of high-throughput (HT) technologies, including deep sequencing (genomics) and mass spectrometry (proteomics, metabolomics, lipidomics), has allowed geneticists, biologists, and biostatisticians to bridge the gap between genotype and phenotype on a scale that was not previously possible. The adoption of a novel technology is typically accompanied by a paradigm shift in how biological assays are designed and executed.

Throughput typically increases by an order of magnitude, accompanied by an exponential cost reduction compared with older, traditional approaches. The economic benefit and efficacy of nascent technologies are often realized through process miniaturization combined with the multiplexing of millions of reactions.

Big data encompasses data sets so large and complex that processing them with traditional data-processing applications is impractical. Challenges consequently arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of big data.

Analogous to the impact of high-throughput DNA sequencing on genomics and transcriptomics, mass spectrometry has revolutionized proteomics, providing independent draft maps of the human proteome. Large-scale interrogation of biological systems using mass-spectrometry-based proteomics provides insights not available from genomics data, namely information on protein abundance, cell-type- and time-dependent expression patterns, post-translational modifications, and protein–protein interactions.

As observed with DNA microarray analysis pipelines over a decade ago, and more recently with HT sequencing, better analytical tools are emerging, primarily from open-source efforts, that permit additional analyses and enhanced information mining from raw data sets compared with the tool kits provided with the instruments themselves.

Different statistical pipelines require different types of compute infrastructure: query-intensive data analysis calls for large database storage arrays; high-throughput sequencing requires high-speed data networks with a hierarchically organized compute core; and statistical modeling methods require a modular, closely coupled compute infrastructure.

Administration and development strategies must take into account the ever-growing size of data, public accessibility of analyzed data, software deprecations and upgrades, hardware failures, user interface improvements, user account management, and long-term storage, as well as the security of systems.

In this Special Issue, we will focus on integration strategies for systems level analysis of omics data, big data infrastructure, rigor and transparency in big data research, best practices for sharing omics data with public repositories, recent developments in pathway and network algorithm development, and integration of omics data with clinical and biomedical data.

Prof. Dr. Gary Hardiman
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.


Keywords

  • systems level analysis
  • high-throughput sequencing
  • mass spectrometry
  • bioinformatics pipelines
  • rigor and transparency in big data research
  • omics data management
  • analysis provenance
  • algorithm development for pathway/network integration

Published Papers (12 papers)


Editorial


Open Access Editorial
An Introduction to Systems Analytics and Integration of Big Omics Data
Genes 2020, 11(3), 245; https://doi.org/10.3390/genes11030245 - 26 Feb 2020
Abstract
A major technological shift in the research community in the past decade has been the adoption of high throughput (HT) technologies to interrogate the genome, epigenome, transcriptome, and proteome in a massively parallel fashion [...] Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Research


Open Access Article
MODIMA, a Method for Multivariate Omnibus Distance Mediation Analysis, Allows for Integration of Multivariate Exposure–Mediator–Response Relationships
Genes 2019, 10(7), 524; https://doi.org/10.3390/genes10070524 - 11 Jul 2019
Abstract
Many important exposure–response relationships, such as diet and weight, can be influenced by intermediates, such as the gut microbiome. Understanding the role of these intermediates, the mediators, is important in refining cause–effect theories and discovering additional medical interventions (e.g., probiotics, prebiotics). Mediation analysis has been at the heart of behavioral health research, rapidly gaining popularity with the biomedical sciences in the last decade. A specific analytic challenge is being able to incorporate an entire ’omics assay as a mediator. To address this challenge, we propose a hypothesis testing framework for multivariate omnibus distance mediation analysis (MODIMA). We use the power of energy statistics, such as partial distance correlation, to allow for analysis of multivariate exposure–mediator–response triples. Our simulation results demonstrate the favorable statistical properties of our approach relative to the available alternatives. Finally, we demonstrate the application of the proposed methods in two previously published microbiome datasets. Our framework adds a new tool to the toolbox of approaches to the integration of ‘omics big data. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)
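The MODIMA framework above builds on energy statistics; the paper uses partial distance correlation, but the core quantity is distance correlation between multivariate samples. The following is a minimal, pure-Python sketch of plain (unpartialled) distance correlation on 1-D toy vectors that are invented for illustration, not taken from the paper.

```python
# Sketch of distance correlation, the energy statistic underlying
# MODIMA-style exposure-mediator-response tests (toy data, 1-D case).

def _dist_matrix(xs):
    # Pairwise Euclidean distances for 1-D samples.
    return [[abs(a - b) for b in xs] for a in xs]

def _double_center(d):
    # Subtract row means and column means, add back the grand mean.
    n = len(d)
    row = [sum(r) / n for r in d]
    col = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

def distance_correlation(xs, ys):
    n = len(xs)
    A = _double_center(_dist_matrix(xs))
    B = _double_center(_dist_matrix(ys))
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvarx = sum(a * a for r in A for a in r) / n**2
    dvary = sum(b * b for r in B for b in r) / n**2
    if dvarx * dvary == 0:
        return 0.0
    return (dcov2 / (dvarx * dvary) ** 0.5) ** 0.5

exposure = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.1, 3.9, 6.2, 8.1, 9.8]  # nearly linear in exposure
print(distance_correlation(exposure, response))  # close to 1 for a near-linear relationship
```

Unlike Pearson correlation, distance correlation is zero only under independence, which is what makes it usable as an omnibus test statistic for whole-assay mediators.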
Show Figures

Figure 1

Open Access Article
IPCT: Integrated Pharmacogenomic Platform of Human Cancer Cell Lines and Tissues
Genes 2019, 10(2), 171; https://doi.org/10.3390/genes10020171 - 22 Feb 2019
Abstract
(1) Motivation: The exponential increase in multilayered data, including omics, pathways, chemicals, and experimental models, requires innovative strategies to identify new linkages between drug response information and omics features. Despite the availability of databases such as the Cancer Cell Line Encyclopedia (CCLE), the Cancer Therapeutics Response Portal (CTRP), and The Cancer Genome Atlas (TCGA), it is still challenging for biologists to explore the relationship between drug response and underlying genomic features due to the heterogeneity of the data. In light of this, the Integrated Pharmacogenomic Database of Cancer Cell Lines and Tissues (IPCT) has been developed as a user-friendly way to identify new linkages between drug responses and genomic features, as these findings can lead not only to new biological discoveries but also to new clinical trials. (2) Results: The IPCT allows biologists to compare the genomic features of sensitive cell lines or small molecules with the genomic features of tumor tissues by integrating the CTRP and CCLE databases with the REACTOME, cBioPortal, and Expression Atlas databases. The input consists of a list of small molecules, cell lines, or genes, and the output is a graph containing data entities connected with the queried input. Users can apply filters to the databases, pathways, and genes as well as select computed sensitivity values and mutation frequency scores to generate a relevant graph. Different objects are differentiated based on the background color of the nodes. Moreover, when multiple small molecules, cell lines, or genes are input, users can see their shared connections to explore the data entities common between them. Finally, users can view the resulting graphs in the online interface or download them in multiple image or graph formats. (3) Availability and Implementation: The IPCT is available as a web application with an integrated MySQL database. 
The web application was developed using Java and deployed on the Tomcat server. The user interface was developed using HTML5, JQuery v.3.1.0, and the Cytoscape Graph API v.1.0.4. The IPCT web application and the source code are available in the Sample Availability section. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Open Access Article
Improving the Gene Ontology Resource to Facilitate More Informative Analysis and Interpretation of Alzheimer’s Disease Data
Genes 2018, 9(12), 593; https://doi.org/10.3390/genes9120593 - 29 Nov 2018
Abstract
The analysis and interpretation of high-throughput datasets relies on access to high-quality bioinformatics resources, as well as processing pipelines and analysis tools. Gene Ontology (GO, geneontology.org) is a major resource for gene enrichment analysis. The aim of this project, funded by the Alzheimer’s Research United Kingdom (ARUK) foundation and led by the University College London (UCL) biocuration team, was to enhance the GO resource by developing new neurological GO terms, and use GO terms to annotate gene products associated with dementia. Specifically, proteins and protein complexes relevant to processes involving amyloid-beta and tau have been annotated and the resulting annotations are denoted in GO databases as ‘ARUK-UCL’. Biological knowledge presented in the scientific literature was captured through the association of GO terms with dementia-relevant protein records; GO itself was revised, and new GO terms were added. This literature biocuration increased the number of Alzheimer’s-relevant gene products that were being associated with neurological GO terms, such as ‘amyloid-beta clearance’ or ‘learning or memory’, as well as neuronal structures and their compartments. Of the total 2055 annotations that we contributed for the prioritised gene products, 526 have associated proteins and complexes with neurological GO terms. To ensure that these descriptive annotations could be provided for Alzheimer’s-relevant gene products, over 70 new GO terms were created. Here, we describe how the improvements in ontology development and biocuration resulting from this initiative can benefit the scientific community and enhance the interpretation of dementia data. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Open Access Article
NeVOmics: An Enrichment Tool for Gene Ontology and Functional Network Analysis and Visualization of Data from OMICs Technologies
Genes 2018, 9(12), 569; https://doi.org/10.3390/genes9120569 - 23 Nov 2018
Abstract
The increasing number of OMICs studies demands bioinformatic tools that aid in the analysis of large sets of genes or proteins to understand their roles in the cell and establish functional networks and pathways. In the last decade, over-representation or enrichment tools have played a successful role in the functional analysis of large gene/protein lists, which is evidenced by thousands of publications citing these tools. However, in most cases the results of these analyses are long lists of biological terms associated to proteins that are difficult to digest and interpret. Here we present NeVOmics, Network-based Visualization for Omics, a functional enrichment analysis tool that identifies statistically over-represented biological terms within a given gene/protein set. This tool provides a hypergeometric distribution test to calculate significantly enriched biological terms, and facilitates analysis on cluster distribution and relationship of proteins to processes and pathways. NeVOmics is adapted to use updated information from the two main annotation databases: Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG). NeVOmics compares favorably to other Gene Ontology and enrichment tools regarding coverage in the identification of biological terms. NeVOmics can also build different network-based graphical representations from the enrichment results, which makes it an integrative tool that greatly facilitates interpretation of results obtained by OMICs approaches. NeVOmics is freely accessible at https://github.com/bioinfproject/bioinfo/. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)
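The hypergeometric test that NeVOmics applies per GO term or KEGG pathway can be written with the standard library alone. The sketch below uses invented counts, not data from the paper: it asks how surprising it is to see k annotated genes in a hit list of size N, given n annotated genes among M in the background.

```python
# Stdlib-only sketch of the hypergeometric over-representation test used by
# enrichment tools such as NeVOmics (illustrative counts, not from the paper).
from math import comb

def enrichment_p(M, n, N, k):
    """P(X >= k): M genes in the background, n of them annotated with the
    term, N genes in the query list, k of those carrying the term."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

# 20 of 5000 background genes carry the term; 8 of a 100-gene hit list do,
# where only ~0.4 would be expected by chance.
p = enrichment_p(M=5000, n=20, N=100, k=8)
print(p)  # a very small p-value, flagging the term as enriched
```

In practice a tool would run this test for every term and then correct the resulting p-values for multiple testing (e.g., Benjamini-Hochberg) before reporting enriched terms.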

Open Access Article
An Analytic Approach Using Candidate Gene Selection and Logic Forest to Identify Gene by Environment Interactions (G × E) for Systemic Lupus Erythematosus in African Americans
Genes 2018, 9(10), 496; https://doi.org/10.3390/genes9100496 - 15 Oct 2018
Cited by 2
Abstract
Development and progression of many human diseases, such as systemic lupus erythematosus (SLE), are hypothesized to result from interactions between genetic and environmental factors. Current approaches to identify and evaluate interactions are limited, most often focusing on main effects and two-way interactions. While higher order interactions associated with disease are documented, they are difficult to detect since expanding the search space to all possible interactions of p predictors means evaluating 2^p − 1 terms. For example, data with 150 candidate predictors requires considering over 10^45 main effects and interactions. In this study, we present an analytical approach involving selection of candidate single nucleotide polymorphisms (SNPs) and environmental and/or clinical factors and use of Logic Forest to identify predictors of disease, including higher order interactions, followed by confirmation of the association between those predictors and interactions identified with disease outcome using logistic regression. We applied this approach to a study investigating whether smoking and/or secondhand smoke exposure interacts with candidate SNPs resulting in elevated risk of SLE. The approach identified both genetic and environmental risk factors, with evidence suggesting potential interactions between exposure to secondhand smoke as a child and genetic variation in the ITGAM gene associated with increased risk of SLE. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)
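The search-space arithmetic in the abstract is worth making concrete: every non-empty subset of p predictors is a candidate term, so exhaustive evaluation means 2^p − 1 models.

```python
# Why exhaustive interaction search is infeasible: the number of candidate
# main effects and interactions among p predictors is 2**p - 1.

def n_terms(p):
    # Count of non-empty subsets of p predictors.
    return 2 ** p - 1

print(n_terms(2))    # 3: two main effects plus one two-way interaction
print(n_terms(150))  # exceeds 10**45, matching the figure in the abstract
```

This combinatorial explosion is exactly why greedy ensemble methods such as Logic Forest, which search the space heuristically, are needed.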

Open Access Article
miRmapper: A Tool for Interpretation of miRNA–mRNA Interaction Networks
Genes 2018, 9(9), 458; https://doi.org/10.3390/genes9090458 - 14 Sep 2018
Cited by 2
Abstract
It is estimated that 30% of all genes in the mammalian cells are regulated by microRNA (miRNAs). The most relevant miRNAs in a cellular context are not necessarily those with the greatest change in expression levels between healthy and diseased tissue. Differentially expressed (DE) miRNAs that modulate a large number of messenger RNA (mRNA) transcripts ultimately have a greater influence in determining phenotypic outcomes and are more important in a global biological context than miRNAs that modulate just a few mRNA transcripts. Here, we describe the development of a tool, “miRmapper”, which identifies the most dominant miRNAs in a miRNA–mRNA network and recognizes similarities between miRNAs based on commonly regulated mRNAs. Using a list of miRNA–target gene interactions and a list of DE transcripts, miRmapper provides several outputs: (1) an adjacency matrix that is used to calculate miRNA similarity utilizing the Jaccard distance; (2) a dendrogram and (3) an identity heatmap displaying miRNA clusters based on their effect on mRNA expression; (4) a miRNA impact table and (5) a barplot that provides a visual illustration of this impact. We tested this tool using nonmetastatic and metastatic bladder cancer cell lines and demonstrated that the most relevant miRNAs in a cellular context are not necessarily those with the greatest fold change. Additionally, by exploiting the Jaccard distance, we unraveled novel cooperative interactions between miRNAs from independent families in regulating common target mRNAs; i.e., five of the top 10 miRNAs act in synergy. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)
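The two core computations miRmapper describes, ranking miRNAs by how many DE mRNAs they target and comparing miRNAs by Jaccard distance over shared targets, are easy to sketch. The miRNA names and target lists below are invented for illustration, not from the paper.

```python
# Toy sketch of the miRmapper ideas: "impact" ranking by target count and
# miRNA-miRNA similarity via Jaccard distance (hypothetical target sets).

targets = {
    "miR-A": {"g1", "g2", "g3", "g4"},
    "miR-B": {"g3", "g4", "g5"},
    "miR-C": {"g9"},
}

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B|; 0 means identical target sets.
    return 1.0 - len(a & b) / len(a | b)

# Rank miRNAs by number of regulated DE transcripts (network "impact").
impact = sorted(targets, key=lambda m: len(targets[m]), reverse=True)
print(impact)  # miR-A dominates this toy network

# Small pairwise distance flags miRNAs that co-regulate the same mRNAs.
print(jaccard_distance(targets["miR-A"], targets["miR-B"]))  # 0.6
```

Clustering the resulting distance matrix (the paper uses a dendrogram and heatmap) is what reveals miRNAs from independent families acting in synergy on common targets.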

Open Access Article
A Computational Method for Classifying Different Human Tissues with Quantitatively Tissue-Specific Expressed Genes
Genes 2018, 9(9), 449; https://doi.org/10.3390/genes9090449 - 07 Sep 2018
Cited by 5
Abstract
Tissue-specific gene expression has long been recognized as a crucial key for understanding tissue development and function. Efforts have been made in the past decade to identify tissue-specific expression profiles, such as the Human Proteome Atlas and FANTOM5. However, these studies mainly focused on “qualitatively tissue-specific expressed genes” which are highly enriched in one or a group of tissues but paid less attention to “quantitatively tissue-specific expressed genes”, which are expressed in all or most tissues but with differential expression levels. In this study, we applied machine learning algorithms to build a computational method for identifying “quantitatively tissue-specific expressed genes” capable of distinguishing 25 human tissues from their expression patterns. Our results uncovered the expression of 432 genes as optimal features for tissue classification, achieving a Matthews correlation coefficient (MCC) of more than 0.99 with a support vector machine (SVM). This model was superior to an SVM model using tissue-enriched genes and yielded an MCC of 0.985 on an independent test dataset, indicating its good generalization ability. These 432 genes were proven to be widely expressed in multiple tissues, and a literature review of the top 23 genes found support for the discriminating power of most of them. As a complement to previous studies, our discovery of these quantitatively tissue-specific genes provides insights into the detailed understanding of tissue development and function. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Review


Open Access Review
Challenges in the Integration of Omics and Non-Omics Data
Genes 2019, 10(3), 238; https://doi.org/10.3390/genes10030238 - 20 Mar 2019
Cited by 4
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Open Access Review
Machine Learning and Integrative Analysis of Biomedical Big Data
Genes 2019, 10(2), 87; https://doi.org/10.3390/genes10020087 - 28 Jan 2019
Cited by 9
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Open Access Review
From Genotype to Phenotype: Through Chromatin
Genes 2019, 10(2), 76; https://doi.org/10.3390/genes10020076 - 23 Jan 2019
Cited by 1
Abstract
Advances in sequencing technologies have enabled the exploration of the genetic basis for several clinical disorders by allowing identification of causal mutations in rare genetic diseases. Sequencing technology has also facilitated genome-wide association studies to gather single nucleotide polymorphisms in common diseases including cancer and diabetes. Sequencing has therefore become common in the clinic for both prognostics and diagnostics. The success in follow-up steps, i.e., mapping mutations to causal genes and therapeutic targets to further the development of novel therapies, has nevertheless been very limited. This is because most mutations associated with diseases lie in inter-genic regions including the so-called regulatory genome. Additionally, no genetic causes are apparent for many diseases including neurodegenerative disorders. A complementary approach is therefore gaining interest, namely to focus on epigenetic control of the disease to generate more complete functional genomic maps. To this end, several recent studies have generated large-scale epigenetic datasets in a disease context to form a link between genotype and phenotype. We focus on DNA methylation and important histone marks, where recent advances have been made thanks to technology improvements, cost effectiveness, and large meta-scale epigenome consortia efforts. We summarize recent studies unravelling the mechanistic understanding of epigenetic processes in disease development and progression. Moreover, we show how methodology advancements enable causal relationships to be established, and we pinpoint the most important issues to be addressed by future research. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)

Other

Open Access Feature Paper Perspective
Artificial Intelligence and Integrated Genotype–Phenotype Identification
Genes 2019, 10(1), 18; https://doi.org/10.3390/genes10010018 - 28 Dec 2018
Cited by 2
Abstract
The integration of phenotypes and genotypes is at an unprecedented level and offers new opportunities to establish deep phenotypes. There are a number of challenges to overcome, specifically, accelerated growth of data, data silos, incompleteness, inaccuracies, and heterogeneity within and across data sources. This perspective report discusses artificial intelligence (AI) approaches that hold promise in addressing these challenges by automating computable phenotypes and integrating them with genotypes. Collaborations between biomedical and AI researchers will be highlighted in order to describe initial successes with an eye toward the future. Full article
(This article belongs to the Special Issue Systems Analytics and Integration of Big Omics Data)
