Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges

Over the last decade, gene set analysis has become the first choice for gaining insights into the underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome-wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview of the statistical structure and steps of gene set analysis approaches used for microarray, RNA-sequencing and genome-wide association data analysis. We also classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic. Rather than reviewing the gene set analysis approaches individually, we describe the generation-wise evolution of such approaches for microarrays, RNA-sequencing and genome-wide association studies and discuss their relative merits and limitations. We identify the key biological and statistical challenges in current gene set analysis, which need to be addressed by statisticians and biologists collectively in order to develop the next generation of gene set analysis approaches. Further, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.


First generation (Over representation analysis)
Over Representation Analysis (ORA), also called functional enrichment analysis, is used to identify pathways/GO categories that are overrepresented in a given list of genes, typically differentially expressed genes obtained from microarray or RNA-seq data by using traditional statistical tests such as the t-test. Similarly, for SNP data, it starts by selecting SNPs and mapping the interesting SNPs to the corresponding genes. This initial selection process is based on whether a SNP is mapped to the pathway or whether the SNP is associated with susceptibility to the disease. Depending on the results, ORA builds a 2 × 2 contingency table to conduct a hypergeometric test. The underlying statistical tests/methodologies for each of the tools are given below.

The second-generation methods use variations of a general framework but share a common execution pattern consisting of the following steps: (i) computation of a gene-level statistic from the molecular measurements of an experiment; (ii) computation of a gene set-level statistic; (iii) evaluation of the statistical significance of the computed statistic. The underlying statistical tests/methodologies used in second-generation GSA tools are given below.

Topology/graph theory-based methods are similar to the second-generation methods in that they perform the same steps. They differ only in that they use pathway topology/gene set network information to compute gene-level statistics. The methodologies used in third-generation GSA tools are given below.

The second- and third-generation GSA tools take test statistic(s) or p-values associated with genes as input, while ignoring the original nature (i.e. discrete, continuous, categorical) of the genomic data. Thus, the fourth generation of GSA approaches is being developed to take the original data as input. The underlying tests/methodologies used in such tools are given below.
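The 2 × 2 contingency table and hypergeometric test used in ORA can be sketched as follows. This is a minimal illustration in Python and not the implementation of any tool reviewed here. The function names `contingency_table` and `ora_pvalue` and the toy gene identifiers are assumptions made for the example. The one-sided (upper-tail) p-value is the probability of observing at least as many selected genes in the gene set as were actually observed.

```python
from math import comb


def contingency_table(universe, gene_set, selected):
    """Build the 2 x 2 table used by ORA (hypothetical helper).

    Rows: selected / not selected genes.
    Columns: in the gene set / not in the gene set.
    """
    universe = set(universe)
    in_set = set(gene_set) & universe
    chosen = set(selected) & universe
    a = len(chosen & in_set)          # selected and in the gene set
    b = len(chosen - in_set)          # selected, not in the gene set
    c = len(in_set - chosen)          # not selected, in the gene set
    d = len(universe) - a - b - c     # neither selected nor in the gene set
    return [[a, b], [c, d]]


def ora_pvalue(table):
    """Upper-tail hypergeometric p-value P(X >= a) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n_universe = a + b + c + d
    n_in_set = a + c
    n_selected = a + b
    total = comb(n_universe, n_selected)
    # Sum the probabilities of all overlaps at least as large as observed.
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(a, min(n_in_set, n_selected) + 1)
    )
    return tail / total


# Toy data (assumed): 20 genes, a 5-gene set, 4 selected genes, 3 of them in the set.
universe = [f"g{i}" for i in range(20)]
gene_set = ["g0", "g1", "g2", "g3", "g4"]
selected = ["g0", "g1", "g2", "g10"]

table = contingency_table(universe, gene_set, selected)
p_value = ora_pvalue(table)
```

In this sketch the gene universe is passed in explicitly, because the result of the test depends on which genes are counted as the background.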

CST
Contains interactive signaling pathway diagrams, research overviews, relevant antibody products, publications, etc. Protein nodes in each interactive pathway diagram are linked to specific antibody product information or, optionally, to protein-specific listings in the database of post-translational modifications.

Database of Interacting Proteins (DIP)
Catalogs the experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually by expert curators and automatically using computational approaches that utilize knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.
https://dip.doe-mbi.ucla.edu/dip/Main.cgi [17]

Gramene
Open source, curated resource for plant comparative genomics and pathway analysis designed to support researchers working in plant genomics, breeding, evolutionary biology, systems biology, and metabolic engineering. It supports visualizing and analyzing genomic information for 44 plants, including curated rice pathways and orthology-based pathway projections for 66 plant species including various crops.
www.gramene.org [18]

PANTHER
The PANTHER Classification System is designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.

NetPath
A resource of curated human signaling pathways. It also provides detailed maps of a number of immune signaling pathways. It acts as a consolidated resource for human signaling pathways that should enable systems biology approaches. [21]

GOLD.db
Provides biological pathways with image maps and visual pathway information for lipid metabolism and obesity-related research. The database also makes it possible to map gene expression data individually to each pathway. Gene expression under different experimental conditions can be viewed sequentially in the context of the pathway.
http://gold.tugraz.at [22]

PATIKA
PATIKA is composed of a server-side, scalable, object-oriented database and client-side editors that provide an integrated, multi-user environment for visualizing and manipulating networks of cellular events. The tool features automated pathway layout, functional computation support, advanced querying and a user-friendly graphical interface.
http://pstiing.licr.org [24]

TRMP
Information about non-target proteins and natural small molecules involved in these pathways also provides useful hints for finding new therapeutic targets and facilitates understanding of how therapeutic targets interact with other molecules in performing specific tasks. The TRMP database is designed to provide information about such multiple pathways along with related therapeutic targets, corresponding drugs/ligands, targeted disease conditions, constituent individual pathways, and structural and functional information about each protein in the pathways.

Chromosomal location
Competitive H0: Genes in the gene set overlap with particular chromosomal location(s) at most as often as the genes not in the gene set.
Here, a gene set (as a collection of genes) can be tested for its association with chromosomal locations (e.g. on chromosome 1). Therefore, proper statistical approaches and tools need to be developed to analyze gene sets with respect to annotation information such as chromosomal locations.

Differential expression
Self-contained H0: No genes in gene set are differentially expressed.
Competitive H0: Genes in the gene set are at most as often differentially expressed as the genes not in the gene set.
In the usual differential expression analysis, a list of differentially expressed genes and a differential expression score for each gene are computed. A statistical methodology can then be developed to test whether the gene set is overrepresented in this list.

Quantitative Trait Loci (QTL)
Self-contained H0: No genes in the gene set overlap with the QTL regions.
Competitive H0: Genes in the gene set overlap with the QTL regions at most as often as the genes not in the gene set.
QTLs are genomic regions that contain or are linked to genes and that correlate with variation in a phenotype. Analyzing gene sets based on trait-specific QTLs through a computational approach, instead of traditional GO or pathway information, will be very helpful in unraveling genotype-phenotype relationships.

Exon content
Self-contained H0: Genes in the gene set are enriched with equal exon content.
Competitive H0: Genes in the gene set are at most as often enriched with equal exon content as the genes not in the gene set.
Another set of statistical tests can be designed to test gene sets with respect to exon count. For instance, a null hypothesis can be such that genes in the gene sets have higher proportions of exon counts than genes outside the gene sets.

Biological process (e.g. cell cycle)
Self-contained H0: No genes in the gene set are represented with a biological process (e.g. cell cycle).
Competitive H0: Genes in the gene set are at most as often overrepresented with the biological process (e.g. cell cycle) as the genes not in the gene set.
Gene sets can be tested for their association with a biological process (e.g. cell cycle). Therefore, proper statistical approaches and tools need to be developed to analyze gene sets with respect to information such as the cell cycle.

Condition/Disease type/Cell type
Self-contained H0: No genes in the gene set are associated with a particular disease type.
A gene set (as a collection of genes) can be tested for its association with the disease type (e.g. breast

Figure S1. Standard operating procedures for gene set analysis followed in microarrays, RNA-seq and GWAS.
Figure S2. Analytical steps of GSA for microarray data analysis.
Figure S3. Analytical steps of GSA for RNA-seq data analysis.
Figure S4. Analytical steps of GSA for SNP (GWAS) data analysis.
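The contrast between the self-contained and competitive null hypotheses for differential expression can be sketched with two permutation schemes. This is a minimal illustration and not the method of any specific tool reviewed here. The example rests on several assumptions: the absolute difference in group means stands in for the gene-level statistic, the mean of gene scores stands in for the gene set-level statistic, and all function names and the toy expression data are invented. Permuting sample labels addresses the self-contained hypothesis, while drawing random gene sets of the same size addresses the competitive hypothesis.

```python
import random
from statistics import mean


def gene_scores(expr, labels):
    """Gene-level statistic (assumed): absolute difference in group means."""
    scores = {}
    for gene, values in expr.items():
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(mean(case) - mean(ctrl))
    return scores


def set_score(scores, gene_set):
    """Gene set-level statistic (assumed): mean of the gene-level scores."""
    return mean([scores[g] for g in gene_set])


def self_contained_pvalue(expr, labels, gene_set, n_perm=500, seed=0):
    """Permute sample labels; only genes in the set are used."""
    rng = random.Random(seed)
    subset = {g: expr[g] for g in gene_set}
    observed = set_score(gene_scores(subset, labels), gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_score(gene_scores(subset, shuffled), gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def competitive_pvalue(expr, labels, gene_set, n_perm=500, seed=0):
    """Compare the set with random gene sets of the same size."""
    rng = random.Random(seed)
    scores = gene_scores(expr, labels)
    observed = set_score(scores, gene_set)
    genes = list(scores)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        if set_score(scores, random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed): 3 controls, 3 cases; 4 shifted genes and 16 unshifted genes.
labels = [0, 0, 0, 1, 1, 1]
expr = {f"s{i}": [0.0, 0.0, 0.0, 5.0, 5.0, 5.0] for i in range(4)}
expr.update({f"b{i}": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0] for i in range(16)})
shifted_set = ["s0", "s1", "s2", "s3"]
background_set = ["b0", "b1", "b2", "b3"]

p_self = self_contained_pvalue(expr, labels, shifted_set)
p_comp = competitive_pvalue(expr, labels, shifted_set)
```

With only six samples, the label-permutation p-value is bounded below by the small number of distinct label assignments. The gene-sampling scheme has no such bound, but it treats genes as independent units, which is a known limitation of competitive tests.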