Gene Set−Based Integrative Analysis Revealing Two Distinct Functional Regulation Patterns in Four Common Subtypes of Epithelial Ovarian Cancer

Clear cell (CCC), endometrioid (EC), mucinous (MC) and high-grade serous carcinoma (SC) are the four most common subtypes of epithelial ovarian carcinoma (EOC). The widely accepted dualistic model of ovarian carcinogenesis divided EOCs into type I and II categories based on the molecular features. However, this hypothesis has not been experimentally demonstrated. We carried out a gene set-based analysis by integrating the microarray gene expression profiles downloaded from the publicly available databases. These quantified biological functions of EOCs were defined by 1454 Gene Ontology (GO) term and 674 Reactome pathway gene sets. The pathogenesis of the four EOC subtypes was investigated by hierarchical clustering and exploratory factor analysis. The patterns of functional regulation among the four subtypes containing 1316 cases could be accurately classified by machine learning. The results revealed that the ERBB and PI3K-related pathways played important roles in the carcinogenesis of CCC, EC and MC; while deregulation of cell cycle was more predominant in SC. The study revealed that two different functional regulation patterns exist among the four EOC subtypes, which were compatible with the type I and II classifications proposed by the dualistic model of ovarian carcinogenesis.


Introduction
Epithelial ovarian carcinomas (EOC) are composed of a group of heterogeneous subtypes classified by their histology and the degree of epithelial proliferation and invasion. Clear cell (CCC), endometrioid (EC), mucinous (MC) and high-grade serous carcinoma (SC) are four common subtypes of EOC. Within the four subtypes, high-grade SC is the most common type accounting for 70% of EOC, followed by CCC, while MC is relatively rare. However, the carcinogenesis of EOC is still poorly understood. Based on the clinicopathological and molecular features, the dualistic model was proposed and divided EOCs into type I and II categories [1]. The type I EOC, including CCC, EC and MC, usually originating from the mutations of KRAS, BRAF, ERBB2, CTNNB1, PTEN and PIK3CA, is genetically stable and has a relatively indolent behavior [2]. The type II EOC, mainly high-grade SC, displays TP53 mutation in over 80% of the cases, exhibits impaired DNA damage repair and has a more uncontrolled cell differentiation and aggressive behavior. This hypothesis was based on the studies performed in the author's laboratory and correlated with the clinical, pathologic and molecular features of the disease. However, there is no single study, nor integrative analysis to demonstrate this hypothesis and compare the pathogenesis among the four EOC subtypes. As a result, we conducted a gene set-based analysis integrating the microarray gene expression profiles of the four EOC subtypes from the publicly available database. Gene expression microarray is the primary tool for investigating cancers, the analysis of gene expression profiles usually starts with detecting the differentially expressed genes (DEG) by statistical methods, and then the aberrant Gene Ontology (GO) terms or signaling pathways are inferred from the DEGs. This workflow identifies the most significant disease-related genes, function or processes annotated by GO terms or signaling pathways, however, it will focus only on the significant ones and omit those whose p values do not reach statistical significance. In fact, genes or GO terms that did not reach the significance also play a role in the carcinogenesis of EOCs. Besides, only limited functions defined by the GO term or canonical pathways are analyzed; the complete information about the regulation of the functions i.e., functionome in EOC is not provided. To address these limitations, we investigated the pathogenesis of the four subtypes of EOC with microarray gene expression profiles of EOC and their functionomes. The biological function was quantized by converting the gene expression profiles to a gene set regularity (GSR) index computed by modifying the DIRAC algorithm [3], which measured the matching degree of gene expression rankings in a given gene set between two different phenotypes, i.e., EOC and the normal ovarian tissue control in this study. This model utilized the gene set definitions from the GO term [4] and Reactome pathway [5] databases downloaded from the Molecular Signatures Database (MSigDB) [6]. These two gene set definitions collect relatively comprehensive biological functions, processes or signaling pathways. We then utilized them to annotate human functionomes. The GO database contains 1454 gene sets, defining biological functions, process and cellular components; the canonical pathway database contains 1330 curated canonical signaling pathways. In our previous study [7], we demonstrated by the GSR indices a stepwise deterioration of cellular function regularity during SC progression from stage I to stage IV according to International Federation of Gynecology and Obstetrics (FIGO). The pathogenesis of SC centered on cell cycle deregulation accompanied with multi-functional aberrations and interactions. To further explore the pathogenesis and relationship among different subtypes of EOCs, we collected the gene expression datasets of the four common subtypes of EOC and normal ovarian samples from the publicly available databases and converted them into the GSR indices, ranging from 0 to 1 and reflecting the regularities of functions defined by the GO terms or Reactome pathways. Then, the pathogenesis of the four EOC subtypes was investigated and compared with the GSR indices by hierarchical clustering, statistical methods and exploratory factor analysis (EFA).

DNA Microarray Gene Expression Datasets and Gene Sets
DNA microarray gene expression datasets of the four EOC subtypes were downloaded from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. Initially, 1855 potentially eligible microarray gene expression profiles were selected. We filtered out the datasets that resulted in the available common gene number less than 8000 during cross-platform integration. A total of 1452 samples, including 85 CCC, 90 EC, 48 MC, 1093 SC and 136 normal ovarian tissue control samples, were utilized in this study ( Table 1). Most of the SC samples were not sub-divided into low-or high-grade SC in the GEO database. However, because high-grade SCs constitute around 90% of all SCs, it was reasonable to assume that the majority of the samples were high-grade SC. These samples data were collected from 38 datasets containing six different DNA microarray platforms without missing data. The 136 normal ovarian tissue gene expression profiles were used as controls for all of the four EOC subtypes. The detailed sample information, including the subtypes, platforms and accession numbers was available in Table S1. The 1454 GO term and 674 Reactome pathway gene set definitions were downloaded from the MSigDB, and the versions were "c5.all.v5.0.symbols.gmt" and "c2.cp.reactome.v5.0.symbols.gmt", respectively. Because various genes were utilized in different microarray platforms, finally, 1446, 1445, 1446, 1350 GO terms and 669, 669, 669 and 614 Reactome pathways were used in computing the GSR indices for the CCC, EC, MC and SC groups, respectively.

Means and Histograms of the GSR Indices of the Four Subtypes
The workflow of the GSR model was displayed on Figure 1, and the detailed procedures of computation were described in Methods. The GSR indices ranged from 0 to 1, 0 represented the most chaotic regulation of the function; while 1 represented the functional regulation of the EOC was completely unchanged in comparison with the most common gene expression ranking in the normal control population. The mean of total GSR indices for each subtype group was smaller than the normal control group, and the difference was statistically significant with a p value <0.001. The CCC and EC groups had similar means of the GSR indices, and the SC group has the smallest mean of the GSR indices, as listed in Table 1. It indicated the EOC groups exhibited more deregulated functions defined by the GO terms than the normal controls, while the SC group had the worst deregulation state.
When displayed on the histograms (Figure 2), two distinguishable distributions of the GSR indices appeared in each EOC subtype; the distribution located on the left side consisted of the GSR indices for the EOC subtype had a smaller levels than the normal control distribution on the right side, indicating the biological functions were generally more deregulated in the EOC subtypes than the normal ovarian tissue controls. Especially, a second peak was observed on the left side of the histogram for the SC group, indicating the existence of a group of more severely deregulated functions in the SC group.

Means and Histograms of the GSR Indices of the Four Subtypes
The workflow of the GSR model was displayed on Figure 1, and the detailed procedures of computation were described in Methods. The GSR indices ranged from 0 to 1, 0 represented the most chaotic regulation of the function; while 1 represented the functional regulation of the EOC was completely unchanged in comparison with the most common gene expression ranking in the normal control population. The mean of total GSR indices for each subtype group was smaller than the normal control group, and the difference was statistically significant with a p value <0.001. The CCC and EC groups had similar means of the GSR indices, and the SC group has the smallest mean of the GSR indices, as listed in Table 1. It indicated the EOC groups exhibited more deregulated functions defined by the GO terms than the normal controls, while the SC group had the worst deregulation state.
When displayed on the histograms (Figure 2), two distinguishable distributions of the GSR indices appeared in each EOC subtype; the distribution located on the left side consisted of the GSR indices for the EOC subtype had a smaller levels than the normal control distribution on the right side, indicating the biological functions were generally more deregulated in the EOC subtypes than the normal ovarian tissue controls. Especially, a second peak was observed on the left side of the histogram for the SC group, indicating the existence of a group of more severely deregulated functions in the SC group. Histograms of the four subtypes. The gene set regularity (GSR) indices for each subtype and normal control group were displayed on the histograms by density. The GSR indices for the two groups showed two distinguishable distributions on the histograms; the distribution consisted of the GSR indices for the EOC subtypes (orange) located on the left side had smaller levels, indicating the biological functions were generally more deregulated in the EOC subtypes than the normal control group (blue).

Figure 2.
Histograms of the four subtypes. The gene set regularity (GSR) indices for each subtype and normal control group were displayed on the histograms by density. The GSR indices for the two groups showed two distinguishable distributions on the histograms; the distribution consisted of the GSR indices for the EOC subtypes (orange) located on the left side had smaller levels, indicating the biological functions were generally more deregulated in the EOC subtypes than the normal control group (blue).

The Relationships of the Four Subtypes
To discover the relationships of the four EOC subtypes, the GSR indices of each gene set for the four EOC subtypes were averaged then classified by hierarchical clustering and displayed on the heatmap and dendrogram ( Figure 3). Grossly, the four EOC subtypes showed distinguishable patterns on the heatmap; the patterns between CCC and EC were more similar, while SC's pattern was quite distinct from the others and showed the worst regularity of function. Their relationships were also demonstrated by the dendrogram (Figure 3). The CCC and EC groups had the closest relationship, followed by MC, while SC group was the farthest from the other three subtypes.

The Relationships of the Four Subtypes
To discover the relationships of the four EOC subtypes, the GSR indices of each gene set for the four EOC subtypes were averaged then classified by hierarchical clustering and displayed on the heatmap and dendrogram ( Figure 3). Grossly, the four EOC subtypes showed distinguishable patterns on the heatmap; the patterns between CCC and EC were more similar, while SC's pattern was quite distinct from the others and showed the worst regularity of function. Their relationships were also demonstrated by the dendrogram (Figure 3). The CCC and EC groups had the closest relationship, followed by MC, while SC group was the farthest from the other three subtypes.

Functional Regulation Patterns Classified and Predicted by Machine Learning
Machine learning can learn from data by building a model and recognizing patterns to make prediction. We trained support vector machine (SVM) [8], a high performance machine learning algorithm to classify and predict among the four EOC subtypes and the normal control datasets with their functional regulation patterns consisted of the GSR indices. The accuracies were tested by

Functional Regulation Patterns Classified and Predicted by Machine Learning
Machine learning can learn from data by building a model and recognizing patterns to make prediction. We trained support vector machine (SVM) [8], a high performance machine learning algorithm to classify and predict among the four EOC subtypes and the normal control datasets with their functional regulation patterns consisted of the GSR indices. The accuracies were tested by five-fold cross-validation. Of 1316 samples, 1052 samples were used for training, and the remaining 264 samples were used for classification and prediction. Each measurement was measured by the cumulative results of repeating 10 times classifications and predictions. The results were shown in Table 2. The accuracies of binary classification (each EOC subtype vs. control) ranged from 98.18% to 100.00%. The classification between the CCC and normal control groups had the best result. The AUC of the test for each subtype ranged from 0.9805 to 1.0000. The accuracy of multiclass classification among the four subtypes and normal control group was 95.55%. The SVM is a widely used, high-performance machine learning algorithm; this result revealed that the GSR indices could provide sufficient and adequate information for SVM to undergo accurate classification and prediction. Table 2. Accuracies of the binary and multiclass classification and prediction by machine learning. This table displayed the performances of the binary (each subtype vs. control group) and multiclass classification (among the four subtype groups) and prediction by SVM with the GSR indices computed through the GO terms. The sensitivities, specificities, AUC, accuracies and the SD were measured by five-fold cross-validation. Each measurement was computed by the cumulative 10 results of repeated classifications and predictions. SVM: support vector machine; GSR: gene set regularity; GO: Gene Ontology; AUC: area under curve; SD: standard deviation; NA: not available.

Deregulated GO Terms and Reactome Pathways of the Subtypes
The GSR index is computed based on the extent of ranking change within a gene set defined by the GO terms or Reactome pathways between the case and control group, so the GSR index reflects the regulation of function defined by that gene set and can be utilized to evaluate the function regulation by comparing the difference between the EOC and normal control group. In order to compare the four EOC subtypes and normal controls based on the same standard, the GSR indices of the four subtype and normal control groups were computed after standardization by the baseline gene set template derived from the most common gene expression rankings in the normal ovarian gene expression profiles. The output of this calculation contained approximately 1400 or 670 GSR indices computed through the GO or Reactome pathway gene sets for each case and in each subtype. Table 3 displayed the top 15 deregulated GO terms ranked by the p values, and the full content was available in Table S2. The first deregulated GO term was "cofactor transport" for the CC and EC groups, "aldo-keto reductase" for MC and "protein tyrosine kinase activity" for the SC group. There were many recurring gene sets existing among the four subtype groups. For example, "oxidoreductase activity" was found in all of the four subtype groups, while "inositol or phosphatidylinositol phosphatase activity" appeared in the CCC, EC and MC groups. These recurring GO terms represented the commonly deregulated functions among the different EOC subtypes. In addition to oxidoreductase activity and cell adhesion, numerous deregulated GO terms in the SC group were associated with cell cycle, including "spindle", "negative regulation of cell proliferation" and "double stranded DNA binding", etc.
The Reactome pathways ranked by the p values revealed the first and second significant deregulated pathways in the CCC and EC groups were "downregulation of ERBB2 ERBB3 signaling" and pathways related to PI3K-AKT, respectively (Table 4); the full content is available in Table S3. Obviously, numerous significantly deregulated Reactome pathways were involved in the PI3K-AKT pathway. In the SC group, the first, second and fourth deregulated pathways were associated with G protein. The first deregulated Reactome pathway was "Ca dependent events"; it was a downstream pathway of "G protein mediated events" and "PCL beta mediated events" (4th deregulated Reactome pathway). The second deregulated pathway was "DARPP 32 events", which was a downstream of G protein coupled receptor (GPCR) signaling pathway and associated with neurotransmitter and steroid signaling. However, their roles in the carcinogenesis of EOC were unknown. Many of the subsequently deregulated pathways in the SC group were associated with cell cycle control, such as "G0 and early G1" and "cyclin A/B1 associated events during G2/M transition pathway", etc. Oxidoreductase Activity Acting on The Aldehyde or OXO Group of Donorsnad or Nadp As Acceptor

The Commonly Deregulated GO Term and Reactome Pathway Gene Sets among the Four Subtypes
Due to the existence of numerous recurring gene sets among the four subtype groups, we carried out set analysis for the top 200 deregulated GO or Reactome pathway gene sets to find out the similarities of deregulated functions among the four EOC subtypes. The p values of those selected gene sets were less than 0.001. The numbers of intersected gene set were displayed on the Venn diagram as shown in Figure 4. There were 27 commonly deregulated GO terms, accounting for 13.5% of all top 200 gene sets among the four subtype groups, including protein tyrosine kinase, cell adhesion, channel activity, oxidoreductase activity, DNA and protein binding etc. The number of common gene sets increased to 73%, or 36.5% of the top 200 gene sets among the CCC, EC and MC groups. Furthermore, the common gene set number was up to 114, or 57% of top 200 gene sets among the CCC and EC groups. It indicated the CCC and EC groups shared more than half of the most deregulated functions and implied a similar pathogenesis between CCC and EC. This finding was compatible with the relationship revealed by the dendrogram on Figure 3. In contrast, the deregulated functions of the SC group were quite different from the other three subtype groups; there were only 39 commonly deregulated gene sets between the MC and SC groups. The set analysis for the Reactome pathway gene sets among the four subtypes showed the number of commonly deregulated Reactome gene sets was 66, it accounted for 33% of the top 200 deregulated pathways. The number of commonly deregulated Reactome gene sets among the CCC, EC and MC groups was 101, or 50.5% of top 200 deregulated gene sets ( Figure 5).  Figure 5).

The Elements of Carcinogenesis Networks Discovered by Exploratory Factor Analysis
Usually, the pathogenesis of complex diseases, such as EOC, involves a variety of functions" aberrations as well as interactions. EFA is a broadly applied statistical technique to discover the underlying structures, or networks among numerous variables. We carried out the EFA to find out the gene set elements contributing to the EOC carcinogenesis network among 1454 GO terms or 674 Reactome pathways with the gene sets of p value <0.0001. The number of "factors", i.e., structure or network contributing to EOC carcinogenesis, was determined by the function "fa.parallel". The numbers of factors was 6, 4, 4 and 11 for the CCC, EC, MC and SC groups, respectively. Taking the CCC group as an example, EFA found six networks (factors) of gene sets involved in the carcinogenesis of CCC selected from the deregulated GO terms of p value <0.0001; each of the six networks contained 118, 59, 40, 52, 35 and 22 gene set elements, respectively. The 118 deregulated GO terms in the first network were associated with oxidoreductase activity, transmembrane receptor protein tyrosine kinase activity, G protein coupled receptor binding, transcription coactivator activity, chromatin assembly, cell cycle, ion transport, binding and cell adhesion. The second network was composed of the elements associated with sterol binding, cell division, channel activity, oxidoreductase activity, chromatin assembly and inositol/phosphatidylinositol phosphatase activity. They represented two different but overlapped networks of EOC carcinogenesis. The sixth network containing 22 elements was a sub-network of the first one.
Because of the similarity among the CCC, EC and MC groups revealed by the hierarchical clustering and set analysis, we merged the microarray gene expression datasets of the three subtypes (CCC-EC-MC group), recomputed the GSR indices for this group and carried out the EFA to discover the commonly deregulated functions among the three subtypes. The results of EFA showed seven networks of deregulated GO terms. The first network was composed of cell proliferation, oxidoreductase activity, protein binding, cell adhesion, steroid hormone, protein tyrosine kinase activity, GPCR, immune response, GTPase activity and metabolism. The second network was composed of oxidoreductase activity, cell adhesion, extracellular matrix, binding and GTPase activity. The third, fourth and fifth network was associated with channel activity, transport, G protein activity and chromatin assembly, respectively. We also utilized the EFA to analyze the Reactome pathways for the combined CCC-EC-MC group; the results showed the signaling cascades were primarily associated with the PI3K and ERBB pathways. The results of EFA for the SC group showed the deregulated GO terms were predominantly associated with cell cycle, apoptosis, cell proliferation and development. Especially, all of the elements in the 5th network were associated with cell cycle, including "spindle", "mitotic cell cycle checkpoint", "M phase of mitotic cell cycle", "condensed chromosome", "regulation of mitosis" and "microtubule organizing center", indicating a series of cell cycle control deregulation. The full EFA results were available in Supplemental Materials (Table S4-

Trees of Deregulated GO Terms for the Four Subtypes
Because the GO terms are structured ontologies established according to their child-parent relationship, the deregulated GO gene set elements from the EFA could be organized and visualized on a directed acyclic graph according to their GO hierarchies. The redundant GO terms could be diminished and simplify the interpretation of EFA results. To establish the tree of deregulated GO terms for each subtype, the deregulated GO gene set elements collected from all factors were merged then remapped to the GO tree by the R package "RamiGO", which would upload these GO terms to the AmiGO 2 web server for establishment of the GO trees. The deregulated GO tree of SC group is displayed in detail in Figure 6 as an illustration. The full deregulated GO trees of the four subtypes are available in Supplemental Materials (Figures S1-S4). This figure show the screenshot of the full GO tree of the SC group and some important deregulated GO terms. After mapping to the GO tree, the deregulated GO terms with similar functions or properties clustered together and were arranged by their GO hierarchies. Then, the group of clustered GO terms could be summarized by their common parental GO terms. Thus, the deregulated functions, processes or cellular components could be interpreted in a simplified way. Nine groups of clusters could be found in the deregulated GO terms of the SC group, including cell cycle, channel activity, oxidoreductase activity, chromosome, development, regulation of cell proliferation, regulation of programmed cell death and protein kinase activity. The GO tree provided an intuitive way to view the structure of deregulated functions in the carcinogenesis of EOCs. The GO trees of the CCC, EC and MC groups were relatively similar, including components of oxidoreductase activity, cell adhesion, binding, G protein activity, metabolism, channel activity and protein kinase activity. There were overlapping elements among the four EOC subtypes; however, the cell cycle-related GO terms were predominantly observed in the SC group. GO terms of the SC group, including cell cycle, channel activity, oxidoreductase activity, chromosome, development, regulation of cell proliferation, regulation of programmed cell death and protein kinase activity. The GO tree provided an intuitive way to view the structure of deregulated functions in the carcinogenesis of EOCs. The GO trees of the CCC, EC and MC groups were relatively similar, including components of oxidoreductase activity, cell adhesion, binding, G protein activity, metabolism, channel activity and protein kinase activity. There were overlapping elements among the four EOC subtypes; however, the cell cycle-related GO terms were predominantly observed in the SC group. After mapping to the GO tree, the similar GO terms clustered together. Each cluster was circled (red) and some of the important deregulated GO terms (green boxes) in the cluster were magnified to view the details. Each cluster was labeled by the common parental GO term (orange rectangle).

Differentially Expressed Genes in the Four Subtypes of EOC
We carried out integrative analysis for microarray gene expression datasets to discover and compare the differentially expressed genes (DEGs) in the four subtypes of EOC. The gene expressions of the samples in each dataset were rescaled to cumulative proportion before integration. Tables 5 listed the top 100 down-regulated and up-regulated genes ranked by the p values. We found the CCC, EC and MC groups shared many common up-regulated or down-regulated DEGs. We then explored the relationship by set analysis of the top 100 DEGs to find out the similarities on deregulated functions among the four EOC subtypes. The numbers of common GEGs among subtypes were displayed on the Venn diagram (Figure 7). There were 38 commonly up-regulated DEGs, accounting for 38% of all top 100 DEGs among CCC, EC and MC groups; however, no commonly up-regulated DEGs among CCC, EC, MC and SC were found. There were 41% commonly down-regulated DEGs among CCC, EC and MC groups but only 21% among the CCC, EC, MC and SC groups. These findings indicated the distribution of pathogenic DEGs of EOC subtypes was similar among CCC, EC and MC, while SC exhibited a significantly different distribution from the other three subtypes. These results also provided additional evidence supporting the dualistic model of type I and II classifications for ovarian carcinogenesis. . After mapping to the GO tree, the similar GO terms clustered together. Each cluster was circled (red) and some of the important deregulated GO terms (green boxes) in the cluster were magnified to view the details. Each cluster was labeled by the common parental GO term (orange rectangle).

Differentially Expressed Genes in the Four Subtypes of EOC
We carried out integrative analysis for microarray gene expression datasets to discover and compare the differentially expressed genes (DEGs) in the four subtypes of EOC. The gene expressions of the samples in each dataset were rescaled to cumulative proportion before integration. Table 5 listed the top 100 down-regulated and up-regulated genes ranked by the p values. We found the CCC, EC and MC groups shared many common up-regulated or down-regulated DEGs. We then explored the relationship by set analysis of the top 100 DEGs to find out the similarities on deregulated functions among the four EOC subtypes. The numbers of common GEGs among subtypes were displayed on the Venn diagram (Figure 7). There were 38 commonly up-regulated DEGs, accounting for 38% of all top 100 DEGs among CCC, EC and MC groups; however, no commonly up-regulated DEGs among CCC, EC, MC and SC were found. There were 41% commonly down-regulated DEGs among CCC, EC and MC groups but only 21% among the CCC, EC, MC and SC groups. These findings indicated the distribution of pathogenic DEGs of EOC subtypes was similar among CCC, EC and MC, while SC exhibited a significantly different distribution from the other three subtypes. These results also provided additional evidence supporting the dualistic model of type I and II classifications for ovarian carcinogenesis.

Discussion
Cancers are usually involved in multiple aberrations of gene and function as well as their interactions. In order to take these features into consideration, we utilized the GSR model to investigate the function regularities in cancers. Instead of detecting the DEGs, the model starts with converting the microarray gene expression profiles into quantized biological functions through a list of gene sets defined by the GO terms or Reactome pathways, and then the pathogenesis is evaluated by comparing the differences of functional regulation between the cases with the normal control groups. These quantized regularities of functions, i.e., the GSR indices, are computed by the modified DIRAC algorithm, which converts the gene expression levels to a gene expression ranking list in a gene set, and then measures the matching degree of gene expression rankings between two different phenotypes. We utilized a baseline gene set expression ranking template, defined as the most common gene expression ranking in the normal control populations for each gene set, as a standard to measure the regularity of gene ranking in either EOC or normal ovarian control sample. Then, the GSR index is computed by measuring the matching degree between the gene expression rankings of each ovarian cancer or normal ovarian control sample with the baseline gene set expression ranking template for each gene set. After being standardized by the baseline gene set template, the GSR indices of the four EOC subtypes can be compared based on the same standard. Besides, the GSR indices are computed based on the gene expression rankings; the gene expression levels are converted into ordinal data, and the ordinal data will encounter less cross-platform bias than the gene expression levels during integrating the datasets from different DNA microarray platforms. Computing the gene expression ranking in a gene set will take the gene interactions in a gene set into consideration. In contrast to the "genome" analyzed with gene expression microarray, this model investigates "functionome" with the GSR indices. By converting tens of thousands of gene expression profiles to approximately one thousand GSR indices, this approach will diminish the data noise, simplify the complexity of the subsequent analyses, and facilitate the performance of machine leaning. Besides, each GSR index is normalized to a value ranging from 0 to 1, in favor of the subsequent analyses.
The functionome of each subtype was computed through either GO term or Reactome pathway gene set database, both databases collect relative comprehensive human biological functions and processes, and provide the browsers for viewing the hierarchy of GO terms (AmiGO 2) [9] and pathways (Reactome Pathway Browser) [10], facilitating the clarification of the relationships among

Discussion
Cancers are usually involved in multiple aberrations of gene and function as well as their interactions. In order to take these features into consideration, we utilized the GSR model to investigate the function regularities in cancers. Instead of detecting the DEGs, the model starts with converting the microarray gene expression profiles into quantized biological functions through a list of gene sets defined by the GO terms or Reactome pathways, and then the pathogenesis is evaluated by comparing the differences of functional regulation between the cases with the normal control groups. These quantized regularities of functions, i.e., the GSR indices, are computed by the modified DIRAC algorithm, which converts the gene expression levels to a gene expression ranking list in a gene set, and then measures the matching degree of gene expression rankings between two different phenotypes. We utilized a baseline gene set expression ranking template, defined as the most common gene expression ranking in the normal control populations for each gene set, as a standard to measure the regularity of gene ranking in either EOC or normal ovarian control sample. Then, the GSR index is computed by measuring the matching degree between the gene expression rankings of each ovarian cancer or normal ovarian control sample with the baseline gene set expression ranking template for each gene set. After being standardized by the baseline gene set template, the GSR indices of the four EOC subtypes can be compared based on the same standard. Besides, the GSR indices are computed based on the gene expression rankings; the gene expression levels are converted into ordinal data, and the ordinal data will encounter less cross-platform bias than the gene expression levels during integrating the datasets from different DNA microarray platforms. Computing the gene expression ranking in a gene set will take the gene interactions in a gene set into consideration. In contrast to the "genome" analyzed with gene expression microarray, this model investigates "functionome" with the GSR indices. By converting tens of thousands of gene expression profiles to approximately one thousand GSR indices, this approach will diminish the data noise, simplify the complexity of the subsequent analyses, and facilitate the performance of machine leaning. Besides, each GSR index is normalized to a value ranging from 0 to 1, in favor of the subsequent analyses.
The functionome of each subtype was computed through either GO term or Reactome pathway gene set database, both databases collect relative comprehensive human biological functions and processes, and provide the browsers for viewing the hierarchy of GO terms (AmiGO 2) [9] and pathways (Reactome Pathway Browser) [10], facilitating the clarification of the relationships among numerous deregulated GO terms or pathways. The functionome was composed of approximately 1400 GO or 600 Reactome GSR indices for each case, when displayed on the heatmap, the functionomes of the four EOC subtypes could be visualized and show distinguishable patterns. These patterns could be recognized, classified and predicted by the machine learning. Our result revealed excellent binary or multiclass classification; it implied that the functionomes composed of GSR indices could be utilized as the basis of molecular classification by machine learning. Subsequently, the pathogenesis of the four subtypes was investigated by evaluating the GSR indexes. From the results of histograms and hierarchical clustering among the four subtypes, it could be found that CCC and EC had the closest relationship, followed by MC, and SC was relatively different from the others in terms of functional regulations. Indeed, the four subtypes shared quite a number of common deregulated functions, including cell adhesion, oxidoreductase activity, protein binding, channel activity and metabolism. However, deregulations of chromatin assembly, ERBB, PI3K-AKT pathways were more common among CCC, EC and MC but not in SC. In contrast, the predominant deregulated functions in SC were cell cycle control.
We further explored the pathogenesis and the relationship among the four subtypes by the EFA. The results of EFA using GO terms disclosed that CCC, EC and MC shared a similar structure of pathogenesis, associated with binding, channel activity, cell adhesion, oxidoreductase activity, protein kinase activity, G protein activity and chromatin assembly. The results of EFA using Reactome pathway gene sets revealed the common deregulation of the PI3K-AKT and ERBB pathways. In contrast, the results of EFA for the SC group revealed the pathogenesis mainly involved in apoptosis, mitosis and cell division and cell cycle checkpoint. Overlapped deregulated functions among the four EOC subtype groups were also found, such as protein tyrosine kinase activity, carbohydrate biosynthetic process, immune response, channel activity, cell adhesion and oxidoreductase activity. The channel activity was demonstrated to be involved in the cell cycle control in the carcinogenesis of EOC [11], and cell adhesion played an important role in the metastasis of EOC [12]. These findings draw the conclusion that the two overlapped, but distinguishable function regulation patterns existing among the four subtypes of EOC. The first pattern observed in the CCC, EC and MC groups had moderate, deregulated functions involved in oxidoreductase activity, channel activity, binding activity, metabolism, chromatin assembly, cell adhesion, PI3K-AKT and ERBB signaling pathway. The secondary pattern, observed in the SC groups, had more severe functional regularity and was predominantly involved in the cell cycle deregulation. These two function regulation patterns were compatible with the type I and type II classifications proposed by the dualistic model of ovarian carcinogenesis: the type I EOCs, including CCC, EC and MC, usually originated from the mutation of KRAS, BRAF, ERBB2, PTEN and PIK3CA, are genetically stable and have a relatively indolent behavior; the type II EOCs, mainly high-grade SC, primarily exhibit a TP53 signature, have a more uncontrolled cell cycle and aggressive behavior. The type I and II EOCs were compatible with the first and second patterns of function regulation in our study, respectively.
This study also showed evidence disclosing the relationship between deregulated functions and carcinogenesis. The association of CCC and EC with endometriosis has been repeated reported [13,14]. The cells in the endometriosis foci will be exposed to the reactive oxygen species (ROS) and are subjected to more DNA damage [15]. As the dendrogram showed in this study, the CCC and EC groups exhibited a relatively close relationship and shared many commonly deregulated GO terms, such as oxidoreductase activity and cell adhesion; both are the characteristic features of the pathogenesis of endometriosis. These findings provided the evidence supporting the role of endometriosis during the carcinogenesis of CCC and EC.
Our results showed the PI3K-AKT signaling pathway was a key element of the pathogenesis of EOCs. PI3K-AKT has been demonstrated to play an important role in the carcinogenesis of EOC, especially in CCC and EC. The deregulation of this signaling pathway may be originated from the loss of PTEN in 40% cases [16], PIK3CA mutation in 33% cases [17] or AKT amplification in 14% cases [18] of CCC patients. PI3K is the major downstream effector of receptor tyrosine kinases (RTK) and GPCR. If PI3K is activated, apoptosis will be inhibited and leads to cell proliferation [19]. Both of PI3K-AKT and G protein deregulation were detected with statistical significances in this study. As the results of CCC-EC-MC combined analysis listed in the Table S9, the GO terms "inositol or phosphatidylinositol phosphatase activity" and "transmembrane receptor protein tyrosine kinase activity" were the first and sixth top deregulated GO gene sets. ERBB2 was the first deregulated pathways for CCC and EC, its expression in EOC varies widely, ranging from 20% to 30% of cases [20]. ERBB is a member of the epidermal growth factor receptor (EGFR) family, it can activate the PI3K-AKT pathway and may represent a prognostic factor in primary EOC [21]. The 9th deregulated Reactome pathway "PI3K events in ERBB2 signaling" in the CCC-EC-MC combined group indicated the interaction between the two important deregulated Reactome pathways in the carcinogenesis of EOC (Table S10).
However, there are limitations when applying the GSR model to investigate the carcinogenesis of EOCs. As an illustration, the TP53 mutation is a common aberration in high-grade SC. The gene set related to TP53 could be found in the list of Reactome pathway database; however, they did not appear on the top of the significantly deregulated pathway list in this study; the first one appearing on the list was the 122th gene set "P53 dependent G1 DNA damage response" with a p value of 4.02ˆ10´1 7 . This finding illustrates the first limitation of this model: if the level of gene expression change does not reach the required extent, the gene expression ranking as well as the GSR index will remain unchanged and the aberration could not be detected. The second limitation is the incompleteness of gene set definitions. For example, there was no definition of PTEN gene set in the GO and Reactome gene set database, so no PTEN aberration was found in this study, although this model discovered a lot of PI3K related functions and pathway aberrations because the PI3K were the effector of PTEN. The third limitation is the false positivity. The third most deregulated Reactome pathway in the MC group was "olfactory signaling pathway" with a p value of 1.32ˆ10´1 2 , which should be independent of the carcinogenesis of MC. This situation can be checked and clarified via the Reactome Pathway Browser. When mapping to the browser, the hierarchy showed the "olfactory signaling pathway" was a member of the GPCR signaling pathway and contained elements involved with the regulation of G protein, and G protein was shown to play an important role in the carcinogenesis of EOC in this study. This false positivity happened because of the presence of the G protein-related gene elements in the gene set. Another limitation of this study was that the DEGs derived from the integrative analysis had not been validated. One of the best ways to validate these DEGs is RNA seq or protein expression for the samples of the four EOC subtypes. We attempted to validate the DEGs in our study by collecting the RNA seq datasets for the four EOC subtypes from two important publically available databases: The Cancer Genome Atlas (TCGA) and NCBI Sequence Read Archive (SRA). However, this validation was not feasible because the available samples of CCC, EC and MC were not enough to get significant statistical significance. Further investigation is still needed for validation of these DEGs.

Computing GSR Indices by Modified Differential Rank Conservation Algorithm
The algorithm of computing the GSR indices was modified from the Differential Rank Conservation (DIRAC). DIRAC is designed to measure the perturbation of a gene set by converting gene expression levels to gene expression rankings, and quantifying the regularity of gene expression ranking in the gene set by computing the ranking matching score, which is a measurement of the degree of each sample's gene expression ranking of each gene set matching the corresponding gene set ranking template. Instead of measuring the perturbation of gene expression ranking, the GSR index measures the extent of gene expression ranking change between two phenotypes in a gene set, i.e., EOC and normal controls in this study. For this purpose, the GSR indices for both EOC and the normal control are computed by comparing the sample's gene expression ranking with a standard template derived from the most common gene expression ranking in a gene set among the entire normal ovarian tissue control samples. Then, the EOC pathogenesis was investigated by comparing the EOC and normal control GSR indices. The baseline gene set ranking template was defined as a template of the most common gene ranking among the unaffected controls in a gene set; it is used as a standard template for a gene set from the unaffected population. The baseline gene set raking template for each gene set is established by pairwise comparison between the expression levels of two genes for all possible combinations of gene pair. A gene set contains m gene G = {G 1 , G m }, and the corresponding gene expression profile E = {E 1 , E m }, E i denotes the expression level of gene G i . Each sample is labeled by a phenotype of case (EOC) and unaffected control group, respectively. The baseline gene set raking template for each gene set is established by pairwise comparison between the expression levels of two genes for all possible combinations of gene pair. The baseline gene rank template B for a given gene set G is the binary vector composed of "A" or "B", where each component is either "A" if the probabilities Pr(E i < E j | phenotype = control) >0.5 or "B" if Pr(E i < E j | phenotype = control) ď0.5. For the expression profile of a given sample e n , the GSR index for a given gene set is the fraction of the mˆ(m´1)/2 pairs for which the observed gene expression ranking within e n matches the baseline gene ranking template B. Establishment of the baseline gene set expression ranking template and measurement of GSR indices were executed in R environment, the code and the test datasets are available on the GitHub (https://github.com/carlzang/GSR-model.git).

Microarray Datasets Gene Set Definition and Data Processing
Gene expression microarray datasets were downloaded in a SOFT format after comprehensively searching for all of the available microarray gene expression profiles in the NCBI GEO database. Ovarian carcinoma and normal ovarian tissue control datasets were selected only when the samples originated from the ovarian tissue and definite diagnosis was provided. The gene expression profile was discarded if containing missing data. The manipulation of genes and the corresponding gene expression data in each dataset was based on the HUGO Gene Nomenclature Committee (HGNC) gene symbols approved in 2013. The microarray gene expression datasets were utilized only if the corresponding gene symbol information was provided in the annotation table. The common genes and the corresponding gene expression profiles among all datasets were used in this study. The dataset were discarded if the number of the common genes became less than 8000 during intersecting with other datasets. The gene sets were discarded if the number of gene elements in the gene set is less the 3.

Statistical Analysis
The differences between the four EOC subtypes and the control GSR indices were tested by Mann Whitney U test and corrected by multiple hypotheses using false discovery rate (Benjamini-Hochberg procedure). The significance level was set at <0.001.

Classification and Prediction by Machine Learning
GSR index matrices computed through GO term and Reactome pathway gene sets were classified and predicted by the support vector machine (SVM) with kernlab [22], which is an R package for kernel-based machine learning methods and is used to classify patterns of the GSR indices with the setting of kernel = "rbfdot" (Radial Basis kernel "Gaussian"), type = "C´svc" (C classification). The performance of classification and prediction by SVM were measured by 5-fold cross-validation. Datasets were randomly sampled and divided into 5 parts, 4 parts were used for training sets and the remainder one part for prediction. The performance of binary classification was assessed with sensitivity, specificity, accuracy and area under curve (AUC), where Sensitivity = true positives/(true positives + false negatives) Specificity = true negatives/(true negatives + false positives) Accuracy = (true positives + true negatives)/(true positives + false positives + true negatives + false negatives) Sensitivity, specificity, accuracy and AUC were computed using the cumulative results of repeating classifications 10 times. AUC was computed by an R package pROC [23]. The performance of multiclass classification was assessed by the accuracy computed from the fraction of correct predictions within total prediction number.

Hierarchical Clustering Dendrogram and Heatmaps
All of the GSR indices in each gene set for each subtype were averaged and underwent hierarchical clustering with the R function "heatmap" as default. This function would execute hierarchical clustering, and drew dendrogram and heatmaps.

Set Analysis
All possible logical relations among the deregulated gene sets of the four EOC subtype groups was evaluated by set analysis and displayed by Venn diagram using an R package "VennDiagram" (version 1.6.16, downloaded from the comprehensive R archive network (CRAN), https://cran.r-project.org/index.html).

Exploratory Factor Analysis for the Deregulated GO Terms and Establishment of GO Trees
The deregulated GO terms of p values <0.001 were selected for EFA. EFA was executed with the R package "psych" (version 1.5.8). The number of factors to be extracted was determined by the function "pa.parellel". The setting of factoring method used in this study was "pa" and the correlation matrix rotation method was "promax". The tree of the deregulated GO terms was constructed and visualized in Portable Network Graphics (PNG) format constructed by the "RamiGO" [24], an R package providing functions to interact with the AmiGO 2 web server and retrieves GO trees.

Detection of Differentially Expressed Genes in the Four Subtypes of EOC
To discover the DEGs for each of the four EOC subtypes, we carried out integrative analysis with the downloaded DNA microarray datasets. The gene expression levels of all samples in each dataset were transformed and rescaled to cumulative proportion values from 0 (lowest expression) to 1 (highest expression) with an R package "YuGene" (version 1.1.5) before integration. The DEGs were discovered using linear model computed with empirical Bayes analysis by the functions "lmFit" and "eBayes" provided by the R package "limna" (version 3.26.9).

Conclusions
Investigating the pathogenesis of diseases with the functionomes not only makes a clear distinction among the different subtypes, but also provides a comprehensive view of the deregulated functions in these diseases. Our study demonstrated two overlapped but distinguishable deregulated function patterns among the four EOC subtypes. The first pattern, observed in CCC, EC and MC, showed a relatively moderate deregulation of functions involving the PI3K-related functions and chromatin assembly. The second pattern, found in SC, showed more severely deregulated functions associated with the control of cell cycle. These findings were compatible with the type I and II classifications proposed by the dualistic model of ovarian carcinogenesis. This study provided solid evidences to support this classification and was the first integrative analysis demonstrating this model.