HisCoM-PAGE: Hierarchical Structural Component Models for Pathway Analysis of Gene Expression Data

Although there have been several analyses for identifying cancer-associated pathways, based on gene expression data, most of these are based on single pathway analyses, and thus do not consider correlations between pathways. In this paper, we propose a hierarchical structural component model for pathway analysis of gene expression data (HisCoM-PAGE), which accounts for the hierarchical structure of genes and pathways, as well as the correlations among pathways. Specifically, HisCoM-PAGE focuses on the survival phenotype and identifies its associated pathways. Moreover, its application to real biological data analysis of pancreatic cancer data demonstrated that HisCoM-PAGE could successfully identify pathways associated with pancreatic cancer prognosis. Simulation studies comparing the performance of HisCoM-PAGE with other competing methods such as Gene Set Enrichment Analysis (GSEA), Global Test, and Wald-type Test showed HisCoM-PAGE to have the highest power to detect causal pathways in most simulation scenarios.


Introduction
Over the past several decades, gene expression data analysis has been the most common approach to investigate human diseases at the RNA level [1,2]. By analyzing gene expression data, we can gain a better understanding of disease etiology and biological mechanisms [3]. Especially for cancer prognosis, genetic information can be very effective in improving prognosis prediction of patients, based only on clinical information [4].
Analyzing high-throughput gene expression data, at the pathway level, is very effective in two ways. Firstly, grouping thousands of genes by their respective pathways reduces complexity to just several hundred pathways. Secondly, identifying active pathways that differ between two conditions, such as normal and tumor tissues, can have more explanatory power than a simple list of differentially expressed genes (DEGs) [5,6]. While there have been several methods proposed for gene set analysis, they mainly focused on the binary phenotypes. There are only a few methods available for dealing with survival phenotypes [7,8].
Various cancer prognosis and survival analysis have been reported [9,10]. For example, pancreatic cancer has a very poor prognosis, compared to other cancers. At the time of diagnosis, fewer than 20% of pancreatic cancer patients can have surgery, and their postoperative 5-year survival rate is also significantly low [11]. Therefore, more accurate pancreatic cancer prognosis, and early detection, are needed.
To build a good prediction model, using gene expression data, for actual clinical practice and medical intervention, it is first necessary to identify genes (features) related to prognosis. Exploring the pathways to which genes belong can provide valuable biological interpretation and help screen out false-positive genes. In this study, we mainly focus on finding significant pathways that are relevant to the prognosis of pancreatic cancer. Through pathway analysis, our ultimate goal is to identify biological mechanisms that influence the prognosis of disease more clearly.
Since gene set enrichment analysis (GSEA) was proposed, which uses the Kolmogorov-Smirnov statistic for measuring differentially expressed gene sets, many other pathway-based methods have been developed [12]. Recently, SetRank was developed to reduce the false positive hits of the GSEA [13]. Unlike other types of phenotypes, however, there are only a few pathway-based methods available for survival phenotypes. For example, the gene set variation analysis (GSVA) method was proposed to handle survival times by estimating the variation of pathway activity over a sample population in an unsupervised way [14]. The global test was proposed for continuous and censored survival time, based on the score statistics from random effects of parameters for association measure [15,16]. Likewise, the Wald test was proposed for the survival phenotype by summarizing the association measure from the sum of coefficients from a survival regression model [17]. However, those previous pathway methods are single pathway analyses, so they do not take into account correlations between pathways, and the global test only considers correlations between gene expression values. The Wald test merely sums up the statistics from each gene, to obtain its pathway statistics, so it does not account for the correlation among pathways. Since some genes may belong to several pathways simultaneously, there is a need for accounting for this nature of genes and pathways.
To account for this issue, we previously developed our Pathway-based approach using HierArchical structure of collapsed Rare variant Of High-throughput sequencing data (PHARAOH) method for discovering rare variants by constructing a hierarchical model that consists of collapsed gene-level summaries and entire pathways [18]. PHARAOH is based on the generalized structural component analysis (GSCA) model [19]. Later, we developed our Hierarchical structured CoMponent analysis of miRNA-mRNA integration (HisCoM-mimi) method to integrate anti correlated expression of miRNA and mRNA. By extension of PHARAOH, HisCoM-mimi can also account for the biological relationships between a miRNA and target mRNAs [20]. Recently, we developed another extension, Hierarchical structural CoMponent analysis of Gene-Gene Interactions (HisCoM-GGI), representing a model that not only summarizes common variants into gene levels, but also considers interactions among common variants [21].
In this study, we developed a new pathway-based model for survival phenotypes, based on gene expression data, by taking advantage of our earlier hierarchical model, referred to as HisCoM-PAGE which represents Hierarchical structural Component Models for Pathway Analysis for Gene Expression data. As an extension of HisCoM-mimi, HisCoM-PAGE considers the biological context of gene and pathway hierarchies, in the form of structured components. HisCoM-PAGE collapses genes into the pathway in a structured form using latent variables. Unlike other methods analyzing one pathway at a time, HisCoM-PAGE analyze all pathways simultaneously by one model, which enables considering the correlation of all pathways by using a ridge penalty in parameter estimation. HisCoM-PAGE can also successfully examine the effects of individual genes within the pathways.
Through simulation studies, we showed that HisCoM-PAGE performed well, compared to other existing pathway methods for survival phenotype. Application to real microarray data of pancreatic ductal adenocarcinoma (PDAC) patients from Seoul National University Hospital (SNUH) showed that HisCoM-PAGE could well identify prognosis-related pathways. HisCoM-PAGE is available at (http://statgen.snu.ac.kr/software/HisCom-PAGE/).

Materials
2.1.1. SNUH-PDAC Microarray Data 125 PDAC samples were collected by the department of Hepatobiliary and Pancreatic Surgery of SNUH and all human subject studies were approved by the Institutional Review Board of SNUH. The PDAC patients' average age was 63.32 years, and the standard deviation was 10.064 years; 75 patients were male, and 50 were female. The median survival time was 25 months. In this dataset, mRNA expression levels were generated using Affymetrix (Santa Clara, CA, USA) HuGene 1.0 ST arrays. A total of 32,321 genes were normalized by the Robust Multi-array Average (RMA) method [22]. Of the total, 21,369 genes were annotated. In this analysis, we selected mRNAs whose expressional variances were ranked in the top 25 percentiles for analysis [23,24]. The detail clinical information is described in Table 1.

The Cancer Genome Atlas -PDAC RNA-Seq Data
The RNA-seq data and the clinical data were downloaded from The Cancer Genome Atlas (TCGA) GDC portal (https://portal.gdc.cancer.gov) [25]. For the RNA-seq data, an Illumina HiSeq instrument (San Diego, CA, USA) was used for mRNA profiling. In the sample selection procedure, non-PDAC samples were removed [26]. Also, samples with a survival time less than 3 months were removed, since the cause of death may not be due to PDAC. As a result, we analyzed 124 PDAC patients, among which there were 61 female samples and 63 male samples. The median survival time was 598 days, and the censoring proportion was 41%. The average age was 64.56 years, and its standard deviation was 10.91.
For the preprocessing procedure of RNA-seq data, the following steps were applied to 56,716 genes annotated. The Relative Log Expression (RLE) normalization method was adopted to control the gene length bias. The RLE method was implemented in R package (v3.5) "DESeq2" (v1.22.2) [27]. After RLE normalization, the genes with zero proportion larger than 80% were filtered out [28]. The number of remaining genes was 37,405. Next, the RNA-seq mRNA expression data were log2 transformed. Using unsupervised filtering based on Median Absolute Deviation (MAD), 9,380 genes whose MADs were ranked in the top 25 percentiles were finally selected for analysis.

Simulation Data
To evaluate the performance of the HisCoM-PAGE method and compare its performance with other pathway methods, we generated a simulation data set, following the simulation settings of Lee et al [29]. In the simulation study, the following parameters were considered: the sample size (I), total number of genes (K), pathway size (m s ), proportion of censoring (c p ), and the proportion of significant genes in the pathway (m p ). Gene expression data were randomly generated from a multivariate normal distribution with mean zero and covariance matrix Σ. Let the O zero matrix be l × (K-l) dimensions, where l is the number of causal genes within the gene set. Let I l be an l × l identity matrix, and A be a l × l symmetric matrix. Then, the covariance matrix is given as follows: Four different scenarios of covariance matrix Σ were considered. For each scenario, Ahas a different structure. Here, i,j represent each row and column index for covariance matrix. For Scenario 1, , and x ij = 0.1 |i−j| . Scenario 4 has random variances and covariance, such that A is given as follows: when i is not equal to j, and 1, when i is equal to j, and ρ ij is generated from N(0,0.1 2 ). For all scenarios, three different significant gene proportion was considered. The survival time was generated from a Cox model with a constant baseline hazard rate of 0.005 whereas the censoring time was generated from an exponential distribution with a parameter of λ whose value depends on the censoring fraction [30]. The survival time and censoring time were generated independently. We only observed the minimum value of either the survival time or the censoring time, which occurred first. For power analysis, the regression coefficients w were generated from the uniform distribution U(0.2,0.6). For type 1 error estimation, the regression coefficient w was assumed to be zero.

HisCoM-PAGE Method
Let y i denote a survival time (i = 1, . . . ,I). Let x ijk denote the kth gene (k = 1, . . . ,m 1 , . . . ,K) expression corresponding to the jth pathway (j = 1, . . . ,J) for the ith patient. As shown in Figure 1, we must then consider latent structures for estimating the model parameters. Let w jk denote the weight assigned to x ijk . The coefficient β j represents the effect of the latent variable f ij on the phenotype.
Considering this structure, we designed the following Cox proportional hazard model.
To estimate the model parameters for HisCoM-PAGE, we maximized the penalized partial log likelihood using a ridge penalty. The following Equation (1) then represents the objective function, which is a partial log likelihood from a Cox model.
Here y 1 < y 2 < . . . <y d denote the distinct and ordered d(≤I) survival times and R(y i ) is the risk set at time y i The objective function can be maximized by an alternating least squares (ALS) algorithm, which iterates the following two steps until convergence. In the first step, the pathway coefficients are estimated and updated, using a least-squares approach. For the second step, the weight coefficients are updated for fixed-path coefficient estimates [31]. In HisCoM-PAGE, we adopted a ridge penalty to address the multicollinearity of genes. When estimating λ gene and λ pathway values, we conducted 5-fold cross-validation to obtain optimal values for λ gene and λ pathway The process of estimating the coefficients, using the ALS algorithm, with penalty, proceeds as follows: Compute η, u, A, and z, based on the latest value ofB ,Ŵ . Then maximize φ(W, B) with the fixedŴ . Repeat these steps untilB converges.

3.
Compute η, u, A, and z, based on the latest value ofB ,Ŵ . Then maximize the φ(W, B), with the updatedB . Repeat these steps untilŴ converges.

4.
Iterate steps 2 and 3 until φ(W, B) converges.  Figure 1 shows the HisCoM-PAGE model with J pathways. Rectangles and circles represent observed variables (mRNA expression) and latent variables (pathways), respectively. Each pathway consists of three or more genes and is represented by a latent variable constructed by a weighted sum of its genes. Single-headed arrows represent the effect of genes in a pathway, and the effect of pathways on the hazard function at the survival time y.
As a convergence criterion, we used the difference between φ(W, B) of consecutive iterations. The iteration continues until this difference is smaller than the given threshold value. In our analysis, the threshold value was 10 −4 . Since it is not straightforward to obtain asymptotic variance estimates of parameters from ridge estimation, we used a permutation test to obtain a statistical significance level for HisCoM-PAGE.

Comparison Methods
The following pathway methods were considered to compare the results of HisCoM-PAGE. We compared other pathway methods such as Gene Set Enrichment Analysis (GSEA) [12], Global Test (GT) [15,16] and the Wald type test [17]. The GSEA methods assume that the total number of genes K and the gene set S is predefined. For the first step, compute the regression coefficients of K genes, by fitting a univariate Cox models. The regression coefficient is used as an association measure between phenotype and genes. Secondly, order K genes by the absolute value of t statistics in descending order. Thirdly, calculate the enrichment score. While computing the enrichment score, GSEA method consider two methods of weighting, including GSEA1, the case when the weight term is 0, and GSEA2, the modified version of the original GSEA method when the weight term is 1 [12,29]. Lastly, calculate the significance level by comparing the observed values and the permutation distribution values. The Global test is based on the regression coefficients from a Cox model. The Global tests can test whether the expression of gene, within a predefined pathway, tends to closely associate with the survival times. The global test's Q statistic was taken as an average of the m test statistics calculated from each m individual gene, constituting a pathway by itself. Although the p-values can be calculated using the permutation and asymptotic method, we used the permutation approach. Thirdly, the Wald type test is based on the unified pathway method proposed which combined component-wise test statistics for significance of a subset of genes [17]. The Wald test also assesses whether the predefined pathway has an association with survival times. Thus, the test statistic is a sum of squares of the Wald statistic for the individual genes that constitute the pathway.

Pathway Analysis Using SNUH Microarray Data
The Affymetrix gene identifiers were mapped to the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Biocarta databases [32,33]. For the KEGG database, 4,320 genes were mapped to 185 pathways including overlapping genes. For the Biocarta database, 4,317 genes were mapped to 216 pathways including overlapping genes. Our objective in this pathway analysis was to identify pathways associated with PDAC patients' overall survival times.
For multiple test adjustment, we used the False Discovery Rate (FDR) analysis to calculate the FDR adjusted q-value as a criterion [34]. From the Biocarta database, HisCoM-PAGE identified four significant pathways with FDR-adjusted q-value smaller than 0.05. The upper part of Table 2 shows the list of significant pathways related to the survival times. The transforming growth factor β (TGF-β) pathway was found to be the most significant. It is well known that the TGF-β pathway is associated with inflammation promotion and carcinogenesis in the early stage of cancer [35][36][37][38][39][40][41]. Transducer of ERBB2 (TOB1) has previously been reported to be linked to PDAC [42,43].
From the KEGG database, HisCoM-PAGE identified 23 pathways with FDR-adjusted q-values smaller than 0.05. The lower part of Table 2 shows the list of significant pathways related to the survival times.
The pathways reported to be related to PDAC or pancreatic cancer are as follows. Hedgehog-signaling dysregulation, due to mutation or overexpression of pathway components and pathway ligands, induces pancreatic cancer [44]. The Wnt signaling pathway is related to drug resistance of pancreatic cancer and its function has an association with prognosis [45,46]. The Vascular Endothelial Growth Factor(VEGF) pathway is angiogenesis related [47], and the insulin-signaling pathway is studied for pancreatic cancer growth and metastasis [48]. Insulin receptor contributes to the Phosphatidylinositol-3-kinase (PI3K)-Akt pathway activation and this pathway mediates the therapeutic resistance [49]. The fatty acid metabolism pathway is closely related to tumor cell survival and growth [50,51]. Glycosphingolipids of the ganglio globo-series pathway have been reported as carbohydrate antigens with cancer [52]. The steroid hormone synthesis pathway has been studied for PDAC. In pancreatic cancer cells, cholesterol can be converted into oxysterol or can be used as a precursor for steroid hormone synthesis [53]. The glycerophospholipid metabolism pathway is significant for PDAC progression [54]. The mammalian TOR (MTOR) signaling pathway is reported to be related to metastasis of PDAC [55]. The ErbB signaling pathway is a mediator or has tumor stroma interactions in PDAC [56].  selected significantly by three methods are as follows: Adherens junction pathway, Colorectal cancer pathway, Circadian rhythm mammal pathway, and Dorso vental axis formation pathway. Among these pathways, only the E-cadherins protein in the Adherens junction pathway is reported to have an association with lymph node metastasis in high-grade PDAC [57,58].  Table 2.

Pathway Analysis Using TCGA RNA-Seq Data
We performed RNA-seq data analysis for a replication purpose of SNUH microarray analysis results. Thus, only the genes used for SNUH microarray data analysis were selected for RNA-seq analysis. As a result, for the KEGG database, 3,258 genes were mapped to 185 pathways. For the Biocarta database, 1,162 genes were mapped to 216 pathways. Table 2 shows the RNA-seq data results. The last three columns summarize the results of RNA-seq data. Since RNA-seq analysis was performed only for a replication purpose, FDR was used to correct the significant pathways identified from SNUH microarray data analysis. There were no significant pathways with FDR-adjusted q-value smaller than 0.05. From the Biocarta database, TGF-β pathway's nominal p-value was 0.053. For the KEGG database, the Amyotrophic lateral sclerosis pathway and MTOR signaling pathway were significantly identified with a nominal p-value Less than 0.05.
In addition, among the pathways significant at the nominal level by RNA-seq analysis, the Neurotrophic signaling pathway showed the FDR-adjusted q-value close to 0.05 by microarray data analysis (p-value = 0.007 and FDR-adjusted q-value = 0.052). The neurotrophic signaling pathway is known to promote pancreatic cancer invasion [59].

PDAC-Related Genes Using SNUH Microarray Data
With pathways associated with prognosis, we could also find genes meaningfully related to PDAC prognosis, as well as considering hierarchies of genes and pathways. Table 3 shows genes significant for the survival phenotype. Using the coefficients of our proposed model, we were able to calculate the w gene × β path value for each gene. As a result, it was possible to simultaneously consider the effect of the matched gene to the pathway, and the effect size of the pathway to the phenotype. After calculating each coefficient, significance testing was performed, using a permutation method. If the marker was selected based only on a nominal p-value, obtained by adapting the entire gene, a type 2 error and the false negative error can be large. Therefore, we used FDR analysis to calculate the FDR-adjusted q-value as a criterion.   Table 3 shows the significant genes with FDR-adjusted q-values < 0.05. SMAD3, BCL2, and TGF-β1 are well known to be related to PDAC. Upregulated SMAD3 promotes epithelial mesenchymal transition and predicts poor PDAC [60,61]. Also, it has been reported that BCL2 downregulated expression is a poor prognostic factor for PDAC [62]. TGF-β1 levels are significantly related to PDAC patients' prognosis [63]. That is, the patients with high levels of TGF-β1 showed higher overall survival times. Some other PDAC-related genes are as follows. ETS1 is known to have resistance to pancreatic cancer chemotherapy. Furthermore, ETS1 can exacerbate poor PDAC prognosis after radiation therapy [64]. HIF1A was reported as a significant indicator of PDAC prognosis [65][66][67]. GNAI1 is known to be a suppressor of tumor cell migration and invasion that is post-transcriptionally targeted by mir-320a/c/d [68]. Mir-320a is found to confer 5-FU chemo-resistance upon human pancreatic cancer cells [69].
3.1.4. PDAC-related Genes Using TCGA RNA-Seq Data Table 3 shows the gene results from the TCGA RNA-seq data analysis. FDR was used for significant genes identified from SNUH microarray data analysis. For the Biocarta database, GNAI1 was also significant with a p-value less than 0.05, and BCL2L1 was significant with a p-value less than 0.1. In addition, the p-value of SMAD3 was close to 0.1. However, TGF-β1 and ETS1 were not replicated. For the KEGG database, the p-value of SMAD3 was less than 0.1.

Simulation Analysis Result
3.2.1. Type 1 Error Figure 3 shows the simulation results, for each method, when total number of genes (K) = 200, sample size (I) = 80, gene set size (m s )= 50, and censoring proportion (c p ) = 0, 0.1, 0.2, 0.3, 0.4, 0.5. The number of permutations for significance testing was 1000. We checked the 95% confidence interval level of the estimated type 1 error. Overall, type 1 errors were shown to be well controlled in various scenario settings; especially in the HisCoM-PAGE method, the type 1 error is well controlled, even when the censoring fraction is high.

Comparison of Power
For power analysis, we varied the censoring proportion and the proportion of significant genes in the causal pathway. We set the parameters as follows: total genes (K) = 200 sample size (I) = 80, pathway size (m s ) = 50, significant gene proportion (m p ) = 0.1, 0.3, 0.5 and censoring proportion (c p ) = 0, 0.3. HisCoM-PAGE showed better performance than the other methods, when the significant gene proportions were not high, and the power was close to 1, when the significant gene proportion increases. Figure 4 shows the power of each method for the four correlation structure scenarios. Overall, the Global and the Wald type tests showed similar trend in power, and GSEA showed a relatively low power, compared to other methods, in many scenarios, as shown in the paper [29]. As shown by Chiristiaan [70], power depends largely on gene proportion, which has effects within a causal pathway. In Scenario 1, when all gene expression values are independent of each other and compared to other scenarios, the statistical power is strongly affected by the centering ratio. In Scenario 2, when the correlation coefficient between casual genes has the same effect, we could see a relatively high power compared to other scenarios.
For the GSEA method, the Cox model was only used for ordering genes, but enrichment scores were calculated using the relative rank only. In the case of the competitive analysis of the pathway methodology, observed statistics of the pathway of interest are compared with those of the pathway consisting of the genes within the pathway [71]. By contrast to GSEA-based method, HisCoM-PAGE can directly calculate the effect of the causal pathway, quantitatively, on the survival time. We could also confirm that the power of HisCoM-PAGE was higher than the other methods in Scenario 1, even when the censoring ratio was high.

Discussion and Conclusions
Among pathway analysis methods, few have been developed only for survival times. Thus, there is a need to quantitatively determine how much the pathway affects the survival phenotype and to identify a relative way of ranking pathways. To this end, HisCoM-PAGE uses structural equations to model real biological phenomena, so it can estimate not only the value of statistics of a pathway but also the meaning of pathway statistics. In other words, HisCoM-PAGE can find not only pathways related to prognosis but also how strongly genes contribute to the pathway. The estimated weight represents the gene effect on the pathway, and β path represents the effect of pathway on the hazard rate through a Cox model.
It is well known that biological phenomenon is a result of complex pathway interactions. That is, multiple pathway analysis provides a more biologically interpretable result than single pathway analysis. HisCoM-PAGE can perform multiple pathway analysis by considering all pathways simultaneously. No existing pathway methods can perform these analyses.
Recently, many studies have been actively conducted to examine cancer prognosis using RNA-seq data [26,72]. HisCoM-PAGE could easily be applicable to RNA-seq data. From the SNUH microarray data, HisCoM-PAGE identified PDAC prognosis-related pathways such as TGF-β, Hedgehog signaling pathways, MTOR signaling pathway, and so on. As a pilot study, we conducted the replication analysis using TCGA RNA-seq data. Application of HisCoM-PAGE to the TCGA RNA-seq data replicated TGF-β and MTOR signaling pathways. Furthermore, HisCoM-PAGE also replicated the GNAI1 gene as a PDAC prognosis-related gene among the genes SMAD3, BCL2, TGF-β1 and GNAI1 identified from SNUH microarray data analysis.
There have been several studies for comparing microarray and RNA-seq platforms [73,74]. For example, in identifying differentially expressed genes (DEGs), there were a number of genes specifically detected as DEGs in only one platform, while a large number of genes were identified as DEGs in both microarray data and RNA-seq data platforms [74]. A large discrepancy is expected between the two platforms when the treatment effect is small. In our pathway analyses, there was a discrepancy in analysis results between microarray data and RNA-seq data. To find out the main reason for this discrepancy, first the expression profiles of genes in two platforms were compared. The correlation between the averages of normalized and log2 transformed values of two platforms was high, with Pearson correlation coefficient of 0.62 and Spearman correlation coefficient of 0.72. However, when the p-value from a univariate Cox model for each gene was compared between the two platforms, the correlation of −log(p-value)s was low, with Pearson correlation coefficient of 0.25 and Spearman correlation coefficient of 0.23. This low correlation was caused by the different survival curves between microarray data and RNA-seq data. The two Kaplan-Meier curves from both microarray data and RNA-seq data were quite different and the log-rank test for the equivalence for these two survival curves has a p-value of 0.046 [75]. Since HisCoM-PAGE is based on the association between the survival time and genes based on a Cox model, it might be that the discrepancy of the pathway analyses between microarray and RNA-seq data was caused by the heterogeneity of survival times between two platforms. It could be expected that the samples with less heterogeneous survival times could provide more consistent results.
Next, we are planning to improve the performance of HisCoM-PAGE by taking the characteristics of RNA-seq data into account. Then, we need to conduct a simulation study to investigate its properties and compare it with other methods for RNA-seq data.
We believe that HisCoM-PAGE can be easily applicable to other types of omics data such as integration analysis of multi-omics data. Furthermore, we can also use the HisCoM-PAGE approach to construct predictive models for prognosis. We could study a design of a prognostic prediction model using the latent variable pathway as a marker, as well as the genetic marker [76]. In this case, unlike building predictive models using only genetic markers, we can add interpretability because the designed predictive model considers the contribution of genetic markers in a pathway manner. Also, it would be possible to study the relationship between genes and pathways, beyond the linear relationship, using the kernel generalized structured component analysis (GSCA) method.
In summary, we proposed a new pathway analysis method, Hierarchical Structural Component Models of Pathway Analysis for Gene Expression (HisCoM-PAGE), to identify disease prognosis-related pathways. Using simulated data, PDAC microarray and RNA-seq data, we can confirm that HisCoM-PAGE can identify pathways and genes that have an association with a survival phenotype. Moreover, HisCoM-PAGE can also find interpretable and meaningful pathways and prognostic genetic markers.