Integrative Analysis of Cancer Omics Data for Prognosis Modeling

Prognosis modeling plays an important role in cancer studies. With the development of omics profiling, extensive research has been conducted to search for prognostic markers for various cancer types. However, many of the existing studies share a common limitation: they focus on a single cancer type and suffer from a lack of sufficient information. With potential molecular similarity across cancer types, one cancer type may contain information useful for the analysis of other types. The integration of multiple cancer types may facilitate information borrowing so as to describe prognosis more comprehensively and more accurately. In this study, we conduct marginal and joint integrative analysis of multiple cancer types, effectively introducing integration in the discovery process. To accommodate high dimensionality and identify relevant markers, we adopt an advanced penalization technique with a solid statistical basis. Gene expression data on nine cancer types from The Cancer Genome Atlas (TCGA) are analyzed, leading to biologically sensible findings that are different from the alternatives. Overall, this study provides a novel avenue for cancer prognosis modeling by integrating multiple cancer types.


Introduction
Cancer is one of the leading causes of death worldwide and has been posing extensive public concerns. In cancer studies, prognosis modeling is a critical step that greatly contributes to understanding cancer etiology, developing effective therapeutic methods, and improving life quality. Significant effort has been devoted to searching for prognostic factors, among which omics markers have important implications. For example, EGFR has been suggested as a strong prognostic indicator in multiple cancers, such as ovarian, cervical, and bladder cancers. Nicholson, et al. [1] reviewed over 200 studies and reported that relapse-free-interval or survival data are directly in relation to the increased EGFR levels in breast, gastric, colorectal, and many other cancers. Petitjean, et al. [2] found that the mutation of TP53 has an impact on the prognosis of breast and several other cancers. Gao, et al. [3] used a Cox model to find that a high level of MMP-14 mRNA expression leads to a significantly shorter overall survival for breast cancer. Chiu, et al. [4] characterized prognostic alteration for melanoma with a panel of five genes, including CSMD2, CNTNAP5, NRDE2, ADAM6, and TRPM2. Despite considerable successes, our understanding of cancer prognosis is still limited. The limited progress in cancer analytics may be attributable to small sample sizes, high dimensionality and low signal-to-noise ratios of omics data, as well as the underlying molecular complexity of cancers.
Most of the existing studies, including the aforementioned, focus on a single type of cancer, and analysis often suffers from a lack of sufficient information. Cancer types have typically been classified according to organ- and tissue histology-based pathology criteria. This is especially true in older studies. More recently, with the development of high-throughput profiling, increasing attention has been paid to the molecular basis of cancers, providing a novel perspective on cancer types. A representative recent work is Hoadley, et al. [5], which conducted the molecular clustering of 33 different types of tumors in The Cancer Genome Atlas (TCGA) with data on aneuploidy, DNA methylation, mRNA, and miRNA. Their results show that some cancers, which were treated as completely different diseases according to traditional organ- and tissue histology-based pathology criteria, are closely related according to their molecular characteristics. For example, squamous cell carcinoma can occur in the lung, bladder, cervix, and head and neck, and different histopathological types are often observed. However, in Hoadley, et al. [5], these cancer types have been found to have similar molecular characteristics.
Molecular similarity across cancers has been well established in the literature. Prognosis of many different cancer types is mediated by some common mechanisms associated with certain common pathways. For example, the p53 pathway inhibits cell growth and stimulates cell death, and it plays an important role in a large fraction of cancers. In addition, there are other genes/pathways that have important roles in many cancer types, such as apoptosis, hypoxia-inducible transcription factor (HIF)-1, mitogen-activated protein kinase (MAPK), phosphoinositide 3-kinase (PI3K), and receptor tyrosine kinases (RTKs) [6]. Published studies have found that different cancer types may share common oncogenes, tumor-suppressor genes, and stability genes, the alterations of which are responsible for the genesis and prognosis of cancers. For example, BRCA1 gene mutation is often found in both breast and ovarian cancers [7]. These two cancer types are perhaps the most common cancers in women and often occur together [7]. Another example is lung adenocarcinoma and lung squamous cell carcinoma, which are two major lung cancer subtypes. Many genes have been reported to be associated with both cancer subtypes, including EGFR [8], TP53 [8], AKT1, DDR2 [9], FGFR1 [10], KRAS [8], PTEN, and others. With molecular similarity, one cancer may contain information useful for the analysis of other cancers. Overall, it is of interest and also reasonable to conduct the integrative analysis of molecular profiles of multiple cancer types to increase information and more accurately describe the underlying prognosis.
More recently, much effort has been devoted to collecting omics profiles of tumor samples with different cancer types under a unified protocol. A representative example is TCGA organized by The National Cancer Institute (NCI) which has generated a large amount of cross-platform genomic data for exploring the complex landscapes of human cancers. Specifically, it has collected multi-omics data from over 20,000 primary cancer and matched normal samples spanning 33 cancer types, including breast cancer, lung squamous cell carcinoma, lung adenocarcinoma, and others. Other examples include the International Cancer Genome Consortium (ICGC), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and others. With the clinical and omics data on multiple cancer types, these databases provide a good opportunity to conduct cancer modeling through data integration.
In the literature, there are a few related studies, which can be generally classified into two families. The first family adopts a meta-analysis strategy, which first analyzes different cancer types separately and then compares results across cancer types to search for overlapping findings. An example is Cava, et al. [11], which first analyzed gene expression data on 16 cancer types separately and then identified 895 de-regulated genes with a central role in pathways. Yu, et al. [12] systematically analyzed gene expressions across diverse cancers during the inflammatory timeline. After comparing the differentially expressed genes among cancers, they found three novel pan-cancer gene expression patterns, in which the gene expressions are regulated differently in the early and late phases of inflammation. Using a cohort of 3899 samples with 10 cancer types, Sharma, et al. [13] adopted a bottom-up approach to quantify the effects of gene expression variations and identified novel recurrent regulatory mutations influencing known cancer genes, such as GRIN2D and NKX2-1, in multiple cancer types. The second family of approaches stacks data from multiple cancer types together to create a "mega" dataset, and then conducts analysis as if there were just a single dataset. An example is Martinez-Ledesma, et al. [14], which used a network-based exploration approach to identify gene expression biomarkers that are predictive of clinical outcomes in 12 cancer types. Using TCGA data on 3281 samples with 12 cancer types, Leiserson, et al. [15] performed a pan-cancer analysis of mutated networks with a new algorithm, HotNet2, and found some significantly mutated subnetworks as well as those with less characterized roles in cancers.
Beyond studies on cancer omics data, similar strategies have also been considered in other fields of biomedical research to collectively analyze multiple datasets. For example, Xing, et al. [16] proposed two variations of a stacking algorithm to simultaneously predict the resistance of multiple drugs using mutation information, leading to improvement in prediction performance. As another example of drug analysis, Matlock, et al. [17] developed stacking models built on multiple cell lines, multiple tested drugs, as well as genomic information for drug sensitivity prediction in cancer cell lines. Medical imaging data integration has also been conducted. For example, a meta-analysis based support vector machine was introduced in [18] to collectively analyze multiple types of images, such as fluorodeoxyglucose positron emission tomography (FDG-PET) and magnetic resonance imaging (MRI), for identifying susceptible brain regions and predicting the incidence of Alzheimer's disease.
Despite considerable successes, both families have limitations. The former neglects integration in the discovery process. Data on each cancer type still suffers from a lack of sufficient information resulting from a small sample size, high noises, and other reasons. As such, the "delay" in integration may make the analysis less effective. For the latter one, although sample size increases by stacking, subjects with different cancer types are treated as if they were from the same population. It cannot effectively accommodate the heterogeneity across cancer types. In addition, in some of the existing studies, "classic" statistical techniques have been adopted, and there is a lack of utilizing state-of-the-art techniques.
Motivated by the limitations of single cancer type analysis and recent successes of integrative analysis in other contexts, in this study our goal is to conduct more effective integrative analysis of multiple cancer types with high dimensional omics data. By contrast with the single cancer type analysis, omics data from multiple cancer types are jointly analyzed to effectively borrow information across cancer types and generate more reliable findings. By contrast with the existing meta-analysis- and stacking-based approaches, the proposed analysis integrates data on multiple cancer types in the discovery process and effectively accommodates the heterogeneity across cancer types. By contrast with the analysis on categorical and continuous outcomes, the more challenging prognosis analysis is conducted. The proposed analysis is based on the penalization technique, which has a solid statistical basis and satisfactory performance in published studies. TCGA mRNA expression data on nine cancer types are analyzed to demonstrate the proposed integrative analysis approach. Overall, this study provides a practically useful new avenue for cancer prognosis modeling with multiple cancer types.

The Cancer Genome Atlas (TCGA) Data
TCGA is one of the largest cancer genomics programs. It comprehensively covers multiple cancer types with high quality omics measurements and serves as an ideal testbed. In this study, the processed level 3 data are downloaded from cBioPortal (http://www.cbioportal.org/). For omics data, we consider mRNA expressions which were measured using the Illumina HiSeq RNAseq V2 platform. For each subject, a total of 20,531 mRNA expression measurements are available. It is noted that the proposed analysis can be directly applied to other types of omics data, such as copy number variation, methylation, microRNA, and others. The prognosis outcome of interest is the overall survival time, which is subject to right censoring. Nine common cancer types are analyzed, including some recognized as highly correlated, such as lung adenocarcinoma and lung squamous cell carcinoma. Summary information is provided in Table 1. We acknowledge that, as the proposed analysis can well accommodate heterogeneity across cancers, the selection of cancers for analysis does not need to follow a strict criterion. Beyond these nine cancers with high prevalence and mortality, others can be added to the analysis easily. It has been suggested in the literature that the number of important prognostic markers is not expected to be large. Besides, with a relatively moderate sample size for each cancer type and a much larger number of genes, analysis may not be reliable. To improve estimation stability and also reduce computational cost, we conduct prescreening as follows. We consider the 1385 genes in the TruSight RNA Pan-Cancer Panel, which is produced by Illumina and provides a comprehensive assessment of cancer-related RNA transcripts and fusion detection. These genes have been referred to in public databases and implicated in multiple cancer types, including solid tumors, soft tissue cancers, and hematological malignancies [19].
After data matching, a total of 1040 gene expression measurements are left for downstream analysis. Note that this prescreening is not essential in our analysis, and the proposed approach can be directly applied to a bigger set of genes.

Methods
We conduct both marginal and joint analysis, where the former analyzes one gene at a time and the latter analyzes all genes in a single model. Both types of analysis have been extensively conducted in existing cancer modeling studies. As they have different implications and cannot replace each other, we conduct both analyses to generate a more comprehensive understanding of cancer prognosis. We develop a penalized regression-based framework to collectively analyze multiple datasets and identify markers associated with the prognosis of multiple cancer types, while effectively accounting for the similarity across cancers. The overall flowchart of analysis is provided in Figure 1.
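Assume that there are $K$ cancer types, where the $k$th ($k = 1, \ldots, K$) type has $n^{(k)}$ independent subjects. For subject $i$ with the $k$th cancer type, let $T_i^{(k)}$ be the log-transformed survival time and $X_i^{(k)} = (X_{i1}^{(k)}, \ldots, X_{ip}^{(k)})^\top$ be the $p$-dimensional vector of gene expression measurements. In practical analysis, right censoring is usually present. Denote $C_i^{(k)}$ as the log-transformed censoring time; then we observe $(y_i^{(k)}, \delta_i^{(k)}, X_i^{(k)})$, where $y_i^{(k)} = \min(T_i^{(k)}, C_i^{(k)})$ and $\delta_i^{(k)} = I(T_i^{(k)} \le C_i^{(k)})$, with $I(\cdot)$ being the indicator function.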

Marginal Analysis
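We adopt the accelerated failure time (AFT) model for describing prognosis. It has been one of the most popular choices in high-dimensional survival analysis due to its lucid interpretation and, more importantly, computational simplicity [20]. For a specific cancer type, consider the marginal AFT model for the $j$th measurement as:
$$T_i^{(k)} = \alpha_j^{(k)} + X_{ij}^{(k)} \eta_j^{(k)} + \varepsilon_{ij}^{(k)}, \tag{1}$$
where $\alpha_j^{(k)}$ is the intercept, $\eta_j^{(k)}$ is the regression coefficient, and $\varepsilon_{ij}^{(k)}$ is the random error. Assume that for each cancer type, the data have been sorted according to $y_i^{(k)}$ in an ascending order. Then, the following weighted penalized objective function is proposed to collectively analyze multiple cancer types,
$$\sum_{k=1}^{K} \frac{1}{2} \sum_{i=1}^{n^{(k)}} w_i^{(k)} \left( y_i^{(k)} - \alpha_j^{(k)} - X_{ij}^{(k)} \eta_j^{(k)} \right)^2 + \sum_{k=1}^{K} \rho_{MCP}\left( |\eta_j^{(k)}|; \lambda_1, \gamma \right) + \rho_\eta\left( \eta_j; \lambda_2 \right), \tag{2}$$
with $\eta_j = (\eta_j^{(1)}, \ldots, \eta_j^{(K)})^\top$. Here, the $w_i^{(k)}$'s are the Kaplan-Meier (KM) weights for accommodating censoring and defined as
$$w_1^{(k)} = \frac{\delta_1^{(k)}}{n^{(k)}}, \qquad w_i^{(k)} = \frac{\delta_i^{(k)}}{n^{(k)} - i + 1} \prod_{l=1}^{i-1} \left( \frac{n^{(k)} - l}{n^{(k)} - l + 1} \right)^{\delta_l^{(k)}}, \quad i = 2, \ldots, n^{(k)},$$
and $\rho_{MCP}(t; \lambda_1, \gamma) = \lambda_1 \int_0^{|t|} \left( 1 - \frac{x}{\lambda_1 \gamma} \right)_+ dx$ is the minimax concave penalty (MCP) with tuning parameter $\lambda_1$ and regularization parameter $\gamma$. We consider two types of $\rho_\eta$ with tuning parameter $\lambda_2$. The first is the magnitude-based shrinkage penalty with
$$\rho_\eta\left( \eta_j; \lambda_2 \right) = \frac{\lambda_2}{2} \sum_{k' \neq k} s_j^{(kk')} \left( \eta_j^{(k)} - \eta_j^{(k')} \right)^2, \tag{3}$$
where $s_j^{(kk')} = I\left( \mathrm{Sgn}(\eta_j^{(k)}) = \mathrm{Sgn}(\eta_j^{(k')}) \right)$, with $\mathrm{Sgn}(\cdot)$ being the sign function. The second is the sign-based shrinkage penalty with
$$\rho_\eta\left( \eta_j; \lambda_2 \right) = \frac{\lambda_2}{2} \sum_{k' \neq k} \left( \mathrm{Sgn}(\eta_j^{(k)}) - \mathrm{Sgn}(\eta_j^{(k')}) \right)^2. \tag{4}$$
Based on (2), genes with nonzero estimated coefficients are identified as associated with prognosis.
The objective function (2) analyzes one gene at a time, and enjoys stable estimation and simple optimization. It may be limited by a lack of attention to the interconnections among genes and their joint effects on cancer prognosis. Our brief literature search suggests that marginal analysis is still highly popular in high-dimensional omics studies [21]. For marginal analysis, a two-stage method is often adopted for marker identification, where multiple tests are first performed and a multiple comparison adjustment is then conducted on p values using, for example, the false discovery rate approach. By contrast with this strategy, we adopt the penalization technique, which can generate more stable results and, more importantly, effectively accommodate the similarity across cancer types. Specifically, MCP is used for regularized estimation and marker identification, and it has been shown to have satisfactory theoretical and numerical properties.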
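The two standard building blocks named above, the Kaplan-Meier weights and the MCP, can be sketched in a few lines. This is a minimal illustration, not the authors' released R code: the function names `km_weights` and `mcp` are hypothetical, the weights follow the usual Stute form for observations sorted by ascending observed time, and the MCP is written in its closed form.

```python
def km_weights(delta):
    """Kaplan-Meier (Stute) weights.

    `delta` holds the event indicators (1 = event, 0 = censored) of one
    cancer type, already sorted by ascending observed time.
    """
    n = len(delta)
    weights = []
    running = 1.0  # product over earlier observations
    for i, d in enumerate(delta, start=1):
        weights.append(d / (n - i + 1) * running)
        running *= ((n - i) / (n - i + 1)) ** d
    return weights


def mcp(t, lam, gamma):
    """Closed form of lam * integral_0^|t| (1 - x / (lam * gamma))_+ dx."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


# Hypothetical toy input: three sorted subjects, the second one censored.
print(km_weights([1, 0, 1]))
print(mcp(1.0, lam=1.0, gamma=3.0))
```

Without censoring every weight equals 1/n, so the weighted loss reduces to ordinary least squares; a censored subject receives weight zero and its mass is passed on to later events.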
The most significant advancement is the penalty term which promotes similarity between the estimated coefficients of each cancer pair. Data integration is conducted in the discovery process to facilitate early information borrowing. With the magnitude-based shrinkage penalty (3), the magnitudes of gene effects across cancer types are promoted to be similar if they have the same signs, while with the sign-based shrinkage penalty (4), the signs of gene effects are promoted to be similar. Thus, the proposed two types of $\rho_\eta$ promote different types of similarity, with the former for quantitative similarity and the latter for qualitative similarity. As in practice the relatedness of cancer types may not be accurately known, both penalties can be useful. $\lambda_1$ and $\lambda_2$ are two tuning parameters which control the sparsity and similarity of coefficients, respectively. For the $p$ objective functions (one per gene), we impose the same values of $\lambda_1$ and $\lambda_2$ on different $\eta_j^{(k)}$ to be concordant with joint analysis. If $\lambda_2 = 0$, the proposed approach goes back to the unintegrated strategy that analyzes each cancer type separately with MCP.
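To make the difference between quantitative and qualitative similarity concrete, the following sketch evaluates both penalties for the coefficients of one gene across cancer types. The exact functional forms are assumptions made for illustration (a squared difference applied only to same-sign pairs, and a squared difference of signs, each summed over ordered pairs and scaled by half of the tuning parameter); the function names are hypothetical.

```python
def sign(x):
    """Sign function returning -1, 0, or 1."""
    return (x > 0) - (x < 0)


def magnitude_penalty(eta, lam2):
    """Assumed magnitude-based form: shrink differences of same-sign pairs."""
    total = 0.0
    for k, a in enumerate(eta):
        for kp, b in enumerate(eta):
            if kp != k and sign(a) == sign(b):
                total += (a - b) ** 2
    return 0.5 * lam2 * total


def sign_penalty(eta, lam2):
    """Assumed sign-based form: penalize pairs whose signs disagree."""
    total = 0.0
    for k, a in enumerate(eta):
        for kp, b in enumerate(eta):
            if kp != k:
                total += (sign(a) - sign(b)) ** 2
    return 0.5 * lam2 * total


# Hypothetical coefficients of one gene in three cancer types.
eta = [1.0, 2.0, -1.0]
print(magnitude_penalty(eta, lam2=1.0))  # only the (1.0, 2.0) pair contributes
print(sign_penalty(eta, lam2=1.0))       # only pairs involving -1.0 contribute
```

In this toy case the magnitude-based term ignores the negative coefficient entirely, whereas the sign-based term is driven only by it, which mirrors the distinction drawn in the text.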

Joint Analysis
For $k = 1, \ldots, K$, consider the AFT model with the joint effects of all omics measurements,
$$T_i^{(k)} = \alpha^{(k)} + X_i^{(k)\top} \beta^{(k)} + \varepsilon_i^{(k)}, \tag{5}$$
where $\alpha^{(k)}$ is the intercept, $\beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_p^{(k)})^\top$ is the vector of regression coefficients, and $\varepsilon_i^{(k)}$ is the random error. With the same notations as in the marginal analysis, for estimation, consider the following weighted penalized objective function
$$\sum_{k=1}^{K} \frac{1}{2} \sum_{i=1}^{n^{(k)}} w_i^{(k)} \left( y_i^{(k)} - \alpha^{(k)} - X_i^{(k)\top} \beta^{(k)} \right)^2 + \sum_{k=1}^{K} \sum_{j=1}^{p} \rho_{MCP}\left( |\beta_j^{(k)}|; \lambda_3, \gamma \right) + \sum_{j=1}^{p} \rho_\beta\left( \beta_j; \lambda_4 \right), \tag{6}$$
where $\beta_j = (\beta_j^{(1)}, \ldots, \beta_j^{(K)})^\top$, and $\lambda_3$ and $\lambda_4$ are the tuning parameters. The KM weights, MCP, and two proposals for $\rho_\beta$ are also adopted in (6). The proposed estimate is defined as the minimizer of (6). Variables with nonzero estimates are identified as associated with prognosis. For optimization, the coordinate descent (CD) algorithm is adopted (Appendix A). Different from (2), objective function (6) jointly analyzes a large number of genes in a single model and thus accommodates a high dimensionality. Compared to marginal analysis, it advances by taking the combined effects of multiple genes into consideration and better describing the underlying disease biology. However, it involves more complex computation and may lead to less stable results. Penalization is adopted to accommodate high dimensionality and identify important genes. It is perhaps the most popular technique in high dimensional data analysis. Different from the existing studies, the magnitude- and sign-based shrinkage penalty terms are also introduced, similarly to Section 2.2.1. This can effectively accommodate the similarity across cancer types and facilitate information borrowing.
The proposed analysis can be effectively realized. To facilitate data analysis within and beyond this study, we have developed R code and made it publicly available at www.github.com/shuanggema/IntePanCancer.
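As a rough illustration of how coordinate descent handles a weighted least squares loss with MCP, the sketch below fits a single dataset. It is a simplified stand-in and not the algorithm of Appendix A: it omits the cross-cancer similarity penalty, the function names `soft` and `cd_mcp` are hypothetical, and it assumes that gamma times the weighted sum of squares of every centered covariate exceeds one, so that each coordinate update has a unique closed-form solution.

```python
def soft(z, lam):
    """Soft-thresholding operator."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def cd_mcp(X, y, w, lam, gamma=3.0, n_iter=200, tol=1e-8):
    """Coordinate descent for 0.5 * sum_i w_i (y_i - a - x_i'b)^2 + MCP(b)."""
    n, p = len(X), len(X[0])
    sw = sum(w)
    # Weighted centering removes the intercept from the updates.
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    xbar = [sum(w[i] * X[i][j] for i in range(n)) / sw for j in range(p)]
    Xc = [[X[i][j] - xbar[j] for j in range(p)] for i in range(n)]
    v = [sum(w[i] * Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    if any(gamma * vj <= 1.0 for vj in v):
        raise ValueError("need gamma * weighted sum of squares > 1")
    beta = [0.0] * p
    resid = [yi - ybar for yi in y]  # residuals at beta = 0
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Weighted least squares solution for coordinate j alone.
            z = sum(w[i] * Xc[i][j] * resid[i] for i in range(n)) / v[j] + beta[j]
            if abs(z) <= gamma * lam:
                new = soft(v[j] * z, lam) / (v[j] - 1.0 / gamma)
            else:
                new = z  # beyond gamma * lam the MCP applies no shrinkage
            change = new - beta[j]
            if change != 0.0:
                for i in range(n):
                    resid[i] -= change * Xc[i][j]
                beta[j] = new
            max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    alpha = ybar - sum(xbar[j] * beta[j] for j in range(p))
    return alpha, beta


# Hypothetical toy data: y depends on the first covariate only.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
y = [1.0 + 2.0 * row[0] for row in X]
print(cd_mcp(X, y, w=[1.0] * 6, lam=0.5))
```

In an integrative fit, each coordinate update would additionally include the similarity term that couples the same gene across cancer types; that coupling is left out here on purpose to keep the sketch short.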

Marginal Analysis
We analyze the TCGA data using the approach described in Section 2.2.1 with penalties (3) (referred to as A1) and (4) (referred to as A2), as well as an alternative marginal approach A3 which analyzes each cancer type separately with MCP for identifying relevant markers. Comparing with the benchmark A3 can straightforwardly establish the merit of the proposed integrative analysis. Detailed estimation results are provided in the Supplementary Excel file. Different approaches are observed to generate different findings. Specifically, a total of 910 genes with 482 unique ones and 1160 genes with 275 unique ones are identified with A1 and A2, respectively, compared to 2655 genes with 999 unique ones with A3.
In Table 2, we present the top five genes with the largest numbers of associated cancer types and refer to the Supplementary Excel file for more detailed results. It is observed that the numbers of multiple cancer types-related genes identified with A1 and A2 are slightly larger than those with A3. For example, both A1 and A2 identify gene APH1A as associated with all nine cancer types, but this gene is missed by A3. Literature search suggests that the identified genes with the proposed A1 and A2 may have important biological implications. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of gene APH1A suggests that it is a member of the notch signaling pathway which has an important impact on developmental and cell fate decisions and is deregulated in human solid tumors [22]. APH1A is one of the four essential components of γ-secretase [23]. γ-secretase is a multiprotein intramembrane-cleaving protease, which can cleave ligand-activated endogenous Notch receptors and is a potential drug target for cancer [24]. Gene MAPK1, identified as associated with eight cancer types by both A1 and A2, has been reported to be involved in many cancer related pathways. MAPK1 is one of the MAP kinases in the MAPK pathway. It can phosphorylate transcription factors, which regulate the expressions of genes involved in cell proliferation and differentiation. Besides, MAPK1 is involved in EGFR tyrosine kinase inhibitor resistance [25], which importantly contributes to the etiology of various types of cancer, such as pancreatic cancer [26], paediatric acute lymphoblastic leukemia [27], and others. In this process, MAPK1 acts as a serine/threonine kinase upstream of FRS2, which plays a role in epidermal growth factor (EGF) signaling [28]. MAPK1 has also been reported to have an impact on the malignant behavior of breast cancer cells. 
Published studies show that gene ETV6, identified as associated with eight and seven cancer types by A1 and A2, respectively, is involved in the transcriptional dysregulation of cancer pathways. The dysregulation of transcription factors can alter the expressions of target genes and lead to the tumorigenic process. For example, ETV6 is a negative regulator of signal transducer and activator of transcription 3 (Stat3) transcription factor activity, which has the ability to mediate the inhibition of the proliferation of tumor cells [29]. Gene ETV6 is relevant to multiple cancer types, including breast cancer [30], leukemia [31], non-small cell lung cancer [32], and others. These biological findings provide support to the validity of the proposed integrative analysis.

[Table 2: top five genes with the largest numbers of associated cancer types; columns: Approach, Gene, Number of Associated Cancer Types]
To gain a deeper insight into the identification results, we further calculate the relative overlapping between gene sets associated with different cancer types. Specifically, for two gene sets $A$ and $B$, their relative overlapping is defined as $ROL(A, B) = \frac{|A \cap B|}{|A \cup B|}$, with a larger value indicating a stronger similarity. Results for different approaches are shown in Table 3. The average ROL values are 0.143 (A1), 0.308 (A2), and 0.147 (A3), respectively, suggesting that A2 leads to gene sets with a higher level of relative overlapping and A1 and A3 have comparable performance. Take breast invasive carcinoma (BRCA) and ovarian serous cystadenocarcinoma (OV), which are established as related, as an example. The ROL values for A1, A2, and A3 are 0.150, 0.265, and 0.146, respectively. The proposed A2 can improve the qualitative similarity of genes selected for multiple cancer types to a certain extent.
Beyond identification, we also take a closer look at the estimation results. Specifically, we compute the difference of the estimated coefficient matrices for each cancer pair. Consider the relative Euclidean distance between the estimated coefficients of two cancer types, with a smaller value indicating a stronger similarity. Results for the three approaches are provided in Table 4, with the average values being 1.606 (A1), 1.534 (A2), and 2.254 (A3). The relative Euclidean distances with A1 and A2 are observed to be smaller than those with A3. For example, the distance values between BRCA and OV are 1.443 with A1 and 1.220 with A2, which are much smaller than 3.230 with A3. As another example, for the two squamous cell carcinomas, lung squamous cell carcinoma (LUSC) and head and neck squamous cell carcinoma (HNSC), the relative Euclidean distances are 1.644 (A1), 1.855 (A2), and 2.577 (A3), respectively. To more intuitively describe similarity, we conduct the hierarchical clustering analysis based on the relative Euclidean distances and present the results in Figure A1 (Appendix B).
Biologically sensible findings are made. For example, the distance between BRCA and OV decreases after integration.
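The relative overlapping measure is the ratio of the intersection size to the union size of two gene sets, and it is simple to compute. The sketch below uses hypothetical helper names and toy gene sets chosen only for illustration; they are not the sets identified in this study.

```python
def rol(a, b):
    """Relative overlapping: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_pairwise_rol(gene_sets):
    """Average ROL over all unordered pairs of cancer types."""
    names = sorted(gene_sets)
    values = [
        rol(gene_sets[u], gene_sets[v])
        for i, u in enumerate(names)
        for v in names[i + 1:]
    ]
    return sum(values) / len(values)


# Hypothetical gene sets for three cancer types.
toy = {
    "BRCA": {"APH1A", "MAPK1", "ETV6"},
    "OV": {"APH1A", "MAPK1", "TP53"},
    "LUSC": {"KRAS"},
}
print(rol(toy["BRCA"], toy["OV"]))
print(average_pairwise_rol(toy))
```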

Joint Analysis
Similar to marginal analysis, in joint analysis we adopt both the magnitude-based shrinkage (referred to as B1) and the sign-based shrinkage (referred to as B2). We also consider an alternative joint analysis referred to as B3, which analyzes each cancer type separately and applies MCP to accommodate high dimensionality and select relevant markers. Detailed estimation results are provided in the Supplementary Excel file. For the nine cancer types combined, B1, B2, and B3 identify a total of 1135 genes with 662 unique ones, 1064 genes with 598 unique ones, and 530 genes with 421 unique ones, respectively. The two proposed approaches lead to results different from the alternative. In addition, the joint analysis identification results also differ from those in marginal analysis.
The top five genes with the largest numbers of associated cancer types are provided in Table 5, and more results are provided in the Supplementary Excel file. Similar patterns are observed where the proposed two approaches identify more genes associated with multiple cancer types. For the identified genes, a literature search provides independent evidence of their associations with multiple cancer types. For example, the important biological implications of gene APH1A have been already discussed in Section 3.1. In addition, gene CCAR2, identified as important for all nine cancer types with B2, has been reported to be associated with the development of many cancer types. It plays a pivotal role in DNA damage response and promoting apoptosis. The depletion of CCAR2 can impair the activation of the AKT pathway, which ultimately causes the inhibition of cancer cell growth [33]. Specifically, it binds to the BRCA1 C Terminus (BRCT) domain of the tumor suppressor BRCA1 and inhibits BRCA1 in breast cancer [34]. Cho, et al. [35] also suggested that the expression of CCAR2 is closely related to the progression of ovarian carcinomas. In Kim, et al. [36], an increase in apoptosis was observed in CCAR2-deficient non-small cell lung cancer cell lines. Wagle, et al. [37] demonstrated that the expression of CCAR2 is significantly associated with a higher clinical stage and predicted shorter survival in osteosarcoma. Gene BTLA is identified as important for eight cancer types with B2. It is an immunoinhibitory receptor and can deliver inhibitory signals for suppressing lymphocyte activation. The ability of BTLA to inhibit tumor-specific human CD8+ T cells suggests it as a target for cancer immunotherapy [38]. Published studies also suggest that gene BTLA is relevant to the occurrence and development of many cancer types [39]. For example, a case-control study conducted by Fu, et al. [40] on women from northeast China suggested that breast cancer risk and prognosis may be affected by BTLA gene polymorphisms. In addition, Oguro, et al. [41] showed that BTLA is closely associated with shorter overall survival in gallbladder cancer. Gene RUNX2 is identified by B2 as important for five cancer types. The transcription factor RUNX2 can regulate the expressions of genes that are associated with tumor promotion, invasion, and metastasis, such as VEGF [42]. RUNX2 is also involved in many pathways that are related to tumorigenesis, such as the WNT pathway, transforming growth factor beta (TGFβ) signaling pathway, and p53 pathway [42].
The relative overlapping and relative Euclidean distances between different cancer types are also tabulated for the joint analysis. Both measures indicate that the proposed joint integrative analysis can improve the identified similarity across cancer types. Take BRCA and PAAD, the relatedness of which has been suggested in the literature, as an example. It has been demonstrated that the proteins annexin A1, A2, A4, and A5 play an important role in the occurrence and development of these two cancer types [43], and BRCA1 and BRCA2 gene mutations are commonly observed in both cancer types [44]. The hierarchical clustering results are presented in Figure A2 (Appendix B). With the proposed B1 and B2, cancer types with stronger relatedness tend to be assigned to the same clusters.
Advancing from marginal analysis, joint analysis has the capability of predicting survival time besides marker identification. To evaluate prediction performance, a resampling procedure is adopted. Specifically, for each of the nine cancers, we first split data randomly into a training and a testing set. The training sets for the nine cancer types are then used to fit models and obtain parameter estimates. Finally, we make prediction for the testing set subjects with the estimated parameters. For evaluation, C-statistic is adopted, which is one of the most popular measures for censored survival data [45,46]. It is the integrated AUC (area under the curve) of the time-dependent ROC curve and has value between 0.5 and 1, with a larger value indicating a better prediction performance. The average values over 100 resamplings are shown in Table 6. Overall, B1 and B2 perform better than B3, with B1 having a prominent superiority. For example, for LUSC, the average C-statistic values are 0.748 (B1), 0.649 (B2), and 0.612 (B3). The improvement in prediction accuracy suggests the benefit of integrative analysis of multiple cancer types.
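The resampling evaluation can be sketched as follows. Note the substitution: instead of the integrated time-dependent AUC described above, this sketch uses Harrell's concordance index, a simpler and widely used concordance measure for censored data. The function names and the test fraction are hypothetical, since the split proportion is not stated here.

```python
import random


def random_split(n, test_fraction, rng):
    """Randomly split indices 0..n-1 into training and testing sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def harrell_c(time, event, risk):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event and
    a shorter observed time than subject j. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Under an AFT model a larger predicted log survival time means lower risk,
# so the negative linear predictor can serve as the risk score.
train, test = random_split(10, 0.3, random.Random(0))
print(len(train), len(test))
print(harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [4.0, 3.0, 2.0, 1.0]))
```

Repeating the split, refitting on the training indices, and averaging the statistic over the testing sets gives a resampling estimate in the spirit of the procedure above.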

Simulation Based on TCGA Data
To gain more insights into the performance of the proposed integrative analysis, we conduct practical data-based simulation under various scenarios. The specific settings were as follows.
(1) The observed gene expression measurements on nine cancer types from TCGA were used as predictors. To generate variations across simulation replicates, we adopted a resampling approach.
(2) Set $p$ = 200, 500, or 1000. For each value of $p$, genes were randomly selected from the original gene set.
(3) For each cancer type, there were 10 genes associated with the cancer outcomes with nonzero regression coefficients $\beta_{(1)}^{(k)}, \ldots, \beta_{(10)}^{(k)}$. The rest of the coefficients were zeros.
(4) For each subject, the event time was computed from the AFT model for $\log T_i^{(k)}$, where the random error $\varepsilon_i$ was generated from $N(0, 1)$. Censoring times were randomly generated from an exponential distribution, and the parameter was adjusted to make the censoring rate around 20%. It is noted that to mimic the complexity of real data, the data generating models are more complicated than the simple AFTs with the presence of a small number of quadratic effects.
We consider various values of $\beta_{(1)}^{(k)}, \ldots, \beta_{(10)}^{(k)}$ to generate different levels of signal-to-noise ratios and cancer similarity. Under Scenarios I and II, the nine cancer types have the same set of important genes with the same nonzero effects, specified for $j = 1, \ldots, 10$ and $k = 1, \ldots, 9$. Under the other scenarios, only part of the important genes are shared across cancer types (with coefficients such as $\beta_{(j)}^{(k)} = 2$), and the other five important genes are "randomly selected" (and hence likely to differ across datasets). There are a total of 12 simulation settings, comprehensively covering different numbers of genes, and different levels of signal-to-noise ratios and cancer similarity.
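The data-generating steps above can be sketched for a single cancer type as follows. This is a simplified illustration under stated assumptions: the covariates are synthetic Gaussian draws rather than resampled TCGA expressions, the quadratic effects are omitted, the coefficient values are placeholders, and the function name `simulate_aft` is hypothetical. The exponential censoring rate is tuned by bisection on fixed random draws so that about 20% of subjects are censored.

```python
import math
import random


def simulate_aft(X, beta, target_censor=0.2, seed=0):
    """Generate (y, delta) from log T = x'beta + N(0, 1) noise."""
    rng = random.Random(seed)
    log_t = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0)
             for row in X]
    # Log of unit-rate exponential draws; log C = log E - log(rate).
    log_e = [math.log(max(rng.expovariate(1.0), 1e-300)) for _ in X]

    def censor_rate(log_rate):
        return sum(e - log_rate < t for e, t in zip(log_e, log_t)) / len(X)

    lo, hi = -50.0, 50.0  # the censoring rate increases with log_rate
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if censor_rate(mid) < target_censor:
            lo = mid
        else:
            hi = mid
    log_c = [e - hi for e in log_e]
    y = [min(t, c) for t, c in zip(log_t, log_c)]
    delta = [int(t <= c) for t, c in zip(log_t, log_c)]
    return y, delta


# Hypothetical setting: 500 subjects, 20 genes, the first 10 with effects.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(500)]
beta = [0.5] * 10 + [0.0] * 10
y, delta = simulate_aft(X, beta, target_censor=0.2, seed=2)
print(1.0 - sum(delta) / len(delta))  # observed censoring rate
```

Working on the log scale avoids overflow when the linear predictor is large, and reusing the same exponential draws during the bisection keeps the censoring rate monotone in the rate parameter.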
Analysis was conducted using the proposed marginal and joint analysis approaches as well as two alternatives. To evaluate identification performance, we computed the true positive rate (TPR) and false positive rate (FPR). The average TPR and FPR values over 100 replicates are provided in Table A3. As the sign consistency of some genes does not hold under Scenario IV, A2 and B2 have inferior performance compared to A1 and B1, but still have superior performance compared to A3 and B3. The superiority of the proposed integrative analysis approaches observed in the data-based simulation provides a certain degree of confidence in the data analysis results.
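The two identification measures can be computed from the true and estimated coefficients of one cancer type, with a gene counted as identified when its estimate is nonzero. The sketch below uses a hypothetical function name and toy coefficient values for illustration only.

```python
def tpr_fpr(beta_true, beta_hat):
    """True and false positive rates of the selected (nonzero) set."""
    tp = fp = positives = negatives = 0
    for truth, estimate in zip(beta_true, beta_hat):
        selected = estimate != 0
        if truth != 0:
            positives += 1
            tp += selected
        else:
            negatives += 1
            fp += selected
    return tp / positives, fp / negatives


# Hypothetical example: two true signals, one of which is recovered,
# and one of three noise genes falsely selected.
print(tpr_fpr([2.0, 2.0, 0.0, 0.0, 0.0], [1.5, 0.0, 0.1, 0.0, 0.0]))
```

Averaging these two rates over replicates gives summaries of the kind reported for the simulation.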

Discussion
In cancer research, prognosis modeling with omics measurements plays an essential role. Most existing studies analyze a single cancer type and often suffer from a lack of sufficient information. Integrative analysis is an emerging trend in biomedical studies. Its most common form is the integrative analysis of multiple types of omics data, including gene expressions, copy number variations, and others, which has led to interesting findings beyond those of single-type omics analysis. In this study, we have taken a different perspective and conducted integrative analysis on multiple cancer types to facilitate across-cancer information borrowing. Similarity across cancer types has been extensively studied in the literature, which provides a solid biological ground for our integrative analysis. Both marginal and joint analysis approaches have been developed with two types of similarity-based penalty, which have intuitive formulations and a solid statistical basis. We have analyzed mRNA gene expression data on nine TCGA cancer types with censored survival outcomes and have obtained biologically sensible findings that differ from those of the benchmark analysis.
The proposed analysis can be directly applied to other types of omics data and other cancer types. In this study, we have focused on prognosis data and the AFT model. A continuous outcome can be regarded as a special case of prognosis outcome without censoring, and thus the proposed analysis can be applied directly. It can also be extended to accommodate categorical outcomes using, for example, generalized linear models. With the availability of multiple types of omics data on multiple cancer types, it can be of interest to conduct the two types of integration simultaneously. More functional examination of the data analysis results will be needed to confirm the findings.

Acknowledgments:
We are very grateful to the reviewers for their careful review and insightful comments, which have led to a significant improvement of this article.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A
For optimizing objective functions (2) and (6), a weighted normalization is first conducted. Then objective functions (2) and (6) can be rewritten as (A1) and (A2), respectively. The coordinate descent (CD) technique is used to optimize objective functions (A1) and (A2). In the CD procedure, the objective function is optimized with respect to one parameter at a time, with the other parameters fixed at their current values. All parameters are iteratively cycled through until convergence.
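One common form of weighted normalization for weighted least squares is sketched below. It is an assumption and not necessarily the exact transformation used here: covariates and responses are centered at their weighted means and scaled by the square roots of the weights, so that a weighted least squares loss with an intercept becomes an ordinary unweighted, intercept-free one. The function name `weighted_normalize` is hypothetical, and the weights `w` are taken as given.

```python
import numpy as np

def weighted_normalize(X, y, w):
    """Center at weighted means and scale by sqrt(weights) (assumed form).

    After this step, sum_i w_i (y_i - a - x_i' b)^2 minimized over the
    intercept a equals ||y_tilde - X_tilde b||^2.
    """
    w = np.asarray(w, dtype=float)
    x_bar = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted covariate means
    y_bar = (w * y).sum() / w.sum()                  # weighted response mean
    sw = np.sqrt(w)
    return sw[:, None] * (X - x_bar), sw * (y - y_bar)

# Small illustration with arbitrary nonnegative weights
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
w = rng.uniform(0.0, 1.0, size=50)
X_tilde, y_tilde = weighted_normalize(X, y, w)
```

With the intercept absorbed in this way, each coordinate update only needs inner products of the transformed columns with the current residual.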
Specifically, with fixed tuning parameters, for j = 1, . . . , p, the CD algorithm for penalized objective function (A1) proceeds as follows.
is the magnitude-based shrinkage penalty (3), compute: x (k) ij 2 + λ 2 and a = 1 is the sign-based shrinkage penalty (4), compute: where χ is a small positive number, which is set as 0.01 in our numerical study.
(3) Repeat Step (2) until convergence. In our numerical study, convergence is concluded if
With fixed tuning parameters, the CD algorithm for penalized objective function (A2) proceeds as follows.
is the magnitude-based shrinkage penalty (3), compute: is the sign-based shrinkage penalty (4), compute: where χ is a small positive number, which is set as 0.01 in our numerical study.
(3) Repeat Step (2) until convergence. In our numerical study, convergence is concluded if
These approaches involve tuning parameters, which are selected using cross validation.
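The general structure of a cyclic CD procedure can be illustrated with the sketch below. It is not the proposed algorithm: a plain Lasso penalty is swapped in for the penalties in (A1) and (A2), the magnitude-based and sign-based shrinkage terms are not implemented, and the stopping rule (largest coefficient change below `tol`) is an assumed criterion. The names `cd_lasso`, `lam`, and `tol` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def cd_lasso(X, y, lam, tol=1e-6, max_iter=1000):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.

    Stand-in example: Lasso penalty only, assumed convergence rule.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()          # current residual y - X beta
    col_ss = (X ** 2).sum(axis=0) / n       # x_j' x_j / n for each column
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):                  # update one parameter at a time
            if col_ss[j] == 0.0:
                continue
            # Inner product of x_j with the partial residual (others fixed)
            rho = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid += X[:, j] * (beta[j] - new)   # keep residual in sync
            beta[j] = new
        if np.max(np.abs(beta - beta_old)) < tol:   # assumed stopping rule
            break
    return beta

# Illustration on simulated data with three important covariates
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 20))
b_demo = np.zeros(20)
b_demo[:3] = 2.0
y_demo = X_demo @ b_demo + 0.1 * rng.standard_normal(100)
beta_hat = cd_lasso(X_demo, y_demo, lam=0.1)
```

In practice the tuning parameter would be chosen by cross validation over a grid, refitting with the routine above on each training fold.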
Appendix B
Figure A1. Marginal analysis: clustering dendrogram based on the relative Euclidean distances.