Comprehensive Analysis of Large-Scale Transcriptomes from Multiple Cancer Types

Various abnormalities of transcriptional regulation revealed by RNA sequencing (RNA-seq) have been reported in cancers. However, strategies to integrate multi-modal information from RNA-seq, which would help uncover more disease mechanisms, are still limited. Here, we present PipeOne, a cross-platform one-stop analysis workflow for large-scale transcriptome data. It was developed based on Nextflow, a reproducible workflow management system. PipeOne is composed of three modules, data processing and feature matrices construction, disease feature prioritization, and disease subtyping. It first integrates eight different tools to extract different information from RNA-seq data, and then used random forest algorithm to study and stratify patients according to evidences from multiple-modal information. Its application in five cancers (colon, liver, kidney, stomach, or thyroid; total samples n = 2024) identified various dysregulated key features (such as PVT1 expression and ABI3BP alternative splicing) and pathways (especially liver and kidney dysfunction) shared by multiple cancers. Furthermore, we demonstrated clinically-relevant patient subtypes in four of five cancers, with most subtypes characterized by distinct driver somatic mutations, such as TP53, TTN, BRAF, HRAS, MET, KMT2D, and KMT2C mutations. Importantly, these subtyping results were frequently contributed by dysregulated biological processes, such as ribosome biogenesis, RNA binding, and mitochondria functions. PipeOne is efficient and accurate in studying different cancer types to reveal the specificity and cross-cancer contributing factors of each cancer.It could be easily applied to other diseases and is available at GitHub.

Besides mRNA, a large number of non-coding RNAs can be detected by RNA-seq, such as circular RNAs (circRNA) [18], long non-coding RNA (lncRNA, linear) [19]. CircRNAs are generated by a mechanism called back-splicing, in contrast to canonical splicing for linear RNAs, and these two splicing mechanisms may compete with each other [20]. One of the functions of circRNAs is serving as miRNA sponges [21]. LncRNAs have been demonstrated to be functional in different cellular activities and dysregulated in various cancers [22,23]. For example, lncRNA PVT1 drives oncogene MYC expression in various types of cancer cells [24]. Furthermore, retrotransposons are a large group of mobile DNA in the genome [25] that have the potential to be transcribed (retrotranscriptome) [26,27] and may be involved in many diseases including cancer [28]. In particular, human endogenous retroviruses (a type of retrotransposons) are stage-specifically transcribed during human embryonic development [29].
Biological processes in the cell interact with each other. For example, RNA-editing can affect AS and circular RNA biogenesis [30], RNA regulators may affect AS and APA [31], and gene fusions may dramatically change the transcriptome [32]. Focusing on only one single type of information may result in failure to identify critical factors underlying diseases. Therefore, combining multi-modal information in one model is critical to pinpoint the key players in pathological conditions. However, such a tool integrating all types of RNA-seq analyses is still lacking, although numerous analysis packages have been developed to perform specific analysis aforementioned [33]. Here, we developed PipeOne, a one-stop RNA-seq analysis pipeline that can integrate multi-modal information from large scale RNA-seq data to systematically identify key factors underlying diseases and stratify disease subtypes. Its application in five cancer types revealed shared cancer driver genes and pathways, and clinically-relevant cancer subtypes with genetic support from somatic mutations. PipeOne is freely available at https://github.com/nongbaoting/PipeOne. (version 1.1.0, accessed on 7 September 2020).

Disease-Related Feature Selection by Machine Learning
After preparation of the eight feature matrices, random forest algorithm (Python package 'scikit-learn') was used to select disease relevant features. Only the top 1000 (users may choose a different number) most variable features from each type of feature matrix were combined together and used for machine learning. Samples were randomly separated into training set (70% samples) and testing set (30%). First, random forest was applied to the training set to obtain all features importance using the leave-one-out validation strategy. Then, the top K (K = 10, 20, 50, 100, 200, all) important features were validated using the testing set. For each chosen K, the subset training matrix with only those top K features were used as the input for retraining as in the first step, and this time the subset test matrix with only those top K features were used to evaluate the clustering accuracy. Finally, features with non-zero importance scores were selected and regarded as disease relevant, which can be further used for downstream analysis.

Cancer Subtyping Analysis
This module used non-negative matrix factorization (NMF) to obtain a latent feature matrix of the disease samples, clustered those samples, and evaluated those clusters as significant subtypes by using survival analysis. Similar to the feature selection module, the top 1000 most variable features were selected first. First, a robust NMF integration algorithm was applied to find latent features with corresponding feature weights. Then, the K-Means clustering algorithm (Python package 'scikit-learn') was performed on the latent feature matrix and clusters was evaluated by silhouette width [53]. Next, clinical data with survival information was used to assess the clinical significance of clustered subtypes by testing the survival difference between those subtypes (log-rank test, R package 'survival'). Finally, a supervised random forest algorithm (Python package 'scikit-learn') was applied to the subtype result with largest silhouette value among significant clustering results (based on log-rank test p-values) to obtain the relative importance of all features contributing to subtyping.

Other Bioinformatic Analysis
Pathway and disease enrichment analysis for cancer-contributing genes in each cancer type was performed by invoking ToppGene API (https://toppgene.cchmc.org, accessed on 14 March 2021). Pathway enrichment for subtype-contributing genes shared across cancer types was performed by ToppGene webserver directly. Somatic mutation analysis and visualization for cancer subtypes were performed by MAFtools (v2.2.10, Singapore) [54].

PipeOne Workflow Overview
Studying diseases from multiple aspects proves to be effective in revealing disease mechanisms. We developed PipeOne that uses multi-modal information from RNA-seq data to comprehensively investigate transcriptomes. As shown in Figure 1A (see Methods), PipeOne is composed of three modules, data processing and feature matrices construction, disease feature prioritization, and disease subtyping. The pipeline is based on Nextflow [55], a reproducible workflow management system, and integrates eight different tools to extract different information from RNA-seq data in the first module ( Figure 1B). Subsequently, module two applies random forest algorithm to the combined most variable features extracted by algorithms in the first module to prioritize a number of diseaserelevant features for downstream analysis of disease mechanisms ( Figure 1C). Module three stratified patients according to evidences from multiple-modal information, as extracted similarly in module two, for better diagnosis and treatment ( Figure 1D).
Compared with two previous RNA-seq pipelines, RNACocktail [56] and VIPER [57], PipeOne integrated predictions for both novel lncRNAs and circRNAs, retrotranscriptome, and alternative splicing (Table 1). PipeOne would also perform and focused on comprehensive feature prioritization and subtyping analysis, instead of simply differential expression analysis.  (D) Details of module three. First, a robust NMF integration algorithm was applied to obtain latent features and associated weights for all samples. Then K-Means clustering evaluated by silhouette width was used to cluster samples based on the latent feature matrix. Differential survival analysis by log-rank test was used to assess the clinical relevance of those stable clusters as potential subtypes. Finally, similar to module two, random forest was used to select features contributing to the subtyping results.
Details of module two. First, feature importance was calculated by using random forest on all features from module one. Then the top K (20, 20, 50, 100, 200, all) ranked by feature importance were used to test and validate the importance of those top features. (D) Details of module three. First, a robust NMF integration algorithm was applied to obtain latent features and associated weights for all samples. Then K-Means clustering evaluated by silhouette width was used to cluster samples based on the latent feature matrix. Differential survival analysis by log-rank test was used to assess the clinical relevance of those stable clusters as potential subtypes. Finally, similar to module two, random forest was used to select features contributing to the subtyping results.

Cancer Genes and Pathways Contributing to Multiple Types of Cancer
We then applied PipeOne to five cancer types (COAD, n = 325 (41 normal), KIRP, n = 320 (32 normal), LIHC, n = 421 (50 normal), STAD, n = 403 (32 normal), and THCA, n = 555 (58 normal)). For each cancer, we observed that many non-mRNA expression cancer-associated features, especially AS, lncRNA expression and retrotranscriptome (Figure 2A, Table S1), supporting our claim that integration of multimodal information may help capture more relevant dysregulated disease factors We found 23 features as shared cancer-associate factors among different cancers ( Figure 2B). In contrast to other cancers, STAD showed the least shared factors (n = 3), and only one shared (KRT7 expression) with the gastrointestinal cancer COAD. KRT7 regulates cell differentiation and is involved in the regulating translation of human papillomavirus type 16 (HPV16) [58], which is oncogenic and contributes to both STAD and COAD development [59,60]. One other shared factor of STAD is ERV3116A3_Xq28b retroelement expression found in THCA. Surprisingly, we observed more frequent multiexon-skipping AS events near the same loci in ABI3BP in three tumors (COAD, KIRP, and THCA), compared to adjacent normal samples ( Figure 2C-E). ABI3BP was reported as a tumor suppressor gene in thyroid cancer [61] and lung cancer [62], and dependent on TP53 [63]. The exon-skipping AS event in this gene probably disrupts its tumor suppressor function across cancers to promote tumorigenesis. Other shared factors included expression of known oncogenic lncRNA PVT1 and mRNA genes involved in programmed cell death, such as BAX (apoptosis), LGALS3 (apoptosis), FBP1 (ferroptosis), and TMEM123 (oncosis), insulin growth factor gene IGF2, and macrophage inhibitory cytokine GDF15. The functionalities of those unexplored factors specifically contributing to individual cancer may be further investigated.
Interestingly, disease enrichment in our important LIHC genes showed drug toxicity or adverse reaction to drug (p = 8.7 × 10 −9 FDR = 1.6 × 10 −5 ) (Table S2). Moreover, acute kidney failure and secondary liver neoplasm were shared by four out five cancer types, suggesting the presence of both liver and kidney dysfunction during tumor progression in different cancers ( Figure 2G).

Clinically-Relevant Cancer Subtypes Characterized by Distinct Somatic Mutations
We further analyzed PipeOne-derived cancer subtypes (Table S3). Three subtypes were identified in LIHC (p = 0.014), KIRP (p = 0.012), and THCA (p < 1 × 10 −4 , six in STAD (p = 0.033) ( Figure 3A-D), but none in COAD. Most samples in KIRP and THCA were classified into two subtypes, subtype-0 (S0) and subtype-1 (S1). We compared the correlation between the PipeOne predictive subtypes and the pathological stages of the cancer and the results showed that there were a few significant correlations, although in most cases there was no correlation between them. (Supplementary Figure S2). This result indicates that there may be a different genetic basis between PipeOne derived subtypes and pathological subtypes, and may provide clues for the new stratification of patients for further evaluation and appropriate treatment. To understand the genetic basis for these cancer subtypes, we investigated somatic mutations in LIHC, KIRP, and THCA. For THCA, S1 and S2 differed most for mutations in BRAF (69% vs. 53%) and HRAS (1% vs. 5%) ( Figure 3E). For KIRP, we observed strikingly higher MET (13% vs. 4%), KMT2D (10% vs. 3%), and KMT2C (9% vs. 5%) somatic mutations in S0 compared to S1 ( Figure 3F). MET is an oncogenic tyrosine kinase and KMT2C and KMT2D are histone methyltransferases that could remodel chromatin. For the three subtypes of LIHC, S0 and subtype-2 (S2) were similar and showed higher number of TP53 somatic mutations, while lower number of TTN mutations, compared with S1 ( Figure 3G).
Genes 2021, 12, 1865 8 of 12 5%) ( Figure 3E). For KIRP, we observed strikingly higher MET (13% vs. 4%), KMT2D (10% vs. 3%), and KMT2C (9% vs. 5%) somatic mutations in S0 compared to S1 ( Figure 3F). MET is an oncogenic tyrosine kinase and KMT2C and KMT2D are histone methyltransferases that could remodel chromatin. For the three subtypes of LIHC, S0 and subtype-2 (S2) were similar and showed higher number of TP53 somatic mutations, while lower number of TTN mutations, compared with S1 ( Figure 3G).  Next, we examined subtype-contributing features in each cancer and across cancers. We observed 232 features (Table S4) shared by at least three cancer types, among which 41 were shared by at least four cancers. These 232 features included 155 mRNA features that were enriched in ribosome constitution (p = 9.5 × 10 −33 and RNA binding (p = 1.7 × 10 −19 ( Figure 3H). Disrupted ribosome homeostasis and RNA binding probably contributed to different subtypes with distinct prognosis. Interestingly, among the 41 features, we observed the strongest signals from mitochondria genes ( Figure 3I), possibly reflecting the reprogramming of cellular metabolism from oxidative phosphorylation (OXPHOS) to glycolysis observed in cancer cells [64].

Discussion
During RNA-seq raw data processing, PipeOne not only included classical procedures of data analysis, e.g., quality control, alignment, transcriptome reconstruction, gene quantification, but also contained novel lncRNA prediction, circRNA prediction, RNA editing prediction, fusion prediction, retrotranscriptome quantification, alternative splicing event detection, variants calling. Compared to RNAcocktail and VIPER, PipeOne harbored more functions, including prediction of novel lncRNAs and circRNAs, retrotranscriptome, and alternative splicing event detection. These will greatly enrich the information derived from RNA-seq data for downstream analysis, in which PipeOne focused on integrating multi-modal information to perform feature prioritization and disease subtyping. To the best of our knowledge, existing tools did not utilize such broad aspect information from RNA-seq for integration analysis. PipeOne did not implement many functions provided in VIPER, however, most of those procedures in RNA-seq analysis are classical, and could be performed by using common tools. For example, differential expression analysis could be perform by edgeR [65] or DESeq2 [66]. PipeOne was built based on Docker and Nextflow [55], making installation and management of workflow easy. Considering that users may not have root permission, PipeOne also provided the option of using Conda [67] for installation. These will alleviate users from tedious installation, configuration, and management. By using the Nextflow management system, advanced users could modify PipeOne to adapt their own research purposes.
One limitation of PipeOne is that it currently focuses on RNA-seq data only. Still, it may also be expanded to include features from other high-throughput data, for example, genome sequencing, DNA methylation profiling by microarray or sequencing, and proteomics by mass spectrometry, where any of these data are available Another limitation is subtyping evaluation. In this study, we only assessed the subtyping clusters by Silhouette values and log-rank test-based overall survival analysis. In the future, more choices would be implemented to allow other ways to evaluate subtypes, for example, by comparing patient drug responses when this information is available. That will enable subtyping analysis for other diseases without survival information. In addition, future version of PipeOne could include features from other high-throughput data as mentioned above to perform multi-omics base subtype analysis and proper reduce noise effects may help to better subtyping. For example, DefFusion [68] can make better survival predictions by taking the noise effect into account when integrating multiple omics data.
In short, we presented a cross-platform analysis pipeline integrating all kinds of RNAseq analysis to model diseases comprehensively and demonstrated the power of PipeOne in five cancer types. These results encourage us and other researchers to confidently apply PipeOne to more cancer types and other complex diseases with a large number of RNA-seq samples available. Our pipeline and analysis results provide new opportunities and clues for cancer research, and may inspire future multi-omics analysis to add more information from RNA-seq data.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/genes12121865/s1, Figure S1: The detailed processing pipeline to construct customized reference annotations with novel lncRNA genes, Figure S2: Comparison of PipeOne subtypes with cancer pathological stages in cancers, including KIRP (A), STAD (B), THCA (C) and LIHC (D), Table S1: Cancer-associated features shared across multiple cancer types, Table S2: Enriched pathways and disease annotations of cancer-associated features for each cancer type, Table S3: PipeOne identified cancer subtypes for each cancer type, Table S4: Subtype-contributing features shared in at least three cancer types.
Author Contributions: Y.X. conceived and supervised the project. B.N. built and executed the raw data preprocessing pipeline with the assistance of W.W. and M.G., Z.S. and B.N. analyzed the data. M.G. drafted the manuscript with assistance of B.N. and W.W., B.N., Y.X. and M.G. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare that they have no competing interest.