Better Agreement of Human Transcriptomic and Proteomic Cancer Expression Data at the Molecular Pathway Activation Level

Previously, we have shown that the aggregation of RNA-level gene expression profiles into quantitative molecular pathway activation metrics results in lesser batch effects and better agreement between different experimental platforms. Here, we investigate whether pathway level of data analysis provides any advantage when comparing transcriptomic and proteomic data. We compare the paired proteomic and transcriptomic gene expression and pathway activation profiles obtained for the same human cancer biosamples in The Cancer Genome Atlas (TCGA) and the NCI Clinical Proteomic Tumor Analysis Consortium (CPTAC) projects, for a total of 755 samples of glioblastoma, breast, liver, lung, ovarian, pancreatic, and uterine cancers. In a CPTAC assay, expression levels of 15,112 protein-coding genes were profiled using the Thermo QE series of mass spectrometers. In TCGA, RNA expression levels of the same genes were obtained using the Illumina HiSeq 4000 engine for the same biosamples. At the gene level, absolute gene expression values are compared, whereas pathway-grade comparisons are made between the pathway activation levels (PALs) calculated using average sample-normalized transcriptomic and proteomic profiles. We observed remarkably different average correlations between the primary RNA- and protein expression data for different cancer types: Spearman Rho between 0.017 (p = 1.7 × 10−13) and 0.27 (p < 2.2 × 10−16). However, at the pathway level we detected overall statistically significantly higher correlations: averaged Rho between 0.022 (p < 2.2 × 10−16) and 0.56 (p < 2.2 × 10−16). Thus, we conclude that data analysis at the PAL-level yields results of a greater similarity when comparing high-throughput RNA and protein expression profiles.


Introduction
Most known human gene products execute their molecular function at the protein level. Proteomics, therefore, should theoretically be considered the preferred approach for high-throughput screening of the expression of such genes. This is also true for quantitative assessments of the activities of molecular pathways. However, for various reasons, the current limitations of proteomic techniques do not allow routine screening of most protein-coding genes [1]. For example, in the US National Cancer Institute's large-scale project Clinical Proteomic Tumor Analysis Consortium (CPTAC), devoted to the integrative proteogenomic characterization of human cancers, expression levels for only 15,112 genes have been resolved at the protein level by using tandem mass (MS/MS) spectrometry conducted on Thermo QE series of mass spectrometers (QE, QEplus, QE-HF, and QE-HF-X) [2]. These limitations are especially severe for the analysis of formalin-fixed paraffin-embedded (FFPE) cancer tissue samples, which is a routine format of biosample storage in clinical oncology [3]. Formalin fixation produces numerous artifact chemical modifications, including covalent cross-links between unrelated protein molecules [4].
Thus, it is important to improve current proteomic instruments or to use alternative methods that could be validated at the proteomic level. Transcriptomics offer another way of measuring expression of protein-coding genes [5]. It has been shown that RNA-and proteomic raw gene expression levels statistically significantly correlate with mean average Speaman Rho 0.16-0.52 [6][7][8]. Furthermore, RNA sequencing (RNAseq) can be used as an alternative to immunohistochemical methods for the assessment of biomarker functional status in clinical cancer samples for proteins such as HER2, ESR1, PGR, and PD-L1 [9].
The RNAseq results demonstrated high reproducibility and robustness in resolving transcripts of 23,248 protein-coding genes [10]. This allowed us to incorporate this type of analysis into the pipeline of tumor molecular analysis to estimate the expression of cancer drug target genes and the activation of targeted molecular pathways [11]. The latter, in turn, made it possible to simulate and predict the activities of cancer drugs with known molecular specificities, and to determine their patient-specific rating [12]. However, to our knowledge, the question of whether there is a correlation between the quantitative molecular pathway activation data received using RNAseq and proteomic data has never been explored before.
We have shown previously that the aggregation of RNA-level gene expression profiles into quantitative molecular pathway activation metrics results in reduced batch effects and better agreement between different experimental platforms. However, to our knowledge, the question of whether the pathway level of data analysis provides any advantage when comparing transcriptomic and proteomic data has never been explored before.
Here, we examined this question using paired proteomic and transcriptomic gene expression profiles obtained for the same human cancer tissue biosamples and available through The Cancer Genome Atlas (TCGA) and the NCI Clinical Proteomic Tumor Analysis Consortium (CPTAC) portals. We analyzed a total of 755 samples for glioblastoma and breast, liver, lung, ovarian, pancreatic, and uterine cancers. In CPTAC proteomic assay, the expression levels of 15,112 genes were profiled using the tandem mass (MS/MS) spectrometry conducted on Thermo QE series of mass spectrometers (QE, QEplus, QE-HF, and QE-HF-X) [2]. We used TCGA RNA expression levels of the same genes obtained with Illumina HiSeq 4000 platform for the same biosamples to compare proteomic and transcriptomic data at the individual gene-and pathway activation levels. At the gene level, absolute gene expression values were compared, whereas the pathway-grade comparisons were made between the pathway activation levels (PALs) calculated using average samplenormalized transcriptomic and proteomic profiles. We observed remarkably different average correlations between the primary RNA-and protein expression data for different cancer types: Spearman Rho between 0.017 (p = 1.7 × 10 −13 ) and 0.27 (p < 2.2 × 10 −16 ). However, at the pathway level, we detected overall statistically significantly higher correlations: averaged Rho between 0.022 (p < 2.2 × 10 −16 ) and 0.56 (p < 2.2 × 10 −16 ). Thus, we conclude that data analysis at the PAL level enables obtaining results of a greater similarity when comparing high-throughput RNA and protein expression profiles.

Results
Intracellular molecular pathways include gene products participating in common molecular processes, i.e., in all major cellular events in health and disease. Traditionally, they are primarily classified as the metabolic, DNA repair, signaling, and cytoskeleton organization pathways [1]. Quantitative assessments of pathway activation levels (PALs) have given rise to next generation biomarkers in human biology that are, in many contexts, more accurate and robust than individual gene expression levels [13]. Herein, positive, zero, and negative PAL values mean upregulation, no changes, and downregulation of a molecular pathway, respectively [14]. Furthermore, the absolute value of PAL corresponds to the extent of a pathway differential regulation [13]. Thus, a higher PAL value reflects greater pathway activation and vice versa. For several experimental techniques regarding RNA expression analysis, PALs also showed less significant platform bias and batch effects than individual gene expression profiles [15]. In this study, we compared, for the first time, RNA and protein expression levels for the same biosamples at both the pathway-and individual gene activation levels. The biospecimens were human cancer tissue samples. Primary RNA and protein expression data were extracted from the TCGA and CPTAC project databases, respectively. The platform for RNA expression profiling was Illumina HiSeq 4000, and proteomic profiles were obtained using tandem mass (MS/MS) spectrometry conducted on Thermo QE series of mass spectrometers (QE, QEplus, QE-HF, and QE-HF-X) [2]. In total, we analyzed 755 paired transcriptomic/proteomic biosamples for seven human cancer types: breast invasive carcinoma (99, 13.11%), glioblastoma (98, 12.98%), hepatocellular carcinoma (87, 11.52%), lung adenocarcinoma (111, 14.70%), ovarian serous cystadenocarcinoma (119, 15.76%), pancreatic ductal adenocarcinoma (140, 18.54%), and uterine corpus endometrial carcinoma (101, 13.38%).
Associations between protein and RNA expressions were assessed using Spearman Rho and Pearson R correlation coefficients for gene expression and pathway activation levels. The initial RNA sequencing profiles were screened and passed technical quality control metrics [16] (Figure 1). We then assessed at the individual gene level Spearman correlations between RNA and protein expression levels in seven human cancer types. We obtained 0.17 mean Spearman correlation that varied from 0.017 for Hepatocellular Carcinoma till 0.27 for Lung Adenocarcinoma samples (Figure 2), which is in line with the previous results, i.e., 0.16-0.35 correlation between the RNA-and proteome-based gene expression levels [6,7]. To compare RNA and protein expression data at the pathway level, we used the PAL approach for the same biosamples. We calculated Spearman correlations between PAL values for all tumor types under investigation ( Figure 3). We observed 0.27 mean Spearman correlation between PAL values, which varied from 0.022 for Hepatocellular Carcinoma up to 0.56 for Lung Adenocarcinoma samples. Thus, we obtained 1.58-fold-change between the averaged gene-to-gene and PAL-to-PAL correlations. We then measured statistical significance of the differences between gene (transcriptome)to-gene(proteome) vs. PAL (transcriptome)-to-PAL (proteome) correlations. Using Wilcoxon statistical test, we observed (Figures 4-10) that both Pearson and Spearman correlations calculated for pathway activation levels were statistically significantly higher for the PAL level compared to the gene level in five out of seven cancer types investigated. In the remaining two cancer types, the results for Pearson and Spearman correlations were inconsistent or statistically not significant, as for ovarian serous cystadenocarcinoma ( Figure 8) and uterine corpus endometrial carcinoma ( Figure 10).
We then performed this analysis for only 78 genes encoding molecular targets for the NCCN-recommended drugs in the seven cancer types considered in this study [17]. We compared 78-gene expression profiles and the profiles of molecular pathways including these genes at the RNA and protein levels. We observed that Pearson and Spearman correlations for pathway activation levels were statistically significantly higher compared to the single gene expression levels in the same four out of seven cancer types (Figures 4,5,9 and 11).         . Paired gene-to-gene and PAL-to-PAL correlation between RNA and protein expression levels estimated within Lung Adenocarcinoma biosamples using Pearson (left) and Spearman (right) correlation coefficients for (A) the total set of genes and molecular pathways; (B) the set of drug target genes and molecular pathways [17].
Additionally, we investigated how activation patterns of individual molecular pathways correlated between the different biosamples and compared this with the gene-specific patterns. We calculated Pearson and Spearman correlations for every gene and every molecular pathway among all biosamples of a given tumor type (Figures 12 and 13). We observed that these correlations were statistically significantly higher for pathway activation levels, with the lowest statistical significance (highest p = 0.013) being observed for hepatocellular carcinoma.
Separately for each cancer type, we then analyzed the "top-10" of the most and the least correlated molecular pathways between RNA and protein expression data (see Tables 1-7 and S1). Notably, such highly correlated pathways with regard to tumor infiltration, immune response, and regulation of DNA polymerase alpha, delta and epsilon activity were consistent among most tumor types (Spearman correlation 0.61-0.91).
We then checked the consistencies of pathway activation schemes built using transcriptomic or proteomic data for the best and the least correlated molecular pathways. We calculated PAL levels for the "top-10" such pathways for Lung Adenocarcinoma biosamples ( Figure 14) and compared activation charts for the chosen "AHR Pathway PS2 Gene expression via ESR" and "reactome Acetylcholine Neurotransmitter Release Cycle Main Pathway" pathways, which have the highest and the lowest correlations, respectively, for Lung Adenocarcinoma (Figure 15). Notably, all the pathways from both the best and the least correlated "top-10" group showed congruent activation patterns for RNAseq and proteomic data. Similarly, most of the components of the two molecular pathways that were considered more in-depth also showed congruent activation trends ( Figure 15).        The number of gene products in a pathway theoretically may have an impact on the extent of the data aggregation effect [15] and, consequently, influence the gain of correlation in the gene-pathway comparisons. Therefore, we separately estimated gene-pathway correlations for the groups of bigger and smaller pathways including, respectively, at least 10, 20, and 40 gene products (Supplementary Figures S1-S7). We noticed, however, that the pathway size did not have any detectable impact on the pathway-level correlation gain.
Specifically, we obtained the following results for the individual cancer types under investigation.

Breast Invasive Carcinoma
The Pearson and Spearman gene-to-gene correlations between RNA and protein expression for an averaged biosample (Figure 2A) were 0.14 and 0.12, respectively, while on the PAL-to-PAL level ( Figure 3A), they were 0.47 and 0.43, respectively. Thus, data analysis at the pathway activation levels resulted in~3.5 times higher transcriptome-proteome correlation compared to the gene level, which was statistically significant (Wilcoxon p < 2.2 × 10 −16 ); see Figures 4A, 16A and 17A. This difference also remained significant (p < 2.2 × 10 −16 ) for drug target genes and molecular pathways ( Figure 4B).    Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 , Figures 12A and 13A).
The "top-10" most strongly and weakly correlated molecular pathways are shown on Table 1.

Glioblastoma Multiforme
The Pearson and Spearman gene-to-gene correlations for an average biosample ( Figure 2B) were 0.29 and 0.24, respectively, and PAL-to-PAL correlations ( Figure 3B) were 0.39 and 0.35, respectively. Thus, in this case, we detected~1.5 times pathway-level gain of correlation, which was statistically significant (p < 2.2 × 10 −16 and p < 10 −9 for Pearson and Spearman correlations, respectively); see Figures 5A, 16B and 17B. This difference also remained statistically significant for the cancer drug-targeted genes and molecular pathways ( Figure 5B).
Pearson and Spearman correlations for individual genes or pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 , Figures 12B and 13B).
The "top-10" of the most and the least correlated pathways are shown on Table 2.

Hepatocellular Carcinoma
The Pearson and Spearman gene-to-gene correlations for an average biosample ( Figure 2C) were 0.02 and 0.017, respectively, and PAL-to-PAL correlations ( Figure 3C) were 0.061 and 0.022, respectively. This corresponded to~3 times pathway-level gain for the Pearson correlation (p < 2.2 × 10 −16 ), whereas the fold-change for the Spearman correlation was only~1.2, which was also statistically significant (p = 2.2 × 10 −8 ); see Figures 6A, 16C and 17C. This difference, however, had the opposite effect on cancer drug targeted genes and molecular pathways ( Figure 6B).
The Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, p = 5 × 10 −5 and p = 0.013, respectively); see Figures 12C and 13C.
The "top-10" of the most and the least correlated molecular pathways are shown on Table 3.

Lung Adenocarcinoma
The Pearson and Spearman gene-to-gene correlations for an average biosample (Figure 2D) were 0.33 and 0.27, respectively, and PAL-to-PAL correlations ( Figure 3D) were 0.58 and 0.56, respectively. Thus, the detected pathway-level gain of correlation was 2 times (p < 2.2 × 10 −16 ); see Figures 7A, 16D and 17D. The correlation gain also remained statistically significant for cancer drug-targeted genes and molecular pathways ( Figure 7B).
Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 , Figures 12D and 13D).
The "top-10" of the most and the least correlated molecular pathways are shown on Table 4.
We also compared PAL levels for "top-10" lists of the most and the least correlated molecular pathways in Lung Adenocarcinoma biosamples ( Figure 14) for the transcriptomic and proteomic data. All of the most and the least correlated pathways showed common activation or inhibition trends for the RNA and protein expression data.
Similarly, the activation charts for "AHR Pathway PS2 Gene expression via ESR" and "reactome Acetylcholine Neurotransmitter Release Cycle Main Pathway" pathways were compared ( Figure 15). Similarly, most of the components in the two considered molecular pathways were also congruently activated.

Ovarian Serous Cystadenocarcinoma
The Pearson and Spearman gene-to-gene correlations for an average biosample ( Figure 2E) were 0.24 and 0.23, respectively, and PAL-to-PAL correlations ( Figure 3E) were 0.25 and 0.17, respectively. In this case, no pathway-level correlation gain was detected (Pearson~1; Spearman~0.74 times) either for drug target genes or molecular pathways ( Figure 8B).
The Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 ); see Figures 12E and 13E.
The "top-10" of the most and the least correlated molecular pathways between RNA and protein expression levels are shown on Table 5.

Pancreatic Ductal Adenocarcinoma
Pearson and Spearman gene-to-gene correlations for an average biosample ( Figure 2F) were 0.16 and 0.11, respectively, and PAL-to-PAL correlations ( Figure 3F) were 0.19 and 0.15, respectively. This suggests~1.4 times pathway-level gain of correlation which was statistically significant only for Spearman (p = 3.6 × 10 −9 ) correlation; see Figures 5F, 15A and 16F. Differences between Spearman correlations were also statistically significant for drug target genes and molecular pathways ( Figure 9B).
The Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 ); see Figures 12F and 13F.
The "top-10" of the most and the least correlated molecular pathways are shown in Table 6.

Uterine Corpus Endometrial Carcinoma
The Pearson and Spearman gene-to-gene correlations for an average biosample ( Figure 2G) were 0.29 and 0.22, respectively, and PAL-to-PAL correlations ( Figure 3G) were 0.29 and 0.23, respectively. In this case, correlations remained essentially the same for both geneand pathway levels; see Figures 10A, 16G and 17G. The same trend was also seen for drug target genes and pathways ( Figure 10B).
Pearson and Spearman correlations for individual genes or molecular pathways were statistically significantly higher for pathway activation levels (Wilcoxon, both p < 2.2 × 10 −16 , Figures 12G and 13G).
The "top-10" of the most and the least correlated molecular pathways between RNA and protein expression levels are shown on Table 7.

Discussion
Quantitative gene expression profiles at the mRNA and protein levels are fundamentally different because of differential mRNA and protein stability patterns, epigenetic and protein translation regulatory mechanisms [18] and technical differences in the screening methods [19]. However, there is an overall correlation between the quantitative gene expression profiles that was detected for various organisms and cell types [18].
In this study we aimed to compare correlations of RNA-and protein-level gene expression data at the single gene and molecular pathway levels.
At the gene level, the correlations observed were congruent with the literature data (e.g., we observed~0.23-0.24 correlations for ovarian cancer at the gene level, compared to~0.3 in the previous report [7], or~0.24-0.29 for glioblastoma in this study compared to previously observed~0.15 [6]).
On the other hand, we detected remarkably different correlations between RNA expression and proteomic data on both gene-and pathway levels between the different tumor types investigated simultaneously in TCGA and CPTAC projects. For example, the lowest correlation could be observed for hepatocellular carcinoma (0.017-0.06), whereas the biggest was detected for lung adenocarcinoma (0.33-0.58).
Such dramatic difference from the first view could be explained by vulnerabilities of the experimental expression platforms to the nature of the tested biospecimens. However, some biological mechanisms could also be involved, e.g., tumor type-specific pH alteration, which can lead to dramatic differences in the repertoire of translated proteins while not strongly affecting RNA transcription [20].
In any case, we noticed that in five out of seven cancer types tested (glioblastoma, breast, liver, lung, pancreatic cancers), the expression data analysis at the pathway level was beneficial in terms of improving the correlation between the quantitative mRNA and protein data for the same biospecimens. In the two remaining tumor types (endometrial and ovarian cancers), we observed no pathway-level gain of correlation between mRNA and proteomic data. These results did not depend on the pathway size, as they were reproduced here with pathways of any size, and with pathways with 10-, 20-, and 40-participant size limits.
Similarly, we observed a gain of correlation between mRNA and protein expression in four out of seven cancer types in a correlation analysis of cancer drug-targeted genes and molecular pathways.

Processing of RNA Sequencing Data
Ensembl gene IDs were converted to HGNC gene symbols according to the Complete HGNC dataset (https://www.genenames.org, accessed on 14 November 2021. Overall, expression levels were established for 36,596 annotated genes with HGNC identifiers. Additionally, '1' was added to all raw gene counts prior to cluster analyses to avoid zero expression values, following the recommendation by Dillies et al. [21]. The gene expression data were merged into a single dataset and preprocessed using DESeq2 [22] as a normalization method. Hierarchical clustering was performed using R "ward.D2" method. We used a threshold of 2.5 M uniquely mapped reads for QC of RNA sequencing data (Figure 1), as this was found to be effective for marking samples with low-quality values of other QC metrics, e.g., the proportion of genomic counts, high rate of mismatches, number of reads spanning splice junction, a high percentage of ribosomal counts [16].

Molecular Pathway Annotation and Activation Scoring
In this study we used a publicly available collection of molecular pathways extracted from Biocarta version 1.2 [25], Qiagen Pathway Central [26], Kyoto Encyclopedia of Genes and Genomes (KEGG) [1], NCI database version 1.2 [27], and Reactome version 1.3 [13] databases, and algorithmically annotated for molecular functions of pathway components and nodes [28]. Using the Oncobox bioinformatics platform [15,16] we calculated pathway activation levels (PALs) for a total of 1611 molecular pathways containing 10 or more gene products. For PAL calculations, each sample expression profile was normalized on mean geometrical levels of RNA or protein expression for all samples in the dataset under analysis.
The PAL approach considers the impact of each gene product on overall molecular pathway activation [29,30], the PAL value for a pathway p a given sample is calculated as follows: PAL p = ∑ n ARR np · BTIF n · ln(CNR n ) where CNR n (case-to-normal ratio) is the ratio of gene n expression level in the sample under investigation to the mean geometrical gene n expression level in the group of control samples. The Boolean flag BTIF n (beyond tolerance interval flag) is zero when the CNR n value has not passed the significance criterion: when the difference with the control group of samples is not significant, where p > 0.05 ARR n,p (activator/repressor role of gene n in pathway p) is the discrete value that equals to −1 when gene product n is a repressor of pathway p; 1, when gene product n is an activator of pathway p; 0, when gene product n has both activities of an activator and of a repressor of pathway p; 0.5 and −0.5, respectively, when gene product n is rather an activator or repressor of pathway p.

Conclusions
In this study, we aimed to compare correlations among RNA-and protein-level gene expression data at the single gene and molecular pathway levels. We detected remarkably different correlations between RNA expression and proteomic data on both gene-and pathway levels among the different tumor types investigated simultaneously in TCGA and CPTAC projects. For example, the lowest correlation was observed for hepatocellular carcinoma (0.017-0.06), whereas the biggest was detected for lung adenocarcinoma (0.33-0.58). This dramatic difference could be due to the vulnerabilities of the experimental expression platforms to the nature of the tested biospecimens. However, some biological mechanisms may also be involved, e.g., tumor type-specific pH alteration, which can lead to dramatic differences in the repertoire of translated proteins while not strongly affecting RNA transcription [20]. We also demonstrated that the assessment of pathway activation levels based on transcriptomic data produces largely congruent profiles with those for the proteomic profiles. Our results evidence that the pathway level of transcriptomic data analysis can be advantageous compared to the single-gene level because it can statistically significantly improve correlations among RNA-and proteomic data in most of the tested cases.