Connecting Histopathology Imaging and Proteomics in Kidney Cancer through Machine Learning

Proteomics data encode molecular features of diagnostic value and accurately reflect key underlying biological mechanisms in cancers. Histopathology imaging is a well-established clinical approach to cancer diagnosis. The predictive relationship between large-scale proteomics and H&E-stained histopathology images remains largely uncharacterized. Here we investigate such associations through the application of machine learning, including deep neural networks, to proteomics and histology imaging datasets generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) from clear cell renal cell carcinoma patients. We report robust correlations between a set of diagnostic proteins and predictions generated by an imaging-based classification model. Proteins significantly correlated with the histology-based predictions are significantly implicated in immune responses, extracellular matrix reorganization, and metabolism. Moreover, we showed that the genes encoding these proteins also reliably recapitulate the biological associations with imaging-derived predictions based on strong gene–protein expression correlations. Our findings offer novel insights into the integrative modeling of histology and omics data through machine learning, as well as the methodological basis for new research opportunities in this and other cancer types.


Introduction
Kidney cancer is one of the most common cancers worldwide, accounting for hundreds of thousands of deaths every year [1]. Clear cell renal cell carcinoma (CCRCC) is the most common subtype of kidney cancer, representing ~75% of cases [2,3]. Its diagnosis is typically incidental, e.g., as part of medical imaging tests unrelated to kidney problems, and ~30% of patients with CCRCC eventually develop metastases even after removal of the kidney and other treatments [2]. Therefore, there is a need for new approaches to the understanding and early diagnosis of CCRCC.
Histopathology is a well-established technique for confirming diagnosis and subsequent sub-classification of kidney and other cancer types [4,5]. Histopathology consists of the visual analysis of microscopic slides obtained from tissue samples, typically stained with hematoxylin and eosin (H&E). This allows the pathologist to identify cellular patterns associated with the presence of cancer, its staging and potential clinical outcomes. Even when performed by well-trained experts, this task is time-consuming and not always highly reproducible among pathologists [6,7]. Moreover, in kidney and other cancers, the use of histological analysis for diagnostic purposes is often challenging because different cancer subtypes may share non-specific morphological patterns [2,8]. Therefore, the accurate and robust analysis of large amounts of digitized histological slides for cancer diagnosis remains a key challenge in cancer research and clinical practice.
To address such challenges, different computational techniques have been proposed for analyzing histology images for diagnostic purposes in multiple cancers [9]. Such analyses have traditionally relied on the application of classification models, which process "handcrafted" (explicitly defined) image-derived features such as cell size, shape, and pixel intensity distributions observed in full slides or selected slide patches [8,9].
With the wider adoption of whole-slide high-content imaging and the increase in the volume of histology datasets, new opportunities have arisen for the application of deep learning (DL) techniques [10]. Unlike previous generations of machine learning approaches, DL models based on convolutional neural networks (CNNs) can process raw intensity images and learn to automatically extract predictive features [11,12]. The accuracy and potential clinical relevance of DL models for analyzing histology images for diagnostic and prognostic purposes have already been shown in different cancer research domains [13–21]. Thus, DL is expected to play a key role in the era of digital pathology and precision medicine [6,22,23].
The analysis of large amounts of omics data, including transcriptomics and proteomics, has significantly advanced the molecular characterization of dozens of cancer types and offers deeper insights into their diagnosis, prognosis and treatment response assessment [24,25]. This has been possible in large part because of consortia such as The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [25,26]. For example, TCGA recently reported a comprehensive analysis of multiple omics features of renal cancer and their associations with cancer subtypes and patient prognosis [3]. The study found that CCRCC tumors show elevated immune cell-specific gene expression in comparison to other kidney cancer subtypes. Other comprehensive characterizations of omic profiles for diagnostic, prognostic or biological understanding purposes in CCRCC have been recently reported [27–29].
The integration of omics and histopathology data has the potential to improve our understanding of the biological mechanisms underlying tumors, their detection, and treatment [30]. Previous efforts to achieve these goals include the integration of H&E-stained tissue sections and genomic markers from patients diagnosed with gliomas [31]. CNNs were applied to analyze the images and predict patient survival, and the combination of such models with genomic biomarkers outperformed the current clinical prognosis approach [31]. In lung adenocarcinomas (LUAD), histopathology-derived features have been shown to correlate with omics-based classification, using gene and protein expression, and to improve patient survival prediction [32]. More recently, using LUAD and liver cancer datasets, the combination of gene expression and imaging features was also shown to improve patient prognosis [33]. In both investigations the histology-based prediction models processed inputs that represented handcrafted image-derived features reflecting specific cellular and sub-cellular morphological patterns. The application of DL models has also been demonstrated with TCGA-derived histopathology images and omics data. For instance, a CNN applied to whole-slide images showed a diagnostic performance comparable to that of pathologists and was also capable of predicting the mutation status of commonly mutated genes in LUAD [34]. In breast cancer and using histopathology images, CNN-based models assigned patients to diagnostic attributes, e.g., tumor stage, and outperformed models based on transcriptomic data only [35]. Examples of other applications of integrative analysis in renal cancer research are provided in [36] and [37].
Despite the progress achieved to date, such investigations tend to emphasize the implementation of histology-based models for improving classification accuracy. Moreover, the integrated analysis of such models with large-scale proteomics data has received relatively less attention in comparison to genomics and transcriptomics data. Deeper investigations of the association of histology imaging models and large-scale proteomics will not only improve our understanding of the predictive complementarity of such data sources, but may also offer the basis for more precise diagnostic systems. Here we address these research needs through the application of machine learning techniques, including DL models, to proteomics and imaging data. Based on the identification of correlations between image-based models and proteomics profiles, we generate hypotheses about the roles of proteins and biological processes in CCRCC whose molecular activity can be accurately captured by histopathology imaging. Furthermore, to the best of our knowledge, we are the first team to systematically investigate the association of histology imaging and proteomics data in CCRCC using DL.

Methods
An overview of our research strategy is summarized in Figure 1A. Here we address the question of finding associations between diagnostic imaging and proteomics data. To this end, we analyzed histology images and proteomics data from hundreds of tumors and control samples. Machine learning models for distinguishing tumors from normal samples were built for each dataset independently (Figure 1B). Based on the resulting models, we investigated correlations between the diagnostic proteins and the image-based predictions. Using different databases containing annotations of biological processes and pathways, we detected statistically significant correlations that are relevant to cancer in general, and CCRCC in particular. Moreover, we investigated associations between mRNA obtained from the same patient cohort and the histology-based predictions, as well as between mRNA and their corresponding proteins.

Datasets
The proteomics and histology datasets were generated by the CPTAC Clear Cell Renal Cell Carcinoma (CCRCC) Discovery Study [38]. The proteomics data, consisting of Tandem Mass Tags-10 (TMT10) experiments of 216 samples, were downloaded from the CPTAC Data Portal. This dataset included complete information for 9964 proteins measured in 194 samples (84 normal, 110 tumor samples), which are the focus of our investigation. The histology dataset was obtained from The Cancer Imaging Archive (TCIA) and included a total of 783 slide images (259 normal, 524 tumor, examples shown in Figure 1C,D). For some of the patients in this cohort, matching histology slides and proteomic samples (from the same patient) are available for investigating associations between proteomic- and image-based diagnostic models (details below). Before implementing diagnostic models, the proteomics dataset was pre-processed by selecting the LogRatio protein abundance column and replacing null values with zero. Histology images (raw pixel intensity data from thumbnails of whole slides) were fed into the DL models, and further processing at the pixel level was carried out during the model training process, as described next.
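The proteomics pre-processing step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the column names (`Gene`, the ` Log Ratio` suffix), the `NA` marker and the toy values are assumptions made for this example, and the real CPTAC file layout may differ.

```python
import csv
import io


def load_log_ratios(handle, id_column="Gene", suffix=" Log Ratio"):
    """Read a tab-separated table and keep only the log-ratio columns.

    id_column and suffix are hypothetical names chosen for this sketch;
    adjust them to the headers of the actual file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    keep = [c for c in reader.fieldnames if c.endswith(suffix)]
    table = {}
    for row in reader:
        values = {}
        for col in keep:
            raw = (row[col] or "").strip()
            # Missing entries (empty or "NA") are replaced with zero.
            sample = col[: -len(suffix)]
            values[sample] = 0.0 if raw in ("", "NA") else float(raw)
        table[row[id_column]] = values
    return table


# Toy input: two hypothetical proteins, two samples, one unrelated column.
toy_tsv = (
    "Gene\tS1 Log Ratio\tS1 Count\tS2 Log Ratio\n"
    "PROT_A\t-1.2\t5\t\n"
    "PROT_B\t2.5\t7\tNA\n"
)
proteins = load_log_ratios(io.StringIO(toy_tsv))
print(proteins)
```

Note that replacing missing values with zero treats them as "no change relative to the reference" on a log-ratio scale, which is a simple choice but not the only possible one.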

Diagnostic Models
The proteomics-based diagnostic model was generated with a Random Forest (RF) classifier with default parameters and ntree = 500. The selected ntree value was sufficient to obtain the top reported classification performance. As inputs to this model, we focused on the top-10% most variable proteins (based on their SD, i.e., 997 proteins) across all available samples, because these proteins were sufficient to obtain the best discrimination between normal and tumor samples. Using the full set of proteins, we obtained a relatively weaker discrimination of samples (Results). The RF model was trained and tested, and its performance assessed, with a 10-fold cross-validation (10-fold CV) sampling strategy. For both proteomics- and imaging-based models, we assessed their diagnostic performance using standard classification quality indicators: accuracy, precision, recall (sensitivity), F1, and AUC values.
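The variability filter described above can be illustrated with a short sketch. It is not the authors' R code: the function name, the toy matrix and the choice to round the number of retained features up are assumptions (rounding up happens to give 997 for 9964 proteins).

```python
import math
import statistics


def top_variable_features(matrix, feature_names, fraction=0.10):
    """Return the names of the most variable features, ranked by sample SD.

    matrix: list of rows (one per sample), each a list of feature values.
    fraction: share of features to keep (0.10 mirrors the top-10% rule).
    """
    n_keep = max(1, math.ceil(len(feature_names) * fraction))
    # Column-wise standard deviation across all samples.
    sds = [statistics.stdev(col) for col in zip(*matrix)]
    ranked = sorted(zip(feature_names, sds), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n_keep]]


# Toy data: 4 samples x 10 hypothetical proteins; only P3 varies strongly.
toy = [[0.0] * 10 for _ in range(4)]
for i, row in enumerate(toy):
    row[3] = float(i * 2)    # high variance
    row[7] = float(i) * 0.1  # low variance
names = [f"P{j}" for j in range(10)]
selected = top_variable_features(toy, names)
print(selected)
```

Because the filter uses all samples without looking at class labels, it can be applied before cross-validation without directly leaking label information into the classifier.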
The imaging-based diagnostic system consisted of a deep neural network architecture that combined a CNN (the VGG16-CNN [39]), a regularized fully connected (FC) neural network and an output layer (OL). As in the case of the RF model, the objective of the classification task is to distinguish between normal and tumor samples. Because of the relatively small number of images (compared to typical large-scale datasets used in DL) and to reduce the computing times needed to train and test the models, we used a VGG16-CNN that was previously trained on more than 14 million generic images corresponding to 1000 image classes. Such "transfer learning" is a well-established DL approach to extracting and re-using low-level image features across imaging application domains [10].
The histology imaging data were partitioned into training (181 normal and 366 tumor images), validation (52 normal and 105 tumor images) and test datasets (26 normal and 53 tumor images). These datasets were used for model generation, selection and independent evaluation, respectively. To ensure an unbiased and robust analysis, we focused on the independent test dataset for implementing the proteomics-imaging integrative analysis. All the images were resized (to 224 × 224 pixels) and were input as 3-channel images to the DL model. To enable robust model building and reduce the risk of overfitting, images were randomly flipped and zoomed during training. The pre-trained VGG16-CNN was followed by a global average pooling layer, a fully connected network (128 units + ReLU activation) and a dropout layer to further minimize overfitting (rate = 0.2). Image classification was done with a 2-unit output layer (representing the disease and control classes) using the softmax activation function to allow probabilistic classification. The FC and OL layers were optimized on the histology imaging data using the Adam optimization algorithm (lr = 0.001, decay = 0.0002), sparse categorical cross-entropy as the loss function, a maximum of 50 learning epochs and a data batch size of 547. The batch size was chosen to include the full training dataset. Code available at https://gitlab.com/biomodlih/histo-proteo.
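To make the output layer and loss function concrete, the following sketch computes a two-class softmax and the sparse categorical cross-entropy for a single image. It is a plain-Python illustration of the standard definitions, not the Keras implementation used in the study; the logit values and the class encoding (0 = normal, 1 = tumor) are assumptions for this example.

```python
import math


def softmax(logits):
    """Convert raw output-layer scores into class probabilities."""
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(probs, label):
    """Loss for one sample; label is an integer class index."""
    return -math.log(probs[label])


# Hypothetical logits for one image: index 0 = normal, index 1 = tumor.
probs = softmax([0.5, 2.0])
loss_if_tumor = sparse_categorical_cross_entropy(probs, 1)
loss_if_normal = sparse_categorical_cross_entropy(probs, 0)
print(probs, loss_if_tumor, loss_if_normal)
```

The probability assigned to the tumor class (`probs[1]` here) is the kind of per-image score that the integrative analysis correlates with protein expression.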

Integrative Data Analysis
Correlations between protein expression and histology-based predictions (the class probabilities, or p-values, generated by the DL diagnostic system) were calculated with the Pearson correlation coefficient. Of the 79 images available in our independent test dataset, only 24 have patient-matched proteomics data. Functional enrichment analyses using GO, KEGG and Reactome annotations were implemented on the set of predictive proteins. To identify highly differentially enriched (Reactome) pathways in the proteomics data on the basis of their correlation with image-based predictions, we performed Gene Set Enrichment Analysis (GSEA) [40].
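The correlation step can be sketched as follows: for each protein, the Pearson coefficient is computed between its expression across matched samples and the image-based tumor probability, and proteins are then ranked by that coefficient. The protein names and values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(expression, tumor_prob):
    """expression: dict mapping protein -> values, one per matched sample."""
    corr = {p: pearson(v, tumor_prob) for p, v in expression.items()}
    return sorted(corr.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: image-based tumor probabilities for four matched samples.
tumor_prob = [0.05, 0.10, 0.90, 0.95]
expression = {
    "PROT_UP": [1.0, 1.2, 3.1, 3.3],    # rises with the prediction
    "PROT_DOWN": [4.0, 3.8, 1.1, 0.9],  # falls with the prediction
    "PROT_FLAT": [2.0, 2.1, 2.1, 2.0],  # unrelated to the prediction
}
ranking = rank_by_correlation(expression, tumor_prob)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

A ranked list of this kind is also the input that GSEA expects, which is why the same correlations serve both the enrichment analyses and the heatmap annotations.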
We also performed correlative, functional enrichment and GSEA analyses on mRNA data matched to the independent dataset, i.e., patients with proteomic, imaging and gene expression data. Like the other datasets in this article, the gene expression data were generated by the CPTAC project (RNASeq) and analyses were applied to their FPKM expression values [41]. A total of 185 samples were available in the RNASeq dataset with matching proteomics data (including 110 tumors), and 9884 genes with corresponding proteins in the proteomics data. Among these data, 22 samples also have matched histology images.

Software and Statistics
The proteomics-based RF classification model was implemented with the R packages caret and randomForest. The image-based DL classification model was implemented in Python using the Pandas, NumPy, Matplotlib, and Keras libraries. We applied one-sample t-tests for detecting statistical differences between matched data groups using R. The statistical significance of functional enrichment analysis and GSEA was estimated with Benjamini-Hochberg adjusted p-values (statistical significance was defined with a cutoff p-adj value of 0.05 in all analyses). Additional data processing and visualization tasks were completed with the R packages fgsea, Rtsne, ggplot2, and ComplexHeatmap.
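For readers unfamiliar with the Benjamini-Hochberg adjustment, the sketch below shows the standard step-up procedure on a handful of made-up p-values. It is an illustration of the method, not the code used in the study, which relied on R.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for five hypothetical pathways.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(adj)
print(significant)
```

In this toy case two raw p-values below 0.05 (0.039 and 0.041) are no longer significant after adjustment, which illustrates why the adjusted values, not the raw ones, are compared with the 0.05 cutoff.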

A Proteomics-Based Classification Model Accurately Detects CCRCC
Before implementing the proteomics-based classifier, we investigated the sample discrimination potential of the top-variable 997 proteins using an unsupervised classification algorithm. We found that this set of proteins effectively segregates disease and normal samples into clearly separated clusters (t-SNE mapping, Figure S1). Interestingly, when using the full set of proteins available in the dataset, we obtained a relatively good segregation of samples as well: only 3 normal samples were clustered closer to tumor samples than to other normal samples (Figure S1). These observations corroborate both the quality and diagnostic potential of the proteomics dataset, in general, and of our selected set of 997 proteomic markers, in particular.
The proteomics-based RF classification model was capable of distinguishing between CCRCC and normal samples with an overall accuracy of 0.98 (10-fold CV results), as well as high sensitivities and specificities (0.97 and 0.99 respectively). This also resulted in high F1 and AUC values (0.98 and 0.99, 10-fold CV results), which offer further evidence of the powerful diagnostic capacity of our proteomics-based classification model.

A Histology-Based Classification Model Accurately Detects CCRCC
The histology-based prediction (DL) model was trained using the transfer learning and network adaptation strategy detailed in Methods. The training process was implemented to learn the parameters of the FC and OL layers of our DL model, while keeping the (transferred learning) parameters of the CNN frozen. The resulting models consistently reported classification accuracies between 0.98 and 0.99 (on the training dataset), and between 0.81 and 0.88 when evaluated on a separate validation dataset (Methods). Such classification performance was observed when training our DL model for 50 epochs. A relatively high classification performance was also obtained on the validation dataset for fewer training epochs: accuracies between 0.83 and 0.85 (for 3 and 20 training epochs, respectively). To reduce the risk of model overfitting and decrease the time needed for training and evaluating models, we selected a DL model trained with 3 learning epochs and the parameters specified in Methods.
The selected model was then applied to the independent test dataset of histology images. Our histology-based classification model was capable of distinguishing between CCRCC and normal samples with an accuracy of 0.95 on the test dataset, as well as with high sensitivities and specificities (1 and 0.93 respectively). This also resulted in high F1 and AUC values (both equal to 0.92), which further indicates the solid diagnostic capacity of our model. The model actually only misclassified 4 images out of 79 test images: 4 normal images predicted as tumors.
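As a worked illustration of how such indicators follow from a confusion matrix, the sketch below derives them from counts inferred from the text: 53 tumor and 26 normal test images (Methods), with 4 normal images predicted as tumors. Treating tumor as the positive class is an assumption. Published values can differ depending on the positive class, on class averaging and on whether AUC is computed from probabilities, so this is not an attempt to reproduce the reported figures.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard indicators from the four cells of a 2 x 2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts inferred from the text, with tumor as the positive class (assumed):
# all 53 tumors detected, 4 of 26 normal images predicted as tumors.
metrics = binary_metrics(tp=53, fp=4, tn=22, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```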
To rule out the possibility of incidental classification due to imaging artifacts, e.g., differences in the amounts of white pixels, we further assessed images that were correctly classified as tumor and normal samples in our test dataset. First, we did not find evidence that the amounts of white pixels or tissue in the images represented the main distinguishing feature for correctly classifying the images. For example, there are instances of normal and tumor samples that were correctly classified (test dataset) despite sharing similar amounts of white pixels and tissue (Figure S2). Next, we examined the feature maps learned by the CNN in our model and observed that normal and tumor samples generate different feature activations, even for input (normal and tumor) images that display similar amounts of white pixels and tissue (Figure S3). These results indicate that the classification performance of our model is not determined by incidental imaging artifacts.

Proteomic Markers are Correlated with Histology-Based Predictions
The previous section's findings motivated us to investigate in depth the relationship between the proteomic markers and the histology-based model predictions. Knowing that the proteomics data represent a strong source for accurately classifying normal vs. tumor samples, a key question is how such predictive features relate to the image-based predictions. To answer this question, we first calculated correlations between each protein in our test dataset of 24 samples (14 tumors, 10 normal samples) and their corresponding image-based predictions (p-values, i.e., predicted probabilities of assigning a sample to the tumor class). Using hierarchical clustering, we also demonstrated that the protein expression data are sufficient to accurately separate tumor from normal samples (Figure 2A). Moreover, these proteins can be grouped in terms of their (expression) correlations with the image-based predictions (plot shown on left side of heatmap, Figure 2A). In particular, the histology-derived predictions are strongly associated, either highly positively or anti-correlated, with a subset of protein markers (Figure 2B).
A closer examination of these relationships showed that the proteins that are either highly positively or anti-correlated with histology-based predictions are significantly enriched in a diversity of biological processes (Figure 2C and Figure S4, GO terms and KEGG pathways, respectively). In the case of proteins that are highly positively correlated with the image-based predictions, such an enrichment includes processes relevant to cell adhesion, extracellular organization and immune responses (Figure 2C, Table S1). Proteins that are strongly anti-correlated with image predictions are significantly associated with several respiratory and metabolic processes. Unlike highly positively and anti-correlated proteins, weakly correlated proteins, i.e., those with correlations around 0 (Figure 2B), are not statistically associated with specific biological processes.

Independent Verification of Biological Associations
Using an independent database of annotated molecular pathways (Reactome) and an alternative enrichment analysis technique (GSEA), we found again that proteins either strongly positively or anti-correlated with histology-based predictions are significantly enriched in a variety of cancer-relevant molecular pathways (Figure 3A, Table S2). Unlike the analysis reported above, here we considered the actual levels of the observed correlations between the proteomic data and the histology-based predictions for detecting significant functional enrichments. We verified that proteins that are positively correlated with the imaging-based predictions are also statistically associated with molecular pathways relevant to extracellular organization and immune responses (Figure 3A,B). Conversely, we found that proteins that are anti-correlated with histology-based predictions are significantly associated with respiratory and metabolic pathways (Figure 3A,C). These findings provide additional supporting evidence of the direct connection between proteomics markers and histology-based predictions, as well as of their biological meaning in the specific context of CCRCC.
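The way GSEA uses the actual correlation levels can be illustrated with its weighted running-sum statistic. The sketch below computes only the enrichment score for one gene set on a toy ranking; the protein names and set memberships are invented, and the permutation-based significance estimation performed by tools such as fgsea is deliberately omitted.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked: list of (name, score) pairs sorted from highest to lowest score,
            e.g., correlations with the image-based predictions.
    """
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in gene_set)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked items")
    running, best = 0.0, 0.0
    for g, s in ranked:
        if g in gene_set:
            # Hits step up in proportion to the magnitude of their score.
            running += abs(s) ** weight / hit_total
        else:
            # Misses step down by a constant amount.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking of six hypothetical proteins by correlation.
ranked = [("A", 0.9), ("B", 0.8), ("C", 0.1),
          ("D", -0.2), ("E", -0.7), ("F", -0.9)]
es_top = enrichment_score(ranked, {"A", "B"})     # set at the top of the list
es_bottom = enrichment_score(ranked, {"E", "F"})  # set at the bottom
print(es_top, es_bottom)
```

A positive score indicates that the set is concentrated among positively correlated proteins, and a negative score that it is concentrated among anti-correlated ones, which matches the sign convention used for the enrichment scores reported in the following section.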

Genes are Highly Correlated with Proteomic Markers and Imaging-Based Predictions
Next, we analyzed the concordance between proteins and their coding RNAs on the basis of their expression values. This analysis was applied to a set of 22 samples (14 tumor and 8 normal samples) with matched proteomics, gene expression and imaging data available. Figure 4 displays a global view of the correlations between these datasets and the histology-based predictions independently. To facilitate a comparative visualization of major trends, in each plot the rows show proteins (Figure 4A) and their corresponding genes (Figure 4B) in full alignment. This analysis first indicates that, like the proteomics data, the gene expression data are sufficiently informative to perfectly separate tumors from normal samples (Figure 4). Moreover, as observed in the case of the proteomics data, genes can also be meaningfully ranked on the basis of their correlations with the histology-based predictions (see correlation plots on the left side of each heatmap, Figure 4). The latter includes RNAs highly positively and anti-correlated with the histology-based predictions (Figure 4 and Figure S5). Also, as in the case of the proteomics data, such genes are significantly enriched in biological processes (Figure S5): immune responses and extracellular organization (for genes highly positively correlated with histology-based predictions), and metabolic processes (for genes anti-correlated with histology-based predictions). A deeper analysis of these datasets (995 proteins with their corresponding gene expression data) showed strong correlations between protein and gene expression (median absolute Pearson correlation, r = 0.76). This correlation was statistically higher than that observed when all the proteins available in the dataset (n = 9984 proteins with corresponding gene expression data) are considered (r = 0.76 vs. 0.47, p < 2.2 × 10^-16, Figure 5A).
Moreover, we found that the correlations between protein expression and image-based predictions are also concordant with the correlations between gene expression and image-based predictions, in particular for the strongest positive and negative correlations observed in each correlation setting (Figure 5B). GSEA of the proteins and genes separately, ranked by their correlations with the image-based predictions, resulted in 35 statistically enriched molecular pathways that were detected by both datasets independently (Figure 5C). This shared set of functional associations included 31 pathways relevant to different immune and extracellular matrix organization processes with positive enrichment scores, i.e., the correlations of protein (and gene expression) with image-based predictions are also positively correlated with the activity of these pathways (Figure 5D, Table S3). Conversely, there are 4 pathways relevant to different metabolic processes with negative enrichment scores (Figure 5D). The latter means that image-based predictions that are not positively correlated with protein and gene expression are similarly anti-correlated with the activity of these 4 pathways.
To further assess the relevance of the correlations between protein (and gene) expression and image-based predictions, we investigated whether only the correlations between protein and gene expression would be sufficient to detect the above-identified molecular mechanisms independently of the image-derived prediction information. This analysis was done by ranking the 995 proteins on the basis of their expression correlations with their corresponding encoding genes, i.e., from the highest to the lowest protein-gene expression correlation pairs, followed by GSEA applied to the obtained ranking. This analysis did not result in any significant pathway enrichments for the set of 995 proteins, though as expected a variety of pathway enrichments were found when using the full set of 9884 proteins (Figure S6). These results confirm that histology imaging-based predictions can reliably capture information relevant to immune responses and metabolic processes, as encoded in both the proteomics and transcriptomics data.


Discussion
Our research addressed the problem of integrating histopathology- and proteomics-based diagnostic models through machine learning approaches. This challenge is important for systematically determining molecular features that can be accurately captured by pathology-based diagnostic models. Although our proteomic- and pathology-based models do not show perfect classification capacity, they are sufficiently accurate for investigating predictive relationships between them, as well as for establishing commonalities and complementarities at the functional level.
Using CCRCC as a novel case study, we elucidated the correlation of the diagnostic proteomics data with the predictions generated by the histology-based diagnostic model. This analysis demonstrated that, on the basis of their expression, a set of proteins is strongly correlated with the image-derived predictions. Using multiple annotation datasets and statistical analyses, we also showed how these correlations are significantly linked to specific biological processes relevant to the emergence and development of cancer. More specifically, we showed how our histology-based diagnostic model accurately captures predictive features in the proteomics dataset that are implicated in immune responses and extracellular matrix reorganization. These associations are also relevant in light of recent findings by the TCGA showing that CCRCC tumors are characterized by elevated immune activity [3]. Conversely, we showed how anti-correlations between proteomics and histology models are reflective of metabolic processes. Furthermore, we showed that gene expression data can also very closely recapitulate these biological associations based on their strong correlation with the proteomics data. These findings are useful not only for understanding novel ways to integrate these data types for predictive purposes, but also for generating hypotheses about the mechanisms underlying patient-specific classifications.
We showed that our model can accurately classify images in a systematic and automatic way. This would be of particular relevance when analyzing images that have not been annotated by human experts. Although our prediction model cannot be used for directly discovering relevant genes or proteins, we showed that the classifications obtained with our model, in particular the P-values of tumor presence, are correlated with the expression of proteins and genes that are relevant to distinguishing between tumor and normal samples. Moreover, we showed that the genes and proteins with the highest correlation with the image-based predictions are significantly involved in biological pathways relevant to cancer.
The highest positive correlations indicate that immune responses play a central role, mostly due to immune cell infiltration and immunotherapy treatment. The organization of proteins in the membrane affects the motility of the cell, and both events have been described in this type of cancer previously [42]. The observed biological categories that anti-correlate with the proteomics (and transcriptomics) data show important effects centered on metabolism and the respiratory chain in the cell, indicating decoupling between the respiratory chain and metabolism in the cell, as expected in many types of cancer [43,44]. Reprogramming of metabolism is considered a key driver of neoplastic malignancies [45]. The heterogeneous behavior of cancer metabolism is observed between tumors in terms of genetic and expression changes in signaling and regulatory pathways [46]. The pattern of positive correlations found in our analyses suggests that the role of oxidative phosphorylation reflects the adaptation of this type of cancer to physiological conditions such as hypoxia, nutrient availability, or the complement of genetic lesions driving this specific tumor type. We also explored the anti-correlation between transcriptomics and proteomics profiles and metabolic pathway expression, and found that the renal tumors retain the metabolic expression patterns of the corresponding native tissues. This might be a consequence of similar local environments or tendencies to maintain the metabolic expression program established in the original tissue.
Although our study offers novel and relevant insights into the integration of histology and proteomics data through the application of machine learning, it has some limitations that merit future consideration. First, our study is limited by our focus on a single cohort of CCRCC patients. Additional validations on datasets obtained from independent cohorts may further demonstrate the clinical relevance of our diagnostic models and their integrative analysis, and will also enable wider investigations of variations related to different clinical factors, such as gender and tumor subtypes. Nevertheless, our study provides a solid basis for further investigations based on the analysis of carefully annotated datasets obtained from a CPTAC reference cohort.
Our study is also limited by the relatively small amounts of data, in particular those needed for independently validating our models on matched histology and proteomics data from the same patients. Although the CPTAC currently offers the largest amount of data combining histology and proteomics data for CCRCC research, further validations with cohorts of different sizes are needed. Although at present we do not have access to another cohort of H&E-stained histopathology images, additional testing of our model on independent image collections will be crucial for further demonstrating its potential clinical relevance. Future research will also benefit from the implementation of multi-modal classification models that directly combine proteomics, mRNA and histology data. Currently, an important obstacle to applying such a strategy is the limited amount of data with matched multi-omics and histology measurements. As part of such efforts, it will be useful to continue investigating the informational complementarities and redundancies among such datasets [34]. Moreover, this may lead to new approaches to predicting protein or gene expression from histology data.
It is also important to recall that we analyzed raw pixel intensity data from thumbnails of whole slides instead of raw full-resolution images. This choice may hamper classification performance, particularly in sub-typing or prognosis tasks. Here we showed that, in the CCRCC domain and in the diagnostic context (tumor vs. normal sample classification), thumbnails of whole slides contain sufficient information to achieve a meaningful and accurate classification. Future research will require analyses based on full-resolution whole slides.
To conclude, our study presented a systematic investigation of the association of histopathology and proteomics data in a diagnostic setting. The resulting models and insights are relevant for understanding the predictive interplay between these datasets, as well as their informational complementarities at the molecular level. Furthermore, the proposed integrative analysis approach is applicable to other investigations with different tumors or omic data types.
Supplementary Materials: The following are available online at http://www.mdpi.com/2077-0383/8/10/1535/s1, Figure S1: Exploration of the (unsupervised) sample discrimination potential of the CPTAC-CCRCC proteomics dataset, Figure S2: Examples of normal and tumor samples correctly classified (test dataset) by the DL model, Figure S3: Examples of normal and tumor samples correctly classified (test dataset) by the DL model and a selection of their corresponding feature maps extracted from the 13th convolutional layer, Figure S4: KEGG pathway enrichments of the proteins that are either highly positively or anti-correlated with histology-based predictions, Figure S5: Analysis of correlations between RNAs (encoding the diagnostic set of proteins) and the histology-based model predictions, Figure S6: Reactome pathways enriched in the full set of 9884 genes available in the CCRCC (RNASeq) dataset, Table S1: Full list of significant GO enrichments (connected to Figure 2C), Table S2: Full list of significant GSEA enrichments (connected to Figure 3A), Table S3: Full list of significant shared GSEA enrichments (connected to Figure 5D).