Uveal melanoma (UM) accounts for approximately 5% of all melanomas [1
]. The incidence of UM in Europe shows a characteristic increase from south to north, from <2 per million in Spain and southern Italy to >8 per million in Norway and Denmark [2
]. Therapy, enucleation, endoresection or radiotherapy, almost completely controls primary tumors but 25% and 34% of UM patients develop metastases within 5 and 10 years, respectively. Median survival after diagnosis of metastatic UM is approximately one year [3
]. The long-term cumulative melanoma-related mortality rate is over 50% for medium and large tumors at 25 years after primary treatment [4
]. UM is clearly distinct from cutaneous melanoma by different driver mutations, different chromosomal copy number alterations, a much lower mutational burden, and thus has to be treated as a different disease despite the common developmental origin of uveal and cutaneous melanocytes [5
UM shows a mean of 17 mutations in coding regions of protein coding genes [7
], approximately 80% of cases show a mutation in the alpha subunit of one of the two G-protein alpha subunits, GNAQ and GNA11 [8
]. Mutations in the tumor suppressor gene BAP1 [9
] are associated with elevated metastatic risk and mutations in SF3B1 [10
], a gene encoding a component of the splicing machinery, confer intermediate risk with retarded development of metastases. The relatively frequent mutations of the translational elongation factor gene EIF1AX [11
] do apparently not influence metastatic risk. Monosomy of chromosome 3 and amplification of chr8q are associated with metastatic risk, while amplification of chr6p is associated with a reduced risk in cases with concomitant monosomy of chr3. For recent reviews see refs. [5
Genomics has greatly contributed to the identification of prognostic classes that are distinguished by different gene expression profiles [13
], chromosomal copy number alterations [16
] and somatic mutations [8
]. As a result, cytogenetics, transcriptional profiling and/or BAP1 sequencing or immunohistochemistry have been integrated in routine prognostic assessment by the pathologist. The initially two prognostic classes, mainly distinguished by chromosome 3 status, have been extended to three classes since cases with disomy of chromosome 3 and SF3B1 mutations tend to develop metastases with time [8
]. Tumors of this subclass show a distinct gene expression profile [18
Currently, the only genomic multiplatform dataset that has been reported for UM is the TCGA-UVM dataset. Robertson and coworkers recently performed multi-omics on eighty UM cases within the TCGA-Project [7
]. This analysis subdivided the two cytogenetic classes (di- versus monosomy of chr3) into two subclasses each, characterized by SF3B1/SRSF2 and EIF1AX mutations in association with distinct transcriptional and methylation profiles in disomic cases, and distinct gene expression and copy number alterations in monosomic cases. Data analysis has thereby been performed separately for each platform followed by a graphical presentation that reassumes the classification obtained by the analysis of data derived from the single platforms.
For the purpose of classification it appears, however, desirable to take the information from the different platforms into account simulatenously. Analytical integration of data of different platforms can be obtained by data fusion techniques that are being developed for the analysis of complex genomic data. Various approaches have been proposed [19
] but have not yet found wide application for cancer classification despite the availability of many multiplatform datasets. A recent exhaustive overview of the application of integrative analysis on multi-level data in the field of cancers, with a focus on variable selection, can be found in [21
Here, we applied and adapted data fusion approaches to prognostic classification of UM. We first perform a simultaneous principal component analysis (PCA), dubbed here joint Singular Value Decomposition (jSVD) but known in chemometrics as Simultaneous Component Analysis (SCA) [22
] or a form of coupled matrix factorization [23
]. In some works on multi-omics cancer data integration, this approach is known as the k-table method [24
]. We then generalize this factorization by allowing different constraints on the factor matrices. This latter method we call joint Constrained Matrix Factorization (jCMF) and it can be computed using state of the art data fusion algorithms [25
]. We compare our algorithms and their findings with a joint graph clustering, the Similarity Network Fusion (SNF) that has recently gained momentum in the field [26
The aim of this article is the application of multiplatform data fusion techniques to the particular uveal melanoma dataset in order to verify previous classification but by using such a black box approach. We show that, in particular, the four classes described by Robertson et al. [7
] can be distinguished by our methods although the main discriminator is chromosome 3 status that is associated with distinct transcriptional and methylation profiles.
The TCGA-UVM dataset consists of several datasets: gene expression (mRNA and ncRNA), DNA-methylation, DNA copy number alterations (CNA) and somatic mutation data. This data has been collected for 80 patients and it has varying size. The task is to cluster these patients into different groups that predict the malignancy of the tumor. This has been done most recently in [7
], where each dataset has been considered independently and then an integrative analysis has identified four clusters with increasingly worse prognosis. Ultimately, the clustering relies mostly on the CNA data.
The goal of this work is to use data fusion techniques to analyse and cluster the datasets simultaneously. This is done both to test the proposed methods on cancer data and to evaluate the validity of the former integrative analysis. In the scope of this article, we constrain ourselves to the three datasets of mRNA expression data, DNA-methylation data, and CNA data, which we will store in three matrices
respectively. The first two matrices, corresponding to expression and methylation data, are nonnegative, whereas the copy number matrix has no restrictions on the entries. See Section 4.2.1
for the preprocessing of this data, in particular for the CNA matrix. Note that we are so far unable to include the data on somatic mutations in our data fusion approach as it consists of binary data that is not only hard to handle by itself but even more difficult to “fuse” with other, less restrictive datasets. Co-mutation plots are shown in Figure 1
, Figure 2
and Figure 3
but they were not used in data fusion analyses.
We propose two different algorithms that to our knowledge have so far not found application in cancer research. In the most general setting, we aim at factorizing M
The datasets all share the same row dimension, in our case the number of patients. Therefore, the factor matrix U is chosen to be the same for all . This means that we are trying to find a common row space that is spanned by the columns of U. If the dimension of this row space is small, we obtain the most important feature vectors that constitute similarities of the datasets as columns of the matrices .
When the datasets
consist of nonnegative information, joint Nonnegative Matrix Factorization (jNNMF)
], a method that has been successfully applied in several studies [28
], could be used to find the decompositions (1
). In jNNMF, the factor matrices U
are chosen to also contain only nonnegative entries. This is done to ensure that the feature vectors in the
are meaningful. If, however, one of the datasets may contain negative entries, as in our case the CNA matrix, jNNMF is not reasonably applicable, and we need to find alternative ways of joint factorization.
A first very straightforward approach to overcome this problem is to find a different constraint on the factor matrices that is more readily generalizable to different data types. Perhaps even more important in data analysis than NNMF is the Principal Component Analysis (PCA)
that relies on the computation of a Singular Value Decomposition (SVD)
and yields orthogonal
feature vectors. This has the advantage that the features are complementary to each other, i.e., information found in one feature vector is strictly not found in any of the other feature vectors. Doing this simultaneously via data fusion results in a joint Singular Value Decomposition (jSVD). This is the first proposed method and described briefly in Section 4.2.2
, and in more detail in [31
On the other hand, for taking nonnegativity of some of the matrices
back into account, we also propose a unified method that we call joint Constrained Matrix Factorization (jCMF)
. Here, we decompose all matrices as in (1
) but with different constraints on the feature matrices
, see Section 4.2.3
To judge the justifiability of clusterings obtained from these two methods, we also test the state of the art Similarity Network Fusion (SNF)
approach, a method that has recently gained momentum in the joint analysis of biological data, see Section 4.2.4
for more details. Here, the weighted graphs are fused together using an iterative procedure described in [26
]. This ultimately yields a single graph that contains the information of all datasets
and that can be clustered with standard techniques. While this method is rather heuristic, it has performed well in the past and it is widely used.
2.1. Results of jSVD
We first present the results obtained by the jSVD method. Figure 1
a shows the copy number variations for the 80 patients ordered according to k-means clustering. We decided to set the number of clusters to four to allow for a direct comparison with the analysis of copy number alteration data that yielded four clusters [7
Three of the four groups are easily visible: the major discriminant is chromosome 3 status, disomic (groups 1 and 2) or monosomic (groups 3 and 4). Most monosomic cases also show amplification of chr8q and a subgroup (group 4) shows deletion of chr8p. The same subgroup is distinguished by deletions of chr6q. Most disomic cases show amplification of chr6p and many of these show additional alterations, deletions or amplifications, on chr6q. The distinction of two groups within the disomic cases is less evident. Group 1 shows no deletions on chr1 that are more frequent in all the other groups. We also extracted features from somatic mutation, transcriptome and methylome data (Figure 1
b) where the distinction of chr3 disomic and monosomic cases is very clear. Group 4 is distinguished by stronger methylation and gene expression of the same genes that also characterize group 3. The distinction between groups 1 and 2 is hardly visible. As expected, somatic mutations are also associated with these groups. BAP1 mutations are typical for group 3 and 4 cases, a single BAP1 mutation occurs in metastatic case of group 2. EIF1AX mutations are almost exclusively observed in group 1 whereas SF3B1 mutations show a similar frequency in groups 1 and 2. Clustering on CNA data alone, as reported by Robertson et al., yields more clearly distinguished subgroups 1 and 2, within the disomic cases, where group 1 is characterized by the absence of CNA on chr8. SVD fusion of multiplatform data by the jSVD method does not yield this distinction.
We also report on the Kaplan Meyer survival curves of the four groups in Figure 1
c. As expected, groups 1 and 2 show better survival than groups 3 and 4. Survival is not significantly different between groups 1 and 2. Groups 3 shows slightly better survival than group 4.
2.2. A Unified Approach: jCMF
The clusterings obtained by joint SVD are satisfactory but a clear division between groups 1 and 2 is not visible. Indeed, the clustering obtained by jCMF shows a clearer division of groups 1 and 2 in the copy number variation plot, see Figure 2
a, and has slightly preferable Kaplan Meyer curves, see Figure 2
c. Groups 1 and 2 are distinguished by CNA on chr8.
Gene expression and methylation features again show a clear distinction between groups 1 and 2 as compared to groups 3 and 4 and between group 3 and 4 but much less so between group 1 and 2. However, jCMF groups together cases with SF3B1 mutations in group 2 (see comutation plot, Figure 2
2.3. A Comparison with SNF
This method yields almost the same clusters as discovered in the study by Robertson et al. [7
]. We have included the copy number variation plot (Figure 3
a), somatic mutations, gene expression features, methylation features (Figure 3
b) and the Kaplan Meyer curves (Figure 3
c). Again, the distinction of subgroups 1 and 2 is weak although SF3B1 mutations almost exclusively occur in group 2.
2.4. Prognostic Misclassification
All methods, whether based on a single data type or on data fusion, yield similar classifications (Figure 1
, Figure 2
and Figure 3
) yet they all misclassify at least one sample. Classification of samples from patients who did not develop metastases as “high risk” (groups 3 and 4) not necessarily indicates misclassification since a high risk tumor may respond to treatment and therefore not develop metastases, it might develop metastases beyond the follow-up period or the patient may have died before developing a metastasis. For tumors classified as “intermediate risk” (group 2) misclassification cannot be assessed since intermediate means they may or may not develop metastases in the time frame observed. On the contrary, tumors that developed metastases but are classified as “low risk” (group 1) are truly misclassified. The contingency table (Table 1
) reports cases that are misclassified as “low risk” for any of the measures applied. Chr3 status, which does not allow for “intermediate risk” misclassifies 5 cases. Chi-square statistics that omit intermediate risk show the highest odds ratio for, in order, chr3 status, DNA-methylation alone, CNA and SNF, yet in the cases of CNA and SNF, these values come at the cost of many cases classified as “intermediate risk”. Data fusion techniques apparently do not reduce misclassification rates.
Multi-platform molecular analyses have become common for many cancers and The Cancer Genome Atlas project (https://www.cancer.gov/tcga
) has collected these data for thousands of tumor samples yet most analyses consider molecular data types one at a time. Data fusion techniques are being developed, amongst them matrix decomposition techniques like GSVD [32
], coupled matrix factorization [23
], and tensor methods as for example in [33
], but they are not yet commonly applied in cancer research. In comparison, Bayesian and network-based models have been more widely applied also by biologists, see [30
], in particular SNF. An exception is perhaps the jNNMF method, which is the only model based on matrix decompositions that has become somewhat state of the art in bioinformatics.
Molecular data are neither completely independent from each other nor can one data type substitute any other data type. Gene expression relies on copy number and is inversely associated with DNA-methylation but this must not be true in each single case. Therefore, it is hard to predict whether fusion of multi-platform data actually improves molecular characterization and, for example, prognostication. For the purpose to develop innovative data fusion techniques able to maximally exploit molecular data present in the TCGA database we focused on UM data since UM, though rare, is very well characterized and intrinsically simple at the molecular level as well as in terms of prognostication. Two well distinguished classes with di- and monosomy of chromosome 3 are well known since the application of cytogenetics to UM [16
]. The analysis of somatic mutations [8
] and gene expression profiles [35
] further refines cytogenetics for the purpose of prognostication. Hence, any innovation will hardly do any better but it should at least reproduce what simpler methods can distinguish. For other, more complex and less well studied neoplasias, data fusion approaches might improve the distinction of molecular classes with prognostic significance.
A downside of jNNMF is that it is only applicable to a set of nonnegative matrices. In this work, it was our aim to reduce preprocessing as much as possible and since not all datasets are nonnegative, we opted to introduce two methods that can be seen as a generalization of jNNMF for other data types. We show here two data fusion approaches, jSVD and jCMF, that are characterized by a simultaneous low rank decomposition of the data matrices, thereby projecting the data on a lower dimensional shared subspace. Both are a form of coupled matrix factorization and the former is also known as Simultaneous Component Analysis (SCA) in related fields [22
]. JSVD produces orthogonal classification and feature matrices while in jCMF, it is possible to adjust the data type in order to reflect the data type of the different matrices. After the projection, a simple application of the kmeans algorithm yields the different clusters. The two algorithms are compared with state of the art SNF. Just as jNNMF, our methods have some advantages over SNF: they are less heuristic and therefore less prone to overfitting, and they produce meaningful feature vectors, i.e., vectors that contain the same data type as the original dataset and can therefore be interpreted in the same fashion. However, we do not claim superiority over SNF. Similar to SNF, jSVD and jCMF reproduce the known cytogenetic classes. The distinction of subtypes within the disomic cases is less defined and might require additional information such as SF3B1 mutation status. The importance of this subclass might be underestimated using the TCGA dataset since it has limited follow-up and cases with SF3B1 mutations develop metastases several years after diagnosis of primary UM [36
]. The TCGA dataset is by no means suited to develop prognostic classifiers and the prognostic power observed might strongly depend on the specific dataset. Moreover, some level of misclassification into risk classes by analyzing the primary tumor can probably not be overcome since tumor metastasization is an intrinsically probabilistic process and since disseminated tumor cells might acquire additional fitness that is not reflected by the primary tumor they are derived from [37
We obtain different odds ratios for the different single or multi-domain methods when applied to the classification in high and low risk groups. Nevertheless, we cannot decide which of the methods tested performs best, as single misclassifications and a different number of “intermediate risk” samples can alter the odds ratio significantly. To do this, we would need independent datasets for external validation yet the TCGA data are, at present, the only source of multi-domain data. The 80 cases of the TCGA dataset are insufficient to create distinct training and validation sets. We therefore limit our interpretation to the fact that all methods tested reproduce the classification by Robertson et al. [7
], which uses CNA data alone. However, a PCA-based dimensionality reduction of the CNA data alone does not yield the same clustering and more advanced algorithms are necessary.
Using UM as a test case, the present analysis shows that the new data fusion techniques work adequately. Both jSVD and jCMF are performed independently of the dataset (up to the choice of constraints in jCMF) and therefore eliminate the overfitting problem associated with heuristic procedures. Thus, they are suited for multiplatform based classification for those tumors that cannot be exhaustively classified by single platform data. In addition, as opposed to the also considered SNF, the methods yield meaningful feature vectors (more so in the case of jCMF) that can in theory be interpreted in order to gain a deeper understanding of the combination of features that determine the classifications, which is of interest to the biologist. We are working on feature extraction methods to be combined with data fusion.