A Random Walk Based Cluster Ensemble Approach for Data Integration and Cancer Subtyping
Genes 2019, 10, 66

Availability of diverse types of high-throughput data increases the opportunities for researchers to develop computational methods to provide a more comprehensive view for the mechanism and therapy of cancer. One fundamental goal for oncology is to divide patients into subtypes with clinical and biological significance. Cluster ensemble fits this task exactly. It can improve the performance and robustness of clustering results by combining multiple basic clustering results. However, many existing cluster ensemble methods use a co-association matrix to summarize the co-occurrence statistics of the instance-cluster, where the relationship in the integration is only encapsulated at a rough level. Moreover, the relationship among clusters is completely ignored. Finding these missing associations could greatly expand the ability of cluster ensemble methods for cancer subtyping. In this paper, we propose the RWCE (Random Walk based Cluster Ensemble) to consider similarity among clusters. We first obtained a refined similarity between clusters by using random walk and a scaled exponential similarity kernel. Then, after being modeled as a bipartite graph, a more informative instance-cluster association matrix filled with the aforementioned cluster similarity was fed into a spectral clustering algorithm to get the final clustering result. We applied our method on six cancer types from The Cancer Genome Atlas (TCGA) and breast cancer from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Experimental results show that our method is competitive against existing methods. Further case study demonstrates that our method has the potential to find subtypes with clinical and biological significance.


Introduction
With the efforts of large-scale projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) [1][2][3], a wealth of genome-scale molecular data is available and easy to access. The multiple types of omics data from genomes, transcriptomes, proteomes, and epigenomes give researchers the opportunity to build a more comprehensive view in cancer informatics, for example in drug target prediction [4,5] and driver gene identification [6][7][8].
One essential topic in oncology is cancer subtyping, whereby tumors are divided into clinically and biologically relevant subtypes, which could offer insight into tumor progression and support personalized treatment. However, applying traditional clustering algorithms to a single data type, such as gene expression data, does not give satisfactory results; for example, the derived subtypes often lack a clear clinical phenotype [9]. These unsatisfactory results indicate the limitations of expression-based analysis for cancer subtyping. Since different types of molecular data contain information on various aspects that may complement each other, it is beneficial to leverage different types of omics data simultaneously [10]. Several integrative frameworks have been proposed and have been successful [9][10][11][12][13][14].
A promising method for cancer subtyping is cluster ensemble [11,15,16]. It merges individual clusterings (clustering results obtained from running diverse clustering algorithms, running on different types of omics data, etc.) into one robust consensus. More importantly, cluster ensemble can naturally be applied to multiple data types as an integrative method. However, traditional cluster ensemble mostly merges different clusterings using a co-association matrix, which measures how frequently two instances are clustered together [17]. In this coarse approach, some important information, such as the relations among clusters, may be lost after merging the base clusterings. Link-based cluster ensemble (LCE) [15] tries to solve this problem by considering the relationship among clusters in terms of triplets. A triplet is a subgraph containing three vertices and two non-zero edges. The similarity between two clusters is measured based on the count of all triplets between them. However, this only captures local structure, since a triplet measures similarity within a local range.
In this paper, we propose a new method named Random Walk based Cluster Ensemble (RWCE) to deal with these problems (Figure 1). We first obtained a refined cluster-cluster similarity by running a random walk on a network of clusters constructed with Jaccard similarity and then applying a scaled exponential similarity kernel, which provided a more global view of the whole cluster network. We then generated a more informative instance-cluster association matrix by filling in the refined cluster-cluster similarity. A bipartite graph was modeled on the resulting matrix, and spectral clustering [18] was used on it to obtain the final partition. Experiments on six cancer type data sets from the TCGA and on the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) breast cancer data set [19] showed that RWCE was competitive with other methods. A further case study demonstrated that our method also had the power to find clinically and biologically relevant subtypes. The source code of RWCE can be found in Supplementary File 1.
Figure 1. (A) A traditional clustering algorithm (here we used K-means) was applied to each molecular data type to obtain M basic clusterings. For each basic clustering, the cluster number was randomly chosen from 2 to √n; (B) each data type's M clusterings were fused into one consensus clustering by RWCE refinement; (C) all data types' consensus clusterings were fused into one final clustering using RWCE refinement again.

Datasets
In order to show the effectiveness of our method, we used six TCGA cancer types: kidney renal clear cell carcinoma (KIRC), glioblastoma multiforme (GBM), lung squamous cell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), and colon adenocarcinoma (COAD). The data were processed by PINS (perturbation clustering for data integration and disease subtyping) [20], which is an integrative clustering framework for cancer subtyping. Each cancer type had three molecular omics data types, namely mRNA expression, miRNA expression, and DNA methylation. In addition, the METABRIC breast cancer data set [19] was used for survival analysis. The METABRIC data set included a discovery cohort (997 patients) and a validation cohort (995 patients). Each of them had two molecular data types: mRNA expression and copy number variation data, which were downloaded from the European Genome-Phenome Archive [21] (https://ega-archive.org/).

Competitive Methods
To show the effectiveness of our method, we compared it with a traditional cluster ensemble method called consensus clustering (CC) [17] as the baseline, and three state-of-the-art methods called link-based cluster ensemble (LCE) [15], perturbation clustering for data integration and disease subtyping (PINS) [20], and entropy-based consensus clustering (ECC) [11].

Evaluation Metrics
We used the Cox log-rank p-value [22] to measure the significance of the difference in survival distributions between subtypes. Normally, a p-value < 0.05 indicates statistical significance, and a lower p-value indicates a more significant difference. We also used the silhouette value to measure consistency within subtypes. The mean silhouette value measures how tightly grouped the data in each cluster are. Higher silhouette values indicate a well-separated clustering structure.
For survival analysis, we also used the concordance index (CI) [23], which measures the consistency between the estimated risk and the real survival time. A higher CI value indicates better performance in survival analysis.
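As a rough illustration of two of these metrics, the sketch below computes a mean silhouette value with Euclidean distance and a Harrell-style concordance index. The function names, the Euclidean distance, the zero score for singleton clusters, and the half credit for tied risks are assumptions of this sketch, not details given in the text; in practice the risk scores would come from a survival model fitted on the subtype labels.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (Euclidean distance assumed)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose risk order matches survival order.

    A pair (i, j) is comparable when patient i had an observed event
    (events[i] is truthy) and a shorter survival time than patient j.
    Tied risk scores count as half concordant (assumed convention).
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A higher value is better for both functions: the silhouette lies in [-1, 1], and a concordance index of 0.5 corresponds to a random risk ordering.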

Methodology Overview of RWCE
Here, we summarize RWCE for cancer subtyping. Suppose we have three data types to use for clustering. There are three steps in the RWCE pipeline.
Step 1: For each data type, M basic clusterings are generated using K-means with the number of clusters randomly chosen from 2 to √n, where n is the number of instances (Figure 1A). Note that any clustering method can be used in this step; in this paper, we fixed it to K-means.
Step 2: These M basic clusterings are combined into a consensus clustering by RWCE refinement (Figure 1B), which we introduce in detail later.
Step 3: Each data type goes through Step 1 and Step 2, which gives three consensus clusterings: $\pi^*_1$, $\pi^*_2$, $\pi^*_3$. Finally, we use RWCE refinement again to combine the consensus clusterings of all data types into the final clustering result $\pi^*$ (Figure 1C).
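The three steps above can be sketched as a small driver function. This is only a skeleton under stated assumptions: `rwce_pipeline`, `base_cluster`, and `refine` are hypothetical names, the base clusterer and the RWCE refinement are passed in as callables rather than implemented here, and rounding √n down to an integer is a choice made for this sketch.

```python
import math
import random


def rwce_pipeline(data_types, n_instances, base_cluster, refine,
                  n_clusterings=10, seed=0):
    """Skeleton of the three-step pipeline (all names are hypothetical).

    data_types    : list of data sets, one per omics type
    base_cluster  : callable(data, k, rng) -> list of n cluster labels
    refine        : callable(list of label lists) -> one consensus label list
    n_clusterings : M, the number of basic clusterings per data type
    """
    rng = random.Random(seed)
    # Upper bound for the cluster number; floor(sqrt(n)) is an assumption.
    upper = max(2, math.isqrt(n_instances))

    consensus_per_type = []
    for data in data_types:
        # Step 1: M basic clusterings, each with k drawn from [2, sqrt(n)].
        ensemble = [
            base_cluster(data, rng.randint(2, upper), rng)
            for _ in range(n_clusterings)
        ]
        # Step 2: fuse the M clusterings of this data type.
        consensus_per_type.append(refine(ensemble))

    # Step 3: fuse the per-type consensus clusterings into the final result.
    return refine(consensus_per_type)
```

Passing the clusterer and the refinement as arguments keeps the sketch independent of any particular K-means or spectral clustering implementation.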

Cluster Ensemble
Let $X$ denote an omics data set, such as gene expression data, with $n$ instances (conditions, experiments, patients, and so on) and $m$ features (genes, biomarkers, and so on). A cluster ensemble is a set of $M$ basic clustering solutions generated by different clustering algorithms or by a single clustering algorithm with different parameters, represented as $\Pi = \{\pi^{(1)}, \pi^{(2)}, \ldots, \pi^{(M)}\}$.
Each clustering $\pi^{(m)}$ partitions $X$ into $K_m$ crisp clusters, represented as $\pi^{(m)} = \{C_1^{(m)}, C_2^{(m)}, \ldots, C_{K_m}^{(m)}\}$. The cluster ensemble method then takes these clusterings $\Pi$ as input and combines them to produce the consensus clustering $\pi^*$ as the output. There are diverse ways of combining them [24,25]. One can derive a binary instance-cluster (IC) matrix in which '1' indicates that an instance belongs to a cluster and '0' indicates that it does not. A clustering algorithm or graph segmentation algorithm can then be applied to this matrix to obtain the consensus clustering solution [17].
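To make the IC matrix concrete, here is a minimal sketch that builds it from a list of label vectors. The function name `build_ic_matrix` and the column ordering (clustering by clustering, labels sorted within each clustering) are assumptions of this example.

```python
def build_ic_matrix(clusterings):
    """Binary instance-cluster matrix for an ensemble of clusterings.

    clusterings : list of M label lists, each of length n
    Returns (ic, columns), where ic is an n x K list of 0/1 rows and
    columns[j] = (clustering index, cluster label) identifies column j.
    """
    n = len(clusterings[0])
    columns = []
    for m, labels in enumerate(clusterings):
        if len(labels) != n:
            raise ValueError("all clusterings must label the same instances")
        for lab in sorted(set(labels)):
            columns.append((m, lab))
    ic = [
        [1 if clusterings[m][i] == lab else 0 for (m, lab) in columns]
        for i in range(n)
    ]
    return ic, columns


# Two clusterings of four instances.
ic, columns = build_ic_matrix([[0, 0, 1, 1], [0, 1, 1, 2]])
```

Each row contains exactly one '1' per clustering, which is the sparsity that the refinement in the next section addresses.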

Generating a Refined Instance-Cluster Association (RIC) Matrix
A problem remains when leveraging an IC matrix or another similarity matrix: they only summarize information at a coarse level. For example, in an IC matrix, each instance has exactly one '1' per clustering, and all other entries are '0'. This leads to sparsity and does not favor similarity-based clustering algorithms. Accordingly, LCE [15] generates an improved variation of the original IC matrix, the refined cluster-association matrix (RM), by replacing the zero entries of the IC matrix with the cluster-cluster similarity discovered by a link-based similarity algorithm. The results show that the refinement is helpful and works better than using the original matrix. However, because that algorithm measures the similarity among clusters through triplets, it is limited to a local view. In response, we proposed RWCE, which takes a more global view in discovering cluster-cluster similarity. We put forward a refined instance-cluster association (RIC) matrix as a more informative variation of the original IC matrix. It is designed to replace the hidden associations (the '0' entries) of the IC matrix with the refined cluster similarity. For each clustering $\pi^{(m)}$, the association between an instance $x_i$ and a cluster $C_j^{(m)}$ is measured as follows (written for $dc = 1$, the value used throughout):

$$RIC\left(x_i, C_j^{(m)}\right) = \begin{cases} 1, & \text{if } C_j^{(m)} = C_*^{(m)}(x_i) \\ \mathrm{sim}\left(C_j^{(m)}, C_*^{(m)}(x_i)\right), & \text{otherwise} \end{cases}$$

where $C_*^{(m)}(x_i)$ is the cluster label to which instance $x_i$ belongs in clustering $\pi^{(m)}$. Moreover, $\mathrm{sim}(C_x, C_y) \in [0, 1]$ measures the similarity between any two clusters $C_x$ and $C_y$, which can be calculated using the random-walk based similarity algorithm described in Section 2.6.2, and $dc$ is a hyperparameter that we empirically set to 1 (performance is robust to $dc$, so we fixed it to 1 for the sake of explanation). In this way, we fill in the zero entries of the IC matrix with the normalized similarity between the clusters, obtained with the following random-walk based similarity algorithm.
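The filling rule can be sketched as follows, given a precomputed cluster-cluster similarity matrix. The name `build_ric_matrix` and the `columns` layout are hypothetical, and applying `dc` as a multiplicative factor on the similarity is an assumption of this sketch (with `dc = 1`, as used in the paper, it has no effect).

```python
def build_ric_matrix(clusterings, columns, sim, dc=1.0):
    """Refined instance-cluster matrix from a cluster similarity matrix.

    clusterings : list of M label lists, each of length n
    columns     : columns[j] = (clustering index, cluster label) for column j
    sim         : K x K similarity between the clusters listed in `columns`
    dc          : ASSUMPTION: used here as a multiplicative scaling factor
    """
    col_index = {col: j for j, col in enumerate(columns)}
    n = len(clusterings[0])
    ric = []
    for i in range(n):
        row = []
        for j, (m, _label) in enumerate(columns):
            # Column of the cluster that instance i belongs to in clustering m.
            own = col_index[(m, clusterings[m][i])]
            # Keep the '1' entries; replace '0' entries with the similarity
            # between cluster j and the instance's own cluster.
            row.append(1.0 if own == j else dc * sim[j][own])
        ric.append(row)
    return ric


# One clustering with two clusters whose similarity is 0.3.
ric = build_ric_matrix([[0, 0, 1]], [(0, 0), (0, 1)], [[1.0, 0.3], [0.3, 1.0]])
```

In this small example the two instances of the first cluster get the row `[1.0, 0.3]` and the remaining instance gets `[0.3, 1.0]`.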

Random-Walk Based Similarity Algorithm
We first constructed an original cluster-cluster similarity network by using the Jaccard index as follows:

$$J_{xy} = \frac{|L_x \cap L_y|}{|L_x \cup L_y|}$$

where $J_{xy}$ is the weight of the edge between clusters $C_x$ and $C_y$ in this similarity network, and $L_x$ and $L_y$ denote the sets of samples in clusters $C_x$ and $C_y$, respectively. On this initial network, we applied random walk with restart:

$$F_{t+1} = \alpha F_t A + (1 - \alpha) F_0$$

where $A$ is the adjacency matrix of the above-mentioned similarity network and $F_0$ is the IC matrix. $(1 - \alpha)$ is the restart probability with which the random walker teleports to the initial node. The random walk process runs iteratively until $F_{t+1}$ converges ($\|F_{t+1} - F_t\| < 1 \times 10^{-6}$). The resulting $F_{t+1}$ is a real-valued rather than binary instance-by-cluster association matrix, on which we can measure the refined similarity between clusters using the scaled exponential similarity kernel:

$$\mathrm{sim}(C_i, C_j) = \exp\left(-\frac{\rho^2(z_i, z_j)}{\sigma}\right)$$

where $z_i$ and $z_j$ are the $i$-th and $j$-th columns of $F_{t+1}$, representing clusters $C_i$ and $C_j$, respectively, $\rho^2(z_i, z_j)$ denotes the squared Euclidean distance between clusters $C_i$ and $C_j$, and $\sigma$ is a parameter that we set to 1.
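The three stages of this section (Jaccard network, random walk with restart, similarity kernel) can be sketched in plain Python. Several choices here are assumptions rather than details from the text: the function names, the default `alpha = 0.5`, the row normalisation of the adjacency matrix (added so that the iteration is guaranteed to converge), the Frobenius norm in the stopping rule, and the kernel form exp(-ρ²/σ).

```python
import math


def jaccard_network(ic):
    """K x K Jaccard similarity between the clusters (columns) of an IC matrix."""
    n, k = len(ic), len(ic[0])
    members = [{i for i in range(n) if ic[i][c]} for c in range(k)]
    adj = [[0.0] * k for _ in range(k)]
    for x in range(k):
        for y in range(k):
            union = members[x] | members[y]
            if union:
                adj[x][y] = len(members[x] & members[y]) / len(union)
    return adj


def random_walk_with_restart(ic, adj, alpha=0.5, tol=1e-6, max_iter=1000):
    """Iterate F <- alpha * F * P + (1 - alpha) * F0 until the change is below tol.

    ASSUMPTION: P is the row-normalised adjacency matrix; the text does not
    state whether A is normalised, and alpha = 0.5 is only a placeholder.
    """
    k = len(adj)
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([v / total if total > 0 else 0.0 for v in row])
    f0 = [[float(v) for v in row] for row in ic]
    f = f0
    for _ in range(max_iter):
        nxt = [
            [alpha * sum(row[x] * trans[x][c] for x in range(k))
             + (1 - alpha) * f0[i][c] for c in range(k)]
            for i, row in enumerate(f)
        ]
        # Frobenius norm of the update (assumed norm for the stopping rule).
        diff = math.sqrt(sum((nxt[i][c] - f[i][c]) ** 2
                             for i in range(len(f)) for c in range(k)))
        f = nxt
        if diff < tol:
            break
    return f


def scaled_exponential_similarity(f, sigma=1.0):
    """sim[a][b] = exp(-squared distance between columns a and b / sigma)."""
    k = len(f[0])
    sim = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            dist_sq = sum((row[a] - row[b]) ** 2 for row in f)
            sim[a][b] = math.exp(-dist_sq / sigma)
    return sim
```

Clusters that share many instances end up with similar columns after the walk and therefore with a similarity close to 1, while clusters that never overlap, directly or through intermediate clusters, stay close to 0.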

Applying Spectral Clustering to RIC
As a result, we obtained a refined and informative instance-cluster (RIC) matrix, where $RIC(i, j) \in [0, 1]$ is the degree to which instance $i$ belongs to cluster $j$. We then modeled a bipartite graph $G = (V, W)$ based on RIC, where $V = V_C \cup V_I$. $V_C$ is the set of vertices that each correspond to a cluster from ensemble $\Pi$; $V_I$ is the set of vertices that each correspond to an instance from data set $X$. $W$ denotes a set of weighted edges that can be defined as follows:

$$W(v_i, v_j) = \begin{cases} RIC(v_i, v_j), & \text{if } v_i \in V_I \text{ and } v_j \in V_C \\ RIC(v_j, v_i), & \text{if } v_i \in V_C \text{ and } v_j \in V_I \\ 0, & \text{otherwise} \end{cases}$$

Note that $W$ can equivalently be written as

$$W = \begin{bmatrix} 0 & RIC^{T} \\ RIC & 0 \end{bmatrix}$$

Given such a graph, spectral graph partitioning (SPEC) [18] was then used to generate the final partition of $X$, denoted as $\pi^*$. SPEC with normalized cut is briefly described as follows. Given graph $G = (V, W)$, it first calculates the degree matrix $D$ with the degree of each node on the diagonal. It then computes the Laplacian matrix $L = D - W$. Next, the normalized Laplacian matrix $D^{-1/2} L D^{-1/2}$ is computed, and its $K$ smallest eigenvalues $\lambda_1, \ldots, \lambda_K$ and their corresponding eigenvectors $u_1, u_2, \ldots, u_K$ are obtained. Then, a matrix $U = [u_1, u_2, \ldots, u_K]$ is formed and row normalized. Finally, SPEC generates the final clustering result using K-means on $U$. More details can be found in [18]. We selected the number of clusters $K = \arg\max_{i>1} \mathrm{eigengap}(i)$, where $\mathrm{eigengap}(i) = \lambda_{i+1} - \lambda_i$. To sum up, we call the process of operating on ensemble $\Pi$ and obtaining the final clustering $\pi^*$ RWCE refinement.
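The spectral steps described above can be sketched with NumPy. The function names are hypothetical, the block ordering (cluster vertices first, then instance vertices) follows the block form of $W$, and the final K-means step on the rows of $U$ is left out to keep the example short.

```python
import numpy as np


def bipartite_weights(ric):
    """Block matrix W = [[0, RIC^T], [RIC, 0]] (cluster vertices come first)."""
    ric = np.asarray(ric, dtype=float)
    n, k = ric.shape
    w = np.zeros((k + n, k + n))
    w[:k, k:] = ric.T
    w[k:, :k] = ric
    return w


def normalized_laplacian_spectrum(w):
    """Eigenvalues (ascending) and eigenvectors of D^{-1/2} (D - W) D^{-1/2}."""
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)  # assumes every vertex has positive degree
    lap = np.diag(deg) - w
    l_sym = d_inv_sqrt[:, None] * lap * d_inv_sqrt[None, :]
    return np.linalg.eigh(l_sym)


def choose_k_by_eigengap(vals):
    """K = argmax over i > 1 of (lambda_{i+1} - lambda_i), with 1-based i."""
    gaps = np.diff(vals)  # gaps[i - 1] holds eigengap(i)
    # Skip eigengap(1); position p in gaps[1:] corresponds to i = p + 2.
    return int(np.argmax(gaps[1:])) + 2


def spectral_embedding(vecs, k):
    """Row-normalised matrix U built from the first k eigenvectors."""
    u = vecs[:, :k]
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    return u / np.where(norms == 0, 1.0, norms)


# Six instances, two clusters, weak cross-cluster similarity.
ric = [[1.0, 0.01]] * 3 + [[0.01, 1.0]] * 3
w = bipartite_weights(ric)
vals, vecs = normalized_laplacian_spectrum(w)
embedding = spectral_embedding(vecs, 2)
instance_rows = embedding[2:]  # rows of U for the instance vertices
```

Because the cluster vertices occupy the first rows of $W$, only the rows of the embedding that belong to instance vertices (here `embedding[2:]`) would be passed to K-means to obtain the partition of the data.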

Integrating Multiple Types of Omics Data for Subtyping
Suppose we had $T$ types of omics data to integrate. For each type of omics data $X_t$, $t = 1, \ldots, T$, we obtained the corresponding clustering result $\pi^*_t$. Then, we treated these clustering results as a new ensemble $\Pi^* = \{\pi^*_1, \pi^*_2, \ldots, \pi^*_T\}$. Finally, we applied RWCE refinement again to $\Pi^*$ to get the final clustering result $\pi^*$ across all $T$ data types.

Evaluation on TCGA Cancer Data Sets
For each cancer type, we counted the number of significant survival analysis results based on the three single molecular data types and on the integration of the three data types.
According to Figure 2, our method outperformed the other methods on both the single data types and their integration. By integrating the three molecular data types, our method attained significant subtypes (p-value < 0.05) for all six cancer types (Table 1). This indicates the potential of leveraging multiple data types simultaneously for identifying meaningful subtypes and the power of RWCE as an integrative method.
Table 1 shows the Cox log-rank p-value of RWCE on the three molecular data types and their integration across the six cancer types from TCGA. It indicates that RWCE is a good integrative method for combining multiple omics data for cancer subtype discovery.
In terms of silhouette value (Figure 3), our method still outperformed other methods, indicating good clustering performance at the data level.

A Case Study: Glioblastoma Multiforme
Our method found three GBM subtypes. Their survival curves are shown in Figure 4. According to Figure 4, subtype 1 had a poor prognosis while subtype 3 had a favorable prognosis. Moreover, Figure 5 shows that patients from subtype 1 had a favorable response to temozolomide (TMZ), a drug commonly used to treat GBM, and that subtype 3 consisted of slightly younger patients.


Evaluation on METABRIC Data Set
We also tested the performance of survival analysis on the METABRIC breast cancer data set. As seen in Table 2, our method outperformed other clustering methods and was comparable with the PAM50 analysis (a standard breast cancer signature). This indicates the potential of our method for finding subtypes with differential survival profiles.
Table 2. Cox p-value and concordance index (CI) of subtypes discovered by PAM50, perturbation clustering for data integration and disease subtyping (PINS), consensus clustering (CC), entropy-based consensus clustering (ECC), link-based cluster ensemble (LCE), and our method on METABRIC data. For each discovery and validation cohort, we calculated the p-value and CI with respect to disease free survival (DFS) and overall survival of the patients. For each row, the best p-value (most significant) and the best CI (highest) are in red. The number of clusters in the discovery and validation cohorts is shown after the name of each clustering method.

Discussion and Conclusions
In this paper, a new cluster ensemble method named RWCE was introduced for clustering and integrating multiple omics data to discover meaningful cancer subtypes. RWCE uses a novel RIC matrix that considers relationships among clusters, which contributes to superior clustering performance in terms of silhouette value and Cox log-rank p-value.
Moreover, RWCE can also be used as an integrative method that combines diverse types of omics data to identify subtypes with differential survival profiles. A further case study on the GBM subtypes that RWCE generated showed that RWCE could find subtypes with differential drug responses and age distributions.
Taken together, RWCE provides a new way of combining basic clusterings in the cluster ensemble method and of integrating multiple data types. We hope RWCE can generalize well to identify meaningful subtypes in more cancer types for the improvement of diagnostic and therapeutic intervention, and this is what we will investigate in future work.