An Integrated Approach for Identifying Molecular Subtypes in Human Colon Cancer Using Gene Expression Data

Identifying molecular subtypes of colorectal cancer (CRC) may allow for more rational, patient-specific treatment. Various studies have identified molecular subtypes for CRC using gene expression data, but they are inconsistent and further research is necessary. From a methodological point of view, a progressive approach is needed to identify molecular subtypes in human colon cancer using gene expression data. We propose an approach to identify the molecular subtypes of colon cancer that integrates denoising by the Bayesian robust principal component analysis (BRPCA) algorithm, hierarchical clustering by the directed bubble hierarchical tree (DBHT) algorithm, and feature gene selection by an improved differential evolution based feature selection method (DEFSW) algorithm. In this approach, the normal samples being completely and exclusively clustered into one class is considered to be the standard of reasonable clustering subtypes, and the feature selection pays attention to imbalances of samples among subtypes. With this approach, we identified the molecular subtypes of colon cancer on the mRNA gene expression dataset of 153 colon cancer samples and 19 normal control samples of the Cancer Genome Atlas (TCGA) project. The colon cancer was clustered into 7 subtypes with 44 feature genes. Our approach could identify finer subtypes of colon cancer with fewer feature genes than the other two recent studies and exhibits a generic methodology that might be applied to identify the subtypes of other cancers.


Introduction
Identifying the molecular subtypes of colorectal cancer (CRC) may allow for more rational, patient-specific treatment in the future. Various studies have been done to predict molecular subtypes for CRC based on gene expression data. Fearon and Vogelstein utilized four different genomic and epigenomic instabilities to identify four subtypes of CRC: chromosome instability (CIN), microsatellite able to select feature subsets with a predefined cardinality (which is its main functionality), but also can discover the optimal feature subset size. A wrapper classifier is needed in the DEFS W algorithm. In view of the usual imbalance of samples among subtypes, we use the naive Bayes (NB) classifier with empirical prior probabilities and weight accuracy to evaluate classification ability. The empirical prior probabilities estimate the prior probabilities from the relative frequencies of the classes in training, which can lessen the influence of the imbalance of samples. Weight accuracy is a special assessment measurement for classification of imbalance samples.
In this study, we integrated these state-of-the-art techniques of denoising, clustering, and feature selection to identify molecular subtypes in human colon cancer using gene expression data. Our integrated approach incorporates denoising by the BRPCA, hierarchical clustering by the DBHT, and selecting feature genes by DEFS W . We applied this approach to the Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) mRNA gene expression dataset of colon cancer and identified 7 subtypes with 44 feature genes. The results deliver finer subtyping with fewer feature genes than in the other two recent studies.

Dataset
The microarray mRNA gene expression dataset we used to identify the subtypes of colon cancer is from TCGA. It includes 153 cancers samples, which have been used by Muzny et al. [2] and Ren et al. [4] for the same purpose, and 19 control normal samples. We used the level 3 dataset from TCGA, and this was downloaded by the R package "RTCGA" [21].

Method Overview
We first applied the BRPCA to denoise the gene expression data by getting the sparse component but removing the low-rank and noise components, and then we used the DBHT to cluster the sparse components in the BRPCA. The normal samples completely and exclusively clustered into one class were considered as the standard of reasonable clustering. If the standard was not reached, we continually tuned the setting of the hyperparameters of the BRPCA until the clustering was up to the standard. Finally, we used the DEFS W to select the feature genes for the clusters. A summary of our approach is shown in Figure 1. The BRPCA, DBHT and DEFSw algorithms were developed in MATLAB (MathWorks, Natick, MA, USA), and all the related source codes for implementing our approach are available in the Supplementary file 2 with a brief README.

Bayesian Robust Principal Component Analysis
In the BRPCA model [8], the observed data matrix Y ∈ R P×N is the superposition of 3 parts: low-rank component L ∈ R P×N , sparse component S ∈ R P×N , and noise term E ∈ R P×N , where P is the number of genes, and N is the number of samples. Furthermore, where Λ ∈ R K×K is a diagonal matrix, X ∈ R P×N , and • denotes the pointwise product. The diagonal matrix Z has binary entries along the diagonal, z kk ∈ {0, 1} for k = 1, . . . K and the binary matrix B ∈ {0, 1} P×N is sparse. The integer K defines the largest possible rank that may be inferred for L. The BRPCA model assumed: The microarray mRNA gene expression dataset we used to identify the subtypes of colon cancer is from TCGA. It includes 153 cancers samples, which have been used by Muzny et al. [2] and Ren et al. [4] for the same purpose, and 19 control normal samples. We used the level 3 dataset from TCGA, and this was downloaded by the R package "RTCGA" [21].

Method Overview
We first applied the BRPCA to denoise the gene expression data by getting the sparse component but removing the low-rank and noise components, and then we used the DBHT to cluster the sparse components in the BRPCA. The normal samples completely and exclusively clustered into one class were considered as the standard of reasonable clustering. If the standard was not reached, we continually tuned the setting of the hyperparameters of the BRPCA until the clustering was up to the standard. Finally, we used the DEFSW to select the feature genes for the clusters. A summary of our approach is shown in Figure 1. The BRPCA, DBHT and DEFSw algorithms were developed in MATLAB (MathWorks, Natick, MA, USA), and all the related source codes for implementing our approach are available in the Supplementary file 2 with a brief README.  In our study, the observed data matrix Y ∈ R P×N was gene expression profiling of colon cancer with P = 17, 814 genes and N = 172 samples (153 cancer samples and 19 normal samples). Each column is a gene expression profile and each row is the gene expression data in every sample. Different from image processing, which the BRPCA was originally used for [8], our data matrix Y consists of 2 types of columns, tumor samples and normal samples, and most genes across 2 types of samples should share some common characteristics. Therefore, we assume the appearance of the sparse component across sample (column) satisfies a Markov property, i.e., where p = 1, . . . P, i = 2, . . . N − 1.
For i = 1, N, we may use sample 2 and sample N − 1 twice, respectively, since sample 1 has no left neighbor and sample N has no right neighbor. Specifically, a gene with high expression in the left and right neighbors of a sample should have a high probability of expressing highly in this sample; hence the sparsity of a sample depends on its neighbors.
Since the density function at one layer is conjugate to the density function at the layer above it, the posterior density function is easily computed via Markov chain Monte Carlo (MCMC) implemented using a Gibbs sampler. The details for calculation of the BRPCA algorithm are described in algorithm 1 in Ding et al. [8].

Hierarchical Information Clustering by Means of Topologically Embedded Graphs
The directed bubble hierarchical tree (DBHT) [10] algorithm is used to extract cluster structure and detect hierarchical organization in complex datasets. This approach is based on the properties of topologically embedded graphs built from a similarity measure. The general idea of the DBHT is to use the topological structure of a planar maximally filtered graph (PMFG) [22] to investigate the properties of the datasets. PMFG is a triangulation of a topological sphere. It has been shown that PMFG graphs are efficient filtering tools, with topological properties associated with the properties of the underlying system [22,23]. This makes the PMFG a desirable tool to extract clusters and hierarchies from complex datasets.
In our study, a sample is a vertex and the Pearson's correlation coefficient matrix is used as the similarity matrix of the vertexes.
The dissimilarity matrix of the vertexes we used is: Based on the similarity and dissimilarity matrix of samples, the DBHT constructs the PMFG to perform clustering and get hierarchies of samples. The details for the DBHT algorithm are described in Song et al. [10].

Differential Evolution Based Feature Selection Method
To identify the subtypes using gene expression profiling, feature selection can be used to reduce the high-dimensionality of huge amounts of otherwise meaningless data. Khushaba and Al-Ani et al. proposed a powerful feature selection method that utilizes the differential evolution (DE) float number optimizer in the combinatorial optimization problem of feature selection, named DEFS O [17], followed by an improved version, DEFS W [18]. In the DEFS O , the desired feature subset size is predefined by the user, while in the DEFS W , the optimal feature subset size can be discovered automatically only by setting an upper limit. In the two algorithms, a wrapper classifier such as K nearest neighbor (KNN), support vector machine (SVM), or naive Bayes (NB) classifier is needed. The wrapper classifier and an assessment measurement such as classification accuracy are used together to evaluate the classification ability of features.
In our study, some subtypes have a small number of samples and others have a lot, i.e., there are imbalances of samples among subtypes. Therefore, a wrapper classifier and an assessment measurement that can cope with the class imbalance have to be used to avoid the "larger subtypes win." Meanwhile, for the DE algorithm, the computation cost is generally huge because of its iterative evolution, so a fast and simple classifier is desired. Not only can the NB classifier be trained very efficiently under the condition of a small amount of training data and take only linear time, but its empirical prior probabilities can lessen the influence of the imbalance of samples. To assess classification ability, we used weight accuracy instead of the usual classification accuracy. The weight accuracy of classification is defined by Draminski et al. [20] as: (13) where c is the number of classes, and n ij denotes the number of samples in the class i classified as those from class j. The wAcc considers sizes of classes in such a way as to prevent undue influence of a majority class on the performance index, and can more effectively assess the ability to classify the selected feature genes in the imbalanced data. The DEFS W algorithm and its parameter setting that we used are listed in Figure 2.

Results
We first used the BRPCA to denoise. In the BRPCA model, we set the largest possible rank as 30 k  and the model hyperparameters were finally specified as follows:  Figure 3), in which the normal samples were completely and exclusively clustered into one cluster and most of the MSI/CIMP samples were divided into three subtypes and some parts of the "invasive" samples were clustered together. The confusion matrix [24] of the two kinds of subtypes is shown in Table 1. Each row of the matrix represents the samples in a predicted class by the method of Muzny et al. [2], while each column represents the samples in a predicted class by our method. We also compared the subtypes predicted by our method and Ren's method. The confusion matrix of the two methods is shown in Table 2. It shows that the subtypes ECL1 and ECL2 identified by Ren can be further subdivided by our method.

Results
We first used the BRPCA to denoise. In the BRPCA model, we set the largest possible rank as k = 30 and the model hyperparameters were finally specified as follows: The initial values of the main arguments were set as ν = 10 −6 , γ n = 1. For MCMC-based Bayesian inference, the number of burn-in iterations N burn−in and collection iterations N collect were set as 200 and 100, respectively. Then we applied the DBHT to the sparse component S and obtained eight sample clusters (see Figure 3), in which the normal samples were completely and exclusively clustered into one cluster and most of the MSI/CIMP samples were divided into three subtypes and some parts of the "invasive" samples were clustered together. The confusion matrix [24] of the two kinds of subtypes is shown in Table 1. Each row of the matrix represents the samples in a predicted class by the method of Muzny et al. [2], while each column represents the samples in a predicted class by our method. We also compared the subtypes predicted by our method and Ren's method. The confusion matrix of the two methods is shown in Table 2. It shows that the subtypes ECL1 and ECL2 identified by Ren can be further subdivided by our method.    We also tried to directly cluster the mRNA gene expression dataset using DBHT without denoising (Figure 4). It suggested that directly clustering by DBHT could not get any meaningful result; even the normal samples could not be clustered into one single class. We also did consensus hierarchical clustering [25] in the same way as that described in Ren et al. [4], and the result ( Figure  5a) suggested that the samples were clustered into two clusters (normal and cancer samples) or three clusters (normal cluster, and ECL1 and ECL2 subtypes identified by Ren et al.). Using the same consensus clustering method, we also clustered component S of the samples in the BRPCA model, and the result (Figure 5b) suggested that consensus clustering could not get finer clusters for component S than the DBHT algorithm. All these suggested that the combination of BRPCA and DBHT could not only correctly cluster the normal samples, but also cluster the cancer samples into finer subtypes. We also tried to directly cluster the mRNA gene expression dataset using DBHT without denoising ( Figure 4). It suggested that directly clustering by DBHT could not get any meaningful result; even the normal samples could not be clustered into one single class. We also did consensus hierarchical clustering [25] in the same way as that described in Ren et al. [4], and the result (Figure 5a) suggested that the samples were clustered into two clusters (normal and cancer samples) or three clusters (normal cluster, and ECL1 and ECL2 subtypes identified by Ren et al.). Using the same consensus clustering method, we also clustered component S of the samples in the BRPCA model, and the result (Figure 5b) suggested that consensus clustering could not get finer clusters for component S than the DBHT algorithm. All these suggested that the combination of BRPCA and DBHT could not only correctly cluster the normal samples, but also cluster the cancer samples into finer subtypes.   We then selected the feature genes for the identified subtypes by the DEFSw algorithm. Considering that the latent premise of identifying subtypes is that the samples are from cancer patients, we expected that the selected feature genes could not only identify cancer subtypes, but also discriminate between tumor and normal samples, i.e., we expected to select out the feature genes that could discriminate all eight sample clusters. It was rational to select the feature genes from the differentially expressed genes (DEGs) of tumor vs. normal samples. We obtained 5897 DEGs by t-test (BH-correction, p value < 0.05) and fold change cutoff 1.5, and then selected the feature genes from these DEGs using the DEFSW algorithm described in Figure 2. We set the upper limit of the size of feature genes from 100 to 20, and for each upper limit, the DEFSw algorithm could deliver an optimal size of feature genes. We then determined our final size of feature genes to be that which gave the maximal average weight accuracy of the eight sample clusters. In this way, we ended up with 44 feature genes : THBS2, NOX4, KIAA1199, SLC16A4, CCDC19, ZNRF3, GOLT1A, HYAL3,  C15orf26, KIFC1, TIPIN, CTNNAL1, CALU, TAF1A, MCM2, MSH6, FLAD1, GCG, SCRG1, PTGER2,  TIMD4, MUC1, PLOD2, LIMS2, ADH1B, PTN, PTPN7, AQP1, PSD3, CRAT, ATOH8, CGN, C6orf204,   Figure 4. Sample cluster structure directly clustered using directed bubble hierarchical tree (DBHT) for the same data in Figure 3. Labels and symbols are also the same as Figure 3.  We then selected the feature genes for the identified subtypes by the DEFSw algorithm. Considering that the latent premise of identifying subtypes is that the samples are from cancer patients, we expected that the selected feature genes could not only identify cancer subtypes, but also discriminate between tumor and normal samples, i.e., we expected to select out the feature genes that could discriminate all eight sample clusters. It was rational to select the feature genes from the differentially expressed genes (DEGs) of tumor vs. normal samples. We obtained 5897 DEGs by t-test (BH-correction, p value < 0.05) and fold change cutoff 1.5, and then selected the feature genes from these DEGs using the DEFSW algorithm described in Figure 2. We set the upper limit of the size of feature genes from 100 to 20, and for each upper limit, the DEFSw algorithm could deliver an optimal size of feature genes. We then determined our final size of feature genes to be that which gave the maximal average weight accuracy of the eight sample clusters. In this way, we ended up with 44 feature genes : THBS2, NOX4, KIAA1199, SLC16A4, CCDC19, ZNRF3, GOLT1A, HYAL3,  C15orf26, KIFC1, TIPIN, CTNNAL1, CALU, TAF1A, MCM2, MSH6, FLAD1, GCG, SCRG1, PTGER2,  TIMD4, MUC1, PLOD2, LIMS2, ADH1B, PTN, PTPN7, AQP1, PSD3, CRAT, ATOH8, CGN, C6orf204, We then selected the feature genes for the identified subtypes by the DEFS w algorithm. Considering that the latent premise of identifying subtypes is that the samples are from cancer patients, we expected that the selected feature genes could not only identify cancer subtypes, but also discriminate between tumor and normal samples, i.e., we expected to select out the feature genes that could discriminate all eight sample clusters. It was rational to select the feature genes from the differentially expressed genes (DEGs) of tumor vs. normal samples. We obtained 5897 DEGs by t-test (BH-correction, p value < 0.05) and fold change cutoff 1.5, and then selected the feature genes from these DEGs using the DEFS W algorithm described in Figure 2. We set the upper limit of the size of feature genes from 100 to 20, and for each upper limit, the DEFS w algorithm could deliver an optimal size of feature genes. We then determined our final size of feature genes to be that which gave the maximal average weight accuracy of the eight sample clusters. In this way, we ended up with 44 feature genes : THBS2, NOX4, KIAA1199, SLC16A4, CCDC19, ZNRF3, GOLT1A,  HYAL3, C15orf26, KIFC1, TIPIN, CTNNAL1, CALU, TAF1A, MCM2, MSH6, FLAD1, GCG, SCRG1,  PTGER2, TIMD4, MUC1, PLOD2, LIMS2, ADH1B, PTN, PTPN7, AQP1, PSD3, CRAT, ATOH8, CGN,  C6orf204, FTHP1, KCNMB1, LIG4, PPFIBP2, PPP2CB, ALAS2, ZZEF1, ATXN7, GRLF1, FAM102A, and C1orf152. Among them, MSH6 is known to be related to CRC, and is located in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway of CRC [26]; THBS2 is a potential prognostic biomarker in CRC [27] and can be used as an early diagnosis biomarkers of CRC [28]; Overexpression of NOX4 predicts poor prognosis and promotes tumor progression in human CRC [29], and NOX4 plays a role in PhIP-induced colon carcinogenesis, especially during the early stages before tumor onset [30], and moreover NOX4 is highly predictive of relapse in stage II left-side colon cancer [31]; KIAA1199 in human colorectal tumors (benign and malignant) is markedly higher than that in the normal colonic mucosa [32,33] and its overexpression promotes CRC cell migration and invasion [34], and could be used as a prognostic factor and novel therapeutic target for CRC [35]; Furthermore KIAA1199 plays a critical role in maintaining an aggressive phenotype of tumor cells, and suppression of KIAA1199-related motilities of tumor cells contributes to reduced tumor metastasis in CRC [36]; MCM2 is correlated with the cell proliferation state in colon cancer [37] and is more sensitive than Ki-67 in identifying colorectal mucosal proliferation [38]; MUC1 is aberrantly overexpressed in human colon cancers and is associated with invasion, metastases and a poor prognosis [39,40]; ZNRF3 is important in serrated tumorigenesis and has identified a potential therapeutic strategy for CRC subtype [41]; LIG4 may represent a new epigenetic marker for CRC independent of known markers [42]; ADH1B displays decreased expression during progression from adenoma to early and more advanced stage of colorectal carcinomas [43]. In contrast, Muzny et al. [2] did not select the feature genes for the identified subtypes, and Ren et al. [4] detected 256 genes as the marker genes of the ECL1 and ECL2 subtypes by Prediction Analysis of Microarrays (PAM) [44].
We used the same NB classifier to validate the classification ability of these feature genes. We did 10%, 20%, and 30% cross-validation (CV) with 1000 repeats. The mean classification accuracy, the mean weight accuracy of classification, and the mean classification accuracy for each class of the 1000 CVs are listed in Table 3. S1-S8 denotes the clusters 1-8 and S8 is the cluster of normal samples. The classification accuracy Acc depends on the number of samples correctly classified and is evaluated by the formula: where t is the number of samples correctly classified and n is the total number of samples. Table 3. Overall mean accuracy, overall mean weight accuracy, and mean accuracy for each class by 1000 times of cross-validation using the naive Bayes (NB) algorithm on the feature gene sets.

Discussion
To identify cancer subtypes based on gene expression data, the proposed approach innovatively integrated state-of-art denoising, clustering, and feature selection algorithms. In BRPCA, the low-rank component of gene expression data may be the same as the background of an image and the noise component simulates unknown and nonstationary noise, whereas the sparse component may be the same as the foreground and is the key information for clustering. The DBHT is intrinsically a correlation-based clustering method. Through building the PMFG, the method does not need any prior tuning and provides both intracluster hierarchy, which describes the way clusters are composed, and intercluster hierarchy, which describes how clusters gather together. To assess whether the clustering is meaningful, we draw in the concept of "reference object" from classical physics. If the reference objects, the normal samples, are correctly clustered together, we consider the clustering as reasonable; otherwise, we need to tune the parameters in the BRPCA. Tuning the parameters means adjusting the degree of details of component S. The DEFS w algorithm can discover the optimal feature subset size. Considering the unstable stochastic search, we repeated the DEFS w algorithm multiple times with different upper limits of the feature size and selected the feature genes from the DEGs. This may be helpful for selecting out the optimal size of feature genes and getting higher accuracy for discriminating tumor and normal samples. Overall, the proposed approach can identify finer subtypes of colon cancer with fewer feature genes than the other two recent studies and exhibits a generic methodology for identifying cancer subtypes based on gene expression data by common processes.
Inter-tumor diversity of CRC complicates the prediction of disease and treatment outcomes. Subtypes of colorectal cancer identified by classifying gene expression profiles with defined prognostic markers would predict individual patient outcomes more precisely and therefore provide valuable guidance on appropriate therapeutic intervention [45]. It is proposed that CRC subtyping may advance precision diagnostics, treatment, and guide rational drug design. Numerous methods have been attempted to achieve this goal using gene expression datasets [2,4]. In a recent study by Bramsen et al. [46], subtyping strategy was used to CRC transcription profiles for identifying molecular-subtype-specific biomarkers which could contribute to improved patient prognostication. Moreover, other directions have also been taken to find the colorectal subtypes based on pathway profiles, morphological characteristics, clinical and molecular features. Different subtype classifications have been established in recent studies based on three identified molecular pathways: CIN (chromosomal instability), MSI-H (microsatellite instability-high), and CIMP [2,3,[47][48][49]. However, there are disagreements among these classifications. There have been many attempts to find consensus in classification of CRC subtypes, and such efforts are essential for revealing prognostic and predictive factors for patient outcomes and to guide treatments [45]. However, no universal subclassification has been agreed upon because of the difficulties and the cost of experimental verification. CRC subtyping consortium (CRCSC) proposed four transcriptional CMSs, which are associated with distinct histopathological features. However, this remains to be further documented, and consensus molecular subtyping is still not in a stage to guide clinical decisions [45]. The reliable molecular subtyping approaches are still needed to unveil clinical potentials.