Mutation Clusters from Cancer Exome

We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.


Introduction and Summary
Unless humanity finds a cure, about a billion people alive today will die of cancer. Unlike other diseases, cancer occurs at the DNA level via somatic alterations in the genome. A common type of such mutations found in cancer is due to alterations to single bases in the genome (single nucleotide variations (SNVs)). These alterations are accumulated throughout the lifespan of an individual via various mutational processes, such as imperfect DNA replication during cell division or spontaneous cytosine deamination [1,2], or due to exposures to chemical insults or ultraviolet radiation [3,4], etc. The footprint left by these mutations in the cancer genome is characterized by distinctive alteration patterns known as cancer signatures.
Identifying all cancer signatures would greatly facilitate progress in understanding the origins of cancer and its development. Therapeutically, if there are common underlying structures across different cancer types, then treatment for one cancer type might be applicable to other cancer types, which would be great news. From a diagnostic viewpoint, the identification of all underlying cancer signatures would aid cancer detection and identification methodologies, including vital early detection [5]. According to the American Cancer Society, late-stage metastatic cancers of unknown origin represent about 2% of all cancers [6] and can make treatment almost impossible. Another practical application is prevention, by pairing the signatures extracted from cancer samples with those caused by known carcinogens (e.g., tobacco, aflatoxin, UV radiation, etc.). At the end of the day, it all boils down to the question of usefulness: is there a small enough number of cancer signatures underlying all (100+) known cancer types, or is this number too large to be meaningful/useful? Note that, if we focus on the 96 mutation categories of SNVs [7], we cannot have more than 96 signatures [8]. Even if the number of true underlying signatures is, say, of order 50, it is unclear whether they would be useful, especially in practical applications. On the other hand, if there are only about a dozen underlying cancer signatures, then the hope for an order-of-magnitude simplification may well be warranted.
The commonly-used method for extracting cancer signatures [9] is based on nonnegative matrix factorization (NMF) [10,11]. Thus, one analyzes SNV patterns in a cohort of DNA-sequenced whole cancer genomes and organizes the data into a matrix G_{iµ}, where the rows correspond to the N = 96 mutation categories, the columns correspond to the d samples, and each element is a nonnegative occurrence count of a given mutation category in a given sample. Under NMF, the matrix G is then approximated via G ≈ W H, where W_{iA} is an N × K matrix, H_{Aµ} is a K × d matrix, and both W and H are nonnegative. The appeal of NMF is its biologic interpretation, whereby the K columns of the matrix W are interpreted as the weights with which the K cancer signatures contribute to the N = 96 mutation categories, and the columns of the matrix H are interpreted as the exposures to these K signatures in each sample. The price to pay for this is that NMF, which is an iterative procedure, is computationally costly, and depending on the number of samples d, it can take days or even weeks to run. Furthermore, NMF does not fix the number of signatures K, which must be either guessed or obtained via trial and error, thereby further adding to the computational cost. Perhaps most importantly, NMF is a nondeterministic algorithm and produces a different matrix W in each run. (Each W corresponds to one of myriad local minima of the NMF objective function.) This is dealt with by averaging over many such W matrices obtained via multiple NMF runs (or samplings). However, each run generally produces a weights matrix W_{iA} with columns (i.e., signatures) not aligned with those in other runs. Aligning or matching the signatures across different runs (before averaging over them) is typically achieved via nondeterministic clustering such as k-means.
Therefore, the result, even after averaging, generally is both noisy [12] and nondeterministic, i.e., if this computationally-costly procedure (which includes averaging) is run again and again on the same data, it will generally yield different-looking cancer signatures every time. Simply put, the NMF-based method for extracting cancer signatures is not designed to be even in-sample stable. Under these circumstances, out-of-sample stability can hardly be expected, i.e., cancer signatures obtained from non-overlapping sets of samples can be dramatically different. Yet out-of-sample stability is crucial for practical usefulness, e.g., diagnostically.
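To make the nondeterminism concrete, here is a minimal NumPy sketch (a toy example of ours, not the pipeline of [9]): plain multiplicative-update NMF run from two different random starts on the same synthetic count matrix lands in different local minima and yields different weight matrices W:

```python
import numpy as np

def nmf(G, K, seed, n_iter=200, eps=1e-9):
    """Plain multiplicative-update NMF (Frobenius norm): G ~ W H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    N, d = G.shape
    W = rng.random((N, K)) + eps
    H = rng.random((K, d)) + eps
    for _ in range(n_iter):
        H *= (W.T @ G) / (W.T @ W @ H + eps)   # multiplicative updates preserve
        W *= (G @ H.T) / (W @ H @ H.T + eps)   # nonnegativity of W and H
    return W / W.sum(axis=0), H                # normalize signature weights

rng = np.random.default_rng(0)
G = rng.poisson(5.0, size=(96, 50)).astype(float)  # toy 96 x 50 count matrix

W1, _ = nmf(G, K=5, seed=1)  # K must be guessed up front
W2, _ = nmf(G, K=5, seed=2)
# Different seeds: different local minima, with unaligned and unequal columns.
max_diff = np.abs(W1 - W2).max()
```

Averaging such W matrices across runs requires first aligning their columns, which, as noted above, is itself typically done with nondeterministic clustering.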
Without in- and out-of-sample stability, practical therapeutic and diagnostic applications of cancer signatures would be challenging. For instance, suppose one sequences genome (or exome; see below) data from a patient sample (be it via a liquid biopsy, a blood test or some other (potentially novel) method). Let us focus on SNVs. We have a vector of occurrence counts for the 96 mutation categories. We need a quick computational test to determine, with a high enough confidence level, whether (i) there is a cancer signature present in this data and (ii) which cancer type this signature corresponds to (i.e., in which organ the cancer originated). If cancer signatures are not even in-sample stable, then we cannot trust them. They could simply be noise. Indeed, there is always somatic mutational noise present in such data, and it must be factored out of the data before extracting cancer signatures. A simple way to understand somatic mutational noise is to note that (i) mutations are already present in humans unaffected by cancer and (ii) such mutations, which are unrelated to cancer, are further exacerbated when cancer occurs, as it disrupts the normal operation of various processes (including repair) in the DNA. At the level of the data matrix G, in [13], we discussed a key component of the somatic mutational noise and gave a prescription for removing it [14]. However, there likely exist other, deeper sources of somatic mutational noise, which must be further identified and carefully factored out. Simply put, somatic mutational noise unequivocally is a substantial source of systematic error in cancer signatures.
However, then there is also the statistical error, which is large and due to the nondeterministic nature of NMF discussed above. This statistical error is exacerbated by the somatic mutational noise, but would be present even if this noise were somehow completely factored out. Therefore, the in-sample instability must somehow be addressed. We emphasize that, a priori, this does not automatically address out-of-sample stability, without which any therapeutic or diagnostic applications would still be farfetched. However, without in-sample stability, nothing is clear.
The problem at hand is nontrivial and requires a step-by-step approach, including identification of various sources of in-sample instability. One simple observation of [13] is that, if we work directly with the occurrence counts G_{iµ} for individual samples, (i) the data are very noisy and (ii) the number of signatures is bound to be too large to be meaningful/useful if the number of samples is large. A simple way to deal with this is to aggregate samples by cancer types. In doing so, we have a matrix G_{is}, where s now labels cancer types, which is (i) less noisy and (ii) much smaller (96 × n, where n is the number of cancer types), so the number of resultant signatures is much more reasonable [15]. Thus, such aggregation is helpful.
Still, even with aggregation, we must address nondeterminism (of NMF). To circumvent this, in [16], we proposed an alternative approach that bypasses NMF altogether. As we argue in [16], NMF is, at least to a certain degree, clustering in disguise, e.g., many COSMIC cancer signatures [17] obtained via NMF (augmented with additional heuristics based on biologic intuition and empirical observations) exhibit clustering substructure, i.e., in many of these signatures, there are mutation categories with high weights ("peaks" or "tall mountain landscapes") with other mutation categories having small weights likely well within statistical and systematic errors. For all practical purposes, such low weights could be set to zero. Then, many cancer signatures would start looking like clusters, albeit some clusters could be overlapping between different signatures. Considering that various signatures may be somatic mutational noise artifacts in the first instance and statistical error bars are large, it is natural to wonder whether there are some robust underlying clustering structures present in the data, with the understanding that such structures may not be present for all cancer types. However, even if they are present for a substantial number of cancer types, unveiling them would amount to a major step forward in understanding cancer signature structure.
To address this question, in [16], we proposed a new clustering algorithm termed *K-means. Its basic building block is the vanilla k-means algorithm, which computationally is very inexpensive. However, it is also nondeterministic. *K-means uses two machine learning levels on top of k-means to achieve statistical determinism (see Section 2 for details) [18], without any initialization of the centers [19]. Once *K-means fixes the clustering, it turns out that the weights and exposures can be computed using (normalized) regressions [16], thereby altogether bypassing computationally-costly NMF. In [16], we applied this method to cancer genome data corresponding to 1389 published samples for 14 cancer types. We found that clustering works well for 10 out of the 14 cancer types; the metrics include within-cluster correlations and overall fit quality. This suggests that there is indeed a clustering substructure present in the underlying cancer genome data, at least for most cancer types [20]. This is encouraging.
In this paper, we apply the method of [16] to exome data consisting of 10,656 published samples (sample IDs with sources are in Appendix A) aggregated by 32 cancer types. *K-means produces a robustly-stable clustering (11 clusters) from these data. One motivation for using exome data is that the exome is a small subset (∼1%) of the full genome containing only the protein-coding regions of the genome [21]. The exome is much less expensive and less time-consuming to sequence than the whole genome, which can be especially important for early-stage diagnostics, yet it encodes important information about cancer signatures. As we discuss in the subsequent sections, our method appears to work well on exome data for most cancer types. In fact, overall, it appears to work better than COSMIC signatures, including out-of-sample, when applying the clusters derived from our exome data to genome data.

*K-means
In [16], by extending a prior work [22] in quantitative finance on building statistical industry classifications using clustering algorithms, we developed a clustering method termed *K-means ("star K-means") and applied it to the extraction of cancer signatures from genome data. *K-means is anchored on the standard k-means algorithm (see [23][24][25][26][27][28][29]) as its basic building block. However, k-means is not deterministic. *K-means is statistically deterministic, without specifying initial centers. This is achieved via two machine learning levels sitting on top of k-means. At the first level, we aggregate a large number M of k-means clusterings with randomly initialized centers (and the number of target clusters fixed using eRank) via a nontrivial aggregation procedure; see [16] for details. This aggregation is based on clustering (again, using k-means) the centers produced in the M clusterings, so the resultant aggregated clustering is still nondeterministic. However, it is a lot less nondeterministic than individual vanilla k-means clusterings, as aggregation dramatically reduces the degree of nondeterminism. At the second level, we take a large number P of such aggregated clusterings and determine the "ultimate" clustering with the maximum occurrence count (among the P aggregations). For sufficiently large M and P, the "ultimate" clustering is stable, i.e., if we run the algorithm over and over again, we will get the same "ultimate" clustering every time, even though the occurrence counts themselves will differ from run to run. What is important here is that the most frequently-occurring ("ultimate") clustering remains the same run after run. We emphasize that *K-means is a universal algorithm, and its application is not limited to cancer genome or exome data.
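The two levels can be illustrated with a self-contained toy sketch (our own simplification; the published implementation in [16] differs in details, e.g., the aggregation procedure and fixing the number of clusters via eRank, which here is passed in by hand; the base runs below use k-means++-style seeding purely for brevity):

```python
import numpy as np
from collections import Counter

def kmeans_once(X, K, rng, n_iter=50):
    """One vanilla k-means run (k-means++-style seeding, Lloyd iterations)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    return labels, centers

def canonical(labels):
    """Relabel clusters in order of first appearance so identical partitions match."""
    seen = {}
    return tuple(seen.setdefault(int(l), len(seen)) for l in labels)

def aggregated_clustering(X, K, M, rng):
    """Level 1: pool the centers of M randomly seeded runs, cluster the pool
    (again with k-means), and assign each point to its nearest meta-center."""
    pooled = np.vstack([kmeans_once(X, K, rng)[1] for _ in range(M)])
    _, meta = kmeans_once(pooled, K, rng)
    return ((X[:, None] - meta[None]) ** 2).sum(-1).argmin(axis=1)

def ultimate_clustering(X, K, M=10, P=20, seed=0):
    """Level 2: among P aggregated clusterings, keep the most frequent one."""
    rng = np.random.default_rng(seed)
    votes = Counter(canonical(aggregated_clustering(X, K, M, rng)) for _ in range(P))
    return np.array(votes.most_common(1)[0][0])
```

On well-separated synthetic data, different seeds for the outer loop reproduce the same "ultimate" partition, which is the stability property described above.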
We discuss how the input data (i.e., matrices of somatic mutation counts for cancer exome) are used in the context of *K-means in Section 3.2 (see [16] for technical details of *K-means).

Data Summary
In this paper, we apply *K-means to exome data. (In [16], we applied it to published genome data. In this work, apart from applying *K-means to exome data, we also perform an out-of-sample stability analysis of our results (see Section 4).) We use data consisting of 10,656 published exome samples aggregated by 32 cancer types listed in Table 1, which summarizes total occurrence counts, numbers of samples and data sources. Appendix A provides sample IDs together with references for the data sources. Occurrence counts for the 96 mutation categories for each cancer type are given in Tables A1-A4. For Tables and Figures labeled A, see Appendix A.

Structure of the Data

The underlying data consist of matrices [G(s)]_{iµ(s)} whose elements are occurrence counts of mutation categories labeled by i = 1, . . . , N = 96 in samples labeled by µ(s) = 1, . . . , d(s). Here, s = 1, . . . , n labels the n different cancer types (in our case, n = 32). We can choose to work with the individual matrices [G(s)]_{iµ(s)} or with the N × d_tot "big matrix" Γ obtained by appending (i.e., bootstrapping) the matrices [G(s)]_{iµ(s)} together column-wise (so d_tot = ∑_{s=1}^{n} d(s)). Alternatively, we can aggregate samples by cancer types and work with the so-aggregated N × n matrix:

G_{is} = ∑_{µ(s)=1}^{d(s)} [G(s)]_{iµ(s)}    (1)

Generally, the individual matrices [G(s)]_{iµ(s)} and, thereby, the "big matrix" Γ contain much noise. For some cancer types, we have a relatively small number of samples. We can also have "sparsely-populated" data, i.e., with many zeros for some mutation categories. In fact, different samples are not even necessarily uniformly normalized. To mitigate these issues, following [13], we work here with the N × n matrix G_{is} with samples aggregated by cancer types. Below, we apply *K-means to G_{is}.
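As a toy illustration (synthetic counts and made-up cancer-type names, not the published data), the "big matrix" Γ and the aggregated matrix G_{is} can be built as follows:

```python
import numpy as np

N = 96                                  # mutation categories
d = {"type_A": 7, "type_B": 4}          # hypothetical cancer types -> sample counts

rng = np.random.default_rng(0)
per_type = {s: rng.poisson(3.0, size=(N, ds)) for s, ds in d.items()}

# "Big matrix" Gamma: all samples appended column-wise, N x d_tot.
Gamma = np.hstack(list(per_type.values()))

# Aggregated matrix G_is: one column per cancer type, summing over its samples.
G = np.column_stack([M.sum(axis=1) for M in per_type.values()])
```

Aggregation preserves the total counts while shrinking the column dimension from d_tot to the number of cancer types.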

Exome Data Results
The 96 × 32 matrix G_{is} given in Tables A1-A4 is what we pass into the function bio.cl.sigs() in Appendix A of [16] as the input matrix x. We use: iter.max = 100 (this is the maximum number of iterations used in the built-in R function kmeans(); we note that there was not a single instance in our 30 million runs of kmeans() where more iterations were required; the R function kmeans() produces a warning if it does not converge within iter.max); num.try = 1000 (this is the number of individual k-means samplings we aggregate every time); and num.runs = 30,000 (this is the number of aggregated clusterings we use to determine the "ultimate", i.e., the most frequently occurring, clustering). More precisely, we ran three batches with num.runs = 10,000 as a sanity check, to make sure that the final result based on 30,000 aggregated clusterings was consistent with the results based on the smaller batches, i.e., that it was stable from batch to batch [30]. Based on Table A5, we identify Clustering-E1 as the "ultimate" clustering (see Section 2). Also, it is evident that the top-10 clusterings in Table A5 essentially are variations of each other.

Within-Cluster Correlations
We have our data matrix G_{is}. We approximate this matrix via the following factorized matrix:

G_{is} ≈ G*_{is} = ∑_{A=1}^{K} W_{iA} H_{As}    (2)

where W_{iA} are the within-cluster weights (i = 1, . . . , N; A = 1, . . . , K), H_{As} are the exposures (s = 1, . . . , n = 32 labels the cancer types), Q : {1, . . . , N} → {1, . . . , K} is the map between the N = 96 mutation categories and the K = 11 clusters in Clustering-E1, and we have W_{iA} = w_i δ_{Q(i),A} [31]. It is the matrix W_{iA} that is given in Tables A6 and A7 for the unnormalized regressions and Tables 2 and 3 for the normalized regressions.
We can now compute the n × K matrix Θ_{sA} of within-cluster cross-sectional correlations between G_{is} and G*_{is}, defined via

Θ_{sA} = xCor([G_{is}]_{i∈J(A)}, [G*_{is}]_{i∈J(A)})    (3)

Here, xCor(·, ·) stands for "cross-sectional correlation", i.e., correlation across the index i; due to the factorized structure (2), these correlations do not directly depend on H_{As}. Also, J(A) = {i | Q(i) = A} is the set of mutation categories labeled by i that belong to a given cluster labeled by A. We give the matrix Θ_{sA} for Clustering-E1 for weights based on unnormalized regressions in Table 4 and weights based on normalized regressions in Table 5. As for the genome data [16], the fit for normalized regressions is somewhat better than that for unnormalized regressions.
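A sketch of this computation (with random stand-in data; for brevity, the weights w_i below are simple within-cluster row means rather than the regression-based weights of [16], but within a cluster the fitted values are proportional to the weights, so the correlation has the same form):

```python
import numpy as np

def within_cluster_corr(G, Q, K):
    """Theta_sA: correlation, over the mutation categories i in cluster A,
    between the counts G_is and the cluster weights w_i."""
    n = G.shape[1]
    w = G.mean(axis=1)                  # stand-in weights, one per mutation category
    Theta = np.full((n, K), np.nan)
    for A in range(K):
        J = np.where(Q == A)[0]         # J(A) = {i | Q(i) = A}
        if len(J) < 2:
            continue                    # correlation undefined for tiny clusters
        for s in range(n):
            Theta[s, A] = np.corrcoef(G[J, s], w[J])[0, 1]
    return Theta

rng = np.random.default_rng(1)
G = rng.poisson(4.0, size=(96, 5)).astype(float)   # toy counts, 5 "cancer types"
Q = rng.integers(0, 11, size=96)                   # toy cluster map, K = 11
Theta = within_cluster_corr(G, Q, K=11)
```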

Overall Correlations
Another useful metric, which we use as a sanity check, is as follows. For each value of s (i.e., for each cancer type), we can run a linear cross-sectional regression (without the intercept) of G_{is} over the matrix W_{iA}. Therefore, we have n = 32 such regressions. Each regression produces its multiple R² and adjusted R², which we give in Tables 4 and 5. Furthermore, we can compute the fitted values G*_{is} based on these regressions, which are given by:

G*_{is} = ∑_{A=1}^{K} W_{iA} F_{As}    (4)

where (for each value of s) F_{As} are the regression coefficients. We can now compute the overall cross-sectional correlations (i.e., with the index i running over all N = 96 mutation categories):

Ξ_s = xCor(G_{is}, G*_{is})    (5)

These correlations are also given in Tables 4 and 5 and measure the overall fit quality.
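A sketch of these per-cancer-type regressions and overall correlations (on synthetic near-factorized data; the no-intercept regressions are done via least squares):

```python
import numpy as np

def overall_fit(G, W):
    """Per-cancer-type no-intercept regressions of G_is on W_iA, and the
    overall correlations Xi_s between the data and the fitted values."""
    F, *_ = np.linalg.lstsq(W, G, rcond=None)      # K x n coefficients F_As
    G_fit = W @ F                                  # fitted values G*_is
    Xi = np.array([np.corrcoef(G[:, s], G_fit[:, s])[0, 1]
                   for s in range(G.shape[1])])
    return Xi, G_fit

rng = np.random.default_rng(2)
Q = rng.integers(0, 11, size=96)                   # toy cluster map
w = rng.random(96) + 0.1                           # toy within-cluster weights
W = np.zeros((96, 11))
W[np.arange(96), Q] = w                            # W_iA = w_i * delta_{Q(i),A}
G = W @ rng.random((11, 5)) + 0.05 * rng.random((96, 5))  # near-factorized data
Xi, _ = overall_fit(G, W)
```

On such near-factorized toy data, the overall correlations come out close to 1, as one would hope; on real data, they are the fit-quality diagnostics reported in Tables 4 and 5.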

Interpretation
Looking at Table 5, a few things jump out. First, most (24 out of 32) cancer types have high (80%+) within-cluster correlations with at least one cluster. Of the other eight cancer types, six have reasonably high (70%+) within-cluster correlations with at least one cluster. The remaining two cancer types are X9 (cervical cancer) and X17 (liver cancer). In [16], based on genome data, we already observed that liver cancer does not have a clustering structure, so this is not surprising. On the other hand, with cervical cancer, the story appears to be trickier. According to [17], we should expect COSMIC signatures CSig2+13 and CSig26 (see Section 4 for more details) to appear in cervical cancer. According to Table A8 (see Section 4), CSig2+13 indeed have high correlations with X9 (but not CSig26). On the other hand, the dominant part of CSig2 (C > T mutations in TCA, TCC, TCG, TCT) is subsumed in Cluster Cl-10 (see Figure 10), and the dominant part of CSig13 (C > G mutations in TCA, TCC, TCT) is subsumed in Cluster Cl-9 (see Figure 9). Basically, it appears that the large (each with 16 mutation categories) Clusters Cl-9, Cl-10 and Cl-11 probably could be split into smaller clusters. In fact, Cl-9 and Cl-11 do not have 80%+ correlations with any cancer types (they do have 70%+ correlations with one cancer type each). This is another indication that these clusters might be "oversized". The same was observed for the largest cluster (with 21 mutation categories) in [16] in the context of genome data. Simply put, these "oversized" clusters may have to be dealt with via appropriately tweaking the underlying clustering algorithm (this is outside of the scope hereof and will be dealt with elsewhere).
The last three columns in Table 5 provide metrics for the overall fit for each cancer type. The overall correlations (between the original data G_{is} and the model-fitted values G*_{is}; see Section 3.3.2) in the last column of Table 5 are above 80% for 16 (out of the 32) cancer types and above 70% for 26 cancer types. These high correlations indicate a good in-sample agreement between the original and reconstructed (model-fitted) data for each of these 26 cancer types. The remaining six cancer types, which all have overall correlations above 60%, are: X4 (B-cell lymphoma), X6 (bladder cancer), X8 (breast cancer), X9 (cervical cancer), X26 (rectum adenocarcinoma) and X29 (testicular germ cell tumor). We already discussed cervical cancer above. We address breast cancer in Section 4 hereof. Now, the X4 data are sparsely populated: there are 24 samples, and the total number of counts is 706, so there are many zeros in the underlying sample data, albeit only two zeros in the aggregated data. According to [17], we should expect CSig9 and CSig17 in B-cell lymphoma. However, according to Table A8 (see Section 4), these signatures do not have high correlations with X4. Note that clustering worked well for B-cell lymphoma for the genome data in [16], but there, the genome data were well-populated. Therefore, it is reasonable to assume that here, the "underperformance" is likely due to the sparsity of the underlying data. For X6 (bladder cancer), the situation is similar to X9 (cervical cancer) above: according to [17], we should expect CSig2+13 in bladder cancer, and Table A8 is consistent with this. However, as mentioned above, CSig2 and CSig13 are subsumed in Clusters Cl-10 and Cl-9, respectively ("oversizing"). According to Table A9, we should expect CSig10 in X26. CSig10 is dominated by the C > A mutation in TCT (which is subsumed in Cluster Cl-9) and the C > T mutation in TCG (which is subsumed in Cluster Cl-10).
Again, here we are dealing with "oversizing" of these clusters. X29 has high within-cluster correlations with Clusters Cl-4 and Cl-5. The overall fit correlation apparently is lowered by the high negative correlation with Cluster Cl-3. To summarize, "oversizing" is one potential "shortcoming" here.

Concluding Remarks
In order to understand the significance of our results, let us compare them to the fit that the COSMIC signatures (for details, see [17]; for references, see [9,32-35]) provide for our exome data. We can do this by computing the following p × n cross-sectional correlation matrix:

∆_{αs} = xCor(U_{iα}, G_{is})    (6)

where U_{iα} (α = 1, . . . , p) is the N × p matrix of weights for the p = 30 COSMIC signatures, which, for brevity, we refer to as CSig1, . . . , CSig30 [36], and the correlation is computed across the index i = 1, . . . , N. The matrix ∆_{αs} is given in Tables A8 and A9. Let us look at the 80%+ correlations (which are in bold font in Tables A8 and A9). (Relaxing this cut-off to 70% (see Tables A8 and A9) does not alter our conclusions below.) Only six out of the 30 COSMIC signatures, to wit, CSig1, 2, 6, 7, 10 and 15, have 80%+ correlations with the exome data for the 32 cancer types. The aetiology of these signatures is known [17]. CSig1 is the result of an endogenous mutational process initiated by spontaneous 5-methylcytosine deamination, hence the ubiquity of its high correlations with many cancer types. CSig2 (which usually appears in tandem with CSig13) is due to APOBEC-mediated cytosine deamination, hence its high correlations with some cancer types. CSig6 is associated with defective DNA mismatch repair, hence its high correlations with several cancer types. CSig7 is due to ultraviolet light exposure, so its high correlation with X19 (melanoma) is spot on [37]. CSig10 is associated with recurrent error-prone polymerase POLE somatic mutations (its high correlations with X26 (rectum adenocarcinoma) and X32 (uterine cancer) are consistent with [17] and, once again, apparently are due to a large overlap between the exome data we use here and those used by [17]). CSig15 is associated with defective DNA mismatch repair; the significance of its high correlation with X23 (pancreatic cancer) is unclear. Therefore, only a handful of COSMIC signatures, all associated with known mutational processes, do well on our exome data [38]. Others do not fit well.
This is the out-of-sample stability issue emphasized in [13]. It traces back to the fact that NMF is an intrinsically unstable method, both in- and out-of-sample. In-sample instability relates to the fact that NMF is nondeterministic and produces different-looking signatures from one run to another. In fact, we attempted running NMF on our exome data. We ran three batches with 800 samplings in each batch (a computationally time-consuming procedure [39]). The three batches produced different-looking results, which with much manual curation could only be partially matched to some COSMIC signatures, but this matching was different and highly unstable across the three batches. Simply put, NMF failed to produce any meaningful results on our exome data. Furthermore, the above discussion illustrates that most COSMIC signatures (extracted using NMF from exome and genome data) apparently are unstable out-of-sample, e.g., when applied to our exome data aggregated by cancer types. Here, one may argue that exome data contain only partial information, and NMF should not be used on them. However, the COSMIC signatures are in fact based on 10,952 exomes and 1048 whole genomes across 40 cancer types [17] (also see, e.g., [40]). The difference here is that we are aggregating samples by cancer types, and most COSMIC signatures apparently do not apply, which means that COSMIC signatures are highly sample-set-specific (that is, unstable out-of-sample). Furthermore, as mentioned above, CSig7 (UV exposure) is spot on in that it has a 99.66% correlation with X19 (melanoma) (albeit one should keep in mind the comments in [37]). Therefore, one can argue that the culprit is not the exome data, but the method (NMF) itself. To quantify this, let us look at the correlations of COSMIC signatures with the genome data for the 14 cancer types used in [13] and [16]. The results are given in Table A10.
As in the case of exome data, here too, we have high correlations only for a handful of COSMIC signatures corresponding to known mutational processes, to wit CSig1,4,6,13. Therefore, most COSMIC signatures do not appear to have explanatory power on genome data aggregated by cancer types, a further indication that most COSMIC signatures lack out-of-sample stability.
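The p × n signature-data correlation matrix ∆_{αs} used in these comparisons can be sketched as follows (random stand-in matrices; U below is emphatically not the actual COSMIC weights matrix):

```python
import numpy as np

def signature_data_corr(U, G):
    """Delta_alpha,s = xCor(U_i,alpha, G_is): correlation over the N mutation
    categories between each signature's weights and each cancer type's counts."""
    p, n = U.shape[1], G.shape[1]
    return np.array([[np.corrcoef(U[:, a], G[:, s])[0, 1] for s in range(n)]
                     for a in range(p)])

rng = np.random.default_rng(3)
U = rng.random((96, 30))                           # stand-in for 30 signature weight vectors
G = rng.poisson(4.0, size=(96, 32)).astype(float)  # stand-in aggregated counts
Delta = signature_data_corr(U, G)
```

With the real inputs, thresholding this matrix at 80% (or 70%) yields the bolded (or underlined) entries of Tables A8-A10.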
What about the out-of-sample stability of the clusters we obtained from exome data? One way to test this is to look at the within-cluster correlations and the overall fit metrics as in Table 5, but for the aforesaid genome data for the 14 cancer types used in [13,16]. The results are given in Table 6. Unsurprisingly, the quality of the fit for genome data (out-of-sample) is not as good as for exome data (in-sample). However, it is (i) reasonably good and (ii) unequivocally much better than the fit provided by the COSMIC signatures (Table A10). Furthermore, the 11 exome-based clusters have a poor overall fit for G.X4 (breast cancer), G.X8 (liver cancer), G.X9 (lung cancer) and G.X14 (renal cell carcinoma), the same four cancer types for which the seven genome-based clusters in [16] produced a poor overall fit, and for a good reason as well (see [16] for details). It is less clear why the 11 exome-based clusters do not have a better fit for G.X7 (gastric cancer), considering that the in-sample fits for this cancer type based on exome data (X15; Table 5 hereof) and genome data (Row 7, Table 15 of [16]) are pretty good.
Therefore, unlike NMF, *K-means clustering, being a statistically deterministic method, is in-sample stable. Here, we can ask: what if we apply to NMF the same two machine learning levels as those that sit on top of k-means in *K-means, to make it statistically deterministic? The answer is that, when applying NMF, one already uses one machine learning method, which is a form of aggregation of a large number of samplings (i.e., individual NMF runs) [41]. This is conceptually similar to the first machine learning level in *K-means. We can then ask: what if we add to NMF the second machine learning level as in *K-means, to wit, by comparing a large number of such "averagings"? A simple, prosaic answer is that it would make NMF, which is already computationally costly as is, and much more so with the first machine learning level, computationally prohibitive. The reason why *K-means is computationally much less expensive is that the basic building block of *K-means, on top of which we add the two machine learning levels, is vanilla k-means, which is much, much less expensive than NMF. That is what makes all the difference [42].

Table 6. The within-cluster cross-sectional correlations Θ_{sA} (Columns 2-12), the overall correlations Ξ_s (Column 15) based on the overall cross-sectional regressions, and the multiple R² and adjusted R² of these regressions (Columns 13 and 14). The cluster weights are based on normalized regressions (see Sections 3.2 and 3.3.1 for details). The definitions of the cancer types G.X1-G.X14 for genome data are given in Table A10. All quantities are in units of 1% rounded to 2 digits. Values above 80% are given in bold font. Values above 70% are underlined.

Finally, let us mention that exome data for chronic myeloid disorders (121 samples, 175 total counts) were published in [43,44], and for neuroblastoma (13 samples, 298 total counts) in [45].
However, these data are so sparsely populated (too many zeros even after aggregation) that we specifically excluded them from our analysis. Much more unpublished data are available for the cancer types we analyze here, as well as other cancer types, and it would be very interesting to apply our methods to these data, including to (still embargoed) extensive genome data of the International Cancer Genome Consortium.

Acknowledgments:
The results published here are in whole or part based on data generated by the TCGA (The Cancer Genome Atlas) Research Network: http://cancergenome.nih.gov/.
Author Contributions: The authors contributed equally.

Conflicts of Interest:
The authors declare no conflict of interest.