Visual Clustering of Transcriptomic Data from Primary and Metastatic Tumors—Dependencies and Novel Pitfalls

Personalized oncology is a rapidly evolving area and offers cancer patients therapy options that are more specific than ever. However, there is still a lack of understanding regarding transcriptomic similarities or differences of metastases and corresponding primary sites. Applying two unsupervised dimension reduction methods (t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP)) on three datasets of metastases (n = 682 samples) with three different data transformations (unprocessed, log10 as well as log10 + 1 transformed values), we visualized potential underlying clusters. Additionally, we analyzed two datasets (n = 616 samples) containing metastases and primary tumors of one entity, to point out potential familiarities. Using these methods, no tight link between the site of resection and cluster formation outcome could be demonstrated, or for datasets consisting of solely metastasis or mixed datasets. Instead, dimension reduction methods and data transformation significantly impacted visual clustering results. Our findings strongly suggest data transformation to be considered as another key element in the interpretation of visual clustering approaches along with initialization and different parameters. Furthermore, the results highlight the need for a more thorough examination of parameters used in the analysis of clusters.


Introduction
From a clinical perspective, characteristic metastatic patterns frequently occur for specific cancer entities [1]. Thus, the site of metastasis has a considerable effect on patients' prognosis. For example, liver metastases derived from pancreatic adenocarcinoma are prognostically worse than lymph node or lung metastases [2,3]. Still, it is unclear whether there is a biological or genetic determination for tumors to develop regional metastases or even distant metastasis with preferred target regions [4][5][6]. From a biological point of view, tumor cells develop through clonal evolution, favoring tumor heterogeneity reflected by different driver mutations or genomic alterations. To be specific, these alterations need to take place in genes tightly related to cellular traits known as the hallmarks of cancer, for example, the ability to attract endothelial cells (angiogenesis) or the sufficient evasion from immune control [7,8]. Due to the accumulation of different alterations, metastases can occur. However, the circumstances of local or distant metastases still need to be investigated as it has already been shown that distant metastasis can develop without previous local metastasis [9]. Additionally, transcriptomic differences of the linear [10] and the parallel progression [11] models of metastasis are still not fully clarified.
Due to the advances in personalized oncology, the comprehensive elucidation of primary tumors and metastases is an ongoing process. Still, the determination of possible transcriptomic differences or similarities between metastases-especially from several different resection sites-and primary tumors is an unmet need. Previous approaches frequently analyzed the differences between two specific groups, e.g., bone and brain metastasis [12], or the mutational evolution, while showing a high concordance of primary and metastatic tumors [13][14][15]. As a result, a comprehensive study of the transcriptomic characteristics of metastases and corresponding primary tumors is lacking.
Visual clustering, based on data dimension reduction methods, is one potential approach to determine the transcriptomic differences and proximities of different metastasis sites. The mainly used visualization methods for this purpose are t-Distributed Stochastic Neighbor Embedding (t-SNE) [16] and Uniform Manifold Approximation and Projection (UMAP) [17]. They have already been widely applied in the field of single cell sequencing [18][19][20] and also bulk RNA sequencing [21][22][23][24] to visually separate transcriptionally similar cell populations from diverging populations in a two-dimensional space. Furthermore, recent studies have shown the critical impact of initialization [25] and parameters [26] on data dimension reduction methods.
To search for transcriptomic dependencies caused by the site of metastasis, t-SNE and UMAP were used to analyze three metastasis datasets, prostate cancer (PCa), neuroendocrine PCa, and skin cutaneous melanoma, totaling 682 samples. For a comprehensive analysis, unprocessed Fragments Per Kilobase Million (FPKM) values, obtained after normalization of the mapped sequencing reads, as well as log10 and log10 + 1 transformed data were analyzed, as logarithmic transformations are commonly used when analyzing gene expressions.

Data Acquisition
RNA sequencing data from three different metastasis datasets were analyzed. The first dataset contained n = 266 samples from metastatic prostate carcinoma (PRAD-SU2C-Dream Team [27]), the second consisted of n = 49 samples from metastatic neuroendocrine prostate carcinoma (NEPC WCM [28]), and the third dataset consisted of n = 367 metastatic skin cutaneous melanomas (TCGA-SKCM-Metastatic [29]). All datasets indicate the site of resection, which served as the basis for further analyses.
For further testing similarities and differences between primary and metastatic tumors, we used the complete TCGA-SKCM dataset, adding n = 103 primary tumor samples for a total of n = 470 samples, and a metastatic breast cancer dataset (MBCproject; cBioPortal [30,31] data version February 2020 [32]) consisting of n = 120 primary tumor and n = 26 metastatic samples.

Bioinformatic Analysis
To obtain a more comprehensive view, t-SNE plots and UMAPs were applied for the analyses, as t-SNE or UMAP have become standard not only for bulk RNA sequencing but also for single cell analysis [33][34][35], combined with three data transformation approaches. First, we used the unprocessed FPKM values-obtained by normalizing the mapped sequencing reads-log10 transformed values, and log10 + 1 transformed values. The so-called log10 transformed values are the log10 values of the unprocessed FPKM values, where values equal to 0 are set to 0. The so-called log10 + 1 transformed values are the log10 values, which are obtained after the unprocessed FPKM value plus 1 has been calculated.
Subsequently, results of t-SNE and UMAP dimension reduction were compared. All t-SNE plots were created equally based on a principal component analysis with 50 components, a learning rate of 300, and a perplexity of 27. For the NEPC WCM dataset, 25 components were used. Further details on the procedure are given elsewhere [23]. UMAP plots were generated based on an adapted UMAP approach as previously described [22]. In brief, the squared pairwise Euclidean distance was used to calculate the distance between samples with a subsequent binary search for the optimal rho based on a fixed number of 15 nearest neighbors. The symmetry calculation was simplified, by dividing the sum of probabilities by 2. Furthermore, mind_dist = 0.25 was used, as well as cross-entropy as cost function with normalized Q parameter. Last, gradient descent learning was used with 2 dimensions and 50 neighbors were applicable (NEP-WCM dataset used 25 neighbors). After generating the unbiased low-dimensional representations of the highdimensional input (RNA sequencing), data manual cluster interpretation was performed.
To further address the question of possible clusters and the associated distinction between primary tumor and metastasis in the SKCM dataset, we additionally performed k-means (for k = 2, 3, 4 based on elbow method) and Leiden [36] (with n_neighbors = 15, 50, 100 and resolution = 0.05-additional use of default parameters n_neighbors = 15 and resolution = 1) clustering for the different UMAP results (unprocessed, log10, log10 + 1). For k-means clustering, the KMeans method of the sklearn cluster module [37] was used. Leiden clustering was implemented using scanpy (version 1.7.2) [38], based on the previously calculated UMAPs.
To further assess the potentially introduced differences based on data transformation between the individual maps, we additionally used the scale-dependent similarity measure proposed by Taskesen et al. [39], utilizing the python module flameplot (v1.0.3) [40] with default parameters.

Analysis of the PRAD-SU2C (Dream Team) Dataset
The first dataset in our analysis represented metastatic prostate carcinoma. Within t-SNE plots, up to three clusters were observable, according to applied data transformation. Unprocessed FPKM values resulted in one visible cluster in addition to the big main cluster (Figure 1a), whereas log10 ( Figure 1b) and log10 + 1 (Figure 1c) approaches showed two additional smaller clusters. These clusters mainly contained bone or liver samples and were named accordingly.
The UMAP approach showed similar results, with unprocessed FPKM values ( Figure 1d) not providing any clustering information, whereas log10 ( Figure 1e) and log10 + 1 (Figure 1f) transformations showed three visible and distinct clusters. Again, one of these clusters consisted completely of bone samples, another mainly consisted of liver samples, and the last and largest cluster consisted of all remaining samples. These results indicate that the resection site was not the main cause for clustering; instead, visualization techniques (t-SNE vs. UMAP) and data transformation (unprocessed vs. log10 vs. log10 + 1 transformed data) heavily affected clustering results.

Analysis of the NEPC WCM (Neuroendocrine Prostate Cancer) Dataset
The second dataset represented neuroendocrine prostate cancer. No clusters were detectable using the t-SNE approach (Figure 2a-c). However, the UMAP approach consistently revealed three distinct clusters (Figure 2d-f). Notably, the resulting clusters were very similar throughout all data transformations-thereby not displaying any resection site specificities.

Analysis of the NEPC WCM (Neuroendocrine Prostate Cancer) Dataset
The second dataset represented neuroendocrine prostate cancer. No clusters were detectable using the t-SNE approach (Figure 2a-c). However, the UMAP approach consistently revealed three distinct clusters (Figure 2d-f). Notably, the resulting clusters were very similar throughout all data transformations-thereby not displaying any resection site specificities.

Analysis of the Metastatic Samples of TCGA-SKCM Dataset
The SKCM-TCGA dataset representing metastatic melanoma served as the third dataset. Again, no clusters were detected using t-SNE plots with the different data transformations (Figure 3a  In summary, in none of the three datasets could a continuous dependence of the resection site be seen. Instead, a strong dependence of the visual cluster formation on the applied method and data transformation was observed. Only one small bone cluster in the Dream Team dataset was detectable independent of data transformation and dimension reduction approaches. To validate the obtained results and to test previous observations of subgroup-dependent clustering, the KIPAN dataset-consisting of three known biologically distinct subgroups of renal cell carcinoma (RCC)-was additionally considered.

Further Evaluation of Cluster Formation Based on Data Dimension Reduction Methods and Data Transformations
To investigate the influence of data transformation and data dimension reduction methods on the formation of visually distinct clusters, the TCGA datasets of the three largest RCC subgroups, clear cell (KIRC), papillary (KIRP), and chromophobe (KICH), were combined to one dataset (KIPAN). Due to the nature of the histopathologic origin of the samples in this dataset, a specific clustering could be expected. t-SNE (Figure 4b,c) and UMAP (Figure 4e,f) approaches based on log10 or the log10 + 1 transformed data yielded a separation of samples matching the histopathologic expectation. Furthermore, the importance and clinical relevance of subgroups identified by t-SNE (Figure 4a) using unprocessed data for the TCGA-KIPAN dataset have already been shown [23]. However, the unprocessed FPKM values yielded no useful information regarding the resection sitespecific agglomeration of samples in the UMAP approach (Figure 4d).   Although both data transformations showed a separation based on the histopathological subgroups for both data dimension reduction methods, clusters were not exclusively subgroup-specific and displayed certain outliers.

Combined Analysis of Primary and Metastatic Samples of the Same Entity
Based on the results of the KIPAN cohort, further analyses were performed for the complete TCGA-SKCM dataset as well-to analyse the transcriptomic relation of the primary and metastatic melanoma samples. Interestingly, no distinct separation between the metastatic and primary melanoma samples was observable ( Figure 5).
Moreover, only the UMAP log10 + 1 transformed approach displayed two distinct clusters, each containing primary and metastatic samples (Figure 5f). For both clusters, no complete subgroup-specific (primary tumor vs. metastasis) clustering resulted, yet a certain gradient was observable, indicating transcriptomic differences between the primary and metastatic tumors, but without previous knowledge of the subgroup, no assumptions could be made in separating both groups. Moreover, some metastatic tumors seem to still harbour primary tumor transcriptomic features, whereas there are also primary tumors already harbouring metastatic features.
This conclusion can also be drawn from the application of two common clustering methods, namely Leiden [41] and k-means clustering [42]. Again, the dependence of the obtained results on the used parameter set can be seen, whereby the number of calculated clusters can differ strongly when applying Leiden clustering. When using k-means clustering with a cluster number determined by the elbow method, similarities with the results of Leiden clustering can be seen (Figures S1-S4). To further validate these results, we finally analysed the metastatic breast cancer project (MBC Project) dataset, consisting of both primary and metastatic tumors of different resection sites. Using the t-SNE approach on this dataset did not lead to cluster formation for any of the data transformations (Figure 6a-c). Again, unprocessed FPKM values in combination with the UMAP did not provide any useful information about the dataset (Figure 6d). Additionally, logarithmic transformations within UMAP approaches did not form any distinct clusters (Figure 6e,f),thereby confirming the findings from the TCGA-SKCM dataset.
Further assessment of the maps in terms of local and global structure between the used transformations revealed that the local distances or neighborhoods between data points were not well preserved (Figures S5-S10). This conclusion can also be drawn from the application of two common clustering methods, namely Leiden [41] and k-means clustering [42]. Again, the dependence of the obtained results on the used parameter set can be seen, whereby the number of calculated

Discussion
t-SNE plotting and UMAP are crucial methods to identify relevant subsets in transcriptomic data consisting of bulk RNA or single cell approaches. Frequently, these subsets or clusters display distinct cellular functionalities and share commonly altered signaling pathways. For druggable pathways, translational researchers have therefore identified therapeutic implications in various cancers, e.g., RCC [23] and prostate cancer [43]. Notably, clustering approaches for the methylomic data of pediatric brain tumors already play a prominent role in the clinical routine by allowing further subtyping of cancer specimens [44]. Moreover, methylome profiles of metastatic melanoma were shown to define distinct clusters linked with the response towards immune checkpoint blockade [45].

The Impact of Data Transformation on Cluster Formation within Data Dimension Reduction
In this work, we were looking for transcriptomic similarities and differences of metastasis representing different resection sites. It has already been shown in several studies that there is no clustering of samples depending on the underlying resection site of the metastasis [46]. Nevertheless, within these studies, clustering was frequently observed [47]. These clusters were often attributed to biologically distinct subgroups in one entity-also stating preferred metastasis sites for different subgroups [1]. Additionally, there are studies showing transcriptional differences between two different resection sites [12]. Due to this, we compared the clustering results of three different datasets. Since previous analyses did not specifically investigate the transcriptomic dependency of the resection site, our approach considered not only different unbiased data dimension reduction methodssubsequently used for visual clustering-but also different data transformations. It was observed that log10 + 1 transformed data especially, frequently resulted in a clearer and more distinct cluster formation when analysed with UMAP. In line with this observation, UMAP analysis of the TCGA-KIPAN dataset showed a cluster dependency mainly based on underlying RCC histopathology. However, histopathological clustering was evident in the UMAP log10 + 1 and in the UMAP log10 data transformations as well as in the results of the corresponding t-SNE approaches.
As already shown in a previous publication, obtained clusters by using unprocessed FPKM values in a t-SNE approach yielded prognostically relevant clusters with biologically distinct characteristics for RCC [23]. These findings were also in line with the previous literature [48]. Additionally, using UMAP data dimension reduction with logarithmically transformed data of the TCGA-ACC (adrenocortical carcinoma) dataset revealed two clusters closely matching the already known ACC subgroups [22]. This suggests that histopathological and cancer subgroup-specific differences can be represented with a UMAP log10 + 1 approach, even though clusters seen within TCGA-KIPAN analysis were not completely subgroup-specific, also observed in the t-SNE plot using unprocessed data. Since t-SNE and UMAP show biologically meaningful clustering results, known histopathological or cancer entity subgroups, based on different data transformations, both data dimension reduction methods are useable and valid, depending on the underlying biological question.
Another remarkable element is the bone cluster identified within the Dream Team dataset. This cluster appears, with minor changes and depending on the area considered, in each of our analyses, regardless of the data transformation and the data dimension reduction method. Considering previous results, we conclude that all present methods have their justification and can be used depending on the research question. For example, the UMAP log10 + 1 approach is suitable for bulk RNA sequencing to identify subgroups within specific entities. However, clusters based on different histopathological tissues, for example, and thus generally showing a quite different transcriptome, can also be seen in the unprocessed data, where t-SNE plot seems to be more suitable for bulk RNA sequencing than UMAP, which in turn does not seem to be suitable for the unprocessed data of bulk RNA sequencing in general.
When looking at the number of clusters previously identified for the analysed datasets within this work, it becomes apparent that the log10 + 1 approaches were mostly in line with the previously shown results. For the TCGA-SKCM dataset, three clusters were identified in the first publication of this dataset [29]. The initial description of the NEP-WCM dataset showed three (based on the main branches of the dendrogram) different main clusters based on unsupervised clustering, overlapping with our UMAP results. Additionally, a smaller neuroendocrine subgroup inside the Dream Team dataset was described, which might be one of the shown clusters in our approach. This further proves the clusters found by unsupervised clustering in the Dream Team original publication, stating the independence of the metastasis site [27] and confirming the different molecular phenotypes of neuroendocrine prostate cancers [49]. Regarding breast cancer metastases, our results confirm previous findings showing the cluster dependency on biological subgroups rather than on the resection site [46].
Looking more closely at the differences in the resulting clusters between the different data transformations of the individual data dimension reduction methods, changes are similar to those caused by parameters such as the number of neighbors. Considering very recent research, we believe that the data transformation used is just an equally important factor to consider in the initialization of the data [25], as respective kernel transformations [26].
In order to quantify the visible differences between the different methods, and especially between data transformations, Taskesen et al. proposed a solution. This method considers the differences between the nearest neighbors of the data points in order to make a statement about the preservation of the local and global structure between different maps. In the datasets used by us, it was noticeable that the local structures, i.e., a small number of nearest neighbors, were remarkably different between the individual maps and data transformations. This shows that the data transformation used has an influence on the local characteristics of the clusters. The global structure based on many nearest neighbors, however, seems to remain the same between data transformations. This circumstance is most apparent when looking at the TCGA-KIPAN datasets, as t-SNE and UMAP, with and without data transformation, provide visibly different results, but, nevertheless, a clustering based on the histopathological subgroups. Nevertheless, quantitative analysis of the nearest neighbors showed that the local structures of the clusters differ between the data transformations and the individual methods. This problem becomes even more relevant as t-SNE and UMAP are also used for the analysis of single cell sequencing, in which the smallest transcriptional differences can have major effects on the representation and subsequent interpretation or further analysis. In addition to the method, the data transformation used represents one further important parameter in the representation of clusters and local distances [50].
Consequently, our results suggest that a more in-depth investigation of data transformations and visualization methods are necessary to further assess the nature of the obtained clusters.

Primary Tumors and Metastases of the Same Entity Share Common Transcriptomic Features
Our findings that primary and metastatic tumors share common transcriptomic features and are inseparable when analysed with data dimension reduction methods appear to match with previous research. In metastatic pancreatic adenocarcinoma, a distinction between primary tumor and metastatic tumor cells was not possible using single cell RNA sequencing [51,52]. This could also be seen in breast cancer single cell RNA sequencing comparing lymph node metastasis with primary tumors [53], which is in line with our findings regarding the MBC project dataset, not forming visual clusters in any considered approach. The presented results support the linear progression model to some extent, at least for the transcriptomic differences between metastasis and primary tumors, indicating the need for further research to combine genomic alterations with transcriptomic features to clarify the (clonal) evolution of metastasis. In conclusion, our results suggest that there is no general transcriptomic dependency on the resection site for metastasis of the same primary tumor and that obtained clusters can be mostly attributed to existing subgroups. The genetic diversity, using bulk sequencing and analytical deconvolution, is a major hallmark of cancer in general. Premetastatic and pre-treatment diversity can help to predict the clinical and evolutionary outcome of the disease. Nonetheless, the regulatory wiring that underpins the metastatic process is likely to dynamically change across the transcriptional landscape.
One limitation of our work is the fact that our conclusions refer to a recurrent observation based on limited datasets. To show a quantifiable statement regarding the performance of t-SNE and UMAP with respect to data transformations, a comprehensive analysis has to be performed in future studies.

Addressing Pitfalls in Visual Clustering
To address these challenges, we propose an additional standard legend for visual clustering approaches based on data dimension reduction methods and machine learning, as represented by the UTMC legend in all figures of this work. The information required by this additional information includes the unit (U) (such as FPKM, TPM, RPKM, or read counts), data transformations (T), represented visualization or data dimension reduction method (M), and, if applicable, the applied cluster identification algorithm (C). This enables the reproducibility of figures and analyses and makes visual clustering approaches much more transparent.
Taken together, our work further extends the knowledge of tumor heterogeneity in different biological contexts [54], by providing sufficient evidence for the linear progression model of metastasis, since no dependency of clusters based on resection site was observable in any of the three considered datasets. The applied transformation tended to have the biggest impact on clustering results, and thus needs more in-depth analysis. Nevertheless, our results cannot identify a favourite approach, as all of them appear to properly address different questions. Transformed data, independent of the data dimension reduction method, tend to visualize subgroups very specifically, whereas using unprocessed data in t-SNE seems to be closer to the biological nature of samples, demonstrating the need for further research in this area.

Conclusions
Using two different data dimension reduction methods, we showed that there was no visual association between the resection site and the transcriptome for three considered metastatic datasets. Instead, there was a significant dependence of clustering according to data transformation and the data dimension reduction method applied. Additionally, the analysis of primary and metastatic samples of specific entities did not show distinct clusters or visible differences. Combining recent works and the results of our study, visual clustering seems highly vulnerable towards data and parameter alterations. To avoid pitfalls in analyzing visual clustering and to enhance reproducibility, we recommend extending the standardized nomenclature, e.g., by adding the UTMC legend introduced in this manuscript.