Portraying the Expression Landscapes of B-Cell Lymphoma-Intuitive Detection of Outlier Samples and of Molecular Subtypes

We present an analytic framework based on Self-Organizing Map (SOM) machine learning to study large scale patient data sets. The potency of the approach is demonstrated in a case study using gene expression data of more than 200 mature aggressive B-cell lymphoma patients. The method portrays each sample with individual resolution, characterizes the subtypes, disentangles the expression patterns into distinct modules, extracts their functional context using enrichment techniques and enables investigation of the similarity relations between the samples. The method also allows to detect and to correct outliers caused by contaminations. Based on our analysis, we propose a refined classification of B-cell Lymphoma into four molecular subtypes which are characterized by differential functional and clinical characteristics.


Introduction
Cancer is a complex disease caused by the deregulation of gene activity. Carcinogenesis and -progression is accompanied by dysfunctions on multiple layers of the cellular machinery. They are affected by a large number of different genetic and epigenetic factors. In recent years, large-scale studies such as The Cancer Genome Atlas (TCGA) [1,2], The Cancer Cell Line Encyclopedia [3] or the International Cancer Genome Consortium (ICGC) [4] were undertaken aimed at characterizing cancer on the molecular and cellular level. These studies allowed to discover the heterogeneity of the underlying regulatory mechanisms and to assign them to molecular cancer subtypes.
On the one hand, high-throughput technologies such as whole genome transcriptional profiling presently revolutionize molecular biology and provide an incredible amount of data. On the other hand, these techniques pose elementary methodological challenges simply by the huge and ever increasing amount of data produced [5 8]: researchers need adequate tools to extract the information content of the data in an effective and intelligent way. This includes algorithmic tasks such as data compression and filtering, feature selection, linkage with the functional context, and proper visualization. Especially, the latter task is very important because an intuitive visualization of massive data clearly promotes quality control, the discovery of their intrinsic structure, functional data mining and finally the generation of hypotheses. We aim at adapting a holistic view on the gene activation patterns as seen by expression studies rather than to consider single genes or single pathways. This view requires methods which support an integrative and reductionist approach to disentangle the complex gene-phenotype interactions related to cancer genesis and progression.
With this motivation we apply Self-Organizing Maps (SOM), a machine-learning clustering approach [9], to a large-scale patient expression data set of mature aggressive B-cell lymphomas published previously [10]. Our approach simultaneously searches for features which are differentially expressed and correlated in their profiles in the set of samples studied [11]. We include functional information about such co-expressed genes to extract distinct functional modules inherent in the data and attribute them to particular types of cellular and biological processes such as inflammation, cell division, etc. [12]. This modular view facilitates the understanding of the gene expression patterns characterizing different cancer subtypes on the molecular level. Importantly, SOMs preserve the information richness of the original data allowing the detailed study of the samples after SOM clustering [11].
A central role in our analysis is played by the so-called expression which serve as intuitive and easy-to-interpret fingerprints of the transcriptional activity of the samples. Their analysis provides a holistic view on the expression patterns activated in a particular sample. Importantly, they also allow identification and interpretation of outlier samples and, thus, improve data quality.
Our application of SOM machine learning to lymphoma expression data aims at characterizing the heterogeneity of the genome wide expression landscapes and at describing the molecular cancer subtypes. In particular, we will demonstrate the capabilities of our strategy to intuitively visualize the individual samples as well as the subtypes in terms of individual and group-averaged portraits, respectively. We show how to extract functional information from the data and appropriately incorporate it into the analysis strategy. Further, we describe how to detect and to correct outlier samples using their portraits. Finally, we propose a more detailed molecular subtype classification of the lymphoma samples.

Expression Data and Preprocessing
Microarray data of lymphoma are available under GEO accession number GSE4475 (data from 221 Affymetrix HG-U133A arrays). This study used biopsy specimens of mature aggressive B-cell lymphoma in which at least 70 percent of all cells were tumor cells. The classification of lymphoma samples into different subtypes is used as provided by Hummel et al. [10]: Of all 221 lymphomas, 44 are assigned to the mBL (molecular lymphoma) signature and 129 to non-mBL signature. 48 cases form an intermediate group, representing the transition zone between the mBL and non-mBL groups.
For a convention we use the data as numerical matrix of dimension N × M where N is the number of genes measured per sample and M is the number of samples in the study. Throughout this paper a row of this matrix will be termed of the respective gene. The columns on the other hand will be termed referring to one sample studied. Raw probe intensity values of Affymetrix arrays were calibrated and summarized into one expression value per probe set using the hook method [13,14]. To ensure comparability, we applied quantile-normalization to the samples [15]. It transfers the expression states of all samples into one common distribution. Then, the expression values of each gene were transformed into log10-scale and centered with respect to the mean expression value of the particular gene averaged over all samples in the study [11]. This translates the expression data into fold change units and will be addressed as logFC = e i,m , the relative log-expression of gene i in sample m. Hence, a e i,m of zero means that the gene is expressed according to its mean expression value. Positive and negative values refer to over-and under-expression in the series of samples, respectively.

SOM Training
The preprocessed expression values e i,m are used to train a Self-Organizing Map (SOM). It translates the high-dimensional N × M expression data matrix into a K × M metadata matrix (K: number of so-called metagenes, in literature also referred as or of the SOM) of reduced dimensionality K << N (N = 22,283 and K = 2,500). The corresponding relative log-expression values of the metagenes will be termed e k,m meta . The metagene expression profiles (in literature also named or are adapted in the iterative machine learning process to optimally cover the data space once the training is completed ( [9], see [16] for detailed illustration). Therefore, the metagene profiles are slightly altered in each iteration such that they resemble the input gene profiles more closely. Each metagene serves as a representative prototype of a cluster of real genes with similar expression profiles. The metagenes expression profile in turn approximately resembles the average profile over the associated real genes. Note that during training, the association of genes to the metagenes is not fixed and alters in a self-organizing process with the effect that the degree of similarity between metagenes decreases with increasing distance in the trained map.
Our SOM method was configured to enable the robust identification of spot modules inherent in the data (see below). Details were described previously [11,12,17]. In short: We have shown that the particular choice of the grid topology (e.g., rectangular or hexagonal) and of the map size (if chosen between K = 30 × 30 and K = 60 × 60 metagenes) is not crucial for downstream expression analysis. It provides almost identical results in terms of the expression patterns identified (see [11] in the supplementary material, and [16]). Variation of the SOM-size in reasonable limits can slightly alter the smoothness of the expression landscapes observed but not their basal properties required for further analysis [11,16]. Our choice of SOM size is further supported by an independent heuristic based on the two largest eigenvectors to estimate the map size [18]: The use of its implementation in toolbox returns an optimal SOM size of K = 42 × 28 metagenes.
In this application, we used a two-dimensional grid of size K = 50 × 50 metagenes and of rectangular topology, Gaussian neighborhood function [11,16], and the implementation of the algorithm in the R-package [19].

SOM Staining
Each meta-state is described by the K expression values in the columns of the metadata matrix. They are arranged according to the underlying metagene grid and visualized by an appropriate color gradient: dark red reflects strong over-expression; yellow and green tones indicate intermediate levels with low or no differential expression; and blue corresponds to under-expression. The color patterns emerge as smooth textures representing the fingerprint of transcriptional activity of each sample. Please note that the assignment of the genes to metagene clusters and therefore also their position in the SOM is identical in all sample portraits. Hence, the coloring at a certain position in the map refers to the same genes in all individual portraits allowing the direct comparison of their expression levels between the maps.
Subtype-specific mean portraits are calculated and visualized as the mean value of each metagene averaged over all sample portraits belonging to one subtype. They reflect subtype specific expression patterns while leveling out the heterogeneity of the individual expression states and outliers.

Detection of Expression Modules: Spot Selection
The SOM algorithm arranges similar metagene profiles in neighbored tiles of the map whereas more different ones are located more distantly. Adjacent metagenes thus tend to be colored similarly and the obtained mosaic portraits show typically smooth patterns with red and blue spot-like regions referring to clusters of over-and under-expressed metagenes, respectively. Metagenes located in the same spot are concertedly expressed across the samples studied. Consequently, distinct and well-separated spots in one sample collect genes of different expression profiles although concertedly over-expressed (or under-expressed) in this particular sample. Each spot can consequently be interpreted as a disjunct expression module of a group of metagenes (and of associated single genes) showing a unique expression profile in the data set studied.
We define over-/under-expression spots by applying a simple 98th/2nd-percentile criterion as described and verified in [11,12,20 23]. It selects the respective fraction of metagenes showing largest/smallest expression in each sample. The percentile criterion chosen allows selection of a sufficient number of candidate genes per spot on one hand and a sufficient number of relevant spots on the other hand. Moderate modifications of the percentile criterion used are uncritical with respect to the final results obtained. All spots detected in the individual portraits are transferred into one master map to visualize the global spot patterns of the series evaluated. This provides a simple and intuitive approach for the detection of expression modules inherent in the data. Note that this detection of spot modules provides gene clusters in an unsupervised fashion without necessity for prior definition of prototypes or cluster numbers.
We further implemented and verified complementary methods of spot selection using different metrics and algorithms such as k-Means and hierarchical clustering based on Euclidean distance between the metagene profiles, and seed-clustering based on their pairwise correlation coefficient [11,12,16]. The basal functional impact of the modules obtained is virtually independent of the particular method of spot selection used although the spots can differ considerably in the number of genes and the area of the map included. We here apply the over-/under-expression spot selection method because it selects lists of strongly differentially expressed genes. Such lists are of particular interest not only in our analysis but also in numerous gene expression studies aiming to detect marker genes.

Enrichment Analysis
Co-expressed genes of each spot module can be assumed to be functionally related according to the -by-principle [24]. Functional analysis aims at identifying the functional context of these expression modules.
We use different approaches to estimate the enrichment of groups of predefined genes (so-called gene sets) in gene lists obtained independently, for example from SOM-spot analysis (see [25] for a critical review). Enriched gene sets indicate an association between their context and the system studied. A large and diverse collection of such gene sets can be derived from the Gene Ontology (GO) annotation database [26] using the interface [27]. In particular, a total of 5,154 gene sets are included in our analysis according to the following categories: (i) 1748 GO gene sets subdivided into GO-terms (1,102 sets), (387 sets) and (259 sets); (ii) pathways referring to Biocarta (217 sets), KEGG (186 sets) and Reactome (430 sets) databases; (iii) curated gene sets taken from the literature on chemical and genetic perturbations 2,439 sets); (iv) tissue specific gene sets (25 sets) derived previously from a gene expression study on healthy human tissues [12]; and (v) gene sets taken from the literature on various cancer types and subtypes (109 sets).
Under the term we here subsume and combined analyses (see references [12,28] for a detailed discussion). In our approach, overrepresentation estimates the probability of finding more members of a given gene set in a particular spot cluster compared with their random appearance, independent of their expression values. Right-tail modified Fisher exact test and the hypergeometric distribution then provide a p-value for each predefined gene set in each spot. This p-value reflects the overlap between the genes in a spot cluster and the gene set given a certain total number of genes studied [12,29,30]. We considered overrepresented sets with p < 10 5 which ensures reasonable adjustment for false positives in multiple testing. In particular, this criterion applies Bonferroni adjustment p < /n, where denotes the desired significance level and n is the number of single tests. With = 0.05 and n 5000 (number of gene sets tested) one obtains p < 10 5 . Note that Bonferroni adjustment represents a conservative approach minimizing the family wise error rate (see e.g., [31]). However, it applies to statistically independent tests, a requirement which is not given for many gene sets used because they contain a high percentage of overlapping genes. The criterion applied therefore provides a conservative lower limit of acceptable gene sets.
As a second approach, -defines the deviation between the mean expression value averaged over the set-members compared with the mean expression value of all genes measured in a sample. The so-called gene set Z-score (GSZ) combines both options of gene set overrepresentation and overexpression approaches [12,32]. In particular, the GSZ-score for the list of all genes studied is given by denotes the mean expression of all genes and the denominator defines the respective standard error (for a detailed description see [12]). We use the GSZ-score to profile enrichment of a selected gene set across all samples and cancer-subtypes studied.
In addition to overrepresentation in spots and the GSZ-profiles, we generate gene set population maps. They visualize the distribution of the genes of a selected set in the SOM grid by appropriate color coding of the number of set members assigned to each metagene. It ranges from white (no gene) to maroon (maximum number of genes per tile observed for the particular gene set). Recall that each gene refers to one and the same metagene in all samples and thus occupies a fixed position in all SOM portraits allowing comparison between gene positions and spot positions in reference to specific functional modules.

Sample Similarity Analysis
Sample similarity analysis aims at evaluating mutual relations between the samples studied. There are various established approaches, for example to extract a hierarchy of similarities, to estimate mutual distances between the expression states or to assess the main sources of variance in the data.
Here, we use three different metrics, namely statistical dependence, Euclidean distance and Pearson correlation, which are applied to the metadata instead of to the original gene . Similarity analysis consequently compares the expression meta-states as characterised by the SOM portraits. The usage of metadata as the basal data has the advantage of improving the representativeness and resolution of the results as shown previously [11,20,33].
Independent component analysis (ICA) [34] is applied to the SOM-metadata using the R-package [35]. It distributes the samples in the space spanned by the components of minimal mutual statistical dependence. These components point along the directions of maximum information content in the data which is estimated by their deviation from a (non-informative) normal distribution [34].
As a second option, we apply the neighbor-joining algorithm (R-package [36]) to visualize similarity relations based on the Euclidean distances between the samples as similarity trees [37]. The distances between pairs of samples in the tree are in scale. It allows to identify h-clusters and to estimate the degree of mutual dissimilarity between them.
Pearson product-moment correlation is the third metric we use. It calculates correlation coefficients between the metagene states for all pairwise combinations of samples. The resulting quadratic correlation matrix is visualized by coloring the correlation values in the pairwise correlation map (PCM) using a color gradient ranging from red for positive correlation to blue for negative correlation. The correlation network (CN) additionally translates the correlation matrix into a graph structure. This undirected graph is constructed by connecting the nodes (i.e., the samples), whose pairwise correlation coefficient exceeds a given threshold. Here, we chose r threshold = 0.5, which ensures a relatively sparse but still fully connected graph structure (see description in Supplementary File 1). It provides a network-like overview about the correlation structure of the expression landscapes of the samples. It is capable to intuitively display multivariate relations in contrast to univariate dendrograms. The lengths of the edges in the CN approximately scale inversely with the respective degree of correlation.

Correction of Biased Data
Systematic deviations give rise to biased data. They can be caused, for example, by individual specifics of the expression characteristics of the patients not related to the disease, by the inaccurate biopsy of the tumor cells leading to contaminations of the samples with healthy tissue or by systematic variations in the sample preparation process. Inspection of the individual SOM portraits combined with similarity and gene set enrichment analyses provide a framework of hand-in-hand options to detect and to correct strongly biased samples: Firstly, outlier spots can be detected in the portrait gallery and subsequently analyzed for their functional context in terms of overrepresented gene sets. Secondly, outlier samples can be identified in the correlation network similarity plot and then further evaluated by functional spot analysis. Gene sets found to be associated with outlier spots and/or samples are then simply excluded from further analyses leading to corrected expression portraits. This procedure can be repeated for different putative sources of systematic errors.

Molecular Subtypes Derived from Prototype-Guided k-Means and from Consensus Clustering
Identification of distinct molecular phenotypes is a common and important question in cancer research. A previous study of lymphoma data classifies the samples into the main subtypes molecular Burkitt s Lymphoma (mBL), non-mBL and an intermediate group [10]. Both, inspection of SOM portraits and their similarity analyses suggests the further refinement of this classification into four subtypes, namely mBL*, non-mBL*,

intermediate-A and intermediate-B (see below).
We applied a modified prototype-k-Means clustering of the metadata to segregate the samples into these four subtypes. k-Means is an iterative algorithm which iteratively assigns the samples to so-called cluster prototypes showing the minimal mutual Euclidean distance and subsequently computes new prototypes as the centroids of the members of each cluster [38]. k-Means requires predefinition of a desired cluster number, while the initial prototypes are usually chosen randomly or initialized from the data [39].
The SOM portraits now constitute another option to initialize the prototypes: they can be established using selected expression patterns observed in the portraits such as the most prominent overexpression spots. Particularly, we define initial prototypic expression portraits showing a selected spot pattern for each subclass with values k,m meta for metagenes within the spot and for metagenes outside. These prototypic spot patterns are then used to assign the samples to the respective clusters in the standard k-Means algorithm. Then, a bootstrapping approach is used to estimate the robustness of the assignment of samples to the subtypes. Therefore, k-Means clustering is repeatedly applied to a subset of samples chosen randomly from the complete set of samples. The mean metagene expression states of the subtypes are used as initial cluster prototypes. The fraction of proper assignments of samples in agreement with their actual class assignment then defines a robustness score of each sample: a bootstrap stability score of means that the respective sample is always found in the correct subtype, while a score of means that the sample is assigned properly in only 50% of the resampling repetitions.
In addition, we applied consensus clustering [40] to validate the results of our k-Means approach by an independent method. Consensus clustering aims at reaching a consensus on the number of classes in the data and at judging reliability of the class assignment of the samples. We applied the R-package [41] for portioning the samples into k classes using hierarchical clustering with k ranging from two to six. For each k, one obtains a consensus matrix, reflecting the fraction of common class memberships for all pairwise combinations of samples estimated in a series of resampling runs (details are given in [40]). It is visualized by means of a clustered heatmap collecting samples frequently found in one class into blue squares along the diagonal. The cumulative distribution function (CDF) aggregates the consensus values up to a certain fractional co-occurrence of sample pairs. The CDF thus reflects the of of a consensus matrix using one curve such that clusterings with different k can be directly compared with the purpose to identify the optimal class number [40]. The incremental change between CDF curves with increasing k serves as a measure to judge whether increasing the class number leads to a marked increase of clusters stability or not.

Additional Expression and Phenotypic Data
Gene expression data from germinal center B-cell line samples and tissue samples of tonsils were taken from reference [42]. After preprocessing as described above, these data were co-trained with the lymphoma data to evaluate the cell of origin characteristics of the lymphoma samples. Recently published patient phenotypic data were used to characterize the newly defined subtypes in the cohort studied [43]. These included data from immunohistochemical staining against CD10, BCL2, BCL6, MUM1, data from interphase fluorescence in situ hybridization (FISH) for IGH, MYC, BCL6 and BCL2 loci, overall survival, age and gender.

SOM Expression Portraits of Lymphoma Samples and Subtypes
SOM machine learning transforms the whole genome expression pattern of the single genes into metagene expression data. Thereby, the number of single genes exceeds the number of metagenes by about one order of magnitude (N = 22,283 and K = 2,500). We visualize the expression meta-state of the samples as mosaic images, consisting of 50 × 50 tiles each representing one metagene. These metagenes serve as representatives of clusters of co-expressed single genes the number of which usually varies from metagene to metagene. The color gradient of the portraits was chosen to visualize over-and under-expression of the metagenes in each particular sample: red to green colors indicate over-expression with decreasing strength, while blue to green colors indicate under-expression. The colored texture of each mosaic thus individually characterizes the gene expression landscape in each sample. Figure 1 shows the expression portraits of selected lymphoma samples arranged according to their previous classification into subtypes [10]. The individual portraits reveal a handful of clusters of co-expressed metagenes frequently observed. These so-  We generate mean subtype portraits by averaging the expression values of each metagene over all the subtype members. This averaging cancels out the highly fluctuating, individual features and, thus, amplifies consistent subtype-specific features. In support of the observations from the individual portraits we found that the mBL and non-mBL subtypes are characterized by two spots in opposite corners of the map: one spot in the top-right corner is over-expressed and the other one in the bottom-left corner is under-expressed in mBL samples and vice versa in non-mBL samples, revealing the antagonistic character of their expression patterns. These subtype-specific spots collect highly populated, highly variable and well resolved metagenes (see Supplementary File 1).
In summary, SOM expression portraits reflect the individual expression landscapes of each sample in terms of characteristic color textures which enable visual perception of subtype-specific spot-like features representing clusters of differentially and co-expressed genes.

Characterizing the Expression Modules: Spot Analysis
Standard analysis tools usually evaluate the whole expression states of the individual samples to perform similarity or cluster analyses, or to generate lists of differentially expressed genes. Such global comparisons might overlook subtle effects due to individual properties of small groups of genes. These details are however projected into the color textures of the individual SOM portraits which change from sample to sample and can be assessed by means of feature selection (see [12] for a detailed review). The most prominent patterns are the over-and under-expression spots formed by neighboring metagenes of similar profiles which, in turn, represent clusters of correlated and thus potentially co-regulated genes strongly over-and/or under-expressed in a subset of samples.
We analyze the spot patterns in order to identify specific properties of the lymphoma subtypes. Figure 2a shows the so-called over-expression summary map which collects all over-expression spots observed in the individual sample portraits into one master map (see also [11]). Each disjunctive region of this map exceeding the 98th-percentile threshold defines one global overexpression spot. It represents a distinct expression module inherent in the data. In total, we identified 23 over-expression spots labeled with capital letters (Figure 2b). Please note that our spot selection algorithm neglects the abundance of each spot in the individual portraits and identifies both rare (e.g., observed in only one sample) and frequent spot modules. The over-expression heatmap in Figure 2c visualizes the spot expression profiles, i.e., the mean expression level of the metagenes in each of the spots across all samples. The colors range from blue representing the lowest mean expression values, to red representing the highest values. The samples are arranged according to their subtype classification. The heatmap provides an overview of the degree of subtype-specific expression in each of the spot modules. For example, spots and partly also spot are selectively over-expressed in samples of the mBL subtype, while spot is characteristic for the non-mBL subtype. Contrary, more ubiquitous spots as as well as rare spots as or lack of subtype-specific overexpression. Note that frequent spots are usually located in the peripheral part of the map (i.e., in the corners and along the edges) whereas rare spots tend to accumulate in the central part. We use the spot information and the mean subtype portraits to assign subtype labels to the most prominent and specific spot modules (Figure 2a): Spots and are ascribed to mBL while spot is prominent in non-mBL. Those three spot modules contain marker genes over-expressed in the respective subtypes as validated below. Spots and also frequently observed in the sample portraits, are assigned to the intermediate subtype. Interestingly, they constitute two alternative intermediate states located in between the main-subtypes mBL and non-mBL. They are characterized either by spot or by spot as indicated by the arrows in Figure 2a. Please note that the training algorithm distributes the metagenes in such a way that strongly correlated profiles are located at adjacent positions in the map whereas metagenes with anti-correlated profiles tend to occupy more distant regions, e.g., in the opposite corners of the map. This rule also applies to the spots detected. In order to discover the covariance between the spot modules we calculated Pearson correlation coefficients for all pairs of spot profiles. It turned out that, as a rule of thumb, neighboring spots are strongly positively correlated and spots located in opposite corners of the map are often strongly anti-correlated. The results of this correlation analysis are visualized in Figure 2b. One sees that, for example, the mBL marker spots and are highly correlated and usually appear together in the sample portraits whereas the anti-correlated over-expression spots and will not be observed together in the same expression portrait.
For this dataset, we also detected 11 global under-expression spots emerging as blue regions in the SOM portraits. The under-expression summary map is shown in Figure 2d. Position and size of most of the detected under-expression spots agree with those of the over-expression spots. Hence, overexpression of the respective metagenes in part of the samples changes into under-expression in other samples. For the analyses described in this paper, we therefore use only the over-expression spots detected without loss of essential information. Interestingly, virtually no blue under-expression spot was detected in the central area of the map indicating that the rare over-expression spots do not show this dualism. Below we will show that these spots potentially constitute clusters of outlier genes the expression of which is affected by bias effects.
In summary, the heterogeneous expression patterns observed in the individual portraits can be condensed to a few major expression modules represented by over-and under-expression spots. This way the relevant dimension of the data set is reduced by three orders of magnitude from about 20,000 single genes to approximately 12 spot modules.

Mining the Functional Context: Gene Set Enrichment Analysis
Each global overexpression spot module represents a cluster of potentially co-regulated genes. We applied gene set overrepresentation analysis to each spot-cluster taking into account a collection of more than 5,000 predefined gene sets referring to different GO-categories, pathways, diseases, human tissues and specific cell experiments (see methodical section). For each spot we obtained a list of gene sets ranked with increasing p-value estimating the probability that genes of the set are found within the spot by chance.
Based on the functional context of the overrepresented sets obtained we assign a short notation to each of the spots (see Figure 3a). Some spots are obviously related to processes associated with general hallmarks of cancer such as and (spots and respectively). Panel b of Figure 3 depicts the GSZ-expression profiles (left part) and the population maps (right part) of those two leading gene sets. The profiles clearly reflect the fact that the respective processes are selectively over-or under-expressed in a subtype-specific fashion. While is activated in the non-mBL subtype, genes annotated to the gene set are active in the mBL subtype. The respective gene set population maps reveal that the associated genes accumulate in the regions of spots overexpressed in the respective subtype, as expected. GSZ-profiles and population maps are shown for gene sets accumulating in the mBL and non-mBL specific overexpression spots as indicated by the red ellipses (panel b), for mBL-vs-non-mBL signature sets published previously [10] (c) and for sets accumulating in rare spots (d).
Neighboring spots of strongly correlated profiles can be assigned to related biological processes: the spot is surrounded by spots assigned to factor and according to the most overrepresented gene sets in each of the spots. Note that, although related, these neighboring spots are usually characterized by subtle differences in their expression profiles and presumably also by fine differences in the functional context of the overrepresented gene sets. Population maps and overexpression spot maps therefore represent complementary tools for discovering the functional context of the expression landscapes. The results so far show that the lymphoma samples split into pairs of subtypes differing by the antagonistic activation of processes related to and on one hand and to and the and translational on the other hand (non-mBL-vs-mBL). To validate the subtype-specific spot patterns identified above, we included the signature set that differentiates between mBL and non-mBL subtypes provided by Hummel et al. [10] (see Figure 3c). As expected, genes of this set clearly accumulate in the subtype-specific spots and assigned to mBL and non-mBL, respectively.
Another important question is about the possible origin of the rare spots in the central part of the map. In Figure 3d, we show the characteristics of two gene sets related to tissue specific gene expression in tonsils [11,12] and to drug response metabolism, cytochrome P450 see [44]), respectively. Their genes strongly accumulate in localized regions of the map agreeing with the positions of the rare spots and respectively.). Both gene sets are overexpressed in only few samples suggesting that the respective samples are outliers contaminated either with healthy tissue or affected by patient specific medication. Both effects are not related to the cancer studied and thus reflect systematic biases of the respective expression patterns.

Analyzing the Sample Similarity Space
We applied two standard sample similarity analyses, namely independent component analysis (ICA) and neighbor-joining clustering (NJ), to visualize and to analyze the mutual relations between the samples. In the two-dimensional ICA-plot shown in Figure 4a, the samples distribute along the first two components of minimal statistical dependency. It reveals basically three clusters referring to the three subtypes, however without clear boundaries limiting them. It also shows that the three subtypes mainly separate along the IC1-coordinate, whereas intra-subtype variability mainly spreads along the IC2-coordinate. The NJ algorithm visualizes sample similarity relations as seen by Euclidean distances. The obtained star-like dendrogram shown in Figure 4b identifies -clusters containing mostly samples of the same subtype. Interestingly, each of the different subtypes distributes over more than one of such bush-like branches reflecting its intrinsic heterogeneity in terms of disjoint clusters.
In summary, the ICA-analysis allows for estimating the mutual dependence of the expression changes associated with the different subtypes. We found a one-dimensional distribution of the lymphoma subtypes, supporting the classification into the three subtypes considered so far. The transversal heterogeneity however remains unconsidered in this case. The NJ-dendrogram, on the other hand, reveals finer details in terms of disjunct substructures potentially reflecting a finer granularity of subtype clusters.

Sample Correlation Structure
coefficients for all pairwise combinations of samples. The pairwise correlation map (PCM) given in Figure 5a visualizes the correlation coefficients for all sample pairings which are arranged according to their subtype assignments (see the color bars along the borders of the map). The compact red square of mBL sample couples reflects the strong similarity between their expression landscapes whereas the blue off-diagonal area formed between the mBL and non-mBL samples indicates their anti-correlated expression states. Note that the pairings between non-mBL samples, although correlated, reveal a much more fuzzy pattern due to the more heterogeneous expression states compared to the mBL subtype. The samples of the intermediate subtype either correlate with the mBL or non-mBL samples or with both in some cases.
The correlation matrix can be transformed into the correlation network (CN) shown in Figure 5b. In this graph representation, the samples are represented by nodes connected by edges if the mutual correlation coefficient exceeds a certain threshold. The length of the edges approximately inversely scales with the respective correlation strength. Visual inspection of the CN shows that the mBL and non-mBL samples accumulate into well separated clusters whereas samples of the intermediate subtype heterogeneously spread over the region between these two clusters. Interestingly, these intermediate samples distribute along two disjunctive branches of the CN, which both link the mBL and non-mBL clusters. These two separate branches also include a fraction of the mBL and non-mBL samples (see the purple lines in Figure 5b roughly separating the clusters and branches). This distribution of the intermediate subtype samples reflects the heterogeneous spot characteristics of the subtypes as discussed above.
A few samples are located far away from their subtype-specific cluster and/or from the majority of the other samples in the CN. Those samples are usually characterized by rare or unique spots as indicated in Figure 5b. We will address this issue in the next section more in detail.
of alternating compact and more fuzzy clusters. The former ones refer to the main subtypes and the latter ones to two distinct groups of samples mainly assigned to the intermediate subtype. The mutual correlation analysis as seen by the CN in combination with the SOM portraits thus provides additional information complementing the other similarity analyses applied.

Detection and Correction of Outliers
Inspection of the CN in Figure 5b reveals a series of samples which are located outside of the main network body. The portraits of these outlier samples reveal overexpression spot patterns deviating from the subtype specific patterns identified in terms of their mean SOM portraits. Particularly the spots and are identified in the outlier sample portraits (red circles in Figure 5b; see Figure 2b for spot-letter assignments). Here, we exemplarily focus on spot , located in the bottom-left region of the SOM and strongly overexpressed in samples MPI-002, MPI-208 and MPI-213 (see Figure 5b). The topmost enriched gene set in this spot is the -set. It was extracted as the tonsil-signature from a large expression data set of healthy human tissues previously analyzed with our SOM pipeline [11,12]. Enrichment of this set suggests that overexpression of spot is caused by contamination of the tumor biopsy with adjacent healthy lymph node tissue.
Panel a in the left part of Figure 6 shows the GSZ-profile and the population map of the set. The GSZ-profile reveals very strong overexpression of the set in a number of samples independent of their subtype assignment. The corresponding genes mainly accumulate in spot Selected samples which possess this particular spot in their portraits are shown in Panel c. They can already be identified as potential outliers by simple visual inspection of the SOM portrait gallery (Supplementary File 2). We highlighted the samples in the GSZ-profile (Panel a) and in the CN (Panel b) by arrows. Note however that not all of these samples protrude as clear outliers in the CN. Despite the strong overexpression of the contamination spot the overall expression state of e.g., samples MPI-208 and MPI-213 obviously resemble those of the unbiased samples. In a simple correction step we removed the genes included in the outlier spots from the whole data set (see red circle in the population maps in Figure 6). This procedure can be repeated for the other contamination spots identified: For example, spot was found to be related to drug metabolism see Figure 3d and sample MPI-090 in Figure 5b), presumably due to individual medication of the patient. Spots show an intense increase in expression of the G-antigen-family for unknown reasons (samples MPI-060, MPI-061 and MPI-195 in Figure 5b).
After removing strongly biased genes from the training data, we generated a new SOM. Note that, depending on the purpose, also re-evaluation of only parts of the analyses may be sufficient. The right part of Figure 6 shows the results after correction for tonsil-contamination accumulated in spot The corresponding GSZ-profile shows a more uniform expression of the gene set after correction. The respective sample portraits now show the characteristic spot signatures of the respective subtypes, i.e., of mBL for MPI-002 and non-mBL for MPI-208 and MPI-213. Especially the outlier sample MPI-002 is now located within the mBL cluster in the CN, such that it attains a more compact shape.
In summary, the combination of individual portraits, enrichment analysis and the correlation network provides a framework for easy and intuitive detection of outlier spots and samples. After correction, more reliable expression landscapes of the samples are obtained.

Alternative Subtyping of B-Cell Lymphoma
Our analysis so far suggests that the samples assigned to the intermediate subtype split up into two separate branches which also include samples previously assigned to the mBL and especially the non-mBL subtypes. These two branches are characterized by overexpression spots in the bottom-right and top-left part of the expression portraits, respectively (compare the first and the second row of the intermediate sample portraits in Figure 1). Note that these spot modules are frequently overexpressed in the intermediate-type samples (see spots and Figure 2a c). Both, neighbor-joining clustering and correlation network analyses clearly show two distinct sample groups forming two continuous transition ranges linking the compact mBL and non-mBL clusters. These transition ranges include samples of the intermediate and also of the mBL and non-mBL types (Figures 4b and 5b). These results suggest the existence of four subtypes partly differing from the classification into three subtypes discussed so far. In order to further verify this hypothesis, we applied our prototype-guided k-Means algorithm to cluster the samples into four groups (see methods section). The algorithm uses initial prototypes of the expression landscapes which are given by artificial spot patterns referring to the four desired subtypes: spot K initializes the new mBL-like subtype mBL*, spot O the non-mBL-like subtype non-mBL* and spots J and Q the two new intermediate subtypes  In the next step, we compare the robustness of the old and new subtype cluster assignments by applying the bootstrap clustering approach described in the methods section. It returns the bootstrap stability score for each sample in the range of [0, 1] for unstable to very stable assignments. For the previous classification into three subtypes, the stability scores of the intermediate and mBL subtype samples show a broad distribution with scores of 0.5 and below (see Supplementary File 1 for details). The new four subtype classification is clearly more robust, reflecting a more consistent and stable clustering of the samples. Only a small number of relatively uncertainly assigned samples are found even in the transition ranges between the different clusters.

Consensus Clustering of B-Cell Lymphoma
To further validate our new subtypes we applied consensus clustering to estimate the optimal number of classes in the lymphoma data by an independent method which assumes class numbers k ranging from two to six. Figure 8a c shows the heatmaps of the consensus matrix for two to four classes, respectively. Pairs of samples, robustly assigned to the same cluster, accumulate within one of the blue squares along the diagonal of the heatmap. The two-class approach basically divides the samples into an mBL-like and a non-mBL-like cluster (Figure 8a). The three-class approach essentially splits the samples into the mBL/intermediate/non-mBL subtype structure as proposed in [10] (Figure 8b).
The four-class consensus clustering resembles our new subtype classification with the two intermediate subtypes (Figure 8c). The five-and six-cluster approaches virtually do not change this result: the additional fifth and sixth clusters collect only one and three outlier samples, respectively (data not shown).
The cumulative distribution functions (CDFs) allow judging the incremental gain of increasing the number of clusters (see Figure 8d). The obtained CDFs support the four-class approach: the CDF converge for k > 3 showing only small incremental changes with further increasing k. Note that the increment between k = 4 and 5 is caused by a single-sample cluster. Hence, consensus clustering confirms our four-subtype classification.

Functional, Molecular and Phenotypic Characterization of the New Subtypes
The four new subtypes are defined by their distinct expression patterns and their particular functional contexts, i.e., they represent molecular subtypes. The question arises if these molecular subtypes associate with selected genetic, clinical, or alternative molecular phenotypes collected independently [10]. We used these data and calculated the frequency distribution of patients for each of the characteristics over the four subtypes. For mBL* and non-mBL* one finds analogous frequency distributions of a series of characteristics as described in previous studies, e.g., the age dependency [10], the effect of the MYC-gene translocation [10], different immune-phenotypes [46] and the GCB-ABC-signature [45]. Nearly 90% of the lymphoma assigned to the non-mBL* and to intermediate A&B subtypes are classified as diffused, large B-cell lymphoma (DLBCL) suggesting a close similarity between these three subtypes. In the supplementary text (Supplementary File 1), we provide a thorough analysis of the expression signatures of the subtypes, the co-expression network of the spot modules and their functional impact. It turned out that each of the subtypes is characterized by different hallmarks of cancer, e.g., proliferation and high transcriptional and translational activity in mBL*; activated immune response and inflammation in non-mBL*, innate immunity in the intermediate A subtype and up-regulated expression of common cancer gene signatures [47] in the intermediate B subtype. Generic, MYC-related poor prognosis gene signatures [48] are associated with the mBL* and, to a lesser degree, intermediate A subtypes. Moreover, we found that intermediate A subtype lymphomas show expression signatures of activated B-cells and strong dissimilarity with expression landscapes of germinal center B-cells and healthy lymph node tissue suggesting different cell-of-origins. On the level of gene regulation, the decomposition of lymphoma into four subtypes obviously further diversifies into different modes which, in turn, reflect driving effects on the genetic and epigenetic levels. The understanding of these molecular mechanisms thus requires the combined analysis of genetic, epigenetic and transcriptional data.
Finally, we generated Kaplan-Meier diagrams to estimate the probability of subtype specific overall patient survival as a function of time [49]. Figure 9a,b show the curves for the three and four subtype classifications, respectively. Based on the original definition by Hummel et al., patients with mBL lymphomas show significantly better survival rates as intermediate and non-mBL patients (p < 0.001 in log-rank test, see also [10]). In contrast, our new classification now reveals that both mBL* and non-mBL* patients show better survival rates than patients of the intermediate A & B subtypes. Assignment of lymphoma to either of the two intermediate subtypes roughly halves the survival rate. The diversification of lymphoma subtypes thus clearly impacts prognosis.
A recent study also proposed new classes of B-cell lymphoma based on a correlation gene set analysis and using a larger patient collective [42]. This study excluded mBL samples from the patient cohort and divided the remaining diffuse large B-cell cases into three classes. Their expression signatures and phenotypic characteristics show certain similarities with our non-mBL*, intermediate A and B subtypes; however, they also differ in other properties, for example in the assignment of cell-of-origin properties and of energy metabolism signatures.

Conclusions
Analysis of molecular biological data using Self-Organizing Maps (SOMs) enables a holistic view on high-dimensional data collected in large-scale studies. It provides a general framework for analytic tasks such as feature selection, integration of concepts of molecular function and systems tracking with individual resolution. The method extracts meta-features such as metagenes and spot-modules representing basal modes of systems behavior important for higher-level analysis.
We applied SOM machine learning to patient expression data of mature aggressive B-cell lymphomas to characterize the specifics of the genome wide expression landscapes in different molecular subtypes of lymphoma. The expression portraits obtained by the SOM algorithm reflect the expression landscapes of the individual samples or subtypes in terms of intuitive and characteristic color textures. These spot patterns can be used to describe the underlying functional modules using gene set enrichment techniques.
Several sample similarity analysis methods were applied to characterize the subtype structure in detail. The correlation network approach provides a powerful representation as it visualizes multivariate relationships in a clear and accessible fashion. We presented a straightforward strategy to identify outlier samples and modules, e.g., due to contaminations of tumor samples with healthy tissue, and to correct them. Furthermore, we found indications for a finer subtype classification of aggressive B-cell lymphoma into four subtypes. Samples were classified using a spot-guided and metagene-based k-Means clustering method. The robustness and consensus-cluster stability of the new four subtypes exceeds that of previous three class approaches. The functional and clinical impact of the new subtypes was discussed. The two intermediate subtypes of heterogeneous molecular signatures are associated with poor survival prognosis compared with the more homogeneous mBL* and non-mBL* subtypes.
Our case study shows that analyzing gene expression landscapes with the tools presented here facilitates information mining in such huge data sets and eventually promotes our understanding of cancer biology.