Applications of a Novel Clustering Approach Using Non-Negative Matrix Factorization to Environmental Research in Public Health

Often data can be represented as a matrix, e.g., observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is driven by the large values. If the data is normalized by subtracting the row/column means, it becomes of mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.


Introduction
Let us consider the number of emergency hospital admissions in several US communities for specific diseases, e.g., cardiovascular disease (CVD), myocardial infarction (MI), and congestive heart failure (CHF). A pattern of admission causes may be characterized by unusually high and/or low counts for some of the possible causes. A specific community may have a high similarity with a particular pattern, e.g., high CVD and low MI, a somewhat lower similarity with the opposite pattern, e.g., low CVD and high MI, and negligible similarities with other patterns. As such, an admission pattern can be thought of as the pattern of an archetypal community, in which all admission causes have average count levels, except for the ones that are unusually high/low and characterize the pattern itself. What if a community's pattern is actually a mixture of patterns of archetypal communities, rather than being similar to one specific pattern? Matrix factorization methods can address both situations.
(ii) The Compressed Mortality File (CMF), a county-level national mortality and population database spanning the years 1968-2010. The table contains death counts for 13 age categories.

Non-Negative Factorization of a General Matrix
Let us consider a matrix V that contains only non-negative entries. V can be approximated by a sum of k rank-1 bilinear forms, V ≈ Σ_q w_q h_q^T, more conveniently written as V ≈ WH^T, with W = [w_q], 1 ≤ q ≤ k, and H = [h_q], 1 ≤ q ≤ k. The vectors w_q and h_q are referred to as the components of the factorization model. The NMF of V can be used to obtain W and H, and guarantees that all elements of W and H are non-negative, by contrast with the ordinary SVD of V, which returns W and H with mixed signs. Consider now a general matrix V, which we would like to analyze with an NMF approach. If V is of mixed signs, a preliminary transformation is required to satisfy the non-negativity requirement. Two approaches can be applied. The first approach is a novel approach that we refer to as the PosNegNMF approach:
(i) Split V into its positive part V+ and the absolute value of its negative part V−, and concatenate them column-wise into V_PN = [V+ | V−].
(ii) Apply the NMF clustering to V_PN.
The second approach is referred to as the affine NMF approach:
(i) Subtract the minimum of each column: V = V_0 + baseline [1].
(ii) Apply the NMF clustering to V_0.
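The PosNeg splitting step can be sketched as follows; a minimal NumPy illustration, in which `posneg_split` is our own (hypothetical) helper name:

```python
import numpy as np

def posneg_split(V):
    """Split a mixed-sign matrix into non-negative positive and negative parts
    and concatenate them column-wise: V_PN = [V+ | V-]."""
    V_pos = np.maximum(V, 0.0)          # positive part: max(V, 0)
    V_neg = np.maximum(-V, 0.0)         # absolute value of the negative part
    return np.hstack([V_pos, V_neg])    # non-negative, with twice the columns

V = np.array([[1.5, -2.0],
              [-0.5, 3.0]])
V_pn = posneg_split(V)
# Every entry of V_pn is non-negative, and V is recovered as V+ minus V-
```

Note that the transformation doubles the number of columns but loses no information, since V can be reconstructed exactly from the two parts.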
A variety of algorithms can be used to obtain the NMF model components W and H. The simplest method uses multiplicative update rules [4] to minimize the sum of squares of the elements of V − WH^T, which we will refer to as the residual sum of squares (Appendix A). Note that, in contrast to the analysis of V_PN, which includes all information contained in V, the analysis of V_0 disregards the baseline, and so it includes only part of the information contained in V. We will show in the results section that this loss of information may have dramatic consequences on the quality and interpretability of the clustering results.
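As a concrete sketch, the multiplicative update rules for the least-squares objective can be written as follows (a minimal, unoptimized illustration under the V ≈ WH^T convention used above; the function and variable names are our own):

```python
import numpy as np

def nmf_mu(V, k, n_iter=500, eps=1e-9, seed=0):
    """Rank-k NMF of a non-negative matrix V ~ W @ H.T, using the
    multiplicative updates for the Frobenius (least-squares) objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps        # non-negative random initialization
    H = rng.random((m, k)) + eps
    for _ in range(n_iter):
        # H <- H * (V^T W) / (H W^T W); keeps H non-negative elementwise
        H *= (V.T @ W) / (H @ (W.T @ W) + eps)
        # W <- W * (V H) / (W H^T H); keeps W non-negative elementwise
        W *= (V @ H) / (W @ (H.T @ H) + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((10, 6)))
W, H = nmf_mu(V, k=3)
rss = np.sum((V - W @ H.T) ** 2)        # residual sum of squares
```

Under these updates the residual sum of squares is non-increasing, which is why the scheme is popular despite its slow convergence near the optimum.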

NMF Clustering and Reordering
Data clustering methods such as K-means can be applied within the space generated by the column vectors of W to cluster observations, as traditionally done with SVD-based methods, such as the latent semantic indexing method used in document clustering [5]. The simple clustering scheme described in the introduction is based on the direct link of each homogeneous block in V with a particular column of W, whose largest elements point to this block. However, neither approach is independent of the chosen scaling system, which is arbitrary. To address this problem, component leverages can be calculated. The leverage represents the ability of a row or column to exert an influence specifically on a particular component, without a corresponding increase in its influence on other components. Leverage column vectors have all elements in the interval (0, 1). They are strongly correlated with the NMF column vectors w_q and h_q, yet much less affected by the choice of a scaling system. Technical details for the calculation of leverages are given in Appendix B.

Stability and Specific Clustering Contribution of NMF Clusters
The clustering method that we consider in this paper operates in a rather dichotomous way, as samples must be assigned to only one cluster, even when the assignment is ambiguous. The stability of NMF clusters is evaluated by applying a resampling scheme. The validation of the identified clustering structure is the most difficult and challenging part of cluster analysis, and it goes far beyond the scope of this paper. We propose a simple, internal validation tool, which does not rely on external information such as known group membership: the specific clustering contribution (SCC) criterion is derived from the concept of entropy [6] to provide a quantitative assessment of the within-cluster homogeneity and the between-cluster heterogeneity. Details for the calculation of the stability and SCC are given in Appendix C. For both indicators, the result is a number in the interval (0, 1).
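As a generic sketch of the resampling idea (not the exact scheme of Appendix C, which resamples rows and re-estimates H from a fixed W), the co-clustering frequency of row pairs across bootstrap runs can be computed as follows; `cluster_fn` is a hypothetical stand-in for any row-clustering step:

```python
import numpy as np

def co_clustering_stability(X, cluster_fn, n_runs=50, seed=0):
    """Pairwise co-clustering frequency across bootstrap resamples.
    cluster_fn(X) must return an integer cluster label per row of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    same = np.zeros((n, n))             # times a pair fell in the same cluster
    counts = np.zeros((n, n))           # times a pair was jointly resampled
    for _ in range(n_runs):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap the rows
        labels = cluster_fn(X[idx])
        for a in range(n):
            for b in range(n):
                counts[idx[a], idx[b]] += 1
                same[idx[a], idx[b]] += labels[a] == labels[b]
    # Frequency in (0, 1); 1 means the pair always co-clusters
    return np.divide(same, counts, out=np.zeros_like(same), where=counts > 0)

X = np.vstack([np.zeros((3, 2)), np.ones((3, 2))])
freq = co_clustering_stability(X, lambda M: (M[:, 0] > 0.5).astype(int))
```

A stable clustering yields pairwise frequencies close to 0 or 1; intermediate values flag ambiguous assignments.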

Rank of the NMF Factorization
Depending on the chosen factorization rank, the clusters may appear more or less stable, which provides an additional criterion for its determination. The factorization rank k can be selected through the analysis of the sum of squares of the residual matrix V − WH^T, where k takes values from 1 up to a large value (e.g., half the number of rows or columns), in combination with the analysis of cluster stability. The plot of each criterion as a function of k is called a scree plot. An optimal k should correspond to both a low residual sum of squares and a high cluster stability. Additionally, the scree plot of the cluster SCC can be used to confirm the choice based on the first two criteria. Ultimately, selecting k is a difficult decision to make, and it is an important step on the path to a better understanding of the nature of V.

Normalization of Contingency Tables
The cells of a contingency table contain counts, which are all non-negative numbers, suggesting that NMF could be applied directly to the table. However, very different totals for rows or columns, as often occurs in such tables, may result in highly heterogeneous variances, in sharp contrast to the error model behind NMF, which assumes a homogeneous distribution across cells, at least when the algorithm minimizes the residual sum of squares, as SVD does. This problem is addressed by Correspondence Analysis (CA), which uses the normalized matrix of contingency ratios. We briefly recall the main steps involved in this approach:
(i) For each cell, the contingency ratio is calculated by forming the ratio of the observed count over the expected count, assuming the independence of rows and columns.
(ii) Further normalization steps include the subtraction of the expected ratio under the assumption of independence (=1), yielding a mixed-signs matrix, and a subsequent scaling of rows and columns to ensure homogeneous cell variances.
(iii) The SVD is applied to the normalized matrix.
(iv) A biplot based on the first two SVD components is then drawn, allowing for a simultaneous clustering of the rows and columns of the table, which we will refer to as the SVD clustering. Note that PCA clustering refers to the same approach, since PCA's eigenvectors are the column singular vectors [7].
More details can be found in [8]. The normalized matrix can alternatively be analyzed using the PosNegNMF clustering.
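Steps (i) and (ii) can be sketched as follows; a minimal NumPy illustration, assuming the standard CA scaling in which each centered contingency ratio is weighted by the square roots of the row and column masses:

```python
import numpy as np

def ca_normalize(N):
    """Normalized matrix of contingency ratios for Correspondence Analysis.
    Entry (i, j) is sqrt(r_i * c_j) * (ratio_ij - 1), where
    ratio_ij = p_ij / (r_i * c_j) is the contingency ratio."""
    P = N / N.sum()                     # relative frequencies
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    expected = np.outer(r, c)           # expected frequencies under independence
    ratio = P / expected                # contingency ratios (step i)
    # subtract the expected ratio (=1) and scale rows/columns (step ii)
    return np.sqrt(expected) * (ratio - 1.0)

N = np.array([[20.0, 30.0],
              [30.0, 20.0]])
S = ca_normalize(N)
# The SVD of S (step iii) yields the CA components used in the biplot (step iv)
```

The resulting mixed-signs matrix is precisely the kind of input for which the PosNeg transformation is intended.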

Software
We used software in the form of a JMP add-in (SAS Institute, Cary, NC, USA). The add-in was developed in the JSL language and can be downloaded from the Supplementary Materials.

Hospital Admissions Data
The objective of the analysis is the identification of archetypal patterns and the subsequent clustering of cities, seeking clusters characterized by relatively low or high counts for one or more causes of hospital admission.

PosNegNMF Clustering
The PosNeg splitting scheme was applied to the normalized matrix of contingency ratios. The evaluation of the scree plot for the residual sum of squares (Figure 1a) and the clustering stability (Figure 1b) indicated that four component vectors provide a small residual variance (it almost stops decreasing when an additional component is considered) and a good stability (it increases back to a level >0.95 from a level <0.92 with a three-component model). The criteria were calculated for up to five components, the number of admission categories before the matrix was split.
The count patterns are presented on the heatmap of the reordered rows and columns of the normalized contingency table (Figure 2). These patterns are characterized by high and/or low count levels for each cause of hospital admission. Specifically, each row of the heatmap provides the counts of hospital admissions of the corresponding city using the color coding of the cells (i.e., red = high, blue = low). High/low counts of hospital admissions appear on the left/right side of the heatmap, respectively; the suffixes "+" and "−" were added to variable names to represent the positive and negative parts. The four clusters derived from the four model components are identified through the colors given to the city labels. Notably, the third cluster (Detroit, Chicago, etc.) is mostly characterized by a lower number of hospital admissions due to respiratory disease, illustrating the role of low values in PosNegNMF clustering.
The summary characterization of each cluster is given in Table 1. All variables were simultaneously clustered. Note that the positive and negative parts of a given variable can be included in different clusters, e.g., the positive part of the respiratory cause is included in the first cluster, while the negative part is included in the third cluster.
Noteworthily, a K-means clustering of the normalized matrix of contingency ratios yields six clusters, following the Cubic Clustering Criterion (CCC) for the choice of k, which is classically used with K-means [9]. Three K-means clusters are very close to PosNeg clusters 2, 3 and 4. However, the other three clusters have only one member each. Moreover, cities cannot be ordered within each cluster as we do with PosNegNMF, since K-means returns only cluster membership indicators, and causes cannot be easily assigned to each cluster since they are not simultaneously clustered. This illustrates the advantages of using PosNegNMF over K-means, as it helps summarize with a heatmap the hospital admission patterns across US cities. It would be interesting to investigate these results more thoroughly, for example by using city characteristics to predict cluster membership, but this would take us beyond the scope of this article.

Affine NMF Clustering
For each cause of hospital admission, the minimum count over all cities for this particular cause was first subtracted from the count of each city. The evaluation of the scree plots for the residual sum of squares (Figure 3a) and the clustering stability (Figure 3b) indicated, as for PosNegNMF, that four component vectors provide both a small residual variance and a good stability.
The ordered heatmap and clusters obtained with the affine NMF approach are shown in Figure 4. The clusters produced by the affine NMF method are neither as clearly separated nor as easy to describe/summarize as those produced by the PosNegNMF method. Specifically, the second and fourth clusters from the affine NMF method contain only one city each. As a consequence, extracting from this figure a summary characterization of the clusters as we did with PosNegNMF in Table 1 appears difficult, apart from the very main features: high myocardial infarction (MI) in cluster 1, and high congestive heart failure (CHF) in cluster 3. The difference in clustering quality is further evidenced by the scree plots for the specific clustering contributions of the two clustering methods (Figure 5), where the SCC is uniformly higher when the PosNegNMF approach is used.

Correspondence Analysis
The first two SVD components of the normalized contingency matrix allow for a simultaneous representation of cities and hospital admission causes. This kind of representation is called a biplot and is widely used when performing Correspondence Analysis (CA, Figure 6). We assess whether the clusters of cities and the admission patterns can be retrieved through this biplot. PosNegNMF clusters are presented using colored city labels, with the same colors as in the heatmap from Section 3.1.1. The proximities between clusters and causes of hospital admissions help the interpretation of the two axes of the biplot, e.g., Sacramento, Seattle and nearby cities all have high levels of MI. However, the corresponding detailed counts are not visible and can only be inferred from these proximities, a process which is not reliable. For example, St Louis, Los Angeles, Bakersfield and Dallas (PosNegNMF cluster 2, green crosses) appear clustered on the biplot with Kansas City, Toledo, etc. (PosNegNMF cluster 4, brown × symbols) despite their relatively high level of MI, which is characteristic of cluster 2. This suggests that the biplot may be too restrictive, possibly because it is limited to the first two SVD components.

Additional Remarks
The three methods (PosNeg NMF, Affine NMF, and Correspondence Analysis) deal with the communities of Akron, OH, and El Paso, TX, USA, in different ways. These communities are similar with respect to four out of the five admission causes, but they are different with respect to diabetes (El Paso is high with respect to diabetes, while Akron is low). The PosNeg NMF method reveals the similarities between these two communities while at the same time pointing to the unique difference between them. The Affine NMF method considers the two communities as two separate clusters due to their differences with respect to diabetes (and thereby completely ignores the similarity with respect to the other causes). Correspondence analysis plots these two communities on the right side of the biplot, somewhat distant from each other on the vertical axis. Since diabetes is plotted at the bottom of the biplot, it may be concluded that the distance between these two communities is due to diabetes (although this may require extensive expertise with correspondence analysis).

Compressed Mortality File
We have presented an example where the PosNegNMF method outperforms the original NMF or CA/SVD methods. Nevertheless, whenever the studied matrix can be approximated well by a rank-1 bilinear form, using this new approach is not recommended. The Compressed Mortality File (CMF), a county-level national mortality and population database spanning the years 1968-2010, will help illustrate this important point. The table contains death counts for males for 13 age categories. The analysis of the normalized matrix of contingency ratios shows that the second SVD component contributes only 6% of the overall variance. Thus, rows and columns can be reordered through the rank-1 bilinear approximation given by the first SVD component, revealing well-known trends in the pivot period marking the end of the 20th century (Figure 7), such as more deaths being observed in the 85+ age category and fewer deaths in the younger age categories. Also, as we enter the 21st century, people tend to live longer. A sudden increase in deaths is observed in the younger age categories (25-44 years old) between 1985 and 1995 (the AIDS outbreak), which tends to return to normal after 1996, once AIDS drugs came into wide use.

Performance
Our experience suggests that the computational performance of the NMF algorithm is improved when the PosNeg-transformed matrix is used, despite the fact that it has twice as many columns as the original matrix. This is due to the smaller number of iterations needed for the algorithm to converge, suggesting a more stable solution. It has been shown that a version of the algorithm using an Expectation-Maximization (EM) scheme [10] can yield a better approximation for sparse matrices. Such a scheme seems appropriate for PosNegNMF, given the sparse nature of the transformed matrix.

Alternative Approaches
While NMF simultaneously clusters the rows and columns of the input matrix V, we could also use standard clustering methods to first cluster the rows, and then the columns, of V. A rank-1 bilinear approximation provides a natural way of performing these operations. Approximating the matrix V by V ≈ wh^T, where w and h are the row and column marker vectors respectively, then reordering the rows of V by the decreasing values of the elements of w and the columns by the decreasing values of the elements of h, leads to a matrix whose elements tend to go from the largest in the top left to the smallest in the bottom right. The bilinear approximation also provides a way to cluster the matrix into internally homogeneous rectangles. Sorting the elements of w into descending order and separating them into k maximally homogeneous clusters leads to the reordered V having rows that are maximally homogeneous, to the extent that the bilinear approximation matches them well, i.e., the elements of the error matrix V − wh^T are small enough. Applying the same operation to separate the elements of h into m maximally homogeneous clusters does the same for the columns, while separating the elements of both the row and column markers clusters the matrix V into km internally homogeneous blocks. This approach was used in Liu et al. [11], where a robust incomplete-data implementation of SVD was used to obtain the approximants w and h. The segmentation of the row and column markers was carried out using the dynamic programming optimization method from Hawkins [12].
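The rank-1 reordering described above can be sketched as follows; a minimal illustration that takes the leading SVD component as the marker vectors w and h (a robust SVD such as that of Liu et al. [11] would replace the plain SVD in practice):

```python
import numpy as np

def rank1_reorder(V):
    """Reorder the rows and columns of V by the leading singular vectors,
    so that large values concentrate toward the top-left corner."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    w = U[:, 0] * s[0]                  # row marker vector
    h = Vt[0, :]                        # column marker vector
    # Resolve the SVD sign ambiguity so the markers are positive
    if w.sum() < 0:
        w, h = -w, -h
    row_order = np.argsort(-w)          # descending row markers
    col_order = np.argsort(-h)          # descending column markers
    return V[np.ix_(row_order, col_order)], row_order, col_order

V = np.outer([1.0, 3.0, 2.0], [2.0, 1.0])   # an exactly rank-1 matrix
V_sorted, rows, cols = rank1_reorder(V)
# The largest element of V now sits in the top-left cell of V_sorted
```

Segmenting the sorted markers w and h into maximally homogeneous runs (e.g., by dynamic programming) then yields the rectangular block clustering described above.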
It may not always be sensible to permute the rows or columns of the matrix, for instance when one considers time series or geographic data, which are naturally ordered. In this setting, a separation into rectangular blocks can still be performed by partitioning the array into homogeneous blocks while preserving the natural order. This approach, described in Hawkins [13], may not lead to explicit optimization algorithms but can be implemented using heuristic approaches. It can also be applied to more general matrix clustering problems during the second stage that follows the initial reordering of rows and columns using a bilinear approximation or other reordering methods, such as spectral clustering [14-16]. Algorithms that try to retrieve such an order are generally called seriation algorithms (see [17] for a historical review).

Limitations
Our research is focused on improving the quality and interpretability of clusters. When the primary objective is to determine the compositions (the columns of H) of a mixture of true real-world sources, and their contributions (the columns of W) to observed samples (the rows of V), finding "the right solution", and determining the uniqueness of the solution, are critical goals. These two problems are essentially different, along the lines of Shmueli's distinction [18] between explaining, here through cluster analysis, and predicting, here estimating mixture contributions. Whether PosNegNMF improves on existing mixture deconvolution methods will require further investigation. Positive Matrix Factorization (PMF) [19] has mostly been applied to mixture problems, such as determining the sources of atmospheric pollution. The NMF and PMF approaches actually solve the same mathematical problem with very similar algorithms. However, PMF is more oriented towards mixture deconvolution problems, yielding accurate mixture contributions, whereas NMF is more oriented towards pattern discovery, yielding interpretable clusters. It should be noted that the PosNeg transformation generates a sparse matrix with at least half of the cells equal to zero. Obviously, matrices which are naturally sparse are not amenable to the PosNeg transformation.

Conclusions
We showed that the heatmap of the reordered rows and columns of a matrix, when properly normalized, can add insight to the SVD clustering produced by CA, in particular with respect to the interpretation of the biplot axes. We also showed that PosNegNMF clustering returns more homogeneous clusters than affine NMF clustering.
In the more general context of NMF, we have provided methodological innovations to address some of the known caveats of the NMF approach, such as the component scaling indeterminacy and the choice of an optimal NMF rank. The calculation of leverages allows for a component scale-free clustering of the rows or columns of a non-negative matrix, while the evaluation of the stability and specific clustering contribution of NMF clusters provides additional criteria for choosing an optimal NMF rank.
We have proposed and illustrated with two examples a novel clustering approach using NMF. With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, the new method may be used by epidemiologists and environmental scientists to obtain clusters with improved quality and interpretability, compared to those obtained from commonly used clustering methods, such as K-means and the clustering produced by CA.
Supplementary Materials: The datasets used in this paper are available online at http://www.mdpi.com/1660-4601/13/5/509/s1.
Appendix B: Calculation of Leverages

The squared elements of the columns of W or H are typically constrained to sum to unity as a convenient way of eliminating the degeneracy associated with the invariance of WH^T under the transformation W → WΛ, H → HΛ^{-1}, where Λ is a diagonal matrix defining a particular scaling system. It should be noted that the sum-to-unity constraint is arbitrary. We will show that the choice of a particular scaling system can affect the clustering process.
Assume that samples are mixtures of two archetypal samples H1 and H2. In Figure A1a, the sample coordinates (x, y) correspond to the mixture coefficients in a two-component model: Sample = x·H1 + y·H2. The red circles correspond to samples with y > x, thus more similar to H2; the blue crosses correspond to samples with y < x, thus more similar to H1. Now, the x and y vectors were scaled by the square root of their respective sum of squares, yielding a new coordinate system. When this system is used, three samples originally closer to H1 appear wrongly clustered with samples originally closer to H2 (Figure A1b, red crosses). Indeed, for these particular samples, y > x in the new coordinate system, whereas y < x in the original system. If the distance to H1 or H2 is instead defined by the Euclidean distance to the extremity of each axis (represented by an arrow), the clustering remains unchanged in either coordinate system.
This distance, which is independent of the chosen scaling system, can easily be extended to more than two component vectors. For each observation i, the distance to each profile vector Hq is defined by:

distance(i, Hq) = (max_j W(j, q) − W(i, q))^2 + Σ_{r≠q} W(i, r)^2    (B1)

The leverage can be directly derived from the distance:

leverage(i, Hq) = exp(−distance(i, Hq) / (2 · mean_j(distance(j, Hq))))    (B2)

The leverage of each variable can be defined in a similar way. By construction, leverages are well correlated with the column vectors of W or H, and lie in the interval (0, 1). They are little affected by the chosen scaling system, thus allowing for a more reliable clustering.
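Equations (B1) and (B2) can be sketched as follows for the rows of W; a minimal NumPy illustration with our own function name:

```python
import numpy as np

def leverages(W):
    """Row leverages for each NMF component q.
    distance(i, q) = (max_j W[j, q] - W[i, q])**2 + sum_{r != q} W[i, r]**2
    leverage(i, q) = exp(-distance(i, q) / (2 * mean_i distance(i, q)))"""
    col_max = W.max(axis=0)
    # squared shortfall from the column maximum on component q
    shortfall = (col_max[None, :] - W) ** 2
    # squared contributions on all the other components
    total_sq = (W ** 2).sum(axis=1, keepdims=True)
    other = total_sq - W ** 2
    dist = shortfall + other
    return np.exp(-dist / (2.0 * dist.mean(axis=0, keepdims=True)))

W = np.array([[0.9, 0.1],
              [0.1, 0.8],
              [0.5, 0.5]])
L = leverages(W)
# Row 0 has the highest leverage on component 1, row 1 on component 2
```

Since the leverages depend only on relative distances within each column, rescaling a column of W leaves the cluster assignments essentially unchanged.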
Now, the distance to each column vector H_q is a function of \(\max_j W(j, q)\), so it can be severely affected by outliers. The following iterative algorithm allows for estimating a robust maximum:

1. Initialize the robust estimate Robust Max(q) by the maximum of each component.

2. For each vector component q:
   a. For each row i of W, calculate the probability p(i, q) and the row score (Equations (C2) and (C1), respectively, Appendix C).
   b. Force the row score to 0 if p(i, q) < 1/k.
   c. Update Robust Max(q) by the weighted mean of W(i, q), where the mean is taken over all rows satisfying W(i, q) > 95th quantile over j of W(j, q), and the weights are the row scores. The idea is that rows with higher row scores should weigh more in the max estimation.
   d. Replace all W(i, q) > Robust Max(q) by Robust Max(q).

3. Repeat step 2 until convergence.
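The steps above can be sketched in NumPy as follows (a sketch under our reading of the algorithm; function names, the fixed iteration count, and the outlier magnitude in the toy example are our choices):

```python
import numpy as np

def robust_max(W, n_iter=20):
    """Iterative robust estimate of the per-column maximum of W."""
    W = W.copy()
    k = W.shape[1]
    rmax = W.max(axis=0)                              # step 1: initialize
    for _ in range(n_iter):                           # step 3: iterate
        # p(i, q): share of row i attributable to component q (Eq. C2)
        p = W / W.sum(axis=1, keepdims=True)
        # row score (Eq. C1); 0 * log2(0) is taken as 0
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log2(p), 0.0)
        score = 1.0 + plogp.sum(axis=1) / np.log2(k)  # step 2a
        for q in range(k):
            s = np.where(p[:, q] < 1.0 / k, 0.0, score)       # step 2b
            top = W[:, q] > np.quantile(W[:, q], 0.95)        # step 2c
            if s[top].sum() > 0:
                rmax[q] = np.average(W[top, q], weights=s[top])
            W[:, q] = np.minimum(W[:, q], rmax[q])            # step 2d
    return rmax

# One clear outlier in column 0 is pulled down toward the bulk of the data.
rng = np.random.default_rng(1)
W = rng.random((200, 2))
W[0, 0] = 50.0
est = robust_max(W)
assert est[0] < 50.0
```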
Going back to the simulation example of Appendix A, the maximum appears above the horizontal blue line representing the robust maximum in the histograms of each column vector (Figure B2). Note that for the second component, the maximum was also identified as an outlier by a standard criterion based on the interquartile range.

Appendix C: Stability and Specific Clustering Contribution of NMF Clusters
Stability: The stochastic nature of most clustering algorithms has motivated methods for evaluating the consistency and robustness of their performance [22]. The central idea is to perform numerous runs of the algorithm, starting from different random initializations. On each run, the algorithm groups the rows into clusters, allowing for the calculation of the frequency at which two different rows fall into the same cluster. In contrast to random initialization, we first calculate a pseudo-unique NMF solution based on SVD initialization, which is itself unique [23]. The rows of V are then resampled with replacement, and the rows of W are resampled in exactly the same way as those of V. H is re-estimated while the resampled W remains fixed. Column leverages are re-estimated after each run of this resampling scheme, giving rise to different clusters of columns, and the frequency at which a column falls into a particular cluster can be calculated. The higher the frequency, the more stable the column is with respect to that cluster. Conversely, the updated H can be used to re-estimate W, which gives rise to different clusters of rows; the frequency at which a row falls into a particular cluster can be calculated in the same way.
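A simplified sketch of the column half of this resampling scheme, assuming H is re-estimated by non-negative least squares with W fixed, and assigning each column to its largest H entry rather than via the leverages of Appendix B (function names and the toy data are ours):

```python
import numpy as np
from scipy.optimize import nnls

def column_cluster_stability(V, W, n_boot=50, seed=0):
    """Resample the rows of V and W jointly, re-estimate H with W fixed
    (non-negative least squares, one column of V at a time), cluster each
    column by its largest H entry, and return the frequency at which each
    column falls into each cluster."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    k = W.shape[1]
    counts = np.zeros((m, k))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # same resampling for V and W
        Vb, Wb = V[idx], W[idx]
        H = np.column_stack([nnls(Wb, Vb[:, j])[0] for j in range(m)])
        counts[np.arange(m), H.argmax(axis=0)] += 1
    return counts / n_boot                   # cluster membership frequencies

# Toy data with a known two-cluster column structure.
rng = np.random.default_rng(2)
W = rng.random((40, 2))
H_true = np.array([[1.0, 1.0, 0.1, 0.1],
                   [0.1, 0.1, 1.0, 1.0]])
V = W @ H_true
freq = column_cluster_stability(V, W)
assert freq[0, 0] > 0.9 and freq[2, 1] > 0.9
```

Columns with membership frequencies near 1 are stable with respect to their cluster; frequencies spread across clusters flag unstable columns.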
Specific Clustering Contribution: For each row of W, we define the row score for the i-th row as:

\[ \mathrm{Row\ score}(i) = 1 + \frac{1}{\log_2 k} \sum_{q=1}^{k} p(i, q) \log_2 p(i, q) \quad \text{(C1)} \]

where p(i, q) is the probability that the i-th row contributes to cluster q, i.e.:

\[ p(i, q) = \frac{W(i, q)}{\sum_{r=1}^{k} W(i, r)} \quad \text{(C2)} \]

The row score is a real value within the interval (0, 1). The higher the row score, the more component-specific the corresponding row [6]. The mean row score over all rows of W provides an overall indicator of the Specific Clustering Contribution (SCC).
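Equations (C1) and (C2) amount to one minus a normalized entropy of each row's cluster-membership probabilities, as the following sketch shows (function name is ours):

```python
import numpy as np

def row_scores(W):
    """Row scores (Eq. C1) with p(i, q) = W[i, q] / sum_r W[i, r] (Eq. C2).
    A score of 1 means the row loads on a single component; 0 means it is
    spread evenly across all k components."""
    k = W.shape[1]
    p = W / W.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return 1.0 + plogp.sum(axis=1) / np.log2(k)

W = np.array([[1.0, 0.0, 0.0],    # fully component-specific -> score 1
              [1.0, 1.0, 1.0]])   # evenly spread            -> score 0
scores = row_scores(W)
scc = scores.mean()               # overall Specific Clustering Contribution
assert np.allclose(scores, [1.0, 0.0])
```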