Hyperspectral Image Clustering with Spatially-Regularized Ultrametrics

We propose a method for the unsupervised clustering of hyperspectral images (HSI) based on spatially regularized spectral clustering with ultrametric path distances. The proposed method efficiently combines data density and geometry to distinguish between material classes in the data, without the need for training labels. It is efficient, with quasilinear scaling in the number of data points, and enjoys robust theoretical performance guarantees. Extensive experiments on synthetic and real HSI data demonstrate its strong performance compared to benchmark and state-of-the-art methods. In particular, the proposed method achieves not only excellent labeling accuracy, but also efficiently estimates the number of clusters.


I. INTRODUCTION
Remote sensing image processing has been revolutionized by machine learning. When large training sets are available, supervised methods such as support vector machines, decision trees, and deep neural networks accurately label pixels in a wide range of imaging modalities. However, it is often impractical to acquire the large training sets necessary for these methods to work well. In these situations, it is necessary to develop unsupervised clustering methods, which label the entire data set without the need for training data.
While many methods for unsupervised clustering have been proposed, most only enjoy mathematical performance guarantees under very restrictive assumptions on the underlying data. For example, K-means clustering works well for clusters that are roughly spherical and well-separated, but provably fails when the clusters become nonlinear or poorly separated [1]. Clustering methods based on deep learning may perform well in some instances, but are sensitive to metaparameters and lack robust mathematical performance guarantees even in highly idealized settings [2], [3].
Recently, ultrametric path distances (UPD) have been proven to provide state-of-the-art theoretical results for clustering high-dimensional data [4]. In particular, only very weak assumptions on the shape and structure of the clusters are required, namely that the underlying data exhibits intrinsically low-dimensional structure. This suggests UPD are well-suited for hyperspectral images (HSI) [5], which, while very high-dimensional, are typically such that each class in the data depends (perhaps nonlinearly) on only a small number of latent variables, and in this sense are intrinsically low-dimensional.
In this paper, we develop an HSI clustering algorithm based on ultrametric spectral clustering. Taking advantage of recent theoretical developments, UPDs are used to construct a weighted graph, from which a graph Laplacian is defined and then spatially regularized. The lowest frequency eigenfunctions of this graph Laplacian are then used as features for K-means clustering. The proposed method is denoted spatially regularized ultrametric spectral clustering (SRUSC). SRUSC is computationally efficient, with quasilinear scaling in the number of data points, and has few tunable parameters. Moreover, the proposed method is shown to outperform a range of benchmark and state-of-the-art clustering methods on several synthetic and real HSI. In particular, the proposed method demonstrates strong clustering accuracy and efficient estimation of the number of latent clusters in the data.
The remainder of this article is organized as follows. In Section II, we provide background on unsupervised HSI clustering and ultrametric path distances. In Section III, we detail the proposed algorithm and discuss its theoretical properties and complexity. In Section IV, we introduce several data sets and perform comparisons between the proposed method and related methods. We conclude in Section V. Code implementing the proposed method and all experimental results is publicly available.¹

II. BACKGROUND
Unsupervised clustering consists in, given a data set X = {x_1, . . . , x_n} ⊂ R^D, assigning to each point a label y_i, where each y_i ∈ {1, 2, . . . , K}. In the case of HSI, n is the number of pixels in the data set, D the number of spectral bands, and K the number of latent classes in X. The key challenge in clustering, compared to supervised learning, is that no training data is available to guide the labeling process. So, labeling decisions must be made entirely based on latent geometrical and statistical properties of the data.
A range of methods for clustering data have been developed, including K-means clustering and Gaussian mixture models, which assume the underlying data is a mixture of well-separated, roughly spherical Gaussians; density-driven methods that characterize clusters as high density regions separated from other high-density regions by regions of low density [6], [7]; and spectral graph methods that attempt to find communities in networks generated from the underlying data [8], [1]. In the specific context of clustering HSI, methods taking advantage of the intrinsically low-dimensional (though potentially nonlinear) structure of the HSI clusters are particularly important, because capturing such low-dimensional structures significantly lowers the sampling complexity necessary to defeat the curse of dimensionality.

A. Background on Ultrametric Path Distances
In many clustering algorithms, decisions are based on pairwise distances between data points; the choice of distance metric is thus critical. We propose to use UPDs.
Let G_0 = (X, W) be an undirected Euclidean k-nearest neighbor graph on X, with k ∼ log(n). For any x_i, x_j ∈ X, let P(x_i, x_j) be the set of paths connecting x_i, x_j in G_0. Define the ultrametric path distance (UPD) between x_i, x_j ∈ X as

ρ_∞(x_i, x_j) = min_{P ∈ P(x_i, x_j)} max_{(p_l, p_{l+1}) ∈ P} ||p_l − p_{l+1}||_2,

where (p_l, p_{l+1}) ranges over the edges of the path P. Intuitively, UPD computes the longest edge in each path, then minimizes this quantity over all paths. It may be understood as an ℓ_∞-geodesic, while the classical shortest path is the ℓ_1-geodesic. Like classical shortest paths, it may be computed efficiently using a Dijkstra-type algorithm [9]. The UPD has been proved to be extremely robust to noise, and enjoys excellent performance guarantees for distinguishing between points in the same and different clusters [4]. These desirable properties suggest its use as a metric in clustering algorithms, such as spectral clustering.
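As an illustration, the Dijkstra-type minimax sweep mentioned above can be sketched in Python; the edge-list graph representation and function name here are illustrative, not the paper's implementation.

```python
import heapq

def ultrametric_path_distances(n, edges, source):
    """Minimax (bottleneck) path distances from `source`, computed by a
    Dijkstra-type sweep: the cost of a path is its LONGEST edge, and we
    minimize that cost over all paths (the UPD rho_inf on this graph)."""
    adj = [[] for _ in range(n)]
    for i, j, w in edges:           # undirected weighted edges
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = [float("inf")] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:             # stale heap entry
            continue
        for v, w in adj[u]:
            # path cost = max of longest-edge-so-far and this edge
            nd = max(d, w)
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Here the direct edge (0, 2) of weight 3.0 beats the two-hop path whose longest edge is 5.0, in contrast to the classical shortest path.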

B. Background on Spectral Clustering
Given a metric ρ on X, define a weighted, undirected graph G with nodes X and weighted edges W_ij = exp(−ρ(x_i, x_j)²/σ²). The scaling parameter σ can be tuned manually, or set automatically [10]. Intuitively, there will be a strong edge between points that are close with respect to ρ, and a weak edge between points that are far apart. A natural approach to clustering is to partition G into K communities that are simultaneously "large" and also pairwise weakly connected. This may be formulated precisely in terms of the normalized cuts functional, which leads to an NP-hard computational problem [8]. This graph cut problem may be relaxed by considering eigenvectors of the graph Laplacian, which makes it possible to determine natural clusters in G in polynomial time [8], [1]. Indeed, let D be the diagonal degree matrix for G: D_ii = Σ_{j=1}^{n} W_ij. Let L = I − D^{−1/2} W D^{−1/2} be the (symmetric normalized) graph Laplacian. Note L ∈ R^{n×n} is positive semi-definite, with a number of zero eigenvalues equal to the number of connected components in G.
The spectral clustering algorithm (Algorithm 1) consists in computing the lowest frequency eigenvectors of L (those with smallest eigenvalues) and using them as features in K-means clustering [1]. Note that we formulate Algorithm 1 as taking only W and a number of clusters K as input. In practice, W must be computed using some metric ρ. While ρ(x_i, x_j) = ||x_i − x_j||_2 is common, other metrics may provide superior performance [11], [12], [4].
Algorithm 1 (spectral clustering): given W and K, form the normalized graph Laplacian L, compute its K lowest-frequency eigenvectors, stack them as columns of a feature matrix, and run K-means on the rows of that matrix, using K as the number of clusters.
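Algorithm 1 can be sketched as follows; this is a minimal NumPy illustration (dense eigendecomposition, row normalization of the embedding, Lloyd iterations with a deterministic farthest-point initialization), not the authors' implementation.

```python
import numpy as np

def spectral_clustering(W, K, n_iter=50):
    """Sketch of Algorithm 1: embed the graph nodes with the K
    lowest-frequency eigenvectors of the symmetric normalized Laplacian
    L = I - D^{-1/2} W D^{-1/2}, then run basic K-means on the rows."""
    d = W.sum(axis=1)
    dinv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - dinv[:, None] * W * dinv[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = vecs[:, :K]                          # K smallest eigenvalues
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # deterministic farthest-point initialization for K-means
    centers = [U[0]]
    for _ in range(K - 1):
        gap = np.min([((U - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(U[np.argmax(gap)])
    centers = np.array(centers)
    for _ in range(n_iter):                  # Lloyd iterations
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), 1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(axis=0)
    return labels
```

On a weight matrix with two strongly intra-connected blocks and weak cross-block edges, the two lowest-frequency eigenvectors separate the blocks and K-means recovers the two communities.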

III. ALGORITHM
The proposed SRUSC method (Algorithm 2) consists in performing spectral clustering using the UPD ρ_∞ and a spatially regularized graph Laplacian. More precisely, let

W_ij = exp(−ρ_∞(x_i, x_j)²/σ²) if x_j ∈ B_r(x_i), and W_ij = 0 otherwise,

where B_r(x_i) is the set of all pixels whose spatial coordinates lie inside the square with side length r centered at x_i. This enforces spatial regularity in the graph: a pixel can only be connected to spatially proximal pixels. Spatial regularization has been shown to improve clustering performance for HSI, by producing clusters that are smoother with respect to the underlying spatial structure [13], [14]. It also has a sparsifying effect, since W has only O(r²n) non-zero entries. This affords a substantial computational advantage in the subsequent eigenvector calculation when r² ≪ n.
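The spatially regularized affinity can be sketched as follows; `spatial_weights`, its signature, and the dense precomputed distance matrix `rho` are illustrative assumptions (a practical implementation would compute only the O(r²n) needed UPD values rather than all n² distances).

```python
import numpy as np
from scipy import sparse

def spatial_weights(rho, H, Wd, r, sigma):
    """Sketch of the spatially regularized affinity: pixel i connects to
    pixel j only if j lies inside the r x r spatial square B_r(i)
    centered at i; surviving edges get weight exp(-rho(i,j)^2/sigma^2).
    `rho` is any precomputed n x n distance matrix (e.g. the UPD) for an
    image with H rows and Wd columns, pixels indexed in row-major order."""
    n = H * Wd
    rows, cols, vals = [], [], []
    half = r // 2
    for i in range(n):
        ri, ci = divmod(i, Wd)
        for dr in range(-half, half + 1):
            for dc in range(-half, half + 1):
                rj, cj = ri + dr, ci + dc
                if 0 <= rj < H and 0 <= cj < Wd:   # stay inside the image
                    j = rj * Wd + cj
                    rows.append(i)
                    cols.append(j)
                    vals.append(np.exp(-rho[i, j] ** 2 / sigma ** 2))
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n, n))
```

The sparse output matters: with r² ≪ n, iterative eigensolvers on this matrix are far cheaper than on a dense affinity.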

A. Computational Complexity
Note that the graph Laplacian L constructed in the proposed method is sparse, with O(r²) non-zero entries per row. Since the cost of computing ρ_∞(x_i, x_j) is proportional to the number of edges in the underlying graph G, the complexity of computing L is O(r²nk), where k is the number of nearest neighbors in G_0. It is known that k ∼ log(n) is sufficient to ensure that, with high probability, the UPD on a k-nearest neighbors graph and the UPD on the fully connected graph coincide [15], [4]. For such a k, the computation of L is O(r²n log(n)). Once L is computed, computing the K = O(1) lowest-frequency eigenvectors is O(r²n) using iterative methods. Finally, running K-means via Lloyd's algorithm on these eigenvectors is O(n) if the number of iterations is constant with respect to n. Thus, the overall algorithm has complexity O(r²n log(n)).

B. Discussion of Parameters
The main parameters of the proposed method are σ, r, and K. Many automated methods exist for determining the scaling parameter σ in spectral clustering [10], and spatially regularized spectral graph methods are typically somewhat robust to the choice of spatial radius r. On the other hand, the number of clusters K is notoriously challenging to estimate, particularly for data with nonlinear or elongated shapes.
A commonly employed heuristic in spectral clustering is the eigengap heuristic, which estimates K̂ = arg max_k (λ_{k+1} − λ_k). However, this estimate depends strongly on σ. One can instead consider a multiscale eigengap [16], [4], which simultaneously maximizes over the eigenvalue index and over σ:

K̂ = arg max_k max_σ (λ_{k+1}(σ) − λ_k(σ)),

where λ_i(σ) is the i-th smallest eigenvalue of L when it is computed using σ. We show in Section IV-C that this approach is effective when using UPD for HSI.
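The multiscale eigengap can be sketched as follows, assuming a precomputed distance matrix `rho`; the function name, the σ grid, and the dense eigendecomposition are illustrative choices rather than the paper's implementation.

```python
import numpy as np

def estimate_K(rho, sigmas, k_max):
    """Multiscale eigengap sketch: for each candidate sigma, form the
    symmetric normalized Laplacian from W_ij = exp(-rho_ij^2 / sigma^2),
    take its k_max + 1 smallest eigenvalues, and return the index k that
    maximizes the gap lambda_{k+1} - lambda_k jointly over k and sigma."""
    best_gap, best_k = -np.inf, None
    for sigma in sigmas:
        W = np.exp(-rho ** 2 / sigma ** 2)
        d = W.sum(axis=1)
        dinv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = np.eye(len(W)) - dinv[:, None] * W * dinv[None, :]
        lam = np.sort(np.linalg.eigvalsh(L))[: k_max + 1]
        gaps = np.diff(lam)            # gaps[k-1] = lambda_{k+1} - lambda_k
        k = int(np.argmax(gaps)) + 1
        if gaps[k - 1] > best_gap:
            best_gap, best_k = gaps[k - 1], k
    return best_k
```

For two well-separated groups of points, the first two eigenvalues are near zero for any reasonable σ, so the maximal gap sits between the second and third eigenvalues and the estimate is K̂ = 2.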

IV. EXPERIMENTAL ANALYSIS
To validate the proposed method, we perform clustering experiments on four data sets: two synthetic HSI, and two real HSI.
The first synthetic data set is generated by uniformly sampling 500 points from four 2-dimensional spheres of differing radii. The second synthetic data set is generated by randomly rotating three copies of a uniform grid sample of [0, 1]³ into R^{199}, then separating them in the 200-th dimension by translating the second cube by (0, 0, . . . , 0, 0.1) and the third cube by (0, 0, . . . , 0, 0.2). Each cube contains 1000 points, spatially arrayed as a 20 × 50 patch. These clusters are concatenated spatially into a 60 × 50 × 200 synthetic HSI; see Fig. 2. To demonstrate the necessity of spatial regularization, we randomly select 30 points from the middles of cube 1 and cube 3 and swap them. This can be understood as a kind of noise, to which we expect spatially regularized methods to be robust.
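The three-cubes construction can be sketched as follows; the QR-based random orthogonal rotation, the function signature, and the independent rotation per copy are assumptions for illustration, not the paper's exact generator.

```python
import numpy as np

def three_cubes(n_side=10, D=200, sep=0.1, seed=0):
    """Sketch of the three-cubes synthetic data: a uniform grid sample of
    [0,1]^3 is rotated into R^{D-1} (here via a random orthogonal matrix
    from a QR decomposition), and the three copies are then separated in
    dimension D by offsets 0, sep, and 2*sep."""
    rng = np.random.default_rng(seed)
    g = np.linspace(0, 1, n_side)
    cube = np.stack(np.meshgrid(g, g, g), -1).reshape(-1, 3)  # n_side^3 pts
    blocks, labels = [], []
    for c in range(3):
        # random orthogonal matrix; its first 3 rows embed R^3 into R^{D-1}
        Q, _ = np.linalg.qr(rng.standard_normal((D - 1, D - 1)))
        X = cube @ Q[:3, :]
        last = np.full((len(cube), 1), c * sep)   # separating coordinate
        blocks.append(np.hstack([X, last]))
        labels += [c] * len(cube)
    return np.vstack(blocks), np.array(labels)
```

With n_side = 10 and D = 200 this reproduces the stated sizes: three clusters of 1000 points in R^200.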
We also consider two real HSI data sets: the Salinas A and Pavia U data sets.² Visualizations of the real HSI, along with their partial ground truth, are in Figs. 3 and 4, respectively. Note that both of these data sets contain a relatively small number of classes; in the case of Pavia U, we chose to crop a small subregion to use for experiments, both due to the well-documented challenges of using too many ground truth classes for unsupervised HSI clustering [17], and in order to ensure that most pixels in the image have ground truth labels.

A. Comparison Methods
We compare with four benchmark clustering methods, as well as four state-of-the-art methods. The benchmark methods are K-means clustering [18]; K-means clustering on data that has been dimension-reduced with PCA, using the top K principal components; Gaussian mixture models [18]; and spectral clustering using Euclidean distance [1]. The state-of-the-art methods are diffusion learning [19], [20]; density peaks clustering [7]; nonnegative matrix factorization [21]; and local covariance matrix representation [22]. For all experiments, we assume the number of clusters K is known a priori; in Section IV-C, we show how the proposed method can estimate K.

B. Clustering Accuracy
To perform quantitative comparisons, we align each clustered data set with the ground truth by solving a linear assignment problem with the Hungarian algorithm, then compute the overall accuracy (OA), average accuracy (AA), and Cohen's κ statistic for each clustering. Numerical results are in Table I, with visual results in Figs. 5-8.
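The alignment step can be sketched with SciPy's Hungarian solver; `aligned_accuracy` is a hypothetical helper reporting only the overall accuracy (OA), with cluster and class labels both assumed to be in {0, . . . , K−1}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_accuracy(y_true, y_pred, K):
    """Align predicted cluster indices to ground-truth labels by
    maximizing total overlap (a linear assignment problem solved with the
    Hungarian algorithm), then report overall accuracy (OA)."""
    C = np.zeros((K, K), dtype=int)       # C[t, p] = overlap of class t, cluster p
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    row, col = linear_sum_assignment(-C)  # negate: maximize total overlap
    mapping = {p: t for t, p in zip(row, col)}
    y_aligned = np.array([mapping[p] for p in y_pred])
    return (y_aligned == np.asarray(y_true)).mean()
```

Because the assignment is over cluster indices, a clustering that is perfect up to a relabeling of its clusters receives an accuracy of 1.0.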

C. Estimation of Number of Clusters
In Fig. 9, the eigenvalues of the SRUSC Laplacian are shown for a range of scales σ. We see that for a range of σ values, the largest gap is between the K-th and (K + 1)-st eigenvalues for both synthetic data sets, as well as Pavia U, indicating that SRUSC correctly estimates K for these data; for Salinas A, the proposed method estimates K = 7 rather than 6. This suggests that SRUSC is able to estimate the number of clusters even on challenging, high-dimensional HSI. The multiscale eigenvalues were also computed with the Euclidean Laplacian, with very poor results.

Fig. 6: On the synthetic three cubes data set, only the proposed method is able to correctly label all data points. In particular, the spatial regularization is necessary to gain robustness to the noise introduced.

Fig. 7: On the Salinas A data set, the proposed method performs strongly, with DL also performing well.

Fig. 9: The first eigenvalue (blue) is always 0; the second eigenvalue is red, the third yellow, and so on.

D. Runtime
The runtimes for all algorithms are in Table II. All experiments were performed on a MacBook Pro with a 3.5 GHz Intel Core i7 processor and 16 GB RAM. The runtimes for SC and SRUSC are much higher than for the other comparison methods because these methods perform model selection by using the eigengap statistic to estimate the number of clusters, which requires running the algorithm over many instances of the scaling parameter σ. If the number of clusters is known a priori, then the runtimes for SC and SRUSC are comparable to DL.

Regarding clustering performance (Table I), we see that across all methods and data sets, the proposed SRUSC method gives the best results. We note that several methods perform well on the synthetic four spheres data set: K-means, PCA followed by K-means, spectral clustering, and FSFDPC all give perfect performances. However, only the proposed method gives perfect performances on the three cubes synthetic data and on the Pavia U data set.

V. CONCLUSIONS AND FUTURE RESEARCH
In this article, we showed that UPD are a powerful and efficient metric for the unsupervised analysis of HSI. When embedded in the spectral clustering framework and combined with suitable spatial regularization, state-of-the-art clustering performance is realized on a range of synthetic and real data sets. Moreover, ultrametric spectral clustering is mathematically rigorous and enjoys theoretical understanding that many unsupervised learning algorithms lack.
Based on the success of unsupervised learning with ultrametric path distances, it is of interest to develop semi-supervised path distance approaches for HSI. Path distances are effective for semi-supervised learning in several contexts [23], and it remains to understand how the UPD performs in the specific context of high-dimensional HSI, particularly for active learning of HSI [24], [25].