From Model-Based Optimization Algorithms to Deep Learning Models for Clustering Hyperspectral Images

Hyperspectral images (HSIs), captured by different Earth observation airborne and space-borne systems, provide rich spectral information in hundreds of bands, enabling far better discrimination between ground materials that are often indistinguishable in visible and multi-spectral images. Clustering of HSIs, which aims to unveil class patterns in an unsupervised way, is highly important in the interpretation of HSI, especially when labelled data are not available. A number of HSI clustering methods have been proposed. Among them, model-based optimization algorithms, which learn the cluster structure of data by solving convex/non-convex optimization problems, have achieved the current state-of-the-art performance. Recent works extend the model-based algorithms to deep versions with deep neural networks, obtaining huge breakthroughs in clustering performance. However, a systematic survey on the topic is absent. This article provides a comprehensive overview of HSI clustering methods and tracks the latest techniques and breakthroughs in the domain, including the traditional model-based optimization algorithms and the emerging deep learning based clustering methods. With a new taxonomy, we elaborate on the main ideas, technical details, advantages, and disadvantages of different types of clustering methods of HSIs. We provide a systematic performance comparison between different clustering methods by conducting extensive experiments on real HSIs. Unsolved problems and future research trends in the domain are pointed out. Moreover, we provide a toolbox that contains implementations of representative clustering algorithms to help researchers develop their own models.


Introduction
A hyperspectral remote sensing image can be viewed as a stack of gray-scale images, each capturing the spectral reflectance characteristics of land cover in a narrow range of wavelengths. The rich spectral information makes it possible to recognize subtle differences and changes in the compositions of materials that cannot be noticed in optical photographs [1]. This is of interest in various domains ranging from space exploration and Earth observation to ocean monitoring and precision agriculture. Figure 1 shows an example of a hyperspectral image (HSI). Clustering of HSI refers to categorizing pixels into different clusters in an unsupervised way, where pixels of the same cluster are more similar than those from different clusters. It unveils the important structure of HSIs, exploiting the fact that pixels from the same cluster often share a common characteristic. The obtained structure information can be used to compress the relevant image content by merging similar pixels, significantly reducing the data volume of HSI to be interpreted. This alleviates the huge burden on big data storage, transmission and real-time processing, which is highly important in current on-trend nanosatellites with very limited power budgets [2]. It should be noted that the clustering of HSI can also refer to the clustering of spectral bands.

Figure 1. An example of HSI for Matiwan Village in Xiongan, Hebei Province, China, consisting of 250 bands with a spatial size of 3750 × 1580, and spectral signatures of four representative land covers, i.e., "building", "water", "pear tree" and "grass".

Figure 2. The number of publications in Web of Science by searching with topics (a) "hyperspectral", "remote sensing", and "classification"; (b) "hyperspectral", "remote sensing", and "clustering".
Traditional clustering methods of HSI include centroid-based [19][20][21], density-based [22][23][24], probability-based [25][26][27], and biologically driven methods [28,29]. Model-based optimization methods [30][31][32][33][34] that employ matrix representation techniques, such as sparse representation (SR) [35], low-rank representation (LRR) [30], and non-negative matrix factorization (NMF) [36], have achieved the current state-of-the-art performance, attracting significant attention in the field. By solving related convex/non-convex optimization problems, useful features/embeddings or important properties (e.g., connectivities) of the data can be obtained for clustering. Recent works have extended model-based methods to deep versions and adopted neural networks to extract deep features for clustering, which is more effective in dealing with the nonlinear data structure of HSIs. Two important questions are: (1) do deep clustering models always outperform the model-based clustering methods? and (2) which factors should be taken into account to develop an effective clustering model of HSI? With a comprehensive overview of HSI clustering methods and extensive experiments, we answer these two questions in this article. In the literature, there are some excellent overview papers on clustering methods [18,[37][38][39][40]]. However, most of them focus on object-level clustering tasks involving gray-scale and color images, and surveys on the clustering of HSI are very scarce. This survey fills this gap by providing a comprehensive overview of the state-of-the-art clustering methods of HSIs. In particular, we introduce the main ideas, technical details, advantages, and disadvantages of different types of clustering methods. A new taxonomy of clustering methods is proposed, which helps readers to better follow the rapidly evolving techniques in the domain. We conduct extensive experiments on real HSIs to support a comprehensive comparative performance analysis of different clustering methods. Moreover, we provide an open source library that contains the codes of different methods to help researchers develop their own models, especially beginners who are willing to enter the field. Lastly, we discuss the limitations of the current status of the field and indicate promising research directions.

The Challenges in the Clustering of HSI
The clustering of hyperspectral images is challenging for the following reasons:

1. Clustering of high-dimensional data, such as HSI, is difficult in general, due to the so-called "curse of dimensionality" problem [41]. The redundant bands of HSI make the inherent meaningful clusters sparse in the high-dimensional space. Conventional distances such as the Euclidean distance are no longer effective for measuring the similarity of data points, due to the participation of irrelevant dimensions (see the toy illustration after this list).

2. Clustering of HSI at pixel level needs efficient algorithms to process large volumes of hyperspectral data. However, advanced models are often required to fit the complex cluster structure of the data and yield accurate clustering results, which leads to computationally expensive algorithms. Striking a good balance between efficiency and accuracy is difficult.

3. Influenced by sensor noise, varying imaging conditions and spectral mixing, hyperspectral data often show large within-class spectral variabilities, leading to a mixture of different clusters to a certain degree. The data distribution within a class can be arbitrary, which makes centroid-based approaches infeasible.

4. Estimation of the number of clusters in HSI is not trivial. Similar clusters can be merged into a major cluster or, on the contrary, a major cluster can be divided into several sub-clusters. Current clustering approaches mostly assume that the number of clusters is known.
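To make the first challenge concrete, here is a small toy illustration in Python (the sizes and names are our own choices, not from any HSI toolbox): as the dimensionality grows, the gap between the nearest and the farthest neighbour shrinks relative to the mean distance, so plain Euclidean similarity carries less and less information.

```python
import numpy as np

# Toy illustration of distance concentration in high dimensions:
# the relative contrast (d_max - d_min) / d_mean shrinks as the
# dimensionality grows, one face of the "curse of dimensionality".
rng = np.random.default_rng(0)
for dim in (3, 50, 500):
    points = rng.random((1000, dim))
    d = np.linalg.norm(points - points[0], axis=1)[1:]  # distances to one query point
    print(f"dim={dim:4d}  relative contrast={(d.max() - d.min()) / d.mean():.3f}")
```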
Traditional clustering methods often yield an unsatisfactory performance in the clustering of HSI. For instance, k-means is known to be sensitive to initialization and noise, and only works well on "ball"-like distributed data, which is often not the case for high-dimensional HSI [42]. Density-based clustering algorithms assume that a cluster is a contiguous region of high point density that is separated from other clusters by contiguous regions of low point density. However, due to the effect of noise and spectral variabilities, this assumption might not hold in practice. The performance of probabilistic clustering can also be degraded by the violation of its assumed probability distributions for clusters. An example with real data is shown in Figure 3, which demonstrates that the distribution of data points within a class is not spherical and the data points across different classes are highly mixed. Centroid-based clustering methods fail to uncover the correct cluster structure of such data. Compared with traditional clustering algorithms, model-based optimization methods and deep learning based methods perform clustering in a learned feature domain where the extracted features can be more discriminative than the raw data, resulting in improved clustering accuracy. Table 1 summarizes the published works of model-based optimization methods and deep learning-based methods for HSI clustering. Figure 4 shows the corresponding statistics. It is observed that most works adopt model-based clustering techniques and deep clustering models account for only 31%. As shown in Table 1, we classify model-based optimization methods into three categories: self-representation based, dictionary learning based and NMF based methods, and classify deep clustering models into four classes: self-representation based deep clustering, autoencoder based, graph convolution based and contrastive learning based approaches, each of which will be introduced in the subsequent sections.

Table 1. A summary of model-based and deep clustering methods.

Category | Subcategory | Sub-Subcategory | Algorithms | Remarks
Model-based clustering | Self-representation based | Spectral based | SSC [31], LRR [43], LRSSC [44], S0/L0-LRSSC [45] | Adopt self-representation models to learn the similarity matrix of data points for spectral clustering. Only spectral information of HSI is exploited.
Model-based clustering | Self-representation based | Object based | RMC-OOSSC [52], FHoSSC [53] | Clustering is performed at object level, which is much faster than pixel-based algorithms.

Notation
We denote scalars by lowercase letters, e.g., x, vectors by boldface lowercase letters, e.g., x, matrices by boldface capital letters, e.g., X, and tensors by capital calligraphic letters, e.g., 𝒳, in this paper. Let 𝒳 ∈ R^{B×M×N} be a 3D HSI cube with a spatial size of M × N and a spectral dimension of B. We denote by X ∈ R^{B×MN} the 2-D matrix reshaped from the 3D HSI tensor 𝒳. The definitions of the norms used in this paper are shown in Table 2, including the ℓ0 norm, ℓ1 norm, Frobenius norm, nuclear norm, etc. Tr(·) represents the trace of a matrix and D = diag(c) is a diagonal matrix with D_ii = c_i.

Table 2. The definitions of the symbols used in this article.

Symbol | Definition
x_i | The i-th column of X
‖X‖_* | The sum of the singular values of X
|c| | The absolute value of c
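As a small illustration of this notation (a sketch with arbitrary sizes; the variable names are ours), the 3D cube 𝒳 ∈ R^{B×M×N} and its reshaped 2-D matrix X ∈ R^{B×MN} are related by a single NumPy reshape, with one spectral vector per column of X:

```python
import numpy as np

B, M, N = 103, 610, 340                 # illustrative cube dimensions
cube = np.random.rand(B, M, N)          # 3D HSI tensor, bands first
X = cube.reshape(B, M * N)              # 2-D matrix: one spectral vector per column
assert np.allclose(X[:, 0], cube[:, 0, 0])
```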

Self-Representation Based Clustering Methods
Sparse representation is a landmark technique for dealing with high-dimensional data and has already achieved great success in signal processing [108][109][110], pattern recognition [111], image processing [112][113][114] and computer vision [115,116]. Basically, it represents an input signal by a linear combination of a few atoms from a dictionary. Next to sparse representation, low-rank representation is another successful technique in signal processing, which aims to learn a representation of data that has a low-rank property. Recently, both techniques were adopted to learn the similarities between data points within a self-representation framework where the input data are employed as the dictionary.
Self-representation based clustering methods are in fact built on the framework of spectral clustering, as shown in Figure 5, where the similarity matrix of a graph, i.e., W, is derived from the coefficients matrix C that is learned by solving sparse coding or low-rank representation problems with the input data as a dictionary:

arg min_{C,E} F(C) + λ G(E)  s.t.  X = XC + E,  (1)

where F(C) is a regularization term with respect to C, which can be a sparse constraint, a low-rank constraint, a smoothing constraint or mixed constraints, and G(E) is a function with respect to the error matrix E. Clustering models of the form (1) are often referred to as subspace clustering in the literature, with the assumption that data points belonging to the same class are drawn from a linear subspace [37]. Compared with traditional spectral clustering methods where the similarity matrix is often built with a fully connected graph, a k nearest neighbours (KNN) graph or an ε-neighborhood graph, self-representation based methods have the following advantages in general:

1. The number of nearest neighbours in the graph is adaptively determined for each data point by the sparsity or low-rank constraint in the representation models, which avoids specifying a fixed number of neighbours for all data points as in a KNN graph.

2. Selecting an effective similarity measurement between data points is difficult in general, especially for high-dimensional data where the "curse of dimensionality" problem might arise. In self-representation based models, the representation coefficients matrix is utilized to build the similarity matrix, thereby avoiding the ad hoc selection of similarity measurements.

Figure 5. Self-representation based clustering methods often consist of three steps: self-representation model design (with sparse, low-rank, spatial, or combined constraints), graph construction and spectral clustering, where each column of X represents the spectral vector of a pixel.
We classify self-representation based clustering methods into seven sub-categories: spectral-based, spatial-spectral, object-based, semi-supervised, multi-view, kernel-based and graph learning based methods. Each of them is introduced in the following subsections.

Spectral-Based Clustering Methods
Sparse subspace clustering (SSC) [31] and low-rank representation (LRR) [43] are two pioneering works of self-representation based clustering methods. Following the framework in Figure 5, SSC obtains a sparse coefficients matrix C = [c_1, c_2, . . . , c_MN] by solving:

arg min_{c_i} ‖c_i‖_0  s.t.  x_i = Xc_i,  c_ii = 0,  (2)

where ‖c_i‖_0 represents the number of non-zeros of c_i and c_ii is the i-th element of c_i. The constraint c_ii = 0 is used to avoid the trivial solution C = I. Minimizing the sparsity of the representation vector c_i with the ℓ0-norm is NP-hard. However, model (2) can be approximately solved by relaxing the ℓ0-norm to the convex ℓ1-norm:

arg min_{c_i} ‖c_i‖_1  s.t.  x_i = Xc_i,  c_ii = 0,  (3)

where ‖c_i‖_1 = ∑_j |c_ij|. The essential idea of SSC is that, among the infinitely many possibilities to represent a data point x_i in terms of other points, a sparse representation will select a few points that belong to the same class as x_i. Thus, the coefficients matrix C can be used to build a similarity matrix for the input data points by W = (|C| + |C^T|)/2. By applying the similarity matrix in a standard spectral clustering algorithm, one can obtain the clustering results of the data. LRR learns the coefficients matrix C with a low-rank constraint by solving the following optimization problem:

arg min_{C,E} ‖C‖_* + λ‖E‖_{2,1}  s.t.  X = XC + E,  (4)

where ‖C‖_* denotes the nuclear norm of C, i.e., the sum of the singular values of C, which is used to regularize the rank of a matrix, ‖E‖_{2,1} = ∑_j √(∑_i E_ij^2), and λ is a parameter to control the balance between the different terms. The utilized low-rank constraint makes LRR robust to noise and outliers [43]. Compared with SSC, LRR is more effective in learning the global structure of data.
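To make the SSC pipeline concrete, the following minimal sketch solves the ℓ1-relaxed problem (3) column by column with an off-the-shelf Lasso solver (a simple stand-in for the dedicated ADMM solvers used in the literature; the function name and the value of alpha are our own choices) and then feeds W = (|C| + |C^T|)/2 to spectral clustering:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_sketch(X, n_clusters, alpha=0.01):
    """Minimal SSC sketch. X is (B, N) with one spectral vector per column."""
    B, N = X.shape
    C = np.zeros((N, N))
    for i in range(N):
        mask = np.arange(N) != i                      # enforce c_ii = 0
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=2000)
        lasso.fit(X[:, mask], X[:, i])                # x_i ~ X c_i (l1-relaxed)
        C[mask, i] = lasso.coef_
    W = 0.5 * (np.abs(C) + np.abs(C).T)               # symmetric similarity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```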
Due to the sparse constraint, the solution of SSC is sometimes too sparse, resulting in the over-segmentation of data points within a cluster. Moreover, the performance of LRR degrades if the subspaces of the data are not independent. To address these issues, Wang et al. [44] combine sparse and low-rank constraints in a unified model, called LRSSC, yielding improved performance over SSC and LRR. The aforementioned models SSC, LRR and LRSSC often utilize relaxed convex norms, i.e., the ℓ1 norm and the nuclear norm, to measure the sparsity and rank of the coefficients matrix. However, the approximated solutions are suboptimal for the original sparse or low-rank constrained optimization problems. In [45], S0/L0-LRSSC was proposed, using the non-convex ℓ0 quasi-norm ‖C‖_0 for the sparsity constraint and the Schatten-0 quasi-norm ‖C‖_{S_0} = ‖diag(Σ)‖_0 (where UΣV^T = C is the singular value decomposition of C) for the low-rank constraint, achieving improved performance compared with LRSSC. Although these methods outperform traditional clustering methods such as k-means and fuzzy c-means in terms of accuracy, they only exploit the spectral information of HSI and neglect spatial dependencies of data points, resulting in a performance that is sensitive to spectral variabilities and sparse noise.

Spatial-Spectral Clustering Methods
It has been demonstrated that using spatial information together with spectral information can effectively improve the performance of various HSI processing tasks including supervised classification [117], denoising [118][119][120], change detection [121] and super-resolution [122]. Similarly, incorporating spatial information proves to be beneficial in HSI clustering as well, resulting in a number of spatial-spectral extensions of SSC and LRR in recent years [42,46,[48][49][50][51][123][124][125][126]]. Spatial-spectral clustering methods take spatial information into account by introducing local constraints on the coefficients matrix or by applying post-processing techniques, such as filtering, to promote piece-wise smoothness of the representation coefficients. As pixels in a local region belong to the same cluster with high probability, the improved smoothness of the coefficients leads to reduced within-cluster variance in the representation/feature domain, which facilitates building a better similarity matrix and thus obtaining improved accuracy in standard spectral clustering.
A number of spatial regularizations have been integrated into SSC to promote piece-wise smoothness of the coefficients matrix; they are summarized in Table 3. Denoting the spatial regularization by Ψ(C), the related optimization problems can be represented in a unified form:

arg min_{C,E} Θ(C) + λ_1 Ψ(C) + λ_2 G(E)  s.t.  X = XC + E,  diag(C) = 0,  C^T 1 = 1,  (5)

where Θ(C) is the sparse or low-rank constraint, diag(C) denotes a vector consisting of the elements C_ii and the constraint C^T 1 = 1 indicates an affine subspace of the data.

Table 3. Spatial regularizations in spatial-spectral clustering models.

Method | Spatial regularization Ψ(C) | Remarks
JSSC [46] | ∑_i ‖C_i‖_{1,2} | C_i is the coefficients matrix corresponding to the pixels within the i-th super-pixel
SpatSC [47] | ‖CH‖_1 | H is a difference matrix for 1-D hyperspectral data
L2-SSC [48] | ∑_{i=1}^{MN} ∑_{j∈N_i} ‖c_i − c_j‖_2^2 | N_i is the index set of horizontal and vertical neighbours of the i-th pixel
TV-CRC-LAD [32] | ∑_{i=1}^{MN} ∑_{j∈N_i} ‖c_i − c_j‖_1 | N_i is the index set of horizontal and vertical neighbours of the i-th pixel
S4C [42] | ‖C − C̃‖_F^2 | C̃ is the smoothed matrix of C with a 2-D mean filter
S-SSC [49] | ‖C − C̃‖_F^2 | C̃ is the smoothed matrix of C with a 3D median filter
LCR-FLDA [50] | Tr(CLC^T) | L is the Laplacian matrix of a normal graph
SPHG-LRSC [51] | Tr(CL_H C^T) | L_H is the Laplacian matrix of a hypergraph

Huang et al. [46] take into account the spatial dependencies of pixels in a local region by introducing a joint sparsity constraint Ψ(C) = ∑_i ‖C_i‖_{1,2}, where C_i ∈ R^{MN×N_i} is the coefficients matrix of the N_i pixels in a local region defined by super-pixel segmentation of the HSI. The utilized ℓ1,2 norm promotes pixels within a super-pixel to select a common set of samples in the subspace representation, resulting in similar coefficients for pixels within a super-pixel. The works in [42,49] develop spatial regularizations of the form Ψ(C) = ‖C − C̃‖_F^2, where C̃ is a smoothed matrix of C obtained by smoothing filters, such as a 2-D mean filter in [42] and a 3D median filter in [49,126]. In [42], the 2-D mean filter is applied on each slice of a reshaped 3D coefficients cube 𝒞 ∈ R^{M×N×MN}, where 𝒞(:, :, i) ∈ R^{M×N} is obtained by reshaping the i-th row of matrix C. Compared with the slice-by-slice filtering strategy in [42], a 3D median filter with a 3D moving window is performed on the tensor cube 𝒞 in [49,126], which promotes column-wise and row-wise smoothness of C at the same time.
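The two filtering-based regularizations above admit a compact implementation; the sketch below (our own helper, using SciPy filters as stand-ins for the filters described in [42,49]) computes the smoothed reference matrix C̃ from C by slice-wise 2-D mean filtering or 3D median filtering of the reshaped coefficients cube:

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

def smoothed_reference(C, M, N, size=3, mode="mean"):
    """Sketch of the smoothed matrix C~ used in the S4C / S-SSC regularizers.

    Each of the MN rows of C is reshaped to an M x N slice, so spatially
    neighbouring pixels are pushed towards similar coefficients.
    """
    cube = C.reshape(-1, M, N)                        # one M x N slice per row of C
    if mode == "mean":                                # slice-wise 2-D mean filter [42]
        cube = uniform_filter(cube, size=(1, size, size))
    else:                                             # 3D median filter [49]
        cube = median_filter(cube, size=(size, size, size))
    return cube.reshape(C.shape)
```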
Instead of approaching a reference matrix obtained by filtering C, total variation (TV) based spatial regularizations have been developed to promote similar representations of neighbouring data points. For instance, Guo et al. [47,127] introduce a TV-based regularization, i.e., Ψ(C) = ‖CH‖_1, for 1-D hyperspectral data acquired by a spectrometer, where H is a difference matrix. Zhai et al. [32,48] develop the TV-based regularizations listed in Table 3, i.e., ∑_{i=1}^{MN} ∑_{j∈N_i} ‖c_i − c_j‖_1 and ∑_{i=1}^{MN} ∑_{j∈N_i} ‖c_i − c_j‖_2^2, for the clustering of HSI, where N_i is the index set of the adjacent spatial neighbours of the i-th pixel in the horizontal and vertical directions. Minimizing the TV-regularized optimization problems in fact promotes sparsity of the difference matrices of C, leading to local smoothness of the coefficients in the spatial domain. It was demonstrated in [32,47,48,127] that introducing TV-based spatial regularizations significantly increases the clustering accuracy compared with SSC.
Another type of spatial regularization is built on manifold learning with a graph Laplacian. By considering each pixel as a graph node, a graph built from the input data is utilized to constrain the manifold structure of the data in the representation domain to be identical to that in the original data space. Liu et al. [50] introduced a k nearest neighbours (KNN) graph-based spatial constraint, where the similarity between the i-th pixel and the j-th pixel is measured by:

(W_knn)_ij = exp(−‖x_i − x_j‖_2^2/σ^2) if x_i and x_j are KNN neighbours of each other, and (W_knn)_ij = 0 otherwise.  (6)

By defining the Laplacian matrix L = D − W_knn, where D = diag(W_knn 1), the KNN graph-based regularization is reformulated as Ψ(C) = Tr(CLC^T), where Tr(C) is the trace of a real square matrix C, i.e., Tr(C) = ∑_i C_ii. The graph Laplacian constraint promotes similar pixels to yield similar representation coefficients, facilitating a better similarity matrix and thus leading to improved clustering accuracy. A normal graph can only model pair-wise connections of nodes. In fact, one node can have connections with multiple nodes and the connected nodes can be seen as a group. In order to exploit group information in the subspace representation, Xu et al. [51] introduce a hypergraph-based regularization, i.e., Ψ(C) = Tr(CL_H C^T), where L_H is a normalized hypergraph Laplacian matrix obtained by:

L_H = I − D_v^{−1/2} H W_H D_e^{−1} H^T D_v^{−1/2},  (7)

with D_v a vertex-degree matrix, D_e a hyperedge-degree matrix, H an incidence matrix and W_H a weight matrix. We refer to [51] for details. The hypergraph-based regularization constrains the pixels (often more than two) connected by one hyperedge to yield similar representations, thereby yielding better performance than models using a normal graph. It should be noted that the construction of the graphs in [50,51] is highly important, as it significantly affects the clustering accuracy.
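For reference, a KNN graph Laplacian consistent with (6) can be assembled in a few lines (a sketch; the helper name, the Gaussian weighting and the symmetrization choice are ours):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_laplacian(X, k=10, sigma=1.0):
    """Sketch of L = D - W_knn for Eq. (6). X is (B, N), pixels as columns."""
    W = kneighbors_graph(X.T, n_neighbors=k, mode="distance").toarray()
    W = np.exp(-W**2 / sigma**2) * (W > 0)            # Gaussian weights on kNN edges
    W = np.maximum(W, W.T)                            # symmetrise the graph
    return np.diag(W.sum(axis=1)) - W                 # L = D - W
```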
In [123,128,129], post-processing techniques are developed for LRR or SSC. Different from the aforementioned spatial regularizations, the post-processing step is independent of the optimization problems with respect to C. In [128], a non-local majority voting scheme was proposed, which identifies the cluster of a data point by majority voting among its non-local neighbours, yielding improved clustering accuracy. In [123], a cascaded weighting and local bilateral filtering scheme is applied on the coefficients matrix of LRR, leading to a better similarity matrix and thus improved clustering results in spectral clustering. In [129], two different strategies, i.e., cosine-Euclidean (CE) and CE dynamic weighting (CEDW), are proposed to build more accurate similarity matrices from the coefficients matrix of SSC. A cosine measure on the sparse coefficients of two pixels is used to exploit the spectral information of HSI and the Euclidean distance is adopted to incorporate spatial information. Both spectral and spatial information are taken into account in CE and CEDW. In [130], based on the sparse coefficients of SSC, an improved similarity matrix is built by multiplying a cosine-measured similarity matrix with a Gaussian kernel dynamic similarity matrix, incorporating both the spatial and spectral information of HSI. In general, post-processing-based clustering approaches have lower computational complexities than spatial regularization-constrained clustering models. However, as the post-processing step is performed on the results of LRR or SSC, the performance of [123,[128][129][130]] might be significantly degraded when LRR or SSC fails to produce a fair result.

Object-Based Clustering Methods
The aforementioned clustering methods classify HSI pixel-by-pixel, which can be easily affected by impulse noise or outliers. Moreover, due to the huge dictionary in the self-representation models, the computational complexities of these approaches are excessively high, which imposes a severe limitation on large-scale data. To alleviate these problems, object-based clustering methods [52,53] were developed. Compared with pixel-wise clustering approaches, object-based clustering methods require an additional pre-processing step to compress the data size of the HSI. Super-pixel segmentation techniques are often applied to achieve this, segmenting the HSI into non-overlapping super-pixels and considering each super-pixel as an "object". As the pixels within a super-pixel often belong to the same cluster, one can cluster an HSI at the super-pixel level, which significantly reduces the number of data points.
In [52], the mean-shift segmentation method [131] is adopted for super-pixel segmentation. Let p be the number of super-pixels or "objects". Then, reweighed mass centers of the 3D "objects", denoted by X̃ = [x̃_1, x̃_2, . . . , x̃_p], are iteratively learned, which serve as the input spatial-spectral features of the different "objects". Next, the representation coefficients matrix of X̃ is obtained by solving:

arg min_{C̃} ‖W_e ⊙ C̃‖_1 + (λ/2) ‖X̃ − X̃C̃‖_F^2  s.t.  diag(C̃) = 0,  (8)

where W_e is a weight matrix to improve the sparsity of C̃ and ⊙ represents the element-wise multiplication of two matrices. As the clustering is performed at the super-pixel level, the clustering speed is much faster than that of pixel-wise clustering methods. Wang et al. [53] employ SLIC [132] for the super-pixel segmentation of HSIs. Compared with [52], the correlation between super-pixels is explicitly taken into account in the subspace representation by introducing a new spatial regularization. The objective function with respect to the coefficients matrix C̃ of super-pixels is modelled by

arg min_{C̃} ‖C̃‖_1 + (λ_1/2) ‖S − SC̃‖_F^2 + (λ_2/2) ‖C̃ − Ĉ‖_F^2  s.t.  diag(C̃) = 0,  (9)

where each column of S is the averaged spectral signature of a super-pixel and Ĉ is a coefficients matrix estimated from KNN neighbours, i.e., ĉ_i = (1/d_i) ∑_{j∈Ω_i} T_ij c̃_j with Ω_i the neighborhood of the i-th super-pixel, d_i = ∑_{j∈Ω_i} T_ij and T_ij = exp(−‖s_i − s_j‖_2^2/σ^2). By applying the similarity matrix W = (|C̃| + |C̃^T|)/2 in spectral clustering, the clustering results can be obtained. To alleviate the effect of inaccurate super-pixel segmentation, the authors of [53] further refine the clustering results with a cumulative Markov random field (MRF)-based post-processing method, resulting in improved clustering accuracy.
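The object-level pre-processing step can be sketched as follows (assuming a recent scikit-image with channel_axis support; the segmentation parameters are illustrative), producing one averaged spectrum per super-pixel, as in the matrix S above:

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_spectra(hsi, n_segments=500):
    """SLIC super-pixels on an (M, N, B) cube, then one mean spectrum each."""
    labels = slic(hsi, n_segments=n_segments, compactness=0.1,
                  channel_axis=-1)                    # (M, N) super-pixel map
    S = np.stack([hsi[labels == s].mean(axis=0)       # (B,) mean spectrum
                  for s in np.unique(labels)], axis=1)
    return S, labels                                  # S is (B, p)
```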

Semi-Supervised Clustering Methods
Typically, the clustering of HSI does not use any labelled data. However, sometimes a few labelled data points might be accessible, which can provide helpful supervised information to guide clustering algorithms to better learn the cluster structure of the data. By incorporating supervised information, semi-supervised clustering methods were developed in [54,55,57]. The works [54,55] focus on the refinement of the coefficients matrix in self-representation models with supervised information, aiming for a more block-diagonal similarity matrix. In [54], a class probability propagation of supervised information based on SSC (CPPSSC) algorithm was proposed, which shares the same form as the objective function in (8). Compared with the object-based clustering model in (8), CPPSSC is a pixel-level clustering approach where the input matrix is X and W_e is obtained from supervised information. CPPSSC first derives class probabilities of data points by using sparse representation classification [133] with a dictionary constructed from all labelled data. Then the inner product of the class probabilities is utilized to measure the similarities of data points, resulting in a supervised weight matrix W_e. By imposing the weight matrix on the coefficients matrix, the connectivities of data points can be learned more accurately in the sparse coding, facilitating a more block-diagonal similarity matrix. Benefiting from the supervised information, the semi-supervised model CPPSSC outperforms unsupervised models such as SSC and S4C. However, due to the lack of spatial regularization, its performance is sensitive to the amount of labelled data.
In [55], a semi-supervised method, called joint SSC with label information (JSSC-L), was proposed, which incorporates spatial information and label information in a unified model. Specifically, a joint sparsity constraint is introduced to regularize pixels within a super-pixel to select a common set of data points in the subspace representation. To refine the coefficients matrix, the authors exploit the available label information to zero the entries of the sparse coefficients matrix that correspond to data points from different classes. The objective function of JSSC-L is formulated as follows:

arg min_C ∑_{i=1}^p w_i ‖C_i‖_{1,2} + (λ/2) ‖X − XC‖_F^2  s.t.  diag(C) = 0,  P_G(C) = 0,  (10)

where w_i is the weight for the i-th super-pixel, p is the number of super-pixels, P_G(C) is a projection operator that extracts the entries of C whose indices are in G, and G is the union of the index sets {(i, i)} and {(i, j)} where the i-th and j-th pixels are labelled pixels from different classes.
In order to make full use of the labelled information, label propagation within super-pixels is carried out, which significantly increases the amount of labelled data. Compared with the semi-supervised model CPPSSC and other unsupervised models, JSSC-L achieves a significant improvement in accuracy with 1% labelled data. Different from [54,55], the authors of [56,57] propagate the label information in a graph that is obtained by solving a self-representation model. Let X_l ∈ R^{B×l} be the labelled data, X_u be the unlabelled data, X = [X_l, X_u], Y_l ∈ R^{c×l} be the one-hot label matrix of X_l and F = [F_l, F_u] be the predicted label matrix of X. The objective function of the semi-supervised clustering model, non-negative LRR (NNLRR), in [56,57] is formulated as follows:

arg min_{F,C,E} ∑_{i,j} ‖f_i − f_j‖_2^2 C_ij + λ_∞ ‖F_l − Y_l‖_F^2 + ‖C‖_* + λ ‖E‖_{2,1}  s.t.  X = XC + E,  C ≥ 0,  (11)

where λ_∞ is a sufficiently large value such that ‖F_l − Y_l‖_F^2 = 0 is approximately satisfied. The first two terms propagate the label vectors Y_l in a graph with the similarity matrix C. The non-negative constraint, i.e., C ≥ 0, allows the learned low-rank and sparse matrix C to be interpreted as a similarity matrix.
Compared with CPPSSC [54] and JSSC-L [55], NNLRR requires much more labelled data to ensure an effective propagation of labels in the graph. Moreover, because of the lack of spatial constraint in NNLRR, the learned similarity matrix can be easily affected by noise and outliers, leading to a less reliable propagation of labels.

Multi-View Clustering Methods
Multi-view clustering methods, as extensions of the aforementioned single-view clustering models, incorporate rich information from different data sources to cluster data points. Here, we refer to different sources acquired by heterogeneous sensors, such as HSI, Light Detection and Ranging (LiDAR) and synthetic-aperture radar (SAR), as well as features extracted from single-source or multi-source data, such as morphological profiles (MPs) [134], Gabor features [135] and local binary patterns [136], as different views of the same scene. Making use of the complementary information from different views can help discriminate better between data points from different classes. The essential problems of multi-view clustering methods are: (1) how to precisely capture the cluster structure of each view; and (2) how to fuse the diverse cluster structures from different views and find a common cluster structure. A flowchart of multi-view clustering methods is shown in Figure 6.

Let {X^t ∈ R^{B_t×MN}}_{t=1}^T denote the multi-view data, where B_t is the dimensionality of the t-th data source and T is the number of data sources. Existing multi-view clustering methods for HSI can be formulated in a unified form:

arg min_{{C^t},{E^t}} ∑_{t=1}^T [ λ G(E^t) + F(C^t) ] + β T({C^t}_{t=1}^T)  s.t.  X^t = X^t C^t + E^t,  t = 1, . . . , T,  (12)

where the first two terms are used to learn the individual cluster structures within a self-representation model, F(C^t) is a term consisting of different regularizations and T({C^t}_{t=1}^T) is a fusion function with respect to {C^t}_{t=1}^T. In [59,137], a multi-view clustering model was proposed by incorporating the polarization information and spectral information of HSIs. Three schemes are designed to capture the individual cluster structure of the data with different regularizations. To fuse the cluster structures of the two views, the authors of [59,137] impose the constraint C^1 = C^2, which regularizes the cluster structures learned from different views to be the same.
In [58], a spatial-spectral-based multi-view low-rank SSC (SSMLC) was proposed. It generates spectral views by partitioning the spectral bands, spatial views with morphological features and robust views with principal component analysis (PCA). SSMLC learns the cluster structures of the different views by using sparse and low-rank constraints, i.e., F(C^t) = ‖C^t‖_1 + α‖C^t‖_*, and regularizes the coefficients matrices {C^t}_{t=1}^T of different views to be similar with the constraint:

∑_{1≤t<t′≤T} ‖C^t − C^{t′}‖_F^2,  (13)

which penalizes the pairwise differences between the coefficients matrices. In order to improve the clustering accuracy of SSMLC for data that are non-linearly separable, the authors of [58] extend SSMLC to a non-linear version with a kernel trick, called K-SSMLC [60]. It learns the coefficients matrix in a higher-dimensional data space with an implicit projection function Φ(X^t): R^{B_t} → R^{B̃_t}, achieving improved accuracy compared with SSMLC. Compared with [59,137], which regularizes different views to yield the same coefficients matrix, the constraints across different views in SSMLC and K-SSMLC are more flexible, as they allow small deviations of C^t across views, which is often the case in real data. The disadvantage of [58][59][60][137] is that the learning of the view-specific coefficients matrices neglects the spatial dependencies of pixels, leading to a performance that is sensitive to noise and outliers.
In [61], a hybrid-hypergraph regularized multi-view subspace clustering (HMSC) method is put forward, which integrates local and nonlocal spatial information from each view in a unified framework. The authors incorporate the spatial content of each view by developing a hybrid-hypergraph-based manifold constraint Tr(C^t L_h^t (C^t)^T), where L_h^t is the Laplacian matrix of the hybrid-hypergraph consisting of multi-scale local hypergraphs and a nonlocal hypergraph. Moreover, a new decomposition-based scheme was proposed to learn the common intrinsic cluster structure from the view-specific subspace representations. The objective function of HMSC is formulated as follows:

arg min_{{C^t},{E^t},Z} ∑_{t=1}^T [ ‖X^t − X^t C^t‖_F^2 + λ_1 Tr(C^t L_h^t (C^t)^T) + λ_2 ‖E^t‖_1 ] + λ_3 ‖Z‖_*  s.t.  C^t = Z + E^t,  t = 1, . . . , T,  (14)

where the first two terms are used to learn the view-specific cluster structures within a self-representation model, and the fused low-rank matrix Z is shared by all the views with view-specific sparse deviations E^t. Compared with SSMLC and K-SSMLC, which integrate a low-rank regularization for each C^t, HMSC contains only one low-rank related constraint, thereby attaining lower computational complexity. Moreover, as HMSC incorporates local and nonlocal spatial information in each view, it yields a significant accuracy improvement over SSMLC. Multi-view clustering methods often outperform single-view clustering models in terms of accuracy, due to the incorporated complementary information from multi-view data. However, multi-view clustering methods require image registration to ensure an identical spatial resolution across different views, and sometimes need to generate hand-crafted "views". The quality of the image registration and the generated features can have a significant effect on the clustering accuracy of multi-view models. Moreover, the learning of coefficients matrices for different views significantly increases the computational complexity.

Kernel-Based Clustering Methods
Due to the effects of noise, spectral mixing and poor imaging conditions, the cluster structure of real HSIs can be highly complex, making the acquired data linearly non-separable. To improve the clustering performance of self-representation-based models in real applications, efforts have been made to extend the linear representation, i.e., X = XC, to non-linear versions by using non-linear mappings. Kernel methods are often exploited to learn the non-linear cluster structure of HSI [62][63][64][65][66][138]. They typically project the raw data into a reproducing kernel Hilbert space H where the correlations of data points can be more easily learned in the self-representation model. Representative kernel-based clustering models [62][63][64][138] can be summarized by:

arg min_{C,E} Γ(C) + λ G(E)  s.t.  Φ(X) = Φ(X)C + E,  (15)

where Φ(·) represents a mapping function that projects the raw data X to a new higher-dimensional feature space Φ(X) and Γ(C) is a regularization term with respect to C.
The kernel trick is often utilized to avoid an explicit mapping of the data. We define the kernel function κ: X × X → R as κ(x, y) = ⟨Φ(x), Φ(y)⟩, and the positive semidefinite Gram matrix K_XX ∈ R^{MN×MN} as:

(K_XX)_ij = κ(x_i, x_j).  (16)

Then, Equation (15) can be reformulated in terms of K_XX only; for instance, with G(E) = ‖E‖_F^2, the data-fidelity term becomes ‖Φ(X) − Φ(X)C‖_F^2 = Tr(K_XX − 2K_XX C + C^T K_XX C), yielding:

arg min_C Γ(C) + λ Tr(K_XX − 2K_XX C + C^T K_XX C).  (17)

In [62,64], a kernel low-rank and sparse subspace clustering (KLRS-SC) was proposed, which learns the correlations of data points within the framework of (17) by imposing joint low-rank and sparse constraints on the representation matrix C. The sparse constraint promotes a sparse graph, which maximizes inter-cluster separation. The low-rank constraint is used to improve the connectivities of data points belonging to the same cluster. The joint constraints enable the model to capture both the local and global structures of HSIs, leading to a more block-diagonal similarity matrix in the Hilbert space. In [63,138], a kernel SSC method with a spatial maximum pooling operation (KSSC-SMP) was proposed, which extends SSC to a kernel version. Compared with KLRS-SC, KSSC-SMP additionally incorporates the spatial information of HSI by a post-processing technique, i.e., max pooling, in the representation domain, producing a smoother clustering map.
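As a concrete instance of (16), the Gram matrix for the widely used RBF kernel κ(x, y) = exp(−γ‖x − y‖²) can be computed as follows (a sketch; the function name and γ are our own choices):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K_XX of Eq. (16) for an RBF kernel. X is (B, N)."""
    sq = np.sum(X**2, axis=0)                         # squared norms of the columns
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))       # clamp tiny negatives
```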
The acquisition of HSIs is often degraded by numerous factors, including sensor saturation, thermal effects, quantization errors and transmission errors, resulting in different types of noise in HSIs. To alleviate the effect of noise in the clustering of HSIs, Jorge et al. [65] combine a TV-based denoising model and the kernel-based clustering model KSSC-SMP in a unified framework. Minimizing the TV-based denoising term results in less noisy data, which facilitates better clustering performance in KSSC-SMP.
Different from the framework in (17), Cai et al. proposed a more generalized model, called efficient kernel graph convolutional subspace clustering (EKGCSC) [66], which improves the self-representation dictionary with graph convolution. The objective function of EKGCSC is formulated as follows:

arg min_C (1/2) ‖Φ(X) − Φ(X)ĀC‖_F^2 + (λ/2) ‖C‖_F^2,  (18)

where Ā = D̃^{−1/2}(A_s + I)D̃^{−1/2} is a normalized similarity matrix with D̃ = diag((A_s + I)1) and A_s a similarity matrix. Φ(X)ĀC can be viewed as a special linear graph convolution operation in the projected high-dimensional feature space. When Ā = I, model (18) reduces to the traditional one in (15). The new dictionary Φ(X)Ā constructed by graph embedding improves the robustness of EKGCSC to noise. Moreover, the optimization problem (18) admits a closed-form solution, which is computationally efficient and makes EKGCSC easy to implement and apply in practice.
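Assuming the ridge-regularized form of (18) given above (which may differ in detail from the exact objective in [66]), the closed-form solution follows from setting the gradient to zero, C = (Ā^T K_XX Ā + λI)^{−1} Ā^T K_XX, as sketched below:

```python
import numpy as np

def ekgcsc_coefficients(K, A_bar, lam=1.0):
    """Closed-form C for the ridge-regularized objective (18).

    K is the Gram matrix K_XX and A_bar the normalized similarity matrix.
    """
    n = K.shape[0]
    return np.linalg.solve(A_bar.T @ K @ A_bar + lam * np.eye(n), A_bar.T @ K)
```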
Kernel-based clustering methods often perform better than linear representation-based clustering methods, benefiting from the increased separability of data points in the projected high-dimensional feature space. However, the selection of a proper kernel function is challenging, and it is not guaranteed that the data lie in a union of linear subspaces in the implicit feature space. Moreover, kernel-based methods need to calculate a predefined kernel matrix, i.e., K_XX, which significantly increases the computational complexity.

Graph Learning Based Clustering Methods
Graph embedding, i.e., Tr(CLC^T) = ∑_i ∑_j ‖c_i − c_j‖_2^2 W_ij, is an effective technique to preserve the local structure of data in the representation domain by promoting similar data points to yield similar coefficients vectors. The construction of the similarity matrix W is essential in graph embedding. Traditional graph-regularized clustering methods adopt a fixed similarity matrix, which is calculated from the raw data; the KNN graph defined in (6) is commonly used. However, noise and outliers in HSIs decrease the quality of such a similarity matrix, resulting in an unreliable graph embedding in the representation domain. To solve this problem, a graph learning strategy was recently introduced into self-representation-based clustering models [67,68], which iteratively learns a graph from the representation domain. A basic graph learning model can be formulated as follows:

arg min_{C,E,S} F(C) + λ_1 G(E) + λ_2 ∑_i ∑_j ‖Xc_i − Xc_j‖_2^2 S_ij  s.t.  X = XC + E,  R(C, E, S),  (19)

where S is the adaptive graph and R(C, E, S) is a set of constraints with respect to C, E and S.
In [68], a clustering method with a dual adaptive graph learning strategy was proposed. The developed model learns a consensus graph from two adaptive graphs that are derived from the representation domain, i.e., ∑_i ∑_j ‖Xc_i − Xc_j‖_2^2 S_ij, and from the projection domain with locality preserving projection (LPP) [139], respectively. The dual adaptive graph learning strategy learns the similarities of data points from two different domains that are less affected by noise, leading to a more robust clustering performance. In [67], a unified clustering model is developed by combining hypergraph learning and spectral clustering. The proposed model learns a hypergraph with the constraint Tr(XCL_h(XC)^T), where L_h is the Laplacian matrix of a hypergraph, and embeds the learned adaptive hypergraph in spectral clustering. The hypergraph learning and spectral clustering benefit from each other in the alternating optimization algorithm, resulting in improved clustering accuracy.

Dictionary Learning Based Clustering Methods
Self-representation-based clustering models often yield better performance in the clustering of HSI than traditional clustering methods such as k-means, fuzzy c-means, density-based clustering methods and spectral clustering. However, as they employ the input data as a dictionary, which is typically huge and redundant in practice, the subspace representation is less efficient and less informative. Moreover, the resulting optimization problems are computationally expensive, with a high complexity of O((MN)^3), where MN is the total number of pixels in the HSI, posing a severe limitation on large-scale data. Recent works [69][70][71][72][73][74][75][76][77][78][79] solve this problem by replacing the self-representation dictionary with a more compact dictionary. Typical ways to obtain the compact dictionary are shown in Figure 7. With a smaller dictionary, the number of coefficients to be learned is significantly reduced, making the resulting clustering models computationally efficient. Denote the compact dictionary by D ∈ R^{B×n}, where n (n ≪ MN) is the number of atoms. According to how the dictionary D is constructed, we classify existing dictionary learning-based clustering methods into three categories: landmark-based, sketch-based and adaptive dictionary-based clustering methods.

Landmark-Based Clustering Methods
This type of method builds the compact dictionary by selecting representative data points from the input data. The selected data points are viewed as landmarks of the data, approximately representing the subspaces associated with the input data. The most efficient way to select landmarks is uniformly random sampling [140]. However, randomly selected landmarks are often redundant, so that more data points are required to represent the input data subspaces, resulting in a larger dictionary. Some methods [69,70] adopt fast clustering algorithms such as k-means to cluster the HSI into different groups and obtain landmarks within each group. Another method [71] combines super-pixel segmentation and sparse coding for the selection of landmarks. These methods typically yield a much smaller dictionary than the self-representation dictionary, leading to a more efficient sparse coding problem. After obtaining the coefficients matrix, clustering results can be obtained either by spectral clustering or by designing post-processing techniques, e.g., minimizing reconstruction residuals.
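A minimal landmark-selection step in the spirit of [69,70] (a sketch; over-clustering with mini-batch k-means is our stand-in for the specific schemes of those papers) looks as follows:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def landmark_dictionary(X, n_atoms=200):
    """Over-cluster the pixels of X (B, MN); return the centroids as D (B, n)."""
    km = MiniBatchKMeans(n_clusters=n_atoms, n_init=3).fit(X.T)
    return km.cluster_centers_.T
```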
In [70], a landmark-based SSC model with TV regularization (LSSC-TV) was proposed for the clustering of large-scale HSIs. LSSC-TV replaces the self-representation dictionary with a landmark dictionary, which is obtained by over-clustering the HSI with k-means and collecting the centroid of each cluster as a landmark. The size of the landmark dictionary is small, significantly reducing the number of optimization variables compared with self-representation models. Thus, LSSC-TV has a lower computational complexity and is more scalable to big data. Moreover, LSSC-TV incorporates spatial information with a TV regularization, which improves the local smoothness of the coefficients, leading to improved accuracy. The objective function of LSSC-TV is formulated as follows:

arg min_A (1/2) ‖X − DA‖_F^2 + λ ‖A‖_1 + λ_tv ∑_{i=1}^{MN} ∑_{j∈N_i} ‖a_i − a_j‖_1  s.t.  A ≥ 0,  1^T A = 1^T,  (20)

where A ∈ R^{n×MN} is the sparse coefficients matrix, N_i is the index set of the spatial neighbours of the i-th pixel, λ and λ_tv are the penalty parameters for the sparsity level and spatial smoothness, respectively, and the non-negative and sum-to-one constraints are used to interpret the coefficients as probabilities of selecting landmarks in the sparse coding. Based on the AnchorGraph theory in [141], a similarity matrix is constructed by

W = A^T Λ^{−1} A,  with Λ = diag(A1).  (21)

Then, a fast spectral clustering algorithm [142] is adopted to obtain the clustering results of the HSI, further reducing the computational complexity of LSSC-TV.

Zhai et al. [69] proposed a sparsity-based clustering method for large-scale HSIs. Compared with existing subspace clustering methods, which often rely on spectral clustering of the representation coefficients to yield clustering results, the clustering methods developed in [69] use the sparse representation recovery residual to cluster HSIs, resulting in a much lower computational complexity. Firstly, a structured dictionary D = [D_1, D_2, . . . , D_c] is constructed by using k-means and k-nearest neighbours (KNN), where D_i is a sub-dictionary corresponding to the i-th cluster that is built from the KNN of the i-th cluster centroid. Inspired by sparse representation classification [133], they obtain discriminative sparse coefficients of all data points by solving a sparsity-based optimization problem, and feed them into a representation residual-based clustering algorithm to yield the clustering results. To reduce the effect of salt-and-pepper noise, a spatial-spectral version, called the joint-sparse-coding-based clustering (JSCC) method, was proposed by introducing an ℓ1,2 norm-based joint sparsity on the coefficients matrix of the pixels within a super-pixel, yielding improved performance in terms of accuracy and time complexity. In [72], a multi-objective SSC was proposed for the clustering of HSIs. A compact dictionary is first constructed by using k-means to reduce the overall computation burden. Different from other subspace clustering methods, which obtain the coefficients matrix by solving a single objective function, the authors of [72] simultaneously optimize multiple objective functions, i.e., a sparsity term, a data fidelity term and a spatial TV term, resulting in a parameter-free clustering model.
More recently, Hinojosa et al. develop a computationally efficient clustering method with a small landmark dictionary obtained by super-pixel segmentation and sparse coding [71]. The landmark dictionary enables a fast calculation of sparse coefficients. Spatial filtering is used to post-process the coefficients matrix, promoting the connectivity of neighbouring pixels in the representation domain. To obtain clustering results of large-scale HSIs, fast spectral clustering is applied with the coefficients matrix, reducing further the computational complexity of the clustering method.

Sketch-Based Clustering Methods
Sketch-based clustering methods compress the self-representation dictionary by using a random sketching technique, i.e., D = XR, where R ∈ R^{MN×n} is a random matrix. The compressed dictionary D, referred to as the sketched dictionary, was originally developed for computer vision tasks where the clustering of faces, digits and scenes is of interest [143]. As with the landmark dictionary, the size of the sketched dictionary is much smaller than that of the self-representation dictionary, making the resulting clustering model computationally efficient. It has been theoretically proven that, with a proper sketching matrix R, the sketched dictionary is as expressive as the self-representation dictionary. Thus, the sketched dictionary can well represent the subspaces associated with the input data. Recent works [73][74][75] apply the sketched dictionary to the clustering of HSIs, achieving state-of-the-art performance in terms of efficiency and accuracy. The objective function of sketch-based clustering methods is formulated as

arg min_A Θ(A) + λ_1 Ψ(A) + (λ_2/2) ‖X − DA‖_F^2,  (22)

where D = XR, Θ(A) is a sparsity or low-rankness related constraint and Ψ(A) is a spatial regularization. After obtaining the matrix A, a KNN graph is built and fed to spectral clustering to yield the clustering results. In [73], a TV regularized sketch subspace clustering method was proposed for hyperspectral remote sensing images. It adopts a Johnson-Lindenstrauss transform to sketch the self-representation dictionary into a compact dictionary, which significantly reduces the number of sparse coefficients to be solved and thereby the overall complexity. In order to alleviate the effect of noise and within-class spectral variations of HSIs, a TV spatial constraint is imposed on the sparse coefficients matrix, which accounts for the spatial dependencies among neighbouring pixels. Compared with the traditional SSC model, the sketch-based clustering model obtains significant improvements in accuracy and running speed. Another sketch-based clustering method [75] adopts the same sketching technique as [73,143] to build the compressed dictionary. To better capture the structural information of HSIs in the representation domain, joint sparsity and low-rankness constraints were introduced, which account for the underlying local and global information of HSIs at the same time. Moreover, a nonlocal means regularization is used to incorporate the spatial correlation information, which further improves the clustering accuracy. The objective function of the sketch-based clustering model [75] is as follows:

arg min_A ‖W_a ⊙ A‖_1 + λ_1 ‖A‖_* + λ_2 ‖A − Ā_NL‖_F^2 + (λ_3/2) ‖X − DA‖_F^2,  (23)

where W_a is an adaptive weight matrix calculated by (W_a)_ij = ε_2/(A_ij + ε_1), which leads to improved sparsity, ε_1 and ε_2 are two small constants, and Ā_NL is a filtered matrix of A obtained by a nonlocal means filter.
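A sketched dictionary in the spirit of D = XR can be obtained in two lines (a sketch; a Gaussian random matrix is used here as a simple stand-in for the exact Johnson-Lindenstrauss transform of [73,143]):

```python
import numpy as np

def sketched_dictionary(X, n_atoms=200, seed=0):
    """D = X R with a Gaussian sketching matrix R (MN x n). X is (B, MN)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_atoms)) / np.sqrt(n_atoms)
    return X @ R
```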

Adaptive Dictionary Based Clustering Methods
Motivated by the success of dictionary learning in signal processing and high-dimensional data analysis [144][145][146][147][148][149], recent works [76][77][78][79] replace the self-representation dictionary with an adaptive dictionary that is learned from the input data, resulting in computationally efficient clustering models. The developed models often consist of three steps: joint dictionary learning and sparse coding, similarity matrix construction, and spectral clustering.
In [76], a novel clustering method based on sparse dictionary learning and anchored regression was proposed. The method first builds a sparse dictionary by multiplying a fixed wavelet dictionary with a learned sparse matrix in a double sparsity constraint-based optimization framework. To improve the efficiency of sparse dictionary learning, an efficient scheme is adopted using a few randomly selected data points. Then, based on the clustering of atoms within the sparse dictionary and anchored regression, class-specific projection matrices are obtained, which allow a fast calculation of the coefficients matrix. A spatial smoothing filter is applied to the coefficients matrix, which is then utilized to build a similarity matrix. Finally, spectral clustering is applied to obtain the clustering results of the HSI. The model developed in [76] achieves a low computational complexity. However, the underlying fixed wavelet dictionary might not fit the input data well. Instead, Bruton et al. [79] proposed an efficient online dictionary learning-based clustering model for HSIs. It obtains a compact dictionary and sparse coefficients simultaneously in a unified model. The learned dictionary is more adaptive to the input data than that of [76]. The sparse coefficients are viewed as extracted features, which have been demonstrated to be more discriminative than the raw spectral data. The new features facilitate a better similarity matrix, thereby improving the accuracy of spectral clustering. However, only the spectral information of HSI is exploited in [79], making the clustering model less robust to degradations of HSIs.
In [78], a dictionary learning-based clustering method is put forward with an adaptive spatial regularization. Specifically, a weighted joint total variation is formulated by adopting a reweighed ℓ1,2 norm penalty on the difference matrix of the coefficients, which effectively encodes the dependencies of spatially neighbouring pixels in the low-dimensional subspaces and promotes the coefficients vectors of neighbouring pixels to be similar. Thus, the within-cluster variation of the data is significantly reduced in the representation domain, leading to improved clustering accuracy in spectral clustering. The objective function of the dictionary learning model is formulated as follows:

arg min_{D,A} (1/2) ‖X − DA‖_F^2 + λ ‖W_h HA^T‖_{1,2}  s.t.  D ≥ 0,  (24)

where D ≥ 0 requires the atoms to be non-negative, in agreement with the positive spectral intensities of HSIs, H is a combined TV operator in the horizontal and vertical directions and W_h is a diagonal weight matrix for the difference matrix HA^T that is iteratively calculated from the difference matrix of A. Compared with self-representation models, the complexity of the model in [78] is much lower. Compared with the commonly used TV regularization, the weighted ℓ1,2 norm-based TV promotes row sparsity on the difference matrices of A, better preserving the local spatial structure of HSI in the representation domain. This makes the similarity matrix constructed from the representation coefficients more block-diagonal, yielding better results in spectral clustering.

Huang et al. [77] proposed a dictionary learning-based clustering method with a joint sparsity constraint, which accounts for the local spatial information of HSIs in the sparse coding. It first segments the HSI into non-overlapping square patches and imposes an ℓ1,2 norm-based constraint on the coefficients matrix of the pixels within each patch. Minimizing the joint sparsity constraint promotes the selection of a common set of atoms in the sparse coding of similar data points. The objective function of the joint sparse coding and dictionary learning is formulated as follows:

arg min_{D,A} (1/2) ‖X − DA‖_F^2 + λ ∑_{i=1}^s w_i ‖A_i‖_{1,2},  (25)

where s is the number of square patches, A_i is the coefficients matrix of the pixels belonging to the i-th patch and {w_i}_{i=1}^s are the weights for the joint sparsity constraint. After obtaining the coefficients matrix A, different from the works of [78,79], which apply the KNN graph built with A in spectral clustering to obtain clustering results, the clustering method of [77] adopts a co-clustering approach based on a bipartite graph, achieving simultaneous clustering of dictionary atoms and spectral data. An undirected bipartite graph G = (D, X, E) is built, where all dictionary atoms d_i and input data points x_i are viewed as nodes and E represents the edges between nodes. As the sparse coefficient A_ij represents the correlation between input data point x_j and dictionary atom d_i, the adjacency matrix of the bipartite graph G is built by

W_G = [ 0  |A| ; |A|^T  0 ],  (26)

where |A| represents the element-wise absolute value of A. To obtain the clustering results of HSIs, the adjacency matrix of the bipartite graph is applied in normalized cut [150].
In summary, benefiting from the compact dictionaries, the number of variables to be optimized in dictionary learning-based clustering methods is significantly reduced compared with self-representation models, resulting in low computation and memory costs. However, the clustering accuracies of dictionary learning-based clustering methods are sensitive to the quality of the built compact dictionary. Moreover, adaptive dictionary-based clustering methods learn the compact dictionary and the sparse coefficients simultaneously, leading to non-convex optimization problems where globally optimal solutions are not guaranteed. The obtained sub-optimal solutions may degrade the performance of these models.

NMF-Based Clustering Methods
Nonnegative matrix factorization (NMF) [151], which decomposes a nonnegative matrix into the product of two nonnegative factor matrices, has been demonstrated to be an effective tool in many applications, including unmixing [152][153][154], source separation [155], compression [156], medical imaging [157] and clustering [83,88,158,159,160]. For a given nonnegative matrix $X \in \mathbb{R}^{B \times MN}$, NMF finds two nonnegative matrices $U \in \mathbb{R}^{B \times r}$ and $V \in \mathbb{R}^{r \times MN}$ such that

$$X \approx UV = \sum_{i=1}^{r} U(:, i)\, V(i, :),$$

where $U(:, i)$ is the $i$-th column of $U$ and $V(i, :)$ is the $i$-th row of $V$. In general, NMF is NP-hard and highly ill-posed due to the non-uniqueness of its solutions [161]. Therefore, suitable regularizations are typically introduced to shrink the solution space and to promote additional properties of the factor matrices. The optimization problem of NMF is formulated as

$$\arg\min_{U \geq 0,\, V \geq 0}\; D(X, UV) + \sum_{i} \alpha_i \Phi_i(U, V),$$

where $D(\cdot, \cdot)$ is a discrepancy term, $\Phi_i(\cdot)$ represents the $i$-th regularization term and $\alpha_i \geq 0$ are the regularization parameters that control the influence of $\Phi_i(\cdot)$. Typical choices of $D(\cdot, \cdot)$ are the Frobenius norm $\|\cdot\|_F^2$, the $\ell_1$ norm $\|\cdot\|_1$ and the $\ell_{2,1}$ norm $\|\cdot\|_{2,1}$. For the regularization, the $\ell_1$ norm, the $\ell_2$ norm and other smoothing terms are commonly used.
NMF can be used for data clustering in two different ways, as shown in Figure 8. The first strategy is to consider NMF as a representation learning technique, where the representation matrix $V$ is viewed as a new feature matrix of the data. By applying the new features to existing clustering methods, clustering results can be obtained. As normally $r \ll B$ and $r \ll MN$, $UV$ is a low-rank approximation of $X$, and clustering in the feature space can be more effective than clustering in the raw data space. The second strategy views the factor matrix $U$ as a matrix of cluster centroids and $V$ as a cluster membership matrix by setting $r$ to the number of clusters and imposing the orthogonality constraint $VV^T = I$. This directly yields clustering results through the columns of $V$. Without other regularization terms, the second strategy is known as the orthogonal NMF (ONMF) problem, which is equivalent to a weighted variant of spherical k-means [162]. Compared with k-means, NMF clustering approaches are more flexible, considering that different prior information of the data can be easily incorporated by introducing suitable regularizations on the factor matrices. The earliest work on NMF-based clustering dates back to 2003 [163], which applied NMF to document clustering and obtained superior performance compared with spectral clustering. Subsequent development of NMF-based clustering methods mainly focused on computer vision tasks, and research on the clustering of HSIs with NMF has appeared only in the last three years. Depending on whether spatial information is incorporated, we categorize NMF-based clustering approaches of HSIs into spectral-based and spatial-spectral-based methods.
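The two strategies can be sketched with off-the-shelf tools. The snippet below is a minimal illustration with scikit-learn; the matrix sizes, the rank r = 30 and the argmax reading of V in the second strategy (a relaxation that drops the orthogonality constraint) are assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

# X holds one pixel per column (B bands x MN pixels), as in the text.
B, MN, n_clusters = 162, 5000, 7
X = np.random.rand(B, MN)                  # stand-in for a flattened HSI

# Strategy 1: NMF as representation learning, then an off-the-shelf clusterer.
nmf = NMF(n_components=30, max_iter=500)   # r = 30 is an arbitrary choice
U = nmf.fit_transform(X)                   # B x r basis matrix
V = nmf.components_                        # r x MN new features of the pixels
labels_1 = KMeans(n_clusters=n_clusters).fit_predict(V.T)

# Strategy 2: set r to the number of clusters and read labels off V directly
# (without enforcing VV^T = I, the argmax of each column of V serves as a
# soft surrogate for the cluster membership matrix).
nmf2 = NMF(n_components=n_clusters, max_iter=500)
U2 = nmf2.fit_transform(X)                 # columns approximate centroids
V2 = nmf2.components_
labels_2 = V2.argmax(axis=0)
```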

Spectral-Based NMF Clustering Methods
Spectral-based NMF clustering methods treat the pixels of an HSI independently, without considering their spatial dependencies. In [80], a hierarchical clustering method based on rank-two NMF (H2NMF) is put forward. The method starts with a single cluster containing all the data points and performs the following two steps iteratively (sketched below): (1) selection of a cluster for further division and (2) splitting of the selected cluster with rank-two NMF. The major advantage of the method is that the tree-structured clustering results avoid rerunning the algorithm from scratch when the number of clusters required by the user is modified. Compared with k-means, spherical k-means and standard NMF, H2NMF yields better clustering accuracy. In [81], Manning et al. extended H2NMF to a version that supports parallel computing with memory distributed across compute nodes and processors, resulting in a scalable clustering algorithm for big data.
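A compact sketch of the hierarchical splitting loop follows. It is illustrative only: the cluster-selection criterion of [80] is more elaborate than the largest-cluster placeholder used here, and scikit-learn's NMF stands in for a specialized rank-two solver.

```python
import numpy as np
from sklearn.decomposition import NMF

def h2nmf(X, n_clusters):
    """Sketch of hierarchical clustering via rank-two NMF (H2NMF-style).
    X: B x N nonnegative matrix with pixels as columns."""
    clusters = [np.arange(X.shape[1])]           # start from a single cluster
    while len(clusters) < n_clusters:
        # (1) select a cluster to divide (placeholder: the largest one)
        idx = max(range(len(clusters)), key=lambda i: clusters[i].size)
        members = clusters.pop(idx)
        # (2) split it with a rank-two NMF of its columns
        V = NMF(n_components=2, max_iter=300).fit(X[:, members]).components_
        side = V.argmax(axis=0)                  # assign by dominant factor
        clusters += [members[side == 0], members[side == 1]]
    labels = np.empty(X.shape[1], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

Because each split only refines an existing node of the tree, asking for one more cluster amounts to one more iteration of the loop rather than a full rerun, which is the advantage noted above.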
In [82], Fernsel et al. proposed elastic net regularized ONMF clustering models, where the factorization rank $r$ is set to the number of clusters. Compared with the traditional NMF, the orthogonality constraint $VV^T = I$ is introduced for the factor matrix $V$, allowing $V$ to be interpreted as a cluster membership matrix. Moreover, an elastic net regularization combining the $\ell_1$ norm and the Frobenius norm is introduced to promote sparse factor matrices. Specifically, the objective function of the models in [82] is formulated as

$$\arg\min_{U \geq 0,\, V \geq 0,\, VV^T = I}\; D(X, UV) + \lambda_U \|U\|_1 + \frac{\mu_U}{2} \|U\|_F^2 + \lambda_V \|V\|_1 + \frac{\mu_V}{2} \|V\|_F^2,$$

where $D(\cdot, \cdot)$ is an $\ell_1$ norm or Frobenius norm-based discrepancy term and $\lambda_U, \lambda_V, \mu_U, \mu_V \geq 0$ are regularization parameters. It has been proved in [82] that the regularized ONMF is in fact equal to a generalized k-means model with suitable distance measures and centroids. Different from [80][81][82], which obtain clustering results via asymmetric NMF of the input data, the work of [83] develops a symmetric NMF (SNMF) clustering model for HSI by decomposing the data covariance matrix $K$ as $MM^T$, where $M \in \mathbb{R}^{MN \times c}$ is a nonnegative matrix that is viewed as a cluster membership matrix. The objective function of SNMF is formulated as

$$\arg\min_{M \geq 0}\; \|K - MM^T\|_F^2 + \sum_{\rho} \lambda_\rho \|M(\rho, :)\|_1, \qquad (29)$$

where the $\lambda_\rho$ are regularization parameters for the sparsity of the rows of $M$. To solve the non-convex matrix factorization problem (29), the work of [83] converts it into a mixed integer linear programming problem. SNMF is shown to perform better than standard clustering methods such as k-means and NMF. It should be noted that even compared with a supervised classifier such as the kernel support vector machine (SVM) trained with 25% of the labelled data, SNMF often yields far better classification accuracy. However, the computational complexity of SNMF is excessively high, posing limitations on the clustering of large-scale data.

Spatial-Spectral-Based NMF Clustering Methods
Instead of using only the spectral information as in [80][81][82][83], spatial information is incorporated into NMF to improve the clustering accuracy [84][85][86][87][88]. In [84], Tian et al. proposed a graph regularized ONMF (GONMF), which employs a graph built in the raw data space to preserve the local geometrical structure in the cluster membership matrix. In addition, morphological spatial features of the HSI are extracted and concatenated with the spectral data, obtaining a more discriminative input matrix $\tilde{X}$ for NMF. The objective function of GONMF is formulated as

$$\arg\min_{U \geq 0,\, V \geq 0,\, VV^T = I}\; \|\tilde{X} - UV\|_F^2 + \alpha\, \mathrm{Tr}(V L V^T),$$

where $L$ is the Laplacian matrix of the graph built in the raw data space. GONMF directly obtains clustering results from the cluster membership matrix $V$. Compared with SSC, GONMF yields improved performance in terms of both accuracy and efficiency. In [85], a work similar to GONMF was proposed, which also takes the spatial information of HSIs into account. Specifically, a total variation regularized spatial constraint is imposed on the cluster membership matrix of ONMF, which promotes grouping neighbouring pixels into the same cluster, resulting in improved local homogeneity of the clustering maps.
Zhang et al. [86] proposed a semi-NMF clustering framework for HSIs, which works efficiently on large-scale data. Specifically, dimensionality reduction by orthogonal projection is performed jointly with clustering in a unified framework. The transformed data have a much lower dimension, which facilitates fast clustering. To increase the robustness of the model to sparse noise and outliers, the $\ell_{2,1}$ norm is utilized for the losses of the dimensionality reduction and the semi-NMF clustering. Moreover, a graph Laplacian-based manifold constraint is introduced in the low-dimensional feature space and the label space, which promotes similar data points to yield similar features and clustering labels. The objective function of the semi-NMF clustering model is formulated as

$$\arg\min_{P,\, U,\, V \geq 0}\; \|X - PY\|_{2,1} + \alpha \|Y - UV\|_{2,1} + \beta \left( \mathrm{Tr}(Y L Y^T) + \mathrm{Tr}(V L V^T) \right), \quad \text{s.t. } Y = P^T X,\; P^T P = I, \qquad (31)$$

where $P$ is the projection matrix that generates the new features $Y$ of $X$, $L$ is the Laplacian matrix of a similarity matrix and $V$ is the cluster membership matrix. Note that to improve the scalability of the clustering model (31), only a small portion of the pixels in the HSI is selected for the input matrix $X$, and the clustering of the remaining pixels is performed by a KNN classifier according to the clustering results of $X$. This avoids the complicated optimization procedure of (31) for the unselected pixels, making the clustering of HSIs much faster; a sketch of this subset-then-extend strategy is given below. It is observed that most NMF-based clustering methods view the factor matrix $V$ as a label matrix by setting the factorization rank of NMF to the number of clusters. Although this enables a direct clustering result with $V$, the linear representation ability of NMF limits its application to data that are not linearly separable. To deal with this problem, the authors of [87] adopt NMF as a feature extraction tool and apply the extracted features to spectral clustering to obtain clustering results. To improve the feature learning in NMF, a graph regularized constraint is introduced in the feature space, which promotes the manifold structures in the raw data space and the feature space to be identical. The objective function of the resulting model is as follows:

$$\arg\min_{U \geq 0,\, V \geq 0}\; \|X - UV\|_F^2 + \lambda\, \mathrm{Tr}(V L_Z V^T), \qquad (32)$$

where $L_Z$ is the Laplacian matrix of a spectral-spatial similarity matrix $Z$ that is constructed by using super-pixel segmentation, with special attention to intra-superpixel and inter-superpixel connectivities. With the new features $V$, the similarity matrix of a KNN graph with binary weights {0, 1} is built, which is further fused with the spectral-spatial similarity matrix $Z$ by a weighted strategy. The fused similarity matrix is demonstrated to be more block-diagonal, thereby improving the accuracy of spectral clustering.
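The scalability trick of [86], clustering a small pixel subset and extending the labels by KNN, is independent of the particular clusterer. The sketch below illustrates it with k-means standing in for the semi-NMF module; the subset size and neighbour count are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def scalable_clustering(X, n_clusters, n_landmarks=2000, seed=0):
    """Subset-then-extend clustering: run an expensive clusterer on a small
    pixel subset, then propagate labels to the remaining pixels with KNN.
    X: N x d matrix of pixels (rows are samples)."""
    rng = np.random.default_rng(seed)
    sub = rng.choice(X.shape[0], size=min(n_landmarks, X.shape[0]),
                     replace=False)
    sub_labels = KMeans(n_clusters=n_clusters).fit_predict(X[sub])
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[sub], sub_labels)
    labels = knn.predict(X)          # extend clustering to all pixels
    labels[sub] = sub_labels         # keep the directly clustered subset
    return labels
```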
Recently, a co-clustering approach based on NMF was proposed for the clustering of large-scale HSIs [88], which integrates affinity matrix learning and spectral co-clustering into a unified model. Specifically, a joint sparsity regularized sparse representation model is used to learn the correlations between data points and anchors, based on which a bipartite graph is built as in (25). Exploiting the equivalence between bipartite graph kernel k-means and NMF, a co-clustering module for pixels and anchors is designed by solving a doubly orthogonality-constrained NMF optimization problem. The unified co-clustering model is formulated as follows:

$$\arg\min_{A,\, U \geq 0,\, V \geq 0}\; \|X - DA\|_F^2 + \lambda \sum_{i} \|A_i\|_{1,2} + \underbrace{\alpha \left\| |A| - U V^T \right\|_F^2}_{\text{co-clustering via NMF}}, \quad \text{s.t. } U^T U = I,\; V^T V = I, \qquad (33)$$

where $A$ is the sparse coefficient matrix of $X$ obtained by joint sparse coding within each super-pixel, with $A_i$ denoting the coefficients of the $i$-th super-pixel. The matrix $A$ measures the correlations between the input data $X$ and the representative anchors, i.e., the dictionary $D$. Benefiting from the $\ell_{1,2}$ norm regularized spatial constraint in the sparse coding, the coefficient matrix better encodes the correlations between the input data and the dictionary, leading to a more accurate clustering result in the NMF. In the model (33), the clustering results of $X$ and $D$ are directly obtained via the cluster membership matrices $V$ and $U$. Compared with self-representation methods such as SSC and LRR, the co-clustering model via NMF in [88] yields significant improvements in terms of accuracy and computational complexity.

In summary, NMF-based clustering models are more efficient than self-representation-based models, as there are far fewer variables to be optimized. As the factor matrix $V$ of NMF indicates the cluster membership of the data points, post-processing via other clustering algorithms is not needed, in contrast to the aforementioned clustering approaches. According to [164], there are strong correlations between NMF, k-means and spectral clustering, such that with mild relaxations of the constraints, NMF is equivalent to the other two clustering methods. Considering its high flexibility in modelling prior information, its low computational complexity and its good interpretability, NMF is promising for the clustering of HSIs, although current research in this direction is limited. The disadvantage of NMF is that the related optimization problems are non-convex, which makes globally optimal solutions difficult to obtain. Moreover, the linear representation ability of NMF limits the clustering performance on data that are not linearly separable.

Deep Clustering Methods
Model-based clustering methods often require devising rational constraints according to domain-specific prior information to avoid ill-posed optimization problems. However, the incorporation of prior information relies heavily on the experience and domain knowledge of experts, which greatly limits the application of model-based clustering methods. In addition, the penalty parameters of the added constraints often vary across different data sets, and there is a lack of theoretical guidance for setting them adaptively. Moreover, the features extracted by shallow models might not be discriminative enough for clustering, especially when dealing with remote sensing images, which are often highly complex. Benefiting from its powerful feature extraction capacity, the data-driven deep learning technique has achieved great success in a number of applications, including classification [165,166], clustering [167], image denoising [168], spectral unmixing [169] and anomaly detection [170]. However, research on the clustering of HSIs with deep learning is at a very early stage. This is a new and rapidly emerging domain within the last few years, showing impressive clustering performance and attracting increasing attention in the field [104]. According to the mechanism of feature learning and clustering, current deep learning-based clustering approaches of HSI are categorized into self-representation-based, autoencoder (AE)-based, graph convolution-based and contrastive learning-based methods. Figure 9 shows the main idea of each category.

Self-Representation Based Deep Clustering (SDC)
Basically, SDC methods integrate deep generative neural networks with the aforementioned self-representation clustering models, such as SSC [31], and can be seen as deep versions of the shallow clustering models in Section 4.1. As shown in Figure 9a, AEs are often used to generate latent features, which are expected to be more effective in clustering tasks. The loss of the AE is formulated as

$$L_{AE} = \|X - \hat{X}\|_F^2 + \lambda\, \Theta(Z),$$

where $X$ is the input data, $\hat{X}$ is the data reconstructed by the AE, $Z = E(X)$ denotes the latent features extracted by the encoder $E(\cdot)$ and $\Theta(Z)$ is a regularization term with respect to $Z$. The encoder of the AE is cascaded with a self-representation layer, i.e., $\hat{Z} = ZC$, where $C$ is the self-representation coefficient matrix. The loss of the self-representation layer is formulated as follows:

$$L_{SR} = \|Z - ZC\|_F^2 + \gamma\, \Psi(C),$$

where $\Psi(C)$ is a regularization term to avoid the trivial solution $C = I$. Combining the reconstruction loss of the AE with the loss of the self-representation layer, the overall loss function is derived as

$$L = L_{AE} + \alpha L_{SR}. \qquad (36)$$

The training of SDC models often consists of two steps: pre-training of the AE by minimizing $L_{AE}$ and a fine-tuning step minimizing (36). Once the coefficient matrix $C$ is obtained, a similarity matrix can be built as in SSC by $W = (|C| + |C^T|)/2$. Finally, the similarity matrix is fed into spectral clustering to obtain the clustering result.
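A minimal PyTorch sketch of this architecture follows; the layer widths, the Frobenius penalty used for Ψ(C) and the weighting constants are illustrative assumptions, not the configuration of any specific model surveyed here.

```python
import torch
import torch.nn as nn

class SDCNet(nn.Module):
    """Encoder -> self-expressive layer (Z ~ CZ) -> decoder."""
    def __init__(self, n_bands, n_samples, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bands, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_bands))
        # C is a trainable N x N coefficient matrix (the self-expressive layer).
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, X):                 # X: N x n_bands
        Z = self.encoder(X)
        ZC = self.C @ Z                   # each latent code as a combination
        X_hat = self.decoder(ZC)          # of the other samples' codes
        return X_hat, Z, ZC

def sdc_loss(X, X_hat, Z, ZC, C, lam=1.0, gamma=0.1):
    # reconstruction + self-expressiveness + Frobenius penalty as Psi(C)
    return (torch.norm(X - X_hat) ** 2
            + lam * torch.norm(Z - ZC) ** 2
            + gamma * torch.norm(C) ** 2)
```

After fine-tuning, the affinity W = (|C| + |C^T|)/2 is formed from the learned coefficient matrix and passed to spectral clustering, as described above.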
The first SDC model [89] was proposed in 2017; it introduces a self-representation layer between the encoder and the decoder to model the self-expressiveness of data in the nonlinear feature space, achieving remarkable performance in the clustering of faces and objects. Motivated by [89], Laplacian regularized SDC models [90][91][92] were recently proposed for the clustering of HSI, which yield significant improvements compared with shallow representation-based clustering methods. Basically, a graph Laplacian constraint is employed to encode the correlations of data points either in the latent feature space or in the self-representation domain, making the manifold structure of the learned features more consistent with that of the original domain. In [90], the authors introduced a graph Laplacian-based manifold constraint on the representation coefficients of the self-representation layer to enhance the geometric structure consistency between the input domain and the representation domain. Moreover, skip connections between the encoder and decoder are utilized to extract spatial-spectral information. Experimental results on real data sets show improved accuracy compared with SDC. The cost function of the model in [90] is formulated as:

$$\arg\min\; L_{AE} + \alpha \|Z - ZC\|_F^2 + \beta \|C\|_F^2 + \eta\, \mathrm{Tr}(C L C^T),$$

where $L$ is the Laplacian matrix of a KNN graph.
In [91], Cai et al. replaced the regular convolutional autoencoder of [90] with a residual convolutional autoencoder, leading to a model that is more easily trained from scratch. More recently, Cai et al. proposed a hypergraph regularized deep clustering model, called HyperAE [92], which incorporates the group structure information of the data into the learning of the deep latent features. The objective function of HyperAE is formulated as:

$$\arg\min\; L_{AE} + \alpha \|Z - ZC\|_F^2 + \beta \|C\|_F^2 + \eta\, \mathrm{Tr}(Z^T L Z),$$

where $Z$ is the matrix of deep latent features of $X$ and $L$ is the normalized hypergraph Laplacian matrix. HyperAE is further extended to a semi-supervised version by making use of the supervised information from a few labelled data points. Specifically, the latent features of the AE are fed to a softmax classifier for label prediction, and a cross-entropy-based classification loss is introduced as a task-specific loss function. The cost function of the semi-supervised HyperAE is formulated as:

$$\arg\min\; L_{AE} + \alpha \|Z - ZC\|_F^2 + \beta \|C\|_F^2 + \eta\, \mathrm{Tr}(Z^T L Z) - \frac{\mu}{N_l} \sum_{i=1}^{N_l} y_i \log \bar{y}_i^T,$$

where the last term is the cross-entropy loss, $N_l$ is the number of labelled data points, $y_i \in \mathbb{R}^{1 \times c}$ is the one-hot label vector of $x_i$ and $\bar{y}_i$ is the predicted label vector of $x_i$. Benefiting from the hypergraph regularization, the deep latent features extracted by HyperAE are more discriminative than those of [89], resulting in better clustering performance in both the unsupervised and semi-supervised modes. Recently, Li et al. [93] proposed a mutual information subspace clustering network for the clustering of HSI by embedding contrastive learning and self-representation of the data into an AE. A contrastive loss, which maximizes the mutual information between the input data and the latent features, was designed, effectively improving the nonlinear feature learning. Experimental results show that the developed model yields improved clustering accuracy compared with other deep clustering approaches.
In [34], a multi-scale SDC model was proposed for the clustering of HSI, which leverages multi-scale convolutional AEs to extract spatial-spectral features of the HSI at different scales. By exploiting the self-expressiveness property of the features at each scale, the extracted spatial-spectral features are transformed to the representation domain and further fused by minimizing the differences between the representation coefficient matrices across all scales. Although this method obtains improved accuracy, the computational complexity is significantly increased due to the multiple self-representation layers.
Different from previous SDC models, which commonly utilize AEs for deep feature extraction, Goel et al. [94] learned discriminative features with deep dictionary learning (DDL), which nonlinearly transforms the input data into a new data space where the data are separable into different subspaces. The DDL is followed by a self-representation layer whose representation coefficients are used to build a similarity matrix for spectral clustering. The objective function of the model proposed in [94] is formulated as:

$$\arg\min_{D_1, D_2, D_3, Z, C}\; \left\| X - D_1\, \varphi\big( D_2\, \varphi( D_3 Z ) \big) \right\|_F^2 + \lambda \sum_{i} \left\| z_i - Z_{\hat{i}}\, c_i \right\|_2^2 + \mu \|C\|_1,$$

where $\varphi(\cdot)$ is the ReLU activation, $D_1$, $D_2$ and $D_3$ are three layers of dictionaries, $Z$ is the corresponding representation matrix and $Z_{\hat{i}}$ represents the sub-matrix of $Z$ obtained by removing $z_i$, as in SSC. The experimental results show significant improvements over state-of-the-art clustering methods. The aforementioned SDC methods separate feature learning from clustering, so the features obtained by deep learning might not be optimal for the clustering task. In [95], a unified self-supervised SDC model combining feature learning and spectral clustering was proposed for the clustering of HSI. It makes use of an AE and a self-representation layer to learn the similarity matrix of the data, and it employs high-confidence cluster assignments from spectral clustering as pseudo-labels to supervise the feature learning process. Moreover, a KNN graph built in the original domain is used to guide the initialization of the self-expressive coefficient matrix, achieving a significant improvement in clustering accuracy. The experimental results in [95] show that the proposed model yields clustering performance comparable to state-of-the-art supervised deep classification methods, with overall accuracies of 97.43%, 100% and 100% on the data sets Indian Pines, Pavia University and Salinas_A, respectively.
In practice, pixel-level self-representation of HSI suffers from high computational complexity and high memory cost, making the aforementioned SDC models infeasible on large-scale HSIs. Recently, Cai et al. [96] proposed a super-pixel guided contrastive subspace clustering network (NCSC) for the clustering of large-scale HSIs. By designing a super-pixel pooling autoencoder, the local spatial information of the HSI is efficiently encoded, allowing effective object-level feature extraction. Moreover, a contrastive loss, which maximizes the similarity between positive samples generated by KNNs, is introduced into NCSC to promote the intra-class similarity of the extracted features. Benefiting from the super-pixel pooling and the contrastive loss, both the accuracy and the computational cost of NCSC are improved, achieving the current state-of-the-art performance in the clustering of HSI.

AE-Based Deep Clustering (AEDC)
AEDC methods utilize AEs as unsupervised deep data representation models to extract latent features for clustering. Due to the nonlinear mapping function of the encoders, AEDC is more effective at dealing with complex data than traditional linear representation models. Clustering can be performed separately from the latent feature learning, which leads to clustering methods such as those in [97][98][99] consisting of two steps: deep feature learning and clustering, as sketched below. In the first step, a reconstruction loss is used to train the AEs; different types of AEs can be utilized, including the traditional AE, stacked AE, convolutional AE and variational AE. In the second step, with the latent features learned by the AEs, classical clustering methods such as k-means and Gaussian mixture models (GMM) are applied to yield the clustering results.
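The two-step recipe can be condensed as follows; the sketch uses a toy fully connected AE and random stand-in data, so the sizes, learning rate and epoch count are placeholders rather than settings from [97–99].

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Step 1: train a plain autoencoder on HSI pixels with a reconstruction loss.
B, latent_dim = 162, 16
encoder = nn.Sequential(nn.Linear(B, 64), nn.ReLU(), nn.Linear(64, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, B))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)
X = torch.rand(5000, B)                     # stand-in for flattened HSI pixels
for epoch in range(100):
    opt.zero_grad()
    loss = ((decoder(encoder(X)) - X) ** 2).mean()
    loss.backward()
    opt.step()

# Step 2: cluster the latent features with a classical algorithm (k-means
# here; a GMM plays the same role in some of the methods discussed above).
with torch.no_grad():
    Z = encoder(X).numpy()
labels = KMeans(n_clusters=7).fit_predict(Z)
```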
A recurrent neural network (RNN)-based asymmetric AE was proposed for the clustering of HSI in [97]. An RNN built with long short-term memory (LSTM) or gated recurrent units (GRUs) is utilized as the encoder. By interpreting the individual bands of an HSI as consecutive steps of a sequence, the high correlation between adjacent bands can be effectively captured by the RNN. A multilayer perceptron is utilized as the decoder. With this asymmetric AE, one obtains a nonlinear mapping function, modelled by the RNN, from the input data to the latent feature space. The latent features are then fed to a GMM to yield the clustering results of the HSI. As the first attempt to use RNNs in the clustering of HSI, the model proposed in [97] performs comparably to other deep clustering approaches in terms of accuracy, but achieves a faster running speed.
In [98,99], multi-sensor AEDC models were proposed, which make use of the rich information in multi-modal remote sensing data, yielding improved clustering performance. Rahimzad et al. [98] developed a boosted convolutional AE that takes concatenated hand-crafted features as input to extract more effective deep features for clustering. Compared with deep models that use raw data as the input of the AEs, the network used in [98] is less complex for feature extraction. In [99], Shahi et al. proposed a multi-stream AEDC model for the clustering of remote sensing images, consisting of three parallel networks: a spectral network with a fully connected AE, a spatial network with a convolutional AE, and a fusion network that reconstructs the concatenated images. The latent features from the spectral and spatial networks are concatenated and then fed to the k-means clustering algorithm. Experimental results show significant improvements over traditional SSC and deep learning methods.
The aforementioned AEDC models separate feature learning from clustering, so the extracted features might not be suitable for the clustering task. The works in [33,100,101] integrated deep feature learning and clustering in a unified framework: apart from the reconstruction loss of the AEs, an additional clustering loss is introduced into the overall training loss. Representative clustering losses include the intra-class distance loss, i.e., the k-means loss, and the Kullback-Leibler (KL) divergence loss between a target distribution and the soft assignments (see the sketch below). In [100], a deep embedded clustering (DEC) method was proposed. It first pre-trains an AE with a reconstruction loss to learn the nonlinear mapping function from the input data to the latent feature space. Then, the decoder is discarded and the encoder is used for the initial feature mapping. By minimizing the KL divergence loss between the target distribution and the soft assignments, the parameters of the encoder and the cluster centroids are jointly optimized. DEC yields remarkable improvements over k-means. However, the removal of the reconstruction loss in the second, fine-tuning stage makes the feature extraction via the encoder unstable. Nalepa et al. [101] extended DEC by coupling 3D convolutional AEs with clustering and combining the reconstruction loss and the KL divergence loss in the fine-tuning stage. Although the AEDC model in [101] yields high clustering accuracy, its computation time is much longer than that of the others. In [33], an intra-class distance constrained AEDC model was proposed for the clustering of HSI, which performs feature extraction and k-means clustering in a unified model. During the training of the network, the clustering error is propagated to the feature learning process of the AE, making the latent features more clustering-friendly. The objective function of the model in [33] is formulated as:

$$\arg\min_{\{W_i, b_i\}_{i=1}^{M},\, H,\, S}\; \|X - \hat{X}\|_F^2 + \lambda \|Z - HS\|_F^2,$$

where $W_i$ and $b_i$ are the weights and biases of the AE, $M$ is the total number of layers of the AE, $Z$ is the matrix of latent features of the AE, and $H$ and $S$ are the cluster centroid matrix and the cluster label matrix of k-means, respectively. The second term is the k-means clustering loss, which promotes smaller intra-class distances in the latent feature space. Experimental results show that the unified model in [33] outperforms both traditional shallow clustering methods and state-of-the-art deep clustering methods.
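For concreteness, the DEC-style soft assignments and sharpened target distribution can be written in a few lines of PyTorch. This is a sketch of the generic mechanism, not the exact training loop of [100] or [101].

```python
import torch

def soft_assignments(Z, centroids, alpha=1.0):
    """Student's t soft assignments q_ij between latent features Z (N x d)
    and cluster centroids (K x d), as used in DEC-style models."""
    d2 = torch.cdist(Z, centroids) ** 2                  # N x K squared dists
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, with cluster
    frequencies f_j = sum_i q_ij; it emphasizes confident assignments."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def kl_clustering_loss(q):
    p = target_distribution(q).detach()   # target is held fixed per step
    return (p * (p.log() - q.log())).sum(dim=1).mean()
```

Minimizing this KL loss pulls the encoder and the centroids toward assignments that agree with their own sharpened version, which is the self-training effect described above.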

Graph Convolution Based Deep Clustering (GCDC)
Graph neural networks extend convolutional neural networks to process data represented in the graph domain [171]; the feature representation of a node is updated by recursively aggregating the representations of its neighbours. GCDC methods integrate graph convolutions into self-representation-based clustering models, aggregating the neighbourhood information of the data in the affinity learning and leading to a similarity matrix that is robust to noise and outliers. Compared with traditional self-representation-based clustering methods, GCDC is more effective in dealing with graph-structured data in the non-Euclidean domain. A typical graph convolution propagation layer [172] can be defined by

$$X^{(r+1)} = \sigma\big( P X^{(r)} W^{(r)} \big), \qquad (42)$$

where $X^{(r)}$ is the $r$-th layer's graph embedding, $W^{(r)}$ is a weight matrix to be trained, $P$ is a propagation matrix built from a similarity matrix of the input data and $\sigma(\cdot)$ is a nonlinear activation function. Cai et al. [66] removed the nonlinear activation function of (42) and employed the graph convolution in the traditional self-representation model, leading to a novel GCDC model as follows:

$$\arg\min_{C}\; \frac{1}{2} \| X - \bar{X} C \|_F^2 + \frac{\lambda}{2} \|C\|_F^2, \quad \bar{X} = PX, \qquad (43)$$

where the representation matrix $C$ can be seen as the parameters of a simplified neural network. A closed-form solution of (43) can be obtained, which makes the model computationally efficient and more widely applicable; a sketch is given below. Moreover, the model in (43) was extended to a kernel version, which was demonstrated to outperform existing clustering methods in terms of accuracy. Zhang et al. [102] replaced the normal graph convolution of (43) with a hypergraph convolution to exploit the group structure of the data beyond pairwise correlations. Moreover, a multi-hop aggregation strategy with the $K$-th power of the propagation matrix, i.e., $P^K$, was employed to incorporate the long-range interdependence between hyperedges and vertices. The resulting model is formulated as

$$\arg\min_{C}\; \frac{1}{2} \| X - P_h^K X C \|_F^2 + \frac{\lambda}{2} \|C\|_F^2, \qquad (44)$$

where $P_h$ is the propagation matrix of a hypergraph and $K$ is the number of hypergraph propagations. The developed model outperforms (43) and achieves state-of-the-art clustering accuracy on five benchmark HSI data sets.
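The closed-form step can be sketched in a few lines. The snippet below illustrates a GCSC-style clustering pass under the formulation assumed in (43): one linear graph convolution followed by an ℓ2-regularized self-representation; the regularization weight is an arbitrary placeholder.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def gcsc_sketch(X, P, lam=1e-2, n_clusters=7):
    """One linear graph convolution Xbar = P X, then the ridge-type
    self-representation X ~ C Xbar solved in closed form.
    X: N x d features (rows are samples), P: N x N propagation matrix."""
    N = X.shape[0]
    Xbar = P @ X                                   # linear graph convolution
    A = Xbar @ Xbar.T + lam * np.eye(N)            # symmetric positive definite
    C = np.linalg.solve(A, Xbar @ X.T).T           # C = X Xbar^T A^(-1)
    W = 0.5 * (np.abs(C) + np.abs(C).T)            # affinity as in SSC
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels
```

The single linear solve replaces the iterative optimization of SSC-type models, which is where the efficiency gain noted above comes from.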
In [103], Cai et al. proposed a more generalized linear graph convolutional network, consisting of a parameter-free neighbourhood propagation and a task-specific linear model with a closed-form solution. As in [102], the nonlinear activation function of (42) is removed, resulting in a simplified linear graph convolutional network. Moreover, an improved propagation scheme over [102] was devised by taking the initial node features into account, formulated as:

$$H^{(r+1)} = (1 - \alpha)\, P H^{(r)} + \alpha X, \quad H^{(0)} = X, \qquad (45)$$

where the initial features $X$ also contribute to the update of the graph embedding $H^{(r)}$ with a fixed proportion $\alpha$. By setting $\alpha = 0$, the propagation of (45) reduces to the one in [102]. With the graph propagation scheme (45), a subspace clustering model is formulated as:

$$\arg\min_{C}\; \frac{1}{2} \big\| H^{(K+1)} - H^{(K+1)} C \big\|_F^2 + \frac{\lambda}{2} \|C\|_F^2,$$

where $H^{(K+1)}$ is the final graph embedding of the linear graph convolutional network and $C$ is the parameter matrix of the overall deep clustering network. The final clustering result is obtained by applying the affinity matrix built from $C$ to spectral clustering. The developed model is demonstrated to outperform traditional shallow representation-based methods and deep clustering methods.

Contrastive Learning Based Deep Clustering (CLDC)
Contrastive learning, a recently developed self-supervised learning technique, has achieved remarkable performance in feature learning [173,174] and in the classification of HSIs [175,176]. It promotes different augmentations of the same data point, called positive pairs, to yield more similar deep representations than the augmentations of other data points, leading to improved discrimination between data points in the feature space. To achieve this, different contrastive loss functions have been designed, including the instance-level InfoNCE loss [177,178] and between-cluster losses [179,180]. Compared with the aforementioned AE-based deep clustering models, which learn features by minimizing a data reconstruction loss, contrastive learning is more effective at learning discriminative features for classification tasks. Research on contrastive learning for the clustering of HSIs is at a very early stage, but the initially obtained clustering performance is remarkable and demonstrates that contrastive learning is highly promising in the domain.
In [104], Cao et al. proposed an effective classification framework for HSIs by combining contrastive learning and AEs. It consists of three steps: (1) generation of two augmentations of the data by a variational AE (VAE) and an adversarial AE (AAE), (2) feature extraction via contrastive learning and (3) clustering or classification of the generated deep features. In the first step, the two different AEs are employed as transform functions for data augmentation. In the second step, the authors developed an adaptive InfoNCE contrastive loss that incorporates the group information of the features, promoting the within-cluster features to be close to their centroids. Experimental results show that contrastive learning is able to extract more discriminative features, even compared with supervised models. In [105], Kang et al. adopted random patch cropping to generate anchor images, and generated augmented images by selecting patches close to the central pixels of the anchor images. CNNs were employed to extract deep features of the anchor and augmented images, and the parameters of the CNNs were obtained with the InfoNCE contrastive loss (sketched below). Finally, the learned features from the CNNs were fed into classifiers or clustering algorithms to obtain the classification results. In [106], Hu et al. generated augmented images for contrastive learning by image flipping and random removal of non-central pixels. Moreover, a two-branch CNN was proposed to extract the spectral and spatial features of HSIs. By combining an instance-level contrastive loss and a cluster-level contrastive loss, an overall contrastive learning loss function was obtained, which minimizes the distances between positive pairs and maximizes the distances between negative pairs. Benefiting from the improved discrimination between data points, the proposed model yields significant accuracy improvements compared with traditional shallow clustering models and state-of-the-art deep clustering models.
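The instance-level InfoNCE loss shared by these methods can be written compactly. The sketch below follows the common NT-Xent form with in-batch negatives; the temperature value is an arbitrary choice rather than one reported in [104–106].

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Instance-level InfoNCE loss for two batches of augmented embeddings
    z1, z2 (N x d), where (z1[i], z2[i]) are positive pairs and all other
    samples in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                  # 2N x d
    sim = z @ z.t() / temperature                   # cosine similarities
    sim.fill_diagonal_(float("-inf"))               # exclude self-pairs
    n = z1.shape[0]
    # index of the positive partner for each of the 2N samples
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Treating the similarity rows as logits over candidate partners turns the contrastive objective into a classification problem: each sample must "recognize" its own augmentation among all other samples in the batch.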
The aforementioned CLDC models need a separate clustering algorithm, such as k-means or spectral clustering, to cluster the extracted deep features, which makes them unscalable to big data. In [107], Cai et al. developed an end-to-end and scalable CLDC model by combining a symmetric twin CNN-based feature learning network with a projection head. The twin CNNs are used to extract deep features of the augmented data, which are further fed into the projection head to directly obtain label representations. Moreover, a novel contrastive loss function, consisting of a within-cluster contrastive loss and a between-cluster contrastive loss, was designed to train the network, promoting an increase of the within-cluster similarity and of the between-cluster differences in the feature domain. Experimental results show that the proposed model outperforms state-of-the-art approaches by large margins.
In general, benefiting from their powerful nonlinear data fitting ability, deep learning-based clustering approaches are more effective at dealing with complex data than traditional clustering models. The features extracted by deep learning are often more clustering-friendly, leading to improved clustering accuracy. However, most deep clustering models separate clustering from feature learning, so the extracted features might not fit well with the adopted clustering algorithms. In addition, existing deep clustering methods often need to apply dimensionality reduction techniques to reduce the dimension of HSIs and avoid a high computational complexity; this results in a loss of spectral information, degrading their clustering accuracy to a certain degree. Moreover, the lack of explainability of deep learning, the uninvestigated robustness to noise and the high requirements on computing resources pose limitations on deep clustering models in real applications.

Experiments
In this section, we conducted extensive experiments with different clustering algorithms on two real HSIs to investigate their clustering performance, and we provide systematic comparisons between the different methods together with in-depth analyses. A toolbox that contains the implementations of the different clustering methods can be accessed via https://github.com/shhuang-1767/HSI_clustering.git (accessed on 20 May 2023).

HYDICE Urban
The first data set used for evaluation is HYDICE Urban, which was captured by the Hyperspectral Digital Imagery Collection Experiment (HYDICE) during a flight campaign over Copperas Cove, near Fort Hood, TX, USA. The data size of HYDICE Urban is 307 × 307 × 210, covering the spectral range from 400 nm to 2500 nm. Due to serious degradation by the atmosphere and water absorption, bands 1-4, 76, 87, 101-111, 136-153 and 198-210 are removed, and the remaining 162 bands are used in the experiments. For computational efficiency, a typical subset of the data with a size of 150 × 160 × 162 was used as the test data, which includes seven classes as shown in Table 4. The false-color image and the ground truth are shown in Figure 10.

University of Houston (Houston)
The second benchmark data set was acquired by the ITRES-CASI 1500 sensor over the University of Houston campus and the neighbouring urban area. A representative region with the image size of 130 × 130 × 144 was selected as the test data, which contains seven classes as shown in Table 4. The false-color image and the ground truth of Houston are shown in Figure 11.
Compared Methods

The following twelve representative clustering methods are compared in the experiments:

1. K-means [19]: a commonly used clustering algorithm due to its simplicity and efficiency.
2. NMF [151]: a representation learning-based approach that applies k-means to the features extracted by nonnegative matrix factorization.
3. ONMF-TV [85]: an orthogonal NMF clustering model with a total variation regularized spatial constraint on the cluster membership matrix.
4. SSC [31]: the classical sparse subspace clustering model based on sparsity-regularized self-representation of the data.
5. JSSC [55]: a spatial-spectral SSC model with joint sparsity on the coefficients of segmented super-pixels.
6. ODL: a scalable subspace clustering method based on online dictionary learning, which replaces the full self-representation with a compact learned dictionary.
7. Sketch-TV [73]: a scalable spatial-spectral subspace clustering model integrating dictionary sketching and a TV spatial regularization.
8. GCSC [66]: a graph convolutional subspace clustering model with a closed-form solution of the representation coefficients.
9. AEC [100]: an autoencoder-based clustering model, where a three-layer stacked denoising AE is used to extract deep features of the HSI and k-means is adopted to obtain the final clustering result.
10. DEC [100]: a symmetric AE-based deep clustering model, which extends AEC by introducing a KL divergence clustering loss to jointly learn the encoder and the cluster centroids.
11. RNNC [97]: an asymmetric AE-based clustering model, where RNNs are employed to build the encoder and a multilayer perceptron is used as the decoder. In our experiments, the RNNs were built with long short-term memory (LSTM) units, and the latent features extracted by the encoder were fed to k-means to yield the clustering results.
12. HyperAE [92]: a recent self-representation-based deep clustering model, which integrates the self-expressiveness of the data points and a graph-based manifold regularization into the autoencoder, resulting in an improved similarity matrix for spectral clustering.

Evaluation Metrics
We adopt six evaluation metrics to measure the performance of the clustering methods: overall accuracy (OA), average accuracy (AA), Kappa coefficient (κ), normalized mutual information (NMI), adjusted Rand index (ARI) and Purity. To calculate OA, AA and κ, we first find the best match between the clustering results and the ground truth through an optimal mapping function obtained by the Kuhn-Munkres algorithm [181]. For a data set with $N$ samples, the OA is obtained by:

$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{N} \delta\big(l_i, \mathrm{map}(r_i)\big),$$

where $r_i$ is the label of the $i$-th data point obtained by clustering, $l_i$ is the corresponding true label, $\delta(x, y) = 1$ if $x = y$ and zero otherwise, and $\mathrm{map}(\cdot)$ is the mapping function obtained by [181]. Let $n_{i,j}$ be the number of samples in class $i$ that are labelled as class $j$. The accuracy of the $i$-th class is computed by $p_i = n_{i,i} / n_{i,+}$, where $n_{i,+} = \sum_j n_{i,j}$ is the number of samples in class $i$. Then, AA is calculated by

$$\mathrm{AA} = \frac{1}{C} \sum_{i=1}^{C} p_i,$$

where $C$ is the number of clusters. The Kappa coefficient κ is defined as:

$$\kappa = \frac{N \sum_{i} n_{i,i} - \sum_{i} n_{i,+}\, n_{+,i}}{N^2 - \sum_{i} n_{i,+}\, n_{+,i}},$$

where $n_{+,i} = \sum_j n_{j,i}$ is the number of samples that are identified as class $i$. The NMI score is calculated as:

$$\mathrm{NMI}(l, r) = \frac{I(l; r)}{\sqrt{H(l)\, H(r)}},$$

where $I(l; r)$ denotes the mutual information between $l$ and $r$, and $H(l)$ and $H(r)$ are their entropies. The ARI score is obtained by:

$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{n_{i,j}}{2} - \Big[ \sum_i \binom{n_{i,+}}{2} \sum_j \binom{n_{+,j}}{2} \Big] \big/ \binom{N}{2}}{\frac{1}{2} \Big[ \sum_i \binom{n_{i,+}}{2} + \sum_j \binom{n_{+,j}}{2} \Big] - \Big[ \sum_i \binom{n_{i,+}}{2} \sum_j \binom{n_{+,j}}{2} \Big] \big/ \binom{N}{2}}.$$

Let $\Omega = \{w_1, w_2, \ldots, w_K\}$ be the clusters obtained by the clustering algorithm and $\mathcal{C} = \{c_1, c_2, \ldots, c_C\}$ be the ground truth, where $w_i$ is the set of samples grouped into the $i$-th cluster and $c_i$ is the set of samples belonging to the $i$-th class according to the ground truth. In the experiments, we assume that the number of clusters is known, i.e., $K = C$. Then, the Purity score is obtained by

$$\mathrm{Purity}(\Omega, \mathcal{C}) = \frac{1}{N} \sum_{k=1}^{K} \max_{j} |w_k \cap c_j|.$$

The evaluation metrics range over [−1, 1] for κ, [0, 1] for NMI, [−1, 1] for ARI and [0, 1] for Purity; a larger value indicates better performance. We also report the running time of the different clustering methods. Note that k-means, NMF, ONMF-TV, SSC, JSSC and Sketch-TV are implemented in MATLAB on a computer with an Intel Core i7-3930K CPU and 64 GB of RAM. The ODL and GCSC methods are implemented in Python on a server node with an Intel Core i7-4930K CPU and 64 GB of RAM. AEC, DEC, RNNC and HyperAE are implemented in Python and run on an NVIDIA GeForce GTX 1080Ti GPU with 11 GB of memory.
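As a worked illustration, the matching-based metrics can be computed with SciPy and scikit-learn as below. This is a sketch under stated assumptions: labels are integers in [0, C), scikit-learn's default NMI normalization stands in for the formula above, and its Kappa and ARI routines implement the standard definitions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, cohen_kappa_score,
                             normalized_mutual_info_score)

def clustering_scores(y_true, y_pred):
    """OA/AA/kappa after Kuhn-Munkres matching, plus NMI/ARI/Purity."""
    C = max(y_true.max(), y_pred.max()) + 1
    # contingency table: rows = predicted clusters, cols = true classes
    cont = np.zeros((C, C), dtype=np.int64)
    for r, l in zip(y_pred, y_true):
        cont[r, l] += 1
    row, col = linear_sum_assignment(-cont)        # maximize matched samples
    mapping = dict(zip(row, col))
    y_map = np.array([mapping[r] for r in y_pred])
    oa = (y_map == y_true).mean()
    aa = np.mean([(y_map[y_true == c] == c).mean() for c in range(C)])
    kappa = cohen_kappa_score(y_true, y_map)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    purity = cont.max(axis=1).sum() / len(y_true)
    return dict(OA=oa, AA=aa, kappa=kappa, NMI=nmi, ARI=ari, Purity=purity)
```

Note that NMI, ARI and Purity are invariant to label permutations, so only OA, AA and κ require the explicit Kuhn-Munkres mapping.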

Performance Comparison
We report the quantitative evaluation of the clustering methods on the two data sets in Tables 5 and 6, and the corresponding clustering maps in Figures 12 and 13. In the tables, the best result is annotated in bold and the second best result is underlined. We set the number of columns of U, i.e., r, to C for NMF and ONMF-TV, the dictionary size to 70 for Sketch-TV, the dimensionality of the latent features to C for AEC and DEC, and the dimensionality of the latent features to 18 for RNNC. For all the methods, we first performed PCA [182] to reduce the spectral dimensionality of the HSI to eight for computational efficiency and then extracted the spatial patch of each central pixel across all bands with a 3 × 3 square window, which serves as the input data point for each clustering method.
It is observed in Tables 5 and 6 that k-means and NMF do not perform well on either data set in terms of accuracy. The reason can be attributed to the non-spherical cluster distributions of the HSIs, shown in Figures 10c and 11c, which cannot be effectively handled by k-means. NMF performs clustering via k-means in a representation domain. However, the representations of the pixels are learned independently of each other, making NMF sensitive to noise and outliers. Moreover, the representation learning in NMF is separated from k-means, which may produce features that are unmatched to k-means, leading to degraded clustering accuracy. In terms of running time, NMF and k-means are much faster than the others, demonstrating their superior efficiency. The results in Tables 5 and 6 show that ONMF-TV outperforms NMF by a large margin, with OA improvements of 9.58% on HYDICE Urban and 4.97% on Houston. The improved performance mainly benefits from the orthogonality constraint and the incorporation of spatial information. NMF performs similarly to k-means on HYDICE Urban, but much worse than k-means on Houston. This might be caused by the small value of r in NMF, which results in non-discriminative features for clustering. The sparse representation-based clustering methods SSC and JSSC perform consistently better than the classic methods k-means and NMF. Compared with k-means-based methods, SSC and JSSC make no assumption on the cluster distribution of the data. In particular, they uncover the cluster structure of the HSI in a graph that is adaptively learned in a sparsity-driven self-representation model. The results demonstrate that self-representation models are very effective at learning the cluster structure of complex data. However, the high computational complexity of SSC and JSSC makes their running times much longer than those of the others. The spatial-spectral JSSC method yields higher accuracy than SSC; however, due to the imprecise super-pixel segmentation of the HSI, the accuracy improvement of JSSC is rather limited. Compared with SSC, the clustering maps of JSSC are smoother, as shown in Figures 12 and 13. The scalable subspace clustering methods ODL and Sketch-TV obtain much faster running speeds than SSC and JSSC due to the introduced compact dictionary, which significantly reduces the number of parameters to be optimized. However, the speed improvement of ODL comes at the cost of accuracy. Due to the incorporation of the spatial information of the HSI, Sketch-TV yields improvements in both accuracy and running speed. Among the shallow representation-based clustering methods, Sketch-TV performs the best in terms of OA, κ, NMI, ARI and Purity. The main reason can be attributed to the reduced within-cluster feature variance induced by the adopted TV-based local spatial constraint. Compared with JSSC, which also incorporates spatial information, Sketch-TV performs considerably better, indicating the importance of an effective spatial constraint.
The deep learning-based clustering methods GCSC, AEC, DEC, RNNC and HyperAE outperform the shallow clustering methods in most cases on HYDICE Urban. On Houston, the deep learning-based methods do not consistently yield better performance than the shallow methods. HyperAE performs the best in terms of accuracy among the deep clustering methods, but slightly worse than Sketch-TV on Houston. As both HyperAE and Sketch-TV feed the constructed similarity matrix into spectral clustering to yield the final clustering results, the lower accuracy indicates that deep features extracted by neural networks do not always guarantee superior performance over traditional shallow clustering methods. It also verifies the importance of incorporating prior information of the HSI, such as local spatial smoothness, global non-local structure, low-rankness and sparsity, to learn clustering-friendly features, instead of relying purely on data-driven techniques. Compared with k-means and NMF, AEC obtains improved accuracy, which demonstrates that the features extracted by the AE are more discriminative than those in the original domain and those from the shallow feature extraction model NMF. However, the improvement is limited, which might be attributed to the separation of feature extraction from clustering. DEC extends AEC by jointly fine-tuning the weights of the AE and performing clustering through a clustering loss function, resulting in improved performance, as shown in Tables 5 and 6; the trade-off for the accuracy improvement is a slight increase in running time. Benefiting from the graph convolution of the dictionary, GCSC obtains improved accuracy compared with SSC and JSSC. Moreover, the employed collaborative representation with an $\ell_2$ norm allows GCSC to obtain a closed-form solution, avoiding the need to derive the optimal solution in an iterative fashion. This leads to a much lower computational complexity of GCSC compared with SSC and JSSC. RNNC yields improved accuracy compared with AEC, demonstrating the potential of asymmetric AEs in unsupervised feature extraction. Figures 12l and 13l show that the clustering maps of RNNC are much smoother than those of AEC. It is observed that HyperAE takes the longest running time among the deep clustering methods, which can mainly be attributed to the introduced self-representation layer, which results in a huge coefficient matrix to be optimized, as in the traditional methods SSC and JSSC.

Summary and Conclusions
In parallel to the supervised classification of HSI, the clustering of HSI is another important research topic in the field of remote sensing. Model-based optimization methods have achieved remarkable performance in the clustering of HSI and have attracted increasing attention in recent years. Meanwhile, powered by deep learning, emerging deep clustering methods extend the model-based methods and yield huge breakthroughs in the clustering of HSIs. However, a comprehensive and systematic overview has been absent, which makes it difficult for researchers, especially beginners, to quickly get into the field and develop their own models, hindering the development of new techniques. In this paper, we showed the evolution of model-based methods and deep learning-based approaches for HSI clustering and provided a systematic overview of each category of methods. Moreover, we discussed the advantages and disadvantages of each subcategory of clustering methods.
We conducted extensive experiments on two real HSIs to compare the performance of twelve representative clustering methods, including the shallow clustering methods k-means, NMF, ONMF-TV, SSC, JSSC, ODL, and Sketch-TV, and the deep clustering methods GCSC, AEC, DEC, RNNC, and HyperAE. Source codes of the different methods were provided to boost the research in the field. Important observations were made through the experiments as follows:

1. Recent deep clustering methods outperform the shallow clustering methods in most cases. Nevertheless, the experimental results show that some traditional shallow clustering methods, such as Sketch-TV, can yield competitive or even better clustering accuracy compared with the state-of-the-art deep clustering methods.
2. Deep feature extraction by an autoencoder indeed improves the discriminability between different clusters compared with using raw data. However, the accuracy improvement might be limited by an inappropriate clustering algorithm or by unconsidered spatial information of the HSI. Our results show that the traditional NMF feature extraction fails to yield improved performance.
3. Spatial-spectral clustering methods often perform better than spectral-based clustering methods. However, the degree of performance improvement highly relies on the adopted spatial regularizations, demonstrating the importance of an effective spatial constraint.
4. Self-representation-based shallow and deep clustering methods are very competitive compared with other clustering methods. However, the computational complexities of self-representation models are much higher than others, which limits their applications on large-scale data.
5. Clustering methods that combine representation learning and clustering in a unified model yield improved accuracies compared with the methods that perform the two steps separately. This demonstrates that introducing a clustering-related loss function improves the clustering performance.
Finally, we pointed out unsolved important problems and future trends in the field as follows:

1. Most existing clustering methods assume that the number of clusters is known, and very few studies in remote sensing focus on the estimation of the number of clusters. Thus, there is an urgent need to design an effective method to calculate the number of clusters for real applications.
2. As data-driven deep clustering methods are typically trained on a specific target data set, the trained models often cannot be well generalized to new data sets. When the trained neural network is applied to a different HSI, the learned features might not be discriminative for clustering due to the different ground objects, varying spatial resolutions and different levels of noise. Improving the robustness and generalization of deep clustering methods is crucial in the domain.
3. Although deep clustering methods often yield better clustering results, a theoretical explanation of the superior performance is still absent, which means that existing deep clustering methods of HSI still lack interpretability for experts to deal with occasional failures on some data sets. A deeper and clearer understanding of the mechanism of deep clustering models is needed. Thus, explainable AI on the clustering of HSI is a very interesting research direction.
4. Current clustering methods of HSI rely on a single clustering algorithm, whose performance is highly limited by the separability of the features and the clustering ability of the selected clustering algorithm. It is known that different clustering methods have different advantages. Thus, it is more desirable to combine the clustering results of different clustering methods (also known as ensemble clustering) to find a consensus, which will effectively improve the clustering accuracy and the robustness to noise.
5. Clustering methods of HSI are mostly designed for a single data source, which is vulnerable to noise and other degradations. Recent advances in remote sensing greatly increase the types of sensors for Earth observation, resulting in different data modalities such as LiDAR, SAR, multispectral images, etc. Moreover, various hand-crafted features, which capture different data properties of HSI from different views, are demonstrated to be helpful in the classification of HSI. Incorporating the complementary information from different image modalities in the clustering of HSI can break the performance limitation of single-source clustering methods, which also improves the robustness of the model to various degradations.
6. Current advanced clustering methods either perform feature extraction and clustering of the data separately or integrate the two steps in a unified clustering framework. All of them still rely on conventional clustering algorithms, such as k-means, spectral clustering, GMM, and density-based methods, to yield the final clustering results. Designing a completely data-driven deep clustering model, which gets rid of the conventional clustering algorithms, might lead to a significant performance improvement.