Hierarchical Sparse Subspace Clustering (HESSC): An Automatic Approach for Hyperspectral Image Analysis

Hyperspectral imaging techniques are becoming one of the most important tools to remotely acquire fine spectral information on different objects. However, hyperspectral images (HSIs) require dedicated processing for most applications. Therefore, several machine learning techniques have been proposed over the last decades. Among them, unsupervised learning techniques have become popular as they do not need any prior knowledge. Specifically, sparse subspace-based clustering algorithms have drawn special attention for clustering HSIs into meaningful groups, since such algorithms can handle high-dimensional and highly mixed data, as is the case in real-world applications. Nonetheless, sparse subspace-based clustering algorithms usually demand high computational power and can be time-consuming. In addition, the number of clusters usually has to be predefined. In this paper, we propose a new hierarchical sparse subspace-based clustering algorithm (HESSC), which handles the aforementioned problems in a robust and fast manner and estimates the number of clusters automatically. In the experiments, HESSC is applied to three real drill-core HSIs and the well-known rural benchmark Trento dataset. In order to evaluate HESSC, its performance is quantitatively and qualitatively compared to state-of-the-art sparse subspace-based algorithms. In addition, in order to provide a comparison with conventional clustering algorithms, HESSC is also compared with K-means and FCM. The obtained clustering results demonstrate that HESSC performs well when clustering HSIs compared to the other applied clustering algorithms.


Introduction
Hyperspectral imaging techniques enable the acquisition of hundreds of narrow spectral bands per pixel, covering, depending on the sensor characteristics, the visible to near-infrared (VNIR, 0.4-1 µm), shortwave infrared (SWIR, 1-2.5 µm), and longwave infrared (LWIR, 8-13 µm) ranges [1,2]. By using such fine spectral information, accurate discrimination of different materials and objects is possible, which makes hyperspectral images (HSIs) a great source of information for a wide variety of applications such as land-cover classification [3,4] and mineral mapping [5][6][7].
The analysis of HSI is, however, a challenging task due to a number of reasons, such as (1) the inherently nonlinear relations between the captured spectral information and the corresponding material, (2) the high dimensionality (i.e., the large number of spectral bands) combined with a limited number of representative samples, which leads to the so-called curse of dimensionality [8], (3) the large redundancy among adjacent bands, and (4) the presence of mixed pixels and different noise types in the data [9].
In order to cope with the aforementioned challenges and process HSIs, several machine learning techniques have been proposed in the past decades. Among the developed machine learning techniques, supervised classification techniques perform well to classify HSIs in different applications [10][11][12]. Nevertheless, for the classification of HSIs, fully supervised algorithms use a set of representative samples of each class, which are also known as training samples. In most cases, acquiring training samples is expensive and time-consuming. Therefore, unsupervised learning techniques (also known as clustering techniques), which classify input data without any label, have drawn considerable attention in many domains, such as computer vision, pattern recognition, and data mining [13].
Traditional clustering algorithms (e.g., K-means and fuzzy c-means (FCM)) normally use distance measures between data points to group them into clusters. For instance, K-means utilizes the Euclidean distance to measure the similarity between data points; data points with small Euclidean distances are grouped into the same cluster [14]. However, in these traditional algorithms, the number of clusters needs to be defined before processing the HSI. Moreover, the performance of such algorithms is not stable due to their random initialization [15].
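For illustration only, such a distance-based clustering of an HSI amounts to reshaping the image cube into a pixel-by-band matrix and running K-means on it; the following minimal Python sketch (with a hypothetical cube size and random data standing in for a real HSI) shows this baseline, which the algorithms discussed below aim to improve upon.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical HSI cube (rows x cols x bands); random data stands in for a real image.
rows, cols, bands = 100, 120, 63
cube = np.random.rand(rows, cols, bands)

X = cube.reshape(-1, bands)                                # one row per pixel, one column per band
labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)    # Euclidean-distance grouping
cluster_map = labels.reshape(rows, cols)                   # map the labels back to the image grid
```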
Inspired by the sparse coding mechanism of human vision systems [16], sparse representation was developed as a powerful approach for a variety of computer vision tasks, including face recognition, image super-resolution, and data segmentation, to name a few [17,18]. The concept of sparse representation has also been successfully extended to deal with the challenging task of hyperspectral image classification [19][20][21]. For the case of sparse representation classification (SRC), an input signal (i.e., usually a pixel vector) is represented by a sparse linear combination of samples (atoms) from a dictionary [22], in which the existing training data are usually used to produce the dictionary. SRC usually comes with the important advantage of avoiding the heavy and expensive training procedure that a typical supervised classifier generally conducts, since it is performed directly on the dictionary [12].
In [19], Chen et al. proposed a pixel-wise sparse model to classify HSIs in a supervised way. The method assumes that spectral pixels can be approximately represented in a low-dimensional subspace spanned by dictionary atoms from the same class. To further improve the performance of the spectral classifier and produce classification maps of higher quality, spatial information was incorporated into the classification framework in [19], shaping the joint sparse model (JSM). This approach has since been extended by a number of classification techniques based on the sparse representation model [22][23][24][25]. Although the concept of sparse representation has matured for supervised problems, it still remains in its infancy for unsupervised problems, where there are no training data to shape the dictionary for sparse classification.
In the last decade, sparse subspace-based clustering algorithms have been successfully applied in order to analyze HSIs [15,[26][27][28][29]. For instance, in [15], Elhamifar et al. proposed an unsupervised learning algorithm called sparse subspace clustering (SSC). In SSC, the entire dataset is used to identify the number of subspaces and data points that belong to each subspace. Although SSC provides good clustering results, one of its drawbacks is that it demands more time and computational power to process larger datasets.
In [30], the authors proposed a sparse subspace clustering algorithm by orthogonal matching pursuit (SSC-OMP). SSC-OMP uses the orthogonal matching pursuit algorithm as a greedy feature selection method to compute the sparsest coefficients in the sparse optimization problem. Consequently, the computed sparse coefficient matrix is used to cluster the data [31]. However, in such an algorithm the number of clusters has to be defined in advance.
In [26], the authors proposed exemplar-based subspace clustering (ESC), which is regarded as a fast sparse subspace-based algorithm for clustering large datasets. ESC clusters all of the data by using a set of representative samples. Because only a small number of samples is used to shape the dictionary, ESC is able to analyze larger datasets within an acceptable processing time. In ESC, in order to define the subset of representative samples, the first sample of the subset is randomly selected, and the remaining samples are selected based on it. Although ESC can efficiently handle the data, its concept has some limitations. In ESC, there is no criterion to choose the first sample of the subset, which increases the uncertainty of the algorithm. In other words, the algorithm is not stable due to the random selection of the first sample in the subset of representative samples. Moreover, in ESC, the number of clusters has to be predefined.
In [32], Wu et al. proposed a hierarchical sparse subspace clustering algorithm to cluster the datasets at multiple resolutions. The proposed algorithm by Wu et al. is able to classify the data in a semi-automatic way, and it can be useful for obtaining information at different resolutions. Nonetheless, it can be time-consuming since it uses SSC in each node of the hierarchical structure to cluster the data, which significantly reduces the efficiency of the algorithm.
In this paper, to address the aforementioned challenges, we propose a new hierarchical sparse subspace-based clustering algorithm to analyze HSIs. The proposed algorithm is capable of handling large datasets in a fast and robust manner. In the proposed algorithm, a hierarchical structure is used to analyze the data at different levels of detail. The foundation of the hierarchical structure is a lasso-based binary clustering algorithm, in which, to accelerate the clustering analysis, a random sample is selected to proceed with the binary clustering procedure. To reduce the effect of this random selection, an entropy-based ensemble algorithm is employed [33]. Additionally, in order to generate the number of clusters automatically, a criterion is defined by comparing the reconstruction error values at parental and child nodes. In the final step, a label is assigned to each end-node to produce the final clustering map. Thus, we can summarize the contributions of our proposed algorithm as follows:
1. We propose a sparse subspace-based clustering algorithm that considerably reduces the computational power requirements by using a novel lasso-based binary clustering algorithm.
2. The proposed algorithm provides robust and stable results thanks to the incorporation of the entropy-based consensus algorithm within its structure, which decreases the effect of the initial random selection of a sample in the lasso-based binary clustering algorithm.
3. The proposed algorithm benefits from defined criteria, which make it possible to automatically generate the number of clusters.
The paper is organized as follows. The state-of-the-art sparse subspace-based algorithms and the proposed algorithm are elaborated in Section 2. Afterward, the experimental results are presented and discussed in Section 3. Finally, Section 4 is devoted to conclusions and remarks.

Methodology
In the following section, an overview of the state-of-the-art subspace clustering algorithms (e.g., scalable exemplar-based subspace clustering [26]) is provided. The proposed clustering algorithm (HESSC) is then elaborated in detail. The workflow of the proposed algorithm is presented in Figure 1.

In order to smoothly grasp the concept of the proposed algorithm, we start by introducing the sparse dictionary learning concept. Let us denote the HSI as $Y = [y_1, y_2, \dots, y_N] \in \mathbb{R}^{D \times N}$, where $N$ is the number of pixels and $D$ is the number of spectral bands in $Y$. $y_i = [y_{i1}, y_{i2}, \dots, y_{iD}]^T$ represents the measured spectrum of the $i$-th pixel in $Y$, and $i \in \{1, 2, \dots, N\}$ indexes the pixels in the HSI. To ease the explanation, let us assume that $Y$ is free of any kind of error (e.g., noise, measurement failures, etc.). In dictionary learning problems, each $y_i$ can be represented by a linear or affine combination of a few atoms from a spectral dictionary $A = [a_1, a_2, \dots, a_h] \in \mathbb{R}^{D \times h}$, composed of $h$ spectra that are defined in advance [34]. The aforementioned representation can be formulated as:

$y_i = A c_i, \quad (1)$

where $c_i = [c_{i1}, c_{i2}, \dots, c_{ih}]^T \in \mathbb{R}^{h}$ is the representation coefficient vector of signal $y_i$. One can also transform Equation (1) into matrix form as follows:

$Y = A C, \quad (2)$

where $C = [c_1, c_2, \dots, c_N] \in \mathbb{R}^{h \times N}$ is the coefficient matrix representing the collection of data points. However, $A$ is usually over-complete, which results in an ill-posed problem. There are many ways to deal with this ill-posedness [34][35][36][37]. One of them is to employ the principle of sparsity [15], in which the sparse coefficient matrix can be obtained by solving the following optimization problem:

$\min_{C} \|C\|_0 \quad \text{s.t.} \quad Y = AC, \quad (3)$

where $\|C\|_n$ denotes the $\ell_n$-norm of $C$ [15]. In Equation (3), the $\ell_0$-norm counts the number of nonzero elements in $C$. The problem in Equation (3) is NP-hard; it can be approximated by greedy algorithms (e.g., orthogonal matching pursuit), but such algorithms lead to sub-optimal solutions [38]. In order to deal with the NP-hardness, the $\ell_0$-norm in Equation (3) is replaced with the $\ell_1$-norm, which is its closest convex approximation. Therefore, one can rewrite Equation (3) as the following convex problem:

$\min_{C} \|C\|_1 \quad \text{s.t.} \quad Y = AC. \quad (4)$

Sparse subspace clustering (SSC) is one of the most widely used subspace-based clustering algorithms [26]. In SSC [15], the self-expressiveness property of the data implies that each data point can be written as a linear combination of other data points from the same subspace. In other words, dictionary $A$ in Equation (4) is replaced with $Y$. Moreover, we assume that the data points in $Y$ are drawn from a union of $H$ linear subspaces, denoted as $\{S_1, S_2, \dots, S_H\}$. The main aim of subspace modeling is to identify the underlying subspaces by using $Y$. Therefore, the sparse optimization problem in the affine form can be written as:

$\min_{C} \|C\|_1 \quad \text{s.t.} \quad Y = YC, \ \mathrm{diag}(C) = 0, \ C^{T}\mathbf{1} = \mathbf{1}, \quad (5)$

where the constraint $\mathrm{diag}(C) = 0$, with $C \in \mathbb{R}^{N \times N}$, is imposed in order to avoid representing a data point as a linear combination of itself. Besides, $C^{T}\mathbf{1} = \mathbf{1}$ is a constraint that ensures the affine subspace case is handled, where $\mathbf{1}$ is a vector of ones. The optimization problem in Equation (5) can be solved by employing the alternating direction method of multipliers (ADMM). In order to have a better understanding of the ADMM solver, we refer the interested readers to [39,40].
After obtaining the sparsest coefficient matrix from data points, the symmetrical non-negative similarity matrix W ∈ R N×N is calculated as W = |C| + |C| T . W needs to be symmetrical to assure that all nodes from the same subspace are connected to each other. In the last step of SSC, a spectral clustering is applied to the similarity matrix to generate the final clustering result [15]. Although SSC provides reliable results, the algorithm suffers from memory shortage, since SSC employs all data points from Y to solve the sparse optimization problem in Equation (5). This issue leads to a very high computational time in large datasets [26].
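For illustration, a minimal Python sketch of this SSC pipeline is given below. It is a simplified stand-in rather than the reference implementation: scikit-learn's Lasso replaces the ADMM solver of Equation (5), the affine constraint $C^{T}\mathbf{1} = \mathbf{1}$ is omitted, and $\mathrm{diag}(C) = 0$ is enforced by excluding each pixel from its own dictionary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_sketch(Y, n_clusters, alpha=1e-3):
    """Simplified SSC: self-expressive sparse coding, W = |C| + |C|^T, spectral clustering.

    Y : (D, N) array with one pixel spectrum per column.
    """
    D, N = Y.shape
    C = np.zeros((N, N))
    for i in range(N):
        idx = np.delete(np.arange(N), i)               # exclude y_i itself (diag(C) = 0)
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(Y[:, idx], Y[:, i])                  # y_i as a sparse combination of the others
        C[idx, i] = lasso.coef_
    W = np.abs(C) + np.abs(C).T                        # symmetric non-negative similarity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```

Note that the loop solves N lasso problems with N - 1 atoms each, which is precisely why SSC does not scale to large HSIs.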
The sparse subspace clustering algorithm by orthogonal matching pursuit (SSC-OMP) uses OMP to solve the optimization problem in Equation (5). In SSC-OMP, it is assumed that the data are contaminated by noise. In order to build the sparse representation, at each OMP iteration the data point (atom) that maximizes the absolute inner product with the current residual is selected [31]. Therefore, SSC-OMP can be written as:

$\min_{c_i} \|y_i - Y c_i\|_2^2 \quad \text{s.t.} \quad \|c_i\|_0 \le K, \ c_{ii} = 0, \quad (6)$

where $K$ is the number of selected data points from $Y$ that are used to compute the sparsest coefficients [31]. $K$ is selected based on the noise present in the data [30]. The computation of $C$ in Equation (6) is followed by computing $W$, and in the last step, spectral clustering is applied to $W$ [31]. However, the number of clusters has to be known in advance. Additionally, the choice of $K$ is critical as well.
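A corresponding sketch of the coefficient computation in SSC-OMP is shown below, with scikit-learn's OrthogonalMatchingPursuit used as the greedy solver; the subsequent affinity and spectral-clustering steps are the same as in the SSC sketch above.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def ssc_omp_coefficients(Y, K):
    """Greedy sparse coding for SSC-OMP (Equation (6)): each pixel is approximated
    by at most K other pixels selected by OMP."""
    D, N = Y.shape
    C = np.zeros((N, N))
    for i in range(N):
        idx = np.delete(np.arange(N), i)                        # enforce c_ii = 0
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
        omp.fit(Y[:, idx], Y[:, i])
        C[idx, i] = omp.coef_
    return C      # follow with W = |C| + |C|^T and spectral clustering
```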
Another recent advanced sparse subspace-based clustering algorithm is exemplar-based subspace clustering (ESC). In [26], ESC was proposed by You et al. to cope with the scalability problem of SSC for large datasets. In ESC, a set of representative samples (also known as "exemplars") is selected to compute the sparsest coefficient matrix, and the first of these representative samples is randomly chosen. The optimization problem in ESC can be formulated as follows:

$\min_{C} \|C\|_1 + \frac{\lambda}{2} \|Y - Y_0 C\|_F^2, \quad (7)$

where $C \in \mathbb{R}^{P \times N}$ is the sparse coefficient matrix, $\lambda$ is a trade-off parameter between the sparsity and the fidelity terms, and $Y_0 = [y_1, y_2, \dots, y_P] \in \mathbb{R}^{D \times P}$ is a subset of selected representative samples ("exemplars") from $Y$, in which $P$ ($P \in \mathbb{Z}$ and $P \ll N$) needs to be predefined in advance. In order to compute $C$ in Equation (7), the lasso version of the LARS algorithm is used [41,42]. Owing to the dimensions of the computed $C$ in Equation (7), to achieve the final clustering map, the spectral clustering step is applied to the similarity matrix $W \in \mathbb{R}^{N \times N}$, which is calculated using the t-nearest neighbors of each $c_i$ [26,43]. In ESC, $Y_0$ is used as a dictionary to solve the minimization problem in Equation (7). Therefore, the processing time is efficiently reduced by using a subset of data points instead of the entire dataset. In addition, to the best of our knowledge, ESC has not yet been investigated for the analysis of HSIs. Although ESC is capable of handling relatively large datasets in a fast manner, its concept suffers from a shortcoming: as mentioned before, in ESC, the first sample of the subset $Y_0$ is randomly selected. Subsequently, the rest of the representative samples are identified by the farthest-first search method proposed in [26], which employs Equation (7) as the cost function. As a result of the random selection of the first sample in $Y_0$, the reliability of the algorithm is jeopardized. In addition, the selection of $P$ is another drawback of ESC, since the number of representative samples has a strong impact on the final clustering result.
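A minimal sketch of the ESC coding step in Equation (7) is given below, taking the exemplar indices as already selected (the random first exemplar and the farthest-first search are omitted) and using scikit-learn's LassoLars as the LARS-lasso solver; the mapping of λ to the solver's alpha parameter is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def esc_coefficients(Y, exemplar_idx, lam=10.0):
    """ESC-style coding: every pixel is represented sparsely over the small
    exemplar dictionary Y0 (Equation (7)) via the LARS-lasso solver."""
    Y0 = Y[:, exemplar_idx]                       # (D, P) exemplar dictionary
    P, N = Y0.shape[1], Y.shape[1]
    C = np.zeros((P, N))
    for j in range(N):
        lars = LassoLars(alpha=1.0 / lam, fit_intercept=False)
        lars.fit(Y0, Y[:, j])
        C[:, j] = lars.coef_
    return C      # cluster via a t-nearest-neighbor affinity on the columns of C
```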

Hierarchical Sparse Subspace Clustering (HESSC)
In this subsection, the proposed algorithm is described in detail. The approach involves the following three phases: (1) the hierarchical structure (a tree), which allows retrieving information from the HSI at different levels (see Figure 2); (2) the lasso-based binary clustering, which is the core part of the HESSC structure and is applied to the data points at each node of the tree (see Figure 3); (3) the criterion, described in detail in Section 2.2.3, which decides whether the current node is split further or not. Using this criterion in the third phase of HESSC, it is possible to automatically generate a suitable number of clusters for the analyzed dataset. The clusters returned by the algorithm are defined by the leaves of the hierarchical structure, i.e., the nodes that are not split any further; the data points associated with these nodes define the final clusters. The employed hierarchical structure is presented at different levels ($l_b$). The parental nodes are divided into child nodes based on the defined criterion, which uses the reconstruction error values of the parental and child nodes (see Section 2.2.3). At $l_2$, white nodes represent the nodes that will be split further, while the blue nodes represent the end-nodes of the hierarchical structure, which will not be split further. Additionally, in each node of the tree, the binary clustering procedure is applied to the data points (see Figure 3).

Figure 3. Illustration of the binary clustering algorithm, which is applied to each node of the tree. Step 1 clusters the data into binary clusters and is iterated k times. Step 2 builds a binary matrix (BM) by stacking all k binary clustering results. Step 3 uses an entropy-based ensemble algorithm to split the parental node into two child nodes by using the information in the BM.

Hierarchical Structure in HESSC
Let us denote the set of data points associated with the $r$-th node of the $b$-th level ($l_b$) of the tree as $Y_{r,l_b}$, where $r \in \{1, 2\}$, $b \in \{0, 1, \dots, l\}$, and $l$ is a user-defined parameter that specifies to which level the HSI is to be explored, thereby indirectly defining the upper boundary of the number of clusters in HESSC ($2^l$). $Y_{r,l_b}$ consists of $N_{r,l_b}$ pixels with $N_{r,l_b} < N$. In HESSC, a tree-based data structure is established to describe the variation of the data points in the HSI at different levels of detail [32]. Additionally, HESSC decreases the computational cost by dividing the data into smaller subsets at each node. As shown in Figure 2, the first node is associated with the original HSI ($Y_{l_0} = Y$). Subsequently, the binary clustering algorithm (Figure 3) is applied to $Y_{l_0}$ to split the data of the current node into two groups, denoted as $Y_{1,l_1}$ and $Y_{2,l_1}$, which are the two nodes at the first level. In mathematical terms, $Y_{l_0} = Y_{1,l_1} \cup Y_{2,l_1}$. The splitting procedure continues at each node until the tree reaches its maximum level, at which the number of end-nodes, in the absence of any further constraints, equals $2^l$, or until the criterion defined in HESSC is satisfied (see Section 2.2.3). The entire procedure relies on the main idea behind HESSC, namely the lasso-based binary clustering. After obtaining the binary clustering result at the current node, the decision on dividing the current node is made based on the defined criterion.
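The recursion behind this hierarchy can be summarized by the following Python skeleton; binary_cluster and may_split are hypothetical placeholders for the lasso-based binary clustering and the reconstruction-error criterion described in the next two subsections.

```python
def hessc_tree(Y_node, level, max_level, binary_cluster, may_split):
    """Recursive skeleton of the HESSC hierarchy (helper names are placeholders).

    Y_node : (D, N_node) array of pixels reaching this node.
    may_split(parent, child) : True if the child's reconstruction error justifies
    splitting it further (the criterion of Section 2.2.3).
    Returns a list of leaf arrays; each leaf becomes one final cluster.
    """
    if level >= max_level or Y_node.shape[1] < 2:
        return [Y_node]                                # maximum depth reached: end-node
    leaves = []
    for child in binary_cluster(Y_node):               # lasso-based binary split
        if may_split(Y_node, child):                   # reconstruction-error criterion
            leaves += hessc_tree(child, level + 1, max_level, binary_cluster, may_split)
        else:
            leaves.append(child)                       # criterion not met: end-node
    return leaves
```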

Binary Clustering in HESSC
The core of the proposed approach is a binary clustering algorithm, which is based on the lasso function. The sparse optimization problem in HESSC is equivalent to the ESC minimization problem shown in Equation (7). Thus, one can write the optimization problem for HESSC as follows:

$\min_{C} \|C\|_1 + \frac{\lambda}{2} \|Y_{r,l_b} - y_i C\|_F^2, \quad (8)$

where $C \in \mathbb{R}^{1 \times N_{r,l_b}}$ stands for the computed sparse coefficients (one coefficient per data point) and $y_i$ is a randomly selected sample from $Y_{r,l_b}$. The main difference between the optimization problems in Equations (7) and (8) is that, in HESSC, a single random sample $y_i$ is used instead of a subset of representative samples to solve the sparse optimization problem. In ESC, the execution time grows as $P$ increases, whereas the processing time needed to solve Equation (8) is lower than that of Equation (7): the computational burden is considerably reduced, since only one random sample is selected to solve the sparse optimization problem. Since $y_i$ is randomly chosen, to reduce the randomness effect in the binary clustering procedure, HESSC employs an entropy-based consensus clustering algorithm (ECC), which was proposed in [33]. To the best of our knowledge, this is the first attempt to use ECC in an HSI analysis framework. ECC initially uses a basic clustering algorithm (e.g., K-means) to obtain k distinct clustering results (different groupings of the dataset) by varying the number of clusters. Subsequently, ECC employs the Shannon entropy as the utility function in order to quantify the similarity between two clustering results (the higher the value of the utility function, the closer the two clustering results are to each other) and looks for the ensemble clustering that has the maximum similarity with all k clustering outputs resulting from the basic clustering algorithm. In order to implement ECC, a modified-distance K-means algorithm was developed to obtain the final clustering map [33]. In [33], the execution time increases with any of the following parameters: (1) the number of iterations I within the ECC algorithm; (2) the number of clustering outputs k, which are concatenated to be used as entries to ECC; and (3) L, the number of clusters in the HSI. Although parameter L is equal to 2 in HESSC, increasing I and k increases the execution time of HESSC as well. For more details on ECC, we refer interested readers to [33]. After extracting the sparsest coefficient matrix by solving Equation (8), the sparse coefficients are sorted in ascending order, and the cumulative sum $CS_i = \sum_{j=1}^{i} c_{(j)}$ is computed for $i = 1, \dots, N_{r,l_b}$, where $c_{(j)}$ denotes the $j$-th smallest coefficient; CS is thus a vector with $N_{r,l_b}$ elements. CS is then normalized between 0 and 1 by its $N_{r,l_b}$-th element (its maximum value). Consequently, to generate the binary clustering result of $Y_{r,l_b}$, a threshold criterion $CS_i / CS_{N_{r,l_b}} > \tau$ is used, where $0 < \tau < 1$ is a user-defined threshold: all points for which $CS_i / CS_{N_{r,l_b}}$ is less than or equal to $\tau$ are assigned to cluster 0, and those for which it is greater than $\tau$ are assigned to cluster 1.
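A compact sketch of one such binary split is given below; scikit-learn's LassoLars stands in for the lasso solver of [41,42], the λ-to-alpha mapping is illustrative only, and the helper name and defaults are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def lasso_binary_split(Y_node, tau=0.5, lam=10.0, rng=None):
    """One lasso-based binary split (Equation (8)) followed by the
    cumulative-sum threshold tau; a sketch, not the reference implementation."""
    rng = np.random.default_rng() if rng is None else rng
    D, N = Y_node.shape
    i = rng.integers(N)                                # randomly selected reference pixel y_i
    lars = LassoLars(alpha=1.0 / lam, fit_intercept=False)
    # Represent every pixel with the single atom y_i: one coefficient per pixel.
    c = np.array([lars.fit(Y_node[:, [i]], Y_node[:, j]).coef_[0] for j in range(N)])
    order = np.argsort(c)                              # sort coefficients in ascending order
    cs = np.cumsum(c[order])
    cs = cs / cs[-1]                                   # normalise by the last (maximum) element
    labels = np.zeros(N, dtype=int)
    labels[order[cs > tau]] = 1                        # threshold at tau -> two clusters
    return labels
```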
We set τ = 0.5, since this value is large enough to capture the general information in the HSI and, at the same time, small enough to capture the detailed information in the HSI. However, if there is a desire to capture more details in each cluster, τ needs to be set close to 0. The structure of the proposed binary clustering algorithm is presented in Figure 3. The binary clustering step is run k times, each time with a different randomly selected y_i, and all k binary outputs are stacked to form a binary matrix (BM). In our proposed algorithm, k and I are set to 100 and 40, respectively; these values were suggested as the default and optimal values by the developers in [33]. Afterwards, ECC is applied to the BM to split the parental node into two child nodes. This procedure continues until one of the criteria of HESSC is met. Eventually, a label is assigned to the data points of each end-node of the hierarchical structure in order to produce the final clustering map.
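Building on the lasso_binary_split sketch above, the repetition and fusion step could be sketched as follows; a plain majority vote after aligning label swaps is used here merely as a simple stand-in for the ECC consensus of [33], which is not reimplemented.

```python
import numpy as np

def binary_consensus(Y_node, k=100, tau=0.5, rng=None):
    """Run the binary split k times, stack the outputs into a binary matrix BM
    (k x N), and fuse them; majority voting replaces ECC in this sketch."""
    rng = np.random.default_rng() if rng is None else rng
    BM = np.stack([lasso_binary_split(Y_node, tau=tau, rng=rng) for _ in range(k)])
    for r in range(1, k):                              # binary labels are defined up to a swap
        if np.mean(BM[r] == BM[0]) < 0.5:
            BM[r] = 1 - BM[r]
    return (BM.mean(axis=0) > 0.5).astype(int)         # consensus assignment per pixel
```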

Automatic Generation of Number of Clusters
The following criterion is defined in HESSC in order to automatically generate the most representative number of clusters. The criterion relies on the reconstruction error values of each node in the hierarchical structure. The first node $Y_{l_0}$ is always divided into two nodes ($Y_{1,l_1}$ and $Y_{2,l_1}$), with $Y_{l_0} = Y_{1,l_1} \cup Y_{2,l_1}$. Let us assume that the eigendecomposition of the matrix $C_{r,l_1} = Y_{r,l_1} Y_{r,l_1}^T$ at $l_1$ can be obtained as $C_{r,l_1} = Q_{r,l_1} \Lambda Q_{r,l_1}^T$, where $Q_{r,l_1} = [q_1, \dots, q_{N_{r,l_1}}] \in \mathbb{R}^{D \times N_{r,l_1}}$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_{N_{r,l_1}})$ contain the eigenvectors and eigenvalues of $Y_{r,l_1}$, respectively. Therefore, by obtaining $Q_{r,l_1}$ and $\Lambda$, the dimension of each subspace can be estimated [32]. The orthonormal basis of each subspace can be written as $U = [q_1, \dots, q_{d_{r,l_1}}]$, where $d_{r,l_1}$ is the identified subspace dimension in the $r$-th node at level $l_1$. The parameter $d_{r,l_b}$ is estimated with an energy threshold, i.e., as the smallest number of leading eigenvalues whose cumulative sum reaches a fraction $\alpha$ of the total energy; $\alpha$ is fixed to 0.99 in order to capture the higher variation within the data in each node. Subsequently, the reconstruction error values of the parental node ($E_p$) and of its child nodes ($E_c$), computed from the projection of the corresponding data onto their estimated subspaces [32], are compared in order to decide whether the child node is divided further. Consequently, the child node will be split further if its reconstruction error is sufficiently smaller than the parental one, as controlled by a predefined parameter $0 \le \beta \le 1$ that governs the nodes' division; a $\beta$ closer to 0 means that the clustering result will have a higher number of clusters.
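For illustration, one plausible reading of the node-level reconstruction error is sketched below: the subspace dimension is taken as the smallest number of leading eigenvalues capturing a fraction α of the total energy, and the error is the relative residual after projecting the node's pixels onto that subspace. This is an assumption made for clarity; the exact formulations of E_p, E_c, and the β-controlled split test follow [32].

```python
import numpy as np

def node_reconstruction_error(Y_node, alpha=0.99):
    """Energy-threshold subspace estimate and projection residual for one node
    (a sketch of one plausible reading of the criterion in Section 2.2.3)."""
    C = Y_node @ Y_node.T                           # (D, D) scatter matrix Y Y^T
    eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    energy = np.cumsum(eigvals) / eigvals.sum()
    d = int(np.searchsorted(energy, alpha)) + 1     # smallest d reaching the alpha energy level
    U = eigvecs[:, :d]                              # orthonormal subspace basis
    residual = Y_node - U @ (U.T @ Y_node)          # projection residual
    return np.linalg.norm(residual, "fro") / np.linalg.norm(Y_node, "fro")
```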

Data Acquisition and Description
To have a comprehensive comparison between the applied clustering algorithms, three real drill-core HSIs and a well-known benchmark (i.e., Trento) were used. This section is devoted to the data acquisition approach and the mineralogical information regarding the drill-core samples that were used. A SisuRock drill-core scanner was employed to capture HSIs of the drill-core samples. The scanner is equipped with an AisaFENIX VNIR-SWIR hyperspectral sensor, which covers the spectral range of 380-2500 nm in 450 bands. The obtained pixel size is 1.5 mm/pixel in this setting. Furthermore, the drill-core samples were analyzed visually by a trained geologist and segmented into domains based on the veins and the matrix composition of the cores. Three drill-core samples collected from different locations within a porphyry system were used to test the proposed method [7]. The first sample is composed of 27 × 190 pixels and is characterized by pervasive calcic-sodic alteration transitioning to potassic alteration. The matrix consists mostly of plagioclase feldspars, quartz, biotite, and chlorite. Two main vein types are present in this sample: quartz veins with a sulfide centerline and a low-intensity alteration halo consisting dominantly of white micas, and calcium sulfate veins with a sulfide centerline transitioning towards pure sulfide veins. The latter vein type shows a higher-intensity and wider alteration halo composed of white micas. The second sample is composed of 27 × 190 pixels; its rock matrix is dominantly composed of plagioclase feldspars, quartz, and chlorite, characteristic of the sodic-phyllic transition in the system. Three main vein types are hosted: a sulfide vein with a wide white mica alteration halo, cross-cut by a second vein consisting predominantly of quartz, calcium sulfate, and sulfides. Finally, fine sulfide veinlets with a nearly unobservable alteration halo are cross-cut by the two previously mentioned vein types. Sample No. 3 is composed of 28 × 189 pixels and is characterized by a pervasive potassic alteration; the matrix is composed dominantly of feldspars, biotite, and minor chlorite. Two main vein types are also present in this sample: veins composed dominantly of sulfides exhibiting a high-intensity white mica alteration halo, and quartz veins with a sulfide or sulfide and calcium sulfate centerline. Complex vein compositions can be observed locally where the two main vein types are reopened or cross-cutting.
Additionally, the Trento dataset was used as a benchmark to compare the performance of the different clustering algorithms. The Trento dataset was captured over a rural area south of the city of Trento, Italy, and is composed of 166 × 600 pixels. The HSI over the area was acquired by the AISA Eagle sensor and consists of 63 spectral bands ranging from 402.89 to 989.09 nm, with spatial and spectral resolutions of 1 m and 9.2 nm, respectively. The surveyed region of interest includes six classes: apple trees, buildings, ground, wood, vineyard, and roads.

Evaluation Metrics
In order to evaluate the performances of the applied clustering algorithms, three validation measures, chosen for their adequacy and described in [44], were used. Let us denote the ground-truth labels of the HSI as $G = [g_1, g_2, \dots, g_N]$ and the obtained clustering labels as $M = [m_1, m_2, \dots, m_N]$. The first measure is the clustering accuracy (CA):

$\mathrm{CA} = \max_{\pi \in \Pi} \frac{1}{N} \sum_{i} n_{i\pi(i)},$

where $\Pi$ is the set of all permutations of the cluster labels and $n_{ij}$ is the number of elements shared by cluster $M_i$ and class $G_j$ ($n_{ij} = |M_i \cap G_j|$). In order to compute $n_{ij}$, a matching function between clustering labels and ground-truth labels is applied. A more detailed elaboration on the applied matching function can be found in [31]. The second evaluation measure is the so-called F-measure [45]. The F-measure reports how well $G$ is described by $M$; its value varies between 0 and 1, and a value closer to 1 means that $G$ is well described by $M$. To calculate the F-measure of two partitions, two quantities are needed: the precision $p_{ij} = n_{ij}/|G_j|$ and the recall $r_{ij} = n_{ij}/|M_i|$. For the $ij$-th entry, the F-measure is $F_{ij} = \frac{2\, r_{ij}\, p_{ij}}{r_{ij} + p_{ij}}$, and the overall F-measure is given by:

$F = \sum_{j} \frac{|G_j|}{N} \max_{i} F_{ij}.$

The third metric is the adjusted Rand index (ARI), which is based on the Rand index [46]. The Rand index [47] was introduced as a measure for comparing the performances of different clustering algorithms. It is calculated from the contingency table (confusion matrix) of two partitions, in this case between $M$, the vector of clustering labels, and $G$, the vector of true labels; the information in the contingency table is thus used to measure the amount of agreement between the two partitions. In [46], Hubert et al. proposed the ARI, which is the Rand index corrected for chance:

$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{n_{i+}}{2} \sum_{j} \binom{n_{+j}}{2}\right] / \binom{N}{2}}{\frac{1}{2}\left[\sum_{i} \binom{n_{i+}}{2} + \sum_{j} \binom{n_{+j}}{2}\right] - \left[\sum_{i} \binom{n_{i+}}{2} \sum_{j} \binom{n_{+j}}{2}\right] / \binom{N}{2}},$

where $n_{i+} = \sum_{j} n_{ij}$ and $n_{+j} = \sum_{i} n_{ij}$. The range of the ARI is normally [0, 1]; an ARI closer to 1 means that $M$ and $G$ highly agree. However, the ARI can also take negative values, which indicates that the agreement between $M$ and $G$ is even lower than expected for a random result. For easier interpretation, we present the obtained ARIs as percentages; therefore, an ARI closer to 100% is better than an ARI close to 0%.
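As an illustration of how these metrics can be computed, the sketch below obtains a best-match clustering accuracy via a Hungarian assignment on the contingency table (one common realization of the matching function mentioned in [31], assumed here rather than taken from it) and uses scikit-learn's adjusted_rand_score for the ARI.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def evaluate_clustering(pred, gt):
    """Best-match clustering accuracy (Hungarian assignment on the contingency
    table n_ij) and adjusted Rand index for 1-D label arrays pred and gt."""
    clusters, classes = np.unique(pred), np.unique(gt)
    n = np.zeros((clusters.size, classes.size), dtype=int)     # contingency table n_ij
    for a, ci in enumerate(clusters):
        for b, gj in enumerate(classes):
            n[a, b] = np.sum((pred == ci) & (gt == gj))
    row, col = linear_sum_assignment(-n)                        # label permutation maximising matches
    ca = n[row, col].sum() / pred.size
    return ca, adjusted_rand_score(gt, pred)
```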

Quantitative Analyses on Clustering Results
In order to quantitatively evaluate the performance of K-means, FCM, ECC, SSC, SSC-OMP, ESC, and HESSC on the drill-core HSIs, reference maps of each sample were drawn by a geologist and resized to the size of the HSIs. The number of clusters for each sample was defined based on prior knowledge obtained from laboratory measurements and the geological interpretation by a geologist. Subsequently, the outputs of the different clustering algorithms were compared against each other using the above-mentioned evaluation metrics (Tables 1-3). In the experiments, for ECC, SSC, and SSC-OMP, the default parameters proposed by their developers in [15,31,33] were used. In addition, as described in Section 2.1, P, the number of representative samples, needs to be predefined for ESC; P was therefore set as shown in Figure 4, following [26]. As mentioned before, K-means, FCM, ECC, SSC, SSC-OMP, and ESC require a predefined number of clusters in order to work, whereas HESSC generates the number of clusters automatically. Therefore, the parameter β in HESSC was tuned by trial and error to achieve a number of clusters similar to the generated reference data. As for the number of clusters for each dataset, 7, 8, 4, and 6 clusters were defined for samples No. 1, No. 2, No. 3, and the Trento dataset, respectively.

The quantitative evaluation results are presented in Table 1. For sample No. 1, HESSC performed well in clustering the minerals: HESSC (CA = 53.12 ± 0.00) exhibited significantly better performance than its competitors. After HESSC, SSC-OMP (CA = 45.39 ± 0.83) performed well compared to other algorithms such as SSC and ESC. For the second sample, as shown in Table 2, HESSC (CA = 46.53 ± 0.00) outperformed the other competing algorithms. In terms of the F-measure for sample No. 2, ESC (F-measure = 26.26 ± 5.85) performed better than HESSC (F-measure = 22.23 ± 0.00). Additionally, ECC performed better than K-means by almost 3%, 1%, and 2% in terms of the CA, F-measure, and ARI measurements, respectively. With respect to the results in Table 3, HESSC (CA = 61.31 ± 0.58) outperformed ESC (CA = 50.03 ± 11.19). Overall, HESSC had the most accurate performance based on the different evaluation metrics, with nearly the lowest standard deviation over 10 trials for all three drill-core samples. As another observation, although ECC showed a performance similar to K-means for all samples, it obtained a lower standard deviation, which indicates that ECC is more stable than K-means. Among all of the applied clustering algorithms, FCM and SSC exhibited the weakest performances.

As shown in Table 4, HESSC clustered the Trento dataset more accurately than the other clustering algorithms (CA = 56.39 ± 0.00). The clustering maps of the three drill-core HSIs obtained by the different clustering algorithms are shown in Figure 5. For the Trento dataset, among the distance-based algorithms, FCM (CA = 53.88 ± 0.00) performed better than K-means and ECC. However, SSC-OMP (CA = 34.80 ± 0.00) and ESC (CA = 33.66 ± 0.02) exhibited weak performances on this clustering task. As shown in Figure 6, different classes, i.e., buildings, wood, and roads, were captured more accurately by HESSC than by the other clustering algorithms. However, HESSC was not able to differentiate between apple trees and wood.
The clustering map obtained by SSC-OMP shows its weak performance compared to the other algorithms for the Trento dataset, which may be due to the selected number of samples. It should be noted that, due to memory limitations, SSC could not be run on our computer for this dataset.
Table 1. Quantitative assessment of the performances of the applied clustering algorithms on the first drill-core sample. The evaluation metrics are reported as mean and standard deviation over 10 trials.
Table 2. Quantitative assessment of the performances of the applied clustering algorithms on the second drill-core sample. The evaluation metrics are reported as mean and standard deviation over 10 trials.
Table 3. Quantitative assessment of the performances of the clustering algorithms applied on the third drill-core sample. The evaluation metrics are reported as mean and standard deviation over 10 trials.

Execution Time
This subsection is devoted to evaluating the processing time of each clustering algorithm. The results with respect to the processing time are presented in Table 5. All the applied algorithms were implemented in MATLAB on a computer with an Intel Core i9-7900X CPU and 128 GB of RAM. As presented in Table 5, K-means was fast in processing all datasets. Nevertheless, in terms of CA, F-measure, and ARI, the performance of K-means was unstable due to its random initialization and, therefore, it is not as robust as the other algorithms. For the three drill-core samples, the original SSC led to the highest processing time, which shows that the algorithm is computationally expensive; the reason is that SSC uses ADMM to solve the sparse optimization problem. In addition, due to the memory limitation, SSC could not be applied to the Trento dataset. Although SSC-OMP and ESC were fast, their CA values show that their performances were weak compared to HESSC. This can be due to the number of samples (i.e., K and P) selected to solve their optimization problems; increasing P and K would make the clustering procedure slower. In HESSC, the time complexity of the ECC step is divided between the nodes, which reduces the execution time with respect to the original ECC. However, ECC still remains the most time-demanding step within the HESSC structure. Compared to HESSC, ESC was faster by almost 30 s for all drill-core samples. The faster performance of ESC can be attributed to its use of a small, user-defined subset of representative samples. Nevertheless, HESSC was faster than SSC for all three drill-core samples. In comparison to ECC, HESSC was faster for Sample No. 1 and Sample No. 2 by almost 12% and 22%, respectively. For the Trento dataset, HESSC was faster than SSC-OMP by almost 30 s.

Qualitative Analyses on Clustering Results
From the geological perspective, the clustering outputs shown in Figure 5 were inspected by a trained geologist. For most features of the three samples, K-means, FCM, and ECC exhibited similar performance. In the case of Sample No. 1, HESSC exhibited better performance than the other algorithms and allowed mapping the variation of the vein composition. However, for all the methods, the distribution of the sulfide appears to be slightly inaccurate. Additionally, HESSC was less sensitive to the small changes present in the sample matrix, which are considered negligible for exploration purposes. A downside is the missing biotite-rich areas in the matrix, which can indicate the transition towards the potassic zone of the deposit. While other methods, such as K-means, FCM, ECC, and ESC, mapped a variation in the matrix, they were not able to accurately distinguish the biotite-rich areas either. Similar behavior of the methods can be observed for Sample No. 2. HESSC, K-means, FCM, and ECC were able to pick up the compositional variation of the thicker sub-vertical vein, from sulfide-rich to quartz-rich vein fill. None of the methods was able to accurately pick up the fine sulfide veinlets, likely because their thickness is below the spatial resolution of the used sensor. With regard to the sub-horizontal sulfide vein, SSC, SSC-OMP, and ESC mapped it with the same label as the sub-vertical vein but, unlike the other methods, did not misclassify it as matrix. HESSC matched a part of this vein to the sub-vertical vein and the other part to the matrix. At the same time, all methods picked up very subtle changes in the alteration halo that were not considered in the generated reference map. The reason for both the vein and halo differences from the reference data can be assigned to the stronger impact of subtle compositional changes of the alteration halo on the spectral response than that of the quartz-sulfide ratio. In the case of Sample No. 3, the biotite-rich matrix was well mapped by HESSC in comparison to the other methods. When it comes to vein identification, difficulties were observed for all methods. Firstly, the very fine veins in the center of the matrix were missed due to the spatial resolution of the sensor. Additionally, the quartz vein with a sulfide centerline on the right-hand side was missed by all methods or misclassified as fine sulfide veins with a white mica alteration halo. The sulfide class as defined in the reference data appears to be only partly preserved in the map produced by HESSC; the class was extrapolated to the matrix by SSC, SSC-OMP, and ESC, and overestimated by K-means, FCM, and ECC. However, due to the resampling of the reference data, some of the sulfide markers were also lost there, which impacts the quantitative evaluation of each method.

Evaluation of the Automatically-Generated Number of Clusters
In order to evaluate the automatic generation of the number of clusters, HESSC was run with fixed parameters on the three drill-core samples. The obtained results were visually inspected by a geologist. Moreover, the performance of HESSC using the automatically generated number of clusters was compared with that of the other applied clustering algorithms; the number of clusters generated by HESSC was used as the input for K-means, FCM, SSC, SSC-OMP, and ESC as well. For the automatically generated clusters, a performance assessment similar to the previous results can be made. As can be seen in Figure 7, in the case of Sample No. 3, the same number of clusters was automatically generated, and therefore the method comparison is not repeated. In addition, six clusters were assigned to both Sample No. 1 and Sample No. 2 in this evaluation. The results for Sample No. 1 were relatively consistent with the ones obtained for a higher number of clusters; however, the sulfide class was missing and was replaced by the class generally interpreted as sulfide-quartz. A stronger variation was present in the matrix, where the biotite-rich class was picked up but slightly overestimated by K-means, FCM, ECC, SSC, and HESSC. In the case of Sample No. 2, the reduction of the number of clusters led to a loss of sensitivity to the compositional variation of the sub-vertical vein for HESSC and to confusion between vein and alteration halo for K-means. At the same time, the mapping of the alteration halo appeared to be less influenced by noise and consistent with the variations noticeable in the RGB image. In the case of K-means, FCM, and ECC, a loss of accuracy related to the alteration halo variation was observed. ESC and SSC exhibited performance similar to that obtained with the higher number of clusters presented in the previous subsection.

Sensitivity of Parameters in HESSC Algorithm
In this section, the sensitivity of each parameter in the HESSC algorithm is investigated. In HESSC, there are three main parameters (i.e., l, τ, and β) that influence the automatic generation procedure. As described in Section 2, l is a predefined parameter used to indicate the level (depth) of detail to which the algorithm shall proceed. In other words, l defines the upper bound of the number of clusters ($2^l$) that the algorithm can reach. In order to evaluate the sensitivity of the other parameters (τ, β), two different scenarios were defined, and the experimental results obtained on Sample No. 1 are presented in Tables 6 and 7. In the first scenario, β is fixed to 0.5. As a result, by increasing the τ value from 0.1 to 0.9, the number of clusters rises to 5, as shown in Table 6. Thus, τ can be described as a parameter that defines the amount of desired detail in HESSC; in order to obtain more detailed results, τ should be set to τ << 1. On the other hand, as illustrated in Table 7, the number of clusters decreases when the β value increases from 0 to 1, while τ = 0.5. Therefore, to obtain a higher number of clusters, it is necessary to have β < 0.5.
Table 6. Sensitivity of parameter τ, with l = 3 and β = 0.5, for evaluating the number of clusters automatically generated by HESSC.
Table 7. Sensitivity of parameter β, with l = 3 and τ = 0.5, for evaluating the number of clusters automatically generated by HESSC.

Conclusions
In this paper, we proposed a hierarchical clustering algorithm (HESSC) that analyzes HSIs in a robust and fast manner. A hierarchical approach was used to analyze HSIs at different levels of detail. The main processing task in each node of the resulting hierarchical structure consists of a sparse subspace-based binary clustering algorithm, in which the sparse representation matrix is calculated using a random sample to solve the sparse optimization problem in each node of the tree. In order to reduce the effect of randomness, an entropy-based ensemble algorithm is employed in HESSC. Moreover, HESSC is able to automatically generate the number of clusters by using a criterion that includes the reconstruction error values of each node. If the reconstruction error of the child node is lower than the reconstruction error of the parental node, the child node continues to split; otherwise, the child node does not split any further. The experimental results on three real drill-core HSIs as well as the Trento benchmark dataset support our findings regarding the performance of the proposed hierarchical sparse subspace-based clustering algorithm.
As a possible future work, the effect of incorporating the spatial information of HSIs will be investigated, since neighboring pixels are more likely to share a clustering label than pixels located further apart. Additionally, we will apply the proposed algorithm to different types of datasets, as well as larger ones, to further investigate the performance of HESSC.