Improving K-Nearest Neighbor Approaches for Density-Based Pixel Clustering in Hyperspectral Remote Sensing Images

: We investigated nearest-neighbor density-based clustering for hyperspectral image analysis. Four existing techniques were considered that rely on a K-nearest neighbor (KNN) graph to estimate local density and to propagate labels through algorithm-speciﬁc labeling decisions. We ﬁrst improved two of these techniques, a KNN variant of the density peaks clustering method DPC , and a weighted-mode variant of KNNCLUST , so the four methods use the same input KNN graph and only differ by their labeling rules. We propose two regularization schemes for hyperspectral image analysis: (i) a graph regularization based on mutual nearest neighbors (MNN) prior to clustering to improve cluster discovery in high dimensions; (ii) a spatial regularization to account for correlation between neighboring pixels. We demonstrate the relevance of the proposed methods on synthetic data and hyperspectral images, and show they achieve superior overall performances in most cases, outperforming the state-of-the-art methods by up to 20% in kappa index on real hyperspectral


Introduction
Clustering consists of automatically grouping data points (samples) having similar characteristics (features) without supervision (labels). It is a central problem across many fields such as remote sensing (e.g., to remove the need for expensive ground truth data), medicine (e.g., to identify meaningful features and trends among patients), genomics and content-based video indexing. As a kind of unsupervised learning paradigm, clustering is also increasingly relevant in the context of artificial intelligence, since the cost of labeled data is currently a major obstacle to the development of supervised methods [1]. Despite several decades of research and many advances, the ever-increasing amount, size and dimensionality of data that needs to be mined keeps challenging us with regard to computational efficiency, but also when it comes to identifying meaningful clusters in the data. This can indeed be difficult in high-dimensional data such as hyperspectral images due to the correlation/redundancy between spectral bands and uneven distribution of information across them.
However, clustering remains an ill-posed problem [21], simply because several legitimate solutions may exist for a given problem [22]. Many popular clustering methods (such as C-means) require that the user know the number of clusters in the data, which is constraining and often impossible. This is particularly true for centroid clustering, mixture resolving and spectral clustering in their baseline implementations. In contrast, hierarchical agglomerative clustering and density-based methods-including DBSCAN and BIRCH propagation and affinity propagation (AP), DPMM and convex clustering, can all estimate the number of clusters in the data automatically, although they have other parameters.
In this work, we consider that a good clustering approach should have the following characteristics: • Unsupervised: no labeled samples for training nor the actual number of clusters are available; • Nonparametric: no information about the clusters' characteristics (shape, size, density, dimensionality) is available; • Easy parametrization: the method only relies on a small number of parameters that are intuitive; • Deterministic: the clustering results do not depend on a random initialization step and are strictly reproducible.
These requirements disqualify a number of popular methods which require that the number of clusters be known (e.g., C-means) or which rely on a random initialization stage (e.g., fuzzy C-means). In the literature, few methods can satisfy all the above requirements. Among them, DBSCAN [6] is a density-based method requiring two parameters, i.e., the size of a neighborhood and the minimum number of samples lying in it. OPTICS [7] was developed to improve DBSCAN for the case of clusters with unbalanced densities. It has the same parameters as DBSCAN, and another distance threshold (the accessibility distance) which allows one to tackle the imbalance. DBSCAN and OPTICS can only be applied to datasets with limited numbers of dimensions [23].
The K-nearest neighbor (KNN) paradigm has been proven very useful to tackle clustering problems. It allows one to deal with the curse of dimensionality, and with clusters with different densities and non-convex shapes. In [24], a density-based clustering method named KNNCLUST was proposed. It iteratively clusters neighbors starting with each sample in its own cluster. In [25], we introduced KSEM, a stochastic extension of KNNCLUST that assigns labels based a posterior distribution of class labels of a given point learned from its nearest neighbors. In the same vein, many recent data clustering methods have been proposed to harness the advantages of the nearest neighbor paradigm. For instance, the fuzzy weighted KNN method in [26] was adapted from density peaks clustering (DPC) [11], much in the same way as the Fast Sparse Affinity Propagation (FSAP) [27] was adapted from AP [17].
Clustering approaches based on nearest neighbor graphs are very attractive when dealing with large datasets. Nevertheless, it is noteworthy that KNN search (i) relies on the distance/similarity metric used, which may have a counter-intuitive behavior in high-dimensional spaces [28,29]; (ii) has a quadratic complexity in the number of samples for an exhaustive search (brute force), though approximate but faster methods exist [30,31]. Moreover, these methods are able to work with just the pairwise distances between samples, i.e., without the data itself.
In this work, we propose several improvements and extensions to baseline nearest-neighbor density-based clustering algorithms. While we focus specifically on hyperspectral image analysis, the proposed framework is applicable to any type of data. The methods studied in the present work include MODESEEK [9,10], DPC [11], KNNCLUST [24] and GWENN [32]. We first consider a KNN variant of the DPC method and modify it to avoid using a decision graph and decision thresholds; then we modify the KNNCLUST method to incorporate ordered point-wise densities into the local labeling decision scheme to improve clustering accuracy with respect to the original algorithm. We then propose two types of clustering regularization specifically tailored for hyperspectral images. The first one is a KNN graph pruning method based on the mutual nearest neighbors paradigm. Since the notion of nearest neighbors loses meaningfulness with increasing dimensionality, the idea is to keep only strongly bonded, confident neighbors, to avoid the merging of close clusters. The second regularization scheme accounts for the specificity of image data and introduces spatial constraints in the local decision rules of each nearest-neighbor density-based algorithm. Note that these improvements can be applied to other algorithms based on the KNN concept.
Each method and its variants was modified to fairly and objectively compare them and to compare them against state-of-the-art methods. In particular, all the enhanced methods require only one parameter, namely, the number of nearest neighbors.
The paper is organized as follows. In Section 2, we provide the notation used. We then review the basic principles of the four density-based methods selected and motivate our choice from the related literature in Section 3. In Section 4 we propose modifications for two of these methods, DPC and KNNCLUST, in order to respectively facilitate implementation and improve clustering performance. In Section 5, we consider the application of these baseline methods to pixel clustering in HSIs, and introduce the two other improvements: the KNN graph regularization to facilitate cluster unveiling in high dimensions, and the introduction of a spatial context into the core labeling rules of each density-based method. Section 6 describes experiments to assess the improvements of the baseline methods and to compare their clustering properties on synthetic datasets. Then, Section 7 proposes an experimental study, using real datasets, of the four methods adapted to the context of hyperspectral images; this study includes a comparison with other state-of-the-art clustering methods. Lastly, we draw our conclusions in Section 8.

Notation
..,N be a set of samples, where x i ∈ R n is an element (sample) of the dataset. Let us define d as a distance metric between any two samples in X , and d ij = d(x i , x j ). Let K be the number of nearest neighbors (NNs); then N K (x i ) is the set of KNNs of For the time being K is assumed constant ∀x i . The directed KNN graph G = (X , N K (X ), {d(X , N K (X ))}) can be rewritten as a pair of N × K arrays (D, J) where D is the array of distances (in ascending order) between each sample and its KNNs, . . , j K i is the array of corresponding indices of each sample's KNNs. Like in [24] and [11], the methods selected in the present work rely on the computation of a local (point-wise) density function ρ(x i ) = ρ i , 1 ≤ i ≤ N. In practice, the density function is built from the set of pairwise distances between each sample x i and its KNNs. Many density functions can be specified in this manner, as shown below.
Each clustering algorithm outputs a vector of cluster labels c = [c 1 . . . c N ] T with c i ∈ C = {1 . . . C}-C is the number of output clusters-and the set of cluster representative samples E = {e c } 1≤c≤C ⊂ X , where e c ∈ X is called the exemplar of cluster c ∈ C, in reference to the terminology used in [17].

MODESEEK
In its original implementation [10], MODESEEK first estimates the local density at each sample based on its KNNs, and selects the neighbor with the highest local density. The label of that neighbor is then given to the sample and this is done iteratively until no more label changing occurs. One important property of this algorithm is that samples having themselves as neighbors (in the above sense) are reported as cluster exemplars. MODESEEK is a very simple, fast and quite attractive clustering method, which does not require the user to specify the number of clusters. MODESEEK is publicly available within the PRTools MATLAB package (prtools.org).Note that a similar, but recursive approach, called KNN graph clustering (KGC) was proposed in [33].

DENSITY PEAKS CLUSTERING-DPC
More recently, the density peaks clustering (DPC) method proposed by Rodriguez and Laio [11] has received much attention from the data science community owing to its attractive features: it is non iterative and deterministic, and it can estimate the number of clusters robustly based on two intuitive parameters: the local density ρ and δ, defined as the minimum distance between x i and any other sample of higher density.
In [26], the authors introduced the FKNN-DPC method, suggesting to replace the original assignment strategy on the basis of KNNs and fuzzy weighted KNNs. In [34], the authors also reported the difficulty of DPC when it came to performing correctly with clusters that have various densities. Another weakness of DPC is that it is essentially based on thresholding the ρ vs. δ decision graph, in order to determine the optimal clustering solution. To the best of our knowledge, there is no closed-form solution for finding the optimal thresholding, and most existing methods are based on some kind of gap statistic, which is only viable if the data are relatively noise-free. This is particularly challenging when the data contains unbalanced clusters [35,36]. Recently, in [37], it was suggested to incorporate the neighborhood information into DPC via shared nearest neighbors (SNN). However, this method also necessitates building a decision graph to identify the cluster centers.

GRAPH WATERSHED USING NEAREST NEIGHBORS-GWENN
In 2016, we introduced a new KNN-based clustering technique named GWENN [32], in reference to the watershed approach extensively used in image segmentation. Though less known, the watershed principle was first proposed by Vincent and Soille in relation with morphological operators on graphs [38], and later with diffusion processes on weighted graphs [39].
In addition to being deterministic, GWENN is also a non iterative procedure. However, unlike DPC and most of its KNN variants, GWENN can provide an estimate of the number of clusters without thresholding parameter. The main idea of GWENN is to progressively assign labels to samples, starting from samples of highest densities. Therefore, after sorting samples by decreasing density, a given sample takes the most common (mode) label among its KNNs which are already labeled. However, the original method can lead to local labeling decisions which are not consistent and can bias the final result: it suffices that, among one sample's NNs, two or more labels appear with the same highest number of occurrences to yield an ambiguous labeling decision state. To avoid this problem, we improved GWENN in [40] to allow labeling disambiguation. More precisely, we proposed to weigh the count of each label found among the set of nearest neighbors by the local densities of the latter. The weighted mode of one sample x i is then the label which, among those of its neighbors, has the maximum weighted sum of local densities: where Q i is defined as the subset of x i 's NNs which have been previously processed, i.e., Q i = P i ∩ N K (x i ), with P i the set of previously processed samples. In the following, we shall refer to this modified method as GWENN-WM, which will be one of our baseline methods in the following. The reader can find the algorithm in [41], though GWENN-WM was applied to the problem of hyperspectral band selection. Notice that a second labeling pass is set up in GWENN-WM which is aimed at removing weakly populated clusters found during the main pass. During the latter, the weighted mode is computed over Q i as defined above; during the second pass the whole set of x i 's NNs, i.e., j i , is considered.

KNNCLUST
In 2006, Tran et al. introduced KNNCLUST [24], an iterative version of the supervised KNN algorithm. In KNNCLUST, all the samples of a dataset are assigned a unique label at initialization. Then the algorithm successively relabels each sample based on a KNN-kernel Bayes decision rule; i.e., the class conditional probability p(x i |c) is replaced by max c ∈C p(x i |c ) ∀x i ∈ X . By doing so, KNNCLUST forms clusters in order to maximize a total class-conditional density function for all samples. KNNCLUST has also the advantage with respect to other popular clustering techniques of automatically estimating the number of clusters. Similarly to other clustering methods, KNNCLUST is iterative; it is stopped when no more label changes occur. Its main drawback is the computational load of the first iterations, due to the high number of putative labels to handle.
Note that the label assignment strategies of GWENN and KNNCLUST are similar to each other, since a label is assigned to each sample x i , by taking into account the labels of its KNNs. This is why we also proposed to apply the weighted mode procedure described above to KNNCLUST in [40]. The resulting algorithm, named KNNCLUST-WM, will also serve as a baseline method in the following.

Implementation Choices
Besides the parameter K used to construct the NN graph, two choices are crucial for density-based methods in general: the distance d and the density function ρ.
In terms of the former, several distances have been proposed to best apprehend the data structure, from the Minkowski L p norm with p = 1, p = 2 or even p < 1 for high dimensional datasets [42], to data-driven distances (e.g., Mahalanobis distance [43], geodesic distance [44,45] or kernel-based distance [46]). In essence, the nearest-neighbor density-based methods discussed in the present work are flexible with regard to this choice and can account for specific distance metrics. All the experiments described below involve the Euclidean norm.
The issue of finding an appropriate density model is more important and deserves attention. There is a vast amount of literature in the field of local density estimation published since the mid-twentieth and the advent of kernel estimators [47]. In the present context, density estimation is necessary to evaluate the local influence, on a given sample x i , of its KNNs. Examples of such local density estimators used in the literature on nearest-neighbor density-based methods are given in Table 1. In order to avoid introducing additional parameters to the studied methods, we will focus only on parameter-free estimators. Thus, in the present work, the experiments have been conducted by using the distance to the farthest neighbor (i.e., of index K), to estimate the local density. More precisely, the local density in the neighborhood of sample x i is defined as the inverse of the distance of x i to its K-th NN: ( Note that in the following, we did not investigate the question of optimizing K, which still deserves study. Instead, we aimed to compare the behaviors of nearest-neighbor clustering methods using K as a unique parameter, once the metric and density model were chosen. Table 1. Some local density estimates ρ i based on KNNs.

Density Estimate References
Non-parametric model Parametric model

Improvement of KNN-DPC: M-KNN-DPC
DPC [11] is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from other points with higher densities. However, the original DPC method has two drawbacks; i.e., it requires the availability of the full pairwise distance matrix between samples to compute the cut-off distance, and it is not very clear how the cluster centers can be detected in a decision graph based on the densities {ρ i } and the distances {δ i = min j: The improved DPC method described here attempts to alleviate these two difficulties. A preliminary version of this method was presented in [40]. With regard to the first issue, we propose, following [26,34,48], to replace the full pairwise distance matrix with the KNN graph structure. This is motivated by the fact that the local density ρ(x i ) can be estimated only from N K (x i ), and does not need to depend on the full pairwise distance matrix; limiting the "scope" of each sample to its KNNs is therefore expected to better capture the structure of the data under study [34], while still allowing the use of different kernel functions for density estimation. Moreover, the reduction in storage can be significant for large datasets, decreasing The second issue related to the decision graph (or more generally the decision rule that one sample is or is not a cluster exemplar), can be easily by-passed owing to a simple observation. Actually, the first stage of DPC consists of finding the unique neighbor of each sample x i having the minimum distance among its neighbors of higher density: The specific neighbor NN(x i ) was recently coined as BigBrother(x i ) in [48]. Its selection is common to most variants of DPC. For KNN-based DPC methods, samples which do not have such neighbors are their own NNs in this respect, and are consequently the cluster exemplars sought. Thus, the idea of our improvement is very simple: once the set {NN(x i )}, x i ∈ X is obtained, a hill-climbing of each point to its NN is applied until no more climb is possible, thereby revealing its cluster exemplar. The difference with the clustering algorithm in [48] relies on the fact that the number of clusters C is not specified by the user, and therefore it does not need to explicitly sort the set of {δ(x i ).ρ(x i )}, x i ∈ X to find the C cluster exemplars. Additionally, the estimation of a cutoff distance as in the original method is no longer required, and the parametrization is reduced to K.
These modifications yield a new method, called M-KNN-DPC in the following, which can be compared with the three other density-based methods strictly under the same conditions. The pseudo-code of M-KNN-DPC is given in Algorithm 1. Compared to similar KNN DPC methods which are non iterative, the iterative feature of M-KNN-DPC is in fact not too critical since convergence is generally ensured within a few iterations, similarly to MODESEEK. In fact, as pointed out in [40], the above simplification of the labeling rule actually makes M-KNN-DPC very similar to MODESEEK: both methods require seeking a unique NN for each sample and proceed by iterative hill-climbing. However, in MODESEEK, the retained NN of an sample is, among its KNNs, the one with the highest density, whereas in M-KNN-DPC, the retained NN is the closest in distance among all NNs of higher density. These rules are illustrated in Figure 1, showing the differences in local labeling decisions provided by the two methods. Moreover, both methods provide the same set of exemplars by construction, since samples which have themselves as retained NNs do satisfy the rules for both methods. Nevertheless, clustering labels for samples with mid-to low-density values are not guaranteed to coincide. (1) Compute D, the N × K array of distances (in ascending order) between each sample and its KNNs. MODESEEK retains, among the current sample x i and its K nearest neighbors, the one with the highest density (here the fourth NN), whereas M-KNN-DPC retains the closest with higher density than the current sample (here the second NN).

Improvement of KNNCLUST-WM: M-KNNCLUST-WM
In the original implementation of KNNCLUST, samples can be processed in any order. In this work, we suggest to use a specific ordering scheme to apply the weighted mode labeling decision rule within KNNCLUST-WM. More precisely, we propose to process the samples by order of decreasing local density, similarly to GWENN-WM. This scheme, applied at each iteration of the algorithm, makes local labeling decisions more robust within denser regions, while postponing the most uncertain decisions, corresponding to lower local density values.
We call this new adaptation M-KNNCLUST-WM, standing for modified KNNCLUST-WM. The details of M-KNNCLUST-WM are given in Algorithm 2. In comparison with KNNCLUST-WM, the samples are processed in turn following the array of indices i, as indicated in Step 4. Recall that KNNCLUST-WM uses the same decision rule in Equation (1) to label each sample x i , except that the whole set of its NNs is considered during the iterations, not just the previously labeled ones as is done in GWENN-WM. Steps (1) to (3) are identical to Algorithm 1; ; end for end while (7) Find E as the set of unique labels in c; (8) Remap cluster labels in [1 . . . C], C = |E |;

Discussion
With the above improvements, we have a benchmark of four baseline clustering techniques, namely, MODESEEK, M-KNN-DPC, GWENN-WM and M-KNNCLUST-WM. It is important to recall that all the methods take the same data as input, i.e., the KNN graph. Therefore, once the distance metric and the way the local densities are estimated are chosen, the only parameter required to run them is the number of NNs K. Table 2 summarizes the properties of these methods. They differ by their initialization stage, their stopping rule and most importantly their internal decision rule. As said above, MODESEEK, M-KNN-DPC and M-KNNCLUST-WM are iterative methods for which the stopping rule is implemented as the criterion of convergence to a stable partitioning. The convergence is significantly slower for M-KNNCLUST-WM than for MODESEEK or M-KNN-DPC, due to the number of distinct labels to process at initialization.
Additionally, one can see that the decision rules differ between subgroups of methods, namely, MODESEEK and M-KNN-DPC on the one hand, and M-KNNCLUST-WM, and GWENN-WM on the other hand. This is expected to produce significantly different clustering results from one subgroup to the other.
In addition to the KNN search and the density estimation which is common to all these methods, the computational complexity of their proper core algorithms can be evaluated.

Improvements of Nearest-Neighbor Density-Based Methods for HSI Pixel Clustering
Hyperspectral image analysis has gained popularity in the remote sensing community during the past decade, owing to the richness of information such images convey and the applications it allows one to tackle. Among those, precision agriculture, forestry, natural habitats, coastal shore surveillance and environmental urban planning now commonly use hyperspectral images to help with decision making.
Due to the recent release of advanced sensors producing images with ever-increasing spatial and spectral sizes, clustering remote sensing HSIs at the pixel level has become a challenging task. However, it remains necessary for the end-user to obtain classification maps within reasonable time, without requiring one to fine tune a lot of parameters.
With the objective of tackling the problem of HSI pixel partitioning, we propose two improvements to the methods described above.
The first one is related to the high dimensionality of the spectral features carried in HSIs, which can reach several hundreds. As of the high correlation between spectral features within image pixels, hyperspectral data vectors often live in a low-dimensional subspace of the original representation space, which has led researchers to consider new clustering approaches such as sparse subspace clustering [20]. Considering the nearest-neighbor density-based methods described above, and to allow their use for high-dimensional data, we propose recourse to KNN graph regularization prior to the clustering stage.
The second improvement is related to the specificity of image data and the high level of correlation observed between adjacent pixels. To account for this, we propose to modify each of the baseline density-based methods by incorporating a spatial context in its proper labeling decision rule. We discuss these modifications below.

KNN Graph Regularization
Clustering high dimensional data is a difficult problem in general, and specifically for density-based methods. Here, we seek a solution through KNN graph regularization. The proposed approach, which is applied prior to clustering itself, is taken from [52]. The idea is to optimize the KNN graph, by attempting to preserve only the set of relevant NNs of each sample, and to remove less relevant edges of the graph. Considering an initial directed KNN graph, the approach consists of removing the edges which are not symmetric, so as to transform it into an undirected graph. More precisely, any edge (x i , x j ) of the directed graph G is retained iff x i ∈ N K (x j ) ∧ x j ∈ N K (x i ) . By doing so, x i and x j are nearest neighbors to each other. In this way, the graph does not longer have constant K outdegree: samples which are "popular" nearest neighbors among samples, called hubs [29], have larger outdegrees than samples located on the external borders of clusters. This approach is referred to as the mutual nearest neighbor (MNN) graph modification in the following. Figure 2 illustrates on a simple 2D example (N = 300, 3 clusters) the effect of applying the MNN graph modification on the initial NN graph. In this example, we set K = 11. One can observe that the set of resulting MNN edges no longer connects the three main data clusters, which makes it easier to identify the latter with density-based methods. However, one drawback of this approach is that it creates small or even single-sample clusters, mainly around the edges of the main clusters.
Concerning implementation, the MNN procedure proceeds by first computing a N × N sparse binary adjacency matrix from the indices of each sample's NNs, and computes the logical AND of this matrix with its transpose, thereby retaining only symmetric edges. The result is then recombined into a new NN graph with distance array D MNN . Note that applying MNN naturally associates with each x i a set of K i ≤ K NNs. Several other graph modification methods exist [52,53], but from our experience [54], the MNN graph modification provides better results than the natural nearest neighbor (3N) approach of Zou and Zhu [55].
Once the number K i of modified neighbors is obtained for each x i by the MNN procedure, the local density in Equation (2) can be rewritten accordingly as: We underline that the number of NNs K of the initial graph remains a key parameter of the clustering methods and must be selected carefully.

Spatial Regularization
In the case of (multi-component) digital images, pixels are considered as samples, and the feature space corresponds to the spectral components. Due to the high correlation between neighboring pixels in images, it can be useful to strengthen their labeling at the local level in the clustering procedure. To achieve this with the four methods described above, we propose to enlarge the set of neighbors of each pixel, by including the set of its spatial neighbors in addition to its NNs in the spectral domain.
Due to the difference in their fundamental principles, the spatial contextualization technique was applied differently to GWENN-WM and M-KNNCLUST-WM on the one hand, and MODESEEK and M-KNN-DPC on the other hand.
Concerning GWENN-WM, let us consider Equation (1), where Q i is the set of previously processed NNs of the current sample x i to label. By applying the spatial regularization, the intersection of previously processed samples with the union of the current sample's NNs and the set of its spatial neighbors V(x i ). This rule reinforces the coherence of local labels at little computational expense, provided the spatial context is limited. Concerning M-KNNCLUST-WM, the same scheme can be applied by replacing the set of indices j i m in the core loop of Algorithm 2 by its union with the spatial neighbors of sample i m .
Concerning M-KNN-DPC, the NN search in Equation (3) is modified as follows: The modification is similar for MODESEEK due to the closeness of the NN search: In order to account for spatial coherence, we will limit the spatial context to a 4-neighborhood in our experiments for each of the four methods; of course the procedure can easily extend to larger spatial neighborhoods. This issue was not investigated in the present work.

Clustering Experiments with the Baseline Methods on Synthetic Datasets
In order to verify the quality of clustering results with the improved baseline methods described in Section 4, we set up a few experiments with synthetic datasets (https://cs.uef.fi/sipu/datasets/). After presenting the clustering performance indices used in the present work, in the first experiment, we wished to assess the improvement of M-KNNCLUST-WM over KNNCLUST-WM. In the second one, we compared the behavior of the four baseline methods on more challenging clustering problems.

Cluster Validation Indices
All the datasets used in our work include a ground truth, which enables one to compute external classification accuracy indices. Some of these indices are derived from a confusion matrix obtained after optimal pairing of cluster labels with the actual ground truth classes. The cluster-to-class mapping is based on the Munkres assignment algorithm [56]. Once the optimal assignment is obtained, a new confusion matrix is obtained which is close to diagonal. From there, standard classification indices can be derived: the overall accuracy (OA, i.e., the normalized trace of the confusion matrix), the average accuracy (AA, i.e., the average of per-class accuracies) [57] and Cohen's kappa index of agreement (κ) [58]. These indices are widely used in the remote sensing community. Other indices are still available, like the adjusted Rand index (ARI) [59], purity and normalized mutual information (NMI) [60] and the centroid similarity index (CSI) [61]. In this work, we also used an internal clustering performance index named the consistency violation ratio (CVR) [62]. The CVR is defined asĤ(c|X )/Ĥ(c), where the denominator is the entropy of the clustering result c (obtained with the standard plugin entropy estimate) and the numerator is the entropy of c conditional on the data observation X . In the present work, the estimate of the latter follows Equation (5) in [62]. By doing this, the computed CVR indices are positive, lower values meaning better clustering.

Comparison of KNNCLUST Variants
In this experiment, we wished to verify the improvement of M-KNNCLUST-WM over the original KNNCLUST, its weighted mode variant KNNCLUST-WM and M-KNNCLUST, which is the original KNNCLUST algorithm modified according to Section 4.2. The objective was to compare their results in terms of AA as a function of the number of NNs K and to show their ability to provide consistent clustering results, in terms of identification of the correct number of clusters and in clustering accuracy. For this we used the S4 dataset [63] (see Figure 3), which has 5000 samples randomly distributed issued from 15 overlapping Gaussian 2D distributions with various means and covariance matrices. Both features were first normalized in the [0, 1] range. Figure 4 reports the number of clusters found as a function of the number of NNs K, and the average accuracy (AA) also versus K. Here, KNNCLUST and KNNCLUST-WM were run 20 times, each time using a different random ordering of the samples, chosen during the initialization stage and kept unchanged during the iterations. Boxplots are displayed on the figures to characterize the distributions of the resulting NC and AA for the latter methods. In contrast, M-KNNCLUST and M-KNNCLUST-WM sort the samples by decreasing local density at the start, so that the results are invariant to the order the samples are processed. Apart from relatively stable results across the clustering results within a large range of NNs K, it was observed that (i) M-KNNCLUST and M-KNNCLUST-WM are able to approach the correct number of clusters with lower K than KNNCLUST and KNNCLUST-WM; (ii) relatedly, M-KNNCLUST and M-KNNCLUST-WM provide higher classification accuracies than KNNCLUST and KNNCLUST-WM on average for low K; (iii) conversely, for higher values of K, M-KNNCLUST and M-KNNCLUST-WM seem to depart from the correct number of clusters faster than the original method on average (for K > 180). This is due to the fact that a smaller number of clusters was found in this situation. Indeed, primarily accounting for too-distant NNs to label points with high density is likely to result in biased labeling decisions and the merging of samples issued from close clusters in a single larger one. However, since our objective is to achieve the best clustering performances at reduced algorithm complexity, this has no practical relevance. In this sense, the above example seems to indicate that M-KNNCLUST and M-KNNCLUST-WM must be preferred over the original method. Figure 4 also shows the improvement of the weighted mode rule (KNNCLUST-WM and M-KNNCLUST-WM) against the classical mode decision rule (KNNCLUST and M-KNNCLUST). Similar conclusions were obtained after analyzing the S1, S2 and S3 datasets (available on the same website), which are much easier to partition due to smaller overlaps between the generating Gaussian distributions.

Comparison of the Baseline Clustering Methods
To illustrate the behavior of the four baseline clustering methods described above, we have used the synthetic datasets Worms-2D, with n = 2, and Worms-64D with n = 64 [48]. We have selected these datasets because they exhibit curve-shaped clusters with a large amount of overlap or even non-separable data, which represents a very challenging clustering task. Figure 3 displays the Worms-2D dataset, with N = 105,600 points distributed in C true = 35 clusters. Figure 5 shows the evolution of the average accuracy (AA) and the number of clusters C as a function of K. The AA values are lower than in the previous example due to the difficulty to cluster this dataset. Still, a tiny plateau of C can be observed for K ∈ [550, 650], giving a number of clusters slightly superior to the actual one. From these figures, one can observe close behaviors between M-KNNCLUST-WM and GWENN-WM on one hand, and between MODESEEK and M-KNN-DPC on the other hand. M-KNNCLUST-WM and GWENN-WM provide a few less clusters than MODESEEK and M-KNN-DPC, thereby yielding higher performance for lower K. For K > 700, M-KNN-DPC provides slightly higher AA values.
The results for the Worms-64D dataset, shown in Figure 6, show that the AA is stable for a very wide range of K (from 20 to 1000 for M-KNNCLUST-WM and GWENN-WM). Over this range, the correct number of 25 clusters was identified. Here again, M-KNNCLUST-WM and GWENN-WM outperformed the two other methods for all K. It is interesting to notice that MODESEEK outperformed M-KNN-DPC for a large range of K. This illustrates the fact that, for high dimensional data, the choice of the closest NN with higher density can be less relevant than the one of the NN with highest density used in MODESEEK. Figure 6 also displays the computation times of the four methods as a function of K; the time necessary to compute the KNN graph is not taken into account in this figure. MODESEEK and M-KNN-DPC were the fastest methods, with times increasing linearly with K; in contrast, M-KNNCLUST-WM and GWENN-WM were slower due to the weighted mode procedure, but GWENN-WM was approximately four times faster than M-KNNCLUST-WM.

Application to Hyperspectral Remote Sensing Images
In this section, we apply and compare the proposed baseline methods and their HSI-adapted variants to pixel clustering in HSIs. As previously mentioned, this task is challenging due to the high number of pixels to handle and the high dimensionality of data representation. All the HSIs considered here are accompanied with ground truth (GT) maps used for clustering assessment. In the first two experiments, the clustering methods are applied over the whole image support, comprising the pixels which are not labeled in the ground truth. Therefore, the number of final clusters generally exceeds by a few units the actual number of clusters specified in the GT. The section ends up with a study of the clustering results obtained using only the ground truth pixels of three other HSIs.

AVIRIS-Salinas Hyperspectral Image
This image (https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes) was acquired by the AVIRIS sensor in 1998 over agricultural fields in the Salinas Valley, California [64]. It has 512 × 217 pixels and 204 spectral bands after absorption and noisy band removal. The ground truth comprises 16 vegetation classes. Figure 7 shows a color composite image, and the associated ground truth. In this experiment, the spectral bands were first individually normalized to zero mean and unit variance before computing KNN graphs based on the Euclidean distance in the representation space (n = 204). The comparison of the proposed approaches was extended to state-of-the-art methods: FCM, DBSCAN and a KNN version of the fast sparse affinity propagation method (FSAP) proposed by Jia et al. [27]. For DBSCAN, we empirically sought the values of MinPts and Eps providing the best clustering results. For FCM, the number of output clusters was imposed, the fuzziness parameter was set to 2 and we performed 20 runs of the method with random initial labeling. For FSAP, which requires a sequence of two runs of sparse affinity propagation, the preference parameters were selected so as to optimize the external clustering indices.  Figure 8 depicts the results of these state-of-the-art methods and the proposed approaches regarding four clustering performance criteria, namely, the AA, the kappa index κ, the adjusted rand index (ARI) and the CVR index, as functions of the number of clusters provided by the methods. The proposed clustering methods include the set of four baseline methods (MODESEEK, M-KNN-DPC, GWENN-WM, M-KNNCLUST-WM), their respective improvements with the MNN graph regularization (denoted by a -MNN suffix thereafter) and the "spatial" variants of the latter (denoted with a -SP suffix). For each method but FCM (where the free parameter is the number of sought clusters C), K was varied in the range [300, 1200] by steps of 100, thereby providing as many clustering results. For FCM, boxplots are plotted for C ∈ {14, . . . , 20} to observe the average and dispersion of the performance indices for the 20 runs of the method. Typically, for all the methods and DBSCAN and FSAP, the number of final clusters decreases with K. Figure 9 shows the clustering maps obtained from the analysis of this HSI with the baseline density-based methods and their variants, and FCM, FSAP and DBSCAN. For each method, the best clustering result in terms of kappa index is displayed, so that these maps correspond to the highest values of kappa shown in Figure 8. Overall accuracy (OA), average accuracy (AA) and kappa indices are also given for each map.
From this experiment, several observations can be drawn: • On average, all the nearest-neighbor density-based methods perform better than DBSCAN, FSAP and FCM whatever the number of clusters, found or imposed, and especially regarding the external performance indices. Exceptions to this observation are discussed below.

•
Performing MNN graph regularization clearly helps with improving the results for all the methods, but at the cost of creating a higher number of clusters. These are consequences of graph pruning, which tends to create low-density clusters at the border of main clusters, while strengthening the density within the latter, thereby providing more robust results on average. Among the state-of-the-art methods used for comparison, FSAP provides the best results, at least visually, with a less noisy clustering map. However, there is some amount of confusion between the output clusters and the GT map which lowers the performance indices.

•
The classes C8 (grapes untrained) and C15 (vineyard untrained) could not be retrieved by any method. Several published papers already show the difficulty to separate these two classes [65]. However, the class C7 (celery) was split into two clusters with visual coherence by all the density-based methods with MNN graph modification, which confirms the usefulness of this approach for detecting close clusters. This additional cluster disappears with the further use of spatial context which forces the links between those two clusters so as to merge in a single one.

•
The only exception regarding the good performances of the proposed methods is with the spatial variant of M-KNN-DPC-MNN, which we recall only differs from MODESEEK by the way the NN of higher local density is selected (see Figure 1). Actually, including spatial neighbors in addition to "spectral" neighbors in this method is likely to drive the NN selection to the spectrally closest spatial neighbors, thereby separating compact clusters into subclusters while achieving the same final number of clusters as MODESEEK, as said above, hence dramatically reducing the clustering performance. In comparison, the NN selection rule set up in MODESEEK is more robust to some extent.

AVIRIS-Hekla Hyperspectral Image
This HSI was acquired in 1991 by AVIRIS over the Hekla volcano in Iceland. The image has 560 × 600 pixels (approximately thrice the number of pixels of the previous image) and 157 spectral bands [66]. Here, the original image was preprocessed with the minimum noise fraction algorithm (MNF) [67], and 10 principal components were retained for clustering. Figure 10 shows a color image of the first three MNF components, and the ground truth used for assessment and comprising 12 land-cover classes. The KNN graphs were computed for K in the range [400,2000] by steps of 100, and the proposed methods were applied and compared similarly as in the above example. For this image, it was not possible to find appropriate parameters to run DBSCAN and FSAP, and only FCM results were computed for C ∈ {10, . . . , 30}. Figure 11 shows the corresponding performance results. Though the data preprocessing stage was different from the previous example, the conclusions drawn from these results remained the same and confirmed the superiority of the nearest-neighbor density-based methods with respect to FCM. The improvement brought by both symmetrizing the initial KNN graph and introducing the spatial context in the density-based methods is also significant with this HSI, except again for

Other Hyperspectral Images
A last experiment was set up with a set of three other HSIs:  [53].
In this experiment, clustering was applied exclusively on the labeled pixels of the GT map. By doing so, it was not possible to apply the spatial context versions of the nearest-neighbor density-based methods, so we limited the study to the original methods and their MNN versions. Therefore, the KNN graphs were calculated from those labeled pixels. FCM (with varying C) and DBSCAN (with optimized parameters) were also used for comparison. No prior data normalization was applied the three HSIs. Table 3 shows the results corresponding to the highest kappa values obtained with the various approaches after varying K within a large range, specific to each HSI. Several performance indices are reported, namely, AA, OA, κ, purity, NMI and CVR. The latter three are useful to assess the consistency of the results; indeed, high values of OA, AA or kappa after cluster assignment are not sufficient to measure high quality clustering, especially in the case where the number of GT classes is greater than the number of found clusters. Again, the nearest-neighbor density-based methods show superior performance to the popular FCM in all cases. The situation is less clear for DBSCAN which seems to provide better accuracy indices for the KSC and Massey University HSIs. Actually, for the KSC dataset, the clustering results for DBSCAN were calculated only over 25% of the number of pixels (so-called "core points"), the remainder being classified as outliers by DBSCAN. For the Massey University HSI, the number of outliers provided by DBSCAN exceeded 96% of the number of pixels. Thus, the best OA results provided by DBSCAN must be considered with caution. However, the high CVR indices given by DBSCAN for these two HSIs can help with detecting this kind of anomalous situation.
Many of these best kappa results come with an underestimation by a few units of the actual number of classes present in the GT. However, on average the number of clusters is better approximated by applying the MNN graph regularization prior to density-based clustering. This was particularly observed for the Massey University HSI which has roughly twice the dimensionality of the other ones.
To some extent, these results on various hyperspectral datasets are in agreement with the preceding observations made in the above experiments, as shown by the good performances of the proposed approaches versus state-of-the-art clustering methods, and the improvement brought by the MNN graph pruning procedure.

Conclusions
In this paper, we have proposed several improvements to existing nearest-neighbor density-based clustering methods, and considered their application to pixel clustering in hyperspectral images. These methods use a KNN graph to estimate local (point-wise) density values which are then used to propagate labeling decisions specific to each method.
Our contributions are twofold. First, we improved two of these methods, namely, KNNCLUST-WM and a KNN variant of DPC. The former method was modified to account for the local density as prior information to foster the labeling decisions corresponding to samples by decreasing order of their density. The latter one was completely revisited to avoid the use of a decision graph and threshold, and we showed that the resulting algorithm is very close to MODESEEK. Second, we proposed using two additional procedures to help with tackling the problem of pixel partitioning in hyperspectral images. The first relies on a KNN graph pruning method based on mutual nearest neighbors (MNN), which can facilitate the discovery of clusters in high dimensional datasets. This procedure is applied prior to the baseline clustering methods. The second one aims to introduce a spatial context in the core of each clustering method, in order to account for the correlation between adjacent pixels in hyperspectral images.
The improved KNNCLUST-WM method and the set of baseline nearest-neighbor density-based methods MODESEEK, M-KNN-DPC, GWENN-WM and M-KNNCLUST-WM were then tested and compared on synthetic datasets, showing their ability to adequately cluster points issued from highly overlapping conditional densities in low and high dimensions. The enhanced methods, using the MNN, and then combining MNN and spatial context, were then applied to a benchmark of real hyperspectral images, and compared with state-of-the-art methods, namely, FCM, DBSCAN and FSAP. The experimental results show that the proposed improvements of the baseline nearest-neighbor density-based methods increase the clustering performance in most cases in terms of standard classification accuracy and other clustering indices. In particular, the graph and spatial improvements of GWENN and KNNCLUST provide the best results on average, outperforming the state-of-the-art methods by more than 20% in optimal kappa index for varying K on real hyperspectral images; while the results of these methods are close to each other, GWENN is much faster than KNNCLUST due to its non-iterative feature. Additionally, M-KNN-DPC generally performs better than MODESEEK in the baseline and MNN improved versions; this is, however, no longer true for their spatial versions, due to the difference pointed out between their respective core labeling rules, and MODESEEK should then be preferred to M-KNN-DPC. From a more general viewpoint, unlike other methods like DBSCAN and FSAP which are sensitive to parameter tuning, the proposed approaches require only one parameter, i.e., the number of nearest neighbors K.
In conclusion, owing to their attractive features, the nearest-neighbor density-based methods investigated in this paper do constitute a viable alternative to traditional clustering methods in the context of hyperspectral images: they are able to closely predict the number of clusters present in a dataset with minimal parametrization; they are strictly deterministic and therefore do not rely on a random initialization stage.
Although only HSIs were considered here, nearest-neighbor density-based methods can also be applied to other remote sensing image data, such as very high resolution multispectral images. To this end, the proposed improvements can be implemented in a multi-resolution framework, following [32,40].
However, many issues still remain, the most important of which is the automatic tuning of K. The choices of the best distance metrics among high-dimensional samples and the best local density model are also critical design choices. These aspects are currently under investigation.
Author Contributions: C.C. conceived of the paper, designed the experiments, generated the dataset, wrote the source code, performed the experiments and wrote the paper. S.L.M. and K.C. provided detailed advice during the writing process and revised the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.