Unsupervised Cluster-Wise Hyperspectral Band Selection for Classification

Abstract: A hyperspectral image provides fine details about the scene under analysis, due to its multiple bands. However, the resulting high dimensionality in the feature space may render a classification task unreliable, mainly due to overfitting and the Hughes phenomenon. In order to attenuate such problems, one can resort to dimensionality reduction (DR). Thus, this paper proposes a new DR algorithm, which performs an unsupervised band selection technique following a clustering approach. More specifically, the data set was split into a predefined number of clusters, after which the bands were iteratively selected based on the parameters of a separating hyperplane, which provided the best separation in the feature space, in a one-versus-all scenario. Then, a fine-tuning of the initially selected bands took place based on the separability of clusters. A comparison with five other state-of-the-art frameworks shows that the proposed method achieved the best classification results in 60% of the experiments.


Introduction
In pattern recognition problems, the separation among classes in the feature space is of great importance for the success of the classifier [1]. An appropriate separation may be achieved by means of effective data representation [2,3]. When it comes to hyperspectral image (HSI) classification, by selecting the right bands, one can provide a wider class separation [4], as well as attenuate the negative effects of the Hughes phenomenon [5] and avoid the overfitting of the classifier [6][7][8].
In such a scenario, feature extraction (FE), i.e., a combination of the original spectral bands, is capable of tackling the aforementioned problems, but it is not a recommended approach for dimensionality reduction of hyperspectral data, because the resulting features do not carry the physical information any longer [9], impairing, consequently, a proper understanding of the model [10,11]. Band selection (BS), on the other hand, is as good as FE in terms of providing class separability; moreover, it keeps the original information about the spectral bands [9,12]. Since a BS method provides suitable bands for a given task, it is possible to design tailored sensors to perform that application, consequently avoiding redundant and irrelevant bands [13].
BS methods can be grouped into one of three major categories [14]: wrapper methods, where the selection of bands occurs during the training phase of the classifier; in this case, the classifier must be trained from scratch every time a band subset is assessed; embedded methods, where the classifier selects the bands by itself, for example, Lasso [15]; and filter methods, where the band selection process takes place before the classifier training phase and has no relation to the classifier to be used [2].
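As a concrete illustration of the embedded category, the following sketch (synthetic data, hypothetical parameters; not taken from this paper) selects bands via the nonzero coefficients of an L1-penalized linear model:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Embedded band selection sketch: the L1 penalty drives most coefficients
# to exactly zero, and the surviving bands are the selected ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                              # 200 pixels, 50 synthetic "bands"
y = X[:, 3] - 2.0 * X[:, 17] + 0.1 * rng.normal(size=200)   # target driven by bands 3 and 17

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)   # bands with nonzero weights survive
print(selected)
```

In this toy setting, bands 3 and 17 (which actually drive the target) are recovered, while the irrelevant bands are zeroed out.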
It is known that unsupervised state-of-the-art BS frameworks follow either a ranking-based [28] or a clustering-based approach [29]. Ranking-based methods sort the spectral bands according to a specific criterion. However, they fail in terms of avoiding correlated bands [30]. Clustering-based BS frameworks, on the other hand, aim to find the most representative bands of each cluster of the data set, decreasing the correlation amongst bands [31]. Thus, clustering in the BS literature is normally used to form clusters of spectral bands. For instance, in [32], the authors propose a BS method that uses dynamic programming to cluster the spectral bands, which are considered continuous. In [33], the density peak is used for the clustering of the bands, due to its capability of tackling non-spherical data; the authors propose weighting the normalized local density against the cluster distance. In [34], the BS algorithm is performed based on correntropy-based clustering of the spectral bands. In [29], the authors propose a BS algorithm that initially clusters the spectral bands based on Euclidean distance; then one band per cluster is selected, resulting in a band subset whose bands are ranked in a fine-tuning step. In [35], the proposed BS uses a self-tuning algorithm to cluster the spectral bands. In [13], a kernel-based probabilistic clustering of spectral bands is proposed, based on the assumption that there is a smooth transition between two adjacent clusters. Finally, in [36], the most representative bands for each pixel are selected by means of an attention mask. Then, an autoencoder reconstructs the original image using the selected bands. In the end, the final bands are selected by a clustering method.
As those cited papers performed the clustering operation on the spectral bands, structural information on the data set was not taken into account. Moreover, since the final objective of band selection frequently lies in a better classification of the data instances, an analysis based solely on the most representative bands (without looking at class separation in the feature space) ends up being of secondary importance. Furthermore, in an unsupervised BS framework, the filter approach is normally used. Thus, the band selection takes place in a preprocessing step, i.e., before the use of the classifier itself [37], and one does not know beforehand which classifier will be used. For this reason, this paper presents a BS framework that seeks to maximize the distance among classes in the feature space. Therefore, our main purpose is not the representation of the data set by a few bands [38], but rather the selection of bands that best separate the classes. This class separability during the selection of bands is the gap we propose to fill in relation to other approaches. Since this framework is set to work in an unsupervised environment, the actual classes are represented by clusters.
Thus, in the proposed approach, the bands were iteratively selected based on data set portions, which, in turn, were defined by clustering algorithms. Eleven clustering methods were evaluated in order to provide the best match between the resulting clustering and the actual data classes. Thus, the clusters formed may be deemed as representatives of the actual classes, which enables an analysis based on the separability of the classes in the feature space; consequently, structural information was taken into account. Once the clusters were formed, a one-versus-all approach was adopted. In this way, the selected bands were those that provided the best separability between the cluster and the rest of the data set. Then, those bands were subjected to a fine-tuning procedure, which consisted of placing these bands into some clusters in order to select a combination of those that provided the biggest cluster separability in the feature space. The proposed method bears the acronym CW due to its cluster-wise approach.
The contributions of this paper are as follows: • The use of a cluster-wise approach to solving the unsupervised band selection problem; • Once two clusters were formed, the selection of bands was based on the parameters of a hyperplane defined by a single-layer neural network; • Fine-tuning of the selected bands based on cluster separability in the feature space.
In Section 2, the proposed method is presented. In Section 3, the results of the proposed method are compared to those of five competitors by using three classifiers and three hyperspectral images commonly used in the BS literature. In Section 4, the results are discussed. Finally, in Section 5, we offer the conclusions of this work.

Method
Every BS algorithm is supposed to select relevant features; refer to [39] for a thorough definition of feature relevance. In short, a relevant spectral band (i) should provide useful information [40] and (ii) should not be redundant [14]. Since the proposed band selection framework is designed for classification purposes, the bands considered to provide useful information are those that provide maximum separation between clusters in the feature space. Redundancy between spectral bands, in this work, is measured by correlation.
Therefore, following this reasoning, the proposed method is composed of three parts: Data clustering; Selection of bands of interest; and Redundancy reduction.

Data Clustering
In unsupervised problems, data reconstruction [2] and data structure analysis (DSA), for instance, are approaches that render feature selection feasible.
Clustering of data entries can find natural groupings in data sets and, when used for this purpose, is considered a DSA-based band selection approach.
Inspired by [41], the proposed method also performs clustering of the data entries. However, here, we adopted a partitional clustering instead of a hierarchical one, as illustrated in Figure 1a. With partitional clustering, each resulting cluster C i , i ∈ {1, 2, . . . , k}, can be taken as a representative of a real class if (i) k equals the number of classes present in the data set, and (ii) the clustering algorithm is appropriate for the data set at hand.
Since one generally wants to classify objects present in a known scene, it is plausible to suppose that the number k of classes is known beforehand.

Choice of the Clustering Algorithm
As for the fitness of a clustering algorithm to hyperspectral data, 11 methods were evaluated. It is worth noting that our focus was not on the best clustering algorithm available in the literature, but rather just to use some well-established clustering algorithms to show the efficiency of the proposed method.
The input data are the Salinas hyperspectral image [42], with 224 bands and 16 classes. So, each clustering algorithm was set to find k = 16 clusters-as we will see later in this paper, the proposed method sets k equal to the number of classes in the image. It is important to clarify that, at this point, the focus is on the comparison of clustering methods, so the data labels will be used.
The measure of agreement between the two partitions (the clustering result and the Salinas ground truth) was computed by means of the adjusted Rand index (r) [1]. In short, let κ 1 and κ 2 be two different clusterings of a given data set, where κ 1 is given by the real classes of the Salinas image and κ 2 is the result of a clustering algorithm.
Considering all pairs of vectors x j and x l , with j ≠ l, let α 1 be the number of pairs in which both vectors belong to the same cluster in both κ 1 and κ 2 . Moreover, let α 2 be the number of pairs in which the vectors belong to different clusters in κ 1 and different clusters in κ 2 .
Finally, the adjusted Rand index between the clusterings κ 1 and κ 2 is given by

r = (α 1 + α 2 ) / m,   (1)

where m is the number of possible vector pairs in the data set. For each clustering algorithm, r was calculated 10 times. Table 1 shows the mean values for the k-means and k-medoids algorithms with different distance metrics; k-means using the cosine similarity measure has the best outcome. For the sake of clarity, the bigger the value of r, the more similar the clusterings κ 1 and κ 2 .

Table 1. Adjusted Rand index (0 ≤ r ≤ 1) (mean values out of 10 runs for Salinas HSI).
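The pair-counting agreement described above can be sketched as follows (a plain Rand index over all point pairs; the adjusted variant additionally corrects for chance agreement, e.g., scikit-learn's adjusted_rand_score):

```python
from itertools import combinations

def rand_index(k1, k2):
    """Pair-counting agreement between two labelings:
    r = (alpha1 + alpha2) / m, with m the number of point pairs."""
    a1 = a2 = m = 0
    for j, l in combinations(range(len(k1)), 2):
        same1 = k1[j] == k1[l]
        same2 = k2[j] == k2[l]
        a1 += same1 and same2                # together in both labelings
        a2 += (not same1) and (not same2)    # apart in both labelings
        m += 1
    return (a1 + a2) / m

# identical partitions (up to label renaming) agree perfectly
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Note that the index is invariant to relabeling of the clusters, which is why it can compare a clustering result against ground-truth classes.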

Consequently, all of the clusters throughout this paper were obtained by k-means using the cosine similarity measure.
It is worth mentioning that an appropriate partitional clustering is able to turn supervised band selection algorithms into unsupervised ones by taking the resulting clusters as class representatives, and the degree of success depends on the r value. This paper follows that approach, considering [4] as a reference. Since 0 ≤ r ≤ 1, where 1 means the two clusterings match identically, r = 0.7941 indicates a good match between the clusters and the real classes of the Salinas HSI. At this point, we opted to analyze a hyperspectral image not used in Section 3 in order to maintain the unsupervised nature of the proposed approach.
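A minimal sketch of k-means under the cosine measure (often called spherical k-means): l2-normalizing the rows makes the dot product equal to the cosine similarity, so assignments maximize cosine similarity to the centroids. This is an illustration with synthetic data and a simplistic deterministic initialization, not the paper's implementation:

```python
import numpy as np

def cosine_kmeans(X, k, iters=50):
    """Spherical k-means sketch: rows are l2-normalized so that the
    dot product with (unit-norm) centroids is the cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = Xn[:k].copy()          # simple deterministic init for the sketch
    labels = np.zeros(len(Xn), dtype=int)
    for _ in range(iters):
        labels = np.argmax(Xn @ centers.T, axis=1)   # assign by cosine similarity
        for c in range(k):
            members = Xn[labels == c]
            if len(members):
                mu = members.mean(axis=0)
                centers[c] = mu / np.linalg.norm(mu)  # re-project onto unit sphere
    return labels

# two distinct spectral "directions", interleaved -> two clean clusters
X = np.array([[1.0, 0.1], [0.1, 1.0]] * 10)
labels = cosine_kmeans(X, 2)
print(labels)
```

The key point is that cosine clustering groups vectors by direction rather than magnitude, which suits spectra whose overall brightness varies.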

Selection of Bands of Interest
Once the initial data set is split into k clusters, it is time to present the proposed band selection algorithm, which has k iterations. At each iteration, two steps take place: (i) the selection of candidate bands and (ii) fine-tuning.

Selection of Candidate Bands
Let C 0 = {b 1 , b 2 , . . . , b d }, b j ∈ R n×1 , be the initial cluster, i.e., the HSI, where b j is the j th band vector, whose l 2 norm is scaled to 1, n is the number of pixels, and d is the dimensionality of the data set.
Let C i , ∀i ∈ {1, 2, . . . , k}, be the k clusters after the partitional clustering of C 0 , where k is the number of classes in the data set.
The following properties hold for the clusters: C 1 ∪ C 2 ∪ . . . ∪ C k = C 0 and C i ∩ C j = ∅ for i ≠ j. For each cluster, a one-versus-all binary classification was performed between C i and C 0 \ C i .
As in [4], we used a single-layer neural network to generate the separating hyperplane f . As an illustration, both the one-versus-all classification and the hyperplane f are shown in Figure 1b.
The cross-entropy loss function of the neural network is given by

L = −(1/η) Σ_{j=1}^{η} [ y_j log(ŷ_j) + (1 − y_j) log(1 − ŷ_j) ],   (2)

where η is the cardinality of the set containing the data points (since we make |C 0 \ C i | ≈ |C i | in order to balance the two clusters, η ≤ n); y j ∈ {0, 1} is the expected output for the input vector x j ∈ R d×1 , where label 1 corresponds to cluster C i ; and ŷ j is the calculated output, given by the sigmoid activation function

ŷ_j = σ(z_j) = 1 / (1 + e^{−z_j}),   (3)

where e is Euler's number and z j is the hyperplane equation

z_j = w^T x_j + β,   (4)

where w ∈ R d×1 and β ∈ R (both calculated by a single-layer neural network) are the parameters of the hyperplane f. The training phase of the network consists of 2000 training epochs, using the backpropagation algorithm, with 70% of the data set for training and the remaining 30% for testing.
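A minimal sketch of such a single-layer sigmoid unit trained on the cross-entropy loss by gradient descent (synthetic two-feature data; the paper's 2000-epoch backpropagation setup with a 70/30 split is simplified here):

```python
import numpy as np

def train_hyperplane(X, y, epochs=2000, lr=0.5):
    """Single-layer network sketch: a sigmoid unit trained with the
    cross-entropy loss, returning the hyperplane parameters w and beta."""
    n, d = X.shape
    w = np.zeros(d)
    beta = 0.0
    for _ in range(epochs):
        z = X @ w + beta                    # hyperplane argument
        y_hat = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
        grad = y_hat - y                    # d(loss)/dz for cross-entropy
        w -= lr * X.T @ grad / n
        beta -= lr * grad.mean()
    return w, beta

# toy one-versus-all split: the "cluster" label is driven by feature 2 only
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 1] > 0).astype(float)
w, beta = train_hyperplane(X, y)
print(np.abs(w))  # |w[1]| dominates: feature 2 drives the separation
```

Because the cross-entropy gradient with respect to z is simply ŷ − y, the update rule stays compact, and the learned weight magnitudes directly reflect each feature's contribution to the separation.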
After the neural network's training, a given input vector x j will cause either z j ≥ 0 or z j < 0. As z j is the argument of a sigmoid function, z j ≥ 0 implies ŷ j ≥ 0.5, and the predicted label is round(ŷ j ), where round(ŷ j ) = 1 if ŷ j ≥ 0.5 and round(ŷ j ) = 0 otherwise. The band selection is based on the magnitude of the weight vector components w (l) , l ∈ {1, . . . , d}. Indeed, according to (4), the weights that are biggest in magnitude, |w (l) |, will strongly determine the sign of z j . Therefore, the bands x (l) j related to the biggest |w (l) | are the most relevant for the binary one-versus-all classification and are, consequently, initially selected.
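The selection of candidate bands then reduces to sorting the weight magnitudes; a toy sketch with an invented weight vector:

```python
import numpy as np

# Pick the bands with the largest hyperplane weights in magnitude.
# w is a hypothetical trained weight vector; n_sel plays the role of 4(s/k).
w = np.array([0.1, -3.2, 0.05, 1.7, -0.4])
n_sel = 2
candidates = np.argsort(np.abs(w))[::-1][:n_sel]   # indices of biggest |w|
print(sorted(candidates.tolist()))  # -> [1, 3]
```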
In order to provide an illustrative view on this matter, Figure 2 depicts a 2D situation in which a linear classifier, represented by a line segment, separates two different clusters, in red and blue colors, composed of synthetic data on variables v 1 and v 2 . In Figure 2a, the clusters are linearly separable, and it is easy to perceive that this separation is provided by variable v 2 , whereas variable v 1 bears similar values for both clusters. It is worth noting that the green line's parameters w (1) and w (2) , calculated by a single-layer neural network (β is omitted), indicate the relative importance of the variables in this binary classification. That is, |w (2) | = 4.8126 > |w (1) | = 0.5782 indicates a higher relevance of variable v 2 in relation to v 1 . A similar situation occurs in Figure 2b, but this time |w (1) | > |w (2) |, indicating that variable v 1 provides better separability between the clusters. Figure 2c,d shows that the same analysis is valid even when the clusters overlap.

Figure 2. A 2D binary classification illustration using synthetic data, on variables v 1 and v 2 . The hyperplane parameters w (1) , w (2) , and β (the latter not shown here) are calculated by a single-layer neural network, whose result is depicted by a green line segment. The magnitude of each parameter indicates the relevance of its corresponding feature.
(a) Two linearly separable classes. Clearly, attribute v 2 provides good separation between the clusters, which is corroborated by |w (2) | > |w (1) |. (b) A similar situation to (a) occurs here, but this time v 1 provides the separation between the clusters, and |w (1) | > |w (2) |. In (c,d), it is possible to draw the same conclusion, even when the clusters overlap.
According to the proposed method, the number s of selected bands is defined by the user.
Since the method has k iterations, the selection of (s/k) ∈ N bands per iteration would be sufficient. However, at each iteration, the method selects 4(s/k) bands, from which only s/k are kept after the fine-tuning step. It is worth noting that numbers other than 4(s/k) have not been tested.
Those bands are then placed into s/k clusters q l , l ∈ {1, 2, . . . , (s/k)}, by means of k-means (Euclidean distance), and from each cluster q l one band b will be initially selected.
By picking 1 band from each cluster q, several tuples t are formed; an example is shown in Figure 3. The exact number of tuples is |q 1 | × |q 2 | × . . . × |q s/k |. Formally, at iteration i, the set containing all tuples is given by the Cartesian product

Q = q 1 × q 2 × . . . × q s/k .   (5)

Note that this approach for refining the band selection is based on [43]; however, here we adopted a different criterion to assess the importance of each tuple of bands.
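Enumerating all tuples is a Cartesian product over the band clusters, which can be sketched with itertools.product (the band indices below are hypothetical):

```python
from itertools import product

# Each q_l holds candidate band indices; a tuple takes one band per cluster.
q = [[5, 12], [40], [88, 101, 150]]   # hypothetical clusters q_1 .. q_{s/k}
Q = list(product(*q))                 # all tuples t
print(len(Q))  # -> 2 * 1 * 3 = 6 tuples
```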
For each tuple t ∈ Q, its bands were evaluated according to the class separability they provided between C i and C 0 \ C i . At this point, the data sets of both clusters contain only the bands in t.

Figure 3. Each cluster provides one band for the composition of tuple t. In this example, the band b i ∈ q 1 , the band b j ∈ q 2 , and the band b l ∈ q s/k , among others connected by the dashed line, form the tuple t. Each possible combination of bands in different clusters gives rise to all t ∈ Q.
The class separability is measured by the index ρ ∈ R, given by

ρ = tr( Σ_w^{−1} Σ_b ),   (6)

with Σ_b = p C i (µ C i − µ 0 )(µ C i − µ 0 )^T + p (C 0 \C i ) (µ (C 0 \C i ) − µ 0 )(µ (C 0 \C i ) − µ 0 )^T and Σ_w = p C i Σ C i + p (C 0 \C i ) Σ (C 0 \C i ) , where µ C i is the mean of cluster C i , µ 0 is the global mean, Σ is the covariance matrix of a cluster, and p is the a priori probability. Since the clusters C i and C 0 \ C i are balanced, i.e., |C i | = |C 0 \ C i |, then p C i = p (C 0 \C i ) = 0.5. According to (6), the bigger ρ is, the more compact the clusters are and the more distant they are from each other.
Finally, for each tuple t ∈ Q there is a corresponding ρ value, and only the bands in t max , whose ρ is the biggest, are selected at iteration i.
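A sketch of a scatter-based separability measure between two balanced clusters (one common form of such an index; the paper's exact expression for ρ may differ). In the method, this score would be computed per tuple t, keeping the tuple with the biggest value:

```python
import numpy as np

def separability(A, B):
    """Scatter-based separability sketch: trace(Sw^-1 Sb) for two
    balanced clusters. Higher values mean compact, well-separated clusters."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    mu0 = np.vstack([A, B]).mean(axis=0)            # global mean
    Sb = 0.5 * (np.outer(mu_a - mu0, mu_a - mu0)    # between-cluster scatter
                + np.outer(mu_b - mu0, mu_b - mu0))
    Sw = 0.5 * (np.cov(A.T) + np.cov(B.T))          # within-cluster scatter
    return np.trace(np.linalg.solve(Sw, Sb))

rng = np.random.default_rng(2)
near = rng.normal(0.0, 1.0, (50, 3))
far = rng.normal(5.0, 1.0, (50, 3))
overlap = rng.normal(0.5, 1.0, (50, 3))
# well-separated clusters score higher than overlapping ones
print(separability(near, far) > separability(near, overlap))
```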

Redundancy Reduction
Let Ψ ∈ R d×d be the correlation matrix of the data set C 0 , calculated according to Pearson's correlation coefficient.
Let ψ be a vector in which each position holds the band most correlated to the band represented by that position's index. Vector ψ is calculated before the iterations start, according to Algorithm 1. For the sake of clarity, ψ j = b l , for instance, means that b l is the band most correlated to band b j . Algorithm 1 starts by creating the correlation matrix Ψ, according to Pearson's correlation. After the initialization of matrix I and vector ψ, all columns of Ψ are sorted in descending order, and the indices idx, which correspond to the band indices, are stored. The band b I(1,1) is assigned to the first position of ψ. Then, the remaining positions of ψ receive the bands most correlated to the band corresponding to each position, in such a way that no band is assigned to more than one position. The output is the vector ψ.
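A rough sketch of this correlation bookkeeping on synthetic bands (illustrative only; Algorithm 1 additionally prevents a band from being assigned to more than one position, which this sketch omits):

```python
import numpy as np

# Synthetic data: bands 3 and 4 are near-copies of bands 0 and 1.
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 3))
X = np.column_stack([base,
                     base[:, 0] + 0.01 * rng.normal(size=100),
                     base[:, 1] + 0.01 * rng.normal(size=100)])

Psi = np.corrcoef(X.T)            # d x d Pearson correlation matrix
np.fill_diagonal(Psi, -np.inf)    # ignore each band's self-correlation
psi = np.argmax(Psi, axis=1)      # most correlated band per band
print(psi[0], psi[3])             # bands 0 and 3 are each other's best match
```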
Given t max and ψ, the subset δ t max of the bands the most correlated to those in t max is given by Algorithm 2.
In Algorithm 2, t max and ψ are the input. Given the bands in t max , the algorithm finds the bands most correlated to those in t max and inserts them into the subset δ t max , which is the output of the algorithm.

Finally, at each iteration i, the bands in t max are selected and inserted into S, which is the final subset of selected bands:

S ← S ∪ t max .   (7)

Once we have t max , its most correlated bands δ t max are inserted into the subset D of bands to be discarded:

D ← D ∪ δ t max .   (8)

Then, for the next iteration i + 1, the data set is updated according to

C 0 ← C 0 \ (t max ∪ δ t max ).   (9)

The method iterates until i = k. Then, the final output is the subset S of selected bands.
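The per-iteration updates can be sketched with simple set bookkeeping, with bands represented by hypothetical indices:

```python
# Bookkeeping sketch for one iteration's updates (select / discard / shrink).
bands = set(range(10))          # current data set C_0
S, D = set(), set()             # selected / discarded bands so far
t_max = {2, 7}                  # bands kept after fine-tuning (hypothetical)
delta_t_max = {3, 8}            # their most correlated bands (from psi)

S |= t_max                      # keep the winning tuple
D |= delta_t_max                # discard their redundant neighbors
bands -= t_max | delta_t_max    # neither set is reconsidered later
print(sorted(bands))  # -> [0, 1, 4, 5, 6, 9]
```

Removing both the selected bands and their closest correlates from C 0 is what keeps later iterations from re-selecting redundant bands.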

Proposed Method's Overview
Algorithm 3 presents the overview of the proposed method.

Algorithm 3 Proposed band selection algorithm
1: Input: data set C 0 , number k of classes
2: S ← ∅ (subset of selected bands)
3: D ← ∅ (subset of bands to be discarded)
4: Proceed to k-means clustering (cosine distance) of C 0 into k clusters C i
5: for i = 1 : k do
6:    Proceed to a binary classification between clusters C i and C 0 \ C i (one-versus-all) using a single-layer neural net
7:    Select the 4(s/k) ∈ N bands related to the biggest separating-hyperplane parameters |w|, according to (4)
8:    Proceed to the band selection fine-tuning, according to Section 2.2.2
9:    Update the subset of selected bands S according to (7)
10:   Update the subset D according to (8)
11:   Update the data set according to (9)
12: Return: S

Results
Normally the quality of the subset of selected bands is assessed by the performance of the classifiers. So, this approach is adopted here, with support vector machine (SVM), K-nearest neighbor (KNN), and classification and regression tree (CART) classifiers [1], via three hyperspectral data sets used in several BS papers: Botswana, Indian Pines, and Pavia University. All of them can be downloaded at [42].
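As an illustration of this evaluation protocol, the following hedged sketch scores one candidate band subset with the three classifier families used in the paper (SVM, KNN with K = 7, CART); the data, labels, and band indices are synthetic inventions, not the paper's experiments:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an HSI: 300 pixels, 30 bands, labels driven by 2 bands.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 30))
y = (X[:, 5] + X[:, 12] > 0).astype(int)
subset = [5, 12]                         # a candidate band selection

# Score the subset with each classifier via cross-validation.
for clf in (SVC(), KNeighborsClassifier(n_neighbors=7),
            DecisionTreeClassifier(random_state=0)):
    acc = cross_val_score(clf, X[:, subset], y, cv=3).mean()
    print(type(clf).__name__, round(acc, 3))
```

In a real comparison, each BS method's subset would be scored this way and the mean accuracy over repeated runs reported, as in Tables 2-4.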

Competitors
The proposed method is compared to five state-of-the-art BS methods: ASPS, MPWR, ONR, UBS, and VGBS.

ASPS
ASPS [44] is the acronym for hyperspectral band selection via adaptive subspace partition strategy. Its framework begins with a two-step partition of the data cube, starting with the coarse partition of the image cube into a predetermined number of parts, and then there is a fine subspace partition, from which bands with low noise levels are finally selected.

MPWR
In [45], the authors proposed a manifold-preserving and weakly redundant (MPWR) unsupervised band selection method. A manifold-preserving band-importance metric was used to measure the band-wise essentiality. Concerning the redundancy caused by the correlated bands, this paper establishes a constrained band-weight optimization model. Thus, both band-wise manifold-preserving capability and intraband correlation are integrated into the BS method.

ONR
The approach called 'hyperspectral band selection via optimal neighborhood reconstruction' is based on optimal neighborhood reconstruction (ONR) [46]. It evaluates different band combinations in order to reconstruct the original data set and applies a noise reducer to minimize the influence of noisy bands.

UBS
UBS is used as the acronym for the approach presented in [47]. This method is based on the spectral decomposition of a matrix; the loading-factors matrix is then constructed for band prioritization, according to the corresponding eigenvalues and eigenvectors.

VGBS
The authors of [48] state that there is a relation between the volume of a sub-simplex and the volume gradient of a simplex. Based on this, they proposed a BS method called VGBS. It is unsupervised and removes the most redundant bands based only on the volume gradient, instead of calculating the volumes of all sub-simplexes.

Experimental Results
In order to compare the outcome of the proposed method, five different numbers of bands were selected from scratch: 10, 20, 30, 40, and 50. A set with 50 bands does not necessarily contain the 10-band set, for example, due to the nature of the neural networks.
We compare the results to other BS methods for each hyperspectral image separately. All of the classifier accuracies exhibited here are the mean values of ten runs.
Our approach is dubbed CW-the unsupervised cluster-wise method.

(Case 1) Botswana HSI
The Botswana image is composed of 242 spectral bands and has 14 classes; see further details about this image at [42]. Table 2 shows the results of the BS methods using the Botswana HSI. Bold values represent the highest scores attained. Figure 4 presents the same results in an illustrative way. In the figure, different marks are used in order to identify the competitors. The line connecting the marks does not indicate interpolation; it merely guides the eye. Since the Botswana image has 14 classes, the proposed method has the same number of iterations (see Algorithm 3). For each one-versus-all case, a single-layer neural net is run, and the error (on the test set) versus epoch curves are shown in Figure 5. Clearly, there is convergence in all cases, meaning that it is possible to find a hyperplane separating the clusters C i and C 0 \ C i . The different curve shapes indicate how fast the algorithm converged; for instance, in Figure 5, Cluster 1 versus All converged faster than Cluster 6 versus All.
Concerning the results, there were five different sets of selected bands, and each set was subjected to three classifiers. Thus, in total, we had 15 different experiments, from which the proposed CW framework surpassed its competitors in 9 out of 15 cases.

(Case 2) Indian Pines HSI
This image has 224 bands and 16 classes. Further details about this image can be found at [42]. Table 3 shows the accuracies and standard deviations of the results. Out of 15 experiments, the proposed CW method achieved the best results 10 times. In Figure 6, it is possible to have a visual idea of the performance of all BS methods.
According to Figure 7, the single-layer neural network converged in all iterations, as evidenced by the error decreasing as the number of epochs increased.

(Case 3) Pavia University HSI
This image has 102 spectral bands and 9 classes. See more information about this image at [42].
In Table 4, we see the results, in terms of overall accuracy and standard deviation, of all methods. The proposed CW algorithm has the best accuracies in 8 out of 15 experiments, as also shown in Figure 8. Figure 9 indicates that all of the one-versus-all separating-hyperplane solutions converged.

Remark
Finally, since this is a paper about band selection, Table 5 presents all of the bands selected by the proposed method CW.

Discussion
The proposed method was introduced in Section 2.2 and its performance was reviewed in Section 3.2.
It is still important to emphasize the advantages of the proposed CW method, as well as to highlight its deficiencies.

Pros
As shown in Tables 2-4, with their corresponding Figures 4, 6 and 8, each classifier, due to its intrinsic characteristics, performed differently when compared to the others. SVM classifiers are well known for their effectiveness in high-dimensional spaces, such as the feature space of a hyperspectral image; for this reason, we see SVM outperforming the other classifiers. An exception appears in Table 4 and Figure 8, where SVM is less accurate than the other two classifiers. Given that KNN and CART exhibit stable performances as the number of bands increases, the reason for the poor performance of SVM in this case, as the dimensionality became higher, may lie in the fact that the Pavia University data were not well suited to discrimination by the SVM algorithm (this phenomenon happened with all competitors). When it comes to the KNN classifier, we set the number of neighbors to 7, i.e., K = 7, in all experiments; the objective here was not to find the best settings for the KNN but to provide equal conditions for the comparison of the band selection methods. As KNN classifies an input pattern based on its neighbors in the feature space, it performed better than the CART classifier, whose decision trees rely on binary rules, which become more complex as the dimensionality increases.
In general, the proposed CW method outperformed its competitors in lower dimensions, due to the fact that the CW algorithm selected its bands based on the class separations in the feature space. Thus, even in lower dimensions, we saw a good performance of the proposed method, which was designed to be used as a filter method, i.e., a preprocessing step of hyperspectral data classification tasks. As the dimension increased, the CW method maintained good results when compared to its competitors. In fact, considering all of the results, we see that the proposed method achieved the best results in (9 + 10 + 8)/45 = 60% of the experiments. This is likely due to the fact that the CW method is capable of selecting the best spectral bands for each individual cluster (or class) in a one-versus-all fashion, even in an unsupervised case. Moreover, the cluster separability criterion used during the band selection process makes the job of the classifiers easier.
In terms of processing times, the proposed method does not appear amongst the fastest ones, as Figure 10 indicates. However, its outstanding mean accuracy compensates for this fact. Moreover, the mean processing time of the CW method was less than 50 s, which poses no problem for offline applications.

Cons
The proposed CW algorithm is not capable of addressing all the issues concerning a band selection application, such as the optimal number of bands to be selected. In fact, we do not address this topic in this paper. Here, the number s of bands to be selected is a user-defined parameter.
Moreover, it is necessary to know the number k of classes in the scene depicted by the image. Even though a remote sensing expert may easily infer the number of classes in a given scene, this topic remains unsolved.

Conclusions
The high dimensionality of a hyperspectral image can be useful in terms of good discrimination amongst objects and classes. On the other hand, it can also be a source of problems, such as the curse of dimensionality and overfitting of the classifier.
In order to alleviate such issues, this paper proposes a novel unsupervised band selection framework based on partitional clustering, in which each cluster stands for a real class of the data set. A hyperplane was used to separate all clusters in a one-versus-all fashion. After this, we proceeded to fine-tuning the initially selected bands based on the cluster separability in the feature space.
The proposed method achieved the best classification results in 60% of the experiments. In future works, it is advisable to verify the performance of support vector machines in finding a separating hyperplane between clusters. Furthermore, numbers of initially chosen bands other than 4(s/k) should be tested. Moreover, some more recent clustering algorithms could be tested in order to check their effects on the final results. Finally, one could use optimization algorithms to find a suitable subset of bands during the fine-tuning process.

Funding: This work was carried out in the framework of the NExT Senior Talent Chair DeepCoSLAM, which was funded by the French Government through the program Investments for the Future managed by the National Agency for Research (ANR-16-IDEX-0007), with the support of Région Pays de la Loire and Nantes Métropole.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 16 August 2022.