Partitioned Relief-F Method for Dimensionality Reduction of Hyperspectral Images

Abstract: The classification of hyperspectral remote sensing images is difficult due to the curse of dimensionality. Therefore, it is necessary to find an effective way to reduce the dimensions of such images. The Relief-F method has been introduced for supervised dimensionality reduction, but the band subset obtained by this method has a large number of continuous bands, resulting in a reduction in the classification accuracy. In this paper, an improved method, called Partitioned Relief-F, is presented to mitigate the influence of continuous bands on classification accuracy while retaining important information. Firstly, the importance scores of each band are obtained using the original Relief-F method. Secondly, the whole band interval is divided in an orderly manner, using a partitioning strategy according to the correlation between the bands. Finally, the band with the highest importance score is selected in each sub-interval. To verify the effectiveness of the proposed Partitioned Relief-F method, a classification experiment is performed on three publicly available data sets. The dimensionality reduction methods Principal Component Analysis (PCA) and original Relief-F are selected for comparison. Furthermore, K-Means and Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) are selected for comparison in terms of partitioning strategy. This paper mainly measures the effectiveness of each method indirectly, using the overall accuracy of the final classification. The experimental results indicate that the addition of the proposed partitioning strategy increases the overall accuracy on the three data sets by 1.55%, 3.14%, and 0.83%, respectively. In general, the proposed Partitioned Relief-F method can achieve significantly superior dimensionality reduction effects.


Introduction
With the development of remote sensing and spectral imaging technologies, the amount of research into hyperspectral images (HSIs) has been increasing in recent years, occupying an important position [1]. Hyperspectral sensors are capable of acquiring continuous spectra of hundreds of bands simultaneously for each pixel. Compared with multispectral images (MSIs), HSIs undoubtedly contain more detailed spectral information, providing a better resource for the recognition or classification of ground objects [2]. There are currently many hyperspectral imaging systems providing a number of hyperspectral data sets, which are often used as research objects, such as AVIRIS (USA), HYDICE (USA), HyMAP (Australia), and ROSIS (Germany) [3].
When classifiers, such as SVM, KNN, and Neural Networks, are directly used to classify HSI data, the classification accuracy is generally low; this is mainly due to the Curse of Dimensionality. As high-dimensional data contain more information, their use should lead to better results for classification tasks (or other tasks). However, with the increase of the data dimension, the computational cost of the model tends to increase exponentially and, in the case of a small number of samples, it is difficult for the model to learn useful information to improve performance; this is the meaning of the Curse of Dimensionality [4]. The Hughes phenomenon indicates that the accuracy of HSI classification will not continue to increase with the addition of new bands but, in fact, may decrease after the accuracy reaches a certain level [5]. Due to the high dimensionality of HSIs, the amount of data required for MSI classification typically does not meet the needs of HSI classification. The linear inseparability of HSIs also makes it difficult to apply traditional MSI classifiers directly to HSIs [6].
In summary, the problem of high dimensionality should be solved before classification. Dimensionality reduction methods, which are generally divided into feature extraction and feature selection, provide an effective way to deal with this problem. Feature extraction refers to mapping the original high-dimensional space into a low-dimensional space by combining features according to certain rules. Common feature extraction methods for HSI dimensionality reduction include Principal Component Analysis (PCA) [7], Minimum Noise Fraction (MNF) transforms [8], Independent Component Analysis (ICA) [9], Linear Discriminant Analysis (LDA) [10], Singular Value Decomposition (SVD) [11], Wavelet Transforms (WT) [12], and other algorithms. Feature extraction preserves all original feature information but destroys the intrinsic feature structure. Feature selection refers to selecting a representative subset from the original band set as the classification basis without losing important information. HSI dimensionality reduction methods based on feature selection can be divided into supervised and unsupervised, depending on whether class labels are needed. Unsupervised feature selection methods only use the information contained in the data itself to select the bands, such as Optimum Index Factor (OIF) [13], Automatic Band Selection (ABS) [14], Hierarchical Clustering [15], K-Means Clustering [16], Particle Swarm Optimization (PSO) [17], and so on. Supervised feature selection methods require the use of class labels for band selection, such as Sequential Forward Selection (SFS) [18], Sequential Forward Floating Selection (SFFS) [19], Suboptimal Search Strategy (SSS) [20], and so on. Relief-F is a supervised feature selection method used for the dimensionality reduction of HSIs [21].
An HSI can be regarded as a data cube containing three dimensions, denoted as X ∈ R^(H×W×B), where H and W represent the two spatial dimensions of the scene, and B represents the spectral dimension. Therefore, the HSI contains H × W pixels, where each pixel contains B bands. In a supervised hyperspectral classification task, a pixel represents a sample that records the spectral information of a ground object. Generally, we need to reduce the dimensionality of the data before classification. The band selection method can be regarded as selecting the most representative B′ (B′ < B) bands to form a new, low-dimensional data cube, denoted as X′ ∈ R^(H×W×B′). In band selection methods, the selected band subset is required to have rich information and low redundancy. In the original Relief-F (ORF) method, the bands are sorted according to their calculated importance scores and the top k bands are selected to form a new data set to achieve data dimensionality reduction. However, ORF can only guarantee that the selected bands are rich in information. According to our research, ORF readily selects adjacent bands, but the adjacent bands may be highly correlated. Therefore, in this paper, we propose an improved Partitioned Relief-F (PRF) method for band selection of HSIs. We can reshape the three-dimensional cube into a two-dimensional matrix, X ∈ R^(N×B) (N = H × W), where each column is a band of the entire HSI, and all B bands can be regarded as one interval. In PRF, a partitioning strategy is proposed to divide the entire interval, in an orderly manner, to obtain multiple sub-intervals. The importance score of each band is calculated by ORF, and the band with the highest score is selected as the representative band of each sub-interval. As the partitioning strategy brings together highly correlated bands, the new low-dimensional data composed of bands selected from different sub-intervals has lower redundancy.
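The cube-to-matrix reshaping and band-subset selection described above can be sketched in plain Python. The tiny cube shape and the chosen band indices here are illustrative assumptions, not values from the paper:

```python
# Sketch: reshape an (H, W, B) HSI cube into an N x B matrix (N = H * W),
# then keep a subset of B' < B bands. Values are synthetic for illustration.
H, W, B = 2, 3, 5
cube = [[[h * 100 + w * 10 + b for b in range(B)] for w in range(W)] for h in range(H)]

# Each row of X is one pixel's spectrum; each column is one band.
X = [cube[h][w] for h in range(H) for w in range(W)]
assert len(X) == H * W and len(X[0]) == B

# Band selection keeps B' columns, e.g. a hypothetical subset {0, 2, 4}.
selected = [0, 2, 4]
X_reduced = [[row[j] for j in selected] for row in X]
assert len(X_reduced[0]) == len(selected)
```

Reshaping back to a cube of shape (H, W, B′) is the inverse row-major mapping.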
Additionally, as SVMs perform well on small sample data sets [22], we chose an RBF-kernel SVM as the final classifier to address the reality of the high cost of HSIs. The details of the proposed method are described in Section 3, and Figure 1 can be used as a reference for the entire model.
The remainder of this paper is organized as follows. Section 2 introduces the related work of dimensionality reduction for HSIs, including feature extraction and feature selection. Section 3 details the proposed methods. Section 4 presents the experimental results and analysis. The conclusion is given in Section 5.

Related Works
As HSIs contain narrow spectral bands over a continuous spectral range, adjacent bands in an HSI are usually highly correlated. Generally, HSIs have dozens or even hundreds of bands, which results in a high-dimensional data cube. The high dimensionality of the data causes considerable problems for classification tasks. Therefore, many researchers have been working on attempting to propose a reasonable and effective dimensionality reduction method for HSIs.

Feature Extraction Methods
In the dimensionality reduction of hyperspectral data, it is necessary to use some appropriate transformations, as the non-linear features of HSI data cannot be extracted using common feature extraction methods. Non-linear features can be extracted by introducing kernel functions: the original data are transformed into a new high-dimensional space through a non-linear mapping and, then, standard methods are used for low-dimensional mapping [23]. For instance, the traditional PCA method can extract non-linear features from HSI data after combining kernel methods. Fauvel et al. demonstrated that the features extracted by Kernel-based Principal Component Analysis (KPCA) are more linearly separable than those extracted by the traditional PCA method [24]. Li X et al. proposed a Two-stage Subspace Projection framework, in which they used the KPCA method to carry out feature projection of HSI data [25]. Further improved methods have been proposed on the basis of KPCA, such as Superpixelwise KPCA [26] and Multiple KPCA based on integrated learning [27].
Kernel techniques have also been combined with feature extraction methods other than PCA. Zhao B et al. proposed an optimized Kernel Minimum Noise Fraction (optimized KMNF) in which the noise is estimated using more stable spectral correlation information [28]. Gomez-Chova et al. proposed a KMNF method that explicitly estimates noise in a reproducing kernel Hilbert space, which yields less noisy results than MNF and other KMNF methods [29]. The kernel technique has also been combined with ICA. Song S et al. verified the effectiveness of kernel ICA in the feature extraction of HSI data for anomaly detection tasks [30]. Han Z et al. conducted dimensionality reduction on feature sets based on a quantitative histogram matrix (QHM) and improved kernel ICA when using HSIs to identify qualified and adulterated petroleum products [31]. The kernel technique has also been combined with other traditional feature extraction methods. For instance, Yuan H et al. extended LDA to kernel LDA by incorporating a local scatter matrix from a small neighborhood as a regularization term into the objective function of LDA [32]. Gu Y et al. proposed an algorithm called rare signal component extraction (RSCE) for an anomaly detection task in HSIs, in which KSVD was used for feature extraction [33]. Du P et al. used the Coiflet kernel function in the Wavelet Transform, which enabled the constructed wavelet SVMs to obtain high accuracy in the classification of HSIs [34].
In addition to introducing kernel functions to extract non-linear features from HSI data, there are also feature extraction methods based on manifold learning. Manifold learning uses easily measured local information to learn the underlying global geometric structure of the data set, and is able to mine non-linear features [35]. Many non-linear dimensionality reduction algorithms based on manifold learning have been proposed and applied to HSI data, such as Isometric Mapping (ISOMAP), Laplacian Eigenmaps (LE), Locally-linear Embedding (LLE), Locality Preserving Projection (LPP), and so on. These non-linear dimensionality reduction methods are also adapted correspondingly when applied to HSI data. For example, Orts Gomez et al. combined ISOMAP with SMACOF, the most accurate MDS method, to reduce the dimensionality of HSI data [36]. Zhou S et al. proposed an improved ISOMAP algorithm, which uses the neighborhood distance to indicate the manifold structure of HSI data, in order to improve the calculation rate [37]. Yan L et al. improved the LE-based dimensionality reduction method to enable the processing of missing data problems in multi-temporal HSIs [38]. Qian SE et al. proposed an improved dimensionality reduction algorithm by combining LLE with LE, which has more advantages in endmember positioning and endmember recognition [39]. Feng F et al. proposed a graph-based discriminant analysis method for spectral similarity to solve the problem that the Euclidean distance used in the LPP algorithm cannot measure spectral variation well [40].

Feature Selection Methods
Feature selection methods select a feature subset from an HSI without losing important information by directly discarding a large number of "irrelevant" or "redundant" features. Therefore, feature selection methods mainly focus on two approaches: determining the validity of the selected bands (i.e., the selected bands should contain enough information); and eliminating the redundancy of the selected bands (i.e., the correlation between the selected bands should be low). Among current studies into band selection methods, ranking-, searching-, and clustering-based methods have received the most attention; however, methods based on sparsity theory have also become popular in recent years [41].
Ranking-based methods rank the bands by calculating the importance score of each band and, then, select a specified number of top-ranked bands as a band subset. The importance score is sometimes calculated based on a positive indicator, such as the optimal index factor (OIF), which measures the amount of information in the band [42]; and sometimes based on a negative indicator, such as recursive divergence, which measures the maximum separability between categories [43]. Some supervised dimensionality reduction methods have also been studied. For example, Mahlein et al. used the Relief-F algorithm for feature selection, which required the class signature of the sample [44]. In searching-based methods, a criterion function is defined to measure a subset of bands, and a search strategy is used to select different band subsets to optimize the defined criterion function. Du Q proposed a sequential forward search (SFS) method based on the abundance estimation of end members for feature selection and introduced a floating search strategy based on this, which further improved the classification accuracy [45]. Sun K et al. proposed a minimum noise band selection (MNBS) method based on sequential backward search (SBS) [46]. Furthermore, evolutionary calculation-based methods have also been used as search strategies. For example, Ghamisi et al. proposed a new binary optimization method based on fractional Darwin particle swarm optimization (FODPSO) to solve the dimensionality reduction problem [47]. Vaiphasa et al. demonstrated the effectiveness of genetic algorithms for band selection in HSIs [48].
Clustering-based methods give priority to the separability between bands and select the optimal band in each cluster after clustering the bands. For example, Martinez-Uso et al. used hierarchical clustering to group bands before selection with mutual information and Kullback-Leibler divergence [49]. Cao X et al. proposed an automatic band selection (ABS) method, which uses the KODAMA algorithm to cluster before selecting the optimal band with the minimum average Euclidean distance in each class [14]. Imbiriba T et al. proposed a band selection strategy based on reproducing kernel Hilbert spaces, which effectively extracts the non-linear features of HSI data, where the basic clustering method used was K-Means [50]. Sparsity-based methods, based on sparsity theory, reduce the dimension by learning a sparse representation of the original data. For example, Li S et al. used the existing K-SVD algorithm to obtain sparse representations of HSI data and selected the optimal band subset by calculating the histogram of the coefficient matrix [51]. Sun W et al. proposed a Symmetric Sparse Representation (SSR) method for band selection, based on the assumption that the selected band subset and the original data set could be sparsely represented similarly to each other [52]. After the analysis of various kinds of band selection methods, this paper presents the PRF method, which incorporates a partitioning strategy into ORF and can be regarded as a synthesis of a ranking-based method and a clustering-based method.

Proposed Method
As mentioned earlier, adjacent bands in an HSI are highly correlated as the hyperspectral sensor acquires data over a continuous spectral range. To verify the correlation between adjacent bands, we provide a statistical t-test in Section 3.2. The proposed model for HSIs is shown in Figure 1. After processing the data, as shown in Section 4.1, the three-dimensional HSI cube is reshaped into a two-dimensional matrix X ∈ R^(N×B) (N = H × W), where each column represents a band and each row records the spectrum of a single pixel. As shown in Figure 1, we first use a partitioning strategy to divide the whole band interval, where different sub-intervals are marked with different colors. Then, the importance scores of all bands are calculated using the ORF method, which can also be calculated before the partition. In each sub-interval, the band with the highest score is selected as the representative band. These representative bands are used to form a new data cube of lower dimension, X′ ∈ R^(H×W×B′) (B′ < B). The above processes are the general flow of the Partitioned Relief-F (PRF) method. For classification purposes, some distinguishable pixels in the HSI are manually labeled, while others are not. Therefore, the unlabeled pixels need to be eliminated first. Some of the labeled pixels are used as the training set to train the SVM classifier, while the rest are used as the test set, to verify the performance of the model. The division ratio of the training set is introduced in Section 4.1.

Band Importance Score Calculation
The original Relief-F (ORF) method is a typical supervised ranking-based band selection method, which selects the top k bands as the band subset according to the calculated importance score of each band. The importance score measures the selection priority of each band, which is calculated according to the similarities and dissimilarities of the bands. Based on this, we set "near-hit" and "near-miss" to calculate the required "similarity contribution" and "dissimilarity contribution", respectively.
The training set of the given HSI data is recorded as H_nB = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}, where x_i represents one pixel of the hyperspectral image, which is a one-dimensional vector (of size B × 1); y_i represents the category tag to which x_i belongs; and y_i ∈ {1, 2, . . . , m}, where m is the number of categories in the training set. In this paper, the Pearson correlation coefficient is used as a distance metric for the calculation of "near-hit" and "near-miss":

corr(x_i, x_j) = Cov(x_i, x_j) / √(Var(x_i) · Var(x_j)), (1)

where x_i and x_j represent two different pixel points in H_nB (i.e., i ≠ j); Cov(x_i, x_j) is the covariance of x_i and x_j; and Var(x_i) is the variance of x_i. The "near-hit" is calculated in the sample set with the same class as x_i (let this sample set be K). For a pixel x_i in K, its "near-hit" can be obtained by finding the sample with the maximum correlation coefficient:

x_{i,nh} = argmax_{x_j ∈ K, j ≠ i} corr(x_i, x_j).

On the contrary, the "near-miss" is calculated in each sample set with a category different from that of x_i, as the sample with the minimum correlation coefficient:

x_{l,nm} = argmin_{x_j ∈ L} corr(x_i, x_j),

where x_{l,nm} represents the "near-miss" of x_i in the sample set L, and L is one of the sample sets of categories different from that of x_i. In the PRF method proposed in this paper, a (a > 1) samples are randomly selected as the base samples in each sample set. Thus, there are a × m base samples, assuming that the number of classes is m. We denote the base samples as x_i, where i = 1, 2, . . . , a × m; the corresponding "near-hit" is denoted as x_{i,nh}, and the "near-miss" in sample set l is denoted as x_{i,l,nm}. Then, the importance score of each band j (denoted as Score_j) is determined by the following formula:

Score_j = Σ_{i=1}^{a×m} ( −diff(x_i^j, x_{i,nh}^j) + Σ_{l ≠ class(x_i)} p_l · diff(x_i^j, x_{i,l,nm}^j) ), (4)

where diff(x, y) = |x − y| and p_l represents the proportion of the number of samples in sample set l to the total number of samples in the training set.
Equation (4) can be divided into two parts: The former part measures the aggregation contribution of the band j to "homogeneous samples", while the latter measures the separation contribution of the band j to "heterogeneous samples".
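A minimal pure-Python sketch of this score computation follows. The toy layout, the parameter a = 2, and the assumption that each class contains at least two samples are choices of this sketch, not of the paper:

```python
import random
from statistics import mean

def corr(x, y):
    # Pearson correlation coefficient between two spectra (Equation (1)).
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def relieff_scores(X, y, a=2, seed=0):
    """Per-band importance scores following the ORF scheme described above.
    For each of `a` random base samples per class, find the near-hit (most
    correlated same-class pixel) and one near-miss per other class (least
    correlated pixel), then accumulate the per-band diff contributions."""
    rng = random.Random(seed)
    B, n = len(X[0]), len(y)
    classes = sorted(set(y))
    p = {c: sum(1 for t in y if t == c) / n for c in classes}
    scores = [0.0] * B
    for c in classes:
        idx_c = [i for i in range(n) if y[i] == c]  # assumes len(idx_c) >= 2
        for i in rng.sample(idx_c, min(a, len(idx_c))):
            # near-hit: same class, maximum correlation
            nh = max((j for j in idx_c if j != i), key=lambda j: corr(X[i], X[j]))
            for b in range(B):
                scores[b] -= abs(X[i][b] - X[nh][b])
            # near-miss per other class: minimum correlation
            for c2 in classes:
                if c2 == c:
                    continue
                idx_l = [j for j in range(n) if y[j] == c2]
                nm = min(idx_l, key=lambda j: corr(X[i], X[j]))
                for b in range(B):
                    scores[b] += p[c2] * abs(X[i][b] - X[nm][b])
    return scores
```

On a toy set where band 0 separates the classes while the other bands do not, band 0 receives the highest score, matching the aggregation/separation reading of Equation (4).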

Correlation Test of Adjacent Bands
The importance score calculated by the method provided in Section 3.1 can be used to measure the importance (i.e., the amount of information) of a band. However, there are similar spectral measurements between two bands of an HSI if they are adjacent, which makes the importance scores of these two adjacent bands similar as well. In order to verify the correlation between adjacent bands, we provide a statistical t-test in this section.
Calculate the maximum of the correlation coefficients between each band (b_i, i = 1, 2, . . . , B) and all other bands as

max_corr(b_i) = max_{j ≠ i} corr(b_i, b_j),

and the maximum of the correlation coefficients between b_i and its adjacent bands as

neibor_corr(b_i) = max( corr(b_i, b_{i−1}), corr(b_i, b_{i+1}) ),

where corr(x, y) is the correlation coefficient between x and y. Then, max_corr(b_i) and neibor_corr(b_i) are examined for significant differences using the paired-sample t-test. Let max_corr(b_i) be X_{1i} and neibor_corr(b_i) be X_{2i}, and test whether the differences D_i = X_{1i} − X_{2i} (i = 1, 2, . . . , B) are significant. Define the null hypothesis as H_0: there is no difference between max_corr(b_i) and neibor_corr(b_i); that is, µ_D = 0, where µ_D is the mean of the differences, µ_D = Σ D_i / B. Define the alternative hypothesis as H_1: there is a significant difference between max_corr(b_i) and neibor_corr(b_i); that is, µ_D ≠ 0.
The t-distribution statistic is as follows:

t = (D̄ − µ_D) / (S_D / √B),

where D̄ is the sample mean of the differences, B is the number of sample differences (here, equal to the number of bands), µ_D is the hypothesized mean difference (zero under H_0), and S_D is the standard deviation of the sample differences.
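The statistic can be sketched directly from this definition; the two input sequences below are hypothetical stand-ins for the max_corr and neibor_corr series, and in practice |t| would be compared with the critical value of the t distribution with B − 1 degrees of freedom:

```python
from statistics import mean, stdev

def paired_t_statistic(x1, x2, mu_d=0.0):
    """Paired-sample t statistic t = (D_bar - mu_D) / (S_D / sqrt(B)),
    where D_i = x1_i - x2_i and B is the number of pairs."""
    d = [a - b for a, b in zip(x1, x2)]
    B = len(d)
    d_bar = mean(d)
    s_d = stdev(d)  # sample standard deviation of the differences
    return (d_bar - mu_d) / (s_d / B ** 0.5)
```

For real use, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.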

Partitioning Strategy
When using the ORF method to select bands in an HSI, the "important bands" are directly selected according to the calculated importance scores to form a new data cube. In the original data set, each band can be assigned an index in spectral order: [1, 2, . . . , B], where B is equal to the spectral dimension of the data set. When two bands have similar indices, the information they contain is also similar, which means that the importance scores of the two bands are likely to be similar as well. Therefore, the indices of the bands selected by the ORF method tend to be continuously arranged. We verified this experimentally; the results are given in Section 4.2. In addition, using the t-test provided in Section 3.2, we confirmed that there is a high correlation between adjacent bands; these results are also given in Section 4.2. Define the band interval as

X = [b_1, b_2, . . . , b_B],

where b_i is an N-dimensional real vector containing the measured values of all N pixels of the HSI in the i-th band. Dividing the band interval according to the correlation between the bands yields multiple sub-intervals, as shown in Figure 1. In each sub-interval, the band with the highest importance score is taken as the representative band, and all the representative bands are used to form a new data set. Under proper division, the correlation between these representative bands should be low; that is, the new data set should have lower redundancy.
For a sub-interval containing m bands, the redundancy of the sub-interval is calculated, in the proposed partitioning strategy, by

r = (1/m) Σ_{i=1}^{m} corr(b_i, c),

where corr(·, ·) is used to calculate the correlation coefficient between two vectors (as in Equation (1)) and c is the "maximum average correlation vector" of the sub-interval; that is, the vector for which the mean of the correlation coefficients with the band vectors b_i (i = 1, 2, . . . , m) attains its theoretical maximum. Therefore, the larger the value of r, the higher the degree of aggregation of the bands in the sub-interval. As the degree of aggregation is characterized by the correlation coefficient, the value of r can be used to measure the redundancy of the sub-interval. The solution for c is shown below.
Let f(z) denote the average correlation coefficient between a vector z and the band vectors of the sub-interval; that is,

f(z) = (1/m) Σ_{i=1}^{m} corr(z, b_i).

The maximum average correlation vector c is the maximizer of f(z), obtained (following Equation (1)) by setting the gradient of f(z) to zero (Equation (16)); at this point, f(z) attains its maximum value. Therefore, the redundancy of the sub-interval can also be calculated by

r = max_z f(z) = f(c). (17)

For the band interval [b_1, b_2, . . . , b_B] in the proposed partitioning strategy, the sub-intervals are divided contiguously. Therefore, the initial state of the first sub-interval (denoted as I_1) is {b_1}. The redundancy of I_1 (denoted by r) is calculated according to Equation (17); if r is greater than the threshold (denoted by λ), the next band b_2 is added to I_1. If r ≤ λ, the division of I_1 is ended and the division of the next sub-interval is started. The general steps of the partitioning strategy can be described as follows: Step (1) Initialize the sub-interval I_k = {b_j}, where the band b_{j−1} belongs to the previous sub-interval I_{k−1}.
Step (2) Add the next band b_{j+1} to the sub-interval I_k and calculate the redundancy r of I_k. If r > λ, the band b_{j+1} is retained in I_k; if r ≤ λ, the band b_{j+1} is removed from I_k and the division of the next sub-interval I_{k+1} starts.
Step (3) Repeat Step 2 until the last band b_B has been assigned. As the correlations between the bands are not the same, the sizes of the sub-intervals also differ. According to the algorithm pseudocode (Algorithm 1) of the partitioning strategy, when the threshold λ is set to a large value, the required degree of redundancy within each sub-interval is also high. In this case, each sub-interval contains fewer bands and, accordingly, there are eventually many sub-intervals. Conversely, when λ is small, there are eventually few sub-intervals. An appropriate value for λ was determined based on the experimental results, as detailed below.
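The three steps above can be sketched as follows. As a simplification of this sketch, the maximum average correlation vector c is searched for among the band vectors themselves rather than solved for analytically, and the toy band vectors and threshold are illustrative:

```python
from statistics import mean

def corr(x, y):
    # Pearson correlation coefficient (Equation (1)).
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def redundancy(bands):
    # Redundancy r of a sub-interval: highest average correlation achieved
    # by any band vector against the whole sub-interval (approximating c).
    if len(bands) == 1:
        return 1.0
    avg = lambda z: mean(corr(z, b) for b in bands)
    return max(avg(b) for b in bands)

def partition(bands, lam):
    """Ordered division of [b_1, ..., b_B]: grow the current sub-interval
    while its redundancy r stays above the threshold lam (Steps 1-3)."""
    intervals, current = [], [bands[0]]
    for b in bands[1:]:
        current.append(b)
        if redundancy(current) <= lam:  # r <= lambda: close the sub-interval
            current.pop()
            intervals.append(current)
            current = [b]               # this band starts the next sub-interval
    intervals.append(current)
    return intervals
```

With two perfectly correlated pairs of toy bands and λ = 0.5, the sketch yields exactly two sub-intervals of two bands each.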
The graphical interpretation of the proposed partitioning strategy is shown in Figure 2. For the two-dimensional matrix X ∈ R^(N×B), each column represents a band, each row records the spectrum of a single pixel, and X_ij is the measured value of the i-th pixel in the j-th band. Assume that the first sub-interval I_1 contains five bands (i.e., I_1 = {b_1, b_2, . . . , b_5}), as shown in the gray part of Figure 2. Let r_1 denote the redundancy of {b_1, b_2, . . . , b_4}, r_2 the redundancy of I_1, and r_3 the redundancy of {b_1, b_2, . . . , b_6}. According to the description of the partitioning strategy, we have r_1 > r_2 > λ > r_3, where λ is the pre-set threshold. Thus, r_1 > r_2 > λ > r_3 is the condition for the current sub-interval to stop growing; when this condition is met, the division of the next sub-interval starts. For each obtained sub-interval, a representative band is selected according to the band importance scores. Assuming that b_4 has the highest score in I_1, it will be selected as one of the representative bands used to form the new low-dimensional data set.
For comparison, this paper chooses two clustering algorithms, K-Means and Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH), to partition the band interval. Similar to the PRF method, after using the clustering method to obtain multiple clusters, the representative band with the highest importance score is selected from each cluster to form a new low-dimensional data set, which can also ensure that the new data set has low redundancy. (1) K-Means: K-Means is a classic unsupervised clustering algorithm, which aims to divide data points into k clusters such that the data points in each cluster have the closest mean. For b_i ∈ R^N (i = 1, 2, . . . , B), it divides these B N-dimensional vectors into the clusters C = {C_1, C_2, . . . , C_k}. The objective function of the K-Means algorithm is

J = Σ_{i=1}^{k} Σ_{b ∈ C_i} ||b − µ_i||²,

where µ_i is the mean of the vectors in C_i. K-Means uses an iterative approach to obtain the clusters. First, k data points are randomly selected as the initial centroids. Then, for each data point, its distance to each centroid is calculated and it is assigned to the cluster of the nearest centroid. After the assignment is completed, the mean of the vectors in each cluster is calculated as the new centroid.
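This iterative procedure (Lloyd's algorithm) can be sketched on band vectors; the toy band vectors, fixed iteration count, and seed are choices of this sketch:

```python
import random

def kmeans_bands(bands, k, iters=20, seed=0):
    """Lloyd's algorithm on band vectors b_i in R^N, reducing the
    within-cluster sum of squares sum_i sum_{b in C_i} ||b - mu_i||^2."""
    rng = random.Random(seed)
    centroids = rng.sample(bands, k)  # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in bands:
            # assign b to the cluster of the nearest centroid
            j = min(range(k),
                    key=lambda i: sum((x - c) ** 2 for x, c in zip(b, centroids[i])))
            clusters[j].append(b)
        # recompute each centroid as the mean of its cluster
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters
```

On two well-separated pairs of toy band vectors, the sketch converges to two clusters of two bands each, from which the highest-scoring band per cluster would then be taken.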
(2) BIRCH: BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) is a hierarchical clustering algorithm suitable for large data sets. As with K-Means, BIRCH also needs the number of clusters to be specified in advance. The core of the algorithm is the construction of a clustering feature (CF) tree for hierarchical clustering, where a CF is a triple that summarizes the information of a cluster. Assuming that cluster C_i contains B_i band vectors, the clustering feature of the cluster is defined as

CF_i = (B_i, LS_i, SS_i),

where B_i is the number of band vectors in the cluster, LS_i is the linear sum of the B_i band vectors (i.e., Σ_{j=1}^{B_i} b_j), and SS_i is the square sum of the B_i band vectors (i.e., Σ_{j=1}^{B_i} ||b_j||²). The specific process is detailed in [53].
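The CF triple itself is simple to compute, and it is additive under cluster merging, which is what makes the CF tree cheap to maintain; this sketch shows both properties on toy band vectors:

```python
def clustering_feature(bands):
    """CF triple (B_i, LS_i, SS_i) for a cluster of band vectors:
    count, element-wise linear sum, and sum of squared components."""
    n = len(bands)
    ls = [sum(col) for col in zip(*bands)]
    ss = sum(x * x for b in bands for x in b)
    return n, ls, ss

def merge_cf(cf1, cf2):
    # Merging two clusters only requires adding their CF triples
    # component-wise; the raw vectors need not be revisited.
    return (cf1[0] + cf2[0],
            [a + b for a, b in zip(cf1[1], cf2[1])],
            cf1[2] + cf2[2])
```

For full CF-tree construction and the final clustering stage, `sklearn.cluster.Birch` provides a ready implementation.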

Experimental Results and Discussions
In order to verify the performance of PRF in the HSI dimensionality reduction task, we selected three publicly available HSI data sets for experimental verification. The effectiveness was evaluated according to the classification accuracy using RBF-SVMs. To verify the effectiveness of PRF, we chose PCA and ORF as comparative dimensionality reduction methods. In addition, two clustering algorithms, K-Means and BIRCH, were selected for comparison in the partitioning strategy. Although each experiment had relatively stable results, in order to avoid accidental bias, the final result was the average of 10 experiments.

Data Sets
To examine the robustness of the proposed method, we selected three data sets collected from different scenes for our experiments.
(1) The Salinas data set was acquired by the AVIRIS sensor over Salinas Valley, California. AVIRIS acquires data in 224 bands of 10 nm width with center wavelengths from 400 to 2500 nm. Only 204 bands remained in the Salinas data set after removing bands covering the region of water absorption: [108-112, 154-167, and 224]. The Salinas data set consists of 512 × 217 pixels with a spatial resolution of 3.7 m/pixel. The Salinas ground-truth contains 16 classes.
(2) The PaviaU data set was collected by the ROSIS sensor over several areas of the University of Pavia and consists of 610 × 610 pixels with a spatial resolution of 1.3 m/pixel. In the PaviaU data set, a large part of the image is not used for the study; that is, many samples did not contain any available information. Thus, the size of the PaviaU data set was reduced to 610 × 340 pixels. The PaviaU data set has 103 available bands of 6 nm width with center wavelengths from 430 to 860 nm. Its ground-truth contains nine classes.
(3) The KSC data set was collected by the AVIRIS sensor over the Kennedy Space Center (KSC), Florida. Therefore, it also has 224 bands with wavelengths ranging from 400 to 2500 nm. There are 176 bands remaining in the KSC data set after the removal of water absorption bands and low-SNR bands: [1-4, 102-116, 151-172, and 218-224]. The size of the data set is 512 × 614 pixels, and the spatial resolution is 18 m/pixel. The KSC ground-truth is divided into 13 classes.
The original information on the three data sets is shown in Table 1. The classes of each data set and their corresponding numbers of samples are shown in Table 2. These data sets were collected by different sensors and have different spectral resolutions, spatial resolutions, and object classes. Therefore, we used these three data sets for experiments to obtain more comprehensive analysis results. As publicly available data sets, they have undergone some general processing; that is, the spectra have been normalized to reflectance, and the low-SNR bands and water absorption bands in each data set have been eliminated, if they existed. In this paper, in order to make the classifier easy to train, the values of the HSIs were standardized using z-score normalization:

x̃ = (x − µ) / σ, (20)

where µ and σ are the mean and standard deviation of the corresponding band. To test the validity of the methods on a small-sample data set, 10% of the entire HSI was randomly selected as the training set, according to the hold-out method, in the Salinas and PaviaU data sets, and the remaining 90% of samples were used as the test set to verify the model. Considering that the sample size of the KSC data set is much smaller than that of the first two data sets, the proportion of the training set for the KSC data set was set at 30%. Although the three data sets were collected from different scenes using different hyperspectral sensors, the processing done on them was similar. As shown in Figure 3, "Processing (1)" was the processing performed during the period from data collection to public access. First, the spectra of the HSI acquired by the hyperspectral sensor were normalized to reflectance. Then, the water absorption bands and/or the low-SNR bands in the obtained HSI cube were eliminated. According to the information provided by the sources of the data sets, only the water absorption bands were removed in the Salinas data set, no bands were removed in the PaviaU data set, and both the water absorption bands and low-SNR bands were removed in the KSC data set.
Figure 3 uses the Salinas data set as an example. It can be seen that, after the 20 water absorption bands are eliminated, 204 bands are left in the data set. "Processing (2)" is the processing performed in this paper. First, the three-dimensional data cube is reshaped into a two-dimensional matrix X ∈ R^(N×B), where N = 512 × 217 and B = 204. Then, the values of each band are standardized using Equation (20), such that each band follows the standard normal distribution. In fact, the data processing flow also includes a third stage: dimensionality reduction using PRF and the division of the training and test sets. This stage is shown in the model structure diagram, Figure 1.
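The reshaping and standardization steps of "Processing (2)" can be sketched as follows; a toy cube stands in for the real 512 × 217 × 204 Salinas cube, and Equation (20) is assumed to be the usual z-score:

```python
import numpy as np

# Toy HSI cube standing in for Salinas (really 512 x 217 pixels, 204 bands).
rng = np.random.default_rng(0)
cube = rng.normal(loc=5.0, scale=2.0, size=(64, 32, 204))

# Reshape the 3-D cube into a 2-D matrix X in R^(N x B).
N, B = cube.shape[0] * cube.shape[1], cube.shape[2]
X = cube.reshape(N, B)

# z-score standardization per band, so that each band has
# zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```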

Verification of Two Assumptions
The addition of the partitioning strategy to the ORF is based on two assumptions. First, a large proportion of the top k bands selected on the basis of importance scores are adjacent in spectral order. Second, when two bands are adjacent, there is a high correlation between them.
After calculating the importance scores of each band in the original data set, the top k bands are selected to form a new, lower-dimensional data set (this is the general processing of the ORF method). When the dimension of an HSI is required to be reduced to 30, the 30 bands with the highest importance scores are selected. The indices of the obtained 30 bands are marked in red in Figure 4. It can be seen from the figure that most of the top 30 bands are arranged continuously, which was the case for all three data sets. Taking the Salinas data set as an example, the top 30 bands were 13-36, 41, and 43-47; the specific ranking of these 30 bands and their corresponding importance scores are shown in Table 3. For generality, Figure 4 also shows the indices (marked in yellow) of the bands with importance scores ranked from 31 to 60. It can be seen that, in the three data sets, most of the bands with scores ranked from 31 to 60 were still continuously arranged. In summary, most of the top k bands obtained by the general processing of the ORF are arranged continuously. The main reason for this may be that HSIs acquire narrow spectral bands over a continuous spectral range, such that the measured values of adjacent bands are similar.

The second assumption is verified in the rest of this section. In fact, it is only necessary to verify that µ_D < ε, where ε is a small real number (i.e., that the difference between max_corr(b_i) and neibor_corr(b_i) is significantly small), without verifying µ_D = 0. The test results, after modifying H_0 and H_1 accordingly, are shown in Table 4.
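The general ORF processing described above, selecting the top k bands by importance score, amounts to a single argsort; the scores below are purely illustrative:

```python
import numpy as np

# Hypothetical per-band importance scores, e.g. as produced by Relief-F.
scores = np.array([0.10, 0.80, 0.75, 0.05, 0.90, 0.20])

# ORF band selection: indices of the k highest-scoring bands.
k = 3
top_k = np.argsort(scores)[::-1][:k]
print(top_k)  # band 4 scores highest, then bands 1 and 2
```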
According to Table 4, there was a high correlation between adjacent bands in two of the data sets (Salinas and PaviaU), which can be seen from the means of neibor_corr(x_i) in these data sets (0.9910 and 0.9971, respectively). Paired-sample t-tests show that the calculated t values of these two data sets fall into the rejection domain (i.e., we reject the null hypotheses and accept the alternative hypotheses). This shows that, at the significance level α = 0.05, the difference between max_corr(x_i) and neibor_corr(x_i) is significantly small (µ_D ≤ 0.01). However, the mean of neibor_corr(x_i) in the KSC data set is 0.5920, which is relatively small compared to the perceived high correlation. The value of the t-distribution statistic falls into the acceptance domain (i.e., the difference between max_corr(x_i) and neibor_corr(x_i) is significantly large; µ_D ≥ 0.2). According to the screening, about 37% of the neibor_corr(x_i) values of all adjacent bands in the KSC data set are greater than 0.98, while about 60% are less than 0.59; that is to say, some of the adjacent bands are highly correlated, while a large number are not.
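The paired-sample t statistic used in these tests can be computed as below. The correlation values are illustrative only, and `delta0` stands for the small threshold (e.g., 0.01) in the one-sided hypotheses H_0: µ_D ≥ delta0 versus H_1: µ_D < delta0:

```python
import numpy as np

# Illustrative per-band correlation values (not the real data):
max_corr    = np.array([0.995, 0.993, 0.990, 0.997, 0.992])
neibor_corr = np.array([0.994, 0.990, 0.989, 0.996, 0.991])

# Paired-sample t statistic for H0: mu_D >= delta0 vs. H1: mu_D < delta0,
# where D = max_corr - neibor_corr and delta0 is the small threshold.
D = max_corr - neibor_corr
delta0 = 0.01
t_stat = (D.mean() - delta0) / (D.std(ddof=1) / np.sqrt(D.size))
# A strongly negative t_stat falls into the rejection domain of this
# left-tailed test, i.e., the difference is significantly small.
```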
Generally speaking, due to the high correlation between adjacent bands, the information they contain is similar. Therefore, selecting bands based only on their importance scores causes the selected bands to contain similar information. A band subset selected in this way can have a large sum of importance scores; however, due to this similarity, it does not actually contain sufficiently diverse information. Therefore, in order to ensure that the final band subset has low redundancy, our partitioning strategy is added, which ensures that the selected bands come from different sub-intervals.
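The idea can be sketched as below. This is only a minimal interpretation of the partitioning strategy, assuming a sub-interval is closed whenever the correlation between consecutive bands drops below a threshold `lam`; the exact rule of Section 3.3 may differ in detail:

```python
import numpy as np

def partition_relief_f(X, scores, lam):
    """Minimal sketch: walk the bands in spectral order, start a new
    sub-interval whenever the correlation between a band and its
    predecessor drops below lam, then keep the band with the highest
    importance score in each sub-interval."""
    n_bands = X.shape[1]
    corr = np.corrcoef(X.T)                 # band-by-band correlations
    intervals, current = [], [0]
    for b in range(1, n_bands):
        if corr[b, b - 1] >= lam:
            current.append(b)               # still the same sub-interval
        else:
            intervals.append(current)       # close it, open a new one
            current = [b]
    intervals.append(current)
    # one representative (highest-scoring) band per sub-interval
    return [max(iv, key=lambda b: scores[b]) for iv in intervals]
```

With highly correlated neighbouring bands, this yields few, wide sub-intervals; with weakly correlated neighbours (as in KSC), it yields many narrow ones.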

The Effectiveness and Advancement of the PRF Method
When using PRF to reduce the dimensionality of HSI data, the threshold λ needs to be set in advance. The value of λ affects the size of each sub-interval and the number of the selected bands. Therefore, this section will first explore the reasonable values of λ.
The maximum correlation coefficient between each band b_i and its adjacent bands (denoted as neibor_corr(b_i)) can be calculated according to Equation (6). Table 4 gives the average values of neibor_corr(b_i) for each data set, which can be used as a reference for the value of λ. Therefore, the threshold value started at 0.98 in the experiments.
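Since Equation (6) is not reproduced here, the sketch below assumes neibor_corr(b_i) is simply the larger of the correlation coefficients between band b_i and its two spectral neighbours (edge bands have only one):

```python
import numpy as np

def neibor_corr(X):
    """For each band b_i (columns of X), return the maximum correlation
    coefficient with its adjacent bands b_{i-1} and b_{i+1}."""
    corr = np.corrcoef(X.T)
    n_bands = X.shape[1]
    out = np.empty(n_bands)
    for i in range(n_bands):
        neighbours = [corr[i, j] for j in (i - 1, i + 1) if 0 <= j < n_bands]
        out[i] = max(neighbours)
    return out
```

The mean of this vector over all bands is the per-data-set quantity reported in Table 4.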
According to Table 5, when (1 − λ) = 0.02, the numbers of bands in the reduced-dimensional Salinas, PaviaU, and KSC data sets were 11, 5, and 31, respectively. From Table 4, it can be seen that the correlation between adjacent bands in the KSC data set was far lower than in the other two data sets. Therefore, although the three data sets shared the same threshold value of 0.98 (i.e., (1 − λ) = 0.02), the number of bands in the reduced-dimensional KSC data set was considerably larger than in the other two data sets. According to the algorithm flow of PRF, the number of sub-intervals increases as the threshold λ increases; accordingly, the number of bands in the dimension-reduced data set increases. The growth of the band number provides the data sets with more abundant information. Thus, the OA on the three data sets improved continually as the value of (1 − λ) was reduced from 0.02 to 0.0001. However, the redundancy of the reduced-dimensional data sets can reach a higher degree when the threshold exceeds a certain level: when (1 − λ) = 0.00001, the number of bands was much larger than before, yet the classification accuracy decreased. Similar results were obtained on all three data sets.

The PRF method partitions the bands using the partitioning strategy given in Section 3.3, where the general process is conducted in order of wavelength from short to long. We reversed the partitioning order and repeated the experiments, in order to explore the variation in the dimensionality reduction effect of the PRF method. The experimental results are shown in Table 6, which demonstrates that there are only slight changes in the results when the experiment is conducted in order of wavelength from long to short. Considering the specific mechanism of the proposed partitioning strategy, which adds bands to sub-intervals in order, the slightly different partitioning results obtained are to be expected.
However, the low correlation between bands from different sub-intervals is definite, which ensures limited variability in the partitioning results. In addition, the representative bands are selected from the sub-intervals according to the importance scores in both situations; thus, no significant difference in the selection of representative bands would be seen unless there was a significant variation in the sub-intervals. The clustering of bands is similar to the partitioning strategy in PRF, in that the clusters obtained by a clustering algorithm are analogous to the sub-intervals obtained by PRF. Therefore, we chose two clustering algorithms, K-Means and BIRCH, for comparison with the proposed partitioning strategy. Both the K-Means and BIRCH algorithms need the number of clusters to be specified in advance, and the representative bands are selected from each cluster, again based on the importance scores. Therefore, the band selection methods based on these two clustering algorithms are denoted as ORF-K-Means and ORF-BIRCH, respectively. The five methods, ORF, PCA, ORF-K-Means, ORF-BIRCH, and PRF, were compared in terms of how the dimensionality reduction effect changes as the number of bands increases, as shown in Figure 5.
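A minimal sketch of ORF-K-Means, under the assumption that each band is treated as a feature vector over the pixels and that scikit-learn's KMeans is used (the paper does not specify the implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def orf_kmeans_select(X, scores, n_bands, seed=0):
    """Cluster the bands of X (pixels x bands) into n_bands clusters,
    then keep the band with the highest importance score per cluster."""
    km = KMeans(n_clusters=n_bands, n_init=10, random_state=seed)
    labels = km.fit_predict(X.T)            # cluster bands, not pixels
    selected = []
    for c in range(n_bands):
        members = np.where(labels == c)[0]
        selected.append(int(members[np.argmax(scores[members])]))
    return sorted(selected)
```

ORF-BIRCH follows the same pattern, with `sklearn.cluster.Birch` in place of KMeans; in both cases the number of clusters plays the role of the target dimensionality.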
On the three data sets, the classification accuracy based on PCA showed an obvious trend of first rising and then falling. The ORF method had extremely poor performance in low dimensions, which was particularly evident on the PaviaU and KSC data sets. As the dimensionality increased, the OA of ORF-based classification continued to improve; however, after the dimensionality reached 40, its OA increased only slowly. When K-Means or BIRCH was used to cluster the bands before band selection, the effect was better than that of ORF. After adding the partitioning strategy, the classification model achieved a high OA in low dimensions, which means that the partitioning strategy can effectively alleviate the lack of diverse information caused by the redundancy of bands. According to Figure 5, PRF had obvious advantages on the Salinas and PaviaU data sets, while its advantage on KSC was relatively weak. According to the test results in Table 4, the correlation between adjacent bands in the Salinas and PaviaU data sets reached a very high level, while the correlation in the KSC data set was relatively low (Salinas: 0.9910; PaviaU: 0.9971; KSC: 0.5920). The partitioning strategy mainly divides the sub-intervals based on the correlation between bands; therefore, it performed better on the Salinas and PaviaU data sets than on the KSC data set.
Classification maps for the three data sets obtained by the different methods are shown in Figures 6a-e, 7a-e, and 8a-e, all of which contain classification noise. Figure 6a-e contains classification maps for the Salinas data set, where the noise is mainly concentrated in two regions belonging to Vineyard-untrained and Grapes-untrained, respectively. Figure 7a-e contains classification maps for the PaviaU data set, where the noise is mainly concentrated in the region belonging to Bare soil. Figure 8a-e contains classification maps for the KSC data set, where the noise is mainly concentrated in three regions belonging to Cabbage palm hammock, Slash pine, and Scrub, respectively. In all three figures, subfigure (e) contains less noise than the other four subfigures; that is, the PRF method performed better than the other four methods.

More specifically, the optimal classifications based on the various methods are compared in Table 7. It can be seen that, compared with the other four methods, the PRF-based classification model obtained the optimal OA on all three data sets. On the PaviaU and KSC data sets, the OA of the classification model based on K-Means (or BIRCH) was better than that based on ORF. In general, the partitioning strategies effectively improved the performance of dimensionality reduction for HSIs, especially the partitioning strategy proposed in Section 3.3.

Finally, the runtimes of the various methods were analyzed, as shown in Table 8. Among the three data sets, the Salinas data set has the largest number of bands and samples, so each method had its longest runtime on it. As the number of bands in the KSC data set is larger than that of the PaviaU data set, the runtimes of the ORF and PRF methods on the KSC data set were longer than those on the PaviaU data set. In general, the runtime of PCA was the shortest, and the runtime of PRF was less than those of K-Means and BIRCH.

Conclusions
In this paper, we proposed a classification model for HSI data sets of small size. Our main focus was to provide a dimensionality reduction method combining the Relief-F method with a partitioning strategy, where the Relief-F method is used to calculate the importance score of each band and the partitioning strategy is used to divide the band interval, in order to reduce the redundancy of the selected bands. The main idea of the partitioning strategy is to divide the entire band interval into sub-intervals whose internal bands are highly redundant, which ultimately ensures that the bands selected from different sub-intervals are less redundant.
The effectiveness of the method provided in this paper was tested by experiments on three data sets. In the experiments, the high correlation between adjacent bands in the HSI data sets was first verified, and the shortcomings of the ORF method in the band selection strategy were analyzed. Compared with the ORF and PCA methods, the advantages of PRF were verified in the HSI data sets with highly correlated adjacent bands. In addition, the K-Means and BIRCH clustering methods were also used for comparison. The experimental results show that the division of the band interval can effectively reduce the redundancy between the selected bands, and that the partitioning strategy proposed in this paper is particularly effective. The method proposed in this paper has not yet been used to mine the spatial information of HSIs. Therefore, we will focus on the integration of spatial information into the feature selection method in future work.
Author Contributions: All the authors made significant contributions to the work. J.R., R.W., and W.W. designed the research, analyzed the results, and accomplished the validation work. G.L., R.F., and Y.W. provided advice for the preparation and revision of the paper. All authors have read and agreed to the published version of the manuscript.