Band Priority Index: A Feature Selection Framework for Hyperspectral Imagery

Abstract: Hyperspectral band selection (BS) aims to select a few informative and distinctive bands to represent the whole image cube. In this paper, an unsupervised BS framework named the band priority index (BPI) is proposed. The basic idea of BPI is to find the bands with large amounts of information and low correlation. Sequential forward search (SFS) is used to avoid an exhaustive search, and the objective function of BPI consists of two parts: the information metric and the correlation metric. We propose a new band correlation metric, namely, the joint correlation coefficient (JCC), to estimate the joint correlation between a single band and multiple bands. JCC uses the angle between a band and the hyperplane determined by a band set to evaluate the correlation between them. To estimate the amount of information, the variance and the entropy are each used as the information metric for BPI. Since BPI is a framework for BS, other information metrics and different mathematical functions of the angle can also be used in the model, which means there are various implementations of BPI. The BPI-based methods have the following advantages: (1) the selected bands are informative and distinctive; (2) the methods usually have good computational efficiency; and (3) the methods have the potential to determine the number of bands to be selected. The experimental results on different real hyperspectral datasets demonstrate that the BPI-based methods are highly efficient and accurate BS methods.


Introduction
Hyperspectral images contain hundreds of bands with a fine spectral resolution, e.g., 0.01 µm, which makes it possible to reduce the overlap between classes and therefore enhances the potential to discriminate subtle spectral differences [1,2]. However, the high dimensionality of the data also brings several problems, such as a heavy computational burden and storage cost. In addition, the high spectral resolution makes the bands highly correlated. Therefore, to process the data effectively, dimensionality reduction (DR) is important and necessary. Dimensionality reduction techniques can be broadly split into two categories: feature extraction and feature selection (i.e., band selection) [3,4]. Feature extraction reduces the data dimensionality by extracting a set of new features from the original ones through some function mapping. For instance, Principal Component Analysis (PCA) [5] is one of the well-known feature extraction methods; others include Nonnegative Matrix Factorization (NMF) [6], Independent Component Analysis (ICA) [7], Local Linear Embedding (LLE) [8], and Maximum Noise Fraction (MNF) [9]. Feature selection reduces the feature space by selecting a subset of the original features. In hyperspectral imagery, band selection (BS) is preferable to feature extraction because BS methods select a subset of bands without losing their physical meaning and thus preserve the relevant original information in the data. Therefore, in this paper, we focus on BS methods.
Band selection has received increasing attention in recent years. BS methods can be roughly divided into two categories: supervised BS [4] and unsupervised BS [10]. Supervised methods try to find the most informative bands with respect to the available prior knowledge, whereas unsupervised methods do not assume any object information. Although prior information about the class labels enables supervised methods to achieve better performance than unsupervised methods, such prior knowledge is often unavailable in practice, in which case supervised BS methods are not suitable. Therefore, there is a need to develop unsupervised BS methods.
To design an unsupervised BS method, three major points should be considered: (1) effective metrics for designing the selection criterion; (2) a suitable subset search strategy that keeps the algorithm efficient; and (3) the number of bands that should be selected. To address these issues, we propose a new BS approach named the Band Priority Index (BPI), which is a model or framework for BS in which different metrics can be used. The BPI model applies the sequential forward search (SFS) strategy [33] to avoid an exhaustive search. With SFS, the desired bands are selected one by one. In each round of lookup, the BPI model computes the score of each unselected band and selects the band with the largest score as the optimal band; the newly selected band is then added to the selected band set, and the next round of lookup begins. The process is repeated until the desired number of bands has been obtained. Once the search strategy has been determined, we need to design a suitable objective function for BPI. The objective function computes the score of each candidate band, and the score denotes the contribution or the priority of the band. Generally, two basic ideas guide the design of the selection criterion of an unsupervised BS method. First, dimensionality reduction almost inevitably results in a loss of information; to minimize the loss, we retain the bands with large amounts of information. Second, the selected bands should have low redundancy with each other, which ensures that the selected band set provides sufficiently useful information for further applications. A good BS method should consider both information and redundancy (usually measured by band correlation). Therefore, the objective function of BPI consists of two parts, the information metric and the correlation metric: the former estimates the amount of information of a candidate band, while the latter measures the joint correlation between the candidate band and the currently selected band set.
The advantages of the BPI model can be summarized as follows: (1) BPI considers the amount of information and the band correlation simultaneously; therefore, the selected bands are useful for further applications such as pixel classification. (2) BPI has good computational efficiency, because the correlation metric can be calculated incrementally using recursive formulas and the calculation of the amount of information is usually not computationally complex. (3) BPI is a model for BS; the metrics in the objective function can be modified or replaced depending on the specific application, which means that BPI has various implementations. (4) The BPI-based methods have the potential to determine the number of bands to be selected.
The remainder of this paper is organized as follows: Section 2 introduces related work associated with the proposed method. Section 3 explains the BPI model in detail. Section 4 presents experiments on three different real-world hyperspectral images. Finally, Section 5 gives some concluding remarks.

OIF
In 1982, Chavez et al. proposed the Optimal Index Factor (OIF) to find the best band combination of a multispectral dataset [13]. The formula computes the optimal index of a combination of n bands:

$$\mathrm{OIF} = \frac{\sum_{i=1}^{n} \sigma_i}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} |R_{i,j}|} \quad (1)$$

$$R_{i,j} = \frac{\sum_{k=1}^{N} (x_{i,k} - \mu_i)(x_{j,k} - \mu_j)}{\sqrt{\sum_{k=1}^{N} (x_{i,k} - \mu_i)^2} \, \sqrt{\sum_{k=1}^{N} (x_{j,k} - \mu_j)^2}} \quad (2)$$

where $\sigma_i$ (i = 1, 2, ..., n) denotes the standard deviation of the ith band $x_i$, and $x_i$ is a column vector with N pixels; $x_{i,k}$ and $x_{j,k}$ denote the kth pixels of $x_i$ and $x_j$, respectively; $\mu_i$ and $\mu_j$ denote the averages of $x_i$ and $x_j$, respectively; and $R_{i,j}$ denotes the correlation coefficient between them. The numerator and denominator of OIF evaluate the amount of information and the band correlation of the band combination, respectively; thus, OIF is consistent with the basic idea of BS that an unsupervised BS method should consider both the amount of information and the band correlation. However, OIF was originally proposed for multispectral images with only seven bands, with n set to 3, so an exhaustive search can be executed rapidly. For a hyperspectral image with hundreds of bands, exhaustive strategies cannot be used because of the huge computational time. Besides, in practical applications, the bands selected by OIF are not always the optimum combination: because the numerator and denominator of OIF are plain sums of standard deviations and correlation coefficients, respectively, OIF is not sufficiently sensitive to the band correlation and is thus likely to select highly correlated bands [14].
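To make the exhaustive nature of OIF concrete, the following is a minimal NumPy sketch, not the original implementation. It assumes the data are arranged as an N × L matrix (pixels by bands) and that absolute values of the correlation coefficients are summed in the denominator; the function names `oif` and `best_oif_combination` and the synthetic data are illustrative only.

```python
from itertools import combinations

import numpy as np


def oif(X, bands):
    """OIF of one band combination: sum of std devs / sum of |pairwise correlations|."""
    sub = X[:, list(bands)]
    sigma = sub.std(axis=0)                    # standard deviation of each band
    R = np.corrcoef(sub, rowvar=False)         # n x n correlation matrix
    upper = np.triu_indices(len(bands), k=1)   # each unordered pair (i < j) once
    return sigma.sum() / np.abs(R[upper]).sum()


def best_oif_combination(X, n=3):
    """Exhaustive search over all n-band combinations (only feasible for few bands)."""
    L = X.shape[1]
    return max(combinations(range(L), n), key=lambda b: oif(X, b))


# Synthetic 7-band example (hypothetical data): band 1 nearly duplicates band 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)
best = best_oif_combination(X, n=3)
```

With 7 bands and n = 3 there are only 35 combinations to score, whereas the number of combinations grows combinatorially with the band count, which is the reason given above for abandoning exhaustive search on hyperspectral data.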

Variants of OIF
To overcome the drawbacks of OIF, some similar indexes have been proposed. For instance, Xijun and Jun [14] proposed a simplified version of OIF that assigns a score, denoted $\mathrm{SOIF}_t$, to each band $x_t$. Unlike OIF, the simplified version uses three adjacent bands to calculate the score of one band, and a specific number of bands with the maximum scores are selected. Compared with OIF, this variant places more weight on the correlation among adjacent bands. However, it only considers the correlation among adjacent bands and neglects that among nonadjacent bands. Because some nonadjacent bands are also likely to be highly correlated, the selected bands may still be highly correlated even though the index selects fewer neighboring bands. In fact, in hyperspectral images adjacent bands are usually highly correlated with each other, so for most bands the denominator of SOIF (i.e., $R_{t-1,t} + R_{t,t+1}$) is almost the same. The value of SOIF is then mainly determined by the numerator, which means that this index also pays insufficient attention to the band correlation. Although other similar indexes have been proposed [34,35], a common drawback of these OIF-based methods is that they cannot properly evaluate the correlation among the bands in the selected band set, so the bands they obtain do not always combine large amounts of information with low correlation.

Band Priority Index
Because an exhaustive search for the optimal solution is computationally prohibitive [36,37], we apply a simple suboptimal search method in this paper, namely, the sequential forward search (SFS) method [33]. SFS is a simple greedy search algorithm that belongs to the heuristic suboptimal search methods. It starts from the empty feature set and repeatedly adds the feature $x$ that maximizes the cost function $f(Y_k + x)$ when combined with the already selected features $Y_k$, until a feature subset of the desired size is obtained. Accordingly, the proposed method selects one band at a time: in each round of lookup, the band that optimizes the objective function is selected and added to the selected band set, and then the next iteration begins. This is repeated until the desired number of bands has been obtained.
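The greedy loop described above can be sketched generically as follows. This is an illustrative skeleton only: the `score` callback stands in for whatever objective function is plugged in, and the toy score used in the example (a value minus a penalty for neighbouring picks) is a made-up placeholder, not the BPI objective.

```python
from typing import Callable, List, Sequence


def sequential_forward_search(
    candidates: Sequence[int],
    score: Callable[[List[int], int], float],
    n: int,
) -> List[int]:
    """Greedy SFS: start from the empty set and add the best-scoring candidate each round."""
    selected: List[int] = []
    remaining = list(candidates)
    while remaining and len(selected) < n:
        # Score every unselected candidate against the current selection.
        best = max(remaining, key=lambda c: score(selected, c))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy objective (hypothetical): intrinsic value minus a penalty per adjacent selected index.
values = {0: 5.0, 1: 4.9, 2: 3.0, 3: 1.0}


def toy_score(selected: List[int], c: int) -> float:
    penalty = sum(4.0 for s in selected if abs(s - c) == 1)
    return values[c] - penalty


picks = sequential_forward_search(list(values), toy_score, n=2)  # -> [0, 2]
```

Each round costs one score evaluation per remaining candidate, so selecting n out of L candidates needs on the order of nL evaluations rather than one per subset.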
Once the search strategy has been determined, we need to design a suitable objective function for the proposed model. In this paper, a new index named the band priority index (BPI) is proposed to evaluate the contribution of the candidate bands. Considering that a good unsupervised BS method should take the amount of information and the band correlation into account simultaneously, and referring to the selection criterion of OIF, we define the objective function of BPI as follows:

$$s_t = c_t \, \delta_t \quad (4)$$

where $s_t$ denotes the score of the tth band $x_t$; $c_t$ denotes the joint band correlation between the band $x_t$ and the currently selected band set; and $\delta_t$ represents the amount of information of $x_t$. The score $s_t$ measures the contribution or the priority of $x_t$: the larger the score, the greater the contribution.
A band with a larger score has a higher priority to be selected; therefore, $s_t$ is called the BPI of the band $x_t$. It should be noted that $c_t$ is inversely related to the band correlation; in other words, the larger the value of $c_t$, the lower the band correlation. Hence, the key issue is to find effective metrics to be used in the model.

Joint Correlation Coefficient
In this section, we propose a new band correlation metric, the joint correlation coefficient (JCC), to estimate the joint correlation between the candidate band and the currently selected band set. JCC is defined as the sine of the angle between the candidate band and the hyperplane spanned by the selected bands. It is derived from the correlation coefficient, and the cosine version of JCC can be regarded as the extension of the correlation coefficient to a high-dimensional space. The correlation coefficient is defined in Equation (2). For a dataset $X = [x_1, x_2, ..., x_L] \in \mathbb{R}^{N \times L}$, where N and L denote the numbers of pixels and bands, respectively, assume that the mean value of each band has been removed. Then, the correlation coefficient between $x_i$ and $x_j$ simplifies to

$$R_{i,j} = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|} = \cos\theta_{i,j} \quad (5)$$

where $x_i^T x_j$ denotes the vector inner product; $\|x_i\|$ denotes the Euclidean norm of $x_i$; and $\theta_{i,j}$ denotes the angle between the bands $x_i$ and $x_j$ (for the correlation coefficient, $\theta_{i,j}$ lies in the interval [0, π] radians). As Figure 1a shows, the correlation coefficient measures the band correlation by computing the angle between two bands. Inspired by this, we extend the correlation coefficient to a high-dimensional space and use the angle between a single band and the hyperplane spanned by other bands to evaluate the correlation among them. Hence, we define JCC as the sine of the angle between a single band and the hyperplane spanned by a set of bands. When only one band has been selected, the cosine version of JCC is exactly the correlation coefficient between the candidate band and the selected band, which is why the cosine version of JCC can be regarded as the extension of the correlation coefficient to the high-dimensional space. Note, however, that the correlation coefficient is a pairwise metric, i.e., it estimates the correlation between two bands, whereas JCC measures the joint correlation between a single band and multiple bands. Thus, in the BPI model, the band correlation is evaluated jointly instead of pairwise.

Figure 1b shows an example in 3-D, in which $x_t$ and $W$ denote a candidate band and the hyperplane spanned by two selected bands, respectively, and $\theta_t$ denotes the angle between them. For the BPI model, we define $\theta_t$ to lie in the interval [0, π/2] radians. Then, as with the correlation coefficient, the larger the value of $\theta_t$, the lower the band correlation. In the worst case, $\theta_t$ equals zero: the band $x_t$ can be expressed linearly by the selected bands, which means $x_t$ is totally linearly correlated with the selected bands and can be regarded as a totally redundant band. In the best case, $\theta_t$ equals ninety degrees: the band $x_t$ is perpendicular to every band in the selected band set, and it is reasonable to consider that $x_t$ has no correlation with the selected bands. Therefore, the angle between the candidate band and the selected band set can estimate the joint correlation between them.

The JCC, or the angle $\theta_t$, can be obtained by computing the orthogonal projection of $x_t$ onto the vector space (or hyperplane) $W$. A vector space is defined as a set that is closed under finite vector addition and scalar multiplication. Suppose that we have obtained k selected bands and the currently selected band set is denoted as $Z = [x_{id(1)}, x_{id(2)}, ..., x_{id(k)}] \in \mathbb{R}^{N \times k}$, where $id(i)$ denotes the index number of the ith selected band. The vector space spanned by the bands in $Z$ can then be expressed as follows:

$$W = \mathrm{span}\{x_{id(1)}, x_{id(2)}, ..., x_{id(k)}\} = \{ Z\alpha \mid \alpha \in \mathbb{R}^{k} \} \quad (6)$$

Then, according to Figure 1b, JCC can be obtained by computing

$$\mathrm{JCC}_t = \sin\theta_t = \frac{\|x^{\perp}\|}{\|x_t\|} \quad (7)$$

where $\mathrm{JCC}_t$ estimates the joint correlation between $x_t$ and $Z$; $x^{\parallel}$ is the orthogonal projection of $x_t$ onto $W$; and $x^{\perp}$ is the orthogonal projection of $x_t$ onto the orthogonal complement of $W$, so that $x_t = x^{\parallel} + x^{\perp}$. The two orthogonal components can be computed by

$$x^{\parallel} = P x_t, \qquad P = Z (Z^T Z)^{-1} Z^T$$

$$x^{\perp} = P^{\perp} x_t, \qquad P^{\perp} = I - P \quad (10)$$

where $I$ is an identity matrix; $P$ is called the orthogonal projector; and $P^{\perp}$ is the orthogonal complement of $P$ [38]. It is worth noting that $P$ (and likewise $P^{\perp}$) is symmetric and idempotent, i.e.,

$$P^T = P, \qquad P^2 = P$$

For simplicity, we use the standardized bands (i.e., the unit vector in the direction of each band) to compute the angle between the candidate band and the selected band set. Assume that the standardized bands are denoted as $\bar{X} = [\bar{x}_1, \bar{x}_2, ..., \bar{x}_L] \in \mathbb{R}^{N \times L}$ and the currently selected band set is denoted as $\bar{Z} = [\bar{x}_{id(1)}, \bar{x}_{id(2)}, ..., \bar{x}_{id(k)}] \in \mathbb{R}^{N \times k}$, with $P^{\perp}$ built from $\bar{Z}$. Since $\|\bar{x}_t\| = 1$, Equation (7) simplifies to

$$\mathrm{JCC}_t = \sin\theta_t = \|P^{\perp} \bar{x}_t\|$$

Hence, we can use the JCC as the correlation metric $c_t$ for the BPI model, i.e.,

$$c_t = \mathrm{JCC}_t = \sin\theta_t$$

Although we choose the JCC as the default correlation metric for the BPI model in this paper, other trigonometric functions of $\theta_t$ (e.g., $\tan\theta_t$) or even the angle $\theta_t$ itself can also be used as the correlation metric in BPI. However, $\cos\theta_t$ cannot be used directly, because $\cos\theta_t$ is very close to 1 once several bands have been selected; in that case, using its inverse as the correlation metric would also make BPI insensitive to the band correlation. Therefore, we do not recommend using $\cos\theta_t$ as the correlation metric without additional processing. When $\tan\theta_t$ or $\theta_t$ is applied as the correlation metric, the results are almost the same as with $\sin\theta_t$, because the three quantities are close to each other when $\theta_t$ is close to zero. Since JCC (i.e., $\sin\theta_t$) is the easiest to compute, we choose it as the default correlation metric.
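As a sanity check on the definition, the sketch below computes JCC directly for a small synthetic matrix. It is an illustrative sketch under stated assumptions: data are stored as an N × L array, a least-squares solve (`numpy.linalg.lstsq`) is used to obtain the projection onto the span of the selected bands instead of forming the N × N projector, and the helper name `jcc_direct` is hypothetical.

```python
import numpy as np


def jcc_direct(X, selected, t):
    """sin(theta_t): norm of the part of unit-norm band t orthogonal to the selected bands."""
    Xc = X - X.mean(axis=0)                  # remove each band's mean
    Xn = Xc / np.linalg.norm(Xc, axis=0)     # standardize: unit Euclidean norm per band
    Z = Xn[:, selected]
    x = Xn[:, t]
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)   # Z @ coef is the projection onto span(Z)
    x_perp = x - Z @ coef
    return float(np.linalg.norm(x_perp))


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2.0 * X[:, 0] - X[:, 1]            # band 3 lies in the span of bands 0 and 1

# One selected band: the cosine version of JCC matches |correlation coefficient|.
jcc_single = jcc_direct(X, [0], 2)
r = np.corrcoef(X[:, 0], X[:, 2])[0, 1]

# Fully redundant band: the angle to the hyperplane is (numerically) zero.
jcc_redundant = jcc_direct(X, [0, 1], 3)
```

The two checks mirror the limiting cases discussed above: with a single selected band the cosine of the angle reduces to the magnitude of the pairwise correlation coefficient, and a band that is a linear combination of the selected bands has a JCC of zero up to rounding error.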

Incremental Calculation of JCC
However, directly computing JCC is impractical, because the projector $P$ (or $P^{\perp}$) is an N × N matrix, where N is the number of pixels and is usually very large, so computing and storing $P$ (or $P^{\perp}$) is unacceptable in practice. Fortunately, JCC can be calculated incrementally using recursive formulas, without computing or storing the projector. As mentioned above, JCC is computed by

$$\sin\theta_t = \|y_t\|, \qquad y_t = P^{\perp} \bar{x}_t$$

where $y_t$ denotes the orthogonal projection of $\bar{x}_t$ onto the orthogonal complement of the vector space $W$, namely, $y_t = x^{\perp}$ (Figure 1b). Assume that the number of desired bands is n; to find all the desired bands, we need to perform n rounds of lookup. For convenience, in the ith round $P^{\perp}$ is denoted as $P_i^{\perp}$, and $y_t$ and $\sin\theta_t$ of the candidate band $\bar{x}_t$ are denoted as follows:

$$y_t^{(i)} = P_i^{\perp} \bar{x}_t, \qquad \sin\theta_t^{(i)} = \|y_t^{(i)}\|$$

After the newly selected band $x_{id(i)}$ has been found and its normalized band $\bar{x}_{id(i)}$ has been added to $\bar{Z}$, the next round of lookup begins. Then, in the (i + 1)th round, for the same candidate band $\bar{x}_t$, we have:

$$y_t^{(i+1)} = y_t^{(i)} - \frac{y_{id(i)}^{(i)} \left(y_{id(i)}^{(i)}\right)^T}{\|y_{id(i)}^{(i)}\|^2} \, y_t^{(i)} \quad (20)$$

which shows that the current $y_t^{(i+1)}$ depends only on $y_t^{(i)}$ and $y_{id(i)}^{(i)}$, both of which were computed and stored in the previous round. Moreover, both $y_t^{(i)}$ and $y_{id(i)}^{(i)}$ are N × 1 vectors and $\|y_{id(i)}^{(i)}\|^2$ is a scalar, so the calculation of Equation (20) only involves low-complexity vector and scalar multiplications. Furthermore, Equation (20) can be rewritten as

$$y_t^{(i+1)} = y_t^{(i)} - \frac{\left(y_{id(i)}^{(i)}\right)^T y_t^{(i)}}{\|y_{id(i)}^{(i)}\|^2} \, y_{id(i)}^{(i)} \quad (21)$$

where $\left(y_{id(i)}^{(i)}\right)^T y_t^{(i)}$ is computed first and is also a scalar. Equation (21) thus avoids generating a high-order (N × N) matrix during the calculation, which saves computing time and storage space. By using Equation (21), it is unnecessary to compute $P$ (or $P^{\perp}$) in each round; the value of JCC (i.e., $\sin\theta_t$) is obtained directly and incrementally, and the computational complexity is reduced significantly.
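The recursive update referred to as Equation (21) is the familiar Gram-Schmidt-style deflation step, and the sketch below checks numerically that it agrees with a direct projection. It is a minimal illustration, assuming an N × L array layout; the helper names `normalize_bands` and `deflate` and the choice of selected bands (2 and 5) are made up for the example.

```python
import numpy as np


def normalize_bands(X):
    """Remove each band's mean and scale it to unit Euclidean norm."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)


def deflate(Y, j):
    """Subtract from every column of Y its component along column j (scalar computed first)."""
    y = Y[:, j].copy()
    coeffs = (y @ Y) / (y @ y)         # one scalar (y^T y_t) / ||y||^2 per band t
    return Y - np.outer(y, coeffs)     # N x L result; no N x N projector is ever formed


rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
Xn = normalize_bands(X)

# Incremental route: deflate by each selected band in turn.
Y = Xn.copy()
for j in (2, 5):
    Y = deflate(Y, j)
sin_incremental = np.linalg.norm(Y, axis=0)

# Direct route: least-squares projection onto the span of the selected bands.
Z = Xn[:, [2, 5]]
coef, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
sin_direct = np.linalg.norm(Xn - Z @ coef, axis=0)
```

Each deflation touches every band once with a dot product and a scaled subtraction, which is where the linear cost in N and L per round comes from.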

Information Metric
The BPI model also requires an effective information metric to evaluate the amount of information of each band. In this paper, we apply two widely used information metrics, namely, the variance and the information entropy, as the information metric $\delta_t$ for the BPI model; the corresponding methods are denoted as BPI-VAR and BPI-EN, respectively.
The variance is the expectation of the squared deviation of a random variable from its mean; it measures how far a set of (random) numbers is spread out from its average value. In hyperspectral remote sensing, the variance is often used to estimate the amount of information of a band, and its value can be regarded, to some extent, as an indicator of classification separability. For a candidate band $x_t$, the variance is defined as follows:

$$\mathrm{var}(x_t) = \frac{1}{N} \sum_{i=1}^{N} (x_{t,i} - \mu_t)^2$$

where $x_{t,i}$ and $\mu_t$ represent the ith pixel and the average of $x_t$, respectively. Then, for the BPI-VAR method, $c_t = \sin\theta_t$ and $\delta_t = \mathrm{var}(x_t)$.
In information theory, the information entropy is defined as the average amount of information produced by a stochastic source of data [39]. It estimates the disorder or uncertainty of a set of variables, and a band with large entropy can be considered a band with a large amount of information. For the band $x_t$, the entropy is computed by

$$e(x_t) = -\sum_{i} p_i \log p_i$$

where $p = [p_1, p_2, ..., p_N]$ is the image histogram of the band $x_t$, normalized as a probability distribution. Hence, for the BPI-EN method, we have $c_t = \sin\theta_t$ and $\delta_t = e(x_t)$.
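The two information metrics can be sketched in a few lines of NumPy. This is an illustration under assumptions that the text does not fix: the variance uses the 1/N (population) form, the histogram uses 256 equal-width bins, and the logarithm is base 2; the function names are hypothetical.

```python
import numpy as np


def band_variance(X):
    """Per-band variance of an N x L array (1/N normalization)."""
    return X.var(axis=0)


def band_entropy(X, bins=256):
    """Per-band Shannon entropy of the normalized histogram (bits; bin count is an assumption)."""
    ent = np.empty(X.shape[1])
    for t in range(X.shape[1]):
        counts, _ = np.histogram(X[:, t], bins=bins)
        p = counts[counts > 0] / counts.sum()    # drop empty bins: 0 * log 0 is taken as 0
        ent[t] = -(p * np.log2(p)).sum()
    return ent


# Tiny example: a constant band, and a band split evenly between two values.
demo = np.column_stack([np.full(4, 7.0), np.array([0.0, 1.0, 0.0, 1.0])])
demo_var = band_variance(demo)    # [0.0, 0.25]
demo_ent = band_entropy(demo)     # [0.0, 1.0]
```

The number of histogram bins affects the absolute entropy values, so when entropy is used as $\delta_t$ the same binning should be applied to every band.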

Number of Selected Bands
In practice, another important issue for BS is determining the number of bands to be selected. Interestingly, the BPI model has the potential to indicate how many bands should be selected. We find that the score of a newly selected band is never larger than the scores of the previously selected bands. Based on this property, the BPI-based methods can determine the number of selected bands. According to Equation (4), the scores of two sequentially selected bands $x_{id(i)}$ and $x_{id(i+1)}$ are measured by:

$$s_{id(i)}^{(i)} = c_{id(i)}^{(i)} \, \delta_{id(i)}, \qquad s_{id(i+1)}^{(i+1)} = c_{id(i+1)}^{(i+1)} \, \delta_{id(i+1)}$$

where $s_{id(i)}^{(i)}$ denotes the score of the ith selected band $x_{id(i)}$. We need to prove that

$$s_{id(i+1)}^{(i+1)} \le s_{id(i)}^{(i)}$$

The two scores have no direct relationship, so we introduce a third variable, $s_{id(i+1)}^{(i)}$, which is the score of band $x_{id(i+1)}$ in the ith round. It is then equivalent to proving that

$$s_{id(i+1)}^{(i+1)} \le s_{id(i+1)}^{(i)} \le s_{id(i)}^{(i)}$$

Obviously, $s_{id(i+1)}^{(i)} \le s_{id(i)}^{(i)}$, because the band $x_{id(i)}$ is the optimal band in the ith round. Hence, we only need to prove that $s_{id(i+1)}^{(i+1)} \le s_{id(i+1)}^{(i)}$; that is, for the same candidate band $x_t$, its score in the current round is no larger than its score in the previous round, $s_t^{(i+1)} \le s_t^{(i)}$. Because $\delta_t$ does not change between rounds, Equation (4) makes this equivalent to proving that $c_t^{(i+1)} \le c_t^{(i)}$; since JCC is used as the correlation metric in this paper, our final goal is to prove $\sin\theta_t^{(i+1)} \le \sin\theta_t^{(i)}$. Taking the squared norm of both sides of Equation (21) gives

$$\|y_t^{(i+1)}\|^2 = \|y_t^{(i)}\|^2 - \frac{\left(\left(y_{id(i)}^{(i)}\right)^T y_t^{(i)}\right)^2}{\|y_{id(i)}^{(i)}\|^2} \le \|y_t^{(i)}\|^2$$

so that $\sin\theta_t^{(i+1)} = \|y_t^{(i+1)}\| \le \|y_t^{(i)}\| = \sin\theta_t^{(i)}$. Therefore, the inequalities above hold, and the scores of the newly selected bands do not increase as the number of iterations increases. This is because, as more bands are included in the selected band set, the remaining bands become more correlated with the selected band set.
When the number of selected bands exceeds a certain size, the score of the newly selected band becomes relatively small, which means that the contribution of additional bands diminishes and adding more bands to the selected band set no longer increases the total amount of information of the band combination significantly. In this case, the BS algorithm can be terminated. For instance, Figure 2 shows the scores of the bands selected by BPI-VAR in each round on a real hyperspectral dataset. The score of each newly selected band is smaller than that of the previously selected band, and the slope of the curve becomes quite small once a sufficient number of bands have been selected, which can be used to determine the number of selected bands. A simple decreasing rate $r(k)$ can be computed for the kth selected band. Then, during the process of BS, when the average rate of three sequentially selected bands is less than a threshold, the BS algorithm can be terminated. In this paper, the threshold is set to 0.05 by default.

Figure 2. The score of the band selected in each round.
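A stopping rule of this kind can be sketched as follows. The exact definition of the decreasing rate is an assumption here: purely for illustration, the sketch uses the relative decrease r(k) = (s(k−1) − s(k)) / s(k−1), and the definition used by the method may differ. The function name `should_stop` and the example score sequences are hypothetical; only the three-band averaging window and the 0.05 default threshold come from the text.

```python
from typing import Sequence


def should_stop(scores: Sequence[float], eps: float = 0.05, window: int = 3) -> bool:
    """True when the mean decreasing rate of the last `window` selected bands is below eps.

    Assumes strictly positive scores; the relative-decrease rate is an illustrative choice.
    """
    if len(scores) < window + 1:
        return False                      # not enough rounds to average over
    recent = range(len(scores) - window, len(scores))
    rates = [(scores[k - 1] - scores[k]) / scores[k - 1] for k in recent]
    return sum(rates) / window < eps


flat_tail = [10.0, 5.0, 4.9, 4.85, 4.8]    # curve has levelled off
still_dropping = [10.0, 5.0, 2.5, 1.25]    # each score halves the previous one
stop_flat = should_stop(flat_tail)          # True
stop_dropping = should_stop(still_dropping) # False
```

Because the scores are non-increasing, every rate is non-negative, so the averaged rate is a reasonable proxy for the slope of the score curve.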

Computational Complexity Analysis
The BPI model also has the advantage of high computational efficiency. The correlation metric JCC can be calculated incrementally using the recursive Equation (21), so this part of the computation is not complex. The cost of computing the amount of information depends on the choice of information metric, and for most information metrics this calculation is also not complex. Here, we use floating point operations (flops) to measure the computational complexity of the proposed methods. The procedure of the BPI-based methods is given in Algorithm 1, which shows that the calculation of the JCC requires only about 3nNL flops in total. Using the variance as the information metric requires about NL flops, and using the entropy requires about 2NL flops. Therefore, the total flops of BPI-VAR and BPI-EN are about 3nNL + NL and 3nNL + 2NL, respectively, i.e., linear in the number of pixels N, the number of bands L, and the number of selected bands n.

Algorithm 1. BPI-based Band Selection
Step 1: Compute the amounts of information of the bands in $X$ and denote them as $D = [\delta_1, \delta_2, ..., \delta_L] \in \mathbb{R}^{L}$.
Step 2: Select the band with the maximum information as the initial selected band, denoted as $x_{id(1)}$. Set the initial selected band set as $\Phi = \{x_{id(1)}\}$.
Step 3: Let $Y = [y_1, y_2, ..., y_L] = \bar{X}$, where $\bar{X}$ denotes the normalized band set of $X$; then set the counter i = 2.
while i < n + 1 (or while the stopping criterion (31) is not met) do
    Step 4: Calculate $c_t$ of the tth normalized band $\bar{x}_t$ (t = 1, 2, ..., L), i.e.,
        $y_t \leftarrow y_t - \dfrac{y_{id(i-1)}^T y_t}{\|y_{id(i-1)}\|^2} \, y_{id(i-1)}, \qquad c_t = \|y_t\|$
    Step 5: Find the band that has the largest score as the optimal band, and add it into $\Phi$, i.e.,
        $id(i) = \arg\max_t \, s_t = \arg\max_t \, c_t \delta_t, \qquad \Phi = \Phi \cup \{x_{id(i)}\}, \qquad i = i + 1$
end while
Note: If the tth band $x_t$ has already been selected, its $y_t$ and $s_t$ are not calculated or compared.
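Putting the steps together, a compact NumPy sketch of a BPI-style selection with the variance as the information metric might look as follows. It is a simplified illustration rather than a reference implementation: it assumes an N × L array with no constant bands, removes band means before normalizing, always selects exactly n bands (no early stopping), and the function name `bpi_select` and the synthetic data are hypothetical.

```python
import numpy as np


def bpi_select(X, n, info=None):
    """Greedy BPI-style selection; returns (selected band indices, their scores)."""
    X = np.asarray(X, dtype=float)
    # Step 1: amount of information per band (variance unless `info` is supplied).
    delta = X.var(axis=0) if info is None else np.asarray(info, dtype=float)
    # Step 3: mean-removed, unit-norm bands (assumes no band is constant).
    Xc = X - X.mean(axis=0)
    Y = Xc / np.linalg.norm(Xc, axis=0)
    # Step 2: the most informative band starts the selection (sin(theta) = 1 for all bands).
    first = int(np.argmax(delta))
    selected, scores = [first], [float(delta[first])]
    while len(selected) < n:
        # Step 4: deflate every band by the most recently selected one.
        y = Y[:, selected[-1]].copy()
        Y = Y - np.outer(y, (y @ Y) / (y @ y))
        c = np.linalg.norm(Y, axis=0)          # JCC = sin(theta_t) per band
        # Step 5: score, skip bands that are already selected, pick the best.
        s = c * delta
        s[selected] = -np.inf
        best = int(np.argmax(s))
        selected.append(best)
        scores.append(float(s[best]))
    return selected, scores


# Synthetic example: band 0 is the most informative, band 1 is almost a copy of it.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)) * np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0])
X[:, 1] = 0.9 * X[:, 0] + 1e-3 * rng.normal(size=300)
selected, scores = bpi_select(X, n=3)
```

In this example band 1 has the second-largest variance but is skipped, because after band 0 is selected its JCC is close to zero; the returned scores are also non-increasing, in line with the property discussed in the section on the number of selected bands.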

Experiments
In this section, we evaluate the performance of the BPI model on three different real-world hyperspectral datasets. Two implementations of BPI, namely, the BPI-VAR and BPI-EN methods, are used in our experiments. For comparison, six different unsupervised BS methods are used: LCMV Band Correlation Constraint (LCMV-BCC) [10], LCMV Band Correlation Minimization (LCMV-BCM) [10], Volume-Gradient-based BS (VGBS) [21], Exemplar Component Analysis (ECA) [28], Manifold Ranking (MR) [20], and the Simplified OIF-based method (SOIF) [14]. Among these methods, LCMV-BCC and LCMV-BCM are classical BS methods; VGBS, ECA and MR are recently proposed state-of-the-art methods; and SOIF has a design idea similar to that of the proposed method. The LCMV-based methods aim to select the bands that best represent the whole image cube, where the representative ability of a candidate band is measured by its correlation with the whole image dataset. VGBS is a geometry-based method that tries to find the band set with the maximum ellipsoid volume; the bands with the maximum volume gradients are removed one by one until the desired number of bands remains. ECA is based on an effective clustering algorithm [29], so it performs quite well in practice. MR is based on several advanced machine learning algorithms, including clustering, clone selection and manifold ranking [20]. SOIF is similar to the BPI model in that it also computes an index for each band, and the bands with the largest scores are selected as the desired bands. The comparison covers three aspects: pixel classification results, band correlation and computing time. Additionally, some tests on the recommended number of selected bands are presented in this section.

Hyperspectral Datasets
(1) Indian Pine Dataset [40]: The first hyperspectral image we used has been researched extensively. The image was collected by the AVIRIS sensor over the Indian Pine region in Northwestern Indiana in 1992, and it has 145 × 145 pixels (about 20 m per pixel) and 220 bands with a wavelength range from 400 to 2500 nm (Figure 3a). In our experiments, bands 1-3, 103-112, 148-165, and 217-220 were removed due to atmospheric water vapor absorption and low signal-to-noise ratio (SNR) [16], leaving 185 valid bands to be used. Of the 16 classes in the image, only nine classes are used in our experiment and the others are removed because of the lack of sufficient samples (Table 1) [16].
(2) Salinas Dataset [41]: The second image was collected by the 224-band AVIRIS sensor over Salinas Valley, California, and is characterized by a high spatial resolution (3.7-m pixels) (Figure 3b). The dataset has a medium size of 512 × 217 pixels, and the spectral range is from 370 to 2507 nm. For this dataset, we discarded the 20 water absorption bands, which were the bands 108-112, 154-167, and 224. In our experiments, all 16 classes in the Salinas dataset are used.
(3) Pavia University Dataset [42]: The third image is a hyperspectral image of the University of Pavia acquired by the ROSIS-3 optical sensor (Figure 3c). The dataset has 103 spectral bands with a spectral range from 0.43 to 0.86 µm. The image size is 610 × 340 with a spatial resolution of about 1.3 m. In the image, nine classes are labeled and used: Asphalt, Meadows, Gravel, Trees, Painted Metal Sheets, Bare Soil, Bitumen, Self-Blocking Bricks, and Shadows [43].

Classification Performance
To evaluate the performance of the different methods, pixel classification is conducted on each of the three hyperspectral images with two different classifiers: K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) [44]. In our experiments, the number of neighbors in KNN is set to 3; for SVM, the Gaussian Radial Basis Function (RBF) is used as the kernel function and the one-against-all scheme [45] is used for multi-class classification. For all three datasets, we randomly select 10% of the samples from each class to construct the training set, and the rest are used for testing. In the following, we discuss the BS results on the different images with respect to the two classifiers. Two kinds of results are shown in this section: the band number-accuracy curves and the averaged accuracy bars (i.e., the average of each accuracy curve). It should be noted that the classification accuracy is defined as the proportion of correctly classified pixels to all the corresponding class pixels in the image.
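The evaluation protocol (a per-class 10% training split followed by 3-nearest-neighbor classification on the selected bands) can be sketched with plain NumPy as below. This is an illustrative stand-in, not the experimental code: the data are two synthetic, well-separated classes, the brute-force KNN uses squared Euclidean distances, the chosen band indices are arbitrary, and the helper names are hypothetical.

```python
import numpy as np


def stratified_split(labels, train_frac=0.1, seed=0):
    """Randomly pick `train_frac` of the samples of each class for training."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[: max(1, int(round(train_frac * idx.size)))])
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(labels.size), train)
    return train, test


def knn_predict(X_train, y_train, X_test, k=3):
    """Brute-force k-NN with majority vote (labels must be non-negative integers)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return np.array([np.bincount(votes).argmax() for votes in y_train[nearest]])


# Synthetic pixels: 2 classes x 200 samples, 4 bands, class 1 shifted by +5 in every band.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
pixels = rng.normal(size=(400, 4)) + 5.0 * labels[:, None]

bands = [0, 2]                                   # pretend these came from a BS method
train, test = stratified_split(labels, train_frac=0.1)
pred = knn_predict(pixels[train][:, bands], labels[train], pixels[test][:, bands], k=3)
overall_acc = float((pred == labels[test]).mean())
per_class_acc = [float((pred[labels[test] == c] == c).mean()) for c in np.unique(labels)]
```

Repeating the split with several seeds and averaging the accuracies would reduce the dependence of the reported numbers on one particular random training set.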
For the Indian Pine dataset, Figure 4 shows that the overall classification accuracies of all the methods increase as the number of selected bands increases. When using SVM, the BPI-EN method shows the best overall performance, followed by MR, ECA, VGBS and BPI-VAR. The performances of VGBS and BPI-VAR are similar to each other. The SOIF, LCMV-BCM and LCMV-BCC methods cannot compete with the two proposed methods, especially when a small number of bands is selected. With the KNN classifier, BPI-EN likewise performs the best, followed by MR, ECA, VGBS and BPI-VAR. For this classifier, VGBS is slightly superior to BPI-VAR. Additionally, the average results over different numbers of selected bands are shown in Figure 4c, from which we can see that BPI-EN obtains the best overall classification performance, followed by MR, ECA, VGBS, BPI-VAR, SOIF and LCMV-BCM, whereas the LCMV-BCC method performs poorly.
For the Salinas dataset, BPI-EN similarly outperforms the other methods. BPI-VAR, ECA, VGBS and MR also perform well (Figure 5). The other methods cannot compete with these methods, especially when small numbers of bands are selected. When using SVM, BPI-EN is always superior to the others, and BPI-VAR, ECA, VGBS and MR show similar performances, while the remaining methods do not perform as well as these four methods. Figure 5c further demonstrates that BPI-EN has the best overall performance, and that the overall performance of BPI-VAR is almost the same as that of ECA and slightly better than those of VGBS and MR. When using KNN, the BPI-EN method still performs the best, followed by BPI-VAR, MR, ECA, VGBS and the others. BPI-EN always performs better than the others, and BPI-VAR also performs much better than the competitors when more than 15 bands are selected. Figure 5c also indicates that the BPI-EN and BPI-VAR methods have the best overall performances for this classifier.
As for the Pavia University dataset, things are a little different (Figure 6). BPI-VAR and VGBS perform the best, followed by MR, BPI-EN, ECA and the other methods. When using the SVM classifier, the BPI-VAR and VGBS methods obtain almost the same classification performances, MR also performs well, and the accuracy of ECA is slightly lower than that of MR and BPI-EN. The remaining methods still cannot compete with these best methods. When using KNN, VGBS and BPI-VAR perform best, followed by MR, BPI-EN and the other methods. ECA performs worse than it does with SVM, while BPI-VAR and BPI-EN still obtain good performances. Figure 6c shows the average results of these methods; for this dataset, BPI-VAR and VGBS rank first, followed by MR and BPI-EN. These four methods outperform the other methods significantly.
After introducing the classification results, we give some in-depth analysis. In general, the proposed methods are more effective than the competitors. BPI-EN obtains the best overall performance among all the methods we used. BPI-VAR also performs well; it performs the best on the Pavia University dataset. This is mainly because the BPI model evaluates the contribution of bands properly: both the band correlation and the amount of information are well considered by the proposed methods. For instance, compared with the similar SOIF method, our proposed methods achieve significant improvements in classification performance. Even compared with state-of-the-art methods such as MR, ECA and VGBS, the BPI-EN method performs better, and BPI-VAR can compete with them, which verifies that the BPI-based methods are effective. We also notice that the classification performance is influenced by the number of selected bands: in general, the performance is better when the band number is larger. Since the purpose of BS is to enhance the computational efficiency and reduce the storage burden at the same time, a BS method should achieve good classification performance with few bands; therefore, if the number of selected bands is not large but the classification performance is satisfactory, the BS method can be considered effective and valuable. The proposed methods (especially the BPI-EN method) have shown satisfactory overall performances in the experiments, and their superiority is more evident when a small number of bands is selected, which demonstrates that the BPI-based methods are valuable in practical applications.
Additionally, we also conduct experiments using the whole image cube, and the results are as follows: Indian Pine [SVM (0.8382) and KNN (0.7449)], Salinas [SVM (0.9406) and KNN (0.8806)] and Pavia University [SVM (0.9396) and KNN (0.8691)]. Comparing the BS methods (Figures 4-6) with the full-band results, we find that the proposed BPI-EN method does not reduce the classification accuracy very much. For each image and classifier, abandoning most of the redundant bands leads to only a small reduction in accuracy (<3%) for the classification task. This indicates that the proposed methods are very effective: although only a limited number of bands are selected, an acceptable performance is achieved. Therefore, the classification experiments on three different datasets verify that the bands selected by the proposed methods are informative for classification, and that the proposed methods are highly accurate BS methods.

Band Correlation
In this section, we evaluate the average band correlation among the bands selected by different methods. We use the average correlation coefficient (ACC) to estimate the overall band correlation among the selected bands. The larger the value of ACC is, the larger the average band correlation is. Table 2 shows the ACC of the fifteen bands selected from different datasets, and the index numbers of the fifteen bands selected from the Indian Pine dataset are listed in Table 3. In Table 2, the bands obtained by SOIF and the two LCMV-based methods are highly correlated, whereas the bands obtained by the other methods have much lower correlation. Furthermore, according to Table 3, most of the bands obtained by the SOIF, LCMV-BCC and LCMV-BCM methods are neighboring bands, whereas the other five methods select fewer neighboring bands. In fact, for hyperspectral images, neighboring bands are usually highly correlated with each other, so methods that select too many neighboring bands cannot ensure that the selected band set has low correlation, and, thus, may yield relatively poor classification performances. The results in Tables 2 and 3 and Figures 4-6 verify this point: the bands selected by the proposed methods, MR, ECA and VGBS have low correlation, and, correspondingly, their classification performances are relatively better than those of the other methods.
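The exact definition of ACC is not spelled out in this section. Assuming it is the mean absolute Pearson correlation coefficient over all pairs of selected bands, it could be computed as in the sketch below; the function name `average_correlation` and the (pixels, bands) array layout are assumptions made for illustration.

```python
import numpy as np

def average_correlation(cube, band_idx):
    """Mean absolute pairwise correlation among the selected bands.

    cube     : array of shape (N pixels, L bands)
    band_idx : indices of the selected bands (at least two)
    """
    sub = np.asarray(cube, dtype=float)[:, band_idx]
    corr = np.corrcoef(sub, rowvar=False)          # bands are columns
    upper = np.triu_indices(len(band_idx), k=1)    # each pair counted once
    return float(np.mean(np.abs(corr[upper])))
```

A set of neighboring, nearly collinear bands gives a value close to 1, whereas a set of mutually uncorrelated bands gives a value close to 0.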

It is worth noting that these results also verify that the SOIF method cannot always consider the band correlation properly. It can be seen in Table 2 that, although SOIF can consider the band correlation to some extent, it does not perform consistently. For instance, when selecting bands from the Indian Pine and Salinas datasets, the selected bands have acceptable correlation, but, when selecting bands from the Pavia University dataset, they have quite high correlation. This occurs because the SOIF method uses the correlation coefficients as the denominator of the objective function, and the correlation coefficients between neighboring bands are often very close to 1. As a result, for most candidate bands, the SOIF denominators are almost the same, and their scores are mainly determined by the amounts of information of the bands. Therefore, the SOIF method sometimes cannot pay sufficient attention to the band correlation, which degrades its performance. As for the two LCMV-based methods, they also select one band at a time, and the band that is most correlated with the whole image cube is regarded as the optimal band. It is easy to see that the LCMV-based methods also do not pay much attention to the correlation among the selected bands; therefore, the bands obtained by these methods are also usually highly correlated.

Computing Time
In this section, we compare the computing time of different methods. The computing time of selecting fifteen bands from different datasets is listed in Table 4, which shows that the proposed methods have good computational efficiencies. Among all the methods, SOIF spends the least time, followed by BPI-VAR and BPI-EN, whose computing time is just a little more than that of SOIF but is always lower than that of the other methods. The SOIF method has a small number of steps, so it has a good computational efficiency. Specifically, the SOIF method only needs to compute the standard deviation of each band and the correlation coefficients of all the adjacent band pairs. All these steps result in only about 2NL flops, where N and L are the numbers of pixels and bands, respectively. As for the proposed methods, the procedures include not only the calculation of the amounts of information but also the computation of JCCs, which results in additional computational complexity. However, due to the adoption of recursive formulas, JCCs can be computed incrementally, which reduces the complexity of the algorithm significantly. For instance, the total computational complexity of BPI-VAR is about 3nNL + NL, which is slightly larger than that of SOIF. Considering that the proposed methods have shown much better classification performances than the SOIF method, a little more time cost is acceptable. Among the methods other than SOIF, the BPI-based methods cost the least time. VGBS needs to compute the covariance matrix of all the bands and perform Singular Value Decomposition (SVD), which results in about NL^2 + L^3 flops, so its computational complexity is higher than that of the BPI-based methods. Although the clustering algorithm applied by ECA is quite efficient compared with most clustering algorithms, its computational burden is still high when compared with the proposed methods. The MR method is quite complicated; it involves clustering, clone selection and manifold ranking, which are all time-consuming. Although the MR method performs quite well in the classification experiments, it costs the most time, which is a significant drawback of this method. The LCMV-based methods need to evaluate the correlation between each candidate band and the whole image cube, which is computationally complex, so they also cost much time. To sum up, these results show that the proposed methods have good computational efficiencies and can obtain the desired bands in a short time.
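The idea of updating the correlation term incrementally, instead of recomputing it from scratch in every iteration, can be illustrated with the sketch below. It is not the paper's recursive formula: it assumes mean-centered bands, variance as the information metric, and a score equal to the variance multiplied by the sine of the angle between a candidate band and the span of the already selected bands. The function name `sfs_select` is hypothetical. Each iteration costs a few passes over the N x L data matrix, because the residual of every candidate band is deflated by a single new orthonormal direction.

```python
import numpy as np

def sfs_select(cube, n_bands):
    """Greedy forward band selection with incrementally updated residuals.

    cube: array of shape (N pixels, L bands). Returns (indices, scores).
    """
    X = np.asarray(cube, dtype=float)
    X = X - X.mean(axis=0)                 # assumption: bands are centered
    norms = np.linalg.norm(X, axis=0)      # band lengths, fixed in the search
    info = X.var(axis=0)                   # variance as information metric
    R = X.copy()                           # residuals w.r.t. the selected set
    chosen = np.zeros(X.shape[1], dtype=bool)
    selected, scores = [], []
    for _ in range(n_bands):
        sine = np.linalg.norm(R, axis=0) / np.maximum(norms, 1e-12)
        score = info * sine                # assumed form of the objective
        score[chosen] = -np.inf            # never pick a band twice
        best = int(np.argmax(score))
        r_norm = np.linalg.norm(R[:, best])
        if r_norm < 1e-12:                 # nothing new left to add
            break
        chosen[best] = True
        selected.append(best)
        scores.append(float(score[best]))
        q = R[:, best] / r_norm            # new orthonormal direction
        R -= np.outer(q, q @ R)            # deflate all residuals at once
    return selected, scores
```

With this update rule the score of the selected band cannot increase from one iteration to the next, since adding a band to the selected set can only shrink the residual of every remaining candidate.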

Number of Selected Bands
In practice, it is difficult to determine the number of bands to be selected; a reasonable way is to choose a number of bands close to the number of classes in the dataset. Generally, the number of classes can be determined by using the virtual dimensionality (VD) estimation approach proposed in [46], but this results in additional computational burden, and the number of classes is sometimes not well estimated because it is also difficult to choose suitable values for the parameters in VD. Therefore, a BS method that can determine the number of bands properly without increasing the computational complexity very much is of great value. Interestingly, the proposed BPI-based methods have the potential to determine the number of selected bands automatically. Most importantly, the parameters in our strategy for determining the number of selected bands can be set easily, and this process causes little additional computational burden; therefore, the BPI model is valuable for practical applications.
In this section, we use the strategy described in Section 3.4 as the stopping criterion for the proposed BS algorithm and test the recommended number of selected bands. For all three datasets, we set the parameters to 0.05, and the recommended numbers (n) of selected bands are listed in Table 5, from which we can see that the recommended numbers are suitable. Furthermore, taking the BPI-VAR method as an example, the scores of the bands selected from the different datasets in each iteration are illustrated in Figure 7, from which we can see that the score curves have clear inflection points and that the slope of each curve becomes quite small once a sufficient number of bands has been selected. When the number exceeds the recommended number, the scores of the newly selected bands are relatively small and can be neglected when compared with the scores of the first several selected bands. Considering that the scores of bands also denote the contribution of bands, it is reasonable to consider that the remaining bands cannot supply much additional information for the current band combination, and, thus, the BS algorithm can be stopped.
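The exact stopping rule is the one given in Section 3.4 and is not reproduced here. As a rough illustration of the behavior described above, the sketch below assumes one simple variant: the search stops as soon as the score of a newly selected band falls below a fixed fraction (0.05 here) of the first band's score. The function name `recommended_band_count` and this particular rule are assumptions, not the paper's definition.

```python
def recommended_band_count(scores, threshold=0.05):
    """Number of bands to keep, given the per-iteration scores.

    Assumed rule: stop at the first band whose score is below
    `threshold` times the score of the first selected band.
    """
    if not scores:
        return 0
    first = scores[0]
    for i, score in enumerate(scores):
        if score < threshold * first:
            return i          # bands 0 .. i-1 are kept
    return len(scores)        # threshold never reached
```

For example, with scores `[10, 6, 3, 0.4, 0.1]` the fourth score is below 5% of the first, so three bands are recommended.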

Summary
From all the experiments on three different hyperspectral datasets, some important results can be summarized. In band selection, both the amount of information and the band correlation should be considered. The BPI model can find a good trade-off between the amount of information and the band correlation, and the experimental results have verified that the bands obtained by the proposed methods are informative and distinctive, and therefore achieve a satisfactory performance. In our experiments, the overall performance of BPI-EN is better than that of the other methods, even when compared with state-of-the-art methods such as MR, VGBS and ECA. BPI-VAR also shows satisfactory performances; its performance is close to that of the VGBS method and much better than that of the other competitive methods excluding MR and ECA. It is worth noting that BPI-EN performs slightly better than BPI-VAR overall, which demonstrates that the choice of information metric influences the performance of the BPI method. Additionally, the proposed methods produce good and stable classification performances on all three datasets, which demonstrates that the proposed methods have good robustness. Furthermore, the BPI model has a good computational efficiency, and the experimental results verify that the BPI-based methods can obtain the desired bands in a short time. Finally, the BPI model also has the potential to determine a suitable number of bands to be selected; the recommended number can be regarded as a reference value for the number of selected bands. In conclusion, the effectiveness of BPI has been verified.

Conclusions
In this paper, a Band Priority Index (BPI) model for hyperspectral feature selection is proposed to effectively find a diverse band combination that contains discriminative and informative bands for hyperspectral image analysis. The BPI model adopts the SFS strategy, so the desired bands are obtained one by one. A new objective function is designed for BPI, and it consists of two parts: the information metric and the correlation metric. To evaluate the correlation between a candidate band and the selected band set, we propose a new correlation metric named the joint correlation coefficient (JCC), which is defined as the sine of the angle between the candidate band and the hyperplane determined by the selected bands. JCC can estimate the band correlation between a single band and multiple bands jointly instead of pairwise. The variance and entropy are, respectively, chosen as the information metric for BPI, and thus, we give two implementations of BPI, i.e., the BPI-VAR and BPI-EN methods. Experimental results on three different datasets demonstrate that the BPI-based methods are highly efficient and accurate BS methods. Moreover, the BPI-based methods have the potential to determine the number of bands to be selected. Finally, our future research interest is to find effective information metrics to improve the BPI model's performance.
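The geometric definition of JCC recalled above (the sine of the angle between a candidate band and the subspace spanned by the selected bands) can be evaluated directly with a least-squares projection, as in the following sketch. The function name `joint_correlation_sine` is hypothetical, and whether the bands are mean-centered beforehand is left to the caller; the paper's own recursive computation is not reproduced here.

```python
import numpy as np

def joint_correlation_sine(candidate, selected):
    """Sine of the angle between `candidate` and span(columns of `selected`).

    candidate : vector of length N (one band, flattened)
    selected  : array of shape (N, k), one selected band per column
    """
    x = np.asarray(candidate, dtype=float)
    S = np.asarray(selected, dtype=float)
    coef, *_ = np.linalg.lstsq(S, x, rcond=None)   # best fit of x within span(S)
    residual = x - S @ coef                        # component orthogonal to span(S)
    return float(np.linalg.norm(residual) / np.linalg.norm(x))
```

Using 3 by 1 vectors as in Figure 1, a candidate lying in the plane of two selected bands gives a value of 0, while a candidate perpendicular to that plane gives a value of 1.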

Figure 1. The geometric explanation of the correlation coefficient and the new correlation metric: (a) the correlation coefficient; and (b) the joint correlation metric (here, for illustration convenience, assume that each band is a 3 by 1 vector).

Figure 2. The scores of the bands selected by BPI-VAR from Pavia University dataset (Section 4.1).

Figure 7. Scores of the selected bands obtained by the BPI-VAR method: (a) Indian Pine dataset; (b) Salinas dataset; and (c) Pavia University dataset.

Table 1. Number of samples for ground objects in Indian Pine dataset.

Table 2. Correlation of the fifteen bands selected from different datasets.

Table 3. Fifteen bands selected by different methods for the Indian Pine dataset.

Table 4. Computing time of selecting fifteen bands from different datasets.