Representative Band Selection for Hyperspectral Image Classification

The high dimensionality of hyperspectral images (HSIs) brings great difficulty for their later data processing. Band selection, as a commonly used dimension reduction technique, is the selection of optimal band combinations from the original bands, while attempting to remove the redundancy between bands and maintain a good classification ability. In this study, a novel hybrid filter-wrapper band selection method is proposed by a three-step strategy, i.e., band subset decomposition, band selection and band optimization. Based on the information gain (IG) and the spectral curve of the hyperspectral dataset, the band subset decomposition technique is improved, and a random selection strategy is suggested. The implementation of the first two steps addresses the problem of reducing inter-band redundancy. An optimization strategy based on a gray wolf optimizer (GWO) ensures that the selected band combination has a good classification ability. The classification performance of the selected band combination is verified on the Indian Pines, Pavia University and Salinas hyperspectral datasets with the aid of support vector machine (SVM) with a five-fold cross-validation. By comparing the proposed IG-GWO method with five state-of-the-art band selection approaches, the superiority of the proposed method for HSIs classification is experimentally demonstrated on three well-known hyperspectral datasets.


Introduction
Remotely sensed hyperspectral data which is collected by hyperspectral sensors consist of hundreds of contiguous spectral bands with high resolutions [1,2]. Recent studies in remote sensing have shown that hyperspectral images (HSIs) have been successfully applied in precision agriculture, urban planning, environmental monitoring and various other fields [3][4][5][6]. One of the typical characteristics of an HSI is that it cannot only obtain the scene information in the two-dimensional space of the target image but can also acquire one-dimensional spectral information with a high resolution to characterize the physical property. However, the superiority of HSIs with high resolution occurs at the expense of their vast amount of data and their high-dimensionality property (often including hundreds of bands). Furthermore, the great deal of information can also deteriorate the accuracy of classification, due to the redundant and noisy bands [7,8]. Therefore, it is of great importance to introduce new dimensionality-reduction methods that readily utilize the information resources of HSIs.
A large number of dimensionality reduction (DR) approaches have been developed in the past decades [9][10][11][12][13][14][15][16][17][18][19][20]. The existing dimensionality reduction methods can be commonly partitioned into the two branches of feature/band extraction and feature/band selection. The former projects the original high-dimensional HSIs data into a lower-dimensional space to reduce the number of dimensions through various transformations [16]. Typical algorithms include locally linear embedding (LLE) [17], to select the informative features by unlabeled samples [45]. Based on Ward's linkage and MI, a hierarchical clustering method was introduced to reduce the original bands [20]. This approach was termed WaLuMI. Maximum information and minimum redundancy (MIMR) defines a criterion that maximizes the amount of information for the selected band combination while removing redundant information [23]. The bands with low redundancy were selected by MI for HSIs classification in these methods. However, choosing a certain number of bands to minimize the correlation among them is a time-consuming optimization issue [46]. Information gain (IG), also known as the Kullback-Leibler divergence, can measure how much information the features could contribute to the classification. In a previous study [47], IG was applied to rank the text features to achieve text recognition. Intuitively, the features with larger IG values are considered informative and momentous contributions to classification. Koonsanit [48] proposed an integrated PCA and IG method for hyperspectral band selection that significantly reduced the computational complexity. Qian et al. [15] used IG as the similarity function to measure the similarity between the two bands, and then an AP clustering algorithm was used to cluster the initial band to select the optimal band subset.
Compared with the wrapper method, the filter-based method is independent of the subsequent learning algorithms during the band selection process and has the advantages of a low computational complexity and strong generalization capability [40,41]. However, there are still two issues that need to be addressed: (1) the representative bands can be quickly obtained in terms of the score of each band in the filter-based method, but the correlation between the selected bands may be high; (2) some bands show a significant and indispensable effect on HSIs classification when combined with other bands. However, their score may not be high, so they may be abandoned in error.
On the basis of the above analysis, it is highly necessary to develop a hybrid filter-wrapper feature selection method. In this paper, the filter method based on IG is designed to replace the predictor wrapped on the search algorithm in the wrapped method. The experimental results show that this attempt achieves a reasonable compromise between efficiency (computing time) and effectiveness (classification accuracy actualized by the selected bands).
Furthermore, inspired by the concept of the adaptive subspace decomposition approach (ASD) [49], an improved subspace decomposition approach has been newly defined. As described previously [49], the entire spectral bands were first partitioned into band subsets in line with the correlation coefficient of adjacent bands. Then, bands were selected from each subset according to the quantity of information or class separability. Nevertheless, ASD did not overcome the limitations of the traditional subspace decomposition method, which only used correlation coefficients to partition the band space. This leads to the problem that subspaces are difficult to partition because of their multiple minimum points. Thus, we suggest an improved subspace decomposition technique in which band space is decomposed by calculating the value of IG along with the visualization result of the HSIs spectral curve. Moreover, the selection of the same number of bands from each band subset ensures that each band subset makes the same contribution to the final classification results.
In this study, a novel hybrid filter-wrapper band selection method is introduced based on an improved subspace decomposition method and GWO optimization algorithm, and it is implemented through a three-step strategy: decomposition, selection and optimization. Extensive experimental results over the three hyperspectral datasets presented here clearly prove the effectiveness of the proposed method (IG-GWO), offering a solution to the aforementioned limitations and excelling compared to the other state-of-the-art band selection methods.

Information Gain (IG)
In machine learning and information theory, IG is a synonym for Kullback-Leibler divergence, which was introduced by Kullback and Leibler [47]. As a feature selection technique, IG has been widely applied to select a slice of important features for text classification. The larger the IG of a feature is, the more information it contains [48]. The computation of IG is based on entropy and conditional entropy.
Given a hyperspectral dataset HD = {x 1 , x 2 , · · · , x n } with l bands {B 1 , B 2 , · · · , B l } and q classes {C 1 , C 2 , . . . , C q }, the IG of band B i can be mathematically defined as below: where E(C) denotes the entropy of dataset HD, and E(C|B) is the conditional entropy. E(C) and E(C|B i ) are expressed as follows: where |·| denotes the cardinality of the set, and N Bm indicates the number of pixels in the m-th set that was obtained by using the band B i to partition the hyperspectral dataset into k different sets via their intensity value G. Assuming a hyperspectral image with a radiometric resolution of 8 bits, the intensity value G of pixels in the image will have 256 gray levels; N j Bm is the number of pixels belonging to the j-th class in the m-th set.
The band with the highest IG value is characterized by better classification performance. However, it is possible that the bands with high IG values will concentrate on a band subinterval. In this case, a flawed classification result will be obtained because the adjacent bands generally have a strong correlation. In addition, we do not know how to properly pre-assign the threshold to select them. Consequently, it is difficult to directly select such bands from the original bands to reduce the dimension of the hyperspectral dataset by using IG values only.

Gray Wolf Optimizer (GWO)
Very recently, a more powerful optimization method, the gray wolf optimizer, was proposed by Mirjalili [30]. This algorithm simulated the living and hunting behavior of a group of gray wolves. All gray wolves were divided hierarchically into four groups, denoted in turn as alpha wolf, beta wolf, delta wolf and omega wolf. In the first group, there was only a gray wolf, the alpha wolf, which was the leader of the others and controlled the whole wolf pack. The beta wolf only obeyed the alpha wolf in the second group and helped the alpha wolf to make some decisions and commanded the rest of the wolves in the two lower levels. The alpha wolf would be replaced by the beta wolf if the alpha wolf passed away. The gray wolves in the fourth group were named omega wolves and were responsible for collecting useful information to submit to the alpha wolf and beta wolf. The remainders were called delta wolves. The hunting procedure of gray wolves was carried out in three steps: tracking, pursuing and attacking the prey.
The social hierarchy and hunting procedure can be mathematically described as follows: α: The fittest solution; β: The second-best solution; δ: The third-best solution; ω: The remaining candidate solutions.
The mathematical model of encircling the prey can be described as where t denotes the current iteration; D represents the distance between the gray wolf and the prey; A and C are coefficient vectors; X p is the position vector of the prey and X the position vector of a gray wolf. The symbol. represents the corresponding multiplication of each component of two vectors. The coefficient vectors A and C can be obtained by where components of a linearly decrease from 2 to 0 over the course of iterations; r 1 and r 2 are random vectors. The mathematical model of hunting is Repeating Equations (4)-(10), one can eventually compute the best solution X α . It should be noted that this theory is valid in a continuous case. If we slightly revise Equation (10) by rounding the operation, this algorithm can be used in a discrete case.
The GWO algorithm is summarized by the following steps: Step 1 Initialize the gray wolf population GW i (= 1, 2, . . . , N); parameters a ; maximum iterations; Step 2 Select the alpha, beta and delta wolves from the population by calculating fitness; Step 3 Update the position of each wolf by Equation (10); Step 4 Compute Equations (4)-(9) and go to step 2; Step 5 Output alpha until the iterations reach their maximum or the choice of the same alpha wolf twice in succession is satisfied.
To select the most distinguished band combination for increasing classification accuracy, it is important to choose/define the fitness function in step 2. The information gain (IG) is adopted as the fitness function in this study. Unlike the adoption of overall accuracy (OA) as fitness, one of the advantages of adopting IG as fitness is to effectively reduce the computation time of the GWO method since the computation of IG has nothing to do with the classification results.
Provided that there are s bands in a band combination (gray wolf), the IG value can be computed by Equation (11): where IG(C, B i ) is the information gain value of i-th band obtained from Equations (1)-(3).

The Proposed Band Selection Method
In this study, we integrate band subset decomposition, band selection and a GWO optimization algorithm to select the optimal band combination. The framework of the proposal is described in Figure 1. On the basis of c different subsets, one can obviously select the bands with high IG values from each subset. However, this may lead to the loss of substantial information. Alternately, we randomly select k different bands from each band subset to constitute a band combination in a process that is also known as gray wolf, ( , … , , … … , , … , ). The component in gray wolf denotes the selected j-th band from i-th subset. This ensures that each band subset has the same contribution to the classification result. It is notable that the acquired band combination has less band redundancy, but this does not mean that it has good classification performance. Thus, the GWO algorithm is employed to optimize it to provide a satisfactory classification performance. The GWO algorithm is chosen as the optimization method because the GWO algorithm shows very good performance in exploration, local minima avoidance, and exploitation, simultaneously [30].
Regarding optimization: suppose that there are N gray wolves in the initial population. The IG value of each gray wolf in the initial population is computed by using Equations (1)- (3) and (11). One then performs a descending sort of the IG values, with the largest in front. The wolves that correspond to the three largest IG values are called the alpha wolf, beta wolf and delta wolf, one after the other. The rest are designated as omega wolves. In the absence of information about the location of the prey, the position vector of the prey is replaced approximately by the position of the alpha wolf. The following supposes that each of the gray wolves knows the location of prey and encircles it [30,33].
Encircling the prey during the hunt can be performed by Equations (4)-(7) and setting a = 2. After encircling the prey, each wolf should quickly update its position relative to the locations of the alpha, beta and delta wolves in order to enclose the prey. The hunting process is accomplished by computing Equations (8)- (10).
The proposed band selection algorithm IG-GWO can be summarized as follows: Input: Hyperspectral dataset; parameters and maximum iterations. Output: The most informative feature, subset .
Step 1: Divide the initial bands into c subsets by their IG value and the spectral curve of the hyperspectral dataset.
Step 2: Randomly select k different bands from each subset to constitute a gray wolf.
Step 3: Initialize the population in GWO by repeatedly performing step 2 N times.
Step 4: Call the GWO algorithm mentioned in Section 2.2.

Experimental Setup and Description of the Dataset
To assess the performance of the proposed method, three typical hyperspectral datasets, Indian Pines, Pavia University and Salinas [50], were chosen for our experiments. A support vector machine with five-fold cross-validation (SVM-5) is applied to classify the selected three datasets. Classification is actualized through the one-against-all approach, and the Gaussian radial basis function (RBF) kernel is adopted as its kernel function in SVM-5. The optimal parameters C and γ of Band subset decomposition is the partitioning of the original bands into different subsets so that bands in the same subset are similar to each other, and bands from different subsets are dissimilar. With this process, bands in different subsets enjoy different classification performances. Obviously, it is arduous to actualize band subset decomposition by only using the IG value. Therefore, in this work, we introduce an improved subset decomposition technique by simultaneously considering the spectral curve of the hyperspectral dataset and local minimal value method.
Specifically, for a given HSI, the information gain of each band is first calculated by Equations (1)-(3). The number c of band subsets is determined by the line chart of information gain in the spectral curve of the HSI. Finally, one can properly partition all bands into c different band subsets.
Regarding band selection, the spectral curve of the hyperspectral data set implies that the spectral curves of each pixel in a certain interval are similar in shape and only different from each other in radiance. This means that bands that are located in the same interval have a similar distinguishing capacity for different land surface features. Consequently, it is logical to replace all bands in the same subset with several selected bands in an attempt to maintain similar classification results. This process implies that some of the redundant bands are abandoned.
On the basis of c different subsets, one can obviously select the bands with high IG values from each subset. However, this may lead to the loss of substantial information. Alternately, we randomly select k different bands from each band subset to constitute a band combination in a process that is also known as gray wolf, (b 11 , . . . , b 1k , . . . . . . , b c1 , . . . , b ck ). The component b ij in gray wolf denotes the selected j-th band from i-th subset. This ensures that each band subset has the same contribution to the classification result. It is notable that the acquired band combination has less band redundancy, but this does not mean that it has good classification performance. Thus, the GWO algorithm is employed to optimize it to provide a satisfactory classification performance. The GWO algorithm is chosen as the optimization method because the GWO algorithm shows very good performance in exploration, local minima avoidance, and exploitation, simultaneously [30].
Regarding optimization: suppose that there are N gray wolves in the initial population. The IG value of each gray wolf in the initial population is computed by using Equations (1)-(3) and (11). One then performs a descending sort of the IG values, with the largest in front. The wolves that correspond to the three largest IG values are called the alpha wolf, beta wolf and delta wolf, one after the other. The rest are designated as omega wolves. In the absence of information about the location of the prey, the position vector of the prey is replaced approximately by the position of the alpha wolf. The following supposes that each of the gray wolves knows the location of prey and encircles it [30,33].
Encircling the prey during the hunt can be performed by Equations (4)- (7) and setting a = 2.
After encircling the prey, each wolf should quickly update its position relative to the locations of the alpha, beta and delta wolves in order to enclose the prey. The hunting process is accomplished by computing Equations (8)- (10).
The proposed band selection algorithm IG-GWO can be summarized as follows: Input: Hyperspectral dataset; parameters a and maximum iterations. Output: The most informative feature, subset α.
Step 1: Divide the initial bands into c subsets by their IG value and the spectral curve of the hyperspectral dataset.
Step 2: Randomly select k different bands from each subset to constitute a gray wolf.
Step 3: Initialize the population in GWO by repeatedly performing step 2 N times.
Step 4: Call the GWO algorithm mentioned in Section 2.2.

Experimental Setup and Description of the Dataset
To assess the performance of the proposed method, three typical hyperspectral datasets, Indian Pines, Pavia University and Salinas [50], were chosen for our experiments. A support vector machine with five-fold cross-validation (SVM-5) is applied to classify the selected three datasets. Classification is actualized through the one-against-all approach, and the Gaussian radial basis function (RBF) kernel is adopted as its kernel function in SVM-5. The optimal parameters C and γ of the RBF kernel are determined via five-fold cross-validation. TMI-CSA (clonal selection algorithm) [45], Lscore [44], STMIGR [45], WaLuMI [20] and MIMR-CSA (clonal selection algorithm) [23] are used as the comparison methods. The reason for choosing these five algorithms is that the same classifiers, labeling proportions and precision evaluation indicators are used in these five algorithms with the suggested methods.
In all experiments, 5%, 10%, 15% and 20% of instances in each class are randomly labeled to compose the training sets. The rest are to be classified directly. In order to reduce the influence of random labeling on the classification result, all experiments are conducted over 30 independent runs with a random choice of training set. The average value and standard deviation of classification accuracy are reported in the following tables. The initial population in the GWO algorithm consists of N = 30 gray wolves. The maximum number of iterations is set to 50 in the GWO algorithm.
To quantitatively appraise the classification result of the hyperspectral dataset, the three universal indices [51] of overall accuracy (OA), average accuracy (AA), and kappa coefficient (KC) are adopted in this study.
(1) Overall accuracy (OA) refers to the number of correctly classified instances divided by the total number of testing samples; (2) Average accuracy (AA) is a measure of the mean value of the classification accuracies of all classes; (3) The kappa coefficient (KC) is a statistical measurement of consistency between the ground truth map and the final classification map.
The Indian pines image [50] was collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor at the Indian pine test station located in northwestern Indiana in 1992. It contains 145 × 145 pixels and 200 available spectral reflectance bands (eliminating the 20 water absorption bands) in the wavelength range from 0.4 to 2.5 µm. The spatial resolution is 20 m/pixel. The 16 classes of vegetation and forests are contained in the image, as well as the specific types of ground objects that can be found in the legend (Figure 2). A three-feature false-color composite image and its ground truth image are displayed in Figure 2a,b, respectively.
The Pavia University image was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over Pavia, Northern Italy [50]. Each band image comprises 610 × 340 samples with a spatial resolution of 1.3 m/pixel. In each band scene, there are nine classes of representative urban ground objects. After removing 12 noisy spectral bands, 103 usable bands are retained for the following research. A three-feature false-color composite image and ground truth image are separately exhibited (Figure 3a,b).   The Salinas scene was also gathered by the AVIRIS sensor over Salinas Valley, California (Figure 3). The image has a spatial dimension of 512 × 217 pixels and a spatial resolution of 3.7 m/pixel. The AVIRIS sensor generates 224 spectral bands (eliminating 20 water absorption bands), and 204 bands remain for the subsequent experiment. The Salinas image consists of 16 classes of ground objects, covering vegetables, fields, bare soils, and so on [50]. A three-feature false-color composite image and its ground truth image are shown (Figure 4a,b).
In the second lines in Tables 2-4, the number of the optimal bands is the product of the number k and the number of band subsets. The numbers in the first columns in Tables 2-4 are sampling ratios.
The classification results of the Indian Pines dataset that were obtained by the proposed method are reported in Table 2. The best classification by OA achieves an accuracy of 86.15% ± 0.7% (42 optimized bands and a labeling ratio of 20%). The classification results over OA and KC become increasingly better as the number of the selected bands increases. When the number of the selected bands ranges from 12 to 42, the three indices all exhibit a slow increase (approximately 2.5%) and no notable increase. This may be because the redundancy among the bands is increased as the number of selected bands increases. The classification results are unfavorable when six bands are selected because too much band information is lost. Some of the main features of the three datasets are summarized in Table 1.

Classification Results.
Based on the information gain and spectral curve of three hyperspectral datasets ( Figure 5), one can properly decompose the Indian Pines dataset into six band subsets, which are (1~36), (37~58), (59~78), (79~105), (106~146) and (147~200). The Pavia University dataset can be divided into two band subsets, which are (1~73) and (74~103). The Salinas dataset may be separated into four band subsets, which are (1~29), (30~115), (116~151) and (152~204). Based on a specific dataset and its band subset decomposition, k different bands in each subset are randomly selected to consist of a gray wolf (k = 1, 2, ..., 7 for the Indian Pines and Salinas datasets and k = 3, 6, 9, 12, 15, 18 for the Pavia University dataset). Then, the most informative band subset is outputted by applying the proposed IG-GWO algorithm.  The satisfactory classification results on the Pavia University dataset are listed in Table 3. The classification accuracy over OA (94.88% ± 0.2%) indicates that the pixels in this dataset are almost correctly classified by the proposed method. This means that more informative bands can be selected by the proposed IG-GWO algorithm. Three indices have little variety and represent a stable tendency as the number of the selected bands increases from 18 to 36.
The classification results on the Salinas dataset are shown in Table 4. From the top-left to the lower right in Table 4, the three indices all present a growing trend. This pattern occurs because the classification accuracy increases as the number of selected bands and the sampling proportion increase. However, it is clear that there is a small difference in the classification results between 20 In the second lines in Tables 2-4, the number of the optimal bands is the product of the number k and the number of band subsets. The numbers in the first columns in Tables 2-4 are sampling ratios.  The classification results of the Indian Pines dataset that were obtained by the proposed method are reported in Table 2. The best classification by OA achieves an accuracy of 86.15% ± 0.7% (42 optimized bands and a labeling ratio of 20%). The classification results over OA and KC become increasingly better as the number of the selected bands increases. When the number of the selected bands ranges from 12 to 42, the three indices all exhibit a slow increase (approximately 2.5%) and no notable increase. This may be because the redundancy among the bands is increased as the number of selected bands increases. The classification results are unfavorable when six bands are selected because too much band information is lost.
The satisfactory classification results on the Pavia University dataset are listed in Table 3. The classification accuracy over OA (94.88% ± 0.2%) indicates that the pixels in this dataset are almost correctly classified by the proposed method. This means that more informative bands can be selected by the proposed IG-GWO algorithm. Three indices have little variety and represent a stable tendency as the number of the selected bands increases from 18 to 36.
The classification results on the Salinas dataset are shown in Table 4. From the top-left to the lower right in Table 4, the three indices all present a growing trend. This pattern occurs because the classification accuracy increases as the number of selected bands and the sampling proportion increase. However, it is clear that there is a small difference in the classification results between 20 bands and 28 bands. This result again confirms that the distinctive bands can be selected from the original bands by our proposal. This better result proves the effectiveness of the proposed method.

Comparative Analysis of Classification Results
To fairly compare our proposal with the other five band selection methods, the same random sampling proportion of 20% and classifiers are taken for the comparison experiments. In the other five approaches, the number of the selected bands is 30 for the Indian Pines and Salinas datasets and 20 for the Pavia University dataset. Based on the band subset decomposition and band selection strategy, the exact band number could not be obtained in our proposal. Accordingly, band number 18 for the Pavia University dataset and band number 28 for the Salinas dataset are adopted and are close to those of the other five methods. The comparison results of the six algorithms for the three datasets are listed in Tables 5-7. One can easily see from Tables 5-7 that the proposal outperforms the other five algorithms on the three datasets. The positive classification results have not been obtained by the six algorithms on the Indian Pines dataset, partially due to the complexity of the ground-truth. However, better classification results have been achieved on the Salinas dataset by these six methods since all of the classification accuracies over OA and AA are greater than 90%. The MIMR-CSA method is more competitive than the other four methods. As shown in Tables 5 and 7, the classification results on the Indian Pines and Salinas datasets show few differences between the MIMR-CSA method and our proposal. In addition, the TMI-CSA, Lscore, and STMIGR methods cannot effectively classify the Indian Pines and Pavia University datasets, as shown in Tables 5 and 6. It should be noted that excellent results can still be obtained by the proposed method in the absence of two bands in the Pavia University and Salinas datasets.
The classification results of each class on the three datasets are presented in Figure 6. For most classes, the classification accuracies that are obtained by our proposal are greater than or equal to those that are provided by the other five algorithms. Figure 6c clearly explains why the values of index AA are greater than those of OA on the Salinas dataset since 14 out of 16 classes are almost correctly classified. The TMI-CSA and STMIGR methods result in dreadful classification results, such as the 7th class (grass/pasture-mowed) and 9th class (Oats) in the Indian Pines dataset and the 3rd class (gravel) and 7th class (bitumen) in the Pavia University dataset.

Sensitivity and Computing Time Analysis
In what follows, for simplicity, we will take the Indian Pines dataset as an example in order to analyze the variation of the average classification accuracies obtained by the six algorithms as the number of selected features increases from 10 to 100. It can be seen in Figure 7 that our algorithm achieves better classification results when the number of the selected bands is less than or equal to In Medjahed et al. [33], the authors directly applied the GWO algorithm to select optimal candidate bands by defining five objective functions by their fitness. Based on the selected bands, they used a K nearest neighbor (KNN) classifier to perform classification. The best classification result (OA) that they provided was 73.67% on Indian Pines, 88.17% on Pavia University and 95.38% on Salinas. We here have not compared our results with theirs in Tables 5-7 since different classifiers are used. Furthermore, the authors did not clearly point out how many bands were selected for their final classification results. Regarding Tables 2-4 and in the case of the same sampling of 10% of the pixels, the superiority of the proposed method is obvious.

Sensitivity and Computing Time Analysis
In what follows, for simplicity, we will take the Indian Pines dataset as an example in order to analyze the variation of the average classification accuracies obtained by the six algorithms as the number of selected features increases from 10 to 100. It can be seen in Figure 7 that our algorithm achieves better classification results when the number of the selected bands is less than or equal to 40. This indicates that the most informative bands can be found by the proposed method. The classification accuracies of the six algorithms all show an upward trend as the number of selected bands increases. However, due to the sharp increase in redundancy between bands, the classification accuracy does not show a notable increase as the number of selected bands multiplies.   When OA and IG are adopted as fitness functions in the optimization process, Figure 9 shows the classification results that are obtained by using the IG-GWO method and single GWO algorithm. Based on these two fitness functions, the proposed method acquires a better classification accuracy.  When OA and IG are adopted as fitness functions in the optimization process, Figure 9 shows the classification results that are obtained by using the IG-GWO method and single GWO algorithm. Based on these two fitness functions, the proposed method acquires a better classification accuracy. The classification difference between the two fitness functions is less than 2%. Starting from the selection of the bands with high IG values, the GWO algorithm using both OA and IG as fitness is directly applied to search for the optimal band combination. As a result, their classification results are lower than those obtained by the IG-GWO method. The GWO algorithm using OA as fitness shows instability, which may be because the optimization algorithm is terminated before the optimal band combination is found. After all, for a data set with 200 bands, there are too many band combinations. GWO with IG as fitness provides poor classification results because there is much redundancy in the resulting band combinations, which leads to the decline of classification ability. This fully illustrates the necessity of carrying out subset decomposition in the proposed method.  When OA and IG are adopted as fitness functions in the optimization process, Figure 9 shows the classification results that are obtained by using the IG-GWO method and single GWO algorithm. Based on these two fitness functions, the proposed method acquires a better classification accuracy. The classification difference between the two fitness functions is less than 2%. Starting from the selection of the bands with high IG values, the GWO algorithm using both OA and IG as fitness is directly applied to search for the optimal band combination. As a result, their classification results are lower than those obtained by the IG-GWO method. The GWO algorithm using OA as fitness shows instability, which may be because the optimization algorithm is terminated before the optimal band combination is found. After all, for a data set with 200 bands, there are too many band combinations. GWO with IG as fitness provides poor classification results because there is much redundancy in the resulting band combinations, which leads to the decline of classification ability. This fully illustrates the necessity of carrying out subset decomposition in the proposed method. As shown in Figure 10, the computing time of the algorithm adopting IG as the fitness function is much less than that with OA. Theoretically, if OA is used as the fitness function, the GWO algorithm needs to implement the classification algorithm (SVM) in each iteration, which undoubtedly increases the computation time of the algorithm. This result is consistent with the theoretical analysis.  It is necessary to analyze the effect of parameter changes on classification accuracy. In the proposal, parameter a controls the optimization range and the convergence rate of the GWO As shown in Figure 10, the computing time of the algorithm adopting IG as the fitness function is much less than that with OA. Theoretically, if OA is used as the fitness function, the GWO algorithm needs to implement the classification algorithm (SVM) in each iteration, which undoubtedly increases the computation time of the algorithm. This result is consistent with the theoretical analysis.
It is necessary to analyze the effect of parameter changes on classification accuracy. In the proposal, parameter a controls the optimization range and the convergence rate of the GWO algorithm. A small a value means that the optimization range of GWO is smaller and the convergence speed is faster. Figure 11 records the variability classification accuracies with different parameters for a. There is approximately a 1% difference between the maximum classification accuracy and the minimum classification accuracy. This shows, to a certain extent, that the algorithm proposed in this paper is relatively stable.  It is necessary to analyze the effect of parameter changes on classification accuracy. In the proposal, parameter a controls the optimization range and the convergence rate of the GWO algorithm. A small a value means that the optimization range of GWO is smaller and the convergence speed is faster. Figure 11 records the variability classification accuracies with different parameters for a. There is approximately a 1% difference between the maximum classification accuracy and the minimum classification accuracy. This shows, to a certain extent, that the algorithm proposed in this paper is relatively stable.

Conclusions
In this work, a novel hybrid filter-wrapper band selection method is introduced by a three-step strategy that is based on the improved band subset decomposition method and the GWO optimization algorithm. For the selected band combination, the first two steps attempt to reduce the redundancy between the spectral bands, and the last step ensures that the combination has a good classification performance. The proposed method inherits the advantages of a high classification accuracy based on the wrapper method and a fast computing speed based on the filter algorithm. The experimental results reveal that this attempt achieves a reasonable compromise between computing time and classification accuracy. The comparative experimental results on three typical hyperspectral datasets successfully confirm that the proposal is more powerful than several existing approaches.

Conclusions
In this work, a novel hybrid filter-wrapper band selection method is introduced by a three-step strategy that is based on the improved band subset decomposition method and the GWO optimization algorithm. For the selected band combination, the first two steps attempt to reduce the redundancy between the spectral bands, and the last step ensures that the combination has a good classification performance. The proposed method inherits the advantages of a high classification accuracy based on the wrapper method and a fast computing speed based on the filter algorithm. The experimental results reveal that this attempt achieves a reasonable compromise between computing time and classification accuracy. The comparative experimental results on three typical hyperspectral datasets successfully confirm that the proposal is more powerful than several existing approaches.
Author Contributions: F.L. and F.X. conceived and designed the experiments; C.L. performed the experiments; L.K. and F.L. analyzed the data and developed the graphs and tables; F.L. and F.X. wrote the paper.

Conflicts of Interest:
The authors declare no conflict of interest.