Hyperspectral Image Classification with Multi-Scale Feature Extraction

Spectral features cannot effectively reflect the differences among the ground objects and distinguish their boundaries in hyperspectral image (HSI) classification. Multi-scale feature extraction can solve this problem and improve the accuracy of HSI classification. The Gaussian pyramid can effectively decompose HSI into multi-scale structures, and efficiently extract features of different scales by stepwise filtering and downsampling. Therefore, this paper proposed a Gaussian pyramid based multi-scale feature extraction (MSFE) classification method for HSI. First, the HSI is decomposed into several Gaussian pyramids to extract multi-scale features. Second, we construct probability maps in each layer of the Gaussian pyramid and employ edge-preserving filtering (EPF) algorithms to further optimize the details. Finally, the final classification map is acquired by a majority voting method. Compared with other spectral-spatial classification methods, the proposed method can not only extract the characteristics of different scales, but also can better preserve detailed structures and the edge regions of the image. Experiments performed on three real hyperspectral datasets show that the proposed method can achieve competitive classification accuracy.

Numerous techniques have been developed for HSI classification.As one of the typical classifiers, the support vector machine (SVM) [13][14][15] has been proven to be effective for hyperspectral imagery.By learning an optimal decision hyperplane, SVM can effectively separate training samples in the high dimensional feature space [16].In addition, other HSI classification methods have been proposed, such as multinomial logistic regression (MLR) [17][18][19], neural networks [20][21][22], kernel-based techniques [23,24], and active learning [25,26].A detailed discussion about advanced spectral classifiers and their performance can be found in [27].In addition, sparse representation (SR) [28,29] has been proven to be a useful tool for HSI classification providing a compact representation in a dictionary as a combination of linear atoms.The sparse representation has been extended to HSI classification in [30,31].However, HSIs not only contain rich ground image information, but also have abundant spectral information.These methods only consider spectral information, ignoring the spatial correlation of neighboring pixels.Therefore, various spatial-spectral classification methods have been developed [32][33][34].For example, Kang et al. proposed a spectral-spatial classification method based on edge-preserving filtering (EPF) [35].EPF has been a research hotspot in image processing and computer vision in recent years, which not only has the function of image smoothing, but also can import spatial structure information into the input image.Moreover, a novel spectral-spatial classification for HSIs is proposed [36], which attempts to utilize the within-class similarity between the training and test samples, while reducing the between-class interference.A composite kernel approach can also effectively combine the spectral-spatial information of each pixel [37,38].Considering that different scales regions contain complementary but interconnected information for classification, Fang et al. proposed a multiscale adaptive sparse representation (MASR) model [39].The MASR makes full use of multiple scales spatial information through an adaptive sparse strategy.In other works, effective feature extraction algorithms [40,41] and multifeature fusion [42,43] techniques have been developed in which the spectral-spatial characteristics of different materials in the image scene are more effectively represented.Peng et al. proposed a region kernel to measure the region-to-region distance similarity for HSI classification [44].The region kernel is designed to be a linear combination of multiscale box kernels, which can handle the HSI regions with arbitrary shape and size.These methods consider the spatial context information of the pixel and surrounding pixels while using the spectral information of the ground object for classification.Therefore, they can effectively reduce the influence of the phenomenon of the same object with different spectrum and different objects with the same spectrum on the classification accuracy.Meanwhile, the classification accuracy can also be greatly improved.
Based on the above works, many researchers have made further explorations on the extraction of spatial-spectral features.Therefore, different kinds of multi-scale feature extraction methods were proposed to improve the accuracy of HSIs classification.For instance, Zhao et al. proposed a deep learning framework for extracting deep learning features using multi-scale convolutional auto-encoder [45].In addition, morphological and neurological methods were used to extract the spectral-spatial features to classify the hyperspectral images of urban areas [46].By further studying the multi-scale feature extraction method [47,48], we found that the spectral features between pairs of pixels exhibit different spectral separability in multi-scale structures.Therefore, this paper proposed a multi-scale feature extraction (MSFE) method for HSI classification.In the proposed method, the Gaussian pyramid decomposition is employed to decompose the original HSI into a multi-scale structure.Then, we use the EPF method for image smoothing in each layer of the Gaussian pyramid.Finally, in order to get better classification results, we use the majority voting method to determine the label of the pixel.Experiments on three well-known HSI datasets demonstrate that the proposed method can obtained high classification accuracy.
The remainder of this paper is structured as follows: First, Section 2 introduces the proposed MSFE methods.Experimental results and analysis are reported in Section 3. Finally, conclusions are discussed in Section 4.

Proposed Classification Framework
For the purpose of improving the separability of the hyperspectral features of different features, we choose Gaussian pyramid for multi-scale feature extraction.Figure 1 shows the schematic diagram of the proposed classification method, which consists of the following major steps: First, the HSI is decomposed into several Gaussian pyramids to extract multi-scale features.Second, we construct and optimize probability maps in each layer of the Gaussian pyramid.Finally, the final classification map is acquired by the majority voting method.

Feature Extraction Based on Gaussian Pyramid Decomposition
Gaussian pyramid decomposition [49] is a series of low-pass sampling filters for images.The filtered image has a lower resolution and sampling density than the previous image.The Gaussian pyramids can be used to scale images on a single scale for multi-scale analysis of images.In this paper, the Gaussian pyramid decomposition operation is performed on the original HSI.Let the original image, O 1 , be used as the first layer of the Gaussian pyramid.Then, O 1 is convolved with a Gaussian kernel and downsampled (or even delete rows and columns).The resulting image, O 2 , is generated after O 1 is low-pass filtered and downsampled.Specifically, in the down-sampling process, when the pixels in rows and columns of an image are removed, the gray value of the pixel on the spatial position (m,n) in the l-th layer of the image can be calculated by an iterative algorithm as follows: where * is the convolution operation.l ∈ {2, 3, • • • , L} and L is the total number of layers in the Gaussian pyramid.The F(u, v) is a (2c + 1) × (2c + 1) Gaussian window that can be defined as: where τ is the variance of the Gaussian filter.The Gaussian pyramid is composed of a series of images {O 1 , O 2 , • • • , O L } which are generated from the above convolution and down-sampling operations.Except for the first layer, the nearest neighbor interpolation method is used to normalize each layer of the gaussian pyramid until the space size becomes the same as the first layer.In this way, multi-scale features will be combined for the subsequent processing step.The normalized Gaussian pyramid, E, can be calculated as follows: where L (which is the total number of the layers in the Gaussian pyramid) is determined by the size of the HSI, and O represents the input data.

Probability Maps Construction and Optimization
The probability maps are constructed and optimized in each layer of the normalized Gaussian pyramid.The initial classification result, C, can be obtained by SVM classifiers in each layer of the normalized Gaussian pyramid.The initial classification result, C, can be expressed in the form of a probability map, in which each pixel i is assigned to a label C i .The initial probability map in the lth layer can be represented as probability maps i.e., P = (P 1 ,P 2 , • • • , P c ), where P i,c ∈ [0, 1] is the probability that a pixel, i, belongs to the cth class.Then, the probability P i,c is denoted as The pixel-wise classification based on SVM does not consider the spatial information of HSIs.All probability values are either zero or one.The initial probability maps contain many noisy elements, and they are not aligned with real object boundaries.Therefore, we employ the EPF to optimize the initial probability maps.Specifically, the optimized probabilities can be denoted as a weighted average of their neighborhood probabilities where i and m represent the ith and mth pixels, respectively.The filtering weight, W, is chosen to preserves edges of a specified guidance image, M. Therefore, the filtering weight, W, and the guidance image, M, can be solved by the following two steps.First, the filtering weight is obtained by the guided filtering method in EPF [35].The guided filter is a filtering algorithm based on a local linear model.Assuming that M is the guidance image, Q is the filtering output, and s is the filtering size, then, in a local window, w, with a size of (2s+1) × (2s+1), Q can always be expressed as a linear transformation of M as follows: where the a m and b m are the linear transformation parameters.Take the derivative of both sides of this equation and we get ∇Q ≈ a∇M; that is, when M has an edge structure at a certain position, the same position of Q will lead to the similar edge structure.To determine the coefficients (a m , b m ), the energy function can be denoted as follows: where ζ (which is a regularization parameter) determines the degree of the blurring for the guided filter.By solving the energy function, the final filtering result is obtained by combining the edge information of M with the pixel information of the input image O.In addition, the output image and guide image must satisfy the local linear model.Q needs to be close to the input image, O.The local linear model in Equation ( 6) can also be expressed as a weighted sum form (5). The filtering weight W i,m (M) for the guided filter can be represented as follows: where ω i and ω m are local windows with pixel i and pixel m, respectively.µ k and σ a represent the mean and variance of the guided image, M, in ω k .|ω| represents the number of pixels in ω k .If M i and M m are on the same side of the edge, the sign of 8) is positive.Conversely, if M i and M m are on different sides, the sign of the term is negative.In other words, the filtering weight on the same side of the edge is higher; otherwise, it is lower.Based on this principle, the probability values corresponding to the pixels on the same side of the edge in the guidance image are usually similar to the filtering output.Second, we employ the gray-scale guidance image as the guidance image for EPF.The original HSI is firstly decomposed by principal component analysis (PCA) [50].The first principal component with the largest variance is used as the guidance image of the EPF.This can give an optimal representation of the HSI in the mean square and retain as much significant information as possible.
After filtering the probability maps, the label of pixel i can be simply selected in a probability maximization manner.The purpose is to use the SVM supervised classifier to convert the probability map into the initial classification result map.Therefore, each layer of the Gaussian pyramid obtains an initial classification map.Finally, the final classification map is obtained by the majority voting method as follows: where c i,t means that pixel i is marked with label c for t times in each initial classification map of the Gaussian pyramids.(2) The University of Pavia image which was captured over an urban area surrounding the University of Pavia, Italy.This data set was recorded by Reflective Optics System Imaging Spectrometer (ROSIS).The size of the image is 610 × 340 × 115.The spatial resolution of this image is 1.3 m per pixel, and the spectral coverage is ranging from 0.43 to 0.86 µm.In the experiment, 12 water absorption bands were discarded.Figure 3a-c  ), (e) the EMP method (67.93%), (f) the EPF method (61.32%),(g) the IFRF method (66.82%),(h) the JSR method (63.69%), (i) the SCMK method (77.36%), (j) the MASR method (61.57%), and (k) the MSFE method (81.35%).

Comparisons with Other Approaches
In the experiments, three quality indexes, i.e., overall accuracy (OA), average accuracy (AA), and Kappa coefficient are used to impersonally evaluate classification results.In this section, the proposed MSFE method is compared with the SVM [13], extended morphological profiles (EMP) [51], EPF [35], image fusion and recursive filtering (IFRF) [34], joint sparse representation (JSR) [30], superpixel-based classification via multiple kernels (SCMK) [52], and MASR [39] methods.The implementation of other comparisons methods is through the default parameters given by the authors.Here, the SVM algorithm utilizes the library for support vector machines with a Gaussian kernels.The EMP exploits the spatial information, which can improve the estimation.The EPF not only has the function of image smoothing, but also can incorporate spatial structure information into the input image.The IFRF method is to divide the hyperspectral image into multiple subsets of adjacent hyperspectral bands and then, by averaging, the bands in each subset.Finally, the fusion band is processed using transform domain recursive filtering to obtain the resulting features for classification.The JSR utilizes a joint sparsity model in which hyperspectral pixels are concentrated near test pixels.The SCMK effectively utilize the spectral information and spatial structure of superpixels through multiple kernels.The MASR makes effective use of multi-scale spatial information based on adaptive sparse strategy.The experiments of all methods are repeated ten times and samples were randomly selected in each experiments for the purpose of obtaining the average accuracy and standard deviation.The details of the number of training samples and test samples are displayed in Tables 1-3.The classification maps of different classification methods are shown in Figures 2-4, and corresponding OA values are also shown.As can be seen, the proposed method achieves the highest classification accuracy.
For the Indian Pines data set, only six labeled samples are randomly chosen as training samples, and the remaining are used as test samples, which can be clearly observed in Table 1.In particular, the number in parentheses represents the standard deviation of the number of trainings.The corresponding reference data and the classification maps for different methods are shown in Figure 2. It can be seen that the classification performance of the SVM is not very good compared with other comparison methods (e.g., EMP, EPF, IFRF, JSR, SCMK, MASR, and MSFE).It can be explained by the fact that SVM uses only spectral information, thus, being more susceptible to the occurrence of noise when estimating the pixels.Although the classification accuracy of the EMP method has improved, there is still a large amount of classification noise.It does take into account spatial information, but can better utilize the spectral information.Other methods (e.g., EPF, IFRF, JSR, SCMK, and MASR) employed spatial-spectral features and their classification accuracies are further enhanced.But they cannot well preserve the edge details.The proposed method can achieve an overall accuracy of 73.92%, which demonstrates its advantage.In addition, for the Corn_N class the proposed method was 29.12% more accurate than SVM classification.We can see that the proposed method, not only effectively extracts the multi-scale features, but also generates a smoother appearance in homogeneous regions, being superior to other methods.However, the data from Tables 1-3 can be analyzed.When the number of classified samples is very small, this method is not in line with expectations.
For the University of Pavia and Salinas data sets, six labeled samples are also chosen randomly as training samples, and the remaining are used as test samples, which are shown in Tables 2 and 3. Similarly, the numbers in parentheses represent the standard deviation of the number of trainings.The corresponding reference data and the classification accuracies for different methods are shown in Figures 3 and 4, respectively.For the two experimental results, it can be seen that the classification accuracy of the proposed method is always better than other comparison methods in terms of OA, AA, and Kappa.As shown in Table 2, it can observed that the classification accuracy of the proposed method reaches 81.35%, and obtained higher accuracies than other methods for the classes of Meadows, Gravel, and Bare soil.In addition, similar classification performance was obtained in the Salinas data set experiment.For example, the OA for SVM is just 79.56%, and the OA for the proposed MSFE method reached 92.08%.Moreover, the AA and Kappa are also increased rapidly.The classification results can effectively prove the superiority of the proposed method.These two strategies, feature extraction at different scales and further use of spatial context information in the image to optimize the probability of pixel-by-pixel spectral classification estimation, play an important role in improving classification performance.

Parameter Analysis
In this section, the analysis of three parameters of the proposed method, i.e., L (the number of layers in the Gaussian pyramid), s (the filtering size), and ζ (blur degree) is performed.In these experiments, the number of training and test samples for the Indian Pines, Pavia University, and Salinas data sets are shown in Tables 1-3, respectively.The experiments are performed by using MATLAB on a notebook computer.
First, the influence of the filtering size, s, and the blur degree, ζ, on the classification performance are show in Figure 5 In the experiments on Indian Pines and Pavia University data sets, s varies from 1 to 9 and ζ varies from 0 to 1, while for the Salinas data set, s varies from 12 to 20 and ζ varies from 0 to 1.It can be seen that if s and ζ are too large, the OA of the method will decline greatly.That is, too-large s and ζ will cause the image to be smoothed out considerably, and some small-scale targets will be misclassified.When the s and ζ are too small, the performance of the method also decreased significantly.The reason for this result is that edge preserving filtering is a local filtering process, and the smaller the window, the smaller the range of spatial information considered in probability optimization.s and ζ should take the appropriate value.Therefore, the parameters s and ζ setting as (s = 7, ζ = 0.4), (s = 4, ζ = 0.1), and (s = 17, ζ = 0) for the Indian Pines, Pavia University, Salinas data sets, respectively, in this paper can obtain the highest OA.
Then, the influence of L (which varies from 1 to 10) to the performance is shown in Figure 6.It can be seen that the accuracy of the proposed method increases gradually as L increases.This means that the proposed method can effectively extract the multi-scale characteristics of HSI for classification.For the proposed method, L is separately set to 6, 9, and 5 as the default parameters for the three data sets, respectively.

Classification Results with Different Numbers of Training Samples
The classification performance to eight methods with different training samples on the three data sets is shown in Figure 7.In the three figures, the number of training samples for all methods increased from 6 to 60.It can be observed that when the number of training samples changes, the proposed method significantly improved the classification accuracies in comparison with other approaches.

Conclusions
In this paper, the MSFE method was presented for the classification of HSIs.This method contains two important strategies: First, the Gaussian pyramid decomposition strategy was employed to achieve multi-scale analysis of images.The proposed method can capture spatial texture features of different scales, and thus targets of different scales can be classified more effectively.Second, by using the EPF algorithm to optimize the probability map, the proposed method can produce a smoother appearance in a homogeneous regions.Therefore, the details and the near-edge area of the image can be better identified.Furthermore, experimental results demonstrate the proposed method outperforms other classification methods in term of accuracy, especially when the number of training samples is quite limited.In future work, multi-scale feature fusion will be integrated into the MSFE to further improve classification accuracy.In addition, we can further solve the problem of classification accuracy loss when the number of samples is small.

( 1 )
The Indian Pines image was captured in 1992, by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor.It captures the agricultural Indian Pine test site of North-western Indiana.The size of the image is 145 × 145 × 220.The spatial resolution of this image is 20 m per pixel, and the spectral coverage ranges from 0.4 to 2.5 µm.In the experiment, 20 water absorption bands (nos.104-108, 150-163, and 220) were discarded.Figure 2a-c shows the false-color composite and the corresponding reference data of the Indian Pines image.
) The Salinas image was also captured by the AVIRIS sensor over Salinas Valley, California.The size of the image is 512 × 217 × 224.The geometric resolution of this image is 3.7 m per pixel.Similar to the Indian Pines image, 20 water absorption bands (nos.108-112, 154-167, and 224) were discarded.The false-color composite and the corresponding reference data of the Salinas image are shown in Figure 4a-c.

Figure 5 .Figure 6 .
Figure 5. Analysis of the effect of the parameters s (the filtering size) and ζ (blur degree).(a) Indian Pines dataset.(b) University of Pavia dataset.(c) Salinas dataset.

Table 1 .
Classification Accuracy (in percentage) Obtained by the SVM, EMP, EPF, IFRF, JSR, SCMK, MASR, and MSFE Methods.Class-specific accuracy values are in percentage.The number of training samples for each class is six for the Indian Pines data set (in %).

Table 2 .
Classification Accuracy (in percentage) Obtained by the SVM, EMP, EPF, IFRF, JSR, SCMK, MASR, and MSFE Methods.Class-specific accuracy values are in percentage.The number of training samples for each class is six for the University of Pavia data set (in %).

Table 3 .
Classification Accuracy (in percentage) Obtained by the SVM, EMP, EPF, IFRF, JSR, SCMK, MASR, and MSFE Methods.Class-specific accuracy values are in percentage.The number of training samples for each class is six for the Salinas data set (in %).