1. Introduction
The emergence of hyperspectral remote sensing technology is a breakthrough in the field of remote sensing [1,2,3]. Hyperspectral sensors with dozens or even hundreds of spectral bands can capture abundant spectral and spatial information about the Earth’s surface, which facilitates further research on and analysis of the land cover of interest. Consequently, hyperspectral images (HSIs) with detailed spectral and spatial information have been successfully applied in various fields, such as environmental monitoring [4], land management [5], target detection [6], urban area planning [7], and precision agriculture [8]. However, it is not easy to classify HSIs effectively and efficiently because of their large data volume and the complex distribution of ground objects, especially when training samples are limited.
A large number of existing studies have shown that deep learning-based classification methods perform well and have achieved great success in computer vision, image processing and other fields. Representative deep learning techniques include the convolutional neural network (CNN) [9,10], recurrent neural network (RNN) [11,12], generative adversarial network [13,14] and convolutional auto-encoder [15,16]. Recently, a series of deep learning-based classification frameworks have been widely used in the field of remote sensing [17,18,19,20,21,22]. By combining a deep CNN with multiple feature learning, a joint feature map for HSI classification was generated, which gave the developed method high classification performance on test datasets [18]. A spectral locality-aware regularization term and label-based data augmentation were used in a CNN structure to prevent over-fitting in the presence of many features and few training samples [19]. Based on the stacked sparse auto-encoder, Tao et al. proposed a feature representation method that learns adaptively from unlabeled data [23]. A feature learning model based on an unsupervised segmented denoising auto-encoder was presented to learn both spectral and spatial features [24]. Even though these deep learning-based HSI classification methods can achieve satisfactory results, a large number of training samples are usually required (e.g., 200 labeled pixels per class) [18,19,20,21]. Therefore, it is necessary to investigate the classification performance of these models when only a limited number of training samples is available.
Superpixels are homogeneous regions whose pixels are spatially adjacent and have similar color or spectral features. There are many superpixel segmentation algorithms in computer vision and image processing. Among them, two representative methods, entropy rate superpixel (ERS) [25] and simple linear iterative clustering [26], are commonly used to split an HSI into superpixels in remote sensing. Based on superpixel homogeneity, a series of spectral-spatial HSI classification or dimensionality reduction approaches have recently been developed [27,28,29] to improve the classification accuracy and speed up the classification process. Taking each superpixel rather than each pixel as the basic input of the classifier, several superpixel-level HSI classification methods have been proposed in the past few years [30,31,32]. Experimental results demonstrate that these superpixel-level approaches can effectively exploit the spectral-spatial information of hyperspectral data and achieve satisfactory results on typical benchmarks even with limited training samples. Some superpixel-based dimensionality reduction methods were also investigated by combining superpixels with classic dimensionality reduction techniques [33,34,35]. These methods make full use of the spatial information provided by superpixels to improve the dimensionality reduction performance of the classic methods. In addition, Blanco et al. adopted the texture information extracted from each superpixel to improve the classification accuracy [36]. Extensive work has shown that the careful use of superpixels in the HSI classification process does improve the classification results.
A common problem in practical HSI classification is the scarcity of labeled data, since labeling samples is expensive and time-consuming. Some researchers have attempted to address this issue methodologically through various techniques [37,38,39,40,41,42]. Acquarelli et al. selected pixels in smaller classes by data augmentation, and then used smoothing- and label-based techniques to prevent overfitting to the few training samples [19]. Following the strategy of pairing or recombining samples, a spatial-spectral relation network was designed for HSI classification with limited labeled samples [43]. Xie et al. used the pseudo sample labels obtained by pre-classification with multiple classifiers to enlarge the training set [44]. Additionally, superpixel-wise classification methods provide an effective means of solving this problem [45]. In fact, superpixel-level classification effectively increases the proportion of labeled samples by reducing the number of inputs to be classified. Because of this advantage, we adopt the superpixel-wise technique in this work.
Transfer learning also provides a feasible solution to the problem of insufficient training samples [46,47,48]. Transfer learning transfers the knowledge learned by a source model to different but related new tasks, thus reducing both the training time of the new task and the number of labeled samples needed. Liu et al. suggested an HSI classification method that improves the performance of a 3D-CNN model through parameter optimization, transfer learning and virtual samples [47]. By combining the 3-D separable ResNet with cross-sensor transfer learning, an effective approach was presented to classify HSIs with only a few labeled samples [48]. Multi-source or heterogeneous transfer learning strategies for HSI classification were investigated to alleviate the problem of few labeled samples [49,50]. An end-to-end 3D lightweight CNN with fewer parameters was built for HSI classification via cross-sensor and cross-modal transfer learning strategies [51]. To achieve a good transfer effect, the models in these methods were pre-trained on a source dataset with sufficient labeled samples. However, the number of training samples on the source dataset may also be limited. Therefore, it is worth investigating the effect and efficiency of knowledge transfer from source data to target data in this case.
Previous work has demonstrated that the traditional CNN-based pixel-wise HSI classification framework can effectively extract the main spectral features of an HSI in the down-sampling process. As the network depth increases, however, the spatial structure information of the HSI is gradually lost, and the lack of spatial information generally leads to unsatisfactory classification results. To obtain good classification results, these methods use a large number of labeled samples to improve the performance of the classifier, but acquiring so many labeled samples is expensive. Meanwhile, a deeper network also takes more time to train, because a large number of parameters need to be optimized. To address these two problems, we design an efficient deep learning-based spectral-spatial classification framework for HSIs with limited training samples, that is, superpixel pooling CNN with transfer learning (SP-CNN). The main spectral features extracted by the CNN architecture and the spatial structure information provided by the superpixel map are effectively fused in the suggested classification scheme, which contributes to satisfactory classification results. Furthermore, different from previous pooling techniques, superpixel pooling weakens the dependence of the deep learning-based classification scheme on massive labeled samples, thus alleviating the problem of insufficient training samples on both source and target datasets. Meanwhile, the transfer learning strategy introduced in our framework speeds up the training process of SP-CNN. As a result, the proposed method can classify HSIs accurately and quickly with a small number of training samples.
The novelties of the current work are as follows:
An efficient spectral-spatial HSI classification scheme is proposed based on superpixel pooling CNN with transfer learning;
The introduced superpixel pooling technique effectively alleviates the problem of insufficient training samples in HSI classification;
The training efficiency of the proposed classification model is improved significantly by using a transfer learning strategy.
The remainder of this work is organized as follows: In Section 2, we describe the designed classification framework and the techniques used. Section 3 quantitatively reports the classification results on three benchmarks and discusses them qualitatively. The impact of the number of training samples, the number of superpixels and the network architecture on the classification results of the SP-CNN method is analyzed in Section 4.
3. Experimental Results and Analysis
To verify the effectiveness of the proposed SP-CNN method, extensive experiments were conducted on three public hyperspectral datasets, namely, Indian Pines, Pavia University and Salinas. These three typical benchmarks are widely utilized to test the performance of the HSI classification algorithms.
3.1. Datasets and Evaluation Indicators
The Indian Pines dataset was collected by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over a pine field in northwestern Indiana. This image consists of 16 different categories, 145 × 145 pixels, and 200 bands. After removing background points, there are 10,249 pixels to be classified. The imbalance between class sizes makes accurate classification difficult.
The Pavia University image was acquired by the reflective optics system imaging spectrometer (ROSIS) sensor at the University of Pavia, Italy. It has nine classes, 610 × 340 pixels, and 103 bands. The spatial structure of each class in this dataset varies greatly, which makes classification challenging.
The last dataset is the Salinas dataset, which was collected by the AVIRIS sensor over the Salinas Valley in California. It is composed of 16 categories, 512 × 217 pixels, and 204 bands. In this image, there are two spatially adjacent classes whose spectra are very similar.
In all experiments conducted in this work, the classification results were evaluated with three commonly used indices, that is, overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ). To reduce the bias caused by the random selection of labeled samples, the mean and standard deviation over 10 independent runs were calculated as the final classification results.
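As an illustration of the three indices, the following minimal sketch computes OA, AA and κ from a confusion matrix using their standard definitions, and summarizes repeated runs by mean and standard deviation. The function names `classification_scores` and `summarize_runs` are hypothetical, and the choice of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def classification_scores(confusion):
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of test pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    # Overall accuracy: fraction of all test pixels classified correctly.
    oa = sum(confusion[i][i] for i in range(n)) / total

    # Average accuracy: mean of the per-class accuracies (recalls),
    # skipping classes that have no test pixels.
    recalls = [confusion[i][i] / row_totals[i] for i in range(n) if row_totals[i]]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    # Kappa is undefined when expected == 1; returning 0.0 is an assumption.
    kappa = (oa - expected) / (1 - expected) if expected != 1 else 0.0
    return oa, aa, kappa


def summarize_runs(values):
    """Mean and (population) standard deviation over independent runs."""
    return statistics.mean(values), statistics.pstdev(values)
```

For a two-class matrix `[[8, 2], [1, 9]]`, this sketch gives OA = 0.85, AA = 0.85 and κ = 0.7; in practice `summarize_runs` would be applied to the 10 per-run values of each index.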
3.2. Classification Results and Analysis
The suggested method was compared with five other state-of-the-art HSI classification approaches, namely, artificial neural network (ANN), CNN [55], convolutional recurrent neural network (CRNN) [56], CNN based on pixel-pair features (CNN-PPF) [57], and spectral-spatial CNN (SS-CNN) [58], because these methods are based on the CNN framework [56,57,58] and adopt an ANN as the final classifier. Table 2, Table 3 and Table 4 list the comparison results of these algorithms on the three datasets.
The classification results of the competing algorithms on the Indian Pines dataset are reported in Table 2. Our method achieves a classification accuracy of 94.45%, about 10 percentage points higher than that of SS-CNN (84.58%). The SS-CNN method is superior to the CRNN and CNN approaches because it uses spatial information in classification. The CNN, CRNN, SS-CNN and SP-CNN methods outperform ANN because the CNN-based framework can extract the main spectral features from raw data. CNN, CRNN and SS-CNN do not obtain satisfactory results on this dataset when there are no more than 30 labeled pixels per class. However, these three methods can still exhibit superior classification performance in the case of 200 labeled pixels per class [55,56,58]. Note that in this case, seven classes with no more than 400 pixels each were ignored in the classification. The experimental results on this image demonstrate that the proposed SP-CNN method can classify an unbalanced dataset with limited labeled samples. The classification result maps of these methods are shown in Figure 5.
The classification statistics of the six methods on the Pavia University dataset are listed in Table 3. According to the values of the three evaluation indicators, our method outperforms the other five algorithms. The use of spatial information makes the classification results of the SS-CNN and SP-CNN methods better than those of the other four approaches. The suggested classification scheme, however, does not distinguish the classes “Trees” and “Shadows” well from other ground objects. This may be because, after removing the background points, fragmented or strip-like class distributions produce many very small superpixels, thus weakening the role of spatial structure information in classification. For the class “Trees” in particular, the spectral-based classifiers ANN, CNN and CRNN also do not achieve good classification accuracy, which suggests that the spectrum of this class is complex. Although good classification results can be obtained by labeling more pixels in each class, it is worth studying how to improve the recognition accuracy of this class under the deep learning framework when labeled samples are limited. Figure 6 presents the visualization of the classification results in Table 3.
Table 4 summarizes the classification results of the six classifiers on the Salinas dataset. The main challenge for this dataset is to classify the classes “Grapes_U” and “Vineyard_U” correctly, because there is only a slight spectral difference between these two spatially adjacent categories. It can be seen from Table 4 that none of the six methods achieves satisfactory classification accuracy for the class “Vineyard_U”. As shown in Figure 7c–e, most pixels of the class “Vineyard_U” were misclassified as the class “Grapes_U”, especially by the classifiers ANN and CNN. As expected, the three spectral-spatial classifiers, CNN-PPF, SS-CNN and SP-CNN, perform well and achieve more than 90% classification accuracy.
5. Conclusions
In this work, we propose a spectral-spatial deep learning model for HSI classification based on CNN and superpixels. Different from the traditional CNN structure, an up-sampling process is connected after down-sampling to recover the lost spatial structure information while preserving the extracted spectral features. The extracted spectral features and the spatial structure information provided by the superpixels are effectively fused in the designed superpixel pooling layer. Furthermore, the homogeneity of superpixels allows each superpixel, instead of each pixel, to be regarded as the basic input of the classifier, thus reducing the number of objects to be classified. For a fixed number of training samples, reducing the number of objects to be classified increases the proportion of training samples. This is the main reason why the proposed SP-CNN method can effectively classify HSIs with limited training samples. At the same time, this idea may serve as a feasible solution to the problem that CNN-based HSI classification frameworks cannot achieve high classification accuracy when training samples are insufficient. As with the traditional CNN classification framework, the efficiency of the proposed SP-CNN method depends on the optimization of a large number of parameters. As expected, the use of transfer learning in the proposed model significantly shortens the training time. Therefore, this method can be applied to other practical problems in the field of remote sensing. Because our work integrates the advantages of the CNN architecture, superpixels and transfer learning, the proposed SP-CNN method can classify hyperspectral data accurately and quickly with a small number of training samples. Extensive experimental and comparative results on three benchmarks confirm the effectiveness and efficiency of SP-CNN.
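The idea of treating each superpixel as one classification object can be sketched as follows. This is only an illustration under stated assumptions: the pooling operator is taken to be a per-superpixel mean of the per-pixel features, which may differ from the layer actually used in SP-CNN, and the names `superpixel_mean_pool` and `labeled_fraction` as well as the numbers in the usage note are hypothetical.

```python
def superpixel_mean_pool(features, segments):
    """Average per-pixel feature vectors inside each superpixel.

    features: H x W nested list; features[r][c] is a feature vector (list).
    segments: H x W nested list; segments[r][c] is the superpixel id.
    Returns a dict mapping superpixel id -> mean feature vector.
    Mean pooling is an assumption made for this sketch.
    """
    sums = {}
    counts = {}
    for feat_row, seg_row in zip(features, segments):
        for vec, sp in zip(feat_row, seg_row):
            if sp not in sums:
                sums[sp] = [0.0] * len(vec)
                counts[sp] = 0
            for k, value in enumerate(vec):
                sums[sp][k] += value
            counts[sp] += 1
    return {sp: [s / counts[sp] for s in sums[sp]] for sp in sums}


def labeled_fraction(n_labeled, n_objects):
    """Proportion of labeled objects among all objects to be classified."""
    return n_labeled / n_objects


# Toy 2 x 2 image with two superpixels (ids 0 and 1) and 2-D features.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
segments = [[0, 0],
            [1, 1]]
pooled = superpixel_mean_pool(features, segments)
```

With purely hypothetical numbers, 100 labeled samples among 10,000 pixels give a labeled proportion of 0.01, whereas the same 100 labels among 500 superpixels give 0.2, provided each labeled sample falls in a different superpixel.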
Thus far, the optimal superpixel segmentation scale still has to be determined experimentally and is difficult to specify in advance. In the future, we would like to adopt a superpixel merging technique to reduce the dependence of the superpixel-level classification method on the segmentation scale.