1. Introduction
Recently, deep learning methods represented by convolutional neural networks (CNNs) have made breakthroughs in computer vision, showing great superiority in the image processing area [1,2,3]. As a result, research on CNN models has attracted more and more attention, and CNNs have penetrated various subareas of image processing, for example, remote sensing image processing [4]. Hyperspectral image classification has always been one of the hotspots in the remote sensing community, and CNN-based hyperspectral classification methods are currently booming [5]. However, hyperspectral images suffer from a large number of spectral bands, large data size, high redundancy, high nonlinearity and the “small sample problem”, so their pixel-wise classification is still challenging [6].
The convolutional neural network can automatically learn hierarchical abstract features from raw images, which provides an ideal solution for feature extraction in computer vision. In 2012, a deep learning model named AlexNet [7] achieved excellent classification results on the ImageNet dataset, a huge collection of natural images. Since then, innovative networks have emerged in an endless stream, constantly inspiring new paradigms of feature extraction and reuse. In 2015, He et al. [8] proposed ResNet, solving the training problem of deep networks by introducing residual connections; in ResNet, feature fusion is realized by the pixelwise addition of different feature maps. In 2017, Huang et al. [9] proposed DenseNet, which enables feature reuse and provides another way of feature fusion, realized by the concatenation of different feature maps. In recent years, these two feature fusion methods, proposed in ResNet and DenseNet, have been widely used in image classification [10,11], semantic segmentation [12,13], object detection [14,15], etc., and they serve as standard patterns of CNN-based feature extraction. As milestones in the design of CNN models, the ideas behind ResNet and DenseNet are also radiating beyond the natural image processing area [16,17]. At present, feature extraction and feature fusion for specific tasks or specific data remain a hot topic in the field of computer vision.
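For illustration only, the following minimal Keras sketch contrasts the two fusion patterns; the layer widths are arbitrary choices for the example, not taken from the original networks. Residual fusion adds feature maps elementwise, while dense fusion concatenates them along the channel axis.

```python
from tensorflow.keras import layers

def residual_fusion(x):
    # ResNet-style fusion: pixelwise addition of feature maps;
    # input and output shapes must match for the addition to work
    f = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(64, 3, padding="same")(f)
    return layers.Add()([x, f])

def dense_fusion(x):
    # DenseNet-style fusion: concatenation along the channel axis;
    # channels accumulate, so later layers can reuse earlier features
    f = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    return layers.Concatenate(axis=-1)([x, f])
```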
Hyperspectral image classification is a hotspot in remote sensing image interpretation and is of great difficulty. Its purpose is to assign an accurate label to each pixel in the image and thereby divide the image into areas with different ground object semantics [7]. Currently, the convolutional neural network has been successfully applied to hyperspectral image classification tasks [18,19,20,21]. In hyperspectral image (HSI) classification, the convolutional neural network acts as an “information distiller”, gradually extracting high-level abstract semantic features as the network deepens. In this process, the hyperspectral images with their huge amount of data are transformed, irrelevant information is filtered out, and useful information is enlarged and refined [22]. Prior to deep learning methods, traditional methods mostly used linear feature extraction techniques, such as linear discriminant analysis [23], principal component analysis [24] and independent component analysis [25], and then used a shallow classifier [26,27,28] to complete the classification. These methods rely on manually designed features, and for complex and diverse hyperspectral data, it is difficult to find a universal feature extraction method along such a route. Convolutional neural networks, which can learn features from HSI autonomously, provide a good solution for feature extraction. HSI classification models based on the 1D-CNN [29] or 2D-CNN [30] can achieve considerable classification results by automatically extracting features from hyperspectral images, but at the cost of a degree of spectral or spatial information loss. In order to fully utilize the spatial and spectral information in hyperspectral images simultaneously, the 3D-CNN, previously used to process video data, was introduced to HSI classification. Compared with the 2D-CNN, the 3D-CNN has a relatively large computation burden but can better learn the spectral features within a hyperspectral image, which results in better classification performance. Since then, 3D-CNNs have been widely applied to HSI classification, and many improved models have been proposed on this basis.
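The difference between the two kernel types can be illustrated with a minimal Keras sketch; the 25 × 25 window and 30 bands are assumed values for illustration, not a configuration from any cited model.

```python
from tensorflow.keras import Input, layers

# 3D convolution: the band axis is kept as a separate dimension, and the
# kernel also slides along it, learning local spectral structure explicitly
patch_3d = Input(shape=(25, 25, 30, 1))   # height, width, bands, channel
x3d = layers.Conv3D(8, kernel_size=(3, 3, 7), activation="relu")(patch_3d)

# 2D convolution: bands are folded into channels, so every kernel mixes
# all 30 bands at once and the ordering of the spectral axis is lost
patch_2d = Input(shape=(25, 25, 30))
x2d = layers.Conv2D(8, kernel_size=(3, 3), activation="relu")(patch_2d)
```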
Chen et al. [18] constructed a 3D-CNN model composed of 3D convolutional layers and 3D pooling layers, improving classification performance by means of a deep exploration of spatial–spectral features. Deeper networks enable deeper and more robust features, but the network structure needs careful design to prevent a sharp rise in the number of parameters. Lee et al. [19] made good use of residual connections in spectral feature learning and built a deeper network (Res-2D-CNN) with which deeper and more abstract features could be extracted. Liu et al. [31] introduced residual connections to the 3D-CNN and built Res-3D-CNN, aiming to enhance spatial–spectral feature learning. Zhong et al. [20] focused on the raw hyperspectral data without dimensionality reduction and built SSRN (spectral–spatial residual network). They introduced residual connections into the whole network and separated the deep feature learning procedure into independent spectral feature learning and spatial feature learning. More discriminative features were learned by SSRN, and its separated feature learning pattern has had a significant impact on subsequent hyperspectral classification research. Recently, dense connections have attracted more attention from hyperspectral researchers [32]. A dense connection reduces the network parameters through a small number of convolution kernels and realizes efficient feature reuse through feature map concatenation, both of which alleviate the problem of model overfitting. Wang et al. [21] introduced a dense block into SSRN using dense connections and built FD-SSC (fast dense spectral–spatial convolution network). With the help of dense connections, FD-SSC further enhanced feature propagation and reuse, making it possible to extract deeper hierarchical spatial–spectral features. Besides the rational use of different residual connections, structural innovation is also an important aspect of optimizing CNN models for hyperspectral classification. Roy et al. [33] proposed a novel hyperspectral feature extraction pattern, HybridSN, based on the combination of the 3D-CNN and 2D-CNN. HybridSN takes hyperspectral data after dimensionality reduction as its input and has a relatively small computation burden. It concatenates the feature maps extracted by three successive 3D convolutional layers along the spectral dimension and then uses a 2D convolutional layer to enhance spatial feature learning. HybridSN, which has only four convolutional layers, achieved extremely high classification accuracy, demonstrating the great potential of 3D-2D-CNN models in hyperspectral classification. Based on the 3D-2D-CNN, Feng et al. [34] proposed R-HybridSN (Residual-HybridSN) by means of the rational use of non-identity residual connections, enriching the feature learning paths and enhancing the flow of spectral information in the network. In particular, R-HybridSN was equipped with depthwise separable convolution layers instead of traditional 2D convolutional layers, which further improved its performance in small sample hyperspectral classification. However, the shallow features in R-HybridSN are not reused, so its network structure can be further optimized.
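The key step in this 3D-2D-CNN pattern is the transition from 3D feature maps to a 2D representation. Below is a minimal sketch of one plausible reading of that step, in which the remaining spectral dimension is folded into the channel axis; the kernel counts and sizes are illustrative assumptions, not the exact configuration of HybridSN.

```python
from tensorflow.keras import Input, layers

x = Input(shape=(25, 25, 30, 1))                     # patch with 30 bands
x = layers.Conv3D(8,  (3, 3, 7), activation="relu")(x)
x = layers.Conv3D(16, (3, 3, 5), activation="relu")(x)
x = layers.Conv3D(32, (3, 3, 3), activation="relu")(x)

# fold the remaining spectral dimension into channels, then let a
# 2D convolution strengthen the spatial feature learning
h, w, d, c = x.shape[1], x.shape[2], x.shape[3], x.shape[4]
x = layers.Reshape((h, w, d * c))(x)
x = layers.Conv2D(64, (3, 3), activation="relu")(x)
```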
Hu et al. [35] proposed squeeze-and-excitation networks, introducing the attention mechanism into image classification networks and winning the 2017 ImageNet Large Scale Visual Recognition Challenge. Recently, the attention mechanism [36] has also been applied to the construction of HSI classification models. The attention mechanism is a resource allocation scheme through which limited computing resources are devoted to processing the more important information. Therefore, an attention module can effectively enhance the expression ability of a model without excessively increasing its complexity. Wang et al. [37] constructed a spatial–spectral squeeze-and-excitation (SSSE) module to automatically learn the weights of different spectral bands and different neighborhood pixels, emphasizing the meaningful features and suppressing unnecessary ones, so that the classification accuracy is improved effectively. Li et al. [38] added an attention module (a squeeze-and-excitation block) after each dense connection module used for shallow and middle feature extraction to emphasize effective features in the spectral bands before feeding them to further deep feature extraction. The attention mechanism in HSI classification models is used for finding more discriminative feature patterns in the spectral or spatial dimension. However, the specific use of the attention mechanism, such as its location and calculation method, still lacks a mature theory and needs further exploration.
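As a reference point for this discussion, a minimal squeeze-and-excitation block is sketched below; the reduction ratio of 8 is an illustrative choice. The block learns one weight per channel from global context and rescales the feature maps accordingly, which is the recalibration idea behind [35].

```python
from tensorflow.keras import layers

def se_block(x, ratio=8):
    """Squeeze-and-excitation: reweight channels by learned importance."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze: global context
    s = layers.Dense(channels // ratio, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)   # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                      # excite: rescale input
```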
Hyperspectral image labeling is laborious and time-consuming; therefore, labeled samples are always limited in classification tasks. How to achieve better classification results with as few labeled samples as possible has long been a research hotspot. Feng et al. [34] conducted vast experiments using different amounts of training samples and found that degradation of CNN models is very common when the sample size decreases. The main strategies for small sample hyperspectral classification include generative adversarial networks [39,40], semi-supervised learning [41,42] and network optimization [33,34]. The residual connection is the core of network optimization, and the purpose of network optimization is to facilitate feature fusion and feature reuse. Compared with a simple pipelined network, a well-designed model, which is more like a directed acyclic graph of layers, usually has a better classification effect [34]. Song et al. [43] proposed a hybrid residual network (HDRN), in which residual connections are used both within and between residual blocks. The rational use of residual connections in the HDRN makes it better able to cope with hyperspectral classification under limited training samples. Network optimization can also be combined with other methods. Liu et al. [44] proposed a deep few-shot learning method focused on “small sample” hyperspectral classification, in which a Res-3D-CNN model is utilized to extract spatial–spectral features and to learn a metric space for each class of objects. Therefore, network optimization has important research significance, and constructing models with a more reasonable structure seems to be an effective solution for “small sample” hyperspectral classification.
Based on the above observations, in order to explore a better topological structure, and inspired by R-HybridSN and the attention mechanism, we propose a novel model named AD-HybridSN (Attention-Dense-HybridSN) for the “small sample problem” from the perspective of network optimization. Based on the 3D-2D-CNN and a densely connected module, AD-HybridSN realizes more efficient feature reuse and feature fusion. Moreover, the attention mechanism is introduced into the 3D convolution part and the 2D convolution part, respectively, so that the model can refine and utilize spectral and spatial features in a targeted manner. With fewer parameters, AD-HybridSN achieves better classification performance on the Indian Pines, Salinas and University of Pavia datasets.
4. Experimental Results and Discussion
In our experiments, the training settings of HybridSN, R-HybridSN, Res-3D-CNN and Res-2D-CNN, such as window size, training epochs, etc., were consistent with the corresponding papers. In addition, we used the Adam optimizer with a learning rate of 0.001. To observe the performance of our model, we trained AD-HybridSN for 100 epochs and used ReLU as its activation function. In all experiments, we monitored the validation accuracy and saved the model with the highest validation accuracy.
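A sketch of this training configuration in Keras is shown below; the helper function and the checkpoint file name are our own illustrative choices, not part of the published setup.

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam

def train(model, x_train, y_train, x_val, y_val):
    """Train for 100 epochs with Adam (lr = 0.001), keeping only the
    weights that achieve the highest validation accuracy."""
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    best = ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                           save_best_only=True, mode="max")
    return model.fit(x_train, y_train, epochs=100,
                     validation_data=(x_val, y_val), callbacks=[best])
```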
4.1. Experimental Results
Three indices were used to measure the accuracy of the models, namely, overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa). OA represents the proportion of samples that are correctly classified by the model. AA stands for the average of the per-class accuracies over all land object classes. Kappa is an accuracy measure based on the confusion matrix, which represents the percentage of errors reduced by the classification compared with a completely random classification.
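The three indices can be computed directly from the confusion matrix; a minimal NumPy sketch, assuming rows are true labels and columns are predictions, is as follows.

```python
import numpy as np

def accuracy_metrics(cm: np.ndarray):
    """OA, AA and Kappa from an (n_classes x n_classes) confusion matrix."""
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)       # accuracy of each class
    aa = per_class.mean()                          # average accuracy
    # expected agreement of a completely random classification
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```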
In order to avoid fluctuations caused by accidental factors as far as possible, we conducted 20 consecutive experiments. Table 4, Table 5 and Table 6 show the average indices and standard deviations of each model on the three datasets. Figure 7, Figure 8 and Figure 9 show the false-color maps, the ground truths and the classification results of each model for the three datasets. The data and predicted maps show that the classification results of AD-HybridSN were more detailed and accurate on Indian Pines, Salinas and University of Pavia. Among the contrast models, the OA of Res-2D-CNN on the three datasets was lower than that of the other contrast models, indicating that the 2D-CNN model is not suitable for small sample hyperspectral classification. Secondly, the classification accuracy of Res-3D-CNN was higher than that of Res-2D-CNN, indicating that the 3D-CNN model can explore the spatial–spectral features of training samples more effectively. R-HybridSN was superior to HybridSN on Indian Pines and University of Pavia, and both models had higher classification accuracy than Res-3D-CNN. To a certain extent, this proves that, compared with models that use 3D or 2D convolution kernels alone, the 3D-2D-CNN model is more suitable for classification under small sample conditions, and that the reasonable use of residual connections can effectively improve the classification performance of the 3D-2D-CNN model. In particular, the classification accuracy of R-HybridSN on Salinas was slightly lower than that of HybridSN, and our proposed model AD-HybridSN effectively solved this problem. Among the three 3D-2D-CNN models, our proposed AD-HybridSN achieved the highest classification accuracies on all three datasets. For example, on Indian Pines, the OA of AD-HybridSN was 0.26% and 2.71% higher than those of R-HybridSN and HybridSN, respectively.
We further compared the experimental results of the three 3D-2D-CNN based models and drew the following conclusions. Firstly, unlike R-HybridSN, whose classification accuracy on Salinas was inferior to that of HybridSN, AD-HybridSN performed in a relatively balanced way across the three datasets. This further demonstrates the strong feature extraction ability of the dense block and the necessity of the feature refinement module. Secondly, the absolute classification accuracy of AD-HybridSN still varied considerably between datasets: using a similar amount of training samples, its classification results on Salinas were far better than those on Indian Pines. Thus, the generalization ability of AD-HybridSN needs to be analyzed further. Thirdly, compared with the other two 3D-2D-CNN models, AD-HybridSN showed a tremendous improvement on small sample classes, such as Stone-steel Towers in Indian Pines and Shadows in University of Pavia. However, the classification accuracy of AD-HybridSN on some ground objects, such as Oats and Alfalfa in Indian Pines and Lettuce_romaine_7wk in Salinas, while higher than that of R-HybridSN, was still lower than that of HybridSN, which needs to be studied further.
4.2. Discussion
Extensive experiments proved that the classification performance of AD-HybridSN is superior to that of R-HybridSN, HybridSN and the other contrast models. Therefore, the network structure of AD-HybridSN is conducive to improving classification accuracy, which deserves further discussion. From the perspective of network structure, HybridSN is a 3D-2D-CNN model with a relatively concise structure, containing only four convolutional layers; R-HybridSN has a relatively deeper and more complex structure, based on non-identity residual connections and depthwise separable convolutional layers. It can be inferred from the experimental results that R-HybridSN has a better spatial–spectral feature learning ability. At the same time, the features extracted by its shallow layers are not fully utilized, which may be the reason why the accuracy of R-HybridSN on the Salinas dataset was slightly lower than that of HybridSN. AD-HybridSN is a redevelopment of R-HybridSN, into which the dense block and attention modules are introduced for feature reuse and refinement. As AD-HybridSN has only six convolutional layers, the structural advantage of our proposed network was verified. However, the effectiveness of the attention module needs to be verified further.
In order to further verify the effectiveness of the attention module in our proposed model, we built D-HybridSN to conduct model ablation experiments. To control the experimental variables, the only difference between D-HybridSN and AD-HybridSN is that the former has no attention modules.
Table 7 shows the accuracies of AD-HybridSN, D-HybridSN and R-HybridSN on the three datasets; the proportions of training samples used in this experiment were also 5%, 1% and 1%, respectively. The classification accuracies of D-HybridSN differed from those of R-HybridSN by −0.42%, +0.66% and +0.27% on Indian Pines, Salinas and University of Pavia, respectively. Judging from the comprehensive performance of the models on the three datasets, the features extracted by D-HybridSN are more discriminative. Thus, it is further proved that, by reusing the spatial–spectral features in the network, the features from shallow layers are better utilized to contribute to classification. Moreover, our proposed AD-HybridSN outperformed D-HybridSN on the three datasets by 0.68%, 0.19% and 0.41%, respectively, which indicates that the spatial–spectral features are further refined by the attention module that follows every convolutional layer.
Although AD-HybridSN achieves satisfactory overall accuracies on the three datasets, its classification of some ground objects is still unsatisfactory. This phenomenon may be attributed to using a fixed network structure for different datasets, which may limit targeted feature learning for datasets with different spatial resolutions and spectral conditions. Therefore, in follow-up research, model integration methods will be used to combine the advantages of different networks, so as to comprehensively improve the classification accuracy of the various ground objects. Besides, a fixed network structure implies a fixed input size, including the window size and the number of bands, which may further limit the model’s ability to learn spatial–spectral features from different datasets. Thus, how to learn features in a more flexible way needs to be investigated further, in terms of both network structure and hyperspectral image preprocessing.
In order to further verify the performance of AD-HybridSN under the “small sample” condition, we further reduced the amount of training samples and conducted supplementary experiments. Section 4.1 reported the experimental results for the unbalanced training sample case; here, we further reduce the amount of training samples. Meanwhile, we also use balanced training samples, meaning that the number of samples of each ground object is equal (a sketch of such per-class sampling is given after this paragraph). Because 5% is the minimum proportion for Indian Pines that ensures every ground object has at least one sample, and because the classification accuracy on University of Pavia is relatively low, we only used University of Pavia in the supplementary experiments. In the unbalanced training sample experiments, the proportion of labeled data was decreased from 0.8% to 0.4%. In the balanced training sample experiments, we used 50, 40, 30 and 20 labeled samples of each ground object, respectively. Table 8 and Table 9 show the experimental results.
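A minimal sketch of the balanced per-class sampling described above, assuming a flattened ground-truth map in which 0 marks unlabeled pixels:

```python
import numpy as np

def balanced_split(labels: np.ndarray, per_class: int, seed: int = 0):
    """Draw an equal number of labeled pixels from every class; return
    the training indices and the remaining labeled indices."""
    rng = np.random.default_rng(seed)
    train_idx, rest_idx = [], []
    for c in np.unique(labels):
        if c == 0:                                 # skip unlabeled background
            continue
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:per_class])
        rest_idx.extend(idx[per_class:])
    return np.asarray(train_idx), np.asarray(rest_idx)

# e.g. 50, 40, 30 or 20 samples per class, as in the balanced experiments:
# train_idx, rest_idx = balanced_split(gt.ravel(), per_class=50)
```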
After analyzing the experimental results, we had the following findings:
(1) In our experiments, the classification accuracy of AD-HybridSN was the highest in both the unbalanced and the balanced training sample cases. Additionally, the classification accuracy of R-HybridSN was higher than that of HybridSN, which is consistent with the experimental results on University of Pavia in Section 4.1. However, the classification accuracies of the three models showed great differences between the two kinds of experiments. By comparing the OA and AA values, we found that the AA value was relatively higher in the balanced training sample case, unlike in the unbalanced case. This phenomenon indicates that the sample distribution has a great influence on the classification results.
(2) The experimental results further indicate that sample distribution is a valuable research issue in “small sample” hyperspectral classification. For now, we randomly split the hyperspectral data to obtain the training, validation and testing sets. However, hyperspectral images pose an ill-posed problem: on the one hand, the number of samples per class is unbalanced; on the other hand, the quality of the samples is also uneven. Thus, selecting the best training sample combination from the labeled samples may alleviate this ill-posed problem to a certain extent.
(3) The experimental results show that when the number of training samples is reduced beyond a certain point, the classification accuracy of all models drops in a cliff-like manner. Therefore, there is a limit to improving small sample classification accuracy by network optimization alone. When the training samples are reduced to that extent, a large number of unlabeled samples are left unused. Thus, in follow-up research, we should focus on mining the information of unlabeled samples by combining semi-supervised learning or active learning strategies.