1. Introduction
Remote sensing via hyperspectral imaging is a powerful means of capturing a set of continuous images of a scene in narrow, contiguous bands over a wide spectral range. The resulting hyperspectral images are rich in both spatial and spectral information, which can be exploited effectively for the classification and recognition of targets in the scene [1,2,3]. However, hyperspectral remote sensing classification remains difficult due to high dimensionality, strong nonlinearity [4], and small training samples.
Early research in hyperspectral classification or pattern recognition mainly focused on feature extraction using model-driven algorithms. Commonly used algorithms include local feature descriptors such as the scale-invariant feature transform (SIFT) [5], histogram of oriented gradients (HOG) [6], and local binary patterns (LBP) [7], as well as classification techniques such as multinomial logistic regression [8], active learning [9], k-nearest neighbors (KNN) [10], support vector machines (SVM) [11], and spectral angle mapping (SAM) [12]. Other algorithms emphasize feature discrimination enhancement or dimensionality reduction, such as principal component analysis (PCA) [13], independent component analysis (ICA) [14], and linear discriminant analysis (LDA) [15]. These feature extractors must be followed by classifiers to obtain the final results. Although model-driven algorithms make full use of spectral information, the resulting classification maps still contain non-negligible noise because spatial contextual information is underutilized. Adding spatial information can greatly improve classification accuracy [16]. For example, a 3D Gabor feature-based collaborative representation (3GCR) approach can extract multiscale spatially structured features [17]. However, such manually designed feature extraction algorithms lack robustness to geometric transformations and photometric variations between unbalanced intraclass samples, which limits the achievable accuracy of hyperspectral classification. Orthogonal complement subspace projection (OCSP) [18] attempts to relax the sample-labeling constraint via unsupervised learning. Although hyperspectral classification methods based on hand-designed feature extraction have made great progress in recent years, the complex modeling process limits further performance improvement and affects the efficiency and scalability of practical applications.
In recent years, deep learning methods have made breakthroughs in image classification [19], target detection [20], natural language processing [21], and other fields. Deep learning methods use a hierarchical structure to extract higher-level abstract features from raw data, enabling a nonlinear mapping from feature space to label space. They have powerful data mining and feature extraction capabilities and automatically learn features that capture the essential structure of the data, which greatly saves human and material resources and effectively improves recognition rates and classification accuracy. In particular, deep learning is considered an effective feature extraction approach for hyperspectral classification [22,23]. For example, a stacked autoencoder (SAE) learns shallow and deep features from hyperspectral images using a single-layer autoencoder and a multilayer stacked autoencoder, respectively [24]. This spatially dominated deep learning framework yields higher accuracy than model-driven methods based on spectral information alone. A deep belief network (DBN) extracts deep, invariant features of hyperspectral data and then feeds the learned features into logistic regression to solve the hyperspectral classification problem [25]. However, both the SAE and the DBN flatten the spatial information into vectors before the pretraining phase, which inevitably causes a loss of spatial information.
In contrast, a convolutional neural network (CNN) has the potential to extract and exploit spatial–spectral features simultaneously. In early work, the hundreds of spectral channels of hyperspectral images were usually represented as 1D arrays. Hu et al. [26] used a 1D-CNN for hyperspectral classification, concentrating only on the spectral signatures without considering spatial correlation. Subsequently, 2D spatial information was emphasized using a 2D-CNN [27], in which randomized PCA was introduced to reduce the spectral dimensionality while keeping the spatial information unchanged. Since then, several 2D-CNN models with innovative network structures have been employed for hyperspectral classification, for example, a deformable 2D-CNN that introduced deformable convolutional sampling locations [28], a dilated 2D-CNN that utilized dilated convolution to avoid the resolution reduction caused by pooling [29], and a two-stream 2D-CNN that combined spectral and spatial features [30]. Furthermore, a deep feature fusion network (DFFN) counteracts the negative effects of excessive network depth, such as overfitting, gradient vanishing, and accuracy degradation, by considering the correlation between different layers and introducing residual learning [31]. Owing to the success of these feature fusion 2D-CNN models in hyperspectral classification, the joint use of spatial–spectral information has become a mainstream trend.
However, most 2D shape-based filters sacrifice spectral information during feature extraction, which eventually prevents further performance improvement in the classification of complex scenes. Cheng et al. [32] proposed a 3D-CNN framework to extract both spectral and spatial information; in addition, L2 regularization and dropout strategies were used to deal with the overfitting caused by limited training samples. Zhong et al. designed an end-to-end spectral–spatial residual network (SSRN) [33] with consecutive residual blocks that learn spectral and spatial representations separately. The network deliberately overfits the dataset and then uses residual connections to address gradient vanishing, with batch normalization (BN) and dropout added as regularization strategies to improve the classification performance. However, this deep network structure with a large number of trainable parameters is computationally more expensive than common CNNs. To fully exploit the spatial contextual information of hyperspectral images, He et al. [34] replaced the multiscale block with a typical 3D convolution layer, improving performance significantly. Sellami et al. [35] preserved the initial spectral–spatial features by automatically selecting relevant spectral bands coupled with a 3D-CNN model. Paoletti et al. [36] used pyramidal bottleneck residual blocks that gradually increase the feature map dimension across all convolutional layers to involve more locations. These pyramidal bottleneck residual units designed for hyperspectral images can extract more robust spectral–spatial representations, although they remain computationally expensive. To reduce the complexity of the network framework, Swalpa K. R. et al. [37] proposed the HybridSN network, which combines complementary spectral–spatial information in the form of 3D and 2D convolutions. This simple hybrid model is more computationally efficient than either a pure 3D-CNN or a pure 2D-CNN and also shows superior performance on the "small sample problem". On this basis, Feng et al. [38] proposed the R-HybridSN network with skip connections and depthwise separable convolutions and achieved better classification results than all the contrast models using very few training samples. Ge et al. [39] also demonstrated the effectiveness of a 2D-3D CNN with multibranch feature fusion.
Although multiple convolutional feature extraction layers and subsequent classifiers can be tied together to form end-to-end hierarchical networks, limited training samples lead to obvious overfitting problems. While data augmentation is one solution, optimization of the network structure is another important direction. Multiscale feature extraction and information fusion modules can greatly improve network performance. From the perspective of the joint use of spatial–spectral information, a multiscale 3D-CNN is effective at extracting features [34]. From the perspective of channels, the channel attention mechanism can focus on more informative channel features; in particular, a squeeze-and-excitation (SE) block can improve the quality of feature representations by reintegrating spatial–spectral information over the channels [40]. From the perspective of spatial information, an unbalanced sample distribution degrades discrimination capability in the "small sample problem". Fortunately, pyramid pooling layers can retain global information at different scales [41] and can make better use of features than single-scale pooling layers.
In this paper, we will combine multiscale 3D convolution, an SE block, and pyramid pooling layers to fully extract and exploit the spatial–spectral information in hyperspectral images. The contributions of this paper are summarized as follows:
To overcome the "small sample problem" of hyperspectral pixel-level classification, we designed a multiscale information fusion hybrid 2D-3D CNN, named the multiscale squeeze-and-excitation pyramid pooling network (MSPN), to improve classification performance. The model not only deepens the network vertically, but also expands the multiscale spectral–spatial information horizontally. To keep the model lightweight, the fully connected layer commonly used at the tail of a network is replaced for the first time by a global average pooling layer, reducing the number of model parameters.
The proposed MSPN was trained on small samples comprising 5%, 0.5%, and 0.5% of the labeled data from the three public datasets of Indian Pines, Salinas, and Pavia University, respectively. The prediction accuracies reached 96.09%, 97%, and 96.56%, respectively, demonstrating that the MSPN achieves high classification performance with small training samples.
This paper is organized as follows:
Section 2 describes the proposed framework of the MSPN.
Section 3 introduces the datasets used in the experiment.
Section 4 describes the comparison experiments with existing methods and analyzes and discusses the results.
Section 5 concludes the paper and looks at future research directions.
2. Methodology
Figure 1 shows the whole framework of hyperspectral image classification based on the MSPN. First, dimension reduction is conducted on the raw hyperspectral data using PCA, and a relatively small number of principal components are kept. The number of retained principal components should yield a projection error of no more than 0.01 when projecting the data from the higher to the lower dimension, which means that 99% of the information is retained in the reduced data. Hyperspectral data can be regarded as a 3D cube with input space $\mathcal{X} \in \mathbb{R}^{W \times H \times K}$, where $W$, $H$, and $K$ are the width, height, and number of bands, respectively. After selecting the first $B$ principal components, the hyperspectral data can be denoted as $\mathcal{X}' \in \mathbb{R}^{W \times H \times B}$. Next, the hyperspectral image cube is cut into small nearest-neighbor patches $P_{i,j} \in \mathbb{R}^{S \times S \times B}$, where $P_{i,j}$ represents the patch whose central pixel is at position $(i, j)$, with a window size of $S \times S$ covering all $B$ spectral dimensions, composed of pixels from $i - (S-1)/2$ to $i + (S-1)/2$ and from $j - (S-1)/2$ to $j + (S-1)/2$. The head of the MSPN is a multiscale 3D-CNN that accepts the 3D hyperspectral image patch as input. The output space for pixel-level classification is represented as $\mathcal{Y} = \{y_1, y_2, \ldots, y_C\}$, where $\mathcal{Y}$ is the set of possible categories. In order to determine whether the model has converged in the training stage, the difference between the predicted class labels $\hat{y}$ and the ground truth class labels $y$ is estimated using a cross-entropy loss function:

$$\mathcal{L} = -\sum_{n=1}^{N} \sum_{c=1}^{C} y_{n,c} \log \hat{y}_{n,c}$$

where $N$ is the number of training samples and $y_{n,c}$ is the one-hot ground truth of sample $n$ for class $c$.
The model is optimized by iteratively updating the parameters through the backpropagation algorithm. The MSPN mainly includes three modules: a multiscale 3D-CNN module, an SE block, and a pyramid pooling module. The multiscale 3D-CNN, consisting of three parallel 3D convolution kernels, extracts preliminary spatial–spectral features. The output of the 3D-CNN is reshaped before being input into the SE block, which redistributes attention over the spatial–spectral channel information. The pyramid pooling module further integrates spatial context information. Finally, a global pooling layer is used in place of the fully connected layer, reducing the number of trainable parameters to avoid overfitting. Each module is described in detail below.
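As a concrete illustration of the preprocessing step, the following minimal sketch performs the PCA reduction and nearest-neighbor patch extraction described above, assuming scikit-learn's PCA. The function names, the default values (16 components, a 13 × 13 window, consistent with the experiments later in the paper), and the zero-padding at the image borders are our own illustrative choices, not details specified by the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube, n_components=16):
    """Reduce the spectral dimension of a (W, H, K) cube to (W, H, B)."""
    w, h, k = cube.shape
    flat = cube.reshape(-1, k)                      # one spectrum per row
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(w, h, n_components)

def extract_patches(cube, window=13):
    """Cut an S x S x B nearest-neighbor patch around every pixel.
    Zero-padding (an illustrative choice) gives border pixels full windows."""
    margin = (window - 1) // 2
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)))
    w, h, b = cube.shape
    patches = np.empty((w * h, window, window, b), dtype=cube.dtype)
    n = 0
    for i in range(margin, w + margin):
        for j in range(margin, h + margin):
            patches[n] = padded[i - margin:i + margin + 1,
                                j - margin:j + margin + 1, :]
            n += 1
    return patches
```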
2.1. Multiscale 3D-CNN
Multiscale convolution means that convolution kernels of different sizes are used simultaneously to learn multiscale information [42]. Multiscale information has been applied to classification because of the rich context information in multiscale structures [43]. A multiscale convolution block can build a more powerful network model for hyperspectral detection and classification [44]. Our proposed multiscale 3D-CNN consists of $n = 3$ parallel convolution kernels of different sizes and extracts three feature cubes of equal size, such as 13 × 13 × 16 in our experiment. The value at position $(x, y, z)$ of the $j$-th feature cube $V$ in the $i$-th layer is calculated as follows:

$$v_{i,j}^{x,y,z} = \mu\left( \sum_{m} \sum_{p} \sum_{q} \sum_{r} K_{i,j,m}^{p,q,r}\, v_{i-1,m}^{x+p,\,y+q,\,z+r} + R_{i,j} \right)$$

where $K$ and $R$ are the parameters of the convolution kernel and the bias term, respectively; $i$, $j$, and $m$ are the indexes of the input layer, output layer, and feature map, respectively; the sums over $p$, $q$, and $r$ run over the kernel extent; and $\mu$ is the activation function, such as the widely used rectified linear unit (ReLU). The input of the 3D-CNN is a small nearest-neighbor patch of size $S \times S \times B$, and the subsampling stride of the 3D-CNN is $(s_1, s_2, s_3)$. For a kernel of size $k_1 \times k_2 \times k_3$, the output spatial width is $\lfloor (S - k_1)/s_1 \rfloor + 1$ and the spectral depth is $\lfloor (B - k_3)/s_3 \rfloor + 1$. We let $s_1 = s_2 = s_3 = 1$ and add the same padding operation to ensure that the output size equals the input size. Then, we concatenate the outputs of the three multiscale 3D-CNNs on the channel dimension and feed them into the SE block.
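A minimal Keras sketch of this multiscale block, consistent with the stated TensorFlow implementation, is given below. The three kernel sizes (3 × 3 × 3, 5 × 5 × 5, 7 × 7 × 7) are illustrative assumptions, since the text specifies only three parallel branches with eight kernels each; 'same' padding and unit strides follow the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def multiscale_3d_block(x, filters=8,
                        kernel_sizes=((3, 3, 3), (5, 5, 5), (7, 7, 7))):
    """Three parallel 3D convolutions ('same' padding, stride 1),
    concatenated on the channel dimension."""
    branches = [layers.Conv3D(filters, k, strides=1, padding='same',
                              activation='relu')(x) for k in kernel_sizes]
    return layers.Concatenate(axis=-1)(branches)

# Example: a 13 x 13 x 16 patch with one input channel yields a
# 13 x 13 x 16 x 24 multiscale feature cube (3 branches x 8 kernels).
inp = layers.Input(shape=(13, 13, 16, 1))
out = multiscale_3d_block(inp)
```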
2.2. SE Block
The SE block was originally developed to improve network performance by explicitly modeling the interdependencies between channels and adaptively recalibrating the channel-wise feature responses [45]. Owing to its lightweight nature, the SE block can also reduce model parameters and increase detection speed. A learning framework with an SE block can characterize channel-wise spectral–spatial features well [46]. An SE block usually comprises two steps, squeeze and excitation. The squeeze operation first compresses the features along the spatial dimension, turning each 2D feature channel into a real number with, to some extent, a global receptive field. The channel statistic $Z$ is generated by reducing the spatial dimension $H \times W$ of the reshaped feature cube $V$ using global average pooling. The $c$-th element of $Z$ is calculated as follows:

$$z_c = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} v_c(x, y)$$

Once the global spatial information is embedded into the feature vector $Z \in \mathbb{R}^{C}$, the excitation operation converts the feature vector $Z$ to another feature vector $S$ as follows:

$$S = \sigma\left( \omega_2\, \mu(\omega_1 Z) \right)$$

where $\mu$ and $\sigma$ are the ReLU activation function and sigmoid activation function, respectively, and $\omega_1$ and $\omega_2$ represent the weights of two consecutive fully connected layers. The weights are generated automatically by explicitly learning the correlation between the feature channels. The output feature channels of $S$ match the input feature channels of $Z$. These two operations let the output vector $S$ capture the global information distribution and select the more informative feature channels by tuning the weights. Finally, the original feature cube $V$ is recalibrated on the channel dimension by weighting it with the feature vector $S$ via channel-wise multiplication:

$$\tilde{v}_c = s_c \cdot v_c$$

Accordingly, the SE block automatically extracts the important information of the feature channels, enhancing the selected features and suppressing the less useful ones.
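The following sketch implements the squeeze and excitation steps described above on a reshaped 2D feature map. The channel reduction ratio of the first fully connected layer (here 4) is an assumption, as the text does not specify it:

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(v, reduction=4):
    """Squeeze (global average pooling over H x W) and excitation
    (two fully connected layers), then channel-wise recalibration of v."""
    channels = v.shape[-1]
    z = layers.GlobalAveragePooling2D()(v)                         # squeeze: Z
    s = layers.Dense(channels // reduction, activation='relu')(z)  # omega_1, mu
    s = layers.Dense(channels, activation='sigmoid')(s)            # omega_2, sigma
    s = layers.Reshape((1, 1, channels))(s)
    return v * s               # broadcast multiply recalibrates each channel
```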
2.3. Pyramid Pooling Module
The pyramid pooling module further integrates spatial information. It is commonly used for scene parsing in image semantic segmentation [47] and pattern recognition [48], since it aggregates multiple receptive fields at different scales, and different scales of receptive field can gather richer spatial context information. It is therefore reasonable to employ it to address the limitations of single-size receptive fields in the "small sample problem" of hyperspectral image classification and to improve classification accuracy. In this paper, the framework and parameters of the pyramid pooling module are specified as follows. A 2D convolutional layer with a kernel size of 3 × 3 and 128 channels extracts a feature map before the pyramid pooling layers. Then, four pooling layers with different receptive fields downsample the feature map to different scales; the sizes of the four pooling layers are 13 × 13, 7 × 7, 5 × 5, and 3 × 3. Subsequently, a 1 × 1 convolution layer changes the number of channels, and the four downsampled feature maps are restored to the original size by a deconvolution layer. The original feature map and the four restored feature maps are concatenated along the channel dimension. A final 2D convolution layer with a kernel size of 3 × 3 and 256 channels extracts the final features, which provide rich global context information for pixel-level classification, even under the "small sample problem".
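The sketch below follows the layer sizes stated above (128-channel 3 × 3 convolution, pooling sizes 13, 7, 5, 3, 1 × 1 convolutions, 256-channel final convolution), assuming a fixed-size input feature map. Two points are our assumptions rather than the paper's specification: average pooling is used (the pooling type is not stated), and the restoration to the original size is done with bilinear resizing instead of the deconvolution layer the paper describes, to keep the sketch short:

```python
import tensorflow as tf
from tensorflow.keras import layers

def pyramid_pooling_module(x, pool_sizes=(13, 7, 5, 3)):
    """Pyramid pooling with the layer sizes given in Section 2.3."""
    h, w = x.shape[1], x.shape[2]                # fixed spatial size assumed
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)  # 3x3, 128 ch
    branches = [x]
    for p in pool_sizes:
        b = layers.AveragePooling2D(pool_size=p, strides=p, padding='same')(x)
        b = layers.Conv2D(128, 1, activation='relu')(b)   # 1x1 conv changes channels
        b = layers.Resizing(h, w)(b)     # restore size (paper uses deconvolution)
        branches.append(b)
    x = layers.Concatenate(axis=-1)(branches)             # original + 4 restored maps
    return layers.Conv2D(256, 3, padding='same', activation='relu')(x)  # 3x3, 256 ch
```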
4. Experimental Results
In order to evaluate the performance of the proposed MSPN model and compare it with other supervised deep learning methods, the following five experiments were implemented.
The first experiment evaluated the classification performance of the proposed MSPN and other models in terms of overall accuracy (OA), average accuracy (AA), and kappa coefficient, using different numbers of training samples from the three classic datasets.
The second experiment compared the classification performances of the original MSPN model and its variants, such as by removing the multiscale 3D-CNN, SE block, or pyramid pooling module.
The third experiment verified the influence of the selection of principal components on the model performance.
The fourth experiment determined the proper number of convolutional kernels and the number of parallel convolutional layers.
The fifth experiment tested the model on the recent high-resolution remote sensing images of the WHU-Hi-LongKou dataset. We compared our proposed MSPN model with other methods, taking 0.1% of the labeled pixels as the training sample. In addition, a confusion matrix was used to visualize the classification results of our proposed model.
The performance was evaluated using OA, AA, and the kappa coefficient, where OA is the ratio of correctly classified pixels to total pixels, AA is the mean of the per-class classification accuracies, and the kappa coefficient is a statistical measure of the consistency between a predicted map and the ground truth. The experiments were implemented in TensorFlow. We used mini-batches of size 128 for training the network; the optimizer was Adam, and the learning rate was set to 0.001. All experiments were repeated 10 times, and the average value was taken as the final classification accuracy.
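A minimal sketch of this training configuration and of the three evaluation metrics is given below, assuming scikit-learn's confusion matrix and kappa implementations; `model`, `x_train`, and `y_train` are placeholders for the MSPN and a prepared dataset:

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def train(model, x_train, y_train, epochs=50):
    """Compile and fit with the stated hyperparameters: Adam, learning
    rate 0.001, mini-batch size 128, cross-entropy loss."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model.fit(x_train, y_train, batch_size=128, epochs=epochs)

def evaluate(y_true, y_pred):
    """OA, AA, and kappa, computed as defined in the text."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                 # correct pixels / total pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean per-class accuracy
    kappa = cohen_kappa_score(y_true, y_pred)    # agreement with ground truth
    return oa, aa, kappa
```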
For the first experiment, Table 2, Table 3 and Table 4 list the classification performance of the different methods on all classes, and Figure 2, Figure 3 and Figure 4 show the ground truths and the predicted classification maps of the different models. Evidently, the classification maps and accuracy of the MSPN are better than those of the other models. It should be noted that the MSPN uses fixed hyperparameters for each dataset; that is, the MSPN generalizes well across different hyperspectral images. The OA, AA, and kappa coefficient obtained by the MSPN were the highest of all the models: the OA of the MSPN was 96.09% on the Indian Pines dataset, 97% on the Salinas dataset, and 96.56% on the Pavia University dataset. It is interesting to note that the hybrid 3D-CNN and 2D-CNN models are superior to either the 2D-CNN or the 3D-CNN alone, and the addition of multiscale modules further improves performance.
Figure 5 presents the convergence of the accuracy and loss during the training phase over 50 epochs. Table 5 presents the computational efficiency of the MSPN in terms of training and testing time. As shown, the MSPN model is more efficient than the R-HybridSN model. Convergence was achieved after just 30 epochs, mainly because the fully connected layer is replaced with the global pooling layer. In addition, relative to the other models, the number of 3D-CNN layers is reduced while the number of 2D-CNN layers is increased, making the network lightweight. No regularization strategy such as batch normalization or dropout is used.
In the second experiment, the contribution of each module was explored in detail. Removing the multiscale module means reducing it to a single-scale module in which only the parameters of the middle 3D-CNN branch are kept; removing the SE block or the pyramid pooling module means removing it from the network completely. Figure 6 shows the performance of the MSPN and its variants. The results show that the pyramid pooling module contributes the most to the model: when it is removed, the OA decreases the most, mainly because this module contains the most convolution layers and multiscale information. The SE block has the least impact because it only recalibrates the features along the spectral dimension to a certain extent; compared with the other modules, it plays only an auxiliary role in feature learning.
In the third experiment, the effect of the number of principal components on the classification performance was investigated. Fewer principal components mean fewer spectral features and shorter computation times, and vice versa. Figure 7 shows the classification OA as a function of the number of principal components. As expected, the OA increases with the number of principal components. However, the optimal number of principal components differs between datasets. Considering generalization and stability, k = 16 is recommended. It should be noted that the PCA preprocessing may lose part of the information in the original hyperspectral data cube. Therefore, it will be necessary in future work to explore new dimensionality reduction methods that reduce the parameters while retaining as much of the original information as possible.
For the fourth experiment, Figure 8 shows the performance of different numbers of convolution kernels and parallel layers used in the MSPN. This experiment did not consider the SE block because it contains no convolution layers. Following the single-variable principle, we used OA as the evaluation criterion. For the multiscale 3D-CNN, we varied the number of convolutional kernels per layer over 2, 4, 8, and 16 with the number of parallel convolutional layers fixed at three, and we varied the number of parallel convolution layers by copying or removing the middle branch with the number of kernels per layer fixed at eight. For the pyramid pooling module, the variables were set similarly. The best configuration was a multiscale 3D-CNN with three parallel layers and eight kernels per layer, and a pyramid pooling module with four parallel layers and 128 convolution kernels per layer. It should be noted that these optimal parameters were employed in the first experiment for comparison, and thus the best classification performance was achieved.
For the fifth experiment, Table 6 lists the classification performance of the different methods on all classes, Figure 9 shows the ground truths and the predicted classification maps of the different models, and Figure 10 shows the visualization of the confusion matrix. From these charts, we can see that our proposed model still performs better on the recent high-resolution WHU-Hi-LongKou dataset.