Spatial–Spectral Squeeze-and-Excitation Residual Network for Hyperspectral Image Classification

Abstract: Jointly using spectral and spatial information has become a mainstream strategy in the field of hyperspectral image (HSI) processing, especially for classification. However, due to the existence of noisy or correlated spectral bands in the spectral domain and inhomogeneous pixels in the spatial neighborhood, HSI classification results are often degraded and unsatisfactory. Motivated by the attention mechanism, this paper proposes a spatial–spectral squeeze-and-excitation (SSSE) module to adaptively learn the weights for different spectral bands and for different neighboring pixels. The SSSE structure can suppress or excite features at certain positions, which effectively resists noise interference and improves the classification results. Furthermore, we embed several SSSE modules into a residual network architecture and generate an SSSE-based residual network (SSSERN) model for HSI classification. The proposed SSSERN method is compared with several existing deep learning networks on two benchmark hyperspectral data sets. Experimental results demonstrate the effectiveness of our proposed network.


Introduction
Hyperspectral sensors collect information as a series of images, represented by hundreds of narrow and contiguous spectral bands across a wide range of the spectrum, which allows detailed spectral signatures to be identified for different imaged materials [1][2][3]. The resulting hyperspectral image (HSI) can be used to find objects, identify specific materials, and detect processes in different application fields [1,3], such as military, agriculture, and mineralogy. Among these applications, classification is a basic problem which aims to assign a class label to each pixel in an HSI [4]. Due to the discriminative characteristics of spectral curves, traditional HSI classification models are often based on spectral information. Typical spectral-based classifiers [2] include support vector machines (SVM), Bayesian models, random forests (RF), and artificial neural networks.
However, the intrinsic complexity of hyperspectral images usually makes these traditional methods unsuitable for consistently providing satisfactory classification results. Compared with the large number of spectral bands, the number of labeled training samples is usually quite limited in practice. This high-dimensionality, small-sample problem makes classification much more difficult and can lead to the Hughes phenomenon [5]. In addition, due to the effects of the acquisition conditions and imaging mechanism, there often exist redundant or even noisy spectral bands in the HSI. By performing feature extraction, the above two problems can be alleviated to a certain extent [6,7]; one of the key problems is how to effectively extract features of the HSI, and spectral-spatial features are currently a mainstream choice. However, the spatial neighborhood of a pixel may contain inhomogeneous pixels, i.e., pixels whose labels are different from that of the central pixel z. To quantify this, we compute the ratio of the number of inhomogeneous pixels to the total number of pixels in the spatial neighborhood. Figure 1c shows the ratio for each pixel. It can be clearly seen that the pixels around the boundary usually have high ratio values, which means that their spatial neighborhoods contain a large number of inhomogeneous pixels. Both the redundant or noisy bands and the inhomogeneous neighboring pixels produce negative effects in the classification. In this paper, motivated by the idea of attention mechanisms, we construct a spatial-spectral squeeze-and-excitation (SSSE) structure to adaptively learn the weights for different spectral bands and for different neighboring pixels at the same time. SSSE trains the network to suppress or excite features at certain spectral bands or spatial positions, which can effectively overcome the redundancy in the spectral channels and the pixel inconsistency in the spatial neighborhood. Furthermore, we embed several SSSE modules into a residual network architecture and generate an SSSE-based residual network (SSSERN) model for HSI classification.
The rest of this paper is organized as follows. Section 2 introduces the residual network and SE structure, and then describes our proposed method. The experimental results and analysis are provided in Section 3. Section 4 gives a discussion. Finally, Section 5 draws the conclusions.

Spatial-Spectral Squeeze-and-Excitation Residual Network
For spectral-based classifiers, the hundreds of spectral bands in hyperspectral data lead to a large degree of feature redundancy and noise, which dramatically affects the classification performance, especially when the number of training samples is small. For spatial-neighborhood-based classification methods, neighboring pixels which are too far from the central pixel usually provide limited contributions to the classification of the central target pixel, especially when the neighborhood window is large. To overcome the redundancy in the spectral channels and the pixel inconsistency in the spatial neighborhoods, we propose a spatial-spectral squeeze-and-excitation (SSSE) structure, which can adaptively learn the weights for different spectral bands and for different neighboring pixels at the same time. Motivated by the recalibration idea of the SE structure, SSSE trains the network to suppress or excite features at a certain position, which can effectively resist noise interference and improve the classification results.

Residual Connections
It has been demonstrated, in previous studies, that skip-connections can take advantage of the multi-level features of a CNN and are effective for various visual tasks [29][30][31][32]. Here, we briefly introduce the concept of residual connections [31,32]. A residual connection adds a shortcut by identity mapping, so that the network only needs to learn the residual of the original non-linear transformation. The residual connection can be written as:

X_l = h(X_{l-1}) = X_{l-1} + f(X_{l-1}), (1)

where X_{l-1} and X_l refer to the input and output of the l-th layer, and h(·) is the original mapping. The desired underlying mapping h can be recovered indirectly by training the residual function f(·), which can be a composite transformation of conventional CNN operations. A typical residual module structure, called a bottleneck residual block, is shown in Figure 2. Residual connections can effectively enhance the flow of information between the top and bottom of the network and can alleviate the over-fitting problem. In addition, the extra mapping structure adds almost no parameters to the network, and residual networks are easier to optimize [30].
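The identity-shortcut idea above can be sketched in a few lines of NumPy. This is an illustrative toy (the residual function f is a placeholder linear-plus-ReLU map, not the paper's convolutional bottleneck):

```python
import numpy as np

def residual_block(x, f):
    """Identity shortcut: the output is X_l = X_{l-1} + f(X_{l-1}),
    so the network only has to learn the residual function f."""
    return x + f(x)

# Placeholder residual function: a linear map followed by ReLU,
# standing in for the convolutional transformations of a real block.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
f = lambda x: np.maximum(W @ x, 0.0)

x = rng.standard_normal(4)
y = residual_block(x, f)
```

Note that if f learns to output zeros, the block degenerates gracefully to the identity mapping, which is why deep residual networks are easier to optimize.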

SpectralSE: Squeeze Spatial Information and Excite Spectral Features
In order to deal with hyperspectral images, we define a SpectralSE structure which squeezes spatial information and excites spectral features. Similar to the traditional squeeze-and-excitation (SE) module [28], SpectralSE aims to recalibrate the channel-wise feature responses by modelling interdependencies between the channels. Let U = [u_1, u_2, ..., u_C] denote the input of the SE module, where u_k ∈ R^{H×W} is the feature map of the k-th channel. As each element of u_k corresponds to only one local area, the bottom layers, with their small receptive fields, suffer from a severe lack of global information [28]. To alleviate this problem, we propose to squeeze the global spatial information into a channel descriptor. This is achieved by global average pooling over the spatial dimensions, which generates a channel-wise statistic z ∈ R^C, with elements

z_k = F_sq(u_k) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_k(i, j), (2)

where F_sq(·) is called the squeeze operator.
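The squeeze operator in Equation (2) is simply a per-channel spatial average. A minimal NumPy sketch (shapes chosen to match the 11 × 11 window and 128 compressed channels used later in the paper):

```python
import numpy as np

def squeeze(U):
    """Global average pooling over the spatial dimensions:
    z_k = (1 / (H*W)) * sum_{i,j} u_k(i, j), giving z in R^C."""
    # U has shape (H, W, C); average over the two spatial axes.
    return U.mean(axis=(0, 1))

H, W, C = 11, 11, 128  # window size and channel count from the paper
U = np.random.default_rng(1).standard_normal((H, W, C))
z = squeeze(U)
```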
To fully capture the channel-wise dependencies, in the process of excitation, a simple gating mechanism with a sigmoid activation σ(·) is used to get the final stimulus value:

s = F_ex(z, W) = σ(W_2 δ(W_1 z)), (3)

where δ(·) is the ReLU function. In order to limit the complexity of the model, a bottleneck with two fully-connected (FC) layers is used to parameterize the excitation operation, where W_1 ∈ R^{(C/2)×C} and W_2 ∈ R^{C×(C/2)} are the weight matrices of the two fully-connected layers. After the squeeze and excitation operations, the final output of the block is:

ũ_k = F_scale(u_k, s_k) = s_k · u_k. (4)

Figure 3a depicts the schema of SpectralSE.
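The excitation and scaling steps can be sketched as follows. This is a minimal NumPy illustration with random (untrained) weights, using a reduction ratio of 2 as in the text; a real implementation would learn W_1 and W_2 by backpropagation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def excite(z, W1, W2):
    """s = sigmoid(W2 @ relu(W1 @ z)): a bottleneck of two FC layers
    (C -> C/2 -> C) gated by a sigmoid, so every s_k lies in (0, 1)."""
    return sigmoid(W2 @ np.maximum(W1 @ z, 0.0))

rng = np.random.default_rng(2)
C = 8
W1 = rng.standard_normal((C // 2, C))   # R^{(C/2) x C}
W2 = rng.standard_normal((C, C // 2))   # R^{C x (C/2)}
z = rng.standard_normal(C)
s = excite(z, W1, W2)

# Recalibration: each channel of U is rescaled by its stimulus value s_k;
# broadcasting applies s over the spatial dimensions.
U = rng.standard_normal((11, 11, C))
U_recal = U * s
```

Because the sigmoid output is strictly between 0 and 1, each channel is attenuated in proportion to its learned importance rather than being hard-selected.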

SpatialSE: Squeeze Spectral Information and Excite Spatial Features
Similar to SpectralSE, we also define a SpatialSE module, which transposes the roles of the spectral and spatial dimensions in the SpectralSE operation. The feature maps of U are squeezed along the channel dimension to compress the information of all channels; the result is then excited and used to scale the original spatial information. Let U = [u_{1,1}, u_{1,2}, ..., u_{i,j}, ..., u_{W,H}] denote the slices along the spatial dimensions, where u_{i,j} ∈ R^{1×1×C} refers to the feature at the spatial position (i, j). The squeeze and excitation operations are completed by a convolution followed by a sigmoid activation:

q = σ(W * U), (5)

where W ∈ R^{1×1×C} is the convolution kernel and q ∈ R^{W×H}. Each q_{i,j} is an excited linear combination of all channels of U at position (i, j). The final recalibration result is obtained by multiplying U with the activation values:

ũ_{i,j} = q_{i,j} · u_{i,j}. (6)

Figure 3b shows the framework of the SpatialSE module.
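Since a 1 × 1 × C convolution is just a per-position dot product over the channel axis, SpatialSE reduces to a tensor contraction followed by a sigmoid gate. A minimal NumPy sketch with random (untrained) weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_se(U, w):
    """A 1x1xC convolution projects every spatial position to a scalar,
    q_{i,j} = sigmoid(w . u_{i,j}); U is then rescaled position-wise."""
    q = sigmoid(np.tensordot(U, w, axes=([2], [0])))  # shape (H, W)
    return U * q[:, :, None], q

rng = np.random.default_rng(3)
H, W, C = 11, 11, 16
U = rng.standard_normal((H, W, C))
w = rng.standard_normal(C)   # the 1x1xC kernel
U_recal, q = spatial_se(U, w)
```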

SSSE: Combination of SpectralSE and SpatialSE
Finally, we combine the SpectralSE and SpatialSE modules to get the spatial-spectral squeeze-and-excitation (SSSE) structure:

U_SSSE = α · U_SpectralSE + (1 − α) · U_SpatialSE, (7)

where α is a trainable variable, allowing the network to learn the proportions of channel excitation and spatial excitation. When the value at position (i, j, c) of U is highly important, it will have a high activation value in the recalibration of both the channel dimension and the spatial dimension. This recalibration encourages the network to learn more meaningful feature maps that are spectrally and spatially related. The SSSE structure is shown in Figure 3c.
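The combination in Equation (7) is a convex blend of the two recalibrated feature maps, so the limiting cases of α recover the individual modules (this matches the ablation in the experiments, where α = 0 gives SpatialSE and α = 1 gives SpectralSE). A one-function sketch:

```python
import numpy as np

def ssse_combine(U_spectral, U_spatial, alpha):
    """Weighted combination of the two recalibrated feature maps:
    U_SSSE = alpha * U_SpectralSE + (1 - alpha) * U_SpatialSE.
    alpha = 1 recovers SpectralSE; alpha = 0 recovers SpatialSE."""
    return alpha * U_spectral + (1.0 - alpha) * U_spatial

rng = np.random.default_rng(4)
A = rng.standard_normal((11, 11, 8))  # stand-in for U_SpectralSE
B = rng.standard_normal((11, 11, 8))  # stand-in for U_SpatialSE
U_ssse = ssse_combine(A, B, 0.5)
```

In the full model α is a trainable scalar updated by gradient descent along with the network weights; here it is fixed for illustration.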

SSSERN: Spatial-Spectral Squeeze-and-Excitation Residual Network
Now, we propose a new residual network that includes the SSSE structure, as shown in Figure 4. In the proposed SSSERN, batch normalization is used to correct the distribution of each layer and speed up the training [33]. The Xavier initialization method is used to initialize the network weights [34], and the Adam optimizer is used to minimize the cross-entropy loss [35]. The details of the layers of the proposed SSSERN method are described in Table 1. The proposed network has four SSSE residual blocks. At the beginning, we use a 1 × 1 convolution kernel to extract features. Taking the Indian Pines data set as an example, the hyperspectral cube with size 11 × 11 × 200 is compressed to 11 × 11 × 128 by performing convolution with 128 filters of size 1 × 1 × 200. Here, the number of residual blocks and the number of compression channels are adjustable. Following the SSSE residual blocks, global pooling is used to transform the feature map into a one-dimensional vector. Finally, through softmax regression, the predicted probability for each category is obtained.
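The shape bookkeeping of the initial compression and final pooling stages described above can be verified with a short NumPy walk-through (random data in place of a real Indian Pines patch; a 1 × 1 convolution is equivalent to a per-pixel matrix product over the channels):

```python
import numpy as np

rng = np.random.default_rng(5)

# Input patch: 11 x 11 spatial window, 200 spectral bands (Indian Pines).
cube = rng.standard_normal((11, 11, 200))

# 128 filters of size 1x1x200: each pixel's 200-band spectrum is
# projected to 128 features, compressing 11x11x200 -> 11x11x128.
kernels = rng.standard_normal((200, 128))
features = np.einsum('hwc,cf->hwf', cube, kernels)

# After the SSSE residual blocks (omitted here), global average pooling
# collapses the spatial dimensions into a single 128-d vector that
# feeds the softmax classifier.
vector = features.mean(axis=(0, 1))
```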

Datasets
To evaluate the performance of the proposed method in HSI classification, we use the following two benchmark hyperspectral data sets: (1) Indian Pines: This data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The image scene contains 145 × 145 pixels and 220 spectral bands, covering 0.4–2.5 µm, of which 20 water-absorption bands were discarded. The spatial resolution of the Indian Pines data is 20 m. There are 16 classes in the data, as shown in Figure 5. The number of samples in each class is shown in Table 2. (2) University of Pavia: This data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The ROSIS sensor generates 115 bands, ranging from 0.43–0.86 µm, of which 12 noisy bands were deleted and the remaining 103 bands are used for the experimental analysis. The spatial resolution is 1.3 m. The scene has a size of 610 × 340 pixels and contains 9 ground categories, as shown in Figure 6. The number of samples in each class is shown in Table 3.

Classification Performance on Indian Pines and University of Pavia Data Sets
In this paper, the TensorFlow deep learning framework was used to build and train the proposed SSSERN. We compare the proposed method with six available classification methods in the literature: (1) Support Vector Machine (SVM) with a radial basis function kernel; (2) Random Forest (RF); (3) Multi-Layer Perceptron (MLP); (4) 2D-CNN [25]; (5) 3D-CNN [12]; and (6) SSRN [29]. Among these methods, SVM, RF, and MLP are spectral classifiers, and 2D-CNN can be considered as a spatial method which uses PCA to reduce the dimensionality of hyperspectral data and extracts only one principal component. Finally, 3D-CNN, SSRN, and the proposed SSSERN are spatial-spectral methods.
In the experiments, we randomly selected 15% of the samples from each class to form the training set; the test set consisted of the remaining samples. The experiment was repeated five times with randomly-chosen training samples, and the results of the five runs were averaged. The class accuracy (CA), overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) on the testing set were recorded to assess the performance of the different classification methods. In 2D-CNN, 3D-CNN, and our proposed algorithm, the neighborhood window was set to 11 × 11. The classification results on the two data sets are shown in Tables 4 and 5, respectively.

Table 4. Overall, average, and individual class accuracies and κ statistics in the form of mean ± standard deviation for the Indian Pines data set. The best results are highlighted in bold typeface.

From the classification results, we can see that:

(1) The proposed SSSERN provided the best classification results on the two data sets.
(2) By jointly using the spectral and spatial information in a deep network architecture, the spatial-spectral methods (i.e., 3D-CNN, SSRN, and the proposed SSSERN) dramatically outperformed the spectral-based and spatial-based methods.
(3) Compared with existing deep learning methods (i.e., 2D-CNN, 3D-CNN, and SSRN), the proposed SSSERN showed better results. This demonstrates that the proposed SSSE structure can extract much more effective spectral-spatial features by highlighting important spectral bands or neighboring pixels and suppressing noisy spectral bands or dissimilar neighboring pixels.

Figures 7 and 8 show the classification maps of SVM, RF, MLP, 2D-CNN, 3D-CNN, SSRN, and our proposed SSSERN on the Indian Pines and University of Pavia data sets, respectively. The spectral-based classifiers, such as SVM and RF, generated noisy classification maps because they only considered isolated spectral samples and did not use spatial information to enhance the spatial neighborhood consistency. The spatial-spectral classifiers (i.e., 3D-CNN, SSRN, and SSSERN) provided much better results than the spectral classifiers and generated maps with little noise and clear object boundaries. Among all methods, our proposed SSSERN achieved the classification map closest to the actual ground-truth; that is to say, the class boundaries were better defined and the background pixels were better classified.

Investigation on the Effect of Network Parameters
Now, we investigate the effect of parameters on the classification performance of SSSERN. The parameters are the width of the input feature window ω (i.e., the window is ω × ω), the combination coefficient α, and the number of residual blocks N_block, where ω controls the size of the input features, α indicates the ratio of SpectralSE to SpatialSE, and N_block decides the depth of the network. We also investigate the effect of the number of training samples, where 5% and 15% of the samples from each class in Indian Pines are chosen for training.
We first fix α = 0.5 and N_block = 4, and show the effect of ω. Six different values of ω (3, 5, 7, 9, 11, and 13) were considered. The corresponding OA values of SSSERN, in the case of 5% and 15% training samples, are shown in Figure 9. It can be clearly seen that the OA of SSSERN increased rapidly with the increase of ω and achieved relatively stable results when ω ≥ 9. The optimal values of ω were 9 and 11 for 5% and 15% training samples, respectively. In the experiments, ω = 11 was used.

Next, we investigate the effect of α. From Equation (7), when α = 0, the SSSE module is reduced to SpatialSE. When α = 1, SSSE is reduced to SpectralSE. When α = 0.5, SpatialSE and SpectralSE have the same importance in the SSSE. For simplicity, we only considered these three values of α (i.e., 0, 1, and 0.5). The OA of SSSERN versus different α values is shown in Figure 10, where SpectralSE, SpatialSE, and SSSE correspond to α = 1, α = 0, and α = 0.5, respectively. It can be seen that the SSSE module that combined SpatialSE and SpectralSE provided the best results.

To further investigate the effectiveness of SSSE, we show the results of SSSERN with and without SSSE modules. As shown in Figure 4, the SSSE module is attached onto the residual block (resBlock).
When the SSSE modules are deleted, SSSERN is reduced to a general residual network. Figure 11 shows the OA of SSSERN with and without SSSE modules. It can be clearly seen that SSSE modules were more effective than traditional residual modules, and the optimal number of SSSE blocks was either 3 or 4.

Investigation on the Stimulus Values by the SSSE Structure
Although previous experiments have proven the effectiveness of SSSE blocks in improving the network performance, we also want to understand how the automatic gating incentive mechanism works in practice. In this subsection, to show the behavior of the SSSE structure more clearly, we will study the activation outputs of individual samples in the model and check their distribution for different classes on different residual modules. Specifically, we choose six different classes from the Indian Pines data set (Classes 1, 3, 4, 11, 14, and 15), and select 50 samples from each class, and then calculate the average of the SSSE module output of these samples in different layers.
As the activation value in the SSSE structure is composed of two parts, namely the stimulus values in the spectral and spatial dimensions, the visualization results of these two parts are shown below. Figure 12 shows the averaged spectral-dimension stimulus value for each class. It can be seen that different classes of samples had different stimulus values for each channel, in each SSSE structure. In the third SSSE structure, Classes 1, 3, 4, and 14 showed synchronous suppression effects at the 36th channel, which demonstrates that the spectral characteristics of these classes were similar in this channel.

Figure 13 shows the activation values of the six classes in the spatial dimensions of different SSSE layers. In the figure, the brighter parts correspond to higher activation values. It can be seen that the features were almost always activated at the center position, and the positions around the boundary were suppressed. For a large window, the boundary pixels may be background pixels or pixels from different classes. In addition, they are far away from the central pixel and, hence, less important. By suppressing these boundary pixels, the SSSERN model can obtain better results.

Discussion
The SSSE structure can re-calibrate the spatial and spectral features through learning and thereby achieves the purpose of suppressing or stimulating certain features related to classification. In the following, we provide an example to display the effect of SSSE. Given a pixel from Class 11 of the Indian Pines data set, we can construct an 11 × 11 spatial neighborhood, as shown in Figure 14. It is clear that the neighborhood contains background pixels with label 0, pixels from the same Class 11, and pixels from the (different) Classes 5 and 6. We compute the stimulus values of the first-layer SpatialSE structure corresponding to the pixels in the neighborhood, and show the stimulus values as different colors in Figure 14. Brighter or darker colors correspond to larger or smaller excitation values, respectively. It can be clearly seen that SpatialSE can generate a mask to stimulate the homogeneous pixels which are helpful for classification and, meanwhile, suppress the inhomogeneous pixels (i.e., background pixels and pixels from Classes 5 and 6) which have negative effects on the classification.

Conclusions
In this paper, we have proposed a spatial-spectral squeeze-and-excitation residual network (SSSERN) method for HSI classification. In the framework of a residual network, the proposed SSSERN contains four SSSE blocks, which can excite or suppress features in the spectral and spatial dimensions, simultaneously, by feature re-calibration. The proposed SSSERN is compared with some state-of-the-art deep learning methods. The experimental results on the Indian Pines and University of Pavia data sets have shown the effectiveness of SSSERN.