1. Introduction
Hyperspectral images (HSIs) typically encompass tens or even hundreds of bands with rich spatial and spectral information [
1]. Hyperspectral imaging techniques are widely used in remote sensing [
2]. HSI classification refers to the process of categorizing land cover or objects within a scene using hyperspectral image data. It is a critical task in HSI analysis and finds a wide array of applications in fields such as vegetation cover monitoring [
3], change detection [
4], and atmospheric environmental studies [
5].
Considerable attention has been paid to the study of HSI classification. Many effective methods have been devised to address the HSI classification task. Early methods, in particular, include traditional approaches like support vector machine (SVM) [
6], k-nearest neighbor [
7], and multinomial logistic regression [
8]. These methods primarily concentrate on the spectral information of HSIs while overlooking the high spatial correlation within them. This oversight results in the loss of valuable spatial information, consequently limiting the classification accuracy. To utilize both spectral and spatial information, Huo et al. [
9] successfully extracted both spectral and spatial details from HSIs. This was achieved by incorporating SVM and Gabor filters. Fang et al. [
10] applied a method of clustering HSIs into numerous superpixels. This effectively leverages both spectral and spatial information through multiple kernels. Tarabalka et al. [
11] applied probabilistic SVM for pixel-wise HSI classification. They improved the results by incorporating spatial contextual information into the classification process through Markov random field regularization. While these methods enhance classification accuracy, they all rely on manually designed feature extraction techniques, which involve complex modeling procedures, limited feature expression capability, and weak generalization. This makes them insufficient to meet higher classification requirements.
In recent years, there have been continuous breakthroughs in deep learning in various fields such as target detection [
12], image classification [
13], and natural language processing [
14]. This has made the utilization of deep learning methods to extract deep features from HSIs a viable option. Numerous studies have shown that deep learning methods extract higher-level features better than traditional methods. These deep features can characterize more complex and abstract structural information, resulting in a significant enhancement of HSI classification accuracy. For example, Chen et al. [
15] proposed a stacked autoencoder to obtain useful high-level features for HSI. Li et al. [
16] introduced deep belief networks (DBN) to HSI feature extraction and classification. They employed the restricted Boltzmann machine as the hierarchical framework for DBN and the results proved that the DBN method can achieve excellent performance. However, the above methods flatten the input into vectors in the spatial feature extraction stage, which can lead to some loss of spatial information.
To address this problem, convolutional neural network (CNN)-based approaches were applied to the field of HSI classification [
17,
18,
19]. Hu et al. [
17] used 1D CNNs for HSI classification. They represented the hundreds of bands of HSI as one-dimensional vectors, focusing solely on spectral features and neglecting spatial features. Makantasis et al. [
18] applied 2D CNNs for spatial feature extraction in HSI. They combined principal component analysis and multilayer perceptron to construct high-level features encoding both spectral and spatial information. Chen et al. [
19] used 3D CNNs to build an HSI classification model. They addressed the common problem of imbalance between the finite samples of HSIs and their high-dimensional features through the use of regularization and dropout to avoid overfitting. As a result, their approach effectively extracted the spectral and spatial features of HSIs. Roy et al. [
20] proposed a hybrid spectral CNN (HybridSN), where 3D CNNs extract joint spectral–spatial features of spectral bands and 2D CNNs further learn the spatial representation at the abstraction level. The results demonstrated that the hybrid CNN performed better than 3D CNN or 2D CNN alone for classification. He et al. [
21] proposed a multi-scale 3D CNN (M3D-CNN) for HSI classification. The model jointly learned 2D multi-scale spatial features and 1D spectral features from HSI in an end-to-end manner. This approach led to advanced classification results. To mitigate the decline of deep learning accuracy, Zhong et al. [
22] designed a spectral–spatial residual network (SSRN). The SSRN incorporates spectral and spatial residual blocks, extracting rich spectral features and spatial contextual discriminative features from HSI. The residual blocks are connected to other 3D convolutional layers through constant mapping, which alleviates the training difficulty of the network. Mei et al. [
23] proposed a spectral–spatial attention network for hyperspectral image classification, which introduces recurrent neural network (RNN) and CNN architectures with attention mechanisms. The attention-equipped RNN captures spectral correlations within a continuous spectrum, while the attention-equipped CNN focuses on saliency features and spatial relationships among neighboring pixels in the spatial dimension. Li et al. [
24] applied a band attention module and a spatial attention module to alleviate the effects of redundant bands and interfering pixels. They extracted spectral and spatial features through multi-scale convolution. Ma et al. [
25] proposed a two-branch multi-attention mechanism network (DBMA). The network utilizes attention mechanisms to refine the feature map, resulting in optimal classification results. Attention mechanisms are increasingly employed in HSI classification tasks. In [
26], Xu et al. combined shallow and deep features using a multi-level feature fusion strategy. This approach enabled the model to leverage multi-scale information and achieve better adaptability for HSI classification. As seen, the models used in HSI classification are becoming increasingly complex and parameterized. They achieve high accuracy with sufficient samples but may exhibit a less-than-optimal performance in the presence of small sample problems.
To address the above challenges, we propose the spectral–spatial double-branch network (SSDBN) with attention mechanisms for HSI classification. This network consists of two independent branches—one for spectral and the other for spatial feature extraction. Additionally, we utilize advanced spectral attention and spatial attention modules to obtain classification results. On the spectral branch, we employ 1D convolution and long short-term memory (LSTM) for extracting spectral features. Simultaneously, on the spatial branch, we utilize serial multi-scale 2D convolution for extracting spatial features. The paper’s main contributions are as follows:
(1) We propose the spectral–spatial double-branch network (SSDBN) for HSI classification. This network is designed as an end-to-end HSI classification system that incorporates attention mechanisms, LSTM, and multi-scale 2D convolution. It discards multi-scale 3D convolution and serially employs multiple multi-scale 2D convolution modules. Moreover, the network fuses the outputs of multi-level multi-scale modules to extract spatial features and it utilizes the spectral sequence processing module (SSPM) and LSTM modules to extract spectral features. The whole network can maintain an excellent classification performance while significantly reducing the number of parameters;
(2) We adopt state-of-the-art lightweight spectral and spatial attention modules, which add negligible overhead while keeping the structure simple and efficient;
(3) The proposed algorithm attains superior classification results on three public datasets with small sample sizes, and its training and testing times are also shorter than those of the other deep learning algorithms compared.
The rest of the paper is organized as follows: in
Section 2, we describe the detailed structure of the proposed SSDBN. In
Section 3, we introduce the three datasets used in the experiments. In
Section 4 and
Section 5, we provide and analyze the results of the comparison and ablation experiments, respectively. Finally, we conclude the full paper and suggest directions for future work in
Section 6.
2. Methodology
The proposed overall framework of the SSDBN is shown in
Figure 1. HSI can be viewed as a 3D cube of shape
H ×
W ×
B, where
H,
W, and
B represent the height, width, and number of bands, respectively. First, the HSI is cut into 3D cubes of 9 × 9 ×
B. Then, the 3D cubes are divided into training, validation, and testing sets according to the set scale. The training set is employed to fit the sample data and train different models, the validation set is utilized to select the optimal model, and the performance of the model is evaluated by the testing set in the prediction phase. To determine whether the model converges in the training phase, the difference between the predicted and true values is usually measured using a cross-entropy loss function, which is defined as:

$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log \hat{y}_{i,c}$

where $N$ is the number of samples, $C$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label, and $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$.
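As an illustration, this loss can be sketched in a few lines of NumPy (a minimal version of the standard definition; the actual implementation would typically use a deep learning framework's built-in loss):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities.

    y_true: (N, C) one-hot ground-truth labels.
    y_pred: (N, C) predicted class probabilities (rows sum to 1).
    """
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Toy example: 2 samples, 3 classes.
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
loss = cross_entropy_loss(y_true, y_pred)
```

The loss is zero when the predicted probabilities exactly match the one-hot labels and grows as the predicted probability of the true class shrinks.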
These 3D cubes are subsequently directed to the spectral and spatial branches, then the spectral and spatial feature maps obtained, respectively, are fused for classification. The SSDBN mainly includes the following modules: multi-scale 2D convolution, SSPM, LSTM, and an attention mechanism. The multi-scale 2D convolution module consists of four parallel 2D convolution kernels for extracting spatial information and three serial multi-scale 2D convolution modules for multi-layer feature fusion, to achieve full extraction of spatial features. The SSPM module processes the spectral sequences before input to the LSTM to enhance the performance of the LSTM layer. The LSTM is employed to capture spectral features from four spectral subsequences, and the attention mechanism serves to enhance the classification accuracy and speed up the model fitting. Each module is elaborated upon below.
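The patch-cutting and sample-splitting steps described above can be sketched as follows. This is a minimal NumPy illustration; `extract_patches` and `split_indices` are hypothetical helper names, and the 5%/5% split mirrors the setting used for IP and KSC later in the paper:

```python
import numpy as np

def extract_patches(hsi, labels, patch=9):
    """Cut an H x W x B hyperspectral image into patch x patch x B cubes,
    one per labeled pixel (label 0 is treated as background)."""
    r = patch // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    cubes, ys = [], []
    H, W = labels.shape
    for i in range(H):
        for j in range(W):
            if labels[i, j] == 0:
                continue
            cubes.append(padded[i:i + patch, j:j + patch, :])
            ys.append(labels[i, j] - 1)
    return np.stack(cubes), np.array(ys)

def split_indices(n, train_frac, val_frac, seed=0):
    """Shuffle sample indices and split into train/val/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_tr = int(n * train_frac)
    n_va = int(n * val_frac)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Toy scene: 20 x 20 pixels, 8 bands, 3 classes (+ background 0).
rng = np.random.default_rng(0)
hsi = rng.random((20, 20, 8))
labels = rng.integers(0, 4, size=(20, 20))
cubes, ys = extract_patches(hsi, labels)
tr, va, te = split_indices(len(cubes), 0.05, 0.05)
```

Reflect padding at the image border is one common choice for keeping every labeled pixel usable; zero padding would work equally well here.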
2.1. RNN and LSTM
RNNs [
27] perform the same task for each element in a sequence, and the output element depends on the previous element or state. Due to their remarkable capacity for encoding contextual information, RNNs have found extensive application in addressing sequence classification problems [
28].
The structure diagram of an RNN is depicted in
Figure 2. When provided with a set of sequences, the computation of a recurrent neural network can be represented by the following equations:

$h_t = f(U x_t + W h_{t-1} + b)$
$o_t = V h_t + c$

where $b$ and $c$ denote the bias vectors; $W$, $U$, and $V$ are the weight matrices; $x_t$, $h_t$, and $o_t$ are the input, hidden, and output values at moment $t$, respectively; and $f$ is the activation function, for which, in general, the $\tanh$ function is chosen.
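The recurrent update can be illustrated with a minimal NumPy sketch (dimensions and weights are arbitrary toy values):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One step of the recurrent computation:
    h_t = tanh(U x_t + W h_{t-1} + b),  o_t = V h_t + c."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    o_t = V @ h_t + c
    return h_t, o_t

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3
U = rng.standard_normal((d_h, d_in))
W = rng.standard_normal((d_h, d_h))
V = rng.standard_normal((d_out, d_h))
b = np.zeros(d_h)
c = np.zeros(d_out)

h = np.zeros(d_h)                         # initial hidden state
seq = rng.standard_normal((5, d_in))      # a length-5 input sequence
for x_t in seq:
    h, o = rnn_step(x_t, h, U, W, V, b, c)
```

Because the same `U`, `W`, and `V` are reused at every step, the hidden state carries contextual information from earlier elements of the sequence forward.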
For the classification task, by adding an activation function, the final predicted model output can be obtained as:

$\hat{y}_t = g(o_t)$

where $g$ is the activation function, typically $\mathrm{softmax}$.
The loss function of RNN is usually chosen as cross entropy [
29], and the loss function of the RNN model at moment $t$ can be written as:

$L_t = -\sum_{i=1}^{N} y_i^{(t)} \log \hat{y}_i^{(t)}$

where $y_i^{(t)}$ and $\hat{y}_i^{(t)}$ denote the label and predicted label of the $i$th sample at moment $t$, respectively, and $N$ represents the number of input samples. Then, taking all moments $t$ into account, the total loss function is defined as:

$L = \sum_{t} L_t$
Since the RNN model is related to a time series, its parameters are optimized by backpropagation through time (BPTT), rather than by applying the standard back-propagation algorithm directly.
However, RNNs inevitably face vanishing or exploding gradients during backpropagation, which makes long sequences of data difficult to handle. To address this problem, ref. [
30] proposed the LSTM, which mitigates the vanishing-gradient problem of traditional RNNs, addresses the issue of long-range dependencies, and is widely applied to various tasks, such as natural language processing [
31] and financial market change prediction [
32]. The general structure of LSTM is depicted in
Figure 3.
The LSTM adds a cell state $c$ to preserve the long-term state and controls it through three gates that act as control switches: the forget gate $f$, the input gate $i$, and the output gate $o$. The forget gate determines how much of the cell state from the previous moment $c_{t-1}$ is preserved to the current moment $c_t$; information in the cell state is discarded selectively according to the value of the sigmoid function, with 0 indicating full discarding and 1 indicating full retention. The input gate determines how much of the network input at the current moment $x_t$ is preserved in the current cell state $c_t$, selectively recording the input information into the cell state, and the output gate controls how much of the current cell state $c_t$ is passed to the current output value $h_t$ of the LSTM. The forward propagation of one LSTM cell at time $t$ is calculated as follows:

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$
$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
$h_t = o_t * \tanh(c_t)$

where $W_f$, $W_i$, $W_c$, and $W_o$ are the corresponding weight matrices, and $b_f$, $b_i$, $b_c$, and $b_o$ are bias vectors. $\sigma$ is the sigmoid function, with the expression $\sigma(x) = 1/(1 + e^{-x})$ and function values between 0 and 1; $\tanh$ is the hyperbolic tangent function, with the expression $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ and function values between $-1$ and 1; and $*$ is the Hadamard product, indicating elementwise multiplication.
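A minimal NumPy sketch of one LSTM cell step, following the standard gate formulation (toy dimensions; the paper's model would use a framework LSTM layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM cell step; [h_{t-1}, x_t] is the concatenated gate input."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)                  # forget gate
    i_t = sigmoid(Wi @ z + bi)                  # input gate
    c_tilde = np.tanh(Wc @ z + bc)              # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # new cell state
    o_t = sigmoid(Wo @ z + bo)                  # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
d_in, d_h = 4, 6
shape = (d_h, d_h + d_in)
Wf, Wi, Wc, Wo = (rng.standard_normal(shape) for _ in range(4))
bf = bi = bc = bo = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((3, d_in)):      # a length-3 input sequence
    h, c = lstm_step(x_t, h, c, Wf, Wi, Wc, Wo, bf, bi, bc, bo)
```

Note how the cell state `c` is updated only through elementwise gating, which is what lets gradients flow across long sequences.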
2.2. SSPM
LSTM can achieve satisfactory results in extracting contextual information from adjacent sequence data, but an HSI usually has tens or even hundreds of consecutive, highly correlated bands, so directly feeding such complex spectral sequences into an LSTM for feature extraction often leads to unsatisfactory results. In our proposed method, a processing strategy combining 1D convolution and average pooling is used: the SSPM processes the spectral sequence into four subsequences of different sizes before they are input to the LSTM module. This allows the LSTM module to extract useful spectral features, compensating for its performance limitations, and also shortens the sequences, reducing the time cost of the whole model. The illustration of SSPM is shown in
Figure 4.
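Since the exact SSPM configuration is given in Figure 4, the following is only a rough NumPy sketch of the idea. The pooling rates 2/4/8/16 are illustrative assumptions, not the paper's values, and the 1D convolution step is omitted for brevity:

```python
import numpy as np

def avg_pool_1d(seq, k):
    """Non-overlapping 1D average pooling (trailing remainder dropped)."""
    n = len(seq) // k
    return seq[:n * k].reshape(n, k).mean(axis=1)

def sspm(spectrum, rates=(2, 4, 8, 16)):
    """Turn one spectral vector into four shorter, smoothed subsequences
    before they are fed to the LSTM branch (hypothetical rates)."""
    return [avg_pool_1d(spectrum, r) for r in rates]

spectrum = np.random.default_rng(2).random(200)   # e.g., the 200 IP bands
subseqs = sspm(spectrum)
```

Each subsequence is both shorter and smoother than the raw spectrum, which is the property the text attributes to SSPM: less work for the LSTM and less redundancy between adjacent bands.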
2.3. Multi-Scale 2D Convolution
Multi-scale convolution involves utilizing convolution kernels of various sizes simultaneously to learn multi-scale information [
33]. The multi-scale information can be applied to address relevant classification problems because of the rich contextual information in the multi-scale structure [
34]. In the HSI classification problem, it is currently popular to use multi-scale 3D convolutional blocks to build HSI classification network models and obtain excellent classification results [
35]. To decrease the number of parameters and build a more lightweight network, the multi-scale 2D convolutional module is proposed, which consists of four convolutional kernels of different sizes in parallel. The structure of the proposed multi-scale 2D convolution module is shown in
Figure 5. Compared to 3D convolution, one key advantage of 2D convolution here is that it treats the spectral bands as input channels rather than as an additional spatial dimension. This reduces the model's parameter count, lowers computational complexity, and mitigates the risk of overfitting during training. For hyperspectral images with numerous bands, 2D convolution can still aggregate spectral information across bands through its channel dimension. Therefore, the design choice of employing 2D convolution instead of 3D convolution contributes to enhanced model efficiency, reduced computational load, and decreased model complexity while maintaining performance.
The detailed parameter settings for this multi-scale 2D convolution module are shown in
Table 1. We tested several sets of parameters and chose this succinct set because it balanced accuracy and efficiency well for the HSI classification task and met our requirement of building a compact network. In general, a single multi-scale 2D convolution module does not sufficiently extract the effective features of the HSI data, so several such modules can be connected in series. In this paper, the number of multi-scale modules is set to 3; the reason for this choice is demonstrated in the ablation experiments later in the paper.
The input data of shape 9 × 9 × B is sent to the multi-scale convolution module. Since each multi-scale convolution performs the corresponding padding operation during feature extraction, the outputs of the four 2D convolutions are of equal size and, by concatenating these four outputs in the channel dimension, the resulting output size is equal to the input size, i.e., 9 × 9 × B.
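The parallel same-padding convolutions can be illustrated with a naive NumPy sketch. It is single-channel with random kernels, and the kernel sizes 1/3/5/7 are placeholders (the actual sizes are given in Table 1); the point is only that same padding makes all four outputs stackable:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive single-channel 2D convolution with zero 'same' padding,
    so the output has the same spatial size as the input."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(img, p)
    H, W = img.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def multiscale_block(img, sizes=(1, 3, 5, 7)):
    """Apply four parallel convolutions of different (assumed) kernel
    sizes and stack the results along a channel axis."""
    rng = np.random.default_rng(3)
    outs = [conv2d_same(img, rng.standard_normal((k, k))) for k in sizes]
    return np.stack(outs, axis=-1)

img = np.random.default_rng(4).random((9, 9))     # one band of a 9 x 9 patch
feat = multiscale_block(img)
```

With padding `p = k // 2` for each odd kernel size `k`, every branch preserves the 9 × 9 spatial extent, so concatenation along the channel dimension is always valid, matching the description of the module's output size.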
2.4. Attention Mechanism
The significance of different spectral bands and spatial pixels in feature extraction varies, and the attention mechanism allows for a selective focus on regions abundant in feature information while considering non-essential areas less. For spectral classification, each pixel can be depicted as a continuous spectral profile containing rich spectral features, and we can use the channel attention mechanism to focus on the interrelationship between the bands of features. In spatial classification, the spatial attention mechanism can be used to increase the weights of compelling pixels and focus on the spatial relationships of features. These two modules are described in detail below.
2.4.1. Spectral Attention Module
Most existing research employs intricate attention modules to attain improved performance, and the increase in model complexity inevitably consumes a large number of computational resources. Wang et al. [
36] proposed an effective channel attention (ECA) module that requires only a small number of additional parameters to deliver significant performance gains, which can effectively alleviate the tension between performance and complexity. The structure of this ECA module is shown in
Figure 6.
Since the purpose of the ECA module is to properly capture local cross-channel information interactions, the approximate range of channel interaction information (i.e.,
k) needs to be determined. The optimal range of optimized information interactions (i.e.,
k) can be tuned manually for convolutional blocks with a different number of channels in various CNN architectures, but manual tuning by cross-validation can be computationally resource intensive. In recent studies, group convolution has been successfully used in CNN architectures [
37] where high-dimensional (low-dimensional) channels are proportional to long-distance (short-distance) convolution for a fixed number of groups. Similarly, the coverage of cross-channel information interactions (i.e.,
k) should also be proportional to the number of channel dimensions
B. In other words, there may be a mapping $\phi$ between $k$ and $B$:

$B = \phi(k)$

The simplest mapping is a linear one, i.e., $\phi(k) = \gamma k - b$. However, linear functions have limitations in representing certain relevant relationships, and the channel dimension $B$ is usually set to a power of 2. Therefore, a nonlinear mapping can be obtained by extending the linear function to an exponential function with a base of 2, i.e.:

$B = \phi(k) = 2^{(\gamma k - b)}$

Therefore, given the channel dimension $B$, the convolution kernel size $k$ can be calculated smoothly as

$k = \psi(B) = \left| \frac{\log_2 B}{\gamma} + \frac{b}{\gamma} \right|_{odd}$

where $|t|_{odd}$ denotes the odd number nearest to $t$, by setting the values of $\gamma$ and $b$.
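The kernel-size selection can be sketched in a few lines of Python, using γ = 2 and b = 1 (the defaults reported in the ECA paper):

```python
import math

def eca_kernel_size(B, gamma=2, b=1):
    """Adaptively choose the 1D convolution kernel size k from the
    channel dimension B via k = |log2(B)/gamma + b/gamma|_odd."""
    t = int(abs(math.log2(B) / gamma + b / gamma))
    return t if t % 2 else t + 1    # force k to be odd

# Kernel sizes for the band counts of the three datasets (IP, UP, KSC):
ks = {B: eca_kernel_size(B) for B in (200, 103, 176)}
```

The kernel size grows only logarithmically with the number of bands, which is why the ECA module stays cheap even for hyperspectral inputs with hundreds of channels.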
2.4.2. Spatial Attention Module
The spatial attention mechanism aims to identify regions within the feature map that deserve focused attention. The convolutional block attention module (CBAM) [
38] is a simple but efficient lightweight attention module. In the spatial dimension, given an $H \times W \times C$ feature map $F$, two $H \times W \times 1$ spatial descriptors are first obtained by average pooling and max pooling along the channel dimension, respectively, and these two descriptors are concatenated along the channel axis. Then, after a $7 \times 7$ 2D convolution with the sigmoid activation function, the obtained weights are multiplied with the input feature $F$ to obtain the adaptively refined features. The spatial attention module used in the proposed method is schematically shown in
Figure 7.
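A rough NumPy sketch of this spatial attention computation follows. It uses a single random 7 × 7 kernel, and the two-channel convolution over the stacked descriptors is approximated by summing two single-channel convolutions, which is a simplification of the CBAM design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(img, kernel):
    """Single-channel 2D convolution with zero 'same' padding."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(img, p)
    H, W = img.shape
    return np.array([[np.sum(padded[i:i + k, j:j + k] * kernel)
                      for j in range(W)] for i in range(H)])

def spatial_attention(F, kernel7):
    """CBAM-style spatial attention: pool along channels, convolve the
    pooled maps, and rescale the input by the sigmoid weights."""
    avg_map = F.mean(axis=-1)                   # H x W average over channels
    max_map = F.max(axis=-1)                    # H x W max over channels
    # stand-in for a 2-channel 7 x 7 convolution over [avg; max]
    att = sigmoid(conv2d_same(avg_map, kernel7) + conv2d_same(max_map, kernel7))
    return F * att[..., None]                   # broadcast over channels

rng = np.random.default_rng(5)
F = rng.random((9, 9, 16))                      # a 9 x 9 patch, 16 channels
refined = spatial_attention(F, rng.standard_normal((7, 7)) * 0.1)
```

Because the sigmoid weights lie in (0, 1), the module can only attenuate spatial positions, acting as a per-pixel gate over the feature map.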
3. Datasets and Experimental Setting
3.1. Datasets
In this paper, we utilize three publicly available hyperspectral datasets—the Indian Pines (IP), Pavia University (UP), and Kennedy Space Center (KSC) datasets—to validate the accuracy and efficiency of the proposed method.
IP: This dataset was collected using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in Northwest Indiana, USA, and consists of 200 spectral bands in the wavelength range of 0.4 µm to 2.5 µm after removal of the absorption bands, with a spatial resolution of 20 m. The dataset contains 145 × 145 pixels and 16 land cover classes.
UP: This dataset was acquired at the University of Pavia in northern Italy with the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. It comprises 103 spectral bands in the wavelength range of 0.43 µm to 0.86 µm with a spatial resolution of 1.3 m. The dataset contains 610 × 340 pixels and 9 land cover classes.
KSC: This dataset was collected by the AVIRIS sensor at the Kennedy Space Center, Florida, and comprises 224 bands and 512 × 340 pixels, with 176 bands remaining after removal of water-vapor noise bands. It has a spectral coverage of 0.4 µm to 2.5 µm, a spatial resolution of 18 m, and a total of 13 land cover categories.
In the HSI classification task, the number of training samples significantly impacts classification accuracy: generally, a larger training set leads to higher accuracy, but more sample data also means increased time consumption and computational complexity. To assess the effectiveness of our model on the small-sample problem, we used minimal training and validation sample sizes for each dataset in our experiments. For the Indian Pines and KSC datasets, we select 5% of the samples for training and 5% for validation. For the Pavia University dataset, which has a very large number of labeled samples, we select only 0.5% for training and 0.5% for validation. The training, validation, and test samples for the three datasets used for the experiments are listed in
Table 2,
Table 3 and
Table 4, and the false-color images and ground truth maps are shown in
Figure 8,
Figure 9 and
Figure 10.
3.2. Experimental Setting
To evaluate the proposed algorithm’s effectiveness, it was compared with five different methods in the literature: (1) support vector machine based on a radial basis function kernel (SVM-RBF) [
6]; (2) multi-scale 3D deep convolutional neural network (M3D-CNN) [
21]; (3) spectral–spatial residual network (SSRN) [
22]; (4) 3D-2D CNN feature hierarchy (HybridSN) [
20]; (5) double-branch multi-attention mechanism network (DBMA) [
25].
We set the batch size to 64 and the learning rate to 0.0001, and used the Adam optimizer to train the network. The algorithm's classification performance was evaluated by the OA, AA, and Kappa coefficient, where OA is the ratio of correctly classified pixels to total pixels, AA is the average classification accuracy across all categories, and the Kappa coefficient is a statistical measure of the consistency between the classification results and the ground truth. Higher values of these three metrics indicate better classification results. We also recorded the training and testing time of each method to assess its efficiency. For a fair comparison, all experiments in this paper were conducted on the same platform, configured with 64 GB RAM and an NVIDIA GeForce RTX 2080Ti GPU, running the Ubuntu 16.04.9 x64 operating system and the PyTorch framework.
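The three evaluation metrics can be computed from a confusion matrix as follows (a minimal NumPy sketch; it assumes every class appears at least once in the ground truth):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and the Kappa coefficient from predicted labels."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                       # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)        # per-class accuracy
    aa = per_class.mean()                           # average accuracy
    # expected agreement by chance, from row/column marginals
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                    # chance-corrected accuracy
    return oa, aa, kappa

# Toy example: 6 pixels, 3 classes, one misclassification.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
```

Kappa is lower than OA here because it discounts the agreement that would be expected from the class marginals alone, which is why it is the stricter of the three metrics on imbalanced scenes.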