1. Introduction
In recent decades, with the rapid development of hyperspectral imaging technology, sensors can capture hyperspectral images (HSIs) in hundreds of bands. An important task in the field of remote sensing is hyperspectral image classification, which assigns accurate labels to individual pixels according to their position in a multidimensional feature space [1,2,3]. In practical applications, hyperspectral image classification technology has been widely used in many fields, such as military reconnaissance, vegetation and ecological monitoring, atmospheric assessment, and geological disaster monitoring [4,5,6,7,8].
Traditional machine-learning methods mainly include two steps: feature extraction and classification [9,10,11,12,13,14]. In the early stage of hyperspectral image classification, many classical methods appeared, such as feature mining techniques [15] and Markov random fields [16]. However, these methods cannot effectively extract features with strong discrimination ability. To accommodate the nonlinear structure of hyperspectral data, the support vector machine (SVM), a pattern recognition algorithm, was introduced [17], but it struggles to solve multiclass classification problems effectively.
With the development of deep learning (DL) technology, DL-based methods have been widely used in hyperspectral image classification [18,19,20]. In particular, hyperspectral image classification methods based on convolutional neural networks (CNNs) have attracted extensive attention because they can effectively deal with nonlinearly structured data [21,22,23,24,25,26,27,28]. In [29], the first attempt to extract the spectral features of HSIs by stacking a multilayer one-dimensional convolutional neural network (1DCNN) was presented. In addition, Yu et al. [30] proposed a CNN with a deconvolution and hashing method (CNNDH). Exploiting the spectral correlation and band variability of HSIs, a recurrent neural network (RNN) was used to extract spectral features [31]. In recent years, two-dimensional networks have also been applied to hyperspectral image classification and have obtained satisfactory classification performance. For example, a two-dimensional stacked autoencoder (2DSAE) was used to extract deep spatial features [32]. In addition, Makantasis et al. [33] proposed a two-dimensional convolutional neural network (2DCNN) to extract spatial information and classify the original HSIs pixel by pixel in a supervised manner. In [34], Feng et al. proposed a CNN-based multilayer spatial–spectral feature fusion and sample augmentation network with local and nonlocal constraints (MSLN-CNN). MSLN-CNN not only fully extracts the complementary spatial–spectral information between shallow and deep layers, but also avoids the overfitting caused by an insufficient number of samples. In [35], Gong et al. proposed a multiscale convolutional neural network (MSCNN), which improves the representation ability of HSIs by extracting deep multiscale features. Meanwhile, a spatial–spectral unified network (SSUN) for HSI classification was proposed [36]. This method shares a unified objective function for feature extraction and classifier training, so all parameters can be optimized simultaneously. Considering the inherent data structure of HSIs, spatial–spectral features can be extracted more fully by using a three-dimensional convolutional neural network (3DCNN). In [37], an unsupervised feature learning strategy based on a three-dimensional convolutional autoencoder (3DCAE) was used to maximally explore spatial–spectral structure information and learn effective features in an unsupervised manner. Roy et al. [38] proposed a mixed 3DCNN and 2DCNN feature extraction method (Hybrid-SN). This method first extracts spatial and spectral features through a 3DCNN, then extracts deep spatial features using a 2DCNN, and finally realizes high-precision classification. In [39], a robust generative adversarial network (GAN) was proposed, which effectively improved classification performance. In addition, Paoletti et al. [40] proposed the pyramid residual network (PyResNet).
Although the above methods can effectively improve the classification performance of HSIs, the results are still not fully satisfactory. In recent years, to further improve classification performance, the channel attention mechanism has been widely studied in computer vision and applied to the field of hyperspectral image classification [41,42,43,44]. For example, the squeeze-and-excitation network (SENet) improved classification performance by introducing the channel attention mechanism [45]. Wang et al. [46] proposed the spatial–spectral squeeze-and-excitation network (SSSE), which utilizes a squeeze operator and an excitation operation to refine the feature maps. In addition, embedding the attention mechanism into popular models can also effectively improve classification performance. In [47], Mei et al. proposed bidirectional recurrent neural networks (bi-RNNs) based on an attention mechanism, where the attention map is calculated using the tanh and sigmoid functions. Roy et al. [48] proposed the fused squeeze-and-excitation network (FuSENet), which obtains channel attention through global average pooling (GAP) and global max pooling (GMP). Ding et al. [49] proposed the local attention network (LANet), which enriches the semantic information of low-level features by embedding local attention in high-level features. However, channel attention can only obtain an attention map over the channel dimension, ignoring spatial information. To obtain prominent spatial features, the convolutional block attention module (CBAM) [50] not only emphasizes the differences between channels through channel attention, but also uses pooling operations along the channel axis to generate a spatial attention map that highlights the importance of different spatial pixels. To fully extract spatial and spectral features, Zhong et al. [51] proposed the spectral–spatial residual network (SSRN). Recently, Zhu et al. [52] added spectral and spatial attention to SSRN, forming the residual spectral–spatial attention network (RSSAN), and achieved better classification performance. To avoid interference between the extracted spatial and spectral features, Ma et al. [53] designed a double-branch multi-attention (DBMA) network that extracts spatial and spectral features using different attention mechanisms in its two branches. Similarly, Li et al. [54] proposed a double-attention network (DANet) incorporating spatial attention and channel attention. Specifically, spatial attention is used to obtain the dependence between any two positions of the feature map, and channel attention is used to obtain the dependence between different channels. In [55], Li et al. proposed the double-branch dual-attention (DBDA) network. By adding spatial attention and channel attention modules to the two branches, DBDA achieves better classification performance. To highlight important features as much as possible, Cui et al. [56] proposed a new dual triple-attention network (DTAN), which uses three branches to obtain cross-dimensional interactive information and attention maps between different dimensions. In addition, to expand the receptive field and extract more effective features, Roy et al. [57] proposed an attention-based adaptive spectral–spatial kernel improved residual network (A2S2K-ResNet).
Although many excellent classification methods have been applied to hyperspectral image classification, extracting features with strong discrimination ability and realizing high-precision classification with small sample sizes remain major challenges. Although the spatial and channel attention mechanisms can capture spatial and channel dependence, they are still limited in capturing long-distance dependence. Considering the spatial location relationships and the differing importance of different bands, we propose a three-dimensional coordination attention mechanism network (3DCAMNet). 3DCAMNet consists of three main components: a convolution module, a linear module, and a three-dimensional coordination attention mechanism (3DCAM). Firstly, the convolution module uses a 3DCNN to fully extract spatial and spectral features. Secondly, the linear module aims to generate feature maps containing more information. Lastly, the designed 3DCAM not only considers the vertical and horizontal directions of spatial information, but also highlights the importance of different bands.
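To make this idea concrete, below is a minimal PyTorch sketch of one plausible realization of a 3D coordinate attention block. The module name CoordAttention3D, the shared bottleneck, and all layer sizes are illustrative assumptions, not the exact 3DCAM configuration of this paper.

```python
import torch
import torch.nn as nn

class CoordAttention3D(nn.Module):
    """Sketch of 3D coordinate attention: directional average pooling
    along height, width, and the spectral axis yields three descriptors,
    each mapped to an attention map that rescales the input."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.pool_h = nn.AdaptiveAvgPool3d((1, None, 1))  # keep height axis
        self.pool_w = nn.AdaptiveAvgPool3d((1, 1, None))  # keep width axis
        self.pool_d = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep spectral axis
        self.reduce = nn.Sequential(                      # shared bottleneck
            nn.Conv3d(channels, mid, kernel_size=1),
            nn.BatchNorm3d(mid),
            nn.ReLU(inplace=True),
        )
        self.att_h = nn.Conv3d(mid, channels, kernel_size=1)
        self.att_w = nn.Conv3d(mid, channels, kernel_size=1)
        self.att_d = nn.Conv3d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W), where D is the number of spectral bands
        a_h = torch.sigmoid(self.att_h(self.reduce(self.pool_h(x))))  # (B, C, 1, H, 1)
        a_w = torch.sigmoid(self.att_w(self.reduce(self.pool_w(x))))  # (B, C, 1, 1, W)
        a_d = torch.sigmoid(self.att_d(self.reduce(self.pool_d(x))))  # (B, C, D, 1, 1)
        return x * a_h * a_w * a_d  # broadcast over the pooled axes

# Example: a 9 x 9 patch with 200 bands and 16 feature channels.
y = CoordAttention3D(16)(torch.randn(2, 16, 200, 9, 9))  # output shape equals input shape
```

Multiplying the three directional maps lets each output voxel be modulated jointly by its row, its column, and its band, which is one way long-distance dependence along each axis can be captured without full self-attention.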
The main contributions of this paper are summarized as follows:
- (1)
The three-dimensional coordination attention mechanism-based network (3DCAMNet) proposed in this paper is mainly composed of a three-dimensional coordination attention mechanism (3DCAM), linear module, and convolution module. This network structure can extract features with strong discrimination ability, and a series of experiments showed that 3DCAMNet can achieve good classification performance and has strong robustness.
- (2)
In this paper, a 3DCAM is proposed. This attention mechanism obtains the 3D coordination attention map of HSIs by exploring the long-distance relationship between the vertical and horizontal directions of space and the importance of different channels of spectral dimension.
- (3)
In order to extract spatial–spectral features as fully as possible, a convolution module is used in this paper. Similarly, in order to obtain feature maps containing more information, a linear module is introduced after the convolution module to extract finer high-level features (a sketch of this pipeline is given after this list).
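As referenced in contribution (3), the following is a minimal sketch, under assumed layer sizes, of how a 3DCNN convolution module followed by a linear module could map a 9 × 9 input patch to class scores; ConvLinearSketch and all dimensions are hypothetical, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvLinearSketch(nn.Module):
    """Hypothetical convolution module (stacked 3D convolutions) followed
    by a linear module that maps flattened features to class scores."""

    def __init__(self, bands: int = 200, num_classes: int = 16):
        super().__init__()
        self.conv = nn.Sequential(  # convolution module: spatial-spectral features
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((4, 3, 3)),
        )
        self.linear = nn.Sequential(  # linear module: finer high-level features
            nn.Flatten(),
            nn.Linear(16 * 4 * 3 * 3, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, bands, 9, 9) -> (B, num_classes)
        return self.linear(self.conv(x))

logits = ConvLinearSketch()(torch.randn(2, 1, 200, 9, 9))  # shape (2, 16)
```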
The remainder of this paper is structured as follows: Section 2 introduces the components of 3DCAMNet in detail; Section 3 provides the experimental results and analysis; Section 4 draws the conclusions.
3. Experimental Results and Analysis
In order to verify the classification performance of 3DCAMNet, this section presents a series of experiments on five datasets. All experiments were run on the same configuration, i.e., a server with an Intel Core i9-9900K CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 32 GB of RAM. This section covers the experimental setup, comparison of results, and discussion.
3.1. Experimental Setting
3.1.1. Datasets
Five common datasets were selected, namely, Indian Pines (IP), Pavia University (UP), Kennedy Space Center (KSC), Salinas Valley (SV), and University of Houston (HT). The IP, KSC, and SV datasets were captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The UP and HT datasets were obtained by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor and the Compact Airborne Spectrographic Imager (CASI) sensor, respectively.
Specifically, IP has 16 feature categories, a spatial size of 145 × 145, and 200 spectral bands available for experiments. Compared with IP, UP has fewer feature categories, only nine, and an image size of 610 × 340; aside from 13 noisy bands, 103 bands are used in the experiment. The spatial resolution of KSC is 20 m, and the spatial size of each image is 512 × 614; after removing the water absorption bands, 176 bands are left for the experiment. SV has a spatial size of 512 × 217, contains 16 feature categories, and provides 204 spectral bands for experiments. The last dataset, HT, has a high spatial resolution, a spatial size of 349 × 1905, 114 bands in the wavelength range 380–1050 nm, and 15 feature categories. The details of the datasets are shown in Table 1.
3.1.2. Experimental Setting
In 3DCAMNet, the batch size and maximum number of training epochs were 16 and 200, respectively, and the Adam optimizer was used during training. The learning rate and input spatial size were 0.0005 and 9 × 9, respectively. In addition, the cross-entropy loss was used to measure the difference between the real and predicted probability distributions.
Table 2 shows the hyperparameter settings of 3DCAMNet.
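For concreteness, the following is a minimal, self-contained sketch of the reported training configuration (batch size 16, 200 epochs, Adam with a learning rate of 0.0005, 9 × 9 patches, cross-entropy loss). The tiny linear model and the random batch are placeholders standing in for 3DCAMNet and the real data loader.

```python
import torch
import torch.nn as nn

bands, num_classes = 200, 16  # e.g., the IP dataset
model = nn.Sequential(nn.Flatten(), nn.Linear(bands * 9 * 9, num_classes))  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # Adam, lr = 0.0005
criterion = nn.CrossEntropyLoss()  # cross-entropy loss

patches = torch.randn(16, bands, 9, 9)         # one batch of 9 x 9 cubes
labels = torch.randint(0, num_classes, (16,))  # random placeholder labels
for epoch in range(200):                       # maximum number of epochs
    optimizer.zero_grad()
    loss = criterion(model(patches), labels)   # predicted vs. real distribution
    loss.backward()
    optimizer.step()
```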
3.1.3. Evaluation Index
Three evaluation indicators were adopted in the experiments, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa) [61]. These evaluation indicators are all dimensionless. The confusion matrix $M$ is constructed from the real category information of the original pixels and the predicted category information, where $C$ is the number of categories and $m_{ij}$ is the number of samples of category $j$ classified as category $i$. Assuming that the total number of samples of the HSI is $N$, OA is the ratio of the number of accurately classified samples to the total number of samples:

$$\mathrm{OA} = \frac{1}{N}\sum_{k=1}^{C} m_{kk},$$

where $m_{kk}$ denotes the correctly classified elements in the confusion matrix. Similarly, AA is the average of the classification accuracies of the individual categories:

$$\mathrm{AA} = \frac{1}{C}\sum_{k=1}^{C}\frac{m_{kk}}{\sum_{i=1}^{C} m_{ik}}.$$

The Kappa coefficient is another performance evaluation index. The specific calculation is as follows:

$$\mathrm{Kappa} = \frac{N\sum_{k=1}^{C} m_{kk} - \sum_{k=1}^{C} m_{k+}\,m_{+k}}{N^{2} - \sum_{k=1}^{C} m_{k+}\,m_{+k}},$$

where $m_{k+}$ and $m_{+k}$ represent the sum of all column elements in row $k$ and the sum of all row elements in column $k$ of confusion matrix $M$, respectively.
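As a quick cross-check of the definitions above, the following NumPy sketch computes the three indicators from a confusion matrix; the function name and the toy example are ours.

```python
import numpy as np

def metrics_from_confusion(M: np.ndarray):
    """OA, AA, and Kappa from a C x C confusion matrix M, following the
    convention above: M[i, j] counts samples of category j classified as i."""
    N = M.sum()
    oa = np.trace(M) / N                                 # overall accuracy
    aa = np.mean(np.diag(M) / M.sum(axis=0))             # mean per-class accuracy
    pe = np.sum(M.sum(axis=0) * M.sum(axis=1)) / N ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Toy 2-class example: 90 + 80 correct out of 200 samples.
oa, aa, kappa = metrics_from_confusion(np.array([[90.0, 20.0], [10.0, 80.0]]))
```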
3.2. Experimental Results
In this section, the proposed 3DCAMNet is compared with other advanced classification methods, including SVM [17], SSRN [51], PyResNet [40], DBMA [53], DBDA [55], Hybrid-SN [38], and A2S2K-ResNet [57]. In the experiments, the training proportions of the IP, UP, KSC, SV, and HT datasets were 3%, 0.5%, 5%, 0.5%, and 5%, respectively. In addition, for fair comparison, the input spatial size of all methods was 9 × 9, and the final experimental results are the average of 30 runs.
SVM is a classification method based on the radial basis function (RBF) kernel. SSRN designs spatial and spectral residual modules to extract spatial–spectral information from the neighborhood blocks of the input three-dimensional cube data. PyResNet gradually increases the feature dimension of each layer through residual connections, so as to obtain more location information. To further improve classification performance, DBMA and DBDA design spectral and spatial branches to extract the spectral–spatial features of HSIs, using attention mechanisms to emphasize channel and spatial features in the two branches, respectively. Hybrid-SN verifies the effectiveness of a hybrid spectral CNN, whereby spectral–spatial features are first extracted through a 3DCNN and spatial features are then extracted through a 2DCNN. A2S2K-ResNet designs an adaptive kernel attention module, which not only automatically adjusts the receptive fields (RFs) of the network, but also jointly extracts spectral–spatial features, thereby enhancing the robustness of hyperspectral image classification. Unlike the attention mechanisms of the above methods, the 3D coordination attention mechanism proposed in this paper captures the long-distance dependence in the vertical and horizontal directions as well as the importance of the spectral bands. Similarly, to extract spectral and spatial features with stronger discrimination ability, the 3DCNN and linear module are used to fully extract joint spectral–spatial features, thereby improving classification performance.
The classification accuracies of all methods on the IP, UP, KSC, SV, and HT datasets are shown in Table A1, Table A2, Table A3, Table A4, and Table A5, respectively. It can be seen that, on the five datasets, compared with the other methods, the method proposed in this paper not only obtained the best OA, AA, and Kappa, but also had an advantage in classification accuracy for almost every class. Specifically, due to the complex distribution of features in the IP dataset, the classification accuracy of all methods on this dataset was low, but our method obtained better accuracy not only in the categories that were easy to classify, but also in the categories that were difficult to classify, such as Class 2, Class 4, and Class 9. Similarly, on the UP dataset, the accuracy of the proposed method, whether measured by OA, AA, and Kappa or by the individual categories, has great advantages over the other methods. Compared with the IP dataset, the UP dataset has fewer feature categories, and all methods exhibited better classification results, but the method in this paper obtained the highest classification accuracy. The KSC dataset has a similar number of categories to the IP dataset, but its feature categories are spatially scattered. It can be seen from Table A3 that all classification methods obtained reasonable results, but the proposed method obtained the best classification accuracy. In addition, because the sample distribution of the SV dataset is relatively balanced and the ground object distribution is relatively regular, the classification accuracy of all methods was high. On the contrary, the HT images were collected over the University of Houston campus, with a complex distribution and many categories, but the proposed method could still achieve high-precision classification.
In addition, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 show the classification visualization results of all methods, including the false-color composite image and the classification map of each method. Because traditional classification methods such as SVM cannot effectively extract spatial–spectral features, their classification maps were rough and noisy. The deep network methods based on ResNet, including SSRN and PyResNet, obtained good classification results, but a small amount of noise remained. In addition, DBMA, DBDA, and A2S2K-ResNet all added an attention mechanism to the network, which yielded better classification visualization results, but there were still many classification errors. In contrast, the classification maps obtained by the proposed method were smoother and closer to the real feature map. This fully verifies the superiority of the proposed method.
In conclusion, analysis from multiple angles verified that the proposed method has more advantages than the other methods. First, among all methods, the proposed method had the highest overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa). In addition, the proposed method could not only achieve high classification accuracy in the categories that were easy to classify, but also had strong discrimination ability in the categories that were difficult to classify. Second, among the classification visualization results of all methods, the method in this paper obtained smoother results that were closer to the false-color composite image.