Spatial–Spectral Feature Refinement for Hyperspectral Image Classification Based on Attention-Dense 3D-2D-CNN

Convolutional neural networks provide an ideal solution for hyperspectral image (HSI) classification. However, the classification effect is not satisfactory when only limited training samples are available. Focusing on "small sample" hyperspectral classification, we propose a novel 3D-2D convolutional neural network (CNN) model named AD-HybridSN (Attention-Dense-HybridSN). In the proposed model, a dense block is used to reuse shallow features and better exploit hierarchical spatial-spectral features. Subsequent depth separable convolutional layers are used to discriminate the spatial information. The spatial-spectral features are further refined by a channel attention method and a spatial attention method, applied after every 3D convolutional layer and every 2D convolutional layer, respectively. Experimental results indicate that the proposed model can learn more discriminative spatial-spectral features from very few training samples. On the Indian Pines, Salinas and University of Pavia datasets, AD-HybridSN obtains 97.02%, 99.59% and 98.32% overall accuracy using only 5%, 1% and 1% of the labeled data for training, respectively, which is far better than all the contrast models.


Introduction
Recently, deep learning methods represented by convolutional neural networks (CNNs) have made breakthroughs in computer vision, showing great superiority in the image processing area [1][2][3]. Therefore, research on CNN models has attracted more and more attention, which has also made CNN applications penetrate into various subareas of image processing, such as remote sensing image processing [4]. Hyperspectral image classification has always been one of the hotspots in the remote sensing community, and CNN based hyperspectral classification methods are currently booming [5]. However, hyperspectral images suffer from a large number of spectral bands, large data size, high redundancy, high nonlinearity and the "small sample problem", so their pixel-wise classification is still challenging [6].
The convolutional neural network can automatically learn hierarchical abstract features from the raw image, which provides an ideal solution for feature extraction in computer vision. In 2012, a deep learning model named AlexNet [7] showed an excellent classification result on the ImageNet dataset, a huge collection of natural images. Since then, innovative networks have emerged in an endless stream, constantly inspiring new paradigms of feature extraction and reuse. In 2015, He et al. [8] proposed ResNet, solving the training problem of deep networks by introducing the residual connection. In ResNet, feature fusion is realized by pixel-wise addition of feature maps from different connections. Structural innovation is also an important aspect of the network optimization of CNN models for hyperspectral classification. Roy et al. [33] proposed a novel hyperspectral feature extraction pattern, HybridSN, based on the combination of the 3D-CNN and 2D-CNN. HybridSN takes hyperspectral data after a dimensionality reduction as the input and has a relatively small computation burden. It concatenates the feature maps extracted by three successive 3D convolutional layers in the spectral dimension and then uses a 2D convolutional layer to enhance the spatial feature learning. HybridSN, which only has four convolutional layers, achieved extremely high classification accuracy, demonstrating the great potential of the 3D-2D-CNN model in hyperspectral classification. Based on the 3D-2D-CNN, Feng et al. [34] proposed R-HybridSN (Residual-HybridSN) by means of the rational use of non-identity residual connections, enriching the feature learning paths and enhancing the flow of spectral information in the network. In particular, R-HybridSN was equipped with depth separable convolutional layers instead of traditional 2D convolutional layers, which further improved its performance in small sample hyperspectral classification.
However, the shallow features in R-HybridSN are not reused, so its network structure can be further optimized.
Hu et al. [35] proposed squeeze-and-excitation networks and introduced the attention mechanism to image classification networks, winning the championship of the 2017 ImageNet Large Scale Visual Recognition Competition. Recently, the attention mechanism [36] has been applied to the construction of HSI classification models. The attention mechanism is a resource allocation scheme, through which limited computing resources are used to process the more important information. Therefore, an attention module can effectively enhance the expression ability of the model without excessively increasing its complexity. Wang et al. [37] constructed a spatial-spectral squeeze-and-excitation (SSSE) module to automatically learn the weights of different spectral bands and different neighborhood pixels, emphasizing the meaningful features and suppressing unnecessary ones so that the classification accuracy is improved effectively. Li et al. [38] added an attention module (squeeze-and-excitation block) after each of the dense connection modules used for shallow and middle feature extraction to emphasize effective features in the spectral bands, and then fed them to further deep feature extraction. The attention mechanism in HSI classification models is used for finding more discriminative feature patterns in the spectral or spatial dimension. However, the specific use of the attention mechanism, such as its location and calculation methods, has no mature theory and still needs further exploration.
Hyperspectral image labeling is laborious and time-consuming, therefore, labeled samples are always limited in classification tasks. How to use as few labeled samples as possible to achieve better classification results has been a research hotspot for a long time. Feng et al. [34] conducted vast experiments using different amounts of training samples and found that the degradation of the CNN model is very common when the sample size decreased. The main strategies for small sample hyperspectral classification include generative adversarial networks [39,40], semi-supervised learning [41,42] and network optimization [33,34]. The residual connection is the core of network optimization, and the purpose of network optimization is to facilitate feature fusion and feature reusing. Compared with the simple pipelined network, the well-designed model, which is more like a directed acyclic graph of layers, usually has a better classification effect [34]. Song et al. [43] proposed a hybrid residual network (HDRN), in which the residual connection is used in and between residual blocks. The rational use of residual connection in the HDRN makes it better able to cope with hyperspectral classification with limited training samples. Network optimization can be combined with other methods. Liu et al. [44] proposed a deep few-shot learning method, which is focused on "small sample" hyperspectral classification. The Res-3D-CNN model is utilized to extract spatial-spectral features and to learn a metric space for each class of objects. Therefore, network optimization has important research significance and constructing models with a more reasonable structure seems to be an effective solution for the "small sample" hyperspectral classification.
Based on the above observations, in order to explore a better topological structure, inspired by R-HybridSN and the attention mechanism, we proposed a novel model named AD-HybridSN (Attention-Dense-HybridSN) for "small sample problem" from the perspective of network optimization.
Based on 3D-2D-CNN and the densely connected module, AD-HybridSN realized a more efficient feature reuse and feature fusion. Moreover, the attention mechanism was introduced to the 3D convolution part and 2D convolution part respectively so that the model can utilize the spectral features and spatial features in a targeted refinement circumstance. With fewer parameters, AD-HybridSN achieves better classification performance in the Indian Pines, Salinas and University of Pavia datasets.

AD-HybridSN Model
Hyperspectral image classification is to assign a specific label to every pixel in hyperspectral images. The convolutional neural networks based hyperspectral classification models take small image patches as the input. Every hyperspectral image patch was composed of the spectral vectors within a predefined range and its land-use type was determined by its center pixel. The hyperspectral patch can be denoted as P^{L×W×S}, where L × W represents the spatial dimension and S represents the number of spectral bands. In our proposed model, the input data was processed by principal components analysis (PCA) in the spectral dimension, which greatly reduced the redundancy within hyperspectral data. The number of spectral bands decreased from S to C, and a different value of C has a great effect on the computational complexity. For the sake of equilibrium, we set L and W as 15 and C as 16 in our proposed network. Figure 1 shows the network structure of the proposed AD-HybridSN. AD-HybridSN is based on the 3D-2D-CNN feature extraction pattern and is composed of 6 convolutional layers. A 3D-Dense block composed of 4 3D convolutional layers was used for learning hierarchical spatial-spectral features. We introduced the channel attention method after every 3D convolutional layer to refine the extracted spatial-spectral features. Two subsequent depth separable convolutional layers supported by the spatial attention method enhanced the spatial information in the feature maps. Multiple residual connections were used in AD-HybridSN, which realized the feature reuse and spectral information compensation.


Convolutional Layers Used in Our Proposed Model
In the CNN based hyperspectral classification networks, the input hyperspectral image patch contains abundant spatial-spectral information. Different types of convolution have different characteristics when extracting features from such a patch. In 2D-CNN, the 2D convolution kernels carry out the convolution operation in the spatial dimension of the feature maps, and the corresponding results of different channels are added pixel-wise to obtain a 3D tensor. Assuming that the input data dimension is Z × Z × C, the 2D convolution kernel size is M × M, the kernel number is N and padding is not used, a 3D tensor of size (Z − M + 1) × (Z − M + 1) × N is obtained. The value at the (x, y) position on the jth feature map in the ith layer, value_{i,j}^{x,y}, is calculated by Formula (1):

value_{i,j}^{x,y} = φ( Σ_μ Σ_{w=0}^{W_i−1} Σ_{l=0}^{L_i−1} θ_{i,j,μ}^{w,l} · value_{(i−1),μ}^{(x+w),(y+l)} + b_{i,j} )    (1)

where φ() represents the nonlinear activation function; W_i and L_i represent the width and height of the convolutional kernel; θ_{i,j,μ}^{w,l} represents the weight parameter at the position (l, w) of the jth kernel in the ith layer for the μth input feature map; value_{(i−1),μ}^{(x+w),(y+l)} represents the value at the (x + w, y + l) position in the μth feature map of the previous layer; and b_{i,j} represents the bias.
In the 3D-CNN based models, the convolution operation using the 3D convolution kernel not only acts on the spatial dimension but also captures the correlation information between several spectral bands. Suppose the dimension of the input hyperspectral image patch is Z × Z × C, and N 3D convolution kernels with a kernel size of M × M × D are used for the convolution operation. If padding is not used, the output is a 4D tensor with a size of (Z − M + 1) × (Z − M + 1) × (C − D + 1) × N. It is a remarkable fact that the number of bands crossed in spectral feature learning is fixed as D, which may be a limitation of the 3D convolutional kernel. Compared with the feature maps generated by 2D convolutional kernels, the 3D convolutional kernel can better utilize spectral information. Therefore, the 3D-CNN based methods are more suitable for the classification of hyperspectral images with rich spectral information. In the 3D convolutional computation, the activation value at the (x, y, z) position on the jth feature map in the ith layer is calculated by Formula (2):

value_{i,j}^{x,y,z} = φ( Σ_μ Σ_{w=0}^{W_i−1} Σ_{l=0}^{L_i−1} Σ_{d=0}^{D_i−1} θ_{i,j,μ}^{w,l,d} · value_{(i−1),μ}^{(x+w),(y+l),(z+d)} + b_{i,j} )    (2)

where D_i represents the spectral depth of the convolutional kernel and the other symbols have the same meanings as in Formula (1).
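The output-size arithmetic above can be checked with a short sketch. The helper functions below are illustrative, not part of the paper's code; they assume "valid" (no-padding) convolution exactly as in the two formulas:

```python
# Sketch: verify the "valid" convolution output-size arithmetic from the text.
# 2D case: input Z x Z x C, N kernels of size M x M -> (Z-M+1) x (Z-M+1) x N.
# 3D case: input Z x Z x C, N kernels of size M x M x D
#          -> (Z-M+1) x (Z-M+1) x (C-D+1) x N.

def conv2d_output_shape(Z, C, M, N):
    """Spatial 'valid' 2D convolution over a Z x Z x C input."""
    return (Z - M + 1, Z - M + 1, N)

def conv3d_output_shape(Z, C, M, D, N):
    """Spatial-spectral 'valid' 3D convolution over a Z x Z x C input."""
    return (Z - M + 1, Z - M + 1, C - D + 1, N)

# With the settings stated earlier (15 x 15 x 16 patch, 3 x 3 x 3 kernels,
# 12 kernels), a single unpadded 3D layer would produce:
print(conv3d_output_shape(Z=15, C=16, M=3, D=3, N=12))  # (13, 13, 14, 12)
```

Note that the 3D case keeps a spectral axis of length C − D + 1, which is exactly the "fixed band span D" limitation discussed above.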
3D-CNN can make good use of the rich spectral information in hyperspectral data while extracting spatial information. However, the 3D convolution kernel has an extra spectral dimension compared with the 2D convolution kernel, which inevitably introduces more parameters and greatly increases the computational complexity [27]. Inspired by HybridSN, in consideration of making better use of the advantages of 3D-CNN and 2D-CNN respectively in feature extraction, we used the 3D-2D-CNN as the feature extraction pattern. Inspired by R-HybridSN, we used depth separable convolutional layers in AD-HybridSN. The depth separable convolution can be divided into depthwise convolution and pointwise convolution. Compared with traditional 2D convolutional layers, the depth separable convolutional layers have fewer parameters and less computational burden, which make it more suitable for hyperspectral data processing [34].
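The parameter saving of depth separable convolution mentioned above is easy to quantify. The following sketch counts weights (bias terms ignored); the channel numbers are hypothetical values chosen for illustration, not the paper's exact layer sizes:

```python
# Sketch: parameter counts (ignoring bias) for a standard 2D convolution versus
# a depthwise separable convolution with the same kernel size and channel counts.

def standard_conv2d_params(c_in, c_out, k):
    return k * k * c_in * c_out

def separable_conv2d_params(c_in, c_out, k):
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1 x 1 convolution to mix channels
    return depthwise + pointwise

# Hypothetical channel counts for illustration:
c_in, c_out, k = 192, 64, 3
print(standard_conv2d_params(c_in, c_out, k))   # 110592
print(separable_conv2d_params(c_in, c_out, k))  # 14016
```

The separable variant needs roughly k²·c_out / (k² + c_out) times fewer weights, which is why it suits hyperspectral feature maps with many channels.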

Multiscale Feature Reuse Module
The residual connection used in ResNet realized feature fusion through the pixel-wise addition of feature maps at different layers, making it relatively easy to train deep networks. Building on the need for feature fusion, the DenseNet proposed by Huang et al. takes it further: within one dense block, every layer is connected to all preceding layers by concatenating their outputs.
On account of the feature fusion method of concatenating feature maps in the channel dimension, the features extracted by the earlier convolutional layers are completely preserved, which makes complete feature reuse possible. For a CNN model with t convolutional layers, the tth layer receives all feature information from layer 0 to layer t − 1, as shown in Formula (3):

x_t = H_t([X_0, X_1, ..., X_{t−1}])    (3)

where H_t() represents the computation of the tth layer and [X_0, X_1, ..., X_{t−1}] represents the concatenation result of all the previous layers in one dense block. This feature fusion method effectively enhances the free flow of information in the model. However, current DenseNet based models adopt local dense connections, which means dense connections only exist within each dense block. This is inevitable because globally dense connections consume a lot of memory and suffer from a severe feature redundancy problem. From the perspective of the receptive field of the feature maps: in AD-HybridSN, we used four 3D convolutional layers in one dense block, each with 12 convolutional kernels of size 3 × 3 × 3. As the convolution proceeds layer by layer, the spatial and spectral receptive fields of each set of feature maps progressively increase. For example, after the first convolution operation, a set of feature maps with a receptive field of 3 × 3 × 3 is obtained. Without dense connections, the receptive fields of the subsequent output feature maps would be 5 × 5 × 5, 7 × 7 × 7 and 9 × 9 × 9. In fact, due to the reuse of shallow features, the actual receptive field is smaller. At the same time, multiscale features are sufficiently reused through the concatenation operations in the dense block.
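The channel growth and receptive-field progression described above can be sketched as follows. The growth rate of 12 comes from the text; treating the input patch as a single-channel 3D volume and assuming 'same' padding (needed for concatenation to be shape-compatible) are our assumptions:

```python
# Sketch: channel growth and nominal receptive field inside the 3D dense block
# described above (4 layers, 12 kernels of 3 x 3 x 3 each; 'same' padding
# assumed so that concatenation across layers is possible).

growth_rate = 12          # kernels per 3D convolutional layer (from the text)
kernel = 3                # 3 x 3 x 3 kernels

channels_in = []          # channels fed to each of the 4 dense-block layers
receptive_fields = []     # nominal receptive field after each layer
c, rf = 1, 1              # the input patch enters with a single channel
for _ in range(4):
    channels_in.append(c)
    rf += kernel - 1      # each 3x3x3 layer widens the field by 2 per axis
    receptive_fields.append(rf)
    c += growth_rate      # dense connectivity: all outputs are concatenated

print(channels_in)        # [1, 13, 25, 37]
print(receptive_fields)   # [3, 5, 7, 9]
```

The receptive-field sequence 3, 5, 7, 9 matches the progression stated in the text for the non-dense case; with dense reuse, part of each layer's input still has the smaller, earlier receptive fields.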

Attention Mechanism
Currently, the attention mechanism has been successfully applied to the area of computer vision based on convolutional neural networks. The attention mechanism can be used to readjust the feature maps generated by the layers of a neural network, which makes the network able to emphasize specific channel or spatial features [45,46]. The attention mechanism can be roughly divided into spatial attention and channel attention [34]. In our proposed model, channel attention was introduced to refactor and refine the spatial-spectral features extracted by every convolutional layer in the dense block, and spatial attention was utilized to discriminate spatial information within the features generated by the depth separable convolutional layers.

Channel Attention Module
As mentioned above, the feature map extracted by a single 3D convolution kernel is modeled as a 3D cube, which can learn detailed features and correlation information across spectral bands of hyperspectral data to some extent. Take the 3 × 3 × 3 convolutional kernel as an example: the same parameters are used across the single-channel 3D hyperspectral data, and each convolution operation covers three spectral bands. As the band span of the spectral features characterized by a single convolutional layer is fixed, spectral feature mining is limited to some extent. Therefore, in order to further refine the extracted spatial-spectral features, the feature maps of all channels were concatenated in the spectral dimension to form a 3D tensor. The reshaped 3D tensor had a large channel number, equal to the original channel number times the original spectral band number. Then channel attention was introduced to assign a specific weight to each channel. Figure 2 is a schematic diagram of the channel attention mechanism used in this article. Let the dimension of the feature map generated by a 3D convolutional layer be B × L × W × C × N, where B represents the batch size, L × W represents the spatial dimension of the feature map, C represents the spectral dimension and N represents the number of convolution kernels. In our proposed method, the 5D feature map is reshaped into a 4D tensor of dimension B × L × W × (CN), where CN is the new channel number. Then channel attention is applied to the new 4D tensor to generate a specific weight for each channel. Firstly, global average pooling is performed on the 4D tensor to obtain a channel-wise representation. Then, after two fully connected layers, the latter of which uses sigmoid as the activation function, the channel-wise representation is converted to the weights of the different channels.
Finally, the reshaped input feature map is multiplied by the obtained weight vector to complete the refactoring of the spatial-spectral features. Take the settings of the proposed model as an example: the dimension of the feature map generated by the first convolutional layer was 15 × 15 × 16 × 12. Twelve weights would be produced if the channel attention mechanism were used directly, but the number of weights increases to 192 after reshaping. Therefore, the feature map reshaping is helpful for refining the spectral features within the original feature maps.
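The reshape-then-squeeze-excite step described above can be sketched in numpy. The two fully connected layers and the reduction ratio r are illustrative assumptions (the paper does not state its exact bottleneck size); only the reshape B × L × W × C × N → B × L × W × (CN) and the pooling/sigmoid pipeline follow the text:

```python
import numpy as np

# Minimal numpy sketch of the channel attention step: reshape the 5D feature
# map B x L x W x C x N into B x L x W x (CN), squeeze it with global average
# pooling, pass it through two fully connected layers (the second with a
# sigmoid), and rescale the channels. Weight shapes and the reduction ratio r
# are assumptions, not the paper's exact settings.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    B, L, W, C, N = x.shape
    x = x.reshape(B, L, W, C * N)                 # merge spectral dim into channels
    s = x.mean(axis=(1, 2))                       # global average pooling -> (B, CN)
    z = sigmoid(np.maximum(s @ w1, 0.0) @ w2)     # FC -> ReLU -> FC -> sigmoid
    x = x * z[:, None, None, :]                   # per-channel reweighting
    return x.reshape(B, L, W, C, N)               # restore the 5D layout

rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 15, 15, 16, 12))   # B x L x W x C x N, CN = 192
r = 8                                             # assumed reduction ratio
w1 = rng.standard_normal((192, 192 // r))
w2 = rng.standard_normal((192 // r, 192))
out = channel_attention(feat, w1, w2)
print(out.shape)  # (2, 15, 15, 16, 12)
```

With this layout, 192 weights are produced per sample (one per reshaped channel), matching the count discussed above, instead of only 12.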
Figure 2. The overall architecture of the channel attention mechanism used in AD-HybridSN. The dimension of the input feature map is L × W × C × N (batchsize is not shown) and after the reshaping operation, the dimension of the new feature map will be L × W × (CN). After the global pooling operation and two fully connected layers, specific weights are generated for every channel and the weight vector is multiplied with the reshaped 3D tensor to complete feature refinement. After feature refinement, the 3D tensor is reshaped back to a 4D tensor, of which the dimension is L × W × C × N.

Spatial Attention Module
Figure 3 shows a schematic of the spatial attention module. Suppose F represents the reshaped feature maps generated by the first depth separable convolutional layer, of which the dimension can be denoted as B × L × W × C. The processing of the attention mechanism in this paper can be summarized as Formula (4):

F′ = SA(F) ⊗ F    (4)

where SA() represents spatial attention, ⊗ represents element-wise multiplication and F′ represents the refined feature maps. Take the first depth separable convolutional layer as an example: global average pooling and global max pooling were performed after the convolutional operation and the feature map dimension became B × 1 × 1 × C for each pooling operation. Then the two pooling results were concatenated to build an efficient feature descriptor, namely the spatial attention feature map. The spatial attention feature map is further processed by a single convolutional kernel so that the locations that need strengthening or suppressing can be highlighted. The above computational process can be summarized as Formula (5):

SA(F) = sigmoid( f^{4×4}( [AvgPool(F); MaxPool(F)] ) )    (5)

where [;] represents the concatenation operation of two feature maps, AvgPool() and MaxPool() represent the corresponding pooling operations, f^{4×4} represents the convolutional operation with a 4 × 4 kernel and sigmoid() is the activation function.
Figure 3. The overall architecture of the spatial attention mechanism used in AD-HybridSN. The dimension of the input feature map is L × W × C (batchsize is not shown). After the max pooling, average pooling and concatenation operations, the dimension of the new feature map will be 1 × 1 × 2C. A convolutional layer that has only one kernel with a sigmoid activation function is used to learn where more attention is needed. The obtained weights are multiplied with the input feature map to complete spatial feature refinement.
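For concreteness, here is a CBAM-style sketch of spatial attention: the feature map is pooled along the channel axis, the two pooled maps are concatenated, and a single 'same'-padded convolution with a sigmoid produces per-location weights. The channel-axis pooling and the 3 × 3 kernel are assumptions following the common CBAM formulation; the paper's exact pooling axes and its 4 × 4 kernel may differ:

```python
import numpy as np

# CBAM-style sketch of spatial attention: average- and max-pool along the
# channel axis, concatenate, convolve with one small kernel, apply sigmoid and
# rescale every spatial location. Pooling axis and kernel size are assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(x, kernel):
    # x: (B, L, W, C); kernel: (k, k, 2), a single convolutional filter
    avg = x.mean(axis=-1, keepdims=True)          # (B, L, W, 1)
    mx = x.max(axis=-1, keepdims=True)            # (B, L, W, 1)
    desc = np.concatenate([avg, mx], axis=-1)     # (B, L, W, 2) descriptor
    k = kernel.shape[0]
    p = k // 2
    desc = np.pad(desc, ((0, 0), (p, p), (p, p), (0, 0)))
    B, L, W, _ = x.shape
    attn = np.empty((B, L, W, 1))
    for i in range(L):                            # naive 'same' convolution
        for j in range(W):
            patch = desc[:, i:i + k, j:j + k, :]
            attn[:, i, j, 0] = (patch * kernel).sum(axis=(1, 2, 3))
    return x * sigmoid(attn)                      # broadcast over channels

rng = np.random.default_rng(1)
feat = rng.standard_normal((2, 13, 13, 64))       # hypothetical 2D feature map
kern = rng.standard_normal((3, 3, 2)) * 0.1
out = spatial_attention(feat, kern)
print(out.shape)  # (2, 13, 13, 64)
```

Because the sigmoid output lies in (0, 1), each spatial location is attenuated rather than amplified, which is the "strengthening or suppressing" behavior described above.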
In our proposed model, the feature maps extracted by 3D convolutional layers contain abundant spectral information. During the subsequent 2D operation, spectral information suffers some loss while spatial information is strengthened, so residual connection was used to compensate spectral information, which in fact constructed a dual-path feature learning pattern. However, this simple dual-path feature extraction pattern was not able to learn the refined spatial feature. The convolutional kernels in the depth separable convolutional layers could not cover the whole feature map, so we used the spatial attention mechanism to refine the feature map point by point.

Datasets and Contrast Models
To observe the performance of the proposed model, three datasets were used in our experiment: Indian Pines, Salinas and the University of Pavia. Indian Pines was collected by the AVIRIS sensor in Indiana. The image had 145 × 145 pixels and contained 224 bands. Apart from 24 bands affected by water vapor absorption, 200 bands were available for classification. The spatial resolution of the image was 20 m, and the spectral coverage was 0.4-2.5 μm. The number of labeled samples was 10,249, divided into 16 categories, including crops and natural vegetation such as soybean, corn, wheat, alfalfa and pasture. The Salinas dataset was acquired in California by the AVIRIS sensor and contained 204 bands for classification after removing the water vapor absorption bands. The image size was 512 × 217 and the spatial resolution was 3.7 m. It contained 16 types of ground objects in total, and the spectral coverage was the same as Indian Pines. The dataset mainly included vegetables, bare soil and vineyards. The University of Pavia dataset contained nine classes, with a total sample size of 42,776. It contained 610 × 340 pixels, with a spatial resolution of 1.3 m and a spectral coverage range of 0.43-0.86 μm. The dataset was collected in urban areas, mainly including trees, asphalt roads, bricks, pastures, etc. Figures 4-6 show the distribution of each ground object in the Indian Pines, Salinas and University of Pavia datasets.

In our experiments, for each image we divided all the pixels into three parts: training set, test set and validation set. The proportion of the training set and validation set of the Indian Pines, Salinas and University of Pavia datasets was 5%, 1% and 1%, respectively, and the remaining pixels served as the test set. The sample distribution of the three datasets for each class of ground object is shown in Tables 1-3.
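The per-class split described above can be sketched as follows. The label vector here is synthetic, and PCA band reduction and patch extraction are omitted for brevity; only the "equal training and validation proportions per class, remainder as test" logic follows the text:

```python
import numpy as np

# Sketch of the sample split described above: per class, draw the stated
# proportion for training and the same proportion for validation, with the
# rest serving as the test set.

def split_indices(labels, ratio, rng):
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n = max(1, round(ratio * len(idx)))       # at least one sample per class
        train.extend(idx[:n])
        val.extend(idx[n:2 * n])
        test.extend(idx[2 * n:])
    return np.array(train), np.array(val), np.array(test)

rng = np.random.default_rng(42)
labels = rng.integers(0, 16, size=10249)          # synthetic 16-class label vector
tr, va, te = split_indices(labels, 0.05, rng)     # 5% train, 5% validation
print(len(tr) + len(va) + len(te) == labels.size)  # True
```

A stratified split like this keeps rare classes represented in the tiny training set, which matters for the small sample classes discussed later.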
We used HybridSN [32], R-HybridSN [33], Res-3D-CNN [31] and Res-2D-CNN [19] as contrast models to verify the classification performance of AD-HybridSN. The sample distribution of the Salinas dataset was as follows:

No. | Class | Training | Validation | Test | Total
1 | Brocoli_green_weeds_1 | 20 | 20 | 1969 | 2009
2 | Brocoli_green_weeds_2 | 37 | 37 | 3652 | 3726
3 | Fallow | 20 | 20 | 1936 | 1976
4 | Fallow_rough_plow | 14 | 14 | 1366 | 1394
5 | Fallow_smooth | 27 | 27 | 2624 | 2678
6 | Stubble | 39 | 40 | 3880 | 3959
7 | Celery | 36 | 36 | 3507 | 3579
8 | Grapes_untrained | 113 | 112 | 11,046 | 11,271
9 | Soil_vinyard_develop | 62 | 62 | 6079 | 6203
10 | Corn_senesced_green_weeds | 33 | 33 | 3212 | 3278
11 | Lettuce_romaine_4wk | 11 | 10 | 1047 | 1068
12 | Lettuce_romaine_5wk | 19 | 20 | 1888 | 1927
13 | Lettuce_romaine_6wk | 9 | 9 | 898 | 916
14 | Lettuce_romaine_7wk | 11 | 10 | 1049 | 1070
15 | Vinyard_untrained | 72 | 73 | 7123 | 7268
16 | Vinyard_vertical_trellis | 18 | 18 | 1771 | 1807
Total | | 541 | 541 | 53,047 | 54,129

Experimental Results and Discussion
In our experiment, the training settings of HybridSN, R-HybridSN, Res-3D-CNN and Res-2D-CNN, such as window size, training epochs, etc., were consistent with the corresponding papers. In addition, we used the Adam optimizer and the learning rate was set to 0.001. For AD-HybridSN, we trained for 100 epochs and used ReLU as the activation function.
In all experiments, we monitored the validation accuracy and saved the model with the highest verification accuracy.

Experimental Results
Three indexes were used to measure the accuracy of the models: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (Kappa). OA represents the proportion of samples that are correctly classified by the model. AA stands for the average precision over all land objects. Kappa is an accuracy measure based on the confusion matrix, which represents the percentage of errors reduced by the classification compared with a completely random classification.
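The three indexes can be computed from a confusion matrix as follows. The small matrix at the end is purely illustrative, not experimental data:

```python
import numpy as np

# Sketch of the three accuracy indices computed from a confusion matrix M,
# where M[i, j] counts samples of true class i predicted as class j.

def oa(M):
    return np.trace(M) / M.sum()                  # overall accuracy

def aa(M):
    return np.mean(np.diag(M) / M.sum(axis=1))    # mean per-class accuracy

def kappa(M):
    n = M.sum()
    po = np.trace(M) / n                          # observed agreement
    pe = (M.sum(axis=0) * M.sum(axis=1)).sum() / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Illustrative 3-class confusion matrix (not from the experiments):
M = np.array([[50, 2, 0],
              [3, 40, 2],
              [1, 1, 45]], dtype=float)
print(round(oa(M), 4))     # 0.9375
print(round(aa(M), 4))
print(round(kappa(M), 4))
```

OA can hide poor accuracy on rare classes, which is why AA and Kappa are reported alongside it.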
In order to avoid fluctuations caused by accidental factors as far as possible, we conducted 20 consecutive experiments. Tables 4-6 show the average indices and standard deviations of each model on the three datasets. Figures 7-9 show the false-color maps, the ground truths and the classification results of each model for the three datasets. The data and predicted maps show that the classification results of AD-HybridSN were more detailed and accurate on Indian Pines, Salinas and the University of Pavia. Among the contrast models, firstly, the OA of Res-2D-CNN on the three datasets was lower than that of the other contrast models, indicating that the 2D-CNN model is not suitable for small sample hyperspectral classification. Secondly, the classification accuracy of Res-3D-CNN was higher than that of Res-2D-CNN, indicating that the 3D-CNN model can explore the spatial-spectral features of training samples more effectively. R-HybridSN was superior to HybridSN on Indian Pines and the University of Pavia, and both models had a higher classification accuracy than Res-3D-CNN. To a certain extent, this proves that, compared with models that use the 3D or 2D convolution kernel alone, the 3D-2D-CNN model is more suitable for classification under small sample conditions, and that the reasonable use of residual connections can effectively improve the classification performance of the 3D-2D-CNN model. In particular, the classification accuracy of R-HybridSN on Salinas was slightly lower than that of HybridSN, and our proposed model AD-HybridSN effectively solved this problem. Among the three 3D-2D-CNN models, our proposed AD-HybridSN achieved the highest classification accuracies on the three datasets. For example, the OA of AD-HybridSN was 0.26% and 2.71% higher than those of R-HybridSN and HybridSN on Indian Pines. We further compared the experimental results of the three 3D-2D-CNN based models and drew the following conclusions.
Firstly, unlike R-HybridSN, whose accuracy on Salinas fell below that of HybridSN, AD-HybridSN achieved relatively balanced accuracy across the three datasets. This further demonstrates the strong feature-extraction ability of the dense block and the necessity of the feature refinement module. Secondly, AD-HybridSN still performed unevenly across datasets: with a similar amount of training samples, its classification results on Salinas were far better than on Indian Pines, so the generalization ability of AD-HybridSN needs to be further analyzed. Thirdly, compared with the other two 3D-2D-CNN models, AD-HybridSN improved tremendously on small-sample classes, such as Stone-steel Towers in Indian Pines and Shadows in the University of Pavia. However, on some ground objects, such as Oats and Alfalfa in Indian Pines and Lettuce_romaine_7wk in Salinas, its accuracy, although higher than that of R-HybridSN, was still lower than that of HybridSN, which needs to be further studied.

Discussion
Rigorous experiments have shown that the classification performance of AD-HybridSN is superior to that of R-HybridSN, HybridSN and the other contrast models. Therefore, the network structure of AD-HybridSN is conducive to improving classification accuracy and merits further discussion. From the perspective of network structure, HybridSN is a 3D-2D-CNN model with a relatively concise structure, containing only four convolutional layers; R-HybridSN has a deeper and more complex structure based on non-identity residual connections and depth separable convolutional layers. The experimental results suggest that R-HybridSN has a better spatial-spectral feature learning ability, but that the features extracted by its shallow layers are not fully utilized, which may be why its accuracy on the Salinas dataset was slightly lower than that of HybridSN. AD-HybridSN is a redevelopment of R-HybridSN that introduces a dense block and attention modules for feature reuse and refinement. As AD-HybridSN has only six convolutional layers, its structural advantage is verified; however, the effectiveness of the attention module needs to be verified separately.
In order to further verify the effectiveness of the attention module in our proposed model, we built D-HybridSN to conduct a model ablation experiment. To control the experimental variables, the only difference between D-HybridSN and AD-HybridSN is that the former has no attention module. Table 7 shows the accuracy of AD-HybridSN, D-HybridSN and R-HybridSN on the three datasets; the proportions of training samples were again 5%, 1% and 1%, respectively. Relative to R-HybridSN, the OA of D-HybridSN changed by −0.42%, +0.66% and +0.27% on Indian Pines, Salinas and the University of Pavia, respectively. Judging from the overall performance on the three datasets, the features extracted by D-HybridSN were more discriminative, which further proves that reusing spatial-spectral features within the network allows the features from shallow layers to contribute more to classification. Moreover, our proposed AD-HybridSN outperformed D-HybridSN on the three datasets by 0.68%, 0.19% and 0.41%, respectively, which indicates that the spatial-spectral features were further refined by the attention modules that follow every convolutional layer. Although AD-HybridSN achieved satisfactory overall accuracies on the three datasets, the classification of some ground objects was still unsatisfactory. This phenomenon may be attributed to using a fixed network structure for datasets with different spatial resolutions and spectral conditions, which may limit targeted feature learning. Therefore, in the following research, a model integration method will be used to combine the advantages of different networks, so as to comprehensively improve the classification accuracy of various ground objects. Besides, a fixed network structure may imply a fixed input size, including the window size and the number of bands.
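The refinement idea the ablation isolates, gating feature maps by channel after 3D convolutions and by spatial position after 2D convolutions, can be illustrated with a minimal NumPy sketch. This is only a schematic of the general channel/spatial attention pattern under simplifying assumptions (no learned weights, a single HxWxC feature map), not the AD-HybridSN implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (H, W, C). Squeeze the spatial dims and gate each channel.
    desc = feat.mean(axis=(0, 1))             # global average pooling -> (C,)
    gate = sigmoid(desc)                      # per-channel weight in (0, 1)
    return feat * gate                        # broadcast over H and W

def spatial_attention(feat):
    # feat: (H, W, C). Squeeze the channels and gate each spatial position.
    desc = feat.mean(axis=2, keepdims=True)   # (H, W, 1)
    gate = sigmoid(desc)                      # per-position weight in (0, 1)
    return feat * gate

x = np.random.rand(5, 5, 8)                   # toy feature map
y = spatial_attention(channel_attention(x))
assert y.shape == x.shape                     # refinement preserves the shape
```

A learned version would insert small weight layers (e.g. a bottleneck MLP or a convolution) before the sigmoid; the point here is only that attention rescales, rather than reshapes, the feature maps, so it can be dropped in after every convolutional layer.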
That may further affect the model's ability to learn spatial-spectral features from different datasets. Thus, how to learn features in a more flexible way needs to be further investigated, both in terms of network structure and of hyperspectral image preprocessing.
In order to further verify the performance of AD-HybridSN under the "small sample" condition, we further reduced the amount of training samples and conducted supplementary experiments. Section 4.1 reported results with unbalanced training samples; here we further reduce the amount of training samples and also use balanced training samples, in which the number of samples per ground object is equal. Because 5% is the minimum proportion for Indian Pines that ensures every ground object has at least one training sample, and because the classification accuracy on the University of Pavia was relatively low, we used only the University of Pavia in the supplementary experiments. In the unbalanced training sample experiment, the amount of labeled data decreased from 0.8% to 0.4%. In the balanced training sample experiment, we used 50, 40, 30 and 20 labeled samples per ground object, respectively. Tables 8 and 9 show the experiment results, from which we had the following findings: (1) In our experiments, the classification accuracy of AD-HybridSN was the highest in both the unbalanced and the balanced training sample case. Additionally, the classification accuracy of R-HybridSN was higher than that of HybridSN, which was consistent with the University of Pavia results in Section 4.1.
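The balanced sampling scheme described above, drawing an equal number of labeled pixels per ground object and leaving the rest for testing, can be sketched as follows. This is a generic illustration (not the authors' data-splitting code), with `labels` assumed to be a flat list of class labels for the labeled pixels:

```python
import random

def balanced_split(labels, k, seed=0):
    """Pick k samples per class for training; the remainder is for testing."""
    rng = random.Random(seed)
    by_class = {}
    for idx, c in enumerate(labels):
        by_class.setdefault(c, []).append(idx)
    train = []
    for c, idxs in by_class.items():
        rng.shuffle(idxs)
        train.extend(idxs[:k])                 # exactly k samples per class
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return train, test

labels = [0] * 10 + [1] * 10 + [2] * 10       # toy labeled pixels, 3 classes
train, test = balanced_split(labels, 2)
print(len(train), len(test))                   # 6 24
```

In the unbalanced case of Section 4.1, by contrast, the training set is drawn as a fixed proportion of all labeled pixels, so class sizes in the training set mirror the (skewed) class sizes in the scene.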
However, the classification accuracy of the three models differed considerably between the two kinds of experiments. Comparing the OA and AA values, we found that the AA value was relatively higher in the balanced training sample case than in the unbalanced case. This phenomenon indicates that the sample distribution has a great influence on the classification results.
(2) The experiment results further indicate that sample distribution is a valuable research issue in "small sample" hyperspectral classification. For now, we randomly split the hyperspectral data to obtain the training set, validation set and testing set. However, hyperspectral classification is ill-posed in two respects: on the one hand, the number of samples per class is unbalanced; on the other hand, the quality of the samples is also uneven. Thus, selecting the best combination of training samples from the labeled pool may alleviate this ill-posedness to a certain extent.
(3) We can tell from the experimental results that when the number of training samples is reduced beyond a certain point, the classification accuracy of all models drops sharply. There is therefore a limit to improving small-sample classification accuracy through network optimization alone. When the training samples are reduced to that extent, a large number of unlabeled samples remain unused. Thus, in the following research, we should focus on mining the information of unlabeled samples by combining semisupervised learning or an active learning strategy.

Conclusions
In this paper, in order to realize efficient extraction and refinement of spatial-spectral features in "small sample" hyperspectral image classification, we proposed the AD-HybridSN model from the perspective of network optimization. Based on a 3D-2D-CNN backbone, multifeature reuse is realized by a dense block, and the 3D and 2D convolutions are equipped with channel attention and spatial attention modules, respectively, so that the spatial-spectral features are further refined. We conducted a series of experiments on three open datasets: Indian Pines, Salinas and the University of Pavia. The experiment results show that AD-HybridSN classified better than all the contrast models. In Section 4.2, we performed a supplementary experiment on the balanced training sample case, and AD-HybridSN still achieved the best classification results as the amount of training samples decreased. However, the accuracy improvement brought by network optimization is limited, so other strategies should be combined to further improve the classification accuracy. In our proposed model, the attention module was of great help in improving hyperspectral classification accuracy under the "small sample" condition; however, only a simple attention module was used in AD-HybridSN. In the future, we will further study the attention mechanism so that a more targeted attention module can be designed and applied in hyperspectral image classification experiments. In addition, AD-HybridSN still has room for improvement in the classification of specific ground objects. In subsequent studies, we will combine semisupervised learning or an active learning strategy to break through the bottleneck of network optimization. Moreover, the dense block and attention module are only preliminarily explored in AD-HybridSN; network optimization is an open and rapidly developing field, and we hope that the idea of AD-HybridSN can be further expanded.