Hybrid Dense Network with Dual Attention for Hyperspectral Image Classification

Abstract: Hyperspectral images (HSIs) are widely used in many fields, but obtaining high classification accuracy remains challenging, especially when only a small number of training samples is available, because acquiring enough labeled samples is time-consuming and laborious. To address the limited training samples and the resulting unsatisfactory classification accuracy, an efficient hybrid dense network based on a dual-attention mechanism is proposed. A stacked autoencoder was first used to reduce the dimensions of the HSIs. A hybrid dense network framework with two feature-extraction branches, based on 3D and 2D convolutional neural network models, was then established to extract abundant spectral-spatial features from the HSIs. In addition, spatial attention and channel attention were jointly introduced to achieve selective learning of the features derived from the HSIs, so that the feature maps were further refined and the more important features were retained. To improve computational efficiency and prevent overfitting, batch normalization and dropout layers were adopted. The Indian Pines, Pavia University, and Salinas datasets were selected to evaluate the classification performance; 5%, 1%, and 1% of the samples of each class were randomly selected as training samples, respectively. Compared against RBF-SVM, 3D-CNN, HybridSN, SSRN, and R-HybridSN, the proposed method reached overall accuracies of 96.80%, 98.28%, and 98.85% on the three datasets, respectively. These results show that the method achieves satisfactory classification performance even with few training samples.


Introduction
Hyperspectral images (HSIs) contain rich spatial and spectral information, and have been widely used in many fields of application, such as environmental science, precision agriculture, and land cover mapping [1-4]. However, the high-dimensional nature of spectral bands can lead to decreases in storage and computing efficiency [5]. In addition, the number of available training samples is usually limited in practical applications [6], and it is still a challenging task to achieve high-precision classification of HSIs [7,8].
To address these problems, feature extraction must be carried out to reduce the dimensions of HSIs before they are input into classifiers. Both linear and nonlinear feature-extraction methods are applied to HSI classification. Principal component analysis (PCA) [9], linear discriminant analysis (LDA) [10], and independent component analysis (ICA) [11] are the most commonly used linear methods. Nevertheless, linear dimension-reduction methods cannot adequately handle the nonlinear structure of HSIs, and they cannot extract deep features. More efficient and robust dimension-reduction methods are therefore required. For example, stacked autoencoders (SAEs) [12] perform well on nonlinear problems: they minimize the loss of information and can process more complex data. In addition, because HSI data are high-dimensional and nonlinear and the training samples are few, classifiers must be able to extract and process deep features [13]. Unfortunately, traditional classification methods, such as support vector machine (SVM) [14], extreme learning machine (ELM) [15], and random forest (RF) [16], often cannot give satisfactory classification results without the support of deep features. Zhu et al. [17] proposed an image-fusion-based algorithm to extract depth information and verified its effectiveness experimentally. Han et al. [18] developed an edge-preserving filtering-based method that better removes haze from original images and preserves their spatial details. However, these two methods cannot automatically learn deep features, and they rely on prior knowledge.
In recent years, deep learning (DL)-based classification methods have been favored due to the powerful ability of convolutional neural networks (CNNs) to automatically extract deep features [19,20]. Some researchers have improved the classification accuracy of HSIs by designing the architectures of CNNs [21,22]. Chen et al. [23,24] proposed a DL framework combining spatial and spectral features for the first time. The SAE and deep belief network (DBN) were used as feature extractors in order to obtain better classification results, demonstrating the great potential of DL in accurately classifying HSIs. Zhao et al. [25] proposed a spectral-spatial feature-based classification (SSFC) framework, in which a balanced local discriminant embedding (BLDE) algorithm and two-dimensional CNN (2D-CNN) network were used to extract spectral and spatial information from the dimension-reduced HSIs; it can be observed that this framework does not make use of the three-dimensional (3D) characteristics of HSIs, and needs to be further improved.
Based on spectral-spatial information and 3D characteristics, Li et al. [26] proposed a three-dimensional convolutional neural network (3D-CNN) framework for precise classification of HSIs. In comparison with 2D-CNNs, the 3D-CNN can more effectively extract deep spatial-spectral fusion features. Roy et al. [27] designed a hybrid 2D and 3D neural network (HybridSN), finding that the HybridSN reduces the model complexity and has better classification performance than a single 3D neural network. It is clear that shallow networks are generally deficient in terms of classification performance.
Compared with shallow networks, the features extracted by deep network structures are more abstract, and the classification results are better. However, as the network structure deepens, vanishing or exploding gradients appear during backpropagation, resulting in network degradation [28,29]. To solve these problems, network connection methods such as residual networks (ResNets) [30] and dense networks (DenseNets) [31] have been adopted, which are beneficial for training deeper networks and alleviating vanishing gradients. Zhong et al. [32] proposed an end-to-end spatial-spectral residual network (SSRN) based on a 3D-CNN. The spectral and spatial residual modules were designed to learn discriminative spatial-spectral features, alleviating the decline in accuracy and further improving the classification performance. Wang et al. [33] introduced a DenseNet into their proposed fast dense spectral-spatial convolution (FDSSC) neural network framework, which achieved better accuracy in less time. Both the SSRN and the FDSSC network first extract spectral features and then extract spatial features. In the process of extracting spatial features, however, the previously extracted spectral features may be destroyed. In addition, Feng et al. [34] introduced a residual learning module and depthwise separable convolution into the HybridSN to build the residual HybridSN (R-HybridSN), which also obtains satisfactory classification results with a deep and effective network structure and fewer training samples; since the shallow features in the R-HybridSN are not reused, the network structure can be further optimized.
Recently, as the most important part of human perception, the attention mechanism has been introduced into CNNs. This enables the model to selectively identify more critical features and ignore some information that is useless for classification [35,36]. Fang et al. [37], based on a DenseNet and a spectral attention mechanism, proposed an end-to-end 3D dense convolutional network with a single-attention mechanism (MSDN-SA) for HSI classification; the network framework enhanced the discriminability of spectral features, and performed well on three datasets; however, it only considers the spectral branch attention, and not the spatial branch attention. Inspired by the attention mechanism of the human visual system, Mei et al. [38] established a dual-channel attention spectral-spatial network based on an attention recurrent neural network (ARNN) and an attention CNN (ACNN), which trained the network in the spectral and spatial dimensions, respectively, to extract more advanced joint spectral-spatial features. Zhu et al. [39] proposed an HSI defogging network based on dual self-attention boost residual octave convolution, which improved the defogging performance. Sun et al. [40] proposed a spectral-spatial attention network (SSAN), in which a simple spectral-spatial network (SSN) was established and attention modules were introduced to suppress the influence of interfering pixels; the distinctive spectral-spatial features with a large contribution to classification could be extracted; similarly, this method may cause the same problems as the SSRN and FDSSC.
To make use of the attention mechanism and solve the problems of the SSRN and FDSSC, we propose a hybrid dense network with dual attention (HDDA) for HSI classification. This framework contains two branches, a 3D-DenseNet and a 2D-DenseNet, which are used to extract spectral-spatial and spatial features, respectively, from HSIs whose dimensions have been reduced by the SAE. In addition, residual channel attention and residual spatial attention are introduced to refine the feature maps and discard unnecessary information. The enhanced spectral-spatial features are then obtained by connecting the outputs of the two branches. Finally, the classification results are obtained using the softmax function. The main contributions of this work are as follows:
(1) To deal with the nonlinear and high-dimensional problems of HSIs, a four-hidden-layer SAE network was built to effectively extract deep features and reduce the feature dimensions;
(2) We proposed a hybrid dense network classifier with a dual-attention mechanism. The classification network has two independent feature-extraction paths, which extract spectral-spatial features simultaneously in 3D and 2D spaces, respectively. The feature conflict that arises in the SSRN and FDSSC, where spatial feature extraction may destroy previously extracted spectral features, is thereby avoided;
(3) We constructed the residual dual-attention module. By integrating channel attention and spatial attention, feature refinement was realized in the channel and spatial dimensions, respectively. The results showed that the refined spectral and spatial features improved the classification results, while less useful features were suppressed.
The rest of this study is arranged as follows: Three HSI datasets and evaluation factors for assessing the proposed network are described in Section 2. The background information is briefly introduced in Section 3. The proposed overall classification framework is presented in detail in Section 4. In Section 5, the experimental results are compared and discussed with reference to the ablation experiments. Finally, a summary and future directions are provided in Section 6.

Hyperspectral Datasets and Evaluation Factors
Three publicly available HSI datasets, namely Indian Pines (IP), Pavia University (PU), and Salinas (SA), were selected to verify the classification performance of the proposed HDDA method (Table 1). The false-color composite images and ground-truth classes are shown in Figures 1-3 and Tables 2-4. For a general deep network, the more training samples, the better the classification performance. Unfortunately, it is usually time-consuming and laborious to collect enough label information from HSI data, and it can be difficult or even impossible to provide sufficient training samples for most networks. Moreover, as the number of samples increases, the computational complexity and time consumption increase correspondingly. Consequently, it is preferable to perform the classification using a small number of training samples. In our study, only 5%, 1%, and 1% of the samples from IP, PU, and SA, respectively, were selected as the training set, while the remaining samples were used to validate the classification performance.
The primary configuration of the computer included an Intel Core i5-7300HQ CPU (2.50 GHz), a GTX1050Ti GPU, and 8 GB RAM, with the Windows 10 64-bit operating system. The development environment was Spyder and the DL framework was PyTorch. To comparatively analyze the classification performance, the overall accuracy (OA), average accuracy (AA), and kappa coefficient (k), all based on the confusion matrix, were used as the evaluation factors [41]. OA is the ratio of correctly classified samples to total samples. AA is the mean of the per-class classification accuracies, that is, their sum divided by the number of classes. k is generally used for checking consistency, and can also be used to measure classification accuracy. The higher the three factors are, the better the classification performance.
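The three evaluation factors can be computed directly from a confusion matrix. The following is a minimal NumPy sketch, not the authors' evaluation code; the function name `classification_scores` and the toy two-class matrix are illustrative assumptions, and the kappa formula is the standard Cohen's kappa.

```python
import numpy as np

def classification_scores(conf):
    """Return (OA, AA, kappa) from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)   # accuracy of each true class
    oa = np.trace(conf) / total                    # correctly classified / all samples
    aa = per_class.mean()                          # mean of per-class accuracies
    # Agreement expected by chance, from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy example (hypothetical counts): class 0 is always right, class 1 is 80% right.
conf = np.array([[50, 0],
                 [10, 40]])
oa, aa, kappa = classification_scores(conf)
print(round(oa, 4), round(aa, 4), round(kappa, 4))   # 0.9 0.9 0.8
```

On imbalanced datasets OA and AA can differ noticeably, because OA is dominated by the large classes while AA weights every class equally.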

Stacked Autoencoder
An autoencoder (AE) is an unsupervised learning method whose structure is similar to that of a general feedforward neural network; its function is to perform representation learning on the input information, and it has been applied to dimension reduction and abnormal data detection [42,43]. Unlike supervised learning methods, the AE requires only the input data itself, not labels. In our study, a stacked AE (SAE) was built to extract features from the original HSIs and perform dimension reduction. The SAE is formed by stacking the basic AE network structure layer by layer, from the input layer through the hidden layers (Figure 4). The encoder is composed of one input layer and four hidden layers, and the decoder includes four hidden layers and one output layer. The tanh nonlinear activation function is used for the hidden layers of both the encoder and the decoder, while the sigmoid nonlinear activation function is adopted for the output layer so that the output layer has the same range of [0, 1] as the input layer. The mean squared error (MSE) loss function is used to measure the deviation between real and reconstructed data. The adaptive moment estimation (Adam) optimization algorithm is used to train the network parameters. As the number of layers increases, the features become more and more abstract. Meanwhile, the dimensions of the data are continuously reduced, so that the high-dimensional input is transformed into low-dimensional features that replace the original HSI bands.
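The encoder/decoder structure described above can be illustrated with a forward pass only. This is a minimal NumPy sketch under stated assumptions, not the authors' PyTorch implementation: the weights are random and untrained, the layer widths 220-120-80-40-10 are borrowed from the Indian Pines configuration reported later in the paper, the decoder is assumed to mirror the encoder, and the helper names `init_layers` and `sae_forward` are hypothetical. Training with Adam is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def sae_forward(x, enc, dec):
    """Encode x to the low-dimensional code, then reconstruct it."""
    h = x
    for w, b in enc:                       # encoder layers use tanh
        h = np.tanh(h @ w + b)
    code = h                               # reduced-dimension features
    for w, b in dec[:-1]:                  # decoder hidden layers use tanh
        h = np.tanh(h @ w + b)
    w, b = dec[-1]
    recon = 1.0 / (1.0 + np.exp(-(h @ w + b)))   # sigmoid output in (0, 1)
    return code, recon

sizes = [220, 120, 80, 40, 10]             # input width down to d = 10
enc = init_layers(sizes)
dec = init_layers(sizes[::-1])             # assumed mirror of the encoder
x = rng.random((5, 220))                   # 5 pixels, spectra scaled to [0, 1]
code, recon = sae_forward(x, enc, dec)
mse = np.mean((x - recon) ** 2)            # reconstruction loss to be minimized
print(code.shape, recon.shape)             # (5, 10) (5, 220)
```

After training, only the encoder would be kept, and `code` would serve as the dimension-reduced representation of each pixel.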

ResNet and DenseNet
As the number of network layers increases, the CNN model becomes prone to vanishing gradients during training. The ResNet and DenseNet address this problem via skip connections and dense connections, respectively. The residual connection, the essential building block of ResNet, enables the input data to be passed directly over the network to subsequent layers [44]. DenseNet, which can be regarded as a special form of ResNet, connects all of the layers directly, ensuring the maximum information flow between the layers of the network [45]. Whereas ResNet combines features by summation, the dense module combines features by concatenating them along the channel direction. The 3D-Dense (Figure 5a) and 2D-Dense (Figure 5b) modules are jointly used to establish a hybrid dense network; each module consists of three convolutional layers (l = 3), with the ReLU activation function. Within a dense module, each layer is connected to all preceding and subsequent layers, which makes it possible to build a deeper network structure.
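The difference between summation (ResNet) and channel-wise concatenation (DenseNet) is easiest to see in how the channel count evolves. The sketch below is illustrative only: `fake_conv` is a hypothetical stand-in that mixes channels with random weights instead of performing a real 3 × 3 convolution, and the values of 32 input channels and 16 new channels per layer are taken from the architecture description later in the paper.

```python
import numpy as np

def fake_conv(x, out_channels, rng):
    """Stand-in for convolution + ReLU: a random per-pixel channel mixing."""
    w = rng.normal(0.0, 0.1, (x.shape[-1], out_channels))
    return np.maximum(x @ w, 0.0)

def dense_block(x, num_layers=3, growth=16, seed=0):
    """Each layer sees all earlier feature maps; outputs are concatenated."""
    rng = np.random.default_rng(seed)
    features = x
    for _ in range(num_layers):
        new = fake_conv(features, growth, rng)
        features = np.concatenate([features, new], axis=-1)   # channels grow
    return features

def residual_block(x, seed=0):
    """Output is the element-wise sum of the input and the transformed input."""
    rng = np.random.default_rng(seed)
    return x + fake_conv(x, x.shape[-1], rng)                 # channels unchanged

x = np.ones((23, 23, 32))            # height, width, channels
print(dense_block(x).shape)          # (23, 23, 80)
print(residual_block(x).shape)       # (23, 23, 32)
```

With l = 3 layers, the dense module grows from 32 to 32 + 3 × 16 = 80 channels, and the original 32 input channels are still present unchanged in the output, which is what allows shallow features to be reused.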

Dual-Attention Mechanism
The basic idea of attention mechanisms in computer vision is to enable the network to ignore irrelevant information among numerous features and pay attention to the important features related to the current task [46,47]. Attention mechanisms can be divided into soft attention mechanisms and hard attention mechanisms. Hard attention mechanisms are non-differentiable and need to be trained through strategies such as reinforcement learning, whereas soft attention mechanisms are differentiable, so the network parameters can be updated during training by the gradient descent algorithm [48]. Therefore, a soft dual-attention mechanism combining channel attention and spatial attention was adopted in our study in order to strengthen the features with a large contribution to classification and suppress the features with only a small contribution (Figure 6).

Channel Attention Module
For the feature maps obtained by the CNN, different channels represent different types of features. The channel attention module reassigns weights along the channel dimension according to the importance of the different channels [49]. The detailed structure of the channel attention module is shown in Figure 6a. The feature map generated by the 3D convolutional layer is taken as an example. It is assumed that the input feature map is $F \in \mathbb{R}^{w \times w \times c \times n}$, where $w \times w$ is the spatial size, $c$ is the spectral dimension, and $n$ is the number of channels. Firstly, 3D global average pooling and 3D global max pooling are each carried out on the whole input feature map $F$, generating two feature descriptors, $F^{c}_{\mathrm{avg}}$ and $F^{c}_{\mathrm{max}}$, each with dimensions of $1 \times 1 \times n$. The two descriptors are then input into a shared network (SN) consisting of two convolutional layers and an activation function layer, which generates feature maps with dimensions of $1 \times 1 \times n$. Afterwards, the output feature vectors are merged by summation, and the channel attention map $CA(F)$ is obtained via the sigmoid activation function. The channel attention map is a vector whose length equals the number of channels in the input feature map; its values lie within the range (0, 1). The calculation of channel attention is expressed as follows:
$$CA(F) = \sigma\left(W_1\,\delta\left(W_0 F^{c}_{\mathrm{avg}}\right) + W_1\,\delta\left(W_0 F^{c}_{\mathrm{max}}\right)\right) \quad (1)$$
where $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively, and $W_0$ and $W_1$ are the weights of the SN. Finally, the output feature map $F' \in \mathbb{R}^{w \times w \times c \times n}$ is obtained as shown in Equation (2):
$$F' = CA(F) \otimes F \quad (2)$$
where $CA(F)$ represents the channel attention map, $F$ is the original input feature, and $\otimes$ denotes element-wise multiplication with the attention map broadcast over the non-channel dimensions.
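The channel attention computation can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the shared network is written as two matrix multiplications (equivalent to 1 × 1 convolutions applied to a 1 × 1 × n descriptor), the weights are random, and the reduction ratio `r` of the hidden layer is a hypothetical parameter that the paper does not specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w0, w1):
    """f: feature map with channels last, e.g. (w, w, c, n).

    Returns the attention vector of length n and the reweighted feature map.
    """
    axes = tuple(range(f.ndim - 1))
    f_avg = f.mean(axis=axes)                 # global average pooling -> (n,)
    f_max = f.max(axis=axes)                  # global max pooling     -> (n,)

    def shared(v):                            # shared network: W1 * ReLU(W0 * v)
        return np.maximum(v @ w0, 0.0) @ w1

    ca = sigmoid(shared(f_avg) + shared(f_max))   # one weight in (0, 1) per channel
    return ca, f * ca                         # broadcast over w, w, c

rng = np.random.default_rng(0)
n, r = 80, 8                                  # r: assumed reduction ratio
f = rng.random((5, 5, 10, n))
w0 = rng.normal(0.0, 0.1, (n, n // r))
w1 = rng.normal(0.0, 0.1, (n // r, n))
ca, refined = channel_attention(f, w0, w1)
print(ca.shape, refined.shape)                # (80,) (5, 5, 10, 80)
```

Because both pooled descriptors pass through the same weights, the module adds few parameters while still combining average and maximum statistics of every channel.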

Spatial Attention Module
In comparison with channel attention, spatial attention focuses on the significant regions of the spatial dimension, which can further capture the contextual information of different regions [50,51]. The detailed structure of the spatial attention module is shown in Figure 6b. Similar to the channel attention module, two types of pooling operations are used to generate two feature descriptors, $F^{s}_{\mathrm{avg}}$ and $F^{s}_{\mathrm{max}}$, but the pooling operation is performed along the channel direction. Both feature descriptors have the same dimensions of $w \times w \times 1$. The two descriptors are then concatenated into a joint feature descriptor $[F^{s}_{\mathrm{avg}}; F^{s}_{\mathrm{max}}]$. A 3D convolutional layer with a sigmoid function is then used to generate the spatial attention map $SA(F)$. The calculation of spatial attention is expressed as follows:
$$SA(F) = \sigma\left(f^{N \times N \times N}\left(\left[F^{s}_{\mathrm{avg}}; F^{s}_{\mathrm{max}}\right]\right)\right) \quad (3)$$
where $\sigma$ represents the sigmoid activation function, while $f^{N \times N \times N}$ represents the 3D convolution operation with a convolution kernel of $N \times N \times N$. The output feature map $F'' \in \mathbb{R}^{w \times w \times c \times n}$ is obtained as shown in Equation (4):
$$F'' = SA(F) \otimes F \quad (4)$$
where $SA(F)$ represents the spatial attention map, $F$ is the original input feature, and $\otimes$ denotes element-wise multiplication with the attention map broadcast along the channel dimension.
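The spatial attention computation follows the same pattern, with pooling along the channel axis instead of over the spatial axes. The sketch below is a simplified 2D illustration, not the authors' code: it uses a feature map of shape (w, w, n), a naive loop-based convolution written for clarity rather than speed, random weights, and an assumed odd 3 × 3 kernel, since the kernel size N is not stated here.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive zero-padded convolution (cross-correlation, as in DL frameworks).

    x is (h, w, c_in), kernel is (k, k, c_in) with odd k; returns (h, w).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros(x.shape[:2])
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k, :] * kernel)
    return out

def spatial_attention(f, kernel):
    """f: (w, w, n) feature map; returns the (w, w) map and the refined features."""
    f_avg = f.mean(axis=-1, keepdims=True)            # pool over channels -> (w, w, 1)
    f_max = f.max(axis=-1, keepdims=True)             # pool over channels -> (w, w, 1)
    stacked = np.concatenate([f_avg, f_max], axis=-1) # joint descriptor   -> (w, w, 2)
    sa = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))   # sigmoid, in (0, 1)
    return sa, f * sa[..., None]                      # broadcast over channels

rng = np.random.default_rng(0)
f = rng.random((9, 9, 80))
kernel = rng.normal(0.0, 0.1, (3, 3, 2))              # assumed 3 x 3 kernel
sa, refined = spatial_attention(f, kernel)
print(sa.shape, refined.shape)                        # (9, 9) (9, 9, 80)
```

For the 3D branch the same idea applies with an additional spectral axis and a 3D kernel.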

Residual Dual-Attention Module
When the dual-attention module does not contribute, the original feature information of the network should not be degraded. A skip connection is therefore introduced, following the ResNet. The high-level features refined by the channel attention and spatial attention are connected with the residual features to form the residual dual-attention module. The final output features, $F_{RDA}$, are as shown in Equation (5). The residual connection promotes information transfer in the model, alleviates the vanishing-gradient problem, and enhances the stability of the model. In addition, because the network is a hybrid CNN, the structure of the residual dual-attention module differs slightly for different input feature maps. More details can be found in Section 3.1.

Overall Workflow
To take advantage of 3D HSI data, we propose a hybrid dense network framework with dual attention (HDDA) for HSI classification (Figure 7). Firstly, the dimensions of the original HSI data were reduced by the SAE; each labeled pixel, together with its w × w neighborhood, was then taken as a sample with the corresponding class label, and the samples were randomly divided into a training set (X_train, Y_train) and a test set (X_test, Y_test). Secondly, we constructed two independent feature-extraction paths, in which the residual dual-attention module was used to refine the features; the training set was then input into the HDDA network, which was trained to obtain the best network model. Finally, the test set was used to evaluate the performance of the trained model. As shown in Figure 7, two branches, a 3D-Dense network and a 2D-Dense network, were jointly used to construct the feature-extraction network, with the dual-attention mechanism introduced into both.
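The sample-construction step of this workflow can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the paper does not state: label 0 marks unlabeled background, borders are handled by reflect padding, the split is drawn per class with at least one training sample each, and the function names `extract_patches` and `stratified_split` are hypothetical. The random cube stands in for an SAE-reduced image.

```python
import numpy as np

def extract_patches(cube, labels, w):
    """Cut a w x w neighbourhood around every labelled pixel (w odd)."""
    pad = w // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels)           # assumed: label 0 = background
    x = np.stack([padded[r:r + w, c:c + w, :] for r, c in zip(rows, cols)])
    y = labels[rows, cols]
    return x, y

def stratified_split(y, train_fraction, rng):
    """Randomly pick train_fraction of each class (at least one) for training."""
    train_idx = []
    for cls in np.unique(y):
        idx = rng.permutation(np.nonzero(y == cls)[0])
        n_train = max(1, int(round(train_fraction * idx.size)))
        train_idx.extend(idx[:n_train])
    train_idx = np.array(sorted(train_idx))
    test_idx = np.setdiff1d(np.arange(y.size), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
cube = rng.random((30, 30, 10))               # toy reduced cube: height, width, d
labels = rng.integers(0, 4, size=(30, 30))    # toy ground truth: 3 classes + background
x, y = extract_patches(cube, labels, w=15)
train_idx, test_idx = stratified_split(y, 0.05, rng)
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]
print(x_train.shape[1:], x_test.shape[1:])    # (15, 15, 10) (15, 15, 10)
```

Each patch keeps its labeled pixel at the center, so the label of the sample is the label of the center pixel, and the surrounding pixels only provide spatial context.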

Hybrid Dense Network
The spatial-spectral features and spatially enhanced features of the HSI data were obtained by using the 3D-DenseNet and 2D-DenseNet with dual attention (Figure 8). The fusion features obtained by merging the outputs of the two networks were then used to perform the classification. In our study, the addition operation was used to merge the features, which were then passed through the fully connected layer, and dropout was used to prevent overfitting. Finally, the classification was carried out by the softmax function. The IP dataset is taken as an example to describe the specific methodology.

3D-DenseNet with Dual Attention
The 3D-DenseNet is composed of a 3D-Dense module and a 3D residual dual-attention module (Figure 8a and Table 5). The sample size of the input layer was set as 23 × 23 × 10 for the HDDA network. At first, a 3D convolution with a kernel size of 3 × 3 × 3 was used to increase the number of channels to 32. Then, the spatial-spectral features were extracted through the 3D-Dense module. There were 16 channels for each 3D convolution with the 3 × 3 × 3 convolution kernels. The size of the output feature map was (23 × 23 × 10, 80) through the dense module. To refine the spatial-spectral features, the 3D residual dual-attention mechanism was introduced to strengthen the spatial regions and channels that make a significant contribution to the classification. The joint dual-attention feature map can be generated with a size of (23 × 23 × 1, 160). Finally, the batch normalization (BN) layer was used to enhance the stability of the model, and a feature map of 1 × 160 was obtained through the global average pooling.

2D-DenseNet with Dual Attention
In order to enhance the spatial information, the HSI cube with a sample size of 23 × 23 × 10 was reshaped and input into the 2D-DenseNet with dual attention, composed of a 2D-Dense module and a 2D residual dual-attention module (Figure 8b and Table 6). Firstly, a 2D convolution with a kernel size of 3 × 3 was adopted to increase the number of channels to 32, giving a feature map with dimensions of (23 × 23, 32). Then, the feature map was passed into the 2D-Dense module, which applies three 2D convolutional layers, each with a convolution kernel size of 3 × 3. The size of the output feature map of the 2D-Dense module was (23 × 23, 80). Subsequently, it was input into the 2D residual dual-attention module in order to obtain a dual-attention feature map with dimensions of (23 × 23, 160), in which the important spatial regions and channels were highlighted. After obtaining the dual-attention feature map, the BN layer and activation function were applied, and a 1 × 160 feature map was acquired through global average pooling.

Configuration of Network Parameters
A hybrid dense network classification framework with a dual-attention mechanism was designed. The weight parameters of the SAE and of the dual-attention hybrid dense network were updated through gradient backpropagation. This section analyzes several determinant factors affecting the classification performance of the HDDA, specifically the window size (w) of the input sample, the learning rate (lr), and the dropout ratio (p). A total of 5%, 1%, and 1% of samples were randomly selected from each class of the IP, PU, and SA datasets, respectively, in order to train the model, while the remaining samples were used to verify the model. In addition, the reduced-dimension value (d) was set to 10, and the mean squared error (MSE) loss function and Adam optimization algorithm were used to train the HDDA. The batch size was set to 64, training ran for 200 epochs, and the average value of 10 experiments was employed as the classification accuracy.

Window Size
The window size w affects the classification performance of a CNN to a great extent [52]. If w is small, the receptive field available to the convolution kernels for feature extraction is insufficient, and local features are captured poorly. A larger window provides more spatial information, but it also introduces more noise and reduces the training speed; in addition, memory usage increases, imposing higher requirements on the hardware platform. Consequently, an appropriate w can improve both the training speed and the classification performance. To find the appropriate w for the three datasets, six window sizes were compared (Figure 9). The OA was the best when w was 15 for IP and SA, and 19 for PU.

Learning Rate
The lr plays an extremely important role in the classification performance of a DL-based network model [53]; it affects the convergence speed of training. If it is too small, optimization may be too slow to converge within the training budget. If it is too large, the parameters change too quickly, and the optimal values may be missed. The best lr can therefore differ from one dataset to another. Four lrs of 0.1, 0.01, 0.001, and 0.0001 were employed in order to compare the classification performance (Figure 10). It was found that, when the lr was 0.001, the HDDA had the best performance on the IP and SA datasets, with the highest OA values. By contrast, the HDDA performed best on the PU dataset when the lr was 0.0001.

Dropout Ratio
Overfitting is one of the most common problems in neural network training, affecting the generalization performance of a model. In an overfitted model, the empirical error on the training samples is very small, while the generalization error on the test set is very large. Dropout is a regularization method in DL, which helps to prevent overfitting and accelerates training [54]. Five dropout ratios (p) were set up to compare the classification performance for the three datasets (Table 7). When p was 0.4, the HDDA had the best performance on the IP and PU datasets, with the highest OAs of 96.80% and 98.28%, respectively; when p was 0.5, the OA was highest for the SA dataset, reaching 98.85%. The accuracy and loss graphs for the three datasets also show similar results (Figure 11).
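The effect of the dropout ratio p can be illustrated with the commonly used "inverted dropout" formulation, which is also what PyTorch's dropout layers apply during training. The sketch below is a NumPy illustration, not the authors' code; the function name and the batch of 160-dimensional feature vectors are illustrative assumptions.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                              # identity at test time
    mask = rng.random(x.shape) >= p           # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)               # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
features = np.ones((1000, 160))               # toy batch of 160-dimensional features
out = dropout(features, p=0.4, rng=rng)
dropped = np.mean(out == 0.0)                 # fraction of zeroed units, close to 0.4
```

Because the surviving activations are divided by 1 - p, the expected value of each unit is the same during training and testing, so no extra scaling is needed at inference time.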

Experimental Results
To verify the effectiveness and robustness of the HDDA, the spectral-based RBF-SVM [14] and four advanced HSI classification methods, namely 3D-CNN [24], HybridSN [25], SSRN [30], and R-HybridSN [32], were adopted for comparison. The parameter settings of the five comparative methods were consistent with those in the corresponding references.

IP Dataset
Several classes of the IP dataset are highly similar to one another, such as Corn (2-4), Grass (5-7), and Soybean (10-12). The sample size is small for certain classes, which makes it difficult to achieve a good classification performance. A total of 5% of the samples of the IP dataset were randomly selected as the training set, and the remainder were used as the test set. The classification results were derived from the mean and standard deviation (SD) of 10 experiments. A five-layer SAE structure was used to reduce the dimensions of the HSIs, and the number of nodes in each layer was set to 220-120-80-40-10. The reduced-dimension HSIs were then input into the HDDA, with a w of 15 × 15, an lr of 0.001, a p of 0.4, and 200 epochs.
In comparison with the five other methods (Table 8), the HDDA had the highest OA, AA, and k of 96.80%, 95.83%, and 96.34%, respectively. For those classes with fewer training samples, such as Grass-pasture-mowed (7) and Oats (9), the accuracy was not satisfactory with the RBF-SVM, which uses only the spectral features to perform classification. By contrast, the four other DL-based classification methods (3D-CNN, HybridSN, SSRN, and R-HybridSN) showed advantages in processing small-sample data, being able to extract joint spatial-spectral features and thus achieve better classification. The accuracy of these two classes reached 100% with the HDDA. Due to the advantages of 3D convolution in extracting features from HSI cube data, the 3D-CNN can simultaneously extract spatial-spectral features, and its OA is improved by 13.13%. HybridSN improves the OA, AA, and k to 94.24%, 87.97%, and 93.40%, respectively, by combining a 3D-CNN and a 2D-CNN [25]. The R-HybridSN, an improvement on the HybridSN, introduces the residual module to deepen the network. It obtains a satisfactory classification performance, with an OA of 96.46% in the case of fewer training samples, which is an improvement of 1.2% compared to the SSRN. In comparison with the SSRN, the OA, AA, and k of the HDDA were increased by 1.54%, 0.64%, and 1.79%, respectively, while they were increased by 0.34%, 5.23%, and 0.32%, respectively, compared with the R-HybridSN. In addition, the HDDA also achieved a good classification performance for the easily misclassified classes, such as the three kinds of Corn (2-4) and Soybean (10-12), with accuracies of more than 94%. Consequently, the classification map of the HDDA contained fewer misclassified points and showed better classification performance (Figure 12).

PU Dataset
Only 1% of the samples of the PU dataset were randomly selected as the training set, and the remainder were used as the test set. The number of nodes in each layer of the SAE was set to 103-80-60-40-10, and then the dataset was classified through the HDDA network. The w, lr, and p were set to 19 × 19, 0.0001, and 0.4, respectively, and a total of 200 epochs were recorded. The classification accuracy was obtained using the mean and SD of 10 experimental results.
As shown in Table 9, the best classification performance was achieved by the HDDA, with OA, AA, and k of 98.28%, 97.07%, and 97.72%, respectively. Although our method did not achieve the best accuracy for every class, all of the per-class accuracies were still more than 93%, indicating that it was able to capture distinguishing features among different classes. Due to the presence of sufficient samples in the PU dataset, the OA reached 84.80% even for the RBF-SVM, but the accuracy was poor for the easily misclassified classes of Gravel (3) and Bitumen (7); by contrast, the SSRN, with its spatial-spectral residual modules, can extract deeper spatial-spectral features and thus increased the accuracy of these two classes to 76.45% and 91.60%, respectively. In comparison with the SSRN, the R-HybridSN improved the accuracy of the two classes by 10.72% and 4.22%, respectively. With the HDDA, they were improved by a further 7.52% and 3.28%, respectively, compared with the R-HybridSN, while the OA, AA, and k increased by 1.69%, 3.98%, and 2.16%, respectively. The classification map derived from the HDDA was smoother and more similar to the ground-truth map (Figure 13).

SA Dataset
Only 1% of the samples of each class were randomly selected as the training set, and the remainder were used as the test set. The number of nodes in each layer of the SAE was set to 224-120-80-40-10, and the reduced-dimension data were then input into the HDDA network for classification. The w, lr, and p were set to 15 × 15, 0.001, and 0.5, respectively, and the number of epochs was set to 200. The classification accuracy was also obtained based on the mean and SD of 10 experiments.
There are 16 classes and sufficient samples for each class of the SA dataset; it is relatively easy to distinguish various classes. As shown in Table 10, the OA reached 88.47% for the RBF-SVM, but the classification accuracy of easily misclassified Vineyard_Untrained (15) was poor, at only 66.81%. It is easier to obtain a deeper knowledge of advanced features for the DL-based network models, which have more advantages to deal with the easily misclassified classes. The classification accuracy of Vineyard_Untrained was improved to 85% using the 3D-CNN-with OA, AA, and k values of 94.03%, 95.09%, and 93.14%, respectively-while it reached more than 97% for the HybridSN; in addition, compared with the SSRN and R-HybridSN, it was more competitive in terms of the classification performance for the SA dataset, with OA, AA, and k values of 98.72%, 98.81%, and 98.54%, respectively. By contrast, the classification accuracy of Vineyard_Untrained (15) for the HDDA was slightly lower than that of the HybridSN, but the OA, AA, and k were 0.13%, 0.44%, and 0.18% higher than for the HybridSN, respectively. Compared with the R-HybridSN, the OA, AA, and k were increased by 0.6%, 1.56% and 0.67%, respectively, indicating that the performance of the HDDA was the best. It is clear that there were fewer noise points on the classification map derived from the HDDA, with smoother visual effects ( Figure 14).

Comparison of Training Percentages
In order to further verify the classification performance of the HDDA when using limited training samples, different training proportions were set up for the IP, PU, and SA datasets. For the IP dataset, the proportions were 2%, 4%, 6%, 8%, and 10%, while for the PU and SA datasets, they were 0.2%, 0.4%, 0.6%, and 0.8%. The classification accuracy of the HDDA was comparatively analyzed by further reducing the number of training samples (Figure 15). The HDDA showed the best OA under different training proportions for all three datasets, reaching 88.57%, 86.93%, and 93.22% for the IP, PU, and SA datasets, respectively, even when using only 2% of IP or 0.2% of PU and SA for training. Moreover, as the number of training samples increased, the HDDA continued to show better classification performance than the five other classification methods.

Ablation Experiments
In order to verify the effectiveness of the proposed hybrid network structure, SAE dimension-reduction method, and attention module, ablation experiments were conducted on the three hyperspectral datasets. The models used for comparison were consistent with the original network structure, except for the tested components.

Effectiveness of the Hybrid Dense Network
In this section, we evaluated the 3D branch and the 2D branch individually, without changing the other parameter settings. The OA was obtained based on the mean and SD of 10 experiments (Table 11). It was observed that the 3D branch is more suitable for processing HSIs than the 2D branch. In addition, because the HDDA method integrates the spatial-spectral features extracted by the 3D branch and the spatial features extracted by the 2D branch, it has higher and more stable classification accuracy.

Effectiveness of the SAE
In order to highlight the advantages of SAE dimensionality reduction, the PCA, locally linear embedding (LLE), and single-layer AE methods were also used to reduce the dimensions of the original HSIs. The number of dimensions was set to 10, and the proposed HDDA network was used for classification. As shown in Figure 16, in comparison with the three other methods, the OA for all three datasets was the highest when using the SAE. Specifically, the OA of the LLE method was slightly better than that of PCA, because LLE can deal with nonlinear problems to a certain extent. The single-layer AE method performed the worst.

Effectiveness of the Dual-Attention Mechanism
To verify the effectiveness of the proposed dual-attention module, we conducted three additional experiments on the three datasets, i.e., without the attention mechanism (Model1), with only the spatial attention mechanism (Model2), and with only the channel attention mechanism (Model3). The OA of the models using the attention mechanisms was higher than that of the model without them, demonstrating the effectiveness of the attention mechanisms (Figure 17). More specifically, the proposed HDDA method had the highest OA. Model2 performed better than Model3 on the IP and SA datasets, while Model3 performed better than Model2 on the PU dataset. This suggests that the spatial attention mechanism was more effective than the channel attention mechanism on the IP and SA datasets, but not on the PU dataset.
Figure 17. Comparison of OA using four attention mechanisms for three datasets.
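The four ablation configurations can be illustrated with the following sketch, in which channel and spatial attention are toggled independently. It is a simplified, parameter-free stand-in: the attention modules of the HDDA presumably contain learned layers, whereas here each weight is simply the sigmoid of a pooled average. The function names, the `VARIANTS` mapping, the toy feature map, and the channel-then-spatial order are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap):
    """Scale every channel by the sigmoid of its global average (no learned layers)."""
    out = []
    for channel in fmap:
        count = sum(len(row) for row in channel)
        weight = sigmoid(sum(sum(row) for row in channel) / count)
        out.append([[value * weight for value in row] for row in channel])
    return out


def spatial_attention(fmap):
    """Scale every pixel by the sigmoid of its mean over channels (no learned layers)."""
    n_ch, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    weights = [[sigmoid(sum(fmap[c][i][j] for c in range(n_ch)) / n_ch)
                for j in range(width)] for i in range(height)]
    return [[[fmap[c][i][j] * weights[i][j] for j in range(width)]
             for i in range(height)] for c in range(n_ch)]


def refine(fmap, use_channel=True, use_spatial=True):
    """Apply the enabled attention steps; channel-then-spatial order is assumed."""
    if use_channel:
        fmap = channel_attention(fmap)
    if use_spatial:
        fmap = spatial_attention(fmap)
    return fmap


# Ablation variants as named in the text; the flag names are hypothetical.
VARIANTS = {
    "Model1": dict(use_channel=False, use_spatial=False),  # no attention
    "Model2": dict(use_channel=False, use_spatial=True),   # spatial attention only
    "Model3": dict(use_channel=True, use_spatial=False),   # channel attention only
    "HDDA": dict(use_channel=True, use_spatial=True),      # dual attention
}

# Toy feature map with 2 channels of size 2 x 2, indexed as [channel][row][column].
toy = [[[1.0, 2.0], [3.0, 4.0]],
       [[0.0, 0.0], [0.0, 0.0]]]
outputs = {name: refine(toy, **flags) for name, flags in VARIANTS.items()}
```

Keeping the two attention steps behind separate flags mirrors the ablation design, in which only the tested component changes while the rest of the network is left untouched.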

Conclusions
To address the limited number of labeled HSI samples and the low classification accuracy of current neural network models, a hybrid dense network with a dual-attention mechanism was proposed from the perspective of network optimization. The network framework was established through a combination of two feature-extraction branches based on a 3D-CNN and a 2D-CNN. The use of dense modules deepened the network, alleviated the vanishing-gradient problem, and extracted more robust spatial-spectral features. In addition, the dual-attention mechanism was introduced to the two feature-extraction branches, and corresponding weights were given in the spatial dimension and the channel dimension. The features in the HSIs were selectively learned, and different weights were assigned to corresponding features in order to further improve the feature-extraction capability of the network. Additionally, BN and dropout layers were introduced and the ReLU activation function was used in order to prevent overfitting, reduce the number of training parameters, and achieve faster convergence. Three publicly available hyperspectral datasets (IP, PU, and SA) were used to evaluate the network. The results show that the HDDA has a superior classification performance compared with the five other methods. In the future, we will further study the attention mechanism and design more targeted attention modules in order to better solve the problem of small samples for HSIs.