Compound Multiscale Weak Dense Network with Hybrid Attention for Hyperspectral Image Classification

Abstract: Recently, hyperspectral image (HSI) classification has become a popular research direction in remote sensing. The emergence of convolutional neural networks (CNNs) has greatly promoted the development of this field, and CNN-based methods have demonstrated excellent classification performance. However, due to the particularity of HSIs, redundant information and limited samples pose huge challenges for extracting strongly discriminative features. In addition, fully mining the internal correlation of the data or features on top of an existing model is also crucial for improving classification performance. To overcome these limitations, this work presents a strong feature extraction neural network with an attention mechanism. First, the original HSI is weighted by means of a hybrid spectral-spatial attention mechanism. Then, the data are input into a spectral feature extraction branch and a spatial feature extraction branch, each composed of multiscale feature extraction modules and weak dense feature extraction modules, to extract high-level semantic features. The two sets of features are compressed by global average pooling and fused by concatenation. Finally, the classification results are obtained using two fully connected layers and one Softmax layer. A performance comparison on three public datasets shows that the proposed model outperforms current state-of-the-art methods.


Introduction
Hyperspectral remote sensing, namely, hyperspectral-resolution remote sensing, refers to the use of many very narrow electromagnetic wave segments (usually <10 nm) to obtain relevant data from the target area. The HSI is acquired using an imaging spectrometer that provides detailed spectral information in a narrow range of continuous wavelengths [1]. Benefitting from the high spectral resolution, the resulting HSI shows advantages in identifying various land-cover categories or targets [2]. It enables substances that cannot be detected in wideband remote sensing to be detected in hyperspectral data.
In recent years, due to the high dimension and massive data of HSIs, the analysis and processing of HSIs has become one of the hotspots in remote sensing image research. The process has been widely used in ocean detection [3,4], mineral exploration [5,6], road detection [7], vegetation analysis [8,9], national defense and military applications [10,11], etc., and it is worthy of further research.
Whereas an RGB image records only three broad bands, an HSI samples the spectral dimension far more finely and forms a three-dimensional data cube with many bands stacked in sequence. HSI classification refers to analyzing the spectral and spatial information of all categories of ground objects in the HSI, selecting the features, dividing the feature space into non-overlapping subspaces through various methods, and then assigning each pixel in the image to one of these subspaces.
In recent decades, deep learning technology [12][13][14] has become one of the most popular research fields in artificial intelligence. It has made breakthroughs in many fields such as image processing [15,16], speech recognition [17], and natural language processing [18], and it is currently one of the most advantageous technologies applied to HSI classification tasks. Owing to the particularity of HSIs, both the many spectral bands and the neighborhood pixels of the target pixel contain significant features. The effective use of the spectral and spatial information of the data is the key to extracting robust and discriminative features. A 3D CNN has the characteristics of parameter sharing and local perception, which fit the special research requirements of HSIs. Therefore, most of the classification models proposed in recent years are based on the 3D CNN approach. Lee et al. [19] proposed a contextual deep CNN, which jointly utilizes the local spectral-spatial relationship of adjacent single pixels to optimally explore local contextual interactions. Zhong et al. [20] designed an end-to-end spectral-spatial residual network, which uses residual connections to reduce decreases in accuracy caused by increases in network depth. Wang et al. [21] proposed an end-to-end fast dense spectral-spatial convolution model, which combines different convolution kernel scales with dense connections in order to extract features. Roy et al. [22] used a hybrid 2D-3D CNN to construct a lightweight end-to-end network model. Ge et al. [23] constructed a multiscale multibranch feature-fusion HSI classification model based on a 2D-3D CNN, and achieved a good classification performance. Huang et al. [24] proposed a dual-path Siamese CNN for HSI classification. This model integrates morphological profiles, a CNN, a Siamese network, and spectral-spatial feature extraction technology, and achieves good classification results. Safari et al. [25] designed a neural network that combines different convolution kernels to effectively learn joint spatial-spectral features at multiple scales, which achieves better classification results on high-resolution datasets. Praveen et al. [26] proposed a classification model combining traditional methods with a CNN, and achieved good performance. In [27], a lightweight spectral-spatial convolution module was proposed to replace the convolution layer. This module consists of cheap transformation operations, which can greatly reduce the model parameters. Gao et al. [28] proposed a sandwich CNN based on spectral feature enhancement (SFE-SCNN), which reduces the interference of mixed pixels by enhancing spectral features.
How to mine the internal correlations of data or features has become a research hotspot in recent years, with researchers focusing on introducing an attention mechanism that weights data or features to improve information utilization. The attention mechanism was initially used to deal with computer vision tasks [29][30][31] and showed good performance. For HSI classification tasks, attention mechanisms are also effective. Typically, attention mechanisms are added at the beginning or end of the neural network models. According to their position in the model, they can be divided into preprocessing-based attention mechanisms and postprocessing-based attention mechanisms.
The attention mechanism based on preprocessing is generally located at the initial stage of the model, directly processing the original HSI data and mining the structural characteristics of the original data [32][33][34]. Yu et al. [35] proposed a spectral-spatial dense CNN model with a feedback attention mechanism, using the semantic knowledge provided by the high level of the dense model to enhance the attention map. Zhu et al. [36] proposed an end-to-end residual spectral-spatial attention network. Based on the residual spectral attention module and spatial attention module, the original hyperspectral data are processed and fed into the CNN for feature extraction. Lin et al. [37] designed an attention-aware pseudo-3D CNN model, which provides a more detailed description of each dimension of the input by allocating attention. Guo et al. [38] proposed a feature-grouped network based on a spectral-spatial connected attention mechanism (FG-SSCA) to enhance the effectiveness of the data.
The attention mechanism based on postprocessing is usually located in the middle or at the end of the model, and improves feature utilization by weighting high-dimensional semantic features. Ma et al. [39] constructed a double-branch multi-attention mechanism network (DBMA). The extracted features are weighted via the spectral and spatial attention modules for better classification performance. Compared with DBMA, DBDA [40] introduced a more flexible and adaptive attention mechanism to achieve better classification performance, while keeping the overall network architecture unchanged. In [41], a 3-D octave convolution (3D-OC) approach combined with a spectral-spatial attention network was proposed to extract discriminative spectral-spatial features. A 3D-OC first mines deep spatial information from high to low frequencies, and also takes spectral information into account. Then, two attention modules are used to highlight important spatial regions and special spectral bands to improve feature discrimination. Xue et al. [42] designed a second-order pooling network based on the attention mechanism, which assigns different weights to different pixels through a correlation matrix and a learnable cosine distance function. Zhang et al. [43] proposed a spectral-spatial-semantic network, which combines a multi-directional attention mechanism for HSI classification. Pu et al. [44] presented a dual-path CNN model based on an attention mechanism, which adaptively recalibrates the nonlinear interdependence between features in conjunction with the multiscale attention mechanism (MS-AM) to alleviate the Hughes phenomenon. Cui et al. [45] proposed a dual-triple attention network model, which achieves high HSI classification accuracy by capturing cross-dimensional interactive information. In this model, attention mechanisms are added during and after the feature extraction process to improve the effectiveness of features. Zhao et al. [46] proposed a central attention network to effectively understand the internal correlation between the central pixel and its neighborhood pixels in a subcube sample. The spectral-spatial features generated by this method showed good discrimination performance. Xue et al. [47] proposed a hierarchical residual network with an attention mechanism (HResNetAM), which uses attention mechanisms in the spectral and spatial feature extraction branches to calibrate the weights of hierarchical spectral and spatial features.
In this paper, inspired by these advanced methods, we propose a novel compound multiscale weak dense network to extract strong, robust and discriminative features. In the preprocessing stage, we construct a learnable hybrid spectral-spatial attention mechanism to improve data effectiveness and further enhance the performance of the model. Our new deep model consists of two network branches that extract the spectral and spatial features of the HSI, respectively. For each branch of the network, the compound multiscale feature extraction modules are designed to obtain abundant features at different scales. Then, the weak dense feature extraction modules are constructed to further extract more discriminative high-dimensional semantic features. The features of the two branches are fused through concatenation. Finally, the fused spectral-spatial features are fed into fully connected layers and a Softmax layer to obtain the classification results.
The main contributions of this paper are as follows: (1) A compound multiscale weak dense network model combining a hybrid attention mechanism (CMWD-HA) is proposed for HSI classification. This model shows good classification performance and high efficiency; (2) A hybrid spectral-spatial attention mechanism is proposed in preprocessing. This attention mechanism aims to weight HSI data simultaneously at both spectral and spatial levels. In addition, the mechanism is learnable and consumes fewer computing resources; (3) Spectral and spatial multiscale feature extraction modules and weak dense spectral and spatial feature extraction modules are designed, and spectral feature extraction branches and spatial feature extraction branches are constructed based on the above modules. The extracted high-level semantic features can distinguish different categories of pixels well, with good generalization ability; (4) Dropout and dynamic learning rates are used to ensure the rapid convergence of the model.

ResNet and DenseNet
In deep learning models, shallow networks may not fit the data well, resulting in weak performance. When the number of network layers increases to some extent, the performance of the model is enhanced accordingly. However, as the number of network layers increases, corresponding problems appear, such as the huge consumption of computing resources, overfitting, and vanishing or exploding gradients. The performance of the model is not always positively correlated with the number of network layers. When the number of network layers exceeds a certain range, the network experiences degradation: as the depth increases, the training loss first decreases gradually, but once the depth continues to grow, the loss increases instead. When network degradation occurs, the training effect of a shallow network is better than that of a deep network [48]. This is consistent with the data processing inequality in information theory: for a Markov process X → Y → Z, we have $I(X; Y) \geq I(X; Z)$, where $I$ represents mutual information. Therefore, in forward propagation, the deeper the network, the less original information the features contain, and model performance cannot be improved by continuously increasing network depth. If the low-level features are transferred to the high level, however, the model will perform at least as well as the shallow network, which is the motivation for ResNet. ResNet is designed so that layer N + 1 contains no less image information than layer N. The computation equation of ResNet is as follows:
$$x_l = H_l(x_{l-1}) + x_{l-1}$$
where $x_l$ represents the output of layer $l$, and $H_l$ represents a nonlinear transformation. For ResNet, the output of layer $l$ is the output of layer $l - 1$ plus the nonlinear transformation of the output of layer $l - 1$.
DenseNet [49] is based on a similar idea to that of ResNet, but it creates dense connections between each layer and all subsequent layers. DenseNet fully reuses the features, that is, each layer of the network can use the feature maps of all previous layers. Compared to ResNet, DenseNet promotes gradient backpropagation, making the network easier to train. In addition, DenseNet can achieve better performance than ResNet with fewer parameters and lower computing costs. The computation equation of DenseNet is as follows:
$$x_l = H_l([x_0, x_1, \cdots, x_{l-1}])$$
where $x_l$ represents the output of layer $l$, $H_l$ is a nonlinear transformation, and $[x_0, x_1, \cdots, x_{l-1}]$ indicates the concatenation (concat) of the output feature maps from layer $0$ to layer $l - 1$. Figure 1a shows the connection mechanism of ResNet, and Figure 1b shows that of DenseNet. In ResNet, the input of each layer is added element-wise to the output of that layer's nonlinear transformation. In DenseNet, each layer is connected with all the previous layers through dimensional stacking (concat). For an L-layer network, DenseNet contains L(L + 1)/2 connections, whereas ResNet contains L connections.
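The two connection rules can be contrasted with a minimal NumPy sketch. It is an illustration only: the helper `H` is a hypothetical stand-in for the nonlinear transformation (a random linear map over channels followed by ReLU), and the toy feature-map size, layer count, and growth rate are assumptions rather than settings from this paper.

```python
import numpy as np

def H(x, out_channels, seed):
    """Toy stand-in for H_l: random linear map over channels, then ReLU."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.1
    return np.maximum(x @ w, 0.0)

def residual_forward(x0, num_layers):
    """x_l = H_l(x_{l-1}) + x_{l-1}; the channel count never changes."""
    x = x0
    for l in range(1, num_layers + 1):
        x = H(x, x.shape[-1], seed=l) + x
    return x

def dense_forward(x0, num_layers, growth):
    """x_l = H_l([x_0, ..., x_{l-1}]); every layer sees all earlier outputs."""
    outputs = [x0]
    for l in range(1, num_layers + 1):
        stacked = np.concatenate(outputs, axis=-1)
        outputs.append(H(stacked, growth, seed=l))
    return np.concatenate(outputs, axis=-1)

x0 = np.ones((5, 5, 8))  # toy 5 x 5 feature map with 8 channels (assumed)
res_out = residual_forward(x0, num_layers=4)
dense_out = dense_forward(x0, num_layers=4, growth=4)
print(res_out.shape)     # (5, 5, 8)
print(dense_out.shape)   # (5, 5, 24): 8 input channels + 4 layers x 4 channels

L = 4
dense_connections = L * (L + 1) // 2  # connection count stated in the text
print(dense_connections)              # 10
```

The residual path keeps the channel dimension fixed, whereas the dense path grows it with every layer, which is the cost that the weak dense design later in the paper limits.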

Hybrid Attention Mechanism
The attention mechanism is a data processing method in machine learning, which has been widely used in natural language processing [50,51] and image processing [52,53]. In HSI processing tasks, scholars have further improved the performance of the model by introducing attention mechanisms based on the study of a neural network model. The attention mechanism can help the model to assign different weights to each part of the input, extract more critical and important information, and enable the model to make more accurate judgments without exerting too much calculation and storage pressure.
This paper proposes a hybrid attention mechanism, which is located at the beginning of the network model. From the perspective of a hybrid and learnable approach, we designed the hybrid attention mechanism to accomplish both spectral and spatial attention simultaneously. The flowchart of the hybrid attention mechanism is shown in Figure 2, taking the Indian Pines dataset as an example. After carrying out principal component analysis (PCA) for dimension reduction, the dimensions of the original data cube change from 15 × 15 × 200 to 15 × 15 × 30. According to the spectral branch and spatial branch in the hybrid attention mechanism, the data are processed into 1 × 1 × 30 and 15 × 15 × 1 through AveragePooling3D. Then, the data are constrained between 0 and 1 through the sigmoid function after a 2D convolution. Among these, the scales of 2D convolution kernels in the spectral branch and spatial branch are 1 × 1 and 3 × 3, respectively. After sigmoid processing, the attention matrix of the same dimension as the original data is obtained through matrix multiplication of the data of the two branches. Finally, the original data and the attention matrix are multiplied element by element and added to complete the attention process.
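The flow described above can be sketched in NumPy for a single 15 × 15 × 30 patch. This is a sketch under stated assumptions, not the authors' implementation: the 1 × 1 convolution of the spectral branch is modelled as a 30 × 30 channel-mixing matrix, random values stand in for the learned weights, the matrix multiplication of the two branches is written as an outer product, and the final addition is interpreted as adding the weighted data back to the original data. The names `hybrid_attention`, `w_spec`, and `k_spat` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3_same(img, kernel):
    """Single-channel 3 x 3 convolution (cross-correlation) with zero padding."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros_like(img)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def hybrid_attention(cube, w_spec, k_spat):
    """cube: (H, W, B) patch after PCA. Returns the reweighted patch and the attention."""
    spec = cube.mean(axis=(0, 1))               # (B,)   spectral descriptor, 1 x 1 x B
    spat = cube.mean(axis=2)                    # (H, W) spatial descriptor, H x W x 1
    spec = sigmoid(spec @ w_spec)               # 1 x 1 conv modelled as channel mixing
    spat = sigmoid(conv3x3_same(spat, k_spat))  # 3 x 3 conv in the spatial branch
    attention = spat[:, :, None] * spec[None, None, :]  # outer product, (H, W, B)
    # Assumed reading of "multiplied element by element and added":
    return cube * attention + cube, attention

rng = np.random.default_rng(0)
cube = rng.standard_normal((15, 15, 30))        # Indian Pines patch size after PCA
w_spec = rng.standard_normal((30, 30)) * 0.1    # stand-in for learned weights
k_spat = rng.standard_normal((3, 3)) * 0.1      # stand-in for learned weights
weighted, attention = hybrid_attention(cube, w_spec, k_spat)
print(weighted.shape, attention.shape)
```

Because both branches end in a sigmoid, every attention weight lies strictly between 0 and 1, and the module adds only one small matrix and one 3 × 3 kernel of trainable parameters in this sketch.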

Multiscale Spectral and Spatial Feature Extraction
HSI cubes exhibit the phenomena of "same spectrum, different materials" and "same material, different spectra", meaning that single-scale features cannot reflect the characteristics of image pixels well. Therefore, we propose a multiscale spectral feature extraction module and a multiscale spatial feature extraction module. In this way, both more local and more global features can be considered. As shown in Figure 3a, for the multiscale spectral feature extraction module, HSI cubes are processed by 3D convolutional layers with scales of 1 × 1 × 3, 1 × 1 × 5, and 1 × 1 × 7, respectively, and then fused by concat. Finally, feature alignment is performed through a 3D CNN with a kernel scale of 7 × 7 × 7. The multiscale spatial feature extraction module is shown in Figure 3b. We designed three branches of the neural network to extract multiscale spatial features of the image cubes. The first branch uses a 3D CNN with a kernel scale of 7 × 7 × 7, and the second branch uses two layers of a 3D CNN with kernel scales of 5 × 5 × 5 and 3 × 3 × 3. The third branch uses three layers of a 3D CNN with the same kernel scale of 3 × 3 × 3. Then, the feature maps extracted from the three branches are also fused through concat. Finally, feature alignment is also performed by means of a 3D CNN with a kernel scale of 1 × 1 × 1. At this point, the multiscale spectral features and multiscale spatial features have been extracted. The implementation details of the two multiscale feature extraction modules are shown in Tables 1 and 2.
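As a small worked example of how the three spatial branches differ, the following sketch computes the receptive field along one axis and the number of kernel weights for each stack of kernels listed above. It assumes stride-1 convolutions and equal channel counts in every layer, neither of which is stated in this section, and the helper names are hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def weights_per_channel_pair(kernel_sizes):
    """Cubic-kernel weights per (input channel, output channel) pair, biases ignored."""
    return sum(k ** 3 for k in kernel_sizes)

spatial_branches = {
    "branch 1": [7],        # one 7 x 7 x 7 layer
    "branch 2": [5, 3],     # 5 x 5 x 5 followed by 3 x 3 x 3
    "branch 3": [3, 3, 3],  # three 3 x 3 x 3 layers
}
for name, kernels in spatial_branches.items():
    print(name, receptive_field(kernels), weights_per_channel_pair(kernels))

# Spectral module: 1 x 1 x k kernels act along the spectral axis only.
spectral_fields = [receptive_field([k]) for k in (3, 5, 7)]
print(spectral_fields)
```

Under these assumptions, the three spatial branches cover the same 7-voxel extent per axis with 343, 152, and 81 weights per channel pair, respectively, so they differ in depth and in the intermediate scales they pass through, while the spectral kernels cover 3, 5, and 7 bands.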

Weak Dense Spectral and Spatial Feature Extraction
After multiscale spectral and spatial feature extraction, the features are fed into the weak dense spectral and spatial feature extraction modules, respectively. In this part of the process, we weaken the DenseNet structure and only retain skip connections between adjacent network layers. The input of each network layer is the fusion of the output feature maps of the previous two network layers. Padding processing is applied to the feature maps in these two modules to ensure that the dimensions of the feature maps remain unchanged. In addition, the stride of the last two convolution layers of each module is set to (1,1,2). The weak dense structure reduces the dimensions of the feature maps while ensuring the reuse of the feature, which improves the model's efficiency to some extent. The implementation details of two weak dense modules are shown in Tables 3 and 4.
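The effect of weakening the dense connections can be illustrated by counting input channels per layer. The sketch below assumes that the fusion of the previous two outputs is a concatenation, and the channel count of 24, the four layers, and the spectral depth of 30 are hypothetical values chosen for illustration; the actual settings are those of Tables 3 and 4.

```python
import math

def dense_input_channels(k0, growth, num_layers):
    """Input channels per layer when each layer sees all earlier outputs."""
    return [k0 + (l - 1) * growth for l in range(1, num_layers + 1)]

def weak_dense_input_channels(k0, growth, num_layers):
    """Input channels per layer when each layer sees only the previous two outputs."""
    outputs = [k0]  # channel counts of x_0, x_1, ...
    inputs = []
    for _ in range(num_layers):
        inputs.append(sum(outputs[-2:]))
        outputs.append(growth)
    return inputs

def same_padded_length(n, stride):
    """Output length along one axis of a 'same'-padded convolution."""
    return math.ceil(n / stride)

full = dense_input_channels(k0=24, growth=24, num_layers=4)
weak = weak_dense_input_channels(k0=24, growth=24, num_layers=4)
print(full)  # [24, 48, 72, 96]
print(weak)  # [24, 48, 48, 48]

# Two consecutive layers with stride 2 along the last axis (assumed depth of 30).
depth = 30
for _ in range(2):
    depth = same_padded_length(depth, stride=2)
print(depth)  # 8
```

In this toy setting the input width of a weak dense layer stays bounded, whereas it grows linearly with depth in a fully dense block, and the two strided layers shrink the last axis by roughly a factor of four.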

Compound Multiscale Weak Dense Network with Hybrid Attention for HSI Classification
The structure of our CMWD-HA is shown in Figure 4. This model can be divided into two parts: the data preprocessing stage and the feature extraction and classification stage. In the data preprocessing stage, the original HSI is processed by means of PCA for dimension reduction, and the hybrid attention mechanism is adopted to assign corresponding weights to the data to improve the effectiveness of the data. After PCA processing of the HSI, the data are compressed in the spectral dimension, and the influence of noise and redundant information is greatly reduced. At this time, the hyperspectral data still have dozens of spectral bands, and the spectral information for different categories of pixels is still discriminative. Therefore, implementing the spectral-spatial attention mechanism on the PCA-processed data can enhance the effectiveness of the data, thereby improving the classification performance of the model. For the feature extraction and classification stage, the proposed method constructs two neural network branches. The two branches first extract the multiscale spectral and spatial features of the HSI, and then use the weak dense feature extraction modules to extract high-dimensional semantic features with sufficient discrimination. The data, processed by PCA and the attention mechanism, have abundant spectral and spatial information, and the important information in the data is more prominent. At this time, based on the multiscale spectral and multiscale spatial feature extraction modules, very rich spectral and spatial features can be obtained. Then, combining these with the weak dense feature extraction modules, the model can extract higher-dimensional and more abstract semantic features. This ensures that the subsequent fused features exhibit strong discrimination and can accurately complete the classification task. Finally, global average pooling is used to reduce the dimensions of the features in the two branches, and then the two features are fused. The classification results are obtained through two fully connected layers and a Softmax layer.
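The final fusion and classification steps can be sketched as follows. The branch output shapes, the hidden width of 32, the ReLU between the two fully connected layers, and the random weights are all assumptions made for illustration; only the sequence of global average pooling, concat, two fully connected layers, and Softmax follows the text.

```python
import numpy as np

def global_average_pool(features):
    """Average a (D1, D2, D3, C) feature volume over its first three axes."""
    return features.mean(axis=(0, 1, 2))

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
spectral_feat = rng.standard_normal((15, 15, 8, 24))  # hypothetical branch output
spatial_feat = rng.standard_normal((15, 15, 8, 24))   # hypothetical branch output

# Compress each branch to one vector, then fuse by concatenation.
fused = np.concatenate([global_average_pool(spectral_feat),
                        global_average_pool(spatial_feat)])  # (48,)

w1, b1 = rng.standard_normal((48, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.standard_normal((32, 16)) * 0.1, np.zeros(16)   # 16 classes, as in IP
hidden = np.maximum(fused @ w1 + b1, 0.0)  # first fully connected layer (ReLU assumed)
probs = softmax(hidden @ w2 + b2)          # second fully connected layer + Softmax
print(fused.shape, probs.shape, int(probs.argmax()))
```

Global average pooling removes all spatial and spectral axes, so the size of the fused vector depends only on the channel counts of the two branches and not on the patch size.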

Measures Taken to Prevent Overfitting
In the construction of deep learning models, blindly pursuing a higher predictive ability often leads to a relatively complex structure. Generally speaking, deep learning models contain a large number of parameters. Such a model may fit the training data very well but perform poorly on the test set. This phenomenon is called overfitting. In this paper, we introduce dropout and a dynamic learning rate to mitigate overfitting.
Dropout means that in the training process of the deep learning model, some neurons will be temporarily dropped from the network according to a certain probability. Therefore, the model will not rely too much on some local features, so as to improve the generalization ability [54]. In this paper, we use a dropout with a 0.5 dropout rate for the fully connected layers at the end of the neural network model. The learning rate is one of the key hyperparameters in the training stage of a deep learning model. If the selected learning rate is too large, the model can accelerate learning in the early stage and decrease the loss rapidly, whereas in the later stage, the loss will fluctuate so that the model cannot converge. If the learning rate is too small, the loss decreases slowly during the training stage, making it difficult to optimize the model. Therefore, this paper adopts the dynamic learning rate mechanism in the training process. In the early stage of training, a slightly higher learning rate is used to reduce the loss rapidly. In the later stage of training, the model can converge better by gradually reducing the learning rate.
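Both measures can be sketched in a few lines of plain Python. The dropout rate of 0.5, the initial learning rate of 0.0005, and the 80 epochs come from this paper; the inverted-dropout rescaling and the step schedule (halving every 20 epochs) are assumptions, since the exact decay rule is not specified here, and the function names are hypothetical.

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def step_decay(initial_lr, epoch, drop_every=20, factor=0.5):
    """Hypothetical schedule: multiply the rate by `factor` every `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)

rng = random.Random(0)
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or 2.0

schedule = [step_decay(0.0005, epoch) for epoch in range(80)]
print(schedule[0], schedule[20], schedule[79])
```

The rescaling by 1 / (1 - rate) keeps the expected activation unchanged during training, so no correction is needed at test time, when dropout is switched off.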

Data Description
Three widely used HSI datasets, the Indian Pines (IP), University of Pavia (PU), and Salinas (SA) datasets, were employed in these experiments.
The Indian Pines (IP) dataset was collected using the airborne visible/infrared imaging spectrometer (AVIRIS) sensor in north-western Indiana, 1992. The dataset contains 16 categories with the size of 145 × 145 pixels and 220 spectral bands in the wavelength range of 0.4-2.5 µm. After removing 20 water absorption bands, the remaining 200 bands can be adopted for analysis.
The University of Pavia (PU) dataset was obtained through the reflective optics system imaging spectrometer (ROSIS) sensor at the University of Pavia, northern Italy, 2001. The dataset contains 9 categories with the size of 610 × 340 pixels and 103 spectral bands in the wavelength range 0.43-0.86 µm.
The Salinas (SA) dataset was acquired using the AVIRIS sensor over Salinas Valley, CA, USA, in 1998. The dataset contains 16 categories with the size of 512 × 217 pixels and 224 spectral bands in the wavelength range 0.4-2.5 µm.
From the three datasets, we selected 5% of the IP samples, 1% of the PU samples, and 1% of the SA samples for training, used the same number of samples for validation, and used the remaining samples for testing. Tables 5-7 list the land-cover classes and corresponding numbers of experimental samples for the three datasets.
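A split of this kind can be sketched as below. The per-class (stratified) sampling, the rounding rule, and the minimum of one training sample per class are assumptions, and `split_indices` is a hypothetical helper; the toy label vector does not correspond to any of the three datasets.

```python
import random
from collections import defaultdict

def split_indices(labels, train_fraction, seed=0):
    """Per-class split: a fraction for training, the same count for validation, the rest for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n = max(1, round(train_fraction * len(idxs)))  # at least one sample per class
        train += idxs[:n]
        val += idxs[n:2 * n]
        test += idxs[2 * n:]
    return train, val, test

labels = [0] * 200 + [1] * 100 + [2] * 40  # toy label vector with three classes
train, val, test = split_indices(labels, train_fraction=0.05)
print(len(train), len(val), len(test))  # 17 17 306
```

Fixing the seed makes a single split reproducible; repeating the experiment with different seeds would give different random splits.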

Experimental Setup
The hardware used in the experiments consisted of an Intel Core i7-9700 CPU and an NVIDIA RTX 2080 Ti GPU. According to the optimal experimental results, 0.0005 was selected as the learning rate, the batch size was 32, and the number of training epochs was 80.
Three quantitative indicators, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa), were used to measure the accuracy of each method. OA refers to the ratio of correctly classified pixels to the total pixels. AA refers to the average of the classification accuracy of all categories. Kappa refers to the consistency between the classification results and ground truth. The larger the value of the three indicators, the better the classification result of the model. All the experiments in this paper were repeated 10 times (the network parameters in each experiment were randomly initialized), and the average values of the 10 experiments were determined as the final experimental results. Next, we briefly introduce the compared methods.
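The three indicators can be computed from a confusion matrix as in the following sketch. Kappa is computed here as Cohen's kappa, the standard definition of agreement corrected for chance; the toy labels are illustrative only, and the sketch assumes every class occurs at least once in the ground truth.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def oa_aa_kappa(cm):
    n = sum(sum(row) for row in cm)
    k = len(cm)
    oa = sum(cm[i][i] for i in range(k)) / n                   # overall accuracy
    aa = sum(cm[i][i] / sum(cm[i]) for i in range(k)) / k      # mean per-class accuracy
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted counts per class
    pe = sum(sum(cm[i]) * col[i] for i in range(k)) / (n * n)  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa, aa, kappa = oa_aa_kappa(cm)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.8 0.8056 0.7015
```

OA is dominated by the large classes, whereas AA weights every class equally, which is why the two can diverge on datasets with very unbalanced classes.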

Quantitative Evaluation of Classification Results
This section presents a quantitative comparison between the proposed method and the related methods from four aspects: the accuracy of each category, OA, AA, and Kappa. The experimental results are shown in Tables 8-10, with the best accuracy for each indicator shown in bold. As shown in Tables 8-10, among all the methods compared, the proposed method achieved the highest classification accuracy in almost all cases on the three datasets. Due to the particularities of HSIs, both spectral and spatial features are necessary factors in obtaining better classification results. The SVM depends only on spectral information for classification, resulting in the weakest classification performance. CDCNN constructs a deeper network structure. However, the network contains a 2D CNN alone and ignores the relevant information between the spectral bands, so the classification accuracy is relatively low. The classification results of the above two methods on the three datasets were lower than 77%, 92%, and 87%, respectively.
The hybrid use of the spectral and spatial features of the HSI is the most direct way to enhance the classification performance of the model. A 3D CNN, with its 3D kernel structure, can simultaneously extract the joint spectral-spatial features of HSIs, which is a popular approach in current research. The structures of SSRN, FDSSC, and HybridSN are all based on the 3D CNN. Compared with the SVM and CDCNN, which use spectral or spatial features alone, the classification accuracy of the combination of spectral and spatial features shows improvements of at least 16.8%, 4.5%, and 6.5% in three datasets. When the structure and parameters of the model are determined, its performance is fixed. In recent years, many scholars have proposed attention mechanisms based on the weight distribution within the features to further improve the classification performance of the network model. The network models of DBMA and DBDA are relatively similar, using different spectral and spatial attention mechanisms for postprocessing, respectively. In IP, the performance of DBMA is slightly better than that of SSRN and HybridSN, which is comparable to FDSSC. Compared with SSRN, FDSSC, and HybridSN, DBMA shows improvements of 3.63%, 2.38%, and 3.99% on the three datasets, respectively. The classification accuracy of DBDA on IP is 2.73% higher than that of DBMA, and their performance is similar on PU and SA. In PU and SA, the classification accuracy of DBMA and DBDA using the attention mechanisms is lower than that of SSRN, FDSSC, and HybridSN. The reason is that only 1% of the data in PU and SA are used for training; thus, the extracted features are insufficient to distinguish between different categories of pixels. In this case, the attention mechanisms used in the postprocessing stage cannot improve the classification performance.
Our proposed CMWD-HA constructs the spectral feature extraction branch and the spatial feature extraction branch, respectively, through the multiscale feature extraction modules and the weak dense feature extraction modules. By fusing the output of two network branches, the fused features can distinguish well between different categories of pixels. In addition, we use the hybrid attention mechanism for preprocessing to further improve the performance of the model on three datasets. Compared with the best methods in the three datasets, the classification accuracy of CMWD-HA is improved by 0.35%, 0.57%, and 0.93%, respectively.

Qualitative Evaluation of Classification Results
The qualitative classification map can directly reflect the classification results of different methods. Figures 5-7 show the classification maps of each compared method. The classification performance of the SVM, which uses only spectral features, and CDCNN, which uses only spatial features, was found to be the worst. The salt-and-pepper noise was severe, which can be seen in Figure 5c,d, Figure 6c,d and Figure 7c,d. By contrast, the hybrid spectral-spatial features extracted via SSRN, FDSSC, and HybridSN showed better classification performance and less noise in the classification maps. After adding attention mechanisms, the noise of DBMA and DBDA on the IP dataset was very small, whereas their noise on PU and SA was greater than that of SSRN, FDSSC, and HybridSN. The hybrid features extracted via the proposed method showed strong robustness and discrimination. On the three datasets, the proposed method achieved the best classification accuracy and relatively clean classification maps.

Comparison of Different Methods When Different Training Samples Are Considered
To further compare the proposed method with the related methods, the performance of different methods with different numbers of training data was compared. In the three datasets, 1%, 3%, 5%, 10%, and 15% of data were adopted for training. The experimental results are shown in Figure 8. When 1% of the data were used for training, only the classification accuracy of FDSSC on IP was slightly higher than that of the proposed method. In other cases, the proposed method achieved the best classification performance. With the increase in the training data, the classification performance of the proposed method was better than that of all the compared methods. In summary, the hybrid features extracted by the proposed method have strong discrimination and robustness, and can distinguish well between different categories of pixels.

Comparison of OA for Different Spatial Sizes
When the spatial size of the selected data cubes is small, it will lead to a lack of spatial information. The extracted features are not sufficient to distinguish different categories of pixels, resulting in lower classification accuracy. If the spatial size is too large, the data cubes will contain more neighborhood pixels, which are likely to contain many other categories of pixels. In other words, the introduction of too much interference data will also lead to low classification accuracy. Therefore, it is very important to select the appropriate spatial size. In this section, we test data cubes of different spatial size from 11 × 11 to 21 × 21, including a total of six cases. The results are shown in Figure 9. In the tests of the three datasets, the OA of the model increased first and then decreased with the gradual increase in the spatial size. Among these, the fluctuations of PU and SA were small, and the classification performance of IP decreased significantly when the spatial size of IP increased from 19 to 21. This shows that the IP dataset was most affected by the neighborhood pixels. According to the overall optimal classification accuracy, we determined that the spatial size of the data cubes was 15 × 15. Please note that the above analysis is only limited to the proposed method.
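For reference, extracting the neighborhood cube of a given spatial size around a target pixel can be sketched as follows. Zero padding at the image borders and an odd patch size are assumptions, since the border handling is not described here, and `extract_patch` is a hypothetical helper; the 145 × 145 scene with 30 bands mirrors the IP dataset after PCA.

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Return the size x size x B neighborhood centred on (row, col), zero-padded at the borders."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)))
    return padded[row:row + size, col:col + size, :]

image = np.arange(145 * 145 * 30, dtype=np.float64).reshape(145, 145, 30)
patch = extract_patch(image, row=0, col=72, size=15)
print(patch.shape)  # (15, 15, 30)

for size in (11, 13, 15, 17, 19, 21):  # the six spatial sizes tested
    print(size, extract_patch(image, 72, 72, size).shape)
```

The target pixel always sits at the center of the patch, so for a pixel on the first image row the upper half of the patch consists of padding only; larger patches therefore include more padding near the borders as well as more pixels from other categories.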

Comparison of OA for Different Learning Rates
As an important hyperparameter in deep learning, the learning rate determines whether and when the objective function converges to a local minimum. If the learning rate is too large, the parameter updates may overshoot the optimum. If it is too small, the loss function changes very slowly, which greatly increases the time needed for the network to converge and makes it prone to becoming trapped in a local minimum or at a saddle point. An appropriate learning rate therefore allows the objective function to converge to a local minimum within a reasonable time. In this section, the learning rate was set to 0.0001, 0.0005, 0.001, and 0.005 in turn, and the results are shown in Figure 10. As can be seen in Figure 10, when the learning rate increased from 0.0001 to 0.0005, the classification accuracy for IP, PU, and SA increased in each case. As the learning rate continued to increase, the classification performance declined steadily. In addition, when the learning rate was set to 0.0005, the model was able to converge within 80 epochs. When the learning rate was set to 0.0001, the model needed to be trained for more epochs. When the initial learning rate was set to 0.001 or 0.005, the model did not converge to the optimal value. Therefore, 0.0005 was selected as the optimal learning rate of the model.
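The general effect of the step size can be seen in a toy example. The sketch below runs plain gradient descent on f(x) = x², which is not the paper's network, loss, or optimizer; the step sizes are chosen for this toy problem only and are unrelated to the values 0.0001 to 0.005 tested above.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with a fixed learning rate and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x**2
        x -= lr * grad  # each step multiplies x by (1 - 2 * lr)
    return x


if __name__ == "__main__":
    for lr in (0.01, 0.4, 1.1):
        print(f"lr={lr}: final |x| = {abs(gradient_descent(lr)):.3e}")
```

With the smallest step the iterate is still far from the minimum at 0 after 50 steps, the middle value converges quickly, and the largest value overshoots further on every step and diverges.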

Analysis of the Attention Mechanism's Effectiveness
In the proposed method, the original HSI is weighted through the hybrid attention mechanism after the PCA dimension reduction. This module is learnable and can complete both spectral and spatial attention processes simultaneously. In this section, we compare the classification performance of the model with and without the attention mechanism.
As shown in Figure 11, the classification accuracy of the model on the three datasets increases to a certain extent after the hybrid attention mechanism is adopted. The experimental results indicate that the proposed hybrid attention mechanism is effective and can further improve the classification performance of the existing model. In addition, this attention mechanism only uses two small convolution kernels for learning and completes the attention weighting process through two matrix multiplications and one matrix addition. Therefore, the proposed attention mechanism consumes very few computational resources.
Figure 11. Effectiveness of the attention mechanism.

The Effectiveness of the Multiscale Method
The two network branches of the proposed model first extract the multiscale spectral and spatial features of the HSI, respectively. Ablation experiments were performed to compare the classification performance of the method with no multiscale feature extraction module, with one multiscale feature extraction module, and with both spectral and spatial multiscale feature extraction modules. The classification performance is shown in Figure 12, where A represents the multiscale spectral feature extraction module, and B represents the multiscale spatial feature extraction module. As shown in Figure 12, the classification performance of the model was improved to a certain extent after the multiscale spectral feature extraction module or multiscale spatial feature extraction module was adopted. When both modules were used, the classification performance of the model was significantly improved. Therefore, with the introduction of multiscale feature extraction modules, the network can obtain different receptive fields in both spectral and spatial aspects, capture information at different scales, and extract abundant features. The extracted features distinguish between different categories of pixels well and achieve a great improvement in performance in the classification task.

The Comparison of DenseNet and Weak DenseNet
The weak dense spectral and spatial feature extraction modules used in the proposed method are simplified from the DenseNet model. Of the connections in DenseNet, only the skip connections between the input and output of each layer are retained. In this section, we added another dataset, SalinasA (SAA), trained with 1% of the data, for a more informative comparison. The dataset is described as follows: SalinasA (SAA) was acquired by the AVIRIS sensor over the Salinas Valley in California, USA. The dataset contains six categories and has a size of 83 × 86 pixels. The scene is corrected by removing 20 water absorption bands (108-112, 154-167, and 224) from the 224 spectral bands.
We tested the classification performance of the model with a weak dense structure and with a dense structure on the four datasets; the experimental results are shown in Figure 13. As shown in Figure 13, when the overall network model remained unchanged, the classification performance of the model using the weak dense structure was 0.05% lower than that using the dense structure on IP. For PU and SA, the classification accuracy of the model with the weak dense structure was higher by 1.23% and 0.23%, respectively. To make the experimental results more convincing, we also tested the SAA dataset, on which the weak dense structure showed a 0.43% performance improvement over the dense structure. Therefore, compared with the dense structure, the weak dense structure can reduce the number of feature maps with almost no reduction in classification accuracy.
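The band removal described for SalinasA can be sketched as follows. The 1-based band numbering, the channels-last cube layout, and the names `WATER_BANDS` and `remove_bands` are assumptions made for illustration; the cube below is a zero-filled placeholder, not the real data.

```python
import numpy as np

# Band numbers quoted in the text, assumed here to be 1-based:
# 108-112, 154-167, and 224 (5 + 14 + 1 = 20 bands).
WATER_BANDS = list(range(108, 113)) + list(range(154, 168)) + [224]


def remove_bands(cube, bands_1based):
    """Drop the listed spectral bands from a (height, width, bands) cube."""
    drop = np.asarray(bands_1based) - 1  # convert to 0-based indices
    keep = np.setdiff1d(np.arange(cube.shape[-1]), drop)
    return cube[..., keep]


if __name__ == "__main__":
    raw = np.zeros((83, 86, 224), dtype=np.float32)  # placeholder scene
    corrected = remove_bands(raw, WATER_BANDS)
    print(len(WATER_BANDS), corrected.shape)
```

Removing the 20 listed bands from 224 leaves 204 spectral bands per pixel.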

The Comparison of AveragePooling and Flatten at the End of the Model
In the proposed method, the high-dimensional spectral and spatial features are first extracted by two network branches. Then, the features are compressed into one dimension by AveragePooling3D. Finally, the features are fed into fully connected layers and a Softmax layer to obtain the classification results. A number of related methods instead flatten the high-dimensional features directly to one dimension and obtain classification results through fully connected layers and a Softmax layer [22,39,40]. Here, we conducted an experimental comparison of these two approaches; the results are shown in Figure 14. As can be seen in Figure 14, compared with the AveragePooling3D approach, the classification performance using the flatten method decreased significantly on the three datasets, by 5.01%, 2.59%, and 0.86%, respectively. On the one hand, AveragePooling3D refines the features of each part and retains the most important information. On the other hand, it effectively reduces the number of parameters and largely suppresses overfitting. Therefore, the proposed method using AveragePooling3D is superior to the flatten method.
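The parameter saving mentioned above can be made concrete by counting the weights of the first fully connected layer under each approach. In the sketch below, the feature-map shape (5, 15, 15, 24), the 256 hidden units, and the function name `first_dense_params` are hypothetical values chosen for illustration, not the paper's actual configuration, and global average pooling stands in for the AveragePooling3D step.

```python
def first_dense_params(feature_shape, units, mode):
    """Count weights and biases of the first fully connected layer.

    feature_shape : (depth, height, width, channels) of the branch output
    units         : number of neurons in the fully connected layer
    mode          : "flatten" keeps every element as an input feature,
                    "gap" averages each channel down to a single value
    """
    depth, height, width, channels = feature_shape
    if mode == "flatten":
        in_features = depth * height * width * channels
    elif mode == "gap":
        in_features = channels
    else:
        raise ValueError("mode must be 'flatten' or 'gap'")
    return in_features * units + units  # weight matrix plus bias vector


if __name__ == "__main__":
    shape = (5, 15, 15, 24)  # hypothetical feature-map shape
    for mode in ("flatten", "gap"):
        print(mode, first_dense_params(shape, 256, mode))
```

For these hypothetical values, flattening feeds 27,000 features into the layer (6,912,256 parameters), whereas pooling feeds only 24 (6,400 parameters), which illustrates why the pooled head is less prone to overfitting on small training sets.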

Investigation on Running Time
Training and test times are important indicators of a model's performance. An excellent classification model requires not only high classification accuracy but also a short running time. Therefore, we compared the training and test times of the models. The comparison results are shown in Tables 11-13. Compared with SSRN, FDSSC, DBMA, and DBDA, the proposed method has advantages in both training and test time. Apart from the multiscale spatial feature extraction module, the proposed model mostly adopts convolutions with small kernels, especially the 1 × 1 × n convolution kernel in the spectral feature extraction branch. The proposed model has fewer parameters, a fast convergence speed, and high operational efficiency. It is worth noting that the network structures of CDCNN and HybridSN are simple and have few parameters. The training and test times of these two methods are shorter than those of the proposed method, but their classification accuracy is relatively low.

Conclusions
In this paper, a spectral and spatial feature extraction method combined with an attention mechanism is proposed for HSI classification. Firstly, PCA is used to reduce the dimensions of the original HSI, and the hybrid spectral-spatial attention mechanism is adopted to weight the data. This process not only reduces the amount of data and redundant information but also enhances the effectiveness of the data. Then, two network branches composed of multiscale feature extraction modules and weak dense feature extraction modules are used in parallel to extract high-dimensional semantic features of the image. Finally, AveragePooling3D is adopted to compress the two sets of features, and the classification results are obtained through fully connected layers and a Softmax layer. The hybrid spectral-spatial attention mechanism is learnable and effectively improves the classification performance of the model at a very small computational overhead. The compound multiscale weak dense network model has a fast convergence speed, a high efficiency of feature extraction, strong robustness and discrimination of features, and good generalization ability. The experimental results on the three datasets showed that the proposed method is superior to the compared methods in terms of classification accuracy and timeliness.
The shortcoming of the proposed method is that there are too many training parameters in the fully connected layers of the model, which reduces the efficiency of the training and test stages to a certain extent. In addition, although the attention mechanism improves the classification performance, the effect is not significant enough. Therefore, our future work will continue to focus on network models based on the attention mechanism, exploring a more efficient attention mechanism based on preprocessing or postprocessing in order to improve the classification performance more substantially.