Hyperspectral Image Classification Based on Spectral Multiscale Convolutional Neural Network

Abstract: In recent years, convolutional neural networks (CNNs) have been widely used for hyperspectral image classification and have shown good performance. However, compared with classification using sufficient training samples, the classification accuracy of hyperspectral images degrades easily when only a small number of samples are available. Moreover, although CNNs can effectively classify hyperspectral images, the efficiency of feature extraction still needs to be further improved because of the rich spatial and spectral information of hyperspectral images. In order to solve these problems, a spatial–spectral attention fusion network is proposed that uses a four-branch multiscale block (FBMB) to extract spectral features and 3D-Softpool to extract spatial features. The network consists of three main parts, which are connected in turn to fully extract the features of hyperspectral images. In the first part, four branches with different convolution kernel sizes are used to fully extract spectral features, and a spectral attention block follows each branch. In the second part, the spectral features are reused through dense connection blocks, and the spectral attention module is then utilized to refine the extracted spectral features. The third part mainly extracts spatial features: a DenseNet module and a spatial attention block jointly extract spatial features, which are then fused with the previously extracted spectral features. Experiments on four commonly used hyperspectral data sets show that the proposed method achieves better classification performance than some existing classification methods when using a small number of training samples.


Introduction
A hyperspectral image is a three-dimensional image captured by aerospace vehicles carrying hyperspectral imagers. Hyperspectral images contain rich spectral-spatial information: each sample has hundreds of spectral bands, and each band records reflectance information, which enables hyperspectral images to play a great role in military target detection, agricultural production, water quality detection, mineral exploration and other fields [1][2][3][4]. Researchers have carried out a significant amount of useful research using the unique characteristics of hyperspectral images, for example, using the spectral information of hyperspectral images to detect information about the earth's surface [5,6]. Hyperspectral image classification relies on the fact that different kinds of substances have different spectral curves. Each category corresponds to some specific samples, and each sample also has its own unique spatial-spectral characteristics. However, there are two common problems in hyperspectral image classification: (1) training with small samples will affect the performance of the model and reduce its generalization ability and (2) in the case of small samples, it is difficult to fully extract spatial-spectral features and improve the classification performance.
In the early stage of research on hyperspectral image classification, tools such as the support vector machine (SVM) [7] and multinomial logistic regression [8] were mainly used. The support vector machine is defined as the linear classifier with the largest margin in the feature space; its learning strategy is to maximize the margin. Because hyperspectral images also contain a large number of nonlinear features, SVM cannot extract the nonlinear features in hyperspectral images well. Although spectral information alone can be used for classification, the classification performance is better if spatial information is fully used on top of spectral information. In order to further improve classification performance, superpixel sparse representation and multiple kernel learning have also been proposed [9][10][11]. A multiple kernel model has more flexibility and stronger feature mapping ability than a single kernel function, but multiple kernel learning algorithms are more complex, less efficient and need more memory and time.
Deep learning technology can automatically extract the nonlinear and hierarchical features of hyperspectral images. For example, image classification [12], semantic segmentation [13] and target detection [14] in computer vision, as well as information extraction [15], machine translation [16] and question answering systems [17] in natural language processing, have made great progress with the help of deep learning technology. As a typical classification task, hyperspectral image classification has also benefited greatly from the progress of deep learning, and its classification accuracy has been greatly improved. So far, a great deal of exploration on extracting spectral-spatial features of hyperspectral images has been carried out. Some typical feature extraction methods [18], such as the structural filtering method [19][20][21], morphological profile method [22][23][24], random field method [25,26], sparse representation method [27,28] and segmentation method [29][30][31], have been proposed. Compared with traditional hand-crafted feature extraction, the deep learning method is an end-to-end approach that can learn useful features automatically from a large amount of hyperspectral data through a multi-layer network. At present, methods for extracting hyperspectral image features using deep learning include the stacked autoencoder (SAE) [32], deep belief network (DBN) [33], CNNs [34], recurrent neural networks [35,36] and graph convolutional networks [37].
In [33], Chen introduced the stacked autoencoder (SAE) to extract important features. Tao [38] extracted spectral-spatial features using two sparse SAEs. A deep autoencoder reduces the dimension of the input data; its dimensionality reduction differs from PCA in that it is more complex because it carries out nonlinear operations. A new deep autoencoder (DAE) was proposed in [39], whose authors designed a new collaborative representation method to process small training sets, which can obtain more useful features from the neighborhood of target pixels in hyperspectral images. Zhang [40] used a recursive autoencoder (RAE) and a weighting method to fuse the extracted spatial information. In [34], a deep belief network (DBN) was used for hyperspectral image classification. Although these methods can perform classification, they are one-dimensional methods with relatively poor classification performance. Hu [41] proposed directly extracting the spectral features of hyperspectral images with a five-layer one-dimensional CNN model and classifying with the extracted spectral features. Li [42] proposed a new pixel-level method for classifying hyperspectral images.
For hyperspectral image classification tasks, the 2D-CNNs model can directly extract spatial information. Some features in hyperspectral images are highly similar; in [43], a deep two-dimensional CNN model based on a deep hash neural network (DHNN) is proposed, which can effectively learn features with high similarity in hyperspectral images. Because hyperspectral images are high-dimensional data, it is necessary to reduce the dimension before classification, and then learn the spatial information in hyperspectral image samples through 2D-CNNs. Chen et al. [44] proposed a method of acquiring spatial-spectral information of hyperspectral images and fusing features using 2D-CNNs, which is based on a deep neural network (DNN). A different-region convolutional neural network (DRCNN) for classifying hyperspectral images was proposed in [45]; inputs from different regions are used to learn the context features of those regions, and better classification results are obtained. Zhu et al. [46] proposed introducing deformable convolution into the network to classify hyperspectral images, adaptively adjusting the receptive field size of the activation units of the convolutional neural network to effectively reflect the complex structures of hyperspectral images.
Hyperspectral image classification using 3D-CNNs has better performance, because neither 1D-CNNs nor 2D-CNNs can extract spatial and spectral feature information at the same time. When small training samples are used for hyperspectral image classification, the 3D-CNNs model is a more effective classification method, because it can capture the spatial-spectral information of hyperspectral images simultaneously. Ding et al. [47] proposed a convolutional neural network based on diverse branch blocks (DBB). It enriches the feature space by combining branches with different scales and complexities, including convolution sequences, multiscale convolution and average pooling, thus improving the feature extraction ability of a single convolution. Each branch uses convolution kernels with different scales to extract the spectral information of hyperspectral images, so as to improve the classification performance. Usually, convolutional neural networks use pooling operations to reduce the size of the feature map. This process is crucial for achieving local spatial invariance and increasing the receptive field of subsequent convolutions. Therefore, the pooling operation should minimize information loss in the feature map while limiting computing and memory overhead. In order to meet these needs, Alexandros et al. [48] proposed a fast and efficient pooling method, 3D-Softpool, which accumulates activations in an exponentially weighted manner. Compared with other pooling methods, 3D-Softpool retains more information in the downsampled activation map. In hyperspectral image classification, finer downsampling preserves more spatial feature information and can improve the classification accuracy.
A new dual-branch dual-attention mechanism network (DBDA) is proposed in [49]. One branch is used to extract spatial features and the other to extract spectral features; a spatial attention module is applied to the spatial branch and a channel attention module to the spectral branch. The attention modules can capture important spectral-spatial features, which helps to improve classification performance. Hyperspectral image classification methods based on 3D-CNNs can be divided into two categories: (1) 3D-CNNs are utilized as a whole to extract the spectral-spatial features of hyperspectral images. In [50], a deep feature extraction network based on 3D-CNNs is proposed, which can capture spatial-spectral features at the same time. Some 3D-CNNs frameworks directly obtain the features of hyperspectral cubes without pre-processing or post-processing the input data. (2) Spectral features and spatial features are extracted separately and classified after feature fusion. In order to fully extract the spatial-spectral features of hyperspectral images from shallow to deep layers, a three-layer CNN is constructed in [51]. By fusing multi-layer spatial features and spectral features, more complementary information can be provided. Finally, the fused features and the classifier form a network, and end-to-end performance optimization is carried out. Li et al. [52] proposed a deep CNN with a double-branch structure to extract spatial and spectral features.
The deep pyramid residual network (pResNet) proposed in [53] can make full use of the large amount of information in hyperspectral images for classification, because the network can increase the dimension of the feature maps between layers. In [54], an end-to-end fast dense spectral-spatial convolution (FDSSC) structure for hyperspectral image classification is proposed. Different convolution kernel sizes are used to extract spectral and spatial features respectively, and an effective convolution method is used to reduce the high dimensionality. Improving the running speed of the network also effectively prevents overfitting. In order to avoid the loss of context information caused by using only one or several fixed windows as the input for hyperspectral image classification, an attention multi-branch CNN structure using adaptive region search (RS-AMCNN) is proposed in [55]. Because RS-AMCNN can adaptively search the position of the spatial window in local areas according to the specific distribution of samples, it can effectively extract edge information and evenly extract important features in each area. In [56], a method for classifying hyperspectral images using multiscale superpixels and a guided filter (MSS-GF) is proposed. MSS is used to obtain spatial local information from different scales in different regions, and a sparse representation classifier is used to generate classification maps of different scales in each region. This method can effectively improve the classification ability for hyperspectral images. In [57], a classification method based on Markov random fields (MRF) that integrates spectral and context information is proposed. In order to make full use of deep features, a cascade MRF model is proposed to extract deep information. This method has good classification performance. In [58], the proposed dual-branch multi-attention network (DBMA) extracts spectral-spatial features and uses the attention mechanism on both branches, which yields a good classification effect.
Sun et al. proposed a new patch-based low-rank component-induced spatial-spectral kernel method, called LRCISSK, for HSI classification. Through low-rank matrix recovery (LRMR), the low-rank spectral features in the HSI are reconstructed to explore more accurate spatial information, which is used to identify the homogeneous neighborhood pixels (i.e., centroid pixels) of the target [59]. In [60], a spectral-spatial feature tokenization transformer (SSFTT) method is proposed, which uses a Gaussian-weighted feature tokenizer for feature transformation and captures spectral-spatial features and high-level semantic features for hyperspectral image classification. Hong et al. proposed a method called invariant attribute profiles (IAP) to extract invariant features from the spatial and frequency domains of hyperspectral images and classify them [24]. Aletti et al. proposed a new semi-supervised method for multilabel segmentation of HSI, which combines a suitable linear discriminant analysis with similarity indexes for comparing different spectra [61]. Bampis et al. proposed a graph-driven image segmentation method; by developing a diffusion process defined on arbitrary graphs, the method achieves a lower computational burden, as shown by experiments [62].
The content of hyperspectral images is usually complex; many different substances show similar texture features, which means the performance of many CNN models cannot be brought into full play. Due to the noise and redundancy in hyperspectral image data, standard CNNs cannot capture all features. In addition, when additional layers are added, a deeper CNN architecture will also affect the convergence of the network and produce lower classification accuracy. In order to alleviate these problems, Ding et al. [47] proposed DBB, which combines multiple branches with different scales and complexities to extract richer spectral feature information, including convolution sequences, multiscale convolution and average pooling. When 3D-CNNs are used to extract features, the more layers the network has, the more complex the network structure becomes, resulting in more parameters and greater computing and memory overhead. In order to reduce information loss, 3D-Softpool [48] is used to extract spatial features; 3D-Softpool accumulates activations in an exponentially weighted manner. Compared with a series of other pooling methods, 3D-Softpool retains more information in the downsampled activation map, which can effectively improve the performance and generalization ability of hyperspectral image classification. Inspired by the DBB and 3D-Softpool methods, in order to fully extract spatial-spectral information and alleviate overfitting on small samples, a spectral-spatial attention fusion method based on four-branch multiscale blocks (FBMB) and a 3D-Softpool module is proposed. The contributions of this study are as follows: This paper proposes an FBMB structure different from other multi-branch designs. The module enriches the feature space by combining multiple branches with different scales and complexities, and adds a spectral attention block to each branch to further extract important spectral features.
Finally, the extracted features of these branches are concatenated. The module can fully capture spatial-spectral features and improve the classification performance.
In the process of extracting spatial features, 3D-Softpool is introduced, which accumulates activations in an exponentially weighted manner. Compared with other pooling methods, 3D-Softpool retains more information in the downsampled activation map. A fusion method similar to dense connection is designed to extract the spectral and spatial features again, further improving the classification accuracy of hyperspectral images.
Experiments on four public data sets show that the proposed method achieves better hyperspectral image classification results than other advanced methods.
The rest of this paper is arranged as follows. Section 2 introduces each part of the proposed method in detail. Section 3 gives the experimental results and analysis. Section 4 provides a discussion of the proposed method. In Section 5, some conclusions are provided.

Overall Structure of the Proposed Method
The overall framework of the proposed spectral four-branch multiscale network (SFBMSN) for hyperspectral image classification is shown in Figure 1. The SFBMSN network is composed of three parts. In the first part, the FBMB structure with a channel attention module is designed to extract and select spectral features, so as to fully extract important spectral features. The second part uses the dense connection structure to further extract the spectral features and fully exploit the information. In the third part, a dense connection structure whose first convolution layer includes a 3D-Softpool module is used to extract spatial features, increasing the receptive field of subsequent convolutions and extracting more spatial features, so as to improve the classification performance. In addition, some optimization strategies are used to prevent overfitting.
In the first part, FBMB includes four branches, and the convolution kernel size of each branch differs. A spectral attention mechanism is introduced into each branch to obtain spectral features more conducive to classification. Then, the spectral features extracted by the four branches are fused. In the second part, based on the DenseNet [63] structure and the idea of reusing spectral features, the fused features are input into the DenseNet network, and dense blocks with three convolution layers are used to extract spectral features. The third part is similar to the extraction of spectral features: the original hyperspectral data are input into a dense block containing 3D-Softpool. The 3D-Softpool module is added to the first convolution layer of the dense block to extract spatial neighborhood features, combined with the spatial attention mechanism. Then, the feature maps obtained from the spectral and spatial branches are added element by element, and the fused spatial-spectral features are passed through global average pooling (GAP), a fully connected layer and a linear classifier to obtain the classification results.
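The final fusion stage described above (element-wise addition of the two branch outputs, global average pooling, then a linear classifier) can be sketched as follows; the shapes and the random classifier weights are illustrative assumptions, not the trained network:

```python
import numpy as np

# Sketch of the fusion stage: element-wise add, GAP, linear classifier.
# Shapes (24 channels, 9 x 9 spatial) follow the text; weights are random.
rng = np.random.default_rng(0)
spectral = rng.standard_normal((24, 9, 9))   # feature map from the spectral branch
spatial = rng.standard_normal((24, 9, 9))    # feature map from the spatial branch

fused = spectral + spatial                   # element-wise addition
gap = fused.mean(axis=(1, 2))                # global average pooling -> (24,)

num_classes = 16                             # e.g., Indian Pines has 16 classes
W = rng.standard_normal((num_classes, 24))   # fully connected (linear) layer
b = np.zeros(num_classes)
logits = W @ gap + b
pred = int(np.argmax(logits))                # predicted class index
```

In the actual network the linear classifier is trained jointly with the feature extractor; this sketch only shows how the two branch outputs are combined into a single prediction.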

FBMB and Spectral Self-Attention Module
The FBMB and spectral self-attention module are important modules in the first part of the proposed method. In Figure 1, the first part is the proposed FBMB structure with spectral self-attention mechanism. Firstly, principal component analysis (PCA) is performed on the original hyperspectral image data, and the data after PCA are P ∈ R 9×9×band, where 9 × 9 represents the length and width of the data and band represents the number of channels. Next, a three-dimensional convolution operation is carried out on P. The convolution kernel size of the three-dimensional convolution layer is set to (1 × 1 × 7), the padding is set to (0 × 0 × 0) and the stride is set to (1 × 1 × 2). In this way, the length and width of the data after the convolution operation remain unchanged at 9 × 9, and the number of channels becomes a, i.e., a is the spectral depth of the data after the convolution layer operation.
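The spectral depth a after this convolution follows the standard output-size formula. A small helper makes the arithmetic explicit; the value band = 97 is only an assumed example, not a value fixed by this section:

```python
# Output-size arithmetic for the (1 x 1 x 7) convolution with stride
# (1 x 1 x 2) and no padding, applied along the spectral axis.
def conv_out_len(n, kernel, stride, padding=0):
    """Standard output-length formula for one convolution axis."""
    return (n + 2 * padding - kernel) // stride + 1

band = 97                                   # assumed example PCA depth
a = conv_out_len(band, kernel=7, stride=2)  # spectral depth after the conv
# The spatial size stays 9 x 9 because the spatial kernel and stride are 1.
```

For band = 97 this gives a = 46; the same formula applies to any other number of retained PCA components.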
The output data are input into the four branches of the FBMB module simultaneously, and useful spectral information is retained as much as possible through the spectral self-attention mechanism in each branch. In the first branch of the FBMB module, a convolution layer with 6 convolution kernels is used. A padding strategy is applied to all branches to keep the input and output data sizes consistent. After the convolution layer, a batch normalization (BN) layer is used. The BN layer normalizes and linearly scales each channel to speed up the convergence of the model. Suppose the input mini-batch is B = {x 1...m }, the output data are y i = BN γ,β (x i ) and the trainable parameters are γ and β. The process of BN can be represented as

μ_B = (1/m) ∑_{i=1}^{m} x_i,   (1)

σ_B² = (1/m) ∑_{i=1}^{m} (x_i − μ_B)²,   (2)

x̂_i = (x_i − μ_B) / √(σ_B² + ε),   (3)

y_i = γ x̂_i + β.   (4)

Firstly, the mean and variance of the input data are calculated according to Equations (1) and (2); then the input data are normalized to zero mean and unit variance according to Equation (3); finally, each normalized element is multiplied by γ and shifted by β to output y_i, as in Equation (4), where γ and β are trainable parameters. The purpose of normalization is to adjust the data to a unified interval, reduce the divergence of the data and reduce the learning difficulty of the network. Using BN and the Mish activation function after the convolution layer can effectively avoid gradient explosion and gradient vanishing. Then, the spectral self-attention mechanism is used to capture important spectral information. The schematic diagram of the spectral self-attention mechanism is shown in Figure 2. Using the spectral attention mechanism, we can mine the interdependence between spectral feature maps, extract feature maps with strong dependence, and improve the feature representation of specific semantics. As shown in Figure 2, U represents the spectral features, and the initial input is U ∈ R c×p×p. Here, p × p is the input patch size and c represents the number of input channels. Y represents the spectral attention map, and the size of Y is c × c.
Y is calculated from the initial input spectral feature map U. y ij is used to measure the influence of the ith spectral feature on the jth spectral feature, where U i is the ith spectral feature and U j is the jth spectral feature. The calculation process is

y_ij = exp(U_i · U_j) / ∑_{i=1}^{c} exp(U_i · U_j).   (5)

Then, the result of the matrix multiplication between Y and U is reshaped into R c×p×p. Finally, the reshaped result is weighted by the scale parameter α and added to the input U to obtain the final spectral attention map E ∈ R c×p×p:

E_j = α ∑_{i=1}^{c} (y_ij U_i) + U_j,   (6)

where α is initialized to zero and can be learned gradually. The final map E includes the weighted summations of the features of all channels, which can improve the discriminability of the features.
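The spectral self-attention computation described above can be sketched in NumPy as follows; this is a minimal illustration of the channel-wise attention map and residual connection, not the trained layer:

```python
import numpy as np

def spectral_self_attention(U, alpha):
    """Spectral (channel) self-attention sketch following the equations above.
    U: (c, p, p) feature map; alpha: learned scalar weight. Returns E."""
    c = U.shape[0]
    flat = U.reshape(c, -1)                       # (c, p*p)
    scores = flat @ flat.T                        # (c, c): U_i . U_j for all i, j
    scores -= scores.max(axis=0, keepdims=True)   # stabilize the exponentials
    Y = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over i
    out = (Y.T @ flat).reshape(U.shape)           # sum_i y_ij * U_i for each j
    return alpha * out + U                        # residual connection gives E

U = np.random.default_rng(1).standard_normal((24, 9, 9))
E = spectral_self_attention(U, alpha=0.1)
```

With alpha initialized to zero, the module starts as an identity mapping and gradually learns how much attention-reweighted content to mix back in.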
In the second branch, the convolution kernel size of the first convolution layer is (1 × 1 × 1), the number of convolution kernels is 6, and BN + Mish is used. The convolution kernel size of the second convolution layer is (3 × 3 × 7), the number of convolution kernels is 6 and BN + Mish is used. At the end of the second branch, spectral self-attention is utilized, and the input size is (9 × 9 × a). The third branch has the same structure as the fourth branch but a different convolution kernel size: in the third branch the kernel size is (3 × 3 × 7) with 6 kernels, and in the fourth branch the kernel size is (5 × 5 × 7) with 6 kernels. After the convolution layers, BN + Mish and the spectral self-attention mechanism are adopted to avoid gradient explosion and gradient vanishing. Because data padding is used in all four branches, the output size of each branch is (9 × 9 × a, 6). Finally, the data output by the four branches are concatenated, and the concatenated cube size is (9 × 9 × a, 24). The cube output by the FBMB module contains a large amount of important spectral feature information, which provides rich spectral feature information for subsequent operations.
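The channel bookkeeping of the FBMB output can be sketched as follows; the branch outputs are random placeholders with the shapes stated above, and the value of a is an assumed example:

```python
import numpy as np

# Sketch of combining the four FBMB branch outputs. Each branch keeps the
# 9 x 9 spatial size and spectral depth a, with 6 kernels per branch.
a = 46                               # assumed spectral depth after the input conv
rng = np.random.default_rng(2)
branches = [rng.standard_normal((6, a, 9, 9)) for _ in range(4)]

fused = np.concatenate(branches, axis=0)   # stack along the channel axis
# 4 branches x 6 kernels each = 24 output channels, matching (9 x 9 x a, 24).
```

Concatenation (rather than element-wise addition) is what makes the channel count grow from 6 per branch to 24 in total, which matches the stated output size.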

Dense Connection Network
The dense connection network is an important component of the second and third parts of the proposed method. In order to avoid the gradient vanishing problem caused by increasing network depth, the dense connection module is adopted behind the FBMB module to further extract effective spectral features. On the premise of ensuring full transmission of information between the middle layers of the network, all layers are directly connected: each layer receives the outputs of all previous layers as input and passes its own output feature maps to all subsequent layers. This ensures maximum spectral information flow between network layers and makes full use of spectral features. The structure diagram of DenseNet is shown in Figure 3. The DenseNet used in this paper has three layers, and each layer is composed of convolution, BN and Mish. Each layer applies a nonlinear transformation H l, the composite function of these three operations. Let x l−1 be the output of layer l − 1. In a traditional feed-forward network, the output of layer l − 1 is the input of layer l, so the output x l of layer l can be represented as

x_l = H_l(x_{l−1}).   (7)

DenseNet can effectively improve inter-layer information transmission by connecting each layer with all subsequent layers. Therefore, the outputs [x 0 , x 1 , . . . , x l−1 ] of all layers before layer l are used as the input of layer l, and the output of layer l is represented as

x_l = H_l([x_0 , x_1 , . . . , x_{l−1}]).   (8)

After the dense connection, the key information of the spectral features is extracted by the spectral self-attention mechanism again.
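The dense-connection wiring above can be sketched as follows. The toy H_l layers here are random linear maps with a ReLU-like nonlinearity standing in for the real convolution + BN + Mish stages; only the feature-reuse pattern is the point:

```python
import numpy as np

def dense_block(x0, layers):
    """Dense connection sketch: layer l receives the concatenation of all
    previous outputs, i.e., x_l = H_l([x_0, ..., x_{l-1}])."""
    feats = [x0]
    for H in layers:
        feats.append(H(np.concatenate(feats, axis=0)))  # input = all earlier outputs
    return np.concatenate(feats, axis=0)

# Toy H_l layers: random linear maps with a fixed "growth rate" of 4 features.
rng = np.random.default_rng(3)
def make_layer(in_c, growth=4):
    W = rng.standard_normal((growth, in_c))
    return lambda x, W=W: np.maximum(W @ x, 0.0)   # linear map + simple nonlinearity

x0 = rng.standard_normal((8, 5))                   # 8 input features, 5 positions
layers = [make_layer(8), make_layer(8 + 4), make_layer(8 + 4 + 4)]
out = dense_block(x0, layers)                      # channels: 8 + 4 + 4 + 4 = 20
```

Note how the input channel count of each layer grows by the growth rate of every preceding layer; this is exactly the feature-reuse property the text attributes to DenseNet.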

Spatial Dense Connection Network with 3D-Softpool
As the dense connection network has already been introduced, it will not be repeated in this section. 3D-Softpool is an important module in the third part of the proposed method, and its principle is introduced in detail here. A dense connection is utilized to capture spatial features. This dense connection has three layers, and each layer is composed of convolution, BN and Mish. In the first layer of the dense connection, the Mish layer is replaced with 3D-Softpool. 3D-Softpool is a fast and efficient pooling method that accumulates activations in an exponentially weighted manner. Compared to other pooling methods, 3D-Softpool retains more information during downsampling of the activation map, which helps to improve classification performance. 3D-Softpool is based on the natural exponential (e), so large activation values have a greater impact on the output. 3D-Softpool is differentiable, which means that every activation in the local neighborhood is assigned at least a minimum gradient value during back propagation. The process of 3D-Softpool is shown in Figure 4. 3D-Softpool computes a smooth approximation of the maximum over the activation region R, in which each activation a_i is assigned a weight W_i:

W_i = e^{a_i} / ∑_{j∈R} e^{a_j}.   (9)

The output value of 3D-Softpool is obtained by summing all weighted activations in the activation region:

ã = ∑_{i∈R} W_i a_i.   (10)

Combining Equations (9) and (10), the output is

ã = ∑_{i∈R} e^{a_i} a_i / ∑_{j∈R} e^{a_j}.   (11)

The normalized weights produced by the SoftMax operation form a probability distribution in which each activation is weighted in proportion to its value relative to its neighbors in the region. Every activation thus contributes to the final output, which extracts more useful spatial features and improves the classification accuracy.
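The exponential weighting described above can be sketched for a single pooling region; this is the per-region arithmetic only, not the full 3D sliding-window implementation:

```python
import numpy as np

def softpool(region):
    """SoftPool over one activation region R: each activation a_i gets
    weight e^{a_i} / sum_j e^{a_j}, and the output is the weighted sum."""
    a = np.asarray(region, dtype=float)
    e = np.exp(a - a.max())        # subtract the max for numerical stability
    w = e / e.sum()                # exponential weights, summing to 1
    return float((w * a).sum())    # weighted sum of the activations

out = softpool([1.0, 2.0, 3.0])
# The result lies between the mean and the max of the region: larger
# activations dominate, but every activation keeps a nonzero share.
```

This intermediate behavior between average pooling and max pooling is why SoftPool loses less information than either extreme while remaining differentiable everywhere.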

Spatial Self-Attention Mechanism
The spatial self-attention mechanism is an important module in the third part of the proposed method, and its principle is introduced in detail in this section. The data generated after the 3D-Conv, BN and Mish operations are input to the spatial self-attention module. The spatial attention mechanism establishes the context relationships between local spatial features, so that more extensive context information can be encoded into local spatial features to improve their representation ability. The spatial self-attention mechanism is shown in Figure 5. F ∈ R c×h×h is the input feature map; B, C and D are three new spatial feature maps generated by three convolution operations, where {B, C, D} ∈ R c×h×h. Then B, C and D are reshaped into R c×t, where t = h 2 is the number of pixels. B and C perform matrix multiplication, and the spatial attention map S ∈ R t×t is then obtained through a SoftMax layer:

S_ij = exp(B_i · C_j) / ∑_{i=1}^{t} exp(B_i · C_j),   (12)
where S ij is the influence of the ith pixel on the jth pixel. D and the attention map S then perform matrix multiplication, and the result is reshaped into R c×h×h, weighted by the scale parameter η and added to the input F to obtain the final output spatial attention feature map Z ∈ R c×h×h:

Z_j = η ∑_{i=1}^{t} (S_ij D_i) + F_j,   (13)

where the initial value of η is zero and larger weights are gradually assigned during training. Equation (13) shows that the output at every position is a weighted sum over all positions added to the original input feature map.
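The position-wise attention described above can be sketched in NumPy; the maps B, C and D are supplied directly as random placeholders for the three convolution outputs, so only the attention arithmetic itself is illustrated:

```python
import numpy as np

def spatial_self_attention(F, B, C, D, eta):
    """Spatial (position) self-attention sketch following the equations above.
    F, B, C, D: (c, h, h) maps; B, C, D stand in for the three conv outputs."""
    c, h, _ = F.shape
    t = h * h
    Bf, Cf, Df = (m.reshape(c, t) for m in (B, C, D))
    scores = Bf.T @ Cf                            # (t, t): B_i . C_j affinities
    scores -= scores.max(axis=0, keepdims=True)   # stabilize the exponentials
    S = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over i
    out = (Df @ S).reshape(c, h, h)               # sum_i S_ij * D_i for each j
    return eta * out + F                          # residual connection gives Z

rng = np.random.default_rng(4)
F = rng.standard_normal((8, 5, 5))
B, C, D = (rng.standard_normal((8, 5, 5)) for _ in range(3))
Z = spatial_self_attention(F, B, C, D, eta=0.1)
```

As with the spectral module, eta initialized to zero makes the block start as an identity mapping, so the attention contribution is learned gradually.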

Data Set and Parameter Setting
The performance of the proposed method is verified using four classical hyperspectral image data sets: Indian Pines (IN), Pavia University (UP), Kennedy Space Center (KSC) and Salinas Valley (SV).
The IN data set shown in Table 1 is the earliest test data set used for hyperspectral image classification. The Indian Pines region in Indiana was imaged by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992, and a subregion of size 145 × 145 was cropped and labeled for hyperspectral image classification tests. The spatial resolution of the image generated by the spectral imager is about 20 m. After eliminating useless bands, 200 bands are left for experimental research. There are 21,025 pixels in the data set, but only 10,249 pixels are ground object pixels; the remaining 10,776 pixels are background pixels, which need to be eliminated in the actual classification. Because the imaged area consists of crops, different ground objects have relatively similar spectral curves; there are 16 classes in total, and the distribution of samples among these 16 classes is extremely uneven. The UP data set shown in Table 2 is part of the hyperspectral data imaged over the city of Pavia, Italy, in 2003 by the German airborne Reflective Optics System Imaging Spectrometer (ROSIS). The spatial resolution of the image is 1.3 m and the data size is 610 × 340. The spectral imager recorded 115 bands in the wavelength range of 0.43-0.86 µm; after eliminating 12 bands affected by noise, 103 usable bands remain. The data set contains 9 classes of land cover, including trees, asphalt roads, bricks, meadows, etc. The KSC data set shown in Table 3 was collected by the NASA AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) instrument at the Kennedy Space Center (KSC) in Florida on 23 March 1996. AVIRIS collected 224 bands with a width of 10 nm and center wavelengths from 400 to 2500 nm. The spatial resolution of the KSC data, obtained from an altitude of about 20 km, is 18 m. After removing the water absorption and low-SNR bands, 176 bands were used for analysis.
Training data were selected using land cover maps provided by color infrared photography and Landsat Thematic Mapper (TM) images provided by the Kennedy Space Center. The vegetation classification scheme was developed by KSC personnel to define the classes that can be distinguished at the spatial resolution of Landsat and these AVIRIS data. Because the spectral characteristics of some vegetation types are similar, it is difficult to distinguish land cover in this environment. For classification purposes, 13 categories representing the various land cover types that occur in this environment were defined for the site. The SV data set shown in Table 4, like the Indian Pines image, was obtained by the AVIRIS imaging spectrometer; it is an image of the Salinas Valley in California, USA. Different from Indian Pines, its spatial resolution reaches 3.7 m. The image originally has 224 bands. Similarly, the remaining 204 bands are generally used after excluding the water absorption bands 108-112, 154-167 and band 224. The size of the image is 512 × 217, so it contains 111,104 pixels, of which 56,975 are background pixels and 54,129 can be applied to classification. These pixels are divided into 16 categories, including fallow and celery. In the experiment, in order to verify that small training samples can also achieve high classification accuracy, we randomly selected 3%, 0.5%, 3% and 0.5% of the samples in IN, UP, KSC and SV, respectively, for training, and the remaining samples in each data set were used for testing. The next section will show that the proposed method can achieve high classification accuracy in the case of small samples. Tables 1-4 list the sample distributions of the four data sets. Hyperspectral image data are dimensionally reduced by PCA and then put into the network for feature extraction.
Assuming that the size of the input data is 9 × 9 × n, the data pass through the input module, the FBMB module of the spectral branch, the 3D-Softpool module of the spatial branch, the dense connection block, the attention block and the final classification block. Clearly, the spatial size and n are important factors affecting classification performance, so this section analyzes their impact in detail. Table 5 shows the parameter configuration of the proposed method.
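As a reference for the preprocessing step just described, the PCA reduction of the spectral dimension followed by extraction of a 9 × 9 spatial neighbourhood around each labelled pixel might be sketched as below; `pca_reduce` and `extract_patch` are illustrative helper names, not the authors' code:

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Reduce the spectral dimension of an (H, W, B) hyperspectral cube
    with PCA, keeping the first n_components principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                      # center each band
    cov = flat.T @ flat / (flat.shape[0] - 1)      # band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues ascending
    components = eigvecs[:, ::-1][:, :n_components]  # top components first
    return (flat @ components).reshape(h, w, n_components)

def extract_patch(cube, row, col, size=9):
    """Cut a size x size spatial neighbourhood centered on (row, col);
    the cube is zero-padded so border pixels also get full patches."""
    pad = size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)))
    return padded[row:row + size, col:col + size, :]
```

Each labelled pixel then yields one 9 × 9 × n training block, where n is the number of retained principal components.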

Effect of n on classification performance: n determines the depth of the network. Here, the influence of the parameter n in the spectral branch on classification accuracy is discussed. Generally speaking, classification accuracy improves as network depth increases, but too many layers bring problems such as overfitting, vanishing gradients and exploding gradients. Figure 6 shows the impact of n on the OA of the different data sets: n is varied over 96, 97, 98 and 99 on the IN, UP, KSC and SV data sets, with the spatial size fixed. As can be seen from Figure 6a, the OA value is highest when n is set to 97, and OA gradually decreases as n increases further. For the SV data set, the OA value fluctuates slightly as n increases. To avoid the information redundancy caused by excessively dense connections and to balance classification accuracy against computation, n is set to 97 on all four data sets.
Effect of spatial size on classification performance: For hyperspectral images, input data blocks that are too small lead to insufficient feature extraction, while blocks that are too large easily introduce noise. Therefore, the spatial size of the input 3D data block also affects classification accuracy. With n fixed, the effect of spatial size on classification performance is analyzed. The spatial size of the input sample is set to 5 × 5, 7 × 7, 9 × 9, 11 × 11, 13 × 13 and 15 × 15, respectively. Figure 6b shows the OAs obtained with the different input spatial sizes on the four data sets. As can be seen from Figure 6b, OA first increases with spatial size. For the UP and IN data sets, OA begins to decrease once the spatial size exceeds 9 × 9; for the SV and KSC data sets, OA is highest at 9 × 9. Therefore, input blocks of spatial size 9 × 9 are selected to train the network.

Experimental Results and Analysis
All methods are tested with the same proportion of training samples, and their classification performance is compared. Tables 6-9 list the per-class classification accuracy obtained by all methods on the IN, UP, KSC and SV data sets. All results are averages over 10 runs, and the best results are shown in bold. It can be observed from Tables 6-9 that, compared with the other methods, the proposed method achieves the highest OA, AA and Kappa. The OA value of the proposed method is 28.29% higher than that of SVM, and 6.11%, 3.45%, 6.82%, 24.64%, 3.77% and 9.9% higher than those of DBMA, DBDA, SSRN, CDCNN, FDSSC and pResNet, respectively. SVM does not use spatial neighborhood information and has poor robustness, so its OA value is low, at only 67.77%. CDCNN is a 2D-CNN structure whose robustness is better than that of SVM, so its OA value is 3.65% higher. FDSSC adopts dense connections, and its OA value is more than 3.05% higher than that of SSRN, which uses residual connections. The DBMA method adopts a double-branch, double-attention network structure, but training it with small samples leads to overfitting. The DBDA network also uses a double-branch, double-attention structure, with a more flexible feature extraction structure than DBMA, so the OA value obtained by DBDA is higher. The proposed method designs the FBMB module to extract spectral features: convolution kernels of different sizes are used to fully extract important spectral features on the four branches, a spectral attention mechanism is deployed on each branch to further extract key spectral features, and finally the spectral features extracted from the four branches are fused to obtain more effective spectral features.
Observation of the experimental results shows that classification performance improves significantly after the 3D-Softpool module is used to extract spatial features. Among the methods trained with small samples, the proposed method has the best classification performance: as can be seen from Tables 6-9, its classification accuracy is the highest on all four data sets. The experimental results also show that the CDCNN method, which uses a shallow network to capture features, performs worst; a shallow network has a smaller receptive field than a deep network and extracts only low-level features. Some of the comparison methods also adopt feature fusion strategies, including SSAN and FDSSC, and these usually provide higher classification accuracy than the others (CDCNN, SVM). When training with small samples, a hierarchical fusion mechanism can fuse the complementary and related information output by different convolution layers, making the extracted features more comprehensive. Moreover, to further verify the performance of the proposed SFBMSN network, the classification maps of the different methods on the IN, UP, SV and KSC data sets are shown in Figures 7-10. Compared with the classification maps of the other methods, the map produced by the proposed method has less noise and clearer boundaries, and is closest to the ground-truth map, which further demonstrates the effectiveness of the proposed method.
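For context on the pooling used in the spatial branch: SoftPool replaces the hard selection of max pooling with a softmax-weighted average over each pooling window, so every activation contributes in proportion to its magnitude. A minimal numpy sketch of a 3-D variant over non-overlapping 2 × 2 × 2 windows (an illustration under that assumption, not the authors' exact module) is:

```python
import numpy as np

def softpool3d(x, k=2):
    """SoftPool over non-overlapping k x k x k windows of a 3-D feature
    volume: each activation is weighted by its softmax within the window,
    so large activations dominate but no value is discarded outright."""
    d, h, w = x.shape
    d, h, w = d - d % k, h - h % k, w - w % k       # crop to a multiple of k
    # regroup so that each k x k x k window occupies the last axis
    windows = (x[:d, :h, :w]
               .reshape(d // k, k, h // k, k, w // k, k)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(d // k, h // k, w // k, k ** 3))
    weights = np.exp(windows - windows.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per window
    return (weights * windows).sum(axis=-1)
```

Because the output is a convex combination of the window's activations, each pooled value lies between the window mean and the window maximum, which is why more information survives downsampling than with max pooling.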

Discussion
Experiment 1: To verify the effectiveness of the proposed attention block, FBMB module, 3D-Softpool and dense connection, ablation experiments were carried out on the four hyperspectral data sets. Only the module under test was removed from the network; all other parts remained unchanged. Figure 11 shows the experimental results with and without each specific module. It can be seen that the network reaches its highest accuracy after the spatial and spectral attention modules are added. The reason is that introducing attention blocks lets the network adaptively assign different weights to different spectral features and spatial regions, selectively enhancing the features that are important for classification; increasing the weight of these important features helps improve accuracy. In addition, Figure 11b shows the effectiveness of fusing spatial-spectral features of different scales: the FBMB module improves the classification accuracy on all four data sets, because multiple branches can effectively extract spectral features at different scales. It is obvious from Figure 11c that after the 3D-Softpool module is removed, the OAs on the four data sets decrease significantly. The 3D-Softpool block reduces the loss of information in the feature maps and retains more information in the downsampled activation maps, so better classification results can be achieved on the four data sets. As can be seen from Figure 11d, if dense connections are not used, the OAs on the IN, UP, SV and KSC data sets decrease significantly. With dense connections, all convolution layers are connected, so that the spectral feature map output by each convolution layer becomes the input of all subsequent layers.
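The dense connectivity just described, where each layer receives the concatenation of all earlier feature maps, can be illustrated with a toy numpy sketch (a fixed moving-average filter stands in for a learned convolution; `dense_block` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def dense_block(x, num_layers=3):
    """Toy dense connectivity on a (channels, bands) feature map: each
    'layer' filters the concatenation of the input and every earlier
    layer's output, then contributes one new channel (the growth rate)."""
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)     # all previous feature maps
        # stand-in for a learned conv: average over input channels,
        # then a 3-tap moving average along the spectral axis
        new = np.convolve(inp.mean(axis=0), np.ones(3) / 3, mode='same')
        features.append(new[None, :])              # append one new channel
    return np.concatenate(features, axis=0)
```

Each new channel thus sees every earlier one, mirroring the feature-reuse pattern of DenseNet-style dense connections.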
Such connectivity ensures the maximum spectral information flow between network layers and makes full use of the spectral features, so better classification performance is obtained on the four data sets of IN, UP, SV and KSC. Experiment 2: Figure 12 is the schematic diagram of FBMB. To prove that the proposed FBMB can effectively improve the classification performance of hyperspectral image classification, we carried out two groups of comparative experiments in which only the structure of FBMB was changed, all other structures remained unchanged, and the same experimental conditions were used; the results of both groups were compared with those of the proposed method. In the first group, the kernel size of all convolutions in FBMB was set to 7 × 7 × 7; we call this variant the four-branch same-scale block (FBSSB). In the second group, we used any three of the four branches B1, B2, B3 and B4, giving four combinations: C1 (B1, B2, B3), C2 (B1, B2, B4), C3 (B2, B3, B4) and C4 (B1, B3, B4). The results of these comparative experiments on the IN, UP, SV and KSC data sets are shown in Figure 13. As can be seen from Figure 13, FBMB achieves the highest OA value and the best classification performance on all four data sets.
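As a toy illustration of the multiscale idea behind FBMB, the same spectral vector can be filtered at several kernel sizes and the branch outputs concatenated; the uniform 1-D kernels below merely stand in for the learned 3-D convolutions of the real module, and `multiscale_branches` is an illustrative name:

```python
import numpy as np

def multiscale_branches(spectrum, kernel_sizes=(1, 3, 5, 7)):
    """Toy 1-D analogue of a four-branch multiscale block: the same
    spectral vector is filtered at several kernel sizes ('same' padding,
    uniform weights for illustration) and the branch outputs are stacked
    along a feature axis for later fusion."""
    outs = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k          # stand-in for learned filter weights
        outs.append(np.convolve(spectrum, kernel, mode='same'))
    return np.stack(outs)                # shape: (num_branches, num_bands)
```

Small kernels preserve narrow absorption features while large kernels smooth over broader spectral structure, which is why combining several scales extracts more complete spectral information than any single kernel size.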
The kernel size of all convolutions in the first group of experiments is 7 × 7 × 7, and the classification performance of this structure is not good: such a large convolution kernel leads to a large number of parameters and slow operation during the experiments. Compared with the three-branch combinations, the four branches of FBMB extract features more fully and give better classification performance. Experiment 3: To further verify the performance of the proposed method in comparison with other methods, the classification performance of the different methods was compared under different proportions of training samples. In the experiment, the training ratio on the IN, UP, KSC and SV data sets was varied among 1%, 5%, 10% and 15%. The experimental results are shown in Figure 14. As can be seen from Figure 14, when there are few training samples, the classification performance of CDCNN and SVM is relatively poor, and the proposed method performs best. As the number of samples increases, every method obtains higher classification accuracy, but the accuracy of the proposed method remains higher than that of the other methods, which shows that the method has good generalization ability. Experiment 4: Feature fusion merges the features extracted from the image into a complete representation that serves as the input of the next network layer, providing that layer with more discriminative features. According to the order of fusion and prediction, feature fusion can be divided into early feature fusion and late feature fusion. Early feature fusion is a commonly used classical strategy; for example, in the Inside-Outside Net (ION) [65] and HyperNet [66], concatenation [67] or addition operations are used to fuse certain layers.
The feature fusion strategy in this experiment is an early fusion strategy that directly concatenates the spectral and spatial scale features. The two input features have the same size, and the output feature dimension is the sum of the two input dimensions. Table 10 shows the experimental results with and without the fusion strategy. It can be seen from Table 10 that the OA values obtained on the four data sets with feature fusion are significantly higher than those obtained without it; on each data set, feature fusion increases the OA value by more than 1.8%. The results show that the feature fusion strategy brings a significant improvement to hyperspectral image classification compared with not using it.
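The concatenation-based early fusion used here can be sketched as follows; the channel dimensions in the example are illustrative, and `fuse_features` is a hypothetical helper:

```python
import numpy as np

def fuse_features(spectral_feat, spatial_feat):
    """Early fusion by concatenation: two feature maps with identical
    spatial shape are joined along the channel axis, so the output
    channel dimension is the sum of the two input dimensions."""
    assert spectral_feat.shape[:-1] == spatial_feat.shape[:-1]
    return np.concatenate([spectral_feat, spatial_feat], axis=-1)
```

The fused map is then flattened and passed to the fully connected classification layer, which sees both spectral and spatial evidence for every pixel.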

Conclusions
In this paper, the SFBMSN method is proposed for hyperspectral image classification. Its network structure uses the FBMB module, the 3D-Softpool module, a spatial attention module, a channel attention module and dense connections. With the FBMB module, the spectral features of hyperspectral images can be extracted at multiple scales and different levels, and a spectral attention module is introduced into each branch of FBMB to capture the more important information and suppress useless information. Using a dense connection structure to extract spatial features allows spatial features from different layers to be spliced directly, realizing feature reuse and improving the efficiency of feature extraction. The 3D-Softpool module is used for the first time inside a dense connection structure; 3D-Softpool retains more spatial feature information in the downsampled activation maps. The purpose of the spatial attention module is to extract important spatial information and suppress useless, redundant information. Experiments on four commonly used data sets with small training samples show that the proposed SFBMSN method is very competitive in hyperspectral image classification tasks.
In future research, we will further explore how to fuse the extracted spatial-spectral features more effectively, so that the spatial and spectral features at the edges can also be fully exploited for classification. Designing a more efficient fusion model is therefore an important direction of our future work.