A Novel 2D-3D CNN with Spectral-Spatial Multi-Scale Feature Fusion for Hyperspectral Image Classification

Abstract: Various hyperspectral image (HSI) classification methods based on convolutional neural networks (CNNs) have been proposed and achieve promising classification performance. However, HSI classification still suffers from several challenges, including abundant redundant information, insufficient spectral-spatial representation, irregular class distribution, and so forth. To address these issues, we propose a novel 2D-3D CNN with spectral-spatial multi-scale feature fusion for HSI classification, which consists of two feature extraction streams, a feature fusion module and a classification scheme. First, we employ two diverse backbone modules for feature representation, that is, the spectral and the spatial feature extraction streams. The former utilizes a hierarchical feature extraction module to capture multi-scale spectral features, while the latter extracts multi-stage spatial features by introducing a multi-level fusion structure. With these network units, the category attribute information of HSI can be fully excavated. Then, to output more complete and robust information for classification, a multi-scale spectral-spatial-semantic feature fusion module is presented based on a Deconstruction-Reconstruction structure. Finally, we design a new classification scheme to lift the classification accuracy. Experimental results on three public datasets demonstrate that the proposed method outperforms state-of-the-art methods without a marked increase in parameters or complexity. These results suggest that our proposed SMFFNet achieves excellent evaluation indexes and outstanding classification performance.


Introduction
Hyperspectral images (HSI), that is, imaging spectroscopy, are generally obtained by imaging spectrometers or hyperspectral remote sensing sensors [1]. HSI contain massive spectral-spatial information, which reflects the interacted rule between light and materials, as well as the intensity of light reflected, emitted or transmitted by certain objects [2]. Compared with traditional RGB images, HSI has more plentiful and specified spectral information, which is beneficial for classification and recognition tasks [3]. HSI classification aims to assign an accurate land-cover label to each hyperspectral pixel, and has been widely applied in mineral exploitation [4], defense and security [5], environment management [6] and urban development [7].
Despite great progress, HSI classification still struggles with various challenges, which are described as follows: (1) The limited quantity of labeled samples. In practical applications, hyperspectral images are easily captured by imaging spectrometers, but it is very difficult and time-consuming to label them; (2) The curse of dimensionality. In the field of supervised learning, classification accuracy may decline severely as the dimensionality increases, due to the imbalance between the limited number of labeled samples and the high dimensionality of HSI data.
The main contributions of this article are summarized as follows: (1) We propose a novel 2D-3D CNN with spectral-spatial multi-scale feature fusion for HSI classification, containing two feature extraction streams, a feature fusion module and a classification scheme. It can extract more sufficient and detailed spectral, spatial and high-level spectral-spatial-semantic fusion features for HSI classification; (2) We design a new hierarchical feature extraction structure to adaptively extract multi-scale spectral features, which is effective at emphasizing important spectral features and suppressing useless ones; (3) We construct an innovative multi-level spatial feature fusion module with spatial attention to acquire multi-level spatial features while putting more emphasis on the informative areas of the spatial features; (4) To make full use of both the spectral features and the multi-level spatial features, a multi-scale spectral-spatial-semantic feature fusion module is presented to adaptively aggregate them, producing high-level spectral-spatial-semantic fusion features for classification; (5) We design a layer-specific regularization and smooth normalization classification scheme to replace the simple combination of two fully connected layers, which automatically controls the fusion weights of the spectral-spatial-semantic features and thus achieves more outstanding classification performance.
The remainder of this article is organized as follows.
Section 2 illustrates the related work. Section 3 presents the details of the overall framework and the individual modules. Section 4 first describes the experimental datasets and parameter settings, and then shows the experimental results and analysis. Finally, Section 5 presents the conclusions.

Related Work
In this section, we introduce some basic knowledge, including convolutional neural networks, residual network and L2 regularization.

Convolutional Neural Networks
CNNs have made great progress on computer vision problems, due to their weight-sharing mechanism and the efficiency of local connections. CNNs mainly consist of a stack of alternating convolutional layers and pooling layers followed by a number of fully connected layers. In general, the convolutional layers are the most important parts of CNNs. Specifically, let X ∈ R^(H × W × C) be the input cube, where H × W is the two-dimensional spatial size and C is the number of spectral bands. Suppose there are m filters at a convolutional layer and the ith filter is characterized by the weight w_i and bias b_i. The ith output of the convolutional layer can be represented as follows:

y_i = f(w_i * X + b_i),  i = 1, 2, . . . , m,

where * represents the convolution operation and f(·) denotes an activation function, which improves the nonlinearity of the network. ReLU has been the most widely used activation function, primarily due to its robustness against gradient vanishing and its fast convergence.
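As a concrete illustration of y_i = f(w_i * X + b_i), the single-filter case can be sketched in NumPy. This is a didactic sketch with explicit loops and valid padding (both assumptions for clarity), not an efficient implementation:

```python
import numpy as np

def conv2d_single(X, w, b):
    """Valid 2-D convolution of one filter w over input X (H, W, C),
    plus bias b and ReLU, i.e. y_i = f(w_i * X + b_i)."""
    H, W, C = X.shape
    kh, kw, _ = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # slide the filter over a local receptive field
            out[r, c] = np.sum(X[r:r + kh, c:c + kw, :] * w) + b
    return np.maximum(out, 0.0)  # ReLU activation f(.)
```

Real frameworks vectorize this operation, but the local-connection and weight-sharing properties discussed above are already visible in the loop structure.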

Residual Network
ResNets can be constructed by stacking microblocks sequentially [48]. For each residual block, the input features are element-wisely added to the output features by skip connection, which not only can relieve the training pressure of the network but also contribute to information propagation.
Consider a network with L layers, each of which implements a nonlinear transformation S_l(·), where l is the layer index and S_l(·) consists of several operations, including convolution, pooling, batch normalization, activation and linear transformation. Figure 1 shows the connection pattern in the residual network, where a skip connection is introduced to bypass each transformation S_l(·). The output of the lth layer after the skip connection is denoted by x_l, and x_0 is the input. The residual learning process is calculated as follows:

x_l = S_l(x_{l-1}) + x_{l-1}.

Note that x_{l-1} is the input of S_l(·) and m_l is its immediate output, that is, m_l = S_l(x_{l-1}). Considering the recursive property of (2), m_l can be rewritten as follows:

m_l = S_l(x_{l-1}) = S_l(S_{l-1}(x_{l-2}) + x_{l-2}) = S_l(S_{l-1}(S_{l-2}(x_{l-3}) + x_{l-3}) + x_{l-2}) = . . .
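The recursion x_l = S_l(x_{l-1}) + x_{l-1} can be sketched directly; the transformations S_l(·) are represented by hypothetical callables:

```python
import numpy as np

def residual_forward(x0, transforms):
    """Forward pass through a stack of residual blocks:
    x_l = S_l(x_{l-1}) + x_{l-1}, where `transforms` lists the S_l(.) blocks."""
    x = x0
    for S in transforms:
        x = S(x) + x  # skip connection: element-wise addition
    return x
```

With S_l(x) = 0 the input passes through all layers unchanged, which illustrates why the skip connection eases information propagation and relieves training pressure.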

L2 Regularization
The basic idea of L2 regularization is to add an L2 norm penalty to the loss function as a constraint, which can prevent over-fitting and improve generalization ability. The loss function with L2 regularization is calculated as follows:

J = J_0 + (λ / 2m) ||W||_2^2,

where J_0 refers to the original loss function, (λ / 2m) ||W||_2^2 denotes the L2 norm penalty, λ stands for the regularization hyperparameter, m is the number of training samples and W represents the weights of the model.
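The penalized loss J = J_0 + (λ/2m) ||W||_2^2 is straightforward to compute; a minimal sketch over a list of weight arrays:

```python
import numpy as np

def l2_regularized_loss(J0, weights, lam, m):
    """J = J0 + (lam / (2*m)) * sum of squared weights.
    `weights` is a list of weight arrays; lam is the hyperparameter,
    m the number of training samples."""
    penalty = sum(np.sum(W ** 2) for W in weights)
    return J0 + lam / (2 * m) * penalty
```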

Proposed Method
As shown in Figure 2, we give an overview of the proposed method. The SMFFNet includes a spectral feature extraction stream, a spatial feature extraction stream, a multi-scale spectral-spatial-semantic feature fusion module and a classification scheme. The spectral feature extraction stream captures multi-scale spectral features by utilizing a hierarchical feature extraction module. The spatial feature extraction stream introduces a multi-level fusion structure to extract multi-stage spatial features. The two streams operate in parallel, extracting spectral and spatial features simultaneously. The former takes as input a 3-D image cube of size 7 × 7 × 200 spanning all spectral bands, while the latter takes as input a 3-D image cube of size 27 × 27 × 30. The multi-scale spectral-spatial-semantic feature fusion module maps the low-level spectral/spatial features to high-level spectral-spatial-semantic fusion features, which are employed for classification. The classification scheme is used to lift the classification accuracy.

The Spectral Feature Extraction Stream
The network structure of the spectral feature extraction stream is shown in Figure 2. First, we employ the initial module to capture general spectral features of the training samples. Then, to extract multi-scale spectral features, we design a hierarchical spectral feature extraction module. Finally, we construct a hierarchical spectral feature fusion structure to fuse multi-scale spectral features and effectively obtain global spectral information.

Hierarchical Spectral Feature Extraction Module
To obtain spectral features at different scales, we propose a hierarchical spectral feature extraction module (HSFEM). As shown in Figure 2, the HSFEM consists of several multi-scale residual learning blocks with channel-wise attention modules (MRCA).

Multi-Scale Residual Learning Block (MSRLB):
The network structure of the MSRLB is shown in Figure 3. The MSRLB is composed of multi-scale spectral feature extraction and local residual learning. Specifically, we construct a two-bypass network in which each bypass employs convolutional kernels of a different size. In this way, spectral features at different scales can be detected and, simultaneously, the spectral information of the two bypasses can be shared. The operation can be expressed by:

S_1 = σ(w^1_{3×3} * M + b^1),  P_1 = σ(w^1_{5×5} * M + b^1),
S_2 = σ(w^2_{3×3} * [S_1, P_1] + b^2),  P_2 = σ(w^2_{5×5} * [P_1, S_1] + b^2),
S' = w^3_{1×1} * [S_2, P_2] + b^3,

where the weights and biases are represented by w and b, respectively. The superscripts refer to the index of the layer at which they are located, and the subscripts refer to the size of the convolutional kernel used in that layer. σ(·) represents the ReLU activation function. [S_1, P_1], [P_1, S_1] and [S_2, P_2] stand for the concatenation operation. M denotes the spectral feature maps, which are sent to the multi-scale residual learning block.
The first convolutional layer of each bypass has N input channels and N output channels for the spectral feature maps. Similarly, the second convolutional layer has 2N channels for the spectral feature maps. The spectral feature maps of all bypasses are concatenated and then sent to a 1 × 1 convolutional layer. Here, the 1 × 1 convolutional layer is used as a bottleneck layer, which reduces the number of channels of the spectral feature maps from 2N to N.
Each MSRLB adopts residual learning, which makes the network more effective. The MSRLB can be described as follows:

M_n = Z_2 + M_{n-1},

where the input and output of the MSRLB are represented by M_{n-1} and M_n, respectively. Additionally, Z_2 stands for the output of the channel-wise attention module. The operation Z_2 + M_{n-1} is realized by a skip connection with element-wise addition. It is worth noting that the use of local residual learning can greatly reduce the computational pressure and promote the flow of information.

Channel-Wise Attention Module (CAM):
The network structure of CAM is shown in Figure 4. To enhance the important spectral features and suppress the unnecessary ones by controlling the weight of each channel, we embed the CAM into the MSRLB. The CAM includes a squeeze process and an excitation process, and consists of a global average pooling (GAP) layer, two fully connected (FC) layers, and two activation function layers. The 2-D global average pooling averages over the spatial dimensions of the feature maps, turning maps of size H × W × C into 1 × 1 × C feature maps. The first FC layer compresses the C channels into C/r channels (r is the compression ratio of the spectral channels) and the second FC layer restores the compressed channels to C channels. To guarantee that the input features of the next layer are optimal, the original output features are multiplied by the weight coefficients, which are limited to the [0, 1] range by a sigmoid function.
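The squeeze-and-excitation process described above can be sketched in NumPy. W1 (C × C/r) and W2 (C/r × C) are hypothetical learned FC weights; biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """CAM sketch: squeeze (global average pool over H x W), excite
    (FC C -> C/r, ReLU, FC C/r -> C, sigmoid), then channel-wise rescale."""
    z = X.mean(axis=(0, 1))                   # squeeze: (C,) channel descriptor
    s = sigmoid(np.maximum(z @ W1, 0) @ W2)   # excitation: weights in [0, 1]
    return X * s                              # reweight each channel of X
```

Because the sigmoid bounds the weights to [0, 1], a channel can only be kept or attenuated, which matches the "suppress unnecessary features" behavior.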

Hierarchical Feature Fusion Structure
It is important to make full use of the spectral features and transfer them to the multi-scale spectral-spatial-semantic feature fusion module (MSSFFM) for classification. However, as the network depth increases, these spectral features may gradually disappear. To fully exploit the hierarchical spectral features of each MRCA and improve the classification performance, we propose a hierarchical feature fusion structure (HFFS).
The outputs of each MRCA are sent to the MSSFFM, which can obtain distinct spectral features at different scales. However, these spectral features may not only contain abundant redundant information, but also increase the computational complexity. Thus, we introduce a convolutional layer with a 1 × 1 kernel as a bottleneck layer, which can adaptively extract the critical spectral information from these hierarchical features. The output of the HFFS can be formulated as:

T = w_{1×1} * [T_0, T_1, T_2, . . . , T_n] + b,

where T_0 refers to the output of the initial module of the spectral feature extraction stream, T_i (i ≠ 0) denotes the output of the ith MRCA, and [T_0, T_1, T_2, . . . , T_n] represents the concatenation operation. The HFFS not only reduces the computational complexity and obtains more representative spectral features, but also improves the classification performance.
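Since a 1 × 1 convolution is exactly a per-pixel linear map over channels, the HFFS can be sketched with a matrix multiply. The weight shape here is an assumption for illustration (summed input channels × output channels):

```python
import numpy as np

def hffs(outputs, W, b):
    """HFFS sketch: concatenate hierarchical outputs [T0, T1, ..., Tn] along
    the channel axis, then apply a 1x1 convolution (pixel-wise matmul) as a
    bottleneck. W: (C_total, C_out), b: scalar or (C_out,)."""
    T = np.concatenate(outputs, axis=-1)  # (H, W, C_total)
    return T @ W + b                      # 1x1 conv == per-pixel linear projection
```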

The Spatial Feature Extraction Stream
The network structure of the spatial feature extraction stream is provided in Figure 2. First, we use principal component analysis (PCA) to remove noise and unimportant spectral bands. Second, the initial module is employed to reduce the number of channels and the computational cost. Then, to extract multi-level spatial features, we construct a multi-level spatial feature fusion module with a spatial attention module (SAMLSFF). Finally, a feature alignment module (FAM) is applied, which reduces the spatial dimension of the spatial features to match that of the spectral feature extraction stream.
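The PCA preprocessing step (e.g. reducing the 200-band cube to the 30 components used by the spatial stream) can be sketched via an eigen-decomposition of the band covariance matrix. This is a generic PCA sketch, not the paper's exact preprocessing code:

```python
import numpy as np

def pca_reduce(cube, k):
    """Reduce an HSI cube (H, W, B) to its first k principal components
    along the spectral axis, e.g. B=200 -> k=30."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    X -= X.mean(axis=0)                          # center each band
    cov = X.T @ X / (X.shape[0] - 1)             # band covariance (B, B)
    vals, vecs = np.linalg.eigh(cov)             # symmetric eigen-decomposition
    comps = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return (X @ comps).reshape(H, W, k)
```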

Spatial Attention Module (SAM):
The network structure of SAM is shown in Figure 5. To make full use of the close correlation between hyperspectral pixels and capture more distinguishing spatial features, we embed the SAM into the multi-level spatial feature fusion module. Let X_k ∈ R^(S × S × C_1) denote the input data of the SAM, where S × S and C_1 represent the spatial size and the number of spectral channels, respectively. First, to reduce the computational complexity and the number of channels, a 3-D convolution with 1 × 1 × C_1 kernels is employed to transform the input data into f(X_k) ∈ R^(S × S × O), g(X_k) ∈ R^(S × S × O) and h(X_k) ∈ R^(S × S × O) from top to bottom. The equation of f(X_k) is as follows:

f(X_k) = W_f * X_k + b_f,

where the weight and bias parameters are represented by W_f and b_f, respectively. The equations of g(X_k) and h(X_k) are similar to that of f(X_k). Second, the three feature maps obtained in the previous step are reshaped to SS × O. The relationship R ∈ R^(SS × SS) between different hyperspectral pixels is calculated by multiplying f(X_k) by g(X_k)^T as follows:

R = f(X_k) g(X_k)^T.

Third, a softmax is used to normalize R by row:

R̄_{i,j} = exp(R_{i,j}) / Σ_j exp(R_{i,j}).

Next, the attention features Att are produced by multiplying the normalized R̄ by h(X_k), as shown in Equation (16):

Att = R̄ h(X_k).

Then, two 3-D convolutional layers with 1 × 1 × O × C_1 and 1 × 1 × 1 × n kernels are utilized to convert Att into Att' ∈ R^(S × S × C_1), so that Att' and X_k have the same number of channels.
Finally, to facilitate the convergence of the proposed method, a skip connection is used to add the attention features Att' to the input features X_k.
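The core of the SAM (project, compute the SS × SS pixel-relationship matrix, row-softmax, attend) can be sketched as follows. Wf, Wg and Wh are hypothetical 1 × 1 × C_1 projection weights; the final projection back to C_1 channels and the residual addition are left out for brevity:

```python
import numpy as np

def spatial_attention(X, Wf, Wg, Wh):
    """Self-attention SAM sketch. X: (S, S, C1); Wf, Wg, Wh: (C1, O).
    Returns attention features of shape (S, S, O)."""
    S1, S2, C1 = X.shape
    P = X.reshape(S1 * S2, C1)              # flatten the S*S pixels
    f, g, h = P @ Wf, P @ Wg, P @ Wh        # 1x1 projections, (SS, O) each
    R = f @ g.T                             # (SS, SS) pixel-pair relationships
    R = np.exp(R - R.max(axis=1, keepdims=True))
    R = R / R.sum(axis=1, keepdims=True)    # softmax normalization by row
    Att = R @ h                             # aggregate h weighted by attention
    return Att.reshape(S1, S2, -1)
```

Because each row of R̄ sums to one, every output pixel is a convex combination of the projected features of all pixels, which is how the module exploits correlations between hyperspectral pixels.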

Multi-Level Spatial Feature Fusion Module (MLSFFM):
The network structure of MLSFFM is presented in Figure 2. Spatial features at different levels have different significance. Shallow spatial features have small receptive fields and can only extract local features, but they have high resolution and richer details. Deep spatial features have low resolution, but they contain more abstract and semantic information. In this work, we design an MLSFFM to fuse different levels of spatial features, which consists of a low-level residual block (LR), a middle-level residual block (MR) and a high-level residual block (HR). The MLSFFM not only obtains the shallow, detailed spatial features, but also extracts the deep, abstract semantic spatial features. Each residual block includes two 3-D convolutional layers with 3 × 3 × 1 × 16 kernels. To boost the training speed and improve the ability of nonlinear discrimination, we add a BN layer and a PReLU to the first convolutional layer. Furthermore, to facilitate the convergence of the MLSFFM and avoid over-fitting, we introduce a skip connection for each residual block.
We denote the input and output of each residual block as I_i and O_i (i ∈ {0, 1, 2} is the residual block index). The process of the MLSFFM is described as follows:

x_i = σ(BN(w_1^i * I_i + b_1^i)),
O_i = w_2^i * x_i + b_2^i + I_i,
O = O_0 + O_1 + O_2,

where x_i stands for the intermediate output of the ith residual block. The weight and bias parameters are represented by w and b, whose superscripts and subscripts refer to the index of the residual block and the position of the layer within it, respectively. σ denotes the PReLU activation function. The final output O of the MLSFFM is realized by element-wise addition. Moreover, to reduce the spatial dimension of the spatial features to match that of the spectral feature extraction stream and to alleviate feature redundancy to some extent, we propose an FAM. The FAM includes four 3-D convolutional layers with 5 × 5 × 1 × 16 kernels and a 3-D convolutional layer with 1 × 1 × 16 × 8 kernels. The spatial feature extraction stream not only aggregates low-level, middle-level and high-level spatial features, but also pays more attention to the informative areas in the spatial features.
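Under the assumption that the three residual blocks are cascaded (each block's output feeds the next) and their outputs fused by element-wise addition, the MLSFFM data flow can be sketched as:

```python
import numpy as np

def mlsffm(I0, blocks):
    """MLSFFM sketch: cascade the LR, MR and HR residual blocks and sum
    their outputs, O = O_0 + O_1 + O_2. `blocks` holds three hypothetical
    callables standing in for each block's two-convolution mapping."""
    outs = []
    x = I0
    for S in blocks:
        x = S(x) + x          # residual block with skip connection
        outs.append(x)        # keep this level's output for fusion
    return sum(outs)          # element-wise fusion of low/mid/high-level features
```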

Multi-Scale Spectral-Spatial-Semantic Feature Fusion Module
After the spectral feature extraction stream and the spatial feature extraction stream, we obtain the multi-scale spectral features and the multi-level spatial features. To extract more representative and discriminating features for HSI classification, we construct a multi-scale spectral-spatial-semantic feature fusion module (MSSFFM). The network structure of MSSFFM is shown in Figure 6. Here, we design a Deconstruction-Reconstruction structure, which not only maps low-level spectral-spatial features to high-level spectral-spatial-semantic features, but also learns multi-scale fusion features at a granular level [53].
First, we adopt a simple concatenation operation to obtain spectral-spatial features as the input data cube of the Deconstruction-Reconstruction structure. Second, after the 1 × 1 convolutional layer, we equally divide the input data into four feature subsets, represented by s_1, s_2, s_3 and s_4. The number of channels per feature subset is a quarter of that of the original input data cube. Except for s_1, each subset has a corresponding 3 × 3 convolution, denoted by Q_i(·), whose output is represented by y_i. The feature subset s_i is added to the output of Q_{i-1}(·) and then fed into Q_i(·). To reduce the number of parameters, we omit the 3 × 3 convolution for s_1. Therefore, y_i can be written as:

y_i = s_i,                i = 1;
y_i = Q_i(s_i + y_{i-1}), 1 < i ≤ 4.

Third, s_2 is used to generate the first high-level feature subset ŝ_2 through a 2-D convolution that contains 18 convolutional filters of size 3 × 3. Then, we use the sum of the first high-level feature subset ŝ_2 and the third subset s_3 as input to generate the second high-level feature subset ŝ_3. Similarly, we employ the sum of the second high-level feature subset ŝ_3 and the fourth subset s_4 as input to generate the third high-level feature subset ŝ_4. Then, to better fuse information at different scales, s_1, ŝ_2, ŝ_3 and ŝ_4 are concatenated and passed through a 1 × 1 convolution. The Deconstruction-Reconstruction structure allows the convolutions to process features more effectively. Finally, we embed the CAM into the Deconstruction-Reconstruction structure and introduce a skip connection, which can enhance the feature extraction ability and promote the flow of spectral-spatial-semantic information.
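The granular split-and-sum flow (y_1 = s_1; y_i = Q_i(s_i + y_{i-1}) for i > 1) can be sketched as follows; the 3 × 3 convolutions Q_2..Q_4 are abstracted into hypothetical callables, and the trailing 1 × 1 fusion convolution is omitted:

```python
import numpy as np

def deconstruct_reconstruct(x, convs):
    """Granular multi-scale flow: split the channels into 4 subsets s1..s4,
    then y1 = s1 and y_i = Q_i(s_i + y_{i-1}) for i >= 2; outputs are
    concatenated back along the channel axis."""
    s = np.split(x, 4, axis=-1)            # deconstruction into 4 subsets
    ys = [s[0]]                            # s1 skips its 3x3 convolution
    for si, Q in zip(s[1:], convs):
        ys.append(Q(si + ys[-1]))          # each subset sees all earlier scales
    return np.concatenate(ys, axis=-1)     # reconstruction of the fused features
```

Note how y_4 transitively depends on s_1..s_4, so the effective receptive field grows with the subset index, which is the "granular level" multi-scale effect.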
Remote Sens. 2021, 13, x FOR PEER REVIEW

Figure 6. The structure of Deconstruction-Reconstruction.

Feature Classification Scheme
Current deep learning classification methods employ simple fully connected layers with the ReLU activation function [54-57]. In this work, a smooth normalization and layer-specific regularization classification scheme (CS) is proposed. We define the CS as follows:

y = σ(w_2 * σ(w_1 * y_{s3})),  with the layer-specific penalty λ(||w_1||_F^2 + ||w_2||_F^2) appended to the loss,

where w_1 and w_2 refer to the convolutional kernels of the two fully connected layers, respectively, ||·||_F^2 is the squared Frobenius norm, λ denotes the regularization parameter, which controls all the fusion weights, and σ stands for the sigmoid activation function. The input and output of the CS are represented by y_{s3} and y, respectively. Compared with the ReLU activation function, the sigmoid activation function can not only avoid the blow-up phenomenon, but also retain more representative features and improve HSI classification performance. To adaptively adjust the fusion weights, we append an L2 regularization term to the CS. Owing to the layer-specific regularization, the novel CS can effectively avoid over-fitting.

Finally, we feed the output y into the last fully connected layer with K classes, followed by a softmax function, to generate the predicted probability vector. The cross entropy objective function is defined as follows:

L = −(1/T) Σ_{i=1}^{T} Σ_{j=1}^{K} t_{ij} log(softmax(w y_i + b)_j),

where T represents the total number of training samples and t_{ij} denotes the jth value of the one-hot encoded ground truth for the ith training sample. w and b stand for the weight and bias of this layer, and y_i refers to the output of the ith training sample. Our proposed CS can adaptively control the fusion weights of the spectral-spatial-semantic features and achieve better classification performance.
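The softmax cross-entropy objective above can be sketched directly from pre-softmax logits (the result of w y_i + b):

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, onehot):
    """Mean cross-entropy over T samples:
    L = -(1/T) sum_i sum_j t_ij * log(softmax(logits_i)_j)."""
    p = softmax(logits)
    return -np.mean(np.sum(onehot * np.log(p + 1e-12), axis=-1))
```

The small constant inside the log guards against log(0) for confidently wrong predictions; it is a standard numerical safeguard, not part of the paper's formulation.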

Experiments and Results
In this section, we first introduce three public HSI datasets and popular evaluation indexes to evaluate the performance of our proposed SMFFNet method. Second, we discuss the main factors affecting the classification performance. Then, we compare the proposed method with several state-of-the-art HSI classification methods. Finally, to demonstrate the superiority of the proposed SMFFNet method, we perform four ablation experiments on three datasets.

Experimental Datasets
We employ three commonly available HSI datasets to evaluate the classification performance of the proposed SMFFNet method.
The Indian Pines (IN) dataset [43] was captured from a pine forest test area in Northwest Indiana by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992. It includes 16 categories with an image size of 145 × 145 pixels and a spatial resolution of 20 m per pixel. There are 224 spectral bands ranging from 0.4 to 2.5 µm. Because bands 104 to 108, 150 to 163 and 220 are affected by water absorption, these 20 bands are generally removed, and the remaining 200 spectral bands are used for HSI experiments.
The Kennedy Space Center (KSC) dataset [43] was collected by AVIRIS in 1996 over the Kennedy Space Center, containing 224 spectral bands ranging from 0.4 to 2.5 µm. It consists of 13 classes with an image size of 512 × 614 pixels and a spatial resolution of 18 m per pixel. After removing the water absorption and low signal-to-noise ratio (SNR) bands, the remaining 176 spectral bands are adopted for HSI experiments.
The Salinas-A scene (SA) dataset [58] is a small subscene of the Salinas scene, gathered by AVIRIS in the Salinas Valley of California, with an image size of 83 × 86 pixels and a spatial resolution of 3.7 m per pixel, and contains six types of geographic objects. Since 20 bands with high moisture absorption are removed, the remaining 204 spectral bands ranging from 0.4 to 2.5 µm can be used for HSI experiments. Tables 1-3 show the total number of samples of each category in each HSI dataset, and Figures 7-9 show the false-color images and ground truth of the three datasets.

Table 1. Land cover class information for the IN dataset.

No.  Class Name                     Number of Samples
1    Alfalfa                        46
2    Corn-notill                    1428
3    Corn-mintill                   830
4    Corn                           237
5    Grass-pasture                  483
6    Grass-trees                    730
7    Grass-pasture-mowed            28
8    Hay-windrowed                  478
9    Oats                           20
10   Soybean-notill                 972
11   Soybean-mintill                2455
12   Soybean-clean                  593
13   Wheat                          205
14   Woods                          1265
15   Buildings-Grass-Trees-Drives   386
16   Stone-Steel-Towers             93
     Total                          10249

Table 2. Land cover class information for the KSC dataset.

No.  Class Name        Number of Samples
1    Scrub             761
2    Willow swamp      243
3    CP hammock        256
4    Slash pine        252
5    Oak/Broadleaf     161
6    Hardwood          229
7    Swamp             105
8    Graminoid marsh   431
9    Spartina marsh    520
10   Cattail marsh     404
11   Salt marsh        419
12   Mud flats         503
13   Water             927
     Total             5211

Table 3. Land cover class information for the SA dataset.

Experiments and Results
In this section, we first introduce three public HSI datasets and popular evaluation indexes to evaluate the performance of our proposed SMFFNet method. Second, we discuss the main factors affecting the classification performance. Then, we compare the proposed method with several state-of-the-art HSI classification methods. Finally, to demonstrate the superiority of the proposed SMFFNet method, we perform four ablation experiments on three datasets.

Experimental Datasets
We employ three commonly available HSI datasets to evaluate the classification performance of the proposed SMFFNet method.
The India Pines (IN) dataset [43] was captured over the pine forest pilot area of Northwest Indiana by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992. It includes 16 categories, with an image size of 145 × 145 pixels and a spatial resolution of 20 m per pixel. There are 224 spectral bands ranging from 0.4 to 2.5 µm. Because bands 104 to 108, 150 to 163 and 220 fall in the water absorption region, these 20 bands are generally removed; the remaining 200 spectral bands are used for HSI experiments.
The Kennedy Space Center (KSC) dataset [43] was collected by AVIRIS in 1996 over the Kennedy Space Center and contains 224 spectral bands ranging from 0.4 to 2.5 µm. It consists of 13 classes, with an image size of 512 × 614 pixels and a spatial resolution of 18 m per pixel. After removing the water absorption and low signal-to-noise ratio (SNR) bands, the remaining 176 spectral bands are adopted for HSI experiments.
The Salinas-A scene (SA) dataset [58] is a small subscene of the Salinas scene, gathered by AVIRIS over the Salinas Valley of California, with an image size of 83 × 86 pixels and a spatial resolution of 3.7 m per pixel, and contains six types of geographic objects. After the 20 bands with high moisture absorption are removed, the remaining 204 spectral bands ranging from 0.4 to 2.5 µm are used for HSI experiments.


Classification Evaluation Indexes
In this work, we adopt the overall accuracy (OA), average accuracy (AA) and Kappa coefficient (Kappa) as the HSI classification evaluation indexes. The confusion matrix (CM) reflects the classification results and is the basis for understanding the other classification evaluation indexes of HSI. Assuming that there are n kinds of ground objects, the confusion matrix C of size n × n is

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{bmatrix}$$

where element c_ij denotes the number of samples of category i that have been classified as class j. OA is the ratio between the number of correctly classified samples and the total number of samples. Although OA reflects the performance of the whole classifier, it is greatly affected by unbalanced samples:

$$\mathrm{OA} = \frac{\sum_{i=1}^{n} c_{ii}}{\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}}$$

AA is the average of the classification accuracies of the individual categories, so each category is weighted equally:

$$\mathrm{AA} = \frac{1}{n}\sum_{i=1}^{n}\frac{c_{ii}}{\sum_{j=1}^{n} c_{ij}}$$

Kappa measures the consistency between the classification results and the ground truth, and is an indispensable index for evaluating HSI classification. With N the total number of samples,

$$\mathrm{Kappa} = \frac{N\sum_{i=1}^{n} c_{ii} - \sum_{i=1}^{n}\left(\sum_{j=1}^{n} c_{ij}\right)\left(\sum_{j=1}^{n} c_{ji}\right)}{N^{2} - \sum_{i=1}^{n}\left(\sum_{j=1}^{n} c_{ij}\right)\left(\sum_{j=1}^{n} c_{ji}\right)}$$
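As a concrete illustration, all three indexes can be computed from a confusion matrix in a few lines of NumPy (a minimal sketch; the function name is ours):

```python
import numpy as np

def evaluation_indexes(cm):
    """Compute OA, AA and Kappa from an n x n confusion matrix,
    where cm[i, j] counts samples of class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / total                       # overall accuracy
    aa = np.mean(diag / cm.sum(axis=1))           # mean per-class accuracy
    # chance agreement from the row/column marginals
    pe = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / total ** 2
    kappa = (oa - pe) / (1 - pe)                  # equivalent to the N-form above
    return oa, aa, kappa
```

For example, the matrix [[40, 10], [5, 45]] gives OA = 0.85, AA = 0.85 and Kappa = 0.7.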

Experimental Setup
In this section, we present the detailed network parameter settings of the proposed SMFFNet for the three HSI datasets, as shown in Tables 4 and 5.
From Table 5, we can find that, in general, the three evaluation indexes of our proposed method gradually increase with the ratio of training samples. Specifically, when the proportion of training samples is 5%, owing to the small total number of samples and the random selection of training samples, some category samples are not selected, which restrains the classification performance, especially on the IN dataset. When the proportion of training samples is 30%, the evaluation indexes of the IN and KSC datasets decrease slightly, while those of the SA dataset drop significantly. When the proportion of training samples is 40%, the proposed method already classifies the IN and KSC categories with three evaluation indexes close to 100%, and those of the SA dataset reach 100%. Note that the higher classification accuracy of the proposed method requires a large quantity of training samples. Therefore, we choose the ratio set of 4:1:5 (training:validation:test) as the final ratio for the three HSI datasets.
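The per-class 4:1:5 split can be sketched as follows (a hypothetical helper, assuming labels is a 1-D array of class indices for the labeled pixels):

```python
import numpy as np

def split_indices(labels, train=0.4, val=0.1, seed=0):
    """Randomly split the labeled pixels of each class into
    train/validation/test subsets (the 4:1:5 ratio used here)."""
    rng = np.random.default_rng(seed)
    tr, va, te = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr = int(round(train * len(idx)))
        n_va = int(round(val * len(idx)))
        tr.extend(idx[:n_tr])                 # 40% training
        va.extend(idx[n_tr:n_tr + n_va])      # 10% validation
        te.extend(idx[n_tr + n_va:])          # remaining 50% test
    return np.array(tr), np.array(va), np.array(te)
```

Sampling per class (rather than globally) avoids the situation noted above where small categories receive no training samples at all.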

Analysis of the Patch Size
The patch size greatly affects the classification performance. If the patch size is too small, information is lost due to the insufficient receptive field; if it is too large, much noise is introduced and interclass interference increases. Therefore, a suitable patch size is vital for the classification performance. From Figure 10a,b,d,e, we can see that when the spectral patch size is 7 × 7 and the spatial patch size is 27 × 27, the IN and KSC datasets attain the best evaluation indexes. As shown in Figure 10c,f, all evaluation indexes are higher than 99.9%, except for the spatial patch size of 21 × 21. To obtain the optimal classification performance and make the proposed SMFFNet universal, we choose the spectral patch size of 7 × 7 and the spatial patch size of 27 × 27 as the most suitable sizes for the SA dataset as well.
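Cutting a patch around each labeled pixel can be sketched as below (a hypothetical helper; zero-padding at the image borders is one common choice, though the paper does not state its border handling):

```python
import numpy as np

def extract_patch(cube, row, col, size):
    """Cut a size x size neighborhood around pixel (row, col) from an
    H x W x B hyperspectral cube, zero-padding at the borders."""
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)))
    # after padding, (row, col) maps to (row + half, col + half)
    return padded[row:row + size, col:col + size, :]
```

Here the spectral stream would use size = 7 and the spatial stream size = 27.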


Analysis of the Principal Components of Spatial Feature Extraction Stream
To analyze the influence of the number of principal components on the classification performance, we set the number of principal components to {20, 25, 30, 35, 40}. From Figure 11a, when the number of principal components is 30, the evaluation indexes are the highest and most features of the HSI data are retained for the IN dataset. From Figure 11b, the evaluation indexes with 30 principal components are clearly superior to the other settings and the classification effect is the most outstanding for the KSC dataset. Therefore, we set the number of principal components to 30 for the IN and KSC datasets. From Figure 11c, the SA dataset has good evaluation indexes except with 20 principal components. To retain more feature information and achieve the best classification performance, we also set the number of principal components to 30 for the SA dataset.
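Reducing the spectral dimension to 30 principal components can be sketched with a plain SVD-based PCA (the paper does not specify its exact implementation; this is one standard way, with a function name of our choosing):

```python
import numpy as np

def pca_reduce(cube, n_components=30):
    """Project the spectral bands of an H x W x B cube onto the
    leading n_components principal components (per-pixel vectors)."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)
    flat -= flat.mean(axis=0)                    # center each band
    # right singular vectors = eigenvectors of the band covariance
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    scores = flat @ vt[:n_components].T          # project onto top components
    return scores.reshape(h, w, n_components)
```

The output cube (H × W × 30) is what the spatial feature extraction stream would consume.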


Analysis of Different Ratios of Channel-Wise Attention Module
To explore the sensitivity of the proposed SMFFNet to different compressed ratios of the CAM, we test versions with r ∈ {1, 4, 8, 16, 32, 64}. From Figure 12a, when the compressed ratio is 1, the IN dataset has the highest evaluation indexes. Meanwhile, with the increase of the compressed ratio, the evaluation indexes decrease significantly, especially the AA. From Figure 12b, when the compressed ratio is 4, the three evaluation indexes are best for the KSC dataset; then, with the increase of the compressed ratio, the evaluation indexes decrease slightly. From Figure 12c, except for the compressed ratios of 4 and 16, the three evaluation indexes attain 100% under all other settings. To reduce parameters and relieve the calculation pressure, we set the compressed ratio to 1 for the SA dataset.

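The compressed ratio r enters a squeeze-and-excitation style bottleneck of width C/r. A minimal NumPy sketch of such channel attention (the weight matrices w1, w2 stand in for the learned fully connected layers; the actual CAM design may differ in detail):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """SE-style channel attention on an H x W x C feature map:
    global average pool, a C -> C//r -> C bottleneck, then
    channel-wise rescaling. w1: (C, C//r), w2: (C//r, C)."""
    z = feat.mean(axis=(0, 1))                   # squeeze: (C,)
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)    # excitation through bottleneck
    return feat * s                              # reweight each channel
```

A smaller r keeps a wider bottleneck (more parameters, finer channel modeling), which matches the observation above that r = 1 works best on IN and SA.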


Classification Results Comparison with the State-of-the-Art Methods
To verify the effectiveness of our proposed SMFFNet method, we compare SMFFNet with several classic methods, including SVM [13], Multinomial Logistic Regression (MLR) [57], Random Forest (RF) [59], 1-D CNN [37], 2-D CNN [60], 3-D CNN [61], Hybrid [62], JSSAN [63], RSSAN [64] and TSCNN [65]. Here, SVM, MLR and RF are implemented with scikit-learn, and the other methods are implemented in the TensorFlow framework. These comparison methods can be grouped in two ways. On the one hand, SVM, MLR and RF belong to traditional machine learning, whereas 1-D CNN, 2-D CNN, 3-D CNN, Hybrid, JSSAN, RSSAN, TSCNN and our proposed SMFFNet belong to deep learning. On the other hand, SVM, MLR, RF and 1-D CNN are based on spectral information; 2-D CNN is based on spatial information; and 3-D CNN, Hybrid, JSSAN, RSSAN, TSCNN and our proposed SMFFNet are based on both spectral and spatial information. For the sake of fair comparison, we choose 40% of the samples as the training set, 10% as the validation set and the remaining samples as the test set. The OA, AA, Kappa coefficients and the classification accuracy of each category for the three HSI datasets are shown in Tables 6-8. This is because the TSCNN method only uses several ordinary consecutive convolution operations with embedded SE modules to extract shallow spectral and spatial features and ignores high-level semantics. However, our proposed SMFFNet not only extracts multi-scale spectral features and multi-level spatial features, but also maps the low-level spectral/spatial features to high-level spectral-spatial-semantic fusion features for improving HSI classification. (3) Both the Hybrid method and the proposed SMFFNet method take 2D-3D CNN as the basic framework for HSI classification.
From the tables, the evaluation indexes of the proposed SMFFNet method are higher than those of the Hybrid method. Specifically, the OA, AA and Kappa of the SMFFNet method are 5.85%, 5.86% and 5.67% higher than those of the Hybrid method on the IN dataset, respectively. Moreover, only one class of our proposed method has lower classification accuracy than that of the Hybrid method. The other two HSI datasets show similar classification results. Although the Hybrid method uses 2D-3D convolution to extract spectral and spatial features, it does not extract coarse spectral-spatial fusion features and ignores the close correlation between spectral and spatial information. (4) The JSSAN, RSSAN, TSCNN and our proposed SMFFNet methods embed attention mechanisms to enhance the feature extraction ability. From the tables, we can see that the OA, AA, Kappa and the per-category classification accuracy of our SMFFNet method are the highest. This is because we use the channel-wise and spatial attention mechanisms to improve the feature extraction capacity, enhance useful feature information and suppress unnecessary information. These results show that the proposed method combined with the attention mechanisms achieves a better classification performance and an excellent classification accuracy. (5) Figures 13-15 show the visualization maps of all categories for all classification methods, along with the corresponding ground-truth maps. From the figures, we can find that the classification maps of SVM, MLR, RF, 1-D CNN, 2-D CNN, 3-D CNN, Hybrid, JSSAN, RSSAN and TSCNN contain some dot noise in some categories. Compared with these methods, the proposed SMFFNet method produces smoother classification maps. In addition, the edge of each category is clearer and the prediction on unlabeled samples is also significantly better, which indicates that the attention mechanisms can effectively suppress the distraction of interfering samples.
Compared with the proposed SMFFNet method, the other methods misclassify many categories and their classification maps are very rough. Our proposed method has fairly smooth classification maps and higher classification prediction accuracy. Owing to the idiosyncratic structure of the SMFFNet method, it can fully extract the spectral-spatial-semantic features of the HSI and obtain more detailed and discriminable fusion features.

Analysis of Classification Scheme and L2 Regularization Parameter
To prove the validity of the proposed classification scheme, we keep the other parameters unchanged and compare four classification schemes: fully connected layers with the ReLU activation function (R); fully connected layers with the sigmoid activation function (S); the ReLU activation function with L2 regularization (R-L2); and our proposed sigmoid activation function with L2 regularization (S-L2).
From Table 9, we can see that, compared with the other classification schemes, our proposed classification scheme has the highest evaluation indexes and an excellent classification performance. Specifically, on the IN dataset, compared with the R scheme, the OA, AA and Kappa of the S-L2 improve by 0.37%, 0.97% and 0.42%, respectively; compared with the S scheme, they improve by 2.89%, 20.65% and 2.85%, respectively; and compared with the R-L2 scheme, they improve by 1.64%, 3.39% and 1.87%, respectively. The KSC dataset behaves similarly to the IN dataset, but the evaluation indexes change more; on the SA dataset, the evaluation indexes of the four classification schemes all reach 100%. These results indicate that our proposed scheme is more robust and effective. To explore the sensitivity of the proposed classification scheme to the parameter λ of L2 regularization, we set λ ∈ {0, 0.0005, 0.002, 0.01, 0.02, 0.03, 0.2, 1}. From Figure 16a,b, with the increase of λ, the curves on the IN and KSC datasets fluctuate obviously; when λ is 0.02, the three evaluation indexes are excellent. As shown in Figure 16c, on the SA dataset, the curves decrease slightly at λ = 0.0005, then rise to the highest accuracy of 100% and remain unchanged, and finally decrease sharply at λ = 0.2.
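The S-L2 training objective amounts to the task loss plus an L2 (weight decay) penalty over the fully connected layer weights; a minimal sketch (function name ours, with λ = 0.02 as the best setting found above):

```python
import numpy as np

def l2_regularized_loss(data_loss, weight_matrices, lam=0.02):
    """Total training loss for the S-L2 scheme: the task loss plus
    lam * sum of squared L2 norms of the dense-layer weights."""
    penalty = sum(np.sum(w ** 2) for w in weight_matrices)
    return data_loss + lam * penalty
```

In a framework such as TensorFlow, the same effect is usually obtained by attaching an L2 kernel regularizer to each dense layer rather than summing the penalty by hand.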

Analysis of Attention Module
To validate the effectiveness of the attention mechanisms embedded in the proposed SMFFNet method, we compare the SMFFNet (CAM+SAM-Net) with the SMFFNet without the spectral and spatial attention mechanisms (NO-Net), the SMFFNet with only a spectral attention mechanism (CAM-Net) and the SMFFNet with only a spatial attention mechanism (SAM-Net).
From Table 10, it is obvious that the evaluation indexes of the NO-Net are the lowest on the three HSI datasets. Specifically, on the IN and KSC datasets, the evaluation indexes of the SAM-Net are significantly higher than those of the CAM-Net; in particular, the AA improves by 4.46% and 4.62%, respectively. This is probably because the CAM-Net only employs spectral information and ignores the rich two-dimensional spatial information. However, on the SA dataset, the evaluation indexes of the CAM-Net are significantly higher than those of the SAM-Net; the Kappa in particular improves by 1.48%. This is probably because the SAM-Net needs more parameters and increases the training complexity. These results suggest that our proposed CAM+SAM-Net has excellent evaluation indexes and outstanding classification performance.

To validate the effectiveness of the spectral feature extraction stream, the spatial feature extraction stream and the multi-scale spectral-spatial-semantic feature fusion module of the proposed SMFFNet method, we compare the SMFFNet (B+A+S) with six other methods: the SMFFNet with only the spectral feature extraction stream (B); the SMFFNet with only the spatial feature extraction stream (A); the SMFFNet with only the spectral and spatial feature extraction streams (B+A); the SMFFNet with only the multi-scale spectral-spatial-semantic feature fusion module (S); the SMFFNet with only the spectral feature extraction stream and the multi-scale spectral-spatial-semantic feature fusion module (B+S); and the SMFFNet with only the spatial feature extraction stream and the multi-scale spectral-spatial-semantic feature fusion module (A+S).
From Figure 17, we can clearly see that the evaluation indexes of the B+A+S are the highest on the three HSI datasets. This is because the B+A+S not only fully extracts spectral and spatial features, but also maps low-level spectral/spatial features to high-level spectral-spatial-semantic fusion features so as to improve the classification performance. Specifically, on the IN dataset, the B has the lowest evaluation indexes, probably because the B only extracts spectral features and ignores abundant spatial and high-level semantic features. On the KSC dataset, the A has the lowest evaluation indexes, probably because the A only pays attention to the spatial information and ignores the rich spectral and high-level semantic features. On the SA dataset, the B+S has the lowest evaluation indexes; probably, although the B+S employs spectral and semantic information, it introduces much noise and redundant information, which is harmful to the classification performance. These results illustrate that our proposed method is superior and obtains excellent classification accuracy.
Figure 17. The influence of the spectral, spatial and spectral-spatial-semantic streams. (a-c) show the influence on the IN, KSC and SA datasets, respectively.


Analysis of the Network Depth
The depth of the proposed SMFFNet greatly affects the classification performance. To find the most suitable depth of the spectral feature extraction stream (B), we examine different depths n ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. From Figure 18a,b, with the increase of the depth of the B, the evaluation index curves on the IN and KSC datasets fluctuate greatly, especially the AA curve. When the depth of the B is 8, the three evaluation indexes are the highest on the IN and KSC datasets. From Figure 18c, the evaluation index curves decrease slightly at a depth of 3, and significantly at depths of 5 and 10, on the SA dataset; the other settings achieve better accuracy. To make the SMFFNet universal and reduce the network complexity, we also set the depth of the B to 8 for the SA dataset. In addition, the evaluation index curves show significant increases or decreases on all three datasets, so the network depth has a great influence on the classification performance. If the network is too shallow, feature extraction is insufficient; if it is too deep, the gradient may vanish. Therefore, adding or removing MRCA units not only changes the depth of the proposed SMFFNet, but also greatly affects the classification performance.

Analysis of the Multi-Level Spatial Feature Fusion Module
To validate the effectiveness of the multi-level spatial feature fusion module (low+middle+high), we compare it with three other configurations: the SMFFNet without the multi-level spatial feature fusion module (no-fusion), the SMFFNet with only the low-level residual learning module (low), and the SMFFNet with only the low-level and middle-level residual learning modules (low+middle). From Figure 19, we can clearly find that the evaluation indexes of the low+middle+high are the highest on the three HSI datasets.
Specifically, as shown in Figure 19a, compared with the no-fusion, the OA, AA and Kappa improve by 3.27%, 21.32% and 3.73%; compared with the low, they improve by 5.54%, 25.45% and 6.32%; and compared with the low+middle, they improve by 3.76%, 21.79% and 4.28%. The results on the KSC dataset are similar to those on the IN dataset. As shown in Figure 19c, the evaluation indexes of the no-fusion are the lowest, while those of the other three configurations reach 100%. This is probably because the SA dataset contains relatively few labeled samples and categories and is easy to train. To make the SMFFNet universal, we choose the low+middle+high for our proposed SMFFNet. These results prove that the proposed method has superb classification performance and stronger robustness. Thus, the combination of the low-level, middle-level and high-level residual learning modules affects both the depth of the proposed SMFFNet and the classification performance.

Discussion and Conclusions
In this paper, we propose a novel 2D-3D CNN with spectral-spatial multi-scale feature fusion (SMFFNet) for hyperspectral image classification, which can simultaneously extract spectral, spatial, and high-level spectral-spatial-semantic fusion features. The functional modules of the proposed method are built on 2D-3D CNNs: the 2D convolutions reduce the number of training parameters and hence the computational complexity, while the 3D convolutions match the 3-D structure of HSI data and extract more discriminative features. The proposed method comprises four parts: two feature extraction streams, a feature fusion module, and a classification scheme. First, we use two distinct backbone modules for feature representation, namely the spectral and the spatial feature extraction streams. The spectral feature extraction stream, consisting of an initial layer, a hierarchical spectral feature extraction module and a hierarchical feature fusion module, is designed to extract multi-scale spectral features, learn important spectral information, and suppress useless information. The spatial feature extraction stream, comprising an initial module, a multi-level spatial feature fusion module with a spatial attention mechanism and a feature alignment module, is constructed to obtain multi-level spatial features and to extract context information that strengthens the spatial features. Together, the two streams fully excavate the category attribute information of HSI. Then, a multi-scale spectral-spatial-semantic feature fusion module is proposed based on a Decomposition-Reconstruction structure, which maps the low-level spectral/spatial features to the high-level spectral-spatial-semantic fusion features used for classification.
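The parameter saving that motivates mixing 2-D with 3-D convolutions can be made concrete with a back-of-the-envelope count; the channel and kernel sizes below are illustrative assumptions, not the paper's configuration.

```python
def conv2d_params(c_in, c_out, k=3):
    # Weights of one 2D conv layer (k x k kernel, bias included).
    return c_out * (c_in * k * k + 1)

def conv3d_params(c_in, c_out, k=3, k_spec=3):
    # Weights of one 3D conv layer with an extra spectral kernel dimension.
    return c_out * (c_in * k * k * k_spec + 1)

# Illustrative channel counts, not the paper's configuration:
p2d = conv2d_params(32, 64)   # 64 * (32*9  + 1) = 18496
p3d = conv3d_params(32, 64)   # 64 * (32*27 + 1) = 55360
print(p2d, p3d, p3d / p2d)    # the 3D layer costs roughly 3x the weights
```

This is why replacing 3-D layers with 2-D ones wherever full spectral kernels are unnecessary keeps the model lighter, while retaining 3-D convolutions where the spectral dimension carries the discriminative signal.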
Ultimately, to enhance classification performance, we adopt a layer-specific regularization and smooth normalization classification scheme in place of a simple stack of two fully connected layers; this scheme adaptively learns the fusion weights of the spectral-spatial-semantic features produced by the fusion module.
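The idea of adaptively weighting branch outputs can be sketched as follows. This is a simplified stand-in for the scheme described above: the softmax plays the role of the smooth normalization, and `alpha` stands in for the learnable fusion weights (the values used here are illustrative only).

```python
import numpy as np

def softmax(z):
    """Smooth normalization: positive weights that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def fuse_branch_scores(branch_scores, alpha):
    """Weight per-branch class scores by smooth-normalized fusion
    weights. 'alpha' stands in for the learnable fusion parameters."""
    scores = np.asarray(branch_scores, dtype=float)  # (branches, classes)
    w = softmax(np.asarray(alpha, dtype=float))      # (branches,)
    return w @ scores                                # fused (classes,)

spectral = np.array([2.0, 0.5, -1.0])  # hypothetical branch class scores
spatial  = np.array([1.0, 1.5,  0.0])
fused = fuse_branch_scores([spectral, spatial], alpha=[0.3, 0.7])
print(fused)
```

In a trained network `alpha` would be updated by backpropagation together with the rest of the parameters, so the model itself decides how much each feature branch contributes to the final prediction.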

To prove the effectiveness and advantages of the proposed SMFFNet, extensive comparison experiments were conducted on three popular HSI datasets. The OA, AA, Kappa coefficients, and the per-category classification accuracies on the three HSI datasets demonstrate that the proposed SMFFNet outperforms the state-of-the-art methods. Moreover, the ablation experiments adequately verify the validity of the proposed hierarchical spectral feature extraction module, the multi-level spatial feature fusion module with the spatial attention module, and the multi-scale spectral-spatial-semantic feature fusion module.
However, the proposed method still has some shortcomings. By calculating the computation cost of several competitive methods, including 3-D CNN, Hybrid, JSSAN, RSSAN, TSCNN and SMFFNet, we find that the proposed method requires a relatively high computation cost. Since the multi-scale residual block of SMFFNet contains different sub-blocks, the integrity of information is guaranteed, but the model structure is relatively complex and the number of training parameters is large. Future work will therefore focus on effectively reducing the complexity of the model while retaining high classification accuracy. In addition, hyperspectral image classification has been widely applied in many fields of computer vision, so in the future we will also try to apply the proposed classification method to other computer vision tasks, such as target recognition.