Deep Spectral Spatial Inverted Residual Network for Hyperspectral Image Classification

Convolutional neural networks (CNNs) have been widely used in hyperspectral image classification in recent years. The training of CNNs relies on a large amount of labeled sample data; however, the number of labeled samples of hyperspectral data is relatively small. Moreover, for hyperspectral images, fully extracting spectral and spatial feature information is the key to achieving high classification performance. To address these issues, a deep spectral spatial inverted residual network (DSSIRNet) is proposed. In this network, a block random erasing strategy is introduced to alleviate the problem of limited labeled samples by data augmentation of small spatial blocks. In addition, a deep inverted residual (DIR) module for spectral spatial feature extraction is proposed, which locks in the effective features of each layer while avoiding network degradation. Furthermore, a global 3D attention module is proposed, which realizes fine extraction of spectral and spatial global context information while keeping the number of input and output feature maps the same. Experiments are carried out on four commonly used hyperspectral datasets. Extensive experimental results show that, compared with several state-of-the-art classification methods, the proposed method provides higher classification accuracy for hyperspectral images.


Introduction
With the rapid development of remote sensing imaging technology, hyperspectral image (HSI) analysis has drawn increasing attention in recent years. An HSI can be viewed as a three-dimensional cube constructed from a large number of bands. Each sample of an HSI contains reflectance information from hundreds of different spectral bands, which makes this kind of image suitable for many practical applications, such as precision agriculture [1], food analysis [2], anomaly detection [3], geological exploration [4,5], etc. In the past decade, hyperspectral image processing technology has become increasingly popular due to the development of machine learning. However, there are still challenges in the field of HSI classification: (1) the training of deep learning models depends on a large quantity of labeled sample data, while the number of labeled samples in hyperspectral data is insufficient; (2) because HSIs contain rich spectral spatial information, the problem that the spectral spatial features of HSIs are not effectively extracted still exists [6]. In addition, the same material often exhibits different spectral curves, and different materials may share the same spectral curve.
Many of the traditional machine learning-based HSI classification approaches use hand-crafted features to train the classifier [7]; feature extraction and classifier training are thus separated. Representative hand-crafted feature methods include local binary patterns (LBPs) [8], the histogram of oriented gradients (HOG) [9], the global invariant scalable transform (GIST) [10], random forest [11], and so on.

Overview of the Proposed DSSIRNet

Let X ∈ ℜ^(H×W×B) be the 3D HSI cube with height H, width W, and B spectral channels, and let Y be the label vector of the HSI data. First, X is randomly divided into 3D blocks composed of labeled central samples and their adjacent samples, recorded as a new sample set, where h, w, and B represent the height, width, and spectral dimension of each new 3D cube, respectively; h and w are set to the same value.
Then, the training set X_train is randomly sampled from the new sample set according to a certain proportion P, and the validation set X_valid is randomly sampled from the remainder according to the same proportion. The remaining samples form the test set X_test. Next, the DSSIRNet is trained with X_train to obtain the initial parameters of the model, and the parameters are continuously updated on the validation set X_valid until the optimal parameters are obtained. Finally, X_test is input into the optimal model to obtain the final prediction results Y′.

Figure 2 shows the structure of the proposed DSSIRNet. The framework is divided into three stages: data augmentation based on random erasing, deep feature extraction with dual-input fusion and the DIR module, and classification. The first stage addresses the problem of limited labeled samples in hyperspectral images: through block random erasing of the training samples, the spatial distribution of the dataset can be fully changed, and the number of available training samples can be effectively increased without any increase in parameters. To improve the robustness of the model, the problem of insufficient spectral spatial information extraction must also be solved; the deep feature extraction network of the second stage addresses this. The second stage mainly includes two parts: (1) a dual-input fusion part; (2) a DIR dense connection part, in which three DIR modules are densely connected to enhance the ability of feature representation. This paper also designs a global 3D attention module inside the DIR module to achieve more refined and sufficient global context feature extraction. Finally, the spectral spatial features fully extracted in the second stage are input into the third stage to realize classification.


Block Random Erasing Strategy for Data Augmentation
In the imaging process, hyperspectral images may be occluded by clouds, shadows, or other objects due to the influence of the atmosphere, resulting in the loss of information in the occluded area. The most common interference is cloud cover. Therefore, in the first stage, this paper designs a block random erasing strategy. Before model training, cloud-like occlusion is first simulated on the image scene, and then the spatial block with the added interference is input into the model for training. This changes the spatial information, increases the available samples, and thereby improves classification accuracy. Different from [40], the feature extraction and classification model proposed in this paper is three-dimensional. To avoid high model complexity, random erasing on small spatial blocks (i.e., 9 × 9) is studied.
For the training sample set X_train, let S be the spatial area of the original input 3D cube, p the erasing probability, and S_e the area of the randomly initialized erasing region, with the ratio S_e/S set between p_l and p_h and the height-to-width ratio r_e of the erasing region set between r_1 and r_2. First, a probability p_1 = Rand(0, 1) is drawn. If p_1 < p is satisfied, erasing is performed; otherwise, the block is left unchanged. To obtain the position of the erasing region, the initial erasing area

S_e = rand(p_l, p_h) × S, (1)

is calculated according to the random erasing proportion, and then the height and width of the erasing region are calculated from the random height-to-width ratio:

H_e = sqrt(S_e × r_e), (2)
W_e = sqrt(S_e / r_e). (3)

Then, a coordinate (x_e, y_e) for the upper left corner of the erasing region is randomly chosen. From the height and width obtained by Formulas (1)–(3), the boundary coordinate of the occluded region (x_e + W_e, y_e + H_e) can be obtained. If these coordinates exceed the boundary of the original spatial block, the above process is repeated. For each training iteration, random erasing is performed to generate a changing 3D cube, which enables the model to learn more abundant spatial information.
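The steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the hyperparameter defaults and the zero fill value for erased pixels are assumptions made here for concreteness.

```python
import numpy as np

def block_random_erase(cube, p=0.5, p_l=0.02, p_h=0.33, r_1=0.3, r_2=3.3,
                       rng=None):
    """Randomly erase a rectangular region of a 3D HSI block of shape (h, w, B).

    Parameter names (p, p_l, p_h, r_1, r_2) follow the text; the default
    values and the zero fill are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = cube.shape
    S = h * w
    if rng.random() >= p:                  # with probability 1 - p, skip
        return cube
    for _ in range(100):                   # retry until the region fits
        S_e = rng.uniform(p_l, p_h) * S            # Formula (1)
        r_e = rng.uniform(r_1, r_2)                # height-to-width ratio
        H_e = int(round(np.sqrt(S_e * r_e)))       # Formula (2)
        W_e = int(round(np.sqrt(S_e / r_e)))       # Formula (3)
        if H_e < 1 or W_e < 1 or H_e > h or W_e > w:
            continue                       # region out of bounds, redraw
        x_e = rng.integers(0, h - H_e + 1)
        y_e = rng.integers(0, w - W_e + 1)
        out = cube.copy()
        out[x_e:x_e + H_e, y_e:y_e + W_e, :] = 0.0  # simulated occlusion
        return out
    return cube
```

Because the erasing is redrawn each epoch, the same labeled block yields many distinct training views at zero parameter cost.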

Dual-Input Fusion Part
Let x ∈ ℜ^(h×h×l) be a data block processed by random erasing. After spectral processing f(·) and spatial processing g(·), two groups of feature maps are obtained as the dual input of the DIR module. Here, f(·) and g(·) are both combinations of three-dimensional convolution, batch normalization (BN), and the swish activation function; they differ only in the kernel of the three-dimensional convolution. The corresponding feature maps are

M_spectral = f(x) = σ(BN(W_1 * x + b_1; γ, β)),
M_spatial = g(x) = σ(BN(W_2 * x + b_2; γ, β)),

where M_spectral and M_spatial are the feature maps obtained by spectral processing and spatial processing, respectively, σ represents the swish activation, * represents three-dimensional convolution, W_i and b_i are the weight and bias of the two convolutions, and γ and β are the trainable parameters of the BN operation. The three-dimensional convolution kernels of spectral processing and spatial processing are 1 × 1 × 9 and 3 × 3 × 9, respectively. The number of convolution kernels is 32, the convolution stride is (1, 1, 2), and the paddings are (0, 0, 0) and (1, 1, 0), respectively. Therefore, the two groups of feature maps have the same size and number. The two groups of feature maps are then fused by element-wise addition,

M_sum = M_spectral ⊕ M_spatial,

where ⊕ represents the element-wise addition operation and M_sum represents the fused feature map.
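That the two branches produce feature maps of identical size follows from the standard convolution output formula, floor((n + 2p − k)/s) + 1 per axis. A quick check, using an illustrative spectral depth of l = 30 (the actual depth depends on the dataset):

```python
def conv3d_out_size(size, kernel, stride, padding):
    """Output size of a 3D convolution along each axis:
    floor((n + 2p - k) / s) + 1."""
    return tuple((n + 2 * p - k) // s + 1
                 for n, k, s, p in zip(size, kernel, stride, padding))

block = (9, 9, 30)  # h x h x l input block; l = 30 is an illustrative value
# Spectral branch: 1 x 1 x 9 kernel, stride (1, 1, 2), no padding.
spectral = conv3d_out_size(block, (1, 1, 9), (1, 1, 2), (0, 0, 0))
# Spatial branch: 3 x 3 x 9 kernel, stride (1, 1, 2), padding (1, 1, 0).
spatial = conv3d_out_size(block, (3, 3, 9), (1, 1, 2), (1, 1, 0))
```

The (1, 1, 0) padding of the spatial branch exactly compensates for its larger 3 × 3 spatial kernel, so the element-wise addition ⊕ is well defined.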

DIR Module
The DIR module designed in this paper is inspired by Mobilenet v3 [42] and Efficientnet [43]. We expand the 2D inverted residual model to a 3D inverted residual model. Through a large number of experiments and parameter adjustments, a general 3D module suitable for hyperspectral image classification is designed. Figure 3 shows the schematic diagram of the DIR module. The main idea of deep spectral spatial feature extraction in a DIR module is to expand the low-dimensional representation of the input data to a high-dimensional representation and filter the feature maps with depthwise separable convolution (DSC). The filtered feature maps are also transmitted to the global 3D attention module for deeper filtering. Following this, the two groups of filtered features are multiplied to enhance feature representation, and these features are projected back to the low-dimensional representation by standard convolution. Finally, the residual branch is utilized to avoid network degradation. The implementation details of the DIR module are described in Algorithm 1.

Algorithm 1 DIR module.
Input: The feature map set x_l ∈ ℜ^(h×h×l×C) obtained after dual-input fusion.
1: Ascend the dimension of the input data x_l by conventional 3D convolution. The number of feature maps after convolution is 6C; the feature maps are then processed by BN and swish activation.
2: Perform DSC on the feature maps in two steps: a depthwise process and a pointwise process, with convolution kernels of 3 × 3 × 3 and 1 × 1 × 1, respectively. The number of feature maps before and after DSC remains the same. BN and swish activation are then applied to obtain the filtered feature map set D_(l+1).
3: Input D_(l+1) into the global 3D attention module for depth filtering and obtain the feature map set G_3D.
4: Calculate the product of the feature maps: Ĝ_3D = D_(l+1) ⊗ G_3D.
5: Reduce the dimension of the multiplied feature map Ĝ_3D through conventional three-dimensional convolution, then perform BN and swish to obtain the reduced feature map set x_(l+1) ∈ ℜ^(h×h×l×C).
6: Add the input feature map and the reduced-dimension feature map, then activate with swish to obtain the output feature map set y_l = σ(x_l ⊕ x_(l+1)).
Output: Feature map set y_l ∈ ℜ^(h×h×l×C) with the same size and number as the input feature map.
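The efficiency of step 2 comes from factorizing the convolution. A quick parameter count makes this concrete; the channel width used below (6C with C = 32) is an illustrative value consistent with the expansion in step 1, not a figure reported by the paper.

```python
def standard_conv3d_params(c_in, c_out, k):
    """Weights of a standard 3D convolution: one k*k*k kernel per
    (input channel, output channel) pair (biases omitted)."""
    return c_in * c_out * k ** 3

def depthwise_separable_conv3d_params(c_in, c_out, k):
    """DSC = depthwise stage (one k*k*k kernel per input channel)
    plus pointwise stage (1x1x1 kernels mixing channels)."""
    return c_in * k ** 3 + c_in * c_out

c = 6 * 32  # expanded width 6C, assuming C = 32 (illustrative)
std = standard_conv3d_params(c, c, 3)             # 192 * 192 * 27
dsc = depthwise_separable_conv3d_params(c, c, 3)  # 192 * 27 + 192 * 192
```

Here the factorized form uses roughly 1/24 of the weights of the standard 3 × 3 × 3 convolution, which is why the expand-filter-project pattern stays cheap even at 6C channels.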

Global 3D Attention Module
Inspired by csSE [44], this paper designs the global 3D attention module, as shown in Figure 4. The global 3D attention module fully considers the global information and effectively extracts the spectral and spatial context features, so as to enhance the ability of feature representation.

The 3D spectral part: first, the input feature map U = [u_1, u_2, . . . , u_s], viewed as a combination of spectral channels u_i ∈ ℜ^(d×d), is processed by adaptive average pooling (AAP), and element k of the pooled tensor is obtained as

z_k = (1/(d × d)) Σ_(i=1)^d Σ_(j=1)^d u_k(i, j).

Then, two linear layers of size c/2 × 1 × 1 × 1 and c × 1 × 1 × 1 are trained to find the correlation between features and classification, and an s-dimensional tensor is obtained. Next, the sigmoid function is used for normalization to obtain the spectral attention map. Finally, the obtained spectral attention map is multiplied by the input feature map. The process can be represented as

F_c(U) = [δ(ẑ_1)u_1, δ(ẑ_2)u_2, . . . , δ(ẑ_s)u_s],

where δ(ẑ_s) represents the combination function of the two linear layers, ReLU operation, and sigmoid activation operation, and F_c(U) is the spectral attention map.

The 3D spatial part: first, the image features are extracted through a convolution layer, then the spatial attention map is activated by the sigmoid function, and finally the obtained spatial attention map is multiplied by the input feature map. The process can be represented as

F_s(U) = δ(q_s)U,

where δ(q_s) is the combination function of a one-layer three-dimensional convolution and sigmoid activation operation, and F_s(U) is the spatial attention map. Finally, the feature maps obtained from the two parts are compared element by element, and the maximum value is returned to generate the final global 3D attention map.
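A minimal NumPy sketch of the two branches and their max fusion, under simplifying assumptions: the feature map is reduced to shape (d, d, s), the spatial branch's convolution is collapsed to a 1 × 1 × 1 channel projection, and all weights (W1, W2, w_sp) are illustrative stand-ins rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_3d_attention(U, W1, W2, w_sp):
    """Sketch of the module for a feature map U of shape (d, d, s) with s
    spectral channels. W1 (s/2, s) and W2 (s, s/2) play the role of the two
    linear layers of the spectral branch; w_sp (s,) stands in for the
    one-layer convolution of the spatial branch."""
    # Spectral branch: squeeze by adaptive average pooling over (d, d),
    # excite with two linear layers (ReLU, then sigmoid), rescale channels.
    z = U.mean(axis=(0, 1))                      # (s,) pooled descriptor
    a = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # (s,) channel weights in (0, 1)
    F_c = U * a                                  # broadcast over (d, d)
    # Spatial branch: project channels to one map, sigmoid, rescale positions.
    q = sigmoid(U @ w_sp)                        # (d, d) spatial weights
    F_s = U * q[:, :, None]
    # Fuse by element-wise maximum, as in the text.
    return np.maximum(F_c, F_s)
```

Because both branches only rescale U by weights in (0, 1), the module never amplifies a feature; the max fusion keeps, per element, whichever branch suppresses it least.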

(1) Datasets
In order to verify the performance of DSSIRNet, four classical datasets are used in the experiment.
The Indian Pines (IN) dataset was obtained by the airborne visible infrared imaging spectrometer (AVIRIS) sensor in northwestern Indiana. The data size is 145 × 145, with a total of 220 contiguous bands imaging the ground objects. After excluding the 20 water absorption bands 104-108, 150-163, and 220, the remaining 200 effective bands are taken as the research object. The spatial resolution is 20 m per pixel and the spectral coverage is 0.4~2.5 µm. The scene contains 16 kinds of crops.
The Pavia University (UP) dataset was collected by the ROSIS sensor. It continuously images 115 bands in the wavelength range of 0.43~0.86 µm, with a spatial resolution of 1.3 m. After eliminating the noise-affected bands, the size of UP is 610 × 340 × 103, including nine types of land cover.
The Salinas Valley (SV) dataset was captured by the AVIRIS sensor over Salinas Valley, California. The spatial resolution of the data is 3.7 m, and the size is 512 × 217. The original data has 224 bands. After removing the 20 bands 108-112, 154-167, and 224, which suffer from serious water vapor absorption, 204 effective bands are retained. The dataset contains 16 crop categories.
The Botswana (BS) dataset was gathered from the NASA EO-1 satellite over Okavango Delta, Botswana, with a spatial resolution of 30 m and spectral coverage of 0.4~2.5 µm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiments, the IN, UP, SV, and BS datasets are each divided into a training set, a validation set, and a test set. The proportions of training, validation, and test samples randomly selected from each category are the same. The training proportion is the ratio of the number of training samples obtained by random sampling to the total number of samples; the validation and test proportions are defined similarly. Here, to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. For the UP, SV, and BS datasets, the proportions of the training, validation, and test sets are the same, namely 3%, 3%, and 94%, respectively. The detailed division information of the four datasets is listed in Tables 1 and 2.
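The per-class sampling scheme above can be sketched as a small stratified-split helper. This is an illustrative implementation of the described scheme, not the authors' code; the minimum of one train/validation sample per class is an assumption added so small classes are never empty.

```python
import random
from collections import defaultdict

def stratified_split(labels, p_train, p_valid, seed=0):
    """Per-class random split into train/validation/test index lists,
    e.g. p_train = p_valid = 0.05 for IN, or 0.03 for UP, SV, and BS.
    `labels` is a sequence mapping sample index -> class label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    train, valid, test = [], [], []
    for c, idxs in by_class.items():
        rng.shuffle(idxs)
        n_tr = max(1, round(len(idxs) * p_train))  # >= 1 train sample
        n_va = max(1, round(len(idxs) * p_valid))  # >= 1 validation sample
        train += idxs[:n_tr]
        valid += idxs[n_tr:n_tr + n_va]
        test += idxs[n_tr + n_va:]
    return train, valid, test
```

Sampling per class rather than globally keeps the class proportions of the three subsets aligned, which matters for rare categories such as those in the IN scene.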
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. vango Delta, Botswana, with a spatial resolution of 30 m and spectral coverage of 0.4~2.5 μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. vango Delta, Botswana, with a spatial resolution of 30 m and spectral coverage of 0.4~2.5 μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
In the follow-up experiment, the IN, UP, SV, and BS datasets are divided into training set, validation set, and test set, respectively. The proportion of training, validation, and test randomly selected from each category is the same. The training proportion is equal to the ratio of the number of training samples obtained by random sampling to the total number of samples. The principle of validation proportion and test proportion is similar. Here, in order to avoid missing training samples of some categories in the IN dataset, we randomly select 5% for training, 5% for validation, and the remaining 90% for testing. The proportion of training set, validation set, and test set randomly selected from UP dataset, SV dataset, and BS dataset is the same, which are 3%, 3%, and 94% respectively. The detailed division information of the four datasets is listed in Tables 1 and 2 respectively. μm. After excluding the uncalibrated bands covering the water absorption characteristics and noise bands, 145 bands of 10-55, 82-97, 102-119, 134-164, and 187-220 are left for research. The data size is 1476 × 256. The ground truth map can be divided into 14 categories.
(2) Experimental Setting and Evaluation Index
In the experiments, the batch size for each dataset was set to 16 and the input spatial size was 9 × 9. The Adam optimizer was adopted, with an initial learning rate of 0.0003, a patience value of 15, and cosine attenuation; the maximum number of training epochs was set to 200. The experimental hardware platform is a server with an Intel(R) Core(TM) i9-9900K CPU, an NVIDIA GeForce RTX 2080 Ti GPU, and 32 GB of memory. The software platform is Windows 10 with Visual Studio Code, using CUDA 10.0, PyTorch 1.2.0, and Python 3.7.4. All reported classification results are the average classification accuracy ± standard deviation over more than 20 runs. For quantitative evaluation, this paper uses overall accuracy (OA), average accuracy (AA), and the kappa coefficient as measures of classification performance. OA is the ratio of the number of correctly classified samples to the total number of samples. AA is the mean of the per-category classification accuracies. The kappa coefficient measures the consistency between the classification result and the ground truth map; it is computed from the confusion matrix, and a lower kappa value indicates a more imbalanced confusion matrix and a worse classification result.
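The three evaluation indices can be computed directly from the confusion matrix. The helper below is a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and the kappa coefficient from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                      # correctly classified / all samples
    per_class = np.diag(cm) / cm.sum(axis=1)       # accuracy of each category
    aa = per_class.mean()                          # mean of the per-class accuracies
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Note that OA weights every sample equally, while AA weights every category equally, which is why the two can diverge on imbalanced datasets such as IN.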

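One common reading of the "cosine attenuation" mentioned in the settings above is a cosine decay of the learning rate from its initial value of 0.0003 over the 200 training epochs. The sketch below illustrates that decay only; how the decay interacts with the patience value of 15 is not specified in the text, so treat this as an assumption:

```python
import math

def cosine_lr(epoch, base_lr=3e-4, total_epochs=200, min_lr=0.0):
    """Cosine attenuation of the learning rate from base_lr down to min_lr."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))
```

In PyTorch (the paper's framework), the same schedule is provided by `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around the Adam optimizer.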
Classification Results Compared with Other Methods
This paper compares the proposed DSSIRNet with some state-of-the-art classification methods, including SVM_RBF [14], DRCNN [45], ROhsi [40], SSAN [46], SSRN [29], and A²S²K-ResNet [47]. SVM_RBF is a pixel-based classification method; DRCNN and ROhsi are 2D-CNN based methods; SSAN is a method based on an RNN and a 2D-CNN; and the remaining two methods, together with the proposed DSSIRNet, are 3D-CNN based methods. Tables 3–6 list the average classification accuracies of the seven methods on the four datasets. Rows 2–17 record the classification accuracy of each specific category, and the last three rows record the OA, AA, and kappa over all categories. For these experimental results, the best results are highlighted in bold.

This paper compares the proposed DSSIRNet with some state-of-the-art classification methods, including SVM_RBF [14], DRCNN [45], ROhsi [40], SSAN [46], SSRN [29], and A2S2K-ResNet.

As can be seen from Table 3, the classification results of the proposed DSSIRNet method on the IN dataset are clearly superior to those of the other state-of-the-art methods, achieving the best OA, AA, and kappa values. Because the deep learning frameworks (DRCNN, ROhsi, SSAN, SSRN, A2S2K-ResNet, and DSSIRNet) have excellent nonlinear representation and automatic hierarchical feature extraction capabilities, their classification performance is better than that of SVM_RBF. Among the 2D-CNN-based models, the structures of ROhsi and SSAN are too simple and their spectral spatial feature extraction is insufficient; therefore, the OAs of these two methods are 9.63% and 9.08% lower than that of DRCNN, respectively. Among the 3D-CNN-based models, SSRN learns spectral and spatial features separately and does not fuse the learned high-level features, so its OA is lower than those of A2S2K-ResNet and the proposed DSSIRNet. The proposed DSSIRNet first performs block data augmentation on the input 3D cube to increase the number of available samples; the designed DIR module then fully extracts spectral spatial features, the global 3D attention module effectively selects and extracts global context information, and the final dense connection operation integrates the joint spectral spatial features learned by the DIR modules. Therefore, DSSIRNet achieves the highest OA value on the IN dataset.

For the UP and SV datasets, as shown in Tables 4 and 5, the classification results of all methods exceed 90%. DRCNN uses multiple input spatial windows, so its OA values are higher than those of SVM_RBF, ROhsi, and SSAN.
A2S2K-ResNet can adaptively select 3D convolution kernels to jointly extract spectral spatial features, so its OA values are higher than those of SSRN. Compared with these methods, the OA values obtained by DSSIRNet on the UP dataset are 7.82%, 3.52%, 4.74%, 4.96%, 1.69%, and 1.09% higher than those of SVM_RBF, DRCNN, ROhsi, SSAN, SSRN, and A2S2K-ResNet, respectively. The OA values obtained on the SV dataset are 8.61%, 2.72%, 5.07%, 4.71%, 2.55%, and 0.99% higher than those of SVM_RBF, DRCNN, ROhsi, SSAN, SSRN, and A2S2K-ResNet, respectively. In particular, the proposed DSSIRNet method achieved a 100% prediction rate in three categories of the UP dataset (grassland, painted metal plate, and bare soil) and in seven categories of the SV dataset.

Table 6 shows the classification results of the different methods on the BS dataset. Compared with the other three datasets, the BS dataset has the highest spatial resolution, so the effective extraction of spatial context information is particularly important for the classification model. Obviously, the 3D-CNN-based methods (SSRN, A2S2K-ResNet, and DSSIRNet) perform better than the other methods, because 3D-CNN can extract spectral and spatial features at the same time. A2S2K-ResNet extracts joint spectral spatial features directly, which may lose spatial context information, so its OA values are lower than those of SSRN. The proposed DSSIRNet method utilizes the DIR module to jointly extract spectral spatial features and combines the global 3D attention module to focus on the spectral and spatial context features that contribute most to classification, so it achieves the best classification performance.

Figures 5-8 show the visual classification results obtained by the seven methods on the four datasets.
Taking Figure 5 as an example, the classification maps obtained by SVM_RBF, DRCNN, ROhsi, and SSAN contain some noise, especially in the corn-notill, grass-pasture, oats, and soybean-mintill classes. The 3D-CNN-based methods (SSRN, A2S2K-ResNet, and DSSIRNet) extract the spectral spatial features more effectively. Compared with the other methods, DSSIRNet greatly improves regional consistency and makes some categories more separable, such as grass-trees, hay-windrowed, and wheat. As can be seen, the probability of misclassification among categories is large for SVM_RBF, DRCNN, ROhsi, and SSAN, and smaller for the other methods. In particular, the proposed DSSIRNet method has the smallest misclassification rate and clear category boundaries, and its result is closest to the ground-truth map, which demonstrates the effectiveness of the proposed method.

Parameter Analysis of Erasing Probability
In order to verify the efficiency of the random erasing strategy on small spatial blocks, we studied the influence of the erasing probability and of spatial block sizes smaller than 20 × 20 on the classification performance; the comparison results are shown as surface graphs in Figure 9a-d. It can be seen from Figure 9 that, for the same patch size, the OA value reaches its maximum at p = 0.15, and the OA values of each dataset then begin to decrease as p increases. From the perspective of patch size, for the same parameter p, the OA value of each dataset reaches its maximum when the patch size is 9. In summary, for these hyperspectral datasets, an appropriate image block size (i.e., 9 × 9) and erasing probability (i.e., 0.15) not only reduce the computational complexity but also improve the classification accuracy.
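The erasing strategy analyzed above can be sketched as follows in NumPy. This is an illustrative sketch, not the paper's implementation: with probability p a random spatial rectangle of a band × height × width patch is zeroed out; the rule for choosing the rectangle size (here capped at a fraction `max_frac` of each spatial side) is our assumption, as the exact region-sizing rule is not given in this section.

```python
import numpy as np

def random_erase(patch, p=0.15, max_frac=0.5, rng=None):
    """With probability p, zero a random spatial rectangle of a
    (bands, height, width) patch across all spectral bands.

    The rectangle-size rule (up to max_frac of each spatial side) is
    an illustrative assumption, not the paper's exact scheme.
    """
    rng = np.random.default_rng(rng)
    if rng.random() >= p:               # keep the patch unchanged
        return patch
    out = patch.copy()
    _, h, w = out.shape
    eh = rng.integers(1, max(2, int(h * max_frac)))   # erased height
    ew = rng.integers(1, max(2, int(w * max_frac)))   # erased width
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out[:, top:top + eh, left:left + ew] = 0.0        # erase all bands
    return out
```

With a 9 × 9 patch and p = 0.15, most training patches pass through unchanged, while a minority have a small spatial region suppressed, which acts as augmentation without adding parameters.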

Run Time Comparison
In addition to classification accuracy, running time is also an important indicator in HSI classification tasks, especially in practical applications. Table 7 shows the running time of all methods on the four datasets. Compared with the deep learning methods, SVM_RBF takes the least time. Because DRCNN learns from multiple input spatial windows, its running time is long. ROhsi feeds large spatial blocks into a 2D-CNN model for training, so its running time is longer than that of DRCNN. SSAN uses an RNN and a 2D-CNN to learn spectral spatial features simultaneously, and its running time is only longer than that of SVM_RBF. For the 3D-CNN models, i.e., SSRN, A2S2K-ResNet, and DSSIRNet, although the proposed DSSIRNet costs slightly more time on the SV and BS datasets due to its larger number of layers, its running time is still less than those of DRCNN and ROhsi. Therefore, DSSIRNet has moderate computational complexity and can be used in practical applications.

Efficiency Analysis of Dense Connection of DIR Module
In this section, the effect of the number of DIR modules is analyzed on the four datasets, as shown in Table 8. As a general module, the DIR module is densely connected, so the joint spectral spatial features extracted by each module can be used most effectively. With two densely connected DIR modules, the classification accuracy already exceeds that of all compared methods except the 3D-CNN-based models. With three DIR modules, the proposed DSSIRNet method achieves the highest classification accuracy. After adding a fourth module, the OA value on each dataset decreases significantly, while the number of layers and the complexity of the model increase rapidly, which hampers the effective utilization of features. In conclusion, with three DIR modules the feature extraction is most effective and the ability to distinguish features is strongest.
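The dense connectivity pattern described above can be illustrated with a dependency-free NumPy sketch. The DIR module's internal 3D-convolution structure is not reproduced here; `toy_block` is a deliberately simple stand-in (a random channel projection), and only the wiring, where each block receives the channel-wise concatenation of the input and all earlier block outputs, reflects the dense connection being analyzed.

```python
import numpy as np

def toy_block(x, out_ch=4):
    """Placeholder for a DIR module: maps (ch, h, w) -> (out_ch, h, w).

    A real DIR module applies 3D convolutions; a fixed random channel
    projection (seeded by the input channel count) keeps this sketch
    deterministic and dependency-free.
    """
    rng = np.random.default_rng(x.shape[0])
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.tensordot(w, x, axes=1)

def dense_forward(x, blocks):
    """Dense connectivity: block i sees the concatenation of the
    input and the outputs of blocks 0..i-1 along the channel axis."""
    feats = [x]
    for block in blocks:
        feats.append(block(np.concatenate(feats, axis=0)))
    return np.concatenate(feats, axis=0)   # fuse all features at the end
```

With a 2-channel 9 × 9 input and three 4-channel blocks, the fused output has 2 + 4 + 4 + 4 = 14 channels, showing how every module's features are retained for the final classification stage.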

Ablation Experiment
In order to verify the contributions of the random erasing (RE) strategy and the global 3D attention (G3D) module proposed in this paper, the ablation results of the two components on the four datasets are shown in Table 9. It can be seen that when neither the RE strategy nor the G3D module is adopted, the classification accuracy on the four datasets is the lowest, and adding either of them increases the OA value. Because the RE strategy increases the number of available samples, it has a greater impact on the final classification performance than adding the G3D module alone. Obviously, the optimal OA value is obtained when both components are added (i.e., the proposed DSSIRNet), which not only realizes data augmentation but also considers the spectral spatial global context information.

Small Sample Comparative Analysis
Figure 10a-d shows the OAs of all methods on the four datasets with different numbers of training samples. Specifically, 1%, 3%, 5%, and 10% of the labeled samples in each class are randomly selected for training. As can be seen from Figure 10, the proposed DSSIRNet method achieves the highest classification accuracy on all four datasets. As the training proportion increases, the OA values of all methods improve and the performance differences between the models shrink, but the OA value of the proposed method remains the highest. With 1% training samples, ROhsi and SVM_RBF have no advantage over the other deep-learning-based models. With 5% training samples, the OA values of ROhsi on UP and SV increase rapidly, exceeding those of SVM_RBF. Compared with the other methods, the proposed DSSIRNet shows the best classification performance in the small-sample case. The reason is that it adopts an effective data augmentation strategy and the DIR module for effective feature extraction, which also proves that DSSIRNet is more advantageous on small datasets.
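The per-class sampling procedure described above can be sketched as follows. This is an illustrative NumPy sketch of stratified random selection (the function name and the at-least-one-sample floor are our assumptions, not from the paper):

```python
import numpy as np

def per_class_split(labels, frac, rng=0):
    """Randomly select `frac` of the labeled samples of each class.

    Returns sorted indices of the training samples; the remaining
    indices can be used for testing. The at-least-one-sample floor
    per class is an assumption for very small classes.
    """
    rng = np.random.default_rng(rng)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(frac * idx.size)))   # samples drawn for class c
        train.extend(rng.choice(idx, size=n, replace=False))
    return np.sort(np.array(train))
```

For example, with 100 labeled samples per class and frac = 0.05, five samples per class are drawn, matching the 5% training setting in Figure 10.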

Conclusions
In this paper, we propose a novel HSI classification network, DSSIRNet, which is divided into three stages. The first stage adopts a random erasing strategy for data augmentation of the original 3D cube. The two components of the second stage realize effective joint spectral spatial feature extraction. The third stage classifies the high-level semantic features. The full combination of the three stages achieves the optimal classification effect for HSI. In addition, this paper studies the random erasing strategy on small spatial blocks, which expands the data effectively without adding parameters. In DSSIRNet, an effective feature extraction module, the DIR module, is designed to fully extract image features. This paper also designs a global 3D attention module to fully consider the global context information of the spectral and spatial dimensions and further improve the classification performance. The experimental results on four datasets prove the effectiveness of DSSIRNet. In the future, we will study deep learning frameworks for HSI classification with low parameter counts and small samples.