CSA-MSO3DCNN: Multiscale Octave 3D CNN with Channel and Spatial Attention for Hyperspectral Image Classification

Abstract: 3D convolutional neural networks (CNNs) have been demonstrated to be a powerful tool for hyperspectral image (HSI) classification. However, using conventional 3D CNNs to extract spectral-spatial features from HSIs results in too many parameters, because HSIs contain plenty of spatial redundancy. To address this issue, in this paper we first design multiscale convolutions to extract contextual features of different scales for HSIs, and then propose to employ an octave 3D CNN, which factorizes the mixed feature maps by their frequency, in place of the normal 3D CNN in order to reduce the spatial redundancy and enlarge the receptive field. To further explore discriminative features, a channel attention module and a spatial attention module are adopted to optimize the feature maps and improve the classification performance. Experiments on four hyperspectral image data sets demonstrate that the proposed method outperforms other state-of-the-art deep learning methods.


Introduction
Hyperspectral images (HSIs) are obtained by a series of hyperspectral imaging sensors and are composed of hundreds of successive spectral bands. Because the wavelength interval between every two neighboring bands is quite small (usually 10 nm), HSIs generally have a very high spectral resolution [1]. Analysis of HSIs has been widely used in a large variety of fields, including materials analysis, precision agriculture, environmental monitoring and surveillance [2][3][4]. Within the hyperspectral community, HSI classification, which assigns a unique class to each pixel in the image, is the most vibrant field of research [5]. However, the excessively redundant spectral band information and the limited training samples pose a great challenge to HSI classification [6].
Early attempts at HSI classification, including radial basis function (RBF) and K-nearest neighbor (kNN) methods, are all pixel-wise and focus on the spectral signatures of the hyperspectral data. Besides the spectral aspect, however, spatial dependency, which indicates that adjacent pixels likely belong to the same category, is another useful source of information in hyperspectral data. Accordingly, in order to characterize the relationship between samples, several spatial methods such as sparse representation and graph-based methods were proposed [7][8][9][10]. However, these methods used class labels to construct a manifold structure from both labeled and unlabeled data and did not incorporate the spectral feature. Consequently, a promising direction is to combine the spectral and spatial information, which can enhance the classification performance. Nevertheless, two issues remain:
• There is a lot of spatial redundancy in hyperspectral data processing, which takes up much memory space. In particular, when a 3D CNN is adopted to learn the features, it involves numerous parameters compared to 2D and 1D CNNs, which is disadvantageous to the classification performance.
• Although methods combining DL and attention mechanisms have achieved success in HSI classification, to the best of our knowledge there is little research on spatial attention for HSI classification, even though it plays an important role.
In this paper, we propose a novel multiscale octave 3D CNN with channel and spatial attention (CSA-MSO3DCNN) for hyperspectral image classification. In our method, since a 3D CNN can mine the information hidden in hyperspectral data more effectively than 1D and 2D CNNs, the 3D CNN serves as the foundation of the entire architecture to directly extract spectral-spatial features. In order to extract spectral-spatial features of different scales, we design 3D convolution kernels of different sizes. Because a 3D CNN has many parameters and much redundancy, we propose to use an octave 3D CNN in place of the standard 3D CNN to decompose the features into high-frequency and low-frequency components and reduce the spatial redundancy. Before the fully connected layers, channel and spatial attention modules are added to refine the feature maps. Through this series of design choices, our method can extract higher-level and more recognizable features than standard deep learning methods such as 2D and 3D CNNs. The contributions of this paper can be summarized as follows:
1. The proposed network takes full advantage of octave 3D CNNs with different kernels to capture diverse features and reduce the spatial redundancy simultaneously. Given the same input and structure, the proposed method works more effectively than a method based on the normal 3D CNN.
2. A new attention mechanism with two attention modules is employed to refine the feature maps, selecting discriminative features from the spectral and spatial views. This boosts the performance of the proposed network by further capturing the similarity of adjacent pixels and the correlation of the various spectral bands.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce the CNN and the attention mechanism in DL. The detailed design of CSA-MSO3DCNN method is given in Section 3. In Section 4, we present and discuss the experimental results, including ablation experiments. Finally, Section 5 summarizes this paper.

Convolutional Neural Network
A convolutional neural network (CNN) is a hierarchical structure composed of a deep stack of convolutional layers. It is this structure that gives CNNs a good capability of extracting features from visual data such as images and videos, which is very helpful for the subsequent operations. The mechanism of the CNN is based on receptive fields and follows the behavior of neurons in the primary visual cortex of the biological brain [36]. To promote the efficiency of CNNs, several improved convolutions, such as group convolution [14], separable convolution [38], depthwise convolution [39], and dilated convolution [40], have been proposed, which are mainly distinguished by the way the convolution is performed.
At present, convolutional neural networks have three different forms of convolution kernels: 1D (s_1 × n), 2D (s_1 × s_2 × n) and 3D (s_1 × s_2 × s_3 × n). They share the same principle: they perform the same elementwise computations and all adopt the back-propagation algorithm to modify the parameters and train the network. For HSI classification, the difference between the three forms is that they characterize different forms of features; specifically, a 1D CNN explores the spectral feature, a 2D CNN explores the spatial feature, and a 3D CNN explores the spatial and spectral features jointly.
Because HSIs are inherently 3D and high dimensional, the 3D CNN is more suitable for feature extraction and is also used in our proposed network. To build a network for HSI classification, the 3D convolution kernel alone is not enough; an activation function and some regularization measures are also needed. Therefore, for an input or intermediate feature map, the processing through one layer of the network can be described as

x_l = H(F(x_{l-1}) + b),

where x_{l-1}, x_l and b are the input, output and corresponding bias respectively, F(·) is the convolution operation and H(·) is a subsequent processing function, which can be batch normalization (BN) followed by rectified linear units (ReLU). By stacking more layers and adding pooling and fully connected layers, a trainable network is established. The same construction is employed in our proposed network.
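The single-layer mapping x_l = H(F(x_{l-1}) + b) can be illustrated with a minimal NumPy sketch. The naive triple loop, the single-channel input and the per-volume normalization are deliberate simplifications for clarity, not the paper's Keras implementation:

```python
import numpy as np

def conv3d(x, w, b):
    """Naive 'valid' 3D convolution F(.) of a single-channel volume x.

    x: (D, H, W) input volume; w: (kd, kh, kw) kernel; b: scalar bias.
    """
    kd, kh, kw = w.shape
    D, H, W = x.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(x[i:i + kd, j:j + kh, k:k + kw] * w) + b
    return out

def bn_relu(x, eps=1e-5):
    """H(.): normalization (per-volume here, for simplicity) followed by ReLU."""
    x = (x - x.mean()) / np.sqrt(x.var() + eps)
    return np.maximum(x, 0.0)

# One layer: x_l = H(F(x_{l-1}) + b)
x = np.random.rand(8, 8, 8)
w = np.random.rand(3, 3, 3)
y = bn_relu(conv3d(x, w, 0.1))
```

With a 3 × 3 × 3 kernel and no padding, an 8 × 8 × 8 input shrinks to 6 × 6 × 6; the ReLU guarantees a non-negative output.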

Attention Mechanisms
As early as around the year 2000, studies have shown that attention mechanisms play an important role in human visual perception [41,42]. Subsequent to these, attention mechanisms have penetrated into various tasks in the field of information recognition, such as machine translation [43], object recognition [30], pose estimation [44], saliency detection [45]. In [43], the author proposed an architecture based on convolutional neural networks which used gated linear units to ease gradient propagation and equipped each decoder layer with a separate attention module. In [30], a recurrent neural network (RNN) model which is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution was proposed.
In recent years, several attention mechanism networks of great significance have been developed. Hu et al. [35] proposed a squeeze-and-excitation network based on channel attention. It employs squeeze and excitation operations, composed of pooling and fully connected layers, to assign different weights to different channels of the feature map in order to re-calibrate it. Furthermore, it is a plug-and-play lightweight model that can easily be combined with classic DL models such as residual networks and conveniently applied to various applications. To investigate attention from more aspects, Park et al. [46] proposed an effective attention module called the bottleneck attention module (BAM), built from separate channel and spatial pathways, which can be embedded in any feed-forward convolutional neural network. Similarly, to boost the representation power of CNNs, Woo et al. [47] proposed an attention module along the channel and spatial dimensions separately and plugged it into every convolutional block, whereas Park et al. placed the BAM module at every bottleneck of the network.
Inspired by [47], we propose a novel deep learning method combined with channel and spatial attention, which not only decreases the noise along the spectral bands but also explores the correlation between them. In the next section, our proposed method will be discussed in detail.

CSA-MSO3DCNN for Hyperspectral Images Classification
In this section, we first introduce the octave CNN and attention module, and then present the proposed network architecture, which is a novel multiscale octave 3D CNN with channel and spatial attention for HSIs classification (CSA-MSO3DCNN).

Octave Convolution
A drawback of the CNN is that the number of network parameters and the demand for memory increase dramatically as the number of convolutional layers increases, especially for the 3D CNNs used in HSI classification. The recently proposed octave convolution [48] decomposes the feature maps produced by a CNN into high and low spatial frequencies, which are updated separately, exchanged with each other and finally merged, in order to reduce the spatial redundancy and enlarge the receptive field.
Based on this, in this paper we propose to use three layers of octave 3D convolution. Let X_j = {x_1, x_2, . . . , x_i, . . . , x_{n_channels} | x_i ∈ R^{n_bands × l_1 × l_2}} denote the n_channels feature maps in the jth layer, where l_1 × l_2 denotes the spatial dimensions and n_bands is the number of spectral bands. As shown in Figure 1, the first layer of the octave 3D convolution decomposes the input feature maps into a high-frequency group X^H_1 and a low-frequency group X^L_1 along the channel dimension according to a hyperparameter α, which is the ratio of the low-frequency group to the total. The numbers of channels of X^H_1 and X^L_1 are (1 − α) × c and α × c respectively. Given the input feature maps X_1, the output of the first layer of octave 3D convolution is

X^H_1 = F(X_1), X^L_1 = F(Avg_pool(X_1)),

where F(·) is the normal 3D convolution operation and Avg_pool is the average pooling operation. In the middle layer, the high-frequency and low-frequency groups perform an intra-frequency update and an inter-frequency communication. The output of the middle layer is computed as

Y^H = F(X^H) + Up(F(X^L)), Y^L = F(X^L) + F(Avg_pool(X^H)),

where Up(·) is the up-sampling operation. Aiming at HSI classification, in the last layer the high- and low-frequency groups are processed to obtain the same shape and then added up to decrease the feature redundancy, as illustrated in Figure 1. The output of the last layer Y is

Y = F(Y^H) + Up(F(Y^L)).

Therefore, in octave 3D convolution, the spatial resolution of the low-frequency group is reduced by sharing information between neighboring regions. The octave 3D convolution thus has two distinct advantages: it reduces spatial redundancy and enlarges the receptive field. Accordingly, we propose to use octave 3D convolution in place of traditional 3D convolution to improve HSI classification.
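The split/update/merge flow above can be sketched in NumPy. To keep the example short it works on 2D spatial maps, replaces the learned convolution F(·) with the identity, and uses α = 0.5 so the two frequency groups have equal channel counts and can be summed directly; none of these simplifications are part of the actual method:

```python
import numpy as np

def avg_pool2(x):
    """Average-pool the spatial dims by 2 (x: (channels, H, W); H, W even)."""
    c, H, W = x.shape
    return x.reshape(c, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour up-sampling by 2 in both spatial dims."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def octave_layer(xh, xl, f=lambda t: t):
    """Middle octave layer: intra-frequency update plus inter-frequency exchange.

    f stands in for the learned convolution F(.); identity here for illustration.
    """
    yh = f(xh) + upsample2(f(xl))      # high freq: own path + up-sampled low
    yl = f(xl) + f(avg_pool2(xh))      # low freq: own path + pooled high
    return yh, yl

alpha, c, H, W = 0.5, 8, 16, 16
x = np.random.rand(c, H, W)
c_low = int(alpha * c)
# First layer: split into high/low frequency groups along the channel axis;
# the low-frequency group lives at half spatial resolution.
xh, xl = x[c_low:], avg_pool2(x[:c_low])
yh, yl = octave_layer(xh, xl)
# Last layer: bring both groups to the same resolution and merge.
y = yh + upsample2(yl)
```

The low-frequency path carries the same channels at a quarter of the spatial cost, which is where the memory saving comes from.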

Channel and Spatial Attention
To boost the representation ability of our network, with consideration of the abundant spectral and spatial information that HSIs contain, we propose to employ a channel attention mechanism, which assigns different significance to the channels of the feature maps, and a spatial attention mechanism, which aims to find which portions of a feature map are more important. We adopt the convolutional block attention module proposed by Woo et al., which is general and end-to-end trainable along with the basic CNN [47]. The structures of the channel and spatial attention modules are shown in Figure 2. Channel-wise attention emphasizes reducing channel redundancy and builds a channel attention map by capturing the inter-channel relationship of features [47]. As exhibited at the top of Figure 2, given an intermediate layer of feature maps X = {x_1, x_2, . . . , x_i, . . . , x_{n_channels} | x_i ∈ R^{n_bands × l_1 × l_2}}, average pooling and max pooling are performed simultaneously to squeeze and aggregate the features, generating two different descriptors: the max-pooled features X_max and the average-pooled features X_avg. Both X_max and X_avg are then fed into a shared network composed of two dense layers, and the channel attention M_C ∈ R^{n_channels × 1 × 1 × 1} is obtained. In addition, we adopt a reduction ratio r to reduce the parameters [35], so the hidden activation size is n_channels/r × 1 × 1 × 1. In summary, the channel attention is calculated as

M_C(X) = σ(MLP(Avg_pool(X)) + MLP(Max_pool(X))),

where σ denotes the sigmoid function, Max_pool is the max pooling operation and MLP is the shared two-layer network.
In order to further explore where to focus within a channel of the feature map, a spatial-wise attention mechanism is adopted, which can be seen as a complement to the channel-wise attention. As illustrated at the bottom of Figure 2, the spatial attention module is connected behind the channel-wise attention module. Its input is the channel-refined feature map

X_C = M_C(X) ⊗ X,

where ⊗ denotes element-wise multiplication. To take full advantage of the channel information, global average pooling and max pooling along the channel axis are both applied to generate the 3D feature maps X_Cmax and X_Cavg. These are then concatenated and convolved by a standard convolution layer to generate the 3D spatial attention map. The spatial attention is computed as

M_S(X_C) = σ(F^{3×3×3}([X_Cavg; X_Cmax])),

where F^{3×3×3} denotes a normal 3D convolution with a kernel size of 3 × 3 × 3. The output feature map of the two attention modules is then

Y = M_S(X_C) ⊗ X_C.

The obtained Y is the feature map of X optimized through the two sequential attention modules, which improves the classification performance.
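The two sequential attention modules can be sketched as follows. The sketch follows the CBAM structure described above but, as a simplification, replaces the learned 3 × 3 × 3 convolution in the spatial branch with a plain average of the two pooled maps; the weight matrices w1 and w2 are random stand-ins for the trained shared MLP:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """M_C = sigmoid(MLP(avg_pool(x)) + MLP(max_pool(x))); x: (C, D, H, W)."""
    avg = x.mean(axis=(1, 2, 3))                         # (C,) squeezed descriptor
    mx = x.max(axis=(1, 2, 3))                           # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)         # shared two-layer network
    return sigmoid(mlp(avg) + mlp(mx))                   # (C,) channel weights

def spatial_attention(xc):
    """M_S: pool along the channel axis, then squash; the learned 3x3x3
    convolution is replaced by a simple mean of the two maps here."""
    avg = xc.mean(axis=0)
    mx = xc.max(axis=0)
    return sigmoid((avg + mx) / 2.0)                     # (D, H, W) spatial weights

C, D, H, W, r = 8, 4, 6, 6, 2
x = np.random.rand(C, D, H, W)
w1 = np.random.rand(C // r, C)                           # reduction ratio r
w2 = np.random.rand(C, C // r)
xc = channel_attention(x, w1, w2)[:, None, None, None] * x   # X_C = M_C(X) (.) X
y = spatial_attention(xc)[None] * xc                         # Y = M_S(X_C) (.) X_C
```

Both attention maps are produced by a sigmoid, so they act as soft, per-channel and per-position gates on the feature maps rather than hard selections.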

Proposed Network Architecture
In this section, we design our CSA-MSO3DCNN architecture as follows. First, for the high-dimensional HSI data, we apply principal component analysis (PCA) to reduce the dimension before the formal network training, which decreases the parameters while keeping the essential information, as illustrated in Figure 3. Then, a pixel and its surrounding pixels are selected from the processed HSI data to form a 3D patch of shape s × s × d, where s × s is the spatial dimension and d is the number of spectral bands. This exploits the relationship between neighbouring pixels, which have a large probability of belonging to the same category. Secondly, the 3D patch is fed to three network branches, each composed of three octave 3D convolution layers, to extract multiscale features. In the three branches, each octave 3D convolution layer is followed by batch normalization and ReLU (BN-ReLU). We denote the outputs of the three branches as X_1, X_2, X_3. It is worth noting that the three branches differ in the size of the octave 3D convolution kernel: the kernel sizes of the three branches are 1 × 1 × 1, 3 × 3 × 3 and 5 × 5 × 5 respectively. In each branch, each octave 3D convolution layer is designed with a different number of convolution kernels, as shown in Figure 3.
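The preprocessing steps, PCA dimension reduction and 3D patch extraction, can be sketched with NumPy. The toy cube size and the zero-padding at the borders are illustrative assumptions; the paper does not specify its border handling:

```python
import numpy as np

def pca_reduce(cube, d):
    """Project an (H, W, B) hyperspectral cube onto its first d principal components."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B)
    flat = flat - flat.mean(axis=0)
    # Eigen-decomposition of the band covariance matrix; eigh returns
    # eigenvalues in ascending order, so take the last d columns reversed.
    cov = flat.T @ flat / (flat.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    comps = vecs[:, ::-1][:, :d]
    return (flat @ comps).reshape(H, W, d)

def extract_patch(cube, row, col, s):
    """Cut an s x s x d patch around (row, col); the cube is zero-padded at borders."""
    pad = s // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)))
    return padded[row:row + s, col:col + s, :]

cube = np.random.rand(30, 30, 100)        # toy HSI: 100 spectral bands
reduced = pca_reduce(cube, 20)            # keep 20 bands, as in the experiments
patch = extract_patch(reduced, 5, 5, 22)  # one 22 x 22 x 20 network input
```

Each such patch, one per labeled pixel, is what the three octave branches consume.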

Thirdly, for the convenience of concatenation, we keep the sizes of the original data and the feature maps consistent, and concatenate the outputs of the three branches to obtain X:

X = G(X_1, X_2, X_3),

where G(·) is the concatenation operation. A normal 3D convolutional layer is then used to abstract the feature map. Fourthly, we employ the attention module with channel-wise and spatial-wise attention (see Section 3.2) to refine the obtained feature maps, so that they become more discriminative. Finally, two fully connected (FC) layers and a softmax classifier perform the classification.
We also use the 'dropout' technique in the fully connected layers, which can effectively suppress over-fitting without adding a large number of parameters. In addition, categorical cross entropy is adopted as the loss function:

L = − Σ_k t_k log y_k,

where t_k is the correct label and y_k is the output of the network. To continuously reduce the loss and update the network parameters, the Adam optimizer is adopted. In summary, a novel multiscale octave 3D CNN based on channel and spatial attention has been proposed for HSI classification. Our approach greatly reduces spatial redundancy and enlarges the receptive field, both of which are beneficial for improving the classification performance.
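The loss L = − Σ_k t_k log y_k for one sample can be checked numerically. The logits below are arbitrary illustrative values:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax, as used by the final classifier layer."""
    e = np.exp(z - z.max())
    return e / e.sum()

def categorical_cross_entropy(t, y, eps=1e-12):
    """L = -sum_k t_k * log(y_k); t is one-hot, y is the softmax output."""
    return -np.sum(t * np.log(y + eps))

logits = np.array([2.0, 1.0, 0.1])
y = softmax(logits)
t = np.array([1.0, 0.0, 0.0])    # the correct class is index 0
loss = categorical_cross_entropy(t, y)
```

Because t is one-hot, the sum collapses to −log y_k for the correct class k, so a confident correct prediction drives the loss toward zero.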

Experimental Results and Analysis
In this section, we evaluate the performance of our CSA-MSO3DCNN on four public HSI data sets. Four popular indicators, namely class accuracy, overall accuracy (OA), average accuracy (AA) and the kappa coefficient (κ), are used to compare our approach with five state-of-the-art methods. All experiments were implemented on an NVIDIA 1060 GPU and a Titan graphics card server, using Tensorflow-gpu and Keras with Python 3.6.
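The three aggregate indicators can be computed from a confusion matrix as sketched below; the 2 × 2 matrix is a made-up illustration, not data from the paper:

```python
import numpy as np

def metrics(conf):
    """OA, AA and kappa from a confusion matrix (rows: true, cols: predicted).

    OA    = correctly classified pixels / all pixels
    AA    = mean of the per-class accuracies
    kappa = (OA - p_e) / (1 - p_e), where p_e is the chance agreement.
    """
    n = conf.sum()
    oa = np.trace(conf) / n
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))
    pe = (conf.sum(axis=0) @ conf.sum(axis=1)) / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

conf = np.array([[48, 2],
                 [3, 47]])
oa, aa, kappa = metrics(conf)   # 0.95, 0.95, 0.9
```

Kappa discounts the agreement expected by chance, which is why it is lower than OA whenever the class distribution makes random agreement likely.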

Experimental Data
The experiments were conducted on four standard HSI data sets, including two popular data sets and two contest data sets: Indian Pines, University of Pavia, grss_dfc_2013 [49] and grss_dfc_2014 [50]. Grss_dfc_2014 is a coarser-resolution long-wave infrared (LWIR, thermal infrared) hyperspectral data set, which is more challenging and was employed in the 2014 IEEE GRSS Data Fusion Contest. It was acquired by an 84-channel imager covering wavelengths between 7.8 and 11.5 µm with approximately 1 m spatial resolution. The size of this data set is 795 × 564 pixels with 22,532 labeled pixels, divided into seven classes. Figure 7a,b give a false color image of Grss_dfc_2014 and the ground truth map. For the Indian Pines data set, as shown in Table 1, because the number of samples in different categories varies widely, for each class we randomly selected half of the samples for training if the class had fewer than 600 samples, and took 300 for training if it had more than 600; the rest were set aside as testing samples. For the University of Pavia and Grss_dfc_2014 data sets, 200 training samples were randomly selected from each class. The remaining samples were used for testing, as shown in Tables 2 and 4. Several classes had a large number of test samples, such as 'meadows' and 'asphalt' in the University of Pavia and 'vegetation' in Grss_dfc_2014, which increased the difficulty of classification. For the Grss_dfc_2013 data set, 200 training samples were randomly selected from each class except the 'water' class; because of the limited samples in that class, we randomly selected 162 samples for training. The remaining samples were used for testing, as shown in Table 3.
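The Indian Pines sampling rule (half the class below 600 samples, a fixed 300 above) can be sketched as follows; the class size of 1428 is just an illustrative value:

```python
import numpy as np

def split_indian_pines(n_samples, rng):
    """Training-set size rule used for Indian Pines: half the class if it has
    fewer than 600 samples, otherwise a fixed 300; the rest is for testing."""
    n_train = n_samples // 2 if n_samples < 600 else 300
    idx = rng.permutation(n_samples)
    return idx[:n_train], idx[n_train:]

rng = np.random.default_rng(0)
train, test = split_indian_pines(1428, rng)   # a hypothetical class of 1428 pixels
```

Capping large classes at 300 keeps the training set roughly balanced, which matters because per-class sizes in Indian Pines span two orders of magnitude.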

Experimental Setup
In all the experiments, the size of the 3D cube was set to 22 × 22 × 20, where '20' is the number of spectral bands after PCA dimension reduction. The processed data retained no less than 99.96% of the information in the original data. Because the last octave 3D convolution layer contains a half-size operation (average pooling), the shape of the cube is reduced to 11 × 11 × 10, which omits the padding operation and reduces the noise at this step. In addition, the sizes of all 3D convolution operations are depicted in Figure 3, e.g., 3 × 3 × 3 with 'SAME' padding. The settings of the attention module are given in Figure 2.
The parameter α was set to 0.25 to reduce the spatial redundancy. The learning rate was set to 1 × 10−4 to ensure a suitable convergence speed. We use training steps to denote the number of iterations of the network, where each iteration updates the parameters of the whole network; too many training steps lead to over-fitting. In all experiments, the number of training steps was set to 1000 and the batch size to 128. The number of hidden layers and the regularization settings are shown in Figures 2 and 3: each branch has three hidden layers, and each convolution layer is followed by batch normalization for regularization.

Experimental Results and Discussion
To evaluate the effectiveness of the proposed method, we compared it with five state-of-the-art methods. For a fair comparison, the size of the training set for each data set is the same for all methods. The compared methods are as follows:
• CNN [20]: A method that exploits a CNN to encode the spectral-spatial information of pixels and an MLP to conduct the classification task.
• M3DCNN [25]: A multiscale 3D CNN method for HSI classification, in which different branches have different sizes of 3D convolution kernels.
• SSRN [26]: A spectral-spatial 3D deep learning network with a residual structure, which effectively mitigates over-fitting.
• MSDN-SA [37]: A dense 3D CNN framework with a spectral-wise attention mechanism.
• MSO3DCNN: Our proposed method without the attention modules.
In the experimental setup, we randomly chose the training and testing samples for the classification task. Because each random selection produces a different classification result, we ran the experiment 10 times for each data set and each class. We then computed the average accuracy and the standard deviation for each class, and the overall accuracy (OA), average accuracy (AA) and kappa coefficient (κ) for each data set.
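Averaging over repeated random splits is straightforward; the per-run OA values below are hypothetical numbers for illustration only:

```python
import numpy as np

def summarize_runs(accuracies):
    """Mean accuracy and standard deviation over repeated random splits."""
    a = np.asarray(accuracies)
    return a.mean(), a.std()

# Hypothetical per-run OA values from 10 repetitions of one experiment
runs = [99.1, 99.3, 99.0, 99.4, 99.2, 99.3, 99.1, 99.2, 99.4, 99.0]
mean, std = summarize_runs(runs)
```

Reporting mean ± std rather than a single run is what allows the stability comparison made later (e.g. CSA-MSO3DCNN versus MSDN-SA on the University of Pavia data set).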

Results for Indian Pines Data Set
The comparison results of the classification accuracy for the Indian Pines data set are presented in Table 5. From Table 5, it can be seen that the proposed CSA-MSO3DCNN obtained the best OA, AA and κ, which are 99.68%, 99.45% and 99.62% respectively. Compared with the CNN method, the OA, AA and κ of our method improved considerably, with the AA increasing by about 8%. The classification results of the SSRN and MSO3DCNN methods show that the spatial-spectral features obtained by the octave 3D CNN were better than those obtained by the normal 3D CNN. Furthermore, the proposed method was better than the MSO3DCNN method, which confirms the positive effect of the channel and spatial attention modules.
To show the visual classification results, the classification maps and normalized confusion matrices are shown in Figures 8 and 9 respectively. From Figure 8, the classification map produced by the CSA-MSO3DCNN method is closest to the ground truth map, which means that the CSA-MSO3DCNN method obtains the best classification result. The diagonal values of the matrices in Figure 9 also confirm this, where rows represent predicted values and columns represent actual values.

Results for The University of Pavia Data Set
The comparison results of the classification accuracy and classification maps for the University of Pavia data set are reported in Table 6 and Figure 10. From Table 6 we can see that the proposed CSA-MSO3DCNN achieved the best OA, AA and κ, which are 99.76%, 99.66% and 99.67% respectively. The per-class results obtained by CSA-MSO3DCNN were much better than the others, for example in the first three categories 'Asphalt', 'Meadows' and 'Gravel'. Compared with the MSO3DCNN method (without the attention modules), the OA, AA and κ of CSA-MSO3DCNN all improved, which indicates the effectiveness of the attention modules. The results also demonstrate the superiority of the octave 3D CNN, which replaces the normal 3D CNN in the CSA-MSO3DCNN method. Furthermore, the results show that the methods based on 3D CNNs were significantly better than the methods based on 2D CNNs. Although the MSDN-SA method provided the second best results, the standard deviations show that our proposed CSA-MSO3DCNN performed more stably than MSDN-SA. For the visual results, from Figure 10, the classification map obtained by the proposed method is closest to the ground truth map. For example, the center yellow area, which represents the class 'Baresoil', is entirely yellow in our classification map, with no other colors, meaning there were no misclassified pixels. The normalized confusion matrices of the classification results are depicted in Figure 11. From Figure 11a, there are several colored off-diagonal entries in the confusion matrix obtained by the CNN method, which means some pixels were misclassified into other categories. From Figure 11a-f, it can be seen that the per-class classification accuracy obtained by CSA-MSO3DCNN was higher than the others.

Results for the Grss_dfc_2013 Data Set
The classification accuracy, the classification maps and the normalized confusion matrices of all methods for the Grss_dfc_2013 data set are listed in Table 7 and shown in Figures 12 and 13 respectively. From Table 7, it can be seen that the proposed CSA-MSO3DCNN method achieved the best OA, AA and κ, which were 99.69%, 99.72% and 99.66% respectively. The classification accuracy of each class was at least 99.09%, and our lowest per-class accuracy was higher than the lowest accuracies obtained by the other methods.

Results for The Grss_dfc_2014 Data Set
This is a challenging data set due to its low resolution. The comparison results of the classification accuracy and classification maps for the Grss_dfc_2014 data set are shown in Table 8 and Figure 14. From Table 8, the best OA, AA and κ were obtained by our proposed CSA-MSO3DCNN method: 97.96%, 96.19% and 97.37% respectively. Compared with the most competitive MSDN-SA method, the proposed CSA-MSO3DCNN method achieved a great improvement; in particular, κ increased by 7%. Further comparing the results of MSO3DCNN and CSA-MSO3DCNN, the CSA-MSO3DCNN method was clearly better. This ablation experiment demonstrates that the feature maps were optimized by the channel and spatial attention modules, which can be understood as a feature selection process. Moreover, the MSO3DCNN method also outperformed the other methods. It can be inferred that this data set is sensitive to spatial redundancy and that methods based on octave 3D CNNs obtain better results.
For the visual results, from Figure 14, the classification map obtained by the CSA-MSO3DCNN method is closest to the ground truth map. For example, the 'vegetation' class, at the top right of the classification map, is entirely yellow, matching the ground truth map, while the classification maps obtained by the other methods contain wrong colors to some extent. The corresponding normalized confusion matrices are reported in Figure 15. The diagonal of the confusion matrix obtained by the CSA-MSO3DCNN method is closest to yellow, indicating the best classification accuracy. The experimental results on this data set also demonstrate the superiority of our approach. Overall, our approach excels on all four data sets compared with the other competitive methods, which indicates the robustness and stability of the CSA-MSO3DCNN method. It is worth noting that the experimental results show that spatial information has a large influence on the results. Therefore, in the next subsection, the effects of several parameters on the experiments are discussed.

The Effects of Parameters and Number of Training Samples
In a deep learning framework, the parameters play a significant role in the experiments. There are three main parameters in our method: α, the spatial size and the dropout rate. As the number of training samples affects the quality of the final model [51], its effect on the model is also analyzed.
1. In our method, α characterizes the ratio between the high-frequency and low-frequency groups, which determines the balance between spatial information and spatial redundancy. We therefore tested a series of α values; the OA results are listed in Table 9. They reveal that the best results are obtained on all four data sets when α = 0.25.
2. To examine the influence of the size of the 3D patch s × s × d, experiments with different spatial sizes s × s were conducted on the four data sets, with d set to 20. The OA results are provided in Table 10. They show that neither too large nor too small a spatial size is recommended, since the former brings in excessive noise and the latter too little spatial information; both are detrimental to classification.
3. In the fully connected layers, dropout is generally employed to overcome over-fitting. The effects of various dropout rates are reported in Table 11. It can be observed that 0.5 is a suitable value for all four data sets, suppressing over-fitting while training the model in a balanced way.
To explore the effects of the number of training samples on the final model, we conducted further experiments using different percentages of the whole sample set as the training set, for all six methods and all data sets. For the Indian Pines data set the training ratio was varied from 5% to 20%, for the University of Pavia and Grss_dfc_2014 data sets from 1% to 6%, and for the Grss_dfc_2013 data set from 10% to 20%. The OA results for the four data sets are shown in Figure 16. For all six methods, the OA increases as the training data increase, and our proposed method always provides a better OA than the state-of-the-art methods.

Conclusions
In this paper, we have proposed a new DL-based framework for HSI classification. Although DL-based methods have achieved good results in HSI classification, the automatically extracted features are still rough and contain a lot of noise. Therefore, we investigated how to reduce the noise in the features and select more appropriate features through octave 3D CNNs and attention mechanisms.
The multiscale octave 3D convolution is designed to decrease the spatial redundancy and enlarge the receptive field, both of which are proven important for extracting appropriate features. The three groups of feature maps are then cascaded into one. In addition, a channel attention module and a spatial attention module are employed to refine the feature maps, assigning different weights to the feature maps along both the channel and the spatial dimensions. The refined feature maps have been demonstrated to be beneficial for improving the classification performance, and the ablation experiments have shown the effectiveness of the attention modules. The experimental results on four public HSI data sets demonstrate that the proposed CSA-MSO3DCNN outperforms the state-of-the-art methods. Accordingly, we conclude that our method is well suited to HSI classification.
Because of the limited labeled HSI pixel samples and the difficulty of labeling them, as future work we intend to explore HSI classification methods combined with data augmentation techniques, as well as semi-supervised HSI classification, in order to overcome the problem of limited labeled samples.
Author Contributions: All the authors made significant contributions to this work. Y.X. and D.W. designed and performed the experiments; Q.X. and Y.X. analyzed the results; Q.X. and Y.X. wrote the paper; B.L. and J.L. acquired the funding support. All authors have read and agreed to the published version of the manuscript.