A 3D Cascaded Spectral–Spatial Element Attention Network for Hyperspectral Image Classification

Abstract: Most traditional hyperspectral image (HSI) classification methods rely on hand-crafted or shallow descriptors, which limits their applicability and performance. Recently, deep learning has gradually become the mainstream approach to HSI classification because it can automatically extract deep abstract features. However, learning meaningful features for HSI classification from a small training sample set remains a challenge. In this paper, a 3D cascaded spectral–spatial element attention network (3D-CSSEAN) is proposed to address this issue. The 3D-CSSEAN integrates spectral–spatial feature extraction and attention area extraction for HSI classification. Two element attention modules in the 3D-CSSEAN enable the deep network to focus on primary spectral features and meaningful spatial features. All attention modules are implemented through several simple activation operations and elementwise multiplication operations. As a result, the attention modules add few training parameters to the network, which makes the network structure suitable for small sample learning. The adopted module cascading pattern not only reduces the computational burden of the deep network but can also be easily operated via plug-expand-play. Experimental results on three public data sets show that the proposed 3D-CSSEAN achieves performance comparable to state-of-the-art methods.


Introduction
With the development of remote sensing technology, hyperspectral images (HSIs) have attracted wide attention and have gradually been applied in many fields [1,2]. HSI classification, a fundamental task in this area, assigns a category label to each pixel in the HSI and has attracted increasing attention.
An HSI usually contains hundreds of spectral bands, so it has abundant spectral information in addition to the usual spatial information of the image. In the early stages of HSI classification, there were many works based on spectral or spatial characteristics [3]. Support vector machines (SVMs) were used to address the problem by using spectral information [4]. In the past ten years, many works were based on spectral-spatial feature learning for HSI classification [5,6]. The performance of sparse representation was improved by using the spatial neighborhood information of samples [7]. In [8], principal component analysis (PCA) was used for unsupervised extraction of spectral features and data dimensionality reduction, and edge-preserving features were obtained by edge-preserving cation results. The main contributions of this work can be summarized in the following three aspects.
First, a cascade element attention network is proposed to extract meaningful features, which can give different weight responses to each element in the 3D data. Two element attention modules are employed to enhance the important spectral features and strengthen the interesting spatial features, respectively.
Second, the proposed element attention modules are implemented through several simple activation operations and elementwise multiplication operations. Therefore, the implementation of the attention module does not add too many parameters, which makes the network model suitable for small sample learning.
Third, the proposed attention modules are plug-and-play and can be implemented in a single branch, which makes them more time-efficient.
The rest of this paper is organized as follows: In Section 2, the existing attention methods for HSI classification are discussed. The proposed 3D-CSSEAN model is described in detail in Section 3. Experimental results and analysis are presented in Section 4. In Section 5, the influence of the number of attention blocks and of the number of training samples on the model is discussed. Finally, conclusions are summarized in Section 6.

Related Work
In this section, the existing attention methods for HSI classification are reviewed briefly. According to the different ways of paying attention to spectral and spatial features, these methods can be roughly divided into three categories:

1. Global operation-based methods. These methods apply a global operation, such as global pooling or global convolution, to an HSI or its feature map to obtain the spectral attention weight or the spatial attention weight [26,29]. As shown in Figure 3a, a spectral weight vector of the HSI is obtained by a global operation over the spatial dimensions, and the weight vector is then multiplied by the HSI to achieve spectral attention. Similarly, in Figure 3b, a spatial weight plane of the HSI is obtained by a global operation over the spectral dimension, and spatial attention features are subsequently obtained by multiplying the spatial weight plane by the HSI.

2. Correlation-based methods. Spatial location correlation and inter-channel correlation are used to describe the degree of attention [27,30]. The channel attention module is illustrated in Figure 4a. First, the original HSI or 3D feature tensor is reshaped to a plane with height C and width N, where C is the spectral dimension and N is the number of pixels. Next, matrix multiplication is performed on the plane and its transpose to obtain the channel correlation matrix. Finally, the channel attention features are obtained by multiplying the channel correlation matrix with the transpose matrix. The spatial attention features can be obtained in a similar way, and the spatial attention module is shown in Figure 4b.

3. Multifeature-based methods. These methods usually take the form of two branches; the rough network structure is shown in Figure 5. In [31], the attention module is composed of a trunk branch and a mask branch: the trunk branch consists of several residual blocks, and the mask branch consists of a symmetrical downsampler-upsampler structure. Different features can be extracted by the different network structures in the two branches. Finally, the attention features are obtained by multiplying the features of the trunk branch and the mask branch. Similarly, in [28], attention modules are composed of two branches that extract spectral-spatial features at different scales, and the parts of interest in the images are obtained by multiplying the different-scale features of the two branches. By adopting different structures or utilizing different scales, these attention models can extract meaningful information and improve the performance of classification tasks.

The above three kinds of attention methods may help deep networks pay more attention to the spatial regions of interest and to important spectral bands. Recently, a multiattention fusion network (MAFN) [32] was proposed to merge multiple attention features for classification. MAFN combines the global operation-based method and the correlation-based method. However, these methods still have room for improvement. For global operation-based methods, global pooling is too simple and crude to capture certain local attention features. Correlation-based methods carry too high a computational burden because of the matrix multiplication. Multifeature-based methods suffer from the small sample learning issue and from a computational burden, because two-branch networks inevitably increase the number of parameters. In this paper, the element attention mechanism is used to extract spectral-spatial attention features, which are more meaningful for HSI classification. At the same time, the single-branch network structure yields a network with a lower computational burden and higher time efficiency.
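To make the first two categories concrete, the following NumPy sketch applies global-operation attention and channel-correlation attention to a single patch. It is only an illustration of the general schemes reviewed above, not the implementation of any cited method: the function names, the use of mean pooling with a sigmoid gate, the row-wise softmax on the correlation matrix, and the choice to multiply the correlation matrix back onto the reshaped C × N plane are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_spectral_attention(x):
    """x: (H, W, C). Pool over the spatial dims to get one weight per band."""
    w = sigmoid(x.mean(axis=(0, 1)))            # (C,), sigmoid gate is an assumption
    return x * w                                # broadcast over H and W

def global_spatial_attention(x):
    """Pool over the spectral dim to get one weight per pixel."""
    w = sigmoid(x.mean(axis=2, keepdims=True))  # (H, W, 1)
    return x * w                                # broadcast over C

def channel_correlation_attention(x):
    """Reshape to a C x N plane, build a C x C correlation matrix, reweight."""
    h, w, c = x.shape
    plane = x.reshape(h * w, c).T               # (C, N)
    corr = plane @ plane.T                      # (C, C) channel correlation
    # Row-wise softmax normalisation (assumption, common in such modules).
    e = np.exp(corr - corr.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)
    out = attn @ plane                          # (C, N)
    return out.T.reshape(h, w, c)
```

The C × C product in the last function is the matrix multiplication whose cost is criticized above; the spatial counterpart would build an N × N matrix instead.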

Proposed Method
As illustrated in Figure 2, the proposed 3D-CSSEAN contains four main modules: data dimension reduction module, spectral element attention module, spatial element attention module, and prediction layers. The 3D-CSSEAN firstly uses several 3D convolution operations for data-dimension reduction and spectral-spatial feature extraction. Then, the element attention mechanism is used to make the model focus on the primary spectral features and strengthen meaningful spatial features as well as to suppress unnecessary features. Finally, prediction layers are used to obtain the classification results. To fully utilize the spectral-spatial information of the HSI, each labeled pixel is first expanded into a 3D image patch centered on it, and then the patch is used as the input of the 3D-CSSEAN for training and testing. The training objective of the network is to update the parameters of the 3D-CSSEAN by minimizing cross-entropy loss between the predictive output and the truth label of the patch center pixel.
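The patch construction and the training objective described above can be sketched as follows. This is a minimal illustration under stated assumptions: `extract_patch` and `cross_entropy` are hypothetical helper names, the patch width is assumed to be odd, and reflect padding at the image border is an assumption, since the text does not say how border pixels are handled.

```python
import numpy as np

def extract_patch(cube, row, col, omega=7):
    """Return the omega x omega x B patch centered on pixel (row, col).

    cube: HSI array of shape (H, W, B). omega is assumed to be odd.
    Border handling by reflect padding is an assumption.
    """
    r = omega // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Pixel (row, col) of the cube sits at (row + r, col + r) in the padded
    # array, so this slice is centered on it.
    return padded[row:row + omega, col:col + omega, :]

def cross_entropy(probs, label):
    """Cross-entropy between a predicted class distribution and the true
    label of the patch center pixel."""
    return -np.log(probs[label] + 1e-12)
```

Each labeled pixel then contributes one (patch, label) pair, and the loss is evaluated only on the label of the center pixel.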

Data Dimension Reduction Module
Commonly, using all of the hundreds of bands in an HSI is not optimal for classification and also increases the computational burden, especially for deep learning with a limited training data set. Therefore, data dimension reduction is necessary to improve both classification performance and time efficiency. The input of our model is a 3D image patch. Let the patch size be $\omega \times \omega \times B$, where $\omega \times \omega$ represents the spatial neighborhood of the centered pixel and $B$ is the number of bands of the HSI. In the proposed framework, a data dimension reduction module based on 3D convolution is designed for shallow feature extraction and spectral dimension reduction, as shown in Figure 2. The $i$-th output of the $(k+1)$-th 3D convolutional layer can be formulated as

$$P_i^{k+1} = G\left(\sum_{j=1}^{n_k} P_j^{k} * W_i^{k+1} + b_i^{k+1}\right),$$

where $P_j^{k} \in \mathbb{R}^{\omega \times \omega \times c_k, 1}$ is the $j$-th component of $P^{k}$; $P^{k} \in \mathbb{R}^{\omega \times \omega \times c_k, n_k}$ represents the input feature tensor of the $(k+1)$-th convolutional layer; $\omega \times \omega \times c_k$ is the size of the feature tensor, with $\omega \times \omega$ the spatial size and $c_k$ the spectral size; $n_k$ is the number of convolutional kernels in the $k$-th convolutional layer; $W_i^{k+1}$ and $b_i^{k+1}$ are the weights and the bias of the $i$-th convolutional operation in the $(k+1)$-th layer, respectively; and $*$ denotes the 3D convolutional operation. After each convolution operation, batch normalization (BN) is used to regularize the training process, as in prior work [20]. $G(\cdot)$ represents the BN operation followed by the rectified linear unit (ReLU) activation function. If the output of the convolution operation is expected to be smaller than its input, then either the convolution stride must be greater than 1, or the kernel size must be greater than 1 with no boundary padding. In the proposed model, three 3D convolutional layers, $C_1$, $C_2$, and $C_3$, are used for spectral-dimension reduction, as shown in Figure 2.
These convolutional layers use a 3D convolution kernel of size $1 \times 1 \times L_i$, $L_i > 1$, and add a subsampling procedure with a stride of $(1, 1, S_i)$, $S_i \geq 1$, where $i$ is 1, 2, or 3, corresponding to $C_1$, $C_2$, and $C_3$. The three entries of the kernel size $1 \times 1 \times L_i$ specify the height, width, and spectral dimensionality of the 3D convolution window, respectively. In particular, the convolutional layer $C_3$ integrates all the spectral features into one dimension by not padding the boundary, which is convenient for subsequent spatial feature extraction.
To better understand this process, an example diagram is used to illustrate the data dimension reduction module on the Indian Pines data set. As shown in Figure 6, let the input of the model be a tensor of size 7 × 7 × 200, where 7 × 7 represents the spatial size of the tensor and 200 is the spectral dimensionality. The first convolutional layer $C_1$ uses a convolution operation with a stride of 2 to reduce the spectral dimension from 200 to 97. The second convolutional layer $C_2$ uses a 1 × 1 × 7 convolution kernel without boundary padding to reduce the spectral dimension from 97 to 91. Finally, the convolutional layer $C_3$ uses a 1 × 1 × 91 convolution kernel without boundary padding to integrate all the spectral features into one dimension.

Figure 6. Diagram of the data dimension reduction process on the Indian Pines data set.
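The spectral sizes in this example follow from the usual output-length formula for an unpadded strided convolution. The short sketch below checks the arithmetic; `conv_out_len` is a hypothetical helper, and the spectral kernel size of 7 for the first layer is an assumption, since the text gives only its stride and the resulting size.

```python
def conv_out_len(n, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1

bands = 200
after_c1 = conv_out_len(bands, kernel=7, stride=2)  # kernel size 7 is assumed
after_c2 = conv_out_len(after_c1, kernel=7)         # 1 x 1 x 7, no padding
after_c3 = conv_out_len(after_c2, kernel=91)        # 1 x 1 x 91, no padding
print(after_c1, after_c2, after_c3)                 # 97 91 1
```

Because the last kernel spans the whole remaining spectral axis, the third layer collapses that axis to length 1 regardless of the earlier choices, provided its kernel length matches the size produced by the second layer.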

Spectral Element Attention Module
Following the data dimension reduction module, a spectral element attention module is designed to extract deep, meaningful spectral features for each patch. The spectral element attention module is composed of several attention blocks, which are shown in Figure 7. The red dotted box in Figure 7 represents an attention block, which can be defined as follows:

$$\mathrm{weighted\_P} = \mathrm{softmax}\left(\tanh\left(P^{k} * W^{k+1} + b^{k+1}\right)\right),$$

$$P^{k+1} = G\left(\mathrm{weighted\_P} \times P^{k}\right), \quad (5)$$

where $P^{k}$ is the input tensor of the spectral element attention block, $P^{k+1}$ is the output of the spectral element attention block, $W^{k+1}$ and $b^{k+1}$ are the weights and the bias of the convolutional operation in the $(k+1)$-th layer, respectively, $*$ represents the 3D convolutional operation, and $\times$ represents the elementwise multiplication operation. To extract spectral features, a $1 \times 1 \times L_e$, $L_e > 1$, convolution kernel is used, where $L_e$ is the kernel size along the spectral dimension. Moreover, $\tanh(\cdot)$ and $\mathrm{softmax}(\cdot)$ represent the tanh and softmax activation functions, respectively. The tanh activation function acts as a contrast stretch, which increases the relative separability of data around zero. The softmax activation function maps the outputs to a probability distribution with values between 0 and 1, which are taken as the weight map (or mask) of the spectral features. The attention block pays different levels of attention to the spectral features via the elementwise multiplication between $\mathrm{weighted\_P}$ and $P^{k}$. Finally, the output of the element attention block is obtained through the BN layer and the activation layer, denoted by $G(\cdot)$ as in the data dimension reduction module. Since this method can give a different attention weight to each element in the tensor, this attention block is called an element attention block. It should be noted that, through the padding strategy, the output tensors of the convolution operation have the same size as the input tensors, which guarantees that the elementwise multiplication can be carried out.
To illustrate the method more clearly, an example diagram of the spectral element attention block is given. As shown in Figure 8, let the input of a spectral element attention block be a feature tensor of size (7 × 7 × 91, 24), where 7 × 7 represents the spatial size of the feature map, 91 is the spectral dimensionality, and 24 is the number of 3D feature maps. First, a convolution layer with kernel size 1 × 1 × 3 is used to extract spectral features from the input data. The tanh activation and the softmax activation are then used to transform the spectral features into attention weights. Finally, spectral attention features are obtained by elementwise multiplication between the original feature tensor and the attention weights.

Remote Sens. 2021, 13, 2451

From the above process, it can be seen that the spectral element attention block first extracts features by 3D convolution, then converts the features into attention weights with two simple activation functions, and finally performs the elementwise multiplication between the weights and the features of the previous layer. The element attention method can give a different weight to any element in the tensor, and thereby pays more attention to detailed features. It considers all the elements of the feature map, so local details are not lost. Meanwhile, this single-branch implementation does not add many training parameters, so the model converges easily and is practical for small data sets. However, the module still has a limitation. Because the values of $\mathrm{weighted\_P}$ are in the range [0, 1], multiplying them with the features $P^{k}$ may degrade the features in deeper layers. Drawing on the idea of the residual network [20], this problem can be mitigated by adding $P^{k+i+1}$ and $P^{k+i}$. Equation (5) is reformulated as follows:

$$P^{k+i+1} = G\left(\mathrm{weighted\_P} \times P^{k+i}\right) + P^{k+i},$$

where $+$ denotes the elementwise addition, and $P^{k+i}$ and $P^{k+i+1}$ represent the input and output of the $i$-th attention block, respectively.
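A spectral element attention block of this kind can be sketched in NumPy for a single patch. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the softmax is assumed to act along the spectral axis (the text does not state the axis), BN is approximated by a per-feature-map standardization without learned scale and shift, the kernel length is assumed odd, and the skip connection is assumed to add the block input after BN and ReLU.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spectral_conv_same(p, w, b):
    """1 x 1 x L convolution along the spectral axis with 'same' padding.

    p: (H, W, C, N_in), w: (L, N_in, N_out) with odd L, b: (N_out,).
    """
    length, c = w.shape[0], p.shape[2]
    pad = length // 2
    pp = np.pad(p, ((0, 0), (0, 0), (pad, pad), (0, 0)))
    out = np.zeros(p.shape[:3] + (w.shape[2],))
    for l in range(length):
        out += pp[:, :, l:l + c, :] @ w[l]     # mix feature maps at offset l
    return out + b

def bn_relu(x, eps=1e-5):
    """Stand-in for G(.): standardize each feature map, then ReLU."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return np.maximum((x - mean) / np.sqrt(var + eps), 0.0)

def attention_weights(p, w, b):
    """weighted_P: convolution, then tanh, then softmax (spectral axis assumed)."""
    return softmax(np.tanh(spectral_conv_same(p, w, b)), axis=2)

def spectral_element_attention(p, w, b, residual=True):
    out = bn_relu(attention_weights(p, w, b) * p)  # elementwise reweighting
    return out + p if residual else out            # assumed skip connection
```

The only trainable parameters in this sketch are the convolution weights and bias, which reflects the point made above that the attention weights come from parameter-free activations.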

Spatial Element Attention Module
The spatial element attention module has a structure similar to that of the spectral element attention module. Unlike the spectral element attention module, the spatial element attention module uses a convolutional kernel of size $L_a \times L_a \times 1$, $L_a > 1$, for spatial feature extraction. The structure of a spatial element attention block is shown in Figure 9. First, a convolution layer with kernel size 7 × 7 × 1 is used to extract spatial features from the input data. Then spatial attention weights and attention features are obtained in the same way as in the spectral element attention module. It should be noted that the input of the spatial module is (7 × 7 × 1, 24), because the $C_3$ convolutional layer reduces the spectral dimension to 1, as shown in Figure 2. The spatial element attention module is also composed of several spatial element attention blocks, as shown in Figure 7.
As can be seen from the above, element attention yields different degrees of attention for both spectral and spatial features, so the model does not need separate global pooling methods designed for the spectral features and the spatial features.
Finally, in the prediction layers, an average pooling layer is used to reduce the dimensions of the feature tensor, and a flatten layer, a fully connected layer, and a softmax activation function are adopted for classification.
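The prediction layers can be sketched as below for one patch. This is an illustration under assumptions: `prediction_layers` is a hypothetical name, the average pooling window is assumed to cover the whole 7 × 7 spatial extent (the text does not give the pooling size), and the fully connected weights are random placeholders rather than trained values.

```python
import numpy as np

def prediction_layers(features, fc_w, fc_b):
    """features: (H, W, 1, N) output of the spatial attention module.

    fc_w: (N, num_classes), fc_b: (num_classes,).
    """
    pooled = features.mean(axis=(0, 1))   # average pooling over space (assumed global)
    flat = pooled.reshape(-1)             # flatten to a vector of length N
    logits = flat @ fc_w + fc_b           # fully connected layer
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
feats = rng.standard_normal((7, 7, 1, 24))
num_classes = 16                          # illustrative value only
probs = prediction_layers(feats,
                          rng.standard_normal((24, num_classes)) * 0.1,
                          np.zeros(num_classes))
predicted_label = int(np.argmax(probs))
```

The resulting vector is a class distribution for the center pixel of the patch, which is what the cross-entropy loss compares against the true label.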

Analysis of the Role of the tanh Function
In this section, the influence of the tanh function on the data is briefly analyzed. The curve of the tanh function on the interval [−5, 5] is shown in Figure 10. Outside this interval, the value of the tanh function approaches −1 as the value on the horizontal axis decreases and approaches 1 as it increases. The tanh function has a higher slope at 0 and its surroundings than at other positions, which means that the image contrast stretch in this area is greater than in other areas. Moreover, the preprocessed data follow a Gaussian distribution with zero mean and unit variance, so many values are distributed near 0. Thus, the tanh function can increase the relative separability of most data. At the same time, the tanh function can suppress the contrast at values that are too large or too small. To show the effect of the tanh function, the visualization of the image after tanh transformation is provided in Figure 11.
Figure 11. (d-f) represent the image after tanh transformation of (a-c), respectively.
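The slope argument above can be checked numerically. The following is a minimal sketch, not part of the original experiments: the helper names `tanh_slope` and `gap_ratio` and the chosen sample points are illustrative assumptions. It evaluates the derivative of tanh, 1 − tanh²(x), at a few points, estimates how much of a standard normal sample lies near 0, and compares how well a small gap between two values is preserved near 0 and far from 0.

```python
import math
import random


def tanh_slope(x):
    """Derivative of tanh at x, i.e. 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def gap_ratio(a, b):
    """Ratio of the gap between a and b after tanh to the gap before tanh."""
    return (math.tanh(b) - math.tanh(a)) / (b - a)


# The slope is largest at 0 and decays quickly as |x| grows.
slopes = {x: tanh_slope(x) for x in (0.0, 1.0, 2.0, 5.0)}

# For zero-mean, unit-variance Gaussian data, most values lie close to 0
# (illustrative synthetic sample, not hyperspectral data).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(10000)]
frac_near_zero = sum(abs(s) < 1.0 for s in samples) / len(samples)

# A small gap near 0 is almost fully preserved, while the same-sized gap
# at large values is strongly compressed.
near = gap_ratio(-0.1, 0.1)
far = gap_ratio(3.0, 3.2)

print(slopes)
print(f"fraction of samples with |x| < 1: {frac_near_zero:.3f}")
print(f"gap ratio near 0: {near:.4f}, gap ratio near 3: {far:.4f}")
```

Because the output range of tanh is bounded to (−1, 1), the contrast near 0 is enhanced only in a relative sense: differences between values near 0 are kept almost unchanged, while differences between extreme values are compressed.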

Experimental Setup
This section evaluates the performance of our method on three public hyperspectral image data sets. The Indian Pines data set includes 16 vegetation classes and has 224 bands from 400 to 2500 nm. After removing water absorption bands, it had 145 × 145 pixels with 200 bands. The Kennedy Space Center data set includes 13 classes and has 224 bands from 400 to 2500 nm. After removing water absorption bands, it had 512 × 453 pixels with 176 bands. The Salinas Scene data set includes 16 classes and has 224 bands from 360 to 2500 nm. After removing water absorption bands, it had 512 × 217 pixels with 204 bands.
In the Indian Pines data set, the labeled samples were unbalanced. In the Kennedy Space Center data set, the number of labeled samples was small. Compared with the Indian Pines and Kennedy Space Center data sets, the labeled samples in the Salinas Scene data set were more numerous and more balanced. Therefore, these three data sets represented three different situations, and verifying the proposed method in all three cases could better demonstrate its generalization ability. For the Indian Pines and Kennedy Space Center data sets, about 5%, 5%, and 90% of the labeled samples were randomly selected as training, validation, and testing data sets, respectively. For the Salinas Scene data set, due to the large number of overall labeled samples, a smaller training ratio of about 1%:1%:98% was set. Moreover, all three data sets were normalized to a Gaussian distribution with zero mean and unit variance. The overall accuracy (OA%), average accuracy (AA%), and Kappa coefficient (Kappa × 100) were used to evaluate the classification performance of the proposed method. The higher these index values, the better the classification performance of the method. Each method was randomly run ten times, and the mean and standard deviation of the classification indices were reported. All the experiments were implemented with a GTX 2080Ti GPU, 16 GB of RAM, Python 3.6, TensorFlow 1.10, and the Keras 2.1.0 framework.
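For reference, the three evaluation indices can be computed from a confusion matrix as in the minimal sketch below. It uses the standard definitions of OA, AA (mean per-class recall), and Cohen's Kappa; the function names and the toy labels are illustrative assumptions and are not taken from the original experiments.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def classification_scores(y_true, y_pred, n_classes):
    """Return (OA%, AA%, Kappa x 100) for integer class labels."""
    m = confusion_matrix(y_true, y_pred, n_classes)
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(n_classes))

    # Overall accuracy: fraction of all samples classified correctly.
    oa = correct / total

    # Average accuracy: mean of per-class recall over classes that occur.
    recalls = [m[i][i] / sum(m[i]) for i in range(n_classes) if sum(m[i]) > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    row_sums = [sum(m[i]) for i in range(n_classes)]
    col_sums = [sum(row[i] for row in m) for i in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - pe) / (1.0 - pe)

    return 100 * oa, 100 * aa, 100 * kappa


# Toy example with three classes (made-up labels, for illustration only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
oa, aa, kappa = classification_scores(y_true, y_pred, 3)
print(f"OA = {oa:.2f}, AA = {aa:.2f}, Kappa x 100 = {kappa:.2f}")
```

AA weights every class equally, so it is more sensitive than OA to errors on small classes, which is why both are reported for unbalanced data sets such as Indian Pines.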
For clarity, Table 1 shows the shapes of the input and output data and the specific parameters of the convolutional operations in the 3D-CSSEAN for the Indian Pines data set. The settings for the Kennedy Space Center and Salinas Scene data sets are the same as those for Indian Pines except for the band number of the input data. C_spe and C_spa in Table 1 indicate the convolution operations in the spectral element attention module and the spatial element attention module, respectively. n_k was set to 24 for each convolutional layer, and experiments show that changing n_k within a small range had little impact on the result.

Comparison and Analysis of Experimental Results
To evaluate the superiority and effectiveness of the proposed 3D-CSSEAN model, it was compared with several machine learning and deep learning classification methods. These methods included a traditional machine learning method, SVM; state-of-the-art 3D deep learning models such as SSRN [20] and HybridSN [21]; and the latest attention networks, such as CDSCN [28] and MAFN [32]. SVM was implemented with the scikit-learn machine learning library. The Radial Basis Function (RBF) was selected as the kernel function on the three data sets, and the grid search method was used to determine the best values of the parameters C and gamma. The other comparison methods were implemented through the code published with their papers [20,21,28,32]. For fairness of comparison, the input image patch size was set to 7 × 7 × B for all methods except HybridSN, where B was the band number of the HSI. For HybridSN, in order to make the network work without changing the network structure, the input image patch size was set to 11 × 11 × B, which was the closest parameter setting. For SVM and HybridSN, the number of PCA principal components was set to 30, which is the same as in the literature on HybridSN [21].
Classification results of the different methods on the testing data of the three data sets are reported in Tables 2-4. As shown, the 3D-CSSEAN achieved the best results on most indicators compared with the other methods. In our experiments, all deep learning methods outperformed SVM, which indicates that these deep learning models are generally superior to the traditional machine learning method in HSI classification. On the Indian Pines data set, the 3D-CSSEAN, MAFN, and CDSCN achieved better results than the other methods. These results suggest that, in the case of imbalanced categories, the attention models pay more attention to meaningful features and therefore achieve better results. Compared with the two other attention methods, the 3D-CSSEAN improved OA, AA, and Kappa by at least 0.89%, 1.52%, and 1.01%, respectively. Moreover, the AA of the 3D-CSSEAN was 0.89% higher than the best result of the other compared methods. These results indicate that the proposed method has good stability and robustness under the condition of unbalanced samples. On the Kennedy Space Center data set, the 3D-CSSEAN, SSRN, CDSCN, and MAFN achieved at least 22% improvement compared to HybridSN and SVM. A possible reason is that HybridSN and SVM use PCA for dimension reduction, while the 3D-CSSEAN, SSRN, CDSCN, and MAFN are end-to-end network structures. In an end-to-end network, the dimension reduction is learned in a supervised way, so it is more effective than the unsupervised PCA. Compared with SSRN and CDSCN, the 3D-CSSEAN achieved 2% and 1.75% improvement on OA, respectively. The 3D-CSSEAN also achieved results comparable to those of the latest MAFN, which was slightly better than the 3D-CSSEAN on AA. A possible reason is that the spatial distribution of some categories in the Kennedy Space Center data set was relatively scattered. MAFN uses a correlation-based attention method to extract spatial features, and correlation-based methods may better capture the connections between scattered samples of these categories. The resulting increase in the accuracy of these categories can improve AA. On the Salinas Scene data set, all methods achieved an overall accuracy higher than 94%, while the 3D-CSSEAN was 0.42, 0.47, and 0.55 higher than the best result of the other methods on OA, Kappa, and AA, respectively.
In general, the three attention methods, CDSCN, MAFN and 3D-CSSEAN, achieved good results, indicating that the attention features extracted by them are beneficial to classification. These results indicated that the proposed element attention method can also effectively improve the classification performance. According to the results of the three data sets, the 3D-CSSEAN has good generalization ability on different data sets.
The classification maps of the five methods and the corresponding ground truth maps of the three data sets are shown in Figures 12-14. It can be clearly seen from these results that the higher the classification accuracy, the better the continuity of the classification map. For the Indian Pines data set, there were obvious noise and discontinuous regions, as shown in Figure 12b, while the classification effect of the 3D-CSSEAN was relatively good. As shown in Figure 13, although there are very few labeled samples in the Kennedy Space Center data set, the 3D-CSSEAN still achieved good results. On the contrary, many obvious misclassified pixels can be seen in Figure 13b,c. All methods achieved over 94% overall accuracy on Salinas Scene data sets; however, there were still significant differences, which can be observed in Figure 14. It can be seen from Figure 14g that the 3D-CSSEAN still performed well at the edge of the category and the easily confused area.
Training and testing times provide a direct measure of the computational efficiency of HSI classification methods. In Table 5, the training time and the test time on the test data of different methods are shown. As presented in Table 5, because their inputs were the data under dimension reduction through PCA, the training time of SVM and HybridSN was significantly lower than that of other methods. Additionally, the time efficiency of the 3D-CSSEAN was higher than that of SSRN, CDSCN, and MAFN. As for MAFN, this may be because it uses a mixture of global operation-based and correlation-based methods to extract attention features, so it is relatively time-consuming. In particular, the training and testing time of the 3D-CSSEAN was about half that of the CDSCN method. The possible reason for this is that CDSCN adopts the dual branches mode, while the 3D-CSSEAN adopts the single branch mode, and thus it can save about half of the running time.

Ablation Studies
Three ablation experiments were conducted to analyze the contribution of different attention modules to HSI classification. The results are shown in Table 6. NONE means the 3D-CSSEAN without the spectral and spatial attention modules. SPE-EAN indicates the 3D-CSSEAN with only the spectral attention module, and SPA-EAN indicates the 3D-CSSEAN with only the spatial attention module. The experimental results showed that either attention module is helpful for classification, and that the role of the spatial attention module is more obvious than that of the spectral attention module. In terms of OA, SPA-EAN was 1.25%, 0.87%, and 1.06% higher than SPE-EAN on the Indian Pines, Kennedy Space Center, and Salinas Scene data sets, respectively. These results suggest that the spatial element attention module is more conducive to acquiring discriminative features for classification. The OA obtained by the 3D-CSSEAN was clearly improved compared with the model without spectral-spatial attention: it was 3.17%, 3.66%, and 1.99% higher on the Indian Pines, Kennedy Space Center, and Salinas Scene data sets, respectively. The ablation results show that the proposed cascaded spectral-spatial element attention module can obtain more meaningful spectral and spatial features, thereby improving the final classification results.
To verify the contribution of the tanh activation function to the classification task, a series of experiments was conducted on the three data sets. The results are shown in Table 7. As can be seen from Table 7, AA, Kappa, and OA were all improved on the three data sets by using the tanh function. Compared with the model without tanh, the OA improvements obtained by the 3D-CSSEAN with tanh were 0.56% (Indian Pines), 0.49% (Kennedy Space Center), and 0.12% (Salinas Scene). The AA increases were 0.49% (Indian Pines), 0.82% (Kennedy Space Center), and 0.04% (Salinas Scene). The Kappa coefficient improvements were 0.64% (Indian Pines), 0.55% (Kennedy Space Center), and 0.14% (Salinas Scene). These results indicate that the tanh function helps to enhance the separability of features and improve the classification performance. In addition, the standard deviation of all the results decreased when the tanh function was used, which shows that the tanh function also improves the stability of the model.

Influence of the Attention Block Number
On three public data sets, the influence of the attention block number on the classification performance was analyzed. The experimental results are shown in Figure 15. In the figure, iSPE_jSPA on the horizontal axis represents i attention blocks in the spectral element attention module and j attention blocks in the spatial element attention module. Figure 15a-c, respectively, show the influence of the attention block number on overall accuracy, average accuracy, and Kappa coefficient. As can be seen from the figure, on the Salinas Scene data set, the number of attention blocks had little effect on the results. In particular, the model with 1SPE_1SPA achieved an OA of over 98%, indicating that the network structure with only one spectral element attention block cascaded to one spatial element attention block extracted enough features to improve the classification performance. On the Indian Pines and Kennedy Space Center data sets, when the number of spectral attention blocks was 1, all three indicators fluctuated greatly as the number of spatial attention blocks increased. In the case of 1SPE_3SPA, all the indicators were significantly reduced. This result shows that when the spectral features are not sufficiently extracted, blindly adding deeper spatial features will not bring good results. When the number of spectral attention blocks was greater than 2, the indicators on the Kennedy Space Center data set tended to be stable, and at the same time, the fluctuation range on the Indian Pines data set also narrowed.
When the number of spectral attention blocks was 2 and the number of spatial attention blocks increased from 1 to 2, both OA and Kappa increased slightly on the three data sets. In the case of 2SPE_2SPA, the best OA was achieved on the Kennedy Space Center and Salinas Scene data sets. As for the Indian Pines data set, when the number of attention blocks increased, the improvement in classification performance was limited. Furthermore, as the number of attention blocks increased, the time efficiency was bound to decrease. Overall, the network with 2SPE_2SPA achieved the best or very close to the best results on the three indicators. In addition, it performed well on all three data sets, indicating better generalization performance. Based on the above analysis, the network structure of our final model is 2SPE_2SPA.

Influence of Different Training Sample Numbers
To evaluate the performance of the proposed 3D-CSSEAN under different numbers of training samples, four groups of labeled samples with different percentages were randomly selected as training samples. Specifically, 1%, 3%, 5%, and 10% of each category were randomly selected from the labeled samples as training samples on the Indian Pines and Kennedy Space Center data sets, and 0.1%, 0.5%, 1%, and 3% of each category were randomly selected from the labeled samples as training samples on the Salinas Scene data set. The experimental results are shown in Figure 16.
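The per-category random selection described above can be sketched as follows. This is only an illustration: the function name `stratified_split`, the fixed seed, the validation fraction, and the rule of keeping at least one training and one validation sample per class are assumptions, since the exact sampling procedure is not specified in the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac, val_frac, seed=0):
    """Randomly split sample indices per class into train/val/test lists.

    Assumption: every class contributes at least one training sample and
    one validation sample when enough samples exist, even if the requested
    fraction rounds down to zero.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = max(1, round(len(idxs) * train_frac))
        n_val = max(1, round(len(idxs) * val_frac))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# Toy unbalanced label vector (made-up class sizes, for illustration only).
labels = [0] * 100 + [1] * 40 + [2] * 10
train, val, test = stratified_split(labels, train_frac=0.05, val_frac=0.05)
print(len(train), len(val), len(test))
```

Sampling within each category rather than over the whole label set keeps small classes represented in the training set, which matters for unbalanced data such as Indian Pines when the training ratio is as low as 1%.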
On the Indian Pines data set, the advantages of the 3D-CSSEAN were more obvious when 1% and 3% of the labeled samples were used for training. Meaningful features extracted by the 3D-CSSEAN were more conducive to improving the classification performance in the case of small samples. Moreover, there was a significant decrease in the OA of CDSCN when only 3% of the labeled samples were used for training, indicating that CDSCN is prone to overfitting small training data. However, the 3D-CSSEAN did not add many training parameters in the implementation of the attention module, and thus this problem can be avoided to some extent. On the Kennedy Space Center data set, the three different attention models, the 3D-CSSEAN, MAFN, and CDSCN, achieved better results than the other methods, especially at 1% and 3%. These results indicate that these three attention features are beneficial for classification on the Kennedy Space Center data set. On the Salinas Scene data set, all methods achieved relatively close results, but the results of the 3D-CSSEAN were always the highest. In most cases, all methods could achieve good results, but when only 0.1% of the labeled samples were used for training, the 3D-CSSEAN and MAFN had more obvious advantages.
In general, on Indian Pines and Salinas Scene data sets, the 3D-CSSEAN consistently outperformed the other approaches on all the training samples. As for the Kennedy Space Center data set, the results of the 3D-CSSEAN and MAFN were very close, and these results were better than those from the other comparison methods. Through these experimental investigations, it can be concluded that the 3D-CSSEAN has better classification performance and robustness in different training sample sets, and especially in the case of small samples, this advantage is more obvious. In addition, the MAFN method based on multiple attention combinations also demonstrated its competitiveness, especially on the Kennedy Space Center data set, where the spatial distribution of categories was relatively scattered. This shows that the combination of multiple attention methods is a promising research direction. In the future, perhaps the combination of the proposed element attention method and other attention methods will also produce more competitive results.

Conclusions
In this paper, a 3D cascaded spectral-spatial element attention network (3D-CSSEAN) is proposed to extract the meaningful features for hyperspectral image classification. The spectral element attention module and the spatial element attention module can make the network focus on primary spectral features and meaningful spatial features. Two element attention modules were implemented through several simple activation functions and elementwise multiplication. Therefore, the proposed model not only can obtain features that facilitate classification, but also has high computational efficiency. Since the implementation of the attention module does not add too many training parameters, it also makes the network structure suitable for small sample learning.
To evaluate the effectiveness of the method, extensive experiments were implemented on three public data sets: Indian Pines, Kennedy Space Center and Salinas Scene. Compared with the machine learning method, the popular deep learning methods and the attention methods, the proposed method obtained better classification performance. In cases with small samples, the advantages of the proposed method are more obvious. These results verify that the attention features obtained by the 3D-CSSEAN are beneficial for classification, and the 3D-CSSEAN is suitable for small sample learning. To evaluate the effectiveness of attention modules, several ablation experiments were conducted. From the results of the ablation experiments, both the spectral element attention module and the spatial element attention module have improved classification performance.
Extensive experiments showed that extracting more meaningful features for classification from limited training samples is a direction worth exploring. In addition, the fusion of multiple attention features may be a promising approach, but how to ensure its time efficiency remains to be studied in the future.