Hyperspectral Image Classification Based on Two-Branch Spectral–Spatial-Feature Attention Network

Abstract: Although most deep-learning-based hyperspectral image (HSI) classification methods achieve great performance, it remains a challenge to use small-size training samples to remarkably enhance the classification accuracy. To tackle this challenge, a novel two-branch spectral–spatial-feature attention network (TSSFAN) for HSI classification is proposed in this paper. Firstly, two inputs with different spectral dimensions and spatial sizes are constructed, which can not only reduce the redundancy of the original dataset but also accurately explore the spectral and spatial features. Then, we design two parallel 3DCNN branches with attention modules, in which one focuses on extracting spectral features and adaptively learning the more discriminative spectral channels, and the other focuses on exploring spatial features and adaptively learning the more discriminative spatial structures. Next, a feature attention module is constructed to automatically adjust the weights of different features based on their contributions to classification, which remarkably improves the classification performance. Finally, we design a hybrid 3D–2DCNN architecture to acquire the final classification result, which significantly decreases the complexity of the network. Experimental results on three HSI datasets indicate that the proposed TSSFAN method outperforms several of the most advanced classification methods.


Introduction
Hyperspectral imagery is captured with a spectrometer and supplies rich spectral information, containing tens to hundreds of narrow bands for every image element [1][2][3]. A hyperspectral image (HSI) contains rich ground features [4][5][6], in which both spatial and spectral features are included for each pixel. As a result, it is widely utilized in multiple fields, such as agriculture [7], target detection [8], environmental monitoring [9], urban planning [10], and military reconnaissance [11]. In these applications, the classification of HSI [12][13][14][15] is a basic problem, which aims to find the specific class of each pixel.
Over the past few decades, diverse classification methods have been proposed to tackle this problem, such as support vector machines (SVM) [16], k-nearest neighbor (KNN) [17], random forests [18], and multinomial logistic regression (MLR) [19]. However, these methods share a common disadvantage: they classify each pixel by applying only its spectral information, while the spectral signature of a pixel belonging to one category is very likely mixed with the spectral signatures of pixels from other categories. Therefore, these classification methods are not robust to noise, and their classification results do not always perform well.
To solve such problems, various novel classification methods have been introduced in the past several years, which try to improve the classification performance by incorporating spatial information. One category of these methods attempts to design diverse feature extraction approaches, including local binary pattern (LBP) histogram feature extraction [20] and extended morphological profiles (EMP) extraction [21]. The disadvantage of this type of method is that only a single feature is extracted, so the improvement in classification performance is limited. The other category tries to fuse spectral information with spatial contexts by adopting the joint sparse representation classification (JSRC) model [22]. Representative methods include the space–spectrum combined kernel filter [23], the multiple kernel sparse representation classifier with superpixel features [24], and the kernel sparse representation-based classification (KSRC) method [25]. Compared with the methods that only use spectral information, these methods can effectively enhance the classification performance; however, the designed classification models are more complex and less adaptable. Moreover, all of the classification methods mentioned above design and extract features based on specific data with particular structures. They have no universality for diverse hyperspectral datasets and cannot simultaneously achieve good results for data with different structures.
Therefore, researchers gradually introduce deep-learning mechanisms [26][27][28][29][30] to replace manual feature extraction, which can automatically learn features and solve various problems caused by the diversification of hyperspectral data structures. Chen [31] first applies the deep-learning network SAE to HSI classification and proposes a deep-learning model that fuses spectral and spatial features to obtain high classification accuracy. Then, more and more deep-learning models [32][33][34][35] are explored by researchers. Zhao [36] introduces the deep belief network (DBN) model into HSI classification, where the data are preprocessed by principal component analysis (PCA) to decrease the redundancy; the hierarchical learning of features and the use of logistic regression to extract spatial–spectral features achieve good experimental results. Wei [37] first applies the convolutional neural network (CNN) to HSI classification, but the established CNN model can only extract spectral features. Chen [38] proposes a CNN-based deep feature extraction method, which establishes a three-dimensional convolutional neural network so that spatial and spectral features can be extracted simultaneously. Zhong [39] proposes the spectral–spatial residual network (SSRN), which facilitates the back propagation of the gradient while extracting deeper spectral features and alleviating the accuracy degradation suffered by other deep-learning models. Sellami [40] introduces a semi-supervised three-dimensional CNN with adaptive dimensionality reduction into HSI classification to solve the curse-of-dimensionality problem. Mei [41] proposes a spectral–spatial attention network and achieves a good training result by incorporating the attention mechanism into the model.
Despite the competitive classification performance achieved by the above deep-learning-based approaches, they still have two major disadvantages. One is that massive training samples are required to learn the parameters in the deep network; however, collecting such labelled data costs considerable money and time, which directly results in a very limited quantity of labelled data in practical applications. The other is that the neural network needs to adjust numerous variables during backpropagation, which results in considerable computation and time costs. Therefore, it remains a challenge to utilize small-size training samples to concurrently extract discriminative spectral–spatial features and remarkably enhance the classification performance.
In this paper, a novel two-branch spectral-spatial-feature attention network (TSSFAN) for HSI classification is proposed. Firstly, two inputs with different spectral dimensions and spatial sizes are constructed for the network, which can not only reduce the redundancy of the original dataset but also accurately and separately explore the spectral and spatial features. Then, two parallel 3DCNN branches with attention modules are designed for the network, in which one focuses on extracting spectral features and adaptively learning the more discriminative spectral channels, and the other focuses on exploring spatial features and adaptively learning the more discriminative spatial structures. Next, the feature attention module is constructed in the fusion stage of the two branches to automatically adjust the weights of different features based on their contributions for the classification. Finally, the 2DCNN network is designed to obtain the final classification result, which can decrease the sophistication of the network and reduce the parameters in the network. Compared with several typical and recent HSI classification methods, the results indicate that our presented TSSFAN is superior to the most advanced methods.
The remainder of the article is organized as follows. The CNN network, the attention mechanism, and the proposed TSSFAN method are introduced in Section 2. The classification results on three different public datasets are presented in Section 3. Finally, the article is concluded in Section 4.

Materials and Methods
In this section, the traditional methods, including 2DCNN, 3DCNN, and the attention mechanism, are first introduced. Then, the process of the proposed TSSFAN method is explained in detail.

2DCNN and 3DCNN
Convolutional neural network (CNN) [42][43][44][45] is commonly employed in computer vision (CV) tasks. Inspired by the thinking mode of the human brain, CNN can automatically learn the spatial features of images by convolution and pooling operations [46][47][48][49][50], and it contains multiple layers of repetitively stacked structures to extract deep information. CNN is originally designed for the recognition of two-dimensional images, so the traditional network structure is a two-dimensional convolutional neural network [51][52][53]. A typical 2DCNN structure is presented in Figure 1. In the convolution layer, the convolution kernel is first used to perform convolutional operations on the input image. Then, the convolutional result is fed into a nonlinear function, and its output is sent to the next layer for further computation. Different from the fully connected (FC) neural network, the training parameters in CNN are remarkably reduced due to the application of the shared convolution kernel. The convolution formula is as follows:

F_l = f(W_l ∗ F_{l−1} + b_l)

where f (·) indicates the nonlinear activation function, and it can strengthen the network's ability to process nonlinear data. F l−1 is the input feature map in layer l − 1, and F l is the output feature map in layer l. W l indicates the convolutional filter, and b l indicates the bias of each output feature map. In the pooling layers, the previous feature maps are sub-sampled to reduce the spatial size. After the multilayer architectures are stacked, the fully connected layer and the SoftMax classifier are typically utilized to present the final results.
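To make the layer computation concrete, the following is a minimal NumPy sketch of a single 2D convolution layer with a ReLU nonlinearity (single input and output channel, stride 1, no padding); the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def conv2d_relu(F_prev, W, b, stride=1):
    """One 2D convolution layer F_l = f(W_l * F_{l-1} + b_l) with f = ReLU.
    F_prev: (H, W) input feature map; W: (k, k) kernel; b: scalar bias."""
    k = W.shape[0]
    H, Wd = F_prev.shape
    out_h = (H - k) // stride + 1
    out_w = (Wd - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Slide the shared kernel over the input (correlation form)
            patch = F_prev[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * W) + b
    return np.maximum(out, 0.0)  # ReLU activation
```

Because the same kernel W is reused at every spatial position, the number of trainable parameters is k × k + 1 regardless of the input size, which is the parameter saving noted above.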
Although 2DCNN can recognize two-dimensional shapes very well, it does not perform satisfactorily when directly processing three-dimensional data. Therefore, 2DCNN is extended to 3DCNN to extract high-level 3D features [54] from three-dimensional data. Figure 2 presents a typical 3DCNN structure. It has a structure highly similar to that of 2DCNN; the difference is that 2DCNN uses a 2D convolution kernel, while 3DCNN uses a 3D convolution kernel. Three-dimensional CNN [55][56][57][58] can simultaneously extract spatial and depth features from three-dimensional data via the 3D convolution kernel.
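The 3D case differs only in that the kernel also slides along the third (depth/spectral) axis. A minimal sketch with a single 3D kernel, again with illustrative names and no padding:

```python
import numpy as np

def conv3d_relu(F_prev, W, b):
    """One 3D convolution over an (H, W, D) cube with a (k1, k2, k3) kernel,
    extracting spatial and depth features jointly, followed by ReLU."""
    k1, k2, k3 = W.shape
    H, Wd, D = F_prev.shape
    out = np.zeros((H - k1 + 1, Wd - k2 + 1, D - k3 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for d in range(out.shape[2]):
                # The kernel now also moves along the third axis
                cube = F_prev[i:i + k1, j:j + k2, d:d + k3]
                out[i, j, d] = np.sum(cube * W) + b
    return np.maximum(out, 0.0)
```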


Attention Mechanism
As the applications of deep learning in many CV tasks become more and more extensive, the attention mechanism [59][60][61] is increasingly used as an auxiliary means in deep networks to optimize the network structure. The attention mechanism [62] is similar to the way human eyes observe things: irrelevant information is ignored while significant information receives attention. It makes the network focus its learning [63], which can remarkably enhance the performance of the network. Figure 3 presents a typical attention module.

As can be seen from Figure 3, the attention module aims to construct an adaptive function, which maps the original images to a matrix that represents the weights of different spatial locations. With the help of such a function, different regions are given independent weights to highlight more relevant and noteworthy information. The process can be expressed by:

X_s = σ(W ∗ [F_max(X), F_avg(X)] + b)
Y = F_attention_sa(X, X_s)

where X is the original image, and Y indicates the output. F_max indicates maximum pooling along the channel dimension, while F_avg represents average pooling along the channel dimension. W and b indicate the convolutional filter and the bias in the convolutional operation, respectively, and σ is the sigmoid function. X_s is the generated weight matrix, and F_attention_sa(·) indicates the spatial-wise multiplication between the original input X and the weight X_s.

The Proposed TSSFAN Method

Figure 4 depicts the flowchart of the proposed TSSFAN method. As shown in the flowchart, the TSSFAN method has four main steps: data preprocessing, two-branch 3DCNN with attention modules, the feature attention module in the co-training model, and 2DCNN for classification. Next, each main step of the TSSFAN method is introduced in detail.


Data Preprocessing
Let the HSI dataset be denoted by I ∈ R H×W×C , where I represents the original input; H, W, C indicate the height, the width, and channel numbers of I. The steps of data preprocessing are described as follows, and Figure 5 shows the process.


(1) PCA is employed to reduce the spectral dimension of the original image and obtain two datasets with different spectral dimensions, P ∈ R H×W×D1 and Q ∈ R H×W×D2.
(2) Based on the two different spectral dimensions, the image with the larger spectral dimension selects a smaller spatial window to create an input 3D cube for each center pixel, while the image with the smaller spectral dimension selects a larger spatial window to create another input 3D cube for each center pixel, where Input_1 is denoted by U ∈ R M×M×D1 and Input_2 is denoted by V ∈ R N×N×D2, respectively.
Through such data preprocessing, we create two inputs with different spectral dimensions and spatial sizes, which can not only reduce the redundancy of the original dataset but also accurately and separately explore the spectral and spatial features.
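The two preprocessing steps can be sketched roughly as follows. The helper names `pca_reduce` and `extract_patch` are hypothetical, and the PCA here is a plain eigen-decomposition of the spectral covariance matrix, assumed equivalent in spirit to the paper's PCA step:

```python
import numpy as np

def pca_reduce(img, n_components):
    """Project an (H, W, C) image onto its top principal spectral components."""
    H, W, C = img.shape
    X = img.reshape(-1, C)
    X = X - X.mean(axis=0)
    # Eigen-decomposition of the C x C spectral covariance matrix
    cov = X.T @ X / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (X @ top).reshape(H, W, n_components)

def extract_patch(img, row, col, window):
    """Cut a window x window neighbourhood cube centred on one pixel,
    reflect-padding the borders so edge pixels also get full patches."""
    r = window // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    return padded[row:row + window, col:col + window, :]
```

With hypothetical sizes D1 > D2 and M < N, Input_1 would be `extract_patch(pca_reduce(I, D1), i, j, M)` and Input_2 would be `extract_patch(pca_reduce(I, D2), i, j, N)` for each labelled pixel (i, j).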

Two-Branch 3DCNN with Attention Modules
After data preprocessing, two inputs with different spectral dimensions and spatial sizes are obtained. Then, we design two parallel 3DCNN branches, where each branch contains an attention module. One branch focuses on spatial feature extraction, and the other focuses on spectral feature extraction. Moreover, the attention module in each branch can automatically adjust the weights of the spatial features and spectral features for different input data, concentrating on more discriminative spatial structures and spectral channels.

A. 3DCNN with Spectral-Spatial Attention
For Input_1 U ∈ R M×M×D 1 with the larger spectral dimension, we design a branch of 3DCNN with spectral-spatial attention, which focuses on extracting spectral features and adaptively learning the more discriminative spectral channels. Figure 6 shows the process. Below, the two main steps of this branch are presented in detail.
Step 1: Let Input_1 U ∈ R M×M×D 1 pass through the spectral-spatial attention module, which can automatically adjust the weights of the spectral features and the spatial features for different input data, concentrating on more discriminative spectral channels and spatial structures. Specifically, Figure 7 presents the spectral-spatial attention module.
As can be seen from Figure 7, let the input be denoted by Y ∈ R H×W×C, where H, W, C indicate the height, the width, and the channel numbers of Y. Let the spectral attention map be denoted by A_se ∈ R 1×1×C, the spatial attention map by A_sa ∈ R H×W×1, and the output by Y_ea ∈ R H×W×C. The computation process can be denoted as:

Y_ea = F_attention_sa(F_attention_se(Y, A_se), A_sa)

where Y_ea is the output, and F_attention_se(·), F_attention_sa(·) represent the spectral-wise multiplication and the spatial-wise multiplication, respectively.
In the spectral-spatial attention module, the acquisition of the spectral attention map and the spatial attention map are two necessary parts. Additionally, the process of obtaining the spectral attention map and the spatial attention map is as follows.
(a) Spectral attention map: The spectral attention exploits the inter-channel relationships of feature maps and aims to construct an adaptive function, which maps the original images to a vector that represents the weights of different spectral bands. As can be seen from Figure 8a, the global average pooling and the global max pooling are first operated to squeeze the spatial dimension and obtain the Avg-Pool Y_avg^c ∈ R 1×1×C and the Max-Pool Y_max^c ∈ R 1×1×C, respectively. Then, the Avg-Pool and Max-Pool are passed through two fully connected layers with shared parameters. Finally, the two outputs are added and passed through the sigmoid function to obtain the spectral attention map. The process is calculated as follows:

A_se = σ(W_2(W_1(Y_avg^c)) + W_2(W_1(Y_max^c)))

where W_1, W_2 are the shared parameters of the two fully connected layers, and σ is the sigmoid function.
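The spectral attention map can be sketched in NumPy as below, assuming the two shared fully connected layers are plain matrices W1 (reduction) and W2 (expansion) with a ReLU between them; these shapes are assumptions, not given in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(Y, W1, W2):
    """Channel attention map A_se in R^{1x1xC}: squeeze the spatial dims
    by global average/max pooling, pass both descriptors through shared
    FC layers, add the results, and apply a sigmoid."""
    y_avg = Y.mean(axis=(0, 1))  # (C,) Avg-Pool descriptor
    y_max = Y.max(axis=(0, 1))   # (C,) Max-Pool descriptor
    # Shared two-layer perceptron: W1 reduces C, W2 restores it
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)
    A_se = sigmoid(mlp(y_avg) + mlp(y_max))
    return A_se.reshape(1, 1, -1)  # broadcastable channel weights

# Spectral-wise multiplication would then be: Y * spectral_attention(Y, W1, W2)
```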
(b) Spatial attention map: The spatial attention aims to construct an adaptive function, which maps the original images to a matrix that represents the weights of different spatial locations. From Figure 8b, the global average pooling and the global max pooling are first operated along the channel direction, squeezing the spectral dimension to obtain the Avg-Pool Y_avg ∈ R H×W×1 and the Max-Pool Y_max ∈ R H×W×1, respectively. Next, we concatenate the Avg-Pool and Max-Pool and pass them through a 2D convolutional layer. At last, the spatial attention map A_sa is generated with the application of a sigmoid function. The process is calculated as follows:

A_sa = σ(W_3 ∗ [Y_avg, Y_max] + b_3)

where Y_avg, Y_max indicate the feature maps obtained by the global average pooling and the global max pooling, respectively, W_3 and b_3 are the parameters of the 2D convolutional layer, and σ is the sigmoid function.
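A rough NumPy sketch of the spatial attention map follows; the single 2D kernel W3 acting on the two stacked pooled maps is an assumed shape, and edge padding keeps the output the same spatial size as the input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(Y, W3, b3):
    """Spatial attention map A_sa in R^{HxWx1}: pool along the channel
    axis, concatenate the two maps, 2D-convolve with one kernel, sigmoid."""
    y_avg = Y.mean(axis=2)  # (H, W) channel-wise average pooling
    y_max = Y.max(axis=2)   # (H, W) channel-wise max pooling
    stacked = np.stack([y_avg, y_max], axis=2)  # (H, W, 2) concatenation
    k = W3.shape[0]
    r = k // 2
    padded = np.pad(stacked, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W = y_avg.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * W3) + b3
    return sigmoid(out)[..., None]  # (H, W, 1) broadcastable weights
```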
Step 2: As can be seen from Figure 6, two 3D convolutional layers are employed to extract spectral and spatial features simultaneously after passing through the spectral-spatial attention module. Finally, output 1 with size {S × S × C, P} is obtained. In the 3D convolution, the convolution formula is calculated by:

F_l = Relu(W_l ∗ F_{l−1} + b_l)

where F_{l−1} is the input feature map in layer l − 1, and F_l is the output feature map in layer l. W_l indicates the 3D convolutional filter, and b_l indicates the bias of each output feature map. Relu(·) is the nonlinear activation function.

B. 3DCNN with Spatial-Spectral Attention
For Input_2 V ∈ R N×N×D2 with the larger spatial size, we design a branch of 3DCNN with spatial-spectral attention, which focuses on exploring spatial features and adaptively learning the more discriminative spatial structures. Figure 9 shows the process.
The two branches are very similar, and the main difference is that they use different attention modules: the spectral-spatial attention module presented in Figure 7 and the spatial-spectral attention module presented in Figure 10, respectively. By comparing Figures 7 and 10, we can see that the spectral-spatial attention module applies spectral attention before spatial attention, while the spatial-spectral attention module does the opposite. The reason for this design is that Input_1 U ∈ R M×M×D1 contains more spectral information, so spectral attention is given priority, while Input_2 V ∈ R N×N×D2 contains more spatial information, so spatial attention is given priority. Finally, through the similar structure, output 2 with size {S × S × C, P} is also obtained. The size of the output in both branches is the same, because we acquire outputs of the same size by controlling the convolution operations, so that the outputs of the two branches can be merged later.

Feature Attention Module in the Co-Training Model
As shown in Figure 11, the next step is to concatenate the two branches for co-training. The outputs of the two 3DCNN branches are merged together, and we obtain the output with size {S × S × C, 2P}. We consider that different features from different branches do not contribute equally to the classification task. If we can fully explore this prior information, then the learning ability of the entire network will be improved to a considerable extent. Therefore, we construct the feature attention module to automatically adjust the weights of different features based on their contributions to classification, which can remarkably enhance the classification performance. Figure 12 presents the feature attention module.
In the feature attention module, for the input X_i ∈ R S×S×C×2P, the global average pooling and the global max pooling are operated in the direction of the channel to obtain the global feature description maps F_avg ∈ R 1×1×1×2P and F_max ∈ R 1×1×1×2P. The weight X_w ∈ R 1×1×1×2P of different features is obtained through a structure similar to the spectral attention map. Finally, the output X_j ∈ R S×S×C×2P is calculated as:

X_j = X_i ⊗ X_w

where ⊗ indicates the channel-wise multiplication.
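The feature attention module can be sketched analogously to the spectral attention above; here the pooling squeezes all three leading axes so that each of the 2P feature maps receives one scalar weight. W1 and W2 are assumed shared fully connected layers, not parameters specified in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_attention(X, W1, W2):
    """Weight the 2P concatenated feature maps X in R^{SxSxCx2P} by their
    contribution: global avg/max pooling over (S, S, C), shared FC layers,
    sigmoid, then channel-wise multiplication X_j = X_i (x) X_w."""
    f_avg = X.mean(axis=(0, 1, 2))  # (2P,) global average descriptors
    f_max = X.max(axis=(0, 1, 2))   # (2P,) global max descriptors
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)
    X_w = sigmoid(mlp(f_avg) + mlp(f_max))   # (2P,) feature weights
    return X * X_w.reshape(1, 1, 1, -1)      # re-weighted feature maps
```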


Feature Attention Module in the Co-Training Model
As shown in Figure 11, the next step is to concatenate the two branches for co-training. The outputs of the two 3DCNN branches are merged together, and we obtain the output with size {S × S × C, 2P}. We consider that different features from different branches do not contribute equally to the classification task. If we can fully exploit this prior information, the learning ability of the entire network will be improved to a considerable extent. Therefore, we construct the feature attention module to automatically adjust the weights of different features based on their contributions to classification, which can remarkably enhance the classification performance. Figure 12 presents the feature attention module.
In the feature attention module, for the input X_i ∈ R^{S×S×C×2P}, global average pooling and global max pooling are operated in the direction of the channel to obtain the global feature description maps F_avg ∈ R^{1×1×1×2P} and F_max ∈ R^{1×1×1×2P}. The weight X_w ∈ R^{1×1×1×2P} of the different features is obtained through a structure similar to that of the spectral attention map. Finally, the output X_j ∈ R^{S×S×C×2P} is calculated as

X_j = X_w ⊗ X_i,

where ⊗ indicates the channel-wise multiplication. By the constructed feature attention module, different features are given different weights based on their contributions for the classification task. We acquire the output with size {S × S × C, 2P}.
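A minimal NumPy sketch of this weighting is given below. The shared two-layer MLP (CBAM-style, with an assumed reduction ratio of 2) is our assumption for the "structure similar to the spectral attention map"; the actual learned structure may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_attention(x, w1, w2):
    # x: (S, S, C, 2P). Pool over everything except the feature axis.
    f_avg = x.mean(axis=(0, 1, 2))               # F_avg: (2P,)
    f_max = x.max(axis=(0, 1, 2))                # F_max: (2P,)

    # Shared two-layer MLP applied to both descriptors, then summed (assumed)
    def mlp(f):
        return np.maximum(f @ w1, 0.0) @ w2

    x_w = sigmoid(mlp(f_avg) + mlp(f_max))       # X_w: (2P,) feature weights
    return x * x_w[None, None, None, :]          # X_j = X_w ⊗ X_i

S, C, P = 9, 9, 8
x = rng.random((S, S, C, 2 * P))
w1 = rng.standard_normal((2 * P, P)) * 0.1       # assumed reduction ratio 2
w2 = rng.standard_normal((P, 2 * P)) * 0.1
y = feature_attention(x, w1, w2)
print(y.shape)                                   # (9, 9, 9, 16)
```

Each of the 2P feature maps is rescaled by a single learned weight, so features that contribute more to classification are emphasized while the output size {S × S × C, 2P} is preserved.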

2DCNN for Classification
As shown in Figure 13, the result of the feature attention module enters the 2DCNN network to further extract features and obtain the final classification result. The purpose of introducing the 2DCNN network instead of continuing to use the 3DCNN network is to decrease the sophistication of the network and reduce its parameters. The main steps are the following.
(1) For the result with size {S × S × C, 2P}, the convolution kernel with size {1 × 1 × C} is adopted to convert the 3D feature maps with size {S × S × C, 2P} to the 2D feature maps with size {S × S, k}.
(2) Then, the 2D feature maps with size {S × S, k} are sent to the 2D convolutional layers to promote the fusion of features and further extract features with stronger representation ability.
(3) The 2D convolutional layer is concatenated with the fully connected layers. Finally, the SoftMax classifier is employed to predict the category of each pixel.
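Step (1) can be sketched as follows: a {1 × 1 × C} kernel spans the full spectral depth at each spatial position, so each of the k output 2D maps is a linear combination of the C × 2P values at that position. The kernel values here are random placeholders, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_3d_to_2d(x, kernels):
    # x: (S, S, C, 2P) 3D feature maps; kernels: (k, C, 2P),
    # each one a 1 x 1 x C filter over all 2P input channels.
    # At every spatial position the filter collapses the spectral depth,
    # yielding k two-dimensional feature maps of size S x S.
    return np.einsum('ijcp,kcp->ijk', x, kernels)

S, C, P, k = 9, 9, 8, 32
x = rng.random((S, S, C, 2 * P))
maps_2d = conv_3d_to_2d(x, rng.standard_normal((k, C, 2 * P)) * 0.01)
print(maps_2d.shape)                             # (9, 9, 32)
```

After this collapse, only cheap 2D convolutions are needed, which is where the parameter saving of the 3D-2D hybrid comes from.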

Experimental Result and Analysis
In this section, the three HSI datasets utilized in our experiments are described first, and the experimental configurations are then presented. Next, the influences of the main parameters on the classification performance of our proposed TSSFAN method are analyzed. Additionally, the proposed TSSFAN is compared with several of the most advanced classification methods to verify its superiorities.

Data Description
In our experiment, we consider three openly accessible HSI datasets, including Indian Pines (IP), University of Pavia (UP), and Salinas Scene (SA).
(1) Indian Pines (IP): IP was acquired by the AVIRIS sensor in June 1992; its spatial size is 145 × 145 and the number of spectral bands is 224. Specifically, its spectral resolution is 10 nm, and the wavelength range of IP is 0.4-2.5 μm. Additionally, sixteen categories are contained in IP, and only 200 effective bands can be utilized because the 24 bands that could carry noise information are excluded.
(2) University of Pavia (UP): UP was acquired by the ROSIS sensor; its spatial size is 610 × 340 and the number of spectral bands is 115. The wavelength range of UP is 0.43-0.86 μm. Specifically, nine categories are contained in UP with 42,776 labeled pixels. In the experiment, only 103 effective bands can be utilized because the 12 bands that could carry noise information are excluded.

Experimental Configuration
All experiments were implemented on a computer with an AMD Ryzen 7 4800H CPU and an Nvidia GeForce RTX 2060 GPU. We employed Windows 10 as the operating system, the PyTorch 1.2.0 deep-learning framework, and Python 3.6. In our experiments, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (Kappa) were adopted as the evaluation metrics to quantitatively assess the classification performance.
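The three metrics can be computed from the confusion matrix as follows; this is a self-contained sketch using the standard definitions of OA, AA, and Cohen's Kappa:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    # Confusion matrix: rows = true class, columns = predicted class
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                 # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)              # per-class recall
    aa = per_class.mean()                                 # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                          # Cohen's Kappa
    return oa, aa, kappa

# Toy example: 6 pixels, 3 classes, one misclassification
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
print(round(oa, 4), round(aa, 4), round(kappa, 4))        # 0.8333 0.8333 0.75
```

OA measures the fraction of correctly labeled pixels, AA averages the per-class recalls (so small classes count equally), and Kappa discounts agreement expected by chance.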

Analysis of Parameters
In this section, we analyze the influences of the three main parameters on the classification performance of our proposed TSSFAN: learning rate, spectral dimension, and spatial size.
(1) Learning rate: During the gradient descent process of a deep-learning model, the weights are constantly updated. A few hyperparameters play an instrumental role in controlling this process properly, and one of them is the learning rate. The convergence capability and the convergence speed of the network can be productively regulated by a suitable learning rate. In our trials, the effect of the learning rate on the classification performance is tested, where the value of the learning rate is set to {0.00005, 0.0001, 0.0003, 0.0005, 0.001, 0.003, 0.005, 0.008}. Figure 17 shows the experimental results.
From Figure 17, we can observe that there is a gradual rise in accuracy as the learning rate increases from 0.00005 to 0.001, while there is a considerable drop as the learning rate further grows from 0.001 to 0.008 for all three datasets. The convergence speed of the network is reduced when the learning rate is lower, which extends the learning time of the model and weakens the classification performance. However, the network may fail to converge or converge to a local optimum if the learning rate is too high, which can also negatively affect the classification performance. Based on the experimental results, 0.001 is chosen as the optimal learning rate for the three datasets to acquire the best classification performance.
(2) Spectral dimension: Input-1 contains more spectral information and less spatial information. The spectral dimension of Input-1 determines how much spectral information is available to classify the pixels. In our experiments, the spectral dimension of Input-1 is set to {21, 23, 25, 27, 29, 31, 33}. Figure 18 presents the experimental results.
Figure 18 shows a trend of rising first and then falling as the spectral dimension increases from 21 to 33 for the three datasets. At the beginning, the classification accuracy increases because more spectral information can be provided as the spectral dimension increases. Nevertheless, as the spectral dimension further increases, although more spectral information can be supplied, some noise information is also introduced, which reduces the classification accuracy. Figure 18 reveals that we achieve the highest classification accuracy when we fix the spectral dimension to 27 for the IP and SA datasets and 23 for the UP dataset. The spectral dimension in this situation can provide sufficient spectral information; although some noise information is also introduced, it can be effectively suppressed by the spectral attention in the network. It should be mentioned that, since Input-2 contains less spectral information, we preset its spectral dimension to 9 for the three datasets, which minimizes the computational complexity while guaranteeing the basic spectral information.
(3) Spatial size: Input-2 contains more spatial information and less spectral information. The spatial size of Input-2 determines how much spatial information is available to classify the pixels. In our experiment, the spatial size is set to {25 × 25, 27 × 27, 29 × 29, 31 × 31, 33 × 33, 35 × 35, 37 × 37}. Figure 19 presents the experimental results.
Figure 19 illustrates a gradual improvement and then a gradual fall as the spatial size increases from 25 to 37 for the three datasets. In the initial stage, the classification accuracy grows because more spatial context and spatial structures become available as the spatial size increases. However, as the spatial size further increases, pixels and spatial structures belonging to different classes are introduced, which reduces the classification performance. Figure 19 indicates that the highest classification accuracy is obtained when we fix the spatial size to 33 × 33 for the three datasets. It should be mentioned that, since Input-1 contains less spatial information, we preset its spatial size to 9 × 9 for the three datasets, which ensures the basic spatial information while minimizing the computational complexity.
According to the above parameter analysis, Table 4 lists all final parameters.
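The construction of the two inputs with these chosen sizes can be sketched as below. PCA is used here as an assumed spectral-reduction method, and the border handling is simplified (the cube is assumed pre-padded); the paper's exact preprocessing may differ:

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_reduce(cube, n_components):
    # cube: (H, W, B) hyperspectral image; reduce B bands to n_components
    H, W, B = cube.shape
    flat = cube.reshape(-1, B)
    flat = flat - flat.mean(axis=0)
    # Eigendecomposition of the band covariance matrix
    cov = flat.T @ flat / (flat.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return (flat @ top).reshape(H, W, n_components)

def extract_patch(cube, row, col, size):
    # Centered spatial patch around the labeled pixel
    r = size // 2
    return cube[row - r:row + r + 1, col - r:col + r + 1, :]

cube = rng.random((50, 50, 200))                            # toy HSI: 200 bands
input_1 = extract_patch(pca_reduce(cube, 27), 25, 25, 9)    # spectral emphasis
input_2 = extract_patch(pca_reduce(cube, 9), 25, 25, 33)    # spatial emphasis
print(input_1.shape, input_2.shape)                         # (9, 9, 27) (33, 33, 9)
```

Input-1 keeps many bands over a small neighborhood, Input-2 keeps few bands over a large neighborhood, matching the selected parameters above.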

Comparisons to the State-of-the-Art Methods
In our experiment, we compare the presented TSSFAN method with SVM [16], the two-dimensional convolutional neural network (2DCNN) [43], the three-dimensional convolutional neural network (3DCNN) [54], the spectral-spatial residual network (SSRN) [39], the hybrid spectral CNN (HybridSN) [64], and the spectral-spatial attention network (SSAN) [41]. Tables 5-7 show the classification performance of each method on IP, UP, and SA. Compared with the other competitor models, our proposed TSSFAN acquires the highest OA, AA, and Kappa on all three datasets. In particular, TSSFAN still achieves a high classification accuracy of 98.26% when only 1% of the samples of Pavia University are used for training. The main reason is that our proposed method creates two inputs with different spectral dimensions and spatial sizes for the network, which can accurately and separately explore the spectral and spatial features even with a very small training set. Although 2DCNN, 3DCNN, SSRN, and HybridSN design different network structures to acquire stronger classification performance, our presented TSSFAN method obtains a higher OA than all of them on all three datasets. In addition, our presented TSSFAN acquires better per-class accuracy than the competitor methods in most cases. Notably, our proposed method achieves 100% accuracy in the Alfalfa and Grass-pasture-mowed categories for Indian Pines and in the Brocoli_green_weeds_2 and Lettuce_romaine_5wk categories for Salinas. The main reason is that our method designs a two-branch 3DCNN with attention modules to focus on more discriminative spectral channels and spatial structures, which can effectively enhance the classification performance. Moreover, although SSAN utilizes the attention mechanism to concentrate on more significant information in the classification task, our proposed TSSFAN method acquires a better OA on all three datasets.
This is largely because our method constructs the feature attention module, which automatically adjusts the weights of different features based on their contributions to classification, so the classification accuracy is improved. As a result, the superiorities of the presented TSSFAN are fully verified: creating two inputs with different sizes to emphasize, respectively, the accurate extraction of spectral and spatial information; designing a two-branch 3DCNN with attention modules to focus on more discriminative spectral channels and spatial structures; and constructing the feature attention module to concentrate on the features contributing more to the classification task. From Figures 20-22, we find that the classification maps acquired by SSRN, HybridSN, and SSAN have smoother boundaries and edges, while those acquired by SVM, 2DCNN, and 3DCNN present more misclassifications. Our proposed TSSFAN achieves the most accurate classification map, with fewer classification errors and smoother boundaries and edges. The main reason is that TSSFAN introduces the attention mechanism to focus on more discriminative information for classification, which provides a more detailed and accurate classification map.
Finally, we test the computational efficiency of the 2DCNN, 3DCNN, and TSSFAN methods on the three datasets to verify the superiority of the hybrid architecture. From Table 8, we find that the calculation time of the presented TSSFAN, including training and testing time, is much less than that of 3DCNN but larger than that of 2DCNN for all three datasets. This is mainly because our proposed TSSFAN designs the hybrid architecture of 3D-2DCNN, which not only makes the network lightweight but also remarkably reduces the complexity of the model.

Conclusions
In this paper, a novel two-branch spectral-spatial-feature attention network (TSSFAN) is proposed for HSI classification. TSSFAN designs two parallel 3DCNN branches with attention modules for two inputs with different spectral dimensions and spatial sizes, which respectively focus on extracting the more discriminative spectral and spatial features. Moreover, TSSFAN constructs the feature attention module to automatically adjust the weights of different features based on their contributions to classification, which remarkably enhances the classification performance, and utilizes a 2DCNN to obtain the final classification result. To verify the effectiveness and superiorities of the proposed method, TSSFAN is compared with several advanced classification methods on three real HSI datasets. The experimental results confirm that our proposed TSSFAN can fully extract more discriminative spectral and spatial features to further improve the classification accuracy, and that it achieves the highest classification accuracy and clearly performs better than the other compared methods. Nevertheless, there still exist some points that could be further improved. In future work, our research will focus on how to optimize a deep-learning framework with an attention mechanism to extract more discriminative spectral-spatial features under the small-training-sample situation and further improve the classification performance.