Improved Transformer Net for Hyperspectral Image Classification

In recent years, deep learning has been successfully applied to hyperspectral image (HSI) classification problems, with several convolutional neural network (CNN) based models achieving appealing classification performance. However, due to the multi-band nature and the data redundancy of hyperspectral data, CNN models underperform in such a continuous data domain. Thus, in this article, we propose an end-to-end transformer model entitled SAT Net that is appropriate for HSI classification and relies on the self-attention mechanism. The proposed model uses the spectral attention mechanism and the self-attention mechanism to extract the spectral and spatial features of the HSI, respectively. Initially, the original HSI data are remapped into multiple vectors containing a series of planar 2D patches after passing through the spectral attention module. On each vector, we perform a linear transformation to compress the sequence vector length. During this process, we add a position-encoding vector and a learnable embedding vector to capture the long-distance continuous spectrum relationships in the HSI. Then, we employ several multi-head self-attention modules to extract the image features and complete the proposed network with a residual network structure to solve the gradient dispersion and over-fitting problems. Finally, we employ a multilayer perceptron for the HSI classification. We evaluate SAT Net on three publicly available hyperspectral datasets and challenge our classification performance against five current classification methods employing several metrics, i.e., overall and average classification accuracy and the Kappa coefficient. Our trials demonstrate that SAT Net attains a competitive classification performance, highlighting that a self-attention transformer network is appealing for HSI classification.


Introduction
Hyperspectral images (HSI) comprise high-dimensional data containing massive information in both the spatial and spectral dimensions. Given that ground objects have diverse characteristics in different dimensions, hyperspectral images are appealing for ground object analysis, ranging from agricultural production, geology, and mineral exploration to urban planning and ecological science [1][2][3][4][5][6][7][8][9][10]. Early attempts exploiting HSI mostly employed support vector machine (SVM) [11][12][13], k-nearest neighbor (KNN) [14], and multinomial logistic regression (MLR) [15] schemes. Traditional feature extraction mostly relies on feature extractors designed by human experts [16,17] exploiting domain knowledge and engineering experience. However, these feature extractors are not appealing in the HSI classification domain, as they ignore the spatial correlation and local consistency and neglect exploiting the spatial feature information of HSI. Additionally, the redundancy of HSI data makes classification a challenging research problem.
In recent years, deep learning (DL) has been widely used in the field of remote sensing [18]. Given that deep learning can extract more abstract image features, the literature suggests several DL-based HSI classification methods. Typical examples include the Stacked Autoencoder (SAE) [19][20][21], Deep Belief Network (DBN) [22], Recurrent Neural Network (RNN) [23,24], and Convolutional Neural Network (CNN) [25][26][27]. For example, Dend et al. [19] use a layered and stacked sparse autoencoder to extract HSI features, while Wan et al. [20] propose a joint bilateral filter and a stacked sparse autoencoder, which can effectively train the network using only a limited number of labeled samples. Zhou et al. [21] employ a semi-supervised stacked autoencoder with co-training: when the training set expands, confident predictions of unlabeled samples are generated to improve the generalization ability of the model. Chen et al. [22] suggest a deep architecture combined with the finite element of the spectral space using an improved DBN to process three-dimensional HSI data. These methods [19][20][21][22][23][24] achieved best results in the three datasets of IN, UP, and SA of 98.39% [21], 99.54% [19], and 98.53% [21], respectively. Zhou et al. [23] extend the long short-term memory (LSTM) network to exploit the spectral space and suggest an HSI classification scheme that treats HSI pixels as a data sequence to model the correlation of information in the spectrum. Hang et al. [24] use a cascaded RNN model with gated recurrent units to explore the HSI redundant and complementary information, i.e., reduce redundant information and learn complementary information, and fuse different properly weighted feature layers. Zhong et al. 
[25] designed an end-to-end spectral-spatial residual network (SSRN), which uses consecutive spectral and spatial residual blocks to reduce accuracy loss and achieve better classification performance in the case of uneven training samples. In [26], the authors propose a deep feature fusion network (DFFN), which introduces residual learning to optimize multiple convolutional layers as identity mappings that can extract deeper feature information. Additionally, the work of [27] suggests a five-layer CNN framework that integrates both the spatial context and the spectral information of the HSI. Although the current literature manages an overall appealing classification performance, the classification accuracy, network parameters, and model training should still be improved.
Deep neural network models increase the accuracy of classification problems; however, as the depth of the network increases, they also cause network degradation and increase the difficulty of training. Prompted by He et al. [28], the residual network (ResNet) was introduced into the HSI classification problem [29][30][31]. Additionally, Paoletti et al. [30] design a novel CNN framework based on a feature residual pyramid structure, while Lee et al. [31] propose a residual CNN network that exploits the context of adjacent pixel vectors in depth using residuals. These network models with a residual structure afford a deep network that learns more easily, enhances gradient propagation, and effectively solves deep learning-related problems such as gradient dispersion.
Due to the three-dimensional nature of HSI data, current methods suffer a certain degree of spatial or spectral information loss. To this end, 3D-CNNs are widely used for HSI classification [32][33][34][35], with Chen et al. [32] proposing a 3D-CNN finite element model combined with regularization that uses regularization and virtual sample enhancement methods to solve the problem of over-fitting and improve the model's classification performance. Seydgar et al. [33] suggest an integrated model that combines a CNN with a convolutional LSTM (CLSTM) module that treats adjacent pixels as a sequence of recursive processes and makes full use of vector-based and sequence-based learning methods to generate deep semantic spectral-spatial characteristics, while Rao et al. [34] develop a 3D adaptive spatial-spectral pyramid CNN model (ASSP-SCNN) that can fully mine spatial and spectral information; additionally, training the network with variable-sized samples increases scale invariance and reduces overfitting. In [35], the authors suggest a deep CNN (DCNN) scheme that during network training combines an improved cost function with a support vector machine (SVM) and adds category separation information to the cross-entropy cost function, promoting between-class compactness and separability during the process of feature learning. These methods [32][33][34][35] achieved best results in the three datasets of IN, UP, and SA of 99.19%, 99.87%, and 98.88%, respectively [33]. However, despite the appealing accuracy of CNN-based solutions, these impose a high computational burden and increase the network parameters. The models proposed in [33] and [35] converge at 50 and 100 epochs, respectively. To solve this problem, quite a few algorithms extract the spatial and spectral features separately and introduce the attention mechanism for HSI classification [36][37][38][39][40][41]. For example, Zhu et al. 
[36] propose an end-to-end residual spectral-spatial attention network (RSSAN), which can adaptively select spatial and spectral information; through weighted learning, this module enhances the information features that are useful for classification. Haut et al. [37] introduce the attention mechanism into the residual network (ResNet), suggesting a new visual-attention-driven technology that considers bottom-up and top-down visual factors to improve the feature extraction ability of the network. Wu et al. [38] develop a 3D-CNN-based residual group channel and space attention network (RGCSA) appropriate for HSI classification, combining bottom-up and top-down attention structures with residual connections, making full use of context information to optimize the features in the spatial dimension and focus on the area with the most information; this method achieved 99.87% and 100% overall classification accuracy on the IN and UP datasets, respectively. Li et al. [39] design a joint spatial-spectral attention network (JSSAN) to simultaneously capture the long-range interdependence of spatial and spectral data through similarity assessment and adaptively emphasize the characteristics of informative land cover and spectral bands, while Mou et al. [40] improve the network by involving a network unit for the spectral attention module, using the global spectral-spatial context and a learnable spectral attention module to generate a series of spectral gates reflecting the importance of each spectral band. Qing et al. 
[41] propose a multi-scale residual network model with an attention mechanism (MSRN). The model uses an improved residual network and a spatial-spectral attention module to extract hyperspectral image information from different scales multiple times, fully integrating and extracting the spatial-spectral features of the image, and achieves a good classification effect on the HSI classification problem. These methods [36][37][38][39][40][41] achieved a best result on the SA dataset of 99.85% [37].
Although CNN models manage good results on the HSI classification problem, these models still have several problems. First, the HSI classification task is at the pixel level, and thus, due to the irregular shape of the ground objects, the typical convolution kernel is unable to capture all the features [42]. Another deficiency of CNNs is the small-size convolution kernel, which limits the CNN's receptive field from matching the hyperspectral features over their entire bandwidth. Thus, the in-depth utilization of CNNs is limited, and the requirements on convolution kernels vary greatly between classification tasks. Due to the large HSI spectral dimensionality, it is not trivial to exploit the long-range sequential dependence between distant positions of the spectral bands, because it is difficult for CNN-based HSI classification to use context-specific convolutional kernels that capture all the spectral features.
Spurred by the above problems, this paper proposes a self-attention-based transformer (SAT) model for HSI classification. Indeed, the transformer model was initially used for natural language processing (NLP) [43][44][45][46][47], achieving great success and attracting significant attention. To date, transformer models have been successfully applied to computer vision fields such as image recognition [48], target detection [49], image super-resolution [50], and video understanding [51]. Hence, in this work, the proposed SAT Net model first processes the original HSI data into multiple flat 2D patch sequences through the spectral attention module and then uses their linear embedding sequence as the input of the transformer model. The image feature information is extracted via a multi-head self-attention scheme that incorporates a residual structure. Due to its core components, our model effectively solves the gradient explosion problem. Verification of the proposed SAT Net on three public HSI data sets against current methods reveals its appealing classification performance.
The main contributions of this work are as follows:
1. Our network employs both a spectral attention module and a self-attention module to extract feature information, avoiding feature information loss.
2. The core process of our network involves an encoder block with multi-head self-attention, which successfully handles the long-distance dependence of the spectral band information of the hyperspectral image data.
3. In our SAT Net model, multiple encoder blocks are directly connected using a multilevel residual structure, effectively avoiding the information loss caused by stacking multiple sub-modules.
4. Our proposed SAT Net is interpretable, enhancing its HSI feature extraction capability and increasing its generalization ability.
5. Experimental evaluation on HSI classification against five current methods highlights the effectiveness of the proposed SAT Net model.
The remainder of this article is organized as follows. Section 2 introduces in detail the multi-head self-attention, the encoder block, the spectral attention, and the overall architecture of the proposed SAT Net. Section 3 analyzes the interplay of each hyper-parameter of SAT Net and challenges it against five current methods. Finally, Section 4 summarizes this work.

Methodology
In this section, we first introduce the spectral attention module; then, we derive a detailed formula for the multi-head self-attention module and the encoder module. Finally, we give the detailed HSI classification process of the proposed model.

Spectral Attention Block
The attention mechanism [52] imitates the internal process of a biological observation behavior. It is a mechanism that aligns internal experience and external sensation to increase observation precision and can quickly extract important features from the data. The attention mechanism is currently an important concept in neural networks and is widely used in several computer vision tasks [53]. In this paper, we introduce the spectral attention module to enhance the feature extraction ability of the proposed deep learning network. Given a feature map F ∈ R^(H×W×C) as input, we define a 1-D spectral attention map M_s ∈ R^(1×1×C). The purpose of using spectral attention is to extract information features useful for HSI classification by changing the weight of the spectral information, which can be defined as:

y_s = F ⊗ σ(W_1 δ(W_0 y_avg) + W_1 δ(W_0 y_max))  (1)

where F ∈ R^(H×W×C) is the input feature map, ⊗ represents element-wise multiplication, y_s is the output of the spectral attention, and σ(·) denotes the gating function. y_avg and y_max represent the vectors produced by global average and global maximum pooling, respectively. The first FC layer is used as a dimensionality-reduction layer parameterized by W_0 ∈ R^((C/r)×C), while the second FC layer is a dimensionality-increasing layer parameterized by W_1 ∈ R^(C×(C/r)). δ refers to the ReLU activation function, and W_0 and W_1 are shared between the two pooling branches. Finally, we multiply the attention map with the input F to obtain y_s.
The spectral attention module is presented in Figure 1, where we use global average and global maximum pooling to extract the spectral information of the image. The two different pooling schemes extract more abstract spectral features, which are then fed through two FC layers and activation functions to establish the two pooling-channel descriptors. Then, we combine the weights of the two spectral feature channels. Finally, the newly assigned feature weight is multiplied by the input feature map to correct the weights of the input feature map, allowing the network to extract higher-level feature information.
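As a concrete illustration, the pooling, shared-MLP, and reweighting steps described above can be sketched in NumPy. The function name `spectral_attention`, the reduction ratio `r`, and the sigmoid gate on the summed branches are illustrative assumptions rather than the authors' exact implementation:

```python
import numpy as np

def spectral_attention(F, W0, W1):
    """Channel-wise (spectral) attention over a feature map F of shape (H, W, C).

    W0: (C, C//r) reduction weights, W1: (C//r, C) expansion weights (shared
    between the average- and max-pooling branches). Returns the reweighted
    feature map with the same shape as F.
    """
    y_avg = F.mean(axis=(0, 1))                 # global average pooling -> (C,)
    y_max = F.max(axis=(0, 1))                  # global max pooling -> (C,)
    relu = lambda x: np.maximum(x, 0.0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # shared two-layer MLP applied to both pooled vectors, branches summed
    w = sigmoid(relu(y_avg @ W0) @ W1 + relu(y_max @ W0) @ W1)   # (C,)
    return F * w                                # broadcast multiply over H, W

rng = np.random.default_rng(0)
H, W_, C, r = 8, 8, 16, 4
F = rng.standard_normal((H, W_, C))
W0 = rng.standard_normal((C, C // r))
W1 = rng.standard_normal((C // r, C))
out = spectral_attention(F, W0, W1)
print(out.shape)   # (8, 8, 16)
```

Because the module only rescales channels, the output shape always equals the input shape, matching the property used later in the pipeline.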

Figure 1. Spectral attention mechanism. The module uses operations such as maximum pooling, average pooling, and shared weights to re-output feature maps with different weights.

Multi-Head Self-Attention
A CNN scheme is strictly limited by its kernel size and number of layers, which weakens its advantage in capturing the long-range dependence of the input data [52] and ultimately forces it to ignore some sequence information of the HSI input data. The self-attention mechanism improves upon the attention mechanism by reducing the dependence on external information, better capturing the internal data correlation and characteristic information. In this work, we utilize a self-attention variant to extract image features, namely the multi-head self-attention module.
Therefore, we initially remap each input vector x_i to q_i, k_i, and v_i by utilizing three initialization transformation matrices W_q, W_k, and W_v:

q_i = W_q x_i  (2)
k_i = W_k x_i  (3)
v_i = W_v x_i  (4)

where x_i is a flattened 2D patch obtained after the original HSI data passes through the spectral attention block. W_q, W_k, and W_v are three different weight matrices that apply three different linear transformations to each input vector to obtain the intermediate vectors q_i, k_i, and v_i, ultimately increasing the diversity of the model's feature sampling.
Then, we calculate the weight vector a_i from the q and k vectors obtained in Equations (2) and (3):

a_{i,j} = softmax_j((q_i · k_j) / √d)  (5)

where i, j ∈ {1, …, N + 1}, with N the number of flattened 2D blocks (Section 2.3 presents a detailed calculation of N). We apply the dot product between q_i and k_j and divide by √d, where d is the dimension of the q and k vectors, to normalize the data; the weight vector a_i is then output through a softmax function. Each a_i depends on one q vector and all k vectors, and thus Equation (5) produces in total N + 1 vectors with a length of N + 1 per vector. Next, we combine the a and v vectors obtained from Equations (4) and (5) and perform a weighted average operation to calculate the vector c_i:

c_i = Σ_{j=1}^{N+1} a_{i,j} v_j  (6)

The output vector of Equation (6) is the weighted average of all v vectors, with the weights provided by the a_i vector.
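The computation of Equations (2)-(6) can be sketched for a single attention head in NumPy; the variable names and dimensions below are illustrative:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence X of shape (N, d_in).

    Implements q = X Wq, k = X Wk, v = X Wv, attention weights
    a = softmax(q k^T / sqrt(d)), and outputs c_i = sum_j a_ij v_j.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (N, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N) scaled alignments
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                   # (N, d) weighted averages

rng = np.random.default_rng(1)
N, d_in, d = 17, 32, 16                            # 17 tokens, i.e., N + 1 in the text
X = rng.standard_normal((N, d_in))
Wq, Wk, Wv = (rng.standard_normal((d_in, d)) for _ in range(3))
C = self_attention(X, Wq, Wk, Wv)
print(C.shape)  # (17, 16)
```

A multi-head variant runs this routine several times with different Wq, Wk, Wv and concatenates the outputs, as described next.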
Our deep learning pipeline combines a multi-head self-attention block under multiple self-attention concatenation schemes, with the detailed process presented in Figure 2. The multi-head self-attention input is the vector produced by Equation (6), employing different W_q, W_k, and W_v parameters during the matrix operations of Equations (2)-(4) to obtain different c vectors. Ultimately, all c outputs are stacked, forming the multi-head self-attention output. Finally, the latter output passes through a fully connected layer to create N + 1 u-vectors, where each u vector has a one-to-one correspondence with an input x_i.

Figure 2. Multi-Head Self-Attention structure: after mapping, linear transformation, and matrix operations, the output sequence has the same length as the input sequence, and each output vector depends on all input vectors.

Encoder Block
According to the transformer concept employed in NLP and the suggestion of Dosovitskiy et al. [54], an image x ∈ R^(H×W×C) can be remapped into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)). We extend [54] and add a processing step where the patch image obtained from the original data is mapped through the spectral attention block to extract the relevant features. Thus, we ultimately obtain N flattened 2D blocks of the same size, with the dimension of each block being P²·C, where P is the chosen block size, N = HW/P², and H, W, and C are the width, height, and channel number of the image, respectively. Then, for each vector, we perform a linear transformation (fully connected layer) and compress the dimension P²·C into dimension D. As a reference, we use the encoder model of the transformer, and since the decoder model is not used, we add a learnable embedding vector x_class and introduce a positional encoding E_pos. This process is represented by:

z_0 = [x_class; x_p¹E; x_p²E; x_p³E; ⋯; x_pᴺE] + E_pos  (7)

where E represents the linear transformation layer, with P²·C the input dimension and D the output dimension. The trainable variable E_pos represents the position information of the added sequence. When positions are close, they often have similar codes, and patches in the same row/column also have similar position codes.
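A minimal NumPy sketch of the embedding step of Equation (7): flattened patches are linearly projected, a class token is prepended, and position codes are added. All shapes below are illustrative:

```python
import numpy as np

def embed_patches(patches, E, x_class, E_pos):
    """Build z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos.

    patches: (N, P*P*C) flattened 2D blocks, E: (P*P*C, D) projection,
    x_class: (D,) learnable class token, E_pos: (N + 1, D) position codes.
    """
    tokens = patches @ E                          # linear embedding -> (N, D)
    z0 = np.vstack([x_class[None, :], tokens])    # prepend class token -> (N + 1, D)
    return z0 + E_pos                             # add positional encoding

rng = np.random.default_rng(2)
N, P, C, D = 16, 4, 8, 64
patches = rng.standard_normal((N, P * P * C))
E = rng.standard_normal((P * P * C, D))
x_class = rng.standard_normal(D)
E_pos = rng.standard_normal((N + 1, D))
z0 = embed_patches(patches, E, x_class, E_pos)
print(z0.shape)  # (17, 64)
```

The resulting (N + 1) × D matrix is the sequence fed to the encoder blocks.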
We design the encoder block by utilizing several operations, including layer normalization (LN), multi-head self-attention (MHSA), and dense layers, as expressed in Equation (8) and illustrated in Figure 3:

z'_l = MHSA(LN(z_{l-1})) + z_{l-1},  z_l = MLP(LN(z'_l)) + z'_l  (8)

It is worth noting that in the latter figure, the Gaussian Error Linear Unit (GELU) [55] activation function introduces the idea of random regularization, affording faster network convergence and increasing the model's generalization ability. Additionally, we employ multiple residual blocks to eliminate problems such as gradient dispersion. The Multilayer Perceptron (MLP) exploited contains two layers with a GELU non-linearity. Finally, depending on the scenario, the encoder block presented in Figure 3 can be stacked multiple times as required to achieve a high HSI classification accuracy; the latter is discussed in Section 3.3.
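The encoder block described above (layer normalization, multi-head self-attention, a two-layer GELU MLP, and two residual connections) can be sketched as follows. The `mhsa` argument is a stand-in for an actual multi-head self-attention implementation, and the hidden width `Dh` is an illustrative assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def encoder_block(z, mhsa, W1, W2):
    """One block: z' = MHSA(LN(z)) + z ; out = MLP(LN(z')) + z'."""
    z = mhsa(layer_norm(z)) + z              # attention sub-layer + residual
    h = gelu(layer_norm(z) @ W1) @ W2        # two-layer MLP with GELU
    return h + z                             # second residual connection

rng = np.random.default_rng(3)
N1, D, Dh = 17, 64, 128
z = rng.standard_normal((N1, D))
W1 = rng.standard_normal((D, Dh)) * 0.1
W2 = rng.standard_normal((Dh, D)) * 0.1
identity_mhsa = lambda x: x                  # placeholder for multi-head self-attention
out = encoder_block(z, identity_mhsa, W1, W2)
print(out.shape)  # (17, 64)
```

Because every sub-layer preserves the (N + 1) × D shape, such blocks can be stacked an arbitrary number of times.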

Overview of the Proposed Model
Finally, the vectors obtained through the stacked encoder modules are input to two fully connected layers employing GELU activation functions. Then, we exploit the first of the output vectors, i.e., the one at the position of the learnable embedding vector x_class, to obtain the final classification result, which is expressed as:

y = MLP(z_L⁰)  (9)

where z_L⁰ is the output of the final encoder block at the position of the additional embedding vector x_class used for classification, and the MLP comprises the dense, GELU, and dense blocks presented in Figure 4. The execution process of the entire SAT network is shown in the latter figure. After the original HSI data are processed, they are input into the spectral attention module and the encoder modules with multi-head self-attention to extract HSI features. Second, the encoder modules are connected via a multilayer residual structure, thereby effectively reducing information loss; finally, the fully connected layers output the classification information.
First, around each pixel, we extract patches of block size s × s × o, with the third dimension o being the spectral dimension of the respective HSI, while for the edge pixels that cannot be directly extracted, we employ a padding operation. Ultimately, we obtain the final sample data with shape (m, s, s, o), where m is the number of samples and s is the width and height of each sample. A detailed analysis of the sample size is presented in Section 3.3. We then pass the processed sample data through the spectral attention module to redistribute the weight of the spectral information. Since the spectral attention mechanism does not change the shape of the input feature map, the shape of the output sample data is still (m, s, s, o).
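The per-pixel cube extraction with border padding can be sketched in NumPy as follows; the zero-padding choice and the function name are illustrative assumptions:

```python
import numpy as np

def extract_cubes(hsi, s):
    """Extract an s x s x o cube around every pixel of an (H, W, o) HSI,
    zero-padding the borders so edge pixels also get full-size cubes."""
    H, W, o = hsi.shape
    p = s // 2
    padded = np.pad(hsi, ((p, p), (p, p), (0, 0)))    # pad spatial dims only
    cubes = np.stack([padded[i:i + s, j:j + s, :]
                      for i in range(H) for j in range(W)])
    return cubes                                       # (H*W, s, s, o)

rng = np.random.default_rng(4)
hsi = rng.standard_normal((10, 12, 5))
cubes = extract_cubes(hsi, s=8)
print(cubes.shape)  # (120, 8, 8, 5)
```

Each cube's center element equals the corresponding source pixel, since the pad width is s // 2.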
Once the raw HSI data are remapped into a set of s × s × o image patches, we process each sample into an (s/P) × (s/P) sequence of flattened 2D patches of shape (P, P, o). However, the transformer model expects a two-dimensional N × D matrix as input (omitting the batch dimension), where N = (s/P) × (s/P) is the sequence length and D is the dimension of each vector of the sequence (set to 64 in this article). Therefore, we reshape the (s/P) × (s/P) 2D patches into a two-dimensional matrix of shape ((s/P) × (s/P), o × P × P) and apply a linear transformation layer on the latter matrix to ultimately create a two-dimensional matrix of shape (N, D). Then, we introduce the embedding vector x_class and the position code E_pos (as described in Section 2.3) and create a matrix of size (Batch_size, N + 1, D), used as the input to the encoder block. Here, we use multiple encoder modules (the specific number of modules is discussed in Section 3.3.3) to continue extracting image features. In contrast to Dosovitskiy et al. [54], we change the direct connection of the encoder modules and employ a residual structure to inter-connect them, with the detailed process shown in Figure 4. This strategy reduces the information loss caused by stacking multiple encoder modules and accelerates model convergence. The classification results are finally output through two fully connected layers.
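The reshaping pipeline described above can be traced with a shape walk-through for a single sample; the concrete values of s, o, and P below are illustrative (D = 64 matches the text):

```python
import numpy as np

# Shape walk-through for one sample, assuming s = 64, o = 16, P = 16, D = 64.
s, o, P, D = 64, 16, 16, 64
rng = np.random.default_rng(5)
sample = rng.standard_normal((s, s, o))                  # after spectral attention
N = (s // P) * (s // P)                                  # sequence length: 16 here
patches = sample.reshape(s // P, P, s // P, P, o)        # cut into P x P blocks
patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, o * P * P)
E = rng.standard_normal((o * P * P, D))
tokens = patches @ E                                     # linear embedding -> (N, D)
x_class = rng.standard_normal((1, D))
z = np.vstack([x_class, tokens])                         # (N + 1, D) encoder input
print(z.shape)  # (17, 64)
```

The reshape/transpose pair guarantees that each row of `patches` is exactly one contiguous P × P spatial block across all o bands.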

Experiments, Results, and Discussion
In this section, we first introduce three publicly available HSI data sets and then analyze the five factors that influence the classification accuracy of the proposed model. Finally, we challenge the proposed model against current state-of-the-art methods and discuss the experimental results.

Data Set Description
For our experiments, we consider three publicly available HSI data sets, namely the Salinas (SA), the Indian Pines (IN), and the University of Pavia (UP). Detailed information on all datasets is presented in Table 1.

The SA dataset comprises HSI collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas, California, USA. It has 224 spectral bands covering the 400-2500 nm spectral range. Each HSI has a size of 512 × 217 pixels and a spatial resolution of 3.7 m/pixel. This dataset has in total 54,129 labeled pixels spanning 16 object classes. The pseudo-color image and the corresponding ground truth map are illustrated in Figure 5, with the sample division ratio of the training and the test set shown in Table 2.

The IN dataset was collected by the AVIRIS sensor over Northwestern Indiana, USA, involving 200 spectral bands covering the 400-2500 nm range. It includes an HSI of 145 × 145 pixels and a spatial resolution of 20 m/pixel, with 10,249 labeled pixels involving 16 object classes. The pseudo-color image and ground truth map are presented in Figure 6. The sample ratio between the training and the test set is shown in Table 3.

The Reflective Optics Spectrographic Imaging System (ROSIS) sensor collected the UP dataset over Pavia, Italy, involving imagery of 610 × 340 pixels and a spatial resolution of 1.3 m/pixel. There are 103 spectral bands covering the 430-860 nm range. In total, there are 42,776 labeled pixels of nine object classes. The pseudo-color image and ground truth map are shown in Figure 7, with the training and test sets presented in Table 4.

For our experiments, we randomly selected 20% of each dataset for training and used the remaining 80% for testing. A detailed experimental analysis is presented in Section 3.2.

Experimental Setup
We evaluate the performance of the proposed SAT Net model on an Intel® Xeon® Gold 5218 with 512 GB RAM and an NVIDIA (Santa Clara, CA, USA) Ampere A100 GPU with 40 GB RAM. Our platform operates on Windows 10, utilizing the TensorFlow 2.2 deep learning framework and Python 3.7. We optimize the model by exploiting the Adam optimizer [56] with a batch size of 64 and employ the cross-entropy loss function for reverse gradient propagation. We also employ a five-fold cross-validation [57] scheme to train and test the model in the experiments of Sections 3.3.1 and 3.3.2. Specifically, we divide each data set into five parts, each accounting for 20% of the total data set. During each training round, four parts are used as the training set and one part as the test set. In total, we consider five rounds of training, exploiting each time a different subset of the data as the training and testing sets. Finally, the average performance over the five test rounds is considered the model's accuracy. In the experiments that follow, we quantitatively evaluate the performance of all competitor methods relying on the overall classification accuracy (OA), the average accuracy (AA), and the Kappa coefficient (K).
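The five-fold protocol described above can be sketched as an index generator; the shuffling seed and helper name are illustrative:

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Yield (train, test) index arrays for five-fold cross-validation:
    each round uses four folds for training and one fold for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)               # five disjoint 20% folds
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test

rounds = list(five_fold_indices(100))
print(len(rounds), len(rounds[0][0]), len(rounds[0][1]))  # 5 80 20
```

The model's reported accuracy is then the mean of the five per-round test accuracies.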

Image Preprocessing
The first batch of trials investigates the interplay between the hyperparameter setup and the overall classification performance of the proposed SAT Net. These hyperparameters involve the extracted cube size, i.e., the size of the 3D extracted patch, the size of the 2D patches, the number of stacked encoder blocks, the learning rate, and the proportion of training to testing samples.

Image Size (IS)
In this trial, we investigate the cube sizes of 16, 32, and 64, extracted around each pixel of the HSI raw data, with the corresponding results presented in Table 5. From the latter table, we observe that for IS = 16, the SA, IN, and UP datasets manage an OA of 97.18%, 93.42%, and 96.45%, respectively. However, despite the OA metric being relatively high, it is still lower than the optimum performance attained when IS = 64. This is because a smaller extraction cube interferes with the spatial continuity, while as IS increases, the performance also increases, and ultimately IS = 64 achieves the highest classification results. It should be noted that due to our hardware, our trials are limited to a maximum of IS = 64.

Patch Size (PS)

In this experiment, we vary the size PS of the flattened 2D patches. The different PS values evaluated are inversely proportional to the number of linear embedding sequences that are input to the encoder block. Thus, we set PS to 4, 8, and 16, with the corresponding results presented in Table 5. From the latter table, we confirm the findings of Dosovitskiy et al. [54] that N = IS²/PS², and thus for our trials it should be greater than 16.
Hence, for our trials, we employ a trial-and-error strategy and conclude that for PS = 16 our method manages an appealing performance, which we adopt for the trials that follow.

Depth Size
Here, we vary the number of stacked encoder blocks within the proposed SAT Net, with the stack cardinality set to 2, 3, 4, 5, and 6. The corresponding experimental results are shown in Figure 8, highlighting that as the number of encoder blocks increases, the classification accuracy increases, but the total network parameters, and thus the difficulty of network training, increase as well. However, increasing the model parameters too much causes the model to overfit and ultimately reduces its classification accuracy. For our trials, an encoder block cardinality of three manages a classification performance of 99.91%, 99.03%, and 99.47% for the SA, IN, and UP datasets, respectively.

Training Sample Ratio
The proportion of training vs. testing data affects the fitting process of the model during its training. Hence, we evaluate training proportions of 3%, 5%, 10%, 20%, 30%, and 40% of the entire dataset, with the corresponding results presented in Figure 9. From the latter figure, we observe that when the proportion of the training set is 3% or 5%, the classification result on IN is poor, because the total number of samples in the IN dataset is relatively small. However, when the proportion of the training set exceeds 20%, all three datasets achieve quite appealing classification results. For the subsequent trials, and to compare our technique against current methods, e.g., Zhong et al. [25], we set the training set ratio to 20% of the total samples.

Learning Rate
The learning rate affects the gradient descent rate of the model; thus, choosing an appropriate learning rate controls the convergence quality and speed of the model. For our experimental analysis, we set the learning rate to 0.0001, 0.0005, 0.001, and 0.005, with the corresponding results shown in Figure 10. We optimize SAT Net's performance by setting the learning rate for SA to 0.001 and for UP and IN to 0.0005.

Evaluation
We challenge the proposed SAT Net against a convolutional neural network (CNN) [58] (a CNN architecture with five weight layers), the spectral attention module-based convolutional network (SA-MCN) [40] (recalibrating spatial and spectral information), a three-dimensional convolutional neural network (3D-CNN) [32], the spectral-spatial residual network (SSRN) [25], and the multi-scale residual network model with an attention mechanism (MSRN) [41]. For fairness, we set the ratio of the training set to the test set to 2:8. We also optimize each model by exploiting the Adam optimizer [56] with a batch size of 64 and employ the cross-entropy loss function for reverse gradient propagation.

Quantitative Evaluation
Tables 6-8 present the classification accuracy of each object class and each method, evaluated using the OA, AA, and K metrics. From the results, we observe that the classification results of the CNN network are still lacking, as the 2D-CNN loses spectral feature information by ignoring the 3D nature of the HSI data. SA-MCN extracts spectral features based on spectral attention. The 3D-CNN directly extracts feature information in both the spatial and spectral dimensions, which significantly improves the accuracy of HSI classification; nevertheless, it still does not fully utilize the correlated spatial and spectral information. On the contrary, SSRN exploits a spatial-spectral attention module to redistribute the spatial and spectral information weights, achieving good classification results. The MSRN network uses an improved residual network and a spatial-spectral attention module to extract hyperspectral image information at different scales and multiple times, fully integrating the spatial-spectral features of the image; it attains the best results on the IN dataset, with an overall accuracy, average accuracy, and Kappa of 0.9937, 0.9945, and 0.9961, respectively. The proposed SAT Net attains the most appealing results on the SA dataset, where its overall classification accuracy, average classification accuracy, and Kappa reach 0.9991, 0.9963, and 0.9978, respectively. Finally, on the UP dataset the proposed method has comparable performance to MSRN: its overall accuracy and Kappa coefficient are slightly inferior to the MSRN model, while its average accuracy is slightly superior. Compared to the competitor methods, we extract the image features via a multi-head self-attention scheme that avoids the partial information loss incurred by regular convolution kernels during feature extraction and solves the problem of long-distance dependence in HSI.
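For reference, the three metrics reported in Tables 6-8 can all be computed from a confusion matrix; the small matrix below is a made-up example, not data from the paper:

```python
import numpy as np

# OA, AA, and Cohen's Kappa from a confusion matrix C, where C[i, j]
# counts samples of true class i predicted as class j.
def oa_aa_kappa(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    oa = np.trace(C) / n                                # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))            # average per-class accuracy
    pe = (C.sum(axis=0) * C.sum(axis=1)).sum() / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                        # Cohen's kappa
    return oa, aa, kappa

C = np.array([[45, 5],
              [10, 40]])       # toy 2-class confusion matrix
oa, aa, kappa = oa_aa_kappa(C)
print(round(oa, 2), round(aa, 2), round(kappa, 2))   # 0.85 0.85 0.7
```

Unlike OA, the Kappa coefficient discounts agreement expected by chance, which is why it is reported alongside OA and AA for unbalanced HSI class distributions.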

Qualitative Evaluation
Figures 11-13 show the overall accuracy curves of the proposed model against the competitor models. The results indicate that as the number of training steps increases, the accuracy of all models continuously improves. Among the models, CNN has the lowest initial OA, SA-MCN has the slowest convergence speed, MSRN has the fastest convergence speed, and SAT Net has the second-fastest. The proposed model converges well within 20 epochs on the SA dataset and within 30 epochs on the IN and UP datasets. Figures 14-16 show the visualization results (pseudo-color classification maps) of the different models on the three public datasets used in this work. The classification maps obtained by CNN and SA-MCN show inferior performance, with significant noise and poor continuity between different object classes. The results obtained by the 3D-CNN and SSRN methods are better, containing less point noise, and MSRN also achieves good classification results. In contrast, the classification maps generated by the proposed SAT Net model and MSRN have smoother boundaries, less noise, and overall higher classification accuracy.

Conclusions
This article proposes a deep learning model appropriate for HSI classification, entitled SAT Net. Our technique successfully employs a transformer scheme for HSI processing and proposes a new strategy for HSI classification. We first process the HSI data into a linear embedding sequence and then use the spectral attention module and the multi-head self-attention module to extract image features. The latter module resolves the long-distance dependence along the HSI spectral bands and simultaneously discards the convolution operation, avoiding the information loss caused by the irregular processing of a typical convolution kernel during object classification. Overall, SAT Net combines multi-head self-attention with linear mapping, regularization, activation functions, and other operations to form an encoder block with a residual structure, and stacks multiple encoder blocks to form the main structure of the model. We verified the effectiveness of the proposed model by conducting two experiments on three publicly available datasets. The first experiment analyzes the influence of the model's hyperparameters, such as image size, training set ratio, and learning rate, on the overall classification performance. The second experiment challenges the proposed model against current classification methods. Compared with CNN, SA-MCN, 3D-CNN, and SSRN on the three public datasets, SAT Net achieves better OA, AA, and Kappa results. Compared with MSRN, SAT Net achieves better results on the SA dataset, comparable performance on the UP dataset, and slightly inferior performance on the IN dataset; however, it uses less convolution (only the spectral attention module) to achieve this classification performance, and it better handles the long-distance dependence of the HSI spectral information, providing a novel idea for HSI classification. On the three public datasets, i.e., SA, IN, and UP, the proposed method achieves an overall accuracy of 99.91%, 99.22%, and 99.64% and an average accuracy of 99.63%, 99.08%, and 99.67%, respectively. Due to the small number of samples in the IN dataset and its uneven data distribution, the classification performance of SAT Net on it still needs improvement. In the future, we will study methods such as data expansion, weighted loss functions, and model optimization to improve the classification of small-sample hyperspectral data.

Figure 3 .
Figure 3. The Transformer encoder block. This module is composed of norm, multi-head self-attention, and dense layers connected in the form of residuals.
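A minimal PyTorch sketch of the encoder block in Figure 3, with the norm, multi-head self-attention, and dense layers each wrapped in a residual connection; the dimensions and the GELU activation are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Sketch of the Figure 3 encoder block: residual connections around
# (norm -> multi-head self-attention) and (norm -> dense layers).
class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=8, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                   # x: (batch, tokens, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]       # residual around self-attention
        x = x + self.mlp(self.norm2(x))     # residual around dense layers
        return x

block = EncoderBlock()
out = block(torch.randn(2, 17, 64))
print(out.shape)   # torch.Size([2, 17, 64])
```

The residual paths are what let several such blocks be stacked without the gradient dispersion noted earlier.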

Figure 4 .
Figure 4. The proposed SAT Net architecture. After the original HSI data are processed, they are input into the spectral attention module and the encoder modules with multi-head self-attention to extract HSI features. The encoder modules are connected through a multilayer residual structure, thereby effectively reducing information loss, and finally the fully connected layer outputs the classification information.
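The input side of Figure 4 (remapping the HSI cube into flattened 2D patches, linear compression, and the learnable-embedding and position-coding vectors) can be sketched as follows; all sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of the Figure 4 input pipeline: split an HSI cube into flattened
# 2D patches, project them linearly, then add a learnable class token and
# position embeddings before the encoder stack. Sizes are illustrative.
patch, bands, d_model = 4, 8, 64
n_patches = (16 // patch) ** 2                         # 4x4 grid of patches
proj = nn.Linear(patch * patch * bands, d_model)       # linear compression
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable embedding
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

cube = torch.randn(2, bands, 16, 16)                   # batch of 2 HSI windows
patches = cube.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, n_patches, -1)
tokens = proj(patches)                                 # (2, 16, 64)
tokens = torch.cat([cls_token.expand(2, -1, -1), tokens], dim=1)
tokens = tokens + pos_embed                            # add position coding
print(tokens.shape)   # torch.Size([2, 17, 64])
```

The resulting token sequence is what the stacked encoder blocks consume, and the class token's final state feeds the multilayer perceptron classifier.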

Figure 8 .
Figure 8. Overall classification accuracy per dataset under various numbers of stacked encoder blocks.

Figure 9 .
Figure 9. Overall accuracy per dataset under different training set proportions.

Figure 10 .
Figure 10.The overall classification accuracy of the three data sets at different learning rates.
Figure 17 presents partially enlarged views of the classification results of MSRN and SAT Net on the SA, IN, and UP datasets. From the enlarged images, we observe that on the SA dataset the SAT Net result contains less continuous noise, with residual noise only at the boundary between the Grapes_untrained and Vinyard_untrained classes. On the IN dataset, MSRN and SAT Net show some pixel misclassifications at the border between the Soybeans-clean till and Soybeans-min till classes. On the UP dataset, both MSRN and SAT Net mix some Meadow pixels into the Bare Soil class.

Figure 11 .
Figure 11. Overall accuracy curves of the different models on the SA dataset.

Figure 12 .
Figure 12. Overall accuracy curves of the different models on the IN dataset.

Table 2 .
Training and Testing Samples for the SA Dataset.

Table 3 .
Training and Testing Samples for the IN Dataset.

Table 4 .
Training and Testing Samples for the UP Dataset.

Table 5 .
Evaluation of several hyperparameters under five-fold cross-validation (Highest Performance is in Boldface).

Table 6 .
Classification Results of Various Methods for the SA Dataset (Highest Performance is in Boldface).

Table 7 .
Classification Results of Various Methods for the IN Dataset (Highest Performance is in Boldface).

Table 8 .
Classification Results of Various Methods for the UP Dataset (Highest Performance is in Boldface).