Multi-Scale Residual Spectral–Spatial Attention Combined with Improved Transformer for Hyperspectral Image Classification

Abstract: Different spectral bands and spatial pixels contribute differently to hyperspectral image (HSI) classification, and sparse connectivity limits the ability of convolutional neural networks to capture global dependencies. To address these two problems, we propose an HSI classification model that combines multi-scale residual spectral–spatial attention with an improved transformer. First, to efficiently highlight discriminative spectral–spatial information, we propose a multi-scale residual spectral–spatial feature extraction module that preserves multi-scale information in a two-layer cascade structure; the spectral–spatial features are refined by residual spectral–spatial attention in the feature-learning stage. In addition, to further capture sequential spectral relationships, we combine the advantages of Cross-Attention and Re-Attention into a Cross-Re-Attention mechanism, yielding an improved transformer that alleviates both attention collapse and the heavy memory footprint and computational burden of the model. The experimental results show that the overall accuracy of the proposed model reaches 98.71%, 99.33%, and 99.72% on the Indian Pines, Kennedy Space Center, and XuZhou datasets, respectively. Compared with state-of-the-art models, the proposed method achieves high accuracy and effectiveness, which shows that the hybrid architecture opens a new window for HSI classification.


Introduction
Hyperspectral images (HSIs) combine a high spatial resolution with continuous spectral bands of different objects, and are therefore characterized by "spectral image unity" [1,2]. They have been used in a wide variety of applications, such as urban management [3], geological exploration [4], and military surveys [5].
HSI classification is a foundational component of Earth-monitoring applications. Its main goal is to assign each pixel in the HSI to a specific land cover class, thus achieving precise identification and classification of surface cover. Initially, HSI classification mainly used traditional machine learning methods to extract features. Typically, these methods first applied dimension reduction to reduce spectral redundancy, for example principal component analysis (PCA) [6] and linear discriminant analysis (LDA) [7]. They then employed classifiers such as the K-nearest neighbor method [8], support vector machine [9], random forest [10], and decision tree [11] to classify the extracted features. Although traditional machine learning-based methods have improved classification performance, they often rely on hand-crafted features. With the rapid development of deep learning, deep learning-based methods have absorbed the early experience of HSI classification, combine spectral and spatial information to complete the classification task, and can directly extract effective deep features from the original image [12]. Chen et al. [13] first introduced deep learning into the field of HSI classification, using the stacked autoencoder (SAE), an unsupervised deep feature-learning model, to extract features from the original image, which improved the accuracy of HSI classification.
Due to the spectral and spatial heterogeneity of HSIs, it is difficult to accurately identify land cover types using only spectral information. Therefore, models that jointly extract spectral-spatial features from HSIs for classification have become a research focus. Convolutional neural networks (CNNs) have been widely used in HSI classification to realize the joint extraction of spectral and spatial features [14]. Among these, the three-dimensional convolutional neural network (3D-CNN) achieves direct end-to-end deep spectral-spatial feature extraction on HSIs, providing a robust and reliable feature extraction mechanism [15,16]. Considering the importance of multi-scale information for improving network performance, Song et al. [17] proposed a deep feature fusion strategy that effectively fuses multi-scale feature representations by creating interconnections between different layers of information. Zhong et al. [18] proposed the spectral-spatial residual network (SSRN), which sequentially uses spectral residual blocks and spatial residual blocks to learn deep features from HSIs. Roy et al. [19] proposed the attention-based adaptive spectral-spatial kernel improved residual network (A²S²K-ResNet), which uses spectral attention to capture discriminative spectral-spatial features in an end-to-end training approach. In addition, attention mechanisms are widely used for HSI classification. Zhou et al. [20] designed a Cross-Attention Fusion module in an Attention Multihop Graph and Multiscale Convolutional Fusion Network (AMGCFN) to highlight important information and enhance feature fusion in different subnets. Gou et al. [21] proposed a global spatial feature representation model that learns global spatial features based on an encoder-decoder structure with channel attention and spatial attention. The CNN-based approaches improve the local perception of the model through point-wise operations on the pixels surrounding each position, but they are limited by the kernel size and the number of network layers, which results in an insufficient ability to capture global contextual information.
In recent years, some studies have introduced the transformer into this pipeline, first extracting features by convolution and then using the transformer to obtain contextual information [22]. Dosovitskiy et al. [23] proposed the Vision Transformer (ViT) with a dynamic and global receptive field, which ensures that the model performs well in image classification tasks and can learn the dependencies between different positions of the image. ViT learns features mainly using the multi-head attention mechanism, which can extract global information from the non-overlapping parts of the image. Therefore, ViT can effectively capture the long-range dependencies from the input images, enabling the network to parse the information from a global perspective, and thus effectively assisting in describing the local semantic information [24]. For the task of feature classification in HSIs, applying the ViT to sequence data is more effective and flexible in analyzing the spectral data of HSIs [25]. Sun et al. [26] proposed the Spectral-Spatial Feature Tokenisation Transformer (SSFTT) to obtain spectral-spatial features and high-level semantic information.
However, when ViT is applied to the HSI classification task, a prominent issue is that the computational burden of the self-attention mechanism grows quadratically with the input size, which slows the inference speed of the model. Additionally, unlike CNNs, which can be extended to deeper layers to improve performance, the performance of ViT saturates rapidly as depth increases. This difficulty is mainly due to attention collapse: the feature maps generated in deeper structures tend to be the same. To address the problem of computational burden, Zhang et al. [27] proposed a lightweight transformer (LiT) that balances high computational efficiency and significant performance. Liu et al. [28] proposed the Swin-Transformer, which uses a shifted window to capture global features. Meanwhile, Lin et al. [29] proposed the Cross-Attention in Vision Transformer (CAT), which uses Cross-Attention to capture local information inside the feature map patches and to capture global information between the feature map patches in a single channel. Both methods reduce the original quadratic growth in computation to linear growth, which significantly reduces the computation of the transformer.
To address the problem of attention collapse, researchers from AI Lab proposed the Re-Attention mechanism, which regenerates the feature maps between layers to enhance the diversity between layers, preventing the feature maps from converging to be the same in deeper layers [30]. Hybrid architectures combining the transformer and convolutions have garnered widespread attention in building lightweight, high-performance models. Some works have proposed a hybrid structure of a CNN and a transformer after analyzing the working principles of both in detail, where shallow features are extracted by the CNN and the extracted features are fed into a semantic tagger to tag the global semantic information [31-33].
Based on the above analysis, this paper proposes an efficient multi-scale residual spectral-spatial attention model combined with an improved transformer (RSSAT) for HSI classification. In RSSAT, we designed a multi-scale residual spectral-spatial feature extraction module to improve the discriminative power of the extracted features and adaptively fuse the acquired spectral and spatial information. In addition, we designed improved transformers to fully extract high-level semantic features and model long-range feature dependencies in HSI multidimensional datasets. Overall, our approach constructs a shallow-to-deep feature-learning model that effectively reduces misclassification of small target samples. The main contributions of this paper are summarized as follows:
1. In order to fully extract HSI high-level semantic features and to enhance the effective representation of global contextual information, this paper combines the respective representational strengths of a CNN and a transformer, and proposes a new HSI classification method called RSSAT. RSSAT has strong advantages in discriminative feature extraction and in capturing long-range dependencies, and achieves the best classification performance among the compared methods.
2. By investigating the characteristics of HSIs, a multi-scale residual spectral-spatial feature extraction module was designed. The module fully exploits the local information of HSIs in a two-layer cascade structure and selectively aggregates the information between spectral bands and spatial pixels to highlight discriminative information. The module alleviates the information loss in feature flow and retains more spectral and spatial information, reducing misclassification of small target samples and discrete samples.
3. In order to accurately capture long-range feature dependencies in HSI multidimensional datasets, we propose an improved transformer, in which the Cross-Re-Attention mechanism replaces the Self-Attention of the traditional transformer. This strategy significantly enhances the model's ability to learn high-level semantic features by introducing a learnable matrix that dynamically generates new attention maps between layers.
4. According to the experimental results, RSSAT significantly outperforms other state-of-the-art deep learning methods in terms of classification performance, especially when dealing with imbalanced samples.

Materials and Methods
Figure 1 shows the framework of the RSSAT model. The architecture mainly consists of a multi-scale residual spectral-spatial feature extraction module and an improved transformer module. The model integrates the advantages of the CNN and the transformer to extract features from shallow to deep, which allows it to fully utilize the rich spectral-spatial information in HSIs and further improves its performance and robustness. In the training process, the redundant spectral bands are first removed by principal component analysis (PCA), and the HSI data are then fed into the convolution module to learn low-order features. Next, to enhance the spectral-spatial feature representation capability and robustness of the RSSAT model, residual spectral-spatial attention is embedded in the multi-scale residual feature-learning part. The multi-scale residual spectral-spatial feature extraction module re-adjusts and optimizes the extracted features through a two-level cascaded residual structure to highlight discriminative information. Meanwhile, the model establishes channel connections between feature maps at different stages to enhance the convergence ability of RSSAT. Finally, the improved transformer captures long-distance dependencies of the sequential spectral features. The resulting discriminative spectral-spatial features are used to obtain the classification results.
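The first stage of the pipeline described above (PCA band reduction followed by per-pixel neighborhoods fed to the convolution module) can be sketched as follows. This is only an illustrative NumPy sketch: the function names `pca_reduce` and `extract_patches`, the toy cube size, the number of retained components, the patch size, and the reflect padding are all assumptions, not values taken from the paper.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project the spectral axis of an (H, W, B) cube onto its top principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)   # one row per pixel (astype copies)
    flat -= flat.mean(axis=0, keepdims=True)        # centre each band
    # Rows of vt are the principal directions of the centred pixel matrix.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(h, w, n_components)

def extract_patches(cube, patch_size):
    """Return one (patch_size, patch_size, C) neighbourhood per pixel (reflect padding, assumed)."""
    r = patch_size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    h, w, c = cube.shape
    patches = np.empty((h * w, patch_size, patch_size, c), dtype=cube.dtype)
    for i in range(h):
        for j in range(w):
            patches[i * w + j] = padded[i:i + patch_size, j:j + patch_size]
    return patches

rng = np.random.default_rng(0)
hsi = rng.random((12, 10, 50))                 # toy cube; sizes are hypothetical
reduced = pca_reduce(hsi, n_components=8)
patches = extract_patches(reduced, patch_size=5)
print(reduced.shape, patches.shape)
```

Each patch is centred on the pixel it will be used to classify, so the label of a patch is the label of its centre pixel.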


Residual Spectral-Spatial Attention
For HSI pixel-level classification, the joint extraction of spectral and spatial information follows two principles [34]:
Principle 1: Spectral information is the basis of HSI pixel-level classification and is the most discriminative information.
Principle 2: Effective spatial information for HSI pixel-level classification refers to the information carried by neighboring pixels that are similar to the center pixel.
Based on these two principles, this paper embeds the residual spectral-spatial attention module into the multi-scale feature extraction part to realign and optimize the spectral and spatial features and highlight the discriminative information, thereby improving the accuracy and efficiency of HSI classification. Figure 2 illustrates the structure of the proposed residual spectral-spatial attention module.

In this paper, we adopt the spectral-spatial attention module [35] and combine it with residual operations to create the residual spectral-spatial attention module, which enhances the feature extraction ability of RSSAT. First, the spectral attention module selects specific spectral bands from the input HSI. It highlights the bands that are useful for the classification task and reduces the influence of irrelevant bands. Next, the spatial attention module achieves fine extraction of spatial information by adaptively strengthening neighboring pixels that belong to the same category as the center pixel and weakening pixels of different categories. The two attention modules are arranged sequentially: based on the given input or intermediate features, spectral attention weights are computed and applied to the relevant features, and the results are then used as inputs to the spatial attention module.
Spectral Attention: The core purpose of the spectral attention module is to highlight the spectral features that are critical for HSI classification. To refine and select features, the spectral feature map is generated using the relationships between the spectral information of the features. The structure of the spectral attention module is given in Figure 3. To aggregate information and infer finer spectral attention, an average pooling layer and a maximum pooling layer are employed, giving two different descriptors of the feature map. The kth element of the average pooling output is calculated by Equation (1), and the kth element of the maximum pooling output is calculated by Equation (2):
$$y^{se}_{avg}(k) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} y_k(i, j) \quad (1)$$
$$y^{se}_{max}(k) = \max_{1 \le i \le H,\ 1 \le j \le W} y_k(i, j) \quad (2)$$
where $y_k(i, j)$ is the value at position $(i, j)$ of the kth channel, $y^{se}_{avg}$ and $y^{se}_{max}$ denote the outputs of the average pooling and maximum pooling, respectively, and $H$ and $W$ denote the height and width, respectively.
To fully capture the interrelationships between different spectral bands and to improve the generalization ability of the model, the outputs of the average pooling layer and maximum pooling layer are fed directly into a shared MLP, which contains two fully connected (FC) layers. New weights are then assigned through the SoftMax function. The output of the spectral attention module is denoted by $F^{se}$.
Spatial Attention: The spatial attention module aims to enhance the spatial information of neighboring pixels that have the same class label as the center pixel, and to weaken the spatial information of pixels that have different class labels. The spatial attention module is given in Figure 4. To fully aggregate the spatial information, an average pooling layer and a maximum pooling layer are used to mine the target features. The spatial attention module takes the output of the spectral attention module and passes it through the maximum pooling and average pooling operations to obtain two new feature maps. Then, the two feature maps are concatenated and passed through a 7 × 7 convolution. Finally, an attention weight is assigned to each pixel using a sigmoid function. The output is computed as follows:
$$F^{sa} = \mathrm{Sigmoid}\left(\mathrm{Conv}_{7 \times 7}\left(\mathrm{Concat}\left(y^{sa}_{avg}, y^{sa}_{max}\right)\right)\right) \quad (6)$$
where $y^{sa}_{avg}$ and $y^{sa}_{max}$ denote the outputs of the average pooling and maximum pooling, respectively, and $F^{sa}$ denotes the output of the spatial attention module.
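The sequential spectral-then-spatial attention described above can be sketched in NumPy as below. Several details are assumptions made only for illustration and are not confirmed by the text: the two MLP responses are summed before the SoftMax, the hidden layer uses a ReLU, the spatial pooling is taken across the channel axis, and the residual connection adds the refined features back to the input. All function names, weight shapes, and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spectral_attention(x, w1, w2):
    """x: (C, H, W). Returns one weight per channel, shape (C,)."""
    y_avg = x.mean(axis=(1, 2))                  # average pooling over H x W
    y_max = x.max(axis=(1, 2))                   # maximum pooling over H x W
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2  # shared two-layer MLP (ReLU assumed)
    return softmax(mlp(y_avg) + mlp(y_max))      # summation before SoftMax is an assumption

def conv2d_same(x, kernel):
    """x: (Cin, H, W), kernel: (Cin, k, k) -> (H, W); zero-padded cross-correlation."""
    k = kernel.shape[-1]
    r = k // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,cij->hw", win, kernel)

def spatial_attention(x, kernel):
    """x: (C, H, W). Returns one weight per pixel, shape (H, W)."""
    y_avg = x.mean(axis=0)                       # pooling across channels (assumed axis)
    y_max = x.max(axis=0)
    stacked = np.stack([y_avg, y_max])           # Concat -> (2, H, W)
    return sigmoid(conv2d_same(stacked, kernel)) # 7 x 7 convolution, then sigmoid

def residual_attention(x, w1, w2, kernel):
    refined = x * spectral_attention(x, w1, w2)[:, None, None]
    refined = refined * spatial_attention(refined, kernel)[None]
    return x + refined                           # residual placement is an assumption

rng = np.random.default_rng(0)
C, H, W = 8, 5, 5
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // 2)) * 0.1
w2 = rng.standard_normal((C // 2, C)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
band_w = spectral_attention(x, w1, w2)
pixel_w = spatial_attention(x, kernel)
out = residual_attention(x, w1, w2, kernel)
print(band_w.shape, pixel_w.shape, out.shape)
```

Because the spatial weights are computed from the spectrally re-weighted features, the order of the two modules matters; swapping them would give a different result.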

Multi-Scale Residual Spectral-Spatial Feature Extraction Module
Multi-scale information effectively enhances the robustness of the model and increases its classification accuracy [36]. Therefore, in this work, we designed a two-tier cascaded multi-scale residual spectral-spatial feature extraction module that refines the multi-scale information to obtain enhanced discriminative spectral-spatial features. Figure 5 illustrates the structure of the module.
The module uses convolution kernels of different sizes to obtain a better representation of the image and to enhance the feature extraction capability of the model. In this work, we employed 1 × 1 × 1, 3 × 3 × 3, and 5 × 5 × 5 convolution kernels. The 1 × 1 × 1 convolution is employed to extract the global information of the image, and the 3 × 3 × 3 and 5 × 5 × 5 convolutions provide the local information of the image under different receptive fields. The proposed model uses a 3D convolution layer after the residual spectral-spatial attention module so that the spectral-spatial features extracted from the previous residual block are fused by 3D convolution. In this way, the following residual spectral-spatial attention module can acquire both the base features and the optimized features, which helps the model learn feature information. Meanwhile, to obtain the more in-depth feature information extracted by each residual spectral-spatial attention module and to enrich the learning hierarchy of the network, we use residual learning outside each residual spectral-spatial attention module to achieve the effective transfer of features, taking full advantage of the independence of different features to complete the global fusion of the features obtained from different residual blocks. Finally, the features at different scales are fused using the Concat operation to make the acquired spectral and spatial features more comprehensive.
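The idea of running 1 × 1 × 1, 3 × 3 × 3, and 5 × 5 × 5 kernels in parallel and fusing the branches by concatenation can be illustrated with a small NumPy sketch. It is a deliberately simplified assumption: a single-channel volume, one kernel per branch, zero padding, and no attention or cascading. The names `conv3d_same` and `multi_scale_features` are hypothetical.

```python
import numpy as np

def conv3d_same(x, kernel):
    """x: (D, H, W) volume; kernel: (k, k, k). Zero-padded 'same' cross-correlation."""
    k = kernel.shape[0]
    r = k // 2
    xp = np.pad(x, r)                                        # pad every axis by r
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k, k))
    return np.einsum("dhwijk,ijk->dhw", win, kernel)

def multi_scale_features(x, kernels):
    """Apply each kernel to the same input and stack the branches (Concat)."""
    branches = [conv3d_same(x, kern) for kern in kernels]
    return np.stack(branches, axis=0)                        # (n_branches, D, H, W)

rng = np.random.default_rng(0)
x = rng.random((4, 6, 6))                                    # toy (bands, height, width) volume
kernels = [
    np.full((1, 1, 1), 2.0),                                 # 1 x 1 x 1 branch
    np.ones((3, 3, 3)) / 27.0,                               # 3 x 3 x 3 branch
    np.ones((5, 5, 5)) / 125.0,                              # 5 x 5 x 5 branch
]
feats = multi_scale_features(x, kernels)
print(feats.shape)
```

All branches keep the spatial and spectral size of the input, which is what makes the final concatenation possible without resampling.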

Improved Transformer
To further capture the long-distance relationships of sequential spectra, this work uses the transformer to enable the model to parse semantic information from a global perspective. However, when applied to HSI classification tasks, the transformer mainly suffers from the following two problems:
(1) Transformer architectures require large quantities of data and computational resources for training and optimization. The computational complexity of the multi-head self-attention (MHSA) in transformers grows quadratically with the input size. Therefore, applying the transformer module to high-resolution images reduces computational efficiency and slows model inference. The amount of computation can be expressed in terms of the input dimensions, where H denotes the height of the input, W denotes the width of the input, and C denotes the number of channels in the input.
(2) Unlike CNNs, which can enhance performance by stacking additional convolutional layers, the performance of the transformer saturates quickly when scaling to deeper layers. The difficulty of scaling the transformer is mainly caused by the attention collapse problem. As the number of transformer layers increases, the attention maps gradually become similar, and after a certain depth they are essentially the same. This suggests that MHSA may not be able to efficiently learn useful feature representations in deep transformer structures, so the model fails to obtain the desired performance gains [30].
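One simple way to make "the attention maps gradually become similar" measurable is to compare the attention maps of two layers row by row with cosine similarity. The sketch below is an assumption about how such a diagnostic could look, not the measurement used in the paper; `cross_layer_similarity` is a hypothetical name and the maps are random toy data.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_similarity(attn_a, attn_b):
    """Mean cosine similarity between matching rows of two (T, T) attention maps."""
    num = (attn_a * attn_b).sum(axis=-1)
    den = np.linalg.norm(attn_a, axis=-1) * np.linalg.norm(attn_b, axis=-1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
layer_a = softmax(rng.standard_normal((16, 16)))   # toy attention map of one layer
layer_b = softmax(rng.standard_normal((16, 16)))   # toy attention map of another layer
same = cross_layer_similarity(layer_a, layer_a)
different = cross_layer_similarity(layer_a, layer_b)
print(same, different)
```

A value close to 1 for deep layer pairs would indicate that the later layer adds little new attention structure.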
Based on the above two points, this paper proposes a Cross-Re-Attention mechanism to alleviate the problems of attention collapse and the huge computational burden. Generating new attention maps between the layers of the transformer enhances the diversity of each layer and avoids similar feature maps at deep layers. Meanwhile, for contextual information extraction and communication, attention is computed on single-channel feature maps, which significantly reduces the computation compared with attention over all channels. Figure 6 illustrates the framework of the improved transformer block.
Patch merging is applied to the input, which is down-sampled twice, to reduce the resolution and adjust the number of channels. The Cross-Re-Attention block is composed of an Inner-Patch-Re-Attention (IPRA) block and a Cross-Patch-Re-Attention (CPRA) block. By stacking IPRA blocks and CPRA blocks, the module efficiently extracts and integrates features between pixels in a patch and between patches in a feature map. The IPRA part performs pixel-by-pixel Re-Attention computation within each patch, aiming to capture and utilize the relationships between the pixels of that patch. This strategy not only significantly reduces the computational burden, but also greatly enhances the inference efficiency of the model. In the corresponding expression for the computational cost, N denotes the size of the patch in IPRA. Compared to the MHSA in the standard transformer, the computational complexity is reduced from quadratic to linear in the size of the input.
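The quadratic-versus-linear claim can be checked with simple counting. The sketch below is an illustration only, not the authors' cost formula: it counts query-key pairs for a feature map of H × W pixels, first with every pixel attending to every other pixel and then with attention restricted to non-overlapping N × N patches. The function names and the example sizes are hypothetical.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every pixel attends to every pixel: (H*W)^2."""
    tokens = height * width
    return tokens * tokens


def patch_attention_pairs(height: int, width: int, patch: int) -> int:
    """Query-key pairs when attention stays inside N x N patches: H*W*N^2."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)
    tokens_per_patch = patch * patch
    # Each patch costs (N^2)^2 pairs; there are H*W / N^2 patches.
    return num_patches * tokens_per_patch * tokens_per_patch


if __name__ == "__main__":
    for side in (8, 16, 32):
        full = global_attention_pairs(side, side)
        local = patch_attention_pairs(side, side, patch=4)
        print(f"{side}x{side}: global={full}, patch-wise={local}")
```

With N fixed, quadrupling the number of pixels quadruples the patch-wise count, whereas the global count grows sixteen-fold.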
In CNN-based networks, although the receptive field can be expanded by stacking convolutional kernels, sparse connectivity restricts the capture of global dependencies and makes it difficult to expand the receptive field to the global range. In a transformer, however, a feature map with a single channel inherently encompasses global information. The CPRA part takes an individual channel as one of the group inputs. Re-Attention is performed within one group to exchange the information of different patches and obtain global semantic information. Meanwhile, the attention maps are regenerated in the layers of the transformer to enhance their diversity across layers.
By virtue of the Cross-Re-Attention mechanism, deep transformer models can be trained with linear growth in computation. Specifically, the method takes the attention maps generated by the heads and produces new attention maps through dynamic aggregation. A learnable matrix, θ, is defined. This matrix maps the attention maps to regenerated ones, which are then multiplied with the V matrix in the transformer as follows:
$$\text{Re-Attention}(Q, K, V) = \text{Norm}\left(\theta^{\top}\,\text{SoftMax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)\right)V$$
where d indicates the dimension of K. The Norm function is employed to reduce the layer-wise variance. The SoftMax function is employed to compute the weights on the values. Q (Query), K (Key), and V (Value) are the projections of the tokens, that is, the matrices obtained by multiplying the input vectors with the weight matrices learned during training.
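The aggregation step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: θ is taken to be a heads × heads matrix that mixes the per-head attention maps, the Norm step is a pluggable function that defaults to the identity because its exact form is not given here, and all function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(q, k):
    """SoftMax(Q K^T / sqrt(d)) for one head, where d is the dimension of K."""
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def re_attention(qs, ks, vs, theta, norm=lambda m: m):
    """Mix the per-head attention maps with theta, then apply them to V.

    qs, ks, vs: one matrix per head. theta[h][g] is the weight of the
    original head h in the regenerated head g (assumed heads x heads).
    norm: stand-in for the Norm step; identity by default (an assumption).
    """
    heads = len(qs)
    maps = [attention_map(q, k) for q, k in zip(qs, ks)]
    n_q, n_k = len(maps[0]), len(maps[0][0])
    outputs = []
    for g in range(heads):
        mixed = [[sum(theta[h][g] * maps[h][i][j] for h in range(heads))
                  for j in range(n_k)] for i in range(n_q)]
        outputs.append(matmul(norm(mixed), vs[g]))
    return outputs


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    theta = [[0.5, 0.5], [0.5, 0.5]]  # every new head averages both old heads
    print(re_attention([q, q], [k, k], [v, v], theta))
```

With θ set to the identity matrix, the sketch reduces to ordinary multi-head attention, which makes it easy to verify against a standard implementation.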

Results
To validate the performance of the RSSAT model, three public HSI datasets were selected, namely the Indian Pines, Kennedy Space Center (KSC), and XuZhou datasets. To better understand the RSSAT structure, we conducted ablation experiments that remove different modules in order to investigate the validity of each component of the model. Meanwhile, we visualized the HSI classification maps to compare the feature extraction capabilities of the proposed RSSAT and other state-of-the-art (SOTA) methods.

Dataset Description and Experiment Design
(1) The Indian Pines dataset was imaged by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992. A region of size 145 × 145 was selected for annotation and used as a test dataset for HSI classification. The imaging wavelength range of the dataset is 0.
Tables 1-3 report the classes and the number of available samples. The Indian Pines and KSC datasets have 10,249 and 5211 labeled samples, respectively, while the XuZhou dataset has 68,877 labeled samples, considerably more than the other two. Therefore, this work used different proportions of labeled samples as training strategies for the datasets, so that the performance of the RSSAT method was validated with different numbers of training samples. On the Indian Pines and KSC datasets, 20% of the labeled samples were randomly selected for training, 10% for validation, and 70% for testing. For the XuZhou dataset, 10%, 10%, and 80% of the labeled pixels were randomly selected as the training set, validation set, and test set, respectively.
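The split proportions above can be illustrated with a short sketch. It assumes a plain random split over the indices of the labeled samples; whether the original experiments sampled per class (stratified) is not stated, and the function name, seed, and rounding rule are hypothetical choices.

```python
import random


def split_indices(num_samples, train_frac, val_frac, seed=0):
    """Randomly split sample indices into train, validation, and test sets.

    The test set receives every index not used for training or validation.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_train = round(num_samples * train_frac)
    n_val = round(num_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Indian Pines: 10,249 labeled samples, split 20% / 10% / 70%.
    train, val, test = split_indices(10249, 0.20, 0.10)
    print(len(train), len(val), len(test))
```

For the XuZhou dataset the same function would be called with 68,877 samples and fractions of 0.10 and 0.10.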

Experiment Configuration
For a fair comparison, all experiments were conducted on a machine with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz processor, 128 GB RAM, and an NVIDIA GeForce RTX 2080Ti GPU, running Windows 10 with the PyTorch framework and Python 3.7. To minimize the errors and contingencies of the experiments, all reported results are the average of 10 runs. For model training, all experiments used batch processing, with the training batch size set to 32 × 32. The Adam optimizer was employed to learn the weights, and the initial learning rate was set to 0.003. To ensure that the model is adequately trained and performs optimally, we set the maximum number of training epochs to 200. We also adopted an early stopping strategy to avoid overfitting.
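The early stopping strategy mentioned above can be sketched as follows. The learning rate and epoch limit come from the text; the stopping criterion (validation loss), the patience value, and all class and function names are assumptions made for illustration, since the exact rule is not specified.

```python
# Values stated in the text; the optimizer name is informational only.
CONFIG = {"optimizer": "Adam", "learning_rate": 0.003, "max_epochs": 200}


class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # hypothetical value, not from the paper
        self.min_delta = min_delta    # minimum decrease that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_training(val_losses, patience=3):
    """Replay a list of per-epoch validation losses; return the last epoch run."""
    stopper = EarlyStopping(patience=patience)
    limit = CONFIG["max_epochs"]
    for epoch, loss in enumerate(val_losses[:limit], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), limit)


if __name__ == "__main__":
    print(run_training([1.0, 0.9, 0.8, 0.85, 0.9, 0.95, 0.7]))
```

In a real training loop, `step` would be called once per epoch with the measured validation loss, and the weights from the best epoch would typically be restored afterwards.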

Experiment Comparison and Analysis
In this study, the classification performance of the proposed RSSAT was verified by comparison with the SVM [37], 3D-CNN [14], residual neural network (ResNet) [38], Multi-Attention Fusion Network (MAFN) [39], Spectral-Spatial Feature Tokenisation Transformer (SSFTT) [26], SSRN [18], and Dual-View Spectral and Global Spatial Feature Fusion Network (DSGSF) [21] methods. SVM is a traditional image classification method. The remaining models are deep learning-based algorithms, which utilize deep neural networks for HSI classification. The experimental results were quantitatively evaluated by three metrics: overall accuracy (OA), average accuracy (AA), and the Kappa (K) coefficient. OA represents the percentage of correctly classified pixels among all classified pixels, and AA is the mean of the per-class accuracies. The Kappa coefficient is a consistency measure that tests whether the prediction results of the method agree with the true labels. Its value ranges from −1 to 1: a positive value close to 1 indicates superior classification performance, a negative value indicates poor classification performance, and a value close to 0 indicates agreement that is little better than chance.
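As a worked illustration of the three metrics, the sketch below computes OA, AA, and the Kappa coefficient from a confusion matrix using their standard definitions. The function name and the toy matrix are hypothetical, and the code assumes every class has at least one labeled sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(num_classes))

    # Overall accuracy: correctly classified samples over all samples.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies (recalls).
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(num_classes)]
    aa = sum(per_class) / num_classes

    # Kappa: agreement corrected for the agreement expected by chance.
    col_sums = [sum(confusion[i][j] for i in range(num_classes))
                for j in range(num_classes)]
    expected = sum(sum(confusion[i]) * col_sums[i]
                   for i in range(num_classes)) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


if __name__ == "__main__":
    toy = [[45, 5],
           [10, 40]]
    oa, aa, kappa = classification_metrics(toy)
    print(f"OA={oa:.4f}, AA={aa:.4f}, Kappa={kappa:.4f}")
```

Because AA weights every class equally, it is more sensitive than OA to errors on small classes, which matters for imbalanced datasets such as Indian Pines.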
Quantitative classification results for the evaluation indicators and the accuracy of each class are given in Tables 4-6, respectively (results are reported as the mean and standard deviation over ten runs). Overall, the experimental results on all datasets show that our proposal yields the best accuracy with relatively low standard deviations. Specifically, on the Indian Pines dataset, RSSAT achieves an AA of 97.31%, whereas SVM, 3D-CNN, ResNet, MAFN, SSFTT, and SSRN achieve 79.76%, 76.95%, 92.64%, 96.65%, 96.08%, and 92.63%, respectively. On the Indian Pines, KSC, and XuZhou datasets, the OA of our proposal exceeds that of SSRN by 0.30%, 0.44%, and 0.09%, respectively. This consistent advantage over SSRN supports the benefit of our method in improving the representation of specific spectral-spatial features by readjusting the high spatial correlation contexts over spectral bands.
Moreover, in the Indian Pines dataset, the samples of the 16 classes are unevenly distributed. For example, there are only 20 labeled samples for the 9th class (Oats), while the 11th class (Soybean-mintill) contains 2455 labeled samples. This uneven sample distribution presents a serious challenge for HSI classification. For the 9th class, SSFTT (67.22 ± 7.29%), SSRN (58.94 ± 48.22%), and other otherwise well-performing methods still fail to provide a good solution. In the proposed RSSAT method, we use a two-tier cascaded multi-scale residual spectral-spatial feature-learning module that introduces a spectral-spatial attention mechanism. Meanwhile, we strategically embed ResBlock to enhance the nonlinear representation capability. The module mitigates the information loss in the feature stream, preserves more spatial information, and better addresses the challenge of scale diversity under different land cover types. Therefore, RSSAT (85.32 ± 13.57%) achieves the best classification result on the 9th class. At the same time, it obtains the best classification performance in terms of the overall evaluation metrics and the classification maps closest to the ground truth.
In addition, Figures 7-9 show the learning curves of the proposed method. On all three datasets, both the loss and the accuracy curves become smooth as the number of epochs increases. The maximum fluctuation in the loss values is less than 0.5, which indicates good convergence of the model. Meanwhile, the gradual fitting of the accuracy curves illustrates the generalization ability of the model. Based on the above analysis, our method exploits complementary hybrid blocks to enable an efficient characterization of the deep spectral-spatial features.

Visualization of Classification Maps
To visually demonstrate the effectiveness of the RSSAT method, we analyzed the classification results over the Indian Pines, KSC, and XuZhou datasets, as shown in Figures 10-12. These classification maps show that RSSAT produces fewer misclassified pixels and cleaner boundaries than the other SOTA models. Therefore, we can conclude that the RSSAT method outperforms the compared methods for classification.

Discussions 4.1. Feature Visualization Analysis
For the purpose of investigating the feature representation capability of RSSAT, the t-distributed stochastic neighborhood embedding (t-SNE) algorithm [40] was used to visualize and compare the features extracted by ResNet and RSSAN in 2D space.As shown in Figures 13-15, the samples belonging to the same class are clearly clustered into a group in the figures, while samples of different classes are easily separated from each other.From the visualization results, the RSSAT method is more significant and effective in clustering the features, which further proves that the method gains the abstract representation of spectral-spatial features for HSIs.

Time Cost Comparison
To comprehensively evaluate the efficiency of the different methods in the HSI classification task, the running time and computational cost of each method are recorded in Table 7. The training time of RSSAT is slightly longer than that of 3D-CNN, SSFTT, and SSRN. This is mainly attributed to the complexity of the RSSAT design, which contains more layers and therefore lengthens the training process to some extent. However, RSSAT exhibits a significant advantage in classification accuracy, especially in the accurate classification of small target samples, which compensates for its minor shortfall in training time. This trade-off is reasonable considering that classification accuracy is often the crucial metric in practical applications. Meanwhile, RSSAT shows significant advantages in both efficiency and performance compared to ResNet and MAFN, which further demonstrates that RSSAT can achieve superior classification performance at a moderate computational cost. Overall, although RSSAT may not be the optimal choice in terms of execution time and computational cost, its high overall classification accuracy and its ability to accurately recognize small target samples make up for these shortcomings.
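As an illustration of how training and test times of this kind might be measured, the following sketch wraps an arbitrary callable in a wall-clock timer. The helper `timed` and the stand-in workload `dummy_train` are hypothetical; the paper does not describe its timing procedure, so this is only one plausible way to obtain such numbers.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def dummy_train(n_steps):
    # Stand-in for a real training loop (hypothetical workload).
    return sum(i * i for i in range(n_steps))


result, train_seconds = timed(dummy_train, 100_000)
# Training time is reported in minutes and test time in seconds in Table 7.
print(f"training: {train_seconds / 60:.4f} min")
```

Wall-clock timings depend on hardware and on what else is running, so comparisons between methods are only meaningful when all of them are measured on the same machine.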

Different Numbers of Training Samples
To approximate real-world application scenarios and to test the generalization ability of the model under limited data, we reduced the proportion of training and validation samples. Specifically, we randomly selected 5% of the samples in the Indian Pines dataset as the training set and 5% as the validation set, and used the remaining samples as the test set. The experimental results are shown in Table 8. The performance of every method degrades to a different degree as the number of training samples is reduced. Compared with the other methods, RSSAT still shows clear advantages with fewer samples, which indicates that RSSAT has superior generalization ability.
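A 5%/5%/90% split of this kind can be sketched as follows. The sketch assumes that the random selection is made per class and that every class keeps at least one training and one validation sample; neither detail is specified in the text, and the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.05, val_frac=0.05, seed=0):
    """Split sample indices per class into train/validation/test index lists."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one training and one validation sample per class (an assumption).
        n_train = max(1, round(train_frac * len(indices)))
        n_val = max(1, round(val_frac * len(indices)))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels: three classes with 200, 100, and 20 samples (illustrative only).
labels = [0] * 200 + [1] * 100 + [2] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 16 16 288
```

Splitting per class keeps small classes represented in the training set, which matters for a dataset with an uneven class distribution such as Indian Pines.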

Ablation Experiments Analysis
In this experiment, we again used the three datasets to perform ablation experiments that investigate the gain contributed by each component of RSSAT by removing different modules. The relevant results are reported in Table 9.
(1) We employed the SSRN model with multi-scale information integration as the basic model architecture (this experimental model is defined as Base).
(2) To verify the validity of the residual spectral-spatial attention module in RSSAT, the experiment added only the improved transformer to Base (this experimental model is defined as Base+IT).
(3) To verify the validity of the improved transformer module in RSSAT, the experiment added only the multi-scale residual spectral-spatial attention module to Base (this experimental model is defined as Base+RSS).
Specifically, Base+IT increased OA by 0.17%, 0.57%, and 0.57% on the Indian Pines, KSC, and XuZhou datasets, respectively, which shows that the transformer adequately captures contextual information, enabling the network to parse semantic information from a global perspective. Base+RSS improved OA by 0.26%, 0.43%, and 0.16% on the three datasets, demonstrating that the multi-scale residual spectral-spatial feature extraction module helps the architecture to adaptively learn the important features of each spectral-spatial domain while emphasizing information-rich features and suppressing less useful ones.
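The OA gains discussed above are differences in overall accuracy between model variants. As a reminder of how OA and the companion metric AA (average accuracy) are conventionally computed, here is a minimal sketch from a confusion matrix. It assumes the standard definitions (OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies); the matrix values are illustrative and are not results from this paper.

```python
def overall_and_average_accuracy(confusion):
    """Compute OA and AA from a square confusion matrix (rows = true class)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    oa = correct / total
    # Per-class accuracy; classes without any samples are skipped.
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    aa = sum(per_class) / len(per_class)
    return oa, aa


# Toy 3-class confusion matrix (illustrative values, not results from the paper).
confusion = [
    [95, 3, 2],
    [2, 18, 0],
    [3, 0, 7],
]
oa, aa = overall_and_average_accuracy(confusion)
print(f"OA = {oa:.4f}, AA = {aa:.4f}")  # OA = 0.9231, AA = 0.8500
```

In this toy example AA is lower than OA because the smallest class is classified worst, which is why both metrics are worth reporting for class-imbalanced datasets.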

Conclusions
In this paper, a novel hybrid architecture is proposed for HSI classification. Specifically, the proposed RSSAT method improves the representational ability of the extracted features and captures long-range relationships in the spectral domain by combining the strengths of a transformer and a CNN. In RSSAT, the residual spectral-spatial attention mechanism is embedded in the multi-scale feature-learning part to jointly extract spectral and spatial features from the selected multi-scale feature maps and to highlight the discriminative information. Considering the approximately continuous nature of HSI spectra, we propose the Cross-Re-Attention mechanism to improve the standard transformer and enable deeper ViT training, which effectively alleviates the attention collapse problem and the computational burden of ViT. Overall, RSSAT successfully extracts discriminative features in complex regions and significantly enhances long-range contextual information in the spectral domain. The classification performance was evaluated on three challenging datasets. The overall accuracy of the RSSAT model was 98.71%, 99.33%, and 99.72%, and the average accuracy was 97.31%, 99.02%, and 99.72%, for the Indian Pines, KSC, and XuZhou datasets, respectively.
Since the number of samples in the Indian Pines dataset is small and unevenly distributed, there is still room for improvement in the classification performance of the RSSAT model. In future work, we will study methods such as data expansion, loss constraints between features and HSI data, and transformer optimization to improve classification performance on small-sample HSI datasets.

Figure 1. Framework of the proposed RSSAT model for HSI classification.

Figure 6. The internal structure of the improved transformer block.

Figure 7. Learning curves for the Indian Pines dataset. (a) Valid loss vs. train loss in each epoch. (b) Valid accuracy vs. train accuracy in each epoch.

Electronics 2024, 13, x FOR PEER REVIEW 14 of 21

Figure 8. Learning curves for the KSC dataset. (a) Valid loss vs. train loss in each epoch. (b) Valid accuracy vs. train accuracy in each epoch.

Figure 9. Learning curves for the XuZhou dataset. (a) Valid loss vs. train loss in each epoch. (b) Valid accuracy vs. train accuracy in each epoch.

Figure 13. Visualization of the 2D spectral-spatial features for the samples in the Indian Pines dataset via t-SNE. (a) ResNet. (b) RSSAT.

Figure 15. Visualization of the 2D spectral-spatial features for the samples in the XuZhou dataset via t-SNE. (a) ResNet. (b) RSSAT.
4-2.5 µm, and it can continuously provide images in 220 hyperspectral bands with a spatial resolution of 20 m/pixel. Since bands 104-108, 150-163, and 220 are affected by water absorption, these 20 bands are typically excluded, leaving only 200 bands for analysis. The dataset contains 10,249 labeled samples and 16 vegetation classes. It is worth mentioning that the samples are unevenly distributed across these 16 classes of ground objects and that the dataset is prone to mixed pixels, which poses a challenge for classification.
(2) The KSC dataset was acquired by the National Aeronautics and Space Administration (NASA) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument. The dataset covers 224 spectral bands and 13 classes, and an area of 512 × 614 pixels was selected for detailed labeling to ensure the accuracy and usefulness of the data. It has a spectral resolution of 10 nm over a range of 0.4 to 2.5 µm, capturing subtle spectral differences, while its 18 m spatial resolution preserves the spatial characteristics of the ground features.
(3) The XuZhou dataset was acquired in November 2014 in XuZhou City, Jiangsu Province, China. A test area of 500 × 260 pixels with 436 bands was selected for labeling to ensure classification accuracy. The test area, which is located near a coal mining area, has been categorized into nine classes of ground objects.
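The band-removal arithmetic for the Indian Pines dataset described above can be checked with a short sketch. It assumes that the quoted band ranges are 1-based and inclusive, which the text does not state explicitly; the function name is hypothetical.

```python
def retained_band_indices(n_bands, removed_ranges):
    """Return 0-based indices of bands kept after dropping 1-based inclusive ranges."""
    removed = set()
    for first, last in removed_ranges:
        removed.update(range(first, last + 1))
    return [band - 1 for band in range(1, n_bands + 1) if band not in removed]


# Band ranges quoted in the text for Indian Pines (assumed 1-based, inclusive).
INDIAN_PINES_REMOVED = [(104, 108), (150, 163), (220, 220)]

kept = retained_band_indices(220, INDIAN_PINES_REMOVED)
print(len(kept))  # 200
```

The three ranges cover 5 + 14 + 1 = 20 bands, which matches the 200 retained bands stated in the text. The returned indices could then be used to slice the spectral axis of the data cube.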

Table 1. Details of the Indian Pines dataset.
No. | Class | Sample Numbers
1 | Alfalfa | 46
2 | … | 1428
3 | Corn-mintill | 830
4 | Corn | …
6 | … | 730
7 | Grass-pasture-mowed | …
13 | … | 205
14 | Woods | 1265
15 | Buildings-Grass-Trees-Drives | 386
16 | Stone-Steel-Towers | 93
Total | | 10,249

Table 2. Details of the KSC dataset.
No. | Class | Sample Numbers
1 | Scrub | 1997
2 | Willow | 3726

Table 3. Details of the XuZhou dataset.

Table 4. Quantitative classification performance of different methods on the Indian Pines dataset.

Table 5. Quantitative classification performance of different methods on the KSC dataset.

Table 6. Quantitative classification performance of different methods on the XuZhou dataset.

Table 7. Training time in minutes (m) and test time in seconds (s) of the comparison methods and the RSSAT method for the three datasets.

Table 8. Classification performance with 5% training samples for the Indian Pines dataset.

Table 9. Ablation experiments for each component.