SS-TMNet: Spatial–Spectral Transformer Network with Multi-Scale Convolution for Hyperspectral Image Classification

Abstract: Hyperspectral image (HSI) classification is a significant foundation of remote sensing image analysis, widely used in biology, aerospace, and other applications. Convolutional neural networks (CNNs) and attention mechanisms have shown outstanding ability in HSI classification and have been widely studied in recent years. However, existing CNN-based and attention-based methods cannot fully use the spatial–spectral information, which limits further improvement of HSI classification accuracy. This paper proposes a new spatial–spectral Transformer network with multi-scale convolution (SS-TMNet), which can effectively extract local and global spatial–spectral information. SS-TMNet includes two key modules: a multi-scale 3D convolution projection module (MSCP) and a spatial–spectral attention module (SSAM). The MSCP uses multi-scale 3D convolutions with different depths to extract fused spatial–spectral features. The SSAM includes three branches: height spatial attention, width spatial attention, and spectral attention, which extract the fused information of spatial and spectral features. The proposed SS-TMNet was tested on three widely used HSI datasets: Pavia University, Indian Pines, and Houston2013. The experimental results show that the proposed SS-TMNet is superior to the existing methods.


Introduction
Hyperspectral image classification is a significant application of remote sensing technology. A hyperspectral remote sensing image has many spectral bands, which provide rich information for more precise classification of scene objects. Each pixel in a hyperspectral image is a high-dimensional vector with hundreds of bands, where each value represents the spectral reflectance at the corresponding wavelength [1]. HSI classification is the pixel-by-pixel classification of remote sensing scenes and is extensively used in agriculture, aerospace, biology, and other fields [2,3].
In the past two decades, hyperspectral image classification has received significant attention as an essential application of remote sensing technology. Some traditional machine learning methods [4][5][6] were proposed for HSI classification tasks in the early years. For instance, the support vector machine (SVM) [4] and K-nearest neighbor (KNN) [5] were used to capture the abundant spectral information in HSI classification. Li et al. [6] presented a multinomial logistic regression method to classify HSIs using semi-supervised learning of a posterior distribution. An extended morphological profiles (EMPs) method [7] was proposed to handle the spatial information in HSIs through multiple morphological operations. Although the above HSI classification methods have proven effective in some cases, their classification performance is unsatisfactory when the environment is very complex.
SS-TMNet applies multi-scale convolution and spatial-spectral attention to capture the local and global dependencies of each dimension. The main contributions of this work are as follows.
• We design a new Transformer-based HSI classification method (SS-TMNet), which uses multi-scale convolution and spatial-spectral attention to extract local and global information efficiently.
• We design an MSCP module to extract the fused spatial-spectral features as the initial feature projection. This module uses multi-scale 3D convolutions and feature fusion to extract fused spatial-spectral features from multiple scales efficiently.
• We propose an SSAM module to encode the input features from the height, width, and spectral dimensions. We use multi-dimensional convolution and self-attention to extract more effective local and global spatial-spectral features.
• We have conducted extensive experiments on three benchmark datasets. The experimental results show that the proposed SS-TMNet outperforms state-of-the-art CNN-based and Transformer-based hyperspectral image classifiers.
The remainder of this work is organized as follows. Section 2 reviews the related work. Section 3 presents the proposed SS-TMNet architecture and details the MSCP and SSAM modules. Section 4 reports and analyzes the experimental results. Section 5 concludes this work.

Related Work
Hyperspectral image classification technology is one of the essential technologies in the field of remote sensing. After years of research, many methods have been presented for HSI classification tasks [35][36][37][38][39]. This section summarizes related work in three parts: traditional classification methods, CNN-based methods, and Transformer-based methods.

Traditional Classification Methods
Some kernel-based methods were proposed in the early stage of HSI classification research. For instance, Melgani et al. [4] applied the SVM method to HSI classification. Unlike SVM, the multiple kernel learning (MKL) method proposed by Rakotomamonjy et al. [40] aims to learn the kernel and the associated predictor simultaneously in a supervised learning setting. However, both methods focus only on the feature information of the spectral dimension and overlook the spatial dimension. Benediktsson et al. [7] proposed extended morphological profiles (EMPs) to exploit the spatial feature information of HSI. Extended attribute profiles and extended multi-attribute profiles (EMAP) were presented in [41] for capturing spatial information. To make better use of the spatial features in HSI, Li et al. [42] presented a generalized composite kernel (GCK) method to model spatial information from the extended multi-attribute profiles. In addition, due to the high-dimensional characteristics of HSIs, many works specifically explore how to reduce dimensionality and extract features more effectively. For instance, Bandos et al. [43] presented a linear discriminant analysis (LDA) method, which can be utilized to solve related ill-posed problems for HSIs. Villa et al. [44] applied Independent Component Analysis (ICA) to HSI classification and presented the Independent Component Discriminant Analysis (ICDA) method, which estimates the density function of each independent component using a nonparametric kernel density estimator. Furthermore, Licciardi et al. [45] proposed nonlinear PCA (NLPCA) for HSI classification. There are other methods in the literature, such as DSML-FS based on multimodal learning, presented by Zhang et al. [46], which utilizes joint structured sparse regularization to explore the relationship between the intrinsic structure of the data and its different characteristics. Jouni et al. [47] proposed an HSI classification method based on tensor decomposition and mathematical morphology by modeling the data as a higher-order tensor. Additionally, Luo et al. [48] introduced a new dimension reduction method for HSI classification, known as local geometric structure Fisher analysis (LGSFA), which uses neighboring points and corresponding intra-class reconstruction points to enhance intra-class compactness and inter-class separability. However, these methods are based on shallow feature representations and often yield unsatisfactory classification results in complex scenes.

CNN-Based Methods
With the development of deep learning, CNNs have performed excellently in extracting local spatial features, and numerous CNN-based methods have been presented for the HSI classification task. Hu et al. [49] introduced the CNN into HSI classification and proposed a five-layer 1D-CNN-based method, which improved performance compared with traditional classification methods. Hao et al. [20] presented a 2D-CNN-based method to classify ground plants. In addition, Fang et al. [22] presented a 3D asymmetric inception network to extract spatial-spectral features and overcome the overfitting problem. Chang et al. [23] presented a novel 3D-CNN-based method to capture the joint spatial-spectral information by stacking layers of 3D-CNN and 2D-CNN. To capture fused spatial-spectral information more effectively, He et al. [21] used multi-scale 3D-CNN for HSI classification and presented a multi-scale 3D deep convolutional neural network (M3D-DCNN). Although CNN-based methods perform well in HSI classification, capturing the long-range dependence between spectral bands remains challenging. Furthermore, the excessive dependence of CNNs on local spatial information makes it difficult to improve classification accuracy further.

Transformer-Based Methods
Recently, owing to the excellent performance of the Transformer in the NLP field, many researchers have applied it to image classification. Dosovitskiy et al. [18] presented ViT, a Transformer-based method for image classification. However, in deep ViT models the top-level feature maps become similar, which prevents the self-attention mechanism from learning deeper feature representations. Zhou et al. [24] presented a ViT-based method called DeepViT that can effectively use deep architectures by dynamically aggregating multiple attention maps to generate a new set of attention maps. Although spectral dependence is considered in these methods, the effect of spatial features is omitted. Considering the superior performance of CNNs in extracting local spatial features, many researchers applied convolution within the Transformer to obtain better performance. Graham et al. [50] re-examined CNNs, applied them to ViT, and proposed a hybrid CNN-ViT network for image classification called LeViT. To extract multi-scale features with ViT, Chen et al. [51] presented a multi-scale Transformer using cross attention, called CrossViT, which uses multiple two-branch multi-scale encoders for feature extraction. Many researchers have also introduced Transformer-based methods into the HSI classification field. For example, He et al. [25] presented an HSI classification method called the spatial-spectral transformer (SST), which uses VGGNet [52] to capture basic spatial information and then feeds the Transformer to capture spectral information. Yang et al. [53] presented a novel Transformer-based method called HiT for HSI classification, which uses a double-branch 3D convolution as the feature mapping, embeds convolution in the encoder of the Transformer architecture, and extracts feature information from different dimensions using convolution. However, these methods do not effectively exploit the advantages of convolution within the attention mechanism, which limits further improvement of the classification performance. In this work, we propose a novel Transformer-based method called SS-TMNet, which effectively combines the advantages of convolution and attention mechanisms to extract global and local spatial-spectral features. In SS-TMNet, two modules, MSCP and SSAM, are proposed to extract multi-scale fused spatial-spectral information and to construct cross-dimensional interactions between different dimensions, respectively.

The Proposed SS-TMNet Method
In this section, we introduce our SS-TMNet method in three aspects: the overall architecture of SS-TMNet, the MSCP module, and the encoder sequence module.

The Framework of the Proposed SS-TMNet
This work presents a novel Transformer-based HSI classification method called SS-TMNet. SS-TMNet consists of two key modules: the MSCP module and the SSAM module. MSCP performs the initial feature projection of the HSI, using multi-scale 3D convolution to capture the fused multi-scale spatial-spectral information. SSAM captures local and global spatial-spectral dependencies from the different spatial and spectral dimensions. The encoder sequence includes four stages, with a downsampling layer added after the second stage to reduce the dimensions. Moreover, a global residual connection links the input to the final output. Figure 1 shows the overall architecture of our SS-TMNet method.

Multi-Scale 3D Convolution
Hyperspectral images differ from ordinary RGB images. Because of the high-dimensional characteristics of HSI, ordinary 2D convolution cannot effectively capture the fused spatial-spectral information, since it ignores the dependence between spectral bands. In contrast, 3D convolution processes features along all three dimensions and can therefore extract features more effectively. In general, HSI data can be represented by a tensor of size C × S × H × W, where C represents the number of channels, S denotes the spectral domain, and H and W are the height and width in the spatial domain. Based on this, we can apply 3D convolution to the initial HSI data to extract a more effective feature representation for subsequent network learning. More specifically, the 3D convolution is formulated as

$$v_{ij}^{xyz} = F\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big),$$

where $v_{ij}^{xyz}$ is the value at position (x, y, z) of the jth feature map in the ith layer, m indexes the feature maps in the (i − 1)th layer connected to the jth feature map, $P_i$ and $Q_i$ are the height and width of the spatial convolution kernel, $R_i$ is the size of the 3D kernel along the spectral dimension, $w_{ijm}^{pqr}$ is the value at the (p, q, r)th position of the kernel connected to the mth feature map of the preceding layer, $b_{ij}$ is the bias of the jth feature map in the ith layer, and F represents the activation function.
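For illustration, the following PyTorch sketch shows a 3D convolution sliding over both the spectral and spatial dimensions of an HSI patch; the shapes, channel counts, and padding here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative shapes only: a batch of HSI patches with C = 1 input channel,
# S = 103 spectral bands, and a 15 x 15 spatial window.
x = torch.randn(8, 1, 103, 15, 15)  # (batch, C, S, H, W)

# A 3D kernel slides over the spectral and both spatial dimensions at once,
# so spectral context is preserved in the output features.
conv3d = nn.Conv3d(in_channels=1, out_channels=16,
                   kernel_size=(11, 3, 3), padding=(5, 1, 1))
y = torch.relu(conv3d(x))
print(y.shape)  # torch.Size([8, 16, 103, 15, 15])
```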
We studied the data characteristics of HSI and found that multi-scale 3D convolution can perform feature mapping more effectively than ordinary 3D convolution. As shown in Figure 2, we developed a multi-scale 3D convolution to build the data mapping module and propose a new feature mapping module called MSCP. The multi-scale convolution layer uses 3D convolutions of different sizes to extract feature maps. From a global perspective, we extract features of interest in the image to obtain new feature maps of different sizes and then fuse them to obtain the spatial-spectral feature map. The feature map produced by the MSCP module carries rich fused spatial-spectral information, which improves the efficiency of feature extraction in the subsequent network.

Module Composition

Figure 2 shows that the MSCP module comprises multiple multi-scale 3D convolution layers and feature fusion modules. MSCP processes the input HSI data in three phases. Suppose X ∈ R^{C×S×H×W} is a patch of the input data (in this paper, the input image is divided into several patches of size H × W for processing; H = W = 15 in the experiments). In the first phase P_1, the input X is fed into a 3D convolution layer with a ReLU operation to extract the spatial-spectral features X_1, where the convolution kernel size is set to (11, 3, 3). Then, X_1 is fed into a multi-scale 3D convolution layer M_1 with four different convolution kernel sizes, mainly used to extract spectral characteristics at different scales, and the multi-scale outputs are fused by addition. To prevent overfitting, a residual connection links the fused multi-scale feature to the output X_1 of the first 3D convolution layer. BatchNorm and ReLU operations then produce the first-stage output X_{P_1}. The feature mapping of the first stage is:

$$X_1 = \mathrm{ReLU}(\mathrm{Conv3D}_{(11,3,3)}(X)), \qquad X_{P_1} = \mathrm{ReLU}\Big(\mathrm{BN}\Big(X_1 \oplus \sum_i M_1^i(X_1)\Big)\Big),$$

where ReLU represents the activation function, BN represents the BatchNorm operation, ⊕ represents the residual connection, and i indexes the 3D convolutions of different scales.
In the second stage P_2, we first feed the first-stage output X_{P_1} into a 3D convolution layer with a ReLU operation, whose convolution kernel size is (9, 3, 3), to further extract spatial-spectral characteristics. The output features are passed through two successive multi-scale 3D convolution layers M_2 and M_3 with feature fusion and residual connection operations to extract deeper spectral features, followed by BatchNorm and ReLU operations to obtain the output X_{P_2}. In the third stage P_3, an activation function and a 3D pointwise convolution are used to further process the output features X_{P_2} of the second stage. Finally, the MSCP module outputs the final representation X_{P_3} ∈ R^{H×W×D} as the extracted features. The second and third stages can be written as:

$$X_2 = \mathrm{ReLU}(\mathrm{Conv3D}_{(9,3,3)}(X_{P_1})), \quad X_2' = X_2 \oplus \sum_i M_2^i(X_2), \quad X_2'' = X_2' \oplus \sum_i M_3^i(X_2'),$$
$$X_{P_2} = \mathrm{ReLU}(\mathrm{BN}(X_2'')), \qquad X_{P_3} = \mathrm{PointConv}(\mathrm{ReLU}(X_{P_2})).$$

Overall, the proposed approach employs multiple multi-scale 3D convolution layers to extract fused spatial-spectral feature information at multiple scales, as well as shallow local spatial-spectral dependencies. To mitigate the issue of gradient vanishing, residual connections are used in multiple locations. The extracted fused spatial-spectral information provides an excellent feature representation for the subsequent encoder sequence.
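A minimal PyTorch sketch of one multi-scale 3D convolution layer with additive fusion and a residual connection, as described above; the four spectral kernel depths and the per-layer placement of BatchNorm/ReLU are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiScale3DConv(nn.Module):
    """One multi-scale 3D convolution layer M_i: parallel 3D convolutions
    with different spectral kernel depths, fused by addition and linked to
    the input through a residual connection."""
    def __init__(self, channels, spectral_kernels=(3, 5, 7, 9)):
        super().__init__()
        # Parallel branches with different spectral extents; "same" padding
        # keeps all outputs at the input resolution so they can be added.
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 3, 3),
                      padding=(k // 2, 1, 1))
            for k in spectral_kernels
        ])
        self.bn = nn.BatchNorm3d(channels)

    def forward(self, x):                                # x: (B, C, S, H, W)
        fused = sum(branch(x) for branch in self.branches)  # additive fusion
        return torch.relu(self.bn(x + fused))            # residual + BN + ReLU
```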

Encoder
As shown in Figure 1, the encoder consists of two modules: the SSAM and FFN modules. SSAM encodes features along the height, width, and spectral dimensions to extract local and global spatial-spectral features. The FFN consists of linear layers with the GELU activation function, which transforms the features and extracts deeper features. The encoder adds LayerNorm and residual connection operations to alleviate overfitting and gradient vanishing and to cooperate more effectively with the above two modules for feature extraction. Given an input embedding X_{P_3} ∈ R^{H×W×D}, the coding process is:

$$Y = X_{P_3} \oplus \mathrm{SSAM}(\mathrm{LN}(X_{P_3})), \qquad Z = Y \oplus \mathrm{FFN}(\mathrm{LN}(Y)),$$

where ⊕ represents the residual connection, LN the layer normalization, Y the residual connection between X_{P_3} and the output of SSAM, and Z the FFN module's output. In general, SS-TMNet has four stages, and each stage consists of an encoder sequence composed of a different number of encoders. The implementation details of SSAM are introduced in the next section.
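Viewed as code, this coding process is a pre-norm residual block; a minimal PyTorch sketch follows, where SSAM is a placeholder submodule and the FFN hidden width is an illustrative assumption:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """One encoder block: pre-LayerNorm SSAM and FFN sub-blocks,
    each wrapped in a residual connection."""
    def __init__(self, dim, ssam, hidden_dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ssam = ssam                     # spatial-spectral attention module
        self.ffn = nn.Sequential(            # linear -> GELU -> linear
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

    def forward(self, x):                    # x: (..., H, W, D) embedding
        y = x + self.ssam(self.norm1(x))     # Y = X ⊕ SSAM(LN(X))
        z = y + self.ffn(self.norm2(y))      # Z = Y ⊕ FFN(LN(Y))
        return z
```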

SSAM Module
Figure 3 shows the structure of the SSAM, which encodes the input features along the height, width, and spectral dimensions to extract local and global spatial-spectral features more effectively. We feed X_in (X_{P_3} after the layer normalization operation) into three branches for height-spatial coding, width-spatial coding, and spectral coding.

In the height branch L_H, we apply a depthwise convolution layer with kernel size (1, 3) to X_in to obtain local height spatial features, which are fed into the height spatial attention (HSA) to calculate the spatial self-attention and obtain the globally dependent X_H. In the width branch L_W, we employ a depthwise convolution layer with kernel size (3, 1) to process X_in and obtain local width spatial characteristics; the width spatial attention (WSA) module is then used to derive the globally dependent X_W from them. In the spectral branch L_S, local spectral information is captured from X_in using a pointwise convolution layer with kernel size (1, 1), and the spectral attention (SA) is then utilized to obtain the globally dependent X_S. Three local residuals connecting the input X_in with the outputs of the height spatial attention, width spatial attention, and spectral attention are also added to alleviate gradient vanishing. It is worth noting that three learnable parameters γ_h, γ_w, and γ_s are used to adjust each branch's contribution to the learned characteristics. Finally, we fuse the feature information of the three branches with an addition operation and a linear projection, and add a global residual connection to X_in to obtain the final output X_out ∈ R^{H×W×D}. The calculation of SSAM is as follows:

$$X_H = X_{in} \oplus \mathrm{HSA}(\mathrm{DepthConv}_{(1,3)}(X_{in})),$$
$$X_W = X_{in} \oplus \mathrm{WSA}(\mathrm{DepthConv}_{(3,1)}(X_{in})),$$
$$X_S = X_{in} \oplus \mathrm{SA}(\mathrm{PointConv}(X_{in})),$$
$$X_{out} = X_{in} \oplus F(\gamma_h X_H + \gamma_w X_W + \gamma_s X_S),$$

where γ_h, γ_w, and γ_s represent the learnable parameters, DepthConv represents a depthwise convolution layer, PointConv represents a pointwise convolution layer, ⊕ represents the residual connection, and F denotes the linear projection. Next, we detail the spatial and spectral attention modules.
As shown in Figure 4, we introduce spatial attention for feature extraction to establish rich spatial feature dependencies. We first reshape X_in ∈ R^{H×W×D} to X_re ∈ R^{(H×W)×D} and then send it to three parallel linear layers for feature mapping to obtain the outputs {Q, K, V} ∈ R^{N×D}, where N = H × W. The concrete procedure of spatial attention can be formulated as:

$$X_{att} = \mathrm{Softmax}\!\left(\frac{Q \otimes K^{T}}{\sqrt{d}}\right) \otimes V,$$

where ⊗ denotes the operation of matrix multiplication and d is the scale factor. Finally, a linear layer maps the feature and reshapes the dimensions to obtain the final output X_out ∈ R^{H×W×D}. Our spectral attention part is similar to the spatial attention.
To simplify the calculation, our spectral attention discards the initial linear projection layer and uses the input features to calculate the self-attention.
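A minimal PyTorch sketch of this spatial attention (a single-head simplification of the description above; names are illustrative):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial self-attention over the H*W positions: reshape to N = H*W
    tokens of dimension D, project to Q, K, V, and apply scaled
    dot-product attention."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5              # 1 / sqrt(d)

    def forward(self, x):                     # x: (B, H, W, D)
        b, h, w, d = x.shape
        t = x.reshape(b, h * w, d)            # N = H*W tokens
        q, k, v = self.q(t), self.k(t), self.v(t)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        y = self.out(attn @ v)                # (B, N, D)
        return y.reshape(b, h, w, d)
```

Per the note above, the spectral attention variant would drop the initial Q/K/V projections and compute the self-attention from the input features directly.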
In summary, the SSAM module uses depthwise and pointwise convolutions to map features from the height, width, and spectral dimensions, respectively, and further extracts features using spatial and spectral attention. The spatial and spectral self-attention mechanisms capture long-range dependencies of both spatial and spectral features. Specifically, our SSAM module integrates convolution with self-attention, extracting features from three dimensions and fusing them to obtain feature representations with both global and local dependencies.
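Putting these pieces together, the three-branch structure of SSAM might be sketched as follows; this is a structural sketch only, where HSA, WSA, and SA are attention submodules passed in from outside, and the exact placement of the learnable weights γ and of the final projection is an assumption:

```python
import torch
import torch.nn as nn

class SSAM(nn.Module):
    """Sketch of the three-branch SSAM: depthwise convolutions extract local
    height/width features, a pointwise convolution extracts local spectral
    features, attention submodules add global dependencies, and learnable
    weights balance the fused branches."""
    def __init__(self, dim, hsa, wsa, sa):
        super().__init__()
        # Branch convolutions; inputs are (B, D, H, W) with D as channels.
        self.conv_h = nn.Conv2d(dim, dim, (1, 3), padding=(0, 1), groups=dim)
        self.conv_w = nn.Conv2d(dim, dim, (3, 1), padding=(1, 0), groups=dim)
        self.conv_s = nn.Conv2d(dim, dim, 1)   # pointwise (1, 1) convolution
        # Attention submodules, assumed to map (B, D, H, W) -> (B, D, H, W).
        self.hsa, self.wsa, self.sa = hsa, wsa, sa
        self.gamma = nn.Parameter(torch.ones(3))  # learnable branch weights
        self.proj = nn.Linear(dim, dim)           # final linear projection

    def forward(self, x):                         # x: (B, D, H, W)
        xh = x + self.hsa(self.conv_h(x))         # local + global, residual
        xw = x + self.wsa(self.conv_w(x))
        xs = x + self.sa(self.conv_s(x))
        fused = self.gamma[0] * xh + self.gamma[1] * xw + self.gamma[2] * xs
        out = self.proj(fused.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + out                            # global residual connection
```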

Experiments
This section introduces the HSI datasets used in the experiments, including the Pavia University, Indian Pines, and Houston2013 datasets. In addition, we introduce the parameter settings, evaluation metrics, and comparison models in experimental settings. Then, we show and analyze the results. Finally, the ablation experiment and model performance analysis are introduced.

Pavia University Dataset
This dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, Italy. The spatial size of the hyperspectral image is 610 × 340 pixels, with 103 spectral bands ranging from 0.43 to 0.86 µm after excluding 12 water absorption bands. The dataset has 9 classification categories and is shown in Figure 5.

Indian Pines Dataset
This dataset was collected in 1992 by the AVIRIS sensor over northwestern Indiana, USA. The spatial size of the hyperspectral image is 145 × 145 pixels, and the spectral bands range from 0.4 µm to 2.5 µm. The total number of spectral bands is 200 after excluding 20 water absorption bands. The available ground truth comprises 16 classes. The dataset is shown in Figure 6.

Houston2013 Dataset
This dataset was captured by the CASI-1500 sensor over the University of Houston and its surroundings in Texas, USA. The spatial size of the image in the dataset is 949 × 1905 pixels, and the spectral dimension includes 144 bands. The dataset has 15 classification categories. The dataset is shown in Figure 7.

Parameters Setting
The training samples were set to 10% in each of the three datasets, and the rest were used as test samples; the selection of training and testing samples was random. To ensure the fairness of the comparative trials, we ran all the comparison models ten times and recorded the results as mean ± standard deviation. The proposed SS-TMNet and the compared methods were implemented on an NVIDIA RTX 3080Ti GPU machine with the PyTorch [54] platform. We used the Adam optimizer for gradient descent with an initial learning rate of 1 × 10⁻⁴. The mini-batch size was set to 32, and we trained for 200 epochs on all three benchmark datasets.
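These settings correspond to a conventional PyTorch training loop. A minimal runnable sketch under the stated hyperparameters follows; the tiny stand-in model, the dummy dataset, and the cross-entropy loss are assumptions for illustration only:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins so the loop runs; in practice `model` is the network and
# the dataset yields HSI patches with their labels.
model = nn.Sequential(nn.Flatten(), nn.Linear(103 * 15 * 15, 9))
train_set = TensorDataset(torch.randn(64, 103, 15, 15),
                          torch.randint(0, 9, (64,)))

# Reported settings: Adam, initial learning rate 1e-4, batch size 32,
# 200 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(train_set, batch_size=32, shuffle=True)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    for patches, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
```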

Evaluation Metrics
Overall accuracy (OA) and the Kappa coefficient (K) were chosen to evaluate the results produced by the different models. OA is the proportion of correctly classified samples among all test samples, and Kappa measures the consistency between the classification results and the actual underlying categories. They are calculated as:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad K = \frac{N\sum_{i=1}^{n} x_{ii} - \sum_{i=1}^{n} x_{i+}x_{+i}}{N^{2} - \sum_{i=1}^{n} x_{i+}x_{+i}},$$

where TP represents the true positive value, TN the true negative value, FP the false positive value, and FN the false negative value; n is the number of categories, and N is the total number of data samples; x_ii denotes the value on the diagonal of the confusion matrix, and x_i+ and x_+i denote the sums of row i and column i of the confusion matrix, respectively.
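As a concrete check of these definitions, the following sketch computes OA and Kappa from a confusion matrix; the toy matrix is illustrative:

```python
import numpy as np

def oa_and_kappa(cm):
    """Compute overall accuracy and the Kappa coefficient from an n x n
    confusion matrix cm, where cm[i, j] counts samples of true class i
    predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    n_samples = cm.sum()                            # N
    oa = np.trace(cm) / n_samples                   # sum of x_ii over N
    # Chance agreement: sum of x_i+ * x_+i over N^2.
    chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n_samples ** 2
    kappa = (oa - chance) / (1.0 - chance)
    return oa, kappa

# Toy example: a 3-class confusion matrix.
cm = [[50, 2, 3], [5, 40, 5], [2, 3, 45]]
print(oa_and_kappa(cm))
```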

Baselines
To validate the proposed SS-TMNet method, several representative baselines and state-of-the-art backbone methods were chosen for comparison, including an RNN-based method (Mou [16]), CNN-based methods (He [21], 3D-CNN [55], and HybridSN [56]), and Transformer-based methods (ViT [18], CrossViT [51], LeViT [50], RvT [57], and HiT [53]).
• LeViT [50]: Another Transformer-based method, which includes four convolution layers and three coding stages, each containing four multi-head attention layers. We replicated this architecture for HSI classification.
• RvT [57]: Based on ViT, the RvT method uses a pooling layer to downsample the image and reduce its size. We follow this architecture to design the network for the HSI classification tasks.
• HiT [53]: A method that embeds convolution into the Transformer. It uses two proposed SACP layers based on 3D convolution to process the input image, and performs feature extraction using a three-branch convolution layer within the Transformer architecture.

Results and Analysis
This section elaborates the experimental results and analysis, including result comparisons and visualizations on the three datasets: Pavia University, Indian Pines, and Houston2013.

Experimental Analysis on Pavia University Dataset

Table 1 shows the experimental results produced by the different comparison models with respect to the OA and Kappa metrics on the Pavia University dataset. The table shows that our proposed SS-TMNet is superior to all comparison methods, with OA and Kappa reaching 91.74% and 89.44%. Its OA was 0.6%, 0.3%, and 0.16% higher than the RNN-based method Mou [16], the CNN-based method HybridSN, and the Transformer-based method LeViT, respectively. The likely reason is that SS-TMNet can more effectively capture local and global dependencies. Among all the mentioned methods, the original ViT method performed the worst, with 88.92% OA and 85.81% Kappa, which indicates that it is difficult for the original ViT network to perform the hyperspectral classification task; ViT may lack effective modeling capabilities for spatial characteristics. Methods using only 3D convolution, such as He [21] and 3D-CNN, which obtained 89.97% and 90.72% OA, respectively, did not perform well, since they focus only on spatial characteristics and do not fully consider spectral correlation. The Transformer-based methods LeViT and HiT achieved OA of 91.58% and 91.28%, respectively. They are developed on the basis of ViT and perform better than ViT-only and 3D convolution-only methods, which demonstrates that combining convolution and Transformer networks can improve classification results.

The visualization results produced by the comparison methods are shown in Figure 8. As shown in the red rectangle in the figure, most methods produce much noise in their classification maps compared with our SS-TMNet method. It is worth noting that although the classification maps of the HybridSN and LeViT methods are similar to ours, they still contain a small amount of noise, and the evaluation metrics in Table 1 show that our method's results remain better. A possible reason is that, compared with HybridSN and LeViT, SS-TMNet learns fused local spatial-spectral features through the proposed MSCP module and more effective local and global feature representations through the SSAM module. The visualization shows that our proposed method produces better results than most existing methods.

Experimental Analysis on Indian Pines Dataset

Table 2 shows the evaluation results of our presented and compared models on the Indian Pines dataset. Our proposed SS-TMNet shows the best results, with OA and Kappa reaching 84.67% and 82.66%. The OA of our method is 9.40% higher than the RNN-based method (Mou), 12.08% higher than the CNN-based method (3D-CNN), and 1.04% higher than the Transformer-based method (LeViT). One possible reason is that we have improved the encoding of the feature projection and of the spatial-spectral features to enable more efficient feature encoding.

Our method differs from the existing methods (i.e., CrossViT, LeViT, RvT, and HiT). We use MSCP to capture the spatial-spectral dependence of the fused multi-scale features, while SSAM captures the local and global spatial-spectral information of the multidimensional data. Thus, our model can more effectively model HSIs in terms of spatial-spectral dependence and local-global features. Figure 9 shows the visualization results on this dataset: our proposed model produces the classification map with the least noise and achieves satisfactory results. For example, as shown in the red rectangle in the figure, the SS-TMNet method generates the least noise in the classification map compared with the other methods. The reason HiT does not perform as well as our proposed method may be its ineffective integration of convolution into the Transformer, leading to a lack of effective modeling of global feature dependencies. Overall, our method produces a classification map closer to the ground truth image than the other methods, which demonstrates its validity.

Experimental Analysis on Houston2013 Dataset
The experimental results of our proposed SS-TMNet and the compared models on the Houston2013 dataset are shown in Table 3. Our model works best, with OA and Kappa reaching 96.22% and 96.22%, respectively. In addition, the standard deviation of our model is the smallest, only 0.12, indicating its stability. It is worth noting that LeViT performs much worse on this dataset than on the other two datasets with respect to OA and Kappa, at only 87.36% and 86.34%, which indicates that the generalization capability of the LeViT model is relatively weak. Our model performs well on all three datasets, possibly because SSAM effectively models both local and global features from three dimensions. Figure 10 shows the visualization results of the experiment. To make the pixel-level differences clearer, we crop local details of the classification map. As shown in the red rectangle in the figure, the classification map generated by our method is less noisy than those of the comparison methods and closer to the ground truth image, which shows the superiority of our method. Other methods, such as HybridSN, may not perform well because they use only a combination of 3D and 2D convolution: although this models local spatial characteristics well, it cannot capture long-range dependencies between spectral bands. As for the CrossViT method, it only uses the Transformer to build the network without considering the effect of convolution on the classification results, which may explain its unsatisfactory performance.

Student's t-Test
We conducted a Student's t-test between our presented method and the compared methods over ten randomized initializations. We collected the OA results produced by ten randomized experiments on the Pavia University, Indian Pines, and Houston2013 datasets using SS-TMNet and the other comparison methods, and employed the Student's t-test to compute the p-value between our proposed method and each existing method. When the p-value is greater than 0.05, there is no significant difference between the two models; when it is less than 0.05, the results of the two models are significantly different.
To make the differences easier to observe, the experimental data are reported in scientific notation. As shown in Table 4, the p-value between SS-TMNet and every compared method is less than 0.05 on all three datasets, which shows that our SS-TMNet method has significant advantages over the other methods. For instance, on the Pavia University dataset, the p-values between the SS-TMNet method and the HybridSN and HiT methods are 1.40 × 10⁻² and 2.26 × 10⁻⁵, respectively, both less than 0.05, showing significant differences between the methods.
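For reference, these p-values can be computed with a standard two-sample t-test; in the sketch below the OA values are hypothetical, purely to show the computation:

```python
import numpy as np
from scipy import stats

# Hypothetical OA values from 10 randomized runs of two methods.
oa_ss_tmnet = np.array([91.74, 91.60, 91.85, 91.70, 91.68,
                        91.80, 91.77, 91.65, 91.72, 91.79])
oa_baseline = np.array([91.28, 91.10, 91.35, 91.20, 91.15,
                        91.30, 91.25, 91.05, 91.22, 91.18])

# Two-sample t-test; p < 0.05 indicates a significant difference.
t_stat, p_value = stats.ttest_ind(oa_ss_tmnet, oa_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```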

Ablation Studies
We performed ablation experiments on the main components of the SS-TMNet model. The results and analysis of the ablation experiments for the proposed MSCP and SSAM modules are described in the following two sections. The results in Tables 5 and 6 are averaged over ten runs.

The Effectiveness of the MSCP Module
To verify the effectiveness of our proposed MSCP module, we used different projection methods (Linear, Conv2D, SACP [53], and MSCP) to project the image features without changing the subsequent modules and network structure; SACP is the feature projection module of the HiT method. As shown in Table 5, we chose ViT as the baseline method. The experimental results show that our MSCP+SSAM achieved the best performance (91.74% OA and 89.44% Kappa). The Mean and Std columns in the table represent the mean and standard deviation differences between our proposed SS-TMNet (MSCP+SSAM) and the comparison methods. Our method had the highest mean and lowest standard deviation, which shows that the MSCP module is more effective than the other feature projection modules.

The Effectiveness of the SSAM Module

To verify the effectiveness of the SSAM module, we took the ViT method as the baseline and set up four groups of comparison experiments. In these experiments, the SSAM module in our proposed method is replaced by a single linear layer (Linear), the convolution permutator module of the HiT method (ConvPermute), and the ViP [58] method (ViP), respectively. Table 6 shows that our MSCP+SSAM had the best performance (91.74 ± 0.12 OA and 89.44 ± 0.16 Kappa). Compared with the substituted ConvPermute and ViP modules, our proposed method was 0.45% and 0.37% higher in OA, respectively, which shows the effectiveness of the SSAM module in improving network performance.

Scalability
Due to the scarcity of hyperspectral image data, it is meaningful to study the influence of the number of training samples on the classification method. We varied the training samples from 10% to 50% on the Houston2013 dataset to study scalability. Each model was run ten times, and the average value was taken as the final result. Table 7 reports the average OA of the proposed SS-TMNet and the compared models. As the training samples increase from 10% to 50%, performance gradually improves, and our model consistently shows excellent results and high stability. It is worth noting that the results of LeViT are slightly higher than our model's when the training samples are 40% and 50%; however, LeViT performs poorly when training samples are few, indicating its instability. Moreover, to study how the results of our SS-TMNet method vary with the number of training samples across datasets, we tested SS-TMNet on all three datasets, again averaging over ten runs. The visualization of the OA metric is shown in Figure 11: as the training samples increase, OA gradually increases and eventually stabilizes, which effectively demonstrates the proposed method's stability.

Conclusions
This work presents a novel Transformer-based HSI classification method (SS-TMNet) that can fully use the spatial-spectral information in HSI data. SS-TMNet includes two key modules: the MSCP module and the SSAM module. The MSCP module uses multi-scale 3D convolution to extract fused spatial-spectral features. The SSAM module extracts features along the height, width, and spectral dimensions, which more effectively captures local and global feature information. We compared our proposed method with state-of-the-art Transformer-based and CNN-based methods on three benchmark HSI datasets. The experimental results show that our SS-TMNet method achieves the best overall accuracy on all three datasets.
In future work, we plan to study more efficient Transformer-based HSI classification methods by embedding convolutional neural networks into the Transformer more effectively. To address the scarcity of labeled HSI data, we plan to study transfer learning and self-supervised learning based on SS-TMNet to improve classification performance with limited training samples.
Author Contributions: Conceptualization, review and editing, X.H.; writing (original draft preparation) and revision, Y.Z.; methodology and revision, X.Y.; data curation and revision, X.Z. and K.W. All authors have read and agreed to the published version of the manuscript.