S2Former: Parallel Spectral–Spatial Transformer for Hyperspectral Image Classification

Abstract: Owing to their excellent ability to represent local features, convolutional neural networks (CNNs) have achieved favourable performance in hyperspectral image (HSI) classification tasks. Nevertheless, current CNN models exhibit a marked flaw: they struggle to model dependencies between long-range positions. This flaw is particularly problematic for the HSI classification task, which must extract discriminative local and global features from limited samples. In this paper, we introduce a spatial–spectral transformer (S2Former), which performs spatial and spectral feature extraction in a dual-stream framework for HSI classification. S2Former consists of a spatial transformer and a spectral transformer in parallel branches and extracts discriminative features in the spatial and spectral dimensions. More specifically, we propose multi-head spatial self-attention to capture the long-range spatial dependency of non-adjacent HSI pixels in the spatial transformer. In the spectral transformer, we propose multi-head covariance spectral attention to mine and represent spectral signatures by computing covariance-based channel maps. Meanwhile, a local activation feed-forward network is developed to complement local details. Extensive experiments conducted on four publicly available datasets indicate that our S2Former achieves state-of-the-art performance for the HSI classification task.

The main challenge of HSI classification is how to extract sufficiently discriminative features from limited and insufficient samples. CNNs have exhibited potential in learning generalisable features from samples and have therefore been widely exploited to customise HSI classifiers. To tackle the above challenge, CNN-based solutions are mainly grouped into two categories. First, deeper network structures are proposed to extract more delicate features from the limited training samples. Lee et al. [22] proposed a contextual deep CNN (CDCNN) for extracting contextual spatial-spectral features. In [23], a regularised deep feature extraction (FE) method is proposed to effectively solve the common problems of insufficient and unbalanced training samples for HSI. Residual Network (ResNet) [24] and Dense Convolutional Network (DenseNet) [25] were designed to relieve the training stress of deep models. Based on ResNet, SSRN [26] was proposed, which exploits a deep model to improve classification accuracy. Similarly, FDSSC [27] was designed on the DenseNet framework to capture spectral-spatial features. Second, the attention mechanism, a dynamic weight adjustment operation on features, is introduced to refine and reweight the extracted spatial and spectral features. Spectral-wise attention (MSDN-SA) [28] was proposed to enhance the discriminatory ability of the model for spectral features. Channel-wise attention and spatial-wise attention were introduced in [29,30] to refine spatial and spectral features and achieved outstanding classification results.
However, CNN models only model and analyse a limited receptive field and thus ignore global contextual information, lacking the ability to model long-range location dependency. When applied to high-dimensional hyperspectral data, this inherent bias of CNNs is magnified. The inadequate global representation ability leads to ignoring global spatial and global spectral features, which most likely contain discriminative information. Even CNN-based methods that incorporate self-attention modules still fail to simultaneously constrain the global spatial information and global spectral information of input HSI cubes.
The transformer has shown excellent performance on most computer vision tasks. It relies on the self-attention mechanism to model global contextual information, simulate a global receptive field, and capture the global information and long-range dependencies of samples. These properties greatly alleviate the above-mentioned limitations of CNN-based methods for HSI classification. ViT [31] is the earliest transformer architecture proposed for computer vision and has achieved better results than convolutional neural networks. Inspired by ViT, Xue et al. [32] developed a deep hierarchical vision transformer (DHViT) to extract long-term spectral dependencies and hierarchical features. Zhao et al. [33] proposed a Convolutional Transformer Network (CTN), whose Convolutional Transformer (CT) blocks address the transformer's weak local feature extraction and effectively capture the local-global features of HSI patches. A backbone network named SpectralFormer [34] combines the advantages of the transformer and the CNN, aiming to learn local spectral feature representations and feature transfer between shallow and deep layers. Song et al. [35] designed a Bottleneck Spatial-Spectral Transformer (BS2T) to describe the dependencies between HSI pixels over long-range locations and bands. HSI-Mixer [36] uses a simple CNN architecture to simulate the function of the transformer and reconsiders the significant inductive bias of convolution; a hybrid measurement-based linear projection and spatial and spectral mixer blocks are constructed to implement spatial-spectral feature fusion and decomposition, respectively. However, these transformer-based classifiers merely introduce the self-attention mechanism to capture global features. Standard self-attention only considers adaptability in the spatial dimension and ignores adaptability in the channel dimension, which is also important for the HSI classification task. For transformer-based HSI classifiers, it is essential to design self-attention mechanisms separately tailored for the spatial and spectral dimensions.
In this article, we develop a novel transformer-based dual-branch network architecture, the parallel spatial-spectral transformer (S2Former for short), to achieve high-performance HSI classification by extracting discriminative features in both the spatial and spectral dimensions with tailored self-attention mechanisms. S2Former consists of a spatial transformer and a spectral transformer in parallel branches, emphasising the spatially global context and the global spectral information individually. Specifically, the spatial transformer exploits multi-head spatial self-attention (MSSA) and a local activation feed-forward network to learn the spatially global context and local signals. The spectral transformer is equipped with multi-head covariance spectral attention (MCSA) to model the contextualised global relationships between spectra and capture subtle spectral discrepancies.
The main contributions can be concluded as follows.

1. A parallel spectral-spatial transformer architecture is proposed for HSI classification, which efficiently extracts spectral and spatial features in dual parallel branches.

2. MCSA and MSSA, which are tailored for spectral and spatial feature extraction, improve the mining of local-global spatial and spectral sequence features.

3. A local activation feed-forward network is proposed to enhance the extraction of local context signals by encoding information from spatially neighbouring pixel positions.

Methodology
As shown in Figure 1, our S2Former consists of a spatial transformer and a spectral transformer. S2Former takes a 3D cube as input; in other words, the target pixel and its adjacent pixels are fed into the network. Given a 3D cube I ∈ R^(M×M×O), where M × M denotes the spatial dimension and O is the number of bands, S2Former first applies a 3 × 3 convolutional layer to obtain low-level feature embeddings F_0 ∈ R^(M×M×C). Next, the shallow features F_0 are transported to the spatial transformer and the spectral transformer in parallel and transformed into deep features F_D^Spa ∈ R^(M×M×C) and F_D^Spe ∈ R^(M×M×C). The learnable weights α and β, optimised by backward propagation, reweight the deep spatial and spectral features, which are then fused into the output features F_out ∈ R^(M×M×C). Finally, the fused features pass through a fully connected layer that maps them into the predicted results.
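As a concrete sketch, the pipeline above can be written as follows. This is a minimal skeleton, not the authors' released code: the weighted-sum form of the fusion, the global pooling before the classifier, and all class and argument names are our assumptions, and the two transformer branches are left as placeholders.

```python
import torch
import torch.nn as nn

class S2FormerSkeleton(nn.Module):
    """Sketch of the S2Former pipeline: a 3x3 conv produces shallow
    features F0, two parallel branches produce deep spatial/spectral
    features, and learnable alpha/beta fuse them before the classifier."""
    def __init__(self, bands, embed_dim, num_classes, spatial_branch, spectral_branch):
        super().__init__()
        self.embed = nn.Conv2d(bands, embed_dim, kernel_size=3, padding=1)
        self.spatial_branch = spatial_branch      # stack of MSTGs (placeholder)
        self.spectral_branch = spectral_branch    # stack of MCTGs (placeholder)
        self.alpha = nn.Parameter(torch.ones(1))  # learnable fusion weights
        self.beta = nn.Parameter(torch.ones(1))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                         # x: (B, O, M, M) HSI cube
        f0 = self.embed(x)                        # F0: (B, C, M, M)
        f_spa = self.spatial_branch(f0)           # deep spatial features
        f_spe = self.spectral_branch(f0)          # deep spectral features
        f_out = self.alpha * f_spa + self.beta * f_spe   # assumed fusion form
        f_out = f_out.mean(dim=(2, 3))            # pool, then classify the centre pixel
        return self.head(f_out)

model = S2FormerSkeleton(bands=30, embed_dim=64, num_classes=9,
                         spatial_branch=nn.Identity(), spectral_branch=nn.Identity())
logits = model(torch.randn(2, 30, 9, 9))          # batch of two 9x9x30 cubes
```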

Spatial Transformer
As illustrated in Figure 1, the spatial transformer is a stack of multi-head spatial transformer groups, similar to an encoder, to extract deeper spatial features. The spatial transformer contains K multi-head spatial transformer groups (MSTG). The spatial features are extracted group by group as

F_i^Spa = H_i^MSTG(F_{i-1}^Spa), i = 1, 2, ..., K,

where H_i^MSTG(·) is the i-th multi-head spatial transformer group. The multi-head spatial transformer group is a residual group with multiple multi-head spatial transformer blocks and a spatial-3D enhanced block. Given the input feature F_{i,0}^Spa of the i-th MSTG, we first extract intermediate spatial features F_{i,j}^Spa by L multi-head spatial transformer blocks (MSTB) as

F_{i,j}^Spa = H_{i,j}^MSTB(F_{i,j-1}^Spa), j = 1, 2, ..., L,

where H_{i,j}^MSTB(·) denotes the j-th multi-head spatial transformer block in the i-th multi-head spatial transformer group. Next, the spatial features F_{i,L}^Spa are enhanced by a spatial-3D enhanced block:

F_i^Spa = F_{i,0}^Spa + H_i^Spa_3D(F_{i,L}^Spa),

where H_i^Spa_3D(·) is the spatial-3D enhanced block in the i-th multi-head spatial transformer group. Next, we give a specific description of the multi-head spatial transformer block and the spatial-3D enhanced block, which are the core components of the multi-head spatial transformer group.
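The group structure above can be sketched as a residual wrapper around L blocks followed by an enhancement block. The placement of the residual connection over the whole group is our reading of the text, and the block implementations are placeholders here:

```python
import torch
import torch.nn as nn

class MSTG(nn.Module):
    """Sketch of a multi-head spatial transformer group: L transformer
    blocks followed by a spatial-3D enhanced block, wrapped in a single
    residual connection over the whole group."""
    def __init__(self, blocks, enhance):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)   # L multi-head spatial transformer blocks
        self.enhance = enhance                # spatial-3D enhanced block

    def forward(self, f):
        x = f
        for blk in self.blocks:               # F_{i,j} = H_MSTB(F_{i,j-1})
            x = blk(x)
        return f + self.enhance(x)            # residual over the whole group

# With identity placeholders, the group reduces to f + f:
g = MSTG(blocks=[nn.Identity(), nn.Identity()], enhance=nn.Identity())
out = g(torch.ones(1, 4))
```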

Multi-Head Spatial Self-Attention
As shown in Figure 2b, a multi-head spatial transformer block contains multi-head spatial self-attention (MSSA), a local activation feed-forward network (LAFN), and layer normalisation (LN) modules. CNN-based methods exploit the local receptive field to extract features for HSI classification but have difficulty modeling pixels at long-range positions. To capture non-local long-range dependencies, our spatial transformer exploits MSSA to model long-range dependencies in the spatial dimension. As demonstrated in Figure 2, given an input X_in ∈ R^(M×M×C), MSSA applies self-attention across global spatial locations and generates an attention map modeling the long-range dependencies and spatial interactions.
The query, key, and value are obtained as

Q_spa = X_in W_Q,  K_spa = X_in W_K,  V_spa = X_in W_V,

where W_Q, W_K, and W_V ∈ R^(C×C) are learnable projection matrices. The attention matrix is then computed by the self-attention mechanism. We apply a dot-product interaction on Q_spa and K_spa to generate the spatial attention map:

X_out = softmax(Q_spa K_spa^T / √d + B) V_spa W_out,

where W_out ∈ R^(C×C) is also a learnable projection matrix, d is the per-head channel dimension, and B is the learnable relative positional encoding. Following multi-head SA [37], MSSA divides the channels into 'heads', performs the attention function for each head in parallel, and concatenates the results to obtain the multi-head output.
A LayerNorm (LN) layer is added before MSSA, and a residual connection is employed to obtain the output feature map:

X_out = X_in + H_MSSA(H_LN(X_in)),

where H_MSSA(·) and H_LN(·) denote the MSSA function and the LayerNorm layer.
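A hedged sketch of MSSA follows, assuming the standard scaled dot-product form; a full learnable bias stands in for the relative positional encoding B, whose exact parameterisation in the paper may differ:

```python
import torch
import torch.nn as nn

class MSSA(nn.Module):
    """Sketch of multi-head spatial self-attention: tokens are the M*M
    spatial positions, and attention is softmax(QK^T/sqrt(d) + B)V with
    a learnable positional term B (simplified to a full bias here)."""
    def __init__(self, dim, heads, tokens):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)    # W_Q, W_K, W_V fused
        self.proj = nn.Linear(dim, dim)       # W_out
        self.bias = nn.Parameter(torch.zeros(heads, tokens, tokens))  # B

    def forward(self, x):                     # x: (B, N, C), N = M*M
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(t):                         # (B, N, C) -> (B, heads, N, C//heads)
            return t.view(b, n, self.heads, -1).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        out = attn.softmax(dim=-1) @ v        # (B, heads, N, C//heads)
        out = out.transpose(1, 2).reshape(b, n, c)   # concatenate heads
        return self.proj(out)

mssa = MSSA(dim=64, heads=4, tokens=81)
y = mssa(torch.randn(2, 81, 64))              # a 9x9 patch -> 81 spatial tokens
```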

Local Activation Feed-Forward Network
In the traditional feed-forward network, two fully connected layers are applied to expand the input feature channels and map the output channels back to the original input dimension. The fully connected layer processes token information point-wise in an identical manner and thus neglects local information. In our work, we propose the local activation feed-forward network (LAFN), which complements local information by encoding information from spatially neighbouring pixel positions. As shown in Figure 3, we complement the local details in the feed-forward network with two operations. First, we exploit a depth-wise convolution layer between the two fully connected layers to explore local signals from the global feature information in the regular branch. Second, we add a branch with depth-wise convolution to activate the local signal. An element-wise product is used to aggregate the local and global information streams of the two parallel branches. Given an input feature X_in ∈ R^(M×M×C), LAFN is formulated as:

X_out = W_2 ((W_D W_1 X_in) ⊙ H_Gelu(W_D W_1 X_in)),

where H_Gelu represents the Gelu non-linearity, W_* (* denotes 1 or 2) denote the fully connected layers, implemented as point-wise convolutions W_P with a kernel size of 1 × 1, W_D represents 3 × 3 depth-wise convolution, and ⊙ denotes element-wise multiplication. Overall, the LAFN controls information flow through the activated local signal in our pipeline, thereby focusing on fine details.
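The two-branch design above can be sketched as follows; the exact placement of the Gelu gate and the expansion ratio are our assumptions from the text and Figure 3:

```python
import torch
import torch.nn as nn

class LAFN(nn.Module):
    """Sketch of the local activation feed-forward network: point-wise
    expansion, two parallel 3x3 depth-wise convolutions (one Gelu-gated),
    element-wise multiplication, then a point-wise projection back."""
    def __init__(self, dim, expand=2):
        super().__init__()
        hidden = dim * expand
        self.fc1 = nn.Conv2d(dim, hidden, 1)   # W_1 (point-wise "fully connected")
        self.dw_main = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # W_D, regular branch
        self.dw_gate = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # W_D, activation branch
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)   # W_2 (point-wise)

    def forward(self, x):                      # x: (B, C, M, M)
        x = self.fc1(x)
        # element-wise product aggregates the local and gated streams
        return self.fc2(self.dw_main(x) * self.act(self.dw_gate(x)))

lafn = LAFN(dim=64)
y = lafn(torch.randn(2, 64, 9, 9))
```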

Spatial-3D Enhanced Block
The spatial-3D enhanced block is designed to maintain the global spatial features and enhance the local spatial feature expression. As shown in Figure 4, we design the spatial-3D enhanced block similarly to DBDA [30]. The input spatial feature F_{i,L}^Spa ∈ R^(M×M×C) is recalibrated and reweighted through the remapping of local spatial feature coordinates by the spatial-3D enhanced block. First, the high-dimensional features across channels are mapped to low-dimensional ones by a 1 × 1 × C 3D convolution. Next, the features are transported to three spatial-3D enhanced layers with a dense connection. Each spatial-3D enhanced layer includes a 3 × 3 × 1 3D convolution, 3D batch normalisation, and a Mish activation function.
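A sketch of this block follows. The growth (channel) counts, the treatment of the feature channels as the spectral depth of a 3D volume, and the use of the last layer's output as the result are our assumptions:

```python
import torch
import torch.nn as nn

class Spatial3DEnhance(nn.Module):
    """Sketch of the spatial-3D enhanced block: a 1x1xC 3D convolution
    compresses the spectral depth, then three densely connected layers of
    (3x3x1 3D conv -> 3D batch norm -> Mish) refine local spatial detail."""
    def __init__(self, bands, growth=12):
        super().__init__()
        self.reduce = nn.Conv3d(1, growth, kernel_size=(bands, 1, 1))   # 1x1xC conv
        self.layers = nn.ModuleList()
        for i in range(3):                    # dense connection: input channels grow
            self.layers.append(nn.Sequential(
                nn.Conv3d(growth * (i + 1), growth,
                          kernel_size=(1, 3, 3), padding=(0, 1, 1)),    # 3x3x1 conv
                nn.BatchNorm3d(growth),
                nn.Mish(),
            ))

    def forward(self, x):                  # x: (B, C, M, M), C treated as spectral depth
        x = self.reduce(x.unsqueeze(1))    # (B, growth, 1, M, M)
        feats = [x]
        for layer in self.layers:          # each layer sees all earlier outputs
            feats.append(layer(torch.cat(feats, dim=1)))
        return feats[-1]

block = Spatial3DEnhance(bands=64)
y = block(torch.randn(2, 64, 9, 9))
```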

Spectral Transformer
As exhibited in Figure 1, similar in structure to the spatial transformer, the spectral transformer consists of K multi-head covariance spectral transformer groups (MCTG) that extract discriminative spectral features. The spectral features are extracted group by group as

F_i^Spe = H_i^MCTG(F_{i-1}^Spe), i = 1, 2, ..., K.

The multi-head covariance spectral transformer group is a residual group with multiple multi-head covariance spectral transformer blocks and a spectral-3D enhanced block. Given the input features F_{i,0}^Spe of the i-th MCTG, we first extract intermediate spectral features F_{i,j}^Spe by L multi-head covariance spectral transformer blocks (MCTB) as

F_{i,j}^Spe = H_{i,j}^MCTB(F_{i,j-1}^Spe), j = 1, 2, ..., L,

where H_{i,j}^MCTB(·) denotes the j-th multi-head covariance spectral transformer block in the i-th multi-head covariance spectral transformer group. Next, the spectral features F_{i,L}^Spe are enhanced by a spectral-3D enhanced block:

F_i^Spe = F_{i,0}^Spe + H_i^Spe_3D(F_{i,L}^Spe),

where H_i^Spe_3D(·) is the spectral-3D enhanced block in the i-th multi-head covariance spectral transformer group. Similarly, we elaborate the details of our multi-head covariance spectral transformer block and spectral-3D enhanced block.

Multi-Head Covariance Spectral Attention
Different from natural images, HSIs are also spectrally correlated and have numerous narrow bands; capturing local and global spectral features is equally essential. In our work, we propose multi-head covariance spectral attention (MCSA) to model the inter-spectra similarity and long-range dependencies. MCSA applies self-attention across spectral channels: it computes the cross-covariance across channels to generate an attention map encoding the global spectral signals. X_in is first projected and reshaped into query Q_spe ∈ R^(C×M²), key K_spe ∈ R^(C×M²), and value V_spe ∈ R^(C×M²) by applying 1 × 1 point-wise convolutions W_P followed by 3 × 3 depth-wise convolutions W_D to encode the spatial context in a spectral-wise manner. Next, the spectral attention map is computed by the self-attention mechanism. We apply a dot-product interaction on Q_spe and K_spe to generate the spectral attention map:

X_out = softmax(ε · Q_spe K_spe^T) V_spe,

where ε is a learnable parameter that reweights the dot product of Q_spe and K_spe before applying the softmax function.
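A hedged sketch of MCSA in the transposed-attention style described above; the query/key normalisation is our own stabilising assumption, not stated in the text:

```python
import torch
import torch.nn as nn

class MCSA(nn.Module):
    """Sketch of multi-head covariance spectral attention: Q, K, V come
    from a 1x1 point-wise then 3x3 depth-wise convolution, are reshaped
    to (C, M*M), and attention is the C x C map softmax(eps * Q K^T),
    so cost scales with channels rather than pixels."""
    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.eps = nn.Parameter(torch.ones(heads, 1, 1))   # learnable temperature
        self.qkv = nn.Sequential(
            nn.Conv2d(dim, dim * 3, 1),                                  # W_P
            nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3),   # W_D
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                  # x: (B, C, M, M)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        def split(t):                                      # -> (B, heads, C//heads, M*M)
            return t.view(b, self.heads, c // self.heads, h * w)
        q, k, v = split(q), split(k), split(v)
        q = nn.functional.normalize(q, dim=-1)             # stabilising assumption
        k = nn.functional.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.eps        # channel-by-channel map
        out = attn.softmax(dim=-1) @ v                     # reweight spectral channels
        return self.proj(out.view(b, c, h, w))

mcsa = MCSA(dim=64, heads=4)
y = mcsa(torch.randn(2, 64, 9, 9))
```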

Spectral-3D Enhanced Block
The spectral-3D enhanced block is stacked behind the L multi-head covariance spectral transformer blocks in each multi-head covariance spectral transformer group. It is designed to learn inter-spectra correlations after modeling the global spectral features. Given F_{i,L}^Spe ∈ R^(M×M×C), the spectral-3D enhanced block first applies a 3D convolution with a kernel of 1 × 1 × 7 to obtain low-level local spectral features. Then, three spectral-3D enhanced layers are applied to extract spectral information. Each spectral-3D enhanced layer contains a 3D convolution, 3D batch normalisation, and a Mish activation function. Except for the third 3D convolution layer, the 3D convolution layers use a kernel of 1 × 1 × 7. The input of the i-th spectral-3D enhanced layer is the concatenation of the outputs of the former layers.
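A sketch of this block under stated assumptions: the text leaves the third layer's kernel unspecified, so 1 × 1 × 3 is used as a placeholder, and the growth channel count is illustrative:

```python
import torch
import torch.nn as nn

class Spectral3DEnhance(nn.Module):
    """Sketch of the spectral-3D enhanced block: a 1x1x7 3D convolution
    along the band axis, then three densely connected layers of
    (3D conv -> 3D batch norm -> Mish); the third kernel is a placeholder."""
    def __init__(self, growth=12):
        super().__init__()
        self.stem = nn.Conv3d(1, growth, kernel_size=(7, 1, 1),
                              padding=(3, 0, 0))           # 1x1x7 along bands
        kernels = [(7, 1, 1), (7, 1, 1), (3, 1, 1)]        # third kernel is an assumption
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(growth * (i + 1), growth, k, padding=(k[0] // 2, 0, 0)),
                nn.BatchNorm3d(growth),
                nn.Mish(),
            ) for i, k in enumerate(kernels))

    def forward(self, x):                 # x: (B, C, M, M), C = spectral depth
        x = self.stem(x.unsqueeze(1))     # (B, growth, C, M, M)
        feats = [x]
        for layer in self.layers:         # dense connection: concat earlier outputs
            feats.append(layer(torch.cat(feats, dim=1)))
        return feats[-1]

block = Spectral3DEnhance()
y = block(torch.randn(2, 64, 9, 9))
```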

Experimental Results
To verify the performance of our S2Former, we conduct ablation studies and comparative experiments on four public datasets: the Indian Pines dataset (IN), Pavia University dataset (UP), Botswana dataset (BS), and Houston dataset (HU). All experiments presented in this section are conducted on an Nvidia GeForce RTX 3090 GPU.
Optimised Parameters: The proposed S2Former is trained on the four datasets for 200 epochs. The batch size is 16, and the Adam optimiser with a learning rate of 0.0005 is used to update the network parameters. For the compared methods and our S2Former, we select input patches with a size of 9 × 9 × O. The training and test data of the four datasets are listed in Tables 1-4. The UP dataset has the largest number of samples, so only 0.5% of the samples are selected for training and the rest are used for testing. The labelled samples of each category in the IN dataset are unevenly distributed; to reasonably utilise the training samples, we choose 3% of the samples for training. For BS and HU, we use 1% of the labelled samples to train the models, and the remaining 99% of the labelled samples are used to test the performance of the models.
Metrics: The classification performance is evaluated with overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa) [39].

Discussion on Input Patch Size

Our S2Former is a 3D cube-wise HSI classification model. The size of the 3D cube determines the number of neighbouring pixels, i.e., the amount of information available for the centre pixel. Therefore, the patch size is closely related to the performance of our S2Former. To explore its impact on HSI classification accuracy, we conduct an ablation study on the input HSI patch size. We exhibit the HSI classification results with the spatial dimension of the 3D cube ranging from 3 × 3 to 13 × 13 on the four standard datasets. The quantitative results are shown in Figure 5. The patch size of the input 3D cube has a significant influence on classification performance, and a patch size of 9 × 9 obtains the optimal results on all four datasets.
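The three metrics above follow their standard definitions and can be computed from a confusion matrix; the helper below is a plain NumPy sketch, not tied to any particular evaluation library:

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, and Cohen's kappa from integer label vectors."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):       # build the confusion matrix
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                  # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)               # recall per class
    aa = per_class.mean()                                  # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # chance agreement
    kappa = (oa - pe) / (1 - pe)                           # kappa coefficient
    return oa, aa, kappa

# Toy example: one Soybean-style confusion between classes 0 and 1.
oa, aa, kappa = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 0, 2, 2], 3)
```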

Discussion on Model Depth
The number of groups and blocks in the spatial transformer and the spectral transformer determines the network depth of S2Former. We select integer values of the group and block numbers in the range {1, 2, 3, 4, 5, 6, 7, 8}. The classification performance (OA) of our S2Former with different group and block numbers is shown in Figure 6. We show the effects of the group number and block number on the Botswana dataset (Figure 6a), Indian Pines dataset (Figure 6b), and Pavia University dataset (Figure 6c). It is observed that the OA is correlated with the block number and group number. The blocks, including the multi-head spatial transformer block in the spatial transformer and the multi-head covariance spectral transformer block in the spectral transformer, are stacked into groups to capture the long-range dependencies of HSI pixels and bands. However, the long-range dependency modeling degrades when excessive blocks and groups are added. The group number and block number of our S2Former are therefore both set to 4, which achieves the desired results with reasonable computational efficiency.

Discussion on Parallel Spectral-Spatial Transformer Architecture
Our S2Former is a dual-stream architecture with a spectral transformer and a spatial transformer in parallel, designed to capture discriminative features in the spectral and spatial dimensions. To demonstrate the effectiveness of the dual-stream architecture, we compare the classification results of the single spatial transformer, the single spectral transformer, and our S2Former. Table 5 shows the classification results of the break-down ablation experiments on the HU, UP, and BS datasets. The quantitative results of Table 5(d), (h), and S2Former demonstrate that the combination of the spectral transformer and the spatial transformer achieves the optimal performance. Compared with the full S2Former, a single spectral or spatial transformer fails to capture comprehensive spectral-spatial features.

Discussion on Self-Attention and Feed-Forward Network
The self-attention modules and the feed-forward network used in S2Former play a powerful role in classification performance. The proposed self-attention mechanisms, MSSA in the spatial transformer and MCSA in the spectral transformer, produce the global spatial and spectral features. Considering the weaknesses of traditional feed-forward networks, we design LAFN. To investigate the contributions of the proposed MSSA, MCSA, and LAFN, an ablation study is conducted. As seen in Table 5, our MCSA shows a feature extraction ability comparable to MSSA; spectral feature representation is equally important for HSI classification given the redundant spectrum. Table 5(b) and (f) confirm that LAFN obtains a significant gain over a traditional MLP, which can be attributed to its supplement of local details.

Comparison with Other Methods
To confirm the superiority of our S2Former, seven state-of-the-art classifiers are used for comparison. Except for SVM, the remaining six methods are patch-based HSI classifiers. Furthermore, the compared classifiers and our S2Former maintain the same parameter settings. The quantitative results are shown in Tables 6-9, which present the detailed classification results on the UP, IN, BS, and HU datasets, respectively. Our S2Former achieves more significant results than the other seven classifiers. In Table 6, the proposed method achieves 97.40% in OA, 97.60% in AA, and 96.54% in Kappa. SVM achieves the worst results, and the 2D-CNN-based CDCNN is only better than SVM. FDSSC exploits dense connections to improve network performance. DBDA extracts spatial and spectral features by adding attention mechanisms and obtains superior classification accuracy. BS2T, which also employs a transformer architecture, obtains performance second only to our method. In Tables 6-9, the best classification accuracy for each land cover category is shown in bold. Our S2Former also achieves satisfactory results in each land cover category. Although the seventh land cover category, Grass-pasture-mowed, in Table 7 has only three training samples, the classification accuracy of S2Former is improved by 13.18% compared with the second-best HSI-Mixer. S2Former uses the proposed self-attention mechanisms to model global spatial-spectral features and extract discriminative features to the maximum extent; therefore, it achieves a significant gain in fine-grained classification. In Table 7, the tenth (Soybean-no-till), eleventh (Soybean-min-till), and twelfth (Soybean-clean) land cover categories in the IN dataset have very similar characteristics. S2Former obtains higher classification accuracy (94.82%, 97.21%, and 99.42%) than the other SOTA methods. Similar situations can also be found in the BS and HU datasets, such as Acacia woodlands, Acacia shrublands, and Acacia grasslands in Table 8, and Stressed Grass and Synthetic Grass in Table 9. Our proposed S2Former obtains outstanding classification results on all four datasets.
To visually demonstrate the performance of our S2Former, we present the false-colour images, the ground truth maps, and the classification maps of each method in Figures 7-10. Our method produces clearer boundaries and less noise and outperforms the other seven SOTA methods, benefiting from the dual-stream spectral-spatial transformer structure of S2Former. SVM performs the worst among all methods, with large intra-class noise. CDCNN, with its simple network structure, also produces considerable noise. Although the other five comparison methods achieve relatively accurate and smooth classification maps, they still make some errors compared with the proposed S2Former. Specifically, in Figure 7, CDCNN misclassifies Bare Soil as Meadows, resulting in poor classification results. Although DBDA works better on the Bare Soil class, it shows obvious misclassification on the Bitumen class. Since our S2Former extracts global features and supplements local details, its visualised classification results are slightly better than those of BS2T. In Figure 8, the compared methods misclassify two similar land cover categories (Soybean-no-till and Soybean-clean). We propose multi-head covariance spectral attention in the spectral transformer branch to explore the high similarity and correlation across the spectral dimension; therefore, our S2Former can accurately classify these two land cover categories. Considering the scattered distribution of labelled pixels in the BS and HU datasets, we perform local magnification on a specific area of the two datasets to exhibit the performance. As shown in Figures 9 and 10, when dealing with isolated objects of small size, the other methods exhibit over-smoothing or misclassification. Our method maintains appreciable classification results even on cluttered pixels.

Comparison of the Complexity
We compare the complexity of our S2Former and state-of-the-art HSI classification algorithms in terms of time and space consumption. We select FLOPs and the number of parameters to represent space consumption and the training time to represent time consumption. Table 10 shows the numbers of parameters and FLOPs and the training times. The results demonstrate that our S2Former achieves higher performance with a better trade-off between model size and accuracy: it obtains significant accuracy gains at the cost of only a modest increase in model complexity.

Conclusions
In this article, we propose a novel spectral-spatial transformer (S2Former) for HSI classification. S2Former exploits a dual-branch transformer architecture to extract discriminative feature streams with local and global receptive fields in the spatial and spectral dimensions. First, the spatial transformer consists of two components: the proposed MSSA applies self-attention in the spatial dimension to encode long-range spatial position information, and the spatial-3D enhanced block adaptively captures local spatial information. Similarly, the spectral transformer exploits MCSA to compute covariance-based channel maps for modeling long-range spectral dependence, and the spectral-3D enhanced block is introduced to learn subtle spectral discrepancies. LAFN is designed to replace the traditional MLP in the transformer block, activating the global features and supplementing local details.

Figure 1. The network architecture of our S2Former. S2Former adopts the parallel spatial transformer and spectral transformer to individually explore discriminative features in the spatial and spectral dimensions. The transformer branches mainly consist of a residual-in-residual design, incorporating the multi-head spatial transformer block and the multi-head covariance spectral transformer block.

Figure 2. Illustration of the multi-head spatial transformer block (MSTB) and the multi-head covariance spectral transformer block (MCTB). The core modules of (a) MCTB and (b) MSTB are (c) multi-head covariance spectral attention, (d) multi-head spatial self-attention, and the local activation feed-forward network.

Figure 3. The architecture of the local activation feed-forward network (LAFN). LAFN exploits depth-wise convolution and the fully connected layer to encode local signals and global information.

Figure 4. The detailed schematic of the spatial-3D enhanced block and the spectral-3D enhanced block.

Figure 5. Model performance comparison under different input patch sizes.

Figure 6. The quantitative results in terms of OA with different model depths.

Table 1. Training and test data for each land cover category in the UP dataset.

Table 2. Training and test data for each land cover category in the IN dataset.

Table 3. Training and test data for each land cover category in the BS dataset.

Table 4. Training and test data for each land cover category in the HU dataset.

Table 5. Break-down ablation study on the proposed components.

Table 6. Comparison of quantitative results on the UP dataset.

Table 7. Comparison of quantitative results on the IN dataset.

Table 8. Comparison of quantitative results on the BS dataset.

Table 9. Comparison of quantitative results on the HU dataset.

Table 10. Comparison of the complexity in terms of time and space consumption.