Article

S2Former: Parallel Spectral–Spatial Transformer for Hyperspectral Image Classification

1 College of Internet of Things Engineering, Hohai University, Changzhou 213022, China
2 The Key Laboratory of Jinan Digital Twins and Intelligent Water Conservancy, Shandong Water Conservancy Survey and Design Institute Co., Jinan 250013, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2023, 12(18), 3937; https://doi.org/10.3390/electronics12183937
Submission received: 15 August 2023 / Revised: 6 September 2023 / Accepted: 15 September 2023 / Published: 18 September 2023
(This article belongs to the Special Issue Artificial Intelligence and Sensors with Agricultural Applications)

Abstract

Due to their excellent ability to represent local features, convolutional neural networks (CNNs) have achieved favourable performance in hyperspectral image (HSI) classification tasks. Nevertheless, current CNN models exhibit a marked flaw: they struggle to model dependencies between spatially distant positions. This flaw is especially problematic for the HSI classification task, which aims to extract more discriminative local and global features from limited samples. In this paper, we introduce a spectral–spatial transformer (S2Former), which explores spatial and spectral feature extraction in a dual-stream framework for HSI classification. S2Former, which consists of a spatial transformer and a spectral transformer in parallel branches, extracts discriminative features in the spatial and spectral dimensions. More specifically, we propose multi-head spatial self-attention to capture the long-range spatial dependency of non-adjacent HSI pixels in the spatial transformer. In the spectral transformer, we propose multi-head covariance spectral attention to mine and represent spectral signatures by computing covariance-based channel maps. Meanwhile, a local activation feed-forward network is developed to complement local details. Extensive experiments conducted on four publicly available datasets indicate that our S2Former achieves state-of-the-art performance for the HSI classification task.


1. Introduction

Hyperspectral imaging measures the reflected solar radiation of land cover objects over dozens to hundreds of spectral bands. The hyperspectral image (HSI) collects rich and detailed spectral information, effectively reflecting the subtle spectral differences between different objects. This provides significant potential for Earth observation missions, such as land cover mapping [1], precision agriculture [2], land cover classification [3], environmental monitoring [4], and mineral exploration [5]. Several hyperspectral image data processing techniques have been explored, such as denoising [6,7], unmixing [8,9], super-resolution [10,11,12,13], target detection [14,15], change detection [16], and classification [17,18,19,20,21]. Among these techniques, HSI classification has attracted the most attention.
The main challenge of HSI classification is how to extract sufficiently discriminative features from limited and insufficient samples. CNNs have exhibited potential in learning generalisable features from samples and have therefore been widely exploited to customise HSI classifiers. To tackle the above challenge, CNN-based solutions are mainly grouped into two categories. First, deeper network structures are proposed to extract more delicate features from the limited training samples. Lee et al. [22] proposed a contextual deep CNN (CDCNN) for extracting contextual spatial–spectral features. In [23], a regularised deep feature extraction (FE) method was proposed to effectively address the common problems of insufficient and unbalanced training samples for HSI. The Residual Network (ResNet) [24] and Dense Convolutional Network (DenseNet) [25] were designed to ease the training of deep models. Based on ResNet, SSRN [26] was proposed, which exploits a deep model to improve classification accuracy. Similarly, FDSSC [27] was designed on the DenseNet framework to capture spectral–spatial features. Second, attention mechanisms are introduced to refine and reweight the extracted spatial and spectral features. The attention mechanism, a dynamic weight-adjustment operation on features, is widely used to optimise feature extraction and refine features. Spectral-wise attention (MSDN-SA) [28] was proposed to enhance the discriminatory ability of the model for spectral features. Channel-wise attention and spatial-wise attention were introduced [29,30] to refine spatial and spectral features and achieve outstanding classification results.
However, CNN models only analyse a limited receptive field and thus ignore global contextual information, lacking the ability to model long-range positional dependencies. When applied to high-dimensional hyperspectral data, this inherent bias of CNNs is magnified. The inadequate global representation ability leads to the neglect of global spatial and global spectral features, which most likely contain discriminative information. Even CNN-based methods equipped with self-attention modules still fail to simultaneously capture the global spatial and global spectral information of input HSI cubes.
The transformer has shown excellent performance on most computer vision tasks. It relies on the self-attention mechanism to model global contextual information, simulate a global receptive field, and capture the global information and long-term dependencies of samples. This property greatly alleviates the above-mentioned limitations of CNN-based methods for HSI classification. ViT [31] is the earliest transformer architecture proposed for the field of computer vision and has achieved better results than convolutional neural networks. Inspired by ViT, Xue et al. [32] developed a novel deep hierarchical vision transformer (DHViT) to extract long-term spectral dependencies and hierarchical features. Zhao et al. [33] proposed a Convolutional Transformer Network (CTN), whose Convolutional Transformer (CT) blocks address the transformer's weak local feature extraction and effectively capture the local–global features of HSI patches. A new backbone network named SpectralFormer [34] combines the advantages of the transformer and CNN, aiming to learn local spectral feature representations and feature transfer between shallow and deep layers. Song et al. [35] designed a Bottleneck Spatial–Spectral Transformer (BS2T) to describe the dependencies between HSI pixels over long-range locations and bands. HSI-Mixer [36] uses a simple CNN architecture to simulate the function of the transformer, reconsidering the significant inductive bias of convolution; hybrid measurement-based linear projection and spatial and spectral mixer blocks are constructed to implement spatial–spectral feature fusion and decomposition, respectively. However, these transformer-based classifiers merely introduce the self-attention mechanism to capture global features. Self-attention is a special attention mechanism that only considers adaptability in the spatial dimension but ignores adaptability in the channel dimension, which is also important for the HSI classification task. For transformer-based HSI classifiers, it is essential to design self-attention mechanisms separately tailored for the spatial and spectral dimensions.
In this article, we aim to develop a novel transformer-based dual-branch network architecture, the parallel spectral–spatial transformer (S2Former for short), to achieve high-performance HSI classification by extracting discriminative features in both spatial and spectral dimensions with tailored self-attention mechanisms. S2Former consists of a spatial transformer and a spectral transformer in parallel branches, emphasising the spatially global context and the global spectral information individually. Specifically, the spatial transformer exploits multi-head spatial self-attention (MSSA) and a local activation feed-forward network to learn the spatially global context and local signals. The spectral transformer is equipped with multi-head covariance spectral attention (MCSA) to model the contextualised global relationships between spectra and capture subtle spectral discrepancies.
The main contributions can be concluded as follows.
1.
A parallel spectral–spatial transformer architecture is proposed for HSI classification, which efficiently extracts spectral and spatial features in dual parallel branches.
2.
MCSA and MSSA, which are tailored for spectral and spatial feature extraction, improve the mining of local–global spatial and spectral sequence features.
3.
A local activation feed-forward network is proposed to enhance the extraction of local context signals by encoding information from spatially neighbouring pixel positions.

2. Methodology

As shown in Figure 1, our S2Former consists of a spatial transformer and a spectral transformer. S2Former takes a 3D cube as input; in other words, the target pixel and its adjacent pixels are fed into the network. Given a 3D cube $I \in \mathbb{R}^{M \times M \times O}$, where $M \times M$ denotes the spatial dimension and $O$ is the number of bands, our S2Former first applies a $3 \times 3$ convolutional layer to obtain low-level feature embeddings $F_0 \in \mathbb{R}^{M \times M \times C}$. Next, the shallow features $F_0$ are fed to the spatial transformer and the spectral transformer in parallel and transformed into deep features $F_D^{Spa} \in \mathbb{R}^{M \times M \times C}$ and $F_D^{Spe} \in \mathbb{R}^{M \times M \times C}$. The spatial transformer contains multiple multi-head spatial transformer groups; similarly, the spectral transformer consists of a series of multi-head spectral transformer groups. Then, we use the learnable weights $\alpha$ and $\beta$ to reweight the deep spatial features $F_D^{Spa}$ and the deep spectral features $F_D^{Spe}$.
The learnable weights $\alpha$ and $\beta$ optimise spatial and spectral feature extraction through backward propagation, yielding the fused output features $F_{out} \in \mathbb{R}^{M \times M \times C}$. Finally, the fused output features pass through a fully connected layer, which maps them into the predicted results.
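The data flow above can be summarised in a minimal PyTorch sketch, assuming the branch modules of Sections 2.1 and 2.2 are supplied separately; the class and argument names here are illustrative placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

class S2FormerSketch(nn.Module):
    """Hypothetical sketch of the S2Former pipeline: 3x3 shallow embedding,
    two parallel branches, learnable alpha/beta fusion, and a linear head."""
    def __init__(self, bands, channels, num_classes, spatial_branch, spectral_branch, patch=9):
        super().__init__()
        self.embed = nn.Conv2d(bands, channels, kernel_size=3, padding=1)   # F_0
        self.spatial_branch = spatial_branch     # stack of MSTGs (Section 2.1)
        self.spectral_branch = spectral_branch   # stack of MCTGs (Section 2.2)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable weight for F_D^Spa
        self.beta = nn.Parameter(torch.tensor(0.5))    # learnable weight for F_D^Spe
        self.fc = nn.Linear(channels * patch * patch, num_classes)

    def forward(self, cube):                     # cube: (B, O, M, M) HSI patch
        f0 = self.embed(cube)                    # shallow features (B, C, M, M)
        f_spa = self.spatial_branch(f0)          # deep spatial features
        f_spe = self.spectral_branch(f0)         # deep spectral features
        fused = self.alpha * f_spa + self.beta * f_spe   # F_out
        return self.fc(fused.flatten(1))         # predicted class logits
```

Passing `nn.Identity()` for both branches yields a runnable shell into which the group and block modules sketched in the following subsections can be plugged.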

2.1. Spatial Transformer

As illustrated in Figure 1, the spatial transformer is a stack of multi-head spatial transformer groups, similar to an encoder, that extracts deeper spatial features. The spatial transformer contains $K$ multi-head spatial transformer groups (MSTGs). The spatial features extracted group by group are given by
$$F_i^{Spa} = H_{MSTG}^{i}\left(F_{i-1}^{Spa}\right), \quad i = 1, 2, 3, \ldots, K,$$
where $H_{MSTG}^{i}(\cdot)$ is the $i$-th multi-head spatial transformer group. The multi-head spatial transformer group is a residual group with multiple multi-head spatial transformer blocks and a spatial-3D enhanced block. Given the input feature $F_{i,0}^{Spa}$ of the $i$-th MSTG, we first extract intermediate spatial features $F_{i,j}^{Spa}$ by $L$ multi-head spatial transformer blocks (MSTBs) as
$$F_{i,j}^{Spa} = H_{MSTB}^{i,j}\left(F_{i,j-1}^{Spa}\right), \quad j = 1, 2, 3, \ldots, L,$$
where $H_{MSTB}^{i,j}(\cdot)$ denotes the $j$-th multi-head spatial transformer block in the $i$-th multi-head spatial transformer group. Next, the spatial features $F_{i,L}^{Spa}$ are enhanced by a spatial-3D enhanced block,
$$F_{i,out}^{Spa} = H_{Spa\_3D}^{i}\left(F_{i,L}^{Spa}\right),$$
where $H_{Spa\_3D}^{i}(\cdot)$ is the spatial-3D enhanced block in the $i$-th multi-head spatial transformer group. Next, we give a specific description of the multi-head spatial transformer block and the spatial-3D enhanced block, which are the core components of the multi-head spatial transformer group.
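A possible composition of one such group is sketched below; the outer skip connection reflects our reading of the "residual group" description and Figure 1's residual-in-residual design, and is an assumption rather than a stated equation.

```python
import torch.nn as nn

class MSTGSketch(nn.Module):
    """Hypothetical multi-head spatial transformer group: L transformer blocks
    followed by a spatial-3D enhanced block, wrapped in an assumed residual."""
    def __init__(self, block_factory, enhance_block, num_blocks=4):
        super().__init__()
        # L multi-head spatial transformer blocks (MSTBs), built by a user-supplied factory
        self.blocks = nn.Sequential(*[block_factory() for _ in range(num_blocks)])
        self.enhance = enhance_block   # spatial-3D enhanced block (Section 2.1.3)

    def forward(self, x):
        # assumed group-level skip connection (residual-in-residual reading of Figure 1)
        return x + self.enhance(self.blocks(x))
```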

2.1.1. Multi-Head Spatial Self-Attention

As shown in Figure 2b, a multi-head spatial transformer block contains a multi-head spatial self-attention (MSSA), a local activation feed-forward network (LAFN), and layer normalisation (LN) modules.
CNN-based methods exploit the local receptive field to extract features in HSI classification but have difficulty modeling pixels at spatially distant positions. To capture non-local long-range dependencies, our spatial transformer exploits MSSA to model long-range dependencies in the spatial dimension. As demonstrated in Figure 2, given an input $X_{in} \in \mathbb{R}^{M \times M \times C}$, MSSA applies self-attention across global spatial locations and generates an attention map modeling long-range dependencies and spatial interactions. $X_{in}$ is first linearly projected into a query $Q_{spa} \in \mathbb{R}^{M^2 \times C}$, key $K_{spa} \in \mathbb{R}^{M^2 \times C}$, and value $V_{spa} \in \mathbb{R}^{M^2 \times C}$,
$$Q_{spa} = W_Q X_{in}, \quad K_{spa} = W_K X_{in}, \quad V_{spa} = W_V X_{in},$$
where $W_Q$, $W_K$, and $W_V \in \mathbb{R}^{C \times C}$ are learnable projection matrices. The attention matrix is computed by the self-attention mechanism: we apply a dot-product interaction on $Q_{spa}$ and $K_{spa}$ to generate the spatial attention map $A_{spa} \in \mathbb{R}^{M^2 \times M^2}$,
$$A_{spa} = \mathrm{Softmax}\left(\frac{Q_{spa} \cdot K_{spa}^{\top}}{\sqrt{C}} + B\right),$$
$$\mathrm{Attention}\left(Q_{spa}, K_{spa}, V_{spa}\right) = W_{out} \cdot V_{spa} \cdot A_{spa},$$
where $W_{out} \in \mathbb{R}^{C \times C}$ is also a learnable projection matrix, and $B$ is a learnable relative positional encoding. Following multi-head self-attention [37], MSSA divides the channels into several "heads", performs the attention function for each head in parallel, and concatenates the results to obtain the multi-head output.
A LayerNorm (LN) layer is added before MSSA, and a residual connection is employed to obtain the output feature map $\hat{X}_{in}^{Spa} \in \mathbb{R}^{M \times M \times C}$,
$$\hat{X}_{in}^{Spa} = H_{MSSA}\left(H_{LN}(X_{in})\right) + X_{in},$$
where $H_{MSSA}(\cdot)$ and $H_{LN}(\cdot)$ denote the functions of MSSA and the LayerNorm layer, respectively.
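The following PyTorch sketch illustrates one way to realise MSSA over the $M^2$ pixel tokens; the per-head scaling and the shape of the relative position bias are our assumptions, and the projections follow the standard multi-head formulation rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class MSSASketch(nn.Module):
    """Hypothetical multi-head spatial self-attention over M*M pixel tokens."""
    def __init__(self, channels, heads, patch=9):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.dim = heads, channels // heads
        self.qkv = nn.Linear(channels, channels * 3, bias=False)    # W_Q, W_K, W_V
        self.out = nn.Linear(channels, channels, bias=False)        # W_out
        # learnable relative positional bias B, one map per head (assumed shape)
        self.bias = nn.Parameter(torch.zeros(heads, patch * patch, patch * patch))

    def forward(self, x):                        # x: (B, C, M, M)
        b, c, m, _ = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, M*M, C) pixel tokens
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        # split channels into heads: (B, heads, M*M, C/heads)
        q, k, v = (t.reshape(b, m * m, self.heads, self.dim).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.dim ** 0.5 + self.bias   # spatial map A_spa
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, m * m, c)
        return self.out(out).transpose(1, 2).reshape(b, c, m, m)
```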

2.1.2. Local Activation Feed-Forward Network

In the traditional feed-forward network, two fully connected layers are applied to expand the input feature channels and then map them back to the original input dimension. The fully connected layer processes token information point-wise in an identical manner and thus neglects local information. In our work, we propose the local activation feed-forward network (LAFN), which complements local information by encoding information from spatially neighbouring pixel positions. As shown in Figure 3, we complement the local details in the feed-forward network with two operations. First, we exploit a depth-wise convolution layer between the two fully connected layers to explore local signals from the global feature information in the regular branch. Second, we add a branch with depth-wise convolution to activate the local signal. An element-wise product is used to aggregate the local and global information streams of the two parallel branches.
Given an input feature $\hat{X}_{in} \in \mathbb{R}^{M \times M \times C}$, LAFN is formulated as
$$X^{\prime} = W_D W_1 \mathrm{LN}\left(\hat{X}_{in}\right) \odot H_{Gelu}\left(W_P W_D \mathrm{LN}\left(\hat{X}_{in}\right)\right), \qquad X = \hat{X}_{in} + W_2 X^{\prime},$$
where $H_{Gelu}$ represents the GELU non-linearity, $\odot$ denotes the element-wise product, and $W_{*}$ ($*$ denotes 1 or 2) denotes the fully connected layers. $W_P$ denotes point-wise convolution with a kernel size of $1 \times 1$, and $W_D$ represents $3 \times 3$ depth-wise convolution. Overall, the LAFN controls the information flow through the activated local signal in our pipeline, thereby focusing on fine details.
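A sketch of LAFN under this reading is given below; the expansion ratio and the use of $1 \times 1$ convolutions in place of fully connected layers (equivalent for pixel tokens) are our assumptions.

```python
import torch.nn as nn

class LAFNSketch(nn.Module):
    """Hypothetical local activation feed-forward network: the regular branch
    (FC -> depth-wise conv) is gated via an element-wise product by a parallel
    GELU-activated depth-wise branch, then projected back and added residually."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.LayerNorm(channels)
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)                                # W_1
        self.dw1 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)        # W_D
        self.dw2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)  # W_D
        self.pw = nn.Conv2d(channels, hidden, kernel_size=1)                                 # W_P
        self.act = nn.GELU()                                                                 # H_Gelu
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)                                # W_2

    def forward(self, x):                                          # x: (B, C, M, M)
        y = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # LN over the channel dim
        gated = self.dw1(self.fc1(y)) * self.act(self.pw(self.dw2(y)))  # element-wise product
        return x + self.fc2(gated)                                 # residual output
```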

2.1.3. Spatial-3D Enhanced Block

The spatial-3D enhanced block is designed to maintain the global spatial features and enhance the local spatial feature expression. As shown in Figure 4, we design the spatial-3D enhanced block similarly to DBDA [30]. The input spatial feature $F_{i,L}^{Spa} \in \mathbb{R}^{M \times M \times C}$ is re-calibrated and reweighted through the remapping of local spatial feature coordinates in the spatial-3D enhanced block. First, the high-dimensional features across channels are mapped to low-dimensional ones by a $1 \times 1 \times C$ 3D convolution. Next, the features are passed to three spatial-3D enhanced layers with dense connections. Each spatial-3D enhanced layer includes a $3 \times 3 \times 1$ 3D convolution, 3D batch normalisation, and a Mish activation function.
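A hedged sketch of this block is shown below; the growth rate of the dense layers and the final $1 \times 1$ fusion back to $C$ channels are not specified in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class Spatial3DEnhancedSketch(nn.Module):
    """Hypothetical spatial-3D enhanced block: a 1x1xC 3D convolution collapses
    the channel depth, then three densely connected layers of 3x3x1 3D conv +
    3D batch norm + Mish refine local spatial detail."""
    def __init__(self, channels, growth=12):
        super().__init__()
        # features viewed as (B, 1, C, M, M); kernel (C, 1, 1) maps across all channels
        self.reduce = nn.Conv3d(1, growth, kernel_size=(channels, 1, 1))
        self.layers = nn.ModuleList()
        for i in range(3):   # dense connection: each layer sees all previous outputs
            self.layers.append(nn.Sequential(
                nn.Conv3d(growth * (i + 1), growth, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
                nn.BatchNorm3d(growth),
                nn.Mish(),
            ))
        self.expand = nn.Conv3d(growth * 4, channels, kernel_size=1)   # assumed fusion to C

    def forward(self, x):                                  # x: (B, C, M, M)
        feats = [self.reduce(x.unsqueeze(1))]              # (B, growth, 1, M, M)
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.expand(torch.cat(feats, dim=1)).squeeze(2)   # back to (B, C, M, M)
```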

2.2. Spectral Transformer

As exhibited in Figure 1, similar in structure to the spatial transformer, the spectral transformer consists of $K$ multi-head covariance spectral transformer groups (MCTGs) that aim to extract discriminative spectral features. The spectral features extracted group by group are given by
$$F_i^{Spe} = H_{MCTG}^{i}\left(F_{i-1}^{Spe}\right), \quad i = 1, 2, 3, \ldots, K.$$
The multi-head covariance spectral transformer group is a residual group with multiple multi-head covariance spectral transformer blocks and a spectral-3D enhanced block. Given the input features $F_{i,0}^{Spe}$ of the $i$-th MCTG, we first extract intermediate spectral features $F_{i,j}^{Spe}$ by $L$ multi-head covariance spectral transformer blocks (MCTBs) as
$$F_{i,j}^{Spe} = H_{MCTB}^{i,j}\left(F_{i,j-1}^{Spe}\right), \quad j = 1, 2, 3, \ldots, L,$$
where $H_{MCTB}^{i,j}(\cdot)$ denotes the $j$-th multi-head covariance spectral transformer block in the $i$-th multi-head covariance spectral transformer group. Next, the spectral features $F_{i,L}^{Spe}$ are enhanced by a spectral-3D enhanced block,
$$F_{i,out}^{Spe} = H_{Spe\_3D}^{i}\left(F_{i,L}^{Spe}\right),$$
where $H_{Spe\_3D}^{i}(\cdot)$ is the spectral-3D enhanced block in the $i$-th multi-head covariance spectral transformer group. Below, we elaborate on the details of our multi-head covariance spectral transformer block and spectral-3D enhanced block.

2.2.1. Multi-Head Covariance Spectral Attention

Different from natural images, HSIs are spectrally correlated and have numerous narrow bands; capturing local and global spectral features is therefore equally essential. In our work, we propose multi-head covariance spectral attention (MCSA) to model the inter-spectra similarity and long-range dependencies. MCSA applies self-attention across spectral channels: it computes the cross-covariance across channels to generate an attention map encoding global spectral signals. $X_{in}$ is first projected and reshaped into a query $Q_{spe} \in \mathbb{R}^{C \times M^2}$, key $K_{spe} \in \mathbb{R}^{C \times M^2}$, and value $V_{spe} \in \mathbb{R}^{C \times M^2}$ by applying $1 \times 1$ point-wise convolutions $W_P$ followed by $3 \times 3$ depth-wise convolutions $W_D$ to encode spatial context in a spectral-wise manner,
$$Q_{spe} = W_P^Q W_D^Q X_{in}, \quad K_{spe} = W_P^K W_D^K X_{in}, \quad V_{spe} = W_P^V W_D^V X_{in}.$$
Next, the spectral attention map is computed by the self-attention mechanism. We apply a dot-product interaction on $Q_{spe}$ and $K_{spe}$ to generate the spectral attention map $A_{spe} \in \mathbb{R}^{C \times C}$,
$$A_{spe} = \mathrm{Softmax}\left(\frac{Q_{spe} \cdot K_{spe}^{\top}}{\varepsilon} + B\right),$$
$$\mathrm{Attention}\left(Q_{spe}, K_{spe}, V_{spe}\right) = W_P \cdot V_{spe} \cdot A_{spe},$$
where $\varepsilon$ is a learnable parameter to reweight the dot product of $Q_{spe}$ and $K_{spe}$ before applying the softmax function.
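A sketch of MCSA in PyTorch follows; splitting the channels into heads, the learnable temperature standing in for $\varepsilon$, and the bias shape are our assumptions on top of the formulas above.

```python
import torch
import torch.nn as nn

class MCSASketch(nn.Module):
    """Hypothetical multi-head covariance spectral attention: Q/K/V come from
    1x1 point-wise plus 3x3 depth-wise convolutions, and attention is computed
    across channels (a covariance-style C x C map) instead of across pixels."""
    def __init__(self, channels, heads):
        super().__init__()
        assert channels % heads == 0
        self.heads, self.dim = heads, channels // heads
        self.pw = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)    # W_P^{Q,K,V}
        self.dw = nn.Conv2d(channels * 3, channels * 3, kernel_size=3, padding=1,
                            groups=channels * 3, bias=False)                      # W_D^{Q,K,V}
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))                  # epsilon
        self.bias = nn.Parameter(torch.zeros(heads, self.dim, self.dim))          # B (assumed)
        self.project = nn.Conv2d(channels, channels, kernel_size=1, bias=False)   # output W_P

    def forward(self, x):                                  # x: (B, C, M, M)
        b, c, m, _ = x.shape
        q, k, v = self.dw(self.pw(x)).chunk(3, dim=1)                     # each (B, C, M, M)
        q, k, v = (t.reshape(b, self.heads, self.dim, m * m) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.temperature + self.bias   # (B, heads, C/h, C/h)
        attn = attn.softmax(dim=-1)                                       # spectral map A_spe
        out = (attn @ v).reshape(b, c, m, m)                              # reweighted spectra
        return self.project(out)
```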

2.2.2. Spectral-3D Enhanced Block

The spectral-3D enhanced block is stacked behind the $L$ multi-head covariance spectral transformer blocks in each multi-head covariance spectral transformer group. It is designed to learn inter-spectra correlations after the global spectral features have been modelled. Given $F_{i,L}^{Spe} \in \mathbb{R}^{M \times M \times C}$, the spectral-3D enhanced block first applies a 3D convolution with a convolutional kernel of $1 \times 1 \times 7$ to obtain low-level local spectral features. Then, three spectral-3D enhanced layers are applied to extract spectral information. Each spectral-3D enhanced layer contains a 3D convolution, 3D batch normalisation, and a Mish activation function. Except for the third 3D convolution layer, the remaining 3D convolution layers use a convolutional kernel of $1 \times 1 \times 7$. The input of the $i$-th spectral-3D enhanced layer is the concatenated output of the preceding layers.
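For completeness, a sketch of this spectral counterpart is given below; the growth rate, the kernel of the third layer, and how the dense output is consumed downstream are unspecified in the text, so they are assumptions here.

```python
import torch
import torch.nn as nn

def spectral_enhanced_layer(in_ch, growth, kernel_depth=7):
    """Hypothetical spectral-3D enhanced layer: a 1x1x7 convolution along the
    spectral depth, followed by 3D batch norm and Mish."""
    return nn.Sequential(
        nn.Conv3d(in_ch, growth, kernel_size=(kernel_depth, 1, 1),
                  padding=(kernel_depth // 2, 0, 0)),
        nn.BatchNorm3d(growth),
        nn.Mish(),
    )

class Spectral3DEnhancedSketch(nn.Module):
    """Dense stack of three spectral-3D enhanced layers behind an initial
    1x1x7 convolution; each layer receives the concatenation of all earlier outputs."""
    def __init__(self, growth=12):
        super().__init__()
        self.head = spectral_enhanced_layer(1, growth)   # low-level local spectral features
        self.layers = nn.ModuleList(
            [spectral_enhanced_layer(growth * (i + 1), growth) for i in range(3)])

    def forward(self, x):                       # x: (B, C, M, M); depth axis = spectra
        feats = [self.head(x.unsqueeze(1))]     # (B, growth, C, M, M)
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # dense connection
        return torch.cat(feats, dim=1)          # concatenated spectral features
```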

3. Experimental Results

To verify the performance of our S2Former, we conduct ablation studies and comparative experiments on four public datasets: the Indian Pines (IN), Pavia University (UP), Botswana (BS), and Houston (HU) datasets. All experiments presented in this section are conducted on an Nvidia GeForce RTX 3090 GPU.

3.1. Experimental Settings

Comparison Methods: To verify the effectiveness of the proposed S2Former, seven state-of-the-art HSI classifiers are used for comparison, including SVM [38], CDCNN [22], FDSSC [27], DBDA [30], SpectralFormer [34], HSI-Mixer [36], and BS2T [35].
Optimised Parameters: The proposed S2Former is trained on the four datasets for 200 epochs. The batch size is 16, and the Adam optimiser with a learning rate of 0.0005 is used to update the network parameters. For the compared methods and our S2Former, we select input patches with a size of $9 \times 9 \times O$. The training and test data of the four datasets are listed in Table 1, Table 2, Table 3 and Table 4. The UP dataset has the largest number of samples, so only 0.5% of the samples are selected for training and the rest are used for testing. The labelled samples of each category in the IN dataset are unevenly distributed; to utilise the training samples reasonably, we choose 3% of the samples for training. For BS and HU, we use 1% of the labelled samples to train the models and the remaining 99% to test their performance.
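A minimal training loop matching these settings might look as follows; `model` and `train_loader` are placeholders for the network and a patch dataset, and this is an illustrative sketch rather than the authors' training script.

```python
import torch
from torch import nn, optim

def train(model, train_loader, device="cuda", epochs=200, lr=5e-4):
    """Hypothetical training loop: Adam, lr = 0.0005, 200 epochs, batch size 16
    (set in the DataLoader), cross-entropy loss on 9x9xO input patches."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for patches, labels in train_loader:        # patches: (16, O, 9, 9)
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
    return model
```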
Metrics: The classification performance is evaluated with overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa) [39].
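These three metrics can be computed from the confusion matrix; the sketch below uses the standard definitions and is not the authors' evaluation code.

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA, and the kappa coefficient from a confusion matrix
    (rows = ground truth, columns = prediction)."""
    conf = np.asarray(conf, dtype=np.float64)
    total = conf.sum()
    oa = np.trace(conf) / total                                    # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)                   # per-class accuracy
    aa = per_class.mean()                                          # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```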

3.2. Ablation Studies

3.2.1. Discussion on Input Patch Size Comparison

Our S2Former is a 3D cube-wise HSI classification model. The size of the 3D cube determines the number of neighbouring pixels, i.e., the amount of contextual information available for the centre pixel. Therefore, the patch size is closely related to the performance of our S2Former. To explore its impact on HSI classification accuracy, we conduct an ablation study on the input HSI patch size, reporting HSI classification results with the spatial dimension of the 3D cube ranging from $3 \times 3$ to $13 \times 13$ on the four standard datasets. The quantitative results are shown in Figure 5. The patch size of the input 3D cube has a significant influence on classification performance, and a patch size of $9 \times 9$ obtains the optimal results on all four datasets.
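The patch-size ablation only changes how the input cube is cropped around each labelled pixel; a hypothetical cube-extraction helper (with zero padding at image borders, which is our assumption) is sketched below.

```python
import numpy as np

def extract_patch(hsi, row, col, patch=9):
    """Crop a patch x patch x O cube centred on (row, col) from an (H, W, O)
    hyperspectral image, zero-padding at the borders."""
    half = patch // 2
    padded = np.pad(hsi, ((half, half), (half, half), (0, 0)), mode="constant")
    cube = padded[row:row + patch, col:col + patch, :]   # centre pixel stays at the middle
    return cube.transpose(2, 0, 1)                       # (O, patch, patch) for the network
```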

3.2.2. Discussion on Model Depth

The numbers of groups and blocks in the spatial transformer and spectral transformer determine the network depth of S2Former. We select integer values for the group number and block number in the range $\{1, 2, 3, 4, 5, 6, 7, 8\}$. The classification performance (OA) of our S2Former with different numbers of groups and blocks is shown in Figure 6. We show the effects of the group number and block number on the Botswana dataset (Figure 6a), Indian Pines dataset (Figure 6b), and Pavia University dataset (Figure 6c). It is observed that the OA is correlated with the block number and group number. The blocks, including the multi-head spatial transformer block in the spatial transformer and the multi-head covariance spectral transformer block in the spectral transformer, are stacked into groups to capture the long-range dependencies of HSI pixels and bands. However, these long-range dependencies are damaged when excessive blocks and groups are added. The group number and block number of our S2Former are both set to 4 to obtain the desired results with reasonable computational efficiency.

3.2.3. Discussion on Parallel Spectral–Spatial Transformer Architecture

Our S2Former is a dual-stream architecture with a spectral transformer and a spatial transformer in parallel, designed to capture discriminative features in the spectral and spatial dimensions. To demonstrate the effectiveness of the dual-stream architecture, we compare the classification results of the single spatial transformer, the single spectral transformer, and our S2Former. Table 5 reports the classification results of the break-down ablation experiments on the HU, UP, and BS datasets. The quantitative results of Table 5(d), Table 5(h), and the full S2Former demonstrate that the combination of the spectral transformer and the spatial transformer achieves the optimal performance. Compared to our S2Former, a single spectral or spatial transformer fails to capture comprehensive spectral–spatial features.

3.2.4. Discussion on Self-Attention and Feed-Forward Network

The self-attention modules and feed-forward network used in S2Former play an important role in classification performance. The proposed self-attention mechanisms, including MSSA in the spatial transformer and MCSA in the spectral transformer, aim to produce global spatial and spectral features. Considering the weaknesses of the traditional feed-forward network, we design LAFN. To investigate the contributions of the proposed MSSA, MCSA, and LAFN, an ablation study is conducted. As seen in Table 5, our MCSA shows feature-extraction ability comparable to that of MSSA; spectral feature representation is equally important for HSI classification given the redundant spectrum. Table 5(b) and (f) confirm that LAFN obtains a significant gain compared to a traditional MLP, which can be attributed to its supplement of local details.

3.3. Comparison with Other Methods

To confirm the superiority of our S2Former, seven state-of-the-art classifiers are used for comparison. Except for SVM, the remaining six methods are patch-based HSI classifiers. Furthermore, the compared classifiers and our S2Former use the same parameter settings.
The quantitative results are shown in Table 6, Table 7, Table 8 and Table 9, which present the detailed classification results on the UP, IN, BS, and HU datasets, respectively. Our S2Former clearly achieves better results than the other seven classifiers. In Table 6, the proposed method achieves 97.40% in OA, 97.71% in AA, and 96.54% in Kappa. SVM achieves the worst results. The 2D-CNN-based CDCNN is only more accurate than SVM. FDSSC exploits dense connections to improve network performance. DBDA extracts spatial and spectral features by adding attention mechanisms and obtains superior classification accuracy. BS2T, which also employs a transformer architecture, obtains performance second only to our method.
In addition, in Table 6, Table 7, Table 8 and Table 9, the best classification accuracy for each land cover category is highlighted. Our S2Former also achieves satisfactory results in each land cover category. Although the seventh land cover category in Table 7, Grass-pasture-mowed, has only three training samples, the classification accuracy of S2Former is improved by 13.18% compared to the second-best HSI-Mixer. S2Former uses the proposed self-attention mechanisms to model global spatial–spectral features and extract discriminative features to the maximum extent. Therefore, the proposed S2Former achieves a significant gain in fine-grained classification.
In Table 7, the tenth (Soybean-notill), eleventh (Soybean-mintill), and twelfth (Soybean-clean) land cover categories in the IN dataset have very similar characteristics. S2Former obtains the highest classification accuracy (94.82%, 97.21%, and 99.42%) among all SOTA methods. Similar situations can also be found in the BS and HU datasets, such as Acacia woodlands, Acacia shrublands, and Acacia grasslands in Table 8, and Stressed Grass and Synthetic Grass in Table 9. Our proposed S2Former obtains outstanding classification results on all four datasets.
To visually demonstrate the performance of our S2Former, we present the false-color images, ground-truth maps, and classification maps of each method in Figure 7, Figure 8, Figure 9 and Figure 10. Our method produces clearer boundaries and less noise and outperforms the other seven SOTA methods, which benefits from the dual-stream spectral–spatial transformer structure of S2Former. SVM performs the worst among all methods, with large intraclass noise. CDCNN, with its simple network structure, also produces considerable noise. Although the other five comparison methods achieve relatively accurate and smooth classification maps, they still contain some errors compared with the proposed S2Former.
Specifically, in Figure 7, CDCNN misclassifies Bare Soil as Meadows, resulting in poor classification results. Although DBDA works better on the Bare Soil class, it shows obvious misclassification in the Bitumen class. Since our S2Former extracts global features and supplements local details, its classification results are slightly better than those of BS2T in the visualisation. In Figure 8, the compared methods misclassify two similar land cover categories (Soybean-notill and Soybean-clean). We propose multi-head covariance spectral attention in the spectral transformer branch to exploit the high similarity and correlation across the spectral dimension; therefore, our S2Former can accurately classify these two land cover categories. Considering the scattered distribution of labelled pixels in the BS and HU datasets, we magnify a specific area of each of the two datasets to exhibit the performance. As shown in Figure 9 and Figure 10, when dealing with isolated objects of small size, the other methods exhibit over-smoothing or misclassification, whereas our method maintains appreciable classification results even for cluttered pixels.

3.4. Comparison of the Complexity

We compare the complexity of our S2Former and state-of-the-art HSI classification algorithms in terms of time and space consumption. We select FLOPs and the number of parameters to represent space consumption and the training time to represent time consumption. Table 10 shows the number of parameters, the FLOPs, and the training times. The results demonstrate that our S2Former achieves higher performance with a better tradeoff between model size and accuracy: it achieves significant accuracy gains at the cost of only a modest increase in model complexity.

4. Conclusions

In this article, we propose a novel spectral–spatial transformer (S2Former) for HSI classification. S2Former exploits a dual-branch transformer architecture to extract discriminative feature streams with local and global receptive fields in the spatial and spectral dimensions. First, the spatial transformer consists of two components: the proposed MSSA applies self-attention in the spatial dimension to encode long-range spatial position information, and the spatial-3D enhanced block adaptively captures local spatial information. Similarly, the spectral transformer exploits MCSA to compute covariance-based channel maps to model long-range spectral dependence, and the spectral-3D enhanced block is introduced to learn subtle spectral discrepancies. LAFN is designed to replace the traditional MLP in the transformer block; it activates the global features and complements local details. Finally, the spatial and spectral features extracted from the dual branches are fused to obtain the classification result.
Extensive experimental results demonstrate the superiority of our S2Former over SOTA HSI classifiers. Our S2Former has good scalability in spectral–spatial feature extraction for other HSI tasks.

Author Contributions

D.Y. (Dong Yuan): Conceptualisation, Methodology, Writing—review and editing. D.Y. (Dabing Yu): Software, Writing—original draft, Writing—review and editing. Y.Q.: Writing—original draft, Data curation, Visualisation. Y.X.: Writing—review and editing. Y.L.: Data curation, Visualisation, Writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Open Research Fund of Key Laboratory of Jinan Digital Twins and Intelligent Water Conservancy (37H2022KY040116) and Changzhou Science and Technology Project (CJ20220089).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, F.; Jiang, H.; Van de Voorde, T.; Lu, S.; Xu, W.; Zhou, Y. Land cover mapping in urban environments using hyperspectral APEX data: A study case in Baden, Switzerland. Int. J. Appl. Earth Obs. Geoinf. 2018, 71, 70–82. [Google Scholar] [CrossRef]
  2. Sethy, P.K.; Pandey, C.; Sahu, Y.K.; Behera, S.K. Hyperspectral imagery applications for precision agriculture-a systemic survey. Multimed. Tools Appl. 2021, 81, 3005–3038. [Google Scholar] [CrossRef]
  3. Zhao, X.; Niu, J.; Liu, C.; Ding, Y.; Hong, D. Hyperspectral Image Classification Based on Graph Transformer Network and Graph Attention Mechanism. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  4. Pour, A.B.; Zoheir, B.; Pradhan, B.; Hashim, M. Editorial for the special issue: Multispectral and hyperspectral remote sensing data for mineral exploration and environmental monitoring of mined areas. Remote Sens. 2021, 13, 519. [Google Scholar] [CrossRef]
  5. Peyghambari, S.; Zhang, Y. Hyperspectral remote sensing in lithological mapping, mineral exploration, and environmental geology: An updated review. J. Appl. Remote Sens. 2021, 15, 031501. [Google Scholar] [CrossRef]
  6. Cao, X.; Fu, X.; Xu, C.; Meng, D. Deep spatial-spectral global reasoning network for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  7. Zhang, H.; Cai, J.; He, W.; Shen, H.; Zhang, L. Double low-rank matrix decomposition for hyperspectral image denoising and destriping. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–19. [Google Scholar] [CrossRef]
  8. Rasti, B.; Koirala, B. SUnCNN: Sparse unmixing using unsupervised convolutional neural network. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  9. Zhang, X.; Sun, Y.; Zhang, J.; Wu, P.; Jiao, L. Hyperspectral unmixing via deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1755–1759. [Google Scholar] [CrossRef]
  10. He, J.; Yuan, Q.; Li, J.; Xiao, Y.; Liu, X.; Zou, Y. DsTer: A dense spectral transformer for remote sensing spectral super-resolution. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102773. [Google Scholar] [CrossRef]
  11. Li, Q.; Wang, Q.; Li, X. Exploring the Relationship Between 2D/3D Convolution for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8693–8703. [Google Scholar] [CrossRef]
  12. Zou, C.; Zhang, C. Hyperspectral image super-resolution using cluster-based deep convolutional networks. Signal Process. Image Commun. 2023, 110, 116884. [Google Scholar] [CrossRef]
  13. Zou, C.; Huang, X. Hyperspectral image super-resolution combining with deep learning and spectral unmixing. Signal Process. Image Commun. 2020, 84, 115833. [Google Scholar] [CrossRef]
  14. Xie, W.; Zhang, X.; Li, Y.; Wang, K.; Du, Q. Background learning based on target suppression constraint for hyperspectral target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5887–5897. [Google Scholar] [CrossRef]
  15. Zhao, X.; Hou, Z.; Wu, X.; Li, W.; Ma, P.; Tao, R. Hyperspectral target detection based on transform domain adaptive constrained energy minimization. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102461. [Google Scholar] [CrossRef]
  16. Qu, J.; Xu, Y.; Dong, W.; Li, Y.; Du, Q. Dual-Branch Difference Amplification Graph Convolutional Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  17. Sun, G.; Zhang, X.; Jia, X.; Ren, J.; Zhang, A.; Yao, Y.; Zhao, H. Deep fusion of localized spectral features and multi-scale spatial features for effective classification of hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2020, 91, 102157. [Google Scholar] [CrossRef]
  18. Yu, D.; Li, Q.; Wang, X.; Xu, C.; Zhou, Y. A Cross-Level Spectral—Spatial Joint Encode Learning Framework for Imbalanced Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  19. Kumar, V.; Singh, R.S.; Dua, Y. Morphologically dilated convolutional neural network for hyperspectral image classification. Signal Process. Image Commun. 2022, 101, 116549. [Google Scholar] [CrossRef]
  20. Mookambiga, A.; Gomathi, V. Kernel eigenmaps based multiscale sparse model for hyperspectral image classification. Signal Process. Image Commun. 2021, 99, 116416. [Google Scholar] [CrossRef]
  21. Ghasrodashti, E.K.; Sharma, N. Hyperspectral image classification using an extended Auto-Encoder method. Signal Process. Image Commun. 2021, 92, 116111. [Google Scholar] [CrossRef]
  22. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3322–3325. [Google Scholar]
  23. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  26. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
  27. Wang, W.; Dou, S.; Jiang, Z.; Sun, L. A fast dense spectral–spatial convolution network framework for hyperspectral images classification. Remote Sens. 2018, 10, 1068. [Google Scholar] [CrossRef]
  28. Fang, B.; Li, Y.; Zhang, H.; Chan, J.C.W. Hyperspectral images classification based on dense convolutional networks with spectral-wise attention mechanism. Remote Sens. 2019, 11, 159. [Google Scholar] [CrossRef]
  29. Ma, W.; Yang, Q.; Wu, Y.; Zhao, W.; Zhang, X. Double-branch multi-attention mechanism network for hyperspectral image classification. Remote Sens. 2019, 11, 1307. [Google Scholar] [CrossRef]
  30. Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process. 2022, 31, 3095–3110. [Google Scholar] [CrossRef]
  33. Zhao, Z.; Hu, D.; Wang, H.; Yu, X. Convolutional Transformer Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  34. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  35. Song, R.; Feng, Y.; Cheng, W.; Mu, Z.; Wang, X. BS2T: Bottleneck Spatial–Spectral Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  36. Liang, H.; Bao, W.; Shen, X.; Zhang, X. HSI-mixer: Hyperspectral image classification using the spectral–spatial mixer representation from convolutions. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Nice, France, 2017; Volume 30. [Google Scholar]
  38. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef]
  39. Gao, H.; Chen, Z.; Xu, F. Adaptive spectral-spatial feature fusion network for hyperspectral image classification using limited training samples. Int. J. Appl. Earth Obs. Geoinf. 2022, 107, 102687. [Google Scholar] [CrossRef]
Figure 1. The network architecture of our S2Former. S2Former adopts the parallel spatial transformer and spectral transformer to individually explore discriminative features in spatial and spectral dimensions. The transformer branches mainly consist of residual in residual design, incorporating a multi-head spatial transformer block and multi-head covariance spectral transformer block.
Figure 2. Illustration of the multi-head spatial transformer block (MSTB) and multi-head covariance spectral transformer block (MCTB). The core modules of (a) MCTB and (b) MSTB are (c) multi-head covariance spectral attention, (d) multi-head spatial self-attention, and the local activation feed-forward network.
Figure 3. The architecture of the local activation feed-forward network (LAFN). LAFN exploits depth-wise convolution and the fully connected layer to encode local signals and global information.
Figure 4. The detailed schematic of the spatial-3D enhanced block and spectral-3D enhanced block.
Figure 5. Model performance comparison under different input patch sizes.
Figure 6. The quantitative results in terms of OA with different model depths.
Figure 7. Qualitative comparison results for the UP dataset. (a) Hyperspectral image. (b) False color image. (c) GT. (d) SVM. (e) CDCNN. (f) FDSSC. (g) DBDA. (h) SpectralFormer. (i) HSI-Mixer. (j) BS2T. (k) S2Former.
Figure 8. Comparison of qualitative results for the IN dataset. (a) Hyperspectral image. (b) False color image. (c) GT. (d) SVM. (e) CDCNN. (f) FDSSC. (g) DBDA. (h) SpectralFormer. (i) HSI-Mixer. (j) BS2T. (k) S2Former.
Figure 9. Comparison of qualitative results for the BS dataset. (a) Hyperspectral image. (b) False color image. (c) GT. (d) SVM. (e) CDCNN. (f) FDSSC. (g) DBDA. (h) SpectralFormer. (i) HSI-Mixer. (j) BS2T. (k) S2Former.
Figure 10. Comparison of qualitative results for the HU dataset. (a) Hyperspectral image. (b) False color image. (c) GT. (d) SVM. (e) CDCNN. (f) FDSSC. (g) DBDA. (h) SpectralFormer. (i) HSI-Mixer. (j) BS2T. (k) S2Former.
Table 1. Training and test data for each land cover category in the UP dataset.

Class | Category | Total Samples | Training | Test
1 | Asphalt | 6631 | 33 | 6598
2 | Meadows | 18,649 | 93 | 18,556
3 | Gravel | 2078 | 10 | 2068
4 | Trees | 3033 | 15 | 3018
5 | Metal Sheets | 1331 | 6 | 1325
6 | Bare Soil | 4978 | 25 | 4953
7 | Bitumen | 1316 | 6 | 1310
8 | Bricks | 3645 | 18 | 3627
9 | Shadows | 937 | 4 | 933
Total |  | 42,344 | 210 | 42,134
Table 2. Training and test data for each land cover category in the IN dataset.

Class | Category | Total Samples | Training | Test
1 | Alfalfa | 46 | 3 | 43
2 | Corn-notill | 1428 | 42 | 1386
3 | Corn-mintill | 830 | 24 | 806
4 | Corn | 237 | 7 | 230
5 | Grass-pasture | 483 | 14 | 469
6 | Grass-trees | 730 | 21 | 709
7 | Grass-pasture-mowed | 28 | 3 | 25
8 | Hay-windrowed | 478 | 14 | 464
9 | Oats | 20 | 3 | 17
10 | Soybean-notill | 972 | 29 | 943
11 | Soybean-mintill | 2455 | 73 | 2382
12 | Soybean-clean | 593 | 17 | 576
13 | Wheat | 205 | 6 | 199
14 | Woods | 1265 | 37 | 1228
15 | Buildings-Grass-Trees-Drives | 386 | 11 | 375
16 | Stone-Steel-Towers | 93 | 3 | 90
Total |  | 10,249 | 307 | 9942
Table 3. Training and test data for each land cover category in the BS dataset.

Class | Category | Total Samples | Training | Test
1 | Water | 270 | 3 | 267
2 | Hippo grass | 101 | 2 | 99
3 | Floodplain grasses 1 | 251 | 3 | 248
4 | Floodplain grasses 2 | 215 | 3 | 212
5 | Reeds 1 | 269 | 3 | 266
6 | Riparian | 269 | 3 | 266
7 | Firescar 2 | 259 | 3 | 256
8 | Island interior | 203 | 3 | 200
9 | Acacia woodlands | 314 | 3 | 311
10 | Acacia shrublands | 248 | 2 | 246
11 | Acacia grasslands | 305 | 3 | 302
12 | Short mopane | 181 | 2 | 179
13 | Mixed mopane | 268 | 3 | 265
14 | Exposed soils | 95 | 2 | 93
Total |  | 3248 | 42 | 3206
Table 4. Training and test data for each land cover category in the HU dataset.

Class | Category | Total Samples | Training | Test
1 | Healthy Grass | 1251 | 13 | 1238
2 | Stressed Grass | 1254 | 13 | 1241
3 | Synthetic Grass | 697 | 7 | 690
4 | Tree | 1244 | 12 | 1232
5 | Soil | 1252 | 13 | 1239
6 | Water | 325 | 3 | 322
7 | Residential | 1268 | 13 | 1255
8 | Commercial | 1244 | 12 | 1232
9 | Road | 1252 | 13 | 1239
10 | Highway | 1227 | 12 | 1215
11 | Railway | 1235 | 12 | 1223
12 | Parking Lot 1 | 1234 | 12 | 1222
13 | Parking Lot 2 | 469 | 5 | 464
14 | Tennis Court | 428 | 4 | 424
15 | Running Track | 660 | 7 | 653
Total |  | 15,011 | 151 | 14,860
Table 5. Break-down ablation study on the proposed components. OA, AA, and Kappa (%) are each reported on the HU, UP, and BS datasets, in that order.

 | Spatial Transformer | Spectral Transformer | OA (HU/UP/BS) | AA (HU/UP/BS) | Kappa (HU/UP/BS)
(a) | MSSA+MLP | - | 81.76/94.42/93.24 | 84.68/93.79/92.89 | 80.28/92.54/92.67
(b) | MSSA+LAFN | - | 84.60/96.94/95.51 | 86.10/97.00/95.41 | 83.34/95.92/95.13
(c) | MSSA+MLP+Spatial-3D | - | 83.53/96.40/95.10 | 86.41/96.47/95.12 | 82.18/95.21/94.69
(d) | MSSA+LAFN+Spatial-3D | - | 84.93/97.21/95.89 | 86.57/97.07/95.77 | 83.36/96.30/95.54
(e) | - | MCSA+MLP | 82.45/94.24/94.02 | 83.76/93.54/94.31 | 81.03/92.66/93.52
(f) | - | MCSA+LAFN | 84.54/96.93/96.33 | 85.00/97.11/95.75 | 83.30/95.91/96.03
(g) | - | MCSA+MLP+Spectral-3D | 84.25/96.39/95.19 | 84.26/96.54/94.99 | 82.99/95.19/94.79
(h) | - | MCSA+LAFN+Spectral-3D | 84.81/97.08/96.19 | 86.10/97.41/96.13 | 83.34/96.27/96.01
S2Former | MSSA+LAFN+Spatial-3D | MCSA+LAFN+Spectral-3D | 85.12/97.40/96.49 | 86.96/97.71/96.42 | 83.92/96.54/96.20
Table 6. Comparison of quantitative results on the UP dataset.

Class | SVM | CDCNN | FDSSC | DBDA | SpectralFormer | HSI-Mixer | BS2T | S2Former
1 | 83.87 | 80.60 | 87.22 | 94.15 | 85.97 | 92.77 | 93.93 | 94.77
2 | 86.33 | 93.10 | 99.50 | 99.01 | 96.26 | 97.10 | 98.64 | 98.16
3 | 67.75 | 62.45 | 72.42 | 95.06 | 92.43 | 93.70 | 97.90 | 98.02
4 | 96.89 | 98.76 | 97.11 | 97.76 | 97.71 | 98.66 | 98.36 | 99.06
5 | 94.28 | 99.45 | 99.01 | 99.34 | 99.23 | 98.35 | 98.61 | 100.00
6 | 84.55 | 86.04 | 97.83 | 98.11 | 94.48 | 97.32 | 98.55 | 98.99
7 | 66.46 | 73.84 | 69.78 | 99.68 | 89.36 | 96.56 | 99.20 | 98.80
8 | 70.70 | 72.71 | 82.76 | 86.61 | 77.95 | 87.03 | 84.39 | 91.55
9 | 99.89 | 98.32 | 99.55 | 98.47 | 97.90 | 98.95 | 98.83 | 100.00
OA (%) | 83.99 | 87.48 | 94.30 | 96.56 | 92.43 | 96.48 | 96.97 | 97.40
AA (%) | 83.41 | 85.03 | 89.46 | 96.47 | 92.37 | 95.60 | 96.69 | 97.71
Kappa (%) | 78.24 | 83.20 | 92.41 | 95.43 | 89.84 | 95.31 | 95.92 | 96.54
Table 7. Comparison of quantitative results on the IN dataset.

Class | SVM | CDCNN | FDSSC | DBDA | SpectralFormer | HSI-Mixer | BS2T | S2Former
1 | 24.20 | 90.39 | 89.55 | 99.76 | 69.23 | 99.67 | 100.00 | 100.00
2 | 55.98 | 74.08 | 95.02 | 95.10 | 83.63 | 89.43 | 93.50 | 95.48
3 | 64.45 | 60.55 | 93.05 | 94.53 | 84.21 | 80.86 | 93.94 | 89.28
4 | 43.19 | 86.67 | 96.75 | 95.23 | 96.33 | 93.76 | 95.78 | 98.70
5 | 84.59 | 88.08 | 97.63 | 95.23 | 96.62 | 95.43 | 98.62 | 98.34
6 | 82.11 | 86.31 | 97.34 | 97.84 | 88.36 | 92.94 | 97.93 | 99.70
7 | 58.75 | 87.08 | 80.78 | 73.47 | 63.98 | 86.82 | 76.89 | 100.00
8 | 87.87 | 86.51 | 98.41 | 100.00 | 88.12 | 89.51 | 99.98 | 100.00
9 | 46.58 | 71.92 | 58.33 | 96.54 | 74.37 | 91.58 | 94.43 | 100.00
10 | 65.10 | 83.29 | 91.48 | 89.57 | 83.27 | 89.01 | 88.00 | 94.82
11 | 63.11 | 72.24 | 93.08 | 94.23 | 80.52 | 95.21 | 96.10 | 97.21
12 | 49.67 | 69.71 | 90.21 | 91.85 | 84.57 | 82.56 | 90.94 | 99.42
13 | 88.59 | 96.55 | 99.58 | 98.92 | 92.47 | 98.76 | 98.39 | 100.00
14 | 89.89 | 90.37 | 97.85 | 97.89 | 91.97 | 94.14 | 96.10 | 97.90
15 | 61.63 | 75.65 | 93.40 | 95.57 | 89.87 | 89.43 | 94.84 | 91.22
16 | 99.23 | 91.30 | 98.80 | 96.95 | 89.87 | 95.69 | 96.69 | 78.76
OA (%) | 69.06 | 78.24 | 92.35 | 94.89 | 85.46 | 92.50 | 94.94 | 95.16
AA (%) | 66.56 | 81.92 | 91.95 | 94.80 | 84.79 | 91.55 | 94.51 | 96.30
Kappa (%) | 64.28 | 74.94 | 91.25 | 94.17 | 83.29 | 91.44 | 94.73 | 94.47
Table 8. Comparison of quantitative results on the BS dataset.

Class | SVM | CDCNN | FDSSC | DBDA | SpectralFormer | HSI-Mixer | BS2T | S2Former
1 | 90.38 | 89.29 | 97.21 | 98.56 | 89.76 | 96.02 | 93.92 | 98.74
2 | 34.58 | 74.72 | 87.97 | 93.38 | 56.31 | 83.03 | 82.76 | 86.48
3 | 86.71 | 68.45 | 99.54 | 99.27 | 83.01 | 100.00 | 100.00 | 100.00
4 | 56.73 | 57.03 | 84.71 | 89.87 | 83.99 | 87.93 | 94.60 | 97.99
5 | 87.27 | 81.65 | 94.07 | 94.28 | 90.21 | 90.82 | 92.38 | 81.72
6 | 56.63 | 59.85 | 82.08 | 92.01 | 70.31 | 90.41 | 94.90 | 97.80
7 | 89.09 | 89.39 | 98.54 | 97.84 | 99.53 | 99.39 | 100.00 | 100.00
8 | 76.91 | 69.52 | 96.99 | 99.20 | 96.77 | 99.45 | 97.04 | 98.00
9 | 75.78 | 74.52 | 88.85 | 96.09 | 96.49 | 90.55 | 96.94 | 98.02
10 | 87.69 | 75.26 | 91.42 | 91.35 | 93.33 | 93.07 | 74.08 | 94.42
11 | 82.25 | 95.84 | 97.57 | 98.61 | 83.06 | 97.67 | 100.00 | 100.00
12 | 60.38 | 89.73 | 99.72 | 99.89 | 86.26 | 97.78 | 99.44 | 96.72
13 | 91.06 | 71.00 | 93.01 | 96.86 | 89.88 | 99.39 | 100.00 | 100.00
14 | 79.91 | 91.09 | 98.18 | 97.67 | 99.55 | 99.29 | 100.00 | 100.00
OA (%) | 73.15 | 74.89 | 91.94 | 95.47 | 83.42 | 94.11 | 94.34 | 96.49
AA (%) | 75.38 | 77.67 | 93.56 | 96.06 | 87.03 | 94.63 | 94.71 | 96.42
Kappa (%) | 71.03 | 72.85 | 91.27 | 95.09 | 82.08 | 93.62 | 93.87 | 96.20
Table 9. Comparison of quantitative results on the HU dataset.

Class | SVM | CDCNN | FDSSC | DBDA | SpectralFormer | HSI-Mixer | BS2T | S2Former
1 | 69.46 | 92.73 | 79.90 | 83.77 | 77.60 | 83.97 | 94.19 | 85.92
2 | 93.52 | 93.20 | 95.09 | 93.75 | 88.60 | 93.22 | 94.13 | 95.38
3 | 61.67 | 96.62 | 99.15 | 96.66 | 96.00 | 100.00 | 100.00 | 100.00
4 | 91.26 | 98.57 | 94.90 | 93.85 | 92.59 | 97.33 | 85.64 | 98.82
5 | 80.70 | 93.72 | 93.81 | 94.75 | 90.75 | 94.33 | 88.44 | 96.08
6 | 94.08 | 100.00 | 90.00 | 92.26 | 97.27 | 97.75 | 100.00 | 100.00
7 | 65.85 | 73.68 | 72.55 | 83.14 | 78.23 | 79.01 | 80.58 | 84.16
8 | 57.54 | 66.31 | 72.30 | 92.44 | 85.19 | 93.08 | 88.99 | 85.32
9 | 75.37 | 65.18 | 73.16 | 75.81 | 69.60 | 71.34 | 70.05 | 76.64
10 | 66.04 | 69.31 | 74.23 | 72.73 | 64.94 | 64.88 | 68.89 | 74.44
11 | 61.10 | 61.46 | 69.28 | 80.09 | 69.27 | 71.29 | 63.11 | 80.60
12 | 67.74 | 63.82 | 76.41 | 71.64 | 63.52 | 72.64 | 65.79 | 71.20
13 | 97.02 | 26.30 | 49.08 | 89.70 | 73.16 | 90.51 | 65.98 | 74.82
14 | 83.87 | 93.19 | 98.23 | 90.51 | 95.32 | 95.10 | 97.22 | 92.92
15 | 65.75 | 98.86 | 96.80 | 89.48 | 92.78 | 95.58 | 95.74 | 86.82
OA (%) | 71.92 | 78.75 | 80.62 | 83.95 | 79.07 | 82.70 | 81.78 | 85.12
AA (%) | 75.40 | 79.53 | 82.33 | 86.69 | 82.32 | 86.67 | 83.92 | 86.96
Kappa (%) | 69.58 | 76.99 | 79.00 | 82.63 | 77.35 | 81.28 | 80.30 | 83.92
Table 10. Comparison of the complexity in terms of time and space consumption.

Metric | CDCNN | FDSSC | DBDA | SpectralFormer | HSI-Mixer | BS2T | S2Former
Training time (s) | 498.97 | 1459.14 | 1229.35 | 1307.76 | 1752.50 | 1549.89 | 1855.85
FLOPs (G) | 2.60 | 22.71 | 13.82 | 13.25 | 12.57 | 13.84 | 12.76
Params (M) | 2.91 | 1.23 | 0.38 | 0.51 | 0.25 | 0.38 | 0.54

