1. Introduction
Hyperspectral images (HSIs) are data cubes captured by hyperspectral sensors that simultaneously record 2-D spatial and 1-D spectral information about land cover substances [1]. What distinguishes HSIs from panchromatic and multispectral images is that each pixel records a distinctive spectral signature across hundreds of nearly continuous spectral bands [2,3,4]. These high-resolution spectral response curves reflect detailed characteristics of land cover substances [5]. Consequently, hyperspectral image (HSI) classification, defined as “assigning a certain category to each pixel” [6], has become a fundamental and crucial task in remote sensing applications. However, abundant spectral information can also be redundant because some spectral bands are highly correlated [7,8,9]. There are other hindrances to HSI classification as well: spectral variability [10,11] and the lack of labeled training samples, for example, impair HSI feature extraction and make classification more challenging. These adverse effects have heightened the need for advanced feature extraction networks.
In recent years, deep learning has emerged as the preferred approach to extracting informative features thanks to its capacity for feature representation. Typical deep learning networks, such as stacked autoencoders (SAEs) [12,13], recurrent neural networks (RNNs) [14,15,16], convolutional neural networks (CNNs), and transformers, have been widely used in HSI classification. Among them, CNN- and transformer-based networks, which excel at local perception and global interaction, respectively, have established their superiority in HSI classification.
In general, HSI classification methods fall into three types according to how features are extracted [17]: spectral-feature networks, spatial-feature networks, and spectral–spatial-feature networks. Accordingly, CNN-based networks can be intuitively divided into 1-D CNNs [18,19], 2-D CNNs [20], and 3-D CNNs [21]. Like SAEs and RNNs, 1-D CNNs exploit only spectral features, so spatial features are weakened [22,23]. Conversely, 2-D CNNs tend to gather only spatial information. Previous studies have indicated that spectral or spatial features alone may not achieve satisfactory performance [24]. Spectral features provide the most revealing insight into land cover substances, while spatial features add complementary information, so integrating the two achieves better classification performance than either alone. Therefore, 3-D CNNs were employed to extract features in the spectral and spatial dimensions jointly. For example, Zhong et al. [25] proposed a spectral–spatial residual network (SSRN) that adopts the 3-D CNN as its basic element to extract spectral–spatial features, achieving impressive performance. 3-D CNNs are only the most direct means of spectral–spatial feature extraction, and other approaches exist. Zhao et al. [26] developed two kinds of 1-D CNNs to extract spectral and spatial features and then fused these features. Zhang et al. [27] combined a 1-D CNN with a 2-D CNN to exploit spectral–spatial features efficiently. Roy et al. [28] proposed a hybrid spectral CNN (HybridSN) that combines a 3-D CNN with a 2-D CNN and thus reduces the computational load. Huang et al. [29] used a 3-D CNN and a pyramid squeeze-and-excitation attention module to extract spectral–spatial features jointly.
Based on the multi-head self-attention (MSA) mechanism [30], transformers have become the dominant paradigm in natural language processing (NLP) and have made significant progress in computer vision (CV) tasks as well. In 2020, the vision transformer (ViT) [31] pioneered the use of transformers for CV tasks, providing an efficient method for modeling long-range dependencies and establishing global interactions. Since then, many researchers have worked on adapting ViT to HSI classification. For the patch embedding layer, there are three perspectives of tokenization: spectral, spatial, and spectral–spatial. The spatial–spectral transformer (SST) [32] and SpectralFormer [33] treated HSIs as spectral sequential data for tokenization. The main difference is that the former utilized a VGG-like architecture to tokenize each band separately, while the latter designed a groupwise spectral embedding layer to tokenize overlapping bands. HSI-BERT [34], on the other hand, concentrated on modeling spatial dependencies among pixels from a spatial perspective. From a spectral–spatial perspective, Sun et al. [1] developed the spectral–spatial feature tokenization transformer (SSFTT), which extracts spectral–spatial features and then makes samples more separable using a Gaussian-weighted feature tokenizer. Many improvements have also been made to the transformer encoder layer to facilitate feature representation. For example, Liang et al. [35] developed a dual multi-head contextual self-attention (DMuCA) network that decouples spatial and spectral contextual attention into two subblocks, capturing rich contextual dependencies from both the spatial and spectral domains.
Despite the exciting progress made by the aforementioned methods, some shortcomings remain:
- (1)
CNNs are good at local perception and extracting low-level features. However, they treat all features equally without considering their different significance. Moreover, their ability to capture global contextual information and establish long-range dependencies is limited by their inherent structure [36].
- (2)
Transformers are good at global interaction and capturing salient features. However, they often have difficulty with local perception [37,38], which is critical to collecting refined information. Furthermore, transformers usually demand a considerable amount of training data [39], yet annotated HSI data are mostly scarce. Moreover, the internal spectral–spatial data structure can be damaged in the transformer architecture, which degrades the classification performance.
- (3)
Most of these CNN- and transformer-based networks follow a patch-wise classification framework; that is, each pixel and its adjacent pixels form a patch that is labeled with the category of the center pixel [40,41]. This framework is grounded on the spatial homogeneity assumption that the adjacent pixels share the same land cover category as their center pixel. However, the assumption is not always tenable because the spatial distribution within a cropped patch is often too complex to be represented by its center pixel alone.
To alleviate the above problems, we propose a U-shaped convolution-aided transformer (UCaT) that embeds group convolutions into a U-shaped transformer architecture to aid per-pixel identification over cropped HSI patches, making full use of the advantages of both CNNs and transformers. Hence, a classification map, rather than a single label, is generated for each patch. Accordingly, the spatial homogeneity assumption mentioned in the third problem serves as a guide rather than a hard constraint. In response to the limitations of CNNs and transformers, we introduce reasonable inductive biases of CNNs, such as locality, into the transformer. Specifically, by replacing the linear projection with a group convolutional projection, the UCaT remains dominated by the transformer, which focuses on salient features and captures global dependencies, while the convolutions provide local perception and lower the demand for training data. Based on this, three components are constructed using particular strategies. First, the spectral groupwise self-attention (spectral-GSA) component treats HSIs as sequential spectral data for extracting discriminative spectral features. Then, the spatial dual-scale convolution-aided self-attention (spatial-DCSA) encoder and the spatial convolution-aided cross-attention (spatial-CCA) decoder form a U-shaped architecture for building spatial attention, which effectively assembles local-global spatial information. Overall, the main contributions can be summarized as follows:
- (1)
A UCaT network, which incorporates group convolutions into a novel transformer architecture, is proposed. The group convolution extracts detailed features locally, and then the MSA recalibrates the obtained features with a global field of vision in consistent groups. This combination takes full account of the characteristics of HSI data, emphasizing informative features while maintaining the inherent spectral–spatial data structure.
- (2)
The spectral-GSA builds spectral attention and provides a new way of dimensionality reduction. It divides the spectral bands into small groups and builds spectral attention within each group, which allows it to capture subtle spectral discrepancies. In addition, a convolutional attention weight adjustment is constructed, which efficiently reduces the spectral dimension.
- (3)
The spatial-DCSA encoder and the spatial-CCA decoder form a U-shaped architecture to assemble local-global spatial information. A dual-scale strategy is employed to exploit information at different scales, and a cross-attention strategy is adopted to complement high-level information with low-level information, which contributes to spatial feature representation.
- (4)
The UCaT achieves better classification results and better interpretability. Extensive experiments demonstrate that the UCaT outperforms the CNN- and transformer-based state-of-the-art networks. A visual explanation shows that the UCaT can not only distinguish homogeneous areas to eliminate semantic ambiguity but also capture pixel-level spatial dependencies.
The remaining sections of this article are organized as follows. Section 2 revisits the related works, i.e., transformer-based networks and segmentation networks. Section 3 gives a brief introduction to the proposed network. Section 4 presents experimental details and classification results, and Section 5 visually explains the proposed network. Finally, Section 6 concludes this article.
3. Methodology
In this section, we will first give a brief introduction to the overall structure of the proposed UCaT and then describe its individual components.
3.1. Overview
The overall structure of the proposed UCaT is depicted in Figure 1. The classification flowchart follows the work in [48], where a few modifications were made to the traditional patch-wise classification flowchart so that it can output classification maps. On this basis, we propose the UCaT network, which makes dense identifications over cropped HSI patches. Thus, the spatial homogeneity assumption provides a soft spatial prior that helps to avoid salt-and-pepper noise without forcing the labels of a whole patch to be the label of its center pixel.
Let $\mathcal{H} \in \mathbb{R}^{H \times W \times C}$ represent the original HSI data, where $H$, $W$, and $C$ denote the spatial height, the width, and the number of bands, respectively. Let $\mathcal{Y} \in \{0, 1, \ldots, N\}^{H \times W}$ indicate the ground truth of $\mathcal{H}$, where $N$ is the number of land cover categories (note that all the unlabeled pixels are subsumed into an additional category, i.e., 0). After removing all the unlabeled pixels, the remaining pixels are randomly divided into training pixels and testing pixels. For each training pixel $p_i$, the patch $X_i \in \mathbb{R}^{h \times w \times C}$ centered on $p_i$ is cropped from $\mathcal{H}$ to set up the training set, where $h \times w$ indicates the cropped window size. The ground truth map $Y_i \in \{0, 1, \ldots, N\}^{h \times w}$ is cropped in the same way (note that here the testing pixels are also subsumed into the additional category, i.e., 0, which serves as the ignore index and does not contribute to backpropagation). To sum up, the training set can be expressed as $\{(X_i, Y_i)\}_{i=1}^{m}$, where $m$ represents the number of training pixels. During the test phase, dense predictions are made by sliding the window across $\mathcal{H}$.
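The construction of the training labels described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `split_labeled_pixels` and `crop_label_patch` are hypothetical, positions that fall outside the image are filled with the ignore index (the text does not say how borders are handled), and for even window sizes the "center" is taken to be offset by `size // 2` from the top-left corner.

```python
import random

IGNORE = 0  # label 0 marks unlabeled pixels and doubles as the ignore index


def split_labeled_pixels(gt, train_ratio, seed):
    """Randomly split the labeled pixel coordinates into training and testing lists."""
    labeled = [(r, c) for r, row in enumerate(gt)
               for c, label in enumerate(row) if label != IGNORE]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    n_train = int(len(labeled) * train_ratio)  # truncation is an assumption
    return labeled[:n_train], labeled[n_train:]


def crop_label_patch(gt, center, size, train_pixels):
    """Crop a size x size label map around `center`.

    Only training pixels keep their label. Testing pixels, unlabeled pixels,
    and positions outside the image (assumed border handling) become IGNORE.
    """
    train_set = set(train_pixels)
    top = center[0] - size // 2
    left = center[1] - size // 2
    patch = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            inside = 0 <= r < len(gt) and 0 <= c < len(gt[0])
            keep = inside and (r, c) in train_set
            row.append(gt[r][c] if keep else IGNORE)
        patch.append(row)
    return patch


# Toy 4 x 4 ground truth map with three classes and three unlabeled pixels.
toy_gt = [
    [1, 1, 0, 2],
    [1, 1, 2, 2],
    [0, 3, 3, 2],
    [3, 3, 3, 0],
]
toy_train, toy_test = split_labeled_pixels(toy_gt, train_ratio=0.5, seed=0)
toy_patch = crop_label_patch(toy_gt, toy_train[0], size=3, train_pixels=toy_train)
```

Because every training pixel inside the window keeps its label, the same training pixel can appear in several patches, while testing pixels never contribute to the loss.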
The UCaT mainly comprises three components: the spectral-GSA component, the spatial-DCSA encoder, and the spatial-CCA decoder. The first is a shallow spectral feature extractor that extracts discriminative spectral features and suppresses redundant features, transforming into , where c is the new channel dimension (set to 64). Then, the last two components form a U-shaped encoder-decoder architecture that assembles local-global spatial information. Both the encoder and decoder contain five blocks; each is a three-tier structure with a skip connection, except that the last block of the decoder is a transposed convolution with an upsampling stride of 2. In each block, the first and third layers are both 1 × 1 convolutional layers for integrating information. The middle layer undertakes the core work of extracting informative spatial features, in which the convolution-aided self-attention with downsampling or the convolution-aided cross-attention with upsampling is performed. In the encoder, the downsampling strides of the five blocks are , and the channel dimension remains c unchanged, so the output resolutions are: . The upsampling strides of the first four blocks in the decoder are set to , thus the output resolutions can be restored as: ; the fifth block is a transposed convolutional block and outputs .
3.2. Spectral Groupwise Self-Attention Component
The high-dimensional spectral bands provide a revealing insight into the physical properties of land cover substances; however, they suffer from data redundancy. Inspired by the channel attention [49], we use the transposed version of MSA for building spectral attention. Then, we add a convolutional attention weight adjustment operation and propose the spectral-GSA, which extracts discriminative spectral features and reduces the channel dimension.
The spectral-GSA is a one-block spectral feature extractor with a skip connection. It is designed on a fundamental principle: the subtle spectral discrepancies and the internal spectral–spatial structure should be retained as far as possible. As seen in Figure 2, we divide the spectral bands into groups and then extract subtle spectral features per group; the yellow series and blue series represent two different groups for illustration. For each group, the spectral attention is built based on the correlations between three neighboring bands. In addition, a novel way of dimensionality reduction is designed by imposing an asymmetric depthwise convolution on the attention weight.
Formally, the max pooling and average pooling operations are adopted to map X into Q and K, respectively, and V is directly duplicated from X:
$$Q = \mathrm{MaxPool}(X), \quad K = \mathrm{AvgPool}(X), \quad V = X$$
Then, the obtained Q, K, and V matrices are divided into small groups and flattened along the spatial dimensions, and the last two dimensions are then transposed.
After the reshape operation, the spectral attention weight A can be calculated using the scaled dot-product: where . It can be seen that the spectral attention weight collects the correlations between channels in the same group. With the aim of dimensionality reduction, a convolutional attention weight adjustment with kernel size (1, 3) and stride (1, 3) is attached, mapping the A into . After that, the output can be obtained by allocating the corresponding attention weight to : where the DWConv(·) denotes the depthwise convolution.
Finally, a 1 × 1 convolutional layer is applied for channel mixing, and c is the output dimension. Once the output of the spectral-GSA block is obtained, a skip connection is carried out to mitigate the vanishing-gradient problem.
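To make the idea of groupwise, channel-wise ("transposed") attention concrete, the following plain-Python sketch treats each band as a token and its flattened pixels as the feature vector, and computes scaled dot-product attention independently inside each group of neighboring bands. It is a simplified illustration and not the authors' implementation: the pooling that produces Q and K, the depthwise convolutional adjustment of the attention weight, and the final 1 × 1 convolution are all omitted, so Q = K = V here. The function names and the choice of scaling by the square root of the number of pixels are assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(q, k, v):
    """Scaled dot-product attention where tokens are bands and features are pixels.

    q, k, v: lists of n bands, each band a flat list of d pixel values.
    Returns the n x n attention weights and the n x d re-weighted bands.
    """
    d = len(q[0])
    scale = math.sqrt(d)  # assumed scaling factor
    weights = [
        softmax([sum(a * b for a, b in zip(q_band, k_band)) / scale for k_band in k])
        for q_band in q
    ]
    out = [
        [sum(w * band[p] for w, band in zip(row, v)) for p in range(d)]
        for row in weights
    ]
    return weights, out


def groupwise_spectral_attention(x, group_size):
    """Apply channel attention separately within each group of neighboring bands."""
    assert len(x) % group_size == 0, "band count must be divisible by group size"
    all_weights, out = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        weights, attended = channel_attention(group, group, group)  # Q = K = V (simplified)
        all_weights.append(weights)
        out.extend(attended)
    return all_weights, out


# Six bands of a 2 x 2 patch (flattened to 4 pixels), attended in groups of three.
bands = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.1, 0.2, 0.3],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
]
group_weights, attended_bands = groupwise_spectral_attention(bands, group_size=3)
```

Each group yields its own small attention matrix, so correlations are only ever computed between bands of the same group, which is the property the text emphasizes.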
3.3. Spatial Dual-Scale Convolution-Aided Self-Attention Encoder
The spatial-DCSA encoder assembles spatial features hierarchically using a stack of five blocks, and each block contains three layers with a skip connection. As shown in Figure 3a, the first and the third layers are both 1 × 1 convolutional layers used for channel mixing, rather than for dimensionality reduction or expansion as in the residual [50] or the inverted residual [51] module. A batch normalization (BN) and a rectified linear unit (ReLU) are executed after each layer. The middle layer is a DCSA layer that extracts informative spatial features while maintaining the inherent spectral–spatial data structure. The DCSA layer is shown in Figure 3b. To keep the spectral–spatial data structure, the spatial feature extraction is conducted in groups. Since group convolution naturally performs groupwise operations, we adopt it directly. By aligning the groups in the group convolution with the heads in MSA, convolution-aided self-attention can be executed.
First, we use group convolutions with different kernel sizes to transform the input into Q, K, and V; the number of groups is g, and the stride is s. As illustrated in Figure 4, when s is 2, the kernel size is also set to 2. When s is 1, the kernel size is set to 1 for obtaining K and V and to 3 for obtaining Q. This dual-scale design helps to fully explore information at different scales without introducing too many parameters. In addition, Q has a larger receptive field, which better guides the allocation of attention weights. Here, GConv(·) denotes the group convolution.
Then, a reshape operation is carried out to adjust the data structure for follow-up calculations: After that, self-attention can be used to calculate spatial attention. Notably, since convolutions implicitly encode positional information [43], positional encoding is not used in our network: where . Finally, the data shape is restored: .
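The kernel and stride rules of the DCSA layer determine how many spatial tokens each attention head sees. The bookkeeping can be sketched as follows, using the standard convolution output-size formula. The function names are hypothetical, and the padding of 1 for the 3 × 3 query convolution is an assumption made so that Q keeps the same resolution as K and V when the stride is 1; the text does not state the padding.

```python
def conv_out(size, kernel, stride, padding=0):
    """Standard convolution output length: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def dcsa_shapes(height, width, channels, groups, stride):
    """Kernel choices and per-head token layout for one DCSA layer (sketch)."""
    if stride == 2:
        q_kernel = kv_kernel = 2   # kernel size equals the stride when downsampling
        q_padding = 0
    else:
        q_kernel, kv_kernel = 3, 1  # dual-scale: 3 for Q, 1 for K and V
        q_padding = 1               # assumed, keeps Q at the K/V resolution
    q_h = conv_out(height, q_kernel, stride, q_padding)
    q_w = conv_out(width, q_kernel, stride, q_padding)
    kv_h = conv_out(height, kv_kernel, stride)
    kv_w = conv_out(width, kv_kernel, stride)
    return {
        "q_kernel": q_kernel,
        "kv_kernel": kv_kernel,
        "q_tokens": q_h * q_w,      # sequence length of the queries
        "kv_tokens": kv_h * kv_w,   # sequence length of the keys and values
        "heads": groups,            # one attention head per convolution group
        "head_dim": channels // groups,
    }


# A 24 x 24 patch with 64 channels and 8 groups.
downsampling_layer = dcsa_shapes(24, 24, 64, 8, stride=2)
same_resolution_layer = dcsa_shapes(24, 24, 64, 8, stride=1)
```

With these settings, a stride-2 layer attends over 144 tokens per head and a stride-1 layer over 576, each token having 8 channels, since every head owns exactly the channels of one convolution group.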
3.4. Spatial Convolution-Aided Cross-Attention Decoder
The spatial-CCA decoder restores the resolutions of feature maps progressively. For designing a decoder, previous studies have demonstrated the benefits of attaching the low-level but high-resolution features obtained earlier by the encoder to the high-level features in the decoder. Apart from the feature concatenation method proposed in UNet, there are also other context fusion methods. For example, the UNet transformer [52] used cross-attention in an encoder-decoder skip connection manner, achieving good performance in medical image segmentation. Inspired by it, we adopt a skip-connection-level cross-attention operation to effectively transfer refined information from the encoder to the decoder.
Symmetrical to the encoder, the decoder is constructed from four CCA blocks and a transposed convolutional block. The difference lies in the middle layer: the encoder utilizes the self-attention mechanism, while the decoder adopts the cross-attention mechanism. The CCA layer is shown in Figure 3c; one input comes from the existing block, and another input was obtained earlier from the previous encoder block. The transposed convolution is employed to upsample the former into Q with a stride s, and two group convolutions with kernel size 1 are used to transform the latter into K and V. Since Q contains high-level information while K and V provide details such as edge and texture, an integration could facilitate feature expression. Thus, the cross-attention is carried out to recalibrate the obtained features, which aggregates low-level but high-resolution features with high-level but low-resolution features. This process can be formulated as follows: where and DConv(·) denotes the transposed convolution.
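The core of the cross-attention step can be illustrated with a single-head sketch in plain Python. Queries stand in for the upsampled decoder features and keys/values for the encoder features from the skip connection. The transposed convolution, the group convolutions, and the multi-head grouping are deliberately left out, and the name `cross_attention` is hypothetical, so this shows only the attention arithmetic rather than the CCA layer itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n_q tokens of length d (stand-in for decoder features)
    keys, values: n_kv tokens of length d (stand-in for encoder features)
    Returns n_q tokens of length d, each a weighted average of `values`.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)])
    return out


# Two decoder tokens attend over three encoder tokens.
decoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
encoder_keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
encoder_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
fused = cross_attention(decoder_tokens, encoder_keys, encoder_values)
```

Every output token is a convex combination of the encoder values, weighted by how well the decoder query matches each encoder key, which is how high-resolution detail is selectively passed to the decoder.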
4. Experiment
In this section, the proposed UCaT is quantitatively evaluated using three publicly available HSI datasets. These datasets and the implementation details of experiments are briefly introduced at first. Then, the important parameters, such as the input patch size, the network width, and the number of groups, are selected experimentally. After that, extensive experiments are conducted for comparison with several state-of-the-art classification algorithms, evaluating the classification performance of the UCaT. Finally, ablation experiments are carried out to further confirm the effectiveness of the main components.
4.1. Data Description
Three publicly available HSI datasets, i.e., Indian Pines (IP), Pavia University (PU), and Salinas Valley (SV), are used to evaluate the effectiveness of the proposed UCaT. The false-color images and the ground-truth maps are shown in Figure 5.
- (1)
IP dataset: The first dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines field in Northwestern Indiana. After discarding some spectral bands that are affected by the water absorption, the remaining 200 bands in a spatial size of 145 × 145 pixels are used for experiments. The dataset has 10,249 labeled pixels that can be partitioned into 16 land cover types.
- (2)
PU dataset: The second dataset was gathered by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor at Pavia University, Northern Italy. The image consists of 610 × 340 pixels; among them, 42,776 pixels were labeled. The dataset has 9 types of land cover classes and 103 spectral bands.
- (3)
SV dataset: The third dataset was also collected by the AVIRIS sensor over Salinas Valley, California. After removing the water absorption bands, the remaining 204 bands with a spatial size of 512 × 217 pixels are used for experiments. The dataset has 16 land cover classes, and a total of 54,129 pixels were labeled.
4.2. Experimental Setup
- (1)
Metrics: Three evaluation metrics, i.e., overall accuracy (OA), average accuracy (AA), and kappa coefficient (K), are used to measure the classification performance quantitatively. To ensure the reliability of the experiment results, all subsequent experiments are repeated ten times, and each is conducted on randomly selected training and testing sets.
- (2)
Data partition: For the IP, PU, and SV datasets, 10% (1024 pixels), 5% (2138 pixels), and 3% (1623 pixels) of the labeled samples, respectively, are randomly selected for training. The ten repeated experiments use the random seeds 0 to 9 to reduce random error.
- (3)
Implementation details: All experiments are implemented in Python 3.7 with the PyTorch platform and run on a desktop PC with an Intel Core i7-9700 CPU and an NVIDIA GeForce RTX 3080 graphics card. Before training, the original HSI datasets are normalized to the range [0, 1] using min-max scaling. Then, the cross-entropy loss and the AdamW optimizer (with the weight decay set to 0.03) are used for training. Specifically, we train the network for 105 epochs with a mini-batch size of 128. The learning rate is initialized to 0.03 and then adjusted by the CosineAnnealingWarmRestarts learning rate scheduler, where the number of iterations until the first restart, T_0, is set to 5, and the factor by which the period grows after each restart, T_mult, is set to 4.
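Two pieces of arithmetic in the setup above can be checked with a short standard-library sketch. The first function reproduces the stated training-set sizes by truncating the percentage of labeled pixels. The second evaluates the cosine annealing with warm restarts formula for T_0 = 5 and T_mult = 4. The sketch assumes that the scheduler is stepped once per epoch and that the minimum learning rate is 0 (the PyTorch default), neither of which is stated in the text; the function names are hypothetical.

```python
import math


def train_count(num_labeled, percent):
    """Number of training pixels when the percentage is truncated to an integer."""
    return num_labeled * percent // 100


def restart_epochs(total_epochs, t_0=5, t_mult=4):
    """Epochs at which a new cosine cycle begins."""
    starts, start, period = [], 0, t_0
    while start < total_epochs:
        starts.append(start)
        start += period
        period *= t_mult
    return starts


def cosine_warm_restart_lr(epoch, base_lr=0.03, t_0=5, t_mult=4, eta_min=0.0):
    """Learning rate at an integer epoch (eta_min = 0 is an assumption)."""
    period, t_cur = t_0, epoch
    while t_cur >= period:      # walk forward to the cycle containing this epoch
        t_cur -= period
        period *= t_mult
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / period)) / 2


sizes = [train_count(10249, 10), train_count(42776, 5), train_count(54129, 3)]
cycle_starts = restart_epochs(105)
schedule = [cosine_warm_restart_lr(e) for e in range(105)]
```

Under these assumptions the cycles last 5, 20, and 80 epochs, which sum to exactly 105, so training would stop at the end of the third cycle, when the learning rate has decayed close to its minimum.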
4.3. Parameter Analysis
The exploration of the main parameters, such as the input patch size, the network width, and the number of groups, is indispensable since they have considerable influences on the classification performance. The proper parameters will be selected through experiments.
4.3.1. Influence of Patch Size
To a certain extent, a larger input patch size provides more neighborhood information for classification. However, as the patch size continues to grow, the computational complexity and the number of parameters increase significantly, yet the gain in classification performance decelerates gradually. We therefore compare the classification performance with different input patch sizes in the range of {12 × 12, 16 × 16, 20 × 20, 24 × 24, 28 × 28, 32 × 32} and report the variation trends in Figure 6. It can be found that the OA curves improve as the patch size expands, especially for the IP and SV datasets. However, if the patch size exceeds 20 × 20, the increase in OA is not statistically significant. Accordingly, follow-up experiments set the patch sizes for the IP, PU, and SV datasets to 24 × 24, 20 × 20, and 24 × 24, respectively, as a compromise between classification performance and computational complexity.
4.3.2. Influence of Network Width and the Number of Groups
We report the classification results under different network widths, {32, 64, 128, [32, 64, 128, 256, 512]}, with different numbers of groups, {1, 2, 4, 8, 16, 32}; the last width setting denotes a width that increases from block to block. As seen in Figure 7, the accuracy is generally better when the width is 64. As the number of groups increases, the accuracy first increases and then declines slightly. We set the network width to 64 and the number of groups to 8 for follow-up experiments.
4.4. Classification Results
Eight state-of-the-art networks, including CNN-based networks, transformer-based networks, and segmentation networks, are selected for comparative experiments to analyze the classification performance of the proposed network in HSI classification. They are the spectral–spatial residual network (SSRN) [25], hybrid spectral CNN (HybridSN) [28], double-branch dual-attention mechanism network (DBDA) [53], vision transformer (ViT) [31], SpectralFormer (SF) [33], spectral–spatial feature tokenization transformer (SSFTT) [1], UNet [47], and UNet transformer (UT) [52]. The hyperparameters of the SSRN, HybridSN, DBDA, SF, and SSFTT are set based on the recommendations in their respective literature. As for the ViT, UNet, and UT, the hyperparameters are consistent with our proposed UCaT for a fair comparison.
Detailed results of these networks on the IP, PU, and SV datasets are presented in Table 1, Table 2 and Table 3, where the best results are in bold. The classification results of CNN-based networks are generally better than those of transformer-based networks, except for the SSFTT. This suggests that CNNs capture more detailed information than pure transformers for refined classification. Because the SSFTT first uses convolutional layers to extract low-level spectral and spatial features and then applies a transformer encoder module to capture global contextual dependencies, it produces better results than pure CNNs or transformers. The traditional patch-wise classification networks are also generally worse than the segmentation networks, possibly because forcing the labels of a whole patch to be the label of its center pixel is not robust. In addition, although the total number of labeled training pixels is the same, the segmentation networks can reuse the training pixels in different patches, which enriches the feature representation. Most notably, among all the networks, the proposed UCaT achieves the best classification results, with 99.48% OA, 99.09% AA, and 99.41% K on the IP dataset; 99.92% OA, 99.86% AA, and 99.90% K on the PU dataset; and 99.94% OA, 99.90% AA, and 99.93% K on the SV dataset. The high classification accuracy supports the claim that the proposed network exploits the advantages of both CNNs and transformers and improves on the classification performance of the segmentation networks.
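For readers less familiar with the three metrics reported in the tables, the sketch below computes OA, AA, and the kappa coefficient from a confusion matrix using their standard definitions. The function name `classification_metrics` and the toy matrix are illustrative only and are not taken from the experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    # Overall accuracy: fraction of all pixels that are classified correctly.
    oa = correct / total

    # Average accuracy: mean of the per-class recalls (classes with no samples skipped).
    recalls = [confusion[i][i] / sum(confusion[i])
               for i in range(n) if sum(confusion[i]) > 0]
    aa = sum(recalls) / len(recalls)

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / (total * total)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy two-class example with a strong class imbalance (90 vs. 10 pixels).
toy_confusion = [
    [90, 0],
    [5, 5],
]
toy_oa, toy_aa, toy_kappa = classification_metrics(toy_confusion)
```

In this toy case the OA is 0.95 while the AA is only 0.75, which illustrates why AA and kappa are reported alongside OA for unbalanced datasets such as IP.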
Specifically, on the IP dataset (see Table 1), the classification results of the proposed UCaT are significantly higher than those of the other networks. It achieves the highest accuracy in 14 of the 16 land cover categories. Moreover, the classification accuracy of every land cover category exceeds 96%, which demonstrates that the proposed network can achieve promising results even under the extremely unbalanced data distribution of the IP dataset.
The improvement on the PU dataset is also obvious (see Table 2). Of the 9 land cover categories, 7 achieve the highest accuracy. All the land cover classes achieve a classification accuracy of over 99.2%. In particular, classes 2, 5, 6, and 7, which are meadows, painted metal sheets, bare soil, and bitumen, respectively, achieve 100% accuracy.
On the SV dataset (see Table 3), all the classes achieve a classification accuracy of over 99.4%. Classes 1, 2, 6, 7, 8, 9, and 12, which are Brocoli_green_weeds_1, Brocoli_green_weeds_2, Stubble, Celery, Grapes_untrained, Soil_vinyard_develop, and Lettuce_romaine_5wk, respectively, attain an accuracy of 100%.
The classification maps of the comparison networks and the proposed network on the three datasets are shown in Figure 8, Figure 9 and Figure 10. On the whole, the proposed UCaT obtains better classification maps than the others, showing its superiority in HSI classification. First, the classification maps of the proposed network are the closest to the ground truth maps for all three datasets. Second, there is no apparent scattered noise in the classification maps of the proposed network for any of the three datasets, which demonstrates that the proposed network effectively eliminates semantic ambiguity by capturing pixel-level spatial dependencies. Third, the edges in the classification maps of the proposed network are relatively smooth, especially on the PU dataset, which suggests that the proposed UCaT has strong spatial feature representation ability.
4.5. Ablation Study
The proposed UCaT contains three key components: the spectral-GSA component, the spatial-DCSA encoder, and the spatial-CCA decoder. This part investigates the necessity of the three components experimentally on the IP dataset using 10% training data. Then, the effectiveness of the particular strategies in the spatial-DCSA encoder and the spatial-CCA decoder is also evaluated.
The classification results in the absence of each of the three components are evaluated in terms of OA, AA, and K. If the spatial-DCSA or the spatial-CCA is not used, a 1 × 1 convolution (or, when the stride is 2, a 2 × 2 convolution for the encoder and a 2 × 2 transposed convolution for the decoder) is used as its substitute. If the spectral-GSA component is removed, a 1 × 1 convolution is used for dimensionality reduction. As listed in Table 4, combining all three components achieves the best classification performance. The classification results decrease in the absence of either the spatial-DCSA or the spatial-CCA, indicating that both help in spatial feature learning; if both are absent, the classification results drop sharply. A possible reason is that the network can capture informative spatial features through either component, so when one is absent, the other compensates. However, the network achieves the highest classification results only when the two components work together. Moreover, the classification results decline in the absence of the spectral-GSA component, which, to some extent, confirms that the spectral-GSA captures more discriminative spectral features. To make the difference easier to observe, an additional experiment was conducted with the proportion of training data reduced to 5%. The complete UCaT achieves 97.99% OA, 90.92% AA, and 97.71% K, whereas it obtains 97.59% OA, 90.61% AA, and 97.25% K in the absence of the spectral-GSA. The decline in OA is 0.40 percentage points, further indicating that the spectral-GSA is helpful in feature learning.
Moreover, to verify the effectiveness of the dual-scale strategy in the spatial-DCSA encoder, we conduct experiments with a series of kernel sizes for obtaining Q, K, and V when the stride is 1. From Table 5, it can be seen that when the kernel size for obtaining Q, K, and V is 1, the classification results are the worst among the configurations. Changing the kernel size for obtaining Q to 3 improves the OA, AA, and K by 0.12%, 0.21%, and 0.14%, respectively, while the number of parameters increases by 12k. However, when the kernel sizes for obtaining K and V are also changed to 3, the classification results show no further improvement, and the number of parameters continues to increase. In summary, the dual-scale strategy achieves better classification results without introducing too many parameters, demonstrating its effectiveness.
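The parameter trade-off behind the dual-scale strategy can be made tangible by counting the weights of a group convolution. The sketch below uses the standard formula for a bias-free 2-D group convolution and the width (64) and number of groups (8) selected in the parameter analysis. Treating the projections as bias-free and counting a single layer are assumptions, and the helper name is hypothetical, so the numbers are per-layer illustrations rather than a reproduction of Table 5.

```python
def group_conv_weights(in_channels, out_channels, kernel, groups):
    """Weight count of a bias-free 2-D group convolution with a square kernel."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    return out_channels * (in_channels // groups) * kernel * kernel


width, groups = 64, 8

weights_k1 = group_conv_weights(width, width, kernel=1, groups=groups)
weights_k3 = group_conv_weights(width, width, kernel=3, groups=groups)

# Extra weights in one layer when only the Q projection uses a 3 x 3 kernel.
extra_dual_scale = weights_k3 - weights_k1

# Extra weights in one layer when Q, K, and V all use 3 x 3 kernels.
extra_all_three = 3 * (weights_k3 - weights_k1)
```

Under these assumptions, enlarging only the Q kernel adds 4096 weights per layer, whereas enlarging all three projections adds 12,288, which is consistent with the observation that the parameter count keeps growing while the accuracy does not.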
Furthermore, Table 6 lists the classification results under the cross-attention, concatenation, and addition approaches in the spatial-CCA decoder. The network performs best when the cross-attention strategy is used for information fusion between the encoder and decoder, confirming the effectiveness of the cross-attention strategy in the spatial-CCA decoder.
5. Discussion
To make the UCaT more explainable, heatmaps [54], which indicate the salient regions for classification, are generated with respect to different land cover classes on the IP dataset. The ground truth map of the cropped HSI patch is shown in Figure 11a. We chose two classes, i.e., Corn-notill and Stone-Steel-Towers, for illustration. Figure 11 shows the heatmaps obtained when the gradients of the loss are calculated with either all the pixels of a class or a single pixel of that class taken as output. When taking all the pixels of a class as output, we can observe that the heatmaps resemble the land cover locations of this class, which confirms that the UCaT has the capacity for spatial information perception.
It can also be found that when a single pixel is taken as output for backpropagation, the heatmaps are almost the same as those obtained when all pixels of the class are taken as output. This indicates that the proposed UCaT can capture pixel-level spatial dependencies.
In sum, these heatmaps indicate the advantages of the cooperation among convolution, MSA, and the U-shaped segmentation architecture. With regard to a whole patch, the network differentiates several homogeneous regions by their spatial relations and considers them jointly, which helps to eliminate the semantic ambiguity caused by the inadequacy of training samples. With respect to a single pixel, the network captures pixel-level spatial dependencies and thus finds similar pixels to assist the classification of this pixel. We consider this a desirable property for HSI classification.