Prompt-Gated Transformer with Spatial–Spectral Enhancement for Hyperspectral Image Classification

Han, Ruimin; Cheng, Shuli; Li, Shuoshuo; Liu, Tingjie

doi:10.3390/rs17152705

School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(15), 2705; https://doi.org/10.3390/rs17152705

Submission received: 11 June 2025 / Revised: 25 July 2025 / Accepted: 1 August 2025 / Published: 4 August 2025

(This article belongs to the Special Issue Multi-Task Remote Sensing Image Analysis: Classification, Segmentation, and Change Detection)

Abstract

Hyperspectral image (HSI) classification is an important task in remote sensing with far-reaching practical significance. Most Convolutional Neural Networks (CNNs) focus only on local spatial features and ignore global spectral dependencies, making it difficult to fully extract the spectral information in HSI. In contrast, Vision Transformers (ViTs) are widely used in HSI classification due to their superior feature extraction capabilities. However, existing Transformer models struggle to achieve spectral–spatial feature fusion while maintaining local structural consistency, which makes it difficult to balance global modeling capability and local representation. To this end, we propose a Prompt-Gated Transformer with Spatial–Spectral Enhancement (PGTSEFormer) network, which includes a Channel Hybrid Positional Attention (CHPA) module and a Prompt Cross-Former (PCFormer). The CHPA module adopts a dual-branch architecture to concurrently capture spectral and spatial positional attention, thereby enhancing the model’s discriminative capacity for complex feature categories through adaptive weight fusion. PCFormer introduces a Prompt-Gated mechanism and a grouping strategy to effectively model cross-regional contextual information while maintaining local consistency, which significantly enhances long-range dependency modeling. Experiments on five HSI datasets yielded overall accuracies of 97.91%, 98.74%, 99.48%, 99.18%, and 92.57% on the Indian Pines, Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets, respectively. These results demonstrate the effectiveness of the proposed approach.

Keywords:

hyperspectral image classification; convolutional neural networks; vision transformer; prompt-gated

1. Introduction

Hyperspectral imaging (HSI) is an advanced remote sensing technique capable of capturing detailed spectral information of surface objects in hundreds of contiguous spectral bands. Each pixel not only carries spatial location information, but also contains a unique spectral response curve [1], making different materials and features highly distinguishable. Owing to these rich data properties, HSI is applied extensively in fields including agricultural monitoring, environmental management, urban planning, geological exploration, and national-security-related defense [2]. As a core task in hyperspectral data processing, HSI classification aims to assign an accurate land cover or object class to each pixel [3]. Nevertheless, this task encounters significant challenges, owing to the complexity of high-dimensional data and the limited availability of labeled samples. Therefore, the development of efficient feature extraction and classification methods has become a key direction in current HSI research.

Early HSI classification methods relied heavily on hand-designed features with shallow classifiers, such as the spectral angle mapper (SAM), support vector machine (SVM) [4], and K-nearest neighbor [5]. These methods mainly utilize spectral information to achieve classification based on the similarity of spectral curves. However, they ignore spatial contextual information, which leads to limited classification accuracy in scenes with complex feature distributions. To overcome the reliance on spectral information alone, researchers have proposed joint spatial–spectral methods such as Morphological Profiles [6], Gabor Filtering [7], and Markov Random Fields [8]. These methods enhance classification performance by fusing spatial texture with spectral features. While they have greatly improved HSI classification accuracy, they rely heavily on expert feature design and struggle to adapt to multi-scale spatial structures, leading to suboptimal results in complex scenes.

In recent years, the rapid iteration of deep learning techniques has brought new breakthroughs in HSI classification research. CNNs [9] have become an efficient feature extraction method for HSI by virtue of their advantages in local feature extraction, but they have also gradually exposed some inherent limitations [10]. Firstly, the local receptive field of CNNs is limited and each convolution can only see a fixed neighborhood, which makes it difficult to capture global contextual information [11]. Secondly, high-dimensional spectral bands generate a huge number of parameters, so the spectral dimension often needs to be reduced beforehand. However, this process inevitably discards some of the spectral information available to the model. Additionally, an excessive number of convolutions can cause the model to overfit, limiting its ability to fully leverage the spectral information. To compensate for the limited local receptive field, researchers have tried allowing CNNs to extract features at a larger scale or at different scales [12]. For example, spectral–spatial networks such as 3D-CNN combine the spectral dimension with the spatial dimension through convolution, as a way to enhance the representation of joint spectral–spatial patterns [13]. Other methods acquire spatial context information at different scales through multi-scale convolution or pyramid structures to take into account both local details and large-scale structures, and these improvements help to enhance the adaptability of CNNs to complex scenes [14]. Liu et al. [15] proposed a multiscale large kernel asymmetric convolutional network, which combines a spectral feature extraction module and a multiscale large kernel asymmetric convolutional module to efficiently capture both local and global spatial features. Guan et al. [16] introduced a dense pyramidal residual network that utilizes a combined spectral–spatial attention mechanism, enabling it to capture intricate spectral and spatial features in HSI. Gong et al. [17] introduced a novel Multiscale Feature Fusion Convolutional Network, incorporating a multiscale convolutional architecture designed to extract both spectral and spatial features. Although these 3D-CNN architectures show strong performance in joint spatial and spectral modeling, their local convolution kernels have difficulty capturing long-range dependencies [18], which limits the global modeling capability of the model.

With the introduction of the Transformer model into the field of HSI classification [19], its powerful global modeling capabilities have been widely applied. However, the detrimental effect of small sample sizes on Transformer training has become prominent: when training samples are insufficient, the model easily overfits or trains unstably [20]. To compensate for the limitations of Transformers in local feature extraction, many studies have begun to adopt CNN–Transformer hybrid architectures, leveraging the powerful local feature extraction ability of CNNs and the global modeling advantages of the Transformer. Hong et al. [21] proposed the SpectralFormer model to further exploit the Transformer’s modeling advantages on spectral sequences. In addition, researchers have successively proposed many innovative Transformer-based models for HSI classification. Zhou et al. [22] proposed a dual-branch convolutional Transformer network with an efficient interactive self-attention mechanism. Sun et al. [23] introduced a memory enhancement mechanism for spatial–spectral feature fusion. Chen et al. [24] explored the role of the center pixel in a Transformer. Zhao et al. [25] proposed a lightweight model based on group convolution. Cheng et al. [26] proposed a multi-scale spatial–spectral information interaction Transformer. Wu et al. [27] proposed a combination of the MLP-mixer and graph convolution in an enhanced Transformer. Shu et al. [28] effectively enhanced feature extraction and fusion capabilities through a dual feature aggregation module and cross-attention aggregation mechanism. These models significantly improved classification accuracy by introducing hybrid structures, attention mechanisms, or multi-scale feature fusion strategies. These innovations demonstrate that the CNN–Transformer hybrid architecture can effectively overcome the limitations of single architectures by combining their complementary advantages.

In contrast, although the Mamba architecture has demonstrated excellent performance in spatial–spectral dependency modeling, and while the SSUMamba model proposed by Fu et al. [29] excels in HSI denoising tasks, its primary advantage lies in denoising rather than classification accuracy. The Mamba architecture has lower computational complexity but lacks the flexibility and diversity in feature extraction of the CNN–Transformer hybrid architecture. In HSI, the CNN–Transformer hybrid architecture can better combine local details with global information, providing more precise classification results.

Therefore, despite the Mamba architecture’s high computational efficiency and performance in certain tasks, the CNN–Transformer hybrid architecture is more suitable for HSI due to its stronger local feature extraction capabilities and global modeling advantages. By optimizing the fusion of local and global features, the CNN–Transformer architecture fully leverages the strengths of both, significantly improving classification accuracy. Based on this, this study selected the CNN–Transformer hybrid architecture as the core framework for further exploration of its application potential and advantages in HSI.

Meanwhile, prompting, as a new paradigm for large model tuning, has shown great potential for efficient feature guidance. For example, Zhang et al. [30] proposed an efficient fine-tuning method based on zero-initialized attention, which efficiently controls the feature flow through prompting factors. This idea inspired us to introduce prompting factors into the HSI task and further improve the classification performance by designing an adaptive prompting factor mechanism to guide the model to achieve more accurate feature selection during the fusion of spectral and spatial features.

The methods mentioned above have made significant progress in fusing spectral and spatial features, but there is still much room for optimization in terms of fully extracting spatial and spectral information while maintaining the consistency of the local spatial structure. To address these problems, we propose a Prompt-Gated Transformer with Spatial–Spectral Enhancement for HSI classification, which combines the advantages of CNNs and Transformers and optimizes the attention mechanism. The network structure mainly consists of two parts: the Channel Hybrid Positional Attention (CHPA) module and the Prompt Cross-Former (PCFormer). The CHPA module exploits its dual-branch design to mine the deep spectral and spatial information in HSI and perform feature fusion. On this basis, we use PCFormer to establish global contextual links and control the feature flow through prompting factors and a gating mechanism, thereby enhancing the model’s ability to express long-range feature dependencies. The main contributions of this paper are as follows:

In this paper, a Prompt-Gated Transformer with Spectral-Spatial feature Enhancement is proposed. The network not only makes full use of the global feature extraction capability of a Transformer network, but also introduces prompting factors into the field of HSI.
To compensate for Transformer’s insufficient modeling of spatial structure, this paper proposes a Channel Hybrid Positional Attention (CHPA) module for HSI. The positional attention introduced by this module can enhance the extraction of spatial structure information, so that the model focuses on the spatial continuity of similar features and the boundary of dissimilar features. The channel weighting mechanism of CHPA can filter out unimportant channels and highlight key spectral features. This helps alleviate the curse of dimensionality in high-dimensional spectral data and improves the model’s ability to utilize effective spectral information.
In order to solve the problem that traditional self-attention mechanisms tend to be globally relevant and ignore local spectral–spatial details, this paper proposes Prompt Cross-Former (PCFormer) for HSI. The PCFormer includes AttenMix and PGFormer Block. In the PGFormer Block, we design the Prompt-Gated Cross Attention (PGCA), which uses a learnable prompt-gating mechanism to adaptively pass training prompts into the self-attention layer of the Transformer, to guide the attention to focus on effective features.

The main research of this paper will be presented in detail in the subsequent sections. Section 2 details our proposed methodology. Section 3 describes the experimental setup in detail, as well as some advanced technological approaches. Section 4 shows the experimental results and analyses them in depth. Section 5 discusses the effects of different parameter configurations on the model performance and the ablation experiments. Finally, Section 6 summarizes our contributions and looks to the future.

2. Methodology

In this section, the PGTSEFormer network for HSI classification is introduced. The overall structure of the network is described first; Sections 2.1 and 2.2 then present the basic structure and principles of CHPA and PCFormer, respectively.

The proposed PGTSEFormer is shown in Figure 1. Its overall framework consists of five parts: a channel tuning stage, the Channel Hybrid Positional Attention (CHPA) module, the Prompt Cross-Former (PCFormer), a global average pooling (GAP) layer, and a fully connected (FC) layer with a softmax classifier.

In the PGTSEFormer framework, the network accepts a series of 3D data cubes as input. Since datasets vary in their spectral dimensions, we first align these dimensions using a channel tuning mechanism to ensure consistent processing across datasets. Specifically, a series of 2D convolutional layers is employed to calibrate the channels and simultaneously extract preliminary spatial–spectral features. Subsequently, the CHPA module is introduced, which adaptively directs the model to focus on key channel information and spatial regions, suppressing irrelevant or interfering information.

After the initial feature extraction, the features are passed to the PCFormer, which is composed of two core modules: AttenMix and the PGFormer Block. The AttenMix module integrates depthwise convolution (DWConv), pointwise convolution (PWConv), a weight branching structure, and channel blending. This module serves as a pre-processing step that prepares features for global feature extraction by the Transformer. It mainly extracts local spatial–spectral features using a lightweight convolutional structure and conducts interactive channel blending through a channel attention mechanism, thereby improving the spatial representation of the model. Following the AttenMix module, the PGFormer Block further enhances the model’s ability to model global context. This module incorporates Prompt-Gated Cross Attention (PGCA), which employs group partitioning and hierarchical strategies to model spatial attention, both within and across groups. These strategies ensure spatial structural consistency, while enabling the capture of long-range contextual dependencies across regions and groups. As a result, the model’s ability to interpret complex spatial relationships is significantly enhanced. Ultimately, the extracted feature maps are reduced to 1 × 1 spatial dimensions by GAP, flattened into one-dimensional vectors, and passed into the FC layer for final classification prediction.
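As a rough illustration of the final stage described above, the following NumPy sketch applies global average pooling, flattening, and a fully connected softmax layer to a batch of feature maps. The function name `classification_head`, the tensor sizes, and the random weights are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def classification_head(features, weight, bias):
    """Hypothetical head: features (B, C, H, W), weight (num_classes, C), bias (num_classes,)."""
    pooled = features.mean(axis=(2, 3))                   # GAP to 1 x 1, flattened to (B, C)
    logits = pooled @ weight.T + bias                     # FC layer: (B, num_classes)
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)           # softmax class probabilities

# Illustrative sizes only: 2 patches, 64 channels, 9 x 9 window, 16 classes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 64, 9, 9))
fc_weight = rng.standard_normal((16, 64)) * 0.1
fc_bias = np.zeros(16)
probs = classification_head(feats, fc_weight, fc_bias)
predicted = probs.argmax(axis=1)                          # one class index per patch
```

Because GAP removes the spatial dimensions, the head does not depend on the patch size, only on the channel count.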

2.1. Channel Hybrid Positional Attention Module

HSI is characterized by high dimensionality and redundancy [31], placing greater demands on the model’s feature extraction capability. In order to improve the network’s ability to perceive key spectral channels and pay attention to important spatial regions, this paper proposes an efficient two-branch attention mechanism, CHPA, and integrates it into the shallow feature extraction stage of the network. As shown in Figure 1, the CHPA module combines channel attention with local spatial position attention and incorporates cross-attention interaction to enhance feature representations. This design strengthens the feature characterization capability from both spectral and spatial dimensions, thereby improving the model’s discriminative performance on complex feature classes. Given an input feature map $X \in R^{C \times H \times W}$, where C denotes the number of input channels and H and W represent the height and width of the feature map, the CHPA module first splits X equally along the channel dimension into two parts:

X_{1}, X_{2} = Split (X, C / 2, dim = 1)

(1)

where $X_{1} \in R^{\frac{C}{2} \times H \times W}$ is used for channel attention modeling, while $X_{2} \in R^{\frac{C}{2} \times H \times W}$ is used for positional attention modeling.

To capture the importance distribution along the spectral dimension, the channel attention branch first applies adaptive global average pooling to extract global contextual information:

z_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} X_{1} (i, j)

(2)

where $z_{c}$ denotes the aggregated descriptor for each channel and $X_{1} (i, j)$ refers to the value at spatial position $(i, j)$ in the feature map $X_{1}$.

To facilitate subsequent processing, the aggregated descriptor is reshaped into a 1D vector:

z^{'} = reshape (z_{c}) = z_{c} . view (B, 1, C / 2)

(3)

where B denotes the batch size. It is then passed into a dynamic convolution layer to perform cross-channel information interaction and generate channel attention weights:

A_{c} = σ (F_{1 D} (z^{'}))

(4)

where $F_{1 D} (\cdot)$ is a 1D convolution whose kernel size is adaptively adjusted based on the number of channels, allowing it to adapt to multi-scale cross-channel relations with a wide receptive field, and $σ (\cdot)$ denotes the sigmoid function. The resulting attention weights are reshaped back to $A_{c} \in R^{B \times C / 2 \times 1 \times 1}$ and applied to the input feature $X_{1}$:

X_{att} = X_{1} ⊙ A_{c}

(5)

where ⊙ denotes element-wise multiplication. The process dynamically learns the importance weights of different spectral channels to highlight highly responsive channels and suppress redundant channels.
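The channel attention branch in Equations (1)–(5) can be sketched in NumPy as follows. The text states only that the kernel size adapts to the number of channels; the `adaptive_kernel_size` rule below borrows the heuristic commonly used in ECA-style attention, and that rule, the function names, and the random kernel are assumptions rather than the authors' settings.

```python
import math
import numpy as np

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Assumed ECA-style rule: an odd kernel size that grows with log2(channels)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x1, kernel):
    """x1: (B, C/2, H, W); kernel: (k,) 1D weights shared across all channels."""
    z = x1.mean(axis=(2, 3))                       # Eq. (2): global average pooling, (B, C/2)
    pad = len(kernel) // 2
    z_pad = np.pad(z, ((0, 0), (pad, pad)))        # zero padding keeps the length at C/2
    # Eqs. (3)-(4): 1D cross-correlation along the channel axis
    # (np.convolve flips its second argument, so the kernel is reversed first).
    conv = np.stack([np.convolve(row, kernel[::-1], mode="valid") for row in z_pad])
    a_c = sigmoid(conv)[:, :, None, None]          # attention weights, (B, C/2, 1, 1)
    return x1 * a_c                                # Eq. (5): element-wise reweighting

# Illustrative sizes only.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 64, 9, 9))
x1, x2 = np.split(x, 2, axis=1)                    # Eq. (1): equal split along channels
k = adaptive_kernel_size(x1.shape[1])
x_att = channel_attention(x1, rng.standard_normal(k))
```

Since a single small kernel slides over the channel axis, the branch adds only `k` parameters regardless of the number of spectral channels.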

In HSI, spatial structure is also important for feature recognition. To further improve the model’s capacity to grasp spatial structures, we introduce a local spatial position attention branch. This branch models long-range spatial dependencies using a multi-way compression mechanism designed to capture positional relationships within feature maps. Given the input feature map $X_{2} \in R^{C / 2 \times H \times W}$, feature compression is applied along the horizontal and vertical directions, respectively, in order to efficiently extract spatial structure features:

z_{h} (i) = \frac{1}{W} \sum_{j = 1}^{W} X_{2} (i, j), \forall i \in [1, H]

(6)

z_{w} (j) = \frac{1}{H} \sum_{i = 1}^{H} X_{2} (i, j), \forall j \in [1, W]

(7)

where $z_{h} \in R^{\frac{C}{2} \times H \times 1}$ and $z_{w} \in R^{\frac{C}{2} \times 1 \times W}$ retain the spatial contextual information along the horizontal and vertical dimensions, respectively. Compared to traditional global pooling, this decomposed compression approach can effectively capture directional features. Next, $z_{h}$ is transposed to match the shape of $z_{w}$ for subsequent concatenation:

z_{h}^{'} = z_{h}^{⊤} \in R^{C / 2 \times 1 \times H}

(8)

The two descriptors are then concatenated and passed through a non-linear transformation:

z_{cat} = Concat (z_{h}^{'}, z_{w}) \in R^{C / 2 \times 1 \times (H + W)}

(9)

\hat{z} = ReLU (BN (W_{1} z_{cat} + b_{1}))

(10)

where $W_{1} \in R^{\frac{C}{8} \times \frac{C}{2}}$ is the dimensionality-reduction convolution kernel, which reduces the number of parameters by compressing the channels to $C / 8$ (a factor of 8 relative to C), BN denotes batch normalization, and $b_{1} \in R^{\frac{C}{8}}$ is the bias term. The fused features are then split again into two directional representations, which are used to generate attention weights in the horizontal and vertical directions:

[{\hat{z}}_{h}, {\hat{z}}_{w}] = Split (\hat{z}, dims = [H, W])

(11)

Each directional feature is processed separately through a 1 × 1 convolution, followed by a sigmoid function:

A_{s}^{h} = σ (z_{h} \cdot {Conv}_{1 \times 1}^{h} ({\hat{z}}_{h}) + b_{h})

(12)

A_{s}^{w} = σ (z_{w} \cdot {Conv}_{1 \times 1}^{w} ({\hat{z}}_{w}) + b_{w})

(13)

where $z_{h} \in R^{H \times \frac{C}{8}}$ represents the horizontal attention projection matrix and $z_{w} \in R^{W \times \frac{C}{8}}$ represents the vertical attention projection matrix; $b_{h} \in R^{H}$ and $b_{w} \in R^{W}$ are the corresponding bias terms. All channels share the same set of $z_{h}$ and $z_{w}$. The two attention maps are then combined through matrix multiplication to establish attention correlations between rows and columns:

X_{pos} = Broadcast (A_{s}^{h} \otimes A_{s}^{w}) \in R^{C / 2 \times H \times W}

(14)

The local spatial positional attention branch aims to capture the local structural information in an image from the spatial dimension, especially the edges, textures, and their spatial relationships in the image. The core idea of this branch is to learn positional dependencies along different spatial directions by applying directional average pooling and convolutional operations. Additionally, shared weights are employed to enable efficient long-range dependency modeling, which in turn enhances the model’s ability to recognize complex spatial patterns.
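The directional pooling and row–column combination in Equations (6)–(14) can be sketched as below. This is one possible reading of the equations, not the authors' code: batch normalization is omitted, the 1 × 1 convolutions are written as plain matrices, and the parameter names (`w1`, `conv_h`, `p_h`, and so on) and random initial values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(c2, h, w, rng):
    """Hypothetical parameters; cr = c2 // 4 mirrors the C/2 -> C/8 reduction."""
    cr = max(1, c2 // 4)
    return {
        "w1": rng.standard_normal((cr, c2)) * 0.1, "b1": np.zeros(cr),
        "conv_h": rng.standard_normal((cr, cr)) * 0.1,
        "conv_w": rng.standard_normal((cr, cr)) * 0.1,
        "p_h": rng.standard_normal((h, cr)) * 0.1, "b_h": np.zeros(h),
        "p_w": rng.standard_normal((w, cr)) * 0.1, "b_w": np.zeros(w),
    }

def positional_attention(x2, p):
    """x2: (C/2, H, W). Returns an attention map broadcast to (C/2, H, W)."""
    c2, h, w = x2.shape
    z_h = x2.mean(axis=2)                                  # Eq. (6): average over width, (C/2, H)
    z_w = x2.mean(axis=1)                                  # Eq. (7): average over height, (C/2, W)
    z_cat = np.concatenate([z_h, z_w], axis=1)             # Eq. (9): (C/2, H + W)
    z_hat = np.maximum(p["w1"] @ z_cat + p["b1"][:, None], 0.0)  # Eq. (10) without BN
    zh_hat, zw_hat = z_hat[:, :h], z_hat[:, h:]            # Eq. (11): split back
    # Eqs. (12)-(13): one weight per row / per column, shared by all channels.
    a_h = sigmoid(np.einsum("ic,ci->i", p["p_h"], p["conv_h"] @ zh_hat) + p["b_h"])
    a_w = sigmoid(np.einsum("jc,cj->j", p["p_w"], p["conv_w"] @ zw_hat) + p["b_w"])
    attn = np.outer(a_h, a_w)                              # Eq. (14): row-column correlation, (H, W)
    return np.broadcast_to(attn, (c2, h, w))

rng = np.random.default_rng(0)
x2 = rng.standard_normal((32, 9, 7))                       # non-square window to expose axis mix-ups
x_pos = positional_attention(x2, init_params(32, 9, 7, rng))
```

Pooling along one axis at a time keeps positional information along the other, which is what lets the outer product assign a distinct weight to every (row, column) pair.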

Through the weighted fusion of the channel attention branch and the local spatial location attention branch, the model is able to extract richer feature representations from both the spectral and spatial dimensions. In summary, the CHPA module strengthens the model’s capacity to recognize intricate structures and subtle category differences, thereby boosting its overall classification performance.

2.2. Prompt Cross-Former

In each PCFormer of the PGTSEFormer network, the input features are first pre-processed by the AttenMix module to enhance their representation. This module integrates several spatial and channel enhancement strategies, including two sets of convolutional operations, a dual-branch dynamic weight fusion mechanism, and a channel mixing module. Specifically, AttenMix first recombines the input features across bands via depthwise convolution (DWC) and pointwise convolution (PWC), then maps them into a high-dimensional feature space to enhance feature representation. Subsequently, the module further extracts key features through a dual channel–spatial branching mechanism. The channel branch utilizes global average pooling to extract spectral statistical information and dynamically adjusts channel weights to enhance local saliency in the feature map, whereas the spatial branch applies a 3D convolutional kernel to model inter-band spatial correlations and capture local structural features more effectively. Through this structure, the model can efficiently mine local spectral–spatial features within neighboring band groups. The introduced channel mixing operation facilitates information interaction across channels, thereby enriching the diversity of feature representations. In addition, it increases the stochasticity during training, which in turn improves the model’s adaptability and generalization capability.

After the feature enhancement process using AttenMix, the features are fed into the PGFormer Block for global context modeling. In the PGFormer Block, we introduce learnable Group Tokens as semantic proxies for different regions, to enable cross-region feature interaction and fusion through the cross-group attention mechanism. Meanwhile, the module introduces a dynamic Prompt Factor to regulate the query vectors through a gating mechanism, so that the attention distribution is self-adapted to the semantic structure and distribution pattern of the input features. The synergistic design of the module preserves the capacity for local detail extraction, while simultaneously activating global semantic associations, and establishes a multi-level, multi-scale feature enhancement framework. This design significantly improves the model’s representational capacity and classification accuracy.

2.2.1. AttenMix

HSI contains a large number of spectral channels and exhibits highly complex spatial structures [32]. Although convolutional operations can effectively extract local spatial features, they often introduce redundancy when handling high-dimensional spectral data and are inherently constrained by fixed kernel sizes. To address these limitations, we designed the AttenMix module as a structure-aware component placed before each Transformer layer, to facilitate efficient local–global feature interaction. Its structure is shown in Figure 1. Firstly, the input feature $X \in R^{C \times H \times W}$ is processed using a depthwise separable convolution, which consists of two stages, a depthwise convolution (DWConv) followed by a pointwise convolution (PWConv):

X_{dw} = DWConv (X) = X * W_{dw}, W_{dw} \in R^{C \times 1 \times H \times W}

(15)

X_{pw} = PWConv (X_{dw}) = X_{dw} * W_{pw}, W_{pw} \in R^{C^{'} \times C \times 1 \times 1}

(16)

where * denotes the convolution operation. $W_{dw}$ is the weight of the depthwise convolution, which extracts spatial information for each channel independently. $W_{pw}$ is the weight of the pointwise convolution, which helps the model to achieve spectral domain fusion between different channels and strengthens cross-channel feature interaction. The processed features are then split into two branches along the channel dimension. In the channel branch, the globally pooled channel descriptors are scaled by a learnable weight $c_{weight}$ and shifted by a learnable bias $c_{bias}$:

A_{c} = σ (c_{weight} \cdot AvgPool (X) + c_{bias})

(17)

where $c_{weight}$ performs a linear transformation of the channel information and learns the magnitude of the enhancement for each channel, regulating how strongly each channel’s pooled response contributes to the final channel attention. $c_{bias}$, on the other hand, introduces an independent offset for each channel’s attention, to enhance the nonlinear representation of the model. The spatial branch is modulated by learnable weights $s_{weight}$ and $s_{bias}$ applied to each grouped spatial subset after GroupNorm (GN):

A_{s} = σ (s_{weight} \cdot GN (X) + s_{bias})

(18)

Here, $s_{weight}$ modulates the normalized spatial features by highlighting salient regions and suppressing redundancy, while $s_{bias}$ provides spatial group-specific offsets and, together with $s_{weight}$, controls the shape of the feature map response. These learnable parameters are automatically updated by backpropagation during the training process. The outputs of the two branches are then concatenated along the channel dimension:

X = Concat (A_{s}, A_{c}) \in R^{C \times H \times W}

(19)

Then, the feature X obtained by concatenating the two branches along the channel dimension is fed into the channel shuffle stage. In channel shuffling, the channels are rearranged in a grouped manner to achieve inter-group information interaction. This design breaks the locality between channels and enhances the flow of features between groups, so that global information can be more fully integrated.
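A minimal NumPy sketch of the two gates in Equations (17) and (18) and the grouped channel shuffle is given below. Equation (19) writes the concatenation of the two attention terms directly; multiplying each gate with its own half of the features before concatenation, the group count of 2, and all function names are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_norm(x, groups, eps=1e-5):
    """Normalize a (C, H, W) tensor within each of `groups` channel groups."""
    c, h, w = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

def channel_shuffle(x, groups):
    """Interleave channels across groups: (groups, C/groups) -> (C/groups, groups)."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def atten_mix_gates(x, c_weight, c_bias, s_weight, s_bias, groups=2):
    """x: (C, H, W); the four parameters have shape (C/2, 1, 1)."""
    x_c, x_s = np.split(x, 2, axis=0)                                         # two branches
    a_c = sigmoid(c_weight * x_c.mean(axis=(1, 2), keepdims=True) + c_bias)   # Eq. (17)
    a_s = sigmoid(s_weight * group_norm(x_s, groups) + s_bias)                # Eq. (18)
    fused = np.concatenate([x_s * a_s, x_c * a_c], axis=0)                    # Eq. (19), gated (assumed)
    return channel_shuffle(fused, groups)

# Shuffle demo on channel indices: two groups [0, 1, 2] and [3, 4, 5] get interleaved.
ids = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled_ids = channel_shuffle(ids, 2).ravel().tolist()

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))
ones, zeros = np.ones((4, 1, 1)), np.zeros((4, 1, 1))
out = atten_mix_gates(x, ones, zeros, ones, zeros)
```

After the shuffle, each contiguous block of channels contains entries from both branches, so a subsequent grouped operation sees channel-gated and spatially gated features side by side.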

2.2.2. PGFormer Block

Although the conventional ViT performs well in capturing global dependencies, it still has limitations in handling images with significant local features [33]. For this reason, a prompt-based gated Transformer structure was designed in this paper, aiming to enhance the model’s ability to model local spatial information. The structure is shown in Figure 2a. It consists of a series of normalization layers, a convolutional layer, and PGCA, which work in concert to significantly enhance the accuracy of feature extraction. We embed the PGCA module in each Transformer encoder layer to enhance the model’s adaptive perception in the spatial dimension.

As shown in Figure 2b, the PGCA module consists of four key steps: local spatial partitioning, intra-group self-attention modeling, inter-group context fusion, and Prompt-Gated mechanism. First, the input features are divided into multiple local regions through the spatial partitioning mechanism, to maintain the consistency of the local structure. Subsequently, the intra-group self-attention mechanism is employed to capture the key spatial relationships within the regions. On this basis, the inter-group context fusion module further models the global dependencies between different regions. Finally, the attention path is dynamically adjusted through the introduction of the Prompt-Gated mechanism, so that the model is capable of adaptively selecting attention regions according to the distribution of input features, thus achieving more accurate spatial feature modeling.

Specifically, given an input feature map $X \in R^{C \times H \times W}$, we first rearrange it into multiple spatial groups, each a g × g local subregion:

X_{group} = Rearrange (X) \to R^{N \times C \times g^{2}}

(20)

In the above formula, $N = \frac{H \cdot W}{g^{2}}$ denotes the number of spatially partitioned groups, where g is the spatial size of each group. Next, a 1 × 1 convolution is applied to each local group feature to generate the query (Q), key (K), and value (V) matrices, enabling intra-group attention modeling:

Q, K, V = {Conv1d}_{1 \times 1} (X_{group}) \to R^{B \cdot N \times h \times g^{2} \times d}

(21)

where h denotes the number of attention heads and d represents the embedding dimension of each head. To enhance the expressive capability of the attention mechanism, we introduce a learnable prompt vector $P \in R^{1 \times h \times 1 \times d}$ and a gating factor $G \in R^{1 \times h \times 1 \times 1}$ to modulate the original query vector. The operation is shown in Equation (22):

Q^{'} = Q \cdot (1 - σ (G)) + P \cdot σ (G)

(22)

where $σ (\cdot)$ denotes the sigmoid function used to control the level of involvement of the prompt vector. Next, the standard self-attention operation is performed within each local spatial group using the modulated query:

Attn (Q^{'}, K, V) = Softmax (\frac{Q^{'} K^{⊤}}{\sqrt{d}}) \cdot V

(23)
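The grouping and prompt-gated attention in Equations (20)–(23) can be sketched in NumPy as follows. The 1 × 1 convolution is written as a per-token linear map, tokens are stored as (N, g², C) rather than (N, C, g²), and the names `partition` and `prompt_gated_attention`, the sizes, and the gate values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def partition(x, g):
    """Eq. (20): (C, H, W) -> (N, g*g, C) with N = H*W / g^2 non-overlapping windows."""
    c, h, w = x.shape
    x = x.reshape(c, h // g, g, w // g, g).transpose(1, 3, 2, 4, 0)
    return x.reshape(-1, g * g, c)

def prompt_gated_attention(tokens, wq, wk, wv, prompt, gate, heads):
    """tokens: (N, T, C); prompt: (1, heads, 1, d); gate: (1, heads, 1, 1)."""
    n, t, c = tokens.shape
    d = c // heads

    def split_heads(y):                                   # (N, T, C) -> (N, heads, T, d)
        return y.reshape(n, t, heads, d).transpose(0, 2, 1, 3)

    q, k, v = (split_heads(tokens @ w) for w in (wq, wk, wv))   # Eq. (21)
    s = sigmoid(gate)
    q = q * (1.0 - s) + prompt * s                        # Eq. (22): gated query
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))    # Eq. (23), per group and head
    out = attn @ v                                        # (N, heads, T, d)
    return out.transpose(0, 2, 1, 3).reshape(n, t, c)

rng = np.random.default_rng(0)
C, H, W, g, heads = 16, 8, 8, 4, 4
tokens = partition(rng.standard_normal((C, H, W)), g)     # N = 4 groups of 16 tokens
wq, wk, wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
prompt = rng.standard_normal((1, heads, 1, C // heads))
closed_gate = np.full((1, heads, 1, 1), -50.0)            # sigmoid ~ 0: prompt is ignored
open_gate = np.full((1, heads, 1, 1), 50.0)               # sigmoid ~ 1: query becomes the prompt
closed = prompt_gated_attention(tokens, wq, wk, wv, prompt, closed_gate, heads)
plain = prompt_gated_attention(tokens, wq, wk, wv, np.zeros_like(prompt), closed_gate, heads)
opened = prompt_gated_attention(tokens, wq, wk, wv, prompt, open_gate, heads)
```

The two extreme gate settings show the role of G: with the gate closed the block reduces to ordinary windowed self-attention, and with the gate fully open every token in a group shares the same prompt-driven query, so all tokens receive the same output.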

The obtained attention result is the output feature after context enhancement within the group. To enhance the information interaction between spatial groups, we design a Group Tokens mechanism. The token at the first position in each group is extracted as its semantic representation, and attention is then computed between groups:

T_{Q}, T_{K} = Linear (GroupTokens), \quad {Attn}_{group} = Softmax (\frac{T_{Q} T_{K}^{⊤}}{\sqrt{d}})

(24)

The results of inter-group attention are subsequently broadcast back to the groups to remodulate the intra-group features and obtain the final context fusion feature ${Attn}_{fused}$. The grouped features are then rearranged back to the original spatial dimensions, and a residual connection is applied:

X_{out} = X + Reshape ({Attn}_{fused})

(25)
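Equations (24) and (25) can be illustrated with the NumPy sketch below. The text does not specify how the inter-group attention is broadcast back into each group, so the additive fusion used here (adding the attention-weighted mix of group tokens to every position of a group) is an assumption, as are the function names, the single-head simplification, and the toy sizes.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def group(x, g):
    """(C, H, W) -> (N, C, g*g), with N = H*W / g**2."""
    c, h, w = x.shape
    x = x.reshape(c, h // g, g, w // g, g).transpose(1, 3, 0, 2, 4)
    return x.reshape((h // g) * (w // g), c, g * g)

def ungroup(groups, h, w, g):
    """Inverse of group: (N, C, g*g) -> (C, H, W)."""
    _, c, _ = groups.shape
    x = groups.reshape(h // g, w // g, c, g, g).transpose(2, 0, 3, 1, 4)
    return x.reshape(c, h, w)

def inter_group_fusion(intra, w_q, w_k):
    """intra: (N, L, d) intra-group attention output, with L = g*g."""
    tokens = intra[:, 0, :]                                   # first position of each group
    t_q, t_k = tokens @ w_q, tokens @ w_k                     # (N, d) each
    d = t_q.shape[-1]
    attn_group = softmax(t_q @ t_k.T / np.sqrt(d), axis=-1)   # Eq. (24), shape (N, N)
    context = attn_group @ tokens                             # assumed: weighted mix of group tokens
    fused = intra + context[:, None, :]                       # assumed: broadcast to all positions
    return fused, attn_group

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
C, H, W, g = 6, 8, 8, 4
x = rng.standard_normal((C, H, W))
intra = group(x, g).transpose(0, 2, 1)      # stand-in for the intra-group output, (N, g*g, C)
w_q, w_k = rng.standard_normal((C, C)), rng.standard_normal((C, C))
fused, attn_group = inter_group_fusion(intra, w_q, w_k)
x_out = x + ungroup(fused.transpose(0, 2, 1), H, W, g)        # Eq. (25): residual connection
```

Because the inter-group attention matrix is only N × N, its cost stays small when the number of groups is small, which matches the lightweight design the section describes.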

With the PGSA module, the PGTSEFormer network effectively enhances the perception ability and information interaction between spatial regions, while maintaining a lightweight structure, which significantly improves the model’s structural modeling and generalization ability in HSI.

3. Datasets and Experimental Settings

To evaluate the effectiveness of the proposed PGTSEFormer framework, we conducted comprehensive experiments on five publicly available HSI datasets. This section describes the five datasets and the experimental setup.

3.1. Datasets

Indian Pines: This dataset was collected in 1992 by NASA over northwestern Indiana, using the AVIRIS sensor. It comprises a hyperspectral image of 145 × 145 pixels, capturing 16 distinct land cover categories. The spectral range spans from 0.4 to 2.5 microns across 220 contiguous bands. Figure 3 presents a false-color composite of the image, alongside its corresponding ground-truth annotation map. In total, the dataset includes 10,249 labeled instances representing all classes. For experimental purposes, the data were divided such that 5% were allocated for training, another 5% for validation, and the remaining 90% for testing. Detailed class-wise sample counts and distribution ratios are provided in Table 1.

Salinas: This dataset was captured by NASA’s AVIRIS sensor over the Salinas Valley in California. It features a spatial resolution of 3.7 m, with image dimensions of 512 × 217 pixels. Originally composed of 224 spectral bands, 20 bands heavily influenced by water vapor absorption were removed, leaving 204 continuous and usable spectral channels. Figure 4 displays a false-color composite image along with its associated ground-truth label map. This dataset focuses on 16 categories of agricultural crops and contains a total of 54,129 labeled samples. For dataset partitioning, 1% of the samples were allocated to the training set and another 1% to the validation set, while the remaining 98% were reserved for testing. Table 2 outlines the number of samples per class and their respective distribution.

Botswana: This dataset was acquired by the Hyperion sensor on NASA's Earth Observing-1 (EO-1) satellite in May 2001 and covers the Okavango Delta region of Botswana, Africa. The dataset has an image size of 1476 × 256 pixels, a spatial resolution of about 30 m, and a spectral wavelength range of 0.4 to 2.5 microns. The raw data contain 242 spectral bands; after removing uncalibrated and noisy bands that cover water absorption features, 145 valid spectral channels were retained. The false-color composite image of the Botswana dataset with its corresponding ground-truth label map is shown in Figure 5. The dataset contains 14 land-cover classes, totaling 3248 labeled samples. In the experimental setup of this paper, 5% of the samples were used for the training set, 5% for the validation set, and the remaining 90% were used as the test set. The details of the category distribution and sample division are shown in Table 3.

WHU-Hi-LongKou: This dataset was collected using a hyperspectral imaging sensor mounted on a UAV over an agricultural area in Longkou Town, Hubei Province, China. It covers nine land-cover classes, mainly crops, and is intended for research on high-resolution hyperspectral classification. Figure 6 shows the dataset's pseudo-color composite image and its corresponding ground-truth map. The image is 550 × 400 pixels, with 270 contiguous spectral bands across the 0.4–1.0 µm range and a spatial resolution of about 0.463 m. The dataset includes 204,542 labeled instances across nine categories. For the experiments, 0.2% of the data were allocated to training and another 0.2% to validation, while the remaining 99.6% were reserved for testing, to ensure a robust evaluation. Class-wise sample counts are provided in Table 4.

WHU-Hi-HongHu: This dataset was acquired in Honghu City, Hubei Province, China, with a 17 mm focal length Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro UAV platform. The experimental area is a complex agricultural scene with many crop classes, and different cultivars of the same crop are also planted in the region, including Chinese cabbage and cabbage, and Brassica chinensis and small Brassica chinensis. The UAV flew at an altitude of 100 m. The image is 940 × 475 pixels, with 270 bands from 400 to 1000 nm and a spatial resolution of about 0.043 m. Figure 7 shows a false-color image and its corresponding ground-truth label map from the WHU-Hi-HongHu dataset. This dataset contains 22 land-cover classes. In the experimental setup of this study, the training and validation sets each accounted for 0.2% of the total samples, while the remaining 99.6% were used as the test set. The number of samples for each class and their partitioning are detailed in Table 5.
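The train/validation/test partitioning described for the five datasets can be sketched as a per-class (stratified) random split. The exact sampling procedure is not given in the text, so the per-class rounding, the minimum of one training sample per class, and the function name `stratified_split` are assumptions; the label vector below is synthetic and stands in for the labeled (non-background) pixels of a scene.

```python
import numpy as np

def stratified_split(labels, train_frac, val_frac, seed=0):
    """Split sample indices per class into train / validation / test index arrays."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Assumed rule: round per class, but keep at least one sample for each subset.
        n_train = max(1, int(round(train_frac * idx.size)))
        n_val = max(1, int(round(val_frac * idx.size)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)

# Synthetic, imbalanced labels for three classes (not taken from any dataset).
labels = np.repeat([1, 2, 3], [1000, 200, 20])
train_idx, val_idx, test_idx = stratified_split(labels, train_frac=0.05, val_frac=0.05)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class rather than over the whole label vector keeps very small classes, such as Oats in Indian Pines, represented in the training set even at ratios as low as 0.2%.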

3.2. Experimental Settings

3.2.1. Experimental Setup

Our proposed PGTSEFormer network was implemented using PyTorch 1.11 and Python 3.7.13. All experiments were conducted on a server with an NVIDIA GeForce RTX 3090 GPU. We used the AdamW optimizer to update network parameters during training, with cross-entropy [34] as the loss function. The learning rate was set to 0.001, weight decay to 0.05, batch size to 128, and patch size to 12. To reduce randomness and ensure stable results, each experiment was repeated 10 times and the average outcome was reported as the final result.
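The hyperparameters above and the 10-run averaging can be collected in a small sketch. The dictionary keys and the `summarize_runs` helper are hypothetical names, the use of the sample standard deviation is an assumption (the text does not say how the ± values in the tables were computed), and the scores in the usage example are made-up placeholders, not results from the paper.

```python
import statistics

# Settings stated in the text; the key names are illustrative.
CONFIG = {
    "optimizer": "AdamW",
    "loss": "cross-entropy",
    "learning_rate": 1e-3,
    "weight_decay": 0.05,
    "batch_size": 128,
    "patch_size": 12,
    "runs": 10,
}

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder per-run accuracies, for illustration only.
placeholder_scores = [0.90, 0.92, 0.91, 0.93, 0.89, 0.90, 0.92, 0.91, 0.90, 0.92]
assert len(placeholder_scores) == CONFIG["runs"]
mean, std = summarize_runs(placeholder_scores)
print(f"{100 * mean:.2f} ± {100 * std:.2f}%")
```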

3.2.2. Evaluation Metrics

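To assess the performance of our method reliably, we used three standard evaluation metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (k). OA is the fraction of all test samples that are classified correctly, AA is the mean of the per-class accuracies, and kappa measures agreement with the ground truth after correcting for chance agreement. Higher metric values indicate better classification performance.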
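The three metrics can be computed from a confusion matrix using their standard definitions, as in the following sketch. The function name and the 2 × 2 example matrix are illustrative assumptions; rows are taken to be true classes and columns predicted classes.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, kappa) from a confusion matrix (rows: true, columns: predicted)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                        # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)       # accuracy of each true class
    aa = per_class.mean()                              # average accuracy
    # Expected chance agreement from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy two-class example: 50 + 40 correct out of 100 samples.
oa, aa, kappa = classification_metrics([[50, 0], [10, 40]])
print(oa, aa, kappa)   # OA = 0.9, AA = 0.9, kappa = 0.8 (up to floating-point rounding)
```

AA weights every class equally, so it is more sensitive than OA to errors on small classes, which is why both are reported for imbalanced HSI datasets.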

3.2.3. Comparative Methods

To assess the performance of the proposed PGTSEFormer in HSI, we compared it with several representative CNNs and Transformer-based methods. The selected methods cover a wide range of feature extraction strategies, from shallow convolutional networks to deep Transformer architectures. The specific comparison models and their structural features are as follows:

3DCNN [35]: This model adopts a multi-layer 3D convolutional structure to jointly capture spectral and spatial features, while integrating a multi-level pooling module to progressively compress the spectral dimensions. Subsequently, feature integration and dimensionality reduction are carried out through an FC layer, and the final output is the classification result.
CACFTNet [36]: This model innovatively introduces a covariance attention mechanism and cross-layer fusion strategy, utilizing a dual-branch module that combines CNN and Transformer structures to achieve efficient integration of spatial–spectral information and band similarities in HSI tasks.
SpectralFormer [21]: This network model comprises four key modules: a band-domain neighborhood augmentation module for constructing local spectral–spatial relationships, a multi-layer Transformer encoder for capturing global contextual information, an embedding and positional coding module for feature representation, and a linear FC layer for final classification.
SPRN [37]: This network model consists of four main components: a 2D convolutional layer for extracting underlying spectral–spatial features, a multi-grouped residual structure for feature enhancement and residual learning, a grouped convolutional module for channel adaptation, and an FC layer for final classification.
SSFTT [38]: This network consists of 3D and 2D convolutional layers for joint spectral–spatial feature extraction, a learnable token generation module to construct semantic representations, a Transformer encoder to model global contextual dependencies, and a linear layer for final category prediction.
GAHT [39]: This network is divided into three stages, each consisting of 1 × 1 convolutional and Transformer Encoder modules embedded in grouped pixels, followed by a global average pooling layer and a linear layer for classification.
MASSFormer [23]: The network model first applies 3D and 2D convolutional layers for shallow spectral–spatial feature extraction. The resulting features are then split into two branches: one performs both max pooling and average pooling to capture statistical characteristics, while the other concatenates a class token and positional encoding to prepare for Transformer-based semantic modeling. Both branches are fed into the Transformer encoder layer, where the pooled branch serves as memory for recomposing Q, K, and V. Finally, the output is passed to an MLP for classification.
SQSFormer [24]: This network is designed for hyperspectral feature extraction and consists of a 2D convolutional layer to extract underlying spatial features, an SE module to recalibrate channel responses, a Transformer encoder enhanced with a central feature mechanism to capture global dependencies, and a linear layer for aggregating features and predicting output classes.
MCTGCL [40]: This network model integrates 3D and 2D convolutional layers to extract underlying spectral–spatial features, an efficient attention weighting module to refine spatial channel importance, a Transformer encoder enhanced with a memory mechanism to capture long-range dependencies, and a linear layer to aggregate features for final classification.

4. Experimental Comparison and Analysis

In this section, we selected nine representative HSI methods for comparative experiments, covering different network structures based on CNNs and Transformer-based architectures. Through systematic comparisons with these state-of-the-art methods, we comprehensively evaluated the classification performance of the proposed PGTSEFormer model on several standard hyperspectral datasets. We present a detailed description of the network architecture, the experimental setup, and the comparison methods used for evaluating PGTSEFormer. In addition, we provide in-depth analyses of the classification results, which collectively demonstrate the state-of-the-art performance and effectiveness of the proposed framework in HSI tasks from multiple perspectives. For the classification accuracy of each feature category, the highest value among the methods is marked in bold font, in order to highlight the performance advantages.

Additionally, to visually demonstrate the classification performance of the different methods across the various datasets, we provide the corresponding false-color composite images, ground-truth annotations, and predicted classification maps. These visualizations help in more effectively comparing the recognition accuracy and spatial consistency of each model.

4.1. Quantitative Results Analysis

Table 6, Table 7, Table 8, Table 9 and Table 10 present the classification results of our method and the comparative approaches on the five datasets. The evaluation metrics used included Overall Accuracy (OA), Average Accuracy (AA), Kappa (k), and individual classification accuracy for each category. The best-performing values in each experiment are boldfaced to underscore the advantages of each method across the various evaluation criteria.

The classification performance on the Indian Pines dataset is shown in Table 6. Although PGTSEFormer performed slightly worse on certain individual categories (compared with the SPRN method), it achieved a significantly higher accuracy for category 9 (Oats) than all other methods, demonstrating its strength in recognizing classes with limited samples. With only 5% of the training data, PGTSEFormer achieved an OA value of 97.91 ± 0.41%, which is approximately 0.19% higher than the best performance of the other methods. Moreover, its performance exhibited lower variance, indicating good stability.

The classification performance on the Salinas dataset is presented in Table 7. SQSFormer achieved an accuracy of 97.70 ± 0.93% on Class 5 (Fallow-Smooth), demonstrating its strength in local feature modeling. However, in terms of overall performance, PGTSEFormer achieved an OA value of 98.74 ± 0.42% using only 1% of the training samples, outperforming the best comparison method (i.e., SPRN) by 1.22%.

The classification results on the Botswana dataset are shown in Table 8. While several methods achieved classification accuracies of 100 ± 0.00% in specific categories, PGTSEFormer also reached this level in most categories, demonstrating its strong capability in modeling multi-class spectral–spatial features. Although the classification accuracies of PGTSEFormer were slightly lower than those of the best-performing method in category 1 (Water), category 5 (Hippo Grass-1), category 12 (Short Mopane), and category 14 (Chalcedony), the respective differences were only 0.25%, 0.47%, 0.13%, and 0.12%, indicating negligible performance gaps. Overall, PGTSEFormer achieved an OA value of 99.48 ± 0.34% using only 5% of the training samples, with a fluctuation of just 0.34%, which further validates its robustness and stability on this dataset.

The classification results on the WHU-Hi-LongKou dataset are shown in Table 9. PGTSEFormer performed particularly well for Class 3 (Sesame) and Class 5 (Narrow-leaf Soybean). The traditional CNN methods performed poorly in these two classes, possibly because they extract features less effectively from few samples. Other Transformer-based methods improved on the CNNs, but PGTSEFormer still achieved the highest classification accuracies for these categories. With only 0.2% of the samples used for training, PGTSEFormer achieved an OA value of 99.18 ± 0.29%, an improvement of about 0.21% over SPRN.

The classification results on the WHU-Hi-HongHu dataset are shown in Table 10. PGTSEFormer outperformed all other methods, especially in Class 1 (Red roof) and Class 5 (Cotton firewood), where it achieved a higher classification accuracy than the other methods. In contrast, 3DCNN and SpectralFormer performed relatively poorly in these two classes, possibly due to their limitations in feature extraction, particularly for the complex patterns present in these classes. Transformer-based methods such as GAHT and MCTGCL showed significant improvements over the CNN-based methods, but they still lagged behind our method in classification accuracy. Our method achieved the highest OA and AA values, with an OA value of 91.52 ± 1.01%, surpassing the next best method (GAHT) by a margin of approximately 1.5%.

In summary, PGTSEFormer achieved excellent classification results on several public datasets, verifying its effectiveness in dealing with complex spatial–spectral structures and modeling global–local feature relationships.

4.2. Visual Results Analysis

Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 show the results of the classification performance visualization for the proposed method with each comparison algorithm on the five datasets. Owing to the structural advantages of the proposed PGTSEFormer in spatial–spectral feature modeling and contextual representation, the classification results exhibited reduced noise artifacts, sharper region boundaries, and greater consistency with the ground-truth distribution. On the Indian Pines dataset, this was especially reflected in category 15 (Vineyard-untrained) (bright red area). This type of area is usually noisy and difficult to segment accurately using traditional methods, while PGTSEFormer could effectively suppress misclassified areas and significantly improved the accuracy of boundary identification. Similar advantages were observed on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets. The classification maps generated by PGTSEFormer not only aligned well with the ground-truth contours, but also exhibited superior accuracy in edge delineation and fine-grained category prediction. These results further validate the model’s robustness and generalization performance for complex hyperspectral scenes.

4.3. Learned Feature Visualizations by t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique widely used for visualizing high-dimensional data. It maps high-dimensional data into two or three dimensions while preserving the local structure, revealing underlying patterns and group structures. In hyperspectral image classification, t-SNE plots show how the learned features are distributed. If the classification performance is good, the samples of different categories are clearly separated in the t-SNE plot and cluster into independent groups, indicating that the classifier can effectively distinguish between categories. If the classification performance is poor, the categories overlap or interweave, and misclassified samples may be projected into the clusters of other categories.

Taking the Indian Pines dataset as an example, as shown in Figure 13, in (b), (c), and (d), the samples of different categories are well clustered together and their boundaries are distinct, suggesting a relatively good classification performance. The overlap between categories is minimal, indicating that the classifier could effectively distinguish between these categories. In (a), (e), and (f), some categories show more overlap, especially in the areas where the colors intersect. This distribution might indicate that the classifier's ability to differentiate between these categories has declined, or that the feature spaces of these categories are similar, making classification more difficult. From (g) to (j), the degree of separation between categories varies across methods. In particular, in (h) and (i), some categories cluster more closely together, possibly due to the impact of the dimensionality reduction process on the local structure.

In (j), the samples of each category are almost all clustered in distinct areas, and the distribution of most categories does not overlap or cross. The samples in each category have high separability in the feature space, indicating that the features of the data can effectively distinguish between these categories. Areas with the same color (e.g., blue, green, pink) show that the samples of each category are tightly grouped, with no points from other categories scattered in these areas, meaning that this plot is excellent in terms of category separation. Compared to the other plots (e.g., (d) and (e)), the samples in (j) are mostly clustered within their respective clusters, with only a few samples potentially in neighboring category regions, showing a low misclassification rate. The distribution of each category in the plot forms relatively compact clusters, indicating that similar samples are grouped closely together in the feature space, thereby enhancing the classification accuracy and robustness.

4.4. Parameter Comparison

To evaluate the synergy between efficiency and accuracy, Table 11 compares the model complexity of PGTSEFormer with nine mainstream HSI methods, considering both Parameters and FLOPs. As shown in the table, PGTSEFormer had significantly fewer parameters than most competitors on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets, demonstrating its lightweight design, without compromising feature extraction capability. Despite the relatively small number of parameters in the model, PGTSEFormer still achieved a leading classification accuracy, which indicates that it is more efficient in parameter utilization.
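For readers unfamiliar with how such complexity figures are typically tallied, the sketch below counts parameters and multiply-accumulate operations (MACs) for convolutional and fully connected layers using the standard formulas. The text does not state which tool or counting convention produced Table 11 (for example, whether one MAC is counted as one or two FLOPs), so the functions and layer sizes here are illustrative assumptions only.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of a 2D convolution with a square kernel."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))

def conv2d_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulate operations of a 2D convolution for one input sample."""
    return c_out * c_in * kernel * kernel * h_out * w_out

def linear_params(n_in, n_out, bias=True):
    """Learnable parameters of a fully connected layer."""
    return n_out * (n_in + (1 if bias else 0))

def linear_macs(n_in, n_out):
    """Multiply-accumulate operations of a fully connected layer for one sample."""
    return n_in * n_out

# Hypothetical layers on a 12 x 12 patch (sizes chosen for illustration).
layers = [
    ("conv3x3", conv2d_params(30, 64, 3), conv2d_macs(30, 64, 3, 12, 12)),
    ("conv1x1", conv2d_params(64, 64, 1), conv2d_macs(64, 64, 1, 12, 12)),
    ("classifier", linear_params(64, 16), linear_macs(64, 16)),
]
total_params = sum(p for _, p, _ in layers)
total_macs = sum(m for _, _, m in layers)
print(total_params, total_macs)
```

The example shows why 1 × 1 convolutions are attractive in lightweight designs: for the same channel widths they need roughly one ninth of the parameters and MACs of a 3 × 3 convolution.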

Comprehensively analyzing the parameter scale, computational cost, and classification performance of the model, PGTSEFormer achieved superior classification results on the basis of ensuring a low computational overhead, which verifies its significant advantages in HSI tasks.

4.5. Impact of Different Training Ratios

To validate the robustness of the proposed method, we compared the OA of all methods on the five HSI datasets under different training sample ratios. As shown in Figure 14, PGTSEFormer consistently outperformed the other methods in classification accuracy, even under limited training data conditions. Moreover, its accuracy increased steadily with the proportion of training samples, demonstrating strong scalability and robustness. Taking the Salinas dataset as an example, the classification accuracy of our method was consistently higher than that of the other methods when the number of training samples was limited. This suggests that the model retained robust feature extraction and discrimination abilities, even in challenging situations with limited samples and subtle class differences.

In summary, this experiment fully demonstrated that PGTSEFormer not only has superior performance under standard settings, but also shows good robustness and stability under restricted training samples.

5. Discussion

To better understand how various parameter configurations influenced the network model, this study conducted a series of systematic experiments on patch sizes, learning rate settings, and module combinations. Through this series of experiments, we effectively revealed which parameter combinations gave the model the highest classification accuracy.

5.1. Patch Size

To examine the impact of patch size on classification performance, we evaluated four configurations (patch sizes of 4, 8, 12, and 16), keeping all other parameters fixed. Figure 15 depicts the OA achieved on the five datasets under different patch sizes. The results reveal that the OA on the Salinas, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets increased steadily with larger patch sizes, whereas the Indian Pines and Botswana datasets exhibited a peak performance at size 12, before declining. Based on these observations, a patch size of 12 was adopted consistently across all datasets.
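Patch-based HSI classification feeds the network a spatial neighborhood around each labeled pixel, and the patch size sets the width of that neighborhood. The sketch below shows one way to extract such a patch with NumPy. The padding mode (`reflect`) and the handling of even patch sizes (the target pixel sits at offset `patch_size // 2`, so the window is not perfectly centered) are assumptions, since the text does not describe these details; `extract_patch` is a hypothetical helper name.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size):
    """Return the (patch_size, patch_size, B) neighborhood of pixel (row, col).

    cube has shape (H, W, B). The target pixel lands at offset patch_size // 2
    inside the patch; borders are filled by reflection padding (an assumption).
    """
    before = patch_size // 2
    after = patch_size - before - 1
    padded = np.pad(cube, ((before, after), (before, after), (0, 0)), mode="reflect")
    return padded[row:row + patch_size, col:col + patch_size, :]

# Synthetic cube standing in for a hyperspectral scene.
rng = np.random.default_rng(0)
cube = rng.standard_normal((20, 15, 3))
corner_patch = extract_patch(cube, 0, 0, 12)   # corner pixel, needs padding
inner_patch = extract_patch(cube, 10, 7, 4)    # interior pixel, no padding used
print(corner_patch.shape, inner_patch.shape)
```

Larger patches provide more spatial context but also include more pixels from neighboring classes, which is one plausible reason for the peak at size 12 on the smaller Indian Pines and Botswana scenes.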

5.2. Learning Rate

Choosing an appropriate learning rate influences not only whether a model reaches the optimum but also how quickly it converges. To study the effect of the learning rate on the network, we tested six values (0.001, 0.005, 0.0001, 0.0005, 0.00001, and 0.00005) while keeping all other parameters unchanged. As shown in Figure 16, the OA on all five datasets was highest when the learning rate was 0.001. We therefore set the learning rate to 0.001 for all five datasets.

5.3. Ablation Study

To systematically evaluate the individual contributions of the proposed modules, we performed ablation studies on five benchmark datasets. The experiments were conducted under the following configurations:

Both CHPA and PCFormer modules removed;
CHPA retained, PCFormer removed;
PCFormer retained, CHPA removed;
Both CHPA and PCFormer modules retained.

The classification results corresponding to these settings are presented in Table 12. As shown, the inclusion of either module individually improved the performance compared to the baseline. Notably, the best performance was achieved when both modules were incorporated, demonstrating their complementary effectiveness in enhancing the network’s feature representation capability.

6. Conclusions

In this paper, we proposed a lightweight network framework for HSI classification named PGTSEFormer. It consists of two main parts: shallow feature extraction and deep contextual information interaction. In the shallow feature extraction part, a CHPA module is introduced to jointly extract local spatial and spectral features. In the deep contextual interaction part, the PCFormer is employed to capture global contextual dependencies. The integration of shallow local feature enhancement and deep contextual interaction mechanisms greatly improved the feature representation capability and classification accuracy of the model. In comparison experiments on five publicly available HSI datasets, PGTSEFormer outperformed current state-of-the-art methods in terms of accuracy and parameter efficiency. In particular, it maintained excellent classification results with a small number of parameters, which demonstrates its usefulness in resource-limited situations.

The experimental results demonstrate that the proposed method achieved strong performance on existing hyperspectral datasets. Future research will focus on further optimizing the model structure and exploring lightweight network designs with improved generalization capabilities to better meet the diverse requirements of real-world HSI processing tasks.

Author Contributions

Conceptualization, S.C. and R.H.; methodology, R.H.; software, R.H.; validation, R.H., S.C. and S.L.; resources, S.C.; data curation, R.H.; writing—original draft preparation, R.H.; visualization, R.H.; supervision, S.C.; project administration, T.L.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China Project under Grant 62441213, the Key Laboratory Open Projects in Xinjiang Uygur Autonomous Region under Grant 2023D04028, and in part by Xinjiang University Graduate Innovation Project under Grant XJDX2025YJS193.

Data Availability Statement

Indian Pines, Botswana, and Salinas datasets are available at https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 21 July 2025). WHU-Hi-LongKou and WHU-Hi-HongHu datasets are available at http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm (accessed on 21 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fang, L.; Yan, Y.; Yue, J.; Deng, Y. Toward the vectorization of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518214.
Ghamisi, P.; Plaza, J.; Chen, Y.; Li, J.; Plaza, A.J. Advanced spectral classifiers for hyperspectral images: A review. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–32.
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709.
Gao, H.M.; Xu, M.X.; Xu, M.G.; Wang, X.; Huang, F.C. A method of hyperspectral image classification based on posterior probability SVM and MRF. In Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, Tianjin, China, 14–17 July 2013; Volume 1, pp. 235–240.
Li, Y.; Yang, X.; Tang, D.; Zhou, Z. RDTN: Residual Densely Transformer Network for hyperspectral image classification. Expert Syst. Appl. 2024, 250, 123939.
Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491.
Zoubir, H.; Rguig, M.; El Aroussi, M.; Saadane, R.; Chehri, A. Pixel-level concrete bridge crack detection using Convolutional Neural Networks, gabor filters, and attention mechanisms. Eng. Struct. 2024, 314, 118343.
Guan, T.; Wang, C.; Liu, Y.H. Neural markov random field for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5459–5469.
Lee, H.; Kwon, H. Going deeper with contextual CNN for hyperspectral image classification. IEEE Trans. Image Process. 2017, 26, 4843–4855.
Yu, C.; Zhu, Y.; Wang, Y.; Zhao, E.; Zhang, Q.; Lu, X. Concern with Center-Pixel Labeling: Center-Specific Perception Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514614.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
He, X.; Chen, Y.; Ghamisi, P. Heterogeneous transfer learning for hyperspectral image classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3246–3263.
Xu, Q.; Wei, J.; Wu, Q.; Wang, J.; Wang, X.; Liu, J.; Jiang, B. Mix-Mask Augmentation and Self-Reconstruction for Cross-Domain Few-Shot Hyperspectral Image Classification. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
Liu, X.; Ng, A.H.M.; Lei, F.; Ren, J.; Liao, X.; Ge, L. Hyperspectral image classification using a multi-scale cnn architecture with asymmetric convolutions from small to large kernels. Remote Sens. 2025, 17, 1461.
Chen, J.; Yue, J.; Chen, Y.; Zhou, H.; Hu, Z. Nonlinear Activation Functions are Not Necessary: A Lightweight Nonlinear Activation Free Network Based on Multiscale Large Kernel Attention Mechanism for Fault Diagnosis. IEEE Sens. J. 2025, 25, 18926–18940.
Guan, Y.; Li, Z.; Wang, N. A Dense Pyramidal Residual Network with a Tandem Spectral–Spatial Attention Mechanism for Hyperspectral Image Classification. Sensors 2025, 25, 1858.
Gong, G.; Wang, X.; Zhang, J.; Shang, X.; Pan, Z.; Li, Z.; Zhang, J. MSFF: A Multi-Scale Feature Fusion Convolutional Neural Network for Hyperspectral Image Classification. Electronics 2025, 14, 797.
Qin, B.; Feng, S.; Zhao, C.; Li, W.; Tao, R.; Zhou, J. Language-Enhanced Dual-Level Contrastive Learning Network for Open-Set Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5508114.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615.
Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615.
Zhou, Y.; Huang, X.; Yang, X.; Peng, J.; Ban, Y. DCTN: Dual-branch convolutional transformer network with efficient interactive self-attention for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5508616.
Sun, L.; Zhang, H.; Zheng, Y.; Wu, Z.; Ye, Z.; Zhao, H. MASSFormer: Memory-Augmented Spectral-Spatial Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516415.
Chen, N.; Fang, L.; Xia, Y.; Xia, S.; Liu, H.; Yue, J. Spectral query spatial: Revisiting the role of center pixel in transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5402714.
Zhao, Z.; Xu, X.; Li, S.; Plaza, A. Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5511817.
Cheng, S.; Chan, R.; Du, A. MS2I2Former: Multiscale Spatial–Spectral Information Interactive Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5532919.
Al-qaness, M.A.; Wu, G.; AL-Alimi, D. MGCET: MLP-mixer and graph convolutional enhanced transformer for hyperspectral image classification. Remote Sens. 2024, 16, 2892.
Shu, Z.; Liu, Z.; Yu, Z.; Wu, X.J. Dual Feature Aggregation Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5502916.
Fu, G.; Xiong, F.; Lu, J.; Zhou, J. SSUMamba: Spatial-Spectral Selective State Space Model for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5527714.
Sun, Z.; Zhao, R. LLM Security Alignment Framework Design Based on Personal Preference. In Proceedings of the 2024 International Conference on Artificial Intelligence and Future Education, Shanghai, China, 1–2 November 2024; pp. 6–11.
He, L.; Li, J.; Liu, C.; Li, S. Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1579–1597.
Kang, J.; Zhang, Y.; Liu, X.; Cheng, Z. Hyperspectral image classification using spectral–spatial double-branch attention mechanism. Remote Sens. 2024, 16, 193.
Ashraf, M.; Zhou, X.; Vivone, G.; Chen, L.; Chen, R.; Majdard, R.S. Spatial-spectral BERT for hyperspectral image classification. Remote Sens. 2024, 16, 539.
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434.
Cheng, S.; Chan, R.; Du, A. CACFTNet: A Hybrid Cov-Attention and Cross-Layer Fusion Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
Zhang, X.; Shang, S.; Tang, X.; Feng, J.; Jiao, L. Spectral partitioning residual network with spatial attention mechanism for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5507714. [Google Scholar] [CrossRef]
Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
Mei, S.; Song, C.; Ma, M.; Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5539014. [Google Scholar] [CrossRef]
Xi, B.; Zhang, Y.; Li, J.; Zheng, T.; Zhao, X.; Xu, H.; Xue, C.; Li, Y.; Chanussot, J. MCTGCL: Mixed CNN-Transformer for Mars Hyperspectral Image Classification With Graph Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503214. [Google Scholar] [CrossRef]

Figure 1. Overall structure of PGTSEFormer.

Figure 2. Overall structure of PCFormer. (a) PGFormer Block. (b) Prompt-Gated Cross Attention.

Figure 3. (a) False-color composite of the Indian Pines dataset. (b) Ground-truth map containing 16 mutually exclusive land cover classes.

Figure 4. (a) False-color composite of the Salinas dataset. (b) Ground-truth map containing 16 mutually exclusive land cover classes.

Figure 5. (a) False-color composite of the Botswana dataset. (b) Ground-truth map containing 14 mutually exclusive land cover classes.

Figure 6. (a) False-color composite of the WHU-Hi-LongKou dataset. (b) Ground-truth map containing 9 mutually exclusive land cover classes.

Figure 7. (a) False-color composite of the WHU-Hi-HongHu dataset. (b) Ground-truth map containing 22 mutually exclusive land cover classes.

Figure 8. Classification maps for the Indian Pines dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 9. Classification maps for the Salinas dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 10. Classification maps for the Botswana dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 11. Classification maps for the WHU-Hi-LongKou dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 12. Classification maps for the WHU-Hi-HongHu dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 13. Visualization of t-SNE results on the Indian Pines dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) MASSFormer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.

Figure 14. Effect of different training sample proportions on OA. (a) Indian Pines. (b) Salinas. (c) Botswana. (d) WHU-Hi-LongKou. (e) WHU-Hi-HongHu.

Figure 15. Impact of five patch sizes on network performance across the five datasets.

Figure 16. Impact of different learning rates on network performance across the five datasets.

Table 1. Indian Pines land cover sample summary.

Class No.	Class Name	Train	Valid	Test
C01	Alfalfa	3	2	41
C02	Corn-notill	71	72	1285
C03	Corn-mintill	42	41	747
C04	Corn	12	12	213
C05	Grass-pasture	24	24	435
C06	Grass-trees	37	36	657
C07	Grass-pasture-mowed	2	1	25
C08	Hay-windrowed	24	24	430
C09	Oats	1	1	18
C10	Soybean-notill	49	48	875
C11	Soybean-mintill	122	123	2210
C12	Soybean-clean	30	29	534
C13	Wheat	10	10	185
C14	Woods	63	63	1139
C15	Buildings-Grass-Trees-Drives	19	20	347
C16	Stone-Steel-Towers	4	5	84
	Total	513	511	9225

Table 2. Salinas land cover sample summary.

Class No.	Class Name	Train	Valid	Test
C01	Brocoli-green-weeds-1	20	20	1969
C02	Brocoli-green-weeds-2	37	37	3652
C03	Fallow	20	20	1936
C04	Fallow-rough-plow	14	14	1366
C05	Fallow-smooth	27	27	2624
C06	Stubble	40	39	3880
C07	Celery	36	36	3507
C08	Grapes-untrained	113	112	11,046
C09	Soil-vinyard-develop	62	62	6079
C10	Corn-senesced-green-weeds	33	33	3212
C11	Lettuce-romaine-4wk	11	10	1047
C12	Lettuce-romaine-5wk	20	19	1888
C13	Lettuce-romaine-6wk	9	9	898
C14	Lettuce-romaine-7wk	10	11	1049
C15	Vinyard-untrained	73	72	7123
C16	Vinyard-vertical-trellis	18	18	1771
	Total	543	539	53,047

Table 3. Botswana land cover sample summary.

Class No.	Class Name	Train	Valid	Test
C01	Water	14	13	243
C02	Hippo-grass	5	5	91
C03	Hippo-grass 1	13	12	226
C04	Hippo-grass 2	11	10	194
C05	Reeds	14	13	242
C06	Riparian	14	13	242
C07	Firescar	13	13	233
C08	Island interior	10	10	183
C09	Acacia woodlands	16	15	283
C10	Acacia shrublands	13	12	223
C11	Acacia grasslands	15	15	275
C12	Short mopane	9	9	163
C13	Mixed mopane	14	13	241
C14	Chalcedony	5	5	85
	Total	166	158	2924

Table 4. WHU-Hi-LongKou land cover sample summary.

Class No.	Class Name	Train	Valid	Test
C01	Corn	69	69	34,373
C02	Cotton	17	17	8341
C03	Sesame	6	6	3019
C04	Broad-leaf soybean	127	126	62,959
C05	Narrow-leaf soybean	9	8	4134
C06	Rice	24	23	11,807
C07	Water	134	134	66,788
C08	Roads and houses	15	14	7095
C09	Mixed weed	11	10	5208
	Total	412	407	203,724

Table 5. WHU-Hi-HongHu land cover sample summary.

Class No.	Class Name	Train	Valid	Test
C01	Red roof	28	28	13,985
C02	Road	7	7	3498
C03	Bare soil	44	43	21,734
C04	Cotton	326	327	162,632
C05	Cotton firewood	13	12	6193
C06	Rape	89	89	44,379
C07	Chinese cabbage	48	48	24,007
C08	Pakchoi	8	8	4038
C09	Cabbage	22	21	10,776
C10	Tuber mustard	25	25	12,344
C11	Brassica parachinensis	22	22	10,971
C12	Brassica chinensis	18	18	8918
C13	Small Brassica chinensis	45	45	22,417
C14	Lactuca sativa	15	15	7326
C15	Celtuce	2	2	998
C16	Film covered lettuce	14	15	7233
C17	Romaine lettuce	6	6	2998
C18	Carrot	6	7	3204
C19	White radish	18	17	8677
C20	Garlic sprout	7	7	3472
C21	Broad bean	2	3	1323
C22	Tree	8	8	4024
	Total	773	773	385,147

Table 6. Performance comparison of ten methods on the Indian Pines dataset. (Bold data indicates the highest classification accuracy).

Class No.	CNN	Transformers								Our Methods
Class No.	3DCNN	SpectralFormer	SPRN	SSFTT	GAHT	MASSFormer	SQSFormer	MCTGCL	CACFTNet	Ours
1	33.68 ± 3.87	20.73 ± 11.96	100.00 ± 0.00	78.54 ± 14.00	54.88 ± 3.13	68.54 ± 18.78	77.56 ± 19.38	79.76 ± 7.48	93.90 ± 6.29	85.85 ± 8.98
2	72.95 ± 2.13	76.37 ± 2.67	96.09 ± 1.84	92.57 ± 3.10	95.18 ± 0.93	92.06 ± 2.21	92.94 ± 3.26	93.42 ± 1.90	97.77 ± 1.03	98.68 ± 0.60
3	63.77 ± 2.04	61.69 ± 3.42	98.30 ± 1.05	91.87 ± 3.75	94.74 ± 2.31	96.49 ± 1.00	92.28 ± 3.32	92.85 ± 3.27	98.27 ± 1.07	97.50 ± 1.05
4	66.68 ± 4.06	52.77 ± 3.05	97.61 ± 1.45	83.43 ± 4.50	90.80 ± 2.99	93.29 ± 5.04	90.99 ± 5.21	90.61 ± 2.87	92.25 ± 2.67	95.62 ± 2.97
5	86.57 ± 3.96	73.89 ± 4.26	98.17 ± 0.21	96.62 ± 1.08	97.06 ± 1.21	89.47 ± 4.61	95.29 ± 2.38	93.95 ± 2.12	98.07 ± 0.40	99.01 ± 0.80
6	99.07 ± 0.24	88.75 ± 1.63	97.85 ± 0.12	99.16 ± 0.58	99.48 ± 0.21	97.69 ± 1.25	98.81 ± 0.41	97.67 ± 1.48	99.80 ± 0.19	98.57 ± 0.90
7	31.30 ± 11.63	46.00 ± 10.00	97.60 ± 1.96	68.80 ± 25.22	83.20 ± 8.91	81.60 ± 13.76	55.20 ± 15.26	92.40 ± 12.58	87.20 ± 11.57	94.40 ± 10.50
8	99.80 ± 0.19	96.42 ± 1.08	99.77 ± 0.23	98.81 ± 1.26	99.70 ± 0.28	99.00 ± 1.29	99.93 ± 0.15	99.81 ± 0.20	99.81 ± 0.33	99.98 ± 0.07
9	11.88 ± 12.64	33.89 ± 10.08	77.78 ± 16.85	36.67 ± 25.72	25.00 ± 14.33	59.44 ± 29.09	17.22 ± 10.38	47.22 ± 20.53	81.44 ± 12.53	85.00 ± 14.07
10	70.01 ± 1.88	77.50 ± 3.21	94.82 ± 0.70	87.79 ± 2.21	90.16 ± 1.30	91.35 ± 3.91	93.61 ± 2.24	92.75 ± 1.35	95.20 ± 0.64	95.52 ± 1.05
11	84.77 ± 2.94	84.36 ± 1.11	98.34 ± 0.42	96.32 ± 1.55	95.48 ± 1.20	96.87 ± 1.26	96.82 ± 1.42	97.82 ± 0.78	98.55 ± 0.39	98.62 ± 0.32
12	53.74 ± 5.79	44.49 ± 2.35	97.51 ± 0.95	86.24 ± 3.28	94.70 ± 2.00	88.15 ± 5.11	91.39 ± 3.05	88.26 ± 1.65	97.28 ± 1.61	93.93 ± 1.62
13	98.82 ± 0.65	97.41 ± 0.93	97.62 ± 0.97	98.65 ± 1.17	97.46 ± 1.64	98.32 ± 1.85	99.51 ± 0.29	96.76 ± 1.95	96.76 ± 2.56	96.86 ± 1.34
14	96.14 ± 0.36	95.36 ± 0.75	98.84 ± 0.39	97.41 ± 0.92	98.43 ± 0.68	97.34 ± 1.48	99.46 ± 0.40	98.60 ± 0.61	98.40 ± 0.49	99.50 ± 0.21
15	76.82 ± 3.89	66.97 ± 2.91	86.02 ± 3.16	84.99 ± 3.89	87.41 ± 2.68	87.44 ± 8.34	70.37 ± 3.35	86.25 ± 3.19	82.82 ± 3.07	91.50 ± 1.52
16	100.00 ± 0.00	99.52 ± 1.09	98.33 ± 0.79	91.43 ± 5.58	98.45 ± 1.41	85.12 ± 8.80	92.62 ± 3.32	99.05 ± 1.90	97.58 ± 1.64	98.69 ± 1.64
OA (%)	80.62 ± 1.67	78.78 ± 0.87	97.72 ± 0.34	93.52 ± 0.73	95.05 ± 0.42	94.25 ± 0.47	94.37 ± 0.59	94.96 ± 0.72	97.36 ± 0.25	97.91 ± 0.41
AA (%)	71.63 ± 2.52	69.76 ± 1.97	96.23 ± 1.14	86.83 ± 2.72	87.63 ± 1.15	88.89 ± 1.27	85.25 ± 1.10	90.45 ± 2.00	95.36 ± 1.19	92.78 ± 0.84
Kappa (%)	77.74 ± 1.91	75.66 ± 1.01	97.40 ± 0.39	92.60 ± 0.83	94.36 ± 0.48	93.45 ± 0.54	93.57 ± 0.68	94.25 ± 0.82	96.99 ± 0.19	97.61 ± 0.47
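
The OA, AA, and Kappa rows in these comparison tables are standard summary statistics of a confusion matrix. As a reading aid, the following is a minimal sketch of how such values are conventionally computed; it is not the authors' evaluation code. The function name `classification_metrics` and the 3-class `toy` matrix are hypothetical, introduced only for illustration, and the sketch assumes every class has at least one test sample.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, Kappa) in percent from a square confusion matrix.

    conf[i, j] counts test samples of true class i predicted as class j.
    Assumes every true class has at least one sample (no empty rows).
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    row_sums = conf.sum(axis=1)   # samples per true class
    col_sums = conf.sum(axis=0)   # samples per predicted class
    correct = np.diag(conf)       # correctly classified samples per class

    oa = correct.sum() / total                    # overall accuracy
    aa = np.mean(correct / row_sums)              # mean of per-class accuracies
    pe = np.dot(row_sums, col_sums) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                # Cohen's kappa
    return 100 * oa, 100 * aa, 100 * kappa

# Hypothetical 3-class confusion matrix, not taken from the paper.
toy = [[50, 0, 0],
       [5, 40, 5],
       [0, 10, 90]]
oa, aa, kappa = classification_metrics(toy)
print(f"OA={oa:.2f}  AA={aa:.2f}  Kappa={kappa:.2f}")
```

Because AA weights every class equally, a few small classes with low accuracy can pull AA well below OA, which is the pattern visible for the rarer classes in these tables. Kappa can never exceed OA for the same predictions, which gives a quick sanity check on any column.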

Table 7. Performance comparison of ten methods on the Salinas dataset. (Bold data indicates the highest classification accuracy).

Class No.	CNN	Transformers								Our Methods
Class No.	3DCNN	SpectralFormer	SPRN	SSFTT	GAHT	MASSFormer	SQSFormer	MCTGCL	CACFTNet	Ours
1	98.17 ± 0.99	95.09 ± 3.13	100.00 ± 0.00	98.56 ± 1.98	100.00 ± 0.00	96.72 ± 4.11	99.15 ± 1.67	99.82 ± 0.42	99.80 ± 0.27	99.88 ± 0.27
2	99.85 ± 0.09	99.41 ± 0.41	100.00 ± 0.00	99.91 ± 0.11	99.99 ± 0.03	99.39 ± 1.01	99.91 ± 0.10	99.97 ± 0.05	99.76 ± 0.31	100.00 ± 0.00
3	92.49 ± 0.86	94.15 ± 2.45	99.56 ± 0.05	98.83 ± 1.30	99.40 ± 0.73	96.94 ± 1.55	98.00 ± 2.74	97.25 ± 1.87	96.56 ± 6.22	99.67 ± 0.54
4	98.78 ± 0.61	95.24 ± 0.44	99.84 ± 0.11	99.53 ± 0.50	99.93 ± 0.08	97.00 ± 2.87	99.74 ± 0.19	99.80 ± 0.24	99.88 ± 0.10	99.95 ± 0.15
5	89.73 ± 1.98	87.17 ± 2.12	96.72 ± 1.11	97.36 ± 1.59	97.60 ± 1.18	97.03 ± 2.12	97.70 ± 0.93	95.72 ± 1.38	96.33 ± 1.67	96.36 ± 1.20
6	100.00 ± 0.00	99.99 ± 0.01	100.00 ± 0.00	99.98 ± 0.04	100.00 ± 0.00	99.76 ± 0.49	99.99 ± 0.03	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00
7	99.79 ± 0.09	98.00 ± 0.67	99.98 ± 0.01	99.87 ± 0.17	99.89 ± 0.10	99.25 ± 0.88	98.98 ± 0.81	99.60 ± 0.50	99.95 ± 0.06	99.83 ± 0.15
8	87.38 ± 1.10	87.38 ± 2.91	95.28 ± 0.99	88.58 ± 2.21	92.07 ± 2.09	93.29 ± 2.36	31.06 ± 1.52	94.27 ± 1.63	91.61 ± 2.16	97.30 ± 1.29
9	98.46 ± 0.30	97.81 ± 0.57	100.00 ± 0.00	99.38 ± 0.59	99.33 ± 0.47	99.68 ± 0.36	99.17 ± 0.24	99.65 ± 0.12	99.48 ± 0.82	100.00 ± 0.00
10	93.86 ± 0.70	91.94 ± 2.29	99.40 ± 0.17	97.55 ± 1.28	99.57 ± 0.33	97.57 ± 1.37	97.78 ± 0.57	99.58 ± 0.21	98.12 ± 0.77	99.94 ± 0.07
11	70.17 ± 1.81	76.57 ± 6.35	79.97 ± 3.37	71.66 ± 4.35	80.81 ± 2.62	94.52 ± 4.47	94.06 ± 0.88	98.22 ± 1.42	83.48 ± 4.06	98.07 ± 1.51
12	98.63 ± 0.61	97.93 ± 3.51	99.96 ± 0.06	99.79 ± 0.55	99.97 ± 0.08	99.34 ± 1.46	100.00 ± 0.00	99.87 ± 0.36	99.98 ± 0.05	99.99 ± 0.02
13	99.99 ± 0.03	98.83 ± 0.70	99.90 ± 0.20	99.73 ± 0.30	99.82 ± 0.36	99.55 ± 0.57	99.39 ± 1.37	99.61 ± 0.53	99.78 ± 0.42	100.00 ± 0.00
14	99.01 ± 0.18	95.90 ± 1.01	99.19 ± 0.18	98.15 ± 0.97	99.60 ± 0.26	97.66 ± 2.23	96.83 ± 1.54	95.59 ± 1.87	98.36 ± 1.79	99.60 ± 0.26
15	71.73 ± 3.48	75.81 ± 4.84	93.78 ± 1.44	86.82 ± 3.90	89.93 ± 4.17	92.57 ± 2.11	89.81 ± 3.15	89.77 ± 1.74	92.58 ± 1.48	93.88 ± 2.01
16	91.84 ± 3.10	95.05 ± 3.38	98.70 ± 0.26	97.78 ± 0.78	97.92 ± 0.99	96.05 ± 3.89	98.47 ± 1.67	97.65 ± 1.45	98.47 ± 0.91	99.32 ± 0.79
OA (%)	91.19 ± 0.52	91.26 ± 0.90	97.52 ± 0.14	94.70 ± 0.26	96.28 ± 0.29	96.54 ± 0.72	96.00 ± 0.27	96.79 ± 0.36	96.38 ± 0.37	98.74 ± 0.42
AA (%)	93.12 ± 0.43	92.89 ± 1.25	97.67 ± 0.20	95.84 ± 0.20	97.24 ± 0.27	97.27 ± 0.79	97.50 ± 0.22	97.88 ± 0.19	97.19 ± 0.40	99.22 ± 0.50
Kappa (%)	90.19 ± 0.59	90.26 ± 1.00	97.24 ± 0.16	94.10 ± 0.19	95.86 ± 0.32	96.15 ± 0.80	95.55 ± 0.30	96.42 ± 0.40	95.97 ± 0.44	98.60 ± 0.47

Table 8. Performance comparison of ten methods on the Botswana dataset. (Bold data indicates the highest classification accuracy).

Class No.	CNN	Transformers								Our Methods
Class No.	3DCNN	SpectralFormer	SPRN	SSFTT	GAHT	MASSFormer	SQSFormer	MCTGCL	CACFTNet	Ours
1	100.00 ± 0.00	100.00 ± 0.00	97.49 ± 5.89	100.00 ± 0.00	100.00 ± 0.00	98.19 ± 1.38	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	99.75 ± 0.38
2	86.37 ± 3.38	97.47 ± 4.96	99.56 ± 0.88	89.45 ± 8.68	99.01 ± 2.97	93.52 ± 7.65	93.74 ± 3.93	94.62 ± 6.38	100.00 ± 0.00	100.00 ± 0.00
3	99.12 ± 0.44	95.35 ± 2.70	100.00 ± 0.00	98.98 ± 0.91	100.00 ± 0.00	97.04 ± 3.00	100.00 ± 0.00	98.81 ± 1.05	100.00 ± 0.00	100.00 ± 0.00
4	97.63 ± 1.31	89.90 ± 6.94	99.95 ± 0.15	97.27 ± 7.19	99.95 ± 0.15	97.78 ± 4.10	100.00 ± 0.00	95.67 ± 3.90	100.00 ± 0.00	100.00 ± 0.00
5	94.71 ± 2.75	68.97 ± 9.20	98.06 ± 3.43	96.90 ± 1.50	99.13 ± 0.63	89.38 ± 9.50	95.00 ± 3.26	88.18 ± 2.32	94.83 ± 3.04	98.66 ± 3.04
6	71.61 ± 4.06	87.23 ± 4.11	94.05 ± 2.95	93.10 ± 6.38	95.17 ± 0.59	90.45 ± 4.32	96.40 ± 1.59	90.66 ± 3.71	93.14 ± 4.83	97.69 ± 0.98
7	99.61 ± 0.30	99.61 ± 0.56	99.79 ± 0.29	99.66 ± 0.50	100.00 ± 0.00	98.93 ± 1.60	100.00 ± 0.00	99.96 ± 0.13	99.91 ± 0.26	100.00 ± 0.00
8	79.18 ± 5.70	77.05 ± 7.48	100.00 ± 0.00	96.23 ± 4.61	100.00 ± 0.00	95.30 ± 9.19	99.67 ± 0.98	89.56 ± 10.08	100.00 ± 0.00	100.00 ± 0.00
9	97.24 ± 1.77	82.65 ± 8.19	99.26 ± 0.71	97.39 ± 3.38	98.87 ± 1.41	98.48 ± 1.47	99.58 ± 0.70	93.64 ± 4.69	99.47 ± 0.51	100.00 ± 0.00
10	98.25 ± 0.51	91.52 ± 3.75	99.42 ± 1.60	98.83 ± 2.21	99.91 ± 0.27	99.10 ± 1.32	98.97 ± 0.64	99.91 ± 0.27	100.00 ± 0.00	100.00 ± 0.00
11	97.78 ± 1.51	98.51 ± 0.38	98.58 ± 0.91	98.95 ± 1.27	99.16 ± 0.61	96.65 ± 3.08	99.02 ± 1.32	98.11 ± 1.89	97.82 ± 0.91	100.00 ± 0.00
12	99.39 ± 0.48	93.13 ± 2.10	99.88 ± 0.25	99.88 ± 0.25	99.63 ± 0.41	94.79 ± 3.49	99.39 ± 0.87	93.01 ± 4.91	99.20 ± 1.40	99.75 ± 0.30
13	99.13 ± 0.88	97.59 ± 3.74	100.00 ± 0.00	98.80 ± 2.81	100.00 ± 0.00	100.00 ± 0.00	95.85 ± 2.30	99.54 ± 0.29	100.00 ± 0.00	100.00 ± 0.00
14	80.94 ± 27.25	87.41 ± 1.18	98.35 ± 1.20	99.06 ± 0.71	98.47 ± 1.06	89.76 ± 7.72	100.00 ± 0.00	98.12 ± 1.41	99.18 ± 2.47	99.88 ± 0.35
OA (%)	93.96 ± 1.28	90.39 ± 2.12	98.80 ± 0.79	97.75 ± 1.33	99.23 ± 0.18	96.10 ± 1.47	98.48 ± 0.27	95.76 ± 1.31	98.61 ± 0.44	99.48 ± 0.34
AA (%)	92.93 ± 2.38	90.46 ± 1.98	98.88 ± 0.69	97.46 ± 1.46	99.24 ± 0.28	95.67 ± 1.76	98.40 ± 0.23	95.70 ± 1.27	98.79 ± 0.39	99.57 ± 0.28
Kappa (%)	93.45 ± 1.39	89.59 ± 2.29	98.70 ± 0.86	97.56 ± 1.44	99.17 ± 0.19	95.78 ± 1.59	98.36 ± 0.29	95.41 ± 1.42	98.50 ± 0.47	99.44 ± 0.37

Table 9. Performance comparison of ten methods on the WHU-Hi-LongKou dataset. (Bold data indicates the highest classification accuracy).

Class No.	CNN	Transformers								Our Methods
Class No.	3DCNN	SpectralFormer	SPRN	SSFTT	GAHT	MASSFormer	SQSFormer	MCTGCL	CACFTNet	Ours
1	99.47 ± 0.09	99.85 ± 0.05	99.98 ± 0.01	99.92 ± 0.03	99.92 ± 0.04	99.37 ± 0.36	99.96 ± 0.03	99.79 ± 0.18	99.85 ± 0.26	99.85 ± 0.14
2	87.99 ± 2.10	86.82 ± 2.00	96.84 ± 0.75	97.12 ± 1.15	96.35 ± 0.73	93.85 ± 3.11	88.88 ± 3.55	93.12 ± 3.30	98.46 ± 1.24	98.53 ± 0.65
3	37.57 ± 6.19	92.78 ± 2.40	97.13 ± 1.27	91.74 ± 3.82	93.92 ± 3.29	87.48 ± 4.87	95.23 ± 1.99	95.86 ± 5.45	90.23 ± 8.00	98.86 ± 1.50
4	98.70 ± 0.18	97.45 ± 0.62	99.42 ± 0.07	98.68 ± 0.21	99.23 ± 0.23	98.17 ± 0.54	98.54 ± 0.37	98.79 ± 0.23	98.93 ± 0.36	99.29 ± 0.31
5	16.14 ± 7.81	72.02 ± 7.85	93.35 ± 1.17	89.24 ± 4.78	86.87 ± 5.89	83.41 ± 6.88	71.98 ± 9.42	78.61 ± 9.80	83.78 ± 3.64	96.72 ± 1.02
6	98.88 ± 0.17	92.33 ± 3.63	99.84 ± 0.15	99.36 ± 0.35	99.71 ± 0.12	97.47 ± 2.34	99.87 ± 0.09	98.79 ± 0.61	99.21 ± 0.69	99.89 ± 0.09
7	99.99 ± 0.00	99.97 ± 0.01	99.99 ± 0.00	99.96 ± 0.03	99.95 ± 0.04	99.55 ± 0.36	99.99 ± 0.00	99.70 ± 0.20	99.99 ± 0.01	99.84 ± 0.13
8	80.26 ± 2.70	77.60 ± 3.81	93.77 ± 1.68	93.93 ± 2.83	91.52 ± 4.16	82.35 ± 6.67	90.66 ± 3.34	80.64 ± 2.79	94.79 ± 2.24	94.01 ± 2.32
9	68.60 ± 1.60	64.08 ± 2.65	87.90 ± 0.69	86.57 ± 3.76	88.20 ± 2.06	84.37 ± 5.70	93.31 ± 1.73	85.67 ± 3.23	85.82 ± 4.41	91.01 ± 2.57
OA (%)	94.83 ± 0.31	95.82 ± 0.30	98.97 ± 0.08	98.52 ± 0.25	98.61 ± 0.22	97.25 ± 0.40	97.94 ± 0.26	97.60 ± 0.40	98.85 ± 0.24	99.18 ± 0.29
AA (%)	76.40 ± 1.61	86.99 ± 0.89	96.47 ± 0.29	95.17 ± 1.16	95.07 ± 1.19	91.78 ± 1.62	93.16 ± 1.21	92.33 ± 1.72	94.56 ± 1.16	97.57 ± 1.02
Kappa (%)	93.15 ± 0.41	94.49 ± 0.39	98.65 ± 0.10	98.05 ± 0.33	98.17 ± 0.29	96.38 ± 0.52	97.29 ± 0.34	96.84 ± 0.52	98.05 ± 0.32	98.93 ± 0.37

Table 10. Performance comparison of ten methods on the WHU-Hi-HongHu dataset. (Bold data indicates the highest classification accuracy).

Class No.	CNN	Transformers								Our Methods
Class No.	3DCNN	SpectralFormer	SPRN	SSFTT	GAHT	MASSFormer	SQSFormer	MCTGCL	CACFTNet	Ours
1	76.90 ± 1.44	93.92 ± 1.35	95.85 ± 1.14	83.59 ± 3.20	94.85 ± 1.24	95.01 ± 1.99	82.03 ± 8.21	94.67 ± 2.57	94.30 ± 3.73	90.15 ± 3.86
2	76.65 ± 1.64	76.89 ± 2.53	76.71 ± 3.46	53.60 ± 15.96	78.62 ± 3.07	72.01 ± 7.21	53.12 ± 18.87	80.36 ± 4.86	41.89 ± 16.97	77.20 ± 11.87
3	92.76 ± 0.43	87.32 ± 1.02	94.83 ± 0.92	93.86 ± 1.42	92.72 ± 1.14	86.40 ± 1.56	93.82 ± 2.07	89.54 ± 2.51	93.10 ± 1.93	92.47 ± 2.00
4	99.28 ± 0.57	98.84 ± 0.24	99.34 ± 0.15	99.34 ± 0.28	99.36 ± 0.18	99.57 ± 0.14	99.76 ± 0.18	99.51 ± 0.18	99.50 ± 0.24	99.56 ± 0.36
5	29.62 ± 13.74	73.60 ± 5.54	75.27 ± 4.46	47.84 ± 11.37	79.20 ± 5.49	73.49 ± 5.30	30.32 ± 10.77	74.12 ± 5.70	22.52 ± 10.14	78.59 ± 3.81
6	92.03 ± 0.45	92.02 ± 0.77	95.79 ± 0.83	94.57 ± 1.12	94.10 ± 1.84	94.35 ± 0.87	93.19 ± 1.43	94.67 ± 0.90	93.44 ± 3.00	96.83 ± 1.49
7	79.21 ± 2.59	84.83 ± 3.15	86.50 ± 1.18	83.85 ± 3.51	86.22 ± 3.09	86.80 ± 2.37	85.15 ± 4.04	87.46 ± 2.88	87.49 ± 3.25	88.28 ± 4.11
8	2.89 ± 1.35	14.25 ± 3.09	27.92 ± 2.80	0.54 ± 0.82	25.11 ± 4.56	26.70 ± 3.77	0.84 ± 0.71	30.39 ± 6.08	0.67 ± 1.17	28.14 ± 10.32
9	94.14 ± 0.71	95.44 ± 1.19	99.00 ± 0.40	97.17 ± 1.09	97.94 ± 0.75	95.16 ± 0.99	94.87 ± 1.61	95.34 ± 1.04	97.41 ± 1.05	97.64 ± 1.76
10	57.94 ± 4.56	61.05 ± 3.69	82.52 ± 2.31	77.24 ± 5.15	84.77 ± 4.18	79.59 ± 2.74	59.13 ± 5.57	82.72 ± 2.91	65.00 ± 8.05	86.67 ± 3.64
11	43.47 ± 3.88	60.13 ± 3.36	71.26 ± 4.46	38.11 ± 10.23	67.88 ± 5.64	71.87 ± 5.59	58.66 ± 5.03	69.78 ± 3.29	46.42 ± 8.37	73.69 ± 7.07
12	50.77 ± 3.26	56.65 ± 7.00	66.96 ± 2.28	50.18 ± 14.06	70.77 ± 6.09	71.01 ± 3.60	33.19 ± 15.32	70.26 ± 5.90	32.75 ± 21.50	72.08 ± 5.08
13	76.29 ± 2.50	72.24 ± 5.21	78.40 ± 2.71	80.66 ± 4.42	80.05 ± 1.86	82.27 ± 1.50	78.93 ± 3.69	82.16 ± 1.95	76.95 ± 4.24	83.12 ± 2.28
14	53.08 ± 2.65	78.27 ± 3.35	90.06 ± 2.04	69.82 ± 7.51	83.57 ± 3.98	84.12 ± 5.69	55.74 ± 11.97	86.33 ± 6.17	68.73 ± 19.33	88.41 ± 3.25
15	7.49 ± 3.28	46.58 ± 10.06	61.13 ± 9.39	0.66 ± 1.59	49.38 ± 9.83	42.42 ± 9.15	0.00 ± 0.00	41.20 ± 9.71	0.00 ± 0.00	45.18 ± 17.49
16	86.56 ± 3.30	93.45 ± 3.09	96.36 ± 0.76	93.75 ± 2.85	93.23 ± 2.64	94.62 ± 2.43	90.47 ± 8.29	92.92 ± 6.42	96.55 ± 2.63	97.19 ± 2.63
17	43.30 ± 2.93	52.17 ± 14.88	70.12 ± 1.07	32.36 ± 25.52	64.03 ± 5.88	66.37 ± 6.32	23.88 ± 27.11	67.71 ± 3.21	3.34 ± 5.43	72.97 ± 5.30
18	11.31 ± 9.33	53.55 ± 6.11	58.73 ± 10.21	1.79 ± 4.83	68.09 ± 9.76	70.49 ± 8.44	3.96 ± 9.23	64.96 ± 9.20	0.07 ± 0.13	73.42 ± 11.84
19	79.74 ± 1.39	81.63 ± 3.05	93.39 ± 1.51	66.44 ± 18.03	90.16 ± 3.86	86.46 ± 4.02	82.11 ± 5.90	85.86 ± 5.39	83.30 ± 5.84	91.32 ± 4.24
20	68.44 ± 4.65	82.74 ± 2.78	92.30 ± 2.48	16.15 ± 18.03	79.94 ± 8.13	72.92 ± 13.83	56.23 ± 15.84	70.85 ± 10.75	58.35 ± 16.71	74.14 ± 15.09
21	0.00 ± 0.00	11.12 ± 3.77	0.00 ± 0.00	0.00 ± 0.00	0.08 ± 0.15	11.97 ± 5.64	0.00 ± 0.00	14.42 ± 6.22	0.00 ± 0.00	11.43 ± 5.45
22	41.73 ± 7.83	65.56 ± 7.91	75.38 ± 6.66	39.93 ± 16.19	72.98 ± 11.34	77.58 ± 6.66	46.07 ± 15.88	83.84 ± 5.22	28.69 ± 10.00	78.79 ± 10.73
OA (%)	83.97 ± 0.30	87.38 ± 0.36	91.41 ± 0.22	85.22 ± 0.55	90.84 ± 0.41	90.51 ± 0.39	84.82 ± 0.29	90.88 ± 0.54	84.88 ± 0.66	92.57 ± 0.33
AA (%)	57.44 ± 1.14	69.65 ± 0.79	76.72 ± 1.02	55.52 ± 1.84	75.14 ± 1.56	74.60 ± 1.28	55.52 ± 1.65	75.41 ± 1.05	54.11 ± 1.78	79.01 ± 1.16
Kappa (%)	79.56 ± 0.35	84.02 ± 0.45	89.11 ± 0.28	81.18 ± 0.72	88.40 ± 0.52	87.97 ± 0.49	80.63 ± 0.37	88.46 ± 0.68	80.73 ± 0.84	90.58 ± 0.43

Table 11. Comparison of model parameters and FLOPs on the Indian Pines, Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets.

Methods	Indian Pines		Salinas		Botswana		WHU-Hi-LongKou		WHU-Hi-HongHu
Methods	Parameters/k	FLOPs/MMac	Parameters/k	FLOPs/MMac	Parameters/k	FLOPs/MMac	Parameters/k	FLOPs/MMac	Parameters/k	FLOPs/MMac
3DCNN	263.596	45.19568	263.596	46.08224	189.594	32.79048	201.989	60.90368	449.52	60.90
Spectralformer	342.649	15.61216	352.405	16.235776	227.844	8.283392	540.644	28.299072	544.86	28.30
SPRN	283.348	21.023393	281.055	21.023393	281.055	20.892897	248.887	21.178785	186.56	11.18
SSFTT	931.848	16.156424	950.28	16.482441	678.278	11.673576	1253.953	21.861256	1250	21.86
GAHT	830.8	13.888512	972.624	14.304128	800.718	11.971456	1514.121	32.713279	1510	32.71
Massformer	304.51	25.54	316.27	25.54	301.11	25.54	327.64	25.54	327.64	25.54
SQSFormer	421.968	19.774	424.272	20.163776	390.03	14.420224	368.65	25.587584	463.06	26.85
MCTGCL	271.44	33.28	271.44	33.28	272.18	33.28	273.456	33.28	275.54	33.28
CACFTNet	3270	66.71	3360	69.44	2023	35.69	4930	120.56	4930	120.57
Ours	649.981	38.151188	204.605	36.926	189.37	24.797	273.456	31.3309	274.30	31.33

Table 12. Comparison of CHPA and PCFormer on five datasets. (Bold font indicates the highest classification performance).

Datasets	CHPA	PCFormer	OA (%)	AA (%)	Kappa (%)
Indian Pines	×	×	96.25 ± 0.83	92.01 ± 1.71	95.73 ± 0.95
	🗸	×	97.07 ± 2.32	89.40 ± 2.83	96.43 ± 0.82
	×	🗸	97.46 ± 0.40	91.35 ± 0.69	97.10 ± 0.46
	🗸	🗸	97.91 ± 0.41	92.78 ± 0.84	97.61 ± 0.47
Salinas	×	×	96.58 ± 0.46	97.74 ± 0.52	96.19 ± 0.96
	🗸	×	97.69 ± 1.07	98.37 ± 1.19	97.43 ± 1.20
	×	🗸	98.32 ± 0.16	98.99 ± 0.14	98.13 ± 0.18
	🗸	🗸	98.74 ± 0.42	99.22 ± 0.50	98.47 ± 0.17
Botswana	×	×	97.29 ± 1.72	97.24 ± 1.65	97.06 ± 1.87
	🗸	×	98.81 ± 0.72	98.90 ± 0.66	98.37 ± 0.77
	×	🗸	98.32 ± 0.16	98.99 ± 0.14	98.13 ± 0.18
	🗸	🗸	99.48 ± 0.34	99.57 ± 0.28	99.44 ± 0.37
WHU-Hi-LongKou	×	×	98.44 ± 0.44	94.47 ± 1.00	97.95 ± 0.31
	🗸	×	98.62 ± 0.34	95.61 ± 2.28	98.13 ± 0.27
	×	🗸	99.03 ± 0.16	97.30 ± 0.87	98.60 ± 0.20
	🗸	🗸	99.18 ± 0.29	97.57 ± 1.02	98.93 ± 0.37
WHU-Hi-HongHu	×	×	90.13 ± 0.31	73.14 ± 1.16	87.48 ± 0.40
	🗸	×	91.49 ± 0.37	75.33 ± 1.55	89.22 ± 0.48
	×	🗸	91.68 ± 0.37	75.85 ± 1.09	89.73 ± 0.48
	🗸	🗸	92.57 ± 0.33	79.01 ± 1.16	90.58 ± 0.43

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

