Article

Prompt-Gated Transformer with Spatial–Spectral Enhancement for Hyperspectral Image Classification

School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2705; https://doi.org/10.3390/rs17152705
Submission received: 11 June 2025 / Revised: 25 July 2025 / Accepted: 1 August 2025 / Published: 4 August 2025

Abstract

Hyperspectral image (HSI) classification is an important task in the field of remote sensing, with far-reaching practical significance. Most Convolutional Neural Networks (CNNs) focus only on local spatial features and ignore global spectral dependencies, making it difficult to fully extract the spectral information in HSI. In contrast, Vision Transformers (ViTs) are widely used in HSI classification due to their superior feature extraction capabilities. However, existing Transformer models struggle to achieve spectral–spatial feature fusion while maintaining local structural consistency, making it difficult to balance global modeling capability and local representation. To this end, we propose a Prompt-Gated Transformer with Spatial–Spectral Enhancement (PGTSEFormer) network, which includes a Channel Hybrid Positional Attention (CHPA) module and a Prompt Cross-Former (PCFormer). The CHPA module adopts a dual-branch architecture to concurrently capture spectral and spatial positional attention, thereby enhancing the model’s discriminative capacity for complex feature categories through adaptive weight fusion. PCFormer introduces a Prompt-Gated mechanism and a grouping strategy to effectively model cross-regional contextual information while maintaining local consistency, which significantly enhances long-range dependency modeling. Experiments were conducted on five HSI datasets, yielding overall accuracies of 97.91%, 98.74%, 99.48%, 99.18%, and 92.57% on the Indian Pines, Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets, respectively. The experimental results demonstrate the effectiveness of the proposed approach.

1. Introduction

Hyperspectral imaging (HSI) is an advanced remote sensing technique capable of capturing detailed spectral information of surface objects in hundreds of continuous spectral bands. Each pixel not only carries spatial location information, but also contains a unique spectral response curve [1], making different materials and features highly distinguishable. Owing to its rich data properties, HSI finds extensive application in fields including agricultural monitoring, environmental management, urban planning, geological exploration, and national defense [2]. As a core task in hyperspectral data processing, HSI classification aims to assign an accurate land cover or object class to each pixel [3]. Nevertheless, this task faces significant challenges owing to the complexity of high-dimensional data and the limited availability of labeled samples. Therefore, the development of efficient feature extraction and classification methods has become a key direction in current HSI research.
Early HSI classification methods relied heavily on hand-designed features combined with shallow classifiers, such as those based on the Spectral Angle Mapper (SAM), Support Vector Machines (SVMs) [4], and K-Nearest Neighbors [5]. These methods mainly exploit spectral information, performing classification based on the similarity of spectral curves. However, they ignore spatial contextual information, which limits classification accuracy in scenes with complex feature distributions. To overcome the reliance on spectral information alone, researchers proposed joint spatial–spectral methods such as Morphological Profiles [6], Gabor Filtering [7], and Markov Random Fields [8]. These methods enhance classification performance by fusing spatial texture with spectral features. While they greatly improved classification accuracy, they heavily rely on expert feature design and struggle to adapt to multi-scale spatial structures, leading to suboptimal results in complex scenes.
In recent years, rapid advances in deep learning have brought new breakthroughs to HSI classification research. CNNs [9] have become an efficient feature extraction tool in HSI by virtue of their strength in local feature extraction, but they have also gradually exposed some inherent limitations [10]. Firstly, the local receptive field of CNNs is limited and a convolution only sees a fixed neighborhood, which makes it difficult to capture global contextual information [11]. Secondly, high-dimensional spectral bands generate a huge number of parameters, which often have to be reduced through spectral dimensionality reduction; however, this process inevitably causes the model to lose some spectral information. Additionally, an excessive number of convolutions can cause the model to overfit, limiting its ability to fully leverage the spectral information. To compensate for the limited receptive field, researchers have tried allowing CNNs to extract features at larger or multiple scales [12]. For example, spectral–spatial networks such as 3D-CNNs were proposed to convolve jointly over the spectral and spatial dimensions, thereby better characterizing the joint spectral–spatial pattern [13]. Other methods acquire spatial context at different scales through multi-scale convolution or pyramid structures, taking into account both local details and large-scale structures; these improvements help enhance the adaptability of CNNs to complex scenes [14]. To this end, Liu et al. [15] proposed a multiscale large-kernel asymmetric convolutional network, which combines a spectral feature extraction module and a multiscale large-kernel asymmetric convolutional module to efficiently capture both local and global spatial features. Guan et al. [16] introduced a dense pyramidal residual network that utilizes a combined spectral–spatial attention mechanism, enabling it to capture intricate spectral and spatial features in HSI. Gong et al. [17] introduced a novel Multiscale Feature Fusion Convolutional Network, incorporating a multiscale convolutional architecture designed to extract both spectral and spatial features. Despite the strong performance of these 3D-CNN architectures in joint spatial–spectral modeling, their local convolution kernels have difficulty capturing long-range dependencies [18], limiting the global modeling capability of the models.
With the introduction of the Transformer model into the field of HSI [19], its powerful global modeling capability has been widely exploited. However, the detrimental effect of small sample sizes on Transformer training has become prominent: with insufficient training samples, Transformers easily overfit or become unstable during training [20]. To compensate for the limitations of Transformers in local feature extraction, many studies have adopted CNN–Transformer hybrid architectures, leveraging the powerful local feature extraction of CNNs and the global modeling advantages of Transformers. Hong et al. [21] proposed the SpectralFormer model to further exploit the Transformer’s modeling advantages on spectral sequences. In addition, researchers have successively proposed many innovative Transformer-based models for HSI. Zhou et al. [22] proposed a dual-branch convolutional Transformer network with an efficient interactive self-attention mechanism. Sun et al. [23] introduced a memory enhancement mechanism for spatial–spectral feature fusion. Chen et al. [24] explored the role of the center pixel in a Transformer. Zhao et al. [25] proposed a lightweight model based on group convolution. Cheng et al. [26] proposed a multi-scale spatial–spectral information interaction Transformer. Wu et al. [27] combined the MLP-mixer and graph convolution in an enhanced Transformer. Shu et al. [28] effectively enhanced feature extraction and fusion capabilities through a dual feature aggregation module and a cross-attention aggregation mechanism. These models significantly improved classification accuracy by introducing hybrid structures, attention mechanisms, or multi-scale feature fusion strategies, demonstrating that the CNN–Transformer hybrid architecture can effectively overcome the limitations of single architectures by complementing each other’s advantages.
In contrast, although the Mamba architecture has demonstrated excellent performance in spatial–spectral dependency modeling, and the SSUMamba model proposed by Fu et al. [29] excels in HSI denoising tasks, its primary advantage lies in denoising rather than classification accuracy. The Mamba architecture has lower computational complexity but lacks the flexibility and diversity in feature extraction of the CNN–Transformer hybrid architecture. In HSI classification, the CNN–Transformer hybrid architecture can better combine local details with global information, providing more precise classification results.
Therefore, despite the Mamba architecture’s high computational efficiency and performance in certain tasks, the CNN–Transformer hybrid architecture is more suitable for HSI due to its stronger local feature extraction capabilities and global modeling advantages. By optimizing the fusion of local and global features, the CNN–Transformer architecture fully leverages the strengths of both, significantly improving classification accuracy. Based on this, this study selected the CNN–Transformer hybrid architecture as the core framework for further exploration of its application potential and advantages in HSI.
Meanwhile, prompting, as a new paradigm for large model tuning, has shown great potential for efficient feature guidance. For example, Zhang et al. [30] proposed an efficient fine-tuning method based on zero-initialized attention, which efficiently controls the feature flow through prompting factors. This idea inspired us to introduce prompting factors into the HSI task and further improve the classification performance by designing an adaptive prompting factor mechanism to guide the model to achieve more accurate feature selection during the fusion of spectral and spatial features.
The methods mentioned above have made significant progress in fusing spectral and spatial features, but there is still much room for improvement in fully extracting spatial and spectral information while maintaining the consistency of the local spatial structure. To address the problems discussed above, we propose a Prompt-Gated Transformer with Spatial–Spectral Enhancement for HSI, which combines the advantages of CNNs and Transformers and optimizes the attention mechanism. The network mainly consists of two parts: the Channel Hybrid Positional Attention (CHPA) module and the Prompt Cross-Former (PCFormer). The CHPA module exploits its dual-branch design to mine the deep spectral and spatial information in HSI and perform feature fusion. On this basis, we use the PCFormer to establish global contextual links and control the feature flow through prompting factors and a gating mechanism, thereby enhancing the model’s ability to express long-range feature dependencies. The main contributions of this paper are as follows:
  • In this paper, a Prompt-Gated Transformer with Spatial–Spectral Enhancement is proposed. The network not only makes full use of the global feature extraction capability of the Transformer, but also introduces prompting factors into the field of HSI.
  • To compensate for Transformer’s insufficient modeling of spatial structure, this paper proposes a Channel Hybrid Positional Attention (CHPA) module for HSI. The positional attention introduced by this module can enhance the extraction of spatial structure information, so that the model focuses on the spatial continuity of similar features and the boundary of dissimilar features. The channel weighting mechanism of CHPA can filter out unimportant channels and highlight key spectral features. This helps alleviate the curse of dimensionality in high-dimensional spectral data and improves the model’s ability to utilize effective spectral information.
  • In order to solve the problem that traditional self-attention mechanisms tend to be globally relevant and ignore local spectral–spatial details, this paper proposes Prompt Cross-Former (PCFormer) for HSI. The PCFormer includes AttenMix and PGFormer Block. In the PGFormer Block, we design the Prompt-Gated Cross Attention (PGCA), which uses a learnable prompt-gating mechanism to adaptively pass training prompts into the self-attention layer of the Transformer, to guide the attention to focus on effective features.
The main research of this paper will be presented in detail in the subsequent sections. Section 2 details our proposed methodology. Section 3 describes the experimental setup in detail, as well as some advanced technological approaches. Section 4 shows the experimental results and analyses them in depth. Section 5 discusses the effects of different parameter configurations on the model performance and the ablation experiments. Finally, Section 6 summarizes our contributions and looks to the future.

2. Methodology

In this section, the PGTSEFormer network for HSI is introduced, together with its overall structure and its application in HSI. Section 2.1 and Section 2.2 then give the basic structures and principles of the CHPA module and the PCFormer, respectively.
The proposed PGTSEFormer is shown in Figure 1. Its overall framework consists of five parts: the channel tuning part, the Channel Hybrid Positional Attention (CHPA) module, the Prompt Cross-Former (PCFormer), the global average pooling (GAP) layer, and finally the fully connected (FC) layer with a softmax classifier.
In the PGTSEFormer framework, the network accepts a series of 3D data cubes as input. Since datasets vary in their spectral dimensions, we first align these dimensions using a channel tuning mechanism to ensure consistent processing across datasets. Specifically, a series of 2D convolutional layers is employed to calibrate the channels and simultaneously extract preliminary spatial–spectral features. Subsequently, the CHPA module is introduced, which adaptively directs the model to focus on key channel information and spatial regions, suppressing irrelevant or interfering information.
After the initial feature extraction, the PCFormer is applied; it is composed of two core modules: the AttenMix and the PGFormer Block. The AttenMix module integrates DWConv, PWConv, a weight branching structure, and channel blending. This module serves as a pre-processing step that prepares features for global feature extraction by the Transformer. It mainly extracts local spatial–spectral features using a lightweight convolutional structure and conducts interactive channel blending through a channel attention mechanism, thereby improving the spatial representation of the model. Following the AttenMix module, the PGFormer Block further enhances the model’s ability to model global context. This module incorporates Prompt-Gated Cross Attention (PGCA), which employs group partitioning and hierarchical strategies to model spatial attention both within and across groups. These strategies ensure spatial structural consistency while enabling the capture of long-range contextual dependencies across regions and groups. As a result, the model’s ability to interpret complex spatial relationships is significantly enhanced. Ultimately, the extracted feature maps are downscaled to 1 × 1 spatial dimensions by GAP and flattened into one-dimensional vectors, which are passed to the FC layer for the final classification prediction.
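To make the data flow concrete, the following PyTorch-style sketch composes the five stages; the CHPA and PCFormer stages are shown as placeholders for the modules detailed in Sections 2.1 and 2.2, and the embedding width is an illustrative assumption rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PGTSEFormerSketch(nn.Module):
    """Minimal sketch of the five-stage pipeline described above. The CHPA and
    PCFormer stages are represented by nn.Identity placeholders; layer widths
    are illustrative assumptions, not the authors' exact settings."""
    def __init__(self, in_bands, num_classes, dim=64):
        super().__init__()
        self.channel_tuning = nn.Sequential(        # align the spectral dimension
            nn.Conv2d(in_bands, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.chpa = nn.Identity()                   # stands in for the CHPA module (Section 2.1)
        self.pcformer = nn.Identity()               # stands in for the PCFormer (Section 2.2)
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(dim, num_classes)       # softmax is applied inside the loss

    def forward(self, x):                           # x: (B, bands, patch, patch) data cube
        x = self.channel_tuning(x)
        x = self.chpa(x)
        x = self.pcformer(x)
        x = self.gap(x).flatten(1)                  # (B, dim)
        return self.fc(x)

# Example: a batch of 12 x 12 patches with 200 spectral bands and 16 classes.
logits = PGTSEFormerSketch(in_bands=200, num_classes=16)(torch.randn(8, 200, 12, 12))
```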

2.1. Channel Hybrid Positional Attention Module

HSI is characterized by high dimensionality and redundancy [31], placing greater demands on the model’s feature extraction capability. To improve the network’s ability to perceive key spectral channels and attend to important spatial regions, this paper proposes an efficient two-branch attention mechanism, CHPA, and integrates it into the shallow feature extraction stage of the network. As shown in Figure 1, the CHPA module combines channel attention with local spatial position attention and incorporates cross-attention interaction to enhance feature representations. This design strengthens the feature characterization capability from both the spectral and spatial dimensions, thereby improving the model’s discriminative performance on complex feature classes. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels, and $H$ and $W$ represent the height and width of the feature map, the CHPA module first splits the input $X$ equally along the channel dimension into two parts:
$$ X_1, X_2 = \mathrm{Split}(X, C/2, \mathrm{dim}=1) $$
where $X_1 \in \mathbb{R}^{\frac{C}{2} \times H \times W}$ is used for channel attention modeling, while $X_2 \in \mathbb{R}^{\frac{C}{2} \times H \times W}$ is used for positional attention modeling.
To capture the importance distribution along the spectral dimension, the channel attention branch first applies adaptive global average pooling to extract global contextual information:
$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_1(i, j) $$
where $z_c$ denotes the aggregated descriptor for each channel and $X_1(i, j)$ refers to the value at spatial position $(i, j)$ in the feature map $X_1$.
To facilitate subsequent processing, the aggregated descriptor is reshaped into a 1D vector:
$$ z = \mathrm{reshape}(z_c) = z_c.\mathrm{view}(B, 1, C) $$
then, it is passed into a dynamic convolution layer to perform cross-channel information interaction and generate channel attention weights:
$$ A_c = \sigma\big(F_{1D}(z_c)\big) $$
where $F_{1D}(\cdot)$ is a 1D convolution with the kernel size adaptively adjusted based on the number of channels, allowing it to adapt to multi-scale cross-channel relations with a wide receptive field, and $\sigma(\cdot)$ denotes the sigmoid function. The resulting attention weights $A_c \in \mathbb{R}^{B \times C/2 \times 1 \times 1}$ are reshaped back to $A_c \in \mathbb{R}^{B \times C \times 1 \times 1}$ and applied to the input feature $X_1$:
$$ X_{\mathrm{att}} = X_1 \odot A_c $$
where ⊙ denotes element-wise multiplication. The process dynamically learns the importance weights of different spectral channels to highlight highly responsive channels and suppress redundant channels.
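A minimal PyTorch sketch of this channel attention branch is given below; a fixed 1D kernel size is used for simplicity, whereas the branch described above adapts it to the number of channels.

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Sketch of the channel attention branch: global average pooling, a 1D
    convolution for cross-channel interaction, a sigmoid, and channel reweighting."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # The paper adapts the 1D kernel size to the channel count; a fixed
        # size is used here for simplicity.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x1):                                 # x1: (B, C/2, H, W)
        b, c, _, _ = x1.shape
        z = self.pool(x1).view(b, 1, c)                    # z_c reshaped to (B, 1, C/2)
        a = self.sigmoid(self.conv1d(z)).view(b, c, 1, 1)  # A_c
        return x1 * a                                      # X_att = X_1 ⊙ A_c
```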
In HSI, spatial structure is also important for feature recognition. To further improve the model’s capacity to grasp spatial structures, we introduce a local spatial position attention branch. This branch models long-range spatial dependencies using an innovative multi-way compression mechanism, designed to capture local positional relationships within feature maps. Given an input feature map $X_2 \in \mathbb{R}^{C/2 \times H \times W}$, feature compression is applied along the horizontal and vertical directions, respectively, to efficiently extract spatial structure features:
$$ z_h(i) = \frac{1}{W} \sum_{j=1}^{W} X_2(i, j), \quad i \in [1, H] $$
$$ z_w(j) = \frac{1}{H} \sum_{i=1}^{H} X_2(i, j), \quad j \in [1, W] $$
where $z_h \in \mathbb{R}^{\frac{C}{2} \times H \times 1}$ and $z_w \in \mathbb{R}^{\frac{C}{2} \times 1 \times W}$, respectively, retain the spatial contextual information along the horizontal and vertical dimensions. Compared to traditional global pooling, this decomposed compression approach can effectively capture directional features. Next, $z_h$ is transposed to match the shape of $z_w$ for subsequent concatenation:
$$ z_h = z_h^{\top} \in \mathbb{R}^{B \times C/2 \times 1 \times H} $$
then, concatenate and apply a non-linear transformation:
$$ z_{\mathrm{cat}} = \mathrm{Concat}(z_h, z_w) \in \mathbb{R}^{C/2 \times (H + W)} $$
$$ \hat{z} = \mathrm{ReLU}\big(\mathrm{BN}(W_1 z_{\mathrm{cat}} + b_1)\big) $$
where $W_1 \in \mathbb{R}^{\frac{C}{8} \times \frac{C}{2}}$ is the dimensionality-reduction convolution kernel, which reduces the number of parameters by compressing the channels by a factor of 8, and $b_1 \in \mathbb{R}^{\frac{C}{8}}$ is the bias term. The fused features are then split again into two directional representations, which are used to generate attention weights in the horizontal and vertical directions:
$$ [\hat{z}_h, \hat{z}_w] = \mathrm{Split}(\hat{z}, \mathrm{dims} = [H, W]) $$
each directional feature is processed separately through a 1×1 convolution, followed by a Sigmoid function.
$$ A_s^h = \sigma\big(z_h \cdot \mathrm{Conv}_{1 \times 1}^{h}(\hat{z}_h) + b_h\big) $$
$$ A_s^w = \sigma\big(z_w \cdot \mathrm{Conv}_{1 \times 1}^{w}(\hat{z}_w) + b_w\big) $$
where $z_h \in \mathbb{R}^{H \times \frac{C}{8}}$ represents the horizontal attention projection matrix, $z_w \in \mathbb{R}^{W \times \frac{C}{8}}$ represents the vertical attention projection matrix, and $b_h \in \mathbb{R}^{H}$ and $b_w \in \mathbb{R}^{W}$ are the corresponding bias terms. All channels share the same set of $z_h$ and $z_w$. The two attention maps are then combined through matrix multiplication to establish attention correlations between rows and columns:
$$ X_{\mathrm{pos}} = \mathrm{Broadcast}(A_s^h A_s^w) \in \mathbb{R}^{C/2 \times H \times W} $$
The local spatial positional attention branch aims to capture the local structural information in an image from the spatial dimension, especially the edges, textures, and their spatial relationships in the image. The core idea of this branch is to learn positional dependencies along different spatial directions by applying directional average pooling and convolutional operations. Additionally, shared weights are employed to enable efficient long-range dependency modeling, which in turn enhances the model’s ability to recognize complex spatial patterns.
Through the weighted fusion of the channel attention branch and the local spatial location attention branch, the model is able to extract richer feature representations from both the spectral and spatial dimensions. In summary, the CHPA module strengthens the model’s capacity to recognize intricate structures and subtle category differences, thereby boosting its overall classification performance.
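The following coordinate-attention-style sketch illustrates one way the positional branch could be realized in PyTorch; the shared projection matrices and matrix-multiplication fusion described above are simplified to per-direction 1 × 1 convolutions combined by broadcasting, so it should be read as an approximation rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PositionalAttentionBranch(nn.Module):
    """Coordinate-attention-style sketch of the positional branch: directional
    pooling, a shared bottleneck, per-direction 1x1 convolutions, and a
    broadcasted row-by-column reweighting. Widths are illustrative assumptions."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))      # z_h: average over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))      # z_w: average over height
        self.bottleneck = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x2):                                  # x2: (B, C/2, H, W)
        _, _, h, w = x2.shape
        zh = self.pool_h(x2)                                # (B, C/2, H, 1)
        zw = self.pool_w(x2)                                # (B, C/2, 1, W)
        z = torch.cat([zh, zw.transpose(2, 3)], dim=2)      # concatenate along H + W
        z = self.bottleneck(z)                              # channel reduction by 8
        zh, zw = torch.split(z, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(zh))                # (B, C/2, H, 1)
        a_w = torch.sigmoid(self.conv_w(zw.transpose(2, 3)))  # (B, C/2, 1, W)
        return x2 * a_h * a_w                               # broadcasted X_pos reweighting
```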

2.2. Prompt Cross-Former

In each PCFormer of the PGTSEFormer network, the input features are first pre-processed by the AttenMix module to enhance their representation. This module integrates several spatial and channel enhancement strategies, including two sets of convolutional operations, a dual-branch dynamic weight fusion mechanism and a channel mixing module. Specifically, AttenMix first recombines the input features across bands via DWC and PWC, then maps them into a high-dimensional feature space to enhance feature representation. Subsequently, the module further extracts key features through a dual channel-space branching mechanism. Among them, the channel branch utilizes global average pooling to extract spectral statistical information and dynamically adjusts channel weights to enhance local saliency in the feature map, whereas the spatial branch applies a 3D convolutional kernel to model inter-band spatial correlations and capture local structural features more effectively. Through this structure, the model can efficiently mine local spectral–spatial features within neighboring band groups. The introduced channel mixing operation facilitates information interaction across channels, thereby enriching the diversity of feature representations. In addition, it increases the stochasticity during training, which in turn improves the model’s adaptability and generalization capability.
After the feature enhancement process using AttenMix, the features are fed into the PGFormer Block for global context modeling. In the PGFormer Block, we introduce learnable Group Tokens as semantic proxies for different regions, to enable cross-region feature interaction and fusion through the cross-group attention mechanism. Meanwhile, the module introduces a dynamic Prompt Factor to regulate the query vectors through a gating mechanism, so that the attention distribution is self-adapted to the semantic structure and distribution pattern of the input features. The synergistic design of the module preserves the capacity for local detail extraction, while simultaneously activating global semantic associations, and establishes a multi-level, multi-scale feature enhancement framework. This design significantly improves the model’s representational capacity and classification accuracy.

2.2.1. AttenMix

HSI contains a large number of spectral channels and exhibits highly complex spatial structures [32]. Although convolutional operations can effectively extract local spatial features, they often introduce redundancy when handling high-dimensional spectral data and are inherently constrained by fixed kernel sizes. To address these limitations, we designed the AttenMix module as a structure-aware component placed before each Transformer layer, to facilitate efficient local–global feature interaction. Its structure is shown in Figure 1. Firstly, the input feature $X \in \mathbb{R}^{C \times H \times W}$ is processed using DWC. Specifically, this consists of two stages:
$$ X_{\mathrm{dw}} = \mathrm{DWConv}(X) = X * W_{\mathrm{dw}}, \quad W_{\mathrm{dw}} \in \mathbb{R}^{C \times 1 \times H \times W} $$
$$ X_{\mathrm{pw}} = \mathrm{PWConv}(X_{\mathrm{dw}}) = X_{\mathrm{dw}} * W_{\mathrm{pw}}, \quad W_{\mathrm{pw}} \in \mathbb{R}^{C \times C \times 1 \times 1} $$
where $*$ denotes the convolution operation. $W_{\mathrm{dw}}$ is the convolution weight of the depthwise-separable convolution, which extracts spatial information for each channel independently. $W_{\mathrm{pw}}$ is the convolution weight of the PWC, which helps the model achieve spectral-domain fusion between different channels and strengthens cross-channel feature interaction. The processed features are then split into two branches along the channel dimension. The channel branch weights a subset of globally pooled channels via the learnable weight $c_{\mathrm{weight}}$ and bias $c_{\mathrm{bias}}$:
$$ A_c = \sigma\big(c_{\mathrm{weight}} \cdot \mathrm{AvgPool}(X) + c_{\mathrm{bias}}\big) $$
where $c_{\mathrm{weight}}$ performs a linear transformation of the channel information and learns the magnitude of the enhancement for each channel, regulating the strength of each channel’s response in the final channel attention, while $c_{\mathrm{bias}}$ introduces an independent offset for each channel’s attention to enhance the nonlinear representation of the model. The spatial branch is modulated using the learnable weights $s_{\mathrm{weight}}$ and $s_{\mathrm{bias}}$ for each grouped spatial subset after GroupNorm:
$$ A_s = \sigma\big(s_{\mathrm{weight}} \cdot \mathrm{GN}(X) + s_{\mathrm{bias}}\big) $$
The $s_{\mathrm{weight}}$ modulates the normalized spatial features by highlighting salient regions and suppressing redundancy, while $s_{\mathrm{bias}}$ provides spatial group-specific offsets and jointly controls the feature map shape with $s_{\mathrm{weight}}$. These learnable parameters are automatically updated by backpropagation during training.
$$ X = \mathrm{Concat}(A_s, A_c) \in \mathbb{R}^{C \times H \times W} $$
Then, we feed the feature $X$ obtained by concatenating these two branches along the channel dimension into the channel shuffle module. In channel shuffling, the channels are rearranged in a grouped manner to achieve inter-group information interaction. This design breaks the locality between channels and enhances the information flow of features between groups, so that global information can be integrated more fully.
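A compact PyTorch sketch of the AttenMix pipeline is shown below; it assumes an even, group-divisible channel count and applies the channel and spatial weights to the two feature halves before concatenation, which is one possible reading of the equations above.

```python
import torch
import torch.nn as nn

class AttenMix(nn.Module):
    """Sketch of AttenMix: depthwise + pointwise convolution, a dual
    channel/spatial weighting branch, and a grouped channel shuffle. Kernel
    sizes, group counts, and how the weights are applied are assumptions."""
    def __init__(self, channels, groups=4):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # DWConv
        self.pw = nn.Conv2d(channels, channels, 1)                              # PWConv
        half = channels // 2
        self.c_weight = nn.Parameter(torch.ones(1, half, 1, 1))
        self.c_bias = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.s_weight = nn.Parameter(torch.ones(1, half, 1, 1))
        self.s_bias = nn.Parameter(torch.zeros(1, half, 1, 1))
        self.gn = nn.GroupNorm(groups, half)
        self.groups = groups

    def channel_shuffle(self, x):
        b, c, h, w = x.shape
        x = x.view(b, self.groups, c // self.groups, h, w)
        return x.transpose(1, 2).reshape(b, c, h, w)         # inter-group interaction

    def forward(self, x):
        x = self.pw(self.dw(x))                               # local spatial + cross-channel mixing
        x1, x2 = x.chunk(2, dim=1)                            # channel branch / spatial branch
        a_c = torch.sigmoid(self.c_weight * x1.mean(dim=(2, 3), keepdim=True) + self.c_bias)
        a_s = torch.sigmoid(self.s_weight * self.gn(x2) + self.s_bias)
        out = torch.cat([x1 * a_c, x2 * a_s], dim=1)          # weighted halves concatenated
        return self.channel_shuffle(out)
```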

2.2.2. PGFormer Block

Although the conventional ViT performs well in capturing global dependencies, it still has limitations in dealing with image details and salient local features [33]. For this reason, a Prompt-Gated Transformer structure based on prompts is designed in this paper, aiming to enhance the model’s ability to model local spatial information. The structure is shown in Figure 2a. It consists of a series of normalization layers, a convolutional layer, and the PGCA, and these parts work in concert to significantly enhance the accuracy of the model’s feature extraction. We embed the PGCA module in each Transformer encoding layer to enhance the model’s adaptive perception in the spatial dimension.
As shown in Figure 2b, the PGCA module consists of four key steps: local spatial partitioning, intra-group self-attention modeling, inter-group context fusion, and Prompt-Gated mechanism. First, the input features are divided into multiple local regions through the spatial partitioning mechanism, to maintain the consistency of the local structure. Subsequently, the intra-group self-attention mechanism is employed to capture the key spatial relationships within the regions. On this basis, the inter-group context fusion module further models the global dependencies between different regions. Finally, the attention path is dynamically adjusted through the introduction of the Prompt-Gated mechanism, so that the model is capable of adaptively selecting attention regions according to the distribution of input features, thus achieving more accurate spatial feature modeling.
Specifically, given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we first rearrange it into multiple spatially grouped blocks shaped as $g \times g$ local subregions.
$$ X_{\mathrm{group}} = \mathrm{Rearrange}(X) \in \mathbb{R}^{N \times C \times g^2} $$
In the above formula, $N = \frac{H \cdot W}{g^2}$ denotes the number of spatially partitioned groups, where $g$ is the spatial size of each group. Next, a $1 \times 1$ convolution is applied to each local group feature to generate the query ($Q$), key ($K$), and value ($V$) matrices, enabling intra-group attention modeling.
$$ Q, K, V = \mathrm{Conv1d}_{1 \times 1}(X_{\mathrm{group}}) \in \mathbb{R}^{B \cdot N \times h \times g^2 \times d} $$
where $h$ denotes the number of attention heads, and $d$ represents the embedding dimension of each head. To enhance the expressive capability of the attention mechanism, we introduce a learnable prompt vector $P \in \mathbb{R}^{1 \times h \times 1 \times d}$ and a gating factor $G \in \mathbb{R}^{1 \times h \times 1 \times 1}$ to modulate the original query vector. The operation is shown in Equation (22):
$$ Q = Q \cdot \big(1 - \sigma(G)\big) + P \cdot \sigma(G) $$
where σ ( · ) denotes the sigmoid function used to control the level of involvement of the prompt vector. Next, the standard self-attention operation is performed within each local space group:
$$ \mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \cdot V $$
The obtained attention result is the output feature after context enhancement within the group. To enhance the information interaction between spatial groups, we design a Group Tokens mechanism: the feature at the first position of each group is extracted as that group’s semantic representation, and attention is then computed between groups:
$$ T_Q, T_K = \mathrm{Linear}(\mathrm{GroupTokens}), \quad \mathrm{Attn}_{\mathrm{group}} = \mathrm{Softmax}\left(\frac{T_Q T_K^{\top}}{\sqrt{d}}\right) $$
The results of the inter-group attention are subsequently broadcast back into the groups to remodulate the intra-group features and obtain the final context-fused feature $\mathrm{Attn}_{\mathrm{fused}}$. The locally grouped features are then rearranged back to the original spatial dimensions, and a residual connection is applied.
$$ X_{\mathrm{out}} = X + \mathrm{Reshape}(\mathrm{Attn}_{\mathrm{fused}}) $$
With the PGCA module, the PGTSEFormer network effectively enhances the perception ability and information interaction between spatial regions while maintaining a lightweight structure, which significantly improves the model’s structural modeling and generalization ability in HSI.
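The sketch below illustrates the core PGCA computation in PyTorch, covering spatial grouping, the prompt-gated query, and intra-group self-attention with a residual connection; the inter-group Group Tokens stage is omitted for brevity, and the tensor-manipulation details are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PromptGatedCrossAttention(nn.Module):
    """Sketch of the PGCA core: g x g spatial grouping, a prompt-gated query,
    and intra-group self-attention. The inter-group Group Tokens stage is
    omitted; H and W are assumed divisible by the group size."""
    def __init__(self, dim, heads=4, group_size=4):
        super().__init__()
        self.h, self.g, self.d = heads, group_size, dim // heads
        self.qkv = nn.Conv1d(dim, dim * 3, kernel_size=1)
        self.prompt = nn.Parameter(torch.zeros(1, heads, 1, self.d))   # P
        self.gate = nn.Parameter(torch.zeros(1, heads, 1, 1))          # G
        self.proj = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, hgt, wid = x.shape
        g, h, d = self.g, self.h, self.d
        n = (hgt // g) * (wid // g)                          # number of spatial groups N
        xg = x.unfold(2, g, g).unfold(3, g, g)               # (B, C, H/g, W/g, g, g)
        xg = xg.permute(0, 2, 3, 1, 4, 5).reshape(b * n, c, g * g)
        q, k, v = self.qkv(xg).chunk(3, dim=1)               # each (B*N, C, g^2)
        q = q.view(b * n, h, d, g * g).transpose(2, 3)       # (B*N, h, g^2, d)
        k = k.view(b * n, h, d, g * g).transpose(2, 3)
        v = v.view(b * n, h, d, g * g).transpose(2, 3)
        gate = torch.sigmoid(self.gate)
        q = q * (1 - gate) + self.prompt * gate              # prompt-gated query
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(2, 3).reshape(b * n, c, g * g)
        out = self.proj(out)
        out = out.view(b, hgt // g, wid // g, c, g, g).permute(0, 3, 1, 4, 2, 5)
        return x + out.reshape(b, c, hgt, wid)               # residual connection
```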

3. Datasets and Experimental Settings

In this section, in order to evaluate the effectiveness of our proposed PGTSEFormer framework, we present a comprehensive experiment using five publicly available HSI datasets. We describe in detail the five datasets used and provide a detailed description of the experimental setup.

3.1. Datasets

Indian Pines: This dataset was collected in 1992 by NASA over northwestern Indiana, using the AVIRIS sensor. It comprises a hyperspectral image of 145 × 145 pixels, capturing 16 distinct land cover categories. The spectral range spans from 0.4 to 2.5 microns across 220 contiguous bands. Figure 3 presents a false-color composite of the image, alongside its corresponding ground-truth annotation map. In total, the dataset includes 10,249 labeled instances representing all classes. For experimental purposes, the data were divided such that 5% were allocated for training, another 5% for validation, and the remaining 90% for testing. Detailed class-wise sample counts and distribution ratios are provided in Table 1.
Salinas: This dataset was captured by NASA’s AVIRIS sensor over the Salinas Valley in California. It features a spatial resolution of 3.7 m, with image dimensions of 512 × 217 pixels. Originally composed of 224 spectral bands, 20 bands heavily influenced by water vapor absorption were removed, leaving 204 continuous and usable spectral channels. Figure 4 displays a false-color composite image along with its associated ground-truth label map. This dataset focuses on 16 categories of agricultural crops and contains a total of 54,129 labeled samples. For dataset partitioning, 1% of the samples were allocated to the training set and another 1% to the validation set, while the remaining 98% were reserved for testing. Table 2 outlines the number of samples per class and their respective distribution.
Botswana: This dataset was acquired by NASA on the Earth Observation-1 (EO-1) satellite in May 2001 and covers the Okavango Delta region of Botswana, Africa. The dataset has an image size of 1476 × 256 pixels, a spatial resolution of about 20 m, and a spectral wavelength range of 0.4 to 2.5 microns. The raw data contain 242 spectral bands, of which 5 bands were excluded due to noise effects, and finally 145 valid continuous spectral channels were retained. The false color composite image of the Botswana dataset with its corresponding ground truth label map is shown in Figure 5. The dataset contains 14 different feature types, totaling 3248 labeled samples. In the experimental setup of this paper, 5% of the samples were used for the training set, 5% for the validation set, and the remaining 90% were used as the test set. The details of the category distribution and sample division are shown in Table 3.
WHU-Hi-LongKou: This dataset was collected using a hyperspectral imaging sensor mounted on a UAV over an area in Hubei Province, China. It primarily focuses on nine representative crop types and is intended for research on high-resolution hyperspectral classification. Figure 6 shows the dataset’s pseudo-color composite images and their corresponding ground-truth maps. Each image is 550 × 400 pixels, covering 270 contiguous spectral bands across the 0.4–1.0 µm range, with a spatial resolution of 0.463 m. The dataset includes 204,542 labeled instances across the nine categories. For the experiments, 0.2% of the data were allocated to training and validation (each), while the remaining 99.6% were reserved for testing, to ensure a robust evaluation. Class-wise sample counts are provided in Table 4.
WHU-Hi-HongHu: This dataset was acquired in Honghu City, Hubei Province, China, with a 17 mm focal length Headwall Nano-Hyperspec imaging sensor mounted on a DJI Matrice 600 Pro UAV platform. The experimental area is a complex agricultural scene with many crop classes, and different cultivars of the same crop are also planted in the region, including Chinese cabbage and cabbage, as well as Brassica chinensis and small Brassica chinensis. The UAV flew at an altitude of 100 m, the imagery is 940 × 475 pixels in size, there are 270 bands from 400 to 1000 nm, and the spatial resolution of the UAV-borne hyperspectral imagery is about 0.043 m. Figure 7 displays a false-color image and its corresponding ground-truth label map from the WHU-Hi-HongHu dataset. In the experimental setup of this study, the training and validation sets each accounted for 0.2% of the total samples, while the remaining 99.6% were used as the test set. This dataset contains 22 land-cover classes. The specific number of samples for each class and their partitioning are detailed in Table 5.
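For reference, the per-class splits described above can be reproduced with a stratified sampling procedure such as the following sketch; the variable names and the use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch of a per-class (stratified) 5% / 5% / 90% split as described above;
# `labels` is assumed to be the flattened ground-truth map, with 0 marking
# unlabeled pixels. Ratios would be adjusted per dataset (e.g., 1% or 0.2%).
idx = np.flatnonzero(labels > 0)
train_idx, rest = train_test_split(
    idx, train_size=0.05, stratify=labels[idx], random_state=0)
val_idx, test_idx = train_test_split(
    rest, train_size=0.05 / 0.95, stratify=labels[rest], random_state=0)
```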

3.2. Experimental Settings

3.2.1. Experimental Setup

Our proposed PGTSEFormer network was implemented using PyTorch 1.11 and Python 3.7.13. All experiments were conducted on a server with an NVIDIA GeForce RTX 3090 GPU. We used the AdamW optimizer to update network parameters during training, with cross-entropy [34] as the loss function. The learning rate was set to 0.001, weight decay to 0.05, batch size to 128, and patch size to 12. To reduce randomness and ensure stable results, each experiment was repeated 10 times and the average outcome was reported as the final result.
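A training loop corresponding to this setup could look like the following sketch; the model, data loader, and epoch count are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

# Sketch of the training configuration described above. `model` and
# `train_loader` are assumed to be defined elsewhere; the epoch count is an
# assumption, as it is not stated in the text.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                          # assumed number of epochs
    model.train()
    for patches, labels in train_loader:          # patches: (128, C, 12, 12) cubes
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
```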

3.2.2. Evaluation Metrics

In order to strengthen the credibility of the performance of our method, we used three comprehensive evaluation metrics: overall accuracy (OA), average accuracy (AA), and kappa (k). Higher metric values indicate better model accuracy.
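These three metrics can be computed directly from the confusion matrix, as in the short sketch below.

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA, and the kappa coefficient from a confusion matrix
    whose rows correspond to the true classes."""
    conf = conf.astype(float)
    total = conf.sum()
    oa = np.trace(conf) / total                               # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)              # per-class accuracy
    aa = per_class.mean()                                     # average accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                              # chance-corrected agreement
    return oa, aa, kappa
```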

3.2.3. Comparative Methods

To assess the performance of the proposed PGTSEFormer in HSI, we compared it with several representative CNNs and Transformer-based methods. The selected methods cover a wide range of feature extraction strategies, from shallow convolutional networks to deep Transformer architectures. The specific comparison models and their structural features are as follows:
  • 3DCNN [35]: This model adopts a multi-layer 3D convolutional structure to jointly capture spectral and spatial features, while integrating a multi-level pooling module to progressively compress the spectral dimensions. Subsequently, feature integration and dimensionality reduction are carried out through an FC layer, and the final output is the classification result.
  • CACFTNet [36]: This model innovatively introduces a covariance attention mechanism and cross-layer fusion strategy, utilizing a dual-branch module that combines CNN and Transformer structures to achieve efficient integration of spatial–spectral information and band similarities in HSI tasks.
  • SpectralFormer [21]: This network model comprises four key modules: a band-domain neighborhood augmentation module for constructing local spectral–spatial relationships, a multi-layer Transformer encoder for capturing global contextual information, an embedding and positional coding module for feature representation, and a linear FC layer for final classification.
  • SPRN [37]: This network model consists of four main components: a 2D convolutional layer for extracting underlying spectral–spatial features, a multi-grouped residual structure for feature enhancement and residual learning, a grouped convolutional module for channel adaptation, and an FC layer for final classification.
  • SSFTT [38]: This network consists of 3D and 2D convolutional layers for joint spectral–spatial feature extraction, a learnable token generation module to construct semantic representations, a Transformer encoder to model global contextual dependencies, and a linear layer for final category prediction.
  • GAHT [39]: This network is divided into three stages, each consisting of 1 × 1 convolutional and Transformer Encoder modules embedded in grouped pixels, followed by a global average pooling layer and a linear layer for classification.
  • Massformer [23]: The network model first applies 3D and 2D convolutional layers for shallow spectral–spatial feature extraction. The resulting features are then split into two branches: one performs both max pooling and average pooling to capture statistical characteristics, while the other concatenates a class token and positional encoding to prepare for Transformer-based semantic modeling. Both branches enter the Transformer encoding layer, where the pooled branch serves as memory to recompose the Q, K, and V. Finally, the output is passed to an MLP for classification.
  • SQSFormer [24]: This network is designed for hyperspectral feature extraction and consists of a 2D convolutional layer to extract underlying spatial features, an SE module to recalibrate channel responses, a Transformer encoder enhanced with a central feature mechanism to capture global dependencies, and a linear layer for aggregating features and predicting output classes.
  • MCTGCL [40]: This network model integrates 3D and 2D convolutional layers to extract underlying spatio-temporal features, an efficient attention weighting module to refine spatial channel importance, a Transformer encoder enhanced with a memory mechanism to capture long-range dependencies, and a linear layer to aggregate features for final classification.

4. Experimental Comparison and Analysis

In this section, we selected nine representative HSI methods for comparative experiments, covering different network structures based on CNNs and Transformer-based architectures. Through systematic comparisons with these state-of-the-art methods, we comprehensively evaluated the classification performance of the proposed PGTSEFormer model on several standard hyperspectral datasets. We present a detailed description of the network architecture, the experimental setup, and the comparison methods used for evaluating PGTSEFormer. In addition, we provide in-depth analyses of the classification results, which collectively demonstrate the state-of-the-art performance and effectiveness of the proposed framework in HSI tasks from multiple perspectives. For the classification accuracy of each feature category, the highest value among the methods is marked in bold font, in order to highlight the performance advantages.
Additionally, to visually demonstrate the classification performance of the different methods across the various datasets, we provide the corresponding false-color composite images, ground-truth annotations, and predicted classification maps. These visualizations help in more effectively comparing the recognition accuracy and spatial consistency of each model.

4.1. Quantitative Results Analysis

Table 6, Table 7, Table 8, Table 9 and Table 10 present the classification results of our method and the comparative approaches on the five datasets. The evaluation metrics used included Overall Accuracy (OA), Average Accuracy (AA), Kappa (k), and individual classification accuracy for each category. The best-performing values in each experiment are boldfaced to underscore the advantages of each method across the various evaluation criteria.
The classification performance on the Indian Pines dataset is shown in Table 6. Although PGTSEFormer performed slightly worse on certain individual categories (compared with the SPRN method), it achieved a significantly higher accuracy for category 9 (Oats) than all other methods, demonstrating its strength in recognizing classes with limited samples. With only 5% of the training data, PGTSEFormer achieved an OA value of 97.91 ± 0.41%, which is approximately 0.19% higher than the best performance of the other methods. Moreover, its performance exhibited lower variance, indicating good stability.
The classification performance on the Salinas dataset is presented in Table 7. SQSFormer achieved an accuracy of 97.70 ± 0.93% on Class 5 (Fallow-Smooth), demonstrating its strength in local feature modeling. However, in terms of overall performance, PGTSEFormer achieved an OA value of 98.74 ± 0.42% using only 1% of the training samples, outperforming the best comparison method (i.e., SPRN) by 1.22%.
The classification results on the Botswana dataset are shown in Table 8. While several methods achieved classification accuracies of 100 ± 0.00% in specific categories, PGTSEFormer also reached this level in most categories, demonstrating its strong capability in modeling multi-class spectral–spatial features. Although the classification accuracies of PGTSEFormer were slightly lower than those of the best-performing method in category 1 (Water), category 5 (Hippo Grass-1), category 12 (Short Mopane), and category 14 (Chalcedony), the respective differences were only 0.25%, 0.47%, 0.13%, and 0.12%, indicating negligible performance gaps. Overall, PGTSEFormer achieved an OA value of 99.48 ± 0.34% using only 5% of the training samples, with a fluctuation of just 0.34%, which further validates its robustness and stability on this dataset.
The classification results on the WHU-Hi-LongKou dataset are shown in Table 9 and PGTSEFormer performed particularly well for Class 3 (Sesame) and Class 5 (Narrow-leaf Soybean). In contrast, the traditional CNN methods performed poorly in these two classes, possibly due to their shortcomings in feature extraction with fewer samples. PGTSEFormer achieved the highest classification accuracies for these categories, despite the improvements of other Transformer-based methods over CNNs. Under very low sample conditions (only 0.2% of the training samples), PGTSEFormer still achieved an OA value of 99.18 ± 0.29%, which is an improvement of about 0.21% compared to the SPRN.
The classification results on the WHU-Hi-HongHu dataset are shown in Table 10. Our method outperformed all other methods, especially in Class 1 (Red roof) and Class 5 (Cotton firewood), where it achieved a higher classification accuracy than the other methods. In contrast, the traditional CNN-based methods such as 3DCNN and SpectralFormer performed relatively poorly in these two classes, possibly due to their limitations in feature extraction, particularly for the complex patterns present in these classes. Notably, the Transformer-based methods such as GAHT and MCTGCL also showed significant improvements over the CNN-based methods, but they still lagged behind our method in terms of classification accuracy. Our method achieved the highest OA and AA values, with a notable OA value of 91.52 ± 1.01%, significantly surpassing the next best method (GAHT) by a margin of approximately 1.5%.
In summary, PGTSEFormer achieved excellent classification results on several public datasets, verifying its effectiveness and advancement in dealing with complex spatial–spectral structures and modeling global–local feature relationships.

4.2. Visual Results Analysis

Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 show the results of the classification performance visualization for the proposed method with each comparison algorithm on the five datasets. Owing to the structural advantages of the proposed PGTSEFormer in spatial–spectral feature modeling and contextual representation, the classification results exhibited reduced noise artifacts, sharper region boundaries, and greater consistency with the ground-truth distribution. On the Indian Pines dataset, this was especially reflected in category 15 (Vineyard-untrained) (bright red area). This type of area is usually noisy and difficult to segment accurately using traditional methods, while PGTSEFormer could effectively suppress misclassified areas and significantly improved the accuracy of boundary identification. Similar advantages were observed on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets. The classification maps generated by PGTSEFormer not only aligned well with the ground-truth contours, but also exhibited superior accuracy in edge delineation and fine-grained category prediction. These results further validate the model’s robustness and generalization performance for complex hyperspectral scenes.

4.3. Learned Feature Visualizations by T-SNE

T-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction technique widely used for visualizing high-dimensional data. It maps high-dimensional data into two or three dimensions, while preserving the local structure, revealing underlying patterns and group structures. In hyperspectral image classification, t-SNE plots can showcase the distribution characteristics of data. If the classification performance is good, the samples of different categories will be clearly separated in the t-SNE plot and will cluster into independent groups, indicating that the classifier can effectively distinguish between categories. However, if the classification performance is poor, the categories may overlap or interweave, making it difficult to separate the samples, suggesting that the classifier failed to distinguish between different categories, and misclassified samples may be projected into the clusters of other categories.
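As an illustration, such an embedding can be produced with scikit-learn’s t-SNE implementation; the feature matrix and labels are assumed to come from the trained classifier.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `features` (an N x D matrix of penultimate-layer outputs) and `labels` are
# assumed to come from the trained classifier; the perplexity is an assumption.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=2)
plt.title("t-SNE of learned features")
plt.show()
```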
Taking the IP dataset as an example, as shown in Figure 13, in (b), (c), and (d), the samples of different categories are well clustered together and their boundaries are distinct, suggesting that these plots represent a relatively good classification performance. The overlap between categories is minimal, indicating that the classifier could effectively distinguish between these categories. In (a), (e), and (f), some categories begin to show more overlap, especially in the areas where the colors intersect. This distribution might indicate that the classifier’s ability to differentiate between these categories has declined, or that the feature spaces of these categories are similar, making classification more difficult. From (g) to (j), it can be observed that the degree of separation between categories fluctuated with different dimensionality reduction parameters and methods. In particular, in (h) and (i), some categories start to cluster more closely together, possibly due to the impact of the dimensionality reduction process on the local structure.
In (j), the samples of each category are almost all clustered in distinct areas, and the distribution of most categories does not overlap or cross. The samples in each category have high separability in the feature space, indicating that the features of the data can effectively distinguish between these categories. Areas with the same color (e.g., blue, green, pink) show that the samples of each category are tightly grouped, with no points from other categories scattered in these areas, meaning that this plot is excellent in terms of category separation. Compared to the other plots (e.g., (d) and (e)), the samples in (j) are mostly clustered within their respective clusters, with only a few samples potentially in neighboring category regions, showing a low misclassification rate. The distribution of each category in the plot forms relatively compact clusters, indicating that similar samples are grouped closely together in the feature space, thereby enhancing the classification accuracy and robustness.

4.4. Parameter Comparison

To evaluate the synergy between efficiency and accuracy, Table 11 compares the model complexity of PGTSEFormer with nine mainstream HSI methods, considering both Parameters and FLOPs. As shown in the table, PGTSEFormer had significantly fewer parameters than most competitors on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets, demonstrating its lightweight design, without compromising feature extraction capability. Despite the relatively small number of parameters in the model, PGTSEFormer still achieved a leading classification accuracy, which indicates that it is more efficient in parameter utilization.
Comprehensively analyzing the parameter scale, computational cost, and classification performance of the model, PGTSEFormer achieved superior classification results on the basis of ensuring a low computational overhead, which verifies its significant advantages in HSI tasks.

4.5. Impact of Different Training Ratios

To validate the robustness of the proposed method, we conducted comparison experiments on the OA of all the compared methods on five HSI datasets, covering different training sample ratio settings. As shown in Figure 14, PGTSEFormer consistently outperformed the other methods in classification accuracy, even under limited training data conditions. Moreover, its accuracy increased steadily with the proportion of training samples, demonstrating strong scalability and robustness. Taking the SA dataset as an example, the classification accuracy of our method was consistently higher than that of the other methods when the number of training samples was limited. This suggests that the model retained robust feature extraction and discrimination abilities, even in challenging situations with limited samples and subtle class differences.
In summary, this experiment fully demonstrated that PGTSEFormer not only has superior performance under standard settings, but also shows good robustness and stability under restricted training samples.

5. Discussion

To better understand how various parameter configurations influenced the network model, this study conducted a series of systematic experiments on patch sizes, learning rate settings, and module combinations. Through this series of experiments, we effectively revealed which parameter combinations gave the model the highest classification accuracy.

5.1. Patchsize

To examine the impact of patch size on classification performance, we evaluated four configurations (patch sizes of 4, 8, 12, and 16), keeping all other parameters fixed. Figure 15 depicts the OA achieved on the five datasets under different patch sizes. The results reveal that the OA on the Salinas, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets increased steadily with larger patch sizes, whereas the Indian Pines and Botswana datasets exhibited peak performance at a patch size of 12 before declining. Based on these observations, a patch size of 12 was adopted consistently across all datasets.

5.2. Learning Rate

Choosing an appropriate learning rate not only influences whether a model reaches the optimum, but also affects its convergence speed. To study the effect of the learning rate on the network model, we set six different learning rates: 0.001, 0.005, 0.0001, 0.0005, 0.00001, and 0.00005, keeping the other parameters unchanged. As shown in Figure 16, performance on all five datasets peaked at a learning rate of 0.001. Based on this result, we set the learning rate to 0.001 uniformly for all five datasets.

5.3. Ablation Study

To systematically evaluate the individual contributions of the proposed modules, we performed ablation studies on five benchmark datasets. The experiments were conducted under the following configurations:
  • Both CHPA and PCFormer modules removed;
  • CHPA retained, PCFormer removed;
  • PCFormer retained, CHPA removed;
  • Both CHPA and PCFormer modules retained.
The classification results corresponding to these settings are presented in Table 12. As shown, the inclusion of either module individually improved the performance compared to the baseline. Notably, the best performance was achieved when both modules were incorporated, demonstrating their complementary effectiveness in enhancing the network’s feature representation capability.
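For reproducibility, the four configurations can be expressed as simple constructor flags, as in the sketch below; the flag names are assumptions about the implementation.

```python
# The four ablation settings expressed as constructor flags; the flag names
# are assumptions about the implementation.
ablation_configs = [
    {"use_chpa": False, "use_pcformer": False},   # baseline
    {"use_chpa": True,  "use_pcformer": False},   # CHPA only
    {"use_chpa": False, "use_pcformer": True},    # PCFormer only
    {"use_chpa": True,  "use_pcformer": True},    # full PGTSEFormer
]
```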

6. Conclusions

In this paper, we proposed a lightweight network framework for HSI named PGTSEFormer. It consists of two main parts: shallow feature extraction, and deep contextual information interaction. In the shallow feature extraction part, a CHPA module is introduced to jointly extract local spatial and spectral features. In the deep contextual interaction part, the PCFormer is employed to capture global contextual dependencies. The integration of shallow local feature enhancement and deep contextual interaction mechanisms greatly improved the feature representation capability and classification accuracy of the model. In comparison experiments on five publicly available HSI datasets, PGTSEFormer outperformed current state-of-the-art methods in terms of accuracy and parameter efficiency, maintaining excellent classification results with a small number of parameters, which demonstrates its application potential in resource-limited situations.
The experimental results demonstrate that the proposed method achieved strong performance on existing hyperspectral datasets. Future research will focus on further optimizing the model structure and exploring lightweight network designs with improved generalization capabilities to better meet the diverse requirements of real-world HSI processing tasks.

Author Contributions

Conceptualization, S.C. and R.H.; methodology, R.H.; software, R.H.; validation, R.H., S.C. and S.L.; resources, S.C.; data curation, R.H.; writing—original draft preparation, R.H.; visualization, R.H.; supervision, S.C.; project administration, T.L.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China Project under Grant 62441213, the Key Laboratory Open Projects in Xinjiang Uygur Autonomous Region under Grant 2023D04028, and in part by Xinjiang University Graduate Innovation Project under Grant XJDX2025YJS193.

Data Availability Statement

Indian Pines, Botswana, and Salinas datasets are available at https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 21 July 2025). WHU-Hi-LongKou and WHU-Hi-HongHu datasets are available at http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm (accessed on 21 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fang, L.; Yan, Y.; Yue, J.; Deng, Y. Toward the vectorization of hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518214.
  2. Ghamisi, P.; Plaza, J.; Chen, Y.; Li, J.; Plaza, A.J. Advanced spectral classifiers for hyperspectral images: A review. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–32.
  3. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709.
  4. Gao, H.M.; Xu, M.X.; Xu, M.G.; Wang, X.; Huang, F.C. A method of hyperspectral image classification based on posterior probability SVM and MRF. In Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, Tianjin, China, 14–17 July 2013; Volume 1, pp. 235–240.
  5. Li, Y.; Yang, X.; Tang, D.; Zhou, Z. RDTN: Residual Densely Transformer Network for hyperspectral image classification. Expert Syst. Appl. 2024, 250, 123939.
  6. Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491.
  7. Zoubir, H.; Rguig, M.; El Aroussi, M.; Saadane, R.; Chehri, A. Pixel-level concrete bridge crack detection using Convolutional Neural Networks, Gabor filters, and attention mechanisms. Eng. Struct. 2024, 314, 118343.
  8. Guan, T.; Wang, C.; Liu, Y.H. Neural Markov random field for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5459–5469.
  9. Lee, H.; Kwon, H. Going deeper with contextual CNN for hyperspectral image classification. IEEE Trans. Image Process. 2017, 26, 4843–4855.
  10. Yu, C.; Zhu, Y.; Wang, Y.; Zhao, E.; Zhang, Q.; Lu, X. Concern with Center-Pixel Labeling: Center-Specific Perception Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5514614.
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  12. He, X.; Chen, Y.; Ghamisi, P. Heterogeneous transfer learning for hyperspectral image classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3246–3263.
  13. Xu, Q.; Wei, J.; Wu, Q.; Wang, J.; Wang, X.; Liu, J.; Jiang, B. Mix-Mask Augmentation and Self-Reconstruction for Cross-Domain Few-Shot Hyperspectral Image Classification. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
  14. Liu, X.; Ng, A.H.M.; Lei, F.; Ren, J.; Liao, X.; Ge, L. Hyperspectral image classification using a multi-scale CNN architecture with asymmetric convolutions from small to large kernels. Remote Sens. 2025, 17, 1461.
  15. Chen, J.; Yue, J.; Chen, Y.; Zhou, H.; Hu, Z. Nonlinear Activation Functions are Not Necessary: A Lightweight Nonlinear Activation Free Network Based on Multiscale Large Kernel Attention Mechanism for Fault Diagnosis. IEEE Sens. J. 2025, 25, 18926–18940.
  16. Guan, Y.; Li, Z.; Wang, N. A Dense Pyramidal Residual Network with a Tandem Spectral–Spatial Attention Mechanism for Hyperspectral Image Classification. Sensors 2025, 25, 1858.
  17. Gong, G.; Wang, X.; Zhang, J.; Shang, X.; Pan, Z.; Li, Z.; Zhang, J. MSFF: A Multi-Scale Feature Fusion Convolutional Neural Network for Hyperspectral Image Classification. Electronics 2025, 14, 797.
  18. Qin, B.; Feng, S.; Zhao, C.; Li, W.; Tao, R.; Zhou, J. Language-Enhanced Dual-Level Contrastive Learning Network for Open-Set Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5508114.
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  20. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–spatial morphological attention transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615.
  21. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615.
  22. Zhou, Y.; Huang, X.; Yang, X.; Peng, J.; Ban, Y. DCTN: Dual-branch convolutional transformer network with efficient interactive self-attention for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5508616.
  23. Sun, L.; Zhang, H.; Zheng, Y.; Wu, Z.; Ye, Z.; Zhao, H. MASSFormer: Memory-Augmented Spectral-Spatial Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516415.
  24. Chen, N.; Fang, L.; Xia, Y.; Xia, S.; Liu, H.; Yue, J. Spectral query spatial: Revisiting the role of center pixel in transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5402714.
  25. Zhao, Z.; Xu, X.; Li, S.; Plaza, A. Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5511817.
  26. Cheng, S.; Chan, R.; Du, A. MS2I2Former: Multiscale Spatial–Spectral Information Interactive Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5532919.
  27. Al-qaness, M.A.; Wu, G.; AL-Alimi, D. MGCET: MLP-mixer and graph convolutional enhanced transformer for hyperspectral image classification. Remote Sens. 2024, 16, 2892.
  28. Shu, Z.; Liu, Z.; Yu, Z.; Wu, X.J. Dual Feature Aggregation Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5502916.
  29. Fu, G.; Xiong, F.; Lu, J.; Zhou, J. SSUMamba: Spatial-Spectral Selective State Space Model for Hyperspectral Image Denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5527714.
  30. Sun, Z.; Zhao, R. LLM Security Alignment Framework Design Based on Personal Preference. In Proceedings of the 2024 International Conference on Artificial Intelligence and Future Education, Shanghai, China, 1–2 November 2024; pp. 6–11.
  31. He, L.; Li, J.; Liu, C.; Li, S. Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Trans. Geosci. Remote Sens. 2017, 56, 1579–1597.
  32. Kang, J.; Zhang, Y.; Liu, X.; Cheng, Z. Hyperspectral image classification using spectral–spatial double-branch attention mechanism. Remote Sens. 2024, 16, 193.
  33. Ashraf, M.; Zhou, X.; Vivone, G.; Chen, L.; Chen, R.; Majdard, R.S. Spatial-spectral BERT for hyperspectral image classification. Remote Sens. 2024, 16, 539.
  34. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  35. Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434.
  36. Cheng, S.; Chan, R.; Du, A. CACFTNet: A Hybrid Cov-Attention and Cross-Layer Fusion Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
  37. Zhang, X.; Shang, S.; Tang, X.; Feng, J.; Jiao, L. Spectral partitioning residual network with spatial attention mechanism for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5507714.
  38. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214.
  39. Mei, S.; Song, C.; Ma, M.; Xu, F. Hyperspectral image classification using group-aware hierarchical transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5539014.
  40. Xi, B.; Zhang, Y.; Li, J.; Zheng, T.; Zhao, X.; Xu, H.; Xue, C.; Li, Y.; Chanussot, J. MCTGCL: Mixed CNN-Transformer for Mars Hyperspectral Image Classification With Graph Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5503214.
Figure 1. Overall structure of PGTSEFormer.
Figure 2. Overall structure of PCFormer. (a) PGFormer Block. (b) Prompt-Gated Cross Attention.
Figure 3. (a) False-color composite of the Indian Pines dataset. (b) Ground-truth map containing 16 mutually exclusive land cover classes.
Figure 4. (a) False-color composite of the Salinas dataset. (b) Ground-truth map containing 16 mutually exclusive land cover classes.
Figure 5. (a) False-color composite of the Botswana dataset. (b) Ground-truth map containing 14 mutually exclusive land cover classes.
Figure 6. (a) False-color composite of the WHU-Hi-LongKou dataset. (b) Ground-truth map containing 9 mutually exclusive land cover classes.
Figure 7. (a) False-color composite of the WHU-Hi-HongHu dataset. (b) Ground-truth map containing 22 mutually exclusive land cover classes.
Figure 8. Classification maps for the Indian Pines dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 9. Classification maps for the Salinas dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 10. Classification maps for the Botswana dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 11. Classification maps for the WHU-Hi-LongKou dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 12. Classification maps for the WHU-Hi-HongHu dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 13. Visualization of t-SNE results on the Indian Pines dataset. (a) 3DCNN. (b) CACFTNet. (c) SSFTT. (d) GAHT. (e) Massformer. (f) SpectralFormer. (g) SPRN. (h) SQSFormer. (i) MCTGCL. (j) Ours.
Figure 14. Effect of different training sample proportions on OA. (a) Indian Pines. (b) Salinas. (c) Botswana. (d) WHU-Hi-LongKou. (e) WHU-Hi-HongHu.
Figure 15. Impact of five different patch sizes on network performance across the five datasets.
Figure 16. Impact of different learning rates on network performance across the five datasets.
Table 1. Indian Pines land cover sample summary.
Class No. | Class Name | Train | Valid | Test
C01Alfalfa3241
C02Corn-notill71711285
C03Corn-mintill4241747
C04Corn1212213
C05Grass-pasture2424435
C06Grass-trees3736657
C07Grass-pasture-mowed2125
C08Hay-windrowed2424430
C09Oats1118
C10Soybean-notill4948875
C11Soybean-mintill1221232210
C12Soybean-clean3020534
C13Wheat1010185
C14Woods63631139
C15Buildings-Grass-Trees-Drives1920347
C16Stone-Steel-Towers4584
Total5135119225
Table 2. Salinas land cover sample summary.
Class No. | Class Name | Train | Valid | Test
C01Brocoli-green-weeds-120201969
C02Brocoli-green-weeds-237376352
C03Fallow20201936
C04Fallow-rough-plow14141366
C05Fallow-smooth27272624
C06Stubble40393880
C07Celery36363507
C08Grapes-untrained11311211,046
C09Soil-vinyard-develop62626079
C10Corn-senesced-green-weeds33333212
C11Lettuce-romaine-4wk11101047
C12Lettuce-romaine-5wk20191888
C13Lettuce-romaine-6wk99898
C14Lettuce-romaine-7wk10111049
C15Vinyard-untrained73727123
C16Vinyard-vertical-trellis18181771
Total54353953,047
Table 3. Botswana land cover sample summary.
Class No. | Class Name | Train | Valid | Test
C01Water1413243
C02Hippo-grass5591
C03Hippo-grass 11312226
C04Hippo-grass 21110194
C05Reeds1413242
C06Riparian1413242
C07Firescar1313233
C08Island interior1010183
C09Acacia woodlands1615283
C10Acacia shrublands1312223
C11Acacia grasslands1515275
C12Short mopanc99163
C13Mixed mopanc1413241
C14Chalcedony5585
Total1661582924
Table 4. WHU-Hi-LongKou land cover sample summary.
Class No. | Class Name | Train | Valid | Test
C01Corn696934,373
C02Cotton17178341
C03Sesame663019
C04Broad-leaf soybean12712662,959
C05Narrow-leaf soybean984134
C06Rice242311,807
C07Water13413466,788
C08Roads and houses15147095
C09Mixed weed11105208
Total412407203,724
Table 5. WHU-Hi-HongHu land cover sample summary.
Class No. | Class Name | Train | Valid | Test
C01Red roof282813,985
C02Road773498
C03Bare soil444321,734
C04Cotton326327162,632
C05Cotton firewood13126193
C06Rape898944,379
C07Chinese cabbage484824,007
C08Pakchoi884038
C09Cabbage222110,776
C10Tuber mustard252512,344
C11Brassica parachinensis222210,971
C12Brassica chinensis18188918
C13Small Brassica chinensis454522,417
C14Lactuca sativa15157326
C15Celtuce22998
C16Film covered lettuce14157233
C17Romaine lettuce662998
C18Carrot673204
C19White radish18178677
C20Garlic sprout773472
C21Broad bean231323
C22Tree884024
Total773765385,147
Table 6. Performance comparison of ten methods on the Indian Pines dataset. (Bold values indicate the highest classification accuracy).
Class No. | CNN | Transformers | Our Methods
3DCNN | Spectralformer | SPRN | SSFTT | GAHT | Massformer | SQSFormer | MCTGCL | CACFTNet | Ours
133.68 ± 3.8720.73 ± 11.96100.00 ± 0.0078.54 ± 14.0054.88 ± 3.1368.54 ± 18.7877.56 ± 19.3879.76 ± 7.4893.90 ± 6.2985.85 ± 8.98
272.95 ± 2.1376.37 ± 2.6796.09 ± 1.8492.57 ± 3.1095.18 ± 0.9392.06 ± 2.2192.94 ± 3.2693.42 ± 1.9097.77 ± 1.0398.68 ± 0.60
363.77 ± 2.0461.69 ± 3.4298.30 ± 1.0591.87 ± 3.7594.74 ± 2.3196.49 ± 1.0092.28 ± 3.3292.85 ± 3.2798.27 ± 1.0797.50 ± 1.05
466.68 ± 4.0652.77 ± 3.0597.61 ± 1.4583.43 ± 4.5090.80 ± 2.9993.29 ± 5.0490.99 ± 5.2190.61 ± 2.8792.25 ± 2.6795.62 ± 2.97
586.57 ± 3.9673.89 ± 4.2698.17 ± 0.2196.62 ± 1.0897.06 ± 1.2189.47 ± 4.6195.29 ± 2.3893.95 ± 2.1298.07 ± 0.4099.01 ± 0.80
699.07 ± 0.2488.75 ± 1.6397.85 ± 0.1299.16 ± 0.5899.48 ± 0.2197.69 ± 1.2598.81 ± 0.4197.67 ± 1.4899.80 ± 0.1998.57 ± 0.90
731.30 ± 11.6346.00 ± 10.0097.60 ± 1.9668.80 ± 25.2283.20 ± 8.9181.60 ± 13.7655.20 ± 15.2692.40 ± 12.5887.20 ± 11.5794.40 ± 10.50
899.80 ± 0.1996.42 ± 1.0899.77 ± 0.2398.81 ± 1.2699.70 ± 0.2899.00 ± 1.2999.93 ± 0.1599.81 ± 0.2099.81 ± 0.3399.98 ± 0.07
911.88 ± 12.6433.89 ± 10.0877.78 ± 16.8536.67 ± 25.7225.00 ± 14.3359.44 ± 29.0917.22 ± 10.3847.22 ± 20.5381.44 ± 12.5385.00 ± 14.07
1070.01 ± 1.8877.50 ± 3.2194.82 ± 0.7087.79 ± 2.2190.16 ± 1.3091.35 ± 3.9193.61 ± 2.2492.75 ± 1.3595.20 ± 0.6495.52 ± 1.05
1184.77 ± 2.9484.36 ± 1.1198.34 ± 0.4296.32 ± 1.5595.48 ± 1.2096.87 ± 1.2696.82 ± 1.4297.82 ± 0.7898.55 ± 0.3998.62 ± 0.32
1253.74 ± 5.7944.49 ± 2.3597.51 ± 0.9586.24 ± 3.2894.70 ± 2.0088.15 ± 5.1191.39 ± 3.0588.26 ± 1.6597.28 ± 1.6193.93 ± 1.62
1398.82 ± 0.6597.41 ± 0.9397.62 ± 0.9798.65 ± 1.1797.46 ± 1.6498.32 ± 1.8599.51 ± 0.2996.76 ± 1.9596.76 ± 2.5696.86 ± 1.34
1496.14 ± 0.3695.36 ± 0.7598.84 ± 0.3997.41 ± 0.9298.43 ± 0.6897.34 ± 1.4899.46 ± 0.4098.60 ± 0.6198.40 ± 0.4999.50 ± 0.21
1576.82 ± 3.8966.97 ± 2.9186.02 ± 3.1684.99 ± 3.8987.41 ± 2.6887.44 ± 8.3470.37 ± 3.3586.25 ± 3.1982.82 ± 3.0791.50 ± 1.52
16100.00 ± 0.0099.52 ± 1.0998.33 ± 0.7991.43 ± 5.5898.45 ± 1.4185.12 ± 8.8092.62 ± 3.3299.05 ± 1.9097.58 ± 1.6498.69 ± 1.64
OA (%)80.62 ± 1.6778.78 ± 0.8797.72 ± 0.3493.52 ± 0.7395.05 ± 0.4294.25 ± 0.4794.37 ± 0.5994.96 ± 0.7297.36 ± 0.2597.91 ± 0.41
AA (%)71.63 ± 2.5269.76 ± 1.9796.23 ± 1.1486.83 ± 2.7287.63 ± 1.1588.89 ± 1.2785.25 ± 1.1090.45 ± 2.0095.36 ± 1.1992.78 ± 0.84
Kappa (%)77.74 ± 1.9175.66 ± 1.0197.40 ± 0.3992.60 ± 0.8394.36 ± 0.4893.45 ± 0.5493.57 ± 0.6896.99 ± 0.1994.25 ± 0.8297.61 ± 0.47
Table 7. Performance comparison of ten methods on the Salinas dataset. (Bold values indicate the highest classification accuracy).
Class No. | CNN | Transformers | Our Methods
3DCNN | Spectralformer | SPRN | SSFTT | GAHT | Massformer | SQSFormer | MCTGCL | CACFTNet | Ours
198.17 ± 0.9995.09 ± 3.13100.00 ± 0.0098.56 ± 1.98100.00 ± 0.0096.72 ± 4.1199.15 ± 1.6799.82 ± 0.4299.80 ± 0.2799.88 ± 0.27
299.85 ± 0.0999.41 ± 0.41100.00 ± 0.0099.91 ± 0.1199.99 ± 0.0399.39 ± 1.0199.91 ± 0.1099.97 ± 0.0599.76 ± 0.31100.00 ± 0.00
392.49 ± 0.8694.15 ± 2.4599.56 ± 0.0598.83 ± 1.3099.40 ± 0.7396.94 ± 1.5598.00 ± 2.7497.25 ± 1.8796.56 ± 6.2299.67 ± 0.54
498.78 ± 0.6195.24 ± 0.4499.84 ± 0.1199.53 ± 0.5099.93 ± 0.0897.00 ± 2.8799.74 ± 0.1999.80 ± 0.2499.88 ± 0.1099.95 ± 0.15
589.73 ± 1.9887.17 ± 2.1296.72 ± 1.1197.36 ± 1.5997.60 ± 1.1897.03 ± 2.1297.70 ± 0.9395.72 ± 1.3896.33 ± 1.6796.36 ± 1.20
6100.00 ± 0.0099.99 ± 0.01100.00 ± 0.0099.98 ± 0.04100.00 ± 0.0099.76 ± 0.4999.99 ± 0.03100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00
799.79 ± 0.0998.00 ± 0.6799.98 ± 0.0199.87 ± 0.1799.89 ± 0.1099.25 ± 0.8898.98 ± 0.8199.60 ± 0.5099.95 ± 0.0699.83 ± 0.15
887.38 ± 1.1087.38 ± 2.9195.28 ± 0.9988.58 ± 2.2192.07 ± 2.0993.29 ± 2.3631.06 ± 1.5294.27 ± 1.6391.61 ± 2.1697.30 ± 1.29
998.46 ± 0.3097.81 ± 0.57100.00 ± 0.0099.38 ± 0.5999.33 ± 0.4799.68 ± 0.3699.17 ± 0.2499.65 ± 0.1299.48 ± 0.82100.00 ± 0.00
1093.86 ± 0.7091.94 ± 2.2999.40 ± 0.1797.55 ± 1.2899.57 ± 0.3397.57 ± 1.3797.78 ± 0.5799.58 ± 0.2198.12 ± 0.7799.94 ± 0.07
1170.17 ± 1.8176.57 ± 6.3579.97 ± 3.3771.66 ± 4.3580.81 ± 2.6294.52 ± 4.4794.06 ± 0.8898.22 ± 1.4283.48 ± 4.0698.07 ± 1.51
1298.63 ± 0.6197.93 ± 3.5199.96 ± 0.0699.79 ± 0.5599.97 ± 0.0899.34 ± 1.46100.00 ± 0.0099.87 ± 0.3699.98 ± 0.0599.99 ± 0.02
1399.99 ± 0.0398.83 ± 0.7099.90 ± 0.2099.73 ± 0.3099.82 ± 0.3699.55 ± 0.5799.39 ± 1.3799.61 ± 0.5399.78 ± 0.42100.00 ± 0.00
1499.01 ± 0.1895.90 ± 1.0199.19 ± 0.1898.15 ± 0.9799.60 ± 0.2697.66 ± 2.2396.83 ± 1.5495.59 ± 1.8798.36 ± 1.7999.60 ± 0.26
1571.73 ± 3.4875.81 ± 4.8493.78 ± 1.4486.82 ± 3.9089.93 ± 4.1792.57 ± 2.1189.81 ± 3.1589.77 ± 1.7492.58 ± 1.4893.88 ± 2.01
1691.84 ± 3.1095.05 ± 3.3898.70 ± 0.2697.78 ± 0.7897.92 ± 0.9996.05 ± 3.8998.47 ± 1.6797.65 ± 1.4598.47 ± 0.9199.32 ± 0.79
OA (%)91.19 ± 0.5291.26 ± 0.9097.52 ± 0.1494.70 ± 0.2696.28 ± 0.2996.54 ± 0.7296.00 ± 0.2796.79 ± 0.3696.38 ± 0.3798.74 ± 0.42
AA (%)93.12 ± 0.4392.89 ± 1.2597.67 ± 0.2095.84 ± 0.2097.24 ± 0.2797.27 ± 0.7997.50 ± 0.2297.88 ± 0.1997.19 ± 0.4099.22 ± 0.50
Kappa (%)90.19 ± 0.5990.26 ± 1.0097.24 ± 0.1694.10 ± 0.1995.86 ± 0.3296.15 ± 0.8095.55 ± 0.3096.42 ± 0.4095.97 ± 0.4498.60 ± 0.47
Table 8. Performance comparison of ten methods on the Botswana dataset. (Bold values indicate the highest classification accuracy).
Class No. | CNN | Transformers | Our Methods
3DCNN | Spectralformer | SPRN | SSFTT | GAHT | Massformer | SQSFormer | MCTGCL | CACFTNet | Ours
1100.00 ± 0.00100.00 ± 0.0097.49 ± 5.89100.00 ± 0.00100.00 ± 0.0098.19 ± 1.38100.00 ± 0.00100.00 ± 0.00100.00 ± 0.0099.75 ± 0.38
286.37 ± 3.3897.47 ± 4.9699.56 ± 0.8889.45 ± 8.6899.01 ± 2.9793.52 ± 7.6593.74 ± 3.9394.62 ± 6.38100.00 ± 0.00100.00 ± 0.00
399.12 ± 0.4495.35 ± 2.70100.00 ± 0.0098.98 ± 0.91100.00 ± 0.0097.04 ± 3.00100.00 ± 0.0098.81 ± 1.05100.00 ± 0.00100.00 ± 0.00
497.63 ± 1.3189.90 ± 6.9499.95 ± 0.1597.27 ± 7.1999.95 ± 0.1597.78 ± 4.10100.00 ± 0.0095.67 ± 3.90100.00 ± 0.00100.00 ± 0.00
594.71 ± 2.7568.97 ± 9.2098.06 ± 3.4396.90 ± 1.5099.13 ± 0.6389.38 ± 9.5095.00 ± 3.2688.18 ± 2.3294.83 ± 3.0498.66 ± 3.04
671.61 ± 4.0687.23 ± 4.1194.05 ± 2.9593.10 ± 6.3895.17 ± 0.5990.45 ± 4.3296.40 ± 1.5990.66 ± 3.7193.14 ± 4.8397.69 ± 0.98
799.61 ± 0.3099.61 ± 0.5699.79 ± 0.2999.66 ± 0.50100.00 ± 0.0098.93 ± 1.60100.00 ± 0.0099.96 ± 0.1399.91 ± 0.26100.00 ± 0.00
879.18 ± 5.7077.05 ± 7.48100.00 ± 0.0096.23 ± 4.61100.00 ± 0.0095.30 ± 9.1999.67 ± 0.9889.56 ± 10.08100.00 ± 0.00100.00 ± 0.00
997.24 ± 1.7782.65 ± 8.1999.26 ± 0.7197.39 ± 3.3898.87 ± 1.4198.48 ± 1.4799.58 ± 0.7093.64 ± 4.6999.47 ± 0.51100.00 ± 0.00
1098.25 ± 0.5191.52 ± 3.7599.42 ± 1.6098.83 ± 2.2199.91 ± 0.2799.10 ± 1.3298.97 ± 0.6499.91 ± 0.27100.00 ± 0.00100.00 ± 0.00
1197.78 ± 1.5198.51 ± 0.3898.58 ± 0.9198.95 ± 1.2799.16 ± 0.6196.65 ± 3.0899.02 ± 1.3298.11 ± 1.8997.82 ± 0.91100.00 ± 0.00
1299.39 ± 0.4893.13 ± 2.1099.88 ± 0.2599.88 ± 0.2599.63 ± 0.4194.79 ± 3.4999.39 ± 0.8793.01 ± 4.9199.20 ± 1.4099.75 ± 0.30
1399.13 ± 0.8897.59 ± 3.74100.00 ± 0.0098.80 ± 2.81100.00 ± 0.00100.00 ± 0.0095.85 ± 2.3099.54 ± 0.29100.00 ± 0.00100.00 ± 0.00
1480.94 ± 27.2587.41 ± 1.1898.35 ± 1.2099.06 ± 0.7198.47 ± 1.0689.76 ± 7.72100.00 ± 0.0098.12 ± 1.4199.18 ± 2.4799.88 ± 0.35
OA (%)93.96 ± 1.2890.39 ± 2.1298.80 ± 0.7997.75 ± 1.3399.23 ± 0.1896.10 ± 1.4798.48 ± 0.2795.76 ± 1.3198.61 ± 0.4499.48 ± 0.34
AA (%)92.93 ± 2.3890.46 ± 1.9898.88 ± 0.6997.46 ± 1.4699.24 ± 0.2895.67 ± 1.7698.40 ± 0.2395.70 ± 1.2798.79 ± 0.3999.57 ± 0.28
Kappa (%)93.45 ± 1.3989.59 ± 2.2998.70 ± 0.8697.56 ± 1.4499.17 ± 0.1995.78 ± 1.5998.36 ± 0.2995.41 ± 1.4298.50 ± 0.4799.44 ± 0.37
Table 9. Performance comparison of ten methods on the WHU-Hi-LongKou dataset. (Bold values indicate the highest classification accuracy).
Class No. | CNN | Transformers | Our Methods
3DCNN | Spectralformer | SPRN | SSFTT | GAHT | Massformer | SQSFormer | MCTGCL | CACFTNet | Ours
199.47 ± 0.0999.85 ± 0.0599.98 ± 0.0199.92 ± 0.0399.92 ± 0.0499.37 ± 0.3699.96 ± 0.0399.79 ± 0.1899.85 ± 0.2699.85 ± 0.14
287.99 ± 2.1086.82 ± 2.0096.84 ± 0.7597.12 ± 1.1596.35 ± 0.7393.85 ± 3.1188.88 ± 3.5593.12 ± 3.3098.46 ± 1.2498.53 ± 0.65
337.57 ± 6.1992.78 ± 2.4097.13 ± 1.2791.74 ± 3.8293.92 ± 3.2987.48 ± 4.8795.23 ± 1.9995.86 ± 5.4590.23 ± 8.0098.86 ± 1.50
498.70 ± 0.1897.45 ± 0.6299.42 ± 0.0798.68 ± 0.2199.23 ± 0.2398.17 ± 0.5498.54 ± 0.3798.79 ± 0.2398.93 ± 0.3699.29 ± 0.31
516.14 ± 7.8172.02 ± 7.8593.35 ± 1.1789.24 ± 4.7886.87 ± 5.8983.41 ± 6.8871.98 ± 9.4278.61 ± 9.8083.78 ± 3.6496.72 ± 1.02
698.88 ± 0.1792.33 ± 3.6399.84 ± 0.1599.36 ± 0.3599.71 ± 0.1297.47 ± 2.3499.87 ± 0.0998.79 ± 0.6199.21 ± 0.6999.89 ± 0.09
799.99 ± 0.0099.97 ± 0.0199.99 ± 0.0099.96 ± 0.0399.95 ± 0.0499.55 ± 0.3699.99 ± 0.0099.70 ± 0.2099.99 ± 0.0199.84 ± 0.13
880.26 ± 2.7077.60 ± 3.8193.77 ± 1.6893.93 ± 2.8391.52 ± 4.1682.35 ± 6.6790.66 ± 3.3480.64 ± 2.7994.79 ± 2.2494.01 ± 2.32
968.60 ± 1.6064.08 ± 2.6587.90 ± 0.6986.57 ± 3.7688.20 ± 2.0684.37 ± 5.7093.31 ± 1.7385.67 ± 3.2385.82 ± 4.4191.01 ± 2.57
OA (%)94.83 ± 0.3195.82 ± 0.3098.97 ± 0.0898.52 ± 0.2598.61 ± 0.2297.25 ± 0.4097.94 ± 0.2697.60 ± 0.4098.85 ± 0.2499.18 ± 0.29
AA (%)76.40 ± 1.6186.99 ± 0.8996.47 ± 0.2995.17 ± 1.1695.07 ± 1.1991.78 ± 1.6293.16 ± 1.2192.33 ± 1.7294.56 ± 1.1697.57 ± 1.02
Kappa (%)93.15 ± 0.4194.49 ± 0.3998.65 ± 0.1098.05 ± 0.3398.17 ± 0.2996.38 ± 0.5297.29 ± 0.3496.84 ± 0.5298.05 ± 0.3298.93 ± 0.37
Table 10. Performance comparison of ten methods on the WHU-Hi-HongHu dataset. (Bold values indicate the highest classification accuracy).
Class No. | CNN | Transformers | Our Methods
3DCNN | Spectralformer | SPRN | SSFTT | GAHT | Massformer | SQSFormer | MCTGCL | CACFTNet | Ours
176.90 ± 1.4493.92 ± 1.3595.85 ± 1.1483.59 ± 3.2094.85 ± 1.2495.01 ± 1.9982.03 ± 8.2194.67 ± 2.5794.30 ± 3.7390.15 ± 3.86
276.65 ± 1.6476.89 ± 2.5376.71 ± 3.4653.60 ± 15.9678.62 ± 3.0772.01 ± 7.2153.12 ± 18.8780.36 ± 4.8641.89 ± 16.9777.20 ± 11.87
392.76 ± 0.4387.32 ± 1.0294.83 ± 0.9293.86 ± 1.4292.72 ± 1.1486.40 ± 1.5693.82 ± 2.0789.54 ± 2.5193.10 ± 1.9392.47 ± 2.00
499.28 ± 0.5798.84 ± 0.2499.34 ± 0.1599.34 ± 0.2899.36 ± 0.1899.57 ± 0.1499.76 ± 0.1899.51 ± 0.1899.50 ± 0.2499.56 ± 0.36
529.62 ± 13.7473.60 ± 5.5475.27 ± 4.4647.84 ± 11.3779.20 ± 5.4973.49 ± 5.3030.32 ± 10.7774.12 ± 5.7022.52 ± 10.1478.59 ± 3.81
692.03 ± 0.4592.02 ± 0.7795.79 ± 0.8394.57 ± 1.1294.10 ± 1.8494.35 ± 0.8793.19 ± 1.4394.67 ± 0.9093.44 ± 3.0096.83 ± 1.49
779.21 ± 2.5984.83 ± 3.1586.50 ± 1.1883.85 ± 3.5186.22 ± 3.0986.80 ± 2.3785.15 ± 4.0487.46 ± 2.8887.49 ± 3.2588.28 ± 4.11
82.89 ± 1.3514.25 ± 3.0927.92 ± 2.800.54 ± 0.8225.11 ± 4.5626.70 ± 3.770.84 ± 0.7130.39 ± 6.080.67 ± 1.1728.14 ± 10.32
994.14 ± 0.7195.44 ± 1.1999.00 ± 0.4097.17 ± 1.0997.94 ± 0.7595.16 ± 0.9994.87 ± 1.6195.34 ± 1.0497.41 ± 1.0597.64 ± 1.76
1057.94 ± 4.5661.05 ± 3.6982.52 ± 2.3177.24 ± 5.1584.77 ± 4.1879.59 ± 2.7459.13 ± 5.5782.72 ± 2.9165.00 ± 8.0586.67 ± 3.64
1143.47 ± 3.8860.13 ± 3.3671.26 ± 4.4638.11 ± 10.2367.88 ± 5.6471.87 ± 5.5958.66 ± 5.0369.78 ± 3.2946.42 ± 8.3773.69 ± 7.07
1250.77 ± 3.2656.65 ± 7.0066.96 ± 2.2850.18 ± 14.0670.77 ± 6.0971.01 ± 3.6033.19 ± 15.3270.26 ± 5.9032.75 ± 21.5072.08 ± 5.08
1376.29 ± 2.5072.24 ± 5.2178.40 ± 2.7180.66 ± 4.4280.05 ± 1.8682.27 ± 1.5078.93 ± 3.6982.16 ± 1.9576.95 ± 4.2483.12 ± 2.28
1453.08 ± 2.6578.27 ± 3.3590.06 ± 2.0469.82 ± 7.5183.57 ± 3.9884.12 ± 5.6955.74 ± 11.9786.33 ± 6.1768.73 ± 19.3388.41 ± 3.25
157.49 ± 3.2846.58 ± 10.0661.13 ± 9.390.66 ± 1.5949.38 ± 9.8342.42 ± 9.150.00 ± 0.0041.20 ± 9.710.00 ± 0.0045.18 ± 17.49
1686.56 ± 3.3093.45 ± 3.0996.36 ± 0.7693.75 ± 2.8593.23 ± 2.6494.62 ± 2.4390.47 ± 8.2992.92 ± 6.4296.55 ± 2.6397.19 ± 2.63
1743.30 ± 2.9352.17 ± 14.8870.12 ± 1.0732.36 ± 25.5264.03 ± 5.8866.37 ± 6.3223.88 ± 27.1167.71 ± 3.213.34 ± 5.4372.97 ± 5.30
1811.31 ± 9.3353.55 ± 6.1158.73 ± 10.211.79 ± 4.8368.09 ± 9.7670.49 ± 8.443.96 ± 9.2364.96 ± 9.200.07 ± 0.1373.42 ± 11.84
1979.74 ± 1.3981.63 ± 3.0593.39 ± 1.5166.44 ± 18.0390.16 ± 3.8686.46 ± 4.0282.11 ± 5.9085.86 ± 5.3983.30 ± 5.8491.32 ± 4.24
2068.44 ± 4.6582.74 ± 2.7892.30 ± 2.4816.15 ± 18.0379.94 ± 8.1372.92 ± 13.8356.23 ± 15.8470.85 ± 10.7558.35 ± 16.7174.14 ± 15.09
210.00 ± 0.0011.12 ± 3.770.00 ± 0.000.00 ± 0.000.08 ± 0.1511.97 ± 5.640.00 ± 0.0014.42 ± 6.220.00 ± 0.0011.43 ± 5.45
2241.73 ± 7.8365.56 ± 7.9175.38 ± 6.6639.93 ± 16.1972.98 ± 11.3477.58 ± 6.6646.07 ± 15.8883.84 ± 5.2228.69 ± 10.0078.79 ± 10.73
OA (%)83.97 ± 0.3087.38 ± 0.3691.41 ± 0.2285.22 ± 0.5590.84 ± 0.4190.51 ± 0.3984.82 ± 0.2990.88 ± 0.5484.88 ± 0.6692.57 ± 0.33
AA (%)57.44 ± 1.1469.65 ± 0.7976.72 ± 1.0255.52 ± 1.8475.14 ± 1.5674.60 ± 1.2855.52 ± 1.6575.41 ± 1.0554.11 ± 1.7879.01 ± 1.16
Kappa (%)79.56 ± 0.3584.02 ± 0.4589.11 ± 0.2881.18 ± 0.7288.40 ± 0.5287.97 ± 0.4980.63 ± 0.3788.46 ± 0.6880.73 ± 0.8490.58 ± 0.43
Table 11. Comparison of models across different metrics for the Indian Pines, Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets.
Methods | Indian Pines | Salinas | Botswana | WHU-Hi-LongKou | WHU-Hi-HongHu
 | Parameters/k | FLOPs/MMac | Parameters/k | FLOPs/MMac | Parameters/k | FLOPs/MMac | Parameters/k | FLOPs/MMac | Parameters/k | FLOPs/MMac
3DCNN263.59645.19568263.59646.08224189.59432.79048201.98960.90368449.5260.90
Spectralformer342.64915.61216352.40516.235776227.8448.283392540.64428.299072544.8628.30
SPRN283.34821.023393281.05521.023393281.05520.892897248.88721.178785186.5611.18
SSFTT931.84816.156424950.2816.482441678.27811.6735761253.95321.861256125021.86
GAHT830.813.888512972.62414.304128800.71811.9714561514.12132.713279151032.71
Massformer304.5125.54316.2725.54301.1125.54327.6425.54327.6425.54
SQSFormer421.96819.774424.27220.163776390.0314.420224368.6525.587584463.0626.85
MCTGCL271.4433.28271.4433.28272.1833.28273.45633.28275.5433.28
CACFTNet327066.71336069.44202335.694930120.564930120.57
Ours649.98138.151188204.60536.926189.3724.797273.45631.3309274.3031.33
Table 12. Comparison of CHPA and PCFormer on five datasets. (Bold font indicates the highest classification performance).
Datasets | CHPA | PCFormer | OA (%) | AA (%) | Kappa (%)
Indian Pines | × | × | 96.25 ± 0.83 | 92.01 ± 1.71 | 95.73 ± 0.95
 | 🗸 | × | 97.07 ± 2.32 | 89.40 ± 2.83 | 96.43 ± 0.82
 | × | 🗸 | 97.46 ± 0.40 | 91.35 ± 0.69 | 97.10 ± 0.46
 | 🗸 | 🗸 | 97.91 ± 0.41 | 92.78 ± 0.84 | 97.61 ± 0.47
Salinas | × | × | 96.58 ± 0.46 | 97.74 ± 0.52 | 96.19 ± 0.96
 | 🗸 | × | 97.69 ± 1.07 | 98.37 ± 1.19 | 97.43 ± 1.20
 | × | 🗸 | 98.32 ± 0.16 | 98.99 ± 0.14 | 98.13 ± 0.18
 | 🗸 | 🗸 | 98.74 ± 0.42 | 99.22 ± 0.50 | 98.47 ± 0.17
Botswana | × | × | 97.29 ± 1.72 | 97.24 ± 1.65 | 97.06 ± 1.87
 | 🗸 | × | 98.81 ± 0.72 | 98.90 ± 0.66 | 98.37 ± 0.77
 | × | 🗸 | 98.32 ± 0.16 | 98.99 ± 0.14 | 98.13 ± 0.18
 | 🗸 | 🗸 | 99.48 ± 0.34 | 99.57 ± 0.28 | 99.44 ± 0.37
WHU-Hi-LongKou | × | × | 98.44 ± 0.44 | 94.47 ± 1.00 | 97.95 ± 0.31
 | 🗸 | × | 98.62 ± 0.34 | 95.61 ± 2.28 | 98.13 ± 0.27
 | × | 🗸 | 99.03 ± 0.16 | 97.30 ± 0.87 | 98.60 ± 0.20
 | 🗸 | 🗸 | 99.18 ± 0.29 | 97.57 ± 1.02 | 98.93 ± 0.37
WHU-Hi-HongHu | × | × | 90.13 ± 0.31 | 73.14 ± 1.16 | 87.48 ± 0.40
 | 🗸 | × | 91.49 ± 0.37 | 75.33 ± 1.55 | 89.22 ± 0.48
 | × | 🗸 | 91.68 ± 0.37 | 75.85 ± 1.09 | 89.73 ± 0.48
 | 🗸 | 🗸 | 92.57 ± 0.33 | 79.01 ± 1.16 | 90.58 ± 0.43
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
