1. Introduction
Hyperspectral imaging (HSI) is an advanced remote sensing technique capable of capturing detailed spectral information of surface objects in hundreds of continuous spectral bands. Each pixel not only carries spatial location information, but also contains a unique spectral response curve [
1], making different materials and features highly distinguishable. Due to its rich data properties, HSI finds extensive application in various fields, which include agricultural monitoring, environmental management, urban planning, geological exploration, and national-security-related defense [
2]. As a core task in hyperspectral data processing, HSI classification aims to assign an accurate land cover or object class to each pixel [
3]. Nevertheless, this task encounters significant challenges, owing to the complexity of high-dimensional data and the limited availability of labeled samples. Therefore, the development of efficient feature extraction and classification methods has become a key direction in current HSI research.
Early HSI classification methods relied heavily on hand-designed features combined with shallow classifiers, such as the spectral angle mapper (SAM), support vector machines (SVM) [
4], and K Nearest Neighbor [
5]. These methods mainly use spectral information and classify pixels according to the similarity of their spectral curves. However, they ignore spatial contextual information, which limits classification accuracy in scenes with complex feature distributions. To overcome the limitation of relying on spectral information alone, researchers have proposed joint spatial–spectral methods such as Morphological Profiles [
6], Gabor Filtering [
7], and Markov Random Fields [
8]. These methods enhance classification performance by fusing spatial texture with spectral features. Although they have greatly improved HSI classification accuracy, they rely heavily on expert feature design and struggle to adapt to multi-scale spatial structures, leading to suboptimal results in complex scenes.
In recent years, the rapid evolution of deep learning techniques has brought new breakthroughs to HSI classification research. Convolutional neural networks (CNNs) [
9] have become an efficient feature extraction method in HSI by virtue of their advantages in local feature extraction, but they have also gradually exposed some inherent limitations [
10]. Firstly, the local receptive field of CNNs is limited and the convolution can only see a fixed neighborhood, which makes it difficult to capture global contextual information [
11]. Secondly, high-dimensional spectral bands generate a huge number of parameters, so the spectral dimension often has to be reduced beforehand. However, this reduction inevitably discards some spectral information. Additionally, stacking an excessive number of convolutional layers can cause the model to overfit, limiting its ability to fully leverage the spectral information. To compensate for the limited local receptive field, researchers have tried allowing CNNs to extract features at a larger scale or at different scales [
12]. For example, spectral–spatial networks such as 3D-CNN were proposed by combining the spectral dimension with the spatial dimension through convolution, as a way to enhance the portrayal of this joint spectral–spatial pattern [
13]. Other methods acquire spatial context information at different scales through multi-scale convolution or pyramid structures to take into account both local details and large-scale structures, and these improvements help to enhance the adaptability of CNNs to complex scenes [
14]. To this end, Liu et al. [
15] proposed a multiscale large kernel asymmetric convolutional network, which combines a spectral feature extraction module and a multiscale large kernel asymmetric convolutional module to efficiently capture both local and global spatial features. Guan et al. [
16] introduced a dense pyramidal residual network that utilizes a combined spectral–spatial attention mechanism, enabling it to capture intricate spectral and spatial features in HSI. Gong et al. [
17] introduced a novel Multiscale Feature Fusion Convolutional Network, incorporating a multiscale convolutional architecture designed to extract both spectral and spatial features. Although these 3D-CNN architectures show strong performance in joint spatial and spectral modeling, their local convolution kernels have difficulty capturing long-range dependencies [
18], limiting the global modeling capability of the model.
With the introduction of the Transformer model into the field of HSI [
19], its powerful global modeling capabilities have been widely applied. However, the detrimental effect of small training sets on Transformer training has become prominent: when training samples are insufficient, the model easily overfits or trains unstably [
20]. To compensate for the limitations of Transformers in local feature extraction, many studies have begun to adopt CNN–Transformer hybrid architectures, leveraging CNN’s powerful local feature extraction ability and the global modeling advantages of Transformer. Hong et al. [
21] proposed the SpectralFormer model to further exploit Transformer’s modeling advantages on spectral sequences. In addition, researchers have successively proposed many innovative Transformer-based models in the field of HSI. Zhou et al. [
22] proposed a dual-branch convolutional Transformer network with an efficient interactive self-attention mechanism. Sun et al. [
23] introduced a memory enhancement mechanism for spatial–spectral feature fusion. Chen et al. [
24] explored the role of the center pixel in a Transformer. Zhao et al. [
25] proposed a lightweight model based on group convolution. Cheng et al. [
26] proposed a multi-scale spatial–spectral information interaction Transformer. Wu et al. [
27] proposed a combination of the MLP-mixer and graph convolution in an enhanced Transformer. Shu et al. [
28] effectively enhanced feature extraction and fusion capabilities through a dual feature aggregation module and cross-attention aggregation mechanism. These models significantly improved classification accuracy by introducing hybrid structures, attention mechanisms, or multi-scale feature fusion strategies. These innovations demonstrate that the CNN–Transformer hybrid architecture can effectively overcome the limitations of single architectures by complementing each other’s advantages.
The Mamba architecture has also demonstrated excellent performance in spatial–spectral dependency modeling. For example, the SSUMamba model proposed by Fu et al. [29] excels in HSI denoising tasks; however, its primary advantage lies in denoising rather than classification accuracy. The Mamba architecture has lower computational complexity, but lacks the flexibility and diversity in feature extraction offered by the CNN–Transformer hybrid architecture. In HSI classification, the CNN–Transformer hybrid architecture can better combine local details with global information, providing more precise classification results.
Therefore, despite the Mamba architecture’s high computational efficiency and performance in certain tasks, the CNN–Transformer hybrid architecture is more suitable for HSI classification due to its stronger local feature extraction capabilities and global modeling advantages. By optimizing the fusion of local and global features, the CNN–Transformer architecture leverages the strengths of both and improves classification accuracy. For this reason, this study selected the CNN–Transformer hybrid architecture as the core framework for further exploration of its application potential and advantages in HSI classification.
Meanwhile, prompting, as a new paradigm for large model tuning, has shown great potential for efficient feature guidance. For example, Zhang et al. [
30] proposed an efficient fine-tuning method based on zero-initialized attention, which efficiently controls the feature flow through prompting factors. This idea inspired us to introduce prompting factors into the HSI task and further improve the classification performance by designing an adaptive prompting factor mechanism to guide the model to achieve more accurate feature selection during the fusion of spectral and spatial features.
The methods mentioned above have made significant progress in fusing spectral and spatial features, but there is still much room for improvement in fully extracting spatial and spectral information while maintaining the consistency of the local spatial structure. To address these problems, we propose a Prompt-Gated Transformer with Spatial–Spectral Enhancement (PGTSEFormer) for HSI classification, which combines the advantages of CNNs and Transformers and refines the attention mechanism. The network structure mainly consists of two parts: the Channel Hybrid Positional Attention (CHPA) module and the Prompt Cross-Former (PCFormer). The CHPA module exploits its dual-branch design to mine the deep spectral and spatial information in HSI and perform feature fusion. On this basis, we use PCFormer to establish global contextual links and control the feature flow through the prompting factors and gating mechanism, thereby enhancing the model’s ability to express long-range feature dependencies. The main contributions of this paper are as follows:
In this paper, a Prompt-Gated Transformer with Spatial–Spectral Enhancement is proposed. The network not only makes full use of the global feature extraction capability of the Transformer, but also introduces prompting factors into the field of HSI classification.
To compensate for Transformer’s insufficient modeling of spatial structure, this paper proposes a Channel Hybrid Positional Attention (CHPA) module for HSI. The positional attention introduced by this module can enhance the extraction of spatial structure information, so that the model focuses on the spatial continuity of similar features and the boundary of dissimilar features. The channel weighting mechanism of CHPA can filter out unimportant channels and highlight key spectral features. This helps alleviate the curse of dimensionality in high-dimensional spectral data and improves the model’s ability to utilize effective spectral information.
In order to solve the problem that traditional self-attention mechanisms tend to be globally relevant and ignore local spectral–spatial details, this paper proposes Prompt Cross-Former (PCFormer) for HSI. The PCFormer includes AttenMix and PGFormer Block. In the PGFormer Block, we design the Prompt-Gated Cross Attention (PGCA), which uses a learnable prompt-gating mechanism to adaptively pass training prompts into the self-attention layer of the Transformer, to guide the attention to focus on effective features.
The main research of this paper will be presented in detail in the subsequent sections.
Section 2 details our proposed methodology.
Section 3 describes the experimental setup in detail, as well as some advanced technological approaches.
Section 4 shows the experimental results and analyses them in depth.
Section 5 discusses the effects of different parameter configurations on the model performance and the ablation experiments. Finally,
Section 6 summarizes our contributions and looks to the future.
2. Methodology
In this section, the PGTSEFormer network for HSI is introduced.
The overall structure of the PGTSEFormer network and its application in HSI classification are described first. Sections 2.1 and 2.2 then give the basic structure and principles of CHPA and PCFormer, respectively.
The proposed PGTSEFormer is shown in Figure 1. Its overall framework consists of the following five parts: the channel tuning part, the Channel Hybrid Positional Attention (CHPA) module, the Prompt Cross-Former (PCFormer), the global average pooling (GAP) layer, and the fully connected (FC) layer with a softmax classifier.
In the PGTSEFormer framework, the network accepts a series of 3D data cubes as input. Since datasets vary in their spectral dimensions, we first align these dimensions using a channel tuning mechanism to ensure consistent processing across datasets. Specifically, a series of 2D convolutional layers is employed to calibrate the channels and simultaneously extract preliminary spatial–spectral features. Subsequently, the CHPA module is introduced, which adaptively directs the model to focus on key channel information and spatial regions, suppressing irrelevant or interfering information.
After the initial feature extraction, the features are passed to the PCFormer, which is composed of two core modules: AttenMix and the PGFormer Block. The AttenMix module integrates depthwise convolution (DWConv), pointwise convolution (PWConv), a weighted branching structure, and channel blending. This module serves as a pre-processing step that prepares features for global feature extraction by the Transformer. It mainly extracts local spatial–spectral features using a lightweight convolutional structure and conducts interactive channel blending through a channel attention mechanism, thereby improving the spatial representation of the model. Following the AttenMix module, the PGFormer Block further enhances the model’s ability to model global context. This module incorporates Prompt-Gated Cross Attention (PGCA), which employs group partitioning and hierarchical strategies to model spatial attention, both within and across groups. These strategies ensure spatial structural consistency, while enabling the capture of long-range contextual dependencies across regions and groups. As a result, the model’s ability to interpret complex spatial relationships is enhanced. Finally, the extracted feature maps are reduced to 1 × 1 spatial dimensions by GAP and flattened into one-dimensional vectors, which are passed into the FC layer for the final classification prediction.
2.1. Channel Hybrid Positional Attention Module
HSI is characterized by high dimensionality and redundancy [
31], placing greater demands on the model’s feature extraction capability. In order to improve the network’s ability to perceive key spectral channels and pay attention to important spatial regions, this paper proposes an efficient two-branch attention mechanism, CHPA, and integrates it into the shallow feature extraction stage of the network. As shown in
Figure 1, the CHPA module combines channel attention with local spatial position attention and incorporates cross-attention interaction to enhance feature representations. This design strengthens the feature characterization capability from both spectral and spatial dimensions, thereby improving the model’s discriminative performance on complex feature classes. Given an input feature map
$X \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of input channels and $H$ and $W$ represent the height and width of the feature map, the CHPA module first equally splits the input
X along the channel dimension into two parts:
where
is used for channel attention modeling, while
is used for positional attention modeling.
To capture the importance distribution along the spectral dimension, the channel attention branch first applies adaptive global average pooling to extract global contextual information:
where
denotes the aggregated descriptor for each channel and
refers to the value at spatial position
in the feature map
.
To facilitate subsequent processing, the aggregated descriptor is reshaped into a 1D vector:
then, it is passed into a dynamic convolution layer to perform cross-channel information interaction and generate channel attention weights:
where
is a 1D convolution with the kernel size adaptively adjusted based on the number of channels, allowing it to adapt to multi-scale cross-channel relations with a wide receptive field.
denotes the sigmoid function. The resulting attention weights are
, which are reshaped back to
and applied to the input feature
:
where ⊙ denotes element-wise multiplication. The process dynamically learns the importance weights of different spectral channels to highlight highly responsive channels and suppress redundant channels.
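The channel attention branch described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the uniform (unlearned) 1D kernel, and the ECA-style kernel-size rule with hypothetical parameters `gamma` and `beta` are all assumptions, since the text only states that the kernel size adapts to the number of channels.

```python
import math


def adaptive_kernel_size(channels, gamma=2, beta=1):
    """Odd kernel size that grows with log2(channels).

    ECA-style rule, assumed here for illustration; the paper's exact rule
    is not given in the text.
    """
    k = int(abs((math.log2(channels) + beta) / gamma))
    return k if k % 2 == 1 else k + 1


def channel_attention(x):
    """Reweight channels of x, a nested list shaped [C][H][W].

    Returns (reweighted feature map, per-channel attention weights).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Global average pooling: one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (h * w) for ch in x]
    # 1D convolution across neighbouring channels. A uniform kernel stands in
    # for the learned weights; zero padding keeps the output length equal to C.
    k = adaptive_kernel_size(c)
    pad = k // 2
    padded = [0.0] * pad + z + [0.0] * pad
    conv = [sum(padded[i + j] for j in range(k)) / k for i in range(c)]
    # Sigmoid maps each response to an attention weight in (0, 1).
    weights = [1.0 / (1.0 + math.exp(-v)) for v in conv]
    # Element-wise reweighting: every pixel of channel i is scaled by weights[i].
    out = [[[weights[i] * v for v in row] for row in x[i]] for i in range(c)]
    return out, weights
```

In this sketch each channel is scaled by a single scalar, so spatial structure inside a channel is untouched; only the relative emphasis between spectral channels changes.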
In HSI, spatial structure is also important for feature recognition. To further improve the model’s capacity to grasp spatial structures, we introduce a local spatial position attention branch. This branch models long-range spatial dependencies using an innovative multi-way compression mechanism, designed to capture local positional relationships within feature maps. Given an input feature map,
. In order to efficiently extract spatial structure features, feature compression is applied to the input feature map along the horizontal and vertical directions, respectively:
where
and
, respectively, retain the spatial contextual information along the horizontal and vertical dimensions. Compared to traditional global pooling, this decomposed compression approach can effectively capture directional features. Next,
is transposed to match the shape of
for subsequent concatenation:
then, concatenate and apply a non-linear transformation:
where
is the dimensionality-reduction convolution kernel, which reduces the number of parameters by compressing the channels by a factor of 8, and
is the bias term. The fused features are then split again into two directional representations, which are used to generate attention weights in the horizontal and vertical directions:
Each directional feature is then processed separately through a 1 × 1 convolution, followed by a sigmoid function:
where
represents the horizontal attention projection matrix and
represents the vertical attention projection matrix.
and
are the corresponding bias terms. All channels share the same set of
and
. The two attention maps are then combined through matrix multiplication to establish attention correlations between rows and columns:
The local spatial positional attention branch aims to capture the local structural information in an image from the spatial dimension, especially the edges, textures, and their spatial relationships in the image. The core idea of this branch is to learn positional dependencies along different spatial directions by applying directional average pooling and convolutional operations. Additionally, shared weights are employed to enable efficient long-range dependency modeling, which in turn enhances the model’s ability to recognize complex spatial patterns.
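The directional pooling and row–column combination described for this branch can be sketched as follows. This is a simplified illustration under stated assumptions: the learned 1 × 1 convolutions and the channel-reduction step are omitted, the pooled descriptors are additionally averaged over channels, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def positional_attention(x):
    """Apply a row/column positional attention map to x, shaped [C][H][W].

    Returns (reweighted feature map, H x W attention map).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Directional average pooling (also averaged over channels for brevity):
    # row_desc[i] is the mean of row i, col_desc[j] is the mean of column j.
    row_desc = [sum(x[k][i][j] for k in range(c) for j in range(w)) / (c * w)
                for i in range(h)]
    col_desc = [sum(x[k][i][j] for k in range(c) for i in range(h)) / (c * h)
                for j in range(w)]
    # The learned projections are skipped here; only the sigmoid is kept.
    a_rows = [sigmoid(v) for v in row_desc]
    a_cols = [sigmoid(v) for v in col_desc]
    # (H x 1) times (1 x W) matrix product, i.e. an outer product,
    # links every row weight with every column weight.
    attn = [[a_rows[i] * a_cols[j] for j in range(w)] for i in range(h)]
    out = [[[x[k][i][j] * attn[i][j] for j in range(w)] for i in range(h)]
           for k in range(c)]
    return out, attn
```

Because the same H × W map is applied to every channel, the sketch mirrors the weight sharing across channels mentioned in the text, while the outer product lets a position be emphasized only when both its row and its column respond strongly.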
Through the weighted fusion of the channel attention branch and the local spatial location attention branch, the model is able to extract richer feature representations from both the spectral and spatial dimensions. In summary, the CHPA module strengthens the model’s capacity to recognize intricate structures and subtle category differences, thereby boosting its overall classification performance.
2.2. Prompt Cross-Former
In each PCFormer of the PGTSEFormer network, the input features are first pre-processed by the AttenMix module to enhance their representation. This module integrates several spatial and channel enhancement strategies, including two sets of convolutional operations, a dual-branch dynamic weight fusion mechanism and a channel mixing module. Specifically, AttenMix first recombines the input features across bands via DWC and PWC, then maps them into a high-dimensional feature space to enhance feature representation. Subsequently, the module further extracts key features through a dual channel-space branching mechanism. Among them, the channel branch utilizes global average pooling to extract spectral statistical information and dynamically adjusts channel weights to enhance local saliency in the feature map, whereas the spatial branch applies a 3D convolutional kernel to model inter-band spatial correlations and capture local structural features more effectively. Through this structure, the model can efficiently mine local spectral–spatial features within neighboring band groups. The introduced channel mixing operation facilitates information interaction across channels, thereby enriching the diversity of feature representations. In addition, it increases the stochasticity during training, which in turn improves the model’s adaptability and generalization capability.
After the feature enhancement process using AttenMix, the features are fed into the PGFormer Block for global context modeling. In the PGFormer Block, we introduce learnable Group Tokens as semantic proxies for different regions, to enable cross-region feature interaction and fusion through the cross-group attention mechanism. Meanwhile, the module introduces a dynamic Prompt Factor to regulate the query vectors through a gating mechanism, so that the attention distribution is self-adapted to the semantic structure and distribution pattern of the input features. The synergistic design of the module preserves the capacity for local detail extraction, while simultaneously activating global semantic associations, and establishes a multi-level, multi-scale feature enhancement framework. This design significantly improves the model’s representational capacity and classification accuracy.
2.2.1. AttenMix
HSI contains a large number of spectral channels and exhibits highly complex spatial structures [
32]. Although convolutional operations can effectively extract local spatial features, they often introduce redundancy when handling high-dimensional spectral data and are inherently constrained by fixed kernel sizes. To address these limitations, we designed the AttenMix module as a structure-aware component placed before each Transformer layer, to facilitate efficient local–global feature interaction. Its structure is shown in
Figure 1. Firstly, the input feature
is processed using DWC followed by PWC. Specifically, the processing consists of two stages:
where * denotes the convolution operation.
is the convolution weight for the depthwise convolution (DWC), which extracts spatial information for each channel independently.
is the convolution weight for PWC, which can help the model to achieve spectral domain fusion between different channels and strengthen cross-channel feature interaction. The processed input features are then split into two branches along the channel dimension. The channel branches are weighted by a subset of globally pooled channels via learnable weights,
and bias
:
where
performs a linear transformation of the channel information and learns an enhancement magnitude for each channel, thereby regulating how strongly each channel contributes to the final channel attention.
, on the other hand, introduces an independent offset for each channel’s attention, to enhance the nonlinear representation of the model. The spatial branch is modulated by learnable weights
and
for each grouped spatial subset after GroupNorm:
The
moderates the normalized spatial features by highlighting salient regions and suppressing redundancy, while
provides spatial group-specific offsets and jointly controls the feature map shape with
. These learnable parameters are automatically updated by backpropagation during the training process.
Then, we feed the feature X, obtained by concatenating the outputs of these two branches along the channel dimension, into the channel shuffle stage. In channel shuffling, the channels are rearranged in a grouped manner to achieve inter-group information interaction. This design breaks the locality between channels and enhances the flow of feature information between groups, so that global information can be integrated more fully.
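The grouped channel rearrangement can be illustrated with a short sketch. It uses the common channel-shuffle pattern (view the channels as a groups × channels-per-group grid, transpose, flatten); whether AttenMix uses exactly this ordering is an assumption, and the function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    `channels` is a list of per-channel items (indices or feature maps).
    The list is viewed as [groups][per_group], transposed to
    [per_group][groups], and flattened again.
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six channels in two groups: [0, 1, 2 | 3, 4, 5] -> [0, 3, 1, 4, 2, 5]
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

After the shuffle, every consecutive block of channels contains members of both original groups, which is what allows a following grouped operation to mix information between them.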
2.2.2. PGFormer Block
Although conventional ViT performs well in capturing global dependencies, it still has limitations in dealing with images with significant local features [
33]. For this reason, a prompt-gated Transformer structure, the PGFormer Block, was designed in this paper, aiming to enhance the model’s ability to model local spatial information. The structure is shown in
Figure 2a. The structure consists of a series of normalization layers, a convolutional layer, and PGCA, which work in concert to improve the accuracy of feature extraction. We embed the PGCA module in each Transformer encoder layer to enhance the model’s adaptive perception in the spatial dimension.
As shown in
Figure 2b, the PGCA module consists of four key steps: local spatial partitioning, intra-group self-attention modeling, inter-group context fusion, and Prompt-Gated mechanism. First, the input features are divided into multiple local regions through the spatial partitioning mechanism, to maintain the consistency of the local structure. Subsequently, the intra-group self-attention mechanism is employed to capture the key spatial relationships within the regions. On this basis, the inter-group context fusion module further models the global dependencies between different regions. Finally, the attention path is dynamically adjusted through the introduction of the Prompt-Gated mechanism, so that the model is capable of adaptively selecting attention regions according to the distribution of input features, thus achieving more accurate spatial feature modeling.
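The first step, local spatial partitioning, can be illustrated for a single-channel feature map with the sketch below. The helper names `partition_groups` and `merge_groups` are hypothetical, and the sketch assumes the height and width are divisible by the group size g.

```python
def partition_groups(fmap, g):
    """Split an [H][W] map into non-overlapping g x g blocks (row-major order)."""
    h, w = len(fmap), len(fmap[0])
    if h % g or w % g:
        raise ValueError("H and W must be divisible by g")
    blocks = []
    for top in range(0, h, g):
        for left in range(0, w, g):
            blocks.append([row[left:left + g] for row in fmap[top:top + g]])
    return blocks


def merge_groups(blocks, h, w, g):
    """Inverse of partition_groups: place each block back at its position."""
    out = [[None] * w for _ in range(h)]
    per_row = w // g  # number of blocks in one row of blocks
    for idx, block in enumerate(blocks):
        top, left = (idx // per_row) * g, (idx % per_row) * g
        for i in range(g):
            for j in range(g):
                out[top + i][left + j] = block[i][j]
    return out


# A 4 x 4 map with g = 2 yields (4 * 4) / (2 * 2) = 4 groups.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
groups = partition_groups(fmap, 2)
```

Intra-group attention then operates on the g × g positions of each block independently, and the inverse rearrangement restores the original spatial layout afterwards.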
Specifically, given an input feature map
, we first rearrange it into multiple spatially grouped blocks shaped as g × g local subregions.
In the above formula,
denotes the number of spatially partitioned groups, where g is the spatial size of each group. Next, 1 × 1 convolution is applied to each local group feature to generate the query (Q), key (K), and value (V) matrices, enabling intra-group attention modeling.
where h denotes the number of attention heads, and d represents the embedding dimension of each head. To enhance the expressive capability of the attention mechanism, we introduce a learnable prompt vector
and a gating factor
to modulate the original query vector. The operation is shown in Equation (22):
where
denotes the sigmoid function used to control the level of involvement of the prompt vector. Next, the standard self-attention operation is performed within each local space group:
The obtained attention result is the output feature after context enhancement within the group. To enhance the information interaction between spatial groups, we design a Group Tokens mechanism. The first position in each group is set to be extracted as a semantic representation, and then attention is computed between groups:
The results of intergroup attention are subsequently broadcast back to the groups to remodulate the intragroup features to obtain the final context fusion feature
. The intra-group features are then rearranged back to the original spatial dimensions, and a residual connection is applied.
With the PGCA module, the PGTSEFormer network enhances perception and information interaction between spatial regions while maintaining a lightweight structure, which improves the model’s structural modeling and generalization ability in HSI classification.
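The prompt-gated query modulation followed by intra-group self-attention can be sketched for a single group and a single head. Equation (22) is not reproduced in this text, so the additive form used here (query plus sigmoid(gate) times prompt) is an assumption chosen for illustration, as are the function and parameter names.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def prompt_gated_attention(q, k, v, prompt, gate):
    """Scaled dot-product attention with a gated prompt added to the queries.

    q, k, v: [n][d] token matrices of one group and one head.
    prompt:  [d] learnable prompt vector (assumed shared by all tokens).
    gate:    scalar gating factor; sigmoid(gate) sets the prompt's influence.
    """
    d = len(q[0])
    s = sigmoid(gate)
    # Assumed modulation: Q' = Q + sigmoid(gate) * prompt.
    q_mod = [[qi + s * pi for qi, pi in zip(row, prompt)] for row in q]
    scores = [[sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d) for kr in k]
              for qr in q_mod]
    attn = [softmax(r) for r in scores]
    out = [[sum(attn[i][j] * v[j][c] for j in range(len(v))) for c in range(d)]
           for i in range(len(q))]
    return out, attn
```

With a strongly negative gate the sigmoid is close to zero and the sketch reduces to ordinary self-attention, so the prompt can be phased in gradually during training; this is the behaviour that motivates gating in the zero-initialized attention idea cited in the Introduction.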
4. Experimental Comparison and Analysis
In this section, we selected nine representative HSI methods for comparative experiments, covering different network structures based on CNNs and Transformer-based architectures. Through systematic comparisons with these state-of-the-art methods, we comprehensively evaluated the classification performance of the proposed PGTSEFormer model on several standard hyperspectral datasets. We present a detailed description of the network architecture, the experimental setup, and the comparison methods used for evaluating PGTSEFormer. In addition, we provide in-depth analyses of the classification results, which collectively demonstrate the state-of-the-art performance and effectiveness of the proposed framework in HSI tasks from multiple perspectives. For the classification accuracy of each feature category, the highest value among the methods is marked in bold font, in order to highlight the performance advantages.
Additionally, to visually demonstrate the classification performance of the different methods across the various datasets, we provide the corresponding false-color composite images, ground-truth annotations, and predicted classification maps. These visualizations help in more effectively comparing the recognition accuracy and spatial consistency of each model.
4.1. Quantitative Results Analysis
Table 6, Table 7, Table 8, Table 9 and Table 10 present the classification results of our method and the comparative approaches on the five datasets. The evaluation metrics used included Overall Accuracy (OA), Average Accuracy (AA), the Kappa coefficient (k), and the individual classification accuracy for each category. The best-performing values in each experiment are boldfaced to highlight the strongest method under each evaluation criterion.
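For reference, the three summary metrics can be computed from a confusion matrix as in the sketch below. These are the standard definitions of OA, AA, and Cohen's kappa; the function name and the toy matrix are illustrative and are not taken from the paper's experiments. The sketch assumes every class has at least one test sample.

```python
def classification_metrics(conf):
    """Compute OA, AA, kappa and per-class accuracy.

    conf[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    # Overall Accuracy: fraction of all samples classified correctly.
    oa = correct / total
    # Per-class accuracy (recall) and its unweighted mean, the Average Accuracy.
    per_class = [conf[i][i] / sum(conf[i]) for i in range(n)]
    aa = sum(per_class) / n
    # Kappa: agreement corrected for the agreement expected by chance.
    col_sums = [sum(conf[i][j] for i in range(n)) for j in range(n)]
    pe = sum(sum(conf[i]) * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa, per_class


# Toy two-class example (not experimental data).
oa, aa, kappa, per_class = classification_metrics([[8, 2], [1, 9]])
```

OA is dominated by large classes, whereas AA weights every class equally, which is why both are reported for datasets with strongly imbalanced categories.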
The classification performance on the Indian Pines dataset is shown in
Table 6. Although PGTSEFormer performed slightly worse on certain individual categories (compared with the SPRN method), it achieved a significantly higher accuracy for category 9 (Oats) than all other methods, demonstrating its strength in recognizing classes with limited samples. With only 5% of the training data, PGTSEFormer achieved an OA value of 97.91 ± 0.41%, which is approximately 0.19% higher than the best performance of the other methods. Moreover, its performance exhibited lower variance, indicating good stability.
The classification performance on the Salinas dataset is presented in
Table 7. SQSFormer achieved an accuracy of 97.70 ± 0.93% on Class 5 (Fallow-Smooth), demonstrating its strength in local feature modeling. However, in terms of overall performance, PGTSEFormer achieved an OA value of 98.74 ± 0.42% using only 1% of the training samples, outperforming the best comparison method (i.e., SPRN) by 1.22%.
The classification results on the Botswana dataset are shown in
Table 8. While several methods achieved classification accuracies of 100 ± 0.00% in specific categories, PGTSEFormer also reached this level in most categories, demonstrating its strong capability in modeling multi-class spectral–spatial features. Although the classification accuracies of PGTSEFormer were slightly lower than those of the best-performing method in category 1 (Water), category 5 (Hippo Grass-1), category 12 (Short Mopane), and category 14 (Chalcedony), the respective differences were only 0.25%, 0.47%, 0.13%, and 0.12%, indicating negligible performance gaps. Overall, PGTSEFormer achieved an OA value of 99.48 ± 0.34% using only 5% of the training samples, with a fluctuation of just 0.34%, which further validates its robustness and stability on this dataset.
The classification results on the WHU-Hi-LongKou dataset are shown in Table 9. PGTSEFormer performed particularly well for Class 3 (Sesame) and Class 5 (Narrow-leaf Soybean). In contrast, the traditional CNN methods performed poorly in these two classes, possibly due to their shortcomings in feature extraction with fewer samples. Although other Transformer-based methods improved on the CNNs, PGTSEFormer achieved the highest classification accuracies for these categories. Under very low sample conditions (only 0.2% of the training samples), PGTSEFormer still achieved an OA value of 99.18 ± 0.29%, an improvement of about 0.21% over SPRN.
The classification results on the WHU-Hi-HongHu dataset are shown in Table 10. Our method outperformed all other methods, especially in Class 1 (Red roof) and Class 5 (Cotton firewood). In contrast, earlier baselines such as 3DCNN and SpectralFormer performed relatively poorly in these two classes, possibly due to their limitations in feature extraction, particularly for the complex patterns present in these classes. Notably, Transformer-based methods such as GAHT and MCTGCL also showed significant improvements over the CNN-based methods, but they still lagged behind our method in terms of classification accuracy. Our method achieved the highest OA and AA values, with an OA value of 91.52 ± 1.01%, surpassing the next best method (GAHT) by a margin of approximately 1.5%.
In summary, PGTSEFormer achieved excellent classification results on several public datasets, verifying its effectiveness in dealing with complex spatial–spectral structures and modeling global–local feature relationships.
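For reference, the quantities discussed above (per-class accuracy, OA, AA, and the "mean ± fluctuation" format) can be computed from a confusion matrix as in the following minimal sketch. The function names are illustrative and not taken from the paper, and the use of the population standard deviation over repeated runs is an assumption about how the ± values were obtained.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def oa_aa(cm):
    """Return overall accuracy, average accuracy, and per-class accuracy."""
    oa = np.trace(cm) / cm.sum()                       # correct / all labeled samples
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()                              # unweighted mean over classes
    return oa, aa, per_class

def summarize(scores):
    """Format scores from repeated runs as 'mean ± std' in percent.

    Assumption: population standard deviation (ddof=0) over the runs.
    """
    scores = np.asarray(scores, dtype=float) * 100
    return f"{scores.mean():.2f} ± {scores.std():.2f}"

# Toy example with two classes (hypothetical labels, not the paper's data).
cm = confusion_matrix([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0], n_classes=2)
oa, aa, per_class = oa_aa(cm)
```

In the toy example, OA exceeds AA because the larger class is classified perfectly while the smaller class is not, which is why both metrics are usually reported together for imbalanced HSI datasets.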
4.2. Visual Results Analysis
Figure 8,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 show the classification maps of the proposed method and each comparison algorithm on the five datasets. Owing to the structural advantages of the proposed PGTSEFormer in spatial–spectral feature modeling and contextual representation, its classification maps exhibited fewer noise artifacts, sharper region boundaries, and greater consistency with the ground-truth distribution. On the Indian Pines dataset, this was especially evident in category 15 (Vineyard-untrained) (bright red area). This type of area is usually noisy and difficult to segment accurately with traditional methods, whereas PGTSEFormer effectively suppressed misclassified areas and significantly improved the accuracy of boundary identification. Similar advantages were observed on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets. The classification maps generated by PGTSEFormer not only aligned well with the ground-truth contours, but also exhibited superior accuracy in edge delineation and fine-grained category prediction. These results further validate the model’s robustness and generalization performance for complex hyperspectral scenes.
4.3. Learned Feature Visualizations by t-SNE
t-SNE (t-distributed stochastic neighbor embedding) is a nonlinear dimensionality reduction technique widely used for visualizing high-dimensional data. It maps high-dimensional data into two or three dimensions while preserving the local structure, thereby revealing underlying patterns and group structures. In hyperspectral image classification, t-SNE plots show how the learned features are distributed. If the classification performance is good, the samples of different categories are clearly separated in the t-SNE plot and form independent clusters, indicating that the classifier can effectively distinguish between categories. If the classification performance is poor, the categories overlap or interweave, which suggests that the classifier fails to distinguish between them, and misclassified samples may be projected into the clusters of other categories.
Taking the IP dataset as an example, as shown in
Figure 13, in (b), (c), and (d), the samples of different categories are well clustered together and their boundaries are distinct, suggesting that these plots represent a relatively good classification performance. The overlap between categories is minimal, indicating that the classifier could effectively distinguish between these categories. In (a), (e), and (f), some categories begin to show more overlap, especially in the areas where the colors intersect. This distribution might indicate that the classifier’s ability to differentiate between these categories has declined, or that the feature spaces of these categories are similar, making classification more difficult. From (g) to (j), it can be observed that the degree of separation between categories fluctuated with different dimensionality reduction parameters and methods. In particular, in (h) and (i), some categories start to cluster more closely together, possibly due to the impact of the dimensionality reduction process on the local structure.
In (j), the samples of each category are almost all clustered in distinct areas, and the distributions of most categories do not overlap or cross. The samples in each category have high separability in the feature space, indicating that the learned features can effectively distinguish between these categories. Areas with the same color (e.g., blue, green, pink) show that the samples of each category are tightly grouped, with almost no points from other categories scattered in these areas. Compared to the other plots (e.g., (d) and (e)), the samples in (j) mostly fall within their respective clusters, with only a few samples potentially in neighboring category regions, which is consistent with a low misclassification rate. Each category forms a relatively compact cluster, indicating that similar samples are grouped closely together in the feature space, which benefits classification accuracy and robustness.
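A two-dimensional embedding of this kind can be produced with scikit-learn's `TSNE`, as in the hedged sketch below. The synthetic features, the perplexity value, and the helper name `embed_features` are assumptions for illustration only; the paper does not state which implementation or settings were used.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features(features, perplexity=30.0, seed=0):
    """Project (n_samples, n_features) learned features to 2-D with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(features)

# Synthetic stand-in for learned features: three separated classes
# (hypothetical data, not features from any model in the paper).
rng = np.random.default_rng(0)
n_per_class, dim = 50, 64
centers = rng.normal(scale=5.0, size=(3, dim))
features = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
labels = np.repeat(np.arange(3), n_per_class)

embedding = embed_features(features)  # shape (150, 2)
```

The embedding can then be drawn as a scatter plot colored by `labels`. Because t-SNE preserves local neighborhoods rather than global distances, the compactness and separation of clusters are meaningful, whereas the distances between clusters should be interpreted with caution.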
4.4. Parameter Comparison
To evaluate the synergy between efficiency and accuracy,
Table 11 compares the model complexity of PGTSEFormer with nine mainstream HSI methods in terms of both parameter count and FLOPs. As shown in the table, PGTSEFormer had significantly fewer parameters than most competitors on the Salinas, Botswana, WHU-Hi-LongKou, and WHU-Hi-HongHu datasets, reflecting a lightweight design that does not compromise feature extraction capability. Despite its relatively small number of parameters, PGTSEFormer still achieved leading classification accuracy, which indicates more efficient parameter utilization.
Considering parameter scale, computational cost, and classification performance together, PGTSEFormer achieved superior classification results while maintaining a low computational overhead, which verifies its advantages in HSI tasks.
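To make the two complexity measures concrete, the sketch below counts parameters and multiply-accumulate operations (MACs) for a convolutional layer and a linear layer using the standard closed-form expressions. The layer sizes are hypothetical and do not describe PGTSEFormer; reporting FLOPs as twice the MACs is a common convention, but counting conventions vary between tools.

```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Parameters and MACs of a square-kernel 2-D convolution."""
    params = (kernel * kernel * c_in + (1 if bias else 0)) * c_out
    macs = kernel * kernel * c_in * c_out * h_out * w_out
    return params, macs

def linear_cost(d_in, d_out, tokens=1, bias=True):
    """Parameters and MACs of a linear layer applied to `tokens` vectors."""
    params = (d_in + (1 if bias else 0)) * d_out
    macs = d_in * d_out * tokens
    return params, macs

# Hypothetical toy network: one 3x3 convolution on a 9x9 patch with
# 30 input bands, followed by a 16-way classifier.
layers = [
    conv2d_cost(c_in=30, c_out=64, kernel=3, h_out=9, w_out=9),
    linear_cost(d_in=64, d_out=16),
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)
total_flops = 2 * total_macs  # assumption: 1 MAC = 2 FLOPs
```

The parameter count depends only on the layer shapes, whereas the MAC count also scales with the spatial size of the input patch, which is why the two measures can rank models differently across datasets.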
4.5. Impact of Different Training Ratios
To validate the robustness of the proposed method, we compared the OA of all methods on the five HSI datasets under different training sample ratios. As shown in
Figure 14, PGTSEFormer consistently outperformed the other methods in classification accuracy, even under limited training data conditions. Moreover, its accuracy increased steadily with the proportion of training samples, demonstrating strong scalability and robustness. Taking the SA dataset as an example, the classification accuracy of our method was consistently higher than that of the other methods when the number of training samples was limited. This suggests that the model retained robust feature extraction and discrimination abilities, even in challenging situations with limited samples and subtle class differences.
In summary, this experiment demonstrated that PGTSEFormer not only performs well under standard settings, but also remains robust and stable when training samples are restricted.
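An experiment of this kind requires drawing training sets at several ratios. The following minimal sketch shows one common way to do so, a per-class (stratified) random split; it assumes that `labels` contains only labeled pixels and that each class contributes at least one training sample, neither of which is stated in the paper. The function name and defaults are illustrative.

```python
import numpy as np

def stratified_split(labels, train_ratio, seed=0, min_per_class=1):
    """Split sample indices per class into train and test sets.

    Assumption: every class keeps at least `min_per_class` training
    samples and, when it has more than one sample, at least one test sample.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(min_per_class, int(round(train_ratio * idx.size)))
        if idx.size > 1:
            n_train = min(n_train, idx.size - 1)
        train_parts.append(idx[:n_train])
    train_idx = np.sort(np.concatenate(train_parts))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# Hypothetical imbalanced label vector with three classes.
labels = np.repeat([0, 1, 2], [100, 40, 10])
train_idx, test_idx = stratified_split(labels, train_ratio=0.05)
```

Repeating the split for each ratio and for several seeds, then training and evaluating every method on the same splits, yields accuracy-versus-ratio curves of the kind compared in this section.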