1. Introduction
Hyperspectral images contain rich spectral and spatial information that can elucidate the response characteristics of ground objects across various spectra [1]. Hyperspectral image classification (HSIC) has been extensively applied to a range of earth science tasks, including environmental mapping [2], climate change studies [3], agricultural management [4,5], mineral exploration [6], and geological research [7]. HSIC represents a critical aspect of hyperspectral image processing and analysis. Hyperspectral images, which encompass hundreds of bands, are transformed into classification maps through pixel-by-pixel category labeling. These classification maps provide an intuitive representation of the distribution of ground objects within the image by employing distinct colors for different types of ground features [8,9]. Early approaches to HSIC primarily focused on extracting spectral information while neglecting spatial information. Commonly utilized classifiers include Random Forest (RF) [10], Support Vector Machine (SVM) [11], and K-Nearest Neighbor (KNN) [12]. Moreover, researchers have proposed several methods tailored for spectral dimensionality reduction and information extraction from hyperspectral images, including principal component analysis (PCA) [13], independent component analysis (ICA) [14], and linear discriminant analysis (LDA) [15]. By calculating spectral correlation information along the spectral dimension of hyperspectral images, these techniques can effectively reduce data dimensionality while extracting relevant spectral features. However, such approaches struggle to comprehensively capture nonlinear relationships and intricate spectral characteristics, which can easily lead to information loss or suboptimal dimensionality reduction. Additionally, the design of handcrafted features demands in-depth domain expertise, increasing complexity and introducing subjective influences that may compromise the stability and reliability of classification outcomes.
In recent years, numerous efforts have been made to classify hyperspectral images using deep learning techniques [16]. The convolutional neural network (CNN) is among the most widely utilized deep learning architectures for HSIC. CNNs employ multiple convolutional, activation, and pooling layers for feature extraction and are typically followed by fully connected layers and softmax classifiers. For hyperspectral images, CNNs can extract either spectral or spatial features depending on the input provided to the network. Hu et al. [17] designed a one-dimensional CNN that takes the spectral information of hyperspectral images as input, effectively extracting spectral convolutional features. Building upon this foundation, Zhao et al. [18] implemented a 2D-CNN to extract spatial features from local data cubes. Wang et al. [19] proposed a hybrid HSIC model that leverages the initial convolutional layers to capture position-invariant mid-level features before employing a recurrent layer to extract contextual details related to the spectrum. To enhance computational efficiency, Liu et al. [20] introduced an improved CNN architecture that mitigates overfitting and improves generalization through the use of 1×1 convolutional kernels. Gao et al. [21] incorporated a global average pooling layer to reduce the number of trainable parameters while facilitating higher-dimensional feature extraction. Subsequent methodologies increasingly focus on integrating both spatial and spectral features. For example, Ran et al. [22] proposed an enhanced pixel-pair feature (PPF) method that extracts spatially adjacent pixel pairs from hyperspectral images as distinctive features. Zhong et al. [23] introduced a supervised spectral–spatial residual network (SSRN) for feature extraction in hyperspectral imagery; this approach employs a series of three-dimensional convolutions within separate spectral and spatial residual blocks. Roy et al. [24] combined 2D and 3D CNNs into a hierarchical convolutional network, termed HybridSN, which effectively extracts spatial–spectral features while reducing computational complexity and enhancing classification accuracy. To account for variations in the spatial environment across different hyperspectral image blocks, Li et al. [25] employed adaptive weight learning instead of fixed weights, thereby introducing greater spatial detail. Deep networks are crucial for capturing global features within an image; however, traditional convolution often falls short in modeling long-range correlations. Furthermore, convolutional networks require a substantial amount of annotated data to learn effective feature representations. As a result, when confronted with small-sample tasks with limited annotated data, classification accuracy tends to be low due to inadequate model training.
The successful implementation of these methodologies hinges on a critical prerequisite: each category must possess a sufficient number of samples, a requirement that is both costly and impractical in hyperspectral remote sensing tasks. Consequently, achieving effective classification with limited training sets has emerged as an urgent necessity within this field. In response, researchers have made persistent efforts to enhance classification performance under small training datasets. Zhang et al. [26] proposed a locally balanced embedding algorithm designed to extract spectral features while thoroughly considering both spatial and spectral characteristics, addressing the scarcity of labeled samples. Liu et al. [27] trained their model by simulating small-sample classification scenarios during the training phase, aiming to extract features with reduced intra-class distances and increased inter-class distances, thereby improving classification accuracy for small sample sizes. Ma et al. [28] developed a two-stage relation learning network that leverages other hyperspectral images for general information sharing while allowing fine-tuning on specific hyperspectral images for targeted information acquisition. More recently, several new feature learning structures have emerged, most prominently the attention mechanism proposed in natural language processing (NLP) [29]. This mechanism has proven more effective at integrating global features at an early stage, and classification performance improves when an attention module is incorporated into the architecture. Fu et al. [30] introduced an HSIC framework based on class-level band learning, which not only improved the model's sensitivity to class-specific information within spectral bands but also significantly increased classification accuracy through the joint extraction of multi-scale spatial features and global spectral features. Xue et al. [31] proposed a novel hierarchical residual network that employs the attention mechanism to assign adaptive weights to spatial and spectral features across different scales. This approach enables the extraction of multi-scale spatial and spectral features at a granular level, expanding the receptive field and enhancing the model's feature representation capabilities. Building upon this foundation, the self-attention mechanism facilitates deeper global understanding and contextual analysis through interactive learning among internal elements. In contrast to traditional CNNs, self-attention establishes connections between any two locations within an image while maintaining a global receptive field, allowing more comprehensive capture of complex patterns and information dependencies. Leveraging this mechanism, Hong et al. [32] introduced SpectralFormer, a backbone network capable of learning local spectral sequence information from adjacent bands to generate grouped spectral embeddings applicable to both pixel- and patch-based inputs. To optimize feature extraction more effectively, some researchers have explored various types of attention mechanisms in hyperspectral classification tasks. Among them, Ma et al. [33] integrated the convolutional block attention module (CBAM) [34], a widely utilized tool in computer vision, into the feature learning process of hyperspectral images, proposing a double-branch multi-attention (DBMA) network. To address the limitations of these methods in spectral feature extraction, Zhang et al. [35] introduced a hierarchical self-attention network (HSAN), which constructs a hierarchical self-attention module for feature learning, designs a hierarchical fusion mode, and leverages the self-attention mechanism within the transformer structure to capture contextual information. This approach minimizes the loss of effective information during feature learning and enhances the integration of features across different levels.
While the attention mechanism effectively promotes global interaction between contexts in HSIC, challenges remain in terms of generalization. Especially for large and diverse hyperspectral datasets, the generalizability of the model emerges as a critical challenge. The emergence of large-scale models offers valuable insights for HSIC. Large language models (LLMs), renowned for their robust capabilities in understanding, generating, and processing vast amounts of textual data, have achieved significant success in natural language processing tasks. The Segment Anything Model (SAM) [36], which extends the large-model paradigm to visual tasks, has pushed the frontiers of image segmentation. After extensive training on substantial datasets, the SAM has demonstrated versatility across various domains, particularly in remote sensing applications. Ren et al. [37] conducted a comprehensive evaluation of the SAM's performance on several remote sensing segmentation datasets, including Solar [38], Inria [39], DeepGlobe and DeepGlobe Roads [40], 38-Cloud [41], and Parcel Delineation [42]. Their findings indicate that the SAM achieves performance comparable to supervised models in segmenting certain ground objects, showcasing strong generalizability. Built from three fundamental components—a large-scale pre-trained visual encoder, a flexible prompting mechanism (supporting point, box, or text prompts), and a lightweight mask decoder—the SAM extracts common visual features through the encoder and uses prompt information to guide the decoder in dynamically adapting to various targets, thereby achieving efficient feature interaction and semantic segmentation. In traditional small-sample classification of hyperspectral images, convolutional neural networks often suffer from overfitting and inadequate feature extraction due to their model complexity and limited local receptive fields. Although transformers are capable of capturing long-range dependencies, they require substantial amounts of data and entail high computational costs, which constrains their effectiveness in scenarios with limited sample sizes. In this paper, we introduce the SAM's prompt-driven mechanism into the task of hyperspectral small-sample classification. The optimization of spectral–spatial feature fusion is achieved through the integration of pre-trained prior knowledge and interactive prompts. Compared to traditional methods, our model significantly reduces reliance on labeled data, effectively balancing local details against global context even with a limited number of samples. This approach enhances both classification accuracy and generalization capability. This study utilizes the publicly available Indian Pines (IP), Salinas (SA), and Pavia University (PU) datasets for experimentation. In these experiments, only five labeled samples per land type are selected as prompt information for the SAM, thereby successfully implementing SAM-based HSIC.
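As a minimal sketch of how a handful of labeled pixels can drive the SAM as point prompts, assuming the public `segment_anything` package and a local ViT-B checkpoint (the checkpoint path, the random false-color image, and the prompt coordinates below are illustrative, not the exact configuration used in this paper):

```python
# Minimal sketch: driving SAM with labeled pixels as point prompts.
# Assumes the public `segment_anything` package and a local ViT-B
# checkpoint; the path, image, and coordinates are illustrative.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# In practice this would be the three PCA components of a band group,
# rescaled to an (H, W, 3) uint8 false-color composite.
rgb = np.random.randint(0, 256, size=(145, 145, 3), dtype=np.uint8)
predictor.set_image(rgb)

# Five labeled pixels of one land-cover class act as positive prompts;
# SAM expects (x, y) coordinates, i.e. (column, row).
point_coords = np.array([[30, 40], [32, 41], [35, 44], [28, 39], [31, 45]])
point_labels = np.ones(len(point_coords), dtype=int)   # 1 = foreground

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)
class_mask = masks[0]   # boolean (H, W) mask proposed for this class
```

Running this once per land-cover class, with that class's labeled pixels as positive prompts, yields per-class masks that can be merged into a classification map.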
3. Experiments and Results
3.1. Experimental Data and Evaluation Indicators
To validate the effectiveness of the proposed method, three publicly available hyperspectral image datasets from Indian Pines (IP), Salinas (SA), and Pavia University (PU) were selected for experimentation. Among these datasets, IP and SA represent agricultural scenarios, whereas PU represents a more complex urban scenario. The detailed information regarding the datasets is provided in Table 2.
The overall accuracy (OA), average accuracy (AA), and kappa coefficient were employed as evaluation metrics for the classification performance of different methods. The same set of tests was repeated 10 times, and the average value was taken as the final test result.
Overall accuracy (OA)

The formula for calculating OA is provided in Equation (5):

$$\mathrm{OA}=\frac{TP+TN}{TP+TN+FP+FN}, \quad (5)$$

where TP represents the number of samples that are actually true and correctly predicted as true, TN represents the number of samples that are actually false and correctly predicted as false, FP represents the number of samples that are actually false but incorrectly predicted as true, and FN represents the number of samples that are actually true but incorrectly predicted as false.

Average accuracy (AA)

The calculation formula for AA is provided in Equation (6):

$$\mathrm{AA}=\frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i+FN_i}, \quad (6)$$

where TP_i represents the TP count for class i, FN_i represents the FN count for class i, and n denotes the total number of categories across all features.

Kappa coefficient

The formula for calculating the kappa coefficient is provided in Equation (7):

$$\mathrm{Kappa}=\frac{P_o-P_e}{1-P_e}, \quad (7)$$

where P_o is equivalent to the OA, P_e is provided by Equation (8), and N represents the total number of test samples as given by Equation (9):

$$P_e=\frac{1}{N^2}\sum_{i=1}^{n}a_i\,b_i, \quad (8)$$

$$N=\sum_{i=1}^{n}\left(TP_i+FN_i\right), \quad (9)$$

where a_i and b_i denote the number of test samples whose actual and predicted labels are class i, respectively.
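As a concrete instance of Equations (5)–(9), the following sketch computes OA, AA, and the kappa coefficient from a confusion matrix (the toy label arrays are illustrative only):

```python
# Sketch: computing OA, AA, and kappa (Equations (5)-(9)) from a
# confusion matrix; the toy label arrays are illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 1, 1])

cm = confusion_matrix(y_true, y_pred)        # rows: actual, cols: predicted
N = cm.sum()                                 # Equation (9): total test samples

oa = np.trace(cm) / N                        # Equation (5), multiclass form
aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # Equation (6): mean per-class recall
pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / N**2   # Equation (8)
kappa = (oa - pe) / (1 - pe)                 # Equation (7)

print(f"OA={oa:.4f}, AA={aa:.4f}, Kappa={kappa:.4f}")
```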
3.2. Experiments and Results
To verify the accuracy of SAM-based HSIC under small-sample conditions, three representative hyperspectral datasets, IP, SA, and PU, were used for image classification experiments. For each land type, a small number of labeled samples were selected as prior-knowledge samples. The experimental results demonstrate that the SAM can efficiently classify hyperspectral images from different scenes. Each experiment was repeated 10 times using the same sample combination, and the average of the 10 trials was taken to verify the stability and reliability of the model through consistent experimental outcomes.
To assess the effectiveness of the method proposed in this paper, five labeled samples were randomly selected as supervised samples for each category of ground objects across the three datasets. Eight distinct methods—SVM, 1D-CNN, 2D-CNN, 3D-CNN, SSRN, HybridSN, DFSL, and transformer—were employed for HSIC. The classification results obtained from these methods were compared with those derived from SAM-based HSIC to evaluate the performance of the SAM in ground object classification. The findings are presented in Figure 5, Figure 6 and Figure 7, as well as in Table 3, Table 4 and Table 5.
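A minimal sketch of the few-shot sampling protocol just described—randomly drawing five labeled pixels per class from a ground-truth map (the array names and toy ground truth are illustrative):

```python
# Sketch: randomly selecting five labeled pixels per class from a
# ground-truth map as the supervised/prompt set; `gt` is a toy stand-in.
import numpy as np

rng = np.random.default_rng(seed=0)
gt = rng.integers(0, 4, size=(145, 145))   # toy ground truth, 0 = unlabeled

def sample_per_class(gt, n_per_class=5):
    """Return {class_id: (n, 2) array of (row, col) pixel coordinates}."""
    prompts = {}
    for c in np.unique(gt):
        if c == 0:                         # skip the unlabeled background
            continue
        rows, cols = np.nonzero(gt == c)
        idx = rng.choice(len(rows), size=min(n_per_class, len(rows)),
                         replace=False)
        prompts[int(c)] = np.stack([rows[idx], cols[idx]], axis=1)
    return prompts

prompts = sample_per_class(gt)
print({c: p.shape for c, p in prompts.items()})
```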
The experimental results presented in Figure 5, Figure 6 and Figure 7, as well as in Table 3, Table 4 and Table 5, demonstrate that the SAM method yields satisfactory classification outcomes across the three datasets—IP, SA, and PU—with only five samples selected for each land type. The overall classification accuracies achieved by the SAM for these three datasets are 80.29%, 90.66%, and 86.51%, respectively; these figures significantly surpass the classification results obtained from the other eight methods discussed in this paper. This confirms that the SAM is capable of performing HSIC effectively with a minimal number of prior samples. Furthermore, it illustrates that the SAM can adapt to new tasks with only a handful of prompt examples while still achieving high classification accuracy without any task-specific training.
As illustrated in Figure 5 and Table 3, the classification performance of the SAM is relatively low for the IP dataset. This may be attributed to the high number of categories and complex features present within the IP dataset, which necessitates a greater volume of samples for effective classification. The intricate characteristics of ground objects contribute to increased diversity among the samples in this dataset, with each category exhibiting unique spectral traits that must be accurately identified and distinguished by the model. As the number of categories increases, the feature differences between categories may also become increasingly subtle and complex. Consequently, models require enhanced expressiveness and discriminative capability to capture these nuances across different categories. As a result, classifying such complex features becomes more challenging, leading to a decline in overall classification performance.
3.3. Analysis of PCA Contribution Rate
To quantitatively assess the information retention efficiency of principal component analysis (PCA) during the dimensionality reduction of hyperspectral data, a statistical analysis was conducted on the PCA contribution rates and cumulative contribution rates for the IP, SA, and PU datasets. Following the voting strategy, the bands of each dataset were randomly divided into ten groups. Each group underwent independent PCA processing, resulting in the extraction of three principal components: PC1, PC2, and PC3. To provide a comprehensive illustration of the contribution rates of the three principal components, we calculated both the average contribution rate and the average cumulative contribution rate across the ten experimental groups. The average contribution rate reflects the mean explanatory power of each principal component across all groups, while the average cumulative contribution rate indicates the overall extent to which data features are represented by the first three principal components. These metrics were employed as representative indicators for each dataset, with the results presented in Table 6.
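A minimal sketch of this grouping-plus-PCA procedure, assuming a hyperspectral cube loaded as an (H, W, B) NumPy array (the random cube below stands in for real data):

```python
# Sketch: random band grouping followed by per-group PCA to three
# components, reporting contribution (explained variance) rates.
# The random cube stands in for a real (H, W, B) hyperspectral image.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)
cube = rng.normal(size=(145, 145, 200))    # toy stand-in for IP-like data
H, W, B = cube.shape

bands = rng.permutation(B)
groups = np.array_split(bands, 10)         # ten random band groups

for g, band_idx in enumerate(groups):
    X = cube[:, :, band_idx].reshape(-1, len(band_idx))
    pca = PCA(n_components=3).fit(X)
    ratios = pca.explained_variance_ratio_   # per-PC contribution rates
    print(f"group {g}: PC1-PC3 = {ratios.round(4)}, "
          f"cumulative = {ratios.sum():.4f}")
    # pca.transform(X).reshape(H, W, 3) -> false-color input for the SAM
```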
Table 6 presents the experimental results, indicating that the average contribution rate of PC1 in the IP dataset is 89.46%, and the average cumulative contribution rate of the first three principal components rises to 96.65%. For SA and PU, the average contribution rates of PC1 are notably higher, reaching 98.40% and 98.13%, respectively. It is also worth noting that the average cumulative contribution rates of the first two principal components alone already reach 99.86% (PU) and 99.72% (SA). These results indicate that, under the designed 10-group band division, the first three principal components retained through PCA dimensionality reduction effectively capture over 95% of the variance in the original data. Furthermore, the average contribution rate of higher-order principal components (such as PC3) is less than 1.61%, suggesting that their complementary contribution to the overall variance is exceedingly limited.
This result delineates the threshold of potential information loss during dimensionality reduction. While three-dimensional compression may discard certain higher-order spectral details, the fundamental spectral features are effectively captured by the first two principal components, and averaging across multiple experiments reinforces the statistical significance of this conclusion. The high and stable cumulative contribution rate indicates that the low-dimensional space resulting from dimensionality reduction preserves the core features of the original hyperspectral data while minimizing information loss. It further demonstrates that combining band grouping with PCA dimensionality reduction strikes an effective balance between model compatibility (three-channel, RGB-format input) and information integrity, thereby providing a robust foundation for SAM-based hyperspectral image classification.
3.4. The Influence of the Quantity of Labeled Samples on the Results
To evaluate the performance of the SAM under small-sample conditions, systematic experiments were conducted on the three classical hyperspectral datasets: Indian Pines, Salinas, and Pavia University. The classification accuracy of the SAM under different sample counts was analyzed by gradually increasing the number of labeled samples per class from 1 to 15 in steps of 2 (i.e., 1, 3, 5, ..., 15). The results are shown in Table 7, Table 8 and Table 9, in which the optimal result for each accuracy index is marked in bold.
From Table 7, Table 8 and Table 9, it is evident that the SAM method demonstrates significant advantages under small-sample conditions, particularly when the sample size is limited to five, and it maintains commendable classification capability even with very low sample counts. Specifically, when each category comprises only five samples, the overall accuracy (OA) of the SAM reaches 80.29%, 90.66%, and 86.51% for the IP, SA, and PU datasets, respectively. These results indicate that a sample size of five effectively balances the adequacy of labeling information against noise control; it provides sufficient data for spectral feature identification while mitigating the redundancy or category confusion that may arise from an excessive number of samples.
Taking the PU dataset as an example, when the sample size increased from one to five, the overall accuracy (OA) surged from 33.54% to 86.51%, and the kappa coefficient rose from 19.95 to 82.05. This demonstrates the SAM's capability to rapidly extract essential spectral features even with extremely limited annotations. The performance on the IP dataset further corroborates this conclusion: with a sample size of five, the OA increases by 36 percentage points compared to using just one sample; however, when the sample size is further expanded to 15, the OA declines slightly to 75.92%, suggesting that the model may be sensitive to local noise or shifts in data distribution. It is noteworthy that performance variations across datasets are closely linked to spectral separability. Owing to the significant spectral differences among its categories, the SA dataset requires only three samples to approach performance saturation, whereas the IP and PU datasets need five samples to overcome the performance bottleneck caused by high spectral overlap. This disparity underscores a design advantage of the SAM's prompt encoder: through a spatial–spectral joint attention mechanism, the model can dynamically concentrate on discriminative bands derived from limited annotations, thereby alleviating issues related to spectral redundancy.
3.5. Ablation Experiments
To thoroughly evaluate the effectiveness of the SAM for HSIC, ablation experiments were conducted on the IP, SA, and PU datasets with various band combinations. Ten distinct groups of slices, each containing different spectral information, were extracted from the original 3D hyperspectral image. Principal component analysis (PCA) was employed to reduce the spectral dimensionality of each group to three components. Ten different classification results were then generated using the SAM, and the final classification result was obtained through a voting strategy. The specific outcomes are presented in Table 10.
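A minimal sketch of the majority-voting step, assuming the ten per-group classification maps are stacked into one array (the random label maps below are illustrative):

```python
# Sketch: pixel-wise majority voting over ten classification maps
# produced from different band groups; the random maps are toy data.
import numpy as np

rng = np.random.default_rng(seed=0)
n_classes, H, W = 9, 145, 145
maps = rng.integers(0, n_classes, size=(10, H, W))   # ten SAM label maps

def majority_vote(label_maps, n_classes):
    """Return the per-pixel most frequent label across all maps."""
    flat = label_maps.reshape(label_maps.shape[0], -1)
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, flat)
    return counts.argmax(axis=0).reshape(label_maps.shape[1:])

final_map = majority_vote(maps, n_classes)
print(final_map.shape)   # (145, 145) fused classification map
```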
By comparing the classification results obtained with and without the voting strategy, we can verify the effectiveness of this approach in enhancing the hyperspectral image classification performance of the SAM. As illustrated in Table 10, for the IP, SA, and PU datasets, the overall classification accuracy (OA) improved by 0.91%, 2.09%, and 5.75%, respectively, following the implementation of the voting strategy. This enhancement is primarily attributed to the optimization and integration of multiple groups of classification results derived from different band combinations through the voting mechanism. Specifically, the original hyperspectral image is partitioned into ten subsets based on varying band combinations; each subset yields an independent classification result after PCA reduction, and the category with the highest frequency among these results is selected as the final output via voting. This method effectively harnesses the distinctive advantages offered by the various band combinations, increasing the kappa coefficient on the PU dataset by a significant 7.65 percentage points. Moreover, the simultaneous improvement in average accuracy (AA) further substantiates that the voting strategy provides balanced enhancements to classification performance across different categories. This experiment demonstrates that, by amalgamating classification outcomes from multiple band combinations, a voting strategy can effectively mitigate errors arising from any single projection, thereby improving overall performance in SAM-based hyperspectral image classification.
3.6. Computational Efficiency Analysis
To assess the computational efficiency of the proposed method, we selected transformer, DFSL, HybridSN, and SSRN—the methods with the strongest overall performance among the comparison set—for comparative analysis against the SAM. The results of this analysis are presented in Table 11, Table 12 and Table 13. The experiments were conducted on a PC equipped with an Intel(R) Core(TM) i7-14650HX processor running at 2.20 GHz, an NVIDIA GeForce RTX 4060 GPU with 8 GB of GPU memory, and 32 GB of RAM.
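Reliable GPU timing requires synchronizing before reading the clock, since CUDA kernels run asynchronously; the following sketch shows the kind of measurement harness such comparisons rely on (PyTorch assumed; the tiny model and input are placeholders, not an actual HSIC network):

```python
# Sketch: wall-clock timing of model inference on GPU; CUDA kernels are
# asynchronous, so synchronize before reading the clock. The tiny model
# and input below are placeholders for a real HSIC network.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 16, 3).to(device).eval()
x = torch.randn(1, 3, 145, 145, device=device)

with torch.no_grad():
    model(x)                                   # warm-up pass
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

print(f"mean inference time: {elapsed / 100 * 1e3:.2f} ms")
```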
Comparing the computational efficiency of the Segment Anything Model (SAM) with several mainstream hyperspectral classification models, it is evident from Table 11, Table 12 and Table 13 that the SAM demonstrates a particularly high level of time efficiency. From a model architecture perspective, this efficiency can be attributed to the deep adaptation of its pre-trained features. As a foundation model pre-trained on large-scale datasets, the SAM significantly mitigates the training overhead of hyperspectral classification tasks by employing lightweight parameter fine-tuning strategies. On the PU dataset, the SAM's training time was a mere 12.52 s—an advantage stemming from the knowledge transfer capabilities of its pre-trained weights, which circumvent the extensive iterative computation typically required when training a model from scratch. Furthermore, the SAM's test time for classification remains stable at approximately 8–9 s, a reduction of over 50% compared to traditional methods.
The SAM's time efficiency advantage is synergistic with its multi-projection integration strategy. Although this approach increases the computational load of data preprocessing due to the 10 sets of independent projections, the SAM's inference flow keeps the time cost of multi-projection inference within a reasonable range by leveraging shared image-encoder features and a batch processing mechanism. Experimental results indicate that the total time cost of the SAM on the PU dataset is 34.17% lower than that of the second-best method, DFSL, demonstrating the SAM's robustness in complex processing workflows. In addition, the SAM's prompt encoder facilitates efficient adaptation to hyperspectral data through lightweight modifications. Traditional interactive segmentation models often encounter computational bottlenecks when processing high-dimensional spectral information; the SAM alleviates this issue by compressing the computational load of prompt encoding into an acceptable range through a spectral-sensitive attention mechanism and a parameter-sharing strategy. This design enables the model to respond swiftly to user prompts while maintaining stable inference speed in scenarios with limited samples.