1. Introduction
Hyperspectral images (HSIs) capture the continuous reflectance spectrum of surface materials across hundreds of narrow and contiguous spectral bands. As a result, an HSI can be represented as a three-dimensional data cube that combines spatial information (pixels) with rich spectral signatures (B bands). This dense spectral resolution enables the discrimination of subtle material differences that are difficult to observe using conventional RGB or multispectral sensors. Therefore, HSIs are valuable for precision agriculture [1], environmental monitoring [2], mineral exploration [3], and urban studies [4]. Hyperspectral image classification (HSIC) is a central task in these applications, in which each pixel is assigned a semantic land-cover or material label to support large-scale mapping and automated decision support.
Early HSIC methods largely relied on handcrafted feature engineering and shallow classifiers. Techniques such as band selection [5], spectral derivatives [6], and linear dimensionality reduction (e.g., PCA and LDA) were commonly used to alleviate spectral redundancy and the curse of dimensionality. Spatial context was often incorporated through morphological profiles [7] or heuristic filtering. Although these approaches are interpretable, they have limited adaptability to complex nonlinear spectral mixing patterns, depend heavily on domain expertise, and often generalize poorly across diverse scenes [8]. These limitations are particularly pronounced in high-dimensional, spatially heterogeneous, and label-scarce environments. Beyond classification-oriented pipelines, non-deep-learning hyperspectral image analysis has also explored noise-aware weighting and outlier removal to improve the robustness of spectral–spatial criteria for object-based processing and scale selection [9].
The advent of deep learning has dramatically reshaped the HSIC landscape. Convolutional Neural Networks (CNNs) have become a dominant paradigm in this area. In particular, 2D CNNs [10] extract spatial textures from spectral bands, and 3D CNNs [11] jointly model spectral–spatial dependencies. Hybrid architectures such as HybridSN [12] further improve efficiency by combining 2D and 3D convolutions. Despite their success, CNNs are limited by local receptive fields and fixed grid processing, which restrict their ability to capture long-range dependencies and adapt to irregular object boundaries. To mitigate these limitations, Transformers have been introduced that use self-attention to model global contextual relationships across both spatial and spectral dimensions [13,14]. More recently, state–space models (SSMs) such as Mamba [15] have emerged as efficient alternatives to Transformers, offering linear complexity with global receptive fields. However, these sequence-based models can be sensitive to spectral noise, may not preserve fine-grained local structures, and do not explicitly model irregular spatial relationships.
In parallel, graph neural networks (GNNs) have gained traction due to their ability to model non-Euclidean relationships among pixels or superpixels. Early graph convolutional networks (GCNs) [16] demonstrated promising results in semi-supervised HSIC by propagating node features over adjacency graphs. Subsequent efforts introduced multi-scale GCNs [17], cross-attention GCNs [18], and object-based graph constructions [19] to enhance feature aggregation and boundary preservation. More recently, hybrid graph state–space models such as Graph Mamba [20] have bridged graph structural learning with sequence modeling. Nevertheless, graph-based approaches still face several fundamental challenges. First, graph convolutions primarily capture local connectivity and may fail to model long-range dependencies effectively. Second, graph construction is often heuristic and scene-dependent. Third, many models lack dynamic multi-scale fusion mechanisms and do not fully integrate spatial and spectral cues. Finally, spectral noise and redundancy can degrade input quality and reduce robustness.
To address these limitations, we propose the Spectral–Spatial Graph Transformer Network (SSGTN), a unified dual-branch architecture that integrates graph-based structural modeling with Transformer-based global reasoning. The proposed framework includes four key components. First, an LDA-SLIC superpixel graph construction module combines linear discriminant analysis (LDA) for spectral compaction with Simple Linear Iterative Clustering (SLIC) for spatially homogeneous region segmentation to obtain a structurally informed and computationally efficient graph representation. Second, a lightweight spectral denoising module based on convolutions and batch normalization suppresses redundant and noisy spectral bands while preserving discriminative features. Third, a Spectral–Spatial Shift Module (SSSM) performs cyclic shifts along spectral, height, and width dimensions to enable efficient multi-scale feature interaction without introducing additional parameters. Fourth, a dual-branch GCN-Transformer block jointly models local graph topology and global dependencies, where a spatial Transformer guided by GCNs captures long-range spatial information and a spectral Transformer models cross-band correlations; the two branches are fused through a residual graph convolution.
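To make the shift operation concrete, the following minimal PyTorch sketch illustrates a parameter-free cyclic shift along the spectral, height, and width dimensions. The function name, the four-way channel grouping, and the unit shift size are our illustrative assumptions, not the exact SSSM implementation.

```python
# Minimal sketch of a parameter-free spectral-spatial shift (our illustration,
# not the exact SSSM). Input x: (N, C, H, W) feature maps.
import torch

def spectral_spatial_shift(x: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Cyclically shift three channel groups along the spectral (C),
    height (H), and width (W) axes; the last group is left unchanged."""
    g = x.size(1) // 4  # four equal groups is an assumption for illustration
    out = x.clone()
    out[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=shift, dims=1)  # spectral
    out[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=shift, dims=2)  # height
    out[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=shift, dims=3)  # width
    return out  # no learnable parameters are introduced

features = torch.randn(2, 64, 32, 32)
assert spectral_spatial_shift(features).shape == features.shape
```

Because torch.roll only permutes memory, such a shift enables cross-dimensional feature interaction at zero parameter cost and negligible compute overhead.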
The main contributions of this work are summarized as follows:
- (1) We propose a novel dual-branch graph–Transformer hybrid architecture that jointly models local graph structures and global spectral–spatial dependencies, effectively overcoming the limitations of conventional single-paradigm models.
- (2) We design a dynamic Spectral–Spatial Shift Module that enables efficient multi-dimensional feature fusion through parameter-free shift operations, enhancing the model’s ability to capture contextual interactions across scales.
- (3) We develop a superpixel-driven graph construction strategy using LDA-SLIC, which adaptively captures spatial homogeneity and spectral discriminability while maintaining computational efficiency via sparse graph representations (see the sketch after this list).
- (4) We introduce a spectral denoising module that refines input representations through lightweight convolutions and normalization, improving robustness to spectral noise and redundancy.
- (5) We conduct comprehensive experiments and ablation studies across multiple datasets and training regimes, validating the superiority, generality, and interpretability of SSGTN in HSI classification under limited supervision.
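As referenced in contribution (3), the sketch below shows one plausible way to assemble an LDA-SLIC superpixel graph from off-the-shelf components (scikit-learn's LinearDiscriminantAnalysis and scikit-image's slic). The label convention, segment count, and component count are assumptions for illustration, not the paper's configuration.

```python
# Illustrative LDA-SLIC superpixel graph construction (our sketch, not the
# paper's code). Assumes an HSI cube `hsi` of shape (H, W, B) and per-pixel
# ground-truth labels `y` of shape (H, W), with 0 meaning "unlabeled".
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from skimage.segmentation import slic

def lda_slic_graph(hsi, y, n_segments=200, n_components=3):
    H, W, B = hsi.shape
    pixels, labels = hsi.reshape(-1, B), y.reshape(-1)
    # 1) LDA spectral compaction, fit on labeled pixels only
    #    (requires at least n_components + 1 classes)
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(pixels[labels > 0], labels[labels > 0])
    compact = lda.transform(pixels).reshape(H, W, n_components)
    # 2) SLIC over the compacted cube yields spatially homogeneous regions
    seg = slic(compact, n_segments=n_segments, start_label=0, channel_axis=-1)
    # 3) Sparse adjacency: link superpixels that share a pixel boundary
    n = seg.max() + 1
    A = np.zeros((n, n), dtype=np.float32)
    A[seg[:, :-1].ravel(), seg[:, 1:].ravel()] = 1.0  # horizontal neighbors
    A[seg[:-1, :].ravel(), seg[1:, :].ravel()] = 1.0  # vertical neighbors
    A = np.maximum(A, A.T)                            # symmetrize
    np.fill_diagonal(A, 0.0)                          # drop self-loops
    return seg, A
```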
The remainder of this paper is organized as follows. Section 2 reviews related work in hyperspectral remote sensing image classification. Section 3 presents the proposed SSGTN architecture. Section 4 reports experimental results on three benchmark hyperspectral datasets. Section 5 discusses the findings and remaining limitations. Finally, Section 6 concludes the paper and outlines future research directions.
2. Related Work
In this section, we systematically review the evolution of deep learning-based hyperspectral image classification (HSIC) methods, which can be broadly categorized into convolutional, attention-based, and graph-based approaches. We highlight the strengths and limitations of each paradigm, motivating the proposed Spectral–Spatial Graph Transformer Network (SSGTN).
2.1. CNN-Based Hyperspectral Image Classification Methods
Convolutional Neural Networks have become a cornerstone in HSIC due to their strong ability to extract spatially structured features [12,21,22,23,24,25,26,27,28,29]. Early work by Hu et al. [10] demonstrated that 2D CNNs can effectively leverage local spatial textures within HSI patches, significantly improving classification accuracy over purely spectral methods. To better model the spectral–spatial dependencies inherent in HSIs, Li et al. [11] extended CNNs to three dimensions and proposed 3D CNNs that jointly process spectral cubes. Further innovations led to hybrid architectures, such as the synergistic 2D/3D CNN by Yang et al. [30], which integrates spectral–spatial fusion through 3D convolutions and uses complementary 2D spatial context modeling to balance accuracy and computational efficiency. Overall, these developments reflect a progression from purely spatial 2D CNNs to more advanced 3D and hybrid architectures for comprehensive spectral–spatial integration.
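For concreteness, the following sketch outlines a compact hybrid spectral-spatial CNN in the spirit of the 2D/3D designs above: a 3D convolutional stem extracts joint spectral-spatial features, after which the spectral dimension is folded into channels for a 2D stage. All layer sizes, the patch size, and the class count are illustrative assumptions rather than any published configuration.

```python
# A compact hybrid 2D/3D CNN sketch for patch-wise HSIC (illustrative sizes).
import torch
import torch.nn as nn

class Hybrid2D3DCNN(nn.Module):
    def __init__(self, bands=30, patch=11, n_classes=16):
        super().__init__()
        self.conv3d = nn.Sequential(  # joint spectral-spatial feature extraction
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
        )
        depth = bands - 6 - 4  # spectral extent left after the two 3D kernels
        self.conv2d = nn.Sequential(  # spatial refinement on folded channels
            nn.Conv2d(16 * depth, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * patch * patch, n_classes)

    def forward(self, x):  # x: (N, 1, bands, patch, patch)
        x = self.conv3d(x)
        n, c, d, h, w = x.shape
        x = self.conv2d(x.reshape(n, c * d, h, w))  # fold spectra into channels
        return self.head(x.flatten(1))

logits = Hybrid2D3DCNN()(torch.randn(4, 1, 30, 11, 11))  # -> (4, 16)
```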
Despite these advances, CNN-based methods exhibit several intrinsic limitations. Standard 2D CNNs often disrupt spectral continuity by treating bands independently, leading to potential misclassification of spectrally similar materials. While 3D CNNs can preserve spectral–spatial coherence, they dramatically increase model size and computational burden, creating scalability issues for high-dimensional HSIs. Moreover, the fixed receptive fields and inherently local inductive biases of convolutional kernels restrict their ability to capture long-range dependencies and multi-scale contextual information. These limitations hinder the generalization of CNN-based models in heterogeneous environments and motivate the exploration of more flexible architectures beyond convolution.
2.2. Attention-Based Hyperspectral Image Classification Methods
To overcome the locality bias of convolutions, attention-based architectures have been introduced into HSIC to model long-range spatial–spectral dependencies [13,15,31,32,33,34,35,36,37,38]. Representative examples include Transformers and, more recently, state–space models such as Mamba. For instance, Hong et al. [14] proposed SpectralFormer to strengthen inter-band relationships via self-attention, yielding competitive gains over convolutional baselines. Gu et al. [39] designed a multi-scale lightweight Transformer to reduce computational cost while preserving global modeling capacity. On the state–space side, He et al. [15] introduced 3DSS-Mamba, which organizes spectral–spatial tokens for efficient long-range dependency modeling. CenterMamba [38] adopts a center-scan strategy to enhance semantic representation with linear-complexity sequence processing. These designs provide two complementary approaches to scalable global spatial–spectral representation learning in HSIC.
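To illustrate the band-token idea behind SpectralFormer-style models, the sketch below embeds each band of a pixel spectrum as a token and applies multi-head self-attention so that every band can attend to every other band. The one-token-per-band tokenization, embedding size, and head count are our assumptions for illustration.

```python
# Minimal band-token self-attention sketch (illustrative, not SpectralFormer).
import torch
import torch.nn as nn

class SpectralAttentionBlock(nn.Module):
    def __init__(self, n_bands=200, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(1, dim)                          # one token per band
        self.pos = nn.Parameter(torch.zeros(1, n_bands, dim))   # band positions
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spectra):                   # spectra: (N, n_bands)
        tok = self.embed(spectra.unsqueeze(-1)) + self.pos
        mixed, _ = self.attn(tok, tok, tok)       # global band-to-band mixing
        return self.norm(tok + mixed)             # (N, n_bands, dim)

features = SpectralAttentionBlock()(torch.randn(8, 200))  # 8 pixel spectra
```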
Notwithstanding their progress, attention-based approaches still face several limitations. First, Transformer models can be computationally demanding and may struggle to reconcile global dependency modeling with fine-grained local detail, especially under high spectral dimensionality and limited labels. Second, many Transformer pipelines rely on fixed tokenization or single-scale processing, leading to insufficient dynamic multi-scale adaptation across heterogeneous scenes. Third, both Transformers and Mamba variants can be sensitive to spectral redundancy and noise, and benefit from explicit denoising or channel re-weighting to stabilize training. Finally, while Mamba/SSM models offer efficiency gains, they may suffer from slow convergence and hyper-parameter sensitivity, and by design, they do not explicitly account for irregular spatial relations. These shortcomings have spurred increasing interest in graph-based architectures, which provide a more flexible representation for non-Euclidean spatial–spectral structures.
2.3. Graph-Based Hyperspectral Image Classification Methods
Graph-based methods have recently emerged as powerful tools for HSIC because they can represent spatial–spectral relations on irregular and non-Euclidean domains [16,19,40,41,42,43,44,45,46,47]. Early studies demonstrated that graph convolutional networks can capture contextual dependencies through message passing over pixels or superpixels [16]. Subsequent advances introduced more adaptive designs. For example, Wan et al. [17] proposed a multi-scale dynamic GCN that aggregates information across spatial neighborhoods, while Yang et al. [18] introduced a cross-attention-driven spatial–spectral GCN to better integrate heterogeneous features. More recently, object-based strategies such as MOB-GCN [19] have further emphasized multi-scale structural cues, improving boundary delineation and robustness to noise.
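To ground the message-passing formulation, the sketch below implements one symmetric-normalized graph convolution in the style of Kipf and Welling, propagating superpixel node features over a given adjacency matrix; a dense matrix is used for brevity, whereas practical pipelines would rely on sparse operations.

```python
# One graph convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, X, A):
        A_hat = A + torch.eye(A.size(0))          # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        return torch.relu(A_norm @ self.lin(X))   # propagate, then transform

nodes = torch.randn(200, 64)                      # 200 superpixel features
adj = (torch.rand(200, 200) > 0.95).float()
adj = ((adj + adj.T) > 0).float()                 # random symmetric graph
out = GCNLayer(64, 32)(nodes, adj)                # -> (200, 32)
```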
Building on these advances, researchers have extended attention mechanisms to graph formulations. Zheng et al. [48] proposed a graph Transformer that fuses spatial–spectral features via self-attention to enhance long-range dependency modeling. In parallel, Ahmad et al. [20] introduced a hybrid Graph Mamba model that tokenizes hyperspectral data into graph representations and leverages state–space modeling to balance efficiency and global context capture.
Although graph-based methods have significantly advanced hyperspectral image classification, they remain constrained by several factors. First, neighborhood aggregation in graph convolutions primarily captures local connectivity, limiting the modeling of complex long-range dependencies. Second, graph construction is often heuristic and scene-dependent, reducing adaptability across diverse scenes. Third, most models process features at fixed scales, hindering their adaptability to heterogeneous spatial–spectral patterns. Fourth, spatial and spectral cues are not always effectively integrated, leading to suboptimal joint representations. Finally, redundant or noisy bands degrade input quality and reduce classification robustness, particularly under scarce supervision. These challenges highlight the need for a more integrated approach that combines the strengths of graph structural learning with dynamic multi-scale fusion and global dependency modeling.
The proposed SSGTN is designed to address the aforementioned limitations in a unified framework. Unlike CNNs, SSGTN captures long-range dependencies via Transformer blocks while preserving local structure through graph convolutions. In contrast to pure Transformers, it incorporates an LDA-SLIC superpixel graph to model non-Euclidean spatial relationships and employs a spectral denoising module to enhance input representations. Compared to existing graph-based methods, SSGTN introduces a novel Spectral–Spatial Shift Module for dynamic multi-scale feature fusion and a dual-branch GCN-Transformer architecture to jointly model local topology and global dependencies. By synergistically integrating adaptive graph priors, spectral purification, shift-based feature interaction, and Transformer-based global reasoning, SSGTN achieves expressive and efficient hyperspectral representation learning under high-dimensional and structurally complex conditions, particularly under limited supervision.
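As a rough illustration of how such a dual-branch block can be wired, the sketch below runs a graph-convolution branch and a Transformer branch in parallel over superpixel features and fuses them through a residual graph convolution. This is our simplified reading of the described design, not the authors' implementation; all names and sizes are assumptions.

```python
# Simplified dual-branch GCN-Transformer block (our illustration only).
import torch
import torch.nn as nn

def normalize_adj(A):
    A_hat = A + torch.eye(A.size(0))                    # self-loops
    d = A_hat.sum(dim=1).pow(-0.5)
    return d[:, None] * A_hat * d[None, :]              # D^-1/2 (A+I) D^-1/2

class DualBranchBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local_lin = nn.Linear(dim, dim)            # GCN branch weights
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.fuse_lin = nn.Linear(2 * dim, dim)         # residual graph fusion

    def forward(self, X, A):                            # X: (nodes, dim)
        A_norm = normalize_adj(A)
        local = torch.relu(A_norm @ self.local_lin(X))  # local graph topology
        global_ = self.transformer(X.unsqueeze(0)).squeeze(0)  # global context
        fused = A_norm @ self.fuse_lin(torch.cat([local, global_], dim=-1))
        return X + fused                                # residual connection

X = torch.randn(200, 64)
A = (torch.rand(200, 200) > 0.95).float()
A = ((A + A.T) > 0).float()
out = DualBranchBlock()(X, A)                           # -> (200, 64)
```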
5. Discussion
The experimental results demonstrate that SSGTN effectively addresses key challenges in hyperspectral image classification under limited labeled data through its novel architectural design. The dual-branch framework successfully leverages complementary strengths: the graph-based branch preserves discriminative spectral patterns in homogeneous regions, while the Transformer branch captures long-range dependencies essential for complex landscapes.
However, SSGTN exhibits limitations in handling severely underrepresented classes, as evidenced by the poor performance on Class 12 in Houston2018. This limitation stems from graph sparsity in rare classes and attention bias toward dominant categories. Future work should explore topology-aware graph sampling and attention regularization to improve minority class representation.
Compared to CNN-based methods, SSGTN achieves superior spatial coherence through graph-structured regularization. Relative to pure GCN approaches, the Transformer branch mitigates oversmoothing in heterogeneous scenes. The computational complexity of joint graph-attention learning necessitates careful hardware considerations for large-scale deployments, though the sparse graph construction provides significant efficiency gains.
The consistent performance advantage across diverse datasets and low-label training regimes demonstrates that SSGTN is a robust and generalizable framework for hyperspectral image classification, particularly in practical scenarios where labeled data are limited and computational efficiency is critical.