1. Introduction
The hundreds of continuous bands in hyperspectral images (HSIs) contain abundant spatial and spectral information [1,2]. In hyperspectral images, each pixel can be regarded as a high-dimensional vector, whose entries correspond to the spectral reflectance at specific wavelengths [3,4,5,6]. Hyperspectral images have the advantage of distinguishing subtle spectral differences and have been widely applied in many fields [7], such as environmental monitoring [8], precision agriculture [9], mineral exploration, and military recognition [10,11].
Hyperspectral images contain hundreds of contiguous spectral bands. While this high-dimensional information provides a basis for classification, it also introduces information redundancy and the Hughes phenomenon. Furthermore, due to their low spatial resolution, HSIs suffer from mixed-pixel effects, so classification based solely on spectral information can incur significant errors. Therefore, effective feature extraction is of great research significance in the field of HSI classification.
Early studies on HSI classification primarily relied on machine learning (ML) methods, including Random Forest [12], K-Nearest Neighbors (KNN) [13], Markov Random Fields [14], Gaussian Processes [15], and Support Vector Machines [16]. However, these methods often ignore spatial–contextual information, are sensitive to noise and outliers, and lack the capacity to capture deep semantic features.
HSI classification models based on deep learning (DL) have become a cutting-edge research topic in recent years [17], and many network architectures have been applied in this field, including autoencoders (AEs) [18,19], recurrent neural networks (RNNs) [20], graph neural networks (GNNs) [21,22], Transformers [23], and Mamba [24,25]. Compared to machine learning, deep learning enables automatic high-dimensional feature extraction and supports end-to-end classification. The convolutional neural network (CNN), as a valuable paradigm, is widely applied in HSI classification tasks. A 1D-CNN [26] classifies HSIs spectrally using a five-layer convolutional structure but overlooks spatial relationships. A 2D-CNN incorporates neighboring-pixel information to improve classification. Since HSI data are intrinsically three-dimensional, Hamida et al. introduced a 3D-CNN [27] to simultaneously extract spectral and spatial features.
As researchers found that CNNs could not fully adapt to the high dimensionality of hyperspectral images, Zhu et al. [28] proposed RSSAN with a dual attention mechanism to suppress irrelevant spectral bands. However, RSSAN is constrained by its CNN-based structure and is prone to ignoring global information. In addition, Mamba [24,25] and Transformer [29,30] capture global features by establishing long-range dependencies in the spectral sequence, but face challenges including high computational resource demands and local information loss. SSGRAM [31] proposes a GNN paradigm that processes data directly within a local window, although this increases the computational burden. To reduce the computational burden, Wang et al. [32] designed a capsule attention network (CAN), combining activity vectors and attention mechanisms to improve HSI classification. Notably, most existing HSI classification methods rely on a single backbone network, which fails to simultaneously capture both global and local features. Consequently, dual-backbone architectures with appropriate fusion strategies present significant research value for effectively integrating global and local representations.
As shown in Figure 1, CNN-based feature extractors, constrained by their limited kernel sizes, primarily focus on local neighboring-pixel features [33]. Consequently, while CNNs demonstrate strong local feature extraction capabilities, their ability to capture global representations is limited [34]. In contrast, GNNs focus on modeling correlations among pixels or patches across the global scope [35,36], constructing a graph and classifying pixels or patches based on these correlations [37]. With a global receptive field that transcends spatial constraints, GNNs excel at capturing long-range pixel dependencies but pay limited attention to local fine-grained features [38,39]. Thus, combining both backbone networks provides an effective solution for multi-scale feature extraction. Current mainstream GNN–CNN fusion methods markedly improve classification accuracy compared to single backbone networks. However, simple fusion strategies often yield suboptimal results and heavy computational burdens. Thus, developing an effective global–local feature fusion module specifically for HSI data represents a viable approach to enhancing classification accuracy [32]. To mitigate the limitations of existing fusion architectures, we propose a feature fusion enhancement network termed GLFFEN.
The main contributions of this research are as follows:
We propose a global–local feature fusion enhancement network (GLFFEN) based on the combination of GNN and CNN. To reduce the GNN's computational load, we use superpixel segmentation (SLIC) patches as graph nodes and construct the graph attention (GA) branch with a multi-head attention-enhanced GNN that dynamically weights neighboring nodes.
We design a CNN-based spatial–spectral feature attention module (SSFAM) to extract local spatial–spectral features.
In order to solve the problems of scale misalignment and information redundancy interference during feature fusion, we propose a multi-feature adaptive fusion (MAF) module to effectively integrate global and local representations.
Comparative experiments have shown that our method is superior to existing methods on three well-known datasets, and ablation experiments have been conducted to verify the effectiveness of the proposed GLFFEN.
2. Related Work
As shown in Figure 2, the spatially heterogeneous distribution of land-cover categories in HSIs poses challenges for CNNs, as their regular grid-based feature extraction struggles to generalize across all image regions. In contrast, GNNs overcome this limitation by representing the image as a graph that transcends Euclidean distance constraints, establishing connections directly from spectral similarity and thereby effectively aggregating spatially dispersed pixels belonging to the same category. Conversely, the detailed spatial patterns captured by CNNs within local neighborhoods serve as robust features for initializing GNN nodes, while compensating for the potential loss of local detail caused by the topology-driven relational reasoning in GNNs. This complementary interaction enables the model to simultaneously perceive local details and establish global contextual relationships, thereby achieving robust classification of complex land-cover structures. By synergistically combining these architectures, a hybrid model can simultaneously leverage the CNN's power in fine-grained feature extraction and the GNN's ability to encode global semantic correlations, leading to enhanced representational capacity and improved classification accuracy, particularly in complex scenes with high inter-class variability.
2.1. Limitations of CNN for HSI Classification
CNN processes data at the pixel level and constructs non-linear mappings through the sequential stacking of convolutional operations, thereby progressively extracting high-level semantic features. Let the HSI cube be $\mathcal{X} \in \mathbb{R}^{H \times W \times B}$. The dot product between each convolutional input window of the $l$-th layer and the weight matrix $\mathbf{W}^{l}$ is

$$v_{i,j}^{l} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} w_{m,n}^{l} \, x_{i \cdot s + m,\; j \cdot s + n}^{l-1} + b^{l},$$

where $k$ denotes the kernel size and $s$ denotes the stride; $x_{i \cdot s + m,\; j \cdot s + n}^{l-1}$ and $w_{m,n}^{l}$ denote the input data and the weight of element $(m,n)$, respectively; $v_{i,j}^{l}$ is the corresponding output after convolution; and $b^{l}$ is the bias. Therefore, the obtained $v^{l}$ will be an array of scalar values. When processing hyperspectral images, the core convolutional operator of a CNN is intrinsically confined, by mathematical definition, to a local neighborhood of size $k \times k$. To expand the receptive field, CNNs must rely on stacking convolutional layers and pooling operations. However, this leads to severe dilution of information between distant pixels through multiple nonlinear transformations, while downsampling via pooling sacrifices precise spatial structural information. Consequently, although the inductive bias of CNNs is beneficial for extracting local-hierarchical features, it fundamentally impedes the capture of global context.
In convolutional neural networks, both the size of the convolution kernel and the stride length are critical hyperparameters. The kernel essentially extracts statistical features from local image regions, which can be regarded as an approximate modeling of reflective properties within uniformly distributed pixel areas. The stride controls the interval at which the kernel moves across the input data. Fundamentally, convolution is an operation that numerically encodes local regions, enabling deeper architectures to capture higher-level semantic features through larger receptive fields [40]. However, constrained by their scalar-based representation, CNNs often require deep architectures stacked with multiple convolutional and nonlinear activation layers to achieve satisfactory performance. This approach not only increases model complexity but also introduces challenges to feature propagation across distant layers. Although classical models such as ResNet and DenseNet have partially alleviated gradient flow issues through identity mappings or dense connections, their substantial computational demands hinder deployment in resource-constrained scenarios. Particularly in hyperspectral image classification tasks, the limited representational capacity of scalar features becomes a notable performance bottleneck.
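As a concrete illustration of this locality constraint, the following minimal PyTorch sketch (shapes and layer sizes are illustrative, not taken from any specific model in this paper) shows that a single convolution output depends only on a small spatial neighborhood of the input patch:

```python
import torch
import torch.nn as nn

# Illustrative only: one 2D convolution over an HSI patch whose channels are spectral bands.
# Each output value v[0, :, i, j] depends only on a 3 x 3 spatial neighborhood of x,
# so global context can only be reached by stacking many such layers (and pooling).
B, H, W = 103, 15, 15                         # bands, patch height, patch width (assumed)
x = torch.randn(1, B, H, W)                   # one HSI patch

conv = nn.Conv2d(in_channels=B, out_channels=64, kernel_size=3, stride=1, padding=1)
v = conv(x)
print(v.shape)                                # torch.Size([1, 64, 15, 15])
```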
2.2. Limitations of GNN for HSI Classification
Graph neural networks (GNNs) model hyperspectral images as undirected graphs to capture contextual relationships between land-cover categories. While GNNs demonstrate considerable potential for HSI classification, they face several inherent limitations that restrict their robustness and efficiency in practical applications.
The core operation in GNN involves iterative feature propagation through layer-wise updates. A standard graph neural network employs the propagation rule

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right),$$

where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is the corresponding degree matrix, $H^{(l)}$ contains the node features at layer $l$, $W^{(l)}$ denotes learnable weights, and $\sigma(\cdot)$ is a nonlinear activation.
A pivotal challenge arises from the sensitive and often subjective graph construction process, which critically influences model performance yet heavily relies on heuristic hyperparameter selection. Furthermore, repeated application of the propagation rule leads to over-smoothing [41], where node features become increasingly similar through successive multiplications with the normalized adjacency matrix. This spectral smoothing effect disproportionately amplifies influences from distant nodes while diluting local detail, undermining the model's capacity to capture fine-grained spatial structures in HSIs.
Additional limitations include substantial computational and memory overhead when scaling to typical hyperspectral data sizes [42], difficulties in modeling spectral–spatial heterogeneity, and a heightened risk of overfitting under limited labeled samples. These issues collectively motivate research into more adaptive and scalable graph-based learning frameworks for HSI classification.
2.3. CNN-GNN Combined Models
TBDGCN [43] alleviates over-smoothing in GNNs through the incorporation of DropEdge and residual connections, and combines superpixel-level and pixel-level features. However, it suffers from high computational complexity, and its performance is strongly influenced by the quality of superpixel segmentation. WFCG [44] performs weighted integration of features from a superpixel-based GAT and a pixel-based CNN to explore high-dimensional characteristics. Yet, its fusion mechanism is relatively elementary and may fall short of facilitating deep interaction between heterogeneous features. MIAF-Net [45] employs an interactive attention mechanism to enhance mutual supplementation between local CNN features and global GCN topology, along with a hierarchical attention fusion module. Nonetheless, the model's structural complexity results in elevated training difficulty and computational expense. Liu et al. [46] extracted features through multi-scale attentional graph convolution and a complementary dual convolutional attention network and introduced an attentional fusion pooling mechanism, yet this approach also faces challenges related to model complexity and computational overhead. NAGIN [37] enhances graph representation flexibility through adaptive neighborhood modeling, thereby boosting classification performance in complex scenarios. However, it maintains a pronounced sensitivity to hyperparameter configurations.
In contrast to the aforementioned methods, the proposed GLFFEN framework is designed to holistically address several recurring limitations in existing CNN-GNN fusion paradigms. Current approaches often face a critical trade-off: while methods relying on superpixels (e.g., TBDGCN) or adaptive graph structures (e.g., NAGIN) enhance modeling flexibility, they inherently suffer from sensitivity to segmentation quality or hyperparameter settings, compromising robustness. Furthermore, many fusion strategies, ranging from elementary weighted averaging (e.g., WFCG) to highly complex interactive attention modules (e.g., MIAF-Net), either fail to facilitate deep, hierarchical feature interactions or incur prohibitive computational costs. Motivated by these identified gaps, GLFFEN introduces a streamlined yet powerful architecture centered on a novel Multi-feature Adaptive Fusion (MAF) module. This design eliminates the dependency on explicit, high-quality superpixels and avoids intricate multi-stage feature extraction, enabling robust and efficient integration of heterogeneous spatial–spectral features. Consequently, GLFFEN establishes a superior balance between model performance, computational efficiency, and operational stability, presenting a cohesive solution that advances beyond the current state-of-the-art.
3. Materials and Methods
As shown in Figure 3, the GLFFEN framework pipeline consists of four key components: (1) preprocessing for spectral dimension reduction, (2) the GA branch for global feature extraction, (3) the SSFA branch for local feature extraction, and (4) the MAF module for global–local feature fusion.
We denote the HSI as $\mathcal{X} \in \mathbb{R}^{H \times W \times B}$, where $H$, $W$, and $B$ denote the height, width, and number of spectral bands, respectively. We use two convolutional layers as the preprocessing step. These layers perform cross-channel information exchange to remove uninformative spectral dimensions, strengthening discriminative ability and reducing computational cost.
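A minimal sketch of this preprocessing step is given below; the use of 1 × 1 convolutions and the specific channel sizes are our assumptions, since the text only specifies two convolutional layers used for cross-channel mixing and spectral reduction:

```python
import torch
import torch.nn as nn

# Sketch of the preprocessing stage (assumed layer sizes): two 1x1 convolutions mix
# information across spectral bands and compress B bands into a smaller channel count.
class SpectralReduction(nn.Module):
    def __init__(self, in_bands, mid=128, out=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, out, kernel_size=1), nn.BatchNorm2d(out), nn.ReLU(),
        )

    def forward(self, x):                  # x: (1, B, H, W), the whole HSI cube
        return self.net(x)

x = torch.randn(1, 200, 145, 145)          # illustrative cube size
print(SpectralReduction(200)(x).shape)     # torch.Size([1, 64, 145, 145])
```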
3.1. Graph Attention Branch
The GA branch is used to extract global features, mainly including four steps: Superpixel-based Graph Representation, Graph Construction, Dynamic Weight Attention Mechanism, and Feature Decoding.
3.1.1. Superpixel-Based Graph Representation
As shown in Figure 4, we adopt the Simple Linear Iterative Clustering (SLIC) superpixel algorithm [47] to aggregate adjacent pixels into homogeneous regions, thereby reducing computational complexity and enhancing structural coherence in subsequent graph construction. The algorithm operates in the CIELab color space augmented with spatial coordinates, forming an extended feature vector $[l, a, b, x, y]^{T}$ for each pixel. Initially, $k$ cluster centers are sampled on a regular grid with interval $S = \sqrt{N/k}$, where $N$ denotes the total number of pixels in the image. Each cluster center is subsequently optimized through iterative $k$-means clustering within a localized region of size $2S \times 2S$, minimizing a combined distance metric that balances color proximity and spatial adjacency. This process partitions the image into compact, perceptually consistent superpixels, which serve as the foundational nodes of the graph $G = (V, E)$ in our graph neural network architecture.
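The sketch below shows one way this step can be realized with scikit-image's SLIC implementation; the segment count, compactness, and the use of the reduced feature cube rather than a CIELab image are our assumptions, not the authors' exact settings:

```python
import numpy as np
from skimage.segmentation import slic

# Illustrative sketch: segment the reduced feature cube into superpixels and use the
# mean feature of each superpixel as the corresponding graph node.
def superpixel_nodes(x, n_segments=200, compactness=0.1):
    # x: (H, W, C) reduced feature cube
    segments = slic(x, n_segments=n_segments, compactness=compactness,
                    channel_axis=-1, start_label=0)
    n_sp = segments.max() + 1
    nodes = np.stack([x[segments == s].mean(axis=0) for s in range(n_sp)])
    return segments, nodes                 # (H, W) label map and (n_sp, C) node features

x = np.random.rand(145, 145, 64).astype(np.float32)
segments, nodes = superpixel_nodes(x)
print(nodes.shape)
```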
3.1.2. Graph Construction
The superpixel–pixel relationship is encoded in a binary association matrix $Q \in \{0,1\}^{HW \times Z}$, where $Z$ is the number of superpixels and $Q_{i,j} = 1$ indicates that pixel $i$ belongs to superpixel $j$. The image $\mathcal{X}$ is transformed into graph nodes $V$ via

$$V = \hat{Q}^{T} \, \mathrm{Flatten}(\mathcal{X}),$$

where $\hat{Q}$ is the column-normalized version of $Q$. Spatial adjacency between superpixels defines the graph edges $E$.
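A small NumPy sketch of this construction (notation follows the description above; the 4-neighbour rule for deciding which superpixels are adjacent is our assumption) is:

```python
import numpy as np

# Build the pixel-superpixel association matrix Q, column-normalize it to obtain node
# features, and connect superpixels that share a boundary (assumed 4-neighbour rule).
def build_graph(x, segments):
    H, W, C = x.shape
    n_sp = segments.max() + 1
    Q = np.zeros((H * W, n_sp), dtype=np.float32)
    Q[np.arange(H * W), segments.reshape(-1)] = 1.0      # Q[i, j] = 1 if pixel i in superpixel j
    Q_hat = Q / Q.sum(axis=0, keepdims=True)             # column-normalized association
    V = Q_hat.T @ x.reshape(H * W, C)                    # node features (per-superpixel mean)

    A = np.zeros((n_sp, n_sp), dtype=np.float32)         # spatial adjacency between superpixels
    right = segments[:, :-1] != segments[:, 1:]
    down = segments[:-1, :] != segments[1:, :]
    for s1, s2 in zip(segments[:, :-1][right], segments[:, 1:][right]):
        A[s1, s2] = A[s2, s1] = 1.0
    for s1, s2 in zip(segments[:-1, :][down], segments[1:, :][down]):
        A[s1, s2] = A[s2, s1] = 1.0
    return Q, V, A
```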
3.1.3. Dynamic Weight Attention Mechanism
To enhance traditional GNNs with adaptive neighbor weighting, we employ a multi-head attention mechanism that dynamically computes the importance of neighboring nodes. The feature transformation for each node $i$ is given by

$$h_{i} = \mathbf{W} v_{i},$$

where $\mathbf{W}$ is a shared weight matrix.
The attention mechanism computes normalized attention coefficients $\alpha_{ij}$ between connected nodes $i$ and $j$ via

$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{T}\left[ h_{i} \,\Vert\, h_{j} \right]\right)\right)}{\sum_{t \in \mathcal{N}_{i}} \exp\!\left(\mathrm{LeakyReLU}\!\left(\mathbf{a}^{T}\left[ h_{i} \,\Vert\, h_{t} \right]\right)\right)},$$

where $\mathbf{a}$ is a learnable attention vector and $\mathcal{N}_{i}$ denotes the neighborhood of node $i$.
Multi-head aggregation combines $K$ independent attention heads:

$$h_{i}' = \mathop{\Vert}_{k=1}^{K} \, \sigma\!\left(\sum_{j \in \mathcal{N}_{i}} \alpha_{ij}^{k} \, \mathbf{W}^{k} h_{j}\right).$$
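The following PyTorch sketch captures the essence of this dynamic weight attention layer over the superpixel graph; the head count, dimensions, and initialization are our assumptions, and this is not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative multi-head graph attention: a shared linear transform per head, LeakyReLU
# attention logits masked by the adjacency, softmax over neighbours, and concatenation
# of the K head outputs.
class MultiHeadGraphAttention(nn.Module):
    def __init__(self, in_dim, out_dim, heads=4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))   # learnable attention vectors
        nn.init.xavier_uniform_(self.a)

    def forward(self, V, A):                                     # V: (n, in_dim), A: (n, n)
        n = V.size(0)
        A = A + torch.eye(n, device=A.device)                    # self-loops avoid empty rows
        h = self.W(V).view(n, self.heads, self.out_dim)          # (n, K, d)
        src = torch.einsum('nkd,kd->nk', h, self.a[:, :self.out_dim])
        dst = torch.einsum('nkd,kd->nk', h, self.a[:, self.out_dim:])
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0))    # logits e_ij, shape (n, n, K)
        e = e.masked_fill((A == 0).unsqueeze(-1), float('-inf')) # keep only graph neighbours
        alpha = torch.softmax(e, dim=1)                          # normalize over neighbours j
        out = torch.einsum('ijk,jkd->ikd', alpha, h)             # weighted neighbour aggregation
        return out.reshape(n, self.heads * self.out_dim)         # concatenate the K heads
```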
3.1.4. Feature Decoding
The graph representation is projected back to pixel space via

$$\hat{X} = Q \, \tilde{V},$$

where $\hat{X}$ represents the pixel-level feature map reconstructed from the superpixel-level features $\tilde{V}$, and $Q$ maps the node features back to the grid format.
The output $\hat{X}$ is then fed into a fully connected layer and projected into the same space as the output of the SSFA branch. This operation places the outputs of the two branches in the same feature space in preparation for the MAF module.
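A minimal sketch of this decoding step (all dimensions are illustrative) is:

```python
import torch
import torch.nn as nn

# Scatter node features back to pixel space through the association matrix Q, then apply
# a fully connected projection into the common feature space shared with the SSFA branch.
def decode_to_pixels(Q, V_out, H, W, proj):
    # Q: (H*W, n_sp) binary association, V_out: (n_sp, C) GA-branch node features
    X_hat = Q @ V_out                      # each pixel inherits its superpixel's feature
    return proj(X_hat).view(H, W, -1)      # project, then restore the grid layout

proj = nn.Linear(64, 64)                   # illustrative dimensions
X_hat = decode_to_pixels(torch.eye(4), torch.randn(4, 64), 2, 2, proj)
print(X_hat.shape)                         # torch.Size([2, 2, 64])
```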
3.2. Spatial–Spectral Feature Attention Branch
The Spatial–Spectral Feature Attention Module (SSFAM) forms the core of our SSFA branch, with two SSFAMs arranged sequentially. As illustrated in Figure 5, each SSFAM comprises two principal components: the Spatial Feature Attention (SpaFA) block and the Spectral Feature Attention (SpeFA) block. These blocks are designed as efficient variants of the self-attention mechanism [48], leveraging global context modeling to enhance feature representations in their respective dimensions.
The proposed SSFAM differentiates itself from existing joint spatial–spectral attention mechanisms [49,50] through its sequential, decoupled architecture. While joint attention attempts to model interactions within a unified high-dimensional tensor, often incurring significant computational overhead and potential feature interference, the SSFAM processes spatial and spectral attention separately. This design ensures a more efficient and hierarchical feature refinement: the SpaFA first establishes global contextual relationships across the image, upon which the SpeFA performs channel-wise recalibration. This sequential, decoupled approach mitigates the optimization difficulties of entangled feature spaces and yields a more computationally efficient and interpretable model compared to its joint counterparts.
3.2.1. SpaFA Block
The SpaFA block operates as a spatial self-attention mechanism that captures long-range dependencies across spatial positions [51]. SpaFA constructs a global spatial attention map by calculating the correlation between any two positions in the feature map. Its core mechanism enables features to interact along the spatial dimension through matrix transposition and multiplication, encoding long-range spatial context into each pixel position; this overcomes the local receptive-field limitation of traditional convolution and strengthens the model's holistic perception of the spatial layout of ground objects. Formally, given an input feature map $X \in \mathbb{R}^{N \times C}$ with $N = H \times W$ spatial positions, we generate query, key, and value projections through linear transformations:

$$Q = X W_{Q}, \quad K = X W_{K}, \quad V = X W_{V},$$

where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable weight matrices. The spatial attention map is computed via scaled dot-product attention:

$$A_{\mathrm{spa}} = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right),$$

where $d_{k}$ represents the dimension of the key vectors. The enhanced spatial features are obtained through

$$X_{\mathrm{spa}} = \gamma \, (A_{\mathrm{spa}} V) + X,$$

where $\gamma$ is a learnable scaling parameter. This formulation enables global contextual modeling while preserving the original feature details through the residual connection.
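A compact PyTorch sketch of the SpaFA computation described above is shown below; the tensor layout and the single-head formulation are our assumptions:

```python
import torch
import torch.nn as nn

# Illustrative spatial self-attention over the N = H*W positions, with a learnable
# residual scale gamma as in the formulation above.
class SpaFA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.gamma = nn.Parameter(torch.zeros(1))    # learnable scaling, starts at zero

    def forward(self, x):                            # x: (B, N, C), N spatial positions
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return self.gamma * (attn @ v) + x           # residual keeps original details
```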
3.2.2. SpeFA Block
The SpeFA block functions as a channel self-attention mechanism that models interdependencies between spectral bands [51]. SpeFA focuses on feature recalibration along the channel dimension, explicitly modeling channel dependencies by constructing a covariance-like affinity matrix between channels. In this way, the features of each channel receive global information from all channels and undergo a nonlinear transformation, adaptively emphasizing spectral bands rich in discriminative information while suppressing redundant channels. Taking the spatially enhanced features $X_{\mathrm{spa}}$ as input, we compute the channel attention map following the self-attention paradigm:

$$Q' = X_{\mathrm{spa}} W_{Q}', \quad K' = X_{\mathrm{spa}} W_{K}', \quad V' = X_{\mathrm{spa}} W_{V}',$$

where $W_{Q}'$, $W_{K}'$, and $W_{V}'$ are learnable parameters. The channel attention is computed as

$$A_{\mathrm{spe}} = \mathrm{Softmax}\!\left(Q'^{T} K'\right),$$

with the final output obtained through

$$X_{\mathrm{spe}} = \beta \, (V' A_{\mathrm{spe}}) + X_{\mathrm{spa}},$$

where $\beta$ is a learnable parameter. This spectral attention mechanism adaptively emphasizes discriminative spectral bands while suppressing redundant information, completing the hierarchical feature refinement process.
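The corresponding channel attention can be sketched analogously (again with an assumed tensor layout); the affinity matrix is now C × C rather than N × N, which keeps the cost low when the spatial size is large:

```python
import torch
import torch.nn as nn

# Illustrative spectral (channel) self-attention: the affinity is computed between
# channels, and a learnable residual scale beta recalibrates the spectral bands.
class SpeFA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):                            # x: (B, N, C), spatially enhanced features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q.transpose(-2, -1) @ k / x.size(1) ** 0.5, dim=-1)  # (B, C, C)
        return self.beta * (v @ attn) + x            # recalibrated channels + residual
```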
3.3. Multi-Feature Adaptive Fusion Module
The proposed Multi-feature Adaptive Fusion (MAF) module addresses feature misalignment and shallow interaction in fusion through a dual-path architecture that explicitly processes macro-scale contextual patterns and micro-scale spatial details. The module employs a gated attention mechanism, implemented via a squeeze-and-excitation block, to generate dynamic channel-wise weights that resolve feature conflicts through non-linear recombination. This enables selective enhancement of complementary features while suppressing redundancies. Positioned after deep backbone networks, the module performs high-level aggregation of semantically rich features, preventing semantic dilution while maintaining discriminative power through coherent integration of multi-scale representations.
The network structure of the MAF module is shown in Figure 6. Given global and local feature representations with the same shape, $F_{g}, F_{l} \in \mathbb{R}^{N \times C}$, a channel descriptor is first obtained for each branch through a global average pooling (GAP) operation over the spatial dimension. The correlation between channels is then modeled by two fully connected layers, and the channel descriptors are generated through the sigmoid activation function:

$$s_{g} = \sigma\!\left(W_{2}\,\delta\!\left(W_{1}\,\mathrm{GAP}(F_{g})\right)\right), \qquad s_{l} = \sigma\!\left(W_{2}\,\delta\!\left(W_{1}\,\mathrm{GAP}(F_{l})\right)\right),$$

where $W_{1}$ and $W_{2}$ are the weights of the fully connected layers, $\delta$ denotes the ReLU activation, and $\sigma$ denotes the sigmoid function. Then, the global and local descriptors are concatenated along the second dimension to obtain $S = [s_{g}; s_{l}]$, and the respective weights are obtained through the softmax function:

$$[w_{g}, w_{l}] = \mathrm{Softmax}(S).$$

Finally, the generated weights are multiplied element-wise with the corresponding inputs and summed:

$$F_{\mathrm{fuse}} = w_{g} \odot F_{g} + w_{l} \odot F_{l},$$

where $F_{\mathrm{fuse}}$ represents the feature representation after fusion, ⊙ denotes the element-wise product, and $w_{g}$ and $w_{l}$ are the adaptive weights of the global and local branches.
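The sketch below summarizes this fusion step; the shared squeeze-and-excitation block and the reduction ratio are our assumptions, not details given in the text:

```python
import torch
import torch.nn as nn

# Illustrative MAF fusion: channel descriptors from global average pooling and a
# squeeze-and-excitation block, softmax competition between the two branches, and a
# weighted element-wise combination of the inputs.
class MAF(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.se = nn.Sequential(                      # SE block (assumed shared for brevity)
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, f_global, f_local):             # both: (B, N, C), same shape
        s_g = self.se(f_global.mean(dim=1))            # GAP over the spatial dimension -> (B, C)
        s_l = self.se(f_local.mean(dim=1))
        w = torch.softmax(torch.stack([s_g, s_l], dim=1), dim=1)   # (B, 2, C) branch weights
        return w[:, 0].unsqueeze(1) * f_global + w[:, 1].unsqueeze(1) * f_local
```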
3.4. Loss Function
The identity loss and reconstruction loss are used to train the proposed GLFFEN. The identity loss operates at the feature extraction level by imposing constraints on both the GA and SSFA branches to preserve the input’s spectral-spatial identity information. This mechanism effectively prevents critical discriminative features from being diminished during complex encoding-decoding transformations. The enforced identity consistency promotes tighter clustering of homogeneous samples and greater separation of heterogeneous samples in the feature space. Consequently, this enhancement in feature discriminability directly contributes to improved classification accuracy, with particularly notable gains observed in classifying minority categories and complex geographical boundaries.
The overall objective is

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{2}\,\mathcal{L}_{\mathrm{id}},$$

where $\mathcal{L}$ denotes the overall loss, $\mathcal{L}_{\mathrm{rec}}$ and $\mathcal{L}_{\mathrm{id}}$ denote the reconstruction loss and the identity loss, and $\lambda_{1}$ and $\lambda_{2}$ are the weights of the two terms.
3.4.1. Reconstruction Loss
The reconstruction loss measures the difference between the predicted value and the target value using the mean squared error (MSE):

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N}\sum_{i=1}^{N}\left( y_{i} - \hat{y}_{i} \right)^{2},$$

where $y_{i}$ denotes the ground-truth (GT) value and $\hat{y}_{i}$ denotes the predicted value.
3.4.2. Identity Loss
An output map is generated through the pre-trained IdentityMLP, and the MSE between it and the input is then calculated:

$$\mathcal{L}_{\mathrm{id}} = \frac{1}{N}\sum_{i=1}^{N}\left( \mathrm{IdentityMLP}(y_{i}) - y_{i} \right)^{2},$$

where IdentityMLP is a pre-trained identity mapping model that takes $y$ as input and outputs a transformation result that keeps the structure unchanged.
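Putting the two terms together, the training objective can be sketched as follows; the weighting values and the IdentityMLP interface are assumptions, since the paper only states that the two MSE terms are weighted and combined:

```python
import torch.nn.functional as F

# Illustrative combined objective: weighted sum of the reconstruction MSE and the
# identity MSE computed through a pre-trained identity mapping network.
def glffen_loss(pred, target, features, identity_mlp, lam_rec=1.0, lam_id=0.1):
    l_rec = F.mse_loss(pred, target)                      # prediction vs. ground truth
    l_id = F.mse_loss(identity_mlp(features), features)   # preserve branch identity information
    return lam_rec * l_rec + lam_id * l_id
```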
6. Conclusions
This paper proposes GLFFEN, a global–local feature fusion enhancement network for hyperspectral image classification. By combining the global and local feature extraction capabilities of GNN and CNN, more comprehensive and detailed classification information can be obtained. We design a GNN based on superpixel segmentation and a multi-head attention mechanism as the GA branch for extracting global features. In addition, we propose the SSFAM to focus more effectively on local spatial–spectral features. Furthermore, the MAF module is designed for self-weighted fusion of global and local features, which enables the fusion strategy to adjust automatically to different ground-object types across datasets. Comparison experiments on three well-known HSI datasets show that GLFFEN has a significant advantage over six other SOTA methods in terms of classification performance.
Although the proposed GLFFEN framework is competitive, this study acknowledges certain limitations. This model is still inherently constrained by its reliance on the quality of superpixel segmentation. In addition, the parallel dual-branch architecture still brings about relatively high computational complexity. These aspects highlight the key directions for future research, including exploring unsegmented graph construction, developing more adaptive and in-depth feature fusion interaction mechanisms, and simplifying model structures to enhance computational efficiency.