A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification

Xu, Yaoqun; Wang, Junyi; You, Zelong; Li, Xin

doi:10.3390/sym17111930

Open AccessArticle

A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification

School of Computer Science and Information Engineering, Harbin University of Commerce, Harbin 150028, China

^*

Author to whom correspondence should be addressed.

Symmetry 2025, 17(11), 1930; https://doi.org/10.3390/sym17111930

Submission received: 25 September 2025 / Revised: 24 October 2025 / Accepted: 8 November 2025 / Published: 11 November 2025

(This article belongs to the Section Computer)

Download

Browse Figures

Versions Notes

Abstract

Convolutional neural networks (CNNs) and graph convolutional networks (GCNs) have been widely applied to hyperspectral image classification tasks, but both exhibit certain limitations. To address these issues, this paper proposes a multi-scale feature fusion architecture (MCGNet). Symmetry serves as the core design principle of MCGNet, where its parallel CNN-GCN branches and multi-scale fusion mechanism strike a balance between local spectral-spatial features and global graph structural dependencies, effectively reducing redundancy and enhancing generalization capabilities. The architecture comprises four modules: the Spectral Noise Suppression (SNS) module enhances the signal-to-noise ratio of spectral features; the Local Spectral Extraction (LSE) module employs deep separable convolutions to extract local spectral-spatial features; Superpixel-level Graph Convolution (SGC), performing graph convolution on superpixel graphs to precisely capture dependencies between object regions; Pixel-level Graph Convolution (PGC), constructed via adaptive sparse pixel graphs based on spectral and spatial similarity to accurately capture irregular boundaries and fine-grained non-local relationships between pixels. These modules form a symmetric, hierarchical feature learning pipeline integrated within a unified framework. Experiments on three public datasets—Indian Pine, Pavia University, and Salinas—demonstrate that MCGNet outperforms baseline methods in overall accuracy, average precision, and Kappa coefficient. This symmetric design not only enhances classification performance but also endows the model with strong theoretical interpretability and cross-dataset robustness, highlighting the significance of symmetry principles in hyperspectral image analysis.

Keywords:

hyperspectral image classification; convolutional neural networks; graph convolutional networks; feature fusion; deep learning

1. Introduction

Hyperspectral imaging (HSI) integrates spatial and spectral information, capturing high-resolution reflectance across hundreds of contiguous bands. At each pixel point, HSI records reflectance information across multiple wavebands from visible light to near-infrared and even mid-infrared, typically encompassing hundreds of adjacent narrow bands [1]. Compared to traditional RGB images or multispectral images, HSI provides richer and more detailed spectral information, enabling it to capture subtle spectral differences in material composition at the chemical level, thereby achieving more precise object identification and classification.

Due to its unique imaging characteristics, hyperspectral remote sensing technology [2,3] has been widely applied. In agricultural remote sensing, it can be used for crop growth monitoring, pest and disease identification, etc.; in environmental protection, it is commonly used for water pollution detection and air pollution source identification; in urban planning, it supports land use classification and urban expansion monitoring; in geological exploration, it enables mineral distribution analysis; in forestry resource surveys, it assists in tree species classification; and in medical imaging processing, it can be applied to tissue pathology detection and cancerous region identification. In the aforementioned applications, HSI classification [4], which involves precisely labeling each pixel in an image with the corresponding object or material category, serves as a foundational step for downstream analysis and decision-making. It plays a crucial role in unlocking the full potential of hyperspectral remote sensing data.

Although HSI exhibits strong discriminative power in the spectral dimension, its classification tasks still face a series of challenges: (1) the contradiction between high-dimensional features and limited samples, i.e., the “curse of dimensionality” and overfitting issues; (2) high spectral similarity between classes in high-dimensional spectral space, along with high correlation and noise interference between bands; (3) complex spatial structures, including blurred object boundaries and widespread mixed pixels; (4) The scarcity of training samples and imbalanced class distribution caused by high annotation costs. The scarcity of training samples and imbalanced class distribution caused by high annotation costs [5,6,7].

Early studies primarily employed traditional machine learning methods based solely on spectral data [8], such as support vector machines [9], extreme learning machines [10], k-nearest neighbor classifiers (KNN) [11], and logistic regression. These methods classify pixels by designing discriminant functions, but most adopt a pixel-by-pixel processing approach, ignoring the spatial structural features of the image. As a result, in areas with complex boundaries or severe noise interference, classification results often exhibit fragmented and discontinuous patterns [12,13]. To address this, researchers have increasingly introduced various spectral-spatial joint feature modeling methods [14], utilizing techniques such as sparse representation [15], low-rank representation [16], and collaborative representation [17] to enhance spatial discriminative capabilities while preserving spectral structure. Additionally, graph modeling techniques [18,19] have been widely applied in HSI classification, effectively capturing spatial proximity and spectral consistency by constructing pixel similarity graphs or superpixel segmentation results. Examples include edge-weighted KNN graphs [20]; superpixel segmentation methods [21], such as SLIC [22,23,24]; and edge-preserving methods based on morphological processing have all achieved significant improvements in spatial structure representation.

With the breakthrough progress of deep learning in computer vision, CNNs have been widely applied in HSI classification, such as the 1D-CNN [25] to 2D-CNN [26] proposed by Ge et al., which transitions from modeling only the spectral dimension to modeling only the spatial dimension; to the 3D-CNN [27] proposed by Zhong et al., which simultaneously models both spectral and spatial dimensions; and various structural evolution versions of 3D-CNN, such as a multi-channel hybrid 2D-3D CNN [28] designed for small-sample hyperspectral image classification. Structures such as MCCNN [29], DenseNet [30], and ResNe [31] have also been introduced to mitigate gradient vanishing and improve training stability. However, the fixed receptive field characteristic of CNNs limits their ability to handle complex morphological structures with irregular spatial distributions and capture long-range spatial dependencies [32].

Therefore, GCN has been introduced into the HSI classification field [33,34]. GCN can propagate features across arbitrary graph structures, effectively capturing spatial correlations across regions. However, since hyperspectral images contain a large number of pixels, directly treating each pixel as a graph node would result in a significant increase in computational and storage overhead [35]. To address this issue, researchers proposed a graph structure modeling method based on superpixels [36,37]. By using a simple linear clustering algorithm (SLIC) to segment HSI into a series of compact superpixels and treating them as nodes, the size of the graph is effectively reduced, thereby improving efficiency. Superpixels exhibit excellent edge retention capabilities in space, facilitating the extraction of superpixel-level features. However, superpixel aggregation inevitably results in the loss of some fine-grained information, particularly at category boundaries and in small target regions, which may lead to classification errors. Although some methods [24,38,39] have attempted to supplement superpixel-level GNN models with richer feature representations by integrating CNNs, they inevitably inherit the inherent limitations of CNNs, such as difficulty in capturing long-range dependencies between instances and the inability to guarantee the homogeneity of pixels within the receptive field.

Against this backdrop, methods that fuse CNNs and GCNs [40,41] have gradually become a research hotspot. By using CNNs to extract local fine-grained features and GCNs to model cross-regional structural relationships, the two can learn synergistically to effectively mitigate each other’s limitations, as mentioned in [42]. However, existing fusion methods [43,44,45] still face the following bottlenecks: (1) The fixed structure of convolutional kernels lacks adaptability to the diversity of landform shapes; (2) While superpixel-level GCN reduces computational complexity, it tends to lose pixel-level detail information, while pixel-level GCN is computationally intensive; (3) There is a lack of an efficient and unified fusion mechanism between CNN and GCN features.

Symmetry holds significant importance in computational science, as it simplifies models, reduces redundancy, and enhances algorithm interpretability. In hyperspectral image classification, both spectral and spatial dimensions exhibit inherent structural symmetry. Leveraging these properties to design symmetric networks ensures balanced information flow, mitigates overfitting, and enhances noise resilience. This study pairs convolutional and graph-based modules, embedding symmetry into the MCGNet architecture to achieve consistent feature extraction across scales and domains, highlighting the crucial role of symmetry in hyperspectral image analysis [46,47].

To effectively address the challenges in the aforementioned HSI classification, this paper proposes a multi-scale feature fusion architecture (MCGNet) that combines convolutional neural networks (CNN) with graph neural networks (GCNs). This architecture consists of four key modules: a lightweight spectral noise suppression module (SNS), a local spectral feature extraction module (LSE), a superpixel-level graph convolution module (SGC), and a pixel-level graph convolution module (PGC). First, the input hyperspectral images undergo noise suppression via the SNS module to enhance the signal-to-noise ratio of spectral features. The denoised pixel features are then simultaneously fed into the LSE, SGC, and PGC modules. In the LSE module, local spectral features are extracted using deep separable convolutions to capture spectral-spatial information from pixel neighborhoods; The SGC module performs graph convolution on the superpixel graph to capture dependencies between object regions and enhance global contextual information. To further enhance feature expressiveness, MCGNet introduces a powerful self-attention mechanism [48], effectively decoupling feature representations at the superpixel level and converting superpixel features into fine-grained pixel-level feature representations. The PGC module adaptively constructs a sparse graph structure at the pixel level, dynamically generating adjacency relationships based on spectral and spatial similarity to accurately model irregular dependencies between pixels. Through parallel processing, these three modules extract complementary features from different levels, capturing multi-scale information at the local, regional, and pixel levels. Finally, these features are fused in the channel dimension and a pixel-level category probability distribution is generated through a Softmax classifier, thereby achieving high-precision hyperspectral image classification. MCGNet significantly improves the accuracy and robustness of hyperspectral image classification by fully utilizing local spectral features, regional dependencies, and fine-grained pixel-level information through a carefully designed parallel feature extraction mechanism.

Experimental results on three public hyperspectral datasets (Indian Pines, Pavia University, and Salinas) demonstrate that the proposed method significantly outperforms baseline methods in terms of OA, AA, and Kappa coefficients, achieving improvements of 0.59% 0.84% and 0.81% over the best baseline method, respectively. Additionally, the superpixel dimension reduction and sparse pixel map mechanisms significantly reduce computational overhead while maintaining high classification accuracy, validating the method’s synergistic optimization advantages in both accuracy and efficiency. Our code is open-source on the public platform GitHub: https://github.com/Wang-jun-yi/MCGNet_main (accessed on 7 November 2025).

The main contributions of this paper are as follows:

We propose a novel MCGNet architecture that successfully achieves multi-scale feature modeling from local details to global relationships and from rule-based domains to non-Euclidean spaces through the collaborative action of the SNS module, LSE branch, SGC branch, and PGC branch. Experimental results show that MCGNet improves overall classification accuracy by 0.59% and reduces running time by 18.01 s compared to the best baseline method on multiple public datasets.
We introduce the SNS module and the LSE module. The SNS module effectively improves the signal-to-noise ratio of input data through a lightweight noise suppression strategy, laying a solid foundation for subsequent feature extraction. The LSE module adopts deep separable convolution, which not only reduces computational complexity but also enhances the quality of feature representations.
We propose the SGC module, which successfully captures the dependency relationships between object regions through graph convolution on superpixel graphs. Additionally, we introduce a self-attention mechanism to decouple superpixel features, further refining pixel-level feature representations.
We propose the PGC module. This module combines spectral and spatial similarity to construct a sparse graph structure and effectively captures complex dependencies between pixels through graph convolution, thereby improving the model’s performance in identifying complex object boundaries and subtle differences.

2. Proposed Method

The method proposed in this paper aims to learn a discriminative model that infers the categories of all pixels in a full hyperspectral image based on a limited number of labeled samples. The model framework is illustrated in Figure 1. The overall architecture of MCGNet follows a symmetric multiscale design principle, organized through four key modules in a balanced and complementary manner: the SNS module for enhancing spectral feature signal-to-noise ratio and improving feature quality; the LSE branch for extracting local spectral features via deep separable convolutions; the SGC branch for modeling superpixel-level dependencies; and the PGC branch for constructing pixel-level sparse graph relationship modeling. Specifically, LSE focuses on extracting local spectral spatial details, while SGC and PGC capture long-range dependencies at regional and pixel scales. This symmetric arrangement ensures consistent information flow from local to global scales and from Euclidean to non-Euclidean domains, reducing bias toward any single representation type. By maintaining structural symmetry across modules, MCGNet achieves a robust balance between efficiency and accuracy, enabling consistent feature learning across diverse datasets. Detailed descriptions of these modules follow.

2.1. Spectral Noise Suppression and Local Spectral-Spatial Encoding

2.1.1. SNS

Since HSI is highly susceptible to various types of noise during data acquisition, such as sensor noise, environmental noise, and atmospheric scattering, these noises can severely impair image quality, reduce the purity of spectral features, and consequently negatively impact the accuracy of subsequent land classification. Therefore, effective denoising of raw hyperspectral data is particularly important before performing advanced feature extraction. To address this, this paper introduces an SNS module at the front end of the proposed model as a preprocessing stage, aiming to filter out redundant information from the source data and enhance effective features, thereby providing a cleaner and more discriminative input representation for the subsequent multi-branch feature extraction module. The specific implementation of this module is as follows:

First, the input HSI is

X \in R^{H \times W \times C}

, where H, W, and C represent the height, width, and spectral channel number of the image, respectively, and is rearranged into a four-dimensional tensor with shape (1, C, H, W). Let the rearranged input hyperspectral image be:

X^{(0)} \in R^{1 \times C_{i n} \times H \times W}

(1)

Subsequently, the data passes through two stacked 1 × 1 convolution layers, each followed by batch normalization (BN) and LeakyReLU activation functions:

X^{(p)} = L e a k y R e L U ({B N}_{m} (X^{(p - 1)} * W^{(p)} + b^{(p)}))

(2)

Among them,

W \in R^{1 \times 1 \times c_{p - 1} \times c_{p}}

is the weight of the 1 × 1 convolution kernel in layer p,

b^{p}

is the corresponding bias term, and

{BN}_{m} (\cdot)

is batch normalization. This module performs the above operations twice. Finally, the output of the SNS module is:

Y_{S N S} = X^{(2)} \in R^{1 \times C_{o u t} \times H \times W}

(3)

Through the synergistic action of the above components, the SNS module can effectively learn from and extract purer, more informative feature representations from raw hyperspectral data. As part of the convolutional branch, the SNS module forms a symmetric counterpart to graph-based modules, focusing on local feature refinement while maintaining structural balance.

2.1.2. LSE

To further explore the spatial structural information of images and construct a discriminative spectral-spatial joint feature representation, this paper introduces the LSE branch into the model. This branch focuses on capturing spatial clues such as texture, edges, and structure in local regions of images, thereby alleviating the category ambiguity issues that may arise when relying solely on spectral features for classification. Combined with the output from the previous stage of the SNS, the LSE can model local spatial variation patterns while reducing spectral noise, thereby improving object recognition accuracy.

The core of the LSE branch lies in the use of a depth-separable convolution structure, which splits traditional convolution operations into two stages: depth convolution and pointwise convolution. This approach significantly reduces model parameters and computational complexity while maintaining feature extraction capabilities, thereby enhancing model efficiency. Let the input tensor be:

H_{L E S}^{(1)} = Y_{S N S} \in R^{1 \times C_{i n} \times H \times W}

(4)

First, in the depthwise convolution stage, a spatial convolution kernel

K^{(c)} \in R^{k \times k}

is applied independently to each input channel

c \in {1, 2, \dots, C_{in}}

to perform local spatial receptive field computation. For a pixel position

(i, j)

, the computation of the output feature

Y^{'} \in R^{H \times W \times C_{in}}

is given by:

\tilde{H} (i, j, c) = \sum_{m = - ⌊ k / 2 ⌋}^{⌊ k / 2 ⌋} \sum_{n = - ⌊ k / 2 ⌋}^{⌊ k / 2 ⌋} K_{m, n} (c) \cdot H_{LES}^{(1)} (i + m, j + n, c)

(5)

where k denotes the size of the convolution kernel, and

⌊ \cdot ⌋

represents the floor operation (rounding down).

Based on the local spatial features extracted by deep convolution, the pointwise convolution stage employs a 1 × 1 convolution kernel

W \in R^{C_{in} \times C_{out}}

to perform a linear combination of the deep convolution outputs along the channel dimension, thereby achieving cross-spectral information fusion. For the output feature map

Y^{'} \in R^{H \times W \times C_{out}}

, the output for each channel k is computed as follows:

H_{LES}^{(2)} (i, j, c) = \sum_{c = 1}^{C_{in}} W_{c, k} \cdot H (i, j, c) + b_{k}

(6)

where

b_{k}

is the bias term for the k-th output channel.

In summary, the LSE module organically integrates local spatial feature modeling with spectral channel fusion through the aforementioned ’separate channels first, then combine channels’ processing approach, effectively enhancing feature representation capabilities while significantly reducing parameter scale and computational complexity. The LSE module, together with SNS, establishes a locally symmetric convolutional branch, ensuring consistency and complementarity with the global graph-based components of the architecture. Therefore, the deep separable convolution structure becomes the key component for extracting local spatial-spectral features in this model.

2.2. Regional and Pixel-Level Relationship Inference

Although the LSE module can effectively capture fine-grained spatial-spectral features in HIS classification, hyperspectral data itself possesses characteristics such as high dimensionality, high resolution, and dense pixel distribution. These characteristics make global modeling at the full image scale extremely computationally expensive and storage-intensive. Additionally, convolutional structures with fixed receptive fields are limited in performance when significant semantic differences exist between objects across multiple scales, making it difficult to comprehensively model long-range dependencies in complex scenes.

2.2.1. SGC

To address this, this paper introduces the SGC branch, which aims to construct a non-local, long-range spatial dependency modeling mechanism. This branch adopts the graph encoder–decoder framework proposed in CEGCN [24]. We design a fusion-type decoder based on the Transformer [48], which not only effectively resolves the problem of insufficient information fusion in traditional decoders but also significantly enhances the model’s capability to model multi-scale context.

First, to obtain semantically meaningful homogeneous regions and reduce the size of the subsequent graph, SLIC is used to segment the HSI. After segmentation, the HSI is divided into

N_{s}

superpixels

S = {S_{i}}_{i = 1}^{N}

.

Based on the superpixel segmentation results, a pixel-to-superpixel mapping matrix

P \in R^{N_{p} \times N_{s}}

is constructed to define the mapping relationship between all

N_{p}

pixels and

N_{s}

superpixels, as defined below:

P_{i j} = \{\begin{matrix} 1, & if pixel i \in superpixel S_{i} \\ 0, & otherwise \end{matrix}

(7)

Second, the association matrix P between pixels and superpixels is used to convert pixel-level features

H_{p} \in R^{N_{p} \times C}

into superpixel-level representations

H_{s} \in R^{N_{s} \times C}

. To achieve this aggregation, the matrix P is first normalized column-wise to obtain

\hat{P}

, where

{\hat{P}}_{i j} = P_{i j} / \sum_{k = 1}^{N_{p}} P_{k j}

. Consequently, the features of each superpixel can be represented as the weighted average of all pixel features within it. The aggregation process can be efficiently implemented using matrix multiplication:

H_{s} = Encode (H_{p}; P) = {\hat{P}}^{T} H_{p}

(8)

where

H_{p}

is the pixel feature matrix and

H_{s}

contains the aggregated superpixel features, this process compresses fine-grained pixel features into more compact regional features, thereby significantly reducing the scale of subsequent graph operations.

Subsequently, the GCN is applied to the pre-constructed superpixel-level adjacency graph

M \in R^{N_{s} \times N_{s}}

to capture the global context and long-range dependencies between different regions. The updated regional feature representation is as follows:

H_{s}^{'} = σ (\hat{M} H_{s} W_{g})

(9)

Among these,

\hat{M} = {\tilde{D}}^{- \frac{1}{2}} (M + I) {\tilde{D}}^{- \frac{1}{2}}

is the normalized adjacency matrix obtained after adding self-loops,

\tilde{D}

is the corresponding degree matrix,

W_{g}

is the learnable weight, and

σ (\cdot)

represents the LeakyReLU activation function. Next, the association matrix P between pixels and superpixels is utilized to convert superpixel-level features into pixel-level feature representations, i.e.:

H_{p}^{'} = P H_{s}^{'}

(10)

Among these,

H_{p}^{'} \in R^{N_{p} \times C}

represents the pixel features of this module. To enhance the representation of the pixel feature

H_{p}^{'}

, we apply the self-attention mechanism to generate a new attention matrix by combining the superpixel feature

H_{s}^{'}

and the local spectral feature

H_{L S E}

generated by the LSE module, and perform attention calculations from the superpixel features to the fine-grained pixel features. Specifically, we first map

H_{p}^{'}

into key vectors (K) and value vectors (V) using two different linear transformations, each followed by a normalization layer:

K = L a y e r N o r m (H_{p}^{'} W_{K}), V = B a t c h N o r m (H_{p}^{'} W_{V})

(11)

Simultaneously, a linear transformation with a normalization layer is applied to

H_{L S E}

to generate a query vector (Q):

Q = L a y e r N o r m (H_{L S E} W_{Q})

(12)

Among them,

W_{K}

,

W_{V}

, and

W_{Q}

are learnable parameters, while

LayerNorm (\cdot)

and

BatchNorm (\cdot)

are used to normalize the feature distribution.

Next, the association matrix P between pixels and superpixels is used to weight the attention matrix, resulting in the attention matrix between pixels and superpixels:

A = Softmax (P ⊙ (\frac{Q K^{T}}{\sqrt{d_{B}}}))

(13)

Finally, the pixel-level feature

H_{p}^{'}

, generated based on the association matrix P between pixels and superpixels, is fused with the attention-weighted aggregation value vector

A V

in a proportional manner:

H_{f u s e d} = α \cdot A V + H_{p}^{'}

(14)

where

α

is a hyperparameter.

In summary, the SGC branch significantly reduces computational complexity through the bidirectional ‘pixel-superpixel-pixel’ mapping mechanism and introduces a Transformer-based global self-attention mechanism in the decoding stage to achieve efficient fusion of multi-source features. This design excels in modeling long-range dependencies, facilitating cross-scale semantic interactions, and integrating multi-source information, thus providing feature representations that combine both local details and global context for subsequent classification and prediction. The SGC module embodies the global branch of the symmetric architecture, complementing the convolutional branch by modeling long-range contextual dependencies at the superpixel level.

2.2.2. PGC

Although the SGC branch effectively models global context information at the superpixel level, the superpixel aggregation process has a certain level of compressibility, which may lead to the loss of fine-grained features during mapping, such as boundary information or small object features. To more precisely capture the complex spatial relationships and spectral similarities between pixels, this paper further designs the PGC branch, which constructs a sparse pixel graph to directly model non-local information.

In this module, the input pixel feature matrix

H_{p} \in R^{N_{p} \times C}

is first subjected to graph construction operations, where

N_{p}

denotes the number of pixels and C represents the feature dimension. This process takes into account two factors: spectral similarity and spatial distance, which are modeled through feature vector dot products and Gaussian kernel functions, respectively. The specific formulas are as follows:

S_{spat} (i, j) = H_{p} (i) H_{p} (j)

(15)

S_{spat} (i, j) = exp (- {∥p_{i} - p_{j}∥}^{2} / 2 σ^{2})

(16)

where

p_{i}

and

p_{j}

denote the spatial coordinates of pixels i and j, respectively, and

σ

is the scale parameter that controls the shape of the Gaussian distribution.

The final weighted adjacency matrix

A \in R^{N_{p} \times N_{p}}

is defined as the weighted sum of the two similarity measures, with only the top K most relevant neighbors retained for each pixel to construct a sparse pixel graph, significantly reducing the computational complexity of the graph.

On the constructed graph structure

\tilde{A} = A + I

, PGC uses GCN for feature propagation and enhancement. The pixel-level graph convolution propagation rule is defined as follows:

H^{(l + 1)} = σ ({\tilde{D}}^{- \frac{1}{2}} \tilde{A} {\tilde{D}}^{- \frac{1}{2}} H^{(l)} W^{(l)})

(17)

Among them,

H^{(l)}

represents the node feature matrix of layer l;

\tilde{D}

is the diagonal matrix of

\tilde{A}

;

W^{(l)}

is the learnable weight matrix of layer l; and

σ (\cdot)

is the nonlinear activation function.

By stacking L layers of graph convolutional units, PGC can progressively expand the receptive field of pixels, capture contextual relationships across a wide spatial range, and ultimately obtain pixel-level feature representations that incorporate global structural constraints:

H_{P G C} = H^{(L)}

(18)

Alongside the SGC module, the PGC branch further enhances the global side of the symmetric design, capturing pixel-level irregular dependencies and maintaining equilibrium with the local convolutional components. The PGC branch performs fine-grained modeling of image structures at the pixel level and captures long-range dependencies, overcoming the limitations of fixed receptive fields in CNNs and compensating for the loss of fine-grained structural information that may occur during superpixel aggregation. This is particularly evident in boundary-sensitive areas, small target regions, or objects with irregular structures. It works in conjunction with modules such as SGC, LSE, and SNS to form the multi-scale feature extraction framework of the MCGNet model, significantly enhancing the model’s ability to integrate information at different granularities and model dependency relationships, thereby improving classification accuracy.

2.3. Fusion and Final Classification

In the MCGNet model proposed in this paper, we have designed three parallel feature extraction branches to model hyperspectral images from different scales and perspectives: the LSE branch focuses on joint spectral-spatial feature extraction within local regular windows; the PGC branch models non-local spectral and spatial relationships by constructing pixel-level graph structures; and the SGC branch models long-range spatial dependencies at the superpixel level using graph convolutions based on superpixel graphs.

Each branch models ground object features at different granularities and spatial levels, offering complementary capabilities. However, features extracted from a single branch are prone to information loss or a limited field of view, making it challenging to achieve high-precision classification alone. Therefore, designing an efficient feature fusion mechanism is essential for improving model performance. To this end, MCGNet introduces a unified fusion and classification module after the three branches to integrate the key features output by each branch, thus enhancing representation capabilities and classification performance.

First, the superpixel-enhanced features

= H_{f u s e d}

from the SGC branch are weighted and fused with the pixel-level graph convolution features

H_{PGC}

from the PGC branch, balancing local and global non-Euclidean relationships. The fusion operation is implemented through element-wise addition:

H_{g r a p h} = H_{f u s e d} + H_{P G C}

(19)

Among these,

H_{f u s e d} \in R^{N \times C}

is the result of the SGC branch projecting superpixel-level graph features to the pixel level,

H_{PGC} \in R^{N \times C}

represents the fine-grained graph structure features captured by the PGC branch, and

H_{g r a p h} \in R^{N \times C}

is the fused pixel-level graph structure features. This fusion method combines the semantic context of superpixel graphs with the detailed feature representation capabilities of pixel graphs, thereby enhancing the model’s ability to understand complex graph structures.

Building on the graph feature fusion, the model further integrates the local convolution features

X_{LSE} \in R^{N \times C}

from the LSE branch. The final fused features are obtained through a concatenation operation, resulting in a high-dimensional representation containing multi-source information:

Y_{f u s e d} = C o n c a t (X_{L S E}, H_{g r a p h})

(20)

Here,

Concat (\cdot)

represents the concatenation operation along the feature dimension;

Y_{fused} \in R^{N \times 2 C}

represents the final pixel feature representation after fusion. This fusion process unifies the modeling capabilities of regular and irregular graph structures, fully capturing texture, edges, regional relationships, and contextual dependencies, thereby constructing a robust multi-source semantic representation.

Finally, the fused features

Y_{fused}

are used as input to the final classification layer, mapped to the category space via a linear transformation, and then output as the classification probabilities for each pixel via the Softmax function:

Z = Y_{f u s e d} \cdot W_{c l s} + b_{c l s}

(21)

Q_{i, j} = S o f t m a x (Z_{i}, j)

(22)

where

W_{cls}

is the classifier weight matrix; Z represents the linear output;

Q_{i, j}

denotes the probability that pixel i is classified as class j; and K represents the total number of classes.

The fusion stage plays a pivotal role in preserving symmetry, integrating local–global and pixel–region features into a unified representation, thereby ensuring balanced information propagation across the entire network.

Through the above fusion strategy, MCGNet achieves multi-scale information integration, spanning from local details to global relationships and from regular domains to non-Euclidean spaces. The three branches handle local spectral features, pixel-level dependencies, and superpixel-level contextual information, respectively, while the fusion module effectively integrates them into discriminative feature representations. This strategy not only enhances the model’s ability to model fine-grained regions and boundary information but also improves its adaptability and generalization in complex terrain scenarios, demonstrating outstanding performance across multiple remote sensing classification datasets.

2.4. Theoretical Explanation of Symmetric Euclidean–Non-Euclidean Feature Fusion

In the proposed network, the CNN branch operates in the Euclidean space, extracting local spatial–spectral features through convolutional filtering within regular neighborhoods. Conversely, the GCN branch functions in the Non-Euclidean space, where relationships among pixels or regions are represented as graph structures. Traditional CNN–GCN fusion methods typically follow a unidirectional flow (CNN → GCN), in which convolutional features dominate the learning process while graph representations serve as auxiliary cues. Such asymmetric structures as illustrated in Figure 2a often lead to gradient imbalance and weakened relational features, limiting model generalization.

MCGNet introduces a symmetric multiscale fusion architecture, establishing bidirectional and equal-weight information exchange between Euclidean and Non-Euclidean feature spaces. At each layer, CNN and GCN features are projected into a shared latent space and fused symmetrically, formulated as:

f_{sym} (X) = \frac{1}{2} (f_{CNN} (X) + f_{GCN} (X))

(23)

where

f_{CNN} (X)

and

f_{GCN} (X)

denote Euclidean and Non-Euclidean feature mappings, respectively. This symmetric operation not only ensures feature complementarity during forward propagation but also maintains balanced gradient flow during backpropagation. Consequently, MCGNet achieves a balanced co-evolution of spatial–spectral and relational representations, effectively mitigating CNN dominance and enhancing structural collaboration between Euclidean and Non-Euclidean spaces.

3. Experiments

This section provides a detailed description of the experimental setup, comparison results, and visualization of the classification results of the proposed TSCMPN model. First, we introduce the three hyperspectral remote sensing datasets used in the experiment and their basic characteristics. Next, we provide a detailed explanation of the experimental configuration, including data preprocessing, training and testing strategies, and evaluation metrics. Finally, we verify the superior performance of the proposed TSCMPN model by comparing it with a classic classification algorithm and several advanced deep learning models.

3.1. Hyperspecteral Data Sets

This section provides a detailed description of the experimental setup, comparative results, and classification result visualizations for the proposed MCGNet model. First, we introduce the three hyperspectral remote sensing datasets used in the experiments and outline their fundamental characteristics. Next, the experimental configuration is presented, including data preprocessing, training and testing strategies, and evaluation metrics. Finally, the superior performance of the proposed MCGNet model is validated by comparing it with several state-of-the-art deep learning models.

In order to comprehensively evaluate the effectiveness of the MCGNet model proposed in this paper, three publicly available hyperspectral datasets commonly used in hyperspectral image classification studies were selected for the experimental section: Indian Pines (IP), Pavia University (PU), and Salinas (SA).

The IP dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 over an agricultural area in northwestern Indiana, USA. The image has a spatial resolution of 145 × 145 pixels and a spectral resolution of 10 nm, covering 224 spectral bands. After removing 24 bands with low signal-to-noise ratios due to atmospheric water vapor absorption, 200 bands were used for analysis in the experiment. This dataset contains 16 feature classes with a total of 10,249 labeled samples. For the IP dataset, 1% of the samples from each class were randomly selected as the training set. The detailed class information for this HSI is provided in Table 1.

The PU dataset was acquired by the German Reflectance Optical Spectral Imager (ROSIS) in 2002 over the University of Pavia, Italy. The spatial resolution of the image is 610 × 340 pixels, with a spectral range from 0.43 to 0.86 µm. The raw data consists of 115 bands, and after removing 12 noise bands, 103 bands were used in the experiment. The scene contains 9 urban feature classes with a total of 42,776 labeled samples. Due to the large sample size, only 0.1% of the samples from each class were randomly selected as the training set. The detailed class information for this HSI is shown in Table 2.

The SA dataset was also acquired by the AVIRIS sensor in 1998 over the Salinas Valley, CA, USA. The image has a high spatial resolution of 512 × 217 pixels (3.7 m). Similar to the IP dataset, 204 bands were used after removing 20 bands affected by water vapor absorption and noise. This dataset contains 16 crop classes with a total of 54,129 labeled samples. Due to the large sample size, only 0.1% of the samples from each class were randomly selected as the training set. The detailed class information for this HSI is provided in Table 3.

To evaluate the effect of class imbalance in the selected datasets, we calculated the Imbalance Ratio (IR). As shown in Table 4, the Indiana Pines dataset exhibits a severe imbalance with an IR of 122.75, indicating that the largest class contains over 120 times more samples than the smallest one. The Pavia University and Salinas datasets show moderate imbalance levels, with IR values of 19.69 and 15.53, respectively. These results suggest that relying solely on Overall Accuracy (OA) may lead to biased evaluations, and thus, complementary metrics such as F1 are introduced for a fairer comparison.

3.2. Experiment Setting

3.2.1. Baseline

To validate the effectiveness of the proposed method, we conducted comparative experiments with a variety of classical and state-of-the-art methods on three standard datasets, which include convolutional neural networks, graph neural networks, and their powerful multi-scale feature fusion models. Specifically, the models include: CNN, a 2D convolutional classification model; GCN, a graph convolutional network method; CEGCN, a fusion model combining CNN and GCN that fuses local spatial convolution with the global graph structure; MSSGU, a multi-scale hypergraph convolutional network based on the UNet structure; SSSTNet, a spectral-space attention mechanism model based on the Transformer; and EMS-GCN, an adaptive hyperpixel graph structure. These baseline methods all use the hyperparameter configurations provided in the authors’ original papers, as these were carefully tuned by the authors.

3.2.2. Model Setup

All experiments were conducted using the PyTorch (version 2.4.0) deep learning framework, with computation accelerated by CUDA in a GPU environment. The training, validation, and test sets were divided using a proportional sampling strategy. During training, the learning rate was uniformly set to

5 \times 10^{- 4}

, with a maximum of 600 training epochs. The Adam optimizer was used, and the scale parameter for hyperpixel segmentation was adjusted according to the dataset characteristics. For the IP dataset, the parameter was set to 200, while for the PU and SA datasets, it was set to 100.

3.2.3. Evaluation Metrics

To quantitatively assess the model performance, we adopted three standard evaluation metrics: OA, AA, and Kappa. All reported experimental results are the mean and standard deviation of five experiments with different random seeds to minimize the impact of randomness.

3.3. Comparison of Classification Performance

3.3.1. Experimental Results

On the IP dataset, MCGNet achieves the best performance across all metrics, with an OA of 85.87% ± 1.68 (95% CI: [0.8291, 0.8924]), an AA of 74.63% ± 2.37 (95% CI: [0.7111, 0.7725]), a Kappa coefficient of 83.81% ± 1.87 (95% CI: [0.8055, 0.8761]), and an F1 score of 85.06% ± 2.77 (95% CI: [0.8319, 0.8852]), as shown in Table 5. These values represent improvements of 0.59%, 3.30%, 0.67%, and 0.33%, respectively, compared to the best-performing baseline method. The narrow confidence intervals indicate that the performance of MCGNet is both stable and statistically reliable across multiple runs (

p < 0.05

). Notably, in categories where traditional models perform poorly, such as Class 1 and Class 4, MCGNet shows significant advantages, achieving accuracies 18.33% and 65.27% higher than GCN and SSSTNet, respectively. In Classes 8 and 13, which exhibit highly similar spectral characteristics, MCGNet achieves 100.00% and 99.70% accuracy, outperforming all other models. These results demonstrate that the multi-branch synergistic structure effectively preserves discriminative information even in complex and small-sample classes, leading to higher F1 scores and improved class balance.

On the PU dataset, MCGNet achieves an OA of 92.03% ± 1.95 (95% CI: [0.8934, 0.9482]), an AA of 89.62% ± 3.25 (95% CI: [0.8533, 0.9403]), a Kappa coefficient of 89.41% ± 2.68 (95% CI: [0.8572, 0.9322]), and an F1 score of 91.95% ± 2.38 (95% CI: [0.8921, 0.9499]), as presented in Table 6. Compared to the strongest baseline, these metrics improve by 0.84%, 1.26%, 2.12%, and 1.11%, respectively. The small variance and narrow confidence intervals again validate the robustness and statistical significance of the improvements (

p < 0.05

). In complex urban scenarios, MCGNet shows distinct advantages in fine-grained feature discrimination. For instance, the accuracy of Class 3 reaches 80.05%, outperforming CNN, MSSGU, and SSSTNet by 26.61%, 14.19%, and 60.05%, respectively. Class 5 achieves 99.92%, demonstrating superior convergence and class-wise balance, which directly contributes to its high F1 score. These gains mainly arise from the superpixel graph convolution (SGC) branch that effectively captures global contextual relationships, while the pixel graph convolution (PGC) branch models local fine-grained structures, complementing the CNN’s limited receptive field.

On the SA dataset, MCGNet achieves the highest scores with an OA of 94.75% ± 0.48 (95% CI: [0.9413, 0.9563]), an AA of 95.37% ± 0.99 (95% CI: [0.9395, 0.9693]), a Kappa coefficient of 94.15% ± 0.54 (95% CI: [0.9346, 0.9513]), and an F1 score of 94.81% ± 0.50 (95% CI: [0.9411, 0.9561]), as shown in Table 7. The confidence intervals for all four metrics are extremely tight, indicating high reliability and minimal fluctuation across repeated trials. MCGNet also performs strongly in small-sample and complex categories—for instance, Class 3 achieves 87.88%, which is 17.99% and 3.47% higher than SSSTNet and EMS-GCN, respectively, while Class 13 reaches 94.35%, surpassing CNN, GCN, and SSSTNet. The superior F1 score reflects the model’s balanced precision and recall, even in categories with few samples or high spectral similarity. This advantage is attributed to the multi-level fusion design, where the LSE branch extracts fine local features, the SGC branch captures superpixel-level contextual dependencies, and the PGC branch learns pixel-level non-local relationships.

The results from all three datasets consistently confirm that MCGNet achieves statistically significant improvements (

p < 0.05

) and stable performance across runs. By jointly modeling local spectral-spatial patterns, superpixel-level dependencies, and pixel-level non-local relations, the proposed architecture realizes hierarchical representation learning from local to global. This leads to superior OA, AA, Kappa, and F1 values, demonstrating strong generalization and cross-scene robustness for hyperspectral image classification.

3.3.2. Visualization of Dataset Classification

In order to more intuitively compare the classification results of different methods, Figure 3, Figure 4 and Figure 5 present the visualization results of multiple methods on the IP, PU, and SA datasets. CNN is prone to significant “salt-and-pepper” noise in complex boundary regions, as shown in Figure 3c. This is particularly noticeable at the farmland boundaries in the IP dataset, where the classification results are fragmented. This issue arises because the fixed receptive field of CNNs struggles to capture contextual information across regions. In contrast, MCGNet, as shown in Figure 3h, generates smoother and more continuous boundaries, which benefits from the region-level relationship modeling performed by the SGC branch on the hyperpixel map, effectively improving spatial consistency.

Although GCN alleviates part of the boundary discontinuity problem by utilizing the graph structure, missing edges still appear at the building-road intersections in the PU dataset, as shown in Figure 4d. In comparison, the PGC branch of the proposed method significantly improves this issue by constructing an adaptive pixel-level graph, which can flexibly adapt to complex boundary shapes and accurately portray irregular contours.

MCGNet also demonstrates advantages in small-sample category recognition. For example, Class 7 in the IP dataset has a scattered distribution and a small area. Compared with other methods, the proposed method can fully identify these regions due to the high-resolution local features in the LSE branch and the fine-grained relationship modeling in the PGC branch.

For the complex urban scenes in the PU dataset, MCGNet maintains high accuracy in distinguishing categories such as buildings, grass, roads, and shadows. Particularly in shaded regions, such as Class 9, it can accurately restore shapes, as shown in Figure 4h, avoiding shape distortion and pixel blurring.

In the SA dataset, MCGNet can more accurately reconstruct the irregular farmland boundaries, as shown in Figure 5h, effectively avoiding the loss of details caused by excessive smoothing.

In summary, the multi-branch cooperative mechanism of MCGNet excels in boundary preservation, small feature recognition, and adaptation to complex scenes. This not only validates its superior quantitative performance but also visually demonstrates its advantages in detail restoration and spatial consistency.

3.3.3. Comparison of Training and Testing Time

To comprehensively evaluate the computational efficiency of the proposed method, we compared the average training and testing time of MCGNet with various baseline methods on the IP dataset. The results are presented in Table 8. The experimental results show that, although MCGNet’s computational time is slightly higher than some simpler baseline models due to its multi-branch architecture and complex graph learning mechanism, the overall overhead remains within an acceptable range. In contrast, EMS-GCN’s training and testing times are longer, primarily due to the adaptive super-pixel segmentation performed in each iteration.

The advantages of MCGNet are primarily attributed to: the SGC branch performing graph reasoning on the super-pixel graph, which reduces the number of graph nodes and effectively lowers the computational complexity of graph convolution operations; the PGC branch constructing a sparse graph structure, connecting only to a small number of relevant neighbors, thereby reducing computational overhead while maintaining feature expressiveness; the LSE branch utilizing depthwise separable convolutions, which significantly reduces the model parameters and floating-point operations (FLOPs).

Notably, the design of MCGNet incorporates efficient graph construction and feature extraction mechanisms, effectively mitigating the high computational overhead faced by traditional graph neural networks when processing large-scale hyperspectral images. These optimizations ensure that MCGNet achieves higher inference efficiency and resource utilization, while maintaining high classification accuracy.

4. Ablation Studies

In this section, the proposed MCGNet model is discussed and analyzed in depth, including the effectiveness of its internal modules, the visualization of feature fusion, the ion, the adaptability of the datasets, the sensitivity of key hyperparameters, and the graph learning mechanism of this; all of this will be presented in the following section one by one.

4.1. Comparison and Analysis of Results

To comprehensively evaluate the contribution and necessity of each key component in the MCGNet model, a series of systematic ablation experiments was conducted. The effects of different architectural configurations were analyzed by selectively removing or replacing the core modules, including the LSE, SGC, and PGC branches, as well as the symmetric and self-attention mechanisms. The comparative results on the three datasets are summarized in Table 9, which reports the overall accuracy (OA), average accuracy (AA), and Kappa coefficient for each model variant.

The results reveal several clear trends. First, the LSE-Only model, which relies solely on depthwise separable convolution for local spectral–spatial feature extraction, achieves the lowest accuracies across all datasets (OA of 74.14%, 84.27%, and 88.77% on IP, PU, and SA, respectively). Although it effectively alleviates the “salt-and-pepper” noise problem by focusing on local contextual consistency, its limited receptive field restricts its ability to capture long-range dependencies and irregular class boundaries.

When integrating graph reasoning modules, the performance significantly improves. The LSE + SGC variant introduces the superpixel graph convolution (SGC) branch, which models region-level contextual relationships. Compared with the LSE-Only model, this configuration increases OA by 10.45%, 6.37%, and 4.92% on the IP, PU, and SA datasets, respectively. This improvement demonstrates that the SGC branch effectively captures global structure and smooths predictions within large homogeneous areas, improving classification consistency.

Similarly, the LSE + PGC variant incorporates the pixel graph convolution (PGC) branch to directly construct sparse pixel-level graphs based on spectral similarity and spatial distance. Although its improvement on IP is moderate (OA = 83.60%), it achieves more robust feature discrimination in complex boundary regions. This branch excels at preserving detailed edge information and modeling fine-grained relationships between spectrally similar classes.

The PGC + SGC variant, which removes the LSE convolutional path but retains both graph branches, shows that these two graph-based reasoning components are largely complementary. It achieves OA values of 84.95%, 90.37%, and 94.29% on IP, PU, and SA, respectively, surpassing any single-branch configuration. This confirms that the integration of pixel-level and superpixel-level graphs enhances both global contextual awareness and local discriminability.

The complete MCGNet model, which integrates all three branches under a symmetric multi-scale framework with self-attention, achieves the best overall performance on all datasets—85.87% ± 1.68, 92.03% ± 1.95, and 94.75% ± 0.48 OA for IP, PU, and SA, respectively. The corresponding AA and Kappa values also reach the highest levels, indicating improved inter-class balance and model reliability. Compared with the best two-branch combination (PGC + SGC), the full model further improves OA by 0.92%, 1.66%, and 0.46%, respectively, validating the synergy of the LSE branch in complementing graph-based features.

In addition, the inclusion of the Self-Attention mechanism yields further gains by enhancing the information exchange between parallel branches. Ablation results without attention (e.g., LSE + PGC) show relatively higher variance across datasets, while configurations with attention exhibit more stable performance and higher mean accuracy. The symmetric design also contributes to feature alignment between scales, leading to better robustness and convergence.

Table 9 presents the OA, AA, and Kappa values of each model variant across the three datasets. To visually demonstrate the impact of different modules on performance, Figure 6 provides a detailed performance analysis across multiple metrics in grouped bar charts.

The LSE-Only model effectively extracts local spectral-spatial features within the neighborhood, mitigating the “salt-and-pepper” phenomenon in pixel-wise classification. However, due to the fixed receptive field, it struggles to model long-range dependencies and irregular boundaries, limiting its performance.

The LSE + SGC method reduces the number of graph nodes and introduces global context with the super-pixel graph generated by SLIC. This significantly improves classification consistency in large homogeneous regions and reduces noise misclassification, outperforming the baseline model.

The LSE + PGC method adaptively constructs sparse graphs at the pixel level, establishing connections through spectral similarity and spatial distance. This approach better characterizes complex boundaries and fine-grained long-range dependencies, showing significant advantages in scenarios with complex boundaries and spectral proximity between classes.

The complete MCGNet model combines the complementary advantages of all three branches, significantly outperforming any single-branch or two-branch combinations. It achieves the best performance across all datasets, validating the effectiveness and robustness of the multi-scale and multi-view collaborative modeling.

Introducing SGC or PGC alone can significantly improve the performance of the LSE baseline, while the model performance is optimized when all three work together and integrate the multi-granularity information effectively by the fusion module, as shown in Figure 6d. This indicates that multi-scale and multi-view feature modeling has significant advantages in HSI classification. The results of the ablation experiments fully demonstrate that the excellent performance of the method adopted in this paper is due to the complementary roles of the modules in the multi-branch architecture.

4.2. Effectiveness of Different Modules in MCGNet

To verify the effectiveness of multi-branch feature fusion, t-SNE is employed to map the high-dimensional features from different stages of the model to a 2D space for comparison.

When only the LSE branch is used, as shown in Figure 7a,e,i, similar samples form an initial clustering in the feature space. However, the boundaries between classes are still unclear, and some classes overlap, indicating that relying solely on local spectral-spatial convolution makes it difficult to separate the classes effectively.

After introducing the SGC branch to the LSE baseline, as shown in Figure 7b,f,j, the super-pixel-level global context modeling enhances the intra-class compactness of large-area features, such as similar farmland, while increasing the inter-class spacing, which reduces misclassification across regions.

When the PGC branch is added to the LSE branch, as shown in Figure 7c,g,k, the pixel-level sparse graph construction effectively captures boundaries and fine-grained non-local relationships, further improving inter-class differentiation. The separation of feature distributions becomes more evident, particularly in small-sample and morphologically complex classes.

Finally, the complete MCGNet model, as shown in Figure 7d,h,l, exhibits an ideal pattern of high intra-class compactness and strong inter-class separation. This evolutionary process demonstrates that the multi-branch synergistic mechanism gradually optimizes the feature space structure, providing the classifier with more discriminative input representations, thereby achieving optimal performance.

4.3. Parameter Sensitivity Analysis

4.3.1. Hyperparameter Sensitivity Analysis

To investigate the robustness of MCGNet with respect to key hyperparameters, we performed sensitivity experiments on two representative parameters: (1) the superpixel segmentation scale (Scale) used in the SLIC algorithm for constructing region-level graphs, and (2) the Gaussian parameter

σ

employed in the pixel-level graph construction.

A smaller Scale value generates finer superpixels, preserving spatial detail but increasing computational cost, whereas a larger Scale reduces node count and computation but may blur region boundaries. Similarly, the Gaussian parameter

σ

controls the connectivity weights in the pixel graph: too small a value causes sparse connections and unstable learning, while too large a value leads to over-smoothing and noise propagation.

As shown in Table 10, MCGNet maintains stable performance under moderate variations in both parameters. The optimal configuration is achieved when Scale = 200 for the Indian Pines dataset, and Scale = 100 for both Pavia University and Salinas datasets, while

σ = 1.0

yields the best overall trade-off between spatial smoothness and edge preservation. The accuracy variation across different parameter values is less than 1.5%, confirming the strong robustness of MCGNet to hyperparameter settings.

4.3.2. Computational Complexity Analysis

To formally analyze the computational cost of MCGNet, both time and memory complexities are expressed using Big-O notation, and compared with representative baselines as summarized in Table 11.

Here, N is the number of pixels, E is the number of graph edges,

N_{s}

is the number of superpixel nodes, and K denotes the number of nearest neighbors in sparse adjacency. The proposed MCGNet achieves a lower computational burden than pixel-level GCNs by introducing a hierarchical superpixel structure and sparse connectivity, while maintaining high discriminative power. This analysis demonstrates that the symmetric design of MCGNet effectively balances accuracy and efficiency, making it scalable to larger hyperspectral scenes.

5. Conclusions

This paper proposes a multi-scale feature fusion architecture (MCGNet) that integrates CNNs with GNNs, successfully addressing key challenges in HSI classification, such as multi-scale dependency modeling, complex feature extraction, and the high computational cost of graph-based methods. MCGNet introduces an SNS module and three complementary branches (LSE, SGC, and PGC) to achieve semantic feature extraction and fusion across multiple scales, from local to global and from Euclidean grids to non-Euclidean spaces. The SNS module enhances the spectral signal-to-noise ratio, forming a solid foundation for subsequent feature learning. The LSE branch employs depthwise separable convolutions to extract local spectral–spatial representations; the SGC branch performs efficient long-range relation modeling at the superpixel level; and the PGC branch utilizes sparse graph learning to accurately capture boundary details and fine-grained pixel dependencies. Moreover, the Transformer-based decoder fusion module further strengthens the integration of heterogeneous multi-branch features. Experimental results demonstrate that MCGNet consistently outperforms mainstream methods such as CNN, GCN, and CEGCN on three benchmark datasets (IP, PU, and SA), confirming the effectiveness of its multi-branch design in capturing the complex spectral-spatial features posed by the initial problem statement.

The key advantage of MCGNet lies in unifying convolutional and graph-based learning within a symmetric multi-scale framework, which ensures both computational efficiency and balanced model complexity. This symmetric architecture enhances robustness, reduces feature redundancy, and ensures balanced spectral–spatial information propagation. From both theoretical interpretability and practical generalizability perspectives, MCGNet achieves an effective balance between accuracy and efficiency, providing a scalable structural foundation for intelligent multisource remote sensing understanding.

While the results validate MCGNet’s design, several limitations guide future research. First, for real-world HSI applications, the model requires validation beyond benchmark datasets to assess its robustness under complex surface conditions and dynamic environments. Second, potential scalability challenges may arise when processing very large-scale HSIs, necessitating research into hierarchical or parallel graph construction strategies. Building on this, future work will explicitly target domain adaptation and transfer learning tasks, exploring MCGNet’s cross-sensor generalization for multi-source, multi-temporal, and heterogeneous remote sensing data. This includes integrating techniques such as federated learning and knowledge distillation for few-shot learning scenarios. Finally, the foundational principle of symmetry shows promise for extension to other remote sensing tasks, such as change detection, semantic segmentation, and multimodal fusion, demonstrating its broad applicability.

Author Contributions

Revising the manuscript critically for intellectual content; and final approval of the version to be published, Y.X.; Drafting the paper, J.W.; Analysis and interpretation of the data, Z.Y.; Conception and design, X.L. All authors agree to be accountable for all aspects of the work. All authors have read and agreed to the published version of the manuscript.

Funding

This paper received no external funding.

Data Availability Statement

We have utilized publicly available datasets: https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 11 July 2024).

Acknowledgments

We appreciate the valuable comments and constructive suggestions from the anonymous reviewers that helped improve the manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

CNN	Convolutional Neural Network
GCN	Graph Convolutional Network
HSI	Hyperspectral Image
KNN	k-Nearest Neighbor
LSE	Local Spectral Feature Extraction
PGC	Pixel-level Graph Convolution
SLIC	Simple Linear Iterative Clustering
SGC	Superpixel Graph Convolution
SNR	Signal-to-Noise Ratio
SNS	Spectral Noise Suppression

References

Zhou, B.; Deng, L.; Ying, J.; Wang, Q.; Cheng, Y. Dimensionality reduction method based on spatial-spectral preservation and minimum noise fraction for hyperspectral images. J. Eur. Opt. Soc.-Rapid Publ. 2025, 21, 31. [Google Scholar] [CrossRef]
Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and applications. Math. Probl. Eng. 2022, 2022, 5880959. [Google Scholar] [CrossRef]
Pande, C.B.; Moharir, K.N. Application of hyperspectral remote sensing role in precision farming and sustainable agriculture under climate change: A review. In Climate Change Impacts on Natural Resources, Ecosystems and Agricultural Systems; Springer: Cham, Switzerland, 2023; pp. 503–520. [Google Scholar]
Lv, W.; Wang, X. Overview of hyperspectral image classification. J. Sensors 2020, 2020, 4817234. [Google Scholar] [CrossRef]
Wang, Y.; Xue, Z.; Jia, M.; Liu, Z.; Su, H. Hypergraph convolutional network with multiple hyperedges fusion for hyperspectral image classification under limited samples. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5526318. [Google Scholar] [CrossRef]
Datta, D.; Mallick, P.K.; Bhoi, A.K.; Ijaz, M.F.; Shafi, J.; Choi, J. Hyperspectral image classification: Potentials, challenges, and future directions. Comput. Intell. Neurosci. 2022, 2022, 3854635. [Google Scholar] [CrossRef]
Tejasree, G.; Agilandeeswari, L. An extensive review of hyperspectral image classification and prediction: Techniques and challenges. Multimed. Tools Appl. 2024, 83, 80941–81038. [Google Scholar] [CrossRef]
Ullah, F.; Ullah, I.; Khan, R.U.; Khan, S.; Khan, K.; Pau, G. Conventional to deep ensemble methods for hyperspectral image classification: A comprehensive survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3878–3916. [Google Scholar] [CrossRef]
Ge, H.; Pan, H.; Wang, L.; Liu, M.; Li, C. Self-training algorithm for hyperspectral imagery classification based on mixed measurement k-nearest neighbor and support vector machine. J. Appl. Remote Sens. 2021, 15, 042604. [Google Scholar] [CrossRef]
Liu, Z.; Zhang, Z.; Cai, Y.; Miao, Y.; Chen, Z. Semi-supervised classification via hypergraph convolutional extreme learning machine. Appl. Sci. 2021, 11, 3867. [Google Scholar] [CrossRef]
Qin, Y.; Ye, Y.; Zhao, Y.; Wu, J.; Zhang, H.; Cheng, K.; Li, K. Nearest neighboring self-supervised learning for hyperspectral image classification. Remote Sens. 2023, 15, 1713. [Google Scholar] [CrossRef]
Zhang, W.; Kasun, L.C.; Wang, Q.J.; Zheng, Y.; Lin, Z. A review of machine learning for near-infrared spectroscopy. Sensors 2022, 22, 9764. [Google Scholar] [CrossRef]
Boateng, D. Advances in deep learning-based applications for Raman spectroscopy analysis: A mini-review of the progress and challenges. Microchem. J. 2025, 209, 112692. [Google Scholar] [CrossRef]
Imani, M.; Ghassemian, H. An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges. Inf. Fusion 2020, 59, 59–83. [Google Scholar] [CrossRef]
Peng, J.; Sun, W.; Li, H.C.; Li, W.; Meng, X.; Ge, C.; Du, Q. Low-rank and sparse representation for hyperspectral image processing: A review. IEEE Geosci. Remote Sens. Mag. 2021, 10, 10–43. [Google Scholar] [CrossRef]
Zhao, Y.; Yan, F. Hyperspectral image classification based on sparse superpixel graph. Remote Sens. 2021, 13, 3592. [Google Scholar] [CrossRef]
Wang, N.; Zeng, X.; Duan, Y.; Deng, B.; Mo, Y.; Xie, Z.; Duan, P. Multi-scale superpixel-guided structural profiles for hyperspectral image classification. Sensors 2022, 22, 8502. [Google Scholar] [CrossRef]
Jia, S.; Jiang, S.; Zhang, S.; Xu, M.; Jia, X. Graph-in-graph convolutional network for hyperspectral image classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1157–1171. [Google Scholar] [CrossRef]
Bai, J.; Ding, B.; Xiao, Z.; Jiao, L.; Chen, H.; Regan, A.C. Hyperspectral image classification based on deep attention graph convolutional network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5504316. [Google Scholar] [CrossRef]
Li, L.; Chen, X.; Song, C. A robust clustering method with noise identification based on directed K-nearest neighbor graph. Neurocomputing 2022, 508, 19–35. [Google Scholar] [CrossRef]
Subudhi, S.; Patro, R.N.; Biswal, P.K.; Dell’Acqua, F. A survey on superpixel segmentation as a preprocessing step in hyperspectral image analysis. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5015–5035. [Google Scholar] [CrossRef]
Subudhi, S.; Patro, R.; Biswal, P.K. Texture Based Superpixel Segmentation Algorithm for Hyperspectral Image Classification. Res. Sq. 2022. [Google Scholar] [CrossRef]
Yang, C.; Kong, Y.; Wang, X.; Cheng, Y. Hyperspectral Image Classification Based on Adaptive Global–Local Feature Fusion. Remote Sens. 2024, 16, 1918. [Google Scholar] [CrossRef]
Liu, Q.; Xiao, L.; Yang, J.; Wei, Z. CNN-enhanced graph convolutional network with pixel-and superpixel-level feature fusion for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 8657–8671. [Google Scholar] [CrossRef]
Bera, S.; Shrivastava, V.K.; Satapathy, S.C. Advances in Hyperspectral Image Classification Based on Convolutional Neural Networks: A Review. CMES-Comput. Model. Eng. Sci. 2022, 133, 219–250. [Google Scholar] [CrossRef]
Ge, Z.; Cao, G.; Li, X.; Fu, P. Hyperspectral image classification method based on 2D–3D CNN and multibranch feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5776–5788. [Google Scholar] [CrossRef]
Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2017, 56, 847–858. [Google Scholar] [CrossRef]
Chen, S.Y.; Chu, P.Y.; Liu, K.L.; Wu, Y.C. A Multichannel Hybrid 2D-3D-CNN for Hyperspectral Image Classification with Small Training Sample Sizes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5540915. [Google Scholar] [CrossRef]
Chiney, A.; Paduri, A.R.; Darapaneni, N.; Kulkarni, S.; Kadam, M.; Kohli, I.; Subramaniyan, M. Handwritten data digitization using an anchor based multi-channel CNN (MCCNN) trained on a hybrid dataset (h-EH). Procedia Comput. Sci. 2021, 189, 175–182. [Google Scholar] [CrossRef]
Liao, T.; Li, L.; Ouyang, R.; Lin, X.; Lai, X.; Cheng, G.; Ma, J. Classification of asymmetry in mammography via the DenseNet convolutional neural network. Eur. J. Radiol. Open 2023, 11, 100502. [Google Scholar] [CrossRef]
Roy, S.K.; Manna, S.; Song, T.; Bruzzone, L. Attention-based adaptive spectral–spatial kernel ResNet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7831–7843. [Google Scholar] [CrossRef]
Li, S.; Zhu, X.; Liu, Y.; Bao, J. Adaptive spatial-spectral feature learning for hyperspectral image classification. IEEE Access 2019, 7, 61534–61547. [Google Scholar] [CrossRef]
Zhao, X.; Ma, J.; Wang, L.; Zhang, Z.; Ding, Y.; Xiao, X. A review of hyperspectral image classification based on graph neural networks. Artif. Intell. Rev. 2025, 58, 172. [Google Scholar] [CrossRef]
Ding, Y.; Chong, Y.; Pan, S.; Zheng, C. Diversity-connected graph convolutional network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5518118. [Google Scholar] [CrossRef]
Yang, A.; Li, M.; Ding, Y.; Hong, D.; Lv, Y.; He, Y. GTFN: GCN and transformer fusion network with spatial-spectral features for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 6600115. [Google Scholar] [CrossRef]
Khatun, Z.; Jónsson, H., Jr.; Tsirilaki, M.; Maffulli, N.; Oliva, F.; Daval, P.; Tortorella, F.; Gargiulo, P. Beyond pixel: Superpixel-based MRI segmentation through traditional machine learning and graph convolutional network. Comput. Methods Programs Biomed. 2024, 256, 108398. [Google Scholar] [CrossRef]
Zhao, H.; Zhou, F.; Bruzzone, L.; Guan, R.; Yang, C. Superpixel-level global and local similarity graph-based clustering for large hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5519316. [Google Scholar] [CrossRef]
Liu, Y.; Wang, X.; Jiang, B.; Chen, L.; Luo, B. SemanticFormer: Hyperspectral image classification via semantic transformer. Pattern Recognit. Lett. 2024, 179, 1–8. [Google Scholar] [CrossRef]
Zhang, H.; Zou, J.; Zhang, L. EMS-GCN: An end-to-end mixhop superpixel-based graph convolutional network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5526116. [Google Scholar] [CrossRef]
Yang, P.; Zhang, X. A dual-branch fusion of a graph convolutional network and a convolutional neural network for hyperspectral image classification. Sensors 2024, 24, 4760. [Google Scholar] [CrossRef]
Zhu, W.; Sun, X.; Zhang, Q. DCG-Net: Enhanced Hyperspectral Image Classification with Dual-Branch Convolutional Neural Network and Graph Convolutional Neural Network Integration. Electronics 2024, 13, 3271. [Google Scholar] [CrossRef]
Gao, L.; Xiao, S.; Hu, C.; Yan, Y. Hyperspectral image classification based on fusion of convolutional neural network and graph network. Appl. Sci. 2023, 13, 7143. [Google Scholar] [CrossRef]
Chen, H.; Long, H.; Chen, T.; Song, Y.; Chen, H.; Zhou, X.; Deng, W. M³FuNet: An unsupervised multivariate feature fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5513015. [Google Scholar] [CrossRef]
Dong, Y.; Liu, Q.; Du, B.; Zhang, L. Weighted feature fusion of convolutional neural network and graph attention network for hyperspectral image classification. IEEE Trans. Image Process. 2022, 31, 1559–1572. [Google Scholar] [CrossRef] [PubMed]
Tu, B.; Ren, Q.; Li, Q.; He, W.; He, W. Hyperspectral image classification using a superpixel–pixel–subpixel multilevel network. IEEE Trans. Instrum. Meas. 2023, 72, 5013616. [Google Scholar] [CrossRef]
Wang, B.; Cao, C.; Kong, D. SGFNet: Redundancy-Reduced Spectral–Spatial Fusion Network for Hyperspectral Image Classification. Entropy 2025, 27, 995. [Google Scholar] [CrossRef]
Zang, C.; Song, G.; Li, L.; Zhao, G.; Lu, W.; Jiang, G.; Sun, Q. DB-MFENet: A Dual-Branch Multi-Frequency Feature Enhancement Network for Hyperspectral Image Classification. Remote Sens. 2025, 17, 1458. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
Liu, Q.; Xiao, L.; Yang, J.; Wei, Z. Multilevel superpixel structured graph U-Nets for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5516115. [Google Scholar] [CrossRef]

Figure 1. The architecture of the proposed MCGNet.

Figure 2. The architecture of the proposed MCGNet.

Figure 3. Classification results on the Indian Pines dataset. (a) CNN; (b) GCN; (c) CEGCN; (d) MSSGU; (e) SSSTNet; (f) EMS-GCN; (g) OURS. Different colors represent different land-cover classes.

Figure 4. Classification results on Pavia University. (a) CNN; (b) GCN; (c) CEGCN; (d) MSSGU; (e) SSSTNet; (f) EMS-GCN; (g) OURS. Different colors represent different land-cover classes.

Figure 5. Classification results on Salinas. (a) CNN; (b) GCN; (c) CEGCN; (d) MSSGU; (e) SSSTNet; (f) EMS-GCN; (g) OURS. Different colors represent different land-cover classes.

Figure 6. Detailed Performance Analysis Across Multiple Metrics.

Figure 7. Visualization results of feature fusion on three datasets: (a–d) Indian Pines, (e–h) Pavia University, and (i–l) Salinas. (a,e,i) LSE baseline; (b,f,j) LSE+SGC; (c,g,k) LSE+PGC; (d,h,l) full MCGNet.

Table 1. Classification information for the Indian Pines dataset.

No.	Class	TRAIN.	VAL.	TEST
1	Alfalfa	1	1	45
2	Corn-notill	14	14	1401
3	Corn-mintill	8	8	815
4	Corn	2	2	233
5	Grass-pasture	5	5	473
6	Grass-trees	7	7	716
7	Grass-pasture-mowed	1	1	26
8	Hay-windrowed	5	5	468
9	Oats	1	1	18
10	Soybean-notill	10	10	952
11	Soybean-mintill	25	25	2405
12	Soybean-clean	6	6	581
13	Wheat	2	2	201
14	Woods	13	13	1239
15	Buildings-Grass-Trees-Drives	4	4	378
16	Stone-Steel-Towers	1	1	91
	Total	105	105	10,042

Table 2. Classification information for the Pavia University dataset.

No.	Class	TRAIN.	VAL.	TEST
1	Asphalt	7	66	6565
2	Meadows	18	186	18,464
3	Gravel	2	21	2078
4	Trees	3	31	3033
5	Painted metal sheets	1	13	1332
6	Bare Soil	5	50	4979
7	Bitumen	1	13	1317
8	Self-Blocking Bricks	4	37	3645
9	Shadows	1	9	938
	Total	42	426	42,351

Table 3. Classification information for the Salinas dataset.

No.	Class	TRAIN.	VAL.	TEST
1	Brocoli_green_weeds_1	2	20	1987
2	Brocoli_green_weeds_2	4	37	3685
3	Fallow	2	20	1954
4	Fallow_rough_plow	1	14	1379
5	Fallow_smooth	3	27	2648
6	Stubble	4	40	3915
7	Celery	4	36	3539
8	Grapes_untrained	11	113	11,147
9	Soil_vinyard_develop	6	62	6135
10	Corn_senesced_green_weeds	3	33	3242
11	Lettuce_romaine_4wk	1	11	1056
12	Lettuce_romaine_5wk	2	19	1906
13	Lettuce_romaine_6wk	1	9	906
14	Lettuce_romaine_7wk	1	11	1058
15	Vinyard_untrained	1	7	718
16	Vinyard_vertical_trellis	2	18	1786
	Total	48	477	47,061

Table 4. Dataset characteristics and class imbalance statistics.

Dataset	$N_{\max}$	$N_{\min}$	Imbalance Ratio (IR)
Indiana Pines	2455	20	122.75
Pavia University	18,668	948	19.69
Salinas	11,271	726	15.53

Table 5. Classification accuracies obtained by different methods for the IP dataset. Bold values indicate the best performance for each class or metric.

Class	CNN	GCN	CEGCN [24]	MSSGU [49]	SSSTNet [38]	EMS-GCN [39]	OURS
1	15.27 ± 11.32	0 ± 0	11.22 ± 11.43	13.63 ± 8.25	4.93 ± 3.83	26.96 ± 21.77	18.33 ± 14.24
2	81.37 ± 6.07	72.57 ± 19.29	86.39 ± 4.24	63.96 ± 6.78	64.29 ± 18.32	87.22 ± 5.99	85.15 ± 5.06
3	39.64 ± 19.79	42.98 ± 16.67	48.27 ± 17.49	27.73 ± 3.80	24.83 ± 6.53	74.81 ± 4.64	72.85 ± 13.90
4	18.12 ± 11.66	47.76 ± 32.41	20.54 ± 15.53	15.23 ± 7.25	11.64 ± 9.82	53.51 ± 28.76	65.27 ± 11.91
5	70.85 ± 8.64	59.28 ± 30.59	66.49 ± 13.34	58.56 ± 7.31	40.80 ± 18.37	67.56 ± 19.04	78.75 ± 11.51
6	98.21 ± 1.25	64.12 ± 35.17	96.87 ± 2.39	83.75 ± 6.43	96.42 ± 1.94	96.56 ± 1.65	98.15 ± 0.72
7	35.64 ± 19.57	0 ± 0	50.48 ± 20.16	50.76 ± 26.35	12.70 ± 15.12	72.66 ± 27.47	85.07 ± 22.95
8	94.07 ± 6.06	97.55 ± 4.89	93.43 ± 7.09	94.91 ± 4.00	79.37 ± 15.45	99.54 ± 0.65	100.00 ± 0.00
9	5.26 ± 8.15	13.68 ± 27.36	7.42 ± 7.12	64.44 ± 15.94	7.42 ± 7.12	20.17 ± 19.48	21.11 ± 12.82
10	68.22 ± 7.33	75.34 ± 13.99	65.64 ± 11.51	62.33 ± 7.16	48.6 ± 16.49	79.17 ± 7.07	77.96 ± 9.93
11	88.04 ± 4.90	82.38 ± 11.81	87.07 ± 4.64	75.16 ± 9.44	94.38 ± 2.76	91.98 ± 8.75	91.85 ± 7.45
12	45.11 ± 18.54	92.61 ± 11.17	48.04 ± 17.47	21.06 ± 6.03	27.86 ± 11.49	70.89 ± 13.79	64.29 ± 17.33
13	95.51 ± 3.99	76.71 ± 38.72	94.32 ± 5.05	77.78 ± 18.06	90.61 ± 6.75	98.95 ± 1.46	99.70 ± 0.39
14	99.80 ± 0.13	92.06 ± 9.78	99.69 ± 0.32	94.85 ± 0.32	99.24 ± 0.89	99.38 ± 0.92	99.93 ± 0.07
15	44.83 ± 23.77	68.88 ± 26.63	47.68 ± 29.05	30.00 ± 12.73	28.44 ± 15.54	67.56 ± 12.76	76.10 ± 19.63
16	19.89 ± 7.18	0 ± 0	17.06 ± 10.62	16.26 ± 13.39	21.72 ± 31.32	34.33 ± 20.91	59.62 ± 19.37
OA (%)	76.16 ± 2.73	74.36 ± 4.39	77.04 ± 2.13	64.67 ± 3.05	67.99 ± 5.19	85.28 ± 1.52	85.87 ± 1.68
AA (%)	57.49 ± 3.92	55.37 ± 5.40	58.79 ± 4.19	53.15 ± 3.17	47.08 ± 6.08	71.33 ± 3.00	74.63 ± 2.37
Kappa	72.26 ± 3.43	70.90 ± 4.87	73.38 ± 2.67	59.24 ± 3.31	62.15 ± 6.38	83.14 ± 1.63	83.81 ± 1.87
F1 (%)	74.41 ± 3.02	72.38 ± 4.26	75.22 ± 2.64	63.55 ± 3.21	66.21 ± 4.95	84.73 ± 1.84	85.06 ± 2.77

Table 6. Classification accuracies obtained by different methods for PU dataset. Bold values indicate the best performance for each class or metric.

Class	CNN	GCN	CEGCN	MSSGU	SSSTNet	EMS-GCN	OURS
1	97.02 ± 2.65	62.29 ± 7.60	95.58 ± 2.57	94.16 ± 4.92	58.77 ± 8.85	98.65 ± 2.57	93.41 ± 5.60
2	94.74 ± 2.97	93.48 ± 2.78	97.35 ± 1.10	95.14 ± 3.07	99.27 ± 7.91	98.21 ± 0.92	96.07 ± 2.32
3	53.44 ± 2.19	93.86 ± 5.07	88.46 ± 6.39	65.86 ± 14.44	20.00 ± 17.66	84.41 ± 1.10	80.05 ± 8.81
4	78.93 ± 4.84	38.15 ± 13.67	78.31 ± 10.93	88.61 ± 4.76	21.87 ± 7.82	93.45 ± 6.39	84.91 ± 9.25
5	98.96 ± 1.04	98.25 ± 2.04	99.95 ± 0.09	99.90 ± 0.19	66.17 ± 11.91	97.80 ± 10.93	99.92 ± 0.15
6	55.04 ± 11.59	90.05 ± 16.40	93.16 ± 10.63	92.96 ± 9.13	25.82 ± 10.67	96.75 ± 0.09	89.15 ± 11.78
7	54.72 ± 30.80	86.63 ± 6.30	93.42 ± 4.80	87.50 ± 14.23	16.08 ± 11.76	88.83 ± 9.63	95.46 ± 4.01
8	68.02 ± 30.52	82.34 ± 10.75	72.06 ± 17.32	81.83 ± 20.50	37.91 ± 23.80	75.06 ± 4.80	83.74 ± 7.58
9	96.29 ± 4.43	9.89 ± 6.70	53.95 ± 27.33	81.70 ± 9.52	10.35 ± 10.29	63.00 ± 27.33	83.89 ± 8.87
OA (%)	83.90 ± 3.66	81.42 ± 2.40	91.61 ± 2.24	91.30 ± 3.15	64.06 ± 2.91	91.19 ± 2.45	92.03 ± 1.95
AA (%)	77.46 ± 6.96	72.77 ± 2.03	85.80 ± 4.00	87.52 ± 4.48	39.58 ± 5.21	88.36 ± 2.25	89.62 ± 3.25
Kappa	78.16 ± 5.20	75.39 ± 3.22	88.81 ± 3.14	88.48 ± 4.20	44.81 ± 5.55	87.29 ± 2.74	89.41 ± 2.68
F1 (%)	82.47 ± 3.91	80.15 ± 3.18	90.26 ± 2.71	89.87 ± 3.58	62.14 ± 4.96	90.84 ± 2.53	91.95 ± 2.38

Table 7. Classification accuracies obtained by different methods for the SA dataset. Bold values indicate the best performance for each class or metric.

Class	CNN	GCN	CEGCN	MSSGU	SSSTNet	EMS-GCN	OURS
1	98.18 ± 1.48	98.09 ± 3.73	98.83 ± 1.73	83.87 ± 18.95	99.26 ± 0.50	99.65 ± 0.68	94.19 ± 10.75
2	100.00 ± 0.00	100.00 ± 0.00	100.00 ± 0.00	97.04 ± 3.29	99.95 ± 0.09	100.00 ± 0.00	100.00 ± 0.00
3	82.20 ± 17.40	93.68 ± 12.50	90.75 ± 6.11	78.92 ± 16.93	69.89 ± 13.68	84.41 ± 17.37	87.88 ± 12.78
4	99.33 ± 0.58	75.80 ± 20.49	99.07 ± 1.15	98.55 ± 1.02	96.59 ± 1.94	93.45 ± 8.43	99.55 ± 0.30
5	95.92 ± 6.16	86.18 ± 8.34	96.86 ± 3.54	91.10 ± 2.56	99.13 ± 0.77	97.80 ± 2.60	96.30 ± 5.89
6	99.87 ± 0.24	96.40 ± 4.19	99.90 ± 0.07	97.13 ± 2.68	100.00 ± 0.00	99.75 ± 0.32	99.93 ± 0.05
7	98.52 ± 2.92	99.70 ± 0.44	99.99 ± 0.01	96.26 ± 4.71	99.91 ± 0.06	99.83 ± 0.23	99.94 ± 0.09
8	88.60 ± 5.26	98.02 ± 0.73	92.76 ± 0.01	70.79 ± 11.45	96.32 ± 1.74	89.78 ± 5.47	92.98 ± 2.07
9	99.98 ± 0.01	99.69 ± 0.60	100.00 ± 0.00	96.90 ± 2.43	99.79 ± 0.26	100.00 ± 0.00	100.00 ± 0.00
10	82.01 ± 10.68	87.32 ± 13.08	90.36 ± 12.15	67.19 ± 12.12	84.51 ± 12.34	82.42 ± 14.88	90.05 ± 9.43
11	93.27 ± 6.42	91.92 ± 9.88	99.10 ± 1.15	68.26 ± 24.87	85.17 ± 24.42	98.10 ± 2.46	99.41 ± 1.17
12	93.42 ± 11.85	86.10 ± 8.21	99.10 ± 1.11	87.41 ± 20.98	94.50 ± 7.85	98.45 ± 2.25	99.46 ± 3.06
13	84.08 ± 24.89	59.48 ± 19.07	80.02 ± 32.91	77.49 ± 19.90	79.05 ± 27.03	88.50 ± 8.16	94.35 ± 6.08
14	98.65 ± 0.85	91.97 ± 9.17	98.41 ± 0.73	97.94 ± 0.94	97.83 ± 1.14	98.81 ± 0.53	99.54 ± 0.55
15	66.64 ± 11.49	93.79 ± 2.67	82.95 ± 10.36	67.37 ± 14.96	27.43 ± 18.83	78.43 ± 13.80	87.52 ± 4.63
16	71.72 ± 13.74	99.46 ± 1.07	79.48 ± 27.36	57.09 ± 32.60	74.47 ± 26.02	84.45 ± 3.46	85.78 ± 5.16
OA (%)	89.40 ± 2.02	94.54 ± 0.79	93.94 ± 1.60	81.76 ± 3.68	85.50 ± 3.90	92.19 ± 2.45	94.75 ± 0.48
AA (%)	90.77 ± 3.99	91.10 ± 2.30	94.22 ± 3.66	83.33 ± 2.95	87.74 ± 5.47	93.36 ± 2.25	95.37 ± 0.99
Kappa	88.15 ± 2.30	93.92 ± 0.88	93.24 ± 1.79	79.66 ± 4.09	83.68 ± 4.46	91.29 ± 2.74	94.15 ± 0.54
F1 (%)	89.02 ± 2.13	93.80 ± 0.91	93.55 ± 1.64	80.93 ± 3.88	84.62 ± 4.12	91.07 ± 2.56	94.81 ± 0.50

Table 8. Computational efficiency comparison of models.

Model	Training Time (s)	Testing Time (s)	Parameter Quantity (M)	FLOPs (G)
CNN	9.67	0.86	0.19	1.62
GCN	8.50	0.80	0.19	1.62
CEGCN	9.41	0.80	0.19	1.62
MSSGU	16.44	1.46	0.13	1.38
SeFormer	12.44	0.88	0.21	1.63
EMS-GCN	30.32	1.51	0.19	1.63
MFSGCN (Ours)	12.31	0.62	0.13	1.63

Table 9. Ablation performance comparison of MCGNet modules on three datasets. Bold values denote the best performance. Checkmark (✓) indicates the module is enabled, and cross (×) indicates it is disabled.

Method			IP			PU			SA
Symmetry	Self-Attn	Class	OA	AA	Kappa	OA	AA	Kappa	OA	AA	Kappa
✓	×	LSE-Only	74.14 ± 2.01	54.07 ± 3.01	69.84 ± 2.48	84.27 ± 4.13	78.43 ± 8.65	78.82 ± 5.72	88.77 ± 2.09	89.67 ± 4.07	87.45 ± 2.37
✓	×	PGC-Only	83.36 ± 1.12	64.52 ± 2.60	80.21 ± 2.30	88.19 ± 4.06	86.42 ± 4.23	84.38 ± 5.30	91.74 ± 2.18	90.09 ± 3.31	90.81 ± 2.41
✓	✓	SGC-Only	82.21 ± 1.60	70.52 ± 4.02	83.53 ± 2.38	84.35 ± 1.32	74.89 ± 1.18	79.25 ± 1.79	94.10 ± 1.40	91.32 ± 3.22	90.43 ± 1.56
✓	✓	LSE + SGC	84.59 ± 1.71	70.62 ± 1.64	82.38 ± 1.95	90.64 ± 2.25	86.21 ± 4.81	87.58 ± 3.04	93.69 ± 1.07	94.86 ± 0.92	93.20 ± 1.19
✓	×	LSE + PGC	83.60 ± 3.98	62.96 ± 3.72	75.57 ± 1.10	83.60 ± 3.98	78.11 ± 7.52	77.98 ± 5.62	90.72 ± 1.25	91.82 ± 1.88	89.66 ± 1.41
×	✓	PGC + SGC	84.95 ± 1.25	73.74 ± 2.59	82.58 ± 3.66	90.37 ± 2.12	85.61 ± 3.55	87.23 ± 2.93	94.29 ± 0.83	95.01 ± 0.93	93.87 ± 0.93
✓	✓	MCGNet	85.87 ± 1.68	74.63 ± 2.37	83.81 ± 1.87	92.03 ± 1.95	89.62 ± 3.25	89.41 ± 2.68	94.75 ± 0.48	95.37 ± 0.99	94.15 ± 0.54

Table 10. Impact of superpixel segmentation scale on classification performance. Bold values indicate the best overall accuracy (OA).

Scale	IP (OA%)	PU (OA%)	SA (OA%)
25	84.23 ± 1.85	90.45 ± 2.12	93.21 ± 0.65
50	85.12 ± 1.72	91.23 ± 1.98	94.08 ± 0.58
100	85.65 ± 1.68	92.03 ± 1.95	94.75 ± 0.48
150	85.78 ± 1.71	91.87 ± 2.01	94.52 ± 0.51
200	85.87 ± 1.68	91.56 ± 2.08	94.28 ± 0.55
250	85.34 ± 1.76	91.12 ± 2.15	93.89 ± 0.62
300	84.91 ± 1.82	90.67 ± 2.23	93.45 ± 0.68

Table 11. Time and memory complexity comparison among representative models.

Model	Time Complexity	Memory Complexity
CNN	$O (N \cdot k^{2} \cdot C^{2})$	$O (N C)$
GCN	$O (E \cdot C^{2})$	$O (N + E)$
CEGCN	$O (N \cdot k^{2} \cdot C^{2} + E \cdot C^{2})$	$O (N + E)$
MCGNet (ours)	$O (N \cdot k^{2} \cdot C + N_{s}^{2} \cdot C + K \cdot N \cdot C)$	$O (N + N_{s} + K)$
MSSGU	$O (N \cdot L \cdot C^{2})$	$O (N L)$
EMS-GCN	$O (E_{s p} \cdot C^{2})$	$O (N + E_{s p})$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, Y.; Wang, J.; You, Z.; Li, X. A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification. Symmetry 2025, 17, 1930. https://doi.org/10.3390/sym17111930

AMA Style

Xu Y, Wang J, You Z, Li X. A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification. Symmetry. 2025; 17(11):1930. https://doi.org/10.3390/sym17111930

Chicago/Turabian Style

Xu, Yaoqun, Junyi Wang, Zelong You, and Xin Li. 2025. "A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification" Symmetry 17, no. 11: 1930. https://doi.org/10.3390/sym17111930

APA Style

Xu, Y., Wang, J., You, Z., & Li, X. (2025). A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification. Symmetry, 17(11), 1930. https://doi.org/10.3390/sym17111930

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Symmetric Multiscale Feature Fusion Architecture Based on CNN and GNN for Hyperspectral Image Classification

Abstract

1. Introduction

2. Proposed Method

2.1. Spectral Noise Suppression and Local Spectral-Spatial Encoding

2.1.1. SNS

2.1.2. LSE

2.2. Regional and Pixel-Level Relationship Inference

2.2.1. SGC

2.2.2. PGC

2.3. Fusion and Final Classification

2.4. Theoretical Explanation of Symmetric Euclidean–Non-Euclidean Feature Fusion

3. Experiments

3.1. Hyperspecteral Data Sets

3.2. Experiment Setting

3.2.1. Baseline

3.2.2. Model Setup

3.2.3. Evaluation Metrics

3.3. Comparison of Classification Performance

3.3.1. Experimental Results

3.3.2. Visualization of Dataset Classification

3.3.3. Comparison of Training and Testing Time

4. Ablation Studies

4.1. Comparison and Analysis of Results

4.2. Effectiveness of Different Modules in MCGNet

4.3. Parameter Sensitivity Analysis

4.3.1. Hyperparameter Sensitivity Analysis

4.3.2. Computational Complexity Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI