1. Introduction
In recent years, research has uncovered extensive applications for hyperspectral images (HSIs) in diverse fields, such as land management [1,2,3], resource exploration [4,5,6], urban rescue [7,8], military investigation [9,10], and agricultural production [11,12]. This is mainly attributed to the abundance of spatial and spectral information available in HSIs [13,14]. Owing to this applicability, HSI classification has attracted considerable attention. HSI classification assigns a class label to each pixel, identifying the land cover that the pixel represents [15].
Researchers have made numerous attempts to achieve more accurate land cover classification. Over the preceding decades, the field of HSI classification has incorporated machine learning techniques. Classical machine learning methods, such as K-nearest neighbors [16], logistic regression [17], local binary patterns (LBP) [18,19], Gabor filters [20], and random forests [21], have been extensively applied to HSI classification and can achieve satisfactory results under ideal conditions; however, these conventional approaches depend heavily on manual feature design, which is constrained by expert knowledge and the parameter-setting stage [22,23].
In contrast, deep-learning (DL) methods have become widely used in HSI classification because they automatically learn deep, adaptive features from training data [24,25]. A wide range of state-of-the-art DL techniques has been successfully employed in HSI classification. For instance, convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks [26], and stacked autoencoders (SAEs) have been proposed as effective approaches to learning the intricate high-dimensional features of HSIs. Among these models, CNNs [27,28,29] have emerged as the predominant method for extracting spectral–spatial features from HSIs [30].
CNNs can capture spatial and spectral information by leveraging local connectivity and weight sharing, and researchers have proposed CNN variants, such as 1D–3D CNNs and hybrid CNNs, to augment the learning of spectral–spatial features. Three-dimensional CNNs, for example, are effective in extracting deep, joint spectral–spatial features [31], while hybrid CNNs can reduce model complexity and perform well under noise and limited training samples. In addition, dual-branch CNNs [32] have proven an effective approach to extracting spectral–spatial features. Researchers have also introduced techniques such as residual and dense connectivity to increase network depth and achieve higher performance in HSI classification. Liang et al. [33] proposed MDRN (multi-scale DenseNet, bidirectional recurrent neural network, and attention mechanism network), a novel spectral–spatial classification framework. MDRN efficiently extracts multi-scale and intricate spatial structural features while capturing internal spectral correlations within continuous spectral data.
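To make the 3-D convolution mechanism concrete, the following is a minimal PyTorch sketch of a block that convolves the spectral and spatial dimensions of an HSI patch jointly. All layer widths and kernel sizes here are illustrative assumptions, not the configurations of the works cited above.

```python
import torch
import torch.nn as nn

class SpectralSpatialBlock(nn.Module):
    """Minimal 3-D convolutional block for HSI patches.

    Input: (batch, 1, bands, height, width). The band axis is treated as a
    third spatial dimension, so each kernel spans a small spectral window
    and a small spatial window simultaneously.
    """
    def __init__(self, n_classes: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            # Kernel (7, 3, 3): 7 adjacent bands x 3x3 spatial neighborhood.
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8),
            nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)        # (B, 16, bands, H, W)
        f = f.mean(dim=(2, 3, 4))   # global average pooling over all axes
        return self.classifier(f)

# Example: classify 9x9 patches of a hypothetical 103-band image.
x = torch.randn(4, 1, 103, 9, 9)
logits = SpectralSpatialBlock()(x)  # (4, 9)
```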
In spite of this, the high computational complexity of these increasingly deep CNNs demands greater computational power and longer training times. Researchers have also explored other advanced CNN architectures, such as spectral–spatial attention networks. For instance, Roy et al. [34] proposed an attention-based adaptive spectral–spatial kernel-improved residual network (A²S²K-ResNet) capable of capturing discriminative spectral–spatial features for HSI classification in an end-to-end manner. Sun et al. [35] introduced a spectral–spatial feature tokenization transformer (SSFTT) method, which effectively enhances classification performance by capturing spectral–spatial features and high-level semantic features. Due to the limited receptive fields of CNNs, significant challenges remain in processing large-scale semantic information while limiting the loss of fine-scale detail and deeply fusing spectral–spatial features.
While CNN models have shown promising results in HSI classification using iterative, backpropagation-based supervised learning, they remain constrained by limitations. For example, CNN models are designed for Euclidean data with regular spatial structure, often overlooking the inherent correlations between adjacent land covers [36]. Graph convolutional networks (GCNs) [37] have garnered significant attention due to their ability to perform convolution operations on arbitrary graph structures [38]. By encoding an HSI as a graph, the intrinsic correlations between adjacent land covers can be explicitly leveraged, so GCNs can better model the spatial context structure of HSIs. For example, Qin et al. [39] introduced a semi-supervised GCN-based method that leverages spectral similarity and spatial distance to propagate information between adjacent pixels; however, because HSIs contain a large number of pixels, treating each pixel as a graph node incurs prohibitive computational costs, limiting the method's practical applicability. Wan et al. [40] proposed replacing individual pixels with superpixels as nodes, significantly reducing the node count in the graph and rendering GCNs more feasible in practice. Superpixels effectively describe land cover properties (such as shape and size) and facilitate subsequent graph learning. Hong et al. [41] proposed miniGCN, which divides the entire graph into smaller blocks during training, enabling more efficient and effective training.
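For reference, these methods build on the standard layer-wise propagation rule of GCNs [37]. Writing $A$ for the adjacency matrix over the chosen nodes, $\tilde{A} = A + I$ for its self-loop-augmented version, and $\tilde{D}$ for the degree matrix of $\tilde{A}$, one GCN layer computes

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$

where $H^{(l)}$ holds the node features at layer $l$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity. Whether the nodes are pixels [39] or superpixels [40] changes only the size of $A$, which is precisely why superpixel graphs are far cheaper to process.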
CNNs and GCNs effectively extract pixel-level and superpixel-level features, respectively, and both are DL methods that excel at capturing deep features. Devising an effective fusion scheme to integrate the two is therefore crucial; however, directly incorporating existing fusion schemes into a hybrid network can lead to incompatible data structures. Additionally, the fusion network must strike a balance between the CNN and GCN subnetworks during training; insufficient training of either subnetwork can impede classification performance. To address these challenges, Liu et al. [42] introduced a unified network, the CNN-enhanced GCN (CEGCN), which seamlessly integrates CNN and GCN by incorporating graph-structure encoding and decoding mechanisms. Similarly, Dong et al. [43] empirically studied hybrid networks and proposed a weighted feature fusion network (WFCG), which effectively combines the advantages of graph attention networks and CNNs in capturing spectral–spatial information.
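One way such hybrids reconcile the two data structures (the graph-structure encoding and decoding in CEGCN [42] follows this general idea) is a pixel-to-superpixel assignment matrix that pools pixel-level CNN features into graph nodes and broadcasts node features back to pixels. The sketch below is a minimal illustration; the function names and the mean-pooling choice are our own assumptions rather than the exact CEGCN formulation.

```python
import torch

def build_assignment(segments: torch.Tensor, n_superpixels: int) -> torch.Tensor:
    """One-hot pixel->superpixel assignment Q, shape (n_pixels, n_superpixels).

    `segments` holds the superpixel index of each (flattened) pixel.
    """
    q = torch.zeros(segments.numel(), n_superpixels)
    q[torch.arange(segments.numel()), segments.flatten()] = 1.0
    return q

def encode(pixel_feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Average pixel features within each superpixel (graph 'encoder')."""
    counts = q.sum(dim=0, keepdim=True).clamp(min=1.0)  # pixels per superpixel
    return (q.t() @ pixel_feats) / counts.t()

def decode(node_feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Broadcast superpixel features back to their member pixels ('decoder')."""
    return q @ node_feats

# Toy example: 6 pixels, 2 superpixels, 4-dimensional CNN features.
segments = torch.tensor([0, 0, 1, 1, 1, 0])
q = build_assignment(segments, n_superpixels=2)
pixel_feats = torch.randn(6, 4)
node_feats = encode(pixel_feats, q)   # (2, 4): input to the GCN subnetwork
fused = torch.cat([pixel_feats, decode(node_feats, q)], dim=1)  # (6, 8)
```

Because encoding and decoding are plain matrix products, gradients flow through both subnetworks, which is what allows such hybrids to be trained end to end.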
In both CNNs and GCNs, the constrained receptive field of an individual convolutional layer limits the efficiency of information capture. Researchers have proposed various approaches to overcome this limitation. For example, Sharifi et al. [44] proposed multi-scale CNNs that use patches of varying sizes to capture intricate spatial features. Sun et al. [45] proposed a novel multi-scale weighted kernel network (MSWKNet) based on adaptive receptive fields to fully and adaptively explore multi-scale information in the spectral and spatial domains of HSIs. Xue et al. [46] introduced a network incorporating a multi-hop hierarchical GCN, which employs small kernels to extract node representations from k-hop graphs; the multi-hop graph systematically aggregates contextual information at multiple scales while avoiding redundant information. Yang et al. [47] proposed a dynamic multi-scale graph dialogue network (DMSGer) classifier that learns pixel representations using a superpixel segmentation algorithm and metric learning. Although researchers have attempted to extract multi-scale information, such attempts are typically confined to a single type of network, emphasizing either CNNs or GCNs. Crucial research tasks therefore remain: clarifying the correlation between distant features in HSI classification, enhancing the capacity to extract multi-scale information while preserving the pixel-level and superpixel-level feature-extraction benefits of hybrid networks, and avoiding the loss of spectral–spatial features caused by extracting multilayered contextual information.
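As a small illustration of the multi-hop construction, a k-hop adjacency over superpixel nodes can be derived from the 1-hop adjacency with matrix powers. The following NumPy sketch is a simplification under our own assumptions, not the construction of [46]; it keeps only node pairs at graph distance exactly k.

```python
import numpy as np

def k_hop_adjacency(adj: np.ndarray, k: int) -> np.ndarray:
    """Node pairs at graph distance exactly k (self and closer hops excluded).

    `adj` is a symmetric 0/1 adjacency matrix over superpixel nodes.
    """
    # Pairs connected by some walk of length exactly k.
    reach_k = np.linalg.matrix_power(adj.astype(int), k) > 0
    # Pairs already reachable within k-1 hops (including self-loops).
    reach_prev = np.eye(len(adj), dtype=bool)
    for i in range(1, k):
        reach_prev |= np.linalg.matrix_power(adj.astype(int), i) > 0
    return (reach_k & ~reach_prev).astype(int)

# Path graph 0-1-2-3: node 0 reaches node 2 in exactly 2 hops.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(k_hop_adjacency(A, 2))
```

Convolving over such k-hop graphs for several values of k is what lets a GCN aggregate context from progressively larger neighborhoods without stacking many layers.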
Based on CNNs and GCNs, this paper proposes a multi-scale pixel-level and superpixel-level method for HSI classification, abbreviated as MPAS. At the technical implementation level, we first design two 1×1 convolutional layers to process the original HSI data; the processed data are then fed separately into two branches. Branch one adopts a parallel multi-hop GCN (MGCN) and a normalization layer to extract multi-scale superpixel-level features. Branch two is the multi-scale hybrid spectral–spatial attention convolution branch (HSSAC), in which a multi-scale spectral–spatial CNN module (MSSC) extracts multi-scale spectral–spatial information and uses cross-path fusion to reduce the semantic information loss caused by fixed convolution kernels during feature extraction. This information is then passed to a spectral–spatial attention module (SSAM) for adaptive feature weighting. Finally, the features from both branches are concatenated for classification; a schematic sketch of this two-branch data flow follows the contribution list below. This paper makes contributions in three primary aspects, summarized as follows:
This study proposes a novel feature extraction framework, MPAS, based on MGCN, MSSC, and SSAM. It combines multi-scale pixel-level CNN features and superpixel-level GCN features to effectively capture both local and long-range contextual relationships. MPAS ensures high training and inference speed while maintaining excellent classification performance on HSIs.
To overcome the narrow receptive field of traditional GCNs, which makes it difficult to capture correlations between distant nodes in HSIs, we propose extracting superpixel features from neighboring nodes over large regions using multi-hop graphs. The network uses parallel multi-hop GCNs to improve the model's ability to perceive global structure.
We propose MSSC to build parallel structures and establish cross-path fusion, realizing the extraction, communication, and fusion of pixel-level information from convolutional kernels of different scales, thus reducing unnecessary information loss during convolution. Finally, we utilize the SSAM module to improve the model's feature representation while reducing computational cost.
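As referenced above, the following schematic PyTorch sketch mirrors the two-branch data flow of MPAS. The module bodies are simple placeholders standing in for MGCN, HSSAC, and SSAM, and all sizes are illustrative assumptions; it shows only how the 1×1 preprocessing, the superpixel-level branch, the pixel-level branch, and the final concatenation fit together.

```python
import torch
import torch.nn as nn

class TwoBranchSketch(nn.Module):
    """Schematic pipeline: 1x1 convs -> (graph branch, CNN branch) -> concat."""
    def __init__(self, n_bands: int = 103, d: int = 64, n_classes: int = 9):
        super().__init__()
        # Two 1x1 convolutions preprocess the raw HSI, one per branch.
        self.pre_graph = nn.Conv2d(n_bands, d, kernel_size=1)
        self.pre_cnn = nn.Conv2d(n_bands, d, kernel_size=1)
        # Placeholder for the superpixel-level branch (MGCN in the paper).
        self.graph_branch = nn.Linear(d, d)
        # Placeholder for the pixel-level branch (HSSAC in the paper):
        # parallel kernels of different sizes, outputs summed here for brevity.
        self.cnn_small = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.cnn_large = nn.Conv2d(d, d, kernel_size=5, padding=2)
        self.classifier = nn.Conv2d(2 * d, n_classes, kernel_size=1)

    def forward(self, hsi: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # hsi: (B, bands, H, W); q: (H*W, S) pixel->superpixel one-hot map.
        b, _, h, w = hsi.shape
        # Branch 1: pool pixels into superpixel nodes, transform, decode back.
        g = self.pre_graph(hsi).flatten(2).transpose(1, 2)       # (B, H*W, d)
        nodes = (q.t() @ g) / q.sum(0).clamp(min=1).unsqueeze(-1)  # (B, S, d)
        g_out = q @ self.graph_branch(nodes)                     # (B, H*W, d)
        g_out = g_out.transpose(1, 2).reshape(b, -1, h, w)
        # Branch 2: multi-scale pixel-level convolutions.
        c = self.pre_cnn(hsi)
        c_out = self.cnn_small(c) + self.cnn_large(c)
        # Concatenate both branches for per-pixel classification.
        return self.classifier(torch.cat([g_out, c_out], dim=1))

# Usage with random data: 8x8 image, 103 bands, 10 superpixels.
hsi = torch.randn(2, 103, 8, 8)
q = torch.zeros(64, 10)
q[torch.arange(64), torch.randint(0, 10, (64,))] = 1.0
out = TwoBranchSketch()(hsi, q)  # (2, 9, 8, 8) per-pixel class scores
```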