Adaptive Multi-Feature Fusion Graph Convolutional Network for Hyperspectral Image Classification

Abstract: Graph convolutional networks (GCNs) are a promising approach for addressing the necessity for long-range information in hyperspectral image (HSI) classification. Researchers have attempted to develop classification methods that combine strong generalization with effective classification. However, current GCN-based HSI classification methods present two main challenges. First, they overlook the multi-view features inherent in HSIs, whereas multi-view information interacts to facilitate classification tasks. Second, many algorithms perform a rudimentary fusion of extracted features, which can result in information redundancy and conflicts. To address these challenges and exploit the strengths of multiple features, this paper introduces an adaptive multi-feature fusion GCN (AMF-GCN) for HSI classification. Initially, the AMF-GCN algorithm extracts spectral and textural features from the HSIs and combines them to create fusion features. Subsequently, these three features are employed to construct separate graphs, which are then processed individually using multi-branch GCNs. The AMF-GCN aggregates node information and utilizes an attention-based feature fusion method to selectively incorporate valuable features. We evaluated the model on three widely used HSI datasets, i.e., Pavia University, Salinas, and Houston-2013, and achieved accuracies of 97.45%, 98.03%, and 93.02%, respectively. Extensive experimental results show that the classification performance of the AMF-GCN on benchmark HSI datasets is comparable to that of state-of-the-art methods.


Introduction
Hyperspectral images (HSIs) are characterized by abundant spectral and spatial information comprising hundreds of contiguous spectral bands [1]. The combination of image and spectral data in HSI, which is attributed to its distinctive characteristics [2], bestows a substantial capacity for information extraction. Consequently, HSIs have been widely employed across diverse domains, including military and civilian sectors, as well as in applications pertaining to agriculture, mining, aviation, and national defense [3][4][5][6].
Recently, the classification of HSIs has been investigated extensively for HSI processing. Nevertheless, the inherent complexity of HSI, coupled with the challenge of acquiring pertinent data attributes, engenders some issues, including substantial noise levels, high computational demands, high-dimensional intricacies, and complexities pertaining to classifier training [7]. Furthermore, the scarcity of adequate sample sizes augments the intricacy of HSI classification tasks [8,9].
Various HSI classification approaches have been developed in recent decades. Early methods primarily harnessed conventional machine learning (ML) techniques to categorize pixels based on the spectral data embedded within HSIs. Examples include K-nearest neighbor (KNN) classification [10], support vector machines (SVM) [11], and random forests [12]. However, when addressing HSIs characterized by intricate feature distributions, relying solely on spectral information can pose challenges in accurately discerning diverse ground features. Consequently, some researchers have introduced methodologies based on morphology to effectively amalgamate the spatial and spectral information within HSIs [13,14]. Similarly, techniques such as texture feature descriptors and Gabor filtering [15,16] have been employed to extract joint spatial-spectral information from HSIs.
However, many of these methods require the manual extraction of spatial-spectral features, thus rendering the quality of these features significantly reliant on expert judgment. In this regard, deep learning offers an elegant solution to HSI feature extraction [17][18][19]. Specifically, deep learning techniques can automatically derive abstract high-level representations by progressively aggregating low-level features, thereby eliminating the necessity for intricate feature engineering [20,21]. In the early stages of the adoption of deep learning, Chen et al. [22] pioneered the use of a stacked autoencoder to extract high-level features from HSIs. Subsequently, Mou et al. [23] employed a recurrent neural network model to address HSI classification challenges. In recent years, classification networks based on transformers have been investigated extensively; a transformer structure affords global feature extraction by establishing long-distance dependencies. For example, He et al. [24] were the first to apply a vision transformer to HSI classification, which resulted in an unexpectedly high classification accuracy. Hong et al. [25] established a new transformer backbone network based on spectral sequencing. However, the transformer model structure is relatively complex and requires significant amounts of computing resources and training data. As a simple and conventional model, convolutional neural networks (CNNs) have emerged as effective tools for HSI classification [26,27]. CNN-based approaches outperform conventional SVM methods in terms of classification performance [28]. For instance, Makantasis et al. [29] utilized a CNN model to encode spatial and spectral information simultaneously within an HSI by employing a multi-layer perceptron for pixel classification. Similarly, Zhang et al. [30] introduced a multi-dimensional CNN to automatically extract multi-level spatial-spectral features. Lee et al. [31] established an innovative contextual deep CNN model that harnessed the spatial-spectral relationships among neighboring pixels to capture optimal contextual information. Zhu et al. [32] used residual connections to enable low-level information to participate directly in high-order convolution operations, which alleviated the degradation in classification accuracy as the number of network layers increased. However, this method continuously extracts spectral-spatial information, and the image's texture content may not be captured. Despite the excellent performance demonstrated by CNNs, a few limitations remain. In conventional CNN models, the convolution kernels typically operate on regular objects within square areas. Consequently, these models cannot adaptively capture geometric variations among different feature blocks within an HSI [33]. Moreover, conventional CNN models cannot directly model long-distance spatial relationships in spectral images [34], which constrains their representation capabilities [35,36].
To address the inherent limitations of CNNs in HSI classification tasks, researchers have adopted a novel deep learning model known as the graph convolutional network (GCN) [37][38][39][40]. GCNs dynamically enhance node representations primarily by assimilating insights from neighboring nodes, where the graph convolution operation is adaptively guided by the underlying graph's structural properties [41,42]. Consequently, GCNs exhibit compatibility with irregular data featuring non-Euclidean structures, thus enabling them to effectively capture irregular class boundaries within HSIs. Furthermore, owing to an appropriately designed graph structure, GCNs can directly capture pixel-to-pixel distances and model spatial relationships.
Exploiting the strengths of GCNs, Qin et al. [38] pioneered the incorporation of both spectral and spatial information into the graph convolution process, although at the expense of increased computational complexity. He et al. introduced a two-branch GCN approach [43] in which the first branch extracts sample-specific features and the second branch engages in label distribution learning. Hong et al. [44] used a GCN to process irregular HSI data and implemented network training in small batches to effectively mitigate the substantial computational demands associated with conventional GCNs. Liu et al. [45] devised a CNN-enhanced graph convolutional network (CEGCN) to address issues arising from the incongruity between the representation structures of CNNs and GCNs. To dynamically adapt to the unique graph structure of HSIs, Yang et al. [46] introduced a deep graph network equipped with an adaptive graph structure, which yielded favorable classification results. Wang et al. [47] pioneered the development of a graph attention network that seamlessly integrated the attention mechanism to adaptively capture spatial and spectral feature information. Yao et al. [48] developed a dual-branch deep hybrid multi-GCN customized for HSI classification by proficiently applying spectral and autoregressive filters to extract spectral features while suppressing graph-related noise. Finally, Bai et al. [49] formulated a multitiered graph-learning network for HSI classification designed to reveal contextual information in HSIs by seamlessly learning both local and global graph structures in an end-to-end manner.
However, existing graph-convolution-based HSI classification methods present two problems. On the one hand, multi-feature fusion can indeed improve HSI classification accuracy [18,50,51], but few algorithms consider combining the multi-view information of HSIs. For instance, the multi-scale graph sample and aggregate network with context-aware learning (MSAGE-Cal) [52] integrates multi-scale and global information from the graph. Meanwhile, multilevel superpixel structured graph U-Nets (MSSGU) [53] create multilevel graphs by merging adjacent regions in HSIs, capturing spatial topologies in a multi-scale hierarchical manner. However, these methods only consider information mining from a single modality, without regard for the complementary value of multi-view data, i.e., textural information. Textural information can complement spatial information to facilitate more accurate classification [54]. On the other hand, these algorithms extract multiple features and then fuse them coarsely, thus retaining redundant information, which may cause conflicts and impair classification. In this study, we introduce an adaptive multi-feature fusion GCN to fully exploit the potential of multi-view data. First, we employ a rotation-invariant uniform local binary pattern (RULBP) to extract textural features, which are then combined with spectral features. This fusion of two distinct feature sets results in a multi-feature representation. To further enhance the integration of information from different views while simultaneously eliminating redundancy, we precisely combine multiple features in an adaptive fusion mechanism based on the attention mechanism, following the feature aggregation of the GCN. The main contributions of this study are as follows.

• This study introduces a novel framework for HSI classification, referred to as the AMF-GCN, which focuses on extracting and adaptively fusing multi-branch features.

• Spectral and textural features are extracted and fused to construct multiple graphs for multi-branch GCNs, and an attention-based adaptive feature fusion method is utilized to eliminate redundant features.

• Extensive experimental results on three benchmark datasets demonstrate the effectiveness and superiority of the AMF-GCN over its competitors.

GCNs
The essential purpose of GCNs is to extract the spatial features of topological graphs. The convolution process on a graph can be regarded as the process of transmitting messages across an entire HSI, which can be separated into two aspects: feature aggregation and feature transformation. In feature aggregation, each node combines its own features with those of its neighboring nodes. The graph structure can be formally defined as G = (V, E), where V denotes the set of nodes and E represents the set of edges. This structure is typically represented using an adjacency matrix A and a degree matrix D. Specifically, A ∈ R^{N×N} encodes the relationships among the pixels within the HSI, where N denotes the total number of nodes. In the case of an undirected graph, A assumes the form of a symmetric square matrix, with its elements being either 0 or 1. A value of 1 signifies the presence of an edge connecting two nodes, whereas a value of 0 indicates its absence. The degree matrix D assumes a diagonal configuration: its diagonal elements correspond to the degrees of the individual vertices, i.e., the number of edges associated with each node.
In spectral graph analysis, the fundamental operator is the graph Laplacian matrix, which is a symmetric positive semidefinite matrix. Owing to the attributes of this symmetric matrix, its n eigenvectors are linearly independent and form a complete set of orthonormal bases within an n-dimensional space. The graph Laplacian matrix can be represented as L = D − A, and its symmetrically normalized version is formally expressed as shown in Equation (1):

L_{sym} = D^{-1/2} L D^{-1/2} = I_N - D^{-1/2} A D^{-1/2}. (1)
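As a concrete illustration, the Laplacian construction above can be sketched in a few lines of NumPy; this is a minimal sketch using a hypothetical 4-node path graph, not data from the paper:

```python
import numpy as np

# Toy undirected graph: a 4-node path 0-1-2-3 (hypothetical example data).
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)                      # node degrees
D = np.diag(deg)
L = D - A                                # combinatorial Laplacian, L = D - A
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt      # symmetrically normalized Laplacian, Eq. (1)

eigvals = np.linalg.eigvalsh(L_sym)      # real spectrum, since L_sym is symmetric
```

Consistent with the positive-semidefiniteness noted above, the eigenvalues of `L_sym` are non-negative (and, for a connected graph, the smallest is exactly zero).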
The graph Fourier transform employs the eigenvectors of the Laplacian matrix as its basis functions. It expresses the features of the nodes as linear combinations of these basis functions, thus effectively transforming the convolution operation into a product involving the coefficients of these basis functions. The convolution formula for the graph is expressed as shown in Equation (2):

g_θ ⋆ x = U g(Λ) U^T x, (2)
where U represents an orthogonal matrix whose column vectors are the eigenvectors of the symmetrically normalized Laplacian matrix, Λ is the diagonal matrix of its eigenvalues, and g(Λ) is a diagonal matrix composed of the parameters θ to be learned. The equation above is the general form of spectral graph convolution; however, implementing Equation (2) is computationally intensive because the complexity of working with the eigenvector matrix U is O(N^2). Equation (3) can be obtained via a truncated fit with the Chebyshev polynomials T_k(x).
To reduce the computational cost of the HSI graph convolution, the Chebyshev polynomials T_k(x) truncated at order K [55] are used to approximate Equation (2). The resulting formula is:

g_θ ⋆ x ≈ Σ_{k=0}^{K} θ_k T_k(L̃) x, (3)

where L̃ = (2/λ_max) L − I_N, λ_max is the maximum eigenvalue of L, and θ_k is the Chebyshev coefficient vector. To further reduce the amount of computation, Kipf et al. [56] set K = 1 and approximated λ_max ≈ 2; consequently, Equation (4) is derived:

g_θ ⋆ x ≈ θ (I_N + D^{-1/2} A D^{-1/2}) x. (4)
In addition, to learn the features of the nodes and alleviate the vanishing-gradient problem in the multi-layer graph convolution process [57], self-normalization (the renormalization trick) is introduced: substituting I_N + D^{-1/2} A D^{-1/2} with D̃^{-1/2} Ã D̃^{-1/2} yields Equation (5),

Z = σ( D̃^{-1/2} Ã D̃^{-1/2} X Θ ), (5)

where Ã = A + I_N, D̃_{ii} = Σ_j Ã_{ij}, and σ is the activation function. Generalized to multiple layers, the propagation rule becomes

H^{(l+1)} = σ( D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)} ), (6)

where H^{(l)} denotes the input of layer l, H^{(l+1)} represents the output of layer l, and W^{(l)} represents the weight parameters to be learned. Equation (6) is the typically used graph convolution formula. The graph convolution layer realizes the transfer of neighborhood relationships by continuously aggregating adjacent nodes.
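The layer-wise propagation rule of Equation (6) can be sketched directly in NumPy; this is a minimal dense-matrix sketch assuming ReLU as the activation σ (real HSI graphs would use sparse operations and learned weights):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: sigma(D~^-1/2 A~ D~^-1/2 H W), Eq. (6)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                       # add self-loops: A~ = A + I_N
    d = A_tilde.sum(axis=1)                       # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # renormalized propagation matrix
    return np.maximum(0.0, A_hat @ H @ W)         # ReLU activation (assumed sigma)

# Hypothetical 3-node graph with 5-dim node features projected to 2 dims.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
H = rng.standard_normal((3, 5))    # input node features H^(l)
W = rng.standard_normal((5, 2))    # learnable weights W^(l)
H_next = gcn_layer(H, A, W)        # output node features H^(l+1)
```

Each output row mixes a node's own features with those of its neighbors, which is exactly the "transfer of neighborhood relationships" described above.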

HSI Classification Based on Superpixels
The conventional approach of directly employing pixels as nodes to construct graphs results in substantial temporal and spatial complexity, which severely constrains the feasibility of applying GCN models to extensive HSI datasets [18,33]. To address this issue, researchers proposed the use of superpixels as nodes to establish graph structures. Superpixels typically denote irregular clusters of neighboring pixels that share similar attributes such as texture, color, and brightness. In HSIs, superpixels are typically created by applying image-segmentation algorithms. More importantly, the number of superpixels in an HSI is generally significantly smaller than the number of individual pixels. Consequently, adopting superpixels as nodes for graph construction judiciously limits graph size and significantly enhances the efficiency of graph convolution. Furthermore, leveraging superpixels to construct graph structures offers the added advantage of preserving local structural information in HSIs.
Owing to these advantages, the practice of building graphs using superpixels as nodes has been adopted extensively. In 2019, Wan et al. [33] were the first to introduce the multiscale dynamic GCN (MDGCN) algorithm, which signified the earliest use of superpixel mappings in GCN methods. The MDGCN algorithm initiates this process by employing an image-segmentation algorithm to partition the HSIs into compact superpixels. The mean spectral feature value derived from all pixels within a specified superpixel defines its feature representation. Subsequently, the graph structure is constructed based on the local spatial neighborhood to facilitate the subsequent graph convolution operations. Since the introduction of the MDGCN algorithm, constructing graph structures based on superpixels has become standard practice [19,52], which subsequently resulted in a series of refinements. For instance, to extract multi-scale spatial-spectral information from an HSI, the MSSGU algorithm [53] employs regional fusion techniques to generate multi-scale spatial-spectral information through superpixel segmentation. This approach classifies superpixels into multiple levels and constructs the corresponding graph structures for graph convolution operations. Additionally, a major feature of this algorithm is that a CNN is used in both the pre-processing and post-processing steps to perform pixel-level feature fusion, thus avoiding the more complex pixel-level graph convolution operations and improving the algorithm's efficiency to a certain extent. To preserve fine-grained spatial-spectral information at the pixel level, the graph-in-graph convolutional network (GiGCN) algorithm [58] creates both pixel-level and superpixel-level graph structures. These are then fused to yield multi-scale spatial-spectral features through different levels of graph convolution. The model's structure sufficiently represents the local and global information of objects and reflects their relationships. Moreover, to further enhance the efficiency of graph convolution operations, the automatic graph learning convolutional network (Auto-GCN) algorithm [59] subdivides the original image into larger grid-shaped superpixels, thereby reducing the number of graph nodes.
In summary, adopting GCNs with superpixels as nodes offers convenience and allows precomputation, thus resulting in efficient resource utilization. Hence, superpixels were used as nodes in the current study. However, the aforementioned algorithms only take into account the spectral information of HSIs, neglecting other pertinent views such as texture. As a result, they are unable to precisely represent the interdependency between ground objects within HSIs, which constrains, to some degree, the capacity to further enhance classification accuracy for HSIs.

Methodology
In this section, we provide a comprehensive overview of the fundamental architecture of the AMF-GCN. Figure 1 shows the overall network structure. We begin by detailing the processing of HSIs, which encompasses the extraction and fusion of spectral and textural features. Subsequently, we present the creation of multi-view features to construct separate graphs. Finally, we investigate the utilization of multi-branch GCNs for feature extraction and the application of an attention mechanism for feature fusion. Furthermore, we describe the detailed procedure of the AMF-GCN in Algorithm 1.

(Figure 1 panel titles: Hyperspectral Image Processing; Multi-Branch Graph Construction; Adaptive Multi-Branch GCN.)

Algorithm 1 Learning Procedure for AMF-GCN
Input: HSI original data X, number of epochs T, number of superpixels N, number of graph-node neighbors k
1: Obtain N superpixels by using SLIC to perform superpixel segmentation on the HSI.
2: Obtain spectral features z_s and texture features z_t by Equations (8) and (12), respectively.
3: Obtain the fused feature z_f by fusing z_s and z_t.
4: while epoch ≤ T do
5:   Use Equation (14) to construct adjacency matrices A_s, A_t, and A_f with k-nearest neighbors for features z_s, z_t, and z_f.
6:   Obtain deep features Z_S, Z_T, and Z_F by Equation (15) using (z_s, A_s), (z_t, A_t), and (z_f, A_f).
7:   Obtain the final feature F through adaptive fusion by using Equations (16) and (17).
8:   Obtain the predicted label Y by Equation (18).
9: end while
10: Perform label prediction for each pixel.
Output: the label of each pixel

HSI Preprocessing
In this study, hyperspectral data are denoted as X_H ∈ R^{H×W×C}, where H and W represent the height and width of an HSI in pixels, respectively, and C denotes the number of spectral bands. Before proceeding with subsequent operations, we preprocessed the HSI. First, we performed principal component analysis [60] to reduce the dimensionality of the HSI. Dimensionality reduction eliminates redundant information from the original hyperspectral data; excessive dimensionality can impede the training and prediction speed of a model and potentially result in issues such as overfitting. Subsequently, we applied the simple linear iterative clustering (SLIC) [61] algorithm for superpixel segmentation. This step is essential to circumvent the substantial computational burden associated with regarding each pixel as a graph node during graph construction. The number of superpixels N obtained after segmentation can be expressed as shown in Equation (7), where the scale parameter λ (λ ≥ 1) is used to control the number of superpixels.
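A minimal sketch of the PCA step on a small synthetic cube, assuming an SVD-based PCA (in practice, superpixel segmentation, e.g., with skimage.segmentation.slic, would follow on the reduced cube); the cube dimensions here are hypothetical, far smaller than real HSIs:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 30                      # hypothetical tiny HSI cube X_H in R^{H x W x C}
X = rng.standard_normal((H, W, C))

pixels = X.reshape(-1, C)               # (H*W, C): one spectrum per pixel
pixels = pixels - pixels.mean(axis=0)   # center the spectra before PCA
U, S, Vt = np.linalg.svd(pixels, full_matrices=False)
k = 5                                   # keep the first k principal components
X_pca = (pixels @ Vt[:k].T).reshape(H, W, k)   # dimensionality-reduced cube
```

The singular values `S` come out in descending order, so the retained components carry the largest variance, which is the redundancy-removal effect described above.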
We represent the set of all superpixels in the image as S = {S_i}_{i=1}^{n}, where S_i corresponds to the i-th superpixel. Here, x_i^j represents the j-th original pixel in superpixel S_i, and N_i denotes the total number of original pixels encompassed by superpixel S_i. We converted all the superpixels into nodes of the topological graph, with each node V_i corresponding to a superpixel S_i. To complete the construction of the topological graph, we must acquire the spectral features z_s, texture features z_t, and fusion features z_f, in addition to an adjacency matrix representing the connection relationships between the nodes.

Spectral Feature Extraction
The original HSI may contain redundant information and noise, which can adversely affect feature extraction. To mitigate this issue, we employed a 1 × 1 CNN to preprocess individual pixels; the model is shown in Figure 2. Subsequently, based on the results of the superpixel segmentation, we incorporated the spectral information of the pixels into the spectral features of the graph nodes. Specifically, the output X^{(l)} = {x̃_i} obtained after the l-th convolutional layer within the node-feature network is expressed as shown in Equation (8):

X^{(l)} = σ(BN(W^{(l)} ∗ X^{(l−1)} + b^{(l)})), (8)

where ∗ represents the convolution operator; X^{(l−1)} signifies the input of the layer; BN(·) denotes batch normalization; σ(·) is the activation function; and W^{(l)} and b^{(l)} represent the learnable weights and offsets, respectively. With a 1 × 1 convolution kernel, the network output size remains identical to that of the input. Even after this operation, the resulting features are still pixel-level. To achieve a feature-level transformation while preserving the spatial information of the original image, superpixel-based feature aggregation must be performed. Let z_s^k denote the spectral features of the k-th node. Feature aggregation can then be expressed as shown in Equation (9):

z_s^k = (1/N_k) Σ_{i=1}^{N_k} x̃_i, (9)

where N_k denotes the number of pixels in the superpixel corresponding to the node. This feature aggregation method, which utilizes average values, mitigates the effect of outlier pixels when the segmentation accuracy is compromised. By stacking all the node feature vectors, we obtain the spectral feature matrix of the nodes, where n denotes the number of nodes.
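The mean aggregation over superpixels can be sketched as follows; a minimal sketch assuming a hypothetical segmentation label map in which `labels[i, j]` gives the superpixel index of pixel (i, j):

```python
import numpy as np

def aggregate_superpixels(features, labels):
    """Average pixel features within each superpixel to obtain node
    features (the mean aggregation of the text)."""
    n_nodes = labels.max() + 1
    dim = features.shape[-1]
    z = np.zeros((n_nodes, dim))
    for k in range(n_nodes):
        mask = labels == k
        z[k] = features[mask].mean(axis=0)   # mean over the N_k pixels of superpixel k
    return z

# Hypothetical 4x4 image split into 2 superpixels, with 3-dim pixel features.
labels = np.array([[0, 0, 1, 1]] * 4)
features = np.ones((4, 4, 3))
features[labels == 1] = 3.0
z_s = aggregate_superpixels(features, labels)   # node spectral feature matrix
```

Because each node feature is a mean, a single outlier pixel inside a superpixel shifts the node feature only by 1/N_k of its deviation, which is the robustness property noted above.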

Texture Feature Extraction
To further enhance image classification accuracy, we incorporate texture features extracted using the local binary pattern (LBP) technique, which are then combined with spectral features for classification. The LBP model, originally introduced by Ojala et al. [62,63], operates on image pixels by comparing the grayscale values of a central pixel and its neighboring pixels to form a binary bit string. Formally, the fundamental LBP operator at a specified center pixel is defined as shown in Equation (10):

LBP_{P,R} = Σ_{i=0}^{P−1} s(g_i − g_c) 2^i, with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise, (10)

where R denotes the radius of the sampling circle; P denotes the number of sampling points situated along the circumference of the circle; g_c represents the gray value of the central pixel; and g_i represents the gray value of the i-th adjacent point along the sampling circle, with i ranging from 0 to P − 1. Here, s(x) is the threshold function employed to binarize the grayscale disparity between g_i and g_c. When a sampling point does not align precisely with an actual pixel, the gray level is typically estimated using standard bilinear interpolation. By encoding all pixels using the LBP model in Equation (10), an m × n texture image can be encoded as an LBP-coded image. Subsequently, a statistical frequency histogram of the encoded image is generated to construct a feature vector. To maintain rotation invariance and reduce the dimensionality of the LBP, Ojala et al. [63] proposed the rotation-invariant uniform LBP (RULBP). Through a mapping, this method obtains rotation invariance and further reduces the feature dimensionality. The rotation-invariant uniform descriptor of the RULBP is expressed as shown in Equation (12):

LBP_{P,R}^{riu2} = Σ_{i=0}^{P−1} s(g_i − g_c) if U(LBP_{P,R}) ≤ 2, and P + 1 otherwise, (12)

where U(LBP_{P,R}) is a uniformity measure that counts the number of 0/1 transitions in the binary pattern.
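The basic operator of Equation (10) can be sketched on a single patch; a minimal sketch assuming the eight immediate neighbors of a 3 × 3 window as the sampling points (i.e., P = 8, R = 1, no bilinear interpolation):

```python
import numpy as np

def lbp_code(patch):
    """Basic LBP_{8,1} code of the center pixel of a 3x3 patch:
    sum_i s(g_i - g_c) * 2^i, with s(x) = 1 if x >= 0 else 0 (Eq. (10))."""
    gc = patch[1, 1]
    # the 8 neighbors in a fixed circular (counter-clockwise) order
    coords = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    code = 0
    for i, (r, c) in enumerate(coords):
        if patch[r, c] >= gc:        # threshold function s(g_i - g_c)
            code |= 1 << i           # set bit i, i.e., add 2^i
    return code

# Hypothetical 3x3 gray-value patch.
patch = np.array([[5, 9, 1],
                  [4, 6, 7],
                  [2, 6, 8]])
code = lbp_code(patch)   # an 8-bit code in [0, 255]
```

Sliding this operator over every pixel produces the LBP-coded image whose frequency histogram forms the texture feature vector; a library such as skimage.feature.local_binary_pattern with `method='uniform'` provides the rotation-invariant uniform variant.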
In HSIs, each spectral band can be regarded as an individual grayscale image. The RULBP model is applied directly to each band, which yields RULBP codes for every pixel. Subsequently, the statistical histogram of the image block centered at a pixel is employed as the RULBP feature of that pixel. Figure 3 illustrates the process of generating RULBP features. The extracted features encompass both primary spectral features and local texture features derived from the RULBP. This approach effectively leverages both the spatial and spectral information inherent in HSIs.

Multi-Branch Graph Construction
After obtaining the spectral feature z_s and texture feature z_t, we used a fully connected layer to unify their lengths. Subsequently, we fused the two features by element-by-element multiplication to obtain the fused feature z_f. Once the feature set z = {z_s, z_t, z_f} is defined, the edges must be established, which involves determining the adjacency matrix A based on the interactions between nodes. To maximize the preservation of spatial information from the original image, we adopted an adjacency relationship derived from the superpixels. In simpler terms, a weight of 1 is assigned to an edge connecting two neighboring nodes, whereas the weights of all other edges are set to 0 to indicate the absence of an edge connection. This can be expressed as shown in Equation (14):

A_{ij} = 1 if z_j ∈ N_k(z_i), and 0 otherwise, (14)

where A_{ij} denotes the element at position (i, j) of the adjacency matrix A and N_k(z_i) represents the set of k nearest neighbors of z_i, selected by KNN in the corresponding view. Employing this approach, we obtain the adjacency matrices A = {A_S, A_T, A_F}, which represent the spectral, texture, and fusion adjacency matrices, respectively, and thereby the corresponding spectral, texture, and fusion graphs. After acquiring the multi-branch features of the HSIs and their corresponding adjacency matrices, we employed a GCN to aggregate adjacent node features, thereby enhancing the model's feature-extraction capability. As discussed previously, each layer of the GCN can be expressed mathematically as in Equation (15). The number of graph convolution layers was set to three: employing more graph convolution operations may result in network degradation, whereas fewer layers may not capture the full spectrum of data features effectively. After the three graph convolution layers, we obtained the deep features, labeled Z_S, Z_T, and Z_F, respectively.
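The KNN-based adjacency of Equation (14) can be sketched as follows; a minimal NumPy sketch assuming Euclidean distance and symmetrization to keep the graph undirected (the node features here are hypothetical stand-ins for z_s, z_t, or z_f):

```python
import numpy as np

def knn_adjacency(Z, k):
    """Symmetric k-nearest-neighbor adjacency: A_ij = 1 if z_j is among
    the k nearest neighbors of z_i (Euclidean distance), else 0."""
    n = Z.shape[0]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # exclude self-matches
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)                             # symmetrize: undirected graph

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 4))       # hypothetical node features for one view
A = knn_adjacency(Z, k=2)             # one of A_S, A_T, or A_F
```

Running this once per view on z_s, z_t, and z_f yields the three branch adjacency matrices A_S, A_T, and A_F fed to the multi-branch GCNs.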
Upon completing the multi-branch feature extraction, the features from each branch were fused. Furthermore, to generate the final category map, the features were remapped to the pixel level to extract pixel-level features, thereby enabling the retrieval of category information for each individual pixel. After the two views and their fused features were mapped back to the size of the original image, based on the correspondence established during superpixel segmentation, they were converted into P_S, P_T, and P_F.

Adaptive Feature Fusion
The fusion of the three features can introduce redundancy and mutual interference. To address this issue, we employed an adaptive feature fusion method. Consider the feature p_i^S of node i in G_S as an illustrative example. The feature is first subjected to a nonlinear transformation; subsequently, an attention vector q ∈ R^{h'×1} is used to compute the attention value ω_i^S, as shown in Equation (16):

ω_i^S = q^T · tanh(W_S · (p_i^S)^T + b_S), (16)

where W_S ∈ R^{h'×h} and b_S ∈ R^{h'×1} represent the weight matrix and bias vector, respectively. In a similar fashion, we can calculate the attention values ω_i^T and ω_i^F for node i in the texture and fusion graphs, respectively. Subsequently, we normalize these values using the softmax function to derive the final weights, as shown in Equation (17):

α_i^S = softmax(ω_i^S) = exp(ω_i^S) / (exp(ω_i^S) + exp(ω_i^T) + exp(ω_i^F)). (17)

A higher value of α_i^S implies greater importance of the corresponding embedding. Finally, we employ the weight vectors to adaptively combine the features from each branch, which yields the final fused feature. The resultant final features, which contain information across multiple views, are then input into a classifier comprising a fully connected network and a softmax function to predict the category of each pixel, as depicted in Equation (18).
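The attention-based fusion of Equations (16) and (17) can be sketched as follows; a minimal sketch assuming tanh as the nonlinear transformation and hypothetical parameters `Wq`, `bq`, `q` shared across the three branches:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fuse(ZS, ZT, ZF, Wq, bq, q):
    """Score each branch per node with q^T tanh(W z + b) (Eq. (16)),
    softmax over the three branches (Eq. (17)), then weighted-sum fuse."""
    branches = np.stack([ZS, ZT, ZF], axis=1)          # (n, 3, h)
    scores = np.tanh(branches @ Wq.T + bq) @ q         # (n, 3) attention values
    alpha = softmax(scores, axis=1)                    # normalized branch weights
    fused = (alpha[..., None] * branches).sum(axis=1)  # (n, h) fused features F
    return fused, alpha

rng = np.random.default_rng(0)
n, h = 5, 4                                  # hypothetical sizes
ZS, ZT, ZF = (rng.standard_normal((n, h)) for _ in range(3))
Wq = rng.standard_normal((h, h))
bq = rng.standard_normal(h)
q = rng.standard_normal(h)
F, alpha = adaptive_fuse(ZS, ZT, ZF, Wq, bq, q)
```

Because the weights α are computed per node, the model can lean on the spectral branch for one superpixel and the texture branch for another, which is the adaptive selectivity the method relies on.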
where Y denotes the category vector output by the network. The network employs the commonly used cross-entropy loss as its loss function for the classification task.

HSI Datasets
To provide a fair assessment of the model's effectiveness, it is crucial to employ diverse datasets. This paper evaluates the model's classification performance using three widely recognized HSI datasets: Pavia University, Salinas, and Houston-2013. It is worth mentioning that, in the experiments, the division of each dataset into training and test sets is not fixed: each experiment randomly selects a fixed number of sample points from the dataset as the training set and uses the remainder as the test set.
Pavia University dataset: The Pavia University dataset was captured in 2003 by the ROSIS imaging spectrometer, operated by the German Aerospace Center (DLR), over the city of Pavia, Italy. The University of Pavia subsequently processed the acquired HSI. After extraction, the dataset has dimensions of 610 × 340 × 115, comprising 610 × 340 pixels with 115 spectral bands per pixel. The spatial resolution of the Pavia University dataset is 1.3 m per pixel, and it encompasses nine distinct label classes. Figure 4 provides a pseudo-color representation of this dataset, with each color corresponding to a specific label category, and Table 1 shows its detailed information.
Salinas dataset: The Salinas dataset was gathered over California's Salinas Valley in 1992 using the AVIRIS sensor. It has dimensions of 512 × 217 × 224, comprising 512 × 217 pixels with 224 spectral bands, including 20 water-absorption bands. Spatially, the Salinas dataset has a resolution of approximately 3.7 m per pixel and incorporates 16 label categories. Figure 5 presents a pseudo-color image of the dataset, along with the corresponding category labels represented by distinct colors, and Table 2 shows its detailed information.
Houston dataset: This dataset was collected by the ITRES CASI-1500 sensor over the University of Houston campus in Texas, USA, and its adjacent rural areas in 2013. After removing the noise bands, the remaining 144 valid spectral bands were used in the experiments. The Houston dataset has 349 × 1905 pixels, a spatial resolution of 2.5 m per pixel, and contains 15 land-cover categories, such as tree, soil, water, healthy grass, running track, and tennis court. The pseudo-color images, standard classification maps, and color category labels corresponding to the Houston dataset are shown in Figure 6, and Table 2 shows its detailed information.

Experimental Setup

Evaluation Indices
To quantitatively analyze the strengths and weaknesses of the proposed algorithm and the comparison algorithms, we use three commonly used evaluation indicators: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). These indicators assess the performance of an algorithm from different aspects, and all three are positively related to classification quality.
(1) Overall accuracy: Overall accuracy is the proportion of correctly classified samples among all samples after the model predicts the dataset. The higher the OA, the better the classification, and it is defined as in Equation (19):

OA = (∑_{i=1}^{C} h_ii) / N, (19)

where N represents the total number of samples, C is the number of categories, and h_ii represents the number of samples correctly classified into category i.
(2) Average accuracy: Average accuracy is the mean of the per-category classification accuracies and captures the classification differences across categories. It is defined as in Equation (20):

AA = (1/C) ∑_{i=1}^{C} (h_ii / h_i+), (20)

where h_i+ represents the total number of samples in category i.
(3) Kappa coefficient: The Kappa coefficient evaluates the consistency between the classification map and the reference map and provides a comprehensive assessment of the classification. It is defined as

Kappa = (N ∑_{i=1}^{C} h_ii − ∑_{i=1}^{C} h_i+ h_+i) / (N² − ∑_{i=1}^{C} h_i+ h_+i),

where h_i+ represents the total number of samples of the i-th category, h_+i represents the number of samples classified as the i-th category, and N is the total number of samples.
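All three indices can be computed directly from a confusion matrix; the following sketch (the function name is illustrative) follows the standard definitions above:

```python
import numpy as np

def oa_aa_kappa(h):
    """Compute OA, AA, and Kappa from a C x C confusion matrix h,
    where h[i, j] counts class-i samples predicted as class j."""
    h = np.asarray(h, dtype=float)
    N = h.sum()
    diag = np.diag(h)
    oa = diag.sum() / N                                 # overall accuracy
    aa = np.mean(diag / h.sum(axis=1))                  # mean per-class accuracy
    pe = (h.sum(axis=1) * h.sum(axis=0)).sum() / N**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

For example, a two-class matrix [[8, 2], [1, 9]] yields OA = 0.85, AA = 0.85, and Kappa = 0.7.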

Compared Methods
To verify the capability of the proposed algorithm, a variety of existing advanced methods were selected for comparison: ML-based algorithms, namely joint collaborative representation and SVM with decision fusion (JSDF) [64] and multiband compact texture units (MBCTU) [65]; CNN-based algorithms, namely the hybrid spectral CNN (HybridSN) [66] and the diverse region-based deep CNN (DR-CNN) [67]; and GNN-based algorithms, namely graph sample and aggregate attention (SAGE-A) [68] and MDGCN [33]. All comparative experiments were run five times using the optimal parameters given in the corresponding articles, and the results were averaged.

Experimental Environment and Parameter Settings
The experiments in this article were conducted using Python 3.9 and PyTorch 1.13. All experiments were run five times on a machine equipped with a 24 GB RTX 3090 GPU and 64 GB of RAM, and the results were averaged to reduce errors. The Adam optimizer was employed with a learning rate of 0.0005. Parameters such as the number of iteration steps (T), the value of k in the k-nearest-neighbor graph construction, and the number of superpixels (N) were set to different values depending on the dataset; specific values can be found in Table 3.

Classification Results

Tables 4-6 present the per-category classification accuracy and the three evaluation indices of the various methods on the Pavia University, Salinas, and Houston datasets. AMF-GCN consistently achieved the highest classification results across all datasets, with the OA exceeding 93% on all three. As observed in Table 4, the proposed AMF-GCN method demonstrates clear superiority over the GNN-based approaches and all other classifiers on the Pavia University dataset, outperforming SAGE-A and MDGCN by 1.26% and 1.77% in OA, respectively, and the CNN-based methods by 4.83% and 7.46% in OA. Although the Salinas dataset contains numerous categories and a substantial amount of data, the proposed AMF-GCN method still achieved the highest OA and Kappa. As evident from Table 5, the ML-based methods delivered impressive results, and the 3D-convolution-based HybridSN concurrently captures the spatial-spectral features of HSI with its three-dimensional kernels, resulting in enhanced accuracy. However, these methods tend to overlook node relationships within the image and may not emphasize the most important information. Consequently, AMF-GCN achieves significant gains of 5.89-8.48% in Kappa on the Salinas dataset, and for the broccoli green weeds 1, broccoli green weeds 2, and fallow classes it achieved 100% accuracy. In contrast to the previous two datasets, the Houston dataset features more dispersed labeled areas at smaller scales; in such scenarios, relying solely on shallow spectral information from the HSI may not suffice for fine-grained classification. SAGE-A and MDGCN achieved OAs of 89.58% and 91.40%, respectively, which, although respectable, fall short of exceptional accuracy. AMF-GCN achieved an impressive OA of 93.02%, underscoring the benefit of extracting and fusing multiple features.

Visual Results
To facilitate a more intuitive comparison of the classification accuracy of the different algorithms, this experiment visualizes the classification results on the Pavia University, Salinas, and Houston datasets, as depicted in Figures 7-9. An immediate observation is that the classification maps generated by AMF-GCN are the most convincing, with fewer misclassifications and a smoother appearance than those of the two convolutional neural network methods. Furthermore, on the Pavia University and Salinas datasets, the ML-based methods excel at classifying certain category blocks but introduce larger errors in some neighborhoods, particularly for self-blocking bricks and trees, suggesting that they may struggle with complex scenarios involving smaller blocks. The approach presented in this paper handles complex situations competently both within larger classification blocks and within smaller ones. Compared with the CNN-based methods, both perform admirably when utilizing neighborhood information; nonetheless, at the boundaries the CNN-based methods lag behind AMF-GCN, indicating that AMF-GCN offers greater flexibility. In conclusion, the proposed approach leverages the attention mechanism to make more effective use of spatial information: it displays very few internal misclassification points and accurately captures the details within the black boxes, outperforming the other methods by a considerable margin. This further underscores the advantages of the attention mechanism in harnessing spatial information.

Parameter Analysis
In this section, we investigate the impact of k and N in detail. The experiments systematically vary k and N with the dataset scales, and Figure 10 shows the results under different parameter combinations. Specifically, k ranges from 100 to 600 at intervals of 100 for the Pavia University and Houston datasets, and from 100 to 350 at intervals of 50 for the Salinas dataset. The resulting surfaces show that constructing the graph with too few edges can cause information aggregation to overlook important neighbor nodes. In HSI, correlation information between pixels at both short and long ranges contributes to classification, so preserving the integrity of the graph data is pivotal for model learning. However, excessively large k values reduce model accuracy on all three datasets, indicating that too many neighbor nodes introduce noise; selecting an appropriate number of neighbors is therefore of paramount importance. Furthermore, the number of superpixels determines the size of the segments obtained: smaller numbers of superpixels retain larger objects while suppressing more noise, whereas larger numbers of superpixels yield smaller segments, preserving smaller objects but potentially introducing more noise. To analyze the impact of the number of superpixels on classification, the experiment varies N from 1000 to 11,000 and tests the accuracy of AMF-GCN on each dataset. As depicted in Figure 10, the accuracy on the Pavia University dataset trends upward as N increases; this is attributed to the larger category scale within the dataset, with finer segmentation improving overall accuracy and suppressing classification-map noise caused by misclassification. However, this upward trend does not persist indefinitely. To avoid excessively smooth classification maps, the number of superpixels for each dataset was set to the most appropriate value during the experiments.
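For illustration, the k-nearest-neighbor graph construction controlled by the k parameter can be sketched as follows; this is a brute-force Euclidean variant over node features (e.g., one vector per superpixel), and the paper's exact distance measure and implementation may differ:

```python
import numpy as np

def knn_adjacency(features, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix from
    node features, one row per graph node."""
    # pairwise Euclidean distances between all nodes
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-matches
    A = np.zeros_like(d)
    nn = np.argsort(d, axis=1)[:, :k]         # k closest nodes per row
    rows = np.repeat(np.arange(len(features)), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)                 # symmetrize the graph
```

Increasing k densifies this adjacency matrix, which matches the observed trade-off: too few edges drop informative neighbors, while too many connect dissimilar nodes and inject noise.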

Ablation Study
To comprehensively evaluate the proposed AMF-GCN algorithm, this section conducts a series of ablation experiments. AMF-GCN comprises three branches, receiving spectral features, texture features, and fusion features as inputs. To dissect the specific contributions of these three types of features, we assessed their individual classification accuracy on the Pavia University, Salinas, and Houston datasets. We also investigated the influence of the attention-based feature fusion mechanism by testing the network's accuracy without it. The results are presented in Table 7. The table shows that removing any feature reduces the overall classification accuracy. On the Salinas dataset, the network that fused spatial-spectral and texture features outperformed the individual features by 1.93% and 1.97%, respectively, underscoring the synergy between texture and spectral information extracted from multiple perspectives. Furthermore, omitting the attention mechanism reduced the OA by 0.86%, 0.86%, and 0.36% on the three datasets, respectively. Moreover, the multi-feature fusion variant 5 achieved 0.20%, 0.61%, and 0.38% higher overall accuracy on the three datasets than variant 3, which used exclusively fused representations. This highlights the significance of the attention-based feature fusion mechanism, which enables the model to assign varying degrees of importance to different features and thereby improves classification.
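The attention-based weighting that this ablation isolates can be illustrated with a small sketch. Note that the real model learns the branch scores from the features themselves, whereas this sketch takes raw scores as given and only shows the softmax-weighted combination step:

```python
import numpy as np

def attention_fuse(branch_feats, w):
    """Fuse per-branch node features with softmax attention weights.
    branch_feats: list of (nodes, dim) arrays, one per GCN branch;
    w: one raw score per branch (learned in the real model)."""
    scores = np.exp(w - np.max(w))            # numerically stable softmax
    alpha = scores / scores.sum()             # weights sum to 1
    return sum(a * f for a, f in zip(alpha, branch_feats))
```

With equal scores, this reduces to plain averaging; the gains reported above come precisely from the scores being unequal and data-dependent.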

Comparison with Various Graph-Based Models
To further demonstrate the advantages of our model, we compare it with advanced recent methods on the Pavia University and Salinas datasets: MSAGE-Cal [52], MSSUG [53], SSG [50], SSPGAT [51], and MARP [18]. Several of these algorithms incorporate multi-feature fusion techniques. For instance, SSPGAT employs a graph attention network to merge pixel and superpixel features, enhancing hyperspectral classification; MSAGE-Cal integrates multi-scale and global information from the graph; and MSSUG creates multilevel graphs by combining adjacent regions in HSIs, capturing spatial topologies in a multi-scale hierarchical fashion. Additionally, MARP discerns the importance weights of various hop neighborhoods and aggregates nodes selectively and automatically. From Table 8, it is evident that AMF-GCN consistently achieves optimal performance across the evaluation metrics. On the Pavia University dataset, the OA of AMF-GCN surpasses MARP, SSPGAT, SSG, MSSUG, and MSAGE-Cal by 0.36%, 1.1%, 2.63%, 5.39%, and 1.06%, respectively; on the Salinas dataset, it outperforms them by 0.22%, 0.45%, 0.39%, 0.04%, and 0.42%, respectively. Two key factors contribute to this superior accuracy. First, AMF-GCN effectively aggregates multi-view features, leveraging their complementary information. Second, unlike models that simply fuse features, AMF-GCN employs an attention-based feature fusion mechanism, adaptively selecting features and substantially reducing redundancy.

Impact of the Number of Training Samples
The number of training samples plays a crucial role in determining the classification performance of a network: more samples generally allow the model to learn the data features more accurately and reduce the risk of overfitting. To assess the impact of sample size on AMF-GCN and the other methods, we varied the number of training samples per class in the three datasets from 5 to 30 in increments of 5. Figure 11 presents the overall classification accuracy of four methods (HybridSN, SAGE-A, MDGCN, and AMF-GCN) on the three datasets. Increasing the number of samples enhances the classification accuracy of every method, underscoring the influence of sample size on classification results. Notably, AMF-GCN outperforms the other methods, demonstrating its robustness and advantages, especially with limited samples. These advantages are attributed to its ability to extract multi-view features and to modules such as superpixel segmentation and attention-based feature fusion. In contrast, HybridSN exhibits the lowest classification results, primarily because of its heavy reliance on training data. The GCN-based methods perform better than HybridSN with few samples because the relationship features obtained through the graph structure help mitigate the shortage of HSI training samples: GCN effectively captures global relationships among the nodes of the graph, and classification is performed based on these relationship features.

Influence of Different Texture Feature Extraction Methods
Texture information in HSI plays a significant role in complementing and enhancing the accuracy of downstream tasks, and the effectiveness of texture extraction directly influences the quality of the extracted features. This section therefore examines the impact of different texture feature extraction methods, with the results presented in Table 9. The chosen algorithms are the extended morphological profile (EMP) [14], LBP, and RULBP. The EMP is a structural approach that characterizes texture primitives and their spatial arrangements while reducing noise through morphological operations. The results show that the EMP algorithm successfully eliminates misclassified patches within ground objects via morphological transformations and achieves commendable accuracy, though it falls slightly behind the LBP-based method. Notably, RULBP is the top-performing approach, outperforming the other methods by 1.43%/0.56%, 1.85%/0.79%, and 1.92%/1.17% on the Pavia University, Salinas, and Houston datasets, respectively, showcasing its effectiveness.
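As a point of reference, a basic 8-neighbor LBP code can be computed as below. This sketch deliberately omits the rotation-invariant uniform mapping that distinguishes RULBP from plain LBP:

```python
import numpy as np

def lbp_3x3(img):
    """Basic 8-neighbor LBP code per interior pixel: each neighbor
    >= center contributes one bit to an 8-bit code."""
    # neighbor offsets, clockwise from top-left
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    c = img[1:-1, 1:-1]                       # center pixels
    for bit, (dy, dx) in enumerate(offs):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= ((nb >= c).astype(np.uint8) << bit)
    return out
```

RULBP then maps each 8-bit code to a rotation-invariant "uniform" label (codes with at most two 0/1 transitions), which is what gives it the robustness reflected in Table 9.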

Influence of the Number of Encoder Layers
To some extent, the performance of a neural network is positively correlated with its number of layers. However, increasing the depth can lead to vanishing or exploding gradients, which hinder convergence, slow down training, and can even degrade performance; the appropriate depth must therefore be guided by the specific problem and dataset characteristics. To evaluate the impact of network depth on AMF-GCN, this section sets the number of network layers to 1, 2, 3, 4, and 5 and assesses the performance at each depth. The results are presented in Figure 12. On the Pavia University dataset, the performance at depths 2 and 3 is quite similar: deeper GCN layers tend to produce smoother classification maps, and since the Pavia University dataset consists of relatively large-scale ground-object categories, this smoothing does not significantly impact performance, while deeper layers still enable the model to extract richer features. For the Salinas and Houston datasets, optimal performance is achieved at a depth of 3. Taking the performance across all three datasets into account, a network depth of 3 is selected in this paper.
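The encoder depth discussed here corresponds to the number of stacked propagation layers. A minimal numpy sketch of the standard GCN rule H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W) — not the paper's exact implementation — is:

```python
import numpy as np

def gcn_forward(A, X, weights):
    """Run X through len(weights) GCN layers over adjacency A.
    The length of `weights` is the encoder depth."""
    A_hat = A + np.eye(len(A))                # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))  # symmetric normalization
    H = X
    for W in weights:
        H = np.maximum(A_norm @ H @ W, 0.0)   # propagate, then ReLU
    return H
```

Each extra layer averages node features over a wider neighborhood, which is why very deep stacks over-smooth the classification map, as observed on the Pavia University dataset.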

Complete Image Visualization
To further validate the consistency and robustness of our classification model, we applied it to the full unseen images, containing both labeled reference samples and unlabeled background pixels. As shown in Figure 13, we compare our model against three graph-based approaches. Our model performs best for classes such as grape untrained and vineyard untrained, exhibiting minimal impact from noisy backgrounds. This can be attributed to the incorporation of texture information, which enhances the model's feature extraction ability and robustness.

Conclusions
In this paper, we introduce a novel HSI classification approach, AMF-GCN. The method extracts and fuses spectral and texture features to create multi-view features, uses three separate graph-convolution branches to consolidate node information, and employs an attention-based feature fusion technique for adaptive fusion. Extensive ablation experiments and discussions thoroughly evaluate the proposed method. The results on three commonly used datasets demonstrate the efficacy and advantages of AMF-GCN, which outperforms all comparative methods and achieves outstanding performance.
However, our model has two limitations. First, constructing multiple views comes at a cost in computational efficiency. Second, our superpixel division method is static and only utilizes shallow spectral information. Going forward, we aim to explore dynamic superpixel generation techniques that can co-evolve with model training for even stronger performance. In future work, we also plan to explore unsupervised learning techniques for clustering tasks on HSI and to delve deeper into the spatial autocorrelation of HSI.

Figure 1 .
Figure 1. AMF-GCN architecture diagram. The model is divided into three main parts. First, the hyperspectral image undergoes preprocessing, including dimensionality reduction, superpixel segmentation, and texture feature extraction. Second, spectral features are extracted from the superpixel-segmented hyperspectral image and combined with the texture features to obtain the fusion features, and the k-nearest-neighbor algorithm is then used to construct a graph for each feature. Finally, graph convolutional networks extract features from G1, G2, and G3, and the attention-based fusion algorithm fuses the features.

Figure 3 .
Figure 3. The process of generating rotation-invariant uniform local binary pattern (RULBP) features in HSI.

Figure 4 .
Figure 4. Pavia University dataset pseudo-color image and corresponding category labels for each color. (a) False colour image. (b) Ground-truth map.

Figure 5 .
Figure 5. Salinas dataset pseudo-color image and corresponding category labels for each color. (a) False colour image. (b) Ground-truth map.

Figure 6 .
Figure 6. Houston dataset pseudo-color image and corresponding category labels for each color. (a) False colour image. (b) Ground-truth map.

Figure 7 .
Figure 7. The classification maps of different methods on the Pavia University dataset.

Figure 8 .
Figure 8. The classification maps of different methods on the Salinas dataset.

Figure 9 .
Figure 9. The classification maps of different methods on the Houston dataset. The red box shows an enlarged version of part of the map. (a) GT, (b) JSDF, (c) MBCTU, (d) HybridSN, (e) DR-CNN, (f) SAGE-A, (g) MDGCN, (h) AMF-GCN.

Figure 10 .
Figure 10. Sensitivity to the k and N parameters on the three datasets. N and k, respectively, represent the number of superpixels and the value of k for k-nearest-neighbor graph construction.

Figure 11 .
Figure 11. Effect of different numbers of training samples on classification results.

Figure 12 .
Figure 12. Effect of different numbers of encoder layers on classification results.

Figure 13 .
Figure 13. The complete classification maps of different methods on the Salinas dataset.

Table 1 .
Detailed information of the Pavia University dataset.

Table 2 .
Detailed information of Salinas and Houston datasets.

Table 3 .
Hyperparameter settings for the different datasets. T and lr denote the number of training epochs and the learning rate, respectively. N and k represent the number of superpixels and the value of k for k-nearest-neighbor graph construction, respectively.

Table 4 .
Classification results of different methods in terms of per-class accuracy, OA, AA, and Kappa for the Pavia University dataset (%). The optimal result is shown in bold.

Table 5 .
Classification results of different methods in terms of per-class accuracy, OA, AA, and Kappa for the Salinas dataset (%). The optimal result is shown in bold.

Table 6 .
Classification results of different methods in terms of per-class accuracy, OA, AA, and Kappa for the Houston dataset (%). The optimal result is shown in bold.

Table 7 .
Classification results of the AMF-GCN ablation variants in terms of OA, AA, and Kappa on the three datasets (%). The optimal result is shown in bold.

Table 8 .
Classification results of different methods in terms of OA, AA, and Kappa for the Pavia University and Salinas datasets (%). The optimal result is shown in bold.

Table 9 .
OA (%) of AMF-GCN with different texture feature extraction methods on the three datasets. The methods are extended morphological profiles (EMP), local binary pattern (LBP), and rotation-invariant uniform LBP (RULBP).