Hyperspectral Image Classification with Localized Graph Convolutional Filtering

: The nascent graph representation learning has shown superiority for resolving graph data. Compared to conventional convolutional neural networks, graph ‐ based deep learning has the advantages of illustrating class boundaries and modeling feature relationships. Faced with hyperspectral image (HSI) classification, the priority problem might be how to convert hyperspec ‐ tral data into irregular domains from regular grids. In this regard, we present a novel method that performs the localized graph convolutional filtering on HSIs based on spectral graph theory. First, we conducted principal component analysis (PCA) preprocessing to create localized hyperspectral data cubes with unsupervised feature reduction. These feature cubes combined with localized ad ‐ jacent matrices were fed into the popular graph convolution network in a standard supervised learning paradigm. Finally, we succeeded in analyzing diversified land covers by considering lo ‐ cal graph structure with graph convolutional filtering. Experiments on real hyperspectral datasets demonstrated that the presented method offers promising classification performance compared with other popular competitors.


Introduction
The hundred contiguously narrow bands of hyperspectral images (HSIs) feature the hyperspectral remote sensing research fields [1]. HSIs make high-resolution spectral or spectral-spatial information extraction possible on account of their ability to carry a high volume of information [2]. Hyperspectral information extraction often involves noise estimation, endmember extraction, spectral unmixing, classification, and target detection phases based on hyperspectral data processing and analysis [3][4][5][6]. Hyperspectral remote sensing image analysis has a great power to recognize the materials of the land surface at a fine level compared to RGB (red, green, and blue) or multispectral image analysis [7,8]. However, it is unable to ignore that adjacent bands in high-dimensional hyperspectral data might be highly correlated, resulting in the Hughes phenomenon (or called the curse of dimensionality) [9], so desiring a large number of labeled samples [10,11]. And then, varying spectral signature and limited training samples at hand would probably raise the unanticipated dilemma that we have to solve the small sample classification problem [12]. When it comes to HSI classification tasks, the early-staged machine learning methods might (1) heavily rely on the handcrafted spectral-spatial features [13], (2) fail to accurately learn class conditional densities [2], (3) not accommodate limited training samples faced with the high dimensionality of hyperspectral data [6]. In the above regard,  propose a subspace-based approach to reduce the dimen-sionality of the input space and facilitate the exploitation of the limited training samples [14]. Yu et al. (2016) introduced a novel supervised classifier for HSI classification combining spectral and spatial information [15]. Gao et al. (2016) combined locality-preserving projection and sparse representation to balance the high dimensionality of hyperspectral data and the limited training samples [16].  integrated the locality-sensitive discriminant analysis with the group sparse representation for HSI classification [17]. Gao et al. (2017) presented an optimized kernel minimum noise fraction transformation algorithm for the efficient feature extraction of HSIs [18].  presented a multiscale super pixel segmentation method to model the distribution of classes based on spatial information [19]. Henceforward, the deep learning technique has been increasingly favored by the scientific community attributing to its great power of abstracting representations to classify hyperspectral cubes into certain land cover categories [20][21][22][23]. As a consequence, Cui et al. (2019) proposed a multiscale spatial-spectral convolutional neural network (CNN) to integrate multiple receptive fields fused features and multiscale spatial features at different levels [24]. Gao et al. (2019) integrated t-distributed stochastic neighbor embedding with a CNN to capture the potential assembly features of HSIs [25]. Yu et al. (2020) proposed a novel method to exploit local spectral similarity and nonlocal spatial similarity by considering spatial consistency [26]. Liu et al. (2020) proposed a novel lightweight shuffled graph convolutional network (GCN) to accelerate the training procedure through a limited number of training data [27]. Making a mark on the latest, the recent novelties regarding graph representation learning have attracted more and more attention from the community.
Graph neural networks (GNNs) are a class of deep learning methods designed to perform inference on data described by graphs [28]. Deep learning as a data-driven machine learning technique has undoubtedly brought enormous prosperity in hyperspectral remote sensing intelligent information extraction depending on the high-level representational ability [29]. CNNs, as a kind of attractive representation of deep learning models, have also achieved promising results in analyzing HSIs [1]. Particularly, the CNNs' localized kernels could efficiently recognize identical features beyond their spatial locations. Although CNN has been successful on the domains with underlying grid-like structured data, the CNN-based methods suffer from several intrinsic drawbacks summarized by the previous studies [30,31], i.e., (1) only adapting to the regular squares regardless of the geometric changes in object regions, (2) difficulty in capturing the valuable information of class boundaries during convolving a punch of patches as the convolution kernels have fixed shape, size, and weights, (3) often take a longer training time to fit huge parameters, (4) incapable of modeling topological relations among samples whether local or nonlocal feature extraction. In this regard, the graph-based convolutional neural networks appear relatively promising to overcome the aforementioned defects and show excellent characteristics, i.e., (1) competently process the irregular image regions in the non-Euclidean (or non-grid) graph data structure, (2) multiple graph inputs can be dynamically updated and refined with multiscale neighborhood [30]. It is worth mentioning that graph representation learning represented by GCNs has received increasing attention in quantifying nonlinear features in irregular graphs converted from hyperspectral data.
Inspired by the previous works on spectral graph-based CNN [31], its key components have been employed to adapt the HSI classification task (see Figure 1). The main contributions of this study are summarized below.
(1) The usual supervised setting regarding fitting the graph-based learning models is designed through collecting the patch-based feature cubes and localized graph adjacent matrices.
(2) The graph convolution layer is used to learn the spatially local graph representation and to represent the localized topological patterns of the graph nodes.
(3) The experiments demonstrate that the presented study could achieve promising classification performance based on the localized graph convolutional filter. The rest of this paper is organized as follows. We first reviewed the latest works relevant to HSI classification with the graph-based methods in Section 2. Then, we provide the preliminaries and definitions in Section 3. The technical details of our graph-based representation learning method are presented in Section 4. Next, we analyze the experimental results and discuss the derived findings in Section 5. Finally, the concluding remarks are given in Section 6.

Related Work
Graphs are a kind of universal representation of non-Euclidean structured data, which could encode complex geometric structures [32]. The following studies regarding the graph-based HSI classification approaches have gained significant attention in the last few years. Therefore, we offer a glimpse of their scientific contributions. Hyperspectral data usually reside on a nonlinear sub-manifold, causing the inefficiency of linear algorithms [1]. Manifold learning-based algorithms have been applied for the exploration of the nonlinear structure of HSI [33]. Graph-based semi-supervised learning usually constructs a graph from the labeled and unlabeled samples for manifold representation [2]. Ma et al. (2014) presented a study of the local manifold learning to preserve the local geometry of each neighborhood by finding the relationships between the nonlinear data points [34].
Sparse representation-based graph learning algorithms are good at obtaining the adjacency relationships among the samples and weights [35]. Tan et al. (2015) constructed a block sparse graph by combining sparse representation and the regularized collaborative representation for HSI classification based on discriminant analysis [36].  employed manifold learning based on sparse representation to illustrate the manifold structure of HSI [37]. De Morsier et al. (2016) proposed a graph representation with the kernel low-rank and sparse subspace clustering for the classification of his, assuming that hyperspectral data lies on the union of manifolds [38]. Based on the previous works, Shao et al. (2017) proposed a probabilistic class structure to estimate the probability relationship between each sample point and each class of the whole data [39].  proposed a graph-based semi-supervised learning method for analyzing the discriminant behavior of the labeled samples to assess the class separability [40].
As spectral information alone is not useful for discriminating different classes, the superior classification performance could be achieved through exploiting the spatial neighborhood information along with spectral information [41]. Camps-Valls et al. (2007) presented a graph-based composite kernel model for learning spectral-spatial information in a semi-supervised way [42].  proposed a two-layer graph-based framework to overcome the challenges of limited data and the compound distribution of classes [43]. Martínez-Usó et al. (2014) proposed a transductive approach for graph-based semi-supervised learning based on the probabilistic relaxation theory [44]. Wang et al. (2014) classified newly introduced samples by constructing the spectral-spatial graph while the unlabeled samples could be randomly selected relying on the spatial information [45].  proposed a graph-based model considering both spatial and spectral information [46].
The sparse representation-based graph semi-supervised learning technique combined with spectral-spatial feature learning has been proven to be effective to boost the resultant classification performance. Kruse et al. (2003) constructed a hypergraph model to explore the high-order relationships among training samples and then performed a semi-supervised hypergraph learning based on a locality constraint low-rank representation method [47]. Chen et al. (2017) conducted the double sparse graph discriminant analysis based on mining the positive and negative relationships among the data points for the dimensionality reduction in HSI in a semi-supervised manner [48]. Xue et al.
(2017) adopted the sparse graph regularization for getting a more accurate classification map [49]. Aydemir and Bilgin (2017) used subtractive clustering to select training samples and extract the kernel sparse representation features to fit a support vector machine (SVM) classifier [50].
In the latest literature, GCNs have been successfully applied in irregular (or non-Euclidean) data representation learning [8]. The label information of each sample is propagated to its neighboring samples until a global stable state is reached on the complete dataset [2]. Earlier, the feature extraction and classification module has been assembled separately or executed step-by-step. As such, some scholars tried the spatial fusion technique to extract the spectral-spatial features, and then the fused features are fed into a CNN framework to learn the class distribution [12,51]. Cao et al. (2016) proposed a graph-based convolutional neural network, which used the Schroedinger Eigenmaps algorithm by incorporating a cluster potential matrix to encode spatial proximity and takes CNN as a spectral-spatial classifier to predict the accurate labels of pixels [12]. Shahraki and Prasad (2018) defined three spectral-spatial weighted affinities, (1) unsupervised adjacency matrix by using the raw reflectance spectra, (2) supervised adjacency matrix through extracting discriminative features using CNN, and (3) semi-supervised adjacency matrix via learning the limited amount of labeled samples and extensive unlabeled samples, to demonstrate the data resided on manifold structure (i.e., graph structure) [52]. Liu et al. (2020) extracted the extended morphological profiles and then conducted graph construction by the k -neighbors method, then fed into a GCN framework [20].
Recent advances in HSI classification also tended to improve the traditional GCN-based methods to inspire novelties in diverse learning paradigms [53]. As traditional GCNs might fail to utilize spectral signatures without considering spatial structures embedded in hyperspectral data, Qin et al. (2019) presented a semi-supervised spectral-spatial GCN framework and claimed that a general backpropagation rule of error could benefit the final classification performance [7]. Wan et al. (2019) made multiple graph inputs dynamically updated and refined with a novel dynamic graph convolution operation, then multiple graphs with different neighborhood scales could serve for extracting spectral-spatial features in different scales [11]. Hong et al. (2020) introduce the mini-batch strategy to improve GCN, which is capable of processing large-scale data and out-of-samples, and then jointly fused CNN (to extract the spectral-spatial features) and GCN (to analyze the relation representations) by testing three fusion schemes [10]. The deeply semi-supervised learning models have drawn more attention depending on their peculiar advantages of mining the unlabeled data to alleviate the annotating burden in the last couple of years [54].
Relevant to this study, most related works pay attention to the graph-based semi-supervised learning methods for HSI classification. The graph-based semi-supervised technique makes the input data built on the full graph, which combines the labeled and unlabeled nodes by employing a graph Laplacian regularizer when training and evaluating node classification models. The unlabeled nodes are completely observed during training or testing, whereas the standard formulation of semi-supervised learning requires the independent and identically distributed assumption between the labeled and unlabeled nodes [55]. In this case, the special concern of how to follow the usual supervised setting is raised as a research problem associated with our scientific motivation in this study.

Graph Structure
An undirected graph is represented by  vertices which signify both the labeled and unlabeled data samples. E is the edge set which denotes the similarities among the labeled samples as well as the unlabeled samples from the dataset. The R n n   A is a weighted adjacency matrix (i.e., graph weights) encoding the connection weight between two vertices. Note that, given a signal x defined on the nodes of the graph, which can be regarded as a vector R n  x , i x is the value of x at the th i node.

Adjacency Matrix
The graph adjacency matrix is usually calculated by measuring the similarity between two spatial neighborhoods. The adjacency matrix can be denoted as , which defines the relationships (or edges) between vertexes. Each element ij a  A can be generally computed by using the following: (1) the radial basis function [8] or the Gaussian similar- [2], where  is a parameter to control the width of the neighborhoods, the vectors i x and j x denote the spectral signatures associated to the vertexes i v and j v , respectively; (2) the distance function

Graph Laplacian
Once Α is given, the corresponding graph Laplacian matrix L can be defined as is a diagonal matrix representing the degrees of A , and i i j j d a   is the degree of the node i . To enhance the generali-zation ability of the graph, the symmetric normalized Laplacian matrix L can be given

Graph Fourier Transform
As L is a real symmetric positive semidefinite matrix, it has a complete set of orthonormal eigenvectors   R n i  u known as the graph Fourier modes [31]. The associated ordered real nonnegative eigenvalues   R n i   of the set of eigenvectors can be identified as the frequencies of the graph. The Laplacian is further diagonalized by the Fourier basis i.e., T  UU E , the L can be written as , its graph Fourier transform can be defined as ˆR T n   x U x , and its inverse is  x Ux [56]. Furthermore, the given basis functions of F can be equivalently represented by a set of eigenvectors of L [8] . Therefore, the F of f on a graph can be expressed as

Graph Construction
Whether the graph-based quasi-semi-supervised learning in the literature or the usual supervised learning in this study, both require the construction of graph data from the labeled and unlabeled samples using a graph Laplacian regularizer to smooth the classification function for the data manifold [2]. Accordingly, the high-dimensional hyperspectral data could be transferred into a low-dimensional subspace adapting to low-dimensional modeling and computation. Here, a graph was constructed with nodes and edges, where the nodes were specified by the unlabeled and labeled samples, whereas the edges specified the similarities among the labeled as well as the unlabeled samples. The effect of the Laplacian regularizer depends upon the construction of the graph adjacency matrix. As illustrated in Figure 2, the construction of the graph structure involved (1) the determination of localized graph adjacency relationships and (2) the calculation of multiple graph weights that have the number of samples. Finally, the correct label of land cover classes could be assigned to each patch-based hyperspectral cube after fitting the deep graph representation model in a standard supervised learning paradigm. The high-dimensional data distribution might form an overlap of multiple manifolds [2]. The existing methods assume that hyperspectral data are a single manifold (follows label smoothness assumption) or multiple well-separated manifolds (i.e., dissatisfy label smoothness assumption). A graph can be constructed with k -nearest neighbor ( k -NN) edges. The nearby nodes are strongly connected and have similar labels. Therefore, the original hyperspectral data are spectral vectors structured in regular grids requiring to be converted into graphs in the irregular (or non-Euclidean) domain before deeply learning graph representations, given hyperspectral data matrix , where n is the number of samples, and c (e.g., 10) is the feature dimension .
Because hyperspectral data contain redundant information of a huge volume, feature reduction is one of the widely used techniques in machine learning-based HSI processing [40,48]. In terms of deep learning methods represented by CNNs, the contribution of using principal component analysis (PCA) has been known to be limited. As for the presented GCN, we found the PCA transformation was workable to enhance the classification performance. In this regard, we tried the PCA preprocessing (note that the number of components was set as 10) to extract unsupervised features and reduce the effects of intrinsic data correlation and noise. Finally, the graph adjacency matrix could be constructed by the k -NN. The k -NN based graph construction method is most favored by the remote sensing community [2]. The adjacent matrices (i.e., graph weights) of a k -NN graph are computed by selecting k -connected neighbor nodes closest to the central node i x from the given data. That is, to compute the weighted graph of k -neighbors for pixels in R n c   X , a set of localized weight matrices (i.e., adjacent matrices)   i i n  A will be normally calculated among all labeled and unlabeled samples to participate in the classification procedure. So, the neighbor nodes i x and j x have an associated weight (e.g.,

=0
ij w means no connection). Notice that most graph construction methods use k -NN to generate adjacent matrices (adjacent graph). Nevertheless, k -NN might fail to obtain sufficient discriminant information.

Graph Convolution Filter
There are two strategies to define convolutional filters, either from a spatial approach or from a spectral approach [31]. The convolution theorem defines convolutions as linear operators diagonalized on the Fourier basis (represented by the eigenvectors of the Laplacian operator) [57]. Given two functions f and g , then their convolution can be written as  is the shifting distance, and  denotes the convolution operator (or using G  denotes convolution operator on the graph in the Fourier domain). Through the well-known theorems presented in [58], the convolution can be generalized to , where  is the element-wise Hadamard product. Hence, the convolution operation on a graph can be converted to define the Fourier transform F or find a set of basis functions. According to the graph Fourier transform addressed before, the convolution between f and g on a graph can be further expressed as

Localized Graph Convolution
When it comes to the construction of localized graph convolution operators, spatial approaches provide filter localization via setting the finite size of the kernel, and spectral approaches, such as spectral filtering, could provide a well-defined localization operator on graphs via convolutions with a Kronecker delta implemented in the spectral domain [56,59]. In this regard, spectral filtering would be an effective approach to construct a graph convolution filter. Assume that by imposing an additional spectral filter g  on the Fourier transform of a graph, we could have tion of the eigenvalues Λ of L with respect to the variable  . The parameter R n   is a vector of Fourier coefficients. As g  is a non-parametric filter, so . And then, we find that T g U could be equivalently written as   g  Λ or g  . That is, the convolution on a graph can be formulated as The non-parametric filters might have an intrinsic deficiency to be localized in node space and have a higher learning complexity [31]. In this case, the polynomial filters are introduced to parameterize the localized filters and defined as where the parameter is a vector of polynomial coefficients. The value at the vertex j of the filter g  centered at the vertex i is given by , where the kernel is localized via the convolution with a Kronecker delta function R n i   . The K relates to the minimum number of edges connecting two vertices on the graph (i.e., the shortest path distance).
As a result, spectral filters represented by th k order polynomials of the Laplacian are exactly K -localized enclosing K parameters.
When the graph convolution filter is localized with respect to K , the cost to filter a graph signal might be relatively high because of the multiplication with the Fourier basis U . A practical solution is to parameterize ( ) g  L as polynomial functions [31], e.g., the th k order truncated expansion of Chebyshev polynomials [60] and Lanczos algo-rithm [61], which can be computed recursively from L . Therefore, the localized graph convolution filter can be parameterized as the truncated expansion of the Chebyshev

Graph-Based CNN
Regarding the architecture of the graph-based CNN (see Figure 3), we have the following propagation rule for fitting a designed GCN [8], where A = A + I  , where E is the loss of energy over a mini-batch of S samples. Each of the above three computations boils down to K sparse matrix-vector multiplications and one dense matrix-vector multiplication. At the top of the graph neural networks, the objective function will be formulated to minimize the training loss and to ensure robustness in terms of convergence. Finally, the unlabeled samples could be classified into different known land cover categories.

Datasets and Settings
Four real hyperspectral datasets (see Figure 4 and Table 1)   The Indian Pines (IP) scene was gathered by the 224-band AVIRIS sensor in the wavelength range 400 to 2500 nm at a 20-m spatial resolution (i.e., 20 meters/pixel, or abbreviated as 20 m/p), in north-western Indiana. The IA dataset was a subset of the IP dataset, which consisted of 86 × 69 pixels and contained 200 spectral reflectance bands by removing bands covering the region of water absorption. The SV scene was also collected by the AVIRIS sensor (of which discarded 20 water absorption bands) over Salinas Valley, which comprised 512 × 217 pixels and had 16 classes, but a 3.7 m/p spatial resolution. Similarly, the SA scene was a small sub-scene of the SV scene, which comprised 86×83 pixels and had 6 classes. The PU scene was acquired by the ROSIS sensor over Pavia University, northern Italy. The PU dataset consisted of 103 spectral bands after the 13 noisiest bands were discarded, which had a size of 610 × 340 at a 1.3 m/p spatial resolution. Meanwhile, there were 9 classes included in the ground truth map.    It is of importance to select qualified training samples for fitting and evaluating the presented algorithm [63]. As listed in Table 1, the unlabeled training samples were coded as class C0 with the white background color. As for all the datasets, the size of the training set of all classes was set to 60, the same as the size of the validation set (i.e., 60 class n  ). Except for the samples included in the training and validation sets, all other samples were taken as the test set. The type of dataset might be a non-negligible factor in this study. The PU dataset was collected over an urban area. The IA, SA, and SV datasets were collected in a natural area. The IA and SA scenes belonged to simple datasets with low data complexity, while the SV and PU scenes appeared relatively complex, whether in spatial scale or landscape diversity. Moreover, the IA and SA datasets had a large proportion of ground truth samples relative to the entire scene and the lower intra-class variability. These intrinsic differences were crucial for investigating the representation ability of deep learning models on simple or complex data.
The experimental platform was a laptop equipped with an Intel Core i7-9750 12-core 2.60 GHz processor, a 256GB SSD, a 1T HDD, a 16 GB RAM, and an 8G GDDR6 NVIDIA RTX 2070 graphics card. The experimental procedures ran on the GPU aimed to achieve a higher computational speed. As only small training data was used, the time consumption of experiments could be controlled within a few minutes with the 5 and 10 independent runs and 200 epochs (or early stopping over 100 epochs) per run. It was relatively fast and showed promising efficiency in terms of complex networks. To ensure a complete comparison with CNN and to improve the traditional GCN, we tried to keep the parameter settings of the network structure as similar as possible. In this regard, we ran the experiments 5 and 10 times with each model for each dataset and kept the sizes of the training and validation sets independent. These sets of samples were randomly shuffled to reduce the possible influence of random effects, and the statistical accuracies were recorded.
The training details incorporated the accuracy and loss of the training and validation procedures. Many factors impacted these curves, which show whether a model is qualified enough or its parameter configuration is appropriate for the subsequent parameter learning. The experiments demonstrated that, for the simple datasets, i.e., the IA and SA datasets, the CNN and GCN models appeared to converge gradually and showed good convergence behavior. When it came to the complex datasets, i.e., the SV and PU datasets, the CNN model stabilized much faster at ~10 epochs, while its validation loss curve (Val_loss) appeared to have an abnormal behavior of convergence, and the corresponding GCN model was getting better slowly (see Figure 5). As a whole, the GCN model showed a better representation of the global convergence than the CNN model, though the GCN model had some difficulties in handling the local portions, which might be influenced by the learning rate. Relative to deep learning models, the machine learning algorithms (i.e., support vector machine (SVM)) often involve the training set (i.e., randomly selecting 60 samples of each category for fitting the classifier) and the test set, without the specialized validation set. In this study, we fine-tuned the hyperparameters (i.e., two parameters, the penalty parameter of the error term and the kernel coefficient for "rbf"). The implementation of the SVM classifier is based on "libsvm" with a one-vs.-one scheme. Moreover, all grid searches are calculated using the five-fold cross-validation. For the IA, SA, SV, and PU datasets, the best parameters obtained in the first independent run were: (1) the penalty parameter fixed at 10.0, 10.0, 100.0, and 100.0, respectively; and (2) the kernel coefficient for "rbf" determined to be 0.1, 0.1, 0.1, and 0.1, respectively. Finally, the best score was 0.8875, 0.9861, 0.9292, and 0.8481, respectively.

Classification Maps
The presented experiments regarding HSI classification were achieved based on the intensity values (i.e., not the reflectance values) distributed along with spectral bands. After the training and evaluating procedures, all the unlabeled samples were classified into the proper categories; then the classification maps would be particularly helpful to assess the final classification results qualitatively. As the experiments using each algorithm for different datasets had been randomly run 5 and 10 times, we first plotted the classification maps of SVM, CNN, and the presented GCN for the IA dataset in the five running times (see Figure 6). Subsequently, the classification maps of three algorithms in the 1 st run for all hyperspectral datasets were illustrated as well (see Figure 7). As shown in Figure 6, the classification maps obtained by the SVM algorithm appeared as scattered spots, as the SVM is a pixel-level classifier essentially. The misclassifications might often be caused by the high intra-class variability and the low inter-class variability among different land cover classes. The similar results of different algorithms were commonly seen in the different random runs. The subsequent accuracy statistics and the corresponding probability maps also supported such an analysis. Referring to Figure 7, the GCN model had promising outputs compared to the other two algorithms, i.e., SVM and CNN. Furthermore, most errors of commission and omission occurred in the non-homogeneous areas involving complex landscape structures or land surface materials. The misclassifications might be mainly caused by some inherent uncertainties between classes. It is obvious that the GCN model obtained fairly good results and surpassed the SVM and CNN algorithms.

Classification Accuracies
Three widely used accuracy metrics, i.e., the Kappa index (K), overall accuracy (OA), average accuracy (AA), were used to assess the classification results, which were derived from the site-specific confusion matrix. As Table 2 illustrated, the presented GCN undoubtedly obtained the best classification performance for all the used hyperspectral datasets. Concerning the CNN model for the SA and SV datasets, the resultant accuracies appeared unanticipated. The reason is that we expanded the number of epochs from 50 to 200, and the CNN model triggered the early stopping event when monitoring its validation loss. The larger epochs might lead to a worse convergence and a decrease in performance simultaneously. Meanwhile, such a result also disclosed that the differences between the simple and complex datasets might have a significant impact on measuring performance.  To analyze the misclassifications of the GCN model, we drew the confusion matrices for different datasets with the best performance based on the presented GCN model (see Figure 8). As for the IA dataset, we got accuracies {K: 0.9644; OA: 0.9750; AA: 0.9796}; there were 32 samples (0.04%) of Class 1 (Corn-notill) wrongly classified as Class 4 (Soybean-mintill) while there were 58 samples (0.03%) of Class 4 (Soybean-mintill) wrongly predicted as other classes, which might mean a potential inter-class negative influence. Because the SA dataset was relatively simple, the corresponding derived accuracies appeared almost saturated, i.e.,  The IA dataset was at the 5th random run, (b) the SA dataset was at the 3rd random run, (c) the SV dataset was at the 3rd random run, and (d) the PU dataset was at the 2nd random run.

Probability Maps
Probability density has been taken as an effective indicator to indicate the confidence of the classification output [63]. In this regard, the label assignment depends on the credible predictions with the maximum predicted probabilities and determines the final output maps. Probability maps are often utilized to observe the probability density and to find weak predictions. Therefore, we graphed the probability maps of each algorithm to show that the GCN model had clear advantages over the popular CNN model, and an apparent distinction could be reflected in the probability maps (see Figure 9).
Weak predictions in a holistic scene regarding the HSI classification task have been reported by scholars previously [63]. Figure 9 indicates the typical but not noticeable differences between the CNN and GCN models. That is, the maximum predicted probabilities of each hyperspectral cube whose central pixel was located at the edge of a category were relatively low. It seems that the GCN might be better at illustrating the boundaries among different land cover categories. Moreover, weak predictions might be more likely to occur in the cross areas and the areas covered by non-ground truth samples. Figure 9. The probability maps of the presented GCN and its competitors in the 1st run for four real hyperspectral datasets, i.e., (a) the IA dataset, (b) the SA dataset, (c) the SV dataset, (d) the PU dataset, corresponding to the experiment with five random runs. Note that the deeper the color, the weaker the prediction.

Time Consumption
The statistics of time cost are related to deep learning network structures. So, the efficiency of deep learning models can be approximately deduced as per the scales of network parameters. In practice, the training and test times are often recorded based on the clock setting of the computer operating system (see Table 3). The processing time with CPU plus GPU may depend on many possible factors, i.e., the randomness in neural networks, the efficiency of memory storage, and the difference of computational environment. Note that we tried to make the network structures of the presented CNN and GCN as comparable as possible, thus facilitating further improvement and contrastive analysis. Table 3. Total network parameters and time consumption (i.e., the average time of 5 and 10 random runs). Note that the numbers in parentheses regarding datasets and models indicate the number of samples and the number of network parameters, respectively. The differences between the presented CNN and GCN models relied on the differences between the convolution layer with a batch normalization layer and the graph convolution layer, under a comparable paradigm of network architecture design. From another perspective, both network structures had approximately the same network complexity, as we expected. Then by observing Table 3, except for the simple datasets, the presented GCN model cost ~2 times the training time than the CNN model. Such a finding could be further confirmed by the number of their network parameters (i.e., CNN: 1.22 × 10 5 ; GCN: 2.50 × 10 5 ).

Conclusions
Deep learning models have been extensively employed for HSI classification and have attracted increasing attention for their strong representation ability. Particularly, the nascent graph representation learning has shown good goodness for resolving the graph-structured data. In this study, we not only reviewed the recent publications related to the graph-based representation learning methods for HSI classification but also presented a novel graph-based spectral filtering approach that has promising benefits. It is worth mentioning that we found that the presented GCN model indeed has the advantage of illustrating the boundaries of different land cover classes by observing the weak predictions derived from the probability maps. In short, graph representation learning might represent future directions to enhance the research field of HSI classification. Future work would involve (i) testing the n-dimensional datasets where the number of components less than 10, and (ii) using the explained variance ratio for the PCA preprocessing.
Author Contributions: S.P. and Y.W. conceived and prepared the manuscript, and all relevant co-authors participated in writing and editing the paper. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: All data used during the study are available at http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.