Multiple Superpixel Graphs Learning Based on Adaptive Multiscale Segmentation for Hyperspectral Image Classification

Abstract: Hyperspectral image classification (HSIC) methods usually require many training samples to achieve good classification performance. However, a large number of labeled samples is difficult to obtain because labeling an HSI in a pixel-wise way is cost- and time-consuming. Therefore, how to overcome the problem of insufficient accuracy and stability under the condition of small labeled training sample size (SLTSS) is still a challenge for HSIC. In this paper, we propose a novel multiple superpixel graphs learning method based on adaptive multiscale segmentation (MSGLAMS) for HSI classification to address this problem. First, the multiscale-superpixel-based framework can reduce the adverse effect of an improperly selected superpixel segmentation scale on the classification accuracy while saving the cost of manually seeking a suitable segmentation scale. To make full use of the superpixel-level spatial information of different segmentation scales, a novel two-step multiscale selection strategy is designed to adaptively select a group of complementary scales (multiscale). To fix the bias and instability of a single model, multiple superpixel-based graphical models, obtained by constructing superpixel contracted graphs of the fusion scales, are developed to jointly predict the final results via a pixel-level fusion strategy. Experimental results show that the proposed MSGLAMS has better performance than other state-of-the-art algorithms. Specifically, its overall accuracy achieves 94.312%, 99.217%, 98.373% and 92.693% on Indian Pines, Salinas, University of Pavia, and the more challenging Houston2013 dataset, respectively.


Introduction
Hyperspectral images (HSIs) contain hundreds of bands [1][2][3], which provide rich spectral and spatial information [4,5]. HSI classification (HSIC) has always been an active research topic and is widely used in the fields of agriculture [6,7], environmental monitoring [8,9] and smart city design [10]. The representations of HSIs are spatially sparse while spectrally correlated; unlike object-level image classification, HSIC needs to perform dense classification or regression over the whole image, so extracting pixel-level spectral features while utilizing the local spatial information of HSIs is at the core of HSI processing. Recently, computer vision has become a mainstream research direction and has been applied in many fields, such as agronomy [11,12], plant science [13], and remote sensing. Since deep learning (DL) was applied to HSI classification, detailed spectral-spatial features can be extracted [14][15][16][17]. The authors in [18] proposed a 1D+2D HSIC method, which uses a one-dimensional (1D) convolution kernel to extract spectral features and a two-dimensional (2D) convolution kernel to extract spatial features. Hamida et al. [19] directly used three-dimensional (3D) convolutional neural networks (3D-CNN) to extract spatial-spectral features simultaneously. Considering the local details of HSIs, a framework based on kernel-guided deformable convolution and a double-window joint bilateral filter was proposed to capture spatial-spectral features flexibly [20]. These DL-based methods achieve excellent performance but often require a large number of training samples, and a limited number of training samples obviously reduces their classification accuracy.
However, manually annotating a large number of accurate training samples is extremely labor-intensive and time-consuming in real scenarios [21][22][23], and thus the small labeled training sample size (SLTSS) problem remains a challenge for HSIC [24,25].
Recently, superpixel (SP) segmentation has been widely applied to HSIC [26,27]. SP-based methods can generate irregular homogeneous regions with similar spatial information [28,29]. The local spatial similarity within an SP can easily enlarge the size of the training samples, so SP segmentation can be utilized to tackle the SLTSS problem [30,31]. The most commonly used SP segmentation methods are simple linear iterative clustering (SLIC) [32,33] and entropy rate SP (ERS) [34]. Initially, the support vector machine (SVM) and SP segmentation were first combined for HSIC by [35]. Then, Zheng et al. [36] proposed SP-guided training sample enlargement with distance-weighted linear regression-based classification (STSE_DWLR) to solve the SLTSS problem. Liu et al. [37] applied SP segmentation to deep learning and employed graph convolutional networks (GCNs) to extract global features; they proposed the CNN-enhanced GCN (CEGCN) to integrate the complementary advantages of CNN and GCN and achieved pixel- and superpixel-level feature fusion. Sellars et al. [38] introduced a superpixel contracted graph-based transductive learning framework, named superpixel graph learning (SGL). The SGL applies the advantages of SP segmentation to HSIC and has good classification performance. These SP-based methods usually apply SP segmentation as a preprocessing technique and achieve excellent performance; however, how to select a suitable segmentation scale (the number of SPs), which affects the results of the subsequent methods, still needs to be addressed. Some studies have demonstrated that different segmentation scales seriously affect the classification accuracy of SP-based methods [39][40][41]. To tackle this issue, Jia et al. [42] proposed an effective framework to calculate the optimal single segmentation scale and select multiple scales, but this method did not consider the spatial complementarity between different segmentation scales.
Although multiscale fusion can cope with this manual-dependence dilemma [43][44][45][46], non-adaptive multiscale selection methods may select unwanted fusion scales that make a negative contribution to the classification accuracy. Later, a novel adaptive multiscale segmentation (AMS) framework, which selects scales through an intra-inter discrimination index f(s_i), was provided by [47]. AMS adopts a two-step framework that first selects the basic scale and then selects the fusion scales according to the basic scale. The intra-inter discrimination index f(s_i) performs satisfactorily on small scale intervals; however, for hyperspectral datasets with complex spatial distributions, such as that of the University of Pavia, which requires a large scale to segment, this index may lose its effectiveness.
To reduce the bias and variance of the classification results when the number of training samples is very small, and to select a group of multiscale superpixel maps with complementary information, a multiple superpixel graphs learning method based on adaptive multiscale segmentation (MSGLAMS) is proposed for HSIC in this paper. In the proposed method, multiscale superpixel segmentation maps are first generated through SLIC, and an optimal reference scale (ORS) selection algorithm (ORSSA) is used to select an ORS map as the basic scale. Then, a multiscale selection algorithm based on sparse representation is used to select fusion scale maps that make a positive contribution to supplementing the spatial information of the ORS. Finally, the classification results of the different fusion scales are obtained by SGL, and these results are fused by voting to obtain the final result. The main contributions of this paper are: (1) The proposed MSGLAMS adopts a multiscale-superpixel-based framework, which can reduce the adverse effect of an improperly selected superpixel segmentation scale on the classification accuracy while saving the cost of manually seeking a suitable segmentation scale.
(2) To make full use of the spatial information of different segmentation scales, a novel two-step multiscale selection strategy is designed to adaptively select a set of complementary multiscales. Specifically, an optimal reference scale selection algorithm (ORSSA), which can select the single basic scale suitable for different datasets, is proposed to select an optimal reference scale (ORS). Then, a multiscale selection method based on sparse representation (MSSR) is proposed to select fusion scales that make a positive contribution to supplementing the spatial information of the ORS. (3) Multiple superpixel-based graphical models (SGL-based models), created via constructed superpixel contracted graphs of the determined scales (the fusion scale pool), are adopted to jointly predict the final classification results; that is, pixel-level labels are determined via the voting results of these different models. This Boosting-like fusion strategy can significantly reduce the bias and instability of the final results while keeping the similar inductive biases of the models.

Proposed Method
In this section, the proposed HSIC framework is explained in detail. As shown in Figure 1, for an HSI, principal component analysis (PCA) [30] is first applied to extract the top three principal components (PCs), and the candidate scale pool is constructed according to the size of the HSI. Then, the SLIC algorithm is used to segment the HSI at different scales to obtain multiscale segmentation maps. A novel optimal reference scale (ORS) selection algorithm (ORSSA) is proposed to select the ORS suitable for different datasets as the basic scale, and the classification results of the ORS map are used to construct a sparse dictionary. Next, a multiscale selection method based on sparse representation (MSSR) is used to select fusion scales that can supplement the spatial information of the enlarged samples of the ORS map. Finally, multiple SGL models are obtained via constructed superpixel contracted graphs with different scales (fusion scales), and the final results are jointly predicted via the voting results of these models.

Construction of the Candidate Scale Pool
Different hyperspectral datasets usually reflect different ground distributions. It is necessary to construct candidate scale pools [S_lower, ..., S_upper] suited to different datasets, because a meaningless SP segmentation scale does not help to improve the classification accuracy and wastes computing time. As shown in Figure 2, when the SP scale exceeds the appropriate range, the segmentation is meaningless. Small-scale segmentation maps cannot preserve the local detailed information of the HSI and cannot ensure that the pixels within an SP belong to similar land-cover categories (undersegmentation), while large-scale segmentation maps lose a large amount of spatial information (oversegmentation) [20,48]. Therefore, it is important to construct a candidate scale pool that excludes the meaningless segmentation scales and includes the indispensable ones. A candidate scale pool [S_lower, ..., S_upper] in [42] was defined in terms of C, the real number of land-cover classes, and R, the texture ratio of the HSI. However, C is usually much smaller than the lower boundary of the meaningful segmentation scale, especially for datasets such as that of the University of Pavia with large image sizes and a large number of edge pixels. In addition, the texture ratio R is easily affected by the choice of texture detector. The ideal interval is determined only by the dataset, so the lower boundary and upper boundary are defined by Equations (3) and (4), where L and W are the length and width of the HSI and K_1 = arg min(L, W). The S_lower and S_upper defined by this method will not miss meaningful segmentation scales. If the scale interval is defined as [S_lower : k : S_upper], the definitions of S_lower and S_upper only affect the runtime of the algorithm, but the setting of k affects the selection of fusion scales. As shown in Figure 3, the relation between the number of SP cluster centers and the segmentation scale is not linear.
If the setting of k is too small, the final fusion scale pool will contain many scales with the same segmentation (i.e., non-contribution scales). If k is too large, it is easy to lose important fusion scales (i.e., positive-contribution scales).
The split point S_L is defined analogously. In addition, k_s = k/2, k_m = 2k and k_l = 3k, where k = Δ/N_S and N_S is the number of multiscale segmentation maps (generally set to 30). Such a candidate scale pool can retain positive-contribution scales and exclude the meaningless segmentation maps.
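The three-interval candidate scale pool described above can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the split points S_M and S_L are not reproduced in the text, so they are placed at thirds of the interval here, and the two interval steps "k_m = 2k and k_m = 3k" quoted above are read as the middle step 2k and a large-interval step 3k.

```python
def candidate_scale_pool(s_lower, s_upper, n_s=30):
    """Sketch of the three-interval candidate scale pool.

    Assumed layout (split points S_M, S_L placed at thirds):
      small  interval [S_lower, S_M) with step k_s = k/2
      middle interval [S_M, S_L)     with step k_m = 2k
      large  interval [S_L, S_upper] with step 3k
    where k = (S_upper - S_lower) / N_S.
    """
    delta = s_upper - s_lower
    k = max(1, delta // n_s)
    s_m = s_lower + delta // 3          # assumed split point S_M
    s_l = s_lower + 2 * delta // 3      # assumed split point S_L
    small = list(range(s_lower, s_m, max(1, k // 2)))
    middle = list(range(s_m, s_l, 2 * k))
    large = list(range(s_l, s_upper + 1, 3 * k))
    return small, middle, large
```

The denser sampling of the small interval reflects Figure 3: the mapping from requested scale to actual SP count flattens at small scales, so finer steps are needed there to avoid duplicate segmentations.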

Optimal Reference Scale Selection
To select multiple fusion scales, an ORS should first be selected as the basic scale. As shown in Figure 4, the influence of the SP scale on the overall accuracy (OA) is not linear. Small-scale SPs contain more similar spectral features, while large-scale SPs contain richer spatial information, so it is difficult to choose a single segmentation scale with good classification performance. Therefore, the optimal reference scale selection algorithm (ORSSA) is designed to choose the optimal single scale suitable for different datasets by calculating the segmentation evaluation indexes of different scale maps. Small-scale maps that contain pure spectral features can yield high-quality enlarged samples; large-scale maps that contain better spatial information can yield more enlarged samples. The optimal single scale generally contains similar spectral features and rich spatial features, so it can only be sought, with balanced spatial-spectral features, in the middle scale pool.
Assume that the middle scale pool is S = {S_1, S_2, ..., S_K}, S ⊆ [S_M : k_m : S_L]. The kth scale has M SP regions, S^k = {S_1^k, S_2^k, ..., S_M^k}. Then, p_i ∈ S_m^k is the ith pixel within the mth SP of the kth scale, p_c is the corresponding cluster center of S_m^k, and (x_i, y_i) and (x_c, y_c) are the spatial coordinates of p_i and p_c, respectively. In order to ensure similar spectral features of the enlarged samples, the optimal single scale should contain small spectral differences between pixels within an SP, so the spectral evaluation index f_spectral(S_m^k) of the mth SP is defined in Equation (5), where I_m is the number of pixels within the mth SP. The overall spectral evaluation index F_spectral(S^k) of the kth scale is then given by Equation (6). In addition, each SP should contain a larger number of pixels, so that SPs with poor spatial information are not easily selected at random to enlarge the training samples. In order to ensure the size of the enlarged samples, the spatial evaluation index f_spatial(S_m^k) of the mth SP is defined in Equation (7), where d is the length of the diagonal of the mth SP S_m^k, and the overall spatial evaluation index F_spatial(S^k) of the kth scale follows in Equation (8). The final segmentation evaluation index of the kth scale, given in Equation (9), combines the normalized results of (6) and (8) with a spatial-spectral balance factor λ. The optimal reference scale is ORS = arg max(F(S^k)). The pseudocode of the ORSSA is given in Algorithm 1.

Algorithm 1 ORSSA
Require: The middle scale pool, S = {S_1, S_2, ..., S_K}, S ⊆ [S_M : k_m : S_L]; the number of SPs in each scale, M = {M_1, M_2, ..., M_K}; the parameter λ;
Ensure: The optimal reference scale ORS;
1: Obtain a series of middle scale maps through SLIC;
2: for k = 1 to K do
3:   for m = 1 to M_k do
4:     Calculate f_spectral(S_m^k) by (5);
5:     Calculate f_spatial(S_m^k) by (7);
6:   end for
7:   Calculate F_spectral(S^k) by (6);
8:   Calculate F_spatial(S^k) by (8);
9:   Compute the segmentation evaluation index of the kth scale F(S^k) by (9);
10: end for
11: ORS = arg max(F(S^k));
12: return ORS;

Superpixel Graph Learning
The superpixel graph learning (SGL) proposed by [38] is an SP contracted graph-based transductive learning framework for semi-supervised HSI classification. The SGL can be split into two major tasks: graph construction and label propagation.

Graph Construction
An HSI is denoted as I ∈ R^{L×W×B} and the reduced image after PCA is denoted as Î ∈ R^{L×W×A}. Assuming that the ith SP S_i contains n_i connected pixels, S_i = {p_{i,1}, ..., p_{i,n_i}}, three meaningful features first need to be extracted from each SP for graph construction: the mean feature vector of the ith SP; the weighted feature vector S_w^i, which is computed using the weights w_{i,z_j} between adjacent SPs; and the final centroidal location of each SP. Next, a weighted undirected graph G = (V, E, W) can be created from these features and the SP node set. The weight between two connected SPs S_i and S_j is constructed from two Gaussian kernels, where β is the balance factor between the mean and weighted features and the widths of the Gaussian kernels are determined by σ_s and σ_l. The edge set is constructed using k-nearest neighbors, and the edge weights are defined accordingly.
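The construction above can be sketched as follows. The exact kernel expressions of SGL are not reproduced in the text, so this sketch assumes a common form: a product of a Gaussian kernel on a β-weighted mix of the mean and weighted features and a Gaussian kernel on the centroid locations, with edges kept only between k-nearest neighbors.

```python
import numpy as np

def superpixel_graph(mean_feats, weighted_feats, centroids,
                     k=10, beta=0.5, sigma_s=1.0, sigma_l=1.0):
    """Sketch of the SP contracted-graph construction (assumed kernels).

    mean_feats, weighted_feats: (n_sp, n_feat) per-SP feature vectors.
    centroids: (n_sp, 2) SP centroid locations.
    Returns a symmetric (n_sp, n_sp) weighted adjacency matrix.
    """
    feats = beta * mean_feats + (1 - beta) * weighted_feats
    df = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    dl = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    w = np.exp(-df**2 / (2 * sigma_s**2)) * np.exp(-dl**2 / (2 * sigma_l**2))
    # keep only k-nearest-neighbour edges (symmetrised), zero the rest
    n = len(w)
    keep = np.zeros_like(w, dtype=bool)
    for i in range(n):
        order = np.argsort(dl[i])
        nn = [j for j in order if j != i][:k]
        keep[i, nn] = True
    keep = keep | keep.T
    adj = np.where(keep, w, 0.0)
    np.fill_diagonal(adj, 0.0)
    return adj
```

Contracting the image to an SP graph in this way keeps the node count at the number of superpixels rather than the number of pixels, which is what makes the subsequent label propagation tractable.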

Label Propagation
A set of labeled spectral pixels is first randomly selected from the original HSI. The final labeling can be specified using a matrix F ∈ R^{K×C}. The cost function Q(F) associated with the matrix F involves a regularization parameter μ > 0 and a labeling matrix Y ∈ R^{K×C} containing the labels of the K superpixels; F denotes the set of K × C matrices with non-negative entries. The labeling matrix is given by F* = arg min_{F∈F} Q(F). This cost function has a closed-form solution. The final labeling of the nodes is then computed as y_i = arg max_{j≤C} F_ij. The final classification result map of the SP map can be obtained by calculating the label of each SP.
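The closed-form expression itself is not reproduced in the text; the sketch below assumes the standard graph-regularized solution F* = (I + L/μ)^{-1} Y with the symmetric normalized Laplacian L, which is the usual closed form for cost functions of this type, not necessarily SGL's exact formula.

```python
import numpy as np

def label_propagation(adj, Y, mu=0.5):
    """Hedged sketch of the SGL label-propagation step.

    adj: (K, K) symmetric SP adjacency matrix.
    Y:   (K, C) label matrix, one-hot rows for labeled SPs, zeros else.
    Solves (I + L/mu) F = Y with the symmetric normalised Laplacian L,
    then assigns y_i = argmax_j F_ij.
    """
    d = adj.sum(1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    s = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lap = np.eye(len(adj)) - s
    F = np.linalg.solve(np.eye(len(adj)) + lap / mu, Y)
    return F.argmax(1)
```

Because the system is only K × K (one row per superpixel, not per pixel), the direct solve stays cheap even for large images.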

Multiscale Selection Based on Sparse Representation
As shown in Figure 4, the classification accuracy of a single scale is greatly affected by the selection of the SP scale. When the number of samples is limited, the random selection of samples seriously affects the spectral-spatial features of the enlarged samples after SP segmentation and thus affects the final classification results. The results shown in Figure 5 indicate that choosing a good single scale can improve the accuracy, but the robustness of a single scale is poor: the accuracy of the same single scale varies considerably across iterations, because each iteration randomly selects different samples to obtain the corresponding enlarged samples. After using a multiscale fusion strategy, complementary spatial information can improve the classification accuracy under SLTSS conditions, and complementary spectral features of the enlarged samples can improve the robustness of the classification accuracy. Hence, the multiscale fusion strategy has more advantages than a single-scale SP classification algorithm under the SLTSS condition, and the accuracy and robustness of classification can be obviously improved by multiscale fusion. Sparse representation can select pixels similar to the spectral features of the samples used to construct the dictionary [49][50][51][52]; therefore, when the classification result of the ORS map is used to construct the dictionary, the fusion scale maps similar to the ORS map can be selected by the representation residual. Consequently, the multiscale selection algorithm based on sparse representation (MSSR) can be exploited to select fusion scale maps that make a positive contribution to supplementing the spatial information of the ORS.
Traditional sparse representation assumes that a pixel in an HSI is represented as y ∈ R^B, where B is the number of spectral bands, and the structured dictionary is denoted by X = [X_1, X_2, ..., X_K] ∈ R^{B×D}, where K is the total number of classes and D is the number of training samples. In this algorithm, we suppose the HSI has K distinct classes and use the structured dictionary X = [X_1, X_2, ..., X_K] ∈ R^{B×N}, where N is the number of pixels of the ORS map. Assuming the classification result of the ORS map has N_j pixels of the jth class, the jth class subdictionary X_j is formed from these pixels. A superpixel S can then be represented by a linear combination of these reference pixels; in particular, the mth SP in the ith scale map is represented as S_m^i ≈ X α_m^i, where α_m^i is the sparse coefficient matrix. The sparse coefficient matrix α̂_m^i is calculated by minimizing the reconstruction error in the Frobenius norm ‖·‖_F under a sparsity constraint, and OMP [53] is implemented to solve this problem. Once α̂_m^i is obtained, the jth class reconstruction residual error, which is the difference between the original SP and the reconstructed SP, can be calculated. The residual matrix is obtained by calculating the minimum residual error of each SP, and the mean value E(S_i) of the residual matrix of the ith scale map is computed over the M SPs in the S_i-th scale map. A small mean value E(S_i) means that the segmentation information is more similar to that of the ORS map. The final fusion scale pool is determined by thresholding E(S_i) with a threshold T_s, where T_s is defined in terms of E_max and E_min, the maximum and minimum values of E(S_i), and a parameter κ. κ = 1 means that all scales are selected for fusion, and T_s = E_min means that only the ORS map selected by ORSSA is used for fusion. The classification result maps of S_F are obtained by the SGL. Subsequently, these result maps are fused to obtain the final classification result.
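The residual score E(S_i) for one candidate scale can be sketched as below. For simplicity this sketch replaces the OMP solver used in the paper with a class-wise least-squares fit, which produces the same kind of per-class reconstruction residual; the per-SP feature here is a single mean vector rather than the full pixel matrix.

```python
import numpy as np

def mssr_scale_residual(sp_means, dictionaries):
    """Hedged sketch of the MSSR residual score for one scale map.

    sp_means: list of per-SP feature vectors (length M, each (B,)).
    dictionaries[j]: (B, N_j) sub-dictionary X_j built from the pixels
    labelled class j by the ORS classification result.
    Returns the mean best-class residual E(S_i) for this scale.
    """
    residuals = []
    for s in sp_means:
        errs = []
        for X in dictionaries:
            # least-squares stand-in for the OMP sparse coding step
            alpha, *_ = np.linalg.lstsq(X, s, rcond=None)
            errs.append(np.linalg.norm(s - X @ alpha))
        residuals.append(min(errs))   # minimum residual over classes
    return float(np.mean(residuals))  # E(S_i)
```

Scales whose E(S_i) falls below the threshold T_s enter the fusion scale pool S_F; a low score means the scale's superpixels are well explained by the ORS-derived dictionary.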

A Pixel-Level Fusion Strategy for Multiple Graphical Models
In general, superpixel-based methods assume that the pixels within the same superpixel share the same label. These superpixel-level predictions can be sub-optimal for dense prediction tasks due to the discrepancy between superpixel-level and pixel-level prediction. Moreover, the instability and bias of a single-scale superpixel-based graph classifier come from the insufficient generalization ability of the single graph model; that is, a single graph cannot represent the distribution of a real HSI well. To close this gap, a pixel-level fusion strategy for multiple graphical models is proposed in this paper. Similar to Boosting [54], multiple graphical models, generated via constructed superpixel contracted graphs of the determined scales S_F = {S_1, S_2, ..., S_f}, are adopted to jointly predict the label of each pixel. The f constructed graphs are given in Equation (26). For a pixel p_i, the labels predicted by the different models are given in Equation (27), where F_ij can be obtained by Equation (18) in Section 2.3.2. The final label is then obtained by the majority voting rule presented in Equation (28). After predicting the label of each pixel, the final classification map is generated. The pseudocode of the MSGLAMS is given in Algorithm 2.
Algorithm 2 MSGLAMS
1: Input the HSI I ∈ R^{L×W×B};
2: Construct the candidate scale pool by Equations (3) and (4);
3: Reduced image Î ∈ R^{L×W×A} is obtained by PCA;
4: Apply the SLIC algorithm on the candidate scale pool to create multiscale segmentation maps;
5: Select the ORS by Algorithm 1;
6: Obtain the classification result of the ORS (l_1, l_2, ..., l_{L×W}) by SGL;
7: Construct a sparse dictionary X = [X_1, X_2, ..., X_K] ∈ R^{B×D} from the classification results of the ORS;
8: for S_i = S_lower to S_upper do
9:   Calculate E(S_i) through Equation (23);
10: end for
11: Determine the final fusion scale pool S_F by Equation (24);
12: Create multiple graphical models of S_F by Equation (26);
13: Fuse the multiple scale maps to obtain the final classification result by Equations (27) and (28);
14: return Final classification result;
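The majority-vote fusion of Equations (27) and (28) reduces, for f pixel-level label maps, to counting per-pixel votes and taking the most frequent class:

```python
import numpy as np

def fuse_by_voting(label_maps):
    """Pixel-level majority vote across the fusion-scale label maps.

    label_maps: list of (H, W) integer label arrays, one per SGL model
    in the fusion scale pool S_F. Ties resolve to the lowest class id.
    """
    stack = np.stack(label_maps)              # (f, H, W)
    n_classes = stack.max() + 1
    # count votes per class at every pixel, then take the argmax
    votes = np.stack([(stack == c).sum(0) for c in range(n_classes)])
    return votes.argmax(0)                    # (H, W) fused labels
```

Because each model votes at the pixel level rather than the superpixel level, a pixel that different scales assign to different superpixels can still receive the majority label, which is the point of the pixel-level fusion.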

Experimental Results and Analysis
In this section, the details of the experimental datasets and the experimental settings are first introduced. Then, the results of the proposed method and other state-of-the-art methods on the HSI datasets are compared. Finally, the effects of the ORSSA and MSSR algorithms are evaluated, and the key parameters of the proposed MSGLAMS are analyzed.

Datasets
Three HSI datasets are selected for experiments: Indian Pines, Salinas and University of Pavia. These datasets are collected by different sensors over different land covers and have been used for experiments in many related papers. The details of these datasets are given as follows:

Experimental Settings
In order to comprehensively verify the effectiveness of the proposed MSGLAMS, the experiments are divided into three parts. First, MSGLAMS is compared with other state-of-the-art methods. Second, as the two key components of the proposed MSGLAMS, the performances of ORSSA and MSSR are compared with their respective counterparts. Finally, the influence of the key parameters of MSGLAMS is analyzed. For the evaluation metrics, overall accuracy (OA), average accuracy (AA) and the kappa coefficient are adopted to quantify the classification performance. All experiments are repeated ten times to obtain the final average performance. The hardware platform is an Intel Xeon Silver 4210 2.4 GHz twelve-core CPU, a GeForce RTX 3090 GPU and 32 GB of memory.

Experimental Settings for Comparison with Other State-of-the-Art Methods
In order to validate the performance of our proposed MSGLAMS framework, seven state-of-the-art algorithms are selected for comparison. These methods include three SP-segmentation-based HSI classification methods: CEGCN [37], STSE_DWLR [36], and SGL [38]; two HSI classification methods designed for small-sample problems: KDCDWBF [20] and 3D-CNN [19]; and two common HSI classification methods: 1D+2D [18] and SVM. For the Houston2013 dataset, we still could not provide the memory needed by the CEGCN method even on a decent hardware platform (CPU: Intel Xeon Silver 4210 2.4 GHz twelve-core processor; GPU: GeForce RTX 3090; Memory: 32 GB).

Experimental Settings of ORSSA
The purpose of the proposed ORSSA is to select an ORS map with the best classification performance as the basic scale. Three mainstream single-scale selection algorithms are compared with ORSSA. First, the texture ratio proposed in [55] is used to calculate the optimal single scale (TRS); the ORSs calculated by TRS are 706, 1603 and 2765 for the Indian Pines, Salinas Scene, and University of Pavia datasets, respectively. Second, the resolution of the dataset is used to calculate the optimal single scale (RES) in [42]; the RES values are 1051, 3142 and 2216 for the Indian Pines, Salinas Scene, and University of Pavia datasets. Finally, the ORS can be adaptively selected through intra-inter parameters (ASS) in [47], with the parameter settings consistent with [47]. The spatial-spectral balance factor λ of ORSSA is set to 0.3 on the three datasets.

Experimental Settings of MSSR
The multiscale comparison experiment compares the impact of different multiscale selection methods on the classification results of the same classifier, SGL. The multiscale selection comparison methods include the multiscale selection formula (MSF) in [42], the multiscale SP number selection formula (MSNSF) in [56] and the adaptive multiscale selection (AMS) method proposed by [47]. The number of multiscales K in MSF and MSNSF is the same as the number of fusion scales of MSSR, and the basic scales of MSF and MSNSF are both set to the ORS. The parameter κ in Equation (25) of MSSR is set to {4, 4, 7} for the Indian Pines, Salinas Scene, and University of Pavia datasets, respectively. The influence of κ is analyzed in detail in Section 3.5.

Comparison of Results of Different Methods
In this section, seven state-of-the-art methods are compared with the proposed MSGLAMS framework. Tables 5-8 report the classification accuracy of MSGLAMS and the seven methods on the Indian Pines, Salinas, University of Pavia and Houston2013 datasets, and the cells with a green background represent the highest accuracy. The classification maps obtained by these methods are illustrated in Figures 6-9. It is easy to see that the accuracy of MSGLAMS far exceeds that of the common classification methods SVM and 1D+2D, because the common classification methods cannot solve the SLTSS problem. Furthermore, the classification accuracy of MSGLAMS also far exceeds that of the small-sample classification methods KDCDWBF and 3D-CNN. Comparing the three SP-segmentation-based small-sample classification methods CEGCN, STSE_DWLR and SGL, the OA of MSGLAMS exceeds that of the best-performing comparison method, SGL, by 5.66% and 4.65% on the University of Pavia and Indian Pines datasets, respectively, and its kappa exceeds that of SGL by 7.23% and 4.57% on these two datasets. CEGCN, STSE_DWLR, SGL and our MSGLAMS all perform well on the Salinas dataset, because the texture of the Salinas dataset is not complicated. For the more challenging Houston2013 dataset, MSGLAMS yields 92.906% overall accuracy, which outperforms KDCDWBF and SGL by a large margin, i.e., by 3.665% and 6.845%. It can be inferred that MSGLAMS can achieve excellent performance even when the spatial distribution of the HSI is complex. It should be noted that all class accuracies except stressed grass and synthetic grass reached top-1. Overall, these experimental results suggest that the proposed method can overcome the problem of unstable classification performance under small-sample conditions and thus effectively improve the classification accuracy.

Analysis of Experimental Results of the ORSSA and MSSR
To better display the effect and function of these two key components of the proposed MSGLAMS, the performance of the ORSSA and MSSR is compared with other single-scale and multiscale selection methods, respectively. The experimental results of the single-scale and multiscale selection methods are put in the same tables for comparison. Tables 9-11 show the classification accuracy (OA) of different scales on the three datasets. The bolded results represent the highest accuracy among the single-scale selection methods, and the cells with a green background represent the highest accuracy among both the single-scale and multiscale selection methods. It can be seen from the results that among all single-scale selection algorithms, the single scale selected by ORSSA has better performance. ASS is also an adaptive single-scale selection method, but because the training sample size is small, the intra-inter index f(s_i) of the SP maps is not necessarily positively correlated with the OAs. Therefore, the optimal single scale selected by the maximum value of the intra-inter index, arg max(f(s_i)), may not have the best performance. TRS and RES do not have the ability to adaptively select scales according to different SP segmentation methods, so the fixed scales chosen by TRS and RES may not be suitable for the SGL classifier. In addition, the classification accuracy of HSI is significantly improved by the multiscale fusion strategy. Especially for the University of Pavia dataset with its complex texture, the classification accuracy of multiscale fusion is much higher than that of a single scale. However, choosing different fusion scales will supplement different spatial information of the ORS and affect the final classification accuracy, and negative-contribution scales will degrade the quality and quantity of the enlarged samples and reduce the classification accuracy.
For example, the OA of MSF and MSNSF is lower than that of MSSR, because the fixed fusion scales selected by MSF and MSNSF may contain scales with negative contribution. AMS has the ability to adaptively select multiple scales, but it still relies too much on the intra-inter index f(s_i); since f(s_i) is not necessarily positively correlated with the OAs, AMS does not have the best performance. As shown in Tables 9-11, the OA of MSSR is higher than that of the other multiscale selection methods and much higher than that of single-scale SP segmentation, because under the SLTSS condition the fusion scales selected by MSSR provide complementary enlarged-sample information, which makes a positive contribution to improving the classification accuracy. Finally, with the increase in the number of samples, the OAs of different scales become very close, especially on the Salinas dataset with its simple texture.

Parameter Analysis
In this section, the influence of three key parameters of the MSGLAMS method is analyzed in detail. First, we jointly analyze the influence of the spatial-spectral balance parameter λ on the ORS selection and the influence of κ in Equation (25) on the classification results. Then, the effect of the number of training samples ({3, 5, 7, 10, 15} randomly selected samples per class) on several classifiers is examined on the Indian Pines, Salinas, and University of Pavia images.

Parameter Analysis of λ and κ
The setting of the spatial-spectral balance parameter λ determines the selection of the ORS used to construct the sparse dictionary, and ultimately affects the fusion scale pool. The interval of λ is set to [0 : 0.1 : 1]. When λ = 0, ORSSA selects the ORS based only on the spectral evaluation index F_spectral, i.e., ORS = S_upper, because S_upper contains the smallest spectral difference between pixels within the same SP. Conversely, λ = 1 means that ORSSA selects the ORS based only on the spatial evaluation index F_spatial, i.e., ORS = S_lower, because each SP in the S_lower scale map contains the largest number of pixels. The interval of κ in the fusion threshold T_S is set to [1 : 2 : 11]. When κ = 1, T_S = E_max, meaning all scales are fused. For κ = 11 or higher, the performance of MSGLAMS is close to that of ORSSA alone.
Figure 10 shows the influence of λ and κ on the overall accuracy for the three datasets, where a deeper color corresponds to higher performance. For the Indian Pines dataset, MSGLAMS achieves the best performance when λ = 0.3 and κ = 5. In addition, the results for κ = 11 show that the ORS selected by ORSSA performs best when λ = 0.3. When the κ value is small, especially κ = 5, the OA of MSGLAMS is higher than for other κ values. For the Salinas dataset, MSGLAMS shows the best performance when λ = 0.7 and κ = 5, and when λ = 0.4 the ORS selected by ORSSA achieves the best performance. Finally, for the University of Pavia dataset, MSGLAMS performs best when λ = 0.3 and κ = 7, and when λ = 0.3 ORSSA yields the best performance.
Two conclusions can be drawn from the above analysis. First, a small spatial-spectral balance factor λ (i.e., a small spectral difference between pixels within the same SP) yields a better-performing ORS. Consequently, to obtain a best-performing ORS for constructing the sparse dictionary, λ can be set to 0.3.
Second, fusing poor scales can reduce the accuracy (e.g., when κ = 1 or κ = 3), while too few fusion scales may fail to capture the complete multiscale spatial information (e.g., when κ = 9 or κ = 11). For typical HSIs, κ can be set to 5, but for datasets with complex textures (such as University of Pavia), setting κ = 7 achieves better classification results.
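The limiting behavior described above (λ = 0 falls back to the spectral index, λ = 1 to the spatial index; κ = 1 admits all scales, large κ prunes most of them) can be sketched as follows. Since Equation (25) is not reproduced here, the linear combination in select_ors and the threshold rule T_S = E_max/κ in fusion_scale_pool are illustrative assumptions, chosen only to match those limits:

```python
import numpy as np

def select_ors(f_spatial, f_spectral, lam):
    """Pick the optimal reference scale (ORS) by a weighted score.

    Hypothetical linear form: lam = 0 reduces to the spectral index
    F_spectral, lam = 1 to the spatial index F_spatial, matching the
    limiting behavior described in the text."""
    score = lam * np.asarray(f_spatial) + (1.0 - lam) * np.asarray(f_spectral)
    return int(np.argmax(score))

def fusion_scale_pool(errors, kappa, e_max):
    """Keep candidate scales whose error stays below a threshold T_S.

    Illustrative rule T_S = e_max / kappa: kappa = 1 gives T_S = e_max
    (all scales fused); a large kappa keeps few scales, so behavior
    approaches using the ORS alone."""
    t_s = e_max / kappa
    return [i for i, e in enumerate(errors) if e <= t_s]
```

For example, with per-scale scores F_spatial = [1, 2, 3] and F_spectral = [3, 2, 1], λ = 0 selects scale 0 and λ = 1 selects scale 2.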

Effect of Different Number of Training Samples
The effect of the number of training samples on several classifiers is examined on the Indian Pines, Salinas, and University of Pavia images in this section. The parameters of all classifiers are kept the same as in Section 3.3, and {3, 5, 7, 10, 15} samples per class are randomly selected as training samples. Figure 11 shows the OA of the different classifiers for the different numbers of randomly selected training samples. The results in Figure 11 show that the OA of all eight classifiers increases with the number of training samples. In addition, MSGLAMS performs best under all training sample sizes, and its advantage becomes more obvious as the number of selected samples decreases: when the number of training samples is small (e.g., three or five per class), the OA of MSGLAMS is significantly higher than that of the comparison methods. This result demonstrates the clear advantage of the proposed MSGLAMS under the SLTSS condition.
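The sampling protocol above (a fixed number of randomly selected pixels per class) can be sketched as below; the function name and seed handling are illustrative, not the paper's implementation:

```python
import numpy as np

def sample_per_class(labels, n_per_class, seed=0):
    """Randomly draw n_per_class training indices from each class.

    labels: 1-D array of ground-truth class ids (unlabeled pixels
    assumed already excluded). Returns sorted training indices;
    classes with fewer than n_per_class pixels contribute all of them.
    """
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        pick = rng.choice(idx, size=min(n_per_class, idx.size), replace=False)
        train_idx.extend(pick.tolist())
    return np.sort(np.array(train_idx))
```

Repeating this draw with different seeds and averaging the resulting OAs is the usual way to report stable accuracies under such small sample sizes.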

Running Time Comparison
To comprehensively evaluate the proposed method, the corresponding experiments are conducted on four datasets. Table 12 presents the runtime results of the different methods. For the small-size datasets, such as Indian Pines, Salinas and University of Pavia, SVM and the lightweight networks (1D+2D and 3D-CNN) yield shorter runtimes. For the large-size dataset, Houston2013, all methods require more runtime. Compared with the first three methods, STSE_DWLR and SGL require more computation time. The core of GCN-M is a graph convolutional network, which needs to feed the whole HSI into the network, so GCN-M incurs a much higher computational cost. KDCDWBF also requires more runtime due to its extra post-processing procedure. Our proposed method likewise has a higher computational cost, because MSGLAMS needs to spend considerable time building multiple superpixel contracted graphs; when the size of an HSI is large, constructing the large-scale superpixel maps requires even more time.

Conclusions
This paper proposes a multiple superpixel graphs learning method based on adaptive multiscale segmentation for HSI classification (MSGLAMS). MSGLAMS consists of two major parts: the selection of a multiscale, and the construction of multiple superpixel-based graphical models that jointly predict the classification results via a pixel-level fusion. For the selection of a multiscale, a novel two-step multiscale selection strategy is designed to adaptively select a set of complementary scales. Initially, an optimal reference scale selection algorithm (ORSSA) is proposed to select the optimal reference scale (ORS) as a basic scale from the built candidate scale pool. Then, a multiscale selection algorithm based on sparse representation (MSSR) is developed to select the fusion scales, which make a positive contribution by supplementing the spatial information of the ORS. Finally, multiple superpixel-based graphical models are obtained by constructing superpixel contracted graphs of the determined scales (fusion scales), and the final results are jointly predicted via the voting results of these models. The proposed multiscale-superpixel-based framework reduces the adverse effect of an improperly selected superpixel segmentation scale on the classification accuracy while saving the cost of manually seeking a suitable segmentation scale. To make full use of the spatial information of different segmentation scales, the two-step multiscale selection strategy adaptively selects a set of complementary scales. To fix the bias and instability of a single model, the multiple superpixel-based graphical models obtained from the superpixel contracted graphs of the fusion scales jointly predict the final results via a pixel-level fusion strategy. This Boosting-like prediction strategy can significantly improve the accuracy and robustness of the classification results, especially when the number of labeled samples is small.
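The pixel-level fusion of the scale-specific predictions can be approximated by a per-pixel majority vote, as sketched below. This assumes equal-weight voting across the label maps produced by the graphical models at the fusion scales; any weighting used in the paper's fusion rule is not reproduced here:

```python
import numpy as np

def fuse_predictions(pred_maps):
    """Fuse label maps from multiple superpixel-graph models by voting.

    pred_maps: list of (H, W) integer label maps, one per fusion scale.
    Returns an (H, W) map where each pixel takes the most frequent
    label across scales (ties resolved toward the smaller label id).
    """
    stack = np.stack(pred_maps)          # shape (S, H, W)
    h, w = stack.shape[1:]
    fused = np.empty((h, w), dtype=stack.dtype)
    for i in range(h):
        for j in range(w):
            vals, counts = np.unique(stack[:, i, j], return_counts=True)
            fused[i, j] = vals[np.argmax(counts)]
    return fused
```

With an odd number of fused scales, a single model's erroneous label at a pixel is overruled whenever the other models agree, which is the intuition behind the robustness gain under small labeled sample sizes.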
The experimental results on the four datasets indicate the superiority of the proposed MSGLAMS over several well-known classifiers in terms of both qualitative and quantitative results, especially when the number of selected labeled samples is small.
The proposed method achieves excellent performance on the four datasets, but it incurs a higher computational cost for HSIs of larger size. In the future, we will try to reduce the computational complexity and adopt CUDA to accelerate MSGLAMS.