Multiscale Weighted Adjacent Superpixel-Based Composite Kernel for Hyperspectral Image Classification

Abstract: This paper presents a composite kernel method (MWASCK) based on multiscale weighted adjacent superpixels (ASs) to classify hyperspectral images (HSIs). The MWASCK adequately exploits the spatial-spectral features of weighted adjacent superpixels to guarantee that more accurate spectral features can be extracted. Firstly, we use a superpixel segmentation algorithm to divide the HSI into multiple superpixels. Secondly, the similarities between each target superpixel and its ASs are calculated to construct the spatial features. Finally, a weighted AS-based composite kernel (WASCK) method for HSI classification is proposed. To avoid seeking the optimal superpixel scale and to fuse the multiscale spatial features, the MWASCK method uses multiscale weighted superpixel neighbor information. Experiments on two real HSIs indicate the superior performance of the WASCK and MWASCK methods compared with several popular classification methods.


Introduction
Hyperspectral images can be regarded as a collection of single-band images acquired in response to different spectral bands. The abundant spectral bands contain a large amount of spectral information, which gives hyperspectral images (HSIs) a wide range of application prospects [1], such as classification [2], unmixing [3], target detection [4,5], etc. In recent decades, HSI classification has received extensive attention from scholars in the remote sensing field. Given a labeled training set, classification labels each pixel with a corresponding category according to the spectral features of the target pixel. Hence, many approaches have emerged to classify HSI, such as the powerful pixel-wise classifier called the support vector machine (SVM) [6], maximum likelihood [7], sparse representation classification (SRC) [8][9][10], collaborative representation classification (CRC) [11], etc.
Most approaches exploit only spectral features to classify the HSI, without any spatial information, which makes them sensitive to noise and unable to obtain satisfactory results. As demonstrated in [12], hyperspectral data should be viewed as a textured image, not simply as a set of unrelated pixels. For this reason, many HSI classification methods combining spectral and spatial features have been proposed, and satisfying classification results have been obtained. Classical spatial feature extraction methods include wavelets [13], the Gabor filter [14], the 3-D Gabor filter [15], and other spatial operators that exploit the image texture information. The extended morphological profiles [16] method utilizes a series of successive morphological filters to capture the spatial features of adjacent pixels. Moreover, methods based on Markov random fields (MRF) [17][18][19] have achieved excellent performance in classifying hyperspectral images. The joint sparse representation [20][21][22][23][24] methods achieve a smoother result by jointly representing the adjacent pixels while representing the target pixels. Furthermore, low-rank representation [25][26][27][28][29][30] approaches have also been applied to classify the HSI. In addition, several kernel-based spatial-spectral approaches have been developed to integrate spatial-spectral features. For instance, the composite kernel (CK) [31] method (i.e., SVMCK) replaces each target pixel with the mean value of the square neighborhood centered on it, so as to extract spatial features, and thus shows good classification performance. On this basis, many multiple kernel learning methods, such as the extreme learning machine with CK [32], CK discriminant analysis [33], and subspace multiple kernel learning [34], have also been used to classify HSI, effectively improving the classification accuracy.
Unlike CK, the spatial-spectral kernel (SSK) constructs a single kernel function to exploit the spatial and spectral features in feature space. Many SSK-based [35][36][37] methods have also achieved satisfactory results. However, the classification results of CK and SSK tend to be too smooth, with blurred edges and small targets lost, since the region used to extract spatial information is usually a square region centered on the target pixel, and the fixed-size square neighborhood leads to insufficient use of the spatial information of the target pixel.
The ideal neighborhood should be one that can accommodate different HSI structures in size and shape and has a similar spectrum. The adaptive non-local strategy and the global-based non-local strategy are the most commonly used methods to obtain homogeneous regions. Both assume that the original cluster is composed of two non-overlapping subclusters, only one of which is effective, and they achieve excellent classification performance in [38][39][40]. Superpixel segmentation has been extensively developed in computer vision [41][42][43] in recent years. According to the texture structure of the image, a superpixel segmentation algorithm can cluster the image into many non-overlapping homogeneous regions, and each superpixel adapts its shape and size to the local texture structure. Hence, various superpixel-based approaches have been proposed to classify HSI. For instance, in [44], the superpixel-based composite kernel (SCK) method captures spatial features by calculating the mean of each superpixel. In [45], the superpixel-based multiple kernels (SCMK) method utilizes the spectral and spatial features between and within superpixels through three kernels. In [46], the region-based relaxed multiple kernel (RMK) method achieves multiscale feature fusion to obtain the spatial features by kernel fusion. In [47], the adjacent-superpixel-based multiscale spatial-spectral kernel (ASMGSSK) method combines an adjacent superpixels (ASs)-based strategy and multiscale feature fusion to obtain the spatial-spectral information. The aforementioned methods achieve outstanding classification performance. However, in HSI, adjacent pixels belonging to different classes influence each other [47]. For this reason, most superpixel-based methods only consider the inner information of each superpixel, which makes it hard to preserve edge regions.
In this paper, we present a composite kernel based on multiscale weighted adjacent superpixel (MWASCK) method to classify HSI. Firstly, we divide the HSI into multiscale superpixels by adopting the entropy rate superpixel segmentation (ERS) [43] algorithm. Secondly, on each scale, the similarities between each current AS and its neighbor ASs are calculated to construct the spatial features. At this time, we can obtain multiscale spatial features. Finally, a multiscale composite kernel approach combining original spectral features with multiscale spatial features is proposed.
The remainder of this paper is organized as follows. Section 2 introduces SVM with CK and superpixel multiscale segmentation techniques. In Section 3, we closely describe the details of the proposed methods. Experimental results and related parameter analysis are given in Section 4. Section 5 concludes this paper.

CK with SVM
At present, SVM is one of the best-known supervised learning algorithms. Its purpose is to find a hyperplane that correctly separates the data on its two sides. Given a set of labeled training data {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^N and y_i ∈ {−1, +1}, and a mapping function φ, which maps the data from the current space to a higher-dimensional space in which the nonlinear distribution of the data becomes linear, SVM attempts to find a maximum-margin classification hyperplane by solving the Lagrangian dual problem:

max_α ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j K(x_i, x_j),   (1)

which is constrained to 0 ≤ α_i ≤ C and ∑_i α_i y_i = 0, i = 1, ..., n, where α_i is a Lagrange multiplier. The kernel function K can be expressed as the inner product between two instances after the nonlinear transformation, as follows:

K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.   (2)

Using the kernel function K in place of the inner product, the decision function of the nonlinear SVM obtained by solving the dual problem is as follows:

f(x) = sgn( ∑_{i=1}^{n} α_i y_i K(x_i, x) + b ),   (3)

where b is a linear classifier parameter.
The following radial basis function (RBF) kernel can achieve the same performance as other nonlinear kernel functions with fewer parameters, and is one of the most widely used kernel functions in SVM:

K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²).   (4)

The CK function is formulated from the spectral and spatial information and should fulfil Mercer's condition [48]. Let x_i^spe denote the spectral information of a pixel x_i. The CK method utilizes the mean or variance of the square region around the target pixel x_i to obtain the spatial information x_i^spa. The spectral kernel K_s and spatial kernel K_w can be computed via (4):

K_s(x_i, x_j) = exp(−‖x_i^spe − x_j^spe‖² / 2σ_s²),   (5)

K_w(x_i, x_j) = exp(−‖x_i^spa − x_j^spa‖² / 2σ_w²),   (6)

where σ_s and σ_w are the hyperparameters of K_s and K_w, respectively. Thus, the CK can be constructed as follows:

K_CK(x_i, x_j) = µ K_s(x_i, x_j) + (1 − µ) K_w(x_i, x_j),   (7)

where µ is the spectral kernel weight.
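As an illustrative sketch (not the authors' code), the RBF kernel and the composite kernel K = µK_s + (1 − µ)K_w can be computed as follows; the array names are assumptions, and the default parameter values simply mirror settings reported later in the experiments:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between row-wise sample sets A and B."""
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    # clip tiny negative values caused by floating-point round-off
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def composite_kernel(spe_a, spe_b, spa_a, spa_b, mu=0.1, sigma_s=2**-2, sigma_w=2**-7):
    """Composite kernel: mu-weighted sum of a spectral and a spatial RBF kernel."""
    K_s = rbf_kernel(spe_a, spe_b, sigma_s)  # spectral kernel K_s
    K_w = rbf_kernel(spa_a, spa_b, sigma_w)  # spatial kernel K_w
    return mu * K_s + (1.0 - mu) * K_w
```

Since both component kernels are valid RBF kernels, their convex combination also fulfils Mercer's condition.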

Superpixel Multiscale Segmentation
In recent years, more and more HSI classification methods exploit spatial information through superpixels. According to texture features, the image can be segmented into non-overlapping homogeneous regions (i.e., superpixels) of similar size. Pixels in HSI usually have hundreds of spectral bands, and most of the information in these bands has little influence on the classification results. Therefore, we use principal component analysis (PCA) [49] to remove the information that has little effect on classification and thereby improve the segmentation efficiency.
In this paper, since the most important information of the HSI is contained in the first principal component (PC), we adopt the powerful entropy rate superpixel (ERS) [43] segmentation algorithm to obtain the segmentation map from the first PC image. Firstly, the superpixel number is given as Q, and the first PC image is represented by a graph G = (V, E), where V is the vertex set corresponding to the pixels of the PC image and E is the edge set representing the similarity among adjacent pixels. Then, ERS aims to form compact and homogeneous superpixels by finding a subset of edges A ⊆ E. The objective function is as follows:

max_A H(A) + λ B(A),  s.t. A ⊆ E,   (8)

where H(A) represents the entropy rate of a random walk on the first PC image, B(·) is the balancing term, which makes the sizes of the superpixels similar, and λ is the equilibrium parameter between H(A) and B(·). The first PC image is segmented into superpixels by using a greedy algorithm [50] to maximize the above objective function. For superpixel multiscale segmentation, a variety of multiscale segmentation methods have been proposed; in this paper, we directly adopt the approach in [47]. Given the base superpixel number Q and the scale number M, the number of superpixels per scale is obtained by doubling the base number at each scale:

Q_s = 2^(s−1) Q,  s = 1, ..., M.   (9)

After Q_s is computed, the segmentation map of each scale can be obtained by ERS directly.
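The per-scale superpixel numbers follow a doubling rule, consistent with the multiscale settings {100, 200, 400, 800, 1600, 3200} used later in the experiments; a one-line sketch:

```python
def multiscale_superpixel_numbers(Q, M):
    """Superpixel number at each scale s = 1..M, doubling the base number Q."""
    return [Q * 2 ** (s - 1) for s in range(1, M + 1)]
```

For example, Q = 100 with M = 6 reproduces the Indian Pines multiscale setting, and Q = 200 with M = 6 reproduces the University of Pavia setting.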

Weighted Adjacent Superpixel-Based Composite Kernel (WASCK)
We propose the WASCK method to classify the HSI by utilizing spectral-spatial features. The spatial neighborhood used in CK is a square region centered on each target pixel, which makes objects affected by the background. To reduce this effect, the SCK method uses the information of superpixels to find the homogeneous region. Furthermore, we use the adjacent superpixel information and the location information of each superpixel to construct weighted ASs. The weighted ASs strategy can reduce the impact of undesired superpixels by assigning different weights to the adjacent superpixels. Figure 1 shows a simple example of different spatial region selection strategies. It can be seen that there are two different ground object targets, denoted green and red, respectively. Figure 1a-c show the strategies of the square neighborhood, the superpixel neighborhood and the weighted ASs neighborhood, respectively. Figure 1d-f show the corresponding features extracted under the three strategies. Since the size of the square neighborhood is chosen manually and fixed, for a target pixel x_i, if the size is too large, it will contain irrelevant pixels (i.e., red pixels); on the contrary, if the size is too small, effective spatial information is left out. It is difficult to find a suitable size for all target pixels, so the spatial features cannot be fully extracted for classification (see Figure 1d). Figure 1b shows the superpixel-based strategy. The size and shape of each superpixel adapt to the different spatial structures. However, this strategy may make each superpixel too small to obtain effective spatial information on account of over-segmentation (see Figure 1e). Figure 1c shows the weighted ASs-based strategy, where {S_i1, S_i2, S_i3, S_i4} are superpixels adjacent to the target superpixel S_i and {D_i1, D_i2, D_i3, D_i4} are the corresponding centroids.
By assigning smaller weights to dissimilar superpixels (i.e., S_i4), this strategy maintains ASs-based homogeneous regions to a large extent and effectively reduces the adverse effects of dissimilar superpixels. The color of the features also demonstrates that the weighted ASs neighborhood produces the most accurate result (see Figure 1f). Therefore, the weighted ASs strategy can effectively exploit the spatial-spectral features of ASs. The WASCK method integrates the weighted ASs strategy and CK together. Let X ∈ R^(B×I) be the HSI, where B is the depth of the HSI and I is the number of pixels. Firstly, given a superpixel number P, the HSI X can be segmented into P superpixels {S_1, ..., S_i, ..., S_P} by ERS on the first PC image.
For each superpixel S_i, its adjacent superpixels can be denoted as {S_i1, ..., S_ik, ..., S_ip}, where p is the number of adjacent superpixels. At the same time, we can get the corresponding centroids {D_i1, ..., D_ik, ..., D_ip} of the adjacent superpixels. Since the feature of each superpixel can be represented by its mean pixel S_ik^mean, the weight of each adjacent superpixel can be calculated from S_ik^mean and D_ik. In order to exploit the spatial features and location information of the adjacent superpixels, the spatial information of x_i can be denoted as:

x_i^spa = ( S_i^mean + ∑_{k=1}^{p} w_ik^d w_ik^r S_ik^mean ) / ( 1 + ∑_{k=1}^{p} w_ik^d w_ik^r ),   (10)

where w_ik^d = exp(−‖D_i − D_ik‖² / 2σ_d²) and w_ik^r = exp(−‖S_i^mean − S_ik^mean‖² / 2σ_r²) represent the correlation of the location information and of the spectral information between the current superpixel and its adjacent superpixels, respectively. σ_d and σ_r are the bandwidth parameters of the Gaussian function, which control its radial action range. Then, the WASCK we construct can be expressed as follows:

K_WASCK(x_i, x_j) = µ K_s(x_i^spe, x_j^spe) + (1 − µ) K_w(x_i^spa, x_j^spa).   (11)

After the kernel function is obtained, the decision formula is obtained by substituting Equation (11) into Equation (3).
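A minimal sketch of the weighted-AS spatial feature, assuming each superpixel is summarized by its mean spectrum and centroid, and that the two Gaussian weights are multiplied and then normalized (variable names are illustrative):

```python
import numpy as np

def weighted_as_feature(mean_i, cent_i, adj_means, adj_cents, sigma_d, sigma_r):
    """Spatial feature of superpixel S_i: weighted average of its own mean
    spectrum and those of its p adjacent superpixels. Each neighbour's weight
    multiplies centroid proximity (w_d) with mean-spectrum similarity (w_r)."""
    w_d = np.exp(-np.sum((adj_cents - cent_i) ** 2, axis=1) / (2.0 * sigma_d ** 2))
    w_r = np.exp(-np.sum((adj_means - mean_i) ** 2, axis=1) / (2.0 * sigma_r ** 2))
    w = w_d * w_r                         # combined weight per adjacent superpixel
    return (mean_i + w @ adj_means) / (1.0 + w.sum())
```

Dissimilar or distant neighbours receive exponentially small weights, so the feature stays close to the mean of the homogeneous region rather than being pulled toward a different class.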

Multiscale Weighted Adjacent Superpixel-Based Composite Kernel (MWASCK)
The WASCK method only uses single-scale weighted ASs to extract the spatial features. Here, the fusion of superpixel spatial information under different segmentation scales is considered to improve the classification performance. To avoid seeking the optimal superpixel scale and to fuse the multiscale spatial features, the MWASCK method is built in the framework of the WASCK as follows:

K_MWASCK(x_i, x_j) = µ K_s(x_i^spe, x_j^spe) + (1 − µ) (1/M) ∑_{s=1}^{M} K_w^s(x_i, x_j),   (12)

where M can be selected empirically and represents the number of total scales, and K_w^s, the spatial kernel on scale s, is acquired directly from the spatial kernel K_w of Equation (11). As in the above case, we can obtain the decision formula by substituting Equation (12) into Equation (3). Figure 2 shows the flowchart of using weighted ASs features to classify HSI via MWASCK.
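Assuming the per-scale spatial kernels are fused by a simple average, as in Equation (12) above, the MWASCK Gram matrix can be sketched as:

```python
import numpy as np

def mwasck_kernel(K_s, K_w_scales, mu=0.1):
    """Multiscale composite kernel: spectral kernel K_s combined with the
    average of the M per-scale spatial kernels K_w^s."""
    K_w_multi = np.mean(K_w_scales, axis=0)   # fuse the M scales
    return mu * K_s + (1.0 - mu) * K_w_multi
```

Averaging valid kernels again yields a valid kernel, so the fused matrix can be fed to any kernel classifier without further adjustment.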

Datasets
Indian Pines: This is a 145 × 145 image acquired by the AVIRIS sensor over the Indian Pines test site in Northwest Indiana. Each pixel contains 220 spectral bands covering wavelengths from 0.4 to 2.5 μm. After removing the 20 bands absorbed by water vapor, 200 bands remain. Table 1 details the 16 available reference categories. The training samples for the experiments are randomly selected as 3% of each category, and the rest are used for testing.


Experimental Results
In the experiments, several kernel-based classification methods are compared with ours: the support vector machine with RBF kernel (SVM-RBF) [6], the classic support vector machine with composite kernel (SVMCK) [31], the superpixel-based multiple kernels (SCMK) [45] method, the region-based relaxed multiple kernel (RMK) [46] method and the multiscale spatial-spectral kernel based on adjacent superpixels (ASMGSSK) [47]. The overall accuracy (OA), average accuracy (AA) and Kappa coefficient are used as the evaluation criteria of algorithm performance. Moreover, the averages over ten randomized trials are reported as the final experimental results. The optimal parameters of the proposed WASCK and MWASCK approaches are set as follows. The spectral kernel weight µ is 0.1, σ_d is set to 2^−3, the σ_r and σ_s in Equation (5) are set to 2^−2, and the σ_w in Equation (6) is set to 2^−7 on the Indian Pines dataset and 2^−6 on the University of Pavia dataset. To simplify the statement, we use snum to denote the number of superpixels. The optimal snum for the WASCK is 1400 on the Indian Pines dataset and 1100 on the University of Pavia dataset. The multiscale snum for the MWASCK is {100, 200, 400, 800, 1600, 3200} on the Indian Pines dataset and {200, 400, 800, 1600, 3200, 6400} on the University of Pavia dataset. The comparison methods use their optimal parameter settings. In addition, the SVM training parameters are selected by fivefold cross-validation. The SVM-based methods in this paper are computed with the LIBSVM [51] toolbox. Figures 3 and 4 show the classification maps of the various approaches for the two datasets. Obviously, the map of the SVM-RBF method, which only considers the spectral information, shows many noisy estimations. The SVMCK method uses the spatial features of HSI to acquire a smoother classification map.
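The experiments use the LIBSVM toolbox; an equivalent precomputed-kernel workflow can be sketched with scikit-learn's SVC (a substitution for illustration, not the authors' toolchain), where the composite kernel matrix plays the role of the Gram matrix:

```python
import numpy as np
from sklearn.svm import SVC

def train_predict_precomputed(K_train, y_train, K_test_train, C=1.0):
    """Train an SVM on a precomputed (e.g., composite) kernel Gram matrix.
    K_train: (n_train, n_train) kernel among training samples.
    K_test_train: (n_test, n_train) kernel between test and training samples."""
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, y_train)
    return clf.predict(K_test_train)
```

In practice C (and the kernel hyperparameters) would be chosen by the fivefold cross-validation mentioned above.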
However, the detail and edge regions still have a lot of misclassified pixels. The SCMK and RMK methods achieve better classification maps by considering the multiple kernels and the spatial information of ASs. The ASMGSSK method further achieves satisfactory classification maps by integrating the ASs strategy and multiscale feature fusion. Moreover, the WASCK method also provides a smoother classification map while maintaining the details. The MWASCK approach achieves a better classification map than the other compared classifiers by considering the weighted ASs strategy and the multiscale structures of HSI.
Tables 2 and 3 present the experimental results of the various approaches for the two datasets. Obviously, the SVM-RBF method, which only considers the spectral features, achieves very poor accuracy. The SVMCK method improves the OA value by adding spatial features within a square region. The SCMK and RMK methods achieve higher accuracy by considering the multiple kernels and the spatial information of ASs, respectively. By taking into consideration the spatial features of the ASs and the multiscale structures of HSI, the ASMGSSK method also achieves satisfactory classification accuracy. In addition, the WASCK approach obtains excellent classification accuracy by considering the weighted ASs strategy. Moreover, the MWASCK method achieves higher accuracy than all the other compared classifiers by considering the weighted ASs strategy and the multiscale structures of HSI.
Deep learning approaches have been used to classify HSIs in recent years. This paper also compares several excellent CNN-based deep learning methods: the deep CNN [52] method, the contextual deep CNN (CD-CNN) [53] method and the diverse-region CNN (DR-CNN) [54] method. Table 4 shows the classification accuracies for different numbers of training samples in each category. For the Indian Pines dataset, we only use the nine largest categories. Obviously, with a small number of training samples, the MWASCK shows a better classification result than the deep learning methods, since CNNs need more training samples to reach their maximum capacity. Therefore, the advantage of our method is that it retains excellent classification performance when the training samples are limited. In addition, we further compared the MWASCK with a fully dense multiscale fusion network (FDMFN) [55] under the same training samples on the Indian Pines and Kennedy Space Center (KSC) datasets. The details of the KSC dataset can be found in [55]. The classification results are shown in Table 5. Obviously, our method also shows outstanding classification performance. Figure 5 illustrates the line graph of the OA values of the WASCK method under different numbers of superpixels (snum). The weighted ASs-based strategy requires smaller superpixels to form larger homogeneous regions; hence, when snum is large, the WASCK has better classification performance. However, when snum is too large, over-segmentation occurs and the spatial information cannot be extracted effectively, so the classification performance declines. As can be observed, the optimal snum for the two datasets is 1400 and 1100, respectively. It can also be found that when the optimal snum is exceeded, the classification performance of the WASCK degrades.

Discussion
Obviously, with limited training samples, the single-scale MWASCK with fewer superpixels obtains higher OA, while with sufficient training samples the single-scale MWASCK with more superpixels obtains higher OA. Therefore, it is confirmed that spatial information cannot be effectively utilized by small-scale superpixels when labeled samples are limited. In addition, when a large number of superpixels is selected, the spatial information of the weighted ASs cannot be exploited effectively, resulting in poor classification performance (see SN = 3200 and SN = 6400 in Figure 6b). Meanwhile, when enough training samples are selected, the limitation of the segmentation algorithm prevents further improvement of the OA value. The proposed MWASCK method not only improves the OA values, but also avoids selecting the optimal snum and a sufficient number of samples. Figure 7 illustrates the effect of the spectral kernel weight on the WASCK and MWASCK, with the weight varied over the interval between 0 and 1.
Obviously, the WASCK and MWASCK show poor classification performance on two datasets when we assign a value of 0 or 1 to the spectral kernel weight (i.e., only using spatial features of weighted ASs or spectral features). On the contrary, the proposed methods show good classification performance when the spatial information is utilized (i.e., the spectral kernel varies from 0.1 to 0.9). This suggests that we should combine spectral features with spatial features of the weighted ASs to classify HSI. It is worth noting that when the interval between 0.1 and 0.9 as the value of the spectral kernel weight, the performance of the proposed methods on two datasets generally degrades. This tells us that we should assign a relatively large weight to the spatial kernel.  Figure 8 illustrates the effect of the kernel width on two datasets. We can observe that the MWASCK has a more robust kernel width value than WASCK. Meanwhile, the proposed WASCK and MWASCK can show better classification performance when goes from 2 −5 to 2 −10 on Indian Pines dataset and goes from 2 −3 to 2 −10 on University of Pavia dataset.  Figure 7 illustrates the effect of the spectral kernel weight associated with WASCK and MWASCK. The interval between 0 and 1 as the value of the spectral kernel weight. Obviously, the WASCK and MWASCK show poor classification performance on two datasets when we assign a value of 0 or 1 to the spectral kernel weight (i.e., only using spatial features of weighted ASs or spectral features). On the contrary, the proposed methods show good classification performance when the spatial information is utilized (i.e., the spectral kernel varies from 0.1 to 0.9). This suggests that we should combine spectral features with spatial features of the weighted ASs to classify HSI. It is worth noting that when the interval between 0.1 and 0.9 as the value of the spectral kernel weight, the performance of the proposed methods on two datasets generally degrades. 
This tells us that we should assign a relatively large weight to the spatial kernel.  Figure 7 illustrates the effect of the spectral kernel weight associated with WASCK and MWASCK. The interval between 0 and 1 as the value of the spectral kernel weight. Obviously, the WASCK and MWASCK show poor classification performance on two datasets when we assign a value of 0 or 1 to the spectral kernel weight (i.e., only using spatial features of weighted ASs or spectral features). On the contrary, the proposed methods show good classification performance when the spatial information is utilized (i.e., the spectral kernel varies from 0.1 to 0.9). This suggests that we should combine spectral features with spatial features of the weighted ASs to classify HSI. It is worth noting that when the interval between 0.1 and 0.9 as the value of the spectral kernel weight, the performance of the proposed methods on two datasets generally degrades. This tells us that we should assign a relatively large weight to the spatial kernel.  Figure 8 illustrates the effect of the kernel width on two datasets. We can observe that the MWASCK has a more robust kernel width value than WASCK. Meanwhile, the proposed WASCK and MWASCK can show better classification performance when goes from 2 −5 to 2 −10 on Indian Pines dataset and goes from 2 −3 to 2 −10 on University of Pavia dataset.  Figure 8 illustrates the effect of the kernel width σ w on two datasets. We can observe that the MWASCK has a more robust kernel width value than WASCK. Meanwhile, the proposed WASCK and MWASCK can show better classification performance when σ w goes from 2 −5 to 2 −10 on Indian Pines dataset and goes from 2 −3 to 2 −10 on University of Pavia dataset.  Figure 9 shows the line chart of OA values of several methods for two datasets under different training sample numbers. As can be observed, with the increase of training sample numbers, classification accuracy is getting higher and higher. 
Table 6 presents the time costs of the WASCK and MWASCK on the Indian Pines and University of Pavia datasets. All experiments were run on a laptop with an Intel Core i5-8265U CPU (1.60 GHz) and 8 GB of RAM. The superpixel segmentation and kernel computation processes account for most of the computation time. Since the segmentation algorithm in [43] is efficient, the superpixel segmentation stage does not consume much time. For the MWASCK on the University of Pavia dataset, because the maximum number of superpixels reaches 6400, the cost of segmentation and kernel computation increases greatly. It is worth noting that once the superpixel segmentation is completed, the pixels within each superpixel share the same spatial information, so the cost of this stage does not grow with the number of training samples.
Table 6. Time costs (in seconds) of the WASCK and MWASCK on two datasets.
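The cost argument above can be sketched concretely: once segmentation assigns each pixel a superpixel label, the per-superpixel spatial feature (here, simply the mean spectrum, as a stand-in for the paper's weighted-AS features) is computed once and then shared by lookup, so it does not need to be recomputed as more samples are drawn. The toy segmentation below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pixels, n_bands, n_superpixels = 100, 8, 5
pixels = rng.normal(size=(n_pixels, n_bands))
labels = rng.integers(0, n_superpixels, size=n_pixels)   # toy segmentation map

# One pass over the image: per-superpixel mean spectra, computed once.
means = np.stack([pixels[labels == s].mean(axis=0) for s in range(n_superpixels)])

# For any pixel, the spatial feature is then a cheap index lookup,
# so the cost is independent of how many training samples are used.
spatial_features = means[labels]                         # (n_pixels, n_bands)
```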

Conclusions
In this paper, the WASCK and MWASCK methods are proposed to classify the HSI. The WASCK fully utilizes the spatial and location information of the weighted ASs. In addition, a multiscale strategy is adopted in the WASCK framework (i.e., the MWASCK) to effectively exploit multiscale superpixel spatial and location information of the HSI. Experimental results indicate that the WASCK and MWASCK achieve excellent classification performance.
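The multiscale fusion just summarized can be sketched minimally: build one spatial kernel per superpixel scale and fuse them. Plain averaging is used here as one simple fusion rule under stated assumptions; it is not necessarily the paper's exact weighting, and the toy features stand in for the per-scale weighted-AS features.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def multiscale_kernel(feats_per_scale, sigma):
    """Fuse spatial kernels built at several superpixel scales by averaging.

    Each entry of feats_per_scale holds the spatial features extracted at one
    superpixel segmentation scale; averaging keeps the result a valid kernel.
    """
    Ks = [rbf_kernel(F, F, sigma) for F in feats_per_scale]
    return sum(Ks) / len(Ks)

rng = np.random.default_rng(3)
scales = [rng.normal(size=(6, 4)) for _ in range(3)]     # toy features at 3 scales
K_ms = multiscale_kernel(scales, sigma=1.0)
```

Fusing at the kernel level avoids committing to a single "optimal" superpixel scale, which is exactly the motivation given for the MWASCK.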
In the experiments, the spectral kernel weight was not selected optimally for each test pixel; instead, it was fixed empirically to a single value shared by all test pixels. In future work, the spectral kernel weight could be obtained adaptively from the local distribution around each target pixel. In addition, a better superpixel segmentation technique would also lead to better classification performance.
In the future, deep features [56][57][58][59] will be exploited and integrated into the composite kernel framework to obtain more accurate results. Moreover, how to adaptively determine the weight of each base kernel remains an open problem; optimization algorithms such as particle swarm optimization [60,61] will be considered and studied.