Multi-Scale Superpixel-Guided Structural Profiles for Hyperspectral Image Classification

Hyperspectral image classification has received a lot of attention in the remote sensing field. However, most classification methods require a large number of training samples to obtain satisfactory performance. In real applications, it is difficult for users to label sufficient samples. To overcome this problem, in this work, a novel multi-scale superpixel-guided structural profile method is proposed for the classification of hyperspectral images. First, the spectral number (of the original image) is reduced with an averaging fusion method. Then, multi-scale structural profiles are extracted with the help of the superpixel segmentation method. Finally, the extracted multi-scale structural profiles are fused with an unsupervised feature selection method followed by a spectral classifier to obtain classification results. Experiments on several hyperspectral datasets verify that the proposed method can produce outstanding classification effects in the case of limited samples compared to other advanced classification methods. The classification accuracies obtained by the proposed method on the Salinas dataset are increased by 43.25%, 31.34%, and 46.82% in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient compared to recently proposed deep learning methods.


Introduction
A hyperspectral imager can simultaneously record abundant spectral information and spatial information regarding the observed materials to form a hyperspectral image (HSI). A HSI, which has hundreds of continuous spectral channels for each pixel, can offer very detailed spectral information. Due to this characteristic, a HSI has the unique ability to identify different land covers and has been widely applied in various fields, such as object detection [1][2][3][4], image interpretation [5][6][7], and water quality retrieval [8][9][10]. In recent years, hyperspectral image classification, which aims to assign a unique label to each pixel, has received more attention due to its importance in precision agriculture, urban investigation, and so on.
In the past few decades, a larger number of hyperspectral image classification methods have been developed [11][12][13][14]. The earliest classification methods were based on spectral classifiers, including support vector machine (SVM) [11], random forest (RF) [12], and nearest neighbor (NN) [13]. These methods mainly used spectral information for classification without considering the spatial information of each pixel, which easily produces salt and pepper noise in the classification map [15].
To fully utilize the spatial information in HSI, many spectral-spatial classification methods have been investigated. For example, Benediktsson et al. proposed extended morphological profiles based on the repeated employment of opening and closing morphological operations with increasing sizes [16]. Kang et al. proposed an edge-preserving filtering method for extracting spectral-spatial features of HSIs, in which the spectral number of the input is first reduced, and then, the domain transform recursive filtering is used to remove the image details [17]. Ghamisi et al. developed extinction filters to extract spatial and contextual information about HSIs [18]. Duan et al. developed a structural profile for the feature extraction of HSIs to greatly improve the discrimination of different ground objects, in which the HSI is modeled as a structural component that reflects the significant spatial structures and a texture component that is the image noise and useless details [19]. In addition, in order to increase the classification performances of these feature extractors, several classification frameworks are used, such as multi-scale classification methods [6], multi-view classification methods [7], semi-supervised classification methods [20], and active learning methods [21].
In recent years, deep learning models have been widely applied in the classification of HSIs. For instance, Chen et al. proposed a convolutional neural network (CNN) for the classification of HSIs, where several convolutional and pooling layers are used to extract deep spectral-spatial features [22]. Hang et al. developed cascaded recurrent neural networks (RNN) to learn the discriminative features of HSIs, in which the first RNN is used to remove redundant information of adjacent spectral channels, and the second RNN models the complementary information of nonadjacent spectral channels [23]. Hang et al. designed a multitask generative adversarial network for the classification of HSIs with limited samples, in which a generator network aims to classify the class of the input data, and a discriminator network is used to discriminate the real and fake samples [24]. These methods can achieve satisfactory classification performances when the number of training samples is sufficient. However, it is hard to label a large number of samples by users in real applications. Therefore, some improved classification technologies have been proposed, such as a semi-supervised classification framework [25], ensemble learning [26], and active learning [27].
To improve the classification performance in the case of limited samples, various classification frameworks, including semi-supervised classification, active learning, and multiscale or multi-view classifications, have been developed [20,[28][29][30]. For example, Kang et al. designed a decision function to expand training samples, in which two feature extractors were used to yield initial probability maps [20]. Lu et al. proposed a multi-view kernel collaborative subspace clustering method, in which the subspace learning and collaborative representation were employed to construct the Laplacian matrix [30]. Although these methods can improve the classification performance to some extent, the rich spatial feature and homogeneity of the same object, which are both very important components for obtaining superior classification results with limited samples, have not been further analyzed in hyperspectral images.
Recently, structural profiles have received more attention in the hyperspectral image classification community [6,19,31,32]. For example, Shah et al. used structural profiles in the preprocessing stage to remove texture details, greatly improving the classification performance of the classifier [31]. Liang et al. developed a multi-view structural profile to characterize complex spectral-spatial features of hyperspectral images [32]. These approaches have proven that the structural profile can well represent the intrinsic properties of the input while removing unrelated information. However, they fail to consider the internal homogeneity among adjacent pixels, which easily yields the over-smoothed phenomenon, especially in the case of limited samples. Inspired by the fact that superpixel segmentation can divide an image into several local regions without intersecting, this work proposes a novel multi-scale superpixel-guided structural profile method for hyperspectral image classification. First, the spectral number of hyperspectral images is reduced with the averaging fusion method. Second, the multi-scale structural profiles are constructed with the help of the multi-scale superpixel segmentation technique. Finally, the multi-scale structural profiles are merged to obtain the discriminative features followed by a spectral classifier to yield the final classification map. Experiments on several hyperspectral datasets demonstrate that the proposed method has a superior classification performance over other advanced classification approaches in the case of limited samples. The novelty and contribution of the proposed method are as follows: (1) This paper presents a novel multi-scale superpixel-guided structural profile classification framework, which can well preserve the boundaries of different land covers. (2) A novel superpixel-level structural feature is proposed to make full use of the strong correlation between neighborhood pixels, which can provide spectral homogenous information belonging to the same object.
The proposed method does not require much parameter adjusting. The parameters of the proposed method are set to fixed values, which are able to obtain superior classification results over other classification methods in the case of limited samples. Figure 1 shows the schematic of the proposed classification method, which mainly consists of three steps: dimension reduction, multi-scale feature extraction, and feature fusion. First, the dimension reduction technique is used to decrease the spectral dimension. Second, the multi-scale feature extraction aims to provide a better characterization of different objects with different sizes. Finally, the extracted multi-scale structural profiles are integrated with the KPCA method, and the fused features are fed into a spectral classifier to yield the classification map.

Dimension Reduction
In order to decrease the computing cost in the following process, the spectral number of the original image is reduced with an averaging fusion method. In more detail, the Ndimensional hyperspectral image I is first segmented into R groups of subsets along the spectral direction. Then, the averaging strategy is utilized to merge the spectral bands in each subset to obtain the dimension-reduced hyperspectral data Y.

Extraction of Multi-Scale Structural Profiles
To better characterize the spectral-spatial information of HSIs, we propose multiscale structural profiles by using the superpixel segmentation technique to preserve the boundaries of different land covers. First, an entropy rate superpixel (ERS) segmentation technique [33] is conducted on the principal components of the input data to produce the homogeneous region. Specifically, the PCA algorithm is performed on the input data I to obtain the first three principal components as the base image I PCs for superpixel segmenta-tion. Then, a graph G = (V, E) is built based on the base image, where V represents the vertex set of pixels in the base image, and E denotes the edge set that measures the pairwise similarities between neighboring pixels. Next, graph G is segmented into L-connected subgraphs by choosing a subset of edges A ⊆ E. To form the homogeneous and compact clusters, the objective function of the ERS segmentation method is constructed by an entropy rate term H(·) and a balance term B(·) as follows: where λ denotes the weight for adjusting the contribution of H(·) and B(·). A greedy optimization scheme [34] is adopted to solve the objective function in (1). Finally, the superpixel map is generated by performing the ERS method on the base image I PCs to yield the 2D superpixel segmentation map. In this work, we refer to the ERS segmentation method with ERS as follows: where E is the segmentation result and T denotes the amount of superpixels, which plays an important role in real applications. How to set an optimal value for the amount of superpixels is a very challenging issue since there is no single segmentation region that can accurately characterize the spatial information [35]. When the number of superpixels is relatively small, each superpixel contains more pixels, which causes ambiguity-labeled boundary superpixels that need to be further segmented. When the number of superpixels is relatively large, each superpixel cannot provide distinctive spatial information to infer accurate labels. To alleviate the above-mentioned problem, a multi-scale segmentation strategy is proposed to select the value of T. In more detail, the base image is segmented into 2C + 1 scales. The amount of superpixels for the cth scale is set as T c : where T f denotes an empirical value as a base superpixel number. Considering that the value of T c may not be the integer number, T c is set as min(max(1, round(T c )), P), where P is the total number of pixels for Y PCs . Accordingly, the amount of multi-scale superpixels T is set to be {T 1 , T 2 , . . . , T C }. Then, a relative total variation (RTV) method [36] is conducted on the dimensionreduced hyperspectral data Y to obtain structural profile S.
where S is the structural profile. α represents a weight parameter. ε denotes a very small positive number to prevent the division by zero. D x and D y are the variations in x and y directions.
where ∂ x S and ∂ y S are the partial derivatives, which aim to estimate the spatial similarity in a local region R(p), and g p,q is defined as: where σ is the standard deviation in the Gaussian filter.
Finally, we use the position indices of pixels for each superpixel E i on the structural profile S to obtain multi-scale 3D superpixels and a mean filtering is performed on each 3D superpixel to obtain the multi-scale structural profiles S c (c = 1, 2, . . . , 2C + 1) so as to boost the similarity of pixels belonging to the same object. Specifically, the mean value of pixels in each 3D superpixel is computed. Then, all pixels in each superpixel are assigned with the mean value. Using this operation, the interferences in each superpixel are greatly decreased, which effectively reduces spectral variability belonging to the same object.

Fusion of Multi-Scale Features
The multi-scale structural profilesŜ have very high dimensions and embody plenty of redundant information. This easily leads to the Hugh phenomenon, which is not conducive to the classification of HSIs. Furthermore, the spectral difference of the same object is decreased in S c . However, the spectral difference belonging to different ground objects is still small and needs to be further improved. To overcome this problem, the KPCA method [37] is adopted to merge different scales of structural profiles to reduce the redundant information and enlarge the spectral differences of different objects. It should be mentioned that the spatial sizes of all structural profiles are consistent. More specifically, first, different scales of structural profiles are stacked together to obtain multi-scale structural profilesŜ = {S 1 , S 2 , . . . , S 2C+1 }. Then, the multi-scale structural profilesŜ are mapped into a high-dimensional feature space H with a radial basis kernel function Φ. Finally, the principal components are calculated as: where G indicates the Gram matrix Φ T (Ŝ)Φ(Ŝ). In this work, we refer to the KPCA with KPCA(Ŝ, K), where K is the number of the preserved features. Accordingly, the obtained features F are fed into the spectral classifier SVM to obtain the classification result O (Algorithm 1).
Algorithm 1 Multi-scale superpixels-guided structural profiles for hyperspectral image classification Input: Input hyperspectral image I; Output: Classification result O; 1: Begin 2: The spectral number of the input hyperspectral image I is reduced with an averagingbased fusion method to obtain dimension-reduced data Y. 3: According to (2), compute the multi-scale superpixel segmentation maps E. 4: According to (4), compute the structural profile S. 5: Using the position indices of pixels for each superpixel in cth superpixel map E c on the structural profile S to obtain multi-scale 3D superpixels. 6: Using mean filtering on each 3D superpixel to obtain the multi-scale structural profileŝ S. 7: According to (7), fuse the multi-scale structural profilesŜ to obtain fused features F. 8: The fused features F are fed into a spectral classifier SVM to obtain classification result O. 9: Return O 10: End

Experimental Setup
(1) Compared methods: To verify the effectiveness of the proposed classification method, several recently proposed or highly cited classification approaches were adopted as compared approaches, including support vector machine (SVM) [11], extended morphological attribute profiles (EMAP) [38], superpixel-based classification via multiple kernels (SCMK) [39], PCA-based edge-preserving features (PCAEPFs) [40], multi-scale total variation (MSTV) [6], orthogonal total variation component analysis (OTVCA) [41], local correntropy matrix (LCEM) [42], spectral-spatial transformer network (SSTN) [43], and spectral-spatial feature tokenization transformer (SSFTT) [44]. For all compared approaches, the default parameters in the corresponding publications were used. The SVM classifier was implemented using the LIBSVM library [45] by using the radial basis function kernel, in which the Gaussian kernel with five-fold cross-validation was adopted for the classifier. (2) Datasets: In this work, three hyperspectral datasets were employed to test the classification abilities of all considered approaches. These datasets are available from the public classification database.
The Salinas Pines dataset was obtained by the AVIRIS sensor over Salinas Valley, USA. The dataset contained 224 spectral bands ranging from 0.4 to 2.5 µm. The spatial resolution was 3.7 m and its spatial size was 512 × 217. Before the classification experiment, 20 water absorption spectral channels (no. 108-112, 154-167, and 224) were removed. Figure 2 presents the false-color image and the ground truth. The Longkou dataset was obtained by an 8-mm focal length Headwall Nano-Hyperspec imaging sensor, over Longkou, Hubei province, China, which was captured by a DJI Matrice 600 Pro UAV platform. The flying height of the UAV was 500 m. The spatial size of the dataset contained 550 × 400 pixels, and it consisted of 270 spectral bands ranging from 0.4 to 1 µm. The spatial resolution of this dataset was about 0.463 m. Figure 3 presents the false-color image and the reference image. The Honghu dataset was obtained by a 17-mm focal length Headwall Nano-Hyperspec imaging sensor embedding on a DJI Matrice 600 Pro UAV platform over Honghu City, Hubei province, China. This study region is a complex agricultural scene, which contains different types of crops and different cultivars of the same crop are shown in the area, such as Brassica chinensis and small Brassica chinensis. The flying height of the UAV was 100 m. This image consisted of 270 spectral bands ranging from 0.4 to 1 µm. The spatial size was 940 × 475 pixels with a spatial resolution of 0.043 m. Figure 4 shows the false-color image and the reference image.
Objective indices: To objectively evaluate the classification performances of all used methods, three extensively used classification indices [46][47][48] were adopted, i.e., overall accuracy (OA), average accuracy (AA), and Kappa coefficient. For these objective indices, a higher value indicates a better classification performance.

Salinas Dataset
The first experiment was tested on the Salinas dataset. To examine the classification performances of different classification methods in the case of limited samples, the number of training samples per class was 5, which was randomly selected from the reference image. Figure 5 presents the classification results of all studied approaches. As depicted in this figure, the SVM method produces a very noisy classification map and some classes have serious misclassification phenomena. The main reason is that the SVM method is a spectral classifier without utilizing the spatial correlation between neighboring pixels. The EMAP method also yields an unsatisfactory classification effect. There are many salt and pepper noises in the classification map since the EMAP method is a pixel-wise feature extraction method. Compared to previous methods mentioned above, the SCMK method improves the classification performance because this method takes the objectlevel features into consideration by using the superpixel segmentation technique in the feature extraction process. The PCAEPF method greatly removes the noisy spots in the classification map. However, the edges between different ground objects have mislabels. Similar to the PCAEPF method, the MSTV method also produces noisy labels on the boundaries of different objects. The OTVCA method fails to well classify some similar land covers, such as Vinyard_untrained and Grapes_untrained classes. For the LCEM method, the Grapes_untrained class is misclassified into the Vinyard_untrained class. The SSTN and SSFTT methods cannot work well in the case of limited samples and the classification maps have serious misclassification labels. The reason is that the two methods are based on deep learning models, which usually require a larger number of training samples. By contrast, the proposed method obtains the best classification results among all classification approaches in preserving the edges of different land covers.
Furthermore, Table 1 presents the objective results of all classification methods. By observing this table, several conclusions can be obtained. First, the SVM method obtains the lowest classification accuracies in terms of OA, AA, and Kappa coefficient. The reason is that the SVM method only uses the spectral information of each pixel. Second, the pixel-level feature extraction method (e.g., EMAP) is inferior to the spectral-spatial classification method (e.g., SCMK, PCAEPF, MSTV). Third, the deep learning methods based on SSTN and SSFTT produce very low classification accuracies. In addition, some classes could not be identified (the primary reason is that the number of training samples is very limited in this work). Finally, the proposed method obtains the highest OA, AA, and Kappa coefficients among all methods. Moreover, the proposed method yields the highest classification accuracy in six classes. This objective result is consistent with the visual classification map.

Longkou Dataset
The second experiment was conducted on the Longkou dataset, where the number of training samples per class was set to 5. Figure 6 presents the visual classification results of all considered classification techniques. As shown in this figure, the SVM method obtained the " pepper and noisy" classification map because only the spectral reflectance of each pixel was utilized. The classification performance of the EMAP method was improved. However, there were still noisy labels in the homogeneous region. The SCMK method discarded some misclassification labels in the homogeneous regions. Nevertheless, some areas had obvious inaccurate classifications. The PCAEPF method improved the classification effect. However, there were obvious misclassifications in the boundaries of different ground objects. Compared to the PCAEPF method, the MSTV method removed some noisy labels, but it could not well identify the edges of different land covers. The OTVCA method had a poor classification performance. The classification result had very serious misclassifications, such as noisy spots. For the LCEM method, some classes had obvious misclassifications, such as water. The SSFTT method also yielded a very noisy classification map. The SSTN method improved the classification performance. However, there were misclassifications on the edges of different classes. Different from these methods, the proposed method had a satisfactory classification performance. The homogeneous regions could be well preserved in the classification map.    Moreover, Table 2 shows the classification accuracy of all used methods. The recorded values of all methods are the mean of ten experiments. For the SVM method, the classification accuracies were relatively low since the spectral classifier only took the spatial information into consideration. The EMAP and OTVCA methods improved classification accuracies, and they obtained similar classification accuracies. However, the AA value was still low. The SSTN method boosted the classification accuracies over the other compared methods. Generally, among all classification methods, the proposed method obtained the highest classification accuracies in terms of OA, AA, and Kappa. The OA increased from 76.47% in the SVM method to 92.63% obtained by our method. In general, our method greatly improved the classification accuracies compared to other classification techniques in the case of limited samples.

Honghu Dataset
The third experiment was performed on the Honghu dataset, where the number of training samples was set to 10. Figure 7 shows the classification results of all methods. The SVM method still had a very noisy classification performance in the resulting image. Likewise, the EMAP method also had a similar classification result. Most land covers cannot be well identified. The SCMK method greatly boosted the classification result. However, the edges of different land covers were not aligned with those of real objects. For the PCAEPF method, the classification map had obvious noisy spots. The MSTV method yielded similar phenomena to the PCAEPF method. The OTVCA method cannot classify different types of crops when the number of training samples is limited. This is due to the fact that this technique cannot well distinguish the similar spectra belonging to different land covers. The LCEM method also yielded obvious misclassification in the visual map. The SSTN method obtained a similar classification map to the SCMK method. The SSFTT method still cannot perform well on this dataset. The classification map had serious speckle noise. In contrast, the proposed method can yield a better classification map with respect to other approaches. The noisy labels in the homogeneous regions were well removed. Moreover, the edges of different objects were well aligned with those of the reference image. Table 3 lists the objective values of different classification methods. Since the Honghu dataset had different cultivars of the same crop, the spectral curves of these land covers were very similar. Thus, many methods cannot produce satisfactory classification accuracies. For example, the OTVCA method on the Honghu dataset yielded very low classification accuracies. In contrast, the proposed method obtained the best classification performance in terms of OA, AA, and Kappa coefficient. When the number of training samples was 10, the OA value obtained by the proposed method reached 89.68%.

The Influences of Different Parameters
In this subsection, four parameters, i.e., dimension reduced parameter R, the number of feature fusion in the KPCA K, weight parameter α, and the filter kernel σ, are first discussed. An experiment was conducted on the Salinas dataset, in which the number of training samples was set to 5. Figure 8 presents the influence of four free parameters on the classification performance of the proposed method. When we discuss the influence of the two parameters, i.e., dimension reduced parameter R, the number of feature fusion in the KPCA K, the weight parameter α, and the filter kernel σ are fixed. The first row in Figure 8 depicts the influence of parameters R and K. It can be observed that the proposed method can obtain the highest classification accuracies when R is 30 and K is 15. Similarly, when the parameters α and σ are analyzed, the dimension reduced parameter R and the number of feature fusions in the KPCA K are fixed. The second row in Figure 8 presents the influence of parameters α and σ. It is obvious that when α and σ are set to 0.007 and 3, respectively, the proposed method has the best classification performance. Therefore, the four free parameters, i.e., R, K, α, and σ, in the proposed method are set to 30, 15, 0.007, and 3, respectively.
Furthermore, the influence of the two parameters, i.e., the number of scales C, and the number of base superpixels T f , is analyzed. An experiment is performed on the Salinas dataset, where the number of training samples is set to 5. Figure 9a shows the OA obtained by the proposed method as a function of the number of scales C whose value is selected from 0 to 8 with step 1. It can be seen that when the number of scales C is relatively small, the proposed method produces unsatisfactory classification performance. When C is 5, the proposed method obtains the highest OA. Figure 9b presents the influence of the number of base superpixel T f ∈ {1, 3, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 300} to the OA of the proposed method. It is obvious that when the number of base superpixel T f is 100, the proposed method yields satisfactory classification performance. Based on the above experiments, the number of scales C and the base superpixel number T f are set to 5 and 100, respectively.

The Influences of Different Steps
In this subsection, the effects of different steps, i.e., dimension reduction, superpixel segmentation, structural profile, and feature fusion, over the classification performance of the proposed method are discussed. An experiment was performed on the Salinas dataset, where the number of training samples was 5 for each class. Table 4 shows the classification accuracies of the proposed method with or without these steps. Here, WDR denotes the proposed method without the dimension reduction step. WMS indicates the proposed method without the multi-scale superpixel segmentation step. WSP is the proposed method without the structural profile extraction step. WFF represents the proposed method without the feature fusion step. As shown in this Table, it can be seen that the WSP method obtains the lowest classification accuracies since this method fails to well characterize the spectralspatial information of HSI. This also illustrates that the structural profile plays an important role in the proposed method. In addition, the WDR method has the least reduction in classification accuracies, indicating that the dimension reduction step has less influence on the proposed method. In general, the proposed method with these steps produces the best classification performance, which demonstrates that all steps make important contributions to the proposed classification method.

The Influences of Different Numbers of Samples
In this subsection, the influences of different numbers of training samples per class on the classification effects of all considered methods are discussed. An experiment was conducted on the Salinas dataset, in which the number of training samples varied from 5 to 50 with step 5. Figure 10 shows the classification accuracies of different classification approaches with different amounts of training sets. From this figure, it can be seen that the proposed method was always higher than other classification approaches. In particular, when the number of training sets was very limited, the proposed method was far more than the other approaches. This also illustrates that the main advantage of the proposed method is that it is able to yield superior classification performance in the case of limited samples. Furthermore, as the number of training samples increases, the classification performances of all classification methods tend to increase. As shown in Figure 10, when the number of training samples was 50, the proposed method produced satisfactory classification performance.

Computing Time
In this subsection, the computing costs of all methods were evaluated. Table 5 shows the computing times of all considered approaches on three hyperspectral datasets. Here, the computing time includes the whole running time of all steps, i.e., training and test times. All experiments were performed on a laptop with 8GB RAM. It can be seen from Table 5 that the computing times of different approaches tend to increase as the spatial size of the input data increases. Moreover, the computing cost of the proposed classification method is acceptable among all methods. For example, the running time of the proposed method on Salinas is only about 18 s. How to improve computational efficiency is an interesting topic.

Conclusions
In this study, we propose multi-scale superpixel-guided structural profiles for hyperspectral image classification. This method is mainly composed of three key steps. First, the spectral numbers of the input data are decreased with the average fusion method. Second, the multi-scale structural profiles are extracted with the help of the superpixel segmentation technique. Finally, the multi-scale structural profiles are merged with the KPCA method. Experiments on three public datasets reveal that the proposed method can achieve satisfactory classification performances with respect to other classification techniques. The main advantage of this method is that the proposed method is superior to other approaches especially when the number of training samples is very limited.