Multi-View Structural Feature Extraction for Hyperspectral Image Classiﬁcation

: The hyperspectral feature extraction technique is one of the most popular topics in the remote sensing community. However, most hyperspectral feature extraction methods are based on region-based local information descriptors while neglecting the correlation and dependencies of different homogeneous regions. To alleviate this issue, this paper proposes a multi-view structural feature extraction method to furnish a complete characterization for spectral–spatial structures of different objects, which mainly is made up of the following key steps. First, the spectral number of the original image is reduced with the minimum noise fraction (MNF) method, and a relative total variation is exploited to extract the local structural feature from the dimension reduced data. Then, with the help of a superpixel segmentation technique, the nonlocal structural features from intra-view and inter-view are constructed by considering the intra- and inter-similarities of superpixels. Finally, the local and nonlocal structural features are merged together to form the ﬁnal image features for classiﬁcation. Experiments on several real hyperspectral datasets indicate that the proposed method outperforms other state-of-the-art classiﬁcation methods in terms of visual performance and objective results, especially when the number of training set is limited.


Introduction
A hyperspectral image (HSI) is able to record hundreds of contiguous spectral channels, and thus provides an unique ability to identify different types of materials. Owing to this merit, hyperspectral imaging has been extensively applied in various aspects, such as land cover mapping [1,2], object detection [3,4], and environment monitoring [5,6]. Over the past few years, hyperspectral image classification has been made great progress because of its significance in mineral mapping, urban investigation, and precision agriculture. Nevertheless, object spectrum is usually affected by the imaging equipment and imaging environment, resulting in high spectrum mixture among different land covers.
To alleviate this problem, hyperspectral feature extraction methods have been widely studied to improve the class discrimination among different land covers. Some representative manifold learning tools [7][8][9][10] have been successfully used as feature extractor of HSIs, such as principal component analysis (PCA) [8], minimum noise fraction (MNF) [10], and independent component analysis (ICA) [9]. However, most of these techniques only utilize the spectral information of different objects, and thus fail to achieve satisfactory classification performance.
To fully exploit the spectral and spatial characteristics in HSIs, a mass of spectralspatial feature extraction techniques have been developed [11][12][13][14][15]. For instance, Marpu et al. developed attribute profiles (APs) to extract discriminative features of HSIs by using morphological operations [12]. Mura et al. studied extended morphological attribute profiles (EMAP) to characterize HSIs by using a series of morphological attribute filters [13]. Kang et al. developed an edge-preserving filtering method for hyperspectral feature extraction to remove low-contrast details [14]. Duan et al. modeled hyperspectral image as a linear combination of structural profile and texture information, in which the structural profile was used as the spatial features [15]. After that, many improved approaches have been also studied for classification of HSIs, such as ensemble learning [16,17], semisupervised learning [18,19], and active learning [20,21]. For instance, in [17], a random feature ensemble method was proposed by using ICA and edge-preserving filtering to boost the classification accuracy of HSIs. In [18], a rolling guidance filter-based semi-supervised classification method was developed in which an extended label propagation technique was utilized to expand the training set.
In addition, with the development of deep learning models, many deep learning models have been applied to extract the high-order semantic features of HSIs. For example, Chen et al. presented a 3D convolutional neural network (CNN)-based feature extraction technique in which the dropout and L 2 regularization were adopted to prevent overfitting for class imbalance [22]. Liu et al. proposed a Siamese CNN method with a margin ranking loss function to improve the discrimination of different objects [23]. Liu et al. designed a mixed deep feature extraction technique by combining the pixel frequency spectrum features obtained by fast Fourier transform and spatial features. Lately, all kinds of improved versions have been also investigated to increase the classification accuracy [24][25][26][27][28]. For example, Hang et al. designed an attention-guided CNN method to extract spectral-spatial information, in which a spectral attention module and a spatial attention module were considered to capture the spectral and spatial information, respectively [26]. Hong et al. improved the transformer network to characterize the sequence attributes of spectral information, where a cross-layer skip connection was used to fuse spatial features from different layers [27].
To better characterize spectral-spatial information, the multiview technique, which aims to reveal the data characteristics from diverse aspects and provide multiple features for model learning, has been applied in hyperspectral feature extraction [20,29,30]. In more detail, the raw data are first transformed into different views (e.g., attributes, feature subsets), and then, the complementary information of all views is integrated together to achieve a more accurate classification result. For example, in [20], Li et al. proposed a multi-view active learning method for hyperspectral image classification, where the subpixel-level, pixel-level, and superpixel-level information were jointly used to achieve a better identification ability. Xu et al. proposed multi-view attribute components for classification of HSIs, in which an intensity-based query scheme was used to expand the number of training set [29]. In general, these methods can increase the classification performance because of multi-view strategy. However, most of them only utilize the local neighboring information without considering the correlation of pixels in the nonlocal region. Based on the above analysis, it is necessary to develop a novel multi-view method to further boost the classification performance by jointly using the dependencies of pixels in the local and nonlocal regions.
In this work, we propose a novel multi-view structural feature extraction method for the first time, which consists of several key steps. First, the spectral dimension of the original data is reduced to increase the computational efficiency. Then, three multi-view structural features are constructed to characterize varying land covers from different aspects. Finally, different types of features are merged together to increase the discrimination of different land covers, and the fused feature is fed into a spectral classifier to obtain the classification results. Experiments are performed on three benchmark datasets to quantitatively and qualitatively validate the effectiveness of the proposed method. The experimental results verify that the proposed feature extraction method can significantly outperform other state-of-the-art feature extractors. More importantly, our method can obtain promising classification performance over other approaches in the case of limited training samples.
The remainder of this article is divided into four sections. Section 2 provides the proposed method. The experimental results are shown and analyzed in Section 3. Section 4 discusses the influence of different components. Finally, Section 5 concludes this work. Figure 1 shows the flowchart of the multi-view structural feature extraction method, which consists of three key steps. First, the spectral number of the raw data is reduced with the MNF method. Then, the multi-view structural features, i.e., local structural feature, intra-view, and inter-view structural features, are constructed based on the correlation of pixels in the local and nonlocal regions. Finally, the multi-view features are fused together with the kernel PCA (KPCA) method, and the fused feature is fed into the support vector machine (SVM) classifier to obtain the classification map.

Dimension Reduction
To decrease the computing time and the influence of image noise, the maximum noise fraction (MNF) [31] is first exploited to decrease the spectral dimension of the original data. Specifically, assume I is the raw data, the MNF is to seek a transform matrix W to maximize the signal to noise ratio of transformed data.
where R is the dimension-reduced data, and W is the transform matrix, which can be estimated as arg max where I is regarded as a linear combination of the uncorrelated signal S and noise matrix N, and cov(I) = ∑ I = ∑ S + ∑ N , in which ∑ S and ∑ N denote the covariance of S and N. In this work, the first L-dimensional components are preserved for the following feature extraction.

Multi-View Feature Generation
Since the imaging scene contains different types of land covers with different spatial size, a single structural feature cannot comprehensively characterize the spatial information of diverse objects. To alleviate this problem, a multi-view structural feature extraction method is proposed. Specifically, three different types of structural features are generated, including local structural feature, intra-view, and inter-view structural features.
(1) Local structural feature: The local structural feature aims to remove useless details (e.g., image noise and texture) and preserve the intrinsic spectral-spatial information.
Specifically, a relative total variation technique [32] is performed on the dimension-reduced data R to construct local structural feature F 1 , which is expressed as arg min where T represents the amount of pixels in total. F 1 stands for the desired structural feature. α denotes a smoothing weight. ε is adopted to prevent dividing by zero. D x and D y indicate the variations in two directions. The solution of Equation (3) refers to [32].
where ∂ x S and ∂ y S stand for the partial derivatives in two directions, which mainly calculates the spatial similarity within a local window R(i), and g i,j is a weight.
where σ denotes the window size.
(2) Intra-view structural feature: The intra-view structural feature is to reduce the spectral difference of pixels belonging to the same land cover and increase the spectral purity in the homogeneous regions. In order to extract the intra-view structural feature, first, an entropy rate superpixel (ERS) segmentation method [33] is adopted to obtain the homogeneous region of the same object. In more detail, the PCA scheme is first conducted on the local structural feature F 1 to obtain the first three componentsF 1 , and then, the ERS segmentation scheme is utilized to obtain a 2D superpixel resulting map.
where S indicates the segmentation result, and T indicates the number of superpixels, which is determined by where L is empirically selected in this work,N represents the amount of nonzero pixels in the detected map obtained by performing Canny filter on the base imageF 1 , and N represents the total amount of pixels in the base image. Based on the position indexes of pixels in each superpixel S i , we can obtain the corresponding 3D superpixels Then, a mean filtering is conducted on each 3D superpixel Y i to calculate the average value. Finally, we assign the pixels in each superpixel to the average value to obtain the intra-view structural feature F 2 .
(3) Inter-view structural feature: The intra-view structural feature is able to reduce the difference of pixels in each superpixel. However, the correlations of pixels for different superpixels are not considered. Thus, we construct an inter-view structural feature to improve the discrimination of different objects. Specifically, a weighted mean operation is performed on the neighboring superpixels Y i,j , {j = 1, 2, . . . , J} of the current superpixel Y i to obtain the inter-view structural feature F 3 .
where ω i,j denotes the neighboring superpixels of the ith superpixels. The obtained valueỹ i is assigned to all pixels in the ith superpixel to produce the inter-view structural feature F 3 .

Feature Fusion
To make full use of multi-view structural features, the KPCA technique [34] is used to merge three types of features. Specifically, first, three structural features F i , {i = 1, 2, 3} are stacked together F = {F 1 , F 2 , F 3 }. Then, the stacked data F is projected into a highdimensional space H by using a Gaussian kernel function Φ. Finally, the fused featureF can be calculated: where K indicates Gram matrix Φ T (S)Φ(S). In this paper, K-dimensional features are preserved in the fused feature. Once the fused feature is obtained, the spectral classifier, i.e., SVM, is considered to examine the classification performance. To clearly show the whole procedure of the proposed method, Algorithm 1 presents a pseudocode to summarize the key steps of our method.

Input:
Input hyperspectral image I; Output: Hyperspectral image featureF 1: According to (1), reduce the spectral number of the raw data I to obtain the dimension reduced image R 2: According to (3), obtain the local structural feature F 1 of the input image. 3: According to (6), obtain the 2-D superpixel segmentation map S. 4: Obtain the corresponding 3-D superpixels Y based on the position indexes of pixels in each superpixels S i . 5: Construct the intra-view structural feature F 2 by performing the mean filtering on each 3-D superpixel. 6: According to (8), construct the inter-view structural feature F 3 . 7: According to (9), fuse the multi-view structural features F i , {i = 1, 2, 3} to obtain the final featureF. 8: ReturnF

Experimental Setup
(1) Datasets: In the experimental section, three hyperspectral datasets, i.e., Indian Pines, Salinas, and Honghu, are used to examine the classification performance of the proposed feature extraction method. All these datasets are collected from a public hyperspectral database.
The Indian Pines dataset was obtained by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test scene in northwestern Indiana, which is available online (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_ Sensing_Scenes (accessed on 10 January 2022 )). This image is composed of 220 spectral bands spanning from 0.4 to 2.5 µm. The spatial size is 145 × 145 with a spatial resolution of 20 m. Twenty water absorption channels (No. 104-108, 150-163, and 220) are discarded before experiments. Figure 2a presents the false color composite image. Figure 2b shows the ground truth, which contains 16 different land covers. Figure 2c gives the class name.
The Salinas dataset was collected by the AVIRIS sensor over Salinas Valley, California, USA, which is available online (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_ Remote_Sensing_Scenes (accessed on 10 January 2022)). This image consists of 224 spectral channels with spatial size of 512 × 217 pixels. The spatial resolution is of 3.7 m. This scene is an agricultural region, which includes different types of crops, such as vegetables, bare soils, and vineyard fields. Twenty spectral bands (No. 108-112, 154-167, and 224) are discarded before the following experiments. Figure 3a gives the false color composition. Figure 3b displays the ground truth, which consists of 16 different land covers. Figure 3c shows the class name.
The Honghu dataset was captured by the a 17 mm focal length Headwall Nano-Hyperspec imaging sensor over Honghu City, Hubei province, China, which is available online (http://rsidea.whu.edu.cn/e-resource_WHUHi_sharing.htm (accessed on 10 January 2022)). This image consists of 270 spectral channels ranging from 0.4 to 1.0 µm. The spatial size is 940 × 475 pixels with a spatial resolution of 0.043 m. This scene is a complex agricultural area, including various crops and different cultivars of the same crop. Figure 4 gives the false color composition, ground truth, and class name. Table 1 presents the training and test samples of all used datasets for the following experiments. (2) Evaluation Indexes: To quantitatively calculate the classification accuracies of all considered techniques, four extensively used objective indexes [35][36][37], i.e., class accuracy (CA), overall accuracy (OA), average accuracy (AA), and Kappa coefficient, are used. The definitions of all objective indexes are shown as follows: (1) CA: CA calculates the percentage of correctly classified pixels of each class in the total number of pixels.
where M is the confusion matrix obtained by comparing the ground truth with the predicted result, and C is the total number of categories.
(2) OA: OA assesses the proportion of correctly identified samples to all samples.
where N is the total number of labeled samples, M is the confusion matrix, and C is the total number of categories.
(3) AA: AA represents mean of the percentage of the correctly identified samples.
where C is the total number of categories, and M is the confusion matrix.
where N is the total number of labeled samples, M is the confusion matrix, and C is the total number of categories.

Classification Results
To examine the effectiveness of the proposed feature extraction method, several stateof-the-art hyperspectral classification methods are selected as competitors, including (1) the spectral classifier, i.e., SVM on the original image (SVM) [38]; (2) the feature extraction methods, i.e., the image fusion and recursive filtering (IFRF) [14], the extended morphological attribute profiles (EMAP) [13], multi-scale total variation (MSTV) [15], the PCA-based edge-preserving features (PCAEPFs) [8]; (3) the spectral-spatial classification methods, i.e., the superpixel-based classification via multiple kernels (SCMK) [39] and the generalized tensor regression approach (GTR) [40]. These methods are adopted because they are either highly cited publications in the remote sensing field or are recently proposed classification methods with state-of-the-art classification performance on several hyperspectral datasets. For all considered approaches, the default parameters follow the corresponding publications for a fair comparison.

Indian Pines Dataset
The first experiment is performed on the Indian Pines dataset, in which 1% labeled samples are randomly selected from the reference image for training (see Table 1). Figure 5 presents the classification maps of all considered methods on the Indian Pines dataset. As shown in this figure, the SVM method yields very noisy visual effects in the classification result, exposing the disadvantages of the spectral classifier without considering the spatial information. By removing image details and preserving the strong edge structures, the IFRF method greatly improves the classification result over the spectral classifier. However, there are still obvious misclassified pixels around the boundaries. For the EMAP method, some homogeneous regions still contain "noise" such as mislabels in the resulting map. For the SCMK method, some obvious misclassified results appear in the edges and corners. The main reason is that the homogeneous region belonging to the same object cannot be accurately segmented. The MSTV method effectively removes the noisy labels. However, it tends to yield an oversmoothed classification map. The PCAEPFs method significantly boosts the classification performance with respect to the IFRF method, since multi-scale feature extraction strategy is adopted. However, some objects with small size fail to be well preserved in the classification map. The GTR method yields spot-like misclassification results since the tensor regression technique cannot fit the spectral curves of different objects well. By contrast, the proposed method obtains a better visual map by integrating multi-view spectral-spatial structural features, in which the edges of the classification map are more consistent with the real scene. For objective comparison, Table 2 lists the objective results of different approaches including CA, OA, AA, and Kappa. It is easily to observe that the proposed method is superior than all compared approaches in terms of OA, AA, and Kappa. For instance, OA value is increased from 53% to 90% obtained by the proposed method with respect to the SVM method on the original data. Moreover, the proposed feature extraction approach yields the highest classification accuracies for ten classes. This experiment illustrates that the proposed feature extraction method is more effective compared to other approaches.
Furthermore, the influence of the number of training set on all classification methods is discussed. Different numbers of samples varying from 1% to 10% are randomly chosen from the reference data to construct the training set. Figure 6 shows the change tendency of all studied methods with different numbers of training set. It is easily found that the classification performance of all methods tends to be improved when the amount of training set increases. In addition, the proposed method performs promising performance with respect to other methods especially when the number of training samples is limited.

Salinas Dataset
The second experiment is conducted on the Salinas dataset, in which five samples per class are randomly selected from the reference image to constitute the training samples (see Table 1). Figure 7 shows the visual maps of all studied approaches. We can easily observe that the SVM method produces noisy classification performance. The reason is that the spatial priors are not considered in the spectral classifier. The IFRF method removes the noisy labels in some homogeneous regions. However, there are still serious misclassifications, such as Grapes_untrained and Vinyard_treils classes. The EMAP method also yields "pepper and noisy" appearance since this method is a pixel-level feature extraction method. For the SCMK method, some regions are misclassified into other classes due to the inaccurate segmentation. The MSTV method produces an oversmoothed visual resulting map. The main reason is that the feature extraction process removes the spatial information of land covers with low reflectivity. The PCAEPFs method produces noisy labels in the edges and boundaries. The GTR method yields a serious misclassification map since the tensor regression model fails to distinguish the similar spectral curves. Different from other methods, the proposed method provides the best visual classification effect in removing noisy labels and preserving the boundaries of different classes. Furthermore, Table 3 also verifies the effectiveness of the proposed method. Likewise, the proposed method obtains the highest classification accuracies with regard to OA, AA, and Kappa compared to other studied techniques. In addition, the influence of different numbers of training samples is presented in Figure 8. The number of training set for each class is varying from 5 to 50. It is shown that the increase of the training size is beneficial to the classification performance of all methods. Moreover, our method is always higher than other classification approaches.

Honghu Dataset
The third experiment is performed on a complex crop scene with 22 crop classes, i.e., Honghu dataset, in which the benchmark training and test samples (http://rsidea.whu. edu.cn/e-resource_WHUHi_sharing.htm (accessed on on Jaunary 2022)) are adopted. The classification results of all methods are shown in Figure 9. Similarly, the SVM method produces a very noisy visual map since only the spectral information is used. The IFRF method greatly improves this problem by using the image filtering on the raw data, obtaining a better classification result over the SVM method. The EMAP method also obtains a noisy classification map. The reason is that this feature extraction method is in a pixel-wise pattern. For the SCMK method, the classification map has a small amount of misclassification labels for some classes. The MSTV method obtains a better classification map due to multi-scale technique. However, the edges of different classes still have misclassification appearance. The PCAEPFs method yields a similar classification effect to the MSTV method. For the GTR method, the classification result suffers from serious misclassification for this complex scene when the amount of training set is limited. By contrast, the proposed method yields the best visual map with respect to other studied approaches, since multi-view structural features are effectively merged to yield a complete characterization of different objects. Furthermore, the objective results obtained by all considered approaches are listed in Table 4. It is obvious from Table 4 that our method still provides the highest classification accuracies concerning OA, AA, and Kappa coefficient. In addition, Figure 10 gives the OA, AA, and Kappa coefficient of all studied approaches as functions of the amount of training samples from 25 to 300. It should be mentioned that the training samples follow the benchmark dataset (http://rsidea.whu.edu.cn/e-resource_WHUHi_sharing.htm (accessed on 10 January 2022)). It is found that the classification accuracies of all methods tend to improve when the training size increases, and our method still produces the highest OA, AA, and Kappa coefficient.  In this part, the influence of different free parameters, i.e., the number of dimension reduction L, the number of fused feature K, the smoothing weight α, and the window size σ, on the classification accuracy of our method is analyzed. An experiment is performed on the three datasets with training set listed in Table 1. When L and K are discussed, α and σ are fixed as 0.005 and 3, respectively. Similarly, when α and σ are analyzed, L and K are set as 30 and 20, respectively. Figure 11 presents the classification accuracy OA of the proposed method with different parameter settings. It can be observed that the proposed method can yield satisfactory classification accuracy for all used datasets when L and K are set to be 30 and 20, respectively. Moreover, when L and K are relatively small, the classification performance tends to be decreased, since the limited number of features cannot well represent the spectral-spatial information in HSIs. Figure 11d-f shows the influence of different α and σ. It is shown that when α and σ increase, the classification performance of the proposed method decreases. The reason is that the structural feature extraction technique smooths out the spatial structures of HSIs. When α and σ are set to be 0.005 and 3, respectively, the proposed method obtains the highest classification accuracy. Based on this observation, L, K, α, and σ are set as 30, 20, 0.005, and 3, respectively.

The Influence of Three Different Views
In this subsection, the influence of three different views on the classification accuracy is investigated. An experiment is conducted on the Indian Pines dataset. The classification performance obtained by the proposed framework with different views is shown in Table 5. It can be seen that the inter-view feature performs the best classification performance among three different views. Furthermore, the combination of two different views outperforms individual view in terms of classification accuracies. Overall, when three types of features are combined, the proposed method provides the highest classification results. The reason is that three different views have complementary information, which can be jointly utilized to improve the classification performance.

Effect of Different Hyperspectral Feature Methods
To demonstrate the advantage of the proposed feature extraction method, several widely used feature extraction methods for HSIs are selected as competitors, including extended morphological profiles (EMP) [41], extended morphological attribute profiles (EMAP) [13], Gabor filtering (Gabor) [42], image fusion and recursive filtering (IFRF) [14], intrinsic image decomposition (IID) [43], PCA-based edge-preserving filters (PCAEPFs) [8], invariant attribute profiles (IAPs) [44], low rank representation (LRR) [45], multi-scale total variation (MSTV) [15], and random patches network (RPNet) [46]. An experiment is performed on the Indian Pines dataset with 1% of training samples listed in Table 1. The classification accuracy of all considered approaches is shown in Figure 12. The EMP and EMAP methods only yield around 60% classification performance when the number of training samples is scarce. The classification performance obtained by the edge-preserving filtering-based feature extraction methods such as IFRF and PCAEPFs also tends to decrease when the number of training set is scarce. The RPNet-based deep feature extraction method also fails to achieve satisfactory performance. By contrast, it is found that the proposed feature method obtains the highest classification performance among all feature extraction techniques for three indexes, which further illustrates that the proposed method can better characterize the spectral-spatial information compared to other methods by fusing local and nonlocal multi-view structural features.

Computing Time
The computing efficiency of all considered techniques for all datasets is provided in Table 6. All experiments are tested a laptop with 8 GB RAM and 2.6 GHz with Matlab 2018. We can observe from Table 6 that when the spatial and spectral dimensions of HSIs increase, the computing time of all methods tends to increases. Furthermore, the computing time of our method is quite competitive among all considered approaches (taking the Indian Pines dataset as an example, the running time of our method is around 5.36 s). The GTR method is the fastest as it is a regression model.

Conclusions
In this work, a multi-view structural feature extraction method is developed for hyperspectral image classification, which consists of three key steps. First, the spectral number of the raw data is decreased. Then, the local structural feature, intra-view structural feature, and inter-view structural feature are constructed to characterize spectral-spatial information of diverse ground objects. Finally, the KPCA technique is exploited to merge multi-view structural features, and the fused feature is incorporated with the spectral classifier to obtain the classification map. Our experimental results on three datasets reveal that the proposed feature extraction method can consistently outperform other state-of-theart classification methods even when the number of training set is limited. Furthermore, with regard to 10 other representative feature extraction methods, our method still produces the highest classification performance.

EMP Extended morphological profiles Gabor
Gabor filtering IID Intrinsic image decomposition IAPs Invariant attribute profiles LRR Low rank representation RPNet Random patches network