An Efficient Spectral Feature Extraction Framework for Hyperspectral Images

Abstract: Extracting diverse spectral features from hyperspectral images has become a hot topic in recent years. However, existing models are time consuming to train and test and suffer from poor discriminative ability, resulting in low classification accuracy. In this paper, we design an effective feature extraction framework for the spectra of hyperspectral data. We construct a structured dictionary to encode spectral information and apply a learning machine to map the coding coefficients. To reduce the training and testing time, the sparsity constraint is replaced by a block-diagonal constraint to accelerate the iteration, and an efficient extreme learning machine is employed to fit the spectral characteristics. To improve the discriminative ability of our model, we first add spectral convolution to extract abundant spectral information. Then, we design shared constraints for the subdictionaries so that their common features can be expressed more effectively, improving the discriminative and reconstructive ability of the dictionary. Experimental results on diverse databases show that the proposed feature extraction framework not only greatly reduces the training and testing time, but also achieves very competitive accuracy compared with deep learning models.


Introduction
Feature extraction of hyperspectral images (HSIs) is a significant topic at present and is widely applied in different HSI applications [1,2], including hyperspectral classification [3], target detection [4], and image fusion [5]. However, the variability and redundancy of spectra make it challenging to extract valid features from HSIs. A large number of feature learning techniques have been developed to describe spectral characteristics, which can be roughly categorized into two types: linear and nonlinear algorithms. Linear models exploit the original spectral information or linearly derive various features from such information. These kinds of features have been widely used to represent the linear separability of certain classes [6]. The common linear models are independent component analysis [7], principal component analysis [8], and linear discriminant analysis [9]. Although these models are simple and compact, they suffer from poor representation ability and cannot cope with intricate HSI data.
The nonlinear models are more effective for class discrimination due to the existence of nonlinear class boundaries. These approaches adopt nonlinear transformations to better represent the spectral features of HSIs. The kernel-based method [10] is a common nonlinear model that maps samples into a higher dimensional space. The support vector machine (SVM) [11][12][13] is a representative kernel-based method and has been proven to be effective for HSI classification. In [14], Bruzzone proposed a transductive SVM that can simultaneously utilize labeled and unlabeled data. Nonetheless, kernel-based algorithms usually lack a theoretical basis for the selection of the corresponding parameters and are not scalable to large datasets. Another widely used nonlinear model is the deep learning method, which has strong potential for feature learning. Chen et al. [15] verified the effectiveness of the stacked autoencoder (SAE) for classical spectral-information-based classification. A similar model was proposed by Chen et al. [16], who applied deep belief networks (DBNs) to extract features in practice. In [17][18][19][20], convolutional neural networks (CNNs) of various dimensionalities were adopted for HSI classification. Rasti et al. [21] provided a technical overview of the state-of-the-art techniques for HSI classification, especially the deep learning models. However, deep learning models require numerous labeled data points, strictly limiting their application domain. Moreover, the trained models are inflexible, and their parameters are difficult to adjust.
Recently, dictionary-based methods have been introduced into HSI recognition. Compared with deep learning models, dictionary-based methods can represent spectral characteristics more effectively with less HSI data. Regarding sparse representation-based classification (SRC), References [22,23] constructed an unsupervised dictionary that often engendered unstable sparse coding. References [24,25] combined the kernel model with sparse coding to make samples more separable. Li et al. [26] designed a sparse representation algorithm robust to outliers in practice. To obtain a compact and discriminative dictionary, Zhang and Li [27] absorbed label information and constructed a k-singular-value decomposition (K-SVD) dictionary for feature learning. Moreover, Reference [28] optimized the discriminative dictionary and applied it to process HSIs. In [29,30], learning vector quantization was adopted in dictionary-based models for hyperspectral classification. In general, dictionary-based methods show great potential for HSI feature representation. However, learning these dictionaries is time consuming, and their discriminative ability is poor.
To address the aforementioned drawbacks, we propose an efficient framework that trains a discriminative structure dictionary to describe HSIs. The main novelties of the proposed model are threefold: (1) We design an efficient feature learning framework that calculates the structured dictionary to encode spectral information and adopts machine learning to map the coding coefficients. The block-diagonal constraint is applied to increase the efficiency of coding, and an effective extreme learning machine (ELM) is employed to complete the mapping. (2) We apply spectral convolution to extract the mean value and local variation of the spectra of HSIs. Then, the dictionary learning is carried out to capture more local spectral characteristics of HSI data.
(3) We devise a new shared constraint for all of the subdictionaries. In this way, the common and specific features of HSI samples will be learned separately to achieve a more discriminative representation.

Materials and Methods
In this section, we first introduce the experimental datasets and then elaborate on the proposed feature extraction framework for HSIs.

The Study Datasets
The experimental datasets include three well-known HSI datasets, and we randomly select 10% of each dataset for training and the rest for testing. The detailed information is presented as follows.
Center of Pavia [31]: The HSI data were collected by the airborne sensor of the reflective optics system imaging spectrometer (ROSIS) located in the urban area of Pavia, Northern Italy. The image consists of 1096 × 492 pixels at a ground sampling distance (GSD) of 1.3 m with 102 spectral bands in the range of 430 nm to 860 nm. In this dataset, nine main categories are investigated for the land cover classification task. The number of training and testing samples is specifically listed in Table 1.

Botswana [32]: This dataset was collected by the Hyperion sensors on the NASA Earth Observing 1 (EO-1) satellite over the Okavango Delta, Botswana. It has 1476 × 256 pixels at a GSD of 30 m with 145 spectral channels ranging from 400 nm to 2500 nm. There are 14 challenging classes for the land cover classification task. Table 2 lists the scene categories and the number of training and testing samples used in the classification task.

Houston University 2013 [21]: The dataset was collected by the compact airborne spectrographic imager (CASI) sensor over the campus of the University of Houston and its surrounding areas, in Houston, TX, USA. It contains 349 × 1905 pixels at a GSD of 1 m with 144 spectral channels ranging from 364 nm to 1046 nm. The specific training and test information for the data is detailed in Table 3.

Related Works
Recently, dictionary learning has led to promising results in HSI classification. Dictionary learning aims to learn a set of atoms, also called visual words in the computer vision community, such that a few atoms can be linearly combined to well approximate a given signal [33]. Here, we briefly introduce several mainstream dictionary-based approaches.

Review of Sparse Representation-Based Classification
Wright et al. [22] proposed the sparse representation-based classification (SRC) model, which is widely applied in HSI classification [30]. Suppose there are C classes of HSIs. Let X = [X_1, ..., X_i, ..., X_C] be the set of original training samples, where X_i is the subset of training samples from class i. With the dictionary D formed from the training samples, the sparse coding vector a of a test sample x is obtained by the l_p-norm minimization constraint as follows:

\hat{a} = \arg\min_a \|x - Da\|_2^2 + \lambda \|a\|_p, (1)

where λ is a positive scalar and p is usually zero or one. The test sample can then be classified via the following:

\mathrm{identity}(x) = \arg\min_i \|x - D_i \hat{a}_i\|_2, (2)

where \hat{a}_i is the coefficient vector associated with class i. SRC has impressive performance in face recognition and is robust to different noises [33]. It acts as a leading method toward classification with the help of dictionary coding. Nevertheless, the SRC model naively employs all the training samples as one dictionary. The dictionary of SRC suffers from redundant atoms and a disordered structure, making it unsuitable for complex HSI classification.
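As a concrete illustration of the two-step SRC pipeline just described — sparse coding over the stacked training dictionary, then classification by per-class reconstruction error — the sketch below uses a simple ISTA solver for the l_1 case. This is a minimal stand-in, not the authors' implementation; the function names and the choice of ISTA are ours:

```python
import numpy as np

def ista_sparse_code(D, x, lam=0.1, n_iter=200):
    """Solve min_a ||x - D a||_2^2 + lam * ||a||_1 by ISTA (the l_p case with p = 1)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part's gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # half-gradient of the quadratic term
        z = a - grad / L                   # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft threshold
    return a

def src_classify(D_list, x, lam=0.1):
    """Assign x to the class whose sub-dictionary gives the smallest reconstruction error."""
    D = np.hstack(D_list)                  # SRC stacks all training samples as one dictionary
    a = ista_sparse_code(D, x, lam)
    errors, start = [], 0
    for D_i in D_list:
        a_i = a[start:start + D_i.shape[1]]
        errors.append(np.linalg.norm(x - D_i @ a_i))
        start += D_i.shape[1]
    return int(np.argmin(errors))
```

The per-class error loop makes the drawback noted above visible: every training sample is an atom, so the coding cost grows with the training set size.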

Review of Class-Specific Dictionary Learning
As discussed in [34], the pre-defined dictionary of the SRC model incorporates much redundancy, as well as noise and trivial information. To solve this problem, Yang et al. [34] constructed a class-specific dictionary, in which sub-dictionary D_i of the learned dictionary D = [D_1, ..., D_i, ..., D_C] corresponds to class i. The sub-dictionary can be learned class-by-class as follows:

\{D_i, A_i\} = \arg\min_{D_i, A_i} \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1, (3)

where A_i is the coding result of samples X_i on sub-dictionary D_i. Equation (3) can be seen as the basic model of class-specific dictionary learning, since each D_i is trained separately from the samples of a specific class. The reconstruction error \|x - D_i a_i\|_2 can then be applied to classify HSI data. However, Equation (3) does not consider the discriminative ability between different coefficients, resulting in low classification accuracy.

Review of Fisher Discriminant Dictionary Learning
Yang et al. [35] proposed a complex model named Fisher discriminant dictionary learning (FDDL), which adopts the Fisher criterion to learn a structured dictionary. Suppose that X = [X_1, ..., X_i, ..., X_C] ∈ R^{L×N} refers to all N training HSI samples from C classes with L bands, and A = [A_1, ..., A_i, ..., A_C] is the corresponding coefficient matrix over dictionary D containing N_A atoms. The samples of class i can be approximated as X_i ≈ D A_i, and the objective function is as follows:

\{D, A\} = \arg\min_{D, A} \left\{ L_R(D, A) + \lambda_1 L_S(A) + \lambda_2 L_D(A) \right\}, (4)

where λ_1 and λ_2 are the regularization parameters. L_R, L_S, and L_D denote the reconstructive loss, sparse constraint loss, and discriminative loss, respectively:

L_R(D, A) = \sum_{i=1}^{C} \left( \|X_i - D A_i\|_F^2 + \|X_i - D_i A_i^i\|_F^2 + \sum_{j \ne i} \|D_j A_i^j\|_F^2 \right), (5)

L_S(A) = \|A\|_1, (6)

L_D(A) = \mathrm{tr}(S_W(A)) - \mathrm{tr}(S_B(A)) + \eta \|A\|_F^2, (7)

where \|\cdot\|_F is the Frobenius norm and A_i^j denotes the coding of X_i over sub-dictionary D_j. In Equation (5), the first term \|X_i - D A_i\|_F^2 guarantees reconstruction fidelity, while the remaining terms are designed for the discriminative ability of dictionary D. As for Equation (6), \|A\|_1 is a sparsity constraint and can be calculated by the lasso [35]. Equation (7), based on the Fisher criterion [35], is completed by minimizing the within-class scatter of A, denoted by S_W(A), and maximizing the between-class scatter S_B(A). The last elastic term of Equation (7) is applied to make the otherwise non-convex problem tractable.
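The Fisher-criterion term of Equation (7) can be sketched directly: the traces of the within- and between-class scatter matrices reduce to sums of squared deviations from class and global means. The following is an illustrative reimplementation under our own naming, not the FDDL code:

```python
import numpy as np

def fisher_discriminant_loss(A_list, eta=1.0):
    """tr(S_W(A)) - tr(S_B(A)) + eta * ||A||_F^2, the discriminative term of Eq. (7).

    A_list[i] is the (n_atoms x n_i) coefficient matrix of class i."""
    A = np.hstack(A_list)
    m = A.mean(axis=1, keepdims=True)                 # global mean coefficient vector
    sw = sb = 0.0
    for A_i in A_list:
        m_i = A_i.mean(axis=1, keepdims=True)         # class mean coefficient vector
        sw += np.sum((A_i - m_i) ** 2)                # trace of within-class scatter
        sb += A_i.shape[1] * np.sum((m_i - m) ** 2)   # trace of between-class scatter
    return sw - sb + eta * np.sum(A ** 2)
```

Minimizing this quantity pulls each class's codes toward their class mean while pushing the class means apart, which is exactly the discriminative pressure FDDL places on A.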
The atoms of the structured dictionary in FDDL are strongly correlated with specific classes, which improves the representation ability of D. However, the FDDL model is time consuming and unsuitable for practical application. More importantly, the structure of the FDDL model needs improvement to enhance its reconstructive ability.

Figure 1 shows the workflow of the proposed framework, in which we construct a structured dictionary to extract spectral features for the classification application. Spectral convolution is first introduced into our model to extract abundant information. Following the convolution, the corresponding coding representations are built for the test spectral data. We design the shared constraint for all of the subdictionaries to enhance the discriminative ability of the structured dictionary. Finally, the ELM model is adopted to map the coding coefficients to the corresponding labels.

Spectral Convolution
The HSI data contain a massive amount of spectral characteristics, such as reflection peaks and valleys, which play important roles in spectral classification. To extract this spectral information, we design three convolution masks, M_1, M_2, and M_3, for the original samples. To achieve stable classification performance, we apply M_1 to preserve the original data. Inspired by the wavelet transform, we design mask M_2 to extract the main structure (mean values) of the spectral samples and mask M_3 to capture the detailed information (local variation) of the spectra. As shown in Figure 2, the results of M_2 capture the main signal of the spectra (M_1), and the values of M_3 change with the local variation in the spectra (M_1). Mask M_2 can be adopted to describe the main structure of spectral samples, while mask M_3 can be applied to describe the local reflection valleys and peaks of the spectral data. However, the running time is closely related to the number of masks. In this work, we employ only three convolutional masks to extract the spectral information; other masks could also be applied to extract further spectral characteristics.
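To make the three-mask idea concrete, the sketch below convolves a pixel's spectrum with an identity mask, an averaging mask, and a second-difference mask. The coefficient values here are hypothetical stand-ins chosen to match the described roles (preserve, smooth, detect local variation); the paper's exact mask values are not reproduced:

```python
import numpy as np

# Hypothetical masks: M1 preserves the spectrum, M2 smooths it (local mean),
# M3 responds to local variation (reflection peaks/valleys).
# The paper's actual coefficients may differ.
M1 = np.array([0.0, 1.0, 0.0])
M2 = np.array([1.0, 1.0, 1.0]) / 3.0
M3 = np.array([-1.0, 2.0, -1.0])

def spectral_features(spectrum):
    """Convolve one pixel's spectrum with each mask and stack the three results."""
    return np.stack([np.convolve(spectrum, m, mode="same") for m in (M1, M2, M3)])
```

Note that M3 is zero on linearly varying bands and nonzero exactly at local bends of the spectrum, which is the "peaks and valleys" behavior described above.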

Structured Dictionary
To encode the spectral information, most dictionary-based methods [34,35] rely on the sparsity constraint under the following framework:

\{D, A\} = \arg\min_{D, A} \|X - DA\|_F^2 + \lambda \|A\|_p + \varphi(D_i, A_i), (9)

where λ ≥ 0 is a scalar constant. The first term \|X - DA\|_F^2 is the fidelity constraint ensuring the representation ability of the trained dictionary, the second term \|A\|_p is the sparsity constraint, and the remaining term \varphi(D_i, A_i) is an additional constraint for some discrimination promotion function. These models train a structured dictionary to represent signals, which promotes discrimination between classes. However, computing the coding coefficients under the sparsity constraint is time consuming, making the model inefficient. More importantly, the role of sparse coding in classification is still an open problem [36][37][38], and some experts have argued that sparse coding may not be crucial for dictionary-based classification.
As described in [38], the block-diagonal constraint is an efficient way to calculate the coding coefficients. Here, we build the structured dictionary model as follows:

\{A, D\} = \arg\min_{A, D} \sum_{i=1}^{C} \|X_i - D_i A_i\|_F^2, \quad \text{s.t.} \ \|d_j\|_2^2 \le 1, (10)

where the coefficient matrix A will be nearly block diagonal. The objective function in Equation (10) is generally non-convex. We introduce a variable matrix P to calculate the coefficient matrix A. Matrix P ∈ R^{N_A×L} is an encoder, and the code A can be calculated as A = PX. With the encoder P = [P_1; ...; P_j; ...; P_C], we want encoder P_j to be able to project the samples X_i (j ≠ i) to a nearly null space, i.e., P_j X_i ≈ 0, ∀ j ≠ i. Therefore, Equation (10) can be relaxed to the following problem:

\{A, D, P\} = \arg\min_{A, D, P} \sum_{i=1}^{C} \|X_i - D_i A_i\|_F^2 + \tau \|P_i X_i - A_i\|_F^2 + \lambda \|P_i \bar{X}_i\|_F^2, \quad \text{s.t.} \ \|d_j\|_2^2 \le 1, (11)

where τ and λ are scalar constants, the τ term enforces P_i X_i ≈ A_i, and \bar{X}_i denotes the complementary data matrix of subset X_i in the whole training set X. Equation (11) can be solved via a two-stage iterative algorithm: updating A with D and P fixed, and updating D and P with A fixed.
(1) Suppose that D and P are fixed; A is updated as follows:

A_i = \arg\min_{A_i} \|X_i - D_i A_i\|_F^2 + \tau \|P_i X_i - A_i\|_F^2. (12)

Equation (12) is a standard least-squares problem, and we obtain the closed-form solution:

A_i = (D_i^T D_i + \tau I)^{-1} (D_i^T X_i + \tau P_i X_i), (13)

where I is the identity matrix.
(2) Fixing A, D and P are updated as follows:

\{D, P\} = \arg\min_{D, P} \sum_{i=1}^{C} \|X_i - D_i A_i\|_F^2 + \tau \|P_i X_i - A_i\|_F^2 + \lambda \|P_i \bar{X}_i\|_F^2, \quad \text{s.t.} \ \|d_j\|_2^2 \le 1, (14)

where d_j is an atom of the structured dictionary and the constraint \|d_j\|_2^2 \le 1 makes the dictionary more stable. The closed-form solution for P can be obtained as:

P_i = \tau A_i X_i^T \left( \tau X_i X_i^T + \lambda \bar{X}_i \bar{X}_i^T + \gamma I \right)^{-1}, (15)

where γ is a small number. D can be calculated by introducing a variable S:

\{D, S\} = \arg\min_{D, S} \sum_{i=1}^{C} \|X_i - D_i A_i\|_F^2, \quad \text{s.t.} \ D = S, \ \|s_j\|_2^2 \le 1. (16)

The optimal solution of Equation (16) can be achieved by the alternating direction method of multipliers (ADMM) algorithm [39]:

D^{(t+1)} = \arg\min_D \sum_{i=1}^{C} \|X_i - D_i A_i\|_F^2 + \rho \|D - S^{(t)} + T^{(t)}\|_F^2,
S^{(t+1)} = \arg\min_S \rho \|D^{(t+1)} - S + T^{(t)}\|_F^2, \quad \text{s.t.} \ \|s_j\|_2^2 \le 1,
T^{(t+1)} = T^{(t)} + D^{(t+1)} - S^{(t+1)}, (17)

where ρ is a value increased at a fixed ratio across iterations and T is an auxiliary (scaled dual) matrix. All these closed-form solutions converge rapidly, and a balance between the discrimination and representation power of the model can be achieved.
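The two closed-form steps of this alternating scheme are each a regularized least-squares solve, which is where the speed advantage over sparse coding comes from. The sketch below is our own illustration of those two solves under the DPL-style formulation [38]; variable names and the small ridge term gamma are our assumptions:

```python
import numpy as np

def update_A(D_i, P_i, X_i, tau):
    """Coefficient update with D, P fixed:
    A_i = (D_i^T D_i + tau*I)^{-1} (D_i^T X_i + tau * P_i X_i)."""
    k = D_i.shape[1]
    return np.linalg.solve(D_i.T @ D_i + tau * np.eye(k),
                           D_i.T @ X_i + tau * (P_i @ X_i))

def update_P(A_i, X_i, X_bar_i, tau, lam, gamma=1e-4):
    """Encoder update with A fixed:
    P_i = tau * A_i X_i^T (tau * X_i X_i^T + lam * Xbar_i Xbar_i^T + gamma*I)^{-1}."""
    L = X_i.shape[0]
    M = tau * (X_i @ X_i.T) + lam * (X_bar_i @ X_bar_i.T) + gamma * np.eye(L)
    return tau * (A_i @ X_i.T) @ np.linalg.inv(M)
```

Both updates cost one small matrix factorization each, so no iterative sparse solver runs inside the loop; only the dictionary update itself needs the ADMM inner iterations.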

Shared Constraint
To improve the representation and reconstructive ability of the subdictionaries, we design a shared constraint for the subdictionaries. As shown in Figure 3, the test samples contain shared features, and our shared constraint (the com subdictionary) is added to describe this duplicated information (shared features). The discriminative features are then "amplified" relative to the original ones, making it easier to construct a new structured dictionary. Here, we design a subdictionary D_com to calculate the class-shared characteristics as follows:

X_i \approx D_{com} A_i^{com} + D_i A_i, (18)

where D_com denotes the shared subdictionary and A_i^{com} is the coding of X_i over D_com. The corresponding objective function is modified as follows:

\{A, D, P\} = \arg\min_{A, D, P} \sum_{i=1}^{C} \|X_i - D_{com} A_i^{com} - D_i A_i\|_F^2 + \tau \|P_i X_i - A_i\|_F^2 + \lambda \|P_i \bar{X}_i\|_F^2, \quad \text{s.t.} \ \|d_j\|_2^2 \le 1. (19)

The introduction of D_com does not affect the solution procedure. With the shared term D_{com} A_i^{com} absorbing the common features, the class-specific residuals tend to be closer to zero, and the corresponding reconstructive ability of the structured dictionary is improved.
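The effect of the shared subdictionary can be illustrated by decomposing a sample into a shared part and a class-specific part via least squares over the concatenated dictionary. This is a toy sketch of the decomposition idea described above, not the training algorithm itself; the function name is ours:

```python
import numpy as np

def shared_specific_split(D_com, D_i, x):
    """Least-squares decomposition x ~ D_com a_com + D_i a_i.

    Returns the shared reconstruction and the class-specific reconstruction."""
    D = np.hstack([D_com, D_i])                    # concatenated dictionary
    a, *_ = np.linalg.lstsq(D, x, rcond=None)      # joint coding over both parts
    k = D_com.shape[1]
    return D_com @ a[:k], D_i @ a[k:]              # shared part, class-specific part
```

Because the shared part soaks up the features common to all classes, the class-specific residual that the subdictionaries must explain becomes smaller and more discriminative.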

Feature Extraction Framework
We construct the structured dictionary and encode the spectral information of HSIs. The coding coefficients A are fed into a learning classifier to achieve better performance than directly using the minimum reconstruction error for classification. Different learning classifiers, such as the SVM [12] and neural networks (NNs), can be employed to map the coding coefficients. However, these tools are often time consuming. Therefore, we employ an efficient machine learning technique, i.e., the extreme learning machine, to classify the HSIs.
In [40], Huang et al. proposed the ELM for generalized single-hidden-layer feed-forward neural networks (SLFNs), which has been widely applied in various applications [41,42]. The ELM tries to learn an approximation function based on the training data. An SLFN with K hidden nodes can be represented as follows:

f(x_i) = \sum_{j=1}^{K} \beta_j \, g(a_j \cdot x_i + b_j), (20)

where a_j is the input weight vector connecting the inputs to the j-th hidden node, b_j is the bias of the j-th hidden node, g(·) is the activation function, and β_j is the output weight of the j-th hidden node. The activation function g(·) can be any nonlinear piecewise continuous function, such as the following:

g(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))}, (21)

g(a, b, x) = \exp(-b \|x - a\|_2^2), (22)

where Equations (21) and (22) are the sigmoid and radial basis function (RBF), respectively, θ = (a, b) are the parameters of the mapping function, and \|\cdot\|_2 denotes the Euclidean norm. Huang et al. [43] proved that SLFNs can approximate any continuous target function over any compact subset X with the above sigmoid and RBF functions. Training an ELM is equivalent to solving a regularized least-squares problem, which is considerably more efficient than training an SVM or learning with back-propagation. Therefore, in our model, an ELM is adopted to map the coding coefficients into the different classes of HSIs.
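The efficiency argument above can be seen in code: the input weights and biases are drawn randomly and never trained, so the only learned parameters are the output weights, obtained by one ridge-regression solve. The sketch below is a generic ELM with sigmoid activations, not the paper's implementation; the regularization value and layer size are illustrative:

```python
import numpy as np

def train_elm(X, T, n_hidden=50, reg=1e-6, seed=0):
    """Train a single-hidden-layer ELM: random (a_j, b_j), then ridge-solve for beta."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights a_j (fixed)
    b = rng.normal(size=n_hidden)                 # random biases b_j (fixed)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # sigmoid hidden-layer outputs
    # Regularized least squares for the output weights beta
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                               # argmax over columns gives the class
```

In our framework the rows of X would be the coding coefficients A and T the one-hot class labels, so training reduces to a single small linear solve regardless of the number of classes.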

Experimental Results and Discussion
In this section, we compare the performance of our proposed method with other feature extracting models, including SVM [12], FDDL [35], DPL [38], ResNet [44], RNN [21], and CNN [21] for HSI classification. We report the overall accuracy (OA), average accuracy (AA), and kappa coefficient of the different datasets and present the corresponding classification maps. The proposed method is evaluated, and relevant results are summarized and discussed in detail as follows.

Compared Methods and Evaluation Indexes
The SVM model (the codes for SVM were obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is a representative kernel-based method and has shown effective performance in HSI classification [12,13,45]. Yang et al. [35] proposed a complicated model named FDDL (the codes for FDDL were obtained from http://www4.comp.polyu.edu.hk/~cslzhang/papers.htm), which was applied to HSI classification in [46]. The DPL [38] method (http://www4.comp.polyu.edu.hk/~cslzhang/papers.htm) was constructed to reduce the running time of learning the dictionary model. Convolutional neural networks (CNNs) [21] (all the CNN models were downloaded from https://github.com/BehnoodRasti/HyFTech-Hyperspectral-Shallow-Deep-Feature-Extraction-Toolbox) are the most popularly adopted deep models for hyperspectral classification. Compared to traditional deep fully connected networks, CNNs possess weight-sharing and local-connection characteristics, making their training processes more efficient and effective. ResNet [44] adopts residual networks to address the degradation problem and enhance the convergence rate of the CNN model, and it has been employed in HSI classification [47]. Recurrent neural networks (RNNs) [48,49] process all the spectral bands as a sequence and adopt a flexible network structure to classify HSIs. All experiments were repeated 10 times, and the average classification results are reported for comparison.
We used the following criteria to evaluate the performance of the different methods for HSI classification:
Overall accuracy (OA): the number of correctly classified HSI pixels divided by the total number of test pixels [50];
Average accuracy (AA): the average value of the classification accuracies of all classes [50];
Kappa coefficient: a statistical measurement of agreement between the final classification and the ground-truth map [50].
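All three criteria can be computed from a single confusion matrix; the following sketch (our own helper, assuming rows index the true class and columns the predicted class) makes the definitions explicit:

```python
import numpy as np

def classification_metrics(cm):
    """OA, AA, and kappa from a confusion matrix cm[i, j] = # of class-i pixels labeled j."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                               # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))          # mean of per-class accuracies
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n ** 2     # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                      # agreement corrected for chance
    return oa, aa, kappa
```

Note that OA weights every test pixel equally, whereas AA weights every class equally, so the two diverge when the class sizes are imbalanced; kappa discounts the agreement expected by chance.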

Discussions of Different Datasets
(1) Center of Pavia: Table 4 lists the classification results of the compared algorithms, and Figure 4 shows the confusion matrix of our model (values rounded to one decimal place). In Table 4, one can observe that all the CNN-based models perform well. The best performance is achieved by the proposed framework, whose OA, AA, and kappa coefficient are 98.39%, 95.83%, and 97.23%, respectively. Compared with the dictionary learning- and deep learning-based models, our model gains significant classification accuracy for this dataset, especially for Class No. 2. The confusion matrix (Figure 4) indicates that our algorithm distinguishes surface regions quite effectively.
For illustrative purposes, Figure 5 shows the classification maps obtained by the compared methods on the Center of Pavia dataset. Figure 5a,b shows the RGB image and ground-truth map, and Figure 5c-h shows the corresponding classification results of SVM, FDDL, DPL, ResNet, RNN, CNN, and the proposed model. We employ yellow and red rectangles to highlight the interesting regions. We can observe from Figure 5 that the classification maps obtained by the proposed feature extractor are smoother in the regions sharing the same materials and sharper on the edges between different materials. The classification map produced by our model is the closest to the ground truth among the compared approaches. Our method is capable of extracting the intrinsic invariant feature representation from the HSI, achieving more effective feature extraction.
(2) Botswana: The class-specific classification accuracies for the Botswana dataset and the corresponding confusion matrix of our model are provided in Table 5 and Figure 6, respectively. From the results, one can see that the proposed algorithm outperforms the other algorithms in terms of OA, AA, and kappa, especially for Class Nos. 10 and 13. The proposed method significantly improves the results, with very high accuracy on the Botswana dataset. From the confusion matrix, our model shows more discriminative ability between the different classes; the confusion matrix also confirms the class-specific classification accuracies presented in Table 5. Figure 7 shows the classification maps for the Botswana dataset, where Figure 7a,b shows the RGB image and ground-truth map and Figure 7c-h shows the corresponding classification results of SVM, FDDL, DPL, ResNet, RNN, CNN, and the proposed model. We employ yellow and red rectangles to highlight the interesting regions. In the classification maps, the compared algorithms show more noisy scattered points; the proposed method removes them and leads to smoother classification results without blurring the boundaries. The result of our model is the closest to the ground truth among the state-of-the-art methods, demonstrating the effectiveness of the proposed structured dictionary learning model.
(3) Houston University 2013: Table 6 lists the classification results of the compared methods on the Houston University 2013 dataset, and Figure 8 shows the corresponding confusion matrix of our model. In Table 6, it is obvious that our model achieves slightly better performance than the CNN-based models. The OA, AA, and kappa coefficient of our framework are 86.82%, 86.44%, and 85.74%, respectively. Compared with the dictionary learning- and deep learning-based models, our model gains significant classification accuracy on this dataset, especially for Class Nos. 8, 9, and 12. The confusion matrix (Figure 8) indicates that our algorithm distinguishes surface regions quite effectively.

For illustrative purposes, Figure 9 shows the classification maps obtained by the compared methods on the Houston University 2013 dataset. Figure 9a,b shows the RGB image and ground-truth map, and Figure 9c-h shows the corresponding classification results of SVM, FDDL, DPL, ResNet, RNN, CNN, and the proposed model. We employ yellow and red rectangles to highlight the interesting regions. As shown in Figure 9, our model effectively removes the salt-and-pepper noise from the classification maps while preserving the meaningful structures and objects. Owing to its robustness to local changes in the spectra, our model obtains more accurate classification maps in and around the parking lot. Generally speaking, our model clearly shows superior performance in the effective classification of HSIs.

Small Training Samples
The impact of the sample size on HSI classification has been reported in many research studies [23,24,28]. To confirm the effectiveness of our framework with small training sets, we randomly selected 5% of the Botswana dataset for training and the rest for testing. As shown in Table 7, the classification performance is sensitive to the number of training samples: reducing the training set to 5% leads to a decrease of about 2%∼4% in classification accuracy. The OA, AA, and kappa of our model are 88.42%, 88.95%, and 87.46%, surpassing all the compared methods. This result suggests that our model has the potential to achieve high accuracy with a limited sample size.

Time Cost
All the experiments in this paper were implemented with MATLAB 2018b and Python on a Windows 10 operating system and conducted on an Intel Core i7-8700 CPU 3.20 GHz desktop with 16 GB of memory. The training and testing times of the different models are listed in Table 8. Overall, the training and testing times of our model are far less than those of the SVM- and CNN-based models, which clearly shows the superior efficiency of our approach in the classification application.

Conclusions
In this work, we propose an efficient spectral feature extraction framework for HSI data. This algorithm is more suitable for low spatial resolution HSIs with a lack of spatial features.
To improve the efficiency of our framework, we replace the sparsity constraint with the block-diagonal constraint to reduce the coding computation and employ an ELM model to map the coding coefficients. More importantly, we design spectral convolution and perform dictionary learning on these features to capture more local spectral characteristics of the data. We also design a new shared constraint to construct a discriminative dictionary during learning. Extensive experiments are conducted on three HSI datasets, and both the qualitative and quantitative results demonstrate the effectiveness of the proposed feature learning model. Furthermore, the proposed approach consistently achieves higher classification accuracy even with a small number of training samples. In comparison to the SVM- and CNN-based models, our framework requires much less computation time, which demonstrates its potential and superiority in the HSI classification task. In the future, we will incorporate spatial information into the model to further strengthen its feature representation ability.

Conflicts of Interest:
The authors declare no conflict of interest.