Hyperspectral Feature Extraction Using Sparse and Smooth Low-Rank Analysis

Abstract: In this paper, we develop a hyperspectral feature extraction method called sparse and smooth low-rank analysis (SSLRA). First, we propose a new low-rank model for hyperspectral images (HSIs) in which the HSI is decomposed into smooth and sparse components. These components are then estimated simultaneously by minimizing a nonconvex constrained penalized cost function (CPCF). The proposed CPCF exploits a total variation penalty, an ℓ1 penalty, and an orthogonality constraint. The total variation penalty promotes piecewise smoothness and therefore extracts spatial (local neighborhood) information. The ℓ1 penalty encourages sparse spatial structures. Additionally, we show that this new type of decomposition improves the classification of HSIs. In the experiments, SSLRA was applied to the Houston (urban) and Trento (rural) datasets. The extracted features were fed into a classifier (either support vector machines (SVM) or random forest (RF)) to produce the final classification map. The results confirm an improvement in classification accuracy compared with state-of-the-art feature extraction approaches.


Introduction
Hyperspectral cameras can acquire remotely sensed images in a large number of contiguous spectral bands. Thus, a hyperspectral image (HSI) contains detailed spectral information about a scene. Since many kinds of materials have unique spectral signatures, this type of image is useful for recognizing the materials in a captured scene [1]. On the other hand, due to the Hughes effect [2], also known as the curse of dimensionality, the high spectral dimensionality makes the analysis of HSIs a challenging task from both computational and statistical perspectives. The limited availability of training samples is a common issue in this kind of analysis since their collection can be both time demanding and expensive [3]. An increase in the number of spectral features, after a certain point, usually causes a decrease in classification accuracy when the number of training samples is limited. As a result, reducing the spectral dimensionality (feature reduction) is of great interest in HSI analysis [4]. In general, dimensionality reduction (DR) techniques can be divided into feature selection (FS) and feature extraction (FE). In this paper, we focus on FE.
FE is the process of finding a set of vectors that represent an observation while reducing the dimensionality. For data classification, it is desirable to extract informative features that are useful for differentiating between the classes of interest. Although DR is important for HSI analysis, the reduction in dimension has to occur without sacrificing the discriminative power of the classifier [5]. FE techniques can be broadly divided, based on the availability of training data, into two main groups: supervised FE (SFE) and unsupervised FE (USFE). SFE methods require training samples, while USFE techniques are used to extract distinctive features in the absence of labeled training data.
SFE has been widely studied in the hyperspectral community [1]. Discriminant analysis feature extraction (DAFE) [6] is a classical SFE approach. It is a parametric method that extracts features maximizing the ratio of the between-class variance to the within-class variance. The main shortcoming of DAFE is that the extracted transformation is not full rank; its rank is at most the number of classes minus one. In addition, the class mean values can strongly affect the performance of DAFE. Therefore, decision boundary feature extraction (DBFE) [7] and nonparametric weighted feature extraction (NWFE) [8] have been suggested for HSI classification. In DBFE, the decision boundary is defined by applying the Bayes decision rules to the training samples, and from that a decision boundary feature matrix is calculated to extract the feature vectors. Hence, DBFE can fail when there are too few training samples, since it works directly with the training samples to determine the location of the effective decision boundaries. NWFE is designed to overcome the limitations of parametric feature extraction by putting different weights on samples to compute local means, and it defines new nonparametric between-class and within-class scatter matrices to produce more features than DAFE. In addition, discriminant-analysis-based techniques such as linear constraint distance-based discriminant analysis (LCDA) [9], modified Fisher's linear discriminant analysis (MFLDA) [10], and a tensor-representation-based discriminant analysis [11] have all been proposed to improve the performance of DAFE.
Recent SFE approaches take the advantage of the local neighborhood properties (spatial information) of data. Li et al. [12] considered local Fisher's discriminant analysis [13] to perform DR while preserving the corresponding multi-modal structure. In [14], local neighborhood information is taken into account in both spectral and spatial domains to obtain a discriminative projection for dimensionality reduction of hyperspectral data. Xue et al. [15] introduced a nonlinear FE approach based on spatial and spectral regularized local discriminant embedding to address spatial variability and spectral multi-modality.
USFE techniques are usually based on optimizing an objective function to project the original features into a lower dimensional feature space. Principal component analysis (PCA) searches for a projection to maximize the signal variance [16]. Maximum noise fraction (MNF) [17] and noise adjusted principal components (NAPC) [18] seek a projection that maximizes the signal-to-noise ratio (SNR). Such FE approaches are mostly used for data representation, usually as a preprocessing step, and address the large size of hyperspectral datasets. Independent component analysis (ICA) [19,20], non-negative matrix factorization (NMF) [21,22], and hyperspectral unmixing [23,24] are other examples of USFE techniques.
Some FE techniques are proposed based on preserving local (spatial) information [25,26]. Neighborhood preserving embedding (NPE) [27], locality preserving projection (LPP) [28], and linear local tangent space alignment (LLTSA) [29] have been proposed for hyperspectral FE [30,31]. The work in [32] develops a tensor version of the LPP algorithm for hyperspectral DR and classification. The work in [33] proposes a common minimization framework called graph-embedding (GE), which is based on estimating an undirected weighted graph to describe the desired intrinsic (statistical or geometrical) properties of the data. The method uses either scale normalization or penalty graph constraints that describe undesirable properties. In [34], a sparse graph-based discriminant analysis (SGDA) technique that induces sparsity on the graph construction is proposed for hyperspectral DR and classification. SGDA may not obtain acceptable results when the input data have a nonlinear and complex nature. To address this issue, a kernel extension of SGDA is proposed in [35]. An FE method based on image fusion and recursive filtering [36] is designed in [37], which incorporates spatial information to extract informative features. In [38], a DR approach is developed to estimate a sparse and low-rank projection matrix by fulfilling the restricted isometry property condition on all secants of hyperspectral data, preserving the nearest neighbor points of all pixels to further improve the subsequent classification step. Total variation (TV) regularization is suggested in [39] for HSI feature extraction. Wavelet-based sparse reduced-rank regression [40] and sparse and low-rank modeling [41] have also been suggested for hyperspectral feature extraction.
Recently, in [42], orthogonal total variation component analysis (OTVCA) is proposed, where a non-convex cost function is optimized to find the best representation for HSI in a low dimensional feature space while controlling the spatial smoothness of the features by using a TV regularization. The TV penalty promotes piecewise smoothness (homogeneous spatial regions) on the extracted features, and thus substantially helps to extract spatial (local neighborhood) information that is very useful for classification.
In this paper, we propose a USFE technique for the classification of HSIs called sparse and smooth low-rank analysis (SSLRA). SSLRA decomposes the HSI into sparse and piecewise smooth components. To capture the spectral redundancy of the HSI and perform DR, we assume that these components can be represented in a lower dimensional space. Therefore, we propose a low-rank model in which the HSI is modeled as a combination of sparse and smooth components. The components are estimated simultaneously by optimizing a constrained penalized cost function (CPCF). The TV and ℓ1 penalties are exploited by the CPCF to promote the smoothness and the sparsity of the corresponding components, respectively. We assume that the unknown bases are orthogonal, and therefore we solve the CPCF under an orthogonality constraint. In the experiments, we used two HSIs: (1) an urban HSI of the University of Houston campus; and (2) a rural HSI of the Italian city of Trento. The experiments confirmed that SSLRA outperforms both OTVCA and state-of-the-art FE techniques with respect to classification accuracy.
The organization of the paper is as follows. The proposed hyperspectral feature extraction technique (SSLRA) and the corresponding algorithm are derived and explained in Section 2. Section 3 is devoted to the experimental results. Finally, Section 4 concludes the paper.

Notation
The notation used in this paper is as follows. The number of spectral bands and the number of pixels in each band are denoted by p and n, respectively, and r denotes the rank of the HSI. Matrices are represented by bold capital letters and vectors by bold letters; the (i, j)th element of X is denoted by x_ij and the ith column by x^(i). I_p denotes the identity matrix of size p × p, and X̂ stands for the estimate of X. The Frobenius norm and the TV-norm are denoted by ∥·∥_F and ∥·∥_TV, respectively. The definitions of the symbols used in the paper are given in Table 1.

Sym.       Definition
x_i        the ith entry of the vector x
x_ij       the (i, j)th entry of the matrix X
x^(i)      the ith column of the matrix X
x_j^T      the jth row of the matrix X
∥x∥_1      ℓ1-norm of the vector x, obtained by ∑_i |x_i|

Hyperspectral Modeling and Sparse and Smooth Low-Rank Analysis
HSIs are often represented by using low-rank models. Such models have, for example, been shown to be more appropriate for HSI in terms of mean squared error than full-rank models [41]. However, the rank is unknown and has to be estimated [43,44]. We model the observed HSI as

Y = X + N,          (1)

where Y = [y^(1), ..., y^(p)] is an n × p matrix containing the vectorized observed image at band i in its ith column, X = [x^(1), ..., x^(p)] is an n × p matrix representing the HSI, and N = [n^(1), ..., n^(p)] is an n × p matrix that represents the noise and model error. Here, we assume that X is a low-rank matrix, i.e., it has rank r << min(n, p). The low-rank property can be enforced by representing X as a product of two rank-r matrices, (F + S) and V^T, which leads to the following low-rank model:

Y = (F + S)V^T + N,          (2)

where F = [f^(1), ..., f^(r)] and S = [s^(1), ..., s^(r)] are matrices of size n × r containing the unknown smooth and sparse components, respectively, and V is an unknown p × r subspace matrix. The model in Equation (2) separates the sparse features from the smooth features. The smooth features can be used to promote smooth regions of interest in the classification maps.
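As a quick illustration of the model in Equation (2), the following sketch builds a synthetic observation Y = (F + S)V^T + N with a piecewise-constant smooth part, a sparse part, and an orthonormal basis. All sizes, sparsity levels, and the noise scale here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 400, 60, 5   # pixels, bands, rank (illustrative sizes)

# Smooth component F: piecewise-constant columns (blocks of repeated values).
F = np.repeat(rng.normal(size=(n // 20, r)), 20, axis=0)

# Sparse component S: mostly zeros, with a few active entries.
S = np.zeros((n, r))
idx = rng.choice(n * r, size=30, replace=False)
S.ravel()[idx] = rng.normal(scale=3.0, size=30)

# Orthonormal subspace basis V (p x r), obtained via QR factorization.
V, _ = np.linalg.qr(rng.normal(size=(p, r)))

# Observed HSI: low-rank signal plus noise, Y = (F + S) V^T + N.
Y = (F + S) @ V.T + 0.05 * rng.normal(size=(n, p))
```

By construction, the noise-free part (F + S)V^T has rank at most r, matching the low-rank assumption of the model.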
Assuming the model in Equation (2), we propose a CPCF to simultaneously estimate F, S, and V by solving

min_{F,S,V} J(F, S, V)   subject to   V^T V = I_r,          (3)

where

J(F, S, V) = (1/2) ∥Y − (F + S)V^T∥_F^2 + λ1 ∑_{i=1}^{r} ∥f^(i)∥_TV + λ2 ∥S∥_1.

In Equation (3), the TV-norm (see Appendix A) promotes piecewise smoothness on F, the ℓ1-norm promotes sparsity on S, and the constraint V^T V = I_r enforces orthogonality on the subspace basis. Note that the minimization in Equation (3) is non-convex and therefore a solution might only be a local minimum.
To solve Equation (3), we use a cyclic descent (CD) algorithm [45][46][47] where the problem is solved with respect to one matrix at a time while the others are assumed to be fixed. As a result, the proposed CD approach consists of the four steps discussed below: initialization, the F step, the S step, and the V step.
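The cyclic descent scheme itself is independent of how each subproblem is solved. A minimal generic skeleton, with caller-supplied step functions and an initialization by the top-r right singular vectors of Y (consistent with the spectral-eigenvector initialization used in the experiments), might look as follows; the function name and interface are assumptions for illustration:

```python
import numpy as np

def sslra_cd(Y, r, f_step, s_step, v_step, n_iter=50):
    """Generic cyclic-descent skeleton for the SSLRA CPCF:
    each step minimizes the cost over one matrix while the
    others are fixed (step functions are supplied by the caller)."""
    n, p = Y.shape
    # Initialize V with the top-r right singular vectors of Y.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:r].T
    F = np.zeros((n, r))
    S = np.zeros((n, r))
    for _ in range(n_iter):
        G = Y @ V            # project the data onto the current subspace
        F = f_step(G, S)     # F-step: TV-regularized subproblem
        S = s_step(G, F)     # S-step: sparse (soft-thresholding) subproblem
        V = v_step(Y, F, S)  # V-step: orthogonal Procrustes rotation
    return F, S, V
```

Because each step minimizes the same cost over one block of variables, the cost is non-increasing over the outer loop regardless of which concrete solvers are plugged in.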

F-Step
Given fixed V_m and S_m, obtain F_{m+1} by solving Equation (3) with respect to F, which reduces to

F_{m+1} = argmin_F (1/2) ∥G − F − S_m∥_F^2 + λ1 ∑_{i=1}^{r} ∥f^(i)∥_TV,          (4)

where G = [g^(1), ..., g^(r)] = Y V_m. The problem in Equation (4) can be viewed as r separable TV-regularization problems [48] that can be solved using the split Bregman iteration method given in [49] (also known as the alternating direction method of multipliers (ADMM) [50]), denoted by F_{m+1} = TV_{λ1}(G − S_m).
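A minimal runnable sketch of the F-step is given below, assuming a smoothed one-dimensional TV along each column and plain gradient descent. The paper instead applies split Bregman/ADMM to 2D TV problems on the image grid; `f_step` and its default parameters are illustrative assumptions:

```python
import numpy as np

def f_step(G, S, lam1, n_iter=300, step=0.05, eps=0.05):
    """Illustrative gradient-descent solver for the F-step
        min_F 0.5*||G - F - S||_F^2 + lam1 * sum_i TV(f^(i)),
    using the smoothed TV surrogate sqrt(d^2 + eps^2) on first
    differences along each column.  (A sketch only; the paper
    uses split Bregman / ADMM on the 2D image grid.)"""
    F = (G - S).copy()                          # start at the least-squares solution
    for _ in range(n_iter):
        d = np.diff(F, axis=0)                  # first differences per column
        w = d / np.sqrt(d ** 2 + eps ** 2)      # derivative of the smoothed |d|
        tv_grad = np.zeros_like(F)
        tv_grad[:-1] -= w                       # d(TV)/dF[k] = w[k-1] - w[k]
        tv_grad[1:] += w
        grad = (F - (G - S)) + lam1 * tv_grad   # data term + TV term
        F -= step * grad
    return F
```

With the default step size the iteration monotonically decreases the smoothed objective, so the returned F attains a lower cost than the initialization G − S_m.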

S-Step
Given fixed V_m and F_{m+1}, obtain S_{m+1} by solving Equation (3) with respect to S, i.e.,

S_{m+1} = argmin_S (1/2) ∥G − F_{m+1} − S∥_F^2 + λ2 ∥S∥_1.          (5)

It can be shown that Equation (5) is separable over the elements of S and that the solution is given by

ŝ_ij = sign(g_ij − f_ij) max(|g_ij − f_ij| − λ2, 0),          (6)

which is called soft-thresholding and is often written as

S_{m+1} = soft(G − F_{m+1}, λ2).          (7)

Note that the soft function in Equation (7) is applied element-wise on the matrix G − F_{m+1}.
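The element-wise soft-thresholding operator of Equations (6) and (7) is straightforward to implement; the following NumPy sketch (the function name `soft` follows the text's notation) shows the closed-form S-step:

```python
import numpy as np

def soft(X, lam):
    """Element-wise soft-thresholding operator:
    soft(x, lam) = sign(x) * max(|x| - lam, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

# S-step: the closed-form minimizer of
#   min_S 0.5*||(G - F) - S||_F^2 + lam2*||S||_1
# is S = soft(G - F, lam2); entries smaller than lam2 in
# magnitude are set exactly to zero, which produces sparsity.
```

For example, thresholding [3.0, -0.5, 1.5] at λ2 = 1 shrinks the large entries by 1 and zeroes out the small one.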

V-Step
Given fixed F_{m+1} and S_{m+1}, obtain V_{m+1} by solving Equation (3) with respect to V, which can be rewritten as

max_V tr(V^T Y^T (F_{m+1} + S_{m+1}))   subject to   V^T V = I_r.          (8)

The solution is given by a low-rank Procrustes rotation [51],

V_{m+1} = P Q^T,          (9)

where the matrices P and Q are given by the following truncated SVD of rank r:

Y^T (F_{m+1} + S_{m+1}) = P Σ Q^T,          (10)

and Σ is a diagonal matrix which contains the first r singular values of Y^T (F_{m+1} + S_{m+1}). The resulting algorithm is summarized in Algorithm 1. The monotonicity of SSLRA follows easily since, by construction, the algorithm guarantees that the cost function is non-increasing with respect to the iteration index, i.e.,

J(F_{m+1}, S_{m+1}, V_{m+1}) ≤ J(F_{m+1}, S_{m+1}, V_m) ≤ J(F_{m+1}, S_m, V_m) ≤ J(F_m, S_m, V_m).

Therefore, the cost function is guaranteed to decrease or stay the same at each iteration of the algorithm. Since the cost function iterates are upper bounded (by the initial value) and lower bounded (the cost is greater than or equal to zero), they converge to a finite value.
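The Procrustes rotation above reduces to one truncated SVD; a compact NumPy sketch (the function name and argument layout are assumptions for illustration) is:

```python
import numpy as np

def v_step(Y, F, S, r):
    """V-step as a low-rank orthogonal Procrustes rotation:
        min_V ||Y - (F + S) V^T||_F^2  s.t.  V^T V = I_r
    is solved by V = P Q^T, where Y^T (F + S) = P Sigma Q^T
    is the truncated SVD of rank r."""
    A = Y.T @ (F + S)                             # p x r matrix
    P, sigma, Qt = np.linalg.svd(A, full_matrices=False)
    return P[:, :r] @ Qt[:r, :]                   # p x r, orthonormal columns
```

Since the Procrustes solution is the global minimizer over matrices with orthonormal columns, any other orthonormal V gives an equal or larger residual.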
Note that the smooth features (F) extracted using SSLRA are used for classification purposes in this paper. This is discussed further in Section 3.

Experimental Results
Two HSIs, the Houston and Trento datasets, described below, were used in the experiments. The Houston dataset investigation is presented in Sections 3.2-3.4. The Trento dataset experiment is presented in Section 3.4. Two classifiers were used in the experiments: Random Forest (RF) and Support Vector Machine (SVM). For the RF, we set the number of trees to 200. For the SVM, a radial basis function (RBF) kernel was used. The penalty parameter C and spread of the RBF kernel γ were selected by searching in the range of 10 −2 , 10 −1 , . . . , 10 4 and 2 −3 , 2 −2 , . . . , 2 4 , respectively, using five-fold cross-validation. The classification metrics used in the experiments are Average Accuracy (AA), Overall Accuracy (OA), and Kappa Coefficient (κ) (see A.5 in [41]).

Trento
The first dataset is from a rural area in the south of the city of Trento, Italy. The size of the dataset is 600 × 166 pixels. The AISA Eagle sensor acquired the HSI with a spatial resolution of 1 m. The hyperspectral data consist of 63 bands ranging from 0.40 to 0.99 µm, where the spectral resolution is 9.2 nm. The available ground truth consists of six classes, i.e., Building, Wood, Apple Tree, Road, Vineyard, and Ground. Figure 1 illustrates a false color composite representation of the hyperspectral data along with the corresponding training and test samples. Table 2 provides information on the number of training and test samples for each class of interest.

Houston

The Compact Airborne Spectrographic Imager (CASI) captured the second HSI over the University of Houston campus and the neighboring urban area. The data size is 349 × 1905 pixels, and the spatial resolution is 2.5 m. The hyperspectral dataset consists of 144 spectral bands ranging from 0.38 to 1.05 µm. The 15 classes of interest are: Grass Healthy, Grass Stressed, Grass Synthetic, Tree, Soil, Water, Residential, Commercial, Road, Highway, Railway, Parking Lot 1, Parking Lot 2, Tennis Court, and Running Track. "Parking Lot 1" includes parking garages at the ground level and also in elevated areas, while "Parking Lot 2" corresponds to parked vehicles. Figure 2 illustrates a false color composite representation of the hyperspectral data and the corresponding training and test samples. Table 3 provides information on the number of training and test samples.
It is important to note that we used the standard sets of training and test samples for the datasets mentioned above to make the results entirely comparable with most of the approaches available in the literature.

Performance of SSLRA with Respect to Tuning Parameters
The assessment of the effect of the tuning parameters (λ1 and λ2) on the performance of the proposed algorithm was of interest. Since we were interested in the classification accuracy, we investigated the effect of the smoothing parameter (λ1) and the sparsity parameter (λ2) on OA. We selected each tuning parameter value as a percentage of the range of the intensity values of the data, i.e.,

λ_j = T_j (y_max − y_min),   j = 1, 2,

where 0 ≤ T_j ≤ 1 and y_max and y_min denote the maximum and minimum intensity values of the observed HSI. Figure 3 shows the contour plot of the OA with respect to T_1 and T_2 for both the random forest (RF) and the support vector machine (SVM) classifiers. It can be seen that, along the diagonal line T_1 = T_2, the OA shows little variability and takes on high values. The results confirm, for this example, that if the tuning parameters are selected to be equal, SSLRA is less sensitive in terms of OA with respect to T_1 and T_2. Tuning parameter selection is non-trivial and often a computationally expensive task.
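The selection rule above, a tuning parameter expressed as a fraction of the intensity range, can be written as a one-line helper; the exact formula here is a reconstruction consistent with the text, not taken verbatim from the paper:

```python
import numpy as np

def lambda_from_T(Y, T):
    """Map a unitless fraction 0 <= T <= 1 to a penalty parameter
    as a percentage of the intensity range of the data
    (reconstruction of the selection rule described in the text)."""
    return T * (Y.max() - Y.min())
```

For instance, with data spanning intensities 0 to 10, T = 0.2 yields λ = 2.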
To save computation, we selected T_1 = T_2 = T. Here, we selected r = 15, the number of classes, and assumed that it is the dimension of the subspace. Note that the number of classes of interest can be regarded as a lower bound on the dimension of the subspace.

Comparisons with Respect to the Tuning Parameter
The tuning parameter T controls the amount of smoothness of F, and therefore it can affect the classification accuracies. Hence, it is of interest to compare the performances of OTVCA and SSLRA with respect to T. Figure 4 shows that SSLRA and OTVCA give similar OA with respect to T for RF, but SSLRA gives higher overall accuracies for SVM except when T = 0.8. Note that increasing T causes oversmoothing of the extracted features, which might lead to a loss of information in the final classification map. We selected T = 0.2 and T = 0.4 to avoid oversmoothing of the features.

Comparisons with Respect to the Number of Features
A major advantage of FE techniques is their DR capability. In HSI classification, DR is of great interest since the spectral redundancy makes HSI classification computationally expensive; moreover, DR can improve the classification since it addresses the Hughes effect, or the curse of dimensionality, to a large extent [2]. As a result, we investigated the performance of SSLRA in terms of OA with respect to r. Figure 5 depicts the DR performance of SSLRA in terms of OA with respect to the number of features r for both the RF and SVM classifiers. For both SVM and RF, and for both T = 0.2 and T = 0.4, SSLRA provides high OAs (close to 90%) when r = 15, which confirms the good DR performance of SSLRA. Additionally, Figure 5 compares SSLRA with OTVCA in terms of OA with respect to r for both classifiers. The figure shows that SSLRA outperforms OTVCA in terms of OA and demonstrates better DR for the SVM classifier. For the RF, SSLRA and OTVCA perform similarly.

Comparisons with Respect to the Number of Training Samples
A major problem in classification applications is collecting reliable ground truth. The number of labeled samples per class is usually limited compared with the number of pixels. Hence, it is of interest to evaluate the performance of the proposed technique with respect to the number of samples selected per class. Figure 6 shows the OA as a function of the number of training samples per class. Both OTVCA and SSLRA show a similar trend in terms of OA with respect to the number of training samples. We see that SSLRA provides high accuracy using only a few training samples per class (5 and 10) for both classifiers, which is of great interest in the remote sensing community. It can also be observed that, by increasing the number of training samples to only 50 samples per class, the OA reaches over 97% in all cases shown for SSLRA. We note that only in the experiments presented in this subsection did we not use the standard sets of test and training samples; instead, we selected the samples randomly in order to show the performance of the techniques with respect to the number of training samples.

The features extracted using SSLRA contain more homogeneous regions than the ones extracted by OTVCA. Figure 8 demonstrates this: it shows a portion of feature 2 extracted by SSLRA compared with the corresponding OTVCA component, and it also depicts the sparse structures extracted by SSLRA. The sparse structures decrease the classification accuracies since they are not frequently included in the regions of interest, and, therefore, class labels are not available for them. SSLRA separates the sparse structures from the smooth ones, which increases the classification accuracy and provides homogeneous class regions in the final classification map.

Performance of SSLRA with Respect to the State-of-the-Art
Here, we compared the performance of SSLRA with PCA, MNF [17], DAFE [6], NWFE [8], and SELD [54] as competitive FE approaches applied to Houston and Trento. The number of features was set to the number of classes of interest (i.e., 15 for Houston and 6 for Trento), except for DAFE, which extracts at most one feature fewer than the number of classes. In the tables, HSI denotes the classification results obtained on the original spectral bands. Tables 4 and 5 show the classification accuracies obtained by applying SVM and RF, respectively, on Houston's components extracted using different FE techniques. Similarly, Tables 6 and 7 show the classification accuracies for Trento. In general, SSLRA outperforms the other FE approaches. For Houston, SSLRA improves OA by over 13% using RF and 6% using SVM compared to HSI. For Trento, SSLRA improves OA by over 9% using RF and 5% using SVM compared to HSI. OTVCA achieves the second best performance in terms of classification accuracy, followed by MNF. As can be seen, DAFE gives the lowest OAs. Figures 9 and 10 show the classification maps for the Houston and Trento datasets, respectively. These figures show that the maps obtained by SSLRA contain homogeneous class regions, which is of interest in classification applications. Table 8 compares the CPU processing time (in seconds) spent by different feature extraction techniques applied on the Trento and Houston datasets. All methods were implemented in Matlab on a computer having an Intel(R) Core(TM) i7-6700 processor (3.40 GHz), 32 GB of memory, and a 64-bit operating system. It can be seen that SSLRA and OTVCA are computationally expensive compared with the other techniques used in the experiments. That is mainly due to the iterative nature of those algorithms, the inner TV-regularization loop, and the SVD step.
It is worth noting that the CPU time for the supervised techniques (DAFE and NWFE) is affected considerably by the number of labeled (training) samples used, and that of the semi-supervised technique (SELD) is affected by both the labeled and unlabeled samples used. In the case of unsupervised techniques, the CPU time is affected by the total size of the data. Figure 11 depicts the values of the cost function J and the values of the stopping criterion ((J_{m+1} − J_m)/J_1) when SSLRA was applied on the Houston and Trento datasets. It can be seen that the cost function values are decreasing for both datasets, as stated in Section 2. The stopping criterion values fall below 0.001 after 62 and 48 iterations for Houston and Trento, respectively. Therefore, in the experiments, we set the number of iterations to 100. Figure 12 compares the values of the cost function J with respect to the number of iterations for two different initializations of SSLRA. It can be seen that the random orthogonal matrix initialization gives higher cost function values for all iterations shown compared with the spectral eigenvector initialization. Therefore, in this paper, spectral eigenvectors were used to initialize SSLRA, i.e., V_0 = W, where W denotes the matrix of spectral eigenvectors. Note that the proposed cost function is nonconvex, and therefore different initializations might converge to different local optima.

Discussion
The conventional FE methods used in the experiments, i.e., PCA, MNF, DAFE, and NWFE, do not take into account the spatial correlation of the HSI, which can considerably improve the classification results [1,42]. SELD incorporates the spatial correlation by using unlabeled samples. However, the number of unlabeled samples highly affects the complexity of the algorithm, and having few samples does not provide satisfactory spatial information [54]. This drawback is considerably mitigated in OTVCA, which captures both spectral and spatial redundancies while extracting informative components. The spectral redundancy of the HSI is captured by the low-rank representation in the OTVCA model. TV regularization not only captures the spatial redundancy of the HSI but also induces piecewise smoothness on the HSI features, which helps to extract spatial information and reduce salt-and-pepper noise. However, there are sparse structures in the components extracted by OTVCA that are mostly treated as outliers in the classification task. SSLRA improves the classification accuracies by separating the sparse structures and providing smoother components. As a result, the classification map obtained contains less salt-and-pepper noise and more homogeneous class regions. On the other hand, SSLRA is computationally more expensive than the other methods.

Conclusions
Sparse and smooth low-rank analysis was proposed for hyperspectral image feature extraction. First, a low-rank model was proposed in which the HSI was modeled as a combination of smooth and sparse components. A constrained penalized cost function minimization was proposed to estimate the smooth and sparse components, using the TV penalty and the ℓ1 penalty to promote smoothness and sparsity, respectively, while an orthogonality constraint was applied to the subspace basis. Then, an iterative algorithm was derived for solving the proposed non-convex minimization problem. In the experiments, it was shown that SSLRA outperforms other FE methods in terms of classification accuracy for urban and rural HSIs. It was also demonstrated that the components extracted by SSLRA provide relatively high classification accuracies when only a limited number of training samples is available. Additionally, the experiments confirmed that, by separating the sparse features from the smooth ones, SSLRA reduces the salt-and-pepper noise effect and produces homogeneous class regions in the classification maps. On the other hand, SSLRA is more complicated and computationally expensive compared with the techniques used in the experiments.