Sparse Feature Learning of Hyperspectral Imagery via Multiobjective-Based Extreme Learning Machine

A hyperspectral image (HSI) consists of hundreds of narrow spectral band components with rich spectral and spatial information. The Extreme Learning Machine (ELM) has been widely used for HSI analysis. However, the classical ELM is difficult to use for sparse feature learning due to its randomly generated hidden layer. In this paper, we propose a novel unsupervised sparse feature learning approach, called Evolutionary Multiobjective-based ELM (EMO-ELM), and apply it to HSI feature extraction. Specifically, we formulate the construction of the ELM Autoencoder (ELM-AE) as a multiobjective optimization problem that takes the sparsity of the hidden layer outputs and the reconstruction error as two conflicting objectives. We then adopt an Evolutionary Multiobjective Optimization (EMO) method to optimize the two objectives simultaneously. To find the best solution in the Pareto solution set and construct the best trade-off feature extractor, a curvature-based method is proposed to focus on the knee area of the Pareto solutions. Benefiting from EMO, the proposed EMO-ELM is less prone to falling into a local minimum and has fewer trainable parameters than gradient-based AEs. Experiments on two real HSIs demonstrate that the features learned by EMO-ELM not only preserve better sparsity but also achieve better separability than many existing feature learning methods.


Introduction
Hyperspectral Imagery (HSI), which is obtained by remote sensing systems, contains high-resolution spectral information over a wide range of the electromagnetic spectrum with hundreds of observed spectral bands [1]. The detailed spectral and spatial information provides the power to accurately differentiate or recognize materials of interest [2]. In recent years, HSI has been applied in a wide variety of applications, including agriculture, surveillance, astronomy, and biomedical imaging, among others [3]. However, the high redundancy among spectral bands brings a heavy computational burden to HSI data analysis [4]. Furthermore, the high dimensionality gives rise to the Hughes phenomenon, namely the so-called curse of dimensionality [5,6].
As an effective solution, feature learning overcomes these issues and helps guarantee good classification accuracy [7][8][9][10][11][12][13]. Conventional feature learning methods, such as Principal Component Analysis (PCA) [9] and its variants [14][15][16], are widely applied to HSI. However, PCA produces linear combinations of all the original bands, so its results are often difficult to interpret; to improve PCA's performance, Sparse PCA (SPCA) [11] introduces sparsity constraints on the loadings. In this paper, we instead construct a multiobjective model composed of two conflicting objectives: the sparsity of the learned features and the reconstruction error. In order to solve this two-objective model, EMO is employed, since it is well suited to this type of optimization problem. Compared with AE and SAE, the number of parameters is reduced to half; compared with ELM-AE and SELM-AE, the proposed approach learns features in a nonlinear manner, and the use of an optimization method avoids the influence of the nonlinear random projection (NRP). By constructing the two conflicting objectives of sparsity and reconstruction error and extracting features from solutions selected on the optimal Pareto Front (PF) obtained by EMO, EMO-ELM improves the classification accuracy while ensuring sparsity. Experimental results verify that the proposed method outperforms NRP, SPCA, AE, SAE, ELM-AE, and SELM-AE in terms of sparsity and reconstruction error.
The main contributions of this paper can be summarized as follows: (1) A novel unsupervised feature learning approach is proposed; (2) We combine the advantages of basic ELM and conventional AE; (3) A hybrid feature learning framework is presented which uses EMO for training.
The remainder of this paper is organized as follows. Section 2 reviews the related work, including ELM-AE and AE. Section 3 describes the proposed EMO-ELM method in detail. The experimental results and discussions are presented in Section 4. Finally, the conclusions are given in Section 5.

ELM-AE
ELM-AE can be divided into NRP and LR, and its structure is shown in Figure 1a. Let us consider N distinct original samples $x_i \in \mathbb{R}^n$ (an n-dimensional feature space). For example, when ELM-AE is used to extract features of the SalinasA data set for comparison in this paper, N = 86 × 83 and n = 204. Supposing the dimension of the extracted feature $\tilde{X}$ is 100, $\tilde{X}$ then has the same number of samples, N = 86 × 83, as the original data X. The input and output layers have the same number of neurons, so the input and output vectors have the same dimension n. The input layer corresponds to the encoding part of an autoencoder, while the output layer corresponds to the decoding part. Further supposing that the number of hidden neurons L is less than the number of input neurons n and that the weight matrix is orthogonal, the first part of ELM-AE maps X into an L-dimensional space by

$$h_i = g(Wx_i + b), \qquad WW^T = I, \qquad b^Tb = 1,$$

where $h_i$ denotes the output vector of the hidden layer for the i-th input sample, $x_i$ is the i-th input sample, W is NRP's orthogonal weight matrix connecting the neurons of the input layer with the neurons of the hidden layer, and b is the orthogonal bias vector of the hidden layer. The output weights β of ELM-AE are responsible for the transformation from the feature space back to the input data and satisfy

$$\min_{\beta}\; \|H\beta - X\|^2,$$

where H is the matrix of hidden-layer outputs. According to ELM-AE theory, the output weights β admit a closed-form solution that is determined by V, the eigenvectors of the covariance matrix $X^TX$.
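The following is a minimal sketch (not the authors' implementation) of this first, encoding part of ELM-AE: an orthogonal random projection with a unit-norm bias followed by a Sigmoid activation. The matrix W is stored here as n × L, i.e., the transpose of the W in the equation above.

```python
# Minimal sketch of the ELM-AE encoding step (orthogonal NRP + Sigmoid).
import numpy as np

def elm_ae_hidden(X, L, seed=0):
    """X: (N, n) data matrix; returns H: (N, L) hidden-layer outputs, plus W, b."""
    rng = np.random.RandomState(seed)
    n = X.shape[1]
    W = rng.randn(n, L)
    W, _ = np.linalg.qr(W)              # orthonormal columns (requires L <= n)
    b = rng.randn(L)
    b /= np.linalg.norm(b)              # unit-norm bias, b^T b = 1
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # Sigmoid activation
    return H, W, b

# Example sizes from the SalinasA description: N = 86 * 83, n = 204, L = 100.
```
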
The new feature can be obtained in ELM-AE by projecting the data X along the decoder-stage weights β as

$$\tilde{X} = X\beta^T,$$

where $\tilde{X}$ is the new feature and can be applied to hyperspectral image classification. When the number of input neurons in ELM-AE is smaller than the number of hidden neurons (n < L), the weight matrix is sparse. According to ELM theory, W and b can be randomly generated, and g(·) is any nonlinear activation function, such as the Sigmoid function. To further increase the model's sparsity, the orthogonal W and b can be generated as follows [37]. Let $w_{ij}$ denote the weight connecting the i-th neuron in the input layer with the j-th neuron in the hidden layer, $b_j$ the bias of the j-th neuron in the hidden layer, and p the ratio of elements in the sparse random matrix W. Each $w_{ij}$ and $b_j$ is generated randomly according to the p-value, similarly to the roulette-wheel algorithm: before every matrix element is generated, a random probability r is drawn uniformly from (0, 1), and the element is set according to r. When $r < 1/3$, the value of $w_{ij}$ (or $b_j$) is $\sqrt{3/L}$; when $1/3 \le r < 5/6$, the value is 0; and when $5/6 \le r < 1$, the value is $-\sqrt{3/L}$. The output weights β of SELM-AE are calculated in the same way and, according to ELM-AE theory, are likewise determined by V, the eigenvectors of the covariance matrix $X^TX$. Dimension reduction is achieved in SELM-AE by projecting the data X along the decoder-stage weights β as $\tilde{X} = X\beta^T$.
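A sketch of the sparse random weight generation described above (SELM-AE) follows. The entry magnitude sqrt(3/L) is an assumption; the probability thresholds 1/3 and 5/6 follow the description in the text.

```python
# Sketch of the sparse random weights used by SELM-AE.
import numpy as np

def sparse_random_weights(n, L, seed=0):
    """Returns a sparse random weight matrix W (n x L) and bias vector b (L,)."""
    rng = np.random.RandomState(seed)
    r = rng.rand(n + 1, L)              # one extra row is used for the bias
    a = np.sqrt(3.0 / L)                # assumed magnitude of non-zero entries
    M = np.zeros_like(r)
    M[r < 1.0 / 3.0] = a                # +a when r < 1/3
    M[r >= 5.0 / 6.0] = -a              # -a when r >= 5/6; zero otherwise
    return M[:-1], M[-1]
```
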
The second part of ELM-AE is a well-known ridge regression (regularized least squares) problem, which solves the output weight matrix β, where $\beta_{ji}$ denotes the weight connecting the j-th neuron in the hidden layer with the i-th neuron in the output layer, by minimizing the following learning problem:

$$\min_{\beta}\; \|\beta\|_t^{\sigma_1} + \lambda\|H\beta - X\|_q^{\sigma_2}.$$

The first term in the objective function penalizes the output weight matrix, and the second term represents the reconstruction error; λ is a regularization coefficient that controls the complexity of the learned model. According to the ELM theorem [37,40], $\sigma_1 > 0$, $\sigma_2 > 0$, and t, q can be any matrix norm (e.g., $t, q = 0, \frac{1}{2}, 1, 2, \cdots, +\infty$). To obtain an analytical solution, $\sigma_1$, $\sigma_2$, t, and q are often set to 2, which gives the well-known ridge regression problem. However, $t, q = 0$ (i.e., the L0-norm) is also possible; in that case $\|\beta\|_0$ denotes the number of non-zero output weights, and $\|H\beta - X\|_0$ denotes the number of non-zero reconstruction errors for the original data X.

When the penalty coefficient λ in the above learning problem is infinitely large, β can be calculated as

$$\beta = H^{\dagger}X,$$

where $H^{\dagger}$ indicates the Moore-Penrose generalized inverse of H. According to the ELM learning theory, an ELM with a minimum norm of output weights has better generalization performance and a more robust solution [41]. Different expressions for the output weights β can be used depending on efficiency concerns for different sizes of training sets. When the number of training samples is not huge, the output weights can be expressed as

$$\beta = H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}X,$$

and the corresponding output function is $f(x) = h(x)H^T\left(\frac{I}{\lambda} + HH^T\right)^{-1}X$ (here the target matrix is the input data X, since ELM-AE reconstructs its input; for an ELM classifier it would be the label matrix). When the number of training samples is very large, the output weights can be represented as

$$\beta = \left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TX,$$

with the output function $f(x) = h(x)\left(\frac{I}{\lambda} + H^TH\right)^{-1}H^TX$. Subsequently, the new feature $\tilde{X}$ can be represented as

$$\tilde{X} = X\beta^T.$$
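The sketch below shows the closed-form output weights for the ridge case ($\sigma_1 = \sigma_2 = t = q = 2$), using the two standard expressions depending on the number of training samples, followed by the feature projection $X\beta^T$. The regularization value is an assumption.

```python
# Closed-form ELM-AE output weights (ridge regression) and feature projection.
import numpy as np

def elm_ae_output_weights(H, X, lam=1e3):
    """H: (N, L) hidden outputs, X: (N, n) reconstruction targets; returns beta: (L, n)."""
    N, L = H.shape
    if N <= L:      # few training samples
        return H.T @ np.linalg.solve(np.eye(N) / lam + H @ H.T, X)
    return np.linalg.solve(np.eye(L) / lam + H.T @ H, H.T @ X)   # many samples

# new_features = X @ elm_ae_output_weights(H, X).T     # X beta^T, shape (N, L)
```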

AE
AE has a symmetrical encoder-decoder structure, as seen in Figure 1b. For the sake of simplicity, we write AE's cost function as

$$J(\Theta_e, \Theta_d) = L\big(X,\, F_d(F_e(X; \Theta_e); \Theta_d)\big) + \lambda R,$$

where L is a loss term such as the Mean Squared Error (MSE); R is a regularization term with regularization coefficient λ, and in SAE, R can be written as $\|\Theta_e\|_p$ or $\|F_e(X; \Theta_e)\|_p$; $F_e$ and $F_d$ denote the encoder and decoder transformations with parameters $\Theta_e$ and $\Theta_d$, respectively. The training of AE can be carried out with gradient backpropagation, which updates the parameters as
$$\Theta \leftarrow \Theta - \mu\frac{\partial J}{\partial \Theta}.$$

Here µ denotes the learning rate, and ∂J/∂Θ indicates the gradient of J with respect to the parameters Θ = {Θ_e, Θ_d}. Unlike ELM-AE, AE represents the learned features by the outputs of the encoder, which can be written as

$$\tilde{X} = F_e(X; \Theta_e).$$

Because $F_e$ is generally nonlinear, AE can process complicated data. For example, AE has been successfully used for HSI feature learning and classification in [24].
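For the AE and SAE baselines, the experiments use the Keras implementation [45]. The following is a minimal sketch with hypothetical layer sizes and regularization weight; the L1 activity regularizer plays the role of R in SAE, and the MSE loss plays the role of L.

```python
# Minimal Keras sketch of an AE/SAE baseline (hypothetical hyperparameters).
from keras.models import Model
from keras.layers import Input, Dense
from keras import regularizers

def build_sae(n_input, n_hidden, sparsity_weight=1e-4):
    x = Input(shape=(n_input,))
    # Encoder F_e; the L1 activity regularizer acts as the sparsity term R.
    h = Dense(n_hidden, activation='sigmoid',
              activity_regularizer=regularizers.l1(sparsity_weight))(x)
    # Decoder F_d reconstructs the input; training minimizes the MSE loss L.
    y = Dense(n_input, activation='sigmoid')(h)
    ae = Model(x, y)
    encoder = Model(x, h)       # used afterwards to extract the learned features
    ae.compile(optimizer='adam', loss='mse')
    return ae, encoder
```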

EMO-ELM
The structure of the proposed approach is given in Figure 2. We keep ELM-AE's structure and parameter settings but learn the parameters through an encoder-decoder scheme like AE. In our approach, gradient-based tuning is difficult to apply, so a multiobjective-based method is adopted instead. In this section, we first introduce the construction of the multiobjective model; next, we give the Non-dominated Sorting Genetic Algorithm II (NSGA-II)-based learning steps; then we describe the solution selection procedure; finally, we use the proposed approach to learn sparse features.

Constructing a Multiobjective Model
Our multiobjective model contains two conflicting objective functions with respect to the decision variable vector $\alpha = [w_{11}, \cdots, w_{1L}, \cdots, w_{n1}, \cdots, w_{nL}, b_1, \cdots, b_L]^T$, which we denote as $f_1$ and $f_2$, respectively. The detailed definitions of $f_1$ and $f_2$ are given below.
Firstly, we define $f_1$ as the sparseness of the encoded results. Suppose the activation function of the hidden neurons is the Sigmoid function; then the activation values in the hidden layer are limited to (0, 1). Informally, we regard a hidden neuron as "active" (or "firing") if its output value is close to 1, and as "inactive" if its output value is close to 0. The lower the average activation of the hidden layer, the sparser the encoded results and the smaller $f_1$ is. Our goal is to constrain the hidden neurons to be inactive most of the time. We first define the average activation of the j-th hidden neuron as

$$\hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} h_j(x_i),$$

where $h_j(x_i)$ is the activation of the j-th hidden neuron for input $x_i$. Further, let ρ be the desired activation; our goal is to force $\hat{\rho}_j$ to be close to ρ. To measure the difference between ρ and $\hat{\rho}_j$, the Kullback-Leibler (KL) divergence is taken into consideration, which can be expressed as

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}.$$

Considering all the hidden neurons, $f_1$ is represented as the sum of the KL divergences of all hidden neurons:

$$f_1 = \sum_{j=1}^{L}\mathrm{KL}(\rho \,\|\, \hat{\rho}_j).$$
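A sketch of this sparsity objective is given below; the value of the desired activation ρ is an assumption, since the paper does not state the setting here.

```python
# Sketch of f1: sum of KL divergences between the desired activation rho and
# the average activation of each hidden neuron.
import numpy as np

def f1_sparsity(H, rho=0.05):
    """H: (N, L) Sigmoid hidden-layer outputs; rho: desired (low) activation."""
    rho_hat = H.mean(axis=0)                         # average activation per neuron
    rho_hat = np.clip(rho_hat, 1e-10, 1 - 1e-10)     # numerical safety
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return float(kl.sum())
```
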
Secondly, we define $f_2$ as the Root Mean Squared Error (RMSE) between the raw input and the decoded results. Because the decoder's parameters are determined analytically, we adopt K-fold cross-validation when computing the RMSE to further avoid overfitting. Thus, $f_2$ can be denoted as

$$f_2 = \frac{1}{K}\sum_{k=1}^{K}\sqrt{\frac{1}{n_k}\sum_{i=1}^{n_k}\big\|x_i - y(x_i)\big\|_2^2},$$

where K indicates that a K-fold cross-validation is used, $n_k$ is the number of validation samples in the k-th fold subject to $\sum_{k=1}^{K} n_k = N$, and $y(x_i)$ is the decoded result for $x_i$. Ultimately, we aim to minimize the two functions simultaneously. For clarity, the overall multiobjective model is expressed as

$$\min_{\alpha}\; F(\alpha) = \big(f_1(\alpha),\, f_2(\alpha)\big).$$
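The sketch below computes this cross-validated reconstruction objective. It uses scikit-learn's KFold and an analytic ridge decoder fitted on each training fold; the regularization value and the exact RMSE normalization are assumptions.

```python
# Sketch of f2: K-fold cross-validated reconstruction RMSE of the ELM-AE style decoder.
import numpy as np
from sklearn.model_selection import KFold

def f2_cv_rmse(X, W, b, K=3, lam=1e3):
    """X: (N, n) data; W: (n, L) encoder weights; b: (L,) bias."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))              # Sigmoid hidden outputs
    L = H.shape[1]
    rmses = []
    for tr, va in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        # analytic decoder weights (ridge regression) fitted on the training fold
        beta = np.linalg.solve(np.eye(L) / lam + H[tr].T @ H[tr], H[tr].T @ X[tr])
        err = X[va] - H[va] @ beta                       # x_i - y(x_i) on the held-out fold
        rmses.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(rmses))
```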

Solving a Multiobjective Model
To effectively solve the two-objective model given above, we employ EMO algorithms. Specifically, NSGA-II [42] is used in this paper due to its fast solving speed, excellent convergence, and wide application. Other evolutionary multiobjective algorithms, such as the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) [43] and Multi-Objective Particle Swarm Optimization (MOPSO) [44], are also practicable in our framework. The main steps of using NSGA-II to solve our multiobjective model are given in Algorithm 1.
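Assuming the f1_sparsity and f2_cv_rmse helpers from the sketches above, the following shows how the two-objective model could be handed to the Platypus NSGA-II implementation referenced in the experiment settings. The decision vector packs W (n × L) and b (L) as defined above; the search range Real(-1, 1) is an assumption.

```python
# Sketch of solving the two-objective model with NSGA-II from Platypus.
import numpy as np
from platypus import NSGAII, Problem, Real

def make_problem(X, L, rho=0.05, K=3):
    n = X.shape[1]
    def evaluate(alpha):
        alpha = np.asarray(alpha)
        W = alpha[:n * L].reshape(n, L)      # unpack encoder weights
        b = alpha[n * L:]                    # unpack bias
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        return [f1_sparsity(H, rho), f2_cv_rmse(X, W, b, K)]
    problem = Problem(n * L + L, 2)          # (n*L + L) decision variables, 2 objectives
    problem.types[:] = Real(-1.0, 1.0)       # assumed search range
    problem.function = evaluate
    return problem

# problem = make_problem(X, L=10)
# algorithm = NSGAII(problem, population_size=50)     # NP = 50
# algorithm.run(50 * 5000)                            # run() counts evaluations; 5000 generations
# pareto = [(s.objectives[0], s.objectives[1], s.variables) for s in algorithm.result]
```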

Selecting Solution
By evolving, we obtain a set of solutions, called the Pareto optimal set. The corresponding objective values can be plotted as a curve, named the PF curve. Unlike single-objective optimization, we have to choose a solution from the Pareto optimal set, because these solutions are all deemed equally important. In this paper, the following three solutions are considered as alternatives.
1. the solution attaining the minimum value of $f_1$;
2. the solution attaining the minimum value of $f_2$;
3. the solution located at the knee area.
The first two solutions are easy to determine. However, locating the knee area is generally difficult, because the PF may not be smooth and the objective functions involved usually have very different magnitudes. Observing the PF, we find that the knee usually coincides with the maximum curvature of the PF curve; thus, the problem is transformed into finding the point of maximum curvature. To do this, the following steps are carried out. First, normalize the Pareto solutions to overcome the different magnitudes. Next, smooth the PF by interpolating it with B-splines. Then, evenly resample from the smoothed spline. Finally, estimate the curvature from the derivatives of the B-spline curve and select the solution closest to the point of maximum curvature, as sketched below.
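A sketch of these knee-selection steps using SciPy's B-spline routines follows; the smoothing factor and the number of resampled points are assumptions.

```python
# Sketch of curvature-based knee selection on a two-objective Pareto Front.
import numpy as np
from scipy.interpolate import splprep, splev

def knee_index(f1_vals, f2_vals, n_samples=200):
    F = np.column_stack([f1_vals, f2_vals]).astype(float)
    F = (F - F.min(axis=0)) / (F.max(axis=0) - F.min(axis=0) + 1e-12)   # normalize
    order = np.argsort(F[:, 0])
    tck, _ = splprep([F[order, 0], F[order, 1]], s=0.01)                # B-spline smoothing
    u = np.linspace(0.0, 1.0, n_samples)                                # even resampling
    x, y = splev(u, tck)
    dx, dy = splev(u, tck, der=1)
    ddx, ddy = splev(u, tck, der=2)
    curvature = np.abs(dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
    kx, ky = x[np.argmax(curvature)], y[np.argmax(curvature)]
    # return the index of the Pareto solution closest to the max-curvature point
    return int(np.argmin(np.sum((F - np.array([kx, ky])) ** 2, axis=1)))
```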

Sparse Feature Learning Using EMO-ELM
We denote the selected solution as $\alpha_F$. Subsequently, we use $\alpha_F$ to regenerate the parameters of the encoder, represented as $W_F$ and $b_F$. Based on this, the learned features can be obtained from the following processing:

$$\tilde{X} = g(XW_F + B_F),$$

where $\tilde{X}$ is the learned feature matrix and $B_F = [b_F^T, \cdots, b_F^T]^T \in \mathbb{R}^{N\times L}$ stacks the bias vector N times. As this shows, the procedure is a nonlinear transformation. The complete pseudocode of using EMO-ELM to extract sparse features is given in Algorithm 2; a brief sketch of this final step follows.
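The sketch below (hypothetical helper name) unpacks a selected solution $\alpha_F$ into $W_F$ and $b_F$ and applies the nonlinear transformation above.

```python
# Sketch of turning a selected Pareto solution into the learned sparse features.
import numpy as np

def extract_features(X, alpha_F, L):
    """X: (N, n) data; alpha_F: flat decision vector [W_F, b_F]; returns (N, L) features."""
    n = X.shape[1]
    alpha_F = np.asarray(alpha_F)
    W_F = alpha_F[:n * L].reshape(n, L)
    b_F = alpha_F[n * L:]
    # Sigmoid g; broadcasting replicates b_F row-wise, i.e., the B_F matrix above.
    return 1.0 / (1.0 + np.exp(-(X @ W_F + b_F)))
```
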
Data Sets

The first data set, SalinasA, was collected by the 224-band Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over Salinas Valley, California, and is characterized by high spatial resolution (3.7-m pixels). The area covered comprises 86 × 83 pixels and includes six classes, whose class information is given in Table 1. The pseudo-color image and its ground truth are shown in Figure 3. The number of bands was reduced to 204 by removing 20 bands covering the water absorption regions: 108-112, 154-167, and 224.

The second data set, Kennedy Space Center (KSC), was acquired by the AVIRIS sensor on March 23, 1996. It contains 512 × 614 pixels with a spatial resolution of 18 m and is shown in Figure 4. After discarding the water absorption and low signal-to-noise ratio (SNR) bands, 176 spectral bands are used for classification. The 13 land cover classes available in the original data set are listed in Table 2.

Experiment Settings
Experiments are organized into three parts addressing convergence, sparsity, and separability. The first part analyzes the convergence of EMO-ELM and further illustrates the solution selection. In the second experiment, we quantitatively compare the sparsity of the EMO-ELM variants (EMO-ELM($f_1$), EMO-ELM($f_2$), and EMO-ELM(best) denote EMO-ELM with the different solution selection strategies, respectively) with NRP (namely the unoptimized ELM), SPCA [11], ELM-AE [37], SELM-AE [37], AE, and SAE. Finally, we use a basic ELM with 500 hidden neurons as the classifier to estimate the classification capability of the learned features in terms of Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. In this part, all experiments are executed with three-fold cross-validation, repeated ten times, and the means are taken as the performance criteria. Ten percent of the samples are used for training and the rest for testing. For AE and SAE, we use the implementation available from the Keras website [45]. For EMO-ELM, the well-known NSGA-II implementation from Platypus (https://github.com/Project-Platypus/Platypus) is applied, in which we uniformly set NP = 50 and gen = 5000. Additionally, all experiments are carried out using Python 2.7 on an Intel Core(TM) i5 3.2-GHz machine with 8 GB of RAM.
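For reference, a minimal sketch of a basic ELM classifier of the kind used here (500 hidden neurons) is given below; the regularization value and random weight initialization are assumptions.

```python
# Sketch of a basic ELM classifier with random hidden weights and ridge output weights.
import numpy as np

class ELMClassifier(object):
    def __init__(self, n_hidden=500, lam=1e3, seed=0):
        self.n_hidden, self.lam = n_hidden, lam
        self.rng = np.random.RandomState(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))     # Sigmoid hidden layer

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        T = (y[:, None] == self.classes_[None, :]).astype(float)   # one-hot targets
        self.W = self.rng.randn(X.shape[1], self.n_hidden)
        self.b = self.rng.randn(self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.solve(np.eye(self.n_hidden) / self.lam + H.T @ H, H.T @ T)
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]
```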

Convergence and Solution Selection
Multiobjective Optimization Problems (MOPs) require that the objectives to be solved conflict with each other. Thus, we can determine whether the algorithm has converged by observing the PF: we consider the algorithm convergent if all the obtained solutions are non-dominated, in which case the PF looks like a 'smooth' curve. In Figure 5, we illustrate the normalized PFs obtained by optimizing a 10-hidden-neuron ELM for 5000 generations on the SalinasA and KSC data sets. As we can see, after 5000 generations of evolution, the final results are completely convergent. Furthermore, this confirms that the two objectives we constructed indeed conflict. Observing the curvature curves, the values increase dramatically in the knee area, so the knee can be located accurately.

Visual Investigation of Features Learned by Different Algorithms
The Iris data set, which contains 150 samples from three classes, is suitable for visualization. Figure 6a-i shows the distributions of the Iris features projected into a two-dimensional space. As observed from Figure 6a, the original nonlinear random projection in ELM disorders the data distribution, which indicates that noise is introduced by the original ELM. As seen in Figure 6b-i, the features of Iris are effectively distributed in the feature space after feature learning. By performing EMO-ELM, this noise is dramatically reduced. From a clustering point of view, EMO-ELM decreases the within-class distance and increases the between-class distance. This phenomenon is especially evident in Figure 6i, which uses the proposed knee-based solution selection strategy.

Measuring Sparsity of the Learned Features
In order to quantify sparsity, we use the $L_2/L_1$ sparsity measure applied in [37,46], which indicates sparsity at an abstract level. A higher $L_2/L_1$ value means there are a few large feature values and many small feature values; a lower value means the opposite. In other words, the higher $L_2/L_1$ is, the sparser the learned features are. In Figure 7, we give the $L_2/L_1$ sparsity measure of different algorithms under various numbers of features. The features learned by the EMO-ELM variants are clearly sparser than those of all the competitors; compared with NRP, EMO-ELM significantly increases the sparsity. A sketch of this measure is given below.
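The following sketch computes the $L_2/L_1$ ratio; whether the ratio is taken over the flattened feature matrix or per sample and then averaged is not specified, so the flattened version is an assumption.

```python
# Sketch of the L2/L1 sparsity measure (higher ratio = sparser features).
import numpy as np

def l2_l1_sparsity(F):
    """F: learned feature matrix; returns the L2-norm / L1-norm ratio."""
    f = np.asarray(F).ravel()
    return float(np.linalg.norm(f, 2) / (np.linalg.norm(f, 1) + 1e-12))
```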

Comparison of Classification Ability
For feature learning, an important criterion for evaluating the effectiveness of the learned features is their classification ability. In this experiment, we compare EMO-ELM's performance with the six above-mentioned feature learning approaches using 10-dimensional features. The results for SalinasA and KSC are given in Tables 3 and 4, with the best results in bold. As seen from Tables 3 and 4, the EMO-ELM variants, especially EMO-ELM($f_2$), outperform the other methods, which means the performance can be further improved through optimization. To visually present the differences between competitors, we also plot the statistical evaluations of OA, AA, and Kappa in Figure 8a-c for SalinasA and Figure 8d-f for KSC. As shown in Figure 8, EMO-ELM($f_2$) obtains the best results compared with the other strategies in all respects, because EMO-ELM($f_2$) attains the smallest reconstruction error. The detailed conclusions are as follows:
1. EMO-ELM($f_1$) vs. EMO-ELM($f_2$) vs. EMO-ELM(best): Generally, EMO-ELM($f_2$) yields higher accuracy than the other variants in terms of mean performance on SalinasA and KSC. The reason is that EMO-ELM($f_2$) guarantees that the model achieves the smallest reconstruction error, whereas EMO-ELM($f_1$), although it attains the smallest value of $f_1$ (i.e., the sparsest features), has limited reconstruction quality. Reviewing Figure 7a,b, EMO-ELM($f_2$) also maintains good sparsity. Hence, the solution selection strategy based on $f_2$ can be considered the best in our experiments. EMO-ELM(best) plays a trade-off role between sparsity and reconstruction error, so we view it as the second choice.
2. NRP vs. EMO-ELM: As shown in Figure 8a-f, the original nonlinear random projection (NRP) is effective for feature mapping, but EMO-ELM shows that NRP's performance can be further improved by optimization.
3. SPCA vs. EMO-ELM: The features learned by SPCA maintain remarkable sparsity, but their classification ability is impaired. Furthermore, as a dimension-reduction method, SPCA cannot work when the learned dimension is larger than the original dimension. By contrast, EMO-ELM outperforms SPCA in most of the tested respects.
4. ELM-AE and SELM-AE vs. EMO-ELM: As is known, ELM-AE and SELM-AE learn features linearly. In this experiment, SELM-AE performs better than ELM-AE because a sparse matrix is used. EMO-ELM, by contrast, learns features nonlinearly and significantly enhances the classification capacity of the learned features.
5. AE and SAE vs. EMO-ELM: The learning procedure of EMO-ELM is similar to that of AE and SAE, but EMO-ELM is more competitive in both classification ability and sparsity after the same number of updates. In particular, EMO-ELM optimizes only the hidden layer, whereas AE and SAE have to simultaneously optimize the hidden layer and the output layer.

Discussion
Based on the above experimental results, we provide a discussion of this work:
1. The proposed approach can be regarded as a general framework composed of a nonlinear encoder and a linear decoder. For the optimization of this framework, we only need to focus on the encoder (the hidden layer), since the decoder has a closed-form solution, which is very different from neural networks. Thus, compared with SAE and AE, the number of EMO-ELM's parameters is reduced to half.
2. In addition to the objectives used in this paper, other objectives, such as the classification error or matrix-norm constraints, can be considered in this framework. More importantly, the optimizer is replaceable and flexible. Therefore, EMO-ELM can serve as a general alternative for unsupervised feature learning.
3. It is well known that evolutionary operations are time-consuming, which is the main challenge faced by EMO-ELM; thus, EMO-ELM is difficult to apply directly to big data. Fortunately, evolutionary algorithms are easy to parallelize. How to use EMO-ELM for deep representation learning is also worth studying in future work.

Conclusions
This paper proposed the EMO-ELM approach for sparse feature learning of hyperspectral images. The main idea of EMO-ELM is to first use an EMO algorithm to optimize the hidden layer of ELM and then perform feature extraction with the optimized hidden layer. To give EMO-ELM the ability to learn sparse features, the hidden-layer activations are constrained by an objective function. The other objective involved in EMO-ELM is the cross-validated RMSE, which ensures the learned features are accurate. Unlike ELM-AE, which transforms the original features linearly, EMO-ELM does so nonlinearly. This procedure is similar to a neural network autoencoder such as AE, but EMO-ELM optimizes only the hidden layer, so it can be updated using EMO. Experiments are carried out on two real hyperspectral image data sets, and the performance of EMO-ELM is extensively explored in terms of convergence, sparsity, and classification ability. According to the experimental results, we draw the following conclusions:
1. EMO-ELM is well suited to extracting sparse features from hyperspectral images, and EMO plays a significant role in dealing with the nonlinear nature of hyperspectral data.
2. The proposed EMO-ELM significantly improves the performance of the original ELM, demonstrating that optimizing the hidden layer of ELM is effective for HSI feature learning.
3. EMO-ELM generally outperforms ELM-AE, SELM-AE, AE, and SAE in terms of sparsity and classification ability, owing to the two optimized objectives.
4. The knee-based solution selection strategy can accurately focus on the knee area of the PF curve; however, the RMSE-based solution selection strategy proved more applicable in our experiments.