Linear vs. Nonlinear Extreme Learning Machine for Spectral-Spatial Classification of Hyperspectral Images

As a new machine learning approach, the extreme learning machine (ELM) has received much attention due to its good performance. However, when directly applied to hyperspectral image (HSI) classification, its recognition rate is low, because ELM does not use spatial information, which is very important for HSI classification. In view of this, this paper proposes a new framework for the spectral-spatial classification of HSI by combining ELM with loopy belief propagation (LBP). The original ELM is linear, and the nonlinear (or kernel) ELMs are an improvement of the linear ELM (LELM). However, extensive experiments and analysis show that LELM is a better choice than nonlinear ELM for the spectral-spatial classification of HSI. Furthermore, we exploit the marginal probability distribution, which uses the whole information in the HSI, and learn this distribution using LBP. The proposed method not only maintains the fast speed of ELM, but also greatly improves the classification accuracy. Experimental results on the well-known HSI data sets Indian Pines and Pavia University demonstrate the good performance of the proposed method.


Introduction
The main goal of HSI classification is to assign each pixel of the hypercube to a class according to its spectral and spatial characteristics [1]. Since each pixel of an HSI has many spectral features, it is difficult to classify HSIs with limited samples due to the curse of dimensionality. There are some typical algorithms for HSI classification, such as the support vector machine (SVM) [2] and sparse multinomial logistic regression (SMLR) [3]. Many techniques have been proposed for feature extraction and dimensionality reduction [4,5], such as singular spectrum analysis (SSA) [6][7][8][9], principal component analysis (PCA) [10,11], and spectral-spatial classification methods [12]. However, HSI classification still faces many challenges. For example, each pixel in the HSI data is actually a vector of responses from different spectral bands, whose dimension equals the number of spectral bands, usually on the scale of hundreds or even thousands. As a result, the efficient and effective classification of HSI remains a challenging problem, especially with limited training samples.
As a new machine learning approach based on a single-hidden-layer feedforward neural network, ELM has received much attention due to its good performance, and it has been proven to be a promising classification method.
The remainder of this paper is organized as follows: Section 2 describes the experimental data and the details of the proposed method. Section 3 presents the extensive experimental results and discussions. The conclusions are summarized in Section 4.

Materials and Methods
In this section, we first introduce the experimental data sets, and then we elaborate on the proposed method based on LELM and LBP.

HSI Data Set
The experimental data sets include two well-known HSI datasets, which are detailed below.
(1) Indian Pines: The Indian Pines HSI data set [3] was collected in June 1992 by the AVIRIS sensor over the Indian Pines region in North-western Indiana. The Indian Pines scene contains two-thirds agriculture and one-third forest or other perennial vegetation.
There are two major dual-lane highways and a rail line, as well as some low-density housing, other built structures, and smaller roads. Since the scene was taken in June, some of the crops present, including corn and soybeans, are in the early stages of growth with less than 5% coverage. The data set has 145 × 145 pixels, each of which has 200 spectral bands, ranging from 0.4 to 2.5 µm, after removing 20 water absorption bands. There are 16 classes in total (e.g., corn, soybean, and wheat), with 10,366 pixels that need to be classified. This data set can be downloaded at http://www.lx.it.pt/~jun/.
(2) Pavia University: The Pavia University HSI data set [1] was acquired in 2001 by the Reflective Optics System Imaging Spectrometer (ROSIS), flown over the city of Pavia, Italy. The sensor collected an HSI data set in 115 spectral bands ranging from 0.43 to 0.86 µm with a spatial resolution of 1.3 m/pixel. A total of 103 bands were selected for the experiments after removing the 12 noisiest bands. The image scene contains 610 × 340 pixels and there are nine classes in total, with 42,776 pixels that need to be classified. This data set can be downloaded at http://www.lx.it.pt/~jun/.

Normalization
Normalization is an important preprocessing step for HSI classification, and a number of normalization approaches have been proposed. For simplicity and consistency, we chose the widely used Max method [30]. Let $X = (X_1, X_2, \ldots, X_N) \in \mathbb{R}^{N \times d}$ be the HSI data, which has $N$ samples with $d$ features per sample. The Max method divides each pixel value by the maximum value of the whole data set, which can be expressed as:

$$\widetilde{X}_{ij} = \frac{X_{ij}}{\max(X)} \quad (1)$$

where $X_{ij}$ is any pixel value of the HSI data, and $\max(X)$ is the largest value of all the data in the HSI.
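To make the preprocessing concrete, the following minimal Python sketch implements the Max normalization of Equation (1). (Our experiments use MATLAB; the language and the small synthetic array here are purely illustrative.)

```python
import numpy as np

def max_normalize(X):
    """Max normalization of Equation (1): divide every pixel value by the
    global maximum of the HSI cube, scaling non-negative data into [0, 1]."""
    return X / np.max(X)

# Example: a tiny synthetic cube with N = 4 pixels and d = 3 spectral bands.
X = np.array([[0.5, 2.0, 1.0],
              [4.0, 3.0, 2.5],
              [1.5, 0.5, 2.0],
              [2.0, 1.0, 3.5]])
X_norm = max_normalize(X)  # all values now lie in [0, 1]
```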

Linear ELM
Let $x = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{N \times d}$ be the HSI data after normalization, and let $y = (y_1, y_2, \ldots, y_N) \in \mathbb{R}^{N \times M}$ denote the class labels. ELM [17] is a single-hidden-layer feedforward neural network, which can be modeled as:

$$y_j = \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i), \quad j = 1, 2, \ldots, N \quad (2)$$

where $w_i = (w_{i1}, w_{i2}, \ldots, w_{id})^T$ is the weight vector connecting the input layer with the $i$-th hidden neuron; $b_i$ is the bias of the $i$-th hidden neuron; $\beta_i$ is the output weight vector of the $i$-th hidden neuron; $T$ is the transpose operation; and $g(\cdot)$ is the activation function of the hidden layer. The main steps of classification with ELM are as follows:

Step 1: Randomly assign the input weights $w_i$ and biases $b_i$, $i = 1, 2, \ldots, L$.
Step 2: Calculate the output matrix of the hidden layer, $G$:

$$G = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L} \quad (3)$$

Step 3: Calculate the output weight matrix $\beta$:

$$\beta = G^{\dagger} y \quad (4)$$

where $\beta = [\beta_1, \ldots, \beta_L]^T_{L \times M}$ and $\dagger$ denotes the Moore-Penrose generalized inverse of the hidden layer matrix.

Step 4: The final classification result of ELM can be expressed as:

$$y^* = G\beta \quad (5)$$

The execution time of ELM can be greatly reduced because the input weights and biases of ELM are randomly generated, and the output weights can be directly computed as $\beta = G^{\dagger} y$. Any piecewise continuous function can be used as the hidden layer activation function. Clearly, the ELM output is a linear operation on the hidden-layer features.
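The following minimal Python sketch implements the four LELM steps above, assuming a sigmoid activation for $g(\cdot)$ and one-hot encoded labels; the toy data, function names, and random seed are illustrative rather than part of the original method.

```python
import numpy as np

def train_lelm(X, Y, L, seed=0):
    """Linear ELM training: random input weights/biases (Step 1), sigmoid
    hidden layer (Step 2), closed-form Moore-Penrose solution (Step 3)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L))          # input weights w_i
    b = rng.standard_normal(L)               # biases b_i
    G = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer matrix G (N x L)
    beta = np.linalg.pinv(G) @ Y             # beta = G^dagger y
    return W, b, beta

def predict_lelm(X, W, b, beta):
    """Step 4: soft outputs G beta; the predicted class is the arg max."""
    G = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    scores = G @ beta
    return scores, scores.argmax(axis=1)

# Toy usage with one-hot labels: N = 6 samples, d = 4 bands, M = 2 classes.
X = np.random.rand(6, 4)
Y = np.eye(2)[np.array([0, 1, 0, 1, 0, 1])]
W, b, beta = train_lelm(X, Y, L=10)
_, labels = predict_lelm(X, W, b, beta)
```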

Nonlinear ELM
The classification problem for NLELM [22] can be formulated as:

$$\min_{\beta, \varepsilon} \; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\varepsilon_i\|^2, \quad \text{s.t.} \; h(x_i)\beta = y_i^T - \varepsilon_i^T, \; i = 1, \ldots, N \quad (6)$$

where $\varepsilon_i = [\varepsilon_{i,1}, \ldots, \varepsilon_{i,M}]$ is the error vector of the $M$ output nodes with respect to the sample $x_i$, and $h(x_i)$ is the hidden-layer output (feature mapping) of the $i$-th sample. Based on the KKT theorem, solving Equation (6) is equivalent to solving the following dual optimization problem:

$$L_D = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\varepsilon_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{M}\alpha_{i,j}\big(h(x_i)\beta_j - y_{i,j} + \varepsilon_{i,j}\big) \quad (7)$$

where $\beta_j$ is the weight vector between the hidden layer and the $j$-th output node, and $\alpha_{i,j}$ is the Lagrange multiplier. From the KKT optimality conditions, we can derive that:

$$\frac{\partial L_D}{\partial \beta_j} = 0 \Rightarrow \beta = H^T\alpha, \quad \frac{\partial L_D}{\partial \varepsilon_i} = 0 \Rightarrow \alpha_i = C\varepsilon_i, \quad \frac{\partial L_D}{\partial \alpha_i} = 0 \Rightarrow h(x_i)\beta - y_i^T + \varepsilon_i^T = 0 \quad (8)$$

where $i = 1, \ldots, N$, $\alpha_i = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,M}]^T$, and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$. Now, the output weight $\beta$ can be formulated as:

$$\beta = H^T\Big(\frac{I}{C} + HH^T\Big)^{-1}y \quad (9)$$

so that the output function is:

$$f(x) = h(x)\beta = h(x)H^T\Big(\frac{I}{C} + HH^T\Big)^{-1}y \quad (10)$$

The hidden-layer feature mapping $h(\cdot)$ need not be known explicitly. Any kernel satisfying Mercer's conditions can be used:

$$\Omega = HH^T, \quad \Omega_{i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j) \quad (11)$$

In general, the Gaussian kernel is chosen:

$$K(x_i, x_j) = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big) \quad (12)$$

Then, the NLELM can be constructed using only the kernel function:

$$f(x) = \big[K(x, x_1), \ldots, K(x, x_N)\big]\Big(\frac{I}{C} + \Omega\Big)^{-1}y \quad (13)$$
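As a sketch of the kernel construction in (9)-(13), the following Python code solves the regularized linear system with a Gaussian kernel; the parameter values ($C$, $\sigma$) and the function names are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix of (12): K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def train_kelm(X, Y, C=2.0**5, sigma=1.0):
    """Kernel ELM training: solve alpha = (I/C + Omega)^(-1) Y, with Omega as in (11)."""
    Omega = gaussian_kernel(X, X, sigma)
    N = X.shape[0]
    return np.linalg.solve(np.eye(N) / C + Omega, Y)

def predict_kelm(X_test, X_train, alpha, sigma=1.0):
    """Output of (13): f(x) = [K(x, x_1), ..., K(x, x_N)] (I/C + Omega)^(-1) Y."""
    scores = gaussian_kernel(X_test, X_train, sigma) @ alpha
    return scores, scores.argmax(axis=1)
```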
Although NLELM can achieve a higher classification accuracy than LELM if we just consider the spectral information, it may degrade the performance in the spectral-spatial classification of HSI. As a result, we will choose the LELM with LBP for the spectral-spatial classification of HSI, yet the performance of LELM and NLELM will be compared in the experiments.

Using LBP Based Spatial Information to Improve the Classification Accuracy
To further exploit the spatial information, the output of LELM is used as the input of LBP. The posterior density $p(y|x)$ is obtained from the feature $x$, which is the output of LELM. We adopt the discriminative random field (DRF) [26]:

$$p(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_{i} \log p(y_i|x_i) + \sum_{i}\sum_{j \in N(i)} \log p(y_i, y_j)\Big) \quad (14)$$

where $Z(x)$ is the partition function. The term $\log p(y_i|x_i)$ is the association potential that models the likelihood of label $y_i$ given the feature $x_i$, and $\log p(y_i, y_j)$ is the interaction potential.
We adopt an isotropic multilevel logistic (MLL) prior to model the image of class labels $y$ in order to use the spatial information of the HSI. This prior belongs to the MRF class and encourages piecewise smooth segmentations; it tends to produce solutions in which adjacent pixels are likely to belong to the same class [3]. The MLL prior has been widely used in image segmentation problems [31][32][33][34] and is a generalization of the Ising model [35][36][37]. It can be formulated as:

$$p(y) = \frac{1}{Z} \exp\Big(\mu \sum_{(i,j)} \delta(y_i - y_j)\Big) \quad (15)$$

where $\mu$ is a tunable parameter controlling the degree of smoothness, $Z$ is a normalization constant for the density, and $\delta(\cdot)$ is the unit impulse function. The pairwise interaction term $\delta(y_i - y_j)$ assigns a higher probability to neighboring pixels with the same label. The setting of the smoothness parameter, $\mu$, will be discussed in Section 3.2.
A maximum a posteriori (MAP) estimate minimizes the Bayesian risk associated with the zero-one loss function over the whole label configuration [3]. The MAP estimate of $y$ is given by:

$$\hat{y} = \arg\max_{y}\Big\{\sum_i \log p(y_i|x_i) + \log p(y)\Big\} \quad (16)$$

This is a combinatorial optimization problem with pairwise interaction terms. An alternative to the MAP solution is the maximizer of the posterior marginals (MAM), which minimizes the Bayesian risk associated with the zero-one loss of each individual label. The MAM estimate of label $y_i$ can be formulated as:

$$\hat{y}_i = \arg\max_{y_i} q(y_i|x) \quad (17)$$

where $q(y_i|x)$ is the marginal density of $p(y|x)$ with respect to $y_i$. The exact computation of the marginal density of $p(y|x)$ in (14) is difficult [3]. Since LBP is an efficient approach to estimate Bayesian beliefs in graphical models [24], we use LBP to estimate the MAM solution, with the output of LELM, $y^*_{LELM}$, as the input of LBP. Figure 1 is a graphical example of an MRF, where each node represents a random variable or a hidden node, and the class label $y_i$ is associated with each input feature $x_i$. In this example, $\psi_{ij}(y_i, y_j) = p(y_i, y_j)$ denotes the interaction potential that penalizes dissimilar pairs of neighboring labels, and $\varphi_i(y_i, x_i) = p(y_i|x_i)$ stands for the association potential of label $y_i$ with respect to the evidence. Each node has a state value $y_i$ and an observation value $x_i$; $\varphi_i(y_i, x_i)$ reflects their statistical dependence, while $\psi_{ij}(y_i, y_j)$ is the potential energy between adjacent nodes and reflects the compatibility of the node variables $y_i$ and $y_j$. Figure 2 provides a graphical example of an undirected network.
Since LBP is an iterative algorithm, at the $t$-th iteration the message sent from node $i$ to its neighbor node $j \in N(i)$ is given by:

$$m_{ij}^{t}(y_j) = \frac{1}{Z}\sum_{y_i}\Big[\varphi_i(y_i, x_i)\,\psi_{ij}(y_i, y_j)\prod_{k \in N(i)\setminus j} m_{ki}^{t-1}(y_i)\Big] \quad (18)$$

where $Z$ is a normalization constant. Let $b_i^t(y_i)$ denote the belief of node $i$ at the $t$-th iteration; it can be represented as:

$$b_i^t(y_i) = \frac{1}{Z}\,\varphi_i(y_i, x_i)\prod_{k \in N(i)} m_{ki}^{t}(y_i) \quad (19)$$

Finally, we can estimate the final solution by maximizing the posterior marginal for node $i$:

$$\hat{y}_i = \arg\max_{y_i} b_i(y_i) \quad (20)$$

As we know, not all of the pixels, but only a part of the HSI, needs to be classified. For instance, the size of the Indian Pines HSI data set is 145 × 145 × 200, so the size of the ground truth is 145 × 145, but only 10,366 out of 21,025 pixels need to be classified. Using LBP directly on all the pixels may cause ill-posed problems. In view of this, we make an improvement to LBP (ILBP) in order to solve this problem: we discard the pixels that belong to the background, i.e., we only consider the pixels that need to be classified. The proposed method is summarized in Algorithm 1, and a minimal code sketch follows the algorithm.

Algorithm 1. The proposed LELM-LBP method.
(1) Normalization: normalize the HSI data using Equation (1).
(2) LELM training:
Step 1: Randomly generate the input weights, $w_i$, and biases, $b_i$.
Step 2: Calculate the hidden-layer output matrix $G$ as in (3).
Step 3: Calculate the output weights $\beta = G^{\dagger}y$ as in (4).
Output of LELM: calculate the hidden-layer matrix of the test samples and their probability outputs as in (5).
(3) Spatial classification by ILBP:
Step 1: Find the indices of the adjacent pixels of the training and test samples, and eliminate the pixels of the background.
Step 2: Pass the messages of (18) until convergence; the belief of node $i$ at the $t$-th iteration is then given by (19), and the final solution for node $i$ is obtained by maximizing the posterior marginal as in (20).
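Below is a minimal Python sketch of the ILBP step, assuming a 4-connected neighborhood, a Potts-style interaction $\exp(\mu\,\delta(y_i - y_j))$ for (15), and a boolean mask that excludes background pixels; the variable names, toy data, and the scale of $\mu$ are illustrative assumptions and are not tuned as in the experiments.

```python
import numpy as np

def ilbp(prob, mask, mu=2.0, n_iter=10):
    """Sum-product LBP restricted to non-background pixels (ILBP).

    prob : (H, W, M) unary beliefs, e.g. the probability output of LELM.
    mask : (H, W) bool, True for pixels to classify (background excluded).
    mu   : smoothness parameter of the MLL/Potts interaction potential.
    Returns an (H, W) label map, with -1 for background pixels.
    """
    H, W, M = prob.shape
    psi = np.exp(mu * np.eye(M))  # pairwise potential psi(y_i, y_j)
    nodes = [(i, j) for i in range(H) for j in range(W) if mask[i, j]]
    nbrs = {p: [(p[0] + di, p[1] + dj)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= p[0] + di < H and 0 <= p[1] + dj < W
                and mask[p[0] + di, p[1] + dj]]
            for p in nodes}
    # Messages m_pq(y_q), initialized uniformly; normalized each round (constant Z).
    msg = {(p, q): np.ones(M) / M for p in nodes for q in nbrs[p]}
    for _ in range(n_iter):
        new = {}
        for p in nodes:
            for q in nbrs[p]:
                h = prob[p].copy()            # unary potential phi_p(y_p, x_p)
                for r in nbrs[p]:
                    if r != q:                # incoming messages except from q, as in (18)
                        h *= msg[(r, p)]
                m = psi @ h                   # sum over y_p of psi(y_p, y_q) * h(y_p)
                new[(p, q)] = m / m.sum()
        msg = new
    labels = np.full((H, W), -1)
    for p in nodes:
        b = prob[p].copy()                    # belief of (19)
        for r in nbrs[p]:
            b *= msg[(r, p)]
        labels[p] = int(b.argmax())           # MAM decision of (20)
    return labels

# Toy usage: a 5 x 5 scene, M = 3 classes, with the border as background.
rng = np.random.default_rng(0)
prob = rng.dirichlet(np.ones(3), size=(5, 5))  # stand-in for LELM probabilities
mask = np.zeros((5, 5), dtype=bool); mask[1:4, 1:4] = True
labels = ilbp(prob, mask, mu=2.0, n_iter=5)
```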

Results and Discussions
In this section, the proposed method is evaluated and the relevant results are discussed in detail. The experimental datasets are the two well-known HSI datasets, i.e., Indian Pines and Pavia University.

Parameter Settings
All the experimental results are assessed by the overall accuracy (OA), average accuracy (AA), and kappa statistic (k) [35]. In order to avoid the effects induced by the selection of training samples, ten independent Monte Carlo runs are performed, and the OA, AA, and k are averaged over the ten runs.
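For clarity, the following short Python sketch shows how OA, AA, and k can be computed from a confusion matrix, assuming integer class labels and that every class occurs in the ground truth; the function name is illustrative.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA), and the kappa
    statistic, all derived from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                 # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))            # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```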
In order to compare the performance of the proposed method with other classifiers, we give the parameter settings used in the experiments. The parameters of SMLR and KSMLR are the same as suggested in [38] (note that SMLR and KSMLR are implemented via variable splitting and augmented Lagrangian (LORSAL) [39], which decreases their computation time). The cost function C = 2^b of NLELM is searched in the range b = [0, 1, 2, . . . , 10], and the kernel function in (12) is the Gaussian kernel. The number of hidden neurons, L, of LELM is a very important parameter, and we will evaluate its impact in the next subsection. The parameter µ in (15) is a tunable parameter controlling the degree of smoothness, which is set to µ = 20 for both Indian Pines and Pavia University; we will further evaluate its impact on the proposed approach in the next subsection. Note that the outputs of LELM and NLELM are used as probability outputs. All the experiments are conducted in MATLAB R2016b on a computer with a 3.50 GHz CPU and 32.0 GB RAM.

Impact of Parameters L and µ
In this subsection, we evaluate the impact of the number of hidden neurons of LELM, L, and the smoothness parameter, µ, using the Indian Pines and Pavia University datasets. Table 1 displays the numbers of training and test samples. Figure 3 shows the OA, AA, and kappa statistic results as a function of L with 1043 and 3921 training samples for Indian Pines and Pavia University, respectively (about 9% and 10% of the total samples). The training samples are randomly selected from each class in each Monte Carlo run. From Figure 3a,b, we can see that the classification accuracies of LELM indeed depend on the number of hidden neurons, so we should choose the best value of L in order to improve the performance of the subsequent spatial-information classification. The best number of hidden neurons for LELM is about 450 for Indian Pines and about 1050 for Pavia University; therefore, we set the number of hidden neurons to 450 for Indian Pines and 1050 for Pavia University.
The classification results as a function of the smoothness parameter show that the performance of the proposed framework depends on µ. However, the classification performance maintains a high accuracy as µ increases, and it becomes almost unchanged when µ ≥ 20. So, in the experiments, µ is set to 20 for both Indian Pines and Pavia University. This also demonstrates that the proposed framework is very robust.

Experimental Results and Analysis
In this subsection, we evaluate the HSI classification accuracy of the proposed method on the two HSI datasets by comparing it with other state-of-the-art methods, including the sparse multinomial logistic regression (SMLR), kernel sparse multinomial logistic regression (KSMLR) [3], nonlinear ELM (NLELM), linear ELM (LELM) [13], SMLR-LBP, KSMLR-LBP, and NLELM-LBP. It is worth noting that SMLR, KSMLR, LELM, and NLELM are spectral (i.e., pixel-based) classification methods, while SMLR-LBP, KSMLR-LBP, LELM-LBP, and NLELM-LBP are spectral-spatial classification methods. For the normalization, we use the Max method as in Equation (1) for all the algorithms. Table 1 shows the numbers of training samples and testing samples of Indian Pines and Pavia University.
For an illustration, Figure 5 shows the training samples of the Indian Pines data. Figure 6a-h show the classification results obtained by the different methods for the Indian Pines data. Moreover, Table 2 shows all the comparable results of the different classifiers. From Table 2, it is obvious that the classifiers with spatial information (the proposed method, NLELM-LBP, SMLR-LBP, and KSMLR-LBP) show a clear advantage over their pixel-only counterparts. NLELM obtains the best pixel-only classification results, but the results of NLELM-LBP are not good. This validates that the nonlinear transform disturbs the salient features of the original pixels. The poor results of SMLR are due to the fact that SMLR needs to iterate, and the outputs of SMLR also disturb the salient features of the original pixels. KSMLR-LBP achieves a slightly higher result than SMLR-LBP.
The kernel operation outperforms the non-kernel operation for the pixel-only classifiers. Nevertheless, the result of KSMLR-LBP is still lower than that of the proposed method. Our proposed spectral-spatial method based on LELM and ILBP achieves the best recognition results when compared with LELM, NLELM, SMLR, KSMLR, NLELM-LBP, SMLR-LBP, and KSMLR-LBP. This is due to the use of the linear transform to keep the original salient features of the pixels, and of ILBP to extract the spatial features. Figure 7 shows the training samples of Pavia University, Figure 8 shows the corresponding classification results, and the classification details are reported in Table 3. It can be seen that the proposed framework also achieves the highest accuracy among all the methods.
In the last lines of Tables 2 and 3, we report the average computation time of all the methods over ten Monte Carlo runs for Indian Pines with 1043 training samples and Pavia University with 3921 training samples, respectively. It is obvious and reasonable that the classifiers with spectral-spatial information cost more time than their pixel-only counterparts. From the last line of Table 2, we can also see that the proposed method has a computation time very similar to that of SMLR-LBP for Indian Pines, while achieving a higher classification accuracy; moreover, it achieves a higher classification accuracy than NLELM-LBP and KSMLR-LBP with much less computation time. From the last line of Table 3, we can draw the same conclusions for the Pavia University data set. To sum up, the proposed method achieves a higher accuracy than KSMLR-LBP and NLELM-LBP with much less computation time. The proposed LELM-LBP maintains the salient features of the HSI very well, so it obtains a higher accuracy than the other spectral-spatial methods with high computational efficiency.

Comparison with Other Spectral-Spatial ELM-Based Methods
In this subsection, we compare the proposed approach with other spectral-spatial ELM-based methods. The classification results are shown in Table 4. The classification accuracies of EMP-ELM, S-ELM, and G-ELM are taken directly from [21][22][23], respectively. From Table 4, we can see that the proposed method achieves the best classification accuracy among the four methods.

Conclusions
In this work, we proposed a new framework for HSI classification using spectral-spatial information with LELM and LBP. The LELM method is used to learn a spectral classifier for the original HSI data while keeping the salient features of the HSI. The spatial information is then modeled with LBP in order to improve the classification accuracy. Because the proposed method maintains the salient features of the HSI for the spatial-based classification, the experimental results show its superiority.
In future work, we will focus on learning the dictionary of each class in the spectral domain for LELM in order to further improve its classification performance. To improve the classification results further, we will also resort to spatial filtering [40]. Moreover, we will reduce the computation time by resorting to the extended multi-attribute profiles (EMAPs) [41] method.