Deep Kernel Extreme-Learning Machine for the Spectral–Spatial Classification of Hyperspectral Imagery

Abstract: Extreme-learning machines (ELMs) have attracted significant attention in hyperspectral image classification due to their extremely fast and simple training structure. However, their shallow architecture may not be capable of further improving classification accuracy. Recently, deep-learning-based algorithms have focused on deep feature extraction. In this paper, a deep neural network-based kernel extreme-learning machine (KELM) is proposed. Furthermore, a spatial guided filter with the first principal component (GFFPC) is also proposed for spatial feature enhancement. Consequently, a new classification framework derived from the deep KELM network and GFFPC is presented to generate deep spectral and spatial features. Experimental results demonstrate that the proposed framework outperforms some state-of-the-art algorithms at very low computational cost, making it suitable for real-time processing.


Introduction
Hyperspectral imagery in remote sensing, with hundreds of narrow spectral bands, is used in many applications, such as global environment monitoring, land-cover change detection, natural disaster assessment, and medical diagnosis [1-4]. Classification is a significant information-acquisition technique for hyperspectral imagery, which focuses on distinguishing physical objects by assigning each pixel a unique class label. With the development of machine learning, many algorithms based on statistical learning theory have been employed in the information processing field. There are many traditional classification algorithms, such as k-nearest neighbors (KNN) [5], Bayesian models [6], and random forests [7]. One of the most important and well-known classifiers for hyperspectral image classification is the kernel-based support vector machine (KSVM), which can also be viewed as a neural network [8,9]. It provides superior classification performance by learning an optimal decision hyperplane that best separates different classes in a high-dimensional feature space through non-linear mapping. Popular kernels include the polynomial function and the Gaussian radial basis function (RBF).

The main contributions of this paper are summarized as follows:

1. ML-ELM is investigated and applied for the first time to hyperspectral classification.

2. A classification framework is proposed for hyperspectral classification, which combines DKELM with a novel guided-filtering first-principal-component (GFFPC) spatial filter.

3. The DKELM model remains simple: randomly generated parameters are unnecessary, and only the kernel parameters need to be tuned in each layer. Furthermore, unlike ML-ELM, the number of nodes in each hidden layer does not need to be set, thanks to the kernel trick.

4. The proposed framework can achieve superior performance with a very fast training speed, which is beneficial for real-time applications.
This paper is organized as follows. Section 2 briefly introduces ELM, KELM and ML-ELM (for convenience, we use MELM for ML-ELM in the following sections). Section 3 proposes our new framework to address the hyperspectral classification problem. Section 4 describes the datasets and parameter tuning. Experimental results are presented and discussed in Section 5. Conclusions are drawn in Section 6.

ELM and KELM
The ELM is a single hidden-layer feed-forward neural network (SLFN), as depicted in Figure 1. It contains three layers: an input layer, a hidden layer, and an output layer. Please note that the hidden layer is non-linear because of the use of a non-linear activation function, whereas the output layer is linear, without an activation function. Let x represent a training sample and f(x) be the output of the neural network. The SLFN with k hidden nodes can be represented by the following equation:

f(x) = \sum_{j=1}^{k} \beta_j G(w_j, b_j, x),    (1)

where G(w, b, x) denotes the hidden-layer activation function, w is the input weight matrix connecting the input layer to the hidden layer, b is the bias weight of the hidden layer, and B = [\beta_1, \beta_2, ..., \beta_k]^T is the matrix of output weights, where \beta_j is the m-dimensional weight vector connecting the j-th hidden neuron to the output layer. For an ELM with n training samples, d input neurons (i.e., the number of bands), k hidden neurons, and m output neurons (i.e., m classes), Equation (1) becomes

t_i = \sum_{j=1}^{k} \beta_j g(\langle w_j, x_i \rangle + b_j), \quad i = 1, ..., n,    (2)

where t_i is the m-dimensional desired output vector for the i-th training sample x_i, the d-dimensional w_j represents the weight vector from the input layer to the j-th hidden neuron, and b_j is the bias of the j-th hidden neuron. Here, \langle w_j, x_i \rangle denotes the inner product of w_j and x_i. The sigmoid function g is used as the activation function, so the output of the j-th hidden neuron is

g(\langle w_j, x_i \rangle + b_j) = \frac{1}{1 + \exp(-\lambda (\langle w_j, x_i \rangle + b_j))},    (3)

where \exp(\cdot) denotes the exponential function and \lambda is the steepness parameter.
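The random-feature hidden layer above can be sketched in a few lines of NumPy (function names and the uniform weight initialization are our assumptions; a unit steepness parameter is used):

```python
import numpy as np

def hidden_layer_output(X, k, seed=0):
    # Random-feature hidden layer of an ELM: H[i, j] = g(<w_j, x_i> + b_j),
    # with sigmoid activation g. W and b are drawn once and never trained.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.uniform(-1, 1, size=(d, k))  # input-to-hidden weights w_j
    b = rng.uniform(-1, 1, size=k)       # hidden biases b_j
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

X = np.random.default_rng(1).normal(size=(5, 3))  # 5 samples, 3 "bands"
H = hidden_layer_output(X, k=8)                   # (n, k) hidden outputs
```

Because the hidden parameters are random and fixed, only the output weights remain to be learned, which is what makes ELM training so fast.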
In matrix form, model (2) can be rearranged as

H B = T,    (4)

where T ∈ R^{n×m} is the target output and H is referred to as the hidden-layer output matrix of the ELM, with size (n, k), which can also be expressed as follows:

H = \begin{bmatrix} g(\langle w_1, x_1 \rangle + b_1) & \cdots & g(\langle w_k, x_1 \rangle + b_k) \\ \vdots & \ddots & \vdots \\ g(\langle w_1, x_n \rangle + b_1) & \cdots & g(\langle w_k, x_n \rangle + b_k) \end{bmatrix}.    (5)

Then, B can be estimated by the smallest-norm least-squares solution

B = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T,    (6)

where C is a regularization parameter. The ELM model can thus be represented as

f(x) = h(x) B = h(x) H^T \left( \frac{I}{C} + H H^T \right)^{-1} T,    (7)

where h(x) is the hidden-layer output row vector for sample x. ELM can be extended to the kernel-based ELM (KELM) via the kernel trick. Let Ω = H H^T, in which Ω_{ij} = h(x_i) \cdot h(x_j) = K(x_i, x_j), where x_i and x_j represent the i-th and j-th training samples, respectively. Then, replacing H H^T by Ω and h(x) H^T by the corresponding kernel vector, the representation of KELM can be written as

f_{KELM}(x) = [K(x, x_1), ..., K(x, x_n)] \left( \frac{I}{C} + Ω \right)^{-1} T,    (8)

where f_{KELM}(x) represents the output of the KELM model. Obviously, different from ELM, the most important characteristic of KELM is that the number of hidden nodes does not need to be set and there are no random feature mappings. Furthermore, the computing time is reduced compared with ELM thanks to the kernel trick. Figure 2 depicts the detailed structure of the ELM autoencoder (ELM-AE). The ELM-AE represents features based on singular values. MELM is a multilayer neural network built by stacking multiple ELM-AEs.
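The KELM solution above reduces to a single linear solve. A minimal NumPy sketch under stated assumptions (RBF kernel, toy XOR labels; all function names are ours):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / sigma^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def kelm_train(X, T, C=1e6, sigma=1.0):
    # Kernel-space "weights": (I/C + Omega)^(-1) T, with Omega = K(X, X).
    Omega = rbf_kernel(X, X, sigma)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, T)

def kelm_predict(Xtest, Xtrain, alpha, sigma=1.0):
    # f(x) = [K(x, x_1), ..., K(x, x_n)] (I/C + Omega)^(-1) T
    return rbf_kernel(Xtest, Xtrain, sigma) @ alpha

# Toy usage: the XOR problem, unsolvable by a linear classifier.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.eye(2)[[0, 1, 1, 0]]          # one-hot XOR labels
alpha = kelm_train(X, T)
pred = kelm_predict(X, X, alpha).argmax(axis=1)  # recovers the XOR labels
```

Note that no hidden-layer width appears anywhere: the kernel matrix Ω plays the role of the random feature mapping, which is exactly the property the text highlights.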

Multilayer Extreme-Learning Machines (MELM)
In MELM, each hidden layer is trained by an ELM-AE, in which the layer input X^{(i)} also serves as the target, and Λ^{(i)} is the transformation matrix used for representation learning with respect to X^{(i)}. According to Equation (4), B is replaced by Λ^{(i)} and T is replaced by X^{(i)}, respectively, in MELM:

H^{(i)} Λ^{(i)} = X^{(i)},    (9)

where H^{(i)} is the output matrix of the i-th hidden layer with respect to X^{(i)}, and Λ^{(i)} can be solved by

Λ^{(i)} = H^{(i)T} \left( \frac{I}{C} + H^{(i)} H^{(i)T} \right)^{-1} X^{(i)}.    (10)

Then, the representation of the next layer is obtained via

X^{(i+1)} = g(X^{(i)} Λ^{(i)T}),    (11)

and repeating this procedure layer by layer yields X*, the last representation of X^{(1)}. X* is used as the hidden-layer output to calculate the output weight β*, and β* can be computed as

β* = X*^T \left( \frac{I}{C} + X* X*^T \right)^{-1} T.    (12)
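The stacking procedure can be sketched compactly in NumPy (a sketch under our assumptions: uniform random AE weights, sigmoid activation, and our own function names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae(X, k, C=1e3, seed=0):
    # ELM autoencoder: random hidden mapping; the transformation matrix
    # Lambda reconstructs the input X from the hidden output H.
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(X.shape[1], k))
    b = rng.uniform(-1, 1, size=k)
    H = sigmoid(X @ W + b)                                    # (n, k)
    return np.linalg.solve(np.eye(k) / C + H.T @ H, H.T @ X)  # Lambda, (k, d)

def melm_transform(X, layer_sizes, C=1e3):
    # Stack ELM-AEs: each layer's representation feeds the next AE.
    for i, k in enumerate(layer_sizes):
        Lam = elm_ae(X, k, C, seed=i)
        X = sigmoid(X @ Lam.T)  # next representation, (n, k)
    return X                    # X*: the final representation

X = np.random.default_rng(7).normal(size=(10, 8))
X_star = melm_transform(X, layer_sizes=[6, 4])
```

A final regularized least-squares classifier (as in a plain ELM) is then trained on X*.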

The Proposed Framework for Hyperspectral Classification
In this section, we propose a new framework for hyperspectral classification which combines hyperspectral spatial features with the deep-layer-based KELM. The main procedure of the proposed framework is briefly depicted in Figure 3. From Figure 3, we can see that the three major parts are: data normalization, spatial feature enhancement, and DKELM classification. The following sections introduce these three procedures in detail. Λ̂^{(1)} is the first layer's transformation matrix of DKELM, obtained from the KELM-AE of the input data. Λ̂^{(i+1)} refers to the transformation matrix of the (i+1)-th hidden layer h_{i+1} of DKELM, which is obtained by the KELM-AE of the i-th hidden layer h_i.

Data Normalization
Let X ∈ R^{N×L} be the hyperspectral data, where N denotes the number of samples and L is the number of bands. Data normalization is the pre-processing step that standardizes each sample. For each sample,

\hat{x}_i = \frac{x_i - μ_i}{δ_i},    (13)

where \hat{x}_i is the normalized sample, x_i ∈ X, i ∈ {1, ..., N}, and μ_i and δ_i denote the mean and standard deviation of the sample, respectively. After this process, the data have zero mean and unit variance.
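This per-sample standardization is a one-liner in NumPy (the eps guard against constant spectra is our addition):

```python
import numpy as np

def normalize_samples(X, eps=1e-12):
    # Standardize each sample (row) across its L bands to zero mean
    # and unit variance; eps guards against constant spectra.
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / (sd + eps)

# Toy usage: 6 spectra with 40 bands each.
Xn = normalize_samples(np.random.default_rng(0).normal(2.0, 5.0, size=(6, 40)))
```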

Spatial Features Enhancement
In our proposed framework, we use the Gaussian filters to extract spatial information; furthermore, a spatial feature enhancement method combined with guided filter (GF) and principal component analysis (PCA) is presented to enhance spatial information. Here, we introduce the GFFPC in detail.
The GF was proposed by He et al. in 2012 [38]. It can be used as an edge-preserving filter, like the bilateral filter, but it performs better near edges and runs faster. As a local linear model, the GF assumes that the output image o is a linear transformation of the guidance image g in a window W_k of size ω × ω centered at pixel k, where ω = 2r + 1:

o_i = a_k g_i + b_k, \quad \forall i ∈ W_k,    (14)

where o_i is the i-th pixel of the output image o and g_i is the i-th pixel of the guidance image g, respectively.
It is obvious that this model ensures

∇o = a_k ∇g,    (15)

which means the output o will have an edge only where g has an edge. To calculate the coefficients a_k and b_k, the following energy function is minimized:

E(a_k, b_k) = \sum_{i ∈ W_k} \left[ (a_k g_i + b_k - f_i)^2 + α a_k^2 \right],    (16)

where f_i is the i-th pixel of the input image f, and α is a regularization parameter penalizing large a_k, which controls the degree of blurring of the GF. According to the energy function, the output image o should be as close as possible to the input image f while preserving the texture information of the guidance image g through the linear model. The solution to (16) is given by linear regression as follows (Draper, 1981):

a_k = \frac{\frac{1}{|W|} \sum_{i ∈ W_k} g_i f_i - μ_k \bar{f}_k}{ν_k^2 + α},    (17)

b_k = \bar{f}_k - a_k μ_k,    (18)

where μ_k and ν_k^2 are the mean and variance of the guidance image g within the local window W_k, respectively, |W| is the number of pixels in W_k, and \bar{f}_k = \frac{1}{|W|} \sum_{i ∈ W_k} f_i is the mean value of f in the window.
The structure of GFFPC is shown in Figure 4. The original hyperspectral image is first processed by PCA, yielding a reconstructed dataset consisting of a new set of uncorrelated bands, the PCs. After that, the GF is applied to each band of the original dataset, using the first PC of the reconstructed dataset as the gray-scale guidance image, since it carries the most information of the hyperspectral image, including its spatial features. The filtering output is a new hyperspectral data cube with more distinctive edges and texture features, which helps the subsequent hyperspectral classification.
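The whole GFFPC pipeline fits in a few dozen lines of NumPy. The following is a sketch under stated assumptions (edge padding in the box filter, an SVD-based first principal component; all function names are ours):

```python
import numpy as np

def box_mean(img, r):
    # Mean over a (2r+1) x (2r+1) window via 2-D cumulative sums,
    # with edge padding at the borders.
    p = np.pad(img, r, mode='edge').cumsum(0).cumsum(1)
    p = np.pad(p, ((1, 0), (1, 0)))
    H, W = img.shape
    s = p[2*r+1:, 2*r+1:] - p[:H, 2*r+1:] - p[2*r+1:, :W] + p[:H, :W]
    return s / (2*r + 1) ** 2

def guided_filter(f, g, r=3, alpha=1e-4):
    # Guided filter: o_i = mean(a_k) * g_i + mean(b_k) over windows covering i.
    mg, mf = box_mean(g, r), box_mean(f, r)
    var_g = box_mean(g * g, r) - mg * mg
    cov_gf = box_mean(g * f, r) - mg * mf
    a = cov_gf / (var_g + alpha)   # linear coefficient a_k
    b = mf - a * mg                # offset b_k
    return box_mean(a, r) * g + box_mean(b, r)

def gffpc(cube, r=3, alpha=1e-4):
    # Guide every spectral band with the first principal component
    # of the pixel spectra (the GFFPC idea).
    H, W, L = cube.shape
    Xc = cube.reshape(-1, L) - cube.reshape(-1, L).mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = (Xc @ Vt[0]).reshape(H, W)       # first-PC guidance image
    out = np.empty(cube.shape)
    for band in range(L):
        out[:, :, band] = guided_filter(cube[:, :, band].astype(float),
                                        pc1, r, alpha)
    return out

# Sanity check: a constant cube should pass through unchanged.
smoothed = gffpc(np.full((6, 7, 3), 2.0))
```

Since the same guidance image is shared by every band, the per-window statistics of pc1 could be precomputed once in an optimized implementation.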

DKELM Classification
DKELM consists of several KELM autoencoders (KELM-AEs) stacked in deep layers. Thus, we first present a brief description of the KELM-AE. Figure 5 demonstrates the structure of the KELM-AE, which is very similar to the ELM-AE except for the kernel representation. The kernel operation in Figure 5 can be represented as

KELM-AE
K(x, x_j) = \langle φ(x), φ(x_j) \rangle,    (19)

where x is referred to as a testing sample, x_j denotes the j-th training sample, and φ is the mapping function to the reproducing kernel Hilbert space (RKHS). In our proposed DKELM, we use the RBF kernel function with parameter σ_k:

K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{σ_k^2} \right).    (20)

From Figure 5, the input matrix X^{(i)} of the i-th KELM-AE yields the kernel matrix

Ω̂^{(i)}_{kj} = K(x^{(i)}_k, x^{(i)}_j).    (21)

Λ̂^{(i)} is employed to represent the i-th transformation matrix in the KELM-AE, which is similar to Λ^{(i)} of the ELM-AE in (10), and Λ̂^{(i)} is calculated via

Λ̂^{(i)} = \left( \frac{I}{C_i} + Ω̂^{(i)} \right)^{-1} X^{(i)}.    (22)

The data can then be represented through the data transformation procedure

X^{(i+1)} = g(Ω̂^{(i)} Λ̂^{(i)}),    (23)

where g is still an activation function. The hidden-layer activation functions can be either linear or non-linear; in our proposed DKELM, we use non-linear activation functions, because deep, distinct and abundant features can be learned and captured through the non-linear data representations between the KELM-AEs, and the combination of dense and sparse representations has more interpretive power than purely linear learning. Compared to the ELM-AE, the number of hidden nodes does not need to be set in advance because of the kernel trick used in each hidden layer. The pseudocode of KELM-AE is depicted in Algorithm 1.
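A single KELM-AE step can be sketched in NumPy as follows (our assumptions: RBF kernel, sigmoid activation between layers, and our own function names):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # RBF kernel matrix between row-sample sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def kelm_ae(X, C=1e3, sigma=1.0):
    # KELM autoencoder: the kernel matrix replaces the random hidden
    # layer, and the targets are the inputs themselves.
    Omega = rbf_kernel(X, X, sigma)                       # (n, n)
    Lam = np.linalg.solve(np.eye(len(X)) / C + Omega, X)  # transformation matrix
    X_new = 1.0 / (1.0 + np.exp(-(Omega @ Lam)))          # sigmoid activation
    return Lam, X_new

X = np.random.default_rng(0).normal(size=(10, 4))
Lam, X_new = kelm_ae(X)  # transformation matrix and new representation
```

Notice that no hidden-layer width appears: the kernel matrix is n × n by construction, which is the "kernel trick" advantage over the ELM-AE mentioned above.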

DKELM
DKELM can achieve universal approximation owing to the two separate learning procedures it contains, the same as in H-ELM [31]. Each pair of Λ̂^{(i)} and X^{(i)} (in the i-th KELM-AE) can be computed via Equations (22) and (23), respectively. Finally, the final data representation X*_final is calculated, and X*_final is used as the training input to train a KELM classification model:

f_{DKELM}(x) = [K(x*, x*_1), ..., K(x*, x*_n)] \left( \frac{I}{C} + Ω*_final \right)^{-1} T,    (24)

where x* is the final representation of sample x and Ω*_final is obtained from X*_final. Then, the output weight β can be calculated via

β = \left( \frac{I}{C} + Ω*_final \right)^{-1} T.    (25)

The procedure of DKELM is depicted in Algorithm 2, including the training and testing phases.
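The two phases of Algorithm 2 can be written compactly in NumPy. The following is a sketch under our assumptions (RBF kernels, sigmoid between layers, a toy two-cluster problem; all names are ours):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dkelm_train(X, T, sigmas=(1.0, 1.0, 1.0), C=1e3):
    # Stack KELM-AEs, then solve a KELM classifier on the final
    # representation (training phase of the DKELM idea).
    Lams, reps = [], [X]
    for s in sigmas[:-1]:
        Omega = rbf_kernel(reps[-1], reps[-1], s)
        Lam = np.linalg.solve(np.eye(len(X)) / C + Omega, reps[-1])
        Lams.append(Lam)
        reps.append(sigmoid(Omega @ Lam))
    Omega_f = rbf_kernel(reps[-1], reps[-1], sigmas[-1])
    beta = np.linalg.solve(np.eye(len(X)) / C + Omega_f, T)
    return Lams, reps, beta

def dkelm_predict(Xt, Lams, reps, beta, sigmas=(1.0, 1.0, 1.0)):
    # Testing phase: propagate with the stored transformation matrices,
    # then apply the final kernel classifier.
    Z = Xt
    for i, Lam in enumerate(Lams):
        Z = sigmoid(rbf_kernel(Z, reps[i], sigmas[i]) @ Lam)
    return rbf_kernel(Z, reps[-1], sigmas[-1]) @ beta

# Toy usage: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
T = np.eye(2)[[0] * 20 + [1] * 20]
Lams, reps, beta = dkelm_train(X, T)
pred = dkelm_predict(X, Lams, reps, beta).argmax(axis=1)
```

Every step is a dense linear solve against an n × n kernel matrix, which is why training cost scales with the number of training samples rather than with a hidden-layer width.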

Algorithm 1
The pseudocode of KELM-AE.
Step 1: Compute the kernel matrix Ω̂^{(i)} with entries Ω̂^{(i)}_{kj} = K(x^{(i)}_k, x^{(i)}_j), where x_k and x_j are referred to as the k-th and j-th training samples, respectively.
Step 2: Calculate the output weight Λ̂^{(i)} = (I/C_i + Ω̂^{(i)})^{-1} X^{(i)}.
Step 3: Calculate the new data representation X^{(i+1)} = g_i(Ω̂^{(i)} Λ̂^{(i)}).

Training Phase
Input: Input matrix X^{(1)}, output matrix T, regularization parameter C_i for each hidden layer, kernel parameter σ_i for each hidden layer, activation function g_i for each hidden layer, and the number of layers N.
Output: Transformation matrix Λ̂^{(i)} for each hidden layer, the final representation X^{N} of the training samples, and the final output weight β of the output layer.
Step 1: For i = 1, ..., N − 1, run KELM-AE (Algorithm 1) on X^{(i)} to obtain Λ̂^{(i)} and X^{(i+1)}.
Step 2: Take X^{N} as the final representation of the training samples.
Step 3: Compute the final kernel matrix Ω*_final with entries K(x^{N}_k, x^{N}_j), where x^{N}_k and x^{N}_j are the final representations of the training samples x_k and x_j, respectively.
Step 4: Calculate the final output weight β = (I/C + Ω*_final)^{-1} T.

Testing Phase
Input: Input matrix of testing samples TX^{(1)}, outputs of the training phase Λ̂^{(1)}, ..., Λ̂^{(N−1)}, X^{N}, and β.
Output: Output matrix of the testing samples TY.
Step 1: For i = 1, ..., N − 1, propagate the testing samples via TX^{(i+1)} = g_i(K(TX^{(i)}, X^{(i)}) Λ̂^{(i)}).
Step 2: Compute the testing kernel matrix Ω_test with entries K(tx^{N}_j, x^{N}_k), where x^{N}_k and tx^{N}_j are the final representation of the training sample x_k and the final representation of the testing sample x_j, respectively.
Step 3: Calculate the output matrix TY = Ω_test β.
MELM employs the pseudoinverse to calculate the transformation matrix in each layer. In contrast, the KELM-AEs of DKELM use the exact inverse of the invertible kernel matrix to calculate Λ̂^{(i)}. Therefore, a theoretically perfect reconstruction of X^{(i)} results, which reduces the error accumulation of the AEs to a certain degree. Consequently, DKELM can learn a better data representation and achieve better generalization.

Experiments
In this section, we design a series of experiments to evaluate our proposed hyperspectral framework combining spatial filters with DKELM. As comparison algorithms, ELM, KELM, KSVM and CNN are used in our experiments. In addition, all of these algorithms are combined with the GFFPC spatial feature enhancement method for further comparison. The evaluation criteria employed are the overall accuracy (OA) and the kappa coefficient. Three classic hyperspectral benchmark datasets and one self-photographed hyperspectral dataset, denoted Indian Pines, University of Pavia, Salinas and Glycine ussuriensis, are used. In this paper, except for CNN, all experiments are performed using MATLAB R2017b on a computer with a 3.2 GHz CPU and 8.0 GB RAM. The CNN algorithm is run on an NVIDIA Tesla K80 GPU.

Hyperspectral Datasets
The Indian Pines dataset was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over North-western Indiana. It contains 220 spectral bands covering the 0.4-2.45 µm region with a spatial resolution of 20 m. The image has 145 × 145 pixels and 200 bands after removing 20 noisy and water-absorption bands. The Indian Pines scene is two-thirds agriculture and one-third forest. The groundtruth of this image is introduced in Figure 6. Sixteen classes are contained, and their numbers of labeled samples are tabulated in Table 1. For Indian Pines, 10% to 50% of the labeled data of each class are selected randomly as training samples, and the rest are used for testing. The second dataset covers the University of Pavia, an urban scene acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. The ROSIS sensor generates 115 spectral bands over 0.43 to 0.86 µm. This image scene has 610 × 340 pixels, and each pixel has 103 bands after noisy-band removal. The geometric resolution is 1.3 m per pixel. The nine groundtruth classes with their numbers of labeled samples are tabulated in Table 2. Figure 7 demonstrates the groundtruth of the University of Pavia, used as the reference image. For the University of Pavia, 5% to 25% of the labeled data of each class are used as training samples to evaluate the proposed framework. The third dataset, named Salinas, was also collected by the 224-band AVIRIS sensor, over Salinas Valley, California. This image scene comprises 512 × 217 pixels and has 204 bands after removing noisy and water-absorption bands. The groundtruth depicted in Figure 8 contains 16 classes, and the detailed sample counts are shown in Table 3. In the Salinas experiments, 5% to 25% of the labeled samples of each class are chosen randomly for training our proposed classification framework.
The last dataset, named the Glycine ussuriensis dataset, was collected over the Yellow River Delta National Nature Reserve, Qingdao, China. The image was acquired by the Nano-Hyperspec imaging system mounted on an unmanned aerial vehicle. This image scene comprises 355 × 266 pixels with 270 bands after removing noisy and water-absorption bands. The Glycine ussuriensis dataset contains four classes, shown in Figure 9 and Table 4. All four categories are plants; specifically, Glycine ussuriensis is a small-seeded species that grows on hills, along roadsides, or in shrubland at 100-800 m above sea level. Unlike the datasets mentioned before, the classes in Glycine ussuriensis have no strict geographical separation; for instance, the samples of tarragon are surrounded by samples of other classes. In the experiments, 5% to 25% of the labeled samples of each class are chosen randomly to validate our proposed classification framework.

Parameters Tuning and Setting
In our proposed framework, several parameters need to be tuned. Two user-specified parameters are required for GFFPC: the size of the local sliding window, denoted ω, and the regularization parameter, denoted α, the latter determining the degree of blurring of the GF. In the simulations, ω is chosen from [3, 5, 7, 9, 11], and α is selected from the range [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]. Figure 10a shows the classification accuracies of DKELM over the ω subspace, with the parameter α fixed. It can be seen that better performance is achieved when ω equals 7 for all the datasets. Figure 10b then indicates the impact of different α on the classification results. The OA first increases and then decreases as α decreases; thus, we finally set α to 1e-4. Figure 11 shows the accuracies obtained with different numbers of kernel layers in DKELM. We can see that with three kernel layers, DKELM achieves superior results: given the characteristics of the hyperspectral datasets, three kernel layers can already extract sufficiently refined and discriminative features for the classification task, while over-fitting occurs when the network is too deep with limited training samples. Thus, in our proposed framework, we set the number of kernel layers to 3. The kernel function used in this paper is the RBF. The activation functions employed to connect different KELM-AEs are the sigmoid and ReLU functions. The sigmoid function constrains the output of each layer to the range from 0 to 1, as given in Equation (3). The ReLU function [39] can be expressed as

ReLU(x) = \max(0, x),    (26)

where x is the input. ReLU leads to sparse learning, which can decrease the correlation among the parameters and mitigate the over-fitting problem. Therefore, the combination of sigmoid and ReLU functions can learn deep features at different scales, which is beneficial to DKELM.
Our proposed DKELM has three kernel layers; therefore, three main parameters need to be tuned: σ_1, σ_2 and σ_3, the parameters of the RBF kernel functions. To take full advantage of our framework, a grid search algorithm is employed to tune σ_1, σ_2 and σ_3. Figure 12 depicts the detailed tuning procedures of the three parameters for the Indian Pines, University of Pavia, Salinas, and Glycine ussuriensis datasets, respectively. The vertical coordinate axis represents σ_3, and the two horizontal coordinate axes represent σ_1 and σ_2. The color bars in Figure 12 indicate the corresponding classification accuracies.
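Generically, such a grid search can be sketched as follows (`train_fn` and `score_fn` are hypothetical stand-ins for fitting the model with a given (σ_1, σ_2, σ_3) triple and measuring its OA):

```python
import itertools
import numpy as np

def grid_search_sigmas(train_fn, score_fn, grid):
    # Exhaustively evaluate every (sigma1, sigma2, sigma3) triple and
    # keep the one with the highest score (e.g., overall accuracy).
    best, best_score = None, -np.inf
    for sigmas in itertools.product(grid, repeat=3):
        score = score_fn(train_fn(sigmas))
        if score > best_score:
            best, best_score = sigmas, score
    return best, best_score

# Dummy usage: the "model" is the triple itself and the score peaks
# at (2, 4, 8), so the search should recover exactly that triple.
best, _ = grid_search_sigmas(
    train_fn=lambda s: s,
    score_fn=lambda s: -abs(s[0] - 2) - abs(s[1] - 4) - abs(s[2] - 8),
    grid=[1, 2, 4, 8],
)
```

The cost grows cubically with the grid size, which is acceptable here because only three kernel parameters are searched.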

Experimental Results and Discussions
In this section, the proposed GFFPC and the novel classification model are assessed, and the related results are summarized and discussed at length. All the algorithms are repeated for 10 runs with different locations of the training samples for each dataset, and the mean results with the corresponding standard deviations are reported.

Discussion on the Proposed GFFPC
In the experiments, we first investigate the impact of different spatial filters on MELM and DKELM. Figure 13 illustrates the OA of different spatial filters based on the MELM and DKELM algorithms. "Origin" denotes the performance obtained by the original MELM and DKELM. G_3 × 3 and G_5 × 5 denote using a Gaussian filter with a 3 × 3 and 5 × 5 window, respectively, before employing MELM and DKELM. GFFPC denotes applying GFFPC before MELM and DKELM. From Figure 13, we can see that the performance with spatial filters is better than without, and that GFFPC obtains performance superior to the Gaussian filters. Consequently, in our subsequent experiments, GFFPC is our recommended spatial filter for strongly enhancing spatial features. Table 5 tabulates the performance achieved by the different classification algorithms and their combinations with the GFFPC spatial filter using 10% of the labeled samples as training samples in Indian Pines. Although the OA of DKELM is slightly worse than that of CNN, our proposed DKELM-GFFPC framework outperforms the other classification algorithms. Beyond that, our proposed GFFPC enhances the classification performance by a wide margin: for instance, the OA of ELM-GFFPC is increased by 16.23%, and the OA of CNN-GFFPC is improved from 83.19% to 95.57%. In particular, the performances of MELM and DKELM are enhanced by 17.02% and 16.38%, respectively. For classes 1, 9 and 16 of Indian Pines, the accuracies increase from 83.33%, 66.67% and 98.33% to 100%, 94.44% and 100% when DKELM-GFFPC is used instead of DKELM. This indicates that our proposed framework is beneficial to the performance of several small-size classes. Table 6 also lists the classification performance of the comparison algorithms and the proposed algorithm on the University of Pavia dataset, using 5% of the labeled samples as training samples.
From the accuracies exhibited in Table 6, the OA of DKELM is improved by 13.95% by using the GFFPC spatial filter. The performance of the other algorithms is also enhanced to different degrees, from 2.6% to 11.29%. It can also be seen that, after combining with GFFPC, the classification performance is greatly enhanced; in particular, DKELM-GFFPC achieves the best performance. Furthermore, the accuracies of small-size classes such as classes 5, 7 and 9 are improved by using GFFPC, implying that GFFPC can retain more distinctive features of small-size classes. Table 6. Classification Accuracy (%) of the comparison and proposed algorithms on the University of Pavia. Table 7 demonstrates the performance obtained by the different classification algorithms on the Salinas dataset, adopting 5% of the labeled samples as training samples. Compared with the other algorithms, the proposed DKELM-GFFPC is still predominant. In addition, the classification performance of ELM, KELM, KSVM, CNN, MELM and DKELM is increased by 5.86%, 7.81%, 6.21%, 6.43%, 6.61% and 6.22%, respectively. Furthermore, classes 11, 13 and 14 are small classes with few training samples, and their performance is greatly enhanced by the GFFPC spatial filter.

Discussion on the Classification Results
The classification accuracies achieved on the Glycine ussuriensis dataset are tabulated in Table 8. From Table 8, DKELM-GFFPC obtains better performance than the other classification frameworks. GFFPC still has a positive influence on the classification performance; for instance, the OAs of MELM and CNN are improved by 13.41% and 12.48% by adding the GFFPC spatial filter. Moreover, the classification accuracy of the class Glycine ussuriensis, which has the fewest labeled samples, is greatly enhanced by our proposed spatial filter and classification framework.
From Tables 5-8, we can see that CNN, MELM and DKELM work better than the other traditional classifiers. Therefore, to further validate our proposed framework, we compare DKELM-GFFPC with CNN-GFFPC and MELM-GFFPC using different percentages of training samples in Figure 14.
As the number of training samples increases, the classification performance improves gradually. Obviously, DKELM-GFFPC outperforms the other two classification frameworks regardless of the number of training samples. One of the most significant advantages of the proposed classification framework is its very fast training procedure. Thus, the average training times of the different algorithms are compared in Table 9. ELM-based methods are faster than KSVM and CNN. The training times of DKELM and KELM depend on the number of training samples, while those of ELM and MELM depend on the number of hidden neurons: the more hidden neurons, the longer the training time, since higher model generalization is provided subject to the data complexity. CNN can achieve superior classification performance; nevertheless, it requires nearly 135 to 411 times the training time of DKELM on the different experimental datasets. Therefore, DKELM is the most appealing, with the best performance and the least training time. To illustrate the merits of our proposed classification framework and spatial filter from the perspective of visualization, Figures 15-18 demonstrate the classification maps. Clearly, compared with the groundtruth shown in Figures 6-9, the classification maps obtained by our proposed framework are the smoothest and clearest. Besides, the classification maps achieved with the GFFPC spatial filter are evidently more distinct than those without. In particular, the border pixels and the boundaries between different classes are more distinct in DKELM-GFFPC. Compared with the other classification methods, our proposed framework is better because its classification maps contain less salt-and-pepper noise.

Conclusions
In this work, the MELM algorithm was investigated and applied to hyperspectral classification for the first time. Then, a DKELM-GFFPC framework was proposed, consisting of the GFFPC for enhancing spatial features and the DKELM, a kernel version of MELM. Experimental results demonstrate that it outperforms other traditional algorithms; in particular, DKELM-GFFPC improves the accuracy of classes with small sample sizes to different degrees on each dataset. Moreover, compared with the Gaussian filter, the proposed GFFPC plays an important role in enhancing hyperspectral classification performance. Finally, our proposed classification framework achieves the highest classification accuracy at the lowest computational cost. Based on the above advantages, we believe that the proposed hyperspectral classification framework based on the novel DKELM and GFFPC is well suited to processing hyperspectral data in practical applications with a low computing cost, and, furthermore, in real-time applications.
Author Contributions: J.L. and Q.D. conceived and designed the study; B.X. performed the experiments; G.R. and R. S. shared part of the experiment data; J.L. and Y.L. analyzed the data; J.L. and B.X. wrote the paper. Y.L., Q.D. and R. S. reviewed and edited the manuscript. All authors read and approved the manuscript.