Hyperspectral Imagery Classification Based on Semi-Supervised Broad Learning System

Abstract: Recently, deep learning-based methods have drawn increasing attention in hyperspectral imagery (HSI) classification due to their strong nonlinear mapping capability. However, these methods suffer from a time-consuming training process because of their many network parameters. In this paper, the concept of broad learning is introduced into HSI classification. Firstly, to make full use of the abundant spectral and spatial information of hyperspectral imagery, hierarchical guidance filtering is performed on the original HSI to obtain its spectral-spatial representation. Then, the class-probability structure is incorporated into the broad learning model to obtain a semi-supervised broad learning version, so that limited labeled samples and many unlabeled samples can be utilized simultaneously. Finally, the connecting weights of the broad structure can be easily computed through the ridge regression approximation. Experimental results on three popular hyperspectral imagery datasets demonstrate that the proposed method achieves better performance than deep learning-based methods and conventional classifiers.


Introduction
Hyperspectral imagery (HSI) captured by hyperspectral sensors has high spectral and spatial resolution, and thus a strong capability to distinguish surface objects [1]. HSI has been widely applied in many fields, including agricultural monitoring [2], environment analysis and prediction [3], and climate monitoring [4]. HSI classification is a common task in these applications, i.e., assigning a surface-object class label to every HSI pixel by using a small number of training samples.
In recent years, many methods have been proposed to address HSI classification. The K-nearest neighbor (KNN) classifier [5] determines the class of a testing sample by calculating the Euclidean distances between the testing sample and the training samples. The support vector machine (SVM) [6,7] projects samples into a high-dimensional space via kernel functions and distinguishes sample classes by learning a classification hyperplane, which achieves satisfactory performance in small-sample classification tasks. The extreme learning machine (ELM) [8,9] is a single-hidden-layer neural network with the following characteristics: (1) the connecting weights between input-layer and hidden-layer neurons are randomly assigned and do not need to be adjusted during the learning process; and (2) the connecting weights between hidden-layer and output-layer neurons can be calculated via the least-squares method. Therefore, the computational efficiency of ELM is high.
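The following sketch illustrates the two ELM characteristics just described: randomly assigned input weights that are never updated, and output weights obtained in closed form by regularized least squares. It is a minimal illustration rather than the exact configuration used in the experiments; the tanh activation, the weight ranges, and all function names are assumptions.

```python
import numpy as np

def train_elm(X, Y, n_hidden=500, reg=1.0, seed=0):
    """Train a single-hidden-layer ELM on data X (n x m) and one-hot labels Y (n x c)."""
    rng = np.random.default_rng(seed)
    # (1) Input-to-hidden weights and biases are randomly assigned and never adjusted.
    W_in = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W_in + b)  # hidden-layer outputs
    # (2) Hidden-to-output weights are solved by regularized least squares.
    W_out = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W_in, b, W_out

def predict_elm(X, W_in, b, W_out):
    # The predicted class of each sample is the output node with the largest response.
    return (np.tanh(X @ W_in + b) @ W_out).argmax(axis=1)
```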
Recently, deep learning (DL) has been found to be able to automatically learn representative features from data by stacking multi-layer nonlinear units [10,11], and has been successfully applied to HSI classification. Graph-based semi-supervised methods have also been explored: a low-rank and sparse graph captures both the global structure (by the low-rankness) and the locally linear structure (by the sparseness) of the data, and hence is both generative and discriminative. Morsier et al. [27] presented a kernel low-rank and sparse graph, which was based on sample proximities in reproducing kernel Hilbert spaces and expressed sample relationships under sparse and low-rank constraints. However, the class structure of the data is not considered in the above methods. To address this, Shao et al. [28] presented a class-probability (CP) structure, which expresses the relation between each sample and each class via a class-probability matrix.
In summary, an HSI classification method is proposed based on a semi-supervised BLS (SBLS). The main contributions of this paper include: (1) To our knowledge, this is the first attempt to apply BLS to HSI classification tasks. The proposed SBLS achieves higher HSI classification accuracy and faster training speed. (2) The class-probability structure is introduced into BLS to obtain an extended semi-supervised BLS, which makes use of a limited number of labeled samples as well as a vast number of unlabeled samples.

HSI Classification Based on SBLS
The flowchart of HSI classification based on SBLS is shown in Figure 1 and includes three steps: (1) the original HSI data are processed by hierarchical guidance filtering (HGF) to obtain the spectral-spatial expression of the HSI; (2) the pseudo labels of unlabeled samples are obtained via the CP structure; and (3) the SBLS model is trained to obtain the predictive labels of the unlabeled samples.
Figure 1. Flowchart of HSI classification based on SBLS.

Hierarchical Guidance Filtering
The first step of SBLS is to obtain the HGF representation of the HSI, shown as Step 1 in Figure 1. The original hyperspectral images are expressed in the form of a 3D tensor. If the tensor is vectorized, not only is the data dimension greatly increased, but the inherent data structure is also destroyed. Pan et al. [29] proposed a spectral-spatial expression of HSI data using HGF. As an edge-preserving filtering method, HGF can remove noise and small details while preserving the overall structure of the image, and can thus map the original HSI data into a feature subspace with a richer feature expression. Given this superiority of HGF, the original HSI is processed by HGF to obtain its spectral-spatial expression.

As an extension of guided filtering and rolling guidance filtering, HGF can generate a series of joint spectral-spatial features. HGF minimizes the following energy function:

$E(a_k^p, b_k^p) = \sum_{i \in \omega_k} \left[ (a_k^p G_i + b_k^p - I_i^p)^2 + \varepsilon (a_k^p)^2 \right]$, (1)

where $a_k^p$ and $b_k^p$ are linear coefficients based on the input HSI data $I$ and the guidance image $G$, $\omega_k$ is the window around pixel $k$ with size $(2r + 1) \times (2r + 1)$, $r$ is the window radius, $i$ indexes the pixels in $\omega_k$, $p$ denotes the $p$-th band, and $\varepsilon$ is the controlling parameter; a larger $\varepsilon$ leads to a smoother output. Equation (1) is a ridge regression and can be solved by:

$a_k^p = \dfrac{\frac{1}{|\omega|} \sum_{i \in \omega_k} G_i I_i^p - \mu_k \bar{I}_k^p}{\sigma_k^2 + \varepsilon}, \quad b_k^p = \bar{I}_k^p - a_k^p \mu_k$, (2)

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $G$ in $\omega_k$, respectively; $\bar{I}_k^p$ is the mean of $I^p$ in $\omega_k$; and $|\omega|$ is the number of pixels in $\omega_k$. More details can be found in [29]. HGF is a kind of preprocessing step, and a similar strategy is also used in [19,29].
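To make Equations (1) and (2) concrete, the sketch below performs one guided-filtering pass on a single band, using a box mean over each $(2r + 1) \times (2r + 1)$ window. It is a minimal single-band illustration assuming a 2D guidance image; the hierarchical iteration that distinguishes HGF from plain guided filtering [29] is omitted, and all names are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter_band(I_p, G, r=2, eps=1e-2):
    """One guided-filtering pass on band I_p with guidance G (cf. Equations (1)-(2))."""
    mean = lambda x: uniform_filter(x, size=2 * r + 1, mode='reflect')  # box mean over omega_k
    mu_G = mean(G)                      # mu_k: mean of the guidance in each window
    var_G = mean(G * G) - mu_G ** 2     # sigma_k^2: variance of the guidance
    mean_I = mean(I_p)                  # mean of the input band in each window
    a = (mean(G * I_p) - mu_G * mean_I) / (var_G + eps)  # coefficient a_k^p
    b = mean_I - a * mu_G                                # coefficient b_k^p
    # Each pixel lies in many windows; average the coefficients before forming the output.
    return mean(a) * G + mean(b)
```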

Class-Probability Structure
The second step of SBLS is to obtain the pseudo labels of unlabeled samples via the CP structure, shown as Step 2 in Figure 1. The labeled samples under the HGF expression $X_S = \{x_1, \cdots, x_{n_S}\} \in \mathbb{R}^{n_S \times m}$ and the corresponding labels $Y_S = \{y_1, \cdots, y_{n_S}\} \in \mathbb{R}^{n_S \times c}$ are given, where $n_S$ is the number of labeled samples, $m$ is the dimensionality, $c$ is the number of classes, and $y_{ij}$ is a binary number: $y_{ij} = 1$ if the $i$-th sample belongs to the $j$-th class, and $y_{ij} = 0$ otherwise. The unlabeled samples under the HGF expression $X_U = \{x_1, \cdots, x_{n_U}\} \in \mathbb{R}^{n_U \times m}$ are also given, where $n_U$ is the number of unlabeled samples, so the overall number of samples is $n = n_S + n_U$. The similarity between the labeled samples $X_S$ and the unlabeled samples $X_U$ can then be expressed by a sparse-representation model with sparsity coefficient $a$ (Equation (3)), which can be solved with the alternating direction method of multipliers with adaptive penalty (ADMAP); more details can be found in [28]. The class-probability vector of $x_i$ is written as

$p_i = (p_{i1}, p_{i2}, \cdots, p_{ic}) \in \mathbb{R}^{1 \times c}$, (4)

where $p_{ij}$ is the probability that the $i$-th sample belongs to the $j$-th class. For the unlabeled samples, the class-probability matrix $P_U \in \mathbb{R}^{n_U \times c}$ can be obtained via label propagation; for the labeled samples, the class-probability matrix $P_S \in \mathbb{R}^{n_S \times c}$ is defined from the given labels. Therefore, the probability that the $i$-th and the $j$-th samples belong to an identical class is written as

$P_{ij} = p_i p_j^T = \sum_{k=1}^{c} p_{ik} p_{jk}$. (5)

As a further step, $P$ can be expressed in block form as $P = \begin{bmatrix} P_{SS} & P_{SU} \\ P_{US} & P_{UU} \end{bmatrix}$, where $P_{SS}$ contains the probabilities that pairs of labeled samples have the same class, $P_{UU}$ contains those for pairs of unlabeled samples, and $P_{US}$ and $P_{SU}$ contain the probabilities that an unlabeled sample and a labeled sample have the same class. Finding the index of the maximum probability in each row of $P_{US}$ yields the labeled sample most similar to each unlabeled one, and hence the pseudo label $Y_U$ of the unlabeled samples. The calculation principle is as follows:

$y_i^U = y_{k_i}^S, \quad k_i = \arg\max_j (P_{US})_{ij}$. (6)
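A compact sketch of this pseudo-labeling step is given below. It assumes the class-probability matrices $P_S$ and $P_U$ have already been obtained (the sparse-coding step solved via ADMAP and the label propagation are omitted), and all names are illustrative.

```python
import numpy as np

def pseudo_labels(P_S, P_U, Y_S):
    """Pseudo labels for unlabeled samples from class-probability matrices.

    P_S: (n_S, c) class probabilities of labeled samples
    P_U: (n_U, c) class probabilities of unlabeled samples (e.g., via label propagation)
    Y_S: (n_S, c) one-hot labels of labeled samples
    """
    # P_ij = sum_k p_ik * p_jk: probability that samples i and j share a class (Equation (5)).
    P_US = P_U @ P_S.T             # block relating unlabeled (rows) to labeled (columns)
    nearest = P_US.argmax(axis=1)  # most similar labeled sample per unlabeled one (Equation (6))
    return Y_S[nearest]            # pseudo labels Y_U (one-hot)
```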

SBLS
The third step of SBLS is to train the SBLS model and obtain the predictive labels of the unlabeled samples, shown as Step 3 in Figure 1. BLS is proposed based on RVFLNN and includes three parts: mapped features (MF, obtained by mapping the inputs), enhancement nodes (EN, obtained by mapping the mapped features), and output labels (obtained by jointly mapping the mapped features and enhancement nodes). The learning parameter is $W_m$, which can be quickly and approximately obtained by ridge regression. However, the BLS model is a supervised method and cannot utilize the vast number of unlabeled samples commonly available in HSI. Hence, to better adapt BLS to HSI classification, it is necessary to investigate a semi-supervised BLS. Here, the CP structure is introduced into BLS and SBLS is proposed to realize semi-supervised classification of HSI.
The HSI samples $X = [X_S; X_U] \in \mathbb{R}^{n \times m}$ under the HGF expression are given, as well as the labels $Y_S$ and the pseudo labels $Y_U$ obtained via the class-probability structure. In SBLS, the input is first mapped to the mapped features with random weights $W_{M_i}$ and biases $\beta_{M_i}$:

$Z_i = \phi_i(X W_{M_i} + \beta_{M_i}), \quad i = 1, \cdots, G_M$, (7)

where $G_M$ is the number of groups of MF and $\phi_i(\cdot)$ is a nonlinear function; different functions can be chosen for different groups of MF. Here, linear mapping is used in all MF for simplicity, i.e., $\phi_i(x) = x$. To obtain better features, $W_{M_i}$ is usually fine-tuned by a linear sparse autoencoder. After obtaining the MF $Z = [Z_1, Z_2, \cdots, Z_{G_M}]$, the expansion of SBLS is realized by mapping the MF to the EN with random weights $W_{E_j}$ and biases $\beta_{E_j}$:

$H_j = \xi_j(Z W_{E_j} + \beta_{E_j}), \quad j = 1, \cdots, G_E$, (8)

where $G_E$ is the number of groups of EN. With $H = [H_1, H_2, \cdots, H_{G_E}]$, the SBLS model is expressed as:

$Y = [Z \,|\, H] W_m$, (9)

where $W_m$ contains the connecting weights from the MF and EN to the output nodes. $W_m$ can be obtained by solving the following problem:

$\min_{W_m} \| [Z \,|\, H] W_m - Y \|_2^2 + \lambda \| W_m \|_2^2$, (10)

where $\lambda$ weights the further constraint on $W_m$. Equation (10) can be solved by ridge regression:

$W_m = ([Z \,|\, H]^T [Z \,|\, H] + \lambda I)^{-1} [Z \,|\, H]^T Y$. (11)

If $\lambda = 0$, Equation (10) degenerates into the least-squares problem. On the other hand, if $\lambda \to \infty$, the solution is heavily constrained and tends to 0. Thus, we set $\lambda$ close to 0 here, such as $2^{-30}$. This gives an approximation to the Moore-Penrose generalized inverse of $[Z \,|\, H]$:

$[Z \,|\, H]^+ = \lim_{\lambda \to 0} ([Z \,|\, H]^T [Z \,|\, H] + \lambda I)^{-1} [Z \,|\, H]^T$, (12)

so that the solution can be written as:

$W_m = [Z \,|\, H]^+ Y$. (13)

Finally, the predictive labels are obtained by

$Y_U = [Z_U \,|\, H_U] W_m$. (14)

In summary, the algorithm steps of HSI classification based on SBLS are shown in Table 1: with the labeled and unlabeled samples as input, compute $Z$, $H$, and $W_m$ according to Equations (7), (8), and (13) using the random parameters $W_M$, $\beta_M$, $W_E$, and $\beta_E$; then output the predictive labels $Y$ via Equation (14).
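The broad-structure training just described reduces to a few matrix operations; a minimal sketch is shown below, assuming stacked inputs $X = [X_S; X_U]$ and one-hot labels $Y = [Y_S; Y_U]$, tanh as the enhancement nonlinearity, and omitting the sparse-autoencoder fine-tuning of $W_M$. All names and sizes are illustrative.

```python
import numpy as np

def train_sbls(X, Y, G_M=10, k_mf=10, G_E=1, k_en=100, lam=2**-30, seed=0):
    """Broad-structure training: mapped features, enhancement nodes, ridge solution."""
    rng = np.random.default_rng(seed)
    # Mapped features (linear mapping, Equation (7)): Z_i = X W_Mi + beta_Mi.
    Z_groups = []
    for _ in range(G_M):
        W_M, beta_M = rng.normal(size=(X.shape[1], k_mf)), rng.normal(size=k_mf)
        Z_groups.append(X @ W_M + beta_M)
    Z = np.hstack(Z_groups)
    # Enhancement nodes (Equation (8)): H_j = tanh(Z W_Ej + beta_Ej).
    H_groups = []
    for _ in range(G_E):
        W_E, beta_E = rng.normal(size=(Z.shape[1], k_en)), rng.normal(size=k_en)
        H_groups.append(np.tanh(Z @ W_E + beta_E))
    H = np.hstack(H_groups)
    # Ridge-regression solution of Equation (10): W_m = (A^T A + lam I)^-1 A^T Y.
    A = np.hstack([Z, H])
    W_m = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return W_m, A @ W_m  # connecting weights and predictive outputs (Equation (14))
```

In a full implementation, the random weights and biases would also be kept so that $[Z_U \,|\, H_U]$ can be rebuilt for new samples at prediction time.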

HSI Datasets
In this section, three real HSI datasets, i.e., Indian Pines, Salinas, and Botswana, are used to evaluate the accuracy and efficiency of the proposed SBLS method. Figure 2 shows the ground-truth maps of the three HSI datasets. For each of the three HSI datasets, 20 samples are randomly selected from each surface-object class as labeled (training) samples, with the remainder used as unlabeled (testing) samples (a minimal sketch of this per-class split follows the list below).
(1) For supervised classification methods, only the labeled samples are used to train the classifier and the trained classifier is used to predict the labels of unlabeled samples.
(2) For semi-supervised classification methods, both labeled and unlabeled samples are used to train the classifier. In addition, since the total size of the Salinas dataset is large, only part of the unlabeled samples participates in the classifier training.
(3) Since the total size of the surface object "Oats" in the Indian Pines dataset is small, the size of its labeled samples (denoted by s.l.s.) equals that of its unlabeled samples (denoted by s.u.s.). The detailed sample settings for the different HSI datasets are shown in Table 2.
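The per-class split described above can be implemented as follows. This is a minimal sketch in which the equal-split fallback for very small classes (such as "Oats") and all names are assumptions; the exact counts are given in Table 2.

```python
import numpy as np

def split_per_class(labels, n_labeled=20, seed=0):
    """Randomly choose n_labeled samples per class as labeled; the rest are unlabeled."""
    rng = np.random.default_rng(seed)
    labeled_idx, unlabeled_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # For very small classes, fall back to an equal labeled/unlabeled split.
        k = min(n_labeled, len(idx) // 2)
        labeled_idx.extend(idx[:k])
        unlabeled_idx.extend(idx[k:])
    return np.array(labeled_idx), np.array(unlabeled_idx)
```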

HSI Datasets
In this section, three real HSI datasets, i.e. Indian Pines, Salinas and Botswana, are used to evaluate the accuracy and efficiency of the proposed SBLS method. Figure 2 shows the ground truth maps of the three HSI datasets. For the three HSI datasets, 20 samples are randomly selected from different surface objects as labeled (training) samples, with the remaining as unlabeled (testing) samples.
(1) For supervised classification methods, only the labeled samples are used to train the classifier and the trained classifier is used to predict the labels of unlabeled samples.
(2) For semi-supervised classification methods, all labeled and unlabeled samples are used to train the classifier. In addition, since the total size of Salinas dataset is large, only part of labeled samples participates in the classifier training.
(3) Since the total size of surface object "Oats" in Indian Pines dataset is small, the size of labeled samples (denoted by s.l.s.) equals that of unlabeled samples (denoted by s.u.s.). The detailed sample settings for different HSI datasets are shown in Table 2.

Comparative Experiments
To evaluate the performance of the proposed SBLS on HSI classification, we investigate the following nine methods for comparison.
(1) Traditional classifiers: SVM [6], ELM [8], and SPELM [9]. Since only the linear feature mapping is used in BLS and SBLS, the linear kernel function is used in SVM and ELM in our experiments. The hyperparameters of SVM, ELM, and SPELM are selected through five-fold cross validation, with the penalty factor of SVM and the regularization coefficients of ELM and SPELM selected from {1, 10, 100, 1000} (a sketch of this selection is given after this list). In addition, HSI data after HGF preprocessing were taken as the input of SVM, ELM, and SPELM for fair comparison. The number of trials of SPELM is set to 50.
(2)-(4) Deep learning-based methods: CNN-PPF, BASS-Net, and R-VCANet.
(5) BLS [20], where HSI data after HGF preprocessing were taken as the input of BLS.
The proposed SBLS and the nine comparative methods are applied to the three HSI datasets for classification. The CNN-PPF and BASS-Net experiments are run on the Theano and Torch platforms with a GTX 980 GPU. The other experiments are performed in MATLAB R2014a on a computer with a 3.60 GHz Intel Core i7-4790 CPU and 16 GB of RAM. Each experiment is conducted five times and the average values are reported to account for the stochastic factors. Tables 3-5 show the classification performance on the different datasets, where five performance indexes are considered: the accuracy on each surface object (%), average accuracy (AA, %), overall accuracy (OA, %), Kappa coefficient, and consumed time (t, s) for classifier training and testing-sample classification.
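As a concrete illustration of the hyperparameter selection mentioned in item (1), the sketch below runs a five-fold cross-validated grid search over the SVM penalty factor; the data variables are random placeholders, not the actual HGF-preprocessed features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for HGF-preprocessed labeled samples (20 per class, 16 classes).
X_labeled = np.random.rand(320, 200)
y_labeled = np.repeat(np.arange(16), 20)

# Five-fold cross validation over the penalty factor, as described above.
search = GridSearchCV(SVC(kernel='linear'), param_grid={'C': [1, 10, 100, 1000]}, cv=5)
search.fit(X_labeled, y_labeled)
print(search.best_params_)  # selected penalty factor
```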
The following can be observed in Tables 3-5:
(1) The AA, OA, and Kappa coefficient of SBLS on the three datasets are the highest. This is because the CP structure is introduced into SBLS, which enables it to make use of the vast number of unlabeled samples, unlike BLS.
(2) ELM has the shortest consumed time, followed by SVM. Apart from SVM and ELM, BLS has the shortest consumed time. This is because the BLS network parameters can be directly solved via the generalized inverse and BLS has a simple network structure.
(3) CNN-PPF, BASS-Net, and R-VCANet have longer consumed times, because these methods belong to deep learning. For BASS-Net, many iteration steps are needed to update the network parameters by gradient descent. For CNN-PPF, to enable the training of a CNN with many layers, the training samples are greatly expanded in number, so it has a longer training time. For R-VCANet, the testing process is time-consuming due to the high dimensionality of the features extracted per layer.
(4) Compared with BLS, SBLS has a longer consumed time, because the correlation computation between samples in the CP structure is time-consuming.
The classification maps on the Indian Pines and Salinas datasets are shown in Figures 3 and 4, respectively. A conclusion consistent with the above can be drawn from Figures 3 and 4: SBLS achieves the best HSI classification results.