Hyperspectral Classification Based on Texture Feature Enhancement and Deep Belief Networks

Abstract: With the success of Deep Belief Networks (DBNs) in computer vision, DBNs have attracted great attention in hyperspectral classification. Many deep-learning-based algorithms focus on deep feature extraction to improve classification. Multi-features, such as texture features, are widely utilized in the classification process to greatly enhance accuracy. In this paper, a novel hyperspectral classification framework based on an optimal DBN and a novel texture feature enhancement (TFE) is proposed. Through band grouping, sample band selection and guided filtering, the texture features of the hyperspectral data are improved. After TFE, the optimal DBN is employed on the reconstructed hyperspectral data for feature extraction and classification. Experimental results demonstrate that the proposed classification framework outperforms several state-of-the-art classification algorithms and achieves outstanding hyperspectral classification performance. Furthermore, the proposed TFE method plays a significant role in improving classification accuracy.


Introduction
Hyperspectral imagery with hundreds of narrow spectral channels provides rich spectral information. With very high spectral resolution, hyperspectral data have been of great interest in many practical applications, such as agriculture, environment, surveillance, and medicine [1][2][3][4]. Hyperspectral classification is a key technique employed in the aforementioned applications. A majority of classification methods have been proposed in the last several decades to distinguish physical objects and classify each pixel into a unique land-cover label, such as maximum likelihood [5], minimum distance [6], K-nearest neighbors [7,8], random forests [9], Bayesian models [10,11], neural networks, etc., and their improvements [12][13][14][15]. Among these supervised classifiers, one of the most important is the kernel-based support vector machine (SVM), which can also be considered a kind of neural network. It can achieve superior hyperspectral classification accuracy by building an optimal hyperplane that best separates the training samples.
In addition, sparse representation based on an over-complete signal dictionary has gained great attention in the literature. Sparse representation-based classification (SRC) [16][17][18] and collaborative representation classification (CRC) [19,20] are proposed from a different perspective: they do not adopt the traditional training-testing fashion. Such classification methods do not need any prior knowledge about the probability density distribution of the data. To further enhance the performance of SRC and CRC, Du and Li [21] utilized a diagonal weight matrix to adaptively adjust the regularization parameter.
To address the Hughes phenomenon in hyperspectral classification, a majority of feature extraction and selection algorithms are utilized to remove redundant features from the original data. To further improve classification performance, multi-features are extracted and employed. For instance, Kang et al. combined spectral and spatial features through a guided filter to process the pixel-wise classification map of each class [22]. Several studies [23][24][25] focused on integrating spatial and spectral information in hyperspectral imagery. In addition, texture features are considered to assist hyperspectral classification [26], and modeling of hyperspectral image textures is significant for classification and material identification.
Recent research has highlighted deep learning with deep neural networks, which can learn high-level features hierarchically. They have demonstrated their potential in image classification, which has also motivated successful applications of deep models to hyperspectral image classification. The classic deep learning method is the convolutional neural network (CNN), which plays a dominant role in visual tasks. The local receptive fields of CNNs can extract spatial-related features at high levels. Fukushima [27] introduced the motivations of CNNs. Ciresan and Lee et al. [28,29] depicted the invariants of CNNs. Chen et al. proposed 2-D CNN and 3-D CNN [30] to capture deep abstract and robust features, yielding superior hyperspectral classification performance. Although CNNs are typically supervised models, a massive training dataset is needed to unlock their full potential. Unfortunately, only a limited number of labeled samples are usually available in hyperspectral imagery. Deep belief networks (DBNs) [31] and stacked autoencoders (SAEs) [32] are also very promising deep learning methods for hyperspectral classification with limited training samples.
In this paper, we mainly investigate the DBN for its suitability and practicality for hyperspectral classification. A novel hyperspectral classification framework is proposed based on an optimal DBN. To acquire desirable performance, we also propose an advanced algorithm to enhance the texture features of hyperspectral imagery. The main contributions of this paper are summarized below.

1. We first propose a band grouping method to separate the bands of hyperspectral data into different band groups. Multiple texture features are used to select a sample band in each band group.

2. We propose a novel algorithm to enhance the texture features of hyperspectral data. We advocate the use of a guided filter to complete the procedure of texture feature enhancement (TFE).

3. An optimal DBN structure is proposed with consideration of learning and deep feature extraction. The learned features are exploited by a Softmax layer to address the classification problem. Furthermore, with enhanced texture features, accurate classification maps can be generated by considering spatial information.
The rest of the paper is organized as follows. Section 2 is a brief description of related work. In Section 3, we detail our proposed DBN model. Datasets and parameter settings are described in Section 4. Experimental results and discussions are presented in Section 5. Section 6 draws the conclusion of this paper.

Related Work
A deep belief network (DBN) is a model that is first pre-trained in an unsupervised way, and then the available labeled training samples are used to fine-tune the pre-trained model through optimizing a cost function defined over the labels of training samples and their predictions.
The original DBN, published in Science [33], uses a generative model in the pre-training procedure and back-propagation in the fine-tuning stage. This is very useful when the number of training samples is limited, as in the case of hyperspectral remote sensing. A DBN can be efficiently trained in an unsupervised, layer-by-layer manner, where the layers are typically made of restricted Boltzmann machines (RBMs). Thus, to explain the structure and theory of the DBN, we first describe its main component, the RBM.

Restricted Boltzmann Machines (RBM)
An RBM generally uses unsupervised learning and can be interpreted as a stochastic neural network. It was originally developed to form a distributed representation. It is a two-layer network composed of visible and hidden units. An RBM only allows full connection between the visible and hidden units; it does not allow connections between two visible units or between two hidden units. Given the visible units, the hidden units can be obtained via a mapping of the visible units, and the activations of the neurons in the hidden layer are independent. Likewise, given the hidden units, the visible units have the same property. A typical RBM structure is depicted in Figure 1.

The visible units can be represented as v, and the hidden units can be expressed as h. The RBM model is a kind of energy-based model in which the joint distribution of the layers can be expressed as a Boltzmann distribution. Energy-based probabilistic models define a probability distribution through an energy function as:

P(v, h) = exp(−E(v, h; θ)) / Z(θ),

where the normalization constant Z(θ) is called the partition function by analogy with physical systems:

Z(θ) = ∑_v ∑_h exp(−E(v, h; θ)).

A joint configuration of the units has an energy given by:

E(v, h; θ) = −∑_{i=1}^{n} a_i v_i − ∑_{j=1}^{m} b_j h_j − ∑_{i=1}^{n} ∑_{j=1}^{m} v_i w_ij h_j,

where θ = {a_i, b_j, w_ij}; w_ij represents the weight connecting visible unit i and hidden unit j; a_i and b_j denote the bias terms of the visible and hidden layers, respectively; n and m are the total numbers of visible and hidden units; and v_i and h_j represent the states of visible unit i and hidden unit j.
Due to the specific structure of RBMs, the visible and hidden units are conditionally independent, as given by:

P(h_j = 1 | v) = σ(b_j + ∑_{i=1}^{n} v_i w_ij),
P(v_i = 1 | h) = σ(a_i + ∑_{j=1}^{m} h_j w_ij),

where σ(·) is the logistic function defined as σ(x) = 1/(1 + e^{−x}). Overall, an RBM has five parameters: h, v, w, a and b, where w, a and b are obtained via learning, v is the input, and h is the output. w, a and b can be learned and updated via the contrastive divergence (CD) method as

Δw_ij = λ(P(h_j = 1 | v) v_i − P(h_j^r = 1 | v^r) v_i^r),
Δa_i = λ(v_i − v_i^r),
Δb_j = λ(P(h_j = 1 | v) − P(h_j^r = 1 | v^r)),

where λ denotes the learning rate, P(h_j^r | v_i^r) represents the reconstructed probability distribution, and v_i^r and h_j^r are the reconstructions of the visible and hidden units, respectively. Once the states of the hidden units are chosen, the visible units can be reconstructed from the hidden units sampled via the Gibbs method. Then, the states of the hidden units are updated through the visible units, so that the hidden units capture the features of the reconstruction. The distribution of the visible units approximates the distribution of the real data. The learning ability of an RBM depends on whether the hidden units contain enough information about the input data.

Deep Belief Learning
The learning ability of a single hidden layer is limited. To capture comprehensive information from the data, the hidden units of one RBM can be fed as the input (visible units) of another RBM. This kind of layer-by-layer learning structure, trained in a greedy manner, forms the so-called Deep Belief Network. In this way, the DBN can extract deep features of image data. The structure of a three-layer DBN is depicted in Figure 2.
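The layer-by-layer scheme above can be sketched in NumPy as follows: CD-1 pre-training of one RBM, whose hidden activations then become the visible data of the next RBM. This is a minimal illustration under placeholder layer sizes, learning rate and epoch count, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(V, m, lam=0.1, epochs=5):
    """Greedy CD-1 pre-training of one RBM. V: (K, n) data; m: hidden units."""
    K, n = V.shape
    W, a, b = 0.01 * rng.standard_normal((n, m)), np.zeros(n), np.zeros(m)
    for _ in range(epochs):
        for v in V:
            ph = sigmoid(b + v @ W)                   # P(h=1 | v), the "up" pass
            h = (rng.random(m) < ph).astype(float)    # sample hidden states
            v_r = sigmoid(a + h @ W.T)                # mean-field reconstruction of v
            ph_r = sigmoid(b + v_r @ W)               # reconstructed hidden probabilities
            # Contrastive-divergence updates for w, a and b
            W += lam * (np.outer(v, ph) - np.outer(v_r, ph_r))
            a += lam * (v - v_r)
            b += lam * (ph - ph_r)
    return W, b, sigmoid(b + V @ W)   # hidden activations feed the next RBM

# Layer-by-layer stacking: RBM 1's hidden layer is RBM 2's visible layer
X = (rng.random((20, 30)) > 0.5).astype(float)
W1, b1, H1 = pretrain_rbm(X, 16)
W2, b2, H2 = pretrain_rbm(H1, 8)
```

Each call trains one layer greedily; fine-tuning of the whole stack (back-propagation) would follow separately.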



The Proposed Framework
To extract more powerful and invariant features, we propose a novel DBN hyperspectral classification algorithm based on TFE. A DBN is composed of several layers of latent factors, which can be deemed the neurons of a neural network. However, the limited training samples in real hyperspectral image classification tasks usually lead to many "dead" (never responding) or "potentially over-tolerant" (always responding) latent factors (neurons) in the trained DBN. Our proposed framework mainly consists of three steps: band grouping and sample band selection, TFE, and DBN-based classification.

Band Grouping and Sample Band Selection
Compared to multispectral imagery, hyperspectral imagery with hundreds of spectral bands has relatively narrow bandwidths, so the correlation between spectral bands needs to be considered. In our framework, we calculate all the pairwise correlation coefficients of the bands and then utilize the correlations between adjacent bands. The spectral correlation coefficients of the different datasets are depicted in Figure 3.

We can obtain the correlation coefficient between adjacent bands as:

ρ(B_i, B_j) = cov(B_i, B_j) / sqrt(var(B_i) var(B_j)),   (9)

where cov is the covariance and var is the variance. B_i and B_j represent the i-th and j-th band channels, respectively, with i = 1, 2, ..., L − 1. Here, L denotes the number of bands of the hyperspectral dataset.
Based on Equation (9), the correlation coefficients between adjacent bands in the different datasets are calculated, as shown in Figure 4. We can see that the highest correlation coefficient in Indian Pines is 0.9997, and the lowest is 0.0686. The spectral bands of the University of Pavia have strong correlations overall, where the highest correlation coefficient is 0.9998 and the lowest is 0.9294. The highest correlation coefficient in Salinas is 0.9999, and the lowest is 0.5856.
Here, we design an algorithm for grouping bands rationally. Firstly, we calculate the average correlation coefficient of the adjacent bands, denoted as C, which is utilized as the threshold in the following steps. It can be calculated through:

C = (1 / (L − 1)) ∑_{i=1}^{L−1} ρ(B_i, B_j),

where j = i + 1. If the correlation coefficient of two adjacent bands is greater than C, these two bands are considered to have strong correlation.
Second, we search for local minimum values among the correlation coefficients of adjacent bands, denoted as ρ_min, where ρ_min = {ρ_i,j | ρ_i,j ≤ ρ_i+1,j+1 and ρ_i,j ≤ ρ_i−1,j−1}. All the elements in ρ_min are compared with C. If the inequality {ρ_i,j ∈ ρ_min} < C is satisfied, it indicates that the correlation between the i-th band and the j-th band is lower than the average correlation value, and the correlation between these two bands is considered weak. Then, the corresponding index pair {i, j} is recorded and added to the set ρ_Loc.
Third, band grouping depends on the stored index pairs in ρ_Loc. For instance, with regard to index pair {i, j}, the i-th band is set as the end band of the former band group and the j-th band is set as the first band of the next band group. Thus, based on the aforementioned rules, all the bands are divided into different band groups. After dividing all the bands of the hyperspectral dataset into different band groups, a sample band with the strongest and clearest texture features is searched for and selected from each group.
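The three grouping steps above (adjacent-band correlations, the average threshold C, and cuts at local minima below the threshold) can be sketched as follows, assuming a NumPy cube of shape (H, W, L); the function name and details are illustrative, not the authors' code:

```python
import numpy as np

def band_grouping(cube, threshold=None):
    """cube: (H, W, L) hyperspectral array. Returns a list of band-index groups."""
    H, W, L = cube.shape
    X = cube.reshape(-1, L).astype(float)
    # Correlation coefficients between adjacent bands (Equation (9))
    rho = np.array([np.corrcoef(X[:, i], X[:, i + 1])[0, 1] for i in range(L - 1)])
    if threshold is None:
        threshold = rho.mean()        # average adjacent correlation, used as C
    # Local minima of the adjacent-correlation curve that fall below the threshold
    cuts = [i for i in range(1, L - 2)
            if rho[i] <= rho[i - 1] and rho[i] <= rho[i + 1] and rho[i] < threshold]
    # Split at the recorded positions: band i ends a group, band i + 1 starts the next
    groups, start = [], 0
    for i in cuts:
        groups.append(list(range(start, i + 1)))
        start = i + 1
    groups.append(list(range(start, L)))
    return groups
```

On data with two spectrally homogeneous blocks, the weakly correlated boundary between them becomes the single cut point.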
To extract texture features, the gray-level co-occurrence matrix (GLCM) has been employed successfully. The GLCM [34] is defined as a matrix of frequencies which can extract second-order statistics from a hyperspectral image. The distribution in the matrix depends on the angular and distance relationship between pixels. After the GLCM is created, it can be used to compute various features. We choose the five most commonly used features in Table 1 to select a sample band from each band group. The texture feature score of each band can be calculated by Equation (11).

Table 1. Features calculated from the normalized co-occurrence matrix P(i, j).

The sample band in each band group can be selected through

g_k = argmax_{B_{l_k} ∈ G_k} Score(B_{l_k}),

where G_k represents the k-th band group of the dataset, Score(·) denotes the texture feature score calculated by Equation (11), and B_{l_k} represents the l_k-th band in the k-th band group. Finally, the sample band set is comprised of {g_1, g_2, ..., g_k}.
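A minimal NumPy sketch of GLCM-based sample-band selection: the feature set below (contrast, energy, homogeneity, entropy) is a stand-in for the five features of Table 1, and the quantization level and offset are illustrative assumptions:

```python
import numpy as np

def glcm(img, levels=8, offset=(0, 1)):
    """Normalized gray-level co-occurrence matrix of one quantized band."""
    q = np.floor((img - img.min()) / (np.ptp(img) + 1e-12) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    dr, dc = offset
    P = np.zeros((levels, levels))
    H, W = q.shape
    # Count co-occurring gray levels at the given pixel offset
    for r in range(max(0, -dr), min(H, H - dr)):
        for c in range(max(0, -dc), min(W, W - dc)):
            P[q[r, c], q[r + dr, c + dc]] += 1
    return P / P.sum()

def texture_score(img):
    """Sum of common GLCM features (hypothetical stand-ins for Table 1)."""
    P = glcm(img)
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)
    energy = np.sum(P ** 2)
    homogeneity = np.sum(P / (1.0 + np.abs(i - j)))
    entropy = -np.sum(P[P > 0] * np.log(P[P > 0]))
    return contrast + energy + homogeneity + entropy

def select_sample_band(group_cube):
    """group_cube: (H, W, n_k). Returns the index of the highest-scoring band."""
    return int(np.argmax([texture_score(group_cube[:, :, l])
                          for l in range(group_cube.shape[2])]))
```

A strongly textured band (e.g., a checkerboard) scores far higher than a flat one, so it is the one chosen as the group's sample band.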

Texture Feature Enhancement
As an effective edge-preserving filter, the guided filter (GF) was proposed by He in 2012. It can enhance the detail of an image. Texture features are an important kind of spatial characteristic and have a long history in image processing. In this paper, we utilize the GF in each band group to enhance the texture features of the image.
The general guided image filtering was designed for gray-scale or color images, and it is easy to extend to multi-channel images. Firstly, the guidance image in our proposed framework is a multi-channel image, denoted as I^M, which is comprised of copies of the band with the strongest texture features in each band group. We assume q^M is a linear transform of I^M in a window ω_k centered at pixel k, and the multi-channel guided filter model can be expressed as

q_i^M = (a_k^M)^T I_i^M + b_k^M,  ∀ i ∈ ω_k,

where I_i^M is a C × 1 vector, C is the channel number of the input image, a_k^M is a C × 1 coefficient vector, and q_i^M and b_k^M are scalars. The guided filter for a multi-channel guidance image becomes

a_k^M = (Σ_k + εU)^{−1} ((1/|ω|) ∑_{i ∈ ω_k} I_i^M p_i^M − μ_k p̄_k^M),
b_k^M = p̄_k^M − (a_k^M)^T μ_k,

where p^M denotes a filtering input image which is given beforehand according to the application, Σ_k is the C × C covariance matrix of I^M in ω_k, U is the C × C identity matrix, ε is a regularization parameter, μ_k is the mean of I^M in ω_k, p̄_k^M is the mean of p^M in ω_k, and |ω| represents the number of pixels in ω_k.
Then, the extended guided image filtering for multi-channel images is applied to each band group. For instance, each channel of the guidance image I^M in Equation (14) for the k-th band group G_k is a copy of the sample band g_k selected previously.
After guided filtering for all groups is completed, the output bands are restored to a hyperspectral image cube according to the band number. Finally, the reconstructed image data with enhanced texture features are obtained through the aforementioned steps. Figure 5 demonstrates the procedure of band grouping and TFE. We can see that, after the sample bands with the strongest textures are obtained, the reconstructed image data with enhanced texture features can be achieved through the GF process.
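The TFE step can be sketched with a single-channel (gray-guidance) guided filter, a simplification of the multi-channel form used in the paper; the radius and ε values are placeholders and `enhance_group` is a hypothetical helper:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, radius=2, eps=1e-3):
    """Gray-guidance guided filter (He et al.): q = a*I + b per local window."""
    size = 2 * radius + 1
    mean = lambda x: uniform_filter(x, size=size, mode='reflect')
    mu_I, mu_p = mean(I), mean(p)
    cov_Ip = mean(I * p) - mu_I * mu_p
    var_I = mean(I * I) - mu_I ** 2
    a = cov_Ip / (var_I + eps)        # linear coefficients per window
    b = mu_p - a * mu_I
    return mean(a) * I + mean(b)      # average coefficients over overlapping windows

def enhance_group(group_cube, sample_idx):
    """Filter every band of a group, guided by the group's sample band."""
    g = group_cube[:, :, sample_idx].astype(float)
    return np.stack([guided_filter(g, group_cube[:, :, l].astype(float))
                     for l in range(group_cube.shape[2])], axis=-1)
```

Because the guidance is the group's strongest-texture band, the edges and textures of that band are transferred into every filtered band of the group.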

DBN Classification Model
In this section, a DBN-based framework for hyperspectral classification with feature enhanced data is developed.
Spectral information is the most significant and direct feature, and can be directly utilized for classification. Existing methods, such as SVM and KNN, can extract spectral features, but not deeply enough. Therefore, only a deep architecture can make full use of the texture-enhanced hyperspectral image characteristics. However, as the training samples are limited, the overfitting problem often occurs if the network is too deep, so we advocate a novel DBN framework which has only two hidden layers (Figure 6).
The input data consist of training samples that are one-dimensional (1-D) vectors, and each pixel of a training sample is collected from the texture-enhanced HSI data. For ease of description, the first hidden layer is denoted as h_1 and the second as h_2. The first layer is learned to extract features from the input data, and the learned features are preserved in h_1. Then, to pursue refined and abstract features, the features contained in h_1 are used as the visible data of the second layer, and h_2 keeps the refined features. This procedure is generally called recursive greedy learning for pre-training a DBN.
In practice, learning each layer is often performed through n-step CD, and the weights are updated using Equations (6)-(8).
To fine-tune the DBN and accomplish classification, a Softmax layer is added to the end of the network. Now, let X = {x_1, x_2, ..., x_K} be a set of training samples and Y = {y_1, y_2, ..., y_K} be the corresponding labels, where x_k = [x_k1, x_k2, ..., x_kL]^T is the spectral signature of the k-th sample with L bands. Utilizing the maximum likelihood method, the objective function can be written as

J(θ) = ∑_{k=1}^{K} log P(y_k | x_k; θ),

where K is the number of training samples, P(y_k | x_k; θ) means the distribution of y_k given x_k with the parameters θ of the Softmax layer, and S_{y_k}(x_k, θ) denotes the output of the Softmax layer for the k-th training sample, that is

S_{y_k}(x_k, θ) = exp(θ_{y_k}^T h^{H_L}) / ∑_{n=1}^{M} exp(θ_n^T h^{H_L}),

where H_L is the number of hidden layers, which is set to 2 in our proposed framework, and M is the number of classes. θ_m and θ_n are the parameter vectors for the m-th and n-th units of the Softmax layer, respectively. h^{H_L} is the output of the H_L-th hidden layer, calculated from the input data and the weights and biases from the first layer to the H_L-th hidden layer. To optimize the objective function, the stochastic gradient descent (SGD) algorithm is used. Finally, the label of each testing pixel is determined via the weights and biases from the aforementioned steps.
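The Softmax output and one SGD fine-tuning step can be sketched as below, treating the last hidden layer's activations h as given; the toy dimensions, learning rate and function names are illustrative, not the authors' implementation:

```python
import numpy as np

def softmax_predict(h, theta):
    """h: (K, D) last-hidden-layer activations; theta: (D, M) Softmax parameters."""
    logits = h @ theta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    probs = e / e.sum(axis=1, keepdims=True)      # S_m(x_k) for every class m
    return probs, probs.argmax(axis=1)

def sgd_step(h, y, theta, lr=0.5):
    """One stochastic-gradient step on the negative log-likelihood."""
    probs, _ = softmax_predict(h, theta)
    onehot = np.eye(theta.shape[1])[y]
    grad = h.T @ (probs - onehot) / len(y)        # cross-entropy gradient
    return theta - lr * grad
```

In the full framework, h would come from forward-propagating the texture-enhanced pixels through the two pre-trained hidden layers, and SGD would also back-propagate into those layers.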

Datasets
In this section, three typical hyperspectral datasets, namely Indian Pines, University of Pavia and Salinas, are employed to compare the proposed DBN classification method with other state-of-the-art methods. In these experiments, we randomly select 300 labeled pixels per class for training, of which 20 samples are utilized for validation. The remaining labeled pixels are used for testing. Furthermore, each pixel is uniformly scaled to the range of −1 to 1.
The first dataset is Indian Pines, which was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in northwestern Indiana. There are 220 spectral channels in the 0.4 to 2.45 µm region with a spatial resolution of 20 m. It consists of 145 × 145 pixels with 200 bands after removing 20 noisy and water absorption bands. We employ the 8 largest classes in this experiment. The numbers of training and testing samples are listed in Table 2. The second dataset, with 610 × 340 pixels, is the University of Pavia, which was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) during a flight campaign over Pavia, northern Italy. The ROSIS sensor covers 115 spectral bands from 0.43 to 0.86 µm, and the geometric resolution is 1.3 m. Each pixel has 103 bands after discarding bad bands. There are 9 ground-truth classes, with the numbers of labeled samples shown in Table 3. The third experiment is on the Salinas dataset, which was also collected by the AVIRIS sensor, capturing an area over Salinas Valley, California, with a spatial resolution of 3.7 m. The area comprises 512 × 217 pixels with 204 bands after removing noisy and water absorption bands. It mainly contains vegetables, bare soils, and vineyard fields. There are 16 different ground-truth classes, and the numbers of training and testing samples are listed in Table 4.
Our experiments are implemented in Matlab 2015b (MathWorks, Massachusetts, USA). The CPU employed is an Intel Core i5-3470 with a base frequency of 3.20 GHz. The operating system is 64-bit Windows 7.

Parameters Tuning and Analysis
In our proposed framework, several parameters need to be adjusted: the number of hidden units, the learning rate, the max epoch, and the number of hidden layers. In this section, some tuning experiment results are listed for selecting proper values. Both the number of hidden layers and the number of hidden units per layer play an important role in classification performance. A suitable number of hidden layers and neurons can make full use of the texture-enhanced hyperspectral data without over-training, and can support a fitting mapping from the original hyperspectral data to hyperspectral features. In the training process of a DBN, the learning rate controls the pace of learning: a too-large learning rate will lead to unstable training output, and a too-small learning rate will lead to a longer training process. Therefore, an appropriate learning rate can expedite the training procedure while retaining satisfactory performance.
In Figure 7, we can see that our proposed framework achieves the best classification accuracy with 200 hidden neurons in each hidden layer, demonstrating that 200 is a suitable number of hidden neurons. Figure 8 depicts the relationship between accuracy and the learning rate. It can be seen that learning rate values from 0.15 to 0.2 obtain better performance. Therefore, we select 0.15 for the first RBM and 0.2 for the second RBM. To determine the max epoch, we set the range of max epoch from 50 to 500. Figure 9 demonstrates that, when the max epoch reaches 300, our proposed framework achieves its best classification performance. Consequently, the max epoch is set to 300. Table 5 lists the accuracies achieved with different numbers of hidden layers in the DBN. When employing two hidden layers, the classification performance of the DBN achieves superior results. Thus, in our proposed framework, we set the number of hidden layers to 2.
In our paper, we utilize the graycomatrix function in Matlab to calculate the GLCM. The parameters used in the experiments are "NumLevels" and "Offset", which are set to 8 and [0,

Evaluation Criteria
The evaluation criteria used in our paper are overall accuracy (OA), average accuracy (AA), precision, and Kappa. In particular, OA, precision, and Kappa are highlighted for assessment of the proposed framework.
where p is the number of classes, N is the total number of hyperspectral image samples with N = ∑_{i=1}^{p} n_i, n_ii is the number of samples of the i-th class classified into the i-th class, and n_ji is the number of samples of the i-th class classified into the j-th class.

We also adopt the nonparametric McNemar's test, based on the standardized normal test statistic, to evaluate the statistical significance of the improvement in OA between different hyperspectral classification algorithms. The McNemar's test statistic for two algorithms, denoted Algorithm 1 and Algorithm 2, can be calculated as [36]:

z = (f_12 − f_21) / √(f_12 + f_21),

where f_12 denotes the number of samples misclassified by Algorithm 2 but not by Algorithm 1, and f_21 the number of samples misclassified by Algorithm 1 but not by Algorithm 2. |z| is the absolute value of z; for the 5% level of significance, the critical |z| value is 1.96. If a |z| value is greater than this quantity, the two classification algorithms have a significant discrepancy.
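The criteria above can be computed directly from a p-by-p confusion matrix. The sketch below is illustrative: the function names and the toy 2-class matrix are invented for the example, and the McNemar statistic follows the standardized form z = (f12 − f21)/√(f12 + f21).

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Kappa from a p-by-p confusion matrix whose entry
    [j, i] counts class-i samples assigned to class j."""
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()                                  # total sample count
    oa = np.trace(conf) / N                         # overall accuracy
    aa = (np.diag(conf) / conf.sum(axis=0)).mean()  # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / N ** 2
    kappa = (oa - pe) / (1 - pe)                    # chance-corrected agreement
    return oa, aa, kappa

def mcnemar_z(f12, f21):
    """McNemar's standardized test statistic."""
    return (f12 - f21) / np.sqrt(f12 + f21)

oa, aa, kappa = classification_metrics([[48, 2], [4, 46]])
z = mcnemar_z(30, 10)  # |z| > 1.96 would indicate significance at 5%
```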

Experimental Results and Discussion
In this section, the proposed TFE and the novel classification framework will be evaluated and the relevant results will be summarized and discussed in detail.

Compared Methods and Band Groups
To analyze and evaluate our proposed algorithm, which combines the TFE and the optimal DBN efficiently, existing algorithms such as SVM with a Radial Basis Function kernel (SVM-RBF), the Radial Basis Function neural network (RBFNN), and CNN are employed for comparison purposes. Besides, we also compare with a state-of-the-art spectral-spatial algorithm called EPF-G-c [22]. All these algorithms are widely used with excellent performance in hyperspectral image classification tasks, especially EPF-G-c. In addition, to evaluate our proposed texture feature enhancement (TFE) algorithm, we also apply the TFE algorithm to the traditional SVM-RBF and RBFNN. All experiments are repeated 10 times, and the average classification results are reported for comparison.

According to our proposed band grouping solution, the bands of Indian Pines can be divided into 41 groups: 1, 2, 3, 4-17, 18, 19-33, 34, 35, 36, …

In hyperspectral classification, some spectra of the hyperspectral image are distorted by imaging noise or low spatial resolution, especially at border pixels; therefore, the difficulty of hyperspectral classification primarily lies in correctly classifying these border pixels. In Figure 11, it can be seen that, by utilizing TFE, the reconstructed border pixels become distinct from the original border pixels, while the reconstructed inner pixels remain nearly the same as the original inner pixels, which implies that TFE plays an important role for border pixels. TFE makes border pixels more distinctive and more similar to their original spectra. Hence, the texture features of the hyperspectral image become more obvious and clear. Consequently, pixels that were difficult to distinguish can be recognized more easily with the clearer texture features. In other words, TFE has a positive effect on hyperspectral classification performance.
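The band grouping idea can be sketched as follows: a new group is started whenever the correlation between adjacent bands drops. This is a simplified stand-in for the paper's grouping rule; the threshold value and the toy cube are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def group_bands(cube, threshold=0.9):
    """Split spectral bands into contiguous groups, starting a new group
    whenever the correlation between adjacent bands falls below threshold.
    cube has shape (height, width, bands)."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)          # one column of pixel values per band
    groups, current = [], [0]
    for i in range(1, b):
        r = np.corrcoef(flat[:, i - 1], flat[:, i])[0, 1]
        if r >= threshold:
            current.append(i)           # still correlated: extend the group
        else:
            groups.append(current)      # correlation dropped: close the group
            current = [i]
    groups.append(current)
    return groups

# Toy cube: bands 0-2 are linear transforms of one another (correlation 1),
# band 3 is spatially reversed (correlation -1 with its neighbor).
x = np.linspace(0, 1, 16).reshape(4, 4)
cube = np.stack([x, 2 * x + 1, x + 0.5, x[::-1, ::-1]], axis=-1)
groups = group_bands(cube)
```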

Discussion on Classification Results and Statistical Test
Table 6 provides the classification performance on Indian Pines achieved by different classification algorithms: SVM, RBFNN, the optimal DBN (O_DBN), SVM combined with TFE (SVM_TFE), RBFNN combined with TFE (RBFNN_TFE), CNN, EPF-G-c, and our proposed framework. O_DBN denotes the optimal DBN we proposed but without TFE. SVM_TFE and RBFNN_TFE are the two algorithms combined with the TFE method. The classification accuracy of each class is also listed in this table. In Table 6, we can see that our proposed framework obtains superior performance compared with the other algorithms. Meanwhile, the optimal DBN has the best classification accuracy among the algorithms without TFE, such as SVM and RBFNN. Although EPF-G-c is an outstanding spectral-spatial hyperspectral classification algorithm, our proposed framework utilizing TFE still achieves slightly better classification accuracy. Besides, SVM_TFE and RBFNN_TFE outperform SVM and RBFNN, respectively: the OA of SVM_TFE is 5.06% greater than that of SVM, and the OA of RBFNN_TFE is 8.97% higher than that of RBFNN. Compared with O_DBN, the OA obtained via our proposed framework improves by 8.08% and the Kappa increases by 9.98%. All these facts indicate the effectiveness of TFE and demonstrate that our proposed framework and TFE perform well on Indian Pines.

It can also be seen from Tables 8 and 10 that our proposed framework has better performance than the other classification methods. In particular, all algorithms that integrate TFE outperform those without TFE. By employing TFE, the performance of SVM increases by 5.78% on University of Pavia and 1.75% on Salinas, while the performance of RBFNN improves by 6.8% on University of Pavia and 1.55% on Salinas. The OA achieved by the proposed framework is 6.55% higher than that of the optimal DBN on University of Pavia and 3.94% higher on Salinas. Furthermore, the proposed classification framework performs better than CNN and EPF-G-c. As for the kappa coefficients, our proposed framework shows better consistency. A possible reason is that the ability of our proposed framework, as a deep network, to extract high-level features of the data is stronger than that of the RBFNN and the SVM, as shallow networks; thus, the description ability of our proposed framework is more stable. In Tables 9 and 11, the precisions obtained through our proposed model on the different datasets are better than those achieved via the other algorithms. Furthermore, our proposed TFE has a positive effect on classification accuracy. Note: the 5% significance level is selected.

Conclusions
In this paper, we investigate a novel hyperspectral classification framework based on an optimal DBN algorithm. In our proposed framework, we develop a new TFE algorithm that employs multi-texture features and a band grouping method. The resulting classification framework offers better classification accuracy than other classic algorithms. To further test our proposed TFE algorithm, a series of experiments combining state-of-the-art algorithms with the TFE algorithm is applied to the three classic hyperspectral datasets. Experimental results demonstrate that the algorithms with TFE outperform those without TFE, which implies that our proposed TFE can play an important role in improving hyperspectral classification performance. We believe that the proposed hyperspectral classification framework based on the optimal DBN and TFE is well suited to processing hyperspectral data in practical applications when training samples are limited.

Figure 1 .
Figure 1. Architecture of Restricted Boltzmann Machines, where w_ij is the weight connecting the visible unit i and the hidden unit j; a_i and b_j denote the bias terms of the visible and hidden layers, respectively; n and m are the total numbers of visible and hidden units; and v_i and h_j represent the states of visible unit i and hidden unit j.
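The quantities in this caption fit together in the standard RBM energy function E(v, h) = −∑_i a_i v_i − ∑_j b_j h_j − ∑_{i,j} v_i w_ij h_j. A minimal numeric sketch, with illustrative toy sizes and randomly initialized weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                       # numbers of visible and hidden units
W = rng.normal(0, 0.1, (n, m))    # w_ij: weight between v_i and h_j
a = np.zeros(n)                   # visible bias terms a_i
b = np.zeros(m)                   # hidden bias terms b_j

def energy(v, h):
    """RBM energy: -a.v - b.h - v W h."""
    return -(a @ v) - (b @ h) - v @ W @ h

def hidden_probs(v):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)."""
    return 1.0 / (1.0 + np.exp(-(b + v @ W)))

v = rng.integers(0, 2, n).astype(float)     # a binary visible state
h = (hidden_probs(v) > 0.5).astype(float)   # a thresholded hidden state
```

Contrastive-divergence training lowers this energy for observed visible states, which is what the DBN's layer-wise pre-training exploits.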

Figure 2 .
Figure 2.An illustration of three-layer DBN with logistic regression.


Figure 3 .
Figure 3.The maps of correlation coefficients of spectral bands in different datasets: (a) Indian Pines; (b) University of Pavia; and (c) Salinas.

Figure 4 .
Figure 4.The correlation coefficients of adjacent spectral bands in different datasets: (a) Indian Pines; (b) University of Pavia; and (c) Salinas.


Figure 5 .
Figure 5. The procedure of band grouping and texture feature enhancement.


Figure 6 .
Figure 6.Our proposed DBN network for classification.
P(y_k | x_k; θ) denotes the distribution of y_k given x_k with the parameters θ of the Softmax layer, and S_{y_k}(x_k, θ) denotes the output of the Softmax layer for the k-th training sample, that is


Figure 7.
Figure 7.The relationship between accuracies and the number of hidden units in different datasets: (a) Indian Pines; (b) University of Pavia; and (c) Salinas.

Figure 8.
Figure 8. The relationship between accuracies and the learning rates in different datasets: (a) Indian Pines; (b) University of Pavia; and (c) Salinas.

Figure 9 .
Figure 9. The relationship between accuracies and the max epoch in different datasets: (a) Indian Pines; (b) University of Pavia; and (c) Salinas.

Figure 10
Figure 10 demonstrates a p-class confusion matrix. Based on Figure 10, AA and precision can be derived as [35]


5.2.
Figure 11 demonstrates the reconstructions of border and inner pixels of four classes after TFE in the Indian Pines dataset. The first image of each row depicts the locations of the border and inner pixels. The reconstruction and reconstruction error of the border pixel are shown in the second image of each row, while those of the inner pixel are shown in the third image.


Figure 11 .
Figure 11. The reconstructions of the border pixels and inner pixels of different classes in Indian Pines. The first row is the reconstruction information of Class 2, the second row of Class 4, the third row of Class 6, and the last row of Class 8.


Figures 12-14.
Figures 12-14 demonstrate the classification maps obtained on Indian Pines, University of Pavia, and Salinas, respectively. Clearly, the classification maps achieved by our proposed framework are the smoothest and clearest. The classification accuracy of border pixels in these datasets is improved greatly, and the boundaries of different classes are more distinct. Compared to the other classification algorithms, the results of our proposed framework are better because they contain less salt-and-pepper noise.


Table 2 .
Number of training and testing samples used in the Indian Pines dataset.

Table 3 .
Number of training and testing samples used in the University of Pavia dataset.

Table 4 .
Number of training and testing samples used in the Salinas dataset.

Table 5 .
The accuracies obtained via different numbers of hidden layers in DBN.


Table 6 .
Classification accuracy of different algorithms on Indian Pines.

Table 7 lists the classification precision achieved via these different classification algorithms. In Table 7, we can see that the precision of our proposed algorithm outperforms SVM, RBFNN, O_DBN, SVM_TFE, RBFNN_TFE, CNN, and EPF-G-c. In addition, the methods associated with TFE have better classification precision than those without TFE.

Table 7 .
Classification precision of different algorithms on Indian Pines.

Tables 8 and 10 present the classification accuracy acquired via different algorithms on the University of Pavia and Salinas datasets. Meanwhile, Tables 9 and 11 list the precisions obtained through our proposed model and the other classification algorithms on these datasets.

Table 8 .
Classification accuracy of different algorithms on University of Pavia.

Table 9 .
Classification precision of different algorithms on University of Pavia.

Table 10 .
Classification accuracy of different algorithms on Salinas Dataset.

Table 11 .
Classification precision of different algorithms on Salinas Dataset.