Convolutional Neural Networks Based Hyperspectral Image Classification Method with Adaptive Kernels

Abstract: Hyperspectral image (HSI) classification aims at assigning each pixel a pre-defined class label, which underpins many vision-related applications such as remote sensing, mineral exploration and ground object identification. Many classification methods have thus been proposed for better hyperspectral imagery interpretation. Following the success of convolutional neural networks (CNNs) in classification tasks on traditional images, considerable effort has been made to leverage CNNs to improve HSI classification. One advanced CNNs architecture uses kernels generated by a clustering method; for example, the K-means network uses K-means to generate its kernels. However, such kernels are often obtained heuristically (e.g., the number of kernels must be assigned manually), and how to determine the number of convolutional kernels (i.e., filters) data-adaptively, and thus generate kernels that better represent the data, is seldom studied in existing CNNs based HSI classification methods. In this study, we propose a new CNNs based HSI classification method in which the convolutional kernels are learned automatically from the data through clustering, without knowing the cluster number in advance. With these data-adaptive kernels, the proposed CNNs method achieves better classification results. Experimental results on three datasets demonstrate the effectiveness of the proposed method.


Introduction
Unlike traditional images (e.g., RGB images), a hyperspectral image (HSI) contains a continuous spectrum at each pixel, which is beneficial for identifying different imaged land covers. With such abundant spectral information, HSI classification, which aims at assigning each pixel a pre-defined class label, has facilitated various applications such as mineral exploration, ground object identification, agricultural surveys and geological monitoring. Therefore, considerable effort has been devoted to HSI classification. According to the features utilized, HSI classification methods can be roughly divided into hand-crafted feature based methods and deep learning feature based methods. A detailed review is given in Section 2. In hand-crafted feature based methods, the HSI is often represented by manually designed features [1][2][3][4][5][6][7]. However, due to their shallow structure, the representation ability of such features is limited, especially for HSIs, which often exhibit high nonlinearity arising from their high dimensionality and from mixed pixels. In contrast, deep learning feature based methods can automatically extract features from training data with deep architectures. Such deep features have been shown to perform well in representing the complicated nonlinearity of the data, which has promoted the development of deep learning feature based HSI classification methods in recent years [8][9][10][11][12].
Since the convolutional kernels must be updated throughout network training, traditional deep learning based methods consume considerable training time. To address this problem, an advanced CNNs architecture has been proposed recently, which adopts kernels pre-learned by clustering the training data and does not update them during training. One typical method is the K-means Net proposed in [13], where each CNNs kernel is learned from a specific cluster obtained by running the K-means algorithm on the training data. Nevertheless, the cluster number K (i.e., the number of kernels in the CNNs) of K-means Net must be assigned empirically, which limits the representational power of the CNNs. Specifically, a different manually chosen number K of kernels in the convolutional layer changes the structure of the CNNs and thus influences its output. In addition, the number K should adapt to different images and tasks. Therefore, choosing a proper number of kernels data-adaptively is crucial for representing data characteristics with CNNs. However, most existing CNNs based HSI classification methods do not give sufficient consideration to this problem.
In this study, we propose an MCFSFDP based CNNs framework for HSI classification. First, inspired by clustering by fast search and find of density peaks (CFSFDP) [14], a novel clustering method, named modified clustering by fast search and find of density peaks (MCFSFDP), is proposed to learn a specific number of kernels from the training data in a data-adaptive way. The convolutional kernels are determined automatically by the center of each cluster and the inter-cluster margin, which ensures that the pre-learned kernels suit the data structure. Then, the CNNs framework with those pre-learned convolutional kernels is employed to classify each pixel in the HSI. Extensive experimental results demonstrate that the proposed method outperforms several state-of-the-art CNNs based methods in classification accuracy.
In summary, the proposed CNNs framework has two key advantages: (1) a specific number of convolutional kernels can be learned data-adaptively from the training data, so that the kernels represent the data characteristics well; and (2) the MCFSFDP based CNNs framework is effective for HSI classification.

Related Work
Based on the features adopted for classification, HSI classification methods can be roughly divided into two categories: hand-crafted feature based methods and deep learning feature based methods.

Hand-Crafted Feature Based Methods
Linear features extracted by principal component analysis (PCA) [15] and partial least squares (PLS) [16] have been applied to classify HSI data. Kernel methods were further developed to exploit the nonlinear features of HSI [17]. To depict the spatial texture of an image, wavelet transform (WT) methods [18,19] have been widely used; they operate at different scales and perform effectively for classification of high spatial resolution remotely sensed (HSRRS) data. Considering the complicated spatial correlation, some Gaussian Markov Random Field (GMRF) [20,21] methods have been proposed to model such correlation within a graph structure. In [22], a spatial feature index that measures the gray similarity distance in every direction is used to describe the shape feature in the local area surrounding a pixel in the HSI. An adaptive mean-shift (MS) analysis framework [2] was proposed for object extraction and classification of HSI over urban areas, which is able to obtain an object-oriented representation of HSI data. Li et al. [3] integrate the spectral and spatial information in a Bayesian framework, which utilizes a Multinomial Logistic Regression (MLR) algorithm to learn the posterior probability distributions from the spectral information. In addition, this method uses subspace projection to better characterize noise, highly mixed pixels and contextual information. In [4], a mathematical morphology (MM) based method is utilized to process the HSI data. In this approach, opening and closing morphological transforms are used to isolate bright (opening) and dark (closing) structures in images, where bright/dark means brighter/darker than the surrounding features in the images. To model different kinds of structural information, morphological attribute profiles (APs) are adopted to provide a multi-level characterization of an image created by the sequential application of morphological attribute filters [23]. Based on the Gray Level Co-occurrence Matrix (GLCM), Zortea et al. attempt to extract the contextual information of images by concatenating the spectral features used for classification [1]. To improve the classification result of HSI, the Edge-Aware Filtering (EAF) and Edge-Preserving Filtering (EPF) methods are proposed in [24,25]. Based on the EPF method, a spectral-spatial classification framework was proposed in [25], which can significantly enhance the classification accuracy. Kang et al. propose combining a recursion with image fusion to enhance the image classification accuracy [26]. Recently, the Bag-of-Words (BOW) model has shown promise for the remote sensing imagery classification problem. In the BOW model, images are represented by the frequency of visual words that are constructed by quantizing local features with a clustering method such as K-means [27,28]. Owing to their capacity for extracting hand-crafted local features, such as local structural points, color histograms and texture features [29,30], BOW based methods perform well. Manifold regularized kernel logistic regression (KLR) has been proposed to solve multi-view image classification [31]. To integrate different levels of features for saliency detection, Wang et al. [32] propose a multiple-instance learning based framework that fuses low-level, mid-level and high-level features into a unified model. While effective, the representation capacity of methods based on manual feature extraction is limited.

Deep Learning Feature Based Methods
Recently, with the development of deep learning technology, many deep learning based methods have been developed for image classification, such as the deep belief network (DBN) and the stacked auto-encoder (SAE). The DBN and SAE are unsupervised learning methods that are also used for spectral-spatial classification of hyperspectral data without using the label information [9,33]. The concept of deep learning was introduced into hyperspectral data classification for the first time in [9]. The Canonical Correlation Analysis Network is useful for multi-view image classification [34]. With the development of convolutional neural networks (CNNs) [35], which have been widely applied to image processing with spectacular results, more and more deep CNNs frameworks have emerged, such as AlexNet [36], VGGNet [37], GoogLeNet [38] and ResNet [39], which can provide results comparable with those of human beings in image classification and recognition tasks. Those methods automatically learn features from the training data, which can replace manually engineered features, and have shown significant effects on HSI classification [8][9][10]. For example, Li et al. [40] applied 3D-CNNs for spectral-spatial feature extraction and classification, where 3D kernels were used to extract features from the HSI cube without any pre-processing or post-processing. In [41], transfer learning is used for HRRS scene classification by transferring features from successfully pre-learned CNNs. Unlike conventional CNNs methods, in which the convolutional kernels are updated during training, the kernels in PCA-Net [42] and K-means Net are pre-learned before network training and do not need to be updated during it. In addition, the kernels come directly from the data. PCA-Net [42] adopts the principal components of the training data as multistage filter banks, while K-means Net learns the kernels by clustering the training data. In this study, we mainly focus on the K-means Net. Although K-means Net can be directly applied to classification and reduces the training time by employing pre-learned kernels, it is difficult to determine the number of kernels, which is crucial for performance. To address this issue, we attempt to adaptively generate a specific number of kernels from the training data of the CNNs framework.

MCFSFDP Based CNNs
The traditional CNNs framework contains a convolutional layer, a fully connected layer and a classification layer. Its convolutional layer is updated through the error feedback process, which differs from the CNNs framework based on pre-learned convolutional kernels.
The proposed MCFSFDP based CNNs method includes three major modules: (1) a data pre-processing module, which extracts patches from block samples; (2) an MCFSFDP based kernel learning module, which learns the convolutional kernels from those extracted patches; and (3) a classification module, which utilizes the learned convolutional kernels.
The flowchart of our MCFSFDP based CNNs method is shown in Figure 1.
Remote Sens. 2017, 9, 618

Data Pre-Processing
In this study, we follow the standard data pre-processing principle in K-means Net [13]. Specifically, the HSI used in this classification task is denoted by R. Though an HSI is 3D data, it can also be seen as a collection of 2D images (i.e., images from different bands). Here, we denote the HSI in 2D form. First, we randomly select M pixels from R, and then extract M corresponding blocks $\{B_i\}_{i=1}^{M}$ with a size of $m \times m$ as samples, where each block is centered at a selected pixel. These M samples are divided into three parts, namely, training samples, validation samples and testing samples. The properties of the center pixel of a block are described by all the pixels in that block. Then, $\{B_i\}_{i=1}^{M}$ are put into the network and the label of the center pixel of block $B_i$ is used as the ground truth for training.
In addition, we randomly extract N patches $\{P_j\}_{j=1}^{N}$ with a size of $n \times n$ from the training samples for learning the kernels.
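The block and patch extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the reflect padding at the image border, and the choice to draw each patch from a single band are all hypothetical.

```python
import numpy as np

def extract_blocks(image, centers, m):
    """Crop an m x m block (m odd) centered on each (row, col) pixel in `centers`."""
    half = m // 2
    # Assumption: reflect-pad the spatial axes so border pixels still get full blocks.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    return np.stack([padded[r:r + m, c:c + m, :] for r, c in centers])

def extract_patches(blocks, num_patches, n, rng):
    """Draw random n x n patches, each from one band of one block (an assumption)."""
    M, m, _, bands = blocks.shape
    idx = rng.integers(0, M, size=num_patches)
    rows = rng.integers(0, m - n + 1, size=num_patches)
    cols = rng.integers(0, m - n + 1, size=num_patches)
    chans = rng.integers(0, bands, size=num_patches)
    return np.stack([blocks[i, r:r + n, c:c + n, b]
                     for i, r, c, b in zip(idx, rows, cols, chans)])

rng = np.random.default_rng(0)
image = rng.random((30, 30, 5))               # toy stand-in for the HSI R
centers = rng.integers(0, 30, size=(20, 2))   # M = 20 randomly selected pixels
blocks = extract_blocks(image, centers, m=9)  # shape (M, m, m, bands)
patches = extract_patches(blocks, num_patches=50, n=4, rng=rng)  # shape (N, n, n)
```

The label of `image[r, c]` would serve as the ground truth for the block centered at `(r, c)`; splitting the blocks into training, validation and testing sets is left out of the sketch.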

MCFSFDP Based CNNs Kernels Learning
To obtain kernels from those cropped patches, a suitable clustering method is necessary. Many clustering methods have been proposed, among which clustering by fast search and find of density peaks (CFSFDP) [14] is a typical state-of-the-art method. The success of CFSFDP is based on the idea that "cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities", so that the cluster centers can be determined through two thresholds, one on distance and one on density [14].
Though CFSFDP has shown its power for clustering, we find that when it is applied directly to generate the kernels for CNNs, the generated kernels are not always optimal for hyperspectral image classification tasks. This phenomenon is observed in the experimental results (see Section 4.3.1). We regard kernels (filters) as the standards against which samples are compared, and hence as the criteria for determining which cluster a sample belongs to. Since inter-cluster points are difficult to classify, several representative inter-cluster points should also be selected as clusters (kernels). To address this problem, we propose a new clustering method based on CFSFDP, which uses only a distance threshold to generate the kernel centers. The proposed method differs from the traditional CFSFDP in two aspects: (1) CFSFDP requires both a large distance and a high density to determine a cluster center, which easily excludes outlier points from the generation of cluster centers, whereas the proposed MCFSFDP method uses only a distance threshold, so cluster centers can be generated from either outlier points (with only a large distance) or high-density points; (2) the number of clusters in CFSFDP is determined 'semi-automatically', i.e., an extra frame needs to be introduced to help determine the number of clusters, whereas the number of clusters is determined automatically in the proposed method. We give the details of the proposed method as follows.
As in the CFSFDP algorithm in [14], we assume that the cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities.
Following this idea, we first reshape each patch $P_j$ into a column vector, which serves as data point j with a size of $1 \times n^2$. For each point j, we compute two values: its local density $\rho_j$ and its distance $\delta_j$ from the nearest point with higher density; if point j has the highest density, $\delta_j$ denotes the largest distance between j and the other points.
Both values depend only on the Euclidean distances $d_{jk}$ between pairs of data points j and k. The local density $\rho_j$ of data point j is defined as

$$\rho_j = \sum_{k} \chi(d_{jk} - d_c), \tag{1}$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, and $d_c$ is a cut-off distance. Basically, $\rho_j$ is equal to the number of points that are closer than $d_c$ to point j. $\delta_j$ is evaluated by computing the minimum distance between point j and any other point with higher density:

$$\delta_j = \min_{k:\, \rho_k > \rho_j} (d_{jk}). \tag{2}$$

For the point with the highest density, we usually take $\delta_j = \max_k (d_{jk})$. Note that $\delta_j$ is much larger than the typical nearest neighbor distance only for points that are local or global maxima in the density. Thus, in CFSFDP the cluster centers are recognized as points for which the value of $\delta_j$ is anomalously large and, at the same time, the value of $\rho_j$ is higher than a density threshold. To show the distance and density of each point intuitively, Figure 3 gives the decision graph of 10,000 patches with a size of $10 \times 10$ from the real Indian Pines dataset.
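The two per-point quantities can be computed along the following lines. This is a sketch rather than the authors' code: the function name is hypothetical, each point is excluded from its own neighbor count, and every point tied for the highest density receives the maximum-distance value; these are assumptions the text does not settle. The dense pairwise distance matrix is only practical for moderate numbers of patches.

```python
import numpy as np

def density_and_distance(points, d_c):
    """Return (rho, delta) for row-vector data points, in the spirit of Equations (1)-(2)."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    d = np.sqrt(np.maximum(d2, 0.0))      # pairwise Euclidean distances d_jk
    np.fill_diagonal(d, 0.0)
    # Number of points closer than d_c, not counting the point itself (assumption).
    rho = (d < d_c).sum(axis=1) - 1
    delta = np.empty(len(points))
    for j in range(len(points)):
        higher = rho > rho[j]
        # Nearest point of higher density; for the densest point(s), the largest distance.
        delta[j] = d[j, higher].min() if higher.any() else d[j].max()
    return rho, delta

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])
rho, delta = density_and_distance(points, d_c=0.15)
```

In this toy example the isolated point at (5, 5) has zero density but a large `delta`, which is exactly the kind of point that a distance-only threshold would retain.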
Different from choosing cluster centers in CFSFDP [14], we use the MCFSFDP algorithm to learn the kernels adaptively. Firstly, we choose the distance $\delta$ as the only threshold for choosing kernels from the decision graph in MCFSFDP.
To adapt the kernels and choose their number, we select the optimal distance threshold $\delta_A$ according to Equations (3)-(5). In Equation (3), $\delta_v$ denotes a distance value at which points occur and $f(\delta_v)$ gives the number of points, $num_v$, whose distances are equal to or larger than $\delta_v$, as shown in Figure 4a. In Equation (4), where $\delta_{v+1} \geq \delta_v$, $con_v$ denotes the differential of $f(\delta_v)$, which is an intermediate result between Equations (3) and (5). Equation (5) gives the variation in the number of points with $\delta_v$, as shown in Figure 4b.
$\delta_A$ denotes the adaptive distance threshold, and the points whose distances are larger than $\delta_A$ are chosen as CNN kernels. $\delta_A$ is a critical point that must satisfy two conditions: the numbers of points $num_v$ and $num_{v+1}$ are stable (in other words, they are of similar size) and, at the same time, the value $|con_v / con_{v+1}|$ is larger than the value $|con_{v+1} / con_{v+2}|$. When both hold, $\delta_v$ is selected as the adaptive distance threshold $\delta_A$.
In other words, to determine the adaptive distance threshold $\delta_A$ intuitively, we first find from curve 1 in Figure 4a the region of $\delta_v$ (0.25-0.30) in which $num_v$ begins to approach 0. As can be seen from Figure 4b, within this region $con_v$ has a local maximum at $\delta_v = 0.28$. This distance is therefore taken as the adaptive distance threshold, i.e., $\delta_A = 0.28$ on the Indian Pines dataset.
Finally, the points j with distance value $\delta_j > \delta_A$ are adaptively chosen as the kernels, and thus the number of kernels is also adaptively determined through the threshold $\delta_A$. Those chosen points are then reshaped into patches with a size of $n \times n$, which serve as the convolutional kernels in the CNNs framework. The CNNs with the pre-learned adaptive kernels are called MCFSFDP Net. The pre-learned kernels are denoted as $w_k$ in the following sections.
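The selection step can be illustrated with a short sketch. The helper names are hypothetical, and reading the "differential" of $f(\delta_v)$ as a finite difference over a grid of distance values is an assumption, not the paper's stated formula; the threshold $\delta_A$ is passed in rather than derived, since the exact selection rule is not reproduced here.

```python
import numpy as np

def count_and_difference(delta, grid):
    """num[v]: points with delta >= grid[v]; con[v]: finite difference of num (assumed form)."""
    num = np.array([(delta >= v).sum() for v in grid])
    con = np.diff(num) / np.diff(grid)
    return num, con

def select_kernels(points, delta, delta_A, n):
    """Keep points with delta_j > delta_A and reshape each one to an n x n kernel."""
    chosen = points[delta > delta_A]
    return chosen.reshape(-1, n, n)

delta = np.array([0.05, 0.10, 0.30, 0.50])
points = np.arange(16, dtype=float).reshape(4, 4)   # four toy points of length n^2 = 4
num, con = count_and_difference(delta, grid=np.array([0.0, 0.2, 0.4]))
kernels = select_kernels(points, delta, delta_A=0.28, n=2)
```

The number of kernels is simply `kernels.shape[0]`, so it follows from the threshold instead of being fixed in advance.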

Convolutional Neural Networks
With the pre-learned kernels $w_k$, a convolutional neural network such as that in [13] is designed for per-pixel HSI classification. This CNNs structure consists of an input layer, a convolutional layer, a pooling layer, a fully connected layer and a soft-max layer, as shown in Figure 5.

There are k kernels in the convolutional layer. Each feature map is calculated by taking the dot product between the k-th kernel $w_k$ of size $n \times n$, $w \in \mathbb{R}^{n \times n \times k}$, and the local context area x of size $m \times m$ with c channels, $x \in \mathbb{R}^{m \times m \times c}$. The feature map corresponding to the k-th filter, $f \in \mathbb{R}^{(m-n+1) \times (m-n+1)}$, is obtained by passing these dot products through the activation $\sigma$, where $\sigma$ is the rectified linear unit (ReLU). The kernels were pre-trained using the MCFSFDP algorithm.
Maximum pooling over local non-overlapping spatial regions is adopted to down-sample the convolutional layer. The pooling layer for the k-th filter takes the maximum of the feature map within each region.
The k feature maps are reshaped into column vectors, and all the column vectors are connected to a fully connected auto-encoder unit. The auto-encoder unit processes the concatenated column vector and learns a representation of its features. The outputs of the hidden layer of the auto-encoder unit are fed to the classification layer.
The last step of the CNNs is a soft-max layer used for final classification.
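The convolution and pooling stages with fixed, pre-learned kernels can be sketched as below. This is an illustrative reading, not the authors' code: the function names are hypothetical, and applying each $n \times n$ kernel to every channel and summing the responses over channels is an assumption about how the c channels are combined. The auto-encoder and soft-max layers are omitted.

```python
import numpy as np

def conv_relu(x, kernels):
    """Valid convolution of an m x m x c block with K fixed n x n kernels, then ReLU."""
    m, _, c = x.shape
    K, n, _ = kernels.shape
    out = m - n + 1
    fmap = np.zeros((K, out, out))
    for i in range(out):
        for j in range(out):
            # Assumption: the same kernel is applied to all channels and summed.
            window = x[i:i + n, j:j + n, :].sum(axis=2)
            fmap[:, i, j] = (kernels * window).sum(axis=(1, 2))
    return np.maximum(fmap, 0.0)

def max_pool(fmap, p):
    """Maximum over non-overlapping p x p regions; incomplete border regions are dropped."""
    K, h, w = fmap.shape
    h, w = h - h % p, w - w % p
    return fmap[:, :h, :w].reshape(K, h // p, p, w // p, p).max(axis=(2, 4))

rng = np.random.default_rng(0)
x = rng.random((25, 25, 3))                   # one toy context-area sample
kernels = rng.standard_normal((49, 14, 14))   # stand-in for pre-learned kernels w_k
features = max_pool(conv_relu(x, kernels), p=4).reshape(-1)
```

Because the kernels are fixed, this stage has no trainable parameters; only the layers that consume `features` would be trained.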

Experiments and Analysis
Three datasets were utilized to validate the feasibility and effectiveness of the proposed MCFSFDP based CNNs method (named MCFSFDP Net) in HSI classification. In the following sections, the datasets and experimental settings are described first, and then the effectiveness and superiority of the proposed method are tested.

Datasets
To start with an image that has few categories and obvious differences between categories, we first select an image with a size of 256 × 256 as Dataset 1. This image has been manually labeled with three categories: mountains, sky and roads. One hundred samples with a size of 25 × 25 were extracted from this image for each category. We randomly choose 210 context area samples for training, 30 samples for validation and the other 60 samples for testing. The details of the selected image samples are given in Table 1.
In order to evaluate the proposed method on complex data, Dataset 2 includes the benchmark Indian Pines image, which is HSI data captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor with a moderate spatial resolution of 20 m over the Indian Pines test site in northwestern Indiana in 1992. As shown in Figure 6, this image contains 145 × 145 pixels and 224 spectral bands, whose wavelengths range from 0.4 to 2.5 µm. The number of bands in the corrected data was reduced to 200 (bands 1-200 were extracted). In addition, 6476 image context area samples with a size of 19 × 19 were extracted. Among them, 3238, 647 and 2591 samples were used for training, validation and testing, respectively. The details of each category of image samples are given in Table 2.
Dataset 3 includes the benchmark Pavia University image, which is HSI data captured by a ROSIS sensor with a spatial resolution of 1.3 m during a flight campaign over Pavia, northern Italy. As shown in Figure 7, this image contains 610 × 610 pixels and 103 spectral bands. The number of bands was reduced to 100 (bands 1-100 were extracted). Furthermore, 34,400 image context area samples with a size of 11 × 11 were extracted. Among them, 17,200, 3440 and 13,760 samples were used for training, validation and testing, respectively. The details of each category of samples are given in Table 3.
Table 1. Ground truth classes and their respective sample numbers in Dataset 1.

Class Samples Training Validation Testing

Experimental Parameter Settings
Ten thousand patches were randomly extracted from the training samples for learning kernels. For each dataset, the sample (block) size and the number of patches were kept the same across the different pre-learned CNNs frameworks.
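As an illustration of this step, the sketch below draws N random n × n patches from a stack of m × m training blocks. The array layout (blocks stored as `(M_T, m, m, c)`), the uniform sampling of block and position, and the function name `extract_patches` are assumptions for the example, not details given in the paper.

```python
import numpy as np

def extract_patches(blocks, n, num_patches, seed=0):
    """Randomly crop `num_patches` patches of size n x n from training blocks.

    blocks: array of shape (M_T, m, m, c), one m x m block per training sample.
    Returns an array of shape (num_patches, n, n, c).
    """
    rng = np.random.default_rng(seed)
    num_blocks, m, _, c = blocks.shape
    if n > m:
        raise ValueError("patch size n must not exceed block size m")
    patches = np.empty((num_patches, n, n, c), dtype=blocks.dtype)
    for j in range(num_patches):
        i = rng.integers(num_blocks)   # which training block
        r = rng.integers(m - n + 1)    # top-left row of the crop
        s = rng.integers(m - n + 1)    # top-left column of the crop
        patches[j] = blocks[i, r:r + n, s:s + n, :]
    return patches

if __name__ == "__main__":
    demo_blocks = np.random.default_rng(1).random((20, 19, 19, 5))
    demo_patches = extract_patches(demo_blocks, n=10, num_patches=100)
    print(demo_patches.shape)
```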
The CNNs framework shown in Figure 5 uses one convolutional layer, one pooling layer, one auto-encoder layer and a classifier. In our algorithm, the pooling layer adopted the non-overlapping rule, the number of neurons in the hidden layer of the auto-encoder was set to 100 and the maximum number of iterations for training the classifier was 400. The learning rate was 0.0001 and the momentum was 1. The batch sizes on the three datasets were chosen as 10, 50 and 200, respectively.
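The first two stages of this framework can be sketched as a valid convolution with pre-learned kernels followed by non-overlapping pooling. The sketch is illustrative only: the paper does not state the pooling operator, so max pooling is assumed here, each kernel is assumed to span all c channels, and the function names are hypothetical.

```python
import numpy as np

def conv_valid(block, kernels):
    """Slide each kernel over the block without padding.

    block:   (m, m, c) local context area.
    kernels: (k, n, n, c) pre-learned kernels.
    Returns feature maps of shape (k, m - n + 1, m - n + 1).
    """
    m = block.shape[0]
    k, n = kernels.shape[0], kernels.shape[1]
    out = m - n + 1
    fmap = np.empty((k, out, out))
    for r in range(out):
        for s in range(out):
            window = block[r:r + n, s:s + n, :]
            # Dot product of every kernel with the current window.
            fmap[:, r, s] = np.tensordot(kernels, window, axes=3)
    return fmap

def pool_non_overlap(fmap, p):
    """Non-overlapping p x p max pooling (max is an assumption).

    Rows and columns that do not fill a complete p x p cell are cropped.
    """
    k, h, w = fmap.shape
    h2, w2 = h // p, w // p
    cropped = fmap[:, :h2 * p, :w2 * p]
    return cropped.reshape(k, h2, p, w2, p).max(axis=(2, 4))

if __name__ == "__main__":
    block = np.ones((19, 19, 3))
    kernels = np.ones((2, 14, 14, 3))
    pooled = pool_non_overlap(conv_valid(block, kernels), p=4)
    print(pooled.shape)
```

The pooled maps would then be flattened and passed to the auto-encoder layer and the classifier, which are not sketched here.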
The average test accuracy is calculated on 10 independent Monte Carlo runs.
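A minimal sketch of this evaluation protocol is given below. It assumes a hypothetical callable `run_once(seed)` that performs one full random split, training and testing cycle and returns the test accuracy; neither the callable nor the use of the run index as a seed comes from the paper.

```python
import numpy as np

def average_accuracy(run_once, n_runs=10):
    """Average the test accuracy over independent Monte Carlo runs.

    run_once: hypothetical callable taking a seed and returning the
              test accuracy of one complete train/test cycle.
    Returns the mean and standard deviation over the runs.
    """
    accs = np.array([run_once(seed) for seed in range(n_runs)], dtype=float)
    return accs.mean(), accs.std()

if __name__ == "__main__":
    # Stand-in for a real experiment: accuracy alternates between two values.
    mean_acc, std_acc = average_accuracy(lambda seed: 0.90 + 0.01 * (seed % 2))
    print(mean_acc, std_acc)
```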


Effectiveness of the Kernels Learned by MCFSFDP
The aim of this experiment is to validate the effectiveness of the kernels learned by MCFSFDP. To this end, we compared those kernels with the kernels taken as the cluster centers obtained by the CFSFDP algorithm. The two kinds of kernels were then integrated into the same CNNs framework for HSI classification on Dataset 1. To obtain a fair comparison, the number of kernels in both methods was fixed at 49. The kernel size was set to 14 × 14 and the pooling size to 4 × 4. The average testing classification accuracy of the two methods is shown in Table 4. It reveals that the kernels learned by MCFSFDP are more effective than the kernels learned by CFSFDP.
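For reference, the two quantities on which density-peak clustering relies can be sketched as follows, assuming that CFSFDP here follows the standard density-peaks formulation: a local density ρ counted within a cutoff distance d_c, and a distance δ to the nearest point of higher density. Picking a fixed number of centers by the largest ρ·δ product is a common heuristic and an assumption of this sketch; it is not claimed to be the selection rule of CFSFDP or MCFSFDP as used in the paper.

```python
import numpy as np

def density_peaks(points, d_c):
    """Compute local density rho and distance delta for every point.

    points: (N, D) array, e.g. flattened patches.
    d_c:    cutoff distance (a free parameter in this sketch).
    Note: builds the full N x N distance matrix, so it only suits small N.
    """
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Number of neighbours closer than d_c, excluding the point itself.
    rho = (dist < d_c).sum(axis=1) - 1
    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = rho > rho[i]
        if higher.any():
            delta[i] = dist[i, higher].min()   # nearest denser point
        else:
            delta[i] = dist[i].max()           # a global density peak
    return rho, delta

def top_k_centers(points, rho, delta, k):
    """Assumed heuristic: keep the k points with the largest rho * delta."""
    order = np.argsort(rho * delta)[::-1]
    return points[order[:k]]

if __name__ == "__main__":
    pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1]).reshape(-1, 1)
    rho, delta = density_peaks(pts, d_c=0.15)
    print(rho, delta)
```

Points with a large δ are far from any denser point, which is why the distance threshold discussed in the paper can be used to pick cluster centers as kernels.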

Effectiveness of the Kernels Number Determined by MCFSFDP
To demonstrate the effectiveness of the kernel number determined by MCFSFDP, we compared MCFSFDP with its variants for classification on each dataset. The variants shared the same CNNs architecture and kernel learning scheme, except that the kernel number was chosen manually. Dataset 1, Dataset 2 and Dataset 3 were used in the experiment. For each dataset, the kernel size and the pooling size can be found in Table 5. We report the testing classification accuracy of all these methods on each dataset in Tables 6-8, respectively. Each variant is denoted as MCFSFDP-M Net followed by a number that indicates the manually chosen kernel number. Similarly, the number that follows MCFSFDP Net represents the kernel number automatically determined by the proposed method. In Table 6, the proposed method determines the kernel number as 35. The manually chosen kernel numbers in the other variants are 20, 25, 41 and 55, respectively. The accuracy, distance threshold and number of kernels for each method are shown in different rows. It can be seen that the proposed method achieves the best classification accuracy. A similar phenomenon arises in Tables 7 and 8. Therefore, we can conclude that the proposed method is able to find a good kernel number for different datasets.

Performance Evaluation of MCFSFDP Net
In this part, the proposed method was compared with three state-of-the-art pre-learned kernels based CNNs methods, namely K-means Net [13], PCA-Net [42] and Random Net. For a fair comparison, the same CNNs architecture was adopted by all comparison methods. The number of kernels for K-means Net, PCA-Net and Random Net was set to 50, while the proposed method determines the number of kernels automatically. For each dataset, the kernel size and the pooling size can be found in Table 5. As shown in Table 9, the proposed algorithm achieves higher pixel classification accuracy than the three pre-learned kernels based CNNs methods on Dataset 1. Moreover, the proposed MCFSFDP Net with 35 kernels has lower computational complexity in the training process than the comparison methods with 50 kernels.
The average testing classification accuracy of our proposed algorithm, K-means Net, PCA-Net and Random Net on Dataset 2 is given in Table 10. The results show that the proposed MCFSFDP Net obtains better accuracy than the three pre-learned kernels based CNNs methods, which is consistent with the results obtained on Dataset 1. The average classification accuracy of our proposed method and the three pre-learned kernels based CNNs methods on the Pavia University image is presented in Table 11. The results show that our proposed CNNs method is more accurate than the three pre-learned kernels based CNNs methods, even though it needs a larger number of kernels to achieve the better classification result.

Effect of the Number of Kernels
In the MCFSFDP-M Net, the number of kernels influences the pixel-level classification. Figure 8 shows the classification accuracy achieved with different kernel numbers $A_k$ that were manually selected via MCFSFDP on Dataset 1, Dataset 2 and Dataset 3.
Figure 8a shows the classification results as the kernel number $A_k$ varies, for each kernel size n × n, on Dataset 1. The accuracy of MCFSFDP-M Net does not improve when the kernel number $A_k$ is increased. Figure 8b shows the highest accuracy on Dataset 2. When the kernel number is chosen manually via MCFSFDP, the accuracy reaches a peak within the tested range of kernel numbers, as it does with the adaptive kernels learned through the MCFSFDP method. Figure 8c demonstrates again, on Dataset 3, that the accuracy does not improve as the kernel number is increased.


Effect of the Kernel Size
In our proposed MCFSFDP based CNN method, the kernel size has a major impact on the pixel classification performance. Table 12 gives the average classification accuracy obtained by using different kernel sizes. It shows that the highest classification accuracy was achieved when the kernel size was set to 10 × 10 and 6 × 6 on Dataset 1 and 6 × 6 on Dataset 2.

Conclusions
In this paper, we propose a novel CNNs classification framework for HSIs, which can data-adaptively learn a specific number of kernels from the training data. In particular, this model adopts the MCFSFDP algorithm to cluster the training data, and the convolutional kernels are then determined automatically from the cluster centers and the inter-cluster margin. With those pre-learned kernels, a CNNs framework is developed for classification. We have compared the proposed CNNs framework against three state-of-the-art deep learning methods with pre-trained kernels on three datasets. The experimental results demonstrate the superiority of the proposed CNNs framework in classification accuracy. Moreover, we validate that the proposed method is able to find a good kernel number for a specific dataset. These adaptively learned kernels can help us understand the complexity of the data and adjust the CNNs architecture for good feature extraction.
In terms of future research, we will exploit a multi-layer architecture via MCFSFDP based CNNs to enhance the classification accuracy with fewer samples.

[...] learning module, which learns the convolutional kernels from those extracted patches; and (3) classification modules, which utilize the learned convolutional kernels. The flowchart of our MCFSFDP based CNNs method is shown in Figure 1.

Figure 1. The flowchart of the MCFSFDP based CNNs method.

M blocks $\{B_i\}_{i=1}^{M}$ with a size of m × m are extracted as samples, where each block is centered at a selected pixel. These M extracted samples are roughly divided into three parts, namely training samples, validation samples and testing samples. The property of the center pixel of a block is described by all the pixels in the block. Then, $\{B_i\}_{i=1}^{M}$ are put into the network, and the center pixel labels of the blocks $B_i$ are used as the ground truth for training. In addition, we randomly extract N patches $\{P_j\}_{j=1}^{N}$ with a size of n × n from the $M_T$ training samples, where $M_T$ denotes the number of training samples, $M_T < M$ and $n < m$. The extracted N patches $\{P_j\}_{j=1}^{N}$ are used for learning the convolutional kernels with a size of n × n via MCFSFDP. The process of producing the blocks (samples) and patches is shown in Figure 2.

Figure 2. The block (sample) is extracted from image R and the patch is extracted from the block.

Figure 3. Decision graph of 10,000 patches with a size of 10 × 10 on the Indian Pines dataset.

In the equations leading to Equation (5), $\delta_v$ denotes a distance value that contains points, and $f(\delta_v)$ gives the number of points whose distances are equal to or larger than $\delta_v$, as shown in Figure 4a. The differential of $f(\delta_v)$ is an intermediate result on the way to Equation (5), which denotes the variation in the number of points with $\delta_v$, as shown in Figure 4b. $\delta_A$ denotes the adaptive distance threshold, and the points whose distances are larger than $\delta_A$ are chosen as CNN kernels. $\delta_A$ is a critical point at which the numbers of points $num_v$ and $num_{v+1}$ are stable (in other words, they are of similar quantity) and, at the same time, the value [...]. In this case, $\delta_v$ is selected as the adaptive distance threshold $\delta_A$. Intuitively, to determine $\delta_A$, we can find from curve 1 in Figure 4a the region of $\delta_v$ (0.25-0.30) in which $num_v$ begins to approach 0; Figure 4b shows $con_v$ for the distance values $\delta_v$ in the same region (0.25-0.30). The distance $\delta_v = 0.28$, which belongs to this region, is confirmed as the adaptive distance threshold $\delta_A$. In conclusion, by observing Figure 4, the adaptive distance threshold $\delta_A$ is determined as 0.28 on the Indian Pines dataset.
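The two ingredients that survive in this description, the count of points at or above each distance value and the selection of kernels above the threshold, can be sketched as follows. This is only an illustration under stated assumptions: the distance values are evaluated on a user-supplied grid, the change between consecutive counts stands in for the differential, and the exact stability criterion used to pick the threshold is not reproduced, so the threshold is passed in as an argument. All function names are hypothetical.

```python
import numpy as np

def count_curve(delta, grid):
    """Number of points with distance >= each grid value, and its decrease.

    delta: (N,) distance value of every patch.
    grid:  increasing candidate distance values (assumed discretisation).
    Returns (counts, drops), where drops[v] = counts[v] - counts[v + 1].
    """
    delta = np.asarray(delta)
    grid = np.asarray(grid)
    counts = (delta[None, :] >= grid[:, None]).sum(axis=1)
    drops = counts[:-1] - counts[1:]
    return counts, drops

def select_kernels(patches, delta, delta_a):
    """Keep the patches whose distance exceeds the threshold delta_a."""
    mask = np.asarray(delta) > delta_a
    return patches[mask]

if __name__ == "__main__":
    delta = np.array([0.05, 0.10, 0.20, 0.29, 0.35])
    counts, drops = count_curve(delta, grid=[0.0, 0.1, 0.2, 0.3])
    print(counts, drops)
    patches = np.arange(5 * 4).reshape(5, 2, 2)
    print(select_kernels(patches, delta, delta_a=0.28).shape)
```

With a threshold of 0.28, as reported for the Indian Pines dataset, only the patches whose distance exceeds 0.28 would be kept as kernels, so the number of kernels follows from the data rather than being fixed in advance.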

Figure 4. The curve for determining the adaptive distance with patches with a size of 10 × 10 on the Indian Pines dataset. (a) shows the curve of point number over distance $\delta_v$; (b) gives the curve of the quotients of the differential over distance $\delta_v$.

Figure 5. The structure of MCFSFDP based CNNs. There are k kernels in the convolutional layer. Each feature map is calculated by taking the dot product between the k-th kernel $w_k$ of size n × n, $w \in R^{n \times n \times k}$, and the local context area $x$ of size m × m with c channels, $x \in R^{m \times m \times c}$.

Figure 6. The Indian Pines image in Dataset 2. (a) shows the composite image; (b) shows the ground truth of the Indian Pines dataset, where the white area denotes the unlabeled pixels.

Figure 7. The Pavia University image in Dataset 3. (a) shows the composite image; (b) shows the ground truth of the Pavia University dataset, where the white area denotes the unlabeled pixels.

Figure 8. The influence of the number of kernels on the classification accuracy. (a) the classification accuracy with an increasing number of kernels for different kernel sizes on Dataset 1; (b) the classification accuracy with an increasing number of kernels for different kernel sizes on Dataset 2; (c) the classification accuracy with an increasing number of kernels on Dataset 3.

Table 1. Ground truth classes and their respective sample numbers in Dataset 1.

Table 2. Groundtruth of classes and their respective sample numbers on Indian Pines scene.

Table 3. Groundtruth of classes and their respective sample numbers in the Pavia University scene.

Table 4. The testing accuracy compared with learned 49 kernels via CFSFDP and MCFSFDP-M on Dataset 1.

Table 5. The chosen block size, kernel size and pooling size of each dataset.

Table 6. The testing accuracy of MCFSFDP-M Net compared with MCFSFDP Net on Dataset 1.

Table 7. The testing accuracy of MCFSFDP-M Net compared with MCFSFDP Net on Dataset 2.

Table 8. The test accuracy of MCFSFDP-M Net compared with MCFSFDP Net on Dataset 3.

Table 9. The testing accuracy of different CNNs methods compared with MCFSFDP Net on Dataset 1.

Table 10. The testing accuracy of different CNNs methods compared with MCFSFDP Net on Dataset 2.

Table 11. The testing accuracy of different CNNs methods compared with MCFSFDP Net on Dataset 3.

Table 12. The average classification accuracy obtained by using different kernel size.