Automatic Kernel Size Determination for Deep Neural Networks Based Hyperspectral Image Classification

Considering the kernels in Convolutional Neural Networks (CNNs) as detectors of local patterns, the K-means neural network clusters local patches extracted from training images and then fixes the kernels as the representative patches of each cluster, without further training. Thus the number of labeled samples required for training can be greatly reduced. One key property of these kernels is their spatial size, which determines their capacity for detecting local patterns and is expected to be task-specific. However, most of the literature determines the spatial size of these kernels heuristically. To address this problem, we propose to automatically determine the kernel size so as to better adapt the K-means neural network to hyperspectral imagery classification. Specifically, a novel kernel-size determination scheme is developed by measuring the clustering performance of local patches of different sizes. With kernels of the determined size, more discriminative local patterns can be detected in the hyperspectral imagery, with which the classification performance of the K-means neural network is clearly improved. Experimental results on two datasets demonstrate the effectiveness of the proposed method.


Introduction
With the growth of remote sensing technology, hyperspectral imagery (HSI), which provides both spatial information and abundant spectral information [1,2], has been widely employed in various applications such as mineral exploration, ground object identification, agricultural survey and geological monitoring. In these applications, pixel-level classification is a commonly used technology, which is crucial for both low-level HSI processing and high-level HSI understanding.
Plenty of methods have been proposed for HSI classification. According to the features utilized to represent the pixels in an HSI, they can be roughly divided into two categories, namely handcrafted-feature-based methods and deep-learning-feature-based methods. In previous years, handcrafted-feature-based methods for HSI classification made promising progress. Nevertheless, these methods [3-9] require various kinds of domain knowledge in order to extract features appropriate for the subsequent classification step. More importantly, handcrafted features often exhibit a shallow structure, and thus are insufficient to represent the many complicated structures found in challenging HSI classification problems. Recently, deep-learning-feature-based methods, which learn features from low level to high level with a deep hierarchical structure, have been extensively investigated. Compared with handcrafted features, learned deep features often show better nonlinear representation ability for the original images. Therefore, numerous deep-learning-feature-based methods have been developed for HSI classification [10-12].
Generally, the indicators for evaluating clustering results fall into two categories. In the first category, evaluating samples against given labels applies only to supervised clustering; however, most clustering problems are unsupervised. In the other category, for non-labeled data, indicators such as Cluster Accuracy (CA), the Rand Index (RI) and Normalized Mutual Information (NMI) are used to evaluate clustering results.
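As a brief illustration (not part of the proposed method), label-based indicators such as NMI can be computed with scikit-learn when ground-truth labels are available; NMI scores a partition against reference labels and is invariant to how the cluster labels are named:

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical ground-truth labels and clustering assignments
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same partition, different label names

# Identical partitions score 1.0 regardless of label permutation
score = normalized_mutual_info_score(labels_true, labels_pred)
print(score)  # 1.0
```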
In [26], K-means Net utilizes the K-means algorithm, an unsupervised clustering method based on the distances between sample points, to learn its kernels. For this reason, the indicator for measuring the clustering results should be related to the inter-class and inner-class distances. However, for different kernel sizes, the same clustering sample occupies a different location when projected onto the 2-D plane, and the number of samples in each class differs, i.e., the samples are non-uniformly distributed. For this evaluation task, traditional indicators based only on the inter-class or inner-class distance are therefore not suitable. To better deal with this problem, a more practical evaluation indicator should be designed to replace the existing, unsuitable determination methods. Moreover, the new indicator needs to take into account the number of samples in each class after clustering.
In this paper, to enhance the HSI classification results of K-means Net, we propose a new K-means Net with size-adaptive kernels. Specifically, a new clustering evaluation indicator is proposed to evaluate the clustering results for groups of pre-learned kernels with different sizes and to determine the adaptive kernel size. With the proposed method, the adaptive kernel size can be easily determined so as to well represent the data characteristics. Experimental results on two datasets demonstrate that, with the automatically determined kernel size, the proposed method outperforms several state-of-the-art CNN methods.
In summary, the proposed CNN framework makes two key contributions: (1) a specific convolutional kernel size can be determined by a new clustering evaluation indicator; (2) the K-means-based CNN framework with adaptive kernel size is effective for HSI classification.

The Proposed Method
The K-means-based CNN method with adaptive kernel size includes four major steps: (1) data pre-processing, which extracts groups of patches with different patch sizes from block samples (the block samples are extracted from the original HSI for training); (2) K-means clustering of the convolutional kernels for each group of sizes; (3) evaluation of the clustering results to determine the adaptive kernel size; and (4) HSI classification using the pre-learned kernels of adaptive size in the K-means-based CNN. The flowchart of the proposed method is shown in Figure 1.

Data Pre-Processing
For simplicity, we denote the HSI employed for classification as R in this paper.
First, we randomly select M pixels from R and then extract the corresponding blocks {B_i}, i = 1, ..., M, of size m × m centered at each selected pixel as samples. Each pixel contains the information from all spectral bands; here, we omit the band dimension when describing the size. These M samples are roughly divided into three sets, namely a training set (M_T samples), a validation set (M_V samples) and a testing set (M_P samples). The label of the central pixel of a block is represented through the property of the whole block; in other words, the property of the central pixel is described via the statistical property of the pixel values of the central pixel and its surrounding pixels within the block. Then, the M_T training blocks are fed into the network, and the labels of their central pixels are used as the ground truth for training. In this paper, by comparing different block sizes, we select a block size of 27 × 27 (i.e., m = 27) to obtain the best classification results [23].
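A minimal sketch of this block-extraction step (the array shapes and the border handling are our assumptions; the paper does not specify how pixels near the image border are treated):

```python
import numpy as np

def extract_blocks(hsi, m=27, num_samples=100, seed=0):
    """Extract m x m blocks (all bands kept) centered at randomly chosen pixels."""
    rng = np.random.default_rng(seed)
    rows, cols, bands = hsi.shape
    half = m // 2
    blocks = []
    for _ in range(num_samples):
        # sample centers away from the border so every block fits inside the image
        r = rng.integers(half, rows - half)
        c = rng.integers(half, cols - half)
        blocks.append(hsi[r - half:r + half + 1, c - half:c + half + 1, :])
    return np.stack(blocks)  # shape: (num_samples, m, m, bands)

# toy HSI with the Indian Pines dimensions (145 x 145 pixels, 200 bands)
blocks = extract_blocks(np.zeros((145, 145, 200), dtype=np.float32),
                        m=27, num_samples=5)
print(blocks.shape)  # (5, 27, 27, 200)
```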
Moreover, we randomly extract N patches {P_j}, j = 1, ..., N, of size n × n from the M_T training samples, where M_T < M and n < m. The extracted patches are used for learning convolutional kernels of size n × n via K-means clustering. In this paper, we choose groups of patches with sizes 22 × 22, 20 × 20, ..., 6 × 6, so that the result of the convolutional process can be divided without remainder in the pooling process (the pooling size is 2 × 2); in addition, N is set to 10,000.
The process of data extraction is shown in Figure 2.
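The patch-extraction step can be sketched as follows (again under our assumptions; we draw patch locations uniformly at random inside each training block):

```python
import numpy as np

def extract_patches(blocks, n=6, num_patches=10000, seed=0):
    """Randomly crop n x n patches (all bands) from a stack of training blocks."""
    rng = np.random.default_rng(seed)
    num_blocks, m, _, bands = blocks.shape
    patches = np.empty((num_patches, n, n, bands), dtype=blocks.dtype)
    for j in range(num_patches):
        b = rng.integers(0, num_blocks)    # which training block to crop from
        r = rng.integers(0, m - n + 1)     # top-left corner inside the block
        c = rng.integers(0, m - n + 1)
        patches[j] = blocks[b, r:r + n, c:c + n, :]
    return patches

patches = extract_patches(np.zeros((10, 27, 27, 200), dtype=np.float32),
                          n=6, num_patches=100)
print(patches.shape)  # (100, 6, 6, 200)
```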


Clustering the Convolutional Kernels via K-Means
The pre-learned convolutional kernels are obtained with the K-means algorithm. To verify the adaptive size, we set the number of classes K to 50, based on experience.
We first reshape each patch P_j into a column vector, also denoted P_j, of size n² × 1. All the vectors are denoted as P = {P_1, ..., P_j, ..., P_10000}. The steps of K-means are as follows: Step 1: we randomly choose 50 vectors from P as the initial cluster centers, i.e., µ_1, ..., µ_50.

Step 2: each vector P_j is assigned to its nearest cluster center, i.e., label(P_j) = argmin_f ||P_j − µ_f||, so that a vector carrying the label of cluster center µ_f is closer to µ_f than to any other cluster center,
where the distance is the Euclidean distance and label(P_j) denotes the label of the vector P_j.
Step 3: for each class f, let c_f denote the number of vectors P_j that carry the label of µ_f. We recompute the mean of these vectors, µ_f = (1/c_f) Σ_{label(P_j)=f} P_j, as the new cluster center. Steps 2 and 3 are repeated until the cluster centers converge.
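The three steps above amount to standard K-means on the flattened patch vectors; a compact sketch (using K = 3 and toy 4-D data rather than the paper's K = 50 and 10,000 patch vectors):

```python
import numpy as np

def kmeans(vectors, k=3, iters=20, seed=0):
    """Plain K-means: random initial centers, assign-to-nearest, recompute means."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k vectors as the initial cluster centers
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each vector to its nearest center (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned vectors
        for f in range(k):
            if np.any(labels == f):
                centers[f] = vectors[labels == f].mean(axis=0)
    return centers, labels

# toy patch vectors: three well-separated groups in 4-D
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(c, 0.1, size=(30, 4)) for c in (0.0, 5.0, 10.0)])
centers, labels = kmeans(data, k=3)
print(centers.shape, labels.shape)  # (3, 4) (90,)
```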
Since the patches of different sizes are extracted from the same samples, they often show different degrees of representation ability. Moreover, patches of different sizes yield different clustering results and different distributions in the 2-D plane. A detailed discussion is given in Section 3.

Determination Method of Adaptive Kernel Size
Given the clustering results obtained with the different groups of patch (kernel) sizes, we determine the adaptive kernel size in the following steps:
Step 1: We compute the inner-class distance D_inner. For each class f, where f = 1, 2, ..., 50 is the class index, we compute the inner-class distance matrix D_inner(K_f), adopting the Euclidean distance as the distance measure. A variable N_f is introduced to represent the number of patches in class f. The entries of D_inner(K_f) are given in Equation (3) as the distances ||P_{K_f} − µ_f||, where P_{K_f} denotes the K_f-th vector in class f, K_f = 1, 2, ..., N_f, and µ_f denotes the cluster center of class f.
We then rank the numbers of patches in the classes from small to large. The weight w_f is the quotient between the number of patches in class f and the total number N of patches, as shown in Equation (4): w_f = N_f / N. In addition, e_f denotes the weight of w_f, representing the rank of the number of patches in each class: the class with the largest number of patches has e_f = 50/50, and conversely the class with the smallest number of patches has e_f = 1/50 (50 is the number of classes). Ranking the remaining classes by their numbers of patches from large to small, their e_f values are set to 49/50, 48/50, ..., 3/50, 2/50, respectively. The inner distance D_inner(f) of class f is given in Equation (5). Over all classes, the final inner-class distance D_inner is described in Equation (6): D_inner = Σ_{f=1}^{50} D_inner(f).
Step 2: We compute the inter-class distance D_inter. We compute the distance matrix D_M composed of the distances among the cluster centers; in this paper, it has size 50 × 50, with entries D_M(r, t) = ||µ_r − µ_t||, where r = 1, ..., 50, t = 1, ..., 50, and µ_r and µ_t are the centers of classes r and t, respectively.
We normalize the matrix D_M as in Equation (7): D_M ← D_M / max(D_M), where max(D_M) denotes the largest element of D_M. Finally, the inter-class distance D_inter is given in Equation (8) as the sum over the entries of the normalized matrix, where r and t denote the row and column indices of D_M, respectively; D_inter is a scalar.
Step 3: The evaluation indicator of the clustering results with the different kernel sizes is then computed. For the kernel sizes n = 22, 20, ..., 6, the evaluation indicator EI(n) is given in Equation (9), where n denotes the kernel size and EI(n) denotes the evaluation indicator value of the clustering result with kernel size n × n. Then, by ranking the values of EI(n), the optimal kernel size is chosen as the n × n that yields the largest EI(n).
The flow chart of determining the adaptive size of the convolutional kernels is shown in Figure 3.
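Under our reading of the scheme, the indicator rewards well-separated centers and compact, size-weighted classes. The sketch below assumes EI = D_inter / D_inner and an assumed form for the per-class inner distance; the paper's exact Equations (5), (8) and (9) are not reproduced here, so this is an illustration, not the authors' verbatim formula:

```python
import numpy as np

def evaluation_indicator(vectors, labels, centers):
    """Score one clustering: a large inter-class distance and a small weighted
    inner-class distance give a large value. EI = D_inter / D_inner is an
    assumed form consistent with the paper's description."""
    k = len(centers)
    counts = np.array([(labels == f).sum() for f in range(k)])
    n_total = len(vectors)
    # rank weights e_f: smallest class -> 1/k, ..., largest class -> k/k
    ranks = np.empty(k)
    ranks[np.argsort(counts)] = np.arange(1, k + 1)
    e = ranks / k
    d_inner = 0.0
    for f in range(k):
        if counts[f] == 0:
            continue
        dists = np.linalg.norm(vectors[labels == f] - centers[f], axis=1)
        w = counts[f] / n_total            # w_f = N_f / N  (Equation (4))
        d_inner += e[f] * w * dists.sum()  # assumed form of D_inner(f)
    d_m = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d_m = d_m / d_m.max()                  # normalization, Equation (7)
    d_inter = d_m.sum()                    # assumed form of D_inter
    return d_inter / d_inner

# tiny demo: two compact, well-separated classes give a positive score
v = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[0.0, 0.5], [5.0, 5.5]])
ei = evaluation_indicator(v, lab, cen)
print(ei > 0)  # True
```

The candidate size among 22 × 22, 20 × 20, ..., 6 × 6 with the largest indicator value would then be selected as the adaptive kernel size.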
With the pre-learned kernels C_f, a convolutional neural network as described in [27] is developed for per-pixel HSI classification. This CNN structure consists of an input layer, a convolutional layer, a pooling layer, a fully connected layer and a soft-max layer, as shown in Figure 4.
There are 50 kernels in the convolutional layer. Each feature map is calculated by taking the dot product between the f-th kernel C_f of size n × n × h, C_f ∈ R^(n×n×h), and a local context area x of size m × m × h with h channels, x ∈ R^(m×m×h). The feature map corresponding to the f-th filter, O_f ∈ R^((m−n+1)×(m−n+1)), is calculated as O_f = σ(x ∗ C_f), where σ is the rectified linear unit (ReLU). The kernels were pre-trained using the K-means algorithm.
Max pooling over local overlapping spatial regions is adopted to down-sample the convolutional layer. The pooling output for the f-th filter is g_f ∈ R^(((m−n+1)/p)×((m−n+1)/p)), where p is the pooling size. The f feature maps are reshaped into column vectors, and all the column vectors are concatenated into a fully connected layer. An auto-encoder unit is used to process the concatenated vector and represent its features; the outputs of the hidden layer of the auto-encoder are connected to the classification layer.
The last soft-max layer outputs the final classification result. The structure of the K-means-based CNN with adaptive kernel size is shown in Figure 4.
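A minimal NumPy sketch of this forward pass for a single filter (valid convolution, ReLU, then 2 × 2 max pooling; for simplicity we pool over non-overlapping regions, whereas the paper describes overlapping pooling):

```python
import numpy as np

def conv_relu_pool(x, kernel, p=2):
    """x: (m, m, h) context area; kernel: (n, n, h); returns pooled feature map."""
    m = x.shape[0]
    n = kernel.shape[0]
    out = m - n + 1                       # valid convolution output size
    fmap = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            # dot product between the kernel and the local window
            fmap[i, j] = np.sum(x[i:i + n, j:j + n, :] * kernel)
    fmap = np.maximum(fmap, 0.0)          # ReLU
    q = out // p                          # pooled size (out divisible by p)
    pooled = fmap[:q * p, :q * p].reshape(q, p, q, p).max(axis=(1, 3))
    return pooled

# m = 27 block, n = 6 kernel, h = 4 toy channels -> 22 x 22 map -> 11 x 11 pooled
x = np.random.default_rng(0).normal(size=(27, 27, 4))
k = np.random.default_rng(1).normal(size=(6, 6, 4))
print(conv_relu_pool(x, k).shape)  # (11, 11)
```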

Experiments and Analysis
To demonstrate the effectiveness of the proposed method, two HSI datasets are adopted in the following experiments. These datasets are used to validate the feasibility and effectiveness of the proposed K-means-based CNN with adaptive kernel size for classification. In the following sections, we first introduce the datasets, then provide the detailed experimental settings, and finally conduct two experiments to show the HSI classification results of the proposed method.


Datasets
Two public image datasets are utilized in our experiments. Dataset 1: In order to evaluate the proposed method on a complex dataset, the first dataset is the benchmark Indian Pines image, shown in Figure 5a. It was gathered by the AVIRIS sensor over the Indian Pines test site in north-western Indiana. The ground reference is shown in Figure 5b. The image contains 145 × 145 pixels and 224 spectral bands; the wavelengths range from 0.4 to 2.5 µm. The number of bands is reduced to 200. We chose 11 of the total 16 categories for the experiment, and 5108 image context area samples of size 27 × 27 were extracted. The samples of each category are given in Table 1. This dataset is used to analyze the distributions of the extracted patches in the different size groups, certifying that the patches used for learning kernels have different distributions, and also to test the feasibility and effectiveness of the proposed approach for classification.

Dataset 2: The second dataset, shown in Figure 6, contains 610 × 610 pixels and 103 spectral bands. The number of bands is reduced to 100 by selecting the top 100 of the 103 bands, and the whole image was used. The samples are split into training, validation and testing sets with ratios 0.5, 0.1 and 0.4; 31,571 image context area samples of size 27 × 27 were extracted, among which 15,785, 3157 and 12,629 samples are used for training, validation and testing, respectively. The details of each category of samples are given in Table 2. This dataset is used to test the feasibility and effectiveness of the proposed approach for classification.

Experimental Parameter Settings
In the experiments, the samples (blocks) are randomly extracted from the HSI dataset, and then groups of patches are extracted from the training samples for learning the pre-learned kernels. Each group of kernels has a constant kernel size, and the number of pre-learned kernels is fixed at 50. The pre-learned kernels are then used in the pre-learned CNN framework.
In the experiments, as shown in Figure 4, the CNN framework uses one convolutional layer, one pooling layer, one auto-encoder layer and a classifier. The pooling layer adopts the overlapping rule with size 2 × 2, the number of neurons in the hidden layer of the auto-encoder is set to 1000, and the maximum number of iterations for training the classifier is 400. The learning rate is 0.0001 and the momentum is 1. The batch size is set to 200. The testing accuracy is the average over 10 trials. The code runs on a computer with two Intel Xeon E5-2678 v3 2.50 GHz CPUs (Intel, Santa Clara, CA, USA), two NVIDIA Tesla K40c GPUs (NVIDIA, Santa Clara, CA, USA), 128 GB RAM, a 120 GB SSD and Matlab 2016a (MathWorks, Natick, MA, USA). The gradient is computed via batch gradient descent, which is not computed on the GPU.


Different Performances in 2-D Plane of the Patches with Different Sizes
The aim of this experiment is to show the behavior of the non-uniformly distributed patches of different sizes in the 2-D plane. In the experiment, patches of different sizes are extracted from the HSI of Dataset 1 and reshaped into vectors, and the vectors are projected onto the 2-D plane with the tSNE_VISURE_2dDATA tool. The chosen patch sizes are 22 × 22, 20 × 20, ..., 6 × 6. The 2-D plane shows both the distribution of the patches and their distribution after K-means clustering with 50 classes.
Figure 7 shows that the patches of different sizes behave differently when projected onto the 2-D plane. This is because patches of different sizes show different qualities. The projections after K-means clustering also differ across patch sizes, which makes it difficult to evaluate the clustering results obtained with different patch sizes. For this reason, the evaluation indicator of the clustering results should be defined anew.
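This projection step can be reproduced with scikit-learn's t-SNE in place of the tSNE_VISURE_2dDATA tool (a sketch; random toy vectors stand in for the reshaped patch vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

# toy stand-ins for reshaped patch vectors (e.g. 200 patches of size 6 x 6)
rng = np.random.default_rng(0)
patch_vectors = rng.normal(size=(200, 36))

# project the high-dimensional patch vectors onto the 2-D plane
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(patch_vectors)
print(emb.shape)  # (200, 2)
```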


Effectiveness of the Adaptive Kernels Size Determined by CNNs Based K-Means Clustering
To demonstrate the effectiveness of the adaptive kernel size determined by the evaluation indicator of K-means, we compare the evaluation indicator values and the classification accuracies of the K-means-based CNN on the two HSI datasets, Dataset 1 and Dataset 2.
We report the evaluation indicator values and the testing accuracies for the different patch sizes on each dataset in Tables 3 and 4. The chosen kernel sizes in this experiment are 6 × 6, 8 × 8, ..., 22 × 22. In Table 3, the proposed method determines the adaptive kernel size on Dataset 1 as 6 × 6: this size has the largest evaluation indicator value, 16.9080, showing that the samples of size 6 × 6 give the best clustering result. It can be seen that the adaptive kernel size 6 × 6 obtained via the proposed method also yields the best testing classification accuracy, 99.7945%. Similar observations can be made from Table 4. The proposed method is therefore demonstrated to have the potential to determine the adaptive kernel size on other datasets.

Performance Evaluation of K-Means Net
In this part, the proposed method is compared with three state-of-the-art pre-learned-kernel-based CNN methods: PCA-Net [26], Random Net and MCFSFDP Net [27]. For a fair comparison, the same CNN architecture is used in all the compared methods. The number of kernels is set to 50, and the adaptive kernel size of 6 × 6 is determined by the proposed method. The parameters of the four networks, such as the learning rate and the momentum value, are determined by tuning, and the number of iterations is set to 400.
The average testing classification accuracies of the proposed algorithm, PCA-Net, Random Net and MCFSFDP Net on Dataset 1 and Dataset 2 are given in Tables 5 and 6. The results show that Random Net, MCFSFDP Net and K-means Net with the adaptive kernel size all obtain acceptable accuracy, and that our proposed method produces the second-best classification result among the four compared methods. MCFSFDP Net achieves the best classification accuracy, which relies on a more advanced clustering method for pre-learning the convolutional kernels. K-means Net, in turn, has two advantages: fast speed and a data-determined character in clustering the kernels. Among the other methods, PCA-Net is also data-determined in learning kernels; however, as the sample dimension increases, the effect of the PCA reduction drops. In other words, the reduced dimension is a hard parameter to design, which influences the kernel performance of PCA-Net. Moreover, learning kernels with PCA-Net is slower than with K-means Net. In Random Net, the kernels are randomly initialized and are therefore not data-determined, so Random Net is not applicable to a large extent. In MCFSFDP Net, the kernels are learned via the MCFSFDP method; these kernels are data-determined, and the clustering method, based on density and distance, has better clustering performance than K-means. However, the MCFSFDP algorithm requires a step that calculates a distance matrix, which needs more time and memory than K-means Net, PCA-Net or Random Net.

Discussion
In the kernel-size determination scheme, the relationships between the clustering results (as measured by the evaluation indicator) and the testing accuracy on Dataset 1 and Dataset 2 are shown in Figures 8 and 9. Figures 8a and 9a show the evaluation indicator values of the clustering results achieved with different patch sizes, calculated from the inter-class and inner-class distances, on the two datasets. Figures 8b and 9b show the classification results as the kernel size varies on Dataset 1 and Dataset 2.
In Figures 8a and 9a, the inner distance increases as the kernel size increases. Nevertheless, the inter distance does not increase with the same regularity as the inner distance; rather, it shows no obvious trend. In contrast, the evaluation indicator value calculated from the inner and inter distances evidently decreases on Dataset 1 and Dataset 2 as the kernel size increases.
In Figures 8b and 9b, the classification results and the evaluation indicator value are reduced at the same time with the increased kernel size.The evaluation indicator value and classification results have the same reduced trend.
In this case, the proposed evaluation indicator value is more suitable than both inner distance and inter distance for determining the adaptive size of convolutional kernels.
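One plausible form of such an indicator, consistent with the trends described above (the inner distance grows with kernel size, the inter distance stays roughly flat, and the indicator falls), is the ratio of inter-centroid distance to inner distance. The paper's exact formula may differ; this sketch only reproduces the reported behavior:

```python
import numpy as np

def evaluation_indicator(X, centers, labels):
    """Hypothetical EI: mean inter-centroid distance divided by the mean
    inner (sample-to-its-centroid) distance. A growing inner distance with
    a flat inter distance drives this ratio down, matching the figures."""
    # inner distance: average distance from each sample to its own centroid
    inner = np.mean(np.linalg.norm(X - centers[labels], axis=1))
    # inter distance: average pairwise distance between cluster centroids
    k = len(centers)
    inter = np.mean([np.linalg.norm(centers[a] - centers[b])
                     for a in range(k) for b in range(a + 1, k)])
    return inter / inner
```

Under this reading, the adaptive kernel size would be the candidate n with the largest EI(n), matching the observation that the indicator and the classification accuracy decrease together as n grows.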

Conclusions
In this paper, we propose a novel K-means-based CNN classification framework for HSI classification, which determines the kernel size from the training data in an adaptive manner. Specifically, the framework uses the K-means algorithm to cluster groups of patches of different sizes extracted from the training data, and the convolutional kernel size is then determined adaptively by the proposed evaluation indicator of the clustering results. The clustering centers at the adaptive kernel size serve as the pre-learned kernels in the K-means-based CNN framework. The experimental results demonstrate that the proposed method is able to find a good kernel size for each dataset, which helps define a more suitable CNN architecture for feature extraction and classification.


Figure 1. The flow chart of our proposed method.

Blocks B_i are used as the ground truth for training the network. In this paper, after comparing different block sizes, we select blocks of size 27 × 27 (i.e., m = 27).

Groups of patches P are used for learning the convolutional kernels of size n × n via K-means clustering. In this paper, we choose groups of patches with sizes 22 × 22, 20 × 20, ..., 6 × 6.
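The block and patch-group extraction described above can be sketched as follows. The image size, the stride, and the use of a single band in place of the full HSI cube are assumptions made purely for illustration:

```python
import numpy as np

def extract_block(image, row, col, m=27):
    """Cut an m x m block (sample) centered on a labeled pixel, edge-clipped."""
    r = m // 2
    return image[max(row - r, 0):row + r + 1, max(col - r, 0):col + r + 1]

def patch_groups(block, sizes=range(22, 5, -2), step=2):
    """Build one group of flattened n x n patches per candidate kernel size."""
    groups = {}
    for n in sizes:
        h, w = block.shape[:2]
        groups[n] = np.stack([block[i:i + n, j:j + n].ravel()
                              for i in range(0, h - n + 1, step)
                              for j in range(0, w - n + 1, step)])
    return groups

img = np.zeros((145, 145))                 # toy single-band image
block = extract_block(img, 72, 72)         # one 27 x 27 sample block
groups = patch_groups(block)               # candidate sizes 22, 20, ..., 6
print(sorted(groups))  # [6, 8, 10, 12, 14, 16, 18, 20, 22]
```

Each group is then clustered separately, so the evaluation indicator can be compared across the candidate kernel sizes.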

Figure 2. The block (sample) is extracted from image R, and the groups of samples with different sizes are extracted from block B_i, respectively.

Figure 3. The flow chart of determining the adaptive size of the convolutional kernels.

Figure 4. The structure of the K-means based CNNs.

Figure 5. The Indian Pines image of Dataset 1. (a) The composite image; (b) the ground truth of the Indian Pines dataset, where the white area denotes unlabeled pixels.

Figure 6. The Pavia University image of Dataset 2. (a) The composite image; (b) the ground truth of the Pavia University dataset, where the white area denotes unlabeled pixels.
Figure 7. Kernels with size of 6 × 6 in the 2-D plane.

Figure 8. The influence of the kernel size on classification accuracy. (a) The evaluation indicator value, inter distance, and inner distance with different kernel sizes on Dataset 1; (b) the classification accuracy together with the evaluation indicator value on Dataset 1.

Figure 9. The influence of the kernel size on classification accuracy. (a) The evaluation indicator value, inter distance, and inner distance with different kernel sizes on Dataset 2; (b) the classification accuracy together with the evaluation indicator value on Dataset 2.
Step 3: The evaluation indicator of the clustering results with different kernel sizes is computed as follows: for each candidate kernel size n = 22, 20, ..., 6, the evaluation indicator EI(n) is calculated from the inner and inter distances of the corresponding clustering result.

Table 1. Ground truth of classes and their respective sample sizes in the Indian Pines scene.

The second dataset is the benchmark Pavia University image, acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. As shown in Figure

Table 2. Ground truth of classes and their respective sample sizes in the Pavia University scene.

Table 3. The evaluation indicator value and the testing accuracy of K-means Net with different kernel sizes on Dataset 1.

Table 4. The evaluation indicator value and the testing accuracy of K-means Net with different kernel sizes on Dataset 2.

Table 5. The testing accuracy of the different CNN methods compared with K-means Net on Dataset 1.

Table 6. The testing accuracy of the different CNN methods compared with K-means Net on Dataset 2.