Semi-Supervised Hyperspectral Image Classification via Spatial-Regulated Self-Training

Abstract: Hyperspectral images contain many unlabeled samples, and manual labeling is costly, so this paper adopts a semi-supervised learning method to make full use of the abundant unlabeled samples. In addition, hyperspectral images carry rich spectral information, and convolutional neural networks have a strong capacity for representation learning. This paper proposes a novel semi-supervised hyperspectral image classification (HSIc) framework that uses self-training to gradually assign highly confident pseudo labels to unlabeled samples by clustering, and employs spatial constraints to regulate the self-training process. The spatial constraints exploit the spatial consistency within the image to correct and re-assign mistakenly classified pseudo labels. Through self-training, the number of high-confidence sample points gradually increases, and these samples are added to the corresponding semantic classes, which progressively strengthens the semantic constraints. At the same time, the growing set of high-confidence pseudo labels also improves regional consistency within the hyperspectral image, which reinforces the role of the spatial constraints and improves HSIc performance. Extensive HSIc experiments demonstrate the effectiveness, robustness, and high accuracy of our approach.


Introduction
Due to advances in optical sensing technology, hyperspectral images, which contain richer spectral information than Synthetic-Aperture Radar (SAR) and Red-Green-Blue (RGB) images, have recently attracted increasing attention in the remote sensing field. To fully analyze raw hyperspectral images, hyperspectral image classification (HSIc) [1][2][3] plays a crucial role and is a prerequisite step towards many remote sensing applications [4,5], e.g., forest inventory, urban-area monitoring, and resource exploration.
HSIc usually refers to the use of the spectral-spatial information of hyperspectral images to accurately map the spatial distribution and material content, and to identify the different types of features in the corresponding scene. But compared with applications of SAR or RGB images [6][7][8], HSIc faces two main challenges: (1) redundancy of spectral information and (2) large data volumes. In traditional methods, owing to the redundancy of spectral information in hyperspectral images, dimensionality reduction [9,10] is required to extract features efficiently. Chen et al. [11] applied Principal Component Analysis (PCA) [12,13] for dimension reduction, removing redundant spectral features by linearly projecting the raw high-dimensional data into a new low-dimensional space. Similarly, there are other ways to implement dimensionality reduction, such as Independent Component Analysis (ICA) [14], manifold learning, or directly selecting the most representative channels from the hyperspectral data. Once low-dimensional data have been obtained by such preprocessing, models such as the Stacked Autoencoder (SAE) [15] or Deep Belief Network (DBN) [16] can be used to classify objects (pixels). However, dimensionality reduction may destroy the original spectral structure, causing the loss of some useful spectral information and thus possibly reducing HSIc performance.
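To make the dimensionality-reduction step concrete, the following minimal sketch projects pixel spectra onto the leading principal component via power iteration. The function name and the pure-Python implementation are illustrative only, not the pipeline used in any of the cited works, which rely on optimized linear algebra and keep many components:

```python
import random

def pca_first_component(spectra, iters=200):
    """Estimate the leading principal component of a set of pixel
    spectra (a list of equal-length lists) and project onto it.
    A sketch only; real HSI pipelines keep several components."""
    n, d = len(spectra), len(spectra[0])
    # Center each spectral band.
    means = [sum(row[j] for row in spectra) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in spectra]
    # Covariance matrix (d x d).
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    # Power iteration for the dominant eigenvector.
    random.seed(0)
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project every centered pixel spectrum onto the component.
    return [sum(centered[i][j] * v[j] for j in range(d)) for i in range(n)]
```

Each pixel's high-dimensional spectrum is thereby reduced to a single score along the direction of maximum variance; repeating with deflation yields further components.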
In recent years, deep learning methods have attracted the attention of researchers as tools for representation learning, and have made significant progress in computer vision [17], natural language processing [18], and speech recognition [19]. In the task of HSIc, deep learning methods have also become popular due to their impressive performance [20][21][22]: the features extracted by deep learning are better representations than hand-crafted features. Recently, spectral-spatial feature extraction methods [23,24] based on convolutional neural networks (CNN) [25][26][27] have been proposed to make full use of spectral-spatial information by extracting the most discriminative regions and channels. Cao et al. [28] used the CNN classification results as the likelihood probability and introduced a superpixel segmentation method as a prior to regulate the final results. Yang et al. [29] used transfer learning to pre-train a CNN on other RGB-based remote sensing image datasets and then fine-tuned it for HSIc by end-to-end training. In all, the above methods can achieve promising HSIc performance on the condition that there are enough training samples.
Despite their great success, deep learning methods rely heavily on enormous amounts of annotated data to achieve good results in supervised learning problems. However, labeling data manually takes much time, which is a heavy burden for most researchers. Hyperspectral images are especially expensive and time-consuming to annotate because expert knowledge and skill are required, which limits the power and applicability of deep learning models on hyperspectral images. To address these problems, semi-supervised HSIc methods have been proposed, which require only a small amount of labeled data.
In this paper, we introduce a novel semi-supervised HSIc framework that relies on a CNN for feature extraction and representation learning. The proposed model comprises three steps: feature extraction, constrained clustering, and spatial constraints. The specific implementation process is as follows. First, the initial feature set is extracted directly from the sample set, each sample feature being the flattening of a fixed-size image patch centered on the sample. We obtain a number of tiny-scale initial clusters with high self-consistency by over-clustering the initial feature set. Next, a small number of initial labels are introduced as semantic constraints to assign pseudo labels to the initial clusters, merging these clusters into their corresponding dominant semantic classes. The initial labels are provided by an expert at the beginning, while the pseudo labels are the labels predicted in each iteration of the algorithm. To overcome erroneous pseudo labels, spatial constraints are introduced to improve HSIc accuracy under the smoothness assumption that neighboring pixels likely belong to similar classes. After clustering with semantic and spatial constraints, the number of pseudo labels increases. At the same time, the increase of highly confident pseudo labels enriches the spatial neighborhood information, which reinforces the role of the spatial constraints. Finally, the pseudo labels are used for CNN training, and the trained network is used for feature extraction; the features extracted by the CNN are more amenable to the next round of constrained clustering. This process loops until the stopping condition is reached.
In all, the contributions of this paper can be summarized as follows:
• We introduce a novel semi-supervised classification algorithm for HSIc based on the cooperation between deep learning models and clustering.
• Adjacent pixels in a hyperspectral image are likely to belong to the same class. We therefore introduce a spatial constraint into the above algorithm as a smoothness assumption to improve HSIc accuracy.
• Compared with previous methods, our proposed approach achieves competitive HSIc performance while leveraging only a tiny amount of labeled data.
The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the proposed HSIc framework. Section 4 describes the experimental design and results. Section 5 analyzes and discusses the experiments. Finally, Section 6 summarizes the conclusions.

Hyperspectral Image Classification
Hyperspectral images contain two-dimensional spatial information together with rich spectral information about the target scene. In a hyperspectral image, each pixel corresponds to the continuous spectral information of an object sample point, and every spectral channel includes the location information of the target object. HSIc usually refers to using this spectral-spatial information to accurately map the spatial distribution and material content, and to identify the different types of features in the corresponding scene. In this paper, we use a semi-supervised learning approach to classify every pixel of the whole hyperspectral image from a small number of labeled samples per scene combined with a large number of unlabeled samples. The features of each pixel are spectral-spatial features extracted by the CNN, making full use of the spectral-spatial information. HSIc is generally the first step toward a variety of remote sensing applications, including geological exploration [30], precision agriculture [31], and environmental monitoring [32], and it plays an important role in these areas.
In general, there are two categories of HSIc methods: pixel-wise methods and spectral-spatial methods. In pixel-wise methods, the raw pixels and their labels are fed directly into the training models. In particular, SVM-based HSIc methods [33,34] have shown good performance, but they require a large number of labeled samples. For hyperspectral images, collecting annotated samples manually is very time-consuming and expensive, which limits such methods.
In spectral-spatial methods, both labeled and unlabeled pixels are available for classification by exploiting the neighborhood information of the labeled pixels. In hyperspectral images, spatially adjacent pixels typically have similar spectral features and tend to share the same class [35]. As a result, spectral-spatial methods are more accurate [36]. They build on a variety of techniques, such as CNNs [37,38], semi-supervised learning [39], Markov Random Fields [40,41], and ensembles of classifiers [42].

Semi-Supervised Learning
With the development of technology, acquiring hyperspectral image data has become easy. However, obtaining labeled hyperspectral image data requires expert knowledge and is costly and time-consuming, so labeling only a small number of samples saves considerable cost. In HSIc applications, commonly used supervised learning methods are limited by the number of labeled samples, and semi-supervised learning methods can solve the problem of insufficient labeled samples. Semi-supervised learning has achieved good results in the classification of hyperspectral imagery. Ma et al. [43] proposed a feature learning algorithm, context deep learning (CDL), applied to HSIc; CDL uses a small number of training samples and a simple classifier to obtain high-quality spectral and spatial features through a deep learning framework. Dópido et al. [44] proposed a semi-supervised classification method based on the spatial neighbors of labeled samples (SNI-L) for HSIc, which adapts standard active learning methods to a self-learning scenario and uses machine-machine interaction instead of manual supervision to obtain new (unlabeled) samples. To exploit the rich information of hyperspectral images, Ma et al. [20] proposed a semi-supervised classification method based on multi-decision labeling and deep feature learning for HSIc tasks. Tan et al. [45] proposed a semi-supervised HSIc method based on the spatial neighborhood information of the selected unlabeled samples (SNI-unL), which combines the spatial neighborhood information of unlabeled samples with the classifier to improve the classification ability of the selected unlabeled samples.
Many classical semi-supervised learning algorithms have been proposed. Commonly used generative methods are relatively easy to implement; they assume that all samples (labeled or not) are "generated" by the same underlying model. The unlabeled samples can then be linked to the learning objective through the parameters of this model, treated as missing data, and the maximum likelihood estimate can usually be obtained with the expectation-maximization (EM) [46] algorithm. But these methods have one key requirement: the model assumption must be correct, otherwise using unlabeled data will reduce generalization performance, which is a serious limitation in real tasks. The Transductive SVM (TSVM) [47] maximizes the margin between classes over all samples, but it has difficulty handling large amounts of unlabeled data. Graph-based [48] semi-supervised learning is conceptually very clear, and the nature of the algorithm is easy to explore by analyzing the matrix operations involved, but it also has major drawbacks: large storage overhead and sensitivity to the structure of the graph. In this paper, we use an extension of the self-training algorithm [49], which is easy to implement. The method adds high-confidence samples and their predicted labels to the pseudo-label set in each iteration until all unlabeled samples have been labeled. Its disadvantage is that when the initial labels do not cover or represent the data space well, the initial classifier performs poorly, so a large number of misclassified samples occur and persist, and the labels newly added in subsequent iterations inherit these mistakes.
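The basic self-training loop described above can be sketched as follows. The `fit`/`predict_proba` interface is a hypothetical placeholder for any base classifier, and this skeleton deliberately omits the semantic and spatial constraints that our method adds on top:

```python
def self_train(labeled, unlabeled, fit, predict_proba,
               threshold=0.9, max_rounds=10):
    """Generic self-training skeleton.
    labeled:   list of (x, y) pairs;  unlabeled: list of x.
    fit(pairs) -> model;  predict_proba(model, x) -> (label, confidence).
    The interface is illustrative, not a specific library API."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = fit(labeled)
        confident, remaining = [], []
        for x in pool:
            y, conf = predict_proba(model, x)
            (confident if conf >= threshold else remaining).append((x, y))
        if not confident:
            break                      # no new high-confidence samples
        labeled.extend(confident)      # promote high-confidence pseudo labels
        pool = [x for x, _ in remaining]
    return fit(labeled), labeled
```

With a weak initial classifier, any sample mislabeled here stays in `labeled` for all later rounds, which is exactly the failure mode the constraints in our method aim to suppress.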
In our method, we first repeat multiple experiments with randomly labeled subsets so that the initial labels cover the data space better and the initial classifier obtains better results. Second, we use semantic constraints and spatial constraints to keep newly misclassified labels to a minimum. Finally, the spectral-spatial features extracted by the CNN are applied to the classifier in the next iteration.

Proposed Method
As illustrated in Figure 1, our method is a recurrent framework which consists of four modules: image preprocessing, CNN-based representation learning, constrained clustering, and spatial constraints. A brief walk-through of Figure 1: (1) Preprocessing: the hyperspectral image is divided into slice spectral-slices according to the number of spectral channels. (2) A CNN (a network structure based on a modification of LeNet [50]) performs spectral-spatial feature extraction for each spectral-slice separately. (3) The feature set extracted from each spectral-slice is clustered under constraints, yielding the clustering results of all spectral-slices. (4) Similar to taking the intersection of sets, the clustering results of the spectral-slices are compared, and only samples assigned the same semantic class in every spectral-slice are kept as the result of the constrained clustering. (5) A spatial constraint is applied to this HSIc result to obtain the pseudo labels. (6) The generated pseudo labels and the input data are fed to the CNN for the next round of training. The details are given below.

Feature Extraction Based on CNN Representation Learning
To take advantage of the spectral-spatial information of the hyperspectral image, we use a CNN to extract spectral-spatial features. Compared with hand-crafted features, CNNs can extract more representative features [51]; they have strong feature learning capabilities and have been highly successful in image recognition.
Given a hyperspectral data set P = (P_1, P_2, ..., P_N), N denotes the total number of images and P_n the n-th hyperspectral image. Each hyperspectral image is divided into slice spectral-slices according to the number of spectral channels, each of which contains 1/slice of the original spectral channels. Then the same operation is performed on all the spectral-slices: m small patches are taken centered on each pixel, so each image yields m × slice small patches. The size of each patch is e × e × t, where e is the width of the patch and t is the number of spectral channels of the patch. All the small patches of one spectral-slice of the n-th image can be represented as Z_n = (z_n1; z_n2; ...; z_nm), where Z_n ∈ Z and Z is defined as the full set of all the patches of the data set. As shown in Figure 1, X_n denotes the features extracted by the CNN for the n-th image, where X_n ∈ X and X is defined as the collection of extracted features of the full set. X_n can be formulated as X_n = f(Z_n; θ), where θ denotes the parameters of the CNN and f represents the forward pass function.
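A minimal illustration of the slicing and patch-taking step, assuming the cube is stored as nested lists (H × W pixels, each a list of T band values) and that borders are handled by replicate padding, a detail the paper does not specify:

```python
def spectral_slices(cube, n_slices):
    """Split an H x W x T hyperspectral cube (nested lists) into
    n_slices sub-cubes along the spectral axis."""
    t = len(cube[0][0])
    step = t // n_slices
    return [[[px[s * step:(s + 1) * step] for px in row] for row in cube]
            for s in range(n_slices)]

def extract_patch(sub, r, c, e):
    """e x e spatial patch centered on pixel (r, c), clamping indices
    at the image border (replicate padding), flattened into a vector."""
    h, w = len(sub), len(sub[0])
    half = e // 2
    feat = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            rr = min(max(r + dr, 0), h - 1)
            cc = min(max(c + dc, 0), w - 1)
            feat.extend(sub[rr][cc])   # append this pixel's band values
    return feat
```

Applying `extract_patch` to every pixel of every spectral-slice yields the m × slice patch vectors per image described above.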

Network Structure
In this paper, our CNN structure is similar to LeNet, except that we abandon the fully connected layer, replace it with a fully convolutional layer [52], and adjust the size and number of the convolutional kernels. The details of the network structure are shown in Figure 2.

Constraints Based on Semantic Information
In this paper, we introduce a constrained clustering scheme [53]. We perform the same operation on all spectral-slices of the hyperspectral image: m small patches are taken centered on each pixel, so each spectral-slice has m small patches. Z_n = (z_n1; z_n2; ...; z_nm) is the set of all small patches of the n-th spectral-slice, and X_n = (x_n1; x_n2; ...; x_nm) denotes the feature vector set extracted by the CNN from the n-th spectral-slice, formulated as X_n = f(Z_n; θ). We over-cluster each spectral-slice's feature vector set separately. Over-clustering yields many small clusters, and the samples within each such cluster are likely to belong to one class. However, annotating image features requires their semantic information, so we use a small amount of labeled information as semantic constraints to identify the semantic class of each cluster. Clusters that fall under the same semantic constraint are merged into the same semantic cluster.
Over-clustering clusters without corresponding semantic labels or semantic information are assigned to an "unknown class". Since the labels within a semantic cluster follow the semantic distribution, the resulting training pairs allow the neural network to learn a better feature representation; thus, in the next iteration, enlarging the over-clustering clusters makes the constrained clustering process more efficient. During the iterative optimization, the number of over-clustering clusters decreases exponentially until it equals K*, where K* is generally set to two or three times the number of semantic classes contained in the image.
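The merging of over-clustering clusters into dominant semantic classes can be sketched as below. The function assumes cluster assignments are already available (e.g., from K-means) and resolves each cluster's class by a majority vote over the few seed labels it contains, which is one plausible reading of the semantic constraint; all names are illustrative:

```python
def merge_clusters(cluster_of, seed_labels, n_clusters, unknown=-1):
    """Assign each over-clustering cluster the dominant semantic class
    among its seed-labeled members; clusters with no seeds -> unknown.
    cluster_of:  list mapping sample index -> cluster id.
    seed_labels: dict mapping a few sample indices -> class id."""
    votes = [dict() for _ in range(n_clusters)]
    for idx, cls in seed_labels.items():
        v = votes[cluster_of[idx]]
        v[cls] = v.get(cls, 0) + 1
    # Majority class per cluster; empty vote sets become "unknown".
    cluster_class = [max(v, key=v.get) if v else unknown for v in votes]
    # Every sample inherits its cluster's semantic class as a pseudo label.
    return [cluster_class[c] for c in cluster_of]
```

Clusters mapped to the same semantic class are thereby effectively merged, while untouched clusters fall into the "unknown class" until later iterations supply labels for them.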

Sample Confidence Calculation
The calculation of sample confidence in the over-clustering process requires label information: the initial over-clustering uses the initial labels, and subsequent over-clustering rounds use the pseudo labels. The calculation proceeds as follows. First, the number of labeled samples in each cluster is counted:

S_i = Σ_{j=1}^{N_{p,i}} Σ_{q=1}^{K} pt(i, j, q),

and then the average number of labeled samples per cluster:

S_ave = (1 / N_c) Σ_{i=1}^{N_c} S_i,

where K is the number of classes, N_c is the number of clusters, N_{p,i} is the number of samples in the i-th cluster, and pt(i, j, q) indicates whether the label of the j-th sample in the i-th cluster is q: when q = 0 the label is the background and pt(i, j, q) = 0, else pt(i, j, q) = 1. Second, the purity of the labels in each cluster is calculated; the purity of label f in the i-th cluster is

PURE_{i,f} = ( Σ_{j=1}^{N_{p,i}} pt(i, j, f) ) / S_i,

and PURE_{i,max} is the maximum of PURE_{i,f} over f. When PURE_{i,max} is greater than the threshold TH and S_i > S_ave, the cluster contains comparatively many labels, most of which belong to category f, so we regard the cluster as belonging to class f with high confidence. The value of TH is usually set between 50% and 80% according to the specific situation. Finally, we compare the clustering results of the slice spectral-slices: only samples assigned to the same semantic class across all slice spectral-slices are kept as the classification result of the over-clustering process.
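A sketch of this confidence rule under the stated conventions (label 0 is background and is ignored; a cluster is accepted for class f only when its maximum purity exceeds TH and its label count S_i exceeds the average S_ave); the data layout is an assumption for illustration:

```python
def confident_clusters(cluster_labels, th=0.6):
    """cluster_labels: one list per cluster holding the initial/pseudo
    labels observed in that cluster (0 = background, ignored).
    Returns {cluster index: class} for clusters passing the test."""
    counts = []
    for labels in cluster_labels:
        c = {}
        for q in labels:
            if q != 0:                      # pt(i, j, 0) = 0: background
                c[q] = c.get(q, 0) + 1
        counts.append(c)
    sizes = [sum(c.values()) for c in counts]   # S_i per cluster
    s_ave = sum(sizes) / len(sizes)             # S_ave over all clusters
    out = {}
    for i, c in enumerate(counts):
        if sizes[i] == 0:
            continue
        f = max(c, key=c.get)               # class with maximum purity
        pure = c[f] / sizes[i]              # PURE_{i,f}
        if pure > th and sizes[i] > s_ave:  # both conditions must hold
            out[i] = f
    return out
```

Clusters failing either condition contribute no pseudo labels in the current round and wait for the next iteration.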

Constraints Based on Neighborhood Spatial Information
In the previous section, we obtained the pseudo labels through the clustering constraints. In this section, considering that neighboring pixels in the spatial domain of a hyperspectral image are likely to belong to the same class, we introduce a local decision [20] strategy to smooth the pseudo labels. First, in a square neighborhood centered on the selected unlabeled sample, we count the labeled samples of each class with weights. Since labels near the test sample should have more decision power than labels farther away, the weights are related to the two-dimensional Euclidean distance. We then sort the class scores, and the label with the highest score is the conclusion of the local decision. A simple example of the local decision process is as follows. As shown in Figure 3, assume that there are three types of labeled samples, L1, L2, and L3, in the square neighborhood centered on sample point P, and that blank cells indicate unlabeled samples. We count the three types of labeled samples with different weights. The score of the i-th class is calculated as

Score_i = w1 × N1_i + w2 × N2_i,

where N1_i is the number of class-i labels among the 8 samples of the first ring around P (weighted by w1), and N2_i is the number of class-i labels among the 16 samples of the second ring around P (weighted by w2).
In this experiment we take w1 = 1 and w2 = 0.5. We sort the calculated label scores, and the label with the highest score is the conclusion of the local decision; in the example, the class with the highest score is L1. We further set a threshold Threshold, whose value depends on the weights w1 and w2; in this paper we take Threshold = 8. When the score of L1 is greater than Threshold, we regard sample point P as belonging to category L1 with high confidence; otherwise the sample remains unlabeled.
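The two-ring weighted vote can be sketched as follows, with the paper's settings w1 = 1, w2 = 0.5, and threshold 8 as defaults; treating out-of-image neighbours as absent is our own assumption:

```python
def local_decision(grid, r, c, w1=1.0, w2=0.5, threshold=8.0):
    """Weighted vote over the 8 first-ring and 16 second-ring neighbours
    of pixel (r, c). grid holds class labels (0 = unlabeled).
    Returns the winning class if its score exceeds threshold, else 0."""
    h, w = len(grid), len(grid[0])
    scores = {}
    for dr in range(-2, 3):
        for dc in range(-2, 3):
            if dr == 0 and dc == 0:
                continue               # skip the center pixel itself
            rr, cc = r + dr, c + dc
            if not (0 <= rr < h and 0 <= cc < w):
                continue               # neighbour falls outside the image
            lbl = grid[rr][cc]
            if lbl == 0:
                continue               # unlabeled neighbours cast no vote
            # First ring (Chebyshev distance 1) gets w1, second ring w2.
            weight = w1 if max(abs(dr), abs(dc)) == 1 else w2
            scores[lbl] = scores.get(lbl, 0.0) + weight
    if not scores:
        return 0
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else 0
```

With a fully labeled neighbourhood the maximum attainable score is 8·w1 + 16·w2 = 16, so the threshold of 8 demands that at least half of that voting mass agree on one class.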

Iteration Process Based on Self-Training
As shown in Algorithm 1, the specific process is as follows. First, we divide the hyperspectral image into slice spectral-slices according to the number of spectral channels; each spectral-slice holds 1/slice of the original spectral channels. In this paper we divide each hyperspectral image into four spectral-slices. Then, we extract the raw feature set X^0 = {X^0_1, X^0_2, ..., X^0_4} from the four spectral-slices, where X^0_4 denotes the feature set extracted from the 4th spectral-slice in the initial round; each feature is the flattening of a sample-centered patch of size 5 × 5. Next, the initial label set L is introduced into the feature set of each spectral-slice, where L is a small number of labels obtained by repeating multiple experiments with randomly labeled subsets, so that the initial labels cover the data space well. A separate over-clustering is then performed to obtain the raw clusters K^0 = {K^0_1, K^0_2, ..., K^0_4}, where K^0_4 denotes the result of the initial constrained clustering of the 4th spectral-slice. We apply semantic constraints to K^0 by introducing L, compare the clustering results of the spectral-slices, and keep only the samples assigned to the same semantic class as the clustering result C^0. Finally, the initial classification result Y^0 is obtained by applying the spatial constraint to C^0.
In subsequent training, the feature set X^t = {X^t_1, X^t_2, ..., X^t_4} is extracted directly from the fully convolutional layers of the CNN (t denotes the t-th iteration), and X^t_4 denotes the feature set extracted from the 4th spectral-slice at the t-th iteration. Each feature comes from a CNN-encoded, sample-centered image patch of size 9 × 9. The subsequent constrained clustering and spatial constraint steps are similar to the initial process, except that each classification result Y^t is updated (t denoting the t-th classification result).
A new round of CNN training is then performed with the updated Y and the input data, and the process iterates until the stopping condition is satisfied. In this paper, we found through many experiments that the HSIc accuracy tends to converge after about 12 rounds of iteration.
Algorithm 1 HSIc algorithm based on self-training
1: Input:
2: I = input hyperspectral image set.
3: L = initial label sets of hyperspectral images.
4: T = number of training epochs.
5: K* = number of over-clustering clusters.
6: Output:
7: Y* = final results of hyperspectral image classification.
8: θ* = final parameters of the CNN.
9: divide image set I into n spectral-slice sets Z = {Z_1, Z_2, ..., Z_n}.
10: extract the initial feature set X^0 = {X^0_1, X^0_2, ..., X^0_n} from each spectral-slice.
11: obtain the initial clustering results K^0 = {K^0_1, K^0_2, ..., K^0_n} by K-means clustering.
12: apply semantic constraints to K^0 by introducing L and compare the initial clustering results of each spectral-slice to get C^0.
13: apply spatial constraints to C^0 to get the initial Y^0.
14: initialize the parameters of the CNN θ^1.
15: t ← 1
16: while K^t > K* or t < T do
17:   update θ^t to θ^{t+1} by training the CNN with labels Y^t.
18:   update X^t to X^{t+1} by feeding Z into the CNN.
19:   obtain the clustering results K^t by K-means clustering.
20:   apply semantic constraints to K^t by introducing Y^{t−1} and compare the clustering results of each spectral-slice to get C^t.
21:   apply spatial constraints to C^t to get Y^t.
22:   t ← t + 1
23: end while

Data Sets
To assess the efficacy of our method, we use three publicly available hyperspectral data sets (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes). The first hyperspectral image used in the experiments was collected by the AVIRIS sensor over the Indian Pines region in Northwestern Indiana in 1992. The second hyperspectral data set was collected by the ROSIS optical sensor over the urban area of the University of Pavia, Italy. The third scene was collected by the 224-band AVIRIS sensor over Salinas Valley, California. Detailed descriptions of these data sets follow.

Indian Pines Data Set
The Indian Pines data set is shown in Figure 4. It consists of 145 × 145 pixels and 224 spectral reflectance bands in the wavelength range 0.4-2.5 µm. The scene contains two-thirds agriculture and one-third forest or other natural perennial vegetation. There are two major two-lane highways, a rail line, as well as some low-density housing, other built structures, and smaller roads. Because the image was taken in June, some crops, such as corn and soybeans, are in an early growth stage with less than 5 percent coverage. The available samples are divided into 16 categories, for a total of 10,366 samples. The number of bands is reduced to 200 by removing the bands covering the water absorption region. The sample categories, numbers, and colors of this data set are listed in Table 1.

Pavia University Data Set

The Pavia University data set is shown in Figure 5. It consists of 610 × 610 pixels and 103 spectral bands with a spatial resolution of 1.3 m, but some samples in the image contain no information and must be discarded before analysis. The ground truth comprises 9 classes with a total of 42,761 samples. The sample categories, numbers, and colors are listed in Table 2. Figure 5. Pavia University data set: the left picture shows a pseudo-color image of the data set, and the right picture shows its geographical location.

Salinas Scene Data Set
The Salinas Scene data set is shown in Figure 6. This scene was collected by the 224-band AVIRIS sensor over Salinas Valley, California, and is characterized by high spatial resolution (3.7-meter pixels). The area covered comprises 512 lines by 217 samples. As with the Indian Pines scene, we discarded the 20 water absorption bands. This image was available only as at-sensor radiance data. It includes vegetables, bare soils, and vineyard fields. The Salinas ground truth contains 16 classes. The sample categories, numbers, and colors are listed in Table 3, which includes (among others) the following classes and sample counts: 3 Fallow (1976), 4 Fallow_rough_plow (1394), 5 Fallow_smooth (2678), 6 Stubble (3959), 7 Celery (3579), 8 Grapes_untrained (11271), 11 Lettuce_romaine_4wk (1068), 12 Lettuce_romaine_5wk (1927), 13 Lettuce_romaine_6wk (916), 14 Lettuce_romaine_7wk (1070), 15 Vinyard_untrained (7268), 16 Vinyard_vertical_trellis (1807).

Experimental Design
We compare the proposed method with several semi-supervised HSIc methods based on self-training frameworks, all of which use deep learning to extract spectral-spatial features. First, we compare our results with CDL, which serves as a benchmark. We then compare with the classical semi-supervised classification method SNI-L: SNI-L first uses criteria based on spatial neighborhood information to infer a candidate set, assumes that pixels spatially adjacent to a training sample can be given the same label, and then automatically selects new samples from the candidate set. Finally, to keep the comparison current, we also compare with the more recent semi-supervised classification method SNI-unL, which combines the spatial neighborhood information of unlabeled samples with a classifier to enhance the classification ability of the selected unlabeled samples.
Different HSIc methods are compared on three common metrics: OA (overall accuracy), the ratio of correctly classified samples over the entire test set; AA (average accuracy), the mean of the per-class classification accuracies; and the Kappa coefficient [55], which is calculated from the entries of the confusion matrix and is a robust measure of agreement beyond chance.
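For reference, the three metrics can be computed from the confusion matrix as sketched below (a straightforward implementation, not the exact code used in the experiments):

```python
def classification_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average per-class accuracy (AA) and
    Cohen's kappa from predicted vs. reference labels."""
    conf = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1                 # rows: reference, cols: predicted
    total = len(y_true)
    diag = sum(conf[i][i] for i in range(n_classes))
    oa = diag / total
    # Per-class recall, averaged over classes that occur in y_true.
    accs = [conf[i][i] / sum(conf[i]) for i in range(n_classes) if sum(conf[i])]
    aa = sum(accs) / len(accs)
    # Expected chance agreement from the row/column marginals.
    pe = sum(sum(conf[i]) * sum(conf[j][i] for j in range(n_classes))
             for i in range(n_classes)) / (total * total)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, aa, kappa
```

OA can be inflated by large classes, AA weights each class equally, and kappa discounts the agreement expected by chance, which is why all three are reported together.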
In the process of extracting spectral-spatial features through the CNN, normalization is applied to the first, fourth, and seventh network layers, mapping the feature maps uniformly onto the interval [0, 1]. Because some feature maps have very large amplitudes that can mask other feature maps, normalization keeps the feature maps balanced.
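This per-feature-map normalization amounts to a min-max rescaling, which might be sketched as follows (our reading of the description; constant maps are mapped to zero by assumption):

```python
def minmax_normalize(feature_map):
    """Map one feature map (nested list) linearly onto [0, 1] so that
    large-amplitude maps cannot dominate the others."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:                      # constant map: return zeros
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]
```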

Experimental Result
First, we test the performance of our method and compare it with the CDL, SNI-L and SNI-unL algorithms. We randomly select 10 labeled samples for each class as the training set, and the rest of the data is used as the test set. The experimental results are shown in Figures 7-9.
Figure 7a shows the Indian Pines data, Figure 7b the selected small set of labeled samples, and Figure 7c-g the outputs of the first, third, fifth, seventh, and ninth iterations of the algorithm; the classification accuracy (OA) at these iterations is given in Table 4. Figure 7h is the ground truth of the Indian Pines data set. The algorithm stops at the 11th iteration with a final classification accuracy (OA) of 87.35%. Similarly, Figure 8a shows the Pavia University data, Figure 8b the selected labeled samples, and Figure 8c-g the outputs of the first, third, fifth, seventh, and ninth iterations, with the corresponding OA in Table 4. Figure 8h is the ground truth of the Pavia data set. The algorithm stops at the 12th iteration with a final OA of 85.63%. Figure 9a shows the Salinas data, Figure 9b the selected labeled samples, and Figure 9c-g the outputs of the first, third, fifth, seventh, and ninth iterations, with the corresponding OA in Table 4. Figure 9h is the ground truth of the Salinas data set. The algorithm stops at the 11th iteration with a final OA of 97.37%. The average OA, AA, and Kappa over five runs are shown in Table 5. For the Indian Pines data set, we use a 31 × 31 window for the spatial constraint; the CDL parameters are set according to the original paper, with the first-layer window size set to 9 × 9. For the Pavia University data set, the spatially constrained window size is 47 × 47 and the first-layer window size is 21 × 21.
The above experimental results show that our method suppresses the generation of erroneous pseudo labels by adding semantic attribute constraints and spatial constraints. In each iteration, the CNN extracts better representation features, which helps the classifier to classify more accurately. As the OA, AA, and Kappa results show, our method performs well even when the initial labels are very few (10 per class). We also computed the variance of the results over the five runs; the variances in Table 4 indicate that our method is relatively stable.
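The three indicators reported throughout (OA, AA, and Kappa) can all be computed from a confusion matrix. The following is a minimal sketch of those standard definitions; the function name and the toy 3-class matrix are illustrations of ours, not taken from the experiments above:

```python
import numpy as np

def classification_metrics(conf):
    """Compute OA, AA and Cohen's kappa from a confusion matrix.

    conf[i, j] = number of test samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                      # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)     # per-class accuracy (recall)
    aa = per_class.mean()                            # average accuracy
    # chance agreement from the row/column marginals
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy 3-class confusion matrix (50 test samples per class)
conf = [[48, 1, 1],
        [2, 45, 3],
        [0, 4, 46]]
oa, aa, kappa = classification_metrics(conf)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.9267 0.9267 0.89
```

AA differs from OA when the classes are imbalanced, which is why both are reported: OA can be dominated by large classes, while AA weights every class equally.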

Discussion
For this method, the number of initial labels selected has a large impact on performance. To investigate this effect, we conduct an experimental validation: the number of initial labels per class is varied from 6 to 15, and the values of the three indicators OA, AA, and Kappa are recorded. We run five experiments and average the three indicators, then plot them with the number of initial labels on the x-axis and the corresponding OA, AA, or Kappa value on the y-axis. The resulting curves confirm that the number of initial labels does affect the labeling result. Figure 10 shows how the OA of the four classification methods (ours, CDL, SNI-L, and SNI-unL) changes with the number of initially selected labels. Overall, OA increases as the number of initial labels increases, indicating that more initial labels yield a higher proportion of correctly classified samples in the test set. Moreover, the OA of our method is always higher than that of the other methods, which indicates that our classification method is superior. Figure 11 shows the change in the AA indicator for different numbers of initial labels; a higher AA means a better classification effect for each class. Like OA, AA also increases with the number of initial labels, reflecting that the algorithm's per-class classification effect improves. The AA of our method is much higher than that of the other methods, showing that our method treats all classes more evenly.
Similarly, Figure 12 shows how the Kappa indicator changes with the number of initially selected labels. Kappa values are commonly divided into five groups indicating different levels of agreement: 0.0-0.20 is slight, 0.21-0.40 is fair, 0.41-0.60 is moderate, 0.61-0.80 is substantial, and 0.81-1.00 is almost perfect. Almost all of the algorithms' Kappa values increase as the number of initial labels increases. The Kappa of our method lies almost entirely between 0.81 and 1 and is consistently higher than that of the other algorithms; that is, our method has higher classification consistency.
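The five-group interpretation of Kappa can be expressed as a small helper; this is our own illustration of the standard grouping, not code from the proposed framework:

```python
def kappa_level(kappa):
    """Map a Cohen's kappa value to the five agreement levels in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(kappa_level(0.87))  # prints "almost perfect"
```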

Conclusions
In this paper, we introduce a novel semi-supervised classification algorithm for HSIc based on the cooperation between deep learning models and clustering. The algorithm builds on self-training to address the fact that hyperspectral images contain a large number of unlabeled samples while labeling samples is costly. First, we use a CNN to extract spectral-spatial features. The extracted features are then used for semantic-constrained clustering. The novelty of this paper is that the hyperspectral image is divided into several spectral slices, each spectral slice is clustered separately under semantic constraints, and the clustering results are compared: only the samples assigned to the same semantic class across slices are retained as the clustering result for the entire hyperspectral image. Because adjacent pixels in hyperspectral images usually belong to the same class, we introduce local decisions to smooth the pseudo labels after obtaining the clustering result for the whole image. Our algorithm is compared with CDL, SNI-L, and SNI-unL. The averages of the three indicators OA, AA, and Kappa for our algorithm are higher than those of the other three algorithms, and the variance of our results shows that our algorithm is also stable.
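The cross-slice agreement and local-decision smoothing steps described above can be sketched as follows. This is a minimal illustration with hypothetical function names; the CNN feature extraction and the semantic-constrained clustering stage itself are omitted, and -1 marks pixels whose pseudo label is not yet confident:

```python
import numpy as np

def agree_pseudo_labels(slice_labels):
    """Keep a pseudo label only where all spectral slices agree.

    slice_labels: list of (H, W) integer label maps, one per spectral slice,
    each produced by semantic-constrained clustering (stage not shown).
    Returns an (H, W) map with -1 wherever the slices disagree.
    """
    stack = np.stack(slice_labels)                 # (S, H, W)
    agreed = np.all(stack == stack[0], axis=0)     # True where all slices match
    return np.where(agreed, stack[0], -1)

def spatial_smooth(labels, window=3):
    """Local decision: fill unconfident pixels (-1) by majority vote
    of the confident pseudo labels inside a window."""
    h, w = labels.shape
    r = window // 2
    out = labels.copy()
    for i in range(h):
        for j in range(w):
            if labels[i, j] >= 0:
                continue                           # keep confident labels
            patch = labels[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            votes = patch[patch >= 0]
            if votes.size:
                vals, counts = np.unique(votes, return_counts=True)
                out[i, j] = vals[np.argmax(counts)]
    return out

# Two 2x2 spectral slices that disagree at the bottom-right pixel
a = np.array([[1, 1], [2, 0]])
b = np.array([[1, 1], [2, 1]])
fused = agree_pseudo_labels([a, b])    # [[1, 1], [2, -1]]
smoothed = spatial_smooth(fused)       # [[1, 1], [2, 1]]
print(fused.tolist(), smoothed.tolist())
```

The sketch reflects the design intent: disagreement between slices withholds a pseudo label rather than guessing, and the spatial constraint then re-assigns the withheld pixels from their confident neighbours.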
The framework has some drawbacks. (1) This paper is based on a self-training algorithm, whose shortcoming is that samples mislabeled in an earlier iteration affect the later iterations, and this impact accumulates. At present there is no way to completely solve this problem; our method can only partially reduce the error rate. If future research solves it, we will continue to optimize the algorithm in this paper; we may also consider adding new constraints to improve classification accuracy, or adopting a new, more advanced framework. (2) At the boundaries between different regions, adjacent samples belong to different classes, yet their image patches largely overlap, so the features extracted by the CNN may be similar. The experimental results on the three data sets show that the Indian Pines and Salinas scenes are relatively simple and their classification accuracy (OA) is relatively high, while the Pavia University scene is complex and its classification accuracy (OA) is relatively low.