Semi-Supervised Deep Learning Classiﬁcation for Hyperspectral Image Based on Dual-Strategy Sample Selection

Abstract: This paper studies the classification problem of hyperspectral images (HSIs). Inspired by the great success of deep neural networks in artificial intelligence (AI), researchers have proposed different deep learning based algorithms to improve the performance of hyperspectral classification. However, deep learning based algorithms always require a large-scale annotated dataset to provide sufficient training. To address this problem, we propose a semi-supervised deep learning framework based on residual networks (ResNets), which uses very limited labeled data supplemented by abundant unlabeled data. The core of our framework is a novel dual-strategy sample selection co-training algorithm, which successfully guides the ResNets to learn from the unlabeled data by making full use of the complementary cues of the spectral and spatial features in HSI classification. Experiments on benchmark HSI datasets and a real HSI dataset demonstrate that, with a small amount of training data, our approach achieves competitive performance for HSI classification, with a maximum improvement of 41% (compared with a traditional convolutional neural network (CNN) with 5 initial training samples per class on the Indian Pines dataset) over the results of state-of-the-art supervised and semi-supervised methods.


Introduction
Hyperspectral images (HSIs) collected by imaging spectrometers capture rich spectral and spatial information simultaneously [1]. For this reason, hyperspectral data are used in a wide range of applications such as environmental sciences [2], agriculture [3], and mineral exploitation [4]. HSI classification is one of the most important topics in remote sensing. Specifically, combining the rich spectral information and spatial information as complementary cues represents an opportunity to dramatically improve the performance of HSI classification.
Most recently, deep learning has emerged as the state-of-the-art machine learning technique with great potential for HSI classification. Instead of depending on manually engineered shallow features, deep learning techniques automatically learn hierarchical features (from low level to high level) from raw input data [5,6]. Inspired by the great success of deep learning for image classification, remarkable efforts have been invested in spectral-spatial HSI classification with deep learning techniques in the last few years [7][8][9][10][11][12]. These deep learning algorithms fall into two broad categories. The first category separates the feature learning and classification steps. For example, Chen et al. [7] applied deep feature learning with a stacked autoencoder (SAE), and a deep belief network (DBN) [8] has been used for spectral-spatial feature extraction. It should be noted that deep neural networks always require large datasets for supervised training, e.g., ImageNet with millions of annotated images [12]. However, labeling a large archive of hyperspectral data for a classification task is very expensive and time consuming. To address this challenge, Yang et al. [13] proposed a deep CNN with a two-branch architecture to extract the joint spectral-spatial features from HSI, which is reportedly beneficial when the number of training samples is limited. Ma et al. [14] developed a spatially updated deep autoencoder; in order to deal with the small training set using deep features, a collaborative representation-based classification is applied.
Although previous works use supervised methods with small training sets, they do not benefit from the massive unlabeled data to promote the classification performance. Thus, it is necessary to develop a new effective training framework for deep learning that benefits from the massive unlabeled data which is already available. To make full use of unlabeled samples, some semi-supervised methods exist in the literature [15,16]. We are particularly interested in the co-training algorithm, which is an important paradigm of semi-supervised methods [17][18][19]. Blum and Mitchell [20] have given theoretical proofs that guarantee the success of co-training in utilizing the unlabeled samples. At each iteration of the co-training process, two learners are trained independently from two views and are required to label some unlabeled examples for each other to augment the training set [19]. The co-training strategy has already been considered for HSI classification under the conditions that: (1) Each example contains two views, either of which is able to depict the example well; and (2) the two views should not be highly correlated. Hyperspectral data match the two conditions well by providing spectral features and spatial features that are conditionally independent [21], and co-training can exploit the limited labeled data together with the massive unlabeled data to improve the performance. Romaszewski et al. [18] used a co-training approach with the P-N learning scheme, in which the P-expert assumes the same class labels for spatially close pixels and the N-expert detects pixels with similar spectra; the P-expert and N-expert take advantage of the spatial structure and the spectral structure, respectively. Tan et al. [17] presented a tri-training method that exploits spectral and spatial information for hyperspectral data classification based on active learning and multi-scale homogeneity.
In order to make accurate predictions of the unknown labels of a sparsely labeled image, Appice et al. [21] applied a transductive learning approach with a co-training schema.
To the best of our knowledge, in previous co-training algorithms, the unlabeled samples selected for augmenting the training set are the ones with the highest confidence according to a single-view (spectral or spatial) sample selection criterion, such as spatial-neighbor sample selection with active learning [17], spectral-neighbor sample selection with Euclidean spectral distance [18], spatial information extracted with a segmentation algorithm [22], or spatial example selection with the class-diversity criterion [19]. However, when only a few training samples are available, as is the case with spectral-spatial HSI classification, this strategy is not appropriate because the training samples are too few to adequately describe the distribution of the data from either the spectral or the spatial view. To address this issue, we propose a new sample selection scheme for the co-training process based on both spectral-feature and spatial-feature views.
Another obstacle in deep neural network training is that when deeper networks start converging, a degradation problem is exposed [23]. Fortunately, the degradation problem caused by increasing the number of convolutional layers can be solved by adding shortcut connections between every other layer and propagating the feature values, as in the latest residual network learning framework (ResNet) proposed by He et al. [23]. Zhong et al. [24] have used ResNet for supervised HSI classification, but that method does not exploit the unlabeled data. In this paper, therefore, the goal is to develop a semi-supervised deep learning classification framework based on co-training. The framework is illustrated in Figure 1b.
The pipeline of the framework can be summarized as follows. First, the spectral-ResNet and spatial-ResNet models are trained on the given labeled data of the respective views. Then at each iteration of co-training, two models are applied to predict the unlabeled sets, and the most confident labeled samples are used to augment the training set of the other model (view). The iterative process is repeated until some stopping criterion has been reached. Finally, the classification result of the spectral features is fused with that of the spatial features to obtain the label of the test data.
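The iterative pipeline above can be sketched as a short co-training loop. This is an illustrative skeleton, not the authors' implementation: `train_view` and `select_confident` are hypothetical stand-ins (here stubbed with trivial logic so the control flow runs), and the key point is the exchange of each view's confident samples into the *other* view's labeled pool.

```python
def train_view(labeled):
    # stub "model": simply remembers its training set
    return set(labeled)

def select_confident(model, unlabeled):
    # stub selection: pick one sample deterministically so the loop runs;
    # the paper's dual-strategy selection would go here
    return {min(unlabeled)} if unlabeled else set()

def co_train(labeled_spec, labeled_spat, unlabeled, n_iters=3):
    """Alternate between training the two views and exchanging confident samples."""
    spec_model = spat_model = None
    for t in range(n_iters):
        spec_model = train_view(labeled_spec)        # spectral-ResNet stand-in
        spat_model = train_view(labeled_spat)        # spatial-ResNet stand-in
        h_spec = select_confident(spec_model, unlabeled)
        h_spat = select_confident(spat_model, unlabeled)
        # each view's confident picks augment the OTHER view's labeled pool
        labeled_spec |= h_spat
        labeled_spat |= h_spec
        unlabeled -= h_spec | h_spat
    return spec_model, spat_model
```

After the loop, the two models' test-set predictions are fused as described above.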
The main contribution of this paper can be summarized as three aspects. Firstly, ResNets are used to extract the spectral features and spatial features for HSI classification. The identity mapping of the ResNet can alleviate the degradation of the classification performance of deep learning models caused by increased depth. Secondly, in order to select a set of informative and high confident samples from the unlabeled datasets to update the next round training of the deep learning models effectively, a new sample selection scheme for co-training process based on spectral features and spatial features views is proposed. Finally, we verify the advantages of our method by testing it on several benchmark HSI datasets and a selected Hyperion dataset.
The remainder of this paper is organized as follows. In Section 2, the general framework is presented, and a sample selection scheme is presented in detail. We present the experimental results and discuss about the experimental results in Sections 3 and 4, respectively. Finally, in Section 5, the paper is summarized, and the future works are suggested.

Overview
The proposed framework aims at learning a powerful semi-supervised deep learning framework for HSI classification based on limited labeled data and a wealth of unlabeled data. To be specific, we have a small labeled pool L and a large-scale unlabeled hyperspectral dataset U. The proposed framework is shown in Figure 2, where a spectral-spatial co-training algorithm based on deep learning is introduced to learn from the unlabeled data. Now, the three important phases of the framework will be introduced. Figure 2. Overview of the semi-supervised deep learning framework for hyperspectral image (HSI) classification. The training of the framework mainly involves two iterative steps: (1) Training the spectral- and spatial-models over the respective data based on the labeled pool (indicated as solid lines); (2) applying each model to predict the unlabeled HSI data and using the respective sample selection strategy to select the most confident samples for the other (indicated as dashed lines; see details in the text). After all iterations of co-training are completed, the classification results of the test dataset obtained through the two trained networks are fused, and the label of the test dataset is obtained (indicated as solid black lines).
Network Architectures. Building a suitable network architecture is the first prerequisite of the whole system. This paper adopts the architecture of ResNet (Section 2.2) to extract both the spectral and spatial features. Residual networks can be regarded as an extension of CNNs with skip connections that facilitate the propagation of gradients and perform robustly with very deep architectures [24]. However, training such deep learning models with an extremely limited number of samples is difficult. To address this problem, we utilize the regularization method batch normalization (BN) [25] to prevent the learning process from overfitting.
Training Process. Training of the semi-supervised deep learning framework mainly involves two iterative steps: Training each ResNet model and updating the labeled pool, as illustrated in Figure 2. More specifically, at the t-th iteration of co-training, the labeled training samples and unlabeled samples are denoted as L_t and U_t, respectively. To effectively select informative and confident samples from U_t to update the next round of training of the deep learning models, a dual-strategy sample selection co-training algorithm based on spectral and spatial features is introduced in Section 2.3.
The goal of the proposed dual-strategy sample selection method is that labeling and selecting the unlabeled samples for each model are based on both spectral and spatial features. To this end, for the spectral view of the co-training, we propose a new similarity metric based on deep spectral feature learning; it is a measurement that defines the relationship between two samples. In particular, we extract hierarchical features from a deep network for all available samples (labeled and unlabeled), and the distance between labeled and unlabeled samples is then given by the Euclidean distance. Using this method, we can select the most confident spectral samples, those with high similarity to each labeled sample, to be included in the new training set, on the condition that the spectral-ResNet agrees on the labeling of these unlabeled samples. For the spatial view of the co-training, we use a spatial neighborhood information extraction strategy to select the most confident spatial neighbors as the new training set, on the condition that the spatial-ResNet agrees on the labeling of these unlabeled samples. Such a dual strategy is believed to select the most useful and informative samples to update the training set for the next round of training of the deep learning models.
Testing Process. The iterative process is repeated until some stopping criterion has been reached. The outputs of the fully connected layers represent the spectral features and spatial features, which are followed by softmax regression classifiers (referred to in this work as the spectral classifier and spatial classifier) to predict the probability distribution over the classes. Finally, the prediction probability vectors of the test dataset from the two channels are summed to get the final classification result, and the label of the test dataset is obtained, as indicated by the solid black lines in Figure 2.
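The decision-fusion step at test time is simple: sum the two views' class-probability vectors and take the argmax. A minimal NumPy sketch (the probability values below are illustrative):

```python
import numpy as np

def fuse_predictions(p_spectral, p_spatial):
    """Sum the two views' per-class probability vectors and take the argmax."""
    return np.argmax(p_spectral + p_spatial, axis=1)

# toy example: 2 test pixels, 3 classes
p_spec = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
p_spat = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.2, 0.7]])
labels = fuse_predictions(p_spec, p_spat)   # the summed scores decide each label
```

Summation (rather than, say, multiplication) treats the two classifiers as equally weighted voters, which matches the description above.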

Networks Architectures Based on Spectral and Spatial Features
ResNet is constructed via stacking residual blocks, and it skips blocks of convolutional layers by using shortcut connections to form residual blocks. By using shortcut connections, residual networks perform residual mapping fitted by stacked nonlinear layers, which is easier to optimize than the original mapping [23]. These stacked residual blocks significantly improve training efficiency and largely resolve the degradation problem by employing batch normalization (BN) [25]. Inspired by the latest residual network learning framework proposed by He et al. [23], the architecture of our network for each model, as shown in Figure 3, contains one convolutional layer and two "bottleneck" building blocks, and each building block has one shortcut connection. Each residual block can be expressed in a general form as

x_{l+1} = f(h(x_l) + F(x_l, W_l)),

where x_l and x_{l+1} are the input and output of the l-th block, respectively, W_l denotes the parameters of the residual structure, F is the residual mapping function, h(x_l) = x_l is the identity mapping function, and f is a Rectified Linear Unit (ReLU) [26] function. For each residual mapping function F, the "bottleneck" building block has a stack of 3 layers. Take the spatial-ResNet on the right of Figure 3 as an example: the three layers are [1 × 1, 20], [3 × 3, 20], and [1 × 1, 80] convolutions, where the 1 × 1 convolution layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3 × 3 convolution layer a bottleneck with smaller input/output dimensions [23]. To regularize and speed up the training process, we adopt batch normalization (BN) [25] right after each convolution and before activation. BN standardizes the mean and variance of the hidden layers for each mini-batch, and is defined as follows:
x̂^(i) = (x^(i) − E[x^(i)]) / √(Var[x^(i)]),

where x̂^(i) is the i-th dimension of the feature batch x, E(·) represents the expected value, and Var(·) is the variance of the features. In order to prevent overfitting, dropout (random omission of part of the features during each training case) is employed in our method after the average pooling in each branch; the dropout rate is set to 0.5. Figure 3. A residual network for the spatial-ResNet model, which contains two "bottleneck" building blocks, each with one shortcut connection. The number on each building block is the number of output feature maps. F(x) is the residual mapping and x is the identity mapping; for each residual function, we use a stack of 3 layers. The original mapping is represented as F(x) + x.
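The residual-block formula and the BN standardization above can be sketched together in NumPy. This is a simplified illustration under assumptions: dense layers stand in for the 1 × 1/3 × 3/1 × 1 convolutions, and BN's learnable scale/shift parameters are omitted; the convolution-then-BN-then-ReLU ordering and the identity shortcut do follow the text.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standardize each feature dimension over the mini-batch (the BN equation
    above), without the learnable scale/shift for brevity."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, weights):
    """x_{l+1} = f(h(x_l) + F(x_l, W_l)), with F a small stack of linear layers
    (stand-ins for the bottleneck convolutions), each followed by BN and ReLU."""
    out = x
    for w in weights:
        out = relu(batch_norm(out @ w))   # "convolution" -> BN -> activation
    return relu(x + out)                  # identity shortcut h(x_l) = x_l

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                           # mini-batch of 8, 16 features
ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(3)]  # 3-layer residual stack
y = residual_block(x, ws)
```

Because the shortcut adds the unchanged input back in, gradients can flow around the stacked layers, which is what alleviates the degradation problem.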
In the spectral-ResNet model, for each HSI pixel to be processed, a 3 × 3 × K-sized cube is extracted from its eight neighborhoods as the original input data (the size of the spatial neighborhood is empirically determined). To meet the input requirement of the spectral-ResNet, the original data are re-arranged into nine pixel vectors, each of length K, where K is the number of bands, as shown in Figure 2. It should be noted that 1D kernels are exploited to effectively capture the intrinsic spectral content along the 1D spectral dimension. In the 1D convolution operation, the input data are convolved with 1D kernels, and the convolved data then go through the activation function to form the feature vectors. The data are re-arranged in the spectral-ResNet to extract high-level abstract spectral features. Suppose the set of labeled samples is L_t^spectral and the set of unlabeled samples is U_t^spectral.
In the spatial-ResNet model, for a certain pixel in the original HSI, it is natural to consider its neighboring pixels for the extraction of spatial features. However, due to the hundreds of bands along the spectral dimension of HSI, the region-based feature vector would result in too large an input dimension. This problem can be solved by principal component analysis (PCA); as PCA is conducted on the pixel spectrum, the spatial information remains intact. We reduce the spectral dimension of the original HSI to three, which is empirically chosen as a trade-off between accuracy and computational complexity with minimum information loss. Then, for each pixel, we choose a relatively large image patch (27 × 27 in our experiments) from its neighborhood window as the input of the spatial-ResNet model, as shown in Figure 2. In each 2D convolutional layer, the image patch is convolved with 2D kernels and then goes through the activation function to form the feature maps.
Then the high-level spatial features can be extracted by the spatial-ResNet model. Suppose the set of labeled samples is denoted as L_t^spatial and the set of unlabeled samples as U_t^spatial.
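The spatial-ResNet input preparation described above, PCA down to three components followed by 27 × 27 patch extraction, can be sketched as follows. This is an illustrative NumPy version (SVD-based PCA, no boundary padding), not the authors' exact preprocessing code.

```python
import numpy as np

def pca_reduce(cube, n_components=3):
    """Project each pixel spectrum onto its top principal components; only the
    spectral dimension is reduced, so the spatial arrangement stays intact."""
    h, w, k = cube.shape
    x = cube.reshape(-1, k)
    x = x - x.mean(axis=0)                       # center the spectra
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # principal directions
    return (x @ vt[:n_components].T).reshape(h, w, n_components)

def extract_patch(img, row, col, size=27):
    """Neighborhood window around (row, col), used as spatial-ResNet input."""
    half = size // 2
    return img[row - half:row + half + 1, col - half:col + half + 1]

cube = np.random.default_rng(1).normal(size=(64, 64, 100))  # toy HSI cube
reduced = pca_reduce(cube)             # 64 x 64 x 3
patch = extract_patch(reduced, 32, 32) # 27 x 27 x 3 spatial-ResNet input
```

In practice, border pixels need padding (e.g., mirroring) so that every pixel gets a full 27 × 27 patch; that detail is omitted here.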

Dual-Strategy Sample Selection Co-Training
The goal of the dual-strategy sample selection co-training algorithm is to select highly confident examples with predicted labels from the unlabeled pool based on spectral and spatial features. These newly labeled examples by each model can boost the performance in the next round of training. Now we introduce the three main components of the iteration algorithm. For clarity, we omit the iteration number t of co-training in the equations below.

New Sample Selection Mechanism Based on Spectral Feature
For the spectral-ResNet model, all bands of the labeled data are used to train the model, so we take full advantage of the spectral characteristics and the inherent deep features to select the most confident samples. Thus, we propose a new sample selection mechanism based on spectral features and deep learning.
Since the spectral information of the same class is similar and the labeled samples are limited, we intend to take the samples with the highest similarity as the most confident samples for each class. First, we define a distance metric between a test sample and a class dataset. For the t-th iteration of co-training, we get a candidate set for each class after the unlabeled samples in U_t^spectral are labeled:

D(x_M, L_M) = inf{ d(x_M, l_M) : l_M ∈ L_M },

where inf represents the infimum and l_M is each sample in the training set L_M with label y_M. Then, the main problem is how to define the distance d between two samples. In order to take advantage of the inherent deep features to describe the distribution of the hyperspectral data, we give a definition of a new metric between two hyperspectral samples based on deep learning. Some research [27,28] has shown that combining the features from lower layers can capture finer features. Moreover, using hierarchical features to describe the distribution of the data can alleviate the problem of intra- and inter-class variation in the data [29]. Inspired by this observation, for the spectral-ResNet described in Section 2.2, we can extract a multi-level representation for each of the unlabeled samples. The multi-level representation consists of the outputs of the first convolutional layer and the two building blocks, denoted as r_1, r_2 and r_3, respectively. Then, the hierarchical representation of each sample is r = [r_1; r_2; r_3]. The distance between two samples x_M and l_M is given by the Euclidean distance between their hierarchical representations:

d(x_M, l_M) = ||r_{x_M} − r_{l_M}||_2.

Last, we define a similarity metric between x_M and the training set L_M based on this distance metric. For this similarity metric, the most confident samples belonging to L_M are those with distance close to zero, and the corresponding similarity is close to one.
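The distance and similarity metrics above can be sketched as follows. Note one assumption: the exact similarity formula is not reproduced here, so `exp(-D)` is used as one plausible form; the text only requires that a distance near zero maps to a similarity near one.

```python
import numpy as np

def set_distance(r_x, r_labeled):
    """D(x_M, L_M): infimum (here, min over a finite set) of Euclidean distances
    between the hierarchical representation r = [r1; r2; r3] of x_M and those
    of the labeled samples in L_M."""
    return min(np.linalg.norm(r_x - r_l) for r_l in r_labeled)

def similarity(r_x, r_labeled):
    """Map the distance into (0, 1] so that D -> 0 gives similarity -> 1.
    exp(-D) is an ASSUMED form, standing in for the paper's metric."""
    return float(np.exp(-set_distance(r_x, r_labeled)))
```

With this, the most confident unlabeled samples for a class are those whose hierarchical representation nearly coincides with some labeled sample of that class.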

Sample Selection Mechanism Based on Spatial Feature
For the spatial-ResNet model, PCA is executed to map the hyperspectral data in the first step; this step keeps the spatial information intact but casts away part of the redundant spectral information. Since spatial consistency has been found among neighboring pixels, the neighbors of the labeled samples are identified using second-order spatial connectivity under the spatial consistency assumption. Then the most confident samples are selected from the classification of the spatial-ResNet and the neighbors of the labeled samples.
For illustrative purposes, Figure 4 shows a toy example of the sample selection mechanism based on spatial features. On the left of Figure 4, we display the available labeled samples for two different classes, 1 and 2. These labeled samples are used to train the spatial-ResNet, and the second-order neighborhoods of the labeled samples are labeled by the trained spatial-ResNet, as illustrated in the upper middle part of Figure 4. In the lower middle part of Figure 4, we label the neighbors of the labeled samples under the spatial consistency assumption. Finally, the most confident samples are selected, as shown on the right of Figure 4.
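The second-order spatial connectivity used here is the 8-connected neighborhood of a pixel. A minimal sketch of the candidate-set construction for the spatial view:

```python
def second_order_neighbors(row, col, height, width):
    """8-connected (second-order) neighborhood of a labeled pixel, clipped to
    the image bounds; these positions form the spatial-view candidate set."""
    return [(r, c)
            for r in range(row - 1, row + 2)
            for c in range(col - 1, col + 2)
            if (r, c) != (row, col) and 0 <= r < height and 0 <= c < width]
```

A candidate neighbor is then kept only when the spatial-ResNet's prediction for it agrees with the label propagated from the central labeled pixel.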

Co-Training
Finally, the two well-trained ResNet models are utilized to predict the unlabeled pool over the respective modalities. Highly confident sets H_spectral and H_spatial can be selected with the new sample selection mechanisms. We then attach the spectral data H_spectral and spatial data H_spatial to update the labeled pools as L_{t+1}^spatial = L_t^spatial ∪ H_spectral and L_{t+1}^spectral = L_t^spectral ∪ H_spatial. This step basically identifies the labeled samples with the highest confidence scores combined from both views. During the next round of training, H_spectral will improve the spatial-ResNet model, as these are new and informative labeled samples, and likewise H_spatial will improve the spectral-ResNet.

Experimental Results and Analyses
In this section, the effectiveness of the proposed method is tested on the classification of three open source hyperspectral datasets, namely, the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Indian Pines dataset, the Reflective Optics System Imaging Spectrometer (ROSIS-03) University of Pavia dataset, and the AVIRIS Salinas Valley dataset, as well as one selected Hyperion dataset. In our experiments, firstly, the performance of the proposed method is compared with three state-of-the-art HSI classification methods: (1) CNN [10], a supervised classification using a deep CNN to extract the joint spectral-spatial features from HSI; (2) CDL-MD-L [15], a self-training semi-supervised classification approach based on contextual deep learning classification (CDL) and multi-decision labeling (MD-L); and (3) Co-DC-CNN: DC-CNN [30] is a dual-channel CNN with non-residual networks from our previous work, which we extend to a co-training approach denoted as Co-DC-CNN. Then, we compared our results against three semi-supervised classification methods based on spectral-spatial features and co-training: (1) PNGrow [18], a semi-supervised classification algorithm using a co-training approach with the P-N learning scheme, in which the P-expert and N-expert take advantage of the spatial structure and the spectral structure, respectively; (2) TT_AL_MSH_MKE [17], a tri-training technique for spectral-spatial HSI classification based on active learning (AL) and multi-scale homogeneity (MSH), where MKE is a combination of MLR, KNN and ELM; and (3) S2CoTraC [21], a semi-supervised classification algorithm using a co-training approach with both spectral information and spatial information, which are iteratively extracted at the pixel level via collective inference. All the parameters in the compared methods are set according to the authors' suggestions or tuned to achieve the best performance. It should be noted that the results of TT_AL_MSH_MKE are taken from those reported by Tan et al. [17].
Additionally, the quantitative comparisons of the classification results are based on class-specific accuracy, overall accuracy (OA), average accuracy (AA), the kappa coefficient (κ) and the F1-measure [31].
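For reference, the three aggregate metrics can be computed from a confusion matrix as in this short sketch (standard definitions, not code from the paper):

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                     # rows: true class, cols: predicted
    n = cm.sum()
    oa = np.trace(cm) / n                 # fraction of correctly labeled pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class recalls
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)          # agreement corrected for chance
    return oa, aa, kappa
```

OA weights classes by their sample counts, while AA weights them equally, which is why both are reported for datasets with very unbalanced class sizes such as Indian Pines.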

Dataset Description and Experimental Settings
The Indian Pines image was recorded by the AVIRIS sensor over the Indian Pines site in Northwestern Indiana. It consists of 145 × 145 pixels and 220 spectral reflectance bands in the wavelength range 0.4-2.5 µm. Twenty spectral bands were removed due to noise and water absorption, and the remaining 200 bands were used for the experiments. The ground-truth data contain sixteen classes, and the false-color composite of the Indian Pines image and the corresponding reference image are shown in Figure 5a,b, respectively. The Pavia University image was gathered by the ROSIS-03 sensor during a flight campaign over Pavia, northern Italy, and has 610 × 340 pixels. A total of 115 spectral bands were collected, in the range 0.43-0.86 µm. Twelve spectral bands were removed due to noise, and the remaining 103 bands were used for classification. Nine land-cover classes were selected, and the true-color composite of the University of Pavia image and the corresponding reference image are shown in Figure 6a,b, respectively. For all the experiments with the four HSI datasets, limited training samples were randomly selected from each class, and the rest of the samples were set as unlabeled data for co-training. To obtain a more convincing estimate of the capabilities of the proposed method, we ran the experiment 10 times for each dataset. The training sample sizes in all the experiments were quite limited, which is a challenge for the classification task.
For the four datasets, the structure of the networks was set to the same depth and width for a fair comparison. The spectral-ResNet contains one convolutional layer and two building blocks. The first layer contains [3 × 1, 20] convolutional kernels and is followed by one pooling layer with pooling size [2,1] and stride [2,1]. For each building block, the three stacked layers contain [1 × 1, 20], [3 × 1, 20] and [1 × 1, 80] convolutional kernels. Finally, the network ends with global average pooling, a fully connected layer, and a softmax. In training a network, one epoch means one pass over the full training set. This network is trained over 240 epochs (160 epochs with learning rate 0.01, then 80 epochs with learning rate 0.001). Each training iteration randomly takes 20 samples, and weight decay, momentum and dropout rate are set to 0.0005, 0.9 and 0.5, respectively. For the spatial-ResNet, three principal components are extracted from the original HSI and then 27 × 27 × 3-sized image patches are extracted as the input data. The network structure is the same as that of the spectral-ResNet. The first layer contains [3 × 3, 20] convolutional kernels and is followed by one pooling layer with pooling size [2,2] and stride [2,2]. For each building block, the three stacked layers contain [1 × 1, 20], [3 × 3, 20] and [1 × 1, 80] convolutional kernels. It is trained over 200 epochs (140 epochs with learning rate 0.01, then 60 epochs with learning rate 0.001). At each training iteration, 20 samples are randomly selected from the training set, and weight decay, momentum and dropout rate are again set to 0.0005, 0.9 and 0.5, respectively. To prevent overfitting, dropout is employed after the average pooling in each branch.
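The step learning-rate schedules above can be encoded in a small helper; this is an illustrative utility built from the reported numbers, not the authors' training script.

```python
def learning_rate(epoch, schedule=((160, 0.01), (240, 0.001))):
    """Step schedule: the default matches the spectral-ResNet (0.01 for epochs
    0-159, then 0.001 up to epoch 239); pass ((140, 0.01), (200, 0.001)) for
    the spatial-ResNet."""
    for end, lr in schedule:
        if epoch < end:
            return lr
    raise ValueError("epoch beyond training schedule")
```

Dropping the rate by 10x late in training is the same coarse-then-fine strategy used in the original ResNet recipe.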
In the dual-strategy sample selection method, the spectral similarity threshold between the tested samples and the training set is empirically set to s ≥ 0.9 for selection, and the eight neighbors of the labeled training samples are used as the candidate set.
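The thresholding step itself is a one-liner; a tiny sketch with illustrative similarity values (the 0.9 threshold is the one stated above):

```python
def select_by_similarity(similarities, threshold=0.9):
    """Keep the indices of unlabeled samples whose spectral similarity to the
    training set reaches the experimental threshold s >= 0.9."""
    return [i for i, s in enumerate(similarities) if s >= threshold]

picked = select_by_similarity([0.95, 0.5, 0.9, 0.89])
```

Only samples passing this filter (and on which the spectral-ResNet agrees about the label) are handed to the spatial view's training pool.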

Experimental Results on the AVIRIS Indian Pines Dataset
The first group of experiments was conducted on the AVIRIS Indian Pines dataset. Firstly, we tested the proposed method in different scenarios with an increasing number of initial training samples ({5, 10, 15, 20} samples per class). In particular, for Grass-pasture-mowed and Oats, the number of initial training samples is at most 10. For the co-training process, three iterations of co-training were performed; the iteration number of co-training is denoted as t = 3. The detailed results are listed in Table 1, and the classification maps are shown in Figure 9. We make two observations on Table 1. First, as the number of initial training samples increases, the OA shows an upward trend until it becomes stable, and the OA with 20 training samples improves only a little compared with 15. Furthermore, in the case of 15 initial training samples, the network structure and sample selection strategy used in the proposed method are already sufficient to train the network well. Second, we analyze the classification results of each class. For the classes with very few samples, Alfalfa, Grass-pasture-mowed and Oats, the classification accuracy had already reached 100% when the number of initial training samples was 10. Furthermore, when the number of initial training samples is set to 5 or 10, the classification accuracy for each class is not stable, especially for Corn-notill, Soybean-mintill and Woods, as those classes have large numbers of samples. However, the results are more stable when the initial training sample size is set to 15 or 20. We then compare the proposed method with three HSI classification algorithms in Table 2. In order to evaluate the performance of co-training in the proposed algorithm, we first compare it with a state-of-the-art supervised spectral-spatial deep learning algorithm, CNN, and a self-training based semi-supervised classification approach, CDL-MD-L.
Especially when there are very few initial training samples, our method shows a significant advantage over the other two algorithms (maximum improvement of 41% with 5 initial training samples per class). To validate the effectiveness of the residual network in the proposed framework, we also compare the proposed algorithm with Co-DC-CNN, a co-training CNN with non-residual networks; it can be seen in Table 2 that the proposed algorithm achieves better performance using the residual network. Moreover, we compare the proposed method with the semi-supervised approaches. As the results in Table 3 show, the proposed method provides the best performance even with a small number of initial training samples. In particular, when the number of initial training samples is 5, the proposed method obtained the best OA of 88.42%, which is 6.31% higher than the second best (82.11%), achieved by PNGrow.

Experimental Results on the ROSIS-03 University of Pavia Dataset
The second experiment was conducted on the ROSIS-03 University of Pavia dataset. The results are listed in Table 4, and the visual classification results are shown in Figure 10. On this dataset, as the number of initial training samples increases, the OA improves and the per-class accuracies become more stable. The comparison between our proposed method and state-of-the-art HSI classification algorithms is presented in Table 5, and the comparison with other semi-supervised methods in Table 6. The proposed method obtained the best results with 10, 15 and 20 initial training samples, but with 5 initial training samples, the OA and κ are lower than those of PNGrow. However, the proposed method uses only three co-training iterations, whereas PNGrow uses ten. In Section 4, we will discuss the relationship between the classification results and the number of iterations in co-training.

Experimental Results on the AVIRIS Salinas Valley Dataset
The third experiment was conducted on the AVIRIS Salinas Valley dataset. We tested the proposed method with the same initial training set sizes as for the AVIRIS Indian Pines dataset. The detailed results are listed in Table 7, and the classification maps are shown in Figure 11. On this dataset, as the number of initial training samples increases, the OA improves and the per-class accuracies become more stable. The comparisons between our proposed method and other methods are presented in Tables 8 and 9. Every class contains a large number of samples, and with so many samples the classification results after three iterations of co-training are not ideal, especially for the two largest classes, Grapes_untrained and Vinyard_untrained.

Experimental Results on Hyperion Dataset
The fourth experiment was conducted on a Hyperion dataset. We tested the proposed method with different numbers of initial training samples ({5, 10, 15} samples per class). Because the number of samples per class is small, we did not run the experiment with 20 initial training samples, and the number of co-training iterations was set to two. The results are listed in Table 10, and the visual classification results are shown in Figure 12. On this dataset, as the number of initial training samples increases, the OA improves and becomes more stable. The comparison between our proposed method and state-of-the-art HSI classification methods is presented in Table 11; according to Table 11, the proposed method performs best.

Influence of Network Hyper-Parameters
For the proposed classification framework, the choice of network hyper-parameters affects the training process and classification performance. In this section, we investigate this impact from two aspects: the kernel number of the convolutional layers and the spatial size of the input image patch in the spatial-ResNet.
For the "bottleneck" building block, the number of kernels and the quadruple expansion are designed following [23]. Because the two blocks do not use pooling, there is no need to increase the number of feature channels to compensate for the information loss that pooling would cause, so two adjacent blocks have the same width. We then experimentally verified the kernel number of the convolutional layers (the width of the network), assessing kernel numbers from 10 to 25 in steps of 5 in each convolutional layer to find a general framework. The classification results on all datasets using different numbers of kernels with 5 initial training samples per class are shown in Figure 13a. The framework with 20 kernels per convolutional layer achieves the highest classification accuracy on the Indian Pines, Pavia University and Hyperion data, while the framework with 15 kernels obtains the best performance on the Salinas Valley dataset, although only slightly better than with 20 kernels. For consistency, we use 20 kernels for all datasets. To determine an appropriate size of the spatial neighborhood in the spatial-ResNet, we tested different neighborhood sizes; Figure 13b shows the classification results on all datasets with 5 initial training samples per class. The accuracy first increases quickly with the spatial size and then plateaus once the spatial size reaches 27 × 27. Therefore, as a trade-off between accuracy and the amount of data involved, we empirically choose 27 × 27 as the spatial neighborhood size.
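As a concrete illustration of the spatial-neighborhood input, the following sketch extracts the 27 × 27 patch around a pixel of an HSI cube. The function name and the edge-replication padding for border pixels are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def extract_patch(cube, row, col, size=27):
    """Extract a size x size spatial neighborhood centered at (row, col).

    cube: HSI array of shape (H, W, bands). The cube is edge-padded so
    that border pixels also yield a full-size patch.
    """
    half = size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="edge")
    # After padding, original pixel (row, col) sits at (row + half, col + half).
    return padded[row:row + size, col:col + size, :]

# Toy example: a 30 x 30 scene with 5 spectral bands.
cube = np.random.rand(30, 30, 5)
patch = extract_patch(cube, 0, 0)   # a corner pixel still gives a full patch
print(patch.shape)  # (27, 27, 5)
```

Each such patch, paired with the pixel's label, is the kind of input a spatial branch consumes, while the spectral branch would take the 1-D band vector `cube[row, col, :]`.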

Effect of the Number of Iterations in Co-Training
As mentioned in Section 3, when the dataset is large and the initial training set is extremely small, the classification results after 3 iterations of co-training are not ideal. It is therefore interesting to examine the relationship between the classification results and the number of co-training iterations, in terms of OA and computation time. Experiments are performed on the ROSIS-03 University of Pavia dataset and the AVIRIS Salinas Valley dataset with an initial labeled training set of 5 samples per class. The accuracy and computation time are plotted in Figure 14a-c. The results show that, on large datasets, increasing the number of co-training iterations improves the classification results but drastically increases the time cost. Note that the first iteration of co-training is in effect a supervised classification; as Figure 14a shows, the accuracies after the first iteration are very low. As samples are added by our sample selection and sample addition mechanisms, the accuracies improve greatly. Moreover, as Figure 14b,c show, the computation time grows drastically with the number of co-training iterations, mainly due to network training time. We therefore suggest a moderate number of iterations (e.g., three), which keeps satisfactory performance at a relatively low time cost. If the computational cost is affordable, the number of iterations can be set higher (e.g., six) to achieve better classification results.
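The iteration structure discussed above can be sketched as a minimal two-view co-training loop. This is an illustration only: a nearest-centroid classifier stands in for the spectral- and spatial-ResNets, confidences are negated distances, and all function names and parameters are hypothetical rather than the authors' code.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)], -d.min(axis=1)  # labels, confidences

def co_train(Xa, Xb, y_lab, lab_idx, unl_idx, iters=3, per_iter=10):
    """Minimal two-view co-training loop.

    Xa, Xb: the two feature views (think spectral and spatial features).
    Each iteration, each view pseudo-labels its most confident unlabeled
    samples, and those samples augment the *other* view's training set.
    """
    lab_a, lab_b = list(lab_idx), list(lab_idx)
    ya = dict(zip(lab_idx, y_lab))
    yb = dict(zip(lab_idx, y_lab))
    pool = list(unl_idx)
    for _ in range(iters):
        ma = nearest_centroid_fit(Xa[lab_a], np.array([ya[i] for i in lab_a]))
        mb = nearest_centroid_fit(Xb[lab_b], np.array([yb[i] for i in lab_b]))
        pa, ca = nearest_centroid_predict(ma, Xa[pool])
        pb, cb = nearest_centroid_predict(mb, Xb[pool])
        # View A's confident picks feed view B's training set, and vice versa.
        top_a = np.argsort(-ca)[:per_iter]
        top_b = np.argsort(-cb)[:per_iter]
        for j in top_a:
            i = pool[j]; yb[i] = pa[j]; lab_b.append(i)
        for j in top_b:
            i = pool[j]; ya[i] = pb[j]; lab_a.append(i)
        moved = set(pool[j] for j in np.concatenate([top_a, top_b]))
        pool = [i for i in pool if i not in moved]
    return ma, mb

# Toy data: two well-separated classes seen through two feature views.
rng = np.random.default_rng(0)
Xa = np.concatenate([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
Xb = np.concatenate([rng.normal(0, 1, (50, 6)), rng.normal(5, 1, (50, 6))])
y = np.array([0] * 50 + [1] * 50)
lab = [0, 1, 50, 51]                       # a few labeled samples per class
unl = [i for i in range(100) if i not in lab]
ma, mb = co_train(Xa, Xb, y[lab], lab, unl, iters=3, per_iter=10)
pred, _ = nearest_centroid_predict(ma, Xa)
print(round((pred == y).mean(), 2))  # typically near 1.0 on this toy data
```

The first fit inside the loop corresponds to the "first iteration is supervised" observation: only the initial labeled set is used, and accuracy improves as pseudo-labeled samples accumulate across iterations.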

Sample Selection Mechanism Analysis in Co-Training
The effectiveness of the proposed co-training method has been validated: it not only uses the labeled data, but also exploits and labels the most confident unlabeled data to aid learning. In this part, we analyze the performance of the dual-strategy sample selection method in co-training. Experiments are carried out on the two HSI datasets with relatively large numbers of samples, the ROSIS-03 University of Pavia dataset and the AVIRIS Salinas Valley dataset. Tables 12 and 13 list, for each co-training iteration on the two datasets with 5 initial training samples per class, the number of initial training samples, the number of samples selected from the spectral and spatial features, and the classification results. The accuracies of the selected samples are listed next to their numbers in Tables 12 and 13.

First, as shown in the first column of Table 12, the numbers of initial training samples and selected samples are small, and the corresponding classification result (54.2%) is accordingly poor. However, after two iterations of co-training with new samples added, the classification results improve greatly, showing that the samples chosen by the proposed selection mechanism effectively promote the training of the network. Second, again from the first column of Table 12, in the first iteration of co-training the numbers of samples selected by the two models differ considerably (e.g., 170 and 63). Nevertheless, the selected samples augment the training set of the other model in the next iteration, which pushes the network performance further. Furthermore, the samples selected from the spatial view capture local distributions while those selected from the spectral view capture global distributions; the two complement each other and facilitate the sample selection in the next round.
Third, since the proposed method is based on a deep learning framework, the training time is relatively long, and the number of samples selected in each co-training iteration is also larger than in other co-training based algorithms [17,18]. Although some selected samples are mislabeled, their number is extremely small compared to the total training samples, so the impact on network training is negligible. Overall, this demonstrates the robustness of our network and the effectiveness of the co-training structure. The same conclusions can be drawn for the AVIRIS Salinas Valley dataset from Table 13.
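To make the confidence-based filtering concrete, the sketch below keeps only unlabeled samples on which a spectral and a spatial model agree and both are sufficiently confident. This agreement-plus-threshold rule is a simplified stand-in for illustration, not the paper's exact dual-strategy criterion; the function name and threshold are assumptions.

```python
import numpy as np

def select_confident(pred_spec, conf_spec, pred_spat, conf_spat, thresh=0.9):
    """Keep unlabeled samples on which the spectral and spatial predictions
    agree and both confidences exceed a threshold; return their indices
    and the pseudo-labels assigned to them."""
    mask = (pred_spec == pred_spat) & (conf_spec >= thresh) & (conf_spat >= thresh)
    idx = np.flatnonzero(mask)
    return idx, pred_spec[idx]

# Toy predictions for 5 unlabeled samples from the two views.
pred_spec = np.array([0, 1, 1, 2, 0])
conf_spec = np.array([0.95, 0.80, 0.99, 0.92, 0.97])
pred_spat = np.array([0, 1, 1, 1, 0])
conf_spat = np.array([0.91, 0.99, 0.93, 0.95, 0.60])
idx, labels = select_confident(pred_spec, conf_spec, pred_spat, conf_spat)
print(idx, labels)  # samples 0 and 2 survive, with pseudo-labels 0 and 1
```

Samples 1 and 4 are rejected for low confidence in one view and sample 3 for disagreement, mirroring how a selection mechanism keeps the mislabeled fraction small relative to the total training set.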

Conclusions
This paper proposed a semi-supervised deep learning framework for HSI classification that reduces the dependence of deep learning methods on large-scale manually labeled HSI data. The framework has two key components: (1) the spectral- and spatial-ResNets for extracting spectral and spatial features, and (2) the dual-strategy sample selection co-training algorithm for effective semi-supervised learning. Experimental results on benchmark HSI data and a selected Hyperion dataset demonstrate the effectiveness of our approach. In future research, we plan to design an adaptive network structure to better classify different HSI data. It would also be interesting to investigate a completely unsupervised setting in which an advanced clustering algorithm initializes the network parameters.