Hierarchical Multi-View Semi-Supervised Learning for Very High-Resolution Remote Sensing Image Classification

Abstract: Traditional classification methods for very high-resolution (VHR) remote sensing images require a large number of labeled samples to obtain high classification accuracy, but labeled samples are difficult and costly to obtain. Therefore, semi-supervised learning, which combines labeled and unlabeled samples for classification, has become an effective paradigm. In semi-supervised learning, the key issue is to enlarge the training set by selecting highly reliable unlabeled samples. Observing the samples from multiple views is helpful for improving the accuracy of label prediction for unlabeled samples; hence, a reasonable view partition is very important for improving the classification performance. In this paper, a hierarchical multi-view semi-supervised learning framework with CNNs (HMVSSL) is proposed for VHR remote sensing image classification. Firstly, a superpixel-based sample enlargement method is proposed to increase the number of training samples in each view. Secondly, a view partition method is designed to partition the training set into two independent views, such that the partitioned subsets are inter-distinctive and intra-compact. Finally, a collaborative classification strategy is proposed for the final classification. Experiments conducted on three VHR remote sensing images show that the proposed method performs better than several state-of-the-art methods.


Introduction
The classification of very high-resolution (VHR) remote sensing images faces great challenges with the rapid development of remote sensing technologies. The purpose of classification is to assign each spectral pixel over the observed scene with a certain thematic class. Early classification approaches focused on spectral-based classification; for instance, the support vector machine (SVM) [1,2], linear discriminant analysis (LDA) [3,4], maximum likelihood (ML) [5], and random forest (RF) [6]. However, these methods easily lead to noisy classification maps. To overcome this problem, spectral-spatial classification methods have become the mainstream in the last decades. The typical spectral-spatial classification methods include Markov random fields (MRFs) [7], dictionary learning [8], multi-kernel learning [9], extended multi-attribute profiles (EMAPs) [10], and edge-preserving filtering [11]. Compared with the spectral classification methods, the performance of spectral-spatial classification methods has been improved significantly.
hardly met in most real-case scenarios [35,36]. The view partition methods presented recently mainly fall into two categories: extracting different features from the original sample set, and single-view partitioning (splitting the original sample set into several subsets). The former makes it difficult to obtain two distinct views, while the latter further reduces the number of labeled samples in each partition and is usually applicable only when labeled samples are sufficient. In this paper, we present a methodology that splits the feature set into two independent sets using K-means, and propose a hierarchical multi-view semi-supervised learning framework with CNNs (HMVSSL) for VHR remote sensing image classification, achieving effective and independent view partitioning with limited labeled samples. The merits of our work are mainly twofold: (1) effective sample enlargement before view partitioning and (2) construction of a view partition set. View partitioning usually requires sufficient labeled samples to guarantee the reliability of label prediction; to cope with limited labeled samples, a superpixel-based sample enlargement process is designed to enlarge the training set first, so that the number of labeled samples in each partition set does not decrease sharply. Construction of the view partition set ensures that the difference between the two views is large enough to confirm the effectiveness of the decision. An ideal partition subset should be inter-distinctive and intra-compact. Following this principle, we designed a novel view partition method: by calculating intra-class and inter-class distances, the diversity of the view partition can be effectively improved.
The main contributions of the proposed HMVSSL model are summarized as follows:
1) A hierarchical semi-supervised learning framework is proposed. The proposed model consists of three levels: superpixel-based sample enlargement, construction of a view partition set, and collaborative classification.
2) Initial sample expansion via initial classification and superpixel segmentation is proposed to enlarge the partitioned sample set.
3) A novel view partition strategy is proposed to promote the inter-distinctiveness and intra-compactness for the view partition sets.
The rest of the paper is organized as follows: Section 2 shows the related works. The details of the proposed method are described in Section 3. Section 4 presents the experimental results and analysis, followed by a conclusion of our work.

Deep Convolutional Neural Networks
One of the most important deep learning models is the convolutional neural network (CNN), which is widely used in VHR remote sensing image classification [37,38]. In general, traditional CNNs consist of five fundamental structures: the convolutional layer, non-linear mapping (NL) layer, pooling layer, full connection (FC) layer, and classification layer. The deep structure of CNNs is achieved by alternating a series of convolutional, NL, and pooling layers. The general CNN structure is shown in Figure 1.

In the convolutional layer, the output convolution features are obtained by convolving the trainable convolution kernels with the input sample or feature. Assume the input feature is x^(l−1), where l denotes the lth layer of the CNNs. The output convolution feature is expressed as [39]:

x^l = F^l ∗ x^(l−1) + B^l,

where ∗ denotes the convolution operator, and F^l and B^l refer to the lth-layer convolution kernels and biases, respectively. After the convolutional layer, an NL layer follows to enhance the non-linear capability of the network. In this paper, the ReLU function is used as the non-linear activation function [40].
The purpose of the pooling layer is to enhance the invariance of the learned features by reducing their size. The output pooling features are then rearranged into a feature vector and fed into the FC layer. Finally, a softmax classifier is connected at the end of the CNNs for classification.
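The layer operations described above can be sketched as follows. This is a minimal illustration with hypothetical layer sizes (a single-band 21 × 21 patch, four 6 × 6 kernels, 2 × 2 max pooling, and six output classes), not the exact architecture used in the experiments:

```python
import numpy as np

def conv2d(x, kernels, biases):
    """'Valid' 2-D convolution of a single-band image x (H, W)
    with a bank of kernels (K, kh, kw)."""
    K, kh, kw = kernels.shape
    H, W = x.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[k]) + biases[k]
    return out

def relu(x):                      # non-linear mapping (NL) layer
    return np.maximum(x, 0.0)

def maxpool2(x):                  # 2x2 max pooling over (K, H, W) feature maps
    K, H, W = x.shape
    return x[:, :H // 2 * 2, :W // 2 * 2].reshape(K, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def softmax(z):                   # classification layer
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
patch = rng.standard_normal((21, 21))                    # one 21x21 input patch
feat = relu(conv2d(patch, 0.1 * rng.standard_normal((4, 6, 6)), np.zeros(4)))
feat = maxpool2(feat)                                    # (4, 16, 16) -> (4, 8, 8)
vec = feat.reshape(-1)                                   # flatten for the FC layer
W_fc = 0.01 * rng.standard_normal((6, vec.size))         # FC weights (6 classes)
probs = softmax(W_fc @ vec)                              # class probabilities
```

A real implementation would of course use a deep learning framework with trainable parameters; the explicit loops above only illustrate what each layer computes.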

Superpixel Segmentation
Superpixel segmentation can adaptively segment the image into several homogeneous regions according to the intrinsic spatial structure [41]. Figure 2 shows superpixel segmentation maps at different scales. In VHR remote sensing image classification, it is generally assumed that the pixels within each superpixel belong to the same category. Based on this assumption, Fang et al. [42] exploited multi-scale superpixel features via multi-kernel learning. Jiao et al. [43] proposed a collaborative representation-based multiscale superpixel fusion method for HSI classification. Feng et al. [44] proposed a superpixel tensor sparse coding model for HSI classification. In our previous work, we proposed superpixel-based 3D CNNs for HSI classification [45], in which a spatial feature map is extracted from the HSI data to suppress noisy pixels in the classification results. Zheng et al. [46] proposed superpixel-guided training sample enlargement. Similar to our work, each superpixel containing training samples of only one class was examined, and all the pixels within such a superpixel were assigned the class of the training samples it contained; all these pixels were then used together with the initial training samples to train the classifiers. However, in superpixel segmentation, mixed pixels may cause inaccurate positioning of land-cover boundaries, resulting in mislabeled samples being added to the training set. Hence, in our method, we only assign the label to the center pixel of the superpixel to reduce false labeling.

Proposed Method
In this paper, a hierarchical multi-view semi-supervised classification method for VHR remote sensing image classification is proposed. The proposed method mainly includes three stages: superpixel-based sample enlargement, the construction of a view partition set, and collaborative classification. In the first two stages, we will generate three classification maps from different views, which provide effective support for the final collaborative classification. The framework of the proposed method is shown in Figure 3.

Superpixel-based Sample Enlargement
The initial training set Ω = {D, L} is extracted by applying a rectangular window to randomly selected pixels, where D denotes the training samples and L the labels. The purpose of this stage is to provide more reliable samples for the subsequent view partition. Superpixel segmentation divides the VHR remote sensing image into several homogeneous regions, and it is commonly assumed that the pixels within each superpixel share the same label; hence, superpixels are often used for pseudo-sample labeling. In practice, the rich details in VHR remote sensing images, such as shadow occlusion, can interfere with the extraction of object boundaries. At the same time, due to the existence of mixed pixels, the boundaries of the superpixels do not exactly match the objects. To avoid mislabeled samples caused by erroneous boundary extraction, only the center pixel within each superpixel is selected and added to the training set. In the following, we describe in detail how pseudo labels are assigned to the center pixels.
To assign pseudo labels to the center pixels, the initial training set is used to train the CNNs, and the initial classification map is obtained by the trained CNNs:

{F f , L f } = g(D f ),

where g(·) denotes the CNNs trained on the initial training set, D f represents all unlabeled samples, F f is the predicted feature, and L f represents the predicted labels. According to the predicted labels, the initial classification map G 1 is obtained. We perform superpixel segmentation on the VHR remote sensing image and project the segmentation map onto the initial classification map; this process is shown in Figure 4. If all the pixels within a superpixel have the same predicted label, the center pixel of this superpixel and its predicted label are added to the training set. In the proposed method, entropy rate segmentation [47] is adopted for superpixel segmentation; other superpixel segmentation methods can also be used here. The new training set is Ω = {D, L; D 1 , L 1 }, where {D 1 , L 1 } consists of the selected unlabeled samples and their predicted labels.
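The enlargement step can be sketched as below, assuming a precomputed superpixel map (one integer id per pixel) and the initial classification map; the function name and the use of the region centroid as the "center pixel" are illustrative assumptions:

```python
import numpy as np

def enlarge_training_set(superpixels, class_map):
    """For each superpixel whose pixels all share one predicted label,
    return the coordinates of its (approximate) center pixel and that label."""
    pseudo = []
    for sp in np.unique(superpixels):
        rows, cols = np.nonzero(superpixels == sp)
        labels = class_map[rows, cols]
        if np.all(labels == labels[0]):                # "pure" superpixel
            r, c = int(rows.mean()), int(cols.mean())  # centroid as center pixel
            pseudo.append(((r, c), int(labels[0])))
    return pseudo

# toy 4x4 example: superpixel 0 is pure (label 2), superpixel 1 is mixed
sp = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 1, 1]])
cm = np.array([[2, 2, 3, 3],
               [2, 2, 3, 1],
               [2, 2, 3, 3],
               [2, 2, 3, 3]])
print(enlarge_training_set(sp, cm))   # -> [((1, 0), 2)]
```

Note that for strongly non-convex superpixels the centroid may fall outside the region, so a practical implementation would snap it to the nearest member pixel.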

Construction of a View Partition Set
In this section, two partition sets are constructed from different views of the feature domain. The purpose of view partitioning is to make each partition set have the characteristics of inter-distinctiveness and intra-compactness; meanwhile, the correlation between different views is as low as possible. For this purpose, we designed a two-step partition method for view partitioning.
Through the CNNs trained in Section 3.1, the features of each training sample can be obtained. Assume the feature set of the training set Ω is F Ω ; in the following, we partition the feature set F Ω into two views. Figure 5 shows the view partitioning process. Notice that the partitioning is applied only to the feature set of the training samples. The proposed view partitioning process consists of two parts: intra-class partitioning and inter-class merging. The purpose of intra-class partitioning is to enhance the intra-compactness of each partition and the difference between the two partitions. According to the labels of the training samples, the feature set F Ω is divided into N subsets, where N is the number of classes. K-means [48] is an unsupervised clustering algorithm that achieves good clustering results at a low time cost; hence, K-means is applied to each class separately. Since two view partition sets are constructed in this section, each class is divided into two subsets. In the second part, the feature set F Ω is merged into two partition sets. The principle of the merging is to enlarge the inter-distinctiveness within each partition set, and the merging is carried out class by class. Assume the two partition sets are Ω 1 = {F 1 1 } and Ω 2 = ∅. For Class 1 and Class 2, the corresponding intra-class partition feature sets are {F 1 1 , F 1 2 } and {F 2 1 , F 2 2 }. For each feature subset, the feature center is calculated by averaging its features; the centers are denoted as {C 1 1 , C 1 2 } and {C 2 1 , C 2 2 }. For the feature center C 1 1 , we calculate its Euclidean distances to C 2 1 and C 2 2 , and the feature subset with the larger distance joins the partition Ω 1 .
Assuming that C 1 1 and C 2 2 have the larger distance, then Ω 1 = {F 1 1 , F 2 2 }, and the other two feature subsets are merged into partition Ω 2 . Similarly to the merging process of Classes 1 and 2, the feature subsets {Ω 1 , Ω 2 } and {F 3 1 , F 3 2 } are merged as described above. The view partitioning process finishes once the feature subsets {F N 1 , F N 2 } are merged into the two partition sets. Equation (4) shows the whole merging process.

Initial : Ω 1 = {F 1 1 }, Ω 2 = ∅
Step 1 : merge {F 1 2 } and {F 2 1 , F 2 2 } into Ω 1 and Ω 2
⋮
Step N − 1 : merge {F N 1 , F N 2 } into Ω 1 and Ω 2    (4)
The intra-class partitioning and inter-class merging process enables us to obtain two partition sets with different views. Although the partitioning is not completely orthogonal, the proposed method keeps the two partition sets far apart.
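The two-step partitioning might be sketched as follows; the minimal K-means routine and the running-center merging rule are simplified assumptions rather than the authors' exact implementation:

```python
import numpy as np

def kmeans2(X, iters=20, seed=0):
    """Minimal 2-cluster K-means; returns a 0/1 assignment per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = X[assign == k].mean(0)
    return assign

def view_partition(features, labels):
    """Class-by-class split into two views: K-means halves each class, and the
    half whose center is farther from view 1's running center joins view 1."""
    classes = np.unique(labels)
    X0 = features[labels == classes[0]]
    a0 = kmeans2(X0)
    view1, view2 = [X0[a0 == 0]], [X0[a0 == 1]]   # seed the two views with class 1
    for c in classes[1:]:
        Xc = features[labels == c]
        ac = kmeans2(Xc)
        s0, s1 = Xc[ac == 0], Xc[ac == 1]
        c1 = np.vstack(view1).mean(0)             # running center of view 1
        if np.linalg.norm(s0.mean(0) - c1) >= np.linalg.norm(s1.mean(0) - c1):
            view1.append(s0); view2.append(s1)    # farther subset joins view 1
        else:
            view1.append(s1); view2.append(s0)
    return np.vstack(view1), np.vstack(view2)

# synthetic demo: three classes, each made of two well-separated blobs
rng = np.random.default_rng(1)
feats, labs = [], []
for c in range(3):
    feats.append(rng.normal((10 * c, 0), 0.1, size=(10, 2)))
    feats.append(rng.normal((10 * c, 5), 0.1, size=(10, 2)))
    labs += [c] * 20
features, labels = np.vstack(feats), np.array(labs)
v1, v2 = view_partition(features, labels)          # two disjoint training views
```

In practice a library K-means (e.g., scikit-learn's) with multiple restarts would be more robust than the bare Lloyd iterations above.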
The CNNs are trained via partition sets Ω 1 and Ω 2 , respectively. Two classification maps G 2 and G 3 with large differences can be obtained for the final decision.

Collaborative Classification
In Sections 3.1 and 3.2, the classification maps of G 1 , G 2 , and G 3 were obtained, respectively. Since these three classification results were obtained from three different training sets, in this section, the three classification maps are combined for final classification.
The collaborative classification process is shown in Figure 6. The final training set is constructed from all the samples that need to be classified, irrespective of any previously assigned pseudo labels. If a sample receives the same label in all three classification results, this sample and its pseudo label are added to the training set. Because the resulting training set is large, the computational complexity increases, and the problem of sample imbalance is exacerbated. Therefore, we take the category with the fewest samples as the benchmark and randomly select the corresponding number of samples from the other categories to form the new training set. The CNNs are trained with the new training set to obtain the final classification result. The whole procedure is summarized as follows:
Level 1: Superpixel-based sample enlargement
1: Train the CNNs with the initial training set Ω.
2: Obtain the initial classification map G 1 according to the trained CNNs (Step 1).
3: Enlarge the training set Ω with the center pixels of the pure superpixels.
Level 2: View partition
4: According to the trained CNNs (Step 1), obtain the feature set F Ω of training set Ω.
5: Intra-class partitioning of the feature set F Ω by K-means.
6: Inter-class merging of the subsets into the two partition sets Ω 1 and Ω 2 .
7: Train the CNNs with the two partition sets, respectively.
8: Obtain two classification maps G 2 and G 3 according to the trained CNNs (Step 7).
Level 3: Collaborative classification
9: Select unlabeled samples with the same label prediction on G 1 , G 2 , and G 3 to enlarge the training set.
10: Train the CNNs with the new training set.
11: Predict the labels of the unlabeled samples using the trained CNNs (Step 10).
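The agreement-based sample selection and the class-balancing rule can be sketched as follows, assuming the three classification maps are given as integer label arrays; the function name is hypothetical:

```python
import numpy as np

def collaborative_training_set(g1, g2, g3, seed=0):
    """Keep the pixels on which the three classification maps agree, then
    balance classes by subsampling down to the rarest agreed-upon class."""
    rng = np.random.default_rng(seed)
    agree = (g1 == g2) & (g2 == g3)
    idx = np.flatnonzero(agree)                     # flat indices of agreed pixels
    labels = g1.ravel()[idx]
    classes = np.unique(labels)
    n = min((labels == c).sum() for c in classes)   # fewest-samples benchmark
    keep = np.concatenate([
        rng.choice(idx[labels == c], n, replace=False) for c in classes
    ])
    return keep, g1.ravel()[keep]

# toy 3x3 maps: the three maps agree everywhere except pixel (0, 0)
g1 = np.array([[0, 0, 1], [1, 2, 2], [0, 1, 2]])
g2, g3 = g1.copy(), g1.copy()
g2[0, 0] = 1
keep, lab = collaborative_training_set(g1, g2, g3)  # 2 samples per class
```

The subsampling step directly implements the "fewest-samples class as the benchmark" rule described above.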

Datasets
The following three VHR remote sensing images are used in our experiments. a) Aerial data [49] were acquired by an ADS80 sensor mounted on an airplane. The spatial resolution of this scene is 0.32 m with three bands. The Aerial data are 560 lines by 360 samples; six classes are available: grass, water, road, trees, building, and shadow. The image data and the ground-truth map are shown in Figure 7a,b, respectively. b) JX_1 data were collected from a UAV platform with a Canon EOS 5D Mark II camera at a flight elevation of 100 m. JX_1 data are a small subscene of the JX image [49]. The size of the JX_1 data is 500 × 700 pixels with a spatial resolution of 0.1 m. JX_1 data have three bands with six classes available. Figure 8a,b show the image data and the ground-truth map, respectively. c) Pavia University data (http://www.ehu.eus//ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes) were collected by the reflective optics system imaging spectrometer (ROSIS-3) optical sensor on 8 July 2002. The original HSI contains 115 spectral bands; after removing the noisy bands, 103 bands remain. This scene, with a size of 610 × 340 pixels, has a spatial resolution of 1.3 m. In this data set, nine classes are available for classification. The false-color image and its ground truth are shown in Figure 9a,b, respectively.

Experiment Setup
In this paper, we compare the performance of our method with several state-of-the-art VHR remote sensing image classification methods, including a supervised classification method and several semi-supervised classification methods.
The supervised classification method is a CNN [12]. CNNs are an effective feature extraction tool and have been widely used in VHR remote sensing image classification.
TSVM (transductive support vector machine) is an iterative algorithm that incorporates the unlabeled samples into the training phase and searches for a more reliable separating hyperplane (in the kernel space).
SemiMLR combines the labeled and unlabeled samples to improve the performance of the multiple logistic regression classifier.
SemiSAE uses a large number of unlabeled samples to pre-train an unsupervised autoencoder and fine-tunes the network with the small number of labeled samples. The size of each hidden layer in the pre-trained autoencoder is 3 − 30 − 30 − 30 − 6 (for Aerial data and JX_1 data) and 103 − 300 − 300 − 300 − 9 (for Pavia University data).
The ladder network is a recently proposed semi-supervised classification network. It consists of two encoders and one decoder; during training, a supervised cost and an unsupervised cost are combined. The supervised cost exploits the deep features of the labeled samples, and the unsupervised cost constrains the reconstruction error of the unlabeled samples. The architecture of the ladder network is 3 − 30 − 30 − 30 − 6 (for Aerial data and JX_1 data) and 103 − 300 − 300 − 300 − 9 (for Pavia University data).
In addition, to demonstrate the advantages of the proposed view partition method, we replaced the proposed two views with a spectral feature view and a 2D Gabor feature view [31] in the proposed framework; this variant is called the spectral-spatial view in the following experiments.
The other parameters of the compared methods were set to the defaults from their papers. For the proposed method, the parameters were as follows: the window size of each sample was 21 × 21 and the superpixel number was 1000; for the CNNs, the first convolutional layer had 20 filters of size 6 × 6, the second convolutional layer had 40 filters of size 5 × 5, the full connection layer had 100 units, the iteration number was 1000, and the learning rate was set to 0.01. The numbers of the initial training samples are shown in Table 1, and the average values of the overall accuracy (OA), average accuracy (AA), and kappa coefficient were used to evaluate the classification results.
Table 1. Numbers of the initial training and testing samples.
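The three evaluation metrics can be computed from a confusion matrix; the following is a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Overall accuracy (OA), average accuracy (AA), and the kappa coefficient."""
    classes = np.unique(y_true)
    C = np.array([[np.sum((y_true == t) & (y_pred == p)) for p in classes]
                  for t in classes], dtype=float)       # confusion matrix
    n = C.sum()
    oa = np.trace(C) / n                                # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))            # mean per-class accuracy
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])
oa, aa, kappa = classification_metrics(y_true, y_pred)  # 5/6, 5/6, 2/3
```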

Experimental Results on Aerial Data
In this experiment, the performance of the proposed HMVSSL method was evaluated on the Aerial data. Figure 10 shows the classification results, and Table 2 tabulates the classification accuracies of the compared methods. The initial and final training pixels are shown in Figure 10a with red and green markers, respectively; the number of training samples clearly increased significantly. To assess the label prediction accuracy for the selected unlabeled samples, we calculated the accuracy of these samples within the ground-truth regions, and the accuracy is 100%. Compared with CNNs, the advantage of using unlabeled samples is clearly revealed in the proposed method: the classification results for road and water are improved significantly. As can be seen from Table 2, the classification accuracies of road and water are increased by 1.05% and 0.98%, respectively, and the OA value is increased by 1.6% compared to CNNs. Among the other semi-supervised learning methods, TSVM has lower classification performance for road, trees, and shadow; Table 2 reports that its classification accuracy for road is only 4.84%. SemiSAE performs poorly on grass, with a classification accuracy of only 28.28%. The classification maps of SemiSAE and the ladder network are obviously noisy. The SemiMLR method shows clear advantages in smooth areas, such as water, road, and building, where its classification accuracies are higher than those of the proposed method. However, for non-smooth regions and small details, e.g., trees and shadow, its accuracy is significantly reduced: the classification accuracies of trees and shadow are 19.94% and 18.5% lower than those of the proposed method. In the proposed HMVSSL method, the classification accuracy of each category is above 94%, with no bias toward any single category.
Because the Gabor wavelet has obvious advantages for texture extraction, the spectral-spatial view obtains higher classification accuracies for trees and roads. However, the redundancy between the spectral and spatial views is large, which results in misclassified samples being added to the training set. Hence, the classification accuracy of the spectral-spatial view is slightly lower than that of the proposed views, especially at the boundaries of the building regions.

Experimental Results on JX_1 Data
In this section, the classification performance is evaluated on the JX_1 data. The classification maps and accuracies are illustrated in Figure 11 and Table 3, respectively. JX_1 data contain not only homogeneous regions with small intra-class differences, such as farmland, buildings, and roads, but also land-cover areas with large intra-class differences, such as trees and grass. As can be seen from Figure 11, the compared methods show different classification performance for trees and grass. The classification maps of TSVM, SemiSAE, and the ladder network show obvious misclassifications within these two categories; in Table 3, the classification accuracies of grass (class 5) for these three methods are 79.67%, 79.04%, and 76.76%, while for CNNs, SemiMLR, and the proposed method they are 97.12%, 80.24%, and 93.19%, respectively. CNNs, SemiMLR, and the proposed method show better regional consistency. Although SemiMLR obtains a competitive OA value, its classification performance for small details is poorer, e.g., shadow (94.28% for SemiMLR versus 96.94% for the proposed method). This case is similar to the former analysis on the Aerial data. The proposed method achieves better performance in both visual quality and classification accuracy. For the spectral-spatial view, the initial classification accuracies on the JX_1 data are high, and hence the probability that the same sample is mislabeled by both views is small; therefore, the spectral-spatial view also achieves good classification results. However, two independent views can obtain more highly reliable unlabeled samples, and therefore the classification accuracy of the proposed method is slightly higher than that of the spectral-spatial view-based classification.

Experimental Results on Pavia University Data
Pavia University data are well-known HSI data containing 103 bands. For CNNs and the proposed method, PCA is performed first, and the first three PCs are retained for the subsequent classification. For the other methods, the input samples are extracted from the original HSI data. Pavia University data contain buildings and roads with rich details, as well as grass and soil areas with noisy information, which increases the difficulty of accurate land-cover interpretation. Figure 12 and Table 4 present the classification maps and classification accuracies, respectively. TSVM and SemiSAE produce classification maps with obvious salt-and-pepper noise; for SemiSAE in particular, the classification accuracy is only 60.22%. After incorporating spatial information, CNNs, SemiMLR, and the proposed method show better anti-noise performance. Due to the use of unlabeled samples, the OA value obtained by the proposed method is 2.64% higher than that of CNNs. For the spectral-spatial view-based classification, the reduced initial accuracy increases the probability that the two views mislabel the same sample; hence, compared with CNNs, the increase in classification accuracy is not significant. Therefore, when the accuracy of the initial classification is low, the independence of the two views has an important influence on the label decision process for the unlabeled samples. Overall, the proposed method performs better than the compared approaches in classification metrics, detail preservation, and region smoothness.
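The PCA preprocessing step (retaining the first three principal components of the HSI cube) can be sketched as follows; the function name and the eigendecomposition route are illustrative:

```python
import numpy as np

def first_pcs(cube, k=3):
    """Project an (H, W, B) hyperspectral cube onto its first k principal components."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    X = X - X.mean(axis=0)                          # center each band
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(vals)[::-1][:k]              # top-k eigenvalues
    return (X @ vecs[:, order]).reshape(H, W, k)

rng = np.random.default_rng(0)
cube = rng.standard_normal((10, 12, 8))             # toy 8-band "image"
pcs = first_pcs(cube, k=3)                          # (10, 12, 3) PC image
```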

Discussion
In the proposed method, the number of training samples is first enlarged by superpixel segmentation: the center pixel of each "pure" superpixel is selected to enlarge the training set. Therefore, the superpixel number determines the correctness of the unlabeled sample selection. Figure 13 shows the classification results with the number of superpixels ranging from 50 to 2000. For the Aerial data and JX_1 data, the effect of the superpixel number on the classification accuracies is not obvious, but the situation is different for the Pavia University data. For the Pavia University data, when the superpixel number is less than 500, the OA values are slightly reduced, because only a small number of unlabeled samples is selected; however, when the superpixel number increases to 2000, the OA value also decreases due to the mislabeling of unlabeled samples. Therefore, the classification evaluation metrics on the Pavia University data are more sensitive to the superpixel number. The other analysis concerns the influence of the number of labeled samples on the classification accuracy. Figure 14 shows the classification performance of the compared and proposed methods, with the number of labeled samples per class ranging from 10 to 100 to further analyze the performance of these methods with limited training samples. Since the sample selection process depends on the initial classification map, the proposed method does not achieve good performance when the number of training samples is less than 30. However, as the number of training samples increases, its OA values become higher than those of the compared approaches. Therefore, the proposed method is superior to the other methods when handling the problem of limited training samples.

Conclusions
This paper proposed a novel hierarchical multi-view semi-supervised learning framework for VHR remote sensing image classification. The proposed method consists of three levels. The first level is the enlargement of the training set, which prevents a sharp reduction in the number of training samples after view partitioning. The second level is view partitioning, which obtains two different views with the characteristics of inter-distinctiveness and intra-compactness; the designed view partitioning method effectively improves the reliability of unlabeled sample selection. The third level combines the classification results of the previous levels for collaborative classification. Experiments were conducted on three VHR remote sensing datasets containing various land-cover classes, such as water, building, road, and grass. The experimental results verify the effectiveness of the proposed method compared to several state-of-the-art approaches.
There are two improvements that can be considered in future work. On the one hand, in the classification problem, most samples are "simple samples" that can be classified correctly; the remaining few are hard to classify and can be called "difficult samples." In our work, the unlabeled samples are selected based on the correctly classified samples, and therefore the improvement in classification performance is still limited for the difficult samples. Hence, our future work is to effectively distinguish the difficult samples. On the other hand, the main contribution of the proposed method is view partitioning. Although self-training, co-training, and tri-training are all well-known semi-supervised classification frameworks, we introduce view partitioning into co-training for two reasons: (1) Multi-view learning is usually combined with co-training in most published multi-view-based remote sensing image classification methods; since the main contribution of the proposed method is view partitioning, implementing it within co-training makes its role easier to understand. (2) Self-training mainly focuses on single-view learning, and tri-training usually uses bootstrap sampling to generate three different views, whereas for co-training, view construction is an essential step. Hence, we evaluated the proposed method within the co-training framework. In fact, the proposed view partitioning can also be utilized in self-training and tri-training, which may achieve even better results.