Superpixel-Guided Layer-Wise Embedding CNN for Remote Sensing Image Classification

Irregular spatial dependency is one of the major characteristics of remote sensing images, which brings challenges for classification tasks. Deep supervised models such as convolutional neural networks (CNNs) have shown great capacity for remote sensing image classification. However, they generally require a huge labeled training set for the fine tuning of a deep neural network. To handle the irregular spatial dependency of remote sensing images and mitigate the conflict between limited labeled samples and training demand, we design a superpixel-guided layer-wise embedding CNN (SLE-CNN) for remote sensing image classification, which can efficiently exploit the information from both labeled and unlabeled samples. With the superpixel-guided sampling strategy for unlabeled samples, we achieve an automatic determination of the neighborhood covering for a spatial dependency system and thus adapt to real scenes of remote sensing images. In the designed network, two types of loss are combined for the training of the CNN: a supervised cross entropy on labeled samples and an unsupervised reconstruction cost on unlabeled samples. Our experiments are conducted with three types of remote sensing data, including hyperspectral, multispectral, and synthetic aperture radar (SAR) images. The designed SLE-CNN achieves excellent classification performance in all cases with a limited labeled training set, suggesting its good potential for remote sensing image classification.


Introduction
Remote sensing images generally refer to the pictorial ground information acquired by satellite or aircraft sensor technologies. With sufficient spectral and spatial information, remote sensing images have played an important role in many applications, such as urban planning, agriculture management, climate monitoring, and military affairs [1][2][3], and for these applications, classification with fine accuracy is essential [4]. Different from common optical images, remote sensing images are more difficult to classify because of their characteristics, such as having more spectral bands, rich spatial information, low spatial resolution, and so on [5,6]. Furthermore, remote sensing images usually have complex ground scenes and irregular objects; thus they are characterized by irregular spatial dependency [7], which further causes challenges for classification tasks.
Many methods have been designed for the classification of remote sensing images, and they can be roughly grouped into three categories, i.e., supervised, semi-supervised and unsupervised methods [8], according to the manner in which information is explored from labeled or unlabeled samples. On the one hand, unsupervised methods, generally using clustering strategies such as fuzzy clustering [9] and fuzzy C-means algorithms [10], which attempt to explore patterns from unlabeled samples [11], have proved to be efficient but incapable of bridging the gap between clusters and classes [12]. On the other hand, supervised classifiers, such as the support vector machine (SVM) [13,14], multinomial logistic regression [15,16] and artificial neural networks (ANNs) [17,18], which learn from labeled samples to obtain prior knowledge, have demonstrated impressive performance. However, supervised classifiers rely heavily on the quantity and quality of labeled samples [19][20][21]. In real scenarios, sample labeling is usually difficult, time-consuming, and expensive [22]. Therefore, the available labeled samples are often insufficient, which leads to the occurrence of the Hughes phenomenon [23] and increases the possibility of overfitting [24,25].
Semi-supervised learning is usually used to relieve the conflict between the training demand and a limited labeled sample set. It aims to make use of both limited labeled samples and abundant unlabeled samples, binding together unsupervised and supervised learning [26]. Many semi-supervised approaches exist in the literature [27]. For instance, generative semi-supervised learning methods use the conditional density to determine the labels of unlabeled samples [28][29][30]. However, these methods generally operate under the assumption that the unlabeled samples follow a certain distribution, which may limit their performance [11]. Wrapper methods include self-training [22,31,32] and co-training [22,[33][34][35]. The former trains the classifier iteratively with new training samples labeled by the classifier itself, while the latter employs several classifiers trained on independent subsets of samples, and the unlabeled samples with high reliability are then used to train another classifier. Self-training schemes may reinforce their own poor predictions, while co-training algorithms demand that the samples can be divided into independent subsets [11]. Low-density separation algorithms, such as the transductive SVMs [36][37][38], which perform the classification by maximizing the margin over labeled and unlabeled samples, also suffer from poor generalization ability. Graph-based approaches construct graphs to connect similar observations and spread label information among neighbors by minimizing an energy function [12,[39][40][41][42], which also incurs some problems, such as sensitivity to the graph structure [43,44].
Recently, deep learning structures have attained great success owing to their outstanding generalization capacity compared with traditional shallow structures [45]. Some recent developments focus on semi-supervised learning, which exploits both labeled and unlabeled information to tackle the issue of overfitting, i.e., a limited amount of labeled information versus the huge number of parameters involved [46]. This new trend has been successfully applied to remote sensing image classification. For instance, Ma et al. [11] use a deep hierarchical structure to learn highly discriminative representations and pre-label unlabeled samples, where multi-decision schemes are formed to update the labeled training dataset and thus realize semi-supervised learning. However, this kind of purely discriminative, self-learning-style semi-supervised method often relies on iterative training and is thus time- and resource-consuming. He et al. [8] apply popular generative adversarial networks (GANs) to study the latent representation of the input data; their model resorts to regularization techniques to explore the information in unlabeled samples and hence assist the discriminative classification task. In such generative models, unsupervised embeddings or hidden representations are often used to help supervised objectives [47]. Nevertheless, such latent variable models are still not well suited to matching hidden representations with the supervised tasks at hand. Rasmus et al. [48] propose a Ladder Network combining supervised learning with unsupervised learning in deep neural networks, which needs only a small number of labeled samples. However, the method lacks the mining of spatial information from unlabeled samples, which weakens its capacity and applications, especially for remote sensing images.
Remote sensing images have complex ground scenes and irregular objects; thus, they are naturally characterized by irregular spatial dependency [7], causing difficulties for classification. Deep supervised models such as CNNs are robust classifiers but require many training samples for fine tuning the network parameters, which conflicts with the reality that only a small number of labeled samples are available. To address these challenges, in this paper, we design a superpixel-guided layer-wise embedding CNN framework (SLE-CNN) for remote sensing image classification. It can automatically determine the neighborhood covering for a spatial dependency system and thus provide more high-quality a priori information for labels, which can improve the training performance of the deep network in a semi-supervised manner. We use a superpixel-based random sampling strategy to select unlabeled samples, since superpixels are adaptive to real scenes of remote sensing images [7]. The involved layer-wise embedding CNN fuses a deep autoencoder (AE) and a CNN in a layer-wise embedding fashion where the unsupervised reconstruction cost and the supervised cross entropy loss are optimized simultaneously, thus achieving an end-to-end structure. This structure can use information from both labeled and unlabeled samples and efficiently reduce the overfitting risks, therefore adapting well to semi-supervised tasks. Moreover, instead of applying the unsupervised auxiliary tasks only as part of a pre-training procedure followed by normal supervised learning, the layer-wise embedding CNN shares the hidden representations between the unsupervised generative representation and its discriminative counterpart, thus helping more informative unsupervised features to be learned for a discriminative purpose. All the above aspects contribute to better classification performance for remote sensing images.
The main research objectives of this paper can be identified as follows:
• Considering the fact that remote sensing images are characterized by irregular spatial dependency, we introduce the superpixel sampling strategy to guide the use of unlabeled samples, which achieves an automatic determination of the neighborhood covering for a spatial dependency system and thus adapts to real scenes of remote sensing images. With the aid of these highly representative and informative unlabeled samples, the training process is boosted, leading to better classification results.

• To demonstrate the performance of our framework for classification tasks on different types of remote sensing data, we conducted experiments to provide the latest results on benchmark problems. In addition, we compared our framework with several typical semi-supervised and supervised methods, which also verifies the effectiveness of the proposed framework.
The remainder of the paper is organized as follows. Section 2 presents our newly developed framework in detail. Experimental results with hyperspectral, multispectral and SAR image data are shown in Section 3. Some discussions with extra experiments are provided in Section 4. Finally, Section 5 draws some conclusions.

Methodology
The block diagram of the proposed classification framework for remote sensing images is shown in Figure 1. The core of the framework is the SLE-CNN (shown in the purple block in Figure 1), which is mainly composed of two sequential steps: heuristic sampling based on superpixel segmentation and the layer-wise embedding CNN. To exploit the full potential of remote sensing data, both limited labeled samples and sufficient unlabeled samples are used to construct the training dataset. Considering the irregular spatial dependency in remote sensing images, a random sampling strategy based on superpixels is employed to guide the selection of unlabeled samples. Superpixels are adaptive to real scenes of remote sensing images, which can improve the performance of the framework given the variations in the spatial characteristics of remote sensing images. Samples close to superpixel boundaries, viewed as samples likely to lie near class boundaries and hence difficult to distinguish, are of high representativeness and entropy and can strengthen the generalization capacity of classifiers. Unlabeled samples, along with a limited number of labeled samples, are subsequently organized in the form of patches and input to the layer-wise embedding CNN to fine tune the deep network and search for the best generalization of the input data. At last, classification maps can be obtained by inputting patches from remote sensing images to the fine-tuned layer-wise embedding CNN.
In this section, we thoroughly present the structure of the SLE-CNN. In the first two parts, we introduce the basic background and knowledge of the proposed method, including a detailed description of the proposed superpixel-based random sampling strategy for unlabeled samples in Section 2.1 and the structure of the autoencoder, one of the basic architectures used in our classifier, in Section 2.2. At last, in Section 2.3, we illustrate the whole structure of the designed SLE-CNN in detail.


Superpixel-Based Random Sampling
To ensure both representative ability and efficiency, we need a random sampling strategy that selects just a portion of the unlabeled dataset instead of using all of it during the training process. However, a purely random strategy is not enough to exploit the potential of the unlabeled samples. To deal with this, we design a superpixel-based random sampling strategy.
Pixels close to class boundaries usually have a higher probability of being misclassified, which makes them informative for sample collection. After a preliminary segmentation, we can actually obtain a strong prior in both the spectral and spatial domains. In addition, samples close to superpixel boundaries are more likely to be those near the class boundaries. We can strengthen the generalization capacity of classifiers by taking these high-entropy samples into account [49].
Considering the irregular spatial dependency of remote sensing images, we use superpixel segmentation to produce the segmentation results in view of its adaptive ability to different scenes of remote sensing images.
The Simple Linear Iterative Clustering (SLIC) [50,51] algorithm is used to obtain superpixels, mainly considering its computational efficiency compared to other algorithms. SLIC, a simple and efficient segmentation method based on k-means clustering, generates superpixels by clustering pixels in both the spectral and spatial domains, with each pixel linked to a feature vector ψ(p, q):

ψ(p, q) = [I(p, q), αp, αq]

where I(p, q) is the spectral vector at position (p, q), and α = c/S is a coefficient that balances the spectral and spatial components of the vector. S is the nominal size of superpixels, and c is a variable that controls the compactness of superpixels.
The algorithm starts by dividing the image into A × B tiles, with A = ⌈iw/rs⌉ and B = ⌈ih/rs⌉, where iw and ih are the number of rows and columns in the image, respectively, rs is the expected spatial size of superpixels, and ⌈·⌉ is the ceiling function, which maps a number to the least integer greater than or equal to it. The initial cluster centers are (p_i, q_j), with p_i = i · iw/A and q_j = j · ih/B. To avoid placing centers at edges and selecting noisy pixels, each cluster center is moved to the position with the lowest gradient within a 3 × 3 window. The gradient is defined as:

G(p, q) = ‖I(p + 1, q) − I(p − 1, q)‖² + ‖I(p, q + 1) − I(p, q − 1)‖²

where ‖·‖ is the L2 norm.
Then the superpixels are obtained by k-means clustering, where each pixel is assigned to the nearest initial cluster center, and a new center is recomputed as the average of the feature vectors of pixels belonging to the cluster.The process is iteratively repeated until convergence.After the k-means clustering, the SLIC algorithm assigns disjoint segments to the largest neighboring cluster to enforce connectivity.
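The clustering step above can be sketched in a few lines of NumPy. The following is a minimal single-channel illustration of SLIC-style k-means over the feature vectors ψ(p, q) = [I(p, q), αp, αq]; the gradient-based center perturbation and connectivity enforcement are omitted, and the function name `simple_slic` and its defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def simple_slic(image, rs=8, compactness=10.0, n_iters=5):
    """Simplified SLIC-style superpixel clustering on a single-channel image.

    Each pixel gets the feature vector psi(p, q) = [I(p, q), a*p, a*q]
    with a = compactness / rs; k-means then assigns it to the nearest
    grid-initialized cluster center.
    """
    h, w = image.shape
    a = compactness / rs
    # Initial cluster centers on a regular grid of tile centers.
    centers = np.array([[image[p, q], a * p, a * q]
                        for p in range(rs // 2, h, rs)
                        for q in range(rs // 2, w, rs)], dtype=float)
    pp, qq = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    feats = np.stack([image, a * pp, a * qq], axis=-1).reshape(-1, 3)
    for _ in range(n_iters):
        # Assign every pixel to its nearest center (squared distance).
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean feature of its members.
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels.reshape(h, w)
```

On an image with a sharp spectral step, the resulting superpixel boundaries follow the step rather than the initial grid, which is the behavior the sampling strategy below relies on.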
Based on the SLIC algorithm, we can sample the most representative unlabeled data. With a simple random sampling strategy, samples are selected randomly and may be biased, which cannot meet the needs of semi-supervised learning [52]. To select highly informative unlabeled samples automatically and reduce the need to enlarge the training set, we introduce a superpixel segmentation-based random sampling strategy, which can also be regarded as a process of mining samples that are hard to distinguish:

• Images are segmented into superpixels, and all pixels in the images are recorded as set A;
• Randomly choose a part of set A as set B;
• Pixels located on the boundaries of superpixels are detected and recorded as set C;
• For each pixel in set B, if its spatial distance to any pixel of the same superpixel in set C is less than or equal to k (in pixel units), put it in the candidate list.
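The four-step procedure above can be sketched as follows. This is an illustrative NumPy version, not the authors' code: it detects boundary pixels by 4-neighbor label changes and, as an assumption, measures the "spatial distance ≤ k" test with the Chebyshev distance; `superpixel_guided_sampling` and `sample_frac` are invented names.

```python
import numpy as np

def superpixel_guided_sampling(seg, sample_frac=0.3, k=1, rng=None):
    """Candidate unlabeled positions near superpixel boundaries.

    seg: 2-D array of superpixel labels. Returns (row, col) positions from
    a random subset of pixels that lie within k pixels of a boundary pixel
    of the same superpixel.
    """
    rng = np.random.default_rng(rng)
    h, w = seg.shape
    # Set C: pixels on superpixel boundaries (label differs from a 4-neighbor).
    boundary = np.zeros((h, w), dtype=bool)
    boundary[:-1, :] |= seg[:-1, :] != seg[1:, :]
    boundary[1:, :] |= seg[1:, :] != seg[:-1, :]
    boundary[:, :-1] |= seg[:, :-1] != seg[:, 1:]
    boundary[:, 1:] |= seg[:, 1:] != seg[:, :-1]
    c_coords = np.argwhere(boundary)
    # Set B: a random fraction of all pixel positions (set A).
    coords = np.argwhere(np.ones((h, w), dtype=bool))
    b_idx = rng.choice(len(coords), size=int(sample_frac * len(coords)),
                       replace=False)
    candidates = []
    for r, c in coords[b_idx]:
        # Keep the pixel if a boundary pixel of the SAME superpixel is near.
        same = c_coords[seg[c_coords[:, 0], c_coords[:, 1]] == seg[r, c]]
        if len(same) and np.abs(same - [r, c]).max(axis=1).min() <= k:
            candidates.append((int(r), int(c)))
    return candidates
```

With k = 0 and a full sample, the candidate list reduces to exactly the boundary pixels themselves; larger k widens the band around each superpixel boundary.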
This whole sampling procedure is visualized in Figure 2 (the procedure of superpixel-based random sampling). Superpixels are introduced to guide the selection of unlabeled samples to handle the irregular spatial dependency of remote sensing images. Under this strategy, more representative and informative samples come from pixels located on superpixel boundaries, since they are more likely to be close to the class boundaries and easier to misclassify.

Autoencoder
An autoencoder (AE) is an artificial neural network used for the unsupervised learning of efficient codings [53,54]. An AE aims to learn a representation (encoding) for a set of data.
Architecturally, the simplest form of an AE is a feedforward, non-recurrent neural network very similar to the multilayer perceptron (MLP): it has an input layer, an output layer and one or more hidden layers connecting them, but the output layer has the same number of nodes as the input layer, and its purpose is to reconstruct its own inputs.
An AE always consists of two parts, the encoder and the decoder. In the simplest case, where there is one hidden layer, the encoder stage of an AE takes the input x and maps it to r:

r = θ(Wx + b)

where the image r is usually referred to as the code, latent variables, or latent representation. θ is an element-wise activation function such as a sigmoid function or a rectified linear unit (ReLU), W is a weight matrix, and b is a bias vector. After the encoder, the decoder stage of the AE maps r to the reconstruction x̂ of the same shape as x:

x̂ = θ′(W′r + b′)

where θ′, W′ and b′ for the decoder may in general differ from the corresponding θ, W and b for the encoder, depending on the design of the AE. AEs are trained to minimize a reconstruction cost (such as the squared error):

L(x, x̂) = ‖x − x̂‖²

which is usually averaged over some input training set. Denoising AEs take a partially corrupted input during training and learn to recover the original clean input. This technique was introduced with a specific approach to good representation [55]: a good representation is one that can be obtained robustly from a noisy input and that is useful for recovering the corresponding clean input.
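The encoder/decoder maps and the denoising variant can be sketched numerically. This is a toy one-hidden-layer AE in NumPy with illustrative shapes and untrained random weights (no gradient descent is shown); it only demonstrates the forward passes and the squared-error reconstruction cost described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative dimensions: 8-D input, 3-D code.
d, h = 8, 3
W, b = rng.normal(scale=0.1, size=(h, d)), np.zeros(h)    # encoder params
W2, b2 = rng.normal(scale=0.1, size=(d, h)), np.zeros(d)  # decoder params

def encode(x):
    return sigmoid(W @ x + b)           # r = theta(W x + b)

def decode(r):
    return sigmoid(W2 @ r + b2)         # x_hat = theta'(W' r + b')

def reconstruction_cost(x):
    x_hat = decode(encode(x))
    return np.sum((x - x_hat) ** 2)     # squared-error cost L(x, x_hat)

# Denoising variant: corrupt the input, score against the CLEAN target.
x = rng.random(d)
x_noisy = x + rng.normal(scale=0.1, size=d)
cost = np.sum((x - decode(encode(x_noisy))) ** 2)
```

Training would then adjust W, b, W2, b2 to minimize this cost; the denoising version differs only in that the encoder sees the corrupted input while the target stays clean.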

Superpixel-Guided Layer-Wise Embedding CNN
Remote sensing images have complex ground scenes and irregular objects; thus, irregular spatial dependency is one of their major characteristics, which brings about challenges for classification tasks. Though prevailing deep supervised models usually have good feature generalization capacities given sufficient training samples, the situation deteriorates rapidly when labeled data is limited [56]. This dilemma originates from the conflict between the huge parameter volume and insufficient training samples. To handle the irregular spatial dependency and relieve the training demand for labeled samples, we design a superpixel-guided layer-wise embedding CNN framework that assists the optimization process for remote sensing image classification by introducing the use of unlabeled data.
Since the goal of using unlabeled data for unsupervised learning is actually a type of regularization for supervised learning, we expect our supervised tasks to perform better, which demands that the hidden representations shared by the supervised and unsupervised parts be more robust. To achieve this, we need on the one hand to feed more informative training samples to learn the best representation, and on the other hand to design a more powerful network structure to capture the internal characteristics of the input data. For the latter, we establish a layer-wise embedding CNN structure to efficiently learn the most discriminative features for the final classification. For the former, we use the superpixel-based random sampling strategy (as introduced in Section 2.1) to heuristically search for useful and informative unlabeled samples without enlarging the labeled part of the training dataset. In particular, superpixels are usually spatially irregular subregions, but the pixels inside them are homogeneous, which amounts to an automatic determination of the neighborhood covering for a spatial dependency system in a data-dependent manner [7]. Thus, superpixels are adaptive to real scenes of remote sensing images considering their irregular spatial dependency.
The whole proposed SLE-CNN structure is shown in Figure 3, where all the inputs are organized as patches centered on pixels. Superpixel-guided patches, i.e., unlabeled samples selected under the superpixel-based random sampling strategy shown in the left purple block of the figure, along with labeled patches, are input to the layer-wise embedding CNN in the right part of the figure to fine tune the network. To get a classification map for a remote sensing image, the patch from each pixel needs to be input to the fine-tuned layer-wise embedding CNN so that a class label at each pixel can be obtained, as shown in the upper part of the figure with an orange-red arrow.
From the denoising AE's point of view, the layer-wise embedding CNN can be constructed from two parts in sequence, where two versions of encoder architectures, one clean encoder (the green block at the top of Figure 3) and one noisy encoder (the blue block at the bottom of Figure 3), are followed by a mutual decoder architecture (the yellow block in the middle of Figure 3). Here, noise is injected into the hidden layers of the noisy encoder to obtain better feature generalization, similar to the common regularization technique dropout [57]. Specifically, we use a deep spectral-spatial CNN structure, which can make full use of the spectral and spatial information of remote sensing images, as the architecture of the encoders to enhance the representative capacity and increase the supervised discriminative power. The parts of the structure are associated through skip connections and layer-wise embedding structures. The former technique strengthens the representative ability of the learned feature in the reconstruction stage by superposing the noisy encoder part upon the decoder part layer-wisely. The latter serves as an extra supervision for the joint optimization process to achieve a strong regularization for raw remote sensing images and promote the discriminative ability.
Consider a dataset with labeled samples {x(m), y(m) | 1 ≤ m ≤ N} and unlabeled samples {x(m) | N + 1 ≤ m ≤ N + M}, where M ≫ N. The goal for the classifier is to learn a function that models P(y|x) by using both the labeled and the unlabeled samples. Here, the objective function for the training of the layer-wise embedding CNN is cast as the sum of the supervised cross entropy (COST1 in Figure 3) computed on labeled patches from the noisy encoder and the unsupervised reconstruction cost (COST2(l) in Figure 3) computed on superpixel-guided unlabeled patches at each layer of the decoder. Since all layers of the noisy encoder are corrupted by noise, another clean encoder path with shared parameters is responsible for providing the clean reconstruction targets. The whole structure is optimized by traditional backpropagation gradient descent.

Figure 3 shows the structure of the superpixel-guided layer-wise embedding CNN. The layer-wise embedding CNN consists of two encoders (the clean one, the green block at the top of the figure, and the noisy one, the blue block at the bottom) and one decoder (the yellow block in the middle). The objective cost function COST for fine tuning comes from the supervised cross entropy (COST1) and the unsupervised reconstruction cost (COST2(l)). The size of the input patch x is set to 13 × 13, representing the neighborhood centered on the objective pixel to be classified. Superpixel-guided unlabeled patches, obtained from the superpixel-based random sampling strategy (purple block on the left), are the main input training samples for the layer-wise embedding CNN (black arrows) and are responsible for COST2(l). Labeled patches are input to the noisy encoder in the direction of the orange arrow and are used for the calculation of COST1. To obtain the classification maps of remote sensing images, each patch is input to the clean encoder of the fine-tuned layer-wise embedding CNN at the test stage to output a clean class label.
At the end of the encoder path, we obtain the one-hot encoded classification vector through a fully connected layer combined with the SoftMax operation. Please note that the ultimate class label for each input patch of remote sensing images at the test stage comes from the clean output of the clean encoder, while the noisy output of the noisy encoder is only used for calculating the supervised cross entropy.
Each part of the layer-wise embedding CNN is explained in detail in the following.

General Steps for Constructing Layer-Wise Embedding CNN
Based on the structure of a denoising AE, we combine each noisy encoder layer and the corresponding decoder layer via vertical skip connections, where the two signals are fused by a denoising function to reconstruct the layer in the decoder. This technique helps the higher layers focus on extracting more abstract and task-specific features, which facilitates feature extraction from complex remote sensing images. Meanwhile, a clean encoder is trained in a feedforward fashion to evaluate the reconstruction effect [58].
The layer-wise embedding CNN can be defined as follows (suppose we have a total of L layers in both the encoder and decoder parts):

(ỹ, r̃(1), …, r̃(L)) = Encoder_noisy(x̃)
(y, r(1), …, r(L)) = Encoder_clean(x)
(x̂, r̂(1), …, r̂(L)) = Decoder(r̃(1), …, r̃(L))

where Encoder_noisy(·), Encoder_clean(·) and Decoder(·) represent the noisy encoder, the clean encoder, and the decoder, respectively. x, x̃ and x̂ are the clean, noisy, and reconstructed input patches, respectively. r(l), r̃(l), and r̂(l) are the clean hidden representation, its noisy version, and its reconstructed version at layer l. y and ỹ, outputs after the SoftMax operation, are the clean class label and the noisy class label, respectively. The noisy ỹ is used to calculate the supervised cross entropy during the training process as described in Equation (17), while the classification map is obtained from the clean y at the test stage.
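The three-path wiring above (clean encoder, noisy encoder with shared weights, top-down decoder) can be made concrete with a toy NumPy sketch. The `layer` transform and the averaging combinator are deliberately trivial stand-ins for the real convolutional layers and denoising function; all names and shapes here are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 2  # number of layers in this toy sketch

def layer(s):
    # Stand-in for one conv + pool + batchnorm layer (parameter-free toy).
    return np.tanh(s.mean(axis=-1, keepdims=True) + s)

def encoder(x, noise_std=0.0):
    """Run the (shared-weight) encoder; noise_std > 0 gives the noisy path.

    Returns the final post-activation (stand-in for the pre-SoftMax output)
    and the list of per-layer representations r(l) or r~(l).
    """
    reps, s = [], x
    for _ in range(L):
        r = layer(s)
        if noise_std:                      # noise injected on the noisy path only
            r = r + rng.normal(scale=noise_std, size=r.shape)
        reps.append(r)
        s = np.maximum(r, 0.0)             # ReLU
    return s, reps

def decoder(noisy_reps):
    """Reconstruct representations top-down from the noisy encoder's r~(l)."""
    recon = [None] * L
    u = noisy_reps[-1]
    for l in reversed(range(L)):
        recon[l] = 0.5 * (u + noisy_reps[l])  # toy combinator g(r~, u)
        u = recon[l]
    return recon

x = rng.random((4, 4))                            # toy "patch"
out_clean, clean_reps = encoder(x, noise_std=0.0) # clean path: test-time output
out_noisy, noisy_reps = encoder(x, noise_std=0.1) # noisy path: for COST1
recon_reps = decoder(noisy_reps)                  # decoder: for COST2(l)
```

The clean path's per-layer outputs serve as reconstruction targets for the decoder, mirroring the role the clean encoder plays in the objective described later in this section.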

CNN Based Encoder for Supervised Learning
To use both the spectral and spatial information in remote sensing images and enhance the representative capacity, a spectral-spatial CNN structure is built into the encoder architecture in the forward path. Overall, the encoder consists of convolution, max pooling, batch normalization, noise injection (for the noisy encoder) and activation operations for each layer. At the end of the encoders, the outputs y and ỹ are obtained through the SoftMax operation (see Figure 3).
Firstly, 3-D convolution conv(l)(·) and max pooling maxPooling(·) transformations from layer (l − 1) to layer l are applied to s̃(l−1), the post-activation at layer (l − 1), to obtain the pre-normalization r̃(l)_pre:

r̃(l)_pre = maxPooling(conv(l)(s̃(l−1)))

Batch normalization is then applied to r̃(l)_pre with the mini-batch mean mean(r̃(l)_pre) and standard deviation stdv(r̃(l)_pre), and isotropic Gaussian noise n is added to compute the pre-activation r̃(l):

r̃(l) = (r̃(l)_pre − mean(r̃(l)_pre)) / stdv(r̃(l)_pre) + n

Then, through a nonlinear activation function such as the ReLU, defined as φ(x) = max(0, x), we obtain s̃(l), the post-activation at layer l, as the input for the next layer:

s̃(l) = φ(γ(l) ⊙ (r̃(l) + β(l)))

where β(l) and γ(l) are trainable parameters responsible for shifting and scaling. Please note that the above equations describe the noisy encoder, with noisy s̃ and r̃. If we remove the noise, we obtain the clean version of the encoder with clean s and r.
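One encoder layer can be sketched numerically as follows. A simple 1-D max pool stands in for the conv + pool transform (the real model uses 3-D convolutions), and `encoder_layer` is an illustrative name; the point is the order of operations: pool/conv, mini-batch normalization, Gaussian noise, then scaled and shifted ReLU.

```python
import numpy as np

def encoder_layer(s_prev, gamma, beta, noise_std=0.1, rng=None):
    """One noisy-encoder layer on a flattened toy feature vector."""
    rng = np.random.default_rng(rng)
    # r_pre = maxPooling(conv(s_prev)) -- toy stand-in: pairwise max pool.
    r_pre = s_prev.reshape(-1, 2).max(axis=1)
    # Batch normalization with mini-batch mean/stdv, then Gaussian noise n.
    r = (r_pre - r_pre.mean()) / (r_pre.std() + 1e-8)
    r = r + rng.normal(scale=noise_std, size=r.shape)
    # Trainable shift/scale followed by ReLU: s = phi(gamma * (r + beta)).
    return np.maximum(gamma * (r + beta), 0.0), r

# With noise_std=0 this is the clean encoder layer.
s, r = encoder_layer(np.arange(8.0), gamma=1.0, beta=0.0, noise_std=0.0, rng=0)
```

Setting `noise_std=0` recovers the clean-encoder computation, exactly as the text notes.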

Vertical Connection and Vanilla Combinator-Based Denoising Function for Unsupervised Learning
In the backward path, deconvolution, unpooling and batch normalization are performed at each layer of the decoder. In addition, a vanilla combinator-based denoising function is used to combine the signal from the noisy encoder and the signal in the decoder, which realizes the vertical connection. This technique strengthens the representative ability of the learned feature in the reconstruction stage.
For each layer of the decoder, deconvolution deconv(l)(·) and unpooling upSampling(·) operations from layer l + 1 to layer l are applied to r̂(l+1):

u(l+1)_pre = upSampling(deconv(l)(r̂(l+1)))

Batch normalization is then applied to u(l+1)_pre to get u(l+1):

u(l+1) = (u(l+1)_pre − mean(u(l+1)_pre)) / stdv(u(l+1)_pre)

After normalization correction, the signal from layer r̂(l+1) and the noisy r̃(l) via the vertical connection are combined into the reconstruction r̂(l) through a denoising process:

r̂(l) = g(r̃(l), u(l+1)) (14)

where g(·, ·) is the vanilla combinator-based denoising function. It combines the lateral u(l+1) and the vertical r̃(l) connections in an element-wise fashion.
Here, the function g(·, ·) aims to achieve the lowest reconstruction cost. Considering the conditional distribution P(r(l) | r(l+1)) that we intend to model, the optimal functional form of g is linear with respect to r̃(l) when P(r(l) | r(l+1)) is Gaussian. The parametrization of the denoising function is therefore:

g(r̃(l), u(l+1)) = (r̃(l) − ω(u(l+1))) ⊙ v(u(l+1)) + ω(u(l+1))

where we model both ω(u) and v(u) with a multilayer nonlinear form: ω(u) = t1 sigmoid(t2 u + t3) + t4 u + t5 and v(u) = t6 sigmoid(t7 u + t8) + t9 u + t10, where t1 to t10 are linear coefficients. For a given u, r̂ is linearly related to r̃ under this parametrization, while both v and ω depend nonlinearly on u.
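The combinator is small enough to write out directly. The sketch below implements g(r̃, u) = (r̃ − ω(u)) ⊙ v(u) + ω(u) element-wise with the sigmoid-based forms of ω and v given above; the concrete values of t1…t10 used here are arbitrary examples (in the model they are learned).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def vanilla_combinator(r_noisy, u, t):
    """g(r~, u) = (r~ - omega(u)) * v(u) + omega(u), element-wise.

    t holds the ten coefficients t1..t10 (zero-indexed here as t[0]..t[9]).
    """
    omega = t[0] * sigmoid(t[1] * u + t[2]) + t[3] * u + t[4]
    v = t[5] * sigmoid(t[6] * u + t[7]) + t[8] * u + t[9]
    return (r_noisy - omega) * v + omega

# Example coefficients: omega(u) = v(u) = sigmoid(u).
t = np.array([1., 1., 0., 0., 0., 1., 1., 0., 0., 0.])
r_hat = vanilla_combinator(np.ones(4), np.zeros(4), t)
```

For fixed u the output is linear in r̃, matching the Gaussian-optimality argument: u gates how much of the noisy lateral signal is trusted versus the top-down prediction ω(u).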

Overall Objective Function Formulation
Finally, the objective function for the layer-wise embedding is a balance of the supervised cross entropy from the noisy encoder and the unsupervised reconstruction cost at each layer of the decoder. Since all layers of the noisy encoder are corrupted by noise for the purpose of obtaining better feature generalization, the clean encoder provides the clean reconstruction targets as the reference for the decoder.
The objective function, COST, is defined as follows:

COST = COST1 + COST2 (16)

where COST1 is the supervised cross entropy from the noisy encoder and COST2 is the unsupervised reconstruction cost from the decoder. COST1 (with N labeled patches) is calculated as the sum of the negative log probabilities of the noisy output ỹ(m) matching the target output y*(m) given the input x(m):

COST1 = − Σ_{m=1}^{N} log P(ỹ(m) = y*(m) | x(m)) (17)

And COST2 (with M superpixel-guided unlabeled patches) represents the sum of the reconstruction costs from all L layers:

COST2 = Σ_{l=1}^{L} λ_l COST2(l) (18)

where COST2(l) is the layer-wise embedding unsupervised reconstruction cost of decoder layer l, and λ_l is a layer-wise coefficient; the denoising intensity of each layer can be tuned by changing λ_l. COST2(l) in Equation (18) is formalized as:

COST2(l) = Σ_{m=N+1}^{N+M} ‖r(l)(m) − r̂(l)(m)‖² (19)

where r̂(l) is batch normalized using the mean and standard deviation of r̃(l)_pre from the noisy encoder part.
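The combined objective can be sketched as a single function over the network's outputs. This is an illustrative NumPy version of Equations (16)-(19) with invented argument names; in practice the quantities come from the noisy encoder, the clean encoder, and the decoder, and the whole expression is minimized by backpropagation.

```python
import numpy as np

def total_cost(y_noisy_probs, y_true, clean_reps, recon_reps, lambdas):
    """COST = COST1 + sum_l lambda_l * COST2(l)   (toy sketch).

    y_noisy_probs: (N, n_classes) SoftMax outputs of the noisy encoder.
    y_true:        (N,) integer labels of the N labeled patches.
    clean_reps / recon_reps: per-layer lists of (M, d_l) arrays for the
    M unlabeled patches (clean targets r(l) vs. reconstructions r_hat(l)).
    """
    # COST1: sum of negative log probabilities of the target class.
    cost1 = -np.sum(np.log(y_noisy_probs[np.arange(len(y_true)), y_true]))
    # COST2: lambda-weighted squared reconstruction error per decoder layer.
    cost2 = sum(lam * np.sum((r - r_hat) ** 2)
                for lam, r, r_hat in zip(lambdas, clean_reps, recon_reps))
    return cost1 + cost2

probs = np.array([[0.5, 0.5]])                 # one labeled patch, two classes
labels = np.array([0])
reps = [np.ones((2, 3))]                       # one decoder layer, M = 2
recon = [np.ones((2, 3))]                      # perfect reconstruction
c = total_cost(probs, labels, reps, recon, lambdas=[1.0])
```

With a perfect reconstruction the unsupervised term vanishes and only the cross entropy remains, which is how the λ_l weights trade off the two parts of the objective.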
The feedforward pass of the layer-wise embedding CNN is listed in Algorithm 1, where batchnorm(·) means batch normalization, and activation(·) is the nonlinear activation function, such as the ReLU.

Experimental Setup
In this section, we use three different types of remote sensing images to evaluate our designed SLE-CNN framework. All of them are common benchmarks for their respective types. A series of experiments has been conducted to make a comprehensive comparison among various methods. All experiments are carried out with the same image pre-processing operations to guarantee fairness.

Dataset Description
Hyperspectral images (HSIs), acquired by hyperspectral imaging sensors, consist of hundreds or even thousands of continuous spectral bands, carrying abundant information. In the experiments, the publicly available University of Pavia data (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes) are employed as a benchmark HSI dataset. The image was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over an urban area at the University of Pavia, Italy, in 2002. It is composed of 610 × 340 pixels with a spatial resolution of 1.3 m and contains 115 bands; after removing 12 noise bands, 103 bands remain. The ground truth map contains 9 classes. Figure 4 shows the false color image as well as the ground truth data. There are 9 classes of interest, and the detailed information of each class is listed in Table 1.

Multispectral images (MSIs), acquired by multispectral sensors, contain several useful discontinuous spectral bands. One image (http://www.recogna.tech), covering the area of Itatinga, SP, Brazil, obtained by Landsat 5 TM, one of the most popular multispectral sensors, is used in the experiments. The image consists of 492 × 526 pixels and 3 bands. The ground truth map contains 6 classes; Figure 5 shows the ground truth data as well as the image. The 6 classes of interest and the detailed information of each class are listed in Table 2.

SAR imagery, acquired by synthetic aperture radar, carries a lot of speckle noise. Here, one image collected by the Electromagnetics Institute Synthetic Aperture Radar (EMISAR) (http://www.space.dtu.dk/english/Research/Research_divisions/Microwaves_and_Remote_Sensing/Sensors/emisar) is used. It was captured over a vegetated region in Foulum, Denmark. The image is composed of 421 × 300 pixels and 41 bands. Its ground truth map contains 5 classes; Figure 6 shows the ground truth data as well as the image. The 5 classes of interest and the detailed information of each class are listed in Table 3.

Experiments
To evaluate the designed SLE-CNN framework for remote sensing image classification, we compare it with the supervised version of our model and several other classification algorithms, i.e., SVM, Laplacian SVM (LapSVM), self-learning (based on the Breaking Ties or BT strategy) SVM (SL SVM), and the convolutional neural network-autoencoder (CNN-AE). Therein, the supervised version of our framework, trained with the cross entropy only, has the same structure as the encoder part of the semi-supervised version to ensure comparability. Specifically, SVM is also a supervised method, while LapSVM and SL SVM are semi-supervised classifiers. The LapSVM [59], a graph-based semi-supervised learning method, introduces an additional manifold regularization term on the geometry of both the labeled and the unlabeled data using the graph Laplacian, and has been demonstrated to be an effective approach [60,61]. Self-learning is one of the traditional wrapper methods of semi-supervised learning; here, SVM is chosen as the probabilistic classifier. In addition, the BT active learning algorithm [62], which searches for the samples with the smallest difference between the two most probable classes, is combined with the self-learning strategy to serve as an adaptive machine-machine approach. To compare with a contextual method, SL SVM is performed on both the original spectral data and on Gabor textures (denoted SL-Gabor SVM) [63]. CNN-AE is an approach based on convolutional features and a sparse AE for remote sensing images, proposed by [64], whose architecture is a sequential version of our proposed method. This approach starts by generating an initial feature representation from a pre-trained CNN model. These convolutional features are then fed into an AE to learn a new suitable representation in an unsupervised manner. After this, several class-specific AEs are trained, and the images are then classified based on the reconstruction error. For fairness, the CNN architecture implemented in CNN-AE is consistent with our proposed method (same number of layers, same size of filters, and so on) as described below. In addition, we have also used the superpixel-based sampling strategy in CNN-AE.
The experiments are implemented on the aforementioned three kinds of datasets. Following the procedure shown in Figure 1, 40% of the ground truth data is randomly selected for testing. From the remaining part, a small number of labeled samples, 5 samples per class for the HSI, SAR and MSI datasets, are selected with a stratified random sampling strategy. Then SLIC is applied to the whole remote sensing image with an average superpixel size of 400 pixels. The pixels within a 3-pixel distance of the boundary of their superpixel are selected as candidate unlabeled samples, excluding those that belong to the test or labeled samples. From these candidates, 5000, 7000 and 3500 samples are randomly selected as the unlabeled samples for the HSI, SAR and MSI datasets, respectively. Finally, the unlabeled and labeled samples together constitute the training set and are used for network fine-tuning. Please note that we did not consider the spatial autocorrelation [65] of the input images when separating training and testing samples. In addition, we organize the samples as 13 × 13 patches and train the layer-wise embedding CNN in random batches with a batch size of 16. All the evaluation experiments are repeated for 10 Monte Carlo runs, and the reported accuracies are the averaged results. To evaluate the experimental results, we compare three indexes: overall accuracy (OA, the number of correct classifications divided by the total number of test samples) [66], average accuracy (AA, the average of the producer's accuracies of the individual classes) [67] and the kappa coefficient (Kappa, a measure of the actual agreement minus the chance agreement) [68]. Moreover, the F-Measure (2 * P * R/(P + R), where P is the user's accuracy and R the producer's accuracy) [69] of the various methods is also compared.
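The superpixel-guided selection of unlabeled samples described above can be sketched as follows. This is a minimal numpy illustration, assuming the SLIC label map has already been computed (e.g., with `skimage.segmentation.slic`); the function names are hypothetical and the 3-pixel distance is approximated with a Manhattan (4-connected) dilation rather than the paper's exact distance measure:

```python
import numpy as np

def boundary_candidates(segments, max_dist=3):
    """Mask of pixels within max_dist (Manhattan) of a superpixel boundary.

    segments: 2-D integer label map, e.g. produced by SLIC.
    """
    # A pixel lies on a boundary if any 4-neighbour carries a different label.
    boundary = np.zeros(segments.shape, dtype=bool)
    boundary[:-1, :] |= segments[:-1, :] != segments[1:, :]
    boundary[1:, :] |= segments[1:, :] != segments[:-1, :]
    boundary[:, :-1] |= segments[:, :-1] != segments[:, 1:]
    boundary[:, 1:] |= segments[:, 1:] != segments[:, :-1]
    # Grow the boundary mask max_dist times (4-connected dilation).
    mask = boundary
    for _ in range(max_dist):
        grown = mask.copy()
        grown[:-1, :] |= mask[1:, :]
        grown[1:, :] |= mask[:-1, :]
        grown[:, :-1] |= mask[:, 1:]
        grown[:, 1:] |= mask[:, :-1]
        mask = grown
    return mask

def sample_unlabeled(segments, exclude, n, max_dist=3, seed=0):
    """Randomly pick up to n unlabeled pixel coordinates near superpixel
    boundaries, excluding pixels already used as labeled or test samples."""
    candidates = boundary_candidates(segments, max_dist) & ~exclude
    flat = np.flatnonzero(candidates)
    rng = np.random.default_rng(seed)
    chosen = rng.choice(flat, size=min(n, flat.size), replace=False)
    return np.unravel_index(chosen, segments.shape)
```

The patch extraction step then simply crops a 13 × 13 window around each returned coordinate.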
For the parameter settings of the layer-wise embedding CNN, we adopt a simple structure of 4 convolution layers and 1 fully connected layer as the encoder, and a decoder with the same number of convolution layers as its counterpart. The hyper-parameters of each layer of the layer-wise embedding CNN are set empirically and can be found in Table 4. The learning rate is set to 0.001 with a mini-batch size of 45 for HSI, 25 for SAR and 100 for MSI. Although the hyper-parameters (such as the number of filters and layers, etc.) of both the encoder and the decoder have not been fully optimized, the results obtained with the adopted empirical settings are already very competitive. Moreover, for the different types of remote sensing images, there is still considerable room for improvement if the hyper-parameters are tuned separately for the HSI, MSI and SAR data.
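Table 4 holds the actual per-layer settings; as a quick sanity check on how a 13 × 13 patch shrinks through the 4-layer encoder, the usual convolution output-size formula can be traced. The 3 × 3 kernels, zero padding and unit stride below are assumptions for illustration, not the settings from Table 4:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of one convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

size = 13                  # the 13 x 13 patches used in the experiments
for _ in range(4):         # 4 convolution layers in the encoder
    size = conv_out(size, kernel=3)
print(size)                # 5: a 5 x 5 map feeds the fully connected layer
```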

Experimental Results and Discussions
For illustrative purposes, Figure 7 shows the OA as a function of the number of iterations in the fine-tuning process of SLE-CNN for the three different types of datasets. In the figures, the yellow curves show the original, fluctuating results, while the blue curves are smoothed versions. It can be observed that the designed SLE-CNN framework converges very fast, requiring merely 11, 8 and 11 iterations with a proper mini-batch size for the HSI, MSI and SAR data, respectively, which indicates its strong convergence ability. Based on the above-mentioned experimental results, a few points can be highlighted. Firstly, among the four SVM-based classifiers (SVM, LapSVM, SL SVM and SL-Gabor SVM), SVM, using only the limited labeled data, performs worst. For instance, it is observed from Table 7 that the OA of SVM is 10.17%, 8.18% and 11.01% lower than those of LapSVM, SL SVM and SL-Gabor SVM, respectively. Similar properties can also be found in Tables 5 and 6. This phenomenon demonstrates the importance of taking advantage of unlabeled data. Secondly, compared with the designed SLE-CNN framework, the network structure of the supervised version of the proposed framework is simple, with only one encoder. In addition, it uses only the supervised cross entropy for training, while the SLE-CNN framework uses a combined loss function consisting of the supervised cross entropy from the noisy encoder and the unsupervised reconstruction cost from the interaction of the clean encoder and the decoder. From the experiments, we can see that the classification accuracy in the supervised scenario is worse than that obtained with SLE-CNN. As shown in Table 7, the OA, AA, Kappa and F-Measure of the supervised CNN are lower than those of the SLE-CNN, which again proves the efficiency and effectiveness of the unsupervised co-training fashion. It also implies that optimization with unlabeled data can reduce the error rate. To obtain a similar level of classification accuracy, the designed SLE-CNN framework needs fewer labeled samples than the supervised version.
Thirdly, even without unlabeled samples for training, deep structures such as CNNs still achieve good results, reflected in high classification accuracies and smooth output classification maps with few shattered fragments, outperforming traditional shallow models (e.g., SVM) with the same training dataset. As shown in Table 5, the OA of the purely supervised CNN is 5.94%, 1.23%, 3.27% and 0.01% higher than those of SVM, LapSVM, SL SVM and SL-Gabor SVM, respectively. It is also clearly visible in Figure 8 that fewer shattered fragments are generated in the purely supervised CNN and SLE-CNN classification maps than in the SVM, Laplacian SVM, SL SVM and SL-Gabor SVM maps. The SAR data yield similar properties. As for the MSI data (see Table 6), the OA of the purely supervised CNN is lower than those of LapSVM, SL SVM and SL-Gabor SVM, but still higher than that of SVM. To some extent, the classification maps of the contextual methods (CNN and SL-Gabor SVM), though much smoother, lack a certain number of details, such as sharp corners and fine elements. However, some of the lost details may be noise caused by the sensors.
Fourthly, when compared with SL-Gabor SVM, the designed SLE-CNN framework shows a higher accuracy, although both use unlabeled samples and contextual features, as shown in Tables 5-7, which further illustrates the capability of the deep CNN.
Besides, the designed SLE-CNN framework performs better than CNN-AE, whose architecture is a sequential version of ours, which proves the superiority of the structure of our proposed model. As shown in Tables 5-7, the OA of the designed SLE-CNN framework is 3.13%, 1.88% and 2.33% higher than that of CNN-AE, respectively. Moreover, the computation time of CNN-AE is longer than that of the designed SLE-CNN framework. The reasons may include the following: (1) In our proposed framework, end-to-end learning is used in the training process, where the supervised and unsupervised parts are trained together through the cost function (Equation (16)) by backpropagation. (2) All classes are trained at the same time in our proposed framework, whereas in CNN-AE each class needs to be trained in a separate AE. (3) During inference, CNN-AE needs to run the pre-trained model to obtain the CNN features and then run the AE t times (if we have t classes), while in our proposed framework the result is acquired by running the encoder part of the AE only once. Finally, the designed SLE-CNN provides better or comparable classification results compared with every other method included in the comparison and obtains state-of-the-art classification accuracy in all three scenarios involving HSI, MSI and SAR data, which also demonstrates its strong capacity to adapt to different types of remote sensing image classification tasks.
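Equation (16) is not reproduced in this section; the sketch below only illustrates the general shape of such a combined objective, under the assumption of a ladder-network-style form (supervised cross entropy from the noisy encoder plus per-layer weighted reconstruction costs from the clean encoder/decoder pair). Function and argument names are illustrative:

```python
import numpy as np

def combined_loss(logits, labels, clean_feats, recon_feats, lambdas):
    """Supervised cross entropy plus layer-wise reconstruction cost.

    logits: class scores of the labeled samples (from the noisy encoder);
    clean_feats / recon_feats: per-layer activations of the clean encoder
    and the decoder's reconstructions; lambdas: one weight per layer.
    """
    # Numerically stable softmax cross entropy on the labeled subset.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(labels)), labels].mean()
    # Unsupervised part: weighted squared error between clean activations
    # and their reconstructions, summed over layers.
    rec = sum(lam * np.mean((c - r) ** 2)
              for lam, c, r in zip(lambdas, clean_feats, recon_feats))
    return ce + rec
```

Because both terms enter one scalar objective, a single backpropagation pass updates the encoder and decoder jointly, which is the end-to-end property contrasted with CNN-AE above.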
Briefly, the aforementioned analysis validates the effectiveness of the SLE-CNN framework for remote sensing image classification. We also evaluate the performance of the designed SLE-CNN framework as the number of labeled samples increases and estimate its stability. We randomly choose 5, 10, 15, 20, 25 and 30 samples from each class as the labeled samples; the OA of the various methods above is plotted in Figure 12. The classification accuracy increases as the number of labeled samples grows, and the designed SLE-CNN framework provides higher classification accuracies than the other methods for every number of labeled training samples. It should be noted that under the condition of extremely limited training samples, our framework risks becoming unstable and influenced by noise, which needs to be improved in the algorithm. However, the superpixel-based random sampling strategy, which guides the selection of unlabeled samples, reduces such risks to some degree. It should also be mentioned that the classification results produced by the designed framework may be over-smooth and some details may be lost; therefore, they should be treated carefully.
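The accuracy indexes used throughout (OA, AA, Kappa and the per-class F-Measure) all reduce to simple confusion-matrix arithmetic; a minimal numpy sketch, with a hypothetical function name, is:

```python
import numpy as np

def accuracy_indexes(conf):
    """OA, AA, Kappa and per-class F-measure from a confusion matrix.

    conf[i, j] = number of test samples of true class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    oa = np.trace(conf) / n                        # overall accuracy
    producer = np.diag(conf) / conf.sum(axis=1)    # producer's accuracy (recall)
    user = np.diag(conf) / conf.sum(axis=0)        # user's accuracy (precision)
    aa = producer.mean()                           # average accuracy
    # Chance agreement from the marginal distributions, then Kappa.
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / n ** 2
    kappa = (oa - pe) / (1.0 - pe)
    f = 2 * user * producer / (user + producer)    # per-class F-measure
    return oa, aa, kappa, f
```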

Discussion
From Tables 5-7, we can observe that for some classes the accuracies of the other methods are higher than those of our proposed SLE-CNN. The accuracies shown in the tables are obtained with only 5 labeled samples per class, and this phenomenon reflects, to some degree, that the proposed SLE-CNN risks becoming unstable and influenced by noise with extremely limited training samples, as stated above. However, the situation improves as the number of samples increases. In Table 5, the "bitumen" class of the HSI data shows a relatively low accuracy; however, the number of samples in this class is small, so the test results are more likely to be affected by randomness. As for "shadows" in the HSI data in Table 5 and the "winter wheat" and "water" classes of the SAR data in Table 7, the accuracies of these classes, though not the highest among the different methods, have already reached a high level. The relatively low class accuracies for the MSI data in Table 6 reflect that the proposed SLE-CNN does not perform as well on the MSI data as on the HSI and SAR data with 5 labeled samples per class. Although the OA and the other global indexes are still better than those of the other methods, the improvement in accuracy is relatively small. Considering this, we will make adaptive adjustments for MSI data in later research.
In the future, we plan to combine variational inference with the unsupervised autoencoder model to obtain a better regularization for supervised learning and improve the decision boundaries.It is also promising to directly fuse the superpixel-based hard mining criterion into the final optimization objective.

Conclusions
Remote sensing images, with complex ground scenes and irregular objects, are naturally characterized by irregular spatial dependency, which causes challenges for classification tasks. Moreover, effective labeled samples are usually scarce in remote sensing datasets, which conflicts with the general requirement of a huge labeled training set for the fine tuning of a deep neural network. To deal with these challenges, we design a superpixel-guided layer-wise embedding CNN framework, in which unlabeled samples are automatically exploited under the guidance of a superpixel-based random sampling strategy. Therefore, a more robust training dataset can be obtained with many informative and representative unlabeled samples. Furthermore, since superpixels can handle the challenge of irregular spatial dependency, our classification framework is much more adaptive to real scenes of remote sensing images. Different from prevailing deep supervised learning models such as DNNs and CNNs, which have already shown an impressive capacity for feature generalization given enough training samples, the designed SLE-CNN aims at relieving this problem under limited ground truth conditions. By using an AE-based generative model to learn an unsupervised embedding of the unlabeled samples, we can strongly regularize the supervised training, thus reducing the search space and obtaining better convergence. In addition, since the supervised and unsupervised parts are combined in a joint optimization fashion with a new objective function, we can simultaneously learn the best feature representation with both supervised and unsupervised information. Experiments on benchmark remote sensing images of different types have shown satisfactory performance compared with both the purely supervised version of our framework and other state-of-the-art supervised and semi-supervised classification models, which implies a promising classification capacity for remote sensing images in different application fields.

Figure 1 .
Figure 1. The block diagram of the proposed classification framework. The core part is the superpixel-guided layer-wise embedding CNN. To reduce the demand for training samples, unlabeled samples are also used for fine tuning the designed layer-wise embedding CNN. Considering the irregular spatial dependency of remote sensing images, superpixels are introduced to guide the selection of more valuable unlabeled samples (superpixel-guided patches in the figure), since superpixels are adaptive to real scenes of remote sensing images.

Figure 2 .
Figure 2. The procedure of superpixel-based random sampling. Superpixels are introduced to guide the selection of unlabeled samples to handle the irregular spatial dependency of remote sensing images. Under this strategy, more representative and informative samples come from pixels located on superpixel boundaries, since they are more likely to be close to class boundaries and easier to misclassify.

Figure 7 .
Figure 7. OA as a function of the number of iterations in the fine-tuning process of SLE-CNN for (a) HSI data, (b) MSI data, (c) SAR data.

Figure 11 .
Figure 11. Classification accuracies of the designed layer-wise embedding CNN with the totally random sampling strategy and with the superpixel-based random sampling strategy for (a) HSI data, (b) MSI data, (c) SAR data.

Figure 12 .
Figure 12. The impact of the number of labeled training samples per class on the OA for (a) HSI data, (b) MSI data, (c) SAR data.

Table 1 .
Number of samples (NoS) and colors of each class in the ground truth of the ROSIS Pavia University hyperspectral image.

Table 2 .
Number of samples (NoS) and colors of each class in the ground truth of the Landsat 5 TM multispectral image.

Table 3 .
Number of samples (NoS) and colors of each class in the ground truth of the EMISAR image.

Table 4 .
Parameter settings in each layer of the layer-wise embedding CNN.