Weakly Supervised Segmentation of SAR Imagery Using Superpixel and Hierarchically Adversarial CRF

Abstract: Synthetic aperture radar (SAR) image segmentation aims at generating homogeneous regions from a pixel-based image and is the basis of image interpretation. However, most existing segmentation methods neglect appearance and spatial consistency during feature extraction and also require a large amount of training data. In addition, pixel-based processing cannot meet real-time requirements. We present a weakly supervised algorithm for segmenting high-resolution SAR images. For efficient segmentation, the input image is first over-segmented into a set of primitive superpixels. The algorithm combines hierarchical conditional generative adversarial nets (CGAN) and conditional random fields (CRF). The CGAN-based networks can learn their parameters from abundant unlabeled data, reducing their reliance on labeled samples. To preserve neighborhood consistency in the feature extraction stage, the hierarchical CGAN is composed of two sub-networks, which extract the information of the central superpixels and the corresponding background superpixels, respectively. Afterwards, a CRF performs label optimization using the concatenated features. Quantitative experiments on an airborne SAR image dataset show that the proposed method can effectively learn feature representations and achieve accuracy competitive with state-of-the-art segmentation approaches. More specifically, our algorithm achieves a higher Cohen's kappa coefficient and overall accuracy, and its computation time is less than that of current mainstream pixel-level semantic segmentation networks.


Introduction
The latest synthetic aperture radar (SAR) imaging sensors can achieve all-day, high-resolution imaging of various geographical terrains [1,2]. SAR image segmentation aims at assigning optimal labels to the pixels and is considered a foundation for many high-level interpretation tasks. Accurate segmentation can greatly reduce the difficulty of subsequent advanced tasks (target detection, recognition [3], tracking, change detection [4], etc.). Unlike the common classification framework, where classifiers (e.g., support vector machine (SVM) [5], random forest (RF) [6], sparse representation [7]) are generally used to assign a discrete or continuous label to each unit, segmentation models also need to preserve neighborhood consistency. For instance, in the segmentation framework, if the neighbors of a pixel are ocean, the confidence that it belongs to an ocean area should increase. The current mainstream segmentation algorithms include superpixel segmentation methods [8,9], watershed segmentation methods [10,11], and level set segmentation methods [12,13].
As a representative probabilistic graphical model, the conditional random field (CRF) [14] can capture appearance and spatial consistency and has become a fundamental tool in image segmentation. A CRF [14] models the conditional distribution of the labels given the observations. In typical CRF-based segmentation methods [15], appropriate features are first extracted and selected from the input images, then a structured support vector machine (SSVM) [16] or another classifier is used to learn the coefficients of the CRF for segmentation [17,18]. Hence, the quality of the feature descriptors has a significant impact on the performance of CRF-based segmentation models. To further accelerate the segmentation of large-scale SAR images, the input is generally over-segmented into superpixels [19,20]. However, effective feature extraction for superpixels remains a challenge. Most previous methods focus on hand-crafted features for superpixels, e.g., histogram of oriented gradients (HOG) [21], gray-level co-occurrence matrix (GLCM) [22], and co-occurrence matrix (COOC) [23]. These hand-crafted features include many coefficients that must be set from prior knowledge and practical experience [19], which is difficult and time-consuming. Besides, the speckle noise in SAR images also increases the difficulty of feature extraction.
In recent years, deep convolutional neural networks (DCNN) [24] have achieved tremendous breakthroughs in image classification and segmentation [25,26]. Combining CRF and DCNN for SAR image segmentation has become a research hotspot [27,28]. For example, Liu et al. [15] utilized a CNN trained on the ImageNet dataset to extract features of superpixels. Liu et al. further modeled the feature extractor and a continuously valued CRF as a DCNN in [27], achieving better results than hand-crafted features on depth estimation and semantic segmentation tasks.
However, the CNN-based feature descriptor still has several issues that need to be resolved. First, both the DCNN and the hand-crafted extractors neglect neighborhood label consistency. When a human labels a superpixel, the context information from its adjacent superpixels has a strong influence on the label. For example, a superpixel surrounded by farmland areas is highly likely to be farmland.
In addition, training a DCNN needs a large number of labeled samples, which mainly come from manual annotation and are quite labor-intensive to produce, especially for SAR images. Unlike common optical images, there is no universal, large-scale labeled SAR image dataset comparable to ImageNet that can be applied to interpreting SAR images. Each airborne SAR imaging system has distinctive imaging parameters. Even for the same airborne SAR, due to the change of altitude, speed, and overlook angles in each flight, the features of the same target regions are always different. Hence, the training sets must be manually labeled for each SAR dataset. To reduce this burden, researchers have proposed unsupervised learning models, such as the sparse auto-encoder (SAE) [29], restricted Boltzmann machine (RBM) [30], deep belief network (DBN) [31], and generative adversarial nets (GAN) [32]. Among them, GAN is an unsupervised learning model proposed in 2014 that has been successfully applied in image generation, image inpainting, road segmentation [33], etc. This model is composed of a generator (G) and a discriminator (D). The G network attempts to generate fake images to deceive the discriminator, while the discriminator is a binary classifier that endeavors to distinguish the fake data from the real data. The adversarial training process between the two sub-networks drives the discriminator to achieve a better feature extraction ability than the other unsupervised deep learning models.
In this paper, we propose a superpixel-wise hierarchically adversarial CRF (HACRF), which combines GAN with CRF to address the above problems. First, in order to reduce the computational cost while maintaining the edge information of the input image, the input is over-segmented into superpixels. Second, based on the semi-supervised conditional generative adversarial nets (CGAN) [34], we design a hierarchical CGAN to learn the target superpixels and their background through the adversarial training process. Then the concatenated feature vectors are fed into the CRF model to obtain the optimal segmentation results. The main contributions of the proposed algorithm are as follows.
(1) To improve the SAR image segmentation performance with insufficient labeled samples, we introduce the CGAN into the CRF-based segmentation method. In the CGAN, a multi-classifier is added to the original GAN and shares the feature extraction network with the discriminator. The binary discriminator aims at distinguishing between the real samples and the generated samples, and the multi-classifier strives to label the samples correctly. Hence, the feature extraction layers can learn their parameters simultaneously from the few labeled images and the abundant unlabeled images.
(2) In order to preserve neighborhood label consistency during feature extraction, the hierarchical CGAN model is composed of two CGANs, named target CGAN (TCGAN) and background CGAN (BCGAN). For an input superpixel (named the target superpixel), we also extract a background superpixel of a larger size, which is composed of the neighboring superpixels of the target superpixel. TCGAN aims at learning the feature vectors of the target superpixels, while BCGAN is introduced to explore the features of the corresponding background superpixels. The concatenated feature vectors from the two kinds of features are then fed into the CRF, which approximates the maximum a posteriori (MAP) estimate of the labels.
The rest of this paper is organized as follows. Previous CRF-based segmentation approaches are reviewed in Section 2. Section 3 presents the proposed method, including the generation of the target and background superpixels, the training of the hierarchical CGANs, and superpixel-wise segmentation using a conditional random field. Our experiments and the results are presented in Section 4. Section 5 presents the discussion in terms of pros and cons. Section 6 gives the conclusion.

Related Work
In this section, we review the previous CRF-based approaches for image segmentation, which are closely related to our method. Their merits and defects are summarized in Table 1. The earliest way of applying a CRF to segmentation is to process each pixel directly as a unit [16]. These methods do not need a feature extraction step, but their weakness is the heavy computing burden. Taking a fully connected CRF [35] as an example, for an image with n pixels, on the order of n² edges must be computed when building the graph model. Considering that remote sensing images generally have much larger sizes than common optical images, the computation of pixel-wise segmentation methods increases sharply.
With the appearance of superpixel generation models (e.g., simple linear iterative clustering, SLIC [36]), superpixel-wise segmentation algorithms have received considerable attention. They introduce CRFs as a post-processing step to incorporate structured constraints. Most superpixel generation models employ unsupervised clustering algorithms to group adjacent and similar pixels into a superpixel. Superpixels can greatly improve the efficiency of the segmentation algorithm while preserving the edge information of the image. According to the kind of superpixel feature descriptor, these superpixel-wise segmentation methods can be divided into two groups, the first of which uses traditional hand-crafted methods to extract superpixel features. For instance, Sultani et al. [19] introduce a feature descriptor for superpixels that includes HOG, GLCM, intensity histogram (IH), and mean intensity (MI). The weights of each type of feature are also learned by an SSVM. Kenduiywo et al. [37] select GLCMs based on a 3 × 3 convolution matrix with 64 gray-scale quantization levels as the features to train the CRF. Moreover, some filter sets, such as Gabor and wavelet transforms, are also utilized for extracting complex features [38]. Most of these methods are unsupervised and do not require labeled samples. However, their parameters are often chosen empirically.
Later on, more attempts were made at extracting deep feature descriptors for superpixels. These methods first train a DCNN to extract the features, then apply a CRF model to refine the segmentation results [15,27,39]. In these models, feature extraction and CRF inference are still separated. Recently, a more appealing direction is deep end-to-end segmentation. For example, Zheng et al. [40] treat mean field inference in CRFs as a recurrent neural network (RNN). Lin et al. [41] propose learning the unary and pairwise potentials of CRFs using CNNs.
The above-mentioned deep feature descriptors all need a large number of labeled images for training the DCNN. In contrast, here we explore GAN to make the best use of the abundant unlabeled SAR images in CRF inference.

Hierarchically Adversarial CRF
This paper presents a hierarchical adversarial CRF (HACRF) method to segment large-scale SAR images. As shown in Figure 1, the segmentation process can be divided into three steps: (1) the input SAR images are over-segmented into target superpixels and background superpixels; (2) the two kinds of superpixels are fed into the hierarchical CGAN, which is composed of the target CGAN (TCGAN) and the background CGAN (BCGAN), to extract their feature vectors; (3) the concatenated features are then utilized to train the CRF and infer the optimal label of each superpixel. The hierarchical CGAN is trained using the labeled data together with unlabeled data from the testing set. The unary and pairwise potential coefficients of the CRF are learned by an SSVM.

Target Superpixels and Background Superpixels
In both training and testing, we first over-segment the large-scale airborne SAR images into superpixels using the SLIC method proposed in [36]. Compared with the regular rectangular patches used in deep learning, superpixels can preserve the edge information of the scene while effectively speeding up the segmentation algorithm to meet the real-time requirements of remote sensing. We represent the superpixels obtained from all the training images in the dataset as $r_j \in L$ ($j = 1, \dots, M$), where M is the number of superpixels.
In order to process the irregular superpixels by convolution, we place each superpixel into a rectangular patch. The center of the patch corresponds to the centroid of the superpixel. This kind of patch is called the 'target superpixel', denoted as $r_j^T$.
When manually labeling a superpixel, we find that it is difficult to determine its category if we focus only on the target superpixel $r_j^T$ without noticing the surrounding superpixels. For example, the shadow areas of mountains look very similar to river areas in airborne or spaceborne SAR images. Therefore, in order to distinguish the class of a superpixel, it is necessary to take its background information into consideration. If there is a clear difference between the central superpixel and the surrounding superpixels, the target superpixel is likely to be located in the shadow of the mountains; otherwise, it is in a river area. Based on these considerations, we believe that appearance and spatial consistency should also be taken into account in the feature extraction stage; in other words, both the background and the target information should be considered. Therefore, for a superpixel $r_j$, a larger rectangular region centered on the centroid of the superpixel is selected as the background slice, called the 'background superpixel' ($r_j^B$). It should be noted that we set the label of $r_j^B$ to be the same as that of $r_j^T$. This forces the hierarchical CGAN to extract the background information with the label of the central superpixel.
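The construction of target and background patches can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_patches`, the reflect padding at image borders, and the regular grid that stands in for a SLIC label map are all assumptions made for the example. The patch sizes (16 and 64) follow the values reported in the experiments.

```python
import numpy as np

def extract_patches(image, labels, target_size=16, background_size=64):
    """Return {superpixel id: (target patch, background patch)}.

    `labels` is an integer map with one id per superpixel (for example the
    output of a SLIC implementation). Both patches are centred on the
    superpixel centroid.
    """
    pad = background_size // 2
    # Assumption: reflect-pad so windows near the border stay inside the array.
    padded = np.pad(image, pad, mode="reflect")
    t, b = target_size // 2, background_size // 2
    patches = {}
    for sp_id in np.unique(labels):
        rows, cols = np.nonzero(labels == sp_id)
        # Centroid of the superpixel, shifted into padded coordinates.
        r = int(round(float(rows.mean()))) + pad
        c = int(round(float(cols.mean()))) + pad
        target = padded[r - t:r + t, c - t:c + t]
        background = padded[r - b:r + b, c - b:c + b]
        patches[int(sp_id)] = (target, background)
    return patches

# Hypothetical stand-in data: a 96 x 96 image split into a 6 x 6 grid of blocks.
rng = np.random.default_rng(0)
image = rng.random((96, 96))
labels = (np.arange(96)[:, None] // 16) * 6 + (np.arange(96)[None, :] // 16)
patches = extract_patches(image, labels)
```

Both patches of a superpixel share the label of the central superpixel, so the background patch only widens the field of view.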

Hierarchical CGANs
In this section, hierarchical CGANs are proposed to extract the features of the target and background superpixels, aiming at mining useful information from the abundant unlabeled data and improving the performance of feature extraction. We introduce the architecture of the hierarchical CGANs below.
The original GAN is mainly composed of a generator and a discriminator. The generator converts random noise z following the distribution $P_z(z)$ into fake images $G(z) \in P_g(x)$ by deconvolution. The fake images G(z) and the real images $x \in P_{data}(x)$ are fed into the discriminator, whose goal is to correctly discriminate whether an input image is real or fake. The generator attempts to keep the generated fake images as close to the real ones as possible, until the discriminator cannot distinguish them. The adversarial training against the generator makes the discriminator achieve a better feature extraction ability. However, the original GAN structure is unable to learn label information. The conditional GAN (CGAN) [42] was then proposed to handle this drawback by adding a multi-classifier to the GAN network. This classifier shares the feature extraction network with the discriminator. The loss function of the CGAN is defined in Equation (1), where x stands for all the real images, including the labeled images $x_L$ and the unlabeled images, and D(x) denotes the output of the discriminator. $E_{x \in P_{data}(x)}[\log D(x)]$ is the sum of log D(x) over all the inputs x, while $E_{x_L \in P^L_{data}(x)}[M(x_L)]$ denotes the sum of $M(x_L)$, which represents the loss values of the multi-classifier for all the labeled data $x_L \in P^L_{data}(x)$. On the right-hand side of Equation (1), the first term ensures that the discriminator can correctly distinguish the real images from the fake ones; the second term encourages the fake images to deceive the discriminator; and the third term supports the multi-classifier in correctly classifying the labeled data.
More specifically, in each training epoch, we first fix the parameters of the generator and train the discriminator and the multi-classifier. The m generated fake images $G(z^{(1)}), \dots, G(z^{(m)})$ and the real images x are fed into the discriminator. The discriminator is optimized by maximizing the loss function V(D) with a stochastic gradient descent (SGD) algorithm, where $N_x$ refers to the number of all the real images x. Meanwhile, the multi-classifier takes the labeled real images as input to predict their labels and calculate the loss function V(M), and its parameters are then updated by minimizing this loss.
The proposed hierarchical CGANs include two CGANs: the target CGAN (TCGAN) and the background CGAN (BCGAN). Figure 2 shows the structure and the training process of the hierarchical CGANs. TCGAN and BCGAN are trained independently. Taking the TCGAN as an example, the target superpixels $r_j^T$ from the training images include labeled and unlabeled ones, represented as $r_j^{L,T}$ and $r_j^{U,T}$, respectively, and $g_i^T$ is the i-th fake superpixel generated by the G-network of the TCGAN. In each training epoch, $r_j^{L,T}$, $r_j^{U,T}$, and $g_i^T$ are fed into the feature extraction part of the TCGAN (named the T-feature extractor). The discriminator then distinguishes whether each of the three kinds of feature vectors is real or fake and obtains the loss value V(D) in Equation (5). Meanwhile, the multi-classifier predicts the categories of the labeled target superpixels $r_j^{L,T}$ and calculates the corresponding loss value V(M) in Equation (5). In Equation (5), $N^{L,T}$, $N^{U,T}$, and m denote the numbers of labeled, unlabeled, and generated target superpixels. The parameters of the T-feature extractor, discriminator, and multi-classifier are updated by maximizing $V_T(D, M)$.
The fake target superpixels are then employed to train the G-network of the TCGAN by minimizing the loss function $V_T(G)$. The adversarial learning between the discriminator, multi-classifier, and generator can fully exploit the available information in the unlabeled target superpixels and thus improve the performance of the T-feature extractor. The training process of the BCGAN is similar to that of the TCGAN. The labeled background superpixels $r_j^{L,B}$, the unlabeled background superpixels $r_j^{U,B}$, and the generated superpixels $g_i^B$ are fed together into the B-feature extractor of the BCGAN, and the loss value $V_B(D, M)$ is then calculated from the outputs of the discriminator and multi-classifier. The parameters of the B-feature extractor, discriminator, and multi-classifier are optimized by maximizing $V_B(D, M)$. In the testing stage, the two trained feature extractors are employed to extract the features of the target superpixels $r_j^T$ and the background superpixels $r_j^B$, which are denoted as $f_j^T$ and $f_j^B$, respectively.
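The three loss terms described above can be illustrated numerically. The sketch below is written under stated assumptions rather than taken from the paper: it uses the standard GAN binary cross entropy for the discriminator, the saturating generator term log(1 − D(G(z))), and a softmax cross entropy for the multi-classifier. The function names and the sign conventions (all three functions return quantities to be minimized) are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross entropy with real samples labeled 1 and fake samples 0.

    d_real: D outputs in (0, 1) for labeled and unlabeled real superpixels.
    d_fake: D outputs in (0, 1) for generated superpixels.
    """
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-12):
    """Assumed saturating form: small when D scores fake samples as real."""
    return np.mean(np.log(1.0 - d_fake + eps))

def classifier_loss(logits, labels):
    """Softmax cross entropy, computed on labeled superpixels only."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Hypothetical outputs: an undecided discriminator and a uniform 5-class classifier.
d_real = np.full(8, 0.5)
d_fake = np.full(4, 0.5)
logits = np.zeros((3, 5))
labels = np.array([0, 2, 4])
loss_d = discriminator_loss(d_real, d_fake)   # 2 * log(2)
loss_g = generator_loss(d_fake)               # log(0.5)
loss_m = classifier_loss(logits, labels)      # log(5)
```

Only the labeled superpixels contribute to the classifier term, while labeled, unlabeled, and generated superpixels all contribute to the discriminator term; this is what lets the shared feature extractor benefit from unlabeled data.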

Superpixel-Wise Segmentation Using Conditional Random Field
The final feature vector $f_j$ of a superpixel is obtained by concatenating the features $f_j^T$ and $f_j^B$. In this section, a conditional random field (CRF) is introduced to optimize the superpixel-wise segmentation results according to the input features. The CRF was first proposed by Lafferty et al. [14] as a data labeling method based on a graph model. For an input SAR image with n superpixels $x = \{x_1, x_2, \dots, x_n\}$, the labels assigned to these superpixels are defined as $y = \{y_1, y_2, \dots, y_n\}$, where $y_i \in L = \{l_1, l_2, \dots, l_k\}$ and k refers to the number of classes. The CRF models the whole image as an undirected graph G(V, E), where V is the set of vertices, each superpixel corresponding to a vertex, and E represents the set of edges. It has been demonstrated that in a CRF, the conditional probability distribution of the labels y obeys the following Gibbs distribution,

$$p(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Big( \sum_{c \in C} \phi_c(y_c, x; \theta) \Big), \qquad (7)$$

where C stands for the set of cliques in the graph model, θ are the coefficients of the CRF model, and Z(x; θ) is a normalization constant. $\phi_c(y_c, x; \theta)$ denotes the potential function of the c-th clique given the observation x and the corresponding labels y. For regular image classification and segmentation, unary and pairwise potentials are included in the objective function. In this case, Equation (7) can be rewritten as

$$p(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Big( \sum_{i} \phi_i(y_i, x; \theta_u) + \sum_{i} \sum_{x_j \in \delta_i} \phi_{ij}(y_i, y_j, x; \theta_p) \Big), \qquad (8)$$

where $y_i$ denotes the label value of $x_i$ and $\delta_i$ stands for all the surrounding superpixels of the central superpixel $x_i$. On the right-hand side of Equation (8), $\phi_i(y_i, x; \theta_u)$ is the unary potential, indicating the dependence between $x_i$ and its label $y_i$, and $\theta_u$ is the coefficient set of the unary potential. Similarly, $\phi_{ij}(y_i, y_j, x; \theta_p)$ denotes the pairwise potential, where $\theta_p$ is its coefficient set and $x_j \in \delta_i$ ranges over all the adjacent superpixels of superpixel $x_i$. The pairwise potential aims at assigning similar labels to superpixels with consistent features. At this point, the CRF optimization can be expressed as

$$y^* = \arg\max_{y} \log p(y \mid x; \theta) = \arg\max_{y} \Big[ \sum_{i} \phi_i(y_i, x; \theta_u) + \sum_{i} \sum_{x_j \in \delta_i} \phi_{ij}(y_i, y_j, x; \theta_p) - \log Z(x; \theta) \Big].$$

It indicates that the label sequence achieving the maximum a posteriori (MAP) probability is the optimal label $y^*$. Since $-\log Z(x; \theta)$ is a constant relative to y, the formula can be simplified as

$$y^* = \arg\max_{y} \Big[ \sum_{i} \phi_i(y_i, x; \theta_u) + \sum_{i} \sum_{x_j \in \delta_i} \phi_{ij}(y_i, y_j, x; \theta_p) \Big].$$

In this formula, the observation x refers to the superpixel features of the input SAR image, i.e., $f = \{f_j \mid j = 1, \dots, M\}$. Hence, we have $\phi_i(y_i, x; \theta_u) = \phi_i(y_i, f; \theta_u)$. We design the unary potential as a linear combination of the features and coefficients, in which $p(y_i \mid f_i)$ refers to the probability that the superpixel $r_i$ is assigned the label $y_i$ given the feature vector $f_i$. The coefficient $W_U$ is a matrix of size $n_c \times n_f$, where $n_c$ represents the number of classes and $n_f$ the length of the feature vector $f_i$. Next, we define the pairwise potential in terms of the zero-one indicator function [·], a transition weight, and a feature distance. Here, $w_{y_i \sim y_j} \in [0, 1]$ is an element of the transition matrix $W_P$ of size $n_c \times n_c$, representing the possibility that a superpixel adjacent to a superpixel belonging to $y_j$ is labeled as $y_i$. As the graph is undirected, $W_P$ is a symmetric matrix. $D_{ij}$ represents the distance between the feature vectors of two superpixels $r_i$ and $r_j$ and is computed from their elements, where $f_{ik}$ is the k-th element of feature vector $f_i$. $D_{ij}$ is proportional to the difference between the features of the two superpixels. Equation (13) encourages the CRF to assign different labels to neighboring but distinctive superpixels. In summary, the CRF coefficients are $\{W_P, W_U\}$, and we use the SSVM to learn them. Finally, an inference algorithm is used to maximize $p(y \mid x; \theta)$ and find the optimal label sequence $y^*$. Here we use the AD3 algorithm proposed in [43] for MAP inference. AD3 is a dual decomposition algorithm based on the alternating directions method of multipliers. As each local subproblem has a quadratic regularizer, AD3 converges faster than other subgradient-based dual decomposition and message-passing methods. Experimental results also show that AD3 yields good segmentation.
The pseudo code of the proposed method is presented in Algorithm 1, which explains the training of the hierarchical CGANs and the CRF in detail.
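To make the MAP objective concrete, the following sketch scores every labeling of a tiny superpixel graph and keeps the best one. It is an illustration under explicit assumptions: the exhaustive search replaces AD3 and is only feasible for a handful of nodes, the unary score is taken as the row of $W_U$ for the chosen label times the feature vector, and the pairwise score (transition weight scaled by exp(−D_ij) with a Euclidean D_ij) is a hypothetical stand-in, since the exact pairwise form is not reproduced here. All function names are invented for the example.

```python
import itertools
import numpy as np

def crf_score(y, feats, edges, W_U, W_P):
    """Sum of unary and pairwise scores for one labeling y."""
    unary = sum(W_U[y[i]] @ feats[i] for i in range(len(y)))
    pairwise = 0.0
    for i, j in edges:
        d_ij = np.linalg.norm(feats[i] - feats[j])    # assumed Euclidean distance
        pairwise += W_P[y[i], y[j]] * np.exp(-d_ij)   # assumed similarity weighting
    return unary + pairwise

def map_brute_force(feats, edges, W_U, W_P):
    """Exhaustive MAP search; exponential in the number of superpixels."""
    n_classes = W_U.shape[0]
    candidates = itertools.product(range(n_classes), repeat=len(feats))
    best = max(candidates, key=lambda y: crf_score(y, feats, edges, W_U, W_P))
    return list(best)

# Hypothetical chain of three superpixels with 2-D features and two classes.
feats = np.array([[1.0, 0.0], [0.45, 0.55], [1.0, 0.0]])
edges = [(0, 1), (1, 2)]
W_U = np.eye(2)   # class k responds to feature k
W_P = np.eye(2)   # symmetric, rewards equal labels on neighbours
labels_smooth = map_brute_force(feats, edges, W_U, W_P)
labels_unary = map_brute_force(feats, edges, W_U, np.zeros((2, 2)))
```

With the pairwise term switched off, the ambiguous middle superpixel follows its own features; with it switched on, the two confident neighbours pull it to their label, which is the neighborhood consistency the CRF is meant to enforce.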

Data Description
The SAR database used in our experiment contains imaging results of FangChengGang in Guangxi Province, China [44]. The imaging range of this database is about 30 × 30 km with a resolution of 2 m, and the image size is 1122 × 1419 pixels. There are 36 images in total in the dataset, and seven of them are selected as the training set. We manually annotate these images into five classes: farmland, river, urban, background, and non-image. Non-image refers to the areas left unscanned during imaging, whose pixel values are zero. Three exemplar images and the corresponding ground truth maps are illustrated in Figure 3. In the ground truth, different colors represent different categories. As can be seen, there are various topographic and geomorphic conditions (e.g., complex buildings and roads, and dense river networks characterized by different sizes and shapes), which lead to great difficulty in segmentation.
In the training or testing stage, all the images and the corresponding ground truths are first over-segmented into superpixels using the SLIC method. The size of the superpixels determines the resolution of the segmentation results. The larger the superpixels are, the more edge information is missed within each superpixel. For example, rivers narrower than the superpixels will be lost. Conversely, superpixels that are too small increase the computation of the segmentation algorithm. According to the clustering results of SLIC on the FangChengGang dataset, a superpixel with 200-300 pixels can ensure that each superpixel retains the shape of rivers and urban areas. In the experiment, we set the compactness of SLIC to 0.5 and the number of superpixels for each SAR image (1122 × 1419 pixels) to 8000, which means a superpixel contains 200 pixels on average. The size of the patch named 'target superpixel' is set to 16 × 16, so that each patch can accommodate an irregular superpixel. The background superpixels are set to 16 times the area of the target superpixel, i.e., 64 × 64.
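The sizes quoted above can be checked with a few lines of arithmetic. The helper name `mean_superpixel_area` is hypothetical; the numbers are the ones stated in the text.

```python
def mean_superpixel_area(height, width, n_superpixels):
    """Average number of pixels per superpixel for a given SLIC setting."""
    return height * width / n_superpixels

avg_area = mean_superpixel_area(1122, 1419, 8000)       # close to 200 pixels
target_side, background_side = 16, 64
patch_area = target_side ** 2                           # pixels in a target patch
area_ratio = background_side ** 2 // target_side ** 2   # background / target area

print(f"average superpixel area: {avg_area:.1f} px")
print(f"target patch area: {patch_area} px, area ratio: {area_ratio}")
```

The 16 × 16 target patch holds 256 pixels, slightly more than the average superpixel, and the factor of 16 between the two patch types refers to area, not side length.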
Table 2 gives the numbers of superpixels of each class in the training and testing sets, and the percentage of training pixels among all pixels for each class. It can be seen that the numbers of urban, farmland, and river superpixels are roughly the same in both the training and testing sets, while the number of background superpixels is much larger than that of the other three categories. This is consistent with practical applications, where a large scene contains many kinds of land cover but only a small part of them is of interest. The huge imbalance in the numbers of samples presents a challenge for the segmentation models. The experiments are implemented using PyTorch on an Ubuntu platform. The main configuration of our computer is 32 GB of memory, an Intel(R) Xeon(R) CPU L5639 @ 2.13GHz×12 (Intel, Santa Clara, CA, USA), and a Tesla K20c graphics card (NVIDIA, Santa Clara, CA, USA).

Evaluation Measures
The pixel-wise overall accuracy (OA) [45], overall precision (OP) [46], and F1 score [47] are utilized to quantitatively measure the segmentation results of each method. The F1 score is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
The precision refers to the proportion of correctly assigned pixels among all the pixels assigned to a class in the segmentation results, while the recall corresponds to the ratio of correctly segmented pixels to all the pixels of that class in the ground truth. The larger the F1 score, the better the overall performance of the segmentation algorithm. OA and OP are defined as the weighted average recall values and precision values, respectively. Cohen's kappa coefficient (κ) [48] is also introduced to measure the correctness of the segmentation. The Kappa coefficient is defined as
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the relative agreement between the segmentation results for the testing data and the real labels, and $p_e$ is the hypothetical probability of chance agreement. A coefficient closer to 1 denotes better agreement between the segmentation results and the ground truth.
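As an illustration of these measures, the sketch below computes per-class precision and recall, F1, and Cohen's kappa from a confusion matrix. It is not the authors' evaluation code: the function names and the toy matrix are hypothetical, and the weighting of OA by the ground-truth pixel count of each class is an assumption about how "weighted average" is meant.

```python
def precision_recall(cm):
    """cm[i][j] = number of pixels of true class i predicted as class j."""
    k = len(cm)
    precision, recall = [], []
    for c in range(k):
        tp = cm[c][c]
        predicted = sum(cm[i][c] for i in range(k))  # column sum
        actual = sum(cm[c])                          # row sum
        precision.append(tp / predicted if predicted else 0.0)
        recall.append(tp / actual if actual else 0.0)
    return precision, recall

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def cohen_kappa(cm):
    """kappa = (p_o - p_e) / (1 - p_e)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / total
    p_e = sum(
        sum(cm[i]) * sum(cm[j][i] for j in range(k)) for i in range(k)
    ) / total ** 2
    return (p_o - p_e) / (1 - p_e)

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy two-class confusion matrix (made-up numbers, for illustration only).
cm = [[40, 10],
      [20, 30]]
precision, recall = precision_recall(cm)
support = [sum(row) for row in cm]  # assumed weights: ground-truth pixel counts
oa = weighted_average(recall, support)
f1_class0 = f1_score(precision[0], recall[0])
kappa = cohen_kappa(cm)
```

With support weights, the weighted average recall equals the fraction of correctly labeled pixels, which is why it can serve as an overall accuracy.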

Feature Extraction Analysis
In this section, before measuring the segmentation results, we first visualize the features extracted by the hierarchical CGANs to explore whether our feature extraction method can capture the inherent differences between the various kinds of superpixels. We train the hierarchical CGANs using the labeled superpixels from the training set and the abundant unlabeled superpixels from the testing set to obtain the T-feature extractor and B-feature extractor.
Figure 4 shows the architecture and hyperparameters of the TCGAN. The abbreviations 'bn' and 'FC' correspond to batch normalization and fully-connected layers. The generator (blue rectangles) is composed of two fully-connected blocks (FC block 1 and 2) and two deconvolution blocks (Deconv block 1 and 2). The T-feature extractor (red rectangles) consists of the convolution block 1 and the FC block 3, and is responsible for transforming a 16 × 16 target superpixel into a feature vector of 1 × 1024. The default activation function for all the layers is Leaky ReLU [49]. The cross entropy and binary cross entropy are selected as the loss functions of the multi-classifier and the discriminator, respectively. The experimental results show that when the TCGAN is trained for more than 30 epochs, the segmentation performance no longer improves, so we set the epoch number to 30. Moreover, the batch size and the learning rates of the discriminator, the multi-classifier, and the generator have little effect on the segmentation performance in our experiment, and therefore we set the batch size to 1 and all learning rates to 0.0002.

The architecture of the BCGAN remains the same, except that the neuron number of its FC block 2 is set to 128 × 16 × 16. Similarly, the B-feature extractor converts each background superpixel of size 64 × 64 into a 1 × 1024 feature vector. After that, the final feature vector of the superpixel (1 × 2048) is obtained by concatenating the two feature vectors.
In order to easily visualize the distribution of features, we use principal component analysis (PCA) [50] to reduce the feature vectors to two dimensions. The results are illustrated in Figure 5a, where individual colors and markers are utilized to distinguish the different kinds of features. It can be seen that the different classes of features extracted by the HACRF do not show much confusion. For comparison, we also visualize the features extracted by other models, as shown in Figure 5b-e. Figure 5b shows the features extracted only by the T-feature extractor, where a large number of the feature points of the background and urban areas overlap, indicating that the introduction of background features improves the separability of the features. Figure 5c illustrates the features extracted by a supervised CNN. This CNN is composed of the T-feature extractor and the multi-classifier from the TCGAN, so its hyperparameters are consistent with those of the TCGAN network. It is trained using 16 × 16 target superpixels from the training set (epoch = 100). Severe confusion between different kinds of features occurs in the CNN's results, which will inevitably reduce the accuracy of its segmentation results.
Figure 5d shows the distribution of dimension-reduced features from the AlexNet model, whose feature extractor is composed of the convolution and pooling layers of the AlexNet [51] network pre-trained on ImageNet [52]. The feature extractor outputs a 256-dimensional feature vector. Figure 5e gives the distribution of features extracted by the DCAE model, which is an autoencoder-based classification method proposed in [38]. In particular, the extracted GLCM and Gabor features first go through dimension reduction by average pooling and PCA. Then a dual-layer sparse autoencoder is employed to optimize these features. Finally, the optimized features are classified using a fully-connected layer with the Softmax activation function. The number of units in the first hidden layer is set to 1.5 times the input dimension, and the number of units in the second hidden layer is set to 0.8 times the input dimension. The greedy training method is utilized to train the autoencoder. The hyperparameters of this algorithm are set as follows: the loss function is the mean square error function; the learning rate is 0.0001; the batch size is 330; the training epoch number is 40. The 16 × 16 patches are cropped from the input SAR images with a step size of 16. In summary, for both AlexNet and DCAE, the separability of the features of each class is weaker than that of the first three methods.

Segmentation Comparison
After converting each superpixel into a feature vector, we can model a SAR image as an undirected graph, where superpixels are regarded as the vertices of the graph, and adjacent superpixels are connected by edges. Finally, SSVM is employed to update the coefficients of the graph-based CRF. The optimal segmentation results can be obtained by the AD3 algorithm.
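The graph construction described above can be sketched as follows. This is an illustrative example rather than the authors' implementation: the function name `superpixel_edges` is hypothetical, and treating two superpixels as adjacent when they share a 4-connected pixel boundary is an assumption, since the text does not state the adjacency rule.

```python
def superpixel_edges(label_map):
    """Return the undirected edges between adjacent superpixels.

    label_map is a 2D list in which each entry is the superpixel id of a
    pixel. Two superpixels are linked when any of their pixels touch
    horizontally or vertically (assumed 4-connectivity).
    """
    edges = set()
    rows, cols = len(label_map), len(label_map[0])
    for r in range(rows):
        for c in range(cols):
            a = label_map[r][c]
            # Looking only right and down visits every neighbouring pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = label_map[rr][cc]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return edges

# Toy 3 x 3 label map with three superpixels (ids 0, 1, 2).
toy_labels = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
edges = superpixel_edges(toy_labels)
print(sorted(edges))
```

Each vertex would then carry the feature vector of its superpixel, and each edge a pairwise term between the two superpixels it connects.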
In this part, we compare the segmentation performance of our HACRF method with several methods: TCGAN-CRF, CNN-CRF, AlexNet, DCAE, and SegNet [53]. Among them, the TCGAN-CRF only adopts the T-feature extractor in the TCGAN to extract the superpixels' features. The other parts and parameters are consistent with those of the HACRF. The comparison with the TCGAN-CRF allows us to explore whether or not introducing background feature extraction can improve the segmentation results. In the CNN-CRF, the fully supervised CNN mentioned in the feature extraction analysis section is applied to extract the features instead of the hierarchical CGAN. The CNN is composed of the T-feature extractor and the multi-classifier in the TCGAN. In order to avoid the influence of changed hyperparameters on the segmentation results, its hyperparameters are also consistent with those of the TCGAN.
AlexNet refers to the classification method based on the AlexNet proposed in [54]. This algorithm combines the AlexNet-based feature extractor (described in Section 3.3) and a fully connected network with Softmax functions for classification. During the training of the network, the parameters of the convolution and pooling layers in AlexNet are fixed, and the patches from the training set are used to train the fully connected layers. Following the suggestion of [54], input patches measuring 21 × 21 are cropped from the training images with a step size of 10. The loss function is the mean square error function; the learning rate is set to 0.0001; the batch size is set to 330; the training epoch number is 40; and the regularization parameter is set to 0.0005.
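To make the patch cropping for the patch-based baselines concrete, the sketch below enumerates the top-left corners of a sliding window. It is an illustration under stated assumptions, not the procedure of [54] or [38]: the helper `patch_origins` is hypothetical, the 1122 × 1419 image size is the one quoted for the dataset, and windows that would cross the image border are simply discarded.

```python
def patch_origins(height, width, patch, step):
    """Top-left corners of all patch x patch windows taken every `step` pixels.

    Windows that would extend past the image border are skipped
    (an assumption; padding is another common choice).
    """
    ys = range(0, height - patch + 1, step)
    xs = range(0, width - patch + 1, step)
    return [(y, x) for y in ys for x in xs]

# 21 x 21 patches with a step of 10 (AlexNet baseline setting in the text).
alexnet_origins = patch_origins(1122, 1419, patch=21, step=10)

# 16 x 16 patches with a step of 16 (DCAE setting in the text).
dcae_origins = patch_origins(1122, 1419, patch=16, step=16)

print(len(alexnet_origins), len(dcae_origins))
```

A step smaller than the patch size yields overlapping patches, whereas a step equal to the patch size tiles the image without overlap.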
SegNet is an encoder-decoder network for pixel-level semantic segmentation proposed in [53]. The encoder part uses the first 13 convolutional layers of VGG16 [55] pre-trained on ImageNet. Each encoder layer corresponds to a decoder layer, and the decoder employs the pooling indices calculated in the max-pooling operation of the corresponding encoder to perform upsampling. Then the upsampled maps are convolved with trainable filters to produce dense feature maps. The feature maps are sent to the Softmax classifier to generate the class probability for each pixel. We found that on the SAR image dataset, a larger patch size more easily causes unstable convergence during training, leading to poor segmentation results. Hence, we finally set the patch size to 16 × 16. Among the other hyperparameters, the learning rate is 0.01, the regularization parameter is 0.0005, the batch size is 300, and the training epoch number is 50.
Table 3 compares the segmentation results of the methods on the FangChengGang testing set. Our method is superior to the other methods in OA, OP, F1 score, and Kappa coefficient. Notably, the better metrics of HACRF compared with TCGAN-CRF show that the introduction of background information in the feature extraction step not only enhances the separability between different classes of superpixel features, but also improves the segmentation results. In addition, TCGAN-CRF performs much better than CNN-CRF, indicating that the weakly supervised CGAN network has better feature extraction ability than the supervised CNN network. AlexNet has the lowest OA, OP, F1 score, and Kappa coefficient. On the one hand, as a fully supervised method, AlexNet requires a large number of labeled samples. On the other hand, AlexNet and DCAE are both classification algorithms, and they do not preserve neighborhood label consistency. All four evaluation indicators also show that, among the six algorithms, SegNet's segmentation performance is only slightly worse than that of HACRF and better than that of the other four methods. It benefits from its encoder-decoder structure, which can enforce appearance and spatial consistency using convolution and deconvolution layers. The excellent feature extraction capability of the deep network improves its segmentation performance.
In the per-class F1 scores of Table 4, TCGAN-CRF's score for farmland regions (63.08) exceeds CNN-CRF's (50.42), indicating that the introduction of TCGAN effectively improves the segmentation results. The addition of BCGAN further increases the F1 score of HACRF to 70.63. For river areas, the F1 scores of all methods are approximately equal and all greater than 72, demonstrating that river features are simpler and easier to segment than farmland and urban areas. To be more specific, SegNet achieves the best score on the river class (75), a little higher than HACRF (73.67) and TCGAN-CRF (72.76). For the background areas, SegNet also gets the highest score, but it is only about 2.33 higher than that of the proposed HACRF. From the above analysis, SegNet is lower than HACRF in terms of OA, OP, F1 score, and Kappa coefficient mainly because of its poor segmentation of the complex farmland regions.
In addition, Figure 6 compares the pixel-level precision and recall values of the six methods. In Figure 6a, HACRF has a precision greater than 65% for river, farmland, and urban pixels. In the segmentation results of TCGAN-CRF, AlexNet, and DCAE, only the precision for river areas is higher than 60%. SegNet has a segmentation precision of more than 60% only for farmland and urban areas. Figure 6b shows that the recall values of our method for five classes of pixels are all higher than 60%. Conversely, the recall values of TCGAN-CRF, AlexNet, DCAE, and SegNet for urban pixels are all lower than 60%.
In order to visually compare the segmentation results of each method, we draw the segmentation results of several methods for two testing images in Figures 7 and 8. The input image in Figure 7 is mainly composed of river and farmland areas, which are both segmented accurately by the HACRF. Although the TCGAN-CRF and CNN-CRF can also capture the outlines of the two types of land cover, the number of misclassified superpixels is clearly larger than that of the HACRF. Because they do not consider appearance and spatial consistency, AlexNet and DCAE assign wrong labels to more patches.
The SAR image in Figure 8a mainly includes rivers and a small amount of farmland. The outline of the rivers is complex and the width of the rivers varies greatly, which increases the difficulty of the segmentation. In the result of the HACRF (Figure 8c), the farmland areas in the upper right corner and the river areas with a large width are completely segmented. It should be noted that, due to the use of superpixels as the segmentation units, some rivers whose width is less than the superpixels' width are misclassified as background. For the TCGAN-CRF (Figure 8d) and CNN-CRF (Figure 8e), more superpixels in the central region of the input are misclassified.

Discussion
The difference between segmentation and classification tasks is that segmentation methods need to consider appearance and spatial consistency. The previous superpixel-wise CRF-based segmentation methods only consider appearance and spatial consistency in the label optimization stage, and they depend heavily on labeled data. We aim to improve the segmentation performance by taking neighborhood consistency into account in the feature extraction stage and by making good use of the abundant unlabeled samples when training the deep feature extractor.
To verify the effectiveness of our improvements, we conduct a series of comparison experiments with some state-of-the-art algorithms. Among them, TCGAN-CRF only utilizes an original CGAN to extract features, without considering the relation between the central superpixels and their surrounding superpixels. Its segmentation results are better than those of the superpixel-wise CNN-CRF method, which uses a common supervised CNN to extract the features of the superpixels. This indicates that the introduction of unlabeled samples in the adversarial training of the CGAN improves the quality of feature extraction and achieves better segmentation performance. Compared with the TCGAN-CRF, our hierarchical CGAN considers the centered superpixels and the corresponding background areas during feature extraction. This improvement is consistent with the cognitive laws of human beings: in the absence of background information, it is difficult to judge the class of the centered superpixel. The background and target features from the hierarchical CGAN allow us to optimize the training of the CRF. A series of experiments has confirmed that this improvement leads to a better OA value, F1 score, and Kappa coefficient in the segmentation results.
We compare the computational time of the proposed method with that of the state-of-the-art methods in Table 5. The running time of our method, TCGAN-CRF, and CNN-CRF includes the time for over-segmentation, feature extraction, and CRF-based label optimization. Because all three models calculate the distance between adjacent superpixels in the feature space when modeling the CRF, they have similar computational efficiency. DCAE requires many operations in the Gabor and GLCM feature extraction stages, so it takes the longest time. Since over-segmentation is not needed, AlexNet has the shortest running time. SegNet is a commonly used end-to-end supervised deep segmentation method. It has the encoder-decoder structure common to pixel-level semantic segmentation networks. It employs convolution layers to learn the texture features, and uses subsequent deconvolution operations to learn the spatial distribution of all the pixels. Our method achieves better segmentation performance than SegNet. Because of its deconvolution layers, SegNet operates only at the pixel level, which brings a larger computation burden. Hence its running time is comparable to that of the DCAE.
In short, our approach introduces conditional adversarial networks to learn information from the unlabeled samples and lessen the reliance on the labeled samples. Both the target and background features are extracted by the hierarchical CGANs for CRF-based optimization. These two improvements make our segmentation accuracy better than that of the other methods. The weakness of our approach is that it is not an end-to-end method: the feature extraction part and the CRF are trained separately.
In future research, we will unify CRF parameter learning and feature extractor learning into one network to facilitate end-to-end segmentation. Although pixel-level segmentation with CNNs and CRF has been implemented in [40,41], the positional relation between neighboring superpixels in superpixel-wise segmentation algorithms remains too complex to be handled in the same way.

Conclusions
In this study, a new segmentation algorithm combining hierarchical CGANs with CRF has been presented for high-resolution airborne SAR imagery. The original images are first over-segmented into the initial superpixels. Then, we design a hierarchical CGAN network to extract the feature vectors of the superpixels. On the one hand, the introduction of CGAN enables this network to learn its parameters using the unlabeled samples. The experiments for feature extraction analysis show that the CGAN can improve the separability of the features of individual classes, and thus facilitate better segmentation results. On the other hand, inspired by the cognitive laws of human beings, this network simultaneously extracts the feature vectors of the centered superpixels and their corresponding background areas. The addition of the background information allows us to obtain optimal unary and pairwise potential parameters in CRF models and effectively preserve the neighbor consistency in the segmentation results. Evaluation measures on the datasets also show that this method further improves the segmentation accuracy.
(2) two kinds of superpixels are fed into the hierarchical CGAN, which is composed of a target CGAN (TCGAN) and a background CGAN (BCGAN), to extract their feature vectors; (3) the concatenated features are then utilized to train the CRF and infer the optimum label of each superpixel. The training of the hierarchical CGAN is performed using the labeled data and the unlabeled data from the testing set. The unary and pairwise potential coefficients of the CRF are learned by SSVM.
Remote Sens. 2019, 11

Figure 1. Architecture of our weakly supervised segmentation method based on HACRF.

L refers to the number of labeled images $x_L$. During training of the generator, we first fix the parameters of the discriminator and the multi-classifier. The $m$ fake images $G(z^{(i)})$ are fed into the discriminator, and then the generator's parameters are optimized by minimizing the loss function $V(G)$.
... classifier. The parameters of the B-feature extractor, discriminator, and multi-classifier are optimized by maximizing $V_B(D, M)$. In the testing stage, the trained dual feature extractors are employed to extract the features of the target superpixels $r_j^T$ and the background superpixels $r_j^B$, which are denoted as $f_j^T$ and $f_j^B$, respectively.


For the superpixels $x = \{x_1, x_2, \ldots, x_n\}$, the labels assigned to these superpixels are defined as $y = \{y_1, y_2, \ldots, y_n\}$ with $y_i \in \{1, \ldots, k\}$, where $k$ refers to the number of classes. CRF models the whole image as an undirected graph $G(V, E)$, where $V$ is the set of vertices and each superpixel corresponds to a vertex, and $E$ represents the set of edges. It has been demonstrated that in CRF, the conditional probability distribution of the label $y$ obeys a Gibbs distribution.
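For reference, a Gibbs distribution over a graph with unary and pairwise terms is commonly written as below. This is the generic textbook form, given here only as a sketch: the partition function $Z(x)$, the energy $E(y \mid x)$, and the potential symbols $\psi_u$ and $\psi_p$ are assumed notation and may differ from the symbols used elsewhere in this paper.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\bigl(-E(y \mid x)\bigr),
\qquad
E(y \mid x) = \sum_{i \in V} \psi_u(y_i, x_i)
            + \sum_{(i,j) \in E} \psi_p(y_i, y_j, x_i, x_j),
\qquad
Z(x) = \sum_{y'} \exp\bigl(-E(y' \mid x)\bigr)
```

Under this form, finding the most probable labeling is equivalent to minimizing the energy $E(y \mid x)$ over all label assignments.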

Figure 2. The architecture of the hierarchical CGAN, consisting of a target CGAN (TCGAN) and a background CGAN (BCGAN). In the training stage, the target superpixels $r_j^T$ and background superpixels $r_j^B$ are respectively utilized to train TCGAN and BCGAN.


• update the T-feature extractor, discriminator, and multi-classifier of TCGAN by maximizing the objective function
• update the generator of TCGAN by minimizing the loss function
• obtain the corresponding labeled and unlabeled background superpixels from x
• the generator of BCGAN outputs fake superpixels $g_i^B$
• update the B-feature extractor, discriminator, and multi-classifier of BCGAN by maximizing the objective function
• update the generator of BCGAN by minimizing the loss function
• model the CRF based on the feature vectors
• train the coefficients of CRF using SSVM and AD3
procedure Inference in CRF model
• obtain the target and background superpixels from $x_t$ → $r_j^T$ and $r_j^B$
• extract the features of $r_j^T$ and $r_j^B$ using the T-feature extractor and B-feature extractor → $f_j^T$ and $f_j^B$
• estimate MAP inference in the CRF model using AD3

Figure 3. Part of the images and the corresponding ground truth.


Figure 4. The parameters of TCGAN. The generator (blue rectangles) is composed of FC block 1, FC block 2, deconv block 1, and deconv block 2. The T-feature extractor (red rectangles) consists of the convolution block 1 and the fully-connected block 3. The Discriminator block and Multi-Classifier block are both implemented by a fully-connected layer.


Figure 6. Comparison of the pixel-wise precision and recall for individual classes of superpixels with five state-of-the-art algorithms. (a) pixel-wise precision; (b) pixel-wise recall.


Table 1. Comparison of recent CRF-based segmentation methods in terms of pros and cons.


... multipliers. As each local subproblem has a quadratic regularizer, AD3 converges faster than other subgradient-based dual decomposition and message-passing methods. Experimental results also show that AD3 can yield good segmentation. The pseudo code of the proposed method is presented in Algorithm 1, which explains the training of the hierarchical CGANs and CRF in detail.
Algorithm 1: Hierarchical Adversarial CRF (HACRF) based Image Segmentation
Input: Training images $x$, test images $x_t$
Output: Superpixel-wise segmentation of image $x_t$
procedure Training Hierarchical CGANs


Table 2. The numbers of superpixels of each class in the training and testing sets.


Table 3. Comparison of the overall segmentation performance with five state-of-the-art methods.

Table 4 gives the F1 scores of the methods for each class of superpixels. In detail, HACRF's F1 scores for urban areas and farmland, 63.17 and 70.63 respectively, are the highest among all methods. SegNet's scores for these two classes are only 61.46 and 52.56. Farmland is the hardest class to segment: AlexNet and DCAE's F1 scores are both less than 30, and SegNet's is only 52.56. Conversely, TCGAN-CRF's F1 score for farmland regions is 63.08, which is higher than the CNN-CRF's (50.42).

Table 4. Comparison of the pixel-wise F1 scores for individual classes with five state-of-the-art methods.

Table 5. Comparison of run time (seconds) with five state-of-the-art algorithms.