Semi-Supervised Cloud Detection in Satellite Images by Considering the Domain Shift Problem

Abstract: In semi-supervised cloud detection, efforts are being made to learn a promising cloud detection model from a limited number of pixel-wise labeled images and a large number of unlabeled ones. However, remote sensing images obtained from the same satellite sensor often exhibit data distribution drift due to the different cloud shapes and land-cover types on the Earth's surface. Therefore, there are domain distribution gaps between labeled and unlabeled satellite images. To solve this problem, we take the domain shift problem into account in the semi-supervised learning (SSL) network. Feature-level and output-level domain adaptations are applied to reduce the domain distribution gaps between labeled and unlabeled images, thus improving the prediction accuracy of the SSL network. Experimental results on Landsat-8 OLI and GF-1 WFV multispectral images demonstrate that the proposed semi-supervised cloud detection network (SSCDnet) is able to achieve promising cloud detection performance when using a limited number of labeled samples, and that it outperforms several state-of-the-art SSL methods.


Introduction
With the development of Earth observation technology, an increasing number of optical satellites have been launched for Earth observation missions. Remote sensing images acquired from these optical satellites serve environmental protection [1], global climate change [2], hydrology [3], agriculture [4], urban development [5], and military reconnaissance [6]. However, since about 60% of the Earth's surface is covered by clouds, the acquired optical remote sensing (RS) images are often contaminated by clouds [7]. In the field of meteorology, the cloud information in RS images is useful for weather forecasting [8], while, for Earth surface observation missions, cloud coverage degrades the quality of satellite imagery. Therefore, it is important to improve RS image quality through cloud detection.
Over the past few decades, cloud detection from RS imagery has attracted much attention, and many advanced cloud detection technologies have been proposed. In this paper, we broadly categorize these methods into rule-based methods and machine learning-based methods. The rule-based methods are mostly developed in the spectral/spatial domain [9][10][11][12]. These methods distinguish clouds from clear-sky pixels by exploiting reflectance variations in the visible, shortwave-infrared, and thermal bands. Rule-based methods have obvious flaws, i.e., they strongly depend on particular sensor models and have poor generalization performance. For example, the Fmask algorithms [9][10][11] are developed for Sentinel-2 and Landsat 4/5/7/8 satellite images, while the multi-feature combined (MFC) method [12] is developed for GF-1 wide field view (WFV) satellite images only. Machine learning-based cloud detection methods have also attracted much attention due to their powerful data adaptability. The most representative machine learning-based cloud detection methods are maximum likelihood [13,14], support vector machines (SVM) [15,16], and neural networks [17,18]. However, these methods heavily rely on hand-crafted features, such as color, texture, and morphological features, to distinguish clouds from clear-sky pixels.
In recent years, with the development of deep learning, deep convolutional neural network (DCNN) methods have been rapidly developed and widely used for cloud detection from RS images, for example, U-Net and SegNet variant cloud detection frameworks [19][20][21][22][23] and multi-scale/level feature fusion cloud detection frameworks [24][25][26][27][28]. In addition, advanced convolutional neural network (CNN) models, such as CDnetV2 [7] and ACDnet [29], have been developed for cloud detection from RS imagery with cloud-snow coexistence. To achieve real-time and onboard processing, lightweight neural networks, such as [30][31][32], have been proposed for pixel-wise cloud detection from RS images. However, most previous CNN-based cloud detection methods are based on supervised learning frameworks. Although these CNN-based cloud detection methods have achieved impressive performance, they heavily rely on a large amount of training data with strong pixel-wise annotations. Some recent cloud detection works, such as unsupervised domain adaptation (UDA) [33,34] and domain translation strategies [35], have begun to explore how to avoid using pixel-wise annotations for cloud detection network training. However, these annotation-free methods are not truly label-free because they rely on other labeled datasets.
Obtaining data labels is an expensive and time-consuming task, especially for pixel-wise annotation. As illustrated in the Cityscapes dataset annotation work [36], it usually takes 1.5 h to produce a pixel-wise annotation for a high-resolution urban scene image with a pixel size of 1024 × 2048. For a remote sensing image with a pixel size of 8k × 8k, labeling a whole scene may take many more hours according to such experience. Although it is easier to label cloud pixel-wise samples individually, it may still take three to four hours for a tough case that contains a large number of tiny and thin clouds, which undoubtedly increases the heavy cost of manual labeling. In contrast, unlabeled RS images can be acquired far more easily than labeled ones [37]. Therefore, it is highly desirable to explore how to utilize a large number of unlabeled data to enhance the performance of a cloud detection model.
In this paper, we propose to use a semi-supervised learning (SSL) method [38,39] to train a cloud detection network, because the SSL method is able to reduce the heavy cost of manual dataset labeling. In a semi-supervised segmentation framework, such as DAN [37] and s4GAN [40], the segmentation network (cloud detection network) is able to simultaneously take advantage of a large number of unlabeled samples and a limited number of labeled examples for learning the network's parameters. The core of the SSL method is a self-training [41] strategy, which leverages pseudo-labels generated from a large number of unlabeled samples to supervise the segmentation network training [42]. This also means that accurate pseudo-label labelling is the key to self-training. Therefore, most advanced SSL methods, such as [40,43], focus on improving the pseudo-labels of unlabeled samples to improve the performance of the SSL network.
SSL networks developed for traditional natural image segmentation, such as [38][39][40]43], may not achieve promising performance for satellite image cloud detection because RS images differ from traditional camera images. In addition, a data drift problem may appear between the limited number of labeled examples and the large number of unlabeled samples due to different cloud shapes and land-cover types in different satellite images. An SSL network trained with labeled samples is difficult to generalize to unlabeled samples due to this data drift problem. During training, the SSL network may produce prediction results with lower certainty when the network is fed unlabeled data [40]. Prediction results of unlabeled samples with lower certainty cannot generate accurate pseudo-labels, which makes self-training unfavorable for providing supervision signals for network training.
In summary, the main contributions of this work are as follows: (i) We propose a semi-supervised cloud detection framework, named SSCDnet, which learns knowledge from a limited number of pixel-wise labeled examples and a large number of unlabeled samples for cloud detection. (ii) We take the domain shift problem between labeled and unlabeled images into account and propose feature-level and output-level domain adaptation methods to reduce domain distribution gaps. (iii) We propose a double-threshold pseudo-labelling method to obtain trustworthy pseudo labels, which helps to avoid the effects of noisy labels on self-training as much as possible and further enhances the performance of SSCDnet.
This paper is organized as follows: in Section 2, we present the proposed SSCDnet in detail. Experimental datasets and network training details are presented in Section 3. The experimental results and discussions are presented in Sections 4 and 5, respectively, followed by conclusions in Section 6.

The Proposed Method
In this section, we provide a detailed introduction of the proposed SSCDnet, including the traditional semi-supervised segmentation framework, the proposed overall workflow of SSCDnet, feature/output-level domain adaptation, trustworthy pseudo label labelling, and the cloud detection and discriminator network structures.

Traditional Semi-Supervised Segmentation Framework
In a traditional semi-supervised segmentation framework [37][38][39][40], as shown in Figure 1, the segmentation network G simultaneously takes advantage of a large number of unlabeled samples and a limited number of labeled examples for training the network's parameters. In this framework, there are two datasets, i.e., a labeled dataset M_l = {x_l, y_l} and an unlabeled dataset M_u = {x_u}, where x_l and x_u are the input data of the segmentation network G and y_l is the pixel-wise label of x_l. p_l and p_u represent the predicted results of x_l and x_u, respectively, and p̂_u represents the pseudo label derived from p_u. During training, given an input labeled image x_l, the segmentation network G is supervised by a standard cross-entropy loss L_ce. When using the unlabeled data x_u, the segmentation network is further supervised by a self-training loss L_st. That is, we use the pseudo label p̂_u generated from the predicted result p_u as the "ground truth" for self-training to enhance the semantic segmentation network G. Therefore, the total training objective L_G of the segmentation network G is defined as follows:

L_G = L_ce + λ_st L_st,
where λ_st is a weight used in minimizing the objective L_G. As shown in Figure 2, remote sensing images obtained from different places show large domain distribution gaps between each other due to different cloud shapes and land-cover types on the Earth's surface. Therefore, there is data drift between the labeled samples and the unlabeled ones in the SSL training dataset. In the SSL framework, a segmentation network G trained with labeled samples is hard to generalize to unlabeled ones due to this data drift problem, and it is difficult to produce a highly certain prediction result p_u for an unlabeled sample x_u. Predicted results with low certainty lead to low-quality pseudo-labels, thus affecting the supervision signal provided by the self-training loss L_st and further affecting the performance of segmentation network G. Therefore, improving the certainty of the predicted results of unlabeled samples is key to a semi-supervised learning framework.
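The traditional objective above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the function names are ours, and the naive argmax pseudo-labelling here is exactly the baseline that the trustworthy-labelling method later improves upon.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-8):
    # p: (H, W, C) softmax probabilities, y: (H, W) integer class labels
    rows, cols = np.indices(y.shape)
    return float(-np.mean(np.log(p[rows, cols, y] + eps)))

def ssl_objective(p_l, y_l, p_u, lam_st=1.0):
    # L_G = L_ce(labeled) + lam_st * L_st(unlabeled vs. its own pseudo label)
    pseudo = np.argmax(p_u, axis=-1)  # naive argmax pseudo-labels, no filtering
    return cross_entropy(p_l, y_l) + lam_st * cross_entropy(p_u, pseudo)
```

Self-training with raw argmax pseudo-labels is exactly what suffers under the data drift described above, which motivates the filtering introduced later.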

Proposed Semi-Supervised Cloud Detection Framework
In this paper, we take the domain shift problem into account in the semi-supervised learning (SSL) framework to improve the generalization of the segmentation network so that it generates trustworthy pseudo-labels for self-training. We improve a standard SSL network with an unsupervised domain adaptation (UDA) strategy and propose an improved semi-supervised cloud detection network, as shown in Figure 3. UDA methods are able to help semantic segmentation networks learn domain-invariant features at different representation layers, such as the input level (pixel level) [49], feature level [45], or output level [46,47].
Different from the traditional SSL framework, we apply feature-level and output-level domain adaptations at the intermediate layers and the end layer of the network, respectively, to reduce domain distribution gaps and improve the generalization performance of SSCDnet. A highly generalized network is able to generate highly certain predicted results for unlabeled samples. To further improve the quality of the pseudo-labels, instead of directly using the predicted results of unlabeled samples for network training, we propose a trustworthy pseudo label labelling method to obtain high-quality pseudo labels. To be specific, similar to [40,43], we take advantage of the feedback information from output-level domain adaptation to obtain high-quality candidate labels. Then, we use a threshold strategy to obtain trustworthy regions from the high-quality candidate labels. Finally, the trustworthy regions are considered as the ground-truth labels for network training through the self-training loss. During training, given an input labeled image x_l with corresponding predicted result p_l, the cloud detection network is supervised by a standard cross-entropy loss L_ce [27]. When inputting unlabeled data x_u, the segmentation network G predicts its corresponding result p_u. Then, we generate the pseudo label p̂_u from the predicted result p_u by using the proposed trustworthy pseudo label labelling method, and p̂_u serves as the "ground truth" for self-training to enhance the semantic segmentation network G. Therefore, when inputting unlabeled data, the semantic segmentation network is further supervised by the self-training loss L_st, the feature-level domain adaptation loss L_fda, and the output-level domain adaptation loss L_oda. The total loss of the semi-supervised cloud detection network, named SSCDnet, is defined as follows:

L_G = L_ce + λ_st L_st + λ_oda L_oda + λ_fda L_fda,

where λ_st, λ_oda, and λ_fda are three regulation parameters used in minimizing the objective L_G.
During training, we minimize L_G to update the parameters of segmentation network G. The detailed feature/output-level domain adaptation strategies and the trustworthy pseudo label labelling method are introduced in the following subsections.
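The composition of L_G is a plain weighted sum over the four loss terms. A minimal helper (hypothetical name; the default weights mirror the GF-1 settings reported later in the training details) makes the bookkeeping explicit:

```python
def sscdnet_objective(l_ce, l_st, l_oda, l_fda,
                      lam_st=1.0, lam_oda=0.1, lam_fda=0.001):
    # L_G = L_ce + lam_st*L_st + lam_oda*L_oda + lam_fda*L_fda
    return l_ce + lam_st * l_st + lam_oda * l_oda + lam_fda * l_fda
```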

Reducing Domain Distribution Gaps

Feature-Level Domain Adaptation
To reduce domain distribution gaps at the feature level, we propose a class-relevant feature alignment (CRFA) strategy. As shown in Figure 4, we use the predicted score maps of each class (i.e., the cloud and background classes) as attention maps to obtain class-relevant features. Then, we design a standard binary classification network as the discriminator to help the segmentation network G generate domain-invariant feature representations, thus helping to reduce the domain distribution gaps between labeled and unlabeled samples at the feature level. The proposed CRFA domain adaptation therefore consists of two stages: (i) class-relevant feature selection and (ii) class-relevant feature alignment.
To be specific, let H_l^k ∈ R^(W×H×C) and H_u^k ∈ R^(W×H×C) denote the labeled and unlabeled samples' features extracted from the k-th intermediate hidden layer of network G, respectively, and let p_l(:,:,1) and p_l(:,:,2) denote the spatial attention maps of the cloud and background areas of the labeled samples, respectively. Then, the cloud-relevant feature C_l^k ∈ R^(W×H×C) and background-relevant feature B_l^k ∈ R^(W×H×C) of the labeled samples' feature H_l^k are defined as follows:

C_l^k = F(p_l(:,:,1)) ⊗ H_l^k

and

B_l^k = F(p_l(:,:,2)) ⊗ H_l^k,

where F(·) represents a sampling operator (down-sampling or up-sampling) that matches the attention map to the feature resolution. Similarly, we obtain the cloud-relevant feature C_u^k and background-relevant feature B_u^k of the unlabeled samples' feature H_u^k. For segmentation network training, the adversarial objectives provided by the discriminators (D_crfa) are defined as follows:

L_adv^(c,k) = −Σ log D_crfa(C_u^k)

and

L_adv^(b,k) = −Σ log D_crfa(B_u^k).

Therefore, the class-relevant feature alignment loss L_crfa^k of the k-th intermediate hidden feature is defined as follows:

L_crfa^k = L_adv^(c,k) + L_adv^(b,k).

Since the proposed feature-level domain adaptation is performed at multiple intermediate layers, the feature-level domain adaptation loss L_fda is provided by all intermediate layers' CRFA losses {L_crfa^k}, k = 1, …, K, i.e.,

L_fda = Σ_(k=1)^K L_fda^k,

where L_fda^k = L_crfa^k is the feature domain adaptation loss of the k-th intermediate hidden feature.
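The class-relevant feature selection stage can be sketched in NumPy: resample a class score map to the feature resolution (a stand-in for the sampling operator F(·), implemented here with nearest-neighbour resampling for simplicity) and weight every feature channel by it. Function names are ours.

```python
import numpy as np

def resample_nearest(score, out_h, out_w):
    # F(.): nearest-neighbour resampling of a (h, w) score map to (out_h, out_w)
    ri = np.arange(out_h) * score.shape[0] // out_h
    ci = np.arange(out_w) * score.shape[1] // out_w
    return score[ri][:, ci]

def class_relevant_feature(feat, score):
    # C^k = F(score) (*) H^k: weight every channel of the hidden feature
    # feat (H, W, C) by the per-pixel class probability map score (h, w)
    att = resample_nearest(score, feat.shape[0], feat.shape[1])
    return feat * att[:, :, None]
```

A uniform score map of ones leaves the feature untouched, while a zero map suppresses it entirely; intermediate probabilities softly select the class-relevant activations.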

Output-Level Domain Adaptation
Instead of directly aligning the predicted results, we align cloud and background object images extracted from labeled and unlabeled samples, respectively, via a proposed class-relevant outputs alignment (CROA) method. Similar to the CRFA method, the CROA method is built on the standard adversarial learning framework. CROA consists of two discriminators, which are used to align the cloud and background images, respectively. As shown in Figure 5, we use two standard binary classification networks as discriminators to reduce the domain distribution gaps between the labeled and unlabeled datasets at the output level and to improve the certainty of the predicted result p_u for an unlabeled sample x_u. We use the cloud and background objects extracted from the original images as the input data of the two discriminators. To be specific, let z_c^u = p_u(:,:,1) ⊗ x_u and z_b^u = p_u(:,:,2) ⊗ x_u denote the extracted cloud and background objects of an unlabeled sample. Similarly, let z_c^l = p_l(:,:,1) ⊗ x_l and z_b^l = p_l(:,:,2) ⊗ x_l denote the extracted cloud and background objects of a labeled sample, where ⊗ represents element-wise multiplication. For discriminator training, both discriminators D_croa^1 and D_croa^2 use the cross-entropy domain classification loss [44] as the objective function. They are defined as follows:

L_D^1 = −Σ [log D_croa^1(z_c^l) + log(1 − D_croa^1(z_c^u))]

and

L_D^2 = −Σ [log D_croa^2(z_b^l) + log(1 − D_croa^2(z_b^u))].

For segmentation network training, the adversarial objectives provided by these discriminators are defined as follows:

L_adv^c = −Σ log D_croa^1(z_c^u)

and

L_adv^b = −Σ log D_croa^2(z_b^u).

Therefore, the output-level domain adaptation objective L_oda is defined as follows:

L_oda = L_adv^c + L_adv^b.

In Figure 6, we present visualized experiment results on a GF-1 WFV image with and without domain adaptation. The results show that applying domain adaptation enables the segmentation network to produce highly certain predicted results for unlabeled samples, as shown in Figure 6c. Highly certain predicted results ensure that we can obtain trustworthy pseudo labels for self-training, thus improving the network's cloud detection performance. More detailed information can be found in Section 4.1 (Ablation Study).
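The object extraction feeding the CROA discriminators is a per-pixel masking of the input image by the predicted class probabilities. A minimal sketch (our function name; note the code uses 0-based channel indices where the paper writes p(:,:,1) and p(:,:,2)):

```python
import numpy as np

def extract_objects(x, p):
    # z_c = p(:,:,1) (*) x and z_b = p(:,:,2) (*) x: mask the input image
    # x (H, W, B) with the predicted cloud / background probability maps p (H, W, 2)
    z_c = p[:, :, 0:1] * x  # cloud object image
    z_b = p[:, :, 1:2] * x  # background object image
    return z_c, z_b
```

Because the two probability maps sum to one per pixel, the two object images partition the input: z_c + z_b reconstructs x.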

Trustworthy Pseudo Label Labelling
Pseudo label labelling is the core of self-training in an SSL framework [38][39][40]43,50]. During self-training, pseudo labels predicted by the segmentation model serve as the "ground truth" to provide additional supervisory signals for network training, which enables the segmentation network to leverage unlabeled data. In this paper, we propose a double-threshold pseudo-labelling method to efficiently obtain trustworthy regions from pseudo labels, and we use these trustworthy regions as the ground truth for network training.

Candidate Labels Selection
As shown in Figure 7, we feed an unlabeled image x_u into the segmentation network G, obtaining its corresponding predicted probability maps p_u = G(x_u) as well as the extracted cloud and background objects z_c^u and z_b^u, respectively. We first use the feedback information from the output-level domain adaptation to obtain candidate labels. Specifically, we select high-quality pseudo labels online based on the two discriminator scores of the output-level domain adaptation, i.e.,

D_croa^1(z_c^u) ≥ τ_1 and D_croa^2(z_b^u) ≥ τ_1, (16)

where τ_1 is the threshold. Equation (16) is the feedback information from the output-level domain adaptation. We treat predicted labels that satisfy Equation (16) as candidate labels. If Equation (16) is not satisfied, we directly set the self-training loss L_st = 0.
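The candidate-label check is a simple gate on the two discriminator scores. A sketch, assuming (as we read Equation (16)) that both scores must reach τ_1; the function name is ours:

```python
def is_candidate(d1_score, d2_score, tau1=0.6):
    # Keep the pseudo label only if both CROA discriminators rate the extracted
    # cloud and background objects at least tau1; otherwise L_st is set to 0.
    return d1_score >= tau1 and d2_score >= tau1
```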

Trustworthy Regions Selection from Candidate Labels
After obtaining the candidate labels, we set a confidence threshold τ_2 to discover trustworthy regions within the selected candidate labels, i.e.,

max_c p_u(w, h, c) > τ_2. (17)

Equation (17) indicates that, if the predicted probability of a region is greater than τ_2, we use this trustworthy region as the ground truth for self-training.

Self-Training Loss L st
According to the above-mentioned method, the self-training loss L_st is defined as:

L_st = −(1/|T|) Σ_((w,h)∈T) Σ_c p̂_u(w, h, c) log p_u(w, h, c),

where T denotes the set of trustworthy pixels, p̂_u is the pseudo label generated from the prediction map p_u by using a one-hot encoding scheme, and p_u(w, h, c) represents the predicted probability of the c-th channel at location (w, h).
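Once a sample passes the candidate gate, the double-threshold pipeline reduces to: build argmax pseudo-labels, keep only pixels whose top-class probability exceeds τ_2, and average the cross-entropy over those trustworthy pixels. A NumPy sketch (our function name; the candidate-label gate of Equation (16) is assumed to have already passed):

```python
import numpy as np

def self_training_loss(p_u, tau2=0.55, eps=1e-8):
    # Masked self-training loss: one-hot pseudo-labels from argmax(p_u), but only
    # pixels whose top-class probability exceeds tau2 (trustworthy regions) count.
    pseudo = np.argmax(p_u, axis=-1)   # pseudo-label class index per pixel
    conf = np.max(p_u, axis=-1)
    trust = conf > tau2                # trustworthy-region mask
    if not trust.any():
        return 0.0
    rows, cols = np.nonzero(trust)
    return float(-np.mean(np.log(p_u[rows, cols, pseudo[rows, cols]] + eps)))
```

Low-confidence pixels contribute nothing, so noisy pseudo-labels in uncertain regions cannot corrupt the supervision signal.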
Network Architecture

Cloud Detection Network

Similar to most SSL approaches [40,43], we use DeepLabv2 [51] as our main cloud detection framework and adopt ResNet-101 [52] as the backbone to extract semantic segmentation information. DeepLabv2 uses an atrous spatial pyramid pooling (ASPP) module [51], which incorporates multiple parallel dilated convolutional layers [53] with different sampling rates, to capture multi-scale features for robust cloud detection. In this paper, we set the resolution of the predicted probability map to 1/8 × 1/8 of the input image size for fair comparison with previous DeepLabv2-based semi-supervised works, such as [37,40]. Then, we directly up-sample the predicted probability maps to the same size as the input images to obtain the final predicted results.

Discriminator Network
In this paper, we design a standard binary classification network as the discriminator for both the CRFA and CROA modules. Figure 8 illustrates the detailed structure of the designed discriminator, which contains four convolutional layers, a global average pooling layer, and a fully-connected layer. To be specific, the four convolutional layers use 4 × 4 kernels and have {256, 192, 128, 64} and {64, 128, 256, 512} channels for the CRFA and CROA modules, respectively. Each convolutional layer uses the same stride (stride = 2) and is followed by a Leaky-ReLU activation (slope = 0.2) and a dropout layer (dropout rate = 0.5). After these convolution operations, a global average pooling layer and a fully-connected layer produce a confidence score for each input image.
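The discriminator's forward pass can be sketched with a naive NumPy convolution to check the shapes: four 4 × 4, stride-2 layers (the CROA channel widths), Leaky-ReLU after each, then global average pooling and a fully-connected layer producing one confidence score. Random weights and padding = 1 are our assumptions, and dropout is omitted since it is inactive at inference:

```python
import numpy as np

def conv2d(x, w, stride=2, pad=1):
    # Naive strided 2-D convolution. x: (C_in, H, W); w: (C_out, C_in, k, k).
    c_out, c_in, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H = (xp.shape[1] - k) // stride + 1
    W = (xp.shape[2] - k) // stride + 1
    out = np.empty((c_out, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i*stride:i*stride+k, j*stride:j*stride+k]
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def discriminator(x, channels=(64, 128, 256, 512), seed=0):
    # Four 4x4 stride-2 conv layers (CROA widths), each followed by Leaky-ReLU.
    rng = np.random.default_rng(seed)
    for c_out in channels:
        w = rng.standard_normal((c_out, x.shape[0], 4, 4)) * 0.05
        y = conv2d(x, w)
        x = np.where(y > 0, y, 0.2 * y)   # Leaky-ReLU, slope 0.2
    pooled = x.mean(axis=(1, 2))          # global average pooling
    fc = rng.standard_normal(pooled.shape[0]) * 0.05
    return float(1.0 / (1.0 + np.exp(-pooled @ fc)))  # sigmoid confidence score
```

Each stride-2 layer halves the spatial resolution, so a 3 × 32 × 32 input shrinks to 512 × 2 × 2 before pooling.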

Experimental Dataset
In this paper, we use two publicly available cloud cover validation datasets, i.e., the Landsat-8 OLI cloud cover validation dataset [48] and the GF-1 WFV cloud and cloud shadow cover validation dataset [12], to comprehensively evaluate the proposed SSCDnet. Table 1 shows the detailed information of the Landsat-8 OLI and GF-1 WFV multispectral images. Similar to [27], we follow the idea that the numbers of sub-images with and without cloud in the training data should be balanced; otherwise, the detection results would be biased towards the majority. Therefore, we exclude some cloud-free and fully cloud-covered scenes. In addition, current cloud detection accuracy measurements are not robust when the cloud percentage is quite low [10]: a low cloud cover percentage in a scene may cause an apparent reduction in the cloud's producer accuracy and user accuracy. Therefore, original large-size images with a cloud percentage of less than 5% are usually removed from training and testing. Finally, we select 40 and 20 scenes from each of the Landsat-8 and Gaofen-1 WFV datasets for network training and evaluation, respectively. As illustrated in [54], CNNs are strongly biased towards recognizing textures for object classification. Therefore, we can select a limited number of channels of the multispectral data to evaluate the proposed SSCDnet. In this paper, we select channels 3 (green), 4 (red), and 5 (near-infrared) of the Landsat-8 OLI data and channels 2 (green), 3 (red), and 4 (near-infrared) of the Gaofen-1 WFV data for segmentation network training and testing.
During training, all training data are cropped into sub-images with a pixel size of 321 × 321, yielding about 30k and 75k annotated sub-images in the Landsat-8 OLI and GF-1 WFV training datasets, respectively. We select a portion of the dataset for supervised network training and use the remaining, de-annotated portion for network self-training (unsupervised training). During testing, we divide the whole RS image into a series of sub-images with an image size of 1200 × 1200 for network evaluation, and the final detection result is obtained by merging the results of the sub-images.
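The tile-and-merge evaluation scheme can be sketched as follows (our function name; border tiles are zero-padded to the tile size, and only the valid region of each tile's prediction is written back, which is one reasonable way to handle scene edges):

```python
import numpy as np

def predict_tiled(image, predict_fn, tile=1200):
    # Split an (H, W, B) scene into tile x tile sub-images, run predict_fn on
    # each (it returns a (tile, tile) label map), and merge the results back.
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=int)
    for r0 in range(0, H, tile):
        for c0 in range(0, W, tile):
            sub = np.zeros((tile, tile, image.shape[2]), dtype=image.dtype)
            h, w = min(tile, H - r0), min(tile, W - c0)
            sub[:h, :w] = image[r0:r0+h, c0:c0+w]       # zero-pad border tiles
            out[r0:r0+h, c0:c0+w] = predict_fn(sub)[:h, :w]
    return out
```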
In addition, in order to improve performance, we initialize the backbone network (ResNet-101) with a model pre-trained on the ImageNet dataset [58] and fine-tune its parameters. We conduct the feature-level domain adaptation tasks at the ends of the Conv4_x and Conv5_x residual blocks, i.e., K = 2. We empirically set the weight parameters λ_st = 1.0, λ_fda = 0.001, and λ_oda = 0.1 for the GF-1 dataset, and λ_st = 1.0, λ_fda = 0.001, and λ_oda = 0.01 for the Landsat-8 OLI dataset. In addition, we empirically set the threshold parameters τ_1 = 0.6 and τ_2 = 0.55 to accurately generate trustworthy pseudo labels for both the GF-1 and Landsat-8 OLI datasets.

Ablation Study
To investigate the effectiveness of SSCDnet, we conduct a series of ablation studies. Five widely used quantitative metrics for RS images, i.e., mean intersection over union (MIoU), kappa coefficient (Kappa), overall accuracy (OA), producer accuracy (PA), and user accuracy (UA), are used to comprehensively measure the cloud detection results. All experiment results are obtained on the GF-1 WFV dataset.
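For reference, the five metrics can be computed from a binary confusion matrix. This sketch uses the standard definitions (PA/UA shown for the cloud class); the paper's exact per-class averaging conventions may differ:

```python
import numpy as np

def cloud_metrics(pred, gt):
    # Binary masks: 1 = cloud, 0 = clear sky.
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    oa = (tp + tn) / n                       # overall accuracy
    miou = (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2   # mean IoU
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (oa - pe) / (1 - pe)             # Cohen's kappa
    pa = tp / (tp + fn)                      # producer's accuracy (cloud recall)
    ua = tp / (tp + fp)                      # user's accuracy (cloud precision)
    return dict(OA=oa, MIoU=miou, Kappa=kappa, PA=pa, UA=ua)
```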

Ablation Study on Loss Function
In this paper, the proposed SSCDnet is supervised by four loss functions, i.e., a standard cross-entropy loss L_ce, self-training loss L_st, output-level domain adaptation loss L_oda, and feature-level domain adaptation loss L_fda. In Table 2, we list ablation results under different proportions of labeled samples (1/200, 1/100, 1/40, 1/20, and fully labeled samples) to demonstrate the effect of each component loss. Results in Table 2 show that the best performance is achieved by the combination of all loss terms (denoted {L_ce, L_st, L_oda, L_fda}), while the baseline framework supervised only by the standard cross-entropy loss L_ce shows the worst performance. To be specific, the combination of L_ce and L_st (i.e., {L_ce, L_st}) and the combination of L_ce, L_st, and L_oda (i.e., {L_ce, L_st, L_oda}) obtain significant improvements of +9.44% and +13.86% in Kappa, respectively, compared with L_ce alone when the labeled sample proportion is 1/200. In addition, {L_ce, L_st} and {L_ce, L_st, L_oda} also outperform L_ce under the other proportions of labeled samples, which shows that the output-level domain adaptation and self-training strategies are able to improve the performance of the segmentation network.
Based on {L_ce, L_st, L_oda}, we further introduce the feature domain adaptation loss L_fda to reduce feature-level domain distribution gaps between the labeled and unlabeled datasets. Results in Table 2 show that {L_ce, L_st, L_oda, L_fda} performs better than {L_ce, L_st, L_oda}. In addition, the performance of {L_ce, L_st, L_oda, L_fda} consistently outperforms those of the other combinations under the different proportions of labeled samples ({1/200, 1/100, 1/40, 1/20}), which shows that the proposed {L_ce, L_st, L_oda, L_fda} is a promising strategy for a semi-supervised segmentation network.
In summary, the results in Table 2 demonstrate that the standard cross-entropy loss L_ce, self-training loss L_st, output-level domain adaptation loss L_oda, and feature-level domain adaptation loss L_fda are all beneficial for semi-supervised cloud detection. Applying the domain adaptation strategy reduces the distribution gaps between the labeled and unlabeled datasets and helps SSCDnet generate trustworthy pseudo-labels for self-training, thus providing positive supervision signals for segmentation network learning. The combination of all these loss terms achieves state-of-the-art cloud detection performance when using a limited number of labeled samples.

Ablation Study on Feature-Level Domain Adaptation
In this paper, we use ResNet-101 [52] as the backbone network to extract feature maps and conduct the domain adaptation study on the end features of the Conv2_x, Conv3_x, Conv4_x, and Conv5_x residual blocks. In Table 3, we list the results of the ablation study on different intermediate layers. The results show that applying feature domain adaptation at the ends of the Conv4_x and Conv5_x residual blocks, i.e., FDA_45, achieves the best performance, while applying feature domain adaptation on other layers' features always performs worse. Hence, we conduct the domain adaptation tasks at the ends of the Conv4_x and Conv5_x residual blocks (two intermediate layers, K = 2, in the feature-level domain adaptation). Furthermore, we find that applying domain adaptation tasks at the above-mentioned intermediate layers also achieves promising performance on the Landsat-8 OLI dataset.

In Table 4, we list the experiment results under different pseudo-labelling strategies, i.e., pseudo-labelling directly transformed from the probability maps, L_st(N/A, N/A); pseudo-labelling based on the discriminator scores of the output-level domain adaptation, L_st(τ_1, N/A); and pseudo-labelling based on both the discriminator scores of the output-level domain adaptation and trustworthy region selection, L_st(τ_1, τ_2), where τ_1 and τ_2 are used to obtain the candidate labels and trustworthy regions, respectively. Results in Table 4 show that the performance of the loss term L_st(τ_1, τ_2) is better than those of L_st(τ_1, N/A) and L_st(N/A, N/A), which demonstrates that trustworthy region selection based on high-confidence candidate labels is a promising strategy for pseudo label selection.

Hyper-Parameter Analysis
The proposed training objective L_G has three balance weights, i.e., λ_st, λ_oda, and λ_fda. In this paper, since we use all the trustworthy regions' pseudo labels for self-training, we directly set λ_st = 1.0. Then, λ_oda and λ_fda are two important hyper-parameters that affect the cloud detection results. In Table 5, we list a series of validation experiment results to investigate the impact of these hyper-parameters. As illustrated in Table 5, we first obtain a promising λ_oda by setting λ_fda = N/A (N/A means Not Applicable), and then obtain a promising λ_fda based on the obtained λ_oda. The experimental results in Table 5 show that setting λ_oda = 0.10 and λ_fda = 0.001 achieves promising performance. Similarly, we obtain the promising hyper-parameter settings for the Landsat-8 OLI cloud detection task with the above-mentioned method.

For comprehensive evaluation, we compare against the deep adversarial network DAN [37], which focuses on training with unlabeled and labeled images simultaneously to improve segmentation performance. Moreover, we compare two semi-supervised CNN-based semantic segmentation methods, i.e., Hung et al. [43] and s4GAN [40]. These two SSL methods are based on adversarial networks and achieve promising performance on the PASCAL VOC 2012 [59] and Cityscapes [36] datasets. For fair comparison with the proposed SSCDnet, we use DeepLabv2 [51] with a ResNet-101 [52] backbone as the segmentation network of the above-mentioned competing methods. In addition, the baseline network DeepLabv2 [51] is also used as a competing method. During training, we retrain these CNN-based methods under their optimal parameter settings on the Landsat-8 OLI and GF-1 WFV datasets.

Results on GF-1 WFV Data
Table 6 shows the quantitative results in terms of average OA, MIoU, Kappa, PA, and UA on the GF-1 WFV testing dataset. We show comparison results for four different proportions of labeled samples (1/200, 1/100, 1/40, and 1/20), where 1/200, 1/100, 1/40, and 1/20 are the fractions of the total training images in the dataset that are used as labeled data, with the rest of the data used without labels. We also give comparison results with fully labeled samples. The experiment results show that our proposed SSCDnet consistently outperforms the competing methods at every proportion of labeled samples. Notably, SSCDnet achieves 83.08% Kappa and 86.96% MIoU using only 0.5% (1/200) of the training data with pixel-wise annotation, significantly outperforming the competing methods. SSCDnet also achieves the best performance on fully labeled data and shows larger gains in terms of OA, MIoU, Kappa, PA, and UA than the competing methods.
It can be seen that the baseline network, DeepLabv2 [51], trained with only labeled data, shows the worst results. DAN [37] shows better results than DeepLabv2 [51] because it is able to effectively utilize unlabeled data for training. The semi-supervised methods of Hung et al. [43] and s4GAN [40] show better results than DeepLabv2 [51] and DAN [37] because they are able to learn knowledge from a limited number of labeled examples and a large number of additional unlabeled samples. Although these two methods show competitive cloud detection performance, there is still a performance gap between them and the proposed SSCDnet. In addition, all of the methods show promising cloud detection performance on GF-1 WFV data when using fully labeled data for network training; even the worst baseline method, DeepLabv2 [51], achieves 90.15% MIoU and 88.64% Kappa.
In Figure 9, we show qualitative results when the labeled sample proportion is 1/200. These images contain typical land-cover types: Figure 9a includes mountains, wetlands, farmland, and grass/crops; Figure 9b includes village, forest, and ice/snow; and Figure 9c includes water. The detection results of SSCDnet are more consistent with the ground truth than those of the other competing methods. The results of the competing methods show many misclassified pixels in thin cloud areas, while SSCDnet shows fewer misclassified pixels, which indicates that SSCDnet achieves more promising performance on these images when using a limited number of labeled samples. The image IDs of (a), (b), and (c) are GF1_WFV2_W102.1_N37.6_20140517_L2A0000244678, GF1_WFV3_E114.1_N2.1_20151011_L2A0001094727, and GF1_WFV3_E87.8_N2.1_20140316_L2A0000184430, respectively.

Results on Landsat-8 OLI Data
In addition to the experiments on GF-1 data, we also conduct experiments on Landsat-8 OLI data. In Table 7, we list quantitative results on the Landsat-8 OLI testing dataset. Compared to the other methods, the proposed SSCDnet also performs best on Landsat-8 OLI data. Even for a low labeled sample proportion such as 1/200, SSCDnet is still able to achieve satisfactory results (90.77% MIoU and 88.72% Kappa) that are consistent with the ground truth. In contrast, the performance of the competing methods falls short of that of SSCDnet. When the proportion of labeled samples is increased, the competing methods improve, but they remain inferior to SSCDnet. From Table 7, we find that the performance of SSCDnet at a labeled sample proportion of 1/20 approaches that of its fully supervised counterpart and outperforms fully supervised DeepLabV2 [51] and DAN [37]. For fully labeled data, with the help of the adversarial training and self-training strategies, SSCDnet still shows the best performance among the competing methods.
In Figure 10, we show qualitative results on three whole-scene Landsat-8 OLI images. These images contain typical land-cover types: Figure 10a includes mountains, forest, ice/snow, water, and wetland areas; Figure 10b includes water, floating ice, urban, mountain, and forest areas; and Figure 10c includes barren and desert areas. The results are obtained with a labeled sample proportion of 1/200. They show that SSCDnet trained with a limited number of labeled samples yields very competitive performance on Landsat-8 OLI data. The results of the competing methods contain a large number of misclassified pixels (red areas), especially in Figure 10c, where many misclassified pixels appear in the thin cloud area. In contrast, SSCDnet works well on these images; its results show better consistency with the ground truth and fewer misclassified pixels than those of the competing methods. Overall, the results in Table 7 and Figure 10 show that the proposed SSCDnet achieves promising cloud detection performance on Landsat-8 OLI data.

Robustness Analysis
To evaluate the robustness of the proposed method, we conducted two series of experiments: (1) experiments on the same area in different seasons and (2) experiments on different land cover types.

Results on the Same Area under Different Seasons
In Figure 11, we present the cloud detection results for the same area in different seasons, i.e., spring, summer, autumn, and winter. Satellite images of this area include different land cover types, such as mountain, village, urban, water, ocean, and plant areas. In Figure 11, the overall accuracy (OA) of the spring, summer, autumn, and winter images is 98.21%, 97.70%, 95.79%, and 96.03%, respectively. These results show that the proposed semi-supervised cloud detection method SSCDnet achieves promising performance under different radiance conditions (i.e., the same area in different seasons). However, there is still a large number of misclassified pixels in the results. These misclassified pixels are located mostly near cloud object boundaries and in thin cloud areas, which is also a difficult problem for most CNN-based cloud detection works, such as [7,19,27,33-35].

Results on Different Land Cover Types
In Figure 12, we present the experimental results on twelve sub-images with different land cover types. The cloud detection results obtained by SSCDnet are consistent with the ground truth when the CNN model is trained with a limited number of labeled samples (labeled proportion of 1/10), except for some misclassified pixels located near cloud object boundaries and in thin cloud areas. Specifically, even for tough cases such as barren/desert areas (Figure 12a,b) and urban areas (Figure 12g,h), SSCDnet achieves promising performance. In addition, SSCDnet also performs well on water areas, such as lake (Figure 12i,j), river (Figure 12h), and ocean (Figure 12k,l) areas. Apart from snow, buildings, and some other white objects, few ground objects affect cloud detection, and SSCDnet easily obtains promising results in general cases, such as mountain/plant (Figure 12e,f) and farmland/village (Figure 12g,h) areas. In general, the results in Figure 12 demonstrate that SSCDnet has robust cloud detection performance on different land cover types.

Computational Complexity Analysis
To analyze the computational complexity of SSCDnet, we evaluate the networks with six criteria: floating point operations (FLOPs), number of trainable parameters, training time, training GPU memory usage, testing time, and testing GPU memory usage. During training, both the segmentation network and the discriminator network are trained simultaneously. The discriminators of the different SSL methods take raw input data with different numbers of channels, which results in different model parameters, computational complexity, training times, and GPU memory requirements. In Table 8, we can see that the FLOPs, number of parameters, and training GPU memory usage of SSCDnet are higher than those of the competing methods. This is because SSCDnet performs intermediate feature map domain adaptation alignments, while the competing methods have no such operation. Feature map alignment requires more computation and GPU memory, and the discriminator network for feature alignment further increases the number of training parameters. Nevertheless, the longest training time belongs not to SSCDnet but to Hung et al. [43].
During testing, only the segmentation network is needed to detect clouds; the discriminator network is not used. Since all of the methods use the same baseline segmentation network, i.e., DeepLabV2 [51], they share the same testing time and GPU memory usage. From Table 8, we can see that it takes about 400 s to detect 20 scenes of Landsat-8 OLI satellite imagery with an image size of 8 k × 8 k; in other words, it takes 20 s to detect one image. A total of 2849 MB of GPU memory is required to process an image of size 1200 × 1200. The code for computing computational complexity is available at https://github.com/sovrasov/flops-counter.pytorch (accessed on 24 April 2022). 1 GFLOPs = 1 × 10^9 FLOPs; 1 M = 1 × 10^6; 1 MB = 1 × 10^6 bytes.
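Per-layer complexity figures such as those in Table 8 can also be estimated analytically. A minimal sketch for a single 2D convolution layer follows, using the common multiply-accumulate (MAC) counting convention that per-layer FLOPs counters rely on; the layer sizes below are illustrative, not the actual SSCDnet configuration:

```python
# Analytic multiply-accumulate (MAC) count for a 2D convolution:
# each output position needs (c_in / groups) * k * k MACs per
# output channel. Layer sizes here are illustrative only.
def conv2d_macs(c_in, c_out, k, h_out, w_out, groups=1):
    return (c_in // groups) * k * k * c_out * h_out * w_out

# Example: a 3x3 conv, 64 -> 64 channels, on a 321x321 feature map
# (stride 1, padding 1, so the spatial size is preserved).
macs = conv2d_macs(64, 64, 3, 321, 321)
gmacs = macs / 1e9
print(round(gmacs, 2))  # → 3.8
```

Summing such per-layer counts over the whole network gives the GFLOPs/GMACs figures; note that some tools report MACs while others report 2 × MACs as FLOPs, so the convention should be checked before comparing numbers.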

Limitations
Although SSCDnet achieves promising cloud detection performance on both GF-1 WFV and Landsat-8 OLI data, there is still a large number of misclassified pixels in tough cases when using a limited number of labeled samples. Since urban and floating ice areas have a similar color or texture to clouds, and thin cloud objects differ little from the underlying ground objects, it is difficult for SSCDnet to handle these areas when using a limited number of labeled samples, as shown in Figure 13, where Figure 13a,b show the results on GF-1 WFV data (GF1_WFV2_W70.8_N19.2_20140801_L2A0000292230), while Figure 13c,d show the results on Landsat-8 OLI data (LC81180382014244LGN00).
The results in Figure 13a,b show that there are many misclassified pixels in urban and floating ice areas when the labeled sample proportion is 1/200 or 1/100. When the labeled sample proportion exceeds 1/40, the qualitative results remain basically stable and are consistent with the ground truth. The results in Figure 13c,d indicate that capturing sharp and detailed object boundaries in thin cloud areas is still very difficult, even when training with fully labeled samples. Adding a sufficient number of thin cloud samples to the training set may yield better detection performance; in future work, we will focus on this point to further improve the method.
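The double-threshold pseudo-labelling used for self-training (Figure 7) can be sketched as follows. The threshold values below are placeholders, not the paper's settings: pixels whose predicted cloud probability is confidently high become cloud pseudo labels, confidently low pixels become clear-sky pseudo labels, and the uncertain band in between is ignored during training:

```python
import numpy as np

# Placeholder thresholds (assumptions, not the paper's settings).
T_HIGH, T_LOW, IGNORE = 0.9, 0.1, 255

def pseudo_label(prob):
    # prob: per-pixel cloud probability from the segmentation network.
    label = np.full(prob.shape, IGNORE, dtype=np.uint8)
    label[prob >= T_HIGH] = 1   # trustworthy cloud pixels
    label[prob <= T_LOW] = 0    # trustworthy clear-sky pixels
    return label                # IGNORE pixels contribute no gradient

prob = np.array([[0.95, 0.50],
                 [0.05, 0.92]])
label = pseudo_label(prob)      # cloud=1, clear=0, ignored=255
```

Only the trustworthy (non-ignored) regions then supply supervision for the self-training loss, which is why misclassifications concentrate in the uncertain thin-cloud band that the thresholds exclude.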

Conclusions
Semi-supervised learning is an effective training strategy that can train a segmentation network using a limited number of pixel-wise labeled samples and a large number of unlabeled ones. In this paper, we present a semi-supervised cloud detection network, named SSCDnet. Since there are domain distribution gaps between the labeled and unlabeled datasets, we take the domain shift problem into account in the semi-supervised learning framework and propose a feature-/output-level domain adaptation strategy to reduce the domain distribution gaps, thus enabling SSCDnet to generate trustworthy pseudo labels for the unlabeled data. High-certainty pseudo labels provide positive supervision signals for segmentation network learning through self-training. Experimental results on the GF-1 WFV and Landsat-8 OLI datasets demonstrate that SSCDnet achieves promising performance using a limited number of labeled samples. It shows great promise for practical application to new satellite RS imagery when little labeled data is available.
Although SSCDnet shows good performance, there is still much room for improvement, such as the hyper-parameter settings of the loss function and the threshold settings for pseudo-labelling. Different cloud detection datasets have different domain distributions, so these parameters need to be updated to achieve promising performance on each dataset. In addition, different ground objects have different characteristics, and the performance of SSCDnet on detecting other objects also needs to be further evaluated. In future work, we will evaluate this method on other cloud detection datasets and other object detection tasks. SSCDnet also performs poorly on cloud boundaries and in thin cloud regions, which will require further effort to improve. We will also explore how to utilize auxiliary information, such as land use and land cover (LULC) maps, water index maps, and vegetation index maps, to improve cloud detection performance.
In general, the semi-supervised learning training strategy provides an effective way to detect clouds in RS images when little labeled data is available. In addition, this strategy may provide a promising approach to other object detection tasks, such as water, vegetation, and building detection.
To promote understanding of the paper's techniques, we have released the code of SSCDnet. It is available at https://github.com/nkszjx/SSCDnet (accessed on 24 April 2022).

Figure 3.
Figure 3. The detailed structure of the proposed SSCDnet.

Figure 4.
Figure 4. Structure of the proposed class-relevant feature alignment module. After obtaining these class-relevant features, we input them into the discriminators, as shown in Figure 4. For discriminator training, both discriminators D_crfa^(k,1) and D_crfa^(k,2) use the cross-entropy domain classification loss [44] as the objective function. They are defined as follows:

Figure 6.
Figure 6. Experimental results on an unlabeled remote sensing image: (a) input GF-1 WFV image; (b) result without domain adaptation; and (c) result with domain adaptation.

Figure 7.
Figure 7. The double-threshold pseudo-labelling method for self-training.

Figure 8 .
Figure 8. Structure of the proposed discriminator network.

Figure 10.
Figure 10. Comparison of cloud extraction results of different methods on the Landsat-8 OLI dataset with a labeled sample proportion of 1/200. The image IDs of (a), (b), and (c) are LC80650182013237LGN00, LC80430122014214LGN00, and LC81990402014267LGN00, respectively.

Table 6.
Cloud extraction accuracy (%) of different comparison networks on GF-1 WFV data. All results are averaged over all testing images.

Table 7.
Cloud extraction accuracy (%) of different comparison networks on Landsat-8 OLI data. All results are averaged over all testing images.

Table 8.
Computational complexity of the different competing methods. FLOPs are calculated for input data with an image size of 321 × 321. The training times of all methods are measured over 5000 iterations. Training GPU memory usage is measured with a batch size of 4 and an image size of 321 × 321. Testing time is measured by testing 20 scenes of Landsat-8 OLI satellite imagery with an image size of 8 k × 8 k. Testing GPU memory usage is measured with a batch size of 1 and an image size of 1200 × 1200.