Refined UNet: UNet-Based Refinement Network for Precise Cloud and Shadow Segmentation

Abstract: Formulated as a pixel-level labeling task, data-driven neural segmentation models for cloud and cloud shadow detection have achieved promising results in remote sensing imagery processing. The limited capability of these methods to delineate the boundaries of clouds and shadows, however, remains a central issue for precise cloud and shadow detection. In this paper, we focus on rough cloud and shadow localization and fine-grained boundary refinement on the Landsat 8 OLI dataset, and propose Refined UNet to achieve this goal. To this end, a data-driven UNet-based coarse prediction and a fully-connected conditional random field (Dense CRF) are concatenated to achieve precise detection. Specifically, a UNet with adaptive weights for balancing categories is trained from scratch to locate clouds and cloud shadows roughly, while the Dense CRF is employed to refine the cloud boundaries. Consequently, Refined UNet yields sharper and more precise cloud and shadow proposals. The experiments and results illustrate that our model proposes sharper and more precise cloud and shadow segmentation than the reference ground truths do. Additionally, evaluations on the Landsat 8 OLI imagery of Blue, Green, Red, and NIR bands illustrate that our model can also feasibly segment clouds and shadows on four-band data.


Introduction
Clouds and corresponding shadows contaminate remote sensing imagery, occlude the recognition of land cover, and eventually invalidate subsequent analysis. Cloud and cloud shadow detection, therefore, is essential for intelligent remote sensing imagery processing and interpretation. Currently, it is very challenging to precisely recognize clouds and corresponding shadows in a remote sensing image, even though rough localization based on spectral and spatial features has been well developed; this is mainly because the manually-developed solutions are highly dependent on the inherent features, which leads to segmenting clouds and shadows with conservative spectral thresholds instead of risking grouping pixels with low confidence. Accordingly, under- or over-segmentation (shrinkage or inflation) remains challenging in cloud and shadow segmentation.
Non-data-driven development of cloud and cloud shadow detection mainly focuses on three aspects of image features, namely spatial and spectral tests, temporal differentiation methods, and statistical methods [1], in which the spatial and spectral features are mainly taken into consideration. Recently, data-driven methods [2][3][4] have thrived because of abundant labeled training samples and adaptive feature extraction, which enables the automatic discovery of typical cloud and cloud shadow features and automatic detection. In particular, CNN-based models [5][6][7][8] utilize learnable feature extractors to adaptively learn features within images and map them to label predictions.

Figure 1. Examples of Refined UNet for cloud and cloud shadow segmentation (each group shows a false-color image and the UNet ×α + refining result). It is observed that Refined UNet delineates the boundaries of clouds and shadows more sharply and precisely, which overcomes the inflation of the given ground truths.
The main contributions of this paper are listed as follows:
• Refined UNet: We propose an architecture assembling UNet and Dense CRF to detect clouds and shadows and refine their corresponding boundaries. The proper utilization of the Dense CRF refinement sharpens the detected cloud and shadow boundaries.
• Adaptive weights for imbalanced categories: An adaptive weight strategy for imbalanced categories is employed in training, which dynamically calculates the weights and enhances the model's attention to minority classes.
• Extension to four-band segmentation: The segmentation efficacy of our Refined UNet was also tested on the Landsat 8 OLI imagery of Blue, Green, Red, and NIR bands; the experimental results illustrate that our method obtains feasible segmentation results as well.
The rest of the paper is organized as follows. Section 2 investigates and presents related work regarding cloud and cloud shadow detection and neural semantic segmentation. The proposed Refined UNet for cloud and shadow detection is described in Section 3. Section 4 presents the test Landsat 8 OLI dataset, implementation details, and experiments for evaluation; it also illustrates experimental results qualitatively. Section 5 concludes this paper.

Related Work
We summarize the related work from two aspects: manual cloud and shadow segmentation in Section 2.1 and state-of-the-art neural semantic segmentation in Section 2.2.

Cloud and Shadow Segmentation
In terms of different perspectives on intermediate spectral features of remote sensing imagery, manually-developed cloud and cloud shadow segmentation can be grouped into three categories: spectral tests, temporal differentiation, and statistical methods [1]. Observing the distribution of spectral data, thresholds within finite ranges were used to detect clouds and shadows [9][10][11][12]. CFMask [13,14] explored the spectral features comprehensively and provided a benchmark for cloud and shadow detection. Temporal differentiation methods [15][16][17] observed the movement of dynamic clouds and shadows, detecting them according to differences between images. Exploiting the statistics of spatial and spectral features, statistical methods [18,19] formulated the detection of cloud and shadow areas as a pixel-wise classification problem, which is highly relevant to data-driven methods. In this case, however, accurate and precise labels must be given so that the statistical model can fit the distribution of clouds and shadows. Recently, as data-driven methods thrived in semantic segmentation tasks on natural images, it has been noted that cloud and shadow detection can be formulated as a semantic segmentation problem and solved by a CNN-based pixel-wise classification model [1]; this is the main inspiration for the formulation of our task as well.

State-of-the-Art Neural Semantic Segmentation
Dense classification tasks, i.e., semantic segmentation tasks, aim to semantically group the pixels of an image into categories, in which the pixels of a potential object should be classified into one category. High-level vision tasks (image classification, object detection, etc.) comprehend high-level semantic information, whereas low-level vision tasks provide a basis for fine-grained image understanding. Accordingly, some representative and state-of-the-art methods are summarized as follows.
Classifiers of natural image segmentation tasks recognize natural objects and classify pixels accordingly: they take natural images as input and ultimately aim to output labeled predictions. These classifiers are seldom trained from scratch; they, alternatively, finetune feature extractors or other components of widely-used pretrained neural classifiers as the backbone networks. Typical backbones include VGG-16/VGG-19 [2], MobileNets V1/V2/V3 [20][21][22], ResNet18/ResNet50/ResNet101 [3,23], DenseNet [4], etc. The aforementioned backbone networks have demonstrated their striking performance in image classification tasks because of delicate feature extractor designing, which can effectively be transferred into the segmentation tasks as well.
Based on these backbone networks, neural semantic segmentation networks have significantly pushed the performance of pixel-level annotation tasks. Fully convolutional networks (FCN) [5] substituted fully-connected layers with convolutional layers, which can adaptively segment images with arbitrary sizes. U-Net [6] introduced intermediate feature fusion by concatenating multi-level feature maps with the same dimensions via shortcut connections, which popularized the reuse of features in image segmentation tasks. SegNet [7,8] inherited the encoder-decoder architecture and was applied to efficient scene understanding applications. Jegou et al. [24] extended DenseNet [4] into semantic segmentation problems due to its excellent performance on image classification tasks. FastFCN [25] used Joint Pyramid Upsampling to reduce computation complexity.
Explorations of feature mechanisms and data distributions have been developed in depth: methods based on dilated convolution balance the trade-off between larger receptive fields and kernel sizes, implementing multi-scale sparse subsampling with small kernels and different dilation rates. Yu et al. [26] proposed a method of multi-scale contextual aggregation using dilated convolutions. RefineNet [27] exploited fine-grained features to reinforce high-resolution classification by building long-range residual connections (identity mappings). PSPNet [28] aggregated global feature representations using a Pyramid Pooling Module to segment images. Peng et al. [29] suggested that large kernels matter in classification and localization tasks simultaneously, and accordingly proposed a global convolutional network to address the mentioned issues. UPerNet [30] was proposed to discover rich visual knowledge and parse multiple visual concepts at once. HRNet [31] aggregated features from all the parallel convolutions instead of only the high-resolution convolutions, leading to stronger feature representations. Gated-SCNN [32] built a two-stream segmentation classifier using a side branch of dedicated shape processing. Papandreou et al. [33] applied EM [34] to weakly- and semi-supervised learning for neural semantic segmentation.
DeepLab initiated a series of segmentation methods along with the development of the mentioned methods. Using atrous convolution and a CRF, DeepLab V1 [35] initiated a pipeline aggregating rough classification and boundary refinement, and DeepLab V2 [36] further improved the performance. DeepLab V3 [37] discarded the CRF postprocessing to improve segmentation performance. DeepLab V3+ [38] applied the depthwise separable convolution from Xception [39] to the atrous spatial pyramid pooling modules and the decoder, promoting both efficiency and robustness.
Additionally, modeling segmentation as a probabilistic graphical model is gradually becoming a novel trend under the condition of CNN extracting high-level visual features. CRFasRNN [40] formulated CRF implementation as an RNN-based layer, which achieved an end-to-end training and inference of neural network predicting and CRF refining in natural image segmentation tasks. Deep parsing network [41] addressed the semantic segmentation task by modeling unary terms and pairwise terms from CNN and approximation of mean-field of additional layers, respectively, yielding a striking performance on PASCAL VOC 2012. Moreover, a combination of Gaussian Conditional Random Field (G-CRF) and deep learning architecture [42] is proposed to address the structured prediction, which inherited several merits including a unique global optimum, end-to-end training, and self-discovered pairwise terms.
Segmentation methods have carried out comprehensive exploration of semantic object localization, and have achieved promising performance on the dense classification tasks. The lower-level issues, however, should be concentrated carefully: splitting objects along with a precise boundary remains challenging, especially in remote sensing data. Consequently, we rethink the drawbacks of cloud and shadow detection and focus on the boundary prediction, which drives us to establish a dedicated model from scratch.

Methodology
In this section, we present the proposed Refined UNet in three subsections: The UNet architecture is introduced in Section 3.1 and the postprocessing of fully-connected conditional random field is presented in Section 3.2. The concatenation of UNet prediction and Dense CRF refinement is introduced in Section 3.3, which is also an overall framework. The entire pipeline of our method is illustrated in Figure 2.

UNet Prediction
UNet has been referred to as an effective structure in image segmentation tasks. Given an image in which each pixel is grouped into a specific category, the UNet architecture hierarchically extracts low-level features and recombines them into higher-level features in the encoder, while it performs the element-wise classification from multiple features in the decoder. Driven by the weighted cross-entropy loss function, UNet gradually secures the learnable parameters in the feature extractors and infers an output closer to the ground truth. The encoder-decoder architecture of UNet is illustrated in Figure 3, in which down-sampling blocks of "Conv-ReLU-MaxPooling" are employed to extract features and up-sampling blocks of "UpSample-Conv-ReLU" are employed to infer the segmentation at the same resolution. To clarify the use of the UNet architecture, a mathematical formulation of learning and inference is given as follows. In the learning phase, given the N-band input x that denotes a multi-band remote sensing image, UNet f_UNet outputs the logits ŷ with respect to x, in which ŷ denotes the corresponding pixel-wise likelihood.
The convolutional operator * filters the multi-band input or intermediate feature maps to generate multi-level features within the N layers of f_UNet, in which each element φ^l_{p,q,k} of the feature map at layer l is calculated in Equation (2).
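Equation (2) is not reproduced in this copy; a plausible reconstruction consistent with the surrounding description is given below (the kernel extent (2a+1)×(2b+1) and the bias term are assumptions):

```latex
% Sketch of Equation (2): one element of the layer-l feature map,
% produced by convolving the layer-(l-1) maps with kernel w^l_k
% and applying ReLU.
\phi^{l}_{p,q,k} = \mathrm{ReLU}\!\left(
  \sum_{c} \sum_{u=-a}^{a} \sum_{v=-b}^{b}
    w^{l}_{k,c,u,v}\, \phi^{l-1}_{p+u,\,q+v,\,c} + b^{l}_{k}
\right)
```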
Following convolutional layers, MaxPooling layers are used to enlarge the receptive field so that high-level features can be captured comprehensively.
f_UNet(·) fuses intermediate feature maps of the same size by concatenation, as in Equation (3).
In our study, a weighted multi-class cross-entropy loss function with an adaptive categorical weight vector α is proposed to push the network to pay more attention to minority categories. Specifically, α_i ∈ α is proportional to the ratio of the total pixel count M to the count M_i of category i, so that minority categories receive higher weights. The loss function is calculated in Equation (4).
where y denotes the one-hot label vector, ŷ is the prediction of f_UNet with respect to input x, and α is the adaptive weight vector of the categories. Each element of α is calculated dynamically in Equation (6), where α_i denotes the adaptive weight of category i and M_i is the total count of category i. In the optimization, gradient descent is used to optimize the learnable parameters of UNet, more specifically, the kernels of the convolutional layers. In particular, the derivative of the loss function with respect to the output ŷ is calculated in Equation (7).
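Equations (4), (6), and (7) are not reproduced in this copy; a mutually consistent reconstruction is given below (the normalization constant 1/k in α_i is an assumption):

```latex
% Eq. (4): weighted multi-class cross-entropy over pixels j and classes i
\mathcal{L}(y, \hat{y}) = -\sum_{j} \sum_{i} \alpha_i\, y_{j,i} \log \hat{y}_{j,i}

% Eq. (6): adaptive weight of category i, proportional to the inverse
% class frequency M_i / M
\alpha_i = \frac{M}{k\, M_i}

% Eq. (7): derivative of the loss with respect to the prediction
\frac{\partial \mathcal{L}}{\partial \hat{y}_{j,i}} = -\frac{\alpha_i\, y_{j,i}}{\hat{y}_{j,i}}
```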
In the inference phase, UNet outputs a segmentation proposal of size p × q × k, indicating that each of the p × q pixels has likelihoods over k categories; taking the per-pixel maximum over the k likelihoods yields the element-wise classification result.

Fully-Connected Conditional Random Field (Dense CRF) Postprocessing
Generally speaking, UNet can reliably sense the existence of clouds and cloud shadows and roughly localize them. The boundaries of clouds, however, cannot be precisely pinpointed by UNet.
The reason for vague boundary segmentation is speculated as follows: multiple max-pooling layers enlarge the receptive field of the neural network, which effectively improves the extraction of high-level features (i.e., semantic information) and helps high-level vision tasks such as image classification. However, multiple max-pooling layers introduce too much invariance for low-level vision tasks, which is detrimental to exact boundary detection in cloud segmentation [35]. UNet is still affected in fine-grained segmentation even though the concatenations attempt to alleviate the lack of high-resolution features. Considering these disadvantages of the UNet prediction, the postprocessing of the fully-connected conditional random field (Dense CRF) is employed to refine exact cloud boundaries.
The cloud and shadow refinement of Dense CRF is formulated as follows. Element-wise classification (X, I) can be formulated as a conditional random field (CRF) characterized by a Gibbs distribution, defined in Equation (9).
in which E(x) denotes the Gibbs energy, G = (V, E) the graph, V = {X_1, X_2, ..., X_N} the element-wise class variables, I the global observation (image), and Z(I) the normalization term guaranteeing a valid probability.
In the Dense CRF, the corresponding Gibbs energy function is defined in Equation (10).
in which x denotes the label assignment of all pixels, ψ_u the unary potential, and ψ_p the pairwise potential. The unary potential ψ_u(x_i) is given by the UNet outputs, while the pairwise potential ψ_p(x_i, x_j) is defined in Equation (11).
in which µ(x_i, x_j) denotes the label compatibility in Dense CRF, and f_i and f_j the feature vectors. In our case, the Potts model µ(x_i, x_j) = [x_i ≠ x_j] is used as the label compatibility.
Contrast-sensitive two-kernel potentials [43] are used to capture the connectivity of two nearby pixels with similar spectral features and eliminate the isolated regions, defined in Equation (12).
in which p_i, p_j denote the positions and I_i, I_j the spectral features of pixels i and j. The spectral features I_i and I_j consist of false-color Bands 5, 4, and 3. Note that θ_α, θ_β, and θ_γ are three key hyperparameters controlling the degrees of connectivity and similarity, and they significantly affect the performance of the refinement.
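Equations (9)-(12) are not reproduced in this copy; they can be reconstructed from the fully-connected CRF formulation of Krähenbühl and Koltun [43] as:

```latex
% Eq. (9): Gibbs distribution of the CRF
P(\mathbf{x} \mid I) = \frac{1}{Z(I)} \exp\bigl(-E(\mathbf{x})\bigr)

% Eq. (10): Gibbs energy with unary and pairwise terms
E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)

% Eq. (11): pairwise potential with label compatibility \mu
\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)

% Eq. (12): contrast-sensitive two-kernel potentials
% (appearance kernel with \theta_\alpha, \theta_\beta; smoothness kernel with \theta_\gamma)
k(\mathbf{f}_i, \mathbf{f}_j) =
  w^{(1)} \exp\!\Bigl(-\frac{\lVert p_i - p_j\rVert^2}{2\theta_\alpha^2}
                      -\frac{\lVert I_i - I_j\rVert^2}{2\theta_\beta^2}\Bigr)
+ w^{(2)} \exp\!\Bigl(-\frac{\lVert p_i - p_j\rVert^2}{2\theta_\gamma^2}\Bigr)
```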
In the inference phase, the Dense CRF infers the most likely assignment x̂ = arg max_x P(x | I). An efficient solution to the Dense CRF has been provided in [43], in which the approximate inference of an iterative message-passing algorithm estimates the CRF distribution. The solution enables Dense CRF inference in linear time, which results in the efficient use of Dense CRF in segmentation tasks.
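To make the mean-field approximation concrete, the following is a didactic brute-force sketch in pure NumPy (O(N²) per iteration; the function name, the unit kernel weights w1 and w2, and the iteration count are our assumptions — the efficient algorithm of [43] replaces the explicit N × N kernel with fast high-dimensional filtering):

```python
import numpy as np

def dense_crf_mean_field(unary, pos, feat, theta_a=80.0, theta_b=13.0,
                         theta_g=3.0, w1=1.0, w2=1.0, n_iters=5):
    """Naive mean-field inference for a fully-connected CRF.

    unary: (N, K) negative log-likelihoods from the classifier (UNet).
    pos:   (N, 2) pixel coordinates; feat: (N, C) spectral features.
    Returns the (N,) argmax labeling after n_iters updates.
    """
    N, K = unary.shape
    # Pairwise kernel: appearance term (position + spectrum, theta_a/theta_b)
    # plus smoothness term (position only, theta_g), as in Equation (12).
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    f2 = ((feat[:, None, :] - feat[None, :, :]) ** 2).sum(-1)
    kern = (w1 * np.exp(-d2 / (2 * theta_a**2) - f2 / (2 * theta_b**2))
            + w2 * np.exp(-d2 / (2 * theta_g**2)))
    np.fill_diagonal(kern, 0.0)  # exclude self-connections

    Q = np.exp(-unary)
    Q /= Q.sum(1, keepdims=True)  # initialize with softmax of the unaries
    for _ in range(n_iters):
        msg = kern @ Q  # message passing from all other pixels
        # Potts compatibility mu(x_i, x_j) = [x_i != x_j]: the penalty for
        # label l is the kernel-weighted mass of *other* labels.
        pairwise = msg.sum(1, keepdims=True) - msg
        Q = np.exp(-unary - pairwise)
        Q /= Q.sum(1, keepdims=True)
    return Q.argmax(1)
```

With two well-separated pixel clusters, an ambiguous pixel is pulled toward the label of its spatially and spectrally similar neighbor, which is exactly the boundary-refining behavior exploited here.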

Concatenation of UNet Prediction and Dense CRF Refinement
The overall framework of our Refined UNet is described as follows. The large size of an entire high-resolution remote sensing image discourages direct UNet prediction; predicting patch by patch, therefore, is a practical solution. Cropped into and reconstructed from tiles, the multi-band remote sensing image is transformed into a segmentation proposal by UNet. Afterward, Dense CRF processes the entire image, which improves the prediction coherency at the edges of tiles and eliminates isolated regions. Specifically, the concatenation of UNet prediction and Dense CRF refinement proceeds as follows:
• The entire image is rescaled, padded, and cropped into patches of size w_crop × h_crop. The trained UNet infers the pixel-level categories for the patches, and the rough segmentation proposal is reconstructed from the results.
• Taking as input the entire UNet proposal and a three-channel edge-sensitive image, Dense CRF refines the segmentation proposal to make the boundaries of clouds and shadows more precise.
We observed the efficacy of patch-wise UNet prediction and Dense CRF refinement in the experiments.
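The two steps above can be sketched as follows (a sketch: `predict_patch` and `refine` are hypothetical callables standing in for the trained UNet and the Dense CRF, and padding only on the bottom/right is a simplification of the symmetric padding described in Section 4.1):

```python
import numpy as np

def segment_scene(image, predict_patch, refine, crop=512):
    """Patch-wise coarse prediction followed by whole-image refinement.

    image:         (H, W, C) multi-band scene.
    predict_patch: maps a (crop, crop, C) patch to (crop, crop) labels.
    refine:        maps (labels, image) to refined labels (e.g., Dense CRF).
    """
    H, W, C = image.shape
    # Pad with zeros so that both dimensions are multiples of the crop size.
    ph, pw = (-H) % crop, (-W) % crop
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)))
    Hp, Wp = padded.shape[:2]

    # Tile-by-tile coarse prediction, reassembled into one proposal.
    proposal = np.zeros((Hp, Wp), dtype=np.int64)
    for i in range(0, Hp, crop):
        for j in range(0, Wp, crop):
            proposal[i:i+crop, j:j+crop] = predict_patch(
                padded[i:i+crop, j:j+crop])

    # Whole-image refinement seals tile edges and sharpens boundaries.
    refined = refine(proposal, padded)
    return refined[:H, :W]  # drop the padding
```

Because the refinement sees the entire stitched proposal at once, it can smooth inconsistencies along the tile borders that a purely patch-wise pipeline would leave visible.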

Experiments and Discussion
Experiments were conducted to evaluate the results of our Refined UNet compared to references; ablation studies were conducted to verify the efficacy of each component as well.
Experimental data acquisition, implementation details, and evaluation metrics are briefly introduced in Sections 4.1, 4.2, and 4.3, respectively. In Section 4.4, Refined UNet and novel methods are compared and evaluated qualitatively and quantitatively. In Section 4.5, the outputs of Refined UNet and references are visually compared, in which the superiority of boundary refinement is illustrated. In Section 4.6, the refinement of Dense CRF is evaluated against vanilla UNet predictions. In Section 4.7, some key hyperparameters are examined to show their effect on the segmentation performance. In Section 4.8, the effect of adaptive weights for imbalanced categories is evaluated against fixed weights. In Section 4.9, cross-validation on the four-year dataset is used to explore the performance consistency. At last, evaluations on four-band imagery and comparisons are conducted in Section 4.10.

Experimental Data Acquisition and Preprocessing
In the experiments, Landsat 8 OLI imagery data [11] were employed to train, validate, and test the performance of our Refined UNet. We chose images from the years 2013, 2014, and 2015 and split them into the training set and validation set. Images from 2016 were chosen as the test data for visualization and numerical evaluation. Cloud and shadow labels were generated from the Pixel Quality Assessment band, in which the clouds and shadows with confidence levels were derived from the CFMask algorithm. Practically, clouds and shadows with high confidence were marked while those with low confidence were excluded. Class IDs of background, fill values, shadows, and clouds are 0, 1, 2, and 3, respectively; we merged the classes of land, snow, and water into the background because the segmentation of clouds and cloud shadows is the key issue under discussion. The labels are referred to as references instead of ground truths because they are dilated and not accurate enough at the pixel level. All seven bands were merged as default inputs, as illustrated in Figure 4. For visual evaluation, Bands 5 NIR, 4 Red, and 3 Green were combined as RGB channels to construct a false-color image, and the Linear 2% algorithm was performed on the false-color images to enhance contrast and visualization. The false-color images were also used as the inputs of Dense CRF because of their sufficient contrast and evident edges. Additionally, Bands 2 Blue, 3 Green, 4 Red, and 5 NIR in the Landsat 8 OLI data were chosen to compose the four-band inputs, whose segmentation performance we assessed against seven-band segmentation. In the preprocessing, images were first padded for slicing; zeros were assigned to fill values and surrounding padded values. The padded sizes were calculated using Equations (13)-(16), respectively, where w_l, w_r, h_u, and h_d denote the left, right, up, and down padding widths and heights.
After padding, we cropped raw image data into 512 × 512 patches for training, validation, or test.
in which ε is 10^-10 to avoid division by zero.
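Since Equations (13)-(16) are not reproduced in this copy, the following is a plausible reconstruction of the padding computation together with the ε-guarded normalization (the even split of the pad between the two sides and the min-max form of the normalization are assumptions):

```python
def padding_sizes(w, h, crop=512):
    """Padding so that width and height become multiples of crop.

    Returns (w_l, w_r, h_u, h_d): left/right pad widths and up/down pad
    heights, split evenly between the two sides (reconstruction of
    Equations (13)-(16); the even split is an assumption).
    """
    pad_w = (-w) % crop
    pad_h = (-h) % crop
    w_l, w_r = pad_w // 2, pad_w - pad_w // 2
    h_u, h_d = pad_h // 2, pad_h - pad_h // 2
    return w_l, w_r, h_u, h_d

def normalize(band, eps=1e-10):
    """Min-max normalization with eps = 1e-10 guarding the denominator."""
    lo, hi = min(band), max(band)
    return [(v - lo) / (hi - lo + eps) for v in band]
```

For a typical Landsat scene of 7000 × 7900 pixels, this yields pads of (84, 84) and (146, 146), giving padded dimensions of 7168 and 8192, both multiples of 512.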

Implementation Details
The UNet model is composed of four "Conv-BN-ReLU" components for down-sampling and four "UpSample-Conv-BN-ReLU" components for up-sampling. The model was trained from scratch on the training set, taking seven- or four-band images as input and outputting class IDs 0-3 to label each pixel. It was optimized by the ADAM [44] optimizer, in which β_1, β_2, and the learning rate were 0.9, 0.999, and 0.001, respectively.
As the postprocessing, Dense CRF took as input both the entire false-color images and the categorical proposals reconstructed from the UNet results, and transformed them into refined predictions. Empirically, the default θ_α, θ_β, and θ_γ were 80, 13, and 3. We further conducted subsequent experiments to thoroughly test the effect of Dense CRF with regard to these hyperparameters.

Evaluation Metrics
In our four-class pixel-level classification task, precision P, recall R, and F1 score F_1 were utilized to evaluate the efficacy and sensitivity of cloud and shadow detection. Considering the confusion matrix P_cm = [p_ij]_{4×4}, i, j ∈ {0, 1, 2, 3}, in which p_ij denotes the number of observations that actually belong to group i and are predicted as group j, precision reports how many pixels in the prediction are correctly retrieved, defined in Equation (18); recall reports how comprehensively the method retrieves the pixels of a specified class, defined in Equation (19); and F1 score is a numerical assessment taking both precision and recall into consideration, defined in Equation (20).
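The metrics of Equations (18)-(20) can be computed from the confusion matrix as follows (a sketch; F1 is implemented as the standard harmonic mean of precision and recall):

```python
def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a K x K confusion matrix.

    cm[i][j] = number of pixels of true class i predicted as class j.
    Precision_i = cm[i][i] / column-i sum; Recall_i = cm[i][i] / row-i sum;
    F1_i is the harmonic mean of the two (Equations (18)-(20)).
    """
    k = len(cm)
    precision, recall, f1 = [], [], []
    for i in range(k):
        col = sum(cm[r][i] for r in range(k))
        row = sum(cm[i])
        p = cm[i][i] / col if col else 0.0
        r = cm[i][i] / row if row else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
    return precision, recall, f1
```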
In addition, Wilcoxon signed-rank test [45] was used to test if the differences between the two methods are significant.

Comparisons of Refined UNet and Novel Methods
We first compared our Refined UNet to its backbone UNet [6], which is widely exploited in natural image segmentation. Besides, the novel PSPNet [28] with ResNet-50 as the backbone was retrained from scratch on the training set, and its results are also taken into consideration. The same strategy of adaptive weights for imbalanced categories was used in training these methods. Qualitative and quantitative results are presented in Figure 5 and Table 1. Figure 5 shows the visualization results of PSPNet, UNet, and Refined UNet. It can be seen that our Refined UNet outperforms PSPNet in terms of visual detection of clouds and shadows: some clouds and shadows are missing in the detection of PSPNet, whereas UNet over-detects clouds and shadows. Refined UNet overcomes the drawback of over-detection and delineates the boundaries of clouds and shadows more precisely than UNet. The cutting edges of tiles, on the other hand, are also neutralized in the results of Refined UNet, while the gaps of PSPNet are not properly sealed. In summary, our Refined UNet can effectively label rough clouds and shadows and refine their boundaries more precisely. Table 1 shows the quantitative assessments with respect to PSPNet, UNet, and Refined UNet. Precision P_i assesses how many pixels the method correctly detects in its prediction of class i, while recall R_i indicates how many pixels the method sensitively captures among all pixels of a specified class i. F1 score F1_i takes both specificity and sensitivity into consideration as the harmonic mean of P_i and R_i. In the detection of clouds and shadows, Refined UNet balances precisions and recalls, while PSPNet only achieves superior precisions due to its negligence of clouds and shadows with low confidence. It is concluded that Refined UNet achieves the best balance of precision and recall in the precise detection of clouds and shadows.

Comparisons of References and Refined UNet
Next, we report the segmentation results and compare them to the references from qualitative and quantitative perspectives. Figure 6 illustrates the false-color visualizations, the segmentation references, the results of Refined UNet, and the differences between them. We can generally conclude that our method detects clouds comprehensively and precisely: in the visual assessment, almost all cloud pixels are detected correctly. Clouds and shadows are considerably retrieved by Refined UNet, especially the interior pixels of clouds and shadows. Sharper boundaries of clouds and shadows are delineated, and the pixels indicating differences are highlighted on the boundaries of clouds and shadows, which illustrates the effect of the Dense CRF refinement. In terms of these results, one of the merits of Refined UNet is thus concluded: Refined UNet detects almost all clouds and shadows with high confidence and refines the cloud boundaries, as highlighted in the difference visualization. We attribute this superiority to the fact that the UNet model roughly locates clouds and shadows while Dense CRF detects the explicit boundaries, which generates accurate and refined results.
Nevertheless, the drawback of the refinement cannot be totally ignored: Refined UNet might over-refine the boundaries of shadows, which leads to missing some shadows. The difference images illustrate that, in some cases, the Dense CRF refinement is so strong that it inevitably erases some weak shadows, which shows its aggressiveness. In fact, this appears to be a trade-off between the specificity and sensitivity of the model, and, in our case, precision is the first priority.
We further evaluated locally by zooming into some areas and observing the rough location and refinement of the detection. Figure 7 visually confirms the superiority of refining the boundaries of clouds and shadows; in its visualization, Bands 5 NIR, 4 Red, and 3 Green are combined as RGB channels to construct the false-color images, and we mark the differences by red pixels for clouds and green for shadows. Combining the entire and local visual assessments, we conclude that our Refined UNet can accurately locate clouds and shadows and precisely capture their boundaries.
We also evaluated our method from the quantitative perspective, employing precision, recall, and F1 score as defined in Section 4.3. Before evaluating, we hypothesized that precisions would be higher while recalls would be lower, given the nature of these indicators and the dilated references. Table 2 confirms our hypothesis.
In Table 2, the average precisions of background, fill values, and shadows are higher, while those of clouds are slightly lower. We attribute the higher precisions to the Dense CRF refinement: it dramatically purifies the detection of shadows. The lower precision of clouds with high standard deviations may be caused by the misclassification of snow pixels, which strongly affects the performance of cloud detection. We will further investigate the differentiation of cloud and snow pixels to promote precisions. In the corresponding figures, Bands 5 NIR, 4 Red, and 3 Green are combined as RGB channels to construct false-color images for visualization. Visually, Refined UNet obtains more precise contours of clouds and shadows compared to the references, which leads to finer detection results. Some patches of shadows, however, might be eliminated due to over-refinement, which should be further considered and solved in future work.

Effect of the Dense CRF Refinement
An ablation study on the Dense CRF was conducted to test its effect. Dense CRF focuses on splitting along boundaries so that it can obtain a finer segmentation result. In addition to refining the contours precisely, Dense CRF eliminates isolated predictions (misclassification noise) and smooths the gaps between slices. Figures 8 and 9 qualitatively show the results with and without Dense CRF refinement. As shown in the figures, the boundaries of clouds and shadows are refined, and the isolated misclassification regions and slicing gaps are removed as well, which demonstrates the superiority of our Refined UNet. We also note that the strong Dense CRF might erase some small shadow patches with vague boundaries or some plausible shadow patches, which should be solved in future work.
Figure 9. Comparisons of segmentations with or without Dense CRF refinement in local areas (L8, Path 113, Row 26). From left to right: false-color images, results of UNet ×α, and UNet ×α + Refinement. Bands 5 NIR, 4 Red, and 3 Green are combined as RGB channels to construct the false-color images. In local areas, it is confirmed that the refinement of Dense CRF precisely delineates the contours of clouds and shadows; in addition, it removes isolated classification errors and smooths the gaps caused by slice-wise processing.

Hyperparameter Sensitivity with Respect to Dense CRF
We examined the performance of the Dense CRF postprocessing by varying the spatial and spectral ranges θ_α, θ_β, and θ_γ in the appearance and smoothness kernels, as shown in Figures 10-12. According to Krähenbühl and Koltun [43], a proper θ_γ yields only a slight visual improvement, which is visually demonstrated by Figure 12. Higher θ_α and θ_β, on the other hand, provide more visual improvement and remove more isolated regions; however, they can also over-refine the cloud and shadow regions. In summary, these parameters should be learned using more accurately labeled data or controlled manually.
Figure 12. Visualizations with regard to θ_γ of the Dense CRF postprocessing. The candidate values of θ_γ vary from 1 to 9 while θ_α and θ_β are fixed to 91 and 11. Hardly any significant visual improvement is observed as θ_γ varies.

Effect of the Adaptive Weights Regarding Imbalanced Categories
Adaptive weights with regard to imbalanced categories were employed to promote the performance of cloud and shadow detection. Observing imagery data across the whole year, clouds and shadows may be minority classes in summer and autumn, so the training samples need to be balanced dynamically; the adaptive weights serve this purpose. Figures 13 and 14 compare segmentation results with fixed and adaptive weights. Fixed weights drive UNet to predict more shadows, but the model tends to over-detect: it captures pixels that should not be grouped into the shadow category. Note that more isolated shadow pixels are also detected in this case. Adaptive weights adjust the prediction dynamically, fit the distribution of cloud and shadow pixels, and push the model to classify properly. In terms of visual assessment, we conclude that our method performs well in finely detecting clouds and shadows.
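The paper does not spell out its weighting formula here, but one common scheme that matches the described behavior is inverse-frequency weighting recomputed from the current data, plugged into a weighted cross-entropy loss. A sketch under that assumption (function names are ours):

```python
import numpy as np

def adaptive_class_weights(labels, n_classes, eps=1e-6):
    """Inverse-frequency weights recomputed from the current batch.

    A rare class (e.g. shadows in summer scenes) receives a large
    weight, so its pixels contribute more to the loss.
    """
    counts = np.bincount(labels.ravel(), minlength=n_classes).astype(float)
    freq = counts / counts.sum()
    weights = 1.0 / (freq + eps)
    return weights / weights.sum() * n_classes  # normalize to mean one

def weighted_cross_entropy(probs, labels, weights):
    # probs: (N, L) softmax outputs; labels: (N,) integer classes.
    n = labels.shape[0]
    picked = probs[np.arange(n), labels]
    return -(weights[labels] * np.log(picked + 1e-8)).mean()
```

Because the weights track the seasonal class distribution, scenes where shadows shrink to a small minority still produce a meaningful gradient for the shadow class, unlike a fixed weighting chosen once for the whole year.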
Quantitative assessments further demonstrate the superiority of our method. The F1 score was used since it considers both precision and recall. In Tables 2 and 3, the UNet with adaptive weights significantly outperforms the models with fixed weights in terms of F1 scores, which also supports the conclusion of the qualitative assessments.
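The per-class precision, recall, and F1 scores reported in the tables can be computed directly from the predicted and reference label maps; a minimal sketch:

```python
import numpy as np

def per_class_f1(pred, truth, n_classes):
    """Per-class (precision, recall, F1) from flattened label maps."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (truth == c))
        fp = np.sum((pred == c) & (truth != c))
        fn = np.sum((pred != c) & (truth == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores
```

The harmonic mean in F1 penalizes a model that trades recall for precision (or vice versa), which is why it is preferred over accuracy for the heavily imbalanced shadow class.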

Cross-Validation over the Entire Dataset
We further evaluated the performance consistency of our Refined UNet via cross-validation over the image set of each year; for all the images used above, five images from each year were selected. For the four cross-validations, the images of two years were used as the training set, one year as the validation set, and the remaining year as the test set. The quantitative results are reported in Table 4. The accuracy, precision, recall, and F1 scores demonstrate the performance consistency of our Refined UNet: the model labels pixels of background, fill values, and clouds well in terms of precision. Labeling shadow pixels, however, still needs improvement, as is the case for many detection algorithms.
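The year-rotating split described above can be sketched as follows; the exact assignment of years to folds in the paper is not specified, so the rotation order here is an assumption:

```python
def yearly_splits(years):
    """Rotate years into train/val/test folds.

    With four years, each fold uses two years for training, one for
    validation, and the remaining year for testing, yielding the four
    cross-validations described above.
    """
    n = len(years)
    folds = []
    for i in range(n):
        test = years[i]
        val = years[(i + 1) % n]
        train = [y for y in years if y not in (test, val)]
        folds.append({"train": train, "val": val, "test": test})
    return folds
```

Splitting by year rather than by random scene keeps seasonally correlated images out of both the training and test sets at once, which gives a more honest estimate of cross-year consistency.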

Evaluation on Four-Band Imagery
We assessed the segmentation performance of our Refined UNet on the four-band imagery dataset, constructed from Bands 2 Blue, 3 Green, 4 Red, and 5 NIR. Qualitative and quantitative results are shown in Figures 15 and 16 and Table 5, respectively. In the experimental results, the performance on the four- and seven-band data differs in terms of visual assessment: visual differences are easily perceived, especially for shadow detection. We further verified the performance quantitatively; the quantitative assessment is shown in Table 5. In addition to the visual differences, the numerical differences in shadow detection are significant in terms of F1 score, which supports the observation of the visual assessments. We speculate that some of the missing bands play a key role in detecting cloud shadows. Conversely, the differences in cloud detection are weak in terms of F1 score, so we conclude that the model can feasibly be applied to four-band cloud segmentation tasks. The causes of the significant differences will be explored in future work.

Table 5. Average scores of accuracy, precision, recall, and F1 in the comparison between the four- and seven-band models; * highlights significant differences (p-value < 0.05) in the Wilcoxon signed-rank test. The top results are highlighted in bold.
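The significance marks in Table 5 come from a paired Wilcoxon signed-rank test; a minimal sketch using `scipy.stats.wilcoxon`, assuming the per-scene F1 scores of the two models form paired samples (the function name and threshold handling are ours):

```python
from scipy.stats import wilcoxon

def band_comparison_significant(f1_four, f1_seven, alpha=0.05):
    """Two-sided paired Wilcoxon signed-rank test on per-scene F1 scores.

    Returns the p-value and whether the difference between the
    four-band and seven-band models is significant at level alpha.
    """
    stat, p = wilcoxon(f1_four, f1_seven)
    return p, p < alpha
```

The Wilcoxon test is a sensible choice here because the per-scene score differences are few in number and not obviously normal, which rules out a paired t-test's assumptions.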

Conclusions
Cloud and cloud shadow segmentation remains a challenging task in intelligent remote sensing imagery processing, and its urgent requirements have driven the prosperous development of learning-based methods, given that vast numbers of training samples and corresponding labels are available. In this paper, we investigate the efficacy of UNet prediction and Dense CRF refinement in cloud and shadow segmentation tasks, and further propose an innovative architecture, Refined UNet, to localize clouds and sharpen boundaries. Specifically, UNet learns the features of clouds and shadows and gives coarse proposals, while the Dense CRF refines the boundaries of clouds and shadows to predict more precisely. Experiments on Landsat 8 OLI datasets for 2016 demonstrate that our method can localize clouds and shadows and refine their segmentation. In future work, we shall categorize pixels into more classes to achieve a finer-grained segmentation, explore approximate inference or learning methods for the Dense CRF, and ultimately concatenate neural network-based classifiers and Dense CRF layers into a more efficient end-to-end framework.