Refined UNet V2: End-to-End Patch-Wise Network for Noise-Free Cloud and Shadow Segmentation

Abstract: Cloud and shadow detection is an essential prerequisite for further remote sensing processing, whereas edge-precise segmentation remains a challenging issue. In Refined UNet, we considered the aforementioned task and proposed a two-stage pipeline to achieve edge-precise segmentation. The isolated segmentation regions produced by Refined UNet, however, degrade visualization and should be sufficiently eliminated. Moreover, an end-to-end model is also expected to jointly predict and refine the segmentation results. In this paper, we propose the end-to-end Refined UNet v2 to achieve joint prediction and refinement of cloud and shadow segmentation, which is capable of visually neutralizing redundant segmentation pixels or regions. To this end, we inherit the pipeline of Refined UNet, revisit the bilateral message passing in the inference of the conditional random field (CRF), and then develop a novel bilateral strategy derived from the Guided Gaussian filter. Derived from a local linear model of denoising, our v2 can considerably remove isolated segmentation pixels or regions, yielding "cleaner" results. Compared to the high-dimensional Gaussian filter, the Guided Gaussian filter-based message-passing strategy is quite straightforward and easy to implement, so that a brute-force implementation can be easily given in GPU frameworks, which is potentially efficient and facilitates embedding. Moreover, we prove that Guided Gaussian filter-based message passing is highly relevant to the Gaussian bilateral term in Dense CRF. Experiments and results demonstrate that our v2 is quantitatively comparable to Refined UNet, but can visually outperform it from the noise-free segmentation perspective. The comparison of time consumption also supports the potential efficiency of our v2.


Introduction
More and more remote sensing applications rely on cloud- and shadow-free images [1][2][3][4], while remote sensing images are usually degraded by clouds and cloud shadows, which negatively affects further processing. In particular, cloud and shadow removal requires cloud and corresponding shadow segmentation as a necessary prerequisite, which remains a challenging issue in remote sensing preprocessing. Fundamental solutions to cloud and cloud shadow segmentation focus on manually developed segmentation methods, which can be generally grouped into three categories [4]: spectral tests, temporal differentiation, and statistical methods. Spectral thresholds can be derived from spectral data [3,[5][6][7][8][9], temporal differentiation methods [10][11][12] pinpoint the movement of clouds and shadows, and statistical methods [13,14] exploit the statistics of spatial and spectral features. Manually developed methods can also be promoted by machine-learning methods thanks to large-scale labeled datasets [3], and data-driven segmentation methods have shown promising performance in cloud and shadow segmentation tasks [4,15].
Neural image segmentation (image segmentation by neural networks) approaches, on the other hand, introduce learnable end-to-end solutions over the spatial and spectral feature spaces of remote sensing images. Convolutional neural network-based (CNN-based) feature encoders enable spatial and spectral feature extraction and output representative feature vectors, which can serve as the backbone for dense classification tasks. Learnable parameters [16][17][18][19][20][21] or network structures [22,23] push models to fit the feature space and reach accurate pixel-wise classification results. Typical neural classifiers [4,24,25] have already been transferred to remote sensing segmentation, and some novel design principles [23,26,27] can gradually be applied to segmentation models fitting particular scenarios as well.
Some challenging issues, however, remain in cloud and cloud shadow segmentation tasks, such as edge-precise segmentation [15]. Due to the discrete cost functions and the inflation of receptive fields in CNN-based feature extractors, neural image segmentation is restricted to coarse pixel-level classification, and the precise delineation of clouds and shadows is still limited. To achieve edge-precise segmentation, the fully connected conditional random field (Dense CRF) has been employed to model segmentation at the pixel or patch level, and it is able to refine the segmentation performance on the edges. A feasible solution has been given in Refined UNet [15], in which we preliminarily investigated edge-precise cloud and shadow segmentation and proposed a feasible two-stage pipeline: a trainable UNet coarsely locates clouds and shadows patch by patch, and then Dense CRF post-processing refines the segmentation edges on the full images.
Building on the two-stage Refined UNet [15], we expect an end-to-end implementation of UNet-CRF segmentation, which should incorporate UNet coarse prediction and CRF refinement for an individual patch in one forward step. However, the complicated high-dimensional filter-based bilateral message passing makes it difficult to extend Refined UNet to an end-to-end model, which inspires us to explore other message-passing strategies. Therefore, we go deep into the Dense CRF and simplify its bilateral message passing, which helps build a joint implementation composed of UNet coarse prediction, unary transformation, and CRF refinement. In this paper, we inherit the two-stage pipeline of cloud and shadow segmentation in Refined UNet [15] and further explore an end-to-end solution, in which clouds and shadows are jointly identified and refined by the concatenation of the pretrained UNet and the following CRF. Guided Gaussian filter-based message passing is employed in our computationally efficient CRF inference, rather than the complicated high-dimensional filter. Practically, a vanilla brute-force GPU implementation can be easily given by the Gaussian filter in GPU frameworks. Derived from the local linear model of denoising, our proposed CRF can effectively eliminate redundant isolated segmentation pixels or regions, yielding "cleaner" results. A visual example of our Refined UNet v2 is shown in Figure 1. Accordingly, our main contributions are listed as follows.

• Refined UNet V2: an experimental prototype of the end-to-end model for cloud and shadow segmentation is proposed, which can jointly predict and refine clouds and shadows by the concatenation of the UNet and the following CRF for an individual image patch in one forward step.

• Straightforward and potentially efficient GPU implementation: we give an innovative Guided Gaussian filter-based message-passing strategy, which is straightforward and easy to implement in GPU frameworks. Thus, even the vanilla implementation is potentially efficient in computation.

• Noise-free segmentation: our proposed Refined UNet v2 can effectively eliminate redundant isolated segmentation pixels or regions and yield "cleaner" results. Moreover, we demonstrate that the CRF can show a particular segmentation preference (edge-precise or clean results) if the bilateral term is customized to fit that preference.
The rest of the paper is organized as follows. Section 2 reviews related work regarding cloud and shadow segmentation. The proposed Refined UNet v2 is described in Section 3. Section 4 presents the experiments on the Landsat 8 OLI dataset, including quantitative and visual comparisons against Refined UNet [15], an ablation study with respect to the proposed CRF, hyperparameter sensitivity with respect to r and ε, and computational efficiency. Section 5 concludes this paper.

Related Work
In this section, we review semantic segmentation and related techniques, including neural image segmentation methods, CRF methods, and edge-preserving filters.
Typical neural segmentation methods are reviewed as they show the principles of designing architectures and adapting to particular scenarios. The fully convolutional network (FCN) [16] initiated neural semantic segmentation, replacing fully connected layers with convolutional layers to adapt to inputs of arbitrary size. UNet [17] introduced intermediate layer concatenation to reuse and fuse the extracted feature maps. Moreover, long-range feature map aggregation has been fully employed in segmentation tasks, such as RefineNet [18] enhancing high-resolution segmentation and PSPNet [19] with its pyramid pooling module. In scene understanding applications, SegNet [20,21] inherited the encoder-decoder architecture for segmentation. For efficient computation in segmentation tasks, joint pyramid upsampling was applied in FastFCN [37]. The DeepLab series [38][39][40][41] exploited atrous convolution, CRF post-processing, depth-wise separable convolution, the atrous spatial pyramid pooling module, and novel backbones, attempting to improve both efficiency and robustness.
The concurrent trend of semantic segmentation concentrates on (i) bringing prior knowledge or features to particular scenarios, such as "cars cannot fly up in the sky" in urban scene segmentation [42], Fourier domain adaption [26], and model transfer (synthetic images to real images) [43], (ii) effective and efficient prediction, such as single-stage effective segmentation [44], boundary preserving segmentation [45], and very high-resolution segmentation [46], and (iii) novel network architecture, such as learnable dynamic architecture for semantic segmentation [23] and graph reasoning [27].

CRF-Based Image Segmentation
CRFs can model semantic segmentation at the pixel or patch level, implicitly performing maximum a posteriori (MAP) inference by minimizing the corresponding Gibbs energy function. Adjacency CRFs have been significantly improved by higher-order potentials or hierarchical connectivity, such as the Robust $P^n$ CRF [47,48], but they are still restricted to coarse segmentation due to their short-range connectivity. Fully connected CRFs benefit from long-range connectivity, while the computational complexity remains a challenging problem; graph cuts [49] and the high-dimensional filter [50] are two common solutions for efficient inference. On the other hand, the unary potentials in a CRF can be predicted by a coarse segmentation network. For example, CRFasRNN [51] presented an end-to-end segmentation structure concatenating a CNN and a CRF, in which the CNN-based segmentation backbone yielded the unary prediction and the following CRF refinement was built as a recurrent neural network (RNN) layer. A similar combination of a deep learning architecture and a Gaussian conditional random field (G-CRF) was given in [52]. CRFs can refine segmentation performance at the pixel level, which provides a novel perspective for our precise cloud and shadow segmentation task.

Edge-Preserving Filters
Edge-preserving filters aim to simultaneously smooth images and preserve edges, and have been widely used in applications such as noise removal [53], high dynamic range (HDR) compression [54], haze removal [55], and joint upsampling [56]. Sophisticated filters include the bilateral, weighted least squares (WLS), and guided filters. The bilateral filter [57] is a straightforward edge-preserving smoothing method that computes each output pixel with a Gaussian kernel weighted by both the spatial and the color-intensity discrepancy. The WLS filter [58] smooths images in an edge-preserving way by optimizing a quadratic cost function that takes the input image as guidance, achieving a global optimization. The guided filter [59] performs edge-preserving smoothing on images by taking the images themselves as guidance. In our study, the edge-preserving filter provides a message-passing strategy that smooths the segmentation while preserving edges in the CRF inference.

Methodology
We introduce our Refined UNet v2 in four subsections, including an overview of Refined UNet v2 in Section 3.1, the Dense CRF as the segmentation refinement in Section 3.2, the Guided Gaussian filter as the efficient message-passing strategy in Section 3.3, and our end-to-end CRF inference in Section 3.4.

Overview of Refined UNet v2
We present an overview of our Refined UNet v2, which performs noise-free segmentation on high-resolution remote sensing images. An end-to-end UNet-CRF architecture is used to roughly locate clouds and shadows and remove noise from a local perspective, taking as input a seven-band 512 × 512 patch and yielding a corresponding refined segmentation result. Hence, a high-resolution remote sensing image is first padded and cropped into patches of 512 × 512, and then the aforementioned end-to-end network infers and refines patch by patch. The full segmentation result of clouds and shadows is eventually reconstructed from the patches. In this case, the pretrained UNet is inherited from [15], and the proposed CRF inference is introduced in the following subsections. The full pipeline of our Refined UNet v2 is illustrated in Figure 2.
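The pad-slice-infer-reconstruct pipeline described above can be sketched as follows; the helper names and the zero padding are our assumptions based on the description in this paper, not the authors' code:

```python
import numpy as np

def pad_and_slice(image, patch=512):
    """Pad an (H, W, B) image to multiples of `patch` and slice it into patches."""
    H, W, B = image.shape
    Hp = -(-H // patch) * patch  # ceil to the next multiple of `patch`
    Wp = -(-W // patch) * patch
    padded = np.zeros((Hp, Wp, B), dtype=image.dtype)  # fill/padded values are zero
    padded[:H, :W] = image
    patches = [padded[i:i + patch, j:j + patch]
               for i in range(0, Hp, patch)
               for j in range(0, Wp, patch)]
    return patches, (Hp, Wp)

def reconstruct(label_patches, padded_shape, patch=512):
    """Stitch single-channel label patches back into the full padded label map."""
    Hp, Wp = padded_shape
    out = np.zeros((Hp, Wp), dtype=label_patches[0].dtype)
    k = 0
    for i in range(0, Hp, patch):
        for j in range(0, Wp, patch):
            out[i:i + patch, j:j + patch] = label_patches[k]
            k += 1
    return out
```

Each patch would be fed to the end-to-end UNet-CRF network in turn, and the refined label patches are stitched back; cropping to the original height and width recovers the full result.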

Revisiting Fully Connected Conditional Random Fields
We revisit the Dense CRF and the corresponding mean-field approximation inference, which has been thoroughly defined in [50]. Given the random field X and its global observation (image) I, the CRF (I, X) is characterized by a Gibbs distribution, defined in Equation (1), and the corresponding Gibbs energy is given by Equation (2).
in which $\mathbf{x}$ denotes the label assignments of all pixels, $\psi_u$ the unary potential, and $\psi_p$ the pairwise potential. In Equation (2), the unary potential can be practically given by a pixel-level classifier, and the pairwise potential is given by Equation (3).
in which $k^{(m)}$ and $w^{(m)}$ denote a Gaussian kernel and its corresponding weight, $\mathbf{f}_i$ and $\mathbf{f}_j$ the feature vectors of pixels $i$ and $j$, and $\mu$ the label compatibility function.
In [50], the contrast-sensitive two-kernel potentials are given by Equation (4), in which $I_i$, $I_j$, $p_i$, and $p_j$ denote the color vectors and spatial positions of pixels $i$ and $j$.
The inference of the CRF aims to find $\hat{x}$, the most probable pixel-level classification, by minimizing the energy function $E(x)$; the mean-field approximation facilitates the inference instead of computing the exact distribution $P(X)$. Equation (5) of the mean-field approximation leads to an iterative update algorithm, which is presented in Algorithm 1.
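The iterative update of Algorithm 1 can be summarized as a short sketch; `message_pass` abstracts the pairwise kernels, and all names here are ours, not the authors':

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_field(unary, message_pass, compat, n_iters=10):
    """Mean-field approximation loop (a sketch of Algorithm 1).

    unary:        (H, W, C) unary potentials (negative log-probabilities)
    message_pass: callable mapping Q -> filtered Q (the pairwise kernels)
    compat:       (C, C) label compatibility matrix (mu in Eq. (3))
    """
    Q = softmax(-unary)                      # initialization
    for _ in range(n_iters):
        msg = message_pass(Q)                # message passing
        pairwise = msg @ compat              # compatibility transform
        Q = softmax(-unary - pairwise)       # local update + normalization
    return Q
```

The only expensive step is `message_pass`, which is exactly where the high-dimensional filter of [50] or, in this paper, the Guided Gaussian filter is plugged in.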
Algorithm 1 Mean-Field Approximation in Fully Connected CRFs.

Guided Gaussian Filter
We introduce the Guided Gaussian filter as our proposed efficient message-passing method. As presented in [50], message passing is the bottleneck of efficient Dense CRF inference, and the high-dimensional filter based on the permutohedral lattice [60] is chosen to accelerate it. The high-dimensional Gaussian filter significantly reduces the time complexity but is quite complicated to implement. Therefore, we seek an alternative for efficient message passing that is highly relevant to the bilateral term, including both the color intensity and the Gaussian spatial features. The Guided Gaussian filter, introduced in this subsection, satisfies this requirement.
Assuming a local linear model between the guidance $I$ and the output $y$, we have Equation (6) in a window $\omega_k$ centered at pixel $k$.
The difference between the desired output $y_i$ and the input $x_i$ is assumed to be the unwanted noise $n_i$, defined in Equation (7).
Given a guidance $I$, the solution should minimize the discrepancy between the input $x$ and the desired output $y$, which is formulated as the cost function of a linear ridge regression. Moreover, we introduce a Gaussian weight $g_{ik}$ in window $\omega_k$, and the cost function is defined in Equation (9).
in which ε is a regularization parameter penalizing large $a_k$. Equation (9) defines a linear ridge regression model, and the solution is given by Equations (10)-(13).
$$\bar{x}_k = \frac{\sum_{i \in \omega_k} g_{ik} x_i}{\sum_{i \in \omega_k} g_{ik}} \quad (12) \qquad \mu_k = \frac{\sum_{i \in \omega_k} g_{ik} I_i}{\sum_{i \in \omega_k} g_{ik}} \quad (13)$$

in which $\bar{x}_k$ and $\mu_k$ denote the Gaussian-weighted means of $x_i$ and $I_i$ in window $\omega_k$. The derivation is given in Appendix A.
In fact, $a_k$ and $b_k$ can be computed by the Gaussian filter with radius $r$, defined in Equations (14) and (15).
in which the radius $r$ controls the size of window $\omega_k$: its width is $2r + 1$.
Using the Gaussian filter with radius $r$ to compute the overlapping averages $\bar{a}_k$ and $\bar{b}_k$, defined in Equations (16) and (17), $y_i$ can be obtained by Equation (18), and the Guided Gaussian filter is summarized in Algorithm 2, where $.*$ and $./$ denote element-wise multiplication and division.
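Under our reading of Equations (10)-(18), the Guided Gaussian filter reduces to a handful of Gaussian filterings; here `scipy.ndimage.gaussian_filter` (parameterized by σ rather than the radius r) stands in for the framework's Gaussian filter, and the function name and default ε are ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def guided_gaussian_filter(I, x, sigma, eps=1e-8):
    """Guided filter with Gaussian-weighted windows (a sketch of Algorithm 2)."""
    # Gaussian-weighted local means of guidance I and input x (Eqs. (12), (13));
    # gaussian_filter is normalized, so it returns the weighted means directly
    mean_I = gaussian_filter(I, sigma)
    mean_x = gaussian_filter(x, sigma)
    # Gaussian-weighted variance of I and covariance of (I, x)
    var_I = gaussian_filter(I * I, sigma) - mean_I * mean_I
    cov_Ix = gaussian_filter(I * x, sigma) - mean_I * mean_x
    # linear coefficients (Eqs. (10), (11)); eps penalizes large a_k
    a = cov_Ix / (var_I + eps)
    b = mean_x - a * mean_I
    # average the overlapping windows (Eqs. (16), (17)), then Eq. (18)
    return gaussian_filter(a, sigma) * I + gaussian_filter(b, sigma)
```

Because every step is a plain (separable) Gaussian filtering or an element-wise operation, this brute-force form maps directly onto GPU frameworks, which is the practical appeal argued above.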
A bilateral term composed of both the color intensity and the Gaussian spatial features is desired, as presented in [50]. In fact, the Guided Gaussian filter takes into consideration both the position and color-intensity features of pixels $i$ and $j$, which can be proved by the implicit kernel weights defined in Equation (19).
in which $Z_g$, $\mu_k$, and $\sigma_k$ denote the normalization term of the Gaussian filter $\sum_{i \in \omega_k} g_{ik}$, the Gaussian-weighted mean $f_{GF}(I_i)$, and the variance $f_{GF}(I_i^2) - f_{GF}^2(I_i)$ in window $\omega_k$, respectively. The proof is given in Appendix B.

End-to-End CRF Inference
As mentioned above, an efficient inference has been presented in [50], in which a high-dimensional filtering algorithm was applied to the efficient bilateral message passing, defined in Equation (20). The high-dimensional filter of the permutohedral lattice [60] reduces the time complexity of bilateral message passing to linear time ($O(N)$) [50].
Nevertheless, the high-dimensional filter is quite complicated to implement, especially on GPUs and for end-to-end inference. Conversely, the brute-force Guided Gaussian filter can be applied to the bilateral message passing, as it is highly relevant to the bilateral features. The bilateral message-passing term can thus be replaced by the Guided Gaussian filter, while the smoothness term remains a Gaussian kernel, defined in Equation (21).
in which the smoothness term is defined in Equation (22).
Consequently, the mean-field approximation algorithm for our CRF inference is presented in Algorithm 3.

Algorithm 3 End-to-End Mean-Field Approximation in CRF Inference.
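Putting the pieces together, one message-passing step per Equation (21) might look as follows; the mixture weights, σ values, and function names are illustrative assumptions, not the authors' configuration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def guided_gaussian(I, x, sigma, eps):
    """Guided filter with Gaussian-weighted windows (Section 3.3, our sketch)."""
    mean_I, mean_x = gaussian_filter(I, sigma), gaussian_filter(x, sigma)
    var_I = gaussian_filter(I * I, sigma) - mean_I ** 2
    cov_Ix = gaussian_filter(I * x, sigma) - mean_I * mean_x
    a = cov_Ix / (var_I + eps)
    b = mean_x - a * mean_I
    return gaussian_filter(a, sigma) * I + gaussian_filter(b, sigma)

def pass_messages(Q, I, w_b=1.0, w_s=1.0, sigma=5.0, eps=1e-8, sigma_s=1.0):
    """Eq. (21): guided-filter bilateral term plus Gaussian smoothness term.

    Q: (H, W, C) marginals; I: (H, W) grayscale guidance.
    w_b, w_s and the sigmas are placeholder hyperparameters.
    """
    msg = np.empty_like(Q)
    for c in range(Q.shape[-1]):
        msg[..., c] = (w_b * guided_gaussian(I, Q[..., c], sigma, eps)
                       + w_s * gaussian_filter(Q[..., c], sigma_s))
    return msg
```

Embedding this step in the mean-field loop of Algorithm 1 yields the end-to-end inference of Algorithm 3.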

Experiments and Discussion
We evaluate our method in several subsections, including quantitative and visual comparisons against Refined UNet [15], an ablation study with respect to our CRF inference, hyperparameter sensitivity with respect to r and ε in the CRF, and the computational efficiency of v2.

Revisiting Experimental Datasets, Preprocessing, Implementation Details, and Evaluation Metrics
To evaluate the performance of the proposed Refined UNet v2, we reuse the experimental dataset in [15], which is selected from Landsat 8 OLI imagery data [3]. Practically, the test dataset, labels of clouds and shadows, class IDs, and visual settings are inherited: images of Landsat 8 OLI, Path 113, Row 26, Year 2016 are reserved for testing. For a fair comparison, the pretrained UNet backbone is inherited, which has been trained and validated on the training and validation sets given in [15]. QA is derived from the Level-2 Pixel Quality Assessment band as the reference, which labels pixels of clouds and shadows with high confidence generated by the CFMask algorithm [8]. Please note that QA is referred to as a reference rather than ground truth because its labels of clouds and shadows are dilated and not precise enough at the pixel level. Class IDs of background, fill values (invalid values, −9999), shadows, and clouds are assigned to 0, 1, 2, and 3, respectively; land, snow, and water are merged into background as we focus on cloud and shadow segmentation. Bands 5 NIR, 4 Red, and 3 Green are stacked as false-color images, and pixels of background (0), cloud shadows (2), and clouds (3) are colorized by gray (#9C9C9C), green (#267300), and cyan (#73DFFF), respectively. Please refer to [15] for more details regarding the dataset.

We also inherit the data preprocessing in [15]: fill and padded values are assigned to zero, full images are sliced into patches with a size of 512 × 512, and all pixels in one patch are normalized to the interval (0, 1] for UNet prediction. In our case, test images are padded to 8192 × 8192 so that they can be sliced into 256 patches. The UNet backbone takes all seven bands as input, but the CRF only uses Bands 5, 4, and 3, combined into grayscale. We follow [50] to transform predictions into unary potentials.
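A minimal sketch of the per-patch normalization and grayscale guidance, under our reading of the description above; the exact grayscale conversion is not specified, so a plain band average is used as a placeholder:

```python
import numpy as np

def preprocess_patch(patch):
    """Normalize a (512, 512, 7) patch for the UNet and build the CRF guidance.

    Min-max normalization stands in for the paper's mapping to (0, 1];
    band order 1..7 along the last axis is assumed.
    """
    p = patch.astype(np.float32)
    lo, hi = p.min(), p.max()
    unet_in = (p - lo) / max(hi - lo, 1e-12)
    # grayscale guidance from Bands 5, 4, 3 (channel indices 4, 3, 2 here);
    # a simple average is a placeholder for the unspecified conversion
    guidance = unet_in[:, :, [4, 3, 2]].mean(axis=-1)
    return unet_in, guidance
```

The normalized seven-band tensor feeds the UNet backbone, while the single-channel guidance drives the Guided Gaussian filter in the CRF.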
In the implementation of CRF inference, the Gaussian and Guided Gaussian filters are built upon the TensorFlow [61] framework, and the center weight of the Gaussian kernel is assigned to zero so that a pixel does not pass a message to itself. Instead of reusing the reported assessment, we reproduce Refined UNet [15] but set the number of iterations of Dense CRF inference to 10. In our Refined UNet v2, ε is empirically assigned to $10^{-8}$, while r varies from 10 to 80; we choose v2 with r = 20 for visual assessment as well. We also conduct a subsequent experiment in this section to examine the effect of these hyperparameters. The demo code is available at https://github.com/92xianshen/refined-unet-v2.
For evaluation, we inherit the quantitative metrics from [15] to assess our methods, including accuracy, precision $P$, recall $R$, and $F_1$ scores. Considering the indicator $p_{ij}$ denoting the cumulative number of pixels that should be grouped into class $i$ but are actually grouped into class $j$, precision $P$ reports the ability of the method to predict correctly, recall $R$ reports the ability of the method to retrieve comprehensively, and the $F_1$ score is a synthesized indicator considering both $P$ and $R$. Equations (23)-(25) define the above indicators, in which $C$ is the number of classes. Additionally, time consumption is also considered as an indicator to evaluate these methods, which can demonstrate whether our method is practically efficient.
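The class-wise metrics built from the indicator $p_{ij}$ can be computed as follows (a sketch; the function name is ours):

```python
import numpy as np

def classwise_prf(y_true, y_pred, n_classes):
    """Accuracy and per-class precision/recall/F1 from the confusion counts p[i, j]."""
    # p[i, j]: pixels of class i predicted as class j
    p = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(p, (y_true.ravel(), y_pred.ravel()), 1)
    tp = np.diag(p).astype(float)
    P = tp / np.maximum(p.sum(axis=0), 1)        # precision: tp / predicted as class
    R = tp / np.maximum(p.sum(axis=1), 1)        # recall: tp / actual class
    F1 = 2 * P * R / np.maximum(P + R, 1e-12)    # harmonic mean of P and R
    acc = tp.sum() / p.sum()
    return acc, P, R, F1
```

Averaging these per-class scores over the test images gives the mean ± standard deviation entries reported in Table 1.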

Quantitative Comparison between Refined UNets and V2
We first clarify some expressions in the comparison experiments. For brevity, we refer to the segmentation results from the UNet as predictions and those from the CRF as refinements. The local and global Refined UNets differ in whether refinement is performed on patches or on full images. Hence, the local Refined UNet produces refinements by taking image patches as input, and the full segmentation results are restored from the patches. Conversely, the global Refined UNet predicts locally but refines globally, yielding full results by the CRF. Naturally, the global Refined UNet represents the method in [15].
For the quantitative comparison, we inherit the hyperparameter configuration of the global and local Refined UNets from [15]: $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$ are empirically assigned to 80, 13, and 3, respectively. The quantitative assessment is listed in Table 1, in which accuracy, precision, recall, and $F_1$ scores are compared. As can be seen in Table 1, the quantitative evaluations of the UNet backbone, global Refined UNet [15], local Refined UNet, and Refined UNet v2 are approximately close in terms of accuracy, $P$, $R$, and $F_1$. Please note that the QA band is not precise enough to serve as ground truth, so we take these quantitative indicators only as a secondary evaluation, in order to observe whether there is a distinct difference in the numerical indicators. Accordingly, the quantitative assessment demonstrates that our Refined UNet v2 is quantitatively comparable to Refined UNet [15]. On the other hand, the accuracy score drops slightly with the increase of the radius r, and the precision $P$ and recall $R$ of cloud shadows decrease significantly. We attribute this to the intrinsic refinement of our v2: it prunes necessary shadow pixels, so $P$ and $R$ are considerably affected. Additionally, a side effect of these experiments is demonstrating the reproducibility of Refined UNet [15].

Visual Comparison between Refined UNet and V2
Furthermore, we compare the visual performance between Refined UNet [15] and v2. Theoretically, Refined UNet [15] focuses on global edge-preserving refinements while Refined UNet v2 concentrates more on eliminating noise in local predictions, due to their mathematical derivations. We present the qualitative evaluations as the principal assessment, illustrated in Figures 3 and 4. Visually, our Refined UNet v2 retains the prediction from its backbone but enlarges some small pieces of clouds and shadows from a global perspective (Figure 3). From local observation, it features "denoising" the prediction (eliminating isolated minor pieces or pixels), performing a better segmentation on the snow region (the first row in Figure 4). We attribute this to its intrinsic property of denoising, which has been shown in Section 3.3. Therefore, our Refined UNet v2 is superior in the noise-free segmentation of clouds and shadows.

Figures 3 and 4 (caption excerpt): (d) Refined UNet [15], (e) Refined UNet v2. Bands 5 NIR, 4 Red, and 3 Green are stacked as RGB channels to construct false-color images for visualization. Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively.

Table 1 (caption): Average accuracy, precision, recall, and F1 scores on the test set (average ± standard deviation; + indicates that higher scores are better).

However, it is noted that our Refined UNet v2 sometimes over-refines the "noise", removing not only misclassified minor regions (snow regions or fine cloud regions) but also small pieces of background, which leads to merging two separated cloud regions; we attribute these drawbacks to the property of denoising as well. The strength of denoising can essentially be controlled by the radius r, which is thoroughly explored in the following subsection. In addition, some misclassified regions remain: a piece of the river, for example, is detected as shadow, and some associated shadows are missing. This is, in our opinion, because of the misclassification of the UNet backbone and the strength of our CRF inference. In the future, we will further explore the properties of other filter-based message-passing mechanisms to improve the CRF inference, and plan to introduce sophisticated weakly supervised strategies to improve the accuracy. Visually, Refined UNet [15] focuses more on the precise boundaries of clouds and shadows but generates more "noise" in the prediction; Refined UNet v2, alternatively, can remove this "noise". Over-refinement, however, can also be found in the visual assessment, which should be further considered in the future.

Ablation Study Regarding Our CRF Inference
We evaluate the performance of the UNet backbone and v2 to demonstrate the effect of our CRF inference, illustrated in Figures 3 and 4. Visually, the UNet with adaptive weights concentrates more on the minority categories (pixels of cloud shadows), which results in a prediction with too much noise. The visualizations in Figures 3 and 4 demonstrate the efficacy of the refinement. Refined UNet [15] can refine the boundaries of cloud and shadow entities and, to some extent, eliminate isolated regions, but it still preserves some stubborn fine regions. In contrast, v2 can faithfully denoise these predictions and generate cleaner results. Consequently, our Refined UNet v2 outperforms Refined UNet [15] in noise-free segmentation. Similarly, the over-refinement of isolated regions remains, which can be a drawback to overcome in future exploration.

Hyperparameter Sensitivity Regarding r and ε in Our CRF Inference
We evaluate the performance of Refined UNet v2 with respect to the hyperparameters r and ε. As mentioned above, r and ε denote the radius of the Gaussian filter and the regularization parameter of the linear ridge regression model, respectively. According to [59], r controls the filter window and affects the inference efficiency of the brute-force implementation: a higher r should yield a more refined result because of longer-range connectivity but takes more time in inference, while ε controls the ability to smooth predictions: a higher ε yields a "blurrier" prediction. The qualitative and quantitative assessments are given in Figure 5 and Table 1.

Figure 5 (caption): Visualizations with regard to r and ε in our CRF implementation. Gray (#9C9C9C), green (#267300), and cyan (#73DFFF) mark pixels of background (0), cloud shadows (2), and clouds (3), respectively.

The candidate values of r vary from 10 to 80 with ε fixed at $10^{-8}$, and the candidate values of ε vary from $10^{-8}$ to 100 with r fixed at 50. The prediction adheres more closely to boundaries with a higher r and a lower ε, while more isolated regions are also eliminated, especially small pieces of shadows. Figure 5 visually presents the effect of r on segmentation performance. As speculated above, a higher r propagates messages over longer-range connectivity, leading to finer segmentation results, while a lower r truncates long-range message passing, leading to coarser results. The quantitative assessment in Table 1 also confirms the denoising sensitivity: v2 with a larger r prefers to prune more shadows, whereas it tends to retain more shadow entities when a smaller r is assigned. This can be visually explained by the similarity between shadows and segmentation noise. Our Refined UNet v2, on the other hand, is not sensitive to ε until it grows to 1; the edges of the segmentation results are not precise enough when ε is too high.
In summary, we can increase r to yield cleaner results and decrease it to preserve more entities.

Computational Efficiency of Refined UNet v2
We compare the computational efficiency between the Refined UNets and v2, as shown in Table 1. Please note that the listed duration is the time consumed to infer one full image in the test phase. As can be seen in Table 1, our v2 is potentially efficient compared to the global and local Refined UNets: the global and local Refined UNets spend similar time refining results, and their time consumption is not proportional to $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$, while our v2 consumes less time than the counterparts when r < 50. The computational efficiency of our v2 benefits from the GPU support of the TensorFlow framework. Moreover, according to Table 1, the computational efficiency of our brute-force implementation is highly related to the selection of the radius r: the time consumption increases significantly when a greater r is used, which we attribute to the brute-force implementation. Thus, we will optimize the time performance with an efficient GPU filter in future work.

Conclusions
In this paper, we present an experimental prototype of an end-to-end pixel-level classifier for noise-free patch-wise cloud and shadow segmentation, which concatenates the UNet for coarse cloud and shadow segmentation and the following CRF inference for segmentation noise removal. In contrast to the separated pipeline of local UNet prediction and global Dense CRF post-processing in [15], our end-to-end Refined UNet v2 locally gives the coarse prediction and the noise-free segmentation result simultaneously. Practically, it is straightforward for GPU frameworks and can be easily implemented in any machine-learning framework, and is thus potentially efficient. Theoretically, we prove that the proposed bilateral term is highly relevant to both the color intensity and the Gaussian spatial features, similar to that of [50]. Experiments and results have demonstrated that our v2 is quantitatively comparable to Refined UNet [15] in terms of accuracy, precision, recall, and F1 scores, but can visually outperform it from the noise-free segmentation perspective. It is noted that a larger radius r results in the accuracy score dropping slightly but the precision, recall, and F1 scores decreasing significantly, particularly for cloud shadows; we attribute this to the intrinsic refinement of our v2. The comparison of time consumption also supports that our implementation is potentially efficient; in fact, our brute-force implementation is more efficient than Refined UNet when r < 50. We will further explore the backpropagation and learnability of our CRF implementation, and improve its computational efficiency.

Appendix A. Derivation of Guided Gaussian Filter
We revisit the derivation of the Guided Gaussian filter in this section, first recalling the cost function of the ridge regression with respect to the linear coefficients $a_k$ and $b_k$.
The partial derivatives of $E$ with respect to $a_k$ and $b_k$ are given by Equations (A2) and (A3).
Both partial derivatives must vanish if we intend to minimize the cost function. We first consider the partial derivative of $E$ with respect to $b_k$; setting it to zero yields $b_k$, given by Equation (A8). For brevity of exposition, we refer to $\mu_k$ and $\bar{x}_k$ as the Gaussian-weighted means of $I_i$ and $x_i$ in the window $\omega_k$.
Now we consider the partial derivative of $E$ with respect to $a_k$, given by Equations (A9) and (A10).
We substitute $b_k$ (Equation (A8)) into the term with respect to $b_k$ in Equation (A10), which yields

$$\sum_{i \in \omega_k} g_{ik} I_i b_k = \sum_{i \in \omega_k} g_{ik} I_i (\bar{x}_k - a_k \mu_k) = \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i - a_k \mu_k \sum_{i \in \omega_k} g_{ik} I_i \quad (A11)$$

$$a_k \sum_{i \in \omega_k} g_{ik} I_i^2 + \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i - a_k \mu_k \sum_{i \in \omega_k} g_{ik} I_i - \sum_{i \in \omega_k} g_{ik} I_i x_i + \epsilon a_k \sum_{i \in \omega_k} g_{ik} = 0 \quad (A12)$$

$$a_k \left( \sum_{i \in \omega_k} g_{ik} I_i^2 - \mu_k \sum_{i \in \omega_k} g_{ik} I_i + \epsilon \sum_{i \in \omega_k} g_{ik} \right) = \sum_{i \in \omega_k} g_{ik} I_i x_i - \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i \quad (A13)$$

Now we have the solution for $a_k$, given by Equation (A14).

$$a_k = \frac{\sum_{i \in \omega_k} g_{ik} I_i x_i - \bar{x}_k \sum_{i \in \omega_k} g_{ik} I_i}{\sum_{i \in \omega_k} g_{ik} I_i^2 - \mu_k \sum_{i \in \omega_k} g_{ik} I_i + \epsilon \sum_{i \in \omega_k} g_{ik}} \quad (A14)$$

Appendix B. Proof of the Kernel of Guided Gaussian Filter
The Guided Gaussian filter takes into consideration both the position and color-intensity features of pixels $i$ and $j$, which can be proved by the implicit kernel weights defined in Equation (A15).
in which $Z_g$, $\mu_k$, and $\sigma_k$ denote the normalization term of the Gaussian filter $\sum_{i \in \omega_k} g_{ik}$, the Gaussian-weighted mean $f_{GF}(I_i)$, and the variance $f_{GF}(I_i^2) - f_{GF}^2(I_i)$ in window $\omega_k$, respectively.
Proof. The kernel is given by Equation (A16).
We replace $b_k$ in Equation (18) with Equation (15) and obtain Equation (A17). Thus, the derivative of $y_i$ with respect to $x_j$ follows in Equation (A18). The derivative of $\bar{x}_k$ with respect to $x_j$ is

$$\frac{\partial \bar{x}_k}{\partial x_j} = \frac{1}{Z_g} g_{jk} \delta_{j \in \omega_k} = \frac{1}{Z_g} g_{kj} \delta_{k \in \omega_j} \quad (A19)$$

and the derivative of $a_k$ with respect to $x_j$ follows analogously. Finally, combining these derivatives yields the kernel in Equation (A16), which completes the proof.