Cloud and Snow Identification Based on DeepLab V3+ and CRF Combined Model for GF-1 WFV Images

Cloud and snow identification in remote sensing images is critical for snow mapping and snow hydrology research. Semantic segmentation models are prone to producing blurred boundaries, slicing traces and isolated small patches when identifying cloud and snow in high-resolution remote sensing images. To address this problem, this study examines the feasibility of combining the DeepLab v3+ and conditional random field (CRF) models for cloud and snow identification in GF-1 WFV images. Model training and testing experiments are compared under different sample numbers, sample sizes and loss functions. The results show that, firstly, when the number of samples is 10,000, the sample size is 256 × 256 and the loss function is the Focal function, the model accuracy is optimal, and the Mean Intersection over Union (MIoU) and the Mean Pixel Accuracy (MPA) reach 0.816 and 0.918, respectively. Secondly, after post-processing with the CRF model, the MIoU and the MPA improve to 0.836 and 0.941, respectively. Moreover, misclassifications such as blurred boundaries, slicing traces and isolated small patches are significantly reduced, which indicates that the combination of the DeepLab v3+ and CRF models is accurate and feasible for cloud and snow identification in high-resolution remote sensing images. The conclusions can provide a reference for high-resolution snow mapping and hydrology applications using deep learning models.


Introduction
As an important part of the cryosphere, snow is one of the most active natural elements on the Earth's surface [1]. Snow cover is a product of atmospheric circulation and plays an extremely important role in the Earth's climate system because its changes can, in turn, affect the climate by altering the surface energy balance, water cycle and atmospheric circulation [2]. Snow cover change also has a wide and profound impact on future ecological security, environmental security and the social economy [3]. With the rapid improvement in the spatial resolution of remote sensing images, high-resolution snow cover identification and mapping have attracted attention in the field of hydrology and water resources. Because high-resolution remote sensing images lack short-wave infrared bands, the commonly used spectrum-based cloud and snow identification algorithms are difficult to apply to them. The study of cloud and snow identification methods for high-resolution remote sensing images has therefore become an important direction of snow remote sensing research.
The current algorithms for cloud and snow identification in remote sensing images mainly include the spectral feature method, spatial texture method and pattern recognition method, and so on [4]. Among these, the spectral feature method is mature and widely used for cloud and snow identification in medium and low spatial resolution remote sensing images with a short-wave infrared band, but it cannot be used for cloud and snow

GF-1 WFV Data
The high-resolution remote sensing images used in this paper are Wide Field View (WFV) sensor data from the Chinese Gaofen-1 (GF-1) satellite. The orbit height of the GF-1 satellite is 645 km. WFV images contain four bands (red, green, blue and near-infrared), with a spatial resolution of 16 m and a swath width of 800 km. The specific parameters of GF-1 WFV are shown in Table 1, and the specific data used in this paper are listed in Table 2. A total of ten GF-1 WFV images acquired from November 2017 to October 2020 were used. Seven of them were used for model training and validation, and the other three were used for model testing. Radiometric calibration and atmospheric correction were performed on these images before sample labeling and model training.

Sample Labeling
Since cloud and snow cover vary frequently with time, samples have to be labeled by manual vectorization. Three labeling categories are used: snow, cloud and background. Because snow in shadows is difficult to label manually with adequate accuracy, both mountain shadows and cloud shadows are annotated as background so that they do not affect the accuracy of model training and testing. Firstly, the regions with relatively concentrated cloud and snow in the seven training and validation images are manually vectorized and labeled, and the labeled regions are cropped into a total of 2000 samples of 256 × 256 pixels with four bands. Secondly, to avoid overfitting caused by the small number of training samples and to improve the robustness of the model, data augmentation methods such as rotation, blur transformation and the addition of Gaussian noise are used to expand the 2000 samples to 10,000. These 10,000 samples of 256 × 256 × 4 pixels are all used as the training and validation set of the cloud and snow identification neural network model, in which the training subset accounts for 75% and the validation subset accounts for 25%. For the other three test images, the regions with relatively concentrated cloud and snow are only manually vectorized and annotated, without cropping or data augmentation, and the annotation results are directly used as the test data for the accuracy evaluation. Some labeled data are shown in Figure 1.
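The augmentation and the 75%/25% split described above can be illustrated with a minimal NumPy sketch. The specific choices here are assumptions for illustration only and are not taken from the paper: two rotations, a 3 × 3 mean blur and additive Gaussian noise (giving five samples per original tile), as well as the helper names `box_blur`, `augment` and `split_train_val` and the noise level.

```python
import numpy as np

def box_blur(tile, k=3):
    """Blur each band with a k x k mean filter (edges padded by replication)."""
    pad = k // 2
    padded = np.pad(tile, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = tile.shape[:2]
    out = np.zeros(tile.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / (k * k)

def augment(tile, label, rng, noise_std=0.01):
    """Return the original (tile, label) pair plus four augmented variants."""
    tile = tile.astype(np.float64)
    pairs = [(tile, label)]
    # Rotations move pixels, so the label mask is rotated in the same way.
    pairs.append((np.rot90(tile, 1), np.rot90(label, 1)))
    pairs.append((np.rot90(tile, 2), np.rot90(label, 2)))
    # Blur and noise only change pixel values, so the label is unchanged.
    pairs.append((box_blur(tile), label))
    pairs.append((tile + rng.normal(0.0, noise_std, tile.shape), label))
    return pairs

def split_train_val(n_samples, rng, train_fraction=0.75):
    """Shuffle sample indices and split them into training and validation sets."""
    order = rng.permutation(n_samples)
    n_train = int(round(n_samples * train_fraction))
    return order[:n_train], order[n_train:]

rng = np.random.default_rng(0)
tile = rng.random((256, 256, 4))          # synthetic 4-band tile
label = rng.integers(0, 3, (256, 256))    # 0 = background, 1 = cloud, 2 = snow
pairs = augment(tile, label, rng)
train_idx, val_idx = split_train_val(10_000, rng)
print(len(pairs), len(train_idx), len(val_idx))  # 5 7500 2500
```

With five variants per tile, 2000 labeled tiles would yield 10,000 samples, which matches the totals reported above; the actual mix of transforms used by the authors is not specified.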

DeepLab V3+ Model
DeepLab v3+ is the latest model in Google's DeepLab series, and the model structure is shown in Figure 2. Compared with the DeepLab v3 model, its biggest feature is that most of the convolutions in the network are replaced with dilated (atrous) convolutions, which enlarge the receptive field and enhance the ability of the model to extract dense image features without increasing the number of parameters.
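To make the receptive field argument concrete, the following minimal sketch computes the effective extent of a dilated kernel and applies a one-dimensional dilated convolution. The function names and the example rates are illustrative assumptions; the point is only that the number of kernel weights stays fixed while the covered span grows with the dilation rate.

```python
import numpy as np

def effective_kernel_size(k, rate):
    """Span covered by a k-tap kernel whose taps are `rate` samples apart."""
    return k + (k - 1) * (rate - 1)

def dilated_conv1d(signal, kernel, rate):
    """Valid 1-D dilated correlation (no padding, stride 1)."""
    span = effective_kernel_size(len(kernel), rate)
    n_out = len(signal) - span + 1
    out = np.zeros(n_out)
    for i in range(n_out):
        taps = signal[i:i + span:rate]   # every `rate`-th sample in the span
        out[i] = np.dot(taps, kernel)
    return out

kernel = np.ones(3)                      # three weights regardless of the rate
for rate in (1, 2, 4):
    print(rate, effective_kernel_size(len(kernel), rate))  # 1 3, 2 5, 4 9

print(dilated_conv1d(np.arange(10.0), kernel, rate=2))
# [ 6.  9. 12. 15. 18. 21.]
```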

The backbone network of the DeepLab v3+ encoder is an Xception network with atrous convolution. Xception was developed from Inception v3, and its structure uses residual connections similar to those in ResNet. It is based on the idea that spatial correlations and inter-channel correlations should be handled separately. Therefore, depthwise separable convolution is used to divide an ordinary convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution performs a spatial convolution on each channel independently, and the pointwise convolution combines the channel values at each pixel, which reduces the number of parameters and the computational complexity while maintaining similar performance [21]. The Xception network used here replaces all max pooling operations with strided depthwise separable convolutions, while the entry flow, middle flow and exit flow structure of the original network is retained. Finally, as in DeepLab v3, Atrous Spatial Pyramid Pooling (ASPP) is used to extract the context information of remote sensing images at four different scales with four different receptive fields, so as to achieve robust segmentation and thus improve the segmentation effect.
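The parameter saving of depthwise separable convolution can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the layer size of 3 × 3 with 256 input and 256 output channels is an assumed example rather than a value taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of an ordinary k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights of a depthwise (k x k per channel) plus pointwise (1 x 1) pair."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(sep / std, 3))  # 589824 67840 0.115
```

The ratio equals 1/c_out + 1/k², so for a 3 × 3 kernel with many output channels the separable form needs roughly one ninth of the weights.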
Remote Sens. 2022, 14
Figure 2. DeepLab v3+ structure [21]. It adopts an encoder-decoder structure. The arrows in the figure represent the data flow. The red, blue and gray in the prediction map represent different ground objects.
The decoder of the DeepLab v3+ model follows the skip connection design of the Fully Convolutional Network (FCN): the low-level detail features from the encoder are reduced in dimension by convolution and fused with the high-level features output by the encoder. The fused feature map is then restored to the original image size using 1 × 1 convolution and bilinear interpolation upsampling. Finally, the Softmax activation function is used to classify each pixel.

Loss Function
In the training of a neural network, the loss function measures the difference between the model prediction and the true label; it is used to adjust the model parameters during training and to evaluate the training results. In general, the smaller the loss value, the higher the accuracy of the model. Cross Entropy (CE) loss is generally used as the loss function in image segmentation. It examines each pixel one by one, but it is prone to fitting difficulties when the numbers of samples of different classes are extremely unbalanced, because the loss becomes too small. In medical image processing, because the anatomical structure of interest usually occupies only a small area of the scanned image, the Dice loss function was proposed in V-Net [22] to increase the weight of the foreground area, which prevents the model from falling into a local minimum of the loss function during training. In the field of target detection, the Focal loss function [23] is usually used to solve the problem of severe imbalance between positive and negative samples. When the categories are unbalanced, it makes the loss smaller for samples with high prediction probability and larger for samples with low prediction probability, thus strengthening the attention of the model on the positive samples. In this study, because the labeled dataset contains many more background samples, and more cloud samples than snow samples, the Focal function is chosen as the loss function, which can effectively solve the problem that the proportion of foreground samples is too small.
In the formulas of the three loss functions, n is the number of categories; $P_i$ is the true probability distribution; $Q_i$ is the probability distribution predicted by the model; ε in the Dice loss function generally takes the value of one, to avoid the gradient explosion caused by a denominator that is zero or too small; and γ in the Focal loss function is the parameter that controls how strongly the loss concentrates on poorly predicted samples, and generally takes a value between zero and five. This paper addresses a three-class problem (cloud, snow, background), so n is three, and $P_i$ and $Q_i$ are 256 × 256 matrices, where $P_i$ is the sample label image and $Q_i$ is the model classification image.
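As a rough illustration of how the three losses behave, the sketch below implements commonly used textbook forms of the CE, Dice and Focal losses in NumPy. These forms are assumptions and may differ in detail from the expressions used by the authors (for example, no class weighting factor is applied in the Focal loss); `p` is a one-hot label array and `q` a softmax output, both of shape (height, width, n).

```python
import numpy as np

def cross_entropy_loss(p, q, eps=1e-7):
    """Mean over pixels of -sum_i P_i * log(Q_i)."""
    q = np.clip(q, eps, 1.0)
    return float(np.mean(-np.sum(p * np.log(q), axis=-1)))

def dice_loss(p, q, smooth=1.0):
    """1 - mean over classes of (2*|P.Q| + smooth) / (|P| + |Q| + smooth)."""
    axes = tuple(range(p.ndim - 1))           # sum over pixels, keep classes
    intersection = np.sum(p * q, axis=axes)
    denom = np.sum(p, axis=axes) + np.sum(q, axis=axes)
    return float(1.0 - np.mean((2.0 * intersection + smooth) / (denom + smooth)))

def focal_loss(p, q, gamma=2.0, eps=1e-7):
    """Mean over pixels of -sum_i P_i * (1 - Q_i)^gamma * log(Q_i)."""
    q = np.clip(q, eps, 1.0)
    return float(np.mean(-np.sum(p * (1.0 - q) ** gamma * np.log(q), axis=-1)))

# One pixel whose true class is index 0, predicted confidently and poorly.
p = np.array([[[1.0, 0.0, 0.0]]])
q_easy = np.array([[[0.90, 0.05, 0.05]]])
q_hard = np.array([[[0.10, 0.45, 0.45]]])

for name, q in (("easy", q_easy), ("hard", q_hard)):
    ce, fl = cross_entropy_loss(p, q), focal_loss(p, q, gamma=2.0)
    print(name, round(ce, 4), round(fl, 4), round(fl / ce, 2))
# easy 0.1054 0.0011 0.01
# hard 2.3026 1.8651 0.81
```

With γ = 2, the well-predicted pixel keeps only 1% of its CE loss while the poorly predicted pixel keeps 81%, which is the down-weighting effect described above; with γ = 0 the Focal form reduces to CE.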

Conditional Random Field
The conditional random field model is a probabilistic graphical model proposed by Lafferty et al. (2001) [24]. It combines the unary potential energy of a single pixel and the pairwise potential energy between neighboring pixels, so that spatially adjacent pixels tend to be assigned the same label. It is usually applied to smooth segmentation maps with edge noise. However, its structure cannot model pixels that are far apart, and it is prone to over-smoothing the boundaries of target objects. To solve this problem, Krähenbühl et al. (2011) proposed the fully connected CRF [25]. In the fully connected CRF, the energy of the predicted label assignment X is defined as

$$E(X) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$$

where i and j index pixels; $x_i$ and $x_j$ are the labels assigned to pixels i and j, respectively; $\psi_u(x_i)$ is the unary potential energy; and $\psi_p(x_i, x_j)$ is the pairwise potential energy. The unary potential energy represents the class probability distribution obtained from the independent prediction of each pixel i in the classification image whose accuracy is to be improved; this prediction contains much noise and is discontinuous. The pairwise potential energy is defined on a fully connected graph that connects all pixels of the image, and it encourages pixels with similar properties to be assigned to the same category. The smaller the energy E(X) of the fully connected CRF, the more accurate the predicted pixel category labels X. The mean field approximation is generally used to iteratively minimize the energy function and obtain a result with improved boundary accuracy. In this paper, the pixel category probability map output by the DeepLab v3+ neural network model is taken as the unary potential energy, and the original high-resolution remote sensing image is used to define the pairwise potential energy.
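The mean field iteration can be sketched on a toy image as follows. This is a brute-force illustration that builds the full pixel-by-pixel kernel matrix, so it is only practical for very small images and is not the efficient filtering scheme used by real fully connected CRF implementations. The Potts label compatibility, the appearance and smoothness Gaussian kernels, and all weights and bandwidths (`w_app`, `w_smooth`, `theta_*`) are assumed values, not settings from the paper.

```python
import numpy as np

def dense_crf_mean_field(probs, image, n_iters=5, w_app=4.0, w_smooth=2.0,
                         theta_alpha=3.0, theta_beta=0.1, theta_gamma=1.0):
    """Refine per-pixel class probabilities with a tiny fully connected CRF."""
    h, w, n_classes = probs.shape
    n = h * w
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    feat = image.reshape(n, -1).astype(np.float64)
    # Squared distances between all pixel pairs in position and in colour.
    d_pos = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
    d_feat = np.sum((feat[:, None, :] - feat[None, :, :]) ** 2, axis=-1)
    kernel = (w_app * np.exp(-d_pos / (2 * theta_alpha ** 2)
                             - d_feat / (2 * theta_beta ** 2))
              + w_smooth * np.exp(-d_pos / (2 * theta_gamma ** 2)))
    np.fill_diagonal(kernel, 0.0)             # a pixel sends no message to itself
    flat = probs.reshape(n, n_classes)
    unary = -np.log(np.clip(flat, 1e-7, 1.0))
    q = flat.copy()
    for _ in range(n_iters):
        message = kernel @ q                  # support for each label from other pixels
        # Potts model: label l is penalised by the support for all other labels.
        pairwise = message.sum(axis=1, keepdims=True) - message
        energy = unary + pairwise
        energy -= energy.min(axis=1, keepdims=True)   # numerical stability
        q = np.exp(-energy)
        q /= q.sum(axis=1, keepdims=True)
    return q.reshape(h, w, n_classes)

# Toy scene: dark left half (class 0) and bright right half (class 1).
labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1
image = labels.astype(float)[..., None]       # single-band image
probs = 0.2 + 0.6 * np.eye(2)[labels]         # 0.8 for the true class
probs[3, 1] = [0.4, 0.6]                      # one isolated misclassified pixel
refined = dense_crf_mean_field(probs, image)
print(probs.argmax(-1)[3, 1], refined.argmax(-1)[3, 1])  # 1 0
```

In this toy case the isolated pixel is relabeled to match its spectrally similar surroundings while the boundary between the two halves is preserved, which mirrors the removal of isolated small patches discussed in this paper.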

Evaluation Indicators
In order to explore the advantages and disadvantages of different neural network models in cloud and snow identification, the accuracy criteria used in this paper are the Mean Intersection over Union (MIoU) and the Mean Pixel Accuracy (MPA) [26,27]. For each class, the intersection over union is the ratio of the intersection to the union of the predicted and true pixel sets; the MIoU is the average of this ratio over all classes. The MPA is the average over all classes of the proportion of correctly classified pixels in each class. Both indicators take values in the range of zero to one, with values closer to one representing better segmentation, and both are commonly used to verify the accuracy of neural network models. Therefore, these two indicators are used as the quantitative criteria in this paper. The expressions are as follows:

$$\mathrm{MIoU} = \frac{1}{K+1} \sum_{i=0}^{K} \frac{n_{ii}}{\sum_{j=0}^{K} n_{ij} + \sum_{j=0}^{K} n_{ji} - n_{ii}}$$

$$\mathrm{MPA} = \frac{1}{K+1} \sum_{i=0}^{K} \frac{n_{ii}}{\sum_{j=0}^{K} n_{ij}}$$

where there are a total of K + 1 label categories (K classes of objects and one other category) in the classified image; $n_{ii}$ is the number of correct predictions of class i; $n_{ij}$ is the number of class i pixels predicted as class j; and $n_{ji}$ is the number of class j pixels predicted as class i.
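Both indicators follow directly from a confusion matrix whose entry (i, j) counts the class i pixels predicted as class j. The sketch below is a minimal NumPy version with assumed helper names and a made-up ten-pixel example; it assumes every class occurs at least once, since otherwise a denominator would be zero.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of pixels of true class i predicted as class j."""
    idx = y_true.ravel() * n_classes + y_pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)

def miou_mpa(cm):
    """Return (MIoU, MPA) computed from a confusion matrix."""
    cm = cm.astype(np.float64)
    correct = np.diag(cm)                 # n_ii
    true_total = cm.sum(axis=1)           # sum_j n_ij
    pred_total = cm.sum(axis=0)           # sum_j n_ji
    iou = correct / (true_total + pred_total - correct)
    pa = correct / true_total
    return float(iou.mean()), float(pa.mean())

# 0 = background, 1 = cloud, 2 = snow (illustrative labels only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
miou, mpa = miou_mpa(cm)
print(cm.tolist())                        # [[3, 1, 0], [0, 2, 1], [1, 0, 2]]
print(round(miou, 4), round(mpa, 4))      # 0.5333 0.6944
```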

Experimental Environment
The experimental platform in this paper is an Intel(R) Core(TM) i7-9700F @ 3.0 GHz CPU, an NVIDIA GeForce RTX 2060 SUPER 8 GB graphics card and 16.0 GB of RAM. In terms of the software environment, Python is used as the main programming language under Windows 10, and the GPU computing library CUDA 11.0 is installed. The deep learning framework is TensorFlow 2.5.0 with Keras 2.3.1. During training, Adam is selected as the optimizer to update the network parameters, and the Softmax activation function is used to classify each pixel. The learning rate is set to 0.001, the batch size is 5 and the number of epochs is 200.

Experiments and Results
The number of samples, the sample size and the loss function all affect the accuracy of a semantic segmentation neural network model, and post-processing also affects the identification results. Generally, the smaller the number of samples, the more likely the model is to overfit. If the sample size is too small, the model cannot learn enough spatial semantic information and easily confuses snow and cloud, which have similar spectral characteristics; if the sample size is too large, the training time increases and the generalization ability decreases. The training accuracy also differs between loss functions. Therefore, this study analyzed the effects of different sample numbers, sample sizes and loss functions on the DeepLab v3+ model for cloud and snow identification, as well as the impact of CRF post-processing on identification accuracy, so as to provide a reference for selecting the optimal parameters of semantic segmentation neural network models for cloud and snow identification.

Sample Number Analysis
In order to investigate the optimal number of samples required for model training, 2000, 5000 and 10,000 samples were randomly taken from the 10,000 training and validation samples prepared above and input to the DeepLab v3+ model, in turn, for training. The 2000 samples were taken directly from the training and validation samples without data augmentation. The batch size was 5 and the number of epochs was 200, and neural network models for cloud and snow identification trained with different sample numbers were obtained. The curves of training loss and training accuracy over the training iterations are shown in Figure 3.
It can be seen from Figure 3 that the larger the number of samples, the smaller the fluctuation in the training loss and training accuracy, and the higher the stability of the model. When the number of samples is 2000, the training loss and training accuracy fluctuate greatly, and the model stability is very low. When the number of samples is 5000 and the number of iterations exceeds 100, the training loss and training accuracy are comparable to those obtained with 10,000 samples, but the stability of the model is still insufficient. When the number of samples is 10,000 and the number of iterations reaches 170, the training accuracy is high and the stability is strong. Therefore, 10,000 training samples are more suitable for training a cloud and snow identification model with high accuracy and stability.
Figure 4 shows the prediction maps for the test data obtained with the models trained on different sample numbers, and the prediction accuracies are shown in Table 3. As seen in Figure 4, when the sample numbers are 2000 and 5000, there are more misclassified cloud and snow pixels, and Snow IoU, Cloud IoU, Snow PA and Cloud PA, as well as MIoU and MPA, are relatively low. In addition, compared with the model trained on 2000 samples, the prediction accuracy of the model trained on 5000 samples is not improved, and Cloud IoU, MIoU and Snow PA are even somewhat lower. When the number of samples is 10,000, the misclassified pixels of cloud and snow are significantly reduced. The MIoU and MPA are 0.816 and 0.918, respectively, which are 0.066 and 0.061 higher than those obtained with 5000 samples. This is a significant improvement. In summary, the model training accuracy, stability and prediction accuracy are optimal when the number of samples is 10,000.

Sample Size Analysis
In order to analyze the appropriate sample size for cloud and snow identification in GF-1 WFV images using the DeepLab v3+ model, the previous 10,000 samples of 256 × 256 pixels were cut into 10,000 samples of 64 × 64 pixels and 10,000 samples of 128 × 128 pixels, respectively. The samples of each size were input to the DeepLab v3+ model for training in turn. The loss function was set to the Focal function, the batch size was set to 5 and the number of epochs was set to 200. The curves of training loss and training accuracy over the training iterations are shown in Figure 5.

As seen in Figure 5, in the early stage of training, the larger the sample size, the faster the model fits. As the number of iterations increases, the differences in training loss and training accuracy between the sample sizes gradually decrease. However, within 200 iterations, the model trained with 256 × 256 samples always has a better training loss and a higher training accuracy than the models trained with 64 × 64 and 128 × 128 samples. Its stability is also better, and it tends to be stable once the number of iterations reaches 170.
Figure 6 shows the prediction maps for the test data obtained with models trained on different sample sizes, and the prediction accuracies are shown in Table 4. As seen in Figure 6 and Table 4, when the training sample sizes are 64 × 64 and 128 × 128, cloud and snow are seriously misclassified in the prediction maps, and Snow IoU, Cloud IoU, Snow PA and Cloud PA, as well as MIoU and MPA, are relatively low; the MIoU and MPA are only 0.754 and 0.862. The prediction accuracy of the 128 × 128 size is not improved compared with that of the 64 × 64 size, and the Cloud IoU, MIoU, Cloud PA and MPA are even reduced to a certain extent; in addition, the classification map of the 128 × 128 size shows serious slicing traces. When the training sample size is 256 × 256, the prediction accuracy of the model is greatly improved, and the misclassified pixels of cloud and snow are significantly reduced. The MIoU and MPA reach 0.816 and 0.918, respectively, and the Cloud PA even reaches 0.934. An appropriate increase in sample size can therefore reduce misclassified pixels and improve the accuracy of the model, but it also makes model training slower and less efficient. In summary, when the sample size is 256 × 256, the training accuracy, stability and prediction accuracy of the model are the best of the three sizes tested.

Selection of Loss Function
To investigate how different loss functions affect the accuracy of the DeepLab v3+ model for cloud and snow identification, the CE, Dice and Focal loss functions were each used in the experiment, and 10,000 samples of 256 × 256 pixels were input to train the DeepLab v3+ models for cloud and snow identification. The batch size was set to 5, and the number of epochs was 200. The changes in training loss and training accuracy were recorded, as shown in Figure 7.
It can be seen from Figure 7 that the training loss curve of the Dice function converges faster and its loss value is smaller throughout training, but the training accuracies of the three loss functions are relatively close. In terms of the stability of training accuracy, the CE function is the most stable, but the difference from the Dice and Focal functions is not obvious.
Figure 8 shows the prediction maps for the test data obtained with models trained under different loss functions, and the prediction accuracies are shown in Table 5. From Figure 8 and Table 5, it can be seen that the model using the CE loss function misclassifies more snow pixels as cloud, and its slicing traces are obvious. Compared with the Dice and Focal functions, the prediction accuracy of the model using the CE function is also the lowest, with MIoU and MPA of only 0.741 and 0.827, respectively. The accuracy of the models using the Dice or Focal loss functions improves somewhat. In particular, because the Focal function increases the focus of the model on snow and cloud samples, the problem of the unbalanced number of samples of each category in the training set is alleviated. In the model prediction maps, the misclassified pixels of cloud and snow are significantly reduced, and the model prediction accuracy is significantly improved. The Cloud PA reaches 0.934 and the Snow PA reaches 0.891. The MIoU and MPA are higher than those of the CE and Dice loss functions. In summary, the training accuracies of the models using the CE, Dice and Focal functions are comparable, but the model using the Focal loss function has higher prediction accuracy and stronger generalization ability.

Conditional Random Field Post-Processing
To investigate how effectively CRF post-processing improves the accuracy of the DeepLab v3+ model for cloud and snow classification, the CRF model is used to post-process the classification results of the DeepLab v3+ model. The cloud and snow classification map predicted by the DeepLab v3+ model on the test data is taken as the unary potential energy of the conditional random field, and the GF-1 WFV image is used to form the pairwise (binary) potential energy. The mean field approximation method is used to iteratively minimize the energy function E(X). The smaller E(X) is, the more accurate the predicted pixel class labels X are, which results in a classification map with improved boundary accuracy, as shown in Figure 9.
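For reference, the energy minimized in this step can be written in the standard fully connected CRF form. This is a sketch that assumes the usual formulation with a Gaussian appearance kernel and a smoothness kernel; the weights w_1 and w_2, the bandwidths θ_α, θ_β and θ_γ, and the label compatibility function μ are assumptions here and may differ from the notation and settings used elsewhere in the paper.

```latex
E(X) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),
\qquad \psi_u(x_i) = -\log P(x_i)

\psi_p(x_i, x_j) = \mu(x_i, x_j) \left[
    w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2}
                     -\frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2} \right)
  + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2} \right)
\right]
```

Here P(x_i) is the class probability that the DeepLab v3+ model assigns to pixel i, p_i is the pixel position, and I_i is the spectral vector of pixel i in the GF-1 WFV image. Under this form, nearby pixels with similar spectra are penalized for taking different labels, which is what sharpens boundaries and removes isolated patches.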

Figure 9 compares the prediction maps before and after CRF post-processing, and Figure 10 compares their local details. Figures 9 and 10 show that the DeepLab v3+ model misidentifies some isolated small patches of snow as cloud, and that the predicted snow boundaries are overly smooth and deviate from the true snow cover. In addition, the semantic segmentation neural network classifies the image slice by slice and then splices the classified slices together. Because each slice is classified using only its own context, adjacent slices can yield different predictions along their shared boundary, which leaves slicing traces in the final spliced classification map. After CRF post-processing, the snow patches misclassified as cloud are correctly relabeled as snow, and the snow cover boundaries are finer and match the true ground objects more closely; at the same time, the slicing traces and isolated small patches are also eliminated.
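The origin of the slicing traces can be reproduced with a toy example. The Python sketch below slices a small image into tiles, classifies each tile separately and splices the results. The helper names and the threshold classifier `above_tile_mean` are hypothetical stand-ins for the network, chosen only because their output depends on the content of the whole tile.

```python
def tile_origins(length, tile, stride=None):
    """Start offsets so that windows of size `tile` cover [0, length)."""
    stride = stride or tile
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # align the last window with the far edge
    return starts

def slice_and_splice(image, classify, tile):
    """Classify an image tile by tile and write each result back in place."""
    rows, cols = len(image), len(image[0])
    out = [[None] * cols for _ in range(rows)]
    for r0 in tile_origins(rows, tile):
        for c0 in tile_origins(cols, tile):
            patch = [row[c0:c0 + tile] for row in image[r0:r0 + tile]]
            labels = classify(patch)
            for dr, label_row in enumerate(labels):
                out[r0 + dr][c0:c0 + len(label_row)] = label_row
    return out

def above_tile_mean(patch):
    """Toy classifier: label 1 where a pixel exceeds the mean of its patch."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    return [[1 if v > mean else 0 for v in row] for row in patch]

# A 2 x 4 brightness ramp, split into two 2 x 2 tiles.
image = [[1, 2, 3, 4], [1, 2, 3, 4]]
tiled = slice_and_splice(image, above_tile_mean, tile=2)
whole = above_tile_mean(image)
```

In this sketch, `tiled` and `whole` disagree on either side of the tile boundary even though the input varies smoothly, which is the same kind of seam that appears in the spliced classification maps.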
To quantitatively analyze the effect of CRF post-processing on the accuracy of cloud and snow identification, the MIoU and MPA of the classification maps before and after CRF post-processing were calculated and are shown in Table 6. The Snow IoU, Cloud IoU and Cloud PA, as well as the MIoU and MPA, are all improved: the MIoU and MPA rise from 0.816 and 0.918 to 0.836 and 0.941, gains of 0.020 and 0.023, respectively. In summary, the combined DeepLab v3+ and CRF model can effectively correct misclassification problems such as blurred boundaries, slicing traces and isolated small patches, thus further improving the accuracy of cloud and snow identification.
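The metrics reported above can be computed from a confusion matrix. The following Python sketch assumes the common definitions, in which the PA of a class is the fraction of its true pixels that are predicted correctly and the IoU is the intersection of the predicted and true pixels divided by their union; the function name and the pixel counts are illustrative assumptions and are not values from Table 6.

```python
def segmentation_metrics(conf):
    """Per-class PA and IoU plus their means.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    Every class is assumed to occur at least once in the reference labels.
    """
    n = len(conf)
    pa, iou = [], []
    for c in range(n):
        tp = conf[c][c]
        true_total = sum(conf[c])                       # TP + FN
        pred_total = sum(conf[r][c] for r in range(n))  # TP + FP
        pa.append(tp / true_total)
        iou.append(tp / (true_total + pred_total - tp))
    return {"PA": pa, "IoU": iou, "MPA": sum(pa) / n, "MIoU": sum(iou) / n}

# Made-up counts; rows and columns are (background, cloud, snow).
conf = [
    [80, 10, 10],
    [5, 90, 5],
    [10, 10, 80],
]
metrics = segmentation_metrics(conf)
```

Because the IoU also penalizes false positives, it is never larger than the PA of the same class, which is consistent with the MIoU values in this paper being lower than the corresponding MPA values.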

Discussion
When conducting image semantic segmentation experiments, it is preferable to use authoritative public datasets. Tian et al. (2019) summarized some common public datasets for image semantic segmentation [28]. PASCAL VOC 2012 is one of the public standard datasets commonly used in the field of computer vision [29], and many scholars have studied the effectiveness and generalization of models using public datasets [30–32]. However, because snow and cloud change frequently over time, there are few publicly available cloud and snow labeling datasets with high spatial resolution. Therefore, the training datasets used in this paper were all produced by manual visual annotation. Since annotating deep learning datasets is time-consuming and labor-intensive, and the number of samples is often insufficient, many scholars have used data augmentation to increase the amount of sample data, with operations including horizontal flips, vertical flips, diagonal mirroring and random scaling [33,34]. In this paper, various data augmentation operations are also used to increase the number of samples in the labeled dataset, reduce the overfitting caused by the small number of samples and improve the robustness of the model. The experimental results for different sample numbers in Section 3.1 also demonstrate that increasing the sample number by data augmentation can improve the identification accuracy of the model. Wieland et al. (2019) achieved an accuracy of 0.89 for cloud and snow identification in multi-spectral satellite images based on an improved U-Net convolutional neural network [35]. The Fmask 4.0 algorithm proposed by Qiu et al. (2019) has an overall accuracy of 0.924 for cloud identification in Landsat 4-7 images [36]. In the tests of this paper, the accuracy (MPA) for cloud and snow identification using only the DeepLab v3+ neural network is 0.918.
However, as seen from the prediction maps above, some misclassification problems remain, such as blurred boundaries, slicing traces and isolated small patches. The CRF model can capture fine-grained information and infer the output class of a target pixel by combining it with nearby pixels, which a convolutional neural network focusing on local information does not achieve. Some scholars have previously used the CRF to extract target features in remote sensing images, and their results show that the CRF model can improve the accuracy of the segmentation results [37,38]. In this paper, CRF post-processing is applied to the prediction maps of the DeepLab v3+ model to further improve the pixel accuracy. The accuracy reaches 0.941, which is 0.023 higher than before CRF post-processing and 0.051 and 0.017 higher than the accuracies reported for the U-Net and Fmask 4.0 models, respectively, and the misclassification problems of blurred boundaries, slicing traces and isolated small patches are corrected. This further demonstrates that CRF post-processing can effectively refine the boundaries of cloud and snow and improve the segmentation accuracy. Therefore, it is feasible to combine the DeepLab v3+ and CRF models for cloud and snow identification in high-resolution remote sensing images.

Conclusions
Snow index algorithms are difficult to apply to cloud and snow identification in high-resolution remote sensing images that lack a short-wave infrared band, and semantic segmentation neural network models are prone to producing blurred boundaries, slicing traces and isolated small patches. To address these problems, this paper explores the feasibility and optimal parameter selection of the DeepLab v3+ and CRF combined model for cloud and snow identification in high-resolution remote sensing images through comparative experiments on different sample numbers, sample sizes, loss functions and CRF post-processing using GF-1 WFV images. The main conclusions are as follows:
(1) When the DeepLab v3+ model is used to identify cloud and snow in GF-1 WFV images with 10,000 samples, a sample size of 256 × 256 and the Focal loss function, the model has the optimal accuracy and strong stability, with the MIoU and the MPA reaching 0.816 and 0.918, respectively.
(2) For cloud and snow identification, CRF post-processing significantly reduces misclassification problems such as blurred boundaries, slicing traces and isolated small patches produced by the semantic segmentation neural network model. Compared with the prediction maps without post-processing, the MIoU and MPA improve to 0.836 and 0.941, respectively, which demonstrates the effectiveness of the post-processing method.
(3) The DeepLab v3+ and CRF combined model has high accuracy and strong feasibility for cloud and snow identification in high-resolution remote sensing images. The conclusions can provide a technical reference for applying deep learning algorithms to high-resolution snow mapping and hydrology.
Sample accuracy is a key factor affecting the prediction results of a semantic segmentation model. The manual labeling accuracy of cloud and snow samples is strongly affected by human factors; in particular, manually labeling snow in shadows is difficult and has limited accuracy. Therefore, this paper treats both mountain shadows and cloud shadows as background. This treatment has limitations and reduces the accuracy of cloud and snow identification. How to reduce the influence of human factors on sample accuracy and improve the accuracy of cloud and snow identification, especially in shadow areas, is therefore a direction for further research. The authors will next try a weakly supervised learning method to identify cloud and snow in high-resolution remote sensing images to reduce the impact of human factors.