A Self-Trained Model for Cloud, Shadow and Snow Detection in Sentinel-2 Images of Snow- and Ice-Covered Regions

: Screening clouds, shadows, and snow is a critical pre-processing step in many remote-sensing data processing pipelines that operate on satellite image data from polar and high mountain regions. We observe that the results of the state-of-the-art Fmask algorithm are not very accurate in polar and high mountain regions. Given the unavailability of large, labeled Sentinel-2 training datasets, we present a multi-stage self-training approach that trains a model to perform semantic segmentation on Sentinel-2 L1C images using the noisy Fmask labels for training and a small human-labeled dataset for validation. At each stage of the proposed iterative framework, we use a larger network architecture in comparison to the previous stage and train a new model. The trained model at each stage is then used to generate new training labels for a bigger dataset, which are used for training the model in the next stage. We select the best model during training in each stage by evaluating the multi-class segmentation metric, mean Intersection over Union (mIoU), on the small human-labeled validation dataset. This effectively helps to correct the noisy labels. Our model achieved an overall accuracy of 93% compared to the Fmask 4 and Sen2Cor 2.8, which achieved 75% and 76%, respectively. We believe our approach can also be adapted for other remote-sensing applications for training deep-learning models with imprecise labels.


Introduction
Satellite imagery provides a vital source of information for a wide range of remotesensing applications. However, such imagery is often contaminated with cloud and cloud shadows. Screening cloud and shadows is a critical preprocessing step that needs to be undertaken before any meaningful analysis can be performed on these images.
Traditional algorithms use threshold functions on spectral values of the image to generate cloud and shadow masks. In addition to this, metadata, such as solar zenith and azimuth angles, and elevation, are also used in computing these masks. The Fmask algorithm is widely accepted in the remote sensing community as a reliable method for generating cloud and cloud shadow masks [1]. Though the initial version of the Fmask was developed for Landsat-5 and Landsat-7 imagery, subsequent publications adapted the algorithm for use in Landsat-8 and Sentinel-2 imagery and proposed further improvements [2,3]. CFmask, a derivative of the Fmask algorithm, is used by the U.S. Geological Survey (USGS) in their production environment. The Sen2Cor process used by the European Space Agency (ESA) to process its Sentinel-2 products generates a scene classification map that detects clouds, shadows, water and snow [4]. However, the results of these algorithms are known to be erroneous when the underlying surface is brightly colored, for example, in the case of ice and snow, white building tops, salt lakes, etc. The detection of shadows is not reliable on darker surfaces. Water bodies are often misclassified as shadows and vice versa. In addition to Fmask and Sen2Cor, several texture-based and threshold-based approaches have been proposed [5][6][7]. Early attempts at using data-driven methods, such as support vector machines [8,9], random forest [10,11] and Markov random fields [12,13], for this task, have shown improvements over the traditional rule-based methods. In this paper, we propose a deep-learning-based model that specializes in cloud detection in polar and high mountain regions that are snow-and ice-covered for most parts of the year.
Over the past decade, deep convolutional neural networks (DCNNs) have made giant leaps in the field of computer vision and have found application in domains that require very high precision, such as medical imaging and autonomous driving [14,15]. In contrast to single pixel classification methods, DCNNs learn spatial relationships between the neighboring pixels in the image. Hence, the network can make use of texture and structural information, in addition to the spectral properties, to distinguish objects in the scene-this contributes towards improved detection performance. DCNNs are typically trained using large amounts of correctly labeled data. However, labeled data is very expensive because it is time-consuming to produce.
The open access policy of the Copernicus programme has led to an abundant use of optical Sentinel-2 data. No cloud mask is provided with this optical data that reliably detects clouds over bright targets or shadows over dark targets. While there are a few datasets for cloud detection, they were unsuitable for direct use in our multi-class segmentation problem. In this study, we propose a self-training framework in order to train our model. The trained model in our work segments a given Sentinel-2 L1C scene into six classes: No-Data, Clear-Sky Land, Cloud, Shadow, Snow and Water. We use a large dataset with labels generated using the Fmask algorithm for the training, and a small human-labeled dataset for validation. The validation dataset contains numerous examples where the Fmask classification has given incorrect labels.
The trained model, when compared with widely used automatic methods, such as Fmask 4 and Sen2Cor 2.8, on our test dataset, showed significant improvement in various segmentation metrics. Interestingly, the model achieved better performance than its teacher, Fmask, that was used to automatically generate the labels for training.

Self-Training
Given the scarcity of large, labeled datasets required to train the increasingly complex neural network models, self-training has recently attracted attention in the research community [16][17][18]. Self-training refers to a learning paradigm where a base model, which was trained using a smaller labeled dataset, is used to generate the pseudo-labels for the unlabeled data that will be used for training another model [19]. Xie et al. proposed a self-training method that achieved a 2% improvement over the state-of-the-art model on an Imagenet classification task using weakly labeled images [18]. Babakhin et al. used a similar iterative training method for segmenting salt deposits in seismic images [20] and Chen et al. applied a similar iterative training method for scene segmentation from video sequences [21]. In each of these papers, small clean labeled datasets were used in combination with a large unlabeled dataset to train the model. In our research, we used a labeled dataset generated using the Fmask algorithm and a large unlabeled dataset for training. Unlike the previous papers, the Fmask algorithm did not always provide clean labels. Hence, we used different regularization techniques in our training to ensure that the model was robust to label noise in the training dataset. Yilmaz and Heckel showed that deep neural networks, trained using an early stopping strategy, fit clean labels faster than noisy labels [22]. Experiments performed on 10%-to 50%-correctly labeled training datasets have produced higher test performance than standard training on a clean dataset for the Imagenet and CIFAR-10 classification tasks.

Cloud Detection Algorithms
Early cloud-detection algorithms relied heavily on feature engineering to produce accurate cloud detection models [8,23,24]. The Landsat 8 Cloud Cover Assessment (CCA) dataset, designed to test the effectiveness of cloud detection algorithms, has helped in the development of new deep-learning-based methods for Landsat imagery [25][26][27]. Scene classification using deep neural networks is treated as a semantic segmentation task, and several variants of the well-established encoder-decoder architectures, U-Net and Fully Convolutional Network (FCN), have been proposed. Some of the pioneering studies in this domain for Landsat-8 imagery include RS-Net [28], Cloud-Net [29], and MF-CNN [30]. Similarly, for images from the GaoFen satellite series operated by the China National Space Administration (CNSA), Zhan et al. used an FCN with a multiscale prediction module to distinguish clouds and snow [31], and Yan et al. implemented a multilevel feature-fused segmentation network called MFFSNet, to perform cloud and cloud shadow detection [32]. Recently, several improvements, such as attention mechanisms [33,34], and novel convolution techniques [35], have contributed to advancing the state-of-the-art in this domain. In contrast to the methods discussed above, which rely on manually labeled, pixellevel annotations, Li et al. proposed a weakly supervised method, trained using block-level labels that indicate the presence or absence of clouds in a given block [36]. The network was used to generate cloud activation maps, which were subsequently segmented, using a statistical threshold based on clear-sky values, to obtain pixel-level cloud detection output.
Liu et al. proposed a neural network with a residual architecture called CloudNet to predict clouds and haze from Sentinel-2 imagery [37]. This method uses Sen2Cor corrected data from Taiwan for training and primarily functions as a cloud-detection algorithm. Li et al. used a lightweight network, trained using a manually labeled dataset over diverse land cover and climatic conditions in mainland China, for performing cloud detection [38]. In contrast to these methods, we use data from polar and high mountain regions and address a more challenging multi-class segmentation problem.
Hughes and Kennedy proposed a fully convolutional network that segmented Landsat-8 imagery, and Hollstein et al. used a decision-tree model that segmented Sentinel-2 imagery into the same classes as our investigation [10,39].

Dataset
The Sentinel-2 L1C data used for this study was downloaded from the Copernicus Open Access Data Hub. The data was captured by the Sentinel-2A and Sentinel-2B satellites using their onboard multi-spectral instrument (MSI) that captures images in 13 frequency bands. The description of the bands used in this study is provided in Table 1. The L1C data contains the top-of-atmosphere (TOA) reflectance values and each scene covers an area of 100 km × 100 km.
The sensing period for the images used in our study was between October 2019 and December 2020. We used 96 scenes in the training dataset, 22 scenes in the validation dataset and 23 scenes in the test dataset. The sites of these scenes were randomly selected in the polar and tundra regions. We also included three sites in mountainous terrain that were not in the polar and tundra region. The geographic distribution of the data is shown in Figure 1. The tile-wise quarterly distribution of the Sentinel-2 scenes for the training, validation and test datasets is provided in Tables A1, A2 and A3, respectively. Table 1. Description of the spectral bands from Sentinel-2A [40] that were used as input in our study. The spectral bands of its twin satellite, Sentinel-2B, also had the same resolution for the respective bands and only varied in the central wavelength by a small fraction. Data from both Sentinel-2 satellites were used in our study without any distinction regarding their source.  The six classes that we used in our model: No-Data, Clear-Sky Land, Cloud, Shadow, Snow and Water, were the same class labels as were used in the segmentation output of the F-Mask algorithm. The No-Data class was used to indicate pixels in the image that were saturated or defective and regions along the edge of the image where sensor data was not available. Such pixels can be identified using the quality indicator information provided along with the Sentinel-2 imagery. Hence, we did not include this class in the labeled dataset. The Clear-Sky Land class was used for snow-and cloud-free land, which was usually rocky surfaces in the sites used for this study. The Shadow class was used for all types of shadow, irrespective of the object that cast the shadow, or the underlying surface on which the shadow fell. The samples in the validation set were selected so that many examples where the Fmask algorithm produced an incorrect label (as per our visual assessment) were present. The labeling for the validation and test dataset was performed manually using the QGIS software [41] package. In addition to the true color image, the short-wave infrared band (B11) and the near infrared band (B08) were used in our labeling process. Labeling of very small targets was performed occasionally using a thresholding operation in the local neighborhood of the target.

Band
The Fmask labels were generated using Fmask 4.3. According to its authors, this version of Fmask offers substantially better cloud, cloud shadow, and snow detection results compared to the previous versions for Sentinel-2 imagery [2,3,42]. The cloud buffer distance, and the cloud shadow buffer distance were set to 0 in the Fmask configuration settings. All other configuration settings were set to their default values, and the Fmask output was computed at 20 m (5490 × 5490 pixels) resolution.
We also compared our model performance against the results from Sen2Cor 2.8. The Sen2cor algorithm is part of the European Space Agency's (ESA) processing pipeline, used for generating the L2A images [4]. This algorithm produces a scene classification (SCL) image, in addition to various quality indicators and bottom-of-atmosphere (BOA) images. The resolution of the SCL was set to 20 m and no additional digital elevation model (DEM) was provided. While the Fmask segmented the image into the same six classes that we used for training our model, Sen2Cor provided a more comprehensive class organization. For example, the Sen2Cor scene classification mask has four classes for clouds: cloud high probability, cloud medium probability, cloud low probability and thin cirrus clouds. To perform the comparison, we combined multiple classes from the Sen2Cor mask into the six classes used in our model, as shown in Table 2. As shown in Table 1, the spectral bands in the Sentinel-2 L1C product are provided in different resolutions. We resampled all the bands to a resolution of 20 m using bicubic interpolation. Due to hardware limitations, it was not possible to train the network with the resampled image of size 5490 × 5490. Hence, we split the images into sub-scenes of size 254 × 254 pixels for training. Since the optical sensor data from the satellite may not always be aligned to fill the complete image, the edges of the image tend to have zero-valued pixels, i.e., no data. After splitting the image into sub-scenes, we discarded those sub-scenes in which all pixels were zero-valued. We applied zero-padding of one pixel around each sub-scene, which increased the network input size to 256 × 256. Since the network did not have any fully connected layers (which requires a fixed input size), we were able flexibly to use a larger sub-scene size for prediction. In the prediction step, we used overlapping sub-scenes of size 510 × 510, and we also applied zero-padding, as in the training step. Before stitching together the network output from each sub-scene, we clipped out a three-pixel border around the sub-scene to eliminate the uncertain network predictions from the zero-padding [39].
The input to the network consisted of seven channels: the Sentinel-2 RGB bands (B02, B03, B04), near infrared band (B08), the short-wave infrared bands (B11, B12) and the normalised difference snow index (NDSI). The NDSI is a well-known index that exploits the difference in the spectral reflectance of the green band (B03) and shortwave infrared band (B11) to detect snow [43]. The NDSI is computed pixel-wise as follows:

Neural Network Architecture
The U-Net [14] and its variants are widely used neural network architectures for semantic segmentation. The network displayed in Figure 2 consists of an encoder-decoder architecture. Each encoder block operates at a spatial resolution twice that of its succeeding encoder but has half the number of convolution filters compared to its succeeding encoder.
At each encoder and decoder block, there are two sequences, comprising a convolution layer, a ReLU layer, and a batch-normalization layer connected in series, followed by a spatial dropout. The convolution layers used here had a kernel size of 3 × 3 and a stride of 1.
The output of the encoder takes two paths. In one path, the output of the spatial dropout is downsampled by a factor of two in the spatial dimensions using a max-pool layer, and is provided as input to the next encoder. The other path, referred as the skip connection, is connected to the decoder block that operates, at the same resolution, on the other side of the network.
At each decoder, two inputs are received and these two inputs have to be concatenated as a single input before applying the layers of the decoder. The input received from the preceding layer is upsampled using a 2 × 2 up-convolution operation with a stride of 2, which results in an increase in spatial dimensions by a factor of two, and a decrease in the number of feature maps by a factor of two. The upsampled output is then concatenated with the input received from the encoder of the corresponding step via the skip connection (denoted as a blue box in the decoder block in Figure 2). The decoder layers are then applied to the concatenated output and the result is provided as the input to the next decoder.
This sequence is repeated until the final decoder that operates at the same spatial resolution as the input. At the final decoder, an additional 1 × 1 convolution, and a softmax layer, is applied to obtain the class probabilities for each pixel of the image. The network output has the same spatial dimensions as the input and has six channels, where each channel contains the probability map for the respective class. The segmentation map can be generated by selecting the class with the highest probability at each pixel position.

Self-Training Framework
We used a four-stage iterative approach to train our model. The network architecture and training data organization of the iterative approach is summarized in Table 3. The training dataset, consisting of sub-scenes of size 254 × 254 pixels, was divided randomly into four equal batches. In the first stage, a deep neural network with a modified U-Net architecture was trained using batch-1 of the training dataset. The imperfect labels used in this stage were generated automatically using the Fmask algorithm. We evaluated the model performance on the validation dataset with human-labeled data after each training epoch, and, importantly, the human-labeled data was never used as training data. Based on the performance metrics, we selected the model from the best epoch, and called the selected model the teacher model. Using the teacher model, we inferred labels for the data in batch-2 of the training dataset. We used only the labels for which the network confidence was greater than a threshold of 0.33 (twice the probability of a random guess). We assigned the label as 0 (No-Data label) to pixels where the confidence was lower than this threshold. In the second stage, we trained another neural network model, called the student model, using a combined dataset of batch-1 and batch-2. For the batch-1 data, we continued to use the Fmask labels exactly as in the previous stage. Once we had selected the best student model from stage 2 using the validation dataset, we made this student model our new teacher model. Using the new teacher model, we inferred new labels for batch-2 and batch-3 data. In each subsequent stage, we trained a new student model with a combined dataset of Fmask labels for batch-1, and model-inferred labels from the teacher model, i.e., the best model of the previous stage, for the remaining batches, as shown in Table 3.
A new batch of the data containing previously unseen images was available for training the model at each stage. Hence, the size of the training data increased linearly at each stage. A smaller model was used for the training in the first stage, and the model size incrementally increased at each stage with increase in the training data. This was done to make sure that the model did not overfit on the smaller training data in the early stages. Hence, we defined two hyperparameters, number of start filters and depth, which we used to control the model size. The number of start filters was the number of filters used in the convolution layer present in the first encoder of the network. The number of filters in the subsequent encoders was defined as multiples of this parameter. The depth of the network set the number of encoder blocks used in the network. A network architecture with 32 start filters and a depth of 5 is illustrated in Figure 2. Table 3. Network architecture and training data used at each stage of the iterative approach. The number of parameters in the network at each stage was controlled using the number of start filters and depth hyperparameters. The size of the training dataset given below was the number of 254 × 254 image patches.

Network Architecture
Training Data

Regularization Techniques
We employed the regularization techniques described below at each stage of the self-training framework to aid in generalization of the trained model.
We used an online augmentation strategy where we performed augmentation "on the fly" before the data was provided as input to the network. The training dataset already consisted of a large number of data samples, and using the online strategy helped us achieve diversity across each training epoch without an explosive increase in the dataset size. The augmentations used included cutout [44], horizontal flip, vertical flip, and rotation in steps of 90 • . These transformations were applied on randomly selected input samples and more than one transformation could be applied to the same image.
Unlike for fully connected layers, it has been shown that element-wise dropout is not very effective in the case of convolutional layers. The spatial dropout used in this study set entire feature maps of the output to zero instead of just individual pixels [45]. The application of spatial dropout leads to learning from an ensemble of sub-networks consisting of randomly selected feature maps and hence helps in regularizing the trained model.
Each stage of the self-training framework was trained for 100 epochs. An early stopping strategy that stopped the training if the average class accuracy on the validation dataset did not improve after 10 epochs, was employed. The Adam optimizer was used for the training with a weight decay of 1 × 10 −5 . The weight decay resulted in 2 norm regularization for the network weights.

Loss Function
As shown in Table 4, the training data had a class imbalance problem. This could have resulted in the network predicting the abundant class more often than the classes that were scarce. Hence, we used a weighted cross-entropy function as the loss function for the training. The weights for each class were calculated using the median frequency balancing (MFB) method [46]. It has been shown that this technique is effective in semantic segmentation of small targets in remote-sensing images [47]. The weights were calculated using the frequency of the class in the training dataset: Let the set C = {1, 2, . . . , M} denote the set of the M classes in the given segmentation problem. The frequency of class i, denoted as f i , is the ratio of the number of pixels that are labeled as class i to the total number of pixels. The median of the class frequencies in the set C is given by median({ f i |i ∈ C}). The weight w i for class i is given by: The target label used for training has a one-hot encoding, i.e., it is a vector whose elements are all zero except for one element corresponding to the true class set to one. The weighted cross entropy loss is given by: where N is the number of training samples (pixels), w i is the weight of class i calculated using the MFB method, y i is the target label at the dimension corresponding to class i for the training sample j, andŷ (j) i is the softmax output of the model at the dimension corresponding to class i.

Validation Metrics
The performance metrics used for the model evaluation were computed from a pixellevel confusion matrix: Consider the confusion matrix for an M-class segmentation problem shown in Figure 3. Each element in the matrix, denoted by n i,j , is the total number of pixels that belong to class i and predicted as class j by the model. For class i, are the number of true positives, false positives and false negatives, respectively. Precision or user accuracy for class i is defined as follows: i.e., the ratio of the number of pixels that were correctly predicted as class i to the total number of pixels that were predicted as class i. Recall or producer accuracy for class i is defined as follows: i.e., the ratio of the number of pixels that were correctly predicted as class i to the total number of pixels that actually belong to class i, i.e., the ground truth. Cases of over-segmentation and under-segmentation are often misinterpreted as good results when either recall or precision is analyzed individually. Hence, in semantic segmentation tasks, metrics such as the F1 Score and the Intersection over Union (IoU) are the preferred evaluation metrics. The F1 score is the harmonic mean of the precision and recall: Given the set of pixels that are predicted to be a particular class by the model, and the set of pixels that actually belong to that class (ground truth), IoU is defined as the ratio of the size of the intersection to the size of the union of these two sets. The IoU is expressed in terms of the number of true positives, true negatives and false positives as follows: The overall performance of the model can be evaluated using the mean Intersection over Union (mIoU) and the total accuracy. The mIoU for a segmentation task with M classes is defined as follows: i.e., the mean of the IoU metric computed over each class. The total accuracy is then defined as follows: i.e., the ratio of the number of correctly labeled pixels to the total number of pixels.

Results
We compared the performance of our model with two widely used methods for cloud masking, Fmask and Sen2Cor.
The confusion matrices comparing the prediction of Fmask, Sen2Cor and our model with the true labels in the test dataset are presented in Tables 5, 6 and 7, respectively. Table 8 shows that our model was able to achieve high precision and recall simultaneously. In the case of Fmask 4 and Sen2Cor 2.8, the high values of precision or recall for certain classes, such as the class Clear-Sky Land, were accompanied by lower values of recall or precision, respectively. Therefore the overall segmentation performance suffered. In contrast, our model outperformed both these methods in this regard; this is clearly reflected in the F1 score and IoU metrics shown in Table 9 and in Figure 4. We also observed that the performance of the class Water was slightly better for Fmask compared to our model.

Discussion
In our study, we showed that, even with a small manually labeled validation dataset, the self-training framework enabled us to train a segmentation model using noisy labels from the Fmask algorithm. Our model outperformed two widely used cloud-screening methods, Sen2Cor and Fmask, and can be considered a better alternative to its teacher, the Fmask algorithm. The results showed that the model performed particularly well for the Cloud and Snow classes, which were the two prominent classes observed in the geographical sites that we used in our study.
Shadows in satellite imagery can be classified based on their source as cloud shadows or topographic shadows. Topographic shadows are the shadows that are cast by topographic features, such as mountains, and are static features that depend on acquisition geometry and solar position. In contrast, cloud shadows are dynamic features whose location and representation also depend on the prevailing meteorological conditions during image acquisition. The spectral characteristics of cloud shadows and topographic shadows are similar [48]. Topographic shadows can be detected accurately using digital elevation models (DEM) and solar angles. However this information is not provided to the network explicitly. The Fmask algorithm also makes use of the image metadata, such as solar zenith and azimuth angles, along with the cloud detections, in order to predict cloud shadow pixels. However, visual inspection of the Fmask results showed that they were not always very accurate. In Figure 5, both Sen2Cor and Fmask tended to incorrectly label topographic shadows as Water. Our model offered an improvement in this regard, correctly labelng them as shadow in many instances. A suitable post-processing technique using the metadata and DEM can be applied to separate the cloud shadows and topographic shadows.
In Table 10, we compare the performance metrics for different stages of the selftraining framework. We observe that the performance of the model improved over the first three stages of the self-training framework and subsequently saturated in the last stage. We attribute this pattern to improvement in the quality of data labels as a result of our framework. The performance on the class Water did not improve across the different stages. We know that shadows and water tend to have dark-colored pixels and it is often difficult to distinguish between them by visual inspection. We believe the network infers that the class Water and the class Shadow are similar, and, in cases of ambiguity, prefers Shadow over Water due to higher loss function weights used during training. As shown in Figure 6, the network focused on improving the performance in the shadow class at each stage and this improvement was also reflected in the performance metrics.
We also trained two models using the conventional supervised training approach. The first model was trained using the same training dataset, but using the noisy Fmask labels, while the second model was trained with the small human-labeled dataset that was previously used for model selection in our proposed framework. The network architecture used was the same as the architecture from the model in Stage-4; we also used all regularization techniques discussed previously for training this model. Hence, the key difference between the supervised models and the self-trained model (Stage-4) was the labels used for the training. We present the performance metrics for these two models and our self-trained model in Table 11. The results for the clean supervised model show that the limited number of labeled examples in our validation dataset was insufficient to generalize the task effectively. We believe that the large noisy dataset offered greater data diversity over the smaller clean dataset, and that his helped the noisy model to outperform the clean model. One might be inclined to argue that the improvement in performance across the stages of the self-training framework was due to the increase in the number of parameters used in the network architecture. However, the performance of the noisy supervised model demonstrated that a larger network alone does not guarantee better performance. We observed that the improvement in data labels achieved through a self-trained framework enabled the stage-2 and stage-3 models to outperform the supervised model despite their smaller network size. Hence, we believe our approach has great potential for training models with noisy labels.
Jeppesen et al. demonstrated the capability of neural networks to learn from noisy Fmask labels in their RS-Net model trained for cloud detection in Landsat Imagery using the supervised learning approach [28]. Our study can be viewed as a next step in this research direction, and we also further extend this approach for the multi-class segmentation task. Similar to our research, Li et al. offered a solution for addressing the difficulty of obtaining large pixel-level annotated datasets for training neural networks [36]. The block-level annotated dataset used for training their weakly supervised model can potentially simplify the annotation process in comparison to the more laborious pixel-level annotation. However, it may prove to be challenging to adapt this approach for multi-class segmentation, particularly for shadow detection. We used the simple, yet effective, U-Net architecture in our self-training framework. However we note that the network used in our framework can be easily adapted to take advantage of recent advances in network architectures and training techniques, which have shown improved cloud-detection capabilities in supervised models.  Figure 6. Comparison of test image results from different stages of the self-training framework and the results from the models trained using supervised training, from Tile 04WDV in Alaska, U.S.A, captured on 27 October 2020.

Conclusions
Our study demonstrates the effectiveness of self-training neural networks in the Earth observation domain. The key challenge in this study was to train the model using the noisy labels from the Fmask cloud detection algorithm. Many other remote-sensing applications face the same difficulty with existing, but imprecise, training data, that often limits the deployment of deep-learning technologies.
Even though the proposed method offers better performance compared to Fmask and Sen2Cor, we believe that the performance can be improved further. As well as the detection of shadows, the detection of thin clouds, particularly those above water bodies, can be improved. The clear-sky land class is the most heterogeneous of the six classes in this investigation. When training a similar model that can be applied to all geographical environments, this class is expected to pose some challenges. The use of other indices in addition to NDSI would allow advantage to be taken of the domain knowledge that has been acquired as a result of years of research.
Our classification strategy has the potential for more nuanced class separation and for integration into information retrieval algorithms. The use of other existing masks or data sets (e.g., ocean, lakes or vegetation) might lead to even higher precision on specific problems or over certain surfaces.