Machine Learning for Cloud Detection of Globally Distributed Sentinel-2 Images

: In recent years, a number of different procedures have been proposed for segmentation of remote sensing images, basing on spectral information. Model-based and machine learning strategies have been investigated in several studies. This work presents a comprehensive overview and an unbiased comparison of the most adopted segmentation strategies: Support Vector Machines (SVM), Random Forests, Neural networks, Sen2Cor, FMask and MAJA. We used a training set for learning and two different independent sets for testing. The comparison accounted for 135 images acquired from 54 different worldwide sites. We observed that machine learning segmentations are extremely reliable when the training and test are homogeneous. SVM performed slightly better than other methods. In particular, when using heterogeneous test data, SVM remained the most accurate segmentation method while state-of-the-art model-based methods such as MAJA and FMask obtained better sensitivity and precision, respectively. Therefore, even if each method has its speciﬁc advantages and drawbacks, SVM resulted in a competitive option for remote sensing applications.


Introduction
Accurate land cover classification of multispectral remote sensing images is of paramount importance to fully exploit their informative content. The correct detection of clouds, shadows, snow or ice, is critical for many activities such as image compositing [1], correction for atmosphere effects [2,3], calculation of vegetation indices [4], classification of land cover [5] and change detection [6]. In recent years, different strategies have been proposed especially for cloud segmentation. In particular, two different strategies can be distinguished [7]: (i) model-based approaches, in this case pixels are assigned to a specific category by evaluating the available reflectances and using an a priori physical model to determine proper thresholds; (ii) learning approaches, these strategies exploit supervised algorithms to learn statistical models able to maximize the separation of pixels belonging to different classes and minimize classification errors.
Model-based approaches use specific physical constraints to separate different classes according to spectral bands; these constraints are usually formalized by proper combining the bands Table 1. Sentinel-2 MSI spatial resolution, central wavelength and bandwidth. Their central wavelength ranges from 442.7 nm for B1 to 2202.4 nm for B12. B1 is coastal aerosol, useful for imaging of shallow waters; B2, B3 and B4 are the blue, green and red bands, respectively; B5, B6 and B7 are useful for vegetation; B8 and B8a are near infrared bands, suitable to measure plant health; B9 is water vapour band; B10, B11 and B12 are shortwave infrared (SWIR) bands, particularly useful for cirrus detection. Sentinel-2 data are available to users under a free and open data policy, which underpins the development of long-term, sustainable Earth Observation (EO) applications.

Sentinel-2 Bands Spatial Resolution (m) Central Wavelength (nm) Bandwidth (nm)
In this work we used three publicly available data sets of manually labelled Sentinel-2 scenes, the distribution of which is shown in Figure 1. We used the first set for training and tuning of machine learning models (Support Vector Machines, Random Forests and Neural Networks), the second and the third one for test and comparison on homogeneous and inhomogeneous data with the state-of-the-art model-based segmentations (Sen2Cor, FMask and MAJA). Further details about the segmentation methods will be provided in the next section.
The first dataset D 1 consists of about 3.8 million labelled pixels from 67 scenes and 26 different sites; according to the original works [23,24], pixels belong to six different classes: This dataset was selected to ensure a global geographical distribution and, therefore, a wide scene variability. This variability is highlighted in Figure 1. The second set D 2 consists of about 2.7 million labelled pixels from 39 scenes and 18 sites. This set has been collected by the same authors of [23,24] and it is labelled according to the same labels of D 1 . The homogeneity of this data provides an excellent way to test the reliability of machine learning methods.
Finally, the third set D 3 consists of about 100 million labelled pixels from 29 scenes [25,26] captured in 10 different sites. This set was used to test the generalization power of the machine learning methodologies across a non homogeneous set of images collected with different modalities and in different sites. D 3 includes the following classes: As we are interested in detecting clouds, we regrouped them into clear sky and clouds. A synthetic overview about D 1 , D 2 and D 3 is provided in Table 2. Although Sentinel-2 cirrus band B10, centred at 1.38 µm, was designed to detect high clouds, the only light reflected in this band comes from altitudes above 1000-2000 m [27]. Thus, in order to avoid misclassification of high altitude bright rocks, GTOPO30 digital elevation models were included in our analyses [28].

Model-Based Segmentation
Model-based methodologies consist of a series of constraints based on physical models and phenomenological observations. These constraints provide ranges and thresholds for reflectances and their ratios, which can be used to assign the image pixels to different classes [29]. Among several possibilities, we present here the evaluation and the comparison of the most adopted and accurate, according to recent literature, model-based segmentation algorithms: Sen2Cor [30], FMask [9] and MAJA [12].

Sen2Cor
Sen2Cor is mainly based on four different steps:

1.
Bright pixels are detected with the red band (B4), as this band tends to be highly sensitive to bright pixels, likely containing snow or clouds; 2.
Snow and cloudy pixels can be distinguished by considering the Normalized Difference Snow Index (NDSI): furthermore, ancillary information (e.g., latitude, near infrared reflectances) can be used; 3.
Vegetation pixels can be detected with the Normalized Difference Vegetation Index (NDV I): furthermore, a reflectance ratio of B8 and B3 is used for senescent vegetation;

4.
Soils and waters are mainly recognized using the blue band (B2) and the soil band (B11).
Using a combination of thresholds and ratios, Sen2Cor can provide a detailed scene classification, accounting for several categories.

FMask
FMask was initially developed for Landsat 5 and Landsat 7 data, but later it was also used with Landsat 8 and Sentinel-2 images [9,31]. FMask first assigns some pixels to clouds and their shadows based on thresholds analogously to what happens in Sen2Cor. Then, pixel-based statistics are used to compute a cloud probability map p cloud for the undecided cases; in these cases, a pixel k is then classified as cloud if: Thus, FMask dynamically determines a data-driven threshold and, in this sense, it can be considered a statistical extension of Sen2Cor. Furthermore, FMask includes other information related to cloud displacement, spectral context and auxiliary features like digital elevation model and global water map [32,33].

MAJA
MAJA is a recursive algorithm specifically developed for cloud detection in FORMOSAT and Landsat images, then extended to Sentinel-2 images [12]. It is a multitemporal and recursive algorithm because it requires a chronologically ordered time series of images for training. Firstly, a monotemporal analysis is performed to distinguish high and low clouds. In particular, a pixel is assigned to the high cloud class if: where h denotes the pixel altitude (Km) above the sea level and B10 * is the B10 pixel reflectance corrected for Rayleigh scattering. Pixels are assigned to the low cloud class if all the following constraints are satisfied: Accordingly, a first cloud mask is obtained. This segmentation is then refined including information from a time series. In the end, an analysis of spatial correlations yields the final segmentation [34].

Learning-Based Segmentation
In this work, we considered three different learning-based segmentation approaches: Random Forest (RF), Support Vector Machines (SVM) and Multilayer Perceptrons (MLP).

Random Forest
Random Forest (RF) is an ensemble learning classifier [35] conceptually based on classification trees. The basic idea behind RF is growing an ensemble of classification trees with each tree being trained using a bootstrap sample from the available data; to avoid biased estimates, one third of available examples is left out of and used for the so called out-of-bag error estimation.
When growing a tree, at each node the optimal split is determined using √ M randomly sampled features (if M features are available). It is demonstrated that the classification error depends on two factors: the correlation among the forest trees and the individual strength of each tree. These factors are essentially controlled by the number of trees used to grow the forest and the number of features sampled at each split; the optimal tuning of these two parameters usually yields accurate and robust predictions. RF has shown its effectiveness in several remote sensing applications [36,37].
In this study, RF was implemented using the randomForest (v.4.6-14) R 3.6.1 package [38]. Using the training set D 1 we investigated the optimal values for the number of trees and the number of variables used at each node split. The number of tree parameter was set equal to 100; in fact, over this value, no further performance increase was observed. For what concerns the number of features, the default value equal to the square root of the number of the available features was used.

Support Vector Machine
Support Vector Machine (SVM) is a learning algorithm [39] in which the essential idea is that the separation of two classes in the feature space is analogous to the definition of a suitable hyperplane, at least for linearly separable variables. Geometrically determining a separation hyperplane is equivalent to determining a number of observations, called support vectors, best representing the classes of the problem. A margin parameter determines how well observations are separated as well as the number of support vectors needed by the model. SVM can be proficiently used even with non-linearly separable observations, provided the existence of a higher dimensional space where linear separation can be achieved with a suitable transformation or kernel function. As infinite choices can be adopted, one of the most important SVM parameters to tune is the optimal kernel; of course many kernel functions can be explored, but to keep the computational burden affordable, a particular constraint is often adopted, i.e., that distances in the new feature space depend only on the original features. A detailed review of SVM and its efficiency in remote sensing can be found in [40].
The main SVM parameters explored in this work were the margin C and the kernel. In particular, a radial kernel was finally adopted, thus requiring the search for the optimal γ parameter. For the radial kernel, γ determines the influence range of training observations, the greater the γ value, the lesser that range. Accordingly, models with large γ values tend to overfit, and models with small values tend to underfit. In this study, the e1071 (v.1.7-3) R package was used [41]. SVM hyperparameters were optimized using a uniform grid algorithm; we preferred this solution given the exiguous number of parameters to tune. We used our training data for cross-validated estimations of accuracy and used this metric to choose the best configuration. In particular, the explored grid consisted of C ranging from 0.5 and 10 (step 0.5) and gamma ranging from 0.03 to 0.6 (step 0.03). More extreme values were not taken into account as we observed a consistent accuracy worsening. We observed that the performance remained stable over a wide range of values; however, we found that a slight increase in segmentation accuracy can be obtained by choosing the configuration C = 9 and γ = 0.18.

Artificial Neural Networks
Our last baseline method attempts to perform pixel-level classification using a Multilayered Perceptron (MLP) with a fully connected architecture. MLPs are composed by three basic structures: an input layer fed by the features, hidden inner layers combining the output vectors of previous layers with linear combinations and a final output layer, which yields the classification result. MLPs can generally be distinguished in shallow and deep neural networks according to the depth of the hidden layers, accordingly including this model allowed us the exploration of a deep learning architecture.
With the adopted architecture, every node of a given layer receives a weighted average of the outputs of the previous one; given its specific weight, that node will propagate the obtained information to the next layer where a new weighted average is calculated; this procedure ends with the final layer where all information is collected and summarized in a vector score (with a number of components equal to the number of desired classes). During the training phase, a backpropagation algorithm [42][43][44] measures the classification error according to the given nodal weights and rearranges the weights in order to minimize the error. In this study, the h2o package R implementation (v.3.28.0.4) of MLP was used [45]. As MLPs are characterized by the use of several hyper-parameters, the best configuration was obtained by means of a random grid [46] and resulted in a 14-20-20-2 architecture, without any regularization and dropout and Rectified Linear Unit (ReLu) activation functions. This model was chosen with a random grid search and using cross-validation accuracy to minimize overfitting risk and determine the optimal configuration. It is worth noting that several deep architectures were also tested, but a shallow one yielded the best results.

Post Processing and Segmentation Evaluation
All the segmentations compared in this study underwent an image post-processing based on morphological filtering. Different filters can be investigated, depending on the specific properties of the analyzed data. The most important are erosion and dilation filters. The former being preferred to reduce segmentations (excess of false positives), the latter to enlarge them (excess of false negatives).
Several studies [11,25] have demonstrated the appropriate filtering in this kind of application is dilation. At least two distinct arguments support this choice: • Cloud edges are so thin that their pixels remain undetected more frequently than inner ones, thus resulting in large numbers of false negatives; • Clouds scatter light to their neighbourhood pixels thus resulting in blurred edges in which the pixels are hardly recognized as clouds.
Because of these considerations, we decided to dilate the obtained cloud masks with a 180 × 180 m square kernel window in order to avoid the under detection of cloud contaminated pixels [25]. These images were then used for classification and segmentation purposes.
Computational requirements for training machine learning algorithms can be extremely demanding, especially when dealing with high-cardinality data as satellite images. In this case, millions of examples are available, but the adoption of sampling strategies can suitably yield a faithful representation of the whole feature space. Accordingly, we investigated to which extent it was possible to use random sampling in order to reduce the computational burden without a significant loss in performance.
Performance evaluation was assessed with several metrics and a five-fold cross-validation procedure. A rigorous application of cross-validation is important for unbiased estimations of machine learning accuracy [47]. This can be particularly true for remote sensing images in which pixels and polygons have strong spatial correlation: for example, let us suppose that two adjacent pixels are used for training and test, respectively. In principle, learning a model using the first pixel and evaluating the model performance on the second would be correct; however, this would lead to overestimating the performance as the two examples are often indistinguishable. Some studies have performed cross-validation over pixels [23], but we preferred here to perform cross-validation over the available images, a choice which is also more adherent to practical purposes. Cross-validation was repeated 100 times. The adopted metrics were accuracy (Acc), sensitivity (Sens), precision (Prec), specificity (Spec) and F1 according to the following definitions: where TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively. We considered as positive pixels those ones belonging to the cloud class. Accuracy is the ratio between the correctly classified pixels and all the pixels of a single scene. Sensitivity and specificity are the portion of correctly classified examples of the positive and negative classes, respectively. Precision is the ratio of positive examples correctly classified within the positive predictions. Finally, F1 is the harmonic mean of sensitivity and precision. Some studies present their results in terms of accuracy, sensitivity and specificity; others prefer F1 and precision. Here, we chose to present all of them to ease the comparison with already published studies.

Cross-Validation Assessment of Learning Methods on D 1
To compare machine learning algorithms with the state-of-the-art cloud detection techniques, we trained RF, SVM and MLP classifiers on the dataset D 1 and assessed their robustness. An overview is presented in Figure 2. SVM achieved the highest cross-validation performance in terms of all proposed metrics except sensitivity, in the latter case, MLP obtained the best result. The numeric values are reported to ease comparison in Table 3. We assessed the statistical difference of cross-validation distributions by means of a Mann-Whitney U test [48], and found that MLP performs significantly better (5%) in terms of sensitivity than RF and SVM. Both SVM and RF perform significantly better (5%) than MLP in terms of precision and specificity. SVM obtains the best better performance in terms of accuracy and F1. Moreover, we assessed the statistical difference of the variances for the performance distributions by means of a Levene Test [49]. The only significant differences we found were for the accuracy distribution of RF and the sensitivity distribution of MLP. In conclusion, RF would seem a slightly more robust segmentation method while SVM is more accurate; accordingly, for the following investigations we considered only SVM.

How Sample Size and Heterogeneity Affect Performance
Firstly, we evaluated how training set size affected classification performance. The aim of this analysis was twofold: on the one hand, understanding whether or not it is possible to use a reduced training set size to accurately distinguish clouds form clear sky pixels; on the other hand, reducing the computational burden of learning processes. Then, we investigated how the availability of heterogeneous data affects the learning process. Using a fixed number of training examples, we sampled them from a varying number of images. The SVM results are presented in Figure 3, a similar behaviour was also observed for RF and MLP. Left panel (a) shows how SVM performance reaches a plateau after 10 4 training pixels are provided for learning. As 10 4 pixels is a sufficient number of training examples to reach accurate classification, the following analyses are performed using this value. Right panel (b) shows how the number of different images affect the classification performance. In this way we evaluated the heterogeneity of D 1 images. We varied the number of images used for training from 10% to 90% of the whole training set. Once reached 80% of training images, the mean and the variance of SVM accuracy are statistically indistinguishable. For the mean, we used the Mann-Whitney U test, and for variance, a Levene Test [49].

Segmentation Reliability on the Independent Test Set D 2
We evaluated the reliability of SVM segmentation on an independent test set D 2 , which is homogeneous with D 1 . Furthermore, we compared our results with state-of-the-art threshold-based methods: Sen2Cor and FMask. It was not possible to evaluate MAJA on D 2 as it requires a temporal series of images that for the present case were not available. Specifically, we trained SVM on D 1 pixels and we evaluated its segmentation performance on D 2 . Results are shown in Figure 4. SVM achieved a similar performance to the one obtained on D 1 , thus confirming the robustness of its segmentations. A complete overview of the comparison is presented in Table 4. SVM segmentations significantly (5%) resulted in the best ones in terms of precision and specificity according to a paired Wilcoxon test [50]. For what concerns accuracy, F1 and sensitivity FMask performed slightly better than other methods. However, these differences were not statistically significant. Finally, FMask and Sen2Cor performance distributions resulted as less sharp than the SVM one (5% significance), according to the Levene Test.

Generalization Power on D 3
Another independent test set D 3 was used to evaluate the generalization power of SVM on a set inhomogeneous with the training one. In order to accomplish such task, we trained SVM on pixels from D 1 and D 2 and evaluated its performance on D 3 . Figure 5 shows a comparison of the distribution of the classification metrics computed over the 29 scenes of the D 3 dataset.
A comprehensive overview of all performance metrics is reported in Table 5.  SVM significantly (5%) resulted in the best performing method in terms of accuracy and F1 according to a paired Wilcoxon test. Conversely, FMask is significantly more accurate than Sen2Cor while no significant difference in terms of accuracy can be assessed between MAJA and FMask. Furthermore, SVM is the most balanced segmentation strategy as it can be observed in terms of the 95% confidence intervals; this difference is statistically significant according to the Levene test. MAJA is significantly (5%) the best segmentation method in terms of sensitivity, followed by SVM. FMask is the second method in terms of F1, it has high specificity but a significant lower sensitivity compared to MAJA and SVM. Finally, Sen2Cor is the least accurate in terms of accuracy and F1, but with remarkable precision and specificity, where it scored as the second. A visual assessment of these results is shown in Figure 6. SVM provides a balanced segmentation between false negative and false positive errors. FMask and MAJA segmentations tend to prefer false negatives and false positives, respectively; Sen2Cor was not considered here as its segmentations tend to be similar to those obtained by FMask.

Discussion
This study proposed a comparison of machine learning and threshold-based techniques for cloud detection and segmentation in Sentinel-2 L1C images. In order to enable an unbiased performance comparison, we first built and validated supervised learning models, namely SVM, RF and MLP, with a cross-validation procedure on a training set D 1 . Machine learning strategies achieve extremely accurate performances, despite the exiguous number of features used; in this work we used only the 13 spectral band intensities and the GTOPO30 digital elevation feature. Although all machine learning strategies were accurate, SVM performed slightly better than the other classifiers in almost all metrics; MLP was the most sensitive method and RF the one with less variance.
For what concerns MLP, it is worth noting that we explored several deep neural network configurations, but a shallow one resulted in the best performing. This is particularly striking compared with the RF performance; it is reasonable to conclude that when dealing with pixel-based approaches, deep learning cannot exploit their full potential as they certainly do in other situations and with other approaches, for example when considering Convolutional Neural Networks (CNN) [51,52].
To this purpose, it should also be noted that CNN algorithms cannot be suitably applied to whatever data format; in particular, in this study we used images with labelled polygons of varying size and shape, a typical situation in remote sensing imagery, thus it is not possible to adopt ready-to-use CNN solutions, and specific solutions and customizations would be required. In this sense, standard learning algorithms, which simply require labelled pixels as the base of knowledge, represent an efficient, user-friendly and therefore still valuable solution for several engineering purposes [53,54].
Furthermore, we evaluated how, despite the increasing size of the training set for remote sensing applications, machine learning strategies remain an efficient tool for segmentation as they require relatively small-size training sets. According to our experiments, 10 4 pixels collected from ∼20 scenes provide a sufficiently accurate classification. This analysis confirmed the robustness of machine learning strategies, in fact for all the classifiers showed a common behaviour.
Typically, a Sentinel-2 acquisition covers a swath of 298 km roughly corresponding to three distinct but adjacent images. Of course, these images are strongly correlated, thus to avoid results biased by spatial correlations we always kept adjacent tiles within the same cross-validation fold. Previous works in remote sensing and other fields have outlined the danger of double dipping in machine learning [47,55], especially for generalization purposes. As demonstrated elsewhere [21,56], generalization remains the most difficult challenge to tackle for learning algorithms. Accordingly, we evaluated the performance of our best machine learning strategy on an independent test set and compared it with model-based segmentations.
A first evaluation was performed using the independent test set D 2 , which was labelled by the same experts of D 1 ; we observed that SVM substantially reproduced the performances observed on D 1 . These findings remark that when training and validation data are homogeneous (e.g., same geographical regions and same segmentation protocols) learning strategies like SVM can provide reliable segmentations. SVM achieved comparable performance with the state-of-the-art methodologies Sen2Cor and FMask. Notably, SVM segmentations were significantly more accurate in terms of precision and specificity. Furthermore, we observed that SVM segmentations were statistically less prone to drastic failures, thus yielding performance distributions affected by less variability, an effect demonstrated by a Levene test. This result can represent a particularly interesting feature, as it suggests that in some cases learning-based approaches are more robust.
A second evaluation was performed using the test set D 3 , which for its characteristics could be considered not homogeneous with D 1 . SVM remained the best performing method in terms of accuracy (94.7%) and F1 (84.5%). Although homogeneity of training and test data is a primary issue for machine learning accuracy [57,58], the only statistically significant drop in performance we observed were ∼4% in accuracy and ∼10% in F1. MAJA was able to provide segmentations more accurate than SVM and other methods in a few cases. An example is shown in Figure 7. As can be seen, MAJA gets the most accurate cloud classification on a barren area. In these regions, characterized by high reflectance, the multi-temporal base of knowledge allows a correct classification of bright surfaces, which otherwise could be classified as clouds [10,11,59].
In general, MAJA resulted in the most sensitive (95.3%) method and FMask resulted in the most precise (98.0%) and specific (99.5%); however, it is worth noting that according to specificity the difference with Sen2Cor is negligible (99.4%). These findings should be taken into consideration as the main purpose of cloud detection is avoiding false positives, especially for change detection or land cover applications [23,25,60,61]. Nevertheless, learning strategies and specifically SVM seem to provide more balanced classifications, which can achieve better results, especially in terms of metrics evaluating the overall performance as accuracy and F1.

Conclusions
In this work, we developed a pixel-based classification procedure based on machine learning techniques for cloud detection in Sentinel-2 data. We evaluated different supervised models: RF, SVM and MLP. Our analyses demonstrated that, among learning models, SVM is the best option. In addition, we compared SVM with state-of-the-art model-based methodologies such as MAJA, FMask and Sen2Cor. We evaluated how data homogeneity affects the segmentations using two independent test sets, the first one collected and segmented with the same procedures of our training set and the second characterized by a deep heterogeneity. In both cases, SVM was the best performing method in terms of accuracy and F1. Nonetheless, all different strategies have strengths and weaknesses: MAJA resulted in the most sensitive method, FMask and Sen2Cor were the most precise. The access to homogeneous data remains a key issue, in fact when using not homogeneous data, we observed a slight but significant drop in performance. The comparison with model-based segmentation suggests that learning methods can improve their performance when trained on temporal series, an aspect that deserves future investigations. Our findings demonstrate the accuracy of standard machine learning methods, especially SVM as a valid alternative to state-of-the-art segmentation strategies. As far as we know, this work presents the most extensive comparison between machine learning and model-based segmentations. Furthermore, other studies comparing different segmentation strategies focused on small data usually covering homogeneous environments. As far as we know, this is the first work considering learning procedures and comparing them with model-based approaches for world-scale analyses, whilst regional scale analyses are usually preferred. Future studies should consider the design and customization of CNN architectures for this data; another interesting aspect to consider is the investigation of eventual differences in segmentation performance between Sentinel and Landsat data.