Cloud Detection: An Assessment Study from the ESA Round Robin Exercise for PROBA-V.

A Round Robin exercise was implemented by ESA to compare different classification methods in detecting clouds from images taken by the PROBA-V sensor. A high-quality dataset of 1350 reflectances and Clear/Cloudy corresponding labels had been prepared by ESA in the framework of the exercise. Motivated by both the experience acquired by one of the authors in this exercise and the availability of such a reliable annotated dataset, we present a full assessment of the methodology proposed therein. Our objective is also to investigate specific issues related to cloud detection when remotely sensed images comprise only a few spectral bands in the visible and near-infrared. For this purpose, we consider a bunch of well-known classification methods. First, we demonstrate the feasibility of using a training dataset semi-automatically obtained from other accurate algorithms. In addition, we investigate the effect of ancillary information, e.g., surface type or climate, on accuracy. Then we compare the different classification methods using the same training dataset under different configurations. We also perform a consensus analysis aimed at estimating the degree of mutual agreement among classification methods in detecting Clear or Cloudy sky conditions.


Introduction
Despite the large existing literature, cloud detection from images taken by sensors onboard satellites is still an area of very active research. This is essentially due to three main reasons: (i) Cloud detection is an important preliminary step of remotely sensed image processing because clouds affect sensor measurements of radiance emitted by surface up to make data unreliable for a wide range of remote-sensing applications that use optical satellite images; (ii) Cloud detection by itself is a difficult problem (even for experts attempting to visually detect clouds from signatures and/or images) in some conditions as transparent or semi-transparent clouds and in general when the contrast between the cloud and the underlying surface is poor; and (iii) Despite consolidated guidelines for cloud detection algorithms (e.g., use of infrared bands, preliminary removal of noninformative bands), development of new sensors with different hardware capabilities in terms of spatial, spectral and temporal resolution claims for specific algorithms or adaption of existing ones (e.g., re-estimate of new thresholds). In particular, a dramatic progress in the technology was availability of hyper-spectral sensors able to take images up to 8K bands [1,2]. While these instruments promise to gather unprecedented information from surface and atmosphere, however, they challenge low-dimensional conventional algorithms for cloud detection not so much for scalability and computational resources required but as for physical and theoretical implications of hyper-spectrality. In particular, really informative bands must be selected in advance to face the curse of dimensionality inherent in statistical estimation (dimensionality reduction). Actually, a manual selection of spectral bands and estimate of thresholds for bands themselves or some couples of theirs is not conceivable anymore, therefore innovative methods for automatic feature extraction from the hyper-spectral images are sought.
In the case of low-dimensionality, as in PROBA-V considered in the present paper, the dimension reduction problem is generally not a concern and all spectral bands are considered (Experiments not reported in this paper confirm that best accuracy is achieved when all PROBA-V spectral bands are considered). However, due to the limited amount of information with a so reduced number of bands, it is important to extract relevant features for cloud detection as effectively as possible.
In all cases, also considering the recent explosive emerging methods, a problem of validation of methods arises that could help not only in comparing their accuracy but also to understand strengths and weaknesses of the general cloud detection problem.
Classification exercises are sometimes organized where different algorithms are challenged to estimate a cloud mask from radiance detected by a specific sensor. Radiance is endowed with labels on the Clear or Cloudy condition accurately assigned by experts that are blind to the algorithms, so to be used as a validation of the algorithms themselves. While the main purpose of such exercises is to develop accurate operational algorithms for specific sensors onboard satellites, an important side effect is comparison of state-of-art methods on a same, very accurate dataset. In this respect we mention the Landsat comparison exercise [3] and the ESA Round Robin exercise for PROBA-V sensor [4]. Such comparisons are an exceptional way not only to compare algorithms, but especially to discover their weakness in particular climatic/surface conditions and, finally, to progress knowledge of cloud mask detection.
One of the authors participated in the ESA Cloud Detection Round Robin exercise (https:// earth.esa.int/web/sppa/activities/instrument-characterization-studies/pv-cdrr). Such exercise was intended for the PROBA-V sensor onboard the PROBA ESA platforms and suited for land use and classification, including vegetation, crop monitoring, food security and scarcity prediction, disaster and biosphere monitoring. PROBA-V has a small number of spectral bands (Blue, Red, NIR and SWIR) and in particular it lacks a Thermal Infrared band that could have been useful to detect cirrus clouds. This makes cloud detection from its images challenging. ESA and the Belgian Science Policy Office organized a dedicated Round Robin exercise to inter-compare the performances of different cloud detection algorithms for PROBA-V. The Round Robin exercise provided the participants with a large dataset of PROBA-V images (331 for almost 8 Billion scenes) covering all seasons, most surface types, different world zones and most cloud types (In this paper, we shall refer to pixel as the single element of the image matrix provided by the sensor with its field of view, including the corresponding geographical coordinates, and to scene as the set of corresponding information for that pixel, namely spectral radiance, sky condition, surface, climatic zone). The key data of the exercise is a set of 1350 scenes, blind to participants that have been manually labelled by experts, claimed to sample the most important types of clouds (gold standard). A more detailed description of the exercise and main conclusions are reported in [4].
The framework proposed by the authors for the Round Robin exercise includes a statistical classification method (Cumulative Discriminant Analysis, CDA [5]), a training set semi-automatically obtained from cloud masks estimated for concurrent sensors, and grouping data in almost homogeneous surface types. In particular, our framework was the only one within the exercise that did not use a manual dataset obtained by expert annotation to train the classification. Instead it was relying on a semi-automatic training obtained as the result of consolidated and acknowledged as reliable cloud masks with comparable spatial resolution as the target cloud mask (MODIS and SEVIRI). The only intervention required is spatial and temporal co-registration of the training cloud mask with the target cloud mask. In view of the fact that this training cloud mask is not obtained by expert judgement but by another algorithm, it will be defined as a silver standard. On the one hand quality of the silver standard cannot be compared with the accuracy of a gold standard. However, in this respect we also recall that even very accurate cloud masks annotated by experts are affected by judgement error that in best cases is estimated around 4-7% [3]. On the other hand, the much larger size of the training dataset and its wide coverage can represent a much larger number of cloud and surface conditions. This result cannot be obtained by manual training, naturally limited by human resources. This strategy appears in our opinion as a natural path when methods requiring large training datasets are involved. It is mainly the case of deep learning algorithms. In this respect we mention [6] who use the results of an algorithm (CFMask [3]) to train their deep RS-Net model for Landsat 8 images, and [7] based on AVIRIS cloud mask. Massive use of such a silver standard dataset for cloud detection was pioneered in [8] in our knowledge.
Another qualifying part of our framework was grouping of scenes into homogeneous zones selected basing on the surface type. It is frequent that algorithms for cloud detection are trained separately for different types of the underlying surface (e.g., land or water); other approaches are possible, for example introducing climatic information as in [5].
Aim of the present paper is first to show a full and detailed analysis of our framework and of the results within the Round Robin exercise, assessing its performance under several cloud and surface conditions.
In addition the availability of a very accurate gold standard allows one to quantitatively analyze weaknesses and strengths of cloud detection under a framework in which also the same training dataset is shared among classification methods. In particular, we address the following questions: (a) to compare prototypes of selected cloud detection algorithms well known in the literature; (b) to assess the feasibility of a silver standard to train cloud detection; (c) to assess the role of surface and/or climatic information on the accuracy of cloud detection. Finally, a consensus analysis is performed aimed at estimating the degree of mutual agreement among classification methods in detecting Clear or Cloudy sky conditions.

PROBA-V Data
PROBA-V (PRoject for On-Board Autonomy-Vegetation) is a global vegetation monitoring mission [9], launched in 2013 to assure the succession of the Vegetation instruments onboard the French SPOT-4 and SPOT-5 Earth observation missions. The satellite follows a Sun-synchronous orbit at a height of 820 km, achieving a daily global coverage, except the equatorial region (within 35 • of the Equator) where coverage is guaranteed every two days. The optical instrument onboard provides from 1/3 km to 1 km-resolution data products. It captures a Blue band (centered at 463 nm), a Red band (centered at 655 nm), a Near-Infrared band (centered at 845nm), and a Short-wave Infrared band (centered at 1600 nm). The data of the traditional Vegetation products, as provided by PROBA-V, are freely accessible for all users. The new, higher resolution products of PROBA-V elder than 1 month share the same full, free and open data policy. Details on the technical characteristics of PROBA-V of interest for the present work are reported in Table 1.
We consider as input data 331 images released by ESA and provided by the organizers of the Round Robin exercise. These images are PROBA-V Level 2A products with Top-of atmosphere reflectance (the four PROBA-V bands radiometrically and geometrically corrected and resampled at 333 m). They conform a complete globe acquisition from four different dates covering the four seasons in 2014. PROBA-V scenes are endowed with a sea/land mask and an algorithm for snow/ice detection. The total number of valid scenes available in the 331 files is 7,731,538,861, the remaining ones being off sensor view, sun glint or missing reflectance.

Validation Dataset (Gold Standard)
All participants of the Round Robin exercise for PROBA-V cloud detection algorithm were provided with a small dataset consisting of 1350 scenes, manually collected, classified and labelled by an expert user. Since obtained by an expert, the dataset can be fully defined as a gold standard. The scenes are a subset of the full PROBA-V dataset collected from 4 different images and labelled with the following categories: Clouds (totally Cloudy, opaque clouds; semi-transparent clouds; other turbid atmosphere, e.g., dust, smoke); Clear sky (over water; snow/ice; other cases); spatially mixed clouds (over land; water; ice).
This small dataset comprises 30% totally Cloudy, 32% semi-transparent and 38% Clear cases. The relationship between land and water scenes is about 70:30 (land:water). The detailed distribution of categories is reported in Table 2. Figure 1 shows the world distribution of the dataset. We mention that a second dataset of 53,000 scenes was prepared during the ESA project. It served as a basis for comparing algorithms participating in the Round Robin exercise [4]. However, this dataset has never been revealed to participants, even after the end of the exercise [10].

Training and Validation Dataset
To produce the training and validation sets in the PROBA-V exercise needed by the classification methods we rely on the cloud masks provided by consolidated algorithms endowed with SEVIRI and MODIS data. We assign labels to all the PROBA-V scenes for which both algorithms can provide labels, as reported in the following in detail.

SEVIRI Cloud Mask
Processing data from SEVIRI (Spinning Enhanced Visible and Infrared Imager) sensor onboard MSG satellites provides a cloud mask dataset at 15 min temporal resolution [11] with a spatial resolution of 3Km sub-satellite that degrades far from the equator and from the Greenwich meridian. The data are provided with regional coverage within a radius of about 60 degrees around the point at zero latitude and longitude. No data are provided for the hemisphere including Americas, Oceania and most Asia. The grid of the SEVIRI cloud mask is 3712 × 3712 pixels; the number of valid cloud mask pixels is about 12M.
SEVIRI cloud mask provides four different labels: Clear over sea, Clear over land, Cloudy, uncertain. We considered for the training set only Clear and Cloudy SEVIRI scenes, so to represent the distribution of Clear and Cloudy conditions more accurately without influence of the other conditions. Moreover, since SEVIRI and PROBA-V scenes need to be co-located, we resampled the SEVIRI grid of pixels to a uniform grid in latitude and longitude. This choice preserves the original space resolution close to the center of the SEVIRI images. Of course, far from the center the finer resolution of the new grid is fake and the SEVIRI cloud mask is simply repeated within the coarser grid. Technically the procedure is equivalent to a Nearest Neighbor interpolation. Then we include in the training dataset only the PROBA-V scenes for which the closest SEVIRI pixel with respect to time and space provides a certain (i.e., Clear or Cloudy) label. Figure 2 shows an example of SEVIRI cloud mask.

MODIS Cloud Mask
MODIS (MODerate-resolution Imaging Spectroradiometer) sensor flies onboard Earth Observation System Terra and Aqua satellites on polar orbits. There are two MODIS Cloud Mask L2 data product files: MOD35 [12], containing data collected from the Terra platform, and MYD35 [13], with data collected from the Aqua platform. The MODIS cloud mask provides fifteen classes (such as confident Clear over different surface types; possibly Clear; uncertain Clear; Cloudy) that provide basic surface and cloud type information.
We selected all the MODIS images for the four days of the PROBA-V exercise at 1Km spatial resolution. Their latitude and longitude grid is not uniform and it is given at a coarser resolution than the PROBA-V grid. For this reason, the grid of pixels is first interpolated to the full resolution of the sought PROBA-V cloud mask. Moreover, since the grid of the MODIS images depends on the granule and therefore is not fixed in time or space, we co-located the coarser MODIS grid into the PROBA-V uniform grid. Since MODIS orbits are polar, there is not always a good match in time with a PROBA-V pixel; we considered to be simultaneous only pixels whose acquisition differs at most by 30 min. We tested indeed shorter overlaps of time intervals (15 min and 7 min), and obtained comparable accuracy in the results. Clearly, the interpolation procedure can generate mixed/uncertain labels on some scenes in the finer grid. However, as already done for the SEVIRI cloud mask, we included in the training and validation set only PROBA-V scenes co-located to MODIS ones labelled as Cloudy and confident Clear. Figure 3 shows an example of MODIS cloud mask. Example of MODIS cloud mask co-located to PROBA-V image PROBAV_L2A_20140321_000512_3_333M_V001 over South Australia.

Combined SEVIRI and MODIS Cloud Mask
A single cloud mask is obtained from the SEVIRI and MODIS cloud masks of Sections 2.3.1 and 2.3.2 by selecting all scenes for which there was an agreement of both. This choice enhances purity of the Clear and Cloudy classes for the purpose of classification. Since MSG points at 0 • longitude, it can take images only of Europe, Africa and part of Asia, therefore the merged database completely misses Americas, almost full Asia and Oceania for which several scenes of the gold standard are present (see Figure 1). The full joint SEVIRI and MODIS dataset is composed of 65M scenes, of which 40% labelled as Clear and 60% as Cloudy. This proportion is consistent with the gold standard in Table 2 (62% Cloudy scenes and 38% Clear ones). An extensive analysis through experiments analogous to the ones described in this paper (not reported for the sake of brevity) confirms that accuracy of the combined cloud mask outperforms the single ones. In [4] some comparisons are shown for the gold standard dataset.

Surface Type
Among the ancillary information that can improve cloud mask detection, a significant contribution can be given by the characterization of the underlying surface. To sort scenes into homogeneous regions with respect to their spectral behavior, one can use the sea/land mask endowed with PROBA-V scenes and an algorithm for snow/ice detection. Moreover, other and more detailed surface classifications can be considered, such as the GlobCover surface map. GlobCover [14] Figure 1. To retain the most significant surface characteristics while avoiding unnecessary details, we group the GlobCover surface classes into the following five types: Water, Vegetation, Bare Land, Urban, Snow/Ice. They are shown in Table 3 together with their numerosity and percentage of Clear and Cloudy scenes both in the full training dataset and in the gold standard one. Table 3. Surface types considered in the paper, their numerosity and percentage of Clear and Cloudy scenes in both the full joint MODIS and SEVIRI and the gold standard dataset.

Full Dataset
Gold

Climate
We also consider grouping of the scenes into homogeneous climatic zones as in [5]. Specifically, we discriminate several regions based on latitude (Tropics, Mid and High Latitude); season (Winter, Summer); Hemisphere. We also rely on information from Elevation and Surface Temperature maps, as obtained from NASA (ASTER Global Digital Elevation Model, [15]) and the European Center for Medium-Range Weather Forecast (ERA-5 land hourly data from the Copernicus Climate Data Store, [16]) public databases, respectively. The specification of the climatic regions we consider is reported in Table 4. Due to the limited world coverage of the SEVIRI data, the number of climatic zones is less than [5] because some of them are empty. Table 4. Climatic zones considered in the paper and their numerosity in both the joint MODIS and SEVIRI dataset and the gold standard dataset.

Full Dataset
Gold

Classification Methods
Cloud detection can be formally considered to be a binary supervised classification problem. As such, methods for its solution need a representative set of data with labels considered to be "certain" (training dataset). They evaluate patterns in different features and assign data into one of the two classes (Clear or Cloudy). The classification procedure also involves collection and evaluation of a validation dataset. Once trained, each classifier applies a decision rule to determine if validation data are more likely to have originated from one class or another. This rule partitions the n-dimensional feature space into 2 regions corresponding to the Clear and Cloudy conditions.
Operational cloud-masking algorithms on low/moderate-resolution sensors such as AVHRR and MODIS were mainly based on empirically tuned thresholds from several spectral channels. Also, for higher spectral resolution sensors (Landsat), thermal channel-based spectral thresholding along with prior knowledge of land surface properties has been the most common approach for automatic cloud detection. However, the payload of several recent sensors does not include thermal channels in which cloud-masking strategies have previously relied, so that different approaches, somewhat relying on complementary information have been pursued. Among the many methods and available implementations from the recent literature, we mention here a few significant ones. Taravat et al. [17] propose a Multilayer Perceptron for automatic classification of SEVIRI MSG images trained on the cloud mask by the European Organization for the Exploitation of Meteorological Satellites; Chen et al. [18] implement a neural network classifier driven by extensive radiative transfer simulations and validate it through collocated CALIOP and MODIS data. In [19], a Support Vector Machine classifier is trained on the Gabor energy characteristics of cloud superpixels from GF-1 images, while in [20] algorithms for Sentinel-2 MSI focused on Decision Trees and classical Bayesian classification are considered. Sedano et al. [21] propose a method based on the estimation of Clear/Cloudy radiance density distributions in a data fusion framework, followed by a region growing process and validate their results against both cloud masks generated by statistical methods and Landsat operational cloud mask. Finally, deep learning methods are increasingly considered, with several different approaches: in [22] a multi-modal, pixel-level Convolutional Neural Network-based classifier is introduced for detecting clouds in medium-and high-resolution remote-sensing images, which relies on a large number of per-pixel cloud masks digitized by experts; Francis et al. [23] use multi-scale features, based on a Fully Convolutional Network architecture, and report results on manually annotated images from two high-resolution sensors. An image-based approach is described in [24], relying on multi-modal, high-resolution satellite imagery (PlanetScope, Sentinel-2) at the scene level. Many of these methods rely on expert intervention for labelling training data. As mentioned in the Introduction, we focus instead on automatic means of assigning labels to the training and validation datasets (silver standard), allowing for adjustments of decision boundaries independent of subjective and costly human intervention. In addition, they can cover more general cases than manually possible ones and with a much larger extent, of course to the detriment of the decreased accuracy of labels.
Among the different approaches reported in the literature, we consider and compare for the present study seven different supervised classifiers. They fall into the categories usually labelled as Statistical and Machine Learning and are based on different principles, as Discriminant Analysis, Neural Networks, Nearest Neighbor. We mention that Neural Networks are the basis of Artificial Intelligence methods of present strong interest when the number of features is very high. In the following we briefly describe them.
1. Linear Discriminant Analysis (LDA). It applies the Bayes rule to each scene to select the Clear/Cloudy class so to maximize the posterior probability of the class for a scene given the actual reflectance in that scene. LDA assumes that reflectance follows Gaussian distributions for the Clear and Cloudy classes sharing the same covariance matrix; 2. Quadratic Discriminant Analysis (QDA), which generalizes LDA assuming that also covariance matrix depends on the class (Clear or Cloudy); 3. Principal Component Discriminant Analysis (PCDA) [25]: the hypothesis of Gaussian distribution of reflectance is released in favor of a generic distribution estimated by nonparametric regression; in addition the original reflectances are transformed into uncorrelated Principal Components before classification; 4. Independent Component Discriminant Analysis (ICDA) [25]: similar to PCDA, but with the original reflectances transformed into Independent Components before nonparametric estimation of the densities; this makes such components independent also for non-Gaussian distributions; 5. Cumulative Discriminant Analysis (CDA) [5]: the decision rule for classification depends on a single threshold for each feature (spectral band), based on the empirical distribution function, which discriminates scenes belonging to the Clear and Cloudy classes; the threshold is estimated so to minimize at the same time the false positive and false negative rates on the training or on a validation dataset. 6. Artificial Neural Networks (ANN) [26][27][28]. We use a two-layer feed-forward network, with sigmoid hidden and SoftMax output neurons for pattern recognition. The network is trained with scaled conjugate gradient backpropagation. 7. K-Nearest Neighbor (KNN) [28,29] that labels each considered scene based on a voting strategy among the labels assigned to the K closest neighboring scenes belonging to the training dataset. We used K = 50 throughout this study.
Methods LDA, QDA, PCDA, ICDA and CDA require estimate of the statistical distribution of radiance. We mention that other methods are available in the literature; results of some of them have not been reported because of poor accuracy on other sensors (Logistic Regression) or unfeasible computational time (Support Vector Machine).
All the above methods are pixel-wise, i.e., they treat pixels separately without taking account of spatial correlations among them or local features that are instead typical of images. Among the classification methods that use spatial features of images we mention [30] (Markov chains), [31] (Discriminant Analysis), [32] (relying on PCANet and SVM). We also mention the special case of Artificial Intelligence Deep Learning algorithms (e.g., [6,23]).
Finally, we mention that the method proposed for the Round Robin exercise was CDA [4].

Results
This section fully analyzes accuracy of the classification methods introduced in Section 3. Different configurations of classification are considered depending on the use of ancillary information.

Basic Classification
We consider as input data the reflectances of the scenes at the four spectral bands of PROBA-V. We extract a random training set of 3M scenes from the training dataset described in Section 2.3, whose cloud mask is assigned by the agreement of two consolidated algorithms (joint SEVIRI and MODIS silver standard). After the training phase, we classify the validation set composed of all the PROBA-V scenes for which a joint SEVIRI and MODIS cloud mask is available (65M scenes).
We remark that formally the training and validation datasets should be kept distinct, whereas the former is a subset of the latter. However, due to the large number of available scenes the features of the system are accurately learnt from the various methods in all conditions and are the same both in distinct and overlapping datasets. From the practical point of view all error indicators obtained keeping training and validation datasets distinct or overlapping have the same values up to 3 decimal digits. In addition, for the same reason a size of 3M scenes as a training set is sufficient for classification methods to accurately learn all the features of the data, so that no larger datasets are needed. In practice accuracy of the methods does not change when increasing the size of the training set even up to cover all the 65M available scenes. Using a smaller dataset for the training phase has the advantage of reducing computational time, which would otherwise be practically unfeasible for some methods such as ANN and KNN. Finally, we also mention that accuracy does not depend on the random choice of the 3M scenes subset within 3 decimal digits.
To estimate accuracy of the methods, we consider the following indicators: if N Clear and N Cloudy represent the number of Clear and Cloudy scenes in the validation dataset according to the MODIS and SEVIRI silver standard, respectively, andN Clear andN Cloudy the corresponding values estimated by any classification method, then we consider the global success accuracy A, A Clear and A Cloudy as Assuming Positive as the Cloudy condition, A Clear is also known as Specificity (True Negative rate) and A Cloudy as Sensitivity (True Positive rate). Accuracy indicators A, A Clear and A Cloudy are shown in Table 5 for the entire dataset and the classification methods of Section 3; they are also sorted by the surface types introduced in Section 2.5. We remark that in this experiment the training and validation phases are both performed without using any surface information; surface sorting is performed after classification only to assess the accuracy level on different surface types. As a complementary exercise, we compute accuracy indicators when the validation dataset is the gold standard of Section 2.2. The performance of all the algorithms is reported in Table 6. We advise that accuracy for some surfaces, namely Urban and Snow/Ice, is biased from the very low number of representative scenes and from lack of Clear or Cloudy scenes (compare with Table 3).
Analysis of results on the entire dataset ( Table 5) shows first of all that ANN and KNN generally outperform methods based on the direct estimate of distributions, being the only ones to reach accuracies beyond 90% globally and in most cases. We also observe that CDA, which was used for the Round Robin exercise, is by far outperformed by ANN and KNN. This is mainly due to its very nature of forcing the same accuracy in both Clear and Cloudy conditions so to reach equal I-and II-Type errors globally (row Global in the table). This is paid when accuracy is assessed on a finer scale of surface type after classification has been made. As an example Bare Land has the most unbalanced proportion of Clear and Cloudy scenes, with latter ones being 8% of the total (see Table 3), therefore Clear scenes that globally have a lower frequency (40%) are strongly penalized. We also observe that accuracy is strongly dependent on the surface type for all methods; in particular it is higher on Land than on Water. This result is consistent among all methods participating the Round Robin exercise [4] and was there justified with a more accurate training. In this experiment this is equivalent to the fact that Land scenes are 2/3 of Total ones in the global training set, so that classification naturally tends to better represent them. On the other side main applications of PROBA-V sensor are for Land and vegetation, particularly. Results are roughly consistent when we limit validation to the gold standard dataset for which the true Clear/Cloudy condition is known (Table 6). Accuracy is lower than the training set case by 7-8% for all methods. A noteworthy exception is CDA, whose global accuracy remains the same (actually, marginally better), even though no longer equibalanced between Clear and Cloudy conditions, the latter being increased to 87.1% at detriment of the former (74.4%). We again observe that ANN and KNN algorithms outperform methods based on the direct estimate of distributions, with accuracy rarely going beyond 90%, and that particularly for KNN is globally 2% higher than ANN, mostly due to a better detection of Cloudy condition (81.6% vs. 75.2%).

Use of Ancillary Information
Experiment of Section 4.1 considers the entire joint SEVIRI and MODIS dataset for the training (apart from the random subset selection). However, it is well known that capability of detecting clouds heavily depends on the underlying surface because of the different contrast between clouds and surface. Therefore, a possible useful strategy is to disaggregate the scenes into groups homogeneous as possible within, at a finer level than the land/water often used in operational cloud detection methods. For this purpose, we consider two possible disaggregations based on the type of surface and on the climate.

Surface as Ancillary Information
We train the algorithms separately on different surface types. From the entire dataset we randomly extract a training set for each of the five surface types considered in Section 2.5. The size of these training sets is fixed as the minimum between 3M scenes and the subset size. Validation is performed on all the scenes of the same surface type in the full dataset. Table 7 reports the accuracy indicators. Table 8 refers to accuracy estimated using the gold standard as a validation dataset.
Comparison of Table 7 with Table 5 does not provide unique indications: most methods show a certain degree of decrease of accuracy around 1% globally (only LDA reaching −17%). Notable exceptions are ANN (+1%) and, especially, CDA (+5.8%). This is to be expected for CDA since the constraint of equal I-and II-Type errors now applies separately to each surface type and much better adapts to the frequency of Clear/Cloudy conditions that globally depends on the underlying type of surface. Improvement of CDA occurs for all surface types; in particular it is remarkable for that sky condition (Clear or Cloudy) with the worse accuracy when a unique classification was made without using ancillary information; however this improvement occurs in a less extent at detriment of the other sky condition (Cloudy or Clear, respectively). If we compare Table 8 with Table 6 again we observe a drop in accuracy around 6-9% (now including CDA), somewhat smaller for KNN (4.4%).
Summarizing there is not a clear evidence from these quantitative indicators that separate classification problems tailored to different surface types improve accuracy. We will investigate further this matter later in Section 4.3.1 through visual inspection of the cloud masks provided by the different algorithms on some specific images.

Climate as Ancillary Information
We train the classification methods separately on different climatic zones, as described in Section 2.6 and introduced in [5]. From the entire dataset a sample is extracted for each climatic zone as a training, with a size as the minimum between 3M and the subset size. As in Section 4.2.1 validation is performed on the entire dataset or on the gold standard one. Accuracy is shown in Tables 9 and 10, respectively. Table 9. Accuracy indicators A, A Clear and A Cloudy of classification methods of Section 3 when validation is made on the full dataset and classifications are separate for each climatic zone of Section 2.6. Percentage values are reported for the entire dataset (row Global) and for the different climatic zones. Also, this experiment does not provide a conclusive answer on the improvement of accuracy considering separate classifications for different climatic zones. Global increase of accuracy, when occurring (ANN among the best performing methods) is not spread over Clear and Cloudy conditions, rather a slight decrease in Clear conditions is observed. In the same way as separate classifications by surface type we observe a mix of better (ICDA and KNN) and worse (CDA and ANN) accuracies also spread over Clear and Cloudy conditions. The same conclusions can be drawn also on the gold standard dataset, with the remarkable better performance of CDA that is indeed the best performing method.

Ancillary Information with ANN
Neural Networks are claimed to successfully mix information of different nature (e.g., nonnumeric). Our framework includes reflectance as a numeric information and surface type or climatic zone as a categorical variable. In Sections 2.5 and 2.6 the categorical variable was dealt independently, in the sense that separate cloud masks were obtained with separate different training sets for each surface type or climatic zone. By its nature, ANN instead easily allows one to introduce the type of surface and/or climatic zone directly within the training dataset as a further fifth variable besides reflectance of the four spectral bands. The present Section aims at estimating if and how accuracy improves or degrades when compared with ancillary information used as separate classifications.
Results are shown in Table 11, where for the sake of brevity only Global results are shown, not disaggregated by type of surface or climate. Table 11. Accuracy of ANN when ancillary information is introduced as a fifth variable. Global accuracy is shown for the entire dataset (A) and separately for Clear (A Clear ) and Cloudy (A Cloudy ) scenes when no ancillary information are used, and surface type, climate and both are used. Comparison of Table 11 with Tables 5, 7 and 9 shows that keeping separate classification for the different types of ancillary information improves accuracy of ANN.

Opaque and Semi-Transparent Clouds
This Section is intended to assess behavior of the considered classification methods with opaque and semi-transparent clouds and border identification. For this purpose, we consider two images already used in [4] and a quantitative analysis based on the gold standard dataset.

Visual Analysis of Opaque and Semi-Transparent Clouds
The first image was acquired on 21 June 2014 at 14:42 UTC over Bolivia overlying a land surface. It includes both opaque and semi-transparent clouds, as can be seen from the composite RGB image in the first panel of Figure 4. The other panels of the Figure show the cloud mask predicted by LDA, PCDA, CDA, ANN and KNN. All cases refer to the configuration of a unique classification independent of surface type or climate. An indirect comparison with other methods of the PROBA-V Round Robin exercise as reported in [4] can be made because the covered region is the same.    Figure 4 shows that a unique classification independent of surface or climate is prone to mistakes in interpreting Clear condition over water. Actually, some of the lakes in the Bolivia region (green color in the RGB image of the Figure) are misrecognized as Cloudy by 3 classification methods out of 5, including one of the best performing (ANN). On the contrary when classification is separately made by surface type only PCDA still misrecognizes some of the lakes as Cloudy, all other methods correctly detecting Clear sky conditions.
We also remark that KNN, though more conservative with respect to transparent clouds, is less prone to spurious Cloudy isolated scenes (see in particular the top-right part of Figure 5, bottom-right panel), while still preserving sharpness of the clouds.
The second image, also considered in [4], was acquired on 21 December 2014 at 02:29 UTC over Northern Australia and includes both land and water surface. Figure 6 shows the RGB image and the cloud mask retrieved by LDA, PCDA, CDA, ANN and KNN (orderly from left to right and from top to bottom) when classification is made separately for each surface type. The

Quantitative Analysis of Semi-Transparent Clouds
To quantitatively analyze behavior of the classification methods with respect to opaque and semi-transparent clouds, Table 12 reports the percentage of semi-transparent scenes identified as Clear and Cloudy for all the considered methods. Table 12 quantitatively confirms the more conservative character of ANN and KNN throughout all the different experiments settings.

Consensus Analysis
This Section makes a consensus analysis aimed at assessing the mutual agreement of classification methods in estimating cloud mask. It can provide a further accuracy indicator, independent of any gold standard eventually available, and advice on the reliability of the cloud masks obtained by classification methods. Given a set of methods, we define consensus on Clear and Cloudy sky condition as the percentage of scenes for which all methods predict Clear and Cloudy conditions, respectively. The global consensus is defined as the direct sum of both Clear and Cloudy ones, and represents the percentage of scenes for which all methods agree in predicting the same sky condition, irrespective of the type, Clear or Cloudy. The percentage is computed on the entire set of 65M scenes of the dataset. We consider a subset of the classification methods considered in Section 3 according to the following guidelines: a) the methods have to be as independent as possible (however some degree of dependence cannot be avoided because all methods rely on the same training data set); b) the number of considered methods has to be low; in fact, the higher the number of methods, the higher probability that any of them predicts the status of the sky differently from the other ones, and therefore the lower the consensus. As a consequence we select only one representative method from the Discriminant Analysis group, namely PCDA that gives better accuracy in the analysis of Sections 4.1, 4.2.1 and 4.2.2, keeping all the other methods (CDA, ANN, KNN). Should the number of methods be much larger, then a more elaborated definition of consensus should be devised.
First, we compute the consensus according to the results of Section 4.1 where a unique training dataset is considered independently of surface or climatic zone. The four methods agree on identifying 33% of the scenes as Clear and 49% as Cloudy, resulting in a full concordance on 82% of the scenes.
Then we consider the case when different separate classifications are performed for each type of surface (Section 2.5), whose results are reported in Table 13. When additional information on surface type is given, global concordance of the methods is slightly higher (83%), with a slightly better concordance on Clear scenes at detriment of Cloudy ones. Values disaggregated by surface type and for each sky condition are reported in the same Table: the highest consensus is obtained for Vegetation (87%) and the lowest one on Water (78%). When separate classifications are made for each climatic zone, global consensus drops to 78%, mainly due to a lower consensus for Clear sky conditions (30.5%). Values for different climatic zones range from 73% (Tropical zone) to 89% (Mid Latitude Winter NH). Table 14 reports full results disaggregated by climatic zone and by sky condition. Summarizing we can state that the highest consensus is reached when different classifications are made by surface type.
Finally, we investigate which of the selected classification methods deviates most from the consensus expressed from the other methods. For this purpose, we consider scenes for which only three of the 4 methods agree on the Clear or Cloudy condition and calculate the frequency of the methods that deviate from the other three ones.
In the basic classification (without additional information) the method that more often disagrees with the other three ones in predicting a Cloudy sky condition is PCDA. Specifically, when three methods out of four agree on Cloudy scenes, PCDA disagrees with them 41% of the times. When three methods agree in predicting a Clear sky condition, then CDA mostly disagrees (61% of scenes). The situation is similar when separate classifications are performed for each surface type (see row Global in Table 15): PCDA mostly disagrees with the other three methods (50.3% of scenes); this result is due to a very poor agreement in Cloudy sky conditions (disagreement in 87.9% of scenes), whereas in Clear sky conditions it is CDA that reaches the highest disagreement with 52.7% of scenes. This also occurs on the different surface types, with PCDA showing the highest disagreement for all types of surfaces but Bare Land, due to the highest disagreement in Cloudy conditions, whereas in Clear sky conditions CDA shows the highest disagreement on Vegetation and Bare Land, ANN on Water and Urban surface and KNN on Snow/Ice. When classification is trained separately on different climate zones, as reported in Table 16, the situation is quite similar for PCDA and CDA, with ANN showing a greater disagreement in more cases. Summarizing, we can say that PCDA and CDA show the highest disagreement with respect to the other methods and on the contrary KNN the lowest disagreement.

Conclusions
The paper shows a detailed analysis of the method presented by one the authors in a Round Robin exercise organized by ESA for detecting clouds from images taken by the PROBA-V sensor. Availability of a common high-quality dataset of scenes labelled as Clear or Cloudy by experts (gold standard) is a unique benchmark for comparing different methods for detecting clouds and investigating on questions still open.
We considered some prototypes of methods and compared them under different frameworks but using a common training dataset for all of them. We demonstrated that CDA, chosen for participating in the Round Robin, was adequate yielding good accuracy. However, ANN and, particularly, KNN can both improve accuracy and better detect scenes with semi-transparent clouds. In addition, a silver standard training dataset, semi-automatically obtained by algorithms developed for other sensors, was proved to be effective in detecting clouds, yielding high accuracy. We stress that the silver standard dataset considered in this paper covers only a portion of the globe and that the majority of gold standard pixels are outside the region covered by the silver standard. Indeed a silver standard dataset is the only feasible solution to get very large training datasets needed to train Artificial Intelligence methods (see, e.g., [7]).
Then, even though it was not possible to give a conclusive quantitative answer whether separate classifications based on ancillary information as surface types and/or climatic zones improve accuracy, a qualitative analysis shows that introducing such information reduces probability to misinterpret Clear/Cloudy condition on Water.
Finally, we performed a consensus analysis aimed at estimating the degree of mutual agreement among classification methods in detecting Clear or Cloudy sky. The result was that a selection of 4 classification methods agree on the status of Clear or Cloudy sky for about 83% of scenes. PCDA and CDA show the highest disagreement with the consensus of the other 3 methods and KNN the lowest disagreement. This result is consistent with the findings on accuracy.
Results shown in the paper strictly refer to sensors with a very low number of spectral bands. Other sensors, especially hyper-spectral, or at higher/lower spatial resolutions need a specific similar analysis that will probably give very different results.
Results of the paper suggest possible future investigations. a) The use of different methods for different surfaces or climatic zones: this study demonstrated that some methods could be more suited for particular surfaces or climatic zones when we consider global accuracy or specific accuracy for Clear or Cloudy conditions. b) Balancing of cloud/Clear conditions: this largely depends on the training dataset and on the fraction of Clear/Cloudy scenes included. Then the classification methods will generally and naturally favor the most populated class when Clear and Cloudy features overlap, so to improve global accuracy. However commonly remotely sensed images can refer to conditions that are prevalently Clear or prevalently Cloudy that could benefit from a training dataset or a classification method that weights scenes according to the different proportion of Clear/Cloudy conditions in the region. CDA is an attempt into this direction, even though in a global way, and Discriminant Analysis is naturally prone to include such weights, simulating a balancing of Clear and Sky conditions different from the training dataset. c) To use images instead of independent pixels in the classification, so to exploit spatial correlation (that clouds indeed possess) and/or equivalently spatial features. In this respect Artificial Intelligence methods, already available in the literature but not considered in this paper, become interesting also for a small number of spectral bands.

Conflicts of Interest:
The authors declare no conflict of interest. The funders designed the Round Robin exercise and collected main data; they had no role in the analysis or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: